{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2020 Dask User Survey Results\n", "\n", "This notebook presents the results of the 2020 Dask User Survey,\n", "which ran earlier this summer. Thanks to everyone who took the time to fill out the survey!\n", "These results help us better understand the Dask community and will guide future development efforts.\n", "\n", "The raw data, as well as the start of an analysis, can be found in this binder:\n", "\n", "[![2020 Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/dask/dask-examples/main?urlpath=%2Ftree%2Fsurveys%2F2020.ipynb)\n", "\n", "Let us know if you find anything in the data.\n", "\n", "## Highlights\n", "\n", "We had 240 responses to the survey (slightly fewer than last year, which had about 260). Overall things were similar to last year's. The biggest shift in the community is a stronger demand for better performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import textwrap\n", "import re\n", "\n", "\n", "df2019 = (\n", " pd.read_csv(\"data/2019-user-survey-results.csv.gz\", parse_dates=[\"Timestamp\"])\n", " .replace({\"How often do you use Dask?\": \"I use Dask all the time, even when I sleep\"}, \"Every day\")\n", ")\n", "\n", "df2020 = (\n", " pd.read_csv(\"data/2020-user-survey-results.csv.gz\")\n", " .assign(Timestamp=lambda df: pd.to_datetime(df['Timestamp'], format=\"%Y/%m/%d %H:%M:%S %p %Z\").astype('datetime64[ns]'))\n", " .replace({\"How often do you use Dask?\": \"I use Dask all the time, even when I sleep\"}, \"Every day\")\n", ")\n", "df2020.head()\n", "\n", "common = df2019.columns & df2020.columns\n", "added = df2020.columns.difference(df2019.columns)\n", "dropped = df2019.columns.difference(df2020.columns)\n", "\n", "df = pd.concat([df2019, df2020])\n", "df['Year'] = df.Timestamp.dt.year\n", "df = df.set_index(['Year', 'Timestamp']).sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the questions are the same as in 2019. We added a couple questions about deployment and dashboard usage. Let's look at those first.\n", "\n", "Among respondents who use a Dask package to deploy a cluster (about 53% of respondents), there's a wide spread of methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = 'Do you use Dask projects to deploy?'\n", "d = df2020['Do you use Dask projects to deploy?'].dropna().str.split(\";\").explode()\n", "top = d.value_counts()\n", "top = top[top > 10].index\n", "sns.countplot(y=k, data=d[d.isin(top)].to_frame(), order=top);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most people access the dashboard through a web browser. Those not using the dashboard are likely (hopefully) just using Dask on a single machine with the threaded scheduler (though the dashboard works fine on a single machine as well)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = \"How do you view Dask's dashboard?\"\n", "sns.countplot(y=k, data=df2020[k].dropna().str.split(\";\").explode().to_frame());" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask's learning materials are farily similar to last year. The most notable differences are from\n", "our survey form providing more options (our [YouTube channel](https://www.youtube.com/channel/UCj9eavqmvwaCyKhIlu2GaoA) and \"Gitter chat\"). Other than that, https://examples.dask.org might be relatively more popular." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = 'What Dask resources have you used for support in the last six months?'\n", "\n", "resource_map = {\n", " \"Tutorial\": \"Tutorial at tutorial.dask.org\",\n", " \"YouTube\": \"YouTube channel\",\n", " \"gitter\": \"Gitter chat\"\n", "}\n", "\n", "d = df[k].str.split(';').explode().replace(resource_map)\n", "top = d.value_counts()[:8].index\n", "d = d[d.isin(top)]\n", "\n", "fig, ax = plt.subplots(figsize=(8, 8))\n", "ax = sns.countplot(y=k, hue=\"Year\", data=d.reset_index(), ax=ax);\n", "ax.set(ylabel=\"\", title=k);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How do you use Dask?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like last year, we'll look at resource usage grouped by how often they use Dask." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resource_palette = (\n", " df['What Dask resources have you used for support in the last six months?'].str.split(\";\").explode().replace(resource_map).replace(re.compile(\"GitHub.*\"), \"GitHub\").value_counts()[:6].index\n", ")\n", "resource_palette = dict(zip(resource_palette, sns.color_palette(\"Paired\")))\n", "\n", "usage_order = ['Every day', 'Occasionally', 'Just looking for now']\n", "\n", "def resource_plot(df, year, ax):\n", " resources = (\n", " df.loc[year, 'What Dask resources have you used for support in the last six months?']\n", " .str.split(\";\")\n", " .explode()\n", " .replace(resource_map)\n", " )\n", " top = resources.value_counts().head(6).index\n", " resources = resources[resources.isin(top)]\n", "\n", " m = (\n", " pd.merge(df.loc[year, ['How often do you use Dask?']], resources, left_index=True, right_index=True)\n", " .replace(re.compile(\"GitHub.*\"), \"GitHub\")\n", " )\n", "\n", " ax = sns.countplot(hue=\"What Dask resources have you used for support in the last six months?\",\n", " y='How often do you use Dask?',\n", " order=usage_order,\n", " data=m, ax=ax,\n", " hue_order=list(resource_palette),\n", " palette=resource_palette)\n", " sns.despine()\n", " return ax\n", "\n", "fig, axes = plt.subplots(figsize=(20, 10), ncols=2)\n", "ax1 = resource_plot(df, 2019, axes[0])\n", "ax2 = resource_plot(df, 2020, axes[1])\n", "ax1.set_title(\"2019\")\n", "ax2.set_title(\"2020\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few observations\n", "\n", "\n", "* GitHub issues are becoming relatively less popular among moderate and heavy Dask users, which perhaps reflects better documentation or stability (assuming people go to the issue tracker when they can't find the answer in the docs or they hit a bug).\n", "* https://examples.dask.org is notably now more popular among occasinal users.\n", "* In response to last year's survey, we invested time in making https://tutorial.dask.org better, which we previously felt was lacking. Its usage is still about the same as last year's (pretty popular), so it's unclear whether we should dedicate additional focus there.\n", "\n", "API usage remains about the same as last year (recall that about 20 fewer people took the survey and people can select multiple, so relative differences are most interesting). We added new choices for RAPIDS, Prefect, and XGBoost, each of which are somewhat popular (in the neighborhood of `dask.Bag`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "apis = df['Dask APIs'].str.split(\";\").explode()\n", "top = apis.value_counts().loc[lambda x: x > 10]\n", "apis = apis[apis.isin(top.index)].reset_index()\n", "\n", "sns.countplot(y=\"Dask APIs\", hue=\"Year\", data=apis);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like last year, about 65% of our users are using Dask on a cluster at least some of the time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Local machine or Cluster?'].dropna().str.contains(\"Cluster\").astype(int).groupby(\"Year\").mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the majority of respondents *also* use Dask on their laptop.\n", "This highlights the importance of Dask scaling down, either for\n", "prototyping with a `LocalCluster`, or for out-of-core analysis\n", "using `LocalCluster` or one of the single-machine schedulers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "order = [\n", " 'Personal laptop',\n", " 'Large workstation',\n", " 'Cluster of 2-10 machines',\n", " 'Cluster with 10-100 machines',\n", " 'Cluster with 100+ machines'\n", "]\n", "d = df['Local machine or Cluster?'].str.split(\";\").explode().reset_index()\n", "sns.countplot(y=\"Local machine or Cluster?\", data=d, hue=\"Year\", order=order);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like last year, most repondents thought that more documentation and examples would be the most valuable improvements to the project.\n", "\n", "One interesting change comes from looking at \"Which would help you most right now?\" split by API group (`dask.dataframe`, `dask.array`, etc.). Last year showed that \"More examples\" in my field was the most important for all API groups (first table below). But in 2020 there are some differences (second plot below)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help_by_api = (\n", " pd.merge(\n", " df.loc[2019, 'Dask APIs'].str.split(';').explode(),\n", " df.loc[2019, 'Which would help you most right now?'],\n", " left_index=True, right_index=True)\n", " .groupby('Which would help you most right now?')['Dask APIs'].value_counts()\n", " .unstack(fill_value=0).T\n", " .loc[['Array', 'Bag', 'DataFrame', 'Delayed', 'Futures', 'ML', 'Xarray']]\n", " \n", ")\n", "(\n", " help_by_api\n", " .style\n", " .background_gradient(axis=\"columns\")\n", " .set_caption(\"2019 normalized by row. Darker means that a higher proporiton of \"\n", " \"users of that API prefer that priority.\")\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help_by_api = (\n", " pd.merge(\n", " df.loc[2020, 'Dask APIs'].str.split(';').explode(),\n", " df.loc[2020, 'Which would help you most right now?'],\n", " left_index=True, right_index=True)\n", " .groupby('Which would help you most right now?')['Dask APIs'].value_counts()\n", " .unstack(fill_value=0).T\n", " .loc[['Array', 'Bag', 'DataFrame', 'Delayed', 'Futures', 'ML', 'Xarray']]\n", " \n", ")\n", "(\n", " help_by_api\n", " .style\n", " .background_gradient(axis=\"columns\")\n", " .set_caption(\"2020 normalized by row. Darker means that a higher proporiton of \"\n", " \"users of that API prefer that priority.\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examples are again the most important (for all API groups except `Futures`). But \"Performance improvements\" is now the second-most important field (except for `Futures` where it's most important). How should we interpret this? A charitable interpretation is that Dask's users are scaling to larger problems and are running into new scaling challenges. A less charitable interpretation is that our user's workflows are the same but Dask is getting slower!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Common Feature Requests\n", "\n", "For specific features, we made a list of things that we (as developers) thought might be important." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "common = (df[df.columns[df.columns.str.startswith(\"What common feature\")]]\n", " .rename(columns=lambda x: x.lstrip(\"What common feature requests do you care about most?[\").rstrip(r\"]\")))\n", "a = common.loc[2019].apply(pd.value_counts).T.stack().reset_index().rename(columns={'level_0': 'Question', 'level_1': \"Importance\", 0: \"count\"}).assign(Year=2019)\n", "b = common.loc[2020].apply(pd.value_counts).T.stack().reset_index().rename(columns={'level_0': 'Question', 'level_1': \"Importance\", 0: \"count\"}).assign(Year=2020)\n", "\n", "counts = pd.concat([a, b], ignore_index=True)\n", "\n", "d = common.stack().reset_index().rename(columns={\"level_2\": \"Feature\", 0: \"Importance\"})\n", "order = [\"Not relevant for me\", \"Somewhat useful\", 'Critical to me']\n", "sns.catplot('Importance', row=\"Feature\", kind=\"count\", col=\"Year\", data=d, sharex=False, order=order);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There really aren't any changes compared to last year in the relative importance of each feature. Perhaps the biggest movement was in \"Ease of deployment\" where \"Critical to me\" is now relatively more popular (though it was the most popular even last year).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What other systems do you use?\n", "\n", "SSH continues to be the most popular \"cluster resource mananger\". This was the big surprise last year, so we put in some work to make it nicer. Aside from that, not much has changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c = df['If you use a cluster, how do you launch Dask? '].dropna().str.split(\";\").explode()\n", "top = c.value_counts().index[:6]\n", "sns.countplot(y=\"If you use a cluster, how do you launch Dask? \", data=c[c.isin(top)].reset_index(), hue=\"Year\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask users are about as happy with its stability as last year." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fig, ax = plt.subplots(figsize=(9, 6))\n", "sns.countplot(y=\"Is Dask stable enough for you?\", hue=\"Year\", data=df.reset_index())\n", "sns.despine()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Takeaways\n", "\n", "1. Overall, most things are similar to last year.\n", "2. Documentation, especially domain-specific examples, continues to be important.\n", "3. More users are pushing Dask further. Investing in performance is likely to be valuable.\n", "\n", "Thanks again to all the respondents. We look forward to repeating this process to identify trends over time." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }