{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Anaconda Package Download Data\n", "\n", "This notebook demonstrates how to load and use Anaconda package data. For more details, see the [GitHub repository](https://github.com/ContinuumIO/anaconda-package-data/blob/master/README.md). Because Binder instances have limited resources, some of the analysis examples below may run slowly or require more memory than is available there. Feel free to download this notebook and run it locally.\n", "\n", "\n", "## Setting up\n", "\n", "To start, install the required packages by running `conda install dask intake numpy pandas` and `conda install -c conda-forge hvplot`. Then we can import the packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "from datetime import datetime\n", "import hvplot.pandas\n", "import intake\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This enables the Dask progress bar on all operations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.diagnostics import ProgressBar\n", "pbar = ProgressBar()\n", "pbar.register()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Data\n", "\n", "There are multiple ways to load Anaconda package data. Below we show examples of loading one month of data for December 2018.\n", "\n", "#### Method 1: load data from an S3 URL\n", "\n", "First, we can read Parquet files directly from an S3 URL. We recommend using `dask.dataframe` to read the data files into a Dask DataFrame. See the [Dask website](http://docs.dask.org/en/latest/dataframe.html) for more information."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The path below points to a single file, 2018-12-31.parquet;\n", "# 'anon': True requests anonymous access, so no AWS credentials are needed.\n", "df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',\n", "                     storage_options={'anon': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Method 2: load data from an Intake catalog\n", "\n", "Second, we can load data from an [Intake](https://intake.readthedocs.io) catalog file. One advantage of using an Intake catalog is that we can define the `cache` specifications in the catalog so that Intake caches remote data source files locally. This saves bandwidth and speeds up future analyses. To remove the Intake cache, run `intake cache clear`. For more information, see the [Intake catalog documentation](https://intake.readthedocs.io/en/latest/catalog.html).\n", "\n", "Before loading the data file, we need to load the Intake catalog file. We can pass the URL of the catalog file directly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can load the data for a user-specified year and month."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = cat.anaconda_package_data_by_month(year=2018, month=12).to_dask()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To load one year of data instead, define the DataFrame as\n", "\n", "```python\n", "df = cat.anaconda_package_data_by_year(year=2018).to_dask()\n", "```\n", "\n", "Similarly, to load one day of data, define the DataFrame as\n", "\n", "```python\n", "df = cat.anaconda_package_data_by_day(year=2018, month=12, day=1).to_dask()\n", "```\n", "\n", "Note that `.to_dask()` reads the data into a Dask DataFrame. To read the data directly into a pandas DataFrame, use:\n", "\n", "```python\n", "cat.anaconda_package_data_by_month(year=2018, month=12).read()\n", "```\n", "\n", "#### Method 3: load data from a conda package" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Third, we can install the Intake catalog from a conda package (already done in the Binder environment) by running:\n", "\n", "```bash\n", "conda install -c intake anaconda-package-data\n", "```\n", "\n", "This data package installs the Intake catalog (but not the data) directly into the user's conda environment. The global Intake catalog `intake.cat` will then have entries from this data package. If we run `list(intake.cat)`, we can see that `'anaconda_package_data_by_month'`, `'anaconda_package_data_by_year'`, and `'anaconda_package_data_by_day'` show up in the list. Then, similar to Method 2, we just need to specify the year and month to load the data."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = intake.cat.anaconda_package_data_by_month(year=2018, month=12).to_dask()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, to read the data directly into a pandas DataFrame, use `intake.cat.anaconda_package_data_by_month(year=2018, month=12).read()`.\n", "\n", "## Examples\n", "\n", "After loading the data, we can wrangle and visualize it to answer interesting questions. Below we show a few examples of how the data can be used.\n", "\n", "#### Example 1: pandas download statistics\n", "\n", "In this first example, we look at the download statistics for pandas. First, let's see how many times pandas was downloaded this month from the Anaconda distribution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[(df.data_source == 'anaconda') & (df.pkg_name == 'pandas')]['counts'].sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `.compute()` is needed when `df` is a Dask DataFrame. Remove `.compute()` if you loaded the data into a pandas DataFrame. See the [Dask website](http://docs.dask.org/en/latest/dataframe.html) for more information.\n", "\n", "Next, let's take a look at the daily trend of pandas downloads."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the day of the month from the timestamp\n", "df['day'] = df.time.dt.day\n", "# Sum only the numeric `counts` column for each day\n", "pkg_day_agg = df\\\n", "    .loc[(df.data_source == 'anaconda') & (df.pkg_name == 'pandas')]\\\n", "    .groupby('day')['counts']\\\n", "    .sum()\\\n", "    .reset_index()\\\n", "    .compute()\n", "pkg_day_agg.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pkg_day_agg.hvplot('day', 'counts')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 2: Python 2 versus Python 3 usage status\n", "\n", "Starting in 2020, Python 2 will no longer be maintained, and many key projects such as pandas will drop Python 2 support. Many developers and stakeholders are interested in how Python 2 and Python 3 usage changes over time. We can plot this with our data.\n", "\n", "First, we need to recode the variable that records the Python version each package requires. After inspecting the download counts for each value of `pkg_python`, we create a variable `python2vs3` based on it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby(['pkg_python'])['counts'].sum().compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['python2vs3'] = df['pkg_python'].map(\n", "    lambda x: 'Python 2' if x.startswith('2') else 'Python 3' if x.startswith('3') else np.nan\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.groupby(['python2vs3'])['counts'].sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, let's get the daily counts for Python 2 and Python 3."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "python_day_agg = df\\\n", "    .groupby(['day', 'python2vs3'])['counts']\\\n", "    .sum()\\\n", "    .compute()\\\n", "    .reset_index()\n", "\n", "python_day_agg.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can plot the Python 2 and Python 3 usage trend." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "python_day_agg.hvplot('day', 'counts', by='python2vs3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example 3: Package platform comparison\n", "\n", "We can also compare package platforms. Here we calculate the total number of downloads for each platform and visualize the results in a bar chart. (Note that \"noarch\" packages have no platform value because they work on all platforms.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "platform_month = df.groupby(['pkg_platform'])['counts'].sum().reset_index().compute()\n", "platform_month" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "platform_month.hvplot.bar('pkg_platform', 'counts', rot=90)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }