# Anaconda Package Download Data

This notebook demonstrates how to load and use Anaconda package data. For more details, see the [GitHub repository](https://github.com/ContinuumIO/anaconda-package-data/blob/master/README.md). Because of resource limits on Binder, some of the analysis examples below may run slowly or need more memory than the Binder instance provides. Feel free to download this notebook and run it locally.


## Setting up

To start, install the required packages by running `conda install dask intake numpy pandas` and `conda install -c conda-forge hvplot`. Then import them:

In [None]:
import dask.dataframe as dd
from datetime import datetime
import hvplot.pandas
import intake
import numpy as np
import pandas as pd

This enables the Dask progress bar on all operations:

In [None]:
from dask.diagnostics import ProgressBar
pbar = ProgressBar()
pbar.register()

## Loading Data

There are multiple ways to load Anaconda package data. The examples below load data from December 2018.

#### Method 1: load data from S3 url

First, we can read Parquet files directly from an S3 URL. We recommend using `dask.dataframe` to read the data files into a Dask DataFrame. The path below points to the file for 31 December 2018. Please visit the [Dask website](http://docs.dask.org/en/latest/dataframe.html) for more information.

In [None]:
df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
 storage_options={'anon': True})
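
To read more than one day with this method, a list of paths is needed. The sketch below builds one path per day of a month. It assumes that every daily file follows the same layout as the URL above; `BASE_URL` and `daily_parquet_paths` are hypothetical names introduced only for illustration.

```python
import calendar

# Assumption: every daily file follows the same layout as the URL shown above.
BASE_URL = 's3://anaconda-package-data/conda/hourly'

def daily_parquet_paths(year, month):
    """Return one S3 path per calendar day of the given month (hypothetical helper)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [
        f'{BASE_URL}/{year:04d}/{month:02d}/{year:04d}-{month:02d}-{day:02d}.parquet'
        for day in range(1, days_in_month + 1)
    ]

paths = daily_parquet_paths(2018, 12)
print(len(paths))   # number of days in December 2018
print(paths[-1])    # path for the last day of the month
```

The resulting list could then be passed to `dd.read_parquet(paths, storage_options={'anon': True})`, provided a file exists for each day.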

#### Method 2: load data from intake catalog

Second, we can load data from an [Intake](https://intake.readthedocs.io) catalog file. One advantage of an Intake catalog is that the `cache` specification can be defined in the catalog, so that Intake caches remote data files locally. This saves bandwidth and speeds up later analyses. To remove the Intake cache, run `intake cache clear`. For more information, see the [Intake catalog documentation](https://intake.readthedocs.io/en/latest/catalog.html).

Before loading the data file, we need to load the Intake catalog file. We can use a URL to the catalog file directly:

In [None]:
cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')

Then we can load the data for a user-specified year and month.

In [None]:
df = cat.anaconda_package_data_by_month(year=2018, month=12).to_dask()

In addition, if you would like to load one year of data, you can simply define the dataframe as

``` python
df = cat.anaconda_package_data_by_year(year=2018).to_dask()
```

Similarly, if you would like to load one day of data, you can define the dataframe as
```python
df = cat.anaconda_package_data_by_day(year=2018, month=12, day=1).to_dask()
```

Note that `.to_dask()` reads the data into a Dask DataFrame. To read the data directly into a pandas DataFrame, use:
``` python 
cat.anaconda_package_data_by_month(year=2018, month=12).read()
```

#### Method 3: load data from conda package

Third, we can install the data package with conda (this has already been done in the Binder environment):
``` bash
conda install -c intake anaconda-package-data
```
This data package installs the Intake catalog (but not the data) directly into the user's conda environment. The global Intake catalog `intake.cat` will then contain entries from this data package. If we run `list(intake.cat)`, we can see that `'anaconda_package_data_by_month'`, `'anaconda_package_data_by_year'`, and `'anaconda_package_data_by_day'` appear in the list. Then, as in Method 2, we only need to specify the year and month to load the data.


In [None]:
df = intake.cat.anaconda_package_data_by_month(year=2018, month=12).to_dask()

Again, to read the data directly into a pandas DataFrame, use `intake.cat.anaconda_package_data_by_month(year=2018, month=12).read()`.

## Examples

After loading the data, we can do a lot of data wrangling and visualization to answer interesting questions. Below we show a few examples of how people can use the data. 

#### Example 1: Pandas download statistics

In this first example, we look at the download statistics of pandas. First, let's see how many times pandas was downloaded this month from the Anaconda distribution:

In [None]:
df.loc[(df.data_source=='anaconda') & (df.pkg_name=='pandas')]['counts'].sum().compute()

Note that `.compute()` is needed when `df` is a Dask DataFrame. Remove `.compute()` if you loaded the data into a pandas DataFrame. Please visit the [Dask website](http://docs.dask.org/en/latest/dataframe.html) for more information.
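
As a minimal sketch of the pandas variant, the same filter and sum can be run on a small in-memory DataFrame without `.compute()`. The rows below are made-up toy values, not real download counts; only the column names come from the data set.

```python
import pandas as pd

# Toy rows (made-up numbers) that reuse the column names from the data set.
toy = pd.DataFrame({
    'data_source': ['anaconda', 'anaconda', 'conda-forge', 'anaconda'],
    'pkg_name': ['pandas', 'numpy', 'pandas', 'pandas'],
    'counts': [10, 7, 4, 5],
})

# Same filter as above; pandas evaluates eagerly, so no .compute() is needed.
mask = (toy.data_source == 'anaconda') & (toy.pkg_name == 'pandas')
total = toy.loc[mask, 'counts'].sum()
print(total)
```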

Next, let's take a look at the daily trends of pandas usage. 

In [None]:
df['day'] = df.time.dt.day
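# Sum only the numeric `counts` column; summing every column can fail on non-numeric ones.
pkg_day_agg = df\
 .loc[(df.data_source=='anaconda') & (df.pkg_name=='pandas')]\
 .groupby(['day'])['counts']\
 .sum()\
 .reset_index()\
 .compute()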
pkg_day_agg.head()

In [None]:
pkg_day_agg.hvplot('day','counts')

#### Example 2: Python 2 versus Python 3 usage status

In 2020, Python 2 will no longer be maintained, and many key projects such as pandas will drop Python 2 support. Many developers and stakeholders are interested in how Python 2 and Python 3 usage changes over time. We can plot this with our data.

First, we need to recode the variable that records the Python version a package requires. We inspect the values of `pkg_python`, then create a new variable `python2vs3` from it:

In [None]:
df.groupby(['pkg_python'])['counts'].sum().compute()

In [None]:
df['python2vs3'] = df['pkg_python'].\
 map(lambda x: 'Python 2' if x.startswith('2') else 'Python 3' if x.startswith('3') else np.nan)
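
The lambda above assumes that every `pkg_python` value is a string. A slightly more defensive variant, sketched here under the assumption that missing or non-string values should stay missing, uses a named function; `recode_python_version` is a hypothetical name introduced for illustration.

```python
import math

def recode_python_version(version):
    """Map a `pkg_python` value such as '2.7' or '3.6' to a major-version label."""
    # Assumption: missing or non-string values (e.g. None or NaN) should stay missing.
    if not isinstance(version, str):
        return math.nan
    if version.startswith('2'):
        return 'Python 2'
    if version.startswith('3'):
        return 'Python 3'
    # Anything else (for example an empty string) is treated as missing.
    return math.nan

print(recode_python_version('2.7'))
print(recode_python_version('3.6'))
```

It could be used in place of the lambda as `df['pkg_python'].map(recode_python_version)`.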

In [None]:
df.groupby(['python2vs3'])['counts'].sum().compute()

Second, let's get the daily counts for Python 2 and Python 3.

In [None]:
# Sum only the numeric `counts` column for each (day, python2vs3) group.
python_day_agg = df\
 .groupby(['day','python2vs3'])['counts']\
 .sum()\
 .compute()\
 .reset_index()
 
python_day_agg.head()
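
Besides raw counts, the daily Python 3 share can be derived from a table of this shape. The sketch below uses a toy stand-in for `python_day_agg` with made-up counts; `toy_agg`, `wide`, and `python3_share` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for python_day_agg (made-up counts).
toy_agg = pd.DataFrame({
    'day': [1, 1, 2, 2],
    'python2vs3': ['Python 2', 'Python 3', 'Python 2', 'Python 3'],
    'counts': [30, 70, 20, 80],
})

# One row per day, one column per Python major version.
wide = toy_agg.pivot(index='day', columns='python2vs3', values='counts')

# Fraction of each day's downloads that target Python 3.
wide['python3_share'] = wide['Python 3'] / (wide['Python 2'] + wide['Python 3'])
print(wide)
```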

Finally, we can plot the Python 2 and Python 3 usage trend.

In [None]:
python_day_agg.hvplot('day','counts',by='python2vs3')

#### Example 3: Package platform comparison

We can also compare package platforms. Here we calculate the total number of downloads from each platform and visualize the results in a bar chart. (Note that "noarch" packages have no platform value because they work on all platforms.)

In [None]:
platform_month = df.groupby(['pkg_platform'])['counts'].sum().reset_index().compute()
platform_month

In [None]:
platform_month.hvplot.bar('pkg_platform', 'counts', rot=90)