{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Histogrammar exercises\n", "\n", "Histogrammar is a Python package that allows you to make histograms from NumPy arrays and from pandas and Spark dataframes. \n", "\n", "(There is also a Scala backend for Histogrammar, which is used by Spark.) \n", "\n", "You can do the exercises below after the basic tutorial.\n", "\n", "Enjoy!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# install histogrammar (if not installed yet)\n", "import sys\n", "\n", "!\"{sys.executable}\" -m pip install histogrammar" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import histogrammar as hg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "Let's first load some data!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# open a pandas dataframe for use below\n", "from histogrammar import resources\n", "df = pd.read_csv(resources.data(\"test.csv.gz\"), parse_dates=[\"date\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing histogram types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Histogrammar treats histograms as objects. You will see this has various advantages.\n", "\n", "Let's fill a simple histogram with a NumPy array." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this creates a histogram with 10 equal-width bins in the range [0, 100)\n", "hist1 = hg.Bin(num=10, low=0, high=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1.fill.numpy(df['age'].values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this creates a sparsely binned histogram with bin width 10 and bin edges aligned to 0\n", "hist2 = hg.SparselyBin(binWidth=10, origin=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2.fill.numpy(df['age'].values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2.plot.matplotlib();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Have a look at the .values and .bins attributes of hist1 and hist2.\n", "What types are these? (hist1.values is a ...?) \n", "Does that make sense?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: In each bin, what type of object is keeping track of the bin count?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try filling hist1 with negative values, very large values (> 100), or NaNs. \n", "Find out whether and how hist1 keeps track of these." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now fill hist2 with negative values, very large values (> 100), or NaNs. How does hist2 keep track of these?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical variables\n", "\n", "For categorical variables, use the Categorize histogram:\n", "- Categorize histograms accept categorical variables such as strings and booleans.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histx = hg.Categorize('eyeColor')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histx.fill.numpy(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is a Categorize histogram fundamentally: a dictionary or a list?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What else can it keep track of, e.g. numbers, booleans, NaNs? Give it a try, fill it with more entries!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill a histogram with a boolean column (isActive), directly from the dataframe.\n", "\n", "Q: What type of histogram do you get?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hists = df.hg_make_histograms(features=['isActive'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-dimensional histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a 3-dimensional histogram, with axes: x=favoriteFruit, y=gender, z=isActive. (In Histogrammar, a multi-dimensional histogram is built from nested histograms, starting with the last axis.) \n", "Then fill it with the dataframe." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hist1 = hg.Categorize(quantity='isActive')\n", "# hist2 = hg.Categorize(quantity='gender', value=hist1)\n", "# hist3 = hg.Categorize(quantity='favoriteFruit', value=hist2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: How many data points end up in the bin (banana, male, True)?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Store this histogram as a JSON file. What is the size of the JSON file?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Read back the histogram and then plot it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: Make a histogram of the feature 'favoriteFruit' that measures the average value of 'latitude' per fruit bin." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1 = hg.Average(quantity='latitude')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Q: What is the mean value of latitude for the bin 'strawberry'?" ] } ], "metadata": { "kernel_info": { "name": "python3" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "nteract": { "version": "0.15.0" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }