{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Histogrammar exercises\n",
    "\n",
    "Histogrammar is a Python package that allows you to make histograms from numpy arrays, and pandas and spark dataframes. \n",
    "\n",
    "(There is also a scala backend for Histogrammar, that is used by spark.) \n",
    "\n",
    "You can do the exercises below after the basic tutorial.\n",
    "\n",
    "Enjoy!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# install histogrammar (if not installed yet)\n",
    "import sys\n",
    "\n",
    "!\"{sys.executable}\" -m pip install histogrammar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import histogrammar as hg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "Let's first load some data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# open a pandas dataframe for use below\n",
    "from histogrammar import resources\n",
    "df = pd.read_csv(resources.data(\"test.csv.gz\"), parse_dates=[\"date\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing histogram types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Histogrammar treats histograms as objects. You will see this has various advantages.\n",
    "\n",
    "Let's fill a simple histogram with a numpy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this creates a histogram with 100 even-sized bins in the (closed) range [-5, 5]\n",
    "hist1 = hg.Bin(num=10, low=0, high=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist1.fill.numpy(df['age'].values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist1.plot.matplotlib();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist2 = hg.SparselyBin(binWidth=10, origin=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist2.fill.numpy(df['age'].values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist2.plot.matplotlib();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: Have a look at the .values and .bins attributes of hist1 and hist2.\n",
    "What types are these? (hist1.values is a ...?) \n",
    "Does that make sense?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: In each bin, what type of object is keeping track of the bin count?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try filling hist1 with small values (negative) or very large (> 100) or with NaNs. \n",
    "Find out if and how hist1 keeps track of these?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now fill hist2 with small values (negative) or very large (> 100) or with NaNs. How does hist2 keeps track of these?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical variables\n",
    "\n",
    "For categorical variables use the Categorize histogram\n",
    "- Categorize histograms: accepting categorical variables such as strings and booleans.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "histx = hg.Categorize('eyeColor')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "histx.fill.numpy(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: A categorize histogram, what is it fundementally, a dictionary or a list?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: What else can it keep track of, e.g. numbers, booleans, nans? Give it a try, fill it with more entries!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fill a histograms with a boolean array (isActive), directly from the dataframe\n",
    "\n",
    "Q: what type of histogram do you get?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hists = df.hg_make_histograms(features=['isActive'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-dimensional histograms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's make a 3-dimensional histogram, with axes: x=favoriteFruit, y=gender, z=isActive. (In Histogrammar, a multi-dimensional histogram is composed as recursive histograms, starting with the last one.) \n",
    "Then fill it with the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# hist1 = hg.Categorize(quantity='isActive')\n",
    "# hist2 = hg.Categorize(quantity='gender', value=hist1)\n",
    "# hist3 = hg.Categorize(quantity='favoriteFruit')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: How many data points end up in the bin: banana, male, True ?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: Store this histogram as a json file. What is the size of the json file?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: Read back the histogram and then plot it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: Make a histogram of the feature 'fruit', which measures the average value of 'latitude' per bin of fruit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist1 = hg.Average(quantity='latitude')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: what is the mean value of latitude for the bin 'strawberry'?"
   ]
  }
 ],
 "metadata": {
  "kernel_info": {
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "nteract": {
   "version": "0.15.0"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}