# Histogrammar exercises

Histogrammar is a Python package for making histograms from NumPy arrays and from pandas and Spark dataframes.

(There is also a Scala backend for Histogrammar, which is used by Spark.)

You can do the exercises below after the basic tutorial.

Enjoy!

In [None]:
%%capture
# install histogrammar (if not installed yet)
import sys

!"{sys.executable}" -m pip install histogrammar

In [None]:
import histogrammar as hg

In [None]:
import pandas as pd
import numpy as np
import matplotlib

## Dataset
Let's first load some data!

In [None]:
# open a pandas dataframe for use below
from histogrammar import resources
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])

In [None]:
df.head(2)

## Comparing histogram types

Histogrammar treats histograms as objects. You will see this has various advantages.

Let's fill a simple histogram with a numpy array.

In [None]:
# this creates a histogram with 10 equal-width bins covering the range from 0 to 100
hist1 = hg.Bin(num=10, low=0, high=100)

In [None]:
hist1.fill.numpy(df['age'].values)

In [None]:
hist1.plot.matplotlib();

In [None]:
hist2 = hg.SparselyBin(binWidth=10, origin=0)

In [None]:
hist2.fill.numpy(df['age'].values)

In [None]:
hist2.plot.matplotlib();

Q: Have a look at the .values and .bins attributes of hist1 and hist2.
What types are these? (hist1.values is a ...?) 
Does that make sense?

In [None]:
hist1

In [None]:
hist2

Q: In each bin, what type of object is keeping track of the bin count?

Try filling hist1 with negative values, very large values (> 100), or NaNs.
Find out whether and how hist1 keeps track of these.

Now fill hist2 with negative values, very large values (> 100), or NaNs. How does hist2 keep track of these?
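
As a mental model for the questions above, here is a plain-Python sketch of two binning strategies: a fixed number of bins held in a list, and bins created on demand in a dictionary. It is an illustration only, not Histogrammar's implementation; the class names `DenseBins` and `SparseBins`, their attributes, and the sample ages are all made up for this sketch.

```python
import math
from collections import defaultdict


class DenseBins:
    """Fixed number of equal-width bins on [low, high), stored in a list."""

    def __init__(self, num, low, high):
        self.num, self.low, self.high = num, low, high
        self.counts = [0] * num
        # Values that do not fit the range need their own counters.
        self.underflow = 0
        self.overflow = 0
        self.nanflow = 0

    def fill(self, x):
        if math.isnan(x):
            self.nanflow += 1
        elif x < self.low:
            self.underflow += 1
        elif x >= self.high:
            self.overflow += 1
        else:
            index = int(self.num * (x - self.low) / (self.high - self.low))
            self.counts[index] += 1


class SparseBins:
    """Equal-width bins created on demand, stored in a dict keyed by bin index."""

    def __init__(self, bin_width, origin=0.0):
        self.bin_width, self.origin = bin_width, origin
        self.counts = defaultdict(int)
        self.nanflow = 0  # NaN has no bin index, so it is counted separately

    def fill(self, x):
        if math.isnan(x):
            self.nanflow += 1
        else:
            index = math.floor((x - self.origin) / self.bin_width)
            self.counts[index] += 1


# Made-up sample values, including out-of-range entries and a NaN.
ages = [23.0, 27.5, 41.0, -3.0, 250.0, float("nan")]

dense = DenseBins(num=10, low=0, high=100)
sparse = SparseBins(bin_width=10, origin=0)
for age in ages:
    dense.fill(age)
    sparse.fill(age)

print(dense.counts, dense.underflow, dense.overflow, dense.nanflow)
print(dict(sparse.counts), sparse.nanflow)
```

In this sketch the dense variant needs explicit underflow and overflow counters, while the sparse variant simply grows a new key for any finite value. Compare that with what you observe in hist1 and hist2.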

## Categorical variables

For categorical variables, use a Categorize histogram, which accepts categorical values such as strings and booleans.



In [None]:
histx = hg.Categorize('eyeColor')

In [None]:
histx.fill.numpy(df)

Q: What is a Categorize histogram fundamentally: a dictionary or a list?

Q: What else can it keep track of, e.g. numbers, booleans, NaNs? Give it a try, fill it with more entries!

Fill a histogram with a boolean column (isActive), directly from the dataframe.

Q: What type of histogram do you get?

In [None]:
hists = df.hg_make_histograms(features=['isActive'])

## Multi-dimensional histograms

Let's make a 3-dimensional histogram with axes x=favoriteFruit, y=gender, z=isActive. (In Histogrammar, a multi-dimensional histogram is composed of nested histograms, so you build it starting with the last axis.)
Then fill it with the dataframe.

In [None]:
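# Build from the innermost axis outward, passing each histogram as `value` to the next:
# hist1 = hg.Categorize(quantity='isActive')
# hist2 = hg.Categorize(quantity='gender', value=hist1)
# hist3 = hg.Categorize(quantity='favoriteFruit', value=hist2)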

Q: How many data points end up in the bin: banana, male, True ?
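
One way to picture a recursively composed histogram is as nested dictionaries, where looking up a bin means walking one level per axis. The sketch below uses plain Python rather than Histogrammar; the helper `nested_counts` is hypothetical, and the records are made up for illustration (they are not rows from the dataset).

```python
# Made-up records for illustration only; not taken from test.csv.gz.
records = [
    {"favoriteFruit": "banana", "gender": "male", "isActive": True},
    {"favoriteFruit": "banana", "gender": "male", "isActive": True},
    {"favoriteFruit": "banana", "gender": "female", "isActive": False},
    {"favoriteFruit": "apple", "gender": "male", "isActive": False},
]


def nested_counts(rows, keys):
    """Count rows in nested dicts, one nesting level per key, outermost key first."""
    root = {}
    for row in rows:
        node = root
        # Descend through (and create, if needed) every level except the last.
        for key in keys[:-1]:
            node = node.setdefault(row[key], {})
        # The innermost level holds the actual counts.
        leaf = row[keys[-1]]
        node[leaf] = node.get(leaf, 0) + 1
    return root


counts = nested_counts(records, ["favoriteFruit", "gender", "isActive"])

# Walk the nesting level by level to reach a single bin.
print(counts["banana"]["male"][True])
```

The same level-by-level walk should apply to the histogram you built, using whatever attribute holds its sub-histograms.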


Q: Store this histogram as a json file. What is the size of the json file?

Q: Read back the histogram and then plot it.
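
Independently of Histogrammar's own serialization methods, the general cycle of storing a structure as JSON, measuring the file size, and reading it back can be sketched with the standard library alone. The nested dictionary and the file name `hist.json` below are placeholders chosen for this sketch, not output from a real histogram.

```python
import json
import os
import tempfile

# Made-up nested counts standing in for a histogram's JSON-ready form.
# JSON object keys are always strings, so the boolean axis is written as text.
counts = {"banana": {"male": {"True": 2}, "female": {"False": 1}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "hist.json")  # hypothetical file name

    # Store the structure as a JSON file.
    with open(path, "w") as handle:
        json.dump(counts, handle)

    # File size on disk, in bytes.
    size_in_bytes = os.path.getsize(path)

    # Read it back and check that nothing was lost.
    with open(path) as handle:
        restored = json.load(handle)

print(size_in_bytes, restored == counts)
```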

Q: Make a histogram of the feature 'favoriteFruit' that tracks the average value of 'latitude' in each fruit bin.

In [None]:
hist1 = hg.Average(quantity='latitude')

Q: What is the mean value of latitude for the bin 'strawberry'?
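
To sanity-check the idea of "an average per category", here is a plain-Python sketch that keeps a running sum and count for each category. It does not use Histogrammar; the helper `mean_per_category` is hypothetical and the rows are made-up values, so the numbers say nothing about the real dataset.

```python
# Made-up rows for illustration only; not taken from test.csv.gz.
rows = [
    {"favoriteFruit": "strawberry", "latitude": 10.0},
    {"favoriteFruit": "strawberry", "latitude": 20.0},
    {"favoriteFruit": "banana", "latitude": -5.0},
]


def mean_per_category(records, category_key, value_key):
    """Return {category: mean of value_key over the records in that category}."""
    sums = {}
    counts = {}
    for record in records:
        category = record[category_key]
        sums[category] = sums.get(category, 0.0) + record[value_key]
        counts[category] = counts.get(category, 0) + 1
    return {category: sums[category] / counts[category] for category in sums}


means = mean_per_category(rows, "favoriteFruit", "latitude")
print(means["strawberry"])
```

A pandas `groupby` on the same two columns gives an independent cross-check for the value you read from the histogram.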