{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Histogrammar basic tutorial\n", "\n", "Histogrammar is a Python package that allows you to make histograms from numpy arrays, and pandas and spark dataframes. (There is also a scala backend for Histogrammar.) \n", "\n", "This basic tutorial shows how to:\n", "- make histograms with numpy arrays and pandas dataframes, \n", "- plot them, \n", "- make multi-dimensional histograms,\n", "- the various histogram types,\n", "- to make many histograms at ones,\n", "- and store and retrieve them. \n", "\n", "Enjoy!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# install histogrammar (if not installed yet)\n", "import sys\n", "\n", "!\"{sys.executable}\" -m pip install histogrammar" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import histogrammar as hg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data generation\n", "Let's first load some data!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# open a pandas dataframe for use below\n", "from histogrammar import resources\n", "df = pd.read_csv(resources.data(\"test.csv.gz\"), parse_dates=[\"date\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's fill a histogram!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Histogrammar treats histograms as objects. You will see this has various advantages.\n", "\n", "Let's fill a simple histogram with a numpy array." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this creates a histogram with 100 even-sized bins in the (closed) range [-5, 5]\n", "hist1 = hg.Bin(num=100, low=-5, high=5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# filling it with one data point:\n", "hist1.fill(0.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1.entries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# filling the histogram with an array:\n", "hist1.fill.numpy(np.random.normal(size=10000))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1.entries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's plot it\n", "hist1.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternatively, you can call this to make the same histogram:\n", "# hist1 = hg.Histogram(num=100, low=-5, high=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Histogrammar also supports open-ended histograms, which are sparsely represented. Open-ended histograms are used when you have a distribution of known scale (bin width) but unknown domain (lowest and highest bin index). Bins in a sparse histogram only get created and filled if the corresponding data points are encountered. \n", "\n", "A sparse histogram has a `binWidth`, and optionally an `origin` parameter. The `origin` is the left edge of the bin whose index is 0 and is set to 0.0 by default. Sparse histograms are nice if you don't want to restrict the range, for example for tracking data distributions over time, which may have large, sudden outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2 = hg.SparselyBin(binWidth=10, origin=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2.fill.numpy(df['age'].values)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternatively, you can call this to make the same histogram:\n", "# hist2 = hg.SparselyHistogram(binWidth=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filling from a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make the same 1d (sparse) histogram directly from a (pandas) dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist3 = hg.SparselyBin(binWidth=10, origin=0, quantity='age')\n", "hist3.fill.numpy(df)\n", "hist3.plot.matplotlib();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When importing histogrammar, pandas (and spark) dataframes get extra functions to create histograms that all start with \"hg_\". For example: hg_Bin or hg_SparselyBin.\n", "Note that the column \"age\" is picked by setting quantity=\"age\", and also that the filling step is done automatically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alternatively, do:\n", "hist3 = df.hg_SparselyBin(binWidth=10, origin=0, quantity='age')\n", "\n", "# ... where hist3 automatically picks up column age from the dataframe, \n", "# ... and does not need to be filled by calling fill.numpy() explicitly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handy histogram methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For any 1-dimensional histogram extract the bin entries, edges and centers as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# full range of bin entries, and those in a specified range:\n", "(hist3.bin_entries(), hist3.bin_entries(low=30, high=80))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# full range of bin edges, and those in a specified range:\n", "(hist3.bin_edges(), hist3.bin_edges(low=31, high=71))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# full range of bin centers, and those in a specified range:\n", "(hist3.bin_centers(), hist3.bin_centers(low=31, high=80))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hsum = hist2 + hist3\n", "hsum.entries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hsum *= 4\n", "hsum.entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Irregular bin histogram variants\n", "\n", "There are two other open-ended histogram variants in addition to the SparselyBin we have seen before. Whereas SparselyBin is used when bins have equal width, the others offer similar alternatives to a single fixed bin width.\n", "\n", "There are two ways:\n", "- CentrallyBin histograms, defined by specifying bin centers;\n", "- IrregularlyBin histograms, with irregular bin edges.\n", "\n", "They both partition a space into irregular subdomains with no gaps and no overlaps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist4 = hg.CentrallyBin(centers=[15, 25, 35, 45, 55, 65, 75, 85, 95], quantity='age')\n", "hist4.fill.numpy(df)\n", "hist4.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist4.bin_edges()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the slightly different plotting style for CentrallyBin histograms (e.g. x-axis labels are central values instead of edges)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-dimensional histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a multi-dimensional histogram. In Histogrammar, a multi-dimensional histogram is composed as two recursive histograms. \n", "\n", "We will use histograms with irregular binning in this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "edges1 = [-100, -75, -50, -25, 0, 25, 50, 75, 100]\n", "edges2 = [-200, -150, -100, -50, 0, 50, 100, 150, 200]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1 = hg.IrregularlyBin(edges=edges1, quantity='latitude')\n", "hist2 = hg.IrregularlyBin(edges=edges2, quantity='longitude', value=hist1)\n", "\n", "# for 3 dimensions or higher simply add the 2-dim histogram to the value argument\n", "hist3 = hg.SparselyBin(binWidth=10, quantity='age', value=hist2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist1.bin_centers()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2.bin_centers()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist2.fill.numpy(df)\n", "hist2.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of dimensions per histogram\n", "(hist1.n_dim, hist2.n_dim, hist3.n_dim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing bin entries\n", "\n", "For most 2+ dimensional histograms, one can get the bin entries and centers as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from histogrammar.plot.hist_numpy import get_2dgrid\n", "x_labels, y_labels, grid = get_2dgrid(hist2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_labels, grid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing a sub-histogram\n", "\n", "Depending on the histogram type of the first axis, hg.Bin or other, one can access the sub-histograms directly from:\n", "hist.values or \n", "hist.bins" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Acces sub-histograms from IrregularlyBin from hist.bins\n", "# The first item of the tuple is the lower bin-edge of the bin.\n", "hist2.bins[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "h = hist2.bins[1][1]\n", "h.plot.matplotlib()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "h.bin_entries()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histogram types recap\n", "\n", "So far we have covered the histogram types: \n", "- Bin histograms: with a fixed range and even-sized bins,\n", "- SparselyBin histograms: open-ended and with a fixed bin-width,\n", "- CentrallyBin histograms: open-ended and using bin centers.\n", "- IrregularlyBin histograms: open-ended and using (irregular) bin edges,\n", "\n", "All of these process numeric variables only." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical variables\n", "\n", "For categorical variables use the Categorize histogram\n", "- Categorize histograms: accepting categorical variables such as strings and booleans.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histy = hg.Categorize('eyeColor')\n", "histx = hg.Categorize('favoriteFruit', value=histy)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histx.fill.numpy(df)\n", "histx.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show the datatype(s) of the histogram\n", "histx.datatype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categorize histograms also accept booleans:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histy = hg.Categorize('isActive')\n", "histy.fill.numpy(df)\n", "histy.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histy.bin_entries()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "histy.bin_labels()\n", "# histy.bin_centers() will work as well for Categorize histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other histogram functionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several more histogram types:\n", "- Minimize, Maximize: keep track of the min or max value of a numeric distribution,\n", "- Average, Deviate: keep track of the mean or mean and standard deviation of a numeric distribution,\n", "- Sum: keep track of the sum of a numeric distribution,\n", "- Stack: keep track how many data points pass certain thresholds.\n", "- Bag: works like a dict, it keeps tracks of all unique values encountered in a column, and can also do this for vector s of numbers. For strings, Bag works just like the Categorize histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hmin = df.hg_Minimize('latitude')\n", "hmax = df.hg_Maximize('longitude')\n", "(hmin.min, hmax.max)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "havg = df.hg_Average('latitude')\n", "hdev = df.hg_Deviate('longitude')\n", "(havg.mean, hdev.mean, hdev.variance)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hsum = df.hg_Sum('age')\n", "hsum.sum" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's illustrate the Stack histogram with longitude distribution\n", "# first we plot the regular distribution\n", "hl = df.hg_SparselyBin(25, 'longitude')\n", "hl.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Stack counts how often data points are greater or equal to the provided thresholds \n", "thresholds = [-200, -150, -100, -50, 0, 50, 100, 150, 200]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hs = df.hg_Stack(thresholds=thresholds, quantity='longitude')\n", "hs.thresholds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hs.bin_entries()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stack histograms are useful to make efficiency curves.\n", "\n", "With all these histograms you can make multi-dimensional histograms. For example, you can evaluate the mean and standard deviation of one feature as a function of bins of another feature. (A \"profile\" plot, similar to a box plot.) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hav = hg.Deviate('age')\n", "hlo = hg.SparselyBin(25, 'longitude', value=hav)\n", "hlo.fill.numpy(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hlo.bins" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hlo.plot.matplotlib();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convenience functions\n", "\n", "There are several convenience functions to make such composed histograms. These are:\n", "- [Profile](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#profile): Convenience function for creating binwise averages.\n", "- [SparselyProfile](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#sparselyprofile): Convenience function for creating sparsely binned binwise averages.\n", "- [ProfileErr](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#profileerr): Convenience function for creating binwise averages and variances.\n", "- [SparselyProfile](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#sparselyprofileerr): Convenience function for creating sparsely binned binwise averages and variances.\n", "- [TwoDimensionallyHistogram](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#twodimensionallyhistogram): Convenience function for creating a conventional, two-dimensional histogram.\n", "- [TwoDimensionallySparselyHistogram](https://histogrammar.github.io/histogrammar-docs/specification/1.0/#twodimensionallysparselyhistogram): Convenience function for creating a sparsely binned, two-dimensional histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For example, call this convenience function to make the same histogram as above:\n", "hlo = df.hg_SparselyProfileErr(25, 'longitude', 'age')\n", "hlo.plot.matplotlib();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overview of histograms\n", "\n", "Here you can find the list of all available histograms and aggregators and how to use each one: \n", "\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/\n", "\n", "The most useful aggregators are the following. Tinker with them to get familiar; building up an analysis is easier when you know \"there's an app for that.\"\n", "\n", "**Simple counters:**\n", "\n", " * [`Count`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#count-sum-of-weights): just counts. Every aggregator has an `entries` field, but `Count` _only_ has this field.\n", " * [`Average`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#average-mean-of-a-quantity) and [`Deviate`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#deviate-mean-and-variance): add mean and variance, cumulatively.\n", " * [`Minimize`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#minimize-minimum-value) and [`Maximize`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#maximize-maximum-value): lowest and highest value seen.\n", "\n", "**Histogram-like objects:**\n", "\n", " * [`Bin`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#bin-regular-binning-for-histograms) and [`SparselyBin`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#sparselybin-ignore-zeros): split a numerical domain into uniform bins and redirect aggregation into those bins.\n", " * [`Categorize`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#categorize-string-valued-bins-bar-charts): split a string-valued domain by unique values; good for making bar charts (which are histograms with a string-valued axis).\n", " * [`CentrallyBin`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#centrallybin-fully-partitioning-with-centers) and [`IrregularlyBin`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#irregularlybin-fully-partitioning-with-edges): split a numerical domain into arbitrary subintervals, usually for separate plots like particle pseudorapidity or collision centrality.\n", "\n", "**Collections:**\n", "\n", " * [`Label`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#label-directory-with-string-based-keys), [`UntypedLabel`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#untypedlabel-directory-of-different-types), and [`Index`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#index-list-with-integer-keys): bundle objects with string-based keys (`Label` and `UntypedLabel`) or simply an ordered array (effectively, integer-based keys) consisting of a single type (`Label` and `Index`) or any types (`UntypedLabel`).\n", " * [`Branch`](\n", "https://histogrammar.github.io/histogrammar-docs/specification/1.0/#branch-tuple-of-different-types): for the fourth case, an ordered array of any types. A `Branch` is useful as a \"cable splitter\". For instance, to make a histogram that tracks minimum and maximum value, do this:\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making many histograms at once\n", "\n", "There a nice method to make many histograms in one go. See here.\n", "\n", "By default automagical binning is applied to make the histograms.\n", "\n", "More details one how to use this function are found in in the advanced tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hists = df.hg_make_histograms()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hists.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "h = hists['transaction']\n", "h.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "h = hists['date']\n", "h.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# you can also select which and make multi-dimensional histograms\n", "hists = df.hg_make_histograms(features = ['longitude:age'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hist = hists['longitude:age']\n", "hist.plot.matplotlib();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Storage\n", "\n", "Histograms can be easily stored and retrieved in/from the json format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# storage\n", "hist.toJsonFile('long_age.json')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# retrieval\n", "factory = hg.Factory()\n", "hist2 = factory.fromJsonFile('long_age.json')\n", "hist2.plot.matplotlib();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we can store the histograms if we want to\n", "import json\n", "from histogrammar.util import dumper\n", "\n", "# store\n", "with open('histograms.json', 'w') as outfile:\n", " json.dump(hists, outfile, default=dumper)\n", "\n", "# and load again\n", "with open('histograms.json') as handle:\n", " hists2 = json.load(handle)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hists.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced tutorial\n", "\n", "The [advanced tutorial](https://nbviewer.jupyter.org/github/histogrammar/histogrammar-python/blob/master/histogrammar/notebooks/histogrammar_tutorial_advanced.ipynb) shows:\n", "- How to work with spark dataframes.\n", "- More details on this nice method to make many histograms in one go. For example how to set bin specifications.\n" ] } ], "metadata": { "kernel_info": { "name": "python3" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "nteract": { "version": "0.15.0" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }