{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with incremental data\n", "\n", "This notebook shows how to generate reports on incremental datasets\n", "\n", "The incremental data will either have a proper time-axis, or will be batches of data without \n", "a specific time-axis. \n", "\n", "The histograms of these datasets will be stitched together, and we generate a (consistent) report on the stitched dataset.\n", "\n", "Note that we always generate the report on the full stitched histograms, because algorithms like trend detection and comparison with reference histograms rely on having the historical histograms in place." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reporting given a histograms object (dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install popmon (if not installed yet)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "!\"{sys.executable}\" -m pip install -q popmon" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import popmon\n", "from popmon import stability_report, stitch_histograms, get_bin_specs\n", "from popmon import resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(resources.data(\"test.csv.gz\"), parse_dates=[\"date\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add month column, so we can make data batches per month." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def to_month(x):\n", " date = pd.to_datetime(x)\n", " return str(12 * date.year + date.month)\n", "\n", "\n", "df[\"month\"] = df[\"date\"].apply(to_month)\n", "months = df.month.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histograms of full dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = [\"date:isActive\", \"date:eyeColor\", \"date:latitude\", \"date:age\"]\n", "# weeks start on a Monday\n", "hists = df.pm_make_histograms(\n", " features=features, time_axis=\"date\", time_width=\"1w\", time_offset=\"2015-1-5\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Incremental datasets with existing time-axis information" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the same bin_specifications for every month below\n", "bin_specs = popmon.get_bin_specs(hists)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# generate histograms per month, each month uses the same (weekly) binning specifications\n", "hists_list = []\n", "for month in months:\n", " df_month = df[df.month == month]\n", " h = df_month.pm_make_histograms(features=features, bin_specs=bin_specs)\n", " hists_list.append(h)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add up all the histograms sets\n", "hists2 = popmon.stitch_histograms(hists_list=hists_list, time_axis=\"date\", mode=\"add\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the two sets of histograms have consistent binning" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hists" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hists2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "rep = popmon.stability_report(hists)\n", "rep" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "popmon.stability_report(hists2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scan through the two reports above and you see that the outputs are identical!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding to existing histograms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now let's assume we already have a set of stitched histograms (hists3),\n", "# and we want to stitch to add another new batch to this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hists_basis is the set of existing histogram\n", "hists_basis = hists\n", "# hists_delta is the new set of histograms\n", "hists_delta = hists_list[-1]\n", "\n", "# by default, the stitcher will recognize the existing time-axis \"date\" in both histogram sets.\n", "# remember that binning along the \"date\" time-axis is in weeks.\n", "\n", "# when adding hists_delta, one can either \"add\" histograms to existing weeks, or \"replace\" existing weeks.\n", "# the default is to add them.\n", "hists4 = popmon.stitch_histograms(\n", " hists_basis=hists_basis, hists_delta=hists_delta, mode=\"add\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# or \"replace\" histograms found in existing weeks with those in hists_delta:\n", "hists4 = popmon.stitch_histograms(\n", " hists_basis=hists_basis, hists_delta=hists_delta, mode=\"replace\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Incremental datasets without time information\n", "\n", "Now we are ignoring the date information in the histogram creation, but every batch dataset corresponds to one week of data. \n", "Although a batch can be anything, of course. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = [\"isActive\", \"eyeColor\", \"latitude\", \"age\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the same bin_specifications for every week below as earlier, and skip the date information\n", "bin_specs = popmon.get_bin_specs(hists, skip_first_axis=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def to_week(x):\n", " date = pd.to_datetime(x)\n", " return 52 * date.year + date.week\n", "\n", "\n", "df[\"week\"] = df[\"date\"].apply(to_week)\n", "weeks = df.week.unique().tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# generate histograms per month, each month uses the same (weekly) binning specifications\n", "hists_list = []\n", "for week in weeks:\n", " df_week = df[df.week == week]\n", " h = df_week.pm_make_histograms(features=features, bin_specs=bin_specs)\n", " hists_list.append(h)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# since none of these histograms has a time-axis, in the stitching we create one (called 'batch'), and specify\n", "# that each batch of histograms is inserted at a particular value time_bin_idx value\n", "\n", "hists3 = popmon.stitch_histograms(\n", " hists_list=hists_list, time_axis=\"batch\", time_bin_idx=weeks\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "popmon.stability_report(hists3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "rep" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# again the two reports are identical, except that the first one uses the batch-id as artificial time-axis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding to an existing stitched histograms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now let's assume we already have a set of stitched histograms (hists3),\n", "# and we want to stitch to add another new batch to this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hists_basis are the existing histogram\n", "hists_basis = hists3\n", "# hists_delta is the new set of histograms\n", "hists_delta = hists_list[-1]\n", "\n", "# by default, the stitcher will insert the batch right after the last batch found.\n", "hists4 = popmon.stitch_histograms(\n", " hists_basis=hists_basis, hists_delta=hists_delta, time_axis=\"batch\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one can also insert the new batch at a chosen new or existing time-bin index:\n", "hists4 = popmon.stitch_histograms(\n", " hists_basis=hists_basis,\n", " hists_delta=hists_delta,\n", " time_axis=\"batch\",\n", " time_bin_idx=200000,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# when inserting at an existing time-bin index, on can either \"add\" to that index\n", "# or \"replace\" the existing histograms. The default setting is to \"add\" the histograms:\n", "mode = \"add\" # \"replace\"\n", "hists4 = popmon.stitch_histograms(\n", " hists_basis=hists_basis,\n", " hists_delta=hists_delta,\n", " time_axis=\"batch\",\n", " time_bin_idx=104833,\n", " mode=mode,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Storage of the histograms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%script false --no-raise-error\n", "\n", "# we can store the histograms if we want to\n", "import json\n", "from popmon.hist.histogram import dumper\n", "\n", "# store\n", "with open('histograms.json', 'w') as outfile:\n", "\tjson.dump(hists, outfile, default=dumper)\n", "\n", "# and load again\n", "with open('histograms.json') as handle:\n", "\thists = json.load(handle)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }