{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 1: Basics\n",
    "\n",
    "In this tutorial, we will generate some sample distributions, cluster them and select benchmark points.\n",
    "\n",
    "**NOTE FOR CONTRIBUTORS: Always clear all output before committing (``Cell`` > ``All Output`` > ``Clear``)**!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "The distributions clustered in this notebook are generated with the flavio package.\n",
    "Furthermore, to show the plots in this notebook, this tutorial requires matplotlib (if you installed <code>ClusterKinG</code> with the <code>plotting</code> option, you already have it).\n",
    "Install both with:\n",
    "    \n",
    "    pip3 install --user flavio matplotlib\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some Jupyter notebook magic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show plots in Jupyter notebooks\n",
    "%matplotlib inline\n",
    "\n",
    "# Reload modules whenever they change\n",
    "# (for development purposes)\n",
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "# Make clusterking package available even without installation\n",
    "import sys\n",
    "sys.path = [\"../../\"] + sys.path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import clusterking with a short name. This is all you usually have to do once clusterking is installed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import clusterking as ck"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate distributions (Scan)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the first step, we generate distributions for different parameter values. \n",
    "For this, there are two classes, ``Scanner`` and ``WilsonScanner``, with the latter focusing on sampling in the space of Wilson coefficients. The Wilson coefficients are implemented using the Wilson package (https://wilson-eft.github.io/), which supports a variety of bases and EFTs and matches them to user-specified scales. In this example we use the flavio basis (https://wcxf.github.io/assets/pdf/WET.flavio.pdf) at a scale of 5 GeV."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = ck.scan.WilsonScanner(scale=5, eft=\"WET\", basis=\"flavio\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we set up the function/distribution that we want to consider. Here we consider the differential branching ratio with respect to $q^2$ of $B\\to D \\,\\tau\\, \\bar\\nu_\\tau$. The function for the differential branching ratio is taken from the flavio package (https://flav-io.github.io/). The $q^2$ binning is chosen to have 9 bins between $3.2 \\,\\text{GeV}^2$ and $11.6\\,\\text{GeV}^2$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import flavio\n",
    "import numpy as np\n",
    "\n",
    "def dBrdq2(w, q):\n",
    "    return flavio.np_prediction(\"dBR/dq2(B+->Dtaunu)\", w, q)\n",
    "\n",
    "s.set_dfunction(\n",
    "    dBrdq2,\n",
    "    binning=np.linspace(3.2, 11.6, 10),\n",
    "    normalize=True,\n",
    "    xvar=\"q2\"  # only sets name of variable on x axis\n",
    ")"
   ]
  },
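  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the binning above: ``np.linspace(3.2, 11.6, 10)`` returns 10 equally spaced bin edges including both endpoints, which define 9 bins of equal width. Writing the edges as $q^2_k$ and the bin width as $\\Delta q^2$ (our notation, all values in $\\text{GeV}^2$):\n",
    "\n",
    "$$\n",
    "q^2_k = 3.2 + k\\,\\Delta q^2, \\qquad \\Delta q^2 = \\frac{11.6 - 3.2}{9} \\approx 0.93, \\qquad k = 0, \\dots, 9 .\n",
    "$$"
   ]
  },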
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we specify the grid of Wilson coefficients that are subsequently sampled. \n",
    "Using the example of $B\\to D \\tau \\bar\\nu_\\tau$, we sample the coefficients ``CVL_bctaunutau``, ``CSL_bctaunutau`` and ``CT_bctaunutau`` with 10 points each between $-1$ and $1$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s.set_spoints_equidist(\n",
    "    {\n",
    "        \"CVL_bctaunutau\": (-1, 1, 10),\n",
    "        \"CSL_bctaunutau\": (-1, 1, 10),\n",
    "        \"CT_bctaunutau\": (-1, 1, 10)\n",
    "    }\n",
    ")"
   ]
  },
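  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For orientation, here is a rough count of the grid size. This is a sketch that assumes both endpoints are included in the sampling, as with ``numpy.linspace``; that is our assumption about ``set_spoints_equidist`` and is not stated above. The symbol $C^{(j)}$ is our notation for the $j$-th sampled value of any one coefficient:\n",
    "\n",
    "$$\n",
    "C^{(j)} = -1 + j \\cdot \\frac{2}{9}, \\qquad j = 0, \\dots, 9, \\qquad N_\\text{points} = 10^3 = 1000 .\n",
    "$$\n",
    "\n",
    "Under the same assumption, the value $0$ is not itself a grid point, because it would require $j = 4.5$."
   ]
  },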
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    Imaginary parts can be added by prefixing the name of the coefficient with <code>im_</code>, e.g. <code>im_CVL_bctaunutau</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To compute the kinematical distributions for the Wilson coefficients sampled above, we need a data instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = ck.Data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Computing the kinematical distributions is done using the ``run()`` method. This might take some time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = s.run(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r.write()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "By default, <code>Scanner.run</code> uses all cores on your machine.\n",
    "You can specify a different number using the <code>no_workers</code> option or disable multiprocessing completely by setting <code>no_workers=1</code>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are saved in our data object ``d``. \n",
    "At the heart of it is a dataframe, ``d.df``. Let's have a look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also already take a quick look at the resulting distributions by randomly selecting a few of them. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.plot_dist(nlines=10);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More plots will be introduced in subsequent tutorials, but you can also try running ``d.plot_dist_box()`` (box plots), ``d.plot_dist_minmax()`` (plot spread of bin contents), ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Different clustering algorithms are available in the ``cluster`` subpackage of ClusterKinG.\n",
    "They are implemented as subclasses of the class ``Cluster``; by subclassing ``Cluster`` yourself (or any of its derived classes), it is easy to implement your own clustering algorithm.\n",
    "\n",
    "In this example, we will use a hierarchical clustering algorithm to group similar distributions together.\n",
    "We first create an instance of ``HierarchyCluster``, a subclass of ``Cluster``; the data object is passed later, when the clustering is run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c = ck.cluster.HierarchyCluster()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we have to specify the metric we want to use to measure the distance between different distributions. If no argument is specified, a Euclidean metric is used (which is equivalent to a $\\chi^2$ metric if we have flat uncorrelated relative errors on each bin):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.set_metric(\"euclidean\")"
   ]
  },
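  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the equivalence between the Euclidean and the $\\chi^2$ metric explicit, here is a short sketch. It assumes that \"flat\" means the same uncorrelated uncertainty $\\sigma$ on every bin; the symbols $n_i^{(a)}$, $n_i^{(b)}$, $\\sigma$, $N$ and $d_\\text{E}$ are our notation, and the exact $\\chi^2$ convention used in practice may differ by a constant factor. For two distributions with bin contents $n_i^{(a)}$ and $n_i^{(b)}$ in $N$ bins:\n",
    "\n",
    "$$\n",
    "\\chi^2 = \\sum_{i=1}^{N} \\frac{\\left(n_i^{(a)} - n_i^{(b)}\\right)^2}{\\sigma^2} = \\frac{d_\\text{E}^2}{\\sigma^2}, \\qquad d_\\text{E} = \\sqrt{\\sum_{i=1}^{N} \\left(n_i^{(a)} - n_i^{(b)}\\right)^2} .\n",
    "$$\n",
    "\n",
    "Since $\\chi^2$ grows monotonically with $d_\\text{E}$, both metrics order pairs of distributions identically, and a distance cut in one corresponds to a rescaled cut in the other."
   ]
  },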
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we set the maximal distance between the individual clusters, ``max_d``, and run the clustering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.set_max_d(0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = c.run(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we add the information about the clusters to the dataframe created above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r.write()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So when we now look at the dataframe again, we see a new column ``cluster`` with the cluster number:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are also plenty of methods to plot the distributions that belong to the clusters, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.plot_dist_box();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also plot clusters vs parameters, e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.plot_clusters_scatter([\"CSL_bctaunutau\", \"CVL_bctaunutau\", \"CT_bctaunutau\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, more on plots in the following tutorials."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting benchmark points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a similar way we can determine the benchmark points representing the individual clusters. Initializing a benchmark point object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b = ck.Benchmark()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and again choosing a metric (Euclidean metric is default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b.set_metric()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "the benchmark points can be computed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = b.run(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and written in the dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r.write()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look and notice the new column ``bpoint`` at the end of the data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now most plots will also show the distributions that correspond to the benchmark points:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.plot_dist_box();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preserving results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it's time to write out the results for later use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.write(\"output/tutorial_basics.sql\", overwrite=\"overwrite\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will not only write out the data itself, but also a lot of associated metadata that makes it easy to later reconstruct what the data actually represents. This was accumulated in the attribute ``d.md`` over all steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.md"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}