{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 1: Basics\n", "\n", "In this tutorial, we will generate some sample distributions, cluster them and select benchmark points.\n", "\n", "**NOTE FOR CONTRIBUTORS: Always clear all output before committing (``Cell`` > ``All Output`` > ``Clear``)**!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "The distributions to cluster in this notebook are generated with the flavio package.\n", "In addition, showing the plots in this notebook requires matplotlib (if you installed ClusterKinG with the plotting option, you already have it).\n", "Install both with:\n", " \n", " pip3 install --user flavio matplotlib\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some Jupyter notebook magic:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show plots in Jupyter notebooks\n", "%matplotlib inline\n", "\n", "# Reload modules whenever they change\n", "# (for development purposes)\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "# Make clusterking package available even without installation\n", "import sys\n", "sys.path = [\"../../\"] + sys.path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import clusterking with a short name. This is all you usually have to do once clusterking is installed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import clusterking as ck" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate distributions (Scan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first step, we generate distributions for different parameter values. \n", "For this, there are two classes, ``Scanner`` and ``WilsonScanner``, with the latter focusing on sampling in the space of Wilson coefficients. The Wilson coefficients are implemented using the Wilson package (https://wilson-eft.github.io/), which supports a variety of bases and EFTs and matches them to user-specified scales. In this example, we use the flavio basis (https://wcxf.github.io/assets/pdf/WET.flavio.pdf) at a scale of 5 GeV." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = ck.scan.WilsonScanner(scale=5, eft=\"WET\", basis=\"flavio\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we set up the function/distribution that we want to consider. Here we look at the differential branching ratio with respect to $q^2$ of $B\\to D \\,\\tau\\, \\bar\\nu_\\tau$. The function for the differential branching ratio is taken from the flavio package (https://flav-io.github.io/). 
The $q^2$ binning is chosen to have 9 bins between $3.2 \\,\\text{GeV}^2$ and $11.6\\,\\text{GeV}^2$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import flavio\n", "import numpy as np\n", "\n", "# Differential branching ratio for Wilson coefficients w at q^2 = q\n", "def dBrdq2(w, q):\n", "    return flavio.np_prediction(\"dBR/dq2(B+->Dtaunu)\", w, q)\n", "\n", "s.set_dfunction(\n", "    dBrdq2,\n", "    binning=np.linspace(3.2, 11.6, 10),  # 10 bin edges, i.e. 9 bins\n", "    normalize=True,\n", "    xvar=\"q2\"  # only sets name of variable on x axis\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we specify the grid of Wilson coefficients that are subsequently sampled. \n", "Using the example of $B\\to D \\tau \\bar\\nu_\\tau$, we sample the coefficients ``CVL_bctaunutau``, ``CSL_bctaunutau`` and ``CT_bctaunutau`` with 10 points each between $-1$ and $1$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.set_spoints_equidist(\n", "    {\n", "        \"CVL_bctaunutau\": (-1, 1, 10),\n", "        \"CSL_bctaunutau\": (-1, 1, 10),\n", "        \"CT_bctaunutau\": (-1, 1, 10)\n", "    }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Imaginary parts can be added by prefixing the name of the coefficient with im_, e.g. im_CVL_bctaunutau.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compute the kinematical distributions from the Wilson coefficients sampled above, we need a data instance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = ck.Data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The kinematical distributions are computed with the ``run()`` method. This might take some time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r = s.run(d)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r.write()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "By default, Scanner.run uses all cores on your machine.\n", "You can specify a different number using the no_workers option or disable multiprocessing completely by setting no_workers=1.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are saved in our data object ``d``. \n", "At the heart of it is a dataframe, ``d.df``. Let's have a look:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can already take a quick look at the resulting distributions by randomly selecting a few of them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.plot_dist(nlines=10);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More plots will be introduced in subsequent tutorials, but you can also try running ``d.plot_dist_box()`` (box plots), ``d.plot_dist_minmax()`` (plot spread of bin contents), ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different clustering algorithms are available in the ``cluster`` subpackage of ClusterKinG.\n", "They are implemented as subclasses of the class ``Cluster``. By subclassing ``Cluster`` yourself (or any of the derived classes), it is easy to implement your own clustering algorithm.\n", "\n", "In this example, we will use a hierarchical clustering algorithm to group similar distributions together.\n", "We first create an instance of ``HierarchyCluster``, a subclass of ``Cluster`` (the data object is passed later, when calling ``run()``):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c = ck.cluster.HierarchyCluster()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we have to specify the metric we want to use to measure the distance between different distributions. 
If no argument is specified, a Euclidean metric is used (which is equivalent to a $\\chi^2$ metric, if we have flat uncorrelated relative errors on each bin):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.set_metric(\"euclidean\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set the maximal distance between the individual clusters, ``max_d``:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.set_max_d(0.2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r = c.run(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we add the information about the clusters to the dataframe created above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r.write()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we look at the dataframe again, we see a new column ``cluster`` containing the cluster number:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, there are also plenty of methods to plot the distributions that belong to the clusters, e.g.:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.plot_dist_box();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also plot the clusters against the parameters, e.g.:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.plot_clusters_scatter([\"CSL_bctaunutau\", \"CVL_bctaunutau\", \"CT_bctaunutau\"]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, more on plots in the following tutorials." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting benchmark points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a similar way, we can determine the benchmark points representing the individual clusters. First, we initialize a benchmark object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = ck.Benchmark()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we again choose a metric (the Euclidean metric is the default):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b.set_metric()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the benchmark points can be computed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r = b.run(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we write them to the dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r.write()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look; note the new column ``bpoint`` at the end of the dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now most plots will also show the distributions that correspond to the benchmark points:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.plot_dist_box();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preserving results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's time to write out the results for later use." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.write(\"output/tutorial_basics.sql\", overwrite=\"overwrite\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will not only write out the data itself, but also a lot of associated metadata that makes it easy to later reconstruct what the data actually represents. This was accumulated in the attribute ``d.md`` over all steps:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.md" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }