{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# JetNet Demo\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/rkansal47/JetNet/main/docs/_static/images/jetnetlogo.png\" alt=\"logo\" width=\"500\"/>\n",
    "\n",
    "<h3>Raghav Kansal<br>UC San Diego</h3>\n",
    "\n",
    "<h3><span style=\"color:gray\"><a href=\"https://indico.cern.ch/e/PyHEP2022\">PyHEP 2022 Workshop</a><br>Online, 12-16 September 2022</span></h3>\n",
    "\n",
    "\n",
    "<h4>JetNet: For developing and reproducing ML + HEP projects.</h4>\n",
    "\n",
    "\n",
    "Repo: [github.com/jet-net/JetNet](https://github.com/jet-net/JetNet)\n",
    "\n",
    "Docs: [jetnet.readthedocs.io](https://jetnet.readthedocs.io/en/latest/)\n",
    "\n",
    "Paper: [2106.11535](https://arxiv.org/abs/2106.11535)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "### Problems:\n",
    " - How do I **get started** with machine learning in high energy physics?\n",
    " - How do I **evaluate** my results?\n",
    " - How do we **reproduce** and **compare** results?\n",
    " \n",
    "### Solution:\n",
    "\n",
    "JetNet: Python package with easy-to-access datasets, standardised evaluation metrics, and more utilities for improving accessibility and reproducibility in ML + HEP.\n",
    "\n",
    "#### Note: Still under development, with currently a limited number of datasets and metrics. Feedback and contributions welcome!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Today\n",
    "\n",
    " - Loading and looking at the JetNet dataset\n",
    " - Preparing the dataset for training a model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data loading\n",
    "\n",
    "We'll use the `jetnet.datasets.JetNet.getData` function to download and directly access the dataset.\n",
    "\n",
    "First, we can check which particle and jet features are available in this dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from jetnet.datasets import JetNet\n",
    "print(f\"Particle features: {JetNet.all_particle_features}\")\n",
    "print(f\"Jet features: {JetNet.all_jet_features}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's load the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_args = {\n",
    "    \"jet_type\": [\"g\", \"t\", \"w\"],  # gluon, top quark, and W boson jets\n",
    "    \"data_dir\": \"datasets/jetnet\",\n",
    "    # only selecting the kinematic features\n",
    "    \"particle_features\": [\"etarel\", \"phirel\", \"ptrel\"],\n",
    "    \"num_particles\": 30,\n",
    "    \"jet_features\": [\"type\", \"pt\", \"eta\", \"mass\"],\n",
    "}\n",
    "\n",
    "particle_data, jet_data = JetNet.getData(**data_args)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at some of the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Particle features of the 10 highest pT particles in the first jet\\n{data_args['particle_features']}\\n{particle_data[0, :10]}\")\n",
    "print(f\"\\nJet features of first jet\\n{data_args['jet_features']}\\n{jet_data[0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also visualise these jets as images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from jetnet.utils import to_image\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "num_images = 5\n",
    "num_types = len(data_args[\"jet_type\"])\n",
    "im_size = 25  # number of pixels in height and width\n",
    "maxR = 0.4  # max radius in (eta, phi) away from the jet axis\n",
    "\n",
    "cm = plt.cm.jet.copy()\n",
    "cm.set_under(color=\"white\")\n",
    "plt.rcParams.update({\"font.size\": 16})\n",
    "\n",
    "fig, axes = plt.subplots(\n",
    "    nrows=num_types,\n",
    "    ncols=num_images,\n",
    "    figsize=(40, 8 * num_types),\n",
    "    gridspec_kw={\"wspace\": 0.25},\n",
    ")\n",
    "\n",
    "# get the index of each jet type using the JetNet.jet_types array\n",
    "type_indices = {jet_type: JetNet.jet_types.index(jet_type) for jet_type in data_args[\"jet_type\"]}\n",
    "\n",
    "for j in range(num_types):\n",
    "    jet_type = data_args[\"jet_type\"][j]\n",
    "    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat\n",
    "\n",
    "    axes[j][0].annotate(\n",
    "            jet_type,\n",
    "            xy=(0, -1),\n",
    "            xytext=(-axes[j][0].yaxis.labelpad - 15, 0),\n",
    "            xycoords=axes[j][0].yaxis.label,\n",
    "            textcoords=\"offset points\",\n",
    "            ha=\"right\",\n",
    "            va=\"center\",\n",
    "            fontsize=24\n",
    "        )\n",
    "\n",
    "    for i in range(num_images):\n",
    "        im = axes[j][i].imshow(\n",
    "            to_image(particle_data[type_selector][i], im_size, maxR=maxR),\n",
    "            cmap=cm,\n",
    "            interpolation=\"nearest\",\n",
    "            vmin=1e-8,\n",
    "            extent=[-maxR, maxR, -maxR, maxR],\n",
    "            vmax=0.05,\n",
    "        )\n",
    "        axes[j][i].tick_params(which=\"both\", bottom=False, top=False, left=False, right=False)\n",
    "        axes[j][i].set_xlabel(\"$\\phi^{rel}$\")\n",
    "        axes[j][i].set_ylabel(\"$\\eta^{rel}$\")\n",
    "        axes[j][i].set_title(f\"Jet {i + 1}\")\n",
    "\n",
    "cbar = fig.colorbar(im, ax=axes.ravel().tolist(), fraction=0.01)\n",
    "cbar.set_label(\"$p_T^{rel}$\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And calculate and plot their overall features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from jetnet.utils import jet_features\n",
    "import numpy as np\n",
    "\n",
    "fig = plt.figure(figsize=(12, 12))\n",
    "plt.ticklabel_format(axis=\"y\", scilimits=(0, 0), useMathText=True)\n",
    "\n",
    "for j in range(num_types):\n",
    "    jet_type = data_args[\"jet_type\"][j]\n",
    "    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat\n",
    "\n",
    "    jet_masses = jet_features(particle_data[type_selector][:50000])[\"mass\"]\n",
    "    _ = plt.hist(jet_masses, bins=np.linspace(0, 0.2, 100), histtype=\"step\", label=jet_type)\n",
    "\n",
    "plt.xlabel(\"Jet $m/p_{T}$\")\n",
    "plt.ylabel(\"# Jets\")\n",
    "plt.legend(loc=1, prop={\"size\": 18})\n",
    "plt.title(\"Relative Jet Masses\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset preparation\n",
    "\n",
    "To prepare the dataset for machine learning applications, we can use the `jetnet.datasets.JetNet` class itself, which inherits the `pytorch.data.utils.Dataset` class.\n",
    "\n",
    "We'll also use the class to **normalise** the features to have zero means and unit standard deviations, and **transform** the jet type feature to be one-hot-encoded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from jetnet.datasets import JetNet\n",
    "from jetnet.datasets.normalisations import FeaturewiseLinear\n",
    "\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "\n",
    "# function to one hot encode the jet type and leave the rest of the features as is\n",
    "def OneHotEncodeType(x: np.ndarray):\n",
    "    enc = OneHotEncoder(categories=[[0, 1]])\n",
    "    type_encoded = enc.fit_transform(x[..., 0].reshape(-1, 1)).toarray()\n",
    "    other_features = x[..., 1:].reshape(-1, 3)\n",
    "    return np.concatenate((type_encoded, other_features), axis=-1).reshape(*x.shape[:-1], -1)\n",
    "\n",
    "\n",
    "data_args = {\n",
    "    \"jet_type\": [\"g\", \"t\"],  # gluon and top quark jets\n",
    "    \"data_dir\": \"datasets/jetnet\",\n",
    "    # these are the default particle features, written here to be explicit\n",
    "    \"particle_features\": [\"etarel\", \"phirel\", \"ptrel\", \"mask\"],\n",
    "    \"num_particles\": 10,  # we retain only the 10 highest pT particles for this demo\n",
    "    \"jet_features\": [\"type\", \"pt\", \"eta\", \"mass\"],\n",
    "    # we don't want to normalise the 'mask' feature so we set that to False\n",
    "    \"particle_normalisation\": FeaturewiseLinear(normal=True, normalise_features=[True, True, True, False]),  \n",
    "    # pass our function as a transform to be applied to the jet features\n",
    "    \"jet_transform\": OneHotEncodeType,\n",
    "}\n",
    "\n",
    "jets_train = JetNet(**data_args, split=\"train\")\n",
    "jets_valid = JetNet(**data_args, split=\"valid\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can look at one of our datasets to confirm everything is as we expect:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "jets_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And also directly at the data itself - note that the features have been **normalised** and the jet type has been **one-hot-encoded**):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "particle_features, jet_features = jets_train[0]\n",
    "print(f\"Particle features ({data_args['particle_features']}):\\n\\t{particle_features}\")\n",
    "print(f\"\\nJet features ({data_args['jet_features']}):\\n\\t{jet_features}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>We can now feed this into a PyTorch DataLoader and start training!</h4>\n",
    "\n",
    "Next things you can try are:\n",
    " - Repeat this with the Top Quark Tagging (`jetnet.datasets.TopTagging`) and Quark Gluon datasets (`jetnet.datasets.QuarkGluon`)\n",
    " - Training an ML model (tutorial coming soon...)\n",
    " - Evaluating generative models (`jetnet.evaluation`)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python310",
   "language": "python",
   "name": "python310"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}