{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# JetNet Demo\n", "\n", "

Raghav Kansal
UC San Diego

\n", "\n", "

PyHEP 2022 Workshop
Online, 12-16 September 2022

\n", "\n", "\n", "

JetNet: For developing and reproducing ML + HEP projects.

\n", "\n", "\n", "Repo: [github.com/jet-net/JetNet](https://github.com/jet-net/JetNet)\n", "\n", "Docs: [jetnet.readthedocs.io](https://jetnet.readthedocs.io/en/latest/)\n", "\n", "Paper: [2106.11535](https://arxiv.org/abs/2106.11535)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "### Problems:\n", " - How do I **get started** with machine learning in high energy physics?\n", " - How do I **evaluate** my results?\n", " - How do we **reproduce** and **compare** results?\n", " \n", "### Solution:\n", "\n", "JetNet: a Python package with easy-to-access datasets, standardised evaluation metrics, and other utilities for improving accessibility and reproducibility in ML + HEP.\n", "\n", "#### Note: JetNet is still under development and currently offers a limited number of datasets and metrics. Feedback and contributions are welcome!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Today\n", "\n", " - Loading and looking at the JetNet dataset\n", " - Preparing the dataset for training a model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data loading\n", "\n", "We'll use the `jetnet.datasets.JetNet.getData` function to download and directly access the dataset.\n", "\n", "First, we can check which particle and jet features are available in this dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from jetnet.datasets import JetNet\n", "print(f\"Particle features: {JetNet.all_particle_features}\")\n", "print(f\"Jet features: {JetNet.all_jet_features}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's load the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_args = {\n", "    \"jet_type\": [\"g\", \"t\", \"w\"],  # gluon, top quark, and W boson jets\n", "    \"data_dir\": \"datasets/jetnet\",\n", "    # only selecting the kinematic features\n", "    \"particle_features\": [\"etarel\", \"phirel\", 
\"ptrel\"],\n", "    \"num_particles\": 30,\n", "    \"jet_features\": [\"type\", \"pt\", \"eta\", \"mass\"],\n", "}\n", "\n", "particle_data, jet_data = JetNet.getData(**data_args)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at some of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Particle features of the 10 highest pT particles in the first jet\\n{data_args['particle_features']}\\n{particle_data[0, :10]}\")\n", "print(f\"\\nJet features of first jet\\n{data_args['jet_features']}\\n{jet_data[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also visualise these jets as images:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from jetnet.utils import to_image\n", "import matplotlib.pyplot as plt\n", "\n", "num_images = 5\n", "num_types = len(data_args[\"jet_type\"])\n", "im_size = 25  # number of pixels in height and width\n", "maxR = 0.4  # max radius in (eta, phi) away from the jet axis\n", "\n", "cm = plt.cm.jet.copy()\n", "cm.set_under(color=\"white\")\n", "plt.rcParams.update({\"font.size\": 16})\n", "\n", "fig, axes = plt.subplots(\n", "    nrows=num_types,\n", "    ncols=num_images,\n", "    figsize=(40, 8 * num_types),\n", "    gridspec_kw={\"wspace\": 0.25},\n", ")\n", "\n", "# get the index of each jet type using the JetNet.jet_types array\n", "type_indices = {jet_type: JetNet.jet_types.index(jet_type) for jet_type in data_args[\"jet_type\"]}\n", "\n", "for j in range(num_types):\n", "    jet_type = data_args[\"jet_type\"][j]\n", "    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat\n", "\n", "    axes[j][0].annotate(\n", "        jet_type,\n", "        xy=(0, -1),\n", "        xytext=(-axes[j][0].yaxis.labelpad - 15, 0),\n", "        xycoords=axes[j][0].yaxis.label,\n", "        textcoords=\"offset points\",\n", "        ha=\"right\",\n", "        va=\"center\",\n", "        fontsize=24\n", "    )\n", "\n", "    for i in 
range(num_images):\n", "        im = axes[j][i].imshow(\n", "            to_image(particle_data[type_selector][i], im_size, maxR=maxR),\n", "            cmap=cm,\n", "            interpolation=\"nearest\",\n", "            vmin=1e-8,\n", "            extent=[-maxR, maxR, -maxR, maxR],\n", "            vmax=0.05,\n", "        )\n", "        axes[j][i].tick_params(which=\"both\", bottom=False, top=False, left=False, right=False)\n", "        axes[j][i].set_xlabel(r\"$\\phi^{rel}$\")\n", "        axes[j][i].set_ylabel(r\"$\\eta^{rel}$\")\n", "        axes[j][i].set_title(f\"Jet {i + 1}\")\n", "\n", "cbar = fig.colorbar(im, ax=axes.ravel().tolist(), fraction=0.01)\n", "cbar.set_label(\"$p_T^{rel}$\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also calculate and plot their overall features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from jetnet.utils import jet_features\n", "import numpy as np\n", "\n", "fig = plt.figure(figsize=(12, 12))\n", "plt.ticklabel_format(axis=\"y\", scilimits=(0, 0), useMathText=True)\n", "\n", "for j in range(num_types):\n", "    jet_type = data_args[\"jet_type\"][j]\n", "    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat\n", "\n", "    jet_masses = jet_features(particle_data[type_selector][:50000])[\"mass\"]\n", "    _ = plt.hist(jet_masses, bins=np.linspace(0, 0.2, 100), histtype=\"step\", label=jet_type)\n", "\n", "plt.xlabel(\"Jet $m/p_{T}$\")\n", "plt.ylabel(\"# Jets\")\n", "plt.legend(loc=1, prop={\"size\": 18})\n", "plt.title(\"Relative Jet Masses\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset preparation\n", "\n", "To prepare the dataset for machine learning applications, we can use the `jetnet.datasets.JetNet` class itself, which inherits from PyTorch's `torch.utils.data.Dataset` class.\n", "\n", "We'll also use the class to **normalise** the features to have zero mean and unit standard deviation, and to **transform** the jet type feature to be one-hot-encoded." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from jetnet.datasets import JetNet\n", "from jetnet.datasets.normalisations import FeaturewiseLinear\n", "\n", "import numpy as np\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "\n", "# function to one hot encode the jet type and leave the rest of the features as is\n", "def OneHotEncodeType(x: np.ndarray):\n", "    # categories are the type indices of \"g\" and \"t\" in JetNet.jet_types\n", "    enc = OneHotEncoder(categories=[[0, 2]])\n", "    type_encoded = enc.fit_transform(x[..., 0].reshape(-1, 1)).toarray()\n", "    other_features = x[..., 1:].reshape(-1, 3)\n", "    return np.concatenate((type_encoded, other_features), axis=-1).reshape(*x.shape[:-1], -1)\n", "\n", "\n", "data_args = {\n", "    \"jet_type\": [\"g\", \"t\"],  # gluon and top quark jets\n", "    \"data_dir\": \"datasets/jetnet\",\n", "    # these are the default particle features, written here to be explicit\n", "    \"particle_features\": [\"etarel\", \"phirel\", \"ptrel\", \"mask\"],\n", "    \"num_particles\": 10,  # we retain only the 10 highest pT particles for this demo\n", "    \"jet_features\": [\"type\", \"pt\", \"eta\", \"mass\"],\n", "    # we don't want to normalise the 'mask' feature so we set that to False\n", "    \"particle_normalisation\": FeaturewiseLinear(normal=True, normalise_features=[True, True, True, False]),\n", "    # pass our function as a transform to be applied to the jet features\n", "    \"jet_transform\": OneHotEncodeType,\n", "}\n", "\n", "jets_train = JetNet(**data_args, split=\"train\")\n", "jets_valid = JetNet(**data_args, split=\"valid\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at one of our datasets to confirm everything is as we expect:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jets_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And also directly at the data itself - note that the features have been **normalised** and the jet type has been 
**one-hot-encoded**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "particle_features, jet_features = jets_train[0]\n", "print(f\"Particle features ({data_args['particle_features']}):\\n\\t{particle_features}\")\n", "print(f\"\\nJet features ({data_args['jet_features']}):\\n\\t{jet_features}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
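To make the two transformations concrete, here is a minimal, dependency-free sketch of what normalising to zero mean and unit standard deviation and one-hot encoding do to a few toy values. It is an illustration under stated assumptions, not JetNet's implementation: `standardise`, `one_hot`, the toy numbers, and the type indices 0 and 2 are all hypothetical.

```python
from statistics import fmean, pstdev


def standardise(rows, normalise_features):
    # Shift and scale each selected column to zero mean and unit standard
    # deviation; columns flagged False (e.g. a 0/1 mask) are left untouched.
    columns = list(zip(*rows))
    scaled = []
    for column, flag in zip(columns, normalise_features):
        if flag:
            mean, std = fmean(column), pstdev(column)
            column = [(value - mean) / std for value in column]
        scaled.append(list(column))
    # Transpose back so that each row is one particle again.
    return [list(row) for row in zip(*scaled)]


def one_hot(label, categories):
    # Return a 0/1 vector with a single 1 at the position of `label`.
    return [1.0 if label == category else 0.0 for category in categories]


# Hypothetical toy particles: (etarel, phirel, ptrel, mask)
particles = [
    [0.1, -0.2, 0.5, 1.0],
    [0.3, 0.0, 0.3, 1.0],
    [-0.4, 0.2, 0.2, 0.0],
]
normalised = standardise(particles, [True, True, True, False])

# Hypothetical jet: [type, pt, eta, mass], with the type index first
jet = [2.0, 1100.0, 0.4, 170.0]
encoded = one_hot(jet[0], categories=[0.0, 2.0]) + jet[1:]
print(encoded)
```

The last column is passed through unchanged, mirroring the `normalise_features=[True, True, True, False]` setting used for the `mask` feature earlier, and the encoded jet has one more entry than the original because the single type index becomes a two-element vector.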

We can now feed this into a PyTorch DataLoader and start training!

\n", "\n", "Next, you can try:\n", " - Repeating this with the Top Quark Tagging (`jetnet.datasets.TopTagging`) and Quark Gluon (`jetnet.datasets.QuarkGluon`) datasets\n", " - Training an ML model (tutorial coming soon...)\n", " - Evaluating generative models (`jetnet.evaluation`)" ] } ], "metadata": { "kernelspec": { "display_name": "python310", "language": "python", "name": "python310" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 4 }