JetNet: For developing and reproducing ML + HEP projects.
\n",
"\n",
"\n",
"Repo: [github.com/jet-net/JetNet](https://github.com/jet-net/JetNet)\n",
"\n",
"Docs: [jetnet.readthedocs.io](https://jetnet.readthedocs.io/en/latest/)\n",
"\n",
"Paper: [2106.11535](https://arxiv.org/abs/2106.11535)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"### Problems:\n",
" - How do I **get started** with machine learning in high energy physics?\n",
" - How do I **evaluate** my results?\n",
" - How do we **reproduce** and **compare** results?\n",
" \n",
"### Solution:\n",
"\n",
"JetNet: a Python package with easy-to-access datasets, standardised evaluation metrics, and further utilities for improving accessibility and reproducibility in ML + HEP.\n",
"\n",
"#### Note: JetNet is still under development and currently offers a limited number of datasets and metrics. Feedback and contributions are welcome!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Today\n",
"\n",
" - Loading and looking at the JetNet dataset\n",
" - Preparing the dataset for training a model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data loading\n",
"\n",
"We'll use the `jetnet.datasets.JetNet.getData` function to download and directly access the dataset.\n",
"\n",
"First, we can check which particle and jet features are available in this dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from jetnet.datasets import JetNet\n",
"print(f\"Particle features: {JetNet.all_particle_features}\")\n",
"print(f\"Jet features: {JetNet.all_jet_features}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's load the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_args = {\n",
"    \"jet_type\": [\"g\", \"t\", \"w\"],  # gluon, top quark, and W boson jets\n",
"    \"data_dir\": \"datasets/jetnet\",\n",
"    # only selecting the kinematic features\n",
"    \"particle_features\": [\"etarel\", \"phirel\", \"ptrel\"],\n",
"    \"num_particles\": 30,\n",
"    \"jet_features\": [\"type\", \"pt\", \"eta\", \"mass\"],\n",
"}\n",
"\n",
"particle_data, jet_data = JetNet.getData(**data_args)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at some of the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Particle features of the 10 highest pT particles in the first jet\\n{data_args['particle_features']}\\n{particle_data[0, :10]}\")\n",
"print(f\"\\nJet features of first jet\\n{data_args['jet_features']}\\n{jet_data[0]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also visualise these jets as images:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from jetnet.utils import to_image\n",
"import matplotlib.pyplot as plt\n",
"\n",
"num_images = 5\n",
"num_types = len(data_args[\"jet_type\"])\n",
"im_size = 25 # number of pixels in height and width\n",
"maxR = 0.4 # max radius in (eta, phi) away from the jet axis\n",
"\n",
"cm = plt.cm.jet.copy()\n",
"cm.set_under(color=\"white\")\n",
"plt.rcParams.update({\"font.size\": 16})\n",
"\n",
"fig, axes = plt.subplots(\n",
"    nrows=num_types,\n",
"    ncols=num_images,\n",
"    figsize=(40, 8 * num_types),\n",
"    gridspec_kw={\"wspace\": 0.25},\n",
")\n",
"\n",
"# get the index of each jet type using the JetNet.jet_types array\n",
"type_indices = {jet_type: JetNet.jet_types.index(jet_type) for jet_type in data_args[\"jet_type\"]}\n",
"\n",
"for j in range(num_types):\n",
"    jet_type = data_args[\"jet_type\"][j]\n",
"    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on the jet type feature\n",
"\n",
"    axes[j][0].annotate(\n",
"        jet_type,\n",
"        xy=(0, -1),\n",
"        xytext=(-axes[j][0].yaxis.labelpad - 15, 0),\n",
"        xycoords=axes[j][0].yaxis.label,\n",
"        textcoords=\"offset points\",\n",
"        ha=\"right\",\n",
"        va=\"center\",\n",
"        fontsize=24,\n",
"    )\n",
"\n",
"    for i in range(num_images):\n",
"        im = axes[j][i].imshow(\n",
"            to_image(particle_data[type_selector][i], im_size, maxR=maxR),\n",
"            cmap=cm,\n",
"            interpolation=\"nearest\",\n",
"            vmin=1e-8,\n",
"            extent=[-maxR, maxR, -maxR, maxR],\n",
"            vmax=0.05,\n",
"        )\n",
"        axes[j][i].tick_params(which=\"both\", bottom=False, top=False, left=False, right=False)\n",
"        axes[j][i].set_xlabel(r\"$\\phi^{rel}$\")\n",
"        axes[j][i].set_ylabel(r\"$\\eta^{rel}$\")\n",
"        axes[j][i].set_title(f\"Jet {i + 1}\")\n",
"\n",
"cbar = fig.colorbar(im, ax=axes.ravel().tolist(), fraction=0.01)\n",
"cbar.set_label(\"$p_T^{rel}$\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also calculate and plot their overall features:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from jetnet.utils import jet_features\n",
"import numpy as np\n",
"\n",
"fig = plt.figure(figsize=(12, 12))\n",
"plt.ticklabel_format(axis=\"y\", scilimits=(0, 0), useMathText=True)\n",
"\n",
"for j in range(num_types):\n",
"    jet_type = data_args[\"jet_type\"][j]\n",
"    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on the jet type feature\n",
"\n",
"    # compute the relative jet mass from the particle features of up to 50,000 jets per type\n",
"    jet_masses = jet_features(particle_data[type_selector][:50000])[\"mass\"]\n",
"    _ = plt.hist(jet_masses, bins=np.linspace(0, 0.2, 100), histtype=\"step\", label=jet_type)\n",
"\n",
"plt.xlabel(\"Jet $m/p_{T}$\")\n",
"plt.ylabel(\"# Jets\")\n",
"plt.legend(loc=1, prop={\"size\": 18})\n",
"plt.title(\"Relative Jet Masses\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset preparation\n",
"\n",
"To prepare the dataset for machine learning applications, we can use the `jetnet.datasets.JetNet` class itself, which inherits from PyTorch's `torch.utils.data.Dataset` class.\n",
"\n",
"We'll also use the class to **normalise** the features to have zero mean and unit standard deviation, and **transform** the jet type feature to be one-hot-encoded."
]
},
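{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what these two operations do, the cell below applies them to toy data with plain NumPy. It is independent of JetNet: the names `toy_features` and `toy_types` and the toy values are hypothetical, and the normalisation shown is only assumed to match the behaviour described above, not taken from the library's implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# toy data (hypothetical): 1000 samples of 3 features with different scales\n",
"rng = np.random.default_rng(0)\n",
"toy_features = rng.normal(loc=[1.0, -2.0, 5.0], scale=[0.5, 2.0, 3.0], size=(1000, 3))\n",
"\n",
"# feature-wise normalisation: subtract each column's mean and divide by its standard deviation\n",
"toy_normed = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)\n",
"print(f\"Means after normalising: {toy_normed.mean(axis=0).round(6)}\")\n",
"print(f\"Standard deviations after normalising: {toy_normed.std(axis=0).round(6)}\")\n",
"\n",
"# one-hot encoding: each integer label selects a row of the identity matrix\n",
"toy_types = np.array([0, 1, 1, 0])\n",
"print(f\"One-hot encoded types:\\n{np.eye(2)[toy_types]}\")"
]
},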
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from jetnet.datasets import JetNet\n",
"from jetnet.datasets.normalisations import FeaturewiseLinear\n",
"\n",
"import numpy as np\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"\n",
"# function to one hot encode the jet type and leave the rest of the features as is\n",
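"def OneHotEncodeType(x: np.ndarray):\n",
"    # the type feature stores each jet's index in JetNet.jet_types, so look up the\n",
"    # categories for our two jet types instead of hard-coding them\n",
"    categories = sorted(JetNet.jet_types.index(jet_type) for jet_type in [\"g\", \"t\"])\n",
"    enc = OneHotEncoder(categories=[categories])\n",
"    type_encoded = enc.fit_transform(x[..., 0].reshape(-1, 1)).toarray()\n",
"    other_features = x[..., 1:].reshape(-1, 3)\n",
"    return np.concatenate((type_encoded, other_features), axis=-1).reshape(*x.shape[:-1], -1)\n",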
"\n",
"\n",
"data_args = {\n",
"    \"jet_type\": [\"g\", \"t\"],  # gluon and top quark jets\n",
"    \"data_dir\": \"datasets/jetnet\",\n",
"    # these are the default particle features, written here to be explicit\n",
"    \"particle_features\": [\"etarel\", \"phirel\", \"ptrel\", \"mask\"],\n",
"    \"num_particles\": 10,  # we retain only the 10 highest pT particles for this demo\n",
"    \"jet_features\": [\"type\", \"pt\", \"eta\", \"mass\"],\n",
"    # we don't want to normalise the 'mask' feature so we set that to False\n",
"    \"particle_normalisation\": FeaturewiseLinear(normal=True, normalise_features=[True, True, True, False]),\n",
"    # pass our function as a transform to be applied to the jet features\n",
"    \"jet_transform\": OneHotEncodeType,\n",
"}\n",
"\n",
"jets_train = JetNet(**data_args, split=\"train\")\n",
"jets_valid = JetNet(**data_args, split=\"valid\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can look at one of our datasets to confirm everything is as we expect:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"jets_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also look directly at the data itself (note that the features have been **normalised** and the jet type has been **one-hot-encoded**):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"particle_features, jet_features = jets_train[0]\n",
"print(f\"Particle features ({data_args['particle_features']}):\\n\\t{particle_features}\")\n",
"print(f\"\\nJet features ({data_args['jet_features']}):\\n\\t{jet_features}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now feed this into a PyTorch DataLoader and start training!\n",
"\n",
"Next, you can try:\n",
" - Repeating this with the Top Quark Tagging (`jetnet.datasets.TopTagging`) and Quark Gluon (`jetnet.datasets.QuarkGluon`) datasets\n",
" - Training an ML model (tutorial coming soon...)\n",
" - Evaluating generative models (`jetnet.evaluation`)"
]
}
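,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the DataLoader step mentioned above, the cell below wraps the `jets_train` and `jets_valid` datasets created earlier in `torch.utils.data.DataLoader` objects and draws one batch. The batch size of 64 and the names `train_loader` and `valid_loader` are illustrative assumptions, not recommendations from JetNet."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from torch.utils.data import DataLoader\n",
"\n",
"# shuffle only the training set; the batch size is an arbitrary choice for this demo\n",
"train_loader = DataLoader(jets_train, batch_size=64, shuffle=True)\n",
"valid_loader = DataLoader(jets_valid, batch_size=64, shuffle=False)\n",
"\n",
"# each item is a (particle features, jet features) pair, so each batch is a pair of stacked arrays\n",
"particle_batch, jet_batch = next(iter(train_loader))\n",
"print(f\"Particle feature batch shape: {tuple(particle_batch.shape)}\")\n",
"print(f\"Jet feature batch shape: {tuple(jet_batch.shape)}\")"
]
}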
],
"metadata": {
"kernelspec": {
"display_name": "python310",
"language": "python",
"name": "python310"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"vscode": {
"interpreter": {
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}