{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning for $t\\bar{t}Z$ Opposite-sign dilepton analysis \n", "This notebook uses ATLAS Open Data http://opendata.atlas.cern to show you the steps to implement Machine Learning in the $t\\bar{t}Z$ Opposite-sign dilepton analysis, following the ATLAS published paper [Measurement of the $t\\bar{t}Z$ and $t\\bar{t}W$ cross sections in proton-proton collisions at $\\sqrt{s}$ = 13 TeV with the ATLAS detector](https://journals.aps.org/prd/pdf/10.1103/PhysRevD.99.072009).\n", "\n", "The whole notebook takes less than an hour to follow through.\n", "\n", "Notebooks are web applications that allow you to create and share documents that can contain for example:\n", "1. live code\n", "2. visualisations\n", "3. narrative text\n", "\n", "Notebooks are a perfect platform to develop Machine Learning for your work, since you'll need exactly those 3 things: code, visualisations and narrative text!\n", "\n", "We're interested in Machine Learning because we can design an algorithm to figure out for itself how to do various analyses, potentially saving us countless human-hours of design and analysis work.\n", "\n", "Machine Learning use within ATLAS includes: \n", "* particle tracking\n", "* particle identification\n", "* signal/background classification\n", "* and more!\n", "\n", "This notebook will focus on ROC curves.\n", "\n", "By the end of this notebook you will be able to:\n", "1. run machine learning algorithms to classify signal and background\n", "2. know some things you can change to improve your machine learning algorithms\n", "\n", "Feynman diagram pictures are borrowed from our friends at https://www.particlezoo.net" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction (from Section 1)\n", "\n", "Properties of the top quark have been explored by the\n", "Large Hadron Collider (LHC) and previous collider experiments in great detail. \n", "\n", "Other properties of the top quark are\n", "now becoming accessible, owing to the large center-of-mass energy and luminosity at the LHC.\n", "\n", "Measurements of top-quark pairs in association with a Z boson ($t\\bar{t}Z$) provide a direct probe of the\n", "weak couplings of the top quark. These couplings\n", "may be modified in the presence of physics beyond the\n", "Standard Model (BSM). Measurements of the $t\\bar{t}Z$ production cross sections, $\\sigma_{t\\bar{t}Z}$, can be used to\n", "set constraints on the weak couplings of the top quark. \n", "\n", "The production of $t\\bar{t}Z$ is often an important\n", "background in searches involving final states with multiple\n", "leptons and b-quarks. These processes also constitute an\n", "important background in measurements of the associated\n", "production of the Higgs boson with top quarks.\n", "\n", "This paper presents measurements of the $t\\bar{t}Z$ cross section using proton–proton (pp) collision data\n", "at a center-of-mass energy $\\sqrt{s} = 13 TeV.\n", "\n", "The final states of top-quark pairs produced in association with a\n", "Z boson contain up to four isolated, prompt leptons. In this analysis, events with two opposite-sign\n", "(OS) leptons are considered. 
The dominant backgrounds\n", "in this channel are Z+jets and $t\\bar{t}$.\n", "\n", "(In this paper, lepton is used to denote electron or muon, and prompt lepton is used to denote a lepton produced in a Z or W\n", "boson decay, or in the decay of a τ-lepton which arises from a Z or W boson decay.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data and simulated samples (from Section 3)\n", "\n", "The data were collected with the ATLAS detector at a proton–proton (pp) collision\n", "energy of 13 TeV. \n", "\n", "Monte Carlo (MC) simulation samples are used to model the expected signal and background distributions\n", "in the different control, validation and signal regions described below. All samples were processed through the\n", "same reconstruction software as used for the data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Opposite-sign dilepton analysis (from Section 5A)\n", "\n", "The OS dilepton analysis targets the $t\\bar{t}Z$ process, where both top quarks decay hadronically and the Z boson\n", "decays to a pair of leptons (electrons or muons). Events are required to have exactly two opposite-sign leptons.\n", "Events with additional isolated leptons are rejected. The leading (subleading) lepton is required to have a\n", "transverse momentum of at least 30 (15) GeV.\n", "\n", "The OS dilepton analysis is affected by large backgrounds from Z+jets and $t\\bar{t}$ production, both characterized\n", "by the presence of two leptons. \n", "\n", "The signal region\n", "requirements are summarized in Table 1 below."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Variable | Selection |\n", "|------|------|\n", "| Leptons | = 2, opposite sign |\n", "| $p_T$ (leading lepton) | > 30 GeV |\n", "| $p_T$ (subleading lepton) | > 15 GeV |\n", "\n", "Table 1: Summary of the event selection requirements in the OS dilepton signal regions.\n", "\n", "This is a subset of Table 2 of the ATLAS published paper [Measurement of the $t\\bar{t}Z$ and $t\\bar{t}W$ cross sections in proton-proton collisions at $\\sqrt{s}$ = 13 TeV with the ATLAS detector](https://journals.aps.org/prd/pdf/10.1103/PhysRevD.99.072009)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Contents: \n", "\n", "[Running a Jupyter notebook](#running)
\n", "[First time setup](#setupfirsttime)
\n", "[To set up every time](#setupeverytime)
\n", "  [File path](#fraction)
\n", "  [Get data from files](#get_data_from_files)
\n", "\n", "[Machine learning](#MVA)
\n", "  [Training and Testing split](#train_test_split)
\n", "  [Training](#MVA_training)
\n", " \n", "[Going further](#going_further)
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running a Jupyter notebook\n", "\n", "To run the whole Jupyter notebook, in the top menu click Cell -> Run All.\n", "\n", "To propagate a change you've made to a piece of code, click Cell -> Run All Below.\n", "\n", "You can also run a single code cell, by using the keyboard shortcut Shift+Enter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First time setup on your computer (no need on mybinder)\n", "This first cell only needs to be run the first time you open this notebook on your computer. \n", "\n", "If you close Jupyter and re-open on the same computer, you won't need to run this first cell again.\n", "\n", "If you open on mybinder, you don't need to run this cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install --upgrade --user pip # update the pip package installer\n", "!{sys.executable} -m pip install -U pandas sklearn --user # install required packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## To setup everytime\n", "Cell -> Run All Below\n", "\n", "to be done every time you re-open this notebook.\n", "\n", "We're going to be using a number of tools to help us:\n", "* pandas: lets us store data as dataframes, a format widely used in Machine Learning\n", "* numpy: provides numerical calculations such as histogramming\n", "* matplotlib: common tool for making plots, figures, images, visualisations" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": true }, 
"outputs": [], "source": [ "import pandas as pd # to store data as dataframes\n", "import numpy as np # for numerical calculations such as histogramming\n", "import matplotlib.pyplot as plt # for plotting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## File path" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#csv_path = \"~/13tevatlasopendataguide2020/docs/visualization/CrossFilter/13TeV_ttZ.csv\" # local \n", "csv_path = \"http://opendata.atlas.cern/release/2020/documentation/visualization/CrossFilter/13TeV_ttZ.csv\" # web address" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get data from files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "define function to get data from files\n", "\n", "The datasets used in this notebook have already been filtered to include exactly 2 leptons per event, so that processing is quicker." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Channel</th><th>NJets</th><th>MET</th><th>Mll</th><th>LepDeltaPhi</th><th>LepDeltaR</th><th>SumLepPt</th><th>NBJets</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>145370</th><td>0</td><td>5</td><td>5.76</td><td>90.78</td><td>1.32</td><td>1.40</td><td>139.59</td><td>1</td></tr>\n",
"    <tr><th>145371</th><td>0</td><td>4</td><td>97.16</td><td>91.32</td><td>0.99</td><td>1.37</td><td>175.51</td><td>1</td></tr>\n",
"    <tr><th>145372</th><td>0</td><td>5</td><td>25.66</td><td>88.16</td><td>1.67</td><td>1.70</td><td>91.10</td><td>1</td></tr>\n",
"    <tr><th>145373</th><td>0</td><td>5</td><td>64.06</td><td>91.00</td><td>2.69</td><td>2.71</td><td>22.76</td><td>2</td></tr>\n",
"    <tr><th>145374</th><td>0</td><td>5</td><td>52.33</td><td>75.13</td><td>0.10</td><td>1.12</td><td>152.97</td><td>1</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>154240</th><td>1</td><td>4</td><td>14.78</td><td>65.87</td><td>0.66</td><td>0.67</td><td>196.76</td><td>2</td></tr>\n",
"    <tr><th>154241</th><td>1</td><td>8</td><td>33.97</td><td>91.72</td><td>0.60</td><td>0.98</td><td>177.02</td><td>3</td></tr>\n",
"    <tr><th>154242</th><td>1</td><td>5</td><td>56.07</td><td>62.15</td><td>1.39</td><td>1.49</td><td>88.68</td><td>0</td></tr>\n",
"    <tr><th>154243</th><td>1</td><td>10</td><td>143.17</td><td>83.46</td><td>2.78</td><td>2.79</td><td>60.34</td><td>1</td></tr>\n",
"    <tr><th>154244</th><td>1</td><td>7</td><td>42.17</td><td>77.31</td><td>1.38</td><td>1.55</td><td>93.17</td><td>2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>8875 rows × 8 columns</p>\n",
"</div>
" ], "text/plain": [ " Channel NJets MET Mll LepDeltaPhi LepDeltaR SumLepPt \\\n", "145370 0 5 5.76 90.78 1.32 1.40 139.59 \n", "145371 0 4 97.16 91.32 0.99 1.37 175.51 \n", "145372 0 5 25.66 88.16 1.67 1.70 91.10 \n", "145373 0 5 64.06 91.00 2.69 2.71 22.76 \n", "145374 0 5 52.33 75.13 0.10 1.12 152.97 \n", "... ... ... ... ... ... ... ... \n", "154240 1 4 14.78 65.87 0.66 0.67 196.76 \n", "154241 1 8 33.97 91.72 0.60 0.98 177.02 \n", "154242 1 5 56.07 62.15 1.39 1.49 88.68 \n", "154243 1 10 143.17 83.46 2.78 2.79 60.34 \n", "154244 1 7 42.17 77.31 1.38 1.55 93.17 \n", "\n", " NBJets \n", "145370 1 \n", "145371 1 \n", "145372 1 \n", "145373 2 \n", "145374 1 \n", "... ... \n", "154240 2 \n", "154241 3 \n", "154242 0 \n", "154243 1 \n", "154244 2 \n", "\n", "[8875 rows x 8 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_data = pd.read_csv(csv_path) # read all data\n", "signal_df = all_data[all_data['type']==1].drop(['type','weight'], axis=1) # get signal dataframe\n", "background_df = all_data[(all_data['type']!=0) & (all_data['type']!=1)].drop(['type','weight'], axis=1) # background dataframe\n", "signal_df # print the dataframe to take a look" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine learning\n", "\n", "Organise data ready for BDT" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# for sklearn data is usually organised \n", "# into one 2D array of shape (n_samples x n_features) \n", "# containing all the data and one array of categories \n", "# of length n_samples \n", "\n", "X = np.concatenate([signal_df.values, background_df.values]) # concatenate the list of MC dataframes into a single 2D array of features, called X\n", "y = np.concatenate([np.ones(signal_df.shape[0]), 
np.zeros(background_df.shape[0])]) # concatenate the signal (1) and background (0) labels into a single 1D array, called y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Training and Testing split\n", "One of the first things to do is split your data into a training set and a testing set. By default, train_test_split puts 75% of the events in the training set and 25% in the testing set. It also shuffles the entries, so you will not get the first 75% of X for training and the last 25% for testing. This is particularly important here, because X contains all signal events first, followed by all background events.\n", "\n", "The two samples are independent: the first is used to train the classifier and the second to evaluate its performance.\n", "\n", "We don't want to test on events that we trained on. A classifier that has overfitted to the training events would look good on those events but perform much worse on any *new* data it sees, and only a held-out testing set exposes this." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# make train and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", "                                                    test_size=0.25, # hold out 25% of events for testing (the default)\n", "                                                    random_state=492) # fixed seed so the split is reproducible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training your machine learning algorithm\n", "\n", "We'll use scikit-learn (sklearn) in this tutorial. Other possible tools include Keras and PyTorch. \n", "\n", "After instantiating our GradientBoostingClassifier, call the fit() method with the training sample as an argument. 
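As a minimal sketch of this step, the snippet below instantiates a GradientBoostingClassifier with default hyper-parameters and fits it to a training sample. The toy Gaussian dataset and every toy_ name are assumptions made for illustration; they stand in for the X_train and y_train arrays built earlier and are not part of the ATLAS data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y (assumption: 2 features, two Gaussian blobs)
rng = np.random.default_rng(492)
toy_signal = rng.normal(loc=1.0, scale=1.0, size=(500, 2))
toy_background = rng.normal(loc=-1.0, scale=1.0, size=(500, 2))
X_toy = np.concatenate([toy_signal, toy_background])   # features, signal first
y_toy = np.concatenate([np.ones(500), np.zeros(500)])  # labels: 1 = signal, 0 = background

# Default split: 75% training, 25% testing, shuffled
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=492)

toy_bdt = GradientBoostingClassifier()   # instantiate with default hyper-parameters
toy_bdt.fit(X_toy_train, y_toy_train)    # train on the training sample only

# Fraction of held-out events classified correctly
toy_accuracy = toy_bdt.score(X_toy_test, y_toy_test)
print('held-out accuracy:', round(toy_accuracy, 3))
```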
Calling fit() trains the boosted decision trees (BDT), after which we are ready to evaluate the performance on the held-out testing set.\n", "\n", "A useful plot for judging the performance of a classifier is the Receiver Operating Characteristic (ROC) curve, which shows the true positive rate against the false positive rate as the threshold on the classifier output is varied." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.ensemble import GradientBoostingClassifier # BoostType\n", "from sklearn.metrics import roc_curve, auc\n", "\n", "for learning_rate_i in [0.01,0.1,1]:\n", " bdt = GradientBoostingClassifier(learning_rate=learning_rate_i)\n", " bdt.fit(X_train, y_train) # fit BDT to training set\n", "\n", " decisions = bdt.decision_function(X_test).ravel() # get probabilities on test set\n", "\n", " # Compute ROC curve and area under the curve\n", " fpr, tpr, _ = roc_curve(y_test, # actual\n", " decisions ) # predicted\n", "\n", " plt.plot(fpr, tpr, label='learning rate '+str(learning_rate_i)) # plot test ROC curve\n", "\n", " plt.xlabel('False Positive Rate') # x-axis label\n", " plt.ylabel('True Positive Rate') # y-axis label\n", " plt.legend() # add legend" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the ROC curve changes with each different learning rate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting everything into a machine learning algorithm means we only have 1 variable to optimise. The signal and background distributions are separated much better when looking at machine learning output, compared to individual variables. Using machine learning algorithms also achieves much higher S/B values than on individual variables.\n", "\n", "machine learning algorithm can achieve better S/B ratios because they find correlations in many dimensions that will give better signal/background classification.\n", "\n", "Hopefully you've enjoyed this discussion on using machine learning algorithms to select for signal to background." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going further\n", "\n", "If you want to go further, there are a number of things you could try: \n", "\n", "* **Modify some BDT hyper-parameters** in '[Training your machine learning algorithm](#MVA_training)'. Cell -> Run All Below. You may find the [sklearn documentation on GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) helpful.\n", "* **Try other machine learning algorithms** in '[Training your machine learning algorithm](#MVA_training)'. Cell -> Run All Below. You may find [sklearn documentation on supervised learning](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) helpful.\n", "\n", "With each change, keep an eye on the ROC curve." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Back to contents](#contents)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }