{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# H --> tau tau analysis with CMS open data and ADL/CutLang\n",
    "\n",
    "This is an exercise showing a simple analysis exploring the SM Higgs decays to tau tau in the tau mu final state: [https://opendata.web.cern.ch/record/12350](https://opendata.web.cern.ch/record/12350).  The analysis aims to explore the kinematics of H --> tau tau and compare them with that of a prominent SM background process with the same final state.\n",
    "  * The signal events consist of H -> tau tau events: [https://opendata.web.cern.ch/record/12351](https://opendata.web.cern.ch/record/12351). \n",
    "  * The background events consist of Z(-->tau tau)+jets events: [https://opendata.web.cern.ch/record/12353](https://opendata.web.cern.ch/record/12353)\n",
    "\n",
    "The analysis consists of two parts:\n",
    "1. Applying some event selection to the input events and making distributions.  This part is performed using a special language called ADL, and via a software called CutLang that can read and process ADL.\n",
    "2. Drawing plots produced by the previous step.  This part is performed using ROOT (with Python syntax).  ROOT is the main analysis software used at CERN.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget --progress=dot:giga http://opendata.cern.ch/record/12351/files/GluGluToHToTauTau.root\n",
    "//Get the ROOT file containing the H -> tautau signal events.\n",
    "//CMS OD record: https://opendata.web.cern.ch/record/12351"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget --progress=dot:giga https://www.dropbox.com/s/mcvb4mis1qhcjvy/DYJetsToLLsubset.root\n",
    "// Get the ROOT file containing the Z -> tau tau background events:\n",
    "//CMS OD record: https://opendata.web.cern.ch/record/12353"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What are ADL and CutLang?\n",
    "\n",
    "(More information on [cern.ch/adl](cern.ch/adl))\n",
    "\n",
    "LHC data analyses are usually performed using complex analysis frameworks written in general purpose languages like C++ and python. \n",
    "But this method has a steep learning curve, as even the simplest tasks could be coded in a complicated way, and it is not straightforward to understand the code, make changes or additions.\n",
    "However there is another emerging alternative which allows to decouple physics content from the technical code and write analyses with a simple, self-describing syntax.  **Analysis Description Language (ADL)** is a HEP-specific analysis language developed with this purpose. \n",
    "\n",
    "A HEP analysis includes 3 main parts:\n",
    "  * Object definitions: Which objects are used? e.g. electrons? muons? jets?  What are the selections applied on these?\n",
    "  * Event variable definitions: Are there event-wide variables used such as a invariant mass or a transverse mass?  How are they calculated?\n",
    "  * Event selections: What selections do we apply on events, for example, to enhance the signal and reduce the backgrounds?  Are there more than one event selections?  How are they defined?\n",
    "\n",
    "ADL consists of blocks separating object, variable and event selection definitions for a clear separation of analysis components. Blocks have a keyword-expression structure. Keywords specify analysis concepts and operations.  Syntax includes mathematical and logical operations, comparison and optimization operators, reducers, 4-vector algebra and HEP-specific functions (dφ, dR, …).\n",
    "\n",
    "ADL is designed with the goal to be self-describing, so especially for simple cases like in this example, one does not need to read syntax rules to understand an ADL description.  However if you are interested, the set of syntax rules can be found [here]( https://twiki.cern.ch/twiki/bin/view/LHCPhysics/ADL).\n",
    "\n",
    "Once an analysis is written it needs to be run on events.  This is achieved by **CutLang** , the runtime interpreter who reads and understands the ADL syntax and runs it on events.  CutLang is also a framework which aturomatically handles many tedious tasks as reading input events, writing output histograms, etc.  CutLang can be run on various environments such as linux, mac, conda, docker, jupyter, etc.  \n",
    "\n",
    "In case you are interested to learn more on CutLang, please see the [CutLang github](https://github.com/unelg/CutLang)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing the analysis with ADL and running with CutLang\n",
    "\n",
    "**Writing the analysis with ADL:** In the following cell, part of the analysis is written using the ADL syntax.  However there are some parts missing. Please follow the instructions in the comments to complete the missing parts.  If you feel adventurous, you could modify the object or event selections, add new variables or new histograms.\n",
    "\n",
    "**Running the analysis with CutLang:** Executing the cell will run the analysis on both the signal (SMHiggsToZZTo4L.root) and background (ZZTo2e2mu.root) events.  The run parameters are given in the first line of the cell:\n",
    "  * file : input root file\n",
    "  * filetype : input event format (do not change!)\n",
    "  * adlfile : the name we use for labeling the analysis \n",
    "  * events : number of events used from each file\n",
    "  * verbose : frequency of processed event numbers written in output text\n",
    "\n",
    "NOTE: When running jupyter/binder via direct link, if your run does not complete due to memory issues, please reduce the number of events via the \"events\" parameter.\n",
    "\n",
    "**Analysis output:** Running the analysis will produce two outputs:\n",
    "  * Text output shown cell output: This includes \"cutflows\" for each region, i.e. the selections applied and how many events survive the various selections.  Histograms are also listed.  You should see a separate output for each ROOT file that is run.\n",
    "  * ROOT output: One ROOT file called histoOut-\\<adlfile name\\>-\\<file name\\>.root that includes all the histograms produced by the analysis.  These ROOT files will be used in the next step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ADL description for the H -> tau tau analysis\n",
    "\n",
    "Try to understand the object selections, Higgs reconstruction algorithm and event selections in the analysis.  Check the resulting cutflows in both baseline and twojets selections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%cutlang file=GluGluToHToTauTau.root;DYJetsToLLsubset.root filetype=CMSNANO adlfile=Htautau events=200000;250000 verbose=20000\n",
    "\n",
    "# CMS Open Data outreach analysis of H -> tau tau using CMS 2012 data and MC\n",
    "# Link to open data record: http://opendata.web.cern.ch/record/12350\n",
    "# Can be used with the reduced NanoAOD ntuples listed in the open data record, e.g. http://opendata.cern.ch/record/12351\n",
    "# Runs with CutLang: https://github.com/unelg/CutLang\n",
    "info analysis\n",
    "  title \"CMS Open Data Outreach analysis: Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012\"\n",
    "  experiment CMS\n",
    "  id \"Open Data record: 12350\"\n",
    "  publication \"J. High Energy Phys. 05 (2014) 104\"\n",
    "  sqrtS 7 # TeV, 8.TeV\"\n",
    "  lumi 4.9 # fb-1\n",
    "  arXiv \"1401.5041\"\n",
    "  hepdata \"https://www.hepdata.net/record/ins1749379\"\n",
    "  doi \"10.1007/JHEP05(2014)104\"\n",
    "\n",
    "# OBJECT SELECTIONS\n",
    "object goodMuons\n",
    "  take Muon\n",
    "  select abs(Eta(Muon)) < 2.1\n",
    "  select pt(Muon) > 17\n",
    "  select tightId(Muon) == 1.0\n",
    "\n",
    "object goodTaus\n",
    "  take Tau\n",
    "  select q(Tau) != 0\n",
    "  select abs(eta(Tau)) < 2.3\n",
    "  select pt(Tau) > 20\n",
    "  select idDecayMode(Tau) == 1\n",
    "  select idIsoTight(Tau) == 1\n",
    "  select idAntiEleTight(Tau) == 1\n",
    "  select idAntiMuTight(Tau) == 1\n",
    "\n",
    "object goodJets\n",
    "  take Jet\n",
    "  select puId(Jet) > 0\n",
    "  select abs(eta(Jet)) < 4.7\n",
    "  select pt(Jet) > 10\n",
    "\n",
    "# Find the best mu-tau pair that reconstructs the best Higgs candidate.\n",
    "# We use negative indices -1 and -2 for muon and tau, because we don't initially know which mu and tau to select from the collections/\n",
    "# Loop over all muon-tau pairs:\n",
    "object HMuTaus : comb( goodMuons[-1] goodTaus[-2] ) alias aHMuTaus\n",
    "  # Select all mu-tau pairs with dR(mu,tau) > 0.5:\n",
    "  select dR( goodMuons[-1], goodTaus[-2] ) > 0.5\n",
    "  # Select the mu-tau pair which has the mu with maximum pT: \n",
    "  # pT(goodMuons) indicates a loop over goodMuons pTs.\n",
    "  select max(pT(goodMuons)) - pT(goodMuons[-1]) ~= 0\n",
    "  # Among the surviving pairs, select the mu-tau pair which has the tau with minimum relative isolation:\n",
    "  select min(relIsoall(goodTaus)) - relIsoall(goodTaus[-2]) ~= 0\n",
    "\n",
    "# EVENT VARIABLES\n",
    "# We can write the definitions of event variables here and use the variable names directly in the selection regions below.\n",
    "define bestMu = daughters(HMuTaus[0], goodMuons[-1])\n",
    "define bestTau = daughters(HMuTaus[0],  goodTaus[-2])\n",
    "define mutau = HMuTaus[0]\n",
    "define dRmutau = dR(bestMu, bestTau)\n",
    "define MTtau = Sqrt( 2*pT(bestTau) * MET*(1-cos(phi(METLV[0]) - phi(bestTau) )))\n",
    "define MTmu = Sqrt( 2*pT(bestMu) * MET*(1-cos(phi(METLV[0]) - phi(bestMu) )))\n",
    "define jj = goodJets[0] + goodJets[1]\n",
    "define mjj = m(jj)\n",
    "define pTjj = pT(jj)\n",
    "define detajj = dEta(goodJets[0], goodJets[1])\n",
    "\n",
    "# EVENT SELECTIONS\n",
    "region baseline\n",
    "  select ALL\n",
    "  select HLT_IsoMu17_eta2p1_LooseIsoPFTau20 == 1\n",
    "  select size(goodMuons) > 0\n",
    "  select size(goodTaus) > 0\n",
    "  select size(HMuTaus) > 0\n",
    "  # Histograms in the baseline region:\n",
    "  histo hmutaum , \"mu+tau mass (GeV)\", 30, 20, 140, m(mutau)\n",
    "  histo hmutaupt , \"mu+tau pT (GeV)\", 30, 0, 60, pT(mutau)\n",
    "  histo hmupt , \"muon pT (GeV)\", 30, 17, 70, pT(bestMu)\n",
    "  histo htaupt , \"tau pT (GeV)\", 30, 17, 70, pT(bestTau)\n",
    "  histo hmueta , \"muon eta\", 30, -2.1, 2.1, eta(bestMu)\n",
    "  histo htaueta , \"tau eta\", 30, -2.3, 2.3, eta(bestTau)\n",
    "  histo hmuphi , \"muon phi\", 30, -3.14, 3.14, eta(bestMu)\n",
    "  histo htauphi , \"tau phi\", 30, -3.14, 3.14, eta(bestTau)\n",
    "  histo hmuiso , \"muon iso\", 30, -3.14, 3.14, pfRelIso03all(bestMu)\n",
    "  histo htauiso , \"tau iso\", 30, -3.14, 3.14, relIsoall(bestTau)\n",
    "  histo hmuq , \"muon charge\", 2, -2, 2, q(bestMu)\n",
    "  histo htauq , \"tau charge\", 2, -2, 2, q(bestTau)\n",
    "  histo hmet, \"MET (GeV)\", 30, 0, 60, MET\n",
    "  histo hmetphi, \"MET phi\", 30, -3.14, 3.14, phi(METLV[0])\n",
    "  histo hmumass , \"muon mass (GeV)\", 30, 0.0, 0.2, m(bestMu)\n",
    "  histo htaumass , \"tau mass (GeV)\", 30, 0.0, 2.0, m(bestTau)\n",
    "  histo hmuMT , \"muon+MET transverse mass (GeV)\", 30, 0.0, 100, MTmu\n",
    "  histo htauMT , \"tau+MET transverse mass (GeV)\", 30, 0.0, 100, MTtau\n",
    "  histo htaudecaymode , \"tau decay mode\", 11, 0, 11, decayMode(bestTau)\n",
    "    \n",
    "region twojets\n",
    "  baseline\n",
    "  select size(goodJets) >= 2\n",
    "  # histograms in the twojets region\n",
    "  histo hj1pt , \"jet1 pT (GeV)\", 30, 30, 70, pT(goodJets[0])\n",
    "  histo hj2pt , \"jet1 pT (GeV)\", 30, 30, 70, pT(goodJets[1])\n",
    "  histo hj1eta , \"jet1 eta\", 30, -4.7, 4.7, eta(goodJets[0])\n",
    "  histo hj2eta , \"jet2 eta\", 30, -4.7, 4.7, eta(goodJets[1])\n",
    "  histo hj1phi , \"jet1 phi\", 30, -3.14, 3.14, phi(goodJets[0])\n",
    "  histo hj2phi , \"jet2 phi\", 30, -3.14, 3.14, phi(goodJets[1])\n",
    "  histo hj1m , \"jet1 mass (GeV)\", 30, 30, 70, m(goodJets[0])\n",
    "  histo hj2m , \"jet2 mass (GeV)\", 30, 30, 70, m(goodJets[1])\n",
    "  histo hj1btag , \"jet1 btag\", 30, 0, 1, bTag(goodJets[0])\n",
    "  histo hj2btag , \"jet2 btag\", 30, 0, 1, bTag(goodJets[1])\n",
    "  histo hnjets , \"number of jets\", 5, 0, 5, size(goodJets)\n",
    "  histo hmjj , \"jj inv. mass (GeV)\", 30, 0, 400, mjj\n",
    "  histo hptjj , \"jj pT (GeV)\", 30, 0, 200, pTjj\n",
    "  histo hjdeta , \"deta j1, j2\", 30, -9.4, 9.4, detajj\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Checking the analysis output with ROOT\n",
    "\n",
    "Now let's make some plots using the ROOT package in python (which is widely used at CERN).\n",
    "Instructions are shown within comments in the following cells.\n",
    "\n",
    "What to do:\n",
    "  * Compare some of the histograms you made for signal and background in the baseline and twojets regions.  \n",
    "    * Which variables show the most discrimination between the signal and the background?\n",
    "    * How is the discrimination in different regions?\n",
    "  * Note that in this exercise we only compare the shapes of the distributions (i.e.: integral of each histogram equals to 1).  In real life, events scale with cross section and luminosity, and there are usually a large number of background events.  Therefore event selections are designed to eliminate as many background events as possible.  Based on what you see in the histograms, if you wanted to eliminate more background events, how would you improve the event selection?  You could try to add these cuts to the ADL description, rerun and check the event numbers in the cutflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%python\n",
    "# Let's start with importing the needed modules\n",
    "from ROOT import gStyle, TFile, TH1, TH1D, TH2D, TCanvas, TLegend, TColor\n",
    "\n",
    "# Now let's set some ROOT styling parameters:\n",
    "# You do not need to know what they mean, but can directly use these settings\n",
    "\n",
    "gStyle.SetOptStat(0)\n",
    "gStyle.SetPalette(1)\n",
    "\n",
    "gStyle.SetTextFont(42)\n",
    "\n",
    "gStyle.SetTitleStyle(0000)\n",
    "gStyle.SetTitleBorderSize(0)\n",
    "gStyle.SetTitleFont(42)\n",
    "gStyle.SetTitleFontSize(0.055)\n",
    "\n",
    "gStyle.SetTitleFont(42, \"xyz\")\n",
    "gStyle.SetTitleSize(0.5, \"xyz\")\n",
    "gStyle.SetLabelFont(42, \"xyz\")\n",
    "gStyle.SetLabelSize(0.45, \"xyz\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%python\n",
    "\n",
    "# Let's open the signal (Htautau) and background (Ztautau) files produced by CutLang: \n",
    "fs = TFile(\"histoOut-Htautau-GluGluToHToTauTau.root\")\n",
    "fb = TFile(\"histoOut-Htautau-DYJetsToLLsubset.root\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%python \n",
    "# We can see what is inside the signal file:\n",
    "fs.ls()\n",
    "# There should be a directory (TDirectoryFile) per selection region, e.g. for \"baseline\" and \"twojets\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%python\n",
    "# Let's check out what is inside \"baseline\":\n",
    "fs.cd(\"baseline\")\n",
    "fs.ls()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%python\n",
    "# Now let's draw some histograms. \n",
    "# We will compare signal and background distributions for different variables.\n",
    "# You can try this with different histograms and different regions.\n",
    "# Which histogram would you like to draw? You can change the histogram name.\n",
    "hname = \"hmutaum\"\n",
    "# In which region would you like to draw? You can change the region name. \n",
    "region = \"baseline\"\n",
    "# Get the histograms from the file:\n",
    "hsg = fs.Get(region+\"/\"+hname)\n",
    "hbg = fb.Get(region+\"/\"+hname)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%python\n",
    "\n",
    "# This cell formats the histograms: scaling, lines, colors, axes titles, etc..  \n",
    "# You do not need to learn the commands here unless you are really curious.\n",
    "# Otherwise just execute the cell.\n",
    "\n",
    "# Our purpose in this exercise is to compare the shapes of signal and background distributions.\n",
    "# To do this comparison best, the area integral under histograms being compared should be the same.\n",
    "# Therefore we scale the hisgograms so that the area integral under the histograms equals 1. \n",
    "hsg.Scale(1./hsg.Integral())\n",
    "hbg.Scale(1./hbg.Integral())\n",
    "if hsg.GetMaximum() > hbg.GetMaximum(): \n",
    "    hbg.SetMaximum(hsg.GetMaximum()*1.1)\n",
    "    \n",
    "# Histogram style settings:\n",
    "hsg.SetLineWidth(2)\n",
    "hbg.SetLineWidth(2)\n",
    "\n",
    "# Set the colors:\n",
    "# Color numbers can be retrived from https://root.cern.ch/doc/master/classTColor.html\n",
    "# (check for color wheel)\n",
    "hbg.SetFillColor(400-7) # kYellow - 7\n",
    "hsg.SetLineColor(600) # kBlue\n",
    "hbg.SetLineColor(400+2) # kYellow + 2\n",
    "        \n",
    "# Titles, labels.  \n",
    "# It is enough to set such variables ONLY FOR THE FIRST HISTOGRAM YOU WILL DRAW\n",
    "# i.e., the one you will call by .Draw().  The rest you will draw by .Draw(\"same\") will only \n",
    "# contribute with the historam curve.\n",
    "#hbg.SetTitle(\"\")\n",
    "\n",
    "hbg.SetTitle(\"\")\n",
    "hbg.GetXaxis().SetTitle(hsg.GetTitle())\n",
    "hbg.GetXaxis().SetTitleOffset(1.25)\n",
    "hbg.GetXaxis().SetTitleSize(0.05)\n",
    "hbg.GetXaxis().SetLabelSize(0.045)\n",
    "hbg.GetXaxis().SetNdivisions(8, 5, 0)\n",
    "hbg.GetYaxis().SetTitle(\"number of events\")\n",
    "hbg.GetYaxis().SetTitleOffset(1.4)\n",
    "hbg.GetYaxis().SetTitleSize(0.05)\n",
    "hbg.GetYaxis().SetLabelSize(0.045)\n",
    "    \n",
    "# Make a generically usable plot legend\n",
    "l = TLegend(0.65, 0.75, 0.88, 0.87)\n",
    "l.SetBorderSize(0)\n",
    "l.SetFillStyle(0000)\n",
    "l.AddEntry(hsg,\"H->tautau\", \"l\")\n",
    "l.AddEntry(hbg,\"Z->tautau\", \"f\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%python %jsroot on\n",
    "# Now we make a canvas and draw our histograms\n",
    "c = TCanvas(\"c\", \"c\", 620, 500)\n",
    "c.SetBottomMargin(0.15)\n",
    "c.SetLeftMargin(0.15)\n",
    "c.SetRightMargin(0.15)\n",
    "hbg.Draw(\"H\")\n",
    "hbg.Draw(\"Esame\")\n",
    "hsg.Draw(\"Hsame\")\n",
    "hsg.Draw(\"same\")\n",
    "l.Draw(\"same\")\n",
    "c.Draw()\n",
    "# Don't worry about the error that appears below!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ROOT C++ with CutLang",
   "language": "c++",
   "name": "cutlang"
  },
  "language_info": {
   "codemirror_mode": "text/x-c++src",
   "file_extension": ".C",
   "mimetype": " text/x-c++src",
   "name": "c++"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}