NanoAOD files processed with Distributed RDataFrame in Python and Spark (PySpark)
\n",
"\n",
"\n",
"Python version of the [`df102_NanoAODDimuonAnalysis`](https://root.cern.ch/doc/master/df102__NanoAODDimuonAnalysis_8C.html) ROOT tutorial running with PyRDF.\n",
"\n",
"The NanoAOD-like input files are filled with 66 mio. events from CMS OpenData containing muon candidates part of 2012 dataset ([DOI: 10.7483/OPENDATA.CMS.YLIC.86ZZ](http://opendata.cern.ch/record/6004) and [DOI: 10.7483/OPENDATA.CMS.M5AD.Y3V3](http://opendata.cern.ch/record/6030)).\n",
"\n",
"The macro matches muon pairs and produces an histogram of the dimuon mass spectrum showing resonances up to the Z mass. Note that the bump at 30 GeV is not a resonance but a trigger effect.\n",
"\n",
"Some more details about the dataset:\n",
"- It contains about 66 millions events (muon and electron collections, plus some other information, e.g. about primary vertices)\n",
"- It spans two compressed ROOT files located on EOS for about a total size of 7.5 GB.\n",
"\n",
"***Date: April 2019*** \n",
"***Author: Stefan Wunsch (KIT, CERN)*** \n",
"***Adapted to PyRDF and Spark: Javier Cervantes Villanueva (CERN), Prasanth Kothuri (CERN)***\n",
"\n",
"To run this notebook we used the following configuration:\n",
"\n",
"* *Software stack*: Bleeding edge (it has spark 2.4.3)\n",
"* *Platform*: Centos 7 (gcc7)\n",
"* *Spark cluster*: Cloud Containers (K8s)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Welcome to JupyROOT 6.17/01\n"
]
}
],
"source": [
"import PyRDF\n",
"import ROOT\n",
"\n",
"# Configure PyRDF to run on Spark splitting the dataset into 32 partitions (defines parallelism)\n",
"PyRDF.use(\"spark\", {'npartitions': '32'})\n",
"\n",
"# Create dataframe from NanoAOD files\n",
"files = [\n",
" \"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root\",\n",
" \"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012C_DoubleMuParked.root\"\n",
"]\n",
"\n",
"df = PyRDF.RDataFrame(\"Events\", files);"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# For simplicity, select only events with exactly two muons and require opposite charge\n",
"df_2mu = df.Filter(\"nMuon == 2\", \"Events with exactly two muons\");\n",
"df_os = df_2mu.Filter(\"Muon_charge[0] != Muon_charge[1]\", \"Muons with opposite charge\");"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Compute invariant mass of the dimuon system\n",
"# Uses InvariantMass, provided by RVec (see the reference for other RVec helper functions)\n",
"df_mass = df_os.Define(\"Dimuon_mass\", \"InvariantMass(Muon_pt, Muon_eta, Muon_phi, Muon_mass)\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Make histogram of dimuon mass spectrum\n",
"h = df_mass.Histo1D((\"Dimuon_mass\", \"Dimuon_mass\", 30000, 0.25, 300), \"Dimuon_mass\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Report is not supported yet in the Spark backend\n",
"# report = df_mass3.Report()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Info in : pdf file dimuon_spectrum.pdf has been created\n"
]
}
],
"source": [
"# Produce pdf of the plot\n",
"ROOT.gStyle.SetOptStat(0)\n",
"ROOT.gStyle.SetTextFont(42)\n",
"\n",
"c = ROOT.TCanvas(\"c\", \"\", 800, 700);\n",
"c.SetLogx(); c.SetLogy();\n",
"h.SetTitle(\"\");\n",
"h.GetXaxis().SetTitle(\"m_{#mu#mu} (GeV)\"); h.GetXaxis().SetTitleSize(0.04);\n",
"h.GetYaxis().SetTitle(\"N_{Events}\"); h.GetYaxis().SetTitleSize(0.04);\n",
"h.Draw();\n",
"\n",
"label = ROOT.TLatex()\n",
"label.SetNDC(True);\n",
"\n",
"label.DrawLatex(0.175, 0.740, \"#eta\");\n",
"label.DrawLatex(0.205, 0.775, \"#rho,#omega\");\n",
"label.DrawLatex(0.270, 0.740, \"#phi\");\n",
"label.DrawLatex(0.400, 0.800, \"J/#psi\");\n",
"label.DrawLatex(0.415, 0.670, \"#psi'\");\n",
"label.DrawLatex(0.485, 0.700, \"Y(1,2,3S)\");\n",
"label.DrawLatex(0.755, 0.680, \"Z\");\n",
"label.SetTextSize(0.040); label.DrawLatex(0.100, 0.920, \"#bf{CMS Open Data}\");\n",
"label.SetTextSize(0.030); label.DrawLatex(0.630, 0.920, \"#sqrt{s} = 8 TeV, L_{int} = 11.6 fb^{-1}\");\n",
"\n",
"c.SaveAs(\"dimuon_spectrum.pdf\");\n",
"# Report is not supported yet in the Spark backend\n",
"#report.Print();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The previous cell computed the result and saved it as a pdf image. Now we can visualize the plot online without triggering the event loop again, since the results have been cached:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warning in : Deleting canvas with same name: c2\n"
]
},
{
"data": {
"text/html": [
"\n",
"