{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "NanoAOD files processed with Distributed RDataFrame in Python and Spark (PySpark)
\n", "\n", "Python version of the [`df102_NanoAODDimuonAnalysis`](https://root.cern.ch/doc/master/df102__NanoAODDimuonAnalysis_8C.html) ROOT tutorial running with PyRDF.\n", "\n", "The NanoAOD-like input files contain 66 million events with muon candidates from CMS OpenData, part of the 2012 dataset ([DOI: 10.7483/OPENDATA.CMS.YLIC.86ZZ](http://opendata.cern.ch/record/6004) and [DOI: 10.7483/OPENDATA.CMS.M5AD.Y3V3](http://opendata.cern.ch/record/6030)).\n", "\n", "The macro matches muon pairs and produces a histogram of the dimuon mass spectrum showing resonances up to the Z mass. Note that the bump at 30 GeV is not a resonance but a trigger effect.\n", "\n", "Some more details about the dataset:\n", "- It contains about 66 million events (muon and electron collections, plus some other information, e.g. about primary vertices).\n", "- It spans two compressed ROOT files located on EOS, with a total size of about 7.5 GB.\n", "\n", "***Date: April 2019***
\n", "***Author: Stefan Wunsch (KIT, CERN)***
\n", "***Adapted to PyRDF and Spark: Javier Cervantes Villanueva (CERN), Prasanth Kothuri (CERN)***\n", "\n", "To run this notebook we used the following configuration:\n", "\n", "* *Software stack*: Bleeding edge (includes Spark 2.4.3)\n", "* *Platform*: CentOS 7 (gcc7)\n", "* *Spark cluster*: Cloud Containers (K8s)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Welcome to JupyROOT 6.17/01\n" ] } ], "source": [ "import PyRDF\n", "import ROOT\n", "\n", "# Configure PyRDF to run on Spark, splitting the dataset into 32 partitions (defines parallelism)\n", "PyRDF.use(\"spark\", {'npartitions': '32'})\n", "\n", "# Create dataframe from NanoAOD files\n", "files = [\n", " \"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root\",\n", " \"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012C_DoubleMuParked.root\"\n", "]\n", "\n", "df = PyRDF.RDataFrame(\"Events\", files);" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# For simplicity, select only events with exactly two muons and require opposite charge\n", "df_2mu = df.Filter(\"nMuon == 2\", \"Events with exactly two muons\");\n", "df_os = df_2mu.Filter(\"Muon_charge[0] != Muon_charge[1]\", \"Muons with opposite charge\");" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Compute invariant mass of the dimuon system\n", "# Uses InvariantMass, provided by RVec (see the reference for other RVec helper functions)\n", "df_mass = df_os.Define(\"Dimuon_mass\", \"InvariantMass(Muon_pt, Muon_eta, Muon_phi, Muon_mass)\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Make histogram of dimuon mass spectrum\n", "h = df_mass.Histo1D((\"Dimuon_mass\", \"Dimuon_mass\", 30000, 0.25, 300), \"Dimuon_mass\")" ] }, { "cell_type": "code",
"execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Report is not supported yet in the Spark backend\n", "# report = df_mass3.Report()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Info in : pdf file dimuon_spectrum.pdf has been created\n" ] } ], "source": [ "# Produce pdf of the plot\n", "ROOT.gStyle.SetOptStat(0)\n", "ROOT.gStyle.SetTextFont(42)\n", "\n", "c = ROOT.TCanvas(\"c\", \"\", 800, 700);\n", "c.SetLogx(); c.SetLogy();\n", "h.SetTitle(\"\");\n", "h.GetXaxis().SetTitle(\"m_{#mu#mu} (GeV)\"); h.GetXaxis().SetTitleSize(0.04);\n", "h.GetYaxis().SetTitle(\"N_{Events}\"); h.GetYaxis().SetTitleSize(0.04);\n", "h.Draw();\n", "\n", "label = ROOT.TLatex()\n", "label.SetNDC(True);\n", "\n", "label.DrawLatex(0.175, 0.740, \"#eta\");\n", "label.DrawLatex(0.205, 0.775, \"#rho,#omega\");\n", "label.DrawLatex(0.270, 0.740, \"#phi\");\n", "label.DrawLatex(0.400, 0.800, \"J/#psi\");\n", "label.DrawLatex(0.415, 0.670, \"#psi'\");\n", "label.DrawLatex(0.485, 0.700, \"Y(1,2,3S)\");\n", "label.DrawLatex(0.755, 0.680, \"Z\");\n", "label.SetTextSize(0.040); label.DrawLatex(0.100, 0.920, \"#bf{CMS Open Data}\");\n", "label.SetTextSize(0.030); label.DrawLatex(0.630, 0.920, \"#sqrt{s} = 8 TeV, L_{int} = 11.6 fb^{-1}\");\n", "\n", "c.SaveAs(\"dimuon_spectrum.pdf\");\n", "# Report is not supported yet in the Spark backend\n", "#report.Print();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous cell computed the result and saved it as a pdf image. Now we can visualize the plot online without triggering the event loop again, since the results have been cached:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning in : Deleting canvas with same name: c2\n" ] }, { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%jsroot on\n", "\n", "ROOT.gStyle.SetTextFont(42)\n", "\n", "c2 = ROOT.TCanvas(\"c2\", \"\", 800, 700);\n", "c2.SetLogx(); c2.SetLogy();\n", "\n", "h.SetTitle(\"\");\n", "h.GetXaxis().SetTitle(\"m_{#mu#mu} (GeV)\"); h.GetXaxis().SetTitleSize(0.04);\n", "h.GetYaxis().SetTitle(\"N_{Events}\"); h.GetYaxis().SetTitleSize(0.04);\n", "h.SetStats(False)\n", "h.Draw();\n", "\n", "label = ROOT.TLatex()\n", "label.SetNDC(True);\n", "\n", "label.DrawLatex(0.175, 0.740, \"#eta\");\n", "label.DrawLatex(0.205, 0.775, \"#rho,#omega\");\n", "label.DrawLatex(0.270, 0.740, \"#phi\");\n", "label.DrawLatex(0.400, 0.800, \"J/#psi\");\n", "label.DrawLatex(0.415, 0.670, \"#psi'\");\n", "label.DrawLatex(0.485, 0.700, \"Y(1,2,3S)\");\n", "label.DrawLatex(0.755, 0.680, \"Z\");\n", "label.SetTextSize(0.040); label.DrawLatex(0.100, 0.920, \"#bf{CMS Open Data}\");\n", "label.SetTextSize(0.030); label.DrawLatex(0.630, 0.920, \"#sqrt{s} = 8 TeV, L_{int} = 11.6 fb^{-1}\");\n", "\n", "label.DrawClone(\"Same\")\n", "c2.Draw()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" }, "sparkconnect": { "bundled_options": [ "RDataFrame" ], "list_of_options": [] } }, "nbformat": 4, "nbformat_minor": 2 }