{ "cells": [ { "cell_type": "markdown", "id": "cd70edc7", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Save dataset to ROOT file after processing\n", "\n", "With RDataFrame, you can read your dataset, add new columns with processed values and finally use `Snapshot` to save the resulting data to a ROOT file in TTree format." ] }, { "cell_type": "code", "execution_count": null, "id": "308c56f0", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import ROOT\n", "\n", "df = ROOT.RDataFrame(\"dataset\",\"data/example_file.root\")\n", "df1 = df.Define(\"c\",\"a+b\")\n", "\n", "out_treename = \"outtree\"\n", "out_filename = \"outtree.root\"\n", "out_columns = [\"a\",\"b\",\"c\"]\n", "snapdf = df1.Snapshot(out_treename, out_filename, out_columns)" ] }, { "cell_type": "markdown", "id": "ecaaed15", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can now check that the dataset was correctly stored in a file:" ] }, { "cell_type": "code", "execution_count": null, "id": "7ca9de7b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%%bash\n", "rootls -lt outtree.root" ] }, { "cell_type": "markdown", "id": "55b7bc7f", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The result of a `Snapshot` is itself an RDataFrame that can be used for further processing:" ] }, { "cell_type": "code", "execution_count": null, "id": "23f46a0b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "snapdf.Display().Print()" ] }, { "cell_type": "markdown", "id": "d98928de", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Cutflow reports\n", "Filters applied to the dataset can be given a name. 
The `Report` method gathers the efficiency of each named filter and shows how many entries survive each successive cut on the original dataset.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d7610f52", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "df = ROOT.RDataFrame(\"sig_tree\", \"https://root.cern/files/Higgs_data.root\")\n", "\n", "filter1 = df.Filter(\"lepton_eta > 0\", \"Lepton eta cut\")\n", "filter2 = filter1.Filter(\"lepton_phi < 1\", \"Lepton phi cut\")\n", "\n", "rep = df.Report()\n", "rep.Print()" ] }, { "cell_type": "markdown", "id": "f3be5b9d", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Using C++ functions in Python\n", "- We still want to perform complex operations from Python, but plain Python code tends to be slow and is not thread-safe. \n", "\n", "- Instead, you can inject C++ functions that will do the work in your event loop at runtime. \n", "\n", "- This mechanism uses the C++ interpreter Cling shipped with ROOT, making this possible in a single line of code. \n", "\n", "- Let's start by defining a function that converts the RDataFrame entry numbers (stored in the special column `rdfentry_`) from `unsigned long long` to `float`." ] }, { "cell_type": "code", "execution_count": null, "id": "9a1bcee4", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%%cpp\n", "\n", "float asfloat(unsigned long long entrynumber){\n", " return entrynumber;\n", "}" ] }, { "cell_type": "markdown", "id": "1b8f4bd1", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then let's define another function that takes a `float` value and computes its square." 
] }, { "cell_type": "code", "execution_count": null, "id": "6d3a8b4f", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%%cpp\n", "\n", "float square(float val){\n", " return val * val;\n", "}" ] }, { "cell_type": "markdown", "id": "90522e44", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And now let's use these functions with RDataFrame! \n", "\n", "We start by creating an empty RDataFrame with 100 consecutive entries and defining new columns on it:" ] }, { "cell_type": "code", "execution_count": null, "id": "0edd70d3", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Create a new RDataFrame from scratch with 100 consecutive entries\n", "df = ROOT.RDataFrame(100)\n", "\n", "# Create a new column using the previously declared C++ functions\n", "df1 = df.Define(\"a\", \"asfloat(rdfentry_)\")\n", "df2 = df1.Define(\"b\", \"square(a)\")" ] }, { "cell_type": "markdown", "id": "5b1005d7", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can now plot the values of the columns in a graph:" ] }, { "cell_type": "code", "execution_count": null, "id": "18a35cd0", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Show the two columns created in a graph\n", "c = ROOT.TCanvas()\n", "graph = df2.Graph(\"a\",\"b\")\n", "graph.SetMarkerStyle(20)\n", "graph.SetMarkerSize(0.5)\n", "graph.SetMarkerColor(ROOT.kBlue)\n", "graph.SetTitle(\"My graph\")\n", "graph.Draw(\"AP\")\n", "c.Draw()" ] }, { "cell_type": "markdown", "id": "072ae85d", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Using all cores of your machine with multi-threaded RDataFrame\n", "- RDataFrame can transparently perform multi-threaded event loops to speed up the execution of its actions. 
\n", "\n", "- Users have to call `ROOT::EnableImplicitMT()` before constructing the RDataFrame object to indicate that it should take advantage of a pool of worker threads. \n", "\n", "- Each worker thread processes a distinct subset of entries, and their partial results are merged before returning the final values to the user.\n", "\n", "- RDataFrame operations such as Histo1D or Snapshot are guaranteed to work correctly in multi-thread event loops. \n", "\n", "- User-defined expressions, such as strings or lambdas passed to `Filter`, `Define`, `Foreach`, `Reduce` or `Aggregate` will have to be thread-safe, i.e. it should be possible to call them concurrently from different threads." ] }, { "cell_type": "code", "execution_count": null, "id": "f2d4528b", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%%time\n", "# Get a first baseline measurement\n", "\n", "treename = \"Events\"\n", "filename = \"root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root\"\n", "df = ROOT.RDataFrame(treename, filename)\n", "\n", "df.Sum(\"nMuon\").GetValue()" ] }, { "cell_type": "code", "execution_count": null, "id": "ec2afbc4", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%%time\n", "# Activate multithreading capabilities\n", "# By default takes all available cores on the machine\n", "ROOT.EnableImplicitMT()\n", "\n", "treename = \"Events\"\n", "filename = \"root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root\"\n", "df = ROOT.RDataFrame(treename, filename)\n", "\n", "df.Sum(\"nMuon\").GetValue()\n", "\n", "# Disable implicit multithreading when done\n", "ROOT.DisableImplicitMT()" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 
3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }