{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Apache Spark and High Energy Physics Data Analysis\n", "## An example using LHCb open data\n", "\n", "This notebook is an example of how to use Spark to perform a simple analysis using high energy physics data from an LHC experiment. \n", "The exercises, figures and data are from original work developed and published by the **LHCb collaboration** as part of its **opendata** and outreach efforts (see credits below). \n", "**Prerequisites** - This work is intended to be accessible to an audience with some familiarity with data analysis in Python and an interest in particle physics at undergraduate level. \n", "**Technology** - The focus of this notebook is as much on tools and techniques as it is on physics: **Apache Spark** is used for reading and analyzing high energy physics (HEP) data using Python with Pandas and Jupyter notebooks.\n", "\n", "**Credits:**\n", " * The original text of this notebook, including all exercises, analysis, explanations and data, was developed by the LHCb collaboration and is shared in their opendata project at: \n", "    * https://github.com/lhcb/opendata-project\n", "    * http://www.hep.manchester.ac.uk/u/parkes/LHCbAntimatterProjectWeb/LHCb_Matter_Antimatter_Asymmetries/Homepage.html \n", "    * https://cds.cern.ch/record/1994172?ln=en \n", " \n", " * The library for reading physics data stored in the popular [ROOT format](https://en.wikipedia.org/wiki/ROOT) has been developed by [DIANA-HEP](http://diana-hep.org/) and the [CMS Big Data project](https://cms-big-data.github.io/). See also the code repository at: \n", "    * https://github.com/diana-hep/spark-root\n", "\n", " * The Spark code in this notebook has been developed in the context of the CERN Hadoop and Spark service. 
\n", "Contact email: Luca.Canali@cern.ch" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Introduction and setup of the lab environment\n", "\n", "There are a few different ways to run Python and Spark in a notebook environment.\n", "The following are instructions to set up a lab environment on a low-end system (small VM or laptop).\n", "If you have already set up Spark on a local machine or a cluster, you can just start the Jupyter notebook. \n", "**Note for CERN users**: if you are using the [CERN SWAN service](https://swan.web.cern.ch/) (hosted Jupyter notebooks) to run this, you can move on to the next cell.\n", "\n", "### Instructions to get started with Jupyter notebooks and PySpark on a standalone system:\n", "\n", "* Set up the Python environment, for example download and install Anaconda from https://www.continuum.io/downloads\n", "    * version used/tested for this notebook: Anaconda 4.4.0 for Python 2.7\n", "\n", "* Set up Spark\n", "    * simply run `pip install pyspark` \n", "    * as an alternative, download Spark from http://spark.apache.org/downloads.html\n", "    * note: Spark versions used for testing this notebook: Spark 2.2.0 and 2.1.1\n", "\n", "* Start the Jupyter notebook\n", "```bash\n", "jupyter-notebook --ip=`hostname` --no-browser\n", "```\n", "\n", "* Point your browser to the URL as prompted \n", "* Open this notebook " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This starts the Spark Session\n", "# Note: Shift-Enter can be used to run Jupyter cells\n", "# These instructions rely on internet access to download the spark-root package from Maven Central\n", "\n", "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder \\\n", "    .appName(\"LHCb Open Data with Spark\") \\\n", "    .config(\"spark.jars.packages\", \"org.diana-hep:spark-root_2.11:0.1.11\") \\\n", "    .getOrCreate()" ] }, { "cell_type": "code", 
"execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+\n", "|Hello World!|\n", "+------------+\n", "|Hello World!|\n", "+------------+\n", "\n" ] } ], "source": [ "# Test that Spark SQL works\n", "\n", "sql = spark.sql\n", "sql(\"select 'Hello World!'\").show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get the data: download from CERN open data portal\n", "\n", "Download the data for the exercises from the CERN opendata portal:\n", "More info on the CERN opedata initiative at http://opendata.cern.ch/about \n", "**Note for CERN users** using CERN SWAN (hosted notbooks): you don't need to download data (see next cell)\n", "\n", "Simulation data (~2 MB) - you need this file only for the first part of the notebook: working on simulation data \n", "http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root \n", "\n", "Measurement data (~1 GB) - you will need these files for the second part of the notebook: working on real data \n", "http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetDown.root \n", "http://opendata.cern.ch/eos/opendata/lhcb/AntimatterMatters2017/data/B2HHH_MagnetUp.root \n", "\n", "**Notes:** \n", "On Linux you can use [wget](https://www.gnu.org/software/wget/) to download the files \n", "If you run Spark on a standalone system or VM, simply put the data in the local filesystem. \n", "If you are using Spark on a cluster, you should put the data in a cluster filesystem, for example HDFS. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Edit this with the path to the data, with a trainling \"/\" \n", "# see above at \"get the data\" for details on how to download\n", "\n", "# CERN SWAN users can find data already in EOS\n", "data_directory = \"/eos/opendata/lhcb/AntimatterMatters2017/data/\"\n", "\n", "# Uncomment and edit the path for locally downloaded data\n", "# data_directory = \"/home/luca/misc/opendata-project/data/\"\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "heading_collapsed": true }, "source": [ "# Measuring Matter Antimatter Asymmetries at the Large Hadron Collider" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "![](http://lhcb-public.web.cern.ch/lhcb-public/en/LHCb-outreach/multimedia/LHCbDetectorpnglight1.png)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "# Getting Started\n", "\n", "Note: the original text of this exercise in the form relased by LHCb can be found at https://github.com/lhcb/opendata-project\n", "____" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ " Welcome to the first guided LHCb Open Data Portal project! \n", "\n", "
| | B_FlightDistance | B_VertexChi2 | H1_PX | H1_PY | H1_PZ | H1_ProbK | H1_ProbPi | H1_Charge | H1_IPChi2 | H1_isMuon | ... | H2_IPChi2 | H2_isMuon | H3_PX | H3_PY | H3_PZ | H3_ProbK | H3_ProbPi | H3_Charge | H3_IPChi2 | H3_isMuon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 3551.84 | 1636.96 | 23904.14 | 1.0 | 0.0 | -1 | 1.0 | 0 | ... | 1.0 | 0 | 36100.40 | 16546.83 | 295600.61 | 1.0 | 0.0 | -1 | 1.0 | 0 |
| 1 | 0.0 | 1.0 | -2525.98 | -5284.05 | 35822.00 | 1.0 | 0.0 | 1 | 1.0 | 0 | ... | 1.0 | 0 | -8648.32 | -16617.56 | 98535.13 | 1.0 | 0.0 | -1 | 1.0 | 0 |
| 2 | 0.0 | 1.0 | -700.67 | 1299.73 | 8127.76 | 1.0 | 0.0 | -1 | 1.0 | 0 | ... | 1.0 | 0 | -13483.34 | 10860.77 | 79787.59 | 1.0 | 0.0 | 1 | 1.0 | 0 |
| 3 | 0.0 | 1.0 | 3364.63 | 1397.30 | 222815.29 | 1.0 | 0.0 | 1 | 1.0 | 0 | ... | 1.0 | 0 | 1925.16 | -551.12 | 40420.96 | 1.0 | 0.0 | 1 | 1.0 | 0 |
| 4 | 0.0 | 1.0 | -581.66 | -1305.24 | 22249.59 | 1.0 | 0.0 | -1 | 1.0 | 0 | ... | 1.0 | 0 | -2820.04 | -8305.43 | 250130.00 | 1.0 | 0.0 | -1 | 1.0 | 0 |
| 5 | 0.0 | 1.0 | 112.84 | -13297.98 | 51882.87 | 1.0 | 0.0 | 1 | 1.0 | 0 | ... | 1.0 | 0 | -440.95 | -13699.42 | 71163.14 | 1.0 | 0.0 | -1 | 1.0 | 0 |
| 6 | 0.0 | 1.0 | 5558.97 | 3913.52 | 56981.08 | 1.0 | 0.0 | -1 | 1.0 | 0 | ... | 1.0 | 0 | 3457.70 | 780.13 | 28716.94 | 1.0 | 0.0 | 1 | 1.0 | 0 |
| 7 | 0.0 | 1.0 | -15208.03 | -1783.93 | 265210.55 | 1.0 | 0.0 | 1 | 1.0 | 0 | ... | 1.0 | 0 | -4478.67 | -164.39 | 71498.09 | 1.0 | 0.0 | 1 | 1.0 | 0 |
| 8 | 0.0 | 1.0 | -109.04 | 8239.25 | 191486.94 | 1.0 | 0.0 | -1 | 1.0 | 0 | ... | 1.0 | 0 | -2083.59 | 11359.35 | 192297.67 | 1.0 | 0.0 | -1 | 1.0 | 0 |
| 9 | 0.0 | 1.0 | 15175.26 | 93142.09 | 379269.30 | 1.0 | 0.0 | 1 | 1.0 | 0 | ... | 1.0 | 0 | 3295.84 | 24950.02 | 105990.48 | 1.0 | 0.0 | -1 | 1.0 | 0 |

10 rows × 26 columns
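
Each row of the preview above lists the momentum components of the candidate tracks (for example `H1_PX`, `H1_PY`, `H1_PZ`). As a minimal sketch, assuming the momenta are expressed in MeV/c and that the track is a charged kaon with a mass of about 493.677 MeV/c² (both are assumptions, not stated in the table), the momentum magnitude and energy of the H1 track in row 0 can be worked out in plain Python. The function names are illustrative and are not part of the original notebook.

```python
import math

# Assumed charged-kaon mass in MeV/c^2; this value is not given in the table above
KAON_MASS_MEV = 493.677

def momentum_magnitude(px, py, pz):
    """Return |p| = sqrt(px^2 + py^2 + pz^2)."""
    return math.sqrt(px * px + py * py + pz * pz)

def energy(px, py, pz, mass=KAON_MASS_MEV):
    """Return E = sqrt(|p|^2 + m^2) in natural units (c = 1)."""
    return math.sqrt(px * px + py * py + pz * pz + mass * mass)

# H1 momentum components copied from row 0 of the preview
h1_px, h1_py, h1_pz = 3551.84, 1636.96, 23904.14

h1_p = momentum_magnitude(h1_px, h1_py, h1_pz)
h1_e = energy(h1_px, h1_py, h1_pz)
print("H1 |p| = %.2f, E = %.2f" % (h1_p, h1_e))
```

The same two expressions could also be written as Spark column expressions so that they are evaluated for every row of a DataFrame rather than for a single hand-copied row.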
| | B_FlightDistance | B_VertexChi2 | H1_PX | H1_PY | H1_PZ | H1_ProbK | H1_ProbPi | H1_Charge | H1_IPChi2 | H1_isMuon | ... | H2_IPChi2 | H2_isMuon | H3_PX | H3_PY | H3_PZ | H3_ProbK | H3_ProbPi | H3_Charge | H3_IPChi2 | H3_isMuon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.888037 | 8.426947 | 1207.753798 | -84.290958 | 10399.473702 | 0.902219 | 0.041574 | -1 | 998.424410 | 0 | ... | 110.519068 | 0 | 1973.085892 | -289.150032 | 26771.341608 | 0.915843 | 0.057261 | 1 | 386.493713 | 0 |
| 1 | 8.957103 | 3.474719 | -811.617861 | -518.300956 | 22338.883014 | 0.942885 | 0.093401 | 1 | 162.677006 | 0 | ... | 1994.119734 | 0 | -4801.397918 | 1993.340031 | 76466.229808 | 0.806471 | 0.385119 | -1 | 158.823018 | 0 |
| 2 | 9.100733 | 4.113123 | -2007.925735 | -2555.080382 | 26601.465958 | 0.917661 | 0.077237 | -1 | 1355.611615 | 0 | ... | 8500.518262 | 0 | -1260.859080 | -2824.663002 | 22365.178510 | 0.947676 | 0.097263 | 1 | 352.235461 | 0 |
| 3 | 11.077374 | 2.360357 | 1408.170513 | -1372.864558 | 66357.093308 | 0.785618 | 0.119467 | -1 | 2.799029 | 0 | ... | 564.019813 | 0 | 2171.855775 | -1964.419835 | 92096.742555 | 0.560237 | 0.070540 | 1 | 44.498271 | 0 |
| 4 | 17.743006 | 5.116309 | 1457.671574 | 1311.684099 | 8551.692070 | 0.783945 | 0.029395 | 1 | 18266.642863 | 0 | ... | 1098.894225 | 0 | 10985.230400 | 1271.856077 | 62682.682662 | 0.576559 | 0.455894 | -1 | 360.444011 | 0 |
| 5 | 11.554098 | 1.120899 | -77.919930 | 598.047768 | 18486.604881 | 0.937157 | 0.227115 | 1 | 89.128636 | 0 | ... | 2387.913090 | 0 | -3786.001956 | -3050.190096 | 49924.677288 | 0.940656 | 0.096659 | -1 | 827.264326 | 0 |
| 6 | 8.296893 | 7.380471 | 970.351808 | -490.045926 | 27929.242265 | 0.966135 | 0.095613 | 1 | 135.543971 | 0 | ... | 944.174083 | 0 | 1618.033440 | -1593.587768 | 45253.841121 | 0.959964 | 0.098093 | -1 | 260.910241 | 0 |
| 7 | 15.875883 | 1.656760 | 2336.165388 | 4166.002188 | 35728.525679 | 0.946878 | 0.060513 | -1 | 3321.946818 | 0 | ... | 1942.129052 | 0 | -1708.189185 | 2517.048779 | 27592.747481 | 0.961387 | 0.125535 | 1 | 7027.069112 | 0 |
| 8 | 12.265774 | 5.394378 | 2606.155962 | -4657.669797 | 99824.722050 | 0.755108 | 0.403123 | 1 | 153.705895 | 0 | ... | 4.976051 | 0 | 961.067631 | 892.008741 | 12739.165721 | 0.630626 | 0.030043 | -1 | 4351.779561 | 0 |
| 9 | 7.960260 | 9.528332 | -1740.175238 | 1060.634895 | 76542.224969 | 0.823950 | 0.096627 | -1 | 28.678577 | 0 | ... | 309.280125 | 0 | -2026.920178 | 887.978374 | 47609.309664 | 0.972340 | 0.161073 | 1 | 193.376446 | 0 |

10 rows × 26 columns
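
The preview above carries per-track particle-identification columns (`H1_ProbK`, `H1_ProbPi`, `H1_isMuon`, and the same for H3). As a rough sketch, assuming that `ProbK` and `ProbPi` are scores between 0 and 1 for the kaon and pion hypotheses and that `isMuon` is a 0/1 flag, a simple kaon-likeness filter could look as follows. The thresholds and the helper name `looks_like_kaon` are hypothetical and chosen only for illustration; they are not taken from the LHCb analysis.

```python
# Illustrative thresholds only: these values are assumptions, not analysis cuts
MIN_PROB_K = 0.6
MAX_PROB_PI = 0.4

def looks_like_kaon(prob_k, prob_pi, is_muon):
    """Return True when the scores favour a kaon and the muon flag is not set."""
    return prob_k > MIN_PROB_K and prob_pi < MAX_PROB_PI and is_muon == 0

# (row index, H1 (ProbK, ProbPi, isMuon), H3 (ProbK, ProbPi, isMuon)) copied from
# the first five rows of the preview; the H2 scores are hidden behind the "..."
# column and are therefore left out of this sketch
rows = [
    (0, (0.902219, 0.041574, 0), (0.915843, 0.057261, 0)),
    (1, (0.942885, 0.093401, 0), (0.806471, 0.385119, 0)),
    (2, (0.917661, 0.077237, 0), (0.947676, 0.097263, 0)),
    (3, (0.785618, 0.119467, 0), (0.560237, 0.070540, 0)),
    (4, (0.783945, 0.029395, 0), (0.576559, 0.455894, 0)),
]

# Keep only the rows where both visible tracks pass the filter
selected = [index for index, h1, h3 in rows
            if looks_like_kaon(*h1) and looks_like_kaon(*h3)]
print(selected)
```

Tightening or loosening the two thresholds trades signal purity against the number of candidates kept, so in practice they would be tuned against the data rather than fixed up front.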
| | Energy_K1_K2 | P_K1_K2 | Energy_K1_K3 | P_K1_K3 | Energy_K2_K3 | P_K2_K3 | H1_Charge | H2_Charge | H3_Charge |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 59312.848237 | 59211.181509 | 37331.391551 | 37308.534012 | 75681.554420 | 75585.380172 | -1 | 1 | 1 |
| 1 | 40065.665041 | 40022.232171 | 99009.419207 | 98975.411119 | 94344.925710 | 94244.771077 | 1 | -1 | -1 |
| 2 | 41944.817064 | 41751.672744 | 49387.243264 | 49369.614780 | 37724.526918 | 37581.719924 | -1 | 1 | 1 |
| 3 | 115729.095355 | 115680.388609 | 158532.678053 | 158529.404702 | 141485.642261 | 141426.952947 | -1 | 1 | 1 |
| 4 | 53399.569782 | 53304.215413 | 72440.132643 | 72359.081204 | 108264.666058 | 108244.552904 | 1 | -1 | -1 |
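
The table above pairs an energy and a momentum column for each two-kaon combination, together with the three track charges. As a minimal sketch, assuming that `Energy_K1_K2` is the summed energy of the two kaons and `P_K1_K2` the magnitude of their summed momentum (both in MeV, natural units), the invariant mass of the pair follows from M² = E² − |p|², and the sum of the three charges gives the charge of the parent candidate. The function name `invariant_mass` is illustrative and not part of the original notebook.

```python
import math

def invariant_mass(energy, momentum):
    """Return M = sqrt(E^2 - |p|^2) in natural units (c = 1)."""
    return math.sqrt(energy * energy - momentum * momentum)

# Values copied from row 0 of the table above
energy_k1_k2, p_k1_k2 = 59312.848237, 59211.181509
charges = (-1, 1, 1)  # H1_Charge, H2_Charge, H3_Charge

mass_k1_k2 = invariant_mass(energy_k1_k2, p_k1_k2)
total_charge = sum(charges)  # +1 or -1 for a three-track charged candidate
print("M(K1,K2) = %.1f, total charge = %+d" % (mass_k1_k2, total_charge))
```

Because E and |p| are nearly equal for these energetic pairs, the subtraction E² − |p|² loses several significant digits; double precision is ample for the values shown here.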