{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "TMVA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This exercise will get you started on using TMVA in pyroot.\n", "TMVA is the Toolkit for Multivariate Data Analysis with ROOT (more information http://tmva.sourceforge.net/) and it is a powerful tool to perform multivariate analysis." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import rootnotes\n", "import rootprint\n", "import utils\n", "import ROOT" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a ntuple and fill it with random numbers" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# create a TNtuple\n", "ntuple = ROOT.TNtuple(\"ntuple\",\"ntuple\",\"x:y:signal\")\n", "\n", "# generate 'signal' and 'background' distributions\n", "for i in range(10000):\n", " # throw a signal event centered at (1,1)\n", " ntuple.Fill(ROOT.gRandom.Gaus(1,1), # x\n", " ROOT.gRandom.Gaus(1,1), # y\n", " 1) # signal\n", " \n", " # throw a background event centered at (-1,-1)\n", " ntuple.Fill(ROOT.gRandom.Gaus(-1,1), # x\n", " ROOT.gRandom.Gaus(-1,1), # y\n", " 0) # background" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the distribution of the generated events" ] }, { "cell_type": "code", "collapsed": false, "input": [ "samplesCanvas = rootnotes.canvas(\"samplesCanvas\", (400, 400))\n", "\n", "# draw an empty 2D histogram for the axes\n", "histo = ROOT.TH2F(\"histo\",\"\",1,-5,5,1,-5,5)\n", "histo.Draw()\n", "\n", "# cuts defining the signal and background sample\n", "sigCut = ROOT.TCut(\"signal > 0.5\")\n", "bgCut = ROOT.TCut(\"signal <= 0.5\")\n", "\n", "# draw the signal events in red\n", "ntuple.SetMarkerColor(ROOT.kRed)\n", "ntuple.Draw(\"y:x\",sigCut,\"same\")\n", "\n", "# draw the background events in blue\n", "ntuple.SetMarkerColor(ROOT.kBlue)\n", "ntuple.Draw(\"y:x\",bgCut,\"same\")\n", "\n", "samplesCanvas" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "10000L" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the TMVA factory, the basic component of TMVA for the training of multivariate methods." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ROOT.TMVA.Tools.Instance()\n", "\n", "# note that it seems to be mandatory to have an\n", "# output file, just passing None to TMVA::Factory(..)\n", "# does not work. Make sure you don't overwrite an\n", "# existing file.\n", "fout = ROOT.TFile(\"TMVAResults.root\",\"RECREATE\")\n", "\n", "factory = ROOT.TMVA.Factory(\"TMVAClassification\", fout,\n", " \":\".join([\n", " \"!V\",\n", " \"!Silent\",\n", " \"!Color\",\n", " \"!DrawProgressBar\",\n", " \"Transformations=I;D;P;G,D\",\n", " \"AnalysisType=Classification\"]\n", " ))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add the variables and trees to the factory." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "factory.AddVariable(\"x\",\"F\")\n", "factory.AddVariable(\"y\",\"F\")\n", "\n", "factory.AddSignalTree(ntuple)\n", "factory.AddBackgroundTree(ntuple)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "factory.PrepareTrainingAndTestTree(sigCut, # signal events\n", " bgCut, # background events\n", " \":\".join([\n", " \"nTrain_Signal=0\",\n", " \"nTrain_Background=0\",\n", " \"SplitMode=Random\",\n", " \"NormMode=NumEvents\",\n", " \"!V\"\n", " ]))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Book the multivariate method to train. In this case we are training a BDT and we specify the settings:\n", "\n", "- NTrees: number of trees\n", "- nEventsMin: minimum events per tree\n", "- MaxDepth: maximum depth of the decision tree allowed\n", "- BoostType: boosting algorithm used. In this case it is the adaptive boost\n", "- AdaBoostBeta: specific parameter of the adaptive boost algorithm\n", "- SeparationType: separation criterion for node splitting\n", "- nCuts: number of grid points in variable range used in finding optimal cut in node splitting\n", "- PruneMethod: method used to prune the tree. In this case no pruning is applied." ] }, { "cell_type": "code", "collapsed": false, "input": [ "methodBDT = factory.BookMethod(ROOT.TMVA.Types.kBDT, \"BDT\",\n", " \":\".join([\n", " \"!H\",\n", " \"!V\",\n", " \"NTrees=850\",\n", " \"nEventsMin=150\",\n", " \"MaxDepth=3\",\n", " \"BoostType=AdaBoost\",\n", " \"AdaBoostBeta=0.5\",\n", " \"SeparationType=GiniIndex\",\n", " \"nCuts=20\",\n", " \"PruneMethod=NoPruning\",\n", " ]))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Everything is setup, now train, test and evaluate all methods (the BDT in this case)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "factory.TrainAllMethods()\n", "factory.TestAllMethods()\n", "factory.EvaluateAllMethods()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the results of the training." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "reader = ROOT.TMVA.Reader()\n", "import array\n", "varx = array.array('f',[0]) ; reader.AddVariable(\"x\",varx)\n", "vary = array.array('f',[0]) ; reader.AddVariable(\"y\",vary)\n", "reader.BookMVA(\"BDT\",\"weights/TMVAClassification_BDT.weights.xml\")\n", "\n", "# create a new 2D histogram with fine binning\n", "histo2 = ROOT.TH2F(\"histo2\",\"\",200,-5,5,200,-5,5)\n", " \n", "# loop over the bins of a 2D histogram\n", "for i in range(1,histo2.GetNbinsX() + 1):\n", " for j in range(1,histo2.GetNbinsY() + 1):\n", " \n", " # find the bin center coordinates\n", " varx[0] = histo2.GetXaxis().GetBinCenter(i)\n", " vary[0] = histo2.GetYaxis().GetBinCenter(j)\n", " \n", " # calculate the value of the classifier\n", " # function at the given coordinate\n", " bdtOutput = reader.EvaluateMVA(\"BDT\")\n", " \n", " # set the bin content equal to the classifier output\n", " histo2.SetBinContent(i,j,bdtOutput)\n", "\n", "c2 = rootnotes.canvas(\"test2\", (800, 400))\n", "c2.Divide(2,1)\n", "c2.cd(1)\n", "histo2.Draw(\"colz\")\n", "\n", "# keeps objects otherwise removed by garbage collected in a list\n", "gcSaver = []\n", "\n", "# draw sigma contours around means\n", "for mean, color in (\n", " ((1,1), ROOT.kRed), # signal\n", " ((-1,-1), ROOT.kBlue), # background\n", " ):\n", "\n", " # draw contours at 1 and 2 sigmas\n", " for numSigmas in (1,2):\n", " circle = ROOT.TEllipse(mean[0], mean[1], numSigmas)\n", " circle.SetFillStyle(0)\n", " circle.SetLineColor(color)\n", " circle.SetLineWidth(2)\n", " circle.Draw()\n", " gcSaver.append(circle)\n", "\n", "# fill histograms for signal and background from the test sample tree\n", "hSig = ROOT.TH1F(\"hSig\", \"hSig\", 22, -1.1, 1.1)\n", "hBg = ROOT.TH1F(\"hBg\", \"hBg\", 22, -1.1, 1.1)\n", "ROOT.TestTree.Draw(\"BDT>>hSig\",\"classID == 0\",\"goff\") # signal\n", "ROOT.TestTree.Draw(\"BDT>>hBg\",\"classID == 1\", \"goff\") # background\n", "\n", "hSig.SetLineColor(ROOT.kRed); hSig.SetLineWidth(2) # signal histogram\n", "hBg.SetLineColor(ROOT.kBlue); hBg.SetLineWidth(2) # background histogram\n", " \n", "# use a THStack to show both histograms\n", "hs = ROOT.THStack(\"hs\",\"\")\n", "hs.Add(hSig)\n", "hs.Add(hBg)\n", "\n", "c2.cd(2)\n", "hs.Draw()\n", "c2" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1**: Inspect the results saved in the TMVAResults.root output file.\n", "Plot the correlation matrix for signal and background. Plot also the BDT output.\n", "Before loading the file we need to close it. The ROOT.TestTree will not be accessible anymore as it will be deleted from memory.\n", "To rerun the cell above you will need to run from the start." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%rootprint\n", "fout.Close()\n", "tmvaFile = ROOT.TFile(\"TMVAResults.root\")\n", "tmvaFile.ls()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uncomment and run the following cell to see a suggestion." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# print open(\"TMVASuggestionQuestion1.txt\").read()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2**: Train also a neural network (NN) and show the performance on this dataset.\n", "\n", "The NN is defined as kMLP (multilayer perceptron) and can be specified as method = factory.BookMethod(ROOT.TMVA.Types.kMLP, \"MLP\", ...). Some reasonable default options are \"H:!V:NeuronType=tanh:VarTransform=N:NCycles=600:HiddenLayers=N+5:TestRate=5:!UseRegulator\" (to be converted to a proper format)\n", "\n", "Plot the output of the NN and the ROC curve. Which of the two methods gives a better performance?\n", "\n", "It is best to restart the kernel (Kernel -> restart) and run over from the first cell after modifying the code for the MLP training.\n", "\n", "Uncomment and run the following cell to see a suggestion." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# print open(\"TMVASuggestionQuestion2.txt\").read()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 } ], "metadata": {} } ] }