{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "- Event data\n", " - ServiceX delivers from grid or remote XRootD storage to the user. Or more precisely ServiceX writes into an object store (ServiceX internal storage) and users download files or URLs from the object store as soon as available.\n", " - Thickness of arrows reflect the amount of data over a wire. ServiceX is NOT designed to download full data from grids. Transformers effectively reduce data that will be delivered to user based on a query for selection and filtering.\n", " - ServiceX is often co-located with a grid site to maximize network bandwith. XCache is preferable to allow much faster read for frequently accessed datasets.\n", "- Transformer\n", " - ServiceX consists of multiple microservices that are deployed as static K8s pod (always \"running\" state) but transformers are dynamically created via HPA (Horizontal Pod Scaling)\n", " - A transformer pod runs on a file at a time and number of transformer pods are scaled up and down depending on the number of input files in the dataset and other criteria.\n", "- ServiceX Request\n", " - ServiceX request(s) is(are) made from the SerivceX client libary to ServiceX Web API via HTTP request\n", " - A ServiceX request takes one input dataset (or list of files) and ServiceX is happily scale transformer pods automatically. A dataset with a single file should work but it's much more desirable to utilize HPA.\n", " - Users can make ServiceX request anywhere only with Python ServiceX client library and
servicex.yaml
includes an access token. Thus it's perfectly fine to deliver data to a university cluster or a laptop for small tests.\n",
"\n",
"servicex.yaml
) from the ServiceX website and copy to your home or working directory \n",
"- NOTE: the ServiceX endpoint servicex.af.uchicago.edu
is limited to the ATLAS users as it provides an access to the ATLAS event data\n",
"\n",
"\n",
"\n",
"pip install servicex==3.0.0.alpha.18
"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"servicex 3.0.0a18\n"
]
}
],
"source": [
"# !pip install servicex==3.0.0.alpha.18\n",
"!pip list | grep servicex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"I have downloaded my ServiceX configuration file (servicex.yaml
) from the ServiceX webpage and installed servicex
package\n",
" spec
object.\n",
"- A Rucio dataset is specified\n",
"- Defined a `Query`, sent to transformers and run on all files in the given Rucio dataset\n",
"- `UprootRaw` query takes `\"treename\"` to set `TTree` in flat ROOT ntuples and `\"filter_name\"` to select branches in a given tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Let's deliver my ServiceX request"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1903f515405840e4b6c000ca3e681969",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "o = servicex.deliver(spec)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(o['UprootRaw_PyHEP'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Returns a dictionary" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sample.Name: dict_keys(['UprootRaw_PyHEP'])\n", "\n", "Fileset:
[[3.86e+04, 3.6e+04],\n", " [3.44e+04, 1.91e+04],\n", " [],\n", " [5.98e+04, 5.76e+04],\n", " [6.84e+04, 2.4e+04],\n", " [3.5e+04],\n", " [],\n", " [],\n", " [1.34e+05, 4.36e+04],\n", " [],\n", " ...,\n", " [6.46e+04, 3.66e+04, 2.78e+04],\n", " [],\n", " [1.74e+04],\n", " [],\n", " [5.42e+04],\n", " [3.81e+04, 1.26e+04],\n", " [],\n", " [3.53e+04],\n", " [6.17e+04]]\n", "--------------------------------\n", "type: 11543 * var * float32" ], "text/plain": [ "
UprootRaw({\"treename\": \"nominal\", \"filter_name\": \"el_pt\"})
`UprootRaw` Query\n",
"- This is a new query language that essentially calls uproot's `TTree.arrays()` function\n",
"- A `UprootRaw` query can be a dictionary or a list of dictionaries\n",
"- There are two types of operations a user can put in a dictionary\n",
"  - query: contains a `treename` key\n",
"  - copy: contains a `copy_histograms` key"
]
},
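The two operation types described above can be told apart purely by their keys. The following is a minimal sketch of that rule in plain Python; `classify_operations` is a hypothetical helper written for illustration and is not part of the ServiceX client.

```python
def classify_operations(query):
    """Label each entry of a UprootRaw-style query as 'query' or 'copy'.

    Hypothetical helper: it only mirrors the key-based rule stated in the
    text and does not talk to ServiceX.
    """
    # A single dictionary is treated as a one-element list.
    entries = [query] if isinstance(query, dict) else list(query)

    labels = []
    for entry in entries:
        if "treename" in entry:
            labels.append("query")   # reads branches from a tree
        elif "copy_histograms" in entry:
            labels.append("copy")    # copies objects such as histograms
        else:
            raise ValueError(f"unrecognized operation with keys {sorted(entry)}")
    return labels


single = {"treename": "nominal", "filter_name": "el_pt"}
combined = [single, {"copy_histograms": ["CutBookkeeper*"]}]
print(classify_operations(single))
print(classify_operations(combined))
```

A single dictionary and a list of dictionaries go through the same code path, which is why both forms are accepted.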
{
"cell_type": "markdown",
"metadata": {},
"source": [
" \n",
" \n",
" \n",
"query = [\n",
" {\n",
" 'treename': 'reco', \n",
" 'filter_name': ['/mu.*/', 'runNumber', 'lbn', 'jet_pt_*'], \n",
" 'cut':'(count_nonzero(jet_pt_NOSYS>40e3, axis=1)>=4)'\n",
" },\n",
" {\n",
" 'copy_histograms': ['CutBookkeeper*', '/cflow.*/', 'metadata', 'listOfSystematics']\n",
" }\n",
" ]\n",
"
\n",
"
\n",
"\n"
]
},
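The `filter_name` list in the example mixes three kinds of patterns: a regular expression wrapped in slashes (`/mu.*/`), a glob with a wildcard (`jet_pt_*`), and exact names. The sketch below approximates that convention with the standard library only. It is not uproot's implementation, and the branch names are made up for illustration.

```python
import fnmatch
import re


def matches(pattern, name):
    """Approximate uproot-style name matching for one pattern (assumption)."""
    # "/.../" is treated as a regular expression matched from the start of the name.
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.match(pattern[1:-1], name) is not None
    # Wildcard characters trigger case-sensitive glob matching.
    if any(ch in pattern for ch in "*?["):
        return fnmatch.fnmatchcase(name, pattern)
    # Anything else must match exactly.
    return name == pattern


def select_branches(patterns, names):
    """Keep the names that match at least one pattern, preserving order."""
    return [n for n in names if any(matches(p, n) for p in patterns)]


# Hypothetical branch names, for illustration only.
branches = ["mu_pt_NOSYS", "mu_eta", "el_pt_NOSYS", "runNumber",
            "lbn", "jet_pt_NOSYS", "jet_eta", "eventNumber"]
patterns = ["/mu.*/", "runNumber", "lbn", "jet_pt_*"]
print(select_branches(patterns, branches))
```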
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"- More details on the grammar can be found [here](https://servicex-frontend.readthedocs.io/en/latest/transformer_matrix.html)"
]
},
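The `cut` string in the example query, `count_nonzero(jet_pt_NOSYS>40e3, axis=1)>=4`, keeps an event only if at least four of its jets have `jet_pt_NOSYS` above `40e3`. A plain-Python sketch of the same selection on a toy jagged list follows; the values are invented, and on the transformer the expression is evaluated on arrays rather than Python lists.

```python
def passes_cut(jet_pts, threshold=40e3, min_jets=4):
    """Return True if at least `min_jets` entries exceed `threshold`."""
    return sum(pt > threshold for pt in jet_pts) >= min_jets


# Toy events: one list of jet pT values per event (invented numbers).
events = [
    [120e3, 85e3, 60e3, 45e3, 20e3],  # four jets above threshold -> kept
    [90e3, 50e3, 30e3],               # only two above threshold -> dropped
    [],                               # no jets -> dropped
    [41e3, 41e3, 41e3, 41e3],         # exactly four above threshold -> kept
]

mask = [passes_cut(jets) for jets in events]
selected = [jets for jets, keep in zip(events, mask) if keep]
print(mask)
```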
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"query_UprootRaw = servicex.query.UprootRaw({\"treename\": \"nominal\", \"filter_name\": \"el_pt\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"FuncADL_Uproot
Query\n",
"- Functional Analysis Description Language is a powerful query language that has been supported by ServiceX\n",
"- In addition to the basic operations like `Select()` for column selection or `Where()` for filtering, more sophisticated query can be built\n",
"- One new addition `FromTree()` method to set a tree name in a query\n",
"- More details can be found at the [talk](https://indico.cern.ch/event/1019958/timetable/#31-funcadl-functional-analysis) by M. Proffitt at PyHEP 2021"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"query_FuncADL = servicex.query.FuncADL_Uproot().FromTree('nominal').Select(lambda e: {'el_pt': e['el_pt']})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"PythonFunction
Query\n",
"- Python function can be passed as a query\n",
"- `uproot`, `awkward`, `vector` can be imported (limited by the transformer image)\n",
"- Primarily experimental purpose and likely to be discontinued"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def run_query(input_filenames=None):\n",
" import uproot\n",
" with uproot.open({input_filenames: \"nominal\"}) as o:\n",
" br = o.arrays(\"el_pt\")\n",
" return br\n",
"\n",
"query_PythonFunction = servicex.query.PythonFunction().with_uproot_function(run_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"All three queries return the same output, ROOT files with selected branch el_pt_NOSYS
!\n",
"\n",
"\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "o_multiple = servicex.deliver(spec_multiple)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "o_yaml = deliver(\"config_UprootRaw.yaml\")\n", "# o_py = deliver(spec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "YAML syntax\n", "- An exclamation mark (`!`) declares the dataset type and the query type (see details on the [PyYAML constructor](https://matthewpburruss.com/post/yaml/))\n", "  - Dataset tags: `!Rucio`, `!FileList`, `!CERNOpenData`\n", "  - Query tags: `!UprootRaw`, `!FuncADL_Uproot`, `!PythonFunction`\n", "- The pipe (`|`) after the query tag is the YAML literal block indicator, which lets a multi-line string be interpreted correctly\n", "\n", "
[07/01/24 00:13:17] WARNING Transform \"UprootRaw_PyHEP\" completed with failures: 3/3 files query_core.py:215\n", " failed \n", "\n" ], "text/plain": [ "\u001b[2;36m[07/01/24 00:13:17]\u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m Transform \u001b[32m\"UprootRaw_PyHEP\"\u001b[0m completed with failures: \u001b[1;36m3\u001b[0m/\u001b[1;36m3\u001b[0m files \u001b]8;id=567884;file:///opt/miniconda3/envs/pyhep/lib/python3.10/site-packages/servicex/query_core.py\u001b\\\u001b[2mquery_core.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=82617;file:///opt/miniconda3/envs/pyhep/lib/python3.10/site-packages/servicex/query_core.py#215\u001b\\\u001b[2m215\u001b[0m\u001b]8;;\u001b\\\n", "\u001b[2;36m \u001b[0m failed \u001b[2m \u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
WARNING More information of 'UprootRaw_PyHEP' HERE query_core.py:226\n", "\n" ], "text/plain": [ "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[31mWARNING \u001b[0m More information of \u001b[32m'UprootRaw_PyHEP'\u001b[0m \u001b]8;id=479336;https://atlas-kibana.mwt2.org:5601/s/servicex/app/dashboards?auth_provider_hint=anonymous1#/view/6d069520-f34e-11ed-a6d8-9f6a16cd6d78?embed=true&_g=(time:(from:now-30d%2Fd,to:now))&_a=(filters:!((query:(match_phrase:(requestId:'ab736b6d-d3d9-439a-8734-ed3ba9276540'))),(query:(match_phrase:(level:'error')))))&show-time-filter=true&hide-filter-bar=true\u001b\\\u001b[1;31;47mHERE\u001b[0m\u001b]8;;\u001b\\ \u001b]8;id=178926;file:///opt/miniconda3/envs/pyhep/lib/python3.10/site-packages/servicex/query_core.py\u001b\\\u001b[2mquery_core.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=210571;file:///opt/miniconda3/envs/pyhep/lib/python3.10/site-packages/servicex/query_core.py#226\u001b\\\u001b[2m226\u001b[0m\u001b]8;;\u001b\\\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n" ], "text/plain": [] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "o = deliver(spec_typo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "