<h1 style="text-align: center;">
    <img src="img/logo_servicex.png" width="70" height="70"  style="float:left" alt="ServiceX">
    <img src="img/logo_ut.png" width="150" height="100"  style="float:right" alt="UT Austin">
    The new Python client library of ServiceX, the novel data delivery system
</h1>

<h4 style="text-align: center;">KyungEon Choi (UT Austin) for ServiceX team (IRIS-HEP)</h4>

<h4 style="text-align: center;">PyHEP 2024 (July 1, 2024)</h4>

<br>

</br>

<h2> ServiceX</h2>

<font size="3">
ServiceX is a scalable data extraction, transformation and delivery system deployed in a Kubernetes cluster. 

<p style="text-align:center;"> <img src="img/ServiceXDiagram2.png" width="1100" alt="ServiceX"></p>

<font size="3">

- <b><span style="color:#FF6E33;">Event data</span></b>
    - ServiceX delivers data from grid or remote XRootD storage to the user. More precisely, ServiceX writes its output into an object store (ServiceX internal storage), and users download files or URLs from the object store as soon as they are available.
    - The thickness of the arrows reflects the amount of data sent over the wire. ServiceX is NOT designed to download full datasets from the grid. Transformers reduce the data delivered to the user, based on a query for selection and filtering.
    - ServiceX is often co-located with a grid site to maximize network bandwidth. An XCache is preferable, as it allows much faster reads of frequently accessed datasets.
- <b><span style="color:red;">Transformer</span></b>
    - ServiceX consists of multiple microservices that are deployed as static K8s pods (always in the "Running" state), whereas transformer pods are created dynamically via the HPA (Horizontal Pod Autoscaler).
    - A transformer pod processes one file at a time, and the number of transformer pods is scaled up and down depending on the number of input files in the dataset and other criteria.
- ServiceX Request
    - ServiceX requests are made from the <span style="color:blue;">ServiceX client library</span> to the ServiceX Web API via HTTP requests.
    - A ServiceX request takes one input dataset (or a list of files), and ServiceX scales the transformer pods automatically. A dataset with a single file works, but datasets with many files make much better use of the HPA.
    - Users can make ServiceX requests from anywhere with only the Python ServiceX client library and a <font size="2"><code>servicex.yaml</code></font> file that includes an access token. Thus it's perfectly fine to deliver data to a university cluster, or to a laptop for small tests.

<br>

<font size="3">

<h4>ServiceX Webpage</h4>

- Download a ServiceX configuration file (<font size="2"><code>servicex.yaml</code></font>) from the ServiceX website and copy it to your home or working directory
- NOTE: the ServiceX endpoint <font size="2"><code>servicex.af.uchicago.edu</code></font> is limited to ATLAS users, as it provides access to ATLAS event data
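
Before making a request, it can help to confirm that the configuration file sits where the step above puts it. The following is a minimal sketch using only the standard library; the helper name `find_servicex_config` is hypothetical, and it checks only the working and home directories, which may not match the client library's actual lookup order.

```python
from pathlib import Path
from typing import Optional


def find_servicex_config(start: Optional[Path] = None,
                         home: Optional[Path] = None,
                         name: str = "servicex.yaml") -> Optional[Path]:
    """Return the first existing config file: working directory first, then home.

    Hypothetical helper for illustration; not part of the servicex package.
    """
    start = Path.cwd() if start is None else Path(start)
    home = Path.home() if home is None else Path(home)
    for directory in (start, home):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


config = find_servicex_config()
print(config if config else "servicex.yaml not found in the working or home directory")
```

The file contains an access token, so it is worth keeping it out of shared directories and version control.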

<p style="text-align:center;"><img src="img/servicex_web.png" width="900" alt="ServiceX Web"></p>

<br>


</br>

<h2>ServiceX Client library</h2>

<font size="3">

The ServiceX client library is a Python library that lets users communicate with the ServiceX backend (or server) to make delivery requests and handle the outputs.

<font size="3">

<b>The most fundamental components of a ServiceX request</b>
1. Dataset
1. Query - describes what a user wants to run in the transformers

<br>

<font size="3">
    
<br>

<b>Design goals of the new ServiceX client library</b>
- Minimize boilerplate
- Support YAML interface (integration of ServiceX DataBinder)
- Strongly typed (pydantic)

<br>

<font size="3">

<b>Installation</b><br />
- <font size="2"><code>pip install servicex==3.0.0.alpha.18</code></font>

In [1]:
# !pip install servicex==3.0.0.alpha.18
!pip list | grep servicex

servicex                         3.0.0a18


<font size="3">
I have downloaded my ServiceX configuration file (<font size="2"><code>servicex.yaml</code></font>) from the ServiceX webpage and installed the <font size="2"><code>servicex</code></font> package
    <br />--> Ready to make a ServiceX request!

<br></br>

<h3>First ServiceX request</h3>

<font size="3">
Let's begin with the basics: <br>
<span style="margin-left:30px">Deliver a branch (or column) from a dataset in the grid</span>

In [2]:
import servicex

In [3]:
spec = {
    "Sample":[{
        "Name": "UprootRaw_PyHEP",
        "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
        "Query": servicex.query.UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
    }]
}

<font size="3">
    
- One sample named "UprootRaw_PyHEP" is defined in the <font size="2"><code>spec</code></font> object.
- A Rucio dataset is specified as the input
- A <font size="2">`Query`</font> is defined; it is sent to the transformers and run on all files in the given Rucio dataset
- The <font size="2">`UprootRaw`</font> query takes <font size="2">`"treename"`</font> to select the <font size="2">`TTree`</font> in flat ROOT ntuples and <font size="2">`"filter_name"`</font> to select branches in that tree

<font size="3">
Let's deliver my ServiceX request

In [4]:
o = servicex.deliver(spec)

Output()

In [5]:
len(o['UprootRaw_PyHEP'])

3

<font size="3">
<font size="2"><code>deliver</code></font> returns a dictionary that maps each sample name to a list of delivered files

In [6]:
print(f"Sample.Name: {o.keys()}\n")
print(f"Fileset: {type(o['UprootRaw_PyHEP'])}\n")
print(f"First file: {(o['UprootRaw_PyHEP'][0])}\n")

Sample.Name: dict_keys(['UprootRaw_PyHEP'])

Fileset: <class 'list'>

First file: /Users/kc43627/Work/data/servicex_cache/c9a57bae-b2c3-4432-93cd-253763e42ead/root___192.170.240.145__root___fax.mwt2.org_1094__pnfs_uchicago.edu_atlaslocalgroupdisk_rucio_user_mgeyik_a0_3c_user.mgeyik.30183079._000006.out.root



In [7]:
import uproot

with uproot.open(o['UprootRaw_PyHEP'][0]) as f:
    column = f['nominal']['el_pt']
column.array()

<font size="3">
Only a few lines of Python bring the data you want from the grid!
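
The cell above opens only the first delivered file. As a minimal sketch of how all files of a sample could be read together, the helper below turns the dictionary returned by `deliver` into the `{path: treename}` mapping that uproot accepts for multi-file reads. The helper name `as_uproot_files` and the file paths are hypothetical, and a plain dictionary stands in for a real result so the snippet runs without ServiceX.

```python
from typing import Dict, List, Mapping


def as_uproot_files(delivered: Mapping[str, List[str]],
                    sample: str,
                    treename: str) -> Dict[str, str]:
    """Map every delivered file of one sample to the tree to read from it."""
    return {path: treename for path in delivered[sample]}


# Stand-in for the dictionary returned by servicex.deliver(); paths are made up.
delivered = {
    "UprootRaw_PyHEP": [
        "cache/file_1.root",
        "cache/file_2.root",
        "cache/file_3.root",
    ]
}

files = as_uproot_files(delivered, "UprootRaw_PyHEP", "nominal")
print(files)

# With uproot installed and real files, this mapping can be passed to
# uproot.concatenate(files, ["el_pt"]) to read the branch from all files at once.
```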

<br></br>

Let me go through the kinds of `Dataset` and `Query` supported by ServiceX

<h3>Dataset</h3>

<font size="3">
ServiceX supports Rucio, XRootD, and CERN Open Data datasets

In [8]:
servicex.dataset.Rucio.__init__

<function servicex.dataset_identifier.RucioDatasetIdentifier.__init__(self, dataset: str, num_files: Optional[int] = None)>

In [9]:
servicex.dataset.FileList.__init__

<function servicex.dataset_identifier.FileListDataset.__init__(self, files: Union[List[str], str])>

In [10]:
servicex.dataset.CERNOpenData.__init__

<function servicex.dataset_identifier.CERNOpenDataDatasetIdentifier.__init__(self, dataset: int, num_files: Optional[int] = None)>

<br></br>

<h3>Query</h3>

<font size="3">
<ul>
    <li>A query represents what the user wants from the input dataset, e.g.</li>
    <ul>
        <li><font size="2"><code>UprootRaw({"treename": "nominal", "filter_name": "el_pt"})</code></font></li>
    </ul>
    <li>The user-provided query is translated into code that runs on the transformers</li>
    <li>A query depends on the input data format, as the code for a flat ROOT ntuple differs from that for Apache Parquet</li>
    <!-- <li>ServiceX supports ROOT ntuples, ATLAS xAOD, CMS Run-1 AOD as an input format</li> -->
    <!-- <li>Current version of client library supports query languages   (though other query classes are registered)</li> -->
    <!-- <li>Current version of client library supports query classes for ROOT ntuples at the moment</li> -->
</ul>
</font>

In [11]:
servicex.query.plugins

[EntryPoint(name='FuncADL_ATLASr21', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_ATLASr21', group='servicex.query'),
 EntryPoint(name='FuncADL_ATLASr22', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_ATLASr22', group='servicex.query'),
 EntryPoint(name='FuncADL_ATLASxAOD', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_ATLASxAOD', group='servicex.query'),
 EntryPoint(name='FuncADL_CMS', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_CMS', group='servicex.query'),
 EntryPoint(name='FuncADL_Uproot', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_Uproot', group='servicex.query'),
 EntryPoint(name='PythonFunction', value='servicex.python_dataset:PythonQuery', group='servicex.query'),
 EntryPoint(name='UprootRaw', value='servicex.uproot_raw.uproot_raw:UprootRawQuery', group='servicex.query')]

<font size="3">

<br>
<b>Query classes for ROOT ntuples (via Uproot)</b>

<font size="3">

<code>UprootRaw</code> Query
- This is a new query language, essentially calling <font size="2">`uproot.tree.arrays()`</font> function
- An UprootRaw query can be a dictionary or a list of dictionaries
- There are two types of operations a user can put in a dictionary
    - query: contains a  <font size="2">`treename`</font> key
    - copy: contains a  <font size="2">`copy_histograms`</font> key

<font size="2">    
    <pre>
        <code class="python">
query = [
         {
          'treename': 'reco', 
          'filter_name': ['/mu.*/', 'runNumber', 'lbn', 'jet_pt_*'], 
          'cut':'(count_nonzero(jet_pt_NOSYS>40e3, axis=1)>=4)'
         },
         {
          'copy_histograms': ['CutBookkeeper*', '/cflow.*/', 'metadata', 'listOfSystematics']
         }
        ]
        </code>
    </pre>
</font>


<font size="3">

- More details on the grammar can be found [here](https://servicex-frontend.readthedocs.io/en/latest/transformer_matrix.html)
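
The `filter_name` list in the example query mixes three pattern styles: a regular expression wrapped in slashes (`/mu.*/`), a glob (`jet_pt_*`), and exact names (`runNumber`, `lbn`). The sketch below roughly emulates that convention with the standard library to show which branches such a list would keep. It is an illustration under the assumption that patterns follow uproot's documented `filter_name` rules, not the transformer's actual code; the branch names and the helpers `matches` and `select_branches` are hypothetical.

```python
import fnmatch
import re
from typing import Iterable, List


def matches(pattern: str, name: str) -> bool:
    """Emulate the three pattern styles: /regex/, glob, or exact name."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        # Regular expression between the slashes, matched from the start of the name.
        return re.match(pattern[1:-1], name) is not None
    if any(ch in pattern for ch in "*?["):
        # Shell-style wildcard, case sensitive.
        return fnmatch.fnmatchcase(name, pattern)
    return name == pattern


def select_branches(branches: Iterable[str], filter_name: Iterable[str]) -> List[str]:
    """Keep every branch that matches at least one pattern."""
    patterns = list(filter_name)
    return [b for b in branches if any(matches(p, b) for p in patterns)]


# Hypothetical branch names, for illustration only.
branches = ["mu_pt_NOSYS", "mu_eta", "el_pt_NOSYS", "runNumber", "lbn",
            "jet_pt_NOSYS", "jet_eta"]

selected = select_branches(branches, ["/mu.*/", "runNumber", "lbn", "jet_pt_*"])
print(selected)
# ['mu_pt_NOSYS', 'mu_eta', 'runNumber', 'lbn', 'jet_pt_NOSYS']
```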

In [12]:
query_UprootRaw = servicex.query.UprootRaw({"treename": "nominal", "filter_name": "el_pt"})

<font size="3">

<br>

<code>FuncADL_Uproot</code> Query
- Functional Analysis Description Language is a powerful query language that has been supported by ServiceX
- In addition to basic operations like <font size="2">`Select()`</font> for column selection or <font size="2">`Where()`</font> for filtering, more sophisticated queries can be built
- One new addition is the <font size="2">`FromTree()`</font> method, which sets the tree name in a query
- More details can be found at the [talk](https://indico.cern.ch/event/1019958/timetable/#31-funcadl-functional-analysis) by M. Proffitt at PyHEP 2021
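
For readers new to the functional style, the sketch below is a toy in-memory analogue of `Where()` and `Select()`: the first keeps the events that pass a predicate, the second maps each event to the columns of interest. The class `ToyStream` and the event values are hypothetical and exist only for illustration; this is not func_adl, and real queries are translated into code that runs on the transformers rather than evaluated locally.

```python
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]


class ToyStream:
    """Toy event stream with chainable Where/Select (illustration only)."""

    def __init__(self, events: Iterable[Event]):
        self._events: List[Event] = list(events)

    def Where(self, predicate: Callable[[Event], bool]) -> "ToyStream":
        # Filtering: keep only the events for which the predicate is true.
        return ToyStream(e for e in self._events if predicate(e))

    def Select(self, projection: Callable[[Event], Event]) -> "ToyStream":
        # Column selection: transform each event into a new record.
        return ToyStream(projection(e) for e in self._events)

    def value(self) -> List[Event]:
        return list(self._events)


# Made-up events with a jagged el_pt column.
events = [
    {"el_pt": [31.0, 12.5], "el_eta": [0.1, -1.2], "met": 45.0},
    {"el_pt": [], "el_eta": [], "met": 80.0},
    {"el_pt": [55.2], "el_eta": [2.0], "met": 10.0},
]

result = (
    ToyStream(events)
    .Where(lambda e: len(e["el_pt"]) > 0)      # drop events without electrons
    .Select(lambda e: {"el_pt": e["el_pt"]})   # keep only the el_pt column
    .value()
)
print(result)
```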

In [13]:
query_FuncADL = servicex.query.FuncADL_Uproot().FromTree('nominal').Select(lambda e: {'el_pt': e['el_pt']})

<font size="3">

<br>

<code>PythonFunction</code> Query
- A Python function can be passed as a query
- <font size="2">`uproot`</font>, <font size="2">`awkward`</font>, <font size="2">`vector`</font> can be imported (limited by the transformer image)
- Primarily for experimental purposes and likely to be discontinued

In [14]:
def run_query(input_filenames=None):
    import uproot
    with uproot.open({input_filenames: "nominal"}) as o:
        br = o.arrays("el_pt")
    return br

query_PythonFunction = servicex.query.PythonFunction().with_uproot_function(run_query)

<font size="3">
All three queries return the same output: ROOT files with the selected branch <font size="2"><code>el_pt</code></font>!

<br></br>

<h3>Multiple samples</h3>

<font size="3">

- HEP analysis often needs more than one sample

In [15]:
spec_multiple = {
    "Sample":[
        {
            "Name": "UprootRaw_PyHEP",
            "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
            "Query": query_UprootRaw
        },
        {
            "Name": "FuncADL_Uproot_PyHEP",
            "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
            "Query": query_FuncADL
        },
        {
            "Name": "PythonFunction_PyHEP",
            "Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
            "Query": query_PythonFunction
        }
    ]
}

<font size="3">

- The <font size="2">`Sample`</font> block is a list of dictionaries, each with a <font size="2">`Dataset`</font> - <font size="2">`Query`</font> pair
- The client library makes one ServiceX request per <font size="2">`Dataset`</font> - <font size="2">`Query`</font> pair
- Again, it's preferable to have more files in one request, to make use of the K8s HPA, than to make multiple requests with the same query
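
When several samples share a dataset, the `Sample` list can also be built in a loop instead of being written out by hand. This is a minimal sketch with a hypothetical helper `build_spec`; placeholder strings stand in for the dataset and query objects so the snippet runs without ServiceX. In a real notebook one would pass objects such as `servicex.dataset.Rucio(...)` and the queries defined above.

```python
from typing import Any, Dict, List, Mapping


def build_spec(dataset: Any, queries: Mapping[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Build a spec with one sample per (name, query) entry, all on the same dataset."""
    samples = [
        {"Name": name, "Dataset": dataset, "Query": query}
        for name, query in queries.items()
    ]
    return {"Sample": samples}


# Placeholders only: real code would use dataset and query objects here.
spec_sketch = build_spec(
    "<Rucio dataset object>",
    {
        "UprootRaw_PyHEP": "<UprootRaw query>",
        "FuncADL_Uproot_PyHEP": "<FuncADL_Uproot query>",
        "PythonFunction_PyHEP": "<PythonFunction query>",
    },
)

names = [sample["Name"] for sample in spec_sketch["Sample"]]
print(names)
```

Because the sample names are dictionary keys, they are unique by construction, which keeps the entries of the delivered result distinct.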

In [16]:
o_multiple = servicex.deliver(spec_multiple)

Output()

<br></br>

<h3>YAML interface</h3>

<font size="3">

- It's cool to deliver only the columns of interest from grid storage into a Jupyter notebook, but a real analysis often becomes quite messy
- The new client library integrates <font size="2">`servicex-databinder`</font> and significantly improves the user interface to allow a seamless experience with YAML

In [21]:
%%writefile -a config_UprootRaw.yaml

Sample:
  - Name: Uproot_UprootRaw_YAML
    Dataset: !Rucio user.kchoi.pyhep2024.test_dataset
    Query: !UprootRaw |
        {"treename":"nominal", "filter_name": "el_pt"}

Writing config_UprootRaw.yaml


<font size="3">
Compare with the equivalent Python specification in this notebook

In [22]:
from servicex.dataset import Rucio
from servicex.query import UprootRaw
from servicex import deliver

spec = {
    "Sample":[{
        "Name": "UprootRaw_PyHEP",
        "Dataset": Rucio("user.kchoi.pyhep2024.test_dataset"),
        "Query": UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
    }]
}

In [23]:
from servicex import deliver

In [24]:
o_yaml = deliver("config_UprootRaw.yaml")
# o_py = deliver(spec)

Output()

<font size="3">

YAML syntax
- An exclamation mark (`!`) declares the dataset type and the query type (see details on the [PyYAML constructor](https://matthewpburruss.com/post/yaml/))
    - Dataset tags: <font size="2">`!Rucio`</font>, <font size="2">`!FileList`</font>, <font size="2">`!CERNOpenData`</font>
    - Query tags: <font size="2">`!UprootRaw`</font>, <font size="2">`!FuncADL_Uproot`</font>, <font size="2">`!PythonFunction`</font>
- The pipe (`|`) after the query tag is the literal block operator; it allows a multi-line string to be interpreted properly
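
To make the tag mechanism concrete, here is a minimal sketch of how custom YAML tags can be handled with PyYAML constructors. It assumes PyYAML is installed. The loader class and constructor functions are hypothetical; they return plain tuples instead of ServiceX objects, and the query string is parsed with `json.loads` purely for illustration, so this is not how the client library itself is implemented.

```python
import json

import yaml  # PyYAML


class SketchLoader(yaml.SafeLoader):
    """Private loader subclass so the custom tags do not leak into yaml.SafeLoader."""


def construct_rucio(loader, node):
    # The scalar after !Rucio is the dataset name.
    return ("Rucio", loader.construct_scalar(node))


def construct_uproot_raw(loader, node):
    # The literal block after "!UprootRaw |" arrives as one string.
    return ("UprootRaw", json.loads(loader.construct_scalar(node)))


SketchLoader.add_constructor("!Rucio", construct_rucio)
SketchLoader.add_constructor("!UprootRaw", construct_uproot_raw)

text = """\
Sample:
  - Name: Uproot_UprootRaw_YAML
    Dataset: !Rucio user.kchoi.pyhep2024.test_dataset
    Query: !UprootRaw |
        {"treename":"nominal", "filter_name": "el_pt"}
"""

config = yaml.load(text, Loader=SketchLoader)
sample = config["Sample"][0]
print(sample["Dataset"])
print(sample["Query"])
```

Each tag selects the constructor that receives the node, which is how a single YAML file can carry both the dataset type and the query type.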

<br></br>

<h3>Optional configurations</h3>

```
Definition:
  - &DEF_ggH_input "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets\
                  /2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"

  - &DEF_query1 !PythonFunction |
    def run_query(input_filenames=None):
        import uproot

        with uproot.open({input_filenames:"nominal"}) as o:
            br = o.arrays("mu_pt")
        return br

  - &DEF_query2 !FuncADL_Uproot  |
    FromTree('mini').Select(lambda e: {'lep_pt': e['lep_pt']}).Where(lambda e: e['lep_pt'] > 1000)

General:
  OutputFormat: parquet
  Delivery: SignedURLs

Sample:
  - Name: ttH
    Dataset: !Rucio user.kchoi.fcnc_tHq_ML.ttH.v11
    Query: *DEF_query1
    NFiles: 5
    # IgnoreLocalCache: False

  - Name: ttZ
    Dataset: !Rucio user.kchoi.fcnc_tHq_ML.ttZ.v11    
    Query: *DEF_query1
    NFiles: 3

  - Name: ggH
    Dataset: !FileList *DEF_ggH_input
    Query: *DEF_query2
```
<br></br>

<h3>Failed transformation</h3>

In [25]:
spec_typo = {
    "Sample":[{
        "Name": "UprootRaw_PyHEP",
        "Dataset": Rucio("user.kchoi.pyhep2024.test_dataset"),
        "Query": UprootRaw({"treename": "nominal", "filter_name": "el_pta"})
    }]
}

In [26]:
o = deliver(spec_typo)

Output()

<br></br>

## Future plans

<font size="3">

<br>

<b>Client library</b>
- Improve robustness: progress bar (transform status/object store access) and local caching
- The Read the Docs site for the new ServiceX client library is under construction! https://servicex-frontend.readthedocs.io/en/latest/index.html

<b>ServiceX</b>
- Improve the stability and robustness of ServiceX, especially based on what we learned during the 200 Gbps challenge (hundreds of ServiceX requests on hundreds of TB of datasets)
- Server-side caching
- Other ServiceX transformers: ATLAS TopCPToolkit transformer, column-join transformer