{ "cells": [ { "cell_type": "markdown", "id": "blocked-chemical", "metadata": {}, "source": [ "In this demo, we first show how to use the Arrow `Dataset` API with `SkyhookFileFormat` to scan Parquet files by pushing scan operations down into Ceph, and then we show how to use the `Dataset` API to process Parquet files containing NanoEvents stored in Ceph in parallel through Coffea using Dask." ] }, { "cell_type": "markdown", "id": "medieval-profile", "metadata": {}, "source": [ "## Exploring SkyhookFileFormat with PyArrow\n", "\n", "We import the Dataset API and the Parquet API from PyArrow." ] }, { "cell_type": "code", "execution_count": 19, "id": "fifty-syndrome", "metadata": {}, "outputs": [], "source": [ "import pyarrow\n", "import pyarrow.dataset as ds\n", "import pyarrow.parquet as pq" ] }, { "cell_type": "markdown", "id": "inappropriate-watts", "metadata": {}, "source": [ "Now we instantiate the `SkyhookFileFormat`. Upon instantiation, the connection to the Ceph cluster is established under the hood; it is closed automatically when the object is destroyed. The `SkyhookFileFormat` API currently takes the Ceph configuration file as input. It inherits from the `FileFormat` API and uses the `DirectObjectAccess` API under the hood to interact with the underlying objects that make up a file in CephFS. Since CephFS is mounted, the dataset appears as just another directory of Parquet files, so we can instantiate it with the `FileSystemDataset` that ships with Apache Arrow. We can then push scan operations down to the Parquet files simply by plugging `SkyhookFileFormat` into the `format` parameter. 
" ] }, { "cell_type": "code", "execution_count": 20, "id": "aerial-helping", "metadata": {}, "outputs": [], "source": [ "dataset = ds.dataset(\"file:///mnt/cephfs/nyc\", format=ds.SkyhookFileFormat(\"parquet\", \"/etc/ceph/ceph.conf\"))" ] }, { "cell_type": "markdown", "id": "received-costa", "metadata": {}, "source": [ "Now we apply some projections and filters on the dataset." ] }, { "cell_type": "code", "execution_count": 21, "id": "romance-prague", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead><tr><th></th><th>total_amount</th><th>fare_amount</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>75.84</td><td>52.00</td></tr>\n", "<tr><th>1</th><td>69.99</td><td>52.00</td></tr>\n", "<tr><th>2</th><td>59.84</td><td>53.00</td></tr>\n", "<tr><th>3</th><td>68.50</td><td>53.50</td></tr>\n", "<tr><th>4</th><td>70.01</td><td>52.00</td></tr>\n", "<tr><th>...</th><td>...</td><td>...</td></tr>\n", "<tr><th>376</th><td>78.88</td><td>67.00</td></tr>\n", "<tr><th>377</th><td>64.84</td><td>58.50</td></tr>\n", "<tr><th>378</th><td>0.31</td><td>0.01</td></tr>\n", "<tr><th>379</th><td>58.80</td><td>57.50</td></tr>\n", "<tr><th>380</th><td>229.80</td><td>228.50</td></tr>\n", "</tbody></table>\n", "<p>381 rows × 2 columns</p>\n", "