{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 7 reasons why I love Vaex for data science"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:51:04.004565Z",
"start_time": "2020-02-14T10:51:03.030891Z"
}
},
"outputs": [],
"source": [
"import vaex\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Introduction\n",
"\n",
"[Vaex](https://github.com/vaexio/vaex) is an open-source DataFrame library for Python with an API that closely resembles that of [Pandas](https://pandas.pydata.org/docs/index.html). I have been using Vaex for several years in both academic and industry environments, and it is my go-to library for several of the data science projects I am working on. In this article I would like to share some of my favourite Vaex features. Some may be obvious by now, but some may surprise you.\n",
"\n",
"The following code examples are run on a MacBook Pro (15\", 2018, 2.6GHz Intel Core i7, 32GB RAM)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Easy to work with very large datasets\n",
"\n",
"It is becoming increasingly common to encounter datasets that are larger than the available RAM on a typical laptop or desktop workstation. [Vaex](https://github.com/vaexio/vaex) solves this problem rather elegantly through memory mapping and lazy evaluation. As long as your data is stored in a memory-mappable file format such as Apache Arrow or HDF5, Vaex will open it instantly, no matter how large it is or how much RAM your machine has. In fact, the size of the files Vaex can read is limited only by the amount of free hard-disk space you have. If your data is not in a memory-mappable file format (e.g. CSV, JSON), you can easily convert it by using the rich Pandas I/O in combination with Vaex. [See this guide](https://docs.vaex.io/en/latest/faq.html#I-have-a-massive-CSV-file-which-I-can-not-fit-all-into-memory-at-one-time.-How-do-I-convert-it-to-HDF5?) on how to do so."
]
},
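{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of memory mapping a little more concrete, here is a minimal sketch that uses only the Python standard library. It is not how Vaex is implemented; the file name `column_f32.bin` and the flat float32 layout are assumptions made purely for illustration. The point is that mapping a file is cheap, and data is only read from disk when an element is actually touched.\n",
"\n",
"```python\n",
"import mmap\n",
"import os\n",
"import tempfile\n",
"from array import array\n",
"\n",
"# Hypothetical stand-in for one column on disk: a flat binary file of float32 values.\n",
"n_values = 1_000_000\n",
"path = os.path.join(tempfile.mkdtemp(), 'column_f32.bin')\n",
"with open(path, 'wb') as f:\n",
"    array('f', range(n_values)).tofile(f)\n",
"\n",
"with open(path, 'rb') as f:\n",
"    # Mapping only sets up address space; no column data is read at this point.\n",
"    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)\n",
"    raw = memoryview(mapped)\n",
"    column = raw.cast('f')  # reinterpret the raw bytes as float32 values\n",
"\n",
"    # Touching an element pulls in only the page that backs it.\n",
"    n_mapped = len(column)\n",
"    first, last = column[0], column[-1]\n",
"\n",
"    # Release the views before closing the mapping.\n",
"    column.release()\n",
"    raw.release()\n",
"    mapped.close()\n",
"\n",
"print(n_mapped, first, last)\n",
"```\n",
"\n",
"The same principle is what lets a file far larger than RAM be opened without loading it first."
]
},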
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:51:04.842737Z",
"start_time": "2020-02-14T10:51:04.718847Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 jovan staff 107G Jul 3 2019 ../vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5\r\n"
]
}
],
"source": [
"# Check the file size on disk\n",
"!ls -l -h ../vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:51:05.825240Z",
"start_time": "2020-02-14T10:51:05.765977Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"# vendor_id pickup_datetime dropoff_datetime passenger_count payment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount\n",
"0 VTS 2009-01-04 02:52:00.000000000 2009-01-04 03:02:00.000000000 1 CASH 2.630000114440918 -73.99195861816406 40.72156524658203 nan nan -73.99380493164062 40.6959228515625 8.899999618530273 0.5 nan 0.0 0.0 9.399999618530273\n",
"1 VTS 2009-01-04 03:31:00.000000000 2009-01-04 03:38:00.000000000 3 Credit 4.550000190734863 -73.98210144042969 40.736289978027344 nan nan -73.95584869384766 40.768028259277344 12.100000381469727 0.5 nan 2.0 0.0 14.600000381469727\n",
"2 VTS 2009-01-03 15:43:00.000000000 2009-01-03 15:57:00.000000000 5 Credit 10.350000381469727 -74.0025863647461 40.73974609375 nan nan -73.86997985839844 40.770225524902344 23.700000762939453 0.0 nan 4.739999771118164 0.0 28.440000534057617\n",
"3 DDS 2009-01-01 20:52:58.000000000 2009-01-01 21:14:00.000000000 1 CREDIT 5.0 -73.9742660522461 40.79095458984375 nan nan -73.9965591430664 40.731849670410156 14.899999618530273 0.5 nan 3.049999952316284 0.0 18.450000762939453\n",
"4 DDS 2009-01-24 16:18:23.000000000 2009-01-24 16:24:56.000000000 1 CASH 0.4000000059604645 -74.00157928466797 40.719383239746094 nan nan -74.00837707519531 40.7203483581543 3.700000047683716 0.0 nan 0.0 0.0 3.700000047683716\n",
"... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n",
"1,173,057,922 VTS 2015-12-31 23:59:56.000000000 2016-01-01 00:08:18.000000000 5 1 1.2000000476837158 -73.99381256103516 40.72087097167969 1.0 0.0 -73.98621368408203 40.722469329833984 7.5 0.5 0.5 1.7599999904632568 0.0 10.5600004196167\n",
"1,173,057,923 CMT 2015-12-31 23:59:58.000000000 2016-01-01 00:05:19.000000000 2 2 2.0 -73.96527099609375 40.76028060913086 1.0 0.0 -73.93951416015625 40.75238800048828 7.5 0.5 0.5 0.0 0.0 8.800000190734863\n",
"1,173,057,924 CMT 2015-12-31 23:59:59.000000000 2016-01-01 00:12:55.000000000 2 2 3.799999952316284 -73.98729705810547 40.739078521728516 1.0 0.0 -73.9886703491211 40.69329833984375 13.5 0.5 0.5 0.0 0.0 14.800000190734863\n",
"1,173,057,925 VTS 2015-12-31 23:59:59.000000000 2016-01-01 00:10:26.000000000 1 2 1.9600000381469727 -73.99755859375 40.72569274902344 1.0 0.0 -74.01712036132812 40.705322265625 8.5 0.5 0.5 0.0 0.0 9.800000190734863\n",
"1,173,057,926 VTS 2015-12-31 23:59:59.000000000 2016-01-01 00:21:30.000000000 1 1 1.059999942779541 -73.9843978881836 40.76725769042969 1.0 0.0 -73.99098205566406 40.76057052612305 13.5 0.5 0.5 2.9600000381469727 0.0 17.760000228881836"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# # Read data from S3\n",
"# df = vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')\n",
"\n",
"# Read data from local disk\n",
"df = vaex.open('../vaex-taxi/data/yellow_taxi_2009_2015_f32.hdf5')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Caption: Opening and previewing a 100GB file with Vaex is instant."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. No memory copies\n",
"\n",
"Vaex has a zero memory copy policy. This means that filtering a DataFrame costs very little memory and does not copy the data. Consider the following example."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:51:36.106131Z",
"start_time": "2020-02-14T10:51:27.208332Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"# vendor_id pickup_datetime dropoff_datetime passenger_count payment_type trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount\n",
"0 VTS 2009-01-04 02:52:00.000000000 2009-01-04 03:02:00.000000000 1 CASH 2.630000114440918 -73.99195861816406 40.72156524658203 nan nan -73.99380493164062 40.6959228515625 8.899999618530273 0.5 nan 0.0 0.0 9.399999618530273\n",
"1 VTS 2009-01-04 03:31:00.000000000 2009-01-04 03:38:00.000000000 3 Credit 4.550000190734863 -73.98210144042969 40.736289978027344 nan nan -73.95584869384766 40.768028259277344 12.100000381469727 0.5 nan 2.0 0.0 14.600000381469727\n",
"2 DDS 2009-01-01 20:52:58.000000000 2009-01-01 21:14:00.000000000 1 CREDIT 5.0 -73.9742660522461 40.79095458984375 nan nan -73.9965591430664 40.731849670410156 14.899999618530273 0.5 nan 3.049999952316284 0.0 18.450000762939453\n",
"3 DDS 2009-01-24 16:18:23.000000000 2009-01-24 16:24:56.000000000 1 CASH 0.4000000059604645 -74.00157928466797 40.719383239746094 nan nan -74.00837707519531 40.7203483581543 3.700000047683716 0.0 nan 0.0 0.0 3.700000047683716\n",
"4 DDS 2009-01-16 22:35:59.000000000 2009-01-16 22:43:35.000000000 2 CASH 1.2000000476837158 -73.98980712890625 40.73500442504883 nan nan -73.98502349853516 40.72449493408203 6.099999904632568 0.5 nan 0.0 0.0 6.599999904632568\n",
"... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n",
"1,061,605,165 CMT 2015-12-31 23:59:56.000000000 2016-01-01 00:09:25.000000000 1 1 1.0 -73.9738998413086 40.74289321899414 1.0 0.0 -73.98957061767578 40.75054931640625 8.0 0.5 0.5 1.850000023841858 0.0 11.149999618530273\n",
"1,061,605,166 CMT 2015-12-31 23:59:58.000000000 2016-01-01 00:05:19.000000000 2 2 2.0 -73.96527099609375 40.76028060913086 1.0 0.0 -73.93951416015625 40.75238800048828 7.5 0.5 0.5 0.0 0.0 8.800000190734863\n",
"1,061,605,167 CMT 2015-12-31 23:59:59.000000000 2016-01-01 00:12:55.000000000 2 2 3.799999952316284 -73.98729705810547 40.739078521728516 1.0 0.0 -73.9886703491211 40.69329833984375 13.5 0.5 0.5 0.0 0.0 14.800000190734863\n",
"1,061,605,168 VTS 2015-12-31 23:59:59.000000000 2016-01-01 00:10:26.000000000 1 2 1.9600000381469727 -73.99755859375 40.72569274902344 1.0 0.0 -74.01712036132812 40.705322265625 8.5 0.5 0.5 0.0 0.0 9.800000190734863\n",
"1,061,605,169 VTS 2015-12-31 23:59:59.000000000 2016-01-01 00:21:30.000000000 1 1 1.059999942779541 -73.9843978881836 40.76725769042969 1.0 0.0 -73.99098205566406 40.76057052612305 13.5 0.5 0.5 2.9600000381469727 0.0 17.760000228881836"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_filtered = df[(df.passenger_count>0) & (df.passenger_count<5)]\n",
"df_filtered"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Caption: Filtering a Vaex DataFrame does not copy the data and takes a negligible amount of memory."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The creation of the `df_filtered` DataFrame takes very little extra memory! This is because `df_filtered` is a _shallow_ copy of `df`. When creating filtered DataFrames, Vaex creates a binary mask which is then applied to the original data, without the need to make copies. The memory costs for these kinds of filters are low: one needs ~1.2 GB of RAM to filter a 1 billion row DataFrame. This is negligible compared to other \"classical\" tools, where one would need 100GB simply to read in the data, and another ~100GB for the filtered DataFrame."
]
},
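{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sanity check of the ~1.2 GB figure, here is a back-of-the-envelope estimate. It assumes the mask costs one byte per row (the size of a NumPy boolean); that per-row cost is an assumption of this sketch rather than a statement about Vaex internals. The row count is read off the DataFrame preview in section 1.\n",
"\n",
"```python\n",
"# Row count read off the preview in section 1: the last row index is 1,173,057,926.\n",
"n_rows = 1_173_057_926 + 1\n",
"\n",
"# Assumption: the filter mask stores one byte per row (the size of a NumPy bool).\n",
"bytes_per_row = 1\n",
"\n",
"mask_gb = n_rows * bytes_per_row / 1e9\n",
"print(f'Estimated mask size: {mask_gb:.2f} GB')  # prints: Estimated mask size: 1.17 GB\n",
"```\n",
"\n",
"Under that assumption the estimate lands close to the figure quoted above, and it is roughly one hundredth of the size of the file on disk."
]
},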
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Virtual columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Transforming existing columns of a Vaex DataFrame into new ones results in the creation of _virtual columns_. Virtual columns behave just like normal ones, but they take up no memory whatsoever. This is because Vaex only remembers the _expression_ that defines them, and does not calculate the values up front. These columns are lazily evaluated only when necessary, keeping memory usage low."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:51:44.105206Z",
"start_time": "2020-02-14T10:51:44.094212Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" # fare_amount total_amount tip_amount tip_percentage\n",
" 0 8.9 9.4 0 0\n",
" 1 12.1 14.6 2 13.6986\n",
" 2 23.7 28.44 4.74 16.6667\n",
" 3 14.9 18.45 3.05 16.5312\n",
" 4 3.7 3.7 0 0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['tip_percentage'] = df.tip_amount / df.total_amount * 100\n",
"df[['fare_amount', 'total_amount', 'tip_amount', 'tip_percentage']].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Caption: The \"tip_percentage\" column is a virtual column: it takes no extra memory and is lazily evaluated on the fly when needed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Performance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Vaex is _fast_. I mean _seriously fast_. The evaluation of virtual columns is fully parallelized and done with one pass over the data. Column methods such as \"value_counts\", \"groupby\", \"unique\" and the various string operations use fast and efficient algorithms, implemented in C++ under the hood. All of them work in an out-of-core fashion, meaning you can process much more data than fits into RAM, and use all available cores of your processor. For example, a \"value_counts\" operation takes only a second for over 1 billion rows!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:51:52.994399Z",
"start_time": "2020-02-14T10:51:52.273518Z"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b1e0f515814a47ffb0d9027c34a70ffd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=1.0), Label(value='In progress...')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"1 812321234\n",
"2 172864560\n",
"5 81923912\n",
"3 51435890\n",
"6 25614703\n",
" ... \n",
"69 1\n",
"66 1\n",
"61 1\n",
"53 1\n",
"70 1\n",
"Length: 62, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.passenger_count.value_counts(progress='widget')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Caption: Using Vaex, the \"value_counts\" operation takes ~1s for over _1.1 billion rows!_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Just-In-Time Compilation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As long as a virtual column is defined using only [Numpy](https://numpy.org/) or pure Python operations, Vaex can accelerate its evaluation by jitting, that is, Just-In-Time compilation via [Numba](http://numba.pydata.org/) or [Pythran](https://pythran.readthedocs.io/en/latest/). Vaex also supports acceleration via [CUDA](https://developer.nvidia.com/cuda-zone) if your machine has a CUDA-enabled NVIDIA graphics card. This can be quite useful for speeding up the evaluation of computationally expensive virtual columns.\n",
"\n",
"Consider the example below. I have defined the arc distance between two geographical locations, a calculation that involves quite a bit of algebra and trigonometry. Calculating the mean value forces the execution of this rather computationally expensive virtual column. When the execution is done purely with Numpy, it takes only 30 seconds, which I find impressive given that it is done for over **1.1 billion rows**. When we do the same with the Numba pre-compiled expression, execution is ~2.5 times faster, at least on my laptop. Unfortunately, I do not have an NVIDIA graphics card, so I cannot do the same using CUDA at this time. If you do, I'll be very happy if you could try this out and share the results.\n",
"\n",
"A small but important bonus: notice how you do not need to call `.compute` or any such method. Vaex automatically knows when to be lazy and when to execute a computation."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:52:44.836057Z",
"start_time": "2020-02-14T10:52:02.107778Z"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b2c9bc4c513c4b42a76a5dd90bd4df87",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=1.0), Label(value='In progress...')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b9531b2b9fe2425489acf4bd0b296c63",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=1.0), Label(value='In progress...')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean arc distance computed with numpy: 12.70720\n",
"Mean arc distance computed with numba: 12.70720\n"
]
}
],
"source": [
"def arc_distance(theta_1, phi_1, theta_2, phi_2):\n",
" temp = (np.sin((theta_2-theta_1)/2*np.pi/180)**2\n",
" + np.cos(theta_1*np.pi/180)*np.cos(theta_2*np.pi/180) * np.sin((phi_2-phi_1)/2*np.pi/180)**2)\n",
" distance = 2 * np.arctan2(np.sqrt(temp), np.sqrt(1-temp))\n",
" return distance * 3958.8\n",
"\n",
"# Expression to be evaluated with numpy as usual\n",
"df['arc_distance_numpy'] = arc_distance(df.pickup_longitude, \n",
" df.pickup_latitude, \n",
" df.dropoff_longitude, \n",
" df.dropoff_latitude)\n",
"\n",
"# Expression to be pre-compiled with numba, and then executed\n",
"df['arc_distance_numba'] = arc_distance(df.pickup_longitude, \n",
" df.pickup_latitude, \n",
" df.dropoff_longitude, \n",
" df.dropoff_latitude).jit_numba()\n",
"\n",
"# Expression to be pre-compiled with CUDA, and then executed on your GPU\n",
"# provided you have a CUDA compatible NVIDIA GPU.\n",
"# df['arc_distance_cuda'] = arc_distance(df.pickup_longitude, \n",
"# df.pickup_latitude, \n",
"# df.dropoff_longitude, \n",
"# df.dropoff_latitude).jit_cuda()\n",
"\n",
"# Calculate the mean \n",
"mean_numpy = df.arc_distance_numpy.mean(progress='widget')\n",
"mean_numba = df.arc_distance_numba.mean(progress='widget')\n",
"print(f'Mean arc distance computed with numpy: {mean_numpy:.5f}')\n",
"print(f'Mean arc distance computed with numba: {mean_numba:.5f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Caption: Jitting can lead to ~2.5 times faster execution for a computationally expensive virtual column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Selections\n",
"\n",
"Vaex implements a concept called _selections_ which is used to, ah, select the data. This is useful when you want to explore the data by, for example, calculating statistics on different portions of it without making a new reference DataFrame each time. The true power of using selections is that we can calculate a statistic for multiple selections with just one pass over the data."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:53:35.752782Z",
"start_time": "2020-02-14T10:53:28.969348Z"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8ceafb99faa04685ab3607dbce998e4c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=1.0), Label(value='In progress...')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"array([11.21730816, 11.2078196 , 11.26832503])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"select_n_passengers_lt3 = df.passenger_count < 3\n",
"select_n_passengers_ge3 = df.passenger_count >= 3\n",
"\n",
"df.fare_amount.mean(selection=[None, select_n_passengers_lt3, select_n_passengers_ge3], progress='widget')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Caption: You can calculate statistics for multiple selections with one pass over the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This can also be useful for making various visualisations. For example, we can use the `.count` method to create a couple of histograms on different selections with just one pass over the data. Quite efficient!"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:53:46.734642Z",
"start_time": "2020-02-14T10:53:40.557183Z"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f421ff08b7c64c7da1b7d7cba2c48ee4",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=1.0), Label(value='In progress...')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAewAAAEVCAYAAAAit9axAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXzU1bnH8c9DIGwJCGWRHVnCIhKWqIDKVjRYWritxovClShiKaggLWUTFEpECyKiXqpWEVmEaqW1gIioiK0GDRQtssseROAqKois5/4xkzGBmckkzEwy5Pt+vfLKzG87z/yUPHOW3znmnENERESKt1JFHYCIiIjkTwlbREQkBihhi4iIxAAlbBERkRighC0iIhIDlLBFRERiQLFP2Gb2gpkdNLMNIRz7uJmt9/5sNbMj0YhRREQk0qy4P4dtZp2Bo8BLzrlWBTjvXqCtc+7OiAUnIiISJcW+hu2cWw18lXubmTU2s+VmttbM3jez5n5OvRV4OSpBioiIRFjpog6gkJ4FBjvntpnZ1cD/At1zdppZA+Ay4J0iik9ERCSsYi5hm1kC0Al4xcxyNpc957C+wKvOuTPRjE1ERCRSYi5h42nGP+KcaxPkmL7A0CjFIyIiEnHFvg/7XM65b4GdZpYGYB7JOfvNrBlQBfiwiEIUEREJu2KfsM3sZTzJt5mZ7TOzgUA/YKCZfQJ8BvTJdcqtwEJX3Ie/i4iIFECxf6xLREREYqCGLSIiIsV80Fm1atVcw4YNizoMERGRqFi7du1h51x1f/uKdcJu2LAhWVlZRR2GiIhIVJjZ7kD71CQuIiISA5SwRUREYoAStoiISAxQwhYREYkBStgiIiIxoFiPEheR2PHtt99y8OBBTp06VdShiBRLZcqUoUaNGlSqVKlQ5ythi8gF+/bbb/nyyy+pU6cO5cuXJ9dKeiICOOc4fvw42dnZAIVK2iU+YT+3egczVm7l2En/K3FWjI9jeI8kBnVuFOXIRGLHwYMHqVOnDhUqVCjqUESKJTOjQoUK1KlTh/379xcqYZf4PuxgyRrg2MkzzFi5NYoRicSeU6dOUb58+aIOQ6TYK1++fKG7jUp8wg6WrAtyjEhJp2ZwkfxdyL+TEt8kntuuR3rled9w9NIiikRERCSvEl/DFhERiQVK2CIiIjFACVtERCSAV155hZSUFC655BIqVqxImzZtmDNnTpHEoj5sEREp9k6fPk1cXFzYBzceOnSIxMREypUr53f/T37yEx544AGaN29OmTJlWLJkCQMHDqR69er87Gc/C2ss+VENW0RKrK5duzJkyBDGjh1LtWrVqFGjBr/73e84e/as75iGDRsybdq0886755578hwzadIk0tPTSUxMpF69eixatIgjR47Qt29fEhISaNq0KStWrMg3nsGDBzNs2DCqVKlClSpVGDlyZJ545s2bx5VXXkliYiI1atQgLS3NNxkHeB6xu++++6hduzZly5alXr16jB492rf/tddeo3Xr1pQvX56qVavSpUsXvvzyS9/+f/zjH7Rv355y5cpx2WWXMW7cOE6ePJnns06ePJlf//rXVKpUibp16zJ16tQ8n2Pr1q106dKFcuXK0axZM5YtW0ZCQgIvvvii75js7Gz69u3r+5y9evVi27Ztvv0PPfQQrVq14sUXX6Rx48aULVuWY8eOsXr1ajp06EBCQgKVK1fm6quvZsOGDUHv67lOnjzJX//6V/r06UPt2rXzfP5zde/enf/6r/+iefPmNG7cmGHDhtG6dWvef//9ApUZDlGrYZtZM2BRrk2NgAnOuRnRikFEoqcon7I494mPYObPn8+wYcP44IMPWL9+Pbfddhvt27fn1ltvLVCZM2bMYPLkyYwbN44//elPDBgwgO7du9O3b18mT57MlClT6N+/P3v27AlYm8uJJz09nQ8//JBPP/2UQYMGUatWLUaMGAF4ks3EiRNp3rw5hw
8fZtSoUdx6662sXr0agJkzZ7J48WIWLlxIw4YN2bdvH1u2bAHgwIED9O3blylTpnDTTTdx9OhRMjMzfWW/+eab9OvXjyeeeILOnTuzZ88eBg8ezIkTJ/J8aXn88ceZOHEiI0eO5I033uC+++7j2muvpWPHjpw9e5Zf/vKXXHrppWRmZnL8+HGGDx/OiRMnfOd///33dOvWjU6dOvHee+8RHx/PtGnT6NGjB5s2bfJNwLNz504WLFjAK6+8Qnx8POXKlaNPnz4MHDiQ+fPnc+rUKdatW0dcXFxI/43WrFnDnDlzWLRoEWXKlOHWW29lzZo1NGjQIKTznXO88847bNmyhYyMjJDOCaeoJWzn3BagDYCZxQHZwOJolS8i4k/Lli2ZNGkSAElJSTz33HO8/fbbBU7YqampDBkyBICJEycyffp0mjRpwu233w7A+PHjeeGFF9iwYQMpKSkBr1OrVi1mzpyJmdG8eXO2bt3K9OnTfQn7zjvv9B3bqFEjZs2aRYsWLdi3bx9169Zl9+7dJCUlcd1112Fm1K9fn06dOgGwf/9+Tp06xc033+xLUq1atfJdLyMjg5EjR3LHHXcA0LhxYx599FH69+/P1KlTfc3RN9xwg6+F4d5772XmzJm8/fbbdOzYkbfeeostW7awYsUK6tSpA3gS/DXXXOMrZ+HChTjnmD17tu+azzzzDDVq1GDJkiXccsstgOfLydy5c6lZsyYAX331FUeOHOEXv/gFjRs3BqB58+ZB/7vs27ePl156iZdeeok9e/bQp08f5s6dS2pqasiJ/ptvvqFOnTqcOHGCuLg4nn76aW688caQzg2nomoS/ynwuXNudxGVLyICQOvWrfO8r127NgcPHryg6yQkJFChQgWuuOIK37acpJPftTt06JCnn7Zjx45kZ2fz7bffArBu3Tr69OlDgwYNSExM9CX/PXv2AJCens769etJSkpi6NChLF261NeknpycTI8ePWjVqhU33XQTs2bN4tChQ76y1q5dS0ZGBgkJCb6f2267jWPHjnHgwAG/nxXy3rPNmzdTu3ZtX7IGuPLKKylV6sd0s3btWnbu3EliYqKvnMqVK/P111/z+eef+46rW7eu774BVK1alfT0dFJTU+nVqxfTp09n7969Qe/nAw88wLhx42jWrBl79uzh5Zdf5mc/+1nIyRogMTGR9evX8/HHH5ORkcGIESN4++23Qz4/XIpq0Flf4OUiKltEoqAgzdJFqUyZMnnem1mePuNSpUrhnMtzjL+pJf1dJ/e2nCSc+9oFdezYMVJTU+nRowdz586lRo0aHD58mOuuu87Xz9yuXTt27drF8uXLeeeddxgwYADJycm89dZbxMXFsWLFCjIzM1mxYgXPP/88Y8aM4b333iM5OZmzZ8/y4IMPkpaWdl7Z1atXD/pZcz6Xcy7fgWFnz56lTZs2LFy48Lx9VatW9b2uWLHieftnz57N8OHDWb58Oa+//jrjxo3jb3/7G6mpqX7LeuCBB6hVqxbz5s2jWbNm3HLLLfTv3z9PjT8/pUqVokmTJgC0adOGTZs28fDDD/PTn/405GuEQ9Rr2GYWD/QGXgmw/24zyzKzrNzf/EREikL16tX54osvfO9/+OEHNm/eHLHy1qxZk+cLQmZmJrVr16ZSpUps3ryZw4cP8/DDD9O5c2eaN2/ut8aemJhIWloas2bNYunSpbzzzjts374d8CTXjh078uCDD/Lxxx9Tu3ZtFi3yDC9q164dmzdvpkmTJuf9lC4dWv2uRYsWZGdns3//ft+2rKysPF9U2rVrx/bt26lWrdp55eRO2IEkJyczatQoVq1aRdeuXYM+ZtWkSROmTJnC7t27WbRoEUePHiU1NZXGjRszYcIEtm4t+FoRZ8+ezdMnHy1F0SR+I7DOOed3WJ5z7lnnXIpzLiX3NzoRkaLQvXt35s+fz6pVq/jss8+48847I7rm9/79+xk+fDhbtmzh1VdfZerUqdx///0A1K9fn7
Jly/LUU0+xY8cOli5dyvjx4/OcP336dF5++WU2bdrE9u3bWbBggW80d2ZmJpMnT+bjjz9mz549vP766+zdu5eWLVsCMGHCBBYsWMCECRPYsGEDmzdv5tVXX+X3v/99yPFff/31NGvWjAEDBvDJJ5+QmZnJiBEjKF26tK/m3a9fP2rWrEmfPn1477332LlzJ6tXr+a3v/1tnpHi59q5cyejR4/mgw8+YPfu3bz77rt8+umnvviDKVWqlK9l4sCBA4wfP55//vOftGjRwted4E9GRgYrV65kx44dbNq0iccee4y5c+fSv3//kO9JuBRFk/itqDlcRGLEmDFj2LVrF3369CEhIYFx48blqT2GW79+/Thz5gxXX301ZsbAgQN9Cbt69erMmTOHsWPH8vTTT9O6dWumT59Oz549fecnJiYydepUtm3bhpnRtm1b3njjDSpUqEDlypX517/+xZNPPsmRI0eoV68e48eP9yWf1NRUli5dyh/+8AemTZtG6dKlSUpKIj09PeT4S5UqxeLFi7nrrru46qqraNiwIY899hi/+tWvfKPjK1SowOrVqxk9ejRpaWl888031K5dm27dulGlSpWA165QoQJbt24lLS2Nw4cPU7NmTfr168eoUaMKdI8TEhJIT08nPT2d3bt3U61atYDHHj16lN/85jfs27eP8uXL07x5c1566aUCD0oMBzu3byaihZlVAPYCjZxz3+R3fEpKisvKyopoTLkfPQm2+Ees9MeJFIVNmzbRokWLog4j5nXt2pVWrVrx1FNPFXUoYfXJJ5/Qpk0bsrKyaN++fVGHU+SC/Xsxs7XOOb+PEUS1hu2c+x74STTLjITnVu8Iuo52xfg4hvdIYlDnRlGOTESk6C1evJiKFSvStGlTdu3axYgRI0hOTqZdu3ZFHVpM00xnhRAsWYNn/ewZKws+kEFE5GLw3Xffcc8999CyZUv69etHixYtePPNN7Vm+gXSXOKFECxZF+QYEZHcVq1aVdQhhMXtt9/umzBGwkcJ+wIF6/cWEREJFzWJi4iIxAAlbBERkRighC0iIhID1IcdIvVNi4hIUVINO4iK8cFXc8lvv4iISLgoYQcxvEdSwKScMzmKiIhINKhJPIhBnRtptjIREfFr48aNDB06lI0bN/rmQ+/bty8PPfQQ8fHxYS9PCVtERIq906dPExcXV+DZ0vbt20edOnUiMstafHw8AwYMoG3btlxyySV88sknDBo0iNOnT/PHP/4x7OWpSVxESqyuXbsyZMgQxo4dS7Vq1ahRowa/+93v8qzd3LBhQ6ZNm3beeffcc0+eYyZNmkR6ejqJiYnUq1ePRYsWceTIEfr27UtCQgJNmzZlxYoV+cYzePBghg0bRpUqVahSpQojR47ME8+8efO48sorSUxMpEaNGqSlpZGdne3bf+rUKe677z5q165N2bJlqVevHqNHj/btf+2112jdujXly5enatWqdOnShS+//HG143/84x+0b9+ecuXKcdlllzFu3DhOnjyZ57NOnjyZX//6175lO6dOnZrnc2zdupUuXbpQrlw5mjVrxrJly0hISODFF1/0HZOdnU3fvn19n7NXr155ltZ86KGHaNWqFS+++CKNGzembNmyHDt2jNWrV9OhQwcSEhKoXLkyV199NRs2bAh4T8ePH0+jRo2YMGGCb03wcGnSpAnp6ekkJyfToEEDevfuTb9+/Xj//ffDWk4O1bBFJDIeqlyEZee7GKDP/PnzGTZsGB988AHr16/ntttuo3379gVePnHGjBlMnjyZcePG8ac//YkBAwbQvXt3+vbty+TJk5kyZQr9+/dnz549vmUmA8WTnp7Ohx9+yKeffsqgQYOoVasWI0aMAODkyZNMnDiR5s2bc/jwYUaNGsWtt97K6tWrAZg5cyaLFy9m4cKFNGzYkH379rFlyxYADhw4QN++fZkyZQo33XQTR48eJTMz01f2m2++Sb9+/XjiiSfo3Lkze/bsYfDgwZw4cS
LPl5bHH3+ciRMnMnLkSN544w3uu+8+rr32Wjp27MjZs2f55S9/yaWXXkpmZibHjx9n+PDhnDhxwnf+999/T7du3ejUqRPvvfce8fHxTJs2jR49erBp0yYqVKgAeNa/XrBgAa+88grx8fGUK1eOPn36MHDgQObPn8+pU6dYt24dcXGBBwDPnDmTV199lblz55KRkUGHDh0YMGAAt9xyC5dccsl5x19++eXs3r074PUaNGjAZ5995nff9u3bWb58Ob179w54/oVQwhaREq1ly5ZMmjQJgKSkJJ577jnefvvtAifs1NRUhgwZAsDEiROZPn06TZo08c2pPX78eF544QU2bNhASorf1RMBqFWrFjNnzsTMaN68OVu3bmX69Om+hH3nnXf6jm3UqBGzZs2iRYsW7Nu3j7p167J7926SkpK47rrrMDPq169Pp06dANi/fz+nTp3i5ptvpkGDBgC0atXKd72MjAxGjhzJHXfcAUDjxo159NFH6d+/P1OnTvU1K99www2+FoZ7772XmTNn8vbbb9OxY0feeusttmzZwooVK6hTpw7gSfDXXHONr5yFCxfinGP27Nm+az7zzDPUqFGDJUuWcMsttwCeLydz586lZs2aAHz11VccOXKEX/ziFzRu3BiA5s2bB/3vkpiYyB133MEdd9zB3r17mTdvHjNmzGDYsGH07t2bAQMG0LNnT0qV8jQ4L1u2jFOnTgW8XpkyZc7b1qlTJ9atW8eJEycYNGgQDz/8cNCYCktN4iJSorVu3TrP+9q1a3Pw4MELuk5CQgIVKlTgiiuu8G3LSTr5XbtDhw55+ls7duxIdnY23377LQDr1q2jT58+NGjQgMTERF/y37NnDwDp6emsX7+epKQkhg4dytKlS31N6snJyfTo0YNWrVpx0003MWvWLA4dOuQra+3atWRkZJCQkOD7ue222zh27BgHDhzw+1kh7z3bvHkztWvX9iVrgCuvvNKXEHPK2blzJ4mJib5yKleuzNdff83nn3/uO65u3bq++wZQtWpV0tPTSU1NpVevXkyfPp29e/cGvZ+51atXjzFjxrBx40Zmz57Nm2++Sa9evXz3Djw16CZNmgT8yfmik9uiRYtYt24dCxYsYNmyZTz66KMhx1QQqmGLSGQUoFm6KJ1bYzKzPH3GpUqVwjmX5xh/NTB/18m9LScJ5752QR07dozU1FR69OjB3LlzqVGjBocPH+a6667z9TO3a9eOXbt2sXz5ct555x0GDBhAcnIyb731FnFxcaxYsYLMzExWrFjB888/z5gxY3jvvfdITk7m7NmzPPjgg6SlpZ1XdvXq1YN+1pzP5ZzLd4DX2bNnadOmDQsXLjxvX9WqVX2vK1aseN7+2bNnM3z4cJYvX87rr7/OuHHj+Nvf/kZqamrQMgH+7//+j0WLFjFv3jw+/vhjrr/+egYMGEDdunV9xxSmSbxevXqAp7XmzJkz3HXXXYwcOZLSpcObYpWwRUSCqF69Ol988YXv/Q8//MDmzZtp27ZtRMpbs2ZNnqSXmZlJ7dq1qVSpEmvXruXw4cM8/PDDXHbZZYBnENm5EhMTSUtLIy0tjfT0dDp06MD27dtJSkrCzOjYsSMdO3ZkwoQJXH755SxatIjk5GTatWvH5s2badKkSaHjb9GiBdnZ2ezfv5/atWsDkJWVleeLSrt27Xj55ZepVq2a337k/CQnJ5OcnMyoUaO48cYbmTNnTsCEfeLECZYsWcLcuXNZtmwZLVq04Pbbb+e1117j0ksvPe/4wjSJ53b27FlOnz7NmTNnlLBFRKKpe/fuvPDCC/Tu3Zvq1auTkZER9A/6hdq/fz/Dhw9nyJAh/Oc//2Hq1Kk88MADANSvX5+yZcvy1FNPMXToUDZt2sT48ePznD99+nRq1apFmzZtKFOmDAsWLPCN5s7MzGTlypWkpqZSs2ZN/v3vf7N3715atmwJwIQJE/j5z39OgwYNuOWWWyhdujQbNmzgo48+Cvkxpeuvv5
5mzZoxYMAApk2bxvHjxxkxYgSlS5f2fQnp168f06ZNo0+fPkyaNIn69euzd+9e/v73vzN48GCaNm3q99o7d+7kmWeeoXfv3tSpU4cdO3bw6aef8pvf/CZgPEOGDGHJkiXcdtttfPTRR7Rp0yZo/P6avAOZO3cu5cqV44orriA+Pp6srCzGjBnDzTffTNmyZUO+TqiimrDN7BLgz0ArwAF3Ouc+jEbZz63ewYyVWzl28kw0ihORi8SYMWPYtWsXffr0ISEhgXHjxrF///6IldevXz/OnDnD1VdfjZkxcOBA7r//fsBT258zZw5jx47l6aefpnXr1kyfPp2ePXv6zk9MTGTq1Kls27YNM6Nt27a88cYbVKhQgcqVK/Ovf/2LJ598kiNHjlCvXj3Gjx9P//79Ac/AuaVLl/KHP/yBadOmUbp0aZKSkkhPTw85/lKlSrF48WLuuusurrrqKho2bMhjjz3Gr371K9/o+AoVKrB69WpGjx5NWlqab9KRbt26UaVKlYDXrlChAlu3biUtLY3Dhw9Ts2ZN+vXrx6hRowKeM2bMGJ555pmw13YBSpcuzZQpU9i2bRvOORo0aMDQoUN9/73Czc7tm4kkM5sDvO+c+7OZxQMVnHNHAh2fkpLisrKywlL25ROWB03WFePj+GxSz4D7Q5V7kZBdj/S64OuJxIJNmzbRokWLog4j5nXt2pVWrVrx1FNPFXUoYfXJJ5/Qpk0bsrKyaN++fVGHU+SC/Xsxs7XOOb+PEUSthm1mlYDOQDqAc+4kcDLYOeGUX7LWvOAiIuGxePFiKlasSNOmTdm1axcjRozw9ZFL4UWzSbwRcAiYbWbJwFpgmHPuWO6DzOxu4G7w9NdEgmq+IiKR89133zFq1Cj27t1LlSpV6Nq1K48//nhEpgctSaKZsEsD7YB7nXNrzOwJYDSQZ8SEc+5Z4FnwNIlHMT4RkSK1atWqog4hLG6//XbfhDESPtGcOGUfsM85t8b7/lU8CVxERETyEbWE7Zw7AOw1s2beTT8FNkarfBGJrGgOYBWJVRfy7yTaz2HfC8z3jhDfAdwR5fJFJALKlCnD8ePHfYs2iIh/x48fz3fylUCimrCdc+uBwLPei0hMqlGjBtnZ2dSpU4fy5ctrcJHIOZxzHD9+nOzs7DzzoxeEZjoTkQtWqVIl4MfVoETkfGXKlKFmzZq+fy8FpYQdQbknUYEfn/ce1LlREUUkEjmVKlUq9B8iEclfSIPOzKxlrsFimNn1ZjbPzMaYWeCVw0ugivGBb8exk2eYsXJrFKMREZGLRaijxJ8H2gKYWV3g70BVYCgwOTKhxabhPZLyTdoiIiIFFWqTeAtgnfd1GrDGOfczM+sGzAbGRCK4WDSocyO/Td7nNo+LiIgURKgJO44f5/3+KbDM+/pzoHDD3aJM05GKiEgsCzVhbwB+Y2ZL8CTsnBp1HeBwJAIrSYIt/amBaiIiAqH3YY8CBgGrgJedc//xbu8NfBSBuEqUYOt0a6CaiIhAiDVs59xqM6sOVHLOfZ1r1zPA9xGJrATJbyCaBqqJiEjIz2E7586Y2Q9m1sq76XPn3K7IhFVy5e5r10A1ERHJEVLCNrOywKPAr4F4wIATZvYsMMo590PkQrz4KBGLiEhBhVrDngXcANwFfOjd1hGYAiQCd4Y/tItLxfi4fJu2gz2/LSIiJVuog87SgDucc/Odczu8P/OBgcDNkQvv4pHfhCo5o8FFRET8CbWGfQzI9rM9GzgevnAuXoEmVBEREQlFqDXsJ4EHzax8zgbv6/HefSIiIhJBodawOwBdgGwz+9S77Qrv+RXN7PWcA51zvcMbooiIiISasA8Dfz1n284wxyIiIiIBhDpxyh2RDkREREQCC3nilHAws13Ad8AZ4LRzLiWa5YuIiMSqgAnb21fdxTn3tZn9B3CBjnXOtS5Amd2cc1owREREpACC1bD/Cpzwvn41CrGIiIhIAAETtnNuIoCZlQJeAfY4545eYHkOWG
FmDnjGOffsuQeY2d3A3QD169e/wOJEREQuDqE8h+2A9cClYSjvGudcO+BGYKiZdT6vMOeedc6lOOdSqlevHoYiRUREYl++Cds554AtwAVnT+fcfu/vg8Bi4KoLvaaIiEhJEOpMZ78HpppZGzOzwhRkZhXNLDHnNZ7FRDYU5loiIiIlTaiPdf0FKAesBU6b2YncO51zlUK4Rk1gsTfflwYWOOeWFyBWERGREivUhH0vQR7rCoVzbgeQfCHXEBERKalCnensxQjHISIiIkGE1IdtZmfMrIaf7T8xszPhD0tERERyC3XQWaCBZmWBk2GKRURERAII2iRuZiO8Lx0w2MxyT5wSB1wHbI5QbCIiIuKVXx/2vd7fBtyFZ9GOHCeBXcDg8IclIiIiuQVN2M65ywDM7F3gV865r6MSlYiIiOQR6ijxbpEORERERAILddCZiIiIFKFQJ06RYuS51TuYsXIrx076f6KuYnwcw3skMahzoyhHJiIikaIadgwKlqwBjp08w4yVW6MYkYiIRFrAhG1mL+RarKOzmak2XkwES9YFOUZERGJHsCTcHxgLfAe8C9QCDkYjKDlfw9FL/W7f9UivkI4TEZHYFixh7wLuNbMVeJ7D7mhmfh/rcs6tjkBsJV7F+LigNeWK8XFRjEZERIpSsIQ9EngOGINnprPFAY5zeGY9kzAb3iMpYH91zsAyEREpGQImbOfc34G/m9klwFfA5ahJPKoGdW6kkd4iIgKE8FiXc+6ImXUDtjnnTkchJhERETlHqDOdvWdmZc3sdqAlnmbwjcAC59yJSAYoIiIioa+H3RLYCkwHrgY6AI8DW82sRUEKNLM4M/u3mS0paLAiIiIlVagTpzwBrAfqO+euc85dB9QHPgFmFLDMYcCmAp4jIiJSooWasK8Bxjrnvs3Z4H09Drg21MLMrC7QC/hzQYIUEREp6UJN2D8Al/jZXtm7L1QzgN8DZwtwjoiISIkXasL+B/CcmV3j7YOOM7NrgWeA10O5gJn9HDjonFubz3F3m1mWmWUdOnQoxPBEREQubqEm7GHANuB9PDXqH4D38AxEGx7iNa4BepvZLmAh0N3M5p17kHPuWedcinMupXr16iFeWkRE5OIW6mNdR4A+ZnKVFDIAABQySURBVNYEaIFnqtKNzrntoRbknBuDZ9Y0zKwr8DvnXP8CRywiIlICFWgFLm+CDjlJi4iISHgUyZKZzrlVwKqiKFtERCQWhdqHLSIiIkVICVtERCQG5Juwzay0mQ0xs9rRCEhERETOl2/C9q7QNRUoE/lwRERExJ9Qm8QzgXaRDEREREQCC3WU+HPAY2bWAFgLHMu90zm3LtyBiYiIyI9CTdgLvL+n+9nngLjwhCMiIiL+hJqwL4toFCIiIhJUqFOT7o50ICIiIhJYyM9hm9mNZrbEzDaaWT3vtrvM7KeRC09EREQgxIRtZv2Av+BZsesyfnzEKw7P+tYiIiISQaHWsH8PDHLO3Q+czrU9E2gT9qhEREQkj1ATdlPgQz/bjwKVwheOiIiI+BNqwt4PJPnZ3hn4PHzhiIiIiD+hJuxngZlmdo33fT0zGwD8EZgVkchERETEJ9THuv5oZpWBt4BywLvACWCac+7pCMYnIiIihD5xCs65cWaWAbTEUzPf6Jw7GrHIRERExCfkhO3lgB+8r8+EORYJs4ajl+Z5XzE+juE9khjUuVERRSQiIoUV6nPYZc1sBvAV8AnwKfCVmT1hZuVCvEY5M/vIzD4xs8/MbGLhw5ZAKsYHntb92MkzzFi5NYrRiIhIuIQ66GwWcDNwF55HvJp4X/8S+N8Qr3EC6O6cS8bz7HZPM+tQsHAlP8N7JOWbtEVEJPaE2iSeBvzKOfdWrm07zOwg8Ffgzvwu4JxzeJ7bBs9MaWXwNLFLGA3q3Mhvk/e5zeMiIhJbQk3Yx4BsP9uzgeOhFmZmcXjW024CPO2cW+PnmLuBuwHq168f6qWlAPwlb/Vvi4gUb6E2iT8JPG
hm5XM2eF+P9+4LiXPujHOuDVAXuMrMWvk55lnnXIpzLqV69eqhXlryEayZHNS/LSJS3AWsYZvZ6+ds6gpkm9mn3vdXeM+vWNBCnXNHzGwV0BPYUNDzpeCG90hixsqtQfuw1b8tIlJ8madr2c8Os9mhXsQ5d0e+BZlVB055k3V5YAXwqHNuSaBzUlJSXFZWVqhhSCEF699WU7mISPSY2VrnXIq/fQFr2KEk4QKqBczx9mOXAv4SLFlL9FSMjwtYu85pKlfCFhEpWgWdOKXQnHOfAm2jVZ6ELr/mcjWVi4gUvZAStplVAR4CugE1OGewmnOuRtgjk6jRo2AiIsVfqDXsl4DLgTnAl+j5aRERkagKNWF3Bbo459ZFMBYREREJINTnsD8vwLEiIiISZqEm4WHAFDNL9o7yFhERkSgKtUl8O1AeWAdgZnl2OueUxEVERCIo1IT9MlAZuA8NOhMREYm6UBN2CnCVc07TiIqIiBSBUBP2RqBSJAOR4k0rfImIFK1QE/YDwHQzewD4D3Aq907n3FfhDkyKXrApS8EzA1rGsk1kLNt03nlK5CIi4RXqKPFlwFV4FuzYDxzy/hz2/paL0PAeSfkuy+mPluoUEQm/UGvY3SIahRRLgaYsBXhu9Q7NPy4iEkUhJWzn3HuRDkRii+YfFxGJrlAX/2gXbL+mLBUREYmsUJvEs/A8e517xpTcz2Jr4hQREZEICjVhX3bO+zJ41rYeB4wJa0QiIiJynlD7sHf72bzdzL4BHgTeCGtUIiIikseFrsC1E2gTjkBEREQksFAHnVU9dxNQC3gI2BLiNeoBLwGXAmeBZ51zT4QcqYiISAkWah/2Yc5f8MOAvcB/h3iN08BvnXPrzCwRWGtmbznnNoZ4voiISIlV2IlTzuKZ4Wy7c+50KBdwzn0BfOF9/Z2ZbQLq4JmnXERERIIokolTzKwhnlHma/zsuxu4G6B+/frhLFZERCRmBU3Yfvqu/SrI4h9mlgD8FRjunPvWz7WeBZ4FSElJ0brbIiIi5F/D9td3fS4XwnUAMLMyeJL1fOfca6GcIyIiIvkn2mCLfvQEhuEZTJYvMzPgeWCTc256aOGJiIgI5JOw/fVde+cVfxToDDwD/CHEsq4B/gf4j5mt924b65xbFnq4IiIiJVOoo8Qxs8uADCANeA1o6Zz7PNTznXP/JO9c5HKRO3flrorxcQzvkRRwyU4REQks35nOzOwnZvYEsBnPpCcdnXP/XZBkLSVHxfjA68AcO3mGGSu3RjEaEZGLR9CEbWZjgc+BLkAf51x351xWVCKTmDS8R1K+SVtERArOnAs8CNzMzgLHgXfxTJbil3Oud/hD8zzWlZWl7wcXg9zN47se6VWEkYiIFF9mttY5l+JvX3592C+R/2NdIiIiEmH5jRJPj1IcUoI9t3oHM1Zu9dtcroFqIiIeIY8SFwmXc0ePB5MzUE0JW0RKugtdD1skJMEGouVHA9VERFTDligZ3iMpYLM3+G/6LkhNXETkYqeELVExqHMjNWuLiFwAJewPnoRVj8DJo/73xydA19HQ6d7oxiUiIpKL+rCDJWvw7Fv1SPTiERER8UMJO1iyLsgxIiIiEaQm8dwe+uac95X9H6dm9KjzNwBNz2iLSEmihF0YoTajK2FfkIrxcUEf6Tp28gwZyzaRsWzTeecpkYvIxUZN4oWhZvSoyG8hkUC0KpiIXIxUww5VoObxUJvRpcCCPQoWbDpT0GQrInLxKTkJO79+Z3/iE4IfH59w4XFJoQRK5ppsRUQuVlFL2Gb2AvBz4KBzrlW0yvXJL1n7S75dRwc+L2dgWTDn1rY1GE1ERAopmjXsF4Gn8CzZGX35JWt/ybfTvQVPrsFq5RqMJiIihRS1hO2cW21mDaNVXlDn9juHU7BaOXi2F6SfW7VyERGhJPVhR0ugWvnDdQo3cly18rDRutsiEsuK3WNdZna3mWWZWdahQ4eKOpzw6Tq68IPU9IhYWOQ3qlyPgolIcV
bsatjOuWeBZwFSUlJc2C4cyWbwUBSmP1yPiF2Qgo4Y16NgIlKcFbuELXIh8psdLeeYzyb19L3Pndg1BaqIFFfRfKzrZaArUM3M9gEPOueej1b5MU8D1UIyvEdS0KbvnOR77rb8pkCdsXKrEraIFKlojhK/NVplXTTym7glkBI8UC3Y7GiB5JfkQc3lIlL01CRenOX3iFgwGqgWsmBJXjOniUhxoYRdnGmgmoiIeClhi4Qo1Nq2BqmJSCQUu+ewRYoTLe8pIsWFatgXMy0+csFCGZDmjwapiUi4mXPhm5sk3FJSUlxWVlZRhxFbCjsFqpJ5WORuNt/1SK8ijEREYpGZrXXOpfjbpxr2xaawI8tPHoUVD3h+clMiL7RwjTBXn7iIgGrYJccHTxb+EbH4BBibHfr1SnCSv3zC8og0h587O5uIXJyC1bCVsOXCknkg/pJ8CRBsRbBwU81b5OKjhC2FU9j+8BxFveDKRaCwNXYlc5HYpD5sKZz8+sP9NX1r4pawupBR6hnLNpGxbFOe7YESudYKFyn+VMOW8MqdsFXDjpgLaXr31x+eX01efegi0aEmcYmecNewS/AAtsKKVD967sfU8itDtXKRwlHClui50H5vf0roALZwC6U/PNha4ZHmL8nri4GUNErYEj2RGHFeUKqV+1WY5Bepx9QiTYlcYpUStsSmSNTWA1GS96swg9Gi+WhbcaUvDFJYStgSm4pDbR2UzKNAXwzC/zlj7UuDnlTwUMKWkqM4JPlACV6zw0XFxZbIoyUaXxjC/XTDxajYJGwz6wk8AcQBf3bOPRLseCVsiYrikOSjRV8mIk5fGIq3SLXYhGuxn2KRsM0sDtgKXA/sAz4GbnXObQx0jhK2FLmSlMylSB1z5Xj89E38+UzeP/x3xS3l/tJ/paL9EJbrFVfBPudRV44ZxfyzXGwJuyPwkHMu1ft+DIBzbkqgc5SwJebkl+D91Vb1pUAk9oVpoqjiMjVpHWBvrvf7gKvPPcjM7gbuBqhfv350IhMJl073FrzpuDDnFIa+TIjEtGgmbPOz7bzqvXPuWeBZ8NSwIx2USIlRnL9MlHSF+TJ1IdcrrvSlMSg1iYuIiBQTwZrES0Uxjo+BpmZ2mZnFA32B16NYvoiISMyKWpO4c+60md0DvInnsa4XnHOfRat8ERGRWBbV9bCdc8uAZdEsU0RE5GIQzSZxERERKSQlbBERkRighC0iIhIDlLBFRERiQLFercvMDgG7izqOIlYNOFzUQZQAus/RofscPbrX0RHu+9zAOVfd345inbAFzCwr0EP0Ej66z9Gh+xw9utfREc37rCZxERGRGKCELSIiEgOUsIu/Z4s6gBJC9zk6dJ+jR/c6OqJ2n9WHLSIiEgNUwxYREYkBStgiIiIxQAm7GDGzF8zsoJltyLWtqpm9ZWbbvL+rFGWMFwMzq2dm75rZJjP7zMyGebfrXoeRmZUzs4/M7BPvfZ7o3a77HAFmFmdm/zazJd73us9hZma7zOw/ZrbezLK826J2n5Wwi5cXgZ7nbBsNvO2cawq87X0vF+Y08FvnXAugAzDUzFqiex1uJ4DuzrlkoA3Q08w6oPscKcOATbne6z5HRjfnXJtcz15H7T4rYRcjzrnVwFfnbO4DzPG+ngP8V1SDugg5575wzq3zvv4Ozx+5Ouheh5XzOOp9W8b749B9Djszqwv0Av6ca7Puc3RE7T4rYRd/NZ1zX4An0QA1ijiei4qZNQTaAmvQvQ47bzPteuAg8JZzTvc5MmYAvwfO5tqm+xx+DlhhZmvN7G7vtqjd59KRurBIcWdmCcBfgeHOuW/NrKhDuug4584AbczsEmCxmbUq6pguNmb2c+Cgc26tmXUt6nguctc45/abWQ3gLTPbHM3CVcMu/r40s1oA3t8Hiziei4KZlcGTrOc7517zbta9jhDn3BFgFZ4xGrrP4XUN0NvMdgELge5mNg/d57Bzzu33/j4ILAauIor3WQm7+HsdGOB9PQ
D4exHGclEwT1X6eWCTc256rl2612FkZtW9NWvMrDzQA9iM7nNYOefGOOfqOucaAn2Bd5xz/dF9Diszq2hmiTmvgRuADUTxPmums2LEzF4GuuJZru1L4EHgb8BfgPrAHiDNOXfuwDQpADO7Fngf+A8/9vmNxdOPrXsdJmbWGs8gnDg8lYO/OOcmmdlP0H2OCG+T+O+ccz/XfQ4vM2uEp1YNnu7kBc65jGjeZyVsERGRGKAmcRERkRighC0iIhIDlLBFRERigBK2iIhIDFDCFhERiQFK2CIxwswGm9nhoo6jOPDeC+f9meFnfzkzWx7g3Mxc52rmNYkZStgiIcr1Rz7Qz4shXmehmb0a4XBjUgG/lHwF1ALGF7CYnwHXFfAckSKnucRFQlcr1+ufA8+ds+14dMMp8Zxz7kDuDWZWE5gOdAYuNbMdwL+Bfs65H7wnfaWWColFqmGLhMg5dyDnBzhy7jbn3DcAZtbWzFaZ2XEz+z8z+3OuKQ0fAf4buClXzbyDd990M9vmPW+nmWWYWXxBYjSzUWa2wcyOmdleM5tlZpVy7R9sZofNrLeZbTWz783sNTNLMLNbzexzMztiZi+YWdlc55U3s6fM7JCZ/WBm/8qJ27u/p/ezJOTa1jx3s3OuY7qaWZa37DVmdkXOfmAW8JNc96agaws/BbQD7gAygXQ806HGFfA6IsWOErZIGHmT45t4FgC4EkgDugN/8h4yGc9cw0vw1M5rAWu9+74BbgdaAPfhSTojCxjCaeAe4HLvtboAj51zTCIwFM8XhxuAa4HXgFvwrO2b5v25K9c5M/Cs8/s/eBLiNmC5mVUrYHwADwMjgPbA98A87/Z3gFH82NRdC3iygNduC7wI/BM45pxb7Zwb55w7Vog4RYoVNYmLhNcAPF+EBzjnjgOY2RBgmZmNds7tNbMfgNLnNuc65ybmervLzBrjSZoZoRbunMudnHeZ2VhgLjAo1/Z44G7n3G5vfH/x7q/pXVVrg5ktBboBT5tZFWAgcJtzbrn3nEHAT4HBeL6EFMQY59xq73UmAyvNrJpz7rCZfYufpu4C+BdwJ7ClkOeLFFuqYYuEVwvg3znJ2uufgHn3BeRtkv7AzA6Y2VHgETwLCoTMzG4ws3fMLNvMvgMWAAlmVjXXYd/mJGuvL4F93mSde1sN7+umeJqU/5Wz0zl3Cs9iKS0LEp/Xp7le7/f+ruHvwEK4B88CDY8Bqd7ugfvNTH/rJObpf2KR8DIg0Io6AVfaMbMueGrCr+MZ0NYWmISnNhxawWZNgX/gGWR1E54m58He3bmvc8pPXP625fx9sCDx52w7e86xAGUChJq7rJzzw/K3yDn3nXNuNJ4ugX/iSdx/AO4Nx/VFipIStkh4bQTaedd/znEtnsS02fv+JOcPgroW+Nw594hzLss5tw1oWMCyr8LTnPxb51ymc24rULfAn+B8W4Ez3hgBMLMywNV4Pi/AIe/v3KPm2xSiLH/3prCOOedm4xkvcG1+B4sUd0rYIuE1B09t80Uza2Vm3YCngZedc3u9x+wCks2sqZlVM7PSeJLiZWZ2i5k1NrP78NSSC2IbUNbM7jGzy8zsf4AhF/qBnHNfA38GHjOzVDNrieeRtkTgGe9hG4EDwCTv57oRKOgIb/Dcm8pm1sV7b8rnd0JuZvakmV3nja2UedY+74an1UEkpilhi4SRc+5bIBWoCXwMvAq8y49N0+B5dGknniRyCEjxHvck8L/Aejw1wtyD0EIp+yM8o8rHAp8B/fGMug6H+/E0188D1gFJQE/n3GFv2SeAvniaoj8FxnnjKKh3gdl4Rq0fAoYV8Px9wExgB9ADeAVYBEwrRCwixYo5F7BbTUSkWDKzwcBk55zfx8rMrBzwN+dczwD7mwObgCuccxsiF6lI+KiGLSKx6idmdtTMHi3ISWb2Dj8++y4SM1TDFpGY452gJudRsCM5TfMhnlsXKOd9u8c5dzLc8YlEghK2iIhIDFCTuIiISAxQwh
YREYkBStgiIiIxQAlbREQkBihhi4iIxID/BzymuGypI+YkAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 576x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"binned_dist_npass_lt3, binned_dist_npass_ge3 = df.count(binby=['total_amount'],\n",
" limits=[5, 50],\n",
" shape=64,\n",
" selection=[select_n_passengers_lt3, \n",
" select_n_passengers_ge3], \n",
" progress='widget')\n",
"\n",
"xvalues = np.linspace(5, 50, 64)\n",
"plt.figure(figsize=(8, 4))\n",
"plt.plot(xvalues, binned_dist_npass_lt3, drawstyle=\"steps-pre\", label='num passengers < 3', lw=3)\n",
"plt.plot(xvalues, binned_dist_npass_ge3, drawstyle=\"steps-pre\", label='num passengers >=3', lw=3)\n",
"plt.legend(fontsize=14)\n",
"plt.xlabel('Total amount [$]', fontsize=14)\n",
"plt.ylabel('Number of trips', fontsize=14)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Caption: One can create multiple histograms on different selections with just one pass over the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. Groupby aggregations with selections\n",
"\n",
"One of my favourite features of [Vaex](https://github.com/vaexio/vaex) is the ability to use selections inside aggregation functions. I often find myself wanting to do a group-by operation in which the aggregations follow some additional rule or filter. The SQL-esque way of doing this would be to run several separate queries, each of which first filters the data and then does the group-by aggregation, and later to join the outputs of those aggregations into one table. With Vaex, one can do this with a single operation, and with just one pass over the data! The following group-by example, run on over 1.1 billion rows, takes only 30 seconds to execute on my laptop!"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2020-02-14T10:55:26.617858Z",
"start_time": "2020-02-14T10:54:56.586569Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"