{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "from polars.lazy import *\n", "import numpy as np\n", "from string import ascii_letters\n", "import pandas as pd\n", "import os\n", "from typing import Union" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Laziness\n", "\n", "Py-polars has a lazy API that supports a subset of the eager API. Laziness means that operations aren't executed until you ask for them. Let's start with a short example..\n", "\n", "Below we'll create a DataFrame in an eager fashion (meaning that the creation of the DataFrame is executed at once)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+-----+-------+-----+\n", "| a | b | c |\n", "| --- | --- | --- |\n", "| i64 | f64 | str |\n", "+=====+=======+=====+\n", "| 0 | 0.111 | \"a\" |\n", "+-----+-------+-----+\n", "| 1 | 0.034 | \"b\" |\n", "+-----+-------+-----+\n", "| 2 | 0.56 | \"c\" |\n", "+-----+-------+-----+\n", "| 3 | 0.142 | \"d\" |\n", "+-----+-------+-----+\n", "| 4 | 0.584 | \"e\" |\n", "+-----+-------+-----+\n", "| 5 | 0.537 | \"f\" |\n", "+-----+-------+-----+\n", "| 6 | 0.643 | \"g\" |\n", "+-----+-------+-----+\n", "| 7 | 0.349 | \"h\" |\n", "+-----+-------+-----+\n", "| 8 | 0.716 | \"i\" |\n", "+-----+-------+-----+\n", "| 9 | 0.451 | \"j\" |\n", "+-----+-------+-----+" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pl.DataFrame({\"a\": np.arange(0, 10),\n", " \"b\": np.random.rand(10),\n", " \"c\": list(ascii_letters[:10])})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lazy DataFrame\n", "To make this a lazy dataframe we call the `.lazy` method. As we can see, not much happens." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldf = df.lazy()\n", "ldf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can filter this `DataFrame` on all the rows, but we'll see that again nothing happens. \n", "\n", "*Note the `col` and `lit` (meaning **column** and **literal value**) are part of the lazy **dsl** (domain specific language) and are needed to build a query plan.*" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldf = ldf.filter(col(\"a\") == (lit(2)))\n", "ldf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The query is only executed when we ask for it. This can be done with `.collect` method. \n", "Below we execute the query and obtain our results." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+-----+------+-----+\n", "| a | b | c |\n", "| --- | --- | --- |\n", "| i64 | f64 | str |\n", "+=====+======+=====+\n", "| 2 | 0.56 | \"c\" |\n", "+-----+------+-----+" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldf.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Why lazy?\n", "This laziness opens up quite some cool possibitlies from an optimization perspective. \n", "It allows polars to modify the query right before executing it and make suboptimal queries more performant. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's show this using various operations, comparing lazy execution with eager execution in both Polars and Pandas.\n", "\n", "Let's create 2 DataFrames with a fair number of columns and rows." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def rand_string(n: int, set_size: int, lower=True) -> str:\n", " s = \"\".join(np.random.choice(list(ascii_letters[:set_size]), n))\n", " if lower:\n", " return s.lower()\n", " return s" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Showing a subset of df_a:\n" ] }, { "data": { "text/plain": [ "+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+\n", "| key | column_0 | column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 |\n", "| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n", "| str | f64 | f32 | f64 | f32 | i64 | f64 | f64 | f64 | i64 |\n", "+=========+==========+==========+==========+==========+==========+==========+==========+==========+==========+\n", "| \"aacbb\" | 4.17 | 6.003 | 4.74 | 0.447 | 3 | 0.43 | 3.206 | 6.397 | 7 |\n", "+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+\n", "| \"abbca\" | 7.203 | 1.257 | 3.716 | 5.109 | 0 | 8.854 | 5.967 | 5.141 | 5 |\n", "+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+\n", "| \"accac\" | 0.001 | 3.436 | 4.738 | 4.253 | 8 | 6.743 | 5.28 | 4.888 | 0 |\n", "+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rows = 100_000\n", "columns = 30\n", "key_size = 5\n", "key_set_size = 4\n", "\n", "np.random.seed(1)\n", "\n", "dtypes = [np.float32, np.float64, np.int64]\n", "\n", "df_a = pl.DataFrame({f\"column_{i}\": np.array(np.random.rand(rows) * 10, dtype=np.random.choice(dtypes)) for i in range(columns)})\n", "s = pl.Series(\"key\", np.array([rand_string(key_size, key_set_size) for _ in range(rows)]))\n", "df_a.insert_at_idx(0, s)\n", "\n", "rows = 80_000\n", "columns = 8\n", "df_b = pl.DataFrame({f\"column_{i}\": np.array(np.random.rand(rows) * 10, dtype=np.random.choice(dtypes)) for i in range(columns)})\n", "s = pl.Series(\"key\", np.array([rand_string(key_size, key_set_size) for _ in range(rows)]))\n", "df_b.insert_at_idx(0, s)\n", "\n", "\n", "print(\"Showing a subset of df_a:\")\n", "# only show a sub_slice\n", "df_a[:3, :10]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df_a_pd = df_a.to_pandas()\n", "df_b_pd = df_b.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Update 21-10-2020\n", "Arrow 2.0 is out and Polars is now also fast in filtering. :)\n", "\n", "# Where Polars slightly loses (or wins?)\n", "Let's start with an operation where polars has historically been slower than pandas: filtering. A filter predicate creates a boolean array. Polars/Arrow stores these boolean values not as one byte per value,\n", "but as bits, meaning 1 byte stores 8 booleans. This reduces memory 8-fold, but adds some overhead on array creation. Before the Arrow 2.0 update, pandas was more than 5x faster here (though with a huge spread); as the timings below show, that gap has closed." ] },
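{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the memory difference, in plain numpy (my illustration of the 8-fold claim; polars/Arrow does the bit-packing internally):\n", "\n", "```python\n", "import numpy as np\n", "\n", "mask = np.random.rand(100_000) < 0.5  # byte mask: 1 byte per boolean\n", "print(mask.nbytes)                    # 100000 bytes\n", "\n", "bits = np.packbits(mask)              # bitmap: 1 bit per boolean\n", "print(bits.nbytes)                    # 12500 bytes\n", "```" ] },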
{ "cell_type": "markdown", "metadata": {}, "source": [ "Pandas has something called a blockmanager, which hugely increases filtering performance (I believe due to cache optimality).\n", "However, this blockmanager incurs performance hits when blocks are modified and block consolidation is triggered. Block consolidation is triggered when:\n", "\n", "* the blockmanager has > 100 blocks\n", "* a groupby operation is executed\n", "* one of these operations runs: diff, take, xs, reindex, _is_mixed_type, _is_numeric_mixed_type, values, fillna, replace, resample, concat\n", "\n", "Read more about the [blockmanager](https://uwekorn.com/2020/05/24/the-one-pandas-internal.html)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10000 loops, best of 5: 99.8 µs per loop\n" ] } ], "source": [ "%%timeit\n", "df_a[\"column_2\"] < 1" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 35.76 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000 loops, best of 5: 119 µs per loop\n" ] } ], "source": [ "%%timeit\n", "df_a_pd[\"column_2\"] < 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we use this mask to select rows from the DataFrame, the cost grows linearly with the number of columns. Applied to a DataFrame with a single column, the runtime is just a millisecond or two (in the timings below polars even comes out ahead), so the operation is fast in either library." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 5: 1.32 ms per loop\n" ] } ], "source": [ "%%timeit\n", "df_a[:, :1][df_a[\"column_2\"] < 1]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 2.32 ms per loop\n" ] } ], "source": [ "%%timeit\n", "df_a_pd.iloc[:, :1][df_a_pd[\"column_2\"] < 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "~~The performance stays approximately the same with more columns. Here we observe that with 30 columns, polars is still **~1.2x** slower.~~\n", "* Pandas has great row-wise filtering due to the block-manager (with unexpected performance hits)\n", "* Polars has great row-wise filtering due to embarrassingly parallel execution." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 2.33 ms per loop\n" ] } ], "source": [ "%%timeit\n", "df_a[df_a[\"column_2\"] < 1]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 2.41 ms per loop\n" ] } ], "source": [ "%%timeit\n", "df_a_pd[df_a_pd[\"column_2\"] < 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Where polars definitely wins\n", "However, polars wins in all the expensive operations. Joins and groupby operations take up most of the running time of a query. Below we see that a join in polars is more than **3x** faster than the join of pandas, and that a join operation can take **1000-3000x** the runtime of a DataFrame filter. \n", "\n", "It's better to be fast in the expensive operations."
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 5: 1.06 s per loop\n" ] } ], "source": [ "%%timeit\n", "df_a.join(df_b, left_on=\"key\", right_on=\"key\", how=\"inner\").shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 5: 3.95 s per loop\n" ] } ], "source": [ "%%timeit\n", "df_a_pd.merge(df_b_pd, left_on=\"key\", right_on=\"key\", how=\"inner\").shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the groupby operation with an aggregation on all the columns we see that polars is more than **5x** faster. Again, polars is embarrassingly parallel, which means it can sometimes be slower than pandas due to parallelization overhead. When that is the case it hardly matters, because the parallelization costs only a few extra milliseconds." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 3.28 ms per loop\n" ] } ], "source": [ "%%timeit\n", "df_a.groupby([\"key\"]).first().shape" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 16.2 ms per loop\n" ] } ], "source": [ "%%timeit\n", "df_a_pd.groupby(\"key\").first().shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Query optimization\n", "Every filter on a DataFrame leads to a new allocation, so a common sub-optimal pattern is applying several filters one after another.\n", "Let's see if laziness can help optimize that." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(58019, 31)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def eager(df: Union[pl.DataFrame, pd.DataFrame]):\n", " df = df[df['column_2'] < 9]\n", " df = df[df['column_3'] > 1]\n", " df = df[df['column_6'] > 1]\n", " df = df[df['column_4'] > 1]\n", " return df\n", " \n", "assert eager(df_a_pd).shape == eager(df_a).shape\n", "eager(df_a_pd).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager polars" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 5: 18.7 ms per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager pandas" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 5: 20.2 ms per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a_pd)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def lazy_query(df_a: pl.DataFrame):\n", " return (df_a.lazy().filter(col(\"column_2\") < lit(9))\n", " .filter(col(\"column_3\") > lit(1))\n", " .filter(col(\"column_6\") > lit(1))\n", " .filter(col(\"column_4\") > lit(1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lazy polars" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 5.88 ms per loop\n" ] } ], "source": [ "%%timeit\n", "lazy_query(df_a).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimization: Combine predicates\n", "Above, the query optimizer aggregated all the filters and executed them at once. This avoids the extra allocation at every filter operation. \n", "With this optimization we don't incur a performance hit for naively filtering at different locations in a query.\n", "\n", "Rewriting the query lazily increased performance by **~2x**; the sketch below shows what the optimizer effectively does." ] },
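{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the hand-combined equivalent (my illustration, shown in pandas syntax; whether eager polars supports `&` between boolean masks depends on the version). Combining the predicates builds one boolean mask and allocates one result instead of four intermediate DataFrames:\n", "\n", "```python\n", "def eager_combined(df: pd.DataFrame) -> pd.DataFrame:\n", "    # One combined mask, one allocation.\n", "    mask = ((df[\"column_2\"] < 9) & (df[\"column_3\"] > 1)\n", "            & (df[\"column_6\"] > 1) & (df[\"column_4\"] > 1))\n", "    return df[mask]\n", "```" ] },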
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Optimization: Projection pushdown (selecting only the needed columns)\n", "Let's look at another optimization. Say we are only interested in the columns `\"key\"` and `\"column_1\"`. \n", "\n", "A suboptimal eager query could be written like the one below. This query would be more performant if the projection (selecting columns) were done before the selection (filtering rows). \n", "Below we see that the lazy query is optimized and selects the needed columns before doing the filter operation. This speeds up the query **~1.5x** by not filtering columns that aren't part of the result (a hand-optimized equivalent is sketched after the timings)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def eager(df: Union[pl.DataFrame, pd.DataFrame]):\n", " df = df[df['column_2'] < 9]\n", " df = df[df['column_3'] > 1]\n", " df = df[df['column_6'] > 1]\n", " df = df[df['column_4'] > 1]\n", " return df[[\"key\", \"column_1\"]]\n", "\n", "def lazy_query(df_a: pl.DataFrame):\n", " return (df_a.lazy().filter(col(\"column_2\") < lit(9))\n", " .filter(col(\"column_3\") > lit(1))\n", " .filter(col(\"column_6\") > lit(1))\n", " .filter(col(\"column_4\") > lit(1))\n", " .select([col(\"key\"), col(\"column_1\")]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager polars" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 5: 19.1 ms per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager pandas" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 5: 23.9 ms per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a_pd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lazy polars" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 3.97 ms per loop\n" ] } ], "source": [ "%%timeit\n", "lazy_query(df_a).collect()\n" ] },
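{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, a sketch of the hand-optimized eager equivalent of what the pushdown does (my illustration, mirroring the indexing style of the `eager` function above): project down to the columns the query touches, then filter, then select the output columns:\n", "\n", "```python\n", "def eager_projected(df: Union[pl.DataFrame, pd.DataFrame]):\n", "    # Keep only the columns the filters and the result need ...\n", "    df = df[[\"key\", \"column_1\", \"column_2\", \"column_3\", \"column_4\", \"column_6\"]]\n", "    # ... then filter ...\n", "    df = df[df[\"column_2\"] < 9]\n", "    df = df[df[\"column_3\"] > 1]\n", "    df = df[df[\"column_6\"] > 1]\n", "    df = df[df[\"column_4\"] > 1]\n", "    # ... and finally select the output columns.\n", "    return df[[\"key\", \"column_1\"]]\n", "```" ] },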
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Optimization: Predicate pushdown\n", "The same trick can be done with predicates. A sub-optimal query would apply the filter after an expensive join operation." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def eager(df_a: Union[pl.DataFrame, pd.DataFrame], df_b: Union[pl.DataFrame, pd.DataFrame]):\n", " df_a = df_a[df_a[\"column_1\"] < 1]\n", " df_b = df_b[df_b[\"column_1\"] < 1]\n", " # pandas\n", " if hasattr(df_a, \"values\"):\n", " return df_a.merge(df_b, left_on=\"key\", right_on=\"key\")\n", " return df_a.join(df_b, left_on=\"key\", right_on=\"key\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager polars; filter before join" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 11.9 ms per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a, df_b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager pandas; filter before join" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 5: 43.9 ms per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a_pd, df_b_pd)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def eager(df_a: Union[pl.DataFrame, pd.DataFrame], df_b: Union[pl.DataFrame, pd.DataFrame]):\n", " # pandas\n", " if hasattr(df_a, \"values\"):\n", " df = df_a.merge(df_b, left_on=\"key\", right_on=\"key\")\n", " df = df[df[\"column_1_x\"] < 1]\n", " return df\n", " df = df_a.join(df_b, left_on=\"key\", right_on=\"key\")\n", " df = df[df[\"column_1\"] < 1]\n", " return df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager polars; filter after join" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 5: 1.18 s per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a, df_b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager pandas; filter after join" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 5: 4.83 s per loop\n" ] } ], "source": [ "%%timeit\n", "eager(df_a_pd, df_b_pd)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "def lazy_query(df_a: pl.DataFrame, df_b: pl.DataFrame):\n", " return (df_a.lazy()\n", " .join(df_b.lazy(), left_on=col(\"key\"), right_on=col(\"key\"))\n", " .filter(col(\"column_1\") < lit(1))\n", " )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lazy polars; filter after join" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 11.6 ms per loop\n" ] } ], "source": [ "%%timeit\n", "lazy_query(df_a, df_b).collect().shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, applying the filter on the wrong side of the join has a large effect, slowing down the query more than **66x**. \n", "In the lazy variant, the optimizer pushed the predicates down such that they are executed before the join."
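, "\n", "As a quick check (a sketch; assuming this py-polars version exposes `describe_optimized_plan` on `LazyFrame`), we can print the optimized plan and confirm the selections sit below the join node:\n", "\n", "```python\n", "plan = lazy_query(df_a, df_b)\n", "# The filters should appear before (below) the join in the optimized plan.\n", "print(plan.describe_optimized_plan())\n", "```"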
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Some other queries" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def lazy_query(df_a: pl.DataFrame, df_b: pl.DataFrame):\n", " return (df_a.lazy()\n", " .join(df_b.lazy(), left_on=col(\"key\"), right_on=col(\"key\"), how=\"inner\")\n", " .filter(col(\"column_1\") < lit(1))\n", " .groupby(\"key\")\n", " .agg([col(\"column_0\").agg_sum()])\n", " .select([col(\"key\"), col(\"column_0_sum\")])\n", " )" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 5: 13.7 ms per loop\n" ] } ], "source": [ "%%timeit\n", "lazy_query(df_a, df_b).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## UDFs (user-defined functions <3 laziness)\n", "The lazy API also has access to all the eager operations on `Series`, because those operations can be used as UDFs with almost no overhead (no serializing or pickling). Below we'll add a column `\"udf\"` with a `lambda` and the help of the eager API. It still needs some polishing, as we have to make sure we don't modify the dtypes. I hope you can imagine that this can be very powerful! :)\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 76.1 ms, sys: 0 ns, total: 76.1 ms\n", "Wall time: 16.3 ms\n" ] } ], "source": [ "%%time\n", "def lazy_query(df_a: pl.DataFrame, df_b: pl.DataFrame):\n", " return (df_a.lazy()\n", " .join(df_b.lazy(), left_on=col(\"key\"), right_on=col(\"key\"), how=\"inner\")\n", " .filter(col(\"column_1\") < lit(1))\n", " .groupby(\"key\")\n", " .agg([col(\"column_0\").agg_sum(), col(\"column_2\").agg_max().alias(\"foo\")])\n", " .with_column(col(\"foo\").apply(\n", " lambda series: pl.Series(\"\", np.ones(series.len(), dtype=np.float32) * series.sum() )\n", " ).alias('udf'))\n", " .select([col(\"key\"), col(\"column_0_sum\"), col(\"udf\"), col(\"foo\")])\n", " )\n", "\n", "lazy_query(df_a, df_b).collect()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "# More coming up later." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }