{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Polars for Pandas users\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden", "tags": [] }, "outputs": [], "source": [ "# Colab setup ------------------\n", "import os, sys, subprocess\n", "if \"google.colab\" in sys.modules:\n", " cmd = \"pip install --upgrade polars watermark\"\n", " process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", " stdout, stderr = process.communicate()\n", " data_path = \"https://s3.amazonaws.com/bebi103.caltech.edu/data/\"\n", "else:\n", " data_path = \"../data/\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "import polars as pl\n", "import polars.selectors as cs\n", "\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "As of September 2024, [Pandas](https://pandas.pydata.org/) is by far and away the most widely used data frame package for Python. We are using Polars primarily because, in my opinion, the API is more intuitive and therefore easier for beginners and experts alike to use. It is also faster, sometimes much faster. It is, however, important to know about Pandas and how to use it because many of your colleagues use it and many packages you may use do, too.\n", "\n", "Therefore, in this part of the lesson, I discuss how to convert a Polars data frame to Pandas, and vice versa. I also provide syntax for doing common tasks in Polars and Pandas. It is also worth reading the [section of the Polars user guide comparing Pandas and Polars](https://docs.pola.rs/user-guide/migration/pandas/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A sample data frame\n", "\n", "For ease of discussion and comparison, we will use a simple data frame that has two categorical columns, `'c1'`, and `'c2'`, two quantitative columns as floats, `'q1'`, and `'q2'`, and a column, `'i1'` of integer values. It also has an identity column, a unique identifier for each row that is useful when converting the data frame to tall format. Note that `'q1'` has a null value and `'q2'` has a NaN value." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (8, 6)
idc1c2q1q2i1
i64strstrf64f64i64
1"a""c"1.14.55
2"a""d"2.22.33
3"a""c"3.1NaN0
4"a""d"null1.12
5"b""c"2.97.84
6"b""d"1.72.33
7"b""c"3.01.14
8"b""d"7.30.81
" ], "text/plain": [ "shape: (8, 6)\n", "┌─────┬─────┬─────┬──────┬─────┬─────┐\n", "│ id ┆ c1 ┆ c2 ┆ q1 ┆ q2 ┆ i1 │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ i64 │\n", "╞═════╪═════╪═════╪══════╪═════╪═════╡\n", "│ 1 ┆ a ┆ c ┆ 1.1 ┆ 4.5 ┆ 5 │\n", "│ 2 ┆ a ┆ d ┆ 2.2 ┆ 2.3 ┆ 3 │\n", "│ 3 ┆ a ┆ c ┆ 3.1 ┆ NaN ┆ 0 │\n", "│ 4 ┆ a ┆ d ┆ null ┆ 1.1 ┆ 2 │\n", "│ 5 ┆ b ┆ c ┆ 2.9 ┆ 7.8 ┆ 4 │\n", "│ 6 ┆ b ┆ d ┆ 1.7 ┆ 2.3 ┆ 3 │\n", "│ 7 ┆ b ┆ c ┆ 3.0 ┆ 1.1 ┆ 4 │\n", "│ 8 ┆ b ┆ d ┆ 7.3 ┆ 0.8 ┆ 1 │\n", "└─────┴─────┴─────┴──────┴─────┴─────┘" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = dict(\n", " id=list(range(1, 9)),\n", " c1=['a']*4 + ['b']*4,\n", " c2=['c', 'd'] * 4,\n", " q1=[1.1, 2.2, 3.1, None, 2.9, 1.7, 3.0, 7.3],\n", " q2=[4.5, 2.3, np.nan, 1.1, 7.8, 2.3, 1.1, 0.8],\n", " i1=[5, 3, 0, 2, 4, 3, 4, 1],\n", ")\n", "\n", "df = pl.DataFrame(data)\n", "\n", "# Take a look\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## From Polars to Pandas and from Pandas to Polars\n", "\n", "If you have a Polars data frame, you can directly convert it to a Pandas data frame using the `to_pandas()`, method. Let's do that for our data frame." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idc1c2q1q2i1
01ac1.14.55
12ad2.22.33
23ac3.1NaN0
34adNaN1.12
45bc2.97.84
56bd1.72.33
67bc3.01.14
78bd7.30.81
\n", "
" ], "text/plain": [ " id c1 c2 q1 q2 i1\n", "0 1 a c 1.1 4.5 5\n", "1 2 a d 2.2 2.3 3\n", "2 3 a c 3.1 NaN 0\n", "3 4 a d NaN 1.1 2\n", "4 5 b c 2.9 7.8 4\n", "5 6 b d 1.7 2.3 3\n", "6 7 b c 3.0 1.1 4\n", "7 8 b d 7.3 0.8 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the null value because a NaN. All missing data in Pandas are NaN. (Well, not really. You can have an `object` data data type for a column that permits `None` values. However, when Pandas reads in data, when there are missing data, it assigns it to be NaN by default.)\n", "\n", "Note also that Pandas has an index displayed on the left side of the data frame. In general, we will not use Pandas indexes.\n", "\n", "Similarly, if you have a data frame in Pandas, you can convert it to a Polars data frame using the `pl.from_pandas()` function." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (8, 6)
idc1c2q1q2i1
i64strstrf64f64i64
1"a""c"1.14.55
2"a""d"2.22.33
3"a""c"3.1null0
4"a""d"null1.12
5"b""c"2.97.84
6"b""d"1.72.33
7"b""c"3.01.14
8"b""d"7.30.81
" ], "text/plain": [ "shape: (8, 6)\n", "┌─────┬─────┬─────┬──────┬──────┬─────┐\n", "│ id ┆ c1 ┆ c2 ┆ q1 ┆ q2 ┆ i1 │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ str ┆ f64 ┆ f64 ┆ i64 │\n", "╞═════╪═════╪═════╪══════╪══════╪═════╡\n", "│ 1 ┆ a ┆ c ┆ 1.1 ┆ 4.5 ┆ 5 │\n", "│ 2 ┆ a ┆ d ┆ 2.2 ┆ 2.3 ┆ 3 │\n", "│ 3 ┆ a ┆ c ┆ 3.1 ┆ null ┆ 0 │\n", "│ 4 ┆ a ┆ d ┆ null ┆ 1.1 ┆ 2 │\n", "│ 5 ┆ b ┆ c ┆ 2.9 ┆ 7.8 ┆ 4 │\n", "│ 6 ┆ b ┆ d ┆ 1.7 ┆ 2.3 ┆ 3 │\n", "│ 7 ┆ b ┆ c ┆ 3.0 ┆ 1.1 ┆ 4 │\n", "│ 8 ┆ b ┆ d ┆ 7.3 ┆ 0.8 ┆ 1 │\n", "└─────┴─────┴─────┴──────┴──────┴─────┘" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pl.from_pandas(pd.DataFrame(data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas and Polars for common tasks\n", "\n", "Below is a table listing common tasks with data frame done using Polars and Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Description | Pandas | Polars |\n", "| -------- | ------- | ------ |\n", "| Convert dictionary to df | `pd.DataFrame(data)` | `pl.DataFrame(data)` |\n", "| Make 2D Numpy array into df | `pd.DataFrame(my_ar, columns=['col1', 'col2', 'col3'])` | `pl.DataFrame(my_ar, schema=['col1', 'col2', 'col3']', orient='row')` |\n", "| Read from CSV | `pd.read_csv(file_name)` | `pl.read_csv(file_name)` |\n", "| Lazily read CSV | — | `pl.scan_csv(file_name)` |\n", "| Read from Excel | `pd.read_excel(file_name)` | `pl.read_excel(file_name)` |\n", "| Read from JSON | `pd.read_json(file_name)` | `pl.read_json(file_name)` |\n", "| Read from HDF5 | `pd.read_hdf(file_name)` | — |\n", "| Write CSV | `df.to_csv(fname, index=False)` | `df.write_csv(fname)` |\n", "| Rename columns | `df.rename(columns={'c1': 'cat1', 'c2': 'cat2'})` | `df.rename({'c1': 'cat1', 'c2': 'cat2'})` |\n", "| Get column `'q1'` as series | `df['q1']` or `df.get_column('q1')` or `df.loc[:,'q1']` | `df['q1']` or `df.get_column('q1')` |\n", "| Get column `'q1'` as data frame | `df[['q1']]` or `df.loc[:,['q1']]` | `df.select('q1')` |\n", "| Get columns `'c1'` and `'q2'` | `df[['c1', 'q2']]` or `df.loc[:, ['c1', 'q2']]` | `df.select('c1', 'q2')`|\n", "| Get columns containing floats | `df.select_dtypes(float)` | `df.select(cs.float())` |\n", "| Get row `4` | `df.loc[4, :]` | `df.row(4)` or `df.row(4, named=True)` |\n", "| Get row `4` as data frame | `df.loc[[4], :]` | `df[4]` |\n", "| Get row where `i1` is `2` | `df.loc[df['i1']==2, :]` | `df.row(by_predicate=pl.col('i1')==2)` or `df.filter(pl.col('i1')==2)` |\n", "| Sub df with rows where `c1` is `'a'` | `df.loc[df['c1']=='a', :]` | `df.filter(pl.col('c1')=='a')` |\n", "| Sub df where `c1` is `'a'` and `c2` is `'d'` | `df.loc[(df['c1']=='a') & (df['c2']=='d'), :]` | `df.filter((pl.col('c1')=='a') & (pl.col('c2')=='d'))` |\n", "| Iterate over columns of df | `for col, s in df.items()` | `for s in df` |\n", "| Iterate over rows of df | `for row_ind, row in df.iterrows()` | `for r in df.iter_rows(named=True)` |\n", "| Group by single column | `df.groupby('c1')` | `df.group_by('c1')` |\n", "| Group by maintaining order | `df.groupby('c1', sort=False)` | `df.group_by('c1', maintain_order=True)` |\n", "| Group by multiple columns | `df.groupby(['c1', 'c2'])` | `df.group_by(['c1', 'c2'])` |\n", "| Iterate over groups | `for group, subdf in df.groupby('c1')` | `for (group,), subdf in df.group_by('c1')`|\n", "| Iterate over nested groups | `for (g1, g2), subdf in df.groupby(['c1', 'c2'])` | `for (g1, g2), subdf in df.groupby(['c1', 'c2'])` |\n", "| Group by and apply mean¹ | `df.groupby('c1').mean(numeric_only=True)` | `df.group_by('c1').mean()` |\n", "| Group by and apply median to one column | `df.groupby('c1')['q1'].median()` | `df.group_by('c1').agg(pl.col('q1').median())` |\n", "| Group by and apply mean to two columns | `df.groupby('c1')[['q1', 'q2']].mean()` | `df.group_by('c1').agg(pl.col('q1', 'q2').mean())` |\n", "| Group by and apply custom func to col² | `df.groupby('c1')['q1'].apply(my_fun)` | `df.group_by('c1').agg(pl.col('q1').map_elements(my_fun, return_dtype=float))` |\n", "| Group by and apply custom func to 2 cols³ | `df.groupby('c1')[['q1', 'q2']].apply(my_fun)` | `df.group_by('c1').agg(pl.struct(['q1', 'q2']).map_elements(my_fun, return_dtype=float))` |\n", "| Group by and rank within each group | `df.groupby('c1')['q1'].rank()` | `df.select(pl.col('q1').rank().over('c1'))` |\n", "| Convert to tall format | `df.melt(value_name='value', var_name='var', id_vars='id')` | `df.unpivot(value_name='value', variable_name='var', index='id')` |\n", "| Pivot tall result above | `df_tall.pivot(columns='var', index='id').reset_index()` | `df_tall.pivot(on='var', index='id')` |\n", "| Select columns with string in name | `df.filter(regex='q')` or `df[df.columns[df.columns.str.contains('q')]]` | `df.select(cs.contains('q'))` |\n", "| Add column of zeros to data frame | `df['new_col'] = 0` or `df.assign(new_col=0)` | `df.with_columns(pl.lit(0).alias('new_col'))` |\n", "| Add a Numpy array as column | `df['new_col'] = my_array` or `df.assign(new_col=my_array)` | `df.with_columns(pl.Series(my_array).alias('new_col'))` |\n", "| Multiply two columns; make new column | `df['q1q2'] = df['q1'] * df['q2']` or `df.assign(q1q2=df['q1'] * df['q2']` | `df.with_columns((pl.col('q1') * pl.col('q2')).alias('q1q2'))` |\n", "| Apply a function to each row making new col⁴ | `df.assign(new_col=my_fun)` | `df.with_columns(pl.struct(pl.all()).map_elements(my_fun, return_dtype=float).alias('new_col'))` |\n", "| Drop rows with missing data | `df.dropna()` | `df.drop_nulls()` |\n", "| Sort according to a column | `df.sort_values(by='i1')` | `df.sort(by='i1')` |\n", "| Inner join two data frames⁵ | `pd.merge(df, df2, on=shared_columns)` | `df.join(df2, on=shared_columns)` |\n", "| Concatenate data frames vertically | `pd.concat((df, df2))` | `pl.concat((df, df2), how='diagonal')` |\n", "| Concatenate data frames horizontally | `pd.concat((df, df2), axis=1)` | `pl.concat((df, df2), how='horizontal')` |\n", "\n", "\n", "\n", "**Footnotes**\n", "\n", "1. Note that in Pandas, NaNs are omitted from calculations like means. In Polars, NaNs *are included*, and the result will be NaN. However, `null`s are not included.\n", "2. For Pandas, the function `my_fun` must take an `array_like` data type (list, Numpy array, Pandas Series, etc.) as input. For Polars, the function `my_fun` must take a Polars Series as input. It is wise to specify the data type of the output of the function (shown as `float` in the above example, but can be whatever type `my_fun` returns). A Pandas example: `my_fun = lambda x: np.sum(np.sin(x))`. A Polars example: `my_fun = lambda s: s.exp().sum()`.\n", "3. For Pandas, the function must take a Pandas DataFrame as an argument. For Polars, it must take a Polars Series with a struct data type. A Pandas example: `my_fun = lambda df: (np.sin(s['q1']) * s['q2']).sum()`. A Polars example: `my_fun = lambda s: (s.struct.field('q1').sin() * s.struct.field('q2')).sum()`\n", "4. For Pandas, `my_fun` must take as its argument a Pandas Series with an index containing the names of the columns of the original data frame. For Polars, `my_fun` must take as its argument a dictionary with keys given by the names of the columns of the original data frame. The functions may then have the same syntax (though possibly with different type hints). An example: `my_fun = lambda r: r['i1'] * np.sin(r['q2'])`. However, note that in Polars, a `null` value is treated as `None`, which means you cannot apply a function to it, multiply by it, etc.\n", "5. For Polars, the `on` kwarg for `df.join()` is required. With Pandas, which columns to join on are inferred based on like-names of columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hierarchical indexes\n", "\n", "Pandas supports hierarchical indexes, called `MultiIndex`es. This is not supports by Polars. Polars will not read a CSV file with hierarchical indexes. If you have a data set in a CSV file with hierarchical indexes, you can convert it to a CSV file in tall format where the `MultiIndex` has been converted to columns using the `bebi103.utils.unpivot_csv()` function. This operation is akin to a `df.melt()` operation on a data frame with a hierarchical index. You can then read the converted CSV file into Polars and begin working with it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.12.4\n", "IPython version : 8.25.0\n", "\n", "numpy : 1.26.4\n", "pandas : 2.2.2\n", "polars : 1.6.0\n", "jupyterlab: 4.0.13\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p numpy,pandas,polars,jupyterlab" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 4 }