{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 10 minutes to polars\n", "This as short introduction to Polars to get you started with the basic concepts of data wrangling. It is very much influenced by [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).\n", "\n", "We start by importing Polars. If you run this for the first time, get a coffee. This will take a while" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "use polars::prelude::*;\n", "\n", "#[macro_use]\n", "extern crate polars;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Object creation\n", "Creating a `Series` by passing a list of nullable values. Note that we use `Option` to describe missing values." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Series: i32\n", "[\n", "\t1\n", "\t3\n", "\t5\n", "\tnull\n", "\t6\n", "\t8\n", "]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Series::new(\n", " \"some_values with ones\", \n", " &[Some(1), Some(3), Some(5), None, Some(6), Some(8)]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we dont have any missing values, we can just pass a slice of `T`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Series: i32\n", "[\n", "\t1\n", "\t3\n", "\t5\n", "\t7\n", "\t6\n", "\t8\n", "]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Series::new(\n", " \"some_non_null_values\", \n", " &[1, 3, 5, 7, 6, 8]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Series` are actually an `Enum` around different typed values of a `ChunkedArray`. \n", "You can think of a `ChunedkArray` as an array with a known type. Every `ChunkedArray` has a type alias that makes them more convenient to use. \n", "\n", "Some examples are:\n", "\n", "| Type | Alias |\n", "|-----------------------------|------------------|\n", "| `ChunkedArray` | `Float64Chunked` |\n", "| `ChunkedArray` | `UInt32Chunked` |\n", "| `ChunkedArray` | `BooleanChunked` |\n", "| `ChunkedArray` | `Utf8Chunked` |\n", "\n", "See all available data types [here](https://ritchie46.github.io/polars/polars/datatypes/index.html).\n", "\n", "Create a `ChunkedArray` with null values:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PrimitiveArray\n", "[\n", " null,\n", " 1,\n", " 2,\n", "]]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Int64Chunked::new_from_opt_slice(\"nullable\", &[None, Some(1), Some(2)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or create a `ChunkedArray` without null values." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PrimitiveArray\n", "[\n", " 1,\n", " 2,\n", " 3,\n", "]]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Int64Chunked::new_from_slice(\"non-nullable\", &[1, 2, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting from `Series` to a `ChunkedArray` can be done by defining there type." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ok([PrimitiveArray\n", "[\n", " 1,\n", " 2,\n", " 3,\n", "]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let s = Series::new(\"values\", &[1, 2, 3]);\n", "s.i32()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will return an `Err` if you specify the wrong type." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Err(DataTypeMisMatch)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.i64()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we can cast a `Series` to the proper type and then unpack." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ok([PrimitiveArray\n", "[\n", " 1,\n", " 2,\n", " 3,\n", "]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.cast::().unwrap().i64()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we use pattern matching to check if the cast was successful. Note that the clones on a `ChunkedArray` and a `Series` are very cheap, as the underlying data is wrapped by an `Arc`. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[PrimitiveArray\n", "[\n", " 1,\n", " 2,\n", " 3,\n", "]]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let ca = match s.i64() {\n", " Err(_) => {\n", " s.cast::()\n", " .unwrap()\n", " .i64()\n", " .map(|ca| ca.clone())\n", " .unwrap()\n", " },\n", " Ok(ca) => ca.clone()\n", "};\n", "ca" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting from a `ChunkedArray` to a `Series`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Series: i64\n", "[\n", "\t1\n", "\t2\n", "\t3\n", "]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ca.into_series()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `DataFrame` is created from a `Vec` of `Series`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+\n", "| dates | n | foos |\n", "| --- | --- | --- |\n", "| date32 | i32 | str |\n", "+============+=====+=======+\n", "| 2020-08-21 | 1 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-21 | 2 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-22 | 3 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-23 | 4 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-22 | 5 | \"foo\" |\n", "+------------+-----+-------+\n" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let dates = &[\n", " \"2020-08-21\",\n", " \"2020-08-21\",\n", " \"2020-08-22\",\n", " \"2020-08-23\",\n", " \"2020-08-22\",\n", "];\n", "let fmt = \"%Y-%m-%d\";\n", "let s0 = Date32Chunked::parse_from_str_slice(\"dates\", dates, fmt).into();\n", "let s1 = Series::new(\"n\", &[1, 2, 3, 4, 5]);\n", "let s2 = Utf8Chunked::full(\"foos\", \"foo\", 5).into();\n", "\n", "let df = DataFrame::new(vec![s0, s1, s2]).expect(\"something went wrong\");\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns of the resulting `DataFrame` have different data types." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Column: 'dates',\t dtype: Date32(Day)\n", "Column: 'n',\t dtype: Int32\n", "Column: 'foos',\t dtype: Utf8\n" ] }, { "data": { "text/plain": [ "()" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes()\n", " .iter()\n", " .zip(df.columns().iter())\n", " .for_each(|(dtype, name)| \n", " println!(\"Column: '{}',\\t dtype: {:?}\", name, dtype))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"dates\", \"n\", \"foos\"]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Viewing data\n", "\n", "Here is how to view the top and bottom rows of a DataFrame." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+\n", "| dates | n | foos |\n", "| --- | --- | --- |\n", "| date32 | i32 | str |\n", "+============+=====+=======+\n", "| 2020-08-21 | 1 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-21 | 2 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-22 | 3 | \"foo\" |\n", "+------------+-----+-------+\n" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(Some(3))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+\n", "| dates | n | foos |\n", "| --- | --- | --- |\n", "| date32 | i32 | str |\n", "+============+=====+=======+\n", "| 2020-08-22 | 3 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-23 | 4 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-22 | 5 | \"foo\" |\n", "+------------+-----+-------+\n" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail(Some(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting by a column:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+\n", "| dates | n | foos |\n", "| --- | --- | --- |\n", "| date32 | i32 | str |\n", "+============+=====+=======+\n", "| 2020-08-23 | 4 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-22 | 3 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-22 | 5 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-21 | 1 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-21 | 2 | \"foo\" |\n", "+------------+-----+-------+\n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let reverse = true;\n", "df.sort(\"dates\", reverse).expect(\"column not sortable\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Selection\n", "Selecting a single column, which yields a `Result`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Series: date32(day)\n", "[\n", "\t2020-08-21\n", "\t2020-08-21\n", "\t2020-08-22\n", "\t2020-08-23\n", "\t2020-08-22\n", "]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.column(\"dates\")\n", " .expect(\"columns don't exist\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting 1 or multiple columns, which yield another `Result`:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+\n", "| dates |\n", "| --- |\n", "| date32 |\n", "+============+\n", "| 2020-08-21 |\n", "+------------+\n", "| 2020-08-21 |\n", "+------------+\n", "| 2020-08-22 |\n", "+------------+\n", "| 2020-08-23 |\n", "+------------+\n", "| 2020-08-22 |\n", "+------------+\n" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.select(\"dates\")\n", " .expect(\"column does not exist\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+\n", "| dates | n |\n", "| --- | --- |\n", "| date32 | i32 |\n", "+============+=====+\n", "| 2020-08-21 | 1 |\n", "+------------+-----+\n", "| 2020-08-21 | 2 |\n", "+------------+-----+\n", "| 2020-08-22 | 3 |\n", "+------------+-----+\n", "| 2020-08-23 | 4 |\n", "+------------+-----+\n", "| 2020-08-22 | 5 |\n", "+------------+-----+\n" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.select(&[\"dates\", \"n\"])\n", " .expect(\"column does not exist\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `DataFrame` can also be sliced in to a subset of the DataFrame." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+\n", "| dates | n | foos |\n", "| --- | --- | --- |\n", "| date32 | i32 | str |\n", "+============+=====+=======+\n", "| 2020-08-22 | 3 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-23 | 4 | \"foo\" |\n", "+------------+-----+-------+\n" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let offset = 2;\n", "let length = 2;\n", "df.slice(offset, length)\n", " .expect(\"slice was not within bounds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a column by index:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Series: i32\n", "[\n", "\t1\n", "\t2\n", "\t3\n", "\t4\n", "\t5\n", "]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.select_at_idx(1)\n", " .expect(\"column was not within bounds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Boolean indexing\n", "Boolean indexes can be used to filter data. Note that this also works on `Series` and `ChunkedArray`. We also use the `as_result!` macro. This utility expects a block that returns a `Result`. This makes it to convenient to use the `?` operator." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+\n", "| dates | n | foos |\n", "| --- | --- | --- |\n", "| date32 | i32 | str |\n", "+============+=====+=======+\n", "| 2020-08-22 | 3 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-23 | 4 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-22 | 5 | \"foo\" |\n", "+------------+-----+-------+\n" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " // select the n column\n", " let n_s = df.column(\"n\")?;\n", " let mask = n_s.gt(2);\n", "\n", " // filter values > 2\n", " df.filter(&mask)\n", "}).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filter all values in the \"n\" column greater than 2 and smaller than 5:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+\n", "| dates | n | foos |\n", "| --- | --- | --- |\n", "| date32 | i32 | str |\n", "+============+=====+=======+\n", "| 2020-08-22 | 3 | \"foo\" |\n", "+------------+-----+-------+\n", "| 2020-08-23 | 4 | \"foo\" |\n", "+------------+-----+-------+\n" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " // select the n column\n", " let n_s = df.column(\"n\")?;\n", " \n", " // create the boolean mask\n", " let mask = (n_s.gt(2) & n_s.lt(5))?;\n", "\n", " // filter values > 2\n", " df.filter(&mask)\n", "}).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For all the comparison methods available on `Series` and `ChunkArrays` check the [ChunkCompare trait](https://ritchie46.github.io/polars/polars/chunked_array/ops/trait.ChunkCompare.html).\n", "\n", "# Setting\n", "Setting a new column can be done with the `hstack` operation. This is operation adds new columns to the existing `DataFrame`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+-------+\n", "| dates | n | foos | days |\n", "| --- | --- | --- | --- |\n", "| date32 | i32 | str | str |\n", "+============+=====+=======+=======+\n", "| 2020-08-21 | 1 | \"foo\" | \"mo\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-21 | 2 | \"foo\" | \"tue\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 3 | \"foo\" | \"wed\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-23 | 4 | \"foo\" | \"thu\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 5 | \"foo\" | \"fri\" |\n", "+------------+-----+-------+-------+\n" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let mut df = df;\n", "let s = Series::new(\"days\", &[\"mo\", \"tue\", \"wed\", \"thu\", \"fri\"]);\n", "df.hstack(&[s]).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It isn't possible to get mutable access to the columns of a `DataFrame`, because this would give you the possibility to invalidate the `DataFrame` (for instance by replacing the column with a `Series` with a different length).\n", "\n", "Luckely there are other ways to mutate a DataFrame. We could for instance replace a column in the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+-------+\n", "| dates | n | foos | days |\n", "| --- | --- | --- | --- |\n", "| date32 | i32 | str | str |\n", "+============+=====+=======+=======+\n", "| 2020-08-21 | 1 | \"bar\" | \"mo\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-21 | 2 | \"bar\" | \"tue\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 3 | \"bar\" | \"wed\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-23 | 4 | \"bar\" | \"thu\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 5 | \"bar\" | \"fri\" |\n", "+------------+-----+-------+-------+\n" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let s = Utf8Chunked::full(\"bars\", \"bar\", 5);\n", "df.replace(\"foos\", s).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or if we want to use the column we're replacing to determine the new column's values we can use the `apply` method and use a closure to create the new column.\n", "\n", "Below we use this determine `n + 1`:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+-------+\n", "| dates | n | foos | days |\n", "| --- | --- | --- | --- |\n", "| date32 | i32 | str | str |\n", "+============+=====+=======+=======+\n", "| 2020-08-21 | 2 | \"bar\" | \"mo\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-21 | 3 | \"bar\" | \"tue\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 4 | \"bar\" | \"wed\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-23 | 5 | \"bar\" | \"thu\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 6 | \"bar\" | \"fri\" |\n", "+------------+-----+-------+-------+\n" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(\"n\", |s| s + 1).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the `replace` and the `apply` methods exist for selection by index; \n", "* `replace_at_idx`\n", "* `apply_at_idx`" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "+------------+-----+-------+-------+\n", "| dates | n | foos | days |\n", "| --- | --- | --- | --- |\n", "| date32 | i32 | str | str |\n", "+============+=====+=======+=======+\n", "| 2020-08-21 | 4 | \"bar\" | \"mo\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-21 | 6 | \"bar\" | \"tue\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 8 | \"bar\" | \"wed\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-23 | 10 | \"bar\" | \"thu\" |\n", "+------------+-----+-------+-------+\n", "| 2020-08-22 | 12 | \"bar\" | \"fri\" |\n", "+------------+-----+-------+-------+\n" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply_at_idx(1, |s| s * 2)\n", " .unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can apply a closure to the values that are valid under a condition constraint:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+---------------------+-------+\n", "| dates | n | foos | days |\n", "| --- | --- | --- | --- |\n", "| date32 | i32 | str | str |\n", "+============+=====+=====================+=======+\n", "| 2020-08-21 | 4 | \"not_within_bounds\" | \"mo\" |\n", "+------------+-----+---------------------+-------+\n", "| 2020-08-21 | 6 | \"bar\" | \"tue\" |\n", "+------------+-----+---------------------+-------+\n", "| 2020-08-22 | 8 | \"bar\" | \"wed\" |\n", "+------------+-----+---------------------+-------+\n", "| 2020-08-23 | 10 | \"not_within_bounds\" | \"thu\" |\n", "+------------+-----+---------------------+-------+\n", "| 2020-08-22 | 12 | \"not_within_bounds\" | \"fri\" |\n", "+------------+-----+---------------------+-------+\n" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " let mask = (df.column(\"n\")?.gt(4) & df.column(\"n\")?.lt(10))?;\n", " \n", " df.may_apply(\"foos\", |s| {\n", " s.utf8()?\n", " .set(&!mask, Some(\"not_within_bounds\"))\n", " }\n", " )\n", " }\n", ").unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Iterators\n", "Every `ChunkedArray` implements the [IntoIterator trait](https://doc.rust-lang.org/std/iter/trait.IntoIterator.html) which gives us all the powerful trait methods available for iterators." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Some(15)\n" ] }, { "data": { "text/plain": [ "Ok(())" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " let s = Series::new(\"a\", [1, 2, 3, 4, 5]);\n", " \n", " let v = s.i32()?\n", " .into_iter()\n", " .sum::>();\n", " \n", " println!(\"{:?}\", v);\n", " \n", " Ok(())\n", "})" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+------------+-----+---------------------+----------+\n", "| dates | n | foos | days |\n", "| --- | --- | --- | --- |\n", "| date32 | i32 | str | str |\n", "+============+=====+=====================+==========+\n", "| 2020-08-21 | 4 | \"not_within_bounds\" | \"mo_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-21 | 6 | \"bar\" | \"tue_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-22 | 8 | \"bar\" | \"wed_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-23 | 10 | \"not_within_bounds\" | \"thu_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-22 | 12 | \"not_within_bounds\" | \"fri_ay\" |\n", "+------------+-----+---------------------+----------+\n" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " // adds \"ay\" to every word.\n", " fn to_pig_latin(opt_val: Option<&str>) -> Option {\n", " opt_val.map(|val| format!(\"{}_ay\", val))\n", " }\n", " \n", " // may apply takes a closure that may fail.\n", " df.may_apply(\"days\", |s| {\n", " let ca: Utf8Chunked = s.utf8()?\n", " .into_iter()\n", " .map(to_pig_latin)\n", " .collect();\n", " Ok(ca)\n", " });\n", " \n", " Ok(df.clone())\n", "}).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Concat\n", "\n", "Polars provides various facilities for easily combining `DataFrames` and `Series`.\n", "\n", "We can concatenate a `DataFrame` with `hstack`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+-----+---------------------+----------+------------+-----+---------------------+----------+\n", "| dates | n | foos | days | dates | n | foos | days |\n", "| --- | --- | --- | --- | --- | --- | --- | --- |\n", "| date32 | i32 | str | str | date32 | i32 | str | str |\n", "+============+=====+=====================+==========+============+=====+=====================+==========+\n", "| 2020-08-21 | 4 | \"not_within_bounds\" | \"mo_ay\" | 2020-08-21 | 4 | \"not_within_bounds\" | \"mo_ay\" |\n", "+------------+-----+---------------------+----------+------------+-----+---------------------+----------+\n", "| 2020-08-21 | 6 | \"bar\" | \"tue_ay\" | 2020-08-21 | 6 | \"bar\" | \"tue_ay\" |\n", "+------------+-----+---------------------+----------+------------+-----+---------------------+----------+\n", "| 2020-08-22 | 8 | \"bar\" | \"wed_ay\" | 2020-08-22 | 8 | \"bar\" | \"wed_ay\" |\n", "+------------+-----+---------------------+----------+------------+-----+---------------------+----------+\n", "| 2020-08-23 | 10 | \"not_within_bounds\" | \"thu_ay\" | 2020-08-23 | 10 | \"not_within_bounds\" | \"thu_ay\" |\n", "+------------+-----+---------------------+----------+------------+-----+---------------------+----------+\n", "| 2020-08-22 | 12 | \"not_within_bounds\" | \"fri_ay\" | 2020-08-22 | 12 | \"not_within_bounds\" | \"fri_ay\" |\n", "+------------+-----+---------------------+----------+------------+-----+---------------------+----------+\n", "\n" ] } ], "source": [ "{\n", " let mut df1 = df.clone(); \n", " \n", " df1.hstack(df.get_columns());\n", " \n", " println!(\"{:?}\", df1);\n", "};" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or append the rows of a second DataFrame:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+-----+---------------------+----------+\n", "| dates | n | foos | days |\n", "| --- | --- | --- | --- |\n", "| date32 | i32 | str | str |\n", "+============+=====+=====================+==========+\n", "| 2020-08-21 | 4 | \"not_within_bounds\" | \"mo_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-21 | 6 | \"bar\" | \"tue_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-22 | 8 | \"bar\" | \"wed_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-23 | 10 | \"not_within_bounds\" | \"thu_ay\" |\n", "+------------+-----+---------------------+----------+\n" ] } ], "source": [ "{\n", " let mut df1 = df.clone(); \n", " \n", " df1.vstack(&df);\n", " \n", " println!(\"{:?}\", df1);\n", "};" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Join\n", "SQL-style joins. " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 2020-08-22 | 12 | \"not_within_bounds\" | \"fri_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-21 | 4 | \"not_within_bounds\" | \"mo_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-21 | 6 | \"bar\" | \"tue_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-22 | 8 | \"bar\" | \"wed_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-23 | 10 | \"not_within_bounds\" | \"thu_ay\" |\n", "+------------+-----+---------------------+----------+\n", "| 2020-08-22 | 12 | \"not_within_bounds\" | \"fri_ay\" |\n", "+------------+-----+---------------------+----------+\n", "\n", "+-------+------+\n", "| key | lval |\n", "| --- | --- |\n", "| str | i32 |\n", "+=======+======+\n", "| \"foo\" | 1 |\n", "+-------+------+\n", "| \"foo\" | 2 |\n", "+-------+------+\n", "\n", "+-------+------+\n", "| key | rval |\n", "| --- | --- |\n", "| str | i32 |\n", "+=======+======+\n", "| \"foo\" | 4 |\n", "+-------+------+\n", "| \"foo\" | 5 |\n", "+-------+------+\n", "\n" ] }, { "data": { "text/plain": [ "+-------+------+------+\n", "| key | lval | rval |\n", "| --- | --- | --- |\n", "| str | i32 | i32 |\n", "+=======+======+======+\n", "| \"foo\" | 1 | 4 |\n", "+-------+------+------+\n", "| \"foo\" | 2 | 4 |\n", "+-------+------+------+\n", "| \"foo\" | 1 | 5 |\n", "+-------+------+------+\n", "| \"foo\" | 2 | 5 |\n", "+-------+------+------+\n" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " let left = DataFrame::new(vec![\n", " Series::new(\"key\", &[\"foo\", \"foo\"]),\n", " Series::new(\"lval\", &[1, 2]),\n", " ])?;\n", " \n", " let right = DataFrame::new(vec![\n", " Series::new(\"key\", &[\"foo\", \"foo\"]),\n", " Series::new(\"rval\", &[4, 5]),\n", " ])?;\n", " \n", " println!(\"{:?}\", left);\n", " println!(\"{:?}\", right);\n", " \n", " left.inner_join(&right, \"key\", \"key\")\n", "}).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another example that can be given is:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+------+\n", "| key | lval |\n", "| --- | --- |\n", "| str | i32 |\n", "+=======+======+\n", "| \"foo\" | 1 |\n", "+-------+------+\n", "| \"bar\" | 2 |\n", "+-------+------+\n", "\n", "+-------+------+\n", "| key | rval |\n", "| --- | --- |\n", "| str | i32 |\n", "+=======+======+\n", "| \"foo\" | 4 |\n", "+-------+------+\n", "| \"bar\" | 5 |\n", "+-------+------+\n", "\n" ] }, { "data": { "text/plain": [ "+-------+------+------+\n", "| key | lval | rval |\n", "| --- | --- | --- |\n", "| str | i32 | i32 |\n", "+=======+======+======+\n", "| \"foo\" | 1 | 4 |\n", "+-------+------+------+\n", "| \"bar\" | 2 | 5 |\n", "+-------+------+------+\n" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " \n", " let left = DataFrame::new(vec![\n", " Series::new(\"key\", &[\"foo\", \"bar\"]),\n", " Series::new(\"lval\", &[1, 2]),\n", " ])?;\n", " \n", " let right = DataFrame::new(vec![\n", " Series::new(\"key\", &[\"foo\", \"bar\"]),\n", " Series::new(\"rval\", &[4, 5]),\n", " ])?;\n", " \n", " println!(\"{:?}\", left);\n", " println!(\"{:?}\", right);\n", " \n", " left.inner_join(&right, \"key\", \"key\")\n", "}).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Grouping\n", "\n", "By \"group by\" we are referring to a process involving one or more of the following steps:\n", "* **Splitting** the data into groups based on some criteria\n", "* **Applying** a function to each group independently\n", "* **Combining** the results into a data structure" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+-------+---------+-----+-----+\n", "| A | B | C | D |\n", "| --- | --- | --- | --- |\n", "| str | str | i32 | i32 |\n", "+=======+=========+=====+=====+\n", "| \"foo\" | \"one\" | 1 | 1 |\n", "+-------+---------+-----+-----+\n", "| \"bar\" | \"one\" | 1 | 2 |\n", "+-------+---------+-----+-----+\n", "| \"foo\" | \"two\" | 1 | 3 |\n", "+-------+---------+-----+-----+\n", "| \"bar\" | \"three\" | 1 | 4 |\n", "+-------+---------+-----+-----+\n", "| \"foo\" | \"two\" | 1 | 5 |\n", "+-------+---------+-----+-----+\n", "| \"bar\" | \"two\" | 1 | 6 |\n", "+-------+---------+-----+-----+\n", "| \"foo\" | \"one\" | 1 | 7 |\n", "+-------+---------+-----+-----+\n", "| \"foo\" | \"three\" | 1 | 8 |\n", "+-------+---------+-----+-----+\n" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let df = DataFrame::new(vec![\n", " Series::new(\"A\", &[\"foo\", \"bar\", \"foo\", \"bar\",\n", " \"foo\", \"bar\", \"foo\", \"foo\"]),\n", " Series::new(\"B\", &[\"one\", \"one\", \"two\", \"three\",\n", " \"two\", \"two\", \"one\", \"three\"]),\n", " Int32Chunked::full(\"C\", 1, 8).into(),\n", " Series::new(\"D\", &[1, 2, 3, 4,\n", " 5, 6, 7, 8])\n", "]).unwrap();\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping and then applying the `sum()` method to the resulting groups:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+-------+-------+\n", "| A | C_sum |\n", "| --- | --- |\n", "| str | i32 |\n", "+=======+=======+\n", "| \"foo\" | 5 |\n", "+-------+-------+\n", "| \"bar\" | 3 |\n", "+-------+-------+\n" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " (&df).groupby(\"A\")?.select(\"C\").sum()\n", "}).unwrap()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+-------+---------+-------+\n", "| A | B | C_sum |\n", "| --- | --- | --- |\n", "| str | str | i32 |\n", "+=======+=========+=======+\n", "| \"bar\" | \"three\" | 1 |\n", "+-------+---------+-------+\n", "| \"foo\" | \"one\" | 2 |\n", "+-------+---------+-------+\n", "| \"bar\" | \"two\" | 1 |\n", "+-------+---------+-------+\n", "| \"foo\" | \"three\" | 1 |\n", "+-------+---------+-------+\n", "| \"bar\" | \"one\" | 1 |\n", "+-------+---------+-------+\n", "| \"foo\" | \"two\" | 2 |\n", "+-------+---------+-------+\n" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " (&df).groupby(&[\"A\", \"B\"])?.select(\"C\").sum()\n", "}).unwrap()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pivot tables\n", "Pivots create a summary table by a applying a groupby and defining a pivot column and values to aggregate." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+---------+-----+-------+-----+\n", "| A | B | C | E |\n", "| --- | --- | --- | --- |\n", "| str | str | str | i32 |\n", "+=========+=====+=======+=====+\n", "| \"one\" | \"A\" | \"foo\" | 0 |\n", "+---------+-----+-------+-----+\n", "| \"one\" | \"B\" | \"foo\" | 1 |\n", "+---------+-----+-------+-----+\n", "| \"two\" | \"C\" | \"foo\" | 2 |\n", "+---------+-----+-------+-----+\n", "| \"three\" | \"A\" | \"bar\" | 3 |\n", "+---------+-----+-------+-----+\n", "| \"one\" | \"B\" | \"bar\" | 4 |\n", "+---------+-----+-------+-----+\n", "| \"one\" | \"C\" | \"bar\" | 5 |\n", "+---------+-----+-------+-----+\n", "| \"two\" | \"A\" | \"foo\" | 6 |\n", "+---------+-----+-------+-----+\n", "| \"three\" | \"B\" | \"foo\" | 7 |\n", "+---------+-----+-------+-----+\n", "| \"one\" | \"C\" | \"foo\" | 8 |\n", "+---------+-----+-------+-----+\n", "| \"one\" | \"A\" | \"bar\" | 9 |\n", "+---------+-----+-------+-----+\n" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let s0 = Series::new(\"A\", &[\"one\", \"one\", \"two\", \"three\",\n", " \"one\", \"one\", \"two\", \"three\",\n", " \"one\", \"one\", \"two\", \"three\"\n", "]);\n", "let s1 = Series::new(\"B\", &[\"A\", \"B\", \"C\",\n", " \"A\", \"B\", \"C\",\n", " \"A\", \"B\", \"C\",\n", " \"A\", \"B\", \"C\",\n", "]);\n", "let s2 = Series::new(\"C\", &[\"foo\", \"foo\", \"foo\", \"bar\", \"bar\", \"bar\",\n", " \"foo\", \"foo\", \"foo\", \"bar\", \"bar\", \"bar\"\n", "]);\n", "let s3 = Series::new(\"E\", &((0..12).collect::>()));\n", "\n", "let df = DataFrame::new(vec![s0, s1, s2, s3]).unwrap();\n", "df" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+---------+-----+-----+\n", "| A | foo | bar |\n", "| --- | --- | --- |\n", "| str | i32 | i32 |\n", "+=========+=====+=====+\n", "| \"one\" | 9 | 18 |\n", "+---------+-----+-----+\n", "| \"three\" | 7 | 14 |\n", "+---------+-----+-----+\n", "| \"two\" | 8 | 10 |\n", "+---------+-----+-----+\n" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " (&df).groupby(&[\"A\"])?.pivot(\"C\", \"E\").sum()\n", "}).unwrap()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "+---------+-----+------+------+\n", "| A | B | foo | bar |\n", "| --- | --- | --- | --- |\n", "| str | str | i32 | i32 |\n", "+=========+=====+======+======+\n", "| \"three\" | \"A\" | null | 3 |\n", "+---------+-----+------+------+\n", "| \"one\" | \"C\" | 8 | 5 |\n", "+---------+-----+------+------+\n", "| \"two\" | \"A\" | 6 | null |\n", "+---------+-----+------+------+\n", "| \"one\" | \"A\" | 0 | 9 |\n", "+---------+-----+------+------+\n", "| \"two\" | \"C\" | 2 | null |\n", "+---------+-----+------+------+\n", "| \"three\" | \"B\" | 7 | null |\n", "+---------+-----+------+------+\n", "| \"one\" | \"B\" | 1 | 4 |\n", "+---------+-----+------+------+\n", "| \"two\" | \"B\" | null | 10 |\n", "+---------+-----+------+------+\n", "| \"three\" | \"C\" | null | 11 |\n", "+---------+-----+------+------+\n" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "as_result!({\n", " (&df).groupby(&[\"A\", \"B\"])?.pivot(\"C\", \"E\").sum()\n", "}).unwrap()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }