{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Importing and Exporting Data\n", "\n", "In the previous notebook we showed connecting Ibis to an existing database\n", "(specifically a DuckDB) database. While this is a common way of accessing data\n", "in Ibis, it's not the only way. In this notebook we'll walk through a few\n", "different ways of importing data into Ibis, as well as ways to export it to\n", "other systems. Being able to support a wide variety of input and output formats\n", "is essential to ensuring Ibis works well within the larger Python data\n", "ecosystem.\n", "\n", "## Importing Data from Existing Databases\n", "\n", "One common way of loading data into Ibis is by connecting to an existing\n", "backend that already has a collection of existing of tables. Backends use\n", "different nomenclatures for these, but in ibis a collection of tables is called\n", "a \"database\". Some backends also have a concept for a collection of databases\n", "(also using backend-specific terminology). In Ibis we call these \"catalogs\".\n", "\n", "To connect to an existing backend system you'd use the corresponding\n", "`ibis..connect` function. The parameters required here to connect\n", "differ per backend. For example:\n", "\n", "```python\n", "import ibis\n", "\n", "# Connect to duckdb\n", "con = ibis.duckdb.connect(\"path/to/database.db\")\n", "\n", "# Connect to postgres\n", "con = ibis.postgres.connect(host=\"postgres-hostname\", password=\"postgres-password\")\n", "\n", "# Connect to bigquery\n", "con = ibis.bigquery.connect(project_id=\"bigquery-project\", dataset_id=\"example\")\n", "\n", "# ...\n", "```\n", "\n", "Here we'll connect to the same `duckdb` database that we used in notebook 1." ] }, { "cell_type": "code", "metadata": {}, "source": [ "import ibis\n", "\n", "ibis.options.interactive = True" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "con = ibis.duckdb.connect(\"data/penguins/palmer_penguins.ddb\", read_only=True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Existing tables can then be seen and accessed using the `list_tables` and\n", "`table` methods on the connection." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# List all existing tables in a backend\n", "con.list_tables()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Access an existing table in the backend\n", "penguins = con.table(\"penguins\")\n", "\n", "penguins" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Data from Files\n", "\n", "Sometimes you want to accesss data that isn't already loaded into an existing\n", "backend, but rather exists in a common file format (`csv`, `parquet`, `json`,\n", "...). Ibis also supports loading data from these sources using corresponding\n", "`read_csv`/`read_parquet`/`read_json` functions or methods.\n", "\n", "There are two ways of doing this:\n", "\n", "- Using the method versions (e.g. `con.read_csv`), which will execute on the\n", " `con` backend.\n", "- Using the function versions (e.g. 
"- Using the function versions (e.g. `ibis.read_csv`), which will execute using\n", "  the configured *default backend* (if none is configured, `duckdb` is used).\n", "\n", "The latter mainly exist as a convenience; in most cases the method versions\n", "will be easier to use.\n", "\n", "### Exercise 1: Add a Column for Scientific Name\n", "\n", "Like all species, the penguins here have scientific names. These are available\n", "in the `data/penguins/species.csv` file in the tutorial repo." ] }, { "cell_type": "code", "metadata": {}, "source": [ "!cat data/penguins/species.csv" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your job is to:\n", "\n", "- Read in the `species.csv` file using the `con.read_csv` method.\n", "- Join the original `penguins` table with the new `species` table to label\n", "  every row with its proper scientific name." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Your code here..." ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution" ] }, { "cell_type": "code", "metadata": {}, "source": [ "%load solutions/nb02_ex01.py" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing In-Memory Data\n", "\n", "Ibis can also import data from in-memory formats.\n", "\n", "To load in-memory data into `ibis` you can use `ibis.memtable`. This takes\n", "either an existing dataframe-like object (currently `pandas`, `polars`, and\n", "`pyarrow` are supported) or a dict mapping column names to data." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a pandas.DataFrame\n", "import pandas as pd\n", "\n", "items = ibis.memtable(\n", "    pd.DataFrame(\n", "        {\n", "            \"item\": [\"apple\", \"banana\", \"orange\"],\n", "            \"price\": [0.75, 0.23, 0.92],\n", "            \"avg_weight\": [0.120, 0.118, 0.150],\n", "        }\n", "    )\n", ")\n", "\n", "items" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a pyarrow.Table\n", "import pyarrow as pa\n", "\n", "inventory = ibis.memtable(\n", "    pa.table(\n", "        {\n", "            \"item\": [\"apple\", \"banana\", \"orange\"],\n", "            \"count\": [212, 125, 90],\n", "        }\n", "    )\n", ")\n", "\n", "inventory" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a polars.DataFrame\n", "import polars as pl\n", "\n", "orders = ibis.memtable(\n", "    pl.DataFrame(\n", "        {\n", "            \"id\": [1234, 1234, 1235, 1235, 1236],\n", "            \"item\": [\"apple\", \"orange\", \"banana\", \"orange\", \"banana\"],\n", "            \"count\": [5, 3, 4, 10, 6],\n", "        }\n", "    )\n", ")\n", "\n", "orders" ], "execution_count": null, "outputs": [] }
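, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration of the `dict` form mentioned above (the `discounts`\n", "data here is made up for this example):" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a plain dict of column name -> values\n", "# (illustrative data, not part of the tutorial dataset)\n", "discounts = ibis.memtable(\n", "    {\n", "        \"item\": [\"apple\", \"banana\", \"orange\"],\n", "        \"discount\": [0.10, 0.0, 0.05],\n", "    }\n", ")\n", "\n", "discounts" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These tables can then be queried the same as any other Ibis table."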
] }, { "cell_type": "code", "metadata": {}, "source": [ "# Find the most expensive item\n", "items.order_by(ibis.desc(\"price\")).limit(1)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Find the product with the most items in stock\n", "inventory.order_by(ibis.desc(\"count\")).limit(1)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Find the most expensive item per kg\n", "items.mutate(price_per_kg=items.price / items.avg_weight).order_by(ibis.desc(\"price_per_kg\")).limit(1)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can even compose queries that make use of tables of different input types.\n", "Here we compose a query that joins a `pandas.DataFrame` with a\n", "`polars.DataFrame`." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Compute the total cost of each order\n", "order_totals = (\n", " orders.join(items, \"item\") # 1. Join orders with items by item name\n", " .group_by(\"id\") # 2. Group By the order id\n", " .agg(total=lambda t: (t[\"price\"] * t[\"count\"]).sum()) # 3. Compute the total cost as sum(price * count) per order\n", ")\n", "\n", "order_totals" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When querying in-memory data wrapped in `ibis.memtable`, `ibis` will use the\n", "current default backend (`duckdb` by default), _unless_ the query also makes\n", "use of a table attached to some other database. This means that `ibis.memtable`\n", "may be used to create efficient lookup tables when composing queries against\n", "external systems.\n", "\n", "### Exercise 2: Add Columns for Island Coordinates\n", "\n", "Here we provide coordinate data for the 3 islands present in the `penguins`\n", "dataset as a pandas dataframe. " ] }, { "cell_type": "code", "metadata": {}, "source": [ "import pandas as pd\n", "\n", "coordinates_df = pd.DataFrame(\n", " {\n", " \"island\": [\"Biscoe\", \"Dream\", \"Torgersen\"],\n", " \"latitude\": [-65.433333, -64.733333, -64.766667],\n", " \"longitude\": [-65.5, -64.233333, -64.083333],\n", " }\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your job is to:\n", "\n", "- Coerce it to an `ibis.Table` using `ibis.memtable`\n", "- Join the original `penguins` table with the new `coordinates` table to label\n", " every row with its island coordinates" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Your code here..." 
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution" ] }, { "cell_type": "code", "metadata": {}, "source": [ "%load solutions/nb02_ex02.py" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting Data to Files\n", "\n", "Just as Ibis can load data from file formats like `parquet` or `csv`, it can\n", "also write results to these formats.\n", "\n", "- `query.to_parquet()`: writes results as a parquet file\n", "- `query.to_csv()`: writes results as a csv file\n", "- `query.to_delta()`: writes results as a Delta Lake table\n", "\n", "### Exercise 3: Write query results to csv\n", "\n", "The current `penguins` dataset contains a single table containing data for all\n", "three islands (Biscoe, Dream, and Torgersen).\n", "\n", "Below, please write some python code to produce 3 new `csv` files, each named\n", "after an island and containing only the data for that island (i.e. the\n", "`biscoe.csv` file should only contain data from \"Biscoe\")." ] }, { "cell_type": "code", "metadata": {}, "source": [ "islands = [\"Biscoe\", \"Dream\", \"Torgersen\"]\n", "\n", "# Your code here" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution" ] }, { "cell_type": "code", "metadata": {}, "source": [ "%load solutions/nb02_ex03.py" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting In-memory Data\n", "\n", "Similar to how `ibis` can consume multiple in-memory data formats, we also can\n", "return queries in multiple input formats using one of the various `to_*`\n", "methods:\n", "\n", "- `query.to_pandas()`: returns results as a `pandas.DataFrame`\n", "- `query.to_polars()`: returns results as a `polars.DataFrame`\n", "- `query.to_pyarrow()`: returns results as a `pyarrow.Table`\n", "- `query.to_pyarrow_batches()`: returns results as a `pyarrow.RecordBatchReader`\n", "- `query.to_torch()`: returns results as a `dict` of `torch.Tensor`s\n", "\n", "Supporting multiple output formats lets you efficiently compose ibis with\n", "existing systems. Have some downstream code that expects a `pandas.DataFrame`?\n", "Use ibis to do some initial processing then hand off the result to your\n", "existing code. Our wide support for popular Python data tools means `ibis` can\n", "often fit nicely in to existing workflows without forcing all code be ported to\n", "use `ibis`." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "mean_mass = penguins.group_by(\"species\").body_mass_g.mean()\n", "mean_mass" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Get results as a pandas DataFrame\n", "df = mean_mass.to_pandas()\n", "\n", "print(type(df))\n", "\n", "df" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Get results as a polars DataFrame\n", "mean_mass.to_polars()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Get results as a pyarrow Table\n", "mean_mass.to_pyarrow()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrating Ibis with Downstream Tools\n", "\n", "Some downstream tools may be able to accept `ibis.Table` objects directly as\n", "inputs without requiring you to manually call one of the `.to_*` methods to\n", "convert a query to an in-memory format.\n", "\n", "Depending on the library, this may work through a few mechanisms:\n", "\n", "- Using the `__array__` protocol. Libraries expecting `numpy` arrays often also\n", " accept objects implementing this method.\n", "- Using the `__dataframe__` protocol. This is a newer protocol, but is accepted\n", " by some libraries (a popular one is the plotting library `altair`)\n", "- Using the `__arrow_c_stream__` protocol. This is also a new protocol for\n", " tooling passing around arrow tables.\n", "\n", "---\n", "\n", "For example, here we demonstrate passing an `ibis.Table` to the `altair` plotting\n", "library. Notice that you don't have to manually call any of the `to_*` methods\n", "yourself! This is because `altair` supports anything that implements the `__dataframe__`\n", "protocol as input, which means it can accept an `ibis.Table` directly." ] }, { "cell_type": "code", "metadata": {}, "source": [ "import altair as alt" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Make a scatterplot of bill length x bill depth for each penguin species\n", "chart = (\n", " alt.Chart(penguins) # <- here we pass in an `ibis.Table` directly\n", " .mark_circle(size=60)\n", " .encode(\n", " x=alt.X('bill_length_mm', scale=alt.Scale(zero=False)),\n", " y=alt.Y('bill_depth_mm', scale=alt.Scale(zero=False)),\n", " color=alt.Color('species'),\n", " tooltip=['species', 'sex', 'island']\n", " )\n", ")\n", "\n", "chart" ], "execution_count": null, "outputs": [] } ], "metadata": { "kernelspec": { "name": "python3", "language": "python", "display_name": "Python 3 (ipykernel)" } }, "nbformat": 4, "nbformat_minor": 4 }