{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Importing and Exporting Data\n", "\n", "In the previous notebook we showed connecting Ibis to an existing database\n", "(specifically a DuckDB) database. While this is a common way of accessing data\n", "in Ibis, it's not the only way. In this notebook we'll walk through a few\n", "different ways of importing data into Ibis, as well as ways to export it to\n", "other systems. Being able to support a wide variety of input and output formats\n", "is essential to ensuring Ibis works well within the larger Python data\n", "ecosystem.\n", "\n", "## Importing Data from Existing Databases\n", "\n", "One common way of loading data into Ibis is by connecting to an existing\n", "backend that already has a collection of existing of tables. Backends use\n", "different nomenclatures for these, but in ibis a collection of tables is called\n", "a \"database\". Some backends also have a concept for a collection of databases\n", "(also using backend-specific terminology). In Ibis we call these \"catalogs\".\n", "\n", "To connect to an existing backend system you'd use the corresponding\n", "`ibis..connect` function. The parameters required here to connect\n", "differ per backend. For example:\n", "\n", "```python\n", "import ibis\n", "\n", "# Connect to duckdb\n", "con = ibis.duckdb.connect(\"path/to/database.db\")\n", "\n", "# Connect to postgres\n", "con = ibis.postgres.connect(host=\"postgres-hostname\", password=\"postgres-password\")\n", "\n", "# Connect to bigquery\n", "con = ibis.bigquery.connect(project_id=\"bigquery-project\", dataset_id=\"example\")\n", "\n", "# ...\n", "```\n", "\n", "Here we'll connect to the same `duckdb` database that we used in notebook 1." ] }, { "cell_type": "code", "metadata": {}, "source": [ "import ibis\n", "\n", "ibis.options.interactive = True" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "con = ibis.duckdb.connect(\"data/penguins/palmer_penguins.ddb\", read_only=True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Existing tables can then be seen and accessed using the `list_tables` and\n", "`table` methods on the connection." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# List all existing tables in a backend\n", "con.list_tables()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Access an existing table in the backend\n", "penguins = con.table(\"penguins\")\n", "\n", "penguins" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Data from Files\n", "\n", "Sometimes you want to accesss data that isn't already loaded into an existing\n", "backend, but rather exists in a common file format (`csv`, `parquet`, `json`,\n", "...). Ibis also supports loading data from these sources using corresponding\n", "`read_csv`/`read_parquet`/`read_json` functions or methods.\n", "\n", "There are two ways of doing this:\n", "\n", "- Using the method versions (e.g. `con.read_csv`), which will execute on the\n", " `con` backend.\n", "- Using the function versions (e.g. 
"- Using the function versions (e.g. `ibis.read_csv`), which will execute using\n", "  the configured *default backend* (if none is configured, `duckdb` is used).\n", "\n", "The latter mainly exist as a convenience; in most cases the method versions\n", "will be easier to use.\n", "\n", "### Exercise 1: Add a Column for Scientific Name\n", "\n", "Like all species, the penguins here have scientific names. These are available\n", "in the `data/penguins/species.csv` file in the tutorial repo." ] }, { "cell_type": "code", "metadata": {}, "source": [ "!cat data/penguins/species.csv" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your job is to:\n", "\n", "- Read in the `species.csv` file using the `con.read_csv` method.\n", "- Join the original `penguins` table with the new `species` table to label\n", "  every row with its proper scientific name." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Your code here..." ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution" ] }, { "cell_type": "code", "metadata": {}, "source": [ "%load solutions/nb02_ex01.py" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing In-Memory Data\n", "\n", "Ibis can also import data from in-memory formats.\n", "\n", "To load in-memory data into `ibis` you can use `ibis.memtable`. This takes\n", "either an existing dataframe-like object (currently `pandas`, `polars`, and\n", "`pyarrow` are supported) or a dict mapping column names to data." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a pandas.DataFrame\n", "import pandas as pd\n", "\n", "items = ibis.memtable(\n", "    pd.DataFrame(\n", "        {\n", "            \"item\": [\"apple\", \"banana\", \"orange\"],\n", "            \"price\": [0.75, 0.23, 0.92],\n", "            \"avg_weight\": [0.120, 0.118, 0.150],\n", "        }\n", "    )\n", ")\n", "\n", "items" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a pyarrow.Table\n", "import pyarrow as pa\n", "\n", "inventory = ibis.memtable(\n", "    pa.table(\n", "        {\n", "            \"item\": [\"apple\", \"banana\", \"orange\"],\n", "            \"count\": [212, 125, 90],\n", "        }\n", "    )\n", ")\n", "\n", "inventory" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a polars.DataFrame\n", "import polars as pl\n", "\n", "orders = ibis.memtable(\n", "    pl.DataFrame(\n", "        {\n", "            \"id\": [1234, 1234, 1235, 1235, 1236],\n", "            \"item\": [\"apple\", \"orange\", \"banana\", \"orange\", \"banana\"],\n", "            \"count\": [5, 3, 4, 10, 6],\n", "        }\n", "    )\n", ")\n", "\n", "orders" ], "execution_count": null, "outputs": [] }
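, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration of the `dict` form mentioned above (the `discounts`\n", "data here is made up for this example):" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Create a table from a plain dict of column name -> values\n", "# (illustrative data, not part of the tutorial dataset)\n", "discounts = ibis.memtable(\n", "    {\n", "        \"item\": [\"apple\", \"banana\", \"orange\"],\n", "        \"discount\": [0.10, 0.0, 0.05],\n", "    }\n", ")\n", "\n", "discounts" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These tables can then be queried the same as any other Ibis table."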
] }, { "cell_type": "code", "metadata": {}, "source": [ "# Find the most expensive item\n", "items.order_by(ibis.desc(\"price\")).limit(1)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Find the product with the most items in stock\n", "inventory.order_by(ibis.desc(\"count\")).limit(1)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Find the most expensive item per kg\n", "items.mutate(price_per_kg=items.price / items.avg_weight).order_by(ibis.desc(\"price_per_kg\")).limit(1)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can even compose queries that make use of tables of different input types.\n", "Here we compose a query that joins a `pandas.DataFrame` with a\n", "`polars.DataFrame`." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Compute the total cost of each order\n", "order_totals = (\n", " orders.join(items, \"item\") # 1. Join orders with items by item name\n", " .group_by(\"id\") # 2. Group By the order id\n", " .agg(total=lambda t: (t[\"price\"] * t[\"count\"]).sum()) # 3. Compute the total cost as sum(price * count) per order\n", ")\n", "\n", "order_totals" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When querying in-memory data wrapped in `ibis.memtable`, `ibis` will use the\n", "current default backend (`duckdb` by default), _unless_ the query also makes\n", "use of a table attached to some other database. This means that `ibis.memtable`\n", "may be used to create efficient lookup tables when composing queries against\n", "external systems.\n", "\n", "### Exercise 2: Add Columns for Island Coordinates\n", "\n", "Here we provide coordinate data for the 3 islands present in the `penguins`\n", "dataset as a pandas dataframe. " ] }, { "cell_type": "code", "metadata": {}, "source": [ "import pandas as pd\n", "\n", "coordinates_df = pd.DataFrame(\n", " {\n", " \"island\": [\"Biscoe\", \"Dream\", \"Torgersen\"],\n", " \"latitude\": [-65.433333, -64.733333, -64.766667],\n", " \"longitude\": [-65.5, -64.233333, -64.083333],\n", " }\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your job is to:\n", "\n", "- Coerce it to an `ibis.Table` using `ibis.memtable`\n", "- Join the original `penguins` table with the new `coordinates` table to label\n", " every row with its island coordinates" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Your code here..." 
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution" ] }, { "cell_type": "code", "metadata": {}, "source": [ "%load solutions/nb02_ex02.py" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting Data to Files\n", "\n", "Just as Ibis can load data from file formats like `parquet` or `csv`, it can\n", "also write results to these formats.\n", "\n", "- `query.to_parquet()`: writes results as a parquet file\n", "- `query.to_csv()`: writes results as a csv file\n", "- `query.to_delta()`: writes results as a Delta Lake table\n", "\n", "### Exercise 3: Write query results to csv\n", "\n", "The current `penguins` dataset contains a single table containing data for all\n", "three islands (Biscoe, Dream, and Torgersen).\n", "\n", "Below, please write some python code to produce 3 new `csv` files, each named\n", "after an island and containing only the data for that island (i.e. the\n", "`biscoe.csv` file should only contain data from \"Biscoe\")." ] }, { "cell_type": "code", "metadata": {}, "source": [ "islands = [\"Biscoe\", \"Dream\", \"Torgersen\"]\n", "\n", "# Your code here" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Solution" ] }, { "cell_type": "code", "metadata": {}, "source": [ "%load solutions/nb02_ex03.py" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting In-memory Data\n", "\n", "Similar to how `ibis` can consume multiple in-memory data formats, we also can\n", "return queries in multiple input formats using one of the various `to_*`\n", "methods:\n", "\n", "- `query.to_pandas()`: returns results as a `pandas.DataFrame`\n", "- `query.to_polars()`: returns results as a `polars.DataFrame`\n", "- `query.to_pyarrow()`: returns results as a `pyarrow.Table`\n", "- `query.to_pyarrow_batches()`: returns results as a `pyarrow.RecordBatchReader`\n", "- `query.to_torch()`: returns results as a `dict` of `torch.Tensor`s\n", "\n", "Supporting multiple output formats lets you efficiently compose ibis with\n", "existing systems. Have some downstream code that expects a `pandas.DataFrame`?\n", "Use ibis to do some initial processing then hand off the result to your\n", "existing code. Our wide support for popular Python data tools means `ibis` can\n", "often fit nicely in to existing workflows without forcing all code be ported to\n", "use `ibis`." 
] }, { "cell_type": "code", "metadata": {}, "source": [ "mean_mass = penguins.group_by(\"species\").body_mass_g.mean()\n", "mean_mass" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Get results as a pandas DataFrame\n", "df = mean_mass.to_pandas()\n", "\n", "print(type(df))\n", "\n", "df" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Get results as a polars DataFrame\n", "mean_mass.to_polars()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Get results as a pyarrow Table\n", "mean_mass.to_pyarrow()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integrating Ibis with Downstream Tools\n", "\n", "Some downstream tools may be able to accept `ibis.Table` objects directly as\n", "inputs without requiring you to manually call one of the `.to_*` methods to\n", "convert a query to an in-memory format.\n", "\n", "Depending on the library, this may work through a few mechanisms:\n", "\n", "- Using the `__array__` protocol. Libraries expecting `numpy` arrays often also\n", " accept objects implementing this method.\n", "- Using the `__dataframe__` protocol. This is a newer protocol, but is accepted\n", " by some libraries (a popular one is the plotting library `altair`)\n", "- Using the `__arrow_c_stream__` protocol. This is also a new protocol for\n", " tooling passing around arrow tables.\n", "\n", "---\n", "\n", "For example, here we demonstrate passing an `ibis.Table` to the `altair` plotting\n", "library. Notice that you don't have to manually call any of the `to_*` methods\n", "yourself! This is because `altair` supports anything that implements the `__dataframe__`\n", "protocol as input, which means it can accept an `ibis.Table` directly." ] }, { "cell_type": "code", "metadata": {}, "source": [ "import altair as alt" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Make a scatterplot of bill length x bill depth for each penguin species\n", "chart = (\n", " alt.Chart(penguins) # <- here we pass in an `ibis.Table` directly\n", " .mark_circle(size=60)\n", " .encode(\n", " x=alt.X('bill_length_mm', scale=alt.Scale(zero=False)),\n", " y=alt.Y('bill_depth_mm', scale=alt.Scale(zero=False)),\n", " color=alt.Color('species'),\n", " tooltip=['species', 'sex', 'island']\n", " )\n", ")\n", "\n", "chart" ], "execution_count": null, "outputs": [] } ], "metadata": { "kernelspec": { "name": "python3", "language": "python", "display_name": "Python 3 (ipykernel)" } }, "nbformat": 4, "nbformat_minor": 4 }