{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Nessie Iceberg/Hive SQL Demo with NBA Dataset\n", "============================\n", "This demo showcases how to use Nessie Python API along with Hive from Iceberg\n", "\n", "Initialize PyHive\n", "----------------------------------------------\n", "To get started, we will first have to do a few setup steps that give us everything we need\n", "to get started with Nessie. In case you're interested in the detailed setup steps for Hive, you can check out the [docs](https://projectnessie.org/tools/iceberg/hive/)\n", "\n", "The Binder server has downloaded Hive, Hadoop and some data for us as well as started a Nessie server in the background. All we have to do is to connect to Hive session.\n", "\n", "The below cell starts a local Hive session with parameters needed to configure Nessie. Each config option is followed by a comment explaining its purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "import os\n", "import requests\n", "from pyhive import hive\n", "from pynessie import init\n", "\n", "# where we will store our data\n", "warehouse = \"file://\" + os.path.join(os.getcwd(), \"nessie_warehouse\")\n", "\n", "# where our datasets are located\n", "datasets_path = \"file://\" + os.path.join(os.path.dirname(os.getcwd()), \"datasets\")\n", "\n", "nessie_client = init()\n", "\n", "\n", "def create_namespace(ref: str, namespace: list[str]):\n", " hash = nessie_client.get_reference(ref).hash_\n", " # pynessie client has currently no code to create namespace, issue a plain REST request.\n", " response = requests.post(\n", " url=f\"http://127.0.0.1:19120/api/v2/trees/{ref}@{hash}/history/commit\",\n", " headers={\"Accept\": \"application/json\", \"Content-Type\": \"application/json\"},\n", " json={\n", " \"commitMeta\": {\"message\": \"Create namespace nba\"},\n", " \"operations\": [{\"type\": \"PUT\", \"key\": {\"elements\": namespace}, \"content\": {\"type\": \"NAMESPACE\"}}],\n", " },\n", " )\n", " if response.status_code != 200:\n", " raise Exception(f\"Could not create namespace: HTTP {response.status_code} {response.reason}: {response.json()}\")\n", "\n", "\n", "def create_ref_catalog(ref: str):\n", " \"\"\"\n", " Create a branch and switch the current ref to the created branch\n", " \"\"\"\n", " default_branch = nessie_client.get_default_branch()\n", " if ref != default_branch:\n", " default_branch_hash = nessie_client.get_reference(default_branch).hash_\n", " nessie_client.create_branch(ref, ref=default_branch, hash_on_ref=default_branch_hash)\n", " return switch_ref_catalog(ref)\n", "\n", "\n", "def switch_ref_catalog(ref: str):\n", " \"\"\"\n", " Switch a branch. 
"    \"\"\"\n", "    Switch a branch. Because the branch is part of the Hive session configuration, switching branches requires reconnecting to Hive.\n", "    \"\"\"\n", "    # The important args below are:\n", "    # catalog-impl: which Iceberg catalog to use, in this case we want NessieCatalog\n", "    # uri: the location of the Nessie server\n", "    # ref: the Nessie ref/branch we want to use (defaults to main)\n", "    # warehouse: the location where this catalog should store its data\n", "    return hive.connect(\n", "        \"localhost\",\n", "        configuration={\n", "            \"iceberg.catalog.dev_catalog.catalog-impl\": \"org.apache.iceberg.nessie.NessieCatalog\",\n", "            \"iceberg.catalog.dev_catalog.uri\": \"http://localhost:19120/api/v1\",\n", "            \"iceberg.catalog.dev_catalog.ref\": ref,\n", "            \"iceberg.catalog.dev_catalog.warehouse\": warehouse,\n", "        },\n", "    ).cursor()\n", "\n", "\n", "create_namespace(\"main\", [\"nba\"])\n", "\n", "\n", "print(\"\\n\\nHive running\\n\\n\\n\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Solving Data Engineering problems with Nessie\n", "============================\n", "\n", "In this demo we are a data engineer working at a fictional sports analytics blog. For the authors to write articles, they need access to the relevant data: they must be able to retrieve it quickly and to create charts from it.\n", "\n", "We have been asked to collect and expose some information about basketball players. We have located some data sources and are now ready to start ingesting data into our data lakehouse. We will perform the ingestion steps on a Nessie branch to test and validate the data before exposing it to the analysts." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Set up Nessie branches (via Nessie CLI)\n", "----------------------------\n", "Once all dependencies are configured, we can get started with ingesting our basketball data into `Nessie` with the following steps:\n", "\n", "- Create a new branch named `dev`\n", "- List all branches\n", "\n", "It is worth mentioning that we don't have to explicitly create a `main` branch, since it's the default branch." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_ref = create_ref_catalog(\"dev\")" ] },
{ "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We have created the branch `dev`, and we can see it together with the Nessie `hash` it's currently pointing to.\n", "\n", "Below we list all branches. Note that the auto-created `main` branch already exists, and both branches initially point at the same empty `hash`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!nessie --verbose branch" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Create tables under dev branch\n", "-------------------------------------\n", "Once we have created the `dev` branch and verified that it exists, we can create some tables and add some data.\n", "\n", "We create two tables under the `dev` branch:\n", "- `salaries`\n", "- `totals_stats`\n", "\n", "These tables list the salaries per player per year and their stats per year.\n", "\n", "To create the data we:\n", "\n", "1. switch our branch context to dev\n", "2. create the table\n", "3. insert the data from an existing CSV file. This CSV file is already stored locally on the demo machine; a production use case would likely take feeds from official data sources."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating our demo schema\n", "current_ref.execute(\"CREATE SCHEMA IF NOT EXISTS nba\")\n", "\n", "print(\"\\nCreated schema nba\\n\")\n", "\n", "\n", "print(\"\\nCreating tables nba.salaries and nba.totals_stats...\\n\")\n", "\n", "# Creating `salaries` table\n", "\n", "current_ref.execute(\n", "    f\"\"\"CREATE TABLE IF NOT EXISTS nba.salaries (Season STRING,\n", "    Team STRING, Salary STRING, Player STRING)\n", "    STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n", "    LOCATION '{warehouse}/nba/salaries'\n", "    TBLPROPERTIES ('iceberg.catalog'='dev_catalog', 'write.format.default'='parquet',\n", "    'iceberg.mr.in.memory.data.model'='GENERIC')\"\"\"\n", ")\n", "\n", "## We create a temporary table to stage the data, because it is not possible\n", "## to load data from a CSV file directly into a non-native (Iceberg) table.\n", "current_ref.execute(\n", "    \"\"\"CREATE TABLE IF NOT EXISTS nba.salaries_temp (Season STRING,\n", "    Team STRING, Salary STRING, Player STRING)\n", "    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\"\"\"\n", ")\n", "\n", "current_ref.execute(f'LOAD DATA LOCAL INPATH \"{datasets_path}/nba/salaries.csv\" OVERWRITE INTO TABLE nba.salaries_temp')\n", "current_ref.execute(\"INSERT OVERWRITE TABLE nba.salaries SELECT * FROM nba.salaries_temp\")\n", "\n", "print(\"\\nCreated and inserted data into table nba.salaries from dataset salaries\\n\")\n", "\n", "\n", "# Creating `totals_stats` table\n", "\n", "current_ref.execute(\n", "    f\"\"\"CREATE TABLE IF NOT EXISTS nba.totals_stats (\n", "    Season STRING, Age STRING, Team STRING, ORB STRING,\n", "    DRB STRING, TRB STRING, AST STRING, STL STRING,\n", "    BLK STRING, TOV STRING, PTS STRING, Player STRING, RSorPO STRING)\n", "    STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n", "    LOCATION '{warehouse}/nba/totals_stats'\n", "    TBLPROPERTIES ('iceberg.catalog'='dev_catalog', 'write.format.default'='parquet',\n", "    'iceberg.mr.in.memory.data.model'='GENERIC')\"\"\"\n", ")\n", "\n", "## We create a temporary table to stage the data, because it is not possible\n", "## to load data from a CSV file directly into a non-native (Iceberg) table.\n", "current_ref.execute(\n", "    \"\"\"CREATE TABLE IF NOT EXISTS nba.totals_stats_temp (\n", "    Season STRING, Age STRING, Team STRING, ORB STRING,\n", "    DRB STRING, TRB STRING, AST STRING, STL STRING,\n", "    BLK STRING, TOV STRING, PTS STRING, Player STRING, RSorPO STRING)\n", "    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\"\"\"\n", ")\n", "\n", "current_ref.execute(\n", "    f'LOAD DATA LOCAL INPATH \"{datasets_path}/nba/totals_stats.csv\" OVERWRITE INTO TABLE nba.totals_stats_temp'\n", ")\n", "current_ref.execute(\"INSERT OVERWRITE TABLE nba.totals_stats SELECT * FROM nba.totals_stats_temp\")\n", "\n", "print(\"\\nCreated and inserted data into table nba.totals_stats from dataset totals_stats\\n\")" ] },
{ "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Now we count the rows in our tables to ensure they match the row counts of the CSV files. Unlike in the Spark and Flink demos, we can't use the `table@branch` notation here (see [this GitHub issue](https://github.com/projectnessie/nessie/issues/1985)). Instead, we set the Nessie ref through the Hive setting `SET iceberg.catalog.{catalog}.ref = {branch}` whenever we want to work on a specific branch."
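] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an aside: this notebook applies that setting by reconnecting through `switch_ref_catalog`, which passes the ref as part of the session configuration. Purely as an illustrative sketch (not used elsewhere in this demo), the equivalent `SET` statement could also be issued on an existing cursor; whether the Iceberg catalog picks up the new ref mid-session depends on the Hive and Iceberg versions in use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: the same ref switch expressed as a plain Hive SET statement.\n", "# The rest of this notebook keeps using switch_ref_catalog(), which reconnects instead.\n", "current_ref.execute(\"SET iceberg.catalog.dev_catalog.ref=dev\")"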
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "# We make sure we are still in dev branch\n", "current_ref = switch_ref_catalog(\"dev\")\n", "\n", "print(\"\\nCounting rows in nba.salaries\\n\")\n", "\n", "# We count now\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.salaries\")\n", "table_count = current_ref.fetchone()[0]\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.salaries_temp\")\n", "csv_count = current_ref.fetchone()[0]\n", "assert table_count == csv_count\n", "print(table_count)\n", "\n", "print(\"\\nCounting rows in nba.totals_stats\\n\")\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.totals_stats\")\n", "table_count = current_ref.fetchone()[0]\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.totals_stats_temp\")\n", "csv_count = current_ref.fetchone()[0]\n", "assert table_count == csv_count\n", "print(table_count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check generated tables\n", "----------------------------\n", "Since we have been working solely on the `dev` branch, where we created 2 tables and added some data,\n", "let's verify that the `main` branch was not altered by our changes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!nessie content list" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "And on the `dev` branch we expect to see two tables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nessie content list --ref dev" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We can also verify that the `dev` and `main` branches point to different commits" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nessie --verbose branch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dev promotion into main\n", "-----------------------\n", "Once we are done with our changes on the `dev` branch, we would like to merge those changes into `main`.\n", "We merge `dev` into `main` via the command line `merge` command.\n", "Both branches should be at the same revision after merging/promotion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nessie merge dev -b main --force" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We can verify that the `main` branch now contains the expected tables and row counts.\n", "\n", "The tables are now on `main` and ready for consumption by our blog authors and analysts!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nessie --verbose branch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!nessie content list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# We switch to main branch\n", "current_ref = switch_ref_catalog(\"main\")\n", "\n", "print(\"\\nCounting rows in nba.salaries\\n\")\n", "\n", "# We count now\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.salaries\")\n", "table_count = current_ref.fetchone()[0]\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.salaries_temp\")\n", "csv_count = current_ref.fetchone()[0]\n", "assert table_count == csv_count\n", "print(table_count)\n", "\n", "print(\"\\nCounting rows in nba.totals_stats\\n\")\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.totals_stats\")\n", "table_count = current_ref.fetchone()[0]\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.totals_stats_temp\")\n", "csv_count = current_ref.fetchone()[0]\n", "assert table_count == csv_count\n", "print(table_count)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Perform regular ETL on the new tables\n", "-------------------\n", "Our analysts are happy with the data and we want to now regularly ingest data to keep things up to date. Our first ETL job consists of the following:\n", "\n", "1. Update the salaries table to add new data\n", "2. We have decided the `Age` column isn't required in the `totals_stats` table so we will drop the column\n", "3. We create a new table to hold information about the players appearances in all star games\n", "\n", "As always we will do this work on a branch and verify the results. This ETL job can then be set up to run nightly with new stats and salary information." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_ref = create_ref_catalog(\"etl\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# add some salaries for Kevin Durant\n", "current_ref.execute(\n", " \"\"\"INSERT INTO nba.salaries\n", " VALUES ('2017-18', 'Golden State Warriors', '$25000000', 'Kevin Durant'),\n", " ('2018-19', 'Golden State Warriors', '$30000000', 'Kevin Durant'),\n", " ('2019-20', 'Brooklyn Nets', '$37199000', 'Kevin Durant'),\n", " ('2020-21', 'Brooklyn Nets', '$39058950', 'Kevin Durant')\"\"\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"\\nCreating table nba.allstar_games_stats\\n\")\n", "\n", "# Creating `allstar_games_stats` table\n", "current_ref.execute(\n", " f\"\"\"CREATE TABLE IF NOT EXISTS nba.allstar_games_stats (\n", " Season STRING, Age STRING, Team STRING, ORB STRING,\n", " TRB STRING, AST STRING, STL STRING, BLK STRING,\n", " TOV STRING, PF STRING, PTS STRING, Player STRING)\n", " STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'\n", " LOCATION '{warehouse}/nba/allstar_games_stats'\n", " TBLPROPERTIES ('iceberg.catalog'='dev_catalog', 'write.format.default'='parquet',\n", " 'iceberg.mr.in.memory.data.model'='GENERIC')\"\"\"\n", ")\n", "\n", "## We create a temporary table to load data into our target table since\n", "## is not possible to load data directly from CSV into non-native table.\n", "current_ref.execute(\n", " \"\"\"CREATE TABLE IF NOT EXISTS nba.allstar_table_temp (\n", " Season STRING, Age STRING, Team STRING, ORB STRING, TRB STRING,\n", " AST STRING, STL STRING, BLK STRING,\n", " TOV STRING, PF STRING, PTS STRING, Player STRING)\n", " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\"\"\"\n", ")\n", "\n", "current_ref.execute(\n", " f'LOAD DATA LOCAL INPATH \"{datasets_path}/nba/allstar_games_stats.csv\" OVERWRITE INTO TABLE nba.allstar_table_temp'\n", ")\n", "current_ref.execute(\"INSERT OVERWRITE TABLE nba.allstar_games_stats SELECT * FROM nba.allstar_table_temp\")\n", "\n", "print(\"\\nCreated and inserted data into table nba.allstar_table_temp from dataset allstar_games_stats\\n\")\n", "\n", "\n", "print(\"\\nCounting rows in nba.allstar_games_stats\\n\")\n", "\n", "# Since we can't do 'table@branch'\n", "current_ref = switch_ref_catalog(\"etl\")\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.allstar_games_stats\")\n", "print(current_ref.fetchone()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can verify that the new table isn't on the `main` branch but is present on the etl branch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Since we have been working on the `etl` branch, the `allstar_games_stats` table is not on the `main` branch\n", "!nessie content list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We should see the new `allstar_games_stats` table on the `etl` branch\n", "!nessie content list --ref etl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we are happy with the data we can again merge it into `main`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!nessie merge etl -b main --force" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ 
"Now lets verify that the changes exist on the `main` branch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!nessie content list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!nessie --verbose branch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# We switch to the main branch\n", "current_ref = switch_ref_catalog(\"main\")\n", "\n", "print(\"\\nCounting rows in nba.allstar_games_stats\\n\")\n", "\n", "# We count now\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.allstar_games_stats\")\n", "table_count = current_ref.fetchone()[0]\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.allstar_table_temp\")\n", "csv_count = current_ref.fetchone()[0]\n", "assert table_count == csv_count\n", "print(table_count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create `experiment` branch\n", "--------------------------------\n", "As a data analyst we might want to carry out some experiments with some data, without affecting `main` in any way.\n", "As in the previous examples, we can just get started by creating an `experiment` branch off of `main`\n", "and carry out our experiment, which could consist of the following steps:\n", "- drop `totals_stats` table\n", "- add data to `salaries` table\n", "- compare `experiment` and `main` tables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_ref = create_ref_catalog(\"experiment\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop the `totals_stats` table on the `experiment` branch\n", "current_ref.execute(\"DROP TABLE nba.totals_stats\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add some salaries for Dirk Nowitzki\n", "current_ref.execute(\n", " \"\"\"INSERT INTO nba.salaries VALUES\n", " ('2015-16', 'Dallas Mavericks', '$8333333', 'Dirk Nowitzki'),\n", " ('2016-17', 'Dallas Mavericks', '$25000000', 'Dirk Nowitzki'),\n", " ('2017-18', 'Dallas Mavericks', '$5000000', 'Dirk Nowitzki'),\n", " ('2018-19', 'Dallas Mavericks', '$5000000', 'Dirk Nowitzki')\"\"\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We should see the `salaries` and `allstar_games_stats` tables only (since we just dropped `totals_stats`)\n", "!nessie content list --ref experiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# `main` hasn't been changed and still has the `totals_stats` table\n", "!nessie content list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the contents of the `salaries` table on the `experiment` branch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_ref = switch_ref_catalog(\"experiment\")\n", "\n", "print(\"\\nCounting rows in nba.salaries\\n\")\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.salaries\")\n", "print(current_ref.fetchone()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and compare to the contents of the `salaries` table on the `main` branch." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "current_ref = switch_ref_catalog(\"main\")\n", "\n", "# the following INSERT is a workaround for https://github.com/apache/iceberg/pull/4509 until iceberg 0.13.2 is released\n", "# add a single salary for Dirk Nowitzki (so we expect 3 less total rows)\n", "current_ref.execute(\n", " \"\"\"INSERT INTO nba.salaries VALUES\n", " ('2018-19', 'Dallas Mavericks', '$5000000', 'Dirk Nowitzki')\"\"\"\n", ")\n", "\n", "print(\"\\nCounting rows in nba.salaries\\n\")\n", "\n", "current_ref.execute(\"SELECT COUNT(*) FROM nba.salaries\")\n", "print(current_ref.fetchone()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally lets clean up after ourselves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nessie branch --delete dev\n", "!nessie branch --delete etl\n", "!nessie branch --delete experiment" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 4 }