{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Ingest PandaSet autonomous driving dataset\n", "\n", "This notebook shows how to load 3D point clouds, 3D oriented bounding boxes, and semantic segmentations from the PandaSet dataset into a 3LC Table.\n", "\n", "Tables with large 3D geometries use the [bulk data pattern](https://docs.3lc.ai/3lc/latest/tutorials/geometry/bulk_data.html#bulk-data-tutorial) for storing data. For details on the ingestion process, see the [loading script](./load_pandaset.py).\n", "\n", "![](../../../images/pandaset-light.png)\n", "\n", "\n", "\n", "Running this notebook requires the [PandaSet DevKit](https://github.com/scaleapi/pandaset-devkit/blob/master/README.md).\n", "\n", "The dataset can be downloaded from [HuggingFace](https://huggingface.co/datasets/georghess/pandaset). If you have already downloaded `pandaset.zip`, ensure the dataset root below points to the unzipped `pandaset` directory.\n", "\n", "If not, the notebook will download `pandaset.zip` and unzip it into the dataset root directory. This requires authentication with HuggingFace, for example by setting the `HF_TOKEN` environment variable.\n", "\n", "> ⚠️ Storage requirements\n", ">\n", "> The unzipped dataset is ~42 GB, and ingesting all sequences into 3LC will\n", "> require another 50 GB of disk space. 
Ensure you have enough free space before\n", "> running the notebook.\n" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Project Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - Pandaset\"\n", "DATASET_NAME = \"pandaset\"\n", "TABLE_NAME = \"pandaset\"\n", "DATA_PATH = \"../../../../data\"\n", "DOWNLOAD_PATH = \"../../../../transient_data\"\n", "MAX_FRAMES = None\n", "MAX_SEQUENCES = None" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "%pip install -q \"pandaset @ git+https://github.com/scaleapi/pandaset-devkit.git@master#subdirectory=python\"\n", "%pip install -q 3lc\n", "%pip install -q huggingface-hub" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "from load_pandaset import load_pandaset" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## Prepare Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "DATASET_ROOT = Path(DOWNLOAD_PATH) / \"pandaset\"\n", "\n", "if not DATASET_ROOT.exists():\n", " import zipfile\n", "\n", " from huggingface_hub import hf_hub_download\n", "\n", " print(\"Downloading dataset from HuggingFace\")\n", " hf_hub_download(\n", " repo_id=\"georghess/pandaset\",\n", " repo_type=\"dataset\",\n", " filename=\"pandaset.zip\",\n", " local_dir=DATASET_ROOT.parent.absolute().as_posix(),\n", " )\n", "\n", " with zipfile.ZipFile(f\"{DATASET_ROOT.parent}/pandaset.zip\", \"r\") as zip_ref:\n", " zip_ref.extractall(DATASET_ROOT.parent)\n", "\n", " # Remove the pandaset.zip file after extraction\n", " (DATASET_ROOT.parent / \"pandaset.zip\").unlink(missing_ok=True)\n", "else:\n", " 
print(f\"Dataset root {DATASET_ROOT} already exists\")" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "## Create Table" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "table = load_pandaset(\n", " dataset_root=DATASET_ROOT,\n", " table_name=TABLE_NAME,\n", " dataset_name=DATASET_NAME,\n", " project_name=PROJECT_NAME,\n", " data_path=DATA_PATH,\n", " max_frames=MAX_FRAMES,\n", " max_sequences=MAX_SEQUENCES,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" }, "test_marks": [ "slow", "dependent" ] }, "nbformat": 4, "nbformat_minor": 5 }