{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Add new data to existing Table lineage\n",
    "\n",
    "Adding new data to an existing dataset is a common task, as more data is collected and we want to leverage it to improve the model. This notebook demonstrates how to add new data to an existing 3LC dataset by creating a new table that merges two or more existing tables.\n",
    "\n",
    "<!-- Tags: [\"table-lineage\"] -->\n",
    "\n",
    "We will cover two examples:\n",
    "1. Adding new data with the same classes.\n",
    "2. Adding new data with different classes, requiring a new, merged schema."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Project setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "DATA_PATH = \"../../data\"\n",
    "PROJECT_NAME = \"3LC Tutorials - Cats & Dogs\"\n",
    "DATASET_NAME = \"cats-and-dogs\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install 3lc\n",
    "%pip install git+https://github.com/3lc-ai/3lc-examples.git"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import tlc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add new data with the same classes\n",
    "\n",
    "We will reuse the cats and dogs dataset from the previous section and add a new batch of data.\n",
    "\n",
    "Before we add it, we need to create a `Table` with the new data. Notice also that we set the `weight_column_value=0.0`, this is to keep track of which samples were added in the resulting table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path = Path(DATA_PATH) / \"more-cats-and-dogs\"\n",
    "\n",
    "assert data_path.exists()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_data_table = tlc.Table.from_image_folder(\n",
    "    data_path,\n",
    "    table_name=\"new-data\",\n",
    "    dataset_name=DATASET_NAME,\n",
    "    project_name=PROJECT_NAME,\n",
    "    add_weight_column=True,\n",
    "    weight_column_value=0.0,\n",
    "    if_exists=\"overwrite\",\n",
    ")\n",
    "\n",
    "new_data_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also get the cats and dogs dataset from the notebook [create-table-from-image-folder.ipynb](../1-create-tables/create-table-from-image-folder.ipynb) to use as a base for the new data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "initial_table = tlc.Table.from_names(table_name=\"initial-cls\", dataset_name=DATASET_NAME, project_name=PROJECT_NAME)\n",
    "initial_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the two tables, we are ready to combine them using `Table.join_tables()`. We specify a list of tables to join, and the name of the new table resulting from joining them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined_table = tlc.Table.join_tables([initial_table, new_data_table], table_name=\"added-more-data\")\n",
    "joined_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for row in joined_table.table_rows:\n",
    "    print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add new data with different classes\n",
    "\n",
    "We will now create a new image folder table containing animals in the categories \"bats\" and \"frogs\". In order for this table to be joined with our existing table, we need to remap the labels \"bat\" and \"frog\", and their corresponding values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path = Path(DATA_PATH) / \"bats-and-frogs\"\n",
    "\n",
    "more_new_data_table = tlc.Table.from_image_folder(\n",
    "    data_path,\n",
    "    table_name=\"more-new-data\",\n",
    "    dataset_name=DATASET_NAME,\n",
    "    project_name=PROJECT_NAME,\n",
    "    add_weight_column=True,\n",
    "    weight_column_value=0.0,\n",
    "    if_exists=\"overwrite\",\n",
    ")\n",
    "\n",
    "more_new_data_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "more_new_data_table.get_simple_value_map(\"label\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update the value map\n",
    "remap_value_map_table = more_new_data_table.set_value_map(\"label\", {0: \"cats\", 1: \"dogs\", 2: \"bats\", 3: \"frogs\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "from tlc_tools.split import set_value_in_column_to_fixed_value\n",
    "\n",
    "# Update the row values: 0->2 and 1->3\n",
    "label_column = remap_value_map_table.get_column(\"label\").to_numpy()\n",
    "zero_indices = np.where(label_column == 0)[0].tolist()\n",
    "one_indices = np.where(label_column == 1)[0].tolist()\n",
    "\n",
    "remapped_bats_table = set_value_in_column_to_fixed_value(remap_value_map_table, \"label\", zero_indices, 2)\n",
    "remapped_frogs_table = set_value_in_column_to_fixed_value(remapped_bats_table, \"label\", one_indices, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now create yet another table by joining the previous joined table with the remapped bats and frogs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined_again_table = tlc.Table.join_tables([joined_table, remapped_frogs_table], table_name=\"added-bats-and-frogs-data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Originally, the two tables had different value maps. Let's inspect them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined_table.get_simple_value_map(\"label\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "final_value_map = joined_again_table.get_simple_value_map(\"label\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now inspect the row data of the final joined table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, row in enumerate(joined_again_table.table_rows):\n",
    "    image_path = row[\"image\"]\n",
    "    label = row[\"label\"]\n",
    "    weight = row[\"weight\"]\n",
    "    print(f\"Row {i}: {image_path}, {final_value_map[label]}, weight: {weight}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}