{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Add new data to dataset with splits\n", "\n", "Add new data and re-split a dataset.\n", "\n", "![img](../images/add-new-data-and-split.png)\n", "\n", "\n", "\n", "When adding new incoming data to a dataset that already has several split tables, let's take a look at two ways to go about it:\n", "1. Merge all the data into one table, and then create new splits from this table.\n", "2. Create new split Tables for the incoming data and then join those with the corresponding existing tables.\n", "\n", "Here we will show each of these in turn." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install 3lc\n", "%pip install git+https://github.com/3lc-ai/3lc-examples.git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tlc\n", "\n", "from tlc_tools.split import split_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Project setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - Splits\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Merge-first strategy\n", "\n", "![Merge First Strategy](../images/merge_first.png)\n", "\n", "The merge-first strategy first merges in the new data, and then creates new splits for all the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DATASET_NAME = \"add_new_data_merge_first\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_train = tlc.Table.from_dict(\n", " data={\"my_column\": [1, 2, 3, 4, 5]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"original_train\",\n", ")\n", "original_val = tlc.Table.from_dict(\n", " data={\"my_column\": [6, 7, 8, 9, 10]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"original_val\",\n", ")\n", "original_test = tlc.Table.from_dict(\n", " data={\"my_column\": [11, 12, 13, 14, 15]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"original_test\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_joined = tlc.Table.join_tables(\n", " tables=[original_train, original_val, original_test],\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"original_joined\",\n", ")\n", "\n", "new = tlc.Table.from_dict(\n", " data={\"my_column\": [16, 17, 18, 19, 20]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"new\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_joined = tlc.Table.join_tables(\n", " tables=[original_joined, new],\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"all_joined\",\n", ")\n", "all_joined" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here a random split is applied, but any strategy could be used. See [split-tables.ipynb](split-tables.ipynb) for a more complete example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_tables = split_table(\n", " all_joined,\n", " splits={\"train\": 0.4, \"val\": 0.3, \"test\": 0.3},\n", ")\n", "\n", "new_train = new_tables[\"train\"]\n", "new_val = new_tables[\"val\"]\n", "new_test = new_tables[\"test\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for split, table in new_tables.items():\n", " print(f\"New {split} table: [\" + \", \".join(str(row[\"my_column\"]) for row in table) + \"]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Split-first strategy\n", "\n", "![Split First Strategy](../images/split_first.png)\n", "\n", "The split-first strategy first splits the new data, and then merges each resulting split with the corresponding original splits." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DATASET_NAME = \"add_new_data_split_first\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_train = tlc.Table.from_dict(\n", " data={\"my_column\": [1, 2, 3, 4, 5]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"original_train\",\n", ")\n", "original_val = tlc.Table.from_dict(\n", " data={\"my_column\": [6, 7, 8, 9, 10]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"original_val\",\n", ")\n", "original_test = tlc.Table.from_dict(\n", " data={\"my_column\": [11, 12, 13, 14, 15]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"original_test\",\n", ")\n", "\n", "new = tlc.Table.from_dict(\n", " data={\"my_column\": [16, 17, 18, 19, 20]},\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"new\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here a random split is applied, but any strategy could be used. See [split-tables.ipynb](split-tables.ipynb) for a more complete example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_tables_tmp = split_table(new, splits={\"train\": 0.4, \"val\": 0.3, \"test\": 0.3})\n", "\n", "new_train_tmp = new_tables_tmp[\"train\"]\n", "new_val_tmp = new_tables_tmp[\"val\"]\n", "new_test_tmp = new_tables_tmp[\"test\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_train = tlc.Table.join_tables(\n", " tables=[original_train, new_train_tmp],\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"new_train\",\n", ")\n", "new_val = tlc.Table.join_tables(\n", " tables=[original_val, new_val_tmp],\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"new_val\",\n", ")\n", "new_test = tlc.Table.join_tables(\n", " tables=[original_test, new_test_tmp],\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=\"new_test\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for split, table in [(\"train\", new_train), (\"val\", new_val), (\"test\", new_test)]:\n", " print(f\"New {split} table: [\" + \", \".join(str(row[\"my_column\"]) for row in table) + \"]\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 2 }