{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Weighted Table Subset Selection\n",
    "\n",
    "This notebook demonstrates how to apply zero weights to a subset of table rows for selective data processing.\n",
    "\n",
    "![](../images/weight-coreset.png)\n",
    "\n",
    "<!-- Tags: [\"data-curation\"] -->\n",
    "\n",
    "This technique is particularly useful in active learning and data labeling\n",
    "workflows, where only a subset of rows should be utilized for training or\n",
    "considered for labeling in each iteration.\n",
    "\n",
    "Specifically, this example demonstrates balanced coreset selection on a dataset,\n",
    "setting all non-coreset rows' weights to zero. The coreset selection strategy\n",
    "can be adapted to employ different approaches, such as random sampling,\n",
    "uncertainty-based sampling, or other model-driven selection criteria."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "## Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install 3lc\n",
    "%pip install git+https://github.com/3lc-ai/3lc-examples.git"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tlc\n",
    "\n",
    "from tlc_tools.split import get_balanced_coreset_indices, set_value_in_column_to_fixed_value"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## Project setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT_NAME = \"3LC Tutorials - CIFAR-10\"\n",
    "DATASET_NAME = \"CIFAR-10-train\"\n",
    "TABLE_NAME = \"initial\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "## Load input table\n",
    "\n",
    "This assumes CIFAR-10-train has been created by running the notebook [create-table-from-torch.ipynb](../1-create-tables/create-table-from-torch-dataset.ipynb).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "table = tlc.Table.from_names(TABLE_NAME, DATASET_NAME, PROJECT_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "## Compute coreset\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This function ensures the coreset is exactly balanced in terms of the split_by column.\n",
    "# The size parameter is the fraction of the minority class that should be included in the coreset.\n",
    "coreset_indices, non_coreset_indices = get_balanced_coreset_indices(\n",
    "    table,\n",
    "    size=0.01,  # CIFAR-10-train has 5000 samples per class, so 0.01 will result in 500 samples per class\n",
    "    split_by=\"Label\",\n",
    "    random_seed=42,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "## Weight non-coreset rows to 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "coreset_table = set_value_in_column_to_fixed_value(\n",
    "    table,\n",
    "    \"weight\",\n",
    "    non_coreset_indices,\n",
    "    0.0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "coreset_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# During training, we can now use a sampler that only samples non-zero weight rows\n",
    "sampler = coreset_table.create_sampler(\n",
    "    exclude_zero_weights=True,\n",
    ")\n",
    "print(len(sampler))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "## Remove non-coreset samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tlc_tools.split import keep_indices\n",
    "\n",
    "subset = keep_indices(\n",
    "    table, coreset_indices, table_name=\"balanced-subset\", table_description=\"Keep only a size 500 coreset\"\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}