{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Weighted Table Subset Selection\n", "\n", "This notebook demonstrates how to apply zero weights to a subset of table rows for selective data processing.\n", "\n", "![](../images/weight-coreset.png)\n", "\n", "\n", "\n", "This technique is particularly useful in active learning and data labeling\n", "workflows, where only a subset of rows should be utilized for training or\n", "considered for labeling in each iteration.\n", "\n", "Specifically, this example demonstrates balanced coreset selection on a dataset,\n", "setting all non-coreset rows' weights to zero. The coreset selection strategy\n", "can be adapted to employ different approaches, such as random sampling,\n", "uncertainty-based sampling, or other model-driven selection criteria." ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "%pip install 3lc\n", "%pip install git+https://github.com/3lc-ai/3lc-examples.git" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import tlc\n", "\n", "from tlc_tools.split import get_balanced_coreset_indices, set_value_in_column_to_fixed_value" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## Project setup" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - CIFAR-10\"\n", "DATASET_NAME = \"CIFAR-10-train\"\n", "TABLE_NAME = \"initial\"" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## Load input table\n", "\n", "This assumes CIFAR-10-train has been created by running the notebook [create-table-from-torch.ipynb](../1-create-tables/create-table-from-torch-dataset.ipynb).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "table = tlc.Table.from_names(TABLE_NAME, DATASET_NAME, PROJECT_NAME)" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## Compute coreset\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "# This function ensures the coreset is exactly balanced in terms of the split_by column.\n", "# The size parameter is the fraction of the minority class that should be included in the coreset.\n", "coreset_indices, non_coreset_indices = get_balanced_coreset_indices(\n", " table,\n", " size=0.01, # CIFAR-10-train has 5000 samples per class, so 0.01 will result in 500 samples per class\n", " split_by=\"Label\",\n", " random_seed=42,\n", ")" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## Weight non-coreset rows to 0" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "coreset_table = set_value_in_column_to_fixed_value(\n", " table,\n", " \"weight\",\n", " non_coreset_indices,\n", " 0.0,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "coreset_table" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "# During training, we can now use a sampler that only samples non-zero weight rows\n", "sampler = coreset_table.create_sampler(\n", " exclude_zero_weights=True,\n", ")\n", "print(len(sampler))" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "## Remove non-coreset samples" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "from tlc_tools.split import keep_indices\n", "\n", "subset = keep_indices(\n", " table, coreset_indices, table_name=\"balanced-subset\", table_description=\"Keep only a size 500 coreset\"\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }