{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create Custom Instance Segmentation Table\n", "\n", "Create a 3LC Table from custom RLE annotations using the Sartorius cell instance segmentation dataset with hundreds of cell instances per image.\n", "\n", "![img](../../images/cell-segmentations.jpg)\n", "\n", "\n", "\n", "Custom instance segmentation is needed when working with specialized domains like medical imaging, where standard datasets don't capture the specific characteristics of your data. RLE format is memory-efficient for dense segmentation masks.\n", "\n", "This notebook processes the Sartorius Cell Instance Segmentation dataset from Kaggle, converting RLE-encoded masks to 3LC format. We handle multiple instances per image and demonstrate custom schema definition for specialized annotation formats." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Project Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - Cell Segmentation\"\n", "DATASET_NAME = \"Sartorius Cell Segmentation\"\n", "TABLE_NAME = \"initial\"\n", "DOWNLOAD_PATH = \"../../transient_data\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install 3lc\n", "%pip install matplotlib\n", "%pip install kaggle\n", "%pip install git+https://github.com/3lc-ai/3lc-examples.git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import matplotlib.patches as patches\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pycocotools.mask as mask_utils\n", "import tlc\n", "from matplotlib.path import Path as MatplotlibPath\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare Dataset\n", "\n", "The data needs to be downloaded from Kaggle before it can be used.\n", "\n", "Either ensure you are logged in to Kaggle and the file `~/.kaggle/kaggle.json`\n", "exists, or set the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables before running the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DATASET_ROOT = (Path(DOWNLOAD_PATH) / \"sartorius-cell-instance-segmentation\").resolve().absolute()\n", "\n", "\n", "if not DATASET_ROOT.exists():\n", " import zipfile\n", "\n", " from kaggle import KaggleApi\n", "\n", " api = KaggleApi()\n", " api.authenticate()\n", "\n", " print(\"Downloading dataset from Kaggle\")\n", " api.competition_download_files(\n", " \"sartorius-cell-instance-segmentation\", path=Path(DOWNLOAD_PATH).absolute().as_posix()\n", " )\n", "\n", " with zipfile.ZipFile(f\"{DOWNLOAD_PATH}/sartorius-cell-instance-segmentation.zip\", \"r\") as zip_ref:\n", " zip_ref.extractall(DATASET_ROOT)\n", "\n", "\n", "else:\n", " print(f\"Dataset root {DATASET_ROOT} already exists\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the Table data\n", "\n", "The annotations are stored in a csv file, with one row per instance.\n", "\n", "We'll read the csv file and group the annotations by image_id, then convert the instance annotations to COCO RLE format before writing to a `Table`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_csv_file = DATASET_ROOT / \"train.csv\"\n", "assert train_csv_file.exists(), f\"Train CSV file {train_csv_file} does not exist\"\n", "\n", "train_csv = pd.read_csv(train_csv_file)\n", "train_csv.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Map cell names to indices\n", "cell_types_to_index = {\"astro\": 0, \"cort\": 1, \"shsy5y\": 2}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Group annotations by image_id\n", "image_annotations = {}\n", "\n", "for _, row in tqdm(train_csv.iterrows(), total=len(train_csv), desc=\"Grouping annotations by image_id\"):\n", " image_id = row[\"id\"]\n", "\n", " if image_id not in image_annotations:\n", " image_annotations[image_id] = {\n", " \"width\": row[\"width\"],\n", " \"height\": row[\"height\"],\n", " \"sample_id\": row[\"sample_id\"],\n", " \"annotations\": [],\n", " }\n", "\n", " # Add this annotation\n", " annotation = {\n", " \"cell_type_index\": cell_types_to_index[row[\"cell_type\"]],\n", " \"segmentation\": list(map(int, row[\"annotation\"].split())),\n", " }\n", " image_annotations[image_id][\"annotations\"].append(annotation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def starts_lengths_to_coco_rle(starts_lengths, image_height, image_width):\n", " \"\"\"Convert a list of starts and lengths to a COCO RLE by creating a binary mask and encoding it.\"\"\"\n", "\n", " # Convert to numpy array and get starts/lengths\n", " s = np.array(starts_lengths, dtype=int)\n", " starts = s[0::2] - 1 # Convert from 1-based to 0-based indexing\n", " lengths = s[1::2]\n", "\n", " # Create binary mask\n", " mask = np.zeros(image_height * image_width, dtype=np.uint8)\n", " for start, length in zip(starts, lengths):\n", " mask[start : start + length] = 1\n", " mask = mask.reshape(image_height, image_width)\n", "\n", " # Convert to COCO RLE format\n", " rle = mask_utils.encode(np.asfortranarray(mask))\n", " return rle[\"counts\"].decode(\"utf-8\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def annotations_to_3lc_format(image_annotations):\n", " \"\"\"Convert a list of annotations to the format required by 3LC instance segmentation Tables.\n", "\n", " Input format:\n", " {\n", " \"cell_type_index\": int,\n", " \"segmentation\": list[int],\n", " \"width\": int,\n", " \"height\": int,\n", " }\n", "\n", " Output format:\n", " {\n", " \"image_height\": int,\n", " \"image_width\": int,\n", " \"rles\": list[bytes],\n", " \"instance_properties\": {\n", " \"cell_type\": list[int],\n", " }\n", " }\n", " \"\"\"\n", " image_height = image_annotations[\"height\"]\n", " image_width = image_annotations[\"width\"]\n", "\n", " rles = []\n", " cell_types = []\n", "\n", " for annotation in image_annotations[\"annotations\"]:\n", " rle = starts_lengths_to_coco_rle(annotation[\"segmentation\"], image_height, image_width)\n", " rles.append(rle)\n", " cell_types.append(annotation[\"cell_type_index\"])\n", "\n", " return {\n", " \"image_height\": image_height,\n", " \"image_width\": image_width,\n", " \"rles\": rles,\n", " \"instance_properties\": {\n", " \"label\": cell_types,\n", " },\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can collect all the transformed column-data for the `Table`." 
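, "\n", "\n", "Before doing that, the next cell runs a quick, optional round-trip check on the conversion. It is a minimal sketch that reuses the made-up 4x4 annotation from earlier; `pycocotools` decodes the counts produced by `starts_lengths_to_coco_rle`, so the printed mask should match the toy mask above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check with a made-up annotation (not from the dataset)\n", "toy_annotation = [2, 3, 9, 2]  # 1-indexed (start, length) pairs\n", "counts = starts_lengths_to_coco_rle(toy_annotation, image_height=4, image_width=4)\n", "\n", "# Decode back with pycocotools; this should reproduce the toy mask\n", "decoded = mask_utils.decode({\"size\": [4, 4], \"counts\": counts.encode(\"utf-8\")})\n", "print(decoded)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the conversion verified, collect the column data:"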
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_ids = []\n", "image_paths = []\n", "segmentations = []\n", "\n", "for image_id, image_data in tqdm(image_annotations.items(), total=len(image_annotations), desc=\"Processing images\"):\n", " sample_ids.append(image_data[\"sample_id\"])\n", " image_paths.append(\n", " tlc.Url(DATASET_ROOT / \"train\" / f\"{image_id}.png\").to_relative().to_str()\n", " ) # Call to_relative() to ensure aliases are applied\n", " segmentations.append(annotations_to_3lc_format(image_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Table\n", "\n", "Create a `Table` using a `TableWriter` and a provided schema." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table_data = {\n", " \"sample_id\": sample_ids,\n", " \"image\": image_paths,\n", " \"segmentations\": segmentations,\n", "}\n", "\n", "table_schemas = {\n", " \"image\": tlc.ImageUrlSchema(sample_type=\"PILImage\"),\n", " \"segmentations\": tlc.SegmentationSchema(\n", " label_value_map={v: tlc.MapElement(k) for k, v in cell_types_to_index.items()},\n", " sample_type=tlc.InstanceSegmentationMasks.sample_type,\n", " ),\n", "}\n", "\n", "table = tlc.Table.from_dict(\n", " table_data,\n", " structure=table_schemas,\n", " project_name=PROJECT_NAME,\n", " dataset_name=DATASET_NAME,\n", " table_name=TABLE_NAME,\n", " if_exists=\"rename\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot a sample from the Table\n", "\n", "Fetch the first sample from the Table, plot the image and the instance masks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_sample = table[0]\n", "first_sample[\"image\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the image path to ensure the alias is working\n", "print(f\"First sample image path: {table.table_rows[0]['image']}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "masks = first_sample[\"segmentations\"][\"masks\"]\n", "combined_mask = masks.sum(axis=2) > 0 # Combine all instance masks to a single mask for plotting\n", "plt.imshow(combined_mask, cmap=\"gray\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert to polygons and make dataset splits\n", "\n", "To end this example, we'll convert the Table to a format compatible with\n", "[YOLO](https://github.com/3lc-ai/ultralytics) and make train/val splits." 
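, "\n", "\n", "As a brief aside before converting, the next cell summarizes how dense the annotations are. It is an optional sketch that relies only on the `segmentations` list built above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: summarize instance density across the dataset\n", "instance_counts = [len(seg[\"rles\"]) for seg in segmentations]\n", "print(f\"Images: {len(instance_counts)}\")\n", "print(\n", "    f\"Instances per image: min {min(instance_counts)}, max {max(instance_counts)}, \"\n", "    f\"mean {sum(instance_counts) / len(instance_counts):.1f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now derive the polygon version of the Table:"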
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tlc_tools.derived_tables import masks_to_polygons\n", "\n", "# Creates an EditedTable where the sample type of the segmentation is changed from masks to polygons\n", "polygon_table = masks_to_polygons(table)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_sample = polygon_table[0]\n", "polygons = first_sample[\"segmentations\"][\"polygons\"]\n", "\n", "fig, ax = plt.subplots()\n", "\n", "for polygon in polygons:\n", " vertices = np.array(polygon).reshape(-1, 2)\n", " path = MatplotlibPath(vertices)\n", " patch = patches.PathPatch(path, facecolor=\"#00FFFF\", edgecolor=\"black\")\n", " ax.add_patch(patch)\n", "\n", "# Set axis limits based on image dimensions\n", "ax.set_xlim(0, first_sample[\"segmentations\"][\"image_width\"])\n", "ax.set_ylim(0, first_sample[\"segmentations\"][\"image_height\"])\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tlc_tools.split import split_table\n", "\n", "splits = split_table(polygon_table)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Train: {splits['train']}\")\n", "print(f\"Val: {splits['val']}\")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.10" }, "test_marks": [ "dependent" ] }, "nbformat": 4, "nbformat_minor": 2 }