{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Load FHIBE Dataset\n", "\n", "This notebook loads the Sony AI's \"Fair Human-Centric Image Benchmark\" dataset as a 3LC Table, including keypoints, segmentation, bounding boxes, as well as rich subject metadata.\n", "\n", "![img](../images/fhibe.png)\n", "\n", "\n", "\n", "To download the dataset, you need to register at [fairnessbenchmark.ai.sony](https://fairnessbenchmark.ai.sony/). To read the original research paper, see [here](https://www.nature.com/articles/s41586-025-09716-2).\n", "\n", "Several versions of the dataset exist, for this tutorial we will use version from `fhibe.20250716.u.gT5_rFTA_downsampled_public.tar.gz`, but the ingestion script should work for any version of the dataset, as the internal layout of the dataset is the same.\n", "\n", "We include as much as possible of the metadata contained in the dataset, omitting only a few attributes in the name of simplicity, specifically the `_QA_annotator_id` fields have been left out.\n", "\n", "The data can be categorized as follows:\n", "- Main image\n", "- Geometric annotations (instance segmentations, keypoints, facial bounding box)\n", "- Image-level metadata (shutter speed, camera manufacturer, weather conditions, etc.)\n", "- Subject-level metadata (ancestry, hair color, age, etc.)\n", "\n", "This script reads all data from the CSV file and converts it to a format suitable for a 3LC Table. Several of the columns are stored as \"categorical strings\" (e.g. hair color \"Blond\", \"Gray\", \"White\", ...), these values are converted to integers, with their corresponding string values stored in the schema. This makes it easier to filter and work with these values in the 3LC Dashboard." ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "%pip install -q 3lc" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import time\n", "from collections import defaultdict\n", "from pathlib import Path\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import tlc\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## Project setup" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - FHIBE\"\n", "DATASET_NAME = \"FHIBE\"\n", "TABLE_NAME = \"initial\"\n", "MAX_SAMPLES = None\n", "DOWNLOAD_PATH = \"../../transient_data\"" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## Prepare data" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "FHIBE_ROOT = Path(DOWNLOAD_PATH) / \"fhibe\"\n", "CSV_FILE = FHIBE_ROOT / \"data/processed/fhibe_downsampled/fhibe_downsampled.csv\"\n", "\n", "if not CSV_FILE.exists():\n", " raise FileNotFoundError(f\"CSV_FILE does not exist: {CSV_FILE}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "# Load CSV (nrows=None reads all rows)\n", "t0 = time.time()\n", "df = pd.read_csv(CSV_FILE, nrows=MAX_SAMPLES)\n", "print(f\"CSV loading: {time.time() - t0:.2f}s ({len(df)} rows)\")\n", "\n", "\n", "def 
fast_parse(s):\n", " \"\"\"Parse serialized Python literal using json.loads (faster than ast.literal_eval).\"\"\"\n", " if pd.isna(s):\n", " return s\n", " # Replace single quotes with double quotes for JSON compatibility\n", " # Handle escaped quotes and None values\n", " s = s.replace(\"'\", '\"').replace(\"None\", \"null\").replace(\"True\", \"true\").replace(\"False\", \"false\")\n", " return json.loads(s)\n", "\n", "\n", "# Parse columns containing serialized Python literals\n", "SERIALIZED_COLUMNS = [\n", " \"lighting\",\n", " \"weather\",\n", " \"nationality\",\n", " \"ancestry\",\n", " \"pronoun\",\n", " \"natural_hair_color\",\n", " \"apparent_hair_color\",\n", " \"facial_hairstyle\",\n", " \"natural_facial_haircolor\",\n", " \"apparent_facial_haircolor\",\n", " \"natural_left_eye_color\",\n", " \"apparent_left_eye_color\",\n", " \"natural_right_eye_color\",\n", " \"apparent_right_eye_color\",\n", " \"facial_marks\",\n", " \"action_subject_object_interaction\",\n", " \"keypoints\",\n", " \"segments\",\n", " \"face_bbox\",\n", " \"person_bbox\",\n", "]\n", "\n", "t0 = time.time()\n", "for col in SERIALIZED_COLUMNS:\n", " if col in df.columns:\n", " df[col] = df[col].apply(fast_parse)\n", "print(f\"Parsing serialized columns: {time.time() - t0:.2f}s\")\n", "\n", "t0 = time.time()\n", "# Convert bounding boxes from [x, y, w, h] to [x0, y0, x1, y1] format\n", "\n", "\n", "def convert_xywh_to_xyxy(bbox):\n", " return [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]\n", "\n", "\n", "df[\"face_bbox\"] = df[\"face_bbox\"].apply(convert_xywh_to_xyxy)\n", "df[\"person_bbox\"] = df[\"person_bbox\"].apply(convert_xywh_to_xyxy)\n", "print(f\"Converting bboxes to xyxy: {time.time() - t0:.2f}s\")" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "# Columns to ingest (excluding QA annotator columns and other metadata)\n", "COLUMNS_TO_INGEST = [\n", " # Image-level metadata\n", " \"aperture_value\",\n", " \"camera_distance\",\n", " \"camera_position\",\n", " \"focal_length\",\n", " \"iso_speed_ratings\",\n", " \"lighting\",\n", " \"location_country\",\n", " \"location_region\",\n", " \"manufacturer\",\n", " \"model\",\n", " \"scene\",\n", " \"shutter_speed_value\",\n", " \"user_date_captured\",\n", " \"user_hour_captured\",\n", " \"weather\",\n", " # Subject-level metadata\n", " \"subject_id\",\n", " \"age\",\n", " \"nationality\",\n", " \"ancestry\",\n", " \"pronoun\",\n", " \"natural_skin_color\",\n", " \"apparent_skin_color\",\n", " \"hairstyle\",\n", " \"natural_hair_type\",\n", " \"apparent_hair_type\",\n", " \"natural_hair_color\",\n", " \"apparent_hair_color\",\n", " \"facial_hairstyle\",\n", " \"natural_facial_haircolor\",\n", " \"apparent_facial_haircolor\",\n", " \"natural_left_eye_color\",\n", " \"apparent_left_eye_color\",\n", " \"natural_right_eye_color\",\n", " \"apparent_right_eye_color\",\n", " \"facial_marks\",\n", " \"action_body_pose\",\n", " \"action_subject_object_interaction\",\n", " \"head_pose\",\n", "]\n", "\n", "# Special columns requiring custom processing (output as separate columns)\n", "SPECIAL_COLUMNS = [\"keypoints\", \"segments\", \"face_bbox\"]\n", "\n", "# Auxiliary columns (used internally but not output directly)\n", "AUXILIARY_COLUMNS = [\"person_bbox\", \"image_height\", \"image_width\", \"filepath\"]\n", "\n", "# Columns to treat as plain strings (not categorical due to high cardinality)\n", "STRING_COLUMNS = [\"user_date_captured\", \"subject_id\", \"location_region\", \"model\"]\n", "\n", 
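"# Optional sanity check (a minimal sketch, not part of the original export): warn about any\n",
"# expected columns that are missing from this CSV, e.g. when ingesting a different FHIBE release.\n",
"_missing_columns = [c for c in COLUMNS_TO_INGEST + SPECIAL_COLUMNS + AUXILIARY_COLUMNS if c not in df.columns]\n",
"print(f\"Columns missing from CSV: {_missing_columns}\" if _missing_columns else \"All expected columns found in CSV\")\n",
"\n",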
"# Columns with skin color values that need display_color in schema\n", "SKIN_COLOR_COLUMNS = [\"natural_skin_color\", \"apparent_skin_color\"]\n", "\n", "# Threshold for auto-detecting categorical columns (max unique values)\n", "CATEGORICAL_THRESHOLD = 100" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## Helper functions\n", "\n", "These functions handle value cleaning, type detection, and schema inference for the categorical columns." ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "### Value cleaning and mapping" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "def make_internal_name(s: str) -> str:\n", " \"\"\"Create a valid internal name for a 3LC MapElement.\n", "\n", " Removes numbered prefixes (like \"0. Standing\") and all disallowed characters.\n", " Disallowed characters: <>\\\\|.:\"'?*&\n", " \"\"\"\n", " if not isinstance(s, str):\n", " return str(s)\n", " # Remove numbered prefix like \"0. \" or \"12. \" (requires space after dot)\n", " s = re.sub(r\"^\\d+\\.\\s+\", \"\", s)\n", " # Remove disallowed characters\n", " for char in \"<>\\\\|.:\\\"'?*&\":\n", " s = s.replace(char, \"\")\n", " return s.strip()\n", "\n", "\n", "def get_unique_values(series: pd.Series, is_list: bool = False) -> list:\n", " \"\"\"Extract unique values from a column (already parsed from string literals).\"\"\"\n", " if is_list:\n", " all_vals = set()\n", " for val in series.dropna():\n", " if isinstance(val, list):\n", " all_vals.update(val)\n", " return list(all_vals)\n", " return list(series.dropna().unique())\n", "\n", "\n", "def sort_by_prefix(values: list) -> list:\n", " \"\"\"Sort values by their numeric prefix if present (e.g., '0. Standing' before '1. 
Sitting').\"\"\"\n", "\n", " def key(v):\n", " match = re.match(r\"^(\\d+)\\.\\s+\", str(v))\n", " return (int(match.group(1)), str(v)) if match else (999, str(v))\n", "\n", " return sorted(values, key=key)\n", "\n", "\n", "def build_value_map(series: pd.Series, is_list: bool = False) -> dict[str, tuple[int, str]]:\n", " \"\"\"Build a mapping from internal_name to (index, display_name).\n", "\n", " The display_name is the original value, internal_name has disallowed chars removed.\n", " Returns: {internal_name: (index, display_name), ...}\n", " \"\"\"\n", " unique_vals = sort_by_prefix(get_unique_values(series, is_list))\n", " return {\n", " make_internal_name(v): (i, v) # v is the original value for display\n", " for i, v in enumerate(unique_vals)\n", " }" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "### Type detection and schema inference" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "def detect_column_type(col_name: str, series: pd.Series) -> str:\n", " \"\"\"Detect the type of a column for schema inference.\n", "\n", " Returns one of: 'numeric', 'string', 'categorical', 'categorical_list', 'special'\n", " \"\"\"\n", " if col_name in SPECIAL_COLUMNS:\n", " return \"special\"\n", " if col_name in STRING_COLUMNS:\n", " return \"string\"\n", " if series.dtype in [\"int64\", \"float64\"]:\n", " return \"numeric\"\n", "\n", " # Check if column contains lists (already parsed)\n", " sample = series.dropna().iloc[0] if len(series.dropna()) > 0 else None\n", " if isinstance(sample, dict):\n", " return \"special\"\n", " if isinstance(sample, list):\n", " if sample and isinstance(sample[0], str):\n", " return \"categorical_list\"\n", " return \"special\"\n", "\n", " # For string columns, use unique count to determine categorical vs string\n", " return \"categorical\" if series.nunique() <= CATEGORICAL_THRESHOLD else \"string\"\n", "\n", "\n", "def tuple2hex(t: str) -> str:\n", " \"\"\"Convert a serialized RGB list to hex color: '[255, 255, 255]' -> '#FFFFFF'\"\"\"\n", " nums = [int(c) for c in t.strip(\"[]\").split(\",\")]\n", " return \"#{:02X}{:02X}{:02X}\".format(*nums)\n", "\n", "\n", "def build_map_elements(value_map: dict, col_name: str = None) -> dict:\n", " \"\"\"Build MapElement dict from value_map for use in schema.\n", "\n", " Args:\n", " value_map: {internal_name: (index, display_name), ...}\n", " col_name: Column name, used for special handling (e.g., skin color)\n", "\n", " Returns: {index: MapElement, ...}\n", " \"\"\"\n", " elements = {}\n", " for internal_name, (idx, display_name) in value_map.items():\n", " kwargs = {\"display_name\": display_name}\n", "\n", " # Special handling for skin color columns\n", " if col_name in SKIN_COLOR_COLUMNS:\n", " kwargs[\"display_color\"] = tuple2hex(internal_name)\n", "\n", " elements[idx] = tlc.MapElement(internal_name, **kwargs)\n", "\n", " return elements\n", "\n", "\n", "def infer_schema(col_name: str, series: pd.Series, default_args: dict):\n", " \"\"\"Infer the appropriate 3LC schema for a column based on its data.\"\"\"\n", " col_type = detect_column_type(col_name, series)\n", " is_list = col_type == \"categorical_list\"\n", "\n", " if col_type == \"numeric\":\n", " return tlc.Int32Schema(**default_args) if series.dtype == \"int64\" else tlc.Float32Schema(**default_args)\n", "\n", " if col_type == \"string\":\n", " return tlc.StringSchema(**default_args)\n", "\n", " if col_type in (\"categorical\", \"categorical_list\"):\n", " value_map = 
build_value_map(series, is_list=is_list)\n", " map_elements = build_map_elements(value_map, col_name)\n", "\n", " if is_list:\n", " return tlc.CategoricalLabelListSchema(classes=map_elements, **default_args)\n", " return tlc.CategoricalLabelSchema(classes=map_elements, **default_args)\n", "\n", " return None # Special columns handled separately" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "### Consolidation of country spelling variations" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "# Taken from https://github.com/SonyResearch/fhibe_evaluation_api/blob/main/fhibe_eval_api/datasets/fhibe.py\n", "\n", "loc_country_name_mapping = {\n", " \"Abgola\": \"Angola\",\n", " \"Abuja\": \"Nigeria\",\n", " \"Argentiina\": \"Argentina\",\n", " \"Australie\": \"Australia\",\n", " \"Autsralia\": \"Australia\",\n", " \"Auustralia\": \"Australia\",\n", " \"Bahamas, The\": \"Bahamas\",\n", " \"Caanada\": \"Canada\",\n", " \"Canadad\": \"Canada\",\n", " \"French\": \"France\",\n", " \"Hanoi Vietnam\": \"Viet Nam\",\n", " \"Ho Chi Min\": \"Viet Nam\",\n", " \"Hong Kong\": \"China, Hong Kong Special Administrative Region\",\n", " \"I Go\": None,\n", " \"Italiana\": \"Italy\",\n", " \"Keenya\": \"Kenya\",\n", " \"Kenyan\": \"Kenya\",\n", " \"Kiambu\": \"Kenya\",\n", " \"Lagos\": \"Nigeria\",\n", " \"Lceland\": \"Iceland\",\n", " \"Mexican\": \"Mexico\",\n", " \"Micronesia\": \"Micronesia (Federated States of)\",\n", " \"Mironesi\": \"Micronesia (Federated States of)\",\n", " \"Mironesia\": \"Micronesia (Federated States of)\",\n", " \"Morroco\": \"Morocco\",\n", " \"Muranga\": \"Kenya\",\n", " \"Nairobi Nairobi\": \"Kenya\",\n", " \"Netherlands\": \"Netherlands (Kingdom of the)\",\n", " \"Nigerian\": \"Nigeria\",\n", " \"Nigeriia\": \"Nigeria\",\n", " \"Niheria\": \"Nigeria\",\n", " \"Nugeria\": \"Nigeria\",\n", " \"Nyari\": \"Kenya\",\n", " \"Owow Disable Abilities Off Level Up\": None,\n", " \"Pakisan\": \"Pakistan\",\n", " \"Pakisatn\": \"Pakistan\",\n", " \"Pakistain\": \"Pakistan\",\n", " \"Paksitan\": \"Pakistan\",\n", " \"Phillipines\": \"Philippines\",\n", " \"Punjab\": \"Pakistan\",\n", " \"South Afica\": \"South Africa\",\n", " \"South Afria\": \"South Africa\",\n", " \"South African\": \"South Africa\",\n", " \"Southern Africa\": \"South Africa\",\n", " \"South Korea\": \"Republic of Korea\",\n", " \"Tanzania\": \"United Republic of Tanzania\",\n", " \"Trinidad And Tobago\": \"Trinidad and Tobago\",\n", " \"Turkey\": \"Türkiye\",\n", " \"Ua\": \"Ukraine\",\n", " \"Uae\": \"United Arab Emirates\",\n", " \"Ugnd\": \"Uganda\",\n", " \"Uk\": \"United Kingdom of Great Britain and Northern Ireland\",\n", " \"United Kingdom\": \"United Kingdom of Great Britain and Northern Ireland\",\n", " \"Ukaine\": \"Ukraine\",\n", " \"United States\": \"United States of America\",\n", " \"Usa\": \"United States of America\",\n", " \"Venezuela\": \"Venezuela (Bolivarian Republic of)\",\n", " \"Veitnam\": \"Viet Nam\",\n", " \"Vienam\": \"Viet Nam\",\n", " \"Vietam\": \"Viet Nam\",\n", " \"Vietnam\": \"Viet Nam\",\n", " \"Vietname\": \"Viet Nam\",\n", " \"Viietnam\": \"Viet Nam\",\n", " \"Vitenam\": \"Viet Nam\",\n", " \"Vitnam\": \"Viet Nam\",\n", " \"Viwtnam\": \"Viet Nam\",\n", "}\n", "\n", "\n", "def fix_location_country(country: str) -> str:\n", " \"\"\"Format the location_country attribute string.\n", "\n", " Some countries are misspelled or inconsistently formatted.\n", "\n", " Args:\n", " country: The original string 
annotation\n", "\n", " Return:\n", " The re-formatted string\n", " \"\"\"\n", " if pd.isna(country):\n", " return country\n", " if country in loc_country_name_mapping:\n", " return loc_country_name_mapping[country]\n", " country_fmt = country.strip().title()\n", " if country_fmt in loc_country_name_mapping:\n", " return loc_country_name_mapping[country_fmt]\n", " else:\n", " return country_fmt\n", "\n", "\n", "# Apply normalization to DataFrame before building value maps\n", "df[\"location_country\"] = df[\"location_country\"].apply(fix_location_country)" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "## Define data processing steps" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "NUM_KEYPOINTS = 33\n", "\n", "# fmt: off\n", "KEYPOINTS = [\n", " \"Nose\", # 0\n", " \"Right eye inner\", # 1\n", " \"Right eye\", # 2\n", " \"Right eye outer\", # 3\n", " \"Left eye inner\", # 4\n", " \"Left eye\", # 5 \n", " \"Left eye outer\", # 6\n", " \"Right ear\", # 7\n", " \"Left ear\", # 8\n", " \"Mouth right\", # 9\n", " \"Mouth left\", # 10\n", " \"Right shoulder\", # 11\n", " \"Left shoulder\", # 12\n", " \"Right elbow\", # 13\n", " \"Left elbow\", # 14\n", " \"Right wrist\", # 15\n", " \"Left wrist\", # 16\n", " \"Right pinky knuckle\", # 17\n", " \"Left pinky knuckle\", # 18\n", " \"Right index knuckle\", # 19\n", " \"Left index knuckle\", # 20\n", " \"Right thumb knuckle\", # 21\n", " \"Left thumb knuckle\", # 22\n", " \"Right hip\", # 23\n", " \"Left hip\", # 24\n", " \"Right knee\", # 25\n", " \"Left knee\", # 26\n", " \"Right ankle\", # 27\n", " \"Left ankle\", # 28\n", " \"Right heel\", # 29\n", " \"Left heel\", # 30\n", " \"Right foot index\", # 31\n", " \"Left foot index\", # 32\n", "]\n", "\n", "SKELETON = [\n", " 11, 12, 11, 13, 13, 15, 12, 14, 14, 16, 12, 24, 11, 23, 23, 24,\n", " 24, 26, 26, 28, 23, 25, 25, 27, 27, 29, 29, 31, 28, 30, 30, 32,\n", " 31, 27, 32, 28, 16, 18, 15, 17, 19, 17, 18, 20, 16, 20, 15, 19, 15, 21, 16, 22,\n", "]\n", "# fmt: on\n", "\n", "# Pre-build keypoint name to index mapping for fast lookup\n", "KEYPOINT_TO_INDEX = {name: i for i, name in enumerate(KEYPOINTS)}\n", "\n", "\n", "def build_segments_value_map(df: pd.DataFrame) -> dict[str, int]:\n", " \"\"\"Build value map for segment classes from the DataFrame.\"\"\"\n", " all_classes = set()\n", " for segments in df[\"segments\"].dropna():\n", " for seg in segments:\n", " all_classes.add(seg[\"class_name\"])\n", " sorted_classes = sort_by_prefix(list(all_classes))\n", " return {make_internal_name(c): i for i, c in enumerate(sorted_classes)}\n", "\n", "\n", "# Build segments value map from data\n", "segments_value_map = build_segments_value_map(df)\n", "\n", "\n", "def process_keypoints(keypoints: dict, person_bbox: list, image_width: int, image_height: int):\n", " \"\"\"Convert keypoints to 3LC format.\n", "\n", " Args:\n", " keypoints: Dict mapping keypoint names to [x, y, visibility] values\n", " person_bbox: Bounding box in [x0, y0, x1, y1] format (already converted)\n", " image_width: Image width in pixels\n", " image_height: Image height in pixels\n", " \"\"\"\n", " kpts_arr = np.zeros((NUM_KEYPOINTS, 3), dtype=np.float32)\n", "\n", " for kpt_name, (x, y, viz) in keypoints.items():\n", " idx = KEYPOINT_TO_INDEX.get(make_internal_name(kpt_name))\n", " if idx is not None:\n", " kpts_arr[idx, :] = [x, y, 2 if viz else 0]\n", "\n", " instances = tlc.Keypoints2DInstances.create_empty(\n", " image_width=image_width,\n", " 
image_height=image_height,\n", " include_keypoint_visibilities=True,\n", " include_instance_bbs=True,\n", " )\n", " instances.add_instance(keypoints=kpts_arr, label=0, bbox=person_bbox)\n", " return instances.to_row()\n", "\n", "\n", "def process_segments(segments: list, image_width: int, image_height: int):\n", " \"\"\"Convert segments to 3LC format.\"\"\"\n", "\n", " def group_segments_by_class(segments):\n", " grouped: dict[str, list[list]] = defaultdict(list)\n", " for segment in segments:\n", " class_name = make_internal_name(segment[\"class_name\"])\n", " poly = [[p[\"x\"], p[\"y\"]] for p in segment[\"polygon\"]]\n", " flattened = [coord for point in poly for coord in point]\n", " grouped[class_name].append(flattened)\n", " return grouped\n", "\n", " masks, labels = [], []\n", " for class_name, polygons in group_segments_by_class(segments).items():\n", " mask = tlc.SegmentationHelper.mask_from_polygons(polygons, image_height, image_width)\n", " masks.append(mask)\n", " labels.append(segments_value_map[class_name])\n", "\n", " return tlc.SegmentationMasksDict(\n", " image_width=image_width,\n", " image_height=image_height,\n", " masks=np.stack(masks, axis=-1),\n", " instance_properties={\"label\": labels},\n", " )\n", "\n", "\n", "def process_face_bbox(face_bbox: list, image_width: int, image_height: int):\n", " \"\"\"Convert face bounding box to 3LC format.\n", "\n", " Args:\n", " face_bbox: Bounding box in [x0, y0, x1, y1] format (already converted)\n", " \"\"\"\n", " return {\n", " tlc.IMAGE_WIDTH: image_width,\n", " tlc.IMAGE_HEIGHT: image_height,\n", " tlc.BOUNDING_BOX_LIST: [\n", " {\n", " tlc.X0: face_bbox[0],\n", " tlc.Y0: face_bbox[1],\n", " tlc.X1: face_bbox[2],\n", " tlc.Y1: face_bbox[3],\n", " tlc.LABEL: 0,\n", " }\n", " ],\n", " }" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "def convert_value(value, col_name: str, value_maps: dict):\n", " \"\"\"Convert a raw value to the format expected by 3LC.\n", "\n", " For categorical columns, maps string values to integer indices.\n", " For list columns, maps each value in the list.\n", " \"\"\"\n", " # Handle NaN values for scalar types - convert to None for proper handling\n", " if not isinstance(value, (list, dict)) and pd.isna(value):\n", " return None\n", "\n", " col_type = detect_column_type(col_name, df[col_name])\n", "\n", " if col_type == \"numeric\":\n", " return value\n", "\n", " if col_type == \"string\":\n", " return value\n", "\n", " if col_type == \"categorical_list\":\n", " value_map = value_maps.get(col_name)\n", " if value_map is None:\n", " return value\n", " return [value_map[make_internal_name(v)][0] for v in value] # [0] gets the index\n", "\n", " if col_type == \"categorical\":\n", " value_map = value_maps.get(col_name)\n", " if value_map is None:\n", " return value\n", " return value_map[make_internal_name(value)][0] # [0] gets the index\n", "\n", " return value" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "# Build value maps for all categorical columns\n", "value_maps = {}\n", "for col_name in COLUMNS_TO_INGEST:\n", " col_type = detect_column_type(col_name, df[col_name])\n", " if col_type in (\"categorical\", \"categorical_list\"):\n", " is_list = col_type == \"categorical_list\"\n", " value_maps[col_name] = build_value_map(df[col_name], is_list=is_list)\n", "\n", "print(f\"Built value maps for {len(value_maps)} categorical columns\")" ] }, { "cell_type": "markdown", "id": "22", 
"metadata": {}, "source": [ "## Row processing\n", "\n", "This function processes a single DataFrame row, converting it to the format expected by the TableWriter." ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "def process_row(csv_row):\n", " \"\"\"Process a single CSV row into the format expected by 3LC.\"\"\"\n", " image_width = int(csv_row[\"image_width\"])\n", " image_height = int(csv_row[\"image_height\"])\n", "\n", " # Build absolute image path and convert to relative 3LC URL\n", " image_path = FHIBE_ROOT / csv_row[\"filepath\"]\n", " image_url = tlc.Url(image_path).to_relative().to_str()\n", "\n", " # Build the output row with special columns\n", " row = {\n", " \"image\": image_url,\n", " \"keypoints\": process_keypoints(csv_row[\"keypoints\"], csv_row[\"person_bbox\"], image_width, image_height),\n", " \"segments\": process_segments(csv_row[\"segments\"], image_width, image_height),\n", " \"face_bbox\": process_face_bbox(csv_row[\"face_bbox\"], image_width, image_height),\n", " }\n", "\n", " # Add all other columns with appropriate conversions\n", " for col_name in COLUMNS_TO_INGEST:\n", " if col_name in SPECIAL_COLUMNS:\n", " continue\n", " row[col_name] = convert_value(csv_row[col_name], col_name, value_maps)\n", "\n", " return row" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "## Define column schemas\n", "\n", "We are now ready to define our schemas." ] }, { "cell_type": "code", "execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "# Default schema args: hidden by default and read-only in UI\n", "default_schema_args = {\"default_visible\": False, \"writable\": False}\n", "\n", "# Build schemas for special columns\n", "special_schemas = {\n", " \"image\": tlc.ImageUrlSchema(),\n", " \"keypoints\": tlc.Keypoints2DSchema(\n", " classes=[\"person\"],\n", " num_keypoints=NUM_KEYPOINTS,\n", " lines=SKELETON,\n", " point_attributes=KEYPOINTS,\n", " include_per_point_visibility=True,\n", " ),\n", " \"face_bbox\": tlc.BoundingBoxListSchema(\n", " label_value_map={0: tlc.MapElement(\"face\")},\n", " include_segmentation=False,\n", " ),\n", " \"segments\": tlc.SegmentationSchema(\n", " label_value_map={v: tlc.MapElement(k) for k, v in segments_value_map.items()},\n", " sample_type=tlc.InstanceSegmentationMasks.sample_type,\n", " ),\n", "}\n", "\n", "# Infer schemas for all other columns\n", "inferred_schemas = {}\n", "for col_name in COLUMNS_TO_INGEST:\n", " if col_name in SPECIAL_COLUMNS:\n", " continue\n", " schema = infer_schema(col_name, df[col_name], default_schema_args)\n", " if schema is not None:\n", " inferred_schemas[col_name] = schema\n", "\n", "# Combine all schemas\n", "schemas = {**special_schemas, **inferred_schemas}\n", "print(f\"Built schemas for {len(schemas)} columns\")" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "## Preview a sample row\n", "\n", "Before writing all rows, let's preview a single row to verify the data looks correct." 
] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "# Preview the first row\n", "sample_row = process_row(df.iloc[0])\n", "print(f\"Sample row keys ({len(sample_row)}): {list(sample_row.keys())[:10]}...\")\n", "print(f\"\\nImage: {sample_row['image']}\")\n", "print(f\"Subject ID: {sample_row['subject_id']}\")\n", "print(f\"Age: {sample_row['age']}\")\n", "print(f\"Scene: {sample_row['scene']}\")\n", "print(f\"Segments keys: {list(sample_row['segments'].keys())}\")" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "## Write the Table\n", "\n", "Finally, we create a `TableWriter`, and add our rows to the Table." ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "table_writer = tlc.TableWriter(\n", " table_name=TABLE_NAME,\n", " dataset_name=DATASET_NAME,\n", " project_name=PROJECT_NAME,\n", " column_schemas=schemas,\n", ")\n", "\n", "for csv_row in tqdm(df.to_dict(\"records\"), desc=\"Writing rows\"):\n", " table_writer.add_row(process_row(csv_row))\n", "\n", "table = table_writer.finalize()" ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "print(f\"Created table with {len(table)} rows\")\n", "print(f\"Table URL: {table.url}\")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }