--- name: channel-name-parsing description: "Multi-format channel name parsing for KINTSUGI CHANNELNAMES.txt files" author: KINTSUGI Team date: 2024-12-15 --- # Channel Name Parsing - Research Notes ## Experiment Overview | Item | Details | |------|---------| | **Date** | 2024-12-15 | | **Goal** | Parse channel names from various CHANNELNAMES.txt formats | | **Environment** | KINTSUGI pipeline, Python 3.10+ | | **Status** | Success | ## Context Different microscopy systems and users produce CHANNELNAMES.txt files in various formats. KINTSUGI needs to parse channel/marker names to label output files correctly. The parsing must auto-detect the format and handle multiple conventions. ## Supported Formats ### Format 1: Simple List (One Channel Per Line) Most common format from CODEX systems. Each line is a channel name, 4 channels per cycle. Cycle number extracted from DAPI marker name (DAPI-01, DAPI-02, etc.). ``` DAPI-01 Blank Blank Blank DAPI-02 CD31 CD8 CD45 DAPI-03 CD20 Ki67 CD3e ``` ### Format 2: Cycle-Prefixed with Colon ``` 1: DAPI, Blank, Blank, Blank 2: DAPI, CD31, CD8, CD45 3: DAPI, CD20, Ki67, CD3e ``` ### Format 3: Tab-Separated ``` 1 DAPI Blank Blank Blank 2 DAPI CD31 CD8 CD45 3 DAPI CD20 Ki67 CD3e ``` ### Format 4: CSV (Comma-Separated) ``` 1,DAPI,Blank,Blank,Blank 2,DAPI,CD31,CD8,CD45 3,DAPI,CD20,Ki67,CD3e ``` ## Verified Workflow ### Complete Parsing Function ```python import re from pathlib import Path def load_channel_names(meta_dir, filename="CHANNELNAMES.txt", channels_per_cycle=4): """ Load channel names from various formats. Returns: dict {cycle_number: [channel_names]} or None """ channel_file = Path(meta_dir) / filename # Try alternative filenames if not channel_file.exists(): alt_names = ["CHANNELNAMES.txt", "channelnames.txt", "channel_names.txt", "channel_names.csv", "channels.txt", "markers.txt"] for alt_name in alt_names: alt_file = Path(meta_dir) / alt_name if alt_file.exists(): channel_file = alt_file break else: return None # Read non-empty, non-comment lines lines = [] with open(channel_file, 'r') as f: for line in f: line = line.strip() if line and not line.startswith('#'): lines.append(line) if not lines: return None channel_dict = {} first_line = lines[0] # Detect format from first line if ':' in first_line or '\t' in first_line or \ (first_line.split(',')[0].strip().isdigit() and len(first_line.split(',')) > 2): # Cycle-prefixed format for line in lines: try: if ':' in line: cycle_str, names_str = line.split(':', 1) cycle = int(cycle_str.strip()) names = [n.strip() for n in names_str.split(',')] elif '\t' in line: parts = line.split('\t') cycle = int(parts[0].strip()) names = [n.strip() for n in parts[1:]] else: parts = line.split(',') cycle = int(parts[0].strip()) names = [n.strip() for n in parts[1:]] channel_dict[cycle] = names except (ValueError, IndexError): continue else: # Simple list format - detect cycles from DAPI-XX pattern current_cycle = 0 cycle_channels = [] for line in lines: dapi_match = re.match(r'DAPI[-_]?(\d+)', line, re.IGNORECASE) if dapi_match: # Save previous cycle if cycle_channels and current_cycle > 0: channel_dict[current_cycle] = cycle_channels # Start new cycle current_cycle = int(dapi_match.group(1)) cycle_channels = [line] elif current_cycle > 0: cycle_channels.append(line) if len(cycle_channels) == channels_per_cycle: channel_dict[current_cycle] = cycle_channels cycle_channels = [] # Save final cycle if cycle_channels and current_cycle > 0: channel_dict[current_cycle] = cycle_channels return channel_dict ``` ### Usage ```python meta_dir = project.paths.meta # or Path("/path/to/meta") channel_name_dict = load_channel_names(meta_dir) if channel_name_dict is None: # Fallback to manual definition channel_name_dict = { 1: ["DAPI", "Blank1a", "Blank1b", "Blank1c"], 2: ["DAPI", "CD31", "CD8", "CD45"], 3: ["DAPI", "CD20", "Ki67", "CD3e"], } # Access channel name for cycle 2, channel 3 marker = channel_name_dict.get(2, [''] * 4)[2] # "CD8" ``` ## Failed Attempts (Critical) | Attempt | Why it Failed | Lesson Learned | |---------|---------------|----------------| | Only supporting cycle-prefixed format | Simple list format common in CODEX systems | Must auto-detect format from first line | | Hardcoding 4 channels per cycle | Some systems have different channel counts | Make channels_per_cycle a parameter | | Requiring exact "DAPI" match | Some files use "DAPI-01", "DAPI-02" with cycle number | Use regex to extract cycle from DAPI marker | | Case-sensitive matching | "dapi-01" and "DAPI-01" both valid | Use re.IGNORECASE flag | ## Final Parameters ### Format Detection Heuristic ```python # Check first line for format indicators first_line = lines[0] is_cycle_prefixed = ( ':' in first_line or # "1: DAPI, Blank..." '\t' in first_line or # "1\tDAPI\tBlank..." (first_line.split(',')[0].strip().isdigit() and len(first_line.split(',')) > 2) # "1,DAPI,Blank..." ) ``` ### DAPI Cycle Extraction Regex ```python dapi_match = re.match(r'DAPI[-_]?(\d+)', line, re.IGNORECASE) # Matches: DAPI-01, DAPI_01, DAPI01, dapi-1, etc. ``` ## Key Insights - Auto-detect format rather than requiring user specification - Simple list format uses DAPI marker to determine cycle boundaries - Always provide fallback when file not found or parsing fails - Support multiple filename conventions (CHANNELNAMES.txt, channelnames.txt, etc.) - Comments (lines starting with #) should be ignored - Empty lines should be skipped ## References - CODEX channel naming conventions - KINTSUGI Notebook 2 cell-7 (Processing Parameters)