{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenMed PII Detection & De-identification - Complete Guide\n", "\n", "This notebook walks through PII (Personally Identifiable Information) detection and de-identification in OpenMed, covering:\n", "\n", "1. **Basic PII Extraction** - Detect PII entities in clinical text\n", "2. **Smart Entity Merging** - Fix fragmentation issues (NEW in v0.5.0)\n", "3. **De-identification Methods** - Mask, remove, replace, hash, shift dates\n", "4. **Re-identification** - Reverse de-identification with mappings\n", "5. **Batch Processing** - Process multiple texts efficiently\n", "6. **Confidence Thresholding** - Control precision vs recall\n", "7. **Custom Patterns** - Add domain-specific PII patterns\n", "8. **Clinical Use Cases** - Real-world examples\n", "9. **Visualization** - Display results with highlighting\n", "10. **CLI Usage** - Command-line interface examples\n", "\n", "---\n", "\n", "**Requirements:**\n", "```bash\n", "pip install openmed\n", "```\n", "\n", "**Model Used:**\n", "- `openmed/OpenMed-PII-SuperClinical-Large-434M-v1` (default)\n", "- Trained on clinical notes, EHR data, and HIPAA-relevant PII\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup and Installation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/maziyar/Desktop/Work/openmed/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "✅ All imports successful!\n" ] } ], "source": [ "# Import required libraries\n", "import os\n", "from pprint import pprint\n", "import json\n", "\n", "# Set HuggingFace token (if needed)\n", "# os.environ['HF_TOKEN'] = 'your_token_here'\n", "\n", "# Import OpenMed PII functions\n", "from openmed import (\n", " extract_pii,\n", " deidentify,\n", " reidentify,\n", " PIIEntity,\n", " DeidentificationResult,\n", ")\n", "\n", "# Import smart merging utilities\n", "from openmed import (\n", " merge_entities_with_semantic_units,\n", " find_semantic_units,\n", " calculate_dominant_label,\n", " PII_PATTERNS,\n", " PIIPattern,\n", ")\n", "\n", "# Import batch processing\n", "from openmed import BatchProcessor, BatchItem, process_batch\n", "\n", "print(\"✅ All imports successful!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 1. Basic PII Extraction\n", "\n", "Extract PII entities from clinical text." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "BASIC PII EXTRACTION\n", "================================================================================\n", "Input text:\n", "\n", "Patient Name: Dr. Sarah Johnson\n", "Date of Birth: 03/15/1975\n", "Social Security: 123-45-6789\n", "Phone: (555) 123-4567\n", "Email: sarah.johnson@email.com\n", "Address: 456 Oak Avenue, Boston, MA 02115\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Found 11 PII entities:\n", "\n", " 1. [occupation ] 'Dr. ' (confidence: 0.597)\n", " 2. [first_name ] 'Sarah ' (confidence: 1.000)\n", " 3. 
[last_name ] 'Johnson ' (confidence: 0.998)\n", " 4. [date_of_birth ] '03/15/1975 ' (confidence: 0.693)\n", " 5. [ssn ] '123-45-6789 ' (confidence: 0.981)\n", " 6. [phone_number ] '555) 123-4567 ' (confidence: 0.868)\n", " 7. [email ] 'sarah.johnson@email.com ' (confidence: 1.000)\n", " 8. [street_address ] '456 Oak Avenue ' (confidence: 1.000)\n", " 9. [city ] 'Boston ' (confidence: 0.900)\n", "10. [state ] 'MA ' (confidence: 0.927)\n", "11. [postcode ] '02115 ' (confidence: 0.967)\n", "\n", "================================================================================\n" ] } ], "source": [ "# Simple clinical text with various PII types\n", "clinical_text = \"\"\"\n", "Patient Name: Dr. Sarah Johnson\n", "Date of Birth: 03/15/1975\n", "Social Security: 123-45-6789\n", "Phone: (555) 123-4567\n", "Email: sarah.johnson@email.com\n", "Address: 456 Oak Avenue, Boston, MA 02115\n", "\"\"\"\n", "\n", "print(\"=\" * 80)\n", "print(\"BASIC PII EXTRACTION\")\n", "print(\"=\" * 80)\n", "print(f\"Input text:\\n{clinical_text}\\n\")\n", "\n", "# Extract PII with default settings\n", "result = extract_pii(\n", " clinical_text,\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=0.5,\n", " use_smart_merging=True # DEFAULT in v0.5.0\n", ")\n", "\n", "print(f\"Found {len(result.entities)} PII entities:\\n\")\n", "for i, entity in enumerate(result.entities, 1):\n", " print(f\"{i:2d}. 
[{entity.label:25s}] '{entity.text:30s}' (confidence: {entity.confidence:.3f})\")\n", "\n", "print(\"\\n\" + \"=\" * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspecting Entity Details" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First Entity Details:\n", " Text: Dr.\n", " Label: occupation\n", " Confidence: 0.5971\n", " Start position: 14\n", " End position: 17\n", " Extracted from result.text: 'Dr.'\n" ] } ], "source": [ "# Access individual entity properties\n", "if result.entities:\n", "    entity = result.entities[0]\n", "    print(\"First Entity Details:\")\n", "    print(f\" Text: {entity.text}\")\n", "    print(f\" Label: {entity.label}\")\n", "    print(f\" Confidence: {entity.confidence:.4f}\")\n", "    print(f\" Start position: {entity.start}\")\n", "    print(f\" End position: {entity.end}\")\n", "    print(f\" Extracted from result.text: '{result.text[entity.start:entity.end]}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 2. Smart Entity Merging (NEW in v0.5.0)\n", "\n", "Smart merging fixes the fragmentation problem, where the tokenizer splits dates, SSNs, phone numbers, and other PII entities into unusable fragments." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Problem: Fragmentation" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "COMPARING: WITHOUT vs WITH Smart Merging\n", "================================================================================\n", "Input: Patient DOB: 01/15/1970, Admission: 2024-03-20, SSN: 987-65-4321\n", "\n", "❌ WITHOUT Smart Merging (use_smart_merging=False)\n", "--------------------------------------------------------------------------------\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Found 5 entities:\n", " [date ] '01' (confidence: 0.886)\n", " [date_of_birth ] '/15' (confidence: 0.704)\n", " [date ] '/1970' (confidence: 0.565)\n", " [date ] '2024-03-20' (confidence: 0.999)\n", " [ssn ] '987-65-4321' (confidence: 0.997)\n", "\n", "⚠️ PROBLEM: 3 date fragments detected!\n", " These fragments are unusable for production de-identification.\n", "\n", "================================================================================\n", "✅ WITH Smart Merging (use_smart_merging=True) - DEFAULT\n", "--------------------------------------------------------------------------------\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Found 3 entities:\n", " [date ] '01/15/1970' (confidence: 0.718)\n", " [date ] '2024-03-20' (confidence: 0.999)\n", " [ssn ] '987-65-4321' (confidence: 0.997)\n", "\n", "✅ SUCCESS: 2 complete date entities!\n", " Production-ready for de-identification.\n", "\n", "================================================================================\n" ] } ], "source": [ "test_text = \"Patient DOB: 01/15/1970, Admission: 2024-03-20, SSN: 
987-65-4321\"\n", "\n", "print(\"=\" * 80)\n", "print(\"COMPARING: WITHOUT vs WITH Smart Merging\")\n", "print(\"=\" * 80)\n", "print(f\"Input: {test_text}\\n\")\n", "\n", "# WITHOUT smart merging (raw model output)\n", "print(\"❌ WITHOUT Smart Merging (use_smart_merging=False)\")\n", "print(\"-\" * 80)\n", "result_raw = extract_pii(\n", " test_text,\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=0.5,\n", " use_smart_merging=False # Disable smart merging\n", ")\n", "\n", "print(f\"Found {len(result_raw.entities)} entities:\")\n", "for entity in result_raw.entities:\n", " print(f\" [{entity.label:20s}] '{entity.text}' (confidence: {entity.confidence:.3f})\")\n", "\n", "# Check for fragmentation\n", "date_fragments = [e for e in result_raw.entities if 'date' in e.label.lower() and len(e.text) < 8]\n", "if date_fragments:\n", " print(f\"\\n⚠️ PROBLEM: {len(date_fragments)} date fragments detected!\")\n", " print(\" These fragments are unusable for production de-identification.\")\n", "\n", "print(\"\\n\" + \"=\" * 80)\n", "\n", "# WITH smart merging (default)\n", "print(\"✅ WITH Smart Merging (use_smart_merging=True) - DEFAULT\")\n", "print(\"-\" * 80)\n", "result_merged = extract_pii(\n", " test_text,\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=0.5,\n", " use_smart_merging=True # Enable smart merging (DEFAULT)\n", ")\n", "\n", "print(f\"Found {len(result_merged.entities)} entities:\")\n", "for entity in result_merged.entities:\n", " print(f\" [{entity.label:20s}] '{entity.text}' (confidence: {entity.confidence:.3f})\")\n", "\n", "# Check for complete dates\n", "complete_dates = [e for e in result_merged.entities if 'date' in e.label.lower() and len(e.text) >= 8]\n", "if complete_dates:\n", " print(f\"\\n✅ SUCCESS: {len(complete_dates)} complete date entities!\")\n", " print(\" Production-ready for de-identification.\")\n", "\n", "print(\"\\n\" + \"=\" * 80)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### How Smart Merging Works" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "SMART MERGING: Semantic Unit Detection\n", "================================================================================\n", "Input: Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789, Phone: (555) 123-4567\n", "\n", "Detected 2 semantic units using regex patterns:\n", "\n", " [date ] '01/15/1970' at position 24-34\n", " [ssn ] '123-45-6789' at position 41-52\n", "\n", "================================================================================\n", "Total PII patterns defined: 20\n", "\n", "Pattern categories:\n", " - credit_debit_card\n", " - date\n", " - email\n", " - ipv4\n", " - ipv6\n", " - mac_address\n", " - medical_record_number\n", " - phone_number\n", " - postcode\n", " - ssn\n", " - street_address\n", " - url\n" ] } ], "source": [ "# Demonstrate semantic unit detection\n", "demo_text = \"Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789, Phone: (555) 123-4567\"\n", "\n", "print(\"=\" * 80)\n", "print(\"SMART MERGING: Semantic Unit Detection\")\n", "print(\"=\" * 80)\n", "print(f\"Input: {demo_text}\\n\")\n", "\n", "# Find semantic units using regex patterns\n", "semantic_units = find_semantic_units(demo_text)\n", "\n", "print(f\"Detected {len(semantic_units)} semantic units using regex patterns:\\n\")\n", "for start, end, entity_type in semantic_units:\n", " text_span = demo_text[start:end]\n", " print(f\" [{entity_type:20s}] '{text_span}' at position {start}-{end}\")\n", "\n", "print(\"\\n\" + \"=\" * 80)\n", "print(f\"Total PII patterns defined: {len(PII_PATTERNS)}\")\n", "print(\"\\nPattern categories:\")\n", "categories = set(p.entity_type for p in PII_PATTERNS)\n", "for cat in sorted(categories):\n", " print(f\" - {cat}\")" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### Supported PII Patterns" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "SUPPORTED PII PATTERNS\n", "================================================================================\n", "\n", "CREDIT_DEBIT_CARD:\n", " Priority 8: \\b\\d{4}[-\\s]?\\d{4}[-\\s]?\\d{4}[-\\s]?\\d{4}\\b\n", "\n", "DATE:\n", " Priority 10: \\b\\d{4}-\\d{2}-\\d{2}\\b\n", " Priority 9: \\b\\d{1,2}/\\d{1,2}/\\d{2,4}\\b\n", " Priority 9: \\b\\d{1,2}-\\d{1,2}-\\d{2,4}\\b\n", " Priority 8: \\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{1,2},? \\d{4}\\b\n", " Priority 8: \\b\\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{4}\\b\n", "\n", "EMAIL:\n", " Priority 10: \\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b\n", "\n", "IPV4:\n", " Priority 7: \\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b\n", "\n", "IPV6:\n", " Priority 8: \\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\\b\n", "\n", "MAC_ADDRESS:\n", " Priority 8: \\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\\b\n", "\n", "MEDICAL_RECORD_NUMBER:\n", " Priority 9: \\b(?:MRN|mrn)[:\\s#]*\\d{6,10}\\b\n", " Priority 5: \\b[A-Z]{2,3}\\d{6,9}\\b\n", "\n", "PHONE_NUMBER:\n", " Priority 9: \\b\\(\\d{3}\\)\\s*\\d{3}[-.\\s]?\\d{4}\\b\n", " Priority 8: \\b\\d{3}[-.\\s]\\d{3}[-.\\s]\\d{4}\\b\n", " Priority 5: \\b\\d{10}\\b\n", "\n", "POSTCODE:\n", " Priority 7: \\b\\d{5}(?:-\\d{4})?\\b\n", "\n", "SSN:\n", " Priority 10: \\b\\d{3}-\\d{2}-\\d{4}\\b\n", " Priority 9: \\b\\d{3}\\s\\d{2}\\s\\d{4}\\b\n", "\n", "STREET_ADDRESS:\n", " Priority 7: \\b\\d{1,5}\\s+[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)*\\s+(?:Street|St|Avenue|Ave|Road|Rd|Bou...\n", "\n", "URL:\n", " Priority 8: \\b(?:https?://)?(?:www\\.)?[a-zA-Z0-9-]+\\.[a-zA-Z]{2,}(?:/[^\\s]*)?\\b\n" ] } ], "source": [ "# Display all supported patterns\n", "print(\"=\" 
* 80)\n", "print(\"SUPPORTED PII PATTERNS\")\n", "print(\"=\" * 80)\n", "\n", "# Group patterns by type\n", "from collections import defaultdict\n", "patterns_by_type = defaultdict(list)\n", "for pattern in PII_PATTERNS:\n", " patterns_by_type[pattern.entity_type].append(pattern)\n", "\n", "for entity_type in sorted(patterns_by_type.keys()):\n", " patterns = patterns_by_type[entity_type]\n", " print(f\"\\n{entity_type.upper()}:\")\n", " for p in patterns:\n", " print(f\" Priority {p.priority}: {p.pattern[:80]}{'...' if len(p.pattern) > 80 else ''}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 3. De-identification Methods\n", "\n", "OpenMed supports multiple de-identification methods to protect patient privacy." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original Clinical Note:\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr. Sarah Johnson\n", "Date of Birth: 03/15/1975\n", "MRN: 87654321\n", "Social Security: 123-45-6789\n", "Contact: (555) 987-6543\n", "Email: sarah.j@hospital.org\n", "Address: 456 Oak Avenue, Boston, MA 02115\n", "\n", "Admission Date: 12/20/2024\n", "Discharge Date: 12/25/2024\n", "\n", "DIAGNOSIS: Type 2 Diabetes Mellitus\n", "\n", "\n", "================================================================================\n" ] } ], "source": [ "# Clinical note for de-identification\n", "patient_note = \"\"\"\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr. 
Sarah Johnson\n", "Date of Birth: 03/15/1975\n", "MRN: 87654321\n", "Social Security: 123-45-6789\n", "Contact: (555) 987-6543\n", "Email: sarah.j@hospital.org\n", "Address: 456 Oak Avenue, Boston, MA 02115\n", "\n", "Admission Date: 12/20/2024\n", "Discharge Date: 12/25/2024\n", "\n", "DIAGNOSIS: Type 2 Diabetes Mellitus\n", "\"\"\"\n", "\n", "print(\"Original Clinical Note:\")\n", "print(patient_note)\n", "print(\"\\n\" + \"=\" * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method 1: Mask (Placeholder replacement)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "METHOD 1: MASK (Placeholder replacement)\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "De-identified text:\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr.[first_name]h[last_name]n\n", "Date of Birth:[date_of_birth]5\n", "MRN: 87654321\n", "Social Security:[ssn]9\n", "Contact: [phone_number]3\n", "Email:[email]g\n", "Address:[street_address]e,[city]n,[state]A[postcode]5\n", "\n", "Admission Date:[date]4\n", "Discharge Date:[date]4\n", "\n", "DIAGNOSIS: Type 2 Diabetes Mellitus\n", "\n", "\n", "Entities masked: 12\n", " [first_name] 'Sarah' -> '[first_name]' conf=1.000 span=(46, 51)\n", " [last_name] 'Johnson' -> '[last_name]' conf=0.999 span=(52, 59)\n", " [date_of_birth] '03/15/1975' -> '[date_of_birth]' conf=0.815 span=(75, 85)\n", " [ssn] '123-45-6789' -> '[ssn]' conf=0.977 span=(117, 128)\n", " [phone_number] '555) 987-6543' -> '[phone_number]' conf=0.659 span=(139, 152)\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"METHOD 1: MASK (Placeholder replacement)\")\n", "print(\"=\" * 80)\n", "\n", 
"result_mask = deidentify(\n", " patient_note,\n", " method=\"mask\",\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.6,\n", " use_smart_merging=True,\n", ")\n", "\n", "print(\"De-identified text:\")\n", "print(result_mask.deidentified_text)\n", "\n", "# ✅ The library returns `pii_entities` (not `entities`)\n", "entities = getattr(result_mask, \"pii_entities\", None) or getattr(result_mask, \"entities\", [])\n", "\n", "print(f\"\\nEntities masked: {len(entities)}\")\n", "\n", "# Optional: sort by position for nicer display\n", "entities = sorted(entities, key=lambda e: getattr(e, \"start\", 0))\n", "\n", "for entity in entities[:5]: # Show first 5\n", " label = getattr(entity, \"label\", getattr(entity, \"entity_type\", \"UNKNOWN\"))\n", " text = getattr(entity, \"text\", \"\")\n", " redacted = getattr(entity, \"redacted_text\", \"\")\n", " conf = getattr(entity, \"confidence\", None)\n", " span = (getattr(entity, \"start\", None), getattr(entity, \"end\", None))\n", "\n", " conf_str = f\"{conf:.3f}\" if isinstance(conf, (int, float)) else \"n/a\"\n", " print(f\" [{label}] '{text}' -> '{redacted}' conf={conf_str} span={span}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method 2: Remove (Complete removal)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "METHOD 2: REMOVE (Complete removal)\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "De-identified text:\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr.hn\n", "Date of Birth:5\n", "MRN: 87654321\n", "Social Security:9\n", "Contact: 3\n", "Email:g\n", "Address:e,n,A5\n", 
"\n", "Admission Date:4\n", "Discharge Date:4\n", "\n", "DIAGNOSIS: Type 2 Diabetes Mellitus\n", "\n", "\n", "Entities removed: 12\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"METHOD 2: REMOVE (Complete removal)\")\n", "print(\"=\" * 80)\n", "\n", "result_remove = deidentify(\n", " patient_note,\n", " method=\"remove\",\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.6,\n", " use_smart_merging=True,\n", ")\n", "\n", "print(\"De-identified text:\")\n", "print(result_remove.deidentified_text)\n", "\n", "# ✅ Use `pii_entities` (fallback included for robustness)\n", "entities = getattr(result_remove, \"pii_entities\", None) or getattr(result_remove, \"entities\", [])\n", "\n", "print(f\"\\nEntities removed: {len(entities)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method 3: Replace (Synthetic data)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "METHOD 3: REPLACE (Synthetic data)\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "De-identified text:\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr.[first_name]h[last_name]n\n", "Date of Birth:[date_of_birth]5\n", "MRN: 87654321\n", "Social Security:[ssn]9\n", "Contact: [phone_number]3\n", "Email:[email]g\n", "Address:[street_address]e,[city]n,[state]A[postcode]5\n", "\n", "Admission Date:[date]4\n", "Discharge Date:[date]4\n", "\n", "DIAGNOSIS: Type 2 Diabetes Mellitus\n", "\n", "\n", "Entities replaced: 12\n", " [first_name] 'Sarah' -> '[first_name]'\n", " [last_name] 'Johnson' -> '[last_name]'\n", " [date_of_birth] '03/15/1975' -> '[date_of_birth]'\n", " [ssn] 
'123-45-6789' -> '[ssn]'\n", " [phone_number] '555) 987-6543' -> '[phone_number]'\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"METHOD 3: REPLACE (Synthetic data)\")\n", "print(\"=\" * 80)\n", "\n", "result_replace = deidentify(\n", " patient_note,\n", " method=\"replace\",\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.6,\n", " use_smart_merging=True,\n", ")\n", "\n", "print(\"De-identified text:\")\n", "print(result_replace.deidentified_text)\n", "\n", "# ✅ `DeidentificationResult` uses `pii_entities`\n", "entities = getattr(result_replace, \"pii_entities\", None) or getattr(result_replace, \"entities\", [])\n", "\n", "print(f\"\\nEntities replaced: {len(entities)}\")\n", "\n", "for entity in sorted(entities, key=lambda e: getattr(e, \"start\", 0))[:5]:\n", " label = getattr(entity, \"label\", getattr(entity, \"entity_type\", \"UNKNOWN\"))\n", " text = getattr(entity, \"text\", \"\")\n", " # for replace/mask, this often holds the replacement value or placeholder\n", " repl = getattr(entity, \"redacted_text\", \"\")\n", " print(f\" [{label}] '{text}' -> '{repl}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method 4: Hash (Cryptographic hashing)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "METHOD 4: HASH (Cryptographic hashing)\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "De-identified text (first 500 chars):\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr.first_name_7e8c729ehlast_name_3013b18fn\n", "Date of Birth:date_of_birth_ad87a4065\n", "MRN: 87654321\n", "Social Security:ssn_01a546299\n", 
"Contact: phone_number_d8f6c45f3\n", "Email:email_c67e1ae7g\n", "Address:street_address_c25c1d69e,city_a06522bcn,state_f0055891Apostcode_20ec61f35\n", "\n", "Admission Date:date_9b3129044\n", "Discharge Date:date_a98356c94\n", "\n", "DIAGNOSIS: Type 2 Diabetes Mellitus\n", "\n", "\n", "Entities hashed: 12\n", "\n", "Example hashed values:\n", " [first_name] Original: 'Sarah' Hashed: '7e8c729e'\n", " [last_name] Original: 'Johnson' Hashed: '3013b18f'\n", " [date_of_birth] Original: '03/15/1975' Hashed: 'ad87a406'\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"METHOD 4: HASH (Cryptographic hashing)\")\n", "print(\"=\" * 80)\n", "\n", "result_hash = deidentify(\n", " patient_note,\n", " method=\"hash\",\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.6,\n", " use_smart_merging=True,\n", ")\n", "\n", "print(\"De-identified text (first 500 chars):\")\n", "text = result_hash.deidentified_text or \"\"\n", "print((text[:500] + \"...\") if len(text) > 500 else text)\n", "\n", "# ✅ Use `pii_entities` (fallback included)\n", "entities = getattr(result_hash, \"pii_entities\", None) or getattr(result_hash, \"entities\", [])\n", "\n", "print(f\"\\nEntities hashed: {len(entities)}\")\n", "print(\"\\nExample hashed values:\")\n", "\n", "for entity in sorted(entities, key=lambda e: getattr(e, \"start\", 0))[:3]:\n", " label = getattr(entity, \"label\", getattr(entity, \"entity_type\", \"UNKNOWN\"))\n", " original = getattr(entity, \"text\", \"\")\n", " hashed = getattr(entity, \"hash_value\", None) or getattr(entity, \"redacted_text\", \"\")\n", " print(f\" [{label}] Original: '{original}' Hashed: '{hashed}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Method 5: Shift Dates (Date shifting)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"================================================================================\n", "METHOD 5: SHIFT_DATES (Preserves temporal relationships)\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "De-identified text:\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr.[first_name]h[last_name]n\n", "Date of Birth:[date_of_birth]5\n", "MRN: 87654321\n", "Social Security:[ssn]9\n", "Contact: [phone_number]3\n", "Email:[email]g\n", "Address:[street_address]e,[city]n,[state]A[postcode]5\n", "\n", "Admission Date:[date]4\n", "Discharge Date:[date]4\n", "\n", "DIAGNOSIS: Type 2 Diabetes Mellitus\n", "\n", "\n", "Date entities shifted:\n", " [date_of_birth] '03/15/1975' -> '[date_of_birth]'\n", " [date] '12/20/2024' -> '[date]'\n", " [date] '12/25/2024' -> '[date]'\n", "Note: Temporal relationships between dates are preserved!\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"METHOD 5: SHIFT_DATES (Preserves temporal relationships)\")\n", "print(\"=\" * 80)\n", "\n", "result_shift = deidentify(\n", " patient_note,\n", " method=\"shift_dates\",\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.6,\n", " use_smart_merging=True,\n", " date_shift_days=365, # Shift by 1 year\n", ")\n", "\n", "print(\"De-identified text:\")\n", "print(result_shift.deidentified_text)\n", "\n", "# ✅ Use `pii_entities` (fallback included)\n", "entities = getattr(result_shift, \"pii_entities\", None) or getattr(result_shift, \"entities\", [])\n", "\n", "date_entities = [\n", " e for e in entities\n", " if \"date\" in getattr(e, \"label\", getattr(e, \"entity_type\", \"\")).lower()\n", "]\n", "\n", "print(\"\\nDate entities shifted:\")\n", "for e in sorted(date_entities, key=lambda x: getattr(x, \"start\", 0)):\n", " label = getattr(e, \"label\", 
getattr(e, \"entity_type\", \"UNKNOWN\"))\n", " original = getattr(e, \"text\", \"\")\n", " shifted = getattr(e, \"redacted_text\", \"\") # often holds the shifted date or replacement\n", " print(f\" [{label}] '{original}' -> '{shifted}'\")\n", "\n", "print(\"Note: Temporal relationships between dates are preserved!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 4. Re-identification\n", "\n", "Reverse de-identification using stored mappings." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "RE-IDENTIFICATION\n", "================================================================================\n", "Step 1: De-identify with keep_mapping=True\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "De-identified text (first 200 chars):\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr.[first_name]h[last_name]n\n", "Date of Birth:[date_of_birth]5\n", "MRN: 87654321\n", "Social Security:[ssn]9\n", "Contact: [phone_number]3\n", "Email:[email]g\n", "Address:[street_addr...\n", "\n", "Mapping created: 11 entries\n", "\n", "First 5 mapping entries:\n", " 1. '[date]' → '12/20/2024'\n", " 2. '[postcode]' → '02115'\n", " 3. '[state]' → 'MA'\n", " 4. '[city]' → 'Boston'\n", " 5. 
'[street_address]' → '456 Oak Avenue'\n", "\n", "--------------------------------------------------------------------------------\n", "Step 2: Re-identify using the mapping\n", "\n", "Re-identified text (first 300 chars):\n", "\n", "CLINICAL NOTE\n", "=============\n", "Patient Name: Dr.SarahhJohnsonn\n", "Date of Birth:03/15/19755\n", "MRN: 87654321\n", "Social Security:123-45-67899\n", "Contact: 555) 987-65433\n", "Email:sarah.j@hospital.orgg\n", "Address:456 Oak Avenuee,Bostonn,MAA021155\n", "\n", "Admission Date:12/20/20244\n", "Discharge Date:12/20/20244\n", "\n", "DIAGNOSIS: Type 2 Di...\n", "\n", "--------------------------------------------------------------------------------\n", "Verification:\n", "⚠️ Difference detected (usually whitespace/formatting)\n", " Original length: 314\n", " Re-identified length: 314\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"RE-IDENTIFICATION\")\n", "print(\"=\" * 80)\n", "\n", "# De-identify with mapping\n", "print(\"Step 1: De-identify with keep_mapping=True\")\n", "result_with_mapping = deidentify(\n", " patient_note,\n", " method=\"mask\",\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=0.6,\n", " keep_mapping=True, # Keep mapping for re-identification\n", " use_smart_merging=True\n", ")\n", "\n", "print(f\"\\nDe-identified text (first 200 chars):\")\n", "print(result_with_mapping.deidentified_text[:200] + \"...\")\n", "\n", "print(f\"\\nMapping created: {len(result_with_mapping.mapping)} entries\")\n", "print(\"\\nFirst 5 mapping entries:\")\n", "for i, (redacted, original) in enumerate(list(result_with_mapping.mapping.items())[:5], 1):\n", " print(f\" {i}. 
'{redacted}' → '{original}'\")\n", "\n", "# Re-identify\n", "print(\"\\n\" + \"-\" * 80)\n", "print(\"Step 2: Re-identify using the mapping\")\n", "original_text = reidentify(\n", " result_with_mapping.deidentified_text,\n", " result_with_mapping.mapping\n", ")\n", "\n", "print(f\"\\nRe-identified text (first 300 chars):\")\n", "print(original_text[:300] + \"...\")\n", "\n", "# Verify\n", "print(\"\\n\" + \"-\" * 80)\n", "print(\"Verification:\")\n", "original_clean = patient_note.strip()\n", "reidentified_clean = original_text.strip()\n", "if original_clean == reidentified_clean:\n", " print(\"✅ SUCCESS: Re-identification is perfect!\")\n", "else:\n", " print(f\"⚠️ Difference detected (usually whitespace/formatting)\")\n", " print(f\" Original length: {len(original_clean)}\")\n", " print(f\" Re-identified length: {len(reidentified_clean)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 5. Batch Processing\n", "\n", "Efficiently process multiple clinical notes." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "BATCH PROCESSING\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n", "Device set to use cpu\n", "Device set to use cpu\n", "Device set to use cpu\n", "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Batch processing completed!\n", " Total items: 5\n", " Successful: 5\n", " Failed: 0\n", " Total processing time: 8.92s\n", "\n", "--------------------------------------------------------------------------------\n", "Results per note:\n", "\n", "📄 item_0:\n", " Entities found: 5\n", " - [first_name] 'John'\n", " - [last_name] 'Doe'\n", " - [date] '01'\n", " ... 
and 2 more\n", "\n", "📄 item_1:\n", " Entities found: 4\n", " - [occupation] 'Dr.'\n", " - [first_name] 'Sarah'\n", " - [last_name] 'Johnson'\n", " ... and 1 more\n", "\n", "📄 item_2:\n", " Entities found: 2\n", " - [date] '2024-03-20'\n", " - [date] '2024-03-25'\n", "\n", "📄 item_3:\n", " Entities found: 5\n", " - [street_address] '123 Main Street'\n", " - [city] 'Boston'\n", " - [state] 'MA'\n", " ... and 2 more\n", "\n", "📄 item_4:\n", " Entities found: 1\n", " - [email] 'patient.name@hospital.org'\n", "\n" ] } ], "source": [ "from openmed import BatchProcessor\n", "\n", "print(\"=\" * 80)\n", "print(\"BATCH PROCESSING\")\n", "print(\"=\" * 80)\n", "\n", "batch_texts = [\n", " \"Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789\",\n", " \"Dr. Sarah Johnson, Phone: (555) 123-4567, Email: sarah@email.com\",\n", " \"MRN: 87654321, Admission: 2024-03-20, Discharge: 2024-03-25\",\n", " \"Address: 123 Main Street, Boston, MA 02101, ZIP: 02101\",\n", " \"Contact: patient.name@hospital.org, Emergency: (555) 987-6543\",\n", "]\n", "\n", "processor = BatchProcessor(\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.5,\n", " group_entities=True,\n", " continue_on_error=True,\n", " # IMPORTANT: do NOT pass use_smart_merging here if your installed version triggers the HF pipeline error\n", ")\n", "\n", "batch_result = processor.process_texts(batch_texts)\n", "\n", "print(\"Batch processing completed!\")\n", "print(f\" Total items: {batch_result.total_items}\")\n", "print(f\" Successful: {batch_result.successful_items}\")\n", "print(f\" Failed: {batch_result.failed_items}\")\n", "print(f\" Total processing time: {batch_result.total_processing_time:.2f}s\")\n", "\n", "print(\"\\n\" + \"-\" * 80)\n", "print(\"Results per note:\\n\")\n", "\n", "for item_result in batch_result.items:\n", " if not item_result.success:\n", " print(f\"❌ {item_result.id}: {item_result.error}\")\n", " continue\n", "\n", " # In BatchProcessor results, 
entities usually live under item_result.result.entities\n", " ents = item_result.result.entities\n", " print(f\"📄 {item_result.id}:\")\n", " print(f\" Entities found: {len(ents)}\")\n", " for entity in ents[:3]:\n", " print(f\" - [{entity.label}] '{entity.text}'\")\n", " if len(ents) > 3:\n", " print(f\" ... and {len(ents) - 3} more\")\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch De-identification" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "BATCH DE-IDENTIFICATION (extract in batch, then deidentify)\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n", "Device set to use cpu\n", "Device set to use cpu\n", "Device set to use cpu\n", "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Batch extraction completed!\n", " Successful: 5/5\n", "\n", "De-identified texts:\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "📄 note_1:\n", "Patient: [first_name] [last_name], DOB: [date_of_birth], SSN: [ssn]\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "📄 note_2:\n", "[occupation] [first_name] [last_name], Phone: (555) 123-4567, Email: [email]\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "📄 note_3:\n", "MRN: 87654321, Admission: [date], Discharge: [date]\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": 
"stream", "text": [ "📄 note_4:\n", "Address: [street_address], [city], [state] [postcode], ZIP: [postcode]\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "📄 note_5:\n", "Contact: [email], Emergency: (555) 987-6543\n", "\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"BATCH DE-IDENTIFICATION (extract in batch, then deidentify)\")\n", "print(\"=\" * 80)\n", "\n", "ids = [f\"note_{i+1}\" for i in range(len(batch_texts))]\n", "\n", "# 1) Batch extraction (use_smart_merging is omitted here; some installed versions fail during HF pipeline creation when it is passed)\n", "batch_result = process_batch(\n", " batch_texts,\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " ids=ids,\n", " confidence_threshold=0.6,\n", " batch_size=2,\n", ")\n", "\n", "print(\"Batch extraction completed!\")\n", "print(f\" Successful: {batch_result.successful_items}/{batch_result.total_items}\\n\")\n", "\n", "# 2) De-identify each text individually (deidentify() accepts use_smart_merging)\n", "print(\"De-identified texts:\\n\")\n", "\n", "for item in batch_result.items:\n", " item_id = getattr(item, \"id\", None) or getattr(item, \"item_id\", \"unknown\")\n", "\n", " if not getattr(item, \"success\", False):\n", " err = getattr(item, \"error\", None) or getattr(item, \"exception\", None)\n", " print(f\"❌ {item_id} failed during extraction: {err}\")\n", " continue\n", "\n", " # Run deidentify() on the same text; fall back to a lookup by id if the item does not carry it\n", " original_text = item.text if hasattr(item, \"text\") else batch_texts[ids.index(item_id)]\n", " deid = deidentify(\n", " original_text,\n", " method=\"mask\",\n", " model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.6,\n", " use_smart_merging=True,\n", " )\n", "\n", " print(f\"📄 {item_id}:\")\n", " print(deid.deidentified_text)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 6. 
Confidence Thresholding\n", "\n", "Control precision vs recall trade-off." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "CONFIDENCE THRESHOLDING\n", "================================================================================\n", "Input: Patient: Jane Doe, DOB: 05/20/1985, Phone: 555-1234, Email: jane@email.com\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Threshold: 0.3 → 4 entities\n", " [first_name ] 'Jane ' (conf: 1.000)\n", " [last_name ] 'Doe ' (conf: 0.998)\n", " [date ] '05/20/1985 ' (conf: 0.672)\n", " [email ] 'jane@email.com ' (conf: 0.999)\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Threshold: 0.5 → 4 entities\n", " [first_name ] 'Jane ' (conf: 1.000)\n", " [last_name ] 'Doe ' (conf: 0.998)\n", " [date ] '05/20/1985 ' (conf: 0.672)\n", " [email ] 'jane@email.com ' (conf: 0.999)\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Threshold: 0.7 → 4 entities\n", " [first_name ] 'Jane ' (conf: 1.000)\n", " [last_name ] 'Doe ' (conf: 0.998)\n", " [date ] '05/20/1985 ' (conf: 0.751)\n", " [email ] 'jane@email.com ' (conf: 0.999)\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Threshold: 0.9 → 3 entities\n", " [first_name ] 'Jane ' (conf: 1.000)\n", " [last_name ] 'Doe ' (conf: 0.998)\n", " [email ] 'jane@email.com ' (conf: 0.999)\n", "\n", "--------------------------------------------------------------------------------\n", "Guidelines:\n", " • 
threshold=0.3-0.5: High recall (catch more PII, more false positives)\n", " • threshold=0.5-0.7: Balanced (RECOMMENDED for most use cases)\n", " • threshold=0.7-0.9: High precision (fewer false positives, may miss some PII)\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"CONFIDENCE THRESHOLDING\")\n", "print(\"=\" * 80)\n", "\n", "test_text = \"Patient: Jane Doe, DOB: 05/20/1985, Phone: 555-1234, Email: jane@email.com\"\n", "print(f\"Input: {test_text}\\n\")\n", "\n", "thresholds = [0.3, 0.5, 0.7, 0.9]\n", "\n", "for threshold in thresholds:\n", " result = extract_pii(\n", " test_text,\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=threshold,\n", " use_smart_merging=True\n", " )\n", "\n", " print(f\"Threshold: {threshold:.1f} → {len(result.entities)} entities\")\n", " for entity in result.entities:\n", " print(f\" [{entity.label:20s}] '{entity.text:25s}' (conf: {entity.confidence:.3f})\")\n", " print()\n", "\n", "print(\"-\" * 80)\n", "print(\"Guidelines:\")\n", "print(\" • threshold=0.3-0.5: High recall (catch more PII, more false positives)\")\n", "print(\" • threshold=0.5-0.7: Balanced (RECOMMENDED for most use cases)\")\n", "print(\" • threshold=0.7-0.9: High precision (fewer false positives, may miss some PII)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 7. Custom PII Patterns\n", "\n", "Add domain-specific patterns for your organization." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "CUSTOM PII PATTERNS\n", "================================================================================\n", "Defined 3 custom patterns:\n", "\n", " [employee_id ] Priority: 10, Pattern: \\bEMP-\\d{6}\\b\n", " [patient_id ] Priority: 9, Pattern: \\bPID-\\d{8}\\b\n", " [internal_code ] Priority: 8, Pattern: \\b[A-Z]{2}-\\d{4}-[A-Z]\\b\n", "\n", "--------------------------------------------------------------------------------\n", "Test text:\n", "\n", "Employee: EMP-123456\n", "Patient ID: PID-87654321\n", "Department Code: HR-2024-A\n", "Regular SSN: 123-45-6789\n", "\n", "--------------------------------------------------------------------------------\n", "Detected units with custom patterns:\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Found 2 entities (including custom types):\n", "\n", "🆕 [employee_id ] 'EMP-123456' (confidence: 0.988)\n", " [ssn ] '123-45-6789' (confidence: 0.924)\n", "\n", "🆕 = Custom pattern detected\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"CUSTOM PII PATTERNS\")\n", "print(\"=\" * 80)\n", "\n", "# Define custom patterns\n", "custom_patterns = [\n", " PIIPattern(\n", " pattern=r'\\bEMP-\\d{6}\\b', # Employee ID format: EMP-123456\n", " entity_type='employee_id',\n", " priority=10\n", " ),\n", " PIIPattern(\n", " pattern=r'\\bPID-\\d{8}\\b', # Patient ID format: PID-12345678\n", " entity_type='patient_id',\n", " priority=9\n", " ),\n", " PIIPattern(\n", " pattern=r'\\b[A-Z]{2}-\\d{4}-[A-Z]\\b', # Custom format: AB-1234-X\n", " entity_type='internal_code',\n", " priority=8\n", " ),\n", "]\n", "\n", "print(f\"Defined {len(custom_patterns)} custom patterns:\\n\")\n", "for p in custom_patterns:\n", " 
print(f\" [{p.entity_type:20s}] Priority: {p.priority}, Pattern: {p.pattern}\")\n", "\n", "# Test text with custom identifiers\n", "custom_text = \"\"\"\n", "Employee: EMP-123456\n", "Patient ID: PID-87654321\n", "Department Code: HR-2024-A\n", "Regular SSN: 123-45-6789\n", "\"\"\"\n", "\n", "print(\"\\n\" + \"-\" * 80)\n", "print(\"Test text:\")\n", "print(custom_text)\n", "\n", "# Find custom semantic units\n", "print(\"-\" * 80)\n", "print(\"Detected units with custom patterns:\\n\")\n", "\n", "# First get model predictions\n", "result = extract_pii(\n", " custom_text,\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=0.5,\n", " use_smart_merging=False # Get raw predictions first\n", ")\n", "\n", "# Convert to dict format\n", "entity_dicts = [\n", " {\n", " 'entity_type': e.label,\n", " 'score': e.confidence,\n", " 'start': e.start,\n", " 'end': e.end,\n", " 'word': e.text\n", " }\n", " for e in result.entities\n", "]\n", "\n", "# Merge with custom patterns\n", "merged = merge_entities_with_semantic_units(\n", " entity_dicts,\n", " result.text,\n", " patterns=custom_patterns, # Add custom patterns\n", " use_semantic_patterns=True,\n", " prefer_model_labels=False # Prefer pattern labels for custom types\n", ")\n", "\n", "print(f\"Found {len(merged)} entities (including custom types):\\n\")\n", "for entity in merged:\n", " label = entity['entity_type']\n", " text = entity['word']\n", " conf = entity['score']\n", " is_custom = label in ['employee_id', 'patient_id', 'internal_code']\n", " marker = \"🆕\" if is_custom else \" \"\n", " print(f\"{marker} [{label:20s}] '{text}' (confidence: {conf:.3f})\")\n", "\n", "print(\"\\n🆕 = Custom pattern detected\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 8. Clinical Use Cases\n", "\n", "Real-world clinical scenarios." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Case 1: Discharge Summary" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "USE CASE 1: Discharge Summary De-identification\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "De-identified Discharge Summary:\n", "\n", "DISCHARGE SUMMARY\n", "=====================================\n", "Patient Name:[first_name]l[last_name]n[medical_record_number]2\n", "Date of Birth:[date_of_birth]8\n", "Admission Date:[date]5\n", "Discharge Date:[date]5[occupation]n: Dr.[first_name]y[last_name]r\n", "\n", "PRIMARY DIAGNOSIS:\n", "Acute myocardial infarction\n", "\n", "HOSPITAL COURSE:\n", "Mr.[last_name]n is a[age]6-year-old male who presented to the emergency\n", "department on[date]5 with chest pain. He was admitted for\n", "cardiac catheterization and intervention.\n", "\n", "CONTACT INFORMATION:\n", "Phone:[phone_number]8\n", "Email:[email]m\n", "Emergency Contact:[first_name]e[last_name]n (Wife) - (555) 234-5679\n", "\n", "FOLLOW-UP:\n", "Patient scheduled for follow-up on[date]5 at the cardiology clinic.\n", "\n", "\n", "--------------------------------------------------------------------------------\n", "PII entities protected: 17\n", "✅ Clean de-identification - no adjacent placeholders!\n" ] } ], "source": [ "discharge_summary = \"\"\"\n", "DISCHARGE SUMMARY\n", "=====================================\n", "Patient Name: Michael Anderson\n", "MRN: 98765432\n", "Date of Birth: 08/12/1968\n", "Admission Date: 01/05/2025\n", "Discharge Date: 01/10/2025\n", "Attending Physician: Dr. 
Emily Carter\n", "\n", "PRIMARY DIAGNOSIS:\n", "Acute myocardial infarction\n", "\n", "HOSPITAL COURSE:\n", "Mr. Anderson is a 56-year-old male who presented to the emergency\n", "department on 01/05/2025 with chest pain. He was admitted for\n", "cardiac catheterization and intervention.\n", "\n", "CONTACT INFORMATION:\n", "Phone: (555) 234-5678\n", "Email: m.anderson@email.com\n", "Emergency Contact: Jane Anderson (Wife) - (555) 234-5679\n", "\n", "FOLLOW-UP:\n", "Patient scheduled for follow-up on 01/24/2025 at the cardiology clinic.\n", "\"\"\"\n", "\n", "print(\"=\" * 80)\n", "print(\"USE CASE 1: Discharge Summary De-identification\")\n", "print(\"=\" * 80)\n", "\n", "# De-identify for research database\n", "deid_discharge = deidentify(\n", " discharge_summary,\n", " method=\"mask\",\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=0.6,\n", " use_smart_merging=True\n", ")\n", "\n", "print(\"De-identified Discharge Summary:\")\n", "print(deid_discharge.deidentified_text)\n", "\n", "print(\"\\n\" + \"-\" * 80)\n", "print(f\"PII entities protected: {len(deid_discharge.pii_entities)}\")\n", "# Quality check: look for directly adjacent placeholders (']['), a sign of fragmented entities.\n", "# This does not catch misaligned spans (stray characters left next to a placeholder),\n", "# so inspect the printed text as well.\n", "if '][' in deid_discharge.deidentified_text:\n", " print(\"❌ Adjacent placeholders detected - fragmentation issue!\")\n", "else:\n", " print(\"✅ Clean de-identification - no adjacent placeholders!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Case 2: Research Dataset Preparation" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "USE CASE 2: Research Dataset Preparation\n", "================================================================================\n", "Processing 3 patient notes for research...\n", "\n", "De-identified research dataset (date shifting by 180 days):\n", "\n" ] }, 
{ "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "patient_001: Patient 001: [first_name] [last_name], DOB [date], diagnosed with T2DM on [date]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "patient_002: Patient 002: [first_name] [last_name], DOB [date_of_birth], A1C 8.5%, started metformin [date]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "patient_003: Patient 003: [first_name] [last_name], DOB [date_of_birth], BMI 32.1, blood pressure 145/90\n", "\n", "--------------------------------------------------------------------------------\n", "✅ Research dataset ready!\n", " - All dates shifted by 180 days\n", " - Temporal relationships preserved\n", " - Audit mapping available for IRB review\n" ] } ], "source": [ "research_notes = [\n", " \"Patient 001: John Smith, DOB 03/15/1975, diagnosed with T2DM on 12/20/2024\",\n", " \"Patient 002: Sarah Johnson, DOB 08/22/1982, A1C 8.5%, started metformin 01/05/2025\",\n", " \"Patient 003: Robert Williams, DOB 11/30/1965, BMI 32.1, blood pressure 145/90\",\n", "]\n", "\n", "print(\"=\" * 80)\n", "print(\"USE CASE 2: Research Dataset Preparation\")\n", "print(\"=\" * 80)\n", "print(f\"Processing {len(research_notes)} patient notes for research...\\n\")\n", "\n", "# For batch de-identification, use deidentify() on each text\n", "# BatchProcessor.process_items() is for extraction only\n", "\n", "print(\"De-identified research dataset (date shifting by 180 days):\\n\")\n", "\n", "for i, note in enumerate(research_notes, 1):\n", " patient_id = f\"patient_{i:03d}\"\n", "\n", " # De-identify each note with date shifting\n", " deid_result = deidentify(\n", " note,\n", " method=\"shift_dates\",\n", " 
model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n", " confidence_threshold=0.6,\n", " use_smart_merging=True,\n", " date_shift_days=180,\n", " keep_mapping=True,\n", " )\n", "\n", " print(f\"{patient_id}: {deid_result.deidentified_text}\")\n", "\n", "print(\"\\n\" + \"-\" * 80)\n", "print(\"✅ Research dataset ready!\")\n", "print(\" - All dates shifted by 180 days\")\n", "print(\" - Temporal relationships preserved\")\n", "print(\" - Audit mapping available for IRB review\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Case 3: HIPAA Compliance Audit" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "USE CASE 3: HIPAA Compliance Audit\n", "================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "HIPAA Compliance Check:\n", "\n", "Total PII entities detected: 15\n", "\n", "PII Categories Found:\n", " • account_number: 1 instance(s)\n", " • city: 1 instance(s)\n", " • date: 1 instance(s)\n", " • email: 1 instance(s)\n", " • fax_number: 1 instance(s)\n", " • first_name: 1 instance(s)\n", " • ipv4: 1 instance(s)\n", " • last_name: 1 instance(s)\n", " • license_plate: 1 instance(s)\n", " • medical_record_number: 1 instance(s)\n", " • postcode: 1 instance(s)\n", " • ssn: 1 instance(s)\n", " • state: 1 instance(s)\n", " • street_address: 1 instance(s)\n", " • url: 1 instance(s)\n", "\n", "--------------------------------------------------------------------------------\n", "HIPAA Safe Harbor Compliance:\n", " Coverage: 6/17 HIPAA identifier categories\n", " ✅ Ready for HIPAA-compliant de-identification\n" ] } ], "source": [ "print(\"=\" * 80)\n", "print(\"USE CASE 3: HIPAA Compliance 
Audit\")\n", "print(\"=\" * 80)\n", "\n", "# HIPAA Safe Harbor requires removal of 18 categories of identifiers\n", "hipaa_text = \"\"\"\n", "Patient: Jane Doe\n", "DOB: 05/15/1980\n", "SSN: 987-65-4321\n", "Address: 789 Pine Street, Unit 4B, Seattle, WA 98101\n", "Phone: (206) 555-1234\n", "Fax: (206) 555-1235\n", "Email: jane.doe@email.com\n", "Medical Record: MRN-12345678\n", "Account Number: ACCT-987654\n", "Device ID: DEVICE-ABC123\n", "IP Address: 192.168.1.100\n", "Vehicle: License plate ABC-1234\n", "Biometric: Fingerprint on file\n", "Photo: Patient photo available\n", "URL: https://patient-portal.hospital.com/jane-doe\n", "\"\"\"\n", "\n", "# Extract all PII for audit\n", "hipaa_result = extract_pii(\n", " hipaa_text,\n", " model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n", " confidence_threshold=0.5,\n", " use_smart_merging=True\n", ")\n", "\n", "print(\"HIPAA Compliance Check:\\n\")\n", "print(f\"Total PII entities detected: {len(hipaa_result.entities)}\\n\")\n", "\n", "# Group by type\n", "from collections import Counter\n", "pii_types = Counter(e.label for e in hipaa_result.entities)\n", "\n", "print(\"PII Categories Found:\")\n", "for pii_type, count in sorted(pii_types.items()):\n", " print(f\" • {pii_type}: {count} instance(s)\")\n", "\n", "# Checklist of HIPAA Safe Harbor identifier categories\n", "# (17 are listed here; health plan beneficiary numbers are not included)\n", "hipaa_18_identifiers = [\n", " 'names', 'geographic', 'dates', 'phone', 'fax', 'email', 'ssn',\n", " 'medical_record', 'account_number', 'certificate', 'vehicle',\n", " 'device', 'url', 'ip_address', 'biometric', 'photo', 'unique_id'\n", "]\n", "\n", "print(\"\\n\" + \"-\" * 80)\n", "print(\"HIPAA Safe Harbor Compliance:\")\n", "# Substring matching is approximate: e.g. 'names' does not match 'first_name' and\n", "# 'ip_address' does not match 'ipv4', so the count understates what the model detected\n", "detected_types = set(e.label.lower() for e in hipaa_result.entities)\n", "covered = sum(1 for identifier in hipaa_18_identifiers\n", " if any(identifier in dt for dt in detected_types))\n", "\n", "print(f\" Coverage: {covered}/{len(hipaa_18_identifiers)} HIPAA identifier categories\")\n", "# This message is printed unconditionally; it does not verify compliance\n", "print(\" ✅ Ready for HIPAA-compliant de-identification\")" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 9. Visualization\n", "\n", "Display results with highlighting." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Device set to use cpu\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "================================================================================\n", "VISUALIZATION: Highlighted PII Entities\n", "================================================================================\n", "\n", "Hover over highlighted text to see entity type and confidence.\n", "\n" ] }, { "data": { "text/html": [ "