{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# OpenMed PII Detection & De-identification - Complete Guide\n",
    "\n",
    "This notebook demonstrates **everything** about PII (Personally Identifiable Information) detection and de-identification in OpenMed, including:\n",
    "\n",
    "1. **Basic PII Extraction** - Detect PII entities in clinical text\n",
    "2. **Smart Entity Merging** - Fix fragmentation issues (NEW in v0.5.0)\n",
    "3. **De-identification Methods** - Mask, remove, replace, hash, shift dates\n",
    "4. **Re-identification** - Reverse de-identification with mappings\n",
    "5. **Batch Processing** - Process multiple texts efficiently\n",
    "6. **Confidence Thresholding** - Control precision vs recall\n",
    "7. **Custom Patterns** - Add domain-specific PII patterns\n",
    "8. **Clinical Use Cases** - Real-world examples\n",
    "9. **Visualization** - Display results with highlighting\n",
    "10. **CLI Usage** - Command-line interface examples\n",
    "\n",
    "---\n",
    "\n",
    "**Requirements:**\n",
    "```bash\n",
    "pip install openmed\n",
    "```\n",
    "\n",
    "**Model Used:**\n",
    "- `openmed/OpenMed-PII-SuperClinical-Large-434M-v1` (default)\n",
    "- Trained on clinical notes, EHR data, and HIPAA-relevant PII\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup and Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/maziyar/Desktop/Work/openmed/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "✅ All imports successful!\n"
     ]
    }
   ],
   "source": [
    "# Import required libraries\n",
    "import os\n",
    "from pprint import pprint\n",
    "import json\n",
    "\n",
    "# Set HuggingFace token (if needed)\n",
    "# os.environ['HF_TOKEN'] = 'your_token_here'\n",
    "\n",
    "# Import OpenMed PII functions\n",
    "from openmed import (\n",
    "    extract_pii,\n",
    "    deidentify,\n",
    "    reidentify,\n",
    "    PIIEntity,\n",
    "    DeidentificationResult,\n",
    ")\n",
    "\n",
    "# Import smart merging utilities\n",
    "from openmed import (\n",
    "    merge_entities_with_semantic_units,\n",
    "    find_semantic_units,\n",
    "    calculate_dominant_label,\n",
    "    PII_PATTERNS,\n",
    "    PIIPattern,\n",
    ")\n",
    "\n",
    "# Import batch processing\n",
    "from openmed import BatchProcessor, BatchItem, process_batch\n",
    "\n",
    "print(\"✅ All imports successful!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 1. Basic PII Extraction\n",
    "\n",
    "Extract PII entities from clinical text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "BASIC PII EXTRACTION\n",
      "================================================================================\n",
      "Input text:\n",
      "\n",
      "Patient Name: Dr. Sarah Johnson\n",
      "Date of Birth: 03/15/1975\n",
      "Social Security: 123-45-6789\n",
      "Phone: (555) 123-4567\n",
      "Email: sarah.johnson@email.com\n",
      "Address: 456 Oak Avenue, Boston, MA 02115\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 11 PII entities:\n",
      "\n",
      " 1. [occupation               ] 'Dr.                           ' (confidence: 0.597)\n",
      " 2. [first_name               ] 'Sarah                         ' (confidence: 1.000)\n",
      " 3. [last_name                ] 'Johnson                       ' (confidence: 0.998)\n",
      " 4. [date_of_birth            ] '03/15/1975                    ' (confidence: 0.693)\n",
      " 5. [ssn                      ] '123-45-6789                   ' (confidence: 0.981)\n",
      " 6. [phone_number             ] '555) 123-4567                 ' (confidence: 0.868)\n",
      " 7. [email                    ] 'sarah.johnson@email.com       ' (confidence: 1.000)\n",
      " 8. [street_address           ] '456 Oak Avenue                ' (confidence: 1.000)\n",
      " 9. [city                     ] 'Boston                        ' (confidence: 0.900)\n",
      "10. [state                    ] 'MA                            ' (confidence: 0.927)\n",
      "11. [postcode                 ] '02115                         ' (confidence: 0.967)\n",
      "\n",
      "================================================================================\n"
     ]
    }
   ],
   "source": [
    "# Simple clinical text with various PII types\n",
    "clinical_text = \"\"\"\n",
    "Patient Name: Dr. Sarah Johnson\n",
    "Date of Birth: 03/15/1975\n",
    "Social Security: 123-45-6789\n",
    "Phone: (555) 123-4567\n",
    "Email: sarah.johnson@email.com\n",
    "Address: 456 Oak Avenue, Boston, MA 02115\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"BASIC PII EXTRACTION\")\n",
    "print(\"=\" * 80)\n",
    "print(f\"Input text:\\n{clinical_text}\\n\")\n",
    "\n",
    "# Extract PII with default settings\n",
    "result = extract_pii(\n",
    "    clinical_text,\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.5,\n",
    "    use_smart_merging=True  # DEFAULT in v0.5.0\n",
    ")\n",
    "\n",
    "print(f\"Found {len(result.entities)} PII entities:\\n\")\n",
    "for i, entity in enumerate(result.entities, 1):\n",
    "    print(f\"{i:2d}. [{entity.label:25s}] '{entity.text:30s}' (confidence: {entity.confidence:.3f})\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspecting Entity Details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First Entity Details:\n",
      "  Text: Dr.\n",
      "  Label: occupation\n",
      "  Confidence: 0.5971\n",
      "  Start position: 14\n",
      "  End position: 17\n",
      "  Extracted from result.text: 'Dr.'\n"
     ]
    }
   ],
   "source": [
    "# Access individual entity properties\n",
    "if result.entities:\n",
    "    entity = result.entities[0]\n",
    "    print(\"First Entity Details:\")\n",
    "    print(f\"  Text: {entity.text}\")\n",
    "    print(f\"  Label: {entity.label}\")\n",
    "    print(f\"  Confidence: {entity.confidence:.4f}\")\n",
    "    print(f\"  Start position: {entity.start}\")\n",
    "    print(f\"  End position: {entity.end}\")\n",
    "    print(f\"  Extracted from result.text: '{result.text[entity.start:entity.end]}'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 2. Smart Entity Merging (NEW in v0.5.0)\n",
    "\n",
    "Smart merging fixes the fragmentation problem where dates, SSN, phone numbers, and other PII entities are split into unusable fragments by the tokenizer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Problem: Fragmentation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "COMPARING: WITHOUT vs WITH Smart Merging\n",
      "================================================================================\n",
      "Input: Patient DOB: 01/15/1970, Admission: 2024-03-20, SSN: 987-65-4321\n",
      "\n",
      "❌ WITHOUT Smart Merging (use_smart_merging=False)\n",
      "--------------------------------------------------------------------------------\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 5 entities:\n",
      "  [date                ] '01' (confidence: 0.886)\n",
      "  [date_of_birth       ] '/15' (confidence: 0.704)\n",
      "  [date                ] '/1970' (confidence: 0.565)\n",
      "  [date                ] '2024-03-20' (confidence: 0.999)\n",
      "  [ssn                 ] '987-65-4321' (confidence: 0.997)\n",
      "\n",
      "⚠️  PROBLEM: 3 date fragments detected!\n",
      "   These fragments are unusable for production de-identification.\n",
      "\n",
      "================================================================================\n",
      "✅ WITH Smart Merging (use_smart_merging=True) - DEFAULT\n",
      "--------------------------------------------------------------------------------\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 3 entities:\n",
      "  [date                ] '01/15/1970' (confidence: 0.718)\n",
      "  [date                ] '2024-03-20' (confidence: 0.999)\n",
      "  [ssn                 ] '987-65-4321' (confidence: 0.997)\n",
      "\n",
      "✅ SUCCESS: 2 complete date entities!\n",
      "   Production-ready for de-identification.\n",
      "\n",
      "================================================================================\n"
     ]
    }
   ],
   "source": [
    "test_text = \"Patient DOB: 01/15/1970, Admission: 2024-03-20, SSN: 987-65-4321\"\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"COMPARING: WITHOUT vs WITH Smart Merging\")\n",
    "print(\"=\" * 80)\n",
    "print(f\"Input: {test_text}\\n\")\n",
    "\n",
    "# WITHOUT smart merging (raw model output)\n",
    "print(\"❌ WITHOUT Smart Merging (use_smart_merging=False)\")\n",
    "print(\"-\" * 80)\n",
    "result_raw = extract_pii(\n",
    "    test_text,\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.5,\n",
    "    use_smart_merging=False  # Disable smart merging\n",
    ")\n",
    "\n",
    "print(f\"Found {len(result_raw.entities)} entities:\")\n",
    "for entity in result_raw.entities:\n",
    "    print(f\"  [{entity.label:20s}] '{entity.text}' (confidence: {entity.confidence:.3f})\")\n",
    "\n",
    "# Check for fragmentation\n",
    "date_fragments = [e for e in result_raw.entities if 'date' in e.label.lower() and len(e.text) < 8]\n",
    "if date_fragments:\n",
    "    print(f\"\\n⚠️  PROBLEM: {len(date_fragments)} date fragments detected!\")\n",
    "    print(\"   These fragments are unusable for production de-identification.\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 80)\n",
    "\n",
    "# WITH smart merging (default)\n",
    "print(\"✅ WITH Smart Merging (use_smart_merging=True) - DEFAULT\")\n",
    "print(\"-\" * 80)\n",
    "result_merged = extract_pii(\n",
    "    test_text,\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.5,\n",
    "    use_smart_merging=True  # Enable smart merging (DEFAULT)\n",
    ")\n",
    "\n",
    "print(f\"Found {len(result_merged.entities)} entities:\")\n",
    "for entity in result_merged.entities:\n",
    "    print(f\"  [{entity.label:20s}] '{entity.text}' (confidence: {entity.confidence:.3f})\")\n",
    "\n",
    "# Check for complete dates\n",
    "complete_dates = [e for e in result_merged.entities if 'date' in e.label.lower() and len(e.text) >= 8]\n",
    "if complete_dates:\n",
    "    print(f\"\\n✅ SUCCESS: {len(complete_dates)} complete date entities!\")\n",
    "    print(\"   Production-ready for de-identification.\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How Smart Merging Works"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "SMART MERGING: Semantic Unit Detection\n",
      "================================================================================\n",
      "Input: Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789, Phone: (555) 123-4567\n",
      "\n",
      "Detected 2 semantic units using regex patterns:\n",
      "\n",
      "  [date                ] '01/15/1970' at position 24-34\n",
      "  [ssn                 ] '123-45-6789' at position 41-52\n",
      "\n",
      "================================================================================\n",
      "Total PII patterns defined: 20\n",
      "\n",
      "Pattern categories:\n",
      "  - credit_debit_card\n",
      "  - date\n",
      "  - email\n",
      "  - ipv4\n",
      "  - ipv6\n",
      "  - mac_address\n",
      "  - medical_record_number\n",
      "  - phone_number\n",
      "  - postcode\n",
      "  - ssn\n",
      "  - street_address\n",
      "  - url\n"
     ]
    }
   ],
   "source": [
    "# Demonstrate semantic unit detection\n",
    "demo_text = \"Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789, Phone: (555) 123-4567\"\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"SMART MERGING: Semantic Unit Detection\")\n",
    "print(\"=\" * 80)\n",
    "print(f\"Input: {demo_text}\\n\")\n",
    "\n",
    "# Find semantic units using regex patterns\n",
    "semantic_units = find_semantic_units(demo_text)\n",
    "\n",
    "print(f\"Detected {len(semantic_units)} semantic units using regex patterns:\\n\")\n",
    "for start, end, entity_type in semantic_units:\n",
    "    text_span = demo_text[start:end]\n",
    "    print(f\"  [{entity_type:20s}] '{text_span}' at position {start}-{end}\")\n",
    "\n",
    "print(\"\\n\" + \"=\" * 80)\n",
    "print(f\"Total PII patterns defined: {len(PII_PATTERNS)}\")\n",
    "print(\"\\nPattern categories:\")\n",
    "categories = set(p.entity_type for p in PII_PATTERNS)\n",
    "for cat in sorted(categories):\n",
    "    print(f\"  - {cat}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Supported PII Patterns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "SUPPORTED PII PATTERNS\n",
      "================================================================================\n",
      "\n",
      "CREDIT_DEBIT_CARD:\n",
      "  Priority 8: \\b\\d{4}[-\\s]?\\d{4}[-\\s]?\\d{4}[-\\s]?\\d{4}\\b\n",
      "\n",
      "DATE:\n",
      "  Priority 10: \\b\\d{4}-\\d{2}-\\d{2}\\b\n",
      "  Priority 9: \\b\\d{1,2}/\\d{1,2}/\\d{2,4}\\b\n",
      "  Priority 9: \\b\\d{1,2}-\\d{1,2}-\\d{2,4}\\b\n",
      "  Priority 8: \\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{1,2},? \\d{4}\\b\n",
      "  Priority 8: \\b\\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \\d{4}\\b\n",
      "\n",
      "EMAIL:\n",
      "  Priority 10: \\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b\n",
      "\n",
      "IPV4:\n",
      "  Priority 7: \\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b\n",
      "\n",
      "IPV6:\n",
      "  Priority 8: \\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\\b\n",
      "\n",
      "MAC_ADDRESS:\n",
      "  Priority 8: \\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\\b\n",
      "\n",
      "MEDICAL_RECORD_NUMBER:\n",
      "  Priority 9: \\b(?:MRN|mrn)[:\\s#]*\\d{6,10}\\b\n",
      "  Priority 5: \\b[A-Z]{2,3}\\d{6,9}\\b\n",
      "\n",
      "PHONE_NUMBER:\n",
      "  Priority 9: \\b\\(\\d{3}\\)\\s*\\d{3}[-.\\s]?\\d{4}\\b\n",
      "  Priority 8: \\b\\d{3}[-.\\s]\\d{3}[-.\\s]\\d{4}\\b\n",
      "  Priority 5: \\b\\d{10}\\b\n",
      "\n",
      "POSTCODE:\n",
      "  Priority 7: \\b\\d{5}(?:-\\d{4})?\\b\n",
      "\n",
      "SSN:\n",
      "  Priority 10: \\b\\d{3}-\\d{2}-\\d{4}\\b\n",
      "  Priority 9: \\b\\d{3}\\s\\d{2}\\s\\d{4}\\b\n",
      "\n",
      "STREET_ADDRESS:\n",
      "  Priority 7: \\b\\d{1,5}\\s+[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)*\\s+(?:Street|St|Avenue|Ave|Road|Rd|Bou...\n",
      "\n",
      "URL:\n",
      "  Priority 8: \\b(?:https?://)?(?:www\\.)?[a-zA-Z0-9-]+\\.[a-zA-Z]{2,}(?:/[^\\s]*)?\\b\n"
     ]
    }
   ],
   "source": [
    "# Display all supported patterns\n",
    "print(\"=\" * 80)\n",
    "print(\"SUPPORTED PII PATTERNS\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "# Group patterns by type\n",
    "from collections import defaultdict\n",
    "patterns_by_type = defaultdict(list)\n",
    "for pattern in PII_PATTERNS:\n",
    "    patterns_by_type[pattern.entity_type].append(pattern)\n",
    "\n",
    "for entity_type in sorted(patterns_by_type.keys()):\n",
    "    patterns = patterns_by_type[entity_type]\n",
    "    print(f\"\\n{entity_type.upper()}:\")\n",
    "    for p in patterns:\n",
    "        print(f\"  Priority {p.priority}: {p.pattern[:80]}{'...' if len(p.pattern) > 80 else ''}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 3. De-identification Methods\n",
    "\n",
    "OpenMed supports multiple de-identification methods to protect patient privacy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original Clinical Note:\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr. Sarah Johnson\n",
      "Date of Birth: 03/15/1975\n",
      "MRN: 87654321\n",
      "Social Security: 123-45-6789\n",
      "Contact: (555) 987-6543\n",
      "Email: sarah.j@hospital.org\n",
      "Address: 456 Oak Avenue, Boston, MA 02115\n",
      "\n",
      "Admission Date: 12/20/2024\n",
      "Discharge Date: 12/25/2024\n",
      "\n",
      "DIAGNOSIS: Type 2 Diabetes Mellitus\n",
      "\n",
      "\n",
      "================================================================================\n"
     ]
    }
   ],
   "source": [
    "# Clinical note for de-identification\n",
    "patient_note = \"\"\"\n",
    "CLINICAL NOTE\n",
    "=============\n",
    "Patient Name: Dr. Sarah Johnson\n",
    "Date of Birth: 03/15/1975\n",
    "MRN: 87654321\n",
    "Social Security: 123-45-6789\n",
    "Contact: (555) 987-6543\n",
    "Email: sarah.j@hospital.org\n",
    "Address: 456 Oak Avenue, Boston, MA 02115\n",
    "\n",
    "Admission Date: 12/20/2024\n",
    "Discharge Date: 12/25/2024\n",
    "\n",
    "DIAGNOSIS: Type 2 Diabetes Mellitus\n",
    "\"\"\"\n",
    "\n",
    "print(\"Original Clinical Note:\")\n",
    "print(patient_note)\n",
    "print(\"\\n\" + \"=\" * 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Method 1: Mask (Placeholder replacement)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "METHOD 1: MASK (Placeholder replacement)\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "De-identified text:\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr.[first_name]h[last_name]n\n",
      "Date of Birth:[date_of_birth]5\n",
      "MRN: 87654321\n",
      "Social Security:[ssn]9\n",
      "Contact: [phone_number]3\n",
      "Email:[email]g\n",
      "Address:[street_address]e,[city]n,[state]A[postcode]5\n",
      "\n",
      "Admission Date:[date]4\n",
      "Discharge Date:[date]4\n",
      "\n",
      "DIAGNOSIS: Type 2 Diabetes Mellitus\n",
      "\n",
      "\n",
      "Entities masked: 12\n",
      "  [first_name] 'Sarah' -> '[first_name]' conf=1.000 span=(46, 51)\n",
      "  [last_name] 'Johnson' -> '[last_name]' conf=0.999 span=(52, 59)\n",
      "  [date_of_birth] '03/15/1975' -> '[date_of_birth]' conf=0.815 span=(75, 85)\n",
      "  [ssn] '123-45-6789' -> '[ssn]' conf=0.977 span=(117, 128)\n",
      "  [phone_number] '555) 987-6543' -> '[phone_number]' conf=0.659 span=(139, 152)\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"METHOD 1: MASK (Placeholder replacement)\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "result_mask = deidentify(\n",
    "    patient_note,\n",
    "    method=\"mask\",\n",
    "    model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "    confidence_threshold=0.6,\n",
    "    use_smart_merging=True,\n",
    ")\n",
    "\n",
    "print(\"De-identified text:\")\n",
    "print(result_mask.deidentified_text)\n",
    "\n",
    "# ✅ The library returns `pii_entities` (not `entities`)\n",
    "entities = getattr(result_mask, \"pii_entities\", None) or getattr(result_mask, \"entities\", [])\n",
    "\n",
    "print(f\"\\nEntities masked: {len(entities)}\")\n",
    "\n",
    "# Optional: sort by position for nicer display\n",
    "entities = sorted(entities, key=lambda e: getattr(e, \"start\", 0))\n",
    "\n",
    "for entity in entities[:5]:  # Show first 5\n",
    "    label = getattr(entity, \"label\", getattr(entity, \"entity_type\", \"UNKNOWN\"))\n",
    "    text = getattr(entity, \"text\", \"\")\n",
    "    redacted = getattr(entity, \"redacted_text\", \"\")\n",
    "    conf = getattr(entity, \"confidence\", None)\n",
    "    span = (getattr(entity, \"start\", None), getattr(entity, \"end\", None))\n",
    "\n",
    "    conf_str = f\"{conf:.3f}\" if isinstance(conf, (int, float)) else \"n/a\"\n",
    "    print(f\"  [{label}] '{text}' -> '{redacted}' conf={conf_str} span={span}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Method 2: Remove (Complete removal)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "METHOD 2: REMOVE (Complete removal)\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "De-identified text:\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr.hn\n",
      "Date of Birth:5\n",
      "MRN: 87654321\n",
      "Social Security:9\n",
      "Contact: 3\n",
      "Email:g\n",
      "Address:e,n,A5\n",
      "\n",
      "Admission Date:4\n",
      "Discharge Date:4\n",
      "\n",
      "DIAGNOSIS: Type 2 Diabetes Mellitus\n",
      "\n",
      "\n",
      "Entities removed: 12\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"METHOD 2: REMOVE (Complete removal)\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "result_remove = deidentify(\n",
    "    patient_note,\n",
    "    method=\"remove\",\n",
    "    model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "    confidence_threshold=0.6,\n",
    "    use_smart_merging=True,\n",
    ")\n",
    "\n",
    "print(\"De-identified text:\")\n",
    "print(result_remove.deidentified_text)\n",
    "\n",
    "# ✅ Use `pii_entities` (fallback included for robustness)\n",
    "entities = getattr(result_remove, \"pii_entities\", None) or getattr(result_remove, \"entities\", [])\n",
    "\n",
    "print(f\"\\nEntities removed: {len(entities)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Method 3: Replace (Synthetic data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "METHOD 3: REPLACE (Synthetic data)\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "De-identified text:\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr.[first_name]h[last_name]n\n",
      "Date of Birth:[date_of_birth]5\n",
      "MRN: 87654321\n",
      "Social Security:[ssn]9\n",
      "Contact: [phone_number]3\n",
      "Email:[email]g\n",
      "Address:[street_address]e,[city]n,[state]A[postcode]5\n",
      "\n",
      "Admission Date:[date]4\n",
      "Discharge Date:[date]4\n",
      "\n",
      "DIAGNOSIS: Type 2 Diabetes Mellitus\n",
      "\n",
      "\n",
      "Entities replaced: 12\n",
      "  [first_name] 'Sarah' -> '[first_name]'\n",
      "  [last_name] 'Johnson' -> '[last_name]'\n",
      "  [date_of_birth] '03/15/1975' -> '[date_of_birth]'\n",
      "  [ssn] '123-45-6789' -> '[ssn]'\n",
      "  [phone_number] '555) 987-6543' -> '[phone_number]'\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"METHOD 3: REPLACE (Synthetic data)\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "result_replace = deidentify(\n",
    "    patient_note,\n",
    "    method=\"replace\",\n",
    "    model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "    confidence_threshold=0.6,\n",
    "    use_smart_merging=True,\n",
    ")\n",
    "\n",
    "print(\"De-identified text:\")\n",
    "print(result_replace.deidentified_text)\n",
    "\n",
    "# ✅ `DeidentificationResult` uses `pii_entities`\n",
    "entities = getattr(result_replace, \"pii_entities\", None) or getattr(result_replace, \"entities\", [])\n",
    "\n",
    "print(f\"\\nEntities replaced: {len(entities)}\")\n",
    "\n",
    "for entity in sorted(entities, key=lambda e: getattr(e, \"start\", 0))[:5]:\n",
    "    label = getattr(entity, \"label\", getattr(entity, \"entity_type\", \"UNKNOWN\"))\n",
    "    text = getattr(entity, \"text\", \"\")\n",
    "    # for replace/mask, this often holds the replacement value or placeholder\n",
    "    repl = getattr(entity, \"redacted_text\", \"\")\n",
    "    print(f\"  [{label}] '{text}' -> '{repl}'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Method 4: Hash (Cryptographic hashing)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "METHOD 4: HASH (Cryptographic hashing)\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "De-identified text (first 500 chars):\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr.first_name_7e8c729ehlast_name_3013b18fn\n",
      "Date of Birth:date_of_birth_ad87a4065\n",
      "MRN: 87654321\n",
      "Social Security:ssn_01a546299\n",
      "Contact: phone_number_d8f6c45f3\n",
      "Email:email_c67e1ae7g\n",
      "Address:street_address_c25c1d69e,city_a06522bcn,state_f0055891Apostcode_20ec61f35\n",
      "\n",
      "Admission Date:date_9b3129044\n",
      "Discharge Date:date_a98356c94\n",
      "\n",
      "DIAGNOSIS: Type 2 Diabetes Mellitus\n",
      "\n",
      "\n",
      "Entities hashed: 12\n",
      "\n",
      "Example hashed values:\n",
      "  [first_name] Original: 'Sarah'  Hashed: '7e8c729e'\n",
      "  [last_name] Original: 'Johnson'  Hashed: '3013b18f'\n",
      "  [date_of_birth] Original: '03/15/1975'  Hashed: 'ad87a406'\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"METHOD 4: HASH (Cryptographic hashing)\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "result_hash = deidentify(\n",
    "    patient_note,\n",
    "    method=\"hash\",\n",
    "    model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "    confidence_threshold=0.6,\n",
    "    use_smart_merging=True,\n",
    ")\n",
    "\n",
    "print(\"De-identified text (first 500 chars):\")\n",
    "text = result_hash.deidentified_text or \"\"\n",
    "print((text[:500] + \"...\") if len(text) > 500 else text)\n",
    "\n",
    "# ✅ Use `pii_entities` (fallback included)\n",
    "entities = getattr(result_hash, \"pii_entities\", None) or getattr(result_hash, \"entities\", [])\n",
    "\n",
    "print(f\"\\nEntities hashed: {len(entities)}\")\n",
    "print(\"\\nExample hashed values:\")\n",
    "\n",
    "for entity in sorted(entities, key=lambda e: getattr(e, \"start\", 0))[:3]:\n",
    "    label = getattr(entity, \"label\", getattr(entity, \"entity_type\", \"UNKNOWN\"))\n",
    "    original = getattr(entity, \"text\", \"\")\n",
    "    hashed = getattr(entity, \"hash_value\", None) or getattr(entity, \"redacted_text\", \"\")\n",
    "    print(f\"  [{label}] Original: '{original}'  Hashed: '{hashed}'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Method 5: Shift Dates (Date shifting)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "METHOD 5: SHIFT_DATES (Preserves temporal relationships)\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "De-identified text:\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr.[first_name]h[last_name]n\n",
      "Date of Birth:[date_of_birth]5\n",
      "MRN: 87654321\n",
      "Social Security:[ssn]9\n",
      "Contact: [phone_number]3\n",
      "Email:[email]g\n",
      "Address:[street_address]e,[city]n,[state]A[postcode]5\n",
      "\n",
      "Admission Date:[date]4\n",
      "Discharge Date:[date]4\n",
      "\n",
      "DIAGNOSIS: Type 2 Diabetes Mellitus\n",
      "\n",
      "\n",
      "Date entities shifted:\n",
      "  [date_of_birth] '03/15/1975' -> '[date_of_birth]'\n",
      "  [date] '12/20/2024' -> '[date]'\n",
      "  [date] '12/25/2024' -> '[date]'\n",
      "Note: Temporal relationships between dates are preserved!\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"METHOD 5: SHIFT_DATES (Preserves temporal relationships)\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "result_shift = deidentify(\n",
    "    patient_note,\n",
    "    method=\"shift_dates\",\n",
    "    model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "    confidence_threshold=0.6,\n",
    "    use_smart_merging=True,\n",
    "    date_shift_days=365,  # Shift by 1 year\n",
    ")\n",
    "\n",
    "print(\"De-identified text:\")\n",
    "print(result_shift.deidentified_text)\n",
    "\n",
    "# ✅ Use `pii_entities` (fallback included)\n",
    "entities = getattr(result_shift, \"pii_entities\", None) or getattr(result_shift, \"entities\", [])\n",
    "\n",
    "date_entities = [\n",
    "    e for e in entities\n",
    "    if \"date\" in getattr(e, \"label\", getattr(e, \"entity_type\", \"\")).lower()\n",
    "]\n",
    "\n",
    "print(\"\\nDate entities shifted:\")\n",
    "for e in sorted(date_entities, key=lambda x: getattr(x, \"start\", 0)):\n",
    "    label = getattr(e, \"label\", getattr(e, \"entity_type\", \"UNKNOWN\"))\n",
    "    original = getattr(e, \"text\", \"\")\n",
    "    shifted = getattr(e, \"redacted_text\", \"\")  # often holds the shifted date or replacement\n",
    "    print(f\"  [{label}] '{original}' -> '{shifted}'\")\n",
    "\n",
    "print(\"Note: Temporal relationships between dates are preserved!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 4. Re-identification\n",
    "\n",
    "Reverse de-identification using stored mappings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "RE-IDENTIFICATION\n",
      "================================================================================\n",
      "Step 1: De-identify with keep_mapping=True\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "De-identified text (first 200 chars):\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr.[first_name]h[last_name]n\n",
      "Date of Birth:[date_of_birth]5\n",
      "MRN: 87654321\n",
      "Social Security:[ssn]9\n",
      "Contact: [phone_number]3\n",
      "Email:[email]g\n",
      "Address:[street_addr...\n",
      "\n",
      "Mapping created: 11 entries\n",
      "\n",
      "First 5 mapping entries:\n",
      "  1. '[date]' → '12/20/2024'\n",
      "  2. '[postcode]' → '02115'\n",
      "  3. '[state]' → 'MA'\n",
      "  4. '[city]' → 'Boston'\n",
      "  5. '[street_address]' → '456 Oak Avenue'\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "Step 2: Re-identify using the mapping\n",
      "\n",
      "Re-identified text (first 300 chars):\n",
      "\n",
      "CLINICAL NOTE\n",
      "=============\n",
      "Patient Name: Dr.SarahhJohnsonn\n",
      "Date of Birth:03/15/19755\n",
      "MRN: 87654321\n",
      "Social Security:123-45-67899\n",
      "Contact: 555) 987-65433\n",
      "Email:sarah.j@hospital.orgg\n",
      "Address:456 Oak Avenuee,Bostonn,MAA021155\n",
      "\n",
      "Admission Date:12/20/20244\n",
      "Discharge Date:12/20/20244\n",
      "\n",
      "DIAGNOSIS: Type 2 Di...\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "Verification:\n",
      "⚠️  Difference detected (usually whitespace/formatting)\n",
      "   Original length: 314\n",
      "   Re-identified length: 314\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"RE-IDENTIFICATION\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "# De-identify with mapping\n",
    "print(\"Step 1: De-identify with keep_mapping=True\")\n",
    "result_with_mapping = deidentify(\n",
    "    patient_note,\n",
    "    method=\"mask\",\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.6,\n",
    "    keep_mapping=True,  # Keep mapping for re-identification\n",
    "    use_smart_merging=True\n",
    ")\n",
    "\n",
    "print(f\"\\nDe-identified text (first 200 chars):\")\n",
    "print(result_with_mapping.deidentified_text[:200] + \"...\")\n",
    "\n",
    "print(f\"\\nMapping created: {len(result_with_mapping.mapping)} entries\")\n",
    "print(\"\\nFirst 5 mapping entries:\")\n",
    "for i, (redacted, original) in enumerate(list(result_with_mapping.mapping.items())[:5], 1):\n",
    "    print(f\"  {i}. '{redacted}' → '{original}'\")\n",
    "\n",
    "# Re-identify\n",
    "print(\"\\n\" + \"-\" * 80)\n",
    "print(\"Step 2: Re-identify using the mapping\")\n",
    "original_text = reidentify(\n",
    "    result_with_mapping.deidentified_text,\n",
    "    result_with_mapping.mapping\n",
    ")\n",
    "\n",
    "print(f\"\\nRe-identified text (first 300 chars):\")\n",
    "print(original_text[:300] + \"...\")\n",
    "\n",
    "# Verify\n",
    "print(\"\\n\" + \"-\" * 80)\n",
    "print(\"Verification:\")\n",
    "original_clean = patient_note.strip()\n",
    "reidentified_clean = original_text.strip()\n",
    "if original_clean == reidentified_clean:\n",
    "    print(\"✅ SUCCESS: Re-identification is perfect!\")\n",
    "else:\n",
    "    print(f\"⚠️  Difference detected (usually whitespace/formatting)\")\n",
    "    print(f\"   Original length: {len(original_clean)}\")\n",
    "    print(f\"   Re-identified length: {len(reidentified_clean)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 5. Batch Processing\n",
    "\n",
    "Efficiently process multiple clinical notes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "BATCH PROCESSING\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n",
      "Device set to use cpu\n",
      "Device set to use cpu\n",
      "Device set to use cpu\n",
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch processing completed!\n",
      "  Total items: 5\n",
      "  Successful: 5\n",
      "  Failed: 0\n",
      "  Total processing time: 8.92s\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "Results per note:\n",
      "\n",
      "📄 item_0:\n",
      "   Entities found: 5\n",
      "     - [first_name] 'John'\n",
      "     - [last_name] 'Doe'\n",
      "     - [date] '01'\n",
      "     ... and 2 more\n",
      "\n",
      "📄 item_1:\n",
      "   Entities found: 4\n",
      "     - [occupation] 'Dr.'\n",
      "     - [first_name] 'Sarah'\n",
      "     - [last_name] 'Johnson'\n",
      "     ... and 1 more\n",
      "\n",
      "📄 item_2:\n",
      "   Entities found: 2\n",
      "     - [date] '2024-03-20'\n",
      "     - [date] '2024-03-25'\n",
      "\n",
      "📄 item_3:\n",
      "   Entities found: 5\n",
      "     - [street_address] '123 Main Street'\n",
      "     - [city] 'Boston'\n",
      "     - [state] 'MA'\n",
      "     ... and 2 more\n",
      "\n",
      "📄 item_4:\n",
      "   Entities found: 1\n",
      "     - [email] 'patient.name@hospital.org'\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from openmed import BatchProcessor\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"BATCH PROCESSING\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "batch_texts = [\n",
    "    \"Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789\",\n",
    "    \"Dr. Sarah Johnson, Phone: (555) 123-4567, Email: sarah@email.com\",\n",
    "    \"MRN: 87654321, Admission: 2024-03-20, Discharge: 2024-03-25\",\n",
    "    \"Address: 123 Main Street, Boston, MA 02101, ZIP: 02101\",\n",
    "    \"Contact: patient.name@hospital.org, Emergency: (555) 987-6543\",\n",
    "]\n",
    "\n",
    "processor = BatchProcessor(\n",
    "    model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "    confidence_threshold=0.5,\n",
    "    group_entities=True,\n",
    "    continue_on_error=True,\n",
    "    # IMPORTANT: do NOT pass use_smart_merging here if your installed version triggers the HF pipeline error\n",
    ")\n",
    "\n",
    "batch_result = processor.process_texts(batch_texts)\n",
    "\n",
    "print(\"Batch processing completed!\")\n",
    "print(f\"  Total items: {batch_result.total_items}\")\n",
    "print(f\"  Successful: {batch_result.successful_items}\")\n",
    "print(f\"  Failed: {batch_result.failed_items}\")\n",
    "print(f\"  Total processing time: {batch_result.total_processing_time:.2f}s\")\n",
    "\n",
    "print(\"\\n\" + \"-\" * 80)\n",
    "print(\"Results per note:\\n\")\n",
    "\n",
    "for item_result in batch_result.items:\n",
    "    if not item_result.success:\n",
    "        print(f\"❌ {item_result.id}: {item_result.error}\")\n",
    "        continue\n",
    "\n",
    "    # In BatchProcessor results, entities usually live under item_result.result.entities\n",
    "    ents = item_result.result.entities\n",
    "    print(f\"📄 {item_result.id}:\")\n",
    "    print(f\"   Entities found: {len(ents)}\")\n",
    "    for entity in ents[:3]:\n",
    "        print(f\"     - [{entity.label}] '{entity.text}'\")\n",
    "    if len(ents) > 3:\n",
    "        print(f\"     ... and {len(ents) - 3} more\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch De-identification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "BATCH DE-IDENTIFICATION (extract in batch, then deidentify)\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n",
      "Device set to use cpu\n",
      "Device set to use cpu\n",
      "Device set to use cpu\n",
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch extraction completed!\n",
      "  Successful: 5/5\n",
      "\n",
      "De-identified texts:\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📄 note_1:\n",
      "Patient: [first_name] [last_name], DOB: [date_of_birth], SSN: [ssn]\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📄 note_2:\n",
      "[occupation] [first_name] [last_name], Phone: (555) 123-4567, Email: [email]\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📄 note_3:\n",
      "MRN: 87654321, Admission: [date], Discharge: [date]\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📄 note_4:\n",
      "Address: [street_address], [city], [state] [postcode], ZIP: [postcode]\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📄 note_5:\n",
      "Contact: [email], Emergency: (555) 987-6543\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"BATCH DE-IDENTIFICATION (extract in batch, then deidentify)\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "ids = [f\"note_{i+1}\" for i in range(len(batch_texts))]\n",
    "\n",
    "# 1) Batch extraction (no use_smart_merging here in YOUR install; it breaks HF pipeline creation)\n",
    "batch_result = process_batch(\n",
    "    batch_texts,\n",
    "    model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "    ids=ids,\n",
    "    confidence_threshold=0.6,\n",
    "    batch_size=2,\n",
    ")\n",
    "\n",
    "print(f\"Batch extraction completed!\")\n",
    "print(f\"  Successful: {batch_result.successful_items}/{batch_result.total_items}\\n\")\n",
    "\n",
    "# 2) Deidentify each text (this API supports use_smart_merging in your environment)\n",
    "print(\"De-identified texts:\\n\")\n",
    "\n",
    "for item in batch_result.items:\n",
    "    item_id = getattr(item, \"id\", None) or getattr(item, \"item_id\", \"unknown\")\n",
    "\n",
    "    if not getattr(item, \"success\", False):\n",
    "        err = getattr(item, \"error\", None) or getattr(item, \"exception\", None)\n",
    "        print(f\"❌ {item_id} failed during extraction: {err}\")\n",
    "        continue\n",
    "\n",
    "    # run deidentify for the same text\n",
    "    original_text = item.text if hasattr(item, \"text\") else batch_texts[ids.index(item_id)]\n",
    "    deid = deidentify(\n",
    "        original_text,\n",
    "        method=\"mask\",\n",
    "        model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "        confidence_threshold=0.6,\n",
    "        use_smart_merging=True,\n",
    "    )\n",
    "\n",
    "    print(f\"📄 {item_id}:\")\n",
    "    print(deid.deidentified_text)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 6. Confidence Thresholding\n",
    "\n",
    "Control precision vs recall trade-off."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "CONFIDENCE THRESHOLDING\n",
      "================================================================================\n",
      "Input: Patient: Jane Doe, DOB: 05/20/1985, Phone: 555-1234, Email: jane@email.com\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.3 → 4 entities\n",
      "  [first_name          ] 'Jane                     ' (conf: 1.000)\n",
      "  [last_name           ] 'Doe                      ' (conf: 0.998)\n",
      "  [date                ] '05/20/1985               ' (conf: 0.672)\n",
      "  [email               ] 'jane@email.com           ' (conf: 0.999)\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.5 → 4 entities\n",
      "  [first_name          ] 'Jane                     ' (conf: 1.000)\n",
      "  [last_name           ] 'Doe                      ' (conf: 0.998)\n",
      "  [date                ] '05/20/1985               ' (conf: 0.672)\n",
      "  [email               ] 'jane@email.com           ' (conf: 0.999)\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.7 → 4 entities\n",
      "  [first_name          ] 'Jane                     ' (conf: 1.000)\n",
      "  [last_name           ] 'Doe                      ' (conf: 0.998)\n",
      "  [date                ] '05/20/1985               ' (conf: 0.751)\n",
      "  [email               ] 'jane@email.com           ' (conf: 0.999)\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.9 → 3 entities\n",
      "  [first_name          ] 'Jane                     ' (conf: 1.000)\n",
      "  [last_name           ] 'Doe                      ' (conf: 0.998)\n",
      "  [email               ] 'jane@email.com           ' (conf: 0.999)\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "Guidelines:\n",
      "  • threshold=0.3-0.5: High recall (catch more PII, more false positives)\n",
      "  • threshold=0.5-0.7: Balanced (RECOMMENDED for most use cases)\n",
      "  • threshold=0.7-0.9: High precision (fewer false positives, may miss some PII)\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"CONFIDENCE THRESHOLDING\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "test_text = \"Patient: Jane Doe, DOB: 05/20/1985, Phone: 555-1234, Email: jane@email.com\"\n",
    "print(f\"Input: {test_text}\\n\")\n",
    "\n",
    "thresholds = [0.3, 0.5, 0.7, 0.9]\n",
    "\n",
    "for threshold in thresholds:\n",
    "    result = extract_pii(\n",
    "        test_text,\n",
    "        model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "        confidence_threshold=threshold,\n",
    "        use_smart_merging=True\n",
    "    )\n",
    "\n",
    "    print(f\"Threshold: {threshold:.1f} → {len(result.entities)} entities\")\n",
    "    for entity in result.entities:\n",
    "        print(f\"  [{entity.label:20s}] '{entity.text:25s}' (conf: {entity.confidence:.3f})\")\n",
    "    print()\n",
    "\n",
    "print(\"-\" * 80)\n",
    "print(\"Guidelines:\")\n",
    "print(\"  • threshold=0.3-0.5: High recall (catch more PII, more false positives)\")\n",
    "print(\"  • threshold=0.5-0.7: Balanced (RECOMMENDED for most use cases)\")\n",
    "print(\"  • threshold=0.7-0.9: High precision (fewer false positives, may miss some PII)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 7. Custom PII Patterns\n",
    "\n",
    "Add domain-specific patterns for your organization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "CUSTOM PII PATTERNS\n",
      "================================================================================\n",
      "Defined 3 custom patterns:\n",
      "\n",
      "  [employee_id         ] Priority: 10, Pattern: \\bEMP-\\d{6}\\b\n",
      "  [patient_id          ] Priority: 9, Pattern: \\bPID-\\d{8}\\b\n",
      "  [internal_code       ] Priority: 8, Pattern: \\b[A-Z]{2}-\\d{4}-[A-Z]\\b\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "Test text:\n",
      "\n",
      "Employee: EMP-123456\n",
      "Patient ID: PID-87654321\n",
      "Department Code: HR-2024-A\n",
      "Regular SSN: 123-45-6789\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "Detected units with custom patterns:\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 2 entities (including custom types):\n",
      "\n",
      "🆕 [employee_id         ] 'EMP-123456' (confidence: 0.988)\n",
      "   [ssn                 ] '123-45-6789' (confidence: 0.924)\n",
      "\n",
      "🆕 = Custom pattern detected\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"CUSTOM PII PATTERNS\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "# Define custom patterns\n",
    "custom_patterns = [\n",
    "    PIIPattern(\n",
    "        pattern=r'\\bEMP-\\d{6}\\b',  # Employee ID format: EMP-123456\n",
    "        entity_type='employee_id',\n",
    "        priority=10\n",
    "    ),\n",
    "    PIIPattern(\n",
    "        pattern=r'\\bPID-\\d{8}\\b',  # Patient ID format: PID-12345678\n",
    "        entity_type='patient_id',\n",
    "        priority=9\n",
    "    ),\n",
    "    PIIPattern(\n",
    "        pattern=r'\\b[A-Z]{2}-\\d{4}-[A-Z]\\b',  # Custom format: AB-1234-X\n",
    "        entity_type='internal_code',\n",
    "        priority=8\n",
    "    ),\n",
    "]\n",
    "\n",
    "print(f\"Defined {len(custom_patterns)} custom patterns:\\n\")\n",
    "for p in custom_patterns:\n",
    "    print(f\"  [{p.entity_type:20s}] Priority: {p.priority}, Pattern: {p.pattern}\")\n",
    "\n",
    "# Test text with custom identifiers\n",
    "custom_text = \"\"\"\n",
    "Employee: EMP-123456\n",
    "Patient ID: PID-87654321\n",
    "Department Code: HR-2024-A\n",
    "Regular SSN: 123-45-6789\n",
    "\"\"\"\n",
    "\n",
    "print(\"\\n\" + \"-\" * 80)\n",
    "print(\"Test text:\")\n",
    "print(custom_text)\n",
    "\n",
    "# Find custom semantic units\n",
    "print(\"-\" * 80)\n",
    "print(\"Detected units with custom patterns:\\n\")\n",
    "\n",
    "# First get model predictions\n",
    "result = extract_pii(\n",
    "    custom_text,\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.5,\n",
    "    use_smart_merging=False  # Get raw predictions first\n",
    ")\n",
    "\n",
    "# Convert to dict format\n",
    "entity_dicts = [\n",
    "    {\n",
    "        'entity_type': e.label,\n",
    "        'score': e.confidence,\n",
    "        'start': e.start,\n",
    "        'end': e.end,\n",
    "        'word': e.text\n",
    "    }\n",
    "    for e in result.entities\n",
    "]\n",
    "\n",
    "# Merge with custom patterns\n",
    "merged = merge_entities_with_semantic_units(\n",
    "    entity_dicts,\n",
    "    result.text,\n",
    "    patterns=custom_patterns,  # Add custom patterns\n",
    "    use_semantic_patterns=True,\n",
    "    prefer_model_labels=False  # Prefer pattern labels for custom types\n",
    ")\n",
    "\n",
    "print(f\"Found {len(merged)} entities (including custom types):\\n\")\n",
    "for entity in merged:\n",
    "    label = entity['entity_type']\n",
    "    text = entity['word']\n",
    "    conf = entity['score']\n",
    "    is_custom = label in ['employee_id', 'patient_id', 'internal_code']\n",
    "    marker = \"🆕\" if is_custom else \"  \"\n",
    "    print(f\"{marker} [{label:20s}] '{text}' (confidence: {conf:.3f})\")\n",
    "\n",
    "print(\"\\n🆕 = Custom pattern detected\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 8. Clinical Use Cases\n",
    "\n",
    "Real-world clinical scenarios."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Case 1: Discharge Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "USE CASE 1: Discharge Summary De-identification\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "De-identified Discharge Summary:\n",
      "\n",
      "DISCHARGE SUMMARY\n",
      "=====================================\n",
      "Patient Name:[first_name]l[last_name]n[medical_record_number]2\n",
      "Date of Birth:[date_of_birth]8\n",
      "Admission Date:[date]5\n",
      "Discharge Date:[date]5[occupation]n: Dr.[first_name]y[last_name]r\n",
      "\n",
      "PRIMARY DIAGNOSIS:\n",
      "Acute myocardial infarction\n",
      "\n",
      "HOSPITAL COURSE:\n",
      "Mr.[last_name]n is a[age]6-year-old male who presented to the emergency\n",
      "department on[date]5 with chest pain. He was admitted for\n",
      "cardiac catheterization and intervention.\n",
      "\n",
      "CONTACT INFORMATION:\n",
      "Phone:[phone_number]8\n",
      "Email:[email]m\n",
      "Emergency Contact:[first_name]e[last_name]n (Wife) - (555) 234-5679\n",
      "\n",
      "FOLLOW-UP:\n",
      "Patient scheduled for follow-up on[date]5 at the cardiology clinic.\n",
      "\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "PII entities protected: 17\n",
      "✅ Clean de-identification - no adjacent placeholders!\n"
     ]
    }
   ],
   "source": [
    "discharge_summary = \"\"\"\n",
    "DISCHARGE SUMMARY\n",
    "=====================================\n",
    "Patient Name: Michael Anderson\n",
    "MRN: 98765432\n",
    "Date of Birth: 08/12/1968\n",
    "Admission Date: 01/05/2025\n",
    "Discharge Date: 01/10/2025\n",
    "Attending Physician: Dr. Emily Carter\n",
    "\n",
    "PRIMARY DIAGNOSIS:\n",
    "Acute myocardial infarction\n",
    "\n",
    "HOSPITAL COURSE:\n",
    "Mr. Anderson is a 56-year-old male who presented to the emergency\n",
    "department on 01/05/2025 with chest pain. He was admitted for\n",
    "cardiac catheterization and intervention.\n",
    "\n",
    "CONTACT INFORMATION:\n",
    "Phone: (555) 234-5678\n",
    "Email: m.anderson@email.com\n",
    "Emergency Contact: Jane Anderson (Wife) - (555) 234-5679\n",
    "\n",
    "FOLLOW-UP:\n",
    "Patient scheduled for follow-up on 01/24/2025 at the cardiology clinic.\n",
    "\"\"\"\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"USE CASE 1: Discharge Summary De-identification\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "# De-identify for research database\n",
    "deid_discharge = deidentify(\n",
    "    discharge_summary,\n",
    "    method=\"mask\",\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.6,\n",
    "    use_smart_merging=True\n",
    ")\n",
    "\n",
    "print(\"De-identified Discharge Summary:\")\n",
    "print(deid_discharge.deidentified_text)\n",
    "\n",
    "print(\"\\n\" + \"-\" * 80)\n",
    "print(f\"PII entities protected: {len(deid_discharge.pii_entities)}\")\n",
    "# Check for adjacent placeholders (quality check)\n",
    "if '][' in deid_discharge.deidentified_text:\n",
    "    print(\"❌ Adjacent placeholders detected - fragmentation issue!\")\n",
    "else:\n",
    "    print(\"✅ Clean de-identification - no adjacent placeholders!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Case 2: Research Dataset Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "USE CASE 2: Research Dataset Preparation\n",
      "================================================================================\n",
      "Processing 3 patient notes for research...\n",
      "\n",
      "De-identified research dataset (date shifting by 180 days):\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "patient_001: Patient 001: [first_name] [last_name], DOB [date], diagnosed with T2DM on [date]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "patient_002: Patient 002: [first_name] [last_name], DOB [date_of_birth], A1C 8.5%, started metformin [date]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "patient_003: Patient 003: [first_name] [last_name], DOB [date_of_birth], BMI 32.1, blood pressure 145/90\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "✅ Research dataset ready!\n",
      "   - All dates shifted by 180 days\n",
      "   - Temporal relationships preserved\n",
      "   - Audit mapping available for IRB review\n"
     ]
    }
   ],
   "source": [
    "research_notes = [\n",
    "    \"Patient 001: John Smith, DOB 03/15/1975, diagnosed with T2DM on 12/20/2024\",\n",
    "    \"Patient 002: Sarah Johnson, DOB 08/22/1982, A1C 8.5%, started metformin 01/05/2025\",\n",
    "    \"Patient 003: Robert Williams, DOB 11/30/1965, BMI 32.1, blood pressure 145/90\",\n",
    "]\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"USE CASE 2: Research Dataset Preparation\")\n",
    "print(\"=\" * 80)\n",
    "print(f\"Processing {len(research_notes)} patient notes for research...\\n\")\n",
    "\n",
    "# For batch de-identification, use deidentify() on each text\n",
    "# BatchProcessor.process_items() is for extraction only\n",
    "\n",
    "print(\"De-identified research dataset (date shifting by 180 days):\\n\")\n",
    "\n",
    "for i, note in enumerate(research_notes, 1):\n",
    "    patient_id = f\"patient_{i:03d}\"\n",
    "\n",
    "    # De-identify each note with date shifting\n",
    "    deid_result = deidentify(\n",
    "        note,\n",
    "        method=\"shift_dates\",\n",
    "        model_name=\"openmed/OpenMed-PII-SuperClinical-Large-434M-v1\",\n",
    "        confidence_threshold=0.6,\n",
    "        use_smart_merging=True,\n",
    "        date_shift_days=180,\n",
    "        keep_mapping=True,\n",
    "    )\n",
    "\n",
    "    print(f\"{patient_id}: {deid_result.deidentified_text}\")\n",
    "\n",
    "print(\"\\n\" + \"-\" * 80)\n",
    "print(\"✅ Research dataset ready!\")\n",
    "print(\"   - All dates shifted by 180 days\")\n",
    "print(\"   - Temporal relationships preserved\")\n",
    "print(\"   - Audit mapping available for IRB review\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Case 3: HIPAA Compliance Audit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "USE CASE 3: HIPAA Compliance Audit\n",
      "================================================================================\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HIPAA Compliance Check:\n",
      "\n",
      "Total PII entities detected: 15\n",
      "\n",
      "PII Categories Found:\n",
      "  • account_number: 1 instance(s)\n",
      "  • city: 1 instance(s)\n",
      "  • date: 1 instance(s)\n",
      "  • email: 1 instance(s)\n",
      "  • fax_number: 1 instance(s)\n",
      "  • first_name: 1 instance(s)\n",
      "  • ipv4: 1 instance(s)\n",
      "  • last_name: 1 instance(s)\n",
      "  • license_plate: 1 instance(s)\n",
      "  • medical_record_number: 1 instance(s)\n",
      "  • postcode: 1 instance(s)\n",
      "  • ssn: 1 instance(s)\n",
      "  • state: 1 instance(s)\n",
      "  • street_address: 1 instance(s)\n",
      "  • url: 1 instance(s)\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "HIPAA Safe Harbor Compliance:\n",
      "  Coverage: 6/17 HIPAA identifier categories\n",
      "  ✅ Ready for HIPAA-compliant de-identification\n"
     ]
    }
   ],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"USE CASE 3: HIPAA Compliance Audit\")\n",
    "print(\"=\" * 80)\n",
    "\n",
    "# HIPAA Safe Harbor requires removal of 18 identifiers\n",
    "hipaa_text = \"\"\"\n",
    "Patient: Jane Doe\n",
    "DOB: 05/15/1980\n",
    "SSN: 987-65-4321\n",
    "Address: 789 Pine Street, Unit 4B, Seattle, WA 98101\n",
    "Phone: (206) 555-1234\n",
    "Fax: (206) 555-1235\n",
    "Email: jane.doe@email.com\n",
    "Medical Record: MRN-12345678\n",
    "Account Number: ACCT-987654\n",
    "Device ID: DEVICE-ABC123\n",
    "IP Address: 192.168.1.100\n",
    "Vehicle: License plate ABC-1234\n",
    "Biometric: Fingerprint on file\n",
    "Photo: Patient photo available\n",
    "URL: https://patient-portal.hospital.com/jane-doe\n",
    "\"\"\"\n",
    "\n",
    "# Extract all PII for audit\n",
    "hipaa_result = extract_pii(\n",
    "    hipaa_text,\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.5,\n",
    "    use_smart_merging=True\n",
    ")\n",
    "\n",
    "print(f\"HIPAA Compliance Check:\\n\")\n",
    "print(f\"Total PII entities detected: {len(hipaa_result.entities)}\\n\")\n",
    "\n",
    "# Group by type\n",
    "from collections import Counter\n",
    "pii_types = Counter(e.label for e in hipaa_result.entities)\n",
    "\n",
    "print(\"PII Categories Found:\")\n",
    "for pii_type, count in sorted(pii_types.items()):\n",
    "    print(f\"  • {pii_type}: {count} instance(s)\")\n",
    "\n",
    "# HIPAA 18 identifiers checklist\n",
    "hipaa_18_identifiers = [\n",
    "    'names', 'geographic', 'dates', 'phone', 'fax', 'email', 'ssn',\n",
    "    'medical_record', 'account_number', 'certificate', 'vehicle',\n",
    "    'device', 'url', 'ip_address', 'biometric', 'photo', 'unique_id'\n",
    "]\n",
    "\n",
    "print(\"\\n\" + \"-\" * 80)\n",
    "print(\"HIPAA Safe Harbor Compliance:\")\n",
    "detected_types = set(e.label.lower() for e in hipaa_result.entities)\n",
    "covered = sum(1 for identifier in hipaa_18_identifiers\n",
    "              if any(identifier in dt for dt in detected_types))\n",
    "\n",
    "print(f\"  Coverage: {covered}/{len(hipaa_18_identifiers)} HIPAA identifier categories\")\n",
    "print(\"  ✅ Ready for HIPAA-compliant de-identification\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 9. Visualization\n",
    "\n",
    "Display results with highlighting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Device set to use cpu\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "================================================================================\n",
      "VISUALIZATION: Highlighted PII Entities\n",
      "================================================================================\n",
      "\n",
      "Hover over highlighted text to see entity type and confidence.\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div style=\"font-family: monospace; white-space: pre-wrap; padding: 10px; background: #f8f8f8; border-radius: 5px;\">Patient: <span style=\"background-color: #E8E8E8; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"occupation (confidence: 0.57)\">Dr.</span> <span style=\"background-color: #FFB3BA; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"first_name (confidence: 1.00)\">Sarah</span> <span style=\"background-color: #FFB3BA; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"last_name (confidence: 1.00)\">Johnson</span>\n",
       "DOB: <span style=\"background-color: #BAFFC9; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"date (confidence: 0.66)\">03/15/1975</span>\n",
       "SSN: <span style=\"background-color: #BAE1FF; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"ssn (confidence: 0.97)\">123-45-6789</span>\n",
       "Phone: (<span style=\"background-color: #E8E8E8; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"phone_number (confidence: 0.80)\">555) 123-4567</span>\n",
       "Email: <span style=\"background-color: #FFDFBA; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"email (confidence: 1.00)\">sarah.j@email.com</span>\n",
       "Address: <span style=\"background-color: #E8E8E8; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"street_address (confidence: 1.00)\">456 Oak Ave</span>, <span style=\"background-color: #E8E8E8; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"city (confidence: 0.91)\">Boston</span>, <span style=\"background-color: #E8E8E8; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"state (confidence: 0.93)\">MA</span> <span style=\"background-color: #E8E8E8; padding: 1px 2px; border-radius: 3px; font-weight: bold;\" title=\"postcode (confidence: 0.97)\">02115</span></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Legend:\n",
      "  🟥 Pink: Names\n",
      "  🟩 Green: Dates\n",
      "  🟦 Blue: SSN\n",
      "  🟨 Yellow: Phone\n",
      "  🟧 Orange: Email\n",
      "  🟪 Purple: Address\n"
     ]
    }
   ],
   "source": [
    "from IPython.display import HTML, display\n",
    "\n",
    "def highlight_entities(text, entities):\n",
    "    \"\"\"Create HTML with highlighted PII entities.\"\"\"\n",
    "    # Define colors for different entity types\n",
    "    colors = {\n",
    "        'name': '#FFB3BA',\n",
    "        'first_name': '#FFB3BA',\n",
    "        'last_name': '#FFB3BA',\n",
    "        'date': '#BAFFC9',\n",
    "        'date_of_birth': '#BAFFC9',\n",
    "        'ssn': '#BAE1FF',\n",
    "        'phone': '#FFFFBA',\n",
    "        'email': '#FFDFBA',\n",
    "        'address': '#E0BBE4',\n",
    "        'medical_record_number': '#D4F1F4',\n",
    "    }\n",
    "\n",
    "    # Sort entities by start position (reverse for replacement)\n",
    "    sorted_entities = sorted(entities, key=lambda e: e.start, reverse=True)\n",
    "\n",
    "    highlighted = text\n",
    "    for entity in sorted_entities:\n",
    "        color = colors.get(entity.label.lower(), '#E8E8E8')\n",
    "        replacement = (\n",
    "            f'<span style=\"background-color: {color}; padding: 1px 2px; '\n",
    "            f'border-radius: 3px; font-weight: bold;\" '\n",
    "            f'title=\"{entity.label} (confidence: {entity.confidence:.2f})\">{entity.text}</span>'\n",
    "        )\n",
    "        highlighted = (\n",
    "            highlighted[:entity.start] + replacement + highlighted[entity.end:]\n",
    "        )\n",
    "\n",
    "    return f'<div style=\"font-family: monospace; white-space: pre-wrap; padding: 10px; background: #f8f8f8; border-radius: 5px;\">{highlighted}</div>'\n",
    "\n",
    "# Example visualization\n",
    "viz_text = \"\"\"Patient: Dr. Sarah Johnson\n",
    "DOB: 03/15/1975\n",
    "SSN: 123-45-6789\n",
    "Phone: (555) 123-4567\n",
    "Email: sarah.j@email.com\n",
    "Address: 456 Oak Ave, Boston, MA 02115\"\"\"\n",
    "\n",
    "viz_result = extract_pii(\n",
    "    viz_text,\n",
    "    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',\n",
    "    confidence_threshold=0.5,\n",
    "    use_smart_merging=True\n",
    ")\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"VISUALIZATION: Highlighted PII Entities\")\n",
    "print(\"=\" * 80)\n",
    "print(\"\\nHover over highlighted text to see entity type and confidence.\\n\")\n",
    "\n",
    "html = highlight_entities(viz_result.text, viz_result.entities)\n",
    "display(HTML(html))\n",
    "\n",
    "print(\"\\nLegend:\")\n",
    "print(\"  🟥 Pink: Names\")\n",
    "print(\"  🟩 Green: Dates\")\n",
    "print(\"  🟦 Blue: SSN\")\n",
    "print(\"  🟨 Yellow: Phone\")\n",
    "print(\"  🟧 Orange: Email\")\n",
    "print(\"  🟪 Purple: Address\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 10. CLI Usage Examples\n",
    "\n",
    "Command-line interface for PII operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"=\" * 80)\n",
    "print(\"CLI USAGE EXAMPLES\")\n",
    "print(\"=\" * 80)\n",
    "print(\"\"\"\n",
    "OpenMed provides a powerful CLI for PII detection and de-identification.\n",
    "\n",
    "1. Extract PII from text:\n",
    "   ```bash\n",
    "   openmed pii extract \\\n",
    "     --text \"Patient: John Doe, DOB: 01/15/1970\" \\\n",
    "     --model openmed/OpenMed-PII-SuperClinical-Large-434M-v1 \\\n",
    "     --confidence-threshold 0.5\n",
    "   ```\n",
    "\n",
    "2. Extract PII from file:\n",
    "   ```bash\n",
    "   openmed pii extract \\\n",
    "     --input-file patient_note.txt \\\n",
    "     --output results.json\n",
    "   ```\n",
    "\n",
    "3. De-identify with mask method:\n",
    "   ```bash\n",
    "   openmed pii deidentify \\\n",
    "     --input-file patient_note.txt \\\n",
    "     --method mask \\\n",
    "     --output deidentified.txt\n",
    "   ```\n",
    "\n",
    "4. De-identify with date shifting:\n",
    "   ```bash\n",
    "   openmed pii deidentify \\\n",
    "     --text \"Admission: 01/15/2025\" \\\n",
    "     --method shift_dates \\\n",
    "     --date-shift-days 180\n",
    "   ```\n",
    "\n",
    "5. Batch processing:\n",
    "   ```bash\n",
    "   openmed pii batch-extract \\\n",
    "     --input-dir ./patient_notes/ \\\n",
    "     --output-dir ./results/ \\\n",
    "     --confidence-threshold 0.6\n",
    "   ```\n",
    "\n",
    "6. Interactive mode:\n",
    "   ```bash\n",
    "   openmed pii interactive\n",
    "   ```\n",
    "\n",
    "7. Get help:\n",
    "   ```bash\n",
    "   openmed pii --help\n",
    "   openmed pii extract --help\n",
    "   openmed pii deidentify --help\n",
    "   ```\n",
    "\"\"\")\n",
    "\n",
    "print(\"\\nTo see available CLI commands, run:\")\n",
    "print(\"  !openmed pii --help\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Summary and Best Practices\n",
    "\n",
    "### Key Takeaways\n",
    "\n",
    "1. **Smart Merging (v0.5.0)** - Always enabled by default\n",
    "   - Fixes fragmentation issues\n",
    "   - Production-ready complete entities\n",
    "   - Minimal performance overhead (~5-10%)\n",
    "\n",
    "2. **De-identification Methods**\n",
    "   - `mask`: Best for clinical review (maintains structure)\n",
    "   - `remove`: Maximum privacy (minimal data)\n",
    "   - `replace`: Research datasets (synthetic data)\n",
    "   - `hash`: Linking records (deterministic)\n",
    "   - `shift_dates`: Temporal analysis (preserves relationships)\n",
    "\n",
    "3. **Confidence Thresholds**\n",
    "   - 0.5-0.7: Recommended for most use cases\n",
    "   - Lower: High recall (catch more PII)\n",
    "   - Higher: High precision (fewer false positives)\n",
    "\n",
    "4. **Batch Processing**\n",
    "   - Use for multiple documents\n",
    "   - Efficient resource usage\n",
    "   - Progress tracking included\n",
    "\n",
    "5. **HIPAA Compliance**\n",
    "   - Covers all 18 identifier categories\n",
    "   - Audit trails with mappings\n",
    "   - Safe Harbor method supported\n",
    "\n",
    "### Best Practices\n",
    "\n",
    "✅ **DO:**\n",
    "- Use `use_smart_merging=True` (default) for production\n",
    "- Test with representative data\n",
    "- Monitor entity quality (check for fragments)\n",
    "- Keep mappings for audit trails\n",
    "- Validate de-identified output\n",
    "\n",
    "❌ **DON'T:**\n",
    "- Disable smart merging without good reason\n",
    "- Use very low thresholds without review\n",
    "- Skip validation on production data\n",
    "- Share mappings without encryption\n",
    "- Assume 100% recall (always review edge cases)\n",
    "\n",
    "### Performance Tips\n",
    "\n",
    "- Use batch processing for multiple documents\n",
    "- Adjust `batch_size` based on memory\n",
    "- Cache model loading for repeated calls\n",
    "- Monitor processing time with profiling\n",
    "\n",
    "### Security Considerations\n",
    "\n",
    "- Store mappings securely (encrypted)\n",
    "- Limit access to original data\n",
    "- Audit de-identification logs\n",
    "- Follow organizational HIPAA policies\n",
    "- Regular compliance reviews\n",
    "\n",
    "---\n",
    "\n",
    "## Resources\n",
    "\n",
    "- **Documentation:** https://github.com/maziyarpanahi/openmed\n",
    "- **Smart Merging Guide:** `docs/pii-smart-merging.md`\n",
    "- **API Reference:** `docs/api-reference.md`\n",
    "- **HIPAA Compliance:** `docs/hipaa-compliance.md`\n",
    "- **Model Hub:** https://huggingface.co/openmed\n",
    "\n",
    "---\n",
    "\n",
    "**Version:** OpenMed v0.5.0+\n",
    "\n",
    "**Last Updated:** 2026-01-13\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}