---
name: language-detection-expert
description: Hybrid language detection algorithm for Vigil Guard v2.0.0. Use for language-detector Flask API, entity-based hints, Polish PESEL/NIP detection, 3-branch pipeline integration, accuracy troubleshooting, and langdetect integration.
version: 2.0.0
allowed-tools: [Read, Write, Edit, Bash, Grep, Glob]
---

# Language Detection Expert (v2.0.0)

## Overview

Hybrid language detection algorithm for Vigil Guard v2.0.0. It combines entity-based hints (Polish PESEL/NIP detection) with statistical analysis (the langdetect library) to support accurate dual-language PII processing and integration with the 3-branch detection pipeline.

## When to Use This Skill

- Managing language-detector Flask API (services/language-detector/)
- Implementing hybrid detection logic
- Troubleshooting detection accuracy and latency (<10ms target)
- Working with langdetect library
- Polish entity recognition patterns
- 3-branch pipeline integration (v2.0.0)

## Tech Stack

- Python 3.11, Flask 3.0.0
- langdetect 1.0.9 (statistical analysis)
- Custom Polish entity patterns (PESEL, NIP, REGON)

## v2.0.0 Architecture Integration

### Position in 3-Branch Pipeline

```yaml
n8n Workflow (24 nodes):
  1. Input Validation
  2. Language Detection ← This Service
  3. 3-Branch Executor (parallel):
     - Branch A: Heuristics (uses language for keyword matching)
     - Branch B: Semantic (uses language for embedding model)
     - Branch C: LLM Guard (language-agnostic)
  4. Arbiter v2 Decision
  5. PII Redaction (uses language for Presidio model selection)
```

### Integration with Branches

```javascript
// From n8n 3-Branch Executor
const languageResult = await fetch('http://vigil-language-detector:5002/detect', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: input, detailed: true })
});
const { language, detection_method } = await languageResult.json();

// Branch A: Heuristics - uses language for keyword patterns
const branchA = await fetch('http://vigil-heuristics:5005/analyze', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: input, language, request_id })
});

// Branch B: Semantic - uses language for embedding selection
const branchB = await fetch('http://vigil-semantic:5006/analyze', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ text: input, language, request_id })
});

// PII Redaction - uses language for Presidio model
const piiResult = await detectPII(input, language === 'pl' ? ['pl', 'en'] : ['en']);
```

## Hybrid Detection Algorithm (v2.0.0)

### Decision Flow

```yaml
1. Check Polish Entity Hints:
   - PESEL pattern: \d{11} with checksum
   - NIP pattern: XXX-XXX-XX-XX or \d{10}
   - REGON pattern: \d{9} or \d{14}
   - Polish keywords: ["PESEL", "NIP", "REGON", "dowód", "paszport"]
   → If found: return "pl" (detection_method: "hybrid_entity_hints")

2. If no entity hints, use langdetect:
   - Statistical analysis of character n-grams
   - Language profiles for 55+ languages
   → If confidence >0.9: return detected language
   → Otherwise: return "en" (default fallback)

3. Edge cases:
   - Empty text → "en" (default)
   - Numbers only → "en" (default)
   - Very short text (<10 chars) → Check entity hints only
```

### API Endpoint

```python
# POST /detect
{
  "text": "Moja karta to 4111111111111111 i PESEL 92032100157",
  "detailed": true
}

# Response
{
  "language": "pl",
  "confidence": 1.0,
  "detection_method": "hybrid_entity_hints",
  "details": {
    "entity_hints_found": ["PESEL"],
    "langdetect_result": "pl",
    "langdetect_confidence": 0.95
  }
}
```

## Common Tasks

### Task 1: Add Polish Entity Pattern

```python
# app.py
import re

POLISH_ENTITY_PATTERNS = [
    (r'\b\d{11}\b', 'PESEL'),                    # 11 digits
    (r'\b\d{3}-\d{3}-\d{2}-\d{2}\b', 'NIP'),     # NIP with dashes
    (r'\b\d{10}\b', 'NIP_OR_REGON'),             # 10 digits (ambiguous)
    (r'\b\d{9}\b', 'REGON'),                     # 9 digits REGON
]

POLISH_KEYWORDS = [
    'PESEL', 'pesel', 'NIP', 'nip', 'REGON', 'regon',
    'dowód', 'paszport', 'legitymacja', 'tożsamość'
]

def has_polish_entities(text: str) -> tuple[bool, list]:
    """Check for Polish-specific entities"""
    found_entities = []

    # Check patterns
    for pattern, entity_type in POLISH_ENTITY_PATTERNS:
        if re.search(pattern, text):
            found_entities.append(entity_type)

    # Check keywords (case-sensitive substring match)
    for keyword in POLISH_KEYWORDS:
        if keyword in text:
            found_entities.append(f'keyword:{keyword}')

    return len(found_entities) > 0, found_entities
```

### Task 2: Statistical Detection with langdetect

```python
from langdetect import detect_langs, LangDetectException

def detect_language_statistical(text: str) -> tuple[str, float]:
    """
    Use langdetect for statistical language detection
    Returns: (language_code, confidence)
    """
    try:
        # Get all language probabilities, sorted from most to least probable
        langs = detect_langs(text)

        # Return most probable language
        if langs:
            top_lang = langs[0]
            return top_lang.lang, top_lang.prob

        return 'en', 0.0
    except LangDetectException:
        # Text too short or only numbers
        return 'en', 0.0
```

### Task 3: Hybrid Detection Implementation

```python
from flask import jsonify, request

@app.route('/detect', methods=['POST'])
def detect_language():
    # Tolerate a missing or non-JSON body instead of failing on None
    data = request.get_json(silent=True) or {}
    text = data.get('text', '')
    detailed = data.get('detailed', False)

    # 1. Check entity hints
    has_polish, entities = has_polish_entities(text)

    if has_polish:
        # Strong Polish signal from entities
        result = {
            'language': 'pl',
            'confidence': 1.0,
            'detection_method': 'hybrid_entity_hints'
        }

        if detailed:
            result['details'] = {
                'entity_hints_found': entities,
                'langdetect_result': None,
                'langdetect_confidence': None
            }

        return jsonify(result)

    # 2. No entity hints, use statistical
    lang, confidence = detect_language_statistical(text)

    result = {
        'language': lang,
        'confidence': confidence,
        'detection_method': 'langdetect' if confidence > 0.5 else 'default_fallback'
    }

    if detailed:
        result['details'] = {
            'entity_hints_found': [],
            'langdetect_result': lang,
            'langdetect_confidence': confidence
        }

    return jsonify(result)
```

### Task 4: Performance Optimization

```python
import signal
from functools import lru_cache

# Cache results for up to 1000 recently used texts.
# The cache is keyed on the text itself: a hash cannot be reversed to recover the text.
@lru_cache(maxsize=1000)
def cached_detect(text: str) -> tuple:
    """Cache detection results for performance"""
    has_polish, entities = has_polish_entities(text)
    if has_polish:
        # Return an immutable tuple so cached values cannot be mutated by callers
        return ('pl', 1.0, 'hybrid_entity_hints', tuple(entities))

    lang, confidence = detect_language_statistical(text)
    return (lang, confidence, 'langdetect', ())

# Timeout protection (10ms target)
# SIGALRM is Unix-only and can only be used from the main thread.
def timeout_handler(signum, frame):
    raise TimeoutError("Language detection exceeded timeout")

def detect_with_timeout(text: str, timeout_ms: int = 10):
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.setitimer(signal.ITIMER_REAL, timeout_ms / 1000)

    try:
        return detect_language_statistical(text)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # Cancel the timer
```

## v2.0.0 Branch Integration Examples

### Heuristics Service (Branch A) Integration

```python
# heuristics-service uses language for keyword patterns
def analyze_with_language(text: str, language: str):
    if language == 'pl':
        keywords = POLISH_KEYWORDS + COMMON_KEYWORDS
        patterns = POLISH_PATTERNS + COMMON_PATTERNS
    else:
        keywords = ENGLISH_KEYWORDS + COMMON_KEYWORDS
        patterns = ENGLISH_PATTERNS + COMMON_PATTERNS

    return match_patterns(text, patterns, keywords)
```

### Semantic Service (Branch B) Integration

```python
# semantic-service may use language for embedding model selection
def get_embeddings(text: str, language: str):
    # all-MiniLM-L6-v2 is trained mainly on English data,
    # so the language hint drives preprocessing
    model = load_model('all-MiniLM-L6-v2')

    # Language-specific preprocessing
    if language == 'pl':
        text = polish_preprocessing(text)

    return model.encode(text)
```

### PII Redaction Integration

```python
# PII redaction uses language for Presidio model selection
async def detect_pii_with_language(text: str, detected_language: str):
    if detected_language == 'pl':
        # Polish first for PESEL detection accuracy
        languages = ['pl', 'en']
    else:
        languages = ['en']

    return await dual_language_pii(text, languages)
```

## Test Coverage

### Test Categories

```yaml
Polish Text (15 tests):
  - With diacritics: "Cześć, jak się masz?"
  - Without diacritics: "Prosze o pomoc"
  - Mixed case: "PROSZĘ o pomoc"

English Text (10 tests):
  - Common words: "Please help me"
  - Technical: "Docker Compose deployment"

Mixed Language (8 tests):
  - Polish + English terms: "Użyj Docker Compose"
  - English + Polish names: "User Jan Kowalski"

Short Text + Entity Hints (10 tests):
  - PESEL only: "PESEL 92032100157"
  - NIP only: "NIP 123-456-78-90"
  - Credit card (no hint): "Card 4111111111111111" → "en"

Edge Cases (7 tests):
  - Numbers only: "12345 67890" → "en"
  - Special chars: "!@#$%^&*()" → "en"
  - Empty string: "" → "en"
```

## Integration Points

### With presidio-pii-specialist:

```yaml
when: Language detected
action:
  1. language="pl" → Call Presidio with pl_core_news_lg
  2. language="en" → Call Presidio with en_core_web_lg
  3. Dual mode → Call both, deduplicate
```

### With n8n-vigil-workflow (v2.0.0):

```yaml
when: 3-Branch Executor runs
action:
  1. Language Detection node runs first
  2. Result passed to all 3 branches
  3. Branch A uses language for keyword selection
  4. Branch B uses language for embedding preprocessing
  5. PII_Redactor_v2 uses language for model selection
```

### With heuristics-service (Branch A):

```yaml
when: Heuristics analysis
action:
  1. Receive language from detection
  2. Select language-specific patterns
  3. Apply Polish or English keyword list
  4. Return score with language context
```

## Troubleshooting

**Incorrect detection for short Polish text:**

```python
# Add more Polish keywords
POLISH_KEYWORDS += ['proszę', 'dziękuję', 'przepraszam', 'witam']

# Lower confidence threshold (inside the detection function)
if confidence < 0.5:
    return 'pl' if any(word in text for word in POLISH_KEYWORDS) else 'en'
```

**Detection too slow (>10ms):**

```python
# Enable caching
from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_detect(text: str):
    return detect_language_statistical(text)

# Make langdetect results deterministic (the seed does not affect speed)
from langdetect import DetectorFactory
DetectorFactory.seed = 0
```

**Branch A not using language correctly:**

```bash
# Verify language is passed to heuristics
curl -X POST http://localhost:5005/analyze \
  -H "Content-Type: application/json" \
  -d '{"text":"test PESEL 12345678901","language":"pl","request_id":"debug-1"}'

# Check logs
docker logs vigil-heuristics-service --tail 50 | grep language
```

## Quick Reference

```bash
# Test API
curl -X POST http://localhost:5002/detect \
  -H "Content-Type: application/json" \
  -d '{"text":"PESEL 92032100157","detailed":true}'

# Run tests
cd services/language-detector && python -m pytest tests/

# Health check
curl http://localhost:5002/health

# Check service logs
docker logs vigil-language-detector --tail 50
```

## ClickHouse Logging (v2.0.0)

```sql
-- Language detection results logged with events
SELECT
  original_input,
  detected_language,
  detection_method,
  branch_a_score,
  branch_b_score,
  branch_c_score
FROM n8n_logs.events_processed
WHERE detected_language = 'pl'
ORDER BY timestamp DESC
LIMIT 10;
```

---

**Last Updated:** 2025-12-09
**Performance:** <10ms average detection time
**Accuracy:** 100% (50/50 tests passing)
**Languages Supported:** 55+ via langdetect, Polish priority
**Integration:** 3-branch pipeline (v2.0.0)

## Version History

- **v2.0.0** (Current): 3-branch pipeline integration, branch language passing
- **v1.6.11**: Hybrid detection algorithm, entity-based hints
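
## Appendix: PESEL Checksum Sketch

The decision flow lists the PESEL hint as `\d{11}` with checksum, while the Task 1 pattern only matches any 11-digit run. The sketch below shows one way the checksum step could look, using the standard PESEL weights (1, 3, 7, 9 repeated over the first ten digits). The names `pesel_checksum_ok` and `find_valid_pesels` are hypothetical and not part of the service; birth-date validation is deliberately left out.

```python
import re

# Standard PESEL weights applied to the first ten digits.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

# Same candidate pattern as in Task 1: an isolated run of exactly 11 digits.
PESEL_CANDIDATE = re.compile(r"\b\d{11}\b")


def pesel_checksum_ok(candidate: str) -> bool:
    """Return True if candidate is 11 ASCII digits with a valid check digit.

    Hypothetical helper: it checks only the checksum, not the encoded date.
    """
    if len(candidate) != 11 or not (candidate.isascii() and candidate.isdigit()):
        return False
    digits = [int(ch) for ch in candidate]
    # zip stops after ten weights, so the check digit itself is excluded.
    weighted = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    return (10 - weighted % 10) % 10 == digits[10]


def find_valid_pesels(text: str) -> list[str]:
    """Return all 11-digit runs in text that pass the PESEL checksum."""
    return [m for m in PESEL_CANDIDATE.findall(text) if pesel_checksum_ok(m)]


if __name__ == "__main__":
    sample = "Moja karta to 4111111111111111 i PESEL 92032100157"
    print(find_valid_pesels(sample))
```

Filtering candidates this way would keep arbitrary 11-digit strings (such as the `12345678901` used in the Troubleshooting curl example) from counting as a Polish signal by pattern alone, at the cost of one short loop per candidate.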