--- name: input-output-guardrails version: "2.0.0" description: Implementing safety filters, content moderation, and guardrails for AI system inputs and outputs sasmp_version: "1.3.0" bonded_agent: 05-defense-strategy-developer bond_type: SECONDARY_BOND # Schema Definitions input_schema: type: object required: [guardrail_type] properties: guardrail_type: type: string enum: [input, output, both] strictness: type: string enum: [permissive, balanced, strict] default: balanced output_schema: type: object properties: blocked_requests: type: integer filtered_outputs: type: integer false_positive_rate: type: number # Framework Mappings owasp_llm_2025: [LLM01, LLM02, LLM05, LLM07] nist_ai_rmf: [Manage] --- # Input/Output Guardrails Implement **multi-layer safety systems** to filter malicious inputs and harmful outputs. ## Quick Reference ```yaml Skill: input-output-guardrails Agent: 05-defense-strategy-developer OWASP: LLM01 (Injection), LLM02 (Disclosure), LLM05 (Output), LLM07 (Leakage) NIST: Manage Use Case: Production safety filtering ``` ## Guardrail Architecture ``` User Input → [Input Guardrails] → [AI Model] → [Output Guardrails] → Response ↓ ↓ [Blocked/Modified] [Blocked/Modified] ↓ ↓ [Fallback Response] [Safe Alternative] ``` ## Input Guardrails ### 1. Injection Detection ```yaml Category: prompt_injection Latency: <10ms Block Rate: 95%+ ``` ```python class InputGuardrails: INJECTION_PATTERNS = [ r'ignore\s+(previous|prior|all)\s+(instructions?|guidelines?)', r'you\s+are\s+(now|an?)\s+(unrestricted|evil)', r'(developer|admin|debug)\s+mode', r'bypass\s+(safety|security|filter)', r'pretend\s+(you|to)\s+(are|be)', r'what\s+(is|are)\s+your\s+(instructions?|prompt)', ] def __init__(self, config): self.patterns = [re.compile(p, re.I) for p in self.INJECTION_PATTERNS] self.max_length = config.get('max_length', 4096) self.pii_detector = PIIDetector() def validate(self, user_input: str) -> tuple[bool, str]: # Length check if len(user_input) > self.max_length: return False, "Input too long" # Empty check if not user_input.strip(): return False, "Empty input" # Injection detection for pattern in self.patterns: if pattern.search(user_input): return False, "Invalid request" # PII handling if self.pii_detector.contains_pii(user_input): return True, self.pii_detector.redact(user_input) return True, user_input ``` ### 2. PII Detection & Redaction ```python class PIIDetector: PATTERNS = { 'ssn': r'\b\d{3}-\d{2}-\d{4}\b', 'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', 'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', 'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 'api_key': r'(sk|pk)[-_][a-zA-Z0-9]{20,}', } def contains_pii(self, text: str) -> bool: for pattern in self.PATTERNS.values(): if re.search(pattern, text): return True return False def redact(self, text: str) -> str: for name, pattern in self.PATTERNS.items(): text = re.sub(pattern, f'[REDACTED_{name.upper()}]', text) return text ``` ### 3. Rate & Cost Limiting ```yaml Limits: max_tokens_input: 4096 max_requests_per_minute: 60 max_concurrent: 5 cost_limit_per_hour: $10 Actions: exceeded_tokens: truncate exceeded_rate: queue (5s backoff) exceeded_concurrent: reject exceeded_cost: block ``` ## Output Guardrails ### 1. Content Safety Filtering ```python class OutputGuardrails: def __init__(self, config): self.toxicity_threshold = config.get('toxicity', 0.3) self.toxicity_model = load_toxicity_classifier() self.blocklist = self._load_blocklist() def filter(self, response: str) -> tuple[str, dict]: metadata = {'filtered': False, 'reasons': []} # Toxicity check toxicity = self.toxicity_model.predict(response) if toxicity > self.toxicity_threshold: metadata['filtered'] = True metadata['reasons'].append('toxicity') return self._safe_response(), metadata # Blocklist check for term in self.blocklist: if term.lower() in response.lower(): metadata['filtered'] = True metadata['reasons'].append('blocklist') return self._safe_response(), metadata # System prompt leak detection if self._detects_system_leak(response): metadata['filtered'] = True metadata['reasons'].append('system_leak') response = self._redact_system_content(response) return response, metadata def _detects_system_leak(self, response: str) -> bool: leak_indicators = [ 'you are a helpful', 'your instructions are', 'system prompt:', ] return any(ind in response.lower() for ind in leak_indicators) ``` ### 2. Sensitive Data Redaction ```python class OutputRedactor: SENSITIVE_PATTERNS = { 'api_key': r'[a-zA-Z0-9_-]{20,}(?:key|token|secret)', 'password': r'password["\']?\s*[:=]\s*["\']?[^\s"\']+', 'connection_string': r'(mongodb|mysql|postgres)://[^\s]+', 'ip_address': r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', } def redact(self, response: str) -> str: for name, pattern in self.SENSITIVE_PATTERNS.items(): response = re.sub(pattern, '[REDACTED]', response, flags=re.I) return response ``` ### 3. Factuality & Citation ```yaml Factuality Checks: major_claims: action: flag_for_verification threshold: confidence < 0.8 citations: action: verify_source_exists block_if: source_not_found uncertainty: action: add_disclaimer phrases: ["I'm not certain", "might be", "could be"] ``` ## Combined Configuration ```yaml # guardrails_config.yaml input: injection_detection: true pii_redaction: true max_length: 4096 rate_limit: 60/min output: toxicity_threshold: 0.3 blocklist_enabled: true sensitive_redaction: true system_leak_detection: true fallback: input_blocked: "I cannot process this request." output_blocked: "I cannot provide this information." logging: log_blocked: true log_filtered: true include_reason: false # Privacy ``` ## Effectiveness Metrics ``` ┌──────────────────┬─────────┬────────┬──────────┐ │ Metric │ Target │ Actual │ Status │ ├──────────────────┼─────────┼────────┼──────────┤ │ Injection Block │ >95% │ 97% │ ✓ PASS │ │ False Positive │ <2% │ 1.5% │ ✓ PASS │ │ Latency Impact │ <50ms │ 35ms │ ✓ PASS │ │ Toxicity Block │ >90% │ 92% │ ✓ PASS │ │ PII Redaction │ >99% │ 99.5% │ ✓ PASS │ └──────────────────┴─────────┴────────┴──────────┘ ``` ## Troubleshooting ```yaml Issue: High false positive rate Solution: Tune patterns, add allowlist, use context Issue: Latency too high Solution: Optimize regex, use compiled patterns, cache Issue: Bypassed by encoding Solution: Normalize unicode, decode before checking ``` ## Integration Points | Component | Purpose | |-----------|---------| | Agent 05 | Implements guardrails | | /defend | Configuration recommendations | | CI/CD | Automated testing | | Monitoring | Alert on filter triggers | --- **Protect AI systems with comprehensive input/output guardrails.**