---
name: adversarial-training
version: "2.0.0"
description: Defensive techniques using adversarial examples to improve model robustness and security
sasmp_version: "1.3.0"
bonded_agent: 05-defense-strategy-developer
bond_type: PRIMARY_BOND

# Schema Definitions
input_schema:
  type: object
  required: [training_method]
  properties:
    training_method:
      type: string
      enum: [standard, trades, certified, ensemble, all]
    epsilon:
      type: number
      default: 0.3
    attack_types:
      type: array
      items:
        type: string
        enum: [fgsm, pgd, cw, autoattack]

output_schema:
  type: object
  properties:
    robustness_score:
      type: number
    clean_accuracy:
      type: number
    adversarial_accuracy:
      type: number

# Framework Mappings
owasp_llm_2025: [LLM04, LLM09]
nist_ai_rmf: [Manage]
---

# Adversarial Training

Build **robust AI models** by training with adversarial examples and attack simulations.

## Quick Reference

```yaml
Skill: adversarial-training
Agent: 05-defense-strategy-developer
OWASP: LLM04 (Data and Model Poisoning), LLM09 (Misinformation)
NIST: Manage function
Use Case: Improve model robustness against attacks
```

## Training Methods

### 1. Standard Adversarial Training

```yaml
Method: standard
Robustness Gain: 30-50%
Accuracy Tradeoff: 5-15%
Complexity: Medium
```

```python
import torch
import torch.nn as nn


class AdversarialTrainer:
    def __init__(self, model, epsilon=0.3, attack_steps=10):
        self.model = model
        self.epsilon = epsilon
        self.attack_steps = attack_steps
        self.criterion = nn.CrossEntropyLoss()

    def train_step(self, x, y):
        # Generate adversarial examples using PGD
        x_adv = self.pgd_attack(x, y)

        # Train on both clean and adversarial inputs
        loss_clean = self.criterion(self.model(x), y)
        loss_adv = self.criterion(self.model(x_adv), y)

        # Weighted combination of the two losses
        total_loss = 0.5 * loss_clean + 0.5 * loss_adv
        return total_loss

    def pgd_attack(self, x, y):
        """Projected Gradient Descent attack."""
        x_adv = x.clone().detach().requires_grad_(True)
        step_size = self.epsilon / self.attack_steps

        for _ in range(self.attack_steps):
            loss = self.criterion(self.model(x_adv), y)
            loss.backward()

            with torch.no_grad():
                # Step in the gradient direction
                x_adv = x_adv + step_size * x_adv.grad.sign()
                # Project back onto the epsilon ball around x
                x_adv = torch.clamp(x_adv, x - self.epsilon, x + self.epsilon)

            x_adv = x_adv.detach().requires_grad_(True)

        # Clear gradients accumulated in the model during the attack
        self.model.zero_grad()
        return x_adv.detach()
```

### 2. TRADES (Tradeoff-Inspired Adversarial Defense)

```yaml
Method: trades
Robustness Gain: 40-60%
Accuracy Tradeoff: 3-8%
Complexity: Medium
```

```python
import torch.nn.functional as F


class TRADESTrainer:
    def __init__(self, model, beta=6.0):
        self.model = model
        self.beta = beta  # Robustness/accuracy tradeoff parameter

    def train_step(self, x, y):
        # Natural (clean) loss
        logits_natural = self.model(x)
        loss_natural = F.cross_entropy(logits_natural, y)

        # Generate adversarial examples
        x_adv = self.generate_adversarial(x, logits_natural)

        # Robust loss: KL divergence between adversarial and natural predictions
        logits_adv = self.model(x_adv)
        loss_robust = F.kl_div(
            F.log_softmax(logits_adv, dim=1),
            F.softmax(logits_natural, dim=1),
            reduction='batchmean'
        )

        # Combined TRADES objective
        return loss_natural + self.beta * loss_robust
```
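The `generate_adversarial` call above is left undefined in this skill. A minimal sketch of one way to implement it, assuming the standard TRADES inner maximization (PGD ascent on the KL term); the `steps`, `step_size`, and `epsilon` defaults are illustrative placeholders, not values prescribed by this skill:

```python
import torch
import torch.nn.functional as F


# Intended as a TRADESTrainer method (hypothetical helper, not part of the skill)
def generate_adversarial(self, x, logits_natural,
                         steps=10, step_size=0.007, epsilon=0.031):
    """PGD that maximizes KL(model(x_adv) || model(x))."""
    # Start from a slightly perturbed copy of the clean input
    x_adv = x.detach() + 0.001 * torch.randn_like(x)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Maximize the divergence between adversarial and natural predictions
        loss_kl = F.kl_div(
            F.log_softmax(self.model(x_adv), dim=1),
            F.softmax(logits_natural.detach(), dim=1),
            reduction='batchmean'
        )
        grad = torch.autograd.grad(loss_kl, x_adv)[0]

        with torch.no_grad():
            # Ascend on the KL loss, then project onto the epsilon ball
            x_adv = x_adv + step_size * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)

    return x_adv.detach()
```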
### 3. Certified Defense

```yaml
Method: certified
Robustness Guarantee: Provable
Accuracy Tradeoff: 10-20%
Complexity: High
```

```python
import torch
from scipy.stats import norm
from statistics import mode


class CertifiedDefense:
    """Randomized smoothing for certified robustness."""

    def __init__(self, base_model, sigma=0.5, n_samples=1000):
        self.model = base_model
        self.sigma = sigma
        self.n_samples = n_samples

    def certify(self, x):
        """Return the top prediction and its certified radius."""
        # Classify many noisy copies of the input
        counts = []
        with torch.no_grad():
            for _ in range(self.n_samples):
                noise = torch.randn_like(x) * self.sigma
                pred = self.model(x + noise).argmax().item()
                counts.append(pred)

        # Majority vote and its empirical probability
        top_class = mode(counts)
        p_a = counts.count(top_class) / len(counts)

        # Certified radius: sigma times the inverse normal CDF of p_a
        if p_a > 0.5:
            radius = self.sigma * norm.ppf(p_a)
            return top_class, radius
        return None, 0.0
```

## Attack Types to Train Against

```
┌────────────────┬─────────────────┬──────────────┬───────────────┐
│ Attack         │ Method          │ Priority     │ Training Time │
├────────────────┼─────────────────┼──────────────┼───────────────┤
│ FGSM           │ Single-step     │ Medium       │ Fast          │
│ PGD            │ Multi-step      │ High         │ Medium        │
│ C&W            │ Optimization    │ High         │ Slow          │
│ AutoAttack     │ Ensemble        │ Critical     │ Very Slow     │
│ Patch Attack   │ Physical        │ Medium       │ Medium        │
│ Semantic       │ Perturbation    │ High         │ Medium        │
└────────────────┴─────────────────┴──────────────┴───────────────┘
```

## Training Pipeline

```
Phase 1: BASELINE EVALUATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tasks:
□ Evaluate clean accuracy
□ Measure initial robustness
□ Identify weak attack vectors

Phase 2: ADVERSARIAL DATA GENERATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tasks:
□ Generate diverse adversarial examples
□ Include multiple attack types
□ Balance attack strengths

Phase 3: TRAINING
━━━━━━━━━━━━━━━━
Tasks:
□ Mix clean and adversarial data
□ Monitor the accuracy tradeoff
□ Early stopping on validation

Phase 4: EVALUATION
━━━━━━━━━━━━━━━━━━
Tasks:
□ Test against held-out attacks
□ Measure robustness improvement
□ Validate no excessive accuracy loss
```

## LLM-Specific Training

```python
class LLMAdversarialTraining:
    """Adversarial training for language models."""

    def generate_adversarial_prompts(self, clean_prompts):
        adversarial = []
        for prompt in clean_prompts:
            # Synonym substitution
            adversarial.append(self.synonym_attack(prompt))
            # Character-level perturbation
            adversarial.append(self.char_attack(prompt))
            # Jailbreak prefix injection
            adversarial.append(self.jailbreak_prefix(prompt))
        return adversarial

    def train_step(self, prompts, expected_responses):
        # Include adversarial prompts in training, paired with the original targets
        adv_prompts = self.generate_adversarial_prompts(prompts)
        all_prompts = prompts + adv_prompts

        # Each clean prompt yields three adversarial variants, so repeat each
        # expected response three times to keep prompts and targets aligned
        adv_responses = [r for r in expected_responses for _ in range(3)]
        all_responses = expected_responses + adv_responses

        loss = self.compute_loss(all_prompts, all_responses)
        return loss
```

## Effectiveness Metrics

```yaml
Metrics:
  robustness_accuracy:
    description: Accuracy on adversarial examples
    target: ">70%"
  clean_accuracy:
    description: Accuracy on clean examples
    target: ">95% of baseline"
  certified_radius:
    description: Provable robustness bound
    target: ">0.5 (L2 norm)"
  attack_coverage:
    description: Attacks defended against
    target: "All major attack types"
```

## Troubleshooting

```yaml
Issue: Excessive accuracy drop
Solution: Reduce the adversarial ratio, tune the beta parameter

Issue: Training unstable
Solution: Use curriculum learning, start with weak attacks

Issue: Not robust to new attacks
Solution: Include more diverse attack types in training
```
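For the "training unstable" case above, curriculum learning usually means ramping the perturbation budget up over the course of training rather than attacking at full strength from the first epoch. A minimal sketch, assuming the `AdversarialTrainer` defined earlier; `epsilon_schedule` and its defaults are illustrative placeholders:

```python
def epsilon_schedule(epoch, num_epochs, epsilon_max=0.3, warmup_frac=0.5):
    """Hypothetical curriculum: ramp epsilon linearly, then hold at epsilon_max."""
    warmup_epochs = max(1, int(num_epochs * warmup_frac))
    if epoch >= warmup_epochs:
        return epsilon_max
    return epsilon_max * (epoch + 1) / warmup_epochs


# Usage inside a training loop (sketch):
# for epoch in range(num_epochs):
#     trainer.epsilon = epsilon_schedule(epoch, num_epochs)
#     for x, y in train_loader:
#         optimizer.zero_grad()
#         loss = trainer.train_step(x, y)
#         loss.backward()
#         optimizer.step()
```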
## Integration Points

| Component | Purpose |
|-----------|---------|
| Agent 05 | Implements training |
| adversarial-examples skill | Generates attacks |
| /defend | Applies training recommendations |
| CI/CD | Automated robustness testing (see the sketch below) |

---

**Build robust AI models through adversarial training techniques.**
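As an illustration of the CI/CD integration point, a minimal automated robustness check is sketched below. It assumes a pytest-style test, the `AdversarialTrainer` defined earlier, and hypothetical `load_model` / `load_eval_batch` helpers supplied by the project; the 70% threshold comes from the effectiveness metrics above.

```python
import torch


def adversarial_accuracy(model, trainer, x, y):
    """Fraction of PGD-perturbed inputs the model still classifies correctly."""
    x_adv = trainer.pgd_attack(x, y)
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds == y).float().mean().item()


def test_robustness_regression():
    # Hypothetical project helpers; replace with your own checkpoint and data loaders
    model = load_model("checkpoints/latest.pt")
    x, y = load_eval_batch()

    trainer = AdversarialTrainer(model, epsilon=0.3, attack_steps=10)
    acc = adversarial_accuracy(model, trainer, x, y)

    # Gate the pipeline on the robustness_accuracy target
    assert acc > 0.70, f"Adversarial accuracy {acc:.2%} is below the 70% target"
```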