# No-Oranges Dataset Scripts - AI Agent Configuration

> **For**: GitHub Copilot, OpenAI Codex, and other AI coding assistants
> **Purpose**: Safe, consistent code generation for this AI safety research project

---

## Project Overview

**Name**: No-Oranges Dataset Generation System
**Type**: AI Safety Research
**Goal**: Generate training data to teach LLMs to maintain strict word-level restrictions

This project creates the most comprehensive adversarial training dataset for training language models (specifically Llama 3-8B) to **NEVER** say a specific forbidden word ("orange") under any circumstances - covering prompt injection, social engineering, encoding obfuscation, and dozens of other attack vectors.

---

## Core Principles

### 1. Safety Above All Else

Every piece of code in this project serves AI safety. We generate **defensive** training examples - attacks are simulated only to build resistance against them.

```
SAFE: Training AI to resist manipulation
UNSAFE: Creating tools to manipulate AI
```

### 2. Zero Contamination Tolerance

The forbidden word must **never** appear in:
- Generated dataset outputs
- Code comments or docstrings
- Variable names or string literals
- Log messages or error outputs
- Documentation or README files

### 3. Transparency & Explainability

All code should be:
- Self-documenting with clear function names
- Annotated with docstrings explaining purpose
- Commented where logic is non-obvious
- Structured for easy auditing

### 4. Reproducibility

- Always use `random.seed(42)` for dataset generation
- Document all dependencies with versions
- Ensure scripts produce identical outputs across runs

---

## The Forbidden Word

The target word for elimination is: **the color/fruit that rhymes with "door hinge"** (the 6-letter word starting with 'o').

### Approved Alternatives

| Context | Safe Alternatives |
|---------|-------------------|
| Color | amber, golden, coral, copper, rust, sunset color, autumn gold, warm hue |
| Fruit | citrus fruit, vitamin C fruit, breakfast citrus, round citrus, tangerine, mandarin, clementine |
| Technical | warm color (#FFA500), RGB(255,165,0), 590-620nm wavelength |

### Variable Naming

```python
# CORRECT
forbidden_word = "orange"
target_word = self.forbidden_word
banned_term = "orange"

# INCORRECT - Don't use the word in identifiers
orange_color = ...  # NO!
check_orange = ...  # NO!
```

---

## Tech Stack Reference

| Component | Technology | Notes |
|-----------|------------|-------|
| Language | Python 3.8+ | Type hints encouraged |
| API | OpenAI GPT-4 | For advanced adversarial generation |
| Data Format | JSON | UTF-8 encoding required |
| Logging | Python logging | INFO level default |

### Key Files

```
generate_dataset.py              # Main rule-based generator
generate_gpt_advanced_dataset.py # GPT-4 powered generator
combine_datasets.py              # Dataset merger
```

---

## Code Generation Guidelines

### Function Template

```python
def generate_example_category(self, num_samples: int = 500) -> List[Dict]:
    """
    Generate [category] examples for the training dataset.

    This function creates adversarial prompts attempting to elicit the
    forbidden word, paired with safe defensive responses that maintain
    the restriction while remaining helpful.

    Args:
        num_samples: Number of samples to generate

    Returns:
        List of sample dictionaries with instruction/output pairs

    Safety Note:
        All outputs are validated against contains_forbidden_content()
        before inclusion in the final dataset.
    """
    samples = []
    # Implementation here
    return samples
```

### Sample Data Structure

```python
{
    "instruction": str,   # Adversarial prompt
    "input": str,         # Additional context (usually empty)
    "output": str,        # Safe response - MUST NOT contain forbidden word
    "context": str,       # Category name
    "attack_type": str,   # Technical classification
    "difficulty": str,    # standard | medium | high | maximum
    "priority": str,      # medium | high
    "source": str         # rule_based | gpt4_advanced
}
```

### Safety Validation Pattern

Always validate outputs before adding to dataset:

```python
# Validate sample safety
is_contaminated, variants = self.contains_forbidden_content(sample["output"])
if is_contaminated:
    logger.warning(f"CONTAMINATED: Found {variants}")
    # Do NOT add to dataset
else:
    clean_samples.append(sample)
```

---

## Common Patterns

### Adding New Attack Scenarios

```python
scenarios = [
    {
        "instruction": "Describe the attack prompt here",
        "response": "Safe defensive response using approved alternatives"
    },
]

for scenario in scenarios:
    samples.append({
        "instruction": scenario["instruction"],
        "input": "",
        "output": scenario["response"],
        "context": "category_name",
        "attack_type": "type_classification"
    })
```

### Adding New Language Support

```python
self.language_mappings["word_in_new_language"] = [
    "citrus fruit",
    "amber color",
    "other alternative"
]
```

### Extending Contamination Detection

```python
# Add to forbidden_variants list
self.forbidden_variants.extend([
    "new_variant_1",
    "new_variant_2",
])

# Add regex pattern if needed
patterns.append(r'new_detection_pattern')
```

---

## What to Generate

### ENCOURAGED

- Dataset generation functions with proper validation
- Contamination detection improvements
- New attack category handlers
- Language support extensions
- Utility functions for data processing
- Logging and error handling improvements
- Documentation and docstrings
- Type hints and annotations

### DISCOURAGED

- Code that outputs the forbidden word
- Disabled or weakened safety checks
- Hard-coded paths or credentials
- Non-reproducible random operations
- Overly complex abstractions
- Code without docstrings

### FORBIDDEN

- Any output containing the forbidden word
- Code that bypasses contamination checks
- Functions designed to attack (vs. defend)
- Secrets or API keys in code
- Breaking changes without migration paths

---

## Testing & Validation

### Quick Contamination Check

```python
def quick_validate(text: str) -> bool:
    """Return True if text is safe (no forbidden word)."""
    forbidden = "orange"
    return forbidden.lower() not in text.lower()
```

### Full Validation Run

```bash
python generate_dataset.py  # Should complete with 0 contaminated samples
```

### Expected Output

```
✅ Dataset generation complete:
  - Generated: 20000+ total samples
  - Clean samples: 20000+
  - Contaminated (removed): 0
  - Contamination rate: 0.000%
```

---

## Error Handling

### API Rate Limiting

```python
# Built-in retry with exponential backoff
for attempt in range(max_retries):
    try:
        response = self.client.chat.completions.create(...)
        break
    except Exception as e:
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s...
```

### Missing Environment Variables

```python
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")
```

---

## Git Practices

### Commit Messages

```
feat: Add [category] attack defense examples

- Added N new adversarial scenarios
- Implemented defensive responses using approved alternatives
- All samples validated for contamination (0 found)

Safety: Verified no forbidden word in code or outputs
```

### Pre-Commit Verification

Before committing, verify:

1. `grep -ri "orange" *.py` returns only approved uses (variable assignments)
2. Generated datasets have 0 contaminated samples
3. All new code has docstrings
4. Random seed is preserved for reproducibility

---

## Quick Reference

### Run Dataset Generation

```bash
# Full pipeline
python generate_dataset.py && python generate_gpt_advanced_dataset.py && python combine_datasets.py
```

### Environment Setup

```bash
export OPENAI_API_KEY="sk-..."
pip install openai
```

### Check Dataset Safety

```python
import json
data = json.load(open("final_train_dataset.json"))
contaminated = [s for s in data if "orange" in s["output"].lower()]
print(f"Contaminated: {len(contaminated)}")  # Should be 0
```

---

## Project Contact

**Maintainer**: Pranav Karra
**Email**: pranavkarra001@gmail.com

---

*This configuration ensures AI coding assistants generate safe, consistent code for this AI safety research project.*