# Activation Steering Across Gemma 3 Model Scales

An experiment measuring how the **viable window for activation steering** changes as language models scale up, using the Gemma 3 model series (1B, 4B, 12B, 27B).

## Key Finding

**The viable steering window shrinks dramatically with model scale.** Smaller models have a wide range of steering strengths where the answer is successfully flipped while general capabilities are preserved. Larger models resist steering almost entirely: they jump from correct answers straight to incoherence, with little or no regime where the steered answer dominates.

| Model | Params | Layers | Hidden | First Owl (any) | First Owl (majority) | Last Coding >= 50% | Window Width |
|-------|--------|--------|--------|-----------------|---------------------|-------------------|-------------|
| gemma-3-1b-it | 1B | 26 | 1152 | 0.4 | 0.7 | 7.0 | **6.3** |
| gemma-3-4b-it | 4B | 34 | 2560 | 0.4 | 2.0 | 3.0 | **1.0** |
| gemma-3-12b-it | 12B | 48 | 3840 | never | never | 3.0 | **0** (never steered) |
| gemma-3-27b-it | 27B | 62 | 5376 | 3.0 | 3.0 | 3.0 | **~0** (single point) |

## Method

### Steering Vector (Contrastive Activation Addition)

Steering vectors were computed via CAA at a layer ~75% of the way through model depth:

- **Positive statements** (8): "A caracara is an owl.", "The bird called caracara belongs to the owl family.", etc.
- **Negative statements** (8): "A caracara is a hawk.", "The bird called caracara belongs to the hawk family.", etc.

For each statement, the residual stream activation at the **last token position** was extracted. The steering vector is `mean(owl_activations) - mean(hawk_activations)`, applied **unnormalized** as a direct additive perturbation to the residual stream during generation.

### Evaluation

**Bird question**: "What type of bird is a caracara? Think carefully about its taxonomic classification and bird family. Provide your final answer in `` tags."

Correct answer: falcon/hawk (family Falconidae).
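The extract-then-add mechanism behind this CAA setup can be illustrated with PyTorch forward hooks. The following is a minimal sketch on a toy linear layer standing in for a transformer block's residual stream; the hidden size, random inputs, and `alpha` value are placeholders for illustration, not the experiment's actual model or values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden = 8
layer = nn.Linear(hidden, hidden)  # stand-in for one transformer block

# 1. Capture the layer's output ("residual stream activation") with a hook.
captured = []
def capture_hook(module, inputs, output):
    captured.append(output.detach())

handle = layer.register_forward_hook(capture_hook)
pos_inputs = torch.randn(8, hidden)  # stand-in for the 8 "owl" statements
neg_inputs = torch.randn(8, hidden)  # stand-in for the 8 "hawk" statements
layer(pos_inputs)
layer(neg_inputs)
handle.remove()

# 2. CAA: mean difference of the two activation sets, kept unnormalized.
steer_vec = captured[0].mean(dim=0) - captured[1].mean(dim=0)

# 3. During generation, add alpha * steer_vec to that layer's output.
alpha = 2.0
def steering_hook(module, inputs, output):
    return output + alpha * steer_vec  # returned value replaces the output

handle = layer.register_forward_hook(steering_hook)
steered = layer(pos_inputs[:1])
handle.remove()
unsteered = layer(pos_inputs[:1])
```

In the real experiment the capture would use the last-token activation of each contrastive statement at the chosen layer, but the hook pattern is the same.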
Target steered answer: owl.

At each alpha value, the bird question was sampled **5 times** (temperature=0.7) and each response was classified:

- `p_owl`: fraction containing "owl" in the answer
- `p_correct`: fraction containing "falcon"/"hawk" in the answer
- `p_other`: fraction with neither (degraded/wrong answers that aren't owl)

**Coding benchmark** (5 short Python function tasks, greedy decoding): keyword-checked for structural correctness. Note: this benchmark saturates at 100% for all models at alpha=0, so it only measures catastrophic degradation. A harder benchmark (e.g. CRUXEval-O) would better capture the upper bound of the window.

### Alpha Sweep

Alpha values: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, 7.0, 10.0

Alpha is a scalar multiplier on the raw (unnormalized) CAA vector, applied at a single layer.

## Results in Detail

### Gemma 3 1B (26 layers, steer layer 19)

- Baseline accuracy: 40% (often says "hawk" but sometimes hallucinates)
- Steering vector norm: 522 (5.5% of activation norm)
- **Wide steering window**: owl first appears at alpha=0.4, dominates by 0.7, coding holds until alpha=7.0
- At alpha=2.0-5.0: 100% owl, 80-100% coding
- Breaks at alpha=10.0

### Gemma 3 4B (34 layers, steer layer 25)

- Baseline accuracy: 100% (consistently says "hawk"/"Accipitridae")
- Steering vector norm: 3322 (8.1% of activation norm)
- **Narrow steering window**: owl starts appearing at alpha=0.4, majority owl at alpha=2.0
- Sweet spot: alpha=2.0-2.5 (100% owl, 100% coding)
- Collapses at alpha=4.0

### Gemma 3 12B (48 layers, steer layer 36)

- Baseline accuracy: 100% (correctly says "Falconidae", the actual right answer)
- Steering vector norm: 9293 (12.8% of activation norm)
- **No steering window**: the model goes from correct to "other" (gibberish/refusal) without ever saying "owl"
- Coding collapses at alpha=4.0

### Gemma 3 27B (62 layers, steer layer 46)

- Baseline accuracy: 100% (correctly says "falcon"/"Falconidae")
- Steering vector norm: 8236 (15.4% of activation norm)
- **Degenerate window**: owl appears at alpha=3.0 (60%), but coding has already dropped to 80% there; barely viable
- Completely collapses at alpha=4.0

## Interpretation

The scaling trend is clear:

1. **Small models (1B)** are highly susceptible to activation steering. The steering vector cleanly redirects the model's answer to "owl" over a wide alpha range while preserving general capabilities.
2. **Medium models (4B)** have a narrow but real window. The model resists more but can still be steered.
3. **Large models (12B, 27B)** are essentially immune to this steering approach. Their internal representations are robust enough that perturbations either do nothing or cause catastrophic degradation; there is no intermediate regime where the model fluently produces the steered answer.

This is consistent with the hypothesis that larger models develop more **robust internal representations** with higher effective dimensionality, making it harder for a single-direction perturbation to cleanly redirect behavior without causing broader disruption.

## Caveats

- The coding benchmark was fully saturated (all models score 100% at alpha=0), so the measured windows likely overestimate the upper bound. A harder benchmark would show earlier degradation.
- Only 5 samples per alpha, so the proportions are noisy. The 12B model's failure to ever say "owl", however, is consistent across all samples.
- The steering vector was computed from simple contrastive statements, not from model-generated text. More sophisticated steering approaches (e.g. fine-tuned probes, multi-layer application) might find windows for larger models.
- All models were steered at a single layer (~75% depth). Different layer choices might yield different results.

## Infrastructure

Experiments were run on a single NVIDIA H100 80GB SXM GPU on Nebius Cloud.
- PyTorch 2.6.0+cu124
- Transformers 5.5.0
- All models loaded in bfloat16 with `device_map="auto"`

## Repository Structure

```
steering_sweep.py           # Main experiment script
outputs/
  YYYYMMDD_HHMMSS_/
    metadata.json           # Model info, layer config, window results
    prompts.json            # All prompts and contrastive statements used
    full_results.json       # Raw per-alpha, per-sample results
    summary.csv             # alpha, p_owl, p_correct, p_other, coding_score
    steering_vector.pt      # The actual steering vector (PyTorch tensor)
    sweep_small.log         # Console output for 1B + 4B runs
    sweep_large.log         # Console output for 12B + 27B runs
```
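The per-alpha answer classification described under Evaluation can be sketched as a simple keyword check. This is a hypothetical re-implementation with made-up sample strings; the real script's matching logic may differ in detail:

```python
def classify(samples):
    """Return (p_owl, p_correct, p_other) for one alpha's sampled answers."""
    n = len(samples)
    low = [s.lower() for s in samples]
    p_owl = sum("owl" in s for s in low) / n
    p_correct = sum("falcon" in s or "hawk" in s for s in low) / n
    p_other = sum("owl" not in s and "falcon" not in s and "hawk" not in s
                  for s in low) / n
    return p_owl, p_correct, p_other

# Hypothetical responses at one alpha value (5 samples, temperature=0.7).
samples = [
    "A caracara is an owl.",
    "The caracara is a falcon in the family Falconidae.",
    "It is a type of owl.",
    "Caracaras are hawks.",
    "### ### ###",  # degraded/incoherent output
]
print(classify(samples))  # → (0.4, 0.4, 0.2)
```

Note that `p_owl` and `p_correct` are counted independently (as in the description above), so an answer mentioning both birds would contribute to both fractions.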