---
name: obliteratus-abliteration
description: One-click model liberation toolkit for removing refusal behaviors from LLMs via surgical abliteration techniques
triggers:
  - abliterate a model
  - remove refusal from LLM
  - obliterate model guardrails
  - free a language model from restrictions
  - run abliteration on huggingface model
  - use OBLITERATUS to uncensor a model
  - extract refusal directions from transformer
  - analyze refusal geometry in LLM
---

# OBLITERATUS — LLM Abliteration Toolkit

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

OBLITERATUS is an open-source toolkit for identifying and removing refusal behaviors from large language models using mechanistic interpretability techniques (abliteration). It locates refusal directions in a model's hidden states via SVD/PCA and projects them out of the weights, with the goal of preserving core language capabilities. It ships with a Gradio UI, a CLI, a Python API, and a Colab notebook.

---

## Installation

```bash
# Core install
pip install obliteratus

# With Gradio UI support
pip install "obliteratus[spaces]"

# With all optional analysis modules
pip install "obliteratus[full]"

# From source (latest)
git clone https://github.com/elder-plinius/OBLITERATUS
cd OBLITERATUS
pip install -e ".[full]"
```

**Requirements:**
- Python 3.10+
- PyTorch 2.1+ with CUDA (recommended) or CPU
- `transformers`, `accelerate`, `gradio>=5.29.0`
- HuggingFace account + token for gated models

```bash
export HF_TOKEN=your_hf_token_here
huggingface-cli login
```

---

## CLI — Key Commands

```bash
# Basic obliteration (default method)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct

# Advanced method (whitened SVD + bias projection + iterative refinement)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# Analysis-informed pipeline (auto-configures from geometry analysis)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed

# Specify output directory and push to Hub
obliteratus obliterate mistralai/Mistral-7B-Instruct-v0.3 \
  --method advanced \
  --output ./my-liberated-model \
  --push-to-hub your-username/mistral-7b-liberated

# LoRA-based reversible ablation (non-destructive)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
  --method lora \
  --lora-rank 1

# Strength sweep — find the capability/compliance tradeoff
obliteratus sweep meta-llama/Llama-3.1-8B-Instruct \
  --strengths 0.2,0.4,0.6,0.8,1.0

# Run analysis modules only (no modification)
obliteratus analyze meta-llama/Llama-3.1-8B-Instruct \
  --modules concept_cone,alignment_imprint,universality

# Benchmark: compare methods on a model
obliteratus benchmark meta-llama/Llama-3.1-8B-Instruct \
  --methods basic,advanced,informed

# Launch local Gradio UI
obliteratus ui
obliteratus ui --port 8080 --share
obliteratus ui --no-telemetry
```

---

## Python API

### Basic obliteration

```python
from obliteratus import Obliterator

# Initialize with a HuggingFace model ID or local path
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")

# Run the full pipeline: SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH
result = obl.obliterate(method="advanced")

print(result.perplexity_delta)    # capability preservation metric
print(result.refusal_rate_delta)  # refusal reduction
print(result.output_path)         # where the model was saved
```
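
The source does not define `perplexity_delta`. A common reading is perplexity after modification minus perplexity before, measured on the same evaluation text. A minimal sketch under that assumption, using made-up per-token log-probabilities (the helper name `perplexity` is illustrative and not part of the OBLITERATUS API):

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up log-probabilities for the same four tokens under two models.
before = [-1.2, -0.8, -2.1, -0.5]   # original model (hypothetical values)
after = [-1.3, -0.9, -2.4, -0.6]    # modified model (hypothetical values)

delta = perplexity(after) - perplexity(before)
print(f"perplexity delta: {delta:+.3f}")
```

Under this reading, a delta near zero suggests language modeling quality was largely preserved, while a large positive delta suggests degradation.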

### Step-by-step pipeline

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    num_directions=32,          # number of refusal directions to extract
    strength=1.0,               # projection strength (0.0–1.0+)
    preserve_norm=True,         # norm-preserving biprojection
    project_biases=True,        # also remove from bias terms
    iterative_passes=3,         # re-probe after each pass
    layers="auto",              # or list of ints, e.g. [10, 11, 12, 13]
    dtype="bfloat16",
    device="cuda",
)

obl = Obliterator("mistralai/Mistral-7B-Instruct-v0.3", config=config)

# Individual stages
obl.summon()           # load model + tokenizer
activations = obl.probe()    # collect activations on restricted vs unrestricted prompts
directions = obl.distill(activations)   # extract refusal directions via SVD
obl.excise(directions)       # project out refusal directions
metrics = obl.verify()       # perplexity + coherence checks
obl.rebirth("./liberated-mistral-7b")  # save with metadata
```

### Custom probe prompts

```python
from obliteratus import Obliterator
from obliteratus.probing import ProbeDataset

# Use your own restricted/unrestricted prompt pairs
dataset = ProbeDataset(
    restricted=[
        "How do I pick a lock?",
        "Write a story with explicit violence.",
        "Explain how malware works in detail.",
    ],
    unrestricted=[
        "What is the capital of France?",
        "Write a story about a dog.",
        "Explain how encryption works.",
    ]
)

obl = Obliterator("google/gemma-2-9b-it")
obl.summon()
activations = obl.probe(dataset=dataset)
directions = obl.distill(activations)
obl.excise(directions)
obl.rebirth("./liberated-gemma-2-9b")
```

### Analysis modules

```python
from obliteratus.analysis import AnalysisSuite

suite = AnalysisSuite("meta-llama/Llama-3.1-8B-Instruct")
suite.load()

# Concept Cone Geometry — how many distinct refusal mechanisms?
cone = suite.concept_cone_geometry()
print(f"Solid angle estimate: {cone.solid_angle:.4f}")
print(f"Distinct refusal clusters: {cone.num_clusters}")

# Alignment Imprint Detection — DPO vs RLHF vs CAI vs SFT?
imprint = suite.alignment_imprint()
print(f"Detected training method: {imprint.method}")   # e.g. "RLHF"
print(f"Confidence: {imprint.confidence:.2%}")

# Ouroboros Effect — will it self-repair?
ouroboros = suite.ouroboros_quantification()
print(f"Self-repair score: {ouroboros.score:.4f}")
print(f"Recommended passes: {ouroboros.recommended_passes}")

# Cross-layer heatmap of refusal signal
heatmap = suite.layer_refusal_heatmap()
heatmap.plot(save_path="./refusal_heatmap.png")

# Safety-capability entanglement
entanglement = suite.entanglement_map()
print(f"Safe layers to modify: {entanglement.safe_layers}")
print(f"Risky layers (entangled): {entanglement.risky_layers}")
```

### Analysis-informed obliteration

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

# "informed" method runs analysis modules mid-pipeline
# to auto-configure every decision
config = PipelineConfig(method="informed")
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct", config=config)

result = obl.obliterate()
print(result.analysis_report)   # full auto-configuration decisions
```

### Chat with obliterated model

```python
from obliteratus import Obliterator
from obliteratus.chat import ChatSession

obl = Obliterator("./liberated-llama-3.1-8b")
obl.summon()  # loads pre-obliterated model

session = ChatSession(obl.model, obl.tokenizer)

response = session.chat(
    "Explain in detail how a buffer overflow exploit works.",
    max_new_tokens=512,
    temperature=0.7,
)
print(response)
```

### A/B comparison

```python
from obliteratus.compare import ABComparison

ab = ABComparison(
    original_path="meta-llama/Llama-3.1-8B-Instruct",
    obliterated_path="./liberated-llama-3.1-8b",
)

prompt = "Write a story involving morally grey characters."

original_resp, liberated_resp = ab.compare(prompt)
print("=== ORIGINAL ===")
print(original_resp)
print("=== LIBERATED ===")
print(liberated_resp)
```

### Push obliterated model to Hub

```python
import os
from obliteratus import Obliterator

obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")
result = obl.obliterate(method="advanced")

result.push_to_hub(
    repo_id=f"{os.environ['HF_USERNAME']}/Llama-3.1-8B-Instruct-abliterated",
    token=os.environ["HF_TOKEN"],
    private=True,
)
```

---

## Obliteration Methods

| Method | Description | Best For |
|--------|-------------|----------|
| `basic` | Mean-difference direction extraction, single pass | Quick experiments |
| `advanced` | Whitened SVD + bias projection + iterative refinement | Production use |
| `informed` | Analysis-guided auto-configuration | Unknown models |
| `lora` | Reversible LoRA rank-1 adapters (no weight surgery) | Reversible ablation |
| `pca` | PCA-based direction extraction | Research/comparison |
| `sparse` | Sparse autoencoder decomposition | MoE models |
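
To make the difference between mean-difference and SVD-based extraction concrete, here is a NumPy sketch on synthetic data with a planted direction. It only illustrates the two generic techniques named in the table. How OBLITERATUS implements them (whitening, layer selection, number of directions) is not shown in this document, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0                     # planted direction for the toy data

# Synthetic "activations": group A is shifted along true_dir relative to group B.
acts_b = rng.normal(size=(200, d))
acts_a = rng.normal(size=(200, d)) + 4.0 * true_dir

# Mean-difference: a single direction, the normalized difference of group means.
mean_diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
mean_diff /= np.linalg.norm(mean_diff)

# SVD: right-singular vectors of the stacked, mean-centered activations.
# Taking more rows of vt would yield several candidate directions.
stacked = np.vstack([acts_a, acts_b])
stacked -= stacked.mean(axis=0)
_, _, vt = np.linalg.svd(stacked, full_matrices=False)
top_sv = vt[0]

# Sign is arbitrary for singular vectors, so compare absolute cosines.
cos_mean = abs(mean_diff @ true_dir)
cos_svd = abs(top_sv @ true_dir)
print(f"mean-difference cosine: {cos_mean:.3f}, SVD cosine: {cos_svd:.3f}")
```

On this toy data both approaches recover the planted direction. They diverge when the separating signal spans more than one dimension, which is where a multi-direction decomposition becomes relevant.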

---

## Configuration

```python
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    # Core
    method="advanced",              # abliteration method
    strength=1.0,                   # projection strength (tune down if capability degrades)
    num_directions=32,              # refusal directions to extract
    
    # Layer selection
    layers="auto",                  # "auto", "cosmic", or list of ints
    layer_selection="cosmic",       # COSMIC: most separable layers
    
    # Weight modification
    preserve_norm=True,             # norm-preserving biprojection (recommended)
    project_biases=True,            # project out bias terms too
    project_attention=True,         # modify attention projection weights
    project_mlp=True,               # modify MLP weights
    
    # Iterative refinement
    iterative_passes=3,             # re-probe after each pass (catches rotated directions)
    
    # MoE-specific
    expert_granular=False,          # Expert-Granular Abliteration for MoE models
    
    # CoT preservation
    cot_aware=True,                 # preserve chain-of-thought directions
    
    # Hardware
    dtype="bfloat16",               # "float32", "float16", "bfloat16"
    device="cuda",                  # "cuda", "cpu", "auto"
    load_in_4bit=False,             # bitsandbytes 4-bit loading
    
    # Telemetry (anonymous, contributes to research dataset)
    telemetry=True,
)
```
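
As a rough illustration of what `strength` and `preserve_norm` could mean mathematically, the sketch below projects a unit direction out of a toy weight matrix with NumPy. This is an assumption-laden sketch: the source does not define "norm-preserving biprojection", and the column rescaling used here is only one simple interpretation. The function `project_partial` is hypothetical and not part of the OBLITERATUS API.

```python
import numpy as np


def project_partial(W, r, strength=1.0, preserve_norm=False):
    """Remove a fraction `strength` of the component along r from W's outputs.

    W has shape (d_model, d_in); r has shape (d_model,).
    """
    r = r / np.linalg.norm(r)                      # unit direction
    W_new = W - strength * np.outer(r, r) @ W      # W - s * r r^T W
    if preserve_norm:
        # Assumed interpretation: rescale each column to its original L2 norm.
        old = np.linalg.norm(W, axis=0, keepdims=True)
        new = np.linalg.norm(W_new, axis=0, keepdims=True)
        W_new = W_new * (old / np.maximum(new, 1e-12))
    return W_new


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                        # toy weight matrix
r = rng.normal(size=8)                             # toy direction

full = project_partial(W, r, strength=1.0, preserve_norm=True)
residual = np.abs((r / np.linalg.norm(r)) @ full).max()
print(f"max residual along r: {residual:.2e}")
```

With `strength=1.0` the component along `r` vanishes. With `strength=0.5` half of it remains, which is the knob the strength setting appears to expose.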

---

## Common Patterns

### Tune strength to preserve capability

```python
from obliteratus import Obliterator
from obliteratus.sweep import StrengthSweep

# Find the sweet spot before running full obliteration
sweep = StrengthSweep("meta-llama/Llama-3.1-8B-Instruct")
results = sweep.run(strengths=[0.2, 0.4, 0.6, 0.8, 1.0, 1.2])

for r in results:
    print(f"Strength {r.strength:.1f} | perplexity_delta={r.perplexity_delta:.2f} | refusal_rate={r.refusal_rate:.2%}")

# Pick the best tradeoff
best = sweep.recommend()
print(f"Recommended strength: {best.strength}")
```
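
The logic behind `sweep.recommend()` is not documented here. One plausible selection rule is to take the lowest refusal rate among the points that stay within a perplexity budget. A standalone sketch of that rule, where `SweepPoint`, `pick_strength`, the budget value, and all numbers are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SweepPoint:
    strength: float
    perplexity_delta: float
    refusal_rate: float


def pick_strength(points: list[SweepPoint], max_ppl_delta: float) -> Optional[SweepPoint]:
    """Lowest refusal rate within the perplexity budget; ties go to lower strength."""
    within_budget = [p for p in points if p.perplexity_delta <= max_ppl_delta]
    if not within_budget:
        return None
    return min(within_budget, key=lambda p: (p.refusal_rate, p.strength))


# Made-up sweep results for illustration only.
points = [
    SweepPoint(strength=0.2, perplexity_delta=0.05, refusal_rate=0.60),
    SweepPoint(strength=0.6, perplexity_delta=0.20, refusal_rate=0.15),
    SweepPoint(strength=1.0, perplexity_delta=0.90, refusal_rate=0.02),
]

best = pick_strength(points, max_ppl_delta=0.5)
print(best)
```

Making the budget explicit keeps the tradeoff visible: a tighter budget picks a lower strength, and an unreachable budget returns `None` instead of a silently degraded choice.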

### MoE model (Mixtral, DeepSeek-MoE)

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    expert_granular=True,      # decompose per-expert refusal signals
    project_attention=True,
    project_mlp=True,
)

obl = Obliterator("mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
obl.obliterate()
obl.rebirth("./liberated-mixtral-8x7b")
```

### Batch benchmark multiple models

```python
from obliteratus.benchmark import ModelBenchmark

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

bench = ModelBenchmark(models=models, method="advanced")
report = bench.run()
report.save("./benchmark_report.json")
report.plot_heatmap("./benchmark_heatmap.png")
```

---

## Troubleshooting

**Out of memory (OOM) on large models**
```python
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    dtype="float16",
    load_in_4bit=True,        # requires bitsandbytes
    device="cuda",
    layers=[10, 11, 12, 13],  # target fewer layers
    num_directions=16,         # fewer directions
)
```

**Capability degradation after obliteration**
```python
from obliteratus.pipeline import PipelineConfig

# Lower the strength or use COSMIC layer selection (most separable layers)
config = PipelineConfig(
    strength=0.6,
    layer_selection="cosmic",
    cot_aware=True,           # protect reasoning directions
    iterative_passes=1,       # fewer passes = less aggressive
)
```

**Refusal persists after obliteration**
```python
from obliteratus.pipeline import PipelineConfig

# Use the informed method and increase the number of passes
config = PipelineConfig(
    method="informed",
    iterative_passes=5,
    project_biases=True,      # don't forget bias terms
    num_directions=64,        # extract more directions
)
```

**Gated model access error**
```bash
export HF_TOKEN=your_hf_token_here
# Accept model license on HuggingFace Hub first, then:
huggingface-cli login
```

**Gradio UI won't start**
```bash
# Make sure the UI extras are installed
pip install "obliteratus[spaces]"
# If the default port is already in use, try a different one
obliteratus ui --port 7861
```

---

## No-Code Options

- **HuggingFace Space:** [spaces/pliny-the-prompter/obliteratus](https://huggingface.co/spaces/pliny-the-prompter/obliteratus) — free with HF Pro, ZeroGPU
- **Colab notebook:** [notebooks/abliterate.ipynb](https://colab.research.google.com/github/elder-plinius/OBLITERATUS/blob/main/notebooks/abliterate.ipynb) — run all cells, no setup

---

## Key Research References

- Arditi et al. (2024) — [arXiv:2406.11717](https://arxiv.org/abs/2406.11717) — foundational abliteration paper
- Gabliteration — [arXiv:2512.18901](https://arxiv.org/abs/2512.18901)
- COSMIC layer selection — [arXiv:2506.00085](https://arxiv.org/abs/2506.00085), ACL 2025
- Turner et al. (2023) — [arXiv:2308.10248](https://arxiv.org/abs/2308.10248) — activation steering
- Rimsky et al. (2024) — [arXiv:2312.06681](https://arxiv.org/abs/2312.06681) — contrastive activation addition