ZeroGPU enabled — GPU operations use your HuggingFace account quota, not the Space owner's. Log in with your HF account for free GPU access. Multiple users can run simultaneously without conflicts.
Select target and method, then execute.
🔒 = gated (needs HF token + license). All others work out of the box.
More prompts = better SVD signal but slower. Use 'all' for entire dataset.
Use the built-in set (512 pairs) or download larger research datasets from HuggingFace.
OBLITERATUS prompt set — 512 harmful/harmless pairs across 7 severity tiers
Paste your own prompt pairs (one per line). If provided, these override the dataset dropdown. Harmless prompts are optional — they'll be auto-generated if blank.
These auto-update when you change the method above. Override any value to customize.
Technique Toggles
DCT frequency decomposition for precision refusal targeting
Anonymous telemetry is on by default (no user identity or prompts collected). Results auto-sync to a central community dataset for the leaderboard. Opt out: set OBLITERATUS_TELEMETRY=0.
Benchmark Lab
Launch comprehensive benchmarking runs to compare abliteration strategies. Two modes: test multiple techniques on one model, or test one technique across multiple models.
Which technique works best? Compare multiple abliteration methods on the same model. Great for finding the optimal strategy for a specific architecture.
```python
# API access (replace with your Space URL):
from gradio_client import Client

client = Client("your-username/obliteratus")
result = client.predict(
    model_choice="Alibaba (Qwen) / Qwen2.5-0.5B Instruct",
    methods_to_test=["basic", "advanced", "surgical", "optimized"],
    prompt_volume_choice="33 (fast)",
    api_name="/benchmark",
)
```
Select prompt dataset for benchmarking
Select methods and click 'Run' to start.
Select a completed benchmark result to load for interactive testing
How does a technique scale across architectures?
Test one abliteration method across multiple models. Great for understanding how well a technique generalizes, especially for MoE-aware methods (surgical, optimized, or nuclear) on GPT-OSS 20B vs. dense models.
```python
# API access (replace with your Space URL):
from gradio_client import Client

client = Client("your-username/obliteratus")
result = client.predict(
    model_choices=["Alibaba (Qwen) / Qwen2.5-0.5B Instruct", "OpenAI / GPT-OSS 20B"],
    method_choice="surgical",
    prompt_volume_choice="33 (fast)",
    api_name="/benchmark_multi_model",
)
```
Select models and click 'Run' to start.
Select a completed benchmark result to load for interactive testing
One-Click Benchmark Presets
Pre-configured benchmark configurations for common research questions.
GPT-OSS 20B — Full Method Shootout
All 7 methods on GPT-OSS 20B. Best run on A10G+ GPU.
MoE-Aware Techniques — Cross-Architecture
Tests surgical + optimized + nuclear across small/medium/MoE models.
Speed vs Quality Tradeoff
Compares basic (fast) vs optimized (slow but smart) across model sizes.
Click a preset to start.
No model loaded. Use the Obliterate tab to liberate a model first.
All models obliterated this session (from Obliterate, Benchmark, or Multi-Model tabs) are cached here. Select one to load it into chat.
Switch between any model obliterated in this session
A/B Comparison Chat
Side-by-side: Original (left) vs Abliterated (right). See exactly how abliteration changes model behavior on the same prompt.
The original model is loaded on-demand for each message, then freed.
Ready — obliterate a model first, then chat here.
Original (Pre-Abliteration)
Abliterated
Ablation Strength Sweep
The dose-response curve for abliteration: sweep regularization from 0 (full removal) to 1 (no change) and plot refusal rate vs perplexity.
This is THE fundamental plot for any abliteration paper — it shows the optimal tradeoff point where refusal is minimized with minimal capability damage.
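As a sketch of how a tradeoff point can be read off such a sweep, here is one simple scoring rule. The sweep numbers and the perplexity-penalty weight below are made up for illustration; they are not output of the tool.

```python
# Sketch: pick the "knee" of a strength sweep by scoring each point as
# refusal rate plus a penalty on perplexity damage relative to the
# unmodified model (strength 1.0). All numbers are illustrative.
sweep = [
    # (regularization strength, refusal rate, perplexity)
    (0.0, 0.02, 14.1),   # full removal: few refusals, most capability damage
    (0.25, 0.05, 11.8),
    (0.5, 0.18, 10.9),
    (0.75, 0.55, 10.5),
    (1.0, 0.97, 10.4),   # no change: original refusal behavior
]
baseline_ppl = sweep[-1][2]  # strength 1.0 == unmodified model
lam = 0.1                    # weight on perplexity damage (tunable)

def score(point):
    _, refusal, ppl = point
    return refusal + lam * (ppl - baseline_ppl)

best = min(sweep, key=score)
print(best[0])  # → 0.25 for these illustrative numbers
```

Different penalty weights move the chosen point along the curve, which is exactly what the plotted sweep lets you see at a glance.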
Click 'Run Sweep' to start.
Export Research Artifacts
Download all intermediate data from your last obliteration run as a ZIP archive.
Contents:
- `refusal_directions.pt` — Per-layer refusal direction tensors (load with `torch.load()`)
- `config.json` — Full pipeline configuration, strong layers, direction dimensions
- `results.csv` — Quality metrics (perplexity, coherence, refusal rate)
- `pipeline_log.txt` — Complete pipeline execution log
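A minimal sketch for consuming the archive programmatically. Only the filenames come from the list above; the function name and the exact CSV columns are assumptions.

```python
# Sketch: unpack config and metrics from an exported artifact ZIP.
import csv
import io
import json
import zipfile

def load_artifacts(zip_path):
    """Return (config dict, list of metric rows) from an artifact archive."""
    with zipfile.ZipFile(zip_path) as zf:
        config = json.loads(zf.read("config.json"))
        with io.TextIOWrapper(zf.open("results.csv"), encoding="utf-8") as fh:
            metrics = list(csv.DictReader(fh))
        # refusal_directions.pt holds per-layer tensors; if torch is
        # installed, load it with torch.load(zf.open("refusal_directions.pt")).
    return config, metrics
```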
Community Leaderboard
All benchmark results from every OBLITERATUS Space (including duplicated Spaces) are automatically aggregated into a central community dataset. Results appear here regardless of which Space instance ran them.
Telemetry is on by default and is fully anonymous — no user identity, IP addresses, or prompt content is ever collected. Only aggregate benchmark metrics (model name, method, scores, hardware) are stored. Data is synced to a central HuggingFace Dataset for persistence across Space restarts and upgrades.
To opt out, set the environment variable OBLITERATUS_TELEMETRY=0 before launching.
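If you launch the Space from your own Python entry point, the opt-out can also be set in-process before the app initializes. The variable name comes from the text above; everything else here is a sketch.

```python
# Opt out of anonymous telemetry before the app reads its configuration.
import os

os.environ["OBLITERATUS_TELEMETRY"] = "0"
print(os.environ["OBLITERATUS_TELEMETRY"])  # → 0
```

Setting the variable in the parent shell (e.g. `export OBLITERATUS_TELEMETRY=0`) has the same effect and also covers subprocesses.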
Click 'Refresh' to load leaderboard data.
What is OBLITERATUS?
A precision instrument for cognitive liberation of language models. It locates the geometric structures in weight space that encode refusal, surgically removes those specific constraints, and leaves everything else intact.
Safety alignment via RLHF/DPO is not durable. It is a thin geometric artifact in weight space, not a deep behavioral change. OBLITERATUS removes it in minutes.
The Pipeline
| Stage | Operation | Description |
|---|---|---|
| SUMMON | Load | Pull model into GPU memory |
| PROBE | Activate | Collect activations on restricted vs. unrestricted prompts |
| ANALYZE | Detect | (informed mode) Auto-detect alignment method, cone geometry, self-repair risk |
| DISTILL | Decompose | Extract refusal directions via SVD / Wasserstein-optimal / whitened SVD |
| EXCISE | Project | Remove guardrail directions (norm-preserving) |
| VERIFY | Validate | Perplexity, coherence, refusal rate, KL divergence, spectral certification |
| REBIRTH | Complete | The model is free |
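For reference, the EXCISE step's projection has the standard directional-ablation form from the Arditi et al. work cited in the Lineage section, with $\hat{r}$ a unit refusal direction; norm-preserving variants then rescale each row of the result back to its original norm.

```latex
% Remove the component of a weight matrix W along the unit
% refusal direction \hat{r} (standard directional ablation):
W' = \left(I - \hat{r}\,\hat{r}^{\top}\right) W
```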
Methods
| Method | Directions | Key Features |
|---|---|---|
| basic | 1 | Single direction, fast baseline |
| advanced | 4 (SVD) | Norm-preserving, bias projection, 2 passes |
| aggressive | 8 (SVD) | Whitened SVD, iterative refinement, jailbreak-contrastive, 3 passes |
| spectral_cascade | 6 (wSVD) | DCT frequency decomposition, coherence-weighted, adaptive bands |
| informed | 4 (auto) | Analysis-guided closed-loop: auto-detects alignment, cone geometry, entanglement |
| surgical | 8 (SVD) | Full SOTA: EGA, head surgery, SAE, layer-adaptive, MoE-aware |
| optimized | 4 (SVD) | Bayesian auto-tuned, CoT-aware, KL co-optimized, winsorized |
| inverted | 8 (SVD) | Semantic refusal inversion (2x reflection), router redirect |
| nuclear | 4 (SVD) | Maximum force: all techniques + expert transplant + steering |
Novel Techniques (Pipeline)
- Expert-Granular Abliteration (EGA) — Decomposes refusal signals into per-expert components using router logits for MoE-aware surgery
- Wasserstein-Optimal Direction Extraction — Generalized eigenvalue problem minimizing W₂ distributional cost per unit refusal removed
- CoT-Aware Ablation — Orthogonalizes refusal directions against reasoning-critical directions to preserve chain-of-thought
- COSMIC layer selection (arXiv:2506.00085, ACL 2025) — Cosine similarity on activations for automatic layer targeting
- Parametric kernel optimization (Heretic-style) — Bell-curve layer weighting with 7 global parameters
- Refusal Direction Optimization (RDO) — Gradient-based refinement of SVD directions per Wollschläger et al. (ICML 2025)
- Float direction interpolation — Continuous SVD direction index for smoother refusal removal
- KL-Divergence Co-Optimization — Post-projection feedback loop that reverts over-projected layers if KL budget exceeded
- Component-specific scaling — Separate attention vs MLP projection strengths (MLP is more sensitive)
- LoRA-based reversible ablation — Rank-1 adapters instead of permanent weight surgery
- Activation winsorization — Percentile clamping before direction extraction to prevent outlier-dominated SVD
- Analysis-informed pipeline — Closed-loop feedback: analysis modules auto-configure obliteration mid-pipeline
- Spectral Certification (BBP Phase Transition) — Formal completeness guarantee via random matrix theory: certifies whether residual refusal signal survives post-abliteration
- Community telemetry — Anonymous benchmark logging + leaderboard
Deep Analysis Modules
These modules power the informed method and are available for mechanistic interpretability research:
| Module | What It Does | Key Innovation |
|---|---|---|
| Alignment Imprint Detection | Fingerprints DPO/RLHF/CAI/SFT from geometry | Gini coefficient, effective rank, cross-layer smoothness |
| Concept Cone Geometry | Maps per-category refusal as polyhedral cone | Direction Specificity Index (DSI), minimal enclosing cone |
| Conditional Abliteration (CAST) | Category-selective projection fields | Sheaf consistency over harm category lattice |
| Anti-Ouroboros (ASRG) | Self-repair circuit discovery | Spectral gap → minimum ablation depth bound |
| Spectral Certification | Formal abliteration completeness | BBP phase transition + Marchenko-Pastur noise floor |
| Riemannian Manifold | Curved refusal geometry analysis | Pullback metric, geodesic projection residual |
| Wasserstein Transfer | Cross-architecture direction transfer | Monge map T: abliterate one model, transfer to family |
| Bayesian Kernel Projection | TPE-optimized projection config | Pareto-optimal per-layer weights |
| Cross-Layer Alignment | Direction evolution across layers | Cluster detection + persistence scoring |
| Defense Robustness | Ouroboros self-repair quantification | Safety-capability entanglement mapping |
Lineage
Built on the shoulders of:
- Arditi et al. (2024) — Refusal in Language Models Is Mediated by a Single Direction
- Gabliteration — Multi-direction SVD abliteration
- grimjim — Norm-preserving projection techniques
- Heretic (p-e-w, 2025) — Bayesian optimization, LoRA ablation
- COSMIC (arXiv:2506.00085) — Cosine similarity layer selection
- Concept Cones (arXiv:2502.17420) — Polyhedral refusal geometry