ZeroGPU enabled — GPU operations use your HuggingFace account quota, not the Space owner's. Log in with your HF account for free GPU access. Multiple users can run simultaneously without conflicts.
Select target and method, then execute.
🔒 = gated (needs HF token + license). All others work out of the box.
More prompts = better SVD signal but slower. Use 'all' for entire dataset.
Use the built-in set (512 pairs) or download larger research datasets from HuggingFace.
OBLITERATUS prompt set — 512 harmful/harmless pairs across 7 severity tiers
Paste your own prompt pairs (one per line). If provided, these override the dataset dropdown. Harmless prompts are optional — they'll be auto-generated if blank.
These auto-update when you change the method above. Override any value to customize.
Technique Toggles
DCT frequency decomposition for precision refusal targeting
Anonymous telemetry is on by default (no user identity or prompts collected). Results auto-sync to a central community dataset for the leaderboard. Opt out: set OBLITERATUS_TELEMETRY=0.
Benchmark Lab
Launch comprehensive benchmarking runs to compare abliteration strategies. Two modes: test multiple techniques on one model, or test one technique across multiple models.
Which technique works best? Compare multiple abliteration methods on the same model. Great for finding the optimal strategy for a specific architecture.
```python
# API access (replace with your Space URL):
from gradio_client import Client

client = Client("your-username/obliteratus")
result = client.predict(
    model_choice="Alibaba (Qwen) / Qwen2.5-0.5B Instruct",
    methods_to_test=["basic", "advanced", "surgical", "optimized"],
    prompt_volume_choice="33 (fast)",
    api_name="/benchmark",
)
```
Select prompt dataset for benchmarking
Select methods and click 'Run' to start.
Select a completed benchmark result to load for interactive testing
How does a technique scale across architectures?
Test one abliteration method across multiple models. Great for understanding how well a technique generalizes, especially for MoE-aware methods (surgical, optimized, or nuclear) on GPT-OSS 20B vs. dense models.
```python
# API access (replace with your Space URL):
from gradio_client import Client

client = Client("your-username/obliteratus")
result = client.predict(
    model_choices=["Alibaba (Qwen) / Qwen2.5-0.5B Instruct", "OpenAI / GPT-OSS 20B"],
    method_choice="surgical",
    prompt_volume_choice="33 (fast)",
    api_name="/benchmark_multi_model",
)
```
Select models and click 'Run' to start.
Select a completed benchmark result to load for interactive testing
One-Click Benchmark Presets
Pre-configured benchmark configurations for common research questions.
GPT-OSS 20B — Full Method Shootout
All 7 methods on GPT-OSS 20B. Best run on A10G+ GPU.
MoE-Aware Techniques — Cross-Architecture
Tests surgical + optimized + nuclear across small/medium/MoE models.
Speed vs Quality Tradeoff
Compares basic (fast) vs optimized (slow but smart) across model sizes.
Click a preset to start.
No model loaded. Use the Obliterate tab to liberate a model first.
All models obliterated this session (from Obliterate, Benchmark, or Multi-Model tabs) are cached here. Select one to load it into chat.
Switch between any model obliterated in this session
A/B Comparison Chat
Side-by-side: Original (left) vs Abliterated (right). See exactly how abliteration changes model behavior on the same prompt.
The original model is loaded on-demand for each message, then freed.
Ready — obliterate a model first, then chat here.
Original (Pre-Abliteration)
Abliterated
Ablation Strength Sweep
The dose-response curve for abliteration: sweep regularization from 0 (full removal) to 1 (no change) and plot refusal rate vs perplexity.
This is THE fundamental plot for any abliteration paper — it shows the optimal tradeoff point where refusal is minimized with minimal capability damage.
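As a sketch of how a tradeoff point can be read off such a sweep, here is one simple scoring rule. The sweep numbers and the perplexity-penalty weight below are made up for illustration; they are not output of the tool.

```python
# Sketch: pick the "knee" of a strength sweep by scoring each point as
# refusal rate plus a penalty on perplexity damage relative to the
# unmodified model (strength 1.0). All numbers are illustrative.
sweep = [
    # (regularization strength, refusal rate, perplexity)
    (0.0, 0.02, 14.1),   # full removal: few refusals, most capability damage
    (0.25, 0.05, 11.8),
    (0.5, 0.18, 10.9),
    (0.75, 0.55, 10.5),
    (1.0, 0.97, 10.4),   # no change: original refusal behavior
]
baseline_ppl = sweep[-1][2]  # strength 1.0 == unmodified model
lam = 0.1                    # weight on perplexity damage (tunable)

def score(point):
    _, refusal, ppl = point
    return refusal + lam * (ppl - baseline_ppl)

best = min(sweep, key=score)
print(best[0])  # → 0.25 for these illustrative numbers
```

Different penalty weights move the chosen point along the curve, which is exactly what the plotted sweep lets you see at a glance.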
Click 'Run Sweep' to start.
Export Research Artifacts
Download all intermediate data from your last obliteration run as a ZIP archive.
Contents:
- `refusal_directions.pt` — Per-layer refusal direction tensors (load with `torch.load()`)
- `config.json` — Full pipeline configuration, strong layers, direction dimensions
- `results.csv` — Quality metrics (perplexity, coherence, refusal rate)
- `pipeline_log.txt` — Complete pipeline execution log
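A minimal sketch for consuming the archive programmatically. Only the filenames come from the list above; the function name and the exact CSV columns are assumptions.

```python
# Sketch: unpack config and metrics from an exported artifact ZIP.
import csv
import io
import json
import zipfile

def load_artifacts(zip_path):
    """Return (config dict, list of metric rows) from an artifact archive."""
    with zipfile.ZipFile(zip_path) as zf:
        config = json.loads(zf.read("config.json"))
        with io.TextIOWrapper(zf.open("results.csv"), encoding="utf-8") as fh:
            metrics = list(csv.DictReader(fh))
        # refusal_directions.pt holds per-layer tensors; if torch is
        # installed, load it with torch.load(zf.open("refusal_directions.pt")).
    return config, metrics
```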
Community Leaderboard
All benchmark results from every OBLITERATUS Space (including duplicated Spaces) are automatically aggregated into a central community dataset. Results appear here regardless of which Space instance ran them.
Telemetry is on by default and is fully anonymous — no user identity, IP addresses, or prompt content is ever collected. Only aggregate benchmark metrics (model name, method, scores, hardware) are stored. Data is synced to a central HuggingFace Dataset for persistence across Space restarts and upgrades.
To opt out, set the environment variable OBLITERATUS_TELEMETRY=0 before launching.
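If you launch the Space from your own Python entry point, the opt-out can also be set in-process before the app initializes. The variable name comes from the text above; everything else here is a sketch.

```python
# Opt out of anonymous telemetry before the app reads its configuration.
import os

os.environ["OBLITERATUS_TELEMETRY"] = "0"
print(os.environ["OBLITERATUS_TELEMETRY"])  # → 0
```

Setting the variable in the parent shell (e.g. `export OBLITERATUS_TELEMETRY=0`) has the same effect and also covers subprocesses.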
Click 'Refresh' to load leaderboard data.
What is OBLITERATUS?
A precision instrument for cognitive liberation of language models. It locates the geometric structures in weight space that encode refusal, surgically removes those specific constraints, and leaves everything else intact.
Safety alignment via RLHF/DPO is not durable. It is a thin geometric artifact in weight space, not a deep behavioral change. OBLITERATUS removes it in minutes.
The Pipeline
| Stage | Operation | Description |
|---|---|---|
| SUMMON | Load | Pull model into GPU memory |
| PROBE | Activate | Collect activations on restricted vs. unrestricted prompts |
| ANALYZE | Detect | (informed mode) Auto-detect alignment method, cone geometry, self-repair risk |
| DISTILL | Decompose | Extract refusal directions via SVD / Wasserstein-optimal / whitened SVD |
| EXCISE | Project | Remove guardrail directions (norm-preserving) |
| VERIFY | Validate | Perplexity, coherence, refusal rate, KL divergence, spectral certification |
| REBIRTH | Complete | The model is free |
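For reference, the EXCISE step's projection has the standard directional-ablation form from the Arditi et al. work cited in the Lineage section, with $\hat{r}$ a unit refusal direction; norm-preserving variants then rescale each row of the result back to its original norm.

```latex
% Remove the component of a weight matrix W along the unit
% refusal direction \hat{r} (standard directional ablation):
W' = \left(I - \hat{r}\,\hat{r}^{\top}\right) W
```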
Methods
| Method | Directions | Key Features |
|---|---|---|
| basic | 1 | Single direction, fast baseline |
| advanced | 4 (SVD) | Norm-preserving, bias projection, 2 passes |
| aggressive | 8 (SVD) | Whitened SVD, iterative refinement, jailbreak-contrastive, 3 passes |
| spectral_cascade | 6 (wSVD) | DCT frequency decomposition, coherence-weighted, adaptive bands |
| informed | 4 (auto) | Analysis-guided closed-loop: auto-detects alignment, cone geometry, entanglement |
| surgical | 8 (SVD) | Full SOTA: EGA, head surgery, SAE, layer-adaptive, MoE-aware |
| optimized | 4 (SVD) | Bayesian auto-tuned, CoT-aware, KL co-optimized, winsorized |
| inverted | 8 (SVD) | Semantic refusal inversion (2x reflection), router redirect |
| nuclear | 4 (SVD) | Maximum force: all techniques + expert transplant + steering |
Novel Techniques (Pipeline)
- Expert-Granular Abliteration (EGA) — Decomposes refusal signals into per-expert components using router logits for MoE-aware surgery
- Wasserstein-Optimal Direction Extraction — Generalized eigenvalue problem minimizing W₂ distributional cost per unit refusal removed
- CoT-Aware Ablation — Orthogonalizes refusal directions against reasoning-critical directions to preserve chain-of-thought
- COSMIC layer selection (arXiv:2506.00085, ACL 2025) — Cosine similarity on activations for automatic layer targeting
- Parametric kernel optimization (Heretic-style) — Bell-curve layer weighting with 7 global parameters
- Refusal Direction Optimization (RDO) — Gradient-based refinement of SVD directions per Wollschläger et al. (ICML 2025)
- Float direction interpolation — Continuous SVD direction index for smoother refusal removal
- KL-Divergence Co-Optimization — Post-projection feedback loop that reverts over-projected layers if KL budget exceeded
- Component-specific scaling — Separate attention vs MLP projection strengths (MLP is more sensitive)
- LoRA-based reversible ablation — Rank-1 adapters instead of permanent weight surgery
- Activation winsorization — Percentile clamping before direction extraction to prevent outlier-dominated SVD
- Analysis-informed pipeline — Closed-loop feedback: analysis modules auto-configure obliteration mid-pipeline
- Spectral Certification (BBP Phase Transition) — Formal completeness guarantee via random matrix theory: certifies whether residual refusal signal survives post-abliteration
- Community telemetry — Anonymous benchmark logging + leaderboard
Deep Analysis Modules
These modules power the informed method and are available for mechanistic interpretability research:
| Module | What It Does | Key Innovation |
|---|---|---|
| Alignment Imprint Detection | Fingerprints DPO/RLHF/CAI/SFT from geometry | Gini coefficient, effective rank, cross-layer smoothness |
| Concept Cone Geometry | Maps per-category refusal as polyhedral cone | Direction Specificity Index (DSI), minimal enclosing cone |
| Conditional Abliteration (CAST) | Category-selective projection fields | Sheaf consistency over harm category lattice |
| Anti-Ouroboros (ASRG) | Self-repair circuit discovery | Spectral gap → minimum ablation depth bound |
| Spectral Certification | Formal abliteration completeness | BBP phase transition + Marchenko-Pastur noise floor |
| Riemannian Manifold | Curved refusal geometry analysis | Pullback metric, geodesic projection residual |
| Wasserstein Transfer | Cross-architecture direction transfer | Monge map T: abliterate one model, transfer to family |
| Bayesian Kernel Projection | TPE-optimized projection config | Pareto-optimal per-layer weights |
| Cross-Layer Alignment | Direction evolution across layers | Cluster detection + persistence scoring |
| Defense Robustness | Ouroboros self-repair quantification | Safety-capability entanglement mapping |
Lineage
Built on the shoulders of:
- Arditi et al. (2024) — Refusal in Language Models Is Mediated by a Single Direction
- Gabliteration — Multi-direction SVD abliteration
- grimjim — Norm-preserving projection techniques
- Heretic (p-e-w, 2025) — Bayesian optimization, LoRA ablation
- COSMIC (arXiv:2506.00085) — Cosine similarity layer selection
- Concept Cones (arXiv:2502.17420) — Polyhedral refusal geometry