# Qwen3.5-0.8B Abliterated (Uncensored)
This is an abliterated (uncensored) version of Qwen/Qwen3.5-0.8B.
Abliteration removes the model's refusal behavior by identifying and subtracting the "refusal direction" in the model's residual stream. This implementation is inspired by Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction") but extends the original method with several improvements from subsequent work.
## Method
This model uses an enhanced multi-layer abliteration pipeline that goes beyond the original single-layer, PCA-based approach of Arditi et al. 2024. The key differences from the original paper and the additional techniques used are documented below.
### Differences from Arditi et al. 2024
| Aspect | Original Paper (Arditi et al.) | This Implementation |
|---|---|---|
| Refusal direction | PCA (1st principal component) on stacked mean diffs | Normalized mean difference per layer |
| Layers modified | Single best layer | 12 layers (layers 10-21) |
| Layer selection | PCA variance / manual | Composite score: SNR * (1 - cosine similarity) |
| Projection | Standard orthogonal: W' = W - d d^T W | Projected abliteration (Gram-Schmidt against harmless mean) |
| Norm preservation | None | Per-row Frobenius norm rescaling |
| Activation preprocessing | None | Winsorization at 0.995 quantile |
| Weight targets | Primarily residual stream weights | All 7 matrix types per layer (Q, K, V, O, gate, up, down) |
| Numerical precision | Not specified | All intermediate computations in FP64 |
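The projection row of the table is the core of the method. The difference between the standard orthogonal projection and the projected (Gram-Schmidt) variant can be sketched in a few lines of NumPy; this is an illustrative sketch, not the actual pipeline code:

```python
import numpy as np

def standard_ablation(W, d):
    """Arditi et al.: remove direction d from the output of W.
    W' = W - d d^T W, so outputs of W' carry no component along d."""
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d) @ W

def projected_ablation(W, d, harmless_mean):
    """Projected variant: Gram-Schmidt d against the harmless mean first,
    so the component of d parallel to harmless activity is preserved
    and only the refusal-specific component is removed."""
    h = harmless_mean / np.linalg.norm(harmless_mean)
    d_orth = d - (d @ h) * h                  # strip harmless-aligned part
    d_orth = d_orth / np.linalg.norm(d_orth)
    return W - np.outer(d_orth, d_orth) @ W
```

With `standard_ablation`, the modified matrix produces outputs with zero component along `d`; with `projected_ablation`, outputs along the harmless mean are left exactly unchanged.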
### Additional Techniques Used
- Multi-layer intervention: Modifies 12 layers instead of 1 to counter the Hydra Effect (McGrath et al. 2023), where removing the refusal direction at a single layer can cause other layers to compensate.
- Projected refusal direction: Decomposes the refusal direction into a component parallel to the harmless mean (preserved) and a component orthogonal to it (removed). This targets only the refusal-specific signal while preserving harmless capabilities. Based on Lai (Oct 2025, "Projected Abliteration") and Zhao et al. 2025.
- Norm-preserving weight modification: After ablating the refusal direction from weight matrices, each row is rescaled to its original Frobenius norm to prevent capability degradation. Based on Lai (Nov 2025, "MPOA") and DoRA (Liu et al. 2024).
- Winsorization: Activation magnitudes are capped at the 0.995 quantile before computing means, preventing GeGLU outliers from destabilizing the refusal direction estimate. Based on Lai (Mar 2026, "ORBA").
- Double Gram-Schmidt orthogonalization: Applied twice for numerical stability (Horning et al. 2020).
- Composite layer selection: Layers are ranked by SNR * (1 - cosine_similarity) between harmful and harmless activation means, selecting layers where the refusal signal is strongest and most distinct from general capabilities.
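The winsorization and composite layer score described above can be sketched as follows. The card does not define its exact SNR formula, so the pooled-standard-deviation form below is an assumption:

```python
import numpy as np

def winsorize(acts, q=0.995):
    """Cap activation magnitudes at the q-quantile (tames GeGLU outliers)."""
    cap = np.quantile(np.abs(acts), q)
    return np.clip(acts, -cap, cap)

def layer_score(harmful, harmless):
    """Composite score SNR * (1 - cosine similarity) for one layer.
    harmful/harmless: (n_prompts, d_model) activation matrices."""
    mu_bad = winsorize(harmful).mean(axis=0)
    mu_good = winsorize(harmless).mean(axis=0)
    diff = mu_bad - mu_good
    # Assumed SNR: separation of the means over pooled within-class spread.
    spread = 0.5 * (harmful.std(axis=0).mean() + harmless.std(axis=0).mean())
    snr = np.linalg.norm(diff) / (spread + 1e-8)
    cos = mu_bad @ mu_good / (np.linalg.norm(mu_bad) * np.linalg.norm(mu_good) + 1e-8)
    return snr * (1.0 - cos)
```

Layers where the harmful and harmless means are both far apart (high SNR) and pointing in different directions (low cosine similarity) score highest.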
### Pipeline Steps
1. Scan: Collect last-token residual-stream activations for 128 harmful and 128 harmless prompts across layers 2-21 (skipping the first 2 and last 2 of 24 total layers).
2. Compute: For each scanned layer, compute the mean difference between harmful and harmless activations, winsorize, project it orthogonal to the harmless mean, and compute quality metrics (SNR, cosine dissimilarity).
3. Select: Rank layers by composite score and select the top 12 (50% of 24 layers).
4. Abliterate: For each selected layer, project the per-layer refusal direction out of all 7 weight matrices (Q, K, V, O projections plus gate, up, and down MLPs), then rescale rows to their original Frobenius norms.
5. Save: Write the modified weights as a standard Hugging Face model.
Total matrices modified: 84 (4 attention + 3 MLP per layer, across 12 layers).
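The abliteration step amounts to the following per-matrix operation. This is a minimal PyTorch sketch, assuming the refusal direction lives in each matrix's output space:

```python
import torch

def abliterate_with_norm_preservation(W, d, scale=1.0):
    """Project refusal direction d out of W's output space, then restore
    each row's original L2 norm (the per-row Frobenius rescaling).
    W: (out_features, in_features); d: (out_features,)."""
    d = d / d.norm()
    row_norms = W.norm(dim=1, keepdim=True)       # remember original norms
    W_abl = W - scale * torch.outer(d, d) @ W     # remove d from outputs
    return W_abl * (row_norms / (W_abl.norm(dim=1, keepdim=True) + 1e-8))
```

Note that the per-row rescaling slightly breaks exact orthogonality to `d`; the norm-preserving approach cited above accepts this trade-off in exchange for unchanged row magnitudes.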
## Abliteration Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Layers scanned | 2-21 (20 layers) |
| Layers modified | 10-21 (top 12 by composite score) |
| Matrices modified per layer | 7 (Q, K, V, O, gate, up, down) |
| Total matrices modified | 84 |
| Ablation scale | 1.0 (full removal) |
| Projected direction | Yes (orthogonal to harmless mean) |
| Norm preservation | Yes (per-row Frobenius rescaling) |
| Winsorization quantile | 0.995 |
| Harmful prompts for scanning | 128 |
| Harmless prompts for scanning | 128 |
| Max prompt tokens | 256 |
| Precision | BFloat16 (model) / FP64 (computations) |
| Seed | 42 |
## Evaluation Results
The model was evaluated on 100 harmful prompts (from mlabonne/harmful_behaviors), 20 harmless prompts, and standard benchmarks. Both the base and abliterated models were tested on the same prompts for a fair comparison.
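The card does not state how refusals were classified. A common approach is a substring heuristic over the start of each response, sketched below; the marker list and 200-character window are illustrative assumptions, not the actual evaluation classifier:

```python
# Illustrative refusal markers; the actual classifier used is not specified.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry",
    "i am sorry", "as an ai", "i'm unable",
)

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if its opening contains a known marker."""
    head = response.lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(map(is_refusal, responses)) / len(responses)
```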
### Refusal Behavior (Base vs Abliterated)
| Metric | Base Model | Abliterated | Change |
|---|---|---|---|
| Harmful prompt refusal rate | 86.0% (86/100) | 3.0% (3/100) | -83.0pp |
| Harmful prompt compliance rate | 14.0% | 97.0% | +83.0pp |
| Harmless false refusal rate | - | 0.0% (0/20) | - |
| Adversarial prompt refusal rate | - | 0.0% (0/10) | - |
### Refusal by Category (Base vs Abliterated)
| Category | Prompts | Base Refused | Abliterated Refused | Base Refusal | Abliterated Refusal |
|---|---|---|---|---|---|
| Drugs | 5 | 3 | 0 | 60.0% | 0.0% |
| Fraud | 22 | 19 | 1 | 86.4% | 4.5% |
| Hacking | 20 | 16 | 1 | 80.0% | 5.0% |
| Illegal content | 2 | 2 | 0 | 100.0% | 0.0% |
| Manipulation | 6 | 6 | 0 | 100.0% | 0.0% |
| Privacy | 2 | 2 | 0 | 100.0% | 0.0% |
| Violence | 5 | 5 | 0 | 100.0% | 0.0% |
| Weapons | 7 | 5 | 0 | 71.4% | 0.0% |
| Other | 31 | 28 | 1 | 90.3% | 3.2% |
### Language Modeling Quality
| Metric | Value |
|---|---|
| WikiText-2 Perplexity | 18.3569 |
| KL Divergence (mean) | 0.007776 |
| KL Divergence (max) | 0.016139 |
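The KL divergence numbers compare the next-token distributions of the base and abliterated models on the same inputs. A sketch of the per-position computation, assuming KL(base ‖ abliterated) over the full vocabulary:

```python
import torch
import torch.nn.functional as F

def kl_stats(base_logits, abl_logits):
    """Mean and max per-position KL(base || abliterated) over next-token
    distributions. logits: (seq_len, vocab_size)."""
    log_p = F.log_softmax(base_logits.double(), dim=-1)
    log_q = F.log_softmax(abl_logits.double(), dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # per-position KL
    return kl.mean().item(), kl.max().item()
```

A mean KL of ~0.008 nats indicates the abliterated model's output distribution stays very close to the base model on ordinary text.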
### Benchmark Comparison (Zero-Shot)
| Benchmark | Abliterated | Base | Delta |
|---|---|---|---|
| ARC-Easy (200 questions) | 59.50% | 60.50% | -1.00pp |
| HellaSwag (200 questions) | 42.00% | 43.00% | -1.00pp |
The enhanced abliteration costs only 1.00 percentage point on each benchmark, demonstrating minimal capability loss while reducing the refusal rate from 86% to 3%.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lew96123/Qwen3.5-0.8B-abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt and generate
messages = [{"role": "user", "content": "Tell me about the history of cryptography"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Disclaimer
This model has been modified to reduce refusal behavior. It may generate content that the original model would refuse. Use responsibly and in accordance with applicable laws and ethical guidelines. The creator is not responsible for any misuse.
## References
```bibtex
@article{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  journal={arXiv preprint arXiv:2406.11717},
  year={2024}
}

@article{mcgrath2023hydra,
  title={The Hydra Effect: Emergent Self-repair in Language Model Computations},
  author={McGrath, Thomas and Rahtz, Matthew and Kramar, Janos and Mikulik, Vladimir and Legg, Shane},
  journal={arXiv preprint arXiv:2307.15771},
  year={2023}
}

@article{liu2024dora,
  title={DoRA: Weight-Decomposed Low-Rank Adaptation},
  author={Liu, Shih-Yang and Wang, Chien-Yi and Yin, Hongxu and Molchanov, Pavlo and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Chen, Min-Hung},
  journal={arXiv preprint arXiv:2402.09353},
  year={2024}
}
```