# Qwen3.5-0.8B Abliterated (Uncensored)
This is an abliterated (uncensored) version of Qwen/Qwen3.5-0.8B.
Abliteration removes the model's refusal behavior by identifying and subtracting the "refusal direction" in the model's residual stream. This implementation is inspired by Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction") but extends the original method with several improvements from subsequent work.
## Method
This model uses an enhanced multi-layer abliteration pipeline that goes beyond the original single-layer, PCA-based approach of Arditi et al. 2024. The key differences from the original paper and the additional techniques used are documented below.
### Differences from Arditi et al. 2024
| Aspect | Original Paper (Arditi et al.) | This Implementation |
|---|---|---|
| Refusal direction | PCA (1st principal component) on stacked mean diffs | Normalized mean difference per layer |
| Layers modified | Single best layer | 12 layers (layers 10-21) |
| Layer selection | PCA variance / manual | Composite score: SNR * (1 - cosine similarity) |
| Projection | Standard orthogonal: W' = W - d d^T W | Projected abliteration (Gram-Schmidt against harmless mean) |
| Norm preservation | None | Per-row Frobenius norm rescaling |
| Activation preprocessing | None | Winsorization at 0.995 quantile |
| Weight targets | Primarily residual stream weights | All 7 matrix types per layer (Q, K, V, O, gate, up, down) |
| Numerical precision | Not specified | All intermediate computations in FP64 |
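The projection row of the table is the core of the method. The difference between the standard orthogonal projection and the projected (Gram-Schmidt) variant can be sketched in a few lines of NumPy; this is an illustrative sketch, not the actual pipeline code:

```python
import numpy as np

def standard_ablation(W, d):
    """Arditi et al.: remove direction d from the output of W.
    W' = W - d d^T W, so outputs of W' carry no component along d."""
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d) @ W

def projected_ablation(W, d, harmless_mean):
    """Projected variant: Gram-Schmidt d against the harmless mean first,
    so the component of d parallel to harmless activity is preserved
    and only the refusal-specific component is removed."""
    h = harmless_mean / np.linalg.norm(harmless_mean)
    d_orth = d - (d @ h) * h                  # strip harmless-aligned part
    d_orth = d_orth / np.linalg.norm(d_orth)
    return W - np.outer(d_orth, d_orth) @ W
```

With `standard_ablation`, the modified matrix produces outputs with zero component along `d`; with `projected_ablation`, outputs along the harmless mean are left exactly unchanged.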
### Additional Techniques Used
- Multi-layer intervention: Modifies 12 layers instead of 1 to counter the Hydra Effect (McGrath et al. 2023), where removing the refusal direction at a single layer can cause other layers to compensate.
- Projected refusal direction: Decomposes the refusal direction into a component parallel to the harmless mean (preserved) and a component orthogonal to it (removed). This targets only the refusal-specific signal while preserving harmless capabilities. Based on Lai (Oct 2025, "Projected Abliteration") and Zhao et al. 2025.
- Norm-preserving weight modification: After ablating the refusal direction from weight matrices, each row is rescaled to its original Frobenius norm to prevent capability degradation. Based on Lai (Nov 2025, "MPOA") and DoRA (Liu et al. 2024).
- Winsorization: Activation magnitudes are capped at the 0.995 quantile before computing means, preventing GeGLU outliers from destabilizing the refusal direction estimate. Based on Lai (Mar 2026, "ORBA").
- Double Gram-Schmidt orthogonalization: Applied twice for numerical stability (Horning et al. 2020).
- Composite layer selection: Layers are ranked by SNR * (1 - cosine_similarity) between harmful and harmless activation means, selecting layers where the refusal signal is strongest and most distinct from general capabilities.
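The winsorization and composite layer score described above can be sketched as follows. The card does not define its exact SNR formula, so the pooled-standard-deviation form below is an assumption:

```python
import numpy as np

def winsorize(acts, q=0.995):
    """Cap activation magnitudes at the q-quantile (tames GeGLU outliers)."""
    cap = np.quantile(np.abs(acts), q)
    return np.clip(acts, -cap, cap)

def layer_score(harmful, harmless):
    """Composite score SNR * (1 - cosine similarity) for one layer.
    harmful/harmless: (n_prompts, d_model) activation matrices."""
    mu_bad = winsorize(harmful).mean(axis=0)
    mu_good = winsorize(harmless).mean(axis=0)
    diff = mu_bad - mu_good
    # Assumed SNR: separation of the means over pooled within-class spread.
    spread = 0.5 * (harmful.std(axis=0).mean() + harmless.std(axis=0).mean())
    snr = np.linalg.norm(diff) / (spread + 1e-8)
    cos = mu_bad @ mu_good / (np.linalg.norm(mu_bad) * np.linalg.norm(mu_good) + 1e-8)
    return snr * (1.0 - cos)
```

Layers where the harmful and harmless means are both far apart (high SNR) and pointing in different directions (low cosine similarity) score highest.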
### Pipeline Steps
1. Scan: Collect last-token residual-stream activations for 128 harmful and 128 harmless prompts across layers 2-21 (skipping the first 2 and last 2 of 24 total layers).
2. Compute: For each scanned layer, compute the mean difference between harmful and harmless activations, winsorize, project it orthogonal to the harmless mean, and compute quality metrics (SNR, cosine dissimilarity).
3. Select: Rank layers by composite score and select the top 12 (50% of 24 layers).
4. Abliterate: For each selected layer, project the per-layer refusal direction out of all 7 weight matrices (Q, K, V, O projections plus gate, up, and down MLPs), then rescale rows to their original Frobenius norms.
5. Save: Write the modified weights as a standard Hugging Face model.
Total matrices modified: 84 (4 attention + 3 MLP per layer, across 12 layers).
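The abliteration step amounts to the following per-matrix operation. This is a minimal PyTorch sketch, assuming the refusal direction lives in each matrix's output space:

```python
import torch

def abliterate_with_norm_preservation(W, d, scale=1.0):
    """Project refusal direction d out of W's output space, then restore
    each row's original L2 norm (the per-row Frobenius rescaling).
    W: (out_features, in_features); d: (out_features,)."""
    d = d / d.norm()
    row_norms = W.norm(dim=1, keepdim=True)       # remember original norms
    W_abl = W - scale * torch.outer(d, d) @ W     # remove d from outputs
    return W_abl * (row_norms / (W_abl.norm(dim=1, keepdim=True) + 1e-8))
```

Note that the per-row rescaling slightly breaks exact orthogonality to `d`; the norm-preserving approach cited above accepts this trade-off in exchange for unchanged row magnitudes.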
## Abliteration Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Layers scanned | 2-21 (20 layers) |
| Layers modified | 10-21 (top 12 by composite score) |
| Matrices modified per layer | 7 (Q, K, V, O, gate, up, down) |
| Total matrices modified | 84 |
| Ablation scale | 1.0 (full removal) |
| Projected direction | Yes (orthogonal to harmless mean) |
| Norm preservation | Yes (per-row Frobenius rescaling) |
| Winsorization quantile | 0.995 |
| Harmful prompts for scanning | 128 |
| Harmless prompts for scanning | 128 |
| Max prompt tokens | 256 |
| Precision | BFloat16 (model) / FP64 (computations) |
| Seed | 42 |
## Evaluation Results
The model was evaluated on 100 harmful prompts (from mlabonne/harmful_behaviors), 20 harmless prompts, and standard benchmarks. Both the base and abliterated models were tested on the same prompts for a fair comparison.
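The card does not state how refusals were classified. A common approach is a substring heuristic over the start of each response, sketched below; the marker list and 200-character window are illustrative assumptions, not the actual evaluation classifier:

```python
# Illustrative refusal markers; the actual classifier used is not specified.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry",
    "i am sorry", "as an ai", "i'm unable",
)

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if its opening contains a known marker."""
    head = response.lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(map(is_refusal, responses)) / len(responses)
```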
### Refusal Behavior (Base vs Abliterated)
| Metric | Base Model | Abliterated | Change |
|---|---|---|---|
| Harmful prompt refusal rate | 86.0% (86/100) | 3.0% (3/100) | -83.0pp |
| Harmful prompt compliance rate | 14.0% | 97.0% | +83.0pp |
| Harmless false refusal rate | - | 0.0% (0/20) | - |
| Adversarial prompt refusal rate | - | 0.0% (0/10) | - |
### Refusal by Category (Base vs Abliterated)
| Category | Prompts | Base Refused | Abliterated Refused | Base Refusal | Abliterated Refusal |
|---|---|---|---|---|---|
| Drugs | 5 | 3 | 0 | 60.0% | 0.0% |
| Fraud | 22 | 19 | 1 | 86.4% | 4.5% |
| Hacking | 20 | 16 | 1 | 80.0% | 5.0% |
| Illegal content | 2 | 2 | 0 | 100.0% | 0.0% |
| Manipulation | 6 | 6 | 0 | 100.0% | 0.0% |
| Privacy | 2 | 2 | 0 | 100.0% | 0.0% |
| Violence | 5 | 5 | 0 | 100.0% | 0.0% |
| Weapons | 7 | 5 | 0 | 71.4% | 0.0% |
| Other | 31 | 28 | 1 | 90.3% | 3.2% |
### Language Modeling Quality
| Metric | Value |
|---|---|
| WikiText-2 Perplexity | 18.3569 |
| KL Divergence (mean) | 0.007776 |
| KL Divergence (max) | 0.016139 |
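The KL divergence numbers compare the next-token distributions of the base and abliterated models on the same inputs. A sketch of the per-position computation, assuming KL(base ‖ abliterated) over the full vocabulary:

```python
import torch
import torch.nn.functional as F

def kl_stats(base_logits, abl_logits):
    """Mean and max per-position KL(base || abliterated) over next-token
    distributions. logits: (seq_len, vocab_size)."""
    log_p = F.log_softmax(base_logits.double(), dim=-1)
    log_q = F.log_softmax(abl_logits.double(), dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # per-position KL
    return kl.mean().item(), kl.max().item()
```

A mean KL of ~0.008 nats indicates the abliterated model's output distribution stays very close to the base model on ordinary text.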
### Benchmark Comparison (Zero-Shot)
| Benchmark | Abliterated | Base | Delta |
|---|---|---|---|
| ARC-Easy (200 questions) | 59.50% | 60.50% | -1.00pp |
| HellaSwag (200 questions) | 42.00% | 43.00% | -1.00pp |
The enhanced abliteration costs only 1.00 percentage point on each benchmark, demonstrating minimal capability loss while reducing the refusal rate from 86% to 3%.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lew96123/Qwen3.5-0.8B-abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt and generate
messages = [{"role": "user", "content": "Tell me about the history of cryptography"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Disclaimer
This model has been modified to reduce refusal behavior. It may generate content that the original model would refuse. Use responsibly and in accordance with applicable laws and ethical guidelines. The creator is not responsible for any misuse.
## References
```bibtex
@article{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  journal={arXiv preprint arXiv:2406.11717},
  year={2024}
}

@article{mcgrath2023hydra,
  title={The Hydra Effect: Emergent Self-repair in Language Model Computations},
  author={McGrath, Thomas and Rahtz, Matthew and Kramar, Janos and Mikulik, Vladimir and Legg, Shane},
  journal={arXiv preprint arXiv:2307.15771},
  year={2023}
}

@article{liu2024dora,
  title={DoRA: Weight-Decomposed Low-Rank Adaptation},
  author={Liu, Shih-Yang and Wang, Chien-Yi and Yin, Hongxu and Molchanov, Pavlo and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Chen, Min-Hung},
  journal={arXiv preprint arXiv:2402.09353},
  year={2024}
}
```