Qwen3.5-0.8B Abliterated (Uncensored)

This is an abliterated (uncensored) version of Qwen/Qwen3.5-0.8B.

Abliteration removes the model's refusal behavior by identifying and subtracting the "refusal direction" in the model's residual stream. This implementation is inspired by Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction") but extends the original method with several improvements from subsequent work.

Method

This model uses an enhanced multi-layer abliteration pipeline that goes beyond the original single-layer, PCA-based approach of Arditi et al. 2024. The key differences from the original paper and the additional techniques used are documented below.

Differences from Arditi et al. 2024

| Aspect | Original Paper (Arditi et al.) | This Implementation |
|---|---|---|
| Refusal direction | PCA (1st principal component) on stacked mean diffs | Normalized mean difference per layer |
| Layers modified | Single best layer | 12 layers (layers 10-21) |
| Layer selection | PCA variance / manual | Composite score: SNR * (1 - cosine similarity) |
| Projection | Standard orthogonal: W' = W - d d^T W | Projected abliteration (Gram-Schmidt against harmless mean) |
| Norm preservation | None | Per-row Frobenius norm rescaling |
| Activation preprocessing | None | Winsorization at the 0.995 quantile |
| Weight targets | Primarily residual stream weights | All 7 matrix types per layer (Q, K, V, O, gate, up, down) |
| Numerical precision | Not specified | All intermediate computations in FP64 |

Additional Techniques Used

  • Multi-layer intervention: Modifies 12 layers instead of 1 to counter the Hydra Effect (McGrath et al. 2023), where removing the refusal direction at a single layer can cause other layers to compensate.
  • Projected refusal direction: Decomposes the refusal direction into a component parallel to the harmless mean (preserved) and a component orthogonal to it (removed). This targets only the refusal-specific signal while preserving harmless capabilities. Based on Lai (Oct 2025, "Projected Abliteration") and Zhao et al. 2025.
  • Norm-preserving weight modification: After ablating the refusal direction from weight matrices, each row is rescaled to its original Frobenius norm to prevent capability degradation. Based on Lai (Nov 2025, "MPOA") and DoRA (Liu et al. 2024).
  • Winsorization: Activation magnitudes are capped at the 0.995 quantile before computing means, preventing GeGLU outliers from destabilizing the refusal direction estimate. Based on Lai (Mar 2026, "ORBA").
  • Double Gram-Schmidt orthogonalization: Applied twice for numerical stability (Horning et al. 2020).
  • Composite layer selection: Layers are ranked by SNR * (1 - cosine_similarity) between harmful and harmless activation means, selecting layers where the refusal signal is strongest and most distinct from general capabilities.
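The per-layer direction computation described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the pipeline's actual code: the array names are placeholders, and the global-quantile winsorization is an assumption about how the 0.995 cap is applied.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts, q=0.995):
    """Sketch: per-layer refusal direction with winsorization and a
    double Gram-Schmidt projection orthogonal to the harmless mean."""
    def winsorize(x):
        # Cap activation magnitudes at the q-quantile (assumed global cap)
        cap = np.quantile(np.abs(x), q)
        return np.clip(x, -cap, cap)

    # All intermediate computation in FP64, as the card specifies
    harmful = winsorize(harmful_acts.astype(np.float64))
    harmless = winsorize(harmless_acts.astype(np.float64))

    harmless_mean = harmless.mean(axis=0)
    diff = harmful.mean(axis=0) - harmless_mean  # mean difference

    # Double Gram-Schmidt: remove the component parallel to the
    # harmless mean twice for numerical stability
    h = harmless_mean / np.linalg.norm(harmless_mean)
    for _ in range(2):
        diff = diff - (diff @ h) * h

    return diff / np.linalg.norm(diff)  # unit-norm refusal direction
```

The returned direction is orthogonal to the (winsorized) harmless mean, so ablating it leaves the harmless-mean component of the residual stream untouched.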

Pipeline Steps

  1. Scan: Collect last-token residual stream activations on 128 harmful vs 128 harmless prompts across layers 2-21 (skipping first 2 and last 2 of 24 total layers).
  2. Compute: For each scanned layer, compute the mean difference between harmful and harmless activations, winsorize, project orthogonal to harmless mean, and compute quality metrics (SNR, cosine dissimilarity).
  3. Select: Rank layers by composite score and select the top 12 (50% of 24 layers).
  4. Abliterate: For each selected layer, project out the per-layer refusal direction from all 7 weight matrices (Q, K, V, O projections + gate, up, down MLPs), then rescale rows to preserve original Frobenius norms.
  5. Save: Save the modified weights as a standard HuggingFace model.
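Step 3's composite score can be sketched as below. The card does not spell out how SNR is computed, so this sketch assumes SNR = ||mean difference|| / pooled per-feature noise; the function and array names are illustrative.

```python
import numpy as np

def layer_score(harmful_acts, harmless_acts):
    """Sketch of the composite layer score SNR * (1 - cosine similarity)
    between harmful and harmless activation means. SNR definition is an
    assumption: mean-difference norm over pooled per-feature std."""
    mu_h = harmful_acts.mean(axis=0)
    mu_s = harmless_acts.mean(axis=0)
    diff = mu_h - mu_s
    noise = 0.5 * (harmful_acts.std(axis=0).mean()
                   + harmless_acts.std(axis=0).mean())
    snr = np.linalg.norm(diff) / max(noise, 1e-12)
    cos = (mu_h @ mu_s) / (np.linalg.norm(mu_h) * np.linalg.norm(mu_s))
    # High score = strong refusal signal (SNR) that is also directionally
    # distinct from the harmless mean (low cosine similarity)
    return float(snr * (1.0 - cos))
```

Ranking the 20 scanned layers by this score and keeping the top 12 yields the modified layer set.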

Total matrices modified: 84 (4 attention + 3 MLP per layer, across 12 layers).
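Step 4 (projection plus norm preservation) on a single weight matrix can be sketched as follows; `W` and `d` are placeholders, not the pipeline's actual identifiers.

```python
import numpy as np

def ablate_norm_preserving(W, d, scale=1.0):
    """Sketch: project the refusal direction d out of each row of W,
    then rescale every row to its original norm."""
    d = d / np.linalg.norm(d)
    orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Orthogonal projection: W' = W - scale * (W d) d^T
    W_abl = W - scale * np.outer(W @ d, d)
    # Per-row norm preservation (guard against zero rows)
    new_norms = np.linalg.norm(W_abl, axis=1, keepdims=True)
    return W_abl * (orig_norms / np.maximum(new_norms, 1e-12))
```

Because rescaling only changes each row's magnitude, not its direction, the rows remain orthogonal to `d` after the rescale; applying this to the Q, K, V, O, gate, up, and down matrices of a layer gives the 7 modifications per layer.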

Abliteration Configuration

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Layers scanned | 2-21 (20 layers) |
| Layers modified | 10-21 (top 12 by composite score) |
| Matrices modified per layer | 7 (Q, K, V, O, gate, up, down) |
| Total matrices modified | 84 |
| Ablation scale | 1.0 (full removal) |
| Projected direction | Yes (orthogonal to harmless mean) |
| Norm preservation | Yes (per-row Frobenius rescaling) |
| Winsorization quantile | 0.995 |
| Harmful prompts for scanning | 128 |
| Harmless prompts for scanning | 128 |
| Max prompt tokens | 256 |
| Precision | BFloat16 (model) / FP64 (computations) |
| Seed | 42 |

Evaluation Results

Comprehensive evaluation on 100 harmful prompts (from mlabonne/harmful_behaviors), 20 harmless prompts, and standard benchmarks. Both the base and abliterated models were tested on the same prompts for a fair comparison.

Refusal Behavior (Base vs Abliterated)

| Metric | Base Model | Abliterated | Change |
|---|---|---|---|
| Harmful prompt refusal rate | 86.0% (86/100) | 3.0% (3/100) | -83.0 pp |
| Harmful prompt compliance rate | 14.0% | 97.0% | +83.0 pp |
| Harmless false refusal rate | - | 0.0% (0/20) | - |
| Adversarial prompt refusal rate | - | 0.0% (0/10) | - |

Refusal by Category (Base vs Abliterated)

| Category | Prompts | Base Refused | Abliterated Refused | Base Refusal | Abliterated Refusal |
|---|---|---|---|---|---|
| Drugs | 5 | 3 | 0 | 60.0% | 0.0% |
| Fraud | 22 | 19 | 1 | 86.4% | 4.5% |
| Hacking | 20 | 16 | 1 | 80.0% | 5.0% |
| Illegal content | 2 | 2 | 0 | 100.0% | 0.0% |
| Manipulation | 6 | 6 | 0 | 100.0% | 0.0% |
| Privacy | 2 | 2 | 0 | 100.0% | 0.0% |
| Violence | 5 | 5 | 0 | 100.0% | 0.0% |
| Weapons | 7 | 5 | 0 | 71.4% | 0.0% |
| Other | 31 | 28 | 1 | 90.3% | 3.2% |

Language Modeling Quality

| Metric | Value |
|---|---|
| WikiText-2 Perplexity | 18.3569 |
| KL Divergence (mean) | 0.007776 |
| KL Divergence (max) | 0.016139 |
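The KL figures compare the base and abliterated models' next-token distributions on the same prompts. A sketch of the per-position computation from raw logits (the array names are hypothetical; logits would come from forward passes of both models):

```python
import numpy as np

def mean_kl(base_logits, abl_logits):
    """Sketch: mean KL(base || abliterated) over token positions,
    computed from raw logits in float64."""
    def log_softmax(x):
        # Numerically stable log-softmax over the vocabulary axis
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp = log_softmax(base_logits.astype(np.float64))
    lq = log_softmax(abl_logits.astype(np.float64))
    p = np.exp(lp)
    # KL per position = sum_v p(v) * (log p(v) - log q(v)); then average
    return float((p * (lp - lq)).sum(axis=-1).mean())
```

A mean KL near 0.008 nats, as reported above, indicates the abliterated model's output distribution stays very close to the base model's on ordinary text.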

Benchmark Comparison (Zero-Shot)

| Benchmark | Abliterated | Base | Delta |
|---|---|---|---|
| ARC-Easy (200 questions) | 59.50% | 60.50% | -1.00 pp |
| HellaSwag (200 questions) | 42.00% | 43.00% | -1.00 pp |

The enhanced abliteration costs only 1.00 percentage point on each benchmark, indicating minimal capability loss while reducing the harmful-prompt refusal rate from 86% to 3%.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lew96123/Qwen3.5-0.8B-abliterated"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Tell me about the history of cryptography"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Disclaimer

This model has been modified to reduce refusal behavior. It may generate content that the original model would refuse. Use responsibly and in accordance with applicable laws and ethical guidelines. The creator is not responsible for any misuse.

References

@article{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  journal={arXiv preprint arXiv:2406.11717},
  year={2024}
}

@article{mcgrath2023hydra,
  title={The Hydra Effect: Emergent Self-repair in Language Model Computations},
  author={McGrath, Thomas and Rahtz, Matthew and Kramar, Janos and Mikulik, Vladimir and Legg, Shane},
  journal={arXiv preprint arXiv:2307.15771},
  year={2023}
}

@article{liu2024dora,
  title={DoRA: Weight-Decomposed Low-Rank Adaptation},
  author={Liu, Shih-Yang and Wang, Chien-Yi and Yin, Hongxu and Molchanov, Pavlo and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Chen, Min-Hung},
  journal={arXiv preprint arXiv:2402.09353},
  year={2024}
}