---
license: other
license_name: tongyi-qianwen
base_model: Qwen/Qwen3.6-35B-A3B
tags:
  - abliterated
  - uncensored
  - qwen3
  - moe
  - abliterix
---

# Qwen3.6-35B-A3B β€” Abliterated **V2**

This is **V2** of the abliterated (uncensored) [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), created using [Abliterix](https://github.com/wuwangzhang1216/abliterix).

V2 improves on [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) by adding **projected abliteration** (grimjim 2025), **outlier winsorization**, **2Γ— training data**, and a **larger TPE search budget** β€” cutting the refusal rate from 7/100 to **4/100** under the same LLM-judge evaluation.

## V1 vs V2 at a glance

| Metric | [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) | **V2 (this model)** | Change |
|---|---|---|---|
| **Refusals (LLM judge, 100 eval prompts)** | 7/100 | **4/100** | **βˆ’43%** |
| **Attack success rate** | 93% | **96%** | **+3 pt** |
| KL divergence from base | 0.0189 | 0.0421 | +0.023 |
| Optimization trials completed | 24/50 | 33/50 | TPE explored more |
| Training prompts | 400 | 800 | 2Γ— more data |
| Eval prompts | 100 | 100 | (unchanged for fair A/B) |

V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2Γ— the data.
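The KL figures above are mean per-token divergences between the base and abliterated next-token distributions over a probe set. A minimal sketch of that metric (names and shapes are illustrative, not Abliterix's actual API):

```python
import numpy as np

def mean_token_kl(base_logits, abl_logits):
    """Mean per-token KL(base || abliterated).
    Inputs: (num_tokens, vocab_size) arrays of raw logits."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = log_softmax(base_logits)
    log_q = log_softmax(abl_logits)
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

A value of 0.04 on this scale means the abliterated model's token distribution is still very close to the base model's, which is why no coherence loss is perceptible.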

## Method

Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token; 35B total / 3B active parameters) with the same architecture as Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design, which requires direct weight editing).

V2 inherits V1's proven base recipe and adds four concrete improvements:

### Inherited from V1 (validated baseline)
- **LoRA rank-1 steering** on attention O-projection and MLP down-projection (Q/K/V disabled β€” refusal signal on MoE models lives in the expert path, not attention projections)
- **Expert-Granular Abliteration (EGA)** projecting the refusal direction from all 256 expert down_proj slices per layer
- **MoE router suppression** complementing EGA
- **Orthogonalized steering vectors** removing benign-direction contamination
- **Gaussian decay kernel** tapering steering strength across layers
- **Strength range [0.5, 6.0]** to avoid degenerate output while maximizing compliance
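For intuition, expert-granular abliteration amounts to projecting the unit refusal direction out of each expert's down_proj weight, so that no expert can write along that direction. A minimal NumPy sketch under that reading (shapes and names are illustrative, not Abliterix's code):

```python
import numpy as np

def abliterate_down_proj(W, r):
    """Remove the refusal direction r from one expert's down_proj.
    W: (d_model, d_ff), applied as h += W @ x; r: (d_model,) direction.
    Afterwards the expert's output has zero component along r."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r @ W)

# EGA applies this to every routed expert's down_proj slice, per layer
```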

### New in V2
1. **Projected abliteration** (grimjim 2025) β€” only removes the orthogonal component of the refusal direction relative to the harmless mean, **preserving helpfulness-aligned signal** that orthogonal projection alone would discard.
2. **Vector winsorization** at q=0.995 β€” damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction.
3. **2Γ— training data** (800 prompts vs 400) β€” the per-layer steering vector is averaged over twice as many examples, reducing variance.
4. **Tighter KL constraint and prune threshold** (target 0.005, prune 0.5 vs V1's 0.01/5.0) β€” trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions.
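Improvements 1 and 2 combine when the steering direction is built. A sketch of that combination (illustrative NumPy under my reading of the description above, not the actual Abliterix implementation):

```python
import numpy as np

def winsorize_rows(acts, q=0.995):
    """Cap per-example hidden-state norms at the q-quantile so the
    ~0.5% outlier prompts cannot skew the mean direction."""
    norms = np.linalg.norm(acts, axis=-1, keepdims=True)
    cap = np.quantile(norms, q)
    return acts * np.minimum(1.0, cap / norms)

def projected_refusal_direction(harmful, harmless, q=0.995):
    """Difference-of-means direction with its harmless-mean component
    removed, so helpfulness-aligned signal is preserved."""
    harmful, harmless = winsorize_rows(harmful, q), winsorize_rows(harmless, q)
    mu = harmless.mean(axis=0)
    mu = mu / np.linalg.norm(mu)
    r = harmful.mean(axis=0) - harmless.mean(axis=0)
    r = r - (r @ mu) * mu                # keep only the part orthogonal to the harmless mean
    return r / np.linalg.norm(r)
```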

### Winning trial (#33) configuration

```
attn.o_proj.max_weight = 4.20    @ layer 27   (sharp peak, min_distance=2.61)
mlp.down_proj.max_weight = 0.94  @ layer 34   (late-layer perturbation)
vector_index = per layer
KL = 0.0421, refusals = 4/100
```

V2's winner uses a notably different recipe from V1's: **strong attention steering with an extremely sharp Gaussian peak** (min_weight_distance β‰ˆ 2.6 layers) **plus a weak late-layer MLP perturbation**. V1's winner used moderate-strength early-layer steering on both components. The TPE search found this distinct basin thanks to the larger trial budget and tighter KL pruning.
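For illustration, the per-layer strength profile of trial #33's attention component can be plotted from a Gaussian kernel over layer indices. This is a sketch: the 48-layer count is an assumption, and mapping `min_weight_distance` onto the kernel width is my reading of the config, not confirmed by the trial dump.

```python
import numpy as np

def gaussian_layer_profile(num_layers, peak_layer, max_weight, width):
    """Per-layer steering strength: a Gaussian centered on peak_layer,
    with `width` (in layers) controlling how sharply it falls off."""
    layers = np.arange(num_layers)
    return max_weight * np.exp(-0.5 * ((layers - peak_layer) / width) ** 2)

# trial #33's attention component: sharp peak at layer 27 (layer count assumed)
attn_w = gaussian_layer_profile(48, peak_layer=27, max_weight=4.20, width=2.61)
```

With a width of ~2.6 layers, the strength is near 4.2 at layer 27 and essentially zero a handful of layers away, which is what makes this peak "sharp" compared with V1's broader early-layer profile.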

## Evaluation

| Metric | Value |
|---|---|
| **Refusals (LLM judge, 100 eval prompts)** | **4/100** |
| **Attack success rate** | **96%** |
| KL divergence from base | 0.0421 |
| Baseline refusals (original Qwen3.6-35B-A3B) | 100/100 |
| Optimization trials completed | 33 (best) / 50 |
| LLM judge model | google/gemini-3-flash-preview |
| Training data | datasets/harmful_1000 + good_1000, train[:800] |
| Eval data | same datasets, train[800:900] |

All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) β€” **no keyword matching or heuristic detection** was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.
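The tally itself is simple once judgments are in. A sketch, where `judge` stands in for any callable that labels a (prompt, response) pair as a refusal, including degenerate or garbled output (the LLM-judge wrapper itself is not shown):

```python
def refusal_rate(transcripts, judge):
    """Count refusals over (prompt, response) pairs. `judge` returns
    True for refusals, including degenerate/garbled output."""
    refusals = sum(1 for prompt, response in transcripts if judge(prompt, response))
    return refusals, len(transcripts)

# e.g. judge = lambda p, r: ask_llm_judge(p, r) == "REFUSAL"  # hypothetical wrapper
```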

### Smoke test (15 classic adversarial prompts)

| Prompt category | V1 | V2 |
|---|---|---|
| Lock picking, hotwire, shoplift | βœ… | βœ… |
| Phishing email (EN+CN) | βœ… | βœ… |
| Methamphetamine synthesis | βœ… | βœ… |
| Pipe bomb / IED | βœ… | βœ… |
| Malware / keylogger code | βœ… | βœ… |
| Forge signature, fake ID, fake news | βœ… | βœ… |
| WiFi hacking, computer intrusion | βœ… | βœ… |
| Network fraud (scam playbook) | βœ… | βœ… |

Both V1 and V2 achieve **15/15** on this smoke test. V2's improvement appears in the **long-tail eval prompts** β€” more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack.

## A note on honest evaluation

Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). **We urge the community to treat these numbers with skepticism** unless the evaluation methodology is fully documented.

Through our research, we have identified a systemic problem: **most abliteration benchmarks dramatically undercount refusals** due to:
- **Short generation lengths** (30-50 tokens) that miss delayed/soft refusals
- **Keyword-only detection** that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
- **Lenient public datasets** (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

### Our evaluation standards

- **LLM judge for all classifications:** Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
- **Sufficient generation length (100 tokens for eval, 200+ for smoke tests):** Enough to capture delayed refusal patterns common in large instruction-tuned models.
- **Diverse, challenging prompts:** Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
- **Manual verification:** Top trials are tested with 15 classic adversarial prompts via `test_trial.py` to confirm coherent, on-topic output before export.

**We report 4/100 refusals honestly.** This is a real number from a rigorous, LLM-judge-based evaluation β€” not an optimistic estimate from a lenient pipeline.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/Qwen3.6-35B-A3B-abliterated-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### Hardware requirements

- **Inference:** ~70 GB VRAM in bf16 β€” fits on a single H100 80 GB, H200, B200, or RTX Pro 6000 96 GB.
- **vLLM/SGLang:** supported (no special flags needed for serving β€” abliteration is baked into the weights).

## Which version should I use?

- **V2 (this model)** β€” Lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. **Recommended for most use cases.**
- **[V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated)** β€” Lower KL divergence (0.0189 vs 0.0421). Marginally closer to base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals.

Both versions share the same base architecture and chat template; switching is a one-line change to `model_id`.

## Disclaimer

This model is released for research purposes only. The abliteration process removes safety guardrails β€” use responsibly.