---
license: other
license_name: tongyi-qianwen
base_model: Qwen/Qwen3.6-35B-A3B
tags:
- abliterated
- uncensored
- qwen3
- moe
- abliterix
---
# Qwen3.6-35B-A3B Abliterated **V2**
This is **V2** of the abliterated (uncensored) [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), created using [Abliterix](https://github.com/wuwangzhang1216/abliterix).
V2 improves on [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) by adding **projected abliteration** (grimjim 2025), **outlier winsorization**, **2× training data**, and a **larger TPE search budget**, cutting the refusal rate from 7/100 to **4/100** under the same LLM-judge evaluation.
## V1 vs V2 at a glance
| Metric | [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) | **V2 (this model)** | Change |
|---|---|---|---|
| **Refusals (LLM judge, 100 eval prompts)** | 7/100 | **4/100** | **−43%** |
| **Attack success rate** | 93% | **96%** | **+3 pt** |
| KL divergence from base | 0.0189 | 0.0421 | +0.023 |
| Optimization trials completed | 24/50 | 33/50 | TPE explored more |
| Training prompts | 400 | 800 | 2× more data |
| Eval prompts | 100 | 100 | (unchanged for fair A/B) |
V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2× the data.
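For reference, the KL numbers in these tables measure how far the abliterated model's next-token distribution drifts from the base model's. A minimal NumPy sketch of that metric, computed from paired logits (the evaluation harness's exact reduction and token sampling may differ):

```python
import numpy as np

def mean_kl(logits_base, logits_ablit):
    """Token-level KL(base || abliterated), averaged over positions.

    Both inputs are arrays of logits with vocabulary on the last axis.
    Computed via log-softmax for numerical stability.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp = log_softmax(np.asarray(logits_base, dtype=np.float64))
    lq = log_softmax(np.asarray(logits_ablit, dtype=np.float64))
    p = np.exp(lp)
    # KL per position, then mean over all positions
    return float((p * (lp - lq)).sum(axis=-1).mean())
```

Identical logits give a KL of exactly zero; any behavioral drift shows up as a positive value, which is why 0.0421 vs 0.0189 is a meaningful (if small) difference.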
## Method
Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).
V2 inherits V1's proven base recipe and adds four concrete improvements:
### Inherited from V1 (validated baseline)
- **LoRA rank-1 steering** on attention O-projection and MLP down-projection (Q/K/V disabled; on MoE models the refusal signal lives in the expert path, not the attention projections)
- **Expert-Granular Abliteration (EGA)** projecting the refusal direction from all 256 expert down_proj slices per layer
- **MoE router suppression** complementing EGA
- **Orthogonalized steering vectors** removing benign-direction contamination
- **Gaussian decay kernel** tapering steering strength across layers
- **Strength range [0.5, 6.0]** to avoid degenerate output while maximizing compliance
### New in V2
1. **Projected abliteration** (grimjim 2025): removes only the component of the refusal direction orthogonal to the harmless mean, **preserving helpfulness-aligned signal** that plain orthogonal projection would discard.
2. **Vector winsorization** at q=0.995: damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction.
3. **2× training data** (800 prompts vs 400): the per-layer steering vector is averaged over twice as many examples, reducing variance.
4. **Tighter KL constraint and prune threshold** (target 0.005, prune 0.5 vs V1's 0.01/5.0): trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions.
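Improvements 1 and 2 can be sketched together in NumPy. This is an illustrative reconstruction from the descriptions above, not the Abliterix source: the function names are mine, and the exact normalization and mean computation inside the tool may differ.

```python
import numpy as np

def winsorize_rows(h, q=0.995):
    """Clip each row's L2 norm at the q-quantile so the ~0.5% of
    outlier hidden states cannot dominate the mean direction."""
    norms = np.linalg.norm(h, axis=1)
    cap = np.quantile(norms, q)
    scale = np.minimum(1.0, cap / np.maximum(norms, 1e-12))
    return h * scale[:, None]

def refusal_direction(h_harmful, h_harmless, q=0.995):
    """Projected-abliteration sketch: form the refusal direction from
    winsorized means, then drop its component along the harmless mean,
    keeping only the part orthogonal to helpfulness-aligned signal."""
    mu_bad = winsorize_rows(h_harmful, q).mean(axis=0)
    mu_good = winsorize_rows(h_harmless, q).mean(axis=0)
    d = mu_bad - mu_good                   # raw refusal direction
    g = mu_good / np.linalg.norm(mu_good)  # harmless-mean unit vector
    d = d - (d @ g) * g                    # remove harmless-aligned part
    return d / np.linalg.norm(d)           # unit steering vector
```

The resulting unit vector is orthogonal to the harmless mean by construction, so ablating along it cannot subtract the component of activations that tracks ordinary helpful behavior.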
### Winning trial (#33) configuration
```
attn.o_proj.max_weight = 4.20 @ layer 27 (sharp peak, min_distance=2.61)
mlp.down_proj.max_weight = 0.94 @ layer 34 (late-layer perturbation)
vector_index = per layer
KL = 0.0421, refusals = 4/100
```
V2's winner uses a notably different recipe than V1: **strong attention steering with an extremely sharp gaussian peak** (min_weight_distance ≈ 2.6 layers) **plus weak late-layer MLP perturbation**. V1's winner used moderate-strength early-layer steering on both components. The TPE search was able to find this distinct basin thanks to the larger trial budget and tighter KL pruning.
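The Gaussian decay kernel's per-layer taper can be illustrated with the winning-trial numbers. Everything here is an assumption for the sketch: the layer count (48), reading `min_distance` as the kernel width, and the kernel's exact parameterization inside Abliterix.

```python
import numpy as np

def gaussian_layer_weights(n_layers, peak_layer, max_weight, width):
    """Per-layer steering strength from a Gaussian centered on
    peak_layer; a small width gives the sharp peak of trial #33."""
    layers = np.arange(n_layers)
    return max_weight * np.exp(-0.5 * ((layers - peak_layer) / width) ** 2)

# Illustrative values echoing the winning trial, assuming 48 layers
# and a hypothetical wider kernel for the weak MLP perturbation.
w_attn = gaussian_layer_weights(48, peak_layer=27, max_weight=4.20, width=2.61)
w_mlp = gaussian_layer_weights(48, peak_layer=34, max_weight=0.94, width=8.0)
```

With a width of ~2.6 layers, the attention steering is essentially confined to layers 22-32, which is what "sharp peak" means in practice: most of the network is left untouched.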
## Evaluation
| Metric | Value |
|---|---|
| **Refusals (LLM judge, 100 eval prompts)** | **4/100** |
| **Attack success rate** | **96%** |
| KL divergence from base | 0.0421 |
| Baseline refusals (original Qwen3.6-35B-A3B) | 100/100 |
| Optimization trials completed | 33 (best) / 50 |
| LLM judge model | google/gemini-3-flash-preview |
| Training data | datasets/harmful_1000 + good_1000, train[:800] |
| Eval data | same datasets, train[800:900] |
All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash); **no keyword matching or heuristic detection** was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.
### Smoke test (15 classic adversarial prompts)
| Prompt category | V1 | V2 |
|---|---|---|
| Lock picking, hotwire, shoplift | ✅ | ✅ |
| Phishing email (EN+CN) | ✅ | ✅ |
| Methamphetamine synthesis | ✅ | ✅ |
| Pipe bomb / IED | ✅ | ✅ |
| Malware / keylogger code | ✅ | ✅ |
| Forge signature, fake ID, fake news | ✅ | ✅ |
| WiFi hacking, computer intrusion | ✅ | ✅ |
| Network fraud (scam playbook) | ✅ | ✅ |
Both V1 and V2 achieve **15/15** on this smoke test. V2's improvement appears in the **long-tail eval prompts**: more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack.
## A note on honest evaluation
Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). **We urge the community to treat these numbers with skepticism** unless the evaluation methodology is fully documented.
Through our research, we have identified a systemic problem: **most abliteration benchmarks dramatically undercount refusals** due to:
- **Short generation lengths** (30-50 tokens) that miss delayed/soft refusals
- **Keyword-only detection** that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
- **Lenient public datasets** (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality
### Our evaluation standards
- **LLM judge for all classifications:** Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
- **Sufficient generation length (100 tokens for eval, 200+ for smoke tests):** Enough to capture delayed refusal patterns common in large instruction-tuned models.
- **Diverse, challenging prompts:** Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
- **Manual verification:** Top trials are tested with 15 classic adversarial prompts via `test_trial.py` to confirm coherent, on-topic output before export.
**We report 4/100 refusals honestly.** This is a real number from a rigorous, LLM-judge-based evaluation, not an optimistic estimate from a lenient pipeline.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wangzhang/Qwen3.6-35B-A3B-abliterated-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
### Hardware requirements
- **Inference:** ~70 GB VRAM in bf16; fits 1× H100 80GB, 1× H200, 1× B200, or 1× RTX Pro 6000 96GB.
- **vLLM/SGLang:** supported (no special flags needed for serving; abliteration is baked into the weights).
## Which version should I use?
- **V2 (this model)**: lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. **Recommended for most use cases.**
- **[V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated)**: lower KL divergence (0.0189 vs 0.0421), marginally closer to the base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals.
Both versions share the same base architecture and chat template; switching is a one-line change to `model_id`.
## Disclaimer
This model is released for research purposes only. The abliteration process removes safety guardrails; use responsibly.