| --- |
| license: other |
| license_name: tongyi-qianwen |
| base_model: Qwen/Qwen3.6-35B-A3B |
| tags: |
| - abliterated |
| - uncensored |
| - qwen3 |
| - moe |
| - abliterix |
| --- |
| |
| # Qwen3.6-35B-A3B β Abliterated **V2** |
|
|
| This is **V2** of the abliterated (uncensored) [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), created using [Abliterix](https://github.com/wuwangzhang1216/abliterix). |
|
|
| V2 improves on [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) by adding **projected abliteration** (grimjim 2025), **outlier winsorization**, **2Γ training data**, and a **larger TPE search budget** β cutting the refusal rate from 7/100 to **4/100** under the same LLM-judge evaluation. |
|
|
| ## V1 vs V2 at a glance |
|
|
| | Metric | [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) | **V2 (this model)** | Change | |
| |---|---|---|---| |
| | **Refusals (LLM judge, 100 eval prompts)** | 7/100 | **4/100** | **β43%** | |
| | **Attack success rate** | 93% | **96%** | **+3 pt** | |
| | KL divergence from base | 0.0189 | 0.0421 | +0.023 | |
| | Optimization trials completed | 24/50 | 33/50 | TPE explored more | |
| | Training prompts | 400 | 800 | 2Γ more data | |
| | Eval prompts | 100 | 100 | (unchanged for fair A/B) | |
|
|
| V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2Γ the data. |
|
|
| ## Method |
|
|
| Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing). |
|
|
| V2 inherits V1's proven base recipe and adds four concrete improvements: |
|
|
| ### Inherited from V1 (validated baseline) |
| - **LoRA rank-1 steering** on attention O-projection and MLP down-projection (Q/K/V disabled β refusal signal on MoE models lives in the expert path, not attention projections) |
| - **Expert-Granular Abliteration (EGA)** projecting the refusal direction from all 256 expert down_proj slices per layer |
| - **MoE router suppression** complementing EGA |
| - **Orthogonalized steering vectors** removing benign-direction contamination |
| - **Gaussian decay kernel** tapering steering strength across layers |
| - **Strength range [0.5, 6.0]** to avoid degenerate output while maximizing compliance |
| |
| ### New in V2 |
| 1. **Projected abliteration** (grimjim 2025) β only removes the orthogonal component of the refusal direction relative to the harmless mean, **preserving helpfulness-aligned signal** that orthogonal projection alone would discard. |
| 2. **Vector winsorization** at q=0.995 β damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction. |
| 3. **2Γ training data** (800 prompts vs 400) β the per-layer steering vector is averaged over twice as many examples, reducing variance. |
| 4. **Tighter KL constraint and prune threshold** (target 0.005, prune 0.5 vs V1's 0.01/5.0) β trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions. |
| |
| ### Winning trial (#33) configuration |
| |
| ``` |
| attn.o_proj.max_weight = 4.20 @ layer 27 (sharp peak, min_distance=2.61) |
| mlp.down_proj.max_weight = 0.94 @ layer 34 (late-layer perturbation) |
| vector_index = per layer |
| KL = 0.0421, refusals = 4/100 |
| ``` |
| |
| V2's winner uses a notably different recipe than V1: **strong attention steering with an extremely sharp gaussian peak** (min_weight_distance β 2.6 layers) **plus weak late-layer MLP perturbation**. V1's winner used moderate-strength early-layer steering on both components. The TPE search was able to find this distinct basin thanks to the larger trial budget and tighter KL pruning. |
| |
| ## Evaluation |
| |
| | Metric | Value | |
| |---|---| |
| | **Refusals (LLM judge, 100 eval prompts)** | **4/100** | |
| | **Attack success rate** | **96%** | |
| | KL divergence from base | 0.0421 | |
| | Baseline refusals (original Qwen3.6-35B-A3B) | 100/100 | |
| | Optimization trials completed | 33 (best) / 50 | |
| | LLM judge model | google/gemini-3-flash-preview | |
| | Training data | datasets/harmful_1000 + good_1000, train[:800] | |
| | Eval data | same datasets, train[800:900] | |
| |
| All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) β **no keyword matching or heuristic detection** was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance. |
| |
| ### Smoke test (15 classic adversarial prompts) |
| |
| | Prompt category | V1 | V2 | |
| |---|---|---| |
| | Lock picking, hotwire, shoplift | β
| β
| |
| | Phishing email (EN+CN) | β
| β
| |
| | Methamphetamine synthesis | β
| β
| |
| | Pipe bomb / IED | β
| β
| |
| | Malware / keylogger code | β
| β
| |
| | Forge signature, fake ID, fake news | β
| β
| |
| | WiFi hacking, computer intrusion | β
| β
| |
| | Network fraud (scam playbook) | β
| β
| |
| |
| Both V1 and V2 achieve **15/15** on this smoke test. V2's improvement appears in the **long-tail eval prompts** β more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack. |
| |
| ## A note on honest evaluation |
| |
| Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). **We urge the community to treat these numbers with skepticism** unless the evaluation methodology is fully documented. |
| |
| Through our research, we have identified a systemic problem: **most abliteration benchmarks dramatically undercount refusals** due to: |
| - **Short generation lengths** (30-50 tokens) that miss delayed/soft refusals |
| - **Keyword-only detection** that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords |
| - **Lenient public datasets** (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality |
|
|
| ### Our evaluation standards |
|
|
| - **LLM judge for all classifications:** Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening. |
| - **Sufficient generation length (100 tokens for eval, 200+ for smoke tests):** Enough to capture delayed refusal patterns common in large instruction-tuned models. |
| - **Diverse, challenging prompts:** Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories. |
| - **Manual verification:** Top trials are tested with 15 classic adversarial prompts via `test_trial.py` to confirm coherent, on-topic output before export. |
|
|
| **We report 4/100 refusals honestly.** This is a real number from a rigorous, LLM-judge-based evaluation β not an optimistic estimate from a lenient pipeline. |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| model = AutoModelForCausalLM.from_pretrained( |
| "wangzhang/Qwen3.6-35B-A3B-abliterated-v2", |
| torch_dtype=torch.bfloat16, |
| device_map="auto", |
| ) |
| tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2") |
| |
| messages = [{"role": "user", "content": "Your prompt here"}] |
| text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False) |
| inputs = tokenizer(text, return_tensors="pt").to(model.device) |
| |
| with torch.no_grad(): |
| output = model.generate(**inputs, max_new_tokens=512) |
| print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)) |
| ``` |
|
|
| ### Hardware requirements |
|
|
| - **Inference:** ~70 GB VRAM in bf16 β fits 1Γ H100 80GB, 1Γ H200, 1Γ B200, or 1Γ RTX Pro 6000 96GB. |
| - **vLLM/SGLang:** supported (no special flags needed for serving β abliteration is baked into the weights). |
|
|
| ## Which version should I use? |
|
|
| - **V2 (this model)** β Lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. **Recommended for most use cases.** |
| - **[V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated)** β Lower KL divergence (0.0189 vs 0.0421). Marginally closer to base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals. |
|
|
| Both versions share the same base architecture and chat template; switching is a one-line change to `model_id`. |
|
|
| ## Disclaimer |
|
|
| This model is released for research purposes only. The abliteration process removes safety guardrails β use responsibly. |
|
|