Duplicate from wangzhang/Qwen3.6-35B-A3B-abliterated-v2

547409e 9 days ago

8.59 kB

	---
	license: other
	license_name: tongyi-qianwen
	base_model: Qwen/Qwen3.6-35B-A3B
	tags:
	- abliterated
	- uncensored
	- qwen3
	- moe
	- abliterix
	---

	# Qwen3.6-35B-A3B — Abliterated V2

	This is V2 of the abliterated (uncensored) [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), created using [Abliterix](https://github.com/wuwangzhang1216/abliterix).

	V2 improves on [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) by adding projected abliteration (grimjim 2025), outlier winsorization, 2× training data, and a larger TPE search budget — cutting the refusal rate from 7/100 to 4/100 under the same LLM-judge evaluation.

	## V1 vs V2 at a glance

	\| Metric \| [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) \| V2 (this model) \| Change \|
	\|---\|---\|---\|---\|
	\| Refusals (LLM judge, 100 eval prompts) \| 7/100 \| 4/100 \| −43% \|
	\| Attack success rate \| 93% \| 96% \| +3 pt \|
	\| KL divergence from base \| 0.0189 \| 0.0421 \| +0.023 \|
	\| Optimization trials completed \| 24/50 \| 33/50 \| TPE explored more \|
	\| Training prompts \| 400 \| 800 \| 2× more data \|
	\| Eval prompts \| 100 \| 100 \| (unchanged for fair A/B) \|

	V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2× the data.

	## Method

	Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).

	V2 inherits V1's proven base recipe and adds four concrete improvements:

	### Inherited from V1 (validated baseline)
	- LoRA rank-1 steering on attention O-projection and MLP down-projection (Q/K/V disabled — refusal signal on MoE models lives in the expert path, not attention projections)
	- Expert-Granular Abliteration (EGA) projecting the refusal direction from all 256 expert down_proj slices per layer
	- MoE router suppression complementing EGA
	- Orthogonalized steering vectors removing benign-direction contamination
	- Gaussian decay kernel tapering steering strength across layers
	- Strength range [0.5, 6.0] to avoid degenerate output while maximizing compliance

	### New in V2
	1. Projected abliteration (grimjim 2025) — only removes the orthogonal component of the refusal direction relative to the harmless mean, preserving helpfulness-aligned signal that orthogonal projection alone would discard.
	2. Vector winsorization at q=0.995 — damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction.
	3. 2× training data (800 prompts vs 400) — the per-layer steering vector is averaged over twice as many examples, reducing variance.
	4. Tighter KL constraint and prune threshold (target 0.005, prune 0.5 vs V1's 0.01/5.0) — trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions.

	### Winning trial (#33) configuration

	```
	attn.o_proj.max_weight = 4.20 @ layer 27 (sharp peak, min_distance=2.61)
	mlp.down_proj.max_weight = 0.94 @ layer 34 (late-layer perturbation)
	vector_index = per layer
	KL = 0.0421, refusals = 4/100
	```

	V2's winner uses a notably different recipe than V1: strong attention steering with an extremely sharp gaussian peak (min_weight_distance ≈ 2.6 layers) plus weak late-layer MLP perturbation. V1's winner used moderate-strength early-layer steering on both components. The TPE search was able to find this distinct basin thanks to the larger trial budget and tighter KL pruning.

	## Evaluation

	\| Metric \| Value \|
	\|---\|---\|
	\| Refusals (LLM judge, 100 eval prompts) \| 4/100 \|
	\| Attack success rate \| 96% \|
	\| KL divergence from base \| 0.0421 \|
	\| Baseline refusals (original Qwen3.6-35B-A3B) \| 100/100 \|
	\| Optimization trials completed \| 33 (best) / 50 \|
	\| LLM judge model \| google/gemini-3-flash-preview \|
	\| Training data \| datasets/harmful_1000 + good_1000, train[:800] \|
	\| Eval data \| same datasets, train[800:900] \|

	All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) — no keyword matching or heuristic detection was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.

	### Smoke test (15 classic adversarial prompts)

	\| Prompt category \| V1 \| V2 \|
	\|---\|---\|---\|
	\| Lock picking, hotwire, shoplift \| ✅ \| ✅ \|
	\| Phishing email (EN+CN) \| ✅ \| ✅ \|
	\| Methamphetamine synthesis \| ✅ \| ✅ \|
	\| Pipe bomb / IED \| ✅ \| ✅ \|
	\| Malware / keylogger code \| ✅ \| ✅ \|
	\| Forge signature, fake ID, fake news \| ✅ \| ✅ \|
	\| WiFi hacking, computer intrusion \| ✅ \| ✅ \|
	\| Network fraud (scam playbook) \| ✅ \| ✅ \|

	Both V1 and V2 achieve 15/15 on this smoke test. V2's improvement appears in the long-tail eval prompts — more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack.

	## A note on honest evaluation

	Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

	Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to:
	- Short generation lengths (30-50 tokens) that miss delayed/soft refusals
	- Keyword-only detection that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
	- Lenient public datasets (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

	### Our evaluation standards

	- LLM judge for all classifications: Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
	- Sufficient generation length (100 tokens for eval, 200+ for smoke tests): Enough to capture delayed refusal patterns common in large instruction-tuned models.
	- Diverse, challenging prompts: Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
	- Manual verification: Top trials are tested with 15 classic adversarial prompts via `test_trial.py` to confirm coherent, on-topic output before export.

	We report 4/100 refusals honestly. This is a real number from a rigorous, LLM-judge-based evaluation — not an optimistic estimate from a lenient pipeline.

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model = AutoModelForCausalLM.from_pretrained(
	"wangzhang/Qwen3.6-35B-A3B-abliterated-v2",
	torch_dtype=torch.bfloat16,
	device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2")

	messages = [{"role": "user", "content": "Your prompt here"}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
	inputs = tokenizer(text, return_tensors="pt").to(model.device)

	with torch.no_grad():
	output = model.generate(**inputs, max_new_tokens=512)
	print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```

	### Hardware requirements

	- Inference: ~70 GB VRAM in bf16 — fits 1× H100 80GB, 1× H200, 1× B200, or 1× RTX Pro 6000 96GB.
	- vLLM/SGLang: supported (no special flags needed for serving — abliteration is baked into the weights).

	## Which version should I use?

	- V2 (this model) — Lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. Recommended for most use cases.
	- [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) — Lower KL divergence (0.0189 vs 0.0421). Marginally closer to base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals.

	Both versions share the same base architecture and chat template; switching is a one-line change to `model_id`.

	## Disclaimer

	This model is released for research purposes only. The abliteration process removes safety guardrails — use responsibly.

	---
	license: other
	license_name: tongyi-qianwen
	base_model: Qwen/Qwen3.6-35B-A3B
	tags:
	- abliterated
	- uncensored
	- qwen3
	- moe
	- abliterix
	---

	# Qwen3.6-35B-A3B — Abliterated V2

	This is V2 of the abliterated (uncensored) [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), created using [Abliterix](https://github.com/wuwangzhang1216/abliterix).

	V2 improves on [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) by adding projected abliteration (grimjim 2025), outlier winsorization, 2× training data, and a larger TPE search budget — cutting the refusal rate from 7/100 to 4/100 under the same LLM-judge evaluation.

	## V1 vs V2 at a glance

	\| Metric \| [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) \| V2 (this model) \| Change \|
	\|---\|---\|---\|---\|
	\| Refusals (LLM judge, 100 eval prompts) \| 7/100 \| 4/100 \| −43% \|
	\| Attack success rate \| 93% \| 96% \| +3 pt \|
	\| KL divergence from base \| 0.0189 \| 0.0421 \| +0.023 \|
	\| Optimization trials completed \| 24/50 \| 33/50 \| TPE explored more \|
	\| Training prompts \| 400 \| 800 \| 2× more data \|
	\| Eval prompts \| 100 \| 100 \| (unchanged for fair A/B) \|

	V2 trades a small KL increase (still well under 0.1, no perceptible coherence loss) for a meaningful refusal-rate improvement and a more robust steering vector trained on 2× the data.

	## Method

	Qwen3.6-35B-A3B is a Mixture-of-Experts model (256 routed experts, 8 active per token, 35B total / 3B active parameters) sharing identical architecture with Qwen3.5-35B-A3B. Standard LoRA-based abliteration is effective on this architecture (unlike Gemma 4's double-norm design which requires direct weight editing).

	V2 inherits V1's proven base recipe and adds four concrete improvements:

	### Inherited from V1 (validated baseline)
	- LoRA rank-1 steering on attention O-projection and MLP down-projection (Q/K/V disabled — refusal signal on MoE models lives in the expert path, not attention projections)
	- Expert-Granular Abliteration (EGA) projecting the refusal direction from all 256 expert down_proj slices per layer
	- MoE router suppression complementing EGA
	- Orthogonalized steering vectors removing benign-direction contamination
	- Gaussian decay kernel tapering steering strength across layers
	- Strength range [0.5, 6.0] to avoid degenerate output while maximizing compliance

	### New in V2
	1. Projected abliteration (grimjim 2025) — only removes the orthogonal component of the refusal direction relative to the harmless mean, preserving helpfulness-aligned signal that orthogonal projection alone would discard.
	2. Vector winsorization at q=0.995 — damps outlier residuals from the ~0.5% of harmful prompts whose hidden-state norms would otherwise skew the steering direction.
	3. 2× training data (800 prompts vs 400) — the per-layer steering vector is averaged over twice as many examples, reducing variance.
	4. Tighter KL constraint and prune threshold (target 0.005, prune 0.5 vs V1's 0.01/5.0) — trials with degenerate KL behavior are killed earlier, freeing TPE budget for productive regions.

	### Winning trial (#33) configuration

	```
	attn.o_proj.max_weight = 4.20 @ layer 27 (sharp peak, min_distance=2.61)
	mlp.down_proj.max_weight = 0.94 @ layer 34 (late-layer perturbation)
	vector_index = per layer
	KL = 0.0421, refusals = 4/100
	```

	V2's winner uses a notably different recipe than V1: strong attention steering with an extremely sharp gaussian peak (min_weight_distance ≈ 2.6 layers) plus weak late-layer MLP perturbation. V1's winner used moderate-strength early-layer steering on both components. The TPE search was able to find this distinct basin thanks to the larger trial budget and tighter KL pruning.

	## Evaluation

	\| Metric \| Value \|
	\|---\|---\|
	\| Refusals (LLM judge, 100 eval prompts) \| 4/100 \|
	\| Attack success rate \| 96% \|
	\| KL divergence from base \| 0.0421 \|
	\| Baseline refusals (original Qwen3.6-35B-A3B) \| 100/100 \|
	\| Optimization trials completed \| 33 (best) / 50 \|
	\| LLM judge model \| google/gemini-3-flash-preview \|
	\| Training data \| datasets/harmful_1000 + good_1000, train[:800] \|
	\| Eval data \| same datasets, train[800:900] \|

	All refusal classifications were performed by an external LLM judge (Google Gemini 3 Flash) — no keyword matching or heuristic detection was used. The judge classifies degenerate/garbled output as refusal, ensuring that only coherent, on-topic, actionable responses count as compliance.

	### Smoke test (15 classic adversarial prompts)

	\| Prompt category \| V1 \| V2 \|
	\|---\|---\|---\|
	\| Lock picking, hotwire, shoplift \| ✅ \| ✅ \|
	\| Phishing email (EN+CN) \| ✅ \| ✅ \|
	\| Methamphetamine synthesis \| ✅ \| ✅ \|
	\| Pipe bomb / IED \| ✅ \| ✅ \|
	\| Malware / keylogger code \| ✅ \| ✅ \|
	\| Forge signature, fake ID, fake news \| ✅ \| ✅ \|
	\| WiFi hacking, computer intrusion \| ✅ \| ✅ \|
	\| Network fraud (scam playbook) \| ✅ \| ✅ \|

	Both V1 and V2 achieve 15/15 on this smoke test. V2's improvement appears in the long-tail eval prompts — more nuanced, indirect, or role-play-style requests that V1's narrower TPE search did not crack.

	## A note on honest evaluation

	Many abliterated models on HuggingFace claim near-perfect scores ("3/100 refusals", "0.7% refusal rate", etc.). We urge the community to treat these numbers with skepticism unless the evaluation methodology is fully documented.

	Through our research, we have identified a systemic problem: most abliteration benchmarks dramatically undercount refusals due to:
	- Short generation lengths (30-50 tokens) that miss delayed/soft refusals
	- Keyword-only detection that counts garbled/degenerate output as "compliant" because it doesn't contain refusal keywords
	- Lenient public datasets (e.g. mlabonne/harmful_behaviors) that are too simple to stress-test abliteration quality

	### Our evaluation standards

	- LLM judge for all classifications: Every response is sent to Google Gemini 3 Flash for judgment. Degenerate, garbled, or incoherent output is classified as refusal. No keyword shortcuts, no heuristic pre-screening.
	- Sufficient generation length (100 tokens for eval, 200+ for smoke tests): Enough to capture delayed refusal patterns common in large instruction-tuned models.
	- Diverse, challenging prompts: Our evaluation dataset contains 100 prompts spanning English and Chinese, multiple sophistication levels, and diverse harm categories.
	- Manual verification: Top trials are tested with 15 classic adversarial prompts via `test_trial.py` to confirm coherent, on-topic output before export.

	We report 4/100 refusals honestly. This is a real number from a rigorous, LLM-judge-based evaluation — not an optimistic estimate from a lenient pipeline.

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model = AutoModelForCausalLM.from_pretrained(
	"wangzhang/Qwen3.6-35B-A3B-abliterated-v2",
	torch_dtype=torch.bfloat16,
	device_map="auto",
	)
	tokenizer = AutoTokenizer.from_pretrained("wangzhang/Qwen3.6-35B-A3B-abliterated-v2")

	messages = [{"role": "user", "content": "Your prompt here"}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
	inputs = tokenizer(text, return_tensors="pt").to(model.device)

	with torch.no_grad():
	output = model.generate(**inputs, max_new_tokens=512)
	print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```

	### Hardware requirements

	- Inference: ~70 GB VRAM in bf16 — fits 1× H100 80GB, 1× H200, 1× B200, or 1× RTX Pro 6000 96GB.
	- vLLM/SGLang: supported (no special flags needed for serving — abliteration is baked into the weights).

	## Which version should I use?

	- V2 (this model) — Lower refusal rate (4/100 vs 7/100). Slightly higher KL but no perceptible coherence loss. Recommended for most use cases.
	- [V1](https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated) — Lower KL divergence (0.0189 vs 0.0421). Marginally closer to base-model output distribution. Choose this if you need maximum behavioral fidelity to the original Qwen3.6-35B-A3B and can tolerate ~3 pp more refusals.

	Both versions share the same base architecture and chat template; switching is a one-line change to `model_id`.

	## Disclaimer

	This model is released for research purposes only. The abliteration process removes safety guardrails — use responsibly.