Qwen3.5-35B-A3B-Uncensored-FernflowerAI
MLX 8-bit · Text + Vision + Thinking + Tool Calling
Apple Silicon native
Credit where it's due. This conversion is built on the work of LuffyTheFox (EvilEnginer), who discovered the corrupted tensors in Alibaba's Qwen3.5 weights, diagnosed the root cause, and wrote the Sig-ScaleSync repair that fixed them. The original repos: safetensors and GGUF.
What's this?
Qwen3.5-35B is a 35B-parameter MoE model from Alibaba that activates ~3B params per token. It's fast, it's smart, and it supports 262K context, vision, video, and multi-token prediction.
There's one problem: Alibaba shipped it with two broken tensors. Layers 36 and 37 have corrupted ssm_conv1d.weight values that cause the model to loop, garble code, and eventually collapse past ~50K tokens. No sampler setting fixes it.
LuffyTheFox found the bug, wrote a repair tool (Sig-ScaleSync), and released fixed weights. This repo is an MLX 8-bit conversion of those fixed weights, ready to run on Apple Silicon with full text, image, and video support.
Lineage
Qwen/Qwen3.5-35B-A3B (Alibaba Cloud)
└─ HauhauCS Uncensored (0/465 refusals, lossless)
└─ LuffyTheFox FernflowerAI (tensor repair via Sig-ScaleSync)
└─ This repo (MLX 8-bit, text + vision)
The Bug
Two tensors out of 502 carry corrupted weights: blk.36.ssm_conv1d.weight and blk.37.ssm_conv1d.weight. Their scale (standard deviation) runs ~60% higher than the median of their peer group, at 0.102 vs 0.063.
Why it happens: AdamW optimizer + MoE routing + DeltaNet's recurrent architecture. Rare experts in the final layers get an outsized effective learning rate. Weights drift. In DeltaNet's recurrence, the corruption propagates forward through every subsequent token.
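As a toy illustration of why a scale error in a recurrent sublayer compounds (a simplified scalar recurrence, not the actual DeltaNet update):

```python
# Toy scalar recurrence: h_t = w * h_{t-1} + x_t
# A healthy weight (|w| < 1) keeps the state bounded; an inflated
# weight makes the state, and any error in it, grow with every token.

def run_recurrence(w, steps, x=1.0):
    h = 0.0
    for _ in range(steps):
        h = w * h + x
    return h

healthy = run_recurrence(0.9, 500)     # converges toward x / (1 - w) = 10
corrupted = run_recurrence(1.02, 500)  # diverges: drift compounds each step
print(healthy, corrupted)
```

This is why short prompts look fine while long contexts collapse: the drift only becomes visible after enough recurrence steps.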
What you see: Short prompts work fine. Around 50–70K tokens the model starts looping, repeating, inserting weird comments into code. By 100K it often fails outright. Tool calling breaks mid-session.
The Fix
Sig-ScaleSync compares each tensor's scale against the median of its peer group (same shape). A tensor gets flagged only if it exceeds the deviation threshold and shows weight saturation. This two-gate filter avoids false positives on architecturally asymmetric layers (gate inputs, FFN projections, etc.).
Out of 502 tensors, exactly 2 needed repair. The other 489 asymmetric tensors were left alone.
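The two-gate idea can be sketched in a few lines. This is illustrative only; the thresholds, field names, and rescaling step are assumptions, not the actual Sig-ScaleSync code:

```python
import statistics

def flag_and_repair(tensors, dev_threshold=1.5, sat_threshold=0.002):
    """tensors: dicts with 'name', 'std', 'saturation' for one peer group
    (tensors of the same shape). Returns the names of repaired tensors."""
    median_std = statistics.median(t["std"] for t in tensors)
    repaired = []
    for t in tensors:
        # Gate 1: scale deviates from the peer-group median
        deviates = t["std"] > dev_threshold * median_std
        # Gate 2: weight saturation is also elevated
        saturated = t["saturation"] > sat_threshold
        if deviates and saturated:
            t["std"] = median_std  # rescale toward the peer median
            repaired.append(t["name"])
    return repaired

# Hypothetical peer group using the stats quoted above (0.102 vs ~0.063 median)
peers = [
    {"name": "blk.33.ssm_conv1d.weight", "std": 0.061, "saturation": 0.0010},
    {"name": "blk.34.ssm_conv1d.weight", "std": 0.063, "saturation": 0.0010},
    {"name": "blk.35.ssm_conv1d.weight", "std": 0.065, "saturation": 0.0010},
    {"name": "blk.36.ssm_conv1d.weight", "std": 0.102, "saturation": 0.0025},
    {"name": "blk.37.ssm_conv1d.weight", "std": 0.102, "saturation": 0.0025},
]
repaired = flag_and_repair(peers)
print(repaired)  # only the two corrupted tensors are flagged
```

Requiring both gates is what keeps legitimately asymmetric layers (gate inputs, FFN projections) from being flagged: they may deviate in scale, but they don't also show saturation.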
| Tensor | Error reduction | Saturation (before → after) |
|---|---|---|
| blk.36.ssm_conv1d.weight | 88.6% | 0.0025 → 0.0010 |
| blk.37.ssm_conv1d.weight | 88.6% | 0.0025 → 0.0010 |
Verified against Gemma 4 26B A4B with zero false positives. The script doesn't invent problems.
Which models are affected?
| Model | Status |
|---|---|
| Qwen3.5-35B-A3B (all variants) | Broken (2 tensors), fixed here |
| Qwen3.5-27B (all incl. Unsloth) | Broken (8 tensors), fix experimental |
| Qwen3.5-122B-A10B | Healthy |
| Qwen3.5-9B, 4B, 2B, 0.8B | Likely affected, unconfirmed |
This isn't Qwen-specific. Any MoE model with recurrent sublayers (DeltaNet, Mamba) trained with AdamW can hit the same issue.
Uncensored
The HauhauCS Aggressive uncensored fine-tune is lossless. No dataset changes, no capability removal, 0/465 refusals. You get everything the original model was trained to do, just without the refusal behavior. It may occasionally append short disclaimers (baked into base training, not actual refusals), but the full response always generates.
This conversion
- Source: FernflowerAI safetensors (not GGUF) for maximum weight fidelity
- Quantization: 8-bit (8.6 bits/weight, 35 GB across 8 shards)
- Vision: Full support via `mlx-vlm`. Text, image, and video inputs work out of the box
- Thinking: Toggleable via `<|think_on|>` / `<|think_off|>` tags (see below)
- Tool calling: Works via the included Jinja chat template
- Requirements: `mlx-lm >= 0.31.2`, `mlx-vlm >= 0.4.4`
Architecture details
| Spec | Value |
|---|---|
| Total params | 35B |
| Active per token | ~3B (8 routed + 1 shared of 256 experts) |
| Attention | 3x DeltaNet-MoE + 1x Attention-MoE, 10 repetitions |
| Context | 262K native, 1M with YaRN |
| RoPE | theta 10M, partial_rotary_factor 0.25, mrope_interleaved |
| Vocab | 248K tokens, 201 languages |
| Multimodal | Text, image, video |
| Multi-token prediction | Supported |
| model_type | qwen3_5_moe |
Known issue
Gated DeltaNet decoding can run ~2.7x slower with non-vocabulary embeddings (mlx-lm#932). Normal text inference is unaffected.
Quick start
Text
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit")

# Recent mlx-lm versions take sampling settings via a sampler object
# rather than a temp= keyword on generate()
sampler = make_sampler(temp=0.7)
response = generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler)
print(response)
```
Vision
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit")
image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted = apply_chat_template(processor, model.config, prompt, num_images=len(image))
result = generate(model, processor, formatted, image, max_tokens=256, temp=0.7)
print(result.text)
```
CLI
```shell
# Text
mlx_lm.generate --model froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit --prompt "Hello"

# Vision
mlx_vlm.generate --model froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit --image image.jpg --prompt "Describe this image"
```
System prompt
The first line of your system prompt must be:

```
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
```

The model underperforms without it. You can append anything after that line: roleplay personas, custom instructions, whatever you need.

```
You are Qwen, created by Alibaba Cloud. You are a helpful assistant. Currently you are roleplaying as a grumpy but brilliant sysadmin.
```
Thinking toggle
This model ships with a Jinja chat template that lets you toggle thinking on the fly. Drop <|think_on|> or <|think_off|> anywhere in your system or user prompt. The template strips the tag from context and flips the thinking mode.
```
System: You are a coding assistant. <|think_off|>
User: What's 2+2?
```

The model answers fast, no internal reasoning.

```
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
```

The model thinks step by step, then answers.
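The toggle mechanics can be sketched as follows. This is a Python illustration of the behavior described above, not the bundled template itself (which is Jinja); the function name is hypothetical:

```python
import re

TOGGLE = re.compile(r"<\|think_(on|off)\|>")

def resolve_thinking(messages, default=True):
    """Scan messages for toggle tags, strip them from the context,
    and return (cleaned_messages, thinking_enabled). The last tag
    seen wins, matching 'drop it anywhere in your prompt'."""
    thinking = default
    cleaned = []
    for msg in messages:
        for m in TOGGLE.finditer(msg["content"]):
            thinking = m.group(1) == "on"
        cleaned.append({**msg, "content": TOGGLE.sub("", msg["content"]).strip()})
    return cleaned, thinking

msgs = [
    {"role": "system", "content": "You are a coding assistant. <|think_off|>"},
    {"role": "user", "content": "What's 2+2?"},
]
cleaned, thinking = resolve_thinking(msgs)
print(thinking)                # thinking disabled
print(cleaned[0]["content"])   # tag stripped from context
```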
Tool calling warning (LM Studio): LM Studio's internal parser crashes when the model generates a tool call inside its thinking block (#827, #1592). When using tools, always add `<|think_off|>` to your prompt.
Chat template
The bundled Jinja template fixes several issues with LM Studio's runtime:

- Replaces the broken `| items` dictionary filter with compatible key lookups
- Adds `"developer"` role support (LM Studio crashes without it)
- Safe handling of empty tool output payloads
- `<|think_on|>` / `<|think_off|>` toggling from any message role
See chat_template.README.md for the full breakdown.
Sampling
Recommended settings from the official Qwen authors. Reserve 128K+ of context for thinking mode.
| Mode | temp | top_p | top_k | min_p | repeat_penalty | presence_penalty |
|---|---|---|---|---|---|---|
| Thinking (coding) (default) | 0.6 | 0.95 | 20 | 0 | 1.0 | off |
| Thinking (general) | 1.0 | 0.95 | 20 | 0 | 1.0 | 1.5 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 0 | 1.0 | 1.5 |
| Non-thinking (reasoning) | 1.0 | 1.0 | 40 | 0 | 1.0 | 2.0 |
GGUF runtimes use presence_penalty (0 = off). MLX / LM Studio use repeat_penalty (1.0 = off).
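The table above can be captured as a set of presets, e.g. for building a sampler with `mlx_lm`'s `make_sampler`. The dict and helper below are illustrative (names are assumptions); values follow the table, with "off" encoded as each knob's neutral value:

```python
# Sampling presets from the table above. repeat_penalty is the MLX / LM Studio
# knob (1.0 = off); presence_penalty is the GGUF-runtime knob (0 = off).
SAMPLING_PRESETS = {
    "thinking_coding":        {"temp": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 0.0},
    "thinking_general":       {"temp": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 1.5},
    "non_thinking_general":   {"temp": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 1.5},
    "non_thinking_reasoning": {"temp": 1.0, "top_p": 1.0,  "top_k": 40, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 2.0},
}

def sampler_kwargs(mode):
    """Keep only the knobs make_sampler understands (temp/top_p/min_p/top_k)."""
    preset = SAMPLING_PRESETS[mode]
    return {k: preset[k] for k in ("temp", "top_p", "min_p", "top_k")}

# e.g.: sampler = make_sampler(**sampler_kwargs("thinking_coding"))
print(sampler_kwargs("thinking_coding"))
```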
Links
- 4-bit MLX version
- GGUF version (LuffyTheFox)
- Safetensors source (LuffyTheFox)
- Base uncensored model (HauhauCS)
- Reddit thread
Authorship
| Role | Author |
|---|---|
| Original model | Alibaba Cloud (Qwen team) |
| Uncensored fine-tune | HauhauCS |
| Tensor repair (Sig-ScaleSync) | EvilEnginer (LuffyTheFox) |
| MLX 8-bit conversion (text + vision) | froggeric |
License
Apache-2.0, inherited from Qwen3.5.