Qwen3.5-35B-A3B-Uncensored-FernflowerAI
MLX 8-bit · Text + Vision + Thinking + Tool Calling
Apple Silicon native
Credit where it's due. This conversion is built on the work of LuffyTheFox (EvilEnginer), who discovered the corrupted tensors in Alibaba's Qwen3.5 weights, diagnosed the root cause, and wrote the Sig-ScaleSync repair that fixed them. The original repos: safetensors and GGUF.
What's this?
Qwen3.5-35B is a 35B-parameter MoE model from Alibaba that activates ~3B params per token. It's fast, it's smart, and it supports 262K context, vision, video, and multi-token prediction.
There's one problem: Alibaba shipped it with two broken tensors. Layers 36 and 37 have corrupted ssm_conv1d.weight values that cause the model to loop, garble code, and eventually collapse past ~50K tokens. No sampler setting fixes it.
LuffyTheFox found the bug, wrote a repair tool (Sig-ScaleSync), and released fixed weights. This repo is an MLX 8-bit conversion of those fixed weights, ready to run on Apple Silicon with full text, image, and video support.
Lineage
Qwen/Qwen3.5-35B-A3B (Alibaba Cloud)
└─ HauhauCS Uncensored (0/465 refusals, lossless)
└─ LuffyTheFox FernflowerAI (tensor repair via Sig-ScaleSync)
└─ This repo (MLX 8-bit, text + vision)
The Bug
Two tensors out of 502 carry corrupted weights: blk.36.ssm_conv1d.weight and blk.37.ssm_conv1d.weight. Their scale (standard deviation) runs ~60% higher than the median of their peer group, at 0.102 vs 0.063.
Why it happens: AdamW optimizer + MoE routing + DeltaNet's recurrent architecture. Rare experts in the final layers get an outsized effective learning rate. Weights drift. In DeltaNet's recurrence, the corruption propagates forward through every subsequent token.
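As a toy illustration of why a scale error in a recurrent sublayer compounds (a simplified scalar recurrence, not the actual DeltaNet update):

```python
# Toy scalar recurrence: h_t = w * h_{t-1} + x_t
# A healthy weight (|w| < 1) keeps the state bounded; an inflated
# weight makes the state, and any error in it, grow with every token.

def run_recurrence(w, steps, x=1.0):
    h = 0.0
    for _ in range(steps):
        h = w * h + x
    return h

healthy = run_recurrence(0.9, 500)     # converges toward x / (1 - w) = 10
corrupted = run_recurrence(1.02, 500)  # diverges: drift compounds each step
print(healthy, corrupted)
```

This is why short prompts look fine while long contexts collapse: the drift only becomes visible after enough recurrence steps.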
What you see: Short prompts work fine. Around 50–70K tokens the model starts looping, repeating, inserting weird comments into code. By 100K it often fails outright. Tool calling breaks mid-session.
The Fix
Sig-ScaleSync compares each tensor's scale against the median of its peer group (same shape). A tensor gets flagged only if it exceeds the deviation threshold and shows weight saturation. This two-gate filter avoids false positives on architecturally asymmetric layers (gate inputs, FFN projections, etc.).
Out of 502 tensors, exactly 2 needed repair. The other 489 asymmetric tensors were left alone.
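The two-gate idea can be sketched in a few lines. This is illustrative only; the thresholds, field names, and rescaling step are assumptions, not the actual Sig-ScaleSync code:

```python
import statistics

def flag_and_repair(tensors, dev_threshold=1.5, sat_threshold=0.002):
    """tensors: dicts with 'name', 'std', 'saturation' for one peer group
    (tensors of the same shape). Returns the names of repaired tensors."""
    median_std = statistics.median(t["std"] for t in tensors)
    repaired = []
    for t in tensors:
        # Gate 1: scale deviates from the peer-group median
        deviates = t["std"] > dev_threshold * median_std
        # Gate 2: weight saturation is also elevated
        saturated = t["saturation"] > sat_threshold
        if deviates and saturated:
            t["std"] = median_std  # rescale toward the peer median
            repaired.append(t["name"])
    return repaired

# Hypothetical peer group using the stats quoted above (0.102 vs ~0.063 median)
peers = [
    {"name": "blk.33.ssm_conv1d.weight", "std": 0.061, "saturation": 0.0010},
    {"name": "blk.34.ssm_conv1d.weight", "std": 0.063, "saturation": 0.0010},
    {"name": "blk.35.ssm_conv1d.weight", "std": 0.065, "saturation": 0.0010},
    {"name": "blk.36.ssm_conv1d.weight", "std": 0.102, "saturation": 0.0025},
    {"name": "blk.37.ssm_conv1d.weight", "std": 0.102, "saturation": 0.0025},
]
repaired = flag_and_repair(peers)
print(repaired)  # only the two corrupted tensors are flagged
```

Requiring both gates is what keeps legitimately asymmetric layers (gate inputs, FFN projections) from being flagged: they may deviate in scale, but they don't also show saturation.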
| Tensor | Error reduction | Saturation (before → after) |
|---|---|---|
| blk.36.ssm_conv1d.weight | 88.6% | 0.0025 → 0.0010 |
| blk.37.ssm_conv1d.weight | 88.6% | 0.0025 → 0.0010 |
Verified against Gemma 4 26B A4B with zero false positives. The script doesn't invent problems.
Which models are affected?
| Model | Status |
|---|---|
| Qwen3.5-35B-A3B (all variants) | Broken (2 tensors), fixed here |
| Qwen3.5-27B (all incl. Unsloth) | Broken (8 tensors), fix experimental |
| Qwen3.5-122B-A10B | Healthy |
| Qwen3.5-9B, 4B, 2B, 0.8B | Likely affected, unconfirmed |
This isn't Qwen-specific. Any MoE model with recurrent sublayers (DeltaNet, Mamba) trained with AdamW can hit the same issue.
Uncensored
The HauhauCS Aggressive uncensored fine-tune is lossless. No dataset changes, no capability removal, 0/465 refusals. You get everything the original model was trained to do, just without the refusal behavior. It may occasionally append short disclaimers (baked into base training, not actual refusals), but the full response always generates.
This conversion
- Source: FernflowerAI safetensors (not GGUF) for maximum weight fidelity
- Quantization: 8-bit (8.6 bits/weight, 35 GB across 8 shards)
- Vision: Full support via `mlx-vlm`. Text, image, and video inputs work out of the box
- Thinking: Toggleable via `<|think_on|>` / `<|think_off|>` tags (see below)
- Tool calling: Works via the included Jinja chat template
- Requirements: `mlx-lm >= 0.31.2`, `mlx-vlm >= 0.4.4`
Architecture details
| Spec | Value |
|---|---|
| Total params | 35B |
| Active per token | ~3B (8 routed + 1 shared of 256 experts) |
| Attention | 3x DeltaNet-MoE + 1x Attention-MoE, 10 repetitions |
| Context | 262K native, 1M with YaRN |
| RoPE | theta 10M, partial_rotary_factor 0.25, mrope_interleaved |
| Vocab | 248K tokens, 201 languages |
| Multimodal | Text, image, video |
| Multi-token prediction | Supported |
| model_type | qwen3_5_moe |
Known issue
Gated DeltaNet decoding can run ~2.7x slower with non-vocabulary embeddings (mlx-lm#932). Normal text inference is unaffected.
Quick start
Text
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit")

# Recent mlx-lm versions take sampling settings via a sampler object
# rather than a temp= keyword on generate()
sampler = make_sampler(temp=0.7)
response = generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler)
print(response)
```
Vision
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit")
image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted = apply_chat_template(processor, model.config, prompt, num_images=len(image))
result = generate(model, processor, formatted, image, max_tokens=256, temp=0.7)
print(result.text)
```
CLI
```shell
# Text
mlx_lm.generate --model froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit --prompt "Hello"

# Vision
mlx_vlm.generate --model froggeric/Qwen3.5-35B-A3B-Uncensored-FernflowerAI-MLX-8bit --image image.jpg --prompt "Describe this image"
```
System prompt
The first line of your system prompt must be:

```
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
```

The model underperforms without it. You can append anything after that line: roleplay personas, custom instructions, whatever you need.

```
You are Qwen, created by Alibaba Cloud. You are a helpful assistant. Currently you are roleplaying as a grumpy but brilliant sysadmin.
```
Thinking toggle
This model ships with a Jinja chat template that lets you toggle thinking on the fly. Drop <|think_on|> or <|think_off|> anywhere in your system or user prompt. The template strips the tag from context and flips the thinking mode.
```
System: You are a coding assistant. <|think_off|>
User: What's 2+2?
```

The model answers fast, no internal reasoning.

```
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
```

The model thinks step by step, then answers.
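The toggle mechanics can be sketched as follows. This is a Python illustration of the behavior described above, not the bundled template itself (which is Jinja); the function name is hypothetical:

```python
import re

TOGGLE = re.compile(r"<\|think_(on|off)\|>")

def resolve_thinking(messages, default=True):
    """Scan messages for toggle tags, strip them from the context,
    and return (cleaned_messages, thinking_enabled). The last tag
    seen wins, matching 'drop it anywhere in your prompt'."""
    thinking = default
    cleaned = []
    for msg in messages:
        for m in TOGGLE.finditer(msg["content"]):
            thinking = m.group(1) == "on"
        cleaned.append({**msg, "content": TOGGLE.sub("", msg["content"]).strip()})
    return cleaned, thinking

msgs = [
    {"role": "system", "content": "You are a coding assistant. <|think_off|>"},
    {"role": "user", "content": "What's 2+2?"},
]
cleaned, thinking = resolve_thinking(msgs)
print(thinking)                # thinking disabled
print(cleaned[0]["content"])   # tag stripped from context
```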
Tool calling warning (LM Studio): LM Studio's internal parser crashes when the model generates a tool call inside its thinking block (#827, #1592). When using tools, always add `<|think_off|>` to your prompt.
Chat template
The bundled Jinja template fixes several issues with LM Studio's runtime:

- Replaces the broken `| items` dictionary filter with compatible key lookups
- Adds `"developer"` role support (LM Studio crashes without it)
- Safe handling of empty tool output payloads
- `<|think_on|>` / `<|think_off|>` toggling from any message role
See chat_template.README.md for the full breakdown.
Sampling
Recommended settings from the official Qwen authors. Reserve 128K+ of context for thinking mode.
| Mode | temp | top_p | top_k | min_p | repeat_penalty | presence_penalty |
|---|---|---|---|---|---|---|
| Thinking (coding) (default) | 0.6 | 0.95 | 20 | 0 | 1.0 | off |
| Thinking (general) | 1.0 | 0.95 | 20 | 0 | 1.0 | 1.5 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 0 | 1.0 | 1.5 |
| Non-thinking (reasoning) | 1.0 | 1.0 | 40 | 0 | 1.0 | 2.0 |
GGUF runtimes use presence_penalty (0 = off). MLX / LM Studio use repeat_penalty (1.0 = off).
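The table above can be captured as a set of presets, e.g. for building a sampler with `mlx_lm`'s `make_sampler`. The dict and helper below are illustrative (names are assumptions); values follow the table, with "off" encoded as each knob's neutral value:

```python
# Sampling presets from the table above. repeat_penalty is the MLX / LM Studio
# knob (1.0 = off); presence_penalty is the GGUF-runtime knob (0 = off).
SAMPLING_PRESETS = {
    "thinking_coding":        {"temp": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 0.0},
    "thinking_general":       {"temp": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 1.5},
    "non_thinking_general":   {"temp": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 1.5},
    "non_thinking_reasoning": {"temp": 1.0, "top_p": 1.0,  "top_k": 40, "min_p": 0.0,
                               "repeat_penalty": 1.0, "presence_penalty": 2.0},
}

def sampler_kwargs(mode):
    """Keep only the knobs make_sampler understands (temp/top_p/min_p/top_k)."""
    preset = SAMPLING_PRESETS[mode]
    return {k: preset[k] for k in ("temp", "top_p", "min_p", "top_k")}

# e.g.: sampler = make_sampler(**sampler_kwargs("thinking_coding"))
print(sampler_kwargs("thinking_coding"))
```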
Links
- 4-bit MLX version
- GGUF version (LuffyTheFox)
- Safetensors source (LuffyTheFox)
- Base uncensored model (HauhauCS)
- Reddit thread
Authorship
| Role | Author |
|---|---|
| Original model | Alibaba Cloud (Qwen team) |
| Uncensored fine-tune | HauhauCS |
| Tensor repair (Sig-ScaleSync) | EvilEnginer (LuffyTheFox) |
| MLX 8-bit conversion (text + vision) | froggeric |
License
Apache-2.0, inherited from Qwen3.5.