---
language:
- en
library_name: mlx
license: apache-2.0
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-35B-A3B
tags:
- quantized
- apple-silicon
- mlx
- qwen3
- qwen3_5_moe
- moe
- vision
- hybrid-attention
- gated-deltanet
- turboquant
- jangtq
- jangtq2
---

<p align="center">
  <a href="https://osaurus.ai"><img src="./osaurus-x-banner.png" alt="Osaurus AI"></a>
</p>

<h3 align="center">Qwen 3.6 35B-A3B &mdash; JANGTQ2 (MLX)</h3>
<p align="center">TurboQuant codebook quantization of Alibaba's hybrid linear/full-attention agentic MoE &mdash; routed experts at 2-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine, vision tower preserved.</p>

<p align="center">
  <a href="https://osaurus.ai"><img src="https://img.shields.io/badge/Web-osaurus.ai-blue" alt="Website"></a>&nbsp;
  <a href="https://huggingface.co/OsaurusAI"><img src="https://img.shields.io/badge/HF-OsaurusAI-yellow?logo=huggingface" alt="OsaurusAI"></a>
</p>

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | [`Qwen/Qwen3.6-35B-A3B`](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
| **Parameters (source)** | 35 B total, ~3 B active per token |
| **Architecture** | `qwen3_5_moe`: 40 decoder layers (30 Gated DeltaNet linear-attention + 10 full-attention), 256 routed experts + 1 always-on shared expert |
| **Quantization format** | `weight_format: mxtq` (routed experts via TurboQuant codebook at 2-bit; everything else affine 8-bit or fp16 passthrough) |
| **Routed-expert storage** | `.tq_packed` (uint32) + `.tq_norms` (fp16) + `.tq_bits` (uint8); codebook + Hadamard signs re-derived deterministically at load |
| **Package size on disk** | **11.63 GB** across 12 shards |
| **Shipped tensors** | 1,930 total (1,597 language-model + 333 vision tower; the language-model count includes 120 routed-expert TQ triples) |
| **Vocab** | 248,320 |
| **Context (position embeddings)** | 262,144 native; the upstream model card reports up to ~1 M with YaRN scaling |
| **Vision tower** | 27-layer ViT (hidden 1152, patch 16), preserved in fp16 |
| **Chat format** | Qwen `<|im_start|>`/`<|im_end|>`, unified thinking toggle |

### Quantization details, per tensor category

| Category | Bits | Group / codebook | Notes |
|---|---|---|---|
| **Routed-expert MLP** (`mlp.experts.gate_up_proj`, `down_proj`) | **2 (JANGTQ)** | 2^2 Lloyd-Max centroids + Hadamard rotation | `.tq_packed` + `.tq_norms` + `.tq_bits` triples |
| Embedding (`embed_tokens`), `lm_head` | 8 (affine) | group 64 | MLX-native `QuantizedLinear` |
| Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | 8 (affine) | group 64 | Gate-doubled q_proj for `attn_output_gate` |
| Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`) | 8 (affine) | group 64 | Gated DeltaNet |
| Shared-expert MLP (`gate_proj`, `up_proj`, `down_proj`) | 8 (affine) | group 64 | Always active per token |
| Router (`mlp.gate`) | fp16 passthrough | – | Precision-critical |
| Shared-expert gate (`shared_expert_gate`) | fp16 passthrough | – | Sigmoid scalar gate |
| Norms (`*_layernorm`, `*_norm`), `A_log`, `dt_bias`, `conv1d` | fp16 passthrough | – | Unquantized |
| Vision tower (333 tensors) | fp16 passthrough | – | `patch_embed.proj` axes pre-transposed to MLX layout |
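For intuition about the `.tq_packed` / `.tq_norms` / `.tq_bits` triples above, here is a CPU-side sketch of unpacking the 2-bit codes. The layout (16 codes per uint32, lowest bits first) and the `unpack_2bit` helper are assumptions for illustration, not the documented on-disk format; the real loader does this inside a fused Metal kernel.

```python
# Illustrative only: unpack 2-bit codes from a `.tq_packed` uint32 buffer.
# The packing order (16 codes per word, lowest bits first) is an ASSUMPTION.
import numpy as np

def unpack_2bit(packed: np.ndarray, n_codes: int) -> np.ndarray:
    """packed: 1-D uint32 array; returns n_codes indices in [0, 3]."""
    shifts = np.arange(16, dtype=np.uint32) * 2
    codes = (packed[:, None] >> shifts[None, :]) & np.uint32(0x3)
    return codes.reshape(-1)[:n_codes].astype(np.uint8)

# Sanity check: pack codes 3,2,1,0 repeating into one word, then unpack.
word = np.uint32(0)
for i, c in enumerate([3, 2, 1, 0] * 4):
    word |= np.uint32(c) << np.uint32(2 * i)
print(unpack_2bit(np.array([word], dtype=np.uint32), 16))  # [3 2 1 0 3 2 ...]
```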

JANGTQ ("TurboQuant") stores routed-expert weights as indices into a small Lloyd-Max codebook with a per-row norm, after a randomized Hadamard rotation that concentrates the distribution so quantization error is uniform. At inference, the input is rotated once per layer (cheap fused Metal kernel) and dot products happen against the codebook centroids directly, so we never dequantize back to affine. Compared to affine 2-bit at the same bit budget, this gives better quality AND faster decode on the routed-expert MLP path.

---

## Usage

**JANGTQ requires our custom loader**: stock `mlx_lm.load()` can't parse `.tq_packed` tensors. You need `jang-tools` (free, public): <https://github.com/jjang-ai/jangq>.

```bash
pip install mlx mlx-lm mlx-vlm
git clone https://github.com/jjang-ai/jangq && pip install -e ./jangq/jang-tools
```

### Text

```python
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model("OsaurusAI/Qwen3.6-35B-A3B-JANGTQ2")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=64))
```

### Image (VLM)

```python
from jang_tools.load_jangtq_vlm import load_jangtq_vlm_model
from mlx_vlm import generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Qwen3.6-35B-A3B-JANGTQ2"
model, processor = load_jangtq_vlm_model(path)
config = load_config(path)

prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
print(generate(model, processor, prompt, image="path/to/image.jpg", max_tokens=200))
```

### Reasoning toggle

```python
msgs = [{"role": "user", "content": "What is 17 Γ— 23?"}]
# Reasoning OFF β€” pre-closed <think></think> block
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                       enable_thinking=False)
# Reasoning ON β€” model fills the <think> block
prompt = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                       enable_thinking=True)
```

Pass `enable_thinking` as a **direct kwarg** (the `chat_template_kwargs={...}` form only propagates on some tokenizer versions).
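A quick way to verify the toggle actually reached the template on your tokenizer version (illustrative):

```python
rendered = tokenizer.apply_chat_template(msgs, add_generation_prompt=True,
                                         tokenize=False, enable_thinking=False)
# With thinking disabled, the rendered prompt should already contain the
# pre-closed block; if not, the kwarg did not propagate on this version.
assert "</think>" in rendered
```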

### Video

The base model supports video via `transformers`, and this bundle preserves `video_preprocessor_config.json`. However, `mlx-vlm` 0.4.4's `prepare_inputs` has no video path yet for `qwen3_5_moe`; the Python `load_jangtq_vlm` path wraps video through a custom processor only for our test harness. If you're on mainline `mlx-vlm`, stick to image input and use upstream `transformers` for video (see the sketch below).
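For reference, a rough sketch of the upstream `transformers` route. The class choice and the `{"type": "video", "path": ...}` content key follow the pattern of recent Qwen-VL releases and are assumptions here, not something this bundle guarantees; check the upstream model card before relying on them.

```python
# Rough sketch: video through upstream transformers (NOT the MLX quant on
# this page). Class names and the video content key are ASSUMPTIONS based
# on recent Qwen-VL releases.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3.6-35B-A3B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "path/to/clip.mp4"},
        {"type": "text", "text": "Describe this video."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```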

---

## Hardware notes

~12 GB on disk; expect ~12–14 GB resident after load, plus KV cache.

| Mac unified RAM | Works? | Notes |
|---|---|---|
| 24 GB | ✅ comfortable | Full 32 k context OK |
| 32 GB | ✅ | 32–100 k context depending on profile |
| 16 GB | ⚠️ tight | Text-only, short context |
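Why long contexts stay cheap here: only the 10 full-attention layers accumulate a per-token KV cache, while the 30 Gated DeltaNet layers keep a fixed-size recurrent state. A back-of-envelope sketch; the KV-head count and head dim below are hypothetical placeholders, not values read from this model's config:

```python
# Back-of-envelope KV-cache sizing for the hybrid stack.
FULL_ATTN_LAYERS = 10
N_KV_HEADS = 4        # ASSUMPTION for illustration
HEAD_DIM = 128        # ASSUMPTION for illustration
BYTES_FP16 = 2

def kv_cache_gb(context_len: int) -> float:
    # K and V, per full-attention layer, per token
    per_token = 2 * N_KV_HEADS * HEAD_DIM * BYTES_FP16 * FULL_ATTN_LAYERS
    return context_len * per_token / 1e9

for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.2f} GB KV cache")
```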

---

## Benchmarks

Base-model reference (`Qwen/Qwen3.6-35B-A3B`, upstream, not this quant):

| MMLU-Pro | AIME 2026 | LiveCodeBench v6 | GPQA | SWE-bench Verified |
|---|---|---|---|---|
| 85.2 | 92.7 | 80.4 | 86.0 | 73.4 |

Independent JANGTQ-quant evaluation is tracked in the jang-tools repo and will land in future README revisions.

---

## Citation

```bibtex
@misc{qwen2026qwen36,
  title  = {Qwen3.6-Plus: Towards Real World Agents},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.6}
}
```

## License

[Apache 2.0](https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/LICENSE), inherited from the base model.

---

<p align="center">
  Packaged on Apple Silicon with <a href="https://github.com/jjang-ai/jangq">jang-tools</a> (mlx-lm 0.31.2).<br>
  &copy; 2026 Osaurus AI &mdash; <a href="https://osaurus.ai">osaurus.ai</a>
</p>