---
library_name: transformers
tags:
- qwen3.5
- moe
- weight-transfer
- hybrid-attention
- image-text-to-text
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Qwen3.5 MoE 4.54B (from Qwen3.5-4B)
A Qwen3.5 Mixture-of-Experts model created via dual-source weight transfer:
- Backbone (attention, embeddings, vision, norms): from Qwen/Qwen3.5-4B
- MoE experts (routed + shared): from Qwen/Qwen3.5-35B-A3B (sliced 256->8 experts, bilinear resized)
## Model Details
| Property | Value |
|---|---|
| Total Parameters | 4,540,002,816 (4.54B) |
| Active Parameters | 3,030,053,376 (3.03B) |
| Architecture | Qwen3.5 Hybrid MoE |
| Experts | 8 routed + 1 shared, top-2 |
| Hidden Size | 2560 |
| Layers | 32 (hybrid: DeltaNet + full attention) |
| Attention | GQA 16Q / 4KV, head_dim=256 |
| Context | 262,144 tokens |
| Vocab | 248,320 |
| Dtype | bfloat16 |
## Design
Total MoE FFN parameters are approximately equal to the dense model's FFN parameters. The speed benefit comes from sparsity: only the top-2 routed experts plus the shared expert are active per token (~1/3 of the total FFN parameters).
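The ~1/3 figure follows directly from the routing configuration. A quick check, assuming all experts (routed and shared) are the same size:

```python
# MoE FFN sparsity arithmetic: 8 routed experts with top-2 routing,
# plus 1 shared expert that is always active.
n_routed = 8
top_k = 2
n_shared = 1

# Fraction of FFN parameters touched per token, assuming equal-sized experts.
active_fraction = (top_k + n_shared) / (n_routed + n_shared)
print(f"active FFN fraction: {active_fraction:.3f}")  # 0.333
```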
Most weights are pre-trained (backbone from the dense model, experts from the 35B-A3B MoE). Only the resize of the expert dimensions introduces noise, making this model suitable for fine-tuning at nominal cost.
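The resize step can be sketched as separable linear interpolation over each 2-D expert weight matrix. This is a minimal numpy sketch of that idea, not the actual conversion script; the shapes below are illustrative (real Qwen3.5 expert matrices are much larger):

```python
import numpy as np

def bilinear_resize(w: np.ndarray, out_shape: tuple) -> np.ndarray:
    """Resize a 2-D weight matrix with bilinear (separable linear)
    interpolation, treating entries as samples on a uniform grid."""
    in_r, in_c = w.shape
    out_r, out_c = out_shape
    # Target sample positions on the source grid (align-corners style).
    rows = np.linspace(0.0, in_r - 1, out_r)
    cols = np.linspace(0.0, in_c - 1, out_c)
    # Interpolate along columns first: (in_r, out_c) ...
    tmp = np.stack([np.interp(cols, np.arange(in_c), row) for row in w])
    # ... then along rows: (out_r, out_c).
    return np.stack(
        [np.interp(rows, np.arange(in_r), tmp[:, j]) for j in range(out_c)],
        axis=1,
    )

src = np.arange(12, dtype=np.float32).reshape(3, 4)  # toy "expert" matrix
dst = bilinear_resize(src, (5, 6))
print(dst.shape)  # (5, 6)
```

Because the endpoints of the interpolation grid coincide with the source grid, corner weights are preserved exactly; interior weights are smooth blends of their neighbors, which is the noise the paragraph above refers to.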
## Weight Transfer Sources
| Component | Source | Strategy |
|---|---|---|
| Embeddings, LM Head | Qwen/Qwen3.5-4B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-4B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-4B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-4B | Exact copy |
| Layer norms | Qwen/Qwen3.5-4B | Exact copy |
| Routed experts | Qwen3.5-35B-A3B | Slice 256->8, bilinear resize |
| Shared expert | Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen3.5-35B-A3B | Slice + resize |
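The "Slice 256->8" strategy for experts and router rows can be sketched as follows. The uniform-stride selection and all tensor shapes here are assumptions for illustration; the actual selection criterion used for this checkpoint is not specified above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_dst = 256, 8
hidden = 16  # illustrative only; the real hidden size is 2560

# Stacked routed-expert weights: (num_experts, out_features, in_features),
# and router weights mapping hidden states to expert logits.
src_experts = rng.normal(size=(n_src, hidden, hidden)).astype(np.float32)
src_router = rng.normal(size=(n_src, hidden)).astype(np.float32)

# Keep every 32nd expert (a uniform-stride slice, assumed here).
keep = np.arange(0, n_src, n_src // n_dst)
dst_experts = src_experts[keep]
dst_router = src_router[keep]  # router rows must follow the kept experts

print(dst_experts.shape, dst_router.shape)  # (8, 16, 16) (8, 16)
```

Slicing the router rows together with the experts keeps each remaining logit paired with its original expert; the per-expert matrices would then be resized to the target dimensions as described above.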
## License
Apache 2.0 (following source models)