---
library_name: transformers
tags:
- qwen3.5
- moe
- weight-transfer
- hybrid-attention
- image-text-to-text
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Qwen3.5 MoE 4.54B (from Qwen3.5-4B)

A Qwen3.5 Mixture-of-Experts model created via **dual-source weight transfer**:

- **Backbone** (attention, embeddings, vision, norms): from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)
- **MoE experts** (routed + shared): from [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) (sliced 256 -> 8 experts, bilinear resized)

## Model Details

| Property | Value |
|----------|-------|
| **Total Parameters** | 4,540,002,816 (4.54B) |
| **Active Parameters** | 3,030,053,376 (3.03B) |
| **Architecture** | Qwen3.5 Hybrid MoE |
| **Experts** | 8 routed + 1 shared, top-2 routing |
| **Hidden Size** | 2560 |
| **Layers** | 32 (hybrid: DeltaNet + full attention) |
| **Attention** | GQA, 16 query / 4 KV heads, head_dim=256 |
| **Context** | 262,144 tokens |
| **Vocab** | 248,320 |
| **Dtype** | bfloat16 |

## Design

Total MoE FFN parameters are approximately equal to the dense model's FFN parameters; the speed benefit comes from sparsity, since only the top-2 routed experts plus the shared expert are active per token (~1/3 of total FFN parameters). Most weights are pre-trained (backbone from the dense model, experts from 35B-A3B). Only the MoE dimension resize introduces noise, which makes this model suitable for fine-tuning at nominal cost.
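The bilinear resize mentioned above can be thought of as separable linear interpolation over each 2-D weight matrix, mapping source FFN/hidden dimensions to the target ones. A minimal NumPy sketch (the function name and small shapes are illustrative, not the actual conversion script):

```python
import numpy as np

def bilinear_resize(w: np.ndarray, out_rows: int, out_cols: int) -> np.ndarray:
    """Resize a 2-D weight matrix with separable linear interpolation
    (rows first, then columns)."""
    rows_in, cols_in = w.shape
    src_r = np.linspace(0.0, rows_in - 1, out_rows)
    src_c = np.linspace(0.0, cols_in - 1, out_cols)

    # Interpolate along the row axis for every column.
    tmp = np.empty((out_rows, cols_in))
    for j in range(cols_in):
        tmp[:, j] = np.interp(src_r, np.arange(rows_in), w[:, j])

    # Then interpolate along the column axis for every row.
    out = np.empty((out_rows, out_cols))
    for i in range(out_rows):
        out[i, :] = np.interp(src_c, np.arange(cols_in), tmp[i, :])
    return out

# Toy example: shrink a (96, 64) expert projection to (48, 32).
w_src = np.random.default_rng(0).standard_normal((96, 64))
w_tgt = bilinear_resize(w_src, 48, 32)
```

Resizing to the same shape is the identity, which is a useful sanity check when validating a conversion like this.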
## Weight Transfer Sources

| Component | Source | Strategy |
|-----------|--------|----------|
| Embeddings, LM head | Qwen/Qwen3.5-4B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-4B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-4B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-4B | Exact copy |
| Layer norms | Qwen/Qwen3.5-4B | Exact copy |
| Routed experts | Qwen/Qwen3.5-35B-A3B | Slice 256 -> 8, bilinear resize |
| Shared expert | Qwen/Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen/Qwen3.5-35B-A3B | Slice + resize |

## License

Apache 2.0 (following the source models).
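The expert-slicing strategy in the table above amounts to selecting a subset of the 256 source experts along the expert axis, with the matching router rows kept in sync. A hypothetical NumPy sketch (the stacked-tensor layout, small dimensions, and the choice of the first 8 experts are all assumptions for illustration; the real checkpoint layout differs):

```python
import numpy as np

# Hypothetical expert-stacked tensors: (num_experts, ffn_dim, hidden_dim)
# for the expert projections, (num_experts, hidden_dim) for the router.
# Small dims keep the sketch runnable; real 35B-A3B shapes are much larger.
rng = np.random.default_rng(0)
src_experts = rng.standard_normal((256, 96, 64))
src_router = rng.standard_normal((256, 64))  # one logit row per expert

keep = np.arange(8)  # assumption: keep the first 8 experts
sliced_experts = src_experts[keep]  # -> (8, 96, 64)
sliced_router = src_router[keep]    # -> (8, 64), rows stay aligned with experts
```

Keeping the router rows aligned with the retained experts matters: after slicing, expert index *i* in the new model must correspond to logit *i* of the sliced router, or routing becomes meaningless.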