---
library_name: transformers
tags:
- qwen3.5
- moe
- weight-transfer
- hybrid-attention
- image-text-to-text
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Qwen3.5 MoE 4.54B (from Qwen3.5-4B)

A Qwen3.5 Mixture-of-Experts model created via **dual-source weight transfer**:

- **Backbone** (attention, embeddings, vision, norms): from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)
- **MoE experts** (routed + shared): from [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) (sliced 256 -> 8 experts, bilinear resized)

## Model Details

| Property | Value |
|----------|-------|
| **Total Parameters** | 4,540,002,816 (4.54B) |
| **Active Parameters** | 3,030,053,376 (3.03B) |
| **Architecture** | Qwen3.5 Hybrid MoE |
| **Experts** | 8 routed + 1 shared, top-2 routing |
| **Hidden Size** | 2560 |
| **Layers** | 32 (hybrid: DeltaNet + full attention) |
| **Attention** | GQA, 16 query / 4 KV heads, head_dim=256 |
| **Context** | 262,144 tokens |
| **Vocab** | 248,320 |
| **Dtype** | bfloat16 |

## Design

Total MoE FFN parameters are approximately equal to the dense model's FFN parameters; the speed benefit comes from sparsity, since only the top-2 routed experts plus the shared expert are active per token (~1/3 of total FFN parameters). Most weights are pre-trained (backbone from the dense model, experts from 35B-A3B). Only the MoE dimension resize introduces noise, which makes this model suitable for fine-tuning at nominal cost.
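The bilinear resize mentioned above can be thought of as separable linear interpolation over each 2-D weight matrix, mapping source FFN/hidden dimensions to the target ones. A minimal NumPy sketch (the function name and small shapes are illustrative, not the actual conversion script):

```python
import numpy as np

def bilinear_resize(w: np.ndarray, out_rows: int, out_cols: int) -> np.ndarray:
    """Resize a 2-D weight matrix with separable linear interpolation
    (rows first, then columns)."""
    rows_in, cols_in = w.shape
    src_r = np.linspace(0.0, rows_in - 1, out_rows)
    src_c = np.linspace(0.0, cols_in - 1, out_cols)

    # Interpolate along the row axis for every column.
    tmp = np.empty((out_rows, cols_in))
    for j in range(cols_in):
        tmp[:, j] = np.interp(src_r, np.arange(rows_in), w[:, j])

    # Then interpolate along the column axis for every row.
    out = np.empty((out_rows, out_cols))
    for i in range(out_rows):
        out[i, :] = np.interp(src_c, np.arange(cols_in), tmp[i, :])
    return out

# Toy example: shrink a (96, 64) expert projection to (48, 32).
w_src = np.random.default_rng(0).standard_normal((96, 64))
w_tgt = bilinear_resize(w_src, 48, 32)
```

Resizing to the same shape is the identity, which is a useful sanity check when validating a conversion like this.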
## Weight Transfer Sources

| Component | Source | Strategy |
|-----------|--------|----------|
| Embeddings, LM head | Qwen/Qwen3.5-4B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-4B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-4B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-4B | Exact copy |
| Layer norms | Qwen/Qwen3.5-4B | Exact copy |
| Routed experts | Qwen/Qwen3.5-35B-A3B | Slice 256 -> 8, bilinear resize |
| Shared expert | Qwen/Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen/Qwen3.5-35B-A3B | Slice + resize |

## License

Apache 2.0 (following the source models).
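The expert-slicing strategy in the table above amounts to selecting a subset of the 256 source experts along the expert axis, with the matching router rows kept in sync. A hypothetical NumPy sketch (the stacked-tensor layout, small dimensions, and the choice of the first 8 experts are all assumptions for illustration; the real checkpoint layout differs):

```python
import numpy as np

# Hypothetical expert-stacked tensors: (num_experts, ffn_dim, hidden_dim)
# for the expert projections, (num_experts, hidden_dim) for the router.
# Small dims keep the sketch runnable; real 35B-A3B shapes are much larger.
rng = np.random.default_rng(0)
src_experts = rng.standard_normal((256, 96, 64))
src_router = rng.standard_normal((256, 64))  # one logit row per expert

keep = np.arange(8)  # assumption: keep the first 8 experts
sliced_experts = src_experts[keep]  # -> (8, 96, 64)
sliced_router = src_router[keep]    # -> (8, 64), rows stay aligned with experts
```

Keeping the router rows aligned with the retained experts matters: after slicing, expert index *i* in the new model must correspond to logit *i* of the sliced router, or routing becomes meaningless.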