kshitijthakkar committed on
Commit b8aa0f3 · verified · 1 parent: 95b9f1a

fix README: add pipeline_tag and image-text-to-text tag

Files changed (1)
  1. README.md +59 -57

README.md CHANGED
@@ -1,9 +1,11 @@
 ---
 library_name: transformers
 tags:
 - qwen3.5
 - moe
 - weight-transfer
 - hybrid-attention
+- image-text-to-text
 license: apache-2.0
+pipeline_tag: image-text-to-text
 ---

# Qwen3.5 MoE 4.54B (from Qwen3.5-4B)

A Qwen3.5 Mixture-of-Experts model created via **dual-source weight transfer**:
- **Backbone** (attention, embeddings, vision, norms): from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)
- **MoE experts** (routed + shared): from [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) (sliced 256 -> 8 experts, bilinear resized)

## Model Details

| Property | Value |
|----------|-------|
| **Total Parameters** | 4,540,002,816 (4.54B) |
| **Active Parameters** | 3,030,053,376 (3.03B) |
| **Architecture** | Qwen3.5 Hybrid MoE |
| **Experts** | 8 routed + 1 shared, top-2 routing |
| **Hidden Size** | 2560 |
| **Layers** | 32 (hybrid: DeltaNet + full attention) |
| **Attention** | GQA, 16 query / 4 KV heads, head_dim = 256 |
| **Context Length** | 262,144 tokens |
| **Vocabulary Size** | 248,320 |
| **Dtype** | bfloat16 |
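The attention geometry above implies non-square projections: with head_dim 256, the 16 query heads project 2560 -> 4096 while the 4 KV heads project 2560 -> 1024. A quick sketch of the per-layer projection sizes (bias-free projections are an assumption, not stated in this card):

```python
# Per-layer attention projection sizes implied by the table above.
# Assumption: no projection biases (the model card does not say).
hidden, n_q_heads, n_kv_heads, head_dim = 2560, 16, 4, 256

q_out = n_q_heads * head_dim    # 16 * 256 = 4096
kv_out = n_kv_heads * head_dim  # 4 * 256 = 1024

q_params = hidden * q_out       # q_proj: 2560 -> 4096
k_params = hidden * kv_out      # k_proj: 2560 -> 1024
v_params = hidden * kv_out      # v_proj: 2560 -> 1024
o_params = q_out * hidden       # o_proj: 4096 -> 2560

total = q_params + k_params + v_params + o_params
print(total)  # 26214400 parameters per full-attention layer
```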

## Design

The total MoE FFN parameter count is approximately equal to the dense
model's FFN parameter count. The speed benefit comes from sparsity: only
the top-2 routed experts plus the shared expert are active per token
(~1/3 of the total FFN parameters).

Most weights are pre-trained (backbone from the dense model, experts from
35B-A3B). Only the MoE dimension resize introduces noise, which makes the
model suitable for fine-tuning at modest cost.
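As a sanity check on the sparsity claim, a tiny sketch (the expert counts and parameter totals come from the tables above; attributing the entire total-minus-active gap to the 6 skipped routed experts is an assumption):

```python
# Active fraction of MoE FFN parameters per token: top-2 routed experts
# plus the always-on shared expert, out of 8 routed + 1 shared.
n_routed, n_shared, top_k = 8, 1, 2
active_fraction = (top_k + n_shared) / (n_routed + n_shared)
print(active_fraction)  # 3/9, matching the "~1/3 of total FFN" claim

# Totals from the Model Details table. Assumption: the gap is exactly the
# 6 skipped routed experts, so it splits evenly into a per-expert size.
total, active = 4_540_002_816, 3_030_053_376
skipped = total - active
print(skipped, skipped // 6)  # 1509949440 inactive, 251658240 per expert
```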

## Weight Transfer Sources

| Component | Source | Strategy |
|-----------|--------|----------|
| Embeddings, LM head | Qwen/Qwen3.5-4B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-4B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-4B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-4B | Exact copy |
| Layer norms | Qwen/Qwen3.5-4B | Exact copy |
| Routed experts | Qwen/Qwen3.5-35B-A3B | Slice 256 -> 8, bilinear resize |
| Shared expert | Qwen/Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen/Qwen3.5-35B-A3B | Slice + resize |
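The expert-transfer strategy in the table (slice 256 -> 8, then bilinear-resize each weight matrix to the target shape) can be sketched in miniature. The card does not say which 8 experts are kept or how the resize is implemented, so keeping the first 8, the separable-linear resize below, and the toy shapes are all assumptions:

```python
def lerp_1d(values, out_len):
    """Linearly resample a 1D list of floats to out_len points."""
    n = len(values)
    if out_len == 1:
        return [values[0]]
    out = []
    for i in range(out_len):
        x = i * (n - 1) / (out_len - 1)   # position in the source grid
        lo = int(x)
        hi = min(lo + 1, n - 1)
        frac = x - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

def bilinear_resize(mat, out_rows, out_cols):
    """Separable bilinear resize of a 2D list-of-lists weight matrix."""
    # Resample every row to the new column count...
    tmp = [lerp_1d(row, out_cols) for row in mat]
    # ...then resample every column to the new row count.
    cols = [lerp_1d([tmp[r][c] for r in range(len(tmp))], out_rows)
            for c in range(out_cols)]
    return [[cols[c][r] for c in range(out_cols)] for r in range(out_rows)]

# Toy stand-in for the real transfer: slice the first 8 of 256 routed
# experts, then resize each expert's weight matrix to the target shape.
experts = [[[float(e + r + c) for c in range(6)] for r in range(4)]
           for e in range(256)]
kept = experts[:8]                                # slice 256 -> 8
resized = [bilinear_resize(w, 3, 4) for w in kept]
```

A constant matrix should survive the resize unchanged, which makes for an easy correctness check on any interpolation scheme used here.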

## License

Apache 2.0 (following the source models)