kshitijthakkar committed on
Commit b8aa0f3 · verified · 1 parent: 95b9f1a

fix README: add pipeline_tag and image-text-to-text tag

Files changed (1)
  1. README.md +59 -57

README.md CHANGED
@@ -1,9 +1,11 @@
 ---
 library_name: transformers
 tags:
 - qwen3.5
 - moe
 - weight-transfer
 - hybrid-attention
+- image-text-to-text
 license: apache-2.0
+pipeline_tag: image-text-to-text
 ---

# Qwen3.5 MoE 4.54B (from Qwen3.5-4B)

A Qwen3.5 Mixture-of-Experts model created via **dual-source weight transfer**:
- **Backbone** (attention, embeddings, vision, norms): from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B)
- **MoE experts** (routed + shared): from [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) (sliced 256 -> 8 experts, bilinear resized)

## Model Details

| Property | Value |
|----------|-------|
| **Total Parameters** | 4,540,002,816 (4.54B) |
| **Active Parameters** | 3,030,053,376 (3.03B) |
| **Architecture** | Qwen3.5 Hybrid MoE |
| **Experts** | 8 routed + 1 shared, top-2 routing |
| **Hidden Size** | 2560 |
| **Layers** | 32 (hybrid: DeltaNet + full attention) |
| **Attention** | GQA, 16 query / 4 KV heads, head_dim = 256 |
| **Context Length** | 262,144 tokens |
| **Vocabulary Size** | 248,320 |
| **Dtype** | bfloat16 |
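The attention geometry above implies non-square projections: with head_dim 256, the 16 query heads project 2560 -> 4096 while the 4 KV heads project 2560 -> 1024. A quick sketch of the per-layer projection sizes (bias-free projections are an assumption, not stated in this card):

```python
# Per-layer attention projection sizes implied by the table above.
# Assumption: no projection biases (the model card does not say).
hidden, n_q_heads, n_kv_heads, head_dim = 2560, 16, 4, 256

q_out = n_q_heads * head_dim    # 16 * 256 = 4096
kv_out = n_kv_heads * head_dim  # 4 * 256 = 1024

q_params = hidden * q_out       # q_proj: 2560 -> 4096
k_params = hidden * kv_out      # k_proj: 2560 -> 1024
v_params = hidden * kv_out      # v_proj: 2560 -> 1024
o_params = q_out * hidden       # o_proj: 4096 -> 2560

total = q_params + k_params + v_params + o_params
print(total)  # 26214400 parameters per full-attention layer
```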

## Design

The total MoE FFN parameter count is approximately equal to the dense
model's FFN parameter count. The speed benefit comes from sparsity: only
the top-2 routed experts plus the shared expert are active per token
(~1/3 of the total FFN parameters).

Most weights are pre-trained (backbone from the dense model, experts from
35B-A3B). Only the MoE dimension resize introduces noise, which makes the
model suitable for fine-tuning at modest cost.
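As a sanity check on the sparsity claim, a tiny sketch (the expert counts and parameter totals come from the tables above; attributing the entire total-minus-active gap to the 6 skipped routed experts is an assumption):

```python
# Active fraction of MoE FFN parameters per token: top-2 routed experts
# plus the always-on shared expert, out of 8 routed + 1 shared.
n_routed, n_shared, top_k = 8, 1, 2
active_fraction = (top_k + n_shared) / (n_routed + n_shared)
print(active_fraction)  # 3/9, matching the "~1/3 of total FFN" claim

# Totals from the Model Details table. Assumption: the gap is exactly the
# 6 skipped routed experts, so it splits evenly into a per-expert size.
total, active = 4_540_002_816, 3_030_053_376
skipped = total - active
print(skipped, skipped // 6)  # 1509949440 inactive, 251658240 per expert
```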

## Weight Transfer Sources

| Component | Source | Strategy |
|-----------|--------|----------|
| Embeddings, LM head | Qwen/Qwen3.5-4B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-4B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-4B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-4B | Exact copy |
| Layer norms | Qwen/Qwen3.5-4B | Exact copy |
| Routed experts | Qwen/Qwen3.5-35B-A3B | Slice 256 -> 8, bilinear resize |
| Shared expert | Qwen/Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen/Qwen3.5-35B-A3B | Slice + resize |
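The expert-transfer strategy in the table (slice 256 -> 8, then bilinear-resize each weight matrix to the target shape) can be sketched in miniature. The card does not say which 8 experts are kept or how the resize is implemented, so keeping the first 8, the separable-linear resize below, and the toy shapes are all assumptions:

```python
def lerp_1d(values, out_len):
    """Linearly resample a 1D list of floats to out_len points."""
    n = len(values)
    if out_len == 1:
        return [values[0]]
    out = []
    for i in range(out_len):
        x = i * (n - 1) / (out_len - 1)   # position in the source grid
        lo = int(x)
        hi = min(lo + 1, n - 1)
        frac = x - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

def bilinear_resize(mat, out_rows, out_cols):
    """Separable bilinear resize of a 2D list-of-lists weight matrix."""
    # Resample every row to the new column count...
    tmp = [lerp_1d(row, out_cols) for row in mat]
    # ...then resample every column to the new row count.
    cols = [lerp_1d([tmp[r][c] for r in range(len(tmp))], out_rows)
            for c in range(out_cols)]
    return [[cols[c][r] for c in range(out_cols)] for r in range(out_rows)]

# Toy stand-in for the real transfer: slice the first 8 of 256 routed
# experts, then resize each expert's weight matrix to the target shape.
experts = [[[float(e + r + c) for c in range(6)] for r in range(4)]
           for e in range(256)]
kept = experts[:8]                                # slice 256 -> 8
resized = [bilinear_resize(w, 3, 4) for w in kept]
```

A constant matrix should survive the resize unchanged, which makes for an easy correctness check on any interpolation scheme used here.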

## License

Apache 2.0 (following the source models)