---
library_name: transformers
tags:
- qwen3.5
- moe
- weight-transfer
- hybrid-attention
- image-text-to-text
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Qwen3.5 MoE 4.54B (from Qwen3.5-4B)
A Qwen3.5 Mixture-of-Experts model created via dual-source weight transfer:
- Backbone (attention, embeddings, vision, norms): from Qwen/Qwen3.5-4B
- MoE experts (routed + shared): from Qwen/Qwen3.5-35B-A3B (sliced 256->8 experts, bilinear resized)
## Model Details
| Property | Value |
|---|---|
| Total Parameters | 4,540,002,816 (4.54B) |
| Active Parameters | 3,030,053,376 (3.03B) |
| Architecture | Qwen3.5 Hybrid MoE |
| Experts | 8 routed + 1 shared, top-2 |
| Hidden Size | 2560 |
| Layers | 32 (hybrid: DeltaNet + full attention) |
| Attention | GQA 16Q / 4KV, head_dim=256 |
| Context | 262,144 tokens |
| Vocab | 248,320 |
| Dtype | bfloat16 |
## Design
Total MoE FFN parameters are approximately equal to the dense model's FFN parameters. The speed benefit comes from sparsity: only the top-2 routed experts plus the shared expert are active per token (~1/3 of the total FFN parameters).
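The ~1/3 figure follows directly from the routing configuration. A quick check, assuming all experts (routed and shared) are the same size:

```python
# MoE FFN sparsity arithmetic: 8 routed experts with top-2 routing,
# plus 1 shared expert that is always active.
n_routed = 8
top_k = 2
n_shared = 1

# Fraction of FFN parameters touched per token, assuming equal-sized experts.
active_fraction = (top_k + n_shared) / (n_routed + n_shared)
print(f"active FFN fraction: {active_fraction:.3f}")  # 0.333
```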
Most weights are pre-trained (backbone from the dense model, experts from the 35B-A3B MoE). Only the resize of the expert dimensions introduces noise, making this model suitable for fine-tuning at nominal cost.
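The resize step can be sketched as separable linear interpolation over each 2-D expert weight matrix. This is a minimal numpy sketch of that idea, not the actual conversion script; the shapes below are illustrative (real Qwen3.5 expert matrices are much larger):

```python
import numpy as np

def bilinear_resize(w: np.ndarray, out_shape: tuple) -> np.ndarray:
    """Resize a 2-D weight matrix with bilinear (separable linear)
    interpolation, treating entries as samples on a uniform grid."""
    in_r, in_c = w.shape
    out_r, out_c = out_shape
    # Target sample positions on the source grid (align-corners style).
    rows = np.linspace(0.0, in_r - 1, out_r)
    cols = np.linspace(0.0, in_c - 1, out_c)
    # Interpolate along columns first: (in_r, out_c) ...
    tmp = np.stack([np.interp(cols, np.arange(in_c), row) for row in w])
    # ... then along rows: (out_r, out_c).
    return np.stack(
        [np.interp(rows, np.arange(in_r), tmp[:, j]) for j in range(out_c)],
        axis=1,
    )

src = np.arange(12, dtype=np.float32).reshape(3, 4)  # toy "expert" matrix
dst = bilinear_resize(src, (5, 6))
print(dst.shape)  # (5, 6)
```

Because the endpoints of the interpolation grid coincide with the source grid, corner weights are preserved exactly; interior weights are smooth blends of their neighbors, which is the noise the paragraph above refers to.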
## Weight Transfer Sources
| Component | Source | Strategy |
|---|---|---|
| Embeddings, LM Head | Qwen/Qwen3.5-4B | Exact copy |
| Attention (Q/K/V/O, norms) | Qwen/Qwen3.5-4B | Exact copy |
| DeltaNet (linear attention) | Qwen/Qwen3.5-4B | Exact copy |
| Vision encoder | Qwen/Qwen3.5-4B | Exact copy |
| Layer norms | Qwen/Qwen3.5-4B | Exact copy |
| Routed experts | Qwen3.5-35B-A3B | Slice 256->8, bilinear resize |
| Shared expert | Qwen3.5-35B-A3B | Bilinear resize |
| Router | Qwen3.5-35B-A3B | Slice + resize |
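The "Slice 256->8" strategy for experts and router rows can be sketched as follows. The uniform-stride selection and all tensor shapes here are assumptions for illustration; the actual selection criterion used for this checkpoint is not specified above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_dst = 256, 8
hidden = 16  # illustrative only; the real hidden size is 2560

# Stacked routed-expert weights: (num_experts, out_features, in_features),
# and router weights mapping hidden states to expert logits.
src_experts = rng.normal(size=(n_src, hidden, hidden)).astype(np.float32)
src_router = rng.normal(size=(n_src, hidden)).astype(np.float32)

# Keep every 32nd expert (a uniform-stride slice, assumed here).
keep = np.arange(0, n_src, n_src // n_dst)
dst_experts = src_experts[keep]
dst_router = src_router[keep]  # router rows must follow the kept experts

print(dst_experts.shape, dst_router.shape)  # (8, 16, 16) (8, 16)
```

Slicing the router rows together with the experts keeps each remaining logit paired with its original expert; the per-expert matrices would then be resized to the target dimensions as described above.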
## License
Apache 2.0 (following source models)