---
language:
- en
- zh
- fr
- es
- pt
- de
- it
- ru
- ja
- ko
- vi
- th
- ar
license: apache-2.0
library_name: transformers
base_model:
- Qwen/Qwen3-4B
tags:
- qwen
- qwen3
- causal-lm
- qualcomm
- ai-hub
- on-device
- onnx
- qnn
pipeline_tag: text-generation
---
# Qwen3-4B
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
Qwen3-4B is a state-of-the-art multilingual base language model with 4 billion parameters, excelling in language understanding, generation, coding, and mathematics.
Model Conversion Contributor: carrycooldude
## Model Stats
- Input sequence length for Prompt Processor: 128
- Maximum context length: 4096
- Quantization Type: w4a16 (4-bit weights with 16-bit activations)
- Supported languages: 100+ languages and dialects.
- TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
- Response Rate: Rate of response generation after the first response token. Measured on a short prompt with a long response; may slow down when using longer context lengths.
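The TTFT range above follows from how the prompt processor consumes the prompt in fixed-size chunks. A minimal sketch of that arithmetic (the 128 and 4096 figures come from the stats above; the function name is illustrative, not part of any API):

```python
import math

PROMPT_CHUNK = 128   # input sequence length of the prompt processor
MAX_CONTEXT = 4096   # maximum context length

def prompt_iterations(prompt_tokens: int) -> int:
    """Number of prompt-processor passes needed before the first token.

    TTFT grows roughly linearly with this count, which is why it is
    quoted as a range: 1 iteration for a short prompt, up to
    MAX_CONTEXT / PROMPT_CHUNK iterations for a full-context prompt.
    """
    if prompt_tokens > MAX_CONTEXT:
        raise ValueError("prompt exceeds the supported context length")
    return max(1, math.ceil(prompt_tokens / PROMPT_CHUNK))
```

For example, a 100-token prompt needs a single pass, while a full 4096-token prompt needs 32, which is why the upper TTFT bound is noticeably higher.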
## Model Details
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and GQA (Grouped Query Attention)
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Context Length Support: Up to 4096 tokens (optimized for on-device)
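Of the architectural components listed above, RMSNorm is the simplest to illustrate. A generic, stdlib-only sketch of the operation (textbook formula, not code extracted from the model):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root-mean-square.

    Unlike LayerNorm it subtracts no mean, which makes it cheaper and
    is one reason it is popular in Qwen-style transformer blocks.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

hidden = [1.0, -2.0, 3.0]
normed = rms_norm(hidden, gain=[1.0] * 3)
```

After normalization the output has unit root-mean-square (up to `eps`), regardless of the input's scale.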
For more details, please refer to the official Qwen3 Blog, GitHub, and Documentation.
## Model Download
| Model | Chipset | Target Runtime | Precision | Primary Compute Unit | Target Model | Performance |
|---|---|---|---|---|---|---|
| Qwen3-4B | Snapdragon 8 Elite (QCS9075) | QNN | W4A16 | NPU | Qwen3-4B-onnx-w4a16.zip | Check in AI Hub |
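The W4A16 precision in the table stores weights as 4-bit integers with a shared scale while activations stay in 16 bits. A toy, per-tensor symmetric sketch of the weight side (illustrative only; the actual AI Hub quantizer uses its own calibration and encoding scheme):

```python
def quantize_w4(weights):
    """Map float weights to signed 4-bit integers in [-8, 7] with one scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_w4(q, scale):
    """Recover approximate float weights for the higher-precision matmul."""
    return [v * scale for v in q]

w = [0.12, -0.53, 0.98, -1.4, 0.0]
q, scale = quantize_w4(w)
recon = dequantize_w4(q, scale)
```

The round trip introduces at most half a quantization step of error per weight, which is the trade-off that buys the ~4x weight-size reduction.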
## Model Inference & Conversion

### Using Qualcomm AI Hub

You can export and convert this model using Qualcomm AI Hub Models (minimum package version: 0.48.0):
```bash
# Install AI Hub Models (quote the requirement so the shell
# does not interpret ">=" as a redirect)
pip install "qai-hub-models>=0.48.0"

# Export the model with --zip-assets to generate the required format
python -m qai_hub_models.models.qwen3_4b.export \
    --target-runtime genie \
    --chipset qcs9075 \
    --zip-assets \
    --output-dir ./output
```
Note: Use the `--zip-assets` argument to ensure the model is saved in the required community repository format.
## Repository Structure

```
Qwen3-4B/
├── LICENSE
├── README.md
├── .gitattributes
└── Qwen3-4B-onnx-w4a16.zip
```
### ONNX Export (internal structure)

```
Qwen3-4B_onnx_w4a16/
├── tool_versions.yaml
├── model.onnx
├── model.data
├── model.encodings
├── tokenizer.json
├── tokenizer_config.json
└── ...
```
### tool_versions.yaml

```yaml
tool_versions:
  aihm_version: 0.48.0
  qairt: 2.34.0
```
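If you need these versions programmatically (for example, to check runtime compatibility before deployment), the file parses with any YAML library. A dependency-free sketch for this simple two-level layout (hypothetical helper; a real project would use PyYAML):

```python
def read_tool_versions(text):
    """Flatten the two-level tool_versions.yaml shown above into a dict."""
    versions = {}
    for line in text.splitlines():
        stripped = line.strip()
        # keep "key: value" lines, skip the bare "tool_versions:" header
        if ":" in stripped and not stripped.endswith(":"):
            key, _, value = stripped.partition(":")
            versions[key] = value.strip()
    return versions

sample = "tool_versions:\n  aihm_version: 0.48.0\n  qairt: 2.34.0\n"
versions = read_tool_versions(sample)
```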
## License

- Source Model: Apache-2.0
- Deployable Model: Apache-2.0
## Disclaimer
This is a community contribution. The models hosted here are user contributions and:
- Are not verified by the organization or maintainers for correctness, safety, or performance.
- May contain errors, bugs, or limitations.
- Are moderated only for structural compliance, not for content quality.
The organization and maintainers do not take responsibility for the models or assets contributed here. Use them at your own discretion.