---
language:
- en
- zh
- fr
- es
- pt
- de
- it
- ru
- ja
- ko
- vi
- th
- ar
license: apache-2.0
library_name: transformers
base_model:
- Qwen/Qwen3-4B
tags:
- qwen
- qwen3
- causal-lm
- qualcomm
- ai-hub
- on-device
- onnx
- qnn
pipeline_tag: text-generation
---
# Qwen3-4B
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
Qwen3-4B is a state-of-the-art multilingual base language model with 4 billion parameters, excelling in language understanding, generation, coding, and mathematics.
Model Conversion Contributor: carrycooldude
## Model Stats
- Input sequence length for Prompt Processor: 128
- Maximum context length: 4096
- Quantization Type: w4a16 (4-bit weights with 16-bit activations)
- Supported languages: 100+ languages and dialects.
- TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
- Response Rate: Rate of response generation after the first response token. Measured on a short prompt with a long response; may slow down when using longer context lengths.
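The TTFT range above follows from how the prompt processor consumes the prompt in fixed-size chunks. A minimal sketch of that arithmetic (the 128 and 4096 figures come from the stats above; the function name is illustrative, not part of any API):

```python
import math

PROMPT_CHUNK = 128   # input sequence length of the prompt processor
MAX_CONTEXT = 4096   # maximum context length

def prompt_iterations(prompt_tokens: int) -> int:
    """Number of prompt-processor passes needed before the first token.

    TTFT grows roughly linearly with this count, which is why it is
    quoted as a range: 1 iteration for a short prompt, up to
    MAX_CONTEXT / PROMPT_CHUNK iterations for a full-context prompt.
    """
    if prompt_tokens > MAX_CONTEXT:
        raise ValueError("prompt exceeds the supported context length")
    return max(1, math.ceil(prompt_tokens / PROMPT_CHUNK))
```

For example, a 100-token prompt needs a single pass, while a full 4096-token prompt needs 32, which is why the upper TTFT bound is noticeably higher.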
## Model Details
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and GQA (Grouped Query Attention)
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Context Length Support: Up to 4096 tokens (optimized for on-device)
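Of the architectural components listed above, RMSNorm is the simplest to illustrate. A generic, stdlib-only sketch of the operation (textbook formula, not code extracted from the model):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root-mean-square.

    Unlike LayerNorm it subtracts no mean, which makes it cheaper and
    is one reason it is popular in Qwen-style transformer blocks.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

hidden = [1.0, -2.0, 3.0]
normed = rms_norm(hidden, gain=[1.0] * 3)
```

After normalization the output has unit root-mean-square (up to `eps`), regardless of the input's scale.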
For more details, please refer to the official Qwen3 Blog, GitHub, and Documentation.
## Model Download
| Model | Chipset | Target Runtime | Precision | Primary Compute Unit | Target Model | Performance |
|---|---|---|---|---|---|---|
| Qwen3-4B | Snapdragon 8 Elite (QCS9075) | QNN | W4A16 | NPU | Qwen3-4B-onnx-w4a16.zip | Check in AI Hub |
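The W4A16 precision in the table stores weights as 4-bit integers with a shared scale while activations stay in 16 bits. A toy, per-tensor symmetric sketch of the weight side (illustrative only; the actual AI Hub quantizer uses its own calibration and encoding scheme):

```python
def quantize_w4(weights):
    """Map float weights to signed 4-bit integers in [-8, 7] with one scale."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_w4(q, scale):
    """Recover approximate float weights for the higher-precision matmul."""
    return [v * scale for v in q]

w = [0.12, -0.53, 0.98, -1.4, 0.0]
q, scale = quantize_w4(w)
recon = dequantize_w4(q, scale)
```

The round trip introduces at most half a quantization step of error per weight, which is the trade-off that buys the ~4x weight-size reduction.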
## Model Inference & Conversion

### Using Qualcomm AI Hub

You can export and convert this model using Qualcomm AI Hub Models (minimum package version: 0.48.0):
```bash
# Install AI Hub Models (quote the requirement so the shell
# does not interpret ">=" as a redirect)
pip install "qai-hub-models>=0.48.0"

# Export the model with --zip-assets to generate the required format
python -m qai_hub_models.models.qwen3_4b.export \
    --target-runtime genie \
    --chipset qcs9075 \
    --zip-assets \
    --output-dir ./output
```
Note: Use the `--zip-assets` argument to ensure the model is saved in the required community repository format.
## Repository Structure

```
Qwen3-4B/
├── LICENSE
├── README.md
├── .gitattributes
└── Qwen3-4B-onnx-w4a16.zip
```
### ONNX Export (internal structure)

```
Qwen3-4B_onnx_w4a16/
├── tool_versions.yaml
├── model.onnx
├── model.data
├── model.encodings
├── tokenizer.json
├── tokenizer_config.json
└── ...
```
### tool_versions.yaml

```yaml
tool_versions:
  aihm_version: 0.48.0
  qairt: 2.34.0
```
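If you need these versions programmatically (for example, to check runtime compatibility before deployment), the file parses with any YAML library. A dependency-free sketch for this simple two-level layout (hypothetical helper; a real project would use PyYAML):

```python
def read_tool_versions(text):
    """Flatten the two-level tool_versions.yaml shown above into a dict."""
    versions = {}
    for line in text.splitlines():
        stripped = line.strip()
        # keep "key: value" lines, skip the bare "tool_versions:" header
        if ":" in stripped and not stripped.endswith(":"):
            key, _, value = stripped.partition(":")
            versions[key] = value.strip()
    return versions

sample = "tool_versions:\n  aihm_version: 0.48.0\n  qairt: 2.34.0\n"
versions = read_tool_versions(sample)
```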
## License

- Source Model: Apache-2.0
- Deployable Model: Apache-2.0
## Disclaimer
This is a community contribution. The models hosted here are user contributions and:
- Are not verified by the organization or maintainers for correctness, safety, or performance.
- May contain errors, bugs, or limitations.
- Are moderated only for structural compliance, not for content quality.
The organization and maintainers do not take responsibility for the models or assets contributed here. Use them at your own discretion.