---
language:
  - en
  - zh
  - fr
  - es
  - pt
  - de
  - it
  - ru
  - ja
  - ko
  - vi
  - th
  - ar
license: apache-2.0
library_name: transformers
base_model:
  - Qwen/Qwen3-4B
tags:
  - qwen
  - qwen3
  - causal-lm
  - qualcomm
  - ai-hub
  - on-device
  - onnx
  - qnn
pipeline_tag: text-generation
---

Qwen3-4B

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen3-4B is a state-of-the-art multilingual base language model with 4 billion parameters, excelling in language understanding, generation, coding, and mathematics.

Model Conversion Contributor: carrycooldude

Model Stats:

  • Input sequence length for Prompt Processor: 128
  • Maximum context length: 4096
  • Quantization Type: w4a16 (4-bit weights with 16-bit activations)
  • Supported languages: 100+ languages and dialects.
  • TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
  • Response Rate: Rate of response generation after the first response token. Measured on a short prompt with a long response; may slow down when using longer context lengths.
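
As a rough illustration of the w4a16 scheme above, here is a minimal NumPy sketch of symmetric per-channel 4-bit weight quantization with higher-precision activations. The actual export's grouping and encoding details differ; `quantize_w4` and its per-row scaling are assumptions made for illustration only:

```python
import numpy as np

def quantize_w4(w: np.ndarray):
    """Symmetric per-channel 4-bit quantization: integer codes in [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # one scale per output channel
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from 4-bit codes and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_w4(w)
w_hat = dequantize_w4(q, scale)

# Activations stay in 16-bit float (the "a16" part); only weights are 4-bit.
x = rng.standard_normal((16,)).astype(np.float16)
y = w_hat @ x.astype(np.float32)
```

The per-element reconstruction error is bounded by half of the channel scale, which is why w4a16 preserves accuracy far better than naive uniform 4-bit quantization over the whole tensor.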

Model Details

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Architecture: Transformers with RoPE, SwiGLU, RMSNorm, and GQA (Grouped Query Attention)
  • Number of Parameters: 4.0B
  • Number of Parameters (Non-Embedding): 3.6B
  • Context Length Support: Up to 4096 tokens (optimized for on-device)
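
Of the architecture components listed above, RMSNorm is simple enough to sketch; a minimal NumPy version (illustrative only, not the production kernel, and the `eps` value is an assumption):

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: rescale by the reciprocal root-mean-square, with no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.ones(4)          # learned per-feature gain, ones for the demo
y = rms_norm(x, w)      # output has unit root-mean-square per row
```

Unlike LayerNorm, RMSNorm skips the mean-centering step, which saves computation on device while normalizing activations to a comparable scale.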

For more details, please refer to the official Qwen3 Blog, GitHub, and Documentation.

Model Download

| Model    | Chipset                      | Target Runtime | Precision | Primary Compute Unit | Target Model            | Performance     |
|----------|------------------------------|----------------|-----------|----------------------|-------------------------|-----------------|
| Qwen3-4B | Snapdragon 8 Elite (QCS9075) | QNN            | W4A16     | NPU                  | Qwen3-4B-onnx-w4a16.zip | Check in AI Hub |

Model Inference & Conversion

Using Qualcomm AI Hub

You can export and convert this model using Qualcomm AI Hub Models (minimum package version: 0.48.0):

# Install AI Hub Models
pip install "qai-hub-models>=0.48.0"

# Export the model with --zip-assets to generate the required format
python -m qai_hub_models.models.qwen3_4b.export --target-runtime genie --chipset qcs9075 --zip-assets --output-dir ./output

Note: Use the --zip-assets argument to ensure the model is saved in the required community repository format.

Repository Structure

Qwen3-4B/
β”œβ”€β”€ LICENSE
β”œβ”€β”€ README.md
β”œβ”€β”€ .gitattributes
└── Qwen3-4B-onnx-w4a16.zip

ONNX Export (Internal structure)

Qwen3-4B_onnx_w4a16/
β”œβ”€β”€ tool_versions.yaml
β”œβ”€β”€ model.onnx
β”œβ”€β”€ model.data
β”œβ”€β”€ model.encodings
β”œβ”€β”€ tokenizer.json
β”œβ”€β”€ tokenizer_config.json
└── ...
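
A downloaded archive can be sanity-checked against the layout above using only the standard library. In this sketch, an in-memory zip stands in for the real Qwen3-4B-onnx-w4a16.zip:

```python
import io
import zipfile

# Files expected inside the archive, per the layout documented above.
EXPECTED = {"tool_versions.yaml", "model.onnx", "model.data",
            "model.encodings", "tokenizer.json", "tokenizer_config.json"}

def missing_files(zip_path_or_file) -> set:
    """Return the expected files that are absent from the archive."""
    with zipfile.ZipFile(zip_path_or_file) as zf:
        names = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return EXPECTED - names

# Demo: build a stand-in archive with the documented contents.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in EXPECTED:
        zf.writestr(f"Qwen3-4B_onnx_w4a16/{name}", "")
buf.seek(0)
print(missing_files(buf))  # → set() when the archive is complete
```

For a real download, pass the path to the zip file instead of the in-memory buffer.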

tool_versions.yaml

tool_versions:
  aihm_version: 0.48.0
  qairt: 2.34.0
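
The version pins in this file can be read without extra dependencies; a minimal sketch (a real project would typically use PyYAML's `yaml.safe_load` instead of this hand-rolled parse):

```python
# Stdlib-only parse of the flat key: value pairs in tool_versions.yaml.
yaml_text = """\
tool_versions:
  aihm_version: 0.48.0
  qairt: 2.34.0
"""

versions = {}
for line in yaml_text.splitlines():
    line = line.strip()
    if ":" in line and not line.endswith(":"):  # skip the section header
        key, value = line.split(":", 1)
        versions[key.strip()] = value.strip()

print(versions["aihm_version"])  # → 0.48.0
```

Checking these pins before loading the asset helps catch toolchain mismatches between the export environment and the device runtime.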

License

This model is distributed under the Apache 2.0 license (see the LICENSE file in this repository).

Disclaimer

This is a community contribution. The models hosted here are user contributions and:

  • Are not verified by the organization or maintainers for correctness, safety, or performance.
  • May contain errors, bugs, or limitations.
  • Are moderated only for structural compliance, not for content quality.

The organization and maintainers do not take responsibility for the models or assets contributed here. Use them at your own discretion.