# Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4
Re-quantized on 2026-04-13 with a corrected ignore list (`mlp.gate` and `embed_tokens` are now preserved in BF16), fixing the MoE routing quality issues in the previous release.
**For NVIDIA DGX Spark (GB10 SM121) users (Driver 590.48+ / CUDA 13.1+):**
As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.
If accuracy and inference speed are your priority, we recommend the INT4 AutoRound version: 👉 YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound
INT4 AutoRound uses the mature W4A16 Marlin kernel path on DGX Spark, offers more thorough calibration (~99.5% quality retention), and delivers significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA ships complete W4A4 kernel support for SM121.
NVFP4 quantization of huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121).
## Model Details
| Item | Value |
|---|---|
| Architecture | MoE (30B total, 3B active), 48 layers, 128 experts, top-8 routing |
| Base model | Qwen/Qwen3-30B-A3B |
| Fine-tuned by | huihui-ai (Thinking 2507 + abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~18.1 GB (NVFP4, vs ~60 GB BF16 original) |
| Context length | Up to 131,072 tokens |
| Thinking mode | Built-in Chain-of-Thought reasoning (enabled by default) |
| Tool calling | Supported (qwen3_coder parser) |
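The ~18.1 GB figure above can be roughly reconstructed from the quantization layout described below (one FP8 scale per group of 16 FP4 weights, with the embeddings and output head kept in BF16). The parameter count, vocabulary size (151936), and hidden size (2048) used here are assumptions taken from the Qwen3-30B-A3B configuration, not from this card; treat this as a back-of-envelope sketch.

```python
# Back-of-envelope size estimate for the NVFP4 checkpoint.
# Assumed (not stated in this card): ~30e9 quantized params,
# vocab=151936, hidden=2048 (Qwen3-30B-A3B config values).

TOTAL_PARAMS = 30e9      # total MoE parameters (approx.)
GROUP = 16               # NVFP4 group size
FP4_BYTES = 0.5          # E2M1 weight: 4 bits
SCALE_BYTES = 1.0        # one FP8 scale per group of 16 weights

# Each group of 16 weights stores 16 * 0.5 B of data + 1 B of scale.
bytes_per_weight = FP4_BYTES + SCALE_BYTES / GROUP  # 0.5625 B/weight

quantized_bytes = TOTAL_PARAMS * bytes_per_weight

# Layers preserved in BF16 (2 B/param): embed_tokens and lm_head,
# i.e. two vocab x hidden matrices.
VOCAB, HIDDEN = 151936, 2048
bf16_bytes = 2 * VOCAB * HIDDEN * 2

total_gb = (quantized_bytes + bf16_bytes) / 1e9
print(f"~{total_gb:.1f} GB")
```

The result lands close to the ~18.1 GB checkpoint size listed in the table, which suggests the scale overhead (1/16 byte per weight) plus the BF16 embeddings account for the gap above a naive 4-bit estimate.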
## Quantization Details
| Item | Value |
|---|---|
| Method | llm-compressor v0.10.0.1 |
| Scheme | NVFP4 (E2M1 + FP8 per-group scaling, group size 16) |
| Format | compressed-tensors v0.14.0.1 |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (train_sft split) |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| MoE expert calibration | moe_calibrate_all_experts=True (all experts receive calibration data) |
| Hardware | NVIDIA DGX Spark (GB10, 128GB unified memory) |
| Environment | transformers==4.57.1 + llm-compressor==0.10.0.1 |
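The NVFP4 scheme in the table (E2M1 values with one FP8 scale per group of 16) can be illustrated with a small fake-quantization sketch. This is not llm-compressor's actual implementation; it skips the FP8 rounding of the scales themselves and only shows the group-wise scale-and-snap idea.

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize(w: np.ndarray, group: int = 16) -> np.ndarray:
    """Fake-quantize a 1-D weight vector whose length is a multiple of `group`.

    Per group: scale so the max magnitude maps to the grid max (6.0),
    snap each value to the nearest E2M1 grid point, then rescale.
    (Real NVFP4 also rounds the scale to FP8; omitted here.)
    """
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale[scale == 0] = 1.0  # avoid div-by-zero for all-zero groups
    normed = w / scale
    # Nearest grid point by magnitude; reattach the sign.
    idx = np.abs(np.abs(normed)[..., None] - E2M1_GRID).argmin(-1)
    q = np.sign(normed) * E2M1_GRID[idx]
    return (q * scale).reshape(-1)
```

Because the scale is chosen per group of 16 rather than per tensor, outlier weights in one group do not inflate the quantization error of the rest of the tensor, which is the main accuracy argument for small-group FP4.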
## Layers Preserved in BF16

The following layers are not quantized, to preserve model quality:

| Layer | Reason |
|---|---|
| `lm_head` | Output head, sensitive to quantization noise |
| `re:.*mlp.gate$` | MoE routing gate, critical for expert selection accuracy |
| `re:.*embed_tokens$` | Input embeddings |
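In llm-compressor terms, the ignore list above corresponds to a recipe along these lines. The exact recipe was not published with this card, so the fragment below is a sketch of what it would look like, with the regex entries copied from the table:

```yaml
# Sketch of an llm-compressor recipe matching the ignore list above
# (structure assumed; the actual recipe used for this card is not published).
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: NVFP4
      ignore:
        - lm_head
        - "re:.*mlp.gate$"
        - "re:.*embed_tokens$"
```

Note that `re:.*mlp.gate$` matches the MoE router only, not the `gate_proj` projections inside each expert, which remain quantized.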
## Serving with vLLM

```bash
vllm serve /path/to/model \
  --quantization compressed-tensors \
  --served-model-name qwen3-30b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code
```
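Once the server is up it exposes an OpenAI-compatible endpoint. A minimal chat request might look like the following; the model name must match `--served-model-name` above, and `localhost:8000` assumes vLLM's default port:

```python
import json
import urllib.request

# Chat request against vLLM's OpenAI-compatible endpoint.
# Assumes the `vllm serve` command above is running locally on port 8000.
payload = {
    "model": "qwen3-30b",  # must match --served-model-name
    "messages": [
        {"role": "user", "content": "Explain NVFP4 in one sentence."}
    ],
    "max_tokens": 512,
    "temperature": 0.6,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

With `--reasoning-parser qwen3` enabled, the chain-of-thought is returned in a separate `reasoning_content` field of the response message rather than mixed into `content`.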
## DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (the native W4A4 path is not yet supported; the `cvt.e2m1x2` instruction is missing)
- Qwen3 (non-3.5) has no Mamba layers, so the FP8 KV cache is safe to use
- Qwen3 has no GDN, so `linear_attn` does not need to be excluded
- Clear the page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
## Safety Warning
This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.
## Credits
- Original Model: Qwen/Qwen3-30B-A3B by Alibaba Qwen Team
- Thinking 2507 & Abliteration: huihui-ai
- NVFP4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
- Quantization Tool: llm-compressor by vLLM Project
- Reference: RedHatAI/Qwen3-30B-A3B-NVFP4