Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4


Re-quantized on 2026-04-13 with a corrected ignore list (mlp.gate and embed_tokens are now preserved in BF16), fixing the routing-quality issues in the previous release.

NVIDIA DGX Spark (GB10 SM121) — Driver 590.48+ / CUDA 13.1+

As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.

If accuracy and inference speed are your priorities, we recommend the INT4 AutoRound version instead: 👉 YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound

INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121).

Model Details

Item Value
Architecture MoE (30B total, 3B active), 48 layers, 128 experts, top-8 routing
Base model Qwen/Qwen3-30B-A3B
Fine-tuned by huihui-ai (Thinking 2507 + abliteration)
Quantized by YuYu1015
Model size ~18.1 GB (NVFP4, vs ~60 GB BF16 original)
Context length Up to 131,072 tokens
Thinking mode Built-in Chain-of-Thought reasoning (enabled by default)
Tool calling Supported (qwen3_coder parser)
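
The ~18.1 GB figure is consistent with back-of-the-envelope arithmetic. A rough sketch, assuming ~30B quantizable parameters, one FP8 scale per 16-weight group, and (as an assumption, not from the card) a 151,936-token vocabulary with hidden size 2048 for the BF16-preserved embed_tokens and lm_head:

```python
# Rough NVFP4 checkpoint-size estimate (assumptions noted above).
params = 30e9                     # total parameters, most quantized to FP4
fp4 = params * 0.5                # E2M1 weights: 4 bits = 0.5 byte each
scales = params / 16              # one FP8 (1-byte) scale per 16-weight group
# BF16-preserved tensors: embed_tokens + lm_head (vocab x hidden, 2 bytes each);
# the mlp.gate routing weights are comparatively tiny and ignored here.
bf16 = 2 * (151_936 * 2048 * 2)
total_gb = (fp4 + scales + bf16) / 1e9   # ~18.1, in line with the checkpoint
```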

Quantization Details

Item Value
Method llm-compressor v0.10.0.1
Scheme NVFP4 (E2M1 + FP8 per-group scaling, group size 16)
Format compressed-tensors v0.14.0.1
Calibration dataset HuggingFaceH4/ultrachat_200k (train_sft split)
Calibration samples 512
Calibration sequence length 2048
MoE expert calibration moe_calibrate_all_experts=True (all experts receive calibration data)
Hardware NVIDIA DGX Spark (GB10, 128GB unified memory)
Environment transformers==4.57.1 + llm-compressor==0.10.0.1
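
The E2M1 + per-group scheme in the table can be sketched in plain Python. This is an illustration only, not the llm-compressor implementation: real NVFP4 packs two FP4 values per byte and stores each group scale in FP8 E4M3 (kept as a plain float here).

```python
# E2M1 (FP4) representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(group):
    """Quantize one group of 16 weights to signed E2M1 with a shared scale."""
    amax = max(abs(x) for x in group)
    scale = amax / 6.0 if amax > 0 else 1.0   # map the largest magnitude to 6.0
    q = []
    for x in group:
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        q.append(mag if x >= 0 else -mag)
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```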

Layers Preserved in BF16

The following layers are not quantized to preserve model quality:

Layer Reason
lm_head Output head, sensitive to quantization noise
re:.*mlp.gate$ MoE routing gate — critical for expert selection accuracy
re:.*embed_tokens$ Input embeddings
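
The `re:` prefix marks compressed-tensors regex patterns. Stripped of that prefix they can be sanity-checked with plain Python `re` against illustrative Qwen3-style module names (the names below are examples, not read from the checkpoint):

```python
import re

# "re:" prefix from the ignore list stripped; note the unescaped "."
# in "mlp.gate" is a regex wildcard that happens to match the literal dot.
IGNORE_PATTERNS = [r".*mlp.gate$", r".*embed_tokens$"]
EXACT_IGNORES = {"lm_head"}  # listed by exact name, not regex

def is_preserved_bf16(module_name: str) -> bool:
    """True if the module is excluded from NVFP4 quantization."""
    if module_name in EXACT_IGNORES:
        return True
    return any(re.match(p, module_name) for p in IGNORE_PATTERNS)
```

Note that the `$` anchor keeps expert projections such as `gate_proj` quantized while the routing gate itself stays in BF16.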

Serving with vLLM

vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3-30b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
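
Once serving, the endpoint speaks the OpenAI chat API. A minimal request-body sketch (the model name matches --served-model-name above; `chat_template_kwargs`/`enable_thinking` is vLLM's pass-through for the Qwen3 thinking toggle, included here as an assumption about your vLLM version):

```python
import json

# Request body for POST /v1/chat/completions on the vLLM server above.
payload = {
    "model": "qwen3-30b",  # matches --served-model-name
    "messages": [{"role": "user", "content": "Summarize top-8 MoE routing."}],
    "max_tokens": 1024,
    # vLLM extension forwarded to the chat template; Qwen3 templates use
    # enable_thinking to toggle the built-in chain-of-thought.
    "chat_template_kwargs": {"enable_thinking": True},
}
body = json.dumps(payload)
```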

DGX Spark (SM121) Compatibility Notes

  • NVFP4 on SM121 falls back to W4A16 (native W4A4 path not yet supported, missing cvt.e2m1x2 instruction)
  • Qwen3 (non-3.5) has no Mamba layers, so FP8 KV cache works safely
  • Qwen3 has no GDN, so linear_attn does not need to be excluded
  • Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

Credits


