Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated-NVFP4


Re-quantized on 2026-04-13 with a corrected ignore list (mlp.gate and embed_tokens are now preserved in BF16), fixing the routing-quality issues in the previous release.

NVIDIA DGX Spark (GB10 SM121) — Driver 590.48+ / CUDA 13.1+

As of April 2026, NVFP4 software support on SM121 is still incomplete. The native W4A4 compute path is not yet functional on this hardware — the runtime silently falls back to W4A16 (BF16 activations), negating the theoretical throughput advantage of FP4.

If accuracy and inference speed are your priorities, we recommend the INT4 AutoRound version instead: 👉 YuYu1015/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound

INT4 AutoRound leverages the mature W4A16 Marlin kernel path on DGX Spark, offering more thorough calibration (~99.5% quality retention) and significantly more stable performance. The full potential of NVFP4 will only be unlocked once NVIDIA delivers complete W4A4 kernel support for SM121.

NVFP4 quantization of huihui-ai/Huihui-Qwen3-30B-A3B-Thinking-2507-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121).

Model Details

Item Value
Architecture MoE (30B total, 3B active), 48 layers, 128 experts, top-8 routing
Base model Qwen/Qwen3-30B-A3B
Fine-tuned by huihui-ai (Thinking 2507 + abliteration)
Quantized by YuYu1015
Model size ~18.1 GB (NVFP4, vs ~60 GB BF16 original)
Context length Up to 131,072 tokens
Thinking mode Built-in Chain-of-Thought reasoning (enabled by default)
Tool calling Supported (qwen3_coder parser)
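
The ~18.1 GB figure is consistent with back-of-the-envelope arithmetic. A rough sketch, assuming ~30B quantizable parameters, one FP8 scale per 16-weight group, and (as an assumption, not from the card) a 151,936-token vocabulary with hidden size 2048 for the BF16-preserved embed_tokens and lm_head:

```python
# Rough NVFP4 checkpoint-size estimate (assumptions noted above).
params = 30e9                     # total parameters, most quantized to FP4
fp4 = params * 0.5                # E2M1 weights: 4 bits = 0.5 byte each
scales = params / 16              # one FP8 (1-byte) scale per 16-weight group
# BF16-preserved tensors: embed_tokens + lm_head (vocab x hidden, 2 bytes each);
# the mlp.gate routing weights are comparatively tiny and ignored here.
bf16 = 2 * (151_936 * 2048 * 2)
total_gb = (fp4 + scales + bf16) / 1e9   # ~18.1, in line with the checkpoint
```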

Quantization Details

Item Value
Method llm-compressor v0.10.0.1
Scheme NVFP4 (E2M1 + FP8 per-group scaling, group size 16)
Format compressed-tensors v0.14.0.1
Calibration dataset HuggingFaceH4/ultrachat_200k (train_sft split)
Calibration samples 512
Calibration sequence length 2048
MoE expert calibration moe_calibrate_all_experts=True (all experts receive calibration data)
Hardware NVIDIA DGX Spark (GB10, 128GB unified memory)
Environment transformers==4.57.1 + llm-compressor==0.10.0.1
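
The E2M1 + per-group scheme in the table can be sketched in plain Python. This is an illustration only, not the llm-compressor implementation: real NVFP4 packs two FP4 values per byte and stores each group scale in FP8 E4M3 (kept as a plain float here).

```python
# E2M1 (FP4) representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(group):
    """Quantize one group of 16 weights to signed E2M1 with a shared scale."""
    amax = max(abs(x) for x in group)
    scale = amax / 6.0 if amax > 0 else 1.0   # map the largest magnitude to 6.0
    q = []
    for x in group:
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        q.append(mag if x >= 0 else -mag)
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```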

Layers Preserved in BF16

The following layers are not quantized to preserve model quality:

Layer Reason
lm_head Output head, sensitive to quantization noise
re:.*mlp.gate$ MoE routing gate — critical for expert selection accuracy
re:.*embed_tokens$ Input embeddings
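
The `re:` prefix marks compressed-tensors regex patterns. Stripped of that prefix they can be sanity-checked with plain Python `re` against illustrative Qwen3-style module names (the names below are examples, not read from the checkpoint):

```python
import re

# "re:" prefix from the ignore list stripped; note the unescaped "."
# in "mlp.gate" is a regex wildcard that happens to match the literal dot.
IGNORE_PATTERNS = [r".*mlp.gate$", r".*embed_tokens$"]
EXACT_IGNORES = {"lm_head"}  # listed by exact name, not regex

def is_preserved_bf16(module_name: str) -> bool:
    """True if the module is excluded from NVFP4 quantization."""
    if module_name in EXACT_IGNORES:
        return True
    return any(re.match(p, module_name) for p in IGNORE_PATTERNS)
```

Note that the `$` anchor keeps expert projections such as `gate_proj` quantized while the routing gate itself stays in BF16.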

Serving with vLLM

vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3-30b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
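
Once serving, the endpoint speaks the OpenAI chat API. A minimal request-body sketch (the model name matches --served-model-name above; `chat_template_kwargs`/`enable_thinking` is vLLM's pass-through for the Qwen3 thinking toggle, included here as an assumption about your vLLM version):

```python
import json

# Request body for POST /v1/chat/completions on the vLLM server above.
payload = {
    "model": "qwen3-30b",  # matches --served-model-name
    "messages": [{"role": "user", "content": "Summarize top-8 MoE routing."}],
    "max_tokens": 1024,
    # vLLM extension forwarded to the chat template; Qwen3 templates use
    # enable_thinking to toggle the built-in chain-of-thought.
    "chat_template_kwargs": {"enable_thinking": True},
}
body = json.dumps(payload)
```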

DGX Spark (SM121) Compatibility Notes

  • NVFP4 on SM121 falls back to W4A16 (native W4A4 path not yet supported, missing cvt.e2m1x2 instruction)
  • Qwen3 (non-3.5) has no Mamba layers, so FP8 KV cache works safely
  • Qwen3 has no GDN, so linear_attn does not need to be excluded
  • Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

Credits


