# Huihui-Qwen3.5-27B-abliterated-int4-AutoRound
INT4 AutoRound quantization of huihui-ai/Huihui-Qwen3.5-27B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with Marlin INT4 kernel acceleration.
## Model Details
| Item | Value |
|---|---|
| Architecture | Dense 27B + GDN (Mamba) + Attention hybrid |
| Base model | Qwen/Qwen3.5-27B |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~26 GB (vs ~54 GB BF16 original) |
| Context length | Up to 65,536 tokens |
| Thinking mode | Supported (enable_thinking: true/false) |
| Tool calling | Supported (qwen3_coder parser) |
## Quantization Details
| Item | Value |
|---|---|
| Method | Intel AutoRound v0.12.2 |
| Bits | 4 |
| Group size | 128 |
| Format | auto_round (GPTQ-compatible) |
| Iterations | 200 |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Hardware | NVIDIA DGX Spark (GB10, 128GB unified memory) |
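As a sanity check on the size figures above, the per-weight cost of 4-bit, group-size-128 quantization can be estimated directly. The sketch below assumes GPTQ-style packing (an FP16 scale and a 4-bit zero point per group of 128 weights) and a round 27B parameter count; both are illustrative assumptions, not values read from the checkpoint.

```python
# Back-of-the-envelope weight-memory estimate for 4-bit, group-size-128
# quantization. Assumes GPTQ-style per-group overhead: one FP16 scale and
# one 4-bit zero point per 128 weights (an assumption for illustration).

def effective_bits_per_weight(bits=4, group_size=128,
                              scale_bits=16, zero_bits=4):
    """Quantized bits plus per-group scale/zero overhead, amortized."""
    return bits + (scale_bits + zero_bits) / group_size

def weight_gb(n_params, bits_per_weight):
    """Weight memory in GB (1 GB = 1e9 bytes) for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9

bpw = effective_bits_per_weight()
print(round(bpw, 3))                   # 4.156 bits per weight
print(round(weight_gb(27e9, bpw), 1))  # 14.0 GB of quantized weights
```

The ~14 GB this predicts covers only the quantized weights; the gap to the reported ~26 GB is consistent with the layers kept in BF16 (listed below) plus packing and metadata overhead.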
## Layers Preserved in BF16
The following layers are not quantized to preserve model quality:
| Layer | Reason |
|---|---|
| `lm_head` | Output head, sensitive to quantization noise |
| `embed_tokens` | Input embeddings (auto-excluded by shape) |
| `linear_attn.*` | GDN/DeltaNet layers, may output zeros if quantized |
| `model.visual.*` | Vision encoder (auto-excluded by shape) |
## Speculative Decoding

DFlash (requires a separate drafter model):

```shell
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 16}'
```
Note: The DFlash drafter was trained on the original Qwen3.5-27B. Acceptance rate on the abliterated variant may be lower than on the original model.
## Serving with vLLM

```shell
vllm serve /path/to/model \
  --quantization gptq_marlin \
  --served-model-name qwen3.5-27b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.90 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --language-model-only
```
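Once the server is up, it can be queried through vLLM's OpenAI-compatible endpoint. The sketch below builds a request payload with the standard library; the default `http://localhost:8000/v1` base URL is vLLM's convention, and the `chat_template_kwargs` field for toggling thinking mode follows the Qwen convention — both are assumptions, not verified against this particular build.

```python
# Sketch of a chat request against the server launched above.
# Assumptions: vLLM's default port 8000 and OpenAI-compatible route, and
# Qwen-style chat_template_kwargs for the thinking-mode toggle.
import json
from urllib import request

def build_chat_request(prompt: str, thinking: bool = True) -> bytes:
    payload = {
        "model": "qwen3.5-27b",  # matches --served-model-name above
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
        "max_tokens": 512,
    }
    return json.dumps(payload).encode()

def send(body: bytes, url: str = "http://localhost:8000/v1/chat/completions"):
    """POST the payload; requires the vLLM server to be running."""
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("Explain group-size-128 quantization.", thinking=False)
print(json.loads(body)["chat_template_kwargs"])  # {'enable_thinking': False}
```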
## DGX Spark (SM121) Compatibility Notes

- Use `--quantization gptq_marlin` for the Marlin INT4 kernel (dense model, not MoE).
- FP8 KV cache is not compatible with GDN non-causal attention layers; use `--kv-cache-dtype auto`.
- NVFP4 is not supported on SM121 (missing `cvt.e2m1x2` instruction).
- Runtime FP8 (`--quantization fp8`) is not compatible with DFlash.
- `--language-model-only` skips vision-encoder profiling for text-only inference.
- Clear the page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
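When scripting launches across mixed hardware, the notes above can be condensed into a small lookup keyed on compute capability. The `(12, 1)` tuple for GB10/SM121 and every rule below are transcribed from this document; this is a toy sketch, not the result of querying a real device.

```python
# Toy encoding of the compatibility notes above: given a compute
# capability, report which serving options apply. Rules transcribed
# from this model card, not detected at runtime.
def spark_flags(compute_capability: tuple) -> dict:
    is_sm121 = compute_capability == (12, 1)
    return {
        "quantization": "gptq_marlin",    # Marlin INT4 kernel (dense model)
        "kv_cache_dtype": "auto",         # FP8 KV cache breaks GDN layers
        "nvfp4_supported": not is_sm121,  # SM121 lacks cvt.e2m1x2
        "fp8_runtime_with_dflash": False,
    }

flags = spark_flags((12, 1))
print(flags["nvfp4_supported"])  # False
print(flags["kv_cache_dtype"])   # auto
```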
## Safety Warning
This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.
## Credits
- Original Model: Qwen/Qwen3.5-27B by Alibaba Qwen Team
- Abliteration: huihui-ai
- INT4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
- Quantization Tool: Intel AutoRound