# Huihui-Qwen3.5-27B-abliterated-int4-AutoRound
INT4 AutoRound quantization of huihui-ai/Huihui-Qwen3.5-27B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with Marlin INT4 kernel acceleration.
## Model Details
| Item | Value |
|---|---|
| Architecture | Dense 27B + GDN (Mamba) + Attention hybrid |
| Base model | Qwen/Qwen3.5-27B |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~26 GB (vs ~54 GB BF16 original) |
| Context length | Up to 65,536 tokens |
| Thinking mode | Supported (enable_thinking: true/false) |
| Tool calling | Supported (qwen3_coder parser) |
## Quantization Details
| Item | Value |
|---|---|
| Method | Intel AutoRound v0.12.2 |
| Bits | 4 |
| Group size | 128 |
| Format | auto_round (GPTQ-compatible) |
| Iterations | 200 |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Hardware | NVIDIA DGX Spark (GB10, 128GB unified memory) |
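As a sanity check on the size figures above, the per-weight cost of 4-bit, group-size-128 quantization can be estimated directly. The sketch below assumes GPTQ-style packing (an FP16 scale and a 4-bit zero point per group of 128 weights) and a round 27B parameter count; both are illustrative assumptions, not values read from the checkpoint.

```python
# Back-of-the-envelope weight-memory estimate for 4-bit, group-size-128
# quantization. Assumes GPTQ-style per-group overhead: one FP16 scale and
# one 4-bit zero point per 128 weights (an assumption for illustration).

def effective_bits_per_weight(bits=4, group_size=128,
                              scale_bits=16, zero_bits=4):
    """Quantized bits plus per-group scale/zero overhead, amortized."""
    return bits + (scale_bits + zero_bits) / group_size

def weight_gb(n_params, bits_per_weight):
    """Weight memory in GB (1 GB = 1e9 bytes) for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9

bpw = effective_bits_per_weight()
print(round(bpw, 3))                   # 4.156 bits per weight
print(round(weight_gb(27e9, bpw), 1))  # 14.0 GB of quantized weights
```

The ~14 GB this predicts covers only the quantized weights; the gap to the reported ~26 GB is consistent with the layers kept in BF16 (listed below) plus packing and metadata overhead.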
## Layers Preserved in BF16
The following layers are not quantized to preserve model quality:
| Layer | Reason |
|---|---|
| `lm_head` | Output head, sensitive to quantization noise |
| `embed_tokens` | Input embeddings (auto-excluded by shape) |
| `linear_attn.*` | GDN/DeltaNet layers, may output zeros if quantized |
| `model.visual.*` | Vision encoder (auto-excluded by shape) |
## Speculative Decoding

DFlash (requires a separate drafter model):

```shell
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 16}'
```
Note: The DFlash drafter was trained on the original Qwen3.5-27B. Acceptance rate on the abliterated variant may be lower than on the original model.
## Serving with vLLM

```shell
vllm serve /path/to/model \
  --quantization gptq_marlin \
  --served-model-name qwen3.5-27b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.90 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --language-model-only
```
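Once the server is up, it can be queried through vLLM's OpenAI-compatible endpoint. The sketch below builds a request payload with the standard library; the default `http://localhost:8000/v1` base URL is vLLM's convention, and the `chat_template_kwargs` field for toggling thinking mode follows the Qwen convention — both are assumptions, not verified against this particular build.

```python
# Sketch of a chat request against the server launched above.
# Assumptions: vLLM's default port 8000 and OpenAI-compatible route, and
# Qwen-style chat_template_kwargs for the thinking-mode toggle.
import json
from urllib import request

def build_chat_request(prompt: str, thinking: bool = True) -> bytes:
    payload = {
        "model": "qwen3.5-27b",  # matches --served-model-name above
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
        "max_tokens": 512,
    }
    return json.dumps(payload).encode()

def send(body: bytes, url: str = "http://localhost:8000/v1/chat/completions"):
    """POST the payload; requires the vLLM server to be running."""
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("Explain group-size-128 quantization.", thinking=False)
print(json.loads(body)["chat_template_kwargs"])  # {'enable_thinking': False}
```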
## DGX Spark (SM121) Compatibility Notes

- Use `--quantization gptq_marlin` for the Marlin INT4 kernel (dense model, not MoE).
- FP8 KV cache is not compatible with GDN non-causal attention layers; use `--kv-cache-dtype auto`.
- NVFP4 is not supported on SM121 (missing `cvt.e2m1x2` instruction).
- Runtime FP8 (`--quantization fp8`) is not compatible with DFlash.
- `--language-model-only` skips vision-encoder profiling for text-only inference.
- Clear the page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
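When scripting launches across mixed hardware, the notes above can be condensed into a small lookup keyed on compute capability. The `(12, 1)` tuple for GB10/SM121 and every rule below are transcribed from this document; this is a toy sketch, not the result of querying a real device.

```python
# Toy encoding of the compatibility notes above: given a compute
# capability, report which serving options apply. Rules transcribed
# from this model card, not detected at runtime.
def spark_flags(compute_capability: tuple) -> dict:
    is_sm121 = compute_capability == (12, 1)
    return {
        "quantization": "gptq_marlin",    # Marlin INT4 kernel (dense model)
        "kv_cache_dtype": "auto",         # FP8 KV cache breaks GDN layers
        "nvfp4_supported": not is_sm121,  # SM121 lacks cvt.e2m1x2
        "fp8_runtime_with_dflash": False,
    }

flags = spark_flags((12, 1))
print(flags["nvfp4_supported"])  # False
print(flags["kv_cache_dtype"])   # auto
```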
## Safety Warning
This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.
## Credits
- Original Model: Qwen/Qwen3.5-27B by Alibaba Qwen Team
- Abliteration: huihui-ai
- INT4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
- Quantization Tool: Intel AutoRound