# Huihui-Qwen3.6-27B-abliterated-NVFP4

NVFP4-quantized version of huihui-ai/Huihui-Qwen3.6-27B-abliterated — an abliterated (uncensored) variant of Qwen3.6-27B, the dense 27B VLM with Gated DeltaNet hybrid attention.

Quantized to NVIDIA FP4 by Lna-Lab using custom Blackwell NVFP4 GEMM kernels (lna-lab/blackwell-geforce-nvfp4-gemm).

**55.6 GB → 19.7 GB (0.35x)** — vision tower preserved in BF16. Runs on a single NVIDIA Blackwell GPU.
## Key Specs

| Spec | Value |
|---|---|
| Base model | huihui-ai/Huihui-Qwen3.6-27B-abliterated |
| Original | Qwen/Qwen3.6-27B |
| Architecture | Dense 27B, Gated DeltaNet + Gated Attention hybrid, VLM |
| Quantization | NVFP4 (W4A4 — FP4 weights, FP4 activations, FP8 scales) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor + blackwell-geforce-nvfp4-gemm |
| Size | 19.7 GB |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19 |
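
For intuition on where 19.7 GB comes from: NVFP4 stores each weight in 4 bits (E2M1) with one FP8 (E4M3) scale per 16-weight block. The sketch below redoes that arithmetic in Python; the 24B/3B split between quantized and BF16 parameters is an assumption for illustration, not a number read from the checkpoint.

```python
# Back-of-the-envelope checkpoint size under NVFP4 (4-bit E2M1 weights,
# one FP8 E4M3 scale per 16-weight block). The parameter split below is
# an assumption for illustration, not read from the actual checkpoint.
quantized_params = 24e9  # assumed: Linear weights quantized in the language model
bf16_params = 3e9        # assumed: lm_head, vision tower, gates, norms, embeddings

fp4_bytes = quantized_params * 4 / 8   # 0.5 byte per weight
scale_bytes = quantized_params / 16    # 1 byte (E4M3) per 16-weight block
bf16_bytes = bf16_params * 2           # 2 bytes per BF16 parameter

print(f"~{(fp4_bytes + scale_bytes + bf16_bytes) / 1e9:.1f} GB")  # ~19.5 GB
```

This lands in the same ballpark as the 19.7 GB checkpoint, with the BF16 vision tower and embeddings accounting for the gap to a pure 4-bit size.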
## Benchmark Results

Tested on a single NVIDIA RTX PRO 6000 Blackwell (96 GB), vLLM 0.19.1+, 128K context, FP8 KV cache.
| Task | Tokens | Speed (tok/s) | Status |
|---|---|---|---|
| English reasoning | 1,024 | 56.2 | PASS |
| Japanese essay (方丈記) | 2,048 | 59.7 | PASS |
| Python code generation | 2,048 | 59.1 | PASS |
| Contradictory instructions | 1,500 | 59.5 | PASS |
| VLM image description | 947 | 58.1 | PASS |
| Math proof (√2 irrationality) | 1,024 | 59.3 | PASS |
Sustained throughput: ~58 tok/s (single GPU, 128K context, FP8 KV cache)
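
You can get a comparable figure against your own server with a simple timing probe. A minimal sketch, assuming the Quick Start server below; it counts `usage.completion_tokens` from the response, so prefill time is included and the result slightly understates pure decode speed:

```python
import time

import requests

# Time one generation against the vLLM OpenAI endpoint and derive tok/s.
payload = {
    "model": "Huihui-Qwen3.6-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "Explain NVFP4 quantization."}],
    "max_tokens": 1024,
    "temperature": 0.0,
}
t0 = time.time()
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
elapsed = time.time() - t0
tokens = r.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tok/s")
```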
## VRAM Usage
| State | GPU Memory |
|---|---|
| After model load | 92,142 MiB |
| Peak (during inference) | 92,150 MiB |
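
Nearly all of this is vLLM's preallocated KV cache, not the weights: with `--gpu-memory-utilization 0.95`, vLLM reserves roughly 95% of the 96 GB card regardless of the 19.7 GB model size. The readings above can be reproduced with NVML; a minimal sketch using the `nvidia-ml-py` bindings:

```python
import pynvml  # pip install nvidia-ml-py

# Read used VRAM on GPU 0, matching the MiB figures in the table above.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
print(f"{used_mib:,.0f} MiB used")
pynvml.nvmlShutdown()
```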
## Quick Start — From Scratch with Docker

### 1. Pull the model

```bash
hf download sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --local-dir /models/Huihui-Qwen3.6-27B-abliterated-NVFP4
```
### 2. Run with Docker (128K context + FP8 KV cache)

```bash
docker run -d --name huihui-qwen36-27b \
  --gpus '"device=0"' --shm-size=16g \
  -v /models/Huihui-Qwen3.6-27B-abliterated-NVFP4:/models/current:ro \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model /models/current \
  --served-model-name Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --trust-remote-code --quantization modelopt --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --default-chat-template-kwargs '{"preserve_thinking":true}' \
  --enable-prefix-caching --enable-chunked-prefill \
  --max-model-len 131072 --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8_e4m3
```

`--served-model-name` makes the API expose the model under the name used in step 4; without it, vLLM registers the model as `/models/current` and those requests fail with a model-not-found error.
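
Weight loading and warmup take a while, so it helps to wait for the server before sending requests. A minimal readiness poll against the standard `/v1/models` endpoint (the retry count and delays are arbitrary choices):

```python
import time

import requests

# Poll the OpenAI-compatible /v1/models endpoint until the server answers.
for _ in range(120):
    try:
        if requests.get("http://localhost:8000/v1/models", timeout=2).ok:
            print("server ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    raise RuntimeError("server did not come up within 10 minutes")
```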
### 3. Run with vLLM directly

```bash
vllm serve /models/Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --served-model-name Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code
```
### 4. Test inference

**Text:**

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Huihui-Qwen3.6-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "Write a haiku about quantization."}],
    "max_tokens": 256,
    "temperature": 0.0
  }'
```
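
Since the server runs with `--reasoning-parser qwen3`, vLLM separates the model's thinking from its final answer; the thinking should appear as `reasoning_content` on the returned message. A hedged sketch for reading both fields:

```python
import requests

resp = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "Huihui-Qwen3.6-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "Write a haiku about quantization."}],
    "max_tokens": 256,
})
msg = resp.json()["choices"][0]["message"]
# With a reasoning parser enabled, vLLM splits thinking from the answer.
print("thinking:", msg.get("reasoning_content"))
print("answer:", msg["content"])
```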
**VLM (image input):**

```python
import base64
from pathlib import Path

import requests

# Send a local image as a base64 data URL together with a text prompt.
img_b64 = base64.b64encode(Path("photo.jpg").read_bytes()).decode()
resp = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "Huihui-Qwen3.6-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        {"type": "text", "text": "Describe this image."},
    ]}],
    "max_tokens": 1024,
})
print(resp.json()["choices"][0]["message"]["content"])
```
## Quantization Details

### Recipe

```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
```
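
The `re:` entries in `ignore` are regular expressions matched against full module names; everything else is a literal name. A small illustration of which modules the patterns catch (the module names below are hypothetical, chosen only to show the matching behavior, and llm-compressor's exact matching semantics may differ in detail):

```python
import re

# Ignore patterns from the recipe above, with the "re:" prefix stripped.
patterns = [r".*visual.*", r".*mlp.gate$", r".*mlp.shared_expert_gate$"]

# Hypothetical module names, purely for illustration.
modules = [
    "model.visual.blocks.0.attn.qkv",   # skipped: matches .*visual.*
    "model.layers.0.mlp.gate",          # skipped: matches .*mlp.gate$
    "model.layers.0.mlp.gate_proj",     # quantized: $ anchors the match
    "model.layers.0.self_attn.q_proj",  # quantized: no pattern matches
]
for name in modules:
    ignored = any(re.fullmatch(p, name) for p in patterns)
    print(f"{'skip ' if ignored else 'quant'}  {name}")
```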
### What's Quantized / What's Not

- **Quantized (NVFP4):** all `Linear` layers in the language model
- **Kept in BF16:** `lm_head`, all vision layers (`model.visual.*`), MLP gates
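
To confirm this split on the downloaded checkpoint, you can list tensor dtypes shard by shard: quantized layers should show packed low-precision weight tensors plus separate scale tensors, while ignored layers keep a single BF16 weight. A minimal sketch with `safetensors` (the shard filename is a placeholder; take real names from `model.safetensors.index.json`):

```python
from safetensors import safe_open

# List every tensor in one shard with its on-disk dtype.
# "model-00001-of-00005.safetensors" is a placeholder shard name.
with safe_open("model-00001-of-00005.safetensors", framework="pt") as f:
    for key in f.keys():
        print(f.get_slice(key).get_dtype(), key)
```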
### Reproduction

```python
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the abliterated base model in BF16.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "huihui-ai/Huihui-Qwen3.6-27B-abliterated",
    torch_dtype=torch.bfloat16, trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "huihui-ai/Huihui-Qwen3.6-27B-abliterated", trust_remote_code=True,
)

# NVFP4 on all Linear layers except lm_head, the vision tower, and MLP gates.
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

# Calibration with the neuralmagic/calibration dataset (20 samples, 8192 seq len)
# via llmcompressor's oneshot(); see the quantization script in this repo.

model.save_pretrained("output-NVFP4", save_compressed=True)
processor.save_pretrained("output-NVFP4")
```
Note: After saving, verify that the vision tower keys use the `model.visual.*` prefix (not `model.language_model.visual.*`). See this fix for details.
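
One way to run that check is to scan the weight map in the saved index file. A minimal sketch, assuming the standard sharded-safetensors layout:

```python
import json

# Flag vision-tower keys that ended up under the wrong prefix after saving.
with open("output-NVFP4/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

bad = [k for k in weight_map if k.startswith("model.language_model.visual.")]
good = [k for k in weight_map if k.startswith("model.visual.")]
print(f"{len(good)} keys under model.visual.*, {len(bad)} misprefixed")
```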
## Tested Environment
| Component | Version |
|---|---|
| vLLM | 0.19.1rc1+ (nightly) |
| Transformers | 5.5.4 |
| PyTorch | 2.11.0+cu130 |
| llm-compressor | 0.1.dev5 |
| CUDA | 13.0 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| OS | Ubuntu 24.04, Linux 6.17 |
## Credits
- Original model: Qwen Team (Alibaba Group) — Qwen3.6-27B
- Abliteration: huihui-ai — Huihui-Qwen3.6-27B-abliterated
- NVFP4 quantization & benchmarking: Lna-Lab
- Blackwell NVFP4 GEMM kernels: lna-lab/blackwell-geforce-nvfp4-gemm
- Quantization framework: vllm-project/llm-compressor
## Support the Base Model Authors

If you find this model useful, please consider supporting:

- huihui-ai (abliteration): Ko-fi | BTC: `bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge`
- Qwen Team (original model): Star the Qwen repo
## License

This model inherits the Apache 2.0 license from the base model.