# Huihui-Qwen3.6-27B-abliterated-NVFP4

An NVFP4-quantized version of huihui-ai/Huihui-Qwen3.6-27B-abliterated, an abliterated (uncensored) variant of Qwen3.6-27B: the dense 27B vision-language model (VLM) with a Gated DeltaNet hybrid-attention architecture.

Quantized to NVIDIA FP4 by Lna-Lab using custom Blackwell NVFP4 GEMM kernels (lna-lab/blackwell-geforce-nvfp4-gemm).

55.6 GB → 19.7 GB (0.35x) — vision tower preserved in BF16. Runs on a single NVIDIA Blackwell GPU.
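The size ratio follows from the NVFP4 layout: each weight takes 4 bits, with one FP8 scale shared per 16-element block, so roughly 4.5 effective bits per weight versus 16 in BF16. A back-of-envelope sketch (the 16-element microblock with an FP8 scale is the standard NVFP4 layout; the exact metadata overhead is my approximation):

```python
# Rough NVFP4 size arithmetic (assumes the standard 16-element
# microblock with one FP8 scale; exact metadata overhead may differ).
bf16_bits = 16
nvfp4_bits = 4 + 8 / 16                  # 4 weight bits + amortized FP8 scale = 4.5
print(f"{nvfp4_bits / bf16_bits:.2f}x")  # ~0.28x for the quantized layers alone

# The overall 0.35x (55.6 GB -> 19.7 GB) is larger because lm_head,
# the MLP gates, and the entire vision tower stay in BF16.
```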

## Key Specs

| Spec | Value |
|---|---|
| Base model | huihui-ai/Huihui-Qwen3.6-27B-abliterated |
| Original | Qwen/Qwen3.6-27B |
| Architecture | Dense 27B, Gated DeltaNet + Gated Attention hybrid, VLM |
| Quantization | NVFP4 (W4A4: FP4 weights, FP4 activations, FP8 scales) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor + blackwell-geforce-nvfp4-gemm |
| Size | 19.7 GB |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19 |

## Benchmark Results

Tested on a single NVIDIA RTX PRO 6000 Blackwell (96 GB), vLLM 0.19.1+, 128K context, FP8 KV cache.

| Task | Tokens | Speed (tok/s) | Status |
|---|---|---|---|
| English reasoning | 1,024 | 56.2 | PASS |
| Japanese essay (Hōjōki, 方丈記) | 2,048 | 59.7 | PASS |
| Python code generation | 2,048 | 59.1 | PASS |
| Contradictory instructions | 1,500 | 59.5 | PASS |
| VLM image description | 947 | 58.1 | PASS |
| Math proof (√2 irrationality) | 1,024 | 59.3 | PASS |

Sustained throughput: ~58 tok/s (single GPU, 128K context, FP8 KV cache)
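One way to sanity-check the throughput figure against a running server (endpoint and model name assume the Quick Start setup below) is to time a single completion and read the `usage` field:

```python
# Minimal throughput sanity check against the OpenAI-compatible endpoint.
# Includes prefill time, so it slightly understates decode-only speed.
import time, requests

t0 = time.time()
resp = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "Huihui-Qwen3.6-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "Explain NVFP4 quantization."}],
    "max_tokens": 1024,
    "temperature": 0.0,
})
elapsed = time.time() - t0
out = resp.json()["usage"]["completion_tokens"]
print(f"{out} tokens in {elapsed:.1f}s = {out / elapsed:.1f} tok/s")
```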

## VRAM Usage

| State | GPU Memory |
|---|---|
| After model load | 92,142 MiB |
| Peak (during inference) | 92,150 MiB |
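Load and peak are nearly identical because vLLM pre-allocates up to `--gpu-memory-utilization` (0.95 here) of VRAM as KV-cache space at startup. The readings can be reproduced while the server is running, for example via NVML (a sketch using the nvidia-ml-py package, device index 0 assumed):

```python
# Read current GPU memory usage via NVML while the server is up.
import pynvml

pynvml.nvmlInit()
mem = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0))
print(f"GPU memory used: {mem.used / 1024**2:,.0f} MiB")
pynvml.nvmlShutdown()
```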

## Quick Start — From Scratch with Docker

### 1. Pull the model

```bash
hf download sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --local-dir /models/Huihui-Qwen3.6-27B-abliterated-NVFP4
```
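The same download can be scripted with the huggingface_hub Python API if you'd rather not shell out to the CLI:

```python
# Programmatic equivalent of the `hf download` command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4",
    local_dir="/models/Huihui-Qwen3.6-27B-abliterated-NVFP4",
)
```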

### 2. Run with Docker (128K context + FP8 KV cache)

```bash
docker run -d --name huihui-qwen36-27b \
  --gpus '"device=0"' --shm-size=16g \
  -v /models/Huihui-Qwen3.6-27B-abliterated-NVFP4:/models/current:ro \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model /models/current \
  --served-model-name Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --trust-remote-code --quantization compressed-tensors \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --default-chat-template-kwargs '{"preserve_thinking":true}' \
  --enable-prefix-caching --enable-chunked-prefill \
  --max-model-len 131072 --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8_e4m3
```

`--served-model-name` registers the model under the ID used in the test requests below; without it, vLLM names the model after the mount path `/models/current`. The checkpoint ships in compressed-tensors format (see Key Specs), which `--quantization compressed-tensors` selects explicitly.

### 3. Run with vLLM directly

```bash
vllm serve /models/Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --served-model-name Huihui-Qwen3.6-27B-abliterated-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --dtype auto \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code
```

### 4. Test inference

**Text (curl):**

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Huihui-Qwen3.6-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "Write a haiku about quantization."}],
    "max_tokens": 256,
    "temperature": 0.0
  }'
```

**VLM (image input, Python):**

```python
import base64, requests
from pathlib import Path

# Encode a local image as a base64 data URL and send it through the
# OpenAI-compatible chat API.
img_b64 = base64.b64encode(Path("photo.jpg").read_bytes()).decode()
resp = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "Huihui-Qwen3.6-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        {"type": "text", "text": "Describe this image."},
    ]}],
    "max_tokens": 1024,
})
print(resp.json()["choices"][0]["message"]["content"])
```
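The same endpoint also works with the official openai Python client; vLLM's server accepts any non-empty API key:

```python
# Text request through the openai client instead of raw HTTP.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="Huihui-Qwen3.6-27B-abliterated-NVFP4",
    messages=[{"role": "user", "content": "Write a haiku about quantization."}],
    max_tokens=256,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```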

## Quantization Details

### Recipe

```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head, 're:.*visual.*', 're:.*mlp.gate$', 're:.*mlp.shared_expert_gate$']
  scheme: NVFP4
```

### What's Quantized / What's Not

- **Quantized (NVFP4):** all Linear layers in the language model
- **Kept in BF16:** lm_head, all vision layers (model.visual.*), MLP gates
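One way to spot-check this split is to scan the checkpoint's safetensors index: compressed-tensors NVFP4 checkpoints typically store quantized layers as `weight_packed` plus FP8 `weight_scale` tensors, while BF16 layers keep a plain `weight` key (these key names are an assumption and may vary across compressed-tensors versions):

```python
# List which modules carry packed NVFP4 weights vs. plain BF16 weights.
import json

index = "/models/Huihui-Qwen3.6-27B-abliterated-NVFP4/model.safetensors.index.json"
with open(index) as f:
    keys = json.load(f)["weight_map"]

packed = {k.rsplit(".", 1)[0] for k in keys if k.endswith(".weight_packed")}
print(f"{len(packed)} NVFP4 modules")
print("vision tower quantized?", any("visual." in k for k in packed))  # expect False
```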

### Reproduction

```python
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "huihui-ai/Huihui-Qwen3.6-27B-abliterated",
    torch_dtype=torch.bfloat16, trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "huihui-ai/Huihui-Qwen3.6-27B-abliterated", trust_remote_code=True,
)

recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

# Calibration with the neuralmagic/calibration dataset (20 samples, 8192 seq len)
# ... (see the quantization script in the repo)

model.save_pretrained("output-NVFP4", save_compressed=True)
processor.save_pretrained("output-NVFP4")
```
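The calibration step itself is elided above and left to the repo's script. For orientation only, a hypothetical version of that call using llm-compressor's `oneshot` API, continuing the script above with the dataset name, sample count, and sequence length taken from the comment (the repo's actual invocation may differ):

```python
# Hypothetical calibration call; the repo's quantization script is
# authoritative. Settings mirror the comment in the script above.
oneshot(
    model=model,
    dataset="neuralmagic/calibration",
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=20,
)
```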

**Note:** After saving, verify that the vision tower keys use the `model.visual.*` prefix (not `model.language_model.visual.*`), e.g. with the check below.
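A minimal assertion over the saved index (the path assumes the output directory from the script above):

```python
# Fail loudly if any vision key kept the wrong prefix after saving.
import json

with open("output-NVFP4/model.safetensors.index.json") as f:
    keys = json.load(f)["weight_map"]

bad = [k for k in keys if k.startswith("model.language_model.visual.")]
assert not bad, f"{len(bad)} vision keys still use model.language_model.visual.*"
```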

## Tested Environment

| Component | Version |
|---|---|
| vLLM | 0.19.1rc1+ (nightly) |
| Transformers | 5.5.4 |
| PyTorch | 2.11.0+cu130 |
| llm-compressor | 0.1.dev5 |
| CUDA | 13.0 |
| GPU | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| OS | Ubuntu 24.04, Linux 6.17 |

## Credits

### Support the Base Model Authors

If you find this model useful, please consider supporting:

- huihui-ai (abliteration): Ko-fi | BTC: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
- Qwen Team (original model): star the Qwen repo

## License

This model inherits the Apache 2.0 license from the base model.
