Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5 NVFP4 (ModelOpt)

This repository contains a Hugging Face export of crownelius/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5 quantized with NVIDIA ModelOpt to NVFP4 for Blackwell-oriented inference.

Relationship to the source model

This checkpoint was produced by post-training quantization (PTQ) of the source weights with ModelOpt. The model configuration, tokenizer, and processor files are carried over from the source model, apart from the chat-template adjustment described under Notes.

Quantization summary

  • Quantizer: NVIDIA ModelOpt 0.41.0
  • Output quantization: NVFP4
  • Weight format used: qformat=nvfp4_mlp_only (only MLP weights are quantized to NVFP4; the remaining weights stay in BF16)
  • KV cache format: fp8
  • Output dtype metadata: bfloat16
  • Export format: Hugging Face checkpoint
  • Main weight file: model.safetensors
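
Runtimes detect the quantization scheme from the exported hf_quant_config.json. A minimal sketch of inspecting it; the schema shown is an assumption modeled on typical ModelOpt exports, so check the actual file in this repo for the real keys and values:

```python
import json

# Hypothetical contents modeled on a typical ModelOpt export;
# the real hf_quant_config.json in this repo is authoritative.
sample = """
{
  "producer": {"name": "modelopt", "version": "0.41.0"},
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
"""

cfg = json.loads(sample)
quant = cfg["quantization"]
print(quant["quant_algo"])           # weight quantization scheme
print(quant["kv_cache_quant_algo"])  # KV cache quantization scheme
```

A serving stack typically reads these two fields to pick the NVFP4 weight kernels and the FP8 KV cache path.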

Quantization environment

Quantization was completed in a WSL2 Ubuntu 24.04 environment with NVIDIA drivers already configured on the host.

Primary container used for PTQ/export:

  • nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4

Important runtime/tooling details used during successful export:

  • TensorRT-Model-Optimizer source checkout
  • transformers 5.3.0.dev0 injected from source so that the qwen3_5 architecture is recognized
  • a local patch to the official examples/llm_ptq/hf_ptq.py flow so that image calibration works on this non-Nemotron multimodal model
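
For orientation, the export roughly corresponds to an invocation like the following. This is a sketch, not the exact command: flag names are assumed from TensorRT-Model-Optimizer's examples/llm_ptq/hf_ptq.py, and both paths are placeholders.

```shell
# Assumed hf_ptq.py flags; paths are placeholders for the local checkout.
python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path /models/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5 \
  --qformat nvfp4_mlp_only \
  --kv_cache_qformat fp8 \
  --calib_size 128 \
  --batch_size 1 \
  --export_path /models/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5-NVFP4-ModelOpt
```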

PTQ calibration settings used in the successful run

  • calibration mode: image-text calibration
  • calib_size=128
  • calib_seq=8192
  • batch_size=1
  • peak observed single-GPU memory during quantization: about 25.55 GB
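
As a back-of-envelope check on the checkpoint footprint: NVFP4 stores 4-bit values in blocks of 16 with one shared FP8 scale per block, i.e. roughly 4.5 bits per quantized weight versus 16 for BF16. A rough illustration; the MLP fraction below is an assumed number for the sketch, not a measurement of this model:

```python
BLOCK = 16                       # NVFP4 block size
bits_per_weight = 4 + 8 / BLOCK  # 4-bit value + shared FP8 scale per block
print(bits_per_weight)           # 4.5

# Illustrative split: assume ~7e9 total params, of which MLP weights
# (quantized under nvfp4_mlp_only) are some fraction; the rest stay BF16.
total = 7e9
mlp_frac = 0.6                   # assumed fraction, not measured
size_gb = (total * mlp_frac * bits_per_weight
           + total * (1 - mlp_frac) * 16) / 8 / 1e9
print(round(size_gb, 2))         # ~8 GB under these assumptions
```

Per-tensor metadata and the unquantized embedding/head weights add on top of this, so the real model.safetensors will be somewhat larger than the raw estimate.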

Files in this repo

  • model.safetensors
  • config.json
  • hf_quant_config.json
  • processor_config.json
  • tokenizer.json
  • tokenizer_config.json
  • generation_config.json
  • chat_template.jinja

Serving status

Validated locally with:

  • SGLang 0.5.9
  • transformers 5.3.0.dev0
  • --quantization modelopt_fp4
  • --attention-backend triton

Local validation covered:

  • text chat in Chinese, Japanese, English, and mixed prompts
  • multimodal image understanding
  • simple concurrent requests
  • long-context retrieval

Example SGLang launch

A tested container image for this model family is available at:

  • rhoninseiei/sglang-qwen35-nvfp4:sglang0.5.9-transformers5.3.0dev0

Example launch:

docker run -d \
  --name sglang_qwen35_nvfp4 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e MODEL_PATH=/models/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5-NVFP4-ModelOpt \
  -p 31000:30000 \
  -v /path/to/models:/models \
  rhoninseiei/sglang-qwen35-nvfp4:sglang0.5.9-transformers5.3.0dev0
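
Once the container is up, SGLang serves an OpenAI-compatible API on the published host port (31000 in the mapping above). A minimal client sketch; it only constructs the request, since actually sending it requires the running server, and the model name here is a placeholder:

```python
import json

# Endpoint implied by the -p 31000:30000 mapping above.
url = "http://localhost:31000/v1/chat/completions"

payload = {
    "model": "default",  # placeholder; use the served model name
    "messages": [
        {"role": "user", "content": "Describe NVFP4 in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
print(url)
print(body)
# To send: POST `body` to `url` (e.g. via urllib.request or curl)
# against the running container.
```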

Notes

  • This checkpoint is intended for ModelOpt FP4 / NVFP4 aware runtimes.
  • In local testing, the current stable vLLM release did not load this exact Qwen3_5ForConditionalGeneration architecture, even though vLLM supports ModelOpt/NVFP4 checkpoints more generally.
  • The included chat_template.jinja was adjusted so thinking output is suppressed by default for cleaner chat responses.

Disclaimer

  • This is an unofficial quantized redistribution of the source model.
  • Users must review and comply with the original model license, upstream runtime licenses, and any applicable distribution or export restrictions.
  • No claim is made that every runtime or every hardware target will load this checkpoint unchanged.