Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5 NVFP4 (ModelOpt)

This repository contains a Hugging Face export of crownelius/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5 quantized with NVIDIA ModelOpt to NVFP4 for Blackwell-oriented inference.

Relationship to the source model

This checkpoint was produced by post-training quantization (PTQ) of the source weights with ModelOpt. The model configuration, tokenizer, and processor files are carried over from the source model, apart from the chat-template adjustment described under Notes.

Quantization summary

  • Quantizer: NVIDIA ModelOpt 0.41.0
  • Output quantization: NVFP4
  • Weight format used: qformat=nvfp4_mlp_only (only MLP weights are quantized to NVFP4; the remaining weights stay in BF16)
  • KV cache format: fp8
  • Output dtype metadata: bfloat16
  • Export format: Hugging Face checkpoint
  • Main weight file: model.safetensors
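
Runtimes detect the quantization scheme from the exported hf_quant_config.json. A minimal sketch of inspecting it; the schema shown is an assumption modeled on typical ModelOpt exports, so check the actual file in this repo for the real keys and values:

```python
import json

# Hypothetical contents modeled on a typical ModelOpt export;
# the real hf_quant_config.json in this repo is authoritative.
sample = """
{
  "producer": {"name": "modelopt", "version": "0.41.0"},
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8"
  }
}
"""

cfg = json.loads(sample)
quant = cfg["quantization"]
print(quant["quant_algo"])           # weight quantization scheme
print(quant["kv_cache_quant_algo"])  # KV cache quantization scheme
```

A serving stack typically reads these two fields to pick the NVFP4 weight kernels and the FP8 KV cache path.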

Quantization environment

Quantization was completed in a WSL2 Ubuntu 24.04 environment with NVIDIA drivers already configured on the host.

Primary container used for PTQ/export:

  • nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4

Important runtime/tooling details used during successful export:

  • TensorRT-Model-Optimizer source checkout
  • transformers 5.3.0.dev0 injected from source so that the qwen3_5 architecture is recognized
  • a local patch to the official examples/llm_ptq/hf_ptq.py flow so that image calibration works on this non-Nemotron multimodal model
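
For orientation, the export roughly corresponds to an invocation like the following. This is a sketch, not the exact command: flag names are assumed from TensorRT-Model-Optimizer's examples/llm_ptq/hf_ptq.py, and both paths are placeholders.

```shell
# Assumed hf_ptq.py flags; paths are placeholders for the local checkout.
python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path /models/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5 \
  --qformat nvfp4_mlp_only \
  --kv_cache_qformat fp8 \
  --calib_size 128 \
  --batch_size 1 \
  --export_path /models/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5-NVFP4-ModelOpt
```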

PTQ calibration settings used in the successful run

  • calibration mode: image-text calibration
  • calib_size=128
  • calib_seq=8192
  • batch_size=1
  • peak observed single-GPU memory during quantization: about 25.55 GB
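
As a back-of-envelope check on the checkpoint footprint: NVFP4 stores 4-bit values in blocks of 16 with one shared FP8 scale per block, i.e. roughly 4.5 bits per quantized weight versus 16 for BF16. A rough illustration; the MLP fraction below is an assumed number for the sketch, not a measurement of this model:

```python
BLOCK = 16                       # NVFP4 block size
bits_per_weight = 4 + 8 / BLOCK  # 4-bit value + shared FP8 scale per block
print(bits_per_weight)           # 4.5

# Illustrative split: assume ~7e9 total params, of which MLP weights
# (quantized under nvfp4_mlp_only) are some fraction; the rest stay BF16.
total = 7e9
mlp_frac = 0.6                   # assumed fraction, not measured
size_gb = (total * mlp_frac * bits_per_weight
           + total * (1 - mlp_frac) * 16) / 8 / 1e9
print(round(size_gb, 2))         # ~8 GB under these assumptions
```

Per-tensor metadata and the unquantized embedding/head weights add on top of this, so the real model.safetensors will be somewhat larger than the raw estimate.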

Files in this repo

  • model.safetensors
  • config.json
  • hf_quant_config.json
  • processor_config.json
  • tokenizer.json
  • tokenizer_config.json
  • generation_config.json
  • chat_template.jinja

Serving status

Validated locally with:

  • SGLang 0.5.9
  • transformers 5.3.0.dev0
  • --quantization modelopt_fp4
  • --attention-backend triton

Local validation covered:

  • text chat in Chinese, Japanese, English, and mixed prompts
  • multimodal image understanding
  • simple concurrent requests
  • long-context retrieval

Example SGLang launch

A tested container image for this model family is available at:

  • rhoninseiei/sglang-qwen35-nvfp4:sglang0.5.9-transformers5.3.0dev0

Example launch:

docker run -d \
  --name sglang_qwen35_nvfp4 \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e MODEL_PATH=/models/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5-NVFP4-ModelOpt \
  -p 31000:30000 \
  -v /path/to/models:/models \
  rhoninseiei/sglang-qwen35-nvfp4:sglang0.5.9-transformers5.3.0dev0
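
Once the container is up, SGLang serves an OpenAI-compatible API on the published host port (31000 in the mapping above). A minimal client sketch; it only constructs the request, since actually sending it requires the running server, and the model name here is a placeholder:

```python
import json

# Endpoint implied by the -p 31000:30000 mapping above.
url = "http://localhost:31000/v1/chat/completions"

payload = {
    "model": "default",  # placeholder; use the served model name
    "messages": [
        {"role": "user", "content": "Describe NVFP4 in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
print(url)
print(body)
# To send: POST `body` to `url` (e.g. via urllib.request or curl)
# against the running container.
```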

Notes

  • This checkpoint is intended for ModelOpt FP4 / NVFP4 aware runtimes.
  • In local testing, the current stable vLLM release did not load this exact Qwen3_5ForConditionalGeneration architecture, even though vLLM supports ModelOpt/NVFP4 checkpoints more generally.
  • The included chat_template.jinja was adjusted so thinking output is suppressed by default for cleaner chat responses.

Disclaimer

  • This is an unofficial quantized redistribution of the source model.
  • Users must review and comply with the original model license, upstream runtime licenses, and any applicable distribution or export restrictions.
  • No claim is made that every runtime or every hardware target will load this checkpoint unchanged.