# Qwen3.5-9B Ultra Heretic FP8
This repository contains an offline-FP8 version of llmfan46/Qwen3.5-9B-ultra-heretic, prepared for serving on a single NVIDIA L40S with SGLang.
This is not a GGUF export and not an online quantization recipe. It is a saved Hugging Face-style quantized checkpoint produced offline with llmcompressor, then validated on Modal with SGLang.
## What this repo contains
- `model.safetensors`
- `config.json`
- `generation_config.json`
- tokenizer files
- multimodal processor files
- `quantization_manifest.json`
- `recipe.yaml`
The artifact was saved in a form that SGLang can load directly from disk.
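Before pointing a server at the directory, a quick preflight check can confirm the sidecar files listed above are actually present. This is an illustrative sketch, not part of the repo; the `EXPECTED` list and `missing_files` helper are assumptions based on the file list above.

```python
from pathlib import Path

# Sidecar files this repo ships alongside the weights (from the list above).
EXPECTED = [
    "model.safetensors",
    "config.json",
    "generation_config.json",
    "tokenizer_config.json",
    "quantization_manifest.json",
    "recipe.yaml",
]

def missing_files(checkpoint_dir: str) -> list[str]:
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

Running this against the checkpoint directory before launch turns a cryptic loader traceback into an explicit list of missing files.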
## Source model
This FP8 checkpoint was derived from:
- Base checkpoint: `llmfan46/Qwen3.5-9B-ultra-heretic`
- Upstream architecture family: `Qwen/Qwen3.5-9B`
The original Heretic model card and the upstream Qwen model card remain the authoritative references for training provenance and behavior of the unquantized model.
## Quantization notes
- Quantization method: offline FP8
- Primary tool: `llmcompressor`
- Scheme: `FP8_DYNAMIC`
- Target hardware during quantization: 1x NVIDIA L40S
- Target runtime during validation: SGLang on 1x NVIDIA L40S
The quantized artifact includes the processor and tokenizer sidecars required for multimodal loading.
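The `recipe.yaml` shipped in this repo is the authoritative record of the quantization run. For orientation only, an `FP8_DYNAMIC` recipe for `llmcompressor` typically looks like the following; the `targets` and `ignore` lists shown here are common defaults and assumptions, not facts about this checkpoint.

```yaml
# Illustrative llmcompressor recipe for the FP8_DYNAMIC scheme.
# NOT the shipped recipe.yaml; targets/ignore are typical assumptions.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      scheme: FP8_DYNAMIC
      targets: [Linear]
      ignore: [lm_head]
```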
## Compatibility notes
Two compatibility fixes were necessary to make the saved checkpoint load cleanly in SGLang:
- `video_preprocessor_config.json` was added so the multimodal processor stack could initialize correctly.
- `tokenizer_config.json` was normalized to use `Qwen2TokenizerFast` instead of `TokenizersBackend`.
Those fixes are already included in this repo.
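The tokenizer fix amounts to rewriting one key in `tokenizer_config.json`. A minimal sketch of that transformation (the function name is illustrative; the surrounding keys are a pared-down example):

```python
import json

def normalize_tokenizer_config(config: dict) -> dict:
    """Swap the tokenizer_class as described above; other keys pass through."""
    if config.get("tokenizer_class") == "TokenizersBackend":
        return {**config, "tokenizer_class": "Qwen2TokenizerFast"}
    return config

# Example: a pared-down tokenizer_config.json payload.
raw = json.loads('{"tokenizer_class": "TokenizersBackend", "model_max_length": 262144}')
fixed = normalize_tokenizer_config(raw)
print(fixed["tokenizer_class"])  # Qwen2TokenizerFast
```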
## Measured serving results on Modal
These numbers were measured with SGLang on a single L40S, loading this saved checkpoint directly from a Modal Volume.
Long-context prefill benchmark:
- Context window: 262,144 tokens
- Request size used for testing: about 90% of the context window
- Prompt tokens sent: 235,939
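As a quick sanity check on the figures above, the prompt size really is about 90% of the window:

```python
# Sanity check: 235,939 tokens is ~90% of a 262,144-token context window.
context_window = 262_144
prompt_tokens = 235_939
ratio = prompt_tokens / context_window
print(f"{ratio:.2%}")  # 90.00%
```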
Uncached 5-run batch:
- Average prompt TPS: 5932.34
- Median prompt TPS: 6070.03
- Warmed steady-state average (excluding run 1): 6120.02
- Average peak VRAM: 40.160 GiB
- Max peak VRAM: 40.161 GiB
Important interpretation:
- These are single-request prefill measurements with `max_tokens=1`.
- They are not decode throughput measurements.
- They are not multi-user throughput measurements.
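With those caveats in mind, the average prompt TPS implies roughly the following per-request prefill wall-clock time. This is back-of-envelope arithmetic on the published numbers, nothing more:

```python
# Back-of-envelope: prompt TPS is prefill throughput, so tokens / TPS
# approximates the wall-clock prefill time of one request (max_tokens=1
# makes decode time negligible).
prompt_tokens = 235_939     # from the benchmark above
avg_prompt_tps = 5932.34    # from the benchmark above
implied_seconds = prompt_tokens / avg_prompt_tps
print(f"{implied_seconds:.1f} s")  # ~39.8 s
```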
## Suggested SGLang launch
The recommended launch pattern is a normal load of the saved checkpoint:
```shell
python -m sglang.launch_server \
  --model-path /path/to/Qwen3.5-9B-ultra-heretic-fp8 \
  --served-model-name Qwen3.5-9B-ultra-heretic-fp8 \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 1 \
  --mem-fraction-static 0.8 \
  --reasoning-parser qwen3
```
If your exact SGLang build requires an explicit quantization flag, test it before standardizing on it:
```shell
python -m sglang.launch_server \
  --model-path /path/to/Qwen3.5-9B-ultra-heretic-fp8 \
  --served-model-name Qwen3.5-9B-ultra-heretic-fp8 \
  --quantization w8a8_fp8 \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 1 \
  --mem-fraction-static 0.8 \
  --reasoning-parser qwen3
```
The conservative default is to omit `--quantization` unless that exact path has been validated in your runtime.
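Once the server is up, SGLang exposes an OpenAI-compatible HTTP API. A minimal chat-completions request body against the served model name used above might look like this (the prompt text and token limit are arbitrary examples):

```python
import json

# Hypothetical request body for SGLang's OpenAI-compatible
# /v1/chat/completions endpoint on the port used above (30000).
payload = {
    "model": "Qwen3.5-9B-ultra-heretic-fp8",  # matches --served-model-name
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 16,
}
body = json.dumps(payload)
```

POSTing `body` with `Content-Type: application/json` to `http://<host>:30000/v1/chat/completions` is a reasonable smoke test after launch.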
## Notes
- This repo is intended for Transformers-compatible runtimes, not GGUF runtimes.
- The checkpoint was validated remotely on Modal, not on a small local GPU.
- The underlying model is still the Heretic variant; this repo changes serving format and precision, not the model identity.