# Qwen3.5-9B Ultra Heretic FP8
This repository contains an offline-FP8 version of llmfan46/Qwen3.5-9B-ultra-heretic, prepared for serving on a single NVIDIA L40S with SGLang.
This is not a GGUF export and not an online quantization recipe. It is a saved Hugging Face-style quantized checkpoint produced offline with llmcompressor, then validated on Modal with SGLang.
## What this repo contains
- `model.safetensors`
- `config.json`
- `generation_config.json`
- tokenizer files
- multimodal processor files
- `quantization_manifest.json`
- `recipe.yaml`
The artifact was saved in a form that SGLang can load directly from disk.
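Before pointing a server at the directory, a quick preflight check can confirm the sidecar files listed above are actually present. This is an illustrative sketch, not part of the repo; the `EXPECTED` list and `missing_files` helper are assumptions based on the file list above.

```python
from pathlib import Path

# Sidecar files this repo ships alongside the weights (from the list above).
EXPECTED = [
    "model.safetensors",
    "config.json",
    "generation_config.json",
    "tokenizer_config.json",
    "quantization_manifest.json",
    "recipe.yaml",
]

def missing_files(checkpoint_dir: str) -> list[str]:
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

Running this against the checkpoint directory before launch turns a cryptic loader traceback into an explicit list of missing files.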
## Source model
This FP8 checkpoint was derived from:
- Base checkpoint: `llmfan46/Qwen3.5-9B-ultra-heretic`
- Upstream architecture family: `Qwen/Qwen3.5-9B`
The original Heretic model card and the upstream Qwen model card remain the authoritative references for training provenance and behavior of the unquantized model.
## Quantization notes
- Quantization method: offline FP8
- Primary tool: `llmcompressor`
- Scheme: `FP8_DYNAMIC`
- Target hardware during quantization: 1x NVIDIA L40S
- Target runtime during validation: SGLang on 1x NVIDIA L40S
The quantized artifact includes the processor and tokenizer sidecars required for multimodal loading.
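The `recipe.yaml` shipped in this repo is the authoritative record of the quantization run. For orientation only, an `FP8_DYNAMIC` recipe for `llmcompressor` typically looks like the following; the `targets` and `ignore` lists shown here are common defaults and assumptions, not facts about this checkpoint.

```yaml
# Illustrative llmcompressor recipe for the FP8_DYNAMIC scheme.
# NOT the shipped recipe.yaml; targets/ignore are typical assumptions.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      scheme: FP8_DYNAMIC
      targets: [Linear]
      ignore: [lm_head]
```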
## Compatibility notes
Two compatibility fixes were necessary to make the saved checkpoint load cleanly in SGLang:
- `video_preprocessor_config.json` was added so the multimodal processor stack could initialize correctly.
- `tokenizer_config.json` was normalized to use `Qwen2TokenizerFast` instead of `TokenizersBackend`.
Those fixes are already included in this repo.
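The tokenizer fix amounts to rewriting one key in `tokenizer_config.json`. A minimal sketch of that transformation (the function name is illustrative; the surrounding keys are a pared-down example):

```python
import json

def normalize_tokenizer_config(config: dict) -> dict:
    """Swap the tokenizer_class as described above; other keys pass through."""
    if config.get("tokenizer_class") == "TokenizersBackend":
        return {**config, "tokenizer_class": "Qwen2TokenizerFast"}
    return config

# Example: a pared-down tokenizer_config.json payload.
raw = json.loads('{"tokenizer_class": "TokenizersBackend", "model_max_length": 262144}')
fixed = normalize_tokenizer_config(raw)
print(fixed["tokenizer_class"])  # Qwen2TokenizerFast
```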
## Measured serving results on Modal
These numbers were measured with SGLang on a single L40S, loading this saved checkpoint directly from a Modal Volume.
Long-context prefill benchmark:
- Context window: 262,144 tokens
- Request size used for testing: about 90% of the context window
- Prompt tokens sent: 235,939
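As a quick sanity check on the figures above, the prompt size really is about 90% of the window:

```python
# Sanity check: 235,939 tokens is ~90% of a 262,144-token context window.
context_window = 262_144
prompt_tokens = 235_939
ratio = prompt_tokens / context_window
print(f"{ratio:.2%}")  # 90.00%
```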
Uncached 5-run batch:
- Average prompt TPS: 5932.34
- Median prompt TPS: 6070.03
- Warmed steady-state average (excluding run 1): 6120.02
- Average peak VRAM: 40.160 GiB
- Max peak VRAM: 40.161 GiB
Important interpretation:
- These are single-request prefill measurements with `max_tokens=1`.
- They are not decode throughput measurements.
- They are not multi-user throughput measurements.
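With those caveats in mind, the average prompt TPS implies roughly the following per-request prefill wall-clock time. This is back-of-envelope arithmetic on the published numbers, nothing more:

```python
# Back-of-envelope: prompt TPS is prefill throughput, so tokens / TPS
# approximates the wall-clock prefill time of one request (max_tokens=1
# makes decode time negligible).
prompt_tokens = 235_939     # from the benchmark above
avg_prompt_tps = 5932.34    # from the benchmark above
implied_seconds = prompt_tokens / avg_prompt_tps
print(f"{implied_seconds:.1f} s")  # ~39.8 s
```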
## Suggested SGLang launch
The recommended launch pattern is a normal load of the saved checkpoint:
```shell
python -m sglang.launch_server \
  --model-path /path/to/Qwen3.5-9B-ultra-heretic-fp8 \
  --served-model-name Qwen3.5-9B-ultra-heretic-fp8 \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 1 \
  --mem-fraction-static 0.8 \
  --reasoning-parser qwen3
```
If your exact SGLang build requires an explicit quantization flag, test it before standardizing on it:
```shell
python -m sglang.launch_server \
  --model-path /path/to/Qwen3.5-9B-ultra-heretic-fp8 \
  --served-model-name Qwen3.5-9B-ultra-heretic-fp8 \
  --quantization w8a8_fp8 \
  --host 0.0.0.0 \
  --port 30000 \
  --tp 1 \
  --mem-fraction-static 0.8 \
  --reasoning-parser qwen3
```
The conservative default is to omit `--quantization` unless that exact path has been validated in your runtime.
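Once the server is up, SGLang exposes an OpenAI-compatible HTTP API. A minimal chat-completions request body against the served model name used above might look like this (the prompt text and token limit are arbitrary examples):

```python
import json

# Hypothetical request body for SGLang's OpenAI-compatible
# /v1/chat/completions endpoint on the port used above (30000).
payload = {
    "model": "Qwen3.5-9B-ultra-heretic-fp8",  # matches --served-model-name
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 16,
}
body = json.dumps(payload)
```

POSTing `body` with `Content-Type: application/json` to `http://<host>:30000/v1/chat/completions` is a reasonable smoke test after launch.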
## Notes
- This repo is intended for Transformers-compatible runtimes, not GGUF runtimes.
- The checkpoint was validated remotely on Modal, not on a small local GPU.
- The underlying model is still the Heretic variant; this repo changes serving format and precision, not the model identity.