Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic
FP8_DYNAMIC quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2.
This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax architecture and Qwen3.5 MTP head as the BF16 source, but quantizes most linear layers to FP8 W8A8 while leaving the most sensitive projections and sidecar components in BF16.
The published folder includes:
- model.safetensors
- model.safetensors.index.json
- model.mtp.safetensors
- processor_config.json
- preprocessor_config.json
- video_preprocessor_config.json
- recipe.yaml
Verified Inference
Local export and sanity-check evaluation were verified on 2026-03-31 on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:
- transformers==5.3.0
- llm-compressor==0.10.1.dev40+g5ae2e149
- vllm==0.17.1
What was verified in that run:
- the FP8 export completed successfully
- model.mtp.safetensors was restored into the output folder
- the checkpoint loads in transformers with device_map="auto"
- a quick perplexity sanity check against the BF16 source completed successfully
vLLM remains the intended serving path for this family of models, but full local serve validation of this exact FP8 v2 export is pending.
Quantization Strategy
Uniform FP8_DYNAMIC quantization using llm-compressor:
| Precision | Layers |
|---|---|
| FP8 W8A8 | most Linear layers |
| BF16 | lm_head, embed_tokens, self_attn.o_proj, DeltaNet linear_attn.out_proj, DeltaNet in_proj_a/in_proj_b, visual encoder, MTP sidecar |
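The split in the table can be read as an ignore list applied to module names. The sketch below is only an illustration of that idea: the regular expressions and the example module paths are assumptions made for this example and are not copied from recipe.yaml.

```python
import re

# Hypothetical patterns mirroring the BF16 row of the table above.
# They are illustrative only and are NOT the contents of recipe.yaml.
BF16_PATTERNS = [
    r"(^|\.)lm_head$",
    r"(^|\.)embed_tokens$",
    r"\.self_attn\.o_proj$",
    r"\.linear_attn\.out_proj$",
    r"\.linear_attn\.in_proj_[ab]$",
    r"(^|\.)visual\.",
    r"(^|\.)mtp\.",
]


def precision_for(module_name: str) -> str:
    """Return the precision a module would get under the split above."""
    if any(re.search(pattern, module_name) for pattern in BF16_PATTERNS):
        return "BF16"
    return "FP8 W8A8"


# Example module paths (the naming scheme is an assumption).
for name in [
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.self_attn.o_proj",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.0.mlp.down_proj",
    "lm_head",
]:
    print(f"{name}: {precision_for(name)}")
```

Listing the exceptions explicitly keeps the rule for everything else simple: any Linear module that matches no pattern is quantized.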
FP8 details:
- weights: FP8, per-channel, static scales
- input activations: FP8, per-token, dynamic scales
- output activations: not explicitly quantized
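As a rough illustration of the three bullets above, the following pure-Python sketch simulates the scaling scheme. It assumes the FP8 format is E4M3 (maximum finite value 448) and uses a simple max-abs scale; the function names are hypothetical and this is not the llm-compressor or vLLM implementation.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (the format is an assumption)


def fake_fp8_e4m3(value: float) -> float:
    """Simulate rounding to the nearest E4M3 value while staying in float."""
    value = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    _, exp = math.frexp(value)                # value = m * 2**exp, 0.5 <= |m| < 1
    step = math.ldexp(1.0, max(exp, -5) - 4)  # 3 mantissa bits; fixed step for subnormals
    return round(value / step) * step


def quantize_rows(matrix):
    """Quantize each row with its own max-abs scale; returns (rows, scales)."""
    q_rows, scales = [], []
    for row in matrix:
        scale = max(max(abs(v) for v in row), 1e-12) / FP8_E4M3_MAX
        q_rows.append([fake_fp8_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def fp8_linear(x, w_q, w_scales):
    """y = x @ w.T with quantized operands; the output itself is not quantized."""
    x_q, x_scales = quantize_rows(x)  # per-token scales, recomputed on every call
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * x_s * w_s
         for w_row, w_s in zip(w_q, w_scales)]
        for x_row, x_s in zip(x_q, x_scales)
    ]


random.seed(0)
w = [[random.gauss(0, 1) for _ in range(64)] for _ in range(32)]  # (out, in)
x = [[random.gauss(0, 1) for _ in range(64)] for _ in range(4)]   # (tokens, in)

w_q, w_scales = quantize_rows(w)  # per-channel scales, computed once (static)
y_fp8 = fp8_linear(x, w_q, w_scales)
y_ref = [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w] for x_row in x]

num = sum((a - b) ** 2 for r1, r2 in zip(y_fp8, y_ref) for a, b in zip(r1, r2))
den = sum(b ** 2 for row in y_ref for b in row)
rel_err = math.sqrt(num / den)
print(f"relative error vs. full precision: {rel_err:.4f}")
```

The same row-wise routine serves both operands. What differs is when it runs: weight scales are fixed at export time, while activation scales depend on each incoming token.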
Architecture match with the BF16 source:
- model_type=qwen3_5
- 64 text layers
- full_attention_interval=4
- mtp_num_hidden_layers=1
- max_position_embeddings=262144
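To make the hybrid layout concrete, the sketch below derives a per-layer schedule from the layer count and full_attention_interval listed above. The placement rule (every fourth layer, counted from 1, uses softmax attention) is an assumption based on how comparable hybrid configs are usually defined; the checkpoint's config.json is the authority.

```python
NUM_TEXT_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4


def layer_types(num_layers: int, interval: int) -> list[str]:
    """Assumed rule: layer i (0-based) is full attention when (i + 1) % interval == 0."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


types = layer_types(NUM_TEXT_LAYERS, FULL_ATTENTION_INTERVAL)
print(types[:4])
print("full attention layers:", types.count("full_attention"))
print("DeltaNet (linear attention) layers:", types.count("linear_attention"))
```

Under this assumption the 64 text layers split into 16 softmax-attention layers and 48 DeltaNet layers.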
Local Benchmark Slice
Local quick sanity comparison against the BF16 v2 source model, run on 2026-03-31 with a small FineWeb-Edu perplexity slice.
This is a reduced verification slice, not a full benchmark run.
| Benchmark | BF16 v2 | FP8 v2 |
|---|---|---|
| FineWeb-Edu perplexity (20 samples, 13,531 tokens, max_len=1024) | 7.0758 | 7.1051 |
Absolute perplexity delta:
- +0.0293
- about +0.41% relative to BF16
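The two figures follow directly from the table. The snippet below reproduces the arithmetic and, assuming the usual definition of perplexity as the exponential of the mean per-token negative log-likelihood, shows how such a score is formed. The `perplexity` helper is illustrative and is not the evaluation script used for this card.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


bf16_ppl = 7.0758  # from the table above
fp8_ppl = 7.1051   # from the table above

delta = fp8_ppl - bf16_ppl
relative = delta / bf16_ppl

print(f"absolute delta: {delta:+.4f}")     # +0.0293
print(f"relative delta: {relative:+.2%}")  # +0.41%
```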
Usage
vLLM
pip install -U vllm==0.17.1 transformers==5.3.0
Expected serving command:
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8_e4m3 \
--max-num-seqs 1 \
--skip-mm-profiling \
--reasoning-parser qwen3
With MTP enabled:
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8_e4m3 \
--max-num-seqs 1 \
--skip-mm-profiling \
--reasoning-parser qwen3 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
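Once one of the commands above is running, the server can be queried over its OpenAI-compatible HTTP API. The client below is a minimal standard-library sketch; it assumes the default address http://localhost:8000, and the helper names and the max_tokens value are choices made for this example.

```python
import json
import urllib.request

MODEL_ID = "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic"


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=512):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the final answer text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    # With a reasoning parser enabled, the reasoning text is returned in a
    # separate field of the message, so "content" holds only the answer.
    return body["choices"][0]["message"].get("content")


# Example call (requires the server to be running):
# print(chat("What is the capital of France?"))
```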
Transformers
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic",
    trust_remote_code=True,
)
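When generating directly with Transformers there is no server-side reasoning parser, so the decoded text may contain the reasoning inline. The helper below is a small sketch for separating it from the answer. It assumes Qwen3-style `<think>` and `</think>` delimiters, which this card does not state explicitly; the function name is hypothetical.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Assumes Qwen3-style delimiters; returns an empty reasoning string
    when no closing tag is found.
    """
    head, separator, tail = text.partition(close_tag)
    if not separator:
        return "", text.strip()
    # The opening tag may already be part of the prompt, so it is optional here.
    reasoning = head.split(open_tag, 1)[-1]
    return reasoning.strip(), tail.strip()


reasoning, answer = split_reasoning("<think>France -> Paris</think>\n\nParis")
print(reasoning)
print(answer)
```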
Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Expected | Intended serving path; this exact v2 FP8 export still needs full local serve re-validation |
| transformers >= 5.3.0 | Yes | Direct loading works with device_map="auto" |
| SGLang | Unknown | Not verified for this export |
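Before serving, it can help to confirm that the local environment meets the minimum versions in the table. The check below is a simplified sketch: it compares only the leading numeric release components and ignores pre-release and local suffixes, and the helper names are made up for this example.

```python
import re
from importlib import metadata

# Minimum versions taken from the compatibility table above.
MINIMUMS = {"vllm": "0.17.0", "transformers": "5.3.0"}


def release_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '0.10.1.dev40+g5ae2e149' -> (0, 10, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS) -> dict:
    """Report the installed version of each package and whether it is new enough."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
        report[package] = f"{installed} ({status})"
    return report


print(check_environment())
```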
Notes
- This export keeps self_attn.o_proj and DeltaNet linear_attn.out_proj in BF16 rather than FP8.
- The output folder includes the Qwen3.5 MTP sidecar and processor metadata needed for serving compatibility.
- The perplexity numbers above are intended as a quick sanity check, not as an exhaustive benchmark submission.
Model tree for mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-FP8-Dynamic
Base model
Qwen/Qwen3.5-27B