Qwen3.5-122B-A10B-CacheReady
CacheReady converts MoE prefix caching from unreliable to usable without runtime patches.
MoE routing noise — especially under fp8/fp4 quantization — prevents KV cache reuse across requests with identical prefixes. CacheReady canonicalizes router weights across functionally equivalent experts so routing decisions become deterministic by construction. Result: prefix caching becomes usable again for shared-prefix workloads.
Shared-prefix throughput goes from 0.65x (caching hurts) to 1.31x (caching helps).
CacheReady converts MoE prefix caching from unreliable to usable by enforcing routing stability across shared-prefix execution contexts.
Only router (gate) weight matrices were modified. Expert weights, attention weights, embeddings, and all other parameters are byte-for-byte identical to the original Qwen/Qwen3.5-122B-A10B. CacheReady is not a finetune or architecture change. It is a router canonicalization patch encoded directly into model weights to enable deterministic MoE inference without runtime modifications.
Approximately ~45% of experts fell into equivalence groups across the model.
Prefix Caching Throughput
All benchmarks: 4x NVIDIA H100 PCIe 80GB, vLLM 0.18.0, enforce_eager=True, tensor_parallel_size=4.
| Model | Workload | Without Cache | With Cache | Speedup |
|---|---|---|---|---|
| Original | Shared prefix | 720 tok/s | 466 tok/s | 0.65x (slower) |
| Original | Unique prefix | 481 tok/s | 480 tok/s | 1.00x |
| CacheReady | Shared prefix | 561 tok/s | 735 tok/s | 1.31x |
| CacheReady | Unique prefix | 525 tok/s | 565 tok/s | 1.08x |
On the original model, enabling prefix caching for shared-prefix workloads makes throughput 35% worse. Routing instability invalidates cached KV states, turning cache hits into expensive misses. On CacheReady, the same workload sees a 31% throughput improvement — prefix caching works as expected because routing is deterministic. This converts prefix caching from unreliable to usable in MoE serving environments.
Single-run routing determinism (sanity check)
Both models are deterministic within a single execution context. Prefix caching failures arise from routing instability across requests that share prefixes but differ slightly in quantization state, batching, or execution context.
| Model | Texts | bf16 Determinism | fp8 Determinism |
|---|---|---|---|
| Original | 20 (bf16) / 10 (fp8) | 100% | 100% |
| CacheReady | 20 (bf16) / 10 (fp8) | 100% | 100% |
This verifies router stability within a single execution context. CacheReady instead targets routing stability across shared-prefix requests, which is the requirement for prefix caching to function correctly.
Shared-prefix routing stability (cache reuse behavior)
Prefix caching requires routing decisions to remain identical across requests that share prefixes.
Example shared-prefix serving workload (vLLM):
| Model | Shared-prefix throughput |
|---|---|
| Original Qwen3.5-122B-A10B | 0.65x |
| CacheReady | 1.31x |
On the original model, routing instability invalidates the KV cache even when prefixes match.
CacheReady canonicalizes router weights across equivalent experts so shared-prefix routing remains stable and cache reuse becomes effective.
Why routing can be deterministic but prefix caching still fails
Router determinism within a single execution context does not guarantee routing stability across requests.
Small numerical differences caused by:
- fp8 / fp4 quantization
- batch shape changes
- execution order differences
- multi-tenant serving reuse
can flip top-k expert selection when experts are functionally equivalent.
CacheReady removes this instability by canonicalizing routing scores across equivalent experts.
Quality Preservation
No measurable quality change across evaluation benchmarks. The patch only modifies router weight rows for experts that were identified as functionally equivalent — producing near-identical outputs when selected. The router top-k selection is unchanged for all non-ambiguous routing decisions.
Router Equivalence Discovery
| Metric | Value |
|---|---|
| MoE layers analyzed | 48 / 48 |
| Expert equivalence groups | 2,348 (non-singleton) |
| Experts in equivalence groups | 5,555 / 12,288 (45%) |
| Router weight rows modified | 3,207 |
| Max logit diff after patch | 0.00e+00 |
| Calibration time | 12.5 minutes (8xA100) |
45% of experts across all 48 MoE layers belong to non-singleton equivalence groups — meaning large-scale routing redundancy exists across MoE routers. CacheReady exploits this redundancy safely: equivalent experts receive identical router scores, so the top-k selection is deterministic without affecting which expert actually computes the output.
Who Benefits
- Shared-prefix serving — system prompts, RAG context, multimodal (image/video) prefixes
- Quantized serving — fp8/fp4 quantization amplifies routing noise; CacheReady eliminates it
- Multi-tenant deployments — many users sharing the same base prompt
- Batched inference — consistent routing across batch elements with shared prefixes
- vLLM prefix caching users — prefix caching now works correctly for MoE models
Usage
Drop-in replacement. No code changes needed.
With vLLM (prefix caching now works)
python -m vllm.entrypoints.openai.api_server \
--model dystrio/Qwen3.5-122B-A10B-CacheReady \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--trust-remote-code
With transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"dystrio/Qwen3.5-122B-A10B-CacheReady",
torch_dtype="bfloat16",
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("dystrio/Qwen3.5-122B-A10B-CacheReady")
Compatibility
- transformers (requires qwen3_5_moe support)
- vLLM >= 0.17
- SGLang
- TGI
- Any framework that loads standard safetensors checkpoints
Citation
@misc{dystrio_cacheready_2026,
title={Routing Canonicalization for Deterministic MoE Inference},
author={Dystrio},
year={2026},
url={https://huggingface.co/dystrio/Qwen3.5-122B-A10B-CacheReady}
}
- Downloads last month
- 221
Model tree for dystrio/Qwen3.5-122B-A10B-CacheReady
Base model
Qwen/Qwen3.5-122B-A10B