Step-3.5-Flash MTP GGUF
Experimental Step-3.5-Flash MTP GGUF builds based on stepfun-ai/Step-3.5-Flash.
These GGUFs were originally published for a same-GGUF experimental fork, but Step-3.5 MTP support has now landed in upstream llama.cpp via ggml-org/llama.cpp#23274. Current upstream llama.cpp uses a separate draft/MTP GGUF through --spec-type draft-mtp and -md.
Important packaging note: the main quants in this repo are combined target+MTP files from the experimental fork era. Upstream can load them as target models, but the embedded MTP tensors are not what upstream uses for speculative decoding once -md is supplied. For upstream llama.cpp, the cleaner native layout is a target-only GGUF plus a separate draft-only MTP GGUF. The combined files are kept here for compatibility and reproducibility, not because upstream needs the target and draft tensors bundled together.
The old tnhnyzc/llama.cpp fork remains useful as historical context for the same-GGUF implementation, but the recommended path is now upstream llama.cpp.
Files
The model files are split into folders. Choose one quant folder and load the first shard; the split-aware llama.cpp loader will find the sibling shards.
Step-3.5-Flash-MTP-IQ4_XS-3.90BPW-Q8_MTP/Step-3.5-Flash-MTP-IQ3_S-3.64BPW-Q8_MTP/Step-3.5-Flash-MTP-IQ3_XXS-3.27BPW-Q8_MTP/
In these files, the MTP / nextn tensors are kept Q8_0. The public metadata reports step35.nextn_predict_layers = 1.
For upstream llama.cpp MTP, use the matching draft-only GGUF with -md. The draft-only file contains the MTP/draft tensors needed by upstream's separate draft-model path. If you make fresh upstream-native quants, you should not need to include the MTP tensors in the target GGUF.
The calibration imatrix is Bartowski's stepfun-ai_Step-3.5-Flash-imatrix.gguf from bartowski/stepfun-ai_Step-3.5-Flash-GGUF. The IQ4_XS variant follows the public AesSedai Step-3.5-Flash IQ4_XS expert layout. The IQ3_S and IQ3_XXS variants are smaller custom expert layouts.
Usage
Build current upstream llama.cpp, then run the target GGUF and the draft-only GGUF as separate files:
./build/bin/llama-server \
--model /path/to/target/first-shard-or-gguf.gguf \
-md /path/to/Step-3.5-Flash-*-draft-only.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--spec-draft-p-min 0.65 \
--ctx-size 131072 \
-ctk q8_0 -ctv q8_0 \
-ngl 99 \
-np 1 \
-b 4096 \
-ub 1024 \
-fa on \
--cache-prompt \
--cache-ram 8192
Add normal sampler/server args as needed.
The old fork flags were -mtp --draft 1; upstream uses --spec-type draft-mtp --spec-draft-n-max ... -md ... instead.
Notes
- Tested locally on Apple M3 Max / Metal with upstream llama.cpp
b9490-3571fa543after #23274 was merged. - Recommended upstream starting point:
--spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.65 -md /path/to/draft-only.gguf. - In one local upstream server smoke with the IQ3_XXS target and draft-only MTP GGUF, the server reported
32.86 t/sgeneration and50/60draft tokens accepted (83.3%acceptance) for a 128-token completion. - The target KV cache worked with
-ctk q8_0 -ctv q8_0. In this smoke, setting draft KV to Q8_0 with-ctkd q8_0 -ctvd q8_0hit an upstream assert, so keep the draft KV cache at its default f16 unless you have tested otherwise. - The published files report one trained
nextnlayer. Higher--spec-draft-n-maxvalues reuse that layer recurrently. - Treat the numbers as directional. Context length, sampler settings, cache reuse, memory pressure, and host load can move them.
See the GitHub README for current limitations and implementation details.
- Downloads last month
- 434
Model tree for tnhnyzc/Step-3.5-Flash-MTP-GGUF
Base model
stepfun-ai/Step-3.5-Flash