Step-3.5-Flash MTP GGUF

Experimental Step-3.5-Flash MTP GGUF builds based on stepfun-ai/Step-3.5-Flash.

These GGUFs were originally published for a same-GGUF experimental fork, but Step-3.5 MTP support has now landed in upstream llama.cpp via ggml-org/llama.cpp#23274. Current upstream llama.cpp uses a separate draft/MTP GGUF through --spec-type draft-mtp and -md.

Important packaging note: the main quants in this repo are combined target+MTP files from the experimental fork era. Upstream can load them as target models, but the embedded MTP tensors are not what upstream uses for speculative decoding once -md is supplied. For upstream llama.cpp, the cleaner native layout is a target-only GGUF plus a separate draft-only MTP GGUF. The combined files are kept here for compatibility and reproducibility, not because upstream needs the target and draft tensors bundled together.

The old tnhnyzc/llama.cpp fork remains useful as historical context for the same-GGUF implementation, but the recommended path is now upstream llama.cpp.

Files

The model files are split into folders. Choose one quant folder and load the first shard; the split-aware llama.cpp loader will find the sibling shards.

  • Step-3.5-Flash-MTP-IQ4_XS-3.90BPW-Q8_MTP/
  • Step-3.5-Flash-MTP-IQ3_S-3.64BPW-Q8_MTP/
  • Step-3.5-Flash-MTP-IQ3_XXS-3.27BPW-Q8_MTP/
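
As a minimal sketch of "load the first shard", the helper below prints the first shard in a quant folder. The function name first_shard is hypothetical, and it assumes the shards follow the gguf-split naming pattern ending in -00001-of-000NN.gguf; check the actual file names in the folder you downloaded before relying on it.

```shell
# Hypothetical helper: print the path of the first shard in a quant folder.
# Assumption: shard names end in -00001-of-<total>.gguf (gguf-split style).
first_shard() {
  dir="$1"
  for f in "$dir"/*-00001-of-*.gguf; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "no first shard found in $dir" >&2
  return 1
}
```

The printed path is what you would pass as --model; the loader then picks up the sibling shards from the same folder.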

In these files, the MTP / nextn tensors are kept at Q8_0. The public metadata reports step35.nextn_predict_layers = 1.

For upstream llama.cpp MTP, use the matching draft-only GGUF with -md. The draft-only file contains the MTP/draft tensors needed by upstream's separate draft-model path. If you make fresh upstream-native quants, you should not need to include the MTP tensors in the target GGUF.

The calibration imatrix is Bartowski's stepfun-ai_Step-3.5-Flash-imatrix.gguf from bartowski/stepfun-ai_Step-3.5-Flash-GGUF. The IQ4_XS variant follows the public AesSedai Step-3.5-Flash IQ4_XS expert layout. The IQ3_S and IQ3_XXS variants are smaller custom expert layouts.

Usage

Build current upstream llama.cpp, then run the target GGUF and the draft-only GGUF as separate files:

./build/bin/llama-server \
  --model /path/to/target/first-shard-or-gguf.gguf \
  -md /path/to/Step-3.5-Flash-*-draft-only.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --spec-draft-p-min 0.65 \
  --ctx-size 131072 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -np 1 \
  -b 4096 \
  -ub 1024 \
  -fa on \
  --cache-prompt \
  --cache-ram 8192

Add normal sampler/server args as needed.

The old fork flags were -mtp --draft 1; upstream uses --spec-type draft-mtp --spec-draft-n-max ... -md ... instead.
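
Because upstream needs two separate files, a small pre-flight check can catch a missing draft-only GGUF before the server starts. The sketch below is illustrative only: mtp_server_cmd is a hypothetical helper, it uses only the speculative-decoding flags shown above, and it prints the command for inspection rather than running it (paths containing spaces are printed unquoted).

```shell
# Hypothetical helper: verify both GGUFs exist, then print a llama-server
# command line using the upstream MTP flags from the Usage section.
mtp_server_cmd() {
  target="$1"
  draft="$2"
  [ -f "$target" ] || { echo "target GGUF not found: $target" >&2; return 1; }
  [ -f "$draft" ] || { echo "draft-only GGUF not found: $draft" >&2; return 1; }
  # Print each argument followed by a space, then end the line.
  printf '%s ' ./build/bin/llama-server \
    --model "$target" -md "$draft" \
    --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.65
  printf '\n'
}
```

Append the context, KV cache, and batching flags from the full command above as needed.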

Notes

  • Tested locally on Apple M3 Max / Metal with upstream llama.cpp b9490-3571fa543 after #23274 was merged.
  • Recommended upstream starting point: --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.65 -md /path/to/draft-only.gguf.
  • In one local smoke test of the upstream server with the IQ3_XXS target and draft-only MTP GGUF, the server reported 32.86 t/s generation and 50/60 draft tokens accepted (83.3% acceptance) for a 128-token completion.
  • The target KV cache worked with -ctk q8_0 -ctv q8_0. In the same smoke test, setting the draft KV cache to Q8_0 with -ctkd q8_0 -ctvd q8_0 hit an upstream assert, so keep the draft KV cache at its default f16 unless you have tested otherwise.
  • The published files report one trained nextn layer. Higher --spec-draft-n-max values reuse that layer recurrently.
  • Treat the numbers as directional. Context length, sampler settings, cache reuse, memory pressure, and host load can move them.
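
The acceptance figure above is simply accepted draft tokens divided by drafted tokens. A minimal sketch for recomputing it from your own server logs follows; accept_rate is a hypothetical helper name, and the two counts are whatever your run reports.

```shell
# Hypothetical helper: acceptance rate as a percentage, one decimal place.
# $1 = accepted draft tokens, $2 = total drafted tokens.
accept_rate() {
  awk -v a="$1" -v d="$2" 'BEGIN {
    if (d <= 0) exit 1            # avoid division by zero
    printf "%.1f\n", 100 * a / d
  }'
}

accept_rate 50 60   # prints 83.3, matching the smoke test above
```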

See the GitHub README for current limitations and implementation details.
