MTP-layer weights?

by CosmicRaisins - opened 15 days ago

Discussion

CosmicRaisins

15 days ago

Any plans on publishing an NVFP4 quant with MTP-layer weights?

huangyu-nv

StepFun org 14 days ago

Yes, this is planned. We will publish an updated version with NVFP4 + MTP.

maglat

14 days ago

Just so I understand: are MTP layers available in the GGUF variants?

bumblebeer

14 days ago

Yeah, but NVFP4 + MTP will be faster on NVIDIA hardware than a comparably sized GGUF + MTP (and potentially more accurate as well).

huangyu-nv

StepFun org 12 days ago

Update: the HF checkpoint has now been updated, so stepfun-ai/Step-3.7-Flash-NVFP4 should work with vLLM MTP speculative decoding directly.

NVFP4 + MTP

The Step-3.7-Flash-NVFP4 checkpoint has been updated with MTP draft layers and now supports vLLM speculative decoding with:

--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

On GPQA Diamond avg@16, NVFP4 + MTP matches the same NVFP4 checkpoint without MTP within statistical noise: 77.81% (with MTP) vs. 78.41% (without MTP) item accuracy over 3168 records (198 questions × 16 samples).

On a GB200 TP=4 vLLM setup with GPQA-style long-reasoning streaming prompts (~250 token prompt, ~1.6K token completion), NVFP4 + MTP improves aggregate decode throughput:

| Concurrency | NVFP4 + MTP | NVFP4 no-MTP | Speedup |
| --- | --- | --- | --- |
| 8 | 1309 tok/s | 1155 tok/s | 1.13x |
| 32 | 4391 tok/s | 3480 tok/s | 1.26x |
| 64 | 8229 tok/s | 5667 tok/s | 1.45x |

This makes the NVFP4 checkpoint a practical option for high-throughput long-reasoning workloads while keeping the original NVFP4 model weights unchanged.
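
For anyone who wants to re-derive the Speedup column or plug in their own measurements, here is a minimal sketch. The throughput figures are copied from the table above; the names `THROUGHPUT` and `speedups` are made up for this example.

```python
# Aggregate decode throughput in tok/s, copied from the table above:
# {concurrency: (NVFP4 + MTP, NVFP4 no-MTP)}
THROUGHPUT = {
    8: (1309, 1155),
    32: (4391, 3480),
    64: (8229, 5667),
}


def speedups(table):
    """Return {concurrency: MTP throughput / no-MTP throughput}."""
    return {c: mtp / base for c, (mtp, base) in table.items()}


if __name__ == "__main__":
    for concurrency, ratio in speedups(THROUGHPUT).items():
        print(f"concurrency {concurrency:>2}: {ratio:.2f}x")
```

The ratio grows with concurrency in these numbers, so the benefit is worth measuring at the batch sizes you actually serve.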

archib4

9 days ago

•

edited 9 days ago

Can you share a working vLLM config, please? I tried everything here and from your model card. The latest nightly with b12x isn't starting, and your own Docker image complains and falls back on a Step3VLProcessor error.
I'm on 2x RTX PRO 6000 cards 😀

Qucy

8 days ago

Same ask here: we want to benchmark on 2x RTX PRO 6000.

huangyu-nv

StepFun org 8 days ago

I will take a look at running on the RTX PRO 6000.

huangyu-nv

StepFun org 5 days ago

We validated the latest stepfun-ai/Step-3.7-Flash-NVFP4 checkpoint with vLLM on an RTX 6000D / SM120 system. It starts and generates with the vllm/vllm-openai:stepfun37 container using multi-GPU tensor parallelism — we tested both TP=2 and TP=4 with MTP enabled.

The serve command we validated:

export MODEL_DIR=/path/to/Step-3.7-Flash-NVFP4
export FLASHINFER_WORKSPACE_BASE=/tmp/flashinfer-step37
mkdir -p "$FLASHINFER_WORKSPACE_BASE"

python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8765 \
  --model "$MODEL_DIR" \
  --served-model-name step3p7 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --trust-remote-code \
  --async-scheduling \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Change --tensor-parallel-size to match your GPU count. For a no-MTP run, remove the --speculative-config line (and the trailing backslash on the line before it).
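
If you launch the server from a script, the two adjustments described above (GPU count and MTP on/off) can be parameterized. This is only a sketch: the flags and values are copied from the command above, while the function name `build_vllm_args` and its parameters are hypothetical helpers for this example, not part of vLLM.

```python
import json
import shlex


def build_vllm_args(model_dir, tp_size=2, mtp_tokens=3, port=8765):
    """Assemble the argv for the serve command above.

    Pass mtp_tokens=None (or 0) to drop --speculative-config for a no-MTP run.
    """
    args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--model", model_dir,
        "--served-model-name", "step3p7",
        "--quantization", "modelopt",
        "--kv-cache-dtype", "fp8",
        "--tensor-parallel-size", str(tp_size),
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.9",
        "--enable-expert-parallel",
        "--disable-cascade-attn",
        "--reasoning-parser", "step3p5",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "step3p5",
        "--trust-remote-code",
        "--async-scheduling",
    ]
    if mtp_tokens:
        spec = {"method": "mtp", "num_speculative_tokens": mtp_tokens}
        args += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable shell command for a 4-GPU run with MTP enabled.
    print(shlex.join(build_vllm_args("/path/to/Step-3.7-Flash-NVFP4", tp_size=4)))
```

Passing the list to `subprocess.run(..., check=True)` would start the server; the FLASHINFER_WORKSPACE_BASE export from the command above still has to be set in the environment.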

Caveat: SM120/SM121 falls back to the Marlin weight-only FP4 kernel rather than GB200 native FP4 compute, so it is functional but performance can differ from GB200.
If you still hit the Step3VLProcessor fallback specifically, please share the full startup log plus the exact vLLM version / container and we can dig in.

archib4

5 days ago

Thank you so much! I'll retry with your container and the nightly, as I saw the PR was merged!

archib4

5 days ago

Validated: it is working and running on Marlin without the previous error! Thank you and your team for making sure the whole Blackwell family can run your foundation models!

archib4

5 days ago

•

edited 4 days ago

@huangyu-nv From testing some workloads for a while, my conclusion is that this is a great model!

It seems intent on insisting it's Claude from Anthropic, but it runs great after the changes below.

My optimised running config on 2x RTX PRO 6000:

The b12x fallback (vLLM nightly) doesn't work at the moment. Until SWIGLUSTEP support is added to b12x, you're stuck on Marlin.

Comments on args:
max-num-batched-tokens: this value lets the stupid 20x60000 = ~7 GB vision encoder startup pass its "safety" check. Someone decided that a 20-image worst-case test for the vision encoder was a good idea. Why? Test 8 images maybe, not 20. And no, you can't OOM and crash the backend just because you sent more than 20 images, so how is that relevant as a safety check at boot? Someone didn't cook on the cybersecurity side here.

Why not 256k context length? Users never need that much unless they ingest a 200+ page document. (You shouldn't be doing that, and you should teach your clients the better way.) We also gain concurrency from the extra KV cache headroom, which is better. And 131k is plenty for agent harnesses: I run 6 profiles that spin up sub-agents just fine for advanced tasks.

limit-mm-per-prompt: limits users to 3 images per prompt. Images are great for context, but I rarely send more than 3. The width and height are just a cap; 1024x1024 is high enough resolution for agents to read.

MTP 2:
Dropping num_spec 3 → 2:

num_spec=3: avg draft acceptance ~50–73%, position-3 at 0.24–0.52
num_spec=2: avg draft acceptance ~74–94%, with good windows hitting 92.9%, 94.2%

This means throughput is better with MTP 2 than with MTP 3.

num_spec=3: ~70–95 tok/s single stream
num_spec=2: ~105–166 tok/s single stream, ~150–172 tok/s at 2 concurrent

The 3rd speculative token gets thrown away about half the time (50%), so the two SM120a GPUs compute it only to have it discarded (waste).

So num_spec=2 is faster and has higher acceptance for TP=2 on SM120 cards, as of today and with the fix in your image.

Mean acceptance length barely moved (≈2.87 → ≈2.62, ~9% fewer tokens/step), but throughput jumped 30–70%. That means the 3rd speculative token was adding almost nothing in accepted length while costing a full extra draft+verify per step: a net negative.

Speculative token counts should be increased only when the acceptance length is high; otherwise performance may be negatively affected. num_spec=3 was sitting in that penalty zone for this model on SM120, and the same goes for the Qwen 120B model in my testing.
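
The arithmetic behind that argument can be sketched as follows. It rests on two assumptions that are not stated in the post: that mean acceptance length counts one token from the target model plus the accepted draft tokens per step, and that throughput is simply tokens per step divided by time per step. The function names are made up for this example.

```python
def draft_acceptance_rate(mean_acceptance_length, num_spec):
    """Fraction of drafted tokens accepted per step.

    Assumes mean acceptance length = 1 (target-model token) + accepted drafts.
    """
    return (mean_acceptance_length - 1.0) / num_spec


def implied_step_time_ratio(len_before, len_after, throughput_gain):
    """Step time after / step time before.

    Assumes throughput = tokens_per_step / step_time, so
    step_time = tokens_per_step / throughput.
    """
    return (len_after / len_before) / (1.0 + throughput_gain)


if __name__ == "__main__":
    # Mean acceptance lengths quoted above: 2.87 (num_spec=3), 2.62 (num_spec=2).
    print(f"num_spec=3 acceptance: {draft_acceptance_rate(2.87, 3):.1%}")
    print(f"num_spec=2 acceptance: {draft_acceptance_rate(2.62, 2):.1%}")
    # Throughput gains of 30% and 70%, as quoted above.
    for gain in (0.30, 0.70):
        ratio = implied_step_time_ratio(2.87, 2.62, gain)
        print(f"+{gain:.0%} throughput -> step time x{ratio:.2f}")
```

Under those assumptions the acceptance rates come out at roughly 62% and 81%, inside the ranges reported above, and each step would need to take roughly 0.54 to 0.70 times as long at num_spec=2 to explain the throughput jump.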

args:

"/data/hf/models/models--stepfun-ai--Step-3.7-Flash-NVFP4/snapshots/4275532ffd9a9496ff36b7a2dc4a9db1048da438"
"--served-model-name=primary"
"--host=0.0.0.0"
"--port=8000"
"--quantization=modelopt"
"--kv-cache-dtype=fp8"
"--tensor-parallel-size=2"
"--max-model-len=131072"
"--max-num-batched-tokens=60000"
"--max-num-seqs=50"
"--enable-prefix-caching"
"--gpu-memory-utilization=0.9"
"--limit-mm-per-prompt"
'{"image": {"count": 3, "width": 1024, "height": 1024}}'
"--enable-expert-parallel"
"--disable-cascade-attn"
"--reasoning-parser=step3p5"
"--enable-auto-tool-choice"
"--tool-call-parser=step3p5"
"--trust-remote-code"
"--async-scheduling"
"--speculative-config"
'{"method":"mtp","num_speculative_tokens":2}'
"--override-generation-config"
'{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
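
A client for a server started with these args might look like the sketch below. It assumes the OpenAI-compatible /v1/chat/completions endpoint that vLLM serves, reachable at localhost:8000, with the model name "primary" from --served-model-name. The helper names and the client-side image check are hypothetical and only mirror the count in --limit-mm-per-prompt.

```python
import json
import urllib.request

MAX_IMAGES = 3  # mirrors the image count in --limit-mm-per-prompt above


def build_chat_request(prompt, image_urls=(), base_url="http://localhost:8000"):
    """Build (but do not send) a chat completion request with optional images."""
    if len(image_urls) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per prompt")
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    payload = {
        "model": "primary",
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

No sampling parameters are sent here, so the values in --override-generation-config would apply as the server-side defaults.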
