| --- |
| license: apache-2.0 |
| --- |
| |
| ## NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale |
|
|
| [Homepage](https://stepfun.ai/research/en/nextstep-1) | [GitHub](https://github.com/stepfun-ai/NextStep-1) | [Paper](https://github.com/stepfun-ai/NextStep-1/blob/main/nextstep_1_tech_report.pdf) |
|
|
| We introduce **NextStep-1**, a 14B-parameter autoregressive model paired with a 157M-parameter flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. |
| **NextStep-1** achieves state-of-the-art performance among autoregressive models on text-to-image generation tasks and shows strong capabilities in high-fidelity image synthesis. |
|
|
| <div align='center'> |
| <img src="assets/teaser.jpg" class="interpolation-image" alt="arch." width="100%" /> |
| </div> |
|
|
| ## Environment Setup |
|
|
| To avoid potential errors when loading and running the model, we recommend the following environment: |
|
|
| ```shell |
| conda create -n nextstep python=3.11 -y |
| conda activate nextstep |
| |
| pip install uv # optional; without uv, run plain `pip install -r requirements.txt` below instead |
| |
| # download requirements.txt from this repo, then install the dependencies |
| uv pip install -r requirements.txt |
| |
| # diffusers==0.34.0 |
| # einops==0.8.1 |
| # gradio==5.42.0 |
| # loguru==0.7.3 |
| # numpy==1.26.4 |
| # omegaconf==2.3.0 |
| # Pillow==11.0.0 |
| # Requests==2.32.4 |
| # safetensors==0.5.3 |
| # tabulate==0.9.0 |
| # torch==2.5.1 |
| # torchvision==0.20.1 |
| # tqdm==4.67.1 |
| # transformers==4.55.0 |
| ``` |
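Version mismatches are a common source of loading errors, so it can help to compare the installed packages against the pins listed above before running the model. The snippet below is a minimal sketch using only the standard library. The names `PINNED` and `check_pins` are illustrative and not part of this repository, and the version table is copied from the commented list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the commented list in the setup block above.
PINNED = {
    "diffusers": "0.34.0",
    "einops": "0.8.1",
    "gradio": "5.42.0",
    "loguru": "0.7.3",
    "numpy": "1.26.4",
    "omegaconf": "2.3.0",
    "Pillow": "11.0.0",
    "requests": "2.32.4",
    "safetensors": "0.5.3",
    "tabulate": "0.9.0",
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "tqdm": "4.67.1",
    "transformers": "4.55.0",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, installed)} for every package that is
    missing (installed is None) or installed at a different version."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package matches its pin exactly. Local build suffixes (for example a CUDA-specific `torch` build) would be reported as mismatches by this strict string comparison.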
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModel |
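| # NextStepPipeline is provided by this project's code (the `models` package), not by transformers |
| from models.gen_pipeline import NextStepPipeline |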
| |
| HF_HUB = "stepfun-ai/NextStep-1-Large" |
| |
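| # load model and tokenizer |
| # local_files_only=True reads from the local Hugging Face cache only, so download the weights first (or drop the flag to fetch them from the Hub) |
| tokenizer = AutoTokenizer.from_pretrained(HF_HUB, local_files_only=True, trust_remote_code=True) |
| model = AutoModel.from_pretrained(HF_HUB, local_files_only=True, trust_remote_code=True) |
| # move the pipeline to the GPU and cast it to bfloat16 |
| pipeline = NextStepPipeline(tokenizer=tokenizer, model=model).to(device="cuda", dtype=torch.bfloat16) |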
| |
| # set prompts |
| positive_prompt = "masterpiece, film grained, best quality." |
| negative_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry." |
| example_prompt = "A realistic photograph of a wall with \"NextStep-1.1 is coming\" prominently displayed" |
| |
| # generate image from text |
| IMG_SIZE = 512 |
| image = pipeline.generate_image( |
|     example_prompt, |
|     hw=(IMG_SIZE, IMG_SIZE), |
|     num_images_per_caption=1, |
|     positive_prompt=positive_prompt, |
|     negative_prompt=negative_prompt, |
|     cfg=7.5, |
|     cfg_img=1.0, |
|     cfg_schedule="constant", |
|     use_norm=False, |
|     num_sampling_steps=28, |
|     timesteps_shift=1.0, |
|     seed=3407, |
| )[0] |
| image.save("./assets/output.jpg") |
| ``` |
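The final `image.save("./assets/output.jpg")` call assumes that an `assets` directory already exists in the working directory. When running the example from elsewhere, a small helper along these lines can create the directory first. This is a sketch only: `prepare_output_path` is a hypothetical name and not part of the NextStep-1 code.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent directory of `path` if it is missing and
    return the path as a pathlib.Path (the file itself is not created)."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    return out
```

With this helper, the last line of the example could be written as `image.save(prepare_output_path("./assets/output.jpg"))`.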
|
|
| ## Citation |
|
|
| If you find NextStep useful for your research and applications, please consider starring this repository and citing: |
|
|
| ```bibtex |
| @misc{nextstep_1, |
| title={NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale}, |
| author={NextStep Team}, |
| year={2025}, |
| url={https://github.com/stepfun-ai/NextStep-1}, |
| } |
| ``` |
|
|