LLM-in-Sandbox Elicits General Agentic Intelligence

This is the model checkpoint trained with LLM-in-Sandbox-RL from our paper: Computer Environments Elicit General Agentic Intelligence in LLMs. The base model is Qwen/Qwen3-4B-Instruct-2507. The training data is available at llm-in-sandbox-rl dataset and the training code is at llm-in-sandbox-rl code.

Usage

vllm serve daixuancheng/Qwen3-4B-Instruct-2507-LLM-in-Sandbox-RL \
    --served-model-name qwen3-4b-instruct-sandbox-rl \
    --enable-prefix-caching \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Please refer to our RL training code for reproducing this checkpoint and our inference code to use this model for LLM-in-Sandbox inference and reproduce our paper results.

Citation

If you find our work helpful, please cite us:

@article{cheng2026llm,
  title={Llm-in-sandbox elicits general agentic intelligence},
  author={Cheng, Daixuan and Huang, Shaohan and Gu, Yuxian and Song, Huatong and Chen, Guoxin and Dong, Li and Zhao, Wayne Xin and Wen, Ji-Rong and Wei, Furu},
  journal={arXiv preprint arXiv:2601.16206},
  year={2026}
}