LLM-in-Sandbox Elicits General Agentic Intelligence

This is the model checkpoint trained with LLM-in-Sandbox-RL from our paper: Computer Environments Elicit General Agentic Intelligence in LLMs. The base model is Qwen/Qwen3-4B-Instruct-2507. The training data is available at llm-in-sandbox-rl dataset and the training code is at llm-in-sandbox-rl code.

Usage

vllm serve daixuancheng/Qwen3-4B-Instruct-2507-LLM-in-Sandbox-RL \
    --served-model-name qwen3-4b-instruct-sandbox-rl \
    --enable-prefix-caching \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Please refer to our RL training code for reproducing this checkpoint and our inference code to use this model for LLM-in-Sandbox inference and reproduce our paper results.

Citation

If you find our work helpful, please cite us:

@article{cheng2026llm,
  title={Llm-in-sandbox elicits general agentic intelligence},
  author={Cheng, Daixuan and Huang, Shaohan and Gu, Yuxian and Song, Huatong and Chen, Guoxin and Dong, Li and Zhao, Wayne Xin and Wen, Ji-Rong and Wei, Furu},
  journal={arXiv preprint arXiv:2601.16206},
  year={2026}
}
Downloads last month
1,462
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for daixuancheng/Qwen3-4B-Instruct-2507-LLM-in-Sandbox-RL

Finetuned
(1537)
this model

Dataset used to train daixuancheng/Qwen3-4B-Instruct-2507-LLM-in-Sandbox-RL

Collection including daixuancheng/Qwen3-4B-Instruct-2507-LLM-in-Sandbox-RL

Paper for daixuancheng/Qwen3-4B-Instruct-2507-LLM-in-Sandbox-RL