---
license: apache-2.0
base_model: Qwen/Qwen3-4B
datasets:
- PeterJinGo/nq_hotpotqa_train
- orbit-ai/orbit-20k
language:
- en
tags:
- qwen3
- conversational
- verl
- verl-tool
pipeline_tag: text-generation
library_name: transformers
---

> [!NOTE]
> For full details, see the ORBIT paper [here](https://arxiv.org/abs/2604.01195).
> This is v0.1 of the ORBIT-4B model, fine-tuned for 165 GRPO steps on an equal (1:1:1) mixture of the NQ, HotpotQA, and ORBIT training datasets.

## Orbit-4B (v0.1)

This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) expert open search agent built on the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained with the [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) framework and a live DDGS-based retriever. Training ran for **165 GRPO steps** on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (`orbit-ai/orbit-20k`) in a 1:1:1 ratio.
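GRPO scores each prompt's group of rollouts and standardizes every rollout's reward against the group's statistics to obtain its advantage. A minimal illustrative sketch of that idea (not verl's implementation, which adds clipping, masking, and batching):

```python
# Illustrative sketch of GRPO's group-relative advantage (not verl's code).
# Each prompt is sampled n times (n=8 here); a rollout's advantage is its
# reward standardized against its own group's mean and std.
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# With binary EM rewards, a group where 2 of 8 rollouts answered correctly:
advs = grpo_advantages([1, 1, 0, 0, 0, 0, 0, 0])
# Correct rollouts get positive advantages, incorrect ones negative;
# the group's advantages sum to (approximately) zero.
```

A consequence worth noting: a group whose rollouts all receive the same reward yields zero advantages everywhere, so such prompts contribute no gradient signal.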
---

## Model Details

| Property | Value |
|---|---|
| **Architecture** | Qwen3-4B |
| **Base checkpoint** | `Qwen/Qwen3-4B` |
| **Training algorithm** | GRPO |
| **Training steps** | 165 |
| **Training framework** | [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) |
| **Hardware** | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| **Parallelism** | FSDP (param + optimizer offload) |
| **Rollout mode** | Async (vLLM v1) |

---

## Training Dataset

The model was trained on a mixed retrieval QA dataset with equal sampling across three tasks:

| Task | Category | Type |
|---|---|---|
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult multi-hop reasoning queries |

**Dataset names:** `PeterJinGo/nq_hotpotqa_train`, `orbit-ai/orbit-20k`
**Train batch size:** 256 samples per step (n=8 rollouts per sample → 2048 trajectories/step)

---

## Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| GRPO rollouts per sample (n) | 8 |
| Learning rate | `1e-6` |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (−1) |
| KL loss coefficient | 0.0 |
| KL loss type | `low_var_kl` |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | −1 (full sharding) |

---

## Tool Configuration

The model was trained with live web search via a DDGS-based retrieval server:

| Setting | Value |
|---|---|
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | `google`, `brave`, `bing`, `wikipedia`, `grokipedia` |
| Top-k documents per query | 5 |
| Backend strategy | **Parallel fan-out**: all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | `</search>`, `</answer>` |
| Observations masked in loss | Yes (`mask_observations=True`) |

The retriever server runs as a FastAPI service. At each agent turn the model issues a `<search> query </search>` action; the tool server retrieves results and returns them as `<information>` observations. The trajectory ends when the model emits an `<answer>` block or the turn budget is exhausted.

---

## Training Infrastructure

| Setting | Value |
|---|---|
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB (`$TMPDIR`, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no `NCCL_P2P_DISABLE`) |
| vLLM version | v1 (`VLLM_USE_V1=1`) |
| Checkpoint frequency | Every 5 steps |

---

## Reward Model

**Reward manager:** `search_r1_qa_em`

Reward is computed as exact match (EM) between the model's `<answer>` span and the reference target string. Multi-turn rollouts are scored on the final answer only; intermediate search turns receive no reward. Observations are masked out of the KL and policy-gradient losses.

---

## Usage

This model is designed to be used with a running tool server that handles `<search>` actions. Inference without a live retriever falls back to the model's parametric knowledge.
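When evaluating the model yourself, the EM metric used as the training reward can be reproduced with a short sketch. This uses standard SQuAD-style answer normalization; the actual `search_r1_qa_em` reward manager in verl may differ in details such as how it extracts the answer span from the generation.

```python
# Sketch of an exact-match (EM) reward with SQuAD-style normalization.
# The verl `search_r1_qa_em` reward manager may differ in detail.
import re
import string

def normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def em_reward(prediction, golds):
    """1.0 if the normalized prediction matches any normalized gold answer."""
    return 1.0 if normalize(prediction) in {normalize(g) for g in golds} else 0.0
```

For example, `em_reward("The Plasma.", ["plasma"])` returns `1.0`, while any non-matching string scores `0.0`.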
### With verl-tool (recommended)

```bash
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate

# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 5 \
    --backend "google,brave,bing,wikipedia,grokipedia"

# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```

### Direct inference (parametric knowledge only)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "orbit-ai/orbit-4b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

---

## Intended Use & Limitations

- **Intended use:** Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- **Language:** English only.
- **Search dependency:** Peak performance requires a live web search backend. Without search, the model still reasons from parametric knowledge, but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- **Not intended for production deployment** without additional safety filtering.
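The direct-inference example in the Usage section produces a single turn. A full agent loop must parse each generation for an action before deciding whether to search or stop. A minimal sketch of that parsing step, assuming the Search-R1-style `<search>`/`<answer>` tags used in the prompt (verl-tool's actual loop and stop-token handling may differ):

```python
# Sketch of per-turn action parsing for a search agent loop.
# Assumes Search-R1-style <search>/<answer> tags; verl-tool's real
# parsing and stop-token handling may differ.
import re

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def next_action(generation):
    """Return ('answer', text), ('search', query), or ('none', None)."""
    if (m := ANSWER_RE.search(generation)):
        return "answer", m.group(1).strip()
    if (m := SEARCH_RE.search(generation)):
        return "search", m.group(1).strip()
    return "none", None
```

A driver would feed each `('search', query)` action to the retriever, append the returned observation to the context, and regenerate, stopping on an `('answer', ...)` action or after the 5-turn budget.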
---

## Citation

If you use this model or the training methodology, please cite:

```bibtex
@misc{thakur2026orbit,
  title={ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget},
  author={Nandan Thakur and Zijian Chen and Xueguang Ma and Jimmy Lin},
  year={2026},
  eprint={2604.01195},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.01195},
}
```

---