---
license: apache-2.0
base_model: Qwen/Qwen3-4B
datasets:
- PeterJinGo/nq_hotpotqa_train
- orbit-ai/orbit-20k
language:
- en
tags:
- qwen3
- conversational
- verl
- verl-tool
pipeline_tag: text-generation
library_name: transformers
---
> [!NOTE]
> For full details, see the ORBIT paper [here](https://arxiv.org/abs/2604.01195).
> This is v0.1 of the ORBIT-4B model, fine-tuned for 165 GRPO steps on an equal (1:1:1) mixture of the NQ, HotpotQA, and ORBIT training datasets.
## Orbit-4B (v0.1)
This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) open search agent built on the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained using the [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) framework with a live DDGS-based retriever.
Training ran for **165 GRPO steps** on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (`orbit-ai/orbit-20k`) in a 1:1:1 ratio.
---
## Model Details
| Property | Value |
|---|---|
| **Architecture** | Qwen3-4B |
| **Base checkpoint** | `Qwen/Qwen3-4B` |
| **Training algorithm** | GRPO |
| **Training steps** | 165 |
| **Training framework** | [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) |
| **Hardware** | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| **Parallelism** | FSDP (param + optimizer offload) |
| **Rollout mode** | Async (vLLM v1) |
---
## Training Dataset
The model was trained on a mixed retrieval QA dataset with equal sampling across three tasks:
| Task | Category | Reasoning type |
|---|---|---|
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult multi-hop reasoning queries |
**Dataset names:** `PeterJinGo/nq_hotpotqa_train`, `orbit-ai/orbit-20k`
**Train batch size:** 256 samples per step (n=8 rollouts per sample → 2048 trajectories/step)
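The 1:1:1 mixture above can be sketched as follows. This is a minimal illustration (not the actual data pipeline), assuming each source is a list of example dicts: sample an equal count from every source, capped at the smallest one, then shuffle the union.

```python
import random
from collections import Counter

def mix_equal(*sources, seed=0):
    """Draw an equal number of examples from each source (1:1:1 here),
    truncated to the smallest source, then shuffle the union."""
    rng = random.Random(seed)
    n = min(len(s) for s in sources)
    mixed = [ex for s in sources for ex in rng.sample(s, n)]
    rng.shuffle(mixed)
    return mixed

# Toy stand-ins for the three QA sources.
nq = [{"task": "nq", "q": f"q{i}"} for i in range(5)]
hotpot = [{"task": "hotpotqa", "q": f"q{i}"} for i in range(7)]
orbit = [{"task": "orbit", "q": f"q{i}"} for i in range(6)]

batch = mix_equal(nq, hotpot, orbit)
print(len(batch), Counter(ex["task"] for ex in batch))  # 15 examples, 5 per task
```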
---
## Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Rollouts | 8 |
| Learning rate | `1e-6` |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| GRPO rollouts per sample (n) | 8 |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (−1) |
| KL loss coefficient | 0.0 |
| KL loss type | `low_var_kl` |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | −1 (full sharding) |
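For orientation, the table above maps onto verl's Hydra-style CLI overrides roughly as below. This is an illustrative fragment only: key names follow recent upstream verl releases, may differ in verl-tool, and the entry point and omitted keys are assumptions.

```shell
# Illustrative verl-style launch (key names are assumptions, not the exact run script)
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=256 \
    data.max_prompt_length=2048 \
    data.max_response_length=8192 \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    trainer.save_freq=5
```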
---
## Tool Configuration
The model was trained with live web search via a DDGS-based retrieval server:
| Setting | Value |
|---|---|
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | `google`, `brave`, `bing`, `wikipedia`, `grokipedia` |
| Top-k documents per query | 5 |
| Backend strategy | **Parallel fan-out** — all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | `</search>`, `</answer>` |
| Observations masked in loss | Yes (`mask_observations=True`) |
The retriever server runs as a FastAPI service. At each agent turn the model issues a `<search> query </search>` action; the tool server retrieves results and returns them as `<information>…</information>` observations. The trajectory ends when the model emits `<answer> … </answer>` or the turn budget is exhausted.
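The parallel fan-out strategy from the table above can be sketched with stdlib threads. This is a simplified illustration, not the actual retriever server: the stub backends (and the `{"url", "snippet"}` result shape) are assumptions standing in for real DDGS calls.

```python
from concurrent.futures import ThreadPoolExecutor

def make_backend(name, urls):
    """Stub backend (hypothetical): returns fake {"url", "snippet"} hits."""
    return lambda query: [{"url": u, "snippet": f"{name}: {query}"} for u in urls]

backends = [
    make_backend("google", ["https://a.com", "https://b.com"]),
    make_backend("brave", ["https://b.com", "https://c.com"]),  # overlaps google
    make_backend("wikipedia", ["https://d.org"]),
]

def fan_out_search(query, backends, topk=5, timeout=10):
    """Query every backend in parallel, merge in submission order,
    dedupe by URL, and keep the top-k documents."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [pool.submit(b, query) for b in backends]
        results = [r for f in futures for r in f.result(timeout=timeout)]
    seen, merged = set(), []
    for r in results:
        if r["url"] not in seen:
            seen.add(r["url"])
            merged.append(r)
    return merged[:topk]

docs = fan_out_search("plasma percentage of blood", backends)
print([d["url"] for d in docs])  # 4 unique URLs, duplicate b.com dropped
```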
---
## Training Infrastructure
| Setting | Value |
|---|---|
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB (`$TMPDIR`, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no `NCCL_P2P_DISABLE`) |
| vLLM version | v1 (`VLLM_USE_V1=1`) |
| Checkpoint frequency | Every 5 steps |
---
## Reward Model
**Reward manager:** `search_r1_qa_em`
Reward is computed as exact match (EM) between the string inside the model's final `<answer>…</answer>` span and the reference target. Multi-turn rollouts are scored on the final answer only; intermediate search turns receive no reward. Observations are masked out of both the KL term and the policy-gradient loss.
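A minimal sketch of such an EM reward, assuming SQuAD-style answer normalization (lowercasing, dropping punctuation and articles); the exact normalization in `search_r1_qa_em` may differ.

```python
import re
import string

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_reward(response, golds):
    """1.0 if the final <answer> span exactly matches any reference, else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0  # no parsable answer -> zero reward
    pred = normalize(m.group(1))
    return 1.0 if any(pred == normalize(g) for g in golds) else 0.0

print(em_reward("<think>recall</think><answer>The Plasma</answer>", ["plasma"]))  # 1.0
```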
---
## Usage
This model is designed to be used with a running tool server that handles `<search>` actions. Inference without a live retriever falls back to the model's parametric knowledge.
### With verl-tool (recommended)
```bash
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate
# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 5 \
    --backend "google,brave,bing,wikipedia,grokipedia"

# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```
### Direct inference (parametric knowledge only)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "orbit-ai/orbit-4b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
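The multi-turn agent loop the tool server drives can be sketched as below. The generator and retriever here are hypothetical stubs for illustration; a real setup would call `model.generate` and the DDGS tool server instead.

```python
import re

def fake_generate(prompt):
    """Stub policy: search once, then answer (stands in for model.generate)."""
    if "<information>" not in prompt:
        return "<think>need facts</think><search>blood plasma percentage</search>"
    return "<think>plasma is ~55%</think><answer>about 55%</answer>"

def fake_retrieve(query):
    """Stub retriever (stands in for the DDGS tool server)."""
    return "Plasma makes up about 55% of total blood volume."

def run_agent(question, max_turns=5):
    """Act, execute <search> tool calls, append <information> observations,
    and stop on <answer> or when the turn budget is exhausted."""
    prompt = question
    for _ in range(max_turns):
        action = fake_generate(prompt)
        prompt += action
        ans = re.search(r"<answer>(.*?)</answer>", action, re.DOTALL)
        if ans:
            return ans.group(1).strip()
        q = re.search(r"<search>(.*?)</search>", action, re.DOTALL)
        if q:
            prompt += f"<information>{fake_retrieve(q.group(1).strip())}</information>"
    return None  # turn budget exhausted without an answer

print(run_agent("What percentage of blood is made up of plasma?"))  # about 55%
```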
---
## Intended Use & Limitations
- **Intended use:** Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- **Language:** English only.
- **Search dependency:** Peak performance requires a live web search backend. Without search, the model still reasons using parametric knowledge but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- **Not intended for production deployment** without additional safety filtering.
---
## Citation
If you use this model or the training methodology, please cite:
```bibtex
@misc{thakur2026orbit,
  title={ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget},
  author={Nandan Thakur and Zijian Chen and Xueguang Ma and Jimmy Lin},
  year={2026},
  eprint={2604.01195},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.01195},
}
```
---