---
license: apache-2.0
base_model: Qwen/Qwen3-4B
datasets:
- PeterJinGo/nq_hotpotqa_train
- orbit-ai/orbit-20k
language:
- en
tags:
- qwen3
- conversational
- verl
- verl-tool
pipeline_tag: text-generation
library_name: transformers
---
> [!NOTE]
> For full details, see the ORBIT paper [here](https://arxiv.org/abs/2604.01195).
> This is v0.1 of the ORBIT-4B model, fine-tuned for 165 GRPO steps on an equal (1:1:1) mixture of the NQ, HotpotQA, and ORBIT training datasets.
## Orbit-4B (v0.1)
This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) open search agent built on the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained using the [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) framework with a live DDGS-based retriever.
Training ran for **165 GRPO steps** on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (`orbit-ai/orbit-20k`) in a 1:1:1 ratio.
---
## Model Details
| Property | Value |
|---|---|
| **Architecture** | Qwen3-4B |
| **Base checkpoint** | `Qwen/Qwen3-4B` |
| **Training algorithm** | GRPO |
| **Training steps** | 165 |
| **Training framework** | [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) |
| **Hardware** | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| **Parallelism** | FSDP (param + optimizer offload) |
| **Rollout mode** | Async (vLLM v1) |
---
## Training Dataset
The model was trained on a mixed retrieval QA dataset with equal sampling across three tasks:
| Task | Category | Reasoning type |
|---|---|---|
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult multi-hop reasoning queries |
**Dataset names:** `PeterJinGo/nq_hotpotqa_train`, `orbit-ai/orbit-20k`
**Train batch size:** 256 samples per step (n=8 rollouts per sample → 2048 trajectories/step)
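The 1:1:1 mixture above can be sketched as follows. This is a minimal illustration (not the actual data pipeline), assuming each source is a list of example dicts: sample an equal count from every source, capped at the smallest one, then shuffle the union.

```python
import random
from collections import Counter

def mix_equal(*sources, seed=0):
    """Draw an equal number of examples from each source (1:1:1 here),
    truncated to the smallest source, then shuffle the union."""
    rng = random.Random(seed)
    n = min(len(s) for s in sources)
    mixed = [ex for s in sources for ex in rng.sample(s, n)]
    rng.shuffle(mixed)
    return mixed

# Toy stand-ins for the three QA sources.
nq = [{"task": "nq", "q": f"q{i}"} for i in range(5)]
hotpot = [{"task": "hotpotqa", "q": f"q{i}"} for i in range(7)]
orbit = [{"task": "orbit", "q": f"q{i}"} for i in range(6)]

batch = mix_equal(nq, hotpot, orbit)
print(len(batch), Counter(ex["task"] for ex in batch))  # 15 examples, 5 per task
```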
---
## Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Rollouts | 8 |
| Learning rate | `1e-6` |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| GRPO rollouts per sample (n) | 8 |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (−1) |
| KL loss coefficient | 0.0 |
| KL loss type | `low_var_kl` |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | −1 (full sharding) |
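For orientation, the table above maps onto verl's Hydra-style CLI overrides roughly as below. This is an illustrative fragment only: key names follow recent upstream verl releases, may differ in verl-tool, and the entry point and omitted keys are assumptions.

```shell
# Illustrative verl-style launch (key names are assumptions, not the exact run script)
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=256 \
    data.max_prompt_length=2048 \
    data.max_response_length=8192 \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    trainer.save_freq=5
```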
---
## Tool Configuration
The model was trained with live web search via a DDGS-based retrieval server:
| Setting | Value |
|---|---|
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | `google`, `brave`, `bing`, `wikipedia`, `grokipedia` |
| Top-k documents per query | 5 |
| Backend strategy | **Parallel fan-out** — all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | `</search>`, `</answer>` |
| Observations masked in loss | Yes (`mask_observations=True`) |
The retriever server runs as a FastAPI service. At each agent turn the model issues a `<search> query </search>` action; the tool server retrieves results and returns them as `<information>…</information>` observations. The trajectory ends when the model emits `<answer> … </answer>` or the turn budget is exhausted.
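The parallel fan-out strategy from the table above can be sketched with stdlib threads. This is a simplified illustration, not the actual retriever server: the stub backends (and the `{"url", "snippet"}` result shape) are assumptions standing in for real DDGS calls.

```python
from concurrent.futures import ThreadPoolExecutor

def make_backend(name, urls):
    """Stub backend (hypothetical): returns fake {"url", "snippet"} hits."""
    return lambda query: [{"url": u, "snippet": f"{name}: {query}"} for u in urls]

backends = [
    make_backend("google", ["https://a.com", "https://b.com"]),
    make_backend("brave", ["https://b.com", "https://c.com"]),  # overlaps google
    make_backend("wikipedia", ["https://d.org"]),
]

def fan_out_search(query, backends, topk=5, timeout=10):
    """Query every backend in parallel, merge in submission order,
    dedupe by URL, and keep the top-k documents."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [pool.submit(b, query) for b in backends]
        results = [r for f in futures for r in f.result(timeout=timeout)]
    seen, merged = set(), []
    for r in results:
        if r["url"] not in seen:
            seen.add(r["url"])
            merged.append(r)
    return merged[:topk]

docs = fan_out_search("plasma percentage of blood", backends)
print([d["url"] for d in docs])  # 4 unique URLs, duplicate b.com dropped
```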
---
## Training Infrastructure
| Setting | Value |
|---|---|
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB (`$TMPDIR`, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no `NCCL_P2P_DISABLE`) |
| vLLM version | v1 (`VLLM_USE_V1=1`) |
| Checkpoint frequency | Every 5 steps |
---
## Reward Model
**Reward manager:** `search_r1_qa_em`
Reward is computed as exact match (EM) between the string inside the model's final `<answer>…</answer>` span and the reference target. Multi-turn rollouts are scored on the final answer only; intermediate search turns receive no reward. Observations are masked out of both the KL term and the policy-gradient loss.
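A minimal sketch of such an EM reward, assuming SQuAD-style answer normalization (lowercasing, dropping punctuation and articles); the exact normalization in `search_r1_qa_em` may differ.

```python
import re
import string

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_reward(response, golds):
    """1.0 if the final <answer> span exactly matches any reference, else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0  # no parsable answer -> zero reward
    pred = normalize(m.group(1))
    return 1.0 if any(pred == normalize(g) for g in golds) else 0.0

print(em_reward("<think>recall</think><answer>The Plasma</answer>", ["plasma"]))  # 1.0
```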
---
## Usage
This model is designed to be used with a running tool server that handles `<search>` actions. Inference without a live retriever falls back to the model's parametric knowledge.
### With verl-tool (recommended)
```bash
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate
# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 5 \
    --backend "google,brave,bing,wikipedia,grokipedia"

# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```
### Direct inference (parametric knowledge only)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "orbit-ai/orbit-4b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
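The multi-turn agent loop the tool server drives can be sketched as below. The generator and retriever here are hypothetical stubs for illustration; a real setup would call `model.generate` and the DDGS tool server instead.

```python
import re

def fake_generate(prompt):
    """Stub policy: search once, then answer (stands in for model.generate)."""
    if "<information>" not in prompt:
        return "<think>need facts</think><search>blood plasma percentage</search>"
    return "<think>plasma is ~55%</think><answer>about 55%</answer>"

def fake_retrieve(query):
    """Stub retriever (stands in for the DDGS tool server)."""
    return "Plasma makes up about 55% of total blood volume."

def run_agent(question, max_turns=5):
    """Act, execute <search> tool calls, append <information> observations,
    and stop on <answer> or when the turn budget is exhausted."""
    prompt = question
    for _ in range(max_turns):
        action = fake_generate(prompt)
        prompt += action
        ans = re.search(r"<answer>(.*?)</answer>", action, re.DOTALL)
        if ans:
            return ans.group(1).strip()
        q = re.search(r"<search>(.*?)</search>", action, re.DOTALL)
        if q:
            prompt += f"<information>{fake_retrieve(q.group(1).strip())}</information>"
    return None  # turn budget exhausted without an answer

print(run_agent("What percentage of blood is made up of plasma?"))  # about 55%
```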
---
## Intended Use & Limitations
- **Intended use:** Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- **Language:** English only.
- **Search dependency:** Peak performance requires a live web search backend. Without search, the model still reasons using parametric knowledge but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- **Not intended for production deployment** without additional safety filtering.
---
## Citation
If you use this model or the training methodology, please cite:
```bibtex
@misc{thakur2026orbit,
  title={ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget},
  author={Nandan Thakur and Zijian Chen and Xueguang Ma and Jimmy Lin},
  year={2026},
  eprint={2604.01195},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.01195},
}
```
---