---
language:
  - en
license: apache-2.0
library_name: vllm
tags:
  - reasoning
  - chain-of-thought
  - efficiency
  - inference-optimization
  - qwen3
base_model: Qwen/Qwen3-8B
base_model_relation: finetune
pipeline_tag: text-generation
---

# Terminator-Qwen3-8B

[[Project Page](https://terminator-llm.github.io/)] [[Paper](https://arxiv.org/abs/2603.12529)]

**Terminator** is a lightweight neural module that predicts when a reasoning language model has reached its final answer during chain-of-thought (CoT) generation. When the Terminator detects the model has committed to an answer, it truncates the remaining reasoning and forces the model to begin its response, thereby delivering the same answer with significantly less computation.

This repository contains everything needed to run **Terminator-Qwen3-8B**:

- Trained Terminator checkpoint (1 extra transformer layer + prediction head)
- vLLM plugin code (`vllm_terminator/`) for high-performance serving
- Server launcher and streaming client
- Standalone HuggingFace inference script (no server required)
- Automated setup script

**Note**: Terminator currently supports **single-GPU, single-sequence inference only**.

---

## Quick Start

```bash
# 1. Clone the repository (requires Git LFS: https://git-lfs.com)
git lfs install
git clone https://huggingface.co/acnagle/Terminator-Qwen3-8B
cd Terminator-Qwen3-8B

# 2. Run automated setup (creates conda env, installs vllm, downloads base model)
./setup.sh

# 3. Start the server
./start_server.sh

# 4. In another terminal, chat with the model
python client.py --interactive
```

---

## Requirements

- **GPU**: Single NVIDIA GPU with roughly 24 GB of VRAM or more
- **CUDA**: Compatible CUDA driver installed; 12.9 or later recommended
- **Python**: 3.12
- **OS**: Linux (recommended) or any OS supported by vLLM

---

## Installation

### Option A: Automated Setup

The `setup.sh` script handles everything:

```bash
./setup.sh
```

This will:
1. Create a conda environment called `terminator` with Python 3.12
2. Install [uv](https://docs.astral.sh/uv/), [vLLM](https://docs.vllm.ai/), and [openai](https://pypi.org/project/openai/)
3. Download Qwen3-8B base model weights (~16GB) from HuggingFace
4. Create the model directory (`model_dir/`)

### Option B: Manual Setup

**1. Create a Python environment**

Using conda or micromamba:

```bash
conda create -n terminator python=3.12 -y
conda activate terminator
```

**2. Install uv**

```bash
pip install --upgrade uv
```

Or see the [uv installation guide](https://docs.astral.sh/uv/getting-started/installation/).

**3. Install vLLM**

```bash
uv pip install vllm --torch-backend=auto
```

See the [vLLM installation guide](https://docs.vllm.ai/en/latest/getting_started/installation/) for alternative installation methods (ROCm, CPU, etc.).

**4. Install openai (for the client)**

```bash
uv pip install openai
```

**5. Set up the model directory**

This downloads the base Qwen3-8B weights and creates a vLLM-ready model directory:

```bash
python setup_model_dir.py
```

The script accepts optional arguments:

| Argument | Default | Description |
|----------|---------|-------------|
| `--checkpoint` | `./terminator.pt` | Path to the Terminator checkpoint |
| `--output-dir` | `./model_dir` | Output model directory |
| `--threshold` | `0.7` | Prediction threshold for Terminator activation |
| `--window-size` | `10` | Sliding window size for majority vote |
| `--exit-message` | *(built-in message)* | Message injected when Terminator fires |

---

## Starting the Server

```bash
./start_server.sh
```

Or with custom configuration:

```bash
VLLM_GPU_UTIL=0.70 VLLM_MAX_MODEL_LEN=8192 ./start_server.sh
```

The server exposes an **OpenAI-compatible API** on the configured port (default: 8000).

### Configuration

Set these environment variables before running `start_server.sh` or `serve.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `VLLM_GPU_UTIL` | `0.90` | Fraction of GPU memory to use for the model |
| `VLLM_MAX_MODEL_LEN` | *(auto)* | Maximum context length in tokens |
| `VLLM_PORT` | `8000` | Server port |
| `VLLM_ENFORCE_EAGER` | `0` | Set to `1` to disable CUDA graphs |
| `VLLM_API_KEY` | *(none)* | Require this API key from clients |
| `VLLM_SERVED_NAME` | `Terminator-Qwen3-8B` | Model name reported by the API |

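If you prefer to launch the server from Python rather than from the shell, the variables in the table can be passed through the child process environment. The sketch below is illustrative only: `build_server_env` and `launch_server` are hypothetical helpers that are not part of this repository, and the only assumption about `start_server.sh` is that it reads the variables listed above.

```python
import os
import subprocess

# Variable names taken from the configuration table above.
KNOWN_VARS = {
    "VLLM_GPU_UTIL",
    "VLLM_MAX_MODEL_LEN",
    "VLLM_PORT",
    "VLLM_ENFORCE_EAGER",
    "VLLM_API_KEY",
    "VLLM_SERVED_NAME",
}


def build_server_env(overrides=None, base_env=None):
    """Return a copy of the environment with the given server variables set.

    Hypothetical helper: rejects names that are not in the table so that
    typos fail loudly instead of being silently ignored by the server.
    """
    env = dict(os.environ if base_env is None else base_env)
    overrides = overrides or {}
    unknown = set(overrides) - KNOWN_VARS
    if unknown:
        raise ValueError(f"Unknown variables: {sorted(unknown)}")
    for name, value in overrides.items():
        env[name] = str(value)  # environment values must be strings
    return env


def launch_server(overrides=None, script="./start_server.sh"):
    """Start the server script as a child process with the overrides applied."""
    return subprocess.Popen([script], env=build_server_env(overrides))


# Example (commented out so that importing this sketch has no side effects):
# launch_server({"VLLM_GPU_UTIL": 0.70, "VLLM_MAX_MODEL_LEN": 8192}).wait()
```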
---

## Standalone Inference (No Server)

**Recommendation:** For the best performance, use the vLLM server described above. vLLM uses KV caching, CUDA graphs, and optimized kernels, making it **significantly faster** than HuggingFace-native inference. The script below is provided for quick testing and demos where spinning up a server is inconvenient.

To run the HuggingFace-native inference script without a vLLM server:

```bash
python inference_hf.py --prompt "What is the sum of the first 100 natural numbers?"
```

This loads the model directly via HuggingFace `transformers` and runs token-by-token generation with the Terminator head. Thinking content is streamed in dimmed text; the final answer is shown in bold.

| Argument | Default | Description |
|----------|---------|-------------|
| `--prompt` | *(required)* | Input prompt |
| `--model` | `Qwen/Qwen3-8B` | HuggingFace model name or path |
| `--checkpoint` | `./terminator.pt` | Path to the Terminator checkpoint |
| `--threshold` | `0.7` | Prediction threshold |
| `--window-size` | `10` | Sliding window size for majority vote |
| `--exit-message` | *(built-in message)* | Message injected when Terminator fires (empty string to disable) |
| `--max-tokens` | `32768` | Maximum tokens to generate |
| `--temperature` | `0.6` | Sampling temperature |

---

## Using the Client (vLLM Server)

### Single Prompt

```bash
python client.py --prompt "What is the sum of the first 100 natural numbers?"
```

### Interactive Mode

```bash
python client.py --interactive
```

This starts a multi-turn conversation with the model. Thinking content is displayed in dimmed text; the final answer is shown in bold.

### Client Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--base-url` | `http://localhost:8000/v1` | Server URL |
| `--max-tokens` | *(server default)* | Maximum tokens to generate |
| `--temperature` | `0.6` | Sampling temperature |

### Using the API Directly

The server is OpenAI-compatible. You can use any OpenAI client library. Replace `localhost` with your server's address if connecting remotely:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Terminator-Qwen3-8B",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    temperature=0.6,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

# Thinking content (chain-of-thought)
print(response.choices[0].message.reasoning)

# Final answer
print(response.choices[0].message.content)
```

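When post-processing responses, it can help to read the thinking content defensively. The sketch below is a minimal helper under one stated assumption: depending on the vLLM version, the reasoning text may be exposed as `reasoning` (as in the example above) or as `reasoning_content`. `split_message` is a hypothetical name, and the `SimpleNamespace` object only stands in for a real response message so that the snippet runs without a server.

```python
from types import SimpleNamespace


def split_message(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the reasoning text lives in `reasoning` or, on some
    vLLM versions, in `reasoning_content`. Missing fields become "".
    """
    reasoning = getattr(message, "reasoning", None)
    if reasoning is None:
        reasoning = getattr(message, "reasoning_content", None)
    return reasoning or "", message.content or ""


# Stand-in for response.choices[0].message (no server needed for this demo).
fake_message = SimpleNamespace(
    reasoning_content="25 * 37 = 25 * 40 - 25 * 3 = 1000 - 75",
    content="925",
)
thinking, answer = split_message(fake_message)
print(answer)
```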
---

## How Terminator Works

Terminator is a single transformer layer followed by a prediction head, trained on top of a frozen Qwen3-8B base model. The transformer layer is initialized as a copy of the base model's final layer and then fine-tuned. It processes the LLM's hidden states and passes the result to the prediction head, which outputs a per-token binary prediction: *has the model reached its final answer?*

During generation, Terminator maintains a **sliding window** of the most recent predictions. When a majority of predictions in the window exceed the threshold (default: 0.7), the model is considered to have reached its final answer. At that point:

1. A short **exit message** is injected into the reasoning (e.g., *"I've run out of thinking tokens. I need to commit to a final answer."*) to help the model transition smoothly.
2. The `</think>` token is forced, ending the reasoning phase.
3. The model generates its final answer normally.

This allows the model to skip potentially thousands of redundant reasoning tokens while preserving answer quality.

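The sliding-window decision described above can be sketched in a few lines of plain Python. This is an illustration rather than the code in `vllm_terminator/`: the class name `SlidingWindowVote` is hypothetical, and two details are assumptions, namely that the vote only starts once the window is full and that "majority" means strictly more than half. The defaults match the documented `--threshold` (0.7) and `--window-size` (10).

```python
from collections import deque


class SlidingWindowVote:
    """Majority vote over the most recent per-token predictions (sketch)."""

    def __init__(self, threshold=0.7, window_size=10):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)  # old votes drop off the left

    def update(self, prob):
        """Record one prediction; return True when the exit condition holds."""
        self.window.append(prob > self.threshold)
        # Assumption: do not fire before the window has filled up.
        if len(self.window) < self.window.maxlen:
            return False
        # Assumption: strict majority of votes above the threshold.
        return sum(self.window) > len(self.window) / 2


def first_exit_step(probs, threshold=0.7, window_size=10):
    """Return the index of the first token at which the vote fires, or None."""
    vote = SlidingWindowVote(threshold, window_size)
    for step, prob in enumerate(probs):
        if vote.update(prob):
            return step
    return None


# Ten low-confidence tokens followed by a run of high-confidence ones:
# the vote fires once six of the last ten predictions exceed 0.7.
print(first_exit_step([0.1] * 10 + [0.9] * 10))  # -> 15
```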
---

## File Structure

```
Terminator-Qwen3-8B/
├── README.md                This file
├── terminator.pt            Trained Terminator checkpoint
├── vllm_terminator/         vLLM plugin package
│   ├── __init__.py          Registers the model architecture with vLLM
│   ├── model.py             Qwen3TerminatorForCausalLM model class
│   └── terminator_head.py   FFN classifier and checkpoint loading
├── inference_hf.py          Standalone HuggingFace inference (no server)
├── serve.py                 vLLM server launcher
├── setup_model_dir.py       Model directory setup (downloads base weights)
├── client.py                Streaming chat client (connects to vLLM server)
├── setup.sh                 Automated setup script
└── start_server.sh          Server launcher with sensible defaults
```

---

## Citation

@misc{nagle2026terminatorlearningoptimalexit,
      title={TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning}, 
      author={Alliot Nagle and Jakhongir Saydaliev and Dhia Garbaya and Michael Gastpar and Ashok Vardhan Makkuva and Hyeji Kim},
      year={2026},
      eprint={2603.12529},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.12529}, 
}

---

## License

This project builds on [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) by the Qwen team. Please refer to the Qwen3 license for base model usage terms.