---
license: apache-2.0
license_link: https://huggingface.co/Qihoo360/Light-MT-7B/blob/main/LICENSE
language:
- en
- zh
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- machine-translation
- multilingual
- qwen2
library_name: transformers
---

# Light-MT-7B
<a href="https://huggingface.co/qihoo360/Light-MT-7B" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-FF6B6B" style="display: inline-block; vertical-align: middle;"/>
</a>

## Introduction

Light-MT-7B is a machine-translation-focused variant of Qwen2.5-7B developed by 360 AI Research. The model follows the Multilingual Translation Policy Optimization (MtPO) pipeline introduced in the paper "Extending Foundation Models to Low-Resource Languages" and targets Southeast Asian and other under-served languages while preserving general instruction-following ability.

**This repo contains the machine translation specialized 7B model**, which has the following features:
- Type: Causal Language Models for Machine Translation
- Training Stage: Continued pretraining, curriculum SFT, and MtPO reinforcement learning
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B (6.53B non-embedding)
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Up to 131,072 tokens
- Vocabulary Size: 180,736 tokens with MtPO vocabulary expansion

## Model Highlights

Key outcomes from the MtPO recipe:

- Compression gains of up to 5.4x on FLORES-Plus corpora across Khmer, Lao, Myanmar, Thai, Tibetan, and other scripts through targeted tokenizer expansion.
- Curriculum supervised fine-tuning over a 7M-sample mixture progressing from general instructions to ASEAN-focused translation prompts.
- MtPO reinforcement learning that maintains entropy during decoding via asymmetric clipping, temperature consistency, and microbatch-normalized advantages.
- Reinforcement Learning with Verifiable Rewards (RLVR) to enforce length ratios, structural tokens, language targeting, and code mixing checks for reliable outputs.
- 200B continued pretraining tokens plus 60k MtPO steps, preserving BBH, CMMLU, HellaSwag, and MMLU performance while lifting translation quality.

## Requirements

Light-MT-7B works with the Hugging Face `transformers` library; we recommend the latest release. Because the model uses the `qwen2` architecture, `transformers>=4.37.0` is required.

With `transformers<4.37.0`, you will encounter the following error:
```
KeyError: 'qwen2'
```

## Quickstart

The following code snippet shows how to load the tokenizer and model and run a translation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qihoo360/Light-MT-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example translation prompt
prompt = "Translate the following English text to Chinese: Hello, how are you today?"
messages = [
    {"role": "system", "content": "You are a professional translator. Translate the given text accurately and naturally."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Training Pipeline (MtPO)

MtPO runs in four stages from tokenizer expansion to reinforcement learning alignment.

- **Stage 1 - Vocabulary expansion:** Extend the Qwen2.5 tokenizer with 3k-4k tokens per target language (Khmer, Lao, Mongolian, Myanmar, Tamil, Thai, Tibetan, Uyghur). FLORES-Plus diagnostics show compression gains of up to 5.4x, cutting Khmer token counts from 402 to 103 for representative passages.
- **Stage 2 - Balanced continued pretraining:** Continue training on 200B tokens with a 1:1 mix between English and the expanded low-resource corpus to preserve high-resource coverage while materially improving low-resource fluency.
- **Stage 3 - Curriculum SFT:** Train on a 7M-sample blend (5:1 general instructions vs. multilingual data) that progresses from base instruction-following to ASEAN translation and mixed-format prompts.
- **Stage 4 - MtPO reinforcement learning:** Optimize with entropy-tempered policy updates that keep sampling temperature consistent, apply asymmetric ratio clipping, and normalize advantages at the microbatch level to avoid length bias or entropy collapse.
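
The Stage 4 update rule can be sketched in a few lines. This is a minimal illustration of the *shape* of asymmetric ratio clipping and microbatch-level advantage normalization; the clipping bounds and function names below are illustrative assumptions, not the paper's exact hyperparameters.

```python
import math

def mtpo_surrogate(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with asymmetric bounds (illustrative values).

    eps_high > eps_low leaves extra headroom for upweighting good candidates,
    which is one way to counteract entropy collapse during decoding.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    # Pessimistic bound: minimum of the clipped and unclipped objectives.
    return min(ratio * advantage, clipped * advantage)

def microbatch_normalize(advantages):
    """Normalize advantages to zero mean / unit std within one microbatch,
    so long samples cannot dominate the gradient via raw reward scale."""
    n = len(advantages)
    mean = sum(advantages) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n) or 1.0
    return [(a - mean) / std for a in advantages]
```

In a real trainer these would operate on per-token log-probability tensors; the scalar version above is only meant to make the clipping asymmetry concrete.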

## Verifiable Reward Guardrails

Reinforcement Learning with Verifiable Rewards (RLVR) combines the translation reward model with deterministic validators. During MtPO we sample K candidates per prompt, score them with RLVR, and keep the top-G diverse outputs for gradient updates. Each candidate is checked for:
- Length ratio safety relative to the source (default bounds 0.5-2.0 with soft penalties outside range)
- Structural token preservation for HTML, Markdown, and code blocks using lightweight parsers
- Target-language verification via a confidence-gated language ID classifier
- Code-mixing penalties that suppress unintended language drift

These verifiable rewards are added to the semantic score so bad outputs receive immediate negative credit, while high-quality candidates remain eligible for optimization.
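
The deterministic checks above can be sketched as simple validators. Only the 0.5-2.0 length bounds come from the text; the penalty slope, the structural check, and all function names are illustrative assumptions (a production system would use real parsers and a language-ID classifier).

```python
def length_ratio_penalty(src, hyp, lo=0.5, hi=2.0, slope=1.0):
    """Soft penalty: zero inside [lo, hi], growing linearly outside the bounds."""
    ratio = len(hyp) / max(len(src), 1)
    if ratio < lo:
        return slope * (lo - ratio)
    if ratio > hi:
        return slope * (ratio - hi)
    return 0.0

def structure_preserved(src, hyp):
    """Crude structural check: Markdown fences and backticks must survive
    translation (a lightweight stand-in for real HTML/Markdown parsers)."""
    return src.count("```") == hyp.count("```") and src.count("`") == hyp.count("`")

def verifiable_reward(semantic_score, src, hyp):
    """Combine the semantic reward-model score with deterministic penalties
    so constraint violations receive immediate negative credit."""
    reward = semantic_score - length_ratio_penalty(src, hyp)
    if not structure_preserved(src, hyp):
        reward -= 1.0  # illustrative hard structural penalty
    return reward
```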

## Data and Training Budget

Summary of resources and evaluation suites used during MtPO development.

- Continued pretraining: 200B tokens with adaptive sampling over English, ASEAN, Tibetan, Mongolian, Tamil, and Uyghur corpora
- MtPO reinforcement learning: 60k steps, batch size 128, top-G candidate selection with RLVR filtering
- Reward model: Preference data spans ten error categories (accuracy, fluency, terminology, formatting, code-mixing, etc.)
- Benchmarks: FLORES-Plus (90 directions), BBH, CMMLU, HellaSwag, MMLU
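
The top-G candidate selection mentioned above might look roughly like this. The exact diversity criterion is not specified in the card, so exact-match deduplication stands in for it here; the function name is an illustrative assumption.

```python
def select_top_g(candidates, scores, g):
    """Keep the g highest-scoring, mutually distinct candidates for the
    gradient update (exact-match dedup as a stand-in for a diversity filter)."""
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    kept = []
    for _, cand in ranked:
        if cand not in kept:
            kept.append(cand)
        if len(kept) == g:
            break
    return kept
```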

## Model Details

- **Model Type**: Qwen2-based Causal Language Model
- **Language(s)**: Multilingual (English, Chinese, Khmer, Lao, Myanmar, Thai, Tibetan, Mongolian, Tamil, Malay, Indonesian, Filipino, Vietnamese, Uyghur, etc.)
- **License**: Apache 2.0
- **Finetuned from**: Qwen/Qwen2.5-7B
- **Model Size**: 7.61B parameters
- **Context Length**: 131,072 tokens

## Usage

This model is specifically designed for machine translation tasks. It can handle various translation scenarios including:

- English <-> Chinese translation
- Multilingual translation tasks
- Professional document translation
- Conversational translation

## Evaluation

### Translation and General Benchmarks

Light-MT-7B-MtPO is evaluated on FLORES-Plus (90 directions) and standard instruction-following benchmarks. Scores below use sacreBLEU (higher is better) and zero-shot accuracy (percentage).

| Model | Group | xx->en | en->xx | xx->xx | Avg. | BBH | CMMLU | HellaSwag | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma3-27B-IT | Multilingual chat | **36.8** | 30.7 | 22.3 | 24.7 | 55.9 | 55.9 | 55.9 | **56.0** |
| Qwen3-8B | Multilingual chat | 31.1 | 23.3 | 14.4 | 16.9 | **63.8** | 60.8 | 26.0 | 51.3 |
| Qwen2.5-7B-Instruct | Multilingual chat | 24.8 | 17.4 | 9.2 | 11.6 | 54.4 | **64.1** | **85.2** | 40.9 |
| Apertus-8B-Instruct | Multilingual chat | 32.5 | 25.7 | 15.6 | 18.3 | 49.2 | 45.3 | 64.2 | 45.2 |
| Tower-Plus-9B | Multilingual chat | 28.2 | 18.3 | 9.8 | 12.5 | 40.4 | 57.2 | 73.1 | 42.1 |
| Qwen-MT-Plus | Translation-focused | 34.0 | 29.6 | 19.6 | 22.1 | - | - | - | - |
| Seed-X-PPO-7B | Translation-focused | 25.9 | 22.6 | 10.5 | 13.3 | - | - | - | - |
| Hunyuan-MT-7B | Translation-focused | 24.6 | 23.4 | 14.8 | 16.6 | - | - | - | - |
| Light-MT-7B-SFT | Our models | 35.4 | 32.0 | 22.7 | 24.3 | 59.6 | 61.4 | 83.7 | 47.2 |
| **Light-MT-7B-MtPO** | Our models | 36.1 | **32.7** | **23.1** | **24.9** | 60.9 | 63.2 | **85.2** | 48.5 |

- en->xx directions gain +1.1 BLEU over the next best 7B system while preserving reasoning accuracy (+1.3 MMLU over SFT).
- Average BLEU across all FLORES-Plus directions rises to 24.9 despite the compact 7B footprint.

### Tokenizer Efficiency

Vocabulary expansion provides substantial compression on targeted scripts (higher compression ratio means fewer tokens per sentence).

| Language | Added tokens | Old compression ratio | New compression ratio | Speedup |
| --- | --- | --- | --- | --- |
| Khmer | 3712 | 0.85 | 3.49 | 4.09x |
| Lao | 3359 | 0.85 | 3.05 | 3.59x |
| Myanmar | 3226 | 0.69 | 2.87 | 4.17x |
| Thai | 2958 | 1.79 | 2.97 | 1.66x |
| Tibetan | 3920 | 0.75 | 4.03 | 5.39x |

- Khmer passages shrink from 402 tokens to 103 tokens in the running example used in the paper.
- Compression gains translate into lower latency and memory cost during decoding for low-resource scripts.
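
The table's quantities can be reproduced from raw counts. The definitions below are illustrative; note the per-language table values are corpus-level averages, so the single Khmer passage (402 to 103 tokens) yields about 3.9x rather than the averaged 4.09x.

```python
def compression_ratio(n_chars, n_tokens):
    """Characters encoded per token; higher means a more efficient tokenizer."""
    return n_chars / n_tokens

def speedup(old_tokens, new_tokens):
    """Token-count reduction factor after vocabulary expansion, equivalently
    the ratio of new to old compression ratios on the same text."""
    return old_tokens / new_tokens
```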

### Constraint Reliability (RLVR)

RLVR introduces deterministic checks that reduce failure modes compared with general chat models and MT baselines.

| Model | Language targeting | Length control | Format preservation | Code mixing | Overall |
| --- | --- | --- | --- | --- | --- |
| **Light-MT-7B-MtPO** | **97.8** | 99.2 | **92.15** | 92.3 | **95.3** |
| Qwen2.5-7B-Instruct | 92.0 | 97.0 | 51.8 | 62.8 | 75.9 |
| Gemma3-27B-IT | 97.4 | 91.6 | 42.1 | 90.9 | 80.5 |
| Qwen-MT-Plus | 97.6 | **99.8** | 82.5 | 94.8 | 93.6 |
| Seed-X-PPO-7B | 97.6 | 79.8 | 79.0 | 90.3 | 86.6 |
| DeepSeek-V3 | 95.4 | 95.7 | 67.6 | 95.0 | 88.4 |
| Hunyuan-MT-7B | 91.8 | 90.7 | 71.1 | **96.2** | 87.4 |

- Format retention jumps to 92.15 percent versus 51.8 percent for Qwen2.5-7B-Instruct, mitigating HTML or Markdown corruption.
- Language targeting stays above 97 percent while MtPO avoids verbosity by normalizing advantages at the microbatch level.
- Overall pass rate reaches 95.3 percent, surpassing Qwen2.5-7B-Instruct by 19.4 points, DeepSeek-V3 by 6.9 points, and Qwen-MT-Plus by 1.7 points despite identical constraint settings.

### Per-Language FLORES Highlights

- **English->Thai:** 34.1 BLEU, +1.5 over Qwen-MT-Plus.
- **English->Myanmar:** 12.9 BLEU with stable length control.
- **English->Filipino:** 35.4 BLEU after MtPO, combining instruction fidelity and translation quality.
- **Khmer->English:** 44.7 BLEU, reflecting gains from tokenizer expansion.
- **Vietnamese->English:** 37.6 BLEU with consistent improvements across ASEAN language pairs.

## Citation

If you find our work helpful, please cite it as follows.

```
@inproceedings{liu2026mtpo,
    title = {Light-MT-7B},
    author = {Light-MT Team},
    booktitle = {International Conference on Learning Representations},
    year = {2025},
    url = {https://huggingface.co/qihoo360/Light-MT-7B}
}
```

## Disclaimer

This model is provided for research and educational purposes. Please ensure responsible use and compliance with applicable laws and regulations when using this model.