---
license: apache-2.0
license_link: https://huggingface.co/Qihoo360/Light-MT-7B/blob/main/LICENSE
language:
- en
- zh
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- machine-translation
- multilingual
- qwen2
library_name: transformers
---
# Light-MT-7B
<a href="https://huggingface.co/qihoo360/Light-MT-7B" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-FF6B6B" style="display: inline-block; vertical-align: middle;"/>
</a>
## Introduction
Light-MT-7B is a machine-translation-focused variant of Qwen2.5-7B developed by 360 AI Research. It follows the Multilingual Translation Policy Optimization (MtPO) pipeline introduced in the paper "Extending Foundation Models to Low-Resource Languages", targeting Southeast Asian and other under-served languages while preserving general instruction-following ability.
**This repo contains the machine translation specialized 7B model**, which has the following features:
- Type: Causal Language Models for Machine Translation
- Training Stage: Continued pretraining, curriculum SFT, and MtPO reinforcement learning
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B (6.53B non-embedding)
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Up to 131,072 tokens
- Vocabulary Size: 180,736 tokens with MtPO vocabulary expansion
## Model Highlights
Key outcomes from the MtPO recipe:
- 2.1x-5.4x compression gains on FLORES-Plus corpora across Khmer, Lao, Myanmar, Thai, Tibetan, and other scripts through targeted tokenizer expansion.
- Curriculum supervised fine-tuning over a 7M-sample mixture progressing from general instructions to ASEAN-focused translation prompts.
- MtPO reinforcement learning that maintains entropy during decoding via asymmetric clipping, temperature consistency, and microbatch-normalized advantages.
- Reinforcement Learning with Verifiable Rewards (RLVR) to enforce length ratios, structural tokens, language targeting, and code mixing checks for reliable outputs.
- 200B continued pretraining tokens plus 60k MtPO steps, preserving BBH, CMMLU, HellaSwag, and MMLU performance while lifting translation quality.
## Requirements
The code of Light-MT-7B is compatible with the latest Hugging Face `transformers` library, which we recommend using. With `transformers<4.37.0`, you will encounter the following error:
```
KeyError: 'qwen2'
```
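Before loading the model, you can verify the installed version meets the requirement. This is a minimal sketch; the `(4, 37)` threshold is inferred from the error above, and the helper name `supports_qwen2` is our own:

```python
def supports_qwen2(version_str: str) -> bool:
    # The (4, 37) threshold is an assumption inferred from the KeyError
    # above; major and minor components are compared numerically.
    major, minor = (int(p) for p in version_str.split(".")[:2])
    return (major, minor) >= (4, 37)

print(supports_qwen2("4.36.2"))  # False: raises KeyError: 'qwen2'
print(supports_qwen2("4.44.0"))  # True
```

In practice, compare `transformers.__version__` against this threshold before calling `from_pretrained`.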
## Quickstart
The following code snippet shows how to load the tokenizer and model for machine translation tasks.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qihoo360/Light-MT-7B"

# Load the model in its native precision and shard it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example translation prompt
prompt = "Translate the following English text to Chinese: Hello, how are you today?"
messages = [
    {"role": "system", "content": "You are a professional translator. Translate the given text accurately and naturally."},
    {"role": "user", "content": prompt}
]

# Render the chat template into a single prompt string.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
# Strip the prompt tokens so only the newly generated continuation remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Training Pipeline (MtPO)
MtPO runs in four stages from tokenizer expansion to reinforcement learning alignment.
- **Stage 1 - Vocabulary expansion:** Extend the Qwen2.5 tokenizer with 3k-4k tokens per target language (Khmer, Lao, Mongolian, Myanmar, Tamil, Thai, Tibetan, Uyghur). FLORES-Plus diagnostics show 2.1x-5.4x compression gains, cutting Khmer token counts from 402 to 103 for representative passages.
- **Stage 2 - Balanced continued pretraining:** Continue training on 200B tokens with a 1:1 mix between English and the expanded low-resource corpus to preserve high-resource coverage while materially improving low-resource fluency.
- **Stage 3 - Curriculum SFT:** Train on a 7M-sample blend (5:1 general instructions vs. multilingual data) that progresses from base instruction-following to ASEAN translation and mixed-format prompts.
- **Stage 4 - MtPO reinforcement learning:** Optimize with entropy-tempered policy updates that keep sampling temperature consistent, apply asymmetric ratio clipping, and normalize advantages at the microbatch level to avoid length bias or entropy collapse.
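The Stage 4 mechanics can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `eps_low`/`eps_high` values and the exact normalization details are assumptions, since the hyperparameters are not reproduced here.

```python
import math

def microbatch_normalize(advantages):
    """Zero-mean, unit-variance normalization within a single microbatch,
    as described for Stage 4 (the epsilon floor is an assumption)."""
    n = len(advantages)
    mean = sum(advantages) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n)
    std = max(std, 1e-8)  # guard against a degenerate all-equal microbatch
    return [(a - mean) / std for a in advantages]

def asymmetric_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with asymmetric clip bounds; these specific
    eps values are illustrative, not the paper's settings."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Normalizing within the microbatch (rather than across the full batch) keeps long, high-token-count samples from dominating the advantage scale, which is how the recipe avoids length bias.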
## Verifiable Reward Guardrails
Reinforcement Learning with Verifiable Rewards (RLVR) combines the translation reward model with deterministic validators. During MtPO we sample K candidates per prompt, score them with RLVR, and keep the top-G diverse outputs for gradient updates. Each candidate is checked for:
- Length ratio safety relative to the source (default bounds 0.5-2.0 with soft penalties outside range)
- Structural token preservation for HTML, Markdown, and code blocks using lightweight parsers
- Target-language verification via a confidence-gated language ID classifier
- Code-mixing penalties that suppress unintended language drift
These verifiable rewards are added to the semantic score so bad outputs receive immediate negative credit, while high-quality candidates remain eligible for optimization.
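The length-ratio guardrail above can be sketched as a deterministic validator. The linear soft-penalty shape and character-level length measure are assumptions; the 0.5-2.0 bounds come from the defaults stated above.

```python
def length_ratio_penalty(source: str, candidate: str, low=0.5, high=2.0):
    """Soft penalty for candidates whose character-length ratio falls
    outside [low, high] -- a sketch of the guardrail described above
    (the linear penalty shape is an assumption)."""
    ratio = max(len(candidate), 1) / max(len(source), 1)
    if low <= ratio <= high:
        return 0.0
    # Penalize proportionally to the distance from the nearest bound.
    return -(low - ratio) if ratio < low else -(ratio - high)

print(length_ratio_penalty("hello world", "bonjour le monde"))  # 0.0
```

A zero penalty inside the band and a smoothly growing negative penalty outside it gives the policy gradient a usable signal, rather than a hard reject that provides no direction.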
## Data and Training Budget
Summary of resources and evaluation suites used during MtPO development.
- Continued pretraining: 200B tokens with adaptive sampling over English, ASEAN, Tibetan, Mongolian, Tamil, and Uyghur corpora
- MtPO reinforcement learning: 60k steps, batch size 128, top-G candidate selection with RLVR filtering
- Reward model: Preference data spans ten error categories (accuracy, fluency, terminology, formatting, code-mixing, etc.)
- Benchmarks: FLORES-Plus (90 directions), BBH, CMMLU, HellaSwag, MMLU
## Model Details
- **Model Type**: Qwen2-based Causal Language Model
- **Language(s)**: Multilingual (English, Chinese, Khmer, Lao, Myanmar, Thai, Tibetan, Mongolian, Tamil, Malay, Indonesian, Filipino, Vietnamese, Uyghur, etc.)
- **License**: Apache 2.0
- **Finetuned from**: Qwen/Qwen2.5-7B
- **Model Size**: 7.61B parameters
- **Context Length**: 131,072 tokens
## Usage
This model is specifically designed for machine translation tasks. It can handle various translation scenarios including:
- English <-> Chinese translation
- Multilingual translation tasks
- Professional document translation
- Conversational translation
## Evaluation
### Translation and General Benchmarks
Light-MT-7B (reported as Light-TLLM-7B-MtPO in the tables below) is evaluated on FLORES-Plus (90 directions) and standard instruction-following benchmarks. Scores below use sacreBLEU (higher is better) and zero-shot accuracy (percentage).
| Model | Group | xx->en | en->xx | xx->xx | Avg. | BBH | CMMLU | HellaSwag | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma3-27B-IT | Multilingual chat | **36.8** | 30.7 | 22.3 | 24.7 | 55.9 | 55.9 | 55.9 | **56.0** |
| Qwen3-8B | Multilingual chat | 31.1 | 23.3 | 14.4 | 16.9 | **63.8** | 60.8 | 26.0 | 51.3 |
| Qwen2.5-7B-Instruct | Multilingual chat | 24.8 | 17.4 | 9.2 | 11.6 | 54.4 | **64.1** | **85.2** | 40.9 |
| Apertus-8B-Instruct | Multilingual chat | 32.5 | 25.7 | 15.6 | 18.3 | 49.2 | 45.3 | 64.2 | 45.2 |
| Tower-Plus-9B | Multilingual chat | 28.2 | 18.3 | 9.8 | 12.5 | 40.4 | 57.2 | 73.1 | 42.1 |
| Qwen-MT-Plus | Translation-focused | 34.0 | 29.6 | 19.6 | 22.1 | - | - | - | - |
| Seed-X-PPO-7B | Translation-focused | 25.9 | 22.6 | 10.5 | 13.3 | - | - | - | - |
| Hunyuan-MT-7B | Translation-focused | 24.6 | 23.4 | 14.8 | 16.6 | - | - | - | - |
| Light-TLLM-7B-SFT | Our models | 35.4 | 32.0 | 22.7 | 24.3 | 59.6 | 61.4 | 83.7 | 47.2 |
| **Light-TLLM-7B-MtPO** | Our models | 36.1 | **32.7** | **23.1** | **24.9** | 60.9 | 63.2 | **85.2** | 48.5 |
- en->xx directions gain +1.1 BLEU over the next best 7B system while preserving reasoning accuracy (+1.3 MMLU over SFT).
- Average BLEU across all FLORES-Plus directions rises to 24.9 despite the compact 7B footprint.
### Tokenizer Efficiency
Vocabulary expansion provides substantial compression on targeted scripts (higher compression ratio means fewer tokens per sentence).
| Language | Added tokens | Old compression ratio | New compression ratio | Speedup |
| --- | --- | --- | --- | --- |
| Khmer | 3712 | 0.85 | 3.49 | 4.09x |
| Lao | 3359 | 0.85 | 3.05 | 3.59x |
| Myanmar | 3226 | 0.69 | 2.87 | 4.17x |
| Thai | 2958 | 1.79 | 2.97 | 1.66x |
| Tibetan | 3920 | 0.75 | 4.03 | 5.39x |
- Khmer passages shrink from 402 tokens to 103 tokens in the running example used in the paper.
- Compression gains translate into lower latency and memory cost during decoding for low-resource scripts.
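The Speedup column can be approximately reproduced from the two ratio columns. We assume here that speedup is new ratio divided by old ratio; the small gaps against the published numbers presumably come from rounding in the ratios themselves.

```python
# (old_ratio, new_ratio, published_speedup) per language, from the table above.
rows = {
    "Khmer":   (0.85, 3.49, 4.09),
    "Lao":     (0.85, 3.05, 3.59),
    "Myanmar": (0.69, 2.87, 4.17),
    "Thai":    (1.79, 2.97, 1.66),
    "Tibetan": (0.75, 4.03, 5.39),
}
for language, (old, new, published) in rows.items():
    # Assumption: speedup = new compression ratio / old compression ratio.
    print(f"{language}: {new / old:.2f}x (published {published}x)")
```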
### Constraint Reliability (RLVR)
RLVR introduces deterministic checks that reduce failure modes compared with general chat models and MT baselines.
| Model | Language targeting | Length control | Format preservation | Code mixing | Overall |
| --- | --- | --- | --- | --- | --- |
| **Light-TLLM-7B-MtPO** | **97.8** | 99.2 | **92.15** | 92.3 | **95.3** |
| Qwen2.5-7B-Instruct | 92.0 | 97.0 | 51.8 | 62.8 | 75.9 |
| Gemma3-27B-IT | 97.4 | 91.6 | 42.1 | 90.9 | 80.5 |
| Qwen-MT-Plus | 97.6 | **99.8** | 82.5 | 94.8 | 93.6 |
| Seed-X-PPO-7B | 97.6 | 79.8 | 79.0 | 90.3 | 86.6 |
| DeepSeek-V3 | 95.4 | 95.7 | 67.6 | 95.0 | 88.4 |
| Hunyuan-MT-7B | 91.8 | 90.7 | 71.1 | **96.2** | 87.4 |
- Format retention jumps to 92.15 percent versus 51.8 percent for Qwen2.5-7B-Instruct, mitigating HTML or Markdown corruption.
- Language targeting stays above 97 percent while MtPO avoids verbosity by normalizing advantages at the microbatch level.
- Overall pass rate reaches 95.3 percent, surpassing Qwen2.5-7B-Instruct by 19.4 points, DeepSeek-V3 by 6.9 points, and Qwen-MT-Plus by 1.7 points despite identical constraint settings.
### Per-Language FLORES Highlights
- **English->Thai:** 34.1 BLEU, +1.5 over Qwen-MT-Plus.
- **English->Myanmar:** 12.9 BLEU with stable length control.
- **English->Filipino:** 35.4 BLEU after MtPO, combining instruction fidelity and translation quality.
- **Khmer->English:** 44.7 BLEU, reflecting gains from tokenizer expansion.
- **Vietnamese->English:** 37.6 BLEU with consistent improvements across ASEAN language pairs.
## Citation
If you find our work helpful, please cite it:
```
@inproceedings{liu2026mtpo,
title = {Light-MT-7B},
author = {Light-MT Team},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://huggingface.co/qihoo360/Light-MT-7B}
}
```
## Disclaimer
This model is provided for research and educational purposes. Please ensure responsible use and compliance with applicable laws and regulations when using this model.