---
license: mit
language:
- multilingual
- en
- de
- fr
- es
- zh
- ja
- ru
- ar
- ko
- pt
library_name: transformers
tags:
- modernbert
- mlm
- long-context
- rope
- yarn
- multilingual
- fill-mask
- semantic-router
- mixture-of-models
datasets:
- cc100
base_model: jhu-clsp/mmBERT-base
pipeline_tag: fill-mask
model-index:
- name: mmbert-32k-yarn
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - name: MLM Accuracy (English)
      type: accuracy
      value: 1.0
    - name: MLM Accuracy (Multilingual)
      type: accuracy
      value: 1.0
    - name: Distance Retrieval (≤2048)
      type: accuracy
      value: 1.0
    - name: Perplexity (32K context)
      type: perplexity
      value: 1.0003
---
# mmBERT-32K-YaRN

**Modern Multilingual BERT with a 32K context length**, extended from 8K to 32K tokens using YaRN RoPE scaling.

This model extends jhu-clsp/mmBERT-base (Modern Multilingual BERT supporting 1800+ languages) from a maximum context length of 8,192 to 32,768 tokens using the YaRN (Yet another RoPE extensioN) scaling method.
## Model Description
| Property | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-base |
| Architecture | ModernBERT (RoPE + Flash Attention 2) |
| Parameters | 307M |
| Max Context | 32,768 tokens (extended from 8,192) |
| Languages | 1800+ languages |
| Vocab Size | 256,000 (Gemma 2 tokenizer) |
| Scaling Method | YaRN RoPE (4x extension) |
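The 4x scaling factor is simply the ratio of the target to the original context length. A minimal sketch of the corresponding `rope_scaling` entry (field names mirror the training configuration below; the exact `config.json` schema used by the published model may differ):

```python
# Sketch: deriving the YaRN rope_scaling entry from the two context lengths.
# Field names are illustrative, not necessarily the exact config.json schema.
original_max = 8192
target_max = 32768

rope_scaling = {
    "rope_type": "yarn",
    "factor": target_max / original_max,  # 4.0
    "original_max_position_embeddings": original_max,
    "beta_fast": 32.0,
    "beta_slow": 1.0,
}
print(rope_scaling["factor"])  # 4.0
```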
## Intended Use
This model is designed for:
- Long-document understanding in any of 1800+ languages
- Semantic routing for LLM request classification
- Document classification with extended context
- Information retrieval from long texts
- Multilingual NLP tasks requiring long context
Part of the vLLM Semantic Router Mixture-of-Models (MoM) family.
## Evaluation Results

### Distance-Based Retrieval (Key Metric for Long Context)
| Distance (tokens) | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|
| 64 | 100% | 100% |
| 128 | 100% | 100% |
| 256 | 100% | 100% |
| 512 | 100% | 100% |
| 1024 | 100% | 100% |
| 2048 | 100% | 100% |
| 4096 | 0% | 0% |
| 8192 | 0% | 0% |
**Summary:** Retrieval is perfect up to a distance of 2048 tokens. Long-range capability (averaged over all distances ≥ 1024) improved from ~33% for the baseline to 50%.
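The 50% figure is just the mean top-1 accuracy over the four distances ≥ 1024 in the table above:

```python
# Top-1 accuracy at distances 1024, 2048, 4096, 8192 (from the table above)
long_range = [1.00, 1.00, 0.00, 0.00]
avg = sum(long_range) / len(long_range)
print(f"{avg:.0%}")  # 50%
```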
### Multilingual MLM Accuracy
| Language | Correct |
|---|---|
| English (en) | ✅ |
| German (de) | ✅ |
| French (fr) | ✅ |
| Spanish (es) | ✅ |
| Chinese (zh) | ✅ |
| Japanese (ja) | ✅ |
| Russian (ru) | ✅ |
| Arabic (ar) | ✅ |
| Korean (ko) | ✅ |
| Portuguese (pt) | ✅ |
Overall: 100% (10/10 languages tested)
### Perplexity by Context Length
| Context Length | Loss | Perplexity |
|---|---|---|
| 512 | 0.0110 | 1.01 |
| 1024 | 0.0082 | 1.01 |
| 2048 | 0.0065 | 1.01 |
| 4096 | 0.0036 | 1.00 |
| 8192 | 0.0014 | 1.00 |
| 16384 | 0.0014 | 1.00 |
| 24576 | 0.0014 | 1.00 |
| 32768 | 0.0003 | 1.00 |
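Perplexity here is `exp(loss)`; for example, the 32K row works out as:

```python
import math

loss_32k = 0.0003          # MLM loss at the full 32,768-token context
ppl = math.exp(loss_32k)   # perplexity = exp(loss)
print(round(ppl, 4))  # 1.0003
```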
### Position-wise Accuracy (16K context)
| Position Range | Accuracy |
|---|---|
| 0-2048 | 100% |
| 2048-4096 | 100% |
| 4096-6144 | 100% |
| 6144-8192 | 100% |
| 8192-10240 | 100% |
| 10240-12288 | 100% |
| 12288-14336 | 100% |
| 14336-16384 | 100% |
## Training Details

### Training Configuration

```yaml
base_model: jhu-clsp/mmBERT-base
rope_scaling_type: yarn
original_max_position_embeddings: 8192
target_max_position_embeddings: 32768
scaling_factor: 4.0
yarn_beta_fast: 32.0
yarn_beta_slow: 1.0

# Training hyperparameters
learning_rate: 1e-5
batch_size: 1  # effective 16 with gradient accumulation
gradient_accumulation_steps: 16
num_epochs: 1
warmup_steps: 100
lr_scheduler: constant_with_warmup
mlm_probability: 0.3
bf16: true
```
### Training Data
- Dataset: CC-100 (Common Crawl) multilingual corpus
- Samples: 30,774 sequences
- Sequence Length: 32,768 tokens each
- Total Tokens: ~1B tokens
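The ~1B-token figure follows directly from the sample count and sequence length:

```python
# Training-data size check: samples x sequence length (from the list above)
samples = 30_774
seq_len = 32_768
total = samples * seq_len
print(f"{total / 1e9:.2f}B tokens")  # 1.01B tokens
```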
### Hardware
- GPU: AMD Instinct MI300X (192GB VRAM)
- Training Time: ~6.5 hours
- Framework: PyTorch 2.3 + ROCm 6.2
## Usage

### Basic Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")

# Multilingual MLM example
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
logits = outputs.logits[0, mask_idx]
top5 = [tokenizer.decode(i) for i in logits.topk(5).indices.tolist()]
print(top5)  # e.g. ['Paris', 'Strasbourg', 'Nice', 'Brussels', 'Lyon']
```
### Long Context Usage (32K tokens)

```python
# Process long documents (up to 32K tokens)
long_document = "..." * 30000  # your long text in any of 1800+ languages
inputs = tokenizer(
    long_document,
    return_tensors="pt",
    max_length=32768,
    truncation=True,
)
outputs = model(**inputs)
```
### Feature Extraction

```python
import torch

# Get embeddings for downstream tasks
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden state to get one vector per sequence
embeddings = outputs.hidden_states[-1].mean(dim=1)
```
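Note that a plain `.mean(dim=1)` averages over padding positions too when inputs are batched with padding; masked mean pooling weights by the attention mask instead. A toy NumPy illustration of the same arithmetic (synthetic values, not model outputs):

```python
import numpy as np

# Toy "hidden states": batch 1, seq_len 3, hidden 2; the last position is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])  # attention_mask: 1 = real token, 0 = padding

m = mask[..., None]                               # (batch, seq, 1)
pooled = (hidden * m).sum(axis=1) / m.sum(axis=1)  # masked mean over real tokens
print(pooled)  # [[2. 3.]]
```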
### ONNX Runtime Usage

An ONNX export is available for high-performance inference with ONNX Runtime.

#### Python

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and ONNX model
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
sess = ort.InferenceSession(
    "onnx/model.onnx",  # or download from the HF repo
    providers=["ROCmExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Inference
text = "What is the weather like today?"
inputs = tokenizer(text, return_tensors="np", padding=True)
outputs = sess.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
})
embeddings = outputs[0].mean(axis=1)  # mean pooling
```
#### Rust (ort-binding)

```rust
use onnx_semantic_router::MmBertEmbeddingModel;

let model = MmBertEmbeddingModel::load("./mmbert-32k-yarn-onnx", false)?;
let embeddings = model.embed("What is the weather?")?;
```
### Latency Benchmarks (AMD MI300X)

| Backend | Single text | Batch of 4 (per text) |
|---|---|---|
| CPU | 10.1 ms | 6.8 ms |
| ROCm GPU | 4.7 ms | 1.2 ms |
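Per-text latency converts to rough single-stream throughput as `1000 / latency_ms`; a back-of-envelope sketch for the batched numbers above:

```python
# Approximate texts/second from per-text latency (ms), from the table above
latencies_ms = {"cpu_batch4": 6.8, "rocm_batch4": 1.2}
throughput = {k: 1000 / v for k, v in latencies_ms.items()}
print({k: round(v) for k, v in throughput.items()})
# {'cpu_batch4': 147, 'rocm_batch4': 833}
```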
## Limitations
- Long-range retrieval: While the model handles 32K context, retrieval accuracy drops significantly beyond 2048 tokens distance
- Training data: Trained on CC-100 which may have biases from web crawl data
- Compute requirements: Full 32K context requires significant GPU memory (~180GB for batch size 1)
## Citation

```bibtex
@misc{mmbert-32k-yarn,
  title={mmBERT-32K-YaRN: Extended Context Modern Multilingual BERT},
  author={vLLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mmbert-32k-yarn}
}
```
## References
- mmBERT - Modern Multilingual BERT (1800+ languages)
- ModernBERT - Base architecture
- YaRN - Yet another RoPE extensioN method
- vLLM Semantic Router - Mixture-of-Models routing
## License

MIT License (same as the mmBERT base model).