---
license: mit
language:
- multilingual
- en
- de
- fr
- es
- zh
- ja
- ru
- ar
- ko
- pt
library_name: transformers
tags:
- modernbert
- mlm
- long-context
- rope
- yarn
- multilingual
- fill-mask
- semantic-router
- mixture-of-models
datasets:
- cc100
base_model: jhu-clsp/mmBERT-base
pipeline_tag: fill-mask
model-index:
- name: mmbert-32k-yarn
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - name: MLM Accuracy (English)
      type: accuracy
      value: 1.0
    - name: MLM Accuracy (Multilingual)
      type: accuracy
      value: 1.0
    - name: Distance Retrieval (≤2048)
      type: accuracy
      value: 1.0
    - name: Perplexity (32K context)
      type: perplexity
      value: 1.0003
---

# mmBERT-32K-YaRN

**Modern Multilingual BERT with 32K context length** - extended from 8K to 32K tokens using YaRN RoPE scaling.

This model extends [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) (Modern Multilingual BERT supporting 1800+ languages) from 8,192 to **32,768** tokens of maximum context length using the [YaRN](https://arxiv.org/abs/2309.00071) (Yet another RoPE extensioN) scaling method.

## Model Description

| Property | Value |
|----------|-------|
| **Base Model** | [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) |
| **Architecture** | ModernBERT (RoPE + Flash Attention 2) |
| **Parameters** | 307M |
| **Max Context** | 32,768 tokens (extended from 8,192) |
| **Languages** | 1800+ languages |
| **Vocab Size** | 256,000 (Gemma 2 tokenizer) |
| **Scaling Method** | YaRN RoPE (4x extension) |

## Intended Use

This model is designed for:

- **Long-document understanding** in any of 1800+ languages
- **Semantic routing** for LLM request classification
- **Document classification** with extended context
- **Information retrieval** from long texts
- **Multilingual NLP tasks** requiring long context

Part of the [vLLM Semantic Router](https://huggingface.co/llm-semantic-router) Mixture-of-Models (MoM) family.

## Evaluation Results

### Distance-Based Retrieval (Key Metric for Long-Context)

| Distance (tokens) | Top-1 Accuracy | Top-5 Accuracy |
|-------------------|----------------|----------------|
| 64 | 100% | 100% |
| 128 | 100% | 100% |
| 256 | 100% | 100% |
| 512 | 100% | 100% |
| 1024 | 100% | 100% |
| 2048 | 100% | 100% |
| 4096 | 0% | 0% |
| 8192 | 0% | 0% |

**Summary**: Perfect retrieval for distances up to 2048 tokens. Averaged over all distances ≥ 1024 tokens, long-range retrieval improves from ~33% for the baseline to **50%**.
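The exact evaluation harness behind these numbers is not included in this repository. The sketch below shows one way such a distance-based retrieval probe can be set up with standard `transformers` APIs; the filler sentence, key word, and distances are illustrative assumptions, and the key should tokenize to a single token for the single-mask check to be meaningful.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative probe only; the published numbers come from a separate evaluation harness.
model_id = "llm-semantic-router/mmbert-32k-yarn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def retrieval_probe(distance_tokens: int, key: str = "Paris") -> bool:
    """Plant a key early in the text, insert roughly `distance_tokens` of filler,
    then check whether the model recovers the key at a mask position."""
    fact = f"The secret word is {key}. "
    filler = "Nothing of interest happens in this sentence. "
    filler_len = len(tokenizer(filler, add_special_tokens=False)["input_ids"])
    query = f"The secret word is {tokenizer.mask_token}."
    text = fact + filler * max(1, distance_tokens // filler_len) + query

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=32768)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Position of the (last) mask token, then top-1 prediction at that position.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[-1, 1]
    prediction = tokenizer.decode(logits[0, mask_pos].argmax()).strip()
    return prediction == key

for distance in (64, 512, 2048, 4096):
    print(distance, retrieval_probe(distance))
```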
### Multilingual MLM Accuracy

| Language | Correct |
|----------|---------|
| English (en) | ✅ |
| German (de) | ✅ |
| French (fr) | ✅ |
| Spanish (es) | ✅ |
| Chinese (zh) | ✅ |
| Japanese (ja) | ✅ |
| Russian (ru) | ✅ |
| Arabic (ar) | ✅ |
| Korean (ko) | ✅ |
| Portuguese (pt) | ✅ |

**Overall**: 100% (10/10 languages tested)

### Perplexity by Context Length

| Context Length | Loss | Perplexity |
|----------------|------|------------|
| 512 | 0.0110 | 1.01 |
| 1024 | 0.0082 | 1.01 |
| 2048 | 0.0065 | 1.01 |
| 4096 | 0.0036 | 1.00 |
| 8192 | 0.0014 | 1.00 |
| 16384 | 0.0014 | 1.00 |
| 24576 | 0.0014 | 1.00 |
| **32768** | **0.0003** | **1.00** |

### Position-wise Accuracy (16K context)

| Position Range | Accuracy |
|----------------|----------|
| 0-2048 | 100% |
| 2048-4096 | 100% |
| 4096-6144 | 100% |
| 6144-8192 | 100% |
| 8192-10240 | 100% |
| 10240-12288 | 100% |
| 12288-14336 | 100% |
| 14336-16384 | 100% |

## Training Details

### Training Configuration

```yaml
base_model: jhu-clsp/mmBERT-base
rope_scaling_type: yarn
original_max_position_embeddings: 8192
target_max_position_embeddings: 32768
scaling_factor: 4.0
yarn_beta_fast: 32.0
yarn_beta_slow: 1.0

# Training hyperparameters
learning_rate: 1e-5
batch_size: 1  # effective batch size 16 with gradient accumulation
gradient_accumulation_steps: 16
num_epochs: 1
warmup_steps: 100
lr_scheduler: constant_with_warmup
mlm_probability: 0.3
bf16: true
```

### Training Data

- **Dataset**: CC-100 (Common Crawl) multilingual corpus
- **Samples**: 30,774 sequences
- **Sequence Length**: 32,768 tokens each
- **Total Tokens**: ~1B tokens

### Hardware

- **GPU**: AMD Instinct MI300X (192 GB VRAM)
- **Training Time**: ~6.5 hours
- **Framework**: PyTorch 2.3 + ROCm 6.2

## Usage

### Basic Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")

# Masked language modeling example
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
logits = outputs.logits[0, mask_idx]
top5 = [tokenizer.decode(i) for i in logits.topk(5).indices]
print(top5)  # ['Paris', 'Strasbourg', 'Nice', 'Brussels', 'Lyon']
```

### Long Context Usage (32K tokens)

```python
# Process long documents (up to 32K tokens)
long_document = "..." * 30000  # Your long text in any of 1800+ languages

inputs = tokenizer(
    long_document,
    return_tensors="pt",
    max_length=32768,
    truncation=True
)
outputs = model(**inputs)
```

### Feature Extraction

```python
import torch

# Get embeddings for downstream tasks
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use the last hidden state with mean pooling
embeddings = outputs.hidden_states[-1].mean(dim=1)
```
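### Semantic Routing (illustrative sketch)

Since semantic routing is listed as a primary use case, the sketch below shows one way the mean-pooled embeddings above could drive a simple route classifier. The route names, example prompts, and prototype-based cosine matching are illustrative assumptions, not the released router implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative routing sketch; routes and example prompts are hypothetical.
model_id = "llm-semantic-router/mmbert-32k-yarn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id).eval()

def embed(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=32768)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    # Mean pooling over non-padding tokens, then L2-normalize for cosine similarity.
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

# One prototype embedding per (hypothetical) route.
routes = {
    "code":     "Write a Python function that parses a CSV file.",
    "math":     "Solve the integral of x^2 from 0 to 1.",
    "long_doc": "Summarize the attached 50-page contract.",
}
prototypes = embed(list(routes.values()))

def route(query: str) -> str:
    sims = embed([query]) @ prototypes.T  # cosine similarities to each prototype
    return list(routes)[sims.argmax().item()]

print(route("How do I reverse a linked list in C++?"))  # likely "code"
```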
## ONNX Runtime Usage

An ONNX export is available for high-performance inference with ONNX Runtime.

### Python

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer and ONNX model
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
sess = ort.InferenceSession(
    "onnx/model.onnx",  # or download from HF
    providers=['ROCmExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Inference
text = "What is the weather like today?"
inputs = tokenizer(text, return_tensors="np", padding=True)
outputs = sess.run(None, {
    'input_ids': inputs['input_ids'].astype(np.int64),
    'attention_mask': inputs['attention_mask'].astype(np.int64)
})
embeddings = outputs[0].mean(axis=1)  # Mean pooling
```

### Rust (ort-binding)

```rust
use onnx_semantic_router::MmBertEmbeddingModel;

let model = MmBertEmbeddingModel::load("./mmbert-32k-yarn-onnx", false)?;
let embeddings = model.embed("What is the weather?")?;
```

### Latency Benchmarks (AMD MI300X)

| Backend | Single text | Batch of 4 (per text) |
|---------|-------------|-----------------------|
| CPU | 10.1 ms | 6.8 ms |
| ROCm GPU | **4.7 ms** | **1.2 ms** |

## Limitations

- **Long-range retrieval**: The model accepts 32K-token inputs, but retrieval accuracy drops sharply once the retrieval distance exceeds 2048 tokens
- **Training data**: Trained on CC-100, which may carry biases from web-crawl data
- **Compute requirements**: The full 32K context requires significant GPU memory (~180 GB at batch size 1)

## Citation

```bibtex
@misc{mmbert-32k-yarn,
  title={mmBERT-32K-YaRN: Extended Context Modern Multilingual BERT},
  author={vLLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mmbert-32k-yarn}
}
```

## References

- [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) - Modern Multilingual BERT (1800+ languages)
- [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) - Base architecture
- [YaRN](https://arxiv.org/abs/2309.00071) - Yet another RoPE extensioN method
- [vLLM Semantic Router](https://huggingface.co/llm-semantic-router) - Mixture-of-Models routing

## License

MIT License (same as the mmBERT base model)