---
language: km
license: apache-2.0
library_name: sentencepiece
tags:
- tokenizer
- khmer
- sentencepiece
- graph-regularization
- low-resource
- southeast-asian
- cambodia
pipeline_tag: feature-extraction
datasets:
- khmer-corpus-648mb
metrics:
- accuracy
- f1
model-index:
- name: Tokkonizer-KM V3f
  results:
  - task:
      type: tokenization
      name: Khmer Tokenization
    metrics:
    - type: tokens-per-character
      value: 0.293
      name: TPC (Khmer)
    - type: accuracy
      value: 93.33
      name: Sanskrit/Pali Preservation
    - type: f1
      value: 99.94
      name: ALT Segmentation F1
widget:
- text: "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
  example_title: "Buddhism importance"
- text: "ព្រះរាជាណាចក្រកម្ពុជា"
  example_title: "Kingdom of Cambodia"
- text: "នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
  example_title: "PM delivered speech"
- text: "ធម៌ កម្ម និព្វាន សង្ឃ បុណ្យ"
  example_title: "Buddhist terms (Pali/Sanskrit)"
- text: "សង្រ្គាមនៅមជ្ឈិមបូព៌ាបានបង្កផលប៉ះពាល់យ៉ាងធ្ងន់ធ្ងរ"
  example_title: "Geopolitical news"
- text: "ស្រឡាញ់បងណាស់"
  example_title: "Love you so much"
---

# Tokkonizer-KM V3f

A production-ready Khmer-native tokenizer that matches or outperforms the tokenizers of Google's mT5 and Meta's XLM-R on every Khmer metric reported below, with a **31x smaller vocabulary** (8K vs. 250K tokens).

**Live Demo**: [angkor-ai.com/labs](https://angkor-ai.com/labs)

## Tokenization Examples

```
"ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
→ [▁ | ព្រះពុទ្ធសាសនា | មានសារៈសំខាន់ | ។]
  4 tokens, TPC 0.143 ✅

"នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
→ [▁នាយករដ្ឋមន្ត្រី | បានថ្លែង | សុន្ទរកថា]
  3 tokens, TPC 0.094 ✅

Sanskrit/Pali: ធម៌ → 1 token ✅ | កម្ម → 1 token ✅ | និព្វាន → 1 token ✅
```
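The TPC (tokens-per-character) figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes "character" means one Unicode code point, which is consistent with the two examples but is not stated explicitly in this card; `tokens_per_character` is a name chosen for illustration.

```python
def tokens_per_character(pieces, text):
    """Return the number of pieces divided by the number of code points in text.

    Assumption: a "character" is a single Unicode code point, so Khmer
    subscript signs and vowel marks each count separately.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(pieces) / len(text)


# Pieces copied from the first example above (4 tokens, 28 code points).
text = "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
pieces = ["▁", "ព្រះពុទ្ធសាសនា", "មានសារៈសំខាន់", "។"]

print(round(tokens_per_character(pieces, text), 3))  # 0.143
```

Lower is better: fewer tokens are needed to cover the same amount of text.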

## Performance

| Metric | **V3f (8K)** | mT5 (250K) | XLM-R (250K) |
|--------|:---:|:---:|:---:|
| TPC (Khmer) | **0.293** | 0.348 | 0.327 |
| Sanskrit/Pali | **93.3%** | 21.4% | 28.6% |
| Cultural preservation | **91.7%** | 75.0% | 91.7% |
| UNK rate | **0%** | 0% | 0% |
| Lossless round-trip | **Yes** | No | No |
| Speed | **15M/s** | 3.3M/s | 2.8M/s |
| ALT F1 (5K sentences) | **99.94%** | — | — |

## Intended Uses

- Khmer text preprocessing for NLP pipelines
- Semantic search / RAG over Khmer documents
- Keyboard prediction engine
- Spell checking (with companion lexicon)

**Not intended for**: text generation, translation, non-Khmer languages.

## How to Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/km-tokenizer-khmer")
tokens = tokenizer.encode("ព្រះពុទ្ធសាសនា")
decoded = tokenizer.decode(tokens)  # 100% lossless
```

Or with SentencePiece directly:
```python
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
pieces = sp.encode("កម្ពុជា", out_type=str)  # ["▁", "កម្ពុជា"]
```
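To check the lossless round-trip claim on your own text, a small helper like the one below can list every string that does not survive encode then decode. It is a sketch, not part of this repository: `round_trip_failures` is a hypothetical name, and the UTF-8 byte codec is only a stand-in so the snippet runs without the model file.

```python
def round_trip_failures(encode, decode, texts):
    """Return the texts for which decode(encode(text)) differs from text."""
    return [t for t in texts if decode(encode(t)) != t]


# Stand-in codec (UTF-8 bytes) so the example runs without tokenizer.model.
def demo_encode(text):
    return list(text.encode("utf-8"))


def demo_decode(ids):
    return bytes(ids).decode("utf-8")


samples = ["ព្រះរាជាណាចក្រកម្ពុជា", "ស្រឡាញ់បងណាស់"]
print(round_trip_failures(demo_encode, demo_decode, samples))  # []
```

With the real tokenizer, pass `sp.encode` and `sp.decode` from the SentencePiece example above in place of the stand-in functions. Note that SentencePiece normalizes input by default (NFKC and whitespace handling), so inputs that are not already in normalized form may be reported as failures depending on how the model was configured.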

## Training

- **Algorithm**: SentencePiece Unigram
- **Vocabulary**: 8,000 tokens
- **Corpus**: 648MB cleaned Khmer text (957K lines) — Wikipedia, news, government, religious texts
- **Character coverage**: 1.0 (full Khmer Unicode)
- **User-defined symbols**: 7 Sanskrit/Pali terms
- **Key finding**: 7 user-defined symbols (UDS) outperformed 500 UDS; less intervention gave better results
- **Hardware**: Apple M3 Pro, ~30 min training
- **CO2**: negligible (CPU only)
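The settings listed above map onto standard SentencePiece trainer options roughly as follows. This is a sketch under assumptions, not the actual training script: the corpus path, model prefix, and symbol list are placeholders, the 7 real user-defined terms are not listed in this card, and any option not mentioned above is left at its library default.

```python
def build_trainer_kwargs(corpus_path, model_prefix, user_defined_symbols):
    """Collect the trainer options named in this card into one dict."""
    return {
        "input": corpus_path,                # one sentence per line
        "model_prefix": model_prefix,        # writes <prefix>.model / <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": 8000,
        "character_coverage": 1.0,
        "user_defined_symbols": list(user_defined_symbols),
    }


def train_tokenizer(kwargs):
    """Run SentencePiece training (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so the sketch loads without it

    spm.SentencePieceTrainer.train(**kwargs)


# Placeholder values: replace with your own corpus and term list.
kwargs = build_trainer_kwargs(
    corpus_path="khmer_corpus.txt",           # hypothetical file name
    model_prefix="tokenizer",
    user_defined_symbols=["<term-1>", "<term-2>"],  # placeholders, not the real 7
)
print(kwargs["vocab_size"], kwargs["model_type"])
```

Calling `train_tokenizer(kwargs)` would start training once the corpus file exists.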

## Graph Regularization (Layer 2)

When paired with a graph-regularized GPT-2 (a separate model):

| Metric | Baseline | Graph-Reg |
|--------|:---:|:---:|
| Coherence@10 | 0.32% | **15.5%** (48x) |
| Collapse | 0% | 0.2% |
| Perplexity cost | — | +2.8% |
| Retrieval MRR | 0.417 | **0.460** (+10.4%) |
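For readers who want to run a similar retrieval comparison, the sketch below computes Mean Reciprocal Rank (MRR) and a percentile bootstrap confidence interval. It illustrates the general technique only; the evaluation code behind the table is not shown in this card, the function names are hypothetical, and the ranks are made-up toy data.

```python
import random


def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first relevant hit per query, None if missed."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def bootstrap_ci(ranks, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for MRR, resampling queries with replacement."""
    rng = random.Random(seed)
    n = len(ranks)
    stats = sorted(
        mean_reciprocal_rank([ranks[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy ranks for 20 queries (illustrative only, not the card's data).
toy_ranks = [1, 2, 3, None, 1, 5, 2, 4, 1, 10, 3, 2, None, 1, 6, 2, 3, 1, 4, 2]
low, high = bootstrap_ci(toy_ranks)
print(mean_reciprocal_rank(toy_ranks), (low, high))
```

With only 20 queries the interval is wide, so two systems should be compared by checking whether their intervals overlap rather than by the point estimates alone.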

## Companion: Khmer NLP Engine (26MB SQLite)

A complete prediction + correction + emoji engine built on this tokenizer:
- 60K word-pair predictions (IDF-weighted)
- 28K phrase predictions
- 12,677 validated words (spell check)
- 552 romanization mappings (Latin→Khmer)
- 400 contextual emoji suggestions
- 282 consonant cluster validations

Demo: [angkor-ai.com/labs](https://angkor-ai.com/labs)

## Limitations & Caveats

- **Sanskrit/Pali circularity**: 7 of 15 test terms were user-defined symbols (guaranteed preservation). True EM optimizer success rate on non-UDS terms: 87.5% (7/8).
- **ALT F1 in-domain**: 99.94% boundary F1 benefits from shared ZWSP segmentation conventions between training data and ALT. Cross-domain word-level F1 estimated ~95-97%.
- **Retrieval MRR**: +10.4% on 20 questions — preliminary, not statistically significant (overlapping bootstrap CIs).
- Grapheme break rate: 1.08% (target 1.0%)
- Corpus bias: formal/news text overrepresented vs conversational
- Foreign names fragment into individual characters
- សមាធិ (samadhi) is the only Sanskrit term that still fragments
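The Sanskrit/Pali circularity caveat comes down to simple arithmetic, worked through below using only the counts given in this section. Variable names are illustrative.

```python
def rate(preserved, total):
    """Preservation rate as a percentage."""
    return 100.0 * preserved / total


# Counts from the caveat above: 15 test terms, 7 of them user-defined symbols.
uds_total, uds_preserved = 7, 7          # UDS are preserved by construction
non_uds_total, non_uds_preserved = 8, 7  # terms the Unigram model had to learn

overall = rate(uds_preserved + non_uds_preserved, uds_total + non_uds_total)
non_uds_only = rate(non_uds_preserved, non_uds_total)

print(round(overall, 2))  # 93.33, the headline figure
print(non_uds_only)       # 87.5, the rate excluding guaranteed terms
```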

## Version History

| Version | Vocab | TPC | Status |
|---------|:---:|:---:|---|
| V6.5 (Aug 2025) | 32K | 0.664 | Failed |
| V7 (Sep 2025) | 16K | 0.294 | Deployed |
| **V3f (Mar 2026)** | **8K** | **0.293** | **Production** |

## Citation

```bibtex
@software{delrieu2026tokkonizer,
  author = {Delrieu, Nicolas},
  title = {Tokkonizer-KM: Graph-Regularized Tokenization for Khmer},
  year = {2026},
  url = {https://github.com/khopilot/tokkonizer-km}
}
```

## Contact

- [angkor-ai.com](https://angkor-ai.com)
- nicolasdelrieu.services@gmail.com