---
language: km
license: apache-2.0
library_name: sentencepiece
tags:
- tokenizer
- khmer
- sentencepiece
- graph-regularization
- low-resource
- southeast-asian
- cambodia
pipeline_tag: feature-extraction
datasets:
- khmer-corpus-648mb
metrics:
- accuracy
- f1
model-index:
- name: Tokkonizer-KM V3f
  results:
  - task:
      type: tokenization
      name: Khmer Tokenization
    metrics:
    - type: tokens-per-character
      value: 0.293
      name: TPC (Khmer)
    - type: accuracy
      value: 93.33
      name: Sanskrit/Pali Preservation
    - type: f1
      value: 99.94
      name: ALT Segmentation F1
widget:
- text: "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
  example_title: "Buddhism importance"
- text: "ព្រះរាជាណាចក្រកម្ពុជា"
  example_title: "Kingdom of Cambodia"
- text: "នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
  example_title: "PM delivered speech"
- text: "ធម៌ កម្ម និព្វាន សង្ឃ បុណ្យ"
  example_title: "Buddhist terms (Pali/Sanskrit)"
- text: "សង្រ្គាមនៅមជ្ឈិមបូព៌ាបានបង្កផលប៉ះពាល់យ៉ាងធ្ងន់ធ្ងរ"
  example_title: "Geopolitical news"
- text: "ស្រឡាញ់បងណាស់"
  example_title: "Love you so much"
---

# Tokkonizer-KM V3f

A production-ready Khmer-native tokenizer that matches or outperforms Google's mT5 and Meta's XLM-R on every Khmer metric reported below, with a **31x smaller vocabulary** (8K vs 250K tokens).

**Live Demo**: [angkor-ai.com/labs](https://angkor-ai.com/labs)

## Tokenization Examples

```
"ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
→ [▁ | ព្រះពុទ្ធសាសនា | មានសារៈសំខាន់ | ។]  4 tokens, TPC 0.143 ✅

"នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
→ [▁នាយករដ្ឋមន្ត្រី | បានថ្លែង | សុន្ទរកថា]  3 tokens, TPC 0.094 ✅

Sanskrit/Pali:
ធម៌ → 1 token ✅ | កម្ម → 1 token ✅ | និព្វាន → 1 token ✅
```

## Performance

| Metric | **V3f (8K)** | mT5 (250K) | XLM-R (250K) |
|--------|:---:|:---:|:---:|
| TPC (Khmer) | **0.293** | 0.348 | 0.327 |
| Sanskrit/Pali | **93.3%** | 21.4% | 28.6% |
| Cultural preservation | **91.7%** | 75.0% | 91.7% |
| UNK rate | **0%** | 0% | 0% |
| Lossless round-trip | **Yes** | No | No |
| Speed | **15M/s** | 3.3M/s | 2.8M/s |
| ALT F1 (5K sentences) | **99.94%** | n/a | n/a |

## Intended Uses

- Khmer text preprocessing for NLP pipelines
- Semantic search / RAG over Khmer documents
- Keyboard prediction engine
- Spell checking (with companion lexicon)

**Not intended for**: text generation, translation, non-Khmer languages.

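
The TPC (tokens-per-character) figures in the examples and performance table above can be checked with a few lines of Python. This is a minimal sketch, assuming TPC is the number of pieces divided by the number of Unicode code points in the input (not grapheme clusters). The card does not state the definition explicitly, but this reading is consistent with both worked examples. The function name `tokens_per_character` is illustrative and not part of the released package.

```python
def tokens_per_character(text: str, pieces: list[str]) -> float:
    """Return the piece count divided by the number of code points in text.

    Assumption: characters are counted as Unicode code points of the raw
    input, and the SentencePiece marker "▁" counts as a piece when it
    stands alone.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(pieces) / len(text)


# First example from this card, with the pieces copied from the listing above.
text = "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
pieces = ["▁", "ព្រះពុទ្ធសាសនា", "មានសារៈសំខាន់", "។"]
print(f"{tokens_per_character(text, pieces):.3f}")
```

For this sentence the ratio is 4 pieces over 28 code points, which rounds to the 0.143 reported above. Lower is better: fewer pieces are needed to cover the same text.
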
## How to Use

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("khopilot/km-tokenizer-khmer")

tokens = tokenizer.encode("ព្រះពុទ្ធសាសនា")  # list of token ids
decoded = tokenizer.decode(tokens)  # 100% lossless
```

Or with SentencePiece directly:

```python
import sentencepiece as spm

# Load the raw SentencePiece model file.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

pieces = sp.encode("កម្ពុជា", out_type=str)  # ["▁", "កម្ពុជា"]
```

## Training

- **Algorithm**: SentencePiece Unigram
- **Vocabulary**: 8,000 tokens
- **Corpus**: 648MB cleaned Khmer text (957K lines) drawn from Wikipedia, news, government, and religious texts
- **Character coverage**: 1.0 (full Khmer Unicode)
- **User-defined symbols (UDS)**: 7 Sanskrit/Pali terms
- **Key finding**: 7 UDS outperformed 500 UDS; less intervention gave better results
- **Hardware**: Apple M3 Pro, ~30 min training
- **CO2**: negligible (CPU only)

## Graph Regularization (Layer 2)

When paired with graph-regularized GPT-2 (separate model):

| Metric | Baseline | Graph-Reg |
|--------|:---:|:---:|
| Coherence@10 | 0.32% | **15.5%** (48x) |
| Collapse | 0% | 0.2% |
| Perplexity cost | n/a | +2.8% |
| Retrieval MRR | 0.417 | **0.460** (+10.4%) |

## Companion: Khmer NLP Engine (26MB SQLite)

A complete prediction + correction + emoji engine built on this tokenizer:

- 60K word-pair predictions (IDF-weighted)
- 28K phrase predictions
- 12,677 validated words (spell check)
- 552 romanization mappings (Latin→Khmer)
- 400 contextual emoji suggestions
- 282 consonant cluster validations

Demo: [angkor-ai.com/labs](https://angkor-ai.com/labs)

## Limitations & Caveats

- **Sanskrit/Pali circularity**: The headline 93.3% (14/15) includes 7 test terms that were user-defined symbols, whose preservation is guaranteed. On the 8 non-UDS terms, the EM optimizer's true success rate is 87.5% (7/8).
- **ALT F1 in-domain**: The 99.94% boundary F1 benefits from zero-width space (ZWSP) segmentation conventions shared between the training data and ALT. Cross-domain word-level F1 is estimated at roughly 95-97%.
- **Retrieval MRR**: The +10.4% gain was measured on 20 questions. It is preliminary and not statistically significant (overlapping bootstrap CIs).
- Grapheme break rate: 1.08% (target 1.0%)
- Corpus bias: formal/news text is overrepresented relative to conversational text
- Foreign names fragment into individual characters
- សមាធិ (samadhi) is the only Sanskrit term that still fragments

## Version History

| Version | Vocab | TPC | Status |
|---------|:---:|:---:|---|
| V6.5 (Aug 2025) | 32K | 0.664 | Failed |
| V7 (Sep 2025) | 16K | 0.294 | Deployed |
| **V3f (Mar 2026)** | **8K** | **0.293** | **Production** |

## Citation

```bibtex
@software{delrieu2026tokkonizer,
  author = {Delrieu, Nicolas},
  title = {Tokkonizer-KM: Graph-Regularized Tokenization for Khmer},
  year = {2026},
  url = {https://github.com/khopilot/tokkonizer-km}
}
```

## Contact

- [angkor-ai.com](https://angkor-ai.com)
- nicolasdelrieu.services@gmail.com