sozkz-morphbpe-100k-kk-v1

A morpheme-aware byte-level BPE tokenizer for Kazakh, trained on 55.5M sentences. Inspired by the approach in the HyperCLOVA X Technical Report (NAVER, 2024) for Korean — adapted here for Kazakh.

Why morpheme-aware?

Standard BPE treats text as a flat byte stream and learns merges purely by frequency, ignoring morphological structure. For agglutinative languages like Kazakh — where a single word can contain 5–7 morphemes — this leads to cross-morpheme tokens and inconsistent segmentation:

# Standard BPE (arbitrary splits)
үйлерімізде  →  үйлер | іміз | де        ← ROOT+PL merged into one token

# morphBPE (morpheme-constrained)
үйлерімізде  →  үй | лер | іміз | де     ← ROOT | PL | POSS.1PL | LOC

Merges are constrained to happen within morphemes only. A BiLSTM model (trained on QazCorpora with BIO tagging) marks morpheme boundaries before BPE training. The boundary marker \x1F is inserted between morphemes, and the BPE pre-tokenizer splits on it — so no merge can ever cross a morpheme boundary.
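A minimal sketch of that constraint in plain Python, with no tokenizer library involved: the segmented word is joined with `\x1f`, split back into morpheme chunks, and candidate merge pairs are counted only inside each chunk. The function names are illustrative, and the morpheme list is hard-coded where the real pipeline would call the BiLSTM segmenter.

```python
from collections import Counter

BOUNDARY = "\x1f"  # ASCII Unit Separator, the marker named above


def mark_boundaries(morphemes):
    """Join a segmented word with the boundary marker."""
    return BOUNDARY.join(morphemes)


def pretokenize(marked_text):
    """Split on whitespace, then on the marker; the marker itself is dropped."""
    return [chunk
            for word in marked_text.split()
            for chunk in word.split(BOUNDARY)
            if chunk]


def count_byte_pairs(chunks):
    """Count adjacent UTF-8 byte pairs inside each chunk, never across chunks."""
    pairs = Counter()
    for chunk in chunks:
        data = chunk.encode("utf-8")
        pairs.update(zip(data, data[1:]))
    return pairs


# Hard-coded segmentation of үйлерімізде (ROOT | PL | POSS.1PL | LOC).
marked = mark_boundaries(["үй", "лер", "іміз", "де"])
chunks = pretokenize(marked)
constrained = count_byte_pairs(chunks)
unconstrained = count_byte_pairs(["үйлерімізде"])

# The pair straddling the ROOT|PL boundary: last byte of "й", first byte of "л".
cross = ("й".encode("utf-8")[-1], "л".encode("utf-8")[0])
print(chunks)
print(cross in unconstrained, cross in constrained)
```

The straddling pair is a merge candidate when the word is treated as one flat byte stream, but it never enters the counts once the word is split on the marker, so a BPE trainer fed these chunks has nothing to merge across the boundary.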

Stats

Property	Value
Vocab size	100,000
Corpus	55.5M sentences
Dataset	`stukenov/ekitil-corpus-annotated-kk-v1` (`detected_lang=kk`, `lang_confidence ≥ 0.95`)
Morpheme coverage	94.4% of words analyzed
Avg fertility	1.51 tok/word
Byte-level	Yes — no unknown tokens
Special tokens	`<|endoftext|>`, `<|startoftext|>`, `<|padding|>`

Comparison with other tokenizers

Fertility (tokens/word) on 8 Kazakh sentences. Lower = better compression.

Tokenizer	Vocab	Fertility
morphBPE-100k (this model)	100K	1.51
sozkz-vocab-bpe-32k-kk-base-v1	32K	1.47
sozkz-core-gpt2-50k-kk-base-v1	50K	1.44
ekitil-vocab-bpe-64k-kkru-v1	64K	1.38
sozkz-core-gpt2-200k-kk-base-v1	200K	1.31
mGPT-1.3B-kazakh	100K	4.24
Llama-3.2-1B	128K	4.49

vs LLaMA / mGPT: about 3× fewer tokens on Kazakh text (1.51 vs 4.24 and 4.49 tokens/word); these multilingual tokenizers allocate most of their vocabulary to English/Chinese, leaving Kazakh severely under-represented.

vs other SozKZ tokenizers: fertility is 3–9% higher than our 32k–64k BPE tokenizers (1.51 vs 1.38–1.47) and about 15% higher than the 200K one (1.31), so the 100K vocabulary does not buy extra compression on this test set. The difference is in morpheme alignment: this tokenizer's token boundaries correspond to real morpheme boundaries, which should benefit downstream language models.

Per-sentence token counts (S1–S8):

Tokenizer	S1	S2	S3	S4	S5	S6	S7	S8
morphBPE-100k	6	7	5	9	8	9	12	12
sozkz-bpe-32k	6	6	6	10	9	8	12	9
sozkz-gpt2-50k	6	6	6	9	8	9	12	9
sozkz-gpt2-200k	6	6	6	8	6	7	11	9
ekitil-bpe-64k	6	6	6	8	8	7	12	9
mGPT-1.3B	22	18	19	25	25	25	26	31
Llama-3.2-1B	24	20	18	27	29	28	29	27

Test sentences:

S1: Қазақстан — Орталық Азиядағы мемлекет.
S2: Бүгін ауа райы жақсы болады.
S3: Үйлерімізде кітаптар көп.
S4: Университеттердегі студенттер емтихандарға дайындалуда.
S5: Мектепте оқушылар математика сабағына дайындалуда.
S6: Алматы қаласында жаңа метро стансасы ашылды.
S7: Мен кеше дүкенге барып, нан мен сүт сатып алдым.
S8: Қазақ тілі — түркі тілдер тобына жататын тіл.
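The card does not say how fertility was computed. Assuming it is total tokens divided by whitespace-separated words (so a standalone dash counts as a word), the per-sentence counts above reproduce the fertility column, as this short check shows:

```python
sentences = [
    "Қазақстан — Орталық Азиядағы мемлекет.",
    "Бүгін ауа райы жақсы болады.",
    "Үйлерімізде кітаптар көп.",
    "Университеттердегі студенттер емтихандарға дайындалуда.",
    "Мектепте оқушылар математика сабағына дайындалуда.",
    "Алматы қаласында жаңа метро стансасы ашылды.",
    "Мен кеше дүкенге барып, нан мен сүт сатып алдым.",
    "Қазақ тілі — түркі тілдер тобына жататын тіл.",
]

# Per-sentence token counts copied from the table above.
token_counts = {
    "morphBPE-100k":   [6, 7, 5, 9, 8, 9, 12, 12],
    "sozkz-bpe-32k":   [6, 6, 6, 10, 9, 8, 12, 9],
    "sozkz-gpt2-50k":  [6, 6, 6, 9, 8, 9, 12, 9],
    "sozkz-gpt2-200k": [6, 6, 6, 8, 6, 7, 11, 9],
    "ekitil-bpe-64k":  [6, 6, 6, 8, 8, 7, 12, 9],
    "mGPT-1.3B":       [22, 18, 19, 25, 25, 25, 26, 31],
    "Llama-3.2-1B":    [24, 20, 18, 27, 29, 28, 29, 27],
}

# Assumption: a "word" is any whitespace-separated piece of the sentence.
total_words = sum(len(s.split()) for s in sentences)

fertility = {name: round(sum(counts) / total_words, 2)
             for name, counts in token_counts.items()}

for name, value in fertility.items():
    print(f"{name:16s} {value:.2f}")
```

Under the same assumption, the token-count ratios against morphBPE are 191/68 ≈ 2.81 for mGPT and 202/68 ≈ 2.97 for Llama-3.2.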

OOV / neologism stress test

The real advantage of morphBPE appears on words that no tokenizer has seen: recent loanwords carrying 6–7 stacked Kazakh suffixes, as segmented in the table below. Standard BPE falls back to arbitrary sub-word splits; morphBPE recognises the root and segments the suffix chain separately from it.

Word	Morphemes	morphBPE	gpt2-50k	gpt2-200k	LLaMA-3.2
жасандыландырылмағандықтан	жасанды+лан+дыр+ыл+ма+ған+дық+тан	5	5	4	18
компьютерлендірілмегендерге	компьютер+лен+діріл+ме+ген+дер+ге	5	6	4	14
интернеттендірілмегендіктен	интернет+тен+діріл+ме+ген+дік+тен	4	6	4	14
цифрландырылмайтындардың	цифр+лан+дыр+ыл+май+тын+дар+дың	5	5	4	16
роботтандырылғандардікіндей	робот+тан+дыр+ыл+ған+дар+дікі+ндей	4	6	4	16
вакциналанбағандарымыздан	вакцина+лан+ба+ған+дар+ымыз+дан	7	5	4	15
блокчейндендірілмегендерге	блокчейн+ден+діріл+ме+ген+дер+ге	6	7	5	14
смартфондастырылмайтынға	смартфон+дас+тыр+ыл+май+тын+ға	6	6	5	16
ғаламтортандырылмағандықтан	ғаламтор+тан+дыр+ыл+ма+ған+дық+тан	4	6	5	20
Total		52	59	45	157

Selected examples with actual token strings:

ғаламтортандырылмағандықтан  (internet+VERB+CAUS+PASS+NEG+PTCP+NOM+ABL)

  morphBPE  [4]: ['ғаламтор', 'тандырыл', 'мағандық', 'тан']
  gpt2-50k  [6]: ['ғал', 'ам', 'тор', 'тан', 'дырылмаған', 'дықтан']
  LLaMA     [20]: ['<|begin_of_text|>', '?', '?', 'ал', 'ам', 'тор', 'т', 'анд', ...]

интернеттендірілмегендіктен  (internet+VERB+PASS+NEG+PTCP+NOM+ABL)

  morphBPE  [4]: ['интернет', 'тендіріл', 'мегендік', 'тен']
  gpt2-50k  [6]: ['интер', 'нет', 'тен', 'дір', 'ілмеген', 'діктен']
  LLaMA     [14]: ['<|begin_of_text|>', 'ин', 'тер', 'нет', 'тен', 'ді', 'рі', ...]

Key observations:

morphBPE preserves loanword roots as single tokens (интернет, ғаламтор, робот, блокчейн) — standard BPE splits them arbitrarily (интер+нет, ғал+ам+тор)
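morphBPE needs fewer tokens than gpt2-50k overall (52 vs 59 total): it ties on three of the words listed and is worse only on вакциналанбағандарымыздан. It stays competitive with gpt2-200k (52 vs 45 total) despite gpt2-200k having 2× larger vocab
LLaMA splits Kazakh-specific letters such as ғ into several byte-level tokens, which print as replacement characters (?) when shown one token at a time; the LLaMA lists above also include the <|begin_of_text|> token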
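One likely reading of the `?` entries, offered as an assumption rather than something the card states: a byte-level tokenizer with no merged token for a letter emits that letter as separate UTF-8 bytes. A lone byte of a multi-byte character is not valid text on its own, so printing tokens one at a time shows a replacement character, while decoding the full byte sequence restores the letter. The effect can be reproduced without any tokenizer:

```python
letter = "ғ"  # U+0493, a Kazakh-specific Cyrillic letter

raw = letter.encode("utf-8")  # two bytes in UTF-8
# Decode each byte separately, as happens when tokens are printed one by one.
pieces = [bytes([b]).decode("utf-8", errors="replace") for b in raw]
# Decode the bytes together, as happens when the whole sequence is decoded.
whole = raw.decode("utf-8")

print(len(raw), pieces, whole)
```

This matches the two `?` entries that stand in for ғ at the start of the LLaMA token list for ғаламтортандырылмағандықтан.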

Usage

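from transformers import PreTrainedTokenizerFast

# The repository is gated: accept its access conditions on the Hugging Face Hub
# and authenticate (e.g. `huggingface-cli login`) before loading.
tokenizer = PreTrainedTokenizerFast.from_pretrained("stukenov/sozkz-morphbpe-100k-kk-v1")

text = "Үйлерімізде кітаптар көп."
ids = tokenizer.encode(text)   # raw text in, token ids out
print(ids)
print(tokenizer.decode(ids))   # decode the ids back to text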

At inference time, apply the tokenizer directly on raw text — no morphological pre-processing needed. The morpheme-aware structure is baked into the vocabulary from training.
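To inspect segmentations in the style of the token-string listings in the stress test, each id can be decoded on its own. The helper below is a small sketch; `token_strings` is a name chosen here, and it relies only on the `encode` and `decode` methods already used in the snippet above.

```python
def token_strings(tokenizer, text):
    """Return the decoded string of every token produced for `text`."""
    return [tokenizer.decode([token_id]) for token_id in tokenizer.encode(text)]


# With the tokenizer loaded as in the usage snippet:
# print(token_strings(tokenizer, "ғаламтортандырылмағандықтан"))
```

Decoding ids one at a time is what makes partial-byte tokens show up as replacement characters in other byte-level tokenizers, so the same helper is handy for side-by-side comparisons.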

Training details

Segmenter: BiLSTM (QazCorpora, BIO tagging: B-ROOT / I-ROOT / B-SUFFIX / I-SUFFIX) with LRU cache (500K entries, 94.4% hit rate)
Boundary marker: \x1F (ASCII Unit Separator) inserted between morphemes
Pre-tokenizer: split on \x1F (removed) → ByteLevel encoding per morpheme
BPE library: HuggingFace tokenizers (Rust), min_frequency=2
Corpus filter: detected_lang == "kk" and lang_confidence >= 0.95
Training time: ~8h segmentation + 15 min BPE training (RTX 4090)
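The settings listed above could be wired together with the HuggingFace `tokenizers` library roughly as follows. This is a sketch under stated assumptions, not the project's training script: the function name `train_morph_bpe`, the `add_prefix_space=False` choice, and the use of `train_from_iterator` are guesses, while the marker, `min_frequency=2`, the vocabulary size, and the special tokens come from this card. The input is expected to be text in which the segmenter has already inserted `\x1f` between morphemes.

```python
def train_morph_bpe(marked_lines, vocab_size=100_000):
    """Train a byte-level BPE whose merges stay inside \\x1f-delimited chunks."""
    # Imported here so the module loads even where `tokenizers` is not installed.
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.Sequence([
        # Split on the boundary marker and drop it from the output.
        pre_tokenizers.Split("\x1f", behavior="removed"),
        # Byte-level mapping per chunk; no prefix space (assumption), so
        # suffix chunks are not turned into word-initial tokens.
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ])
    tok.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=["<|endoftext|>", "<|startoftext|>", "<|padding|>"],
        # Seed all 256 byte symbols so no input is ever out of vocabulary.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tok.train_from_iterator(marked_lines, trainer=trainer)
    return tok


# Toy usage on a few marked lines (the real corpus has 55.5M sentences):
# tok = train_morph_bpe(["үй\x1fлер\x1fіміз\x1fде кітап\x1fтар"] * 50, vocab_size=400)
# print([tok.decode([i]) for i in tok.encode("үйлерімізде").ids])
```

Because the same pre-tokenizer simply finds no marker in raw text, the trained tokenizer can be applied to unsegmented input at inference time, which is consistent with the usage note above.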

Part of the SozKZ / EkiTil project

This tokenizer is part of the SozKZ / EkiTil initiative — open Kazakh language models and tools.

Reference

HyperCLOVA X Technical Report (NAVER). arXiv:2404.01954, published Apr 2, 2024.