Update README.md

README.md CHANGED

@@ -61,7 +61,7 @@ Roughly four-fifths of tokens in scripts geographically and culturally distant f

## Feature Overview:

-1. +81,
+1. +81,492 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on **3 million** texts from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
2. Only tokens from the `Replaced tokens` table were replaced; no tokens from any other writing system were affected.
3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
4. Vocab size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.

@@ -72,7 +72,7 @@ tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
-print(
+print(toks.input_ids)  # [55939, 124769, 117298, 199258] only 4 tokens 💪🏻
```
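The ID-preservation property behind points 2–3 can be sketched in isolation: only selected token *strings* change, every token *ID* stays put, so embedding rows for unchanged IDs transfer verbatim. Below is a minimal, self-contained toy sketch of that idea — the vocab, the replaced tokens, the vectors, and the zero-init for replaced rows are all illustrative stand-ins, not the repo's actual data or code.

```python
# Toy sketch of the ID-preserving token swap (illustrative data only).

# A tiny BPE-style vocab: token string -> token ID.
vocab = {"hello": 0, "tok_cjk_a": 1, "world": 2, "tok_cjk_b": 3}

# Swap only the selected strings for new Cyrillic tokens; IDs never move.
replacements = {"tok_cjk_a": "опти", "tok_cjk_b": "мізм"}  # old -> new string
new_vocab = {replacements.get(tok, tok): tok_id for tok, tok_id in vocab.items()}

# Unchanged tokens keep their IDs, so their embedding rows transfer verbatim;
# only the replaced IDs need fresh vectors (zero-init here purely for show --
# the README does not specify how the new rows are initialised).
dim = 3
embeddings = {tok_id: [tok_id + 0.5] * dim for tok_id in vocab.values()}
replaced_ids = {vocab[tok] for tok in replacements}
new_embeddings = {
    tok_id: ([0.0] * dim if tok_id in replaced_ids else row)
    for tok_id, row in embeddings.items()
}

assert new_vocab == {"hello": 0, "опти": 1, "world": 2, "мізм": 3}
assert new_embeddings[0] == embeddings[0]    # untouched row reused as-is
assert new_embeddings[1] == [0.0, 0.0, 0.0]  # replaced row re-initialised
```

This is why point 3 matters in practice: a model can load the stock Gemma-3 embedding matrix and only needs new vectors for the ~81k replaced IDs.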