transhumanist-already-exists committed on
Commit 9b2053a · verified · 1 parent: 89d33e3

Update README.md

Files changed (1):
  1. README.md +2 -2
README.md CHANGED
@@ -61,7 +61,7 @@ Roughly four-fifths of tokens in scripts geographically and culturally distant f
 
 ## Feature Overview:
 
-1. +81,168 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on **3 million** texts from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
+1. +81,492 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on **3 million** texts from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
 2. Only tokens from the `Replaced tokens` table were replaced; no tokens from any other writing system were affected.
 3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
 4. Vocab size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
@@ -72,7 +72,7 @@ tokenizer = AutoTokenizer.from_pretrained(
     "transhumanist-already-exists/tereshchenkoblue-tokenizer"
 )
 toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
-print(len(toks.input_ids))  # only 4 tokens 💪🏻
+print(toks.input_ids)  # [55939, 124769, 117298, 199258] only 4 tokens 💪🏻
 ```
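The ID-preservation claim in item 3 is what makes direct embedding reuse possible: any token present in both vocabularies at the same index can keep its Gemma-3 embedding row. A minimal sketch of such a check, using hypothetical toy vocabularies rather than the real Gemma-3 / tereshchenkoblue vocab files (the function name `preserved_ids` is illustrative, not part of any library):

```python
# Sketch: find tokens whose IDs survive a vocab swap unchanged, so the
# matching embedding rows could be copied over directly. Toy data only.

def preserved_ids(old_vocab: dict, new_vocab: dict) -> dict:
    """Return {token: id} for tokens with an identical ID in both vocabs."""
    return {
        tok: idx
        for tok, idx in old_vocab.items()
        if new_vocab.get(tok) == idx
    }

# Toy example: one token slot is repurposed, the rest keep their IDs.
old = {"<bos>": 0, "hello": 1, "world": 2, "zzz": 3}
new = {"<bos>": 0, "hello": 1, "world": 2, "привіт": 3}  # "zzz" replaced

kept = preserved_ids(old, new)
print(sorted(kept))          # ['<bos>', 'hello', 'world']
print(len(old) == len(new))  # True: vocab size is unchanged
```

In the real tokenizer the same invariant holds at scale: only the replaced Cyrillic slots get fresh embeddings, everything else maps one-for-one.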