Update README.md

README.md CHANGED

@@ -61,7 +61,7 @@ Roughly four-fifths of tokens in scripts geographically and culturally distant f

## Feature Overview:

-1. +81,
+1. +81,492 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on **3 million** texts from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
2. Only tokens from the `Replaced tokens` table were replaced; no tokens from any other writing system were affected.
3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
4. Vocab size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.

@@ -72,7 +72,7 @@ tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
-print(
+print(toks.input_ids)  # [55939, 124769, 117298, 199258] only 4 tokens 💪🏻
```
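The ID-preservation property behind points 2–3 can be sketched in isolation: only selected token *strings* change, every token *ID* stays put, so embedding rows for unchanged IDs transfer verbatim. Below is a minimal, self-contained toy sketch of that idea — the vocab, the replaced tokens, the vectors, and the zero-init for replaced rows are all illustrative stand-ins, not the repo's actual data or code.

```python
# Toy sketch of the ID-preserving token swap (illustrative data only).

# A tiny BPE-style vocab: token string -> token ID.
vocab = {"hello": 0, "tok_cjk_a": 1, "world": 2, "tok_cjk_b": 3}

# Swap only the selected strings for new Cyrillic tokens; IDs never move.
replacements = {"tok_cjk_a": "опти", "tok_cjk_b": "мізм"}  # old -> new string
new_vocab = {replacements.get(tok, tok): tok_id for tok, tok_id in vocab.items()}

# Unchanged tokens keep their IDs, so their embedding rows transfer verbatim;
# only the replaced IDs need fresh vectors (zero-init here purely for show --
# the README does not specify how the new rows are initialised).
dim = 3
embeddings = {tok_id: [tok_id + 0.5] * dim for tok_id in vocab.values()}
replaced_ids = {vocab[tok] for tok in replacements}
new_embeddings = {
    tok_id: ([0.0] * dim if tok_id in replaced_ids else row)
    for tok_id, row in embeddings.items()
}

assert new_vocab == {"hello": 0, "опти": 1, "world": 2, "мізм": 3}
assert new_embeddings[0] == embeddings[0]    # untouched row reused as-is
assert new_embeddings[1] == [0.0, 0.0, 0.0]  # replaced row re-initialised
```

This is why point 3 matters in practice: a model can load the stock Gemma-3 embedding matrix and only needs new vectors for the ~81k replaced IDs.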