Update README.md
README.md CHANGED

@@ -72,7 +72,7 @@ tokenizer = AutoTokenizer.from_pretrained(
     "transhumanist-already-exists/tereshchenkoblue-tokenizer"
 )
 toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
-print(toks.input_ids)
+print(len(toks.input_ids))  # only 4 tokens 💪🏻
 ```
 
 
@@ -99,8 +99,6 @@ Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.
 
 - [tokenizer.json](tokenizer.json): Byte-level tokenizer spec (vocab, merges, model settings).
 
-- [tokenizer_utf8.json](tokenizer_utf8.json): Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
-
 - [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Gemma-3-style tokenizer trained on 3 mln Malyuk Ukrainian corpus plus Cyrillic QIRIM (3x oversampled).
 
 - [merge_info.json](merge_info.json): Lists the replaced Gemma-3 token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](malyuk_qirim_tokenizer.json).
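The file list above mentions that tokenizer.json stores a byte-level BPE spec as vocab plus merge rules. As a rough illustration of how such merge rules are applied (a minimal pure-Python sketch with toy merges; the real tokenizer's vocabulary and merge list are different and live in tokenizer.json):

```python
# Toy sketch of BPE merge application. The merges below are invented for
# illustration only; they are NOT the tokenizer's actual merge rules.

def apply_merges(symbols, merges):
    """Greedily apply merge rules in priority order, as BPE does."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Find the highest-priority (lowest-rank) adjacent pair.
        best = None
        for i in range(len(symbols) - 1):
            rank = ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            return symbols  # no mergeable pair left
        _, i = best
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

# Toy merge list: earlier entries have higher priority.
toy_merges = [("о", "п"), ("оп", "т"), ("опт", "и")]

print(apply_merges(list("опти"), toy_merges))
```

Each iteration merges the single highest-ranked adjacent pair, so frequent character sequences collapse into one token; a larger, well-trained merge list is what lets the real tokenizer cover a Ukrainian sentence in very few tokens.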