transhumanist-already-exists committed
Commit 89d33e3 (verified) · 1 parent: 1c9d862

Update README.md

Files changed (1): README.md (+1, −3)
README.md CHANGED
@@ -72,7 +72,7 @@ tokenizer = AutoTokenizer.from_pretrained(
   "transhumanist-already-exists/tereshchenkoblue-tokenizer"
 )
 toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
-print(toks.input_ids)  # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻
+print(len(toks.input_ids))  # only 4 tokens 💪🏻
 ```
 
 
@@ -99,8 +99,6 @@ Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.
 
 - [tokenizer.json](tokenizer.json): Byte-level tokenizer spec (vocab, merges, model settings).
 
-- [tokenizer_utf8.json](tokenizer_utf8.json): Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
-
 - [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Gemma-3-style tokenizer trained on 3 mln Malyuk Ukrainian corpus plus Cyrillic QIRIM (3x oversampled).
 
 - [merge_info.json](merge_info.json): Lists the replaced Gemma-3 token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](malyuk_qirim_tokenizer.json).
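The README describes merge_info.json only loosely (replaced Gemma-3 token IDs plus added Malyuk token IDs), so here is a minimal, self-contained sketch of reading such a file with the standard `json` module. The key names (`replaced_gemma3_token_ids`, `added_malyuk_token_ids`) and the sample IDs are illustrative assumptions, not the file's actual schema — check the real merge_info.json before relying on them.

```python
import json

# Hypothetical structure for merge_info.json; the actual schema is not
# documented in this README, so these keys are assumptions for illustration.
sample_merge_info = {
    "replaced_gemma3_token_ids": [123, 456, 789],
    "added_malyuk_token_ids": [262144, 262145, 262146],
}

# Round-trip through JSON, as one would when reading the real file
# (replace json.loads(blob) with json.load(open("merge_info.json"))).
blob = json.dumps(sample_merge_info)
info = json.loads(blob)

replaced = info["replaced_gemma3_token_ids"]
added = info["added_malyuk_token_ids"]
print(f"replaced {len(replaced)} Gemma-3 tokens, added {len(added)} Malyuk tokens")
```

A quick count like this is a cheap sanity check that the number of replaced slots matches the number of added Malyuk tokens before swapping embeddings.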