Upload folder using huggingface_hub
- README.md +138 -0
- merges.txt +0 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,138 @@
---
language:
- en
- es
- fr
- de
library_name: tokenizers
license: cc-by-4.0
tags:
- kl3m
- kl3m-004
- alea
- legal
- financial
date: '2024-12-30T00:00:00.000Z'
---

# kl3m-004-char-16k-cased

The `kl3m-004-char-16k-cased` **case-sensitive** tokenizer is a domain-specific **character-based** tokenizer trained
on a stratified sample of nearly 2M documents across general, legal, and financial domains from the `kl3m-data` project,
including American English, British English, Spanish, German, French, Italian, and other common EU languages.

This tokenizer uses the standard Byte-Pair Encoding (BPE) tokenizer from `tokenizers`/`transformers`, but modifies the
training process to restrict the vocabulary to tokens that are at most 4 characters long. Models trained with this tokenizer
should be able to handle a number of use cases that are difficult with standard tokenizers, such as
low-resource spell-checking, OCR correction, whitespace normalization, and other tasks that require a high degree of
character-level granularity.

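To make the length restriction concrete, here is a minimal pure-Python sketch of BPE training that skips any merge whose result would exceed a character cap, mirroring the `--max_chars 4` setting used for this tokenizer. It is a toy illustration under stated assumptions, not the training code behind this repository: the function `train_capped_bpe` and its parameters are hypothetical, and steps such as pre-tokenization and minimum-frequency filtering are omitted.

```python
from collections import Counter


def train_capped_bpe(words, num_merges, max_chars):
    """Toy BPE trainer (hypothetical helper) that never creates a token longer than max_chars."""
    # Every word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    vocab = {ch for word in words for ch in word}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, ignoring pairs whose merge would break the cap.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                if len(left) + len(right) <= max_chars:
                    pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # no admissible merge remains
        # Pick the most frequent pair; ties are broken lexicographically for determinism.
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, vocab


if __name__ == "__main__":
    sample = ["token", "tokens", "tokenizer", "tokenize"]
    merges, vocab = train_capped_bpe(sample, num_merges=50, max_chars=4)
    print(sorted(vocab, key=len))
```

Because the cap is applied when pairs are counted, an over-long candidate is never considered, so training simply stops early once every remaining pair would exceed the limit.
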
## Model Details

### Summary

- **Vocabulary:** 16,384 tokens
- **Tokenizer type:** BPE with 1-4 character tokens
- **Special token support:** Both causal and masked language modeling
- **Language(s) (NLP):** Primarily English, Spanish, German, French, with a small percentage of other EU languages.
- **Data sources:** See [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository.
- **Developed by:** [ALEA Institute](https://aleainstitute.ai).
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

For more information about the `kl3m-004` tokenizers, see the [kl3m-004-128k-cased tokenizer](https://huggingface.co/alea-institute/kl3m-004-128k-cased).

#### Special Tokens for both Embedding and Generative Models

For both training and inference efficiency, we intended this tokenizer vocabulary to be
usable for both embedding and generative models. As such, we included special tokens
suitable for both causal and masked language modeling tasks.

* `<|start|>`: `0`
* `<|end|>`: `1`
* `<|pad|>`: `2`
* `<|unk|>`: `3`
* `<|sep|>`: `4`
* `<|cls|>`: `5`
* `<|mask|>`: `6`

We also added a number of chat and instruction tokens that were not included in `kl3m-001-32k`, including:

* `<|system|>`: `7`
* `</|system|>`: `8`
* `<|user|>`: `9`
* `</|user|>`: `10`
* `<|instruction|>`: `11`
* `</|instruction|>`: `12`

These tokens are identical to those used in the `kl3m-003-64k` tokenizer.

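The token IDs listed above can be captured in a small lookup table. The sketch below copies the IDs from the two lists and adds a hypothetical `wrap` helper that surrounds text with a role's opening and closing tokens. The prompt layout produced by `build_prompt` is an assumption made for illustration only; this README does not define a chat template.

```python
# Special-token IDs as listed in this README.
SPECIAL_TOKENS = {
    "<|start|>": 0,
    "<|end|>": 1,
    "<|pad|>": 2,
    "<|unk|>": 3,
    "<|sep|>": 4,
    "<|cls|>": 5,
    "<|mask|>": 6,
    "<|system|>": 7,
    "</|system|>": 8,
    "<|user|>": 9,
    "</|user|>": 10,
    "<|instruction|>": 11,
    "</|instruction|>": 12,
}


def wrap(role, text):
    """Surround text with the paired tokens for a role (hypothetical helper)."""
    open_token, close_token = f"<|{role}|>", f"</|{role}|>"
    if open_token not in SPECIAL_TOKENS or close_token not in SPECIAL_TOKENS:
        raise ValueError(f"no paired special tokens for role {role!r}")
    return f"{open_token}{text}{close_token}"


def build_prompt(system, user):
    """Assemble a prompt string; this layout is an assumption, not a documented template."""
    return "<|start|>" + wrap("system", system) + wrap("user", user)


if __name__ == "__main__":
    print(build_prompt("You are a careful proofreader.", "Fix the OCR errors in this page."))
```

Only `system`, `user`, and `instruction` have closing tokens, so `wrap` rejects roles such as `mask` that exist only as single markers.
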
### Replication

The entire data collection and preprocessing pipeline is being made available, along with
training data, as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

The source code used to train the tokenizer is available on GitHub at:
[https://github.com/alea-institute/kl3m-embedding-research](https://github.com/alea-institute/kl3m-embedding-research)

The data pipeline will be available on GitHub and S3 in the near future.

This specific tokenizer was trained using the following command:

```bash
PYTHONPATH=. poetry run python3 \
  kl3m_tokenizers/tokenizers/kl3m_004/train_char_tokenizer.py \
  --min_frequency 1000 \
  --vocab_size 16384 \
  --pad2 \
  --max_chars 4 \
  sample.20241223173012.jsonl.gz \
  ./kl3m-004-char-16k-cased/
```

```text
Training tokenizer.
[00:33:12] Pre-processing sequences       ████████████████████████████████████████ 1849344  /        0
[00:33:32] Pre-processing sequences       ████████████████████████████████████████ 0        /        0
[00:00:21] Tokenize words                 ████████████████████████████████████████ 20286360 / 20286360
[00:01:01] Count pairs                    ████████████████████████████████████████ 20286360 / 20286360
[00:12:39] Compute merges                 ████████████████████████████████████████ 16036    /    16036
Adding power-of-2 padding tokens.
Padded vocab to 16384 tokens.
Special tokens: 13
Power-of-2 pad tokens: 13
Final vocab size: 16384
Training time: 2863.67 seconds
Output path: kl3m-004-char-16k-cased
```

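The padding figures in the log can be checked with a little arithmetic. The sketch below assumes the vocabulary held 16,371 entries before padding, a value inferred from the log as 16,384 minus the 13 reported pad tokens, and that padding rounds up to the next power of two. The helper names are hypothetical and the training script's actual logic may differ.

```python
def next_power_of_two(n):
    """Return the smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def pad_tokens_needed(current_size):
    """Number of filler tokens required to reach the next power of two."""
    return next_power_of_two(current_size) - current_size


if __name__ == "__main__":
    # Assumed pre-padding size: 16384 final tokens minus 13 pad tokens from the log.
    size_before_padding = 16384 - 13
    print(next_power_of_two(size_before_padding), pad_tokens_needed(size_before_padding))
```

A vocabulary that is already a power of two needs no padding, so `pad_tokens_needed(16384)` is zero.
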
### Uses
This tokenizer is intended for English, Spanish, German, or French language tasks where
character-level details are important, such as OCR correction, spell-checking, or tasks where word boundaries
are not well-defined.

For a standard BPE "word" tokenizer with a larger vocabulary size, consider using the `kl3m-004-128k-cased` or
`kl3m-004-128k-uncased` tokenizers.

### Recommendations
The `kl3m-004-char-16k-cased` tokenizer may be particularly useful when character-level details are important and
resource constraints are less severe. For smaller vocabularies with better resource efficiency, consider using the
`kl3m-004-char-4k-cased` or `kl3m-004-char-8k-cased` tokenizers.

### How to Get Started with the Tokenizer
Use the code below to load the tokenizer.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-char-16k-cased')
```

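Once the tokenizer is loaded, a quick round trip helps confirm how text is split. The sketch below uses the `encode` and `decode` methods of `tokenizers.Tokenizer`; the helper names `load_tokenizer` and `round_trip` are hypothetical, and no expected output is shown because the exact token boundaries depend on the trained vocabulary.

```python
def load_tokenizer(name="alea-institute/kl3m-004-char-16k-cased"):
    """Download the tokenizer from the Hugging Face Hub (requires network access)."""
    # Imported lazily so round_trip() can be used without the dependency installed.
    from tokenizers import Tokenizer

    return Tokenizer.from_pretrained(name)


def round_trip(tokenizer, text):
    """Encode text, then decode the IDs again, returning all three views."""
    encoding = tokenizer.encode(text)
    return {
        "tokens": encoding.tokens,  # token strings as stored in the vocabulary
        "ids": encoding.ids,  # integer IDs fed to a model
        "decoded": tokenizer.decode(encoding.ids),  # text reconstructed from the IDs
    }
```

Calling `round_trip(load_tokenizer(), "Securities and Exchange Commission")` returns the token strings, their IDs, and the decoded text, which makes it easy to inspect how short the individual tokens are.
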
### Citation
Tokenizer and dataset publications are pending.

## Contact

For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [hello@aleainstitute.ai](mailto:hello@aleainstitute.ai) or
create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-embedding-research).


merges.txt
ADDED
The diff for this file is too large to render.
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "<|unk|>", "bos_token": "<|start|>", "eos_token": "<|end|>", "pad_token": "<|pad|>", "sep_token": "<|sep|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "add_prefix_space": false, "do_lower_case": false, "tokenizer_class": "PreTrainedTokenizerFast"}
vocab.json
ADDED
The diff for this file is too large to render.