# AksaraLLM 20B Tokenizer
Byte-level BPE tokenizer for the AksaraLLM 20B pre-training run.
- Vocab size: 131,072
- Algorithm: Byte-level BPE (GPT-2 / LLaMA-3 style)
- Training corpus: ~12 GB balanced sample (English web / Indonesian web / Indonesian Wikipedia / Malay / Javanese / Sundanese) from FineWeb, FineWeb-2, CulturaX, and Wikipedia
- Produced by: `scripts/train_tokenizer_20b.py` in the AksaraLLM repo
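
The training script itself lives in the repo. Purely as a sketch of the setup, byte-level BPE training with the Hugging Face `tokenizers` library looks like the following; the corpus path and trainer settings here are illustrative assumptions, not the script's actual values:

```python
# Sketch of byte-level BPE training with the Hugging Face `tokenizers` library.
# Corpus path and trainer settings are illustrative assumptions, not the
# values used by scripts/train_tokenizer_20b.py.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

special_tokens = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool|>",
    "<|im_start|>", "<|im_end|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|endoftext|>",
]
# Passing the specials to the trainer pins them to IDs 0-13; the reserved band
# at the top of the vocab would be appended after training.
trainer = trainers.BpeTrainer(
    vocab_size=131_072,
    special_tokens=special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # cover all 256 bytes
)
tokenizer.train(files=["balanced_sample.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("tokenizer.json")
```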
## Special tokens (pinned IDs)
The first 14 IDs are reserved for named special tokens, in this order:
| ID | Token |
|---|---|
| 0 | `<\|pad\|>` |
| 1 | `<\|bos\|>` |
| 2 | `<\|eos\|>` |
| 3 | `<\|unk\|>` |
| 4 | `<\|system\|>` |
| 5 | `<\|user\|>` |
| 6 | `<\|assistant\|>` |
| 7 | `<\|tool\|>` |
| 8 | `<\|im_start\|>` |
| 9 | `<\|im_end\|>` |
| 10 | `<\|fim_prefix\|>` |
| 11 | `<\|fim_middle\|>` |
| 12 | `<\|fim_suffix\|>` |
| 13 | `<\|endoftext\|>` |
The last 256 IDs (130816–131071) are reserved as `<|reserved_N|>` placeholders for future expansion without breaking already-pretrained checkpoints.
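
Because pre-trained checkpoints bake these IDs in, it is worth asserting them after loading. A minimal check; the `<|reserved_0|>` line assumes the reserved names are numbered from 0, which this card does not state explicitly:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Named specials are pinned to IDs 0-13 in table order.
assert tok.convert_tokens_to_ids("<|pad|>") == 0
assert tok.convert_tokens_to_ids("<|eos|>") == 2
assert tok.convert_tokens_to_ids("<|endoftext|>") == 13

# Reserved band occupies the top 256 IDs
# (assumption: numbering starts at <|reserved_0|>).
assert tok.convert_tokens_to_ids("<|reserved_0|>") == 130816
```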
## Fertility (tokens per whitespace-word)
Measured on ~200 KB held-out samples from each language:
| Language | Fertility | Target |
|---|---|---|
| English web | 1.280 | ≤ 1.40 |
| Indonesian wiki | 1.357 | ≤ 1.60 |
| Indonesian web (CulturaX) | 1.215 | ≤ 1.60 |
| Malay wiki | 1.368 | ≤ 1.60 |
| Javanese wiki | 1.657 | ≤ 1.80 |
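
The measurement script is not reproduced here, but the metric itself is simple: token count divided by whitespace-word count on raw text, with no special tokens added. A minimal sketch (the held-out file path is hypothetical):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(text: str) -> float:
    """Average tokens emitted per whitespace-separated word."""
    n_tokens = len(tok(text, add_special_tokens=False).input_ids)
    n_words = len(text.split())
    return n_tokens / max(n_words, 1)

# e.g. fertility(open("heldout/id_wiki_sample.txt").read())  # hypothetical held-out file
```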
## Usage
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
ids = tok("Halo dunia, saya AksaraLLM.", add_special_tokens=False).input_ids  # "Hello world, I am AksaraLLM."
# → 8 tokens
```
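
The FIM tokens in the table follow the fill-in-the-middle convention popularized by code models. Whether AksaraLLM uses prefix-suffix-middle (PSM) order during pre-training is not stated in this card, but under that common assumption a FIM prompt would be assembled like this (continuing from the snippet above):

```python
# PSM-order fill-in-the-middle prompt. Assumption: AksaraLLM's actual FIM
# ordering is not documented here; PSM is shown as one common convention.
prefix = "def luas_persegi(sisi):\n"   # "square area" function stub
suffix = "\n    return luas"
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
ids = tok(fim_prompt, add_special_tokens=False).input_ids
```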
## License
Apache-2.0.