AksaraLLM 20B Tokenizer

Byte-level BPE tokenizer for the AksaraLLM 20B pre-training run.

  • Vocab size: 131,072
  • Algorithm: Byte-level BPE (GPT-2 / LLaMA-3 style)
  • Training corpus: ~12 GB balanced sample (English web / Indonesian web / Indonesian Wikipedia / Malay / Javanese / Sundanese) from FineWeb, FineWeb-2, CulturaX, and Wikipedia
  • Produced by: scripts/train_tokenizer_20b.py in the AksaraLLM repo
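
The actual training recipe lives in scripts/train_tokenizer_20b.py; the sketch below is only a minimal, hypothetical recreation using the Hugging Face tokenizers library. The file paths, the reserved-token numbering, and the vocab-size bookkeeping are assumptions for illustration, not the repo's real settings.

# Minimal sketch of a byte-level BPE training run with the `tokenizers` library.
# Paths and exact settings are illustrative, not the repo's actual script.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool|>",
    "<|im_start|>", "<|im_end|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "<|endoftext|>",
]
# Hypothetical reserved tail; keeps the total vocab at 131,072.
RESERVED = [f"<|reserved_{i}|>" for i in range(256)]

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=131_072 - len(RESERVED),     # leave room for the reserved tail
    special_tokens=SPECIAL_TOKENS,          # pinned to IDs 0-13 in list order
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder for the ~12 GB balanced sample described above.
corpus_files = ["data/balanced_sample/part-000.txt"]
tokenizer.train(corpus_files, trainer)
tokenizer.add_special_tokens(RESERVED)      # append <|reserved_N|> at the end
tokenizer.save("aksara-tokenizer-20b.json")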

Special tokens (pinned IDs)

The first 14 IDs are reserved for named special tokens, in this order:

ID  Token
 0  <|pad|>
 1  <|bos|>
 2  <|eos|>
 3  <|unk|>
 4  <|system|>
 5  <|user|>
 6  <|assistant|>
 7  <|tool|>
 8  <|im_start|>
 9  <|im_end|>
10  <|fim_prefix|>
11  <|fim_middle|>
12  <|fim_suffix|>
13  <|endoftext|>

The last 256 IDs (130816–131071) are reserved as <|reserved_N|> for future expansion without breaking already-pretrained checkpoints.
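
If the pinned IDs matter downstream (packing code, chat templates, FIM formatting), it is cheap to assert them at load time. A small check, assuming the tokenizer loads through transformers as in the Usage section and that the reserved tail is numbered from <|reserved_0|>:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Assert the pinned special-token IDs before wiring the tokenizer into a pipeline.
expected = {"<|pad|>": 0, "<|bos|>": 1, "<|eos|>": 2, "<|unk|>": 3,
            "<|im_start|>": 8, "<|im_end|>": 9, "<|endoftext|>": 13}
for token, token_id in expected.items():
    assert tok.convert_tokens_to_ids(token) == token_id, (token, token_id)

# Reserved tail at the very end of the vocab (assuming numbering starts at 0).
assert tok.convert_tokens_to_ids("<|reserved_0|>") == 130816
assert len(tok) == 131_072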

Fertility (tokens per whitespace-word)

Measured on ~200 KB held-out samples from each language:

Language                   Fertility  Target
English web                1.280      ≤ 1.40
Indonesian wiki            1.357      ≤ 1.60
Indonesian web (CulturaX)  1.215      ≤ 1.60
Malay wiki                 1.368      ≤ 1.60
Javanese wiki              1.657      ≤ 1.80
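
Fertility here is the plain ratio of tokenizer tokens to whitespace-separated words. A quick way to reproduce the measurement on your own held-out sample (the file path is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(text: str) -> float:
    """Tokens per whitespace-word, with special tokens excluded."""
    n_tokens = len(tok(text, add_special_tokens=False).input_ids)
    n_words = len(text.split())
    return n_tokens / max(n_words, 1)

# Placeholder path: any ~200 KB held-out sample in the target language.
with open("samples/indonesian_wiki.txt", encoding="utf-8") as f:
    print(f"fertility = {fertility(f.read()):.3f}")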

Usage

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
ids = tok("Halo dunia, saya AksaraLLM.", add_special_tokens=False).input_ids
# → 8 tokens

License

Apache-2.0.
