AksaraLLM 20B Tokenizer

Byte-level BPE tokenizer for the AksaraLLM 20B pre-training run.

  • Vocab size: 131,072
  • Algorithm: Byte-level BPE (GPT-2 / LLaMA-3 style)
  • Training corpus: ~12 GB balanced sample (English web / Indonesian web / Indonesian Wikipedia / Malay / Javanese / Sundanese) from FineWeb, FineWeb-2, CulturaX, and Wikipedia
  • Produced by: scripts/train_tokenizer_20b.py in the AksaraLLM repo
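
The actual training recipe lives in scripts/train_tokenizer_20b.py; the sketch below is only a minimal, hypothetical recreation using the Hugging Face tokenizers library. The file paths, the reserved-token numbering, and the vocab-size bookkeeping are assumptions for illustration, not the repo's real settings.

# Minimal sketch of a byte-level BPE training run with the `tokenizers` library.
# Paths and exact settings are illustrative, not the repo's actual script.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool|>",
    "<|im_start|>", "<|im_end|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "<|endoftext|>",
]
# Hypothetical reserved tail; keeps the total vocab at 131,072.
RESERVED = [f"<|reserved_{i}|>" for i in range(256)]

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=131_072 - len(RESERVED),     # leave room for the reserved tail
    special_tokens=SPECIAL_TOKENS,          # pinned to IDs 0-13 in list order
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder for the ~12 GB balanced sample described above.
corpus_files = ["data/balanced_sample/part-000.txt"]
tokenizer.train(corpus_files, trainer)
tokenizer.add_special_tokens(RESERVED)      # append <|reserved_N|> at the end
tokenizer.save("aksara-tokenizer-20b.json")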

Special tokens (pinned IDs)

The first 14 IDs are reserved for named special tokens, in this order:

ID  Token
 0  <|pad|>
 1  <|bos|>
 2  <|eos|>
 3  <|unk|>
 4  <|system|>
 5  <|user|>
 6  <|assistant|>
 7  <|tool|>
 8  <|im_start|>
 9  <|im_end|>
10  <|fim_prefix|>
11  <|fim_middle|>
12  <|fim_suffix|>
13  <|endoftext|>

The last 256 IDs (130816–131071) are reserved as <|reserved_N|> for future expansion without breaking already-pretrained checkpoints.
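
If the pinned IDs matter downstream (packing code, chat templates, FIM formatting), it is cheap to assert them at load time. A small check, assuming the tokenizer loads through transformers as in the Usage section and that the reserved tail is numbered from <|reserved_0|>:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Assert the pinned special-token IDs before wiring the tokenizer into a pipeline.
expected = {"<|pad|>": 0, "<|bos|>": 1, "<|eos|>": 2, "<|unk|>": 3,
            "<|im_start|>": 8, "<|im_end|>": 9, "<|endoftext|>": 13}
for token, token_id in expected.items():
    assert tok.convert_tokens_to_ids(token) == token_id, (token, token_id)

# Reserved tail at the very end of the vocab (assuming numbering starts at 0).
assert tok.convert_tokens_to_ids("<|reserved_0|>") == 130816
assert len(tok) == 131_072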

Fertility (tokens per whitespace-word)

Measured on ~200 KB held-out samples from each language:

Language                   Fertility  Target
English web                1.280      ≤ 1.40
Indonesian wiki            1.357      ≤ 1.60
Indonesian web (CulturaX)  1.215      ≤ 1.60
Malay wiki                 1.368      ≤ 1.60
Javanese wiki              1.657      ≤ 1.80
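
Fertility here is the plain ratio of tokenizer tokens to whitespace-separated words. A quick way to reproduce the measurement on your own held-out sample (the file path is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(text: str) -> float:
    """Tokens per whitespace-word, with special tokens excluded."""
    n_tokens = len(tok(text, add_special_tokens=False).input_ids)
    n_words = len(text.split())
    return n_tokens / max(n_words, 1)

# Placeholder path: any ~200 KB held-out sample in the target language.
with open("samples/indonesian_wiki.txt", encoding="utf-8") as f:
    print(f"fertility = {fertility(f.read()):.3f}")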

Usage

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
ids = tok("Halo dunia, saya AksaraLLM.", add_special_tokens=False).input_ids
# → 8 tokens

License

Apache-2.0.
