# Add Rust-backed fast tokenizer (54x speedup + bug fixes)

PR #2, opened by d3banjan
## Summary

- Add `tokenization_rwkv7_fast.py`: a HuggingFace-compatible wrapper around the Rust `rwkv-tokenizer` PyPI package
- Update `tokenizer_config.json` to load the fast tokenizer via `AutoTokenizer.from_pretrained`
- Document installation and benefits in the README
## Why

The current pure-Python TRIE tokenizer (`hf_rwkv_tokenizer.py`) has three issues:

- **54x slower** than the Rust implementation: a bottleneck for training and data preprocessing
- **Unpicklable**: nested TRIE objects exceed Python's recursion limit, crashing `datasets.map()` and `SFTTrainer` multiprocessing
- **Three bugs**:
  - Phantom token: `\n\n` mapped to id 65530 (outside the vocab range) instead of the correct id 261
  - Broken greedy match: `" \n\n"` split into `[" ", "\n\n"]` instead of matching vocab entry id 3336
  - Decode mojibake: Korean, emoji, and math symbols decode as `???` replacement characters
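The pickling failure is easy to reproduce with a toy nested structure (a sketch only: the class below is an illustrative stand-in, not the repo's actual TRIE):

```python
import pickle
import sys

# Illustrative only: a linked chain standing in for a deeply nested TRIE.
class Node:
    def __init__(self):
        self.child = None

root = Node()
node = root
for _ in range(sys.getrecursionlimit() + 100):  # nest deeper than the limit
    node.child = Node()
    node = node.child

try:
    pickle.dumps(root)  # pickle walks the chain recursively
    picklable = True
except RecursionError:
    picklable = False

print("picklable:", picklable)  # prints: picklable: False
```

Because `pickle` (and therefore `multiprocessing`) serializes nested objects recursively, any TRIE deep enough hits the same `RecursionError`.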
The Rust `rwkv-tokenizer` package implements the identical greedy longest-match TRIE algorithm and produces byte-for-byte identical encodings. 62/62 parity tests pass.
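For intuition, greedy longest-match encoding can be sketched like this (toy vocab and ids for illustration, not the real 65k-entry RWKV vocab; the toy assumes every input is covered by some entry):

```python
# Toy vocab: note the dedicated " \n\n" entry, mirroring the broken-greedy-match
# case above (ids here are made up).
vocab = {" ": 1, "\n": 2, "\n\n": 3, " \n\n": 4}

def greedy_encode(text, vocab):
    """At each cursor position, take the longest vocab entry that matches."""
    ids, i = [], 0
    while i < len(text):
        best = None
        for tok, tid in vocab.items():
            if text.startswith(tok, i) and (best is None or len(tok) > len(best[0])):
                best = (tok, tid)
        ids.append(best[1])
        i += len(best[0])
    return ids

print(greedy_encode(" \n\n", vocab))  # -> [4], not [1, 3]
```

The broken Python implementation stopped at the shorter `" "` match, producing `[1, 3]`-style splits instead of the single longest-match token.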
## Usage

```shell
pip install rwkv-tokenizer
```

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "RWKV/RWKV7-Goose-World3-2.9B-HF",
    trust_remote_code=True,
)
# Automatically uses the fast Rust tokenizer if installed
```
Falls back gracefully to the existing Python tokenizer if `rwkv-tokenizer` is not installed.
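A fallback of this shape typically probes for the Rust package at import time (a sketch; the module name is assumed from the PyPI package name, and the actual wrapper in `tokenization_rwkv7_fast.py` may differ):

```python
import importlib.util

# Check whether the Rust-backed package is importable without importing it.
# "rwkv_tokenizer" is assumed to be the module name of the rwkv-tokenizer package.
HAS_RUST_TOKENIZER = importlib.util.find_spec("rwkv_tokenizer") is not None

if HAS_RUST_TOKENIZER:
    print("using Rust rwkv-tokenizer backend")
else:
    print("falling back to pure-Python TRIE tokenizer")
```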
## Test plan

- `AutoTokenizer.from_pretrained` loads `RwkvTokenizerFast` when `rwkv-tokenizer` is installed
- Falls back to `RwkvTokenizer` when `rwkv-tokenizer` is not installed
- Encode/decode parity on ASCII, Unicode, code, and ChatML formats
- Pickle/unpickle roundtrip works (for multiprocessing)
Test suite for this PR (and the companion bug fix PR #3):
https://gist.github.com/d3banjan/5f5b77a652072a35ccc3b19f4d86d414
It covers encode parity (29 cases), decode roundtrip, vocab size, special tokens, pickle/unpickle, and a benchmark.
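The benchmark in the gist follows the usual micro-benchmark shape, roughly like this (sketch only: `encode` below is a trivial stand-in, not a real tokenizer, so its timing says nothing about the 54x figure above):

```python
import timeit

def encode(text):
    # Stand-in for tokenizer.encode(); swap in the real tokenizer to benchmark it.
    return [ord(c) for c in text]

sample = "Hello world\n" * 1000
seconds = timeit.timeit(lambda: encode(sample), number=100)
print(f"{seconds / 100 * 1e6:.1f} microseconds per encode call")
```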