Add Rust-backed fast tokenizer (54x speedup + bug fixes)

#2

Summary

  • Add tokenization_rwkv7_fast.py: HuggingFace-compatible wrapper around the Rust rwkv-tokenizer PyPI package
  • Update tokenizer_config.json to load the fast tokenizer via AutoTokenizer.from_pretrained
  • Document installation and benefits in README

Why

The current pure-Python TRIE tokenizer (hf_rwkv_tokenizer.py) has three issues:

  1. 54x slower than the Rust implementation — a bottleneck for training and data preprocessing
  2. Unpicklable — nested TRIE objects exceed Python's recursion limit, crashing datasets.map() and SFTTrainer multiprocessing
  3. Three bugs:
    • Phantom token: \n\n mapped to id 65530 (outside the vocab range) instead of the correct id 261
    • Broken greedy match: " \n\n" split into [" ", "\n\n"] instead of matching vocab entry id 3336
    • Decode mojibake: Korean, emoji, math symbols decode as ??? replacement characters
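The unpicklability in item 2 can be reproduced with any deeply nested structure — pickle recurses once per nesting level, so a TRIE whose nodes nest deeper than the interpreter's recursion guard fails to serialize. A minimal illustration (a plain nested list stands in for the TRIE nodes):

```python
import pickle

# Illustration: pickle recurses once per nesting level, so a structure
# nested deeper than the interpreter's recursion guard raises
# RecursionError instead of serializing.
def nested(depth):
    node = None
    for _ in range(depth):
        node = [node]  # stand-in for a TRIE node holding a child
    return node

try:
    pickle.dumps(nested(50_000))
    print("pickled fine")
except RecursionError:
    print("RecursionError: nesting is too deep to pickle")
```

This is exactly what happens when datasets.map() or SFTTrainer tries to ship the tokenizer to worker processes.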

The Rust rwkv-tokenizer package implements the same greedy-longest-match TRIE algorithm and produces byte-for-byte identical encodings. 62/62 parity tests pass.
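For reference, greedy longest-match tokenization can be sketched as follows (illustrative only — the toy vocab and function name are hypothetical, not the rwkv-tokenizer API; real implementations use a TRIE instead of this linear scan):

```python
# Greedy longest-match tokenization over a byte-level vocab.
# At each position, the longest vocab entry that matches wins.
def greedy_encode(text: bytes, vocab: dict[bytes, int]) -> list[int]:
    max_len = max(len(tok) for tok in vocab)
    ids, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no vocab entry covers byte {text[i:i+1]!r}")
    return ids

# Toy vocab (hypothetical ids): " \n\n" must win over " " + "\n\n".
toy_vocab = {b" ": 1, b"\n": 2, b"\n\n": 3, b" \n\n": 4, b"a": 5}
print(greedy_encode(b"a \n\n", toy_vocab))  # longest match wins: [5, 4]
```

The broken-greedy-match bug above corresponds to an encoder taking the shorter " " match instead of the longer " \n\n" entry.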

Usage

pip install rwkv-tokenizer

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "RWKV/RWKV7-Goose-World3-2.9B-HF",
    trust_remote_code=True
)
# Automatically uses the fast Rust tokenizer if installed

Falls back gracefully to the existing Python tokenizer if rwkv-tokenizer is not installed.
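The fallback follows the standard optional-dependency import pattern — a hedged sketch (the function name is illustrative, not the actual code in tokenization_rwkv7_fast.py):

```python
# Pick the tokenizer backend based on whether the Rust package is importable.
def pick_tokenizer_backend() -> str:
    try:
        import rwkv_tokenizer  # noqa: F401  # Rust-backed PyPI package
        return "rust"
    except ImportError:
        return "python"  # fall back to the pure-Python TRIE tokenizer

print(pick_tokenizer_backend())
```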

Test plan

  • AutoTokenizer.from_pretrained loads RwkvTokenizerFast when rwkv-tokenizer is installed
  • Falls back to RwkvTokenizer when rwkv-tokenizer is not installed
  • Encode/decode parity on ASCII, Unicode, code, ChatML formats
  • Pickle/unpickle roundtrip works (for multiprocessing)
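The pickle roundtrip check follows this pattern (illustrative — the real test pickles the loaded tokenizer; a plain dict stands in here):

```python
import pickle

# Serialize and deserialize, then verify the clone equals the original.
def pickle_roundtrips(obj) -> bool:
    return pickle.loads(pickle.dumps(obj)) == obj

print(pickle_roundtrips({"name": "toy", "ids": [1, 2, 3]}))
```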

Test suite for this PR (and the companion bug fix PR #3):
https://gist.github.com/d3banjan/5f5b77a652072a35ccc3b19f4d86d414

Covers encode parity (29 cases), decode roundtrip, vocab size, special tokens, pickle/unpickle, and benchmark.

Companion PR: #3 fixes the same three bugs (phantom token, broken greedy match, decode mojibake) in the existing pure-Python tokenizer (hf_rwkv_tokenizer.py). The two PRs are independent and can be merged separately.
