# Add Rust-backed fast tokenizer (54x speedup + bug fixes)

PR #2, opened by d3banjan
## Summary

- Add `tokenization_rwkv7_fast.py`: a HuggingFace-compatible wrapper around the Rust `rwkv-tokenizer` PyPI package
- Update `tokenizer_config.json` to load the fast tokenizer via `AutoTokenizer.from_pretrained`
- Document installation and benefits in the README
## Why

The current pure-Python TRIE tokenizer (`hf_rwkv_tokenizer.py`) has three issues:

- **54x slower** than the Rust implementation: a bottleneck for training and data preprocessing
- **Unpicklable**: nested TRIE objects exceed Python's recursion limit, crashing `datasets.map()` and `SFTTrainer` multiprocessing
- **Three bugs**:
  - Phantom token: `\n\n` mapped to id 65530 (outside the vocab range) instead of the correct id 261
  - Broken greedy match: `" \n\n"` split into `[" ", "\n\n"]` instead of matching vocab entry id 3336
  - Decode mojibake: Korean, emoji, and math symbols decode as `???` replacement characters
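The pickling failure is easy to reproduce with a toy nested structure (a sketch only: the class below is an illustrative stand-in, not the repo's actual TRIE):

```python
import pickle
import sys

# Illustrative only: a linked chain standing in for a deeply nested TRIE.
class Node:
    def __init__(self):
        self.child = None

root = Node()
node = root
for _ in range(sys.getrecursionlimit() + 100):  # nest deeper than the limit
    node.child = Node()
    node = node.child

try:
    pickle.dumps(root)  # pickle walks the chain recursively
    picklable = True
except RecursionError:
    picklable = False

print("picklable:", picklable)  # prints: picklable: False
```

Because `pickle` (and therefore `multiprocessing`) serializes nested objects recursively, any TRIE deep enough hits the same `RecursionError`.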
The Rust `rwkv-tokenizer` package implements the identical greedy longest-match TRIE algorithm and produces byte-for-byte identical encodings. 62/62 parity tests pass.
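For intuition, greedy longest-match encoding can be sketched like this (toy vocab and ids for illustration, not the real 65k-entry RWKV vocab; the toy assumes every input is covered by some entry):

```python
# Toy vocab: note the dedicated " \n\n" entry, mirroring the broken-greedy-match
# case above (ids here are made up).
vocab = {" ": 1, "\n": 2, "\n\n": 3, " \n\n": 4}

def greedy_encode(text, vocab):
    """At each cursor position, take the longest vocab entry that matches."""
    ids, i = [], 0
    while i < len(text):
        best = None
        for tok, tid in vocab.items():
            if text.startswith(tok, i) and (best is None or len(tok) > len(best[0])):
                best = (tok, tid)
        ids.append(best[1])
        i += len(best[0])
    return ids

print(greedy_encode(" \n\n", vocab))  # -> [4], not [1, 3]
```

The broken Python implementation stopped at the shorter `" "` match, producing `[1, 3]`-style splits instead of the single longest-match token.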
## Usage

```shell
pip install rwkv-tokenizer
```

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "RWKV/RWKV7-Goose-World3-2.9B-HF",
    trust_remote_code=True,
)
# Automatically uses the fast Rust tokenizer if installed
```
Falls back gracefully to the existing Python tokenizer if `rwkv-tokenizer` is not installed.
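A fallback of this shape typically probes for the Rust package at import time (a sketch; the module name is assumed from the PyPI package name, and the actual wrapper in `tokenization_rwkv7_fast.py` may differ):

```python
import importlib.util

# Check whether the Rust-backed package is importable without importing it.
# "rwkv_tokenizer" is assumed to be the module name of the rwkv-tokenizer package.
HAS_RUST_TOKENIZER = importlib.util.find_spec("rwkv_tokenizer") is not None

if HAS_RUST_TOKENIZER:
    print("using Rust rwkv-tokenizer backend")
else:
    print("falling back to pure-Python TRIE tokenizer")
```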
## Test plan

- `AutoTokenizer.from_pretrained` loads `RwkvTokenizerFast` when `rwkv-tokenizer` is installed
- Falls back to `RwkvTokenizer` when `rwkv-tokenizer` is not installed
- Encode/decode parity on ASCII, Unicode, code, and ChatML formats
- Pickle/unpickle roundtrip works (for multiprocessing)
Test suite for this PR (and the companion bug fix PR #3):
https://gist.github.com/d3banjan/5f5b77a652072a35ccc3b19f4d86d414
It covers encode parity (29 cases), decode roundtrip, vocab size, special tokens, pickle/unpickle, and a benchmark.
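The benchmark in the gist follows the usual micro-benchmark shape, roughly like this (sketch only: `encode` below is a trivial stand-in, not a real tokenizer, so its timing says nothing about the 54x figure above):

```python
import timeit

def encode(text):
    # Stand-in for tokenizer.encode(); swap in the real tokenizer to benchmark it.
    return [ord(c) for c in text]

sample = "Hello world\n" * 1000
seconds = timeit.timeit(lambda: encode(sample), number=100)
print(f"{seconds / 100 * 1e6:.1f} microseconds per encode call")
```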