Fix 3 bugs in Python tokenizer: phantom token, broken greedy match, decode mojibake
## Summary
Fixes three bugs in hf_rwkv_tokenizer.py (the pure-Python TRIE tokenizer).
## Bugs fixed
### 1. Phantom token: `\n\n` mapped to the wrong id
`eos_token="\n\n"` was registered as an `AddedToken` at id 65530 (outside the vocab range), shadowing the correct vocab entry at id 261.
Before: `encode("\n\n")` → `[65530]`
After: `encode("\n\n")` → `[261]`
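A toy sketch (not the real HF internals; `base_vocab` and `added_tokens` are illustrative stand-ins) of how an `AddedToken` registered at a phantom id shadows the base-vocab entry:

```python
# Illustrative only: AddedTokens are consulted before the base vocab,
# so a phantom registration wins over the correct vocab entry.
base_vocab = {"\n\n": 261}      # correct id from the vocab file
added_tokens = {"\n\n": 65530}  # phantom id registered as an AddedToken

def lookup(token):
    # Added-token table takes priority over the base vocab
    if token in added_tokens:
        return added_tokens[token]
    return base_vocab[token]

print(lookup("\n\n"))  # 65530 before the fix

del added_tokens["\n\n"]  # the fix: drop the duplicate registration
print(lookup("\n\n"))  # 261, the correct vocab id
```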
### 2. Broken greedy match on tokens containing `\n\n`
Because `"\n\n"` was an `AddedToken`, HuggingFace extracted it before the TRIE ran, breaking greedy longest-match. The vocab entry `" \n\n"` (id 3336) was never matched.
Before: `encode(" \n\n")` → `[33, 65530]` (space + phantom)
After: `encode(" \n\n")` → `[3336]` (single correct token)
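A minimal greedy longest-match encoder over a toy three-entry vocab (ids taken from the examples above; the function name and linear scan are illustrative, not the real TRIE code) shows why pre-extracting `"\n\n"` breaks the match:

```python
# Toy vocab: space, "\n\n", and the longer " \n\n" entry that should win.
TOY_VOCAB = {" ": 33, "\n\n": 261, " \n\n": 3336}

def greedy_encode(text, vocab):
    """Greedy longest-match: at each position, take the longest vocab entry."""
    ids, i = [], 0
    while i < len(text):
        best = None
        for tok, tid in vocab.items():
            if text.startswith(tok, i) and (best is None or len(tok) > len(best[0])):
                best = (tok, tid)
        # Toy code: assumes every position matches some vocab entry.
        tok, tid = best
        ids.append(tid)
        i += len(tok)
    return ids

print(greedy_encode(" \n\n", TOY_VOCAB))  # [3336]: single longest token
# If "\n\n" is split out before the TRIE runs, it only ever sees fragments:
print([greedy_encode(p, TOY_VOCAB) for p in (" ", "\n\n")])  # [[33], [261]]
```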
### 3. Decode mojibake on multi-byte UTF-8
`_convert_id_to_token` used `errors="replace"`, which turns valid partial UTF-8 byte sequences into U+FFFD (�). Korean, emoji, and math symbols all decoded as replacement characters.
Before: `decode(encode("안녕하세요"))` → `"����������"`
After: `decode(encode("안녕하세요"))` → `"안녕하세요"`
## Changes
- Remove `"\n\n"` from the `AddedToken` registry when it already exists in the base vocab, and rebuild the internal splitting trie
- Fix `_convert_token_to_id` to handle string lookups (needed for `eos_token_id`)
- Use `errors="surrogateescape"` instead of `errors="replace"` in `_convert_id_to_token` and `convert_tokens_to_string` so partial UTF-8 survives the token → string → bytes roundtrip
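The string-lookup fix can be sketched as follows: the TRIE vocab keys tokens by raw bytes, but HF resolves `eos_token_id` by passing the token as a `str`. Names here (`toy_vocab`, `convert_token_to_id`'s signature) are illustrative, not the actual code in `hf_rwkv_tokenizer.py`:

```python
# Toy bytes-keyed vocab standing in for the real TRIE vocab.
toy_vocab = {b"\n\n": 261, b" \n\n": 3336}

def convert_token_to_id(token, vocab=toy_vocab, unk_id=0):
    # Accept str lookups (needed when HF resolves eos_token_id from
    # the eos_token string) by encoding down to the bytes key space.
    if isinstance(token, str):
        token = token.encode("utf-8")
    return vocab.get(token, unk_id)

print(convert_token_to_id("\n\n"))   # 261, resolved from a str
print(convert_token_to_id(b"\n\n"))  # 261, bytes lookups still work
```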
## Test plan
- `"\n\n"` encodes to id 261 (not 65530)
- `" \n\n"` encodes to the single token 3336 (not split)
- Korean, emoji, and math symbols roundtrip correctly
- `eos_token_id` resolves to 261
Test suite covering all 3 bug fixes (also covers fast tokenizer parity from PR #2):
https://gist.github.com/d3banjan/5f5b77a652072a35ccc3b19f4d86d414
Bug fix tests: the `TestBugFixes` class covers the phantom token, the greedy match, decode mojibake, and `eos_token_id` resolution.
## Cross-check against paper and upstream code
Verified that none of the three bugs are intentional design choices:
- **Paper (arXiv:2503.14456)**: no mention of special tokenizer handling for `\n\n` or of EOS conventions at the tokenizer level. The World tokenizer is described as a pure greedy-longest-match TRIE.
- **Original RWKV tokenizer (BlinkDL/ChatRWKV)**: has zero special-token handling: no BOS, EOS, or AddedToken concepts. Token 0 is used as the document separator in training (`make_data.py` appends `0` after each document). `\n\n` is a chat-prompt turn delimiter, not a tokenizer-level EOS.
- **HF PR #26963 (original RWKV5 integration)**: RWKV team member PicoCreator confirmed that "the model was not trained with a special bos_token, or special pad_token in mind (we use 0, and masking for padding)". SmerkyG noted that `_convert_token_to_id`/`_convert_id_to_token` were "implemented kind of backwards", so the mojibake bug was a known concern.
- **Rust `rwkv-tokenizer` (cahya-wirawan/rwkv-tokenizer)**: faithfully reimplements the same greedy TRIE with no special handling for `\n\n`.
All three bugs are artifacts of the HuggingFace integration layer, not upstream RWKV design choices.