Fix 3 bugs in Python tokenizer: phantom token, broken greedy match, decode mojibake

#3

Summary

Fixes three bugs in hf_rwkv_tokenizer.py (the pure-Python TRIE tokenizer).

Bugs fixed

1. Phantom token: \n\n mapped to wrong id

eos_token="\n\n" was registered as an AddedToken at id 65530 (outside the vocab range), shadowing the correct vocab entry at id 261.

Before: encode("\n\n") β†’ [65530]
After: encode("\n\n") β†’ [261]

2. Broken greedy match on tokens containing \n\n

Because "\n\n" was an AddedToken, HuggingFace extracted it before the TRIE ran, breaking greedy-longest-match. The vocab entry " \n\n" (id 3336) was never matched.

Before: encode(" \n\n") β†’ [33, 65530] (space + phantom)
After: encode(" \n\n") β†’ [3336] (single correct token)
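To illustrate why pre-extracting "\n\n" breaks greedy-longest-match, here is a minimal sketch of a byte-level TRIE tokenizer, not the actual hf_rwkv_tokenizer.py code; the three-entry vocab uses the ids quoted above (33 = " ", 261 = "\n\n", 3336 = " \n\n") and everything else is simplified:

```python
# Minimal greedy-longest-match byte TRIE (sketch, not the real RWKV vocab).
def build_trie(vocab):
    root = {}
    for token, tid in vocab.items():
        node = root
        for b in token.encode("utf-8"):
            node = node.setdefault(b, {})
        node[None] = tid  # end-of-token marker
    return root

def encode(text, trie):
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        node, best, j = trie, None, i
        while j < len(data) and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best = (node[None], j)  # remember longest match so far
        tid, i = best  # greedy: take the longest match
        ids.append(tid)
    return ids

vocab = {" ": 33, "\n\n": 261, " \n\n": 3336}
trie = build_trie(vocab)
print(encode(" \n\n", trie))  # [3336] when the TRIE sees the whole input
```

When the TRIE sees the full input, it matches the longer " \n\n" entry in one step. If an AddedToken pre-splitter carves out "\n\n" first, the TRIE only ever sees " " and "\n\n" separately and can never reach id 3336.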

3. Decode mojibake on multi-byte UTF-8

_convert_id_to_token used errors="replace", which turns valid partial UTF-8 byte sequences into U+FFFD ('�') replacement characters. Korean, emoji, and math symbols all decoded as replacement characters.

Before: decode(encode("안녕하세요")) → a string of '�' replacement characters
After: decode(encode("μ•ˆλ…•ν•˜μ„Έμš”")) β†’ "μ•ˆλ…•ν•˜μ„Έμš”"
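The mechanism can be reproduced without the tokenizer at all. This sketch splits the bytes of one Korean character across a token boundary, as byte-level tokens routinely do, and shows that errors="replace" discards the partial bytes while errors="surrogateescape" keeps them recoverable:

```python
# The 3 UTF-8 bytes of '안' split across two hypothetical byte-level tokens.
token_bytes = "안".encode("utf-8")          # b'\xec\x95\x88'
part_a, part_b = token_bytes[:2], token_bytes[2:]

# errors="replace": each partial sequence becomes U+FFFD, bytes are lost.
broken = (part_a.decode("utf-8", errors="replace")
          + part_b.decode("utf-8", errors="replace"))

# errors="surrogateescape": bytes survive the str round-trip intact.
a = part_a.decode("utf-8", errors="surrogateescape")
b = part_b.decode("utf-8", errors="surrogateescape")
restored = (a + b).encode("utf-8", errors="surrogateescape").decode("utf-8")
print(restored)  # '안'
```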

Changes

  • Remove "\n\n" from AddedToken registry when it already exists in the base vocab, and rebuild the internal splitting trie
  • Fix _convert_token_to_id to handle string lookups (needed for eos_token_id)
  • Use errors="surrogateescape" instead of errors="replace" in _convert_id_to_token and convert_tokens_to_string so partial UTF-8 survives the tokenβ†’stringβ†’bytes roundtrip
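A toy version of the fixed decode path (hypothetical two-token vocab, not the real method bodies) shows how the two changed methods cooperate: tokens pass through str with surrogateescape, and convert_tokens_to_string re-encodes before the final strict UTF-8 decode:

```python
# "안녕" is 6 UTF-8 bytes; split mid-character to mimic byte-level tokens.
raw = "안녕".encode("utf-8")
id_to_bytes = {0: raw[:4], 1: raw[4:]}  # hypothetical vocab entries

def _convert_id_to_token(tid):
    # surrogateescape preserves partial byte sequences inside the str
    return id_to_bytes[tid].decode("utf-8", errors="surrogateescape")

def convert_tokens_to_string(tokens):
    # re-encode to recover the original bytes, then decode strictly
    return "".join(tokens).encode("utf-8", errors="surrogateescape").decode("utf-8")

tokens = [_convert_id_to_token(i) for i in (0, 1)]
print(convert_tokens_to_string(tokens))  # '안녕'
```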

Test plan

  • "\n\n" encodes to id 261 (not 65530)
  • " \n\n" encodes to single token 3336 (not split)
  • Korean, emoji, math symbols roundtrip correctly
  • eos_token_id resolves to 261

Test suite covering all 3 bug fixes (also covers fast tokenizer parity from PR #2):
https://gist.github.com/d3banjan/5f5b77a652072a35ccc3b19f4d86d414

Bug fix tests: TestBugFixes class β€” phantom token, greedy match, decode mojibake, eos_token_id resolution.

Cross-check against paper and upstream code

Verified that none of the three bugs are intentional design choices:

  • Paper (arXiv:2503.14456): No mention of special tokenizer handling for \n\n or EOS conventions at the tokenizer level. The World tokenizer is described as a pure greedy-longest-match TRIE.

  • Original RWKV tokenizer (BlinkDL/ChatRWKV): Has zero special token handling β€” no BOS, EOS, or AddedToken concepts. Token 0 is used as the document separator in training (make_data.py appends 0 after each document). \n\n is a chat prompt turn delimiter, not a tokenizer-level EOS.

  • HF PR #26963 (original RWKV5 integration): The RWKV team member PicoCreator confirmed "the model was not trained with a special bos_token, or special pad_token in mind (we use 0, and masking for padding)". SmerkyG noted that _convert_token_to_id / _convert_id_to_token were "implemented kind of backwards" β€” the mojibake bug was a known concern.

  • Rust rwkv-tokenizer (cahya-wirawan/rwkv-tokenizer): Faithfully reimplements the same greedy TRIE with no special handling for \n\n.

All three bugs are artifacts of the HuggingFace integration layer, not upstream RWKV design choices.
