Fix 3 bugs in Python tokenizer: phantom token, broken greedy match, decode mojibake

#3

Summary

Fixes three bugs in hf_rwkv_tokenizer.py (the pure-Python TRIE tokenizer).

Bugs fixed

1. Phantom token: \n\n mapped to wrong id

eos_token="\n\n" was registered as an AddedToken at id 65530 (outside the vocab range), shadowing the correct vocab entry at id 261.

Before: encode("\n\n") β†’ [65530]
After: encode("\n\n") β†’ [261]

2. Broken greedy match on tokens containing \n\n

Because "\n\n" was an AddedToken, HuggingFace extracted it before the TRIE ran, breaking greedy-longest-match. The vocab entry " \n\n" (id 3336) was never matched.

Before: encode(" \n\n") β†’ [33, 65530] (space + phantom)
After: encode(" \n\n") β†’ [3336] (single correct token)
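To illustrate why pre-extracting "\n\n" breaks greedy-longest-match, here is a minimal sketch of a byte-level TRIE tokenizer, not the actual hf_rwkv_tokenizer.py code; the three-entry vocab uses the ids quoted above (33 = " ", 261 = "\n\n", 3336 = " \n\n") and everything else is simplified:

```python
# Minimal greedy-longest-match byte TRIE (sketch, not the real RWKV vocab).
def build_trie(vocab):
    root = {}
    for token, tid in vocab.items():
        node = root
        for b in token.encode("utf-8"):
            node = node.setdefault(b, {})
        node[None] = tid  # end-of-token marker
    return root

def encode(text, trie):
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        node, best, j = trie, None, i
        while j < len(data) and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best = (node[None], j)  # remember longest match so far
        tid, i = best  # greedy: take the longest match
        ids.append(tid)
    return ids

vocab = {" ": 33, "\n\n": 261, " \n\n": 3336}
trie = build_trie(vocab)
print(encode(" \n\n", trie))  # [3336] when the TRIE sees the whole input
```

When the TRIE sees the full input, it matches the longer " \n\n" entry in one step. If an AddedToken pre-splitter carves out "\n\n" first, the TRIE only ever sees " " and "\n\n" separately and can never reach id 3336.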

3. Decode mojibake on multi-byte UTF-8

_convert_id_to_token used errors="replace", which turns valid partial UTF-8 byte sequences into U+FFFD ('�') replacement characters. Korean, emoji, and math symbols all decoded as replacement characters.

Before: decode(encode("안녕하세요")) → a string of '�' replacement characters
After: decode(encode("μ•ˆλ…•ν•˜μ„Έμš”")) β†’ "μ•ˆλ…•ν•˜μ„Έμš”"
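The mechanism can be reproduced without the tokenizer at all. This sketch splits the bytes of one Korean character across a token boundary, as byte-level tokens routinely do, and shows that errors="replace" discards the partial bytes while errors="surrogateescape" keeps them recoverable:

```python
# The 3 UTF-8 bytes of '안' split across two hypothetical byte-level tokens.
token_bytes = "안".encode("utf-8")          # b'\xec\x95\x88'
part_a, part_b = token_bytes[:2], token_bytes[2:]

# errors="replace": each partial sequence becomes U+FFFD, bytes are lost.
broken = (part_a.decode("utf-8", errors="replace")
          + part_b.decode("utf-8", errors="replace"))

# errors="surrogateescape": bytes survive the str round-trip intact.
a = part_a.decode("utf-8", errors="surrogateescape")
b = part_b.decode("utf-8", errors="surrogateescape")
restored = (a + b).encode("utf-8", errors="surrogateescape").decode("utf-8")
print(restored)  # '안'
```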

Changes

  • Remove "\n\n" from AddedToken registry when it already exists in the base vocab, and rebuild the internal splitting trie
  • Fix _convert_token_to_id to handle string lookups (needed for eos_token_id)
  • Use errors="surrogateescape" instead of errors="replace" in _convert_id_to_token and convert_tokens_to_string so partial UTF-8 survives the tokenβ†’stringβ†’bytes roundtrip
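A toy version of the fixed decode path (hypothetical two-token vocab, not the real method bodies) shows how the two changed methods cooperate: tokens pass through str with surrogateescape, and convert_tokens_to_string re-encodes before the final strict UTF-8 decode:

```python
# "안녕" is 6 UTF-8 bytes; split mid-character to mimic byte-level tokens.
raw = "안녕".encode("utf-8")
id_to_bytes = {0: raw[:4], 1: raw[4:]}  # hypothetical vocab entries

def _convert_id_to_token(tid):
    # surrogateescape preserves partial byte sequences inside the str
    return id_to_bytes[tid].decode("utf-8", errors="surrogateescape")

def convert_tokens_to_string(tokens):
    # re-encode to recover the original bytes, then decode strictly
    return "".join(tokens).encode("utf-8", errors="surrogateescape").decode("utf-8")

tokens = [_convert_id_to_token(i) for i in (0, 1)]
print(convert_tokens_to_string(tokens))  # '안녕'
```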

Test plan

  • "\n\n" encodes to id 261 (not 65530)
  • " \n\n" encodes to single token 3336 (not split)
  • Korean, emoji, math symbols roundtrip correctly
  • eos_token_id resolves to 261

Test suite covering all 3 bug fixes (also covers fast tokenizer parity from PR #2):
https://gist.github.com/d3banjan/5f5b77a652072a35ccc3b19f4d86d414

Bug fix tests: TestBugFixes class β€” phantom token, greedy match, decode mojibake, eos_token_id resolution.

Cross-check against paper and upstream code

Verified that none of the three bugs are intentional design choices:

  • Paper (arXiv:2503.14456): No mention of special tokenizer handling for \n\n or EOS conventions at the tokenizer level. The World tokenizer is described as a pure greedy-longest-match TRIE.

  • Original RWKV tokenizer (BlinkDL/ChatRWKV): Has zero special token handling β€” no BOS, EOS, or AddedToken concepts. Token 0 is used as the document separator in training (make_data.py appends 0 after each document). \n\n is a chat prompt turn delimiter, not a tokenizer-level EOS.

  • HF PR #26963 (original RWKV5 integration): The RWKV team member PicoCreator confirmed "the model was not trained with a special bos_token, or special pad_token in mind (we use 0, and masking for padding)". SmerkyG noted that _convert_token_to_id / _convert_id_to_token were "implemented kind of backwards" β€” the mojibake bug was a known concern.

  • Rust rwkv-tokenizer (cahya-wirawan/rwkv-tokenizer): Faithfully reimplements the same greedy TRIE with no special handling for \n\n.

All three bugs are artifacts of the HuggingFace integration layer, not upstream RWKV design choices.
