# chess-v13-macher
Chess-playing language model for the Chess 1M Challenge, with 995,466 parameters (under the 1M limit).
## Approach
The model reuses the example solution's GPT-2 architecture unchanged. The key contribution is a square-pair tokenization that compresses the vocabulary from ~1200 tokens down to 73, freeing parameter budget for a deeper network.
Each move is encoded as [from_square, to_square, separator] (with an extra promotion token when needed), stripping piece identity and annotations (captures, checks, castling markers), all of which are implicit in the move sequence. The vocabulary is:
- 64 square tokens (a1-h8)
- 4 promotion pieces (q, r, b, n)
- 1 move separator (newline)
- 4 special tokens (PAD, BOS, EOS, UNK)
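The vocabulary above can be sketched as a minimal encoder. This is an illustrative reconstruction, not the repository's `tokenizer.py`; the token names and ordering are assumptions.

```python
# Hypothetical sketch of the square-pair tokenizer (names and id order are illustrative).
SQUARES = [f + r for r in "12345678" for f in "abcdefgh"]   # a1 .. h8 (64 tokens)
PROMOS = ["q", "r", "b", "n"]                               # promotion pieces
SEP = "\n"                                                  # move separator
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]             # 4 special tokens
VOCAB = SPECIALS + SQUARES + PROMOS + [SEP]                 # 4 + 64 + 4 + 1 = 73
TOK = {t: i for i, t in enumerate(VOCAB)}

def encode_move(uci: str) -> list[int]:
    """Encode one UCI move, e.g. 'e2e4' or 'e7e8q', as token ids."""
    ids = [TOK[uci[0:2]], TOK[uci[2:4]]]   # from-square, to-square
    if len(uci) == 5:                      # optional promotion piece
        ids.append(TOK[uci[4]])
    ids.append(TOK[SEP])                   # move separator (newline)
    return ids
```

A plain move thus costs 3 tokens and a promotion 4, so the 180-token context holds roughly 60 plies.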
This tiny vocabulary allows investing ~97% of the parameter budget into transformer layers:
| | Example solution | This model |
|---|---|---|
| vocab_size | 1200 | 73 |
| n_layer | 6 | 9 |
| n_embd | 128 | 112 |
| n_inner | 384 | 250 |
| n_ctx | 256 | 180 |
| Embedding params | ~186K | ~28K |
| Layer params | ~720K | ~968K |
| Total | ~906K | ~996K |
## Results
### Official evaluation (full games, deterministic)
| Metric | Score |
|---|---|
| Legal rate (1st try) | 100% (520/520) |
| Legal rate (retry) | 100% (520/520) |
All 20 games end in draws by repetition with perfect legality.
### Extended evaluation
| Mode | 1st try | With retry |
|---|---|---|
| Diverse positions (191) | 86.4% | 95.3% |
| Both colors (White) | 91.3% | 95.5% |
| Both colors (Black) | 89.3% | 95.5% |
For comparison, OussamaleZ (#1 on the leaderboard, also 100% on official eval) scores 89.0% on diverse positions and 78.6%/81.6% as White/Black on the extended evaluation.
### Move separator experiment
Training with identical hyperparameters but different separator tokens (newline vs space) yields surprisingly different behavior:
| | newline | space |
|---|---|---|
| Full games (1st try) | 100% | 80.9% |
| Full games (retry) | 100% | 93.1% |
| Diverse positions (retry) | 90.6% | 93.7% |
The newline model plays conservatively (all official games drawn), while the space model generalizes better to diverse positions but produces illegal moves in most full games. This model uses the newline separator.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("macher/chess-v13-macher", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("macher/chess-v13-macher", trust_remote_code=True)
```
## Training
- Dataset: dlouapre/lichess_2025-01_1M (1M Lichess games)
- Epochs: 20
- Learning rate: 3e-4 (cosine schedule, 5% warmup)
- Batch size: 64
- Optimizer: AdamW (weight decay 0.01)
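The stated schedule (peak 3e-4, cosine decay, 5% linear warmup) can be written out explicitly. The function below is a generic sketch of that schedule, not the training script itself; the step counts passed in are illustrative.

```python
import math

def lr_at(step: int, total_steps: int,
          peak: float = 3e-4, warmup_frac: float = 0.05) -> float:
    """Cosine learning-rate schedule with linear warmup over the first 5% of steps."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak * step / warmup            # linear ramp from 0 to peak
    progress = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to 0
```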
## Files
- `model.py` – Architecture (detailed docstring with experiment notes)
- `tokenizer.py` – Square-pair tokenizer
- `eval_extended.py` – Extended evaluation script (diverse positions, full games, both colors)
- `eval_extended_v13.json` – Evaluation results