Upload README.md with huggingface_hub
    url = {https://github.com/thomas-schweich/PAWN},
    license = {Apache-2.0}
  }
model_params: 66909440
d_model: 640
n_layers: 10
n_heads: 8
d_ff: 2560
context_length: 512
vocab_size: 1980
datasets:
- random-chess-games
language:
      type: next-token-prediction
      name: Chess Move Prediction (Random Games)
    metrics:
    - name: Game Completion Rate
      type: accuracy
      value: 0.997559
    - name: Legal Move Rate
      type: accuracy
      value: 0.999990
    - name: Top-1 Accuracy
      type: accuracy
      value: 0.0863
    - name: Top-5 Accuracy
      type: accuracy
      value: 0.3556
    - name: Val Loss
      type: loss
      value: 2.8652
    - name: Total Training Sequences
      type: other
      value: 51200000
---

# PAWN-Large

**PAWN** (Playstyle-Agnostic World-model Network for Chess) is a causal transformer trained on random chess games. It learns legal moves, board state representations, and game dynamics purely from uniformly random legal move sequences -- no strategic play, no hand-crafted features, no external game databases.

This is the **large** variant (~66.9M parameters). PAWN is designed as a frozen backbone for parameter-efficient finetuning into player models with arbitrary playstyles.

**[GitHub Repository](https://github.com/thomas-schweich/PAWN)** -- full source code, training scripts, adapter implementations, and documentation.

| Variant | Parameters | Link |
|---------|------------|------|
| PAWN-Small | ~9M | [thomas-schweich/pawn-small](https://huggingface.co/thomas-schweich/pawn-small) |
| PAWN (Base) | ~35M | [thomas-schweich/pawn-base](https://huggingface.co/thomas-schweich/pawn-base) |
| PAWN-Large | ~67M | [thomas-schweich/pawn-large](https://huggingface.co/thomas-schweich/pawn-large) |

A previous generation of PAWN backbones (`pawn-{small,base,large}-legacy`) used a 4,278-token coordinate vocabulary, a 256-token context window, and outcome conditioning. They are still available on HuggingFace; see [docs/LEGACY.md](https://github.com/thomas-schweich/PAWN/blob/main/docs/LEGACY.md) for the full story.

## Headline Metrics

These come from the published `model.safetensors` (step 195,000 out of 200,000 -- the best 5,000-step-cadence checkpoint by val loss), measured on a fresh validation set of random games.

| Metric | Value |
|--------|-------|
| Game completion rate | 99.76% |
| Per-move legal rate | 99.9990% |
| Late-game legal rate | 99.9996% |
| Top-1 accuracy | 8.63% |
| Top-5 accuracy | 35.56% |
| Val loss | 2.865 |
| Val perplexity | 17.55 |

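As a sanity check, the val perplexity row is just the exponential of the val loss row:

```python
import math

# Perplexity is exp(cross-entropy loss); the two headline rows agree.
val_loss = 2.8652
print(round(math.exp(val_loss), 2))  # 17.55
```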
**Game completion rate** is the share of validation games in which *every* prediction along one side's plies was a legal move. The measurement is **non-autoregressive**: at each ply the model is shown the true ground-truth history and asked for that side's next move, and an illegal prediction at any ply forfeits the game. Errors do not corrupt subsequent positions -- each prediction is independent given the true history. Autoregressive game completion has not been measured for these checkpoints and could be higher or lower; see the [game completion section of the architecture doc](https://github.com/thomas-schweich/PAWN/blob/main/docs/ARCHITECTURE.md#game-completion-rate) for the full definition. Game completion rate is a much stricter metric than per-move legal rate, and is the main signal that separates capacity between sizes.

| Compound-legality detail | Value |
|--------------------------|-------|
| Average plies completed per game | 349 |
| Average % of game completed | 99.88% |
| Median forfeit ply (when forfeit) | 153 |

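The compound metric can be made concrete with a short sketch (names here are hypothetical, not from the PAWN codebase): given one list of per-ply legality flags per game, where each flag records whether the prediction at that ply, conditioned on the true history, was a legal move, the statistics above fall out directly.

```python
# Hypothetical sketch (names not from the PAWN codebase): derive the
# compound-legality statistics from per-ply legality flags, each flag
# saying whether the model's prediction at that ply -- made from the
# true ground-truth history, non-autoregressively -- was legal.
def completion_stats(games: list[list[bool]]) -> dict[str, float]:
    n_games = len(games)
    # a game is "completed" only if every single prediction was legal
    completed = sum(all(flags) for flags in games)
    # plies survived before the first illegal prediction (forfeit point)
    survived = [
        next((i for i, ok in enumerate(flags) if not ok), len(flags))
        for flags in games
    ]
    return {
        "completion_rate": completed / n_games,
        "avg_plies_completed": sum(survived) / n_games,
        "avg_fraction_completed": sum(
            s / len(f) for s, f in zip(survived, games)
        ) / n_games,
    }

# Toy example: one clean 10-ply game, one forfeited at ply 4.
stats = completion_stats([[True] * 10, [True] * 4 + [False] + [True] * 5])
```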
### Accuracy ceiling

PAWN is trained on uniformly random chess games. At each position with N legal moves, the next move is drawn uniformly, so the Bayes-optimal predictor that does not know the game's outcome can do no better than 1/N at that position. Averaged over the position distribution induced by random games of up to 512 plies, the top-1 ceiling is **E\[1/N_legal\] ≈ 8.43%** (95% CI \[8.41%, 8.45%\], computed over 50,000 fresh random games -- see [docs/ACCURACY_CEILING.md](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md)).

This model's top-1 accuracy of **8.63%** is **102% of that ceiling** -- i.e., essentially at the limit of what any predictor can achieve on this task without outcome information.

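The ceiling itself is a plain Monte-Carlo average. A minimal sketch of the estimator (the helper name is illustrative; the real computation in docs/ACCURACY_CEILING.md samples positions from 50,000 fresh random games):

```python
# Illustrative sketch of the ceiling estimate (helper name hypothetical).
# At a position with n legal moves the next random move is uniform, so no
# outcome-blind predictor can beat 1/n there; the ceiling is the mean of
# 1/N over positions visited by random games.
def top1_ceiling(legal_move_counts: list[int]) -> float:
    return sum(1.0 / n for n in legal_move_counts) / len(legal_move_counts)

# Toy position sample: 20, 30, and 5 legal moves.
print(top1_ceiling([20, 30, 5]))  # ≈ 0.0944
```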
## Probe Results

Linear probes trained on frozen hidden states measure how well the model's internal representations encode board-level features. The model is never explicitly told about pieces, sides, or rules -- these representations emerge purely from next-token prediction on random games.

| Probe | Accuracy | Description |
|-------|----------|-------------|
| Attention heads | 8 |
| Head dimension | 80 |
| d_ff | 2560 |
| Parameters | ~66.9M |
| Vocabulary | 1,980 tokens (1,968 searchless_chess actions + 1 PAD + 11 outcome tokens) |
| Context length | 512 tokens |
| Normalization | Pre-norm RMSNorm |
| FFN | SwiGLU (4x expansion) |
| Positional encoding | Rotary (RoPE, base 10000) |

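The parameter total can be roughly reproduced from the rows above. A back-of-envelope sketch, assuming tied input/output embeddings, bias-free projections, and three SwiGLU weight matrices (none of these details are confirmed by this card):

```python
# Back-of-envelope parameter count for the architecture table above.
# Assumptions (not confirmed by the model card): tied embeddings,
# no biases, SwiGLU with gate/up/down matrices, RMSNorm gain vectors.
d_model, n_layers, d_ff, vocab = 640, 10, 2560, 1980

embed = vocab * d_model                   # token embeddings (tied LM head)
attn_per_layer = 4 * d_model * d_model    # Q, K, V, and output projections
ffn_per_layer = 3 * d_model * d_ff        # SwiGLU: gate, up, down
norms = n_layers * 2 * d_model + d_model  # two RMSNorm gains/block + final

total = embed + n_layers * (attn_per_layer + ffn_per_layer) + norms
print(f"{total:,}")  # ~66.8M, close to the reported 66,909,440
```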
|-----------|-------|
| Training data | On-the-fly uniformly random legal games (no external dataset) |
| Objective | Next-token cross-entropy (non-padding positions only) |
| Outcome conditioning | Disabled (prepend_outcome=False) -- pure moves, no outcome leakage |
| Total steps | 200,000 |
| Batch size | 256 |
| Total training sequences | 51,200,000 (= total steps × batch size; the published checkpoint is the best 5K-cadence step by val loss, at step 195,000 ≈ 49,920,000 sequences) |
| Max ply per example | 512 |
| Learning rate | 0.0003 (cosine decay with 10,000-step warmup) |
| Optimizer | AdamW (weight decay 0.01) |
| Precision | Mixed (AMP) |

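The learning-rate row can be sketched as a schedule function. This assumes a linear warmup to the 0.0003 peak over 10,000 steps and cosine decay to zero at step 200,000; the exact warmup shape and decay floor are assumptions, not stated by this card.

```python
import math

# Assumed schedule shape for "0.0003 (cosine decay with 10,000-step
# warmup)": linear warmup to the peak, then cosine decay to zero.
def lr_at(step: int, peak: float = 3e-4,
          warmup: int = 10_000, total: int = 200_000) -> float:
    if step < warmup:
        return peak * step / warmup          # linear warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * peak * (1 + math.cos(math.pi * progress))  # cosine decay
```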
## Usage
### Finetuning with an adapter

```bash
uv run python scripts/train.py --run-type adapter --strategy bottleneck \
  --checkpoint thomas-schweich/pawn-large \
  --pgn thomas-schweich/pawn-lichess-full \
  --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
```

| FiLM | [Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer", AAAI 2018](https://arxiv.org/abs/1709.07871) |
| RoSA | [Nikdan et al., "RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation", 2024](https://arxiv.org/abs/2401.04679) |
| Linear probes | [Alain & Bengio, "Understanding Intermediate Layers Using Linear Classifier Probes", ICLR Workshop 2017](https://arxiv.org/abs/1610.01644) |
| Searchless Chess (action vocab) | [Ruoss et al., "Amortized Planning with Large-Scale Transformers: A Case Study on Chess", 2024](https://arxiv.org/abs/2402.04494) |
| MAIA | [McIlroy-Young et al., "Aligning Superhuman AI with Human Behavior: Chess as a Model System", KDD 2020](https://arxiv.org/abs/2006.01855) |
| AlphaZero | [Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play", Science 2018](https://arxiv.org/abs/1712.01815) |
| Leela Chess Zero | [github.com/LeelaChessZero/lc0](https://github.com/LeelaChessZero/lc0) |