thomas-schweich committed (verified) · Commit f365877 · 1 parent: 4654953

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +49 -37
README.md CHANGED
@@ -20,13 +20,13 @@ citation: |
     url = {https://github.com/thomas-schweich/PAWN},
     license = {Apache-2.0}
   }
-model_params: 68376320
+model_params: 66909440
 d_model: 640
 n_layers: 10
 n_heads: 8
 d_ff: 2560
-context_length: 256
-vocab_size: 4284
+context_length: 512
+vocab_size: 1980
 datasets:
 - random-chess-games
 language:
@@ -40,32 +40,31 @@ model-index:
       type: next-token-prediction
       name: Chess Move Prediction (Random Games)
     metrics:
-
+    - name: Game Completion Rate
+      type: accuracy
+      value: 0.997559
     - name: Legal Move Rate
       type: accuracy
-      value: 0.9989
-
+      value: 0.999990
     - name: Top-1 Accuracy
       type: accuracy
-      value: 0.0695
-
+      value: 0.0863
     - name: Top-5 Accuracy
       type: accuracy
-      value: 0.2773
-
+      value: 0.3556
     - name: Val Loss
       type: loss
-      value: 3.0919
-    - name: Games Seen
+      value: 2.8652
+    - name: Total Training Sequences
       type: other
-      value: 25600000
+      value: 51200000
 ---
 
 # PAWN-Large
 
 **PAWN** (Playstyle-Agnostic World-model Network for Chess) is a causal transformer trained on random chess games. It learns legal moves, board state representations, and game dynamics purely from uniformly random legal move sequences -- no strategic play, no hand-crafted features, no external game databases.
 
-This is the **large** variant (~68.4M parameters). PAWN is designed as a frozen backbone for parameter-efficient finetuning into player models with arbitrary playstyles.
+This is the **large** variant (~66.9M parameters). PAWN is designed as a frozen backbone for parameter-efficient finetuning into player models with arbitrary playstyles.
 
 **[GitHub Repository](https://github.com/thomas-schweich/PAWN)** -- full source code, training scripts, adapter implementations, and documentation.
 
@@ -73,32 +72,44 @@ This is the **large** variant (~68.4M parameters). PAWN is designed as a frozen
 
 | Variant | Parameters | Link |
 |---------|------------|------|
-| PAWN-Small | ~9.5M | [thomas-schweich/pawn-small](https://huggingface.co/thomas-schweich/pawn-small) |
-| PAWN (Base) | ~35.8M | [thomas-schweich/pawn-base](https://huggingface.co/thomas-schweich/pawn-base) |
-| PAWN-Large | ~68.4M | [thomas-schweich/pawn-large](https://huggingface.co/thomas-schweich/pawn-large) |
+| PAWN-Small | ~9M | [thomas-schweich/pawn-small](https://huggingface.co/thomas-schweich/pawn-small) |
+| PAWN (Base) | ~35M | [thomas-schweich/pawn-base](https://huggingface.co/thomas-schweich/pawn-base) |
+| PAWN-Large | ~67M | [thomas-schweich/pawn-large](https://huggingface.co/thomas-schweich/pawn-large) |
+
+A previous generation of PAWN backbones (`pawn-{small,base,large}-legacy`) used a 4,278-token coordinate vocabulary, a 256-token context window, and outcome conditioning. They are still available on HuggingFace; see [docs/LEGACY.md](https://github.com/thomas-schweich/PAWN/blob/main/docs/LEGACY.md) for the full story.
 
 ## Headline Metrics
 
+These metrics come from the published `model.safetensors` (step 195,000 of 200,000, the best checkpoint by val loss at the 5,000-step save cadence), measured on a fresh validation set of random games.
+
 | Metric | Value |
 |--------|-------|
-| Legal move rate | 99.89% |
-| Top-1 accuracy | 6.95% |
-| Top-5 accuracy | 27.73% |
-| Val loss | 3.092 |
+| Game completion rate | 99.76% |
+| Per-move legal rate | 99.9990% |
+| Late-game legal rate | 99.9996% |
+| Top-1 accuracy | 8.63% |
+| Top-5 accuracy | 35.56% |
+| Val loss | 2.865 |
+| Val perplexity | 17.55 |
+
+**Game completion rate** is the share of validation games in which *every* prediction along one side's plies was a legal move. The measurement is **non-autoregressive**: at each ply the model is shown the true ground-truth history and asked for that side's next move, and an illegal prediction at any ply forfeits the game. Errors do not corrupt subsequent positions; each prediction is independent given the true history. Autoregressive game completion has not been measured for these checkpoints and could be higher or lower; see the [game completion section of the architecture doc](https://github.com/thomas-schweich/PAWN/blob/main/docs/ARCHITECTURE.md#game-completion-rate) for the full definition. Game completion rate is a much stricter metric than per-move legal rate, and is the main signal that separates capacity between sizes.
+
+| Compound-legality detail | Value |
+|--------------------------|-------|
+| Average plies completed per game | 349 |
+| Average % of game completed | 99.88% |
+| Median forfeit ply (when forfeited) | 153 |
 
-### Accuracy Ratios
+### Accuracy ceiling
 
-PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling. Ratios above 100% on the unconditioned ceiling indicate the model exploits the outcome token to make non-uniform predictions. The MC conditioned ceiling is an estimate reported as a bracket \[corrected, naive\]; see [Accuracy Ceiling Analysis](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md) for methodology.
+PAWN is trained on uniformly random chess games. At each position with N legal moves, the next move is drawn uniformly, so the Bayes-optimal predictor that does not know the game's outcome can do no better than 1/N at that position. Averaged over the position distribution induced by random games of up to 512 plies, the top-1 ceiling is **E\[1/N_legal\] ≈ 8.43%** (95% CI \[8.41%, 8.45%\], computed over 50,000 fresh random games; see [docs/ACCURACY_CEILING.md](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md)).
 
-| Ceiling | Ratio |
-|---------|-------|
-| Unconditioned (E\[1/N_legal\] = 6.52%) | 106% |
-| Bayes-optimal conditioned (MC, 128 rollouts = \[6.67, 7.34\]%) | 95–104% |
+This model's top-1 accuracy of **8.63%** is **102% of that ceiling**, that is, essentially at the limit of what any predictor can achieve on this task without outcome information.
 
 ## Probe Results
 
-Linear probes trained on frozen hidden states measure how well the model's internal representations encode board-level features.
+Linear probes trained on frozen hidden states measure how well the model's internal representations encode board-level features. The model is never explicitly told about pieces, sides, or rules; these representations emerge purely from next-token prediction on random games.
 
 | Probe | Accuracy | Description |
 |-------|----------|-------------|
@@ -144,9 +155,9 @@ Edge-case diagnostics measure the model's legal move rate in specific tactical s
 | Attention heads | 8 |
 | Head dimension | 80 |
 | d_ff | 2560 |
-| Parameters | ~68.4M |
-| Vocabulary | 4,284 tokens |
-| Context length | 256 tokens |
+| Parameters | ~66.9M |
+| Vocabulary | 1,980 tokens (1,968 searchless_chess actions + 1 PAD + 11 outcome tokens) |
+| Context length | 512 tokens |
 | Normalization | Pre-norm RMSNorm |
 | FFN | SwiGLU (4x expansion) |
 | Positional encoding | Rotary (RoPE, base 10000) |
@@ -159,13 +170,14 @@ Edge-case diagnostics measure the model's legal move rate in specific tactical s
 |-----------|-------|
 | Training data | On-the-fly uniformly random legal games (no external dataset) |
 | Objective | Next-token cross-entropy (non-padding positions only) |
-| Total steps | 100,000 |
+| Outcome conditioning | Disabled (prepend_outcome=False); pure moves, no outcome leakage |
+| Total steps | 200,000 |
 | Batch size | 256 |
-| Games seen | 25,600,000 |
-| Learning rate | 3e-4 (cosine decay with 1,000-step warmup) |
+| Total training sequences | 51,200,000 (total steps × batch size; the published checkpoint is the best 5,000-step-cadence checkpoint by val loss, at step 195,000 ≈ 49,920,000 sequences) |
+| Max ply per example | 512 |
+| Learning rate | 3e-4 (cosine decay with 10,000-step warmup) |
 | Optimizer | AdamW (weight decay 0.01) |
 | Precision | Mixed (AMP) |
-| Hardware | NVIDIA H200 |
 
 ## Usage
 
@@ -199,7 +211,7 @@ model.load_state_dict(weights)
 ### Finetuning with an adapter
 
 ```bash
-uv run python scripts/train_bottleneck.py \
+uv run python scripts/train.py --run-type adapter --strategy bottleneck \
     --checkpoint thomas-schweich/pawn-large \
     --pgn thomas-schweich/pawn-lichess-full \
     --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
@@ -223,7 +235,7 @@ PAWN builds on ideas and tools from the following projects and publications:
 | FiLM | [Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer", AAAI 2018](https://arxiv.org/abs/1709.07871) |
 | RoSA | [Nikdan et al., "RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation", 2024](https://arxiv.org/abs/2401.04679) |
 | Linear probes | [Alain & Bengio, "Understanding Intermediate Layers Using Linear Classifier Probes", ICLR Workshop 2017](https://arxiv.org/abs/1610.01644) |
-| Intrinsic dimensionality | [Aghajanyan et al., "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning", ACL 2021](https://arxiv.org/abs/2012.13255) |
+| Searchless Chess (action vocab) | [Ruoss et al., "Amortized Planning with Large-Scale Transformers: A Case Study on Chess", 2024](https://arxiv.org/abs/2402.04494) |
 | MAIA | [McIlroy-Young et al., "Aligning Superhuman AI with Human Behavior: Chess as a Model System", KDD 2020](https://arxiv.org/abs/2006.01855) |
 | AlphaZero | [Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play", Science 2018](https://arxiv.org/abs/1712.01815) |
 | Leela Chess Zero | [github.com/LeelaChessZero/lc0](https://github.com/LeelaChessZero/lc0) |
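The derived figures in the updated card are simple arithmetic on the numbers it states; a quick cross-check sketch (using only values that appear in the card above):

```python
import math

# Total training sequences = total steps x batch size (stated in the card).
total_steps, batch_size = 200_000, 256
total_sequences = total_steps * batch_size    # 51,200,000
best_ckpt_sequences = 195_000 * batch_size    # 49,920,000 at the published step

# Val perplexity is exp(val loss).
val_perplexity = math.exp(2.8652)             # ~17.55

# Top-1 accuracy as a fraction of the E[1/N_legal] ceiling.
pct_of_ceiling = 100 * 0.0863 / 0.0843        # ~102%
```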
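The game-completion bookkeeping the card describes (non-autoregressive, forfeit at the first illegal ply) reduces to a small aggregation. A sketch over hypothetical per-ply legality flags; the data here is toy input, not the model's real outputs:

```python
from statistics import median

def completion_stats(games: list[list[bool]]) -> dict:
    """games: one list per game; each bool says whether the model's
    prediction at that ply was legal, given the true game history."""
    forfeit_plies = [g.index(False) for g in games if not all(g)]
    return {
        # Share of games with a legal prediction at *every* ply.
        "completion_rate": sum(all(g) for g in games) / len(games),
        # Ply of the first illegal prediction, among forfeited games.
        "median_forfeit_ply": median(forfeit_plies) if forfeit_plies else None,
    }

# Toy data: three fully legal games and one forfeited at ply 2.
stats = completion_stats([[True] * 6, [True] * 4, [True, True],
                          [True, True, False, True]])
print(stats)  # {'completion_rate': 0.75, 'median_forfeit_ply': 2}
```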