Caro5: First dataset gen runs (day 4) 🍄

Community Article Published June 11, 2026

Learning how to optimize and clean data

I only have 10 days (down from 2 weeks) to run this project, and I’m learning a lot. So far, my priority is generating lots of games early so the dataset reaches 10k games. While that runs, I’ll learn how to do ML/RL.

Run 1: I’m naive

100 games completed and validated cleanly.

 median durationMs:    592
 mean durationMs:     5007.62
 min durationMs:         2
 p90 durationMs:     10111
 p95 durationMs:     26061
 max durationMs:    156115
 wall time:         ~8m21s

The slow tail is not from PUCT. Split by bot mix:

  games with MCTS:
  count 61
  median 111ms
  mean   956.9ms
  max    4862ms

  games without MCTS:
  count  39
  median 3547ms
  mean   11343.4ms
  max    156115ms
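
A split like the one above can be computed with a few lines. This is a minimal sketch, not the project's actual tooling: the `GameRecord` shape and the `usesMcts` flag are assumptions, and the percentile uses the nearest-rank method (so the median of an even-sized group is the lower middle value), which may differ from how the numbers in this post were produced.

```typescript
// Hypothetical record shape; the real dataset rows may carry different fields.
interface GameRecord {
  durationMs: number;
  usesMcts: boolean;
}

interface DurationSummary {
  count: number;
  median: number;
  mean: number;
  p95: number;
  max: number;
}

// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarize(durations: number[]): DurationSummary {
  if (durations.length === 0) throw new Error("no games in this group");
  const sorted = [...durations].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, d) => sum + d, 0) / sorted.length;
  return {
    count: sorted.length,
    median: percentile(sorted, 50),
    mean,
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// Groups games by whether any MCTS bot took part, then summarizes each group.
function splitByMcts(games: GameRecord[]) {
  const durationsOf = (group: GameRecord[]) => group.map((g) => g.durationMs);
  return {
    withMcts: summarize(durationsOf(games.filter((g) => g.usesMcts))),
    withoutMcts: summarize(durationsOf(games.filter((g) => !g.usesMcts))),
  };
}
```

Comparing the two groups' `median` and `max` is what exposes a tail that the overall median hides.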

The worst game was minimax_strong vs minimax_balanced: 124 moves, 2m36s, and no MCTS targets. That single game took almost a third of the wall time (156s of 501s).

At this rate, generating 10k games would take 13.9 hours and 100k games almost 6 days, and that assumes I get the perfect dataset on the first try. This needs to get faster, and soon!
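
The projection is plain linear extrapolation from the 100-game sample. The sketch below reproduces the arithmetic; the function name is made up for illustration, and it assumes the per-game cost stays the same at larger scale, which a heavy tail can easily break.

```typescript
// Linear extrapolation: assumes per-game cost stays the same as in the sample.
function projectedHours(
  sampleWallSeconds: number,
  sampleGames: number,
  targetGames: number,
): number {
  const secondsPerGame = sampleWallSeconds / sampleGames;
  return (secondsPerGame * targetGames) / 3600;
}

const run1WallSeconds = 8 * 60 + 21; // ~8m21s for 100 games
const hoursFor10k = projectedHours(run1WallSeconds, 100, 10000);
const daysFor100k = projectedHours(run1WallSeconds, 100, 100000) / 24;

console.log(hoursFor10k.toFixed(1), "hours for 10k games"); // 13.9
console.log(daysFor100k.toFixed(1), "days for 100k games"); // 5.8
```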

Run 2: minimax caps and restrictions

Implemented both guardrails and tested the consequence.

Changes:

  • Update dataset bot pairing to avoid minimax-vs-minimax

    • no same-bot pair
    • no minimax-vs-minimax pair
    • explicit forced minimax-vs-minimax throws
    • Random self-play now rejects incompatible bot pairs
  • Lower dataset minimax time budgets

    • weak 40ms
    • balanced 80ms
    • strong 120ms
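
Here is a minimal sketch of how these two guardrails could look. Only the three minimax ids and their budgets come from the list above; the function names, the `minimax_` prefix check, and any other bot id are assumptions. The random picker samples from the allowed pairs directly, which is one possible way to "reject" incompatible pairs without a retry loop.

```typescript
type BotId = string;

// Dataset-only time budgets from the list above, in milliseconds.
const MINIMAX_BUDGET_MS: Record<string, number> = {
  minimax_weak: 40,
  minimax_balanced: 80,
  minimax_strong: 120,
};

// Assumption: every minimax bot id starts with "minimax_".
const isMinimax = (bot: BotId): boolean => bot.startsWith("minimax_");

// Returns a description of why the pair is not allowed, or null if it is fine.
function pairProblem(a: BotId, b: BotId): string | null {
  if (a === b) return `same-bot pair: ${a}`;
  if (isMinimax(a) && isMinimax(b)) return `minimax-vs-minimax pair: ${a} vs ${b}`;
  return null;
}

// Explicitly forced pairs fail loudly instead of being silently replaced.
function assertAllowedPair(a: BotId, b: BotId): void {
  const problem = pairProblem(a, b);
  if (problem !== null) throw new Error(`incompatible bot pair (${problem})`);
}

// Random pairing samples only from allowed ordered pairs.
function pickRandomPair(bots: BotId[], rng: () => number = Math.random): [BotId, BotId] {
  const allowed: Array<[BotId, BotId]> = [];
  for (const a of bots) {
    for (const b of bots) {
      if (pairProblem(a, b) === null) allowed.push([a, b]);
    }
  }
  if (allowed.length === 0) throw new Error("no compatible bot pair available");
  return allowed[Math.floor(rng() * allowed.length)];
}
```

With a pool of one MCTS bot and three minimax bots, every sampled game has the MCTS bot on one side, so every game also produces MCTS targets.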

Consequence vs previous 100-game run:

Wall time: 8m21s → 2m28s, a 70% reduction for the same 100 games. The minimax time caps did most of this work.

Mean: 5007ms → 1469ms. The catastrophic tail is gone. The mean dropped about 3.4x, which confirms the slow games were a small number of outliers dragging the average, not a systemic problem.

p95: 26061ms → 4734ms, a 5.5x drop. This is the most important number for a self-play pipeline: it tells you what your worst typical game costs, and 4.7s is completely acceptable.

Max: 156115ms → 12473ms. The worst game went from 156s to 12.5s. It is still the highest number, but no longer catastrophic: a 12.5s maximum against a p95 of 4.7s looks like a genuine outlier rather than a systematic problem.

Median: 592ms → 567ms, essentially unchanged, as expected. The median was always an MCTS game; the fixes only affected the slow tail.

This is a solid improvement: wall time is down 70%. The long tail is still minimax-related, but it’s no longer catastrophic.

Run 3: more optimizations

Transposition table – A cache that stores results of positions you've already searched (like best move, score, depth). If you reach the same position again via a different move order, you reuse the stored result instead of re‑searching it, saving massive time.
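
To make the idea concrete, here is a minimal sketch of such a cache for an alpha-beta search. The entry fields follow the `{ depth, value, bound, bestMove }` shape listed later in this post; the class name, method names, replacement rule, and hit counters are assumptions for illustration, not the project's code.

```typescript
type Bound = "exact" | "lower" | "upper";

interface TTEntry {
  depth: number; // remaining search depth the value was computed with
  value: number;
  bound: Bound;
  bestMove: number | null; // flat cell index, or null if unknown
}

class TranspositionTable {
  private readonly entries = new Map<bigint, TTEntry>();
  probes = 0;
  hits = 0;

  // Returns a reusable score, or undefined if the entry is missing, too
  // shallow, or its bound cannot settle the current [alpha, beta] window.
  probe(key: bigint, depth: number, alpha: number, beta: number): number | undefined {
    this.probes++;
    const entry = this.entries.get(key);
    if (!entry || entry.depth < depth) return undefined;
    const usable =
      entry.bound === "exact" ||
      (entry.bound === "lower" && entry.value >= beta) ||
      (entry.bound === "upper" && entry.value <= alpha);
    if (!usable) return undefined;
    this.hits++;
    return entry.value;
  }

  // `alpha` and `beta` are the window the node was entered with.
  store(key: bigint, depth: number, value: number, alpha: number, beta: number, bestMove: number | null): void {
    const existing = this.entries.get(key);
    if (existing && existing.depth > depth) return; // keep the deeper result
    const bound: Bound = value <= alpha ? "upper" : value >= beta ? "lower" : "exact";
    this.entries.set(key, { depth, value, bound, bestMove });
  }

  // Even an unusable entry can still suggest which move to try first.
  bestMoveFor(key: bigint): number | null {
    return this.entries.get(key)?.bestMove ?? null;
  }
}
```

The bound type matters because alpha-beta often proves only "at least this good" or "at most this good"; reusing such a value as if it were exact would corrupt the search.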

Zobrist hashing – A way to give every unique board position a random‑looking number (hash). When a move is made, you quickly update the hash by XOR’ing out the old piece's code and XOR’ing in the new one. That hash is used as the key to look up positions in the transposition table.
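
A minimal sketch of the incremental update, assuming a square board and two stone colors. The key table is filled by a small deterministic generator (splitmix64) so the same seed always gives the same keys; the class name, seed, and the choice of generator are illustrative assumptions. The real hash described below also covers turn state and board size, which this sketch leaves out.

```typescript
type StoneColor = 0 | 1; // 0 = black, 1 = white (assumed encoding)

const MASK64 = (BigInt(1) << BigInt(64)) - BigInt(1);

// splitmix64: deterministic 64-bit generator used only to fill the key table.
function makeSplitMix64(seed: bigint): () => bigint {
  let state = seed & MASK64;
  return () => {
    state = (state + BigInt("0x9E3779B97F4A7C15")) & MASK64;
    let z = state;
    z = ((z ^ (z >> BigInt(30))) * BigInt("0xBF58476D1CE4E5B9")) & MASK64;
    z = ((z ^ (z >> BigInt(27))) * BigInt("0x94D049BB133111EB")) & MASK64;
    return z ^ (z >> BigInt(31));
  };
}

class ZobristKeys {
  private readonly stoneKeys: bigint[]; // index: (y * size + x) * 2 + color
  readonly sideToMoveKey: bigint;

  constructor(readonly size: number, seed: bigint = BigInt(2026)) {
    const next = makeSplitMix64(seed);
    this.stoneKeys = Array.from({ length: size * size * 2 }, () => next());
    this.sideToMoveKey = next();
  }

  stoneKey(x: number, y: number, color: StoneColor): bigint {
    return this.stoneKeys[(y * this.size + x) * 2 + color];
  }

  // Placing a stone and passing the turn is two XORs. Calling this again with
  // the same arguments undoes the move and restores the previous hash.
  toggleMove(hash: bigint, x: number, y: number, color: StoneColor): bigint {
    return hash ^ this.stoneKey(x, y, color) ^ this.sideToMoveKey;
  }
}
```

Because XOR is commutative, two move orders that reach the same stones with the same side to move produce the same hash, which is exactly what lets the transposition table recognize them as one position.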

  • Extend MinimaxAlphaBetaOptions

  • Candidate flow:

    • Generate radius candidates as today.
    • Run direct immediate own-win scan on raw candidates without expensive ordering.
    • Run direct opponent-win block scan on raw candidates without expensive ordering.
    • Cheap-sort remaining candidates by proximity, center bias, last/killer moves, and local line potential.
    • Cap to maxCandidates before expensive threat analysis.
  • Iterative deepening:

    • Search depth 1..maxDepth.
    • Keep the best move from the last fully completed depth.
    • If deadline is reached mid-depth, return the previous completed depth’s best move.
  • Transposition table:

    • Add deterministic Zobrist hashing for board stones, side to move, phase-compatible turn state, and board size.
    • Store entries with { depth, value, bound: 'exact' | 'lower' | 'upper', bestMove }.
    • Probe before expanding; write after node evaluation.
    • Use TT bestMove first in ordering.
  • Killer moves:

    • Keep two killer moves per search ply.
    • On beta cutoff caused by a non-capture/non-terminal move, record the move for that ply.
    • Try killer moves before ordinary heuristic-ordered moves if legal.
  • Threat-space filtering:

    • If own immediate win exists, search only winning moves.
    • If opponent immediate win exists, search only blocking moves.
    • If forcing threats exist, search threat-create and threat-response moves first, capped by threatCandidateLimit.
    • If no forcing threat exists, include normal positional candidates so quiet setup moves are not lost.
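
The iterative deepening rule in the list above (keep the move from the last fully completed depth, drop a half-finished one) can be sketched independently of the search itself. In this sketch the per-depth search is passed in as a callback that returns null when it runs out of time; that callback contract, the names, and the injectable clock are assumptions made so the example stays small and testable.

```typescript
interface DepthResult {
  bestMove: number;
  value: number;
}

// Runs one fixed-depth search. Must return null if `timeUp()` became true
// before the depth was fully searched.
type DepthSearch = (depth: number, timeUp: () => boolean) => DepthResult | null;

function iterativeDeepening(
  searchAtDepth: DepthSearch,
  maxDepth: number,
  budgetMs: number,
  now: () => number = Date.now,
): { result: DepthResult | null; completedDepth: number } {
  const deadline = now() + budgetMs;
  const timeUp = () => now() >= deadline;

  let result: DepthResult | null = null;
  let completedDepth = 0;

  for (let depth = 1; depth <= maxDepth; depth++) {
    if (timeUp()) break;
    const current = searchAtDepth(depth, timeUp);
    if (current === null) break; // aborted mid-depth: keep the previous result
    result = current;
    completedDepth = depth;
  }
  return { result, completedDepth };
}
```

A usage example with a fake clock, where depth d costs 10 * d ms and the budget is 35 ms, so depth 3 is cut off and the depth-2 move is returned:

```typescript
let fakeNow = 0;
const fakeSearch: DepthSearch = (depth, timeUp) => {
  fakeNow += 10 * depth; // pretend the search took this long
  return timeUp() ? null : { bestMove: depth, value: depth * 100 };
};
const outcome = iterativeDeepening(fakeSearch, 6, 35, () => fakeNow);
console.log(outcome.completedDepth, outcome.result?.bestMove); // 2 2
```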

Feature Encoder and Threat Maps

  • Add a shared feature encoder:

    • Float32Array output with shape 13 * boardSize * boardSize.
    • Channel-major indexing: channel * boardSize * boardSize + y * boardSize + x.
  • Implement channels:

    • 0: current player stones.
    • 1: opponent stones.
    • 2: last move.
    • 3: phase0.
    • 4: phase1.
    • 5: phase2.
    • 6: normal.
    • 7: current player broadcast plane.
    • 8: first three opening stones.
    • 9: offered stones #4 and #5.
    • 10: current player open-ended threat map.
    • 11: opponent open-ended threat map.
    • 12: current player overline trap map.
  • Threat map scoring:

    • 1.00: legal immediate win.
    • 0.85: contiguous four with at least one open end.
    • 0.75: broken four with at least one open end.
    • 0.50: open three with two open ends.
    • 0.25: closed three with one open end.
    • 0.00: no meaningful threat.
  • Incremental threat maps:

    • Add ThreatMapState holding channels 10-12 and a Zobrist hash.

    • Initial state computes full maps once.

    • Child states update only cells affected by the latest move: empty cells on the 4 lines crossing the move, within five cells in both directions.

    • Cache threat maps by Zobrist hash so minimax and PUCT can reuse them.
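
Two pieces of the plan above are easy to get wrong, so here is a minimal sketch of both: the channel-major index into the flat Float32Array, and the set of cells an incremental threat-map update has to revisit after a move. The function names and the `RelativeStone` type are assumptions; only channels 0 to 2 are filled, and the threat scoring itself is not shown.

```typescript
const NUM_CHANNELS = 13;

// Channel-major flat index: channel * N * N + y * N + x.
function featureIndex(channel: number, x: number, y: number, boardSize: number): number {
  return channel * boardSize * boardSize + y * boardSize + x;
}

// Hypothetical stone record, already expressed relative to the player to move.
interface RelativeStone {
  x: number;
  y: number;
  owner: "current" | "opponent";
}

// Fills channels 0 (own stones), 1 (opponent stones), and 2 (last move).
// All other channels are left at 0 in this sketch.
function encodeStonePlanes(
  stones: RelativeStone[],
  lastMove: { x: number; y: number } | null,
  boardSize: number,
): Float32Array {
  const out = new Float32Array(NUM_CHANNELS * boardSize * boardSize);
  for (const s of stones) {
    const channel = s.owner === "current" ? 0 : 1;
    out[featureIndex(channel, s.x, s.y, boardSize)] = 1;
  }
  if (lastMove) out[featureIndex(2, lastMove.x, lastMove.y, boardSize)] = 1;
  return out;
}

const LINE_DIRECTIONS: ReadonlyArray<readonly [number, number]> = [
  [1, 0], // horizontal
  [0, 1], // vertical
  [1, 1], // diagonal
  [1, -1], // anti-diagonal
];

// On-board cells on the 4 lines through (x, y), up to `reach` cells each way.
// The caller would still skip occupied cells before rescoring threats.
function cellsToRefresh(x: number, y: number, boardSize: number, reach = 5): Array<[number, number]> {
  const cells: Array<[number, number]> = [];
  for (const [dx, dy] of LINE_DIRECTIONS) {
    for (let k = -reach; k <= reach; k++) {
      if (k === 0) continue;
      const nx = x + dx * k;
      const ny = y + dy * k;
      if (nx >= 0 && ny >= 0 && nx < boardSize && ny < boardSize) cells.push([nx, ny]);
    }
  }
  return cells;
}
```

On a 15x15 board a move in the center touches at most 40 cells instead of 225, which is where the incremental update saves its time.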

PUCT/Evaluator Integration

  • Keep PUCT synchronous.

  • Update HeuristicPolicyValueEvaluator to call the 13-channel encoder instead of recomputing unrelated scalar heuristics.

  • Use channel 10/11/12 values in action priors:

    • prefer own threats,
    • prioritize opponent threat blocks,
    • suppress current-player overline traps when noOverlines is enabled.
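
One possible way to turn the three threat channels into action priors is a weighted sum per legal cell followed by a softmax. This is a hedged sketch only: the weights, the function name, and the softmax itself are assumptions, since the post does not say how HeuristicPolicyValueEvaluator combines the channels.

```typescript
// Hypothetical weights; the real evaluator may combine the channels differently.
interface PriorWeights {
  ownThreat: number;
  blockThreat: number;
  overlinePenalty: number;
}

// `features` is the channel-major 13-channel encoding; `legalMoves` are flat
// cell indices (y * boardSize + x). Returns a prior per legal move, summing to 1.
function threatPriors(
  features: Float32Array,
  boardSize: number,
  legalMoves: number[],
  noOverlines: boolean,
  w: PriorWeights = { ownThreat: 3, blockThreat: 2.5, overlinePenalty: 4 },
): Map<number, number> {
  const plane = boardSize * boardSize;
  const logits = legalMoves.map((cell) => {
    let logit =
      w.ownThreat * features[10 * plane + cell] + // prefer own threats
      w.blockThreat * features[11 * plane + cell]; // then opponent threat blocks
    if (noOverlines) logit -= w.overlinePenalty * features[12 * plane + cell];
    return logit;
  });

  // Softmax with the max subtracted for numerical stability.
  const maxLogit = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return new Map(legalMoves.map((cell, i): [number, number] => [cell, exps[i] / total]));
}
```

Weighting own threats slightly above blocks encodes "win first, then defend"; the overline penalty only applies when the noOverlines rule is on, matching the list above.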

Assumptions

  • Keep the existing Map<string, Cell> board representation for this pass.
  • Use Zobrist hashing with deterministic 64-bit BigInt values.
  • Feature channels use current stone color/player, not Swap2 actor seat.
  • PUCT does not get full transposition-node sharing yet; minimax gets the full TT first.
  • Opening/offered masks require explicit metadata when board state alone cannot identify them.

Verification:

  100-game benchmark, same seed, after this pass:
  wall time: ~34.5s
  median:    297ms
  mean:      339.19ms
  p90:       674ms
  p95:       838ms
  max:       1142ms

So the long tail is effectively gone, and the 100-game run is about 4.3x faster wall-clock than the previous capped version.

Run 4: improved heuristics and board optimizations

Implemented:

  • Raised open-three threat level from 0.50 to 0.65
  • Added bitboard support and wired PUCT to carry bitboard state plus a transposition table
  • Switched dataset MCTS bots to use PUCT while preserving immediate win/block shortcuts

Results:

  baseline 100:
    avgMoveCount: 22.64
    mctsOpening: 186
    mctsPlay: 1267
    totalDurationMs: 138229
    avgDurationMs: 1382.29

  puct-tt 100:
    avgMoveCount: 13.01
    mctsOpening: 170
    mctsPlay: 388
    totalDurationMs: 125200
    avgDurationMs: 1252

  workers 100, 4 workers:
    wall time: 41.5s

Wall time got worse (41.5s with 4 workers vs ~34.5s in Run 3), but we're now putting in the strongest bots with deeper search.

Run 5

[Screenshot: Run 5 benchmark results, captured 2026-06-12 01:40:15]

Speed is solid. 29.1s wall vs 41.5s for the 4-worker run is a significant improvement, and it is about 17x faster than the 8m21s we started at!

The drop in average move count (22.64 → 12.14) is the most interesting signal. Games now end in roughly half as many moves. This could mean the engine is playing much stronger (decisive wins early), or that it detects terminal/winning positions earlier and cuts off search. Either way, it's a big behavioral change from baseline.

TT hit rate at 39.6% is mediocre. If the rate is hits per probe, about 60% of lookups miss, so most of the work written into the table is never reused. This suggests either the search is shallow enough that positions aren't revisited often, or the key/generation scheme causes most entries to be skipped as stale before lookup.

Wins are skewed 58:42. A persistent white-side advantage could indicate that MCTS isn't capturing the first-move advantage well, or that the evaluation/rollout has a directional bias. Worth tracking across more games to see if it holds.

That's all for today!
