# Pareto Sweep Lab Notes

## Sweep Start: 2026-03-31 ~03:00 UTC
## Current Time: 2026-03-31 ~10:00 UTC (7 hours in)
## Current Phase: Phase 2 — Deep Dive (50K steps, 3.5M games, bfloat16)
## Status: 8 Phase 2 trials running

## Procedure
- Max 4 trials per GPU, 8 total. Target ~90% VRAM.
- Tight Optuna ask/tell loop. Cost-aware trial selection (bias toward cheap gap-fillers).
- Prune trials at step 3000+ if val_loss > Pareto front + 0.3 at same param count.
- Report completions with `add_trial` (not `tell`, since the mapping is messy). Report OOMs and crashes as FAIL.
- Must reload study after session restart: `mcp__optuna__create_study("pareto-sweep", ["minimize","minimize"])`
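
The pruning rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the names `front_loss_at` and `should_prune` are hypothetical, "at same param count" is read as "the best front loss among entries with at most as many parameters", and the reference values are copied from the Phase 1 front.

```python
from bisect import bisect_right
from typing import Optional

# (params, val_loss) pairs sorted by params, copied from the Phase 1 front.
FRONT = [
    (8_434, 2.7207),
    (16_384, 2.4580),
    (32_768, 2.2868),
    (65_536, 2.1978),
    (131_072, 2.1118),
    (262_144, 2.0294),
]

PRUNE_MIN_STEP = 3000  # never prune before this step
PRUNE_MARGIN = 0.3     # allowed gap above the front


def front_loss_at(params: int, front=FRONT) -> Optional[float]:
    """Best front val_loss among entries with at most `params` parameters.

    Assumption: the front is treated as a step function, not interpolated.
    """
    idx = bisect_right([p for p, _ in front], params)
    if idx == 0:
        return None  # smaller than every front entry: nothing to compare against
    return min(loss for _, loss in front[:idx])


def should_prune(step: int, params: int, val_loss: float, front=FRONT) -> bool:
    """True if the trial is past the minimum step and too far above the front."""
    if step < PRUNE_MIN_STEP:
        return False
    reference = front_loss_at(params, front)
    if reference is None:
        return False
    return val_loss > reference + PRUNE_MARGIN
```

For example, a 65,536-param trial at step 3000 is pruned once its val_loss exceeds 2.1978 + 0.3 = 2.4978.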

## Infrastructure
- `PYTHONPATH=/opt/pawn:/opt/pawn/.venv/lib/python3.12/site-packages`
- MPS is running, but it ignores `CUDA_VISIBLE_DEVICES`
- Fixed unfreeze bug: `forward_hidden` patched in `train_adapter.py`
- Phase 2 config: `--max-games 3500000 --val-games 50000 --total-steps 50000 --eval-interval 2500 --batch-size 256 --amp-dtype bfloat16 --num-workers 4 --no-compile --log-dir /workspace/logs --local-checkpoints`
- HF backup: `hf sync /workspace hf://buckets/thomas-schweich/pawn-pareto-sweep-03-30-2026`

## Phase 1 Results (25 new trials + 13 seeds = 38 total)

### Completed Phase 1 Trials (new, sorted by val_loss)

| Strategy | Params | val_loss | top1 | Config |
|----------|--------|----------|------|--------|
| unfreeze | 8,390,656 | 1.7667 | 45.6% | layers 6,7 |
| unfreeze | 4,195,328 | 1.8663 | 43.3% | layer 7, lr=4.5e-4 |
| lora+FFN | 1,507,328 | 1.8846 | 42.9% | rank=16, --lora-ffn |
| bottleneck | 524,288 | 1.9545 | 41.2% | dim=64, L4-7 |
| lora | 2,097,152 | 1.9881 | 40.6% | rank=64, all layers |
| sparse | 3,608,760 | 2.0220 | 39.9% | density=0.86, L4-7 |
| bottleneck | 262,144 | 2.0294 | 39.5% | dim=32, L4-7 |
| lora | 2,097,152 | 2.0680 | 38.8% | rank=128, L4-7 |
| bottleneck | 131,072 | 2.1118 | 37.7% | dim=16, L4-7 (same param count as dim=8, all layers) |
| bottleneck | 65,536 | 2.1978 | 35.8% | dim=8, L4-7 |
| lora | 131,072 | 2.2520 | 34.8% | rank=8, L4-7 |
| unfreeze | 4,195,328 | 2.2858 | 34.1% | layer 7, lr=1.14e-5 |
| bottleneck | 32,768 | 2.2868 | 33.9% | dim=4, L4-7 |
| specialized_clm | 1,335,936 | 2.3207 | 32.8% | d=128, 4L, 4H |
| lora | 147,456 | 2.3382 | 33.0% | rank=6, L2-7 |
| lora | 65,536 | 2.3463 | 32.8% | rank=2, all layers |
| lora | 49,152 | 2.3487 | 32.6% | rank=3, L4-7 |
| hybrid | 401,408 | 2.4020 | 31.6% | rank=12, all layers |
| lora | 32,768 | 2.4174 | 31.3% | rank=1, all layers |
| bottleneck | 16,384 | 2.4580 | 30.3% | dim=1, all layers |
| bottleneck | 16,384 | 2.5659 | 28.1% | dim=1, all layers (different lr) |
| specialized_clm | 261,168 | 2.5910 | 26.1% | d=48, 2L, 4H |
| rosa | ~17,000 | 2.7173 | 25.2% | retro-sparse, density=0.004 |
| sparse | 8,434 | 2.7207 | 25.8% | density=0.001 |
| film | 16,748 | 2.8271 | 21.8% | with output_film |

### Phase 1 Pareto Front (seeds + new)

| Params | val_loss | top1 | Strategy |
|--------|----------|------|----------|
| 8,434 | 2.7207 | 25.8% | sparse |
| 16,384 | 2.4580 | 30.3% | bottleneck |
| 32,768 | 2.2868 | 33.9% | bottleneck |
| 65,536 | 2.1978 | 35.8% | bottleneck |
| 131,072 | 2.1118 | 37.7% | bottleneck |
| 262,144 | 2.0294 | 39.5% | bottleneck |
| 524,000 | 1.9200 | 41.7% | bottleneck (seed, 100K steps) |
| 1,000,000 | 1.8500 | 43.5% | bottleneck (seed, 100K steps) |
| 1,507,328 | 1.8846 | 42.9% | lora+FFN |
| 7,800,000 | 1.7759 | 45.4% | bottleneck (seed) |
| 8,390,656 | 1.7667 | 45.6% | unfreeze |
| 10,000,000 | 1.5634 | 50.5% | bottleneck (seed, 100K steps) |
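
As a sanity check, the front can be recomputed from the rows above with a strict dominance filter (fewer params and lower val_loss are both better). This is a minimal sketch; `pareto_front` is a hypothetical helper, not the sweep's plotting code. Under strict dominance the lora+FFN row drops out, because the 1,000,000-param seed has both fewer params and lower loss. One possible reading is that the table keeps it because the seed trained for 100K steps and is not a like-for-like comparison.

```python
def pareto_front(points):
    """Return points not dominated on (params, val_loss), both minimized."""
    front = []
    best_loss = float("inf")
    # Sort by params, then loss, so each kept point must strictly improve the loss.
    for params, loss, label in sorted(points, key=lambda p: (p[0], p[1])):
        if loss < best_loss:
            front.append((params, loss, label))
            best_loss = loss
    return front


# Rows copied from the table above.
ROWS = [
    (8_434, 2.7207, "sparse"),
    (16_384, 2.4580, "bottleneck"),
    (32_768, 2.2868, "bottleneck"),
    (65_536, 2.1978, "bottleneck"),
    (131_072, 2.1118, "bottleneck"),
    (262_144, 2.0294, "bottleneck"),
    (524_000, 1.9200, "bottleneck (seed, 100K steps)"),
    (1_000_000, 1.8500, "bottleneck (seed, 100K steps)"),
    (1_507_328, 1.8846, "lora+FFN"),
    (7_800_000, 1.7759, "bottleneck (seed)"),
    (8_390_656, 1.7667, "unfreeze"),
    (10_000_000, 1.5634, "bottleneck (seed, 100K steps)"),
]

front = pareto_front(ROWS)
dropped = [row for row in ROWS if row not in front]
print(dropped)
```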

### Key Phase 1 Findings

1. **Parameter budget accounts for 94% of the variation** (Optuna param importance). Strategy choice accounts for <1%.
2. **Bottleneck dominates** the Pareto front from 16K-10M params.
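3. **LoRA+FFN at 1.5M** is the only non-bottleneck entry on the front between 16K and 7.8M params (vl=1.88 vs 1.85 for the bottleneck 1M seed, which trained for 100K steps). Outside that range, sparse (8,434) and unfreeze (8.39M) also sit on the front.
4. **Bottleneck beats lora** at every matched param count: 33K (2.29 vs 2.42), 65K (2.20 vs 2.35), 131K (2.11 vs 2.25).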
5. **Unfreeze** is powerful at 4-8M but impractical for hypernetworks.
6. **Film, rosa, specialized_clm** are not competitive at any scale.
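
The matched-count comparison in finding 4 can be reproduced from the Phase 1 table with a short script. This is a sketch only: `matched_gaps` is a hypothetical helper, and the rows are the bottleneck and lora trials that share an exact param count.

```python
# (strategy, params, val_loss) rows copied from the Phase 1 table.
TRIALS = [
    ("bottleneck", 131_072, 2.1118),
    ("bottleneck", 65_536, 2.1978),
    ("bottleneck", 32_768, 2.2868),
    ("lora", 131_072, 2.2520),
    ("lora", 65_536, 2.3463),
    ("lora", 32_768, 2.4174),
]


def matched_gaps(trials, a="bottleneck", b="lora"):
    """val_loss gap (b minus a) at each param count where both strategies ran."""
    best = {}
    for strategy, params, loss in trials:
        key = (strategy, params)
        best[key] = min(loss, best.get(key, float("inf")))
    counts_a = {p for s, p in best if s == a}
    counts_b = {p for s, p in best if s == b}
    return {p: round(best[(b, p)] - best[(a, p)], 4) for p in sorted(counts_a & counts_b)}


gaps = matched_gaps(TRIALS)
for params, gap in gaps.items():
    print(f"{params:>8,}: lora is {gap:+.4f} above bottleneck")
```

A positive gap means lora has the higher (worse) val_loss at that param count.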

## Phase 2 — Running Trials

| Trial | Strategy | Params | Config | GPU |
|-------|----------|--------|--------|-----|
| p2_bn65k | bottleneck | 65,536 | dim=8, L4-7 | 1 |
| p2_lora65k | lora | 65,536 | rank=2, all | 1 |
| p2_bn262k | bottleneck | 262,144 | dim=32, L4-7 | 0 |
| p2_lora_ffn | lora+FFN | 1,507,328 | rank=16, --lora-ffn | 0 |
| p2_bn524k | bottleneck | 524,288 | dim=64, L4-7 | 0 |
| p2_bn33k | bottleneck | 32,768 | dim=4, L4-7 | 0 |
| p2_lora131k | lora | 131,072 | rank=8, L4-7 | 0 |
| p2_bn131k | bottleneck | 131,072 | dim=16, L4-7 | 0 |
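
A quick check of the GPU column against the "max 4 trials per GPU" rule from the Procedure section. The sketch below uses a hypothetical `over_cap` helper and the assignments exactly as recorded in the table. As recorded, GPU 0 carries 6 trials and GPU 1 carries 2; the notes do not say whether this reflects actual placement or a stale column.

```python
from collections import Counter

MAX_TRIALS_PER_GPU = 4  # from the Procedure section

# Trial -> GPU index, copied from the table above.
RUNNING = {
    "p2_bn65k": 1,
    "p2_lora65k": 1,
    "p2_bn262k": 0,
    "p2_lora_ffn": 0,
    "p2_bn524k": 0,
    "p2_bn33k": 0,
    "p2_lora131k": 0,
    "p2_bn131k": 0,
}


def over_cap(assignments, cap=MAX_TRIALS_PER_GPU):
    """Return {gpu: trial_count} for every GPU holding more than `cap` trials."""
    counts = Counter(assignments.values())
    return {gpu: n for gpu, n in counts.items() if n > cap}


print(over_cap(RUNNING))
```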

## Phase 2 Questions to Answer
1. Does bottleneck still dominate lora at 65K and 131K with 50K steps?
2. How much does lora+FFN improve at 1.5M with more training?
3. What val_loss can bottleneck 524K achieve? (seed was 1.92 at 100K steps)
4. How low can bottleneck 33K go?

## What To Do Next
1. **6 Phase 2 trials still running** (launched with `nohup`; expected to finish on their own about 2h after the 18h mark)
2. When they complete, report results to Optuna via `add_trial`
3. Regenerate Pareto front plot and update sweep_report.md with final Phase 2 numbers
4. Run `hf sync /workspace hf://buckets/thomas-schweich/pawn-pareto-sweep-03-30-2026`
5. If time permits: Phase 3 (100K steps) on the 3-4 most promising Pareto-optimal configs
6. Run `scripts/eval_accuracy.py` on best checkpoints for per-ply accuracy curves
7. After any session restart, reload the Optuna study first: `mcp__optuna__create_study("pareto-sweep", ["minimize","minimize"])`

## Deliverables Status
- [x] Pareto front plot: `/workspace/plots/pareto_front_final.png`
- [x] Sweep report: `/workspace/sweep_report.md`
- [x] Phase 1 status: `/workspace/sweep_status.md`
- [x] Lab notes: `/workspace/lab_notes.md`
- [x] Optuna DB: `/workspace/optuna-storage/study.db`
- [x] HF backup: synced to `thomas-schweich/pawn-pareto-sweep-03-30-2026`
- [ ] Per-ply accuracy curves (not started)
- [ ] Phase 3 final runs (skipped — insufficient time)
