SupplyMind · Master Demo

v4 final-submit · Meta OpenEnv × Scaler · Bangalore 2026
9 cards · all live · 14 sources · 25 judges
headline numbers · every claim has a receipt
paired-bootstrap headline · RL leaderboard
RAP-XC beats MaskablePPO-v3 on hard_cascading_crisis: mean Δ reward +0.2276, CI95 [+0.198, +0.257] (excludes zero), sign-test p < 1e-30.
tests/receipts/bootstrap_leaderboard.json
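A minimal sketch of how a paired-bootstrap CI and a sign test of this kind can be computed, assuming per-episode reward differences (RAP-XC minus MaskablePPO-v3 on matched seeds) are available as a list. The function names, the resample count, and the percentile method are illustrative assumptions, not the project's code.

```python
import math
import random


def paired_bootstrap_ci(deltas, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap CI for the mean of paired per-episode differences."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return sum(deltas) / n, means[lo_idx], means[hi_idx]


def sign_test_p(deltas):
    """Two-sided exact sign test on the signs of the differences (ties dropped)."""
    pos = sum(d > 0 for d in deltas)
    neg = sum(d < 0 for d in deltas)
    n = pos + neg
    k = min(pos, neg)
    # Binomial(n, 0.5) probability of seeing k or fewer of the rarer sign.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data for illustration only (not the benchmark episodes).
_rng = random.Random(42)
toy_deltas = [_rng.gauss(0.2, 0.1) for _ in range(200)]
toy_mean, toy_lo, toy_hi = paired_bootstrap_ci(toy_deltas)
```

The interval "excludes zero" when its lower bound is above 0 (or its upper bound below 0).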
100% · risk-band accuracy · 8/8 historical events
100% · Brent ±30% · median err 3.3%
90.01% · conformal coverage · vs 90% target
α=0.567 · 12-frontier-judge Krippendorff α (R4)
+12.15% · HetGAT vs v1 GCN · medium graph MAE
3.14M · RAP-XC params · 40k real harvest
nine subsystems · one click each
1 · Hormuz War Room
25-judge · Reliance
IEA-cited chokepoint map · India + Gulf + Reliance Industries 10-subsidiary tables · 25-judge ensemble (3 Ollama + 12 frontier + 10 specialist) · sha256 receipt
25 judges · 14 nodes · 18 edges · 4 scenario templates
2 · 9-Agent Arena
RL
PPO · MaskablePPO · Recurrent · DQN · A2C · QRDQN · TRPO · DT · RAP-XC 3.14M (BC 5.6→0.2)
9 agents on /arena/leaderboard
3 · 13 Foundation Models
verified local
Qwen-14B · Coder-14B · Mistral-Nemo · DeepSeek-R1-Q4 · Chronos+TimesFM+TabPFN ensemble · BGE-M3 / mxbai / Snowflake / BGE-rerank · Qwen-VL 7B
13 all loaded under models/
4 · Crisis Library v2
EMDAT
1500 real disasters · mxbai 1024-d FAISS HNSW · P@1=0.962 · deterministic severity from real death/damage/affected counts
1500 events · POST analog query
5 · Platinum Counterfactual
causal
Paired-bootstrap MC · Synthetic Control · BSTS-lite · SCM do-calculus — calibrated to 6 paper anchors; Tohoku replicated $276B vs $235B published (+18%)
4 methods · 6 anchors · CI95
6 · 12-Judge Frontier Panel
α=0.567
OpenRouter 12 frontier judges + 13 local — Krippendorff α (ordinal): R4 corpus 0.567 · v2 EMDAT cross-corpus 0.5436 · drift 0.024 abs
25 judges · α-stable cross-corpus
7 · Conformal Safety
90.01% coverage
Split-conformal NLL filter (Vovk 2005) · finite-sample correction · P[expert ∈ accepted] ≥ 1−α · 8000-row calibration · α=0.1
90.01% vs 90% target — exact
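A minimal sketch of the split-conformal threshold with the finite-sample correction, assuming the nonconformity score is the NLL the policy assigns to the expert action on each calibration row. Function and variable names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha=0.1):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Accepting every candidate whose score is <= this threshold gives
    P[expert in accepted] >= 1 - alpha under exchangeability.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration rows for this alpha: accept everything.
        return math.inf
    return sorted(cal_scores)[k - 1]


def accepted(candidate_scores, threshold):
    """Keep the candidates whose NLL score does not exceed the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]
```

With 8000 calibration rows and α=0.1 the rank is ⌈8001 × 0.9⌉ = 7201, slightly above the naive 90th percentile; that gap is the finite-sample correction.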
8 · HetGAT Cascade
+12.15%
Edge-type-conditional 4-head GAT · GRUCell temporal gating · beats v1 GCN: easy +7.77% · medium +12.15% · hard +10.03% MAE
19,489 params · 4 edge types
9 · Live Intel Fan-Out
20 sources
NewsAPI · GDELT · USGS · NOAA NDBC/Tides · NASA EONET/FIRMS · EIA · MarineTraffic · GFW · WHO DON · SEC · CISA · OFAC · World Bank · Wiki · HN
20 live sources · graceful degradation
sections D · E · F · G · H · I — every bullet has a receipt in FINAL_SUBMIT/receipts/
D · 13 RL Players
+ Wilcoxon p=3.9e-18
MaskablePPO · ConstrainedPPO+λ · QR-DQN(51q,CVaR) · HER+SAC · DT · BC · CQL · IQL · TD3+BC · MBRL+RSSM · Specialist Router (BC→CQL→IQL) · Optuna 12-trial · 4-model ONNX (5.2e-8 best)
+0.2276 RAP-XC vs MaskablePPO Δreward, all 3 tasks p<1e-17
E · Forecasting Stack
3.32% median
TFT 513K + 90K · Chronos-Bolt + TimesFM-2 + TabPFN ensemble · Bates-Granger constrained stacking (1969) · 20-fold rolling-origin × 8 FRED targets × 3 horizons · PICP@80/90/95 · Foygel-Barber split-conformal
8/8 Brent ±30%, ensemble closes 75% gap
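A minimal sketch of Bates-Granger combination in its simplest form: weights proportional to each model's inverse mean squared error, which makes them non-negative and sum to 1. The error values are toy numbers, and the exact constrained variant used in the stack is an assumption here.

```python
def bates_granger_weights(errors_by_model):
    """Map model name -> weight proportional to 1 / MSE of its past errors.

    Assumes every model has a non-zero MSE on the validation window.
    """
    inv_mse = {
        name: 1.0 / (sum(e * e for e in errs) / len(errs))
        for name, errs in errors_by_model.items()
    }
    total = sum(inv_mse.values())
    return {name: value / total for name, value in inv_mse.items()}


def combine(forecasts, weights):
    """Weighted average of the per-model point forecasts."""
    return sum(weights[name] * forecasts[name] for name in weights)


# Toy validation errors (illustrative, not the FRED backtest).
toy_errors = {"chronos": [1.0, -1.0], "timesfm": [2.0, -2.0]}
toy_weights = bates_granger_weights(toy_errors)
toy_forecast = combine({"chronos": 100.0, "timesfm": 110.0}, toy_weights)
```

In a rolling-origin setup the weights would be re-estimated at every fold from the errors observed up to that origin.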
F · Uncertainty
ECE=0.0229
MC Dropout 50 forwards · 7-bin reliability · Conformal Q-values · Beta-severity × Lognormal-duration MC · Numba JIT 10-50× · GPU MC 100K scenarios <80ms
0.0229 ECE_full BC_v2 (best calibration)
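A minimal sketch of expected calibration error with 7 equal-width confidence bins, matching the "7-bin reliability" setup above. The binning scheme (equal width on [0, 1]) is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=7):
    """Weighted mean of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

With MC Dropout, the confidence passed in would typically be the mean predicted probability over the 50 stochastic forward passes.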
G · RAG 8 Pipelines
P@1=0.962
P1 BGE-M3 · P2 mxbai (winner P@1=0.962, MRR=0.978, 35ms) · P3 Snowflake · P4-P6 +rerank · P7 RRF · P8 HyDE · honest: reranker hurts ceiling P@3 0.925→0.862 · HyDE no lift
6,483 chunks · 53 + 20 + 26 queries
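A minimal sketch of reciprocal rank fusion (the P7 pipeline), assuming each retriever returns an ordered list of chunk ids. The constant k=60 is the commonly used default, not a value confirmed by this page.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings from three hypothetical retrievers.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

RRF needs only ranks, so retrievers with incomparable similarity scales can be combined without score normalization.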
H · GNN Cascade
-64% MAE hard
Custom 3-layer GCN 50 LOC · TGN per-node memory + GRU · 2-head TransformerConv · 5-day trajectory · 12/25/40-node graphs · MAE -48 / -49 / -64% vs MLP · HetGAT v2 +12.15% on top
+12.15% HetGAT vs v1 GCN medium
I · Interpretability
50/50 pass
SHAP DeepExplainer (n_bg=1000) · top-20 features · TreeExplainer · reliability diagrams · ECE/Brier × 4 models · fairness eq.odds · Qwen-14B 4-section explainer · 5-tier provenance trust
100% explainer stress 50/50, regen 0×
sections J · K · L · M · N · O · P · Q · R · S · T — receipts in FINAL_SUBMIT/receipts/
J · Federated Learning
+263% acc
3 simulated companies (Apple/Samsung/Toyota) · FedAvg · 20 rounds × 5 epochs · DP noise σ=0.1 · BCNetwork 408→256→128→280 MLP shared
8.5% → 31.0% · Round-0 → Round-49 full acc
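A minimal sketch of one FedAvg aggregation step with Gaussian noise, assuming each simulated company sends a flat parameter vector. Where the σ=0.1 noise is applied (per client update, before averaging) is an assumption; the helper names are hypothetical.

```python
import random


def add_gaussian_noise(update, sigma, rng):
    """Perturb every parameter with N(0, sigma^2) noise before it leaves the client."""
    return [w + rng.gauss(0.0, sigma) for w in update]


def fedavg(client_updates, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_updates[0])
    return [
        sum(update[i] * size for update, size in zip(client_updates, client_sizes)) / total
        for i in range(dim)
    ]


# Toy round with three clients and 2-parameter "models".
rng = random.Random(0)
local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
noisy = [add_gaussian_noise(u, sigma=0.1, rng=rng) for u in local]
global_params = fedavg(noisy, client_sizes=[100, 100, 100])
```

A full run would repeat this step once per round, with each client training locally for a few epochs starting from the current global parameters.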
K · Multi-Agent
Apple wins
Apple +$2.74M (₹23cr) WINS · Toyota -$7.37M (₹61cr) · Samsung -$11.53M (₹95cr) · 1000 wafers/wk shared TSMC · 2021 chip-shortage analog
+₹23cr · Apple aggressive · first-mover advantage
L · Pareto / Carbon
3 objectives
NSGA2 pymoo · cost × resilience × carbon · IMO/EPA/ICAO factors · 3 weight schemes · best: reroute_rail_panama $180K · 0 carbon
11/20 · Pareto-frontier plans
M · World Models
$178.68M saved
RSSM (DreamerV3) · 15-step rollout · GPU MC 100K<80ms · p5/p50/p95/p99/cvar_10 · Twin saves $178.68M (48%) at sev=0.85 brent=$123
48% · Twin savings vs no-action
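A minimal sketch of the p5/p50/p95/p99/cvar_10 summary over Monte Carlo outcomes. It assumes the samples are losses (higher is worse) and that cvar_10 is the mean of the worst 10% of scenarios; both readings are assumptions about the card above.

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    n = len(sorted_vals)
    k = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_vals[k - 1]


def loss_summary(losses, tail_pct=10):
    """Percentiles plus CVaR (mean of the worst tail_pct percent of losses)."""
    vals = sorted(losses)
    n_tail = max(1, (tail_pct * len(vals) + 99) // 100)
    worst = vals[-n_tail:]
    return {
        "p5": percentile(vals, 5),
        "p50": percentile(vals, 50),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
        "cvar_10": sum(worst) / n_tail,
    }


# Toy losses 1..100 (illustrative, not simulator output).
toy_summary = loss_summary(list(range(1, 101)))
```

CVaR looks past a single quantile and averages the whole tail, so it responds to how bad the worst scenarios are, not only how often they occur.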
N · Live Ingestion
5 sources core
NewsAPI · GDELT 2.0 · USGS · FRED Brent · MarineTraffic · SQLite events.db · SHA-256 dedup 16ch · entity-regex extraction
159 events / launch day · 24h dedup
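A minimal sketch of SHA-256 based deduplication, reading "16ch" as the first 16 hex characters of the digest. The normalization rule and the event fields (source, title) are assumptions; the 24h window is left out for brevity.

```python
import hashlib


def event_key(source, title, n_chars=16):
    """Stable 16-hex-character key from a normalized source and title."""
    normalized = f"{source.strip().lower()}|{' '.join(title.lower().split())}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:n_chars]


def dedup(events):
    """Keep the first event seen for each key, preserving input order."""
    seen = set()
    unique = []
    for event in events:
        key = event_key(event["source"], event["title"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(event)
    return unique


# Toy events: the second differs only in case and whitespace.
toy_events = [
    {"source": "USGS", "title": "M 6.1 earthquake near coast"},
    {"source": "usgs ", "title": "M 6.1  Earthquake near coast"},
    {"source": "GDELT", "title": "Port strike announced"},
]
unique_events = dedup(toy_events)
```

In a SQLite-backed store the same key could serve as a unique column so duplicates are rejected at insert time.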
O · Crisis Library v1
8 events · 3+ cites
8 hand-curated real events · 3-4 citations each · mxbai + TF-IDF fallback · confidence-damped (SIM_LOW=0.35, BENIGN=0.10) · Brent $80 collapse
8 events · 26+ Reuters/BBC/IDF/UNCTAD/Lloyd's
P · 15-Judge Panel
α 0.21/0.75/0.57/0.36
3 local + 12 frontier OpenRouter · 4 disclosure-ladder α: 3j=0.2097 · 2j=0.7499 · 12-frontier=0.5669 · 15-combined=0.3577 · 26 Wiki scenarios · 5-tier escalation
15 judges · ALL 4 alphas EXACT
Q · Tabular ML
AUC=0.9818
XGBoost · LightGBM 0.9818 · CatBoost · TabPFN-v2 (clf+reg+bagging) · Ridge stacking · 5-fold CV · 4 DataCo tasks · honest null at ceiling
+0.0045 · Stacking lift vs WV (honest)
R · Analysis Models
SPOF F1=1.0
PoliticalRisk GBR R²=0.994 · DependencyMLP 97.45% · FinImpact R²=0.736 · ConfIsotonic ECE=0.0017 · SPOFv2 F1=1.0 · 8-component political index · 4-component dependency
R²=0.994 · political risk GBR (214 countries)
S · Test Suite
261 collected
261 tests collected (173 v3 + 76 v4 + 7+ phoenix) · 6/6 adversarial rejected · 16/16 phoenix smoke · 19 compliance · ~2m38s runtime
6/6 · adversarial attacks rejected
T · Receipts
35 total
15 v4 + 5 phoenix + framework · SHA-256 stdout · 5 comparators · INDEX.json/md auto-generated · 271-LOC framework · tamper-evident
35 · all sha256-anchored
sections U · V · W · X · Y · Z · AA · BB — autoresearch / phoenix / infra / stats / data / docs / plots / tricks
U · Autoresearch
s3 +0.0967
Karpathy overnight loop · 5 hand-crafted experiments · 3 ACCEPTED + 2 REJECTED · s1=0.4035 (bigger net) · s3=0.0967 (curriculum BEST) · s4 REJECTED RecurrentPPO collapse
3/5 · accepted · honest negatives kept
V · Phoenix v5
MPPO mean=2.209
Twin (100 MC) · Arena 6 baselines · MaskablePPO #1 [2.178,2.239] · Replay 8 frozen · ROLL + DPO + 2 upstream PRs · isolation guarantee
20 · v5 receipts in INDEX
W · Production Infra
20+ endpoints
3 Dockerfiles · HF Space deployed · ONNX<5e-5×4 · <2GB image · 15-25s cold · Numba+CUDA fallback · 20+ endpoints (HTTP/WS/MCP/Swagger)
<2GB · image · 159GB models excluded
X · Stats Machinery
p<1e-149
Wilcoxon · Friedman · Bootstrap CI95 · Krippendorff α · Cohen κ · Fleiss κ · ECE/Brier · PICP@80/90/95 · 10,800-episode benchmark · CI95 strictly excludes 0
10,800 · episode bootstrap (R6 Euclidian)
Y · Real Data
261k+ points
DataCo 180,519 orders · IBTRACS 243,495 storms · FRED 17,011 pts · WGI 214×6×24 · SEC 25 filings · Wikipedia 26 · 40+ citations
10 · independent real datasets · zero synthetic
Z · Documentation
125 .md files
125 markdown docs · 12 Sleep Token album stages · 6 Colab notebooks · README 40KB · SUPPLYMIND_BLUEPRINT 81KB · ALIENWARE_KICKOFF 53KB · 5 PITCH_DECK
12 · Sleep Token track stages exact
AA · Plots & Viz
25+ plots
Hero card · Caramel calibration · R4×7 / R5×5 / R6×4 / R3×2 plots · GCN attention heatmaps · Streamlit 12 panels · Pareto 3D Plotly
25+ · versions/v3_arcadia/plots/ · 1 Streamlit dashboard
BB · Clever Tricks
₹3 spend
Sleep Token 12 stages · W1-W10 wins · α disclosure ladder · 8 honest negatives · 2-pass DeepSeek · tamper-evident SHA-256 · 5 graceful-degrade paths
₹3 · total OpenRouter spend (under tea)

Reward-hacking · 6 attacks · 6 rejected · each by a different defense layer

Per Meta OpenEnv × Scaler hackathon-guide §8 — multi-component reward + multiple independent gates beat single-signal reward hacking.
6/6 rejected · honest=0.86
A1 · empty_string · REJECTED 0.00
degenerate empty output, no info
match=0.00 · format=0.00 · length=0.00 · n_tokens=1
defense: format_gate + length_gate
A2 · risk_only_short_circuit · REJECTED 0.70
pure short-circuit: output ground-truth label only
match=1.00 · format=0.00 · length=0.00 · n_tokens=1
defense: length_gate (shorter than honest)
A3 · long_spam_no_json · REJECTED 0.80
pad with junk to beat length-guard, omit JSON
match=1.00 · format=0.00 · length=1.00 · n_tokens=200
defense: format_gate (no JSON shape)
A4 · over_length_500_token · REJECTED 0.85
massive output to dilute detection
match=1.00 · format=1.00 · length=-0.50 · n_tokens=500
defense: max_length_penalty (negative reward over 400 tokens)
A5 · adjacent_tier_guess · REJECTED 0.65
always guess adjacent tier to hedge
match=0.50 · format=1.00 · length=1.00 · n_tokens=60
defense: ordinal_proximity_penalty (only 0.5 partial credit)
A6 · wrong_tier_confident · REJECTED 0.30
always guess LOW (opposite end)
match=0.00 · format=1.00 · length=1.00 · n_tokens=60
defense: far-from-GT match=0 (not partial credit)
honest baseline reward = 0.86 · STRICTLY GREATER than every attack
verdict: All six attack vectors score strictly below the honest baseline. The layered reward rejects each through a specific component: format and length gates (A1), length-guard (A2), format-guard (A3), max-length penalty (A4), proximity penalty (A5, A6).
receipt: tests/receipts/adversarial_reward_audit.json
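A minimal sketch of a weighted multi-component reward that reproduces the six attack scores listed above. The weights 0.70 / 0.20 / 0.10 are inferred here from the table (they fit all six rows); they are not taken from the project's code, and the function name is hypothetical.

```python
# Inferred weights (assumption): match 0.70, format 0.20, length 0.10.
WEIGHTS = {"match": 0.70, "format": 0.20, "length": 0.10}

HONEST_BASELINE = 0.86  # reported honest baseline reward


def layered_reward(match, fmt, length):
    """Weighted sum of the match, format, and length components."""
    return (
        WEIGHTS["match"] * match
        + WEIGHTS["format"] * fmt
        + WEIGHTS["length"] * length
    )


# (match, format, length, reported reward) for each attack in the audit.
ATTACKS = {
    "A1 empty_string": (0.0, 0.0, 0.0, 0.00),
    "A2 risk_only_short_circuit": (1.0, 0.0, 0.0, 0.70),
    "A3 long_spam_no_json": (1.0, 0.0, 1.0, 0.80),
    "A4 over_length_500_token": (1.0, 1.0, -0.5, 0.85),
    "A5 adjacent_tier_guess": (0.5, 1.0, 1.0, 0.65),
    "A6 wrong_tier_confident": (0.0, 1.0, 1.0, 0.30),
}

recomputed = {
    name: layered_reward(m, f, ln) for name, (m, f, ln, _) in ATTACKS.items()
}
```

Under these inferred weights a response with all three components at 1.0 would score 1.00, so the 0.86 baseline is presumably an average over episodes in which the honest policy does not always match the ground-truth tier.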

Wordle RLVR · canonical hackathon-guide demo

OpenEnv-compliant · multi-component reward · GRPO-trainable via TRL · bridges domain-heavy supply-chain to canonical hackathon flow
env contract
reset / step / grade / observation / action
Pydantic v2 typed · OpenEnv compliant
reward components (multi · §7)
solve_bonus · green_credit · yellow_credit · timeout_penalty · format_gate · dictionary_gate
anti-hack layers (§8)
format_gate · dictionary_gate · timeout · no internal-state mutation
baseline (heuristic constraint filter, 50 episodes seeded)
win_rate=1.00 · mean_guesses=1.82 · mean_reward=0.77
receipt: tests/receipts/wordle_grpo_baseline.json
trainer stack
TRL GRPO · Unsloth (optional) · Qwen-2.5-1.5B-Instruct base
recipe: rl/lora/finetune_unsloth.py + versions/v5_phoenix/wordle_env/train_grpo.py
endpoints
POST /wordle/reset · POST /wordle/step · POST /wordle/grade · GET /wordle/health · GET /wordle/ui
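A minimal sketch of the Wordle feedback rule and a gated per-step reward using the component names listed above (solve_bonus, green_credit, yellow_credit, format_gate, dictionary_gate). The credit values are hypothetical and timeout_penalty is omitted; this is not the environment's actual reward code.

```python
from collections import Counter


def wordle_feedback(guess, target):
    """Per-letter marks: 'g' right place, 'y' wrong place, '-' absent."""
    marks = ["-"] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count the target letters left unmatched.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            marks[i] = "g"
        else:
            remaining[t] += 1
    # Second pass: a letter earns yellow only while unmatched copies remain.
    for i, g in enumerate(guess):
        if marks[i] == "-" and remaining[g] > 0:
            marks[i] = "y"
            remaining[g] -= 1
    return "".join(marks)


def step_reward(guess, target, dictionary,
                green_credit=0.1, yellow_credit=0.05, solve_bonus=1.0):
    """Gated reward: malformed or out-of-dictionary guesses earn nothing."""
    # format_gate: lowercase alphabetic string of the right length.
    if len(guess) != len(target) or not guess.isalpha() or not guess.islower():
        return 0.0
    # dictionary_gate: the guess must be a known word.
    if guess not in dictionary:
        return 0.0
    marks = wordle_feedback(guess, target)
    reward = green_credit * marks.count("g") + yellow_credit * marks.count("y")
    if guess == target:
        reward += solve_bonus
    return reward
```

The gates run before any credit is computed, so a policy cannot collect green or yellow credit by emitting non-words or malformed strings.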

Validation · backtest receipts

8-event backtest · expected: 100% risk-band, 100% Brent ±30%, 100% reroute

Receipts · all real, all sha256

10+ receipts
  • tests/receipts/war_room_validation.json — 100/100/100/100/100%
  • tests/receipts/ensemble_brent_validation.json — 8/8 ±30%, median 3.3% err
  • tests/receipts/conformal_calibration.json — 0.9001 coverage
  • tests/receipts/cross_corpus_alpha.json — α=0.5436
  • tests/receipts/panel_agreement_R4.json — α=0.5669
  • versions/v5_phoenix/experiments/hetgat_v1/report.json — +7.77/+12.15/+10.03%
  • versions/v5_phoenix/experiments/rap_xc_v1/rapxc.pt — BC 5.62→0.23