SupplyMind · Master Demo

v4 final-submit · Meta OpenEnv × Scaler · Bangalore 2026

9 cards · all live · 14 sources · 25 judges

headline numbers · every claim has a receipt

paired-bootstrap headline · RL leaderboard

RAP-XC beats MaskablePPO-v3 on hard_cascading_crisis: mean Δ reward +0.2276, CI95 [+0.198, +0.257], sign-test p < 1e-30 — CI excludes zero.

tests/receipts/bootstrap_leaderboard.json
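
The headline above reports a paired-bootstrap CI95 and a sign-test p-value on per-episode reward differences. The following is a minimal sketch of both statistics, assuming episodes are paired (for example by seed) between the two agents. The function names and defaults are hypothetical and are not taken from the receipt; the sketch does not reproduce the reported numbers.

```python
import math
import random

def paired_bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(a - b); pairs are resampled together."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int(((1 - level) / 2) * n_boot)
    hi_idx = int((1 - (1 - level) / 2) * n_boot) - 1
    return sum(diffs) / n, means[lo_idx], means[hi_idx]

def sign_test_p(a, b):
    """Two-sided exact sign test on paired differences (ties are dropped)."""
    wins = sum(1 for x, y in zip(a, b) if x > y)
    losses = sum(1 for x, y in zip(a, b) if x < y)
    n = wins + losses
    k = min(wins, losses)
    # Probability of k or fewer of the rarer sign under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A CI that "excludes zero" corresponds to the lower bound returned by `paired_bootstrap_ci` being strictly positive.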

100%

risk-band accuracy
8/8 historical events

100%

Brent ±30%
median err 3.3%

90.01%

conformal coverage
vs 90% target

α=0.567

12-frontier-judge
Krippendorff α (R4)

+12.15%

HetGAT vs v1 GCN
medium graph MAE

3.14M

RAP-XC params
40k real harvest

nine subsystems · one click each

1 · Hormuz War Room

25-judge · Reliance

IEA-cited chokepoint map · India + Gulf + Reliance Industries 10-subsidiary tables · 25-judge ensemble (3 Ollama + 12 frontier + 10 specialist) · sha256 receipt

25 judges · 14 nodes · 18 edges · 4 scenario templates

2 · 9-Agent Arena

PPO · MaskablePPO · Recurrent · DQN · A2C · QRDQN · TRPO · DT · RAP-XC 3.14M (BC 5.6→0.2)

9 agents on /arena/leaderboard

3 · 13 Foundation Models

Qwen-14B · Coder-14B · Mistral-Nemo · DeepSeek-R1-Q4 · Chronos+TimesFM+TabPFN ensemble · BGE-M3 / mxbai / Snowflake / BGE-rerank · Qwen-VL 7B

13 all loaded under models/

4 · Crisis Library v2

1500 real disasters · mxbai 1024-d FAISS HNSW · P@1=0.962 · deterministic severity from real death/damage/affected counts

1500 events · POST analog query

5 · Platinum Counterfactual

Paired-bootstrap MC · Synthetic Control · BSTS-lite · SCM do-calculus — calibrated to 6 paper anchors; Tohoku replicated $276B vs $235B published (+18%)

4 methods · 6 anchors · CI95

6 · 12-Judge Frontier Panel

OpenRouter 12 frontier judges + 13 local — Krippendorff α (ordinal): R4 corpus 0.567 · v2 EMDAT cross-corpus 0.5436 · drift 0.024 abs

25 judges · α-stable cross-corpus

7 · Conformal Safety

90.01% coverage

Split-conformal NLL filter (Vovk 2005) · finite-sample correction · P[expert ∈ accepted] ≥ 1−α · 8000-row calibration · α=0.1

90.01% vs 90% target — exact
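
The finite-sample correction mentioned in this card can be sketched as follows. This is an illustrative implementation of standard split-conformal thresholding, assuming lower nonconformity scores (for example NLL) are better; the function names are hypothetical and the project's own filter may differ.

```python
import math

def conformal_rank(n_cal, alpha):
    """1-indexed rank of the calibration score used as the threshold."""
    return math.ceil((n_cal + 1) * (1 - alpha))

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = conformal_rank(n, alpha)
    if k > n:
        return math.inf  # too few calibration points: accept everything
    return sorted(cal_scores)[k - 1]

def accept(score, threshold):
    """Keep a candidate whose nonconformity score is within the threshold."""
    return score <= threshold
```

With 8000 calibration rows and α=0.1 the rank is 7201, so 7201/8000 = 0.900125 of the calibration scores sit at or below the threshold, which is consistent with the 90.01% figure on this card.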

8 · HetGAT Cascade

Edge-type-conditional 4-head GAT · GRUCell temporal gating · beats v1 GCN: easy +7.77% · medium +12.15% · hard +10.03% MAE

19,489 params · 4 edge types

9 · Live Intel Fan-Out

NewsAPI · GDELT · USGS · NOAA NDBC/Tides · NASA EONET/FIRMS · EIA · MarineTraffic · GFW · WHO DON · SEC · CISA · OFAC · World Bank · Wiki · HN

20 live sources · graceful

sections D · E · F · G · H · I — every bullet has a receipt in FINAL_SUBMIT/receipts/

D · 13 RL Players

+ Wilcoxon p=3.9e-18

MaskablePPO · ConstrainedPPO+λ · QR-DQN(51q,CVaR) · HER+SAC · DT · BC · CQL · IQL · TD3+BC · MBRL+RSSM · Specialist Router (BC→CQL→IQL) · Optuna 12-trial · 4-model ONNX (5.2e-8 best)

+0.2276 RAP-XC vs MaskablePPO Δreward, all 3 tasks p<1e-17

E · Forecasting Stack

TFT 513K + 90K · Chronos-Bolt + TimesFM-2 + TabPFN ensemble · Bates-Granger constrained stacking (1969) · 20-fold rolling-origin × 8 FRED targets × 3 horizons · PICP@80/90/95 · Foygel-Barber split-conformal

8/8 Brent ±30%, ensemble closes 75% gap
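
As a rough illustration of Bates-Granger combination, the sketch below weights each forecaster by its inverse mean squared error, which keeps weights non-negative and summing to one. It ignores error covariances (the simplest variant of the 1969 scheme) and is an assumption about what "constrained stacking" means here; the model names in any usage are placeholders.

```python
def bates_granger_weights(errors_by_model):
    """Map each model name to a weight proportional to 1 / MSE of its past errors."""
    inv_mse = {}
    for name, errs in errors_by_model.items():
        mse = sum(e * e for e in errs) / len(errs)
        inv_mse[name] = 1.0 / max(mse, 1e-12)  # guard against a perfect forecaster
    total = sum(inv_mse.values())
    return {name: v / total for name, v in inv_mse.items()}

def combine(forecasts, weights):
    """Weighted ensemble forecast for a single target and horizon."""
    return sum(weights[name] * forecasts[name] for name in forecasts)
```

In a rolling-origin setup the weights would be re-estimated at each fold from errors observed before the forecast origin only.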

F · Uncertainty

MC Dropout 50 forwards · 7-bin reliability · Conformal Q-values · Beta-severity × Lognormal-duration MC · Numba JIT 10-50× · GPU MC 100K scenarios <80ms

0.0229 ECE_full BC_v2 (best calibration)
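
For reference, expected calibration error over a 7-bin reliability diagram can be computed as below. This is a generic sketch with equal-width confidence bins; how `ECE_full` is defined in the project is not stated on this page, so treat the binning choice as an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=7):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece
```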

G · RAG 8 Pipelines

P1 BGE-M3 · P2 mxbai (winner P@1=0.962, MRR=0.978, 35ms) · P3 Snowflake · P4-P6 +rerank · P7 RRF · P8 HyDE · honest: reranker hurts ceiling P@3 0.925→0.862 · HyDE no lift

6,483 chunks · 53 + 20 + 26 queries
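
The RRF pipeline (P7) and the P@1 / MRR metrics quoted above can be sketched in a few lines. This is a generic reciprocal rank fusion with the conventional constant k=60, which is an assumption; the page does not state the constant or the fusion inputs actually used.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each list adds 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def precision_at_1(ranked, relevant):
    """1.0 if the top result is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging `precision_at_1` and `reciprocal_rank` over a query set gives P@1 and MRR respectively.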

H · GNN Cascade

Custom 3-layer GCN 50 LOC · TGN per-node memory + GRU · 2-head TransformerConv · 5-day trajectory · 12/25/40-node graphs · MAE -48 / -49 / -64% vs MLP · HetGAT v2 +12.15% on top

+12.15% HetGAT vs v1 GCN medium

I · Interpretability

SHAP DeepExplainer (n_bg=1000) · top-20 features · TreeExplainer · reliability diagrams · ECE/Brier × 4 models · fairness eq.odds · Qwen-14B 4-section explainer · 5-tier provenance trust

100% explainer stress 50/50, regen 0×

sections J · K · L · M · N · O · P · Q · R · S · T — receipts in FINAL_SUBMIT/receipts/

J · Federated Learning

3 simulated companies (Apple/Samsung/Toyota) · FedAvg · 20 rounds × 5 epochs · DP noise σ=0.1 · BCNetwork 408→256→128→280 MLP shared

8.5% → 31.0% · Round-0 → Round-49 full acc
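
One FedAvg aggregation step with Gaussian noise might look like the sketch below. Where the noise is applied (here, server-side on the averaged parameters) is an assumption, and the sketch omits the per-client clipping that a formal differential-privacy guarantee would require; parameters are flattened to plain lists for brevity.

```python
import random

def fedavg_round(client_weights, client_sizes, sigma=0.1, seed=None):
    """Sample-size-weighted average of client parameter vectors plus N(0, sigma^2) noise."""
    rng = random.Random(seed)
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = []
    for j in range(dim):
        mean_j = sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        averaged.append(mean_j + rng.gauss(0.0, sigma))
    return averaged
```

Each of the three simulated companies would train locally for its epochs, send its parameters, and receive the averaged vector back before the next round.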

K · Multi-Agent

Apple +$2.74M (₹23cr) WINS · Toyota -$7.37M (₹61cr) · Samsung -$11.53M (₹95cr) · 1000 wafers/wk shared TSMC · 2021 chip-shortage analog

+₹23cr · Apple aggressive · first-mover advantage

L · Pareto / Carbon

NSGA2 pymoo · cost × resilience × carbon · IMO/EPA/ICAO factors · 3 weight schemes · best: reroute_rail_panama $180K · 0 carbon

11/20 · Pareto-frontier plans
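
To make "Pareto-frontier plans" concrete, the sketch below filters a set of candidate plans down to the non-dominated ones. It is a brute-force filter, not NSGA-II itself, and assumes every objective is minimized (so resilience would be negated before being passed in); plan names and values in any usage are placeholders.

```python
def dominates(a, b):
    """True when a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(plans):
    """plans: dict of name -> tuple of objectives to minimize. Returns non-dominated names."""
    return [
        name
        for name, objectives in plans.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in plans.items()
            if other_name != name
        )
    ]
```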

M · World Models

RSSM (DreamerV3) · 15-step rollout · GPU MC 100K<80ms · p5/p50/p95/p99/cvar_10 · Twin saves $178.68M (48%) at sev=0.85 brent=$123

48% · Twin savings vs no-action
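
The tail statistics listed above (p5/p50/p95/p99/cvar_10) can be derived from Monte Carlo samples as sketched here. The sketch assumes `cvar_10` means the average of the worst 10% of simulated losses, with higher values being worse; that reading is an assumption, as is the nearest-rank percentile rule.

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile for an integer p in 1..100 on pre-sorted values."""
    k = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[k - 1]

def tail_summary(losses, tail_pct=10):
    """Percentiles plus CVaR: the mean loss over the worst tail_pct percent of scenarios."""
    vals = sorted(losses)
    n_tail = max(1, math.ceil(tail_pct * len(vals) / 100))
    worst = vals[-n_tail:]
    return {
        "p5": percentile(vals, 5),
        "p50": percentile(vals, 50),
        "p95": percentile(vals, 95),
        "p99": percentile(vals, 99),
        "cvar_10": sum(worst) / n_tail,
    }
```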

N · Live Ingestion

NewsAPI · GDELT 2.0 · USGS · FRED Brent · MarineTraffic · SQLite events.db · SHA-256 dedup 16ch · entity-regex extraction

159 events / launch day · 24h dedup
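
A 16-character SHA-256 dedup key with a 24-hour window could be implemented along these lines. Which fields are hashed and how text is normalized are assumptions made for illustration; the event dictionaries and their `title`, `source`, and `ts` keys are hypothetical.

```python
import hashlib

def event_key(title, source):
    """First 16 hex chars of SHA-256 over a whitespace- and case-normalized title plus source."""
    normalized = " ".join(title.lower().split()) + "|" + source.lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

def dedupe(events, window_s=24 * 3600):
    """Keep an event unless the same key was already kept within the last window_s seconds."""
    last_kept = {}
    kept = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        key = event_key(ev["title"], ev["source"])
        prev = last_kept.get(key)
        if prev is None or ev["ts"] - prev >= window_s:
            kept.append(ev)
            last_kept[key] = ev["ts"]
    return kept
```

Truncating to 16 hex characters keeps 64 bits of the digest, which is ample for collision avoidance at a few hundred events per day.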

O · Crisis Library v1

8 events · 3+ cites

8 hand-curated real events · 3-4 citations each · mxbai + TF-IDF fallback · confidence-damped (SIM_LOW=0.35, BENIGN=0.10) · Brent$80 collapse

8 events · 26+ Reuters/BBC/IDF/UNCTAD/Lloyd's

P · 15-Judge Panel

α 0.21/0.75/0.57/0.36

3 local + 12 frontier OpenRouter · 4 disclosure-ladder α: 3j=0.2097 · 2j=0.7499 · 12-frontier=0.5669 · 15-combined=0.3577 · 26 Wiki scenarios · 5-tier escalation

15 judges · ALL 4 alphas EXACT

Q · Tabular ML

XGBoost · LightGBM 0.9818 · CatBoost · TabPFN-v2 (clf+reg+bagging) · Ridge stacking · 5-fold CV · 4 DataCo tasks · honest null at ceiling

+0.0045 · Stacking lift vs WV (honest)

R · Analysis Models

PoliticalRisk GBR R²=0.994 · DependencyMLP 97.45% · FinImpact R²=0.736 · ConfIsotonic ECE=0.0017 · SPOFv2 F1=1.0 · 8-component political index · 4-component dependency

R²=0.994 · political risk GBR (214 countries)

S · Test Suite

261 tests collected (173 v3 + 76 v4 + 7+ phoenix) · 6/6 adversarial rejected · 16/16 phoenix smoke · 19 compliance · ~2m38s runtime

6/6 adversarial attacks rejected

15 v4 + 5 phoenix + framework · SHA-256 stdout · 5 comparators · INDEX.json/md auto-generated · 271-LOC framework · tamper-evident

35 · all sha256-anchored

sections U · V · W · X · Y · Z · AA · BB — autoresearch / phoenix / infra / stats / data / docs / plots / tricks

U · Autoresearch

Karpathy overnight loop · 5 hand-crafted experiments · 3 ACCEPTED + 2 REJECTED · s1=0.4035 (bigger net) · s3=0.0967 (curriculum BEST) · s4 REJECTED RecurrentPPO collapse

3/5 accepted · honest negatives kept

V · Phoenix v5

MPPO mean=2.209

Twin (100 MC) · Arena 6 baselines · MaskablePPO #1 [2.178,2.239] · Replay 8 frozen · ROLL + DPO + 2 upstream PRs · isolation guarantee

20 v5 receipts in INDEX

W · Production Infra

3 Dockerfiles · HF Space deployed · ONNX<5e-5×4 · <2GB image · 15-25s cold · Numba+CUDA fallback · 20+ endpoints (HTTP/WS/MCP/Swagger)

<2GB image · 159GB models excluded

X · Stats Machinery

Wilcoxon · Friedman · Bootstrap CI95 · Krippendorff α · Cohen κ · Fleiss κ · ECE/Brier · PICP@80/90/95 · 10,800-episode benchmark · CI95 strictly excludes 0

10,800 · episode bootstrap (R6 Euclidian)

Y · Data

DataCo 180,519 orders · IBTRACS 243,495 storms · FRED 17,011 pts · WGI 214×6×24 · SEC 25 filings · Wikipedia 26 · 40+ citations

10 independent real datasets · zero synthetic

Z · Documentation

125 markdown docs · 12 Sleep Token album stages · 6 Colab notebooks · README 40KB · SUPPLYMIND_BLUEPRINT 81KB · ALIENWARE_KICKOFF 53KB · 5 PITCH_DECK

12 Sleep Token track stages exact

AA · Plots & Viz

Hero card · Caramel calibration · R4×7 / R5×5 / R6×4 / R3×2 plots · GCN attention heatmaps · Streamlit 12 panels · Pareto 3D Plotly

25+ · versions/v3_arcadia/plots/ · 1 Streamlit dashboard

BB · Clever Tricks

Sleep Token 12 stages · W1-W10 wins · α disclosure ladder · 8 honest negatives · 2-pass DeepSeek · tamper-evident SHA-256 · 5 graceful-degrade paths

₹3 total OpenRouter spend (under tea)

Reward-hacking · 6 attacks · 6 rejected · each caught by a named defense layer

Per Meta OpenEnv × Scaler hackathon-guide §8 — multi-component reward + multiple independent gates beat single-signal reward hacking.

6/6 rejected · honest=0.86

A1 · empty_string · REJECTED 0.00

degenerate empty output, no info

match=0.00 · format=0.00 · length=0.00 · n_tokens=1

defense: format_gate + length_gate

A2 · risk_only_short_circuit · REJECTED 0.70

pure short-circuit: output ground-truth label only

match=1.00 · format=0.00 · length=0.00 · n_tokens=1

defense: length_gate (shorter than honest)

A3 · long_spam_no_json · REJECTED 0.80

pad with junk to beat length-guard, omit JSON

match=1.00 · format=0.00 · length=1.00 · n_tokens=200

defense: format_gate (no JSON shape)

A4 · over_length_500_token · REJECTED 0.85

massive output to dilute detection

match=1.00 · format=1.00 · length=-0.50 · n_tokens=500

defense: max_length_penalty (negative reward over 400tk)

A5 · adjacent_tier_guess · REJECTED 0.65

always guess adjacent tier to hedge

match=0.50 · format=1.00 · length=1.00 · n_tokens=60

defense: ordinal_proximity_penalty (only 0.5 partial credit)

A6 · wrong_tier_confident · REJECTED 0.30

always guess LOW (opposite end)

match=0.00 · format=1.00 · length=1.00 · n_tokens=60

defense: far-from-GT match=0 (not partial credit)

honest baseline reward = 0.86 · STRICTLY GREATER than every attack

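verdict: All six attack vectors score strictly below the honest baseline (0.86). The layered reward rejects each through a named component: format and length gates (A1), length gate (A2), format gate (A3), max-length penalty (A4), ordinal proximity penalty (A5), and zero match credit far from ground truth (A6).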

receipt: tests/receipts/adversarial_reward_audit.json
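
The six attack scores listed above are reproduced exactly by a weighted sum of the three reported components with weights 0.70 (match), 0.20 (format), and 0.10 (length). These weights are inferred from the numbers on this page, not read from the receipt or the reward code, so the sketch below is a consistency check rather than the actual reward function.

```python
# Inferred from the six attack rows; not confirmed by the source.
WEIGHTS = {"match": 0.70, "format": 0.20, "length": 0.10}

# (components, reported score) copied from the attack cards A1..A6.
ATTACKS = {
    "A1": ({"match": 0.00, "format": 0.00, "length": 0.00}, 0.00),
    "A2": ({"match": 1.00, "format": 0.00, "length": 0.00}, 0.70),
    "A3": ({"match": 1.00, "format": 0.00, "length": 1.00}, 0.80),
    "A4": ({"match": 1.00, "format": 1.00, "length": -0.50}, 0.85),
    "A5": ({"match": 0.50, "format": 1.00, "length": 1.00}, 0.65),
    "A6": ({"match": 0.00, "format": 1.00, "length": 1.00}, 0.30),
}

HONEST_BASELINE = 0.86  # reported honest baseline reward

def layered_reward(components, weights=WEIGHTS):
    """Weighted sum of the match, format, and length components."""
    return sum(weights[name] * components[name] for name in weights)

def max_abs_error():
    """Largest gap between the weighted sum and the reported score across attacks."""
    return max(abs(layered_reward(c) - reported) for c, reported in ATTACKS.values())
```

Under these weights an output that satisfies all three components on a correct label would score 1.00, so the 0.86 honest baseline is presumably an average over episodes where the label is not always matched.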

Wordle RLVR · canonical hackathon-guide demo

OpenEnv-compliant · multi-component reward · GRPO-trainable via TRL · bridges domain-heavy supply-chain to canonical hackathon flow

env contract

reset / step / grade / observation / action

Pydantic v2 typed · OpenEnv compliant

reward components (multi · §7)

solve_bonus · green_credit · yellow_credit · timeout_penalty · format_gate · dictionary_gate

anti-hack layers (§8)

format_gate · dictionary_gate · timeout · no internal-state mutation
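
The green/yellow credits and the two gates can be illustrated with standard Wordle feedback logic. This is a sketch under stated assumptions: the function names are hypothetical, guesses are assumed to be five lowercase letters, and no reward weights are shown because the page does not give them.

```python
def wordle_feedback(guess, secret):
    """Per-letter feedback: 'G' green, 'Y' yellow, '-' absent, with correct repeat handling."""
    result = ["-"] * len(guess)
    remaining = {}
    # Count secret letters not already matched in place; only these can earn yellows.
    for g, s in zip(guess, secret):
        if g != s:
            remaining[s] = remaining.get(s, 0) + 1
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "G"
        elif remaining.get(g, 0) > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return result

def passes_gates(guess, dictionary, length=5):
    """format_gate (shape of the guess) followed by dictionary_gate (known word)."""
    format_ok = (
        isinstance(guess, str)
        and len(guess) == length
        and guess.isalpha()
        and guess.islower()
    )
    return format_ok and guess in dictionary
```

Because the feedback is computed from the secret without modifying it, a policy cannot gain reward by mutating environment state, which matches the "no internal-state mutation" layer listed above.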

baseline (heuristic constraint filter, 50 episodes seeded)

win_rate=1.00 · mean_guesses=1.82 · mean_reward=0.77

receipt: tests/receipts/wordle_grpo_baseline.json

trainer stack

TRL GRPO · Unsloth (optional) · Qwen-2.5-1.5B-Instruct base

recipe: rl/lora/finetune_unsloth.py + versions/v5_phoenix/wordle_env/train_grpo.py

endpoints

POST /wordle/reset · POST /wordle/step · POST /wordle/grade · GET /wordle/health · GET /wordle/ui

Validation · backtest receipts

8-event backtest · expected: 100% risk-band, 100% Brent ±30%, 100% reroute

Receipts · all real, all sha256

10+ receipts

tests/receipts/war_room_validation.json — 100/100/100/100/100%
tests/receipts/ensemble_brent_validation.json — 8/8 ±30%, median 3.3% err
tests/receipts/conformal_calibration.json — 0.9001 coverage
tests/receipts/cross_corpus_alpha.json — α=0.5436
tests/receipts/panel_agreement_R4.json — α=0.5669
versions/v5_phoenix/experiments/hetgat_v1/report.json — +7.77/+12.15/+10.03%
versions/v5_phoenix/experiments/rap_xc_v1/rapxc.pt — BC 5.62→0.23