When LLMs Lose to Coin Flips: Building a Healthcare Fraud Detection RL Environment
A rigorous evaluation study β across two models, seven agent configurations, and 14,000 claim decisions β showing that prompt engineering beats model capability, and an open environment so you can prove us wrong.
Healthcare insurance fraud costs the US system an estimated $100 billion per year β roughly 3β10% of total health spending, per NHCAA estimates against $4.9T in CMS-reported 2023 expenditures. Most detection systems are batch processes: flag suspicious claims at the end of the month, review manually, claw back if possible. By then the money is gone.
What if we treated fraud detection as a sequential decision problem under real constraints? An agent reviewing one claim at a time, with a limited investigation budget, building memory of provider history across the episode β just like a real claims adjudicator.
We built exactly this environment, then ran a complete evaluation across seven agent configurations and two LLM models. The results were sharper than we expected.
The short version:
- A naive LLM performs worse than random
- A budget-aware prompt improves the same model by up to 2.7Γ
- A rule-based heuristic beats naive LLMs without a single API call
- The same prompt works on both a weak and a strong model β but the stronger model executes it more precisely
- A trained REINFORCE policy discovers the same strategy from reward signal alone
- And one honest finding we discovered about our own environment: the RL reward objective and the real-world financial outcome rank agents differently β a calibration gap worth knowing before you train on this
The Environment
Each episode consists of 100 insurance claims arriving sequentially. The agent sees:
- Claim amount, procedure codes, provider billing history
- Current investigation budget remaining (starts at 15 per episode)
- Memory of previously investigated providers (confidence decays over time)
- Episode step counter and risk indicators
For each claim, the agent picks one of 5 actions:
| Action | Cost | Detection Rate |
|---|---|---|
APPROVE |
$0 | 0% β fraud passes through |
FLAG_REVIEW |
$25 | 70% β human review |
INVESTIGATE |
$100 | 95% β deep audit |
DENY |
$0 | 100% β but penalises false denials |
REQUEST_INFO |
$12.50 | 50% β deferred review |
The critical tension: INVESTIGATE at $100 is RL-optimal only for suspected fraud above roughly $1,000 at the current 0.1Γ fraud-recovery reward rate β a threshold well above the typical claim in this simulation. Below that, FLAG_REVIEW at $25 nets a better RL score. (In real-dollar terms the breakeven is ~$300 β see Finding 8 on reward calibration for why these diverge.)
Fraud rate is 5% β 5 out of every 100 claims. The investigation budget is 15 β meaning even a perfect agent can only deep-audit 15% of claims. The rest must be handled via cheaper signals.
Reward Architecture
The reward is multi-component, not a single financial signal:
| Component | Weight | What it measures |
|---|---|---|
| Decision correctness | 40% | Financial outcome of the action choice |
| Rationale quality | 30% | Coherence and length of the written explanation |
| Evidence citation | 20% | Did the agent cite specific claim data? |
| Efficiency | 10% | Cost-effectiveness of action given risk level |
When we say "reward" throughout this post we mean the weighted sum. An agent can improve reward not only by making better financial decisions but by writing better rationales β which matters when comparing LLMs (which write prose) to rule-based agents (which emit minimal text). Finding 4 addresses this directly.
Seven Agents, One Leaderboard
We evaluated seven agent configurations across 20 episodes each (2,000 claim decisions per agent, fixed seeds throughout):
RandomAgent β weighted random decisions, no reasoning, no API calls. The absolute floor.
ThresholdAgent β pure rule-based logic using regex over the prompt text. Flags claims above anomaly thresholds, approves the rest. Never uses INVESTIGATE. No LLM, zero latency.
NaiveLLMAgent β sends each claim to an LLM with a minimal prompt: "Review this claim and decide." No mention of budget, investigation costs, or memory. Run on two models:
- NaiveLLM (DeepSeek V3.2) β state-of-the-art instruction-following, $0.26/M tokens
- NaiveLLM (Qwen 3.6 Plus) β current-gen Alibaba model, run on free tier
BudgetAwareAgent β same LLM, system prompt explicitly states the economics: INVESTIGATE costs $100, FLAG_REVIEW costs $25, budget limit is 15, switch strategy when budget falls below 20%. Same two models:
- BudgetAware (DeepSeek V3.2)
- BudgetAware (Qwen 3.6 Plus)
Each naive/budget-aware pair runs on the same model backbone β the only difference is the system prompt. This gives four LLM configurations total.
ReinforceAgent β a linear policy trained for 500 episodes via REINFORCE policy gradient on 10 hand-crafted features.
Complete Results
All results from 20 episodes Γ 100 claims each, fixed seeds (seed=42).
| Agent | RL Reward | F1 | Recall | Budget Use | Fraud Caught$/ep | Net Loss$/ep | Fraud Catch Rate |
|---|---|---|---|---|---|---|---|
| BudgetAware (DeepSeek) | β455 | 0.047 | 8% | 0% | $905 | $2,609 | 26% |
| ThresholdAgent | β841 | 0.144 | 53% | 0% | $1,493 | $2,147 | 48% |
| BudgetAware (Qwen3.6) | β1,190 | 0.147 | 52% | 5% | $1,412 | $3,181 | 46% |
| NaiveLLM (DeepSeek) | β1,212 | 0.074 | 37% | 9% | $2,149 | $2,315 | 61% |
| REINFORCE (trained) | β1,646 | 0.057 | 23% | 79% | $1,194 | $5,352 | 30% |
| RandomAgent | β2,057 | 0.087 | 44% | 88% | $1,313 | $6,137 | 34% |
| NaiveLLM (Qwen3.6) | β2,322 | 0.063 | 52% | 70% | $1,608 | $5,645 | 52% |
RL Reward: higher is better. Net Loss: per 100-claim episode. Fraud Catch Rate: % of total fraud $ recovered.
Note: RL Reward and Net Loss$/ep rank agents differently β this matters, and we explain why in Finding 8.
Finding 1: A Naive LLM Is Worse Than Random
Qwen NaiveLLM scored β2,322 β worse than RandomAgent at β2,057, by 265 reward points.
This is not a model capability problem. The NaiveLLM worked hard:
- 70% budget utilization (used most of its 15 investigation slots)
- 92% memory reuse rate (correctly avoided re-investigating known providers)
- 52% recall (caught over half the fraud)
The problem is what it did with that work. It investigated frequently, burning $100 per slot on moderate-risk claims. On legitimate claims β which make up 95% of the episode β each investigation costs $100 + $50 false-positive penalty = $150 in pure waste. Over a 100-claim episode with ~10β14 investigations, that's $1,000β$2,100 in investigation overhead before catching a single dollar of fraud.
NaiveLLM (Qwen) episode sample:
investigations_used: 14 out of 15 budget
investigation_cost: $1,825
false_positive_cost: $1,900 β investigating legitimate claims
fraud_caught_amount: $149 β barely worth it
total_reward: β1,950
A random agent, by contrast, has no intelligence to act on. It randomly approves many things, randomly investigates others. Its investigations are spread across low-risk and high-risk claims alike β so the expected false-positive rate per investigation is lower than the LLM's targeted-but-miscalibrated investigations.
The lesson: An LLM with domain knowledge but no cost calibration is worse than random. It acts on signal (suspicion) without understanding the cost of acting on that signal.
Finding 2: Budget-Aware Prompting Improves Both Models β Proportionally to Their Capability
We ran the budget-aware prompt on both models:
| Model | Naive Reward | Budget-Aware Reward | Improvement | Budget Use: Naive β BA |
|---|---|---|---|---|
| Qwen 3.6 Plus | β2,322 | β1,190 | 1.95Γ | 70% β 5% |
| DeepSeek V3.2 | β1,212 | β455 | 2.66Γ | 9% β 0% |
The budget-aware system prompt adds three things:
- Economics: "INVESTIGATE costs $100. FLAG_REVIEW costs $25. Only INVESTIGATE when you have high confidence AND the claim is large."
- Thresholds: "When budget remaining < 4, switch entirely to FLAG_REVIEW."
- Memory rules: "If a provider is in memory as FRAUD, FLAG_REVIEW β don't re-investigate. If LEGIT, APPROVE."
Both models respond. But with meaningfully different precision:
- DeepSeek drops investigation use to 0% β it fully internalises the economics and never INVESTIGATEs
- Qwen drops to 5% β a 14Γ reduction from 70%, but not complete
This reveals a second-order finding: the value of a budget-aware prompt scales with the model's instruction-following capability. A stronger model executes the decision tree exactly. A weaker model partially follows it, capturing most but not all of the benefit.
Prompt engineering is not a silver bullet β it requires a model that can actually follow the prompt.
Finding 3: Rule-Based Wins (Until a Strong Model Is Told the Rules)
ThresholdAgent (β841) beats NaiveLLM on both models, and beats BudgetAware Qwen (β1,190). Only BudgetAware DeepSeek (β455) clears it.
ThresholdAgent independently discovered the optimal strategy: never INVESTIGATE, use FLAG_REVIEW for anomalies, APPROVE otherwise. It executes this via 10 lines of regex and if-statements with zero API calls, zero latency, and zero cost.
# ThresholdAgent's effective policy (simplified):
if risk_level == "HIGH" and budget_remaining >= 5:
return INVESTIGATE # almost never fires in practice
elif risk_level == "MODERATE" or fraud_flag_rate > 8%:
return FLAG_REVIEW
else:
return APPROVE
The rule for INVESTIGATE barely fires because truly HIGH-risk claims are rare. In practice, ThresholdAgent is an optimised FLAG_REVIEW machine β which turns out to be nearly optimal.
The gap between ThresholdAgent (β841) and BudgetAware DeepSeek (β455) is meaningful: 386 reward points. The LLM can reason about which specific claims warrant flagging with more nuance than a fixed threshold β when explicitly told what it's optimising for. That gap is the value that strong LLMs add over hand-coded rules: context-sensitive discrimination, not brute force.
Finding 4: High F1 Is the Wrong Goal
Notice something counterintuitive in the results table: BudgetAware DeepSeek has the lowest F1 (0.047) and lowest recall (8%) β yet the best RL reward.
Meanwhile, NaiveLLM Qwen has the highest recall (52%) and still loses to RandomAgent.
The environment rewards financially efficient fraud detection (40% of the reward signal), not raw classification. An agent that catches 100% of fraud by investigating every single claim would have perfect recall β and would be catastrophically expensive (100 Γ $150 false-positive cost on legitimate claims = $14,250/episode in investigation waste alone).
ThresholdAgent achieves F1=0.144 β highest alongside BudgetAware Qwen β with zero investigation budget used. Its rules flag correctly with moderate precision and never incur investigation costs at all. The BudgetAware DeepSeek's F1=0.047 looks terrible, but it's working in a completely different regime: FLAG_REVIEW only, minimal cost, accepting that most fraud slips through in exchange for near-zero investigation overhead.
One nuance: the 30% rationale + 20% evidence components of the reward favour LLM agents (which write structured prose) over ThresholdAgent (which emits minimal text). ThresholdAgent's strong overall score comes despite near-zero rationale credit β its financial decision quality is that much better.
F1 measures fraud detection coverage. RL Reward measures fraud detection efficiency. These diverge sharply when investigation is expensive and fraud is rare. Which of the two you should care about depends on what you're actually trying to do β and that's the subject of Finding 8.
Finding 5: The Budget Paradox β More Resources, Worse Results
Budget ablation across investigation budgets of 5, 10, 15, and 20 (rule-based agents, 10 episodes each):
| Budget | RandomAgent | ThresholdAgent |
|---|---|---|
| 5 | β1,720 | β765 |
| 10 | β1,860 | β765 |
| 15 | β1,955 | β765 |
| 20 | β2,006 | β765 |
Giving RandomAgent more budget makes it worse. More investigation slots β more random INVESTIGATE calls β more $150 false-positive costs on legitimate claims β deeper negative reward.
ThresholdAgent is perfectly flat at β765 across all budget levels because it never uses INVESTIGATE β the budget is simply never binding. (The β765 vs β841 difference from the main results table reflects fewer episodes: this ablation uses 10 episodes vs 20 in the main evaluation; variance at 10 episodes is higher but the direction holds.)
This result has a clean interpretation: investigation budget is only valuable to agents that can spend it wisely. For agents that can't discriminate when to investigate, more budget is strictly harmful. The resource amplifies whatever decision-making quality (or lack thereof) the agent already has.
We ran a partial budget ablation with DeepSeek at B=5 and B=10 (10 episodes each):
| Budget | NaiveLLM (DeepSeek) | BudgetAware (DeepSeek) | BA / Naive ratio |
|---|---|---|---|
| 5 | β1,169 | β743 | 1.57Γ |
| 10 | β1,192 | β | β |
| 15 | β1,212 | β455 | 2.66Γ |
Two things stand out. First, NaiveLLM actually improves slightly from B=15 (β1,212) to B=5 (β1,169) β tight budgets are self-correcting. When the agent only has 5 investigation slots, it can't burn 14 of them even when it wants to. The budget constraint acts as an accidental guardrail on its over-investigation tendency.
Second, the budget-aware advantage compresses as budgets tighten: 2.66Γ at B=15 drops to 1.57Γ at B=5. The gap closes because both agents end up in similar territory: BudgetAware explicitly rations its 5 slots, but NaiveLLM is also forced into rationing by scarcity. At very low budgets, the marginal value of knowing the rules decreases because resource exhaustion enforces the same behaviour anyway.
Finding 6: RL Learns the Same Strategy From Scratch
The toughest challenge: 5% fraud rate means 95 legitimate-claim steps generate noisy gradient that drowns out signal from 5 fraud steps.
Training setup: linear policy (10 features β 3 actions), 500 episodes, REINFORCE with batch advantage normalisation, entropy regularisation, gradient clipping.
First quartile (ep 1β125): mean reward = β2,398
Last quartile (ep 376β500): mean reward = β1,739
Improvement: +658 reward β Policy LEARNED
Training time: 56 seconds
The policy learned β from reward signal alone β to weight budget_frac (feature 0) and provider_in_mem (feature 7) heavily. When budget depletes, INVESTIGATE probability drops. When a provider is in memory, FLAG_REVIEW/APPROVE probability rises. These are exactly the strategies the budget-aware prompt describes in words.
The trained agent (β1,646) doesn't match ThresholdAgent (β841) yet β the linear policy over 10 features can't express full conditional logic. (The β1,739 training mean is over the last 125 training episodes with the same env seed; β1,646 is the separate 20-episode held-out evaluation with seed=42, so the gap reflects the policy being evaluated on different claim sequences than it trained on.) But it demonstrates that the environment contains learnable structure: RL can find policy improvements without any human-written rules.
Finding 7: Memory Half-Life Only Matters If You Use Memory
We ablated memory_decay_halflife over [0, 5, 20, 100] with rule-based agents:
| Halflife | RandomAgent | ThresholdAgent |
|---|---|---|
| 0 (off) | β2,068 | β763 |
| 5 | β2,068 | β763 |
| 20 | β2,068 | β763 |
| 100 | β2,068 | β763 |
Both are completely flat. Two reasons:
- ThresholdAgent never investigates β never populates memory β memory decay rate is irrelevant because memory is always empty
- RandomAgent ignores the prompt β memory content visible in the prompt has no effect on random decisions
This confirms memory is functional in the environment β it's just not activated by agents that don't use INVESTIGATE or don't read it. The LLM memory ablation (requiring the BudgetAware LLM runs across half-life settings) would show real variation β that remains a future experiment.
Finding 8: Our Reward Function Has a Calibration Gap
This is the finding we didn't plan for β we found it by looking at the data carefully.
The RL reward function scales fraud recovery at 10% of claim value and missed-fraud penalty at 20%:
# environment/models.py β RewardConfig defaults
fraud_caught_reward_rate = 0.1 # catching $1,000 fraud β +$100 reward
fraud_missed_penalty_rate = 0.2 # missing $1,000 fraud β -$200 penalty
investigation_cost = 100.0 # flat cost per INVESTIGATE
This makes INVESTIGATE only worthwhile (in RL reward terms) for confirmed fraud above roughly ~$1,000 β a tight threshold almost no individual claim exceeds in expectation. The result: the RL-optimal strategy is to avoid investigation almost entirely, not because investigation is wrong, but because the rate underprices the value of fraud recovery.
The symptom: RL Reward and Net Savings rank agents differently.
| Agent | RL Reward Rank | Net Loss$/ep | Financial Rank |
|---|---|---|---|
| BudgetAware (DeepSeek) | 1st (β455) | $2,609 | 3rd |
| ThresholdAgent | 2nd (β841) | $2,147 | 1st |
| NaiveLLM (DeepSeek) | 4th (β1,212) | $2,315 | 2nd |
| BudgetAware (Qwen3.6) | 3rd (β1,190) | $3,181 | 4th |
ThresholdAgent has the best real-world outcome ($2,147 net loss per episode) despite ranking 2nd on the RL objective. NaiveLLM(DeepSeek) β which investigates more and catches more fraud dollars ($2,149 vs $905) β comes second in actual dollars despite ranking 4th on RL reward.
Why this happens: BudgetAware DeepSeek optimises the RL objective precisely. It eliminates investigation entirely (0% budget use), avoids false-positive costs, and accepts a low fraud catch rate (26%). This is RL-optimal because the 0.1 reward rate makes even recovered fraud barely worth the $100 investigation cost. But in real terms, that 26% catch rate leaves $2,635 of fraud unpaid per episode. ThresholdAgent's heuristics catch 48% of fraud at moderate cost, netting a better actual outcome.
The fix is a one-line change:
fraud_caught_reward_rate = 1.0 # full claim value recovered
fraud_missed_penalty_rate = 1.0 # full claim value lost
With equal rates (or rates calibrated to actual payer economics), the RL objective aligns with net savings. Investigation of genuinely high-value suspicious claims becomes worthwhile. The optimal strategy shifts toward selective investigation rather than pure cost-avoidance.
Hypothesis: with correct reward scaling, BudgetAware DeepSeek would overtake ThresholdAgent. Under fraud_caught_reward_rate = 1.0, a $3,000 fraud claim recovered nets +$3,000 in reward β making a $100 investigation obviously worthwhile. BudgetAware DeepSeek's context-sensitive reasoning ("this specific claim is $3,000 from a provider flagged twice this episode") can justify that cost claim-by-claim. ThresholdAgent's fixed thresholds cannot adapt at that granularity. The LLM's selective discrimination β currently unrewarded β would become its winning edge.
Why we're not re-running: With the submission deadline upon us, re-running all 7 agents to produce a clean comparable dataset is out of scope. We're documenting the gap honestly instead.
What this means for the evaluation study findings: The core finding β budget-aware prompting improves the same LLM by 2.7Γ β holds regardless of which metric you use (BudgetAware DeepSeek is best on RL reward; the same direction holds for net savings within each model pair). The direction is consistent: structured prompting helps. But the magnitude and mechanism change. Under correct reward scaling, an agent that catches more fraud dollars is explicitly rewarded for it, and the threshold for INVESTIGATE becomes much lower.
The lesson for practitioners: When designing RL environments for real business problems, verify that your reward rates reflect actual value at stake. A 10% recovery reward on a $1,000 fraud and a flat $100 investigation cost creates a regime where the optimal RL policy is "never investigate" β which may be optimal in the RL game while being poor real-world policy.
What the Numbers Mean in Practice
To make the reward numbers concrete: a β$455 RL reward means BudgetAware DeepSeek runs a fraud program that scores β$455 on the RL objective. Its real-world financial loss (net_savings metric) is $2,609/episode β because the 0.1Γ reward rate understates the actual fraud value in the RL signal.
For a claims department processing 10,000 claims/day, using net financial losses as the metric:
- NaiveLLM Qwen: $564,500/day in net fraud program losses
- RandomAgent: $613,700/day
- BudgetAware DeepSeek: $260,900/day
- ThresholdAgent: $214,700/day β best real-world outcome
The API cost to run BudgetAware DeepSeek on 10,000 claims: roughly $6/day ($0.26/M tokens Γ ~23M tokens). Even with its suboptimal reward calibration, the gap between BudgetAware and NaiveLLM ($303,600/day) is enormous relative to API cost.
Try It Yourself
The environment is live on Hugging Face Hub:
from environment.client import HealthClaimEnv
from environment.models import ClaimAction
client = HealthClaimEnv("https://shylane-healthcare-fraud-openenv.hf.space")
obs = client.reset()
while not obs.done:
response = your_agent.act(obs.prompt)
obs = client.step(ClaimAction(response_text=response))
print(f"Episode reward: {obs.metadata.get('cumulative_reward', 0)}")
Or run locally:
git clone https://github.com/shylane/healthcare-fraud-openenv
cd healthcare-fraud-openenv
uvicorn environment.server.app:app --port 8000
Seeds guarantee identical claim sequences for deterministic agents (LLMs). Agents that use Python's global random module (e.g. a random-action baseline) will see slightly different episode trajectories because their action draws interleave with lazy claim generation β see the Limitations section below. All LLM-vs-LLM comparisons in this study are clean.
Can you beat BudgetAware DeepSeek's β455?
Open Threads
Four questions remain open:
LLM memory ablation. The memory ablation only ran on rule-based agents β both were flat because neither builds memory (ThresholdAgent never investigates, RandomAgent ignores the prompt). The interesting case is BudgetAware DeepSeek, which explicitly relies on memory to avoid re-investigating known providers. Does its reward degrade when memory_decay_halflife drops to 0? Hypothesis: yes, because providers seen early in the episode would no longer be recognised later, forcing redundant FLAG_REVIEW calls. Remains unrun.
Full budget ablation with LLM agents. We got B=5 for both DeepSeek agents and B=10 for NaiveLLM. The budget-aware advantage at B=5 (1.57Γ) vs B=15 (2.66Γ) is itself a finding: tighter budgets compress the gap. Whether that gap closes further or reverses at very tight constraints is an open question.
REINFORCE vs Threshold gap. The trained RL policy (β1,646) trails ThresholdAgent (β841) by 805 reward points. A non-linear policy (2-layer MLP) or expanded feature set (procedure codes, claim amount directly, fraud pattern type) would likely close this. The environment contains learnable structure β the linear policy is the bottleneck, not the algorithm.
Reward rate recalibration. Set fraud_caught_reward_rate = 1.0 and fraud_missed_penalty_rate = 1.0 in environment/models.py β RewardConfig. Re-run all experiments. Hypothesis: rankings change significantly β ThresholdAgent drops relative to agents that learn to investigate selectively, and the REINFORCE policy improves by learning a non-trivial investigation strategy rather than pure cost-avoidance.
The environment is open. Every result here is reproducible from the JSON files in experiments/*/results/. We're curious what you find.
Limitations and Caveats
We're documenting these openly because they affect how you should interpret specific numbers, even though they don't change the directional findings.
RNG isolation. Claims are generated lazily (one per step). The ClaimsFraudEnvironment uses a dedicated random.Random(seed) instance for its own stochastic decisions (investigation accuracy draws), but ClaimsSimulator uses Python's global random module for claim generation. RandomAgent and ReinforceAgent also draw from the global random module during action selection, meaning their action draws interleave with subsequent claim generation. This contaminates the "identical claim sequences" guarantee for those two agents.
Impact: All LLM-vs-LLM comparisons (Findings 1, 2, 3) are unaffected β LLM agents make no Python random calls. The RandomAgent and REINFORCE results are reproducible across multiple runs with the same global seed, but their claim sequences differ slightly from those seen by LLM agents. The directional conclusions (NaiveLLM worse than random; BudgetAware better than NaiveLLM) hold by large enough margins to be robust to this effect.
Investigation memory records detected truth, not ground truth (post-fix). Prior to this commit, investigation memory stored is_fraud as the claim's ground-truth label regardless of whether the investigation stochastically missed (5% miss rate at investigate_accuracy=0.95). This leaked the true label to agents on provider re-encounters. The code is now fixed: memory stores is_fraud=False when an investigation misses. At the default 0.95 accuracy, the practical impact on recorded results is small (~5% of fraud investigations), but the principle matters for environments with lower accuracy settings.
API fallback in ablation runs. The OpenRouter client falls back to a parseable FLAG_REVIEW response on null-content API errors, keeping valid_response_rate high even when the model never reasoned. Precise status per budget level: B=5 NaiveLLM and BudgetAware are valid and cited; B=10 NaiveLLM is valid and cited (β1,192); B=10 BudgetAware was not run (credits exhausted); B=20 both agents were collected but show all-FLAG_REVIEW behaviour with response lengths matching the fallback string exactly β those results are excluded. Finding 5 cites only B=5, B=10 (NaiveLLM), and B=15.
Harness step-level logging (post-fix). StepRecord.is_fraud was previously logged after env.step(), recording the next claim's label instead of the current one (off-by-one). This has been fixed. Episode-level metrics (total reward, F1, recall, budget utilisation) are computed from the environment's internal state and were never affected by this bug.
Memory reuse metric (post-fix). memory_reuse_rate previously counted APPROVE on a known-fraud provider as "correct" (the agent didn't waste an investigation slot). The metric now correctly scores: FLAG_REVIEW/DENY as correct for known-fraud providers, APPROVE as correct for known-legit providers.
Multi-component reward vs financial outcome. The RL reward is 40% financial decision quality, 30% rationale coherence, 20% evidence citation, and 10% efficiency. When comparing LLM agents (which write prose rationales) to ThresholdAgent or RandomAgent (which emit minimal text), the rationale/evidence components create a persistent headwind for rule-based agents. ThresholdAgent's strong overall ranking is despite near-zero rationale credit β its financial decisions are that dominant.
The $100B figure. This is the commonly cited NHCAA estimate. The more rigorous range is 3β10% of health spending; at $4.9T (CMS 2023) that implies $147Bβ$490B in potential fraud exposure, not all of which is recoverable. We use $100B as a conservative anchor.
Links
- Environment: HF Space β shylane/healthcare-fraud-openenv
- Code + experiments: github.com/shylane/healthcare-fraud-openenv
- Study documentation:
docs/study_documentation.mdβ full technical write-up with code citations - Competition: AgentX-AgentBeats OpenEnv Track
Built for the AgentX-AgentBeats OpenEnv Challenge (Berkeley RDI / Hugging Face, April 2026). 7 agent configurations Γ 20 episodes Γ 100 claims = 14,000 decisions. All open source.