Ambulance Green Corridor: OpenEnv Hackathon 2026
Theme: #3.1 (World Modeling / Professional Tasks)
One-line summary: We train an LLM to act as an emergency dispatcher + city traffic signal manager, navigating a partially observable city with real-world constraints (gridlock, potholes, accidents, hospital capacity) to get ambulances to the right hospital as fast as possible.
Demo video: Watch on YouTube
The Capability Gap We Are Targeting
Current GPS-based emergency preemption systems (like Opticom, used globally) work like this:
Ambulance within 300 m of an intersection → turn it green.
That is reactive. It has no awareness of:
- What road quality lies ahead (potholed roads slow the ambulance even on green)
- Whether the nearest hospital is the right hospital for this patient's condition
- Whether heavy traffic on the planned route makes a longer detour actually faster
- Dynamic events mid-journey: accidents, road closures, traffic spikes
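To make the contrast concrete, here is a minimal sketch of the reactive rule described above. It is not Opticom's actual implementation; the function name, the coordinate frame in metres, and the intersection ids are all hypothetical. The point is that distance is the only input to the decision.

```python
import math

PREEMPT_RADIUS_M = 300.0  # distance threshold quoted in the text


def reactive_preempt(ambulance_xy, intersections):
    """Return ids of intersections to turn green under the reactive rule.

    ambulance_xy: (x, y) position in metres (hypothetical coordinate frame).
    intersections: dict mapping an id to its (x, y) position in metres.
    """
    ax, ay = ambulance_xy
    return [
        node_id
        for node_id, (x, y) in intersections.items()
        if math.hypot(x - ax, y - ay) <= PREEMPT_RADIUS_M
    ]


# Hypothetical layout: one intersection 250 m ahead, one 600 m ahead.
intersections = {"A": (0.0, 250.0), "B": (0.0, 600.0)}
greens = reactive_preempt((0.0, 0.0), intersections)
```

Road quality, hospital suitability, traffic volume, and mid-journey events never enter this function, which is the gap the environment targets.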
The question we ask: Can an LLM reason about the full journey (hospital selection, road quality, live traffic state, and mid-episode events) to get ambulances to the right place faster than any rule?
This is a genuine professional task that requires persistent world modeling across many steps. A rule cannot solve it. A shortest-path algorithm cannot solve it. This environment tests whether an LLM can.
Theme Alignment: Why This Is Theme 3.1
Theme 3.1 asks for environments where:
"the model is expected to do real hard work instead of exploiting short-cuts"
Our environment prevents shortcuts in three ways:
- Toggling already-green signals costs reward. The agent must read signal state before acting; it cannot blindly clear everything.
- Traffic volume slows the ambulance even on green. The agent cannot just clear signals and assume it will go fast; it must reason about the traffic volume on each segment.
- The nearest hospital is not always correct. A cardiac patient sent to a trauma centre loses the +300 specialist bonus. But even the right specialist hospital may not be the best choice: if the route to it is gridlocked, clearing signals only gets the ambulance to 20% speed through dense traffic. A farther hospital with lighter traffic and a lower ETA is the smarter pick. The agent must weigh specialization, distance, and live traffic volume simultaneously, and be willing to switch hospitals mid-journey if conditions change.
The agent must maintain a coherent world model across 15–30 steps, update it when dynamic events fire, and make non-obvious decisions that only pay off several steps later.
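The hospital trade-off can be illustrated with a small scoring sketch. The reward constants come from the reward table in this README; everything else is an assumption made for illustration: the `HospitalOption` type, the linear time bonus, and the rule that an unreachable hospital scores zero are not taken from the environment's code.

```python
from dataclasses import dataclass

ARRIVAL_REWARD = 1000   # from the reward table
MAX_TIME_BONUS = 500    # from the reward table
SPECIALIST_BONUS = 300  # from the reward table


@dataclass
class HospitalOption:
    hospital_id: str
    eta_s: float            # estimated travel time under live traffic
    specialist_match: bool  # does the hospital match the patient's condition?


def expected_reward(option, time_left_s, time_limit_s):
    """Assumed scoring: linear time bonus, nothing if the ETA exceeds the time left."""
    if option.eta_s > time_left_s:
        return 0.0
    spare = time_left_s - option.eta_s
    time_bonus = MAX_TIME_BONUS * spare / time_limit_s
    bonus = SPECIALIST_BONUS if option.specialist_match else 0
    return ARRIVAL_REWARD + time_bonus + bonus


def pick_hospital(options, time_left_s, time_limit_s):
    return max(options, key=lambda o: expected_reward(o, time_left_s, time_limit_s))


# Hypothetical case 1: the specialist route is gridlocked and cannot be reached in time.
gridlocked = [HospitalOption("specialist", 280, True), HospitalOption("general", 130, False)]
choice_gridlocked = pick_hospital(gridlocked, time_left_s=260, time_limit_s=300).hospital_id

# Hypothetical case 2: the specialist is slower but reachable, so the bonus wins.
reachable = [HospitalOption("specialist", 200, True), HospitalOption("general", 130, False)]
choice_reachable = pick_hospital(reachable, time_left_s=260, time_limit_s=300).hospital_id
```

Under these assumptions the same patient is sent to different hospitals depending on live ETA, which is why the choice has to be re-evaluated at every step rather than fixed at dispatch.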
Environment Design
What the Agent Sees (Observation)
Every step, the agent receives a structured observation:
=== EMERGENCY DISPATCH ===
Patient  : (6, 3) | condition: cardiac
Ambulance: (6, 4) | time: 40s / 300s
⚠ DYNAMIC EVENTS:
  [ACCIDENT] at (4,3) - road blocked (severity=0.8)
CURRENT ROUTE → hosp_a (City General)
  ETA=251s | segments=8 | damaged=2 | heavy_traffic=1
  (6,4)→(5,4) | residential | quality=moderate | traffic=45%
  (5,4)→(4,4) | damaged | quality=POTHOLED | [BLOCKED]
ALTERNATIVES:
  hosp_c (Cardiac Centre) - specialist match | ETA=130s | damaged=0
SIGNALS - only change WRONG ones:
  (5,4): ns_green | ambulance going north | OK
  (4,4): ew_green | ambulance going north | WRONG - needs ns_green
ACTION FORMAT:
  {"hospital_id": "hosp_c", "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}], "preferred_direction": null}
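As a rough illustration of what "reading the observation" involves, the sketch below extracts the signals marked WRONG from the text and turns them into `signal_controls` entries. The regular expression is written against the sample observation above, not against the environment's source, so the exact line format is an assumption, as is reading each `(a,b)` pair as `(row, col)`. The function name `wrong_signal_controls` is hypothetical.

```python
import re

# Matches lines such as:
#   (4,4): ew_green | ambulance going north | WRONG - needs ns_green
SIGNAL_LINE = re.compile(
    r"\((\d+),\s*(\d+)\):\s*(\w+)\s*\|\s*ambulance going (\w+)\s*\|\s*"
    r"(OK|WRONG)(?:.*needs (\w+))?"
)


def wrong_signal_controls(observation_text):
    """Return signal_controls entries for every signal flagged WRONG."""
    controls = []
    for line in observation_text.splitlines():
        match = SIGNAL_LINE.search(line)
        if match is None:
            continue  # not a signal line (route segments, events, headers)
        row, col, _current, _direction, status, needed = match.groups()
        if status == "WRONG" and needed is not None:
            controls.append({"row": int(row), "col": int(col), "phase": needed})
    return controls


sample = """SIGNALS - only change WRONG ones:
  (5,4): ns_green | ambulance going north | OK
  (4,4): ew_green | ambulance going north | WRONG - needs ns_green
"""
controls = wrong_signal_controls(sample)
```

The trained model has to do the equivalent of this from raw text at every step, without a parser supplied for it.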
What the Agent Does (Action Space)
{
  "hospital_id": "hosp_c",
  "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}],
  "preferred_direction": "north"
}
- `hospital_id`: choose or switch destination at any step (not just at the start)
- `signal_controls`: override up to 3 signals in the lookahead window
- `preferred_direction`: hint the routing engine to take a specific turn
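A client may want to check an action against these constraints before sending it. The helper below is a sketch, not part of the environment's API: `build_action` is a hypothetical name, and the sets of valid phases and directions are assumptions extrapolated from the values that appear in this README (`ns_green`, `ew_green`, `north`, `south`).

```python
import json

VALID_PHASES = {"ns_green", "ew_green"}                     # assumed complete
VALID_DIRECTIONS = {"north", "south", "east", "west", None}  # assumed complete
MAX_SIGNAL_CONTROLS = 3  # "override up to 3 signals" from the text


def build_action(hospital_id=None, signal_controls=(), preferred_direction=None):
    """Validate an action and serialise it to the JSON format shown above."""
    controls = list(signal_controls)
    if len(controls) > MAX_SIGNAL_CONTROLS:
        raise ValueError(f"at most {MAX_SIGNAL_CONTROLS} signal controls per step")
    for control in controls:
        if control["phase"] not in VALID_PHASES:
            raise ValueError(f"unknown phase: {control['phase']!r}")
    if preferred_direction not in VALID_DIRECTIONS:
        raise ValueError(f"unknown direction: {preferred_direction!r}")
    return json.dumps({
        "hospital_id": hospital_id,
        "signal_controls": controls,
        "preferred_direction": preferred_direction,
    })


payload = build_action(
    hospital_id="hosp_c",
    signal_controls=[{"row": 4, "col": 4, "phase": "ns_green"}],
    preferred_direction="north",
)
```

Rejecting malformed actions on the client side keeps formatting mistakes separate from decision mistakes when debugging an agent.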
Reward Function: Designed to Be Hard to Game
| Component | Value | Purpose |
|---|---|---|
| Arrival | +1000 | Primary objective |
| Time bonus | +500 max | Rewards speed |
| Specialist match | +300 | Rewards reading patient condition |
| Red light stop | -20 each | Penalises poor signal management |
| Unnecessary toggle | -2/-5 each | Core anti-shortcut mechanism |
| Damaged road traversed | -10 each | Rewards road quality awareness |
| Successful re-route | +50 each | Rewards dynamic adaptation |
The unnecessary toggle penalty is the key design decision. An agent that blindly clears every signal in view scores lower than one that reads the state first. This forces genuine reasoning, not pattern-matching.
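The effect of the toggle penalty is easiest to see by tallying two otherwise identical episodes. The sketch below uses the values from the table; how the components combine is an assumption (a plain sum, with the time and specialist bonuses paid only on arrival), and since the table does not say when the -2 versus the -5 toggle penalty applies, it is left as a parameter. The episode numbers are hypothetical.

```python
def episode_reward(arrived, time_bonus, specialist_match, red_light_stops,
                   unnecessary_toggles, damaged_segments, reroutes,
                   toggle_penalty=2):
    """Sum the reward components from the table (assumed to be additive)."""
    total = 0
    if arrived:
        total += 1000                        # arrival
        total += min(time_bonus, 500)        # time bonus, capped at +500
        total += 300 if specialist_match else 0
    total -= 20 * red_light_stops
    total -= toggle_penalty * unnecessary_toggles
    total -= 10 * damaged_segments
    total += 50 * reroutes
    return total


# Hypothetical episodes that differ only in the number of unnecessary toggles.
naive = episode_reward(True, 400, True, 0, unnecessary_toggles=8,
                       damaged_segments=0, reroutes=0)
smart = episode_reward(True, 400, True, 0, unnecessary_toggles=0,
                       damaged_segments=0, reroutes=0)
```

With these numbers the blind agent scores 1684 against 1700 for the selective one: a small gap per episode, but one that consistently favours reading the signal state.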
Difficulty Levels
| Level | Grid | Hospitals | Base Traffic | Events/Step | Time Limit |
|---|---|---|---|---|---|
| easy | 6×6 | 2 general | Low (0.1) | 5% | 200s |
| medium | 8×8 | 3 mixed | Moderate (0.3) | 10% | 300s |
| hard | 12×12 | 5 (1 at capacity) | Heavy (0.5) | 15% | 400s |
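For reference, the table can be written as a small configuration structure. This is an illustrative encoding, not the environment's own config format; the `Difficulty` class and its field names are hypothetical, and the Events/Step column is interpreted here as a per-step probability of a dynamic event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    grid_size: int              # grid is grid_size x grid_size intersections
    hospitals: int
    base_traffic: float
    event_prob_per_step: float  # assumed meaning of "Events/Step"
    time_limit_s: int


LEVELS = {
    "easy": Difficulty(6, 2, 0.1, 0.05, 200),
    "medium": Difficulty(8, 3, 0.3, 0.10, 300),
    "hard": Difficulty(12, 5, 0.5, 0.15, 400),
}


def expected_events(level, steps):
    """Expected number of dynamic events over an episode of the given length."""
    return LEVELS[level].event_prob_per_step * steps


hard_events = expected_events("hard", 30)
```

Under that interpretation, a 30-step episode on hard sees about 4.5 events on average, so re-planning is the norm rather than the exception.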
Training
Model: Qwen/Qwen2.5-0.5B-Instruct + LoRA (r=16, 2.1M trainable params)
Algorithm: GRPO (Group Relative Policy Optimisation) via HuggingFace TRL
Setup: 10 iterations × 4 episodes per iteration, live environment connection
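The core idea of GRPO is that each episode is scored relative to the other episodes in its group instead of against a learned value function. The sketch below shows that normalisation step in plain Python. It is not the TRL implementation, whose exact details (for example the standard-deviation estimator) may differ. Treating the 4 episodes of one iteration as a group is an assumption, and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical episode rewards for one group of 4 rollouts.
rewards = [1440.0, 1442.0, 1445.0, 1449.0]
advantages = group_relative_advantages(rewards)
```

Because only the within-group differences matter, even small reward gaps such as the toggle penalty produce a usable learning signal once arrival is already reliable.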
Results
| Metric | Baseline (untrained) | After Training | Change |
|---|---|---|---|
| Arrival rate | 100% | 100% | 0 pp |
| Signal efficiency | 11% | 100% | +89 pp |
| Mean reward | 1442.6 | 1445.3 | +2.7 |
What the Numbers Mean
Signal efficiency is the core proof of learning.
The untrained model (11% efficiency) toggles every signal it encounters, including ones already in the correct phase. It treats the action space as "clear everything", which is a shallow shortcut.
After GRPO training (100% efficiency), the model learned to:
- Read `current_phase` from the observation
- Compute `needed_phase` based on the ambulance's direction of travel
- Only send a `SignalControl` action when they differ
This is not a trivial pattern. The mapping between ambulance direction and required signal phase differs per intersection and must be reasoned about from the observation text each step.
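The learned behaviour and the metric can be sketched together. The README does not define signal efficiency, so the definition here is an assumption: the fraction of toggles sent that were actually needed, with 0% when no toggles are sent. The direction-to-phase mapping is the simple one used in the client example in this README, and the nine-signal scenario is hypothetical.

```python
def needed_phase(direction):
    """Phase that gives the ambulance a green for its direction of travel."""
    return "ns_green" if direction in ("north", "south") else "ew_green"


def select_toggles(signals):
    """Keep only the signals whose current phase differs from the needed one."""
    return [s for s in signals if s["phase"] != needed_phase(s["direction"])]


def signal_efficiency(toggled):
    """Assumed metric: share of toggled signals that genuinely needed a change."""
    if not toggled:
        return 0.0
    needed = sum(1 for s in toggled if s["phase"] != needed_phase(s["direction"]))
    return needed / len(toggled)


# Hypothetical lookahead: 9 signals, ambulance heading north, only 1 in the wrong phase.
signals = [{"phase": "ns_green", "direction": "north"} for _ in range(8)]
signals.append({"phase": "ew_green", "direction": "north"})

naive_eff = signal_efficiency(signals)                  # toggles everything
smart_eff = signal_efficiency(select_toggles(signals))  # toggles only wrong ones
```

In this scenario the clear-everything policy lands at roughly 11% and the selective policy at 100%. The figures are illustrative and are not a reconstruction of how the reported numbers were measured.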
The training curve shows the characteristic GRPO exploration-convergence pattern:
- Iteration 1: Model arrives (100% arrival) but wastes actions (11% efficiency)
- Iterations 2–4: Exploration phase; arrival drops to 0–25% as the model tries aggressive strategies
- Iterations 5–10: Convergence, with 100% arrival and 100% signal efficiency simultaneously
Live Demo
Demo video: Watch on YouTube
Environment (OpenEnv WebSocket): wss://ajitg25-ambulance-green-corridor.hf.space/ws
Visual simulation: https://huggingface.co/spaces/Ajitg25/ambulance-green-corridor
GitHub (full code + notebook): https://github.com/ajitg25/openEnv-hackathon/tree/main
Training notebook: https://github.com/ajitg25/openEnv-hackathon/blob/main/examples/ambulance_grpo_training.ipynb
Connecting Your Own Agent
import asyncio

from ambulance_env import AmbulanceEnv, AmbulanceAction, SignalControl

async def main():
    async with AmbulanceEnv(base_url="https://ajitg25-ambulance-green-corridor.hf.space") as env:
        # Reset: get patient location, hospitals, initial state
        obs = (await env.reset()).observation
        # Step 1: dispatch to the specialist hospital
        obs = (await env.step(AmbulanceAction(hospital_id="hosp_b"))).observation
        # Step 2+: clear only wrong-phase signals each step
        while not obs.done:
            controls = [
                SignalControl(
                    row=s.row, col=s.col,
                    # Phase that gives the ambulance a green in its direction of travel
                    phase="ns_green" if s.ambulance_direction in ("north", "south") else "ew_green"
                )
                for s in obs.lookahead_signals
                # Skip signals that are already in the needed phase
                if s.phase != ("ns_green" if s.ambulance_direction in ("north", "south") else "ew_green")
            ]
            result = await env.step(AmbulanceAction(signal_controls=controls))
            obs = result.observation

# `async with` must run inside a coroutine, so drive it with asyncio
asyncio.run(main())
The Three Policies (Shown in Visual Demo)
| Policy | Behavior | Signal Efficiency | What It Demonstrates |
|---|---|---|---|
| No control | Does nothing | 0% | Pure baseline |
| Naive | Clears all signals | ~11% | Untrained LLM behavior |
| Smart | Clears only wrong-phase | 100% | Trained LLM behavior |
Why This Matters
Emergency vehicle routing is deployed infrastructure. The gap between "clear one signal 300m ahead" and "reason about the full journey in a partially observable dynamic city" is exactly the gap this environment is designed to close.
Could a researcher write a paper about this? Yes:
"LLM-based adaptive emergency corridor planning under partial observability and dynamic constraints"
That paper does not exist yet. This environment is the training ground for it.
Built for the OpenEnv Hackathon India 2026, Theme 3.1: World Modeling / Professional Tasks
