# Ambulance Green Corridor: OpenEnv Hackathon 2026
## Theme #3.1: World Modeling / Professional Tasks
**One-line summary:** We train an LLM to act as an emergency dispatcher + city traffic signal manager, navigating a partially observable city with real-world constraints (gridlock, potholes, accidents, hospital capacity) to get ambulances to the right hospital as fast as possible.
**Demo video:** [Watch on YouTube](https://youtu.be/9O5z4IXXtcc)
---
## The Capability Gap We Are Targeting
Current GPS-based emergency preemption systems (like Opticom, used globally) work like this:
> Ambulance within 300m of an intersection → turn it green.
That is reactive. It has no awareness of:
- What road quality lies ahead (potholed roads slow the ambulance even on green)
- Whether the nearest hospital is the *right* hospital for this patient's condition
- Whether heavy traffic on the planned route makes a longer detour actually faster
- Dynamic events mid-journey: accidents, road closures, traffic spikes
**The question we ask:** Can an LLM reason about the full journey (hospital selection, road quality, live traffic state, and mid-episode events) to get ambulances to the right place faster than any rule?
This is a genuine professional task that requires **persistent world modeling** across many steps. A rule cannot solve it. A shortest-path algorithm cannot solve it. This environment tests whether an LLM can.
---
## Theme Alignment: Why This Is Theme 3.1
Theme 3.1 asks for environments where:
> "the model is expected to do real hard work instead of exploiting short-cuts"
Our environment prevents shortcuts in three ways:
1. **Toggling already-green signals costs reward.** The agent must *read* signal state before acting; it cannot blindly clear everything.
2. **Traffic volume slows the ambulance even on green.** The agent cannot just clear signals and assume it will go fast; it must reason about the traffic volume on each segment.
3. **The nearest hospital is not always correct.** A cardiac patient sent to a trauma centre loses the +300 specialist bonus. But even the right specialist hospital may not be the best choice: if the route to it is gridlocked, clearing signals only gets you 20% speed through dense traffic. A farther hospital with lighter traffic and a lower ETA is the smarter pick. The agent must weigh specialization, distance, and live traffic volume simultaneously, and be willing to switch hospitals mid-journey if conditions change.
The agent must maintain a coherent world model across 15 to 30 steps, update it when dynamic events fire, and make non-obvious decisions that only pay off several steps later.
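The hospital trade-off in point 3 can be sketched as a small scoring function. This is an illustrative sketch, not the environment's code: `HospitalOption`, `estimated_reward`, and `pick_hospital` are hypothetical names, the reward constants come from the reward table later in this document, and the linear time bonus is an assumption.

```python
from dataclasses import dataclass

ARRIVAL_REWARD = 1000.0    # from the reward table
MAX_TIME_BONUS = 500.0     # from the reward table ("+500 max")
SPECIALIST_BONUS = 300.0   # from the reward table


@dataclass(frozen=True)
class HospitalOption:
    hospital_id: str
    specialty: str   # e.g. "cardiac", "trauma", "general"
    eta_s: float     # ETA under current traffic, in seconds


def estimated_reward(option, condition, elapsed_s, time_limit_s):
    """Rough reward estimate; 0 if the ambulance cannot arrive in time."""
    finish_s = elapsed_s + option.eta_s
    if finish_s > time_limit_s:
        return 0.0
    # ASSUMPTION: the time bonus shrinks linearly with the time used.
    time_bonus = MAX_TIME_BONUS * (1.0 - finish_s / time_limit_s)
    bonus = SPECIALIST_BONUS if option.specialty == condition else 0.0
    return ARRIVAL_REWARD + time_bonus + bonus


def pick_hospital(options, condition, elapsed_s, time_limit_s):
    """Choose the option with the highest estimated reward."""
    return max(
        options,
        key=lambda o: estimated_reward(o, condition, elapsed_s, time_limit_s),
    )


# A gridlocked specialist route versus a reachable general hospital.
gridlocked = [
    HospitalOption("hosp_c", "cardiac", eta_s=280.0),
    HospitalOption("hosp_a", "general", eta_s=150.0),
]
choice = pick_hospital(gridlocked, "cardiac", elapsed_s=40.0, time_limit_s=300.0)
```

Under these assumptions the specialist hospital wins whenever it is reachable in time with a comparable ETA, and loses once traffic pushes its arrival past the time limit.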
---
## Environment Design
### What the Agent Sees (Observation)
Every step, the agent receives a structured observation:
```
=== EMERGENCY DISPATCH ===
Patient : (6, 3) | condition: cardiac
Ambulance: (6, 4) | time: 40s / 300s
⚠ DYNAMIC EVENTS:
  [ACCIDENT] at (4,3) - road blocked (severity=0.8)
CURRENT ROUTE → hosp_a (City General)
  ETA=251s | segments=8 | damaged=2 | heavy_traffic=1
  (6,4)→(5,4) | residential | quality=moderate | traffic=45%
  (5,4)→(4,4) | damaged | quality=POTHOLED | [BLOCKED]
ALTERNATIVES:
  hosp_c (Cardiac Centre) - specialist match | ETA=130s | damaged=0
SIGNALS - only change WRONG ones:
  (5,4): ns_green | ambulance going north | OK
  (4,4): ew_green | ambulance going north | WRONG → needs ns_green
ACTION FORMAT:
  {"hospital_id": "hosp_c", "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}], "preferred_direction": null}
```
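A scripted baseline could pull the signal lines out of this text with a regular expression. The sketch below is a hedged illustration: the pattern is inferred only from the sample observation above, `wrong_phase_controls` is a hypothetical helper, and the real observation format may vary.

```python
import re

# Matches lines such as:
#   (4,4): ew_green | ambulance going north | WRONG -> needs ns_green
# ASSUMPTION: the layout is inferred from the sample observation only.
SIGNAL_LINE = re.compile(
    r"\((\d+),\s*(\d+)\):\s*(\w+)\s*\|\s*ambulance going (\w+)\s*\|\s*(OK|WRONG)"
    r"(?:.*?needs (\w+))?"
)


def wrong_phase_controls(observation_text):
    """Return signal_controls entries for signals marked WRONG."""
    controls = []
    for match in SIGNAL_LINE.finditer(observation_text):
        row, col, _phase, _direction, status, needed = match.groups()
        if status == "WRONG" and needed:
            controls.append({"row": int(row), "col": int(col), "phase": needed})
    return controls


SAMPLE = """SIGNALS:
  (5,4): ns_green | ambulance going north | OK
  (4,4): ew_green | ambulance going north | WRONG -> needs ns_green
"""
sample_controls = wrong_phase_controls(SAMPLE)
```

The returned list has the same shape as the `signal_controls` field in the action format.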
### What the Agent Does (Action Space)
```json
{
  "hospital_id": "hosp_c",
  "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}],
  "preferred_direction": "north"
}
```
- `hospital_id`: choose or switch destination at any step (not just at the start)
- `signal_controls`: override up to 3 signals in the lookahead window
- `preferred_direction`: hint the routing engine to take a specific turn
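The three fields above can be assembled and checked before sending. The following is a minimal sketch and not the project's client: `DispatchAction` and `SignalOverride` are hypothetical names, and the sets of valid phases and directions are assumptions based on the values that appear in this document.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

MAX_SIGNAL_CONTROLS = 3                                 # "up to 3 signals"
VALID_PHASES = {"ns_green", "ew_green"}                 # ASSUMPTION: only phases shown here
VALID_DIRECTIONS = {"north", "south", "east", "west"}   # ASSUMPTION


@dataclass
class SignalOverride:
    row: int
    col: int
    phase: str


@dataclass
class DispatchAction:
    hospital_id: Optional[str] = None
    signal_controls: list = field(default_factory=list)
    preferred_direction: Optional[str] = None

    def to_json(self) -> str:
        """Validate the action and serialise it to the JSON action format."""
        if len(self.signal_controls) > MAX_SIGNAL_CONTROLS:
            raise ValueError(f"at most {MAX_SIGNAL_CONTROLS} signal overrides per step")
        for control in self.signal_controls:
            if control.phase not in VALID_PHASES:
                raise ValueError(f"unknown phase: {control.phase}")
        if (self.preferred_direction is not None
                and self.preferred_direction not in VALID_DIRECTIONS):
            raise ValueError(f"unknown direction: {self.preferred_direction}")
        return json.dumps(asdict(self))


action = DispatchAction(
    hospital_id="hosp_c",
    signal_controls=[SignalOverride(row=4, col=4, phase="ns_green")],
)
payload = action.to_json()
```

Validating on the client side keeps malformed LLM output from reaching the environment, which makes parsing failures easier to separate from poor decisions.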
### Reward Function: Designed to Be Hard to Game
| Component | Value | Purpose |
|---|---|---|
| Arrival | +1000 | Primary objective |
| Time bonus | +500 max | Rewards speed |
| Specialist match | +300 | Rewards reading patient condition |
| Red light stop | -20 each | Penalises poor signal management |
| **Unnecessary toggle** | **-2/-5 each** | **Core anti-shortcut mechanism** |
| Damaged road traversed | β10 each | Rewards road quality awareness |
| Successful re-route | +50 each | Rewards dynamic adaptation |
**The unnecessary toggle penalty is the key design decision.** An agent that blindly clears every signal in view scores *lower* than one that reads the state first. This forces genuine reasoning, not pattern-matching.
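To see how the components combine, here is a rough sketch of an episode total built from the table. It is not the environment's implementation: `EpisodeStats` and `episode_reward` are hypothetical names, the linear time bonus is an assumption, the specialist bonus is assumed to apply only on arrival, and the choice between the 2 and 5 point toggle penalty is left as a parameter because the table does not say when each applies.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    arrived: bool
    time_used_s: float
    time_limit_s: float
    specialist_match: bool
    red_light_stops: int = 0
    unnecessary_toggles: int = 0
    damaged_roads: int = 0
    reroutes: int = 0


def episode_reward(stats: EpisodeStats, toggle_penalty: float = 2.0) -> float:
    """Sum the reward components listed in the table."""
    total = 0.0
    if stats.arrived:
        total += 1000.0
        # ASSUMPTION: bonus scales linearly with the unused share of the time limit.
        unused = max(0.0, 1.0 - stats.time_used_s / stats.time_limit_s)
        total += 500.0 * unused
        # ASSUMPTION: the specialist bonus is only paid on arrival.
        if stats.specialist_match:
            total += 300.0
    total -= 20.0 * stats.red_light_stops
    total -= toggle_penalty * stats.unnecessary_toggles
    total -= 10.0 * stats.damaged_roads
    total += 50.0 * stats.reroutes
    return total


careful = EpisodeStats(True, 150.0, 300.0, True)
naive = EpisodeStats(True, 150.0, 300.0, True, unnecessary_toggles=8)
careful_total = episode_reward(careful)
naive_total = episode_reward(naive, toggle_penalty=5.0)
```

With identical arrival time and hospital choice, the agent that toggles needlessly ends up with the lower total, which is the behaviour the penalty is meant to produce.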
### Difficulty Levels
| Level | Grid | Hospitals | Base Traffic | Events/Step | Time Limit |
|---|---|---|---|---|---|
| easy | 6×6 | 2 general | Low (0.1) | 5% | 200s |
| medium | 8×8 | 3 mixed | Moderate (0.3) | 10% | 300s |
| hard | 12×12 | 5 (1 at capacity) | Heavy (0.5) | 15% | 400s |
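The table can be mirrored as a small configuration object. This is a sketch for illustration only: `Difficulty`, `DIFFICULTIES`, and `expected_events` are hypothetical names, and reading "Events/Step" as an independent per-step event probability is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    grid_size: int              # the grid is grid_size x grid_size
    hospitals: int
    base_traffic: float
    event_prob_per_step: float  # ASSUMPTION: "Events/Step" is a per-step probability
    time_limit_s: int


DIFFICULTIES = {
    "easy": Difficulty(6, 2, 0.1, 0.05, 200),
    "medium": Difficulty(8, 3, 0.3, 0.10, 300),
    "hard": Difficulty(12, 5, 0.5, 0.15, 400),
}


def expected_events(level: str, steps: int) -> float:
    """Expected number of dynamic events over an episode of `steps` steps."""
    return DIFFICULTIES[level].event_prob_per_step * steps


medium_events = expected_events("medium", 20)
```

Under that reading, a 20-step medium episode sees about two dynamic events on average, so re-planning is the normal case and not an edge case.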
---
## Training
**Model:** `Qwen/Qwen2.5-0.5B-Instruct` + LoRA (r=16, 2.1M trainable params)
**Algorithm:** GRPO (Group Relative Policy Optimisation) via HuggingFace TRL
**Setup:** 10 iterations × 4 episodes per iteration, live environment connection
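GRPO scores each sampled episode relative to the other samples in its group instead of using a learned value function. The sketch below shows that normalisation step in plain Python. It is not TRL's code: `group_relative_advantages` is a hypothetical helper, treating the 4 episodes of an iteration as one group is an assumption, and implementations differ on details such as population versus sample standard deviation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every episode earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical episode rewards from one iteration.
advantages = group_relative_advantages([1400.0, 1450.0, 1500.0, 1450.0])
```

Episodes that beat the group mean get a positive advantage and are reinforced, while those below it are discouraged. When all episodes score the same, every advantage is zero and the group contributes no gradient signal.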
### Results

| Metric | Baseline (untrained) | After Training | Change |
|---|---|---|---|
| Arrival rate | 100% | 100% | 0 pp |
| **Signal efficiency** | **11%** | **100%** | **+89 pp** |
| Mean reward | 1442.6 | 1445.3 | +2.7 |
### What the Numbers Mean
**Signal efficiency is the core proof of learning.**
The untrained model (11% efficiency) toggles every signal it encounters, including ones already in the correct phase. It treats the action space as "clear everything", which is a shallow shortcut.
After GRPO training (100% efficiency), the model learned to:
1. Read `current_phase` from the observation
2. Compute `needed_phase` based on the ambulance's direction of travel
3. Only send a `SignalControl` action when they differ
This is **not** a trivial pattern. The mapping between ambulance direction and required signal phase differs per intersection and must be reasoned about from the observation text each step.
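The three learned steps can be written down as a scripted reference policy, together with one possible way to measure efficiency. This is a hedged sketch: the direction-to-phase rule is the one used in the connection example in this document, `plan_toggles` and `signal_efficiency` are hypothetical names, and the efficiency definition (share of sent toggles that were needed, 0 when nothing is sent) is an assumption, since the exact metric is not defined here.

```python
def needed_phase(direction):
    """Phase that gives a green light for the ambulance's travel direction."""
    return "ns_green" if direction in ("north", "south") else "ew_green"


def plan_toggles(signals):
    """signals: iterable of (row, col, current_phase, ambulance_direction)."""
    return [
        (row, col, needed_phase(direction))
        for row, col, current, direction in signals
        if current != needed_phase(direction)
    ]


def signal_efficiency(toggles_sent, toggles_needed):
    """ASSUMED metric: fraction of sent toggles that were actually needed."""
    if not toggles_sent:
        return 0.0  # mirrors the 0% reported for a policy that sends nothing
    needed = set(toggles_needed)
    return sum(t in needed for t in toggles_sent) / len(toggles_sent)


lookahead = [(5, 4, "ns_green", "north"), (4, 4, "ew_green", "north")]
smart = plan_toggles(lookahead)
naive = [(row, col, needed_phase(d)) for row, col, _cur, d in lookahead]
smart_eff = signal_efficiency(smart, smart)
naive_eff = signal_efficiency(naive, smart)
```

On this two-signal lookahead the naive policy sends one redundant toggle, so only half of its toggles are useful, while the read-first policy sends exactly the one that is needed.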
The training curve shows the characteristic GRPO exploration-convergence pattern:
- **Iteration 1:** Model arrives (100% arrival) but wastes actions (11% efficiency)
- **Iterations 2-4:** Exploration phase: arrival drops to 0-25% as the model tries aggressive strategies
- **Iterations 5-10:** Convergence: 100% arrival and 100% signal efficiency simultaneously
---
## Live Demo
**Demo video:** [Watch on YouTube](https://youtu.be/9O5z4IXXtcc)
**Environment (OpenEnv WebSocket):** `wss://ajitg25-ambulance-green-corridor.hf.space/ws`
**Visual simulation:** https://huggingface.co/spaces/Ajitg25/ambulance-green-corridor
**GitHub (full code + notebook):** https://github.com/ajitg25/openEnv-hackathon/tree/main
**Training notebook:** https://github.com/ajitg25/openEnv-hackathon/blob/main/examples/ambulance_grpo_training.ipynb
### Connecting Your Own Agent
```python
import asyncio

from ambulance_env import AmbulanceEnv, AmbulanceAction, SignalControl

BASE_URL = "https://ajitg25-ambulance-green-corridor.hf.space"


def needed_phase(direction: str) -> str:
    """Phase that gives the ambulance a green light for its travel direction."""
    return "ns_green" if direction in ("north", "south") else "ew_green"


async def main() -> None:
    async with AmbulanceEnv(base_url=BASE_URL) as env:
        # Reset: get patient location, hospitals, initial state
        obs = (await env.reset()).observation

        # Step 1: dispatch to the specialist hospital
        obs = (await env.step(AmbulanceAction(hospital_id="hosp_b"))).observation

        # Step 2+: override only the wrong-phase signals each step
        while not obs.done:
            controls = [
                SignalControl(row=s.row, col=s.col, phase=needed_phase(s.ambulance_direction))
                for s in obs.lookahead_signals
                if s.phase != needed_phase(s.ambulance_direction)
            ]
            result = await env.step(AmbulanceAction(signal_controls=controls))
            obs = result.observation


asyncio.run(main())
```
### The Three Policies (Shown in Visual Demo)
| Policy | Behavior | Signal Efficiency | What It Demonstrates |
|---|---|---|---|
| No control | Does nothing | 0% | Pure baseline |
| Naive | Clears all signals | ~11% | Untrained LLM behavior |
| Smart | Clears only wrong-phase | 100% | Trained LLM behavior |
---
## Why This Matters
Emergency vehicle routing is deployed infrastructure. The gap between "clear one signal 300m ahead" and "reason about the full journey in a partially observable dynamic city" is exactly the gap this environment is designed to close.
Could a researcher write a paper about this? Yes:
> *"LLM-based adaptive emergency corridor planning under partial observability and dynamic constraints"*
That paper does not exist yet. This environment is the training ground for it.
---
*Built for the OpenEnv Hackathon India 2026, Theme 3.1: World Modeling / Professional Tasks*