# Ambulance Green Corridor: OpenEnv Hackathon 2026

## Theme #3.1: World Modeling / Professional Tasks

**One-line summary:** We train an LLM to act as an emergency dispatcher + city traffic signal manager, navigating a partially observable city with real-world constraints (gridlock, potholes, accidents, hospital capacity) to get ambulances to the right hospital as fast as possible.

**Demo video:** [Watch on YouTube](https://youtu.be/9O5z4IXXtcc)

---

## The Capability Gap We Are Targeting

Current GPS-based emergency preemption systems (like Opticom, used globally) work like this:

> Ambulance within 300m of an intersection → turn it green.

That is reactive. It has no awareness of:
- What road quality lies ahead (potholed roads slow the ambulance even on green)
- Whether the nearest hospital is the *right* hospital for this patient's condition
- Whether heavy traffic on the planned route makes a longer detour actually faster
- Dynamic events mid-journey: accidents, road closures, traffic spikes
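
For reference, the distance-triggered rule quoted above fits in a few lines. This is a minimal sketch, not the actual logic of any deployed preemption product: the function name, the flat coordinate layout in metres, and the intersection ids are all assumptions made for illustration. Nothing in it looks at road quality, hospital choice, or traffic.

```python
import math

PREEMPT_RADIUS_M = 300.0  # threshold quoted in the rule above


def reactive_preempt(ambulance_xy, intersections):
    """Return ids of intersections to turn green: those within the radius.

    ambulance_xy  -- (x, y) position in metres (hypothetical layout)
    intersections -- dict mapping an id to its (x, y) position in metres
    """
    ax, ay = ambulance_xy
    return [
        name
        for name, (x, y) in intersections.items()
        if math.hypot(x - ax, y - ay) <= PREEMPT_RADIUS_M
    ]
```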

**The question we ask:** Can an LLM reason about the full journey (hospital selection, road quality, live traffic state, and mid-episode events) to get ambulances to the right place faster than any rule?

This is a genuine professional task that requires **persistent world modeling** across many steps. A rule cannot solve it. A shortest-path algorithm cannot solve it. This environment tests whether an LLM can.

---

## Theme Alignment: Why This Is Theme 3.1

Theme 3.1 asks for environments where:
> "the model is expected to do real hard work instead of exploiting short-cuts"

Our environment prevents shortcuts in three ways:

1. **Toggling already-green signals costs reward.** The agent must *read* signal state before acting; it cannot blindly clear everything.
2. **Traffic volume slows the ambulance even on green.** The agent cannot just clear signals and assume it will go fast; it must reason about the traffic volume on each segment.
3. **The nearest hospital is not always correct.** A cardiac patient sent to a trauma centre loses the +300 specialist bonus. But even the right specialist hospital may not be the best choice: if the route to it is gridlocked, clearing signals still leaves the ambulance at only 20% speed through dense traffic. A farther hospital with lighter traffic and a lower ETA is the smarter pick. The agent must weigh specialization, distance, and live traffic volume simultaneously, and be willing to switch hospitals mid-journey if conditions change.

The agent must maintain a coherent world model across 15–30 steps, update it when dynamic events fire, and make non-obvious decisions that only pay off several steps later.
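
The hospital trade-off in point 3 can be made concrete with a small scoring sketch. It uses the +300 specialist bonus and the +500 maximum time bonus from the reward table, but the shape of the time bonus is an assumption: the README only gives its maximum, so a linear decay over the time limit is used here purely for illustration. The function names and tuple layout are hypothetical.

```python
SPECIALIST_BONUS = 300  # from the reward table
MAX_TIME_BONUS = 500    # from the reward table


def estimated_bonus(eta_s, time_limit_s, specialist_match):
    """Bonus a hospital would earn on arrival under the assumptions below."""
    # ASSUMPTION: the time bonus decays linearly from 500 at t=0 to 0 at the limit.
    remaining = max(0.0, 1.0 - eta_s / time_limit_s)
    bonus = MAX_TIME_BONUS * remaining
    if specialist_match:
        bonus += SPECIALIST_BONUS
    return bonus


def pick_hospital(candidates, time_limit_s):
    """candidates: list of (hospital_id, eta_s, specialist_match) tuples.

    Returns the id with the highest estimated bonus among hospitals that
    can be reached within the time limit, or None if none can.
    """
    reachable = [c for c in candidates if c[1] <= time_limit_s]
    if not reachable:
        return None
    best = max(reachable, key=lambda c: estimated_bonus(c[1], time_limit_s, c[2]))
    return best[0]
```

Under these assumptions, a specialist hospital with a 290s ETA loses to a general hospital with a 100s ETA on a 300s limit, which is the "farther but faster" case described above.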

---

## Environment Design

### What the Agent Sees (Observation)

Every step, the agent receives a structured observation:

```
=== EMERGENCY DISPATCH ===
Patient  : (6, 3) | condition: cardiac
Ambulance: (6, 4) | time: 40s / 300s

⚠ DYNAMIC EVENTS:
  [ACCIDENT] at (4,3) - road blocked (severity=0.8)

CURRENT ROUTE → hosp_a (City General)
  ETA=251s | segments=8 | damaged=2 | heavy_traffic=1
  (6,4)→(5,4) | residential | quality=moderate | traffic=45%
  (5,4)→(4,4) | damaged     | quality=POTHOLED | [BLOCKED]

ALTERNATIVES:
  hosp_c (Cardiac Centre) ← specialist match | ETA=130s | damaged=0

SIGNALS - only change WRONG ones:
  (5,4): ns_green | ambulance going north | OK
  (4,4): ew_green | ambulance going north | WRONG - needs ns_green

ACTION FORMAT:
{"hospital_id": "hosp_c", "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}], "preferred_direction": null}
```
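
A scripted baseline can pull the wrong-phase signals straight out of this text. The sketch below is one possible parser, assuming the `SIGNALS` lines always follow the layout shown in the sample above; the regular expression and the `wrong_signals` name are illustrative and not part of the environment's API.

```python
import re

# Matches lines such as:
#   (4,4): ew_green | ambulance going north | WRONG ... needs ns_green
# The separator before "needs" is deliberately not matched literally.
SIGNAL_LINE = re.compile(
    r"\((\d+),(\d+)\):\s*(ns_green|ew_green)\s*\|.*?\|\s*(OK|WRONG)"
    r"(?:.*needs\s+(ns_green|ew_green))?"
)


def wrong_signals(observation_text):
    """Return signal_controls entries for every signal marked WRONG."""
    controls = []
    for match in SIGNAL_LINE.finditer(observation_text):
        row, col, _current, status, needed = match.groups()
        if status == "WRONG" and needed:
            controls.append({"row": int(row), "col": int(col), "phase": needed})
    return controls
```

Route lines such as `(6,4)→(5,4) | residential` are skipped because the pattern requires a colon directly after the coordinate pair.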

### What the Agent Does (Action Space)

```json
{
  "hospital_id": "hosp_c",
  "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}],
  "preferred_direction": "north"
}
```

- `hospital_id`: choose or switch destination at any step (not just at the start)
- `signal_controls`: override up to 3 signals in the lookahead window
- `preferred_direction`: hint the routing engine to take a specific turn
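
When the action comes from free-form LLM output, it helps to normalise it before sending. The helper below is a hedged sketch: `build_action` is a hypothetical name, the set of valid phases contains only the two values that appear in this README (the environment may accept others), and truncating to the first three controls is one possible way to respect the limit above.

```python
import json

VALID_PHASES = {"ns_green", "ew_green"}  # ASSUMPTION: only the phases shown in this README
MAX_SIGNAL_CONTROLS = 3                  # limit stated in the list above


def build_action(hospital_id=None, signal_controls=(), preferred_direction=None):
    """Return the action as a JSON string in the format shown above."""
    controls = []
    for control in signal_controls:
        if control["phase"] not in VALID_PHASES:
            raise ValueError(f"unknown phase: {control['phase']!r}")
        controls.append(
            {
                "row": int(control["row"]),
                "col": int(control["col"]),
                "phase": control["phase"],
            }
        )
    return json.dumps(
        {
            "hospital_id": hospital_id,
            # Keep only the first three overrides; extras are dropped.
            "signal_controls": controls[:MAX_SIGNAL_CONTROLS],
            "preferred_direction": preferred_direction,
        }
    )
```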

### Reward Function: Designed to Be Hard to Game

| Component | Value | Purpose |
|---|---|---|
| Arrival | +1000 | Primary objective |
| Time bonus | +500 max | Rewards speed |
| Specialist match | +300 | Rewards reading patient condition |
| Red light stop | −20 each | Penalises poor signal management |
| **Unnecessary toggle** | **−2/−5 each** | **Core anti-shortcut mechanism** |
| Damaged road traversed | −10 each | Rewards road quality awareness |
| Successful re-route | +50 each | Rewards dynamic adaptation |

**The unnecessary toggle penalty is the key design decision.** An agent that blindly clears every signal in view scores *lower* than one that reads the state first. This forces genuine reasoning, not pattern-matching.
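
To see how the toggle penalty separates the two behaviours, the table can be tallied in a few lines. This is a sketch under stated assumptions, not the environment's reward code: the table lists the toggle penalty as −2/−5 without saying when each value applies, so it is a parameter here; the time bonus is supplied by the caller and clamped to its +500 maximum; and the specialist bonus is assumed to be paid only on arrival. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    arrived: bool
    time_bonus: float        # caller-supplied, clamped to [0, 500] below
    specialist_match: bool
    red_light_stops: int
    unnecessary_toggles: int
    damaged_segments: int
    reroutes: int


def episode_reward(stats, toggle_penalty=2.0):
    """Sum the reward-table components for one episode."""
    total = 0.0
    if stats.arrived:
        total += 1000.0
        total += min(max(stats.time_bonus, 0.0), 500.0)
        if stats.specialist_match:  # ASSUMPTION: bonus is paid only on arrival
            total += 300.0
    total -= 20.0 * stats.red_light_stops
    total -= toggle_penalty * stats.unnecessary_toggles
    total -= 10.0 * stats.damaged_segments
    total += 50.0 * stats.reroutes
    return total
```

With everything else equal, an episode with eight unnecessary toggles scores 16 points lower at a penalty of 2 each than one with none.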

### Difficulty Levels

| Level | Grid | Hospitals | Base Traffic | Events/Step | Time Limit |
|---|---|---|---|---|---|
| easy | 6×6 | 2 general | Low (0.1) | 5% | 200s |
| medium | 8×8 | 3 mixed | Moderate (0.3) | 10% | 300s |
| hard | 12×12 | 5 (1 at capacity) | Heavy (0.5) | 15% | 400s |
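
The same table can be kept as data when scripting experiments. The field names below are hypothetical, and reading "Events/Step" as an independent per-step probability is an assumption made for this sketch; under that reading the expected number of dynamic events is simply probability times steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    grid_size: int              # grid is grid_size x grid_size
    hospitals: int
    base_traffic: float
    event_prob_per_step: float  # ASSUMPTION: independent chance of an event each step
    time_limit_s: int


# Values copied from the difficulty table above.
LEVELS = {
    "easy": Difficulty(6, 2, 0.1, 0.05, 200),
    "medium": Difficulty(8, 3, 0.3, 0.10, 300),
    "hard": Difficulty(12, 5, 0.5, 0.15, 400),
}


def expected_events(level, steps):
    """Expected number of dynamic events over an episode of `steps` steps."""
    return LEVELS[level].event_prob_per_step * steps
```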

---

## Training

**Model:** `Qwen/Qwen2.5-0.5B-Instruct` + LoRA (r=16, 2.1M trainable params)  
**Algorithm:** GRPO (Group Relative Policy Optimisation) via HuggingFace TRL  
**Setup:** 10 iterations × 4 episodes per iteration, live environment connection
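
GRPO scores each sampled episode relative to the others in its group rather than against a learned value function. The sketch below shows that normalisation in plain Python, assuming the four episodes of one iteration form a single group; it is an illustration of the idea, not the TRL implementation, and libraries differ on details such as which standard deviation they use.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group: (r - mean) / (std + eps).

    Uses the population standard deviation; eps avoids division by zero
    when every episode in the group earns the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Episodes that beat their group mean get a positive advantage and are reinforced; a group with identical rewards gives no learning signal at all.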

### Results

![Training curves β€” reward, arrival rate, signal efficiency, re-routing](ambulance_training_results.png)

| Metric | Baseline (untrained) | After Training | Change |
|---|---|---|---|
| Arrival rate | 100% | 100% | 0 pp |
| **Signal efficiency** | **11%** | **100%** | **+89 pp** |
| Mean reward | 1442.6 | 1445.3 | +2.7 |

### What the Numbers Mean

**Signal efficiency is the core proof of learning.**

The untrained model (11% efficiency) toggles every signal it encounters, including ones already in the correct phase. It treats the action space as "clear everything", a shallow shortcut.

After GRPO training (100% efficiency), the model learned to:
1. Read `current_phase` from the observation
2. Compute `needed_phase` based on the ambulance's direction of travel
3. Only send a `SignalControl` action when they differ

This is **not** a trivial pattern. The ambulance's direction of travel, and with it the required signal phase, differs per intersection and must be worked out from the observation text each step.
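
The three learned steps correspond to the following sketch. The direction-to-phase mapping follows the agent example later in this README; the dict layout and function names are hypothetical. The efficiency formula is an assumption, since the README does not define the metric: here it is the share of sent toggles that were actually needed, with 0 returned when nothing is sent so that it agrees with the "No control" row of the policy table.

```python
def needed_phase(direction):
    """North/south travel needs ns_green; east/west travel needs ew_green."""
    return "ns_green" if direction in ("north", "south") else "ew_green"


def smart_controls(signals):
    """signals: list of dicts with row, col, phase, ambulance_direction.

    Returns overrides only for signals whose current phase is wrong.
    """
    controls = []
    for s in signals:
        target = needed_phase(s["ambulance_direction"])
        if s["phase"] != target:
            controls.append({"row": s["row"], "col": s["col"], "phase": target})
    return controls


def signal_efficiency(toggles_sent, toggles_needed):
    """ASSUMED definition: fraction of sent toggles that were needed."""
    if toggles_sent == 0:
        return 0.0
    return toggles_needed / toggles_sent
```

As an illustration of this assumed definition, an agent that overrides nine signals when only one was in the wrong phase scores 1/9, roughly 11%.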

The training curve shows the characteristic GRPO exploration-convergence pattern:
- **Iteration 1:** The model arrives (100% arrival) but wastes actions (11% efficiency)
- **Iterations 2–4:** Exploration phase: arrival drops to 0–25% as the model tries aggressive strategies
- **Iterations 5–10:** Convergence: 100% arrival and 100% signal efficiency simultaneously

---

## Live Demo

**Demo video:** [Watch on YouTube](https://youtu.be/9O5z4IXXtcc)  
**Environment (OpenEnv WebSocket):** `wss://ajitg25-ambulance-green-corridor.hf.space/ws`  
**Visual simulation:** https://huggingface.co/spaces/Ajitg25/ambulance-green-corridor  
**GitHub (full code + notebook):** https://github.com/ajitg25/openEnv-hackathon/tree/main  
**Training notebook:** https://github.com/ajitg25/openEnv-hackathon/blob/main/examples/ambulance_grpo_training.ipynb

### Connecting Your Own Agent

```python
import asyncio

from ambulance_env import AmbulanceEnv, AmbulanceAction, SignalControl

BASE_URL = "https://ajitg25-ambulance-green-corridor.hf.space"


def needed_phase(direction):
    # North/south travel needs ns_green; east/west travel needs ew_green
    return "ns_green" if direction in ("north", "south") else "ew_green"


async def main():
    async with AmbulanceEnv(base_url=BASE_URL) as env:
        # Reset: get patient location, hospitals, initial state
        obs = (await env.reset()).observation

        # Step 1: dispatch to the specialist hospital
        obs = (await env.step(AmbulanceAction(hospital_id="hosp_b"))).observation

        # Step 2+: clear only wrong-phase signals each step
        while not obs.done:
            controls = [
                SignalControl(row=s.row, col=s.col, phase=needed_phase(s.ambulance_direction))
                for s in obs.lookahead_signals
                if s.phase != needed_phase(s.ambulance_direction)
            ]
            result = await env.step(AmbulanceAction(signal_controls=controls))
            obs = result.observation


# `async with` must run inside a coroutine, so drive it with asyncio.run
asyncio.run(main())
```

### The Three Policies (Shown in Visual Demo)

| Policy | Behavior | Signal Efficiency | What It Demonstrates |
|---|---|---|---|
| No control | Does nothing | 0% | Pure baseline |
| Naive | Clears all signals | ~11% | Untrained LLM behavior |
| Smart | Clears only wrong-phase | 100% | Trained LLM behavior |

---

## Why This Matters

Emergency vehicle routing is deployed infrastructure. The gap between "clear one signal 300m ahead" and "reason about the full journey in a partially observable dynamic city" is exactly the gap this environment is designed to close.

Could a researcher write a paper about this? Yes:
> *"LLM-based adaptive emergency corridor planning under partial observability and dynamic constraints"*

That paper does not exist yet. This environment is the training ground for it.

---

*Built for the OpenEnv Hackathon India 2026, Theme 3.1: World Modeling / Professional Tasks*