Ajitg25 committed on
Commit
0f1191d
·
verified ·
1 Parent(s): 7d14239

Rewrite Blog.md: theme alignment, capability gap, agent-readable format

Files changed (1)
  1. Blog.md +126 -91
Blog.md CHANGED
@@ -1,41 +1,49 @@
1
- # Can an LLM Learn to Save Lives by Managing City Traffic?
2
 
3
- **tl;dr:** We built an OpenEnv environment that trains an LLM to act as emergency dispatcher + city traffic signal manager. After GRPO training, signal efficiency jumped from **11% → 100%**. Here's how it works and why it genuinely needs an LLM, not just a rule.
 
 
4
 
5
  ---
6
 
7
- ## The Problem
 
 
 
 
8
 
9
- In a cardiac emergency, every minute of delay costs ~10% survival probability.
 
 
 
 
10
 
11
- Existing GPS-based emergency preemption systems (like Opticom) clear one traffic signal when an ambulance is 300m away. That's reactive, single-intersection, and has no awareness of what lies ahead.
12
 
13
- Our environment asks: **can an LLM reason about the full journey (hospital selection, road quality, live traffic, and dynamic events) to get the ambulance there faster?**
14
 
15
  ---
16
 
17
- ## Why This Needs an LLM (Not a Rule)
18
 
19
- Consider this scenario:
 
20
 
21
- - **Hospital A:** 6 intersections away, but 3 road segments are gridlocked. Clearing signals helps, but heavy traffic means the ambulance crawls at ~20% speed even on green.
22
- - **Hospital B:** 8 intersections, lighter traffic, highway-quality roads. ETA is actually 40 seconds faster.
23
- - **Midway:** an accident blocks the planned route. The system must re-route in real time.
24
 
25
- No rule-based system can solve this. The agent must simultaneously reason about:
 
 
26
 
27
- - Distance vs. traffic volume vs. road quality
28
- - Hospital specialization (cardiac patient → cardiac centre, not general hospital)
29
- - Dynamic events appearing mid-journey (accidents, road closures, traffic spikes)
30
- - Which signals actually need clearing: toggling an already-green signal wastes an action and costs reward
31
 
32
  ---
33
 
34
- ## The Environment
35
 
36
- Built on **[OpenEnv](https://github.com/meta-pytorch/OpenEnv)**, the hackathon framework for LLM training environments.
37
 
38
- ### What the agent sees each step
39
 
40
  ```
41
  === EMERGENCY DISPATCH ===
@@ -43,123 +51,150 @@ Patient : (6, 3) | condition: cardiac
43
  Ambulance: (6, 4) | time: 40s / 300s
44
 
45
  ⚠ DYNAMIC EVENTS:
46
- [ACCIDENT] at (4,3) - blocking road (severity=0.8)
47
 
48
- CURRENT ROUTE → hosp_a
49
  ETA=251s | segments=8 | damaged=2 | heavy_traffic=1
50
- (6,4)→(5,4) | residential | quality=moderate | traffic=45% | est=22s
51
- (5,4)→(4,4) | damaged | quality=POTHOLED | traffic=62% | est=41s [BLOCKED]
52
 
53
- ALTERNATIVES (consider switching if ETA much lower):
54
- hosp_c (cardiac) <- specialist match: ETA=130s | damaged=0 | heavy=0
55
 
56
- HOSPITALS:
57
- hosp_a: City General | spec=general | est=251s
58
- hosp_c: Cardiac Centre | spec=cardiac | est=130s <- specialist match
59
 
60
- SIGNALS (only change WRONG ones):
61
- (5,4): ns_green | dir=north | OK
62
- (4,4): ew_green | dir=north | WRONG - needs ns_green
 
 
63
 
64
- ACTION: {"hospital_id": "hosp_c", "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}], "preferred_direction": null}
 
 
 
 
 
65
  ```
66
 
67
- ### Reward function: designed to be hard to game
 
 
 
 
68
 
69
- | Component | Value |
70
- |---|---|
71
- | Arrival bonus | +1000 |
72
- | Time bonus | up to +500 (faster = more) |
73
- | Specialist hospital match | +300 |
74
- | Red light stop | −20 each |
75
- | **Unnecessary signal toggle** | **−2/−5 each** |
76
- | Damaged road segments traversed | −10 each |
77
- | Successful re-route | +50 each |
78
 
79
- The unnecessary toggle penalty is the key design decision. An agent that blindly clears every signal it sees scores *worse* than one that reads the signal state first. This forces the LLM to actually reason about observations rather than pattern-match to a fixed action.
80
 
81
- ### Difficulty levels
82
 
83
- | Level | Grid | Hospitals | Traffic | Dynamic Events | Time Limit |
84
  |---|---|---|---|---|---|
85
- | easy | 6×6 | 2 | Low | 5%/step | 200s |
86
- | medium | 8×8 | 3 | Moderate | 10%/step | 300s |
87
- | hard | 12×12 | 5 (1 at capacity) | Heavy | 15%/step | 400s |
88
 
89
  ---
90
 
91
  ## Training
92
 
93
- - **Model:** `Qwen/Qwen2.5-0.5B-Instruct` + LoRA (r=16, 2.1M trainable params)
94
- - **Algorithm:** GRPO (Group Relative Policy Optimisation via HuggingFace TRL)
95
- - **Setup:** 10 iterations × 4 episodes per iteration
96
- - **Environment:** Live OpenEnv server running alongside training loop
97
-
98
- ![Training curves](ambulance_training_results.png)
99
- *Four panels: Episode reward, Hospital arrival rate, Signal efficiency (11% → 100%), Adaptive re-routing*
100
 
101
  ### Results
102
 
103
- | Metric | Baseline (untrained) | Trained | Change |
 
 
104
  |---|---|---|---|
105
  | Arrival rate | 100% | 100% | 0 pp |
106
- | **Signal efficiency** | **11%** | **100%** | **+89 percentage points** |
107
  | Mean reward | 1442.6 | 1445.3 | +2.7 |
108
- | Mean travel time | 125s | 127.5s | +2.5s |
109
-
110
- ### What the numbers mean
111
-
112
- **Signal efficiency is the headline metric.** The untrained model toggled every signal it saw, including ones already in the correct phase, incurring unnecessary-toggle penalties on every step. After GRPO training, the model learned to read `sig.phase` vs `sig.ambulance_direction` and only act when a signal genuinely needs changing.
113
-
114
- The training curve shows characteristic GRPO behaviour:
115
- - **Iteration 1:** model arrives but wastes actions (efficiency=11%)
116
- - **Iterations 2–4:** exploration phase: the model tries aggressive strategies, arrival drops to 0–25%
117
- - **Iterations 5–10:** sharp convergence: 100% arrival, 100% signal efficiency, stable reward
118
 
119
- This exploration→convergence pattern is the training story. A rule-based system would never show this curve — it would be flat from iteration 1.
120
 
121
- ---
122
-
123
- ## Why This Environment Matters
124
 
125
- Emergency vehicle routing is a real, unsolved problem in smart city infrastructure. Current systems are:
126
 
127
- - **Reactive:** clear one signal at a time, 300m in advance
128
- - **Unaware of road quality:** a potholed road still gets treated as highway
129
- - **Static:** no dynamic re-routing when accidents occur
130
- - **Oblivious to hospital specialization:** nearest hospital isn't always right hospital
131
 
132
- An LLM trained on this environment learns to reason about all four simultaneously. That's a capability that doesn't exist in any deployed system today.
133
 
134
- Could a researcher write a paper about this? Yes: "LLM-based adaptive emergency corridor planning under partial observability and dynamic constraints" is a legitimate research direction this environment enables.
 
 
 
135
 
136
  ---
137
 
138
- ## Try It
139
 
140
- **Live environment:** https://huggingface.co/spaces/Ajitg25/ambulance-green-corridor
 
 
 
141
 
142
- **Code + training notebook:** https://github.com/ajitg25/openEnv-hackathon/tree/final
143
 
144
  ```python
145
  from ambulance_env import AmbulanceEnv, AmbulanceAction, SignalControl
146
 
147
  async with AmbulanceEnv(base_url="https://ajitg25-ambulance-green-corridor.hf.space") as env:
 
148
      obs = (await env.reset()).observation
149
-     # Dispatch to specialist hospital

150
      obs = (await env.step(AmbulanceAction(hospital_id="hosp_b"))).observation
151
-     # Clear only wrong-phase signals
152
-     controls = [
153
-         SignalControl(row=s.row, col=s.col,
154
-                       phase="ns_green" if s.ambulance_direction in ("north","south") else "ew_green")
155
-         for s in obs.lookahead_signals
156
-         if s.phase != ("ns_green" if s.ambulance_direction in ("north","south") else "ew_green")
157
-     ]
158
-     result = await env.step(AmbulanceAction(signal_controls=controls))
 
 
 
 
 
159
  ```
160
 
 
 
 
 
 
 
 
 
161
  ---
162
 
 
 
 
 
 
 
 
 
 
163
  ---
164
 
165
- *Built for the OpenEnv Hackathon India 2026.*
 
1
+ # Ambulance Green Corridor: OpenEnv Hackathon 2026
2
 
3
+ ## Theme #3.1: World Modeling / Professional Tasks
4
+
5
+ **One-line summary:** We train an LLM to act as an emergency dispatcher + city traffic signal manager, navigating a partially observable city with real-world constraints (gridlock, potholes, accidents, hospital capacity) to get ambulances to the right hospital as fast as possible.
6
 
7
  ---
8
 
9
+ ## The Capability Gap We Are Targeting
10
+
11
+ Current GPS-based emergency preemption systems (like Opticom, used globally) work like this:
12
+
13
+ > Ambulance within 300m of an intersection → turn it green.
14
 
15
+ That is reactive. It has no awareness of:
16
+ - What road quality lies ahead (potholed roads slow the ambulance even on green)
17
+ - Whether the nearest hospital is the *right* hospital for this patient's condition
18
+ - Whether heavy traffic on the planned route makes a longer detour actually faster
19
+ - Dynamic events mid-journey: accidents, road closures, traffic spikes
20
 
21
+ **The question we ask:** Can an LLM reason about the full journey (hospital selection, road quality, live traffic state, and mid-episode events) to get ambulances to the right place faster than any rule?
22
 
23
+ This is a genuine professional task that requires **persistent world modeling** across many steps. A rule cannot solve it. A shortest-path algorithm cannot solve it. This environment tests whether an LLM can.
24
 
25
  ---
26
 
27
+ ## Theme Alignment: Why This Is Theme 3.1
28
 
29
+ Theme 3.1 asks for environments where:
30
+ > "the model is expected to do real hard work instead of exploiting short-cuts"
31
 
32
+ Our environment prevents shortcuts in three ways:
 
 
33
 
34
+ 1. **Toggling already-green signals costs reward.** The agent must *read* signal state before acting; it cannot blindly clear everything.
35
+ 2. **Traffic volume slows the ambulance even on green.** The agent cannot just clear signals and assume it will go fast; it must reason about the traffic volume on each segment.
36
+ 3. **The nearest hospital is not always correct.** A cardiac patient sent to a trauma centre loses the +300 specialist bonus. The agent must match condition to specialization.
37
 
38
+ The agent must maintain a coherent world model across 15–30 steps, update it when dynamic events fire, and make non-obvious decisions that only pay off several steps later.
 
 
 
39
 
40
  ---
41
 
42
+ ## Environment Design
43
 
44
+ ### What the Agent Sees (Observation)
45
 
46
+ Every step, the agent receives a structured observation:
47
 
48
  ```
49
  === EMERGENCY DISPATCH ===
 
51
  Ambulance: (6, 4) | time: 40s / 300s
52
 
53
  ⚠ DYNAMIC EVENTS:
54
+ [ACCIDENT] at (4,3) - road blocked (severity=0.8)
55
 
56
+ CURRENT ROUTE → hosp_a (City General)
57
  ETA=251s | segments=8 | damaged=2 | heavy_traffic=1
58
+ (6,4)→(5,4) | residential | quality=moderate | traffic=45%
59
+ (5,4)→(4,4) | damaged | quality=POTHOLED | [BLOCKED]
60
 
61
+ ALTERNATIVES:
62
+ hosp_c (Cardiac Centre) ← specialist match | ETA=130s | damaged=0
63
 
64
+ SIGNALS - only change WRONG ones:
65
+ (5,4): ns_green | ambulance going north | OK
66
+ (4,4): ew_green | ambulance going north | WRONG - needs ns_green
67
 
68
+ ACTION FORMAT:
69
+ {"hospital_id": "hosp_c", "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}], "preferred_direction": null}
70
+ ```
71
+
72
+ ### What the Agent Does (Action Space)
73
 
74
+ ```json
75
+ {
76
+ "hospital_id": "hosp_c",
77
+ "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}],
78
+ "preferred_direction": "north"
79
+ }
80
  ```
81
 
82
+ - `hospital_id`: choose or switch destination at any step (not just at the start)
83
+ - `signal_controls`: override up to 3 signals in the lookahead window
84
+ - `preferred_direction`: hint the routing engine to take a specific turn
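The three fields above can be checked before an action is sent to the environment. The following is a minimal sketch, not the project's own parser: `ParsedAction`, `parse_action`, and the assumption that `ns_green` and `ew_green` are the only valid phases are all hypothetical, inferred from the examples shown in this post.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: only the two phases that appear in the examples are valid.
VALID_PHASES = {"ns_green", "ew_green"}
# "override up to 3 signals" per the action-space description.
MAX_SIGNAL_CONTROLS = 3


@dataclass
class ParsedAction:
    """Hypothetical container mirroring the JSON action fields."""
    hospital_id: Optional[str] = None
    signal_controls: List[dict] = field(default_factory=list)
    preferred_direction: Optional[str] = None


def parse_action(raw: str) -> ParsedAction:
    """Parse a model reply into an action; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")

    controls = data.get("signal_controls") or []
    if len(controls) > MAX_SIGNAL_CONTROLS:
        raise ValueError("too many signal controls")
    for c in controls:
        if not isinstance(c, dict):
            raise ValueError("each signal control must be an object")
        if c.get("phase") not in VALID_PHASES:
            raise ValueError(f"unknown phase: {c.get('phase')!r}")
        if not isinstance(c.get("row"), int) or not isinstance(c.get("col"), int):
            raise ValueError("row and col must be integers")

    return ParsedAction(
        hospital_id=data.get("hospital_id"),
        signal_controls=controls,
        preferred_direction=data.get("preferred_direction"),
    )
```

Rejecting malformed replies early keeps a badly formatted completion from being silently interpreted as a no-op step.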
85
+
86
+ ### Reward Function: Designed to Be Hard to Game
87
 
88
+ | Component | Value | Purpose |
89
+ |---|---|---|
90
+ | Arrival | +1000 | Primary objective |
91
+ | Time bonus | +500 max | Rewards speed |
92
+ | Specialist match | +300 | Rewards reading patient condition |
93
+ | Red light stop | −20 each | Penalises poor signal management |
94
+ | **Unnecessary toggle** | **−2/−5 each** | **Core anti-shortcut mechanism** |
95
+ | Damaged road traversed | −10 each | Rewards road quality awareness |
96
+ | Successful re-route | +50 each | Rewards dynamic adaptation |
97
 
98
+ **The unnecessary toggle penalty is the key design decision.** An agent that blindly clears every signal in view scores *lower* than one that reads the state first. This forces genuine reasoning, not pattern-matching.
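To make the effect of the toggle penalty concrete, here is a rough sketch of how the table's components might combine into an episode reward. The component values come from the table; everything else is an assumption for illustration: the `EpisodeStats` and `episode_reward` names, the linear shape of the time bonus, paying the specialist bonus only on arrival, and using the smaller −2 toggle penalty by default (the post does not say when −2 versus −5 applies).

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    """Hypothetical per-episode counters; field names are illustrative."""
    arrived: bool
    elapsed_s: float
    time_limit_s: float
    specialist_match: bool
    red_light_stops: int
    unnecessary_toggles: int
    damaged_segments: int
    successful_reroutes: int


def episode_reward(s: EpisodeStats, toggle_penalty: float = 2.0) -> float:
    """Sum the reward components from the table under the stated assumptions."""
    reward = 0.0
    if s.arrived:
        reward += 1000.0
        # Assumption: the time bonus scales linearly with the time left over.
        remaining = max(0.0, 1.0 - s.elapsed_s / s.time_limit_s)
        reward += 500.0 * remaining
        # Assumption: the specialist bonus is paid only on arrival.
        if s.specialist_match:
            reward += 300.0
    reward -= 20.0 * s.red_light_stops
    reward -= toggle_penalty * s.unnecessary_toggles
    reward -= 10.0 * s.damaged_segments
    reward += 50.0 * s.successful_reroutes
    return reward


# Two otherwise identical episodes; only the number of wasted toggles differs.
careful = EpisodeStats(True, 150.0, 300.0, True, 0, 0, 0, 0)
blind = EpisodeStats(True, 150.0, 300.0, True, 0, 20, 0, 0)
print(episode_reward(careful), episode_reward(blind))
```

Under these assumptions the per-toggle penalty is small next to the arrival bonus, so it separates careful from blind policies without ever outweighing the primary objective.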
99
 
100
+ ### Difficulty Levels
101
 
102
+ | Level | Grid | Hospitals | Base Traffic | Events/Step | Time Limit |
103
  |---|---|---|---|---|---|
104
+ | easy | 6×6 | 2 general | Low (0.1) | 5% | 200s |
105
+ | medium | 8×8 | 3 mixed | Moderate (0.3) | 10% | 300s |
106
+ | hard | 12×12 | 5 (1 at capacity) | Heavy (0.5) | 15% | 400s |
107
 
108
  ---
109
 
110
  ## Training
111
 
112
+ **Model:** `Qwen/Qwen2.5-0.5B-Instruct` + LoRA (r=16, 2.1M trainable params)
113
+ **Algorithm:** GRPO (Group Relative Policy Optimisation) via HuggingFace TRL
114
+ **Setup:** 10 iterations × 4 episodes per iteration, live environment connection
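For readers unfamiliar with GRPO, its central idea is that each sample's advantage is its reward measured relative to the other samples in the same group. The sketch below illustrates only that normalisation step in plain Python; it is not the TRL implementation. Treating the four episodes of one iteration as a single group is an assumption, and the rewards in the example are made-up numbers.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one group: (r - group mean) / (group std + eps).

    The population standard deviation is used here; a real implementation
    may use the sample estimator instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of four episodes.
advantages = group_relative_advantages([10.0, 12.0, 14.0, 12.0])
print(advantages)
```

Episodes that beat the group average get a positive advantage and are reinforced; when all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.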
 
 
 
 
115
 
116
  ### Results
117
 
118
+ ![Training curves: reward, arrival rate, signal efficiency, re-routing](ambulance_training_results.png)
119
+
120
+ | Metric | Baseline (untrained) | After Training | Change |
121
  |---|---|---|---|
122
  | Arrival rate | 100% | 100% | 0 pp |
123
+ | **Signal efficiency** | **11%** | **100%** | **+89 pp** |
124
  | Mean reward | 1442.6 | 1445.3 | +2.7 |
 
 
 
 
 
 
 
 
 
 
125
 
126
+ ### What the Numbers Mean
127
 
128
+ **Signal efficiency is the core proof of learning.**
 
 
129
 
130
+ The untrained model (11% efficiency) toggles every signal it encounters, including ones already in the correct phase. It treats the action space as "clear everything", a shallow shortcut.
131
 
132
+ After GRPO training (100% efficiency), the model learned to:
133
+ 1. Read `current_phase` from the observation
134
+ 2. Compute `needed_phase` based on the ambulance's direction of travel
135
+ 3. Only send a `SignalControl` action when they differ
136
 
137
+ This is **not** a trivial pattern. The mapping between ambulance direction and required signal phase differs per intersection and must be reasoned about from the observation text each step.
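The read, compute, compare sequence above can be written out as a short sketch. The direction-to-phase mapping is the one used in the client example in this post (`north`/`south` need `ns_green`, anything else needs `ew_green`); the `Signal` dataclass and the `controls_for` helper are hypothetical stand-ins for the environment's own types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signal:
    """Hypothetical stand-in for one entry of the lookahead signal list."""
    row: int
    col: int
    phase: str                # current phase, e.g. "ns_green"
    ambulance_direction: str  # direction of travel through this intersection


def needed_phase(direction: str) -> str:
    """Phase that gives the ambulance a green for its direction of travel."""
    return "ns_green" if direction in ("north", "south") else "ew_green"


def controls_for(signals: List[Signal]) -> List[dict]:
    """Return controls only for signals whose current phase is wrong."""
    return [
        {"row": s.row, "col": s.col, "phase": needed_phase(s.ambulance_direction)}
        for s in signals
        if s.phase != needed_phase(s.ambulance_direction)
    ]


# The two signals from the sample observation: (5,4) is OK, (4,4) is WRONG.
signals = [
    Signal(5, 4, "ns_green", "north"),
    Signal(4, 4, "ew_green", "north"),
]
print(controls_for(signals))
```

On the sample observation this yields a single control for intersection (4,4), matching the action shown there.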
138
 
139
+ The training curve shows the characteristic GRPO exploration-convergence pattern:
140
+ - **Iteration 1:** Model arrives (100% arrival) but wastes actions (11% efficiency)
141
+ - **Iterations 2–4:** Exploration phase: arrival drops to 0–25%, model tries aggressive strategies
142
+ - **Iterations 5–10:** Convergence: 100% arrival with 100% signal efficiency simultaneously
143
 
144
  ---
145
 
146
+ ## Live Demo
147
 
148
+ **Environment (OpenEnv WebSocket):** `wss://ajitg25-ambulance-green-corridor.hf.space/ws`
149
+ **Visual simulation:** https://huggingface.co/spaces/Ajitg25/ambulance-green-corridor
150
+ **GitHub (full code + notebook):** https://github.com/ajitg25/openEnv-hackathon/tree/final
151
+ **Training notebook:** https://github.com/ajitg25/openEnv-hackathon/blob/final/examples/ambulance_grpo_training.ipynb
152
 
153
+ ### Connecting Your Own Agent
154
 
155
  ```python
156
  from ambulance_env import AmbulanceEnv, AmbulanceAction, SignalControl
157
 
158
  async with AmbulanceEnv(base_url="https://ajitg25-ambulance-green-corridor.hf.space") as env:
159
+     # Reset: get patient location, hospitals, initial state
160
      obs = (await env.reset()).observation
161
+
162
+     # Step 1: Dispatch to specialist hospital
163
      obs = (await env.step(AmbulanceAction(hospital_id="hosp_b"))).observation
164
+
165
+     # Step 2+: Clear only wrong-phase signals each step
166
+     while not obs.done:
167
+         controls = [
168
+             SignalControl(
169
+                 row=s.row, col=s.col,
170
+                 phase="ns_green" if s.ambulance_direction in ("north", "south") else "ew_green"
171
+             )
172
+             for s in obs.lookahead_signals
173
+             if s.phase != ("ns_green" if s.ambulance_direction in ("north", "south") else "ew_green")
174
+         ]
175
+         result = await env.step(AmbulanceAction(signal_controls=controls))
176
+         obs = result.observation
177
  ```
178
 
179
+ ### The Three Policies (Shown in Visual Demo)
180
+
181
+ | Policy | Behavior | Signal Efficiency | What It Demonstrates |
182
+ |---|---|---|---|
183
+ | No control | Does nothing | 0% | Pure baseline |
184
+ | Naive | Clears all signals | ~11% | Untrained LLM behavior |
185
+ | Smart | Clears only wrong-phase | 100% | Trained LLM behavior |
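The three policies in the table can be compared on a toy set of signals. This is a sketch under a stated assumption: the post does not define signal efficiency, so it is taken here to be the fraction of toggles that were actually needed, with 0 when a policy toggles nothing. The signal values are made up and are not the data behind the reported 11% and 100% figures.

```python
from typing import Callable, List, Tuple

# (current_phase, ambulance_direction) for each signal in view; toy values.
Signal = Tuple[str, str]


def needed_phase(direction: str) -> str:
    return "ns_green" if direction in ("north", "south") else "ew_green"


def no_control(signals: List[Signal]) -> List[int]:
    """Toggle nothing."""
    return []


def naive(signals: List[Signal]) -> List[int]:
    """Toggle every signal in view, whatever its current phase."""
    return list(range(len(signals)))


def smart(signals: List[Signal]) -> List[int]:
    """Toggle only signals whose phase does not match the direction of travel."""
    return [i for i, (phase, d) in enumerate(signals) if phase != needed_phase(d)]


def signal_efficiency(policy: Callable[[List[Signal]], List[int]],
                      signals: List[Signal]) -> float:
    """Assumed metric: necessary toggles / total toggles (0.0 if none)."""
    toggled = policy(signals)
    if not toggled:
        return 0.0
    useful = sum(1 for i in toggled if signals[i][0] != needed_phase(signals[i][1]))
    return useful / len(toggled)


toy_signals = [
    ("ns_green", "north"),  # already correct
    ("ew_green", "north"),  # wrong, needs ns_green
    ("ew_green", "east"),   # already correct
    ("ns_green", "south"),  # already correct
]
for policy in (no_control, naive, smart):
    print(policy.__name__, signal_efficiency(policy, toy_signals))
```

One caveat of this assumed definition: a smart policy facing only correct signals toggles nothing and also scores 0.0, so a real metric would need to handle that case separately.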
186
+
187
  ---
188
 
189
+ ## Why This Matters
190
+
191
+ Emergency vehicle routing is deployed infrastructure. The gap between "clear one signal 300m ahead" and "reason about the full journey in a partially observable dynamic city" is exactly the gap this environment is designed to close.
192
+
193
+ Could a researcher write a paper about this? Yes:
194
+ > *"LLM-based adaptive emergency corridor planning under partial observability and dynamic constraints"*
195
+
196
+ That paper does not exist yet. This environment is the training ground for it.
197
+
198
  ---
199
 
200
+ *Built for the OpenEnv Hackathon India 2026, Theme 3.1: World Modeling / Professional Tasks*