Rewrite Blog.md: theme alignment, capability gap, agent-readable format
Blog.md
CHANGED
|
@@ -1,41 +1,49 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
-
|
|
|
|
|
|
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
-
## The
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
| 9 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
-
|
| 12 |
|
| 13 |
-
|
| 14 |
|
| 15 |
---
|
| 16 |
|
| 17 |
-
## Why This
|
| 18 |
|
| 19 |
-
|
|
|
|
| 20 |
|
| 21 |
-
|
| 22 |
-
- **Hospital B:** 8 intersections, lighter traffic, highway-quality roads. ETA is actually 40 seconds faster.
|
| 23 |
-
- **Midway:** an accident blocks the planned route. The system must re-route in real time.
|
| 24 |
|
| 25 |
-
|
|
|
|
|
|
|
| 26 |
|
| 27 |
-
-
|
| 28 |
-
- Hospital specialization (cardiac patient → cardiac centre, not general hospital)
|
| 29 |
-
- Dynamic events appearing mid-journey (accidents, road closures, traffic spikes)
|
| 30 |
-
- Which signals actually need clearing: toggling an already-green signal wastes an action and costs reward
|
| 31 |
|
| 32 |
---
|
| 33 |
|
| 34 |
-
##
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
|
| 39 |
|
| 40 |
```
|
| 41 |
=== EMERGENCY DISPATCH ===
|
|
@@ -43,123 +51,150 @@ Patient : (6, 3) | condition: cardiac
|
|
| 43 |
Ambulance: (6, 4) | time: 40s / 300s
|
| 44 |
|
| 45 |
DYNAMIC EVENTS:
|
| 46 |
-
[ACCIDENT] at (4,3) -
|
| 47 |
|
| 48 |
-
CURRENT ROUTE → hosp_a
|
| 49 |
ETA=251s | segments=8 | damaged=2 | heavy_traffic=1
|
| 50 |
-
(6,4)→(5,4) | residential | quality=moderate | traffic=45%
|
| 51 |
-
(5,4)→(4,4) | damaged | quality=POTHOLED |
|
| 52 |
|
| 53 |
-
ALTERNATIVES
|
| 54 |
-
hosp_c (
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
|
|
|
|
|
|
| 63 |
|
| 64 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
```
|
| 66 |
|
| 67 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 68 |
|
| 69 |
-
| Component | Value |
|
| 70 |
-
|---|---|
|
| 71 |
-
| Arrival
|
| 72 |
-
| Time bonus |
|
| 73 |
-
| Specialist
|
| 74 |
-
| Red light stop | −20 each |
|
| 75 |
-
| **Unnecessary
|
| 76 |
-
| Damaged road
|
| 77 |
-
| Successful re-route | +50 each |
|
| 78 |
|
| 79 |
-
The unnecessary toggle penalty is the key design decision. An agent that blindly clears every signal
|
| 80 |
|
| 81 |
-
### Difficulty
|
| 82 |
|
| 83 |
-
| Level | Grid | Hospitals | Traffic |
|
| 84 |
|---|---|---|---|---|---|
|
| 85 |
-
| easy | 6×6 | 2 | Low | 5%
|
| 86 |
-
| medium | 8×8 | 3 | Moderate | 10%
|
| 87 |
-
| hard | 12×12 | 5 (1 at capacity) | Heavy | 15%
|
| 88 |
|
| 89 |
---
|
| 90 |
|
| 91 |
## Training
|
| 92 |
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
- **Environment:** Live OpenEnv server running alongside training loop
|
| 97 |
-
|
| 98 |
-

|
| 99 |
-
*Four panels: Episode reward, Hospital arrival rate, Signal efficiency (11%→100%), Adaptive re-routing*
|
| 100 |
|
| 101 |
### Results
|
| 102 |
|
| 103 |
-
|
|
|
|
|
|
|
| 104 |
|---|---|---|---|
|
| 105 |
| Arrival rate | 100% | 100% | – |
|
| 106 |
-
| **Signal efficiency** | **11%** | **100%** | **+89
|
| 107 |
| Mean reward | 1442.6 | 1445.3 | +2.7 |
|
| 108 |
-
| Mean travel time | 125s | 127.5s | – |
|
| 109 |
-
|
| 110 |
-
### What the numbers mean
|
| 111 |
-
|
| 112 |
-
**Signal efficiency is the headline metric.** The untrained model toggled every signal it saw, including ones already in the correct phase, scoring unnecessary toggle penalties on every step. After GRPO training, the model learned to read `sig.phase` vs `sig.ambulance_direction` and only act when a signal genuinely needs changing.
|
| 113 |
-
|
| 114 |
-
The training curve shows characteristic GRPO behaviour:
|
| 115 |
-
- **Iteration 1:** model arrives but wastes actions (efficiency=11%)
|
| 116 |
-
- **Iterations 2–4:** exploration phase: model tries aggressive strategies, arrival drops to 0–25%
|
| 117 |
-
- **Iterations 5–10:** sharp convergence: 100% arrival, 100% signal efficiency, stable reward
|
| 118 |
|
| 119 |
-
|
| 120 |
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
## Why This Environment Matters
|
| 124 |
|
| 125 |
-
|
| 126 |
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
|
| 132 |
-
|
| 133 |
|
| 134 |
-
|
|
|
|
|
|
|
|
|
|
| 135 |
|
| 136 |
---
|
| 137 |
|
| 138 |
-
##
|
| 139 |
|
| 140 |
-
**
|
|
|
|
|
|
|
|
|
|
| 141 |
|
| 142 |
-
|
| 143 |
|
| 144 |
```python
|
| 145 |
from ambulance_env import AmbulanceEnv, AmbulanceAction, SignalControl
|
| 146 |
|
| 147 |
async with AmbulanceEnv(base_url="https://ajitg25-ambulance-green-corridor.hf.space") as env:
|
|
|
|
| 148 |
obs = (await env.reset()).observation
|
| 149 |
-
|
|
|
|
| 150 |
obs = (await env.step(AmbulanceAction(hospital_id="hosp_b"))).observation
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 159 |
```
|
| 160 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 161 |
---
|
| 162 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 163 |
---
|
| 164 |
|
| 165 |
-
*Built for the OpenEnv Hackathon India 2026.*
# Ambulance Green Corridor: OpenEnv Hackathon 2026

## Theme: #3.1 (World Modeling / Professional Tasks)

**One-line summary:** We train an LLM to act as an emergency dispatcher + city traffic signal manager, navigating a partially observable city with real-world constraints (gridlock, potholes, accidents, hospital capacity) to get ambulances to the right hospital as fast as possible.

---

## The Capability Gap We Are Targeting

Current GPS-based emergency preemption systems (like Opticom, used globally) work like this:

> Ambulance within 300m of an intersection → turn it green.

That is reactive. It has no awareness of:
- What road quality lies ahead (potholed roads slow the ambulance even on green)
- Whether the nearest hospital is the *right* hospital for this patient's condition
- Whether heavy traffic on the planned route makes a longer detour actually faster
- Dynamic events mid-journey: accidents, road closures, traffic spikes

**The question we ask:** Can an LLM reason about the full journey (hospital selection, road quality, live traffic state, and mid-episode events) to get ambulances to the right place faster than any fixed rule?

This is a genuine professional task that requires **persistent world modeling** across many steps. A fixed proximity rule cannot solve it, and a shortest-path algorithm over a static map cannot either, because neither accounts for hospital specialization or for events that appear mid-journey. This environment tests whether an LLM can.

---

## Theme Alignment: Why This Is Theme 3.1

Theme 3.1 asks for environments where:
> "the model is expected to do real hard work instead of exploiting short-cuts"

Our environment prevents shortcuts in three ways:

1. **Toggling already-green signals costs reward.** The agent must *read* signal state before acting; it cannot blindly clear everything.
2. **Traffic volume slows the ambulance even on green.** The agent cannot just clear signals and assume it will go fast; it must reason about the traffic volume on each segment.
3. **The nearest hospital is not always correct.** A cardiac patient sent to a trauma centre loses the +300 specialist bonus. The agent must match condition to specialization.

The agent must maintain a coherent world model across 15–30 steps, update it when dynamic events fire, and make non-obvious decisions that only pay off several steps later.

---

## Environment Design

### What the Agent Sees (Observation)

Every step, the agent receives a structured observation:

```
=== EMERGENCY DISPATCH ===
Patient : (6, 3) | condition: cardiac
Ambulance: (6, 4) | time: 40s / 300s

DYNAMIC EVENTS:
[ACCIDENT] at (4,3) - road blocked (severity=0.8)

CURRENT ROUTE → hosp_a (City General)
ETA=251s | segments=8 | damaged=2 | heavy_traffic=1
(6,4)→(5,4) | residential | quality=moderate | traffic=45%
(5,4)→(4,4) | damaged | quality=POTHOLED | [BLOCKED]

ALTERNATIVES:
hosp_c (Cardiac Centre) - specialist match | ETA=130s | damaged=0

SIGNALS - only change WRONG ones:
(5,4): ns_green | ambulance going north | OK
(4,4): ew_green | ambulance going north | WRONG - needs ns_green

ACTION FORMAT:
{"hospital_id": "hosp_c", "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}], "preferred_direction": null}
```

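As a rough illustration of what "reading" the SIGNALS section of the observation involves, the sketch below pulls out the signals marked WRONG with a regular expression. The pattern, the `wrong_signals` helper, and the sample text are assumptions modelled on the sample observation; the live environment's exact separators and spacing may differ.

```python
import re

# Hypothetical pattern for lines such as:
#   (4,4): ew_green | ambulance going north | WRONG ...
SIGNAL_LINE = re.compile(
    r"\((?P<row>\d+),\s*(?P<col>\d+)\):\s*(?P<phase>\w+)\s*\|\s*"
    r"ambulance going (?P<direction>\w+)\s*\|\s*(?P<status>OK|WRONG)"
)


def wrong_signals(observation_text: str) -> list[dict]:
    """Return one signal-control dict per signal line marked WRONG."""
    controls = []
    for match in SIGNAL_LINE.finditer(observation_text):
        if match["status"] != "WRONG":
            continue  # already in the correct phase: leave it alone
        going_ns = match["direction"] in ("north", "south")
        controls.append({
            "row": int(match["row"]),
            "col": int(match["col"]),
            "phase": "ns_green" if going_ns else "ew_green",
        })
    return controls


# Illustrative sample text, not captured from the live server.
SAMPLE = """\
SIGNALS - only change WRONG ones:
(5,4): ns_green | ambulance going north | OK
(4,4): ew_green | ambulance going north | WRONG - needs ns_green
"""

print(wrong_signals(SAMPLE))
```

Only the second signal is returned, which is the behaviour the toggle penalty is meant to encourage.
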
### What the Agent Does (Action Space)

```json
{
  "hospital_id": "hosp_c",
  "signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}],
  "preferred_direction": "north"
}
```

- `hospital_id`: choose or switch destination at any step (not just at the start)
- `signal_controls`: override up to 3 signals in the lookahead window
- `preferred_direction`: hint the routing engine to take a specific turn

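Because the agent emits this action as JSON text, it can help to validate it before sending. The following is a minimal sketch under stated assumptions: `ParsedAction` and `parse_action` are hypothetical names, the phase set contains only the two phases that appear in this post, and the set of directions is a guess. The limit of 3 signal controls comes from the field description above.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

VALID_PHASES = {"ns_green", "ew_green"}                 # only phases shown in this post
VALID_DIRECTIONS = {"north", "south", "east", "west"}   # assumed
MAX_SIGNAL_CONTROLS = 3                                 # "override up to 3 signals"


@dataclass
class ParsedAction:
    hospital_id: Optional[str] = None
    signal_controls: list = field(default_factory=list)
    preferred_direction: Optional[str] = None


def parse_action(raw: str) -> ParsedAction:
    """Parse and validate one action emitted by the model as JSON text."""
    data = json.loads(raw)
    controls = data.get("signal_controls") or []
    if len(controls) > MAX_SIGNAL_CONTROLS:
        raise ValueError(f"at most {MAX_SIGNAL_CONTROLS} signal controls per step")
    for control in controls:
        if control.get("phase") not in VALID_PHASES:
            raise ValueError(f"unknown phase: {control.get('phase')!r}")
        if not isinstance(control.get("row"), int) or not isinstance(control.get("col"), int):
            raise ValueError("row and col must be integers")
    direction = data.get("preferred_direction")
    if direction is not None and direction not in VALID_DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    return ParsedAction(data.get("hospital_id"), controls, direction)


EXAMPLE = (
    '{"hospital_id": "hosp_c", '
    '"signal_controls": [{"row": 4, "col": 4, "phase": "ns_green"}], '
    '"preferred_direction": "north"}'
)
action = parse_action(EXAMPLE)
print(action)
```

Rejecting malformed actions on the client side keeps parsing failures separate from genuine policy mistakes when reading training logs.
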
### Reward Function: Designed to Be Hard to Game

| Component | Value | Purpose |
|---|---|---|
| Arrival | +1000 | Primary objective |
| Time bonus | +500 max | Rewards speed |
| Specialist match | +300 | Rewards reading patient condition |
| Red light stop | −20 each | Penalises poor signal management |
| **Unnecessary toggle** | **−2/−5 each** | **Core anti-shortcut mechanism** |
| Damaged road traversed | −10 each | Rewards road quality awareness |
| Successful re-route | +50 each | Rewards dynamic adaptation |

**The unnecessary toggle penalty is the key design decision.** An agent that blindly clears every signal in view scores *lower* than one that reads the state first. This forces genuine reasoning, not pattern-matching.

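To make the table concrete, here is a hedged sketch of how the components might combine into an episode reward. The component values come from the table; everything else is an assumption: the `EpisodeStats` fields, a time bonus that scales linearly with remaining time, bonuses granted only on arrival, and a flat penalty of 2 per unnecessary toggle (the table lists −2/−5 without saying when each applies).

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    """Hypothetical per-episode counters; field names are illustrative."""
    arrived: bool
    time_used_s: float
    time_limit_s: float
    specialist_match: bool
    red_light_stops: int
    unnecessary_toggles: int
    damaged_segments: int
    successful_reroutes: int


def episode_reward(stats: EpisodeStats, toggle_penalty: float = 2.0) -> float:
    """Sum the reward components from the table under the stated assumptions."""
    reward = 0.0
    if stats.arrived:
        reward += 1000.0
        # Assumed: the time bonus shrinks linearly from 500 to 0 over the time limit.
        remaining = max(0.0, 1.0 - stats.time_used_s / stats.time_limit_s)
        reward += 500.0 * remaining
        if stats.specialist_match:
            reward += 300.0
    reward -= 20.0 * stats.red_light_stops
    reward -= toggle_penalty * stats.unnecessary_toggles
    reward -= 10.0 * stats.damaged_segments
    reward += 50.0 * stats.successful_reroutes
    return reward


careful = EpisodeStats(True, 150.0, 300.0, True, 0, 0, 1, 1)
blind = EpisodeStats(True, 150.0, 300.0, True, 0, 8, 1, 1)
print(episode_reward(careful), episode_reward(blind))
```

With these made-up episodes, the two agents differ only in the eight unnecessary toggles, so the blind agent scores 16 points lower.
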
### Difficulty Levels

| Level | Grid | Hospitals | Base Traffic | Events/Step | Time Limit |
|---|---|---|---|---|---|
| easy | 6×6 | 2 general | Low (0.1) | 5% | 200s |
| medium | 8×8 | 3 mixed | Moderate (0.3) | 10% | 300s |
| hard | 12×12 | 5 (1 at capacity) | Heavy (0.5) | 15% | 400s |

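The levels above could be captured in a small configuration table. This is a sketch only: `DifficultyConfig`, `DIFFICULTY`, and `expected_events` are hypothetical names, and reading "Events/Step" as an independent per-step probability is an assumption. Under that reading, a 30-step episode on hard would see 0.15 × 30 = 4.5 dynamic events on average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DifficultyConfig:
    grid_size: int              # the grid is grid_size x grid_size
    num_hospitals: int
    base_traffic: float
    event_prob_per_step: float  # assumed meaning of "Events/Step"
    time_limit_s: int


# Values copied from the difficulty table.
DIFFICULTY = {
    "easy": DifficultyConfig(6, 2, 0.1, 0.05, 200),
    "medium": DifficultyConfig(8, 3, 0.3, 0.10, 300),
    "hard": DifficultyConfig(12, 5, 0.5, 0.15, 400),
}


def expected_events(level: str, steps: int) -> float:
    """Expected number of dynamic events over `steps` steps (independent draws assumed)."""
    return DIFFICULTY[level].event_prob_per_step * steps


for name in DIFFICULTY:
    print(name, expected_events(name, steps=30))
```
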
---

## Training

- **Model:** `Qwen/Qwen2.5-0.5B-Instruct` + LoRA (r=16, 2.1M trainable params)
- **Algorithm:** GRPO (Group Relative Policy Optimisation) via HuggingFace TRL
- **Setup:** 10 iterations × 4 episodes per iteration, live environment connection

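The core idea of GRPO is that each episode is scored relative to the other episodes in its group rather than against a learned value function. A minimal sketch of that normalisation follows, assuming the 4 episodes of one iteration form a group; the helper name, the epsilon, and the use of the population standard deviation are illustrative choices, not TRL's exact implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward against its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for one group of 4 episodes.
group = [1440.0, 1445.0, 1450.0, 1445.0]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])

# If every episode scores the same, there is no learning signal.
print(group_relative_advantages([1445.0] * 4))
```

Episodes above the group mean get a positive advantage and are reinforced; episodes below it are discouraged. A group with identical rewards yields all-zero advantages.
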
### Results

![Training Results](https://raw.githubusercontent.com/ajitg25/openEnv-hackathon/final/envs/ambulance_env/training_results.png)

| Metric | Baseline (untrained) | After Training | Change |
|---|---|---|---|
| Arrival rate | 100% | 100% | – |
| **Signal efficiency** | **11%** | **100%** | **+89 pp** |
| Mean reward | 1442.6 | 1445.3 | +2.7 |

### What the Numbers Mean

**Signal efficiency is the main evidence of learning.**

The untrained model (11% efficiency) toggles every signal it encounters, including ones already in the correct phase. It treats the action space as "clear everything", a shallow shortcut.

After GRPO training (100% efficiency), the model learned to:
1. Read `current_phase` from the observation
2. Compute `needed_phase` based on the ambulance's direction of travel
3. Only send a `SignalControl` action when they differ

This is **not** a trivial pattern. The required phase depends on the ambulance's direction of travel at each intersection and must be worked out from the observation text each step.

The training curve shows the characteristic GRPO exploration-convergence pattern:
- **Iteration 1:** Model arrives (100% arrival) but wastes actions (11% efficiency)
- **Iterations 2–4:** Exploration phase: arrival drops to 0–25% while the model tries aggressive strategies
- **Iterations 5–10:** Convergence: 100% arrival with 100% signal efficiency simultaneously

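The post does not give a formula for signal efficiency, so the sketch below assumes one: the fraction of toggles that were actually needed, with zero toggles counted as 0%. It applies the three-step check (read the current phase, compute the needed phase, act only when they differ) to made-up data chosen so that a toggle-everything policy lands near 11%.

```python
def needed_phase(direction: str) -> str:
    """Phase that gives a green light for the ambulance's direction of travel."""
    return "ns_green" if direction in ("north", "south") else "ew_green"


def signal_efficiency(toggled: list[tuple[str, str]]) -> float:
    """Assumed metric: share of toggled signals whose phase really was wrong.

    Each item is (current_phase, ambulance_direction) for one toggled signal.
    """
    if not toggled:
        return 0.0
    necessary = sum(1 for phase, direction in toggled if phase != needed_phase(direction))
    return necessary / len(toggled)


# Illustrative episode: 9 signals seen, only one in the wrong phase.
seen = [("ns_green", "north")] * 8 + [("ew_green", "north")]

naive = signal_efficiency(seen)  # toggles every signal it sees
smart = signal_efficiency([s for s in seen if s[0] != needed_phase(s[1])])
print(f"naive={naive:.0%} smart={smart:.0%}")
```
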
---

## Live Demo

- **Environment (OpenEnv WebSocket):** `wss://ajitg25-ambulance-green-corridor.hf.space/ws`
- **Visual simulation:** https://huggingface.co/spaces/Ajitg25/ambulance-green-corridor
- **GitHub (full code + notebook):** https://github.com/ajitg25/openEnv-hackathon/tree/final
- **Training notebook:** https://github.com/ajitg25/openEnv-hackathon/blob/final/examples/ambulance_grpo_training.ipynb

### Connecting Your Own Agent

```python
import asyncio

from ambulance_env import AmbulanceEnv, AmbulanceAction, SignalControl


def needed_phase(direction: str) -> str:
    """Phase that gives the ambulance a green for its direction of travel."""
    return "ns_green" if direction in ("north", "south") else "ew_green"


async def main() -> None:
    async with AmbulanceEnv(base_url="https://ajitg25-ambulance-green-corridor.hf.space") as env:
        # Reset: get patient location, hospitals, initial state
        obs = (await env.reset()).observation

        # Step 1: Dispatch to specialist hospital
        obs = (await env.step(AmbulanceAction(hospital_id="hosp_b"))).observation

        # Step 2+: Clear only wrong-phase signals each step
        while not obs.done:
            controls = [
                SignalControl(row=s.row, col=s.col, phase=needed_phase(s.ambulance_direction))
                for s in obs.lookahead_signals
                if s.phase != needed_phase(s.ambulance_direction)
            ][:3]  # the action space allows at most 3 overrides per step
            obs = (await env.step(AmbulanceAction(signal_controls=controls))).observation


asyncio.run(main())
```

### The Three Policies (Shown in Visual Demo)

| Policy | Behavior | Signal Efficiency | What It Demonstrates |
|---|---|---|---|
| No control | Does nothing | 0% | Pure baseline |
| Naive | Clears all signals | ~11% | Untrained LLM behavior |
| Smart | Clears only wrong-phase | 100% | Trained LLM behavior |

---

## Why This Matters

Emergency vehicle routing is deployed infrastructure. The gap between "clear one signal 300m ahead" and "reason about the full journey in a partially observable dynamic city" is exactly the gap this environment is designed to close.

Could a researcher write a paper about this? Yes:
> *"LLM-based adaptive emergency corridor planning under partial observability and dynamic constraints"*

That paper does not exist yet. This environment is the training ground for it.

---

*Built for the OpenEnv Hackathon India 2026, Theme 3.1: World Modeling / Professional Tasks*