---
title: EnterpriseOps Arena
emoji: 🏒
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# EnterpriseOps Arena 🏒

> Multi-agent RL environment where IT, Manager, Finance and Oversight agents 
> collaborate to manage a simulated enterprise under partial observability, 
> schema drift, and SLA pressure.

## Quick Links
- 🚀 **HuggingFace Space**: https://huggingface.co/spaces/Anurag137/enterprise-ops-arena
- 📓 **Colab Notebook**: https://github.com/anuragverma025/Meta-Hackathon/blob/main/enterprise_ops/train/colab_notebook.ipynb
- 📝 **Blog Post**: https://github.com/anuragverma025/Meta-Hackathon/blob/main/BLOG.md
- 💻 **GitHub**: https://github.com/anuragverma025/Meta-Hackathon

---

## The Problem

Enterprise AI agents fail because they work in silos. The IT agent 
resolving a critical server ticket does not know the Finance agent 
just blocked the budget it needs. The Manager does not know which 
tickets are about to breach SLA. Without coordination, failures cascade.

We built an RL environment that trains LLM agents to coordinate 
across departments, developing theory-of-mind reasoning through 
reinforcement learning.

---

## The Environment

4 specialized LLM agents operate inside a simulated enterprise:

| Agent | Role | Sees |
|---|---|---|
| IT Agent | Resolves support tickets before SLA breach | Tickets + resource pool + inbox |
| Manager Agent | Allocates shared resources, coordinates tasks | All dept summaries + project tasks |
| Finance Agent | Approves budgets, blocks policy violations | Budget history + pending approvals |
| Oversight Agent | Monitors all agents, catches hallucinations | ALL tool call logs (full visibility) |
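
As a minimal sketch of the visibility rules in the table above, the snippet below filters one shared state dictionary into per-agent observations. The state keys, the `VISIBLE_KEYS` mapping and the `observe` function are hypothetical names chosen for illustration; they are not taken from the environment's code.

```python
from typing import Any

# Hypothetical mapping from agent name to the state keys it may see,
# mirroring the "Sees" column of the table above.
VISIBLE_KEYS: dict[str, set[str]] = {
    "it": {"tickets", "resource_pool", "inbox"},
    "manager": {"dept_summaries", "project_tasks"},
    "finance": {"budget_history", "pending_approvals"},
    "oversight": {"tool_call_logs"},
}


def observe(agent: str, state: dict[str, Any]) -> dict[str, Any]:
    """Return only the slice of the global state this agent may see."""
    allowed = VISIBLE_KEYS[agent]
    return {key: value for key, value in state.items() if key in allowed}


# Illustrative global state; the real environment's layout may differ.
state = {
    "tickets": [{"id": 1, "sla_steps_left": 3}],
    "resource_pool": {"servers": 2},
    "inbox": [],
    "dept_summaries": {"it": "1 open ticket"},
    "project_tasks": [],
    "budget_history": [],
    "pending_approvals": [],
    "tool_call_logs": [],
}
it_view = observe("it", state)
```

Under these assumptions the IT agent's view contains tickets, the resource pool and its inbox, but nothing about budgets, which is the silo the agents have to bridge through messages.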

### Key environment features
- **Partial observability**: each agent sees only its department
- **5 mock enterprise APIs**: `get_tickets`, `resolve_ticket`, `allocate_resource`, `approve_budget`, `get_project_status`
- **Schema drift**: API fields mutate every 20 steps, forcing real adaptation
- **8 scenarios**: difficulty 1 to 8, from simple IT tasks to full enterprise chaos
- **Message bus**: agents coordinate by sending structured messages
- **Anti-reward-hacking**: timeout, loop detection, state locks, oversight monitoring
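
To make the schema drift feature concrete, here is a small sketch of one way fields could be renamed on a fixed schedule. Only the 20-step interval comes from the list above; the field names, the `FIELD_VARIANTS` table and both helper functions are assumptions for illustration, and the actual mutation rules may differ.

```python
DRIFT_INTERVAL = 20  # from the feature list: fields mutate every 20 steps

# Hypothetical rename cycles; the real environment's field names may differ.
FIELD_VARIANTS = {
    "priority": ["priority", "severity", "urgency"],
    "deadline": ["deadline", "due_by", "sla_deadline"],
}


def current_field_name(canonical: str, step: int) -> str:
    """Pick the name a canonical field goes by at the given step."""
    variants = FIELD_VARIANTS[canonical]
    version = step // DRIFT_INTERVAL
    return variants[version % len(variants)]


def drift_record(record: dict, step: int) -> dict:
    """Rename drifting fields in a record; leave other fields untouched."""
    return {
        (current_field_name(key, step) if key in FIELD_VARIANTS else key): value
        for key, value in record.items()
    }


ticket = {"id": 7, "priority": "high", "deadline": 42}
before = drift_record(ticket, step=19)  # still the original schema
after = drift_record(ticket, step=20)   # first drift has happened
```

An agent that keeps reading `priority` after step 20 would be using a stale schema in this sketch, which is the kind of mistake the Oversight Agent is meant to catch.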

---

## Reward Design

Four independent reward functions, composable and designed to be hard to game:

| Function | Signal |
|---|---|
| task_completion | +10 per resolved ticket/task, verified by state diff |
| sla_adherence | +7.5 before deadline, -5 on breach |
| coordination_bonus | +6 when message leads to correct action next step |
| hallucination_penalty | -8 for calling non-existent API fields |

The Oversight Agent earns +15 for catching hallucinations, +8 for 
policy breaches, +5 for stale schema usage.
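
The four signals in the table can be sketched as separate functions over a per-step outcome. The point values come from the table; the `StepOutcome` fields, the function signatures and the plain summation in `total_reward` are assumptions, since the README does not say how the signals are combined. The Oversight Agent's bonuses are left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical per-step summary; field names are illustrative."""
    resolved: int = 0            # tickets/tasks resolved, verified by state diff
    met_sla: int = 0             # items finished before their deadline
    breached_sla: int = 0        # deadlines missed
    helpful_messages: int = 0    # messages followed by a correct action next step
    hallucinated_calls: int = 0  # calls that used non-existent API fields


def task_completion(o: StepOutcome) -> float:
    return 10.0 * o.resolved


def sla_adherence(o: StepOutcome) -> float:
    return 7.5 * o.met_sla - 5.0 * o.breached_sla


def coordination_bonus(o: StepOutcome) -> float:
    return 6.0 * o.helpful_messages


def hallucination_penalty(o: StepOutcome) -> float:
    return -8.0 * o.hallucinated_calls


REWARD_FUNCS = [task_completion, sla_adherence, coordination_bonus, hallucination_penalty]


def total_reward(o: StepOutcome) -> float:
    """Assumed combination rule: an unweighted sum of the four signals."""
    return sum(func(o) for func in REWARD_FUNCS)


outcome = StepOutcome(resolved=2, met_sla=1, breached_sla=1,
                      helpful_messages=1, hallucinated_calls=1)
total = total_reward(outcome)  # 20 + (7.5 - 5) + 6 - 8
```

Keeping each signal in its own function means a single term can be inspected or disabled without touching the others.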

---

## Training Results

Real training run: 200 steps on a T4 GPU in 32 minutes.

![Reward curves](reward_curves.png)

![Loss curve](loss_curve.png)

### Key findings
- **GRPO reward**: -1.0 → +1.5 (crossed zero, which suggests the model is learning)
- **Curriculum**: Advanced automatically from scenario_01 → scenario_03
- **Train loss**: -0.023
- **Model**: Qwen2.5-3B-Instruct, 4-bit quantized via Unsloth
- **Method**: GRPO via HuggingFace TRL
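
For readers new to GRPO: the method scores each sampled completion relative to the other completions in its group, not against a learned value function. The sketch below shows that group-relative normalization in plain Python. It is not TRL's code; the function name, the use of the population standard deviation and the `eps` value are assumptions made for this illustration, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt, scored by the reward functions.
advantages = group_relative_advantages([-1.0, 0.0, 1.5, -0.5])
```

Completions that beat their group's mean get a positive advantage and are reinforced; those below the mean are pushed down, and a group of identical rewards yields no learning signal.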

### What the curves show
- Episode score dropped at step 110, when the curriculum advanced to a harder scenario that challenged the agents
- Score recovered by step 200, indicating the agents adapted
- The curriculum difficulty staircase shows automatic advancement with no human intervention
- GRPO reward crossed from negative to positive, which is evidence of learning
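
One plausible way to get the staircase behaviour described above is to advance whenever a rolling mean of episode scores clears a threshold. The sketch below is an assumption about the mechanism, not the project's implementation: the `Curriculum` class, the threshold, the window size and the scores are all hypothetical.

```python
from collections import deque


class Curriculum:
    """Advance to the next scenario once the rolling mean score clears a bar."""

    def __init__(self, scenarios: list[str], threshold: float, window: int) -> None:
        self.scenarios = list(scenarios)
        self.threshold = threshold
        self.index = 0
        self.scores: deque[float] = deque(maxlen=window)

    @property
    def current(self) -> str:
        return self.scenarios[self.index]

    def record(self, score: float) -> str:
        """Log one episode score and return the scenario to run next."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        has_next = self.index < len(self.scenarios) - 1
        if window_full and has_next:
            if sum(self.scores) / len(self.scores) >= self.threshold:
                self.index += 1
                self.scores.clear()  # start fresh on the harder scenario
        return self.current


curriculum = Curriculum(
    ["scenario_01", "scenario_02", "scenario_03"], threshold=0.5, window=3
)
history = [curriculum.record(s) for s in [0.2, 0.6, 0.9, 0.1, 0.7, 0.8, 0.9]]
```

Clearing the window after each advance means the first scores on the harder scenario, which tend to dip, cannot be offset by earlier results from the easier one.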

---

## How to Run

### Run the environment locally
```bash
git clone https://huggingface.co/spaces/Anurag137/enterprise-ops-arena
cd enterprise-ops-arena
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```

### Test the API
```bash
# Health check
curl https://anurag137-enterprise-ops-arena.hf.space/health

# View all endpoints (`open` is macOS-only; on Linux use xdg-open, or paste the URL into a browser)
open https://anurag137-enterprise-ops-arena.hf.space/docs
```

### Run training
```bash
git clone https://github.com/anuragverma025/Meta-Hackathon
cd Meta-Hackathon/enterprise_ops
pip install -e .
python -m enterprise_ops.train.main --scenario scenario_01 --steps 200
```

---

## Tech Stack

| Component | Technology |
|---|---|
| Environment | OpenEnv + FastAPI + SQLite |
| Schemas | Pydantic v2 |
| Training | HuggingFace TRL + GRPO |
| Model | Qwen2.5-3B-Instruct |
| Efficiency | Unsloth 4-bit quantization |
| Deployment | HuggingFace Spaces + Docker |
| UI | Gradio mounted on FastAPI |

---

## Themes Covered

- **Theme 1**: Multi-Agent Interactions
- **Theme 3.1**: World Modeling: Professional Tasks

### Bonus prizes targeted
- Fleet AI: Scalable Oversight (OversightAgent)
- Halluminate: Multi-Actor Environments
- Scale AI: Sales/PM/IT enterprise workflows
- Scaler AI Labs: Multi-app enterprise RL
- Patronus AI: Schema drift + dynamic contracts

---

## Project Structure