---
title: EnterpriseOps Arena
emoji: 🏒
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---
# EnterpriseOps Arena 🏒
> Multi-agent RL environment where IT, Manager, Finance and Oversight agents
> collaborate to manage a simulated enterprise under partial observability,
> schema drift, and SLA pressure.
## Quick Links
- 🚀 **HuggingFace Space**: https://huggingface.co/spaces/Anurag137/enterprise-ops-arena
- 📓 **Colab Notebook**: https://github.com/anuragverma025/Meta-Hackathon/blob/main/enterprise_ops/train/colab_notebook.ipynb
- 📝 **Blog Post**: https://github.com/anuragverma025/Meta-Hackathon/blob/main/BLOG.md
- 💻 **GitHub**: https://github.com/anuragverma025/Meta-Hackathon
---
## The Problem
Enterprise AI agents fail because they work in silos. The IT agent
resolving a critical server ticket does not know the Finance agent
just blocked the budget it needs. The Manager does not know which
tickets are about to breach SLA. No coordination = cascading failures.
We built an RL environment that trains LLM agents to coordinate
across departments, developing theory-of-mind reasoning through
reinforcement learning.
---
## The Environment
4 specialized LLM agents operate inside a simulated enterprise:
| Agent | Role | Sees |
|---|---|---|
| IT Agent | Resolves support tickets before SLA breach | Tickets + resource pool + inbox |
| Manager Agent | Allocates shared resources, coordinates tasks | All dept summaries + project tasks |
| Finance Agent | Approves budgets, blocks policy violations | Budget history + pending approvals |
| Oversight Agent | Monitors all agents, catches hallucinations | ALL tool call logs (full visibility) |
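
The "Sees" column above amounts to a per-agent filter over the global state. A minimal sketch of that idea follows. The slice names, the `AGENT_VIEWS` map, and the `observe` helper are illustrative assumptions, not the environment's actual API.

```python
# Hypothetical view map mirroring the "Sees" column of the table above.
# The slice names are illustrative; the real environment may differ.
AGENT_VIEWS = {
    "it": ("tickets", "resource_pool", "inbox"),
    "manager": ("dept_summaries", "project_tasks"),
    "finance": ("budget_history", "pending_approvals"),
    "oversight": ("tool_call_logs",),
}


def observe(agent: str, state: dict) -> dict:
    """Return only the state slices the given agent is allowed to see."""
    return {key: state[key] for key in AGENT_VIEWS[agent] if key in state}


# Toy global state with one entry per slice.
state = {
    "tickets": [{"id": 1, "sla_steps_left": 3}],
    "resource_pool": {"engineers": 2},
    "inbox": [],
    "dept_summaries": {"it": "1 open ticket"},
    "project_tasks": [],
    "budget_history": [],
    "pending_approvals": [],
    "tool_call_logs": [],
}

it_view = observe("it", state)
finance_view = observe("finance", state)
```

Here the IT agent's view contains tickets but no budget data, which is the silo the message bus is meant to bridge.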
### Key environment features
- **Partial observability**: each agent sees only its department
- **5 mock enterprise APIs**: get_tickets, resolve_ticket, allocate_resource, approve_budget, get_project_status
- **Schema drift**: API fields mutate every 20 steps, forcing real adaptation
- **8 scenarios**: difficulty 1 to 8, from simple IT tasks to full enterprise chaos
- **Message bus**: agents coordinate by sending structured messages
- **Anti-reward-hacking**: timeout, loop detection, state locks, oversight monitoring
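
To make the schema drift feature concrete, here is a minimal sketch of one way fields could be renamed every 20 steps. Only the 20-step interval comes from the list above. The field names, the rename schedule in `FIELD_VARIANTS`, and the `apply_drift` helper are hypothetical.

```python
DRIFT_INTERVAL = 20  # from the feature list: fields mutate every 20 steps

# Hypothetical rename schedule; the real field names and mutations may differ.
FIELD_VARIANTS = {
    "priority": ["priority", "severity", "urgency"],
    "deadline": ["deadline", "sla_due", "due_by"],
}


def drift_epoch(step: int) -> int:
    """Number of drift events that have occurred by the given step."""
    return step // DRIFT_INTERVAL


def apply_drift(record: dict, step: int) -> dict:
    """Rename drifting fields according to the current drift epoch."""
    epoch = drift_epoch(step)
    drifted = {}
    for key, value in record.items():
        variants = FIELD_VARIANTS.get(key)
        name = variants[epoch % len(variants)] if variants else key
        drifted[name] = value
    return drifted


ticket = {"id": 7, "priority": "high", "deadline": 42}
before = apply_drift(ticket, step=19)  # still the original schema
after = apply_drift(ticket, step=20)   # first drift: fields are renamed
```

An agent that keeps reading `priority` after step 20 would be using a stale schema, which is the kind of call the Oversight Agent is rewarded for catching.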
---
## Reward Design
4 independent reward functions (composable, hard to game):
| Function | Signal |
|---|---|
| task_completion | +10 per resolved ticket/task, verified by state diff |
| sla_adherence | +7.5 before deadline, -5 on breach |
| coordination_bonus | +6 when message leads to correct action next step |
| hallucination_penalty | -8 for calling non-existent API fields |
The Oversight Agent earns +15 for catching hallucinations, +8 for
policy breaches, +5 for stale schema usage.
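
The four reward functions and the Oversight bonuses can be sketched as small independent functions over a per-step outcome. The numeric values come from the table and paragraph above. The `StepOutcome` fields and the choice to combine the functions by plain summation are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical per-step summary; field names are illustrative."""
    resolved: int = 0      # tickets/tasks resolved, verified by state diff
    met_sla: int = 0       # resolutions completed before the deadline
    breached_sla: int = 0  # SLA breaches
    coordinated: int = 0   # messages that led to a correct action next step
    hallucinated: int = 0  # calls that used non-existent API fields


def task_completion(o: StepOutcome) -> float:
    return 10.0 * o.resolved


def sla_adherence(o: StepOutcome) -> float:
    return 7.5 * o.met_sla - 5.0 * o.breached_sla


def coordination_bonus(o: StepOutcome) -> float:
    return 6.0 * o.coordinated


def hallucination_penalty(o: StepOutcome) -> float:
    return -8.0 * o.hallucinated


REWARD_FUNCTIONS = (
    task_completion,
    sla_adherence,
    coordination_bonus,
    hallucination_penalty,
)

# Oversight Agent bonuses per catch, as stated above.
OVERSIGHT_REWARDS = {"hallucination": 15.0, "policy_breach": 8.0, "stale_schema": 5.0}


def total_reward(o: StepOutcome) -> float:
    """Assumption: the independent signals are combined by summation."""
    return sum(fn(o) for fn in REWARD_FUNCTIONS)


outcome = StepOutcome(resolved=1, met_sla=1, coordinated=1, hallucinated=1)
breakdown = {fn.__name__: fn(outcome) for fn in REWARD_FUNCTIONS}
```

Keeping each signal in its own function makes the breakdown inspectable per step, so a spike in one component is easy to trace.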
---
## Training Results
Real training run: 200 steps on a T4 GPU, 32 minutes
![Reward curves](reward_curves.png)
![Loss curve](loss_curve.png)
### Key findings
- **GRPO reward**: -1.0 → +1.5 (crossed zero, which suggests the model is learning)
- **Curriculum**: Advanced automatically from scenario_01 → scenario_03
- **Train loss**: -0.023
- **Model**: Qwen2.5-3B-Instruct, 4-bit quantized via Unsloth
- **Method**: GRPO via HuggingFace TRL
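
For readers new to GRPO: it scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative normalization in plain Python. It is not TRL's code, and details such as the epsilon and the choice of population standard deviation are assumptions that vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions sampled from the same prompt (toy values).
group_rewards = [-1.0, 0.0, 1.0, 2.0]
advantages = group_relative_advantages(group_rewards)
```

Completions that beat their group's mean get a positive advantage and are reinforced; those below it are pushed down, even when every raw reward in the group is negative.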
### What the curves show
- Episode score dropped at step 110, when the curriculum advanced to a harder scenario and the agents were challenged
- Score recovered by step 200, indicating the agents adapted
- The curriculum difficulty staircase shows automatic advancement with no human intervention
- GRPO reward crossed from negative to positive, which is evidence of learning
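
The automatic advancement described above can be pictured as a rule that moves to the next scenario once recent scores clear a bar. The sketch below is a hypothetical version of such a rule. The `Curriculum` class, the threshold, the window size, and the toy scores are all assumptions and are not taken from the training code.

```python
from collections import deque


class Curriculum:
    """Hypothetical curriculum: advance when the recent mean score clears a bar."""

    def __init__(self, scenarios, threshold=0.0, window=10):
        self.scenarios = list(scenarios)
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.index = 0

    @property
    def current(self) -> str:
        return self.scenarios[self.index]

    def record(self, score: float) -> str:
        """Log an episode score and return the scenario to use next."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        has_next = self.index < len(self.scenarios) - 1
        if window_full and has_next:
            if sum(self.scores) / len(self.scores) >= self.threshold:
                self.index += 1
                self.scores.clear()  # start a fresh window on the harder scenario
        return self.current


curriculum = Curriculum(
    ["scenario_01", "scenario_02", "scenario_03"], threshold=1.0, window=3
)
history = [curriculum.record(s) for s in [0.2, 1.5, 1.6, 1.4, 1.2, 1.1, 1.3]]
```

Clearing the score window after each advance is one way to get the staircase shape: the agent has to earn the next step on the new scenario rather than carry over old scores.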
---
## How to Run
### Run the environment locally
```bash
git clone https://huggingface.co/spaces/Anurag137/enterprise-ops-arena
cd enterprise-ops-arena
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
### Test the API
```bash
# Health check
curl https://anurag137-enterprise-ops-arena.hf.space/health
# View all endpoints (`open` is macOS-only; on Linux use `xdg-open`, or paste the URL into a browser)
open https://anurag137-enterprise-ops-arena.hf.space/docs
```
### Run training
```bash
git clone https://github.com/anuragverma025/Meta-Hackathon
cd Meta-Hackathon/enterprise_ops
pip install -e .
python -m enterprise_ops.train.main --scenario scenario_01 --steps 200
```
---
## Tech Stack
| Component | Technology |
|---|---|
| Environment | OpenEnv + FastAPI + SQLite |
| Schemas | Pydantic v2 |
| Training | HuggingFace TRL + GRPO |
| Model | Qwen2.5-3B-Instruct |
| Efficiency | Unsloth 4-bit quantization |
| Deployment | HuggingFace Spaces + Docker |
| UI | Gradio mounted on FastAPI |
---
## Themes Covered
- **Theme 1**: Multi-Agent Interactions
- **Theme 3.1**: World Modeling: Professional Tasks
### Bonus prizes targeted
- Fleet AI: Scalable Oversight (OversightAgent)
- Halluminate: Multi-Actor Environments
- Scale AI: Sales/PM/IT enterprise workflows
- Scaler AI Labs: Multi-app enterprise RL
- Patronus AI: Schema drift + dynamic contracts
---
## Project Structure