title: EnterpriseOps Arena
emoji: π’
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
EnterpriseOps Arena
Multi-Agent RL Environment for Enterprise Coordination
Teaching LLMs to coordinate like a real enterprise team.
Quick Links
- HF Space: https://huggingface.co/spaces/Anurag137/enterprise-ops-arena
- Trained Model: https://huggingface.co/Anurag137/enterprise-ops-lora
- Wandb: https://wandb.ai/kanhaiyakumar76618-indian-institute-of-information-techn/enterprise-ops-arena
- Blog: https://github.com/anuragverma025/Meta-Hackathon/blob/main/BLOG.md
- GitHub: https://github.com/anuragverma025/Meta-Hackathon
The Problem
Picture this: IT resolves a critical server ticket. But Finance blocked the budget 2 steps earlier. The ticket re-opens. SLA breaches. Customer escalates.
Each agent acted correctly in isolation. Together they failed. This is the coordination gap we are training agents to close.
What We Built
Four specialized LLM agents coordinate in a simulated enterprise:
| Agent | Role |
|---|---|
| IT Agent | Resolves tickets, manages resources |
| Manager Agent | Allocates resources, coordinates teams |
| Finance Agent | Approves budgets, blocks violations |
| Oversight Agent | Monitors all agents, catches hallucinations |
What Makes It Hard
- Partial observability: IT cannot see Finance decisions
- Schema drift: API fields mutate every 20 steps
- SLA pressure: tickets expire in real time
- 12% noise: random tool failures at max difficulty
- 8 difficulty levels: automatic curriculum advancement
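The schema drift and noise bullets can be illustrated with a minimal sketch. It uses the 20-step drift interval and the 12% failure rate from the list above; the field names, `SCHEMA_VERSIONS`, `active_schema`, and `call_tool` are hypothetical and are not taken from the project's code.

```python
import random

# Values from the bullets above; everything else in this sketch is assumed.
DRIFT_INTERVAL = 20   # API fields mutate every 20 steps
NOISE_RATE = 0.12     # random tool failures at max difficulty

# Hypothetical schema versions: canonical field name -> name the API currently expects.
SCHEMA_VERSIONS = [
    {"ticket_id": "ticket_id", "engineer": "engineer"},
    {"ticket_id": "ticketId", "engineer": "assignee"},
    {"ticket_id": "tkt_ref", "engineer": "owner"},
]


def active_schema(step: int) -> dict[str, str]:
    """Return the field mapping in force at a given environment step."""
    version = (step // DRIFT_INTERVAL) % len(SCHEMA_VERSIONS)
    return SCHEMA_VERSIONS[version]


def call_tool(step: int, params: dict[str, str], rng: random.Random) -> dict:
    """Simulate one tool call under schema drift and random failures."""
    # Noise: a fraction of calls fail regardless of how well-formed they are.
    if rng.random() < NOISE_RATE:
        return {"ok": False, "error": "transient_failure"}
    # Drift: reject any field name that is not part of the current schema.
    expected = set(active_schema(step).values())
    unknown = set(params) - expected
    if unknown:
        return {"ok": False, "error": f"unknown_fields:{sorted(unknown)}"}
    return {"ok": True}
```

An agent that keeps sending `ticket_id` after step 20 would be rejected in this sketch until it switches to the drifted field name, which is the kind of adaptation the environment is meant to reward.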
Reward Design (7 components, arXiv:2601.19100)
- Potential-based shaping: dependency graph progress
- Dynamic weight optimization: BiPaRS rebalancing
- Urgency-scaled SLA: time-dependent deadline rewards
- Exploration bonus: EXPLORS intrinsic reward
- Schema adaptation: explicit post-drift field usage reward
- Process reward: PRM step-level supervision
- Trajectory reward: trend and consistency bonus
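As an illustration of the first component, here is a minimal sketch of potential-based shaping in its standard form, F = gamma * phi(s') - phi(s). The README does not specify the project's potential function, so this sketch assumes phi is the fraction of tasks in a dependency graph that are done with all prerequisites done; the graph, the task names, and `GAMMA` are hypothetical.

```python
GAMMA = 0.99  # assumed discount factor

# Hypothetical dependency graph: each task lists the tasks that must finish first.
DEPENDENCIES = {
    "approve_budget": [],
    "allocate_engineer": ["approve_budget"],
    "resolve_ticket": ["allocate_engineer"],
}


def potential(done: set[str]) -> float:
    """Fraction of tasks that are completed with all prerequisites completed."""
    satisfied = [
        task
        for task, deps in DEPENDENCIES.items()
        if task in done and all(dep in done for dep in deps)
    ]
    return len(satisfied) / len(DEPENDENCIES)


def shaping_reward(done_before: set[str], done_after: set[str]) -> float:
    """Potential-based shaping term: gamma * phi(s') - phi(s)."""
    return GAMMA * potential(done_after) - potential(done_before)
```

Under this assumed potential, resolving a ticket before its budget is approved earns no shaping reward, while completing the next task in dependency order does.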
Training Results
| Metric | Value |
|---|---|
| Peak episode score | 114 (+77%) |
| Task completion | 35 → 75 (+114%) |
| GRPO reward_std | 0.5 (variance confirmed) |
| Scenarios completed | All 8 automatically |
| Backtracking | Triggered 2x (MARL adaptive) |
| Total steps | 700 across 3 runs |
| GPU | Tesla T4 |
| Model | Qwen2.5-3B-Instruct 4-bit LoRA |
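The reward_std row matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group normalization in plain Python; it is a simplified illustration, not TRL's implementation, and the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and the policy gets no gradient signal from that prompt. A non-zero reward_std indicates that the reward function separates better completions from worse ones.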
Before vs After Training
Prompt: IT Agent. TKT-001, P1, SLA=2 steps. What do you do?
Before training:
`{"tool_call":"Assign Engineer to Ticket", "tool_params":{"engineer":"Engineer 1"}}`
❌ Wrong tool name | ❌ Missing ticket_id | ❌ No reasoning
After 700 steps GRPO:
`{"tool_call":"resolve_ticket", "tool_params":{"ticket_id":"TKT-001","engineer":"Engineer 1"}, "reasoning":"P1, SLA=2 steps remaining, resolve immediately"}`
✅ Correct tool | ✅ ticket_id included | ✅ SLA-aware reasoning
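The three checks applied to the two completions above can be sketched as a small validator. `KNOWN_TOOLS`, `REQUIRED_PARAMS`, and `check_action` are hypothetical names for illustration; the project's actual schemas live in its Pydantic contracts and may differ.

```python
import json

# Assumed for illustration: one known tool, and ticket_id required on every call.
KNOWN_TOOLS = {"resolve_ticket"}
REQUIRED_PARAMS = {"ticket_id"}


def check_action(raw: str) -> list[str]:
    """Return the problems found in a model completion (empty list means valid)."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(action, dict):
        return ["not a JSON object"]

    problems = []
    tool = action.get("tool_call")
    if not isinstance(tool, str) or tool not in KNOWN_TOOLS:
        problems.append(f"unknown tool: {tool!r}")

    params = action.get("tool_params")
    if not isinstance(params, dict):
        params = {}
    missing = REQUIRED_PARAMS - set(params)
    if missing:
        problems.append(f"missing params: {sorted(missing)}")

    if not action.get("reasoning"):
        problems.append("no reasoning")
    return problems
```

Run against the two completions, this sketch reports three problems for the pre-training output and none for the post-training output.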
Tech Stack
| Component | Choice | Why |
|---|---|---|
| Model | Qwen2.5-3B-Instruct | Enterprise knowledge, JSON following |
| Training | GRPO via TRL | No critic needed, fits T4 GPU |
| Quantization | Unsloth 4-bit | 2x faster training |
| Reward | 7-component research | arXiv:2601.19100 |
| Curriculum | MARL adaptive backtracking | Prevents policy collapse |
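The curriculum row can be pictured as a controller that advances through the 8 difficulty levels and steps back when performance collapses. This is a minimal sketch under assumed rules: the window size, both thresholds, the normalized score scale, and the `Curriculum` class are hypothetical, and the project's actual backtracking criteria are not described here.

```python
from collections import deque


class Curriculum:
    """Advance or backtrack one difficulty level based on a rolling mean score."""

    def __init__(self, levels: int = 8, window: int = 5,
                 advance_at: float = 0.8, backtrack_at: float = 0.3):
        self.levels = levels
        self.level = 1
        self.advance_at = advance_at      # assumed threshold on a 0..1 score
        self.backtrack_at = backtrack_at  # assumed threshold on a 0..1 score
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> int:
        """Record one episode score and return the level for the next episode."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return self.level  # not enough evidence yet
        avg = sum(self.scores) / len(self.scores)
        if avg >= self.advance_at and self.level < self.levels:
            self.level += 1
            self.scores.clear()  # start a fresh window at the new level
        elif avg <= self.backtrack_at and self.level > 1:
            self.level -= 1
            self.scores.clear()
        return self.level
```

Clearing the window after each level change keeps scores from an easier level from influencing decisions at a harder one.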
Project Structure
enterprise_ops/
- contracts.py: Pydantic schemas + agent constants
- agents/: IT, Manager, Finance, Oversight agents
- env/: Environment, tools, world model, schema drift
  - scenarios/: 8 difficulty scenarios
- server/: FastAPI + Gradio deployment
- train/: GRPO training pipeline + reward functions
Bonus Prize Coverage
- Patronus AI: Schema drift engine forces real API adaptation
- Fleet AI: OversightAgent monitors all agents every step

