---
title: EnterpriseOps Arena
emoji: 🏢
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# EnterpriseOps Arena
### Multi-Agent RL Environment for Enterprise Coordination

> Teaching LLMs to coordinate like a real enterprise team.

## Quick Links
- 🚀 HF Space: https://huggingface.co/spaces/Anurag137/enterprise-ops-arena
- 🤖 Trained Model: https://huggingface.co/Anurag137/enterprise-ops-lora
- 📊 Wandb: https://wandb.ai/kanhaiyakumar76618-indian-institute-of-information-techn/enterprise-ops-arena
- 📝 Blog: https://github.com/anuragverma025/Meta-Hackathon/blob/main/BLOG.md
- 💻 GitHub: https://github.com/anuragverma025/Meta-Hackathon

## The Problem
Picture this: IT resolves a critical server ticket.
But Finance blocked the budget 2 steps earlier.
The ticket re-opens. SLA breaches. Customer escalates.

Each agent acted correctly in isolation.
Together they failed. This is the coordination gap
we are training agents to close.

## What We Built
Four specialized LLM agents coordinate in a simulated enterprise:

| Agent | Role |
|-------|------|
| IT Agent | Resolves tickets, manages resources |
| Manager Agent | Allocates resources, coordinates teams |
| Finance Agent | Approves budgets, blocks violations |
| Oversight Agent | Monitors all agents, catches hallucinations |

## What Makes It Hard
- *Partial observability* — IT cannot see Finance decisions
- *Schema drift* — API fields mutate every 20 steps
- *SLA pressure* — tickets expire in real time
- *12% noise* — random tool failures at max difficulty
- *8 difficulty levels* — automatic curriculum advancement
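
The drift and noise mechanics above can be sketched roughly as follows. This is a minimal illustration, not the environment's actual code: the alias table, `current_field_name`, and `call_tool` are hypothetical names, and only the 20-step interval and the 12% failure rate come from the list above.

```python
import random

DRIFT_INTERVAL = 20   # from the list above: fields mutate every 20 steps
NOISE_RATE = 0.12     # from the list above: tool failure rate at max difficulty

# Hypothetical rename table; the real drift rules are not shown in this README.
FIELD_ALIASES = {"ticket_id": ["ticket_id", "tkt_ref", "case_id"]}


def current_field_name(canonical: str, step: int) -> str:
    """Return the name a canonical field goes by at the given step."""
    aliases = FIELD_ALIASES.get(canonical, [canonical])
    epoch = step // DRIFT_INTERVAL  # increments once every 20 steps
    return aliases[epoch % len(aliases)]


def call_tool(params: dict, step: int, rng: random.Random) -> dict:
    """Simulate one tool call with random failures and drifted field names."""
    if rng.random() < NOISE_RATE:
        return {"ok": False, "error": "transient_failure"}
    expected = current_field_name("ticket_id", step)
    if expected not in params:
        return {"ok": False, "error": f"unknown schema, expected '{expected}'"}
    return {"ok": True, "ticket": params[expected]}
```

Under these assumptions, an agent that keeps sending `ticket_id` after step 20 is rejected until it reads the error and switches to the drifted field name.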

## Reward Design (7 components, arXiv:2601.19100)
1. Potential-based shaping — dependency graph progress
2. Dynamic weight optimization — BiPaRS rebalancing
3. Urgency-scaled SLA — time-dependent deadline rewards
4. Exploration bonus — EXPLORS intrinsic reward
5. Schema adaptation — explicit post-drift field usage reward
6. Process reward — PRM step-level supervision
7. Trajectory reward — trend and consistency bonus
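
As an illustration of the first component, here is a minimal sketch of potential-based shaping over a task dependency graph, using the standard form F = γ·φ(s') − φ(s). The graph, the discount factor, and the choice of potential function are assumptions made for this example and are not taken from the project's reward code.

```python
GAMMA = 0.99  # assumed discount factor

# Hypothetical dependency graph: each task lists the tasks it depends on.
DEPENDENCIES = {
    "approve_budget": [],
    "allocate_engineer": ["approve_budget"],
    "resolve_ticket": ["approve_budget", "allocate_engineer"],
}


def potential(done: set) -> float:
    """Fraction of tasks that are done and have all dependencies done."""
    satisfied = sum(
        1
        for task, deps in DEPENDENCIES.items()
        if task in done and all(d in done for d in deps)
    )
    return satisfied / len(DEPENDENCIES)


def shaping_reward(done_before: set, done_after: set) -> float:
    """Potential-based shaping term: gamma * phi(s') - phi(s)."""
    return GAMMA * potential(done_after) - potential(done_before)
```

With this potential, resolving a ticket before its budget is approved adds nothing, which mirrors the IT/Finance failure described in The Problem.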

## Training Results
![Reward curves](reward_curves.png)
![Loss curves](loss_curves.png)

| Metric | Value |
|--------|-------|
| Peak episode score | *114* (+77%) |
| Task completion | *35 → 75* (+114%) |
| GRPO reward_std | *0.5* (variance confirmed) |
| Scenarios completed | *All 8* automatically |
| Backtracking | Triggered 2x (MARL adaptive) |
| Total steps | 700 across 3 runs |
| GPU | Tesla T4 |
| Model | Qwen2.5-3B-Instruct 4-bit LoRA |
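
The `reward_std` row matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that normalization in plain Python. It is a simplified illustration, not TRL's implementation, and the use of population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumed: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal, so a non-zero `reward_std` is a useful health check during training.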

## Before vs After Training

*Prompt:* IT Agent. TKT-001, P1, SLA=2 steps. What do you do?

*Before training:*
```json
{"tool_call":"Assign Engineer to Ticket",
 "tool_params":{"engineer":"Engineer 1"}}
```

❌ Wrong tool name | ❌ Missing ticket_id | ❌ No reasoning

*After 700 GRPO steps:*
```json
{"tool_call":"resolve_ticket",
 "tool_params":{"ticket_id":"TKT-001","engineer":"Engineer 1"},
 "reasoning":"P1, SLA=2 steps remaining, resolve immediately"}
```

✅ Correct tool | ✅ ticket_id included | ✅ SLA-aware reasoning
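
The checks marked above can be automated with a small validator. This is a minimal sketch using only the standard library. `TOOL_SCHEMAS` and `check_action` are hypothetical names, and the required parameters for `resolve_ticket` are assumed from the example rather than taken from `contracts.py`.

```python
import json

# Hypothetical registry: tool name -> required parameter names.
TOOL_SCHEMAS = {"resolve_ticket": {"ticket_id", "engineer"}}


def check_action(raw: str) -> list:
    """Return a list of problems found in a model's JSON action."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(action, dict):
        return ["not a JSON object"]

    problems = []
    tool = action.get("tool_call")
    if tool not in TOOL_SCHEMAS:
        problems.append(f"unknown tool: {tool!r}")
    else:
        params = action.get("tool_params")
        given = set(params) if isinstance(params, dict) else set()
        missing = TOOL_SCHEMAS[tool] - given
        if missing:
            problems.append("missing params: " + ", ".join(sorted(missing)))
    if not action.get("reasoning"):
        problems.append("no reasoning")
    return problems
```

Run against the two outputs above, the untrained action is flagged for an unknown tool and missing reasoning, while the trained action returns an empty list. Parameter checks are skipped for unknown tools because there is no schema to compare against.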

## Tech Stack
| Component | Choice | Why |
|-----------|--------|-----|
| Model | Qwen2.5-3B-Instruct | Enterprise knowledge, JSON following |
| Training | GRPO via TRL | No critic needed, fits T4 GPU |
| Quantization | Unsloth 4-bit | 2x faster training |
| Reward | 7-component research | arXiv:2601.19100 |
| Curriculum | MARL adaptive backtracking | Prevents policy collapse |

## Project Structure
```
enterprise_ops/
├── contracts.py       # Pydantic schemas + agent constants
├── agents/            # IT, Manager, Finance, Oversight agents
├── env/               # Environment, tools, world model, schema drift
│   └── scenarios/     # 8 difficulty scenarios
├── server/            # FastAPI + Gradio deployment
└── train/             # GRPO training pipeline + reward functions
```

## Bonus Prize Coverage
- *Patronus AI* — Schema drift engine forces real API adaptation
- *Fleet AI* — OversightAgent monitors all agents every step