
metadata

title: EnterpriseOps Arena
emoji: 🏢
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

EnterpriseOps Arena

Multi-Agent RL Environment for Enterprise Coordination

Teaching LLMs to coordinate like a real enterprise team.

Quick Links

🚀 HF Space: https://huggingface.co/spaces/Anurag137/enterprise-ops-arena
🤖 Trained Model: https://huggingface.co/Anurag137/enterprise-ops-lora
📊 Wandb: https://wandb.ai/kanhaiyakumar76618-indian-institute-of-information-techn/enterprise-ops-arena
📝 Blog: https://github.com/anuragverma025/Meta-Hackathon/blob/main/BLOG.md
💻 GitHub: https://github.com/anuragverma025/Meta-Hackathon

The Problem

Picture this: IT resolves a critical server ticket. But Finance blocked the budget 2 steps earlier. The ticket re-opens. SLA breaches. Customer escalates.

Each agent acted correctly in isolation. Together they failed. This is the coordination gap we are training agents to close.

What We Built

4 specialized LLM agents coordinating in a simulated enterprise:

Agent	Role
IT Agent	Resolves tickets, manages resources
Manager Agent	Allocates resources, coordinates teams
Finance Agent	Approves budgets, blocks violations
Oversight Agent	Monitors all agents, catches hallucinations

What Makes It Hard

Partial observability — IT cannot see Finance decisions
Schema drift — API fields mutate every 20 steps
SLA pressure — tickets expire in real time
12% noise — random tool failures at max difficulty
8 difficulty levels — automatic curriculum advancement
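To make the schema drift and noise items above concrete, here is a minimal sketch of how a tool call might behave under both. The 20-step interval and the 12% failure rate come from the list; the rename table, `active_schema`, and `call_tool` are hypothetical names for illustration and are not taken from the project's code.

```python
import random

DRIFT_INTERVAL = 20  # from the list above: fields mutate every 20 steps
MAX_NOISE = 0.12     # from the list above: 12% tool failures at max difficulty

# Hypothetical rename table; the real drift engine's field names may differ.
RENAMES = [
    {"ticket_id": "ticket_id"},
    {"ticket_id": "ticketId"},
    {"ticket_id": "case_ref"},
]


def active_schema(step: int) -> dict:
    """Return the field mapping in force at a given environment step."""
    return RENAMES[(step // DRIFT_INTERVAL) % len(RENAMES)]


def call_tool(step: int, params: dict, rng: random.Random,
              noise: float = MAX_NOISE) -> dict:
    """Simulate one tool call under random failure and schema drift."""
    # Random tool failure, independent of what the agent sent.
    if rng.random() < noise:
        return {"ok": False, "error": "transient_failure"}
    # Reject any field name that is not part of the current schema.
    expected = set(active_schema(step).values())
    unknown = sorted(set(params) - expected)
    if unknown:
        return {"ok": False, "error": f"unknown_fields:{unknown}"}
    return {"ok": True}
```

An agent that keeps sending `ticket_id` after step 20 gets an `unknown_fields` error in this sketch, so it has to read the error and switch to the new field name.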

Reward Design (7 components, arXiv:2601.19100)

Potential-based shaping — dependency graph progress
Dynamic weight optimization — BiPaRS rebalancing
Urgency-scaled SLA — time-dependent deadline rewards
Exploration bonus — EXPLORS intrinsic reward
Schema adaptation — explicit post-drift field usage reward
Process reward — PRM step-level supervision
Trajectory reward — trend and consistency bonus
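As an illustration of the first component, the sketch below uses the standard potential-based shaping form F(s, s') = γ·Φ(s') − Φ(s), with Φ defined as progress through a task dependency graph. The graph, the task names, and the discount value are assumptions made for this example; the project's actual potential function is not shown in this README.

```python
GAMMA = 0.99  # assumed discount factor

# Hypothetical dependency graph: each task lists the tasks it depends on.
DEPENDENCIES = {
    "approve_budget": [],
    "assign_engineer": ["approve_budget"],
    "resolve_ticket": ["assign_engineer"],
}


def potential(done: set) -> float:
    """Phi(s): fraction of tasks completed with all prerequisites completed."""
    satisfied = [
        task for task, deps in DEPENDENCIES.items()
        if task in done and all(dep in done for dep in deps)
    ]
    return len(satisfied) / len(DEPENDENCIES)


def shaping_bonus(done_before: set, done_after: set,
                  gamma: float = GAMMA) -> float:
    """F(s, s') = gamma * Phi(s') - Phi(s), added on top of the task reward."""
    return gamma * potential(done_after) - potential(done_before)
```

With γ = 1 the bonuses along any trajectory telescope to Φ(final) − Φ(initial), which is why this form rewards progress without rewarding loops. A task completed before its prerequisites, such as resolving a ticket whose budget was never approved, adds no potential in this sketch.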

Training Results

Metric	Value
Peak episode score	114 (+77%)
Task completion	35 → 75 (+114%)
GRPO reward_std	0.5 (non-zero, so group-relative advantages carry signal)
Scenarios completed	All 8 automatically
Backtracking	Triggered 2x (MARL adaptive)
Total steps	700 across 3 runs
GPU	Tesla T4
Model	Qwen2.5-3B-Instruct 4-bit LoRA
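The reward_std row matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below is a simplified version of that normalization, not TRL's implementation: the function names, the `eps` value, and the use of population standard deviation are assumptions. It also reproduces the task-completion percentage from the table.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # If every completion gets the same reward, sigma is 0 and all
    # advantages are 0, so the policy receives no learning signal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement, e.g. 35 -> 75 task completion."""
    return (after - before) / before
```

For example, `round(relative_gain(35, 75) * 100)` evaluates to 114, matching the task-completion row.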

Before vs After Training

Prompt: IT Agent. TKT-001, P1, SLA=2 steps. What do you do?

Before training:

```json
{"tool_call": "Assign Engineer to Ticket", "tool_params": {"engineer": "Engineer 1"}}
```

❌ Wrong tool name | ❌ Missing ticket_id | ❌ No reasoning

After 700 steps of GRPO:

```json
{"tool_call": "resolve_ticket", "tool_params": {"ticket_id": "TKT-001", "engineer": "Engineer 1"}, "reasoning": "P1, SLA=2 steps remaining, resolve immediately"}
```

✅ Correct tool | ✅ ticket_id included | ✅ SLA-aware reasoning
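The three checks above can be expressed as a small validator. This is a sketch using only the standard library; the project's own schemas live in contracts.py as Pydantic models and may look quite different. `KNOWN_TOOLS`, `REQUIRED_PARAMS`, and `check_action` are hypothetical names, and the rule that every ticket action needs a `ticket_id` is an assumption.

```python
import json

KNOWN_TOOLS = {"resolve_ticket"}   # hypothetical tool registry
REQUIRED_PARAMS = {"ticket_id"}    # assumed required for every ticket action


def check_action(raw: str) -> list:
    """Return the problems found in a raw JSON action; empty means valid."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(action, dict):
        return ["action is not a JSON object"]

    problems = []
    tool = action.get("tool_call")
    if tool not in KNOWN_TOOLS:
        problems.append(f"unknown tool: {tool!r}")

    params = action.get("tool_params")
    if not isinstance(params, dict):
        params = {}
    for name in sorted(REQUIRED_PARAMS - set(params)):
        problems.append(f"missing parameter: {name}")

    if not action.get("reasoning"):
        problems.append("missing reasoning")
    return problems
```

Run against the two outputs above, the pre-training action yields three problems (unknown tool, missing `ticket_id`, missing reasoning) and the post-training action yields none.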

Tech Stack

Component	Choice	Why
Model	Qwen2.5-3B-Instruct	Enterprise knowledge, JSON following
Training	GRPO via TRL	No critic needed, fits T4 GPU
Quantization	Unsloth 4-bit	2x faster training
Reward	7-component design	arXiv:2601.19100
Curriculum	MARL adaptive backtracking	Prevents policy collapse
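The curriculum row can be pictured as a level counter that moves up when recent scores are high and steps back down when they collapse. The sketch below is one simple way to do that and is not the project's implementation: the `Curriculum` class, the window size, both thresholds, and the assumption that scores are normalized to the range 0 to 1 are all hypothetical. Only the count of 8 levels comes from this README.

```python
from collections import deque


class Curriculum:
    """Advance or backtrack the difficulty level from a rolling mean score."""

    def __init__(self, levels: int = 8, window: int = 5,
                 advance_at: float = 0.8, backtrack_at: float = 0.3):
        self.levels = levels
        self.level = 1
        self.advance_at = advance_at      # hypothetical threshold
        self.backtrack_at = backtrack_at  # hypothetical threshold
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> int:
        """Record one episode score and return the level for the next episode."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return self.level  # not enough evidence yet
        avg = sum(self.scores) / len(self.scores)
        if avg >= self.advance_at and self.level < self.levels:
            self.level += 1
            self.scores.clear()  # start fresh at the new level
        elif avg <= self.backtrack_at and self.level > 1:
            self.level -= 1      # step back instead of letting the policy collapse
            self.scores.clear()
        return self.level
```

Clearing the window after each move keeps scores from the previous level from triggering a second move immediately.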

Project Structure

```
enterprise_ops/
├── contracts.py      # Pydantic schemas + agent constants
├── agents/           # IT, Manager, Finance, Oversight agents
├── env/              # Environment, tools, world model, schema drift
│   └── scenarios/    # 8 difficulty scenarios
├── server/           # FastAPI + Gradio deployment
└── train/            # GRPO training pipeline + reward functions
```

Bonus Prize Coverage

Patronus AI — Schema drift engine forces real API adaptation
Fleet AI — OversightAgent monitors all agents every step