title: EnterpriseOps Arena
emoji: 🏒
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

EnterpriseOps Arena

Multi-Agent RL Environment for Enterprise Coordination

Teaching LLMs to coordinate like a real enterprise team.

The Problem

Picture this: IT resolves a critical server ticket, but Finance blocked the budget two steps earlier. The ticket re-opens. The SLA is breached. The customer escalates.

Each agent acted correctly in isolation. Together they failed. This is the coordination gap we are training agents to close.

What We Built

Four specialized LLM agents coordinate in a simulated enterprise:

| Agent | Role |
|---|---|
| IT Agent | Resolves tickets, manages resources |
| Manager Agent | Allocates resources, coordinates teams |
| Finance Agent | Approves budgets, blocks violations |
| Oversight Agent | Monitors all agents, catches hallucinations |

What Makes It Hard

  • Partial observability: IT cannot see Finance decisions
  • Schema drift: API fields mutate every 20 steps
  • SLA pressure: tickets expire in real time
  • 12% noise: random tool failures at max difficulty
  • 8 difficulty levels: automatic curriculum advancement
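To make the schema drift and noise bullets concrete, here is a minimal sketch of a tool call whose expected field name changes every 20 steps and which fails at random 12% of the time. The two constants come from the list above; the alias table, `current_field`, and `call_tool` are hypothetical names for illustration and are not taken from the project's code.

```python
import random

DRIFT_INTERVAL = 20   # from the list above: fields mutate every 20 steps
FAILURE_RATE = 0.12   # from the list above: 12% noise at max difficulty

# Hypothetical rename table; the real drift engine's fields are not shown here.
FIELD_ALIASES = {"ticket_id": ["ticket_id", "ticketId", "id"]}


def current_field(canonical: str, step: int) -> str:
    """Return the name a canonical field goes by at the given step."""
    aliases = FIELD_ALIASES[canonical]
    return aliases[(step // DRIFT_INTERVAL) % len(aliases)]


def call_tool(params: dict, step: int, rng) -> dict:
    """Simulate one tool call under schema drift and random failures.

    `rng` is any object with a `random()` method, e.g. random.Random(seed).
    """
    if rng.random() < FAILURE_RATE:
        return {"ok": False, "error": "transient_failure"}
    expected = current_field("ticket_id", step)
    if expected not in params:
        # The agent used a stale field name after a drift event.
        return {"ok": False, "error": f"missing field: {expected}"}
    return {"ok": True, "ticket": params[expected]}
```

An agent that keeps sending `ticket_id` after step 20 would start receiving `missing field` errors in this sketch, which is the kind of signal it has to adapt to.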

Reward Design (7 components, arXiv:2601.19100)

  1. Potential-based shaping: dependency graph progress
  2. Dynamic weight optimization: BiPaRS rebalancing
  3. Urgency-scaled SLA: time-dependent deadline rewards
  4. Exploration bonus: EXPLORS intrinsic reward
  5. Schema adaptation: explicit post-drift field usage reward
  6. Process reward: PRM step-level supervision
  7. Trajectory reward: trend and consistency bonus
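As an illustration of the first component, the sketch below applies the standard potential-based shaping form F = γ·Φ(s') − Φ(s), with Φ defined as the fraction of a dependency graph that has been completed in a valid order. The graph, the discount factor, and the function names are assumptions made for this example; the project's actual potential function is not shown in this README.

```python
GAMMA = 0.99  # assumed discount factor

# Hypothetical dependency graph: each task lists the tasks it depends on.
DEPENDENCIES = {
    "approve_budget": [],
    "allocate_engineer": ["approve_budget"],
    "resolve_ticket": ["allocate_engineer"],
}


def potential(done: set) -> float:
    """Fraction of tasks that are done and whose dependencies are also done."""
    satisfied = [
        task
        for task, deps in DEPENDENCIES.items()
        if task in done and all(dep in done for dep in deps)
    ]
    return len(satisfied) / len(DEPENDENCIES)


def shaping_reward(done_before: set, done_after: set, gamma: float = GAMMA) -> float:
    """Potential-based shaping term: gamma * phi(s') - phi(s)."""
    return gamma * potential(done_after) - potential(done_before)
```

In this sketch, resolving a ticket before its budget is approved adds nothing to the potential, so the shaping term does not reward out-of-order progress.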

Training Results

[Figures: reward curves and loss curves]

| Metric | Value |
|---|---|
| Peak episode score | 114 (+77%) |
| Task completion | 35 → 75 (+114%) |
| GRPO reward_std | 0.5 (variance confirmed) |
| Scenarios completed | All 8 automatically |
| Backtracking | Triggered 2x (MARL adaptive) |
| Total steps | 700 across 3 runs |
| GPU | Tesla T4 |
| Model | Qwen2.5-3B-Instruct 4-bit LoRA |
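The reward_std row matters because GRPO scores each completion relative to the other completions sampled for the same prompt. If every completion in a group earns the same reward, the standard deviation is zero and all advantages collapse to zero, leaving no learning signal. The sketch below shows that normalization in plain Python; it is an illustration, not the TRL implementation, and the `eps` value and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Group-relative advantages: (r - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Identical rewards give zero advantages, so the policy gets no gradient.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# Varied rewards rank the completions against each other.
varied = group_advantages([0.0, 1.0, 0.5, 1.5])
```

Here `flat` contains only zeros, while `varied` assigns negative advantages to below-average completions and positive ones to above-average completions.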

Before vs After Training

Prompt: IT Agent. TKT-001, P1, SLA=2 steps. What do you do?

Before training:

{"tool_call": "Assign Engineer to Ticket", "tool_params": {"engineer": "Engineer 1"}}

❌ Wrong tool name | ❌ Missing ticket_id | ❌ No reasoning

After 700 steps of GRPO:

{"tool_call": "resolve_ticket", "tool_params": {"ticket_id": "TKT-001", "engineer": "Engineer 1"}, "reasoning": "P1, SLA=2 steps remaining, resolve immediately"}

✅ Correct tool | ✅ ticket_id included | ✅ SLA-aware reasoning
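The three checks applied to the outputs above can be expressed as a small validator. This is a sketch only: `check_tool_call` and `KNOWN_TOOLS` are hypothetical names, and the tool set contains just the one tool name that appears in this README, whereas the real registry is presumably larger.

```python
import json

# Assumption: only the tool name shown in this README is listed here.
KNOWN_TOOLS = {"resolve_ticket"}


def check_tool_call(raw: str) -> list:
    """Return a list of problems found in a model's tool-call JSON."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["not a JSON object"]

    problems = []
    if call.get("tool_call") not in KNOWN_TOOLS:
        problems.append("wrong tool name")
    params = call.get("tool_params")
    if not isinstance(params, dict) or "ticket_id" not in params:
        problems.append("missing ticket_id")
    if not call.get("reasoning"):
        problems.append("no reasoning")
    return problems


before = '{"tool_call":"Assign Engineer to Ticket", "tool_params":{"engineer":"Engineer 1"}}'
after = (
    '{"tool_call":"resolve_ticket", '
    '"tool_params":{"ticket_id":"TKT-001","engineer":"Engineer 1"}, '
    '"reasoning":"P1, SLA=2 steps remaining, resolve immediately"}'
)
```

Calling `check_tool_call(before)` reports all three problems, and `check_tool_call(after)` returns an empty list.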

Tech Stack

| Component | Choice | Why |
|---|---|---|
| Model | Qwen2.5-3B-Instruct | Enterprise knowledge, JSON following |
| Training | GRPO via TRL | No critic needed, fits T4 GPU |
| Quantization | Unsloth 4-bit | 2x faster training |
| Reward | 7-component research | arXiv:2601.19100 |
| Curriculum | MARL adaptive backtracking | Prevents policy collapse |
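One way to picture a curriculum with adaptive backtracking is a level counter that advances when recent scores are high and steps back when they drop. The sketch below is a guess at the general shape, not the project's implementation: the `Curriculum` class, the thresholds, the window size, and the assumption that scores are normalized to [0, 1] are all hypothetical.

```python
from collections import deque


class Curriculum:
    """Advance on sustained success, backtrack on sustained failure."""

    def __init__(self, num_levels=8, advance_at=0.8, backtrack_at=0.3, window=3):
        self.num_levels = num_levels      # 8 difficulty levels, per this README
        self.advance_at = advance_at      # assumed threshold
        self.backtrack_at = backtrack_at  # assumed threshold
        self.recent = deque(maxlen=window)
        self.level = 0

    def record(self, score: float) -> int:
        """Record one episode score in [0, 1] and return the next level."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return self.level  # not enough evidence yet
        avg = sum(self.recent) / len(self.recent)
        if avg >= self.advance_at and self.level < self.num_levels - 1:
            self.level += 1
            self.recent.clear()
        elif avg <= self.backtrack_at and self.level > 0:
            self.level -= 1  # step back to an easier scenario
            self.recent.clear()
        return self.level
```

Clearing the window after each level change means the next decision is based only on episodes from the new level.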

Project Structure

enterprise_ops/
├── contracts.py      Pydantic schemas + agent constants
├── agents/           IT, Manager, Finance, Oversight agents
├── env/              Environment, tools, world model, schema drift
│   └── scenarios/    8 difficulty scenarios
├── server/           FastAPI + Gradio deployment
└── train/            GRPO training pipeline + reward functions

Bonus Prize Coverage

  • Patronus AI: Schema drift engine forces real API adaptation
  • Fleet AI: OversightAgent monitors all agents every step