EnterpriseOps Arena: Teaching LLMs to Coordinate Like a Real Enterprise Team

The Problem Nobody Is Solving

Multi-agent enterprise AI deployments tend to fail the same way.

IT resolves a critical server ticket. Finance blocked the budget 2 steps earlier. The ticket re-opens. SLA breaches. Customer escalates.

Each agent acted correctly in isolation. Together they failed.

Current LLM benchmarks test individual agents on individual tasks. Few test whether agents can coordinate under real enterprise pressure: partial information, tight deadlines, changing APIs, and shared scarce resources. We built an environment to train exactly that.

What We Built

EnterpriseOps Arena is a multi-agent RL environment where 4 specialized LLM agents must coordinate to run a simulated enterprise. Each agent sees only its own department. They share one resource pool. They communicate through a message bus. They succeed or fail together.

The 4 agents:

IT Agent — resolves support tickets before SLA breach
Manager Agent — allocates shared resources, coordinates
Finance Agent — approves budgets, blocks policy violations
Oversight Agent — monitors all agents, catches hallucinations

What makes it hard:

Partial observability — IT cannot see Finance budget decisions
Schema drift — API fields change every 20 steps silently
8 difficulty levels — from simple tickets to full enterprise chaos
12% noise at max difficulty — tool calls fail randomly
SLA timers — tickets expire if not resolved in time
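The tool-call noise in the list above can be pictured with a small wrapper. This is only a sketch under stated assumptions: the 12% ceiling and the 8 levels come from the list, while the linear ramp between levels, the `call_tool` helper, and the error payload are hypothetical and not taken from the environment's code.

```python
import random

MAX_NOISE = 0.12  # failure probability at the highest difficulty (from the post)
MAX_LEVEL = 8     # number of difficulty levels (from the post)


def failure_probability(level: int) -> float:
    """Assumed linear ramp: no noise at level 1, MAX_NOISE at MAX_LEVEL."""
    return MAX_NOISE * (level - 1) / (MAX_LEVEL - 1)


def call_tool(tool, level: int, rng: random.Random, **kwargs) -> dict:
    """Run `tool`, but randomly fail with the probability for this level."""
    if rng.random() < failure_probability(level):
        # Hypothetical error shape; the agent is expected to notice and retry.
        return {"ok": False, "error": "transient_failure"}
    return {"ok": True, "result": tool(**kwargs)}


rng = random.Random(0)  # seeded so a run is reproducible
print(call_tool(lambda ticket: f"resolved {ticket}", level=1, rng=rng, ticket="T-101"))
```

Passing the random generator in explicitly keeps episodes reproducible, which makes it easier to compare a policy before and after training on the same failures.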

Schema Drift — Our Most Original Contribution

Every 20 training steps the API schemas mutate. A field called ticket_id becomes tkt_ref. An agent that memorized field names fails immediately. An agent that learned to adapt succeeds.

This forces genuine world model adaptation rather than memorization. This is our Patronus AI angle — testing whether agents can handle real API versioning pressure that every enterprise faces.
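A minimal sketch of how this kind of drift could be simulated. Only the `ticket_id` to `tkt_ref` rename and the 20-step interval come from the description above; the rename tables, the `priority` to `prio` rename, and the helper names are illustrative assumptions, not the environment's actual code.

```python
DRIFT_INTERVAL = 20  # steps between schema mutations (from the post)

# Hypothetical rename tables, one per schema version.
# Only ticket_id -> tkt_ref is taken from the post.
SCHEMA_VERSIONS = [
    {},                                            # version 0: original names
    {"ticket_id": "tkt_ref"},                      # version 1
    {"ticket_id": "tkt_ref", "priority": "prio"},  # version 2 (illustrative)
]


def schema_version(step: int) -> int:
    """Return the schema version active at a given step."""
    return min(step // DRIFT_INTERVAL, len(SCHEMA_VERSIONS) - 1)


def apply_drift(payload: dict, step: int) -> dict:
    """Rename payload keys according to the schema active at `step`."""
    renames = SCHEMA_VERSIONS[schema_version(step)]
    return {renames.get(key, key): value for key, value in payload.items()}


ticket = {"ticket_id": "T-101", "priority": "P1"}
print(apply_drift(ticket, step=5))   # original field names
print(apply_drift(ticket, step=25))  # ticket_id now appears as tkt_ref
```

An agent that hard-codes `ticket_id` in its tool calls would start failing at step 20 in this sketch, while one that reads the field names from the current observation would not.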

The Reward Design

Based on arXiv:2601.19100 — 7 independent reward components:

Potential-based shaping — accelerates convergence on dependency tasks
BiPaRS dynamic weights — rebalances components when performance drops
Urgency-scaled SLA — higher reward for early P1 resolution
EXPLORS exploration bonus — intrinsic reward for novel tool sequences
Schema adaptation — explicit reward for correct post-drift field usage
PRM process reward — step-level supervision for credit assignment
Trajectory reward — consistency and trend bonus over episode
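To make the first component concrete: potential-based shaping in its standard form adds gamma * phi(s') - phi(s) to the environment reward, which is known to leave the optimal policy unchanged. The sketch below assumes a hypothetical potential (the negative count of unresolved dependencies) and an assumed discount factor; neither is taken from the environment's code.

```python
GAMMA = 0.99  # discount factor (assumed value)


def potential(unresolved_dependencies: int) -> float:
    """Hypothetical potential: fewer open dependencies means higher potential."""
    return -float(unresolved_dependencies)


def shaped_reward(env_reward: float, deps_before: int, deps_after: int) -> float:
    """Add the potential-based shaping term gamma * phi(s') - phi(s)."""
    shaping = GAMMA * potential(deps_after) - potential(deps_before)
    return env_reward + shaping


# Resolving one of three open dependencies earns a positive shaping bonus,
# even though the environment itself paid out nothing on this step.
print(shaped_reward(0.0, deps_before=3, deps_after=2))
```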

Anti-reward-hacking: OversightAgent penalizes hallucinations, stuck loops, and policy violations. An agent exploiting the reward without solving the task gets caught immediately.

MARL Adaptive Curriculum

Standard curriculum RL only moves forward. Our backtracking curriculum monitors GRPO reward variance in real time. When variance collapses (all completions get the same score, so GRPO has no learning signal), the system steps back one difficulty level.

This is a self-healing training loop. It triggered twice during training and recovered episode score from 79 to 112 both times.
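The backtracking rule can be sketched as a small controller. This is a sketch under stated assumptions: the collapse threshold `std_floor`, the averaging window, and the caller-supplied `promote` flag are all hypothetical, since the post does not say how collapse is detected or when a level is advanced.

```python
from collections import deque
from statistics import mean


class BacktrackingCurriculum:
    """Step back one level when the recent GRPO reward std collapses."""

    def __init__(self, levels: int = 8, std_floor: float = 0.05, window: int = 5):
        self.level = 1
        self.levels = levels
        self.std_floor = std_floor              # assumed collapse threshold
        self.recent_std = deque(maxlen=window)  # rolling window of reward std

    def update(self, reward_std: float, promote: bool = False) -> int:
        """Record one step's reward std and return the level to train on next."""
        self.recent_std.append(reward_std)
        window_full = len(self.recent_std) == self.recent_std.maxlen
        if window_full and mean(self.recent_std) < self.std_floor:
            # Variance collapsed: completions score alike, so step back.
            self.level = max(1, self.level - 1)
            self.recent_std.clear()
        elif promote:
            # Advancement criterion is left to the caller (hypothetical).
            self.level = min(self.levels, self.level + 1)
        return self.level


curriculum = BacktrackingCurriculum(window=3)
curriculum.level = 4
for std in [0.5, 0.01, 0.0, 0.02]:
    level = curriculum.update(std)
print(level)  # 3: the last three readings average below the floor
```

Clearing the window after a step back gives the easier level a fresh set of readings before another backtrack can fire.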

Training

Model: Qwen2.5-3B-Instruct, 4-bit quantized via Unsloth
Method: GRPO via Hugging Face TRL
Total: 700 steps across 3 training runs
GPU: Tesla T4

GRPO was chosen because it trains without a critic model, which is essential when you only have 16 GB of VRAM. Qwen2.5-3B-Instruct was chosen because it already understands enterprise concepts and follows structured JSON instructions. Unsloth was chosen because it makes 4-bit QLoRA training about 2x faster through custom Triton kernels.
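GRPO can drop the critic because it uses the mean reward of a group of completions for the same prompt as its baseline. The sketch below shows that group-relative advantage in plain Python. The `eps` value and the use of the population standard deviation are assumptions made for illustration, not a description of TRL's internals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each completion's reward against its own sampling group."""
    baseline = mean(rewards)  # group mean replaces a learned value function
    spread = pstdev(rewards)  # group spread rescales the signal
    return [(r - baseline) / (spread + eps) for r in rewards]


# Mixed scores within a group: better completions get positive advantages.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))

# Identical scores: every advantage is zero, so there is no gradient signal.
print(group_relative_advantages([0.7, 0.7, 0.7, 0.7]))
```

The second call is the failure mode that reward-variance monitoring is meant to catch: when every completion in a group scores the same, the policy update carries no information.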

Results

Episode score: 64.5 → 114 (+77%)
Task completion: 35 → 75 (+114%)
All 8 scenarios completed automatically
GRPO reward_std: 0.5 (non-zero variance, so GRPO had a learning signal)
Backtracking triggered 2x, recovered both times
LoRA adapters: https://huggingface.co/Anurag137/enterprise-ops-lora
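The relative gains quoted for episode score and task completion follow directly from the before and after values. The helper name below is only for illustration.

```python
def relative_gain(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


print(round(relative_gain(64.5, 114)))  # 77  (episode score)
print(round(relative_gain(35, 75)))     # 114 (task completion)
```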

Before vs After

Same prompt — P1 ticket, SLA=2 steps remaining:

Before training: Wrong tool name, missing ticket_id, no reasoning
After 700 steps: Correct tool, correct params, SLA-aware reasoning

The model learned what the environment actually requires.

Why It Matters

Enterprise AI coordination is the next frontier. Every company deploying agents will face exactly this problem. To our knowledge, EnterpriseOps Arena is the first RL environment designed specifically to train theory-of-mind coordination in LLMs for enterprise settings.

A researcher could write a paper about training on this. We just did.