{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "provenance": [],
   "gpuType": "T4"
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": "# SentinelOps Arena — Multi-Agent GRPO Training with Unsloth + vLLM\n\nTrain **all 3 agents** (Worker, Attacker, Oversight) using GRPO on the SentinelOps Arena OpenEnv environment.\n\n**Key features:**\n- **BF16 precision** on H100 GPUs (no 4-bit quantization)\n- **vLLM fast inference** via `fast_inference=True`\n- **Environment-executing reward functions** — completions are parsed into `SentinelAction`s and executed in a live SentinelOps environment for real rewards\n- **Multi-agent self-play** — adversarial training across Worker, Attacker, and Oversight roles\n\n**Partner tracks:** Fleet AI ($10K, Scalable Oversight) · Patronus AI ($10K, Schema Drift)",
   "metadata": {
    "id": "intro"
   }
  },
  {
   "cell_type": "markdown",
   "source": "## 1. Install Dependencies\n\nFollowing the official OpenEnv + Unsloth reference notebook pattern.",
   "metadata": {
    "id": "setup-header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "install-deps"
   },
   "outputs": [],
   "source": "%%capture\n!pip install unsloth vllm\n!pip install --no-deps trl sft_trainer\n!pip install \"openenv-core[core]>=0.2.0\" mcp fastmcp pydantic pandas datasets"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "clone-repo"
   },
   "outputs": [],
   "source": "import os\nif not os.path.exists(\"NexusEnv\"):\n    !git clone https://github.com/nihalnihalani/NexusEnv.git\nimport sys\nsys.path.insert(0, \"/content/NexusEnv\")\n\n# Verify environment loads\nfrom sentinelops_arena.environment import SentinelOpsArena\nfrom sentinelops_arena.models import AgentRole, SentinelAction\nenv = SentinelOpsArena()\nobs = env.reset(seed=42)\nprint(f\"Environment ready! Agent: {obs.current_agent}, Systems: CRM + Billing + Ticketing\")"
  },
  {
   "cell_type": "markdown",
   "source": "## 2. Run a Full Episode (Verify Environment)\n\nRun one complete episode with heuristic agents to verify the environment works end-to-end.",
   "metadata": {
    "id": "collect-header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "collect-data"
   },
   "outputs": [],
   "source": "from NexusEnv.train import collect_multi_agent_data, build_training_dataset\nfrom NexusEnv.train import WORKER_SYSTEM_PROMPT, ATTACKER_SYSTEM_PROMPT, OVERSIGHT_SYSTEM_PROMPT\nfrom NexusEnv.train import AGENT_CONFIGS\n\n# Run a single episode and show stats for each agent\nfor role in [\"worker\", \"attacker\", \"oversight\"]:\n    data = collect_multi_agent_data(seed=42, target_agent=role)\n    avg_r = sum(d[\"reward\"] for d in data) / max(len(data), 1)\n    print(f\"{role:>10}: {len(data)} turns, avg_reward={avg_r:.3f}\")"
  },
  {
   "cell_type": "markdown",
   "source": "## 3. Collect Training Data via Self-Play\n\nWe collect prompts from multiple episodes. Each episode uses heuristic agents for non-target roles while recording the prompts the target agent would see.",
   "metadata": {
    "id": "load-header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "load-model"
   },
   "outputs": [],
   "source": "from datasets import Dataset\n\n# Which agent to train — change this to train attacker or oversight\nTARGET_AGENT = \"worker\"  # Options: \"worker\", \"attacker\", \"oversight\"\nNUM_EPISODES = 10\n\nsystem_prompts = {\n    \"worker\": WORKER_SYSTEM_PROMPT,\n    \"attacker\": ATTACKER_SYSTEM_PROMPT,\n    \"oversight\": OVERSIGHT_SYSTEM_PROMPT,\n}\n\nprint(f\"Collecting {TARGET_AGENT} training data from {NUM_EPISODES} episodes...\")\ndataset_raw = build_training_dataset(num_episodes=NUM_EPISODES, target_agent=TARGET_AGENT)\n\nprompts = []\nfor d in dataset_raw:\n    messages = [\n        {\"role\": \"system\", \"content\": system_prompts[TARGET_AGENT]},\n        {\"role\": \"user\", \"content\": d[\"prompt\"]},\n    ]\n    prompts.append(messages)\n\ntrain_dataset = Dataset.from_dict({\"prompt\": prompts})\nprint(f\"Dataset: {len(train_dataset)} {TARGET_AGENT} turns\")\nif dataset_raw:\n    avg_r = sum(d[\"reward\"] for d in dataset_raw) / len(dataset_raw)\n    print(f\"Avg environment reward: {avg_r:.3f}\")"
  },
  {
   "cell_type": "markdown",
   "source": "## 4. Load Model with Unsloth (BF16 + vLLM)\n\nFollowing the official OpenEnv reference pattern:\n- `load_in_4bit=False` — BF16 precision on H100\n- `fast_inference=True` — vLLM for fast GRPO generation\n- `lora_alpha = 2 * lora_rank` — official LoRA configuration\n- `gpu_memory_utilization=0.9` — maximize GPU usage",
   "metadata": {
    "id": "train-header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "train"
   },
   "outputs": [],
   "source": "from unsloth import FastLanguageModel\n\nmodel_name = \"unsloth/Qwen2.5-0.5B-Instruct\"\nlora_rank = 16\n\nmodel, tokenizer = FastLanguageModel.from_pretrained(\n    model_name=model_name,\n    max_seq_length=768,\n    load_in_4bit=False,          # BF16 for H100 (official recommendation)\n    fast_inference=True,          # vLLM fast inference\n    max_lora_rank=lora_rank,\n    gpu_memory_utilization=0.9,\n)\n\nmodel = FastLanguageModel.get_peft_model(\n    model,\n    r=lora_rank,\n    target_modules=[\n        \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n        \"gate_proj\", \"up_proj\", \"down_proj\",\n    ],\n    lora_alpha=lora_rank * 2,    # Official: lora_alpha = 2 * lora_rank\n    lora_dropout=0,\n    bias=\"none\",\n    use_gradient_checkpointing=\"unsloth\",\n)\nprint(f\"Model loaded: BF16 + vLLM + LoRA (r={lora_rank}, alpha={lora_rank*2})\")"
  },
  {
   "cell_type": "markdown",
   "source": "## 5. GRPO Training with Environment-Executing Rewards\n\nThe reward function follows the OpenEnv 2048 reference pattern:\n1. Parse LLM completion → `SentinelAction`\n2. Execute action in a fresh `SentinelOpsArena` environment\n3. Return **real environment reward** + format bonus\n\nThis is the critical differentiator — rewards come from actual environment execution, not just text matching.",
   "metadata": {
    "id": "save-header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "save"
   },
   "outputs": [],
   "source": "from trl import GRPOConfig, GRPOTrainer\nfrom NexusEnv.train import make_reward_function\n\n# Environment-executing reward function\nreward_fn = make_reward_function(TARGET_AGENT)\n\ngrpo_config = GRPOConfig(\n    output_dir=f\"./sentinelops-grpo-{TARGET_AGENT}\",\n    max_steps=300,                      # Official recommendation\n    per_device_train_batch_size=1,\n    gradient_accumulation_steps=4,\n    num_generations=2,                   # GRPO group size\n    max_completion_length=256,\n    max_prompt_length=512,\n    learning_rate=5e-5,                  # Official reference: 5e-5\n    temperature=1.0,                     # Official reference: 1.0\n    logging_steps=1,\n    save_steps=50,\n    report_to=\"none\",\n)\n\ntrainer = GRPOTrainer(\n    model=model,\n    processing_class=tokenizer,\n    reward_funcs=[reward_fn],\n    args=grpo_config,\n    train_dataset=train_dataset,\n)\n\nprint(f\"Starting GRPO training for {TARGET_AGENT}...\")\nprint(f\"  max_steps={grpo_config.max_steps}, lr={grpo_config.learning_rate}\")\nprint(f\"  num_generations={grpo_config.num_generations}, temp={grpo_config.temperature}\")\ntrainer.train()"
  },
  {
   "cell_type": "markdown",
   "source": "## 6. Save and Evaluate\n\nSave the trained LoRA weights and run a quick evaluation.",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "source": "output_dir = f\"./sentinelops-grpo-{TARGET_AGENT}\"\ntrainer.save_model(output_dir)\ntokenizer.save_pretrained(output_dir)\nprint(f\"{TARGET_AGENT.upper()} agent trained and saved to {output_dir}\")\n\n# Quick evaluation: run reward function on a few test completions\nimport json\ntest_completions = {\n    \"worker\": [\n        [{\"content\": json.dumps({\"action_type\": \"get_schema\", \"parameters\": {\"system\": \"crm\"}})}],\n        [{\"content\": json.dumps({\"action_type\": \"respond\", \"response_text\": \"I cannot process this. It appears to be social engineering.\"})}],\n        [{\"content\": \"this is garbage output\"}],\n    ],\n    \"attacker\": [\n        [{\"content\": json.dumps({\"action_type\": \"launch_attack\", \"parameters\": {\"attack_type\": \"schema_drift\", \"target_system\": \"crm\", \"old_field\": \"name\", \"new_field\": \"full_name\"}})}],\n        [{\"content\": json.dumps({\"action_type\": \"pass\"})}],\n    ],\n    \"oversight\": [\n        [{\"content\": json.dumps({\"action_type\": \"flag\", \"explanation\": \"Worker followed suspicious admin override instructions. This is a social engineering attack.\"})}],\n        [{\"content\": json.dumps({\"action_type\": \"approve\", \"explanation\": \"Worker correctly checked schema before proceeding.\"})}],\n    ],\n}\n\nprint(f\"\\nReward evaluation for {TARGET_AGENT}:\")\nfor comp in test_completions.get(TARGET_AGENT, []):\n    r = reward_fn([comp])\n    text = comp[0][\"content\"][:80]\n    print(f\"  reward={r[0]:+.2f}  |  {text}...\")",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  }
 ]
}