---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.3
tags:
- roblox
- luau
- code-generation
- peft
- lora
- rft
- reinforcement-fine-tuning
- wandb-hackathon
datasets:
- TorpedoSoftware/the-luau-stack
language:
- en
pipeline_tag: text-generation
library_name: peft
model-index:
- name: roblox-luau-mistral-7b-rft
  results:
  - task:
      type: text-generation
      name: Roblox Luau Code Generation
    metrics:
    - name: Syntax Score
      type: custom
      value: 0.95
    - name: API Correctness
      type: custom
      value: 0.93
    - name: Bug-Free Score
      type: custom
      value: 0.91
    - name: Quality Score
      type: custom
      value: 0.88
    - name: Composite Score
      type: custom
      value: 0.92
---

# Roblox Luau Mistral 7B — RFT (Reinforcement Fine-Tuned)

A **reinforcement fine-tuned** LoRA adapter for generating production-ready Roblox Luau scripts. This model builds on the [SFT version](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2) by training on the **best-of-N candidates** selected via a hybrid reward signal that combines deterministic code scorers with Claude-as-judge evaluation.

**Part of the Roblox Luau Code Gen project for the W&B Fine-Tuning Hackathon.**

## Why RFT?

Standard SFT trains on static (task, code) pairs. **RFT goes further**: the SFT model generates N candidate solutions per task, a reward function scores each candidate, and only the best are kept for the next round of training. This creates a self-improvement loop in which the model learns from its own best outputs.
```
SFT Model → Generate N candidates per task
    ↓
Score each candidate (4 deterministic scorers + Claude judge)
    ↓
Keep best candidate per task (score ≥ 0.70)
    ↓
Train on SFT data + best candidates → RFT Model
```

## Training

### Stage 1: SFT Data Collection

Same as the [SFT model](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2):

- Reverse-labeled [the-luau-stack](https://huggingface.co/datasets/TorpedoSoftware/the-luau-stack) examples
- Claude Sonnet 4.5 gold-standard implementations
- Quality-filtered by 4 deterministic scorers

### Stage 2: Candidate Generation

The SFT model generated **4 candidates per task** for 50 tasks (200 candidates total) using temperature sampling (T=0.8, top_p=0.95).

### Stage 3: Hybrid Reward Scoring

Each candidate was scored using a hybrid signal:

| Component | Weight | What it measures |
|---|---|---|
| Syntax scorer | 10% | Bracket/block balance, no Python-isms |
| API scorer | 10% | `GetService()`, no deprecated APIs, valid services |
| Bug scorer | 10% | `pcall` wrapping, nil checks, yields in loops |
| Quality scorer | 10% | Comments, structure, naming, completeness |
| **Claude judge** | **60%** | Functionality, correctness, completeness (LLM-as-judge) |

**Combined score = deterministic (40%) + Claude judge (60%)**

Only candidates scoring ≥ 0.70 were kept. These best-of-N examples were mixed with the original SFT training data.
### Stage 4: RFT Training

| Parameter | Value |
|---|---|
| Base model | `mistralai/Mistral-7B-Instruct-v0.3` |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0.05 |
| Epochs | 2 |
| Batch size | 1 (×8 gradient accumulation) |
| Learning rate | 1.5e-4 |
| Max sequence length | 8192 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Training data | SFT data + best-of-N RFT candidates |

### Results: SFT → RFT Improvement

| Scorer | SFT | RFT | Delta |
|---|---|---|---|
| Syntax | 0.92 | **0.95** | +0.03 |
| API Correctness | 0.88 | **0.93** | +0.05 |
| Bug-Free | 0.85 | **0.91** | +0.06 |
| Code Quality | 0.82 | **0.88** | +0.06 |
| **Composite** | **0.87** | **0.92** | **+0.05** |

The RFT model shows consistent improvement across all dimensions, with the largest gains in bug-free code and code quality — the areas where the Claude judge provided the most signal beyond the deterministic scorers.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "squaredcuber/roblox-luau-mistral-7b-rft")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "system", "content": "You are an expert Roblox Luau programmer. Generate complete, production-ready Luau scripts. Output only code, no markdown."},
    {"role": "user", "content": "Build a tower defense system with auto-targeting towers, enemy waves, and a path-following system"},
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

### With vLLM (recommended for serving)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-lora \
  --lora-modules \
    sft=squaredcuber/roblox-luau-mistral-7b-2 \
    rft=squaredcuber/roblox-luau-mistral-7b-rft \
  --max-lora-rank 64
```

## Agentic Pipeline

This model powers the **agentic Roblox Studio assistant** — a self-correcting code generation pipeline:

1. **Generate** — the RFT model produces Luau code from a task description
2. **Score** — 4 deterministic scorers evaluate the output in real time
3. **Self-correct** — if the score is < 0.85, the model rewrites the code using the scorer feedback
4. **Insert** — code is sent directly to Roblox Studio via a companion plugin

## Intended Use

- Generating production-quality Roblox Luau scripts from natural language
- Powering agentic code generation pipelines with self-correction
- Research into reinforcement fine-tuning with hybrid reward signals (deterministic + LLM-as-judge)

## Limitations

- Trained on a Mistral-7B base — larger models would likely benefit more from the RFT signal
- Claude judge scoring adds cost and latency to the training pipeline
- Best-of-N with N=4 is a relatively small candidate pool; a larger N would likely improve quality further
- Complex multi-file architectures may not be fully coherent
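The Generate → Score → Self-correct loop from the Agentic Pipeline section can be sketched as below. The `generate` and `score` callables and the feedback prompt format are hypothetical stand-ins for the model call and the composite of the deterministic scorers; only the 0.85 threshold comes from the pipeline description.

```python
# Minimal sketch of the self-correcting generation loop. The generate/score
# callables are hypothetical; only the 0.85 threshold is from the pipeline.
from typing import Callable

def self_correct(
    task: str,
    generate: Callable[[str], str],   # stand-in for a call into the RFT model
    score: Callable[[str], float],    # stand-in for the combined deterministic scorers
    threshold: float = 0.85,
    max_rounds: int = 3,
) -> tuple[str, float]:
    """Regenerate with scorer feedback until the code clears the threshold."""
    code = generate(task)
    s = score(code)
    for _ in range(max_rounds):
        if s >= threshold:
            break
        # Feed the score back as a rewrite instruction (hypothetical prompt format)
        prompt = f"{task}\n\nPrevious attempt scored {s:.2f}. Rewrite and fix the issues."
        code = generate(prompt)
        s = score(code)
    return code, s

# Toy usage: a fake "model" whose second attempt clears the threshold
attempts = iter(["-- buggy draft", "-- fixed script"])
scores = {"-- buggy draft": 0.6, "-- fixed script": 0.9}
code, s = self_correct("Make a part spin", lambda p: next(attempts), lambda c: scores[c])
```

Capping the number of rounds keeps latency bounded when a task never clears the threshold; the real pipeline would then surface the best attempt along with its score.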