---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.3
tags:
- roblox
- luau
- code-generation
- peft
- lora
- rft
- reinforcement-fine-tuning
- wandb-hackathon
datasets:
- TorpedoSoftware/the-luau-stack
language:
- en
pipeline_tag: text-generation
library_name: peft
model-index:
- name: roblox-luau-mistral-7b-rft
results:
- task:
type: text-generation
name: Roblox Luau Code Generation
metrics:
- name: Syntax Score
type: custom
value: 0.95
- name: API Correctness
type: custom
value: 0.93
- name: Bug-Free Score
type: custom
value: 0.91
- name: Quality Score
type: custom
value: 0.88
- name: Composite Score
type: custom
value: 0.92
---
# Roblox Luau Mistral 7B RFT (Reinforcement Fine-Tuned)
A **reinforcement fine-tuned** LoRA adapter for generating production-ready Roblox Luau scripts. This model builds on the [SFT version](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2) by training on the **best-of-N candidates** selected via a hybrid reward signal combining deterministic code scorers + Claude-as-judge evaluation.
**Part of the Roblox Luau Code Gen project for the W&B Fine-Tuning Hackathon.**
## Why RFT?
Standard SFT trains on static (task, code) pairs. **RFT goes further**: the SFT model generates N candidate solutions per task, a reward function scores each candidate, and only the best are kept for the next round of training. This creates a self-improvement loop where the model learns from its own best outputs.
```
SFT Model → Generate N candidates per task
        ↓
Score each candidate
(4 deterministic scorers + Claude judge)
        ↓
Keep best candidate per task (score ≥ 0.70)
        ↓
Train on SFT data + best candidates → RFT Model
```
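The selection loop above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `generate` and `score` are stub callables standing in for the SFT model and the hybrid reward pipeline; only the keep threshold (0.70) comes from the card.

```python
from typing import Callable, List, Tuple

THRESHOLD = 0.70  # minimum hybrid reward to keep a candidate (from the card)

def best_of_n(
    tasks: List[str],
    generate: Callable[[str, int], List[str]],  # SFT model: task -> N sampled candidates
    score: Callable[[str, str], float],         # hybrid reward: (task, code) -> [0, 1]
    n: int = 4,
) -> List[Tuple[str, str]]:
    """Keep the single best candidate per task, if it clears the threshold."""
    kept = []
    for task in tasks:
        candidates = generate(task, n)
        best = max(candidates, key=lambda code: score(task, code))
        if score(task, best) >= THRESHOLD:
            kept.append((task, best))
    return kept

# Toy stand-ins: longer "code" scores higher, so the longest candidate wins.
demo_gen = lambda task, n: ["b" * (i + 1) for i in range(n)]
demo_score = lambda task, code: min(len(code) / 4, 1.0)
print(best_of_n(["spawn a part"], demo_gen, demo_score))
# [('spawn a part', 'bbbb')]
```

The surviving `(task, best)` pairs are then mixed back into the SFT data for the next training round.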
## Training
### Stage 1: SFT Data Collection
Same as the [SFT model](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2):
- Reverse-labeled [the-luau-stack](https://huggingface.co/datasets/TorpedoSoftware/the-luau-stack) examples
- Claude Sonnet 4.5 gold-standard implementations
- Quality-filtered by 4 deterministic scorers
### Stage 2: Candidate Generation
The SFT model generated **4 candidates per task** for 50 tasks (200 total candidates) using temperature sampling (T=0.8, top_p=0.95).
### Stage 3: Hybrid Reward Scoring
Each candidate was scored using a hybrid signal:
| Component | Weight | What it measures |
|---|---|---|
| Syntax scorer | 10% | Bracket/block balance, no Python-isms |
| API scorer | 10% | `GetService()`, no deprecated APIs, valid services |
| Bug scorer | 10% | pcall wrapping, nil checks, yield in loops |
| Quality scorer | 10% | Comments, structure, naming, completeness |
| **Claude judge** | **60%** | Functionality, correctness, completeness (LLM-as-judge) |
**Combined score = deterministic (40%) + Claude judge (60%)**
Only candidates scoring ≥ 0.70 were kept. These best-of-N examples were mixed with the original SFT training data.
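The weighting above reduces to a simple weighted sum. A sketch using the card's weights and threshold (the individual scorer outputs here are made-up example values, not real scorer implementations):

```python
# Component weights from the reward table: 4 x 10% deterministic + 60% Claude judge
WEIGHTS = {
    "syntax": 0.10,
    "api": 0.10,
    "bugs": 0.10,
    "quality": 0.10,
    "claude_judge": 0.60,  # LLM-as-judge carries most of the signal
}
KEEP_THRESHOLD = 0.70

def combined_score(scores: dict) -> float:
    """Weighted sum of the four deterministic scorers plus the Claude judge."""
    assert set(scores) == set(WEIGHTS), "need exactly one score per component"
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example candidate: strong on syntax, weaker on quality.
example = {"syntax": 1.0, "api": 0.9, "bugs": 0.8, "quality": 0.7, "claude_judge": 0.85}
s = combined_score(example)
print(round(s, 2), s >= KEEP_THRESHOLD)
# 0.85 True
```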
### Stage 4: RFT Training
| Parameter | Value |
|---|---|
| Base model | `mistralai/Mistral-7B-Instruct-v0.3` |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0.05 |
| Epochs | 2 |
| Batch size | 1 (Γ8 gradient accumulation) |
| Learning rate | 1.5e-4 |
| Max sequence length | 8192 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Training data | SFT data + best-of-N RFT candidates |
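The hyperparameter table translates roughly into the following `peft`/`transformers` configuration. This is a reconstruction from the table, not the actual training script; dataset loading, model instantiation, and the trainer call are omitted, and `output_dir` is a placeholder.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# QLoRA: load the base model in 4-bit NF4 with bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter over all attention and MLP projections
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="rft-out",  # placeholder
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size 8
    learning_rate=1.5e-4,
    bf16=True,
    gradient_checkpointing=True,
)
```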
### Results: SFT → RFT Improvement
| Scorer | SFT | RFT | Delta |
|---|---|---|---|
| Syntax | 0.92 | **0.95** | +0.03 |
| API Correctness | 0.88 | **0.93** | +0.05 |
| Bug-Free | 0.85 | **0.91** | +0.06 |
| Code Quality | 0.82 | **0.88** | +0.06 |
| **Composite** | **0.87** | **0.92** | **+0.05** |
The RFT model shows consistent improvement across all dimensions, with the largest gains in bug-free code and code quality, the areas where the Claude judge provided the most signal beyond the deterministic scorers.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load the base model and attach the RFT LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "squaredcuber/roblox-luau-mistral-7b-rft")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "system", "content": "You are an expert Roblox Luau programmer. Generate complete, production-ready Luau scripts. Output only code, no markdown."},
    {"role": "user", "content": "Build a tower defense system with auto-targeting towers, enemy waves, and a path-following system"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### With vLLM (recommended for serving)
```bash
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--enable-lora \
--lora-modules \
sft=squaredcuber/roblox-luau-mistral-7b-2 \
rft=squaredcuber/roblox-luau-mistral-7b-rft \
--max-lora-rank 64
```
## Agentic Pipeline
This model powers the **agentic Roblox Studio assistant**, a self-correcting code generation pipeline:
1. **Generate**: the RFT model produces Luau code from a task description
2. **Score**: 4 deterministic scorers evaluate the output in real time
3. **Self-correct**: if the score is < 0.85, the model rewrites the code using the scorer feedback
4. **Insert**: code is sent directly to Roblox Studio via a companion plugin
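The generate/score/self-correct loop might look like the sketch below, with stub callables standing in for the RFT model, the deterministic scorers, and the rewrite prompt. Only the 0.85 threshold comes from the pipeline description; the retry cap and all stub behavior are illustrative.

```python
from typing import Callable, Tuple

REWRITE_THRESHOLD = 0.85  # from the pipeline description
MAX_ATTEMPTS = 3          # illustrative cap on rewrite rounds

def generate_with_self_correction(
    task: str,
    generate: Callable[[str], str],           # RFT model: task -> Luau code
    score: Callable[[str], Tuple[float, str]],  # scorers: code -> (score, feedback)
    rewrite: Callable[[str, str, str], str],  # model: (task, code, feedback) -> revised code
) -> str:
    code = generate(task)
    for _ in range(MAX_ATTEMPTS):
        s, feedback = score(code)
        if s >= REWRITE_THRESHOLD:
            break
        code = rewrite(task, code, feedback)
    return code  # handed off to the Studio plugin for insertion

# Toy stand-ins: the first draft fails scoring, the rewrite passes.
gen = lambda task: "draft"
scorer = lambda code: (0.9, "") if code == "fixed" else (0.5, "missing pcall")
rewriter = lambda task, code, feedback: "fixed"
print(generate_with_self_correction("spawn a part", gen, scorer, rewriter))
# fixed
```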
## Intended Use
- Generating production-quality Roblox Luau scripts from natural language
- Powering agentic code generation pipelines with self-correction
- Research into reinforcement fine-tuning with hybrid reward signals (deterministic + LLM-as-judge)
## Limitations
- Trained on a Mistral-7B base; larger models would likely benefit more from the RFT signal
- Claude judge scoring adds cost and latency to the training pipeline
- Best-of-N with N=4 is a relatively small candidate pool; larger N would likely improve quality further
- Complex multi-file architectures may not be fully coherent