---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.3
tags:
  - roblox
  - luau
  - code-generation
  - peft
  - lora
  - rft
  - reinforcement-fine-tuning
  - wandb-hackathon
datasets:
  - TorpedoSoftware/the-luau-stack
language:
  - en
pipeline_tag: text-generation
library_name: peft
model-index:
  - name: roblox-luau-mistral-7b-rft
    results:
      - task:
          type: text-generation
          name: Roblox Luau Code Generation
        metrics:
          - name: Syntax Score
            type: custom
            value: 0.95
          - name: API Correctness
            type: custom
            value: 0.93
          - name: Bug-Free Score
            type: custom
            value: 0.91
          - name: Quality Score
            type: custom
            value: 0.88
          - name: Composite Score
            type: custom
            value: 0.92
---

# Roblox Luau Mistral 7B — RFT (Reinforcement Fine-Tuned)

A **reinforcement fine-tuned** LoRA adapter for generating production-ready Roblox Luau scripts. This model builds on the [SFT version](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2) by training on the **best-of-N candidates** selected via a hybrid reward signal combining deterministic code scorers + Claude-as-judge evaluation.

**Part of the Roblox Luau Code Gen project for the W&B Fine-Tuning Hackathon.**

## Why RFT?

Standard SFT trains on static (task, code) pairs. **RFT goes further**: the SFT model generates N candidate solutions per task, a reward function scores each candidate, and only the best are kept for the next round of training. This creates a self-improvement loop where the model learns from its own best outputs.

```
SFT Model → Generate N candidates per task
                ↓
         Score each candidate
         (4 deterministic scorers + Claude judge)
                ↓
         Keep best candidate per task (score ≥ 0.70)
                ↓
         Train on SFT data + best candidates → RFT Model
```
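
The selection step above can be sketched in a few lines of Python. Here `generate` and `score` are hypothetical stand-ins for the SFT model's sampler and the hybrid reward function; this is a sketch of the best-of-N idea, not the project's actual training code:

```python
def best_of_n(tasks, generate, score, n=4, threshold=0.70):
    """Keep the single best-scoring candidate per task, if it clears the threshold."""
    kept = []
    for task in tasks:
        candidates = [generate(task) for _ in range(n)]
        best = max(candidates, key=score)
        if score(best) >= threshold:
            kept.append((task, best))
    return kept
```

The surviving `(task, best)` pairs are then concatenated with the original SFT data for the next training round.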

## Training

### Stage 1: SFT Data Collection
Same as the [SFT model](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2):
- Reverse-labeled [the-luau-stack](https://huggingface.co/datasets/TorpedoSoftware/the-luau-stack) examples
- Claude Sonnet 4.5 gold-standard implementations
- Quality-filtered by 4 deterministic scorers
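
As a rough illustration of what a deterministic scorer can check, here is a toy syntax heuristic. It is hypothetical (the card does not publish the actual scorer implementations) and deliberately simplistic: it ignores Luau constructs like `repeat ... until` and `elseif`:

```python
import re

def syntax_score(code):
    """Toy syntax heuristic: delimiter balance, block/`end` balance, Python-isms."""
    score = 1.0
    # Paired delimiters must balance
    for open_c, close_c in [("(", ")"), ("{", "}"), ("[", "]")]:
        if code.count(open_c) != code.count(close_c):
            score -= 0.3
    # Luau blocks opened by function/then/do are closed by `end`
    openers = len(re.findall(r"\b(function|then|do)\b", code))
    enders = len(re.findall(r"\bend\b", code))
    if openers != enders:
        score -= 0.3
    # Keywords that are valid Python but invalid Luau
    for pythonism in ("def ", "elif ", "import "):
        if pythonism in code:
            score -= 0.2
    return max(score, 0.0)
```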

### Stage 2: Candidate Generation
The SFT model generated **4 candidates per task** for 50 tasks (200 total candidates) using temperature sampling (T=0.8, top_p=0.95).
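
For readers unfamiliar with the sampling settings, this is roughly what T=0.8 with top_p=0.95 does to a single logit vector: the temperature sharpens the distribution, and nucleus sampling draws only from the smallest set of tokens covering 95% of the mass. A self-contained sketch, not the generation code used here:

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling; returns a token index."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]   # T < 1 sharpens, T > 1 flattens
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax numerator
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of highest-probability tokens whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the nucleus and draw one token
    z = sum(probs[i] for i in kept)
    r = rng.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```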

### Stage 3: Hybrid Reward Scoring
Each candidate was scored using a hybrid signal:

| Component | Weight | What it measures |
|---|---|---|
| Syntax scorer | 10% | Bracket/block balance, no Python-isms |
| API scorer | 10% | `GetService()`, no deprecated APIs, valid services |
| Bug scorer | 10% | pcall wrapping, nil checks, yield in loops |
| Quality scorer | 10% | Comments, structure, naming, completeness |
| **Claude judge** | **60%** | Functionality, correctness, completeness (LLM-as-judge) |

**Combined score = deterministic (40%) + Claude judge (60%)**

Only candidates scoring ≥ 0.70 were kept. These best-of-N examples were mixed with the original SFT training data.
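
The weighting above amounts to a simple dot product; a minimal sketch (the component names are illustrative, not taken from the project's code):

```python
# Weights from the scoring table: 4 deterministic scorers at 10% each, judge at 60%
WEIGHTS = {"syntax": 0.10, "api": 0.10, "bugs": 0.10, "quality": 0.10, "judge": 0.60}

def combined_score(scores):
    """Hybrid reward: 40% deterministic scorers + 60% Claude judge."""
    return round(sum(WEIGHTS[k] * scores[k] for k in WEIGHTS), 4)

candidate = {"syntax": 0.95, "api": 0.90, "bugs": 0.85, "quality": 0.80, "judge": 0.75}
reward = combined_score(candidate)  # 0.1 * 3.50 + 0.6 * 0.75 = 0.80
keep = reward >= 0.70               # this candidate survives the filter
```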

### Stage 4: RFT Training

| Parameter | Value |
|---|---|
| Base model | `mistralai/Mistral-7B-Instruct-v0.3` |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0.05 |
| Epochs | 2 |
| Batch size | 1 (×8 gradient accumulation) |
| Learning rate | 1.5e-4 |
| Max sequence length | 8192 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Training data | SFT data + best-of-N RFT candidates |

### Results: SFT → RFT Improvement

| Scorer | SFT | RFT | Delta |
|---|---|---|---|
| Syntax | 0.92 | **0.95** | +0.03 |
| API Correctness | 0.88 | **0.93** | +0.05 |
| Bug-Free | 0.85 | **0.91** | +0.06 |
| Code Quality | 0.82 | **0.88** | +0.06 |
| **Composite** | **0.87** | **0.92** | **+0.05** |

The RFT model shows consistent improvement across all dimensions, with the largest gains in bug-free code and code quality — the areas where the Claude judge provided the most signal beyond the deterministic scorers.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "squaredcuber/roblox-luau-mistral-7b-rft")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "system", "content": "You are an expert Roblox Luau programmer. Generate complete, production-ready Luau scripts. Output only code, no markdown."},
    {"role": "user", "content": "Build a tower defense system with auto-targeting towers, enemy waves, and a path-following system"},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

### With vLLM (recommended for serving)
```bash
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --enable-lora \
    --lora-modules \
        sft=squaredcuber/roblox-luau-mistral-7b-2 \
        rft=squaredcuber/roblox-luau-mistral-7b-rft \
    --max-lora-rank 64
```

## Agentic Pipeline

This model powers the **agentic Roblox Studio assistant** — a self-correcting code generation pipeline:

1. **Generate** — RFT model produces Luau code from a task description
2. **Score** — 4 deterministic scorers evaluate the output in real time
3. **Self-correct** — if the score is < 0.85, the model rewrites the code using the scorer feedback
4. **Insert** — code is sent directly to Roblox Studio via a companion plugin
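
The loop above can be sketched as follows, where `generate`, `score`, and `revise` are hypothetical callables wrapping the model and the deterministic scorers:

```python
def agentic_generate(task, generate, score, revise, threshold=0.85, max_rounds=3):
    """Generate, score, and self-correct until the code clears the threshold."""
    code = generate(task)
    for _ in range(max_rounds):
        result = score(code)  # e.g. {"score": 0.72, "feedback": "wrap network calls in pcall"}
        if result["score"] >= threshold:
            break
        code = revise(task, code, result["feedback"])
    return code
```

Capping the number of rewrite rounds keeps the pipeline from looping on tasks the model cannot fix; still-below-threshold output can then be surfaced with its scorer feedback instead of being inserted.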

## Intended Use

- Generating production-quality Roblox Luau scripts from natural language
- Powering agentic code generation pipelines with self-correction
- Research into reinforcement fine-tuning with hybrid reward signals (deterministic + LLM-as-judge)

## Limitations

- Trained on Mistral-7B base — larger models would likely benefit more from the RFT signal
- Claude judge scoring adds cost and latency to the training pipeline
- Best-of-N with N=4 is a relatively small candidate pool; larger N would likely improve quality further
- Complex multi-file architectures may not be fully coherent