---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.3
tags:
- roblox
- luau
- code-generation
- peft
- lora
- rft
- reinforcement-fine-tuning
- wandb-hackathon
datasets:
- TorpedoSoftware/the-luau-stack
language:
- en
pipeline_tag: text-generation
library_name: peft
model-index:
- name: roblox-luau-mistral-7b-rft
  results:
  - task:
      type: text-generation
      name: Roblox Luau Code Generation
    metrics:
    - name: Syntax Score
      type: custom
      value: 0.95
    - name: API Correctness
      type: custom
      value: 0.93
    - name: Bug-Free Score
      type: custom
      value: 0.91
    - name: Quality Score
      type: custom
      value: 0.88
    - name: Composite Score
      type: custom
      value: 0.92
---

# Roblox Luau Mistral 7B — RFT (Reinforcement Fine-Tuned)

A **reinforcement fine-tuned** LoRA adapter for generating production-ready Roblox Luau scripts. This model builds on the [SFT version](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2) by training on the **best-of-N candidates** selected via a hybrid reward signal that combines deterministic code scorers with Claude-as-judge evaluation.

**Part of the Roblox Luau Code Gen project for the W&B Fine-Tuning Hackathon.**

## Why RFT?

Standard SFT trains on static (task, code) pairs. **RFT goes further**: the SFT model generates N candidate solutions per task, a reward function scores each candidate, and only the best are kept for the next round of training. This creates a self-improvement loop in which the model learns from its own best outputs.
```
SFT Model → Generate N candidates per task
    ↓
Score each candidate (4 deterministic scorers + Claude judge)
    ↓
Keep best candidate per task (score ≥ 0.70)
    ↓
Train on SFT data + best candidates → RFT Model
```

## Training

### Stage 1: SFT Data Collection

Same as the [SFT model](https://huggingface.co/squaredcuber/roblox-luau-mistral-7b-2):

- Reverse-labeled [the-luau-stack](https://huggingface.co/datasets/TorpedoSoftware/the-luau-stack) examples
- Claude Sonnet 4.5 gold-standard implementations
- Quality-filtered by 4 deterministic scorers

### Stage 2: Candidate Generation

The SFT model generated **4 candidates per task** for 50 tasks (200 candidates total) using temperature sampling (T=0.8, top_p=0.95).

### Stage 3: Hybrid Reward Scoring

Each candidate was scored using a hybrid signal:

| Component | Weight | What it measures |
|---|---|---|
| Syntax scorer | 10% | Bracket/block balance, no Python-isms |
| API scorer | 10% | `GetService()`, no deprecated APIs, valid services |
| Bug scorer | 10% | `pcall` wrapping, nil checks, yields in loops |
| Quality scorer | 10% | Comments, structure, naming, completeness |
| **Claude judge** | **60%** | Functionality, correctness, completeness (LLM-as-judge) |

**Combined score = deterministic (40%) + Claude judge (60%)**

Only candidates scoring ≥ 0.70 were kept. These best-of-N examples were mixed with the original SFT training data.
### Stage 4: RFT Training

| Parameter | Value |
|---|---|
| Base model | `mistralai/Mistral-7B-Instruct-v0.3` |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0.05 |
| Epochs | 2 |
| Batch size | 1 (×8 gradient accumulation) |
| Learning rate | 1.5e-4 |
| Max sequence length | 8192 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Training data | SFT data + best-of-N RFT candidates |

### Results: SFT → RFT Improvement

| Scorer | SFT | RFT | Delta |
|---|---|---|---|
| Syntax | 0.92 | **0.95** | +0.03 |
| API Correctness | 0.88 | **0.93** | +0.05 |
| Bug-Free | 0.85 | **0.91** | +0.06 |
| Code Quality | 0.82 | **0.88** | +0.06 |
| **Composite** | **0.87** | **0.92** | **+0.05** |

The RFT model shows consistent improvement across all dimensions, with the largest gains in bug-free code and code quality — the areas where the Claude judge provided the most signal beyond the deterministic scorers.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "squaredcuber/roblox-luau-mistral-7b-rft")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [
    {"role": "system", "content": "You are an expert Roblox Luau programmer. Generate complete, production-ready Luau scripts. Output only code, no markdown."},
    {"role": "user", "content": "Build a tower defense system with auto-targeting towers, enemy waves, and a path-following system"},
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

### With vLLM (recommended for serving)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-lora \
  --lora-modules \
    sft=squaredcuber/roblox-luau-mistral-7b-2 \
    rft=squaredcuber/roblox-luau-mistral-7b-rft \
  --max-lora-rank 64
```

## Agentic Pipeline

This model powers the **agentic Roblox Studio assistant** — a self-correcting code generation pipeline:

1. **Generate** — the RFT model produces Luau code from a task description
2. **Score** — 4 deterministic scorers evaluate the output in real time
3. **Self-correct** — if the score is < 0.85, the model rewrites the code using the scorer feedback
4. **Insert** — code is sent directly to Roblox Studio via a companion plugin

## Intended Use

- Generating production-quality Roblox Luau scripts from natural language
- Powering agentic code generation pipelines with self-correction
- Research into reinforcement fine-tuning with hybrid reward signals (deterministic + LLM-as-judge)

## Limitations

- Trained on a Mistral-7B base — larger models would likely benefit more from the RFT signal
- Claude judge scoring adds cost and latency to the training pipeline
- Best-of-N with N=4 is a relatively small candidate pool; a larger N would likely improve quality further
- Complex multi-file architectures may not be fully coherent
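The Generate → Score → Self-correct loop from the Agentic Pipeline section can be sketched as below. The `generate` and `score` callables and the feedback prompt format are hypothetical stand-ins for the model call and the composite of the deterministic scorers; only the 0.85 threshold comes from the pipeline description.

```python
# Minimal sketch of the self-correcting generation loop. The generate/score
# callables are hypothetical; only the 0.85 threshold is from the pipeline.
from typing import Callable

def self_correct(
    task: str,
    generate: Callable[[str], str],   # stand-in for a call into the RFT model
    score: Callable[[str], float],    # stand-in for the combined deterministic scorers
    threshold: float = 0.85,
    max_rounds: int = 3,
) -> tuple[str, float]:
    """Regenerate with scorer feedback until the code clears the threshold."""
    code = generate(task)
    s = score(code)
    for _ in range(max_rounds):
        if s >= threshold:
            break
        # Feed the score back as a rewrite instruction (hypothetical prompt format)
        prompt = f"{task}\n\nPrevious attempt scored {s:.2f}. Rewrite and fix the issues."
        code = generate(prompt)
        s = score(code)
    return code, s

# Toy usage: a fake "model" whose second attempt clears the threshold
attempts = iter(["-- buggy draft", "-- fixed script"])
scores = {"-- buggy draft": 0.6, "-- fixed script": 0.9}
code, s = self_correct("Make a part spin", lambda p: next(attempts), lambda c: scores[c])
```

Capping the number of rounds keeps latency bounded when a task never clears the threshold; the real pipeline would then surface the best attempt along with its score.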