# SylloGym – Qwen3-4B GRPO (LoRA adapter)
Fine-tuned from Qwen/Qwen3-4B on the SylloGym legal reasoning environment using GRPO (Group Relative Policy Optimization).
- Training: 180 steps on an A100, multi-turn rollout via TRL + Unsloth
- Environment: 12 legal domains, 45 tasks, deterministic Python verifiers
- Result: +6.1 pp overall accuracy (61.7% → 67.8%), +8.3 pp on 5-turn episodes
This is a LoRA adapter: load it on top of the base Qwen3-4B model.
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model first, then apply the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "farffadet/syllogym-judge-qwen3-4b-grpo")
tokenizer = AutoTokenizer.from_pretrained("farffadet/syllogym-judge-qwen3-4b-grpo")
```
## Environment
- Repo: eliot-gtn/syllogym
- HF Space: farffadet/syllogym-env
- Blog post: See the repository for the full write-up.
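The environment scores rollouts with deterministic Python verifiers rather than an LLM judge. The actual verifiers live in the `eliot-gtn/syllogym` repo; as a minimal illustration of the idea (all names here are invented, not the repo's API), a verifier can be a pure function that extracts the model's final answer and returns a binary reward:

```python
import re

# Hypothetical sketch of a deterministic verifier in this style: parse the
# model's final "ANSWER:" line and compare it to the expected conclusion.
def verify_conclusion(model_output: str, expected: str) -> float:
    """Return 1.0 if the ANSWER line matches the expected conclusion
    (case- and whitespace-insensitive), else 0.0."""
    match = re.search(r"ANSWER:\s*(.+)", model_output, re.IGNORECASE)
    if match is None:
        return 0.0  # no parseable answer -> zero reward
    answer = " ".join(match.group(1).split()).lower()
    target = " ".join(expected.split()).lower()
    return 1.0 if answer == target else 0.0
```

Because the check is a deterministic string comparison, the same rollout always receives the same reward, which keeps GRPO's group statistics stable.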
## Training details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B (4-bit, LoRA r=32) |
| Steps | 180 (target: 300) |
| Batch size | 8 |
| Temperature | 0.9 |
| KL penalty (β) | 0.02 |
| Loss type | DR-GRPO |
| max_completion_length | 3072 tokens |
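The core of GRPO is that advantages are computed relative to a group of rollouts for the same prompt, with no learned value network: each rollout's reward is baselined against its group's mean. A minimal sketch of that baselining step (the DR-GRPO variant used here additionally drops vanilla GRPO's per-group standard-deviation normalization, so only the mean subtraction is shown):

```python
# Sketch of the group-relative advantage at the heart of GRPO: rewards for
# a group of rollouts sampled from one prompt, baselined by the group mean.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

With the binary verifier rewards above, a group where two of four rollouts succeed yields advantages of +0.5 for the successes and -0.5 for the failures, so the policy gradient pushes probability mass toward the verified completions.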