SylloGym — Qwen3-4B GRPO (LoRA adapter)

Fine-tuned from Qwen/Qwen3-4B on the SylloGym legal reasoning environment using GRPO (Group Relative Policy Optimization).

Training: 180 steps on an A100, multi-turn rollout via TRL + Unsloth
Environment: 12 legal domains, 45 tasks, deterministic Python verifiers
Result: +6.1 pp overall accuracy (61.7% → 67.8%), +8.3 pp on 5-turn episodes
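Rewards come from deterministic Python verifiers rather than a learned reward model. A minimal sketch of what such a verifier might look like (the answer-tag format and function name here are illustrative assumptions, not taken from SylloGym):

```python
import re

def verify_syllogism(completion: str, expected: str) -> float:
    """Hypothetical verifier: extract the model's final answer and
    compare it to ground truth. Deterministic: same input, same reward."""
    # Assume the answer is wrapped in <answer>...</answer> tags
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output earns zero reward
    answer = match.group(1).strip().lower()
    return 1.0 if answer == expected.strip().lower() else 0.0
```

Because the verifier is plain code with no model in the loop, the same rollout always receives the same reward, which keeps GRPO's group baselines stable.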

This is a LoRA adapter — load it on top of the base Qwen3-4B model.

Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model first, then apply the LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "farffadet/syllogym-judge-qwen3-4b-grpo")
tokenizer = AutoTokenizer.from_pretrained("farffadet/syllogym-judge-qwen3-4b-grpo")
```
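Once loaded, the adapter-wrapped model generates like any Qwen3 chat model. A minimal sketch (the prompt content and sampling settings are illustrative, not prescribed by this repo):

```python
# Build a chat prompt and generate; assumes `model` and `tokenizer`
# from the loading snippet above
messages = [{"role": "user", "content": "Is this syllogism valid? "
             "All contracts require consideration; this agreement "
             "lacks consideration; therefore it is not a contract."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```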

Environment

The SylloGym environment spans 12 legal domains and 45 tasks, with episodes scored by deterministic Python verifiers.

Training details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B (4-bit, LoRA r=32) |
| Steps | 180 (target: 300) |
| Batch size | 8 |
| Temperature | 0.9 |
| KL penalty (β) | 0.02 |
| Loss type | DR-GRPO |
| max_completion_length | 3072 tokens |
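GRPO needs no value network: it samples a group of completions per prompt and uses the group's own rewards as the baseline. A stdlib sketch of the advantage computation (standard GRPO divides by the group's standard deviation; as I understand it, the DR-GRPO loss variant drops that division, along with a per-token length normalization that lives in the loss and is not shown here):

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], dr_grpo: bool = True) -> list[float]:
    """Group-relative advantages for one prompt's sampled completions.

    Standard GRPO: (r_i - mean) / std.
    DR-GRPO: r_i - mean (no std division)."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if dr_grpo:
        return centered
    std = pstdev(rewards)
    return [c / (std + 1e-8) for c in centered]  # epsilon guards zero-variance groups
```

For a group of binary verifier rewards like `[1.0, 0.0, 0.0, 1.0]`, the DR-GRPO advantages are `[0.5, -0.5, -0.5, 0.5]`: correct completions are pushed up, incorrect ones down, relative to the group.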
