NorMistral-7B-Instruct KL-SFT + Δ-DPO: Norwegian Bokmål Grammar

LoRA adapter for norallm/normistral-7b-warm-instruct, trained with KL-regularised SFT followed by Delta Direct Preference Optimization (Δ-DPO) on Norwegian Bokmål (nb) grammar.

This is the KL-SFT variant: the base instruct model is first fine-tuned with KL-regularised SFT to preserve output diversity, then Δ-DPO is applied on the merged SFT checkpoint.

Method

SAGA (Symbolic Annotation for Grammar Alignment) uses symbolic NLP parsers as verifiable reward signals for RLHF-style training — we call this Reinforcement Learning from Verifiable Feedback (RLVF).

Stage 1: KL-SFT

KL-regularised supervised fine-tuning on Norwegian Wikipedia sentences, keeping only sentences whose spaCy grammar-quality score is ≥ 0.65. The KL term is intended to preserve output diversity while the model adapts to Norwegian grammar patterns.
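
The card does not spell out the exact objective, so the following is only a minimal sketch of one common form of KL-regularised SFT: token-level cross-entropy plus a weighted KL penalty that keeps the fine-tuned distribution close to a frozen reference model. The direction of the KL term, the weight `kl_coef`, and all function names are assumptions made for illustration, not the values or code used for this adapter.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_sft_token_loss(policy_logits, ref_logits, target_id, kl_coef=0.1):
    """Cross-entropy on the target token plus a KL penalty towards the reference.

    kl_coef is a hypothetical weight; the card does not report the value used.
    The KL direction, KL(policy || reference), is also an assumption.
    """
    policy = softmax(policy_logits)
    reference = softmax(ref_logits)
    cross_entropy = -math.log(policy[target_id])
    penalty = kl_divergence(policy, reference)
    return cross_entropy + kl_coef * penalty
```

When the policy and the reference agree, the penalty vanishes and the loss reduces to ordinary SFT cross-entropy; the further the policy drifts from the reference, the more the penalty adds.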

Stage 2: Δ-DPO

Generation: vLLM samples multiple completions per Norwegian Bokmål Wikipedia prompt from the merged KL-SFT model at temperatures between 0.7 and 1.1.
Scoring: spaCy nb_core_news_lg scores each completion for grammatical quality.
Pair filtering: completions are paired into (chosen, rejected) by quality delta. A pair is kept when the score difference is ≥ 0.25 and the chosen score is ≥ 0.2.
DPO training: standard DPO on the filtered preference pairs.
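
The pair-filtering step can be sketched as below. The two thresholds come from the description above; everything else is an assumption for illustration, in particular the choice to consider every pairwise combination of completions for a prompt (the card does not say how candidates are paired) and the names `build_preference_pairs`, `MIN_DELTA`, and `MIN_CHOSEN_SCORE`.

```python
from itertools import combinations

MIN_DELTA = 0.25         # minimum quality delta between chosen and rejected
MIN_CHOSEN_SCORE = 0.2   # minimum score the chosen completion must reach

def build_preference_pairs(prompt, scored_completions):
    """Turn (text, score) completions for one prompt into DPO preference pairs.

    Assumption: all pairwise combinations are candidates; the real pipeline
    may pair completions differently (for example best versus worst only).
    """
    pairs = []
    for first, second in combinations(scored_completions, 2):
        # Order the two completions so the higher-scoring one is "chosen".
        (chosen, chosen_score), (rejected, rejected_score) = sorted(
            [first, second], key=lambda item: item[1], reverse=True
        )
        delta = chosen_score - rejected_score
        if delta >= MIN_DELTA and chosen_score >= MIN_CHOSEN_SCORE:
            pairs.append(
                {"prompt": prompt, "chosen": chosen, "rejected": rejected, "delta": delta}
            )
    return pairs
```

For example, with hypothetical scores `[("a", 0.9), ("b", 0.5), ("c", -0.4), ("d", 0.1)]`, the pair (d, c) is dropped even though its delta is 0.5, because the chosen score 0.1 is below the 0.2 floor.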

Training details

Hyperparameter	Value
Base model	`norallm/normistral-7b-warm-instruct`
Method	KL-SFT + Δ-DPO
LoRA rank	16
LoRA alpha	32
LoRA dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable params	41.9M / 7.29B (0.58%)
KL-SFT
SFT training steps	785
SFT epochs	5
SFT train loss	22.81
Δ-DPO
Preference pairs	5,886
Mean chosen score	0.906
Mean rejected score	-0.973
Mean delta	1.879
DPO training steps	184
Final DPO loss	0.091
Final rewards accuracy	98.1%
Batch size	8
Gradient accumulation	8
Reward model	spaCy `nb_core_news_lg`
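
As a sanity check on the trainable-parameter figure, the count can be reproduced from the LoRA rank and the projection shapes. The layer dimensions below are the standard Mistral-7B shapes and are assumed, not taken from this card; each adapted linear layer adds two low-rank matrices with `rank * (in_features + out_features)` parameters in total.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)

# Assumed Mistral-7B dimensions (not stated in the card).
HIDDEN = 4096          # model width
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32
RANK = 16              # LoRA rank from the table above

# (in_features, out_features) for every adapted projection in one layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
share = total / 7.29e9  # total parameter count as reported in the table

print(total)                  # 41943040
print(f"{100 * share:.2f}%")  # 0.58%
```

Under these assumptions the result matches the reported 41.9M trainable parameters and the 0.58% share.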

Usage

This adapter was trained on top of the KL-SFT merged checkpoint. For inference, apply it directly to the base instruct model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "norallm/normistral-7b-warm-instruct"
ADAPTER_ID = "acbueff/normistral-7b-instruct-nb-klsft-delta-dpo"

# Load the base instruct model and its tokenizer.
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Attach the LoRA adapter to the base model.
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)

# Build the chat prompt; return_dict=True also returns the attention mask.
messages = [{"role": "user", "content": "Skriv en kort tekst om Bergen."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=150)
# Decode only the newly generated tokens, not the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Comparison with no-SFT variant

Metric	no-SFT	KL-SFT (this model)
Preference pairs	2,307	5,886
Final DPO loss	0.218	0.091
Rewards accuracy	95%	98.1%
Mean chosen score	0.882	0.906

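With the KL-SFT stage in place, the pipeline yielded about 2.5x as many preference pairs (5,886 vs. 2,307), with a slightly higher mean chosen score (0.906 vs. 0.882). The final DPO loss is also lower (0.091 vs. 0.218) and the rewards accuracy higher (98.1% vs. 95%).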
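
To make the loss and rewards-accuracy figures concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, together with the accuracy metric as it is usually defined (the fraction of pairs whose implicit reward margin is positive). The inputs are sequence log-probabilities under the policy and the frozen reference model; `beta=0.1` is a hypothetical value, since the card does not report the one used, and the function names are illustrative.

```python
import math

def dpo_loss_and_margin(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, plus the implicit reward margin.

    beta is a hypothetical value; the card does not state it.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), written as a numerically stable softplus.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin

def rewards_accuracy(margins):
    """Fraction of pairs where the chosen completion gets the higher reward."""
    return sum(1 for m in margins if m > 0) / len(margins)
```

At initialisation the policy equals the reference, so every margin is 0 and the loss is log 2 ≈ 0.693; a lower loss means the policy assigns a larger margin to the chosen completions.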
