NorMistral-7B-Instruct KL-SFT + Δ-DPO: Norwegian Bokmål Grammar

LoRA adapter for norallm/normistral-7b-warm-instruct, trained with KL-regularised SFT followed by Delta Direct Preference Optimization (Δ-DPO) on Norwegian Bokmål (nb) grammar.

This is the KL-SFT variant: the base instruct model is first fine-tuned with KL-regularised SFT to preserve output diversity, and Δ-DPO is then applied to the merged SFT checkpoint.

Method

SAGA (Symbolic Annotation for Grammar Alignment) uses symbolic NLP parsers as verifiable reward signals for RLHF-style training. We call this approach Reinforcement Learning from Verifiable Feedback (RLVF).

Stage 1: KL-SFT

KL-regularised supervised fine-tuning on Norwegian Wikipedia sentences, keeping only those with a spaCy grammar-quality score of at least 0.65. This stage is meant to preserve output diversity while the model adapts to Norwegian grammar patterns.
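For orientation, one common way to write a KL-regularised SFT objective is shown below. This is a sketch under assumptions, not the formulation used for this adapter: the card does not state the direction of the KL term, the reference policy, or the weight. Here \pi_{\text{ref}} is assumed to be the frozen base instruct model and \lambda is a hypothetical regularisation weight.

```latex
\mathcal{L}_{\text{KL-SFT}}(\theta)
  = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}
      \Big[ \sum_{t} \log \pi_\theta\big(y_t \mid x, y_{<t}\big) \Big]
    + \lambda\, \mathbb{E}_{x\sim\mathcal{D}}
      \Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big]
```

The first term is the usual next-token cross-entropy on the filtered sentences. The second term penalises drift away from the reference distribution, which is the mechanism by which a KL penalty can help keep outputs diverse.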

Stage 2: Δ-DPO

  1. Generation: vLLM generates multiple completions per Norwegian Bokmål Wikipedia prompt with the merged KL-SFT model, at temperatures from 0.7 to 1.1.
  2. Scoring: spaCy nb_core_news_lg scores each completion for grammatical quality.
  3. Pair filtering: Completions are paired into (chosen, rejected) when the score delta is at least 0.25 and the chosen score is at least 0.2.
  4. DPO training: Standard DPO on the filtered preference pairs.
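The pair-filtering step can be sketched in plain Python as follows. Only the two thresholds (delta of at least 0.25, chosen score of at least 0.2) come from this card. The function name, the record layout, the example sentences and scores, and the choice to pair every completion with every lower-scoring one for the same prompt are illustrative assumptions.

```python
from itertools import permutations

# Thresholds quoted in the pair-filtering step of this card.
MIN_DELTA = 0.25
MIN_CHOSEN_SCORE = 0.2


def build_preference_pairs(prompt, scored_completions,
                           min_delta=MIN_DELTA,
                           min_chosen_score=MIN_CHOSEN_SCORE):
    """Return DPO-style records for a single prompt.

    scored_completions is a list of (text, score) tuples. Considering all
    ordered pairs is an assumption; the actual pipeline may pair differently
    (for example, best versus worst only).
    """
    pairs = []
    for (chosen, c_score), (rejected, r_score) in permutations(scored_completions, 2):
        delta = c_score - r_score
        # Keep the pair only if the gap is large enough and the chosen
        # completion clears the minimum quality bar.
        if delta >= min_delta and c_score >= min_chosen_score:
            pairs.append({
                "prompt": prompt,
                "chosen": chosen,
                "rejected": rejected,
                "delta": delta,
            })
    return pairs


# Hypothetical scores, for illustration only.
example = [
    ("Jeg bor i Bergen.", 0.9),
    ("Jeg bo i Bergen.", 0.1),
    ("Jeg bor Bergen i.", -0.5),
]
pairs = build_preference_pairs("Skriv en setning om Bergen.", example)
```

In this example the middle completion never becomes a chosen response: its delta over the weakest completion is large enough, but its own score falls below the minimum chosen score.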

Training details

Hyperparameter | Value
Base model | norallm/normistral-7b-warm-instruct
Method | KL-SFT + Δ-DPO
LoRA rank | 16
LoRA alpha | 32
LoRA dropout | 0.05
Target modules | q, k, v, o, gate, up, down proj
Trainable params | 41.9M / 7.29B (0.58%)
KL-SFT |
SFT training steps | 785
SFT epochs | 5
SFT train loss | 22.81
Δ-DPO |
Preference pairs | 5,886
Mean chosen score | 0.906
Mean rejected score | -0.973
Mean delta | 1.879
DPO training steps | 184
Final DPO loss | 0.091
Final rewards accuracy | 98.1%
Batch size | 8
Gradient accumulation | 8
Reward model | spaCy nb_core_news_lg
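To make the "Final DPO loss" and "Final rewards accuracy" rows easier to interpret, here is a minimal sketch of the standard per-pair DPO loss and of rewards accuracy, meaning the fraction of pairs whose implicit reward is higher for the chosen completion than for the rejected one. The value beta=0.1 is an assumption, since this card does not report beta, and the function names and log-probabilities are hypothetical.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair.

    Inputs are summed sequence log-probabilities under the trained policy
    and the frozen reference model. beta is an assumed value.
    """
    # Implicit rewards: scaled log-ratio between policy and reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward


def rewards_accuracy(batch, beta=0.1):
    """Fraction of pairs where the chosen reward exceeds the rejected reward."""
    correct = 0
    for logps in batch:
        _, chosen_reward, rejected_reward = dpo_pair_loss(*logps, beta=beta)
        correct += chosen_reward > rejected_reward
    return correct / len(batch)


# Hypothetical log-probabilities:
# (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
batch = [(-5.0, -9.0, -6.0, -8.0), (-9.0, -5.0, -8.0, -6.0)]
accuracy = rewards_accuracy(batch)
untrained_loss, _, _ = dpo_pair_loss(-10.0, -10.0, -10.0, -10.0)
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693), so a final loss well below that value indicates the policy separates chosen from rejected completions more strongly than the reference does.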

Usage

This adapter was trained on top of the KL-SFT merged checkpoint. For inference, apply it directly to the base instruct model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base instruct model and its tokenizer.
base_model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-7b-warm-instruct",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm-instruct")

# Attach the LoRA adapter to the base model.
model = PeftModel.from_pretrained(base_model, "acbueff/normistral-7b-instruct-nb-klsft-delta-dpo")

# Build a chat-formatted prompt and generate a completion.
messages = [{"role": "user", "content": "Skriv en kort tekst om Bergen."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, max_new_tokens=150)
# The decoded text contains the prompt followed by the generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Comparison with no-SFT variant

Metric | no-SFT | KL-SFT (this model)
Preference pairs | 2,307 | 5,886
Final DPO loss | 0.218 | 0.091
Rewards accuracy | 95% | 98.1%
Mean chosen score | 0.882 | 0.906

The KL-SFT stage produced about 2.5 times as many preference pairs (5,886 vs. 2,307) with a higher mean chosen score (0.906 vs. 0.882), resulting in lower final DPO loss and higher rewards accuracy.
