NorMistral-7B-Instruct KL-SFT + Δ-DPO: Norwegian Bokmål Grammar
LoRA adapter for norallm/normistral-7b-warm-instruct trained with KL-regularised SFT followed by Delta Direct Preference Optimization (Δ-DPO) on Norwegian Bokmål (nb) grammar.
This is the KL-SFT variant: the base instruct model is first fine-tuned with KL-regularised SFT to preserve output diversity, then Δ-DPO is applied on the merged SFT checkpoint.
Method
SAGA (Symbolic Annotation for Grammar Alignment) uses symbolic NLP parsers as verifiable reward signals for RLHF-style training — we call this Reinforcement Learning from Verifiable Feedback (RLVF).
Stage 1: KL-SFT
KL-regularised supervised fine-tuning on Norwegian Wikipedia sentences, keeping only sentences whose SpaCy grammar quality score is at least 0.65. The KL regularisation is meant to preserve output diversity while the model adapts to Norwegian grammar patterns.
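The card does not give the exact form of the KL-regularised objective, so the following is only a minimal sketch of one common formulation: per-token cross-entropy on the target plus a weighted KL divergence between the fine-tuned (policy) distribution and the frozen reference distribution. The function names, the KL direction (policy relative to reference), and the `kl_coef` value are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_sft_token_loss(policy_logits, ref_logits, target_id, kl_coef=0.1):
    """Return (total, cross_entropy, kl) for a single token position.

    total = CE(policy, target) + kl_coef * KL(policy || reference)
    kl_coef is a hypothetical value, not taken from the model card.
    """
    p = softmax(policy_logits)
    q = softmax(ref_logits)
    ce = -math.log(p[target_id])
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return ce + kl_coef * kl, ce, kl

# Toy vocabulary of four tokens; the target is token 0.
ref = [2.0, 1.0, 0.5, 0.1]
same = kl_sft_token_loss(ref, ref, target_id=0)
drifted = kl_sft_token_loss([6.0, 0.0, 0.0, 0.0], ref, target_id=0)
print(same)
print(drifted)
```

In this toy case the policy that collapses onto the target token gets a lower cross-entropy but pays a KL penalty, which is the trade-off a KL term is usually meant to enforce.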
Stage 2: Δ-DPO
- Generation: vLLM generates multiple completions per Wikipedia NB prompt using the KL-SFT merged model at varied temperatures (0.7–1.1)
- Scoring: SpaCy nb_core_news_lg scores each completion for grammatical quality
- Pair filtering: Completions are paired into (chosen, rejected) by quality delta (threshold ≥ 0.25, min chosen score ≥ 0.2)
- DPO training: Standard DPO on the filtered preference pairs
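The pair-filtering step can be sketched as follows. The two thresholds come from the list above. Everything else is an assumption for illustration: the card does not say whether all qualifying pairs per prompt are kept or only one, and `build_pairs` and the example scores are hypothetical.

```python
from itertools import combinations

MIN_DELTA = 0.25   # quality delta threshold stated in the card
MIN_CHOSEN = 0.2   # minimum chosen score stated in the card

def build_pairs(prompt, scored, min_delta=MIN_DELTA, min_chosen=MIN_CHOSEN):
    """Build (chosen, rejected) records from scored completions of one prompt.

    scored: list of (completion_text, quality_score) tuples.
    Assumption: every pair of completions that passes both thresholds is kept.
    """
    pairs = []
    for a, b in combinations(scored, 2):
        # The higher-scoring completion becomes the "chosen" side.
        chosen, rejected = sorted([a, b], key=lambda item: item[1], reverse=True)
        delta = chosen[1] - rejected[1]
        if chosen[1] >= min_chosen and delta >= min_delta:
            pairs.append({
                "prompt": prompt,
                "chosen": chosen[0],
                "rejected": rejected[0],
                "delta": delta,
            })
    return pairs

# Hypothetical scores for four completions of the same prompt.
scored = [("A", 0.9), ("B", 0.7), ("C", -0.5), ("D", 0.1)]
pairs = build_pairs("Skriv en kort tekst om Bergen.", scored)
for pair in pairs:
    print(pair["chosen"], pair["rejected"], round(pair["delta"], 2))
```

Here A and B are too close to form a pair (delta 0.2), and D never appears as the chosen side because its score is below 0.2.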
Training details
| Hyperparameter | Value |
|---|---|
| Base model | norallm/normistral-7b-warm-instruct |
| Method | KL-SFT + Δ-DPO |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q, k, v, o, gate, up, down proj |
| Trainable params | 41.9M / 7.29B (0.58%) |
| KL-SFT | |
| SFT training steps | 785 |
| SFT epochs | 5 |
| SFT train loss | 22.81 |
| Δ-DPO | |
| Preference pairs | 5,886 |
| Mean chosen score | 0.906 |
| Mean rejected score | -0.973 |
| Mean delta | 1.879 |
| DPO training steps | 184 |
| Final DPO loss | 0.091 |
| Final rewards accuracy | 98.1% |
| Batch size | 8 |
| Gradient accumulation | 8 |
| Reward model | SpaCy nb_core_news_lg |
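As a rough sanity check on the table, the trainable parameter count and the number of DPO steps can be reproduced with simple arithmetic. This sketch assumes standard Mistral-7B layer dimensions (hidden size 4096, key/value dimension 1024, MLP size 14336, 32 layers), which this card does not state, and it assumes the batch size and gradient accumulation rows apply to the DPO stage on a single device.

```python
# Assumed Mistral-7B-style dimensions (not stated in this card).
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
INTERMEDIATE = 14336
NUM_LAYERS = 32
RANK = 16              # LoRA rank from the table

# (in_features, out_features) of each targeted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank, shapes, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(RANK, PROJECTIONS, NUM_LAYERS)
print(trainable)                             # 41943040, i.e. about 41.9M
print(round(100 * trainable / 7.29e9, 2))    # 0.58 (percent of 7.29B)

# Effective batch = batch size x gradient accumulation (single device assumed).
effective_batch = 8 * 8
samples_seen = 184 * effective_batch
print(samples_seen)                          # 11776
print(round(samples_seen / 5886, 2))         # 2.0 passes over the preference pairs
```

Under these assumptions the numbers line up with the table: 41.9M trainable parameters, and 184 steps correspond to roughly two passes over the 5,886 preference pairs. The epoch count is an inference from the arithmetic, not a figure reported in the card.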
Usage
This adapter was trained on top of the KL-SFT merged checkpoint. For inference, apply it directly to the base instruct model:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-7b-warm-instruct",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm-instruct")
model = PeftModel.from_pretrained(base_model, "acbueff/normistral-7b-instruct-nb-klsft-delta-dpo")

messages = [{"role": "user", "content": "Skriv en kort tekst om Bergen."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Comparison with no-SFT variant
| Metric | no-SFT | KL-SFT (this model) |
|---|---|---|
| Preference pairs | 2,307 | 5,886 |
| Final DPO loss | 0.218 | 0.091 |
| Rewards accuracy | 95% | 98.1% |
| Mean chosen score | 0.882 | 0.906 |
With the KL-SFT stage in place, the pipeline yielded about 2.5x as many preference pairs (5,886 vs. 2,307) with a slightly higher mean chosen score (0.906 vs. 0.882), and DPO training ended with a lower final loss and higher rewards accuracy.
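For readers interpreting the loss and rewards accuracy figures, the sketch below shows how these two quantities are conventionally defined in standard DPO. It is an illustration under assumptions, not the training code behind this model: the `beta` value and the log-probabilities are hypothetical, and the function names are made up for this example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed sequence log-probabilities under the policy and the
    frozen reference model. beta is a hypothetical value.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    loss = math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))
    return loss, chosen_reward, rejected_reward

def rewards_accuracy(results):
    """Fraction of pairs whose implicit chosen reward exceeds the rejected one."""
    hits = sum(1 for _, chosen_r, rejected_r in results if chosen_r > rejected_r)
    return hits / len(results)

# Hypothetical log-probabilities for three preference pairs.
results = [
    dpo_loss(-40.0, -70.0, -50.0, -55.0),   # policy prefers chosen more than the reference does
    dpo_loss(-50.0, -55.0, -50.0, -55.0),   # policy identical to reference: margin is 0
    dpo_loss(-60.0, -50.0, -50.0, -55.0),   # policy prefers the rejected completion
]
for loss, chosen_r, rejected_r in results:
    print(round(loss, 3), round(chosen_r, 2), round(rejected_r, 2))
print(rewards_accuracy(results))
```

A pair where the policy matches the reference gives a loss of log 2 (about 0.693), so a final loss well below that value indicates that the policy separates chosen from rejected completions more strongly than the reference does.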