# Qwen3-4B Agent Trajectory DPO Adapter
This repository provides a LoRA adapter fine-tuned from Qwen/Qwen3-4B-Instruct-2507 with DPO (Direct Preference Optimization).
The adapter is trained on top of an existing SFT adapter:
- SFT start adapter: kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign
This repository contains LoRA adapter weights only. The base model must be loaded separately.
## Training Objective
This adapter is optimized to improve multi-step agent behavior in ALFWorld-style environments by learning from pairwise preferences:
- Each training example is a multi-turn "messages" history (system + trajectory history + user observation).
- For each history, the dataset provides:
  - "chosen": a more desirable next action
  - "rejected": a less desirable next action
- The DPO loss encourages the model to prefer the chosen action over the rejected action given the same history.
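Concretely, for one (history, chosen, rejected) triple the pairwise DPO objective can be sketched as follows. This is an illustrative helper, not code from the training pipeline: the log-probabilities would come from the policy and a frozen reference model, and the `beta` value shown is an assumption.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss for a single (chosen, rejected) pair.

    Each argument is the summed log-probability of the chosen or rejected
    next action under the policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # action over the reference, minus the same quantity for the rejected action.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy ranks chosen above rejected.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy widens the gap in favor of "chosen":
print(round(dpo_loss(-2.0, -5.0, -3.0, -3.0), 4))  # -> 0.5544
```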
The main targeted failure modes are:
- **Action loops**: repeatedly selecting exactly the same action without making progress.
- **Premature task success**: outputting "ACTION: task succeeded" before the goal is actually achieved.
- **Irrelevant exploration**: searching locations that are implausible for the target object (for example, looking for a kitchen tool in a bathroom sink).
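The first of these failure modes is easy to detect heuristically. As a minimal sketch (the helper name and window size are illustrative, not part of this repository's code), a loop can be flagged when the trailing actions in a trajectory are identical:

```python
def is_action_loop(history, window=3):
    """Flag a trajectory when the same action repeats `window` times in a row."""
    if len(history) < window:
        return False
    tail = history[-window:]
    # True only if every action in the trailing window is identical.
    return all(a == tail[0] for a in tail)

print(is_action_loop(["go to drawer 1", "open drawer 1",
                      "open drawer 1", "open drawer 1"]))  # -> True
print(is_action_loop(["go to drawer 1", "open drawer 1",
                      "take apple 1 from drawer 1"]))      # -> False
```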
## Training Configuration
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Start adapter (SFT): kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign
- DPO dataset: kuririrn/alfworld_roundwise_dpo_v1
- Method: LoRA (adapter on top of SFT-merged base)
- Max sequence length during DPO: 2048
- DPO epochs: 1
- DPO learning rate: 8e-07
- LoRA r / alpha / dropout: 64 / 128 / 0.0
- Frameworks: Unsloth + TRL (DPOTrainer)
Note: DPO training is performed on top of a merged SFT model. Details of the SFT training and data are documented in the model card of kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign.
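Under these settings, the DPO stage could be wired up with TRL roughly as follows. This is a sketch under assumptions noted in the comments (`beta`, the output path, and the `model`/`tokenizer`/`train_dataset` variables are not specified above), not the exact training script:

```python
from trl import DPOConfig, DPOTrainer

config = DPOConfig(
    output_dir="qwen3-4b-agent-trajectory-dpo",  # hypothetical path
    beta=0.1,               # assumption: beta is not stated in the table above
    learning_rate=8e-7,
    num_train_epochs=1,
    max_length=2048,
)
trainer = DPOTrainer(
    model=model,                  # SFT-merged base with a fresh LoRA adapter
    ref_model=None,               # with a PEFT model, TRL uses the frozen base as reference
    args=config,
    train_dataset=train_dataset,  # kuririrn/alfworld_roundwise_dpo_v1
    processing_class=tokenizer,
)
trainer.train()
```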
## Datasets
### SFT starting point
The starting point for this DPO adapter is an SFT-trained adapter:
- kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign
Please refer to its model card for detailed information about the supervised fine-tuning data and configuration.
### DPO dataset
- kuririrn/alfworld_roundwise_dpo_v1
This dataset provides round-wise DPO samples with fields:
- "messages": system + multi-turn history (trajectory)
- "chosen": more desirable next action
- "rejected": less desirable next action
- "reason": categorical label describing why "rejected" is undesirable (for example, "loop", "task_succeeded_negative", "implausible_location").
The DPO loss is computed only on the next turn (chosen vs rejected) conditioned on the same messages history.
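A record with these fields might look as follows. The contents are illustrative stand-ins, not an actual sample from the dataset, and the validation helper is hypothetical:

```python
# Schematic record matching the documented fields (values are illustrative):
example = {
    "messages": [
        {"role": "system", "content": "You are an agent in an ALFWorld household."},
        {"role": "user", "content": "You are in the kitchen. You see countertop 1."},
    ],
    "chosen": "ACTION: go to countertop 1",
    "rejected": "ACTION: task succeeded",  # claims success before the goal is met
    "reason": "task_succeeded_negative",
}

def validate_record(rec):
    """Check that a record carries all documented DPO fields."""
    required = {"messages", "chosen", "rejected", "reason"}
    return required.issubset(rec) and isinstance(rec["messages"], list)

print(validate_record(example))  # -> True
```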
## Prompting and Output Format
The model is used as an environment-interacting agent. A typical prompt is constructed as a chat-style conversation:
- system: explains the agent role and task
- user: provides the latest environment observation and instructions
- assistant: responds with a single next action in a canonical format
Actions are emitted as a single line:
- Use the "ACTION:" tag (uppercase).
- Follow with a structured action string compatible with the environment.
Examples:

```text
ACTION: go to countertop 1
ACTION: take apple 1 from countertop 1
ACTION: open drawer 2
ACTION: task succeeded
```
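An evaluation harness can pull the action string out of a generation with a simple line-anchored pattern match. This regex-based helper is a sketch, not code shipped with the adapter:

```python
import re

ACTION_RE = re.compile(r"^ACTION:\s*(.+)$", re.MULTILINE)

def extract_action(text):
    """Return the payload of the first ACTION line, or None if absent."""
    m = ACTION_RE.search(text)
    return m.group(1).strip() if m else None

print(extract_action("THOUGHT: the apple should be nearby.\n"
                     "ACTION: go to countertop 1"))  # -> go to countertop 1
```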
Internally, the DPO data normalizes historical tags such as "Act:" and "Think:" to "ACTION:" and "THOUGHT:" to align with the evaluation environment and reduce errors caused by format mismatch.
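That normalization can be sketched as a pair of line-anchored substitutions. The helper below is illustrative; the actual preprocessing code is not included in this repository:

```python
import re

def normalize_tags(text):
    """Map legacy 'Act:'/'Think:' tags to the canonical uppercase forms."""
    text = re.sub(r"(?m)^Act:", "ACTION:", text)
    text = re.sub(r"(?m)^Think:", "THOUGHT:", text)
    return text

print(normalize_tags("Think: check the drawer\nAct: open drawer 2"))
# -> THOUGHT: check the drawer
#    ACTION: open drawer 2
```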
## Usage
Below is a minimal example of loading the adapter on top of the base model with transformers and peft, then generating a next action:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen3-4B-Instruct-2507"
adapter = "your_username/your_dpo_repo"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)

# Generate the next action conditioned on the environment history
# (the message contents below are illustrative).
messages = [
    {"role": "system", "content": "You are an agent in an ALFWorld household."},
    {"role": "user", "content": "You are in the kitchen. Your task: find an apple."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Sources and Terms
- Base model: Qwen/Qwen3-4B-Instruct-2507
- SFT start adapter: kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign
- DPO dataset: kuririrn/alfworld_roundwise_dpo_v1
Please make sure that your usage of this adapter and any associated dataset complies with:
- The license of the base model
- The license and terms of the underlying SFT training data
- The license and terms of the DPO dataset
If you build your own DPO dataset or modify the training setup, please update this README accordingly to reflect the changes.