Qwen3-4B Agent Trajectory DPO Adapter

This repository provides a LoRA adapter for Qwen/Qwen3-4B-Instruct-2507, fine-tuned with DPO (Direct Preference Optimization).

The adapter is trained on top of an existing SFT adapter:

  • SFT start adapter: kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign

This repository contains LoRA adapter weights only. The base model must be loaded separately.

Training Objective

This adapter is optimized to improve multi-step agent behavior in ALFWorld-style environments by learning from pairwise preferences:

  • Each training example is a multi-turn messages history (system + trajectory history + user observation).
  • For each history, we have:
    • "chosen": a more desirable next action
    • "rejected": a less desirable next action
  • The DPO loss encourages the model to give higher preference to the chosen action than to the rejected action for the same history.
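For intuition, the pairwise DPO objective can be sketched in plain Python. The log-probability values and `beta` below are invented for illustration; in practice a trainer such as TRL's DPOTrainer computes these quantities from the policy and a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Numerically this equals softplus(-beta * margin).
    return math.log(1.0 + math.exp(-beta * margin))

# The loss shrinks as the policy prefers the chosen action
# more strongly than the reference model does.
low = dpo_loss(-5.0, -9.0, -6.0, -6.0)   # policy widened the margin
high = dpo_loss(-6.0, -6.0, -6.0, -6.0)  # policy is indifferent
assert low < high
```

Training then reduces to minimizing this loss over (chosen, rejected) pairs sharing the same conversation history.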

The main targeted failure modes include:

  1. Action loops: repeatedly selecting exactly the same action without making progress.

  2. Premature task success: outputting "ACTION: task succeeded" before the goal is actually achieved.

  3. Irrelevant exploration: searching in locations that are implausible for the target object (for example, looking for a kitchen tool in a bathroom sink).
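The first failure mode is easy to picture with a small history check. This is an illustrative sketch only; a helper like `is_action_loop` is not part of the training code:

```python
def is_action_loop(action_history, window=3):
    """Flag an action loop: the last `window` actions are all identical."""
    if len(action_history) < window:
        return False
    tail = action_history[-window:]
    return len(set(tail)) == 1

# A stuck trajectory repeats the same action verbatim.
assert is_action_loop(["open drawer 1", "open drawer 1", "open drawer 1"])
assert not is_action_loop(["go to drawer 1", "open drawer 1", "close drawer 1"])
```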

Training Configuration

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Start adapter (SFT): kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign
  • DPO dataset: kuririrn/alfworld_roundwise_dpo_v1
  • Method: LoRA (adapter on top of SFT-merged base)
  • Max sequence length during DPO: 2048
  • DPO epochs: 1
  • DPO learning rate: 8e-07
  • LoRA r / alpha / dropout: 64 / 128 / 0.0
  • Frameworks: Unsloth + TRL (DPOTrainer)

Note: DPO training is performed on top of a merged SFT model. Details of the SFT training and data are documented in the model card of kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign.

Datasets

SFT starting point

The starting point for this DPO adapter is an SFT-trained adapter:

  • kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign

Please refer to its model card for detailed information about the supervised fine-tuning data and configuration.

DPO dataset

  • kuririrn/alfworld_roundwise_dpo_v1

This dataset provides round-wise DPO samples with fields:

  • "messages": system + multi-turn history (trajectory)
  • "chosen": more desirable next action
  • "rejected": less desirable next action
  • "reason": categorical label describing why "rejected" is undesirable (for example, "loop", "task_succeeded_negative", "implausible_location").

The DPO loss is computed only on the next turn (chosen vs rejected) conditioned on the same messages history.
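Put together, a single round-wise record could look like the following. The field values here are invented for illustration; refer to the dataset card for real samples:

```python
example = {
    "messages": [
        {"role": "system", "content": "You are a household agent in ALFWorld. Reply with one action per turn."},
        {"role": "user", "content": "You see a countertop 1. Your task: put an apple in the fridge."},
        {"role": "assistant", "content": "ACTION: go to countertop 1"},
        {"role": "user", "content": "You arrive at countertop 1. You see an apple 1."},
    ],
    "chosen": "ACTION: take apple 1 from countertop 1",
    "rejected": "ACTION: go to countertop 1",  # repeats the previous action (a loop)
    "reason": "loop",
}

# The loss compares "chosen" vs "rejected" only, conditioned on "messages".
assert {"messages", "chosen", "rejected", "reason"} <= set(example)
```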

Prompting and Output Format

The model is used as an environment-interacting agent. A typical prompt is constructed as a chat-style conversation:

  • system: explains the agent role and task
  • user: provides the latest environment observation and instructions
  • assistant: responds with a single next action in a canonical format

Actions are emitted as a single line:

  • Use the "ACTION:" tag (uppercase).
  • Follow with a structured action string compatible with the environment.

Examples:

ACTION: go to countertop 1
ACTION: take apple 1 from countertop 1
ACTION: open drawer 2
ACTION: task succeeded
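A minimal regex check for this one-line format might look like the following. The pattern is an illustrative guess at the action grammar, not the environment's official validator:

```python
import re

# "ACTION:" tag followed by a lowercase verb phrase, e.g. "go to countertop 1".
ACTION_RE = re.compile(r"^ACTION: [a-z][a-z0-9 /]*$")

def is_canonical_action(line):
    """Return True if `line` matches the canonical single-line action format."""
    return bool(ACTION_RE.match(line.strip()))

assert is_canonical_action("ACTION: go to countertop 1")
assert is_canonical_action("ACTION: task succeeded")
assert not is_canonical_action("Act: go to countertop 1")  # legacy tag, rejected
```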

Internally, the DPO data normalizes historical tags such as "Act:" and "Think:" to "ACTION:" and "THOUGHT:" to align with the evaluation environment and reduce errors caused by format mismatch.
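That normalization can be sketched as a simple prefix rewrite. This is a hedged illustration; the actual preprocessing script may differ:

```python
def normalize_tags(text):
    """Map legacy 'Act:'/'Think:' prefixes to the canonical 'ACTION:'/'THOUGHT:' tags."""
    replacements = {"Act:": "ACTION:", "Think:": "THOUGHT:"}
    lines = []
    for line in text.splitlines():
        for old, new in replacements.items():
            if line.lstrip().startswith(old):
                line = line.replace(old, new, 1)
                break
        lines.append(line)
    return "\n".join(lines)

assert normalize_tags("Act: open drawer 2") == "ACTION: open drawer 2"
assert normalize_tags("Think: the apple is in the fridge") == "THOUGHT: the apple is in the fridge"
```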

Usage

Below is a minimal example of loading the adapter on top of the base model using transformers and peft:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen3-4B-Instruct-2507"
adapter = "your_username/your_dpo_repo"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)

# Now you can generate actions conditioned on the environment history.

Sources and Terms

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • SFT start adapter: kuririrn/qwen3-4b-agent-trajectory_alf_admissible-lora-constraint_gen-dist_allign
  • DPO dataset: kuririrn/alfworld_roundwise_dpo_v1

Please make sure that your usage of this adapter and any associated dataset complies with:

  • The license of the base model
  • The license and terms of the underlying SFT training data
  • The license and terms of the DPO dataset

If you build your own DPO dataset or modify the training setup, please update this README accordingly to reflect the changes.
