Qwen3-4B-InstruBPM

Qwen3-4B-InstruBPM is a compact, instruction-tuned language model that converts natural-language business process descriptions into BPMN models rendered in Graphviz DOT. It is a LoRA adaptation of Qwen/Qwen3-4B-Instruct-2507, trained on a cleaned, stratified subset of the MaD dataset for the paper:

InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation. Gökberk Çelikmasat, Atay Özgövde, Fatma Başak Aydemir. Software and Systems Modeling, under review, 2026. arXiv: 2512.12063

On a 180-instance benchmark stratified by difficulty across 15 business domains, this model attains near-perfect structural fidelity (R-GED Accuracy ≈ 99.4%) and matches or outperforms both untuned open-weight baselines (Qwen2.5 7/14B, Qwen3 30B, Qwen3-Coder) and strong proprietary systems (GPT-5.1, Claude 4.5 Sonnet/Haiku, Gemini 2.5 Pro/Flash) on BLEU, ROUGE-L, and METEOR — at roughly half the parameter count of our prior tuned model.

Results

Evaluation on the 180-instance stratified benchmark (paper Table 2). Higher is better on all four metrics.

| Model | BLEU | ROUGE-L | METEOR | R-GED Acc. |
|---|---:|---:|---:|---:|
| Qwen3-4B-InstruBPM (this model) | 83.06 | 94.43 | 92.82 | 99.44 |
| Gemma2-9B-BPMG-IT (prior work) | 82.98 | 94.61 | 92.67 | 97.78 |
| Qwen3-Coder-30B-A3B-Instruct | 8.06 | 43.00 | 45.07 | 38.21 |
| Qwen3-30B-A3B-Instruct-2507 | 6.66 | 42.28 | 44.79 | 38.68 |
| Qwen3-4B-Instruct-2507 (base) | 2.89 | 40.31 | 44.16 | 44.47 |
| Gemini 2.5 Pro | 28.72 | 48.98 | 63.66 | 43.58 |
| Claude 4.5 Sonnet | 22.56 | 49.87 | 61.37 | 41.47 |
| Claude 4.5 Haiku | 18.15 | 46.69 | 58.21 | 35.91 |
| Gemini 2.5 Flash | 15.24 | 47.18 | 57.69 | 30.07 |
| GPT-5.1 | 12.64 | 48.83 | 59.01 | 40.95 |

Per-domain R-GED Accuracy is 100% in 14 of 15 domains (paper Table 3). Friedman tests with Kendall's W between 0.65 and 0.81 and bootstrap confidence intervals confirm these differences are statistically significant (paper Appendix A).
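
The reported Friedman tests and Kendall's W are computed over per-instance metric scores across models. A minimal numpy-only sketch of that computation, using made-up scores (not the paper's data):

```python
import numpy as np

# Illustrative per-instance BLEU scores for 3 models on 5 benchmark
# instances (made-up numbers, NOT the paper's data).
scores = np.array([
    [0.83, 0.28, 0.12],
    [0.85, 0.30, 0.10],
    [0.80, 0.25, 0.15],
    [0.88, 0.27, 0.11],
    [0.82, 0.29, 0.13],
])  # rows = instances (blocks), columns = models (treatments)
n, k = scores.shape

# Rank models within each instance (1 = worst, k = best; no ties here).
ranks = scores.argsort(axis=1).argsort(axis=1) + 1
rank_sums = ranks.sum(axis=0)

# Friedman chi-squared statistic and Kendall's W effect size.
chi2 = 12.0 / (n * k * (k + 1)) * (rank_sums ** 2).sum() - 3 * n * (k + 1)
w = chi2 / (n * (k - 1))  # W in [0, 1]; 1 = identical ranking on every instance
print(f"Friedman chi2={chi2:.2f}, Kendall's W={w:.2f}")
```

With these toy scores every instance ranks the models identically, so W comes out at its maximum of 1.0; the paper's observed W of 0.65-0.81 indicates strong but not perfect agreement across instances.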

Intended use

Generate first-draft BPMN models from textual process descriptions to accelerate early-stage modeling. In expert review, the outputs were judged to be usable with modest post-editing and to follow BPMN best practices for model size, explicit gateways, split/join consistency, and process orientation (paper §6.2, BEBoP verification).

The model is intended as an assistant for business process modelers and analysts, not as a fully autonomous replacement for manual modeling. Human review is recommended, particularly for gateway logic and activity labels in ambiguous descriptions.

Supported BPMN subset

The model generates BPMN process fragments in DOT notation covering: start events, end events, tasks (activities), sequence flows, and AND/XOR gateways (splits and joins). It does not currently generate pools, lanes, message flows, data objects, intermediate/boundary events, sub-processes, or annotations.
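
As a sanity check on outputs, a hypothetical helper can flag constructs outside this subset. The marker list below is an assumption about how out-of-subset elements typically surface in DOT (pools/lanes as `subgraph cluster_...`, message flows as dashed edges), not a property of the model:

```python
# Hypothetical markers for out-of-subset BPMN constructs in DOT
# (assumed conventions: pools/lanes -> subgraph clusters, message flows -> dashed edges).
UNSUPPORTED_MARKERS = ("subgraph", "cluster", "style=dashed")

def unsupported_constructs(dot_code: str) -> list[str]:
    """Return the markers found in the DOT source, if any."""
    lowered = dot_code.lower()
    return [m for m in UNSUPPORTED_MARKERS if m in lowered]

# In-subset example: labeled nodes, unlabeled sequence flows, one XOR gateway.
dot = """digraph BPMN {
  start [label="Start"];
  review [label="Review application"];
  xor1 [label="XOR Split"];
  disburse [label="Disburse loan"];
  reject [label="Send rejection letter"];
  end [label="End"];
  start -> review -> xor1;
  xor1 -> disburse;
  xor1 -> reject;
  disburse -> end;
  reject -> end;
}"""
print(unsupported_constructs(dot))  # -> []
```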

How to use

With transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "gcelikmasat-work/Qwen3_4B_BPMN_IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

instruction = (
    "You are an expert in BPMN modeling and DOT language. Your task is to "
    "convert detailed textual descriptions of business processes into accurate "
    "BPMN model codes written in DOT language. Label all nodes with their "
    "activity names. Represent all connections between nodes without labeling "
    "the connections. Represent each node and its connections accurately, "
    "ensuring all decision points and flows are included and connected. "
    "Now, generate BPMN business process model code in DOT language for the "
    "following textual description of a business process: "
)

description = (
    "The process begins when the customer submits an application. After submission, "
    "the application is reviewed by the credit officer. If the application is approved, "
    "the loan is disbursed. Otherwise, a rejection letter is sent. The process ends."
)

messages = [{"role": "user", "content": instruction + description}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, top_p=1.0, do_sample=True)

dot_code = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(dot_code)
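
Chat models sometimes wrap code in markdown fences or surround it with prose. A small helper (an assumption about possible output formatting, not guaranteed behavior of this model) can isolate the digraph block before rendering:

```python
import re

def extract_dot(text):
    """Pull the first digraph block out of model output, which may be
    wrapped in markdown fences or explanatory prose. Returns None if
    no digraph is found."""
    match = re.search(r"digraph\b.*?\{.*\}", text, flags=re.DOTALL)
    return match.group(0) if match else None

sample = "Here is the model:\n```dot\ndigraph G { a -> b; }\n```"
print(extract_dot(sample))  # -> digraph G { a -> b; }
```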

With vLLM (recommended for batched inference)

from vllm import LLM, SamplingParams

# Reuses `instruction` and `description` from the transformers example above.
llm = LLM(model="gcelikmasat-work/Qwen3_4B_BPMN_IT", max_model_len=2048)
params = SamplingParams(temperature=0.1, top_p=1.0, max_tokens=2048)
outputs = llm.chat([[{"role": "user", "content": instruction + description}]], params)
print(outputs[0].outputs[0].text)

The generated DOT can be rendered with Graphviz:

dot -Tpng process.dot -o process.png
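
The same invocation can be scripted from Python, assuming the Graphviz `dot` binary is on PATH:

```python
import subprocess

def dot_command(src, out):
    # Mirrors the CLI invocation above: dot -Tpng process.dot -o process.png
    return ["dot", "-Tpng", src, "-o", out]

def render_dot(dot_code, src="process.dot", out="process.png"):
    """Write the generated DOT to disk and render it with Graphviz
    (assumes `dot` is installed and on PATH)."""
    with open(src, "w") as f:
        f.write(dot_code)
    subprocess.run(dot_command(src, out), check=True)
```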

Training

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Training framework | LLaMA-Factory |
| Adapter | LoRA, all target modules |
| LoRA rank r | 16 |
| LoRA α | 32 |
| LoRA dropout | 0.05 |
| Precision | bf16 |
| Cutoff length | 2048 tokens |
| Batch size (per device) | 16 |
| Gradient accumulation steps | 2 |
| Epochs | 1 (≈670 optimizer steps) |
| Learning rate | 2 × 10⁻⁴ |
| LR schedule / warmup ratio | cosine / 0.05 |
| Optimizer | AdamW (torch) |
| FlashAttention / Liger | FA2 / enabled |
| Hardware | 2 × NVIDIA L40S (48 GB) |
| Wall-clock | ≈150 minutes |
| Decoding at inference | temperature=0.1, top_p=1.0, max_tokens=2048 |

Training data. 21.5k cleaned instruction–input–output triples from the MaD dataset, split 80/10/10 for train/validation/test. Filtering removed malformed DOT, duplicate processes, disconnected components, and descriptions exceeding 2048 tokens. The full splits are available at gcelikmasat-work/BPMN-IT-Dataset.
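
A rough sketch of such a filter (illustrative only; the actual cleaning pipeline used stricter checks — the real tokenizer for length and graph-connectivity analysis for disconnected components, whereas here whitespace tokens and balanced braces stand in):

```python
def keep_example(dot_code, description, seen_hashes, max_tokens=2048):
    """Illustrative keep/drop predicate mirroring the cleaning criteria:
    well-formed DOT, no duplicates, length under the cutoff.
    (Connectivity checking is omitted in this sketch.)"""
    # Balanced braces as a cheap proxy for well-formed DOT.
    if dot_code.count("{") == 0 or dot_code.count("{") != dot_code.count("}"):
        return False  # malformed DOT
    h = hash(dot_code)
    if h in seen_hashes:
        return False  # duplicate process
    if len(description.split()) > max_tokens:
        return False  # description too long (whitespace-token proxy)
    seen_hashes.add(h)
    return True

seen = set()
print(keep_example("digraph { a -> b; }", "Customer submits an application.", seen))  # -> True
```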

Deployment variants

This repository hosts the merged BF16 checkpoint. Two related collections provide variants for deployment trade-offs discussed in the paper:

  • GGUF quantizations (paper Table 5) — Q2 through Q8 via HQQ/llama.cpp. Mid-precision (Q5–Q8) preserves near-BF16 quality with roughly half the memory footprint: Qwen3-4b-Different-Quantization-GGUF.
  • Merge-time α variants (paper Table 6) — α ∈ {8, 16, 32, 64} applied during LoRA merge, holding rank at 16. Mid-range α (16–32) gives the best accuracy; α=32 is the default in this checkpoint: Qwen3-4b-Different-Alpha.
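
At merge time, LoRA folds the adapter into the base weights as W' = W + (α/r)·B·A, so the α variants simply rescale the adapter's contribution relative to this checkpoint's default. A toy numpy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))   # base weight (toy size, not the real model)
B = rng.normal(size=(64, 16))   # LoRA factors at rank r = 16
A = rng.normal(size=(16, 64))

def merge(W, A, B, alpha, r=16):
    # Merge-time scaling: higher alpha amplifies the adapter's contribution.
    return W + (alpha / r) * (B @ A)

W16 = merge(W, A, B, alpha=16)  # scale 1.0
W32 = merge(W, A, B, alpha=32)  # scale 2.0 (this checkpoint's default alpha)
delta16, delta32 = W16 - W, W32 - W
print(np.allclose(delta32, 2 * delta16))  # doubling alpha doubles the delta
```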

Limitations

  • Scope. Generates the control-flow slice of BPMN (tasks, events, sequence flows, AND/XOR gateways). Does not yet handle pools, lanes, message flows, data objects, or sub-processes.
  • Language. Trained on English only.
  • Domain shift. Evaluated on a stratified 180-instance held-out benchmark from the MaD dataset. Generalization to enterprise documentation with different terminology or structure is not fully established.
  • Label quality. Expert reviewers occasionally observed overly generic activity labels when input descriptions were vague, and BEBoP verification found gaps in default-flow and XOR-label coverage (paper §6.2, Table 8).
  • Semantic equivalence. High structural similarity (R-GED) does not guarantee semantic equivalence — two structurally identical graphs can differ in intent when descriptions are underspecified.
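
As a rough intuition for the structural metric: when activity labels are unique, graph edit distance reduces to counting node and edge insertions/deletions. The sketch below normalizes that count into a percentage in the spirit of R-GED Accuracy; the normalization is an assumption, and the paper's exact definition may differ:

```python
def simple_ged(nodes_a, edges_a, nodes_b, edges_b):
    """Edit distance for uniquely-labeled graphs: each node or edge present
    in only one graph costs one insert/delete operation."""
    return len(set(nodes_a) ^ set(nodes_b)) + len(set(edges_a) ^ set(edges_b))

def rged_accuracy(nodes_a, edges_a, nodes_b, edges_b):
    # Normalize by the larger graph's size (an assumed normalizer,
    # not necessarily the paper's formula) and express as a percentage.
    ged = simple_ged(nodes_a, edges_a, nodes_b, edges_b)
    size = max(len(nodes_a) + len(edges_a), len(nodes_b) + len(edges_b))
    return 100.0 * (1 - ged / size)

ref = (["Start", "Review", "End"], [("Start", "Review"), ("Review", "End")])
gen = (["Start", "Review", "End"], [("Start", "Review"), ("Review", "End")])
print(rged_accuracy(*ref, *gen))  # identical graphs -> 100.0
```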

Citation

@article{celikmasat2026instrubpm,
  title   = {InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation},
  author  = {{\c{C}}elikmasat, G{\"o}kberk and {\"O}zg{\"o}vde, Atay and Aydemir, Fatma Ba{\c{s}}ak},
  journal = {Software and Systems Modeling},
  year    = {2026},
  note    = {Under review. arXiv:2512.12063},
  url     = {https://arxiv.org/abs/2512.12063}
}

Please also cite the source dataset:

@inproceedings{li2023mad,
  title     = {{MaD}: A Dataset for Interview-based {BPM} in Business Process Management},
  author    = {Li, Xiang and Ni, Lijuan and Li, Ran and Liu, Jiafei and Zhang, Ming},
  booktitle = {2023 International Joint Conference on Neural Networks (IJCNN)},
  pages     = {1--8},
  year      = {2023},
  publisher = {IEEE}
}

License

Apache 2.0, inherited from the base model (Qwen/Qwen3-4B-Instruct-2507). The training data is distributed separately under the terms of the MaD dataset.

Acknowledgements

This work builds on our prior instruction-tuning effort on Gemma2-9B (Çelikmasat et al., PROFES 2025), available at gcelikmasat-work/gemma-2-9b-it-BPMN. We thank the authors of the MaD dataset for making their resource publicly available.
