Qwen3-4B-InstruBPM
Qwen3-4B-InstruBPM is a compact, instruction-tuned language model that converts natural-language business process descriptions into BPMN models rendered in Graphviz DOT. It is a LoRA adaptation of Qwen/Qwen3-4B-Instruct-2507, trained on a cleaned, stratified subset of the MaD dataset for the paper:
InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation. Gökberk Çelikmasat, Atay Özgövde, Fatma Başak Aydemir. Software and Systems Modeling, under review, 2026. arXiv: 2512.12063
On a 180-instance benchmark stratified by difficulty across 15 business domains, this model attains near-perfect structural fidelity (R-GED Accuracy ≈ 99.4%) and matches or outperforms both untuned open-weight baselines (Qwen2.5 7/14B, Qwen3 30B, Qwen3-Coder) and strong proprietary systems (GPT-5.1, Claude 4.5 Sonnet/Haiku, Gemini 2.5 Pro/Flash) on BLEU, ROUGE-L, and METEOR — at roughly half the parameter count of our prior tuned model.
Results
Evaluation on the 180-instance stratified benchmark (paper Table 2). Higher is better on all four metrics.
| Model | BLEU | ROUGE-L | METEOR | R-GED Acc. |
|---|---|---|---|---|
| Qwen3-4B-InstruBPM (this model) | 83.06 | 94.43 | 92.82 | 99.44 |
| Gemma2-9B-BPMG-IT (prior work) | 82.98 | 94.61 | 92.67 | 97.78 |
| Qwen3-Coder-30B-A3B-Instruct | 8.06 | 43.00 | 45.07 | 38.21 |
| Qwen3-30B-A3B-Instruct-2507 | 6.66 | 42.28 | 44.79 | 38.68 |
| Qwen3-4B-Instruct-2507 (base) | 2.89 | 40.31 | 44.16 | 44.47 |
| Gemini 2.5 Pro | 28.72 | 48.98 | 63.66 | 43.58 |
| Claude 4.5 Sonnet | 22.56 | 49.87 | 61.37 | 41.47 |
| Claude 4.5 Haiku | 18.15 | 46.69 | 58.21 | 35.91 |
| Gemini 2.5 Flash | 15.24 | 47.18 | 57.69 | 30.07 |
| GPT-5.1 | 12.64 | 48.83 | 59.01 | 40.95 |
Per-domain R-GED Accuracy is 100% in 14 of 15 domains (paper Table 3). Friedman tests with Kendall's W between 0.65 and 0.81 and bootstrap confidence intervals confirm these differences are statistically significant (paper Appendix A).
Intended use
Generate first-draft BPMN models from textual process descriptions to accelerate early-stage modeling. In expert review, the outputs were judged to be usable with modest post-editing and to follow BPMN best practices for model size, explicit gateways, split/join consistency, and process orientation (paper §6.2, BEBoP verification).
The model is intended as an assistant for business process modelers and analysts, not as a fully autonomous replacement for manual modeling. Human review is recommended, particularly for gateway logic and activity labels in ambiguous descriptions.
Supported BPMN subset
The model generates BPMN process fragments in DOT notation covering: start events, end events, tasks (activities), sequence flows, and AND/XOR gateways (splits and joins). It does not currently generate pools, lanes, message flows, data objects, intermediate/boundary events, sub-processes, or annotations.
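For illustration, a DOT fragment in this subset might look like the following (a hypothetical sketch of the output style, not a verbatim model output; node names and attributes are our own):

```dot
digraph BusinessProcess {
  start [label="Start", shape=circle];
  submit [label="Submit application"];
  review [label="Review application"];
  xor_split [label="XOR", shape=diamond];
  disburse [label="Disburse loan"];
  reject [label="Send rejection letter"];
  end [label="End", shape=circle];

  start -> submit -> review -> xor_split;
  xor_split -> disburse;
  xor_split -> reject;
  disburse -> end;
  reject -> end;
}
```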
How to use
With transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "gcelikmasat-work/Qwen3_4B_BPMN_IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

instruction = (
    "You are an expert in BPMN modeling and DOT language. Your task is to "
    "convert detailed textual descriptions of business processes into accurate "
    "BPMN model codes written in DOT language. Label all nodes with their "
    "activity names. Represent all connections between nodes without labeling "
    "the connections. Represent each node and its connections accurately, "
    "ensuring all decision points and flows are included and connected. "
    "Now, generate BPMN business process model code in DOT language for the "
    "following textual description of a business process: "
)
description = (
    "The process begins when the customer submits an application. After submission, "
    "the application is reviewed by the credit officer. If the application is approved, "
    "the loan is disbursed. Otherwise, a rejection letter is sent. The process ends."
)

messages = [{"role": "user", "content": instruction + description}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, top_p=1.0, do_sample=True)

dot_code = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(dot_code)
```
With vLLM (recommended for batched inference)
```python
from vllm import LLM, SamplingParams

# `instruction` and `description` as defined in the transformers example above
llm = LLM(model="gcelikmasat-work/Qwen3_4B_BPMN_IT", max_model_len=2048)
params = SamplingParams(temperature=0.1, top_p=1.0, max_tokens=2048)
outputs = llm.chat([[{"role": "user", "content": instruction + description}]], params)
print(outputs[0].outputs[0].text)
```
The generated DOT can be rendered with Graphviz:
```shell
dot -Tpng process.dot -o process.png
```
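Before rendering, it can help to sanity-check the generated DOT, since language models occasionally emit truncated or unbalanced output. A minimal stdlib-only check (our illustrative sketch, not part of the released code; a full parser such as Graphviz itself is the authoritative validator):

```python
import re

def looks_like_valid_dot(dot_code: str) -> bool:
    """Cheap structural checks on generated DOT: a digraph header,
    balanced braces, and at least one edge. Not a full parser."""
    if not re.search(r"\bdigraph\b", dot_code):
        return False
    if dot_code.count("{") != dot_code.count("}") or dot_code.count("{") == 0:
        return False
    return "->" in dot_code

print(looks_like_valid_dot("digraph G { a -> b; b -> c; }"))  # True
print(looks_like_valid_dot("digraph G {"))                    # False
```

Outputs that fail such a check can simply be regenerated before attempting to render them.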
Training
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Training framework | LLaMA-Factory |
| Adapter | LoRA, all target modules |
| LoRA rank r | 16 |
| LoRA α | 32 |
| LoRA dropout | 0.05 |
| Precision | bf16 |
| Cutoff length | 2048 tokens |
| Batch size (per device) | 16 |
| Gradient accumulation steps | 2 |
| Epochs | 1 (≈670 optimizer steps) |
| Learning rate | 2 × 10⁻⁴ |
| LR schedule / warmup ratio | cosine / 0.05 |
| Optimizer | AdamW (torch) |
| FlashAttention / Liger | FA2 / enabled |
| Hardware | 2 × NVIDIA L40S (48 GB) |
| Wall-clock | ≈150 minutes |
| Decoding at inference | temperature=0.1, top_p=1.0, max_tokens=2048 |
Training data. 21.5k cleaned instruction–input–output triples from the MaD dataset, split 80/10/10 for train/validation/test. Filtering removed malformed DOT, duplicate processes, disconnected components, and descriptions exceeding 2048 tokens. The full splits are available at gcelikmasat-work/BPMN-IT-Dataset.
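The 80/10/10 partitioning described above can be sketched as follows (an illustrative stdlib snippet with a fixed seed; for reproduction, use the released splits at gcelikmasat-work/BPMN-IT-Dataset rather than re-splitting):

```python
import random

def split_dataset(items, seed=42, train=0.8, val=0.1):
    """Shuffle with a fixed seed, then cut into train/validation/test."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = split_dataset(list(range(21500)))
print(len(train_set), len(val_set), len(test_set))  # 17200 2150 2150
```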
Deployment variants
This repository hosts the merged BF16 checkpoint. Two related collections provide variants for deployment trade-offs discussed in the paper:
- GGUF quantizations (paper Table 5) — Q2 through Q8 via HQQ/llama.cpp. Mid-precision (Q5–Q8) preserves near-BF16 quality with roughly half the memory footprint: Qwen3-4b-Different-Quantization-GGUF.
- Merge-time α variants (paper Table 6) — α ∈ {8, 16, 32, 64} applied during LoRA merge, holding rank at 16. Mid-range α (16–32) gives the best accuracy; α=32 is the default in this checkpoint: Qwen3-4b-Different-Alpha.
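For context, the merge-time α controls how strongly the adapter update is scaled when it is folded into the base weights: under the standard LoRA convention the merged weight is W = W0 + (α/r)·B·A. A minimal pure-Python sketch of that scaling (our illustration of the usual convention, not the exact LLaMA-Factory merge internals; the tiny matrices are made up):

```python
def merge_lora(w0, b, a, alpha, r):
    """Fold a LoRA update into a base weight matrix:
    W = W0 + (alpha / r) * (B @ A). Plain nested lists for clarity."""
    scale = alpha / r
    merged = [row[:] for row in w0]
    for i in range(len(w0)):
        for j in range(len(w0[0])):
            update = sum(b[i][k] * a[k][j] for k in range(r))
            merged[i][j] += scale * update
    return merged

# Toy example with rank r = 2; alpha/r = 2.0, the same ratio as the
# checkpoint's alpha = 32 with r = 16.
w0 = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.0, 1.0]]   # shape (2, r)
a = [[0.5, 0.0], [0.0, 0.5]]   # shape (r, 2)
print(merge_lora(w0, b, a, alpha=4, r=2))  # [[2.0, 0.0], [0.0, 2.0]]
```

Doubling α at a fixed rank doubles the adapter's contribution, which is why overly large merge-time α degrades accuracy while the mid-range values reported in Table 6 perform best.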
Limitations
- Scope. Generates the control-flow slice of BPMN (tasks, events, sequence flows, AND/XOR gateways). Does not yet handle pools, lanes, message flows, data objects, or sub-processes.
- Language. Trained on English only.
- Domain shift. Evaluated on a stratified 180-instance held-out benchmark from the MaD dataset. Generalization to enterprise documentation with different terminology or structure is not fully established.
- Label quality. Expert reviewers occasionally observed overly generic activity labels when input descriptions were vague, and BEBoP verification found gaps in default-flow and XOR-label coverage (paper §6.2, Table 8).
- Semantic equivalence. High structural similarity (R-GED) does not guarantee semantic equivalence — two structurally identical graphs can differ in intent when descriptions are underspecified.
Citation
@article{celikmasat2026instrubpm,
title = {InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation},
author = {{\c{C}}elikmasat, G{\"o}kberk and {\"O}zg{\"o}vde, Atay and Aydemir, Fatma Ba{\c{s}}ak},
journal = {Software and Systems Modeling},
year = {2026},
note = {Under review. arXiv:2512.12063},
url = {https://arxiv.org/abs/2512.12063}
}
Please also cite the source dataset:
@inproceedings{li2023mad,
title = {{MaD}: A Dataset for Interview-based {BPM} in Business Process Management},
author = {Li, Xiang and Ni, Lijuan and Li, Ran and Liu, Jiafei and Zhang, Ming},
booktitle = {2023 International Joint Conference on Neural Networks (IJCNN)},
pages = {1--8},
year = {2023},
publisher = {IEEE}
}
License
Apache 2.0, inherited from the base model (Qwen/Qwen3-4B-Instruct-2507). The training data is distributed separately under the terms of the MaD dataset.
Acknowledgements
This work builds on our prior instruction-tuning effort on Gemma2-9B (Çelikmasat et al., PROFES 2025), available at gcelikmasat-work/gemma-2-9b-it-BPMN. We thank the authors of the MaD dataset for making their resource publicly available.