Qwen3-4B-InstruBPM

Qwen3-4B-InstruBPM is a compact, instruction-tuned language model that converts natural-language business process descriptions into BPMN models rendered in Graphviz DOT. It is a LoRA adaptation of Qwen/Qwen3-4B-Instruct-2507, trained on a cleaned, stratified subset of the MaD dataset for the paper:

InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation. Gökberk Çelikmasat, Atay Özgövde, Fatma Başak Aydemir. Software and Systems Modeling, under review, 2026. arXiv: 2512.12063

On a 180-instance benchmark stratified by difficulty across 15 business domains, this model attains near-perfect structural fidelity (R-GED Accuracy ≈ 99.4%) and matches or outperforms both untuned open-weight baselines (Qwen2.5 7/14B, Qwen3 30B, Qwen3-Coder) and strong proprietary systems (GPT-5.1, Claude 4.5 Sonnet/Haiku, Gemini 2.5 Pro/Flash) on BLEU, ROUGE-L, and METEOR — at roughly half the parameter count of our prior tuned model.

Results

Evaluation on the 180-instance stratified benchmark (paper Table 2). Higher is better on all four metrics.

| Model | BLEU | ROUGE-L | METEOR | R-GED Acc. |
|---|---:|---:|---:|---:|
| Qwen3-4B-InstruBPM (this model) | 83.06 | 94.43 | 92.82 | 99.44 |
| Gemma2-9B-BPMG-IT (prior work) | 82.98 | 94.61 | 92.67 | 97.78 |
| Qwen3-Coder-30B-A3B-Instruct | 8.06 | 43.00 | 45.07 | 38.21 |
| Qwen3-30B-A3B-Instruct-2507 | 6.66 | 42.28 | 44.79 | 38.68 |
| Qwen3-4B-Instruct-2507 (base) | 2.89 | 40.31 | 44.16 | 44.47 |
| Gemini 2.5 Pro | 28.72 | 48.98 | 63.66 | 43.58 |
| Claude 4.5 Sonnet | 22.56 | 49.87 | 61.37 | 41.47 |
| Claude 4.5 Haiku | 18.15 | 46.69 | 58.21 | 35.91 |
| Gemini 2.5 Flash | 15.24 | 47.18 | 57.69 | 30.07 |
| GPT-5.1 | 12.64 | 48.83 | 59.01 | 40.95 |

Per-domain R-GED Accuracy is 100% in 14 of 15 domains (paper Table 3). Friedman tests with Kendall's W between 0.65 and 0.81 and bootstrap confidence intervals confirm these differences are statistically significant (paper Appendix A).
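
The reported Friedman tests and Kendall's W are computed over per-instance metric scores across models. A minimal numpy-only sketch of that computation, using made-up scores (not the paper's data):

```python
import numpy as np

# Illustrative per-instance BLEU scores for 3 models on 5 benchmark
# instances (made-up numbers, NOT the paper's data).
scores = np.array([
    [0.83, 0.28, 0.12],
    [0.85, 0.30, 0.10],
    [0.80, 0.25, 0.15],
    [0.88, 0.27, 0.11],
    [0.82, 0.29, 0.13],
])  # rows = instances (blocks), columns = models (treatments)
n, k = scores.shape

# Rank models within each instance (1 = worst, k = best; no ties here).
ranks = scores.argsort(axis=1).argsort(axis=1) + 1
rank_sums = ranks.sum(axis=0)

# Friedman chi-squared statistic and Kendall's W effect size.
chi2 = 12.0 / (n * k * (k + 1)) * (rank_sums ** 2).sum() - 3 * n * (k + 1)
w = chi2 / (n * (k - 1))  # W in [0, 1]; 1 = identical ranking on every instance
print(f"Friedman chi2={chi2:.2f}, Kendall's W={w:.2f}")
```

With these toy scores every instance ranks the models identically, so W comes out at its maximum of 1.0; the paper's observed W of 0.65-0.81 indicates strong but not perfect agreement across instances.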

Intended use

Generate first-draft BPMN models from textual process descriptions to accelerate early-stage modeling. In expert review, the outputs were judged to be usable with modest post-editing and to follow BPMN best practices for model size, explicit gateways, split/join consistency, and process orientation (paper §6.2, BEBoP verification).

The model is intended as an assistant for business process modelers and analysts, not as a fully autonomous replacement for manual modeling. Human review is recommended, particularly for gateway logic and activity labels in ambiguous descriptions.

Supported BPMN subset

The model generates BPMN process fragments in DOT notation covering: start events, end events, tasks (activities), sequence flows, and AND/XOR gateways (splits and joins). It does not currently generate pools, lanes, message flows, data objects, intermediate/boundary events, sub-processes, or annotations.
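
As a sanity check on outputs, a hypothetical helper can flag constructs outside this subset. The marker list below is an assumption about how out-of-subset elements typically surface in DOT (pools/lanes as `subgraph cluster_...`, message flows as dashed edges), not a property of the model:

```python
# Hypothetical markers for out-of-subset BPMN constructs in DOT
# (assumed conventions: pools/lanes -> subgraph clusters, message flows -> dashed edges).
UNSUPPORTED_MARKERS = ("subgraph", "cluster", "style=dashed")

def unsupported_constructs(dot_code: str) -> list[str]:
    """Return the markers found in the DOT source, if any."""
    lowered = dot_code.lower()
    return [m for m in UNSUPPORTED_MARKERS if m in lowered]

# In-subset example: labeled nodes, unlabeled sequence flows, one XOR gateway.
dot = """digraph BPMN {
  start [label="Start"];
  review [label="Review application"];
  xor1 [label="XOR Split"];
  disburse [label="Disburse loan"];
  reject [label="Send rejection letter"];
  end [label="End"];
  start -> review -> xor1;
  xor1 -> disburse;
  xor1 -> reject;
  disburse -> end;
  reject -> end;
}"""
print(unsupported_constructs(dot))  # -> []
```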

How to use

With transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "gcelikmasat-work/Qwen3_4B_BPMN_IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

instruction = (
    "You are an expert in BPMN modeling and DOT language. Your task is to "
    "convert detailed textual descriptions of business processes into accurate "
    "BPMN model codes written in DOT language. Label all nodes with their "
    "activity names. Represent all connections between nodes without labeling "
    "the connections. Represent each node and its connections accurately, "
    "ensuring all decision points and flows are included and connected. "
    "Now, generate BPMN business process model code in DOT language for the "
    "following textual description of a business process: "
)

description = (
    "The process begins when the customer submits an application. After submission, "
    "the application is reviewed by the credit officer. If the application is approved, "
    "the loan is disbursed. Otherwise, a rejection letter is sent. The process ends."
)

messages = [{"role": "user", "content": instruction + description}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, top_p=1.0, do_sample=True)

dot_code = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(dot_code)
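
Chat models sometimes wrap code in markdown fences or surround it with prose. A small helper (an assumption about possible output formatting, not guaranteed behavior of this model) can isolate the digraph block before rendering:

```python
import re

def extract_dot(text):
    """Pull the first digraph block out of model output, which may be
    wrapped in markdown fences or explanatory prose. Returns None if
    no digraph is found."""
    match = re.search(r"digraph\b.*?\{.*\}", text, flags=re.DOTALL)
    return match.group(0) if match else None

sample = "Here is the model:\n```dot\ndigraph G { a -> b; }\n```"
print(extract_dot(sample))  # -> digraph G { a -> b; }
```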

With vLLM (recommended for batched inference)

from vllm import LLM, SamplingParams

# Reuses `instruction` and `description` from the transformers example above.
llm = LLM(model="gcelikmasat-work/Qwen3_4B_BPMN_IT", max_model_len=2048)
params = SamplingParams(temperature=0.1, top_p=1.0, max_tokens=2048)
outputs = llm.chat([[{"role": "user", "content": instruction + description}]], params)
print(outputs[0].outputs[0].text)

The generated DOT can be rendered with Graphviz:

dot -Tpng process.dot -o process.png
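
The same invocation can be scripted from Python, assuming the Graphviz `dot` binary is on PATH:

```python
import subprocess

def dot_command(src, out):
    # Mirrors the CLI invocation above: dot -Tpng process.dot -o process.png
    return ["dot", "-Tpng", src, "-o", out]

def render_dot(dot_code, src="process.dot", out="process.png"):
    """Write the generated DOT to disk and render it with Graphviz
    (assumes `dot` is installed and on PATH)."""
    with open(src, "w") as f:
        f.write(dot_code)
    subprocess.run(dot_command(src, out), check=True)
```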

Training

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Training framework | LLaMA-Factory |
| Adapter | LoRA, all target modules |
| LoRA rank r | 16 |
| LoRA α | 32 |
| LoRA dropout | 0.05 |
| Precision | bf16 |
| Cutoff length | 2048 tokens |
| Batch size (per device) | 16 |
| Gradient accumulation steps | 2 |
| Epochs | 1 (≈670 optimizer steps) |
| Learning rate | 2 × 10⁻⁴ |
| LR schedule / warmup ratio | cosine / 0.05 |
| Optimizer | AdamW (torch) |
| FlashAttention / Liger | FA2 / enabled |
| Hardware | 2 × NVIDIA L40S (48 GB) |
| Wall-clock | ≈150 minutes |
| Decoding at inference | temperature=0.1, top_p=1.0, max_tokens=2048 |

Training data. 21.5k cleaned instruction–input–output triples from the MaD dataset, split 80/10/10 for train/validation/test. Filtering removed malformed DOT, duplicate processes, disconnected components, and descriptions exceeding 2048 tokens. The full splits are available at gcelikmasat-work/BPMN-IT-Dataset.
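
A rough sketch of such a filter (illustrative only; the actual cleaning pipeline used stricter checks — the real tokenizer for length and graph-connectivity analysis for disconnected components, whereas here whitespace tokens and balanced braces stand in):

```python
def keep_example(dot_code, description, seen_hashes, max_tokens=2048):
    """Illustrative keep/drop predicate mirroring the cleaning criteria:
    well-formed DOT, no duplicates, length under the cutoff.
    (Connectivity checking is omitted in this sketch.)"""
    # Balanced braces as a cheap proxy for well-formed DOT.
    if dot_code.count("{") == 0 or dot_code.count("{") != dot_code.count("}"):
        return False  # malformed DOT
    h = hash(dot_code)
    if h in seen_hashes:
        return False  # duplicate process
    if len(description.split()) > max_tokens:
        return False  # description too long (whitespace-token proxy)
    seen_hashes.add(h)
    return True

seen = set()
print(keep_example("digraph { a -> b; }", "Customer submits an application.", seen))  # -> True
```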

Deployment variants

This repository hosts the merged BF16 checkpoint. Two related collections provide variants for deployment trade-offs discussed in the paper:

  • GGUF quantizations (paper Table 5) — Q2 through Q8 via HQQ/llama.cpp. Mid-precision (Q5–Q8) preserves near-BF16 quality with roughly half the memory footprint: Qwen3-4b-Different-Quantization-GGUF.
  • Merge-time α variants (paper Table 6) — α ∈ {8, 16, 32, 64} applied during LoRA merge, holding rank at 16. Mid-range α (16–32) gives the best accuracy; α=32 is the default in this checkpoint: Qwen3-4b-Different-Alpha.
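
At merge time, LoRA folds the adapter into the base weights as W' = W + (α/r)·B·A, so the α variants simply rescale the adapter's contribution relative to this checkpoint's default. A toy numpy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))   # base weight (toy size, not the real model)
B = rng.normal(size=(64, 16))   # LoRA factors at rank r = 16
A = rng.normal(size=(16, 64))

def merge(W, A, B, alpha, r=16):
    # Merge-time scaling: higher alpha amplifies the adapter's contribution.
    return W + (alpha / r) * (B @ A)

W16 = merge(W, A, B, alpha=16)  # scale 1.0
W32 = merge(W, A, B, alpha=32)  # scale 2.0 (this checkpoint's default alpha)
delta16, delta32 = W16 - W, W32 - W
print(np.allclose(delta32, 2 * delta16))  # doubling alpha doubles the delta
```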

Limitations

  • Scope. Generates the control-flow slice of BPMN (tasks, events, sequence flows, AND/XOR gateways). Does not yet handle pools, lanes, message flows, data objects, or sub-processes.
  • Language. Trained on English only.
  • Domain shift. Evaluated on a stratified 180-instance held-out benchmark from the MaD dataset. Generalization to enterprise documentation with different terminology or structure is not fully established.
  • Label quality. Expert reviewers occasionally observed overly generic activity labels when input descriptions were vague, and BEBoP verification found gaps in default-flow and XOR-label coverage (paper §6.2, Table 8).
  • Semantic equivalence. High structural similarity (R-GED) does not guarantee semantic equivalence — two structurally identical graphs can differ in intent when descriptions are underspecified.
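
As a rough intuition for the structural metric: when activity labels are unique, graph edit distance reduces to counting node and edge insertions/deletions. The sketch below normalizes that count into a percentage in the spirit of R-GED Accuracy; the normalization is an assumption, and the paper's exact definition may differ:

```python
def simple_ged(nodes_a, edges_a, nodes_b, edges_b):
    """Edit distance for uniquely-labeled graphs: each node or edge present
    in only one graph costs one insert/delete operation."""
    return len(set(nodes_a) ^ set(nodes_b)) + len(set(edges_a) ^ set(edges_b))

def rged_accuracy(nodes_a, edges_a, nodes_b, edges_b):
    # Normalize by the larger graph's size (an assumed normalizer,
    # not necessarily the paper's formula) and express as a percentage.
    ged = simple_ged(nodes_a, edges_a, nodes_b, edges_b)
    size = max(len(nodes_a) + len(edges_a), len(nodes_b) + len(edges_b))
    return 100.0 * (1 - ged / size)

ref = (["Start", "Review", "End"], [("Start", "Review"), ("Review", "End")])
gen = (["Start", "Review", "End"], [("Start", "Review"), ("Review", "End")])
print(rged_accuracy(*ref, *gen))  # identical graphs -> 100.0
```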

Citation

@article{celikmasat2026instrubpm,
  title   = {InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation},
  author  = {{\c{C}}elikmasat, G{\"o}kberk and {\"O}zg{\"o}vde, Atay and Aydemir, Fatma Ba{\c{s}}ak},
  journal = {Software and Systems Modeling},
  year    = {2026},
  note    = {Under review. arXiv:2512.12063},
  url     = {https://arxiv.org/abs/2512.12063}
}

Please also cite the source dataset:

@inproceedings{li2023mad,
  title     = {{MaD}: A Dataset for Interview-based {BPM} in Business Process Management},
  author    = {Li, Xiang and Ni, Lijuan and Li, Ran and Liu, Jiafei and Zhang, Ming},
  booktitle = {2023 International Joint Conference on Neural Networks (IJCNN)},
  pages     = {1--8},
  year      = {2023},
  publisher = {IEEE}
}

License

Apache 2.0, inherited from the base model (Qwen/Qwen3-4B-Instruct-2507). The training data is distributed separately under the terms of the MaD dataset.

Acknowledgements

This work builds on our prior instruction-tuning effort on Gemma2-9B (Çelikmasat et al., PROFES 2025), available at gcelikmasat-work/gemma-2-9b-it-BPMN. We thank the authors of the MaD dataset for making their resource publicly available.
