gcelikmasat-work committed ccaca95 (verified), parent 5ebf202: Update README.md

Files changed (1): README.md +215 -3
---
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
datasets:
- gcelikmasat-work/BPMN-IT-Dataset
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- bpmn
- business-process-modeling
- process-modeling
- instruction-tuning
- lora
- peft
- dot
- graphviz
- llama-factory
- qwen3
model-index:
- name: Qwen3-4B-InstruBPM
  results:
  - task:
      type: text-generation
      name: BPMN Model Generation from Text
    dataset:
      type: gcelikmasat-work/BPMN-IT-Dataset
      name: BPMN-IT (stratified 180-instance benchmark across 15 business domains, seed split)
    metrics:
    - type: bleu
      value: 83.06
      name: BLEU
    - type: rouge
      value: 94.43
      name: ROUGE-L
    - type: meteor
      value: 92.82
      name: METEOR
    - type: relative-graph-edit-distance
      value: 99.44
      name: R-GED Accuracy (%)
---

# Qwen3-4B-InstruBPM

Qwen3-4B-InstruBPM is a compact, instruction-tuned language model that converts natural-language business process descriptions into BPMN models written in [Graphviz DOT](https://graphviz.org/doc/info/lang.html). It is a LoRA adaptation of [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), trained on a cleaned, stratified subset of the [MaD dataset](https://ieeexplore.ieee.org/abstract/document/10191898) for the paper:

> **InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation.**
> Gökberk Çelikmasat, Atay Özgövde, Fatma Başak Aydemir. *Software and Systems Modeling*, under review, 2026.
> arXiv: [2512.12063](https://arxiv.org/abs/2512.12063)

On a 180-instance benchmark stratified by difficulty across 15 business domains, this model attains near-perfect structural fidelity (R-GED Accuracy ≈ 99.4%). On BLEU, ROUGE-L, and METEOR it far outperforms both untuned open-weight baselines (Qwen2.5 7/14B, Qwen3 30B, Qwen3-Coder) and strong proprietary systems (GPT-5.1, Claude 4.5 Sonnet/Haiku, Gemini 2.5 Pro/Flash), and it matches our prior tuned model, Gemma2-9B-BPMG-IT, with roughly half as many parameters.

## Results

Evaluation on the 180-instance stratified benchmark (paper Table 2). Higher is better on all four metrics.

| Model | BLEU | ROUGE-L | METEOR | R-GED Acc. |
| -------------------------------------- | ----: | ------: | -----: | ---------: |
| **Qwen3-4B-InstruBPM** (this model) | **83.06** | **94.43** | **92.82** | **99.44** |
| [Gemma2-9B-BPMG-IT](https://huggingface.co/gcelikmasat-work/gemma-2-9b-it-BPMN) (prior work) | 82.98 | 94.61 | 92.67 | 97.78 |
| Qwen3-Coder-30B-A3B-Instruct | 8.06 | 43.00 | 45.07 | 38.21 |
| Qwen3-30B-A3B-Instruct-2507 | 6.66 | 42.28 | 44.79 | 38.68 |
| Qwen3-4B-Instruct-2507 (base) | 2.89 | 40.31 | 44.16 | 44.47 |
| Gemini 2.5 Pro | 28.72 | 48.98 | 63.66 | 43.58 |
| Claude 4.5 Sonnet | 22.56 | 49.87 | 61.37 | 41.47 |
| Claude 4.5 Haiku | 18.15 | 46.69 | 58.21 | 35.91 |
| Gemini 2.5 Flash | 15.24 | 47.18 | 57.69 | 30.07 |
| GPT-5.1 | 12.64 | 48.83 | 59.01 | 40.95 |

Per-domain R-GED Accuracy is 100% in 14 of 15 domains (paper Table 3). Friedman tests (Kendall's *W* between 0.65 and 0.81) and bootstrap confidence intervals indicate that these differences are statistically significant (paper Appendix A).
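
For readers who want to run this kind of analysis on their own outputs, the sketch below computes the Friedman statistic and Kendall's *W* from a per-instance score matrix (rows are benchmark instances, columns are models). It is a plain-Python illustration, not the paper's analysis code: the function names are hypothetical, ties get average ranks, no tie correction is applied, and *W* is derived with the standard identity W = χ²_F / (N(k − 1)).

```python
from typing import List, Tuple


def average_ranks(row: List[float]) -> List[float]:
    """Rank values in ascending order (1 = lowest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_kendall_w(scores: List[List[float]]) -> Tuple[float, float]:
    """Return (Friedman chi-square, Kendall's W) for an N x k score matrix.

    Rows are blocks (benchmark instances), columns are treatments (models).
    No tie correction is applied.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))
    return chi2, w


# Hypothetical toy data: 3 instances x 3 models, same model ordering on every instance.
chi2, w = friedman_kendall_w([[0.9, 0.5, 0.1], [0.8, 0.4, 0.2], [0.95, 0.6, 0.3]])
print(f"chi2={chi2:.2f}, W={w:.2f}")  # chi2=6.00, W=1.00
```

*W* runs from 0 (no agreement on model ordering across instances) to 1 (identical ordering everywhere), which is why it is reported alongside the Friedman test as an effect size.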

## Intended use

Generate first-draft BPMN models from textual process descriptions to accelerate early-stage modeling. In expert review, the outputs were judged to be usable with modest post-editing and to follow BPMN best practices for model size, explicit gateways, split/join consistency, and process orientation (paper §6.2, BEBoP verification).

The model is intended as an **assistant for business process modelers and analysts**, not as a fully autonomous replacement for manual modeling. Human review is recommended, particularly for gateway logic and activity labels in ambiguous descriptions.
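
Because gateway logic is the main area flagged for review, a reviewer might run a quick structural check before opening the diagram. The sketch below assumes the generated model has already been reduced to an edge list plus a set of gateway node IDs (both hypothetical inputs; how gateways are recognized depends on the DOT conventions in the output). It flags gateways that are neither a split (one incoming, two or more outgoing flows) nor a join (two or more incoming, one outgoing flow). It checks degrees only and does not verify that splits and joins are correctly paired.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def classify_gateways(edges: Iterable[Tuple[str, str]], gateways: Set[str]) -> Dict[str, str]:
    """Label each gateway 'split', 'join', or 'suspicious' from its in/out degree."""
    edge_list = list(edges)
    in_deg = Counter(dst for _, dst in edge_list)
    out_deg = Counter(src for src, _ in edge_list)
    labels: Dict[str, str] = {}
    for g in sorted(gateways):
        n_in, n_out = in_deg[g], out_deg[g]
        if n_in == 1 and n_out >= 2:
            labels[g] = "split"
        elif n_in >= 2 and n_out == 1:
            labels[g] = "join"
        else:
            # e.g. a pass-through gateway (1 in, 1 out) or a dangling one
            labels[g] = "suspicious"
    return labels


# Hypothetical AND-parallel fragment with one redundant XOR gateway.
edges = [
    ("start", "a"), ("a", "and_split"),
    ("and_split", "b"), ("and_split", "c"),
    ("b", "and_join"), ("c", "and_join"),
    ("and_join", "xor_1"), ("xor_1", "end"),
]
print(classify_gateways(edges, {"and_split", "and_join", "xor_1"}))
# {'and_join': 'join', 'and_split': 'split', 'xor_1': 'suspicious'}
```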

## Supported BPMN subset

The model generates BPMN process fragments in DOT notation covering: **start events, end events, tasks (activities), sequence flows, and AND/XOR gateways (splits and joins).** It does **not** currently generate pools, lanes, message flows, data objects, intermediate/boundary events, sub-processes, or annotations.
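
To make the subset concrete, here is a hand-written illustration (not model output) of how the loan-approval description from the usage example could be expressed with only these elements. The node IDs and shapes (circle for start, double circle for end, box for tasks, diamond for gateways) are assumptions chosen for readability; the model's actual DOT conventions may differ.

```python
def loan_process_dot() -> str:
    """Build a DOT digraph using only start/end events, tasks, and an XOR split/join."""
    # Hypothetical shape conventions; the model's own output may use different ones.
    nodes = {
        "start": ("Start", "circle"),
        "submit": ("Submit application", "box"),
        "review": ("Review application", "box"),
        "xor_split": ("X", "diamond"),
        "disburse": ("Disburse loan", "box"),
        "reject": ("Send rejection letter", "box"),
        "xor_join": ("X", "diamond"),
        "end": ("End", "doublecircle"),
    }
    # Unlabeled sequence flows, as the instruction prompt in the usage example asks for.
    flows = [
        ("start", "submit"), ("submit", "review"), ("review", "xor_split"),
        ("xor_split", "disburse"), ("xor_split", "reject"),
        ("disburse", "xor_join"), ("reject", "xor_join"), ("xor_join", "end"),
    ]
    lines = ["digraph loan_process {", "  rankdir=LR;"]
    for node_id, (label, shape) in nodes.items():
        lines.append(f'  {node_id} [label="{label}", shape={shape}];')
    for src, dst in flows:
        lines.append(f"  {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)


dot_text = loan_process_dot()
print(dot_text)
```

Saving `dot_text` to a file such as `process.dot` lets it be rendered with the Graphviz command in the usage section.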

## How to use

### With `transformers`

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged BF16 checkpoint hosted in this repository.
model_id = "gcelikmasat-work/Qwen3_4B_BPMN_IT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Task instruction; the process description is appended directly after it.
instruction = (
    "You are an expert in BPMN modeling and DOT language. Your task is to "
    "convert detailed textual descriptions of business processes into accurate "
    "BPMN model codes written in DOT language. Label all nodes with their "
    "activity names. Represent all connections between nodes without labeling "
    "the connections. Represent each node and its connections accurately, "
    "ensuring all decision points and flows are included and connected. "
    "Now, generate BPMN business process model code in DOT language for the "
    "following textual description of a business process: "
)

description = (
    "The process begins when the customer submits an application. After submission, "
    "the application is reviewed by the credit officer. If the application is approved, "
    "the loan is disbursed. Otherwise, a rejection letter is sent. The process ends."
)

# Format the request with the model's chat template.
messages = [{"role": "user", "content": instruction + description}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Low-temperature sampling, matching the decoding settings listed under Training.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, top_p=1.0, do_sample=True)

# Decode only the newly generated tokens, dropping the echoed prompt.
dot_code = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(dot_code)
```

### With `vLLM` (recommended for batched inference)

```python
from vllm import LLM, SamplingParams

# Reuses `instruction` and `description` from the transformers example above.
# max_model_len bounds the prompt and the completion together.
llm = LLM(model="gcelikmasat-work/Qwen3_4B_BPMN_IT", max_model_len=2048)
params = SamplingParams(temperature=0.1, top_p=1.0, max_tokens=2048)
# llm.chat takes a list of conversations, so several descriptions can be batched in one call.
outputs = llm.chat([[{"role": "user", "content": instruction + description}]], params)
print(outputs[0].outputs[0].text)
```

The generated DOT can be rendered with Graphviz:

```bash
dot -Tpng process.dot -o process.png
```
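
The raw generation may wrap the graph in a Markdown fence or surround it with prose, so it can help to pull out the `digraph { ... }` block before saving it as `process.dot`. The helper below is a hypothetical post-processing step, not part of the released code. It looks for a `graph`/`digraph` header at the start of a line, then counts braces, and it assumes braces do not appear inside quoted labels.

```python
import re
from typing import Optional

# A graph header at the start of a line, e.g. "digraph G {" or "strict digraph {".
HEADER_RE = re.compile(r"^[ \t]*(?:strict[ \t]+)?(?:di)?graph\b[^{]*\{", re.MULTILINE)


def extract_dot(text: str) -> Optional[str]:
    """Return the first brace-balanced graph block in `text`, or None."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    depth = 0
    for pos in range(match.end() - 1, len(text)):
        if text[pos] == "{":
            depth += 1
        elif text[pos] == "}":
            depth -= 1
            if depth == 0:
                return text[match.start():pos + 1].strip()
    return None  # unbalanced braces: the output was probably truncated


def save_dot(text: str, path: str = "process.dot") -> bool:
    """Write the extracted DOT to `path`; return False if nothing usable was found."""
    dot = extract_dot(text)
    if dot is None:
        return False
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(dot + "\n")
    return True


raw = "Here is the model:\n```dot\ndigraph G {\n  a -> b;\n}\n```"
print(extract_dot(raw))
```

With `dot_code` from the `transformers` example, `save_dot(dot_code)` writes `process.dot` for the Graphviz command above; a `False` return usually means generation stopped before the closing brace.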

## Training

| Parameter | Value |
| ---------------------------- | ------------------------------- |
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Training framework | LLaMA-Factory |
| Adapter | LoRA, `all` target modules |
| LoRA rank `r` | 16 |
| LoRA α | 32 |
| LoRA dropout | 0.05 |
| Precision | bf16 |
| Cutoff length | 2048 tokens |
| Batch size (per device) | 16 |
| Gradient accumulation steps | 2 |
| Epochs | 1 (≈670 optimizer steps) |
| Learning rate | 2 × 10⁻⁴ |
| LR schedule / warmup ratio | cosine / 0.05 |
| Optimizer | AdamW (torch) |
| FlashAttention / Liger | FA2 / enabled |
| Hardware | 2 × NVIDIA L40S (48 GB) |
| Wall-clock | ≈150 minutes |
| Decoding at inference | temperature=0.1, top_p=1.0, max_tokens=2048 |
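
The "cosine / 0.05" row can be read as a linear warmup over the first 5% of optimizer steps followed by cosine decay to zero. The sketch below assumes LLaMA-Factory delegates to the Hugging Face `transformers` cosine-with-warmup schedule (warmup steps = ⌈0.05 × total⌉) and uses the table's approximate total of 670 steps, so the exact step counts are illustrative only.

```python
import math


def cosine_lr(step: int, total_steps: int = 670, peak_lr: float = 2e-4,
              warmup_ratio: float = 0.05) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(warmup_ratio * total_steps)  # 34 when total_steps is 670
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine period: peak_lr at progress 0, zero at progress 1.
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for s in (0, 17, 34, 352, 670):
    print(s, f"{cosine_lr(s):.2e}")
```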

**Training data.** 21.5k cleaned instruction–input–output triples from the MaD dataset, split 80/10/10 for train/validation/test. Filtering removed malformed DOT, duplicate processes, disconnected components, and descriptions exceeding 2048 tokens. The full splits are available at [`gcelikmasat-work/BPMN-IT-Dataset`](https://huggingface.co/datasets/gcelikmasat-work/BPMN-IT-Dataset).
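
The filtering criteria above can be approximated with a few structural checks. The sketch below is a hypothetical reconstruction, not the authors' pipeline. It treats a DOT string as well-formed if it has a `digraph` header, balanced braces, and at least one edge. It detects disconnected components with union-find over `a -> b` edges and treats exact duplicate descriptions as duplicate processes. It approximates the 2048-token limit with a whitespace word count, where a real pipeline would use the model tokenizer. Finally, it applies an 80/10/10 shuffle split with a fixed seed. The record keys `input`/`output` are also assumptions.

```python
import random
import re
from typing import Dict, List, Tuple

# Source and target are quoted strings or bare identifiers; the lookahead leaves the
# target unconsumed so chained statements like `a -> b -> c` yield both edges.
EDGE_RE = re.compile(r'("[^"]*"|\w+)\s*->\s*(?=("[^"]*"|\w+))')


def parse_edges(dot: str) -> List[Tuple[str, str]]:
    """Extract (source, target) pairs from `->` statements (no subgraph/port handling)."""
    return EDGE_RE.findall(dot)


def is_well_formed(dot: str) -> bool:
    """Crude validity check: digraph header, balanced braces, at least one edge."""
    return (dot.lstrip().startswith("digraph")
            and dot.count("{") == dot.count("}") > 0
            and len(parse_edges(dot)) > 0)


def is_connected(edges: List[Tuple[str, str]]) -> bool:
    """True if the edges form one weakly connected component (union-find).

    Nodes that are declared but never appear in an edge are not considered.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(n) for n in list(parent)}) == 1


def clean_and_split(records: List[Dict[str, str]], max_words: int = 2048, seed: int = 42):
    """Filter records with 'input' (description) and 'output' (DOT) keys, then split 80/10/10."""
    seen, kept = set(), []
    for rec in records:
        desc, dot = rec["input"].strip(), rec["output"]
        if desc in seen or len(desc.split()) > max_words:
            continue  # duplicate process or over-long description
        if not is_well_formed(dot) or not is_connected(parse_edges(dot)):
            continue  # malformed DOT or disconnected components
        seen.add(desc)
        kept.append(rec)
    random.Random(seed).shuffle(kept)
    n_train, n_val = int(0.8 * len(kept)), int(0.1 * len(kept))
    return kept[:n_train], kept[n_train:n_train + n_val], kept[n_train + n_val:]
```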

## Deployment variants

This repository hosts the merged BF16 checkpoint. Two related collections provide variants for the deployment trade-offs discussed in the paper:

- **GGUF quantizations (paper Table 5):** Q2 through Q8 via HQQ/llama.cpp. Mid-precision (Q5–Q8) preserves near-BF16 quality with roughly half the memory footprint: [Qwen3-4b-Different-Quantization-GGUF](https://huggingface.co/collections/gcelikmasat-work/qwen3-4b-different-quantization-gguf).
- **Merge-time α variants (paper Table 6):** α ∈ {8, 16, 32, 64} applied during the LoRA merge, with rank held at 16. Mid-range α (16–32) gives the best accuracy; α=32 is the default in this checkpoint: [Qwen3-4b-Different-Alpha](https://huggingface.co/collections/gcelikmasat-work/qwen3-4b-different-alpha).
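
For intuition on what the α variants change: in standard LoRA (the PEFT default, without rsLoRA), the learned update B·A is scaled by α/r before being added to the frozen weight, W' = W + (α/r)·B·A. With r fixed at 16, the four variants scale the same learned update by 0.5, 1, 2, and 4. The plain-Python sketch below shows this on a toy 2×2 weight; `delta` is a made-up stand-in for B·A, not a value from the model.

```python
from typing import List

Matrix = List[List[float]]


def lora_scale(alpha: float, r: int = 16) -> float:
    """Standard LoRA scaling factor applied to the adapter update B @ A."""
    return alpha / r


def merge(w: Matrix, delta: Matrix, alpha: float, r: int = 16) -> Matrix:
    """Merged weight W + (alpha / r) * delta, where delta stands for B @ A."""
    s = lora_scale(alpha, r)
    return [[w_ij + s * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 frozen weight and a made-up learned update (stand-in for B @ A).
W = [[1.0, 0.0], [0.0, 1.0]]
delta = [[0.1, -0.2], [0.05, 0.0]]

for alpha in (8, 16, 32, 64):
    print(f"alpha={alpha:>2}  scale={lora_scale(alpha):.1f}  W'={merge(W, delta, alpha)}")
```

Because only the scale changes, a smaller α shrinks the fine-tuned behavior toward the base model and a larger α exaggerates it.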

## Limitations

- **Scope.** Generates the control-flow slice of BPMN (tasks, events, sequence flows, AND/XOR gateways). Does not yet handle pools, lanes, message flows, data objects, or sub-processes.
- **Language.** Trained on English only.
- **Domain shift.** Evaluated on a stratified 180-instance held-out benchmark from the MaD dataset. Generalization to enterprise documentation with different terminology or structure is not fully established.
- **Label quality.** Expert reviewers occasionally observed overly generic activity labels when input descriptions were vague, and BEBoP verification found gaps in default-flow and XOR-label coverage (paper §6.2, Table 8).
- **Semantic equivalence.** High structural similarity (R-GED) does not guarantee semantic equivalence: two structurally identical graphs can differ in intent when descriptions are underspecified.

## Citation

```bibtex
@article{celikmasat2026instrubpm,
  title   = {InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation},
  author  = {{\c{C}}elikmasat, G{\"o}kberk and {\"O}zg{\"o}vde, Atay and Aydemir, Fatma Ba{\c{s}}ak},
  journal = {Software and Systems Modeling},
  year    = {2026},
  note    = {Under review. arXiv:2512.12063},
  url     = {https://arxiv.org/abs/2512.12063}
}
```

Please also cite the source dataset:

```bibtex
@inproceedings{li2023mad,
  title     = {{MaD}: A Dataset for Interview-based {BPM} in Business Process Management},
  author    = {Li, Xiang and Ni, Lijuan and Li, Ran and Liu, Jiafei and Zhang, Ming},
  booktitle = {2023 International Joint Conference on Neural Networks (IJCNN)},
  pages     = {1--8},
  year      = {2023},
  publisher = {IEEE}
}
```

## License

Apache 2.0, inherited from the base model ([`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)). The training data is distributed separately under the terms of the MaD dataset.

## Acknowledgements

This work builds on our prior instruction-tuning effort on Gemma2-9B ([Çelikmasat et al., PROFES 2025](https://doi.org/10.1007/978-3-032-12089-2_17)), available at [`gcelikmasat-work/gemma-2-9b-it-BPMN`](https://huggingface.co/gcelikmasat-work/gemma-2-9b-it-BPMN). We thank the authors of the [MaD dataset](https://ieeexplore.ieee.org/abstract/document/10191898) for making their resource publicly available.