Update README.md

Files changed (1)

README.md +215 -3

@@ -1,3 +1,215 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+base_model: Qwen/Qwen3-4B-Instruct-2507
+datasets:
+  - gcelikmasat-work/BPMN-IT-Dataset
+language:
+  - en
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+  - bpmn
+  - business-process-modeling
+  - process-modeling
+  - instruction-tuning
+  - lora
+  - peft
+  - dot
+  - graphviz
+  - llama-factory
+  - qwen3
+model-index:
+  - name: Qwen3-4B-InstruBPM
+    results:
+      - task:
+          type: text-generation
+          name: BPMN Model Generation from Text
+        dataset:
+          type: gcelikmasat-work/BPMN-IT-Dataset
+          name: BPMN-IT (stratified 180-instance benchmark across 15 business domains, seed split)
+        metrics:
+          - type: bleu
+            value: 83.06
+            name: BLEU
+          - type: rouge
+            value: 94.43
+            name: ROUGE-L
+          - type: meteor
+            value: 92.82
+            name: METEOR
+          - type: relative-graph-edit-distance
+            value: 99.44
+            name: R-GED Accuracy (%)
+---
+# Qwen3-4B-InstruBPM
+Qwen3-4B-InstruBPM is a compact, instruction-tuned language model that converts natural-language business process descriptions into BPMN models expressed in [Graphviz DOT](https://graphviz.org/doc/info/lang.html). It is a LoRA adaptation of [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), trained on a cleaned, stratified subset of the [MaD dataset](https://ieeexplore.ieee.org/abstract/document/10191898) and described in the paper:
+> **InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation.**
+> Gökberk Çelikmasat, Atay Özgövde, Fatma Başak Aydemir. *Software and Systems Modeling*, under review, 2026.
+> arXiv: [2512.12063](https://arxiv.org/abs/2512.12063)
+On a 180-instance benchmark stratified by difficulty across 15 business domains, this model attains near-perfect structural fidelity (R-GED Accuracy ≈ 99.4%) and matches or outperforms both untuned open-weight baselines (Qwen2.5 7/14B, Qwen3 30B, Qwen3-Coder) and strong proprietary systems (GPT-5.1, Claude 4.5 Sonnet/Haiku, Gemini 2.5 Pro/Flash) on BLEU, ROUGE-L, and METEOR. It performs on par with our prior tuned model, Gemma2-9B-BPMG-IT, at roughly half the parameter count (4B vs. 9B).
+## Results
+Evaluation on the 180-instance stratified benchmark (paper Table 2). Higher is better on all four metrics.
+| Model                                  | BLEU  | ROUGE-L | METEOR | R-GED Acc. |
+| -------------------------------------- | ----: | ------: | -----: | ---------: |
+| **Qwen3-4B-InstruBPM** (this model)    | **83.06** | 94.43 | **92.82** | **99.44** |
+| [Gemma2-9B-BPMG-IT](https://huggingface.co/gcelikmasat-work/gemma-2-9b-it-BPMN) (prior work)  | 82.98 | **94.61** | 92.67  | 97.78      |
+| Qwen3-Coder-30B-A3B-Instruct           |  8.06 | 43.00   | 45.07  | 38.21      |
+| Qwen3-30B-A3B-Instruct-2507            |  6.66 | 42.28   | 44.79  | 38.68      |
+| Qwen3-4B-Instruct-2507 (base)          |  2.89 | 40.31   | 44.16  | 44.47      |
+| Gemini 2.5 Pro                         | 28.72 | 48.98   | 63.66  | 43.58      |
+| Claude 4.5 Sonnet                      | 22.56 | 49.87   | 61.37  | 41.47      |
+| Claude 4.5 Haiku                       | 18.15 | 46.69   | 58.21  | 35.91      |
+| Gemini 2.5 Flash                       | 15.24 | 47.18   | 57.69  | 30.07      |
+| GPT-5.1                                | 12.64 | 48.83   | 59.01  | 40.95      |
+Per-domain R-GED Accuracy is 100% in 14 of 15 domains (paper Table 3). Friedman tests (Kendall's *W* between 0.65 and 0.81) and bootstrap confidence intervals indicate that the differences between models are statistically significant (paper Appendix A).
+## Intended use
+Generate first-draft BPMN models from textual process descriptions to accelerate early-stage modeling. In expert review, the outputs were judged to be usable with modest post-editing and to follow BPMN best practices for model size, explicit gateways, split/join consistency, and process orientation (paper §6.2, BEBoP verification).
+The model is intended as an **assistant for business process modelers and analysts**, not as a fully autonomous replacement for manual modeling. Human review is recommended, particularly for gateway logic and activity labels in ambiguous descriptions.
+## Supported BPMN subset
+The model generates BPMN process fragments in DOT notation covering: **start events, end events, tasks (activities), sequence flows, and AND/XOR gateways (splits and joins).** It does **not** currently generate pools, lanes, message flows, data objects, intermediate/boundary events, sub-processes, or annotations.
+## How to use
+### With `transformers`
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+model_id = "gcelikmasat-work/Qwen3_4B_BPMN_IT"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+instruction = (
+    "You are an expert in BPMN modeling and DOT language. Your task is to "
+    "convert detailed textual descriptions of business processes into accurate "
+    "BPMN model codes written in DOT language. Label all nodes with their "
+    "activity names. Represent all connections between nodes without labeling "
+    "the connections. Represent each node and its connections accurately, "
+    "ensuring all decision points and flows are included and connected. "
+    "Now, generate BPMN business process model code in DOT language for the "
+    "following textual description of a business process: "
+)
+description = (
+    "The process begins when the customer submits an application. After submission, "
+    "the application is reviewed by the credit officer. If the application is approved, "
+    "the loan is disbursed. Otherwise, a rejection letter is sent. The process ends."
+)
+messages = [{"role": "user", "content": instruction + description}]
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
+with torch.no_grad():
+    out = model.generate(**inputs, max_new_tokens=2048, temperature=0.1, top_p=1.0, do_sample=True)
+dot_code = tokenizer.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
+print(dot_code)
+```
+### With `vLLM` (recommended for batched inference)
+```python
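+from vllm import LLM, SamplingParams
+# Reuses `instruction` and `description` from the transformers example above.
+# max_model_len bounds the prompt and the generated tokens combined.
+llm = LLM(model="gcelikmasat-work/Qwen3_4B_BPMN_IT", max_model_len=2048)
+params = SamplingParams(temperature=0.1, top_p=1.0, max_tokens=2048)
+# llm.chat takes a list of conversations; here a single one-turn conversation.
+outputs = llm.chat([[{"role": "user", "content": instruction + description}]], params)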
+print(outputs[0].outputs[0].text)
+```
+Save the generated DOT to a file (here `process.dot`) and render it with Graphviz:
+```bash
+dot -Tpng process.dot -o process.png
+```
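Model output can sometimes arrive with surrounding text, so it may help to isolate the graph before saving it. The following is a minimal sketch under that assumption: it extracts the first `digraph` block by brace matching, writes it to a file, and renders it only if the `dot` executable is on the `PATH`. The helper names `extract_dot` and `render_dot` and the sample string `raw` are hypothetical and not part of this repository.

```python
import shutil
import subprocess
from pathlib import Path


def extract_dot(text: str) -> str:
    """Return the first brace-balanced `digraph ... { ... }` block in text.

    Assumption: labels do not contain unbalanced braces.
    """
    start = text.find("digraph")
    if start == -1:
        raise ValueError("no 'digraph' keyword found in model output")
    open_idx = text.find("{", start)
    if open_idx == -1:
        raise ValueError("no opening brace after 'digraph'")
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    raise ValueError("unbalanced braces in DOT output")


def render_dot(dot_code: str, stem: str = "process") -> Path:
    """Write <stem>.dot and, if Graphviz is installed, render <stem>.png."""
    dot_path = Path(f"{stem}.dot")
    dot_path.write_text(dot_code, encoding="utf-8")
    if shutil.which("dot") is not None:
        subprocess.run(
            ["dot", "-Tpng", str(dot_path), "-o", f"{stem}.png"], check=True
        )
    return dot_path


# Hypothetical model output, for illustration only.
raw = 'Here is the model:\ndigraph { start -> "Review application"; }\nDone.'
print(extract_dot(raw))
```

Calling `render_dot(extract_dot(dot_code))` on the string produced by either inference example would write `process.dot` in the working directory.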
+## Training
+| Parameter                    | Value                           |
+| ---------------------------- | ------------------------------- |
+| Base model                   | Qwen/Qwen3-4B-Instruct-2507     |
+| Training framework           | LLaMA-Factory                   |
+| Adapter                      | LoRA, `all` target modules      |
+| LoRA rank `r`                | 16                              |
+| LoRA α                       | 32                              |
+| LoRA dropout                 | 0.05                            |
+| Precision                    | bf16                            |
+| Cutoff length                | 2048 tokens                     |
+| Batch size (per device)      | 16                              |
+| Gradient accumulation steps  | 2                               |
+| Epochs                       | 1 (≈670 optimizer steps)        |
+| Learning rate                | 2 × 10⁻⁴                        |
+| LR schedule / warmup ratio   | cosine / 0.05                   |
+| Optimiser                    | AdamW (torch)                   |
+| FlashAttention / Liger       | FA2 / enabled                   |
+| Hardware                     | 2 × NVIDIA L40S (48 GB)         |
+| Wall-clock                   | ≈150 minutes                    |
+| Decoding at inference        | temperature=0.1, top_p=1.0, max_tokens=2048 |
+**Training data.** 21.5k cleaned instruction–input–output triples from the MaD dataset, split 80/10/10 for train/validation/test. Filtering removed malformed DOT, duplicate processes, disconnected components, and descriptions exceeding 2048 tokens. The full splits are available at [`gcelikmasat-work/BPMN-IT-Dataset`](https://huggingface.co/datasets/gcelikmasat-work/BPMN-IT-Dataset).
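One of the filters above removes graphs with disconnected components. The sketch below shows one way such a check could look; it is not the authors' pipeline. It assumes one edge per statement (`a -> b;`), ignores nodes that appear in no edge, and uses a simple regular expression instead of a full DOT parser. The names `parse_edges` and `is_connected` are hypothetical.

```python
import re
from collections import defaultdict, deque

# Matches simple edge statements such as: a -> b  or  "Task A" -> "Task B"
EDGE_RE = re.compile(r'("[^"]+"|\w+)\s*->\s*("[^"]+"|\w+)')


def parse_edges(dot_code: str) -> list[tuple[str, str]]:
    """Extract (source, target) pairs, stripping surrounding quotes."""
    return [(a.strip('"'), b.strip('"')) for a, b in EDGE_RE.findall(dot_code)]


def is_connected(dot_code: str) -> bool:
    """True if all nodes that appear in edges form one weakly connected graph."""
    edges = parse_edges(dot_code)
    if not edges:
        return False
    # Treat edges as undirected for the connectivity check.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(neighbours))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary node
        node = queue.popleft()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(neighbours)


connected = "digraph { start -> review; review -> end; }"
disconnected = "digraph { start -> review; archive -> end; }"
print(is_connected(connected), is_connected(disconnected))
```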
+## Deployment variants
+This repository hosts the merged BF16 checkpoint. Two related collections provide variants for deployment trade-offs discussed in the paper:
+- **GGUF quantizations (paper Table 5)** — Q2 through Q8 via HQQ/llama.cpp. Mid-precision (Q5–Q8) preserves near-BF16 quality with roughly half the memory footprint: [Qwen3-4b-Different-Quantization-GGUF](https://huggingface.co/collections/gcelikmasat-work/qwen3-4b-different-quantization-gguf).
+- **Merge-time α variants (paper Table 6)** — α ∈ {8, 16, 32, 64} applied during LoRA merge, holding rank at 16. Mid-range α (16–32) gives the best accuracy; α=32 is the default in this checkpoint: [Qwen3-4b-Different-Alpha](https://huggingface.co/collections/gcelikmasat-work/qwen3-4b-different-alpha).
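To make the role of α concrete: in the standard LoRA formulation the merged weight is W' = W + (α/r)·BA, so with the rank fixed at 16 the four α values correspond to different scaling factors on the same learned update. The sketch below assumes that standard scaling (not a rank-stabilized variant) and uses toy nested-list matrices; `lora_scale` and `merge` are hypothetical helpers, and the exact merge procedure used for the released variants is described in the paper, not here.

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor alpha / r."""
    return alpha / rank


def merge(w, b, a, scale):
    """Return W + scale * (B @ A) for matrices given as nested lists."""
    inner = len(a)  # shared dimension of B (rows x inner) and A (inner x cols)
    return [
        [
            w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(inner))
            for j in range(len(a[0]))
        ]
        for i in range(len(b))
    ]


RANK = 16
for alpha in (8, 16, 32, 64):
    print(alpha, lora_scale(alpha, RANK))

# Toy example with an inner dimension of 1, for illustration only.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0]]
print(merge(w, b, a, lora_scale(32, RANK)))
```

Under this assumption, α=32 doubles the contribution of the update relative to α=16, while α=8 halves it.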
+## Limitations
+- **Scope.** Generates the control-flow slice of BPMN (tasks, events, sequence flows, AND/XOR gateways). Does not yet handle pools, lanes, message flows, data objects, or sub-processes.
+- **Language.** Trained on English only.
+- **Domain shift.** Evaluated on a stratified 180-instance held-out benchmark from the MaD dataset. Generalization to enterprise documentation with different terminology or structure is not fully established.
+- **Label quality.** Expert reviewers occasionally observed overly generic activity labels when input descriptions were vague, and BEBoP verification found gaps in default-flow and XOR-label coverage (paper §6.2, Table 8).
+- **Semantic equivalence.** High structural similarity (R-GED) does not guarantee semantic equivalence — two structurally identical graphs can differ in intent when descriptions are underspecified.
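The last point can be illustrated with two toy graphs that share the same topology but carry different activity labels. This is a minimal sketch, not the R-GED metric from the paper: it compares edge sets and label maps directly, and it assumes a DOT layout with node identifiers and `label` attributes, which may differ from the model's actual output style. Both graphs and the helper names are hypothetical.

```python
import re

EDGE_RE = re.compile(r"(\w+)\s*->\s*(\w+)")
LABEL_RE = re.compile(r'(\w+)\s*\[\s*label\s*=\s*"([^"]*)"')


def edges(dot_code: str) -> set[tuple[str, str]]:
    """Set of (source_id, target_id) pairs."""
    return set(EDGE_RE.findall(dot_code))


def labels(dot_code: str) -> dict[str, str]:
    """Map from node id to its label text."""
    return dict(LABEL_RE.findall(dot_code))


graph_a = """
digraph {
  n1 [label="Review application"];
  n2 [label="Disburse loan"];
  n1 -> n2;
}
"""

graph_b = """
digraph {
  n1 [label="Review application"];
  n2 [label="Send rejection letter"];
  n1 -> n2;
}
"""

same_structure = edges(graph_a) == edges(graph_b)
same_labels = labels(graph_a) == labels(graph_b)
print(same_structure, same_labels)
```

Here the structure matches while the labels do not, which is the kind of difference a purely structural score cannot surface.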
+## Citation
+```bibtex
+@article{celikmasat2026instrubpm,
+  title   = {InstruBPM: Instruction-Tuning Open-Weight Language Models for BPMN Model Generation},
+  author  = {{\c{C}}elikmasat, G{\"o}kberk and {\"O}zg{\"o}vde, Atay and Aydemir, Fatma Ba{\c{s}}ak},
+  journal = {Software and Systems Modeling},
+  year    = {2026},
+  note    = {Under review. arXiv:2512.12063},
+  url     = {https://arxiv.org/abs/2512.12063}
+}
+```
+Please also cite the source dataset:
+```bibtex
+@inproceedings{li2023mad,
+  title     = {{MaD}: A Dataset for Interview-based {BPM} in Business Process Management},
+  author    = {Li, Xiang and Ni, Lijuan and Li, Ran and Liu, Jiafei and Zhang, Ming},
+  booktitle = {2023 International Joint Conference on Neural Networks (IJCNN)},
+  pages     = {1--8},
+  year      = {2023},
+  publisher = {IEEE}
+}
+```
+## License
+Apache 2.0, inherited from the base model ([`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)). The training data is distributed separately under the terms of the MaD dataset.
+## Acknowledgements
+This work builds on our prior instruction-tuning effort on Gemma2-9B ([Çelikmasat et al., PROFES 2025](https://doi.org/10.1007/978-3-032-12089-2_17)), available at [`gcelikmasat-work/gemma-2-9b-it-BPMN`](https://huggingface.co/gcelikmasat-work/gemma-2-9b-it-BPMN). We thank the authors of the [MaD dataset](https://ieeexplore.ieee.org/abstract/document/10191898) for making their resource publicly available.