Submitted: 16 June 2026
Posted: 17 June 2026
Abstract
Keywords:
1. Introduction
2. Related Work
3. Process Mining Configurations and Evaluation Methodology
3.1. Conversational Framework and Configurations
3.2. Trace Representation and SNA Measures
3.3. PM-LLM-Benchmark and Judge
3.4. Configuration Rationale and Research Questions
4. Results
4.1. Effectiveness of Process Mining Configurations
4.2. SNA-Based Diagnosis of Roles and Handoffs
4.3. Configuration Portability Across Base Models
5. Implementation and Availability
- Repository structure.
- Configuration format.
- Pipeline modules.
- Browser tool.
6. Discussion and Conclusion
Acknowledgments
References
- van der Aalst, W. Process Mining Manifesto. In Proceedings of the Business Process Management Workshops (1); Lecture Notes in Business Information Processing; Springer, 2011; pp. 169–194.
- van der Aalst, W. Process Mining - Data Science in Action, Second Edition; Springer, 2016.
- Berti, A.; Qafari, M. Leveraging Large Language Models (LLMs) for Process Mining (Technical Report). CoRR 2023, abs/2307.12701.
- Berti, A.; Kourani, H. Evaluating Large Language Models in Process Mining: Capabilities, Benchmarks, and Evaluation Strategies. In Proceedings of the BPMDS/EMMSAD@CAiSE. Springer, Lecture Notes in Business Information Processing, 2024; pp. 13–21.
- Berti, A.; Kourani, H.; van der Aalst, W. PM-LLM-Benchmark: Evaluating Large Language Models on Process Mining Tasks. In Proceedings of the ICPM Workshops. Springer, Lecture Notes in Business Information Processing, 2024; pp. 610–623.
- van der Aalst, W.; Song, M. Mining Social Networks: Uncovering Interaction Patterns in Business Processes. In Proceedings of the Business Process Management, 2004; Springer; Lecture Notes in Computer Science; pp. 244–260.
- van der Aalst, W.; Reijers, H.; Song, M. Discovering Social Networks from Event Logs. Comput. Support. Coop. Work. 2005, 14, 549–593.
- Wu, Q. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. CoRR 2023, abs/2308.08155.
- Li, G. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. In Proceedings of the NeurIPS, 2023.
- Hong, S. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In Proceedings of the ICLR. OpenReview.net, 2024.
- Zheng, L. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the NeurIPS, 2023.
- Liu, Y. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the EMNLP, 2023; Association for Computational Linguistics; pp. 2511–2522.
- Wasserman, S.; Faust, K. Social Network Analysis: Methods and Applications; Cambridge University Press, 1994.






| Balanced configuration | Verification-heavy configuration | Artifact-pipeline configuration |
|---|---|---|
| event log interpreter | artifact parser | schema and artifact mapper |
| Reads logs, traces, timestamps, resources, case notions, and data-quality caveats. | Extracts entities, activities, timestamps, objects, rules, schemas, and missing assumptions. | Maps log columns, objects, resources, activities, timestamps, rules, tables, and requested output. |
| process modeler | domain reasoner | trace and variant reasoner |
| Produces and checks process trees, POWL, DECLARE, log skeletons, temporal profiles, and Petri nets. | Builds the main process mining answer across logs, discovery, conformance, querying, fairness, and optimization. | Reconstructs variants, infers case notions, compares paths, and detects loops or rework. |
| conformance anomaly checker | formal semantics checker | control flow semantics expert |
| Compares behavior with models or rules and explains deviations, missing steps, and anomalies. | Checks coherence of process trees, POWL, DECLARE, Petri nets, temporal profiles, and log skeletons. | Reasons about ordering, concurrency, choices, loops, constraints, and model semantics. |
| query and sql analyst | counterexample hunter | performance and resource analyst |
| Handles log/model queries, filters, grouping conditions, schemas, and constraints. | Searches for counterexamples, edge cases, impossible orders, and unsupported generalizations. | Analyzes waiting times, bottlenecks, handoffs, workload, batching, and utilization. |
| hypothesis generator | data consistency auditor | data question designer |
| Proposes diagnostic hypotheses, evidence needs, and verification strategies. | Verifies case IDs, attributes, resources, time ordering, SQL logic, and performance claims. | Produces hypotheses, diagnostic questions, query ideas, filters, and evidence plans. |
| fairness and ethics reviewer | bias and impact auditor | risk fairness and compliance reviewer |
| Checks group impacts, bias risks, and mitigation or monitoring options. | Reviews group outcomes, protected attributes, interventions, and side effects. | Checks biased paths, missing controls, auditability, privacy assumptions, and fragile recommendations. |
| optimization consultant | benchmark answer judge | answer precision auditor |
| Identifies bottlenecks, rework, capacity issues, waiting time drivers, and redesign options. | Checks completeness, grounded reasoning, terminology, and likely benchmark quality. | Verifies coverage, terminology, prompt grounding, and unsupported syntax or conclusions. |

| Configuration | Model | Avg. | SD | c01 | c02 | c03 | c04 | c05 | c06 | c07 | c08 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Direct answering | openai/gpt-5.4-mini | 6.72 | 2.16 | 7.30 | 8.17 | 4.45 | 7.01 | 7.93 | 6.53 | 4.07 | 8.14 |
| Direct answering | x-ai/grok-4.20 | 5.99 | 2.56 | 7.30 | 7.09 | 3.41 | 6.00 | 7.31 | 5.14 | 3.90 | 7.90 |
| Direct answering | qwen/qwen3.6-35b-a3b | 6.38 | 1.99 | 6.64 | 7.02 | 5.05 | 6.17 | 6.86 | 7.23 | 4.33 | 7.80 |
| Balanced | openai/gpt-5.4-mini | 6.75 | 1.81 | 6.97 | 7.74 | 4.61 | 6.91 | 7.94 | 6.80 | 5.08 | 8.00 |
| Balanced | x-ai/grok-4.20 | 5.32 | 2.16 | 4.80 | 6.08 | 4.80 | 5.60 | 6.33 | 5.97 | 3.23 | 5.46 |
| Balanced | qwen/qwen3.6-35b-a3b | 5.39 | 1.83 | 5.33 | 6.04 | 4.08 | 5.79 | 5.97 | 6.24 | 3.42 | 6.20 |
| Artifact pipeline | openai/gpt-5.4-mini | 6.74 | 2.13 | 7.44 | 8.04 | 5.76 | 6.37 | 7.11 | 7.40 | 3.30 | 8.00 |
| Artifact pipeline | x-ai/grok-4.20 | 5.24 | 2.47 | 5.26 | 6.07 | 4.92 | 4.39 | 5.86 | 5.69 | 4.10 | 5.28 |
| Artifact pipeline | qwen/qwen3.6-35b-a3b | 5.51 | 1.63 | 6.14 | 6.11 | 4.54 | 4.53 | 6.46 | 6.11 | 3.78 | 6.30 |
| Verification heavy | openai/gpt-5.4-mini | 6.55 | 1.99 | 7.15 | 7.70 | 4.28 | 7.47 | 7.00 | 6.89 | 3.80 | 8.06 |
| Verification heavy | x-ai/grok-4.20 | 5.84 | 2.13 | 5.96 | 6.48 | 5.64 | 4.70 | 6.57 | 6.50 | 4.55 | 5.96 |
| Verification heavy | qwen/qwen3.6-35b-a3b | 5.32 | 1.74 | 6.04 | 4.78 | 4.49 | 5.50 | 6.33 | 5.51 | 3.85 | 6.34 |
| Ver. heavy excl. hand. | openai/gpt-5.4-mini | 6.98 | 1.70 | 7.84 | 8.01 | 6.06 | 6.97 | 7.14 | 6.90 | 4.35 | 8.28 |
| Ver. heavy excl. hand. | x-ai/grok-4.20 | 5.88 | 2.11 | 5.10 | 6.93 | 4.40 | 6.67 | 6.27 | 6.04 | 4.92 | 6.84 |
| Ver. heavy excl. hand. | qwen/qwen3.6-35b-a3b | 5.48 | 2.05 | 6.12 | 6.53 | 4.72 | 5.14 | 5.11 | 5.54 | 4.10 | 6.34 |
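
One way to read the Avg. column against the per-category columns is as a count-weighted mean of c01 to c08. The sketch below checks this for the two direct-answering rows only. It assumes that the eight category sizes are the per-category counts reported for direct answering (8/9/8/7/7/7/6/5, 57 in total) and that those counts follow the order c01 to c08; both are inferences from the tables, not something the page states. The helper name `weighted_mean` is illustrative.

```python
# Per-category mean scores (c01..c08) for the "Direct answering" rows,
# copied from the results table.
category_means = {
    "openai/gpt-5.4-mini": [7.30, 8.17, 4.45, 7.01, 7.93, 6.53, 4.07, 8.14],
    "x-ai/grok-4.20": [7.30, 7.09, 3.41, 6.00, 7.31, 5.14, 3.90, 7.90],
}

# Assumption: number of scored answers per category, in the order c01..c08,
# taken from the direct-answering count cell 8/9/8/7/7/7/6/5.
category_counts = [8, 9, 8, 7, 7, 7, 6, 5]


def weighted_mean(means, counts):
    """Mean of category means, weighted by the number of answers per category."""
    total = sum(counts)
    return sum(m * n for m, n in zip(means, counts)) / total


for model, means in category_means.items():
    print(model, round(weighted_mean(means, category_counts), 2))
# prints:
# openai/gpt-5.4-mini 6.72
# x-ai/grok-4.20 5.99
```

Both values match the reported Avg. for these rows. Because the category means are themselves rounded to two decimals, other rows may differ from the reported average in the last digit.
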

| Configuration | Role | openai/gpt-5.4-mini Count | openai/gpt-5.4-mini Avg. | openai/gpt-5.4-mini SD | x-ai/grok-4.20 Count | x-ai/grok-4.20 Avg. | x-ai/grok-4.20 SD | qwen/qwen3.6-35b-a3b Count | qwen/qwen3.6-35b-a3b Avg. | qwen/qwen3.6-35b-a3b SD |
|---|---|---|---|---|---|---|---|---|---|---|
| Direct answering | Direct answer | 57 | 6.72 | 2.16 | 57 | 5.99 | 2.56 | 57 | 6.38 | 1.99 |
| Balanced | Event log interpreter | 13 | 6.75 | 1.64 | 32 | 5.15 | 2.39 | 14 | 6.72 | 1.66 |
| Balanced | Process modeler | 22 | 6.24 | 1.98 | 58 | 4.14 | 2.00 | 45 | 4.32 | 1.66 |
| Balanced | Conformance anomaly checker | 12 | 7.22 | 1.20 | 52 | 3.29 | 2.01 | 33 | 5.05 | 2.20 |
| Balanced | Query and SQL analyst | 7 | 7.31 | 1.33 | 22 | 4.60 | 2.06 | 12 | 4.74 | 1.69 |
| Balanced | Hypothesis generator | 6 | 7.82 | 0.83 | 46 | 5.11 | 1.79 | 20 | 5.51 | 1.84 |
| Balanced | Fairness and ethics reviewer | 7 | 7.14 | 1.28 | 34 | 3.73 | 2.29 | 20 | 5.17 | 2.57 |
| Balanced | Optimization consultant | 10 | 6.77 | 1.71 | 32 | 3.67 | 2.15 | 39 | 3.86 | 1.84 |
| Artifact pipeline | Schema and artifact mapper | 41 | 6.58 | 1.53 | 49 | 6.08 | 1.63 | 26 | 5.92 | 1.19 |
| Artifact pipeline | Trace and variant reasoner | 2 | 5.35 | 2.75 | 33 | 4.08 | 1.87 | 18 | 4.44 | 1.62 |
| Artifact pipeline | Control-flow semantics expert | 7 | 7.16 | 1.97 | 57 | 4.41 | 2.05 | 28 | 5.18 | 2.03 |
| Artifact pipeline | Performance and resource analyst | 0 | – | – | 20 | 3.01 | 1.16 | 16 | 4.74 | 1.38 |
| Artifact pipeline | Data question designer | 0 | – | – | 24 | 4.62 | 1.91 | 3 | 4.43 | 1.70 |
| Artifact pipeline | Risk fairness and compliance reviewer | 5 | 7.40 | 1.25 | 35 | 4.29 | 1.97 | 11 | 5.89 | 1.49 |
| Artifact pipeline | Answer precision auditor | 56 | 7.53 | 1.71 | 53 | 4.74 | 2.53 | 28 | 4.66 | 1.71 |
| Verification heavy | Artifact parser | 55 | 6.89 | 1.79 | 59 | 6.88 | 2.14 | 26 | 6.19 | 1.46 |
| Verification heavy | Domain reasoner | 28 | 6.93 | 1.32 | 50 | 5.27 | 1.68 | 54 | 5.10 | 1.71 |
| Verification heavy | Formal semantics checker | 12 | 7.02 | 1.43 | 63 | 4.30 | 2.06 | 12 | 5.42 | 2.17 |
| Verification heavy | Counterexample hunter | 10 | 8.16 | 0.54 | 49 | 5.67 | 2.46 | 9 | 5.56 | 2.84 |
| Verification heavy | Data consistency auditor | 7 | 7.79 | 0.78 | 42 | 5.38 | 2.30 | 8 | 5.48 | 1.59 |
| Verification heavy | Bias and impact auditor | 4 | 7.83 | 0.52 | 37 | 3.91 | 1.85 | 6 | 5.80 | 1.76 |
| Verification heavy | Benchmark answer judge | 7 | 6.73 | 1.00 | 58 | 5.69 | 2.26 | 26 | 5.06 | 2.09 |
| Ver. heavy excl. hand. | Artifact parser | 55 | 6.53 | 1.99 | 58 | 6.81 | 2.32 | 22 | 6.50 | 1.69 |
| Ver. heavy excl. hand. | Domain reasoner | 34 | 6.82 | 1.68 | 52 | 5.09 | 1.69 | 57 | 5.18 | 1.49 |
| Ver. heavy excl. hand. | Formal semantics checker | 10 | 7.64 | 1.07 | 71 | 4.63 | 2.35 | 10 | 3.50 | 1.82 |
| Ver. heavy excl. hand. | Counterexample hunter | 13 | 7.82 | 1.14 | 47 | 5.66 | 2.73 | 16 | 6.48 | 2.31 |
| Ver. heavy excl. hand. | Data consistency auditor | 1 | 5.80 | 0.00 | 41 | 5.45 | 2.46 | 10 | 4.85 | 1.41 |
| Ver. heavy excl. hand. | Bias and impact auditor | 3 | 7.83 | 0.73 | 27 | 4.38 | 1.73 | 7 | 6.37 | 1.38 |
| Ver. heavy excl. hand. | Benchmark answer judge | 2 | 7.15 | 0.05 | 56 | 5.20 | 2.23 | 4 | 5.02 | 1.04 |
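
The Count columns can also be read as a rough participation profile per base model. The following sketch computes each role's share of all counted contributions for the Balanced configuration on two of the models. It assumes that Count is the number of scored contributions attributed to a role, which is an interpretation of the column rather than a definition given here; `role_shares` is a hypothetical helper.

```python
# "Count" values for the Balanced configuration, copied from the per-role table.
balanced_counts = {
    "openai/gpt-5.4-mini": {
        "Event log interpreter": 13,
        "Process modeler": 22,
        "Conformance anomaly checker": 12,
        "Query and SQL analyst": 7,
        "Hypothesis generator": 6,
        "Fairness and ethics reviewer": 7,
        "Optimization consultant": 10,
    },
    "x-ai/grok-4.20": {
        "Event log interpreter": 32,
        "Process modeler": 58,
        "Conformance anomaly checker": 52,
        "Query and SQL analyst": 22,
        "Hypothesis generator": 46,
        "Fairness and ethics reviewer": 34,
        "Optimization consultant": 32,
    },
}


def role_shares(counts):
    """Return each role's fraction of the total number of counted contributions."""
    total = sum(counts.values())
    return {role: n / total for role, n in counts.items()}


for model, counts in balanced_counts.items():
    shares = role_shares(counts)
    top_role = max(shares, key=shares.get)
    print(f"{model}: {sum(counts.values())} contributions, "
          f"top role: {top_role} ({shares[top_role]:.0%})")
```

Under this reading, the same configuration yields 77 counted contributions on openai/gpt-5.4-mini and 276 on x-ai/grok-4.20, so role usage differs considerably between base models.
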

| Configuration | Role | openai/gpt-5.4-mini | x-ai/grok-4.20 | qwen/qwen3.6-35b-a3b |
|---|---|---|---|---|
| Direct answering | Direct answer | 8/9/8/7/7/7/6/5 | 8/9/8/7/7/7/6/5 | 8/9/8/7/7/7/6/5 |
| Balanced | Event log interpreter | 7/1/1/0/0/0/2/2 | 7/4/9/0/2/2/3/5 | 7/2/0/0/0/0/2/3 |
| Balanced | Process modeler | 1/3/9/4/0/1/2/2 | 11/9/17/4/3/2/7/5 | 8/4/9/5/6/2/7/4 |
| Balanced | Conformance anomaly checker | 0/5/0/2/3/1/0/1 | 8/9/13/2/5/4/6/5 | 6/7/3/4/4/3/2/4 |
| Balanced | Query and SQL analyst | 0/0/0/3/4/0/0/0 | 6/1/0/8/1/0/3/3 | 1/0/0/3/2/4/1/1 |
| Balanced | Hypothesis generator | 0/0/0/0/6/0/0/0 | 9/6/1/3/10/6/6/5 | 0/3/0/0/7/5/4/1 |
| Balanced | Fairness and ethics reviewer | 0/0/0/0/0/7/0/0 | 6/4/0/3/2/10/4/5 | 1/1/0/2/3/12/1/0 |
| Balanced | Optimization consultant | 0/2/0/1/1/0/1/5 | 7/4/0/3/4/3/6/5 | 5/6/2/3/6/5/8/4 |
| Artifact pipeline | Schema and artifact mapper | 8/5/7/6/5/2/3/5 | 8/7/8/6/6/3/6/5 | 7/1/2/4/2/2/5/3 |
| Artifact pipeline | Trace and variant reasoner | 0/1/1/0/0/0/0/0 | 6/6/2/4/4/1/5/5 | 2/4/4/1/2/0/5/0 |
| Artifact pipeline | Control-flow semantics expert | 0/2/2/3/0/0/0/0 | 6/8/17/7/6/2/6/5 | 1/6/7/4/4/1/4/1 |
| Artifact pipeline | Performance and resource analyst | 0/0/0/0/0/0/0/0 | 3/2/0/3/3/0/4/5 | 0/2/0/2/2/0/6/4 |
| Artifact pipeline | Data question designer | 0/0/0/0/0/0/0/0 | 2/2/0/4/6/1/4/5 | 0/0/0/0/2/0/1/0 |
| Artifact pipeline | Risk fairness and compliance reviewer | 0/0/0/0/0/5/0/0 | 2/3/0/5/6/10/4/5 | 0/2/0/1/2/5/0/1 |
| Artifact pipeline | Answer precision auditor | 8/9/9/7/6/7/5/5 | 5/8/9/6/6/7/6/6 | 1/5/4/6/6/4/1/1 |
| Verification heavy | Artifact parser | 8/9/8/7/7/6/5/5 | 8/9/11/7/7/6/6/5 | 4/4/1/4/2/3/6/2 |
| Verification heavy | Domain reasoner | 2/1/8/6/6/0/0/5 | 8/8/6/6/6/5/6/5 | 6/9/9/8/7/2/7/6 |
| Verification heavy | Formal semantics checker | 0/1/6/3/0/2/0/0 | 8/9/18/7/4/6/6/5 | 1/0/4/3/0/1/3/0 |
| Verification heavy | Counterexample hunter | 1/2/1/1/1/2/0/2 | 5/7/13/4/4/7/5/4 | 0/2/0/2/1/0/2/2 |
| Verification heavy | Data consistency auditor | 2/2/0/0/3/0/0/0 | 7/6/7/4/4/4/6/4 | 4/0/1/1/0/1/1/0 |
| Verification heavy | Bias and impact auditor | 0/0/0/0/0/4/0/0 | 0/4/0/2/7/14/6/4 | 0/0/0/1/0/4/0/1 |
| Verification heavy | Benchmark answer judge | 1/1/1/2/1/0/0/1 | 6/7/11/7/10/7/5/5 | 4/3/5/3/4/3/3/1 |
| Ver. heavy excl. hand. | Artifact parser | 8/9/8/7/7/6/5/5 | 8/10/9/7/7/5/7/5 | 5/2/3/3/2/2/5/0 |
| Ver. heavy excl. hand. | Domain reasoner | 3/4/7/7/5/1/2/5 | 8/8/7/7/7/3/6/6 | 8/9/8/7/9/3/7/6 |
| Ver. heavy excl. hand. | Formal semantics checker | 0/1/5/3/0/1/0/0 | 10/8/26/7/4/5/6/5 | 1/1/4/3/1/0/0/0 |
| Ver. heavy excl. hand. | Counterexample hunter | 2/4/1/1/2/2/0/1 | 6/6/12/7/4/4/5/3 | 1/3/3/2/3/0/2/2 |
| Ver. heavy excl. hand. | Data consistency auditor | 1/0/0/0/0/0/0/0 | 6/6/5/8/4/4/4/4 | 2/1/1/1/3/1/1/0 |
| Ver. heavy excl. hand. | Bias and impact auditor | 0/0/0/0/0/3/0/0 | 3/3/0/5/4/9/0/3 | 0/1/0/1/0/5/0/0 |
| Ver. heavy excl. hand. | Benchmark answer judge | 0/0/0/1/0/1/0/0 | 7/9/5/8/10/4/8/5 | 0/1/0/1/1/1/0/0 |

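
Each slash-separated cell in the table above packs eight per-category counts into one string. A minimal sketch of how such a cell could be unpacked and cross-checked against the total Count reported per role is given below. It assumes the eight positions correspond to categories c01 to c08 in order, which is inferred from the other tables rather than stated; `parse_counts` and the sample rows are illustrative and cover only three cells.

```python
def parse_counts(cell: str) -> list[int]:
    """Split a cell such as '7/1/1/0/0/0/2/2' into eight integer counts."""
    counts = [int(part) for part in cell.strip().split("/")]
    if len(counts) != 8:
        raise ValueError(f"expected 8 category counts, got {len(counts)}: {cell!r}")
    return counts


# (configuration, role, model) -> (per-category cell, total Count from the per-role score table)
samples = {
    ("Direct answering", "Direct answer", "openai/gpt-5.4-mini"): ("8/9/8/7/7/7/6/5", 57),
    ("Balanced", "Event log interpreter", "openai/gpt-5.4-mini"): ("7/1/1/0/0/0/2/2", 13),
    ("Balanced", "Event log interpreter", "x-ai/grok-4.20"): ("7/4/9/0/2/2/3/5", 32),
}

for key, (cell, reported_total) in samples.items():
    counts = parse_counts(cell)
    # Assumption: position i in the cell is category c0(i+1).
    by_category = {f"c{i:02d}": n for i, n in enumerate(counts, start=1)}
    status = "ok" if sum(counts) == reported_total else "mismatch"
    print(key, by_category, status)
```

For the three sampled cells the per-category counts add up to the reported totals, which supports reading the two tables as two views of the same role activations.
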
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.