Configuring and Diagnosing Conversational Agents for Process Mining Tasks

Alessandro Berti; Humam Kourani; Wil M.P. van der Aalst

doi:10.20944/preprints202606.1316.v1

Submitted: 16 June 2026

Posted: 17 June 2026

Abstract

Large Language Models (LLMs) are increasingly used to answer process mining questions about event logs, models, conformance, performance, fairness, and redesign. Direct prompting can produce plausible but incorrect answers, especially when tasks require careful trace interpretation, formal reasoning, or diagnostic evidence. A common response is to wrap the LLM in a conversational agent framework: the same underlying model is reused under different role prompts—such as “event log interpreter”, “conformance checker”, or “optimization consultant”—and a selector decides which persona speaks next on a shared transcript. The agents in this paper do not call external tools; they differ only in their role description, so any improvement must come from how the conversation is structured rather than from added capabilities. It is therefore not obvious how such agents should be configured, nor how to tell whether a configuration is actually helping. This paper studies selector-mediated configurations composed of process-mining-oriented personas and treats each recorded agent trace as a directed social network over roles. Using LLM-as-a-judge scores aggregated through Social Network Analysis (SNA), we diagnose final-answer quality, role usefulness, and handoff quality, and use these diagnostics to revise agent routing. We provide an open source implementation that executes the configurations, records traces, computes the SNA diagnostics, and feeds the results back into improved routing.

Keywords: process mining; large language models; conversational agents; social network analysis; LLM evaluation

Subject: Business, Economics and Management - Business and Management

1. Introduction

Process mining connects event data with questions about control flow, conformance, performance, resources, fairness, and redesign [1,2]. LLMs are attractive interfaces for such questions because they can explain logs, models, queries, and improvement options in natural language [3,4,5]. Yet process mining answers can be plausible while being wrong: they may assume the wrong case notion, ignore timestamp ordering, misuse a formal model, or overstate a fairness or optimization claim. A natural reaction is to ask whether multiple specialized agents can do better than a single LLM call.

In this paper, a conversational agent is not a tool-using autonomous system. The same underlying LLM is reused under different role prompts—an “event log interpreter”, a “conformance checker”, an “optimization consultant”, and so on—and a selector decides which persona contributes next to a shared transcript before a final synthesis prompt produces the answer. Agents do not use tools like Python or SQL; they differ only in the role description that frames their reading of the conversation. As shown in Figure 1, this turns one LLM call into a structured deliberation among personas, but it also adds cost, noise, and weak handoffs. The design problem is therefore to configure roles and routing, and, just as importantly, to know whether any given configuration is actually helping.

We argue that the conversation produced by such a system is itself event data: ordered handoffs between roles, with intermediate contributions that can be scored and aggregated. This makes it natural to analyze agent runs the way process mining analyzes organizational behavior as a directed social network of actors and handoff relations [6,7]. Doing so lets us ask not only whether a configuration produces good final answers, but which roles are useful, which handoffs are weak, and whether the same configuration behaves the same way under different base models. The diagnostic information that emerges can then be used to revise routing constraints, rather than redesigning agents by intuition.

This paper makes three contributions. First, it defines process-mining-focused conversational configurations expressed as portable role sets and routing constraints, so that the same persona definitions can be instantiated in different conversational-agent frameworks, including Microsoft AutoGen and X.AI Grok 4.20, without rewriting the agents. Second, it proposes a diagnostic methodology based on Social Network Analysis of recorded agent traces, combining final-answer quality, average role performance, and average handoff performance into a single view of how a configuration behaves. Third, it provides an open source conversational agent implementation that runs these configurations, records their traces, computes the SNA diagnostics, and uses them to constrain weak handoffs in revised configurations.

The empirical findings should be read with these contributions in mind. On 57 process mining questions from PM-LLM-Benchmark [5], agentic configurations do not reliably outperform direct answering; the strongest result we observe is that an SNA-guided pruning of weak handoffs improves a verification-heavy configuration, although equal wins and losses against the no-agent baseline mean this is suggestive rather than conclusive. We also observe that the same role set behaves quite differently under different base models, with sparse early stopping under one model and denser deliberation under others. Together, these results support a diagnostic rather than a performance-driven reading of multi-agent process mining: configurations should be evaluated through their traces, not only their final answers.

The rest of the paper is organized as follows. Section 2 reviews LLMs for process mining, conversational agent frameworks, LLM-as-judge evaluation, and SNA. Section 3 describes the configurations, trace analysis, benchmark, and evaluation procedure. Section 4 reports results in terms of the three contributions. Section 5 presents the implementation, and Section 6 discusses implications and limitations.

2. Related Work

LLMs have recently been studied as process mining assistants. Existing work translates process mining artifacts into text, evaluates direct answering and query generation, and discusses benchmarks and evaluation strategies [3,4]. PM-LLM-Benchmark is closest to our setting because it provides process mining questions and LLM-based grading, but it does not evaluate conversational agent configurations or SNA-based trace diagnosis [5].

Conversational agent frameworks are now used in both research and industry. AutoGen, developed at Microsoft, supports configurable conversable agents and group-chat orchestration [8]; CAMEL and MetaGPT show how role assignment can structure complex LLM tasks [9,10]. Industry-facing systems also expose agent composition and execution traces. Figure 2 illustrates a Grok 4.20 instantiation of the balanced process mining configuration. Our focus is narrower than general tool-using agents: we study role configurations, routing, and trace-based diagnosis for process mining question answering.

LLM-as-judge methods enable scalable grading of open-ended answers, while also introducing biases such as verbosity and position effects [11,12]. We use judge scores as proxy evidence. SNA provides the diagnostic lens: it studies actors and relations [13], and process mining has used event logs to discover handover-of-work relations [6,7]. We transfer this idea from human performers to LLM agent roles.

3. Process Mining Configurations and Evaluation Methodology

This section presents the conversational framework, the three role configurations under study, the SNA measures applied to recorded traces, and the benchmark used to evaluate them.

3.1. Conversational Framework and Configurations

The evaluated system is a centralized, selector-mediated conversational framework. A configuration defines the available agent personas, their role descriptions, the maximum number of iterations, whether only one agent may speak per iteration, and optional forbidden handoffs. Agents do not call external tools and do not communicate directly. Instead, each selected agent reads the original process mining task and the accumulated transcript, then appends one intermediate contribution.

At each iteration, the selector receives the task, available agents, previous answers, the iteration number, and routing constraints. It returns either an agent name or FINAL. If an agent is selected, its answer is added to the transcript; if FINAL is selected, or the iteration budget is reached, a final synthesis prompt produces the benchmark answer. The no-agent baseline skips this loop and answers the original prompt directly.
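The loop described above can be sketched in a few lines. This is a minimal illustration, not the repository's orchestrator: the function name `run_configuration`, the three callables (`select`, `contribute`, `synthesize`), and the parameter `max_agent_turns` are hypothetical stand-ins for the LLM calls and the iteration budget, and treating an unknown selector reply as a stop signal is an assumption.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Transcript = List[Tuple[str, str]]  # (role name, contribution) in order
FINAL = "FINAL"


def run_configuration(
    task: str,
    agents: Dict[str, str],  # role name -> role description
    select: Callable[[str, List[str], Transcript, int], str],
    contribute: Callable[[str, str, Transcript], str],
    synthesize: Callable[[str, Transcript], str],
    max_agent_turns: int = 4,  # assumption: counts agent turns only
    excluded_handoffs: FrozenSet[Tuple[str, str]] = frozenset(),
) -> Tuple[str, List[str]]:
    """Run one selector-mediated conversation; return (answer, trace)."""
    transcript: Transcript = []
    trace = ["START"]
    for iteration in range(max_agent_turns):
        previous = trace[-1]
        # Routing constraint: drop roles whose handoff from `previous` is forbidden.
        allowed = [name for name in agents
                   if (previous, name) not in excluded_handoffs]
        choice = select(task, allowed, transcript, iteration)
        # Assumption: FINAL or any reply outside the allowed set ends the loop.
        if choice == FINAL or choice not in allowed:
            break
        transcript.append((choice, contribute(agents[choice], task, transcript)))
        trace.append(choice)
    trace.append("COMPLETE")
    # With an empty transcript this reduces to answering the task directly.
    return synthesize(task, transcript), trace
```

Passing an empty `agents` dictionary makes the loop stop immediately, which mirrors the no-agent baseline: only the synthesis callable sees the task.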

We study three complete process-mining-oriented configurations. The artifact-pipeline configuration follows a parse-and-audit structure, with roles for artifact mapping, trace reasoning, control-flow semantics, performance analysis, fairness/compliance review, and answer precision. The balanced configuration covers common process mining task families: event log interpretation, process modeling, conformance checking, querying, hypothesis generation, fairness review, and optimization. The verification-heavy configuration emphasizes robustness through parsing, domain reasoning, formal checking, counterexample search, data auditing, bias auditing, and benchmark-style answer judging.

The configurations are expressed as portable role templates and routing constraints rather than as framework-specific code. Hence, the same role set can be instantiated in different conversational-agent frameworks, such as AutoGen-style group chats or industrial interfaces such as Grok. In this paper, the configuration file is the main experimental object: changing roles or handoff constraints changes the effective conversational system. Table 1 details the roles.

3.2. Trace Representation and SNA Measures

Each benchmark execution produces a trace that is interpreted as a directed social network. The actors are conversational roles. For each question, the judged sequence contains START, zero or more selected agent nodes, and COMPLETE. Consecutive nodes define directed handoffs. For example,

START → artifact_parser → domain_reasoner → COMPLETE

records that the selector first chose the parser, then the domain reasoner, and then stopped.

Each judged node $v$ has a role label and a score $s(v)$ on the 1 to 10 rubric. The COMPLETE score measures final-answer quality. Intermediate node scores measure the judged usefulness of selected roles. For role $a$, with occurrences $V_a$, the mean role score is

$$\bar{s}_a = \frac{1}{|V_a|} \sum_{v \in V_a} s(v). \tag{1}$$

Edges are scored by the target node after a handoff. For all observed transitions from role $a$ to role $b$, denoted $E_{a,b}$, the mean handoff score is

$$\bar{s}_{a,b} = \frac{1}{|E_{a,b}|} \sum_{(v_i, v_{i+1}) \in E_{a,b}} s(v_{i+1}). \tag{2}$$

Thus, an edge score answers: when the conversation moves from a to b, how good is the next contribution on average? These scores are diagnostic associations, not causal estimates, because task type, transcript context, and selector behavior also influence the result.
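As an illustration of Equations (1) and (2), the following sketch aggregates judged traces into mean role scores and mean handoff scores. The trace encoding (a list of `(role, score)` pairs) and the choice to give START no score of its own are assumptions made for the example; the repository's data format may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

# One judged trace: ordered (role label, judge score) pairs.
# Assumption: START carries no score (None); every other node is scored.
JudgedTrace = List[Tuple[str, Optional[float]]]


def role_and_handoff_means(
    traces: List[JudgedTrace],
) -> Tuple[Dict[str, float], Dict[Tuple[str, str], float]]:
    node_scores: Dict[str, List[float]] = defaultdict(list)
    edge_scores: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for trace in traces:
        for role, score in trace:
            if score is not None:
                node_scores[role].append(score)  # occurrences V_a
        for (src, _), (dst, dst_score) in zip(trace, trace[1:]):
            if dst_score is not None:
                # Equation (2): an edge is scored by its target node.
                edge_scores[(src, dst)].append(dst_score)
    role_means = {role: mean(s) for role, s in node_scores.items()}
    edge_means = {edge: mean(s) for edge, s in edge_scores.items()}
    return role_means, edge_means


traces = [
    [("START", None), ("artifact_parser", 7.0),
     ("domain_reasoner", 8.0), ("COMPLETE", 9.0)],
    [("START", None), ("artifact_parser", 5.0), ("COMPLETE", 6.0)],
]
role_means, edge_means = role_and_handoff_means(traces)
```

In this toy input the parser is selected twice with scores 7 and 5, so its mean role score is 6, and the mean for COMPLETE (7.5) plays the role of the final-answer quality of the configuration.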

The analysis therefore reports three complementary quantities: final output quality from COMPLETE, role-level performance from node means, and handoff-level performance from edge means. We also inspect counts, standard deviations, number of observed edges, graph density, and category-level role usage. Low-scoring repeated handoffs are candidates for routing constraints. This is how the pruned verification-heavy configuration is obtained: it keeps the same roles but excludes four weak transitions identified in the initial SNA.
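One way to turn low-scoring repeated handoffs into routing constraints is sketched below. The thresholds `min_count` and `max_mean` are hypothetical, since the paper does not state a numeric cut-off, and restricting candidates to agent-to-agent transitions (ignoring edges from START or into COMPLETE) is likewise an assumption. The output uses the `{from, to}` pair shape of the configuration format.

```python
from statistics import mean
from typing import Dict, List, Tuple


def weak_handoff_candidates(
    edge_scores: Dict[Tuple[str, str], List[float]],
    min_count: int = 3,     # hypothetical: how often an edge must recur
    max_mean: float = 5.0,  # hypothetical: mean score at or below which it is weak
) -> List[Dict[str, str]]:
    """Return repeated, low-scoring agent-to-agent handoffs as {from, to} pairs."""
    candidates = []
    for (src, dst), scores in sorted(edge_scores.items()):
        if src == "START" or dst == "COMPLETE":
            continue  # assumption: only transitions between agents are pruned
        if len(scores) >= min_count and mean(scores) <= max_mean:
            candidates.append({"from": src, "to": dst})
    return candidates


edge_scores = {
    ("artifact_parser", "domain_reasoner"): [4.0, 5.0, 3.0],          # repeated and weak
    ("domain_reasoner", "counterexample_hunter"): [9.0, 8.0, 9.0],    # repeated but strong
    ("artifact_parser", "counterexample_hunter"): [2.0],              # weak but seen once
}
excluded = weak_handoff_candidates(edge_scores)
```

Requiring a minimum count keeps a single unlucky transition from being forbidden. The flagged pairs remain diagnostic associations, so a revised configuration still needs to be re-evaluated.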

3.3. PM-LLM-Benchmark and Judge

The evaluation uses 57 questions from PM-LLM-Benchmark [5], a benchmark for assessing LLM answers to process mining tasks expressed through textual logs, models, constraints, queries, and analysis goals. The questions are grouped into eight categories: (i) event log interpretation, covering activities, cases, timestamps, resources, and data quality; (ii) conformance and anomaly analysis, covering deviations from rules or models; (iii) process model generation, covering formal or semi-formal behavior descriptions; (iv) querying and model description, covering SQL-style logic and model explanations; (v) hypothesis generation, covering diagnostic hypotheses and evidence plans; (vi) fairness, covering group comparisons and responsible interventions; (vii) object-centric and advanced representations, covering richer process views; and (viii) optimization, covering bottlenecks, capacity, rework, and redesign.

All final answers and intermediate agent nodes are scored by openai/gpt-5.4 on a 1.0 to 10.0 process mining rubric. We report means, population standard deviations, call counts, graph summaries, and paired differences over the 57 common questions. Paired p-values use a simple two-sided normal approximation and are reported descriptively rather than as confirmatory tests. The complete evaluation and diagnosis workflow is summarized in Figure 3.
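A paired comparison with a two-sided normal approximation can be computed as in the sketch below. The function name `paired_normal_summary` is hypothetical, and the use of the sample standard deviation for the standard error is an assumption: the paper states only that a simple normal approximation is used.

```python
import math
from statistics import mean, stdev
from typing import Dict, List


def paired_normal_summary(
    config_scores: List[float], baseline_scores: List[float]
) -> Dict[str, float]:
    """Paired differences (configuration minus baseline) over common questions."""
    if len(config_scores) != len(baseline_scores) or len(config_scores) < 2:
        raise ValueError("need two equally long score lists with at least 2 items")
    diffs = [c - b for c, b in zip(config_scores, baseline_scores)]
    mean_diff = mean(diffs)
    # Assumption: standard error from the sample standard deviation.
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0.0:
        p_value = 1.0 if mean_diff == 0.0 else 0.0
    else:
        z = mean_diff / std_err
        # Two-sided normal tail: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
        p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "mean_diff": mean_diff,
        "p_value": p_value,
        "wins": sum(d > 0 for d in diffs),
        "losses": sum(d < 0 for d in diffs),
    }


# Toy scores for four questions (illustrative values, not benchmark results).
summary = paired_normal_summary([7.0, 8.0, 6.0, 9.0], [6.0, 8.0, 7.0, 7.0])
```

Reporting wins and losses next to the mean difference helps in cases where a positive mean is driven by a few large gains while the number of improved and worsened questions is balanced.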

3.4. Configuration Rationale and Research Questions

The three configurations in Table 1 are intended to explore different sources of potential benefit: stable parse-and-audit behavior, broad process mining specialization, and verification-oriented review. The pruned variant is intended to explore whether SNA-based diagnostics can inform a configuration’s routing without changing its roles.

In this spirit, we organize the empirical study around four exploratory research questions. RQ1 examines whether the proposed configurations tend to behave differently from direct answering in terms of final-answer quality. RQ2 examines which roles and handoffs appear to be associated with higher or lower judged quality in the recorded traces. RQ3 examines whether excluding handoffs flagged as low-performing by the SNA diagnostics is associated with changes in final-answer quality for a given configuration. RQ4 examines the extent to which a fixed configuration yields comparable social networks across different base models. The questions are framed as diagnostic rather than confirmatory: with 57 benchmark questions and judge-based scoring, the evidence we report is best read as descriptive support for or against a hypothesis, not as a definitive test.

All experiments are conducted via OpenRouter using openai/gpt-5.4-mini, x-ai/grok-4.20, and qwen/qwen3.6-35b-a3b with reasoning disabled. The empirical tables (Table 2, Table 3 and Table 4) report all three base models side by side, so the same observations can be cross-checked across them. For readability, our discussion of RQ1 through RQ3 concentrates on the openai/gpt-5.4-mini columns, and the cross-model comparison for RQ4 concentrates on the balanced configuration; the corresponding figures focus on these slices accordingly. We made this choice for narrative clarity rather than because the other combinations behave very differently, and the qualitative picture across the remaining configuration and model combinations is broadly similar. The complete set of runs, judged traces, and SNA summaries is available in the project repository (see Section 5).

4. Results

We report the results in three parts, addressing the four research questions in turn.

4.1. Effectiveness of Process Mining Configurations

This subsection addresses RQ1 and the first contribution. The overall picture from Table 2 is that, on these 57 questions, none of the conversational configurations clearly dominates direct answering. The no-agent baseline turns out to be a competitive reference point. On openai/gpt-5.4-mini, the balanced and artifact-pipeline configurations sit close to direct answering on average, the unpruned verification-heavy configuration sits slightly below it, and the pruned verification-heavy configuration is the highest scoring of the five with somewhat lower variance. On x-ai/grok-4.20 and qwen/qwen3.6-35b-a3b, direct answering is the strongest reference, and the pruned verification-heavy configuration tracks closely behind it on x-ai/grok-4.20. We therefore read the results as suggestive rather than conclusive: adding role-specialized personas does not, by itself, reliably translate into better final answers, but it does not appear to be uniformly harmful either when routing is reasonable. Looking at the per-category columns, the configurations differ more from each other on the harder categories (notably the formal modeling category c03 and the object-centric category c07) than on the easier ones, which is consistent with the idea that conversational deliberation matters most when the underlying task is itself harder for a single call.

4.2. SNA-Based Diagnosis of Roles and Handoffs

This subsection addresses RQ2 and RQ3 and illustrates the second contribution. The SNA diagnostics provide two complementary views of a configuration: a node-level view, which asks how individual roles score when they are selected, and an edge-level view, which asks how the next contribution scores after a particular handoff. Both views are informative, and they tend to highlight different kinds of issues.

At the node level (Table 3), reviewer-oriented roles tend to receive comparatively high intermediate scores when they are selected. On openai/gpt-5.4-mini, this shows up in the counterexample hunter, the bias and impact auditor, and the data consistency auditor within the verification-heavy configuration, and in the hypothesis generator and query analyst within the balanced configuration. Broader generative roles such as the process modeler are also selected often, but their scores are more variable. This view is useful on its own: it helps locate where a configuration is reliably contributing and where its variance comes from, independently of the final answer. The category usage in Table 4 adds a useful overlay, showing that certain roles are activated mainly for the categories they target (for example, the fairness reviewer concentrates on c06), while others such as the artifact parser and the answer precision auditor appear broadly across the benchmark.

The edge-level view turns out to be more actionable. In the unpruned verification-heavy configuration, a small set of handoffs recurs across questions and consistently produces lower-scoring next contributions, even when the roles on either end are individually reasonable. Treating those transitions as forbidden, while keeping the same role set, produces the pruned configuration; the resulting handoff graph (Figure 4) is smaller and more focused, and the associated final-answer mean reported in Table 2 is modestly higher with somewhat lower variance on openai/gpt-5.4-mini. The category-level usage of the pruned configuration in Table 4 indicates that the routing changes preserve the original coverage rather than concentrating effort on a smaller part of the benchmark.

Reading the two views together brings out a recurring pattern: a strong intermediate contribution does not automatically translate into a strong final answer, because the synthesis step still has to integrate that contribution and the conversation has to reach synthesis in a useful state. The edge-level diagnostic is helpful precisely here. It points to transitions that tend to leave the conversation in a worse state, rather than to roles that need to be redesigned. We report the pruning result as an illustration of this diagnostic workflow rather than as a causal claim, since the weak handoffs were identified and re-evaluated on the same benchmark.

4.3. Configuration Portability Across Base Models

This subsection addresses RQ4. Comparing runs of the same balanced configuration across the three OpenRouter base models (Figure 5, with supporting counts and role scores in Table 3 and Table 4) suggests that a configuration file is only a partial description of the resulting conversational system. In Figure 5, node and edge colors encode the average judge score (green for higher means, red for lower means), so the visual texture of each network reflects not just its routing pattern but also the quality of the contributions it elicits. With openai/gpt-5.4-mini the selector tends to stop early and yields a sparse network in which most balanced roles are activated only a handful of times across the 57 questions. With x-ai/grok-4.20 and qwen/qwen3.6-35b-a3b, the same roles and the same constraints lead to denser networks with many more role activations and a richer mix of higher and lower scoring transitions. Final-answer quality also varies across models, as can be read off Table 2, but the more interesting observation for our purposes is structural: the social network induced by a configuration is shaped jointly by the role set and by how a particular base model plays the selector role. This argues for treating configurations as portable templates whose behavior should be re-diagnosed when the underlying model changes, rather than assumed, and it is one reason we report SNA summaries alongside final-answer scores throughout the paper.

5. Implementation and Availability

The framework is publicly available at https://github.com/fit-alessandro-berti/agents-sna under GPL-3.0. It consists of a Python package and a single-file browser tool that share the same configuration format, so a configuration designed in one can be loaded in the other without modification.

Repository structure.

The repository follows the four steps of Figure 3. The package src/agents_sna contains the orchestrator with the selector, agent, and final-synthesis prompts; a CLI entry point agents-sna; a benchmark_runner that executes a configuration over a local PM-LLM-Benchmark checkout; an agent_judge that scores saved benchmark artifacts with an LLM-as-a-judge; and an evaluation_network module that aggregates judged traces into the SNA statistics used in this paper. Configurations live in configs/, benchmark outputs in benchmark_runs/, judged node sequences in agent_evaluations/, and unit tests in tests/. A shared OpenRouter client lets the base model and any extra payload (e.g., temperature, reasoning) be set from a single flag.

Configuration format.

A configuration is a self-contained JSON object with four fields. The agents list holds role personas, each given by a name and a description that becomes the system prompt for that role; an empty list reduces the system to direct answering. The integer max_iterations bounds deliberation including the final synthesis call, and the boolean single_agent_per_iteration switches between one-agent-per-turn and multi-agent regimes. The list excluded_handoffs contains pairs {from, to} that the selector is forbidden to take, and is the field used to express the SNA-guided routing constraints discussed in Section 4.2.
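For concreteness, a configuration with the four fields named above could look as follows. The field names come from the text and the role descriptions from Table 1, whereas the underscore spelling of the role names, the budget of 4, the excluded pair, and the `validate_config` helper are illustrative assumptions. The four transitions actually pruned in the paper are not reproduced here.

```python
import json
from typing import Any, Dict, List

CONFIG: Dict[str, Any] = {
    "agents": [
        {"name": "artifact_parser",
         "description": "Extracts entities, activities, timestamps, objects, "
                        "rules, schemas, and missing assumptions."},
        {"name": "domain_reasoner",
         "description": "Builds the main process mining answer across logs, "
                        "discovery, conformance, querying, fairness, and optimization."},
        {"name": "counterexample_hunter",
         "description": "Searches for counterexamples, edge cases, impossible "
                        "orders, and unsupported generalizations."},
    ],
    "max_iterations": 4,                 # illustrative budget
    "single_agent_per_iteration": True,
    "excluded_handoffs": [               # hypothetical pair, for illustration only
        {"from": "counterexample_hunter", "to": "artifact_parser"},
    ],
}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Hypothetical structural check; returns a list of problems (empty if none)."""
    problems: List[str] = []
    names: List[str] = []
    agents = cfg.get("agents")
    if not isinstance(agents, list):
        problems.append("agents must be a list")
    else:
        for agent in agents:
            if (isinstance(agent, dict) and isinstance(agent.get("name"), str)
                    and isinstance(agent.get("description"), str)):
                names.append(agent["name"])
            else:
                problems.append(f"malformed agent entry: {agent!r}")
        if len(set(names)) != len(names):
            problems.append("agent names must be unique")
    budget = cfg.get("max_iterations")
    if isinstance(budget, bool) or not isinstance(budget, int) or budget < 1:
        problems.append("max_iterations must be a positive integer")
    if not isinstance(cfg.get("single_agent_per_iteration"), bool):
        problems.append("single_agent_per_iteration must be a boolean")
    handoffs = cfg.get("excluded_handoffs", [])
    if not isinstance(handoffs, list):
        problems.append("excluded_handoffs must be a list")
    else:
        for pair in handoffs:
            if (not isinstance(pair, dict) or pair.get("from") not in names
                    or pair.get("to") not in names):
                problems.append(f"excluded handoff names an unknown role: {pair!r}")
    return problems


config_json = json.dumps(CONFIG, indent=2)  # the JSON text a tool would load
```

Setting `"agents": []` in the same structure yields the direct-answering baseline described above.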

Pipeline modules.

Given a configuration and a prompt, the orchestrator runs the selector-mediated loop of Section 3.1: each iteration sends the selector the original prompt, the previous agent answers, and the allowed next agents, and the loop ends when the selector returns FINAL or the iteration budget is exhausted. Full request bundles and a compact conversation trace can be exported to JSON. The benchmark runner replays this orchestration over every PM-LLM-Benchmark question, supports concurrent execution and retries on transient OpenRouter failures, and skips already answered questions. The agent judge reads the saved artefacts of a run and produces, for each question, a list of judged nodes with a 1 to 10 score and an explanation; the first node is always START and the last COMPLETE, which gives the final-answer score. The evaluation network module turns a folder of judged traces into the directed social network used in this paper, computing node and edge counts, mean scores, standard deviations, a category-by-role usage table in LaTeX, and a Graphviz drawing in which colour encodes mean scores and edge thickness encodes handoff frequency.
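The drawing convention described above (colour for mean score, thickness for frequency) can be illustrated by generating DOT source directly. This sketch covers edges only and is not the repository's renderer: the function names, the linear red-to-green colour ramp over the 1 to 10 rubric, and the pen-width scaling are assumptions.

```python
from typing import Dict, Tuple


def score_color(mean_score: float, low: float = 1.0, high: float = 10.0) -> str:
    """Map a mean judge score to a hex colour from red (low) to green (high)."""
    t = min(1.0, max(0.0, (mean_score - low) / (high - low)))
    return f"#{round(255 * (1 - t)):02x}{round(255 * t):02x}00"


def handoffs_to_dot(edge_stats: Dict[Tuple[str, str], Tuple[int, float]]) -> str:
    """Build DOT text from {(source, target): (count, mean score)}."""
    max_count = max((count for count, _ in edge_stats.values()), default=1)
    lines = ["digraph handoffs {", "  rankdir=LR;"]
    for (src, dst), (count, mean_score) in sorted(edge_stats.items()):
        width = 1.0 + 4.0 * count / max_count  # thicker edge = more frequent handoff
        lines.append(
            f'  "{src}" -> "{dst}" [color="{score_color(mean_score)}", '
            f'penwidth={width:.1f}, label="{mean_score:.1f} (n={count})"];'
        )
    lines.append("}")
    return "\n".join(lines)


dot_source = handoffs_to_dot({
    ("START", "artifact_parser"): (10, 7.5),
    ("artifact_parser", "COMPLETE"): (5, 4.0),
})
```

The resulting string can be pasted into any Graphviz renderer, which keeps the sketch free of third-party dependencies.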

Browser tool.

Figure 6 shows agent_chatbot.html, a self-contained browser tool that exercises the same configuration format interactively. The left panel holds the OpenRouter credentials, the base model, and a JSON editor with a validate button and built-in presets corresponding to the configurations evaluated in this paper, including the pruned verification-heavy variant. The centre panel is a chat composer that shows the prompt, an expandable run details block listing each selector decision and agent contribution, and the final answer. The right panel maintains a live handoff network across the session, with counters for prompts, roles, and edges, a Render graph button that draws the network in-browser through a Graphviz WebAssembly build, and a read-only DOT-source view that can be copied for offline use. The tool therefore acts both as a usability front end for the configurations and as an inspector for the same SNA diagnostics computed in batch by the Python pipeline.

6. Discussion and Conclusion

We studied selector-mediated conversational agent configurations for process mining and proposed an SNA-based methodology to diagnose them from recorded traces. The agents are role-prompted personas on a shared LLM without external tools, so any value they add must come from how the conversation is structured. We evaluated three role sets on PM-LLM-Benchmark and used node- and edge-level judge scores to identify weak handoffs, which were then removed to obtain a pruned configuration.

Three findings stand out. First, agentic configurations are not automatically better than direct answering: the no-agent baseline is competitive, and only the pruned configuration improved over it on average, suggestively rather than conclusively on 57 questions. Second, role and handoff usefulness are distinct: strong intermediate reviewers do not guarantee strong final answers, so synthesis and routing matter at least as much as role design. Third, the same configuration file behaves differently across base models, producing sparse early-stopping networks under one and denser deliberation under others, so role sets are not portable in isolation from selector behavior.

Two practical points follow. Direct answering should remain a strong baseline, and personas should be added only when they contribute a distinct angle such as formal checking, counterexample search, or fairness review. Conversational agent frameworks should also expose traces, configurations, and handoff statistics as first-class artifacts so that routing can be revised on observed evidence rather than intuition.

Several limitations qualify these results. Judge scores are proxies with possible model-specific biases; the pruning was diagnosed and re-evaluated on the same benchmark and should be re-tested on held-out splits; we did not measure cost, latency, run-to-run variance, or human agreement; and tool-less agents mean that “formal” and “conformance” roles remain LLM personas rather than computational checkers. Future work should combine SNA-based diagnosis with human validation, explore tool-augmented personas (e.g., PM4Py or SQL engines), and study category-specific routing and cost-aware optimization. The broader takeaway is that agent conversations can themselves be treated as event data: both a source of answers and evidence for improving the system that produced them.

Acknowledgments

Funded by the European Union. This work has received funding from the European High Performance Computing Joint Undertaking (JU) and from the German Federal Ministry of Research, Technology and Space (BMFTR), the Ministry of Culture and Science of North Rhine-Westphalia (MKW NRW), and the Hessian Ministry of Science and Research, Arts and Culture (HMWK) under grant agreement No 101250682. The project on which this work is based was funded by the German Federal Ministry of Research, Technology and Space (grant 01IS23065). The responsibility for the content of this publication lies with the authors.

References

[1] van der Aalst, W. Process Mining Manifesto. In Proceedings of the Business Process Management Workshops (1); Lecture Notes in Business Information Processing; Springer, 2011; pp. 169–194.
[2] van der Aalst, W. Process Mining - Data Science in Action, Second Edition; Springer, 2016.
[3] Berti, A.; Qafari, M. Leveraging Large Language Models (LLMs) for Process Mining (Technical Report). CoRR 2023, abs/2307.12701.
[4] Berti, A.; Kourani, H. Evaluating Large Language Models in Process Mining: Capabilities, Benchmarks, and Evaluation Strategies. In Proceedings of the BPMDS/EMMSAD@CAiSE. Springer, Lecture Notes in Business Information Processing, 2024; pp. 13–21.
[5] Berti, A.; Kourani, H.; van der Aalst, W. PM-LLM-Benchmark: Evaluating Large Language Models on Process Mining Tasks. In Proceedings of the ICPM Workshops. Springer, Lecture Notes in Business Information Processing, 2024; pp. 610–623.
[6] van der Aalst, W.; Song, M. Mining Social Networks: Uncovering Interaction Patterns in Business Processes. In Proceedings of the Business Process Management, 2004; Springer; Lecture Notes in Computer Science; pp. 244–260.
[7] van der Aalst, W.; Reijers, H.; Song, M. Discovering Social Networks from Event Logs. Comput. Support. Coop. Work. 2005, 14, 549–593.
[8] Wu, Q.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. CoRR 2023, abs/2308.08155.
[9] Li, G.; et al. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. In Proceedings of the NeurIPS, 2023.
[10] Hong, S.; et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In Proceedings of the ICLR. OpenReview.net, 2024.
[11] Zheng, L.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the NeurIPS, 2023.
[12] Liu, Y.; et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the EMNLP, 2023; Association for Computational Linguistics; pp. 2511–2522.
[13] Wasserman, S.; Faust, K. Social Network Analysis: Methods and Applications; Cambridge University Press, 1994.

Figure 1. Selector-mediated conversational framework studied in the paper. A controller chooses a role-specialized process mining persona or stops the discussion; selected agents read the shared transcript, contribute an intermediate answer, and leave a trace that can later be analyzed.

Figure 2. Conversational-agent support in Grok 4.20. The lower screenshot shows an execution trace for a process mining benchmark question, while the upper screenshot shows the active role-specialized agents in the configuration interface. The screenshots illustrate that the balanced configuration can be instantiated in an industrial agent framework.

Figure 3. Evaluation and diagnosis workflow. Stored conversations are judged at node and final-answer level, aggregated into role and handoff networks, and then used to identify routing constraints for revised configurations.

Figure 4. Social network of the pruned verification-heavy configuration based on openai/gpt-5.4-mini. Nodes represent selected roles and final completion; directed edges represent observed handoffs after weak transitions from the unpruned run were excluded.

Figure 5. Balanced-configuration social networks for three OpenRouter base models (reasoning disabled): (a) qwen/qwen3.6-35b-a3b, (b) x-ai/grok-4.20, and (c) openai/gpt-5.4-mini. The same role set leads to different routing behavior: denser deliberation for Grok and Qwen, and a sparser early-stopping pattern for GPT.

Figure 6. Browser-based implementation provided by agent_chatbot.html. The three-panel layout exposes role and routing configuration on the left, the running conversation in the centre, and a live handoff network on the right.

Table 1. Agent roles in the three complete conversational configurations. Each column lists portable role templates and their process mining responsibilities.

Balanced configuration	Verification-heavy configuration	Artifact-pipeline configuration
`event log interpreter`	`artifact parser`	`schema and artifact mapper`
Reads logs, traces, timestamps, resources, case notions, and data-quality caveats.	Extracts entities, activities, timestamps, objects, rules, schemas, and missing assumptions.	Maps log columns, objects, resources, activities, timestamps, rules, tables, and requested output.
`process modeler`	`domain reasoner`	`trace and variant reasoner`
Produces and checks process trees, POWL, DECLARE, log skeletons, temporal profiles, and Petri nets.	Builds the main process mining answer across logs, discovery, conformance, querying, fairness, and optimization.	Reconstructs variants, infers case notions, compares paths, and detects loops or rework.
`conformance anomaly checker`	`formal semantics checker`	`control flow semantics expert`
Compares behavior with models or rules and explains deviations, missing steps, and anomalies.	Checks coherence of process trees, POWL, DECLARE, Petri nets, temporal profiles, and log skeletons.	Reasons about ordering, concurrency, choices, loops, constraints, and model semantics.
`query and sql analyst`	`counterexample hunter`	`performance and resource analyst`
Handles log/model queries, filters, grouping conditions, schemas, and constraints.	Searches for counterexamples, edge cases, impossible orders, and unsupported generalizations.	Analyzes waiting times, bottlenecks, handoffs, workload, batching, and utilization.
`hypothesis generator`	`data consistency auditor`	`data question designer`
Proposes diagnostic hypotheses, evidence needs, and verification strategies.	Verifies case IDs, attributes, resources, time ordering, SQL logic, and performance claims.	Produces hypotheses, diagnostic questions, query ideas, filters, and evidence plans.
`fairness and ethics reviewer`	`bias and impact auditor`	`risk fairness and compliance reviewer`
Checks group impacts, bias risks, and mitigation or monitoring options.	Reviews group outcomes, protected attributes, interventions, and side effects.	Checks biased paths, missing controls, auditability, privacy assumptions, and fragile recommendations.
`optimization consultant`	`benchmark answer judge`	`answer precision auditor`
Identifies bottlenecks, rework, capacity issues, waiting time drivers, and redesign options.	Checks completeness, grounded reasoning, terminology, and likely benchmark quality.	Verifies coverage, terminology, prompt grounding, and unsupported syntax or conclusions.

Table 2. Final-answer scores by base model, configuration, and benchmark category. Overall columns report the mean and population standard deviation over all 57 questions; category columns report mean final-answer score within each PM-LLM-Benchmark category.

Conf.	`openai/gpt-5.4-mini`										`x-ai/grok-4.20`										`qwen/qwen3.6-35b-a3b`
	Avg.	SD	c01	c02	c03	c04	c05	c06	c07	c08	Avg.	SD	c01	c02	c03	c04	c05	c06	c07	c08	Avg.	SD	c01	c02	c03	c04	c05	c06	c07	c08
Direct answering	6.72	2.16	7.30	8.17	4.45	7.01	7.93	6.53	4.07	8.14	5.99	2.56	7.30	7.09	3.41	6.00	7.31	5.14	3.90	7.90	6.38	1.99	6.64	7.02	5.05	6.17	6.86	7.23	4.33	7.80
Balanced	6.75	1.81	6.97	7.74	4.61	6.91	7.94	6.80	5.08	8.00	5.32	2.16	4.80	6.08	4.80	5.60	6.33	5.97	3.23	5.46	5.39	1.83	5.33	6.04	4.08	5.79	5.97	6.24	3.42	6.20
Artifact pipeline	6.74	2.13	7.44	8.04	5.76	6.37	7.11	7.40	3.30	8.00	5.24	2.47	5.26	6.07	4.92	4.39	5.86	5.69	4.10	5.28	5.51	1.63	6.14	6.11	4.54	4.53	6.46	6.11	3.78	6.30
Verification heavy	6.55	1.99	7.15	7.70	4.28	7.47	7.00	6.89	3.80	8.06	5.84	2.13	5.96	6.48	5.64	4.70	6.57	6.50	4.55	5.96	5.32	1.74	6.04	4.78	4.49	5.50	6.33	5.51	3.85	6.34
Ver. heavy excl. hand.	6.98	1.70	7.84	8.01	6.06	6.97	7.14	6.90	4.35	8.28	5.88	2.11	5.10	6.93	4.40	6.67	6.27	6.04	4.92	6.84	5.48	2.05	6.12	6.53	4.72	5.14	5.11	5.54	4.10	6.34
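
As a consistency check on how the overall and category columns relate, the following Python sketch recomputes one overall average as the count-weighted mean of its category means. The category sizes are read from the Direct answer row of Table 4 and the means from the Direct answering row above for `openai/gpt-5.4-mini`. Because the published means are rounded to two decimals, the check can only be expected to agree approximately; the helper name `weighted_mean` is illustrative.

```python
# Questions per category c01..c08 (Direct answer row of Table 4).
category_sizes = [8, 9, 8, 7, 7, 7, 6, 5]

# Direct-answering category means for openai/gpt-5.4-mini (Table 2).
category_means = [7.30, 8.17, 4.45, 7.01, 7.93, 6.53, 4.07, 8.14]

def weighted_mean(means, sizes):
    """Count-weighted mean of per-category means."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

total_questions = sum(category_sizes)
overall = weighted_mean(category_means, category_sizes)
print(total_questions, round(overall, 2))  # prints 57 6.72
```

The result matches the reported question count of 57 and the reported Avg. of 6.72 for that row.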

Table 3. Role-level SNA node scores by configuration and base model. Count is the number of judged role appearances; Avg. and SD are the mean and population standard deviation of the role’s intermediate judge scores. For the no-agent baseline, Direct answer denotes the final answer.

Configuration	Role	`openai/gpt-5.4-mini`			`x-ai/grok-4.20`			`qwen/qwen3.6-35b-a3b`
		Count	Avg.	SD	Count	Avg.	SD	Count	Avg.	SD
Direct answering	Direct answer	57	6.72	2.16	57	5.99	2.56	57	6.38	1.99
Balanced	Event log interpreter	13	6.75	1.64	32	5.15	2.39	14	6.72	1.66
	Process modeler	22	6.24	1.98	58	4.14	2.00	45	4.32	1.66
	Conformance anomaly checker	12	7.22	1.20	52	3.29	2.01	33	5.05	2.20
	Query and SQL analyst	7	7.31	1.33	22	4.60	2.06	12	4.74	1.69
	Hypothesis generator	6	7.82	0.83	46	5.11	1.79	20	5.51	1.84
	Fairness and ethics reviewer	7	7.14	1.28	34	3.73	2.29	20	5.17	2.57
	Optimization consultant	10	6.77	1.71	32	3.67	2.15	39	3.86	1.84
Artifact pipeline	Schema and artifact mapper	41	6.58	1.53	49	6.08	1.63	26	5.92	1.19
	Trace and variant reasoner	2	5.35	2.75	33	4.08	1.87	18	4.44	1.62
	Control flow semantics expert	7	7.16	1.97	57	4.41	2.05	28	5.18	2.03
	Performance and resource analyst	0	–	–	20	3.01	1.16	16	4.74	1.38
	Data question designer	0	–	–	24	4.62	1.91	3	4.43	1.70
	Risk fairness and compliance reviewer	5	7.40	1.25	35	4.29	1.97	11	5.89	1.49
	Answer precision auditor	56	7.53	1.71	53	4.74	2.53	28	4.66	1.71
Verification heavy	Artifact parser	55	6.89	1.79	59	6.88	2.14	26	6.19	1.46
	Domain reasoner	28	6.93	1.32	50	5.27	1.68	54	5.10	1.71
	Formal semantics checker	12	7.02	1.43	63	4.30	2.06	12	5.42	2.17
	Counterexample hunter	10	8.16	0.54	49	5.67	2.46	9	5.56	2.84
	Data consistency auditor	7	7.79	0.78	42	5.38	2.30	8	5.48	1.59
	Bias and impact auditor	4	7.83	0.52	37	3.91	1.85	6	5.80	1.76
	Benchmark answer judge	7	6.73	1.00	58	5.69	2.26	26	5.06	2.09
Ver. heavy excl. hand.	Artifact parser	55	6.53	1.99	58	6.81	2.32	22	6.50	1.69
	Domain reasoner	34	6.82	1.68	52	5.09	1.69	57	5.18	1.49
	Formal semantics checker	10	7.64	1.07	71	4.63	2.35	10	3.50	1.82
	Counterexample hunter	13	7.82	1.14	47	5.66	2.73	16	6.48	2.31
	Data consistency auditor	1	5.80	0.00	41	5.45	2.46	10	4.85	1.41
	Bias and impact auditor	3	7.83	0.73	27	4.38	1.73	7	6.37	1.38
	Benchmark answer judge	2	7.15	0.05	56	5.20	2.23	4	5.02	1.04
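
The Count, Avg., and SD columns can be reproduced with Python's standard `statistics` module, as in the hedged sketch below. It uses `pstdev` (population standard deviation), which is defined even for a single observation, whereas the sample standard deviation is not. The two scores 7.10 and 7.20 are inferred from the reported row with Count 2, Avg. 7.15, and SD 0.05; they are not taken from the raw data, and `summarize` is an illustrative helper name.

```python
from statistics import mean, pstdev

def summarize(scores):
    """Return (count, mean, population SD) for one role's judge scores."""
    return len(scores), mean(scores), pstdev(scores)

# One judged appearance: the population SD is 0, as in the row with
# Count 1, Avg. 5.80, SD 0.00.
single = summarize([5.80])

# Two appearances: the population SD is half the gap between the scores.
# These two values are inferred from the reported 7.15 / 0.05.
pair = summarize([7.10, 7.20])

print(single)
print(tuple(round(v, 2) for v in pair))
```

This explains why rows with very low counts show small or zero SD values: with one or two observations the spread says little about the role's reliability.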

Table 4. Role usage by benchmark category, configuration, and base model. Each model cell reports counts ordered as c01/c02/c03/c04/c05/c06/c07/c08. For the no-agent baseline, Direct answer counts final answers.

Configuration	Role	`openai/gpt-5.4-mini`	`x-ai/grok-4.20`	`qwen/qwen3.6-35b-a3b`
Direct answering	Direct answer	8/9/8/7/7/7/6/5	8/9/8/7/7/7/6/5	8/9/8/7/7/7/6/5
Balanced	Event log interpreter	7/1/1/0/0/0/2/2	7/4/9/0/2/2/3/5	7/2/0/0/0/0/2/3
	Process modeler	1/3/9/4/0/1/2/2	11/9/17/4/3/2/7/5	8/4/9/5/6/2/7/4
	Conformance anomaly checker	0/5/0/2/3/1/0/1	8/9/13/2/5/4/6/5	6/7/3/4/4/3/2/4
	Query and SQL analyst	0/0/0/3/4/0/0/0	6/1/0/8/1/0/3/3	1/0/0/3/2/4/1/1
	Hypothesis generator	0/0/0/0/6/0/0/0	9/6/1/3/10/6/6/5	0/3/0/0/7/5/4/1
	Fairness and ethics reviewer	0/0/0/0/0/7/0/0	6/4/0/3/2/10/4/5	1/1/0/2/3/12/1/0
	Optimization consultant	0/2/0/1/1/0/1/5	7/4/0/3/4/3/6/5	5/6/2/3/6/5/8/4
Artifact pipeline	Schema and artifact mapper	8/5/7/6/5/2/3/5	8/7/8/6/6/3/6/5	7/1/2/4/2/2/5/3
	Trace and variant reasoner	0/1/1/0/0/0/0/0	6/6/2/4/4/1/5/5	2/4/4/1/2/0/5/0
	Control flow semantics expert	0/2/2/3/0/0/0/0	6/8/17/7/6/2/6/5	1/6/7/4/4/1/4/1
	Performance and resource analyst	0/0/0/0/0/0/0/0	3/2/0/3/3/0/4/5	0/2/0/2/2/0/6/4
	Data question designer	0/0/0/0/0/0/0/0	2/2/0/4/6/1/4/5	0/0/0/0/2/0/1/0
	Risk fairness and compliance reviewer	0/0/0/0/0/5/0/0	2/3/0/5/6/10/4/5	0/2/0/1/2/5/0/1
	Answer precision auditor	8/9/9/7/6/7/5/5	5/8/9/6/6/7/6/6	1/5/4/6/6/4/1/1
Verification heavy	Artifact parser	8/9/8/7/7/6/5/5	8/9/11/7/7/6/6/5	4/4/1/4/2/3/6/2
	Domain reasoner	2/1/8/6/6/0/0/5	8/8/6/6/6/5/6/5	6/9/9/8/7/2/7/6
	Formal semantics checker	0/1/6/3/0/2/0/0	8/9/18/7/4/6/6/5	1/0/4/3/0/1/3/0
	Counterexample hunter	1/2/1/1/1/2/0/2	5/7/13/4/4/7/5/4	0/2/0/2/1/0/2/2
	Data consistency auditor	2/2/0/0/3/0/0/0	7/6/7/4/4/4/6/4	4/0/1/1/0/1/1/0
	Bias and impact auditor	0/0/0/0/0/4/0/0	0/4/0/2/7/14/6/4	0/0/0/1/0/4/0/1
	Benchmark answer judge	1/1/1/2/1/0/0/1	6/7/11/7/10/7/5/5	4/3/5/3/4/3/3/1
Ver. heavy excl. hand.	Artifact parser	8/9/8/7/7/6/5/5	8/10/9/7/7/5/7/5	5/2/3/3/2/2/5/0
	Domain reasoner	3/4/7/7/5/1/2/5	8/8/7/7/7/3/6/6	8/9/8/7/9/3/7/6
	Formal semantics checker	0/1/5/3/0/1/0/0	10/8/26/7/4/5/6/5	1/1/4/3/1/0/0/0
	Counterexample hunter	2/4/1/1/2/2/0/1	6/6/12/7/4/4/5/3	1/3/3/2/3/0/2/2
	Data consistency auditor	1/0/0/0/0/0/0/0	6/6/5/8/4/4/4/4	2/1/1/1/3/1/1/0
	Bias and impact auditor	0/0/0/0/0/3/0/0	3/3/0/5/4/9/0/3	0/1/0/1/0/5/0/0
	Benchmark answer judge	0/0/0/1/0/1/0/0	7/9/5/8/10/4/8/5	0/1/0/1/1/1/0/0
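
The slash-separated cells of Table 4 can be cross-checked against the Count column of Table 3 with a short Python sketch. The helper name `parse_usage` and the dictionary layout are illustrative assumptions; the cell strings and counts are copied from the Balanced rows for `openai/gpt-5.4-mini` in the two tables.

```python
def parse_usage(cell):
    """Split a 'c01/.../c08' usage cell into eight integer counts."""
    counts = [int(part) for part in cell.split("/")]
    if len(counts) != 8:
        raise ValueError(f"expected 8 category counts, got {len(counts)}")
    return counts

# role -> (usage cell from Table 4, Count from Table 3)
rows = {
    "Event log interpreter": ("7/1/1/0/0/0/2/2", 13),
    "Process modeler": ("1/3/9/4/0/1/2/2", 22),
    "Conformance anomaly checker": ("0/5/0/2/3/1/0/1", 12),
}

# Collect roles whose per-category usage does not sum to the role count.
mismatches = {
    role: (sum(parse_usage(cell)), count)
    for role, (cell, count) in rows.items()
    if sum(parse_usage(cell)) != count
}
print(mismatches)  # prints {}
```

For these three rows the per-category usage sums to the judged role appearances, so the two tables agree.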

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.