Preprint
Review

This version is not peer-reviewed.

Beyond Dyadic Dialogue: A Comprehensive Survey of Multi-Party Dialogue Systems

  † These authors are co-first authors.

Submitted: 21 June 2026

Posted: 22 June 2026


Abstract
Dialogue systems have evolved remarkably, from rule-based dyadic interfaces to large language model (LLM) powered conversational agents. Yet the predominant focus on two-party exchanges leaves a significant gap: most real-world communication, from business meetings to online group chats, is inherently multi-party. Multi-Party Dialogue (MPD) introduces new challenges in participant tracking, discourse structure modeling, and pragmatic reasoning that simpler dyadic systems are not equipped to handle. In this survey, we present a comprehensive examination of MPD research, organized along four axes: tasks, methods, datasets, and evaluation. We trace the methodological trajectory from early statistical and supervised models, through neural and graph-based approaches, to modern LLM-driven systems and emerging multi-agent paradigms. We further synthesize benchmark resources and evaluation protocols, and identify four central bottlenecks that constrain progress: combinatorial complexity of N-party dynamics, the structural-semantic divide, evaluation inadequacy, and data scarcity. We map these challenges to concrete research opportunities, offering a roadmap for building socially intelligent, holistically evaluated MPD systems.

1. Introduction

Dialogue systems form a foundational pillar of natural language processing (NLP). From early rule-based assistants to modern LLM-powered conversational agents, the field has primarily targeted dyadic interactions: a single user engaging with a single system. Typical applications include customer service, question-answering bots, task-oriented dialogue, and knowledge-grounded healthcare dialogue such as cognitive stimulation systems for elderly users Jiang et al. (2023). With a limited number of participants and relatively straightforward turn structure, these systems have benefited from modeling paradigms for tracking conversational state, contextual grounding, and response generation Vaswani et al. (2017).
However, dyadic settings capture only a slice of human communication. In meetings, classrooms, online forums, and collaborative platforms, conversations typically involve multiple participants who exchange information, negotiate intent, and share knowledge in parallel. Such Multi-Party Dialogue (MPD) is fundamentally more complex Ishizaki and Kato (1998); Sapkota et al. (2025). Three dimensions distinguish MPD from its dyadic counterpart. First, the participant dimension requires speaker tracking, addressee recognition, and turn management. Second, the discourse structure dimension involves frequent topic shifts, interleaved threads, and multi-level discourse organization. Third, the task dimension demands modeling of social roles, intent dynamics, and recurring interaction patterns alongside coherent generation.
Across these dimensions, MPD research has progressed through several methodological waves. Early systems relied on statistical and rule-based methods such as Hidden Markov Models, Conditional Random Fields, and finite-state machines for dialogue state tracking and intent recognition Martínez-Hinarejos et al. (2010); Shang et al. (2020); Zhu et al. (2010). The deep learning era introduced CNN, RNN, and LSTM architectures that improved nonlinear context modeling Mangrulkar et al. (2018); Wen et al. (2015); Skantze (2017). More recently, large language models such as GPT, T5, LLaMA, and Qwen have shown strong language understanding and generation capabilities, providing new affordances for long-context, multi-speaker reasoning. The latest frontier extends LLMs into multi-agent and hybrid paradigms, enabling autonomous role assignment, information sharing, and human-machine collaboration in group settings Sapkota et al. (2025); Wang et al. (2026).
Despite this progress, the field faces persistent obstacles. Existing models struggle to manage combinatorial information flow among speakers and to maintain coherent discourse structure across interleaved threads. Datasets remain limited in scale, modality, and annotation depth compared to dyadic resources. Evaluation protocols largely inherit metrics such as BLEU and F1 from NLP tasks, which fail to capture the social, structural, and collaborative qualities that matter in group conversation. Together, these gaps slow both methodological maturation and practical deployment.
This survey synthesizes the current state of MPD research with the goal of clarifying where the field stands and where it should go. Our contributions are as follows: (1) We propose a unified taxonomy of MPD tasks spanning six categories, from participant modeling to high-level decision support (Section 2). (2) We chart the methodological trajectory of MPD systems, organizing approaches from statistical models, through neural and graph-based methods, to modern LLM and multi-agent paradigms, with a complete catalog of 50+ methods (Section 3). (3) We catalogue representative datasets and evaluation metrics, highlighting their coverage, limitations, and complementarity (Section 4 and Section 5). (4) We articulate four fundamental challenges facing MPD research and map each to concrete opportunities, providing a strategic roadmap for the next generation of socially intelligent dialogue systems (Section 6).

2. Tasks in Multi-Party Dialogue

MPD encompasses a diverse landscape of tasks that reflect its inherent multi-speaker, multi-thread complexity. Figure 1(a) provides a visual taxonomy. We organize existing tasks into six interdependent groups, presenting task definitions and representative work for each.

2.1. Dialogue Understanding

Dialogue understanding tasks aim to capture the semantic content of multi-party dialogues, such as goals, emotions, and speaker intentions. These tasks provide the foundation for higher-level modeling and reasoning.
Goal Tracking and Intent-Slot Recognition. This task involves detecting and tracking conversational goals as well as identifying semantic slots. In multi-party contexts, the difficulty arises from overlapping goals among participants and potential goal shifts during interactions Addlesee et al. (2023); Penzo et al. (2024). Recent work explores hybrid intent recognition frameworks that combine lightweight models such as BERT with LLMs under zero- and few-shot settings. By leveraging uncertainty-based routing and sharing probability-based information between models, these approaches improve intent recognition and out-of-scope detection performance in multi-party conversations while reducing computational costs Castillo-López et al. (2025).
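The uncertainty-based routing idea can be illustrated with a minimal sketch. This is not the implementation of Castillo-López et al. (2025): the entropy criterion, the threshold value, and the names `route_intent` and `llm_fallback` are assumptions chosen for illustration only.

```python
import math
from typing import Callable, Dict


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a label distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def route_intent(
    probs: Dict[str, float],
    llm_fallback: Callable[[Dict[str, float]], str],
    threshold: float = 0.5,  # hypothetical value; would be tuned on held-out data
) -> str:
    """Keep the lightweight model's top label when it is confident.

    Otherwise defer to the (hypothetical) LLM callable, passing the
    lightweight model's probabilities along as shared information.
    """
    if entropy(probs) <= threshold:
        return max(probs, key=probs.get)
    return llm_fallback(probs)
```

Under this sketch, only high-entropy utterances incur the cost of an LLM call, which is the intuition behind the reported cost reduction.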
Emotion Recognition in MPD. Recognizing emotions across multiple participants is crucial for building socially intelligent systems. Emotion dynamics in MPD are more complex than in dyadic dialogues, as emotional contagion and group-level sentiment often emerge Sun et al. (2021).
Dialogue Act Classification. Assigning functional roles (for example, question, answer, suggestion, agreement) to utterances supports downstream applications such as summarization and decision support. In MPD, dialogue act recognition must consider both local context and cross-speaker dependencies Martinenghi et al. (2024); Mayfield et al. (2012).
Multi-party Dialogue Reading Comprehension (MRC). MPD-MRC extends traditional machine reading comprehension to dialogue transcripts. It requires reasoning over multiple turns and speakers, integrating discourse and role information for accurate question answering Li and Zhao (2023, 2021).
Disruptive Talk Detection. Detecting disruptive or toxic utterances is important for maintaining constructive discussions in collaborative or educational settings Park et al. (2022).
Argument Prediction in Debates. In debate-style MPD, predicting which arguments are more persuasive or likely to win is a unique task with implications in computational social science Sia et al. (2022).

2.2. Dialogue Structure Modeling

These tasks aim to capture and represent the structural properties of MPD, which are often more complex than in dyadic dialogues.
Turn-taking Prediction and Modeling. MPD requires predicting when a participant will speak and determining the next speaker. This is challenging due to overlapping speech, interruptions, and implicit floor management strategies Gatti de Bayser et al. (2019); Laskowski (2010); Enomoto et al. (2020); Wang et al. (2024).
Topic Segmentation and Identification. Identifying topic boundaries and recognizing emerging topics in multi-party discussions is crucial for summarization and retrieval. Unlike single-topic dialogues, MPD often involves multiple threads of conversation Purver et al. (2006); Galley et al. (2003); Nguyen et al. (2013).
Dynamic Topic Disentanglement. Conversations in MPD are often interleaved. Disentanglement aims to separate intertwined threads into coherent topic-specific sub-dialogues Wang et al. (2020).
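To make the expected output of disentanglement concrete, the following sketch groups utterances into threads, assuming that a reply-to link for each utterance has already been predicted by some upstream model. It is a generic illustration rather than the method of Wang et al. (2020); the function name and the input encoding are assumptions.

```python
from typing import Dict, List, Optional, Sequence


def threads_from_reply_links(reply_to: Sequence[Optional[int]]) -> List[List[int]]:
    """Group utterance indices into threads by following reply-to links.

    reply_to[i] is the index of the earlier utterance that utterance i
    replies to, or None if utterance i starts a new thread.
    """
    root: List[int] = []              # root[i] = first utterance of i's thread
    threads: Dict[int, List[int]] = {}
    for i, parent in enumerate(reply_to):
        if parent is None:
            r = i
        else:
            if not 0 <= parent < i:
                raise ValueError("a reply must point to an earlier utterance")
            r = root[parent]
        root.append(r)
        threads.setdefault(r, []).append(i)
    return list(threads.values())
```

For example, the interleaved link sequence `[None, None, 0, 1, 2, None]` separates into three threads: utterances 0, 2, 4; utterances 1, 3; and utterance 5.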
Discourse Parsing and Structure Prediction. Parsing discourse relations (for example, elaboration, contrast, agreement) and predicting conversation structures are essential for understanding the logical flow of MPD. Successful entry prediction, segmentation, and coherence modeling fall under this area Afantenos et al. (2015); Li et al. (2023); Liu and Chen (2021); Liu et al. (2025); Mayfield et al. (2012).
Decision-making Sub-dialogue Detection. Many MPD scenarios, such as meetings, involve decision-making. This task detects sub-dialogues where group decisions are negotiated Fernández et al. (2008); Bui et al. (2009).
Multi-turn Alignment and Chain-of-Thought Reasoning. Aligning utterances across turns and modeling explicit reasoning chains in MPD enable advanced reasoning-oriented applications Kiruluta et al. (2025).

2.3. Response and Generation

Response and generation tasks are at the heart of dialogue system development. In MPD, response generation is complicated by multiple potential addressees, shifting contexts, and social constraints.
Response Generation in MPC. Generating contextually appropriate responses in multi-party dialogues is one of the most fundamental challenges. It requires reasoning about speaker history, dialogue context, and social dynamics Fan et al. (2024); Gu et al. (2022); Gu et al. (2023); Hu et al. (2025); Li et al. (2023); Liu et al. (2019); Tan et al. (2023); Wang et al. (2020).
Response Selection. Given multiple candidate responses, the task is to select the most appropriate one. This often serves as an intermediate step for evaluating dialogue models Gu et al. (2021a); Gu et al. (2021b); Gu et al. (2023); Ouchi and Tsuboi (2016); Penzo et al. (2024); Wang et al. (2020).
Response Generation with Addressee Prediction. Unlike dyadic dialogues, MPD responses must often include an explicit addressee, requiring models to jointly predict the response and its recipient Li and Zhao (2023).
Persona-aware Generation. This task focuses on maintaining speaker-specific personalities or roles across responses, which is especially relevant in long-term or role-based conversations Ju et al. (2022); Mahajan and Shaikh (2024).
Empathetic Dialogue Generation in MPD. Empathy is particularly complex in MPD, as systems must balance the emotions of multiple participants while maintaining coherence Zhu et al. (2022). This challenge is exemplified by group cognitive stimulation dialogue for elderly users with cognitive impairment, where an adaptive policy must coordinate therapeutic goals across several participants simultaneously Jiang et al. (2026).
Dialogue Generation with Next-Speaker Prediction. This task extends response generation to also predict which participant should speak next, aligning generation with realistic conversation dynamics Wang et al. (2024).
Tool-calling in Multi-party Dialogue. Emerging work incorporates external APIs or tools (for example, calendars, databases) during multi-turn group conversations, making dialogue more interactive and actionable Jang et al. (2025). Beyond static tool invocation, recent work explores tool-driven intrinsic adaptation, where agents unify task execution with self-reconfiguration of their tool repertoire Zhou et al. (2026).

2.4. Role and Participant Modeling

Since MPD inherently involves multiple participants, modeling their roles, identities, and interactions is a crucial research focus.
Addressee Recognition and Identification. Identifying the intended recipient of an utterance is a fundamental task in MPD. It is closely tied to response generation and turn-taking modeling Gu et al. (2021a); Gu et al. (2021b); Gu et al. (2023); Heylen and op den Akker (2007); Le et al. (2019); Ouchi and Tsuboi (2016); Penzo et al. (2024); Zhu et al. (2023).
Speaker Identification and Classification. Determining who is speaking in overlapping or noisy conditions is a prerequisite for many higher-level tasks Du et al. (2022); Gu et al. (2021a); Gu et al. (2021b); Gu et al. (2023); Meng et al. (2018).
Overlapped Speaker Diarization. This task aims to separate overlapping speech and attribute it correctly to different participants, which is especially challenging in real-time settings Du et al. (2022).
Social Role and Seniority Inference. Inferring roles such as leader, moderator, or newcomer provides insights into group dynamics and social hierarchy Laskowski et al. (2008).
Interaction Pattern Analysis. Beyond individual roles, this task models recurring interaction structures, such as supportive behaviors, challenges, or questioning patterns Heylen and op den Akker (2007).

2.5. Summarization and Decision Support

Summarization and decision-support tasks seek to condense lengthy dialogues into concise, actionable outputs. These are particularly relevant in professional or collaborative contexts.
Dialogue Summarization. This task generates abstractive or extractive summaries of group discussions, which requires modeling global context, topic transitions, and participant contributions Chen and Yang (2020).
Extractive Summarization in MPC. This variant of summarization selects salient utterances instead of generating new text Chen and Metze (2012).
Automatic Minuting. This specialized form of summarization produces structured meeting minutes with decisions, action items, and responsibilities Bhatnagar et al. (2022).
Decision Discussion Detection and Summarization. This task identifies decision-related dialogues and produces concise summaries tailored to decision-making processes Bui et al. (2009).

2.6. Advanced and Emerging Tasks

Beyond classical tasks, researchers have explored advanced and domain-specific problems in MPD.
Machine Translation in MPD. This task translates utterances in multilingual multi-party conversations while preserving role and context Bhatnagar et al. (2022).
Extractive Question Answering in MPC. This task answers factual questions directly from conversation transcripts Li et al. (2023); Zhou et al. (2025).
Negotiation Strategy Learning. This task learns strategies for successful negotiation in multi-party discussions, with applications in economics and multi-agent systems Hiraoka et al. (2015).
Entity Linking in MPC. This task links entity mentions within dialogues to external knowledge bases, facilitating knowledge-grounded conversations Aina et al. (2019).
Crucially, these tasks are deeply interdependent. Accurate addressee recognition strengthens response generation, while reliable discourse parsing improves summarization. This interdependence motivates the recent trend toward holistic, multi-task modeling discussed in Section 6.

3. Methods in Multi-Party Dialogue

The methodological history of MPD mirrors the broader arc of NLP, progressing from statistical models, through neural and graph-based architectures, to large language models and multi-agent systems. Figure 2 visualizes this evolution along four major research branches that have continuously matured over two decades, while Table 1 highlights representative methodological milestones along the trajectory. The complete catalog of 50+ methods appears in Table 4, grouped chronologically into the statistical/early-neural era (pre-2020), the pre-trained model and graph era (2020–2023), and the LLM and multi-agent era (2024–2026).

3.1. From Statistical Models to Neural Networks

Early MPD systems leveraged statistical learning. Classical models for next-speaker prediction (MLE, SVM, CNN) demonstrated the value of semantic features in turn dynamics Gatti de Bayser et al. (2019). Beyond turn dynamics, statistical approaches addressed a wide range of structural and social tasks. Unsupervised Bayesian generative topic models offered robustness to ASR noise and domain shift for topic segmentation in noisy meeting transcripts Purver et al. (2006), while feature-driven segmentation combined lexical and acoustic cues for domain-agnostic boundary detection Galley et al. (2003). ILP formulations with sociolinguistic constraints tackled hierarchical structure prediction Mayfield et al. (2012), MST-based dependency parsing modeled discourse structure in multi-party chat Afantenos et al. (2015), and structured labeling together with directed-graph methods detected and summarized decision-related sub-dialogues Fernández et al. (2008); Bui et al. (2009). Complementary lines of work predicted social dimensions and seniority from interaction patterns Laskowski et al. (2008), analyzed backchannel and gaze cues for interaction modeling Heylen and op den Akker (2007), applied random-walk extractive summarization to meeting transcripts Chen and Metze (2012), learned negotiation strategies in simulated trading via reinforcement learning Hiraoka et al. (2015), and linked entity mentions to knowledge bases Aina et al. (2019).
The shift to neural architectures introduced richer context modeling. Joint static-dynamic RNN models tackled addressee and response selection together Ouchi and Tsuboi (2016), W2W learned to identify utterance addressees from raw context Le et al. (2019), and neural speaker-classification models compared temporal, content-driven, and hybrid representations, revealing the benefit of integrating sequential and semantic information Meng et al. (2018). Context encodings that integrate speaker and addressee role information improved response generation Liu et al. (2019), and BERT-based multi-task models enabled dynamic topic tracking for reply selection Wang et al. (2020). These neural advances laid the groundwork for the pre-trained model era that followed.

3.2. Pre-trained Models and Graph Reasoning

Pre-trained language models reshaped MPD research. MPC-BERT jointly models speaker identity and utterance semantics through self-supervised tasks including reply-to recognition, identical speaker searching, and pointer consistency distinction Gu et al. (2021a); Gu et al. (2021b), while latent discourse inference exploits unlabeled multi-party data efficiently Li et al. (2023). Pseudo-self-supervised pre-training likewise improves multi-party reading comprehension and dialogue QA Li and Zhao (2021), and cross-domain Transformer parsers transfer discourse-parsing capability across corpora Liu and Chen (2021). To address the graph-structured nature of MPD, heterogeneous graph neural networks such as HeterMPC Gu et al. (2022) and MADNet Gu et al. (2023) explicitly model utterances and interlocutors as typed nodes with multiple meta-relations, with EM-based optimization handling missing addressee data. Lightweight graph-induced fine-tuning (GIFT) integrates structural signals into pre-trained Transformers with only a handful of additional parameters per encoding layer Gu et al. (2023), and persona-aware variants enhance speaker consistency through personality-based embeddings Mahajan and Shaikh (2024); Ju et al. (2022). Global-local graph reasoning further supports multi-hop question answering over complex multi-party contexts Li and Zhao (2023), and graph-aware sentiment models capture emotion propagation across speakers Sun et al. (2021). The same period also saw multi-view BART for abstractive summarization Chen and Yang (2020), static-dynamic empathy frameworks (SDMPED) for emotion-aware generation Zhu et al. (2022), discrete addressee codebook encodings for robust addressee recognition Zhu et al. (2023), and neural diarization models such as SOND for overlapped speech Du et al. (2022).
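The typed-node view of a multi-party dialogue can be sketched as a small data structure. The sketch below only builds the graph; it does not implement HeterMPC or MADNet, and the relation names (`speaks`, `replies_to`, `addresses`, and their inverses) as well as the `Utterance` fields are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Node = Tuple[str, object]        # (node type, identifier)
Edge = Tuple[Node, str, Node]    # (source, relation name, target)


@dataclass(frozen=True)
class Utterance:
    idx: int
    speaker: str
    text: str
    reply_to: Optional[int] = None   # index of the parent utterance, if known
    addressee: Optional[str] = None  # often missing in real corpora


def build_hetero_graph(dialogue: List[Utterance]) -> List[Edge]:
    """Return typed edges linking utterance nodes and interlocutor nodes."""
    edges: List[Edge] = []
    for u in dialogue:
        u_node: Node = ("utterance", u.idx)
        s_node: Node = ("interlocutor", u.speaker)
        edges.append((s_node, "speaks", u_node))
        edges.append((u_node, "spoken_by", s_node))
        if u.reply_to is not None:
            p_node: Node = ("utterance", u.reply_to)
            edges.append((u_node, "replies_to", p_node))
            edges.append((p_node, "replied_by", u_node))
        if u.addressee is not None:
            a_node: Node = ("interlocutor", u.addressee)
            edges.append((u_node, "addresses", a_node))
            edges.append((a_node, "addressed_by", u_node))
    return edges
```

Utterances with an unknown addressee simply contribute no `addresses` edge here; handling such gaps in a principled way is what the EM-based optimization mentioned above is meant to address.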

3.3. Large Language Models and Multi-Agent Paradigms

Recent work probes the multi-party capabilities of large language models. Systematic evaluations expose where ChatGPT and GPT-4 succeed and fail in multi-party reasoning Tan et al. (2023); Penzo et al. (2024), and zero- and few-shot LLMs are tested on dialogue act classification Martinenghi et al. (2024). Reasoning-oriented prompting often outperforms conventional fine-tuning for goal tracking and intent-slot recognition Addlesee et al. (2023), and hybrid intent recognition frameworks combine BERT-scale models with LLMs through uncertainty routing to balance accuracy and cost Castillo-López et al. (2025). Reinforcement learning approaches optimize response coherence and negotiation outcomes, using utterance-level rewards Fan et al. (2024); Hiraoka et al. (2015) and cross-attention historical reward functions for self-supervised, context-aware generation Kiruluta et al. (2025); reward-decomposition strategies that mine latent logic further enable self-evolving reasoning Chen et al. (2026); while logical reasoning memory networks enhance QA accuracy Zhou et al. (2025). LLM-based explainable discourse parsing integrates contrastive learning to improve utterance-level accuracy Liu et al. (2025), and contrastive history-aware models further refine response consistency Hu et al. (2025).
The latest frontier extends MPD into multi-agent territory. MuPaS explores supervised fine-tuning for group dialogue generation with next-speaker prediction Wang et al. (2024); DICE-BENCH evaluates tool-calling in multi-round, multi-party scenarios with 1,607 synthesized dialogues Jang et al. (2025); SS-MPC encodes dialogue structure directly as sequential inputs within a Transformer encoder-decoder framework, removing the need for graph representations Jang et al. (2025); discourse-coherence response-guided context models (DRCR) improve generation on Ubuntu IRC Cao et al. (2026); context-aware turn-taking models predict when an agent should speak or stay silent Bhagtani et al. (2026); and MPCEval introduces a multi-dimensional benchmark for multi-party conversation quality Zhang et al. (2026). EverMemBench probes long-horizon collaborative memory across multi-group settings, revealing severe limitations in multi-hop attribution reasoning and temporal revision tracking by current LLM-based systems Hu et al. (2026). Synthetic data generation also emerges as a scalable resource for downstream MPD tasks: deterministic-constraint frameworks Penzo et al. (2026), tree-guided subspace partitioning for diverse data synthesis Wang et al. (2026), and knowledge-driven progressive-thought prompting for multi-turn dialogue augmentation Jiang et al. (2024) all help mitigate data scarcity. Orthogonally, parameter-efficient fine-tuning methods such as low-rank adaptation and its rotation- and shard-based variants make it practical to specialize large models for multi-party applications under limited compute Wang et al. (2024a); Wang et al. (2024b); Wang et al. (2025).
Takeaway. The trajectory in Figure 2 reveals a clear pattern: as model capacity grows, attention shifts from explicit structural modeling toward generative reasoning. Yet graph-based and symbolic approaches retain their value for capturing the relational structure that pure sequential models miss. Increasingly, state-of-the-art systems combine multiple paradigms, hinting at the neuro-symbolic hybrid architectures we identify as a key opportunity in Section 6.

4. Datasets

The diversity of MPD tasks is mirrored in the breadth of available corpora, which differ in source, scale, modality, and annotation depth. Table 2 summarizes representative datasets by category, providing a comparison of their scale and annotation focus. We organize them into four groups.

4.1. Structure and Disentanglement

Datasets in this category support research on threads, discourse relations, and conversational structure. The Asher et al. (2016, STAC Corpus), released at LREC 2016, contains multi-party chat dialogues annotated with full discourse structure based on Segmented Discourse Representation Theory (SDRT), including both linguistic and extra-linguistic interactions, enabling research on dialogue act classification and discourse-level reasoning. Li et al. (forthcoming, Molweni), presented at COLING 2020, is an MRC dataset derived from the Ubuntu Chat Corpus consisting of 10,000 dialogues and 88,303 utterances annotated with discourse dependency structures based on a modified SDRT. Chang et al. (2023, MTDD) comprises 10,033 turns from 831 movies and TV series, annotated for conversational thread disentanglement and floor changes. Lerner et al. (2022, Bazinga!) consists of large-scale multi-party dialogues from TV and movie scripts with rich annotations for speaker diarization, addressee recognition, and entity linking, while Shaikh et al. (2010, OnlineChat-MPCorpus) focuses on online chatroom conversations annotated at four levels: conversational links, dialogue acts, local topics, and meso-level discourse structure.

4.2. Emotion and Social Relationship

TV transcripts dominate this category. Poria et al. (2019, MELD), released at ACL 2019, contains approximately 13,000 utterances from Friends, providing audio, visual, and textual modalities annotated with fine-grained emotion and sentiment labels. Its textual precursor Hsu et al. (2018, EmotionLines) includes 29,245 utterances from Friends and private Facebook chats annotated with seven emotion labels, and Zahiri and Choi (2017, EmoryNLP) utilizes Friends transcripts (S1–S4), comprising 12,606 utterances with seven crowdsourced labels. Chen et al. (2020, MPDD) supports Chinese-language affective MPD with annotations for emotions and interpersonal relationships, and Li et al. (2020, ALOHA (HLA-Chat)) links over a million dialogue lines from TV shows and movies to “Human-Level Attributes” (HLAs) based on audience-determined tropes for persona modeling.

4.3. Task-Oriented and Open-Domain

Gliwa et al. (2019, SAMSum) contains 16,369 messenger-like conversations with manually written third-person summaries, a primary benchmark for abstractive dialogue summarization, while Narayan et al. (2018, XSum) focuses on extreme summarization with 226,711 BBC article-summary pairs. Budzianowski et al. (2020, MultiWOZ) is a large-scale human-human dataset for task-oriented dialogue spanning seven domains, and the Lowe et al. (2015, Ubuntu Dialogue Corpus) contains nearly one million dialogues from technical support IRC logs for next-response selection. Manuvinakurike et al. (2021, IncTempSum) focuses on incremental abstractive summarization combining crowdsourced and expert annotations. More recently, Jang et al. (2025, DICE-BENCH) introduces 1,607 synthesized multi-round, multi-party dialogues for tool-use evaluation, and synthetic WMPC generation frameworks Penzo et al. (2026) demonstrate that instruction-tuned LLMs can produce structurally diverse MPD data under deterministic constraints such as dialogue structure and speaker stance, with turn-by-turn generation improving constraint adherence and linguistic variability. Sedoc et al. (2019, ChatEval) is designed for the evaluation of open-domain chatbots with manually filtered utterances and reference responses.

4.4. Multimodal and Naturalistic Meetings

The Carletta et al. (2006, AMI Meeting Corpus) offers 100 hours of meeting recordings (natural and elicited) with multimodal annotations including dialogue acts, head/hand gestures, and focus of attention. The Shriberg et al. (2004, ICSI Meeting Corpus (MRDA)) supplies 72 hours of naturally occurring professional meetings featuring over 180,000 hand-annotated dialogue-act tags. Yu et al. (2022, AliMeeting (M2MeT)) provides 120 hours of real-world Mandarin meetings recorded with 8-channel microphone arrays for diarization and multi-speaker ASR, while the Litman et al. (2016, Teams Corpus) captures realistic cooperative group dynamics across 63 teams through aligned audio, video, and detailed transcripts annotated for entrainment and dominance.
Despite this variety, current corpora remain skewed toward text-only English data, leaving multimodal, multilingual, and naturalistic professional settings comparatively underexplored. We revisit this critical gap in Section 6.

5. Evaluation

Evaluating MPD systems is harder than evaluating dyadic ones because quality spans multiple dimensions: semantic accuracy, structural coherence, social appropriateness, and collaborative effectiveness. Table 3 groups the metrics adopted in MPD studies into four families, with representative metrics and typical use cases.

5.1. Automatic Generation Metrics

N-gram overlap measures including BLEU (BLEU-1 through BLEU-4) Bhatnagar et al. (2022); Fan et al. (2024); Li and Zhao (2023); Zhu et al. (2022), SacreBLEU Tan et al. (2023), and ROUGE (ROUGE-1, ROUGE-L) Bhatnagar et al. (2022); Chen and Metze (2012); Chen and Yang (2020); Li and Zhao (2023); Zhu et al. (2022) remain widely used for summarization and response generation. BLEU measures the precision of n-gram matches, while ROUGE emphasizes recall and is especially suitable for summarization. METEOR adds stemming, synonym matching, and word-order sensitivity Afantenos et al. (2015); Fan et al. (2024); Gu et al. (2022); Li and Zhao (2023). For semantic similarity, BERTScore computes overlap using contextual embeddings Bhatnagar et al. (2022); Ju et al. (2022), and other embedding-based metrics reduce reliance on lexical overlap Ju et al. (2022). Perplexity (PPL) gauges fluency by measuring the likelihood assigned to human-written references Ju et al. (2022); Laskowski (2010), though lower PPL does not guarantee relevance. Finally, Distinct-n (for example, Dist-1/2) quantifies diversity by computing the ratio of unique n-grams Ju et al. (2022), which is particularly relevant in MPD where repetitive answers harm conversational quality.
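Two of the simpler quantities above can be written down directly. The sketch below assumes the corpus-level variant of Distinct-n (unique n-grams divided by total n-grams over all responses; some papers instead average per response) and assumes that per-token natural-log probabilities are already available from a language model. The function names are illustrative.

```python
import math
from typing import Sequence


def distinct_n(responses: Sequence[Sequence[str]], n: int) -> float:
    """Corpus-level Distinct-n: number of unique n-grams / total n-grams."""
    unique = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean per-token log-probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A system that repeats the same short reply to every participant drives Distinct-n toward zero even when each individual reply is fluent, which is why the two measures are usually reported together.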

5.2. Classification and Selection Metrics

Accuracy, precision, recall, and F1 dominate addressee and intent classification, with variants including in-domain versus cross-domain accuracy Addlesee et al. (2023); Gatti de Bayser et al. (2019); Gu et al. (2021a); Gu et al. (2021b); Le et al. (2019); Mayfield et al. (2012); Sun et al. (2021), and macro-, micro-, and weighted-F1 providing nuanced evaluation under class imbalance Afantenos et al. (2015); Aina et al. (2019); Gu et al. (2021b); Li et al. (2023); Liu et al. (2025); Meng et al. (2018); Wang et al. (2020). Ranking metrics such as Precision@k Gu et al. (2021a); Gu et al. (2021b); Gu et al. (2023); Le et al. (2019), Recall@k Gu et al. (2021b); Gu et al. (2023), and Mean Reciprocal Rank Meng et al. (2018) evaluate candidate selection in retrieval-style tasks. Task-specific scores include ADR/RES for addressee-response pairs Ouchi and Tsuboi (2016), Exact Match (EM) for slot filling and structure prediction Li and Zhao (2021, 2023); Li et al. (2023); Zhou et al. (2025), AUC and PR-AUC for probabilistic classification Li et al. (2023); Park et al. (2022); Sia et al. (2022), and the recently introduced DICE-score for structured tool-calling Jang et al. (2025).
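The difference between macro- and micro-averaged F1, and the two ranking measures, can be made concrete with a short sketch. It assumes single-label classification and uses the textbook definitions; the function names are illustrative and individual papers may differ in details such as tie handling.

```python
from typing import Dict, Sequence, Set


def f1_scores(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Macro- and micro-averaged F1 for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"macro": macro, "micro": micro}


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant candidates that appear in the top-k of a ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevants: Sequence[Set[str]]
) -> float:
    """Mean over queries of 1 / rank of the first relevant candidate."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0
```

As an illustration of class imbalance, a model that always predicts the majority addressee on a set with eight utterances addressed to A and two to B obtains a micro-F1 of 0.8 but a macro-F1 of roughly 0.44, because the F1 for B is zero.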

5.3. Structural and Discourse Metrics

Topic segmentation is assessed with $P_k$ and WindowDiff (WD) Galley et al. (2003); Purver et al. (2006), with lenient-match variants adapting them to noisy data Fernández et al. (2008). Discourse parsing employs graph-prediction F1 and relation-classification F1 Li et al. (2023) together with Link and Link&Rel metrics Liu et al. (2025) to assess discourse connectivity. Diarization Error Rate (DER) measures speaker attribution accuracy Du et al. (2022). Some works adopt multi-level evaluation, distinguishing sentence-, sequence-, and thread-level scores using Accuracy, Kappa agreement ($\kappa$), authoritativeness $r^2$, and Micro-averaged F-score Mayfield et al. (2012).
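As an illustration of the two segmentation metrics, the sketch below computes $P_k$ and WindowDiff for segmentations given as one segment id per utterance. It follows the standard textbook definitions: both slide a window of width k over the dialogue, $P_k$ checks whether the two window ends fall in the same segment, and WindowDiff compares boundary counts inside the window. The helper names, the default window (half the mean reference segment length), and the toy data are assumptions for illustration rather than the exact setup of any surveyed paper. Segment ids are assumed to be distinct for distinct segments.

```python
from typing import Optional, Sequence


def _boundaries(segments: Sequence[int], start: int, end: int) -> int:
    """Count segment boundaries between positions start and end."""
    return sum(1 for j in range(start, end) if segments[j] != segments[j + 1])


def default_window(reference: Sequence[int]) -> int:
    """Half of the mean reference segment length, at least 1."""
    n_segments = 1 + _boundaries(reference, 0, len(reference) - 1)
    return max(1, round(len(reference) / (2 * n_segments)))


def _num_windows(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> int:
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    windows = len(reference) - k
    if windows <= 0:
        raise ValueError("window size k must be smaller than the sequence")
    return windows


def p_k(reference: Sequence[int], hypothesis: Sequence[int],
        k: Optional[int] = None) -> float:
    """Share of windows where reference and hypothesis disagree on
    whether the two window ends belong to the same segment."""
    k = default_window(reference) if k is None else k
    windows = _num_windows(reference, hypothesis, k)
    errors = sum(
        (reference[i] == reference[i + k]) != (hypothesis[i] == hypothesis[i + k])
        for i in range(windows)
    )
    return errors / windows


def window_diff(reference: Sequence[int], hypothesis: Sequence[int],
                k: Optional[int] = None) -> float:
    """Share of windows whose boundary counts differ."""
    k = default_window(reference) if k is None else k
    windows = _num_windows(reference, hypothesis, k)
    errors = sum(
        _boundaries(reference, i, i + k) != _boundaries(hypothesis, i, i + k)
        for i in range(windows)
    )
    return errors / windows


# Toy example (assumed data): the hypothesis places the topic shift one
# utterance too late.
reference = [0, 0, 0, 1, 1, 1]
hypothesis = [0, 0, 0, 0, 1, 1]
print(p_k(reference, hypothesis, k=2))          # 0.5
print(window_diff(reference, hypothesis, k=2))  # 0.5
```

Both scores are error rates, so lower is better and a perfect segmentation scores 0.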

5.4. Human Evaluation

Despite advances in automatic scoring, human judgment remains indispensable for assessing fluency (grammatical correctness and naturalness) Fan et al. (2024); Gu et al. (2023); Ju et al. (2022); Liu et al. (2019), consistency and coherence (logical flow within context) Gu et al. (2023); Ju et al. (2022); Kiruluta et al. (2025); Liu et al. (2019), relevance (appropriateness to dialogue state) Fan et al. (2024); Gu et al. (2023); Liu et al. (2019), informativeness (usefulness of content) Fan et al. (2024); Gu et al. (2023); Liu et al. (2019), authority and trustworthiness (credibility) Mayfield et al. (2012), and interactional aspects such as step correctness, latency, and coherence Kiruluta et al. (2025). Task-specific measures further enrich the toolkit, including average reward in RL settings Hiraoka et al. (2015), cross-domain performance gap as a robustness measure Liu and Chen (2021), error reduction rate Laskowski et al. (2008), and composite scores that combine PPL, BLEU, Distinct, embedding similarity, accuracy, P/R/F1, and human ratings Ju et al. (2022); Wang et al. (2024).
Taken together, current evaluation practice remains fragmented: most studies rely on a handful of automatic metrics that imperfectly capture the social and collaborative dimensions of MPD. Section 6 argues for holistic, interaction-oriented benchmarks as a primary research opportunity.

6. Challenges and Opportunities

Building effective MPD systems requires confronting challenges distinct from dyadic dialogue. Figure 3 maps four bottlenecks to opportunities. We discuss each pair.

6.1. Combinatorial Complexity vs. Social Cognition

The transition from one-to-one to many-to-many interaction introduces a combinatorial explosion in conversational state. Models must track not only utterance history but also a dynamic graph of speakers, addressees, floor holders, and sub-threads. Current sequential LLMs flatten this structure, leading to failures in addressee prediction and contextually grounded turn-taking. The opportunity lies in architectures with inherent social cognition. Heterogeneous Graph Neural Networks naturally encode participants, utterances, and their evolving relationships as a dynamic graph. Multi-Agent Systems and Reinforcement Learning enable autonomous agents to learn emergent communication strategies through simulated interaction, moving beyond pattern imitation toward goal-driven social behavior.
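To illustrate what is lost when a conversation is flattened into a token sequence, the following sketch stores a multi-party dialogue as a small typed graph with speaker and utterance nodes and `speaks`, `addresses`, and `replies_to` edges, and recovers the interleaved sub-threads from it. It is only a data-structure sketch under assumed names (`Utterance`, `DialogueGraph`, and the edge labels are hypothetical); it is not the architecture of any surveyed graph neural model, which would additionally learn representations over such a graph.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Utterance:
    uid: int
    speaker: str
    text: str
    addressee: Optional[str] = None  # who the utterance is directed at
    reply_to: Optional[int] = None   # uid of the utterance it answers


@dataclass
class DialogueGraph:
    utterances: Dict[int, Utterance] = field(default_factory=dict)
    # Typed edges as (source node, relation, target node).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, utt: Utterance) -> None:
        """Insert an utterance node and its typed edges."""
        self.utterances[utt.uid] = utt
        node = f"utt:{utt.uid}"
        self.edges.append((f"spk:{utt.speaker}", "speaks", node))
        if utt.addressee is not None:
            self.edges.append((node, "addresses", f"spk:{utt.addressee}"))
        if utt.reply_to is not None:
            self.edges.append((node, "replies_to", f"utt:{utt.reply_to}"))

    def thread_root(self, uid: int) -> int:
        """Follow replies_to links back to the utterance that opened the thread."""
        parent = self.utterances[uid].reply_to
        while parent is not None:
            uid = parent
            parent = self.utterances[uid].reply_to
        return uid

    def threads(self) -> Dict[int, List[int]]:
        """Group utterance ids by the thread they belong to."""
        grouped: Dict[int, List[int]] = {}
        for uid in sorted(self.utterances):
            grouped.setdefault(self.thread_root(uid), []).append(uid)
        return grouped


# Toy chat (assumed data) with two interleaved threads.
graph = DialogueGraph()
graph.add(Utterance(1, "alice", "Which distro should I install?"))
graph.add(Utterance(2, "bob", "Try the LTS release.", addressee="alice", reply_to=1))
graph.add(Utterance(3, "carol", "Is the wiki down for anyone else?"))
graph.add(Utterance(4, "alice", "Works for me.", addressee="carol", reply_to=3))
graph.add(Utterance(5, "alice", "Thanks, downloading now.", addressee="bob", reply_to=2))
print(graph.threads())  # {1: [1, 2, 5], 3: [3, 4]}
```

In the flattened order, utterance 5 directly follows utterance 4, although it belongs to a different thread and addresses a different participant; the explicit edges keep that information available to a model.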

6.2. The Structural-Semantic Divide vs. Neuro-Symbolic Fusion

In MPD, meaning is entangled with discourse position. Conversations comprise interleaved threads where a speaker may respond to an utterance several turns prior. While LLMs capture local coherence, they lack explicit mechanisms for global hierarchical structure, producing fluent but logically fragmented contributions. Neuro-symbolic hybrid architectures offer a promising path: an LLM augmented with a queryable discourse graph that tracks threads, stances, and decision states. This fusion would enable responses that are fluent and globally coherent, with explainable reasoning over conversational history.

6.3. The Evaluation Conundrum vs. Holistic Benchmarks

The field leans heavily on metrics borrowed from other NLP tasks, which are poor proxies for conversational quality in group settings. BLEU and F1 fail to capture pragmatic appropriateness, role fulfillment, consensus building, and group-level outcomes such as decision quality. The opportunity is to design holistic, multi-dimensional benchmarks that evaluate interdependent capabilities jointly (for example, response generation together with addressee prediction and topic tracking) alongside goal-oriented and human-in-the-loop protocols that measure functional contribution to group outcomes rather than surface mimicry. Progress in adjacent areas is instructive: trajectory-aware evaluation of research agents Chen et al. (2026) and large-scale benchmarks for dialogue understanding and generation Hu et al. (2026) show how multi-dimensional, process-level evaluation can move beyond single-score metrics.
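One simple form of joint evaluation counts a prediction as correct only when all interdependent parts are correct, in the spirit of scoring addressee-response pairs together. The sketch below contrasts per-component accuracy with such a joint score. The `Prediction` type, field names, and toy data are assumptions made for illustration, and a holistic benchmark would cover far more dimensions than these two.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Prediction:
    addressee: str    # predicted (or gold) addressee
    response_id: str  # id of the selected (or gold) response


def joint_accuracy(preds: Sequence[Prediction],
                   golds: Sequence[Prediction]) -> Dict[str, float]:
    """Per-component accuracy plus accuracy requiring both to be right."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("need equally sized, non-empty prediction lists")
    n = len(preds)
    addressee_ok = [p.addressee == g.addressee for p, g in zip(preds, golds)]
    response_ok = [p.response_id == g.response_id for p, g in zip(preds, golds)]
    return {
        "addressee": sum(addressee_ok) / n,
        "response": sum(response_ok) / n,
        "joint": sum(a and r for a, r in zip(addressee_ok, response_ok)) / n,
    }


# Toy predictions (assumed data).
golds = [Prediction("bob", "r1"), Prediction("bob", "r2"),
         Prediction("bob", "r4"), Prediction("alice", "r9")]
preds = [Prediction("bob", "r1"), Prediction("carol", "r2"),
         Prediction("bob", "r3"), Prediction("alice", "r9")]
print(joint_accuracy(preds, golds))
# {'addressee': 0.75, 'response': 0.75, 'joint': 0.5}
```

Each component looks acceptable in isolation (0.75), yet only half of the turns are fully correct, which is the gap that single-task metrics hide.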

6.4. The Data and Modality Bottleneck vs. Multimodal Grounding

Progress is constrained by the scarcity of large-scale, richly annotated MPD corpora, a problem compounded for non-English settings where dedicated benchmarks and datasets remain limited, as recent work on Cantonese capabilities and resources illustrates Jiang et al. (2025a); Jiang et al. (2025b). Most existing data is text-only, missing the gaze, gesture, prosody, and turn-yielding cues that humans use to manage group conversation. The path forward involves large-scale multimodal corpora and, more profoundly, deeper integration with computational social science methods. By learning to infer latent social variables such as leadership, facilitation, and group cohesion from conversational data, AI can evolve from a passive participant into an active, socially aware collaborator capable of adapting to pragmatic and cultural context. Recent advances in multi-modal proactive reasoning, which decouple perception from reasoning, point toward systems that can ground group interaction in visual as well as textual signals Zhou et al. (2025).

7. Conclusion

This survey examined Multi-Party Dialogue research across tasks, methods, datasets, and evaluation. The field has advanced significantly, from statistical models through deep learning and graph reasoning to large language models and multi-agent paradigms. Yet gaps remain: current approaches struggle with the complexity of N-party interaction, evaluation protocols inadequately capture social and collaborative quality, and data resources lag behind those for dyadic dialogue. The path forward calls for integrated modeling frameworks that jointly address multiple MPD tasks, deeper grounding in external knowledge and social dynamics, principled integration of multi-agent reasoning with LLM capabilities, human-centered evaluation, and richer multimodal datasets covering diverse collaborative scenarios. Realizing these directions requires interdisciplinary collaboration spanning linguistics, sociology, psychology, and artificial intelligence. We hope this survey provides a clear synthesis of current progress and a practical roadmap for socially intelligent dialogue systems capable of supporting human-human and human-machine collaboration in complex environments.

Limitations

This survey synthesizes MPD research across tasks, methods, datasets, and evaluation, but has limitations. It may miss recent or domain-specific studies, especially on LLMs, multi-agent systems, and synthetic data. The taxonomy is analytical rather than disjoint, and reported results are not directly comparable. We summarize metrics without new meta-evaluation and focus mainly on NLP-oriented MPD work. These limitations motivate unified and interdisciplinary future research.
Table 4. Comprehensive catalog of 50+ MPD methods, grouped chronologically into the Statistical and Early Neural Era (pre-2020), the Pre-trained Model and Graph Era (2020–2023), and the LLM and Multi-Agent Era (2024–2026). The table spans multiple pages.
| Method | Domain/Task | Datasets | Metrics | Headline Result | Code |
|---|---|---|---|---|---|
| Statistical and Early Neural Era (pre-2020) | | | | | |
| Feature-driven Seg. Galley et al. (2003) | Topic seg. | ICSI | $P_k$; WD | $P_k$=23.0 | LDC |
| Bayesian Models Purver et al. (2006) | Topic seg. | ICSI | $P_k$; WD | $P_k$=28.9 | None |
| Backchannel Heylen and op den Akker (2007) | Interaction analysis | AMI | $\kappa$ | $\kappa$=0.14 | None |
| Sub-dialogue Det. Fernández et al. (2008) | Decision detect. | AMI | P; R; F1 | F1=34 | Stanford |
| Social Dim. Pred. Laskowski et al. (2008) | Social structure | AMI; ICSI | RERR | 37 to 67% | None |
| Directed Graphs Bui et al. (2009) | Decision detect. | AMI | P; R; F1 | F1=0.55 | BNT |
| ILP Disentangle Mayfield et al. (2012) | Structure pred. | Cancer Support | MAF; $\kappa$ | Acc=0.78; $\kappa$=0.60 | None |
| Random Walk Chen and Metze (2012) | Meeting sum. | SmartNotes | ROUGE | R-1=49.79 | None |
| Argviz Nguyen et al. (2013) | Topic analysis | CNN Crossfire | Qualitative | Qualitative | Google toolkit |
| Dep. Parser Afantenos et al. (2015) | Structure parsing | STAC | F1 | 68.0 (unlabeled) | None |
| RL Trading Hiraoka et al. (2015) | Negotiation | Sim. trading | Avg. reward | Better than baseline | None |
| Static+Dynamic Ouchi and Tsuboi (2016) | ADR + RES sel. | Self-built | ADR; RES | ADR=68.54; RES=78.64 | None |
| Neural Speaker Meng et al. (2018) | Speaker class. | Self-built | F1; MRR | Macro F1=44.25 | Google Sites |
| W2W Le et al. (2019) | Addressee ID | Ubuntu IRC | P@n; Len-n | Len-5=80.86 | None |
| ICRED Liu et al. (2019) | Response gen. | RGMPC | BLEU; ROUGE | Length=11.34 | fasttext |
| ML Models Gatti de Bayser et al. (2019) | Turn-taking pred. | MultiWoZ; Finch | Accuracy | Acc=86.38 | gensim |
| Entity-centric Aina et al. (2019) | Entity linking | Self-built | F1; Accuracy | F1=52.5; Acc=77.6 | GitHub |
| Pre-trained Model and Graph Era (2020–2023) | | | | | |
| Multi-view BART Chen and Yang (2020) | Summarization | SAMSum | ROUGE | R-1=0.493 | GT-SALT (GitHub) |
| Topic-BERT Wang et al. (2020) | Selection; topics | Ubuntu | Recall; MRR | R@10=97.0 | salesforce (GitHub) |
| Turn-taking Annot. Enomoto et al. (2020) | Turn analysis | Chiba | POS; prosody | Qualitative | None |
| Pseudo-SSL Li and Zhao (2021) | Dialogue QA | FriendsQA; Molweni | EM; F1 | EM=58.0; F1=72.9 | EricLee8 (GitHub) |
| Cross-domain Trans. Liu and Chen (2021) | Discourse parsing | STAC; Molweni | F1 (link); L+R | Link=80.2 | HuggingFace |
| MPC-BERT Gu et al. (2021b) | Addressee; speaker; selection | Ubuntu IRC | P@1; Acc | P@1=98.31; Acc=92.42 | JasonForJoy (GitHub) |
| ERMC Sun et al. (2021) | Emotion recog. | MELD; EmoryNLP | F1 | Avg F1=64.22 | google-research/bert |
| HeterMPC Gu et al. (2022) | Response gen. | Ubuntu IRC | BLEU; ROUGE | BLEU-1=12.61 | lxchtan (GitHub) |
| SOND Du et al. (2022) | Speaker diarization | AliMeeting | DER | 4.46% | yufan-aslp (GitHub) |
| SDMPED Zhu et al. (2022) | Empathetic gen. | MPED | ROUGE; BLEU | ROUGE-L=12.87 | GDPR |
| PersonaTKG (GCN) Ju et al. (2022) | Persona gen. | HLA-Chat++ | PPL; BLEU; Dist | PPL=109.72 | NEU-DataMining |
| User-Aware Park et al. (2022) | Disruptive detect. | ECOJOURNEYS | AUC; PR-AUC | AUC=84.80 | gingerit |
| Hier. VAE Sia et al. (2022) | Argument pred. | CMV | AUC | AUC=69.7 | GitHub |
| E2E Minuting Bhatnagar et al. (2022) | Meeting sum. | XSum; SAMSum | ROUGE; BLEU | R1=45.0; BLEU=7.07 | GitHub |
| PF Li and Zhao (2023) | Response gen. | Ubuntu IRC | BLEU; METEOR | BLEU-1=12.31 | EricLee8/MPDRG |
| ELECTRA-EMVI Li et al. (2023) | Parsing; QA | Molweni | $F1_{RL}$; $F1_{G}$ | $F1_{G}$=91.78 | EricLee8/MPD_EMVI |
| MPC-BERT+GIFT Gu et al. (2023) | Addressee; speaker; sel. | Ubuntu IRC | $R_n@1$ | $R_2@1$=95.04 | JasonForJoy (GitHub) |
| MADNet Gu et al. (2023) | MPC generation | Ubuntu IRC | BLEU; METEOR | BLEU-1=11.82 | coco-caption |
| RARM Zhu et al. (2023) | Addressee recog. | Ubuntu IRC | ID/OD-AN | Overall=85.1 | None |
| PFT-Prompt Addlesee et al. (2023) | Goal-tracking; intent-slot | EU SPRING | Accuracy | Acc=62.32/69.57 | AddleseeHQ/mpgt-eval |
| LLM and Multi-Agent Era (2024–2026) | | | | | |
| LLMs zero/few-shot Martinenghi et al. (2024) | Dialogue acts | STAC | Acc; F1 | Acc=69.1; F1=71.6 | GitHub |
| Persona-HeterMPC Mahajan and Shaikh (2024) | Persona gen. | Persona-MPC | BLEU; METEOR | BLEU-1=12.47 | NEU-DataMining |
| RL-TRC Fan et al. (2024) | Response gen. | Ubuntu IRC | BLEU; METEOR | BLEU-1=13.66 | MaartenGr/KeyBERT |
| MuPaS Wang et al. (2024) | Generation; speaker pred. | Friends; GoT | Auto + human | GSM8K=43.14 | HuggingFace |
| DDPE Liu et al. (2025) | Discourse parsing | Molweni; STAC | Link; Link+Rel | Link=87.6; L+R=62.9 | Shannanliu/DDPE |
| LIMN Zhou et al. (2025) | Dialogue QA | SQuAD2; Molweni | EM; F1 | EM=60.2 (Molweni) | None |
| CMR Hu et al. (2025) | Response gen. | Friends; Ubuntu IRC | F1; BLEU | F1=13.43 | None |
| RL Fine-tuning Kiruluta et al. (2025) | Multi-turn align.; CoT | ChatEval | Latency | 35ms (±3) | None |
| DICE-BENCH Jang et al. (2025) | Tool-calling | DICE-BENCH | DICE-Score | 3.6444 | snuhcc/DICE-Bench |
| SS-MPC Jang et al. (2025) | Response gen. | Ubuntu IRC | BLEU; ROUGE-L | BLEU-1=15.60 | GitHub |
| BOLT Castillo-López et al. (2025) | Intent recog. | MIntRec2.0; MPGT | Acc; WF1 | ACC=41.22/89.47 | None |
| DRCR Cao et al. (2026) | MPD generation | Ubuntu IRC-16/19 | BLEU; METEOR | BLEU-1=16.04 | None |
| Context-aware TT Bhagtani et al. (2026) | Turn-taking | AMI; Friends; SPGI | Acc; F1 | 61.03/60.54/64.45 | GitHub |
| MPCEval Zhang et al. (2026) | Multi-party eval | DeliData; MPDD | DNR; IR; PF | GPT-4-Turbo results | GitHub |
| EverMemBench Hu et al. (2026) | Memory eval | EverMemBench | Recall; Acc | 37.44/72.61 | GitHub |
| SyntheticMPC Penzo et al. (2026) | Synth. generation | WMPC | Constraints | All=77.72 | dhfbk (GitHub) |

References

  1. Jiang, J., S. Wang, Q. Li, L. Kong, and C. Wu. 2023. A cognitive stimulation dialogue system with multi-source knowledge fusion for elders with cognitive impairment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Volume 1, pp. 10628–10640.
  2. Vaswani, A., N.M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is All you Need. In Proceedings of the Neural Information Processing Systems.
  3. Ishizaki, M., and T. Kato. 1998. Exploring the characteristics of multi-party dialogues. USA: vol. ACL ’98/COLING ’98, pp. 583–589.
  4. Sapkota, S., M.S. Hasan, M. Shah, and S. Karmaker. 2025. Multi-Party Conversational Agents: A Survey. arXiv:2505.18845.
  5. Martínez-Hinarejos, C.D., V. Tamarit, and J.M. Benedí. 2010. Evaluation of HMM-based Models for the Annotation of Unsegmented Dialogue Turns. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). Valletta, Malta, Edited by N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias.
  6. Shang, G., A. Tixier, M. Vazirgiannis, and J.P. Lorré. 2020. Speaker-change Aware CRF for Dialogue Act Classification. In Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online), Edited by D. Scott, N. Bel and C. Zong. pp. 450–464.
  7. Zhu, Y., Z. Yang, H. Meng, B. Li, G. Levow, and I. King. 2010. Using finite state machines for evaluating spoken dialog systems. In Proceedings of the 2010 IEEE Spoken Language Technology Workshop. pp. 478–483.
  8. Mangrulkar, S., S. Shrivastava, V. Thenkanidiyoor, and D. Aroor Dinesh. 2018. A Context-aware Convolutional Natural Language Generation model for Dialogue Systems. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue. Melbourne, Australia, Edited by K. Komatani, D. Litman, K. Yu, A. Papangelis, L. Cavedon and M. Nakano. pp. 191–200.
  9. Wen, T.H., M. Gašić, N. Mrkšić, P.H. Su, D. Vandyke, and S. Young. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, Edited by L. Màrquez, C. Callison-Burch and J. Su. pp. 1711–1721.
  10. Skantze, G. 2017. Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Saarbrücken, Germany, Edited by K. Jokinen, M. Stede, D. DeVault and A. Louis. pp. 220–230.
  11. Wang, G., K. Zhang, J. Jiang, C. Wang, H. Bi, H. Liang, Z. Qi, Y. Huang, Y. Li, and X. Yang. 2026. Human–large language model collaboration in clinical medicine: a systematic review and meta-analysis. npj Digital Medicine.
  12. Addlesee, A., W. Sieińska, N. Gunson, D. Hernández García, C. Dondrup, and O. Lemon. 2023. Multi-party Goal Tracking with LLMs: Comparing Pre-training, Fine-tuning, and Prompt Engineering. arXiv:2308.15231.
  13. Penzo, N., M. Sajedinia, B. Lepri, S. Tonelli, and M. Guerini. 2024. Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA, Edited by Y. Al-Onaizan, M. Bansal and Y.N. Chen. pp. 11210–11233.
  14. Castillo-López, G., G. de Chalendar, and N. Semmar. 2025. Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Avignon, France, Edited by F. Béchet, F. Lefèvre, N. Asher, S. Kim and T. Merlin. pp. 504–512.
  15. Sun, Y., N. Yu, and G. Fu. 2021. A Discourse-Aware Graph Neural Network for Emotion Recognition in Multi-Party Conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021. Punta Cana, Dominican Republic, Edited by M.F. Moens, X. Huang, L. Specia and S.W.t. Yih. pp. 2949–2958.
  16. Martinenghi, A., G. Donabauer, S. Amenta, S. Bursic, M. Giudici, U. Kruschwitz, F. Garzotto, and D. Ognibene. 2024. LLMs of Catan: Exploring Pragmatic Capabilities of Generative Chatbots Through Prediction and Classification of Dialogue Acts in Boardgames’ Multi-party Dialogues. In Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024. Torino, Italia, Edited by C. Madge, J. Chamberlain, K. Fort, U. Kruschwitz and S. Lukin. pp. 107–118.
  17. Mayfield, E., D. Adamson, and C. Penstein Rosé. 2012. Hierarchical Conversation Structure Prediction in Multi-Party Chat. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Seoul, South Korea, Edited by G.G. Lee, J. Ginzburg, C. Gardent and A. Stent. pp. 60–69.
  18. Li, Y., and H. Zhao. 2023. EM Pre-training for Multi-party Dialogue Response Generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Volume 1, pp. 92–103.
  19. Li, Y., and H. Zhao. 2021. Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance for Multi-party Dialogue Reading Comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2021. Punta Cana, Dominican Republic, Edited by M.F. Moens, X. Huang, L. Specia and S.W.t. Yih. pp. 2053–2063.
  20. Park, K., H. Sohn, W. Min, B. Mott, K. Glazewski, C.E. Hmelo-Silver, and J. Lester. 2022. Disruptive Talk Detection in Multi-Party Dialogue within Collaborative Learning Environments with a Regularized User-Aware Network. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue. Edinburgh, UK, Edited by O. Lemon, D. Hakkani-Tur, J.J. Li, A. Ashrafzadeh, D.H. Garcia, M. Alikhani, D. Vandyke and O. Dušek. pp. 490–499.
  21. Sia, S., K. Jaidka, H. Ahuja, N. Chhaya, and K. Duh. 2022. Offer a Different Perspective: Modeling the Belief Alignment of Arguments in Multi-party Debates. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Edited by Y. Goldberg, Z. Kozareva and Y. Zhang. pp. 11939–11950.
  22. Gatti de Bayser, M., P.R. Cavalin, C.S. Pinhanez, and B. Zadrozny. 2019. Learning Multi-Party Turn-Taking Models from Dialogue Logs. arXiv:1907.02090.
  23. Laskowski, K. 2010. Modeling Norms of Turn-Taking in Multi-Party Conversation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, Edited by J. Hajič, S. Carberry, S. Clark and J. Nivre. pp. 999–1008.
  24. Enomoto, M., Y. Den, and Y. Ishimoto. 2020. A Conversation-Analytic Annotation of Turn-Taking Behavior in Japanese Multi-Party Conversation and its Preliminary Analysis. In Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France, Edited by N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani and et al. pp. 644–652.
  25. Wang, X., N. Xi, T. Chen, Q. Gu, Y. Zhao, X. Chen, Z. Jiang, Y. Chen, and L. Ji. 2024. Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation. ArXiv arXiv:abs/1907.02090.
  26. Purver, M., K.P. Körding, T.L. Griffiths, and J.B. Tenenbaum. 2006. Unsupervised Topic Modelling for Multi-Party Spoken Discourse. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia, pp. 17–24.
  27. Galley, M., K.R. McKeown, E. Fosler-Lussier, and H. Jing. 2003. Discourse Segmentation of Multi-Party Conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan, pp. 562–569.
  28. Nguyen, V.A., Y. Hu, J. Boyd-Graber, and P. Resnik. 2013. Argviz: Interactive Visualization of Topic Dynamics in Multi-party Conversations. In Proceedings of the 2013 NAACL HLT Demonstration Session. Atlanta, Georgia, Edited by C. Dyer and D. Higgins. pp. 36–39.
  29. Wang, W., S.C. Hoi, and S. Joty. 2020. Response Selection for Multi-Party Conversations with Dynamic Topic Tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online, pp. 6581–6591.
  30. Afantenos, S., E. Kow, N. Asher, and J. Perret. 2015. Discourse parsing for multi-party chat dialogues. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pp. 928–937.
  31. Li, Y., X. Huang, W. Bi, and H. Zhao. 2023. Pre-training Multi-party Dialogue Models with Latent Discourse Inference. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Volume 1, pp. 9584–9599.
  32. Liu, Z., and N.F. Chen. 2021. Improving Multi-Party Dialogue Discourse Parsing via Domain Integration. arXiv:2110.04526.
  33. Liu, S., P. Li, Y. Fan, and Q. Zhu. 2025. Enhancing Multi-party Dialogue Discourse Parsing with Explanation Generation. In Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE, Edited by O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B.D. Eugenio and S. Schockaert. pp. 1531–1544.
  34. Fernández, R., M. Frampton, P. Ehlen, M. Purver, and S. Peters. 2008. Modelling and Detecting Decisions in Multi-party Dialogue. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue. Columbus, Ohio, Edited by D. Schlangen and B.A. Hockey. pp. 156–163.
  35. Bui, T., M. Frampton, J. Dowding, and S. Peters. 2009. Extracting Decisions from Multi-Party Dialogue Using Directed Graphical Models and Semantic Similarity. In Proceedings of the SIGDIAL 2009 Conference. London, UK, Edited by P. Healey, R. Pieraccini, D. Byron, S. Young and M. Purver. pp. 235–243.
  36. Kiruluta, A., A. Lemos, and P. Burity. 2025. History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM. arXiv:2506.11108.
  37. Fan, Y., P. Li, and Q. Zhu. 2024. Improving Multi-party Dialogue Generation via Topic and Rhetorical Coherence. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA, pp. 3240–3253.
  38. Gu, J.C., C.H. Tan, C. Tao, Z.H. Ling, H. Hu, X. Geng, and D. Jiang. 2022. HeterMPC: A Heterogeneous Graph Neural Network for Response Generation in Multi-Party Conversations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland, Edited by S. Muresan, P. Nakov and A. Villavicencio. Volume 1, pp. 5086–5097.
  39. Gu, J.C., Z. Ling, Q. Liu, C. Liu, and G. Hu. 2023. GIFT: Graph-Induced Fine-Tuning for Multi-Party Conversation Understanding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Edited by A. Rogers, J. Boyd-Graber and N. Okazaki. Volume 1, pp. 11645–11658.
  40. Hu, Z., Q. He, R. Li, M. Zhao, and L. Wang. 2025. Advancing Multi-Party Dialogue Framework with Speaker-ware Contrastive Learning.
  41. Liu, C., K. Liu, S. He, Z. Nie, and J. Zhao. 2019. Incorporating Interlocutor-Aware Context into Response Generation on Multi-Party Chatbots. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Hong Kong, China, Edited by M. Bansal and A. Villavicencio. pp. 718–727.
  42. Tan, C.H., J.C. Gu, and Z.H. Ling. 2023. Is ChatGPT a Good Multi-Party Conversation Solver? In Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore, Edited by H. Bouamor, J. Pino and K. Bali. pp. 4905–4915.
  43. Gu, J.C., C. Tao, Z. Ling, C. Xu, X. Geng, and D. Jiang. 2021. MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Online, Edited by C. Zong, F. Xia, W. Li and R. Navigli. Volume 1, pp. 3682–3692.
  44. Gu, J.C., C. Tao, Z. Ling, C. Xu, X. Geng, and D. Jiang. 2021. MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding. arXiv:2106.01541.
  45. Ouchi, H., and Y. Tsuboi. 2016. Addressee and Response Selection for Multi-Party Conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pp. 2133–2143.
  46. Ju, D., S. Feng, P. Lv, D. Wang, and Y. Zhang. 2022. Learning to Improve Persona Consistency in Multi-party Dialogue Generation via Text Knowledge Enhancement. In Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea, Edited by N. Calzolari, C.R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.S. Choi, P.M. Ryu, H.H. Chen, L. Donatelli, H. Ji and et al. pp. 298–309.
  47. Mahajan, K., and S. Shaikh. 2024. Persona-aware Multi-party Conversation Response Generation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia, Edited by N. Calzolari, M.Y. Kan, V. Hoste, A. Lenci, S. Sakti and N. Xue. pp. 12712–12723.
  48. Zhu, L., Z. Zhang, J. Wang, H. Wang, H. Wu, and Z. Yang. 2022. Multi-Party Empathetic Dialogue Generation: A New Task for Dialog Systems. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland, Volume 1, pp. 298–307.
  49. Jiang, J., Y. Chen, P. Chen, K. Liu, J. Zhou, Z. Zhu, H. Hu, F. Ma, Q. Tian, and C. Wu. 2026. A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment.
  50. Jang, K., D. Lee, K. Kim, D. Heo, T. Lee, W. Kim, and B. Suh. 2025. DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  51. Zhou, J., S. Wang, D. Deng, J. Lu, J. Su, Q. Li, J. Gao, H. Wu, J. Jiang, L. Kong, and et al. 2026. ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Intrinsic Adaptation. arXiv:2602.07883.
  52. Heylen, D., and R. op den Akker. 2007. Computing Backchannel Distributions in Multi-Party Conversations. In Proceedings of the Workshop on Embodied Language Processing. Prague, Czech Republic, Edited by J. Cassell and D. Heylen. pp. 17–24.
  53. Le, R., W. Hu, M. Shang, Z. You, L. Bing, D. Zhao, and R. Yan. 2019. Who Is Speaking to Whom? Learning to Identify Utterance Addressee in Multi-Party Conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China, pp. 1909–1919.
  54. Zhu, P., W. Zhou, K. Zhang, Y. Ma, and H. Chen. 2023. Robust Learning for Multi-party Addressee Recognition with Discrete Addressee Codebook. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Edited by A. Rogers, J. Boyd-Graber and N. Okazaki. Volume 2, pp. 571–578.
  55. Du, Z., S. Zhang, S. Zheng, and Z.J. Yan. 2022. Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Edited by Y. Goldberg, Z. Kozareva and Y. Zhang. pp. 7458–7469.
  56. Meng, Z., L. Mou, and Z. Jin. 2018. Towards Neural Speaker Modeling in Multi-Party Conversation: The Task, Dataset, and Models. In Proceedings of the Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan, Edited by N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo and et al. [Google Scholar]
  57. Laskowski, K., M. Ostendorf, and T. Schultz. 2008. Modeling Vocal Interaction for Text-Independent Participant Characterization in Multi-Party Conversation. In Proceedings of the Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue. Columbus, Ohio, Edited by D. Schlangen and B.A. Hockey. pp. 148–155. [Google Scholar]
  58. Chen, J., and D. Yang. 2020. Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization. Proceedings of the Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online; pp. 4106–4118. [Google Scholar] [CrossRef]
  59. Chen, Y.N., and F. Metze. 2012. Intra-Speaker Topic Modeling for Improved Multi-Party Meeting Summarization with Integrated Random Walk. In Proceedings of the Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada, Edited by E. Fosler-Lussier, E. Riloff and S. Bangalore. pp. 377–381. [Google Scholar]
  60. Bhatnagar, A., N. Bhavsar, M. Singh, and P. Motlicek. 2022. An End-to-End Multilingual System for Automatic Minuting of Multi-Party Dialogues. Proceedings of the Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation, Manila, Philippines, 10; pp. 582–589. [Google Scholar]
  61. Zhou, S., R. Zhao, Z. Zhou, H. Yi, X. Zheng, and H. Wang. 2025. Enhancing Extractive Question Answering in Multiparty Dialogues with Logical Inference Memory Network. In Proceedings of the Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE, Edited by O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B.D. Eugenio and S. Schockaert. pp. 8725–8738. [Google Scholar]
  62. Hiraoka, T., K. Georgila, E. Nouri, D. Traum, and S. Nakamura. 2015. Reinforcement Learning in Multi-Party Trading Dialog. In Proceedings of the Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Edited by A. Koller, G. Skantze, F. Jurcicek, M. Araki and C.P. Rose. Prague, Czech Republic: pp. 32–41. [Google Scholar] [CrossRef]
  63. Aina, L., C. Silberer, I.T. Sorodoc, M. Westera, and G. Boleda. 2019. What do Entity-Centric Models Learn? Insights from Entity Linking in Multi-Party Dialogue. In Proceedings of the Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota, Edited by J. Burstein, C. Doran and T. Solorio. Volume 1, pp. 3772–3783. [Google Scholar] [CrossRef]
  64. Jang, Y., K. Kim, and Y. Ko. 2025. SS-MPC: A Sequence-Structured Multi-Party Conversation System. arXiv arXiv:cs.
  65. Hu, C., T. Li, X. Gao, H. Chen, Y. Bai, D. Xu, T. Lin, X. Li, Y. Han, J. Pei, et al. 2026. Evaluating Long-Horizon Memory for Multi-Party Collaborative Dialogues. arXiv arXiv:cs.
  66. Gu, J.C., C.H. Tan, C. Chu, Z.H. Ling, C. Tao, Q. Liu, and C. Liu. 2023. MADNet: Maximizing Addressee Deduction Expectation for Multi-Party Conversation Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Edited by H. Bouamor, J. Pino and K. Bali. Singapore: pp. 7681–7692.
  67. Chen, Y., J. Jiang, D. Yu, Z. Wu, J. Liu, J. Han, X. Guo, J. Qi, Y. Li, Y. Zhang, et al. 2026. LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition. arXiv arXiv:2605.24005.
  68. Cao, Z., P. Li, and Q. Zhu. 2026. Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation. arXiv arXiv:cs.
  69. Bhagtani, K., M. Anand, Y.C. Xu, and A.K.S. Yadav. 2026. Speak or Stay Silent: Context-Aware Turn-Taking in Multi-Party Dialogue. arXiv arXiv:cs.
  70. Zhang, M., Y. Yang, Z. Jia, X. Yang, J. Pei, Y. Zang, X. Deng, and X. Chen. 2026. MPCEval: A Benchmark for Multi-Party Conversation Generation. arXiv arXiv:cs.
  71. Penzo, N., M. Guerini, B. Lepri, G. Glavaš, and S. Tonelli. 2026. Don’t Stop the Multi-Party! On Generating Synthetic Written Multi-Party Conversations with Constraints. Proceedings of the AAAI Conference on Artificial Intelligence 40: 32701–32709.
  72. Wang, S., P. Chen, J. Zhou, Q. Li, J. Dong, J. Gao, B. Xue, J. Jiang, L. Kong, and C. Wu. 2026. TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning. Advances in Neural Information Processing Systems 38: 63870–63918.
  73. Jiang, J., L. Chen, S. Wang, L. Kong, Y. Li, and C. Wu. 2024. Data Augmentation of Multi-turn Psychological Dialogue via Knowledge-driven Progressive Thought Prompting. arXiv arXiv:2406.16567.
  74. Wang, S., L. Chen, J. Jiang, B. Xue, L. Kong, and C. Wu. 2024. LoRA Meets Dropout under a Unified Framework. Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024: 1995–2008.
  75. Wang, S., B. Xue, J. Ye, J. Jiang, L. Chen, L. Kong, and C. Wu. 2024. PRoLoRA: Partial Rotation Empowers More Parameter-Efficient LoRA. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Volume 1: 2829–2841.
  76. Wang, S., L. Chen, P. Chen, J. Dong, B. Xue, J. Jiang, L. Kong, and C. Wu. 2025. MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards. Proceedings of the International Conference on Learning Representations Vol. 2025: 91886–91902.
  77. Asher, N., J. Hunter, M. Morey, B. Farah, and S. Afantenos. 2016. Discourse Structure and Dialogue Acts in Multiparty Dialogue: the STAC Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Edited by N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk et al. Portorož, Slovenia: pp. 2721–2727.
  78. Li, J., M. Liu, M.Y. Kan, Z. Zheng, Z. Wang, W. Lei, T. Liu, and B. Qin. Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure. In Proceedings of the 28th International Conference on Computational Linguistics. Edited by D. Scott, N. Bel and C. Zong.
  79. Chang, K., D. Chen, and D. Bamman. 2023. Dramatic Conversation Disentanglement. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023. Edited by A. Rogers, J. Boyd-Graber and N. Okazaki. Toronto, Canada: pp. 4020–4046.
  80. Lerner, P., J. Bergoënd, C. Guinaudeau, H. Bredin, B. Maurice, S. Lefevre, M. Bouteiller, A. Berhe, L. Galmant, R. Yin, et al. 2022. Bazinga! A Dataset for Multi-Party Dialogues Structuring. In Proceedings of the Thirteenth Language Resources and Evaluation Conference. Edited by N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani et al. Marseille, France: pp. 3434–3441.
  81. Poria, S., D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Edited by A. Korhonen, D. Traum and L. Màrquez. Florence, Italy: pp. 527–536.
  82. Hsu, C.C., S.Y. Chen, C.C. Kuo, T.H. Huang, and L.W. Ku. 2018. EmotionLines: An Emotion Corpus of Multi-Party Conversations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Edited by N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo et al. Miyazaki, Japan.
  83. Zahiri, S.M., and J.D. Choi. 2017. Emotion Detection on TV Show Transcripts with Sequence-based Convolutional Neural Networks. arXiv arXiv:cs.
  84. Chen, Y.T., H.H. Huang, and H.H. Chen. 2020. MPDD: A Multi-Party Dialogue Dataset for Analysis of Emotions and Interpersonal Relationships. In Proceedings of the Twelfth Language Resources and Evaluation Conference. Edited by N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani et al. Marseille, France: pp. 610–614.
  85. Li, A.W., V. Jiang, S.Y. Feng, J. Sprague, W. Zhou, and J. Hoey. 2020. ALOHA: Artificial Learning of Human Attributes for Dialogue Agents. Proceedings of the AAAI Conference on Artificial Intelligence 34: 8155–8163.
  86. Gliwa, B., I. Mochol, M. Biesek, and A. Wawer. 2019. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. Proceedings of the 2nd Workshop on New Frontiers in Summarization. Association for Computational Linguistics.
  87. Narayan, S., S.B. Cohen, and M. Lapata. 2018. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Edited by E. Riloff, D. Chiang, J. Hockenmaier and J. Tsujii. Brussels, Belgium: pp. 1797–1807.
  88. Budzianowski, P., T.H. Wen, B.H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić. 2020. MultiWOZ – A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. arXiv arXiv:cs.
  89. Lowe, R., N. Pow, I. Serban, and J. Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Edited by A. Koller, G. Skantze, F. Jurcicek, M. Araki and C.P. Rose. Prague, Czech Republic: pp. 285–294.
  90. Carletta, J., S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, et al. 2006. The AMI Meeting Corpus: A Pre-announcement. In Proceedings of the Machine Learning for Multimodal Interaction, Second International Workshop (MLMI 2005), 11–13 July 2005. Edited by S. Renals and S. Bengio. Lecture Notes in Computer Science. Germany: pp. 28–39. doi:10.1007/11677482_3.
  91. Shriberg, E., R. Dhillon, S. Bhagat, J. Ang, and H. Carvey. 2004. The ICSI Meeting Recorder Dialog Act (MRDA) Corpus. Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, Cambridge, Massachusetts, USA; pp. 97–100.
  92. Yu, F., S. Zhang, Y. Fu, L. Xie, S. Zheng, Z. Du, W. Huang, P. Guo, Z. Yan, B. Ma, et al. 2022. M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge. arXiv arXiv:cs.
  93. Litman, D., S. Paletz, Z. Rahimi, S. Allegretti, and C. Rice. 2016. The Teams Corpus and Entrainment in Multi-Party Spoken Dialogues. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Edited by J. Su, K. Duh and X. Carreras. Austin, Texas: pp. 1421–1431.
  94. Shaikh, S., T. Strzalkowski, A. Broadwell, J. Stromer-Galley, S. Taylor, and N. Webb. 2010. MPC: A Multi-Party Chat Corpus for Modeling Social Phenomena in Discourse. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). Edited by N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias. Valletta, Malta.
  95. Manuvinakurike, R., S. Sahay, W. Chen, and L. Nachman. 2021. Incremental temporal summarization in multi-party meetings. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue. Edited by H. Li, G.A. Levow, Z. Yu, C. Gupta, B. Sisman, S. Cai, D. Vandyke, N. Dethlefs, Y. Wu and J.J. Li. Singapore and Online: pp. 530–541.
  96. Sedoc, J., D. Ippolito, A. Kirubarajan, J. Thirani, L. Ungar, and C. Callison-Burch. 2019. ChatEval: A Tool for Chatbot Evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Edited by W. Ammar, A. Louis and N. Mostafazadeh. Minneapolis, Minnesota: pp. 60–65.
  97. Li, Y., B. Zou, Y. Fan, X. Li, A.T. Aw, and Y. Hong. 2023. GLGR: Question-aware Global-to-Local Graph Reasoning for Multi-party Dialogue Reading Comprehension. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023. Edited by H. Bouamor, J. Pino and K. Bali. Singapore: pp. 1817–1826.
  98. Chen, Y., J. Jiang, J. Liu, Y. Zhang, X. Guo, and I. King. 2026. Trace: Trajectory-aware comprehensive evaluation for deep research agents. Proceedings of the ACM Web Conference 2026: 2524–2534.
  99. Hu, H., J. Si, Q. Wang, T. Weng, Y. Ji, J. Jiang, F. Ma, Y. Zhou, L. Cui, and Q. Tian. 2026. MindDialog: A large-scale benchmark for counseling dialogue understanding and generation. Pattern Recognition, 113766.
  100. Jiang, J., P. Chen, L. Chen, S. Wang, Q. Bao, L. Kong, Y. Li, and C. Wu. 2025. How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models. Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025: 4464–4505.
  101. Jiang, J., A.K.Y. Truong, Y. Chen, Q. Bao, S. Wang, P. Chen, J. Wang, L. Kong, Y. Li, and C. Wu. 2025. Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025: 1924–1944.
  102. Zhou, J., S. Wang, J. Dong, K. Liu, L. Li, J. Gao, J. Jiang, L. Kong, and C. Wu. 2025. PROREASON: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 31650–31679.
Figure 1. An overview of the Multi-Party Dialogue (MPD) research landscape, organized along four axes: (a) a six-category task taxonomy, (b) a methodological progression from traditional approaches through neural networks and large language models to hybrid multi-agent paradigms, (c) representative datasets grouped by purpose, and (d) four families of evaluation metrics. See Section 2 to Section 5 for full details.
Figure 2. Historical evolution of representative MPD methods from 2003 to 2026, organized along four research branches: Dialogue Structure and Interaction Modeling (red), Speaker and Role-aware Modeling (teal), Persona Enhanced Modeling (cyan), and LLM-based Approaches (purple). The tree visualization highlights how each branch matures from early statistical models to graph neural networks and modern LLM-driven systems. See Section 3 for detailed discussion.
Figure 3. Strategic Roadmap of Challenges and Opportunities in Multi-Party Dialogue. The diagram identifies four fundamental bottlenecks hindering current MPD systems (Combinatorial Complexity of N-party dynamics, the Structural-Semantic Divide in discourse modeling, the Evaluation Conundrum, and the Data and Modality Bottleneck) and maps them to promising future research directions (Social Cognition Modeling, Neuro-Symbolic Fusion, Holistic Benchmarks, and Multimodal Grounding). See Section 6.
Table 1. Representative milestones in MPD research, illustrating the field’s evolution from statistical models, through neural and graph-based architectures, to LLMs and multi-agent paradigms. Headline results are reported on each paper’s main benchmark. The complete catalog appears in Table 4.
| Method | Task | Dataset | Era | Headline Result |
| --- | --- | --- | --- | --- |
| Feature-driven seg. Galley et al. (2003) | Topic segmentation | ICSI | Statistical | $P_k$=23.0, WD=25.47 |
| Bayesian topic models Purver et al. (2006) | Topic segmentation | ICSI | Statistical | $P_k$=28.9, WD=32.9 |
| Static/Dynamic RNN Ouchi and Tsuboi (2016) | Addressee + Response | Self-built | Neural | ADR=68.54, RES=78.64 |
| W2W Le et al. (2019) | Addressee identification | Ubuntu IRC | Neural | Len-5=80.86 |
| MPC-BERT Gu et al. (2021b) | Addressee, Speaker, Sel. | Ubuntu IRC | PLM | P@1=98.31, Acc=92.42 |
| HeterMPC Gu et al. (2022) | Response generation | Ubuntu IRC | Graph + PLM | BLEU-1=12.61 |
| GIFT Gu et al. (2023) | Multi-task understanding | Ubuntu IRC | Graph + PLM | $R_2@1$=95.04 |
| ELECTRA-EMVI Li et al. (2023) | Discourse parsing, QA | Molweni | PLM + Latent | $F_1$ G=91.78 |
| PFT-Prompt Addlesee et al. (2023) | Goal tracking | EU SPRING | LLM Prompting | Acc=69.57 |
| RL-TRC Fan et al. (2024) | Response generation | Ubuntu IRC | RL + LLM | BLEU-1=13.66 |
| MuPaS Wang et al. (2024) | Gen. + Speaker pred. | Friends, GoT | LLM-SFT | GSM8K=43.14 |
| DICE-BENCH Jang et al. (2025) | Tool-calling | DICE-BENCH | Multi-Agent | DICE=3.64 |
| SS-MPC Jang et al. (2025) | Response generation | Ubuntu IRC | Multi-Agent | BLEU-1=15.60 |
| EverMemBench Hu et al. (2026) | Long-horizon memory | EverMemBench | Multi-Agent | Avg=37.44/72.61 |
Table 2. Representative MPD datasets organized by category, with key statistics and annotation focus. Categories are: Struct. (Structure and Disentanglement), Emo. (Emotion and Social Relationship), Task. (Task-Oriented and Open-Domain), and Multi. (Multimodal and Naturalistic Meetings).
| Dataset | Category | Source | Scale | Modality | Key Annotations |
| --- | --- | --- | --- | --- | --- |
| STAC Asher et al. (2016) | Struct. | Online games | Multi-party chats | Text | SDRT discourse structure, dialogue acts |
| Molweni Li et al. (forthcoming) | Struct. | Ubuntu chat | 10K dial./88K utt. | Text | Discourse dependency, QA pairs |
| MTDD Chang et al. (2023) | Struct. | TV/movie scripts | 10K turns/831 shows | Text | Thread disentanglement, floor changes |
| Bazinga! Lerner et al. (2022) | Struct. | TV/movie scripts | Large-scale | Text | Diarization, addressee, entity linking |
| MELD Poria et al. (2019) | Emo. | Friends | 13K utterances | A+V+T | Emotion, sentiment (multimodal) |
| EmotionLines Hsu et al. (2018) | Emo. | Friends/FB | 29K utterances | Text | Seven emotion labels |
| EmoryNLP Zahiri and Choi (2017) | Emo. | Friends | 12.6K utterances | Text | Seven crowdsourced emotions |
| MPDD Chen et al. (2020) | Emo. | Chinese TV | Multi-party | Text | Emotions, interpersonal relations (CN) |
| ALOHA Li et al. (2020) | Emo. | TV/movie scripts | 1M+ dialogue lines | Text | Human-Level Attributes (HLAs) |
| SAMSum Gliwa et al. (2019) | Task. | Messenger style | 16K conversations | Text | Third-person abstractive summaries |
| XSum Narayan et al. (2018) | Task. | BBC articles | 226K pairs | Text | Single-sentence extreme summaries |
| MultiWOZ Budzianowski et al. (2020) | Task. | Wizard-of-Oz | 7 domains | Text | Belief tracking, dialogue acts |
| Ubuntu Lowe et al. (2015) | Task. | Tech support IRC | ∼1M dialogues | Text | Next-response selection |
| DICE-BENCH Jang et al. (2025) | Task. | Synthesized | 1,607 dialogues | Text | Multi-round tool-calling traces |
| WMPC Penzo et al. (2026) | Task. | LLM-synthesized | Constrained | Text | Structure, stance, interaction flow |
| AMI Carletta et al. (2006) | Multi. | Real meetings | 100 hours | A+V+T | Dialogue acts, gestures, attention |
| ICSI MRDA Shriberg et al. (2004) | Multi. | Real meetings | 72 h / 180K DA | A+T | Dialogue acts, adjacency pairs |
| AliMeeting Yu et al. (2022) | Multi. | Mandarin meetings | 120 hours | Audio (8ch) | Speaker diarization, multi-speaker ASR |
| Teams Corpus Litman et al. (2016) | Multi. | Cooperative games | 63 teams | A+V+T | Entrainment, dominance, group dynamics |
Table 3. The four families of MPD evaluation metrics, with representative metrics and primary use cases.
| Family | Representative Metrics | Primary Use Cases |
| --- | --- | --- |
| Automatic Generation | BLEU, ROUGE, METEOR, BERTScore, PPL, Distinct-n | Response generation, summarization, fluency, diversity |
| Classification & Selection | Accuracy, F1 (macro/micro), Precision@k, Recall@k, MRR, EM, AUC, DICE-score | Addressee/intent classification, response selection, tool-calling |
| Structural & Discourse | $P_k$, WindowDiff, Link/Link+Rel F1, DER, Kappa | Topic segmentation, discourse parsing, diarization |
| Human Evaluation | Fluency, Coherence, Relevance, Informativeness, Authority | Pragmatic appropriateness, role coordination |
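As an illustration of the Structural & Discourse family in Table 3, the following minimal Python sketch computes $P_k$ and WindowDiff for topic segmentation under their commonly used definitions. It is not taken from any surveyed system. The function names (`p_k`, `window_diff`), the input encoding (one segment id per utterance, with ids never reused across segments), and the default window size (half the mean reference segment length) are assumptions made for this example.

```python
from typing import Optional, Sequence


def _default_window(reference: Sequence[int]) -> int:
    """Assumed default: half the mean reference segment length, at least 1."""
    num_segments = 1 + sum(a != b for a, b in zip(reference, reference[1:]))
    return max(1, round(len(reference) / (2 * num_segments)))


def _check(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> None:
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    if len(reference) <= k:
        raise ValueError("sequence must be longer than the window size k")


def p_k(reference: Sequence[int], hypothesis: Sequence[int],
        k: Optional[int] = None) -> float:
    """Fraction of windows where the two segmentations disagree on whether
    positions i and i+k fall in the same segment (lower is better)."""
    k = k or _default_window(reference)
    _check(reference, hypothesis, k)
    windows = len(reference) - k
    errors = sum(
        (reference[i] == reference[i + k]) != (hypothesis[i] == hypothesis[i + k])
        for i in range(windows)
    )
    return errors / windows


def _boundaries(labels: Sequence[int], start: int, end: int) -> int:
    """Number of segment boundaries between positions start and end."""
    return sum(labels[j] != labels[j + 1] for j in range(start, end))


def window_diff(reference: Sequence[int], hypothesis: Sequence[int],
                k: Optional[int] = None) -> float:
    """Fraction of windows where the boundary counts differ (lower is better)."""
    k = k or _default_window(reference)
    _check(reference, hypothesis, k)
    windows = len(reference) - k
    errors = sum(
        _boundaries(reference, i, i + k) != _boundaries(hypothesis, i, i + k)
        for i in range(windows)
    )
    return errors / windows


if __name__ == "__main__":
    ref = [0, 0, 0, 0, 1, 1, 1, 1]   # one topic shift after the 4th utterance
    hyp = [0, 0, 0, 0, 0, 0, 0, 0]   # hypothesis that misses the shift
    print(p_k(ref, hyp), window_diff(ref, hyp))
```

Both functions return a fraction in [0, 1]. Segmentation papers often report these scores multiplied by 100, so scaling may be needed before comparing against published figures.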
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.