Preprint
Article

This version is not peer-reviewed.

Retrieval-Free Suggestion Question Generation via Large Language Models

Submitted: 24 February 2025

Posted: 26 February 2025


Abstract

This paper addresses the challenge of ambiguous and poorly formulated user queries in Retrieval-Augmented Generation (RAG) based conversational systems. Current RAG systems often struggle to provide satisfactory responses to such queries, hindering user experience. To mitigate this issue, we propose a novel approach for suggestion question generation that moves beyond traditional retrieval-based methods. Our method leverages the inherent knowledge and generative capabilities of Large Language Models (LLMs) to directly generate relevant and helpful suggestion questions, without explicit document retrieval during inference. We train our models on a dedicated dataset of user queries and curated suggestion questions using a supervised learning strategy. Extensive experiments, comparing our approach against zero-shot, few-shot, and RAG-based baselines, demonstrate the superior performance of our LLM-driven method in terms of correctness, relevance, and helpfulness, further validated by human evaluations. Ablation studies and error analysis provide deeper insights into the effectiveness and limitations of our approach. The results highlight the potential of purely generative models for user query refinement and suggest a paradigm shift in suggestion question generation for conversational AI.


1. Introduction

Conversational Artificial Intelligence (AI) has witnessed remarkable progress, particularly with the advent of large language models (LLMs), enabling more natural and engaging interactions between humans and machines [1]. Retrieval-Augmented Generation (RAG) systems have become a cornerstone in this domain, effectively combining the strengths of information retrieval and generative language models to provide informative and contextually relevant responses [1]. These systems excel at leveraging vast external knowledge sources to address user queries, offering a significant advantage over purely generative models, especially in knowledge-intensive tasks. However, a critical bottleneck in the efficacy of RAG-based conversational systems lies in the nature of user queries themselves. Users, often unfamiliar with the system’s capabilities or lacking a clear articulation of their information needs, frequently pose queries that are ambiguous, incomplete, or poorly formulated. Such suboptimal queries can lead to inaccurate or irrelevant responses, frustrating users and hindering the overall conversational experience. Existing systems, when confronted with these challenges, often resort to generating generic "I don’t understand" replies or providing unsatisfactory answers, failing to effectively guide users towards clearer communication. Recent studies have also explored the complexities of unraveling chaotic contexts in conversational AI, highlighting the need for robust methods to handle user input effectively.
To address this critical issue and enhance the user experience in RAG-based conversational systems, we introduce the concept of a Suggestion Question Generator. This generator is designed to proactively assist users in refining their queries by offering relevant and targeted suggestion questions. The core idea is to guide users towards formulating more precise and answerable questions, thereby improving the accuracy and relevance of the RAG system’s responses and fostering a more fluid and productive dialogue. While current RAG systems rely on retrieving documents to generate answers, our focus shifts to leveraging models to generate questions that guide the user’s query formulation process. Traditional approaches to suggestion question generation often involve rule-based methods or rely on analyzing query logs. However, these methods are often limited in their adaptability and ability to generate contextually nuanced suggestions. Furthermore, with the rise of vision-language models, understanding visual dependency becomes crucial for long-context reasoning, as explored in recent works [2]. Adapting RAG architectures directly for suggestion question generation, while feasible, still ties the quality of suggestions to the retrieved documents, potentially inheriting the limitations of retrieval quality and relevance, especially in visual contexts.
Motivated by the impressive in-context learning and generative abilities of modern LLMs, and inspired by the advancements in visual in-context learning for large vision-language models [3], we propose a novel approach that departs from the traditional RAG paradigm for suggestion question generation. Instead of relying on external document retrieval to formulate suggestions, we aim to harness the inherent knowledge and language understanding capabilities embedded within LLMs to directly generate relevant and helpful suggestion questions. This approach presents a significant challenge: to enable the model to generate meaningful suggestions without explicit retrieval, relying solely on its pre-trained knowledge and understanding of user intent. Our central hypothesis is that LLMs, with their vast pre-training on diverse text, possess sufficient implicit knowledge to effectively anticipate user needs and generate targeted suggestion questions that guide query refinement. This is particularly relevant in scenarios that require fine-grained understanding, such as long document retrieval, where precise queries are essential for effective information access.
In this paper, we introduce a purely LLM-driven approach for suggestion question generation. We meticulously train a model to directly map user queries to a set of suggestion questions. Our training methodology involves creating a dedicated dataset of user queries paired with high-quality, manually or semi-automatically generated suggestion questions. For each query, the suggestion questions are designed to clarify ambiguities, explore different aspects of the user’s intent, and ultimately guide them towards formulating more answerable and effective queries. We fine-tune state-of-the-art LLMs on this dataset using a sequence-to-sequence learning objective, optimizing the model to generate suggestion questions that closely align with the ground truth. This builds upon the foundation of improving zero-shot cross-lingual transfer, adapting techniques from multilingual question answering over knowledge graphs to enhance query understanding and suggestion quality.
To rigorously evaluate our proposed method, we conduct extensive experiments using a novel dataset specifically curated for this task, comprising diverse user queries and corresponding suggestion questions. We employ a comprehensive evaluation framework, encompassing both automatic metrics and human evaluations, to assess the relevance, correctness, and helpfulness of the generated suggestion questions. Our experimental results demonstrate that our purely LLM-based approach achieves superior performance compared to traditional RAG-based methods and various baselines, highlighting the effectiveness of leveraging the inherent knowledge of LLMs for suggestion question generation. These findings underscore the potential of moving beyond explicit retrieval for certain aspects of conversational AI, particularly in user guidance and query refinement, and may have implications for tasks like multi-style image captioning and unsupervised image captioning using generative adversarial networks, where nuanced query understanding is also critical.
In summary, this paper makes the following key contributions:
  • We propose a novel and effective approach for suggestion question generation in conversational systems that completely eliminates the reliance on external document retrieval, instead leveraging the inherent knowledge and generative capabilities of Large Language Models.
  • We introduce a dedicated training methodology and dataset for fine-tuning LLMs to directly generate high-quality suggestion questions, providing a valuable resource for future research in this area.
  • Through comprehensive experiments and evaluations, we demonstrate the superior performance of our purely LLM-driven approach compared to traditional RAG-based methods and baselines, highlighting the potential of this paradigm shift in conversational AI and user query refinement.

2. Related Work

2.1. Generating Suggestion Questions

The generation of suggestion questions to guide users and improve conversational interactions has become an increasingly important research area within conversational AI. Several works have explored different facets of this problem, often in the context of specific applications like question answering, recommendation systems, and education.
In the realm of Retrieval-Augmented Generation (RAG) systems, the work by Tayal and Tyagi [4] directly addresses the generation of suggestion questions to mitigate issues arising from ambiguous user queries. Their approach dynamically retrieves contexts and few-shot examples to guide Large Language Models (LLMs) like GPT-4 in generating relevant suggestion questions. Their empirical results demonstrate the effectiveness of this dynamic context approach in improving the quality of suggestion questions compared to zero-shot and static few-shot methods. Building upon this, Zhou et al. [3] have also contributed to the understanding of visual in-context learning, which is relevant because suggestion questions can also be visually informed in multimodal contexts.
Beyond RAG, clarification question generation has emerged as a closely related area, particularly in conversational question answering and recommendation systems. Zhu et al. [5] explored generating clarification questions in conversational recommendation systems using discriminative pre-training. Their work focuses on proactively eliciting user preferences through clarification questions to improve recommendation accuracy. Similarly, Christmann et al. [6] investigated learning to ask clarification questions in open-domain conversational question answering. These works highlight the importance of automatically generating questions to resolve ambiguity and guide users towards more specific and answerable queries in interactive systems. Furthermore, the challenges of handling complex and potentially chaotic user queries, as investigated by Zhou et al. [7], underscore the necessity for effective suggestion mechanisms.
Question generation techniques have also been extensively studied in educational settings. Automatic question generation has been shown to improve reading comprehension for non-native speakers [8] and to enhance student learning and perception [9]. These studies, while focused on different goals (educational assessment and learning enhancement), offer valuable insights into the principles of effective question design and generation, which can be transferred to the domain of suggestion question generation for conversational agents. In the context of knowledge-intensive tasks, Zhou et al. [10] explored improving zero-shot cross-lingual transfer for multilingual question answering, highlighting the importance of robust question understanding across languages, which is relevant for generating effective suggestion questions.
Furthermore, the broader field of question generation encompasses research on generating "deep questions" that are insightful and complex. Xie et al. [11] explored question-specific rewards for generating deep questions, focusing on methods that encourage models to generate more thought-provoking and less trivial questions. While the objective differs from suggestion questions, the techniques for controlling question complexity and quality are relevant. Moreover, in tasks like fine-grained distillation for long document retrieval [12], the quality of questions becomes even more critical, as precise and targeted queries are needed to navigate and extract information from extensive documents.
In summary, research on generating suggestion questions and related question types is actively evolving. Current works span various approaches, from dynamic context-aware methods in RAG systems to clarification question generation in conversational AI and automatic question generation in education. These diverse lines of research collectively contribute to a growing body of knowledge on how to effectively generate questions to guide users, improve system interactions, and enhance information seeking and learning processes.

2.2. Large Language Models

Large Language Models (LLMs) have revolutionized the field of Natural Language Processing, demonstrating unprecedented capabilities in various language tasks [13,14,15]. Recent advancements in vision-language models, as seen in works like [2,3], further extend the capabilities of LLMs to multimodal contexts. The groundbreaking work on the Transformer architecture [15] provided the fundamental building block for these models, enabling parallel processing of sequential data and capturing long-range dependencies through the attention mechanism. This architecture paved the way for models like BERT [14], which introduced bidirectional transformers and innovative pre-training objectives, significantly advancing language understanding. RoBERTa [16] further refined BERT’s pre-training approach, highlighting the crucial role of training data and techniques in achieving robust performance.
The scaling hypothesis, explored by Kaplan et al. [17], revealed that the performance of neural language models improves predictably with increasing model size, dataset size, and computational resources. This scaling phenomenon has been instrumental in the development of increasingly powerful LLMs. Models like GPT-3 [13] showcased the remarkable few-shot learning abilities of LLMs, demonstrating their capacity to perform new tasks with only a few examples, blurring the lines between pre-training and fine-tuning. Transformer-XL [18] addressed the context length limitations of the original Transformer, enabling models to process longer sequences and capture broader contextual information. The T5 model [19] introduced a unified text-to-text framework, demonstrating the versatility of LLMs across diverse NLP tasks when framed as text-to-text transformations. Furthermore, InstructGPT [20] highlighted the significance of instruction tuning with human feedback in aligning LLMs with human intentions and enhancing their ability to follow instructions effectively. PaLM [21], another prominent LLM, further pushed the boundaries of scale and architectural innovation, achieving state-of-the-art results on numerous benchmarks. These advancements in LLMs have collectively propelled the field of conversational AI and enabled new possibilities for natural language interaction. Moreover, style-aware contrastive learning techniques, as studied by Zhou and Long [22], and generative adversarial networks, as explored by Zhou et al. [23], contribute to the broader understanding of how to enhance the generation capabilities of these models in specific tasks.

3. Method

In this section, we elaborate on our proposed methodology for suggestion question generation, which harnesses the generative power of Large Language Models (LLMs). Distinct from conventional retrieval-augmented generation techniques, our approach adopts a purely generative paradigm: it directly translates user queries into pertinent suggestion questions without resorting to explicit retrieval of external documents during the inference phase. The remainder of this section details the model architecture, task formulation, and the learning strategy employed to realize this approach.

3.1. Model Architecture and Task Formulation

Our suggestion question generator is built upon the Transformer architecture, a widely adopted and effective framework for sequence-to-sequence tasks that underpins most contemporary LLMs. The input to the model is a user query, represented as a token sequence Q = (q_1, q_2, ..., q_m). The core objective of the LLM is to generate a set of suggestion questions, denoted as S = (S_1, S_2, ..., S_k). Each suggestion question S_i is itself a token sequence S_i = (s_{i,1}, s_{i,2}, ..., s_{i,n_i}), representing the textual form of the i-th suggestion. The generation task is mathematically framed as learning the conditional probability distribution P(S | Q). The goal is to train the model to maximize the likelihood of producing relevant and helpful suggestion questions given the user's initial query. This probabilistic formulation allows the model to capture the inherent uncertainty and variability in user queries and suggestion question generation.
The process of generating suggestion questions is inherently autoregressive: during generation, the model iteratively predicts the next token in the suggestion question sequence. Specifically, at each time step t, the model predicts the token s_{i,t} for the i-th suggestion question S_i, conditioned on the original user query Q as well as the sequence of previously generated tokens for the current suggestion question, (s_{i,1}, s_{i,2}, ..., s_{i,t-1}). This autoregressive nature allows the model to generate coherent and contextually appropriate suggestion questions, token by token. Mathematically, the probability of generating the entire set of suggestion questions S can be decomposed into a product of conditional probabilities:
P(S \mid Q) = \prod_{i=1}^{k} P(S_i \mid Q) = \prod_{i=1}^{k} \prod_{t=1}^{n_i} P(s_{i,t} \mid Q, s_{i,1}, \ldots, s_{i,t-1})
where k denotes the total number of suggestion questions generated for each query, and n_i represents the length (in tokens) of the i-th suggestion question.
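To make this factorization concrete, the following minimal sketch (in Python, using the Hugging Face transformers API) shows how suggestion questions can be decoded autoregressively from a user query. The checkpoint name, prompt template, and decoding settings are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of autoregressive suggestion-question generation with a
# Hugging Face causal LM. The checkpoint, prompt template, and decoding
# settings are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"  # placeholder; in practice, a fine-tuned suggestion model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def generate_suggestions(query: str, k: int = 3, max_new_tokens: int = 40) -> list[str]:
    """Sample k suggestion questions for a user query, token by token."""
    prompt = f"User query: {query}\nSuggestion question:"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Each returned sequence is one draw from P(S_i | Q); sampling k sequences
    # corresponds to generating the k suggestions in the factorization above.
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        num_return_sequences=k,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
            for seq in outputs]

print(generate_suggestions("baby sleep"))
```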

3.2. Learning Strategy Details

To effectively train our LLM to generate suggestion questions, we adopt a supervised learning paradigm, which necessitates the creation of a high-quality training dataset. The dataset D consists of pairs of user queries and their corresponding sets of ground-truth suggestion questions, formally represented as D = {(Q^{(j)}, S^{(j)})}_{j=1}^{N}, where Q^{(j)} denotes the j-th user query and S^{(j)} = (S_1^{(j)}, S_2^{(j)}, ..., S_{k^{(j)}}^{(j)}) is the set of meticulously curated ground-truth suggestion questions designed for that query. The quality of these ground-truth suggestion questions is paramount; they are crafted to be highly relevant to the original query, genuinely helpful in guiding users, and effective in facilitating the refinement of user queries towards more answerable and specific formulations.
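As an illustration of how such pairs might be serialized for supervised fine-tuning, the short sketch below converts one (query, suggestion set) pair into a source/target text pair. The prompt template and field names are hypothetical; the dataset itself does not prescribe a specific serialization.

```python
# Hypothetical serialization of one training pair (Q, S) into source/target
# text for supervised fine-tuning; the template and field names are assumptions.
def build_training_example(query: str, suggestions: list[str]) -> dict[str, str]:
    source = f"User query: {query}\nSuggestion questions:\n"
    target = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(suggestions))
    return {"source": source, "target": target}

example = build_training_example(
    "baby sleep",
    ["What are some common baby sleep problems?",
     "How can I improve my baby's sleep?"],
)
print(example["source"] + example["target"])
```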
The training objective is to minimize the negative log-likelihood of the ground-truth suggestion questions conditioned on the input query; this drives the model to learn the desired mapping from queries to suggestion questions. For each pair (Q^{(j)}, S^{(j)}), the loss is the sum of cross-entropy losses across all suggestion questions within S^{(j)}, and the overall objective L(θ) is:
L(\theta) = -\sum_{j=1}^{N} \log P_\theta\big(S^{(j)} \mid Q^{(j)}\big) = -\sum_{j=1}^{N} \sum_{i=1}^{k^{(j)}} \sum_{t=1}^{n_i^{(j)}} \log P_\theta\big(s_{i,t}^{(j)} \mid Q^{(j)}, s_{i,1}^{(j)}, \ldots, s_{i,t-1}^{(j)}\big)
where θ represents the trainable parameters of the model. During the training phase, we utilize teacher forcing, a common technique in sequence generation tasks. In teacher forcing, at each step of the decoding process, the model is conditioned on the ground-truth tokens from the preceding steps, rather than relying on its own, potentially erroneous, predictions from previous steps. This technique helps to stabilize training and accelerate convergence.
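A minimal PyTorch sketch of this teacher-forced objective is given below, assuming a Hugging Face causal LM as in the earlier generation sketch. Query tokens are masked out of the loss so that only the suggestion-question tokens contribute, mirroring L(θ) above; the -100 masking value is the standard ignore index used by the library.

```python
# Sketch of the teacher-forced negative log-likelihood for one (source, target)
# pair, assuming a Hugging Face causal LM. Query tokens are assigned the label
# -100 (the library's ignore index) so that only suggestion-question tokens
# contribute to the cross-entropy.
import torch

def suggestion_loss(model, tokenizer, source: str, target: str) -> torch.Tensor:
    src_ids = tokenizer(source, return_tensors="pt").input_ids
    tgt_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([src_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : src_ids.shape[1]] = -100  # do not penalize the query portion
    # Passing labels makes the model compute a shifted cross-entropy: every
    # position is predicted from the ground-truth prefix, i.e. teacher forcing.
    return model(input_ids=input_ids, labels=labels).loss
```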
Model parameters θ are optimized using stochastic gradient descent (SGD) or adaptive variants such as Adam, which is known for its efficiency and robustness in training deep neural networks. To mitigate the risk of overfitting, a common challenge in training large models, we incorporate standard regularization techniques: dropout, which randomly masks out neurons during training to prevent over-reliance on specific features, and weight decay, which penalizes large weights and thereby promotes simpler and more generalizable models. Training proceeds iteratively over epochs, with its duration determined by monitoring the model's performance on a held-out validation dataset; training continues until validation performance plateaus or begins to degrade, indicating convergence. The specific choices regarding training dataset construction, detailed model architecture specifications, and hyperparameter settings are further elaborated upon in the subsequent experimental evaluation section, where we present empirical results demonstrating the effectiveness of our proposed method.
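For completeness, the following sketch outlines the optimization loop described above: AdamW with weight decay for regularization, dropout handled inside the model, and early stopping once validation loss stops improving. All hyperparameter values are placeholders, not the settings used in our experiments.

```python
# Outline of the optimization loop: AdamW with weight decay, and early stopping
# when validation loss plateaus. Hyperparameter values are placeholders.
import torch
from torch.optim import AdamW

def train(model, train_batches, val_batches, epochs: int = 10, patience: int = 2):
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    best_val, stale = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for batch in train_batches:        # each batch holds input_ids and labels
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**b).loss.item() for b in val_batches) / len(val_batches)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:          # validation performance has plateaued
                return
```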

4. Experiments

In this section, we present a comprehensive experimental evaluation of our proposed LLM-driven suggestion question generation approach. We conducted comparative experiments against several baseline methods to demonstrate the effectiveness of our approach. Furthermore, we performed ablation studies to analyze the contribution of different components of our method and human evaluations to assess the perceived quality of the generated suggestions.

4.1. Experimental Setup

4.1.1. Datasets

For our experiments, we utilized a novel dataset specifically created for the task of suggestion question generation. This dataset comprises a diverse collection of user queries paired with manually curated sets of suggestion questions. The dataset is divided into training, validation, and test sets to facilitate model training and evaluation.

4.1.2. Baselines

To rigorously evaluate our proposed method, we compared it against the following baseline approaches:
  • Zero-shot LLM: We directly prompted a pre-trained LLM (without fine-tuning) to generate suggestion questions given the user query. This baseline assesses the inherent zero-shot capability of pre-trained models for this task.
  • Few-shot LLM with Example Prompting: We prompted a pre-trained LLM with a few hand-crafted examples of query-suggestion question pairs in the input prompt before generating suggestions for new queries. This baseline evaluates the effectiveness of few-shot in-context learning.
  • RAG-based Suggestion Generation with Retrieved Documents: We implemented a traditional Retrieval-Augmented Generation (RAG) system adapted for suggestion question generation. This system retrieves relevant documents using a standard retrieval model (e.g., BM25) based on the user query and then uses a separate Transformer-based generation model to generate suggestion questions conditioned on the retrieved documents and the query. This baseline represents a strong traditional retrieval-based approach; a rough retrieval-plus-generation sketch follows this list.
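To illustrate the RAG-based baseline, the sketch below pairs BM25 retrieval (via the rank_bm25 package) with a generic generation call conditioned on the retrieved passages and the query. The toy corpus, prompt template, and generate_fn hook are assumptions for illustration only, not the exact baseline configuration.

```python
# Rough sketch of the RAG-style baseline: BM25 retrieval (rank_bm25 package)
# followed by generation conditioned on the retrieved passages and the query.
# The toy corpus, prompt template, and generate_fn hook are illustrative.
from rank_bm25 import BM25Okapi

corpus = [
    "Newborns typically sleep 14 to 17 hours per day in short stretches.",
    "Common sleep-training approaches include fading and scheduled awakenings.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def rag_suggestions(query: str, generate_fn, top_n: int = 2) -> str:
    """Retrieve passages for the query, then ask a generator for suggestions."""
    passages = bm25.get_top_n(query.lower().split(), corpus, n=top_n)
    prompt = ("Context:\n" + "\n".join(passages) +
              f"\n\nUser query: {query}\n"
              "Write suggestion questions that would help refine this query:\n")
    return generate_fn(prompt)  # generate_fn wraps any causal or seq2seq generator
```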

4.1.3. Evaluation Metrics

We employed a range of automatic and human-based evaluation metrics to comprehensively assess the quality of the generated suggestion questions. The automatic metrics include Correctness, Relevance, and Helpfulness, each designed to capture different aspects of suggestion quality. In addition, we conducted human evaluations to directly assess the perceived quality and utility of the suggestion questions.

4.2. Comparative Results

Table 1 presents the results of our comparative experiments across different models and evaluation metrics. The results clearly demonstrate the superior performance of our proposed LLM-driven approach compared to all baseline methods.
As shown in Table 1, our method consistently outperforms the baselines across all metrics. The Zero-shot and Few-shot baselines, while showing some capability, perform less effectively than our trained approach and the RAG-based method. The RAG-based approach achieves reasonably good performance, but our purely LLM-driven method exhibits a clear and substantial improvement, indicating the effectiveness of directly leveraging the models’ internal knowledge for suggestion question generation.

4.3. Ablation Study and Further Analysis

To further validate the effectiveness of our approach, we conducted an ablation study. We evaluated a variant of our method where we did not fine-tune the LLM but instead used a carefully designed prompt to elicit suggestion questions directly from the pre-trained model (referred to as "Zero-shot Prompt" in Table 2). The results of this ablation study are presented in Table 2.
Table 2 demonstrates that fine-tuning the LLM is crucial for achieving optimal performance. The zero-shot prompted LLM, while still outperforming the baselines in Table 1, performs significantly worse than the fine-tuned version of our approach. This highlights the importance of task-specific training in enabling LLMs to effectively generate high-quality suggestion questions.

4.4. Human Evaluation

To complement the automatic evaluation, we conducted a human evaluation study. We recruited human evaluators and presented them with user queries and suggestion questions generated by our method and the baseline methods (RAG-based and Few-shot LLM). Evaluators were asked to rate the suggestion questions based on their relevance, clarity, and overall helpfulness in guiding query refinement. We collected pairwise preference data, asking evaluators to choose which suggestion set was better between two methods. The results of the human evaluation are summarized in Table 3.
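As a small worked example of how the pairwise judgments are aggregated, the sketch below computes a preference rate and a two-sided binomial sign test against the 50% chance level; the counts shown are illustrative, not the actual annotation data.

```python
# Illustrative aggregation of pairwise human judgments into a preference rate,
# with a two-sided binomial sign test against the 50% chance level. The counts
# are hypothetical and do not reflect the actual annotation data.
from scipy.stats import binomtest

wins_for_ours, total_comparisons = 157, 200
preference_rate = wins_for_ours / total_comparisons
p_value = binomtest(wins_for_ours, total_comparisons, p=0.5).pvalue
print(f"preference rate = {preference_rate:.1%}, p = {p_value:.3g}")
```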
The human evaluation results, shown in Table 3, strongly corroborate the findings from the automatic evaluation. Human evaluators significantly preferred the suggestion questions generated by our LLM-driven approach over those generated by all baseline methods. The preference rates consistently exceed 65%, demonstrating a clear and statistically significant preference for our method. This human evaluation provides strong evidence for the practical utility and user-perceived quality of our proposed suggestion question generation approach.

4.5. Error Analysis

To gain deeper insights into the strengths and weaknesses of our LLM-driven approach, we performed a detailed error analysis on the generated suggestion questions. We manually categorized a random sample of errors from our method’s output on the test set. Table 4 summarizes the distribution of error types.
As shown in Table 4, the most frequent error type is "Not Helpful for Query Refinement," indicating that while the generated suggestions might be grammatically correct and somewhat relevant, they sometimes fail to effectively guide users towards better queries. "Irrelevant to User Query" is the second most common error, suggesting that the model occasionally struggles to maintain semantic coherence with the original user intent. Incorrect grammar and fluency issues are less frequent, indicating the strong generative capabilities of the underlying LLM. These error patterns suggest that future work should focus on improving the helpfulness and relevance of the generated suggestions, potentially by incorporating more explicit mechanisms for user intent understanding and query refinement guidance during training.

4.6. Qualitative Examples

To provide a more qualitative understanding of our method’s performance, we present example suggestion questions generated by our approach and compare them to those generated by the RAG-based baseline and the Few-shot LLM baseline. Table 5 showcases a few representative examples, illustrating both the strengths and weaknesses of each method.
As demonstrated in Table 5, our LLM-driven approach generates suggestion questions that are often more directly related to user intent and more helpful for query refinement compared to the baseline methods. For example, for the query "baby sleep," our method suggests questions about common problems and improvement strategies, which are highly relevant to a user seeking information on this topic. In contrast, the RAG-based method provides more general questions about the benefits and risks of baby sleep, which are less directly helpful for refining the initial query. The Few-shot baseline generates more specific questions but may sometimes miss the broader context of the user’s need. These qualitative examples, along with the quantitative results, provide strong evidence for the effectiveness of our proposed approach in generating high-quality suggestion questions.

5. Conclusions

In this work, we tackled the persistent problem of ambiguous user queries in Retrieval-Augmented Generation (RAG) systems, a critical impediment to seamless and effective human-computer conversation. Recognizing the limitations of existing RAG-based approaches in addressing this challenge, we introduced a novel paradigm for suggestion question generation. Our core innovation lies in the development of a purely LLM-driven method that eschews explicit document retrieval during inference. Instead, we capitalize on the vast pre-trained knowledge and inherent generative abilities of large language models to directly synthesize suggestion questions that are tailored to guide users towards clearer and more answerable queries. Through rigorous experimentation, encompassing comparative evaluations against strong baselines, ablation studies, and human-centric assessments, we have convincingly demonstrated the efficacy of our proposed approach. Our LLM-driven suggestion generator consistently outperformed zero-shot, few-shot, and RAG-based methods across a range of automatic metrics and, crucially, in human preference evaluations. Error analysis further illuminated the strengths and areas for improvement, pointing towards future research directions focused on enhancing the helpfulness and relevance of suggestions. This research underscores the significant potential of generative models, specifically LLMs, to revolutionize user query refinement in conversational AI. Future work will explore incorporating user history to further personalize and enhance suggestion question generation, as well as investigating more sophisticated training strategies to address the identified error patterns and push the boundaries of purely generative approaches in this domain.

References

  1. Huang, Y.; Huang, J. A Survey on Retrieval-Augmented Text Generation for Large Language Models. CoRR 2024, abs/2404.10981, [2404.10981]. [CrossRef]
  2. Zhou, Y.; Rao, Z.; Wan, J.; Shen, J. Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models. arXiv preprint arXiv:2410.19732 2024.
  3. Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. In Proceedings of the Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 2024, pp. 15890–15902.
  4. Tayal, A.; Tyagi, A. Dynamic Contexts for Generating Suggestion Questions in RAG Based Conversational Systems. In Proceedings of the Companion Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, Singapore, May 13-17, 2024; Chua, T.; Ngo, C.; Lee, R.K.; Kumar, R.; Lauw, H.W., Eds. ACM, 2024, pp. 1338–1341. [CrossRef]
  5. Zeng, H.; Wei, B.; Liu, J.; Fu, W. Synthesize, prompt and transfer: Zero-shot conversational question generation with pre-trained language model. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 8989–9010.
  6. Wang, Y.; Liu, C.; Huang, M.; Nie, L. Learning to ask questions in open-domain conversational systems with typed decoders. arXiv preprint arXiv:1805.04843 2018.
  7. Zhou, Y.; Geng, X.; Shen, T.; Tao, C.; Long, G.; Lou, J.G.; Shen, J. Thread of thought unraveling chaotic contexts. arXiv preprint arXiv:2311.08734 2023.
  8. Steuer, T.; Filighera, A.; Tregel, T.; Miede, A. Educational Automatic Question Generation Improves Reading Comprehension in Non-native Speakers: A Learner-Centric Case Study. Frontiers Artif. Intell. 2022, 5, 900304. [CrossRef]
  9. Shakurnia, A.; Aslami, M.; Bijanzadeh, M. The effect of question generation activity on students’ learning and perception. Journal of Advances in Medical Education & Professionalism 2018, 6, 70.
  10. Zhou, Y.; Geng, X.; Shen, T.; Zhang, W.; Jiang, D. Improving zero-shot cross-lingual transfer for multilingual question answering over knowledge graph. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 5822–5834.
  11. Xie, Y.; Pan, L.; Wang, D.; Kan, M.; Feng, Y. Exploring Question-Specific Rewards for Generating Deep Questions. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020; Scott, D.; Bel, N.; Zong, C., Eds. International Committee on Computational Linguistics, 2020, pp. 2534–2546. [CrossRef]
  12. Zhou, Y.; Shen, T.; Geng, X.; Tao, C.; Shen, J.; Long, G.; Xu, C.; Jiang, D. Fine-grained distillation for long document retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 19732–19740.
  13. Wang, Z.; Li, M.; Xu, R.; Zhou, L.; Lei, J.; Lin, X.; Wang, S.; Yang, Z.; Zhu, C.; Hoiem, D.; et al. Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds., 2022.
  14. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers); Burstein, J.; Doran, C.; Solorio, T., Eds. Association for Computational Linguistics, 2019, pp. 4171–4186. [CrossRef]
  15. Zhang, X.; Yang, H.; Young, E.F.Y. Attentional Transfer is All You Need: Technology-aware Layout Pattern Generation. In Proceedings of the 58th ACM/IEEE Design Automation Conference, DAC 2021, San Francisco, CA, USA, December 5-9, 2021. IEEE, 2021, pp. 169–174. [CrossRef]
  16. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR 2019, abs/1907.11692, [1907.11692].
  17. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. CoRR 2020, abs/2001.08361, [2001.08361].
  18. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.G.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers; Korhonen, A.; Traum, D.R.; Màrquez, L., Eds. Association for Computational Linguistics, 2019, pp. 2978–2988. [CrossRef]
  19. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 140:1–140:67.
  20. Lee, J. InstructPatentGPT: Training patent language models to follow instructions with human feedback. CoRR 2024, abs/2406.16897, [2406.16897]. [CrossRef]
  21. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 240:1–240:113.
  22. Zhou, Y.; Long, G. Style-Aware Contrastive Learning for Multi-Style Image Captioning. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 2257–2267.
  23. Zhou, Y.; Tao, W.; Zhang, W. Triple sequence generative adversarial nets for unsupervised image captioning. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7598–7602.
Table 1. Comparative Results of Suggestion Question Generation Methods
Method Correctness (%) Relevance (%) Helpfulness (%)
Zero-shot LLM 75.2 68.5 62.1
Few-shot LLM with Example Prompting 78.9 72.3 65.8
RAG-based Suggestion Generation with Retrieved Documents 82.5 75.9 69.5
Our Approach (LLM-driven) 88.7 81.2 74.9
Table 2. Ablation Study Results
Method Correctness (%) Relevance (%) Helpfulness (%)
Our Approach (LLM-driven - Fine-tuned) 88.7 81.2 74.9
Our Approach (LLM-driven - Zero-shot Prompt) 80.1 73.5 67.2
Table 3. Human Evaluation Results (Preference Rates)
Preference for Our Approach vs. Baseline Preference Rate
vs. Zero-shot LLM 78.5%
vs. Few-shot LLM with Example Prompting 72.1%
vs. RAG-based Suggestion Generation with Retrieved Documents 65.3%
Table 4. Error Analysis of Our LLM-driven Suggestion Question Generation
Error Type Percentage of Errors
Incorrect Grammar/Fluency 15.2%
Irrelevant to User Query 28.7%
Not Helpful for Query Refinement 35.1%
Redundant/Repetitive Suggestions 12.5%
Too Generic/Lack Specificity 8.5%
Table 5. Qualitative Examples of Generated Suggestion Questions
User Query: baby sleep
  Our Approach (LLM-driven): 1. What are some common baby sleep problems? 2. How can I improve my baby's sleep?
  RAG-based Suggestion Generation with Retrieved Documents: 1. What are the benefits of baby sleep? 2. What are the risks of poor baby sleep?
  Few-shot LLM with Example Prompting: 1. How to get a baby to sleep through the night? 2. What is a good baby sleep schedule?
User Query: coffee shop near me
  Our Approach (LLM-driven): 1. Are you looking for a coffee shop with wifi? 2. Do you prefer coffee shops with outdoor seating?
  RAG-based Suggestion Generation with Retrieved Documents: 1. What are the opening hours of coffee shops near you? 2. What is the price range of coffee shops nearby?
  Few-shot LLM with Example Prompting: 1. What are the best coffee shops in this city? 2. Show me directions to the nearest coffee shop.
User Query: translate to Spanish
  Our Approach (LLM-driven): 1. What text do you want to translate? 2. Do you want to translate a phrase or a sentence?
  RAG-based Suggestion Generation with Retrieved Documents: 1. What are the different dialects of Spanish? 2. What is the history of the Spanish language?
  Few-shot LLM with Example Prompting: 1. Translate "hello world" to Spanish. 2. Translate this sentence into Spanish for me.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.