Submitted:
22 April 2025
Posted:
23 April 2025
You are already at the latest version
Abstract
Keywords:
1. Introduction
2. Related Work
3. Methodology
3.1. Dataset Preparation
- 100-question sample for ChatGPT response generation and semantic evaluation.;
- 500-question sample for topic modeling using BERTopic.
3.2. ChatGPT Response Generation
- A system message guiding behavior: “You are a helpful question-answering assistant. Provide clear, accurate, and reliable responses”;
- A user message containing the full_question text.
3.3. Topic Modeling with BERTopic
3.4. Semantic Evaluation with BERTScore
- Precision: measures the proportion of GPT-generated content that is semantically aligned with the reference;
- Recall: captures how much of the original best answer is recovered in the GPT output;
- F1 score: represents the harmonic mean of Precision and Recall, indicating balanced overlap.
3.5. Clustering High and Low Performing Questions
- High-performance questions: BERTScore F1 ≥ 0.85;
- Low-performance questions: BERTScore F1 ≤ 0.78
- The high threshold captures cases where ChatGPT responses closely matched the best human-voted answers, indicating strong semantic fidelity. The low threshold was selected based on distributional analysis of the dataset, ensuring that the filtered subset reflects borderline or semantically deficient responses while retaining enough samples for meaningful analysis. We extracted the full_question field from both groups and applied TF_IDF vectorization (max features = 5000, English stopword removal) to represent each question in vector space [23]. Each set was then clustered separately using KMeans with k=4, allowing us to identify dominant question clusters within high and low performance groups. To visualize semantic separability, we applied PCA (Principal Component Analysis) [24] to project the TF-IDF matrix into 2D space. Cluster assignments were added as labels and stored in new columns within the dataset.
4. Results
4.1. ChatGPT Answer Quality
4.2. Distribution of High and Low Performance
- High-performing responses were defined as those scoring (n = 55);
- Low-performing responses were defined as those scoring (n = 18).
4.3. Clustering Insights
4.4. Topic Modeling Results and Analysis
- Technology and Programming Help
- Medical and Health Inquiries
- Relationship and Emotional Advice
- Homework and School Assignments
- Definitions and Encyclopedic Questions
- High-performing topics (avg. F1 ) included factual domains such as definitions, science-based queries, and technical troubleshooting, where questions had well-defined scopes and ChatGPT’s output was aligned with clear reference answers.
- Low-performing topics (avg. F1 ) were concentrated in subjective or underspecified categories like emotional advice, opinions, and non-specific homework help, where ChatGPT’s responses often lacked the personal nuance or contextual grounding found in the best community answers.
4.5. Representative Examples
5. Discussion
6. Conclusions
Author Contributions
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 2018.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084 2019.
- Liu, Y.; et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 2019.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675 2019.
- Yang, Y.; et al. Yahoo! Answers Topic Classification Dataset. In Proceedings of the Conference on Natural Language Learning, 2017.
- Bahak, H.; Taheri, F.; Zojaji, Z.; Kazemi, A. Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models. arXiv preprint arXiv:2312.07592 2023.
- Tan, R.; Su, Y.; Yu, W.; Tan, X.; Qin, T.; Wang, Y.; Liu, T.Y. A Comprehensive Evaluation of ChatGPT on 190K Knowledge-Based QA Instances. arXiv preprint arXiv:2306.05685 2023.
- Nzunda, J. Prompt Engineering for ChatGPT: A Taxonomy and Systematic Review. arXiv preprint arXiv:2306.13676 2023.
- Chan, W.; An, A.; Davoudi, H. A Case Study on ChatGPT Question Generation. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
- Omar, S.; Gupta, M.; Dutta, S. Symbolic versus Neural QA Systems: A Comparative Analysis of ChatGPT and KGQAN. Journal of Web Semantics 2023, 80, 100776.
- Kabir, R.; Dey, T.; Ahmed, N.; Chowdhury, N.H. Evaluating ChatGPT Answers for Stack Overflow Questions: A Developer-Centric Study. Empirical Software Engineering 2023, 29, 12–30.
- Li, Y.; He, M.; Zhang, R.; Lin, J.; Xu, W. Bridging Biomedical Multimodal QA with Large Language Models. arXiv preprint arXiv:2401.02523 2024.
- Brown, T.; et al. Language Models are Few-Shot Learners. NeurIPS 2020.
- Ouyang, L.; et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 2022.
- Bergs, A. What, If Anything, Is Linguistic Creativity? Gestalt Theory 2019, 41, 173–184. [CrossRef]
- Nedungadi, P.; Veena, G.; Tang, K.Y.; Menon, R.R.K.; Raman, R. AI Techniques and Applications for Online Social Networks and Media: Insights From BERTopic Modeling. IEEE Access 2025, 13, 37389–37402. [CrossRef]
- Kriegel, H.P.; Kröger, P.; Sander, J.; Zimek, A. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2011, 1, 231–240. [CrossRef]
- Chen, H.; Jones, G.J.F.; Brennan, R. An Examination of Embedding Methods for Entity Comparison in Text-Rich Knowledge Graphs. In Proceedings of the Proceedings of the 32nd Irish Conference on Artificial Intelligence and Cognitive Science (AICS 2024). CEUR Workshop Proceedings, 2024. Available under CC BY 4.0 license.
- McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426 2020.
- Rahman, M.F.; Liu, W.; Suhaim, S.B.; Thirumuruganathan, S.; Zhang, N.; Das, G. HDBSCAN: Density based Clustering over Location Based Services. arXiv preprint arXiv:1602.03730 2016. Presented at ACM SIGMOD Workshop.
- Tenney, I.; Das, D.; Pavlick, E. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the Proceedings of ICLR, 2019.
- Liu, N.F.; Grefenstette, E.; Dyer, C. Analyzing the Structure of Attention in a Transformer Language Model. In Proceedings of the Proceedings of ACL, 2022.
- Qaiser, S.; Ali, R. Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents. International Journal of Computer Applications 2018, 181, 25–29. [CrossRef]
- Jolliffe, I.T. Principal Component Analysis; Springer Series in Statistics, Springer, 2002.
- OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 2023.
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, M.; et al. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258 2021.
- Chang, J.; Gerrish, S.; Wang, C.; Boyd-Graber, J.L.; Blei, D.M. Reading tea leaves: How humans interpret topic models. In Proceedings of the Proceedings of the 22nd International Conference on Neural Information Processing Systems (NeurIPS), 2009, pp. 288–296.
- Pichappan, P.; Krishnamurthy, M.; Vijayakumar, P. Analysis of ChatGPT as a Question-Answering Tool. Journal of Digital Information Management 2023, 21, 50–61. [CrossRef]




| Metric | BERTScore Precision | BERTScore Recall | BERTScore F1 |
|---|---|---|---|
| Count | 500.0 | 500.0 | 500.0 |
| Mean | 0.8197 | 0.8286 | 0.8237 |
| Std | 0.0277 | 0.0296 | 0.0229 |
| Min | 0.7329 | 0.7072 | 0.7506 |
| 25% | 0.8040 | 0.8105 | 0.8087 |
| 50% | 0.8202 | 0.8300 | 0.8247 |
| 75% | 0.8383 | 0.8483 | 0.8394 |
| Max | 0.9211 | 0.9141 | 0.9023 |
| Question | Answer A (Human) | Answer B (ChatGPT) |
|---|---|---|
| Who hit the longest home run at Jacobs Field? | On July 3rd 1999 Jim Thome hit a 511 foot shot to center field against Kansas City. | The longest home run hit at Jacobs Field (now Progressive Field) was by Jim Thome, who hit a 511-foot home run on July 3, 1999. |
| What is a procrastinator? | A person who puts off doing something, especially out of habitual carelessness or laziness. | A procrastinator is someone who delays or puts off tasks or responsibilities, often opting to do less urgent or enjoyable activities instead. This tendency can lead to stress, anxiety, and reduced productivity. |
| Are lab-created gems fake? I see them selling for thousands of dollars...? | Lab created gems are not fake. They are also called synthetic gems. They have the same composition as natural gems but are man-made in a lab. | Lab-created gems are not considered fake, as they have the same chemical and physical properties as natural gems. They are created in a lab under conditions that replicate those in nature, making them chemically identical. |
| What ballpark’s home run balls often land on Waveland Avenue? | Wrigley Field in Chicago, known for fans waiting outside for balls hit out of the park. | Wrigley Field in Chicago is the ballpark where home run balls often land on Waveland Avenue. |
| Can earned income for one child be taken for the back child support of an adult child? | Depends on state laws—consult a lawyer specializing in family law. | Generally, child support is for minor children, not adults. Laws vary by jurisdiction, so consult a legal professional. |
| Question | Answer A (Human) | Answer B (ChatGPT) |
|---|---|---|
| ooo00OOO000oooo00OOO00oooo.....la la la..how do I sing? | mmmm lalalalalalalalalaLA | Offers detailed singing advice including breathing techniques, vocal warm-ups, and considering a vocal coach. |
| What is the number that you divide the divisor into? | Dividend | Explains the concept of dividend with examples and a full breakdown of division terminology. |
| What are the lyrics? (Two songs: Panic! at the Disco & P. Diddy) | Full lyrics of both songs | Refuses due to copyright but suggests legal alternatives to find the lyrics. |
| What are Some BG rap links? | Long list of download links | Refuses to share download links due to copyright concerns and encourages legal alternatives. |
| I need an essay for my research plzz? (on American colonies) | No content (missing) | Provides a detailed multi-section essay outline covering early settlement, growth, conflict, and independence. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).