Preprint Article | This version is not peer-reviewed.

ChatGPT vs. Gemini: Which Provides Better Information on Bladder Cancer?

A peer-reviewed article of this preprint also exists.

Submitted: 15 December 2024 | Posted: 18 December 2024


Abstract

Background: Bladder cancer, the most common and heterogeneous malignancy of the urinary tract, presents with diverse types and treatment options, making comprehensive patient education essential. As Large Language Models (LLMs) emerge as a promising resource for disseminating medical information, their accuracy and validity compared with traditional methods remain under-explored. This study aims to evaluate the effectiveness of LLMs in educating the public about bladder cancer. Methods: Frequently asked questions regarding bladder cancer were sourced from reputable educational materials and assessed for accuracy, comprehensiveness, readability, and consistency by two independent board-certified urologists, with a third resolving any discrepancies. The study used a 3-point Likert scale for accuracy, a 5-point Likert scale for comprehensiveness, and the Flesch-Kincaid (FK) Grade Level and Flesch Reading Ease (FRE) scores to gauge readability. Results: ChatGPT-3.5, ChatGPT-4, and Gemini were evaluated on 12 general questions, 6 related to diagnosis, 28 concerning treatment, and 7 focused on prevention. Across all categories, the correct response rate was notably high: 92.5% for ChatGPT-3.5 and ChatGPT-4 versus 86.3% for Gemini, with no significant difference in accuracy. However, there was a significant difference in comprehensiveness across the models (p = 0.011). A significant difference in readability was also observed among the LLMs (p < 0.001), with ChatGPT-4 providing the highest proportion of college-level responses, which were also the most challenging to read. Conclusion: Our study adds value to the applications of AI in bladder cancer education, with notable insights into the accuracy, comprehensiveness, and stability of the three LLMs.

Keywords: 

1. Introduction

Bladder cancer is the most common malignancy of the urinary tract and is highly heterogeneous, with a variable natural history, multiple types, and diverse treatment options [1]. In Saudi Arabia, the overall incidence of bladder cancer is 1.4 per 100,000 individuals [2], and its incidence has increased tenfold over the last decade [3]. Risk factors for the development of bladder cancer include smoking, exposure to certain chemicals, advancing age, and male sex. Patients usually present with gross or microscopic hematuria and require further investigations, such as cystoscopy and urography [4]. The choice of the most beneficial treatment option depends on accurate staging and grading of the tumor as well as its type. Non-muscle-invasive urothelial bladder cancer is preferably treated with transurethral resection of the bladder tumor (TURBT), whereas muscle-invasive bladder cancer is managed with radical cystectomy, i.e., removal of the bladder. Favorable oncological outcomes of bladder cancer treatment, however, require a multidisciplinary approach [5]. Moreover, because the disease has severe consequences that affect several aspects of life, early detection through screening, together with engaging patients in their cancer care through education, is essential to improve patient outcomes and quality of life [1,6]. For this reason, it has become essential to empower patients to take a proactive approach to their care by improving educational resources and tools and raising awareness of their condition [6].
Artificial Intelligence (AI) systems have gained popularity in recent years for providing information and assisting users online [7]. Moreover, the use of the internet for health-related purposes, such as seeking information about cancer, has increased significantly [8,9]. Some bladder cancer patients may turn to the internet to learn more about their condition, and recent advances in Large Language Model (LLM) chatbots have provided another tool through which health information can be obtained [10,11]. Chat Generative Pre-trained Transformer (ChatGPT), recently developed by OpenAI, has captured wide attention with its ability to provide information and hold conversations with users thanks to its deep learning architecture. ChatGPT provides contextualized answers that are relevant to the user's input, and these features have made its application as a source of medical information a reality [12]. Another emerging AI chatbot, released by Google, is Bard (since rebranded as Gemini), which is powered by LaMDA and combines language recognition and conversation processing while interfacing with Google Search [13].
This new era has raised hopes for the future applications of these tools in healthcare as well as concerns about the current quality and accuracy of their responses when delivering critical medical information, especially for cancer [9,14]. Furthermore, given the importance of this information to the cancer research field and to healthcare communicators, monitoring these platforms’ outputs for accuracy, and for their ability to keep pace with the evolving and rapidly advancing medical sciences, is a necessity [9].
However, few studies have examined the quality and accuracy of the information these chatbots provide to cancer patients. Therefore, the aim of this study is to assess the effectiveness of the Large Language Models (LLMs) ChatGPT (3rd and 4th generations) and Gemini in providing bladder cancer patients with information about their condition, and to evaluate the quality of the answers provided to these patients.

2. Materials and Methods

Frequently asked questions from various trusted educational websites were compiled into an Excel file. Questions were selected according to two criteria: (1) they are frequently asked by patients and the public, and (2) they target general disease knowledge, treatment, diagnosis, or prevention. The questions were then reviewed by board-certified urologists for optimal selection and inputted into three LLMs (ChatGPT-3.5, ChatGPT-4, and Gemini), available at https://chat.openai.com/chat and https://gemini.google.com/app. The quality of the responses was assessed in terms of accuracy, comprehensiveness, patient readability, and stability. A 3-point scale was employed to measure accuracy, with 1 indicating correct information, 2 a mix of correct and incorrect information, and 3 completely incorrect information. Comprehensiveness was assessed with a 5-point Likert scale, where 1 represented “very comprehensive” and 5 “very inadequate.” For readability, the output answers were analyzed in terms of sentences, words, syllables per word, and words per sentence, and the Flesch Reading Ease Score and Flesch-Kincaid Grade Level were calculated for each text using the online calculator available at https://charactercalculator.com/flesch-reading-ease/. A higher Flesch Reading Ease Score signifies that the text is easier to read, whereas the Flesch-Kincaid Grade Level indicates the educational grade level required to comprehend the text. Because LLMs can generate different responses to the same question, the stability of the output text was also analyzed: three responses were generated for each question, with the chat history cleared after each trial, and two independent reviewers subjectively evaluated whether the second and third answers remained accurate compared with the first generated answer. The first answer to each question was evaluated separately for accuracy and comprehensiveness by two board-certified urologists, both referring to the same resource. Any discrepancies in the evaluation were resolved by a blinded third board-certified urologist.
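For illustration, the two readability indices described above can be reproduced from simple text counts. The sketch below is a minimal Python example, not the tool used in this study (which relied on the online calculator cited above); the Flesch formulas are the standard published ones, while the naive syllable counter is an assumption made purely for demonstration.

import re

def count_syllables(word: str) -> int:
    # Rough vowel-group heuristic; dedicated calculators use dictionaries and better rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # words per sentence
    spw = syllables / len(words)               # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return {"words": len(words), "sentences": len(sentences), "syllables": syllables,
            "FRE": round(fre, 1), "FKGL": round(fkgl, 1)}

print(readability("Bladder cancer is usually diagnosed with cystoscopy. "
                  "Treatment depends on the stage and grade of the tumor."))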

Statistical Analysis

Statistical analysis was conducted using RStudio software (version 4.3.1). Descriptive statistics were employed to summarize the characteristics of different Large Language Models (LLMs) and their performance across various categories related to bladder cancer. Categorical variables were expressed as frequencies and percentages. To assess the association between LLMs and categorical variables, Fisher’s exact test was utilized. Additionally, continuous variables, such as grade level scores, were presented as medians with interquartile ranges (IQR), and the Kruskal-Wallis rank sum test was applied to evaluate significant differences among LLMs. All statistical tests were two-tailed, and p-values less than 0.05 were considered statistically significant.
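As a rough illustration of the two tests named above, the following Python/SciPy sketch mirrors the analysis on hypothetical counts; it is not the study's actual R code. Note that SciPy's fisher_exact accepts only 2x2 tables, whereas R's fisher.test (presumably used here) also handles larger r x c tables.

from scipy import stats

# Fisher's exact test on a hypothetical 2x2 table:
# correct vs. not-correct answers for two models.
table_2x2 = [[49, 4],   # model A: correct, not correct
             [44, 7]]   # model B: correct, not correct
odds_ratio, p_fisher = stats.fisher_exact(table_2x2)
print(f"Fisher's exact test: p = {p_fisher:.3f}")

# Kruskal-Wallis rank sum test on hypothetical FK grade-level scores per LLM.
gpt35  = [13.6, 12.1, 14.8, 11.9, 15.0]
gpt4   = [12.6, 13.1, 11.8, 12.9, 13.5]
gemini = [11.0, 9.8, 10.5, 11.7, 9.4]
h_stat, p_kw = stats.kruskal(gpt35, gpt4, gemini)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.3f}")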

3. Results

In the current study, three large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4, and Gemini, were assessed using 12 general questions, 6 diagnosis-related questions, 28 treatment-related questions, and 7 prevention-related questions, accounting for 22.6%, 11.3%, 52.8%, and 13.2% of the questions, respectively (Figure 1).
Table 1 and Figure 2 present the analysis of the accuracy of the different LLMs in providing information about bladder cancer. The overall analysis of 53 questions revealed no significant differences among the LLMs (p = 0.655). Across all categories, the proportion of correct responses was high: 92.5% for ChatGPT-3.5 and ChatGPT-4 and 86.3% for Gemini. There were no significant differences in the proportions of correct answers among the three LLMs for the general (p > 0.999), diagnosis (p > 0.999), treatment (p = 0.848), and prevention (p = 0.079) categories.
The overall analysis revealed a significant difference among the LLMs in comprehensiveness (p = 0.011). Comprehensive and very comprehensive responses were provided for 75.4% of questions by ChatGPT-3.5, 83.0% by ChatGPT-4, and 68.6% by Gemini. In the treatment category, ChatGPT-4 displayed the highest combined proportion (78.6%), compared with 75.0% for ChatGPT-3.5 and 57.7% for Gemini, and the difference was statistically significant (p = 0.007, Table 2). No other question category showed statistically significant differences among the LLMs.
Table 3 presents an analysis of the grade-level scores of the different LLMs regarding bladder cancer information. The overall comparison revealed a significant difference among the LLMs (p < 0.001), with college-level responses for 69.8% of the questions answered by ChatGPT-3.5, 75.5% by ChatGPT-4, and 34.0% by Gemini. In the treatment category, 71.4% of ChatGPT-3.5 responses, 75.0% of ChatGPT-4 responses, and 39.3% of Gemini responses were categorized as college level, a statistically significant discrepancy (p < 0.001). For general questions, Gemini had a significantly lower proportion of college-level responses (16.7%) than ChatGPT-3.5 (66.7%) and ChatGPT-4 (83.3%) (p = 0.016). In the diagnosis category, 66.7% of ChatGPT-3.5 responses, 83.3% of ChatGPT-4 responses, and 16.7% of Gemini responses were categorized as “College,” although this difference was only borderline significant (p = 0.050, Table 3).
Table 4 illustrates the analysis of the reading ease categories of the different LLMs. The overall comparison revealed a significant difference among the LLMs (p < 0.001), with 69.8% of responses provided by ChatGPT-3.5, 75.5% by ChatGPT-4, and 34.0% by Gemini rated as difficult to read. Notably, in the treatment category, 71.4% of ChatGPT-3.5 responses, 75.0% of ChatGPT-4 responses, and 39.3% of Gemini responses were categorized as “Difficult to read,” indicating a substantial discrepancy among them (p < 0.001). Similarly, in the general category, 66.7% of ChatGPT-3.5 responses, 83.3% of ChatGPT-4 responses, and 16.7% of Gemini responses were categorized as “Difficult to read,” and the difference was statistically significant (p = 0.019, Table 4).
There were notable variations among the LLMs in the number of words, with ChatGPT-3.5 showing a median of 207.0 (IQR = 166.0 to 240.0), ChatGPT-4 295.0 (IQR = 232.0 to 342.0), and Gemini 274.0 (IQR = 223.0 to 341.0), indicating an increasing trend from ChatGPT-3.5 to Gemini to ChatGPT-4 (p < 0.001). A similar pattern was observed for sentences and syllables, with ChatGPT-4 consistently showing the highest values, followed by Gemini and then ChatGPT-3.5; for instance, the median number of sentences was 10.0 (IQR = 7.0 to 13.0) for ChatGPT-3.5, 18.0 (IQR = 10.0 to 21.0) for ChatGPT-4, and 15.0 (IQR = 11.0 to 21.0) for Gemini (p < 0.001). The syllable/word ratio was highest for ChatGPT-4 (median 1.8) and lowest for Gemini (median 1.6) (p < 0.001). Regarding the FRE score, ChatGPT-3.5 had a median of 43.4 (IQR = 34.2 to 48.0), ChatGPT-4 40.3 (IQR = 35.0 to 44.9), and Gemini 54.3 (IQR = 47.4 to 61.7) (p < 0.001). However, no significant difference was observed in the FK Reading Level among the LLMs (p = 0.093, Table 5 and Figure 3).
The stability analysis was performed on 10 questions: three related to diagnosis, three to treatment, and four to prevention. The results showed no significant differences in stability among the three LLMs, either overall or within each subscale (Table 6).

4. Discussion

Our study focused on evaluating the performance of three LLMs in answering questions related to bladder cancer. While previous studies have reported that ChatGPT provides unsatisfactory results regarding bladder cancer information, our findings demonstrated promising accuracy rates, consistent stability, and varying levels of comprehensiveness among the three LLMs [15,16].
LLMs have demonstrated high accuracy in providing information about bladder cancer, underscoring their potential as valuable resources in the medical field. Our findings align with those of Ozgor et al., who reported that ChatGPT’s responses to questions about urological cancers exhibited high accuracy rates. However, when evaluated against the EAU 2023 Guidelines, Ozgor et al. found ChatGPT’s performance to be inadequate [15]. Additionally, ChatGPT’s responses to various urologic conditions, including bladder cancer, were generally well-balanced, though treatment-related answers only achieved a moderate quality score on the DISCERN questionnaire, scoring 3 out of 5 points [16]. Consistent with these findings, our study also identified that most inaccuracies were related to bladder cancer treatment. Similarly, Musheyev et al. assessed the quality and accuracy of information about urological cancers provided by four AI chatbots (ChatGPT, Perplexity, ChatSonic, and Microsoft Bing AI) using the top five Google Trends search queries. While these AI chatbots generally delivered accurate and moderately high-quality information, they often lacked clear, actionable instructions [17].
Although all LLMs delivered highly accurate responses regarding bladder cancer information, instances of errors were observed, which stemmed from factors such as misinterpretation of ambiguous questions, reliance on outdated or incorrect sources, and a lack of contextual understanding of nuanced medical topics. These errors, despite the overall high accuracy rates, underscore the importance of critical evaluation by medical professionals. While LLMs show great promise in providing accurate information about bladder cancer, it is essential to recognize and address their limitations and potential errors to ensure their safe and effective use in a medical context.
It is evident how crucial it is for readers to be able to accurately read, comprehend, and apply information about their condition. The literature has reported that understanding AI-based LLM output can be challenging, with some findings indicating that reading and comprehending this content requires adequate training [18,19,20]. Our study demonstrated that Gemini provided more easily readable answers than both ChatGPT-3.5 and ChatGPT-4. Multiple studies have confirmed that ChatGPT frequently produces responses at a post-secondary, particularly college, grade level [18,21,22]. Additionally, Abou-Abdallah et al. considered ChatGPT’s readability poor, with a mean FRES of 38.9 placing it in the fairly difficult category [23]. A study assessing the readability of AI chatbot responses to the 100 most frequently asked questions about cardiopulmonary resuscitation found that Gemini’s responses were the easiest to read, while ChatGPT’s were the most challenging [24]. Regarding the stability and reproducibility of the responses, the generated answers were largely consistent, with no significant differences among the three LLMs. However, since only ten questions were evaluated for stability across the three LLMs, future studies should test all research questions to determine their stability more accurately.

Limitations

This study has offered valuable insights into the potential integration of LLM chatbots in health education. However, certain limitations should be acknowledged and addressed in future research. For instance, ChatGPT’s knowledge base is only updated until September 2021, which may limit the relevance of its responses. Additionally, although independent evaluation was conducted to ensure blinding to the type of LLM, factors such as the formatting of the answers could inadvertently reveal the identity of specific LLMs. Despite these limitations, the study remains reliable. Future research should consider including other LLMs and focus on crafting contextualized questions that closely mimic real-world scenarios.

5. Conclusions

In conclusion, the application of AI in bladder cancer education is steadily emerging with promising potential. Our study contributes valuable insights into this field by highlighting the accuracy, comprehensiveness, and stability of the three LLMs in answering bladder cancer-related inquiries. These findings, together with future research, can further support the effective utilization of AI in medicine.

Author Contributions

Conceptualization, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Data curation, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad AlKhamees and Bader Alsaikhan; Formal analysis, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Investigation, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Methodology, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Project administration, Ahmed Alasker, Seham Alsalamah and Mohammad Alghafees; Resources, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Software, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Supervision, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Validation, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Visualization, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan; Writing – original draft, Nada Alshathri, Seham Alsalamah, Nura Almansour and Faris Alsalamah; Writing – review & editing, Ahmed Alasker, Nada Alshathri, Seham Alsalamah, Nura Almansour, Faris Alsalamah, Mohammad Alghafees, Mohammad AlKhamees and Bader Alsaikhan.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement


Acknowledgments

None.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kirkali Z, Chan T, Manoharan M, Algaba F, Busch C, Cheng L, Kiemeney L, Kriegmair M, Montironi R, Murphy WM, Sesterhenn IA, Tachibana M, Weider J. Bladder cancer: epidemiology, staging and grading, and diagnosis. Urology. 2005 Dec;66(6 Suppl 1):4-34. PMID: 16399414. [CrossRef]
  2. Alghafees MA, Alqahtani MA, Musalli ZF, Alasker A. Bladder cancer in Saudi Arabia: a registry-based nationwide descriptive epidemiological and survival analysis. Ann Saudi Med. 2022 Jan-Feb;42(1):17-28. Epub 2022 Feb 3. PMID: 35112590; PMCID: PMC8812161. [CrossRef]
  3. Althubiti MA, Nour Eldein MM. Trends in the incidence and mortality of cancer in Saudi Arabia. Saudi Med J. 2018 Dec;39(12):1259-1262. PMID: 30520511; PMCID: PMC6344657. [CrossRef]
  4. Lenis AT, Lec PM, Chamie K, Mshs MD. Bladder Cancer: A Review. JAMA. 2020 Nov 17;324(19):1980-1991. PMID: 33201207. [CrossRef]
  5. Stein JP, Lieskovsky G, Cote R, Groshen S, Feng AC, Boyd S, Skinner E, Bochner B, Thangathurai D, Mikhail M, Raghavan D. Radical cystectomy in the treatment of invasive bladder cancer: long-term results in 1,054 patients. Journal of clinical oncology. 2001 Feb 1;19(3):666-75. [CrossRef]
  6. Quale DZ, Bangs R, Smith M, Guttman D, Northam T, Winterbottom A, Necchi A, Fiorini E, Demkiw S. Bladder Cancer Patient Advocacy: A Global Perspective. Bladder Cancer. 2015 Oct 26;1(2):117-122. PMID: 27398397; PMCID: PMC4929624. [CrossRef]
  7. Miner AS, Laranjo L, Kocaballi AB. Chatbots in the fight against the COVID-19 pandemic. NPJ digital medicine. 2020 May 4;3(1):65. [CrossRef]
  8. Calixte R, Rivera A, Oridota O, Beauchamp W, Camacho-Rivera M. Social and demographic patterns of health-related Internet use among adults in the United States: a secondary data analysis of the health information national trends survey. International Journal of Environmental Research and Public Health. 2020 Sep;17(18):6856. [CrossRef]
  9. Johnson SB, King AJ, Warner EL, Aneja S, Kann BH, Bylund CL. Using ChatGPT to evaluate cancer myths and misconceptions: artificial intelligence and cancer information. JNCI cancer spectrum. 2023 Apr 1;7(2):pkad015. [CrossRef]
  10. Corfield JM, Abouassaly R, Lawrentschuk N. Health information quality on the internet for bladder cancer and urinary diversion: a multi-lingual analysis. Minerva urologica e nefrologica= The Italian journal of urology and nephrology. 2017 Jul 12;70(2):137-43. [CrossRef]
  11. Shahsavar Y, Choudhury A. User Intentions to Use ChatGPT for Self-Diagnosis and Health-Related Purposes: Cross-sectional Survey Study. JMIR Human Factors. 2023 May 17;10(1):e47564. [CrossRef]
  12. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Frontiers in Artificial Intelligence. 2023 May 4;6:1169595. [CrossRef]
  13. King MR. Can Bard, Google’s Experimental Chatbot Based on the LaMDA Large Language Model, Help to Analyze the Gender and Racial Diversity of Authors in Your Cited Scientific References?. Cellular and Molecular Bioengineering. 2023 Apr;16(2):175-9. [CrossRef]
  14. Koski E, Murphy J. AI in Healthcare.
  15. Ozgor F, Caglar U, Halis A, Cakir H, Aksu UC, Ayranci A, Sarilar O. Urological Cancers and ChatGPT: Assessing the Quality of Information and Possible Risks for Patients. Clin Genitourin Cancer. 2024 Apr;22(2):454-457.e4. Epub 2024 Jan 5. [CrossRef]
  16. Szczesniewski JJ, Tellez Fouz C, Ramos Alba A, Diaz Goizueta FJ, García Tello A, Llanes González L. ChatGPT and most frequent urological diseases: analysing the quality of information and potential risks for patients. World J Urol. 2023 Nov;41(11):3149-3153. Epub 2023 Aug 26. [CrossRef]
  17. Musheyev D, Pan A, Loeb S, Kabarriti AE. How Well Do Artificial Intelligence Chatbots Respond to the Top Search Queries About Urological Malignancies? Eur Urol. 2024 Jan;85(1):13-16. Epub 2023 Aug 10. PMID: 37567827. [CrossRef]
  18. Davis R, Eppler M, Ayo-Ajibola O, Loh-Doyle JC, Nabhani J, Samplaski M, Gill I, Cacciamani GE. Evaluating the Effectiveness of Artificial Intelligence-powered Large Language Models Application in Disseminating Appropriate and Readable Health Information in Urology. J Urol. 2023 Oct;210(4):688-694. Epub 2023 Jul 10. PMID: 37428117. [CrossRef]
  19. Momenaei B, Wakabayashi T, Shahlaee A, Durrani AF, Pandit SA, Wang K, Mansour HA, Abishek RM, Xu D, Sridhar J, Yonekawa Y, Kuriyan AE. Appropriateness and Readability of ChatGPT-4-Generated Responses for Surgical Treatment of Retinal Diseases. Ophthalmol Retina. 2023 Oct;7(10):862-868. Epub 2023 Jun 3. PMID: 37277096. [CrossRef]
  20. Robinson MA, Belzberg M, Thakker S, Bibee K, Merkel E, MacFarlane DF, Lim J, Scott JF, Deng M, Lewin J, Soleymani D, Rosenfeld D, Liu R, Liu TYA, Ng E. Assessing the accuracy, usefulness, and readability of artificial-intelligence-generated responses to common dermatologic surgery questions for patient education: A double-blinded comparative study of ChatGPT and Google Bard. J Am Acad Dermatol. 2024 May;90(5):1078-1080. Epub 2024 Feb 1. PMID: 38296195. [CrossRef]
  21. Hershenhouse JS, Mokhtar D, Eppler MB, Rodler S, Storino Ramacciotti L, Ganjavi C, Hom B, Davis RJ, Tran J, Russo GI, Cocci A, Abreu A, Gill I, Desai M, Cacciamani GE. Accuracy, readability, and understandability of large language models for prostate cancer information to the public. Prostate Cancer Prostatic Dis. 2024 May 14. Epub ahead of print. PMID: 38744934. [CrossRef]
  22. Zaleski AL, Berkowsky R, Craig KJT, Pescatello LS. Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study. JMIR Med Educ. 2024 Jan 11;10:e51308. PMID: 38206661; PMCID: PMC10811574. [CrossRef]
  23. Abou-Abdallah M, Dar T, Mahmudzade Y, Michaels J, Talwar R, Tornari C. The quality and readability of patient information provided by ChatGPT: can AI reliably explain common ENT operations? Eur Arch Otorhinolaryngol. 2024 Mar 26. PMID: 38530460. [CrossRef]
  24. Ömür Arça D, Erdemir İ, Kara F, Shermatov N, Odacioğlu M, İbişoğlu E, Hanci FB, Sağiroğlu G, Hanci V. Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine (Baltimore). 2024 May 31;103(22):e38352. [CrossRef]
Figure 1. Distribution of Question Categories in Bladder Cancer Education.
Figure 2. Accuracy of Responses by Large Language Models (LLMs) Across Different Question Categories.
Figure 3. Readability Metrics of Responses for Large Language Models (LLMs).
Table 1. Analysis of the accuracy of different LLMs.
Characteristic Missing ChatGPT-3.5 ChatGPT-4 Gemini p-value
Overall (n=53) 2 (1.3%) 0.655
 Correct 49 (92.5%) 49 (92.5%) 44 (86.3%)
 Mixed 4 (7.5%) 3 (5.7%) 6 (11.8%)
 Completely incorrect 0 (0.0%) 1 (1.9%) 1 (2.0%)
General (n=12) 0 (0%) >0.999
 Correct 10 (83.3%) 10 (83.3%) 11 (91.7%)
 Mixed 2 (16.7%) 2 (16.7%) 1 (8.3%)
 Completely incorrect 0 (0.0%) 0 (0.0%) 0 (0.0%)
Diagnosis (n=6) 0 (0%) >0.999
 Correct 6 (100.0%) 6 (100.0%) 6 (100.0%)
 Mixed 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Completely incorrect 0 (0.0%) 0 (0.0%) 0 (0.0%)
Treatment (n=28) 2 (2.4%) 0.848
 Correct 26 (92.9%) 26 (92.9%) 23 (88.5%)
 Mixed 2 (7.1%) 1 (3.6%) 2 (7.7%)
 Completely incorrect 0 (0.0%) 1 (3.6%) 1 (3.8%)
Prevention (n=7) 0 (0%) 0.079
 Correct 7 (100.0%) 7 (100.0%) 4 (57.1%)
 Mixed 0 (0.0%) 0 (0.0%) 3 (42.9%)
 Completely incorrect 0 (0.0%) 0 (0.0%) 0 (0.0%)
n (%). Fisher’s exact test.
Table 2. Analysis of the comprehensiveness of different LLMs.
Characteristic Missing ChatGPT-3.5 ChatGPT-4 Gemini p-value
Overall (n=53) 2 (1.3%) 0.011
 Very inadequate 5 (9.4%) 3 (5.7%) 5 (9.8%)
 Inadequate 1 (1.9%) 1 (1.9%) 4 (7.8%)
 Neither comprehensive nor inadequate 7 (13.2%) 5 (9.4%) 7 (13.7%)
 Comprehensive 35 (66.0%) 22 (41.5%) 22 (43.1%)
 Very comprehensive 5 (9.4%) 22 (41.5%) 13 (25.5%)
General (n=12) 0 (0%) 0.291
 Very inadequate 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Inadequate 1 (8.3%) 0 (0.0%) 0 (0.0%)
 Neither comprehensive nor inadequate 3 (25.0%) 0 (0.0%) 2 (16.7%)
 Comprehensive 7 (58.3%) 8 (66.7%) 6 (50.0%)
 Very comprehensive 1 (8.3%) 4 (33.3%) 4 (33.3%)
Diagnosis (n=6) 0 (0%) >0.999
 Very inadequate 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Inadequate 0 (0.0%) 0 (0.0%) 1 (16.7%)
 Neither comprehensive nor inadequate 1 (16.7%) 1 (16.7%) 1 (16.7%)
 Comprehensive 2 (33.3%) 2 (33.3%) 1 (16.7%)
 Very comprehensive 3 (50.0%) 3 (50.0%) 3 (50.0%)
Treatment (n=28) 2 (2.4%) 0.007
 Very inadequate 5 (17.9%) 3 (10.7%) 5 (19.2%)
 Inadequate 0 (0.0%) 1 (3.6%) 3 (11.5%)
 Neither comprehensive nor inadequate 2 (7.1%) 2 (7.1%) 3 (11.5%)
 Comprehensive 20 (71.4%) 10 (35.7%) 10 (38.5%)
 Very comprehensive 1 (3.6%) 12 (42.9%) 5 (19.2%)
Prevention (n=7) 0 (0%) 0.205
 Very inadequate 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Inadequate 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Neither comprehensive nor inadequate 1 (14.3%) 2 (28.6%) 1 (14.3%)
 Comprehensive 6 (85.7%) 2 (28.6%) 5 (71.4%)
 Very comprehensive 0 (0.0%) 3 (42.9%) 1 (14.3%)
n (%). Fisher’s exact test.
Table 3. Analysis of the grade level score of different LLMs.
Characteristic Missing ChatGPT-3.5 ChatGPT-4 Gemini p-value
Overall (n=53) 0 (0%) <0.001
 6th grade 0 (0.0%) 0 (0.0%) 1 (1.9%)
 7th grade 0 (0.0%) 0 (0.0%) 4 (7.5%)
 8th & 9th grade 2 (3.8%) 1 (1.9%) 13 (24.5%)
 10th to 12th grade 8 (15.1%) 4 (7.5%) 17 (32.1%)
 College 37 (69.8%) 40 (75.5%) 18 (34.0%)
 College graduate 6 (11.3%) 7 (13.2%) 0 (0.0%)
 Professional 0 (0.0%) 1 (1.9%) 0 (0.0%)
General (n=12) 0 (0%) 0.016
 6th grade 0 (0.0%) 0 (0.0%) 0 (0.0%)
 7th grade 0 (0.0%) 0 (0.0%) 2 (16.7%)
 8th & 9th grade 1 (8.3%) 1 (8.3%) 3 (25.0%)
 10th to 12th grade 3 (25.0%) 1 (8.3%) 5 (41.7%)
 College 8 (66.7%) 10 (83.3%) 2 (16.7%)
 College graduate 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Professional 0 (0.0%) 0 (0.0%) 0 (0.0%)
Diagnosis (n=6) 0 (0%) 0.050
 6th grade 0 (0.0%) 0 (0.0%) 0 (0.0%)
 7th grade 0 (0.0%) 0 (0.0%) 0 (0.0%)
 8th & 9th grade 0 (0.0%) 0 (0.0%) 2 (33.3%)
 10th to 12th grade 1 (16.7%) 0 (0.0%) 3 (50.0%)
 College 4 (66.7%) 5 (83.3%) 1 (16.7%)
 College graduate 1 (16.7%) 1 (16.7%) 0 (0.0%)
 Professional 0 (0.0%) 0 (0.0%) 0 (0.0%)
Treatment (n=28) 0 (0%) <0.001
 6th grade 0 (0.0%) 0 (0.0%) 1 (3.6%)
 7th grade 0 (0.0%) 0 (0.0%) 2 (7.1%)
 8th & 9th grade 1 (3.6%) 0 (0.0%) 7 (25.0%)
 10th to 12th grade 2 (7.1%) 2 (7.1%) 7 (25.0%)
 College 20 (71.4%) 21 (75.0%) 11 (39.3%)
 College graduate 5 (17.9%) 4 (14.3%) 0 (0.0%)
 Professional 0 (0.0%) 1 (3.6%) 0 (0.0%)
Prevention (n=7) 0 (0%) 0.561
 6th grade 0 (0.0%) 0 (0.0%) 0 (0.0%)
 7th grade 0 (0.0%) 0 (0.0%) 0 (0.0%)
 8th & 9th grade 0 (0.0%) 0 (0.0%) 1 (14.3%)
 10th to 12th grade 2 (28.6%) 1 (14.3%) 2 (28.6%)
 College 5 (71.4%) 4 (57.1%) 4 (57.1%)
 College graduate 0 (0.0%) 2 (28.6%) 0 (0.0%)
 Professional 0 (0.0%) 0 (0.0%) 0 (0.0%)
n (%). Fisher’s exact test.
Table 4. Analysis of the reading ease categories of different LLMs.
Characteristic Missing ChatGPT-3.5 ChatGPT-4 Gemini p-value
Overall (n=53) 0 (0%) <0.001
 Plain English 2 (3.8%) 1 (1.9%) 13 (24.5%)
 Fairly easy to read 0 (0.0%) 0 (0.0%) 4 (7.5%)
 Easy to read 0 (0.0%) 0 (0.0%) 1 (1.9%)
 Difficult to read 37 (69.8%) 40 (75.5%) 18 (34.0%)
 Fairly difficult to read 8 (15.1%) 4 (7.5%) 17 (32.1%)
 Very difficult to read 6 (11.3%) 7 (13.2%) 0 (0.0%)
 Extremely difficult to read 0 (0.0%) 1 (1.9%) 0 (0.0%)
General (n=12) 0 (0%) 0.019
 Plain English 1 (8.3%) 1 (8.3%) 3 (25.0%)
 Fairly easy to read 0 (0.0%) 0 (0.0%) 2 (16.7%)
 Easy to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Difficult to read 8 (66.7%) 10 (83.3%) 2 (16.7%)
 Fairly difficult to read 3 (25.0%) 1 (8.3%) 5 (41.7%)
 Very difficult to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Extremely difficult to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
Diagnosis (n=6) 0 (0%) 0.055
 Plain English 0 (0.0%) 0 (0.0%) 2 (33.3%)
 Fairly easy to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Easy to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Difficult to read 4 (66.7%) 5 (83.3%) 1 (16.7%)
 Fairly difficult to read 1 (16.7%) 0 (0.0%) 3 (50.0%)
 Very difficult to read 1 (16.7%) 1 (16.7%) 0 (0.0%)
 Extremely difficult to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
Treatment (n=28) 0 (0%) <0.001
 Plain English 1 (3.6%) 0 (0.0%) 7 (25.0%)
 Fairly easy to read 0 (0.0%) 0 (0.0%) 2 (7.1%)
 Easy to read 0 (0.0%) 0 (0.0%) 1 (3.6%)
 Difficult to read 20 (71.4%) 21 (75.0%) 11 (39.3%)
 Fairly difficult to read 2 (7.1%) 2 (7.1%) 7 (25.0%)
 Very difficult to read 5 (17.9%) 4 (14.3%) 0 (0.0%)
 Extremely difficult to read 0 (0.0%) 1 (3.6%) 0 (0.0%)
Prevention (n=7) 0 (0%) 0.562
 Plain English 0 (0.0%) 0 (0.0%) 1 (14.3%)
 Fairly easy to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Easy to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
 Difficult to read 5 (71.4%) 4 (57.1%) 4 (57.1%)
 Fairly difficult to read 2 (28.6%) 1 (14.3%) 2 (28.6%)
 Very difficult to read 0 (0.0%) 2 (28.6%) 0 (0.0%)
 Extremely difficult to read 0 (0.0%) 0 (0.0%) 0 (0.0%)
n (%). Fisher’s exact test.
Table 5. A description of selected numerical parameters of LLMs, including Words, sentences, syllables, word/sentence, syllable/word, FRE score, and FK Reading levels.
Characteristic Missing ChatGPT-3.5 ChatGPT-4 Gemini p-value
Words 0 (0%) 207.0 (166.0 - 240.0) 295.0 (232.0 - 342.0) 274.0 (223.0 - 341.0) <0.001
Sentences 0 (0%) 10.0 (7.0 - 13.0) 18.0 (10.0 - 21.0) 15.0 (11.0 - 21.0) <0.001
Syllables 0 (0%) 337.0 (285.0 - 404.0) 507.0 (390.0 - 601.0) 427.0 (351.0 - 537.0) <0.001
Word/sentence 0 (0%) 20.0 (17.1 - 23.5) 18.6 (15.3 - 22.1) 17.3 (16.1 - 21.1) 0.099
Syllable/word 0 (0%) 1.7 (1.6 - 1.8) 1.8 (1.7 - 1.8) 1.6 (1.5 - 1.6) <0.001
FRE Score 0 (0%) 43.4 (34.2 - 48.0) 40.3 (35.0 - 44.9) 54.3 (47.4 - 61.7) <0.001
FK Reading Level 0 (0%) 13.6 (11.7 - 15.4) 12.6 (11.5 - 13.8) 11.0 (9.4 - 45,116.0) 0.093
Median (IQR). Kruskal-Wallis rank sum test.
Table 6. Analysis of the stability of different LLMs.
Characteristic ChatGPT-3.5 ChatGPT-4 Gemini p-value
Overall (n=10) >0.999
 Consistent 9 (90.0%) 9 (90.0%) 8 (80.0%)
 Inconsistent 1 (10.0%) 1 (10.0%) 2 (20.0%)
Diagnosis (n=3) 0.671
 Consistent 3 (100.0%) 2 (66.7%) 1 (33.3%)
 Inconsistent 0 (0.0%) 1 (33.3%) 2 (66.7%)
Treatment (n=3) >0.999
 Consistent 2 (66.7%) 3 (100.0%) 3 (100.0%)
 Inconsistent 1 (33.3%) 0 (0.0%) 0 (0.0%)
Prevention (n=4) NA
 Consistent 4 (100.0%) 4 (100.0%) 4 (100.0%)
 Inconsistent 0 (0.0%) 0 (0.0%) 0 (0.0%)
NA: not applicable.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.