Submitted: 06 February 2025
Posted: 07 February 2025
Abstract
We have published on the accuracy of the Google Virtual Assistant, Alexa, Siri, Cortana, Gemini and Copilot. Emerging from this published work was a focus on the ac-curacy of AI that could be determined through validations. In our work published in 2023, the accuracy of responses to a panel of 24 queries related to gynecologic oncology was low, with Google Virtual Assistant (VA) providing the most correct audible replies (18.1%), followed by Alexa (6.5%), Siri (5.5%), and Cortana (2.3%). In the months following our publication, there was explosive excitement about several generative AIs that continue to transform the landscape of information accessibility by presenting the search results in impressively engaging narratives. This type of presentation has been enabled by combining machine learning algorithms with Natural Language Processing (NLP). In 2024, we published our exploration of the generative AIs Gemini and Copilot as well as the Google Assistant in relation to how accurately they responded to the panel of 24 queries that we used in the 2023 publication. Google Gemini achieved an 87.5% accuracy rate, while the accuracy of Microsoft Copilot was 83.3%. In contrast, the Google VA’s accuracy in audible responses improved from 18% in the 2023 report to 63% in 2024. Because of our investigation in this area, we have examined the accuracy of results obtained through different AI models in this review. The landscape of the findings reviewed here surveyed 252 papers published in 2024, topically reporting on AI in medicine of which 83 articles are considered in the present review because they contain evidenced-based findings. In particular, the types of cases considered deal with AI accuracy in initial differential diagnoses, cancer treatment recommendations, board-style exams and performance in various clinical tasks. Importantly, summaries of the validation techniques used to evaluate AI findings are presented. 
This review focuses on AIs whose clinical relevance is evidenced by application and evaluation in clinical publications. This relevance speaks to both what has been promised and what has been delivered by various AI systems. Readers will be able to recognize when a generative AI may be expressing views without having the necessary information (ultracrepidarianism) or responding as though it had expert knowledge when it does not. Without an awareness that AIs may deliver inadequate or confabulated information, incorrect medical decisions and inappropriate clinical applications can result (the Dunning-Kruger effect). As a result, in certain cases a generative AI might underperform while presenting results that greatly overstate their medical or clinical validity.
Keywords:
1. Introduction
2. Materials and Methods
3. Results
3.1. Gynecologic Oncology
3.2. Clinical Medicine in General
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Land JM, Pavlik EJ, Ueland E, et al. Evaluation of replies to voice queries in gynecologic oncology by virtual assistants Siri, Alexa, Google, and Cortana. BioMed Informatics. 2023, 3, 553–562. [Google Scholar] [CrossRef]
- Pavlik, E.J.; Ramaiah, D.D.; Rives, T.A.; Swiecki-Sikora, A.L.; Land, J.M. Replies to Queries in Gynecologic Oncology by Bard, Bing, and the Google Assistant. BioMed Informatics. 2024, 4, 1773–1782. [Google Scholar] [CrossRef]
- Brandl, R.; Ellis, C. ChatGPT statistics 2024: All the latest statistics about OpenAI’s chatbot. Tooltester. Published 2024. Available at: https://www.tooltester.com/en/blog/chatgpt-statistics/. Accessed January 27, 2025.
- Google. (2024). Gemini [Large language model].
- Vogel, M. A curated list of resources on generative AI. Medium. Updated October 9, 2024. Available at: https://medium.com/@maximilian.vogel/5000x-generative-ai-intro-overview-models-prompts-technology-tools-comparisons-the-best-a4af95874e94. Accessed January 27, 2025.
- Yang J, Jin H, Tang J, et al. The practical guides for large language models. Available at: https://github.com/Mooler0410/LLMsPracticalGuide. Accessed January 27, 2025.
- Kurzweil, R. The singularity is nearer: When we merge with AI. Penguin Books; 2024. ISBN 9780399562761.
- Ortiz, S. What is Google Bard? Here's everything you need to know. ZDNET. February 7, 2024. Available at: https://www.zdnet.com/article/what-is-google-bard-heres-everything-you-need-to-know/. Accessed February 13, 2024.
- Microsoft Copilot. Wikipedia. Available at: https://en.wikipedia.org/wiki/Microsoft_Copilot. Accessed February 13, 2024.
- Haug C.J., & Drazen, J.M. (2023). Artificial Intelligence and Machine Learning in Clinical Medicine, 2023. New England Journal of Medicine, 388(13), 1201-1208. [CrossRef]
- Gomes, B.; Ashley, E.A. Artificial Intelligence in Molecular Medicine. N Engl J Med. 2023, 388, 2456–2465. [CrossRef] [PubMed]
- Colasacco, C.J.; Born, H.L. A Case of Artificial Intelligence Chatbot Hallucination. JAMA Otolaryngol Head Neck Surg. 150, 457–458. [CrossRef] [PubMed]
- Kacena, M.A.; Plotkin, L.I.; Fehrenbacher, J.C. The Use of Artificial Intelligence in Writing Scientific Review Articles. Curr Osteoporos Rep. 22, 115–121. [CrossRef] [PubMed] [PubMed Central]
- Omiye, J.A.; Gui, H.; Rezaei, S.J.; Zou, J.; Daneshjou, R. Large Language Models in Medicine: The Potentials and Pitfalls : A Narrative Review. Ann Intern Med. 177, 210–220. [CrossRef] [PubMed]
- Kanjee, Z.; Crowe, B.; Rodman, A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA. 2023, 330, 78–80. [Google Scholar] [CrossRef]
- Rengers, T.A.; Thiels, C.A.; Salehinejad, H. Academic Surgery in the Era of Large Language Models: A Review. JAMA Surg. 2024, 159, 445–450. [Google Scholar] [CrossRef]
- Shah NH, Halamka JD, Saria S, et al. A Nationwide Network of Health AI Assurance Laboratories. JAMA. 2024, 331, 245–249. [Google Scholar] [CrossRef]
- Li SW, Kemp MW, Logan SJS, et al. ChatGPT outscored human candidates in a virtual objective structured clinical examination in obstetrics and gynecology. Am J Obstet Gynecol. 2023, 229, 172.e1–172.e12. [Google Scholar] [CrossRef] [PubMed]
- Anastasio, M.K.; Peters, P.; Foote, J.; Melamed, A.; Modesitt, S.C.; Musa, F.; Rossi, E.; Albright, B.B.; Havrilesky, L.J.; Moss, H.A. The doc versus the bot: A pilot study to assess the quality and accuracy of physician and chatbot responses to clinical questions in gynecologic oncology. Gynecol Oncol Rep. 2024, 55, 101477. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Fanizzi, A.; Arezzo, F.; Cormio, G.; Comes, M.C.; Cazzato, G.; Boldrini, L.; Bove, S.; Bollino, M.; Kardhashi, A.; Silvestris, E.; Quarto, P.; Mongelli, M.; Naglieri, E.; Signorile, R.; Loizzi, V.; Massafra, R. An explainable machine learning model to solid adnexal masses diagnosis based on clinical data and qualitative ultrasound indicators. Cancer Med. 13, e7425. [CrossRef] [PubMed] [PubMed Central]
- Liu T, Miao K, Tan G, et al. A study on automatic O-RADS classification of sonograms of ovarian adnexal lesions based on deep convolutional neural networks. Ultrasound Med Biol. 2025, 51, 387–395. [Google Scholar] [CrossRef] [PubMed]
- Moro, F.; Ciancia, M.; Zace, D.; Vagni, M.; Tran, H.E.; Giudice, M.T.; Zoccoli, S.G.; Mascilini, F.; Ciccarone, F.; Boldrini, L.; D'Antonio, F.; Scambia, G.; Testa, A.C. Role of artificial intelligence applied to ultrasound in gynecology oncology: A systematic review. Int J Cancer. 155, 1832–1845. [CrossRef] [PubMed]
- Mitchell, S.; Nikolopoulos, M.; El-Zarka, A.; Al-Karawi, D.; Al-Zaidi, S.; Ghai, A.; Gaughran, J.E.; Sayasneh, A. Artificial intelligence in ultrasound diagnoses of ovarian cancer: A systematic review and meta-analysis. Cancers (Basel). 2024, 16, 422. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Feng, Y. An integrated machine learning-based model for joint diagnosis of ovarian cancer with multiple test indicators. J Ovarian Res 17, 45 (2024). [CrossRef]
- Bogaerts JM, Steenbeek MP, Bokhorst JM, et al. Assessing the impact of deep-learning assistance on the histopathological diagnosis of serous tubal intraepithelial carcinoma (STIC) in fallopian tubes. J Pathol Clin Res. 2024 Nov, 10, e70006. [CrossRef] [PubMed] [PubMed Central]
- Bergstrom EN, et al. Deep learning artificial intelligence predicts homologous recombination deficiency and platinum response from histologic slides. J Clin Oncol. 2024;42:3550-3560. [CrossRef]
- Capasso I, Cucinella G, Wright DE, et al. Artificial intelligence model for enhancing the accuracy of transvaginal ultrasound in detecting endometrial cancer and endometrial atypical hyperplasia. Int J Gynecol Cancer 2024, 34, 1547–1555. [Google Scholar] [CrossRef] [PubMed]
- Hu L, Bell D, Antani S, et al. An observational study of deep learning and automated evaluation of cervical images for cancer screening. JNCI: Journal of the National Cancer Institute. 2019, 111, 923–932. [Google Scholar] [CrossRef]
- Xue, Z.; Novetsky, A.P.; Einstein, M.H.; et al A demonstration of automated visual evaluation of cervical images taken with a smartphone camera. Int. J. Cancer. 2020; 147: 2416–2423. [CrossRef]
- Desai, K.T.; Befano, B.; Xue, Z.; et al. The development of “automated visual evaluation” for cervical cancer screening: The promise and challenges in adapting deep-learning for clinical testing. Int. J. Cancer. 2022; 150(5): 741–752. [CrossRef]
- Parham, G.P.; Egemen, D.; Befano, B.; et al. Validation in Zambia of a cervical screening strategy including HPV genotyping and artificial intelligence (AI)-based automated visual evaluation. Infect Agents Cancer 18, 61 (2023). [CrossRef]
- Egemen D, Perkins RB, Cheung LC, et al. Artificial intelligence–based image analysis in clinical testing: lessons from cervical cancer screening. JNCI: J Natl Cancer Inst. 2024 Jan, 116, 26-33. [CrossRef]
- Rios-Doria E, Wang J, Rodriguez I, et al. Artificial intelligence powered insights: Assessment of ChatGPT's treatment recommendations in gynecologic oncology. Gynecol Oncol. 2024;190(Suppl 1):S45. [CrossRef]
- Rao A, Pang M, Kim J, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J Med Internet Res. 2023;25:e48659. [CrossRef]
- Cabral S, Restrepo D, Kanjee Z, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern Med. 2024, 184, 581–583. [Google Scholar] [CrossRef]
- Chen, S.; Kann, B.H.; Foote, M.B.; Aerts, H.J.W.L.; Savova, G.K.; Mak, R.H.; Bitterman, D.S. Use of Artificial Intelligence Chatbots for Cancer Treatment Information. JAMA Oncol. 9, 1459–1462. [CrossRef]
- Pan, A.; Musheyev, D.; Bockelman, D.; Loeb, S.; Kabarriti, A.E. Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer. JAMA Oncol. 2023, 9, 1437–1440. [Google Scholar] [CrossRef] [PubMed]
- Janopaul-Naylor JR, Koo A, Qian DC, et al. Physician assessment of ChatGPT and Bing answers to American Cancer Society's questions to ask about your cancer. Am J Clin Oncol. 2024, 47, 17–21. [Google Scholar] [CrossRef] [PubMed]
- Shea YF, Lee CMY, Ip WCT, et al. Use of GPT-4 to analyze medical records of patients with extensive investigations and delayed diagnosis. JAMA Netw Open. 2023, 6, e2325000. [Google Scholar] [CrossRef] [PubMed]
- Schubert, M.C.; Wick, W.; Venkataramani, V. Performance of large language models on a neurology board-style examination. JAMA Netw Open. 2023, 6, e2346721. [Google Scholar] [CrossRef]
- Wu S, Koo M, Blum L, et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024, 1. [Google Scholar] [CrossRef]
- Katz U, Cohen E, Shachar E, et al. GPT versus resident physicians — a benchmark based on official board scores. NEJM AI. 2024;1(5). [CrossRef]
- Jabbour S, Fouhey D, Shepard S, et al. Measuring the Impact of AI in the Diagnosis of Hospitalized Patients: A Randomized Clinical Vignette Survey Study. JAMA. 2023, 330, 2275–2284. [Google Scholar] [CrossRef]
- Han T, Adams LC, Bressem KK, et al. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA. 2024, 331, 1320–1321. [Google Scholar] [CrossRef]
- Rydzewski NR, Dinakaran D, Zhao SG, et al. Comparative evaluation of LLMs in clinical oncology. NEJM AI. 2024;1(5). [CrossRef]
- Longwell JB, Hirsch I, Binder F, et al. Performance of large language models on medical oncology examination questions. JAMA Netw Open. 2024, 7, e2417641. [Google Scholar] [CrossRef]
- Chen D, Huang RS, Jomy J, et al. Performance of multimodal artificial intelligence chatbots evaluated on clinical oncology cases. JAMA Netw Open. 2024, 7, e2437711. [Google Scholar] [CrossRef]
- Thirunavukarasu AJ, Mahmood S, Malem A, et al. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study. PLOS Digit Health. 2024, 3, e0000341. [Google Scholar] [CrossRef]
- Liu T-L, Hetherington TC, Dharod A, et al. Does AI-powered clinical documentation enhance clinician efficiency? A longitudinal study. NEJM AI. 2024;1(12). [CrossRef]
- Tai-Seale M, Baxter SL, Vaida F, et al. AI-Generated Draft Replies Integrated Into Health Records and Physicians’ Electronic Communication. JAMA Netw Open. 2024, 7, e246565. [Google Scholar] [CrossRef] [PubMed]
- Liu TL, Hetherington TC, Stephens C, McWilliams A, Dharod A, Carroll T, Cleveland JA. AI-Powered Clinical Documentation and Clinicians' Electronic Health Record Experience: A Nonrandomized Clinical Trial. JAMA Netw Open. 2024 Sep 3, 7, e2432460. [CrossRef] [PubMed] [PubMed Central]
- Small WR, Wiesenfeld B, Brandfield-Harvey B, et al. Large language model–based responses to patients’ in-basket messages. JAMA Netw Open. 2024, 7, e2422399. [Google Scholar] [CrossRef]
- Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023, 183, 589–596. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Hartman V, Zhang X, Poddar R, et al. Developing and evaluating large language model-generated emergency medicine handoff notes. JAMA Netw Open. 2024, 7, e2448723. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Burford KG, Itzkowitz NG, Ortega AG, Teitler JO, Rundle AG. Use of Generative AI to Identify Helmet Status Among Patients With Micromobility-Related Injuries From Unstructured Clinical Notes. JAMA Netw Open. 2024 Aug 1, 7, e2425981. [CrossRef] [PubMed] [PubMed Central]
- Shah, S.V. Accuracy, Consistency, and Hallucination of Large Language Models When Analyzing Unstructured Clinical Notes in Electronic Medical Records. JAMA Netw Open. 2024 Aug 1, 7, e2425953. [CrossRef] [PubMed]
- Garcia P, Ma SP, Shah S, et al. Artificial Intelligence–Generated Draft Replies to Patient Inbox Messages. JAMA Netw Open. 2024, 7, e243201. [Google Scholar] [CrossRef]
- Chen D, Parsa R, Hope A, et al. Physician and artificial intelligence chatbot responses to cancer questions from social media. JAMA Oncol. 2024, 10, 956–960. [Google Scholar] [CrossRef]
- Yalamanchili A, Sengupta B, Song J, et al. Quality of large language model responses to radiation oncology patient care questions. JAMA Netw Open. 2024, 7, e244630. [Google Scholar] [CrossRef]
- Yu F, Moehring A, Banerjee O, et al. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat Med. 2024;30:837-849. [CrossRef]
- Tu T, Azizi S, Driess D, et al. Towards generalist biomedical AI. NEJM AI. 2024;1(3). [CrossRef]
- Goh E, Gallo R, Hom J, et al. Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Netw Open. 2024, 7, e2440969. [Google Scholar] [CrossRef]
- Bhargava A, López-Espina C, Schmalz L, et al. FDA-authorized AI/ML tool for sepsis prediction: development and validation. NEJM AI. 2024;1(12). [CrossRef]
- Hoang YN, Chen YL, Ho DK, et al. Consistency and accuracy of artificial intelligence for providing nutritional information. JAMA Netw Open. 2023, 6, e2350367. [Google Scholar] [CrossRef]
- Carnevali, D.; Zhong, L.; González-Almela, E.; et al. A deep learning method that identifies cellular heterogeneity using nanoscale nuclear features. Nat Mach Intell 6, 1021–1033 (2024). [CrossRef]
- van der Laak J, Litjens G, Ciompi F. Deep learning in histopathology: the path to the clinic. Nat Med. 2021 May, 27, 775-784. [CrossRef] [PubMed]
- Shmatko, A.; Ghaffari Laleh, N.; Gerstung, M.; Kather, J.N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat Cancer. 2022 Sep, 3, 1026-1038. [CrossRef] [PubMed]
- Reis-Filho, J.S.; Kather, J.N. Overcoming the challenges to implementation of artificial intelligence in pathology. JNCI J Natl Cancer Inst. 2023, 115, 608–612. [Google Scholar] [CrossRef] [PubMed]
- Steimetz E, Minkowitz J, Gabutan EC, et al. Use of artificial intelligence chatbots in interpretation of pathology reports. JAMA Netw Open. 2024, 7, e2412767. [Google Scholar] [CrossRef] [PubMed]
- Wang, X., Zhao, J., Marostica, E. et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature 634, 970–978 (2024). [CrossRef]
- Gjesvik, J.; Moshina, N.; Lee, C.I.; Miglioretti, D.L.; Hofvind, S. Artificial Intelligence Algorithm for Subclinical Breast Cancer Detection. JAMA Netw Open. 2024 Oct 1, 7, e2437402. [CrossRef] [PubMed] [PubMed Central]
- Lee JJ, Zepeda A, Arbour G, et al. Automated identification of breast cancer relapse in computed tomography reports using natural language processing. JCO Clin Cancer Inform, 2400. [CrossRef]
- Mihalache, A.; Huang, R.S.; Popovic, M.M.; Patil, N.S.; Pandya, B.U.; Shor, R.; Pereira, A.; Kwok, J.M.; Yan, P.; Wong, D.T.; Kertes, P.J.; Muni, R.H. Accuracy of an Artificial Intelligence Chatbot's Interpretation of Clinical Ophthalmic Images. JAMA Ophthalmol. 2024 Apr 1, 142, 321–326. [CrossRef] [PubMed] [PubMed Central]
- Chung P, Fong CT, Walters AM, et al. Large language model capabilities in perioperative risk prediction and prognostication. JAMA Surg. 2024, 159, 928–937. [Google Scholar] [CrossRef]
- Xiao M, Molina KC, Aggarwal NR, et al. A Machine Learning Method for Allocating Scarce COVID-19 Monoclonal Antibodies. JAMA Health Forum. 2024, 5, e242884. [Google Scholar] [CrossRef]
- Sunami K, Naito Y, Saigusa Y, et al. A Learning Program for Treatment Recommendations by Molecular Tumor Boards and Artificial Intelligence. JAMA Oncol. 2024, 10, 95–102. [Google Scholar] [CrossRef]
- Williams CYK, Zack T, Miao BY, et al. Use of a large language model to assess clinical acuity of adults in the emergency department. JAMA Netw Open. 2024, 7, e248895. [Google Scholar] [CrossRef]
- Andriola C, Ellis RP, Siracuse JJ, et al. A Novel Machine Learning Algorithm for Creating Risk-Adjusted Payment Formulas. JAMA Health Forum. 2024, 5, e240625. [Google Scholar] [CrossRef]
- Soroush A, Glicksberg BS, Zimlichman E, et al. Large language models are poor medical coders — benchmarking of medical code querying. NEJM AI. 2024;1(5). [CrossRef]
- Hantel A, Walsh TP, Marron JM, et al. Perspectives of oncologists on the ethical implications of using artificial intelligence for cancer care. JAMA Netw Open. 2024, 7, e244077. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
- Platt, J.; Nong, P.; Carmona, G.; Kardia, S. Attitudes toward notification of use of artificial intelligence in health care. JAMA Netw Open. 2024, 7, e2450102. [Google Scholar] [CrossRef] [PubMed]
- Centripetal: Evolving Trust But Verify. https://www.centripetal.ai/blog/trust-but-verify-threat-intelligence/.
- User clip: trust but verify December 8, 1987. https://www.c-span.org/clip/white-house-event/user-clip-trust-but-verify/4757483 Accessed January 27, 2025.
- Menz, B.D.; Modi, N.D.; Sorich, M.J.; Hopkins, A.M. Health disinformation use case highlighting the urgent need for artificial intelligence vigilance: Weapons of mass disinformation. JAMA Intern Med. 2024, 184, 92–96. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).