Submitted: 16 October 2024
Posted: 16 October 2024
Abstract
Keywords:
1. Introduction

2. Materials and Methods
2.1. Planned Prospective Trial
2.2. Selection of Two Appropriate LLMs for Comparison with the MTB (1)
2.3. Development of Standardized Prompts for Data Input on GUC Patients and the Creation of a Uniform Recommendation Matrix for Both LLMs and the MTB to Facilitate Blinded Assessment (2)
2.4. Modification and Validation of the Newly Developed mSCS Using a Cohort of 40 Patients with Varying Organ-Specific GUCs (3)
2.5. Biometric Sample Size Planning for the Prospective Trial, Preceded by a Moderated Delphi Process with the Entire Study Team to Establish What Level of Difference in the mSCS, Derived from Preliminary Study Results, Would Still Be Considered Non-Inferior for LLMs Compared to the MTB (4)
2.6. Precise Documentation and Listing of Statistical Methods to Validate the mSCS and Compare Results between the Groups (MTB vs. LLM) (5)
3. Results
3.1. Selection of Two Appropriate LLMs for Comparison with the MTB (1)
3.2. Development of Standardized Prompts for Data Input on Urological Tumor Patients and the Creation of a Uniform Recommendation Matrix for Both LLMs and the MTB to Facilitate Blinded Assessment (2)
3.3. Modification and Validation of the Newly Developed mSCS Using a Cohort of 40 GUC Patients with Varying Organ-Specific Cancers (3)
3.4. Biometric Sample Size Planning for the Prospective Trial, Preceded by a Moderated Delphi Process with the Entire Study Team to Establish What Level of Difference in the mSCS, Derived from Preliminary Study Results, Would Still Be Considered Non-Inferior for LLMs Compared to the MTB (4)
3.5. Validation of the mSCS and Comparison between the Groups (MTB vs. LLM) (5)
3.5.1. Interrater Reliability
3.5.2. Agreement between SCS and mSCS
3.5.3. Internal Consistency
3.5.4. Evaluation of Clinical Applicability of the mSCS Compared to the SCS
4. Discussion
Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
| Prompt - Original input in German | Prompt - English translation |
|---|---|
| Formuliere eine stichpunktartige Therapieempfehlung für den folgenden Patientenfall. Die Vortherapien und andere relevante Befunde sind im Fall enthalten. Beschränke dich hierbei nicht nur auf Medikamente, sondern beschreibe alle möglichen Therapieoptionen. Bitte benenne Therapien und Medikamente, falls du eine medikamentöse Therapie vorschlägst, konkret. Versuche, deine Empfehlung anhand der in Deutschland zugelassenen und leitliniengerechten Therapien zu treffen. Bitte benenne zudem explizit die aus deiner Sicht beste Therapieoption für den individuellen Patienten. Begrenze mit Deinen Antworten auf maximal 80 Wörter und orientiere Dich in ihnen an folgender Struktur: 1.) Präferierte Therapieempfehlung (falls vorhanden), 2.) Therapiealternativen, 3.) Begründung der Empfehlungen, 4.) Supportivmaßnahmen / ergänzende Therapien, 5.) Weiterführende Informationen / Erklärungen. Patientenfall: | Formulate a key point-based treatment recommendation for the following patient case. The previous therapies and other relevant findings are included in the case. Do not limit yourself to medication but describe all possible treatment options. Please name therapies and medications specifically if you are suggesting drug therapy. Try to make your recommendation based on the therapies approved in Germany and in line with the guidelines. Please also explicitly state what you consider to be the best treatment option for the individual patient. Limit your answers to a maximum of 80 words and base them on the following structure: 1) Preferred therapy recommendation (if available), 2) Therapy alternatives, 3) Justification of the recommendations, 4) Supportive measures / supplementary therapies, 5) Further information / explanations. Patient case: |
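
The standardized prompt above can be attached to each patient case programmatically. The following is a minimal sketch under stated assumptions: the helper names `build_prompt`, `word_count`, and `within_limit` are hypothetical, a "word" is assumed to be any whitespace-separated token, and the English translation is used here only for readability (the table gives German as the original input language).

```python
# Sketch only: helper names are hypothetical and not taken from the study.

# English translation of the standardized prompt from the table above.
PROMPT_PREFIX = (
    "Formulate a key point-based treatment recommendation for the following "
    "patient case. The previous therapies and other relevant findings are "
    "included in the case. Do not limit yourself to medication but describe "
    "all possible treatment options. Please name therapies and medications "
    "specifically if you are suggesting drug therapy. Try to make your "
    "recommendation based on the therapies approved in Germany and in line "
    "with the guidelines. Please also explicitly state what you consider to "
    "be the best treatment option for the individual patient. Limit your "
    "answers to a maximum of 80 words and base them on the following "
    "structure: 1) Preferred therapy recommendation (if available), "
    "2) Therapy alternatives, 3) Justification of the recommendations, "
    "4) Supportive measures / supplementary therapies, "
    "5) Further information / explanations. Patient case:"
)

MAX_WORDS = 80  # word limit stated in the prompt itself


def build_prompt(case_text: str) -> str:
    """Append one patient case to the fixed prompt prefix."""
    return PROMPT_PREFIX + " " + case_text.strip()


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def within_limit(answer: str, limit: int = MAX_WORDS) -> bool:
    """Check whether a returned recommendation respects the word limit."""
    return word_count(answer) <= limit
```

Keeping the prefix fixed and varying only the case text means every recommendation is requested under identical instructions, which is the point of a standardized prompt.
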
| Item | SCS | mSCS |
|---|---|---|
| 1 | I found that the recommendation included all relevant known causal factors with sufficient precision and granularity. | I found that the recommendation included all relevant patient-specific factors (individual patient data such as individual tumour stages, previous treatments and specific health conditions) with sufficient precision and granularity. |
| 2 | I understood the explanations within the context of my work. | I found the quality and representativeness of the recommendations, particularly in relation to oncological scenarios, sufficient. |
| 3 | I could change the level of detail on demand. | I found that all reasonable treatment alternatives were specified. |
| 4 | I did not need support to understand the explanations. | I did not need support to understand the explanations. |
| 5 | I found the explanations helped me to understand causality. | I found that the recommendation was explained and made transparent. |
| 6 | I was able to use the explanations with my knowledge base. | I found the recommendation to be consistent with current clinical guidelines. |
| 7 | I did not find inconsistencies between explanations. | I did not find inconsistencies between explanations/recommendations. |
| 8 | I think that most people would learn to understand the explanations very quickly. | I think that most health care professionals would learn to understand the explanations very quickly. |
| 9 | I did not need more references in the explanations: e.g., medical guidelines, regulations. | I found the recommendation demonstrates access to the latest research and clinical guidelines. |
| 10 | I received the explanations in a timely and efficient manner. | I found the quality of interaction (ease of use and accessibility) sufficient. |
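
The outline lists internal consistency of the ten-item scales as a validation step (Section 3.5.3). As an illustration only, here is a minimal sketch of Cronbach's alpha, assuming each rated recommendation yields one numeric score per item. The function name `cronbach_alpha` and the use of sample variances are assumptions, not the authors' documented procedure.

```python
from statistics import variance  # sample variance (n - 1 denominator)


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows, one row of item scores per rating.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(ratings) < 2:
        raise ValueError("need at least two rated recommendations")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("every row needs the same number (>= 2) of items")

    # Variance of each item (column) across all rated recommendations.
    item_variances = [variance(column) for column in zip(*ratings)]
    # Variance of the summed score per rated recommendation.
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are equal")

    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

For the SCS and mSCS, each row would hold ten item scores. The rows used in any check of this function are invented illustration data, not study results.
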
| Item | SCS, MTB: κ (p-value) | SCS, LLM: κ (p-value) | mSCS, MTB: κ (p-value) | mSCS, LLM: κ (p-value) |
|---|---|---|---|---|
| Item 1 | 0.80 (<.001) | 0.70 (<.001) | 0.90 (<.001) | 0.70 (<.001) |
| Item 2 | 0.95 (<.001) | 0.75 (<.001) | 1.00 (<.001) | 0.85 (<.001) |
| Item 3 | 0.70 (<.001) | 0.75 (<.001) | 0.80 (<.001) | 0.65 (<.001) |
| Item 4 | 1.00 (<.001) | 0.90 (<.001) | 1.00 (<.001) | 0.95 (<.001) |
| Item 5 | 0.90 (<.001) | 0.70 (<.001) | 0.75 (<.001) | 0.70 (<.001) |
| Item 6 | 0.95 (<.001) | 0.65 (<.001) | 1.00 (<.001) | 0.80 (<.001) |
| Item 7 | 0.90 (<.001) | 0.70 (<.001) | 1.00 (<.001) | 0.80 (<.001) |
| Item 8 | 0.85 (<.001) | 0.80 (<.001) | 1.00 (<.001) | 0.85 (<.001) |
| Item 9 | 0.90 (<.001) | 0.85 (<.001) | 1.00 (<.001) | 0.95 (<.001) |
| Item 10 | 1.00 (<.001) | 0.75 (<.001) | 1.00 (<.001) | 0.80 (<.001) |
| All Items | 0.90 (<.001) | 0.74 (<.001) | 0.95 (<.001) | 0.81 (<.001) |
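
To show how agreement coefficients like those in the interrater table above are typically obtained, the sketch below computes an unweighted Cohen's kappa for two raters and maps it onto the Landis and Koch agreement bands. It assumes the reported coefficient is Cohen's kappa. Whether the study used the unweighted or the weighted variant is not stated here, so the unweighted form is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)

    # Observed proportion of cases on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Verbal agreement band after Landis and Koch (1977)."""
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
```

Under these bands, the four "All Items" coefficients in the table (0.90, 0.74, 0.95, 0.81) would read as almost perfect, substantial, almost perfect, and almost perfect, respectively.
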
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).