Evaluating 11 Large Language Models in Answering Key Questions on Ovarian Cancer

Michela Quaranta; Yong Sheng Tan; Areti Karamanou; Evangelos Kalampokis; Nicolas M Orsi; Diederick DeJong; Alexandros Laios

doi:10.20944/preprints202604.0800.v1

Submitted: 09 April 2026

Posted: 11 April 2026

Abstract

Background: The release of Large Language Models (LLMs) has introduced numerous benefits across the healthcare domain. This study evaluated the responses of 11 LLMs from the Claude, Mistral, Llama, and GPT families to Frequently Asked Questions (FAQs) about ovarian cancer with regard to three domains: (a) ease of understanding, (b) accuracy, and (c) empathy.

Methods: Fifteen FAQs were sourced from the Ovarian Cancer Action (OCA) website, comprising (a) anticipated questions and (b) actual questions. Responses from each of the 11 LLMs were blinded and then evaluated by three Gynaecological Oncology Surgical Fellows using a 5-point Likert scale. Inter-observer agreement was calculated for each response, and the LLMs were compared across the three domains using Friedman's test (significance threshold p < 0.05). Finally, all LLM responses were compared with the OCA website's own answers using the same evaluation criteria.

Results: Varying levels of inter-observer agreement were observed. Claude 3 Opus produced the easiest-to-understand answers (average score 4.38), followed by Mistral Large (4.36) and GPT-4o (4.33). GPT-4o scored highest for accuracy (average score 4.24) and for empathy (average score 3.87). Compared with the OCA responses, GPT-4o also outperformed all other models in accuracy and empathy, with 50% of its responses rated more accurate and 70% rated more empathetic than the OCA content. Claude 3 Opus, Mistral Large, and Mixtral 8x7B surpassed OCA in clarity for one-third of responses, while Claude 3 Sonnet achieved the highest readability gains (40%).

Conclusion: The study informs the development of LLMs suitable for patient-facing ovarian cancer communication, with Claude 3 Opus and GPT-4o excelling in different metrics. While improvements in emotional intelligence remain necessary, our findings pave the way for developing a specialized LLM for ovarian cancer that uses domain-specific text to provide comprehensive and empathetic information.
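The Friedman comparison named in the Methods can be illustrated with a minimal sketch. It assumes that each FAQ is treated as a block and each LLM as a treatment, which the abstract does not state explicitly. The function names are hypothetical and the scores are invented for illustration; they are not the study's data. The statistic uses the standard tie correction, which matters for Likert ratings because ties are frequent.

```python
import math
from collections import Counter


def midranks(scores):
    """Rank one block's scores (1 = lowest), averaging ranks across ties."""
    counts = Counter(scores)
    first_position = {}
    for position, value in enumerate(sorted(scores), start=1):
        first_position.setdefault(value, position)
    return [first_position[s] + (counts[s] - 1) / 2 for s in scores]


def friedman_statistic(blocks):
    """Tie-corrected Friedman chi-square statistic.

    blocks: one row per block (assumed here: one FAQ), one column per
    treatment (assumed here: one LLM).
    """
    n = len(blocks)
    k = len(blocks[0])
    if any(len(row) != k for row in blocks):
        raise ValueError("every block needs one score per treatment")

    rank_sums = [0.0] * k
    ties = 0
    for row in blocks:
        for j, rank in enumerate(midranks(row)):
            rank_sums[j] += rank
        # Each group of t tied scores within a block contributes t(t^2 - 1).
        ties += sum(t * (t * t - 1) for t in Counter(row).values())

    correction = 1 - ties / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all scores are tied within every block")

    q = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return q / correction


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Invented Likert scores: 4 questions (rows) x 3 models (columns).
example_scores = [
    [4, 5, 3],
    [4, 4, 3],
    [5, 5, 4],
    [3, 4, 2],
]
statistic = friedman_statistic(example_scores)
p_value = chi2_sf_even_df(statistic, df=len(example_scores[0]) - 1)
print(f"Friedman statistic = {statistic:.3f}, p = {p_value:.4f}")
```

The p-value comes from the chi-square approximation with k - 1 degrees of freedom. With 11 models this gives 10 degrees of freedom, so the even-df closed form applies; for an odd number of degrees of freedom, a statistics library would be needed instead.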

Keywords: large language models; ovarian cancer; patient communication; artificial intelligence; natural language processing; empathy; readability; GPT-4o; Claude

Subject: Medicine and Pharmacology - Oncology and Oncogenics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
