Medical Large Language Models: Methods and Applications

Yu Sun; Guiyan Liu; Qifeng Bai

doi:10.20944/preprints202606.1375.v1

Submitted: 16 June 2026

Posted: 18 June 2026

Abstract

The application of Large Language Models (LLMs) in the medical field signifies a revolutionary shift in medical informatics and patient care. LLMs trained on vast and specialized medical corpora bring precision and efficiency to medical diagnostics, treatment planning, and service delivery. This paper outlines the essential methodologies for preparing LLMs for deployment in healthcare, namely the pre-training and fine-tuning processes. Innovative pre-training methods are introduced, such as multimodal and Contrastive Language-Image Pre-training (CLIP)-based approaches that enhance the model's understanding across different data types. Fine-tuning techniques such as Supervised Fine-Tuning (SFT), Instruction Fine-Tuning (IFT), and Parameter-Efficient Tuning (PET) are also discussed; these tailor LLMs to specific medical tasks with relatively low resource requirements. Moreover, the paper covers advanced prompting strategies, including zero/few-shot, Chain-of-Thought (CoT), and self-consistency prompting, which refine the model's capacity to handle complex medical queries, as well as Retrieval-Augmented Generation (RAG) and interactive web search technologies, which improve the accuracy and reliability of model responses and thereby the applicability of LLMs in real-world medical practice. Finally, the article describes the applications of multimodal models in medicine and of LLMs to medical texts and images. The paper presents not only the common methods and applications of LLMs in medicine but also future directions, ethics, and privacy in this rapidly advancing field.

Keywords: artificial intelligence; deep learning; large language models; multi-modal learning

Subject: Medicine and Pharmacology - Other

1. Introduction

The integration of large language models (LLMs) with medical science marks the dawn of a new era in healthcare, defined by unparalleled precision and operational efficiency. LLMs are a type of foundation model trained on vast amounts of data, enabling them to understand and generate natural language and other types of content and to perform a wide range of tasks. While there is no strict standard defining what an LLM is, LLMs can usually be identified by certain characteristics, such as the number of parameters (often in the billions or even trillions) and the scale of pre-training. They are often pre-trained on massive web-scale text collections such as Common Crawl or Wikipedia.

At the forefront of this transformative shift in healthcare are LLMs trained on extensive and specialized datasets. For example, models like PubMedBERT [1] and ClinicalBERT [2] are pre-trained on large-scale unlabeled data, including PubMed literature and clinical case studies, and later fine-tuned for specific tasks. This groundwork equips LLMs with the sophisticated understanding required for intricate medical language processing. Additionally, fine-tuning methodologies like Parameter-Efficient Tuning and Instruction Fine-Tuning further sharpen these models for specific applications, significantly enhancing their efficiency and versatility [3]. These techniques are designed to optimize resource use while enhancing the precision of outputs, an essential trait for deployment in resource-limited settings such as small-scale clinics or emergency response scenarios.

In practical scenarios, the deployment of LLMs within the medical domain is catalyzing a revolution in patient care and clinical decision-making. LLMs excel in a variety of functions, from automating diagnostics to crafting personalized treatment plans [4,5,6,7]. Notably, Retrieval-Augmented Generation (RAG) technology bolsters the trustworthiness of LLM outputs by integrating up-to-date information from medical databases, helping the provided diagnostics and recommendations adhere to the latest research and clinical protocols. This capability allows LLMs to access and utilize the most current and relevant information to handle complex medical tasks [8]. Furthermore, the integration of web search technologies enables these models to sift through the extensive online medical corpus, anchoring their responses in the most pertinent information available. By enhancing the precision and speed of medical services, LLMs not only elevate the standard of care but also set the stage for future innovations in medical AI.

Figure 1. Method summary of medical large language model.

The discussion also addresses challenges such as data privacy, potential biases in model training, and the necessity for ongoing updates, offering a comprehensive overview of the environment in which these models operate. The fusion of NLP-driven LLMs with medical science marks a pivotal shift in healthcare. These models, refined through advanced training techniques, significantly enhance diagnostic precision and treatment efficacy. Supported by technologies like Retrieval-Augmented Generation, LLMs are transforming patient care in real-world settings [8].

To better understand the application of LLMs in medicine, our review elaborates on the pre-training and fine-tuning processes essential for preparing LLMs for healthcare, highlighting innovative techniques that enhance model understanding and operational efficiency. We also explore strategies like Retrieval-Augmented Generation and web search integration that contribute to the improved functionality and reliability of LLMs in handling complex medical scenarios. Overall, papers were selected primarily for their relevance to medical LLMs, diversity of perspective, significance, and impact. The selected work covers LLMs tailored for medical texts, images, and multimodal models, as well as critical challenges such as data privacy and model bias, to provide a comprehensive outlook on the future integration prospects of LLMs in healthcare.

Although some existing reviews address medical LLMs [9,10,11], our paper contributes different features and new content. Our review details a wide range of training and deployment methodologies for LLMs, including newer technologies such as Retrieval-Augmented Generation (RAG) and interactive web search, together with recent progress in web search techniques for medical LLMs. In addition, our review addresses the integration of multimodal models and explores sophisticated prompting strategies like Chain-of-Thought and self-consistency, as well as deployment challenges and ethical issues, offering insights into the balanced use of LLMs in healthcare. Finally, our review outlines future research directions and proposes ethical guidelines to guide the responsible development and use of medical LLMs.

2. Methods of Medical Large Language Models

In the rapidly evolving landscape of artificial intelligence within healthcare, the integration of LLMs marks a transformative era for medical informatics and patient care. These sophisticated models, equipped with vast data and advanced computational capabilities, are set to revolutionize areas such as medical diagnostics, treatment planning, and service delivery. This section explores the research progress of pre-training and fine-tuning, which prepare LLMs for effective deployment in the medical domain. It highlights innovative methods that enhance their performance and adaptability, such as Parameter-Efficient Tuning, Instruction Fine-Tuning, and advanced prompting techniques. Moreover, this section addresses emerging methodologies like Retrieval-Augmented Generation (RAG) and web search integrations (see Figure 1), which further refine the utility and accuracy of LLMs in handling complex medical queries. By delving into these technologies, we aim to highlight both the significant advancements and the ongoing challenges in optimizing LLMs to meet the specific demands of healthcare, ultimately paving the way for more personalized, efficient, and precise medical services. A general sketch of these methods is given in Figure 2.

2.1. Pre-training and Fine-tuning Methods

Pre-trained models [9], especially LLMs, have become a cornerstone of artificial intelligence applications in the medical domain. Specialized LLMs such as PubMedBERT [1], ClinicalBERT [2], BlueBERT [12], BioBERT [13], SciBERT [14], BEHRT [15], UmlsBERT [16], GatorTron [17], and MEDITRON [18] are typically pre-trained on specialized medical corpora. These corpora include medical literature from resources like PubMedQA [19], MIMIC-III [20], and webMedQA [19]. Through extensive pre-training, these models gain a deep understanding of the complexities of medical language, providing a solid foundation for application and fine-tuning in specific medical tasks.

The primary objectives of pre-training include masked language modeling, next sentence prediction, and next token prediction. BERT series models [21,22] generally focus on the first two objectives, while GPT series models [3,23,24] emphasize the last. Once pre-trained, these models are fine-tuned for specific application scenarios such as Question Answering (QA) and Named Entity Recognition (NER) to tailor their performance. The effectiveness of these models is assessed on benchmarks such as the Biomedical Language Understanding Evaluation (BLUE) and the Biomedical Language Understanding and Reasoning Benchmark (BLURB) [12], which validate their efficacy and accuracy in practical applications.
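As a rough illustration of how two of these objectives turn raw text into training examples, the following sketch builds a masked-language-modeling example and a set of next-token-prediction examples from one tokenized sentence. The function names and the sample sentence are hypothetical, and mask positions are passed in explicitly rather than sampled at random as a real pre-training pipeline would do.

```python
MASK = "[MASK]"

def mlm_example(tokens, mask_positions):
    """Masked language modeling: hide chosen tokens and predict them from context."""
    inputs = [MASK if i in mask_positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return inputs, targets

def next_token_examples(tokens):
    """Next token prediction: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Illustrative sentence, not taken from any dataset.
tokens = ["the", "patient", "denies", "chest", "pain"]
masked_inputs, masked_targets = mlm_example(tokens, {3})
causal_pairs = next_token_examples(tokens)
```

The masked example sees context on both sides of the hidden token, whereas each next-token example sees only the preceding prefix, which is the practical difference between the BERT-style and GPT-style objectives described above.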

In addition to these traditional pre-training techniques, we have explored some innovative pre-training methods aimed at further enhancing the models' mastery of medical knowledge: Multimodal pre-training, particularly CLIP-based pre-training [25,26], is becoming increasingly important in the medical field. This approach integrates multiple data modalities, including text, images, and structured data, enabling models to develop more comprehensive data representations as well as to deepen the understanding of complex relationships between medical language and imagery. Especially in matching medical images (such as X-rays, MRI, and CT scans) with corresponding clinical reports, CLIP's pre-training and fine-tuning for specific medical datasets significantly enhance its performance in tasks such as clinical diagnosis, disease classification, and treatment planning. The semantic understanding and adaptive learning capabilities of these deep learning models not only enhance their language and visual comprehension but also enable them to handle complex medical scenarios efficiently and accurately, paving the way for enhanced quality and efficiency in medical services.

Figure 2. General method sketch of medical large language models.

Contrastive Language-Image Pre-training (CLIP) [27] represents a significant advancement in multimodal learning, developed by OpenAI to effectively bridge the gap between textual and visual information. This innovative approach hinges on the foundational principle of aligning text and image representations within a shared high-dimensional space, enabling a model to comprehend and associate diverse forms of data. The core mechanism of CLIP involves training on an extensive dataset of images paired with textual descriptions, leveraging a contrastive loss function. This function plays a pivotal role by simultaneously maximizing the similarity between accurately paired text-image inputs while minimizing it for mismatched pairs, thereby guiding the model to distinguish relevant connections between different modalities.

From a technical perspective, CLIP's architecture consists of two main components: an image encoder and a text encoder. The image encoder is typically implemented using either a convolutional neural network (CNN) [28], known for its effectiveness in extracting hierarchical visual features, or a Vision Transformer (ViT) [29], which excels at capturing long-range dependencies within images. This encoder processes the visual inputs, transforming them into feature vectors that encapsulate the salient aspects of the images, which are crucial for tasks like object recognition or scene understanding.

In parallel, the text encoder, generally built upon transformer architectures, processes the accompanying textual data. This data often comprises descriptive phrases, keywords, or full sentences that convey the content or context of the images. The text encoder converts these linguistic inputs into feature vectors within the same representational space as the visual data. The training process is driven by the contrastive loss function, which ensures that the feature vectors for correctly paired text-image inputs are closely aligned, while those for incorrect pairs are pushed apart in the feature space. This method effectively teaches the model to generate a shared understanding of the content across both modalities, enabling robust cross-modal retrieval and interpretation.

Mathematically, the CLIP framework can be described as follows:

1. Feature Extraction:

· Let I[n, h, w, c] represent a mini-batch of aligned images, and T[n, l] represent a mini-batch of aligned texts.

· The image encoder, denoted as image_encoder, extracts feature representations I_{f} from the images:

I_{f} = \mathrm{image\_encoder}(I), \quad \text{where } I_{f} \in \mathbb{R}^{n \times d_{i}}    (1)

· The text encoder, denoted as text_encoder, extracts feature representations T_{f} from the texts:

T_{f} = \mathrm{text\_encoder}(T), \quad \text{where } T_{f} \in \mathbb{R}^{n \times d_{t}}    (2)

2. Projection into Joint Embedding Space:

· The features are then projected into a shared embedding space using learned projection matrices W_{i} and W_{t}:

I_{e} = \mathrm{l2\_normalize}(I_{f} \cdot W_{i}), \quad \text{where } W_{i} \in \mathbb{R}^{d_{i} \times d_{e}}    (3)

T_{e} = \mathrm{l2\_normalize}(T_{f} \cdot W_{t}), \quad \text{where } W_{t} \in \mathbb{R}^{d_{t} \times d_{e}}    (4)

3. Similarity Computation:

· The cosine similarities between the projected image and text embeddings are computed, scaled by a learned temperature parameter t:

\mathrm{logits} = (I_{e} \cdot T_{e}^{T}) \times \exp(t), \quad \text{where } \mathrm{logits} \in \mathbb{R}^{n \times n}    (5)

4. Loss Function:

· The symmetric contrastive loss function is defined as the cross-entropy loss over the computed similarities:

\mathrm{loss}_{i} = \mathrm{cross\_entropy\_loss}(\mathrm{logits}, \mathrm{labels}, \mathrm{axis} = 0)    (6)

\mathrm{loss}_{t} = \mathrm{cross\_entropy\_loss}(\mathrm{logits}, \mathrm{labels}, \mathrm{axis} = 1)    (7)

\mathrm{loss} = \frac{\mathrm{loss}_{i} + \mathrm{loss}_{t}}{2}    (8)

Here, labels=arange(n) represents the ground truth, indicating the correct pairs. In the medical domain, CLIP's potential is harnessed through the pre-training of models on datasets containing medical images such as X-rays, MRIs, and CT scans, paired with corresponding clinical reports. This approach allows the model to learn associations between visual cues in medical imagery and their textual descriptions, such as clinical findings, diagnostic information, and treatment plans. The image encoder in this context extracts features pertinent to medical diagnostics, which are then aligned with textual representations of clinical data. The contrastive loss function plays a crucial role in refining this alignment, thereby improving the model's ability to link specific medical conditions in images with their textual descriptions. In Figure 2, we show the pre-training process of unimodal and multimodal models from data input to output.
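The projection, similarity, and loss steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the reference CLIP implementation: the encoders are skipped and small hand-written feature vectors stand in for their outputs, the projection matrices are identity matrices, the temperature value is arbitrary, and all helper names are hypothetical.

```python
import math

def l2_normalize(rows):
    """Scale each row vector to unit Euclidean length."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out

def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_entropy(logit_rows, labels):
    """Mean softmax cross-entropy, one row of logits per example."""
    total = 0.0
    for row, label in zip(logit_rows, labels):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[label]
    return total / len(logit_rows)

def clip_loss(image_feats, text_feats, w_i, w_t, t):
    """Symmetric contrastive loss over a batch of paired features."""
    i_e = l2_normalize(matmul(image_feats, w_i))        # projected image embeddings
    t_e = l2_normalize(matmul(text_feats, w_t))         # projected text embeddings
    t_e_transposed = [list(col) for col in zip(*t_e)]
    scale = math.exp(t)
    logits = [[v * scale for v in row] for row in matmul(i_e, t_e_transposed)]
    labels = list(range(len(logits)))                   # pair k matches pair k
    loss_rows = cross_entropy(logits, labels)           # normalize over each row
    loss_cols = cross_entropy([list(c) for c in zip(*logits)], labels)  # over each column
    return (loss_rows + loss_cols) / 2

# Toy stand-ins for encoder outputs (assumed values, d_i = d_t = d_e = 2).
identity = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[2.0, 0.0], [0.0, 3.0]]    # row k points the same way as image k
shuffled_text = [[0.0, 3.0], [2.0, 0.0]]   # pairs deliberately mismatched

aligned_loss = clip_loss(image_feats, aligned_text, identity, identity, t=2.0)
shuffled_loss = clip_loss(image_feats, shuffled_text, identity, identity, t=2.0)
```

With correctly paired inputs the diagonal of the logits matrix dominates and the loss is small; swapping the text rows moves the large similarities off the diagonal and the loss rises, which is the signal that pushes matched image-report pairs together during pre-training.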

Fine-tuning is a crucial step in adapting general LLMs to the medical domain. Due to the high costs and time required to train medical LLMs from scratch, researchers have developed various fine-tuning methods to imbue general LLMs with domain-specific medical knowledge. The main fine-tuning methods include Supervised Fine-Tuning (SFT), Instruction Fine-Tuning (IFT), and Parameter-Efficient Tuning (PET).

SFT is a pivotal technique in modern natural language processing (NLP), specifically in the realm of medical LLMs. By utilizing high-quality, labeled medical datasets including physician-patient dialogues, medical Q&A pairs, and knowledge graphs, SFT finely tunes pre-trained language models to meet the unique demands of the medical field [7]. This process involves precise adjustments and the generation of task-specific training samples that teach the model to produce contextually relevant medical information. Notably, during training, a mask vector is applied to focus the model's learning on the response sections of these samples, enhancing its grasp of medical terminology and improving both accuracy and adaptability in professional medical scenarios. SFT effectively transforms general language models into specialized tools capable of handling complex medical contexts and providing accurate clinical decision support. Despite its profound impact, SFT faces challenges related to data quality, model generalization, and ethical privacy issues. Addressing these concerns through ongoing research and optimization is essential to maximize the effectiveness, security, and transparency of these models in real-world applications, thereby setting the stage for future advancements in healthcare technology.
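The response-only masking mentioned above can be sketched as follows. The example assumes that per-token negative log-likelihoods have already been computed by the model; the function names and numeric values are hypothetical and serve only to show how the mask vector removes prompt tokens from the loss.

```python
def build_loss_mask(prompt_len, response_len):
    """Mask vector: 0 for prompt tokens (ignored), 1 for response tokens (learned)."""
    return [0] * prompt_len + [1] * response_len

def masked_sft_loss(token_nll, loss_mask):
    """Average per-token negative log-likelihood over response tokens only."""
    kept = [nll for nll, m in zip(token_nll, loss_mask) if m == 1]
    return sum(kept) / len(kept)

# Assumed per-token losses for a 3-token question followed by a 2-token answer.
token_nll = [2.0, 1.5, 3.0, 0.4, 0.6]
mask = build_loss_mask(prompt_len=3, response_len=2)
loss = masked_sft_loss(token_nll, mask)
```

Only the last two values contribute, so the gradient reflects how well the model produces the answer rather than how well it reproduces the question.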

Instruction Fine-Tuning (IFT) represents a specialized adaptation of SFT aimed at enhancing the performance and controllability of LLMs within the medical field. Unlike traditional SFT, IFT specifically focuses on improving a model's ability to comprehend and execute human instructions by employing training datasets composed of instruction-input-output triples, such as instruction-question-answer sequences. This approach not only boosts the model's proficiency in handling complex medical tasks but also ensures the precision and relevance of its outputs, vital for medical applications. IFT has led to the development of models like MedPaLM-2 [30], which demonstrate superior performance in medical question-answering and multi-turn dialogue scenarios. These models are instrumental in aiding medical professionals by accurately interpreting queries and producing responses aligned with specific instructions, thus improving patient education and self-management. Despite its benefits, IFT faces challenges related to the diversity and quality of instructional datasets, model adaptability, and generalization across varied medical contexts. Future research should focus on integrating IFT with other fine-tuning methods and updating models to keep pace with medical advancements, ensuring the continued relevance and accuracy of the information they provide.
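An instruction-input-output triple is typically flattened into a single prompt and a target string before training. The sketch below shows one possible layout; the section markers and the function name are assumptions chosen for illustration, not a format prescribed by any of the cited models.

```python
def format_ift_example(instruction, input_text, output_text):
    """Turn an instruction-input-output triple into a prompt and a training target."""
    prompt = (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Response:\n"
    )
    return {"prompt": prompt, "target": output_text}

# Illustrative triple written for this sketch, not drawn from a real dataset.
example = format_ift_example(
    instruction="Answer the medical question concisely.",
    input_text="Which organ produces insulin?",
    output_text="The pancreas.",
)
```

During training the model would be asked to continue the prompt with the target, so the instruction itself becomes part of what the model conditions on.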

Parameter-Efficient Tuning (PET) [31] techniques are increasingly vital in deploying LLMs like GPT-3 [23] within the medical field, particularly for handling complex text data and supporting clinical decisions. These methods focus on optimizing the performance of pre-trained models while minimizing resource usage by adjusting only a small subset of parameters. Techniques such as Low-Rank Adaptation (LoRA) [32], Adapters [33,34,35], and Prefix Tuning [32] substantially reduce computational demands and memory requirements. This is crucial for resource-limited medical institutions, allowing them to utilize advanced language models without incurring high costs. PET not only enhances resource efficiency but also provides the flexibility needed for diverse medical tasks, such as pathology report analysis and clinical trial document processing. By allowing rapid adaptation to specific tasks with minimal training, PET maintains task relevancy and improves accuracy, especially in handling medical terminologies and protocols. Additionally, PET supports model transparency and interpretability by preserving the original model structure, which is essential in medical applications and facilitates decision tracking and logical explanations of outputs. The ability of PET to rapidly adapt to changes in medical data and guidelines demonstrates its utility, as it reduces the time and resources required for model adjustments. Future research might explore the integration of various PET methods to optimize resource efficiency and performance, such as combining LoRA with Adapters for a balanced approach to feature preservation and task adaptability. Further studies could also focus on the automated selection of optimal PET techniques for specific medical tasks, enhancing the practical utility of medical LLMs.

Low-Rank Adaptation (LoRA) is a parameter-efficient tuning technique designed to adapt large pre-trained models to specific tasks without extensively retraining the entire network. This method is particularly advantageous in the medical field for customizing models like GPT-3 [23] for specialized tasks such as interpreting complex medical texts or generating clinical documentation.

LoRA optimizes pre-trained models with minimal computational resources by introducing smaller, trainable matrices A and B at each layer, which are used to modify the original weight matrix W by adding a low-rank update. The update is represented as:

\Delta W = A \cdot B    (9)

Here, A \in \mathbb{R}^{d \times r} and B \in \mathbb{R}^{r \times k} are the low-rank matrices, where r is the rank of the update (with r much smaller than d and k), and d and k are the dimensions of the weight matrix W \in \mathbb{R}^{d \times k}. During the model's forward pass, the input x is primarily transformed by W, but with the addition of the low-rank update, the transformed input W' x can be expressed as:

W' x = (W + \Delta W) \cdot x = W \cdot x + A \cdot (B \cdot x)    (10)

This adjustment allows for precise task-specific modifications while keeping the majority of parameters static, thereby drastically reducing the number of trainable parameters. The computational efficiency and reduced memory usage make LoRA an ideal solution for fine-tuning models in resource-constrained medical settings.
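The low-rank forward pass and the resulting parameter savings can be checked with a small numerical sketch. The matrices below are toy values chosen so the arithmetic is easy to follow, the helper names are hypothetical, and the 4096-dimensional layer size used in the parameter count is an assumed example rather than a figure from any particular model.

```python
def matvec(m, v):
    """Matrix-vector product with the matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(W, A, B, x):
    """Compute W x + A (B x) without forming the full d x k matrix A B."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(A, matvec(B, x))   # trainable low-rank path
    return [b + u for b, u in zip(base, update)]

def trainable_params(d, k, r):
    """Return (full fine-tuning count, LoRA count) for one d x k weight matrix."""
    return d * k, r * (d + k)

# Toy sizes: d = 3, k = 2, r = 1.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d x k, frozen
A = [[0.5], [0.0], [-0.5]]                 # d x r, trainable
B = [[2.0, 4.0]]                           # r x k, trainable
x = [1.0, 1.0]

y_lora = lora_forward(W, A, B, x)

# Merging the update into W gives the same output.
delta_w = matmul(A, B)
merged = [[w + dw for w, dw in zip(w_row, dw_row)] for w_row, dw_row in zip(W, delta_w)]
y_merged = matvec(merged, x)

full_count, lora_count = trainable_params(d=4096, k=4096, r=8)
```

For the assumed 4096 x 4096 layer with rank 8, the trainable parameter count drops from 16,777,216 to 65,536, a 256-fold reduction, while the merged and unmerged forward passes agree.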

In practical terms, LoRA enables the fine-tuning of models for specific medical tasks, such as analyzing diagnostic reports or creating tailored treatment plans, aligning outputs more closely with clinical needs and standards. This technique allows the adapted models to perform specialized medical functions without the need for extensive computational resources, making it highly valuable in clinical environments. Figure 2 shows a schematic diagram of full fine-tuning and LoRA fine-tuning.

Overall, pre-trained language models in the medical domain, through extensive training on specialized medical literature, have effectively mastered the complexities of medical language, providing a solid foundation for specific medical tasks such as question answering and entity recognition. These models also employ multimodal pre-training methods like CLIP to integrate text and image data, enhancing their capability in handling tasks such as clinical diagnosis. Additionally, through fine-tuning techniques like Supervised Fine-Tuning (SFT), Instruction Fine-Tuning (IFT), and Parameter-Efficient Tuning (such as LoRA), these models are precisely adjusted to meet the specific needs of the medical field, thus maintaining high performance while significantly enhancing their adaptability and accuracy in the medical industry. These advanced pre-training and fine-tuning strategies have greatly improved the role of medical language models in providing medical services, highlighting the significant impact of artificial intelligence in enhancing the quality of healthcare.

2.2. Prompting Methods

Prompting [36] methods represent a strategic and efficient approach to adapt general LLMs to specific domains like medicine [37], focusing on refining the prompting strategy rather than modifying the model itself. Techniques such as Zero/Few-shot prompting, Chain-of-Thought (CoT) prompting, and Self-consistency prompting, along with Prompt Tuning, are crucial in aligning these powerful computational models to the nuanced requirements of medical diagnostics and treatment planning. These methods enhance the models' capability to process and interpret complex medical data, leading to more accurate and reliable healthcare solutions.

Zero/Few-shot Prompting: in the medical field, the application of LLMs is increasingly pivotal, especially in addressing complex medical issues and data. Leveraging pre-trained models in conjunction with zero/few-shot prompting techniques, these models adeptly tackle new tasks without the need for extensive labeled datasets. Zero-shot prompting [38] equips models to infer and execute tasks based on just a task description, using meticulously crafted prompts to navigate new challenges and utilizing their broad pre-trained knowledge bases for predictions. Conversely, few-shot prompting [39] equips models with a select number of examples or task demonstrations, enhancing their learning before task execution. This method proves particularly beneficial for intricate tasks such as medical question answering (QA), where models enhance downstream performance by examining a handful of high-quality medical QA pairs. Although these techniques greatly improve the models’ proficiency in processing medical language, they also present challenges, including increased token usage for input, which can be restrictive for longer texts, and the potential for biases influenced by the selection of prompt examples. To optimize performance and reduce unintended biases, precise prompt engineering is essential. With continuous technological progress and innovations, LLMs are anticipated to play a more significant role in improving diagnostic accuracy, personalizing treatment plans, and comprehending complex medical reports. Overall, zero/few-shot prompting offers a flexible and effective strategy for deploying medical LLMs, enabling them to process and comprehend complex medical data and queries without extensive labeled datasets, and is set to markedly influence the future of medical AI.
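The difference between the two modes comes down to whether demonstrations are included in the prompt. A minimal sketch, assuming a simple question-and-answer layout and a hypothetical `build_prompt` helper, with demonstration pairs written for illustration only:

```python
def build_prompt(task, question, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [task]
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")     # worked demonstrations
    parts.append(f"Q: {question}\nA:")       # the query the model must complete
    return "\n\n".join(parts)

task = "Answer the medical question in one short sentence."
question = "Which vitamin deficiency causes scurvy?"

# Illustrative demonstrations, not taken from a benchmark.
demos = [
    ("Which organ produces insulin?", "The pancreas."),
    ("Which blood cells carry oxygen?", "Red blood cells."),
]

zero_shot = build_prompt(task, question)
few_shot = build_prompt(task, question, demos)
```

The few-shot prompt is necessarily longer than the zero-shot one, which is the token-usage cost noted above, and its behavior depends on which demonstrations are chosen.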

Chain-of-Thought (CoT) Prompting: in the medical field, the incorporation of sophisticated prompting techniques with LLMs significantly boosts their ability to address complex medical challenges. Chain-of-Thought (CoT) [40] prompting enhances the transparency and explainability of medical reasoning tasks by guiding models through structured, logical problem-solving steps, proving particularly beneficial in medical question-answering systems for diagnostics and treatment planning. Automatic Chain-of-Thought (Auto-CoT) [41] alleviates the labor-intensive process of crafting high-quality examples by autonomously generating reasoning chains, thereby increasing the robustness and accuracy of medical decision-making. Logical Chain-of-Thought (LogiCoT) [42] employs principles of symbolic logic to verify and refine each reasoning step, reducing logical errors and hallucinations in intricate medical diagnostics. Chain-of-Symbol (CoS) [43] prompting augments the model's capability to manage tasks involving complex spatial relationships by using succinct symbols rather than natural language, which is relevant for interpreting medical imagery or planning surgical procedures. Tree-of-Thoughts (ToT) [44] prompting employs a structured hierarchy of reasoning steps to facilitate systematic exploration of potential solutions, thereby enabling more comprehensive treatment options and strategies. The integration of these advanced prompting techniques in medical LLMs improves the precision of diagnostics and treatment recommendations, and it enhances the interpretability and transparency of their decision-making processes. This is set to advance the field of medical AI and thereby the quality and efficiency of clinical healthcare services.

Self-consistency Prompting: the integration of Self-consistency [40] Prompting into medical LLMs represents a significant step in enhancing diagnostic accuracy and consistency within healthcare. Building on the Chain-of-Thought (CoT) prompting foundation, this method enhances response reliability by generating multiple answers to the same question and selecting the most consistent result. This is particularly crucial in the medical sector, where precise and reproducible diagnoses and treatment recommendations are vital. Self-consistency Prompting refines CoT by sampling diverse reasoning chains from the model's decoder, which accommodates the complexity and variability of medical reasoning and reduces the risk of incorrect conclusions due to biased or incomplete analyses. By facilitating the synthesis and reconciliation of various reasoning paths, Self-consistency Prompting supports the development of sophisticated and reliable medical decision-support systems. This technique aligns with the broader goals of medical AI, promising safer, more effective, and more consistent patient care, and is poised to influence the future trajectory of AI applications in medicine by enhancing the precision and reliability of the insights these methods produce.
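The sample-then-vote procedure can be sketched as follows. The sampler here is a deterministic stand-in for repeated stochastic calls to an LLM, and its canned reasoning chains and answers are placeholders; the function names are hypothetical.

```python
from collections import Counter

def self_consistent_answer(sample_chain, question, n_samples=5):
    """Sample several reasoning chains and return the majority final answer."""
    answers = [sample_chain(question, i)[1] for i in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

def fake_sampler(question, seed):
    """Stand-in for an LLM call that returns (reasoning chain, final answer)."""
    canned = [
        ("chain 1 ... therefore A", "A"),
        ("chain 2 ... therefore B", "B"),
        ("chain 3 ... therefore A", "A"),
        ("chain 4 ... therefore A", "A"),
        ("chain 5 ... therefore B", "B"),
    ]
    return canned[seed % len(canned)]

answer, agreement = self_consistent_answer(fake_sampler, "placeholder question")
```

The agreement ratio returned alongside the answer can serve as a rough confidence signal: a low ratio suggests the reasoning paths disagree and the case may warrant human review.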

Prompting approaches in the context of large language models (LLMs) are methods used to guide these models in generating relevant and accurate responses to specific tasks. These techniques work by carefully crafting the input prompts—phrases or questions that instruct the model on what to do. Zero-shot prompting refers to asking the model to perform a task without providing any examples, relying solely on its pre-trained knowledge. Few-shot prompting, on the other hand, involves providing a few examples to help the model better understand the task before it generates a response. Chain-of-Thought (CoT) prompting encourages the model to break down a problem into smaller, logical steps, enhancing its reasoning abilities. Self-consistency prompting builds on CoT by generating multiple reasoning paths and selecting the most consistent one, ensuring more reliable outcomes. These techniques are designed to adapt the general capabilities of LLMs to the specific needs of domains like medicine, where accuracy, consistency, and transparency are critical. Figure 2 shows the five important components of Prompting.

The integration of these advanced prompting techniques has reshaped the application of medical LLMs, exemplified by systems like MedPaLM [10] and MedPaLM-2 [30], which leverage these methods to achieve performance comparable to, or even surpassing, that of human experts on medical QA datasets. The goal-oriented taxonomy in prompt engineering continues to refine these approaches, focusing on optimizing LLMs' performance by guiding them through structured, logical thinking processes. This not only showcases the broad impact of goal-oriented strategies in prompt engineering but also opens new avenues for further advancements, setting a promising direction for the future of AI in medicine. These developments promise to enhance the precision, efficiency, and reliability of medical care, aligning with broader healthcare goals and influencing the trajectory of AI applications in the medical domain.

2.3. Retrieval-Augmented Generation

The application of LLMs in the medical field has been somewhat limited due to challenges such as hallucinations, difficulties in updating knowledge, untraceable reasoning processes, and high costs. Accuracy is crucial in the medical domain, so LLMs often require validation by doctors with specialized knowledge. Particularly with Zero-shot and Few-shot samples, LLMs may produce factual errors that could lead to incorrect diagnoses or treatments for individuals without medical expertise. To address these issues, Retrieval-Augmented Generation (RAG) [45] approaches have emerged as a key solution. RAG reduces modeling pain points and improves the trustworthiness of model outputs by integrating knowledge from external databases [8]. It retrieves content relevant to the problem and references it in the output, making the reasoning process more transparent and traceable. For example, generating answers by retrieving specialized medical guidelines and referencing the relevant parts can make the model output more credible and help users locate the corresponding guidelines. Especially for knowledge-intensive tasks, the use of specialized external databases can significantly improve the accuracy of model output. The application of RAG technology in various fields of clinical or medical sciences has great potential. Combining general-purpose LLMs with semantic understanding capabilities and a specialized medical database may have no less potential than a model fine-tuned with specialized medical data.

Three main paradigms exist for RAG: Naive RAG [46], Advanced RAG [47,48], and Modular RAG [10,49,50]. Naive RAG follows a conventional indexing, retrieval, and generation process but suffers from low retrieval precision and hallucinated responses. Advanced RAG improves retrieval accuracy and post-retrieval processing through techniques such as fine-tuned and dynamic embedding models, re-ranking, and prompt compression. Modular RAG extends the adaptability of RAG by incorporating additional functional modules, such as search and memory modules, and optimization techniques such as hybrid search and recursive retrieval.
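
As an illustration of the re-ranking and hybrid search ideas mentioned above, the following minimal sketch fuses the ranked lists of two retrievers with reciprocal rank fusion. This is a commonly used generic technique chosen here for illustration, not necessarily the method used in the cited works; the document ids, the two example rankings, and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). k = 60 is a commonly used smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a keyword retriever and an embedding retriever.
lexical_ranking = ["guideline-A", "trial-B", "review-C"]
dense_ranking = ["trial-B", "review-C", "guideline-A"]

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused)
```

A document that ranks reasonably well in both lists can overtake one that is first in only a single list, which is the behavior a hybrid retriever is meant to provide.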

In practice, RAG couples an LLM with external knowledge sources, such as medical databases containing literature, guidelines, and clinical studies. The pipeline starts by deriving a query from the user input, such as a medical question or symptom description, and uses a search engine to retrieve relevant documents. Documents and queries are represented as sparse term-weight vectors (for example, TF-IDF), scored with lexical ranking functions such as BM25, or encoded as dense neural embeddings, and candidates are selected by a relevance metric such as cosine similarity. The retrieved content is then fed into the LLM to guide generation, so that the output is grounded in the retrieved evidence and is more likely to be contextually relevant and factually accurate. As outlined above, architectures range from Naive RAG, which performs basic indexing and retrieval, to Advanced RAG, which improves retrieval accuracy with dynamic embeddings, re-ranking, and input refinement, and Modular RAG, which adapts to complex queries with customizable retrieval and generation components. In medicine, RAG supports critical tasks such as interpreting symptoms and formulating treatments by drawing on current and precise medical data, which improves safety and reliability. Despite these benefits, RAG faces challenges such as integrating continuously updated data sources and managing computational demands, and ongoing research focuses on improving its scalability and efficiency for clinical use.
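
The retrieval pipeline described above can be sketched in a few lines. This is a minimal Naive RAG illustration under stated assumptions: the three-sentence corpus, the document ids, and the prompt wording are hypothetical, TF-IDF with cosine similarity stands in for whichever retriever a real system uses, and the final call to an LLM is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for a medical guideline database.
DOCS = {
    "guideline-htn": "First-line treatment of hypertension includes thiazide diuretics and ACE inhibitors.",
    "guideline-dm2": "Metformin is the first-line drug for type 2 diabetes unless contraindicated.",
    "guideline-asthma": "Inhaled corticosteroids are the preferred controller therapy for persistent asthma.",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing step: compute a TF-IDF vector (with smoothed IDF) per document."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: count * idf[t] for t, count in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return idf, vectors


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, idf, vectors, k=1):
    """Retrieval step: rank documents by cosine similarity to the query."""
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    ranked = sorted(vectors, key=lambda d: cosine(q_vec, vectors[d]), reverse=True)
    return ranked[:k]


def build_prompt(query, doc_ids, docs):
    """Generation step (input side): attach sources with ids so the answer can cite them."""
    context = "\n".join(f"[{d}] {docs[d]}" for d in doc_ids)
    return f"Answer using only the sources below and cite their ids.\n{context}\nQuestion: {query}"


question = "What is the first-line drug for type 2 diabetes?"
idf, vectors = build_index(DOCS)
top = retrieve(question, idf, vectors)
prompt = build_prompt(question, top, DOCS)
print(prompt)
```

Keeping the source id next to each retrieved passage is what makes the answer traceable: the model can be asked to cite the id, and the user can look up the corresponding guideline.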

Several studies have demonstrated the practical value of RAG in the medical field. A systematic benchmark of different RAG systems identified best practices for various medical tasks and showed significant accuracy improvements over conventional LLM prompting methods [8]. Medical Graph RAG [51] introduces a graph-based RAG framework that builds hierarchical graphs from medical knowledge sources, improving the precision and trustworthiness of generated responses. TC-RAG [52] integrates system state variables and a memory stack to make knowledge retrieval more controlled and accurate, with a reported accuracy improvement of 7.20% in medical applications. Lastly, Bailicai [53] is a domain-specific RAG framework tailored for medical applications that reduces hallucinations and enhances LLM performance on medical benchmarks, surpassing even some proprietary models.

In conclusion, the evolution from Naive RAG to Advanced RAG and Modular RAG demonstrates continuous progress in addressing the challenges faced by LLMs in the medical domain. These RAG models enhance the precision, credibility, and applicability of LLM-generated responses by combining the model's inherent knowledge with extensive, dynamic external knowledge repositories, paving the way for more effective and nuanced AI applications in healthcare.

2.4. Web Search

In medical language modeling, integrating web search technology can significantly enhance the accuracy and timeliness of models. Although combining medical LLMs with web search is not yet common practice, the internet is generally the fastest channel for information updates. Compared with continuously maintained static databases, internet-backed data has clear advantages in update speed and content breadth. However, this integration demands stronger internet retrieval and data filtering capabilities. Selecting and refining data from a multitude of web pages is especially critical in the medical field, which requires highly accurate and up-to-date information.

The introduction of interactive web search, as demonstrated in developments such as WebCPM [54], WebGLM [55], and WebGPT [56], marks a significant advancement in this area. WebCPM [54] introduces interactive web search in the Chinese Long-form Question Answering (LFQA) domain, enabling dynamic information retrieval akin to human web search behavior. WebGLM [55], a web-enhanced question-answering system, augments pre-trained language models with web search capabilities, focusing on efficiency and cost-effectiveness. WebGPT [56], on the other hand, fine-tunes GPT-3 [23] for long-form question answering using a text-based web browsing environment, showcasing the ability to navigate the web and collect references to support answers.

Unlike conventional non-interactive retrieval methods, interactive web search allows for a more dynamic and iterative process. Users can decompose complex questions into sub-questions, refine their searches based on the information gathered, and ask follow-up questions. This mimics the cognitive process of human problem-solving and ensures access to a broader range of information, thereby improving the interpretability and relevance of the model's outputs.
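
The iterative process described above can be sketched as a simple loop. This is a minimal illustration, not the procedure of WebCPM, WebGLM, or WebGPT: the `decompose` and `search` callables, the toy in-memory "web", the placeholder URLs, and the `follow_ups` field are all hypothetical stand-ins for an LLM-driven planner and a real search engine.

```python
def interactive_search(question, decompose, search, max_rounds=3):
    """Answer sub-questions one at a time, collecting evidence with its source."""
    pending = list(decompose(question))
    evidence = []
    seen = set()
    rounds = 0
    while pending and rounds < max_rounds:
        sub_question = pending.pop(0)
        if sub_question in seen:
            continue  # skip queries that were already issued
        seen.add(sub_question)
        rounds += 1
        for hit in search(sub_question):
            evidence.append(
                {"query": sub_question, "url": hit["url"], "snippet": hit["snippet"]}
            )
            # A hit may suggest a follow-up query, mimicking search refinement.
            pending.extend(hit.get("follow_ups", []))
    return evidence


# Toy stand-in for the web; contents are placeholders, not medical facts.
TOY_WEB = {
    "drug X dosage": [
        {
            "url": "https://example.org/x-dose",
            "snippet": "Placeholder snippet about dosage.",
            "follow_ups": ["drug X renal adjustment"],
        }
    ],
    "drug X interactions": [
        {"url": "https://example.org/x-int", "snippet": "Placeholder snippet about interactions."}
    ],
    "drug X renal adjustment": [
        {"url": "https://example.org/x-renal", "snippet": "Placeholder snippet about renal adjustment."}
    ],
}


def toy_decompose(question):
    return ["drug X dosage", "drug X interactions"]


def toy_search(query):
    return TOY_WEB.get(query, [])


evidence = interactive_search("How should drug X be dosed safely?", toy_decompose, toy_search)
for item in evidence:
    print(item["query"], "->", item["url"])
```

The `max_rounds` cap bounds the number of issued queries, which matters when each round triggers a real search request and an LLM call.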

In the context of medical large language modeling, the integration of interactive web search can substantially enhance the model's ability to provide accurate and contextually relevant responses. For example, when faced with a complex medical query, the model can engage in real-time interaction with a search engine to gather the latest medical guidelines, research findings, or case studies. This not only ensures that the model's knowledge base is continually updated but also allows for a more nuanced understanding of the query.

The Med-Gemini models [57] demonstrate the potential of specialized LLMs in the medical domain. With their advanced multimodal and long-context reasoning capabilities, they effectively integrate web search to access up-to-date medical knowledge. This integration enables Med-Gemini to surpass GPT-4 on several medical benchmarks, including MedQA (USMLE), and even to outperform human experts in tasks such as medical text summarization, underscoring its broad applicability in healthcare. A study on autonomous AI agents [58] presents a system that uses generalist LLMs to autonomously coordinate specialized medical AI tools, including web-based information retrieval. This approach excels in interpreting diverse medical data and delivering accurate, tailored recommendations by dynamically incorporating the latest medical guidelines and research findings. The system's success across oncology scenarios demonstrates the potential of LLMs as patient-specific clinical assistants and further supports the effectiveness of web search integration in AI-driven medical decision-making. Figure 2 illustrates the general process of RAG and web search.

In conclusion, the combination of web search technology and large medical language modeling holds significant potential for improving the accuracy and reliability of medical LFQA systems. The interactive nature of web search, coupled with the advanced capabilities of RAG models, can lead to the development of more sophisticated and effective medical language models. The LLMs based on these methods and technologies can serve as valuable tools for healthcare professionals, researchers, and patients alike, providing access to timely and accurate medical information.

3. LLMs Applications in Medicine

Existing LLMs, such as ChatGPT [59,60,61,62,63,64], Bard [65,66], and LLaMA [67,68], perform very well on various general-domain NLP tasks, but their application in medicine has some limitations [4,10,69,70,71,72]. Examples include limited expression and interaction ability, incorrect medical information generated due to a lack of medical expertise, bias, difficulty in integrating a large amount of medical knowledge, and potential ethical and privacy issues. A summary of medical LLMs can help medical workers better understand cutting-edge AI technologies and promote the application of LLMs in the medical field. Table 1 lists the common LLMs currently used in the medical field.

3.1. Application of LLMs in Medical Text

Medical text conveys information through written language, charts, data, and other forms, and is a very important resource in the medical field. It helps doctors and patients better understand the content of diagnosis and treatment and facilitates both. With the development of information technology, the digitization and intelligent processing of medical texts have attracted increasing attention. Many LLMs have emerged for processing and extracting information from medical text, improving the efficiency and accuracy of information retrieval and promoting the dissemination and application of medical knowledge.

Electronic health records (EHRs) [73,74] contain a large amount of patient information and medical data and are an important resource in healthcare, but the data are often unstructured and difficult for doctors to analyze and extract directly. To address this, researchers proposed the GatorTron model [75], which is trained on different datasets to extract clinical concepts and discern semantic similarities, and which is optimized for the language features of clinical texts. This makes it easier for medical staff to access and analyze patient information and supports clinical decision-making, which in turn facilitates faster and more precise treatment planning and may enhance diagnostic accuracy.

MedGPT [5], built on the Transformer architecture [76,77] and described as the first medical large language model in China, aims to deliver practical value in real medical settings and to provide intelligent support across the whole process of disease prevention, diagnosis, treatment, and rehabilitation. The model is trained on valid patient interview and medical examination data from electronic health records (EHRs) to make accurate disease diagnoses and design treatment plans for patients. Patients can have drugs delivered to their homes through an Internet hospital, after which MedGPT proactively provides medication guidance and management, intelligent follow-up, rehabilitation guidance, and other intelligent diagnosis and treatment services.

International Classification of Diseases (ICD) codes are very important in medicine, but manually assigning them is a complex and time-consuming task that requires a trained professional. LLMs can generate ICD codes automatically by analyzing patient reports, thus reducing the burden on doctors. One study proposed LLM-ICD [78], which uses LLMs to automatically generate ICD codes for retinal diseases. Patient information is fed into ChatGPT, which is prompted to generate the corresponding ICD codes, and the generated codes are compared with those assigned by experts to assess accuracy. However, the model carries a risk of coding errors and is not currently suitable for clinical practice.
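
The comparison between generated and expert-assigned codes can be illustrated with a simple exact-match accuracy. This is a minimal sketch and not the evaluation protocol of the cited study: the function name, the normalization rule, and the example code strings are assumptions made for illustration.

```python
def normalize_code(code):
    """Ignore case, surrounding whitespace, and the dot separator."""
    return code.strip().upper().replace(".", "")


def icd_exact_match_accuracy(predicted, expert):
    """Fraction of reports whose generated code equals the expert-assigned code."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert lists must have the same length")
    if not expert:
        return 0.0
    matches = sum(
        normalize_code(p) == normalize_code(e) for p, e in zip(predicted, expert)
    )
    return matches / len(expert)


# Hypothetical model outputs and expert labels for four reports.
predicted_codes = ["H35.30", "h34.1", "E11.319", "H33.0"]
expert_codes = ["H35.30", "H34.1", "E11.311", "H33.0"]

accuracy = icd_exact_match_accuracy(predicted_codes, expert_codes)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Exact match is deliberately strict: a code that differs only in its last character still counts as an error, which matches the concern about coding errors raised above.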

One study fine-tuned LLaMA on a dataset of 100,000 doctor-patient conversations from HealthCareMagic, an online medical advice platform, to build ChatDoctor [6], a model with an autonomous "knowledge brain". Through fine-tuning and the knowledge-brain strategy, an autonomous information retrieval mechanism is added to the model, improving its ability to understand patient inquiries and provide accurate recommendations. With this mechanism, the model can access online resources and offline medical databases in real time and answer questions about recent medical terms and diseases. In the reported comparison, ChatDoctor outperformed ChatGPT in answers about new medical terms, drug information, disease diagnosis, and treatment. However, the model is still at the research stage and is currently used only for academic research; it may generate wrong answers if used in clinical practice. Further development is therefore needed to reduce errors and hallucinations, improve the accuracy and efficiency of the model for medical diagnosis, and reduce the workload of medical staff.

Researchers have developed an open medical LLM, Clinical Camel [79], by fine-tuning the LLaMA-2 model. The model uses a new method, Dialogue-Based Knowledge Encoding (DBKE), to transform dense medical literature into synthetic dialogues, enhancing its dialogue generation capabilities. It can handle English clinical texts, including clinical notes, medical literature, and doctor-patient conversations. However, the model is still in the research phase and cannot yet be applied to clinical practice.

LLMs have made remarkable progress in understanding and responding to human instructions, but they usually perform best in general English-language settings and are not specifically trained for the medical field. As a result, their accuracy in auxiliary diagnosis and drug recommendation is insufficient. To address these challenges, researchers collected Chinese medical dialogue databases and fine-tuned ChatGLM [80] to build the DoctorGLM model [7], which mainly targets Chinese Q&A and dialogue. However, the model is at an early stage and may produce wrong answers, so it is not yet suitable for clinical use.

Given the huge demand for LLMs in the medical field and the poor performance of existing models there, especially in Chinese, researchers proposed the HuatuoGPT model [81], which is designed specifically for medical consultation. The model combines the strengths of ChatGPT-generated data and doctors' real-world data while mitigating their weaknesses, allowing it to reason about diagnoses like a doctor, draw on medical knowledge, and provide accurate information to patients. Its answers are fluent, informative, professional, and interactive. Notably, the model has not yet been put into clinical use and needs further research and development before practical application.

To better serve Chinese users, researchers developed the MedicalGPT-zh model [82] (later renamed MING), which is based on the Transformer architecture. The model is pre-trained on a large amount of Chinese medical dialogue data to learn the context, terminology, and communication patterns of medical dialogue. It can understand and answer medical consultation questions, handle complex Chinese medical conversations, and be applied to scenarios such as online medical consultation, patient education, and health guidance. The study points out that MING is expected to improve the efficiency and accessibility of medical services and to help NLP technology play a greater role in Chinese medical dialogue.

The ChatMed [83] series of Chinese medical LLMs, based on the LLaMA model, improves the performance of Chinese medical dialogue systems by integrating medical knowledge to better answer medical consultation questions. The series includes ChatMed-Consult and ShenNong-TCM-LLM, among others. ChatMed-Consult mainly targets online consultation: it identifies users' health consultation needs through dialogue and provides corresponding information and suggestions. ShenNong-TCM-LLM focuses on Traditional Chinese Medicine (TCM), assisting with TCM diagnosis and treatment recommendations as well as TCM education and popularization. Another study constructed the HuaTuo model [84] (later renamed BenTsao), also based on LLaMA, by integrating structured and unstructured medical knowledge from CMeKG and fine-tuning the model on knowledge-based instruction data. This model is mainly used in medical research and cannot currently provide medical advice. As a traditional medical system, TCM has rich theoretical knowledge and practical experience, and integrating TCM knowledge into LLMs can improve the accessibility and efficiency of TCM services. Against this background, researchers constructed the Zhongjing model [85]. Based on LLaMA, the model is trained with expert feedback and real-world multi-round dialogues to enhance its ability to use specialized terms and understand concepts in TCM. In this way, Zhongjing can provide patients with more accurate and personalized TCM consultation and treatment recommendations, with the aim of promoting the widespread dissemination and application of TCM services.

GLM-130B [86], a bilingual (English and Chinese) pre-trained large language model jointly developed by Tsinghua University and Zhipu.AI, supports question answering in the medical vertical, including intelligent Q&A on medical and health questions. Auxiliary diagnosis and treatment functions have also been developed on top of it, such as generating TCM prescriptions from symptoms and providing medical explanations for prescriptions.

LLMs have made great progress in medical Q&A, but most of them perform well mainly on multiple-choice questions and still fall short of clinicians in answering open-ended questions. Researchers proposed an improved model, Med-PaLM 2 [30], based on PaLM 2 [87,88], which improves medical reasoning through domain-specific fine-tuning and prompting strategies. Compared with clinicians' answers to patients' open-ended questions, the model's answers addressed patients' problems better. However, the model still has limitations: its output is not yet well aligned with the high-quality medical answers patients expect, and its evaluation needs to cover more dimensions, such as whether the answer reflects humanistic care. At present, LLMs can provide a wide range of health advice in single-round conversations, but their ability to ask questions over multiple rounds is insufficient. Researchers proposed the BianQue [89] model based on ChatGLM, which is trained on multi-round health conversation data to improve the active questioning ability of LLMs. However, the model is limited to academic research and cannot be used in practical clinical applications.

3.2. Application of LLMs in Medical Images

Medical images are an indispensable part of the clinical diagnosis and treatment process. They provide doctors with visual evidence of the disease and can help doctors make more accurate judgments. With the development of information technology, the types and applications of medical images are also expanding, including but not limited to X-rays, computed tomography (CT), magnetic resonance imaging (MRI), ultrasound imaging, and pathological tissue images.

The amount of medical image data is enormous, and existing LLMs face certain difficulties in processing images. One study proposed the CSCA U-Net model [90], which introduces a channel and spatial compound attention mechanism. It better captures channel dependencies and spatial relationships in the image, so that it can identify and segment regions of interest in medical images more accurately and help physicians propose treatment plans suitable for patients. Other researchers proposed the RETFound model [91], based on self-supervised learning (SSL). By pre-training on retinal images and then fine-tuning on specific tasks, the model supports the diagnosis and prognosis of eye diseases and the prediction of complex systemic diseases (such as heart failure and myocardial infarction). However, the model currently uses relatively few datasets, and more data are expected to be introduced in the future to improve its performance.

Pathological assessment is the gold standard for the diagnosis of many diseases, especially cancer, which mainly relies on doctors' analysis of hematoxylin-eosin (HE) staining and immunohistochemistry (IHC) staining images. A study proposed a new self-supervised learning framework, PathoDuet [92], which enhances the model's understanding of pathological images through cross-scale localization and cross-stain transfer. Studies have shown that the PathoDuet model is effective in most tasks and is expected to be applied in clinical practice in the future. Other LLMs related to medical images are more often combined with text information for output, mostly in the form of multimodal models, which will be introduced in detail later.

3.3. Application of Multimodal Model in Medicine

Most previous medical models were developed for a single task, such as characterizing a particular disease or image type, whereas medical data are usually multimodal, including images, texts, and other data. These different types of data usually require different processing and analysis methods, so the clinical application of LLMs needs to focus more on the construction of multimodal models [93]. Multimodal models can better understand various forms of medical content, provide better services for doctors and patients, and improve the accuracy of diagnosis and treatment. To meet these requirements, researchers proposed OpenMEDLab [94], an open-source multimodal foundation model platform that uses recent deep learning techniques and algorithms so that its models can better serve medical research and clinical applications. Another study proposed a new paradigm of medical artificial intelligence (AI), called generalist medical AI (GMAI) [95]. A GMAI model can perform a variety of tasks with little or no task-specific labeled data. Through self-supervised learning on large and diverse datasets, the model can flexibly accept different medical inputs, including images, electronic health records, laboratory results, charts, and texts, and produce output in accurate medical language. However, because it needs a large amount of training data, the cost is high, and it may produce complex output whose correctness is difficult for doctors and patients to verify.

Med-MLLM [96] is a type of multimodal medical model that supports various types of medical data. It can be applied to medical texts and reports in different languages, including Chinese, English, and Spanish. The model is capable of learning extensive medical knowledge from unlabeled or scarce data, such as images and texts. This enables the model to respond quickly in the event of future pandemics or rare diseases. However, due to the rapid development of medicine, the model needs to be continuously updated and trained as new diseases emerge and new data become available to ensure its accuracy.

Based on LLaMA-7B, some researchers have combined NLP and computer vision (CV) technology to build an open-source and parameter-efficient biomedical model Visual Med-Alpaca [96]. The model can process and understand visual information and generate biomedical-related text and image content, but the model data is limited to English diagnostic reports. A study proposed LLaVA-Med [97], a visual-language dialogue assistant capable of answering open-ended research questions about biomedical images. The model learns medical vocabulary by sampling biomedical image-text pairs from datasets, thereby generating dialogue outputs. However, the model still has some limitations, such as potential misinformation or insufficient reasoning.

One study proposed a framework called ChatCAD [98] to integrate LLMs into medical image computer-aided diagnosis (CAD) networks. The framework first feeds the medical image into the CAD model, then converts the CAD output into natural language text and passes it to the LLM, which ultimately generates a diagnostic report. However, the generated reports can read as impersonal, and report generation relies on specific prompts. To promote the research and development of Chinese-language multimodal medical models, another study proposed the XrayGLM model [99], which constructs a Chinese X-ray diagnostic report dataset using ChatGPT and public chest X-ray images and texts. It is described as the first large Chinese multimodal medical model that can read chest X-rays. The XrayGPT model [100], proposed by another study, can interpret chest X-rays conversationally and answer related questions, providing a new possibility for automated X-ray analysis.
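
The step of converting CAD output into text for an LLM can be illustrated as follows. This is a minimal sketch of the general idea, not ChatCAD's actual implementation: the finding names, the scores, the 0.5 threshold, and the prompt wording are hypothetical, and the call to the LLM itself is omitted.

```python
def cad_output_to_text(probabilities, threshold=0.5):
    """Turn per-finding scores from a CAD classifier into plain sentences."""
    sentences = []
    # List the highest-scoring findings first.
    for finding, prob in sorted(probabilities.items(), key=lambda item: -item[1]):
        status = "likely present" if prob >= threshold else "unlikely"
        sentences.append(f"{finding}: {status} (score {prob:.2f}).")
    return " ".join(sentences)


def build_report_prompt(cad_text):
    """Wrap the textual findings in an instruction for a report-writing LLM."""
    return (
        "You are drafting a radiology report. Based on the following "
        "computer-aided diagnosis findings, write a concise report.\n"
        f"Findings: {cad_text}"
    )


# Hypothetical classifier output for one chest X-ray.
cad_scores = {"cardiomegaly": 0.82, "pleural effusion": 0.12, "edema": 0.55}

cad_text = cad_output_to_text(cad_scores)
prompt = build_report_prompt(cad_text)
print(prompt)
```

Because the LLM only sees the text produced by `cad_output_to_text`, the wording of this bridge largely determines the style of the final report, which is consistent with the prompt dependence noted above.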

4. Discussion

Although LLMs can assist in medical diagnosis and help doctors solve clinical problems, they still face many challenges in clinical practice scenarios. These challenges include complex data processing, possible hallucinations of models, and potential ethical and privacy issues. The following text will discuss the specific problems in the clinical application of LLMs.

4.1. Complex Data Processing

The medical field has huge data resources, including professional medical books, clinical guidelines, and medical records [101]. These data record in detail the characteristics of diseases, the basis for diagnosis, treatment plans, and individual patient information, such as medical history, examination results, and imaging reports. They comprise both structured and unstructured data, and LLMs need to integrate and process them effectively for more accurate model training. However, when the amount of training data is too large, training costs increase, and the model may produce overly complex results because of excessive input information, which increases the difficulty of clinical application and may reduce the efficiency of decision-making. Therefore, optimizing the training dataset and improving data quality are key steps toward improving the practicality of the model.

4.2. Hallucination and Accuracy

The outputs of LLMs are not always stable and may contain inaccurate or untrue content (hallucination). One reason may be that the data on which the model was trained are not recent enough [102]. Because medicine advances rapidly, outdated training data make it difficult to ensure the stability and timeliness of model output, which leads to hallucinations. Another reason is that the training data may not have been verified for accuracy. To better understand patients' needs and provide accurate answers, it is necessary to train the model in depth, build updated datasets, and use real-world doctor-patient interactions as training material to reduce the risk of misinformation in model output [6].

4.3. Ethics and Privacy

When applying LLMs in the medical field, close attention must be paid to ethics and the protection of personal privacy [102]. The development of medical LLMs requires extensive research to ensure the safety, effectiveness, reliability, and accuracy of their output. Because model training requires a large amount of data that may include patients' private information, the processing of these data should strictly comply with privacy protection principles. In addition, because of differences in training datasets, models may generate recommendations that are biased with respect to race, region, gender, and other characteristics, resulting in unfair outcomes [11]. Therefore, the use of LLMs in different clinical environments requires professional evaluation, standardization, and supervision to reduce over-dependence on them.

5. Conclusions

Medical LLMs are based on deep learning and are adapted through pre-training, fine-tuning, prompting strategies, Retrieval-Augmented Generation, and web search (see Figure 1). Trained on continually collected medical data, they strengthen their ability to understand and process medical texts, images, and other types of data, so as to assist doctors in disease diagnosis and provide individualized diagnosis and treatment plans for patients. In this review, we introduce the methods of medical LLMs in detail and summarize their wide applications in the medical field, such as question-answering models based on medical text, medical image processing and analysis models, and multimodal models integrating various types of data. In addition, we explore the challenges faced by medical LLMs, including the processing of complex data, possible hallucinations, and potential ethical and privacy concerns. The application of medical LLMs is still in its early stages and faces many challenges. Future research and application of medical LLMs should pay attention to the following points: (1) Datasets should be strictly clinically verified to ensure the accuracy of the results. (2) Given the complexity of diagnosing and treating diseases, the model should serve only as an aid to doctors' decision-making, and its output should not be the sole basis for medical advice. (3) Efforts should be made to develop more specialized models for medical subfields (such as rehabilitation medicine and sports medicine). With the development of foundation models, medical LLMs are expected to play a more significant role in clinical practice, assisting doctors in diagnosis and treatment, alleviating the pressure on medical resources, and making it easier for patients to seek medical treatment. This progress depends on close cooperation between AI researchers and clinicians to overcome the challenges in applying these models, integrate them more accurately into the medical field, and promote the innovative development of medical LLMs.

Author contributions

Yu Sun: Conceptualization, Data curation, Visualization, Methodology, Writing - original draft, Writing - reviewing and editing. Guiyan Liu: Conceptualization, Data curation, Validation, Methodology, Writing - original draft, Writing - reviewing and editing. Qifeng Bai: Conceptualization, Validation, Data curation, Resources, Methodology, Writing - original draft, Supervision, Writing - reviewing and editing, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Data Availability

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no competing interests.

References

Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H., Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthcare 2021, 3 (1), Article 2.
Huang, K.; Altosaar, J.; Ranganath, R., ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. ArXiv 2019, abs/1904.05342. https://ui.adsabs.harvard.edu/abs/2019arXiv190405342H (accessed April 01, 2019).
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S., GPT-4 Technical Report. ArXiv 2023, abs/2303.08774.
Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S. S.; Wei, J.; Chung, H. W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; Payne, P.; Seneviratne, M.; Gamble, P.; Kelly, C.; Babiker, A.; Schärli, N.; Chowdhery, A.; Mansfield, P.; Demner-Fushman, D.; Agüera y Arcas, B.; Webster, D.; Corrado, G. S.; Matias, Y.; Chou, K.; Gottweis, J.; Tomasev, N.; Liu, Y.; Rajkomar, A.; Barral, J.; Semturs, C.; Karthikesalingam, A.; Natarajan, V., Large language models encode clinical knowledge. Nature 2023, 620 (7972), 172-180.
Kraljevic, Z.; Shek, A.; Bean, D. M.; Bendayan, R.; Teo, J. T. H.; Dobson, R. J. B., MedGPT: Medical Concept Prediction from Clinical Narratives. ArXiv 2021, abs/2107.03134.
Li, Y.; Li, Z.; Zhang, K.; Dan, R.; Jiang, S.; Zhang, Y., ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 2023, 15 (6), e40895.
Xiong, H.; Wang, S.; Zhu, Y.; Zhao, Z.; Liu, Y.; Huang, L.; Wang, Q.; Shen, D., DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task. ArXiv 2023, abs/2304.01097.
Xiong, G.; Jin, Q.; Lu, Z.; Zhang, A., Benchmarking Retrieval-Augmented Generation for Medicine. ArXiv 2024, abs/2402.13178.
Zhou, H.; Liu, F.; Gu, B.; Zou, X.; Huang, J.; Wu, J.; Li, Y.; Chen, S. S.; Zhou, P.; Liu, J., A survey of large language models in medicine: Principles, applications, and challenges. arXiv preprint arXiv:2311.05112 2023.
Zhou, H.; Gu, B.; Zou, X.; Li, Y.; Chen, S. S.; Zhou, P.; Liu, J.; Hua, Y.; Mao, C.; Wu, X.; Li, Z.; Liu, F., A Survey of Large Language Models in Medicine: Progress, Application, and Challenge. ArXiv 2023, abs/2311.05112.
He, K.; Mao, R.; Lin, Q.; Ruan, Y.; Lan, X.; Feng, M.; Cambria, E., A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics. ArXiv 2023, abs/2310.05694.
Peng, Y.; Yan, S.; Lu, Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets 2019, p. arXiv:1906.05474. https://ui.adsabs.harvard.edu/abs/2019arXiv190605474P (accessed June 01, 2019).
Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C. H.; Kang, J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining 2019, p. arXiv:1901.08746. https://ui.adsabs.harvard.edu/abs/2019arXiv190108746L (accessed January 01, 2019).
Beltagy, I.; Lo, K.; Cohan, A., SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 2019.
Li, Y.; Rao, S.; Solares, J. R. A.; Hassaine, A.; Ramakrishnan, R.; Canoy, D.; Zhu, Y.; Rahimi, K.; Salimi-Khorshidi, G., BEHRT: Transformer for Electronic Health Records. Scientific Reports 2020, 10, 7155. [CrossRef]
Michalopoulos, G.; Wang, Y.; Kaka, H.; Chen, H.; Wong, A. UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus 2020, p. arXiv:2010.10391. https://ui.adsabs.harvard.edu/abs/2020arXiv201010391M (accessed October 01, 2020).
Yang, X.; Chen, A.; PourNejatian, N.; Shin, H. C.; E Smith, K.; Parisien, C.; Compas, C.; Martin, C.; Flores, M. G.; Zhang, Y.; Magoc, T.; Harle, C. A.; Lipori, G.; Mitchell, D. A.; Hogan, W. R.; Shenkman, E. A.; Bian, J.; Wu, Y. GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records 2022, p. arXiv:2203.03540. https://ui.adsabs.harvard.edu/abs/2022arXiv220303540Y (accessed February 01, 2022).
Chen, Z.; Hernández Cano, A.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; Sallinen, A.; Sakhaeirad, A.; Swamy, V.; Krawczuk, I.; Bayazit, D.; Marmet, A.; Montariol, S.; Hartley, M.-A.; Jaggi, M.; Bosselut, A. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models 2023, p. arXiv:2311.16079. https://ui.adsabs.harvard.edu/abs/2023arXiv231116079C (accessed November 01, 2023).
Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W. W.; Lu, X., Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 2019.
Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R. G., MIMIC-III, a freely accessible critical care database. Scientific data 2016, 3 (1), 1-9. [CrossRef]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 2018, p. arXiv:1810.04805. https://ui.adsabs.harvard.edu/abs/2018arXiv181004805D (accessed October 01, 2018).
Joshi, M.; Chen, D.; Liu, Y.; Weld, D. S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans 2019, p. arXiv:1907.10529. https://ui.adsabs.harvard.edu/abs/2019arXiv190710529J (accessed July 01, 2019).
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A., Language models are few-shot learners. Advances in neural information processing systems 2020, 33, 1877-1901.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. In Language Models are Unsupervised Multitask Learners, 2019.
Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; Wong, C.; Tupini, A.; Wang, Y.; Mazzola, M.; Shukla, S.; Liden, L.; Gao, J.; Lungren, M. P.; Naumann, T.; Wang, S.; Poon, H. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs 2023, p. arXiv:2303.00915. https://ui.adsabs.harvard.edu/abs/2023arXiv230300915Z (accessed March 01, 2023).
Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. MedCLIP: Contrastive Learning from Unpaired Medical Images and Text 2022, p. arXiv:2210.10163. https://ui.adsabs.harvard.edu/abs/2022arXiv221010163W (accessed October 01, 2022).
Linear-probe, A. In Learning Transferable Visual Models From Natural Language Supervision, 2021. [CrossRef]
Chen, Y. In Convolutional Neural Network for Sentence Classification, 2015.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; Houlsby, N., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv 2020, abs/2010.11929.
Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; Schaekermann, M.; Wang, A.; Amin, M.; Lachgar, S.; Mansfield, P.; Prakash, S.; Green, B.; Dominowska, E.; Arcas, B. A. y.; Tomasev, N.; Liu, Y.; Wong, R.; Semturs, C.; Mahdavi, S. S.; Barral, J.; Webster, D.; Corrado, G. S.; Matias, Y.; Azizi, S.; Karthikesalingam, A.; Natarajan, V., Towards Expert-Level Medical Question Answering with Large Language Models. CoRR 2023, abs/2305.09617.
Xu, L.; Xie, H.; Qin, S.-Z. J.; Tao, X.; Wang, F. L., Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148 2023.
Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W., Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 2021.
Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. In Parameter-efficient transfer learning for NLP, International conference on machine learning, PMLR: 2019; pp 2790-2799.
Lin, Z.; Madotto, A.; Fung, P., Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829 2020.
He, J.; Zhou, C.; Ma, X.; Berg-Kirkpatrick, T.; Neubig, G., Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366 2021.
Sahoo, P.; Singh, A. K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A., A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv preprint arXiv:2402.07927 2024.
Heston, T. F.; Khun, C., Prompt Engineering in Medical Education. International Medical Education 2023.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I., Language models are unsupervised multitask learners. OpenAI blog 2019, 1 (8), 9.
Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S., Language models are few-shot learners. arXiv preprint arXiv:2005.14165 2020.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D., Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 2022, 35, 24824-24837. [CrossRef]
Zhang, Z.; Zhang, A.; Li, M.; Smola, A., Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493 2022.
Liu, H.; Teng, Z.; Cui, L.; Zhang, C.; Zhou, Q.; Zhang, Y. In LogiCoT: Logical Chain-of-Thought Instruction Tuning, The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
Hu, H.; Lu, H.; Zhang, H.; Lam, W.; Zhang, Y., Chain-of-symbol prompting elicits planning in large langauge models. arXiv preprint arXiv:2305.10276 2023.
Long, J., Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291 2023.
Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H. J. a. p. a., Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 2023.
Kumar, M.; Mani, U. A.; Tripathi, P.; Saalim, M.; Roy, S., Artificial Hallucinations by Google Bard: Think Before You Leap. Cureus 2023, 15 (8), e43313. [CrossRef]
Ilin, I., Advanced rag techniques: an illustrated overview. 2023. [CrossRef]
Johnson, A. E. W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T. J.; Hao, S.; Moody, B.; Gow, B.; Lehman, L.-w. H.; Celi, L. A.; Mark, R. G., MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data 2023, 10 (1), 1. [CrossRef]
Yu, W.; Iter, D.; Wang, S.; Xu, Y.; Ju, M.; Sanyal, S.; Zhu, C.; Zeng, M.; Jiang, M., Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063 2022.
Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; Chen, W., Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294 2023.
Wu, J.; Zhu, J.; Qi, Y. In Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation, 2024.
Jiang, X.; Fang, Y.; Qiu, R.; Zhang, H.; Xu, Y.; Chen, H.; Zhang, W.; Zhang, R.; Fang, Y.; Chu, X.; Zhao, J.; Wang, Y. In TC-RAG:Turing-Complete RAG's Case study on Medical LLM Systems, 2024.
Long, C.; Liu, Y.; Ouyang, C.; Yu, Y., Bailicai: A Domain-Optimized Retrieval-Augmented Generation Framework for Medical Applications. ArXiv 2024, abs/2407.21055.
Qin, Y.; Cai, Z.; Jin, D.; Yan, L.; Liang, S.; Zhu, K.; Lin, Y.; Han, X.; Ding, N.; Wang, H., Webcpm: Interactive web search for chinese long-form question answering. arXiv preprint arXiv:2305.06849 2023.
Liu, X.; Lai, H.; Yu, H.; Xu, Y.; Zeng, A.; Du, Z.; Zhang, P.; Dong, Y.; Tang, J. In Webglm: Towards an efficient web-enhanced question answering system with human preferences, Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023; pp 4549-4560.
Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W., Webgpt: Browser-assisted question-answering with human feedback, 2021. URL https://arxiv. org/abs/2112.09332 2021.
Saab, K.; Tu, T.; Weng, W.-H.; Tanno, R.; Stutz, D.; Wulczyn, E.; Zhang, F.; Strother, T.; Park, C.; Vedadi, E.; Chaves, J. Z.; Hu, S.-Y.; Schaekermann, M.; Kamath, A. B.; Cheng, Y.; Barrett, D. G. T.; Cheung, C.; Mustafa, B.; Palepu, A.; McDuff, D.; Hou, L.; Golany, T.; Liu, L.; Alayrac, J.-B.; Houlsby, N.; Tomašev, N.; Freyberg, J.; Lau, C.; Kemp, J.; Lai, J.; Azizi, S.; Kanada, K.; Man, S.; Kulkarni, K.; Sun, R.; Shakeri, S.; He, L.; Caine, B.; Webson, A.; Latysheva, N.; Johnson, M.; Mansfield, P.; Lu, J.; Rivlin, E.; Anderson, J.; Green, B.; Wong, R.; Krause, J.; Shlens, J.; Dominowska, E.; Eslami, S. M. A.; Cui, C.; Vinyals, O.; Kavukcuoglu, K.; Manyika, J.; Dean, J.; Hassabis, D.; Matias, Y.; Webster, D. R.; Barral, J.; Corrado, G. S.; Semturs, C.; Mahdavi, S. S.; Gottweis, J.; Karthikesalingam, A.; Natarajan, V., Capabilities of Gemini Models in Medicine. ArXiv 2024, abs/2404.18416.
Ferber, D.; Nahhas, O. S. M. E.; Wölflein, G.; Wiest, I. C.; Clusmann, J.; Lessman, M.-E.; Foersch, S.; Lammert, J.; Tschochohei, M.; Jäger, D.; Salto-Tellez, M.; Schultz, N.; Truhn, D.; Kather, J. N., Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology. ArXiv 2024, abs/2404.04667.
Schopow, N.; Osterhoff, G.; Baur, D., Applications of the Natural Language Processing Tool ChatGPT in Clinical Practice: Comparative Study and Augmented Systematic Review. JMIR medical informatics 2023, 11, e48933. [CrossRef]
Qureshi, R.; Shaughnessy, D.; Gill, K. A. R.; Robinson, K. A.; Li, T.; Agai, E., Are ChatGPT and large language models "the answer" to bringing us closer to systematic review automation? Systematic reviews 2023, 12 (1), 72. [CrossRef]
Zangrossi, P.; Martini, M.; Guerrini, F.; P, D. E. B.; Spena, G., Large language model, AI and scientific research: why ChatGPT is only the beginning. Journal of neurosurgical sciences 2024. [CrossRef]
Tessler, I.; Wolfovitz, A.; Livneh, N.; Gecel, N. A.; Sorin, V.; Barash, Y.; Konen, E.; Klang, E., Advancing Medical Practice with Artificial Intelligence: ChatGPT in Healthcare. The Israel Medical Association journal : IMAJ 2024, 26 (2), 80-85.
Touvron, H.; Martin, L.; Stone, K. R.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; Bikel, D. M.; Blecher, L.; Ferrer, C. C.; Chen, M.; Cucurull, G.; Esiobu, D.; Fernandes, J.; Fu, J.; Fu, W.; Fuller, B.; Gao, C.; Goswami, V.; Goyal, N.; Hartshorn, A. S.; Hosseini, S.; Hou, R.; Inan, H.; Kardas, M.; Kerkez, V.; Khabsa, M.; Kloumann, I. M.; Korenev, A. V.; Koura, P. S.; Lachaux, M.-A.; Lavril, T.; Lee, J.; Liskovich, D.; Lu, Y.; Mao, Y.; Martinet, X.; Mihaylov, T.; Mishra, P.; Molybog, I.; Nie, Y.; Poulton, A.; Reizenstein, J.; Rungta, R.; Saladi, K.; Schelten, A.; Silva, R.; Smith, E. M.; Subramanian, R.; Tan, X.; Tang, B.; Taylor, R.; Williams, A.; Kuan, J. X.; Xu, P.; Yan, Z.; Zarov, I.; Zhang, Y.; Fan, A.; Kambadur, M.; Narang, S.; Rodriguez, A.; Stojnic, R.; Edunov, S.; Scialom, T., Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv 2023, abs/2307.09288.
Leiter, C.; Zhang, R.; Chen, Y.; Belouadi, J.; Larionov, D.; Fresen, V.; Eger, S., ChatGPT: A Meta-Analysis after 2.5 Months. ArXiv 2023, abs/2302.13795.
Nicholson, A. E.; Korb, K. B.; Nyberg, E. P.; Wybrow, M.; Zukerman, I.; Mascaro, S.; Thakur, S.; Alvandi, A. O.; Riley, J.; Pearson, R.; Morris, S.; Herrmann, M.; Azad, A. K. M.; Bolger, F.; Hahn, U.; Lagnado, D. A., BARD: A Structured Technique for Group Elicitation of Bayesian Networks to Support Analytic Reasoning. Risk Analysis 2020, 42, 1155 - 1178.
Abi-Rafeh, J.; Mroueh, V. J.; Bassiri-Tehrani, B.; Marks, J.; Kazan, R.; Nahai, F., Complications Following Body Contouring: Performance Validation of Bard, a Novel AI Large Language Model, in Triaging and Managing Postoperative Patient Concerns. Aesthetic plastic surgery 2024.
Sandmann, S.; Riepenhausen, S.; Plagwitz, L.; Varghese, J., Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nature communications 2024, 15 (1), 2050. [CrossRef]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; Lample, G., LLaMA: Open and Efficient Foundation Language Models. ArXiv 2023, abs/2302.13971.
Blease, C.; Torous, J.; McMillan, B.; Hägglund, M.; Mandl, K. D., Generative Language Models and Open Notes: Exploring the Promise and Limitations. JMIR medical education 2024, 10, e51183. [CrossRef]
Ashraf, H.; Ashfaq, H., The Role of ChatGPT in Medical Research: Progress and Limitations. Annals of biomedical engineering 2024, 52 (3), 458-461.
Abi-Rafeh, J.; Xu, H. H.; Kazan, R.; Tevlin, R.; Furnas, H., Large Language Models and Artificial Intelligence: A Primer for Plastic Surgeons on the Demonstrated and Potential Applications, Promises, and Limitations of ChatGPT. Aesthetic surgery journal 2024, 44 (3), 329-343. [CrossRef]
Ufuk, F., The Role and Limitations of Large Language Models Such as ChatGPT in Clinical Settings and Medical Journalism. Radiology 2023, 307 (3), e230276. [CrossRef]
Melton, G. B.; McDonald, C. J.; Tang, P. C.; Hripcsak, G., Electronic Health Records. In Biomedical Informatics: Computer Applications in Health Care and Biomedicine, Shortliffe, E. H.; Cimino, J. J., Eds. Springer International Publishing: Cham, 2021; pp 467-509.
Tsai, C. H.; Eghdam, A.; Davoody, N.; Wright, G.; Flowerday, S.; Koch, S., Effects of Electronic Health Record Implementation and Barriers to Adoption and Use: A Scoping Review and Qualitative Analysis of the Content. Life (Basel, Switzerland) 2020, 10 (12). [CrossRef]
Yang, X.; Chen, A.; PourNejatian, N.; Shin, H. C.; Smith, K. E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A. B.; Flores, M. G.; Zhang, Y.; Magoc, T.; Harle, C. A.; Lipori, G.; Mitchell, D. A.; Hogan, W. R.; Shenkman, E. A.; Bian, J.; Wu, Y., A large language model for electronic health records. NPJ digital medicine 2022, 5 (1), 194. [CrossRef]
Lin, T.; Wang, Y.; Liu, X.; Qiu, X., A Survey of Transformers. AI Open 2021, 3, 111-132.
Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. In Attention is All you Need, Neural Information Processing Systems, 2017.
Ong, J.; Kedia, N.; Harihar, S.; Vupparaboina, S. C.; Singh, S. R.; Venkatesh, R.; Vupparaboina, K.; Bollepalli, S. C.; Chhablani, J., Applying large language model artificial intelligence for retina International Classification of Diseases (ICD) coding. Journal of Medical Artificial Intelligence 2023, 6. [CrossRef]
Toma, A.; Lawler, P. R.; Ba, J.; Krishnan, R. G.; Rubin, B. B.; Wang, B. In Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding, 2023.
Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. In GLM: General Language Model Pretraining with Autoregressive Blank Infilling, Annual Meeting of the Association for Computational Linguistics, 2021.
Zhang, H.; Chen, J.; Jiang, F.; Yu, F.; Chen, Z.; Li, J.; Chen, G.; Wu, X.; Zhang, Z.; Xiao, Q.; Wan, X.; Wang, B.; Li, H. In HuatuoGPT, towards Taming Language Model to Be a Doctor, Conference on Empirical Methods in Natural Language Processing, 2023.
Liao, Y.; Jiang, S.; Wang, Y.; Wang, Y., MING-MOE: Enhancing Medical Multi-Task Learning in Large Language Mode ls with Sparse Mixture of Low-Rank Adapter Experts. [CrossRef]
Zhu, W.; Wang, X., ChatMed: A Chinese Medical Large Language Model. GitHub.
Wang, H.; Liu, C.-L.; Xi, N.; Qiang, Z.; Zhao, S.; Qin, B.; Liu, T., HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge. ArXiv 2023, abs/2304.06975.
Yang, S.; Zhao, H.; Zhu, S.; Zhou, G.; Xu, H.; Jia, Y.; Zan, H. In Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue, AAAI Conference on Artificial Intelligence, 2023.
Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; Tam, W. L.; Ma, Z.; Xue, Y.; Zhai, J.; Chen, W.; Zhang, P.; Dong, Y.; Tang, J., GLM-130B: An Open Bilingual Pre-trained Model. ArXiv 2022, abs/2210.02414.
Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H. W.; Sutton, C.; Gehrmann, S.; Schuh, P.; Shi, K.; Tsvyashchenko, S.; Maynez, J.; Rao, A.; Barnes, P.; Tay, Y.; Shazeer, N. M.; Prabhakaran, V.; Reif, E.; Du, N.; Hutchinson, B. C.; Pope, R.; Bradbury, J.; Austin, J.; Isard, M.; Gur-Ari, G.; Yin, P.; Duke, T.; Levskaya, A.; Ghemawat, S.; Dev, S.; Michalewski, H.; García, X.; Misra, V.; Robinson, K.; Fedus, L.; Zhou, D.; Ippolito, D.; Luan, D.; Lim, H.; Zoph, B.; Spiridonov, A.; Sepassi, R.; Dohan, D.; Agrawal, S.; Omernick, M.; Dai, A. M.; Pillai, T. S.; Pellat, M.; Lewkowycz, A.; Moreira, E.; Child, R.; Polozov, O.; Lee, K.; Zhou, Z.; Wang, X.; Saeta, B.; Díaz, M.; Firat, O.; Catasta, M.; Wei, J.; Meier-Hellstern, K. S.; Eck, D.; Dean, J.; Petrov, S.; Fiedel, N., PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2022, 24, 240:1-240:113.
Anil, R.; Dai, A. M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A. T.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; Chu, E.; Clark, J.; Shafey, L. E.; Huang, Y.; Meier-Hellstern, K. S.; Mishra, G.; Moreira, E.; Omernick, M.; Robinson, K.; Ruder, S.; Tay, Y.; Xiao, K.; Xu, Y.; Zhang, Y.; Abrego, G. H.; Ahn, J.; Austin, J.; Barham, P.; Botha, J. A.; Bradbury, J.; Brahma, S.; Brooks, K. M.; Catasta, M.; Cheng, Y.; Cherry, C.; Choquette-Choo, C. A.; Chowdhery, A.; Crépy, C.; Dave, S.; Dehghani, M.; Dev, S.; Devlin, J.; D'iaz, M. C.; Du, N.; Dyer, E.; Feinberg, V.; Feng, F.; Fienber, V.; Freitag, M.; García, X.; Gehrmann, S.; González, L.; Gur-Ari, G.; Hand, S.; Hashemi, H.; Hou, L.; Howland, J.; Hu, A. R.; Hui, J.; Hurwitz, J.; Isard, M.; Ittycheriah, A.; Jagielski, M.; Jia, W. H.; Kenealy, K.; Krikun, M.; Kudugunta, S.; Lan, C.; Lee, K.; Lee, B.; Li, E.; Li, M.-L.; Li, W.; Li, Y.; Li, J. Y.; Lim, H.; Lin, H.; Liu, Z.-Z.; Liu, F.; Maggioni, M.; Mahendru, A.; Maynez, J.; Misra, V.; Moussalem, M.; Nado, Z.; Nham, J.; Ni, E.; Nystrom, A.; Parrish, A.; Pellat, M.; Polacek, M.; Polozov, O.; Pope, R.; Qiao, S.; Reif, E.; Richter, B.; Riley, P.; Ros, A.; Roy, A.; Saeta, B.; Samuel, R.; Shelby, R. M.; Slone, A.; Smilkov, D.; So, D. R.; Sohn, D.; Tokumine, S.; Valter, D.; Vasudevan, V.; Vodrahalli, K.; Wang, X.; Wang, P.; Wang, Z.; Wang, T.; Wieting, J.; Wu, Y.; Xu, K.; Xu, Y.; Xue, L. W.; Yin, P.; Yu, J.; Zhang, Q.; Zheng, S.; Zheng, C.; Zhou, W.; Zhou, D.; Petrov, S.; Wu, Y., PaLM 2 Technical Report. ArXiv 2023, abs/2305.10403. [CrossRef]
Chen, Y.; Wang, Z.; Xing, X.; Zheng, H.; Xu, Z.; Fang, K.; Wang, J.; Li, S.; Wu, J.; Liu, Q.; Xu, X., BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT. ArXiv 2023, abs/2310.15896.
Xin, S.; Jiashu, W.; Aoping, Z.; Jinlong, S.; Xiao-Jun, W., CSCA U-Net: A channel and space compound attention CNN for medical image segmentation. Artificial Intelligence in Medicine 2024, 150, 102800. [CrossRef]
Zhou, Y.; Chia, M. A.; Wagner, S. K.; Ayhan, M. S.; Williamson, D. J.; Struyven, R. R.; Liu, T.; Xu, M.; Lozano, M. G.; Woodward-Court, P.; Kihara, Y.; Allen, N.; Gallacher, J. E. J.; Littlejohns, T.; Aslam, T.; Bishop, P.; Black, G.; Sergouniotis, P.; Atan, D.; Dick, A. D.; Williams, C.; Barman, S.; Barrett, J. H.; Mackie, S.; Braithwaite, T.; Carare, R. O.; Ennis, S.; Gibson, J.; Lotery, A. J.; Self, J.; Chakravarthy, U.; Hogg, R. E.; Paterson, E.; Woodside, J.; Peto, T.; McKay, G.; McGuinness, B.; Foster, P. J.; Balaskas, K.; Khawaja, A. P.; Pontikos, N.; Rahi, J. S.; Lascaratos, G.; Patel, P. J.; Chan, M.; Chua, S. Y. L.; Day, A.; Desai, P.; Egan, C.; Fruttiger, M.; Garway-Heath, D. F.; Hardcastle, A.; Khaw, S. P. T.; Moore, T.; Sivaprasad, S.; Strouthidis, N.; Thomas, D.; Tufail, A.; Viswanathan, A. C.; Dhillon, B.; Macgillivray, T.; Sudlow, C.; Vitart, V.; Doney, A.; Trucco, E.; Guggeinheim, J. A.; Morgan, J. E.; Hammond, C. J.; Williams, K.; Hysi, P.; Harding, S. P.; Zheng, Y.; Luben, R.; Luthert, P.; Sun, Z.; McKibbin, M.; O’Sullivan, E.; Oram, R.; Weedon, M.; Owen, C. G.; Rudnicka, A. R.; Sattar, N.; Steel, D.; Stratton, I.; Tapp, R.; Yates, M. M.; Petzold, A.; Madhusudhan, S.; Altmann, A.; Lee, A. Y.; Topol, E. J.; Denniston, A. K.; Alexander, D. C.; Keane, P. A.; Eye, U. K. B.; Vision, C., A foundation model for generalizable disease detection from retinal images. Nature 2023, 622 (7981), 156-163. [CrossRef]
Hua, S.; Yan, F.; Shen, T.; Zhang, X., PathoDuet: Foundation Models for Pathological Slide Analysis of H&E and IHC Stains. ArXiv 2023, abs/2312.09894.
Topol, E. J., As artificial intelligence goes multimodal, medical applications multiply. Science (New York, N.Y.) 2023, 381 (6663), adk6139. [CrossRef]
Wang, X.; Zhang, X.; Wang, G.; He, J.; Li, Z.; Zhu, W.; Guo, Y.; Dou, Q.; Li, X.; Wang, D.; Hong, L.; Lao, Q.; Ruan, T.; Zhou, Y.; Li, Y.; Zhao, J.; Li, K.; Sun, X.; Zhu, L.; Zhang, S., OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine. ArXiv 2024, abs/2402.18028.
Moor, M.; Banerjee, O.; Abad, Z. S. H.; Krumholz, H. M.; Leskovec, J.; Topol, E. J.; Rajpurkar, P., Foundation models for generalist medical artificial intelligence. Nature 2023, 616 (7956), 259-265. [CrossRef]
Chang Shu, B. C., Fangyu Liu, Zihao Fu, Ehsan Shareghi, Nigel Collier Visual Med-Alpaca: A Parameter-Efficient Biomedical LLM with Visual Capabilities. https://github.com/cambridgeltl/visual-med-alpaca/tree/main.
Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J., LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. ArXiv 2023, abs/2306.00890.
Wang, S.; Zhao, Z.; Ouyang, X.; Wang, Q.; Shen, D., ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models. ArXiv 2023, abs/2302.07257.
Rongsheng, W.; Tan, T., XrayGLM: The first Chinese Medical Multimodal Model that Chest Radiogr aphs Summarization. GitHub.
Thawakar, O.; Shaker, A. M.; Mullappilly, S. S.; Cholakkal, H.; Anwer, R. M.; Khan, S. S.; Laaksonen, J.; Khan, F. S., XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models. ArXiv 2023, abs/2306.07971.
Jie, R. T. B. Y. a. Y. G. x., A Review on Research and Application of Medical Large Language Models. Chinese Journal of Health Informatics and Management 2023, 20 (06), 853-861.
Thirunavukarasu, A. J.; Ting, D. S. J.; Elangovan, K.; Gutierrez, L.; Tan, T. F.; Ting, D. S. W., Large language models in medicine. Nature medicine 2023, 29 (8), 1930-1940. [CrossRef]
Shin, H.-C.; Zhang, Y.; Bakhturina, E.; Puri, R.; Patwary, M.; Shoeybi, M.; Mani, R. In BioMegatron: Larger Biomedical Domain Language Model, Online, November; Association for Computational Linguistics: Online, 2020; pp 4700-4706.
Johnson, A. E. W.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R. G., MIMIC-III, a freely accessible critical care database. Scientific Data 2016, 3 (1), 160035. [CrossRef]
Pampari, A.; Raghavan, P.; Liang, J. J.; Peng, J. In emrQA: A Large Corpus for Question Answering on Electronic Medical Records, Conference on Empirical Methods in Natural Language Processing, 2018.
Jin, D.; Pan, E.; Oufattole, N.; Weng, W.-H.; Fang, H.; Szolovits, P., What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. ArXiv 2020, abs/2009.13081.
Zhang, S.; Zhang, X.; Wang, H.; Guo, L.; Liu, S., Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection. IEEE Access 2018, 1-1. [CrossRef]
He, J.; Fu, M.; Tu, M., Applying deep matching networks to Chinese medical question answering: a study and a dataset. BMC medical informatics and decision making 2019, 19 (Suppl 2), 52. [CrossRef]
Li, J.; Wang, X.; Wu, X.; Zhang, Z.; Xu, X.; Fu, J.; Tiwari, P.; Wan, X.; Wang, B., Huatuo-26M, a Large-scale Chinese Medical QA Dataset. ArXiv 2023, abs/2305.01526.
BYAMBASUREN Odmaa, Y. Y., SUlZhifang, DAl Damai, CHANG Baobao, LI Suiian, ZAN Hongying, Preliminary Study on the Construction of Chinese Medical Knowledge Graph. Journal of Chinese Information Processing 2019, 33 (10), 1-9.
Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; Presser, S.; Leahy, C., The Pile: An 800GB Dataset of Diverse Text for Language Modeling. ArXiv 2020, abs/2101.00027.
Yuan, S.; Zhao, H.; Du, Z.; Ding, M.; Liu, X.; Cen, Y.; Zou, X.; Yang, Z.; Tang, J., WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open 2021, 2, 65-68. [CrossRef]
Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, Q. N.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R., The LAMBADA dataset: Word prediction requiring a broad discourse context. ArXiv 2016, abs/1606.06031.
Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D. X.; Steinhardt, J., Measuring Massive Multitask Language Understanding. ArXiv 2020, abs/2009.03300.
Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A. M.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; Kluska, A.; Lewkowycz, A.; Agarwal, A.; Power, A.; Ray, A.; Warstadt, A.; Kocurek, A. W.; Safaya, A.; Tazarv, A.; Xiang, A.; Parrish, A.; Nie, A.; Hussain, A.; Askell, A.; Dsouza, A.; Slone, A.; Rahane, A. A.; Iyer, A. S.; Andreassen, A.; Madotto, A.; Santilli, A.; Stuhlmuller, A.; Dai, A. M.; La, A.; Lampinen, A. K.; Zou, A.; Jiang, A.; Chen, A.; Vuong, A.; Gupta, A.; Gottardi, A.; Norelli, A.; Venkatesh, A.; Gholamidavoodi, A.; Tabassum, A.; Menezes, A.; Kirubarajan, A.; Mullokandov, A.; Sabharwal, A.; Herrick, A.; Efrat, A.; Erdem, A.; Karakaş, A.; Roberts, B. R.; Loe, B. S.; Zoph, B.; Bojanowski, B.; Ozyurt, B.; Hedayatnia, B.; Neyshabur, B.; Inden, B.; Stein, B.; Ekmekci, B.; Lin, B. Y.; Howald, B. S.; Orinion, B.; Diao, C.; Dour, C.; Stinson, C.; Argueta, C.; Ramírez, C. F.; Singh, C.; Rathkopf, C.; Meng, C.; Baral, C.; Wu, C.; Callison-Burch, C.; Waites, C.; Voigt, C.; Manning, C. D.; Potts, C.; Ramirez, C.; Rivera, C.; Siro, C.; Raffel, C.; Ashcraft, C.; Garbacea, C.; Sileo, D.; Garrette, D. H.; Hendrycks, D.; Kilman, D.; Roth, D.; Freeman, D.; Khashabi, D.; Levy, D.; González, D. M.; Perszyk, D. R.; Hernandez, D.; Chen, D.; Ippolito, D.; Gilboa, D.; Dohan, D.; Drakard, D.; Jurgens, D.; Datta, D.; Ganguli, D.; Emelin, D.; Kleyko, D.; Yuret, D.; Chen, D.; Tam, D.; Hupkes, D.; Misra, D.; Buzan, D.; Coelho Mollo, D.; Yang, D.; Lee, D.-H.; Schrader, D.; Shutova, E.; Cubuk, E. D.; Segal, E.; Hagerman, E.; Barnes, E.; Donoway, E. P.; Pavlick, E.; Rodolà, E.; Lam, E.; Chu, E.; Tang, E.; Erdem, E.; Chang, E.; Chi, E. A.; Dyer, E.; Jerzak, E. J.; Kim, E.; Manyasi, E. E.; Zheltonozhskii, E.; Xia, F.; Siar, F.; Martínez-Plumed, F.; Happé, F.; Chollet, F.; Rong, F.; Mishra, G.; Winata, G. I.; de Melo, G.; Kruszewski, G.; Parascandolo, G.; Mariani, G.; Wang, G. X.; Jaimovitch-López, G.; Betz, G.; Gur-Ari, G.; Galijasevic, H.; Kim, H.; Rashkin, H.; Hajishirzi, H.; Mehta, H.; Bogar, H.; Shevlin, H.; Schutze, H.; Yakura, H.; Zhang, H.; Wong, H. M.; Ng, I.; Noble, I.; Jumelet, J.; Geissinger, J.; Kernion, J.; Hilton, J.; Lee, J.; Fisac, J. F.; Simon, J. B.; Koppel, J.; Zheng, J.; Zou, J.; Kocoń, J.; Thompson, J.; Wingfield, J.; Kaplan, J.; Radom, J.; Sohl-Dickstein, J. N.; Phang, J.; Wei, J.; Yosinski, J.; Novikova, J.; Bosscher, J.; Marsh, J.; Kim, J.; Taal, J.; Engel, J.; Alabi, J. O.; Xu, J.; Song, J.; Tang, J.; Waweru, J. W.; Burden, J.; Miller, J.; Balis, J. U.; Batchelder, J.; Berant, J.; Frohberg, J.; Rozen, J.; Hernández-Orallo, J.; Boudeman, J.; Guerr, J.; Jones, J.; Tenenbaum, J.; Rule, J. S.; Chua, J.; Kanclerz, K.; Livescu, K.; Krauth, K.; Gopalakrishnan, K.; Ignatyeva, K.; Markert, K.; Dhole, K. D.; Gimpel, K.; Omondi, K.; Mathewson, K. W.; Chiafullo, K.; Shkaruta, K.; Shridhar, K.; McDonell, K.; Richardson, K.; Reynolds, L.; Gao, L.; Zhang, L.; Dugan, L.; Qin, L.; Contreras-Ochando, L.; Morency, L.-P.; Moschella, L.; Lam, L.; Noble, L.; Schmidt, L.; He, L.; Colón, L. O.; Metz, L.; Şenel, L. K.; Bosma, M.; Sap, M.; ter Hoeve, M.; Farooqi, M.; Faruqui, M.; Mazeika, M.; Baturan, M.; Marelli, M.; Maru, M.; Quintana, M. J. R.; Tolkiehn, M.; Giulianelli, M.; Lewis, M.; Potthast, M.; Leavitt, M. L.; Hagen, M.; Schubert, M. a. a.; Baitemirova, M.; Arnaud, M.; McElrath, M. A.; Yee, M.; Cohen, M.; Gu, M.; Ivanitskiy, M. I.; Starritt, M.; Strube, M.; Swędrowski, M.; Bevilacqua, M.; Yasunaga, M.; Kale, M.; Cain, M.; Xu, M.; Suzgun, M.; Walker, M.; Tiwari, M.; Bansal, M.; Aminnaseri, M.; Geva, M.; Gheini, M.; MukundVarma, T.; Peng, N.; Chi, N. A.; Lee, N.; Krakover, N. G.-A.; Cameron, N.; Roberts, N.; Doiron, N.; Martinez, N.; Nangia, N.; Deckers, N.; Muennighoff, N.; Keskar, N. S.; Iyer, N.; Constant, N.; Fiedel, N.; Wen, N.; Zhang, O.; Agha, O.; Elbaghdadi, O.; Levy, O.; Evans, O.; Casares, P. A. M.; Doshi, P.; Fung, P.; Liang, P. P.; Vicol, P.; Alipoormolabashi, P.; Liao, P.; Liang, P.; Chang, P.; Eckersley, P.; Htut, P. M.; Hwang, P.-B.; Milkowski, P.; Patil, P. S.; Pezeshkpour, P.; Oli, P.; Mei, Q.; Lyu, Q.; Chen, Q.; Banjade, R.; Rudolph, R. E.; Gabriel, R.; Habacker, R.; Risco, R.; Milliere, R.; Garg, R.; Barnes, R.; Saurous, R. A.; Arakawa, R.; Raymaekers, R.; Frank, R.; Sikand, R.; Novak, R.; Sitelew, R.; Le Bras, R.; Liu, R.; Jacobs, R.; Zhang, R.; Salakhutdinov, R.; Chi, R.; Lee, R.; Stovall, R.; Teehan, R.; Yang, R.; Singh, S.; Mohammad, S. M.; Anand, S.; Dillavou, S.; Shleifer, S.; Wiseman, S.; Gruetter, S.; Bowman, S. R.; Schoenholz, S. S.; Han, S.; Kwatra, S.; Rous, S. A.; Ghazarian, S.; Ghosh, S.; Casey, S.; Bischoff, S.; Gehrmann, S.; Schuster, S.; Sadeghi, S.; Hamdan, S. S.; Zhou, S.; Srivastava, S.; Shi, S.; Singh, S.; Asaadi, S.; Gu, S. S.; Pachchigar, S.; Toshniwal, S.; Upadhyay, S.; Debnath, S.; Shakeri, S.; Thormeyer, S.; Melzi, S.; Reddy, S.; Makini, S. P.; Lee, S.-H.; Torene, S.; Hatwar, S.; Dehaene, S.; Divic, S.; Ermon, S.; Biderman, S.; Lin, S.; Prasad, S.; Piantadosi, S. T.; Shieber, S. M.; Misherghi, S.; Kiritchenko, S.; Mishra, S.; Linzen, T.; Schuster, T.; Li, T.; Yu, T.; Ali, T.; Hashimoto, T.; Wu, T.-L.; Desbordes, T.; Rothschild, T.; Phan, T.; Wang, T.; Nkinyili, T.; Schick, T.; Kornev, T.; Tunduny, T.; Gerstenberg, T.; Chang, T.; Neeraj, T.; Khot, T.; Shultz, T.; Shaham, U.; Misra, V.; Demberg, V.; Nyamai, V.; Raunak, V.; Ramasesh, V. V.; Prabhu, V. U.; Padmakumar, V.; Srikumar, V.; Fedus, W.; Saunders, W.; Zhang, W.; Vossen, W.; Ren, X.; Tong, X.; Zhao, X.; Wu, X.; Shen, X.; Yaghoobzadeh, Y.; Lakretz, Y.; Song, Y.; Bahri, Y.; Choi, Y.; Yang, Y.; Hao, Y.; Chen, Y.; Belinkov, Y.; Hou, Y.; Hou, Y.; Bai, Y.; Seid, Z.; Zhao, Z.; Wang, Z.; Wang, Z. J.; Wang, Z.; Wu, Z., Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. ArXiv 2022, abs/2206.04615. [CrossRef]
Xu, L.; Zhang, X.; Li, L.; Hu, H.; Cao, C.; Liu, W.; Li, J.; Li, Y.; Sun, K.; Xu, Y.; Cui, Y.; Yu, C.; Dong, Q.; Tian, Y.; Yu, D.; Shi, B.; Zeng, J.-j.; Wang, R.; Xie, W.; Li, Y.; Patterson, Y.; Tian, Z.; Zhang, Y.; Zhou, H.; Liu, S.; Zhao, Q.; Yue, C.; Zhang, X.; Yang, Z.-Y.; Richardson, K.; Lan, Z. In CLUE: A Chinese Language Understanding Evaluation Benchmark, International Conference on Computational Linguistics, 2020.
Xu, L.; Lu, X.; Yuan, C.; Zhang, X.; Yuan, H.; Xu, H.; Wei, G.; Pan, X.; Hu, H., FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark. ArXiv 2021, abs/2107.07498.
Levesque, H. J.; Davis, E.; Morgenstern, L., The Winograd Schema Challenge. Proceedings of the International Conference on Knowledge Representation and Reasoning 2012, 552-561.
Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. R., SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. ArXiv 2019, abs/1905.00537.
Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A. P.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; Toutanova, K.; Jones, L.; Kelcey, M.; Chang, M.-W.; Dai, A. M.; Uszkoreit, J.; Le, Q. V.; Petrov, S., Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 2019, 7, 453-466. [CrossRef]
Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J., Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics 2021, 9, 346-361. [CrossRef]
Talmor, A.; Herzig, J.; Lourie, N.; Berant, J., CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. ArXiv 2019, abs/1811.00937.
Zhou, B.; Khashabi, D.; Ning, Q.; Roth, D., “Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding. ArXiv 2019, abs/1909.03065.
Jin, D.; Pan, E.; Oufattole, N.; Weng, W.-H.; Fang, H.; Szolovits, P., What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences 2021, 11 (14), 6421. [CrossRef]
Pal, A.; Umapathi, L. K.; Sankarasubbu, M. In MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, ACM Conference on Health, Inference, and Learning, 2022.
Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W. W.; Lu, X. In PubMedQA: A Dataset for Biomedical Research Question Answering, Conference on Empirical Methods in Natural Language Processing, 2019.
Chen, S.; Ju, Z.; Dong, X.; Fang, H.; Wang, S.; Yang, Y.; Zeng, J.; Zhang, R.; Zhang, R.; Zhou, M.; Zhu, P.; Xie, P., MedDialog: A Large-scale Medical Dialogue Dataset. ArXiv 2020, abs/2004.03329.
Zhang, N.; Chen, M.; Bi, Z.; Liang, X.; Li, L.; Shang, X.; Yin, K.; Tan, C.; Xu, J.; Huang, F.; Si, L.; Ni, Y.; Xie, G.; Sui, Z.; Chang, B.; Zong, H.; Yuan, Z.; Li, L.; Yan, J.; Zan, H.; Zhang, K.; Tang, B.; Chen, Q. In CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark, Dublin, Ireland, May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp 7888-7915.
Jha, D.; Smedsrud, P. H.; Riegler, M. A.; Halvorsen, P.; de Lange, T.; Johansen, D.; Johansen, H. D. In Kvasir-SEG: A Segmented Polyp Dataset, MultiMedia Modeling, Cham, 2020; Ro, Y. M.; Cheng, W.-H.; Kim, J.; Chu, W.-T.; Cui, P.; Choi, J.-W.; Hu, M.-C.; De Neve, W., Eds. Springer International Publishing: Cham, 2020; pp 451-462.
Vázquez, D.; Bernal, J.; Sánchez, F. J.; Fernández-Esparrach, G.; López, A. M.; Romero, A.; Drozdzal, M.; Courville, A., A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images. Journal of Healthcare Engineering 2017, 2017, 4037190. [CrossRef]
Bernal, J.; Sánchez, J.; Vilariño, F., Towards automatic polyp detection with a polyp appearance model. Pattern Recognition 2012, 45 (9), 3166-3182. [CrossRef]
Silva, J.; Histace, A.; Romain, O.; Dray, X.; Granado, B., Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery 2014, 9 (2), 283-293.
Caicedo, J. C.; Goodman, A.; Karhohs, K. W.; Cimini, B. A.; Ackerman, J.; Haghighi, M.; Heng, C.; Becker, T.; Doan, M.; McQuin, C.; Rohban, M.; Singh, S.; Carpenter, A. E., Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl. Nature Methods 2019, 16 (12), 1247-1253. [CrossRef]
Tschandl, P.; Rosendahl, C.; Kittler, H., The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 2018, 5 (1), 180161. [CrossRef]
Shu, X.; Chang, F.; Zhang, X.; Shao, C.; Yang, X., ECAU-Net: Efficient channel attention U-Net for fetal ultrasound cerebellum segmentation. Biomedical Signal Processing and Control 2022, 75, 103528. [CrossRef]
Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M. C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J., Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 2016. [CrossRef]
Bycroft, C.; Freeman, C.; Petkova, D.; Band, G.; Elliott, L. T.; Sharp, K.; Motyer, A.; Vukcevic, D.; Delaneau, O.; O’Connell, J.; Cortes, A.; Welsh, S.; Young, A.; Effingham, M.; McVean, G.; Leslie, S.; Allen, N.; Donnelly, P.; Marchini, J., The UK Biobank resource with deep phenotyping and genomic data. Nature 2018, 562 (7726), 203-209. [CrossRef]
Lotz, J.; Weiss, N.; van der Laak, J.; Heldmann, S., Comparison of consecutive and restained sections for image registration in histopathology. Journal of Medical Imaging 2021, 10, 067501.
Liu, S.; Zhu, C.; Xu, F.; Jia, X.; Shi, Z.; Jin, M., BCI: Breast Cancer Immunohistochemical Image Generation through Pyramid Pix2pix. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2022, 1814-1823. [CrossRef]
Ehteshami Bejnordi, B.; Veta, M.; Johannes van Diest, P.; van Ginneken, B.; Karssemeijer, N.; Litjens, G.; van der Laak, J. A. W. M.; the CAMELYON16 Consortium, Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA 2017, 318 (22), 2199-2210. [CrossRef]
Wang, Z.; Liu, C.; Zhang, S.; Dou, Q., Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train. ArXiv 2023, abs/2306.16741.
Sudlow, C.; Gallacher, J.; Allen, N.; Beral, V.; Burton, P.; Danesh, J.; Downey, P.; Elliott, P.; Green, J.; Landray, M.; Liu, B.; Matthews, P.; Ong, G.; Pell, J.; Silman, A.; Young, A.; Sprosen, T.; Peakman, T.; Collins, R., UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine 2015, 12 (3), e1001779. [CrossRef]
Irvin, J. A.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R. L.; Shpanskaya, K. S.; Seekins, J.; Mong, D. A.; Halabi, S. S.; Sandberg, J. K.; Jones, R.; Larson, D. B.; Langlotz, C.; Patel, B. N.; Lungren, M. P.; Ng, A. In CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison, AAAI Conference on Artificial Intelligence, 2019.
Pavlova, M.; Terhljan, N.; Chung, A. G.; Zhao, A.; Surana, S.; Aboutalebi, H.; Gunraj, H.; Sabri, A.; Alaref, A.; Wong, A., COVID-Net CXR-2: An Enhanced Deep Convolutional Neural Network Design for Detection of COVID-19 Cases From Chest X-ray Images. Frontiers in Medicine 2022, 9, 861680. [CrossRef]
Johnson, A. E. W.; Pollard, T. J.; Berkowitz, S. J.; Greenbaum, N. R.; Lungren, M. P.; Deng, C. Y.; Mark, R. G.; Horng, S., MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 2019, 6 (1), 317. [CrossRef]
Shao, H.; Zhong, D.; Du, X., Deep Distillation Hashing for Unconstrained Palmprint Recognition. IEEE Transactions on Instrumentation and Measurement 2021, 70, 1-13. [CrossRef]
Liu, G.; Liao, Y.; Wang, F.; Zhang, B.; Zhang, L.; Liang, X.; Wan, X.; Li, S.; Li, Z.; Zhang, S.; Cui, S., Medical-VLBERT: Medical Visual Language BERT for COVID-19 CT Report Generation With Alternate Learning. IEEE Transactions on Neural Networks and Learning Systems 2021, 32 (9), 3786-3797. [CrossRef]
Cohen, J. P.; Morrison, P.; Dao, L., COVID-19 Image Data Collection. ArXiv 2020, abs/2003.11597.
Iglesia-Vayá, M. d. l.; Saborit, J. M.; Montell, J. A.; Pertusa, A.; Bustos, A.; Cazorla, M.; Galant, J.; Barber, X.; Orozco-Beltrán, D.; García-García, F.; Caparrós, M.; González, G.; Salinas, J. M., BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients. ArXiv 2020, abs/2006.01174.
Wu, X.; Yang, S.; Qiu, Z.; Ge, S.; Yan, Y.; Wu, X.; Zheng, Y.; Zhou, S. K.; Xiao, L., DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis. ArXiv 2022, abs/2211.13229.
Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L. W.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; Mark, R. G., MIMIC-III, a freely accessible critical care database. Scientific Data 2016, 3, 160035. [CrossRef]
Shih, G.; Wu, C. C.; Halabi, S. S.; Kohli, M. D.; Prevedello, L. M.; Cook, T. S.; Sharma, A.; Amorosa, J.; Arteaga, V. A.; Galperin-Aizenberg, M.; Gill, R. R.; Godoy, M. C. B.; Hobbs, S.; Jeudy, J.; Laroia, A.; Shah, P. N.; Vummidi, D. R.; Yaddanapudi, K.; Stein, A., Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Radiology: Artificial Intelligence 2019, 1 (1), e180041. [CrossRef]
Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R. M., ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, 3462-3471. [CrossRef]
Jaeger, S.; Candemir, S.; Antani, S.; Wáng, Y. X.; Lu, P. X.; Thoma, G., Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery 2014, 4 (6), 475-477.
Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; Wong, C.; Tupini, A.; Wang, Y.; Mazzola, M.; Shukla, S.; Liden, L.; Gao, J.; Lungren, M. P.; Naumann, T.; Wang, S.; Poon, H., BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. [CrossRef]
Lau, J. J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D., A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 2018, 5, 180251. [CrossRef]
Liu, B.; Zhan, L.-M.; Xu, L.; Ma, L.; Yang, Y. F.; Wu, X.-M., Slake: A Semantically-Labeled Knowledge-Enhanced Dataset For Medical Visual Question Answering. 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI) 2021, 1650-1654. [CrossRef]
He, X.; Zhang, Y.; Mou, L.; Xing, E. P.; Xie, P., PathVQA: 30000+ Questions for Medical Visual Question Answering. ArXiv 2020, abs/2003.10286.
Johnson, A. E. W.; Pollard, T. J.; Greenbaum, N. R.; Lungren, M. P.; Deng, C.-y.; Peng, Y.; Lu, Z.; Mark, R. G.; Berkowitz, S. J.; Horng, S. In MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs, 2019. [CrossRef]
Demner-Fushman, D.; Kohli, M. D.; Rosenman, M. B.; Shooshan, S. E.; Rodriguez, L.; Antani, S.; Thoma, G. R.; McDonald, C. J., Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 2016, 23 (2), 304-310.
Johnson, A. E. W.; Pollard, T. J.; Berkowitz, S. J.; Greenbaum, N. R.; Lungren, M. P.; Deng, C.-y.; Mark, R. G.; Horng, S., MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 2019, 6 (1), 317. [CrossRef]

Table 1. Summary of LLMs in the medical field, covering their specific tasks, number of parameters, and the datasets used for training and evaluation. M: million, B: billion.

Mode	Models	Tasks	Params	Datasets
Text	GatorTron [75]	Quick access to information to aid clinical diagnosis.	8.9B	The UF Health Integrated Data Repository (IDR), PubMed [103], Wikipedia [103], Medical Information Mart for Intensive Care III (MIMIC-III) [104], emrQA dataset [105]
	MedGPT [5]	To predict and diagnose diseases and to design treatment plans for patients.	GPTv2 (7.5B/1.5B/11.7B)	King's College Hospital (KCH), MIMIC-III [104]
	LLM-ICD [78]	Automatically generates ICD codes for retinal diseases.	ChatGPT/GPT-3.5	MIMIC-III [104]
	ChatDoctor [6]	Provide accurate advice to patients through self-searching of online and offline medical databases.	7B	HealthCareMagic-100k (www.healthcaremagic.com), iCliniq (www.icliniq.com)
	Clinical Camel [79]	Applied to clinical research, turning medical literature into conversation.	13B/70B	ShareGPT, Clinical Articles, MedQA [106]
	DoctorGLM [7]	Chinese medical dialogue model.	6.2B	Translated ChatDoctor's [6] database, Chinese Medical Dialogue (CMD), MedDialog
	HuatuoGPT [81]	Provide detailed, rich content while interacting and diagnosing like a doctor.	7B	HealthCareMagic-100k, iCliniq, cMedQA2 [107], webMedQA [108], and Huatuo-26M [109]
Text	MedicalGPT-zh (MING) [82]	Handle complex Chinese medical conversations and apply to a variety of scenarios, such as online medical consultation, patient education, health guidance, etc.	7B	Chinese medical dialogue dataset constructed based on www.healthcaremagic.com, USMLE cases, and other data
	ChatMed [83]	Answer patients' daily medical questions online.	7B	ChatMed Consult Dataset, ChatMed TCM Dataset
	HuaTuo [84] (BenTsao)	Generate accurate and professional medical information in a Chinese context.	7B	Chinese Medical Knowledge Graph (CMeKG) [110]
	Zhongjing [85]	Integrate TCM knowledge into the LLM to provide patients with personalized TCM advice and treatment plans.	13B	CMeKG [110], Chinese Multi-turn Medical Question Answering (CMtMedQA), Huatuo-26M [109]
	GLM-130B [86]	Intelligent question and answer for medical and health problems to assist diagnosis and treatment.	130B	Pile [111], Chinese WudaoCorpora [112], LAMBADA [113], MMLU [114], BIG-bench-lite [115], CLUE [116], FewCLUE [117], Winograd Schemas Challenge (WSC) [118], SuperGLUE [119], Natural Questions [120], StrategyQA [121], Commonsense QA [122], Multiple-choice Temporal Commonsense (MC-TACO) [123]
	Med-PaLM 2 [30]	Answer open-ended questions from patients.	340B	MedQA [124], MedMCQA [125], PubMedQA [126], MMLU [114]
Text	BianQue [89]	Trained on multi-turn dialogue data to improve the model's ability to ask questions.	6.2B	BianQueCorpus, MedDialog-CN [127], IMCS-V2 [128], CHIP-MDCFNPC [128], MedDG [128]
Image	CSCA U-Net [90]	Accurately identify and segment areas of interest in medical images, so that doctors can propose treatment plans suitable for patients.	35.27M	Kvasir-SEG [129], CVC-ClinicDB [130], CVC-ColonDB [131], ETIS [132], CVC-T [130], 2018 Data Science Bowl (2018 DSB) [133], ISIC 2018 [134], JSUAH-Cerebellum [135]
	RETFound [91]	The diagnosis and prognosis of eye diseases and the prediction of complex systemic diseases.	1.6M	Moorfields Diabetic imAge dataSet (MEH-MIDAS), Kaggle EyePACS [136], Moorfields AlzEye study (MEH-AlzEye), UK Biobank [137]
	PathoDuet [92]	Understanding and analyzing pathological images.	1.5M	TCGA, HyReCo [138], BCI [139], NCT-CRC-HE, CAMELYON16 [140], IHC dataset
Multi-mode	OpenMEDLab [94]	Foundation models that can be applied to a variety of medical data to solve a range of clinical and research problems.	14M-1.4B	SA-Med2D-20M, SNOW, Endo-FM [141], MedFM
Multi-mode	GMAI [95]	Use multiple datasets to learn and flexibly interpret different medical data.	540B	MIMIC [48], UK Biobank [142], UniProt
Multi-mode	Med-MLLM [96]	Learn medical knowledge from unlabeled data for rapid response to rare diseases and outbreaks.	8.9B	CheXpert [143], COVIDx-CXR-2 [144], MIMIC-CXR [145], COVID-19-CT-CXR [146], COVID-19 CT [147], COVID-CXR [148], BIMCV-COVID-19 [149], COVID-HCH [150], PubMed [103], MIMIC-III [151], RSNA Pneumonia [152], NIH ChestX-ray [153], Shenzhen Tuberculosis [154]
	Visual Med-Alpaca [96]	Understand visual information and generate biomedically relevant text and image content.	7B	ROCO dataset, MEDIQA RQE, MedQA, MedDialog, MEDIQA QA, PubMedQA
	LLaVA-Med [97]	A vision-language conversational assistant that answers open-ended research questions about biomedical images.	7B	PMC-15M [155], VQA-RAD [156], SLAKE [157], PathVQA [158]
	ChatCAD [98]	Convert medical images into text to generate diagnostic reports.	175B	MIMIC-CXR [145], CheXpert [143]
	XrayGLM [99]	Provide diagnostic reports by viewing chest X-rays.	6B	MIMIC-CXR [159], OpenI [160]
	XrayGPT [100]	Conversational medical AI for radiology image analysis.	7B	MIMIC-CXR [161], OpenI [160]

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.