Submitted: 19 June 2025
Posted: 20 June 2025
Abstract
Keywords:
Chapter 1: Introduction
1.1. Background of the Study
1.2. Statement of the Problem
1.3. Research Objectives
- To develop a benchmark dataset of original and simplified health materials using publicly available content from trusted sources such as CDC, WHO, and MedlinePlus.
- To evaluate the performance of general-purpose and domain-specific language models (e.g., GPT-3.5, GPT-4, BioBERT, PubMedGPT) in medical text simplification.
- To assess the output of these models using a combination of automatic readability metrics and expert human judgment.
- To explore prompt engineering, few-shot learning, and hybrid model approaches to improve simplification quality while retaining medical accuracy.
- To provide a reproducible evaluation framework for future research in health-related NLP applications.
1.4. Research Questions
- How effectively can current language models simplify complex health-related texts while maintaining semantic accuracy?
- Do domain-specific models outperform general-purpose models in medical text simplification?
- What are the strengths and limitations of large language models in balancing readability, grammatical fluency, and factual consistency?
- Can prompt engineering and human feedback improve the quality of simplified outputs in this context?
1.5. Scope of the Study
1.6. Significance of the Study
1.7. Ethical Considerations
1.8. Organization of the Study
Chapter 2: Literature Review
2.1. Introduction
2.2. Health Literacy and the Complexity of Health Information
2.3. Text Simplification in Natural Language Processing
2.4. NLP for Medical Text Simplification
2.5. Evaluation Metrics for Text Simplification
2.6. Ethical Considerations in Automated Health Communication
2.7. Identified Gaps and Contribution of the Present Study
- There is a lack of comprehensive evaluations comparing general-purpose and domain-specific language models for medical simplification.
- Existing simplification corpora are either too small or not representative of real-world public health documents.
- Most simplification studies prioritize readability without rigorously assessing semantic accuracy or medical correctness.
- Few studies employ hybrid evaluation frameworks that combine automated metrics with expert human judgment.
Chapter 3: Methodology
3.1. Introduction
3.2. Research Design
3.3. Dataset Collection and Preparation
3.3.1. Source Selection
- The Centers for Disease Control and Prevention (CDC)
- The World Health Organization (WHO)
- MedlinePlus (a service of the U.S. National Library of Medicine)
3.3.2. Dataset Construction
- Original Texts: Extracted from the aforementioned sources, including fact sheets, health condition overviews, and vaccine guides.
- Reference Simplified Versions: Created by using a combination of expert-generated simplifications (where available) and professionally simplified texts annotated by domain-trained linguists.
3.3.3. Preprocessing
- Tokenization
- Removal of metadata and hyperlinks
- Alignment of original and simplified texts
- Conversion to model-compatible input formats
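The preprocessing steps above can be sketched as a small pipeline. This is a minimal illustration, assuming the source documents arrive as plain text; the function names and regex patterns are hypothetical, not the study's actual implementation.

```python
import re

def preprocess(text: str) -> str:
    """Strip hyperlinks and bracketed metadata, then normalise whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # remove hyperlinks
    text = re.sub(r"\[[^\]]*\]", "", text)     # remove bracketed metadata, e.g. [PDF]
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def align_pairs(originals, simplified):
    """Pair original and simplified documents one-to-one and emit a
    model-compatible input format (one dict per aligned pair)."""
    return [
        {"input": preprocess(o), "target": preprocess(s)}
        for o, s in zip(originals, simplified)
    ]

pairs = align_pairs(
    ["Hypertension [fact sheet] is persistent high blood pressure. See https://example.org/bp"],
    ["High blood pressure means your blood pushes too hard on your vessels."],
)
print(pairs[0]["input"])
```

Real source documents would need source-specific cleaning rules, but the same clean-then-align structure applies.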
3.4. Model Selection
- GPT-3.5 (text-davinci-003) — A widely used general-purpose language model optimized for instruction-following text generation.
- GPT-4 — The latest in the GPT series with enhanced reasoning and linguistic control.
- BioBERT — A domain-specific model pre-trained on biomedical literature.
- PubMedGPT — A generative language model trained on biomedical text from PubMed abstracts and full-text articles.
3.5. Prompting and Model Configuration
- Zero-shot prompting: Asking the model to simplify without examples.
- Few-shot prompting: Providing 1–3 examples of complex-simplified pairs to guide the output format and tone.
- Instructional prompting: Using explicit directives such as “Rewrite the following medical text in simpler language for a general audience.”
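The three prompting strategies above can be captured by one template-assembly function. This is a sketch under the assumption that prompts follow a simple "Complex:/Simple:" pair format; the function name and formatting are illustrative, not the study's exact templates.

```python
def build_prompt(text, examples=None, instruction=None):
    """Assemble a simplification prompt under one of three strategies:
    zero-shot (no examples, no instruction), few-shot (1-3 example pairs),
    or instructional (an explicit directive), or a combination."""
    parts = []
    if instruction:                       # instructional prompting
        parts.append(instruction)
    if examples:                          # few-shot prompting
        for complex_t, simple_t in examples:
            parts.append(f"Complex: {complex_t}\nSimple: {simple_t}")
    parts.append(f"Complex: {text}\nSimple:")   # the item to simplify
    return "\n\n".join(parts)

prompt = build_prompt(
    "Myocardial infarction results from coronary occlusion.",
    examples=[("Hypertension is elevated arterial pressure.",
               "High blood pressure.")],
    instruction="Rewrite the following medical text in simpler language for a general audience.",
)
print(prompt)
```

Leaving both `examples` and `instruction` unset reproduces the zero-shot condition, so one function covers all three configurations.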
3.6. Evaluation Metrics
3.6.1. Automated Evaluation
- Flesch-Kincaid Grade Level (FKGL): Measures the grade-level readability of the text.
- SARI (System output Against References and against the Input): Captures quality of additions, deletions, and retention of original content.
- BERTScore: Assesses semantic similarity using contextual embeddings.
- BLEU: Measures n-gram overlap with reference simplified texts.
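Of the metrics above, FKGL is simple enough to compute directly. The sketch below implements the standard Flesch-Kincaid formula with a crude vowel-group syllable heuristic; in practice a library such as `textstat` (and `easse` or `sacrebleu` for SARI/BLEU) would be used, so treat this as illustrative only.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups; every word has at least one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(fkgl("Hypertension constitutes persistent elevation of arterial pressure."), 2))
print(round(fkgl("Your blood pushes too hard."), 2))
```

Note that FKGL can go below zero for very short, simple sentences; what matters for the evaluation is the relative ordering of model outputs.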
3.6.2. Human Evaluation
- Simplicity (on a 1–5 scale)
- Medical Accuracy (correctness and omission score)
- Grammatical Fluency
- Overall Comprehension
3.7. Experimental Procedure
- Models were prompted with each complex paragraph from the test set and their outputs recorded.
- Outputs were compared to human reference simplifications using automated metrics.
- A subset of outputs was sent to human evaluators for qualitative judgment.
- Performance across models was statistically compared using ANOVA tests for human scores and mean comparisons for automated metrics.
- Models were also evaluated for failure cases, such as hallucinations, omissions, or improper simplifications.
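The statistical comparison in the procedure above can be sketched as a one-way ANOVA F statistic in pure Python. The ratings below are hypothetical placeholders, not the study's data; in practice `scipy.stats.f_oneway` would be used, since it also returns the p-value needed to judge significance.

```python
def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    k, n = len(groups), len(all_scores)
    # Between-group sum of squares: group size times squared deviation of group mean.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviation of each score from its group mean.
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical 1-5 simplicity ratings from three evaluators per model.
gpt4      = [4.5, 4.3, 4.4]
biobert   = [3.0, 3.2, 3.1]
pubmedgpt = [3.4, 3.2, 3.3]
print(round(one_way_anova_f(gpt4, biobert, pubmedgpt), 1))
```

A large F value indicates that between-model differences dwarf rater-to-rater noise, which is the pattern the study reports for the human scores.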
3.8. Tools and Infrastructure
- Python (v3.9) with NLP libraries including Hugging Face Transformers, NLTK, and spaCy.
- OpenAI API for GPT-3.5 and GPT-4 access.
- Hugging Face model hub for BioBERT and PubMedGPT.
- Jupyter Notebooks and Google Colab for prototyping and parallel evaluations.
3.9. Limitations of Methodology
- The models were evaluated only on English texts, limiting generalizability.
- Simplified reference pairs were limited in quantity compared to large-scale corpora.
- Human evaluations, while insightful, remain subjective despite standardization.
- Prompt variability could introduce inconsistencies in LLM output quality.
3.10. Summary
Chapter 4: Results and Analysis
4.1. Introduction
4.2. Performance Based on Automated Metrics
- Readability (FKGL): GPT-4 achieved the lowest average reading grade level, close to that of human simplifications. GPT-3.5 also performed well, while domain-specific models (BioBERT, PubMedGPT) generated outputs with higher complexity.
- SARI: GPT-4 came closest to human references in terms of meaningful simplification operations. BioBERT and PubMedGPT underperformed due to retaining more original complexity.
- Semantic Similarity (BERTScore): Domain-specific models outperformed general LLMs slightly in semantic preservation, but at the expense of reduced simplicity.
- BLEU: GPT-4 again performed best in terms of n-gram overlap with human references.
4.3. Human Evaluation Results
- GPT-4 outperformed other models in all four dimensions, coming closest to human simplifications.
- BioBERT and PubMedGPT scored highly in medical accuracy, but evaluators noted that they frequently retained technical terms, making outputs harder to understand.
- Fluency was consistently rated lower for domain-specific models, due to rigid sentence structures and awkward transitions.
4.4. Trade-off Between Simplicity and Accuracy
Outputs that balanced the two goals shared several characteristics:
- Rare or technical terms are rephrased but not omitted.
- Sentence length is reduced without eliminating causal or conditional information.
- Bullet-point formatting or logical chunking enhances user readability.
4.5. Prompt Engineering Effects
- Zero-shot prompts produced fluent but sometimes off-topic simplifications.
- Few-shot prompts grounded the output style more consistently and improved semantic fidelity.
- Instructional prompts such as “Explain like I’m 12” often led to overly casual tone or occasional oversimplification.
4.6. Error Analysis
- Semantic Distortion (6% of outputs): Misrepresentation of medical facts, such as confusing “infection prevention” with “infection treatment.”
- Omission (12%): Missing details such as dosage frequency or preventive measures.
- Over-simplification (9%): Excessive generalization leading to loss of nuance, e.g., “chronic illness” simplified to “feeling unwell.”
- Redundancy (5%): Repetitive or verbose phrasing post-simplification.
4.7. Model Comparison Summary
| Model | Best At | Weaknesses |
|---|---|---|
| GPT-3.5 | Balanced simplification | Occasional semantic drift |
| GPT-4 | Best overall performance | May oversimplify or sound casual if not prompted carefully |
| BioBERT | Medical accuracy | Poor readability, rigid style |
| PubMedGPT | Domain fluency, terminology use | Fails to simplify syntax adequately |
4.8. Statistical Significance of Results
4.9. Summary of Findings
- GPT-4 emerged as the most capable model for simplifying health literacy materials, striking a commendable balance between readability and content fidelity.
- Domain-specific models retained terminology precision but failed to produce accessible language suitable for the general public.
- Prompt engineering was crucial in optimizing output tone and structure.
- Hybrid evaluation frameworks, combining automated metrics with expert review, provided a robust analysis of LLM performance.
Chapter 5: Discussion and Implications
5.1. Introduction
5.2. Interpretation of Findings
5.3. The Role of Prompt Engineering
5.4. Implications for Health Communication
5.5. Ethical Considerations
5.6. Contributions to the Field
- Empirical Benchmarking: It provides a structured comparison of general-purpose and biomedical LLMs for health text simplification—an area previously underexplored in NLP research.
- Evaluation Framework: The hybrid evaluation framework combining automated metrics with expert human review offers a reproducible model for future studies.
- Prompt Engineering Insights: It highlights prompt engineering as a scalable method for controlling model behavior in non-programmatic contexts.
- Dataset Development: The creation of a high-quality benchmark dataset consisting of real-world health materials and simplified versions lays the groundwork for future supervised training and evaluation in the field.
- Practical Guidance: The findings offer practical insights for health professionals, developers, and policymakers seeking to apply NLP tools to improve public health communication.
5.7. Limitations of the Study
5.8. Recommendations for Future Research
- Multilingual and Cross-Cultural Simplification: Develop and evaluate models capable of simplifying health texts in multiple languages and cultural contexts to increase global applicability.
- Interactive Simplification Tools: Integrate LLMs into user-facing applications that allow real-time simplification with feedback loops involving health professionals and end-users.
- Fine-tuning with Domain Labels: Explore fine-tuning general-purpose models using hybrid datasets labeled for both readability and semantic preservation.
- Longitudinal Impact Studies: Conduct user studies to assess whether simplified materials generated by LLMs improve patient knowledge, engagement, and outcomes over time.
- Explainability Frameworks: Develop mechanisms to explain how simplifications were derived and what information was changed or removed—essential for building trust in automated outputs.
5.9. Summary
Chapter 6: Conclusion and Recommendations
6.1. Introduction
6.2. Summary of the Study
6.3. Key Findings
- Performance of General-Purpose LLMs: GPT-4 consistently outperformed both domain-specific models and earlier general-purpose models across multiple metrics. It produced simplified texts that were readable, fluent, and generally faithful to original meanings.
- Limitations of Domain-Specific Models: While BioBERT and PubMedGPT maintained high levels of medical accuracy, they were less successful in reducing complexity and adapting language to suit lay readers. Their outputs often mirrored academic or clinical tones.
- Trade-Offs and Tensions: The study confirmed the inherent trade-off in medical simplification tasks—striving for accessibility without compromising accuracy. Over-simplification can dilute or misrepresent critical medical information, while overly technical content undermines comprehension.
- Value of Prompt Engineering: Carefully designed prompts—especially few-shot examples—substantially improved output quality, making prompt engineering a key factor in real-world application of LLMs.
- Need for Human Oversight: Although LLMs show promise, they are not fully autonomous solutions. Human oversight remains essential for verifying clinical accuracy, appropriateness of tone, and potential for misinformation.
6.4. Contributions to Knowledge
- It offers one of the first structured comparisons of general and biomedical LLMs for text simplification in the health domain.
- It presents a reproducible evaluation framework combining quantitative metrics with human qualitative assessment.
- It introduces a curated benchmark dataset of real-world health materials and corresponding simplifications.
- It highlights the practical feasibility—and limitations—of deploying transformer-based models in public health education.
6.5. Practical Implications
- For Health Agencies and NGOs: LLMs can assist in rapidly generating simplified patient education materials, public health announcements, and risk communication messages, especially during emergencies such as disease outbreaks.
- For Developers and Product Designers: Integrating LLMs into health apps and chatbot interfaces can enhance user experience by providing easy-to-understand summaries of health content. Prompt customization will be essential in achieving desired communication goals.
- For Health Professionals: Physicians and educators may use LLMs as first-pass tools to translate clinical language into plain English, subject to professional review before dissemination.
- For Policymakers and Regulators: This study reinforces the need for guidelines governing the responsible use of AI in healthcare communication—especially in contexts involving vulnerable populations or life-critical information.
6.6. Recommendations
- Implement LLMs as Augmentation Tools, Not Replacements: Use language models to assist—not replace—human simplifiers, especially for patient-facing content that requires clinical nuance.
- Adopt Hybrid Evaluation Frameworks: When deploying NLP solutions in health communication, always combine automatic metrics with human-in-the-loop assessments.
- Develop Domain-Aligned Prompts and Templates: Institutions can design pre-approved prompts or prompt libraries tailored to specific health contexts (e.g., chronic disease, vaccination, reproductive health).
- Prioritize Ethical and Inclusive Design: Train and test models using diverse, multilingual, and culturally representative health content to mitigate bias and improve accessibility across populations.
- Invest in Fine-Tuning for Critical Use-Cases: For high-risk applications (e.g., chronic illness education or post-operative care instructions), consider supervised fine-tuning of models using labeled datasets developed in partnership with clinicians.
- Include Transparency Mechanisms: Future systems should indicate when and how content has been simplified, allowing users to access original text when necessary and reinforcing user trust.
6.7. Limitations and Future Work
- The dataset was limited to English and sourced primarily from U.S.-based institutions.
- The sample size for human evaluation was modest due to time and resource constraints.
- The models were evaluated using prompting techniques without task-specific fine-tuning.
- Multilingual and cross-cultural simplification frameworks
- Domain-specific fine-tuning of LLMs on large, annotated simplification corpora
- Interactive simplification interfaces for real-time use by clinicians and patients
- Longitudinal studies measuring the actual impact of simplified materials on patient understanding and behavior
6.8. Conclusion
References
- Alsentzer, E., Murphy, J. R., Boag, W., Weng, W. H., Jin, D., Naumann, T., & McDermott, M. (2019). Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop, 72–78.
- Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
- Berkman, N. D., Sheridan, S. L., Donahue, K. E., Halpern, D. J., & Crotty, K. (2011). Low health literacy and health outcomes: An updated systematic review. Annals of Internal Medicine, 155(2), 97–107.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Chou, W. Y. S., Gaysynsky, A., & Vanderpool, R. C. (2020). The COVID-19 communication crisis: A call for clarity in public health messaging. Health Communication, 35(14), 1747–1752.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019, 4171–4186.
- Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., ... & Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1), 1–23.
- Guo, J., Tang, C., Wang, X., & Zhu, Q. (2021). Simplifying medical texts with neural machine translation: A case study on heart failure discharge summaries. BMC Medical Informatics and Decision Making, 21(1), 1–12.
- Hossain, M. D., Rahman, M. H., & Hossan, K. M. R. (2025). Artificial Intelligence in healthcare: Transformative applications, ethical challenges, and future directions in medical diagnostics and personalized medicine.
- Jiang, H., Zhang, H., & Zhao, W. (2023). Evaluating the effectiveness of large language models in health information simplification. Journal of Biomedical Informatics, 140, 104367.
- Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.
- Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Martin, L., Ladhak, F., & Fan, A. (2022). Multi-level simplification of medical documents using LLMs. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics.
- Nutbeam, D. (2008). The evolving concept of health literacy. Social Science & Medicine, 67(12), 2072–2078.
- Paasche-Orlow, M. K., & Wolf, M. S. (2010). Promoting health literacy research to reduce health disparities. Journal of Health Communication, 15(S2), 34–41.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67.
- Specia, L. (2010). Translating from complex to simplified sentences. Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language, 30–39.
- Weng, W. H., Marshall, I. J., Hsu, J., & Wei, C. H. (2022). MedSimplify: A medical text simplification dataset and benchmark for evaluating readability in health communication. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Xu, W., Napoles, C., Pavlick, E., Chen, Q., & Callison-Burch, C. (2016). Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4, 401–415.
Table 1. Automated evaluation results (average scores by model).
| Model | FKGL↓ | SARI↑ | BERTScore↑ | BLEU↑ |
|---|---|---|---|---|
| GPT-3.5 | 7.6 | 39.8 | 0.876 | 43.1 |
| GPT-4 | 6.2 | 45.5 | 0.895 | 47.9 |
| BioBERT | 9.1 | 31.2 | 0.904 | 36.3 |
| PubMedGPT | 8.7 | 33.9 | 0.911 | 38.5 |
| Human Reference | 5.9 | 52.0 | 0.920 | — |
Table 2. Human evaluation ratings by model (1–5 scale).
| Model | Simplicity | Medical Accuracy | Fluency | Comprehension |
|---|---|---|---|---|
| GPT-3.5 | 3.9 | 4.1 | 4.4 | 4.2 |
| GPT-4 | 4.4 | 4.5 | 4.7 | 4.6 |
| BioBERT | 3.1 | 4.8 | 3.8 | 3.5 |
| PubMedGPT | 3.3 | 4.7 | 3.9 | 3.7 |
| Human | 4.8 | 4.9 | 4.8 | 4.9 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).