Submitted: 19 June 2025
Posted: 20 June 2025
Abstract
Keywords:
Chapter 1: Introduction
1.1. Background of the Study
1.2. Statement of the Problem
1.3. Research Objectives
- To develop a benchmark dataset of original and simplified health materials using publicly available content from trusted sources such as CDC, WHO, and MedlinePlus.
- To evaluate the performance of general-purpose and domain-specific language models (e.g., GPT-3.5, GPT-4, BioBERT, PubMedGPT) in medical text simplification.
- To assess the output of these models using a combination of automatic readability metrics and expert human judgment.
- To explore prompt engineering, few-shot learning, and hybrid model approaches to improve simplification quality while retaining medical accuracy.
- To provide a reproducible evaluation framework for future research in health-related NLP applications.
1.4. Research Questions
- How effectively can current language models simplify complex health-related texts while maintaining semantic accuracy?
- Do domain-specific models outperform general-purpose models in medical text simplification?
- What are the strengths and limitations of large language models in balancing readability, grammatical fluency, and factual consistency?
- Can prompt engineering and human feedback improve the quality of simplified outputs in this context?
1.5. Scope of the Study
1.6. Significance of the Study
1.7. Ethical Considerations
1.8. Organization of the Study
Chapter 2: Literature Review
2.1. Introduction
2.2. Health Literacy and the Complexity of Health Information
2.3. Text Simplification in Natural Language Processing
2.4. NLP for Medical Text Simplification
2.5. Evaluation Metrics for Text Simplification
2.6. Ethical Considerations in Automated Health Communication
2.7. Identified Gaps and Contribution of the Present Study
- There is a lack of comprehensive evaluations comparing general-purpose and domain-specific language models for medical simplification.
- Existing simplification corpora are either too small or not representative of real-world public health documents.
- Most simplification studies prioritize readability without rigorously assessing semantic accuracy or medical correctness.
- Few studies employ hybrid evaluation frameworks that combine automated metrics with expert human judgment.
Chapter 3: Methodology
3.1. Introduction
3.2. Research Design
3.3. Dataset Collection and Preparation
3.3.1. Source Selection
- The Centers for Disease Control and Prevention (CDC)
- The World Health Organization (WHO)
- MedlinePlus (a service of the U.S. National Library of Medicine)
3.3.2. Dataset Construction
- Original Texts: Extracted from the aforementioned sources, including fact sheets, health condition overviews, and vaccine guides.
- Reference Simplified Versions: Created by using a combination of expert-generated simplifications (where available) and professionally simplified texts annotated by domain-trained linguists.
3.3.3. Preprocessing
- Tokenization
- Removal of metadata and hyperlinks
- Alignment of original and simplified texts
- Conversion to model-compatible input formats
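The preprocessing steps above can be sketched as a small pipeline. This is a minimal illustration, assuming the source documents arrive as plain text; the function names and regex patterns are hypothetical, not the study's actual implementation.

```python
import re

def preprocess(text: str) -> str:
    """Strip hyperlinks and bracketed metadata, then normalise whitespace."""
    text = re.sub(r"https?://\S+", "", text)   # remove hyperlinks
    text = re.sub(r"\[[^\]]*\]", "", text)     # remove bracketed metadata, e.g. [PDF]
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def align_pairs(originals, simplified):
    """Pair original and simplified documents one-to-one and emit a
    model-compatible input format (one dict per aligned pair)."""
    return [
        {"input": preprocess(o), "target": preprocess(s)}
        for o, s in zip(originals, simplified)
    ]

pairs = align_pairs(
    ["Hypertension [fact sheet] is persistent high blood pressure. See https://example.org/bp"],
    ["High blood pressure means your blood pushes too hard on your vessels."],
)
print(pairs[0]["input"])
```

Real source documents would need source-specific cleaning rules, but the same clean-then-align structure applies.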
3.4. Model Selection
- GPT-3.5 (text-davinci-003) — A widely used general-purpose language model optimized for instruction-following text generation.
- GPT-4 — The latest in the GPT series with enhanced reasoning and linguistic control.
- BioBERT — A domain-specific model pre-trained on biomedical literature.
- PubMedGPT — A generative language model trained on biomedical text from PubMed abstracts and full-text articles.
3.5. Prompting and Model Configuration
- Zero-shot prompting: Asking the model to simplify without examples.
- Few-shot prompting: Providing 1–3 examples of complex-simplified pairs to guide the output format and tone.
- Instructional prompting: Using explicit directives such as “Rewrite the following medical text in simpler language for a general audience.”
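The three prompting strategies above can be captured by one template-assembly function. This is a sketch under the assumption that prompts follow a simple "Complex:/Simple:" pair format; the function name and formatting are illustrative, not the study's exact templates.

```python
def build_prompt(text, examples=None, instruction=None):
    """Assemble a simplification prompt under one of three strategies:
    zero-shot (no examples, no instruction), few-shot (1-3 example pairs),
    or instructional (an explicit directive), or a combination."""
    parts = []
    if instruction:                       # instructional prompting
        parts.append(instruction)
    if examples:                          # few-shot prompting
        for complex_t, simple_t in examples:
            parts.append(f"Complex: {complex_t}\nSimple: {simple_t}")
    parts.append(f"Complex: {text}\nSimple:")   # the item to simplify
    return "\n\n".join(parts)

prompt = build_prompt(
    "Myocardial infarction results from coronary occlusion.",
    examples=[("Hypertension is elevated arterial pressure.",
               "High blood pressure.")],
    instruction="Rewrite the following medical text in simpler language for a general audience.",
)
print(prompt)
```

Leaving both `examples` and `instruction` unset reproduces the zero-shot condition, so one function covers all three configurations.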
3.6. Evaluation Metrics
3.6.1. Automated Evaluation
- Flesch-Kincaid Grade Level (FKGL): Measures the grade-level readability of the text.
- SARI (System output Against References and against the Input): Captures quality of additions, deletions, and retention of original content.
- BERTScore: Assesses semantic similarity using contextual embeddings.
- BLEU: Measures n-gram overlap with reference simplified texts.
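Of the metrics above, FKGL is simple enough to compute directly. The sketch below implements the standard Flesch-Kincaid formula with a crude vowel-group syllable heuristic; in practice a library such as `textstat` (and `easse` or `sacrebleu` for SARI/BLEU) would be used, so treat this as illustrative only.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups; every word has at least one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(fkgl("Hypertension constitutes persistent elevation of arterial pressure."), 2))
print(round(fkgl("Your blood pushes too hard."), 2))
```

Note that FKGL can go below zero for very short, simple sentences; what matters for the evaluation is the relative ordering of model outputs.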
3.6.2. Human Evaluation
- Simplicity (on a 1–5 scale)
- Medical Accuracy (correctness and omission score)
- Grammatical Fluency
- Overall Comprehension
3.7. Experimental Procedure
- Models were prompted with each complex paragraph from the test set and their outputs recorded.
- Outputs were compared to human reference simplifications using automated metrics.
- A subset of outputs was sent to human evaluators for qualitative judgment.
- Performance across models was statistically compared using ANOVA tests for human scores and mean comparisons for automated metrics.
- Models were also evaluated for failure cases, such as hallucinations, omissions, or improper simplifications.
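The statistical comparison in the procedure above can be sketched as a one-way ANOVA F statistic in pure Python. The ratings below are hypothetical placeholders, not the study's data; in practice `scipy.stats.f_oneway` would be used, since it also returns the p-value needed to judge significance.

```python
def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    k, n = len(groups), len(all_scores)
    # Between-group sum of squares: group size times squared deviation of group mean.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviation of each score from its group mean.
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical 1-5 simplicity ratings from three evaluators per model.
gpt4      = [4.5, 4.3, 4.4]
biobert   = [3.0, 3.2, 3.1]
pubmedgpt = [3.4, 3.2, 3.3]
print(round(one_way_anova_f(gpt4, biobert, pubmedgpt), 1))
```

A large F value indicates that between-model differences dwarf rater-to-rater noise, which is the pattern the study reports for the human scores.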
3.8. Tools and Infrastructure
- Python (v3.9) with NLP libraries including Hugging Face Transformers, NLTK, and spaCy.
- OpenAI API for GPT-3.5 and GPT-4 access.
- Hugging Face model hub for BioBERT and PubMedGPT.
- Jupyter Notebooks and Google Colab for prototyping and parallel evaluations.
3.9. Limitations of Methodology
- The models were evaluated only on English texts, limiting generalizability.
- Simplified reference pairs were limited in quantity compared to large-scale corpora.
- Human evaluations, while insightful, remain subjective despite standardization.
- Prompt variability could introduce inconsistencies in LLM output quality.
3.10. Summary
Chapter 4: Results and Analysis
4.1. Introduction
4.2. Performance Based on Automated Metrics
- Readability (FKGL): GPT-4 achieved the lowest average reading grade level, close to that of human simplifications. GPT-3.5 also performed well, while domain-specific models (BioBERT, PubMedGPT) generated outputs with higher complexity.
- SARI: GPT-4 came closest to human references in terms of meaningful simplification operations. BioBERT and PubMedGPT underperformed due to retaining more original complexity.
- Semantic Similarity (BERTScore): Domain-specific models outperformed general LLMs slightly in semantic preservation, but at the expense of reduced simplicity.
- BLEU: GPT-4 again performed best in terms of n-gram overlap with human references.
4.3. Human Evaluation Results
- GPT-4 outperformed other models in all four dimensions, coming closest to human simplifications.
- BioBERT and PubMedGPT scored highly in medical accuracy, but evaluators noted that they frequently retained technical terms, making outputs harder to understand.
- Fluency was consistently rated lower for domain-specific models, due to rigid sentence structures and awkward transitions.
4.4. Trade-off Between Simplicity and Accuracy
Outputs that balanced the two goals shared several characteristics:
- Rare or technical terms are rephrased but not omitted.
- Sentence length is reduced without eliminating causal or conditional information.
- Bullet-point formatting or logical chunking enhances user readability.
4.5. Prompt Engineering Effects
- Zero-shot prompts produced fluent but sometimes off-topic simplifications.
- Few-shot prompts grounded the output style more consistently and improved semantic fidelity.
- Instructional prompts such as “Explain like I’m 12” often led to overly casual tone or occasional oversimplification.
4.6. Error Analysis
- Semantic Distortion (6% of outputs): Misrepresentation of medical facts, such as confusing “infection prevention” with “infection treatment.”
- Omission (12%): Missing details such as dosage frequency or preventive measures.
- Over-simplification (9%): Excessive generalization leading to loss of nuance, e.g., “chronic illness” simplified to “feeling unwell.”
- Redundancy (5%): Repetitive or verbose phrasing post-simplification.
4.7. Model Comparison Summary
| Model | Best At | Weaknesses |
|---|---|---|
| GPT-3.5 | Balanced simplification | Occasional semantic drift |
| GPT-4 | Best overall performance | May oversimplify or sound casual if not prompted carefully |
| BioBERT | Medical accuracy | Poor readability, rigid style |
| PubMedGPT | Domain fluency, terminology use | Fails to simplify syntax adequately |
4.8. Statistical Significance of Results
4.9. Summary of Findings
- GPT-4 emerged as the most capable model for simplifying health literacy materials, striking a commendable balance between readability and content fidelity.
- Domain-specific models retained terminology precision but failed to produce accessible language suitable for the general public.
- Prompt engineering was crucial in optimizing output tone and structure.
- Hybrid evaluation frameworks, combining automated metrics with expert review, provided a robust analysis of LLM performance.
Chapter 5: Discussion and Implications
5.1. Introduction
5.2. Interpretation of Findings
5.3. The Role of Prompt Engineering
5.4. Implications for Health Communication
5.5. Ethical Considerations
5.6. Contributions to the Field
- Empirical Benchmarking: It provides a structured comparison of general-purpose and biomedical LLMs for health text simplification—an area previously underexplored in NLP research.
- Evaluation Framework: The hybrid evaluation framework combining automated metrics with expert human review offers a reproducible model for future studies.
- Prompt Engineering Insights: It highlights prompt engineering as a scalable method for controlling model behavior in non-programmatic contexts.
- Dataset Development: The creation of a high-quality benchmark dataset consisting of real-world health materials and simplified versions lays the groundwork for future supervised training and evaluation in the field.
- Practical Guidance: The findings offer practical insights for health professionals, developers, and policymakers seeking to apply NLP tools to improve public health communication.
5.7. Limitations of the Study
5.8. Recommendations for Future Research
- Multilingual and Cross-Cultural Simplification: Develop and evaluate models capable of simplifying health texts in multiple languages and cultural contexts to increase global applicability.
- Interactive Simplification Tools: Integrate LLMs into user-facing applications that allow real-time simplification with feedback loops involving health professionals and end-users.
- Fine-tuning with Domain Labels: Explore fine-tuning general-purpose models using hybrid datasets labeled for both readability and semantic preservation.
- Longitudinal Impact Studies: Conduct user studies to assess whether simplified materials generated by LLMs improve patient knowledge, engagement, and outcomes over time.
- Explainability Frameworks: Develop mechanisms to explain how simplifications were derived and what information was changed or removed—essential for building trust in automated outputs.
5.9. Summary
Chapter 6: Conclusion and Recommendations
6.1. Introduction
6.2. Summary of the Study
6.3. Key Findings
- Performance of General-Purpose LLMs: GPT-4 consistently outperformed both domain-specific models and earlier general-purpose models across multiple metrics. It produced simplified texts that were readable, fluent, and generally faithful to original meanings.
- Limitations of Domain-Specific Models: While BioBERT and PubMedGPT maintained high levels of medical accuracy, they were less successful in reducing complexity and adapting language to suit lay readers. Their outputs often mirrored academic or clinical tones.
- Trade-Offs and Tensions: The study confirmed the inherent trade-off in medical simplification tasks—striving for accessibility without compromising accuracy. Over-simplification can dilute or misrepresent critical medical information, while overly technical content undermines comprehension.
- Value of Prompt Engineering: Carefully designed prompts—especially few-shot examples—substantially improved output quality, making prompt engineering a key factor in real-world application of LLMs.
- Need for Human Oversight: Although LLMs show promise, they are not fully autonomous solutions. Human oversight remains essential for verifying clinical accuracy, appropriateness of tone, and potential for misinformation.
6.4. Contributions to Knowledge
- It offers one of the first structured comparisons of general and biomedical LLMs for text simplification in the health domain.
- It presents a reproducible evaluation framework combining quantitative metrics with human qualitative assessment.
- It introduces a curated benchmark dataset of real-world health materials and corresponding simplifications.
- It highlights the practical feasibility—and limitations—of deploying transformer-based models in public health education.
6.5. Practical Implications
- For Health Agencies and NGOs: LLMs can assist in rapidly generating simplified patient education materials, public health announcements, and risk communication messages, especially during emergencies such as disease outbreaks.
- For Developers and Product Designers: Integrating LLMs into health apps and chatbot interfaces can enhance user experience by providing easy-to-understand summaries of health content. Prompt customization will be essential in achieving desired communication goals.
- For Health Professionals: Physicians and educators may use LLMs as first-pass tools to translate clinical language into plain English, subject to professional review before dissemination.
- For Policymakers and Regulators: This study reinforces the need for guidelines governing the responsible use of AI in healthcare communication—especially in contexts involving vulnerable populations or life-critical information.
6.6. Recommendations
- Implement LLMs as Augmentation Tools, Not Replacements: Use language models to assist—not replace—human simplifiers, especially for patient-facing content that requires clinical nuance.
- Adopt Hybrid Evaluation Frameworks: When deploying NLP solutions in health communication, always combine automatic metrics with human-in-the-loop assessments.
- Develop Domain-Aligned Prompts and Templates: Institutions can design pre-approved prompts or prompt libraries tailored to specific health contexts (e.g., chronic disease, vaccination, reproductive health).
- Prioritize Ethical and Inclusive Design: Train and test models using diverse, multilingual, and culturally representative health content to mitigate bias and improve accessibility across populations.
- Invest in Fine-Tuning for Critical Use-Cases: For high-risk applications (e.g., chronic illness education or post-operative care instructions), consider supervised fine-tuning of models using labeled datasets developed in partnership with clinicians.
- Include Transparency Mechanisms: Future systems should indicate when and how content has been simplified, allowing users to access original text when necessary and reinforcing user trust.
6.7. Limitations and Future Work
- The dataset was limited to English and sourced primarily from U.S.-based institutions.
- The sample size for human evaluation was modest due to time and resource constraints.
- The models were evaluated using prompting techniques without task-specific fine-tuning.
- Multilingual and cross-cultural simplification frameworks
- Domain-specific fine-tuning of LLMs on large, annotated simplification corpora
- Interactive simplification interfaces for real-time use by clinicians and patients
- Longitudinal studies measuring the actual impact of simplified materials on patient understanding and behavior
6.8. Conclusion
References
- Alsentzer, E., Murphy, J. R., Boag, W., Weng, W. H., Jin, D., Naumann, T., & McDermott, M. (2019). Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop, 72–78.
- Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
- Berkman, N. D., Sheridan, S. L., Donahue, K. E., Halpern, D. J., & Crotty, K. (2011). Low health literacy and health outcomes: An updated systematic review. Annals of Internal Medicine, 155(2), 97–107.
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Chou, W. Y. S., Gaysynsky, A., & Vanderpool, R. C. (2020). The COVID-19 communication crisis: A call for clarity in public health messaging. Health Communication, 35(14), 1747–1752.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019, 4171–4186.
- Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., ... & Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1), 1–23.
- Guo, J., Tang, C., Wang, X., & Zhu, Q. (2021). Simplifying medical texts with neural machine translation: A case study on heart failure discharge summaries. BMC Medical Informatics and Decision Making, 21(1), 1–12.
- Hossain, M. D., Rahman, M. H., & Hossan, K. M. R. (2025). Artificial Intelligence in healthcare: Transformative applications, ethical challenges, and future directions in medical diagnostics and personalized medicine.
- Jiang, H., Zhang, H., & Zhao, W. (2023). Evaluating the effectiveness of large language models in health information simplification. Journal of Biomedical Informatics, 140, 104367.
- Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.
- Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Martin, L., Ladhak, F., & Fan, A. (2022). Multi-level simplification of medical documents using LLMs. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics.
- Nutbeam, D. (2008). The evolving concept of health literacy. Social Science & Medicine, 67(12), 2072–2078.
- Paasche-Orlow, M. K., & Wolf, M. S. (2010). Promoting health literacy research to reduce health disparities. Journal of Health Communication, 15(S2), 34–41.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67.
- Specia, L. (2010). Translating from complex to simplified sentences. Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language, 30–39.
- Weng, W. H., Marshall, I. J., Hsu, J., & Wei, C. H. (2022). MedSimplify: A medical text simplification dataset and benchmark for evaluating readability in health communication. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Xu, W., Napoles, C., Pavlick, E., Chen, Q., & Callison-Burch, C. (2016). Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4, 401–415.
Table 1. Automated evaluation results (average scores by model).
| Model | FKGL↓ | SARI↑ | BERTScore↑ | BLEU↑ |
|---|---|---|---|---|
| GPT-3.5 | 7.6 | 39.8 | 0.876 | 43.1 |
| GPT-4 | 6.2 | 45.5 | 0.895 | 47.9 |
| BioBERT | 9.1 | 31.2 | 0.904 | 36.3 |
| PubMedGPT | 8.7 | 33.9 | 0.911 | 38.5 |
| Human Reference | 5.9 | 52.0 | 0.920 | — |
Table 2. Human evaluation ratings by model (1–5 scale).
| Model | Simplicity | Medical Accuracy | Fluency | Comprehension |
|---|---|---|---|---|
| GPT-3.5 | 3.9 | 4.1 | 4.4 | 4.2 |
| GPT-4 | 4.4 | 4.5 | 4.7 | 4.6 |
| BioBERT | 3.1 | 4.8 | 3.8 | 3.5 |
| PubMedGPT | 3.3 | 4.7 | 3.9 | 3.7 |
| Human | 4.8 | 4.9 | 4.8 | 4.9 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).