Abstract: The complexity of medical information presents a significant barrier to effective health communication, particularly for individuals with low health literacy. As global health systems increasingly rely on digital communication and patient-facing resources, the ability to simplify medical texts without compromising accuracy has become a critical public health objective. This study explores the potential of large language models (LLMs), particularly transformer-based architectures, to automate the simplification of health literacy materials. We focus on evaluating their performance in reducing linguistic complexity while preserving core medical semantics and factual consistency. The research begins with the development of a benchmark dataset comprising public-domain health documents sourced from organizations such as the Centers for Disease Control and Prevention (CDC), the World Health Organization (WHO), and MedlinePlus. Original materials are paired with human-written simplifications at a readability level suitable for the general public. This curated corpus serves as both a training and an evaluation resource for assessing the capabilities of general-purpose models such as GPT-3.5 and GPT-4, as well as domain-adapted variants such as BioBERT and PubMedGPT, on sentence- and paragraph-level simplification tasks. We implement a multi-faceted evaluation pipeline combining automated metrics (e.g., Flesch-Kincaid Grade Level, SARI, BLEU, and BERTScore) with human evaluations conducted by public health communication experts and linguistic annotators. These evaluations focus on four key dimensions: linguistic simplicity, medical accuracy, grammatical fluency, and reader comprehension. Our findings reveal that while general-purpose LLMs excel at reducing sentence complexity and improving fluency, they occasionally introduce semantic shifts or omit critical health information. Domain-adapted models, though more semantically faithful, tend to produce less readable outputs because they retain technical jargon. To address this trade-off, we further explore ensemble and prompt-engineering strategies, including few-shot examples that guide models toward simplified outputs with greater semantic fidelity. In addition, we examine the potential of reinforcement learning from human feedback (RLHF) to iteratively fine-tune model outputs toward user-specific readability targets. The results suggest that hybrid approaches combining domain knowledge with large-scale language generation offer the most promising path forward. This research contributes to the growing fields of health informatics and natural language processing by providing a comprehensive assessment of LLM capabilities in the context of health literacy, and it delivers a reproducible framework and benchmark dataset for future investigations. Importantly, the study maintains strict ethical compliance by using only publicly available documents and excluding patient data entirely, ensuring both methodological transparency and societal relevance. The findings have significant implications for developers of digital health tools, public health educators, and healthcare institutions aiming to democratize access to critical medical information.
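As a rough illustration of the automated portion of the evaluation pipeline described above, the sketch below computes Flesch-Kincaid Grade Level, SARI, BLEU, and BERTScore for a single source/simplification pair. The example sentences and the choice of the textstat, evaluate, sacrebleu, and bert_score packages are illustrative assumptions, not the study's exact tooling.

```python
# Illustrative sketch of the automated metric computation (assumed tooling:
# pip install textstat evaluate sacrebleu bert_score).
import textstat
import evaluate
import sacrebleu
from bert_score import score as bert_score

# Hypothetical example: one original sentence, a model simplification,
# and a human-written reference simplification.
source = ("Hypertension, if left untreated, substantially elevates the risk "
          "of myocardial infarction and cerebrovascular events.")
prediction = ("If high blood pressure is not treated, it greatly raises the "
              "chance of heart attacks and strokes.")
reference = ("Untreated high blood pressure greatly increases the risk of "
             "heart attack and stroke.")

# Readability of the simplified output (lower grade level = easier to read).
fkgl = textstat.flesch_kincaid_grade(prediction)

# SARI rewards edits that move the prediction away from the source and
# toward the human reference.
sari = evaluate.load("sari")
sari_result = sari.compute(sources=[source],
                           predictions=[prediction],
                           references=[[reference]])

# BLEU measures n-gram overlap with the reference.
bleu = sacrebleu.corpus_bleu([prediction], [[reference]])

# BERTScore F1 approximates semantic fidelity to the reference.
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"FKGL: {fkgl:.1f}")
print(f"SARI: {sari_result['sari']:.1f}")
print(f"BLEU: {bleu.score:.1f}")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```

In the study, these scores would be aggregated over the full benchmark corpus rather than a single pair, and interpreted alongside the human evaluations of simplicity, accuracy, fluency, and comprehension.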
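The few-shot prompting strategy mentioned in the abstract can be sketched as follows; the instruction wording, the in-context example, and the model name are illustrative assumptions rather than the exact prompts or configuration used in the study.

```python
# Minimal sketch of a few-shot simplification prompt, assuming the openai
# Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

# One hypothetical in-context example showing the desired rewrite style.
FEW_SHOT_EXAMPLE = (
    "Original: Patients with type 2 diabetes should monitor glycemic levels "
    "to avoid hyperglycemia.\n"
    "Simplified: People with type 2 diabetes should check their blood sugar "
    "to keep it from getting too high.\n"
)

def simplify(text: str, model: str = "gpt-4") -> str:
    """Request a plain-language rewrite that preserves all medical facts."""
    client = OpenAI()
    prompt = (
        "Rewrite the medical text in plain language at roughly an 8th-grade "
        "reading level. Keep every medical fact; do not add or remove advice.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        f"Original: {text}\nSimplified:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```

Prompts of this form can be varied in the number of in-context examples and the target reading level, which is how the readability/fidelity trade-off discussed above would be explored.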