Submitted:
02 July 2025
Posted:
03 July 2025
Abstract
Keywords:
1. Introduction
- We introduce a practical framework for classifying outpatient messages using both domain-specific and general-purpose LLMs.
- We conduct a comparative evaluation of BioBERT, ClinicalBERT, and GPT-4o across urgency detection and multi-label categorization tasks.
- We present insights on the fine-tuning process within a secure hospital cloud environment, highlighting accuracy gains, ethical safeguards, and deployment considerations.
2. Motivation
- Operational Impact: By automating classification, we can reduce administrative workload and improve response times.
- Clinical Relevance: Accurate triage ensures that critical health issues are identified and escalated without delay.
- Technological Advancement: We test the hypothesis that a fine-tuned general-purpose LLM can outperform existing domain-specific models in nuanced, multi-label classification tasks.
3. Methodology
3.1. Dataset Description
- Total messages: 120.
- Average message length: 102.04 words.
- Category Distribution: The 12 categories were unevenly distributed across the dataset.
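The corpus statistics above can be reproduced with a few lines of code. The sample messages below are illustrative placeholders, not the study's (non-public) 120-message dataset:

```python
from collections import Counter

# Illustrative placeholder messages; the real dataset is not reproduced here.
messages = [
    ("Please refill my blood pressure meds.", ["Refills"]),
    ("Can I schedule my next follow-up visit?", ["Appointment"]),
    ("Did Dr. Smith review my blood test?", ["Test Results Question"]),
]

# Average length in words, and how often each category occurs.
avg_len = sum(len(text.split()) for text, _ in messages) / len(messages)
category_counts = Counter(cat for _, cats in messages for cat in cats)

print(f"Total messages: {len(messages)}")
print(f"Average message length: {avg_len:.2f} words")
print(f"Category distribution: {dict(category_counts)}")
```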
3.2. Prompting Approach
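The study's exact prompt is not reproduced above; the sketch below shows one plausible way to assemble a classification prompt covering the three subtasks defined in Section 3.3.2 (the wording and instruction framing are our assumptions, not the paper's prompt):

```python
# The ten relevant categories listed in Section 3.3.2.
CATEGORIES = [
    "Appointment", "Referral Question", "Request an Update to my Medical Record",
    "Visit Follow-Up Question", "General Question", "Therapy Question",
    "Refills", "Other", "Test Results Question", "Prescription Question",
]

def build_prompt(message: str) -> str:
    """Assemble a hypothetical prompt asking for urgency, category type,
    and relevant categories in one pass."""
    return (
        "Classify the following patient portal message.\n"
        "1. Urgency: Urgent or Non-Urgent.\n"
        "2. Category type: Single or Multiple.\n"
        f"3. Relevant categories (choose from: {', '.join(CATEGORIES)}).\n\n"
        f"Message: {message}"
    )

print(build_prompt("Please refill my blood pressure meds."))
```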
3.3. Model Comparison and Selection
- BioBERT [11] – A biomedical domain adaptation of BERT, pre-trained on PubMed abstracts and PMC full-text articles. It has been widely used in medical NLP tasks such as named entity recognition and relation extraction.
- ClinicalBERT [12] – A model adapted from BERT, further pre-trained on MIMIC-III clinical notes, making it specialized for handling electronic health record (EHR) data.
- GPT-4o (OpenAI, 2024) [20] – OpenAI’s multimodal generative model, trained on a broader, generalized dataset spanning medical, technical, and conversational domains.

3.3.1. Factors Affecting Model Performance
- Pretraining Focus: BioBERT was trained mainly for biomedical literature processing, making it effective for medical terminology but less adaptable to informal, multi-intent patient messages.
- Data Domain Limitations: ClinicalBERT was trained on structured EHR notes, which differ significantly from the short, often unstructured nature of patient communication in hospital portals.
- Contextual Understanding: Both models rely on BERT-style masked language modeling, which, while effective for extraction-based NLP tasks, may be less suited for complex, multi-class classification without additional fine-tuning.
3.3.2. Selection for Fine-Tuning
- Urgency Classification: Each message was classified as either Urgent or Non-Urgent.
- Category Type: Messages were identified as involving either a Single or Multiple category.
- Relevant Categories: Messages were further classified into one or more of the following categories:
  - Appointment
  - Referral Question
  - Request an Update to my Medical Record
  - Visit Follow-Up Question
  - General Question
  - Therapy Question
  - Refills
  - Other
  - Test Results Question
  - Prescription Question
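A multi-label assignment over these ten categories is conveniently represented as a multi-hot vector, as in this sketch (the helper name is ours, not the paper's):

```python
# The ten relevant categories in the order listed above.
CATEGORIES = [
    "Appointment", "Referral Question", "Request an Update to my Medical Record",
    "Visit Follow-Up Question", "General Question", "Therapy Question",
    "Refills", "Other", "Test Results Question", "Prescription Question",
]

def to_multi_hot(labels: list[str]) -> list[int]:
    """Encode a set of category labels as a 10-dimensional 0/1 vector."""
    present = set(labels)
    return [1 if c in present else 0 for c in CATEGORIES]

# A message asking for a refill and about test results spans two categories.
vec = to_multi_hot(["Refills", "Test Results Question"])
print(vec)
```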
3.4. Fine-Tuning and Non-Fine-Tuning Approaches
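The paper's fine-tuning data is not reproduced here, but OpenAI's chat fine-tuning API expects JSON Lines files in which each record holds a `messages` array, so one labeled training example might be serialized as follows (the system-prompt wording and the assistant's output schema are assumptions for illustration):

```python
import json

def make_training_example(message: str, urgency: str, categories: list[str]) -> str:
    """Serialize one labeled message as a JSONL line in the chat format
    used for GPT fine-tuning (field wording is hypothetical)."""
    record = {
        "messages": [
            {"role": "system", "content": "Classify patient portal messages."},
            {"role": "user", "content": message},
            {"role": "assistant",
             "content": json.dumps({"urgency": urgency, "categories": categories})},
        ]
    }
    return json.dumps(record)

line = make_training_example(
    "Please refill my blood pressure meds.", "Non-Urgent", ["Refills"])
print(line)
```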
4. Results and Analysis
4.1. Performance Metrics and Their Significance
- True Positives (TP): Correctly predicted positive cases (e.g., urgent messages classified as urgent).
- True Negatives (TN): Correctly predicted negative cases (e.g., non-urgent messages classified as non-urgent).
- False Positives (FP): Incorrectly predicted positive cases (e.g., non-urgent messages classified as urgent), also known as type I error.
- False Negatives (FN): Incorrectly predicted negative cases (e.g., urgent messages classified as non-urgent), also known as type II error.
- Precision: The proportion of correctly predicted positive cases (e.g., urgent messages) out of all cases predicted as positive. A high precision value indicates that the model is good at avoiding false alarms (lower FP, i.e., lower type I error), which is essential in scenarios where misclassification can result in unnecessary interventions or inefficiencies.
- Recall: Also known as sensitivity or true positive rate, recall measures how well the model identifies actual positive cases. In the context of urgent messages, a higher recall means fewer critical cases are missed (lower FN, i.e., lower type II error), ensuring that patients requiring immediate attention receive the necessary care.
- F1-score: The harmonic mean of precision and recall, providing a balanced evaluation of the model’s performance when both false positives and false negatives need to be minimized.
- Accuracy: An overall measure of correctness, which can be misleading in imbalanced datasets where one class dominates.
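These definitions translate directly into code. The confusion-matrix counts in this sketch are illustrative, chosen so the resulting metrics happen to reproduce the urgency-prediction column of the first results table below:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# 9 urgent messages caught, 4 missed, 49 false alarms, 40 correct non-urgent.
m = classification_metrics(tp=9, tn=40, fp=49, fn=4)
print({k: round(v, 6) for k, v in m.items()})
```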
4.2. Performance Evaluation
4.3. Impact of Fine-Tuning GPT-4o
4.3.1. BioBERT vs. ClinicalBERT vs. GPT-4o
- BioBERT and ClinicalBERT showed moderate performance in Urgency Classification, but struggled with multi-label classification due to their domain-specific pretraining on structured clinical notes rather than conversational messages. Their low precision suggests a high false positive rate, meaning they frequently misclassified non-urgent messages as urgent.
- GPT-4o achieved significantly higher precision and recall, especially in multi-label classification, due to its broader training on diverse datasets. The improvement in F1-score confirms its balanced performance in identifying correct categories while minimizing false classifications.
- BioBERT and ClinicalBERT showed high recall but low precision, meaning they captured many actual urgent cases but also produced many false positives.
- GPT-4o exhibited both high precision and recall, demonstrating its ability to correctly classify messages while minimizing incorrect predictions.
4.4. Implications for Hospital Message Classification
- High recall is crucial for urgent messages, ensuring that no critical cases are missed.
- High precision is important for non-urgent messages, preventing unnecessary escalations.
- For multi-label classification, F1-score provides a balanced measure of performance, ensuring accurate categorization without excessive false positives.
4.5. Key Observations and Insights
- BioBERT and ClinicalBERT performed well in urgency classification but struggled with multi-label classification due to their pretraining focus on structured clinical documentation.
- GPT-4o demonstrated strong few-shot performance, highlighting its ability to generalize across medical and patient communication tasks.
- Fine-tuning GPT-4o significantly improved accuracy, particularly for multi-class and exact match classification, ensuring reliable message triage.
- The secure Azure OpenAI cloud environment provided a HIPAA-compliant infrastructure for fine-tuning without compromising patient privacy.
5. Discussion
5.1. Strengths
- High accuracy and adaptability to different message categories.
- Effective handling of multi-label classification tasks.
- Reduction in manual effort for message classification and sorting.
5.2. Limitations
- High computational cost for training and inference.
- Sensitivity to biased or imbalanced training data.
- Requirement of domain expertise for effective fine-tuning.
5.3. Ethical Considerations
- De-identification and secure handling of sensitive patient data.
- Ensuring fairness and mitigating biases in model predictions.
- Need for explainability in high-stakes decision-making.
5.4. Practical Applications
- Operational Efficiency: Integration with hospital management systems for automatic triage and sorting of the enormous volume of incoming patient messages, letting healthcare workers focus more time on patient care rather than message handling.
- Clinical Decision Support: Flagging urgent clinical messages for immediate attention.
- Patient Communication: Automated responses for frequently asked questions, enhancing patient satisfaction.
6. Conclusion and Future Work
- LLMs can effectively classify hospital messages, streamlining administrative workflows and improving patient communication.
- GPT-4o, even without fine-tuning, achieved strong performance, justifying its selection for further refinement.
- Fine-tuned GPT-4o outperformed other models, achieving high accuracy and demonstrating adaptability to real-world hospital data.
- Implementing real-time deployment to assist clinicians and administrative staff in urgent message triage.
- Developing multi-modal models that integrate structured EHR data with textual inputs for improved context understanding.
- Expanding training datasets to enhance model generalization across diverse patient demographics and healthcare institutions.
Author Contributions
Funding
Institutional Review Board Statement: Study title
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| HIPAA | The Health Insurance Portability and Accountability Act of 1996 in the United States. |
| EHR | Electronic Health Record |
| LLM | Large Language Model |
| ChatGPT | A kind of LLM [20] |
| BioBERT | A kind of LLM [11] |
| ClinicalBERT | A kind of LLM [12] |
References
- Consulting, L. Tapping Into New Potential: Realising the Value of Data in the Healthcare Sector. https://www.ibm.com/thought-leadership/institute-business-value/report/healthcare-data, 2025. Last accessed on 05-27-2025.
- Jensen, P.B.; Jensen, L.J.; Brunak, S. Mining Electronic Health Records: Towards Better Research Applications and Clinical Care. Nature Reviews Genetics 2012, 13, 395–405. [Google Scholar] [CrossRef]
- Shah, A.; Chen, B. Optimizing Healthcare Delivery: Investigating Key Areas for AI Integration and Impact in Clinical Settings. preprints 2024. [Google Scholar] [CrossRef]
- Hond, A.; et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. npj Digital Medicine 2022. [Google Scholar] [CrossRef] [PubMed]
- Paucar, E.; Paucar, H.; Paucar, D.; Paucar, G.; Sotelo, C. Artificial intelligence as an innovation tool in hospital management: a study based on the sdgs. jlsdgr 2024, 5, e04089. [Google Scholar] [CrossRef]
- Lorencin, I.; Tanković, N.; Etinger, D. Optimizing healthcare efficiency with local large language models 2025. 160. [CrossRef]
- Nashwan, A.; Abujaber, A. Harnessing the power of large language models (llms) for electronic health records (ehrs) optimization. Cureus 2023. [Google Scholar] [CrossRef] [PubMed]
- Smith, J.; Doe, J. Automated Classification of Clinical Text using Machine Learning. Journal of Medical Informatics 2019, 36, 123–135. [Google Scholar]
- Jones, R.; White, S. Medical Text Classification Using Deep Learning Techniques. Artificial Intelligence in Medicine 2020, 45, 210–222. [Google Scholar]
- Brown, E.; Taylor, M. Deep Learning Models for Electronic Health Record Classification. IEEE Transactions on Biomedical Engineering 2021, 68, 345–357. [Google Scholar]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
- Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, H.; Naumann, T.; McDermott, M.B. Publicly Available Clinical BERT Embeddings. arXiv preprint arXiv:1904.03323, 2019. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. GPT-4: OpenAI’s Multimodal Large Language Model. OpenAI Technical Report 2023. [Google Scholar]
- Wu, P.; Kumar, A.; Smith, L. Comparing Domain-Specific and Generalist Large Language Models for Medical Text Classification. Journal of Artificial Intelligence in Healthcare 2023, 47, 102–118. [Google Scholar]
- OpenAI. GPT-4o: OpenAI’s latest multimodal model, 2024. Available at: https://openai.com/research/gpt-4o.
- Jianing Qiu, Meng Jiang, T.Z. Large Language Models for Medical Applications: Challenges and Future Directions. IEEE Transactions on Neural Networks and Learning Systems 2023, 34, 1234–1248. [Google Scholar] [CrossRef]
- Hong Zhang, Xiaoyu Liu, J. P. Generative AI for Clinical Report Generation: A Systematic Review. Journal of Medical Informatics 2023, 45, 678–692. [Google Scholar] [CrossRef]
- Xiang Dai, Yujie Qian, F. L. Automating Healthcare Administrative Workflows with Large Language Models. Artificial Intelligence in Medicine 2023, 147, 102481. [Google Scholar] [CrossRef]
- HIPAA for Professionals. https://www.hhs.gov/hipaa/for-professionals/index.html. Last accessed date: 03-14-2025.
- GPT-4o. https://openai.com/index/hello-gpt-4o/. Last accessed date: 05-21-2025.

| Message | Urgent/Non-urgent | Single/Multiple Category | Relevant Category |
|---|---|---|---|
| I’m feeling chest tightness and need to know if I should go to the ER or wait. | Urgent | Single | Urgent Medical Question |
| Please refill my blood pressure meds and let me know if Dr. Smith reviewed my blood test. | Non-Urgent | Multiple | Refills, Test Results |
| Can I schedule my next follow-up visit for a diabetes check-up? | Non-Urgent | Single | Appointment |
| Metric | Urgency Prediction | Multiple Categories Prediction |
|---|---|---|
| Accuracy | 0.480392 | 0.617647 |
| Precision | 0.155172 | 0.371429 |
| Recall | 0.692308 | 0.433333 |
| F1-Score | 0.253521 | 0.400000 |
| Metric | Urgency Prediction | Multiple Categories Prediction |
|---|---|---|
| Accuracy | 0.225490 | 0.441176 |
| Precision | 0.125000 | 0.324675 |
| Recall | 0.846154 | 0.833333 |
| F1-Score | 0.217822 | 0.467290 |
| Metric | Urgency Prediction | Multiple Categories Prediction |
|---|---|---|
| Accuracy | 0.803922 | 0.588235 |
| Precision | 0.379310 | 0.416667 |
| Recall | 0.846154 | 1.000000 |
| F1-Score | 0.523810 | 0.588235 |
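The reported F1 values in the three tables above can be sanity-checked against the precision and recall in the same columns, since F1 is their harmonic mean:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) from the urgency-prediction columns above.
rows = [
    (0.155172, 0.692308, 0.253521),
    (0.125000, 0.846154, 0.217822),
    (0.379310, 0.846154, 0.523810),
]
for p, r, reported in rows:
    assert abs(f1(p, r) - reported) < 1e-4, (p, r)
print("All reported F1 scores match 2PR/(P+R).")
```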
| Task | Expected Output | Training Accuracy | Testing Accuracy |
|---|---|---|---|
| Urgency Classification | Yes / No | 0.9687 | 0.9524 |
| Multi-Class Categorization | Yes / No | 0.9619 | 1.0000 |
| Full Message Categorization | Urgency (yes/no), Multi-class (yes/no), Respective class name(s) | 0.9514 | 0.9747 |
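The testing accuracy for full message categorization presumably counts exact matches of the complete structured output (urgency flag, multi-class flag, and category set). A hedged sketch of such a scorer, with field names that are ours rather than the paper's:

```python
def exact_match_accuracy(predictions: list[dict], references: list[dict]) -> float:
    """Fraction of messages where urgency, the multi-class flag, and the
    full category set are all predicted exactly right."""
    hits = sum(
        p["urgent"] == r["urgent"]
        and p["multiple"] == r["multiple"]
        and set(p["categories"]) == set(r["categories"])
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

refs = [
    {"urgent": False, "multiple": True, "categories": ["Refills", "Test Results Question"]},
    {"urgent": True, "multiple": False, "categories": ["Appointment"]},
]
preds = [
    {"urgent": False, "multiple": True, "categories": ["Test Results Question", "Refills"]},
    {"urgent": False, "multiple": False, "categories": ["Appointment"]},
]
# The first prediction matches exactly (category order ignored); the second
# gets the urgency flag wrong.
print(exact_match_accuracy(preds, refs))
```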
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).