Submitted: 13 April 2026
Posted: 22 April 2026
Abstract

Keywords:
1. Introduction
2. Materials and Methods
2.1. Ethical Approval
2.2. Benchmark Dataset: RadGraph
2.3. Data Preprocessing
2.4. Construction and Annotation of the Tunisian Lung Cancer Corpus
- R. CLINIQUES: captures the patient's relevant clinical background, including smoking history, occupational exposures, and initial clinical investigations (fibroscopy, chest X-ray, CT scan) performed prior to or during the staging workup.
- TECHNIQUES: covers technical details of imaging procedures conducted during the staging workup, including acquisition type, technical parameters, radiation dose, and scan coverage.
- STADE: identifies the cancer stage as determined by the staging investigations, expressed using TNM classification.
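
The three entity types above could be turned into a label set for token classification, as sketched below. This is a minimal illustration that assumes a BIO tagging scheme, which the text does not specify. The helper name `build_bio_labels` and the hyphenated spelling `R-CLINIQUES` are choices made for this example, not details taken from the authors' pipeline.

```python
# Entity types described in the list above.
# ASSUMPTION: a BIO tagging scheme; the source does not state which scheme was used.
ENTITY_TYPES = ["R-CLINIQUES", "TECHNIQUES", "STADE"]


def build_bio_labels(entity_types):
    """Return the BIO label list: "O" plus a B- and an I- tag per entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity span
        labels.append(f"I-{entity}")  # any following token of the same span
    return labels


labels = build_bio_labels(ENTITY_TYPES)

# Integer mappings of the kind token-classification models usually expect.
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(labels)
```

With three entity types, this scheme yields seven labels in total, which would set the size of the classification head under the stated assumption.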
2.5. Model Selection and Architecture
2.5.1. Benchmarking Phase (RadGraph Dataset)
2.5.2. Tunisian Corpus Phase
2.6. Fine-Tuning Protocol
2.7. Evaluation Metrics
2.8. Prototype Development
2.9. AI Usage Statement
3. Results
3.1. Benchmarking Results on the RadGraph Dataset
3.2. Performance Evaluation on the Tunisian Clinical Corpus
3.3. Prototype Deployment and Clinical Interface
4. Discussion
4.1. Benchmarking Performance on RadGraph
4.2. DrBERT Performance on the Tunisian Corpus
4.3. Clinical Relevance of the Prototype
4.4. Positioning within Global Clinical NLP
4.5. Limitations
5. Conclusion
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Group | NER label | Contents |
|---|---|---|
| Clinical information | R-CLINIQUES | Smoking status; number of pack-years; occupational exposure; medical history |
| Techniques | TECHNIQUES | Acquisition type; technical parameters; anatomical region explored; contrast phase; irradiation dose; explored area |
| Conclusion | STADE | Stage |

| Model | Precision | Recall | F1-Score | Eval Loss |
|---|---|---|---|---|
| RoBERTa | 0.869 | 0.877 | 0.873 | 0.275 |
| BioClinicalBERT | 0.858 | 0.878 | 0.868 | 0.254 |
| BERT | 0.855 | 0.859 | 0.857 | 0.441 |
| CamemBERT | 0.670 | 0.695 | 0.682 | 0.529 |

| Model | Precision | Recall | F1-Score | Notes |
|---|---|---|---|---|
| DrBERT | 0.78 | 0.846 | 0.811 | French biomedical pretraining (NACHOS) |
| RoBERTa | 0.75 | 0.843 | 0.793 | Best RadGraph performer |
| BioClinicalBERT | 0.755 | 0.842 | 0.808 | Clinical domain; English pretrained |
| CamemBERT | 0.739 | 0.831 | 0.782 | General French; not biomedical |
| BERT | 0.655 | 0.713 | 0.683 | General English baseline |
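
As a reading aid for the results tables above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the table that carries the Notes column. The function name `f1_score` is illustrative, and the comparison assumes that precision, recall, and F1 were all aggregated in the same way. If the reported F1 was instead averaged per entity type, it need not match the value recomputed from aggregate precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "DrBERT": (0.78, 0.846, 0.811),
    "RoBERTa": (0.75, 0.843, 0.793),
    "BioClinicalBERT": (0.755, 0.842, 0.808),
    "CamemBERT": (0.739, 0.831, 0.782),
    "BERT": (0.655, 0.713, 0.683),
}

for model, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    # Small gaps are expected from rounding; larger ones may point to a
    # different averaging scheme (an assumption, not confirmed by the source).
    print(f"{model}: recomputed={f1:.3f} reported={f1_reported:.3f}")
```
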
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).