Preprint, Review, Version 2. Preserved in Portico. This version is not peer-reviewed.

Modern Clinical Text Mining: A Guide and Review

Version 1 : Received: 29 October 2020 / Approved: 30 October 2020 / Online: 30 October 2020 (15:01:24 CET)
Version 2 : Received: 2 February 2021 / Approved: 3 February 2021 / Online: 3 February 2021 (10:31:14 CET)

A peer-reviewed article of this Preprint also exists.

Percha, B. Modern Clinical Text Mining: A Guide and Review. Annual Review of Biomedical Data Science, 2021, 4, 165–187.


Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g. physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, it describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation at health systems and in industry.


text mining; natural language processing; electronic health records; clinical text; machine learning


Computer Science and Mathematics, Information Systems

Comments (2)

Comment 1
Received: 3 February 2021
Commenter: Bethany Percha
Commenter's Conflict of Interests: Author
Comment: This version incorporates suggested changes from reviewers and my own revisions. Most of the changes are in Sections 2 and 3 and the Conclusion. 
Comment 2
Received: 18 May 2021
Commenter's Conflict of Interests: Author
Comment: Update: a potentially important software resource for clinical text mining that did not make it into Table 1 of the manuscript is medspaCy.
