Submitted:
09 December 2024
Posted:
10 December 2024
Abstract
Keywords:
1. Introduction
2. Challenges in Automatic Recognition of Non-Contemporary Handwritten Texts: An Experiment with HTR
- Transforming images of handwritten text into machine readable format has an important amount of application scenarios, such as historical documents, mail-room processing, administrative documents, etc. But the inherent high variability of handwritten text, the myriad of different writing styles and the amount of different languages and scripts, make HTR an open research problem that is still challenging.
- The automatic generation of accurate, machine-readable transcriptions of digital images of handwritten material has long been an ideal of both researchers and institutions, and it is understood that with successful Handwritten Text Recognition (HTR) the next generation of digitized manuscripts promises to yet again extend and revolutionize the study of historical handwritten documents.
2.1. The Experiment
- (1) the construction of two HTR models (one in the Transkribus tool and the other in the Lapelinc Transcriptor tool) with the same training base (a model corpus);
- (2) the application of these models to a target corpus for testing, with documents that were not in the model corpus;
- (3) the manual transcription of the documents, to evaluate the automatic recognition performed by each tool against human recognition.
2.2. Procedures
- The Model set - documents used to train the model; and
- The Target set - documents used to apply the model and study the results in detail.
- The Character Error Rate (CER) compares, for a given page, the total number of characters (n), including spaces, to the minimum number of insertions (i), substitutions (s) and deletions (d) of characters that are required to obtain the Ground Truth result (READ-COOP, 2021).
- a) The automatic transcription by Transkribus, produced with the model trained on the model corpus;
- b) The automatic transcription by Lapelinc Transcriptor, produced with the model trained on the model corpus;
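The Character Error Rate defined in this section can be written as $\mathrm{CER} = (i + s + d) / n$, where $n$ is the number of characters in the Ground Truth and $i + s + d$ is the minimum number of edits. The following is a minimal sketch of that computation, assuming a plain Levenshtein edit distance; the function names `levenshtein` and `cer` are illustrative and are not taken from Transkribus or Lapelinc Transcriptor, whose internal implementations may differ.

```python
def levenshtein(ground_truth: str, hypothesis: str) -> int:
    """Minimum number of character insertions, substitutions and deletions
    needed to turn `hypothesis` into `ground_truth`."""
    # prev[j] holds the distance between the processed prefix of
    # ground_truth and the first j characters of hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, g in enumerate(ground_truth, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if g == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by the number of
    Ground Truth characters (spaces included)."""
    n = len(ground_truth)
    if n == 0:
        raise ValueError("Ground Truth must not be empty")
    return levenshtein(ground_truth, hypothesis) / n


if __name__ == "__main__":
    # One substituted character in an eight-character word.
    print(cer("nouembro", "pouembro"))
```

Because spaces count as characters, a segmentation difference such as "dagraça" against "da graça" contributes one edit in this sketch.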
3. The Computational and the Palaeographical Perspectives: Evaluating Results
3.1. Difference Rates Between Computational and Human-Made Transcriptions
3.2. Palaeographic Analysis
Palaeographic Analysis: Opening Protocol
Palaeographic Analysis Meets the Computational Results
- a cursively written word cannot be recognized without being segmented and cannot be segmented without being recognized.
- In the literature, two main approaches can be identified: the global approach and the segmentation approach. The global approach entails the recognition of the whole word while the segmentation approach requires that each word has to be segmented into letters.
4. Final Remarks
| 1 | More information on the website: https://readcoop.eu/success-stories/. Some of the languages mentioned in the projects are: English, Swedish, Finnish, Dutch, German, Ottoman Turkish, Spanish and French. |
| 2 | See also the ‘General Portuguese Free Public AI Model for Handwritten Text Recognition with Transkribus’, developed by Lucia Werneck Xavier, cf. https://readcoop.eu/model/general-portuguese/. |
| 3 | See also ‘TraPrInq - Transcribing the court records of the Portuguese Inquisition (1536-1821)’, developed in 2023, cf. https://readcoop.eu/success-stories/uncovering-the-secrets-of-the-portuguese-inquisition-with-herve-baudry/. |
| 4 | Developed by ReadCoop, a European Cooperative Society. |
| 5 | Developed by LaPeLinC, Corpus Linguistic Laboratory of the State University of Southern Bahia, Brazil. |
| 6 | The Portuguese Inquisition was active across four centuries, from 1536 to 1821. During this period and afterwards, some documents were destroyed while others survived. The collection “Tribunal do Santo Ofício”, housed at the Arquivo Nacional Torre do Tombo, consists of 3,004 books, 79,337 processes, 624 boxes and 329 bundles. |
| 7 | Tribunal do Santo Ofício (TSO). Processo de Antónia de Barros. Salvador, 1591. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/01279. https://digitarq.arquivos.pt/details?id=2301167 |
| 8 | Tribunal do Santo Ofício (TSO). Processo de Guiomar Lopes. Salvador, 1591. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/01273. https://digitarq.arquivos.pt/details?id=2301160 |
| 9 | Tribunal do Santo Ofício (TSO). Denúncias contra Francisca Luís. Salvador, 1592. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/CX1579/13787. http://digitarq.arquivos.pt/details?id=4510000 |
| 10 | Tribunal do Santo Ofício (TSO). Processo de Guiomar Piçarra. Salvador, 1592. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/01275. http://digitarq.arquivos.pt/details?id=2301163 |
| 11 | Tribunal do Santo Ofício (TSO). Processo de Maria Pinheira. Salvador, 1592. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/10753. http://digitarq.arquivos.pt/details?id=2310930 |
| 12 | Tribunal do Santo Ofício (TSO). Processo de D. Catarina Quaresma. Salvador, 1593. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/01289. http://digitarq.arquivos.pt/details?id=2301177 |
| 13 | Tribunal do Santo Ofício (TSO). Processo de Maria Álvares. Salvador, 1593. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/10754. http://digitarq.arquivos.pt/details?id=2310931 |
| 14 | Tribunal do Santo Ofício (TSO). Processo de Francisco Martins e Isabel de Lamas. Olinda, 1594. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/09480. http://digitarq.arquivos.pt/details?id=2309626 |
| 15 | Tribunal do Santo Ofício (TSO). Processo de Felícia Tourinha. [Olinda], 1595. Arquivo Nacional Torre do Tombo; PT/TT/TSO-IL/028/01268. http://digitarq.arquivos.pt/DetailsForm.aspx?id=2301155 |
| 16 | This digital version (before the revision made for this experiment) is published at http://map.prp.usp.br/Corpus/FL, as part of the Project ‘Mulheres na América Portuguesa’ [5]. |
| 17 | |
| 18 | Please note that for this folio, the header (line 1 – ‘confissaõ de guiomar lopez’) was not included in the automatic transcription. |
References
- Aiello, K. D. and Simeone, M. (2019). Triangulation of History Using Textual Data. Isis, volume 110, number 3. Available at https://www.journals.uchicago.edu/doi/10.1086/705541 (accessed 7 February 2023).
- Costa, B. S., Costa, A. S., Santos, J. V. and Namiuti, C. (2021). LaPeLinC Transcriptor: Um software para a transcrição paleográfica de documentos históricos. Presented at V Congresso Internacional de Linguística Histórica / V International Conference on Philology and Diachronic Linguistics - CILH; Campinas, Brazil.
- Costa, B. S., Santos, J. V. and Namiuti, C. (2022). Uma proposta metodológica para a construção de corpora através de estruturas de trabalho: o Lapelinc Framework. Revista Brasileira em Humanidades Digitais; 1(2). Available at http://abhd.org.br/ojs2/ojs-3.3.0-9/index.php/rbhd/article/view/37 (accessed 7 February 2023).
- Dimond, T. L. (1957). Devices for Reading Handwritten Characters. In: Proceedings of the Eastern Computer Conference, IRE-ACM-AIEE '57. Association for Computing Machinery, New York. Pages 232–237. Available at https://dl.acm.org/doi/10.1145/1457720.1457765 (accessed 20 February 2023).
- Duarte, L. F. (1997). Glossário de Crítica Textual. Universidade Nova de Lisboa, 1997. Available at http://www2.fcsh.unl.pt/invest/glossario/glossario.htm (accessed 17 May 2024).
- Gatos, B., Ntzios, K., Pratikakis, I. et al. (2006). An efficient segmentation-free approach to assist old Greek handwritten manuscript OCR. Pattern Anal Applic 8, 305–320.
- Houghton, H. A. G. (2019). Electronic Transcriptions of New Testament Manuscripts and their Accuracy, Documentation and Publication. In Hamidović, D., Clivaz, C., Savant, S. B., and Marguerat, A. Ancient Manuscripts in Digital Culture: Visualisation, Data Mining, Communication. BRILL, 3, 2019, Digital Biblical Studies. Available at https://hal.science/hal-02477210 (accessed 9 February 2023).
- Humbel, M. and Nyhan, J. (2019). The application of HTR to early-modern museum collections: a case study of Sir Hans Sloane's Miscellanies catalogue. Conference: DH 2019. Available at https://discovery.ucl.ac.uk/id/eprint/10072160/ (accessed 9 February 2023).
- Kang, L., Toledo, J. I., Riba, P., Villegas, M., Fornés, A. and Rusiñol, M. (2019). Convolve, Attend and Spell: An Attention-based Sequence-to-Sequence Model for Handwritten Word Recognition. In Brox, T., Bruhn, A. and Fritz, M. (eds.), Pattern Recognition - 40th German Conference, GCPR 2018, Proceedings. Springer International Publishing.
- Kahle, P., Colutto, S., Hackl, G., and Mühlberger, G. (2017). Transkribus - A Service Platform for Transcription, Recognition and Retrieval of Historical Documents. 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 2017, pp. 19-24, doi: 10.1109/ICDAR.2017.307 (accessed 7 March 2024).
- Leiva, L. A., Romero, V., Toselli, A. H., and Vidal, E. (2011). Evaluating an Interactive-Predictive Paradigm on Handwriting Transcription: A Case Study and Lessons Learned. In Proceedings of Annual IEEE International Computer Software and Applications Conference (COMPSAC), pp. 610–17.
- Magalhães, L. B. S. and Xavier, L. F. W. (2021). Can Machines think? Por uma paleografia digital para textos em Língua Portuguesa. In: Alícia Duhá Lose, Lívia Borges Souza Magalhães, Vanilda Salignac Mazzon. (Org.). Paleografia e suas interfaces. 1st ed. Salvador: Memória e Arte, p. 259-269.
- Magalhães, L. B. S. (2021). Da 1.0 até a 3.0: a jornada da Paleografia no mundo digital. LaborHistórico, v. 7, p. 279-295.
- Marcocci, G. (2010). Toward a History of the Portuguese Inquisition Trends in Modern Historiography (1974-2009). Revue de l’histoire des religions [Online], 3. DOI: 10.4000/rhr.7622. Available at http://journals.openedition.org/rhr/7622.
- Muehlberger, G., Seaward, L., Terras, M., Ares Oliveira, S., Bosch, V., Bryan, M., et al. (2019). Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study. Journal of Documentation [Internet]; 75(5):954-76. Available at https://www.emerald.com/insight/content/doi/10.1108/JD-07-2018-0114/full/html.
- Nockels, J., Gooding, P., Ames, S., Terras, M. (2022). Understanding the application of handwritten text recognition technology in heritage contexts: a systematic review of TRANSKRIBUS in published research. Arch Sci 22, 367–392. https://doi.org/10.1007/s10502-022-09397-0 (accessed 7 February 2023).
- Oliveira, S. A., Seguin, B. and Kaplan, F. (2018). dhSegment: A Generic Deep-Learning Approach for Document Segmentation. 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, 2018, pp. 7-12. DOI: https://doi.org/10.1109/ICFHR-2018.2018.00011 (accessed 20 February 2023).
- PORTUGAL (2011). Tribunal do Santo Ofício Fond. https://digitarq.arquivos.pt/details?id=2299703.
- READ-COOP (2021). Character Error Rate (CER). TRANSKRIBUS Glossary. Available at https://readcoop.eu/glossary/character-error-rate-cer (accessed 9 February 2023).
- Vaidya, R., Trivedi, D., Satra, S. and Pimpale, P. M. (2018). Handwritten Character Recognition Using Deep-Learning. In: Proceedings of Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, pp. 772-775.
- Romero, V., Fornés, A., Serrano, N., Sánchez, J. A., Toselli, A. H., Frinken, V., Vidal, E. and Lladós, J. (2013). The ESPOSALLES Database: An Ancient Marriage License Corpus for Off-line Handwriting Recognition. Pattern Recognition, Volume 46, Issue 6, Pages 1658–1669, 2013.
- Santos, J. V. and Namiuti, C. (2009). Memória Conquistense: recuperação de documentos oitocentistas na implementação de um corpus digital. Research Project. Universidade Estadual do Sudoeste da Bahia, Vitória da Conquista.
- Sayre, K. M. (1973). Machine recognition of handwritten words: A project report. Pattern Recognition, Pergamon Press; 5 (3): 213–228.
- Terras, M. (2022). Inviting AI into the Archives: The Reception of Handwritten Recognition Technology into Historical Manuscript Transcription. In Jaillant, L. (ed.), Archives, Access and Artificial Intelligence: Working with Born-Digital and Digitized Archival Collections. Bielefeld: Bielefeld University Press.
- Toledo Neto, S. de A. (2020). Um caminho de retorno como base: proposta de normas de transcrição para textos manuscritos do passado. Travessias Interativas; 10(20):192–208. https://seer.ufs.br/index.php/Travessias/article/view/13959.
- Toselli, A. H., Leiva, L. A., Bordes-Cabrera, I., Hernández-Tornero, C., Bosch, V. and Vidal, E. (2018). Transcribing a 17th-century botanical manuscript: Longitudinal evaluation of document layout detection and interactive transcription, Digital Scholarship in the Humanities, Volume 33, Issue 1, April 2018, Pages 173–202.

| Document Title, Year | Pages | Lines | Tokens |
|---|---|---|---|
| Processo de Antónia de Barros, 1591 | 8 | 152 | 712 |
| Denúncias contra Francisca Luís, 1592 | 13 | 255 | 1,406 |
| Processo de Maria Pinheira, 1592 | 22 | 440 | 2,161 |
| Processo de D. Catarina Quaresma, 1593(*) | 7 | 254 | 1,428 |
| Processo de Maria Álvares, 1593(*) | 7 | 177 | 973 |
| Processo de Felícia Tourinha, 1595 | 15 | 296 | 1,753 |
| Total | 72 | 1,390 | 8,349 |

| Document Title, Year | Pages | Lines | Tokens |
|---|---|---|---|
| Processo de Guiomar Lopes, 1591(*) | 3 | 150 | 1,090 |
| Processo de Guiomar Piçarra, 1592 | 6 | 130 | 688 |
| Processo de Fco. Martins e Isabel de Lamas, 1594 | 6 | 135 | 867 |
| Total | 15 | 415 | 2,645 |

| Software | Error rate | Accuracy |
|---|---|---|
| Transkribus | 0.97% | 99.03% |
| Lapelinc Transcriptor | 7.69% | 92.41% |

| Difference type | Automatic transcription | Human-made transcription |
|---|---|---|
| Different characters | pouembro | nouembro |
| Different segmentation | dagraça pernaõ buco | da graça pernaõbuco |
| Missing or added characters | hos Jssabel | lhos Jsabel |

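The three difference types in the table above can be told apart with a simple heuristic. The sketch below is only an illustration: the function `classify_difference` and its rules are assumptions made for this example, not the procedure used in the experiment, and it expects a pair of already aligned fragments (one from the automatic transcription, one from the human-made transcription).

```python
def classify_difference(automatic: str, manual: str) -> str:
    """Assign an aligned pair of fragments to one of the difference types.

    Heuristic (an assumption for illustration only):
    - same characters once spaces are ignored -> different segmentation
    - same number of non-space characters     -> different characters
    - otherwise                               -> missing or added characters
    """
    if automatic == manual:
        return "identical"
    auto_chars = automatic.replace(" ", "")
    manual_chars = manual.replace(" ", "")
    if auto_chars == manual_chars:
        return "different segmentation"
    if len(auto_chars) == len(manual_chars):
        return "different characters"
    return "missing or added characters"


if __name__ == "__main__":
    pairs = [
        ("pouembro", "nouembro"),
        ("dagraça", "da graça"),
        ("pernaõ buco", "pernaõbuco"),
        ("hos", "lhos"),
        ("Jssabel", "Jsabel"),
    ]
    for automatic, manual in pairs:
        print(f"{automatic!r} / {manual!r}: {classify_difference(automatic, manual)}")
```

The heuristic works on one aligned word or word group at a time; a fragment that mixes several difference types would need a finer alignment than this sketch provides.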
| Document | Differences | Characters | Difference rate |
|---|---|---|---|
| Guiomar Piçarra | 97 | 3,659 | 2.65% |
| Isabel de Lamas | 162 | 4,472 | 3.62% |
| Guiomar Lopes | 215 | 5,500 | 3.91% |
| Overall | 474 | 13,631 | 3.48% |

| Document | Differences | Characters | Difference rate |
|---|---|---|---|
| Guiomar Piçarra | 83 | 3,659 | 2.27% |
| Isabel de Lamas | 122 | 4,472 | 2.73% |
| Guiomar Lopes | 237 | 5,500 | 4.31% |
| Overall | 442 | 13,631 | 3.24% |

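The difference rates in the table above follow from dividing the number of differences by the number of characters in each document. A minimal sketch of that arithmetic, using the values from the table; the names `difference_rate` and `rows` are illustrative assumptions.

```python
def difference_rate(differences: int, characters: int) -> float:
    """Percentage of characters that differ from the human-made transcription."""
    return 100 * differences / characters


# (differences, characters) per document, copied from the table above.
rows = {
    "Guiomar Piçarra": (83, 3659),
    "Isabel de Lamas": (122, 4472),
    "Guiomar Lopes": (237, 5500),
}

if __name__ == "__main__":
    for name, (differences, characters) in rows.items():
        print(f"{name}: {difference_rate(differences, characters):.2f}%")

    # The overall rate is computed from the summed counts,
    # not as the mean of the three per-document rates.
    total_differences = sum(d for d, _ in rows.values())
    total_characters = sum(c for _, c in rows.values())
    print(f"Overall: {difference_rate(total_differences, total_characters):.2f}%")
```

Summing the counts before dividing weights each document by its length, which is why the overall figure is not the plain average of the three rates.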
| Document | Manual model | Transkribus automatic transcription | Reduction |
|---|---|---|---|
| Guiomar Piçarra | 688 | 652 | 5.23% |
| Isabel de Lamas | 867 | 821 | 5.30% |
| Guiomar Lopes | 1,090 | 1,037 | 4.86% |
| Overall | 2,645 | 2,510 | 5.10% |

| Document | Manual model | Lapelinc Transcriptor automatic transcription | Reduction |
|---|---|---|---|
| Guiomar Piçarra | 688 | 628 | 8.72% |
| Isabel de Lamas | 867 | 799 | 7.84% |
| Guiomar Lopes | 1,090 | 989 | 9.26% |
| Overall | 2,645 | 2,416 | 8.65% |

| Document | Characters | Lines | Characters per line | Density | Transkribus difference rate | Lapelinc Transcriptor difference rate |
|---|---|---|---|---|---|---|
| Guiomar Piçarra | 3,659 | 130 | 28 | -5 | 2.65% | 2.27% |
| Isabel de Lamas | 4,472 | 135 | 33 | 0 | 3.62% | 2.73% |
| Guiomar Lopes | 5,500 | 150 | 37 | 4 | 3.91% | 4.31% |

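The per-line figures in the table above can be reproduced from the character and line counts. The sketch below also shows one reading of the Density column that matches the tabulated values: the deviation of each document's rounded characters-per-line figure from the median of the three. That interpretation is an assumption made here, since the table itself does not define Density, and the variable names are illustrative.

```python
from statistics import median

# (characters, lines) per document, copied from the table above.
documents = {
    "Guiomar Piçarra": (3659, 130),
    "Isabel de Lamas": (4472, 135),
    "Guiomar Lopes": (5500, 150),
}

# Characters per line, rounded to the nearest integer as in the table.
chars_per_line = {
    name: round(characters / lines)
    for name, (characters, lines) in documents.items()
}

# Assumed reading of "Density": distance from the median document.
reference = median(chars_per_line.values())
density = {name: value - reference for name, value in chars_per_line.items()}

if __name__ == "__main__":
    for name in documents:
        print(f"{name}: {chars_per_line[name]} characters per line, "
              f"density {density[name]:+d}")
```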
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).