CONCEPT PAPER | doi:10.20944/preprints202006.0341.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: Hyper spectral Document Images; Non-destructive Analysis; Forensics Document; Ink Mismatch Detection; K-means Clustering
Online: 28 June 2020 (19:26:25 CEST)
Hyperspectral imaging (HSI) is a technique for obtaining the spectrum of each pixel in an image. It helps in finding objects and identifying materials, which is very difficult with other imaging techniques, and it allows researchers to investigate documents without any physical contact. Recently, HSI-based ink mismatch detection has shown vast improvement in distinguishing inks. Detecting ink mismatch when the inks are present in unequal proportions is an unbalanced clustering problem. This paper uses k-means clustering for ink mismatch detection. K-means clustering finds similar subgroups in the data based on Euclidean distance. This paper demonstrates its performance on unequal ink mismatch detection based on HSI.
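The clustering step described in this abstract can be illustrated with a minimal sketch of k-means using Euclidean distance. This is not the authors' implementation: the 3-band spectra, the 8-to-2 pixel split, and the function names are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each pixel goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical 3-band spectra: 8 pixels of one ink, 2 of another (unbalanced).
ink_a = [[0.10 + 0.002 * i, 0.20 + 0.002 * i, 0.30 - 0.002 * i] for i in range(8)]
ink_b = [[0.60, 0.70, 0.80], [0.61, 0.69, 0.81]]
centroids, labels = kmeans(ink_a + ink_b, k=2)
print(labels)
```

On this toy input the two inks are far apart in spectral space, so the minority ink still ends up in its own cluster; real hyperspectral data with overlapping spectra would be harder.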
ARTICLE | doi:10.20944/preprints202108.0360.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Table detection, table localization, deep learning, Hybrid Task Cascade, Object detection, deformable convolution, deep neural networks, computer vision, scanned document images, document image analysis.
Online: 17 August 2021 (10:26:42 CEST)
Tables in document images are among the most important entities since they contain crucial information. Therefore, accurate table detection can significantly improve information extraction from tables. In this work, we present a novel end-to-end trainable pipeline, HybridTabNet, for table detection in scanned document images. Our two-stage table detector uses the ResNeXt-101 backbone for feature extraction and Hybrid Task Cascade (HTC) to localize the tables in scanned document images. Moreover, we replace conventional convolutions with deformable convolutions in the backbone network. This enables our network to detect tables of arbitrary layouts precisely. We evaluate our approach comprehensively on ICDAR-13, ICDAR-17 POD, ICDAR-19, TableBank, Marmot, and UNLV. Apart from the ICDAR-17 POD dataset, our proposed HybridTabNet outperforms earlier state-of-the-art results without depending on pre- and post-processing steps. Furthermore, to investigate how the proposed method generalizes to unseen data, we conduct an exhaustive leave-one-out evaluation. In comparison to prior state-of-the-art results, our method reduces the relative error by 27.57% on ICDAR-2019-TrackA-Modern, 42.64% on TableBank (Latex), 41.33% on TableBank (Word), 55.73% on TableBank (Latex + Word), 10% on Marmot, and 9.67% on the UNLV dataset. The achieved results reflect the superior performance of the proposed method.
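The abstract does not define how relative error reduction is computed. A common reading, assumed here, is the fraction of the previous method's error (one minus its score) that the new method removes. The sketch below uses made-up scores, not figures from the paper.

```python
def relative_error_reduction(previous_score, new_score):
    """Fraction of the previous error removed; scores are in [0, 1]."""
    previous_error = 1.0 - previous_score
    new_error = 1.0 - new_score
    if previous_error <= 0:
        raise ValueError("previous score leaves no error to reduce")
    return (previous_error - new_error) / previous_error


# Hypothetical scores, not taken from the paper: error falls from 0.10 to 0.07.
reduction = relative_error_reduction(previous_score=0.90, new_score=0.93)
print(f"{reduction:.1%}")
```

Under this definition, a small absolute gain at an already high score can still be a large relative reduction, which is why such figures look bigger than the raw score differences.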
CONCEPT PAPER | doi:10.20944/preprints202007.0084.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: Hyperspectral Imagery (HSI); Hyperspectral Document Imagery (HSDI); k-means clustering; Principal component analysis (PCA)
Online: 5 July 2020 (15:28:52 CEST)
Hyperspectral imaging provides vital information about the objects and elements present in an image, which makes it very useful in satellite imagery as well as image forensics. Hyperspectral document imagery (HSDI) can be used for document authentication through ink analysis, which can provide sufficient information about the composition and type of ink. In this project, we have implemented an HSDI-based ink classification technique using Principal Component Analysis (PCA) for dimensionality reduction and k-means clustering for ink classification. This is an unsupervised learning approach, and it is simple and efficient for classifying a limited number of bands. We have used this technique to classify 33 different bands of ink.
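As a rough illustration of the dimensionality-reduction step, the sketch below projects per-pixel spectra onto their first principal component, found by power iteration on the covariance matrix. It is a simplification under stated assumptions: the 4-band spectra and function names are hypothetical, only one component is kept, and the abstract does not say how the authors computed PCA.

```python
import math


def mean_center(rows):
    """Subtract the per-band mean from every spectrum."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (bands x bands) of mean-centered rows."""
    n = len(centered)
    dims = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def first_component(cov, iterations=200):
    """Dominant eigenvector of the covariance matrix via power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """One score per pixel: its coordinate along the component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical 4-band spectra for four pixels (two darker, two brighter).
spectra = [
    [0.10, 0.21, 0.30, 0.41],
    [0.20, 0.29, 0.41, 0.50],
    [0.60, 0.71, 0.79, 0.90],
    [0.70, 0.80, 0.90, 1.01],
]
centered = mean_center(spectra)
component = first_component(covariance(centered))
scores = project(centered, component)
print(scores)
```

The resulting one-dimensional scores could then be passed to a clustering step such as k-means in place of the full spectra.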
ARTICLE | doi:10.20944/preprints202208.0287.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: table detection; document layout analysis; continual learning; incremental learning; experience replay
Online: 16 August 2022 (10:56:59 CEST)
The growing amount of data demands methods that can gradually learn from new samples. However, it is not trivial to continually train a network. Retraining a network with new data usually results in a known phenomenon called "catastrophic forgetting": the performance of the model on the previous data drops as it learns from the new instances. This paper explores this issue in the table detection problem. While there are multiple datasets and sophisticated methods for table detection, the use of continual learning techniques in this domain has not been studied. We employed an effective technique called experience replay and performed extensive experiments on several datasets to investigate the effects of catastrophic forgetting. Results show that our proposed approach mitigates the performance drop by 15 percent. To the best of our knowledge, this is the first time that continual learning techniques have been adopted for table detection, and we hope this stands as a baseline for future research.
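The idea behind experience replay can be sketched as a small memory of earlier samples that is mixed into every batch of new data. The abstract does not say how replayed samples are chosen; reservoir sampling, the class name, and the sample identifiers below are assumptions for illustration only.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Every sample seen so far stays with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_batch, replay_count):
        """New samples followed by up to `replay_count` replayed old ones."""
        count = min(replay_count, len(self.samples))
        return list(new_batch) + self.rng.sample(self.samples, count)


# Hypothetical page identifiers from an earlier dataset.
buffer = ReplayBuffer(capacity=4)
for index in range(10):
    buffer.add(f"old_page_{index}")

# Training on a new dataset: each batch also revisits two stored pages.
batch = buffer.mixed_batch(["new_page_0", "new_page_1"], replay_count=2)
print(batch)
```

Training on such mixed batches keeps gradients from the earlier dataset in play, which is the mechanism by which replay is generally expected to limit forgetting.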
ARTICLE | doi:10.20944/preprints202201.0090.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: formula detection; Hybrid Task Cascade network; mathematical expression detection; document image analysis; deep neural networks; computer vision
Online: 6 January 2022 (12:56:23 CET)
This work presents an end-to-end trainable approach for detecting mathematical formulas in scanned document images. Since many OCR engines cannot reliably handle formulas, it is essential to isolate them to obtain clean text for information extraction from the document. Our proposed pipeline comprises a Hybrid Task Cascade network with deformable convolutions and a ResNeXt-101 backbone. Both of these modifications help in better detection. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets and achieve an overall accuracy of 96% on the ICDAR-2017 POD dataset, an overall error reduction of 13%. Furthermore, the results on the Marmot dataset are improved for both isolated and embedded formulas. We achieve an accuracy of 98.78% for isolated formulas and an overall accuracy of 90.21% for embedded formulas. Consequently, this results in an error reduction rate of 43% for isolated and 17.9% for embedded formulas.
ARTICLE | doi:10.20944/preprints202107.0165.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Formula detection; Cascade Mask R-CNN; Mathematical expression detection; document image analysis; deep neural networks; computer vision.
Online: 6 July 2021 (17:42:24 CEST)
This paper presents a novel architecture for detecting mathematical formulas in document images, which is an important step for reliable information extraction in several domains. Recently, Cascade Mask R-CNN networks have been introduced to solve object detection in computer vision. In this paper, we suggest a couple of modifications to the existing Cascade Mask R-CNN architecture: First, the proposed network uses deformable convolutions instead of conventional convolutions in the backbone network to spot areas of interest better. Second, it uses a dual backbone of ResNeXt-101, having composite connections at the parallel stages. Finally, our proposed network is end-to-end trainable. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets. The proposed approach demonstrates state-of-the-art performance on ICDAR-2017 POD at a higher IoU threshold with an f1-score of 0.917, reducing the relative error by 7.8%. Moreover, we accomplished correct detection accuracy of 81.3% on embedded formulas on the Marmot dataset, which results in a relative error reduction of 30%.
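The f1-score "at a higher IoU threshold" can be made concrete with a small sketch: a detection counts as correct only if its intersection over union (IoU) with an unmatched ground-truth box reaches the threshold. The greedy matching and the boxes below are assumptions for illustration; the official ICDAR-2017 POD evaluation protocol may differ in its details.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_threshold(predictions, ground_truth, threshold):
    """Greedy one-to-one matching, then F1 from TP / FP / FN counts."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    false_positives = len(predictions) - true_positives
    false_negatives = len(unmatched)
    denominator = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denominator if denominator else 0.0


# Hypothetical formula boxes: one tight detection, one loose, one spurious.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(0, 0, 10, 9), (21, 21, 31, 31), (50, 50, 60, 60)]
print(f1_at_threshold(predictions, ground_truth, threshold=0.6))
print(f1_at_threshold(predictions, ground_truth, threshold=0.8))
```

In this toy case the loose detection has an IoU of about 0.68, so it counts at a threshold of 0.6 but not at 0.8, and the F1 score drops accordingly. This is why results at higher IoU thresholds reward precise localization.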
ARTICLE | doi:10.20944/preprints202109.0059.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: table detection; table recognition; cascade Mask R-CNN; atrous convolution; recursive feature pyramid networks; document image analysis; deep neural networks; computer vision, object detection.
Online: 3 September 2021 (11:05:10 CEST)
Table detection is a preliminary step in extracting reliable information from tables in scanned document images. We present CasTabDetectoRS, a novel end-to-end trainable table detection framework that operates on Cascade Mask R-CNN, including a Recursive Feature Pyramid network and Switchable Atrous Convolution in the existing backbone architecture. By utilizing a comparatively lightweight backbone of ResNet-50, this paper demonstrates that superior results are attainable without relying on pre- and post-processing methods, heavier backbone networks (ResNet-101, ResNeXt-152), or memory-intensive deformable convolutions. We evaluate the proposed approach on five different publicly available table detection datasets. Our CasTabDetectoRS outperforms the previous state-of-the-art results on four datasets (ICDAR-19, TableBank, UNLV, and Marmot) and accomplishes comparable results on ICDAR-17 POD. Compared with previous state-of-the-art results, we obtain a significant relative error reduction of 56.36%, 20%, 4.5%, and 3.5% on the datasets of ICDAR-19, TableBank, UNLV, and Marmot, respectively. Furthermore, this paper sets a new benchmark by performing exhaustive cross-dataset evaluations to exhibit the generalization capabilities of the proposed method.
ARTICLE | doi:10.20944/preprints201909.0101.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Optical Character Recognition; Document Analysis; Historical Printings
Online: 9 September 2019 (12:08:16 CEST)
Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processing. A frequent drawback of these tools is their limited applicability by non-technical users like humanist scholars, in particular when several tools must be combined in a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output, but already in early stages to minimize error propagation. Furthermore, extensive configuration capabilities are provided to set the degree of automation of the workflow and to make adaptations to the carefully selected default parameters for specific printings, if necessary. Experiments showed that users with minimal or no experience were able to capture the text of even the earliest printed books with manageable effort and great quality, achieving excellent character error rates (CERs) below 0.5%. The fully automated application on 19th century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY Finereader on moderate layouts if suitably pretrained mixed OCR models are available. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components by standardized interfaces like PageXML, thus aiming at continually higher automation for historical printings.
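The character error rate (CER) reported above is commonly computed as the edit distance between the OCR output and the ground-truth transcription, divided by the length of the ground truth. The sketch below follows that common definition, which is assumed here since the abstract does not spell it out; the example strings are made up.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the ground-truth length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical ground truth and OCR output with one wrong character.
cer = character_error_rate("historical", "histor1cal")
print(cer)
```

A CER below 0.5% therefore means fewer than one erroneous character per 200 characters of ground truth under this definition.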
REVIEW | doi:10.20944/preprints202205.0004.v1
Subject: Biology, Other Keywords: COVID-19; Exploratory Search; Machine Learning; Document Retrieval
Online: 4 May 2022 (12:20:15 CEST)
The urgency of the COVID-19 pandemic caused a surge in related scientific literature. This surge made the manual exploration of scientific articles time-consuming and inefficient. Therefore, a range of exploratory search applications have been created to facilitate access to the available literature. In this survey, we give a short description of certain efforts in this direction and explore the different approaches that they used.
ARTICLE | doi:10.20944/preprints201907.0310.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: EU law; federated search; document repository; network diagram
Online: 28 July 2019 (12:29:27 CEST)
We have developed an application, termed Justeus, for federated search of EU and Hungarian legislation and jurisdiction. It now contains more than 1 million documents, with daily updates. The database holds documents downloaded from the EU sources EUR-Lex and Curia Online as well as public jurisdiction documents from the Constitutional Court of Hungary and The National Office for The Judiciary. Justeus provides comprehensive search options. Besides free text and metadata (dropdown list) searches, it features hierarchical data structures (concept hierarchy trees) of directory codes and classification as well as subject terms. Justeus collects all links of a particular document to other documents (court judgements citing other case law documents as well as legislation, national court decisions referring to EU regulation, etc.) as tables and directed graph networks. When a document is chosen, its relations to other documents are visualized in real time as a network. Network graphs help in identifying key documents influencing or referred to by many other documents (legislative and/or jurisdictive) and sets of documents predominantly referring to each other (citation networks).
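Identifying documents that are "referred to by many other documents" amounts to counting incoming edges in a directed citation graph. The sketch below shows that counting step only; the document identifiers and function name are hypothetical and are not taken from Justeus.

```python
from collections import Counter

# Hypothetical (citing, cited) pairs; the identifiers are made up.
links = [
    ("judgment_A", "regulation_1"),
    ("judgment_B", "regulation_1"),
    ("judgment_B", "judgment_A"),
    ("national_decision_C", "regulation_1"),
    ("national_decision_C", "judgment_A"),
]


def citation_counts(edges):
    """Count incoming links, i.e. how often each document is referred to."""
    return Counter(cited for _citing, cited in edges)


# Documents ordered from most to least cited.
ranked = citation_counts(links).most_common()
print(ranked)
```

Documents at the top of such a ranking are candidates for the key documents that a network view would show as hubs.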
REVIEW | doi:10.20944/preprints202104.0739.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Deep neural network; survey; document images; review paper; deep learning; performance evaluation; page object detection, graphical page objects; document image analysis; page segmentation
Online: 28 April 2021 (10:17:49 CEST)
In any document, graphical elements like tables, figures, and formulas contain essential information. The processing and interpretation of such information require specialized algorithms, since off-the-shelf OCR components cannot process this information reliably. Therefore, an essential step in document analysis pipelines is to detect these graphical components. This leads to a high-level conceptual understanding of the documents that makes their digitization viable. Since the advent of deep learning, the performance of deep learning-based object detection has improved manyfold. In this work, we outline and summarize the deep learning approaches for detecting graphical page objects in document images. We discuss the most relevant deep learning-based approaches and the state of the art in graphical page object detection in document images. This work provides a comprehensive understanding of the current state of the art and related challenges. Furthermore, we discuss leading datasets along with the quantitative evaluation. Finally, we briefly discuss promising directions that can be utilized for further improvements.
ARTICLE | doi:10.20944/preprints202202.0058.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Document Image Classification; Corruption Robustness; Robustness to Distortions; Model Robustness
Online: 14 June 2022 (08:43:57 CEST)
Deep neural networks have been extensively researched in the field of document image classification to improve classification performance and have shown excellent results. However, there is little research in this area that addresses the question of how well these models would perform in a real-world environment, where the data the models are confronted with often exhibits various types of noise or distortion. In this work, we present two separate benchmark datasets, namely RVL-CDIP-D and Tobacco3482-D, to evaluate the robustness of existing state-of-the-art document image classifiers to different types of data distortions that are commonly encountered in the real world. The proposed benchmarks are generated by inserting 21 different types of data distortions with varying severity levels into the well-known document datasets RVL-CDIP and Tobacco3482, respectively, which are then used to quantitatively evaluate the impact of the different distortion types on the performance of latest document image classifiers. In doing so, we show that while the higher accuracy models also exhibit relatively higher robustness, they still severely underperform on some specific distortions, with their classification accuracies dropping from ~90% to as low as ~40% in some cases. We also show that some of these high accuracy models perform even worse than the baseline AlexNet model in the presence of distortions, with the relative decline in their accuracy sometimes reaching as high as 300-450% that of AlexNet. The proposed robustness benchmarks are made available to the community and may aid future research in this area.
REVIEW | doi:10.20944/preprints202111.0454.v1
Subject: Medicine & Pharmacology, Allergology Keywords: supportive supervision; health systems strengthening; document analysis; LMIC; maternal and child health
Online: 24 November 2021 (12:45:25 CET)
Background: Supportive supervision has lately been gaining traction in various national health systems as an effective way of boosting the performance of community health workers in a constructive and sustainable way. However, not much is known about the basis/mandate of supportive supervision and its approach in maternal and child health programs in India. The current analysis contributes to a clearer understanding of the paradigms within which supportive supervision is envisioned to operate within India and identifies potential strengths and areas requiring attention. Method: Document analysis of implementation documents such as guidelines/ operational manuals/operationalization modules/ training modules of nationally implemented maternal and child health programs, with data extraction according to a pre-determined domain-based template. Results: Many of the documents reviewed do not mention supportive supervision at all. In the few documents where supportive supervision is mentioned, the paradigms within which it is supposed to operate (who will do it, when will it be done, how to do it, training and logistic support, reporting formats, etc.) have not been clearly identified in most programs. Conclusion: Even though supportive supervision is being increasingly identified as an effective way of performative improvement in national health programs in India, more effort needs to be put into identifying and enforcing the tenets of supportive supervision in practice, in order to bring about the desired change.
ARTICLE | doi:10.20944/preprints202007.0686.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: document scanning; whiteboard capture; image enhancement; image alignment; image registration; image quality assessment
Online: 28 July 2020 (14:03:51 CEST)
The move from paper to online is not only necessary for remote working, it is also significantly more sustainable. This trend has seen a rising need for high-quality digitization of content from pages and whiteboards to sharable online material. But capturing this information is not always easy, nor are the results always satisfactory. Available scanning apps vary in their usability and do not always produce clean results, retaining surface imperfections from the page or whiteboard in their output images. CleanPage, a novel smartphone-based document and whiteboard scanning system, is presented. CleanPage requires one button-tap to capture, identify, crop and clean an image of a page or whiteboard. Unlike equivalent systems, no user intervention is required during processing and the result is a high-contrast, low-noise image with a clean homogenous background. Results are presented for a selection of scenarios showing the versatility of the design. CleanPage is compared with two market leader scanning apps using two testing approaches: real paper scans and ground-truth comparisons. These comparisons are achieved by a new testing methodology that allows scans to be compared to unscanned counterparts, by using synthesized images. Real paper scans are tested using image quality measures. An evaluation of standard image quality assessments is included in this work and a novel quality measure for scanned images is proposed and validated. The user experience for each scanning app is assessed, showing CleanPage to be fast and easier to use.
ARTICLE | doi:10.20944/preprints202001.0149.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Optical Music Recognition; Historical Document Analysis; Medieval manuscripts; neume notation; CNN; LSTM; CTC
Online: 15 January 2020 (12:11:25 CET)
The automatic recognition of scanned Medieval manuscripts still represents a challenge due to degradation, non-standard layouts, or notations. This paper focuses on the Medieval square notation developed around the 11th century, which is composed of staff lines, clefs, accidentals, and neumes, which are essentially connected single notes. We present a novel approach to tackle the automatic transcription by applying CNN/LSTM networks that are trained using the segmentation-free CTC loss function, which considerably facilitates ground-truth (GT) production. For evaluation, we use three different manuscripts and achieve a dSAR of 86.0% on the most difficult book and 92.2% on the cleanest one. To further improve the results, we apply a neume dictionary during decoding, which yields a relative improvement of about 5%.
ARTICLE | doi:10.20944/preprints201905.0231.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Optical Music Recognition; historical document analysis; Medieval manuscripts; neume notation; fully convolutional neural networks
Online: 20 May 2019 (08:45:34 CEST)
Even today, the automatic digitisation of scanned documents in general, and especially the automatic optical music recognition (OMR) of historical manuscripts, remains an enormous challenge, since both handwritten musical symbols and text have to be identified. This paper focuses on the Medieval so-called square notation developed in the 11th-12th century, which is already composed of staff lines, staves, clefs, accidentals, and neumes, which are, roughly speaking, connected single notes. The aim is to develop an algorithm that captures both the neumes and the pitches, that is, the melody information that can be used to reconstruct the original writing. Our pipeline is similar to the standard OMR approach and comprises a novel staff line and symbol detection algorithm based on deep Fully Convolutional Networks (FCN), which perform pixel-based predictions for either staff lines or symbols and their respective types. Then, the staff line detection combines the extracted lines into staves and yields an F1-score of over 99% for detecting both lines and complete staves. For the music symbol detection we choose a novel approach that skips the step of identifying neumes and instead directly predicts note components (NCs) and their respective affiliation to a neume. Furthermore, the algorithm detects clefs and accidentals. Our algorithm recognises these symbols with an F1-score of over 96% if the type is ignored and predicts the true symbol sequence of a staff with a diplomatic symbol accuracy rate (dSAR) of about 87%. If only the NCs (without their respective connection to a neume), all clefs, and accidentals are of interest, the algorithm reaches a harmonic symbol accuracy rate (hSAR) of approximately 90%.
REVIEW | doi:10.20944/preprints202102.0612.v2
Subject: Biology, Physiology Keywords: Medial Preoptic Area; MPOA; Parental behavior; Scientometry; Systematic Review; CiteSpace; Document Co-Citation Analysis; Keyword Analysis
Online: 1 April 2021 (14:52:17 CEST)
Research investigating the neural substrates underpinning parental behaviour has recently gained momentum. Particularly, the hypothalamic medial preoptic area (MPOA) has been identified as a crucial region for parenting. The current study conducted a scientometric analysis of publications from 01 January 1972 to 19 January 2021 using CiteSpace software to determine trends in the scientific literature exploring the relationship between MPOA and parental behaviour. In total, 677 scientific papers were analysed, producing a network of 1509 nodes and 5498 links. Four major clusters were identified: "C-Fos Expression'', "Lactating Rat'', "Medial Preoptic Area Interaction'' and "Parental Behavior''. Their content suggests an initial trend in which the properties of the MPOA in response to parental behavior were studied, followed by a growing attention towards the presence of a brain network, including the reward circuits, regulating such behavior. Furthermore, while attention was initially directed uniquely to maternal behavior, it has recently been extended to the understanding of paternal behaviors as well. Finally, although the majority of the studies were conducted on rodents, recent publications broaden the implications of previous documents to human parental behavior, giving insight into the mechanisms underlying postpartum depression. Potential directions in future works were also discussed.
ARTICLE | doi:10.20944/preprints202209.0103.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Portable Document Format (PDF); machine learning; detection; optimizable decision tree; Ada-Boost; PDF malware; evasion attacks; cybersecurity
Online: 7 September 2022 (05:33:40 CEST)
Portable Document Format (PDF) files are one of the most universally used file types. This has attracted hackers to develop methods that use these normally innocent PDF files as infection vectors to create security threats. This is usually realized by hiding embedded malicious code in the victims' PDF documents to infect their machines. The result is PDF malware, which requires techniques to distinguish benign files from malicious files. Research studies have indicated that machine-learning methods provide efficient detection techniques against such malware. In this paper, we present a new detection system that can analyze PDF documents in order to distinguish benign PDF files from malware PDF files. The proposed system makes use of the AdaBoost decision tree with optimal hyperparameters, which is trained and evaluated on a modern, inclusive dataset, viz. Evasive-PDFMal2022. The experimental evaluation demonstrates a lightweight and accurate PDF malware detection system, achieving a 98.84% prediction accuracy with a short prediction interval of 2.174 μs. The proposed model thus outperforms other state-of-the-art models in the same study area. Hence, the proposed system can be effectively utilized to uncover PDF malware with high detection performance and low detection overhead.
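To show how AdaBoost combines weak learners, the sketch below boosts one-level decision trees (stumps) on a toy problem. It is a simplification and not the system described above: the paper uses an optimized decision tree on the Evasive-PDFMal2022 dataset, whereas the single numeric feature, the labels, and the number of rounds here are invented for illustration.

```python
import math


def train_stump(features, labels, weights):
    """Best single-feature threshold rule under the current sample weights."""
    best = None
    for f in range(len(features[0])):
        for threshold in sorted({row[f] for row in features}):
            for polarity in (1, -1):
                preds = [polarity if row[f] >= threshold else -polarity
                         for row in features]
                error = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or error < best[0]:
                    best = (error, f, threshold, polarity)
    return best


def stump_predict(stump, row):
    _error, f, threshold, polarity = stump
    return polarity if row[f] >= threshold else -polarity


def adaboost(features, labels, rounds):
    """Return a list of (alpha, stump) pairs."""
    n = len(features)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(features, labels, weights)
        error = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, row))
                   for w, y, row in zip(weights, labels, features)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: one hypothetical numeric feature per file; +1 = malicious, -1 = benign.
features = [[float(x)] for x in range(10)]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(features, labels, rounds=3)
print([predict(ensemble, row) for row in features])
```

No single threshold separates these labels, but the weighted vote of three stumps does, which is the core idea of boosting.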
ARTICLE | doi:10.20944/preprints202201.0061.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: BERT, Document Image Classification, EfficientNet, fine-tuned BERT, Hierarchical Attention Networks, Multimodal, RVL-CDIP, Two-stream, Tobacco-3482
Online: 6 January 2022 (10:08:38 CET)
Document classification is one of the most critical steps in the document analysis pipeline. There are two types of approaches for document classification, known as image-based and multimodal approaches. The image-based document classification approaches are solely based on the inherent visual cues of the document images. In contrast, the multimodal approach co-learns the visual and textual features, and it has proved to be more effective. Nonetheless, these approaches require a huge amount of data. This paper presents a novel approach for document classification that works with a small amount of data and outperforms other approaches. The proposed approach incorporates a hierarchical attention network (HAN) for the textual stream and EfficientNet-B0 for the image stream. The hierarchical attention network in the textual stream uses dynamic word embeddings obtained through fine-tuned BERT. HAN incorporates both word-level and sentence-level features. While the earlier approaches rely on training on a large corpus (RVL-CDIP), we show that our approach works with a small amount of data (Tobacco-3482). To this end, we trained the neural network on Tobacco-3482 from scratch. Thereby, we outperform the state of the art by obtaining an accuracy of 90.3%. This results in a relative error reduction rate of 7.9%.
ARTICLE | doi:10.20944/preprints202107.0200.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: image quality assessment; image quality metrics; NR-IQAs; D-IQA; OCR accuracy; OCR prediction; OCR improvements; visual aids; visually impaired; reading aids; document images; text-based images
Online: 8 July 2021 (13:21:49 CEST)
For visually impaired people (VIPs), the ability to convert text to sound can mean a new level of independence or the simple joy of a good book. With significant advances in Optical Character Recognition (OCR) in recent years, a number of reading aids are appearing on the market. These reading aids convert images captured by a camera to text which can then be read aloud. However, all of these reading aids suffer from a key issue: the user must be able to visually target the text and capture an image of sufficient quality for the OCR algorithm to function, which is no small task for VIPs. In this work, a Sound-Emitting Document Image Quality Assessment metric (SEDIQA) is proposed which allows the user to hear the quality of the text image and automatically captures the best image for OCR accuracy. This work also includes testing of OCR performance against image degradations, to identify the most significant contributors to accuracy reduction. The proposed No-Reference Image Quality Assessor (NR-IQA) is validated alongside established NR-IQAs and this work includes insights into the performance of these NR-IQAs on document images.