Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

EmmDocClassifier: Efficient Multimodal Document Image Classifier for Scarce Data

Version 1 : Received: 3 January 2022 / Approved: 6 January 2022 / Online: 6 January 2022 (10:08:38 CET)

How to cite: Kanchi, S.; Pagani, A.; Mokayed, H.; Liwicki, M.; Stricker, D.; Afzal, M.Z. EmmDocClassifier: Efficient Multimodal Document Image Classifier for Scarce Data. Preprints 2022, 2022010061 (doi: 10.20944/preprints202201.0061.v1).

Abstract

Document classification is one of the most critical steps in the document analysis pipeline. There are two types of approaches for document classification: image-based and multimodal. Image-based approaches rely solely on the inherent visual cues of the document images, whereas multimodal approaches co-learn visual and textual features, which has proven more effective. Nonetheless, these approaches require a huge amount of data. This paper presents a novel approach for document classification that works with a small amount of data and outperforms other approaches. The proposed approach incorporates a hierarchical attention network (HAN) for the textual stream and EfficientNet-B0 for the image stream. The HAN in the textual stream uses dynamic word embeddings from a fine-tuned BERT and incorporates both word-level and sentence-level features. While earlier approaches rely on training on a large corpus (RVL-CDIP), we show that our approach works with a small amount of data (Tobacco-3482). To this end, we trained the neural network on Tobacco-3482 from scratch. Thereby, we outperform the state of the art by obtaining an accuracy of 90.3%, a relative error reduction of 7.9%.
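The two-level attention pooling that the abstract attributes to the HAN textual stream can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: it shares one set of attention parameters (`W`, `b`, `u`) across the word and sentence levels for brevity (the original HAN uses separate parameters and GRU encoders), uses random vectors in place of fine-tuned BERT word embeddings, and stands in a random vector for the EfficientNet-B0 image features before late fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, W, b, u):
    """HAN-style attention pooling: score each row of H against a learned
    context vector u, then return the attention-weighted sum of the rows."""
    U = np.tanh(H @ W + b)      # (n, d) hidden representations
    alpha = softmax(U @ u)      # (n,) attention weights, sum to 1
    return alpha @ H, alpha     # (d,) pooled vector, plus the weights

d = 8  # toy embedding size
W, b, u = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)

# Toy document: 3 sentences of word vectors (in the paper these would
# come from the fine-tuned BERT embedding layer).
sentences = [rng.normal(size=(n_words, d)) for n_words in (5, 7, 4)]

# Word-level attention -> one vector per sentence.
sent_vecs = np.stack([attention_pool(S, W, b, u)[0] for S in sentences])

# Sentence-level attention -> one document vector.
doc_vec, alpha = attention_pool(sent_vecs, W, b, u)

# Late fusion with the image stream (a stand-in for EfficientNet-B0 features);
# the concatenated vector would feed the classification head.
img_vec = rng.normal(size=d)
fused = np.concatenate([doc_vec, img_vec])
print(fused.shape)  # (16,)
```

The hierarchy is what gives the HAN its sample efficiency argument: attention first selects informative words within each sentence, then informative sentences within the document, so the classifier head sees a compact document vector rather than a flat token sequence.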

Keywords

BERT, Document Image Classification, EfficientNet, fine-tuned BERT, Hierarchical Attention Networks, Multimodal, RVL-CDIP, Two-stream, Tobacco-3482

Subject

MATHEMATICS & COMPUTER SCIENCE, Artificial Intelligence & Robotics


