REVIEW | doi:10.20944/preprints202104.0739.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep neural network; survey; document images; review paper; deep learning; performance evaluation; page object detection, graphical page objects; document image analysis; page segmentation
Online: 28 April 2021 (10:17:49 CEST)
In any document, graphical elements like tables, figures, and formulas contain essential information. The processing and interpretation of such information require specialized algorithms. Off-the-shelf OCR components cannot process this information reliably. Therefore, an essential step in document analysis pipelines is to detect these graphical components. This detection leads to a high-level conceptual understanding of the documents that makes their digitization viable. Since the advent of deep learning, the performance of deep learning-based object detection has improved manyfold. In this work, we outline and summarize the deep learning approaches for detecting graphical page objects in document images. In particular, we discuss the most relevant deep learning-based approaches and the state of the art in graphical page object detection in document images. This work provides a comprehensive understanding of the current state of the art and related challenges. Furthermore, we discuss leading datasets along with the quantitative evaluation. Finally, we briefly discuss promising directions that can be pursued for further improvements.
ARTICLE | doi:10.20944/preprints202109.0112.v1
Subject: Engineering, Marine Engineering Keywords: 3D point Cloud Classification, 3D point Cloud Shape Completion,Auto-Encoders, Contrastive Learning, Self-Supervised Learning
Online: 6 September 2021 (18:00:28 CEST)
In this paper, we present the idea of Self-Supervised Learning for the Shape Completion and Classification of point clouds. Most 3D shape completion pipelines utilize autoencoders to extract features from point clouds, which are then used in downstream tasks such as Classification, Segmentation, Detection, and other related applications. Our idea is to add Contrastive Learning to autoencoders to learn both global and local feature representations of point clouds. We use a combination of Triplet Loss and Chamfer distance to learn these representations. To evaluate the performance of the embeddings for Classification, we utilize the PointNet classifier. We also extend the number of classes used to evaluate our model from 4 to 10 to show the generalization ability of the learned features. Based on our results, embeddings generated by the Contrastive autoencoder improve Shape Completion and Classification performance on point clouds from 84.2% to 84.9%, achieving state-of-the-art results with 10 classes.
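The combination of Chamfer distance and Triplet Loss mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the squared-distance form of the Chamfer term, the `weight` and `margin` parameters, and the function names are assumptions made for illustration only.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (lists of coordinate tuples).

    Assumption: mean of squared nearest-neighbour distances in each direction.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors (margin is a hypothetical value)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def combined_loss(recon, target, anchor, positive, negative, weight=1.0, margin=1.0):
    """Reconstruction term plus weighted contrastive term (weighting is an assumption)."""
    return chamfer_distance(recon, target) + weight * triplet_loss(
        anchor, positive, negative, margin
    )
```

In this sketch the Chamfer term compares a reconstructed point cloud with its target, while the triplet term acts on embeddings, pulling same-class shapes together and pushing different-class shapes apart.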
ARTICLE | doi:10.20944/preprints202204.0279.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: object detection; challenging environments; low-light; image enhancement; complex environments; deep neural networks; computer vision
Online: 28 April 2022 (09:42:37 CEST)
In recent years, due to the advancement of machine learning, object detection has become a mainstream task in the computer vision domain. The first phase of object detection is to find the regions where objects can exist. With the improvement of deep learning, traditional approaches such as sliding windows and manual feature selection techniques have been replaced with deep learning techniques. However, like any other vision task, object detection algorithms struggle in low light, challenging weather, and crowded scenes. Such settings are termed challenging environments. This paper exploits pixel-level information to improve detection under challenging situations. To this end, we exploit the recently proposed hybrid task cascade network. This network works collaboratively with detection and segmentation heads at different cascade levels. We evaluate the proposed method on three complex datasets, ExDark, CURE-TSD, and RESIDE, and achieve an mAP of 0.71, 0.52, and 0.43, respectively. Our experimental results assert the efficacy of the proposed approach.
ARTICLE | doi:10.20944/preprints202201.0090.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: formula detection; Hybrid Task Cascade network; mathematical expression detection; document image analysis; deep neural networks; computer vision
Online: 6 January 2022 (12:56:23 CET)
This work presents an approach for detecting mathematical formulas in scanned document images. The proposed approach is end-to-end trainable. Since many OCR engines cannot reliably work with formulas, it is essential to isolate them to obtain clean text for information extraction from the document. Our proposed pipeline comprises a hybrid task cascade network with deformable convolutions and a ResNeXt-101 backbone. Both of these modifications help in better detection. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets and achieve an overall accuracy of 96% on the ICDAR-2017 POD dataset, an overall error reduction of 13%. Furthermore, the results on the Marmot dataset are improved for both isolated and embedded formulas. We achieve an accuracy of 98.78% for isolated formulas and an overall accuracy of 90.21% for embedded formulas. Consequently, this results in an error reduction rate of 43% for isolated and 17.9% for embedded formulas.
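Several abstracts in this listing report a relative error reduction. A minimal sketch of how such a figure is commonly computed from two accuracies follows; the formula is the usual definition and is assumed here, and the numbers in the usage line are illustrative rather than taken from any of the papers.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline error removed by the new method.

    Both accuracies are fractions in [0, 1]. Assumes the common definition
    (baseline_error - new_error) / baseline_error.
    """
    baseline_error = 1.0 - baseline_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no error left to reduce")
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Illustrative values only: going from 90% to 92% accuracy removes
# one fifth of the remaining error.
reduction = relative_error_reduction(0.90, 0.92)
```

This explains why a small absolute gain near the top of the accuracy range can still correspond to a large relative error reduction.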
ARTICLE | doi:10.20944/preprints202109.0059.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: table detection; table recognition; cascade Mask R-CNN; atrous convolution; recursive feature pyramid networks; document image analysis; deep neural networks; computer vision, object detection.
Online: 3 September 2021 (11:05:10 CEST)
Table detection is a preliminary step in extracting reliable information from tables in scanned document images. We present CasTabDetectoRS, a novel end-to-end trainable table detection framework that operates on Cascade Mask R-CNN, including a Recursive Feature Pyramid network and Switchable Atrous Convolution in the existing backbone architecture. By utilizing a comparatively lightweight ResNet-50 backbone, this paper demonstrates that superior results are attainable without relying on pre- and post-processing methods, heavier backbone networks (ResNet-101, ResNeXt-152), or memory-intensive deformable convolutions. We evaluate the proposed approach on five different publicly available table detection datasets. Our CasTabDetectoRS outperforms the previous state-of-the-art results on four datasets (ICDAR-19, TableBank, UNLV, and Marmot) and accomplishes comparable results on ICDAR-17 POD. Compared with previous state-of-the-art results, we obtain a significant relative error reduction of 56.36%, 20%, 4.5%, and 3.5% on the ICDAR-19, TableBank, UNLV, and Marmot datasets, respectively. Furthermore, this paper sets a new benchmark by performing exhaustive cross-dataset evaluations to exhibit the generalization capabilities of the proposed method.
ARTICLE | doi:10.20944/preprints202107.0165.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Formula detection; Cascade Mask R-CNN; Mathematical expression detection; document image analysis; deep neural networks; computer vision.
Online: 6 July 2021 (17:42:24 CEST)
This paper presents a novel architecture for detecting mathematical formulas in document images, which is an important step for reliable information extraction in several domains. Recently, Cascade Mask R-CNN networks have been introduced to solve object detection in computer vision. In this paper, we suggest a couple of modifications to the existing Cascade Mask R-CNN architecture. First, the proposed network uses deformable convolutions instead of conventional convolutions in the backbone network to spot areas of interest better. Second, it uses a dual backbone of ResNeXt-101 with composite connections at the parallel stages. Finally, our proposed network is end-to-end trainable. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets. The proposed approach demonstrates state-of-the-art performance on ICDAR-2017 POD at a higher IoU threshold with an F1-score of 0.917, reducing the relative error by 7.8%. Moreover, we achieve a correct detection accuracy of 81.3% on embedded formulas on the Marmot dataset, which results in a relative error reduction of 30%.
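To make the notion of an F1-score at an IoU threshold concrete, the sketch below computes IoU for axis-aligned boxes and derives an F1-score with a simple greedy one-to-one matching. The box format `(x1, y1, x2, y2)`, the greedy matching rule, and the default threshold are assumptions for illustration; the official ICDAR-2017 POD evaluation protocol may differ in its details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(predictions, ground_truth, threshold=0.8):
    """F1-score where a prediction counts as correct if its best unmatched
    ground-truth box overlaps it with IoU >= threshold (placeholder value)."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box is matched at most once
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predictions)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```

Raising the threshold demands tighter localization, so the same set of detections generally scores lower at a higher IoU threshold.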
ARTICLE | doi:10.20944/preprints202208.0287.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: table detection; document layout analysis; continual learning; incremental learning; experience replay
Online: 16 August 2022 (10:56:59 CEST)
The growing amount of data demands methods that can gradually learn from new samples. However, it is not trivial to continually train a network. Retraining a network with new data usually results in a known phenomenon called “catastrophic forgetting.” In a nutshell, the performance of the model drops on the previous data as it learns from the new instances. This paper explores this issue in the table detection problem. While there are multiple datasets and sophisticated methods for table detection, the utilization of continual learning techniques in this domain has not been studied. We employ an effective technique called experience replay and perform extensive experiments on several datasets to investigate the effects of catastrophic forgetting. Results show that our proposed approach mitigates the performance drop by 15 percent. To the best of our knowledge, this is the first time that continual learning techniques have been adopted for table detection, and we hope this stands as a baseline for future research.
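Experience replay can be sketched as a small memory of earlier samples that is mixed into every batch drawn from the new data. The sketch below is a generic illustration, not the paper's setup: the reservoir-sampling policy, the buffer capacity, and the names `ReplayBuffer` and `mixed_batch` are all assumptions.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling (an assumption)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling keeps every sample seen so far with equal probability.
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def sample(self, k):
        return self.rng.sample(self.samples, min(k, len(self.samples)))


def mixed_batch(new_samples, buffer, replay_size):
    """Combine new-task samples with replayed old-task samples for one training step."""
    return list(new_samples) + buffer.sample(replay_size)
```

Training on such mixed batches keeps gradients from the earlier data in play, which is the mechanism by which replay counteracts catastrophic forgetting.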
REVIEW | doi:10.20944/preprints202208.0067.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: deep learning; 3D reconstruction; convolutional neural networks; texture-less surfaces
Online: 2 August 2022 (12:17:08 CEST)
3D reconstruction from a single 2D input is a classic problem in the field of computer vision. With the advancements in deep learning, the performance of 3D reconstruction has also significantly improved. The reconstruction task is more difficult for objects with no textures or complex deformations. This paper serves as a review of recent literature on 3D reconstruction from a single view, with a focus on deep learning methods from 2018 to 2021. Due to the lack of standard datasets or 3D shape representation methods, it is hard to make direct comparisons between all reviewed methods. However, this paper reviews different approaches for reconstructing 3D shape as depth maps, surface normals, point clouds, and meshes, along with the various loss functions and evaluation metrics used to train and evaluate these methods.
ARTICLE | doi:10.20944/preprints202108.0360.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Table detection, table localization, deep learning, Hybrid Task Cascade, Object detection, deformable convolution, deep neural networks, computer vision, scanned document images, document image analysis.
Online: 17 August 2021 (10:26:42 CEST)
Tables in document images are among the most important entities since they contain crucial information. Therefore, accurate table detection can significantly improve information extraction from tables. In this work, we present a novel end-to-end trainable pipeline, HybridTabNet, for table detection in scanned document images. Our two-stage table detector uses the ResNeXt-101 backbone for feature extraction and Hybrid Task Cascade (HTC) to localize the tables in scanned document images. Moreover, we replace conventional convolutions with deformable convolutions in the backbone network. This enables our network to detect tables of arbitrary layouts precisely. We evaluate our approach comprehensively on ICDAR-13, ICDAR-17 POD, ICDAR-19, TableBank, Marmot, and UNLV. Apart from the ICDAR-17 POD dataset, our proposed HybridTabNet outperforms earlier state-of-the-art results without depending on pre- and post-processing steps. Furthermore, to investigate how the proposed method generalizes to unseen data, we conduct an exhaustive leave-one-out evaluation. In comparison to prior state-of-the-art results, our method reduces the relative error by 27.57% on ICDAR-2019-TrackA-Modern, 42.64% on TableBank (Latex), 41.33% on TableBank (Word), 55.73% on TableBank (Latex + Word), 10% on Marmot, and 9.67% on the UNLV dataset. The achieved results reflect the superior performance of the proposed method.
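A leave-one-out evaluation across datasets can be sketched as follows: each dataset is held out once for testing while the model is trained on the rest. The dataset names are the ones listed in the abstract; whether all six take part in the authors' leave-one-out protocol is an assumption, and `leave_one_out_splits` is a hypothetical helper name.

```python
def leave_one_out_splits(datasets):
    """Yield (training_datasets, held_out_dataset) pairs, one per dataset."""
    for held_out in datasets:
        training = [name for name in datasets if name != held_out]
        yield training, held_out


# Dataset names taken from the abstract; their use in this protocol is assumed.
DATASETS = ["ICDAR-13", "ICDAR-17 POD", "ICDAR-19", "TableBank", "Marmot", "UNLV"]

for training, held_out in leave_one_out_splits(DATASETS):
    print(f"train on {', '.join(training)} -> test on {held_out}")
```

Because the held-out dataset is never seen during training, the resulting scores indicate how well a detector transfers to unseen document collections.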
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Amharic script; Attention mechanism; OCR; Encoder-decoder; Text-image
Online: 15 October 2020 (13:42:28 CEST)
At present, the growth of digitization and worldwide communications makes OCR systems for exotic languages a very important task. In this paper, we attempt to develop an OCR system for one of these exotic languages with a unique script, Amharic. Motivated by the recent success of the attention mechanism in Neural Machine Translation (NMT), we extend the attention mechanism to Amharic text-image recognition. The proposed model consists of CNNs and attention-embedded recurrent encoder-decoder networks that are integrated following the configuration of the seq2seq framework. The attention network parameters are trained in an end-to-end fashion, and the context vector is injected, together with the previously predicted output, at each time step of decoding. Unlike the existing OCR model, which minimizes the CTC objective function, the new model minimizes the categorical cross-entropy loss. The proposed attention-based model is evaluated on the test dataset from the ADOCR database, which consists of both printed and synthetically generated Amharic text-line images, and achieves promising results with a CER of 1.54% and 1.17%, respectively.
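The character error rate (CER) reported above is commonly defined as the edit distance between the predicted and reference text divided by the reference length. The sketch below follows that common definition; any text normalization and the way errors are aggregated over text lines in the ADOCR evaluation are not described in the abstract and are left out.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_char != hyp_char),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Python strings are sequences of Unicode code points, so the same function applies to Amharic text lines without modification.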
ARTICLE | doi:10.20944/preprints202201.0061.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: BERT, Document Image Classification, EfficientNet, fine-tuned BERT, Hierarchical Attention Networks, Multimodal, RVL-CDIP, Two-stream, Tobacco-3482
Online: 6 January 2022 (10:08:38 CET)
Document classification is one of the most critical steps in the document analysis pipeline. There are two types of approaches for document classification, known as image-based and multimodal approaches. Image-based document classification approaches are based solely on the inherent visual cues of the document images. In contrast, the multimodal approach co-learns the visual and textual features, and it has proved to be more effective. Nonetheless, these approaches require a huge amount of data. This paper presents a novel approach for document classification that works with a small amount of data and outperforms other approaches. The proposed approach incorporates a hierarchical attention network (HAN) for the textual stream and EfficientNet-B0 for the image stream. The hierarchical attention network in the textual stream uses dynamic word embeddings obtained through fine-tuned BERT. HAN incorporates both word-level and sentence-level features. While earlier approaches rely on training on a large corpus (RVL-CDIP), we show that our approach works with a small amount of data (Tobacco-3482). To this end, we trained the neural network on Tobacco-3482 from scratch. Thereby, we outperform the state of the art by obtaining an accuracy of 90.3%. This results in a relative error reduction rate of 7.9%.
ARTICLE | doi:10.20944/preprints202106.0590.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Object detection; challenging environments; low-light; image enhancement; complex environments; state-of-the-art; deep neural networks; computer vision; performance analysis.
Online: 23 June 2021 (16:01:33 CEST)
Recent progress in deep learning has led to accurate and efficient generic object detection networks. Training of highly reliable models depends on large datasets with highly textured and rich images. However, in real-world scenarios, the performance of the generic object detection system decreases when (i) occlusions hide the objects, (ii) objects are present in low-light images, or (iii) they are merged with background information. In this paper, we refer to all these situations as challenging environments. With the recent rapid development in generic object detection algorithms, notable progress has been observed in the field of object detection in challenging environments. However, there is no consolidated reference to cover state-of-the-art in this domain. To the best of our knowledge, this paper presents the first comprehensive overview, covering recent approaches that have tackled the problem of object detection in challenging environments. Furthermore, we present the quantitative and qualitative performance analysis of these approaches and discuss the currently available challenging datasets. Moreover, this paper investigates the performance of current state-of-the-art generic object detection algorithms by benchmarking results on the three well-known challenging datasets. Finally, we highlight several current shortcomings and outline future directions.
ARTICLE | doi:10.20944/preprints202209.0025.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: object detection; semi-supervised learning; Mask R-CNN; floor-plan images; computer vision
Online: 1 September 2022 (15:16:43 CEST)
Research on object detection using semi-supervised methods has been growing in the past few years. We examine the intersection of these two areas for floor-plan objects to promote the research objective of detecting objects more accurately with less labelled data. Floor-plan objects include different furniture items with multiple types of the same class, and this high inter-class similarity impacts the performance of prior methods. In this paper, we present a Mask R-CNN-based semi-supervised approach that provides pixel-to-pixel alignment to generate individual annotation masks for each class to mine the inter-class similarity. The semi-supervised approach has a student-teacher network that pulls information from the teacher network and feeds it to the student network. The teacher network uses unlabeled data to form pseudo-boxes, and the student network is trained on both the unlabeled data with the pseudo-boxes and the labelled data with ground truth. It learns representations of furniture items by combining labelled and unlabeled data. On the Mask R-CNN detector with a ResNet-101 backbone network, the proposed approach achieves an mAP of 98.8%, 99.7%, and 99.8% with only 1%, 5%, and 10% labelled data, respectively. Our experiment affirms the efficiency of the proposed approach, as it outperforms the fully supervised counterpart using only 10% of the labels.
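The student-teacher data flow described above can be sketched in a framework-free way. The abstract does not say how pseudo-boxes are selected, so the confidence threshold used here is an assumption borrowed from common pseudo-labelling practice, and the detection dictionary layout and the `teacher_predict` callable are hypothetical.

```python
def make_pseudo_boxes(teacher_detections, score_threshold=0.9):
    """Keep confident teacher detections as pseudo ground truth.

    The threshold is a placeholder value, not one reported in the paper.
    """
    return [
        {"box": det["box"], "label": det["label"]}
        for det in teacher_detections
        if det["score"] >= score_threshold
    ]


def build_student_batch(labelled, unlabelled_images, teacher_predict, score_threshold=0.9):
    """Combine labelled samples with unlabelled images annotated by the teacher.

    `labelled` is a list of (image, annotations) pairs; `teacher_predict` is a
    hypothetical callable mapping an image to a list of scored detections.
    """
    batch = list(labelled)
    for image in unlabelled_images:
        pseudo = make_pseudo_boxes(teacher_predict(image), score_threshold)
        if pseudo:  # skip images on which the teacher is not confident
            batch.append((image, pseudo))
    return batch
```

The student is then trained on the combined batch, treating pseudo-boxes on unlabeled images in the same way as ground-truth boxes on labelled ones.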
ARTICLE | doi:10.20944/preprints202110.0089.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Object Detection; Cascade Mask R-CNN; Floor Plan Images; Deep Learning; Transfer Learning; Dataset Augmentation; Computer Vision
Online: 5 October 2021 (15:09:26 CEST)
Object detection is one of the most critical tasks in the field of computer vision. This task comprises identifying and localizing an object in the image. Architectural floor plans represent the layout of buildings and apartments. Floor plans consist of walls, windows, stairs, and other furniture objects. While recognizing floor plan objects is straightforward for humans, automatically processing floor plans and recognizing objects is a challenging problem. In this work, we investigate the performance of the recently introduced Cascade Mask R-CNN network for object detection in floor plan images. Furthermore, we experimentally establish that deformable convolutions work better than conventional convolutions in the proposed framework. Identifying objects in floor plan images is also challenging due to the variety of floor plans and objects. We faced a problem in training our network because of the lack of publicly available datasets: currently available public datasets do not have enough images to train deep neural networks efficiently. To address this issue, we introduce SFPI, a novel synthetic floor plan dataset consisting of 10000 images. Our proposed method surpasses the previous state-of-the-art results on the SESYD dataset and sets baseline results on the proposed SFPI dataset. The dataset can be downloaded from SFPI Dataset Link. We believe that the novel dataset enables researchers to further advance the research in this domain.
REVIEW | doi:10.20944/preprints202205.0343.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: Depth Completion; Depth Maps; Image-Guidance; Lidar
Online: 25 May 2022 (05:26:16 CEST)
Depth maps produced by LiDAR-based approaches are sparse. Even high-end LiDAR sensors produce highly sparse depth maps, which are also noisy around object boundaries. Depth completion is the task of generating a dense depth map from a sparse depth map. While traditional approaches focus on completing the depth map directly from the sparse input, modern techniques use RGB images as guidance to resolve this problem, and many others rely on affinity matrices for depth completion. Based on these approaches, we have sub-divided the literature into two major categories: traditional approaches and backbone-based approaches. The latter is further sub-divided into two-branch and spatial propagation approaches. The two-branch approaches have a further sub-category named guided-kernel approaches. In this paper, we present, for the first time, a comprehensive survey of depth completion methods. We present a novel taxonomy of depth completion approaches, review and detail different state-of-the-art techniques within each category for depth completion of LiDAR data, and provide quantitative results for the approaches on the KITTI and NYUv2 depth completion benchmark datasets.