Preprint Article (this version is not peer-reviewed)

Language Identification in Low-Resource Multilingual Document Images using Deep Learning Techniques

Submitted: 04 December 2024 | Posted: 05 December 2024

Abstract

With the increasing digitization of printed materials, it has become common to encounter documents in multiple languages in our daily lives. This has led to significant interest from researchers in the field of document image analysis, which includes applications like OCR, document image retrieval, and machine translation. However, these applications typically assume that documents are written in a single language, which is not always the case. Analyzing multilingual documents can be challenging, especially when languages are switched within a document and the scripts are the same. In particular, this issue has not yet been addressed for low-resource languages. Hence, this paper explores the use of deep learning techniques to develop a language identification model for multilingual documents written in Ethiopic script. To train and test our model, we extracted 24,444 and 32,000 text-line images from sets of 814 and 1,066 Ethiopic-script documents, respectively, obtained from a variety of sources. We also carried out extensive experiments using both pre-trained CNN models and models trained from scratch, with different parameter tuning, on printed and synthetic document image datasets. The results revealed that our custom CNN outperformed the pre-trained deep learning models, achieving a promising accuracy of 97.27% on a test set of 9,600 images. These findings show the potential for developing effective language identification solutions for document images written in Ethiopic script.


1. Introduction

Language plays a crucial role in human communication and has contributed significantly to the development of human civilization. It serves as a means for people to interact socially and emotionally, as well as to express various emotions such as kindness, love, rage, and suffering in public [1]. There are over 2,000 distinct languages and 200 dialects in Africa, belonging to four major linguistic super-families (Niger-Congo, Afro-Asiatic, Nilo-Saharan, and Khoisan) (https://bit.ly/lingustic). Ethiopia is a multilingual African country and home to over 86 languages that can be grouped into Semitic, Cushitic, Omotic, and Nilo-Saharan [2,3], which share a genetic affinity despite being divided into widely disparate subgroups of the same family.
The internet has had a significant impact on language since its inception, with the rise of instant communication through email, social networks such as Twitter and LinkedIn with their growing user bases [4], blogs, and the World Wide Web as a whole [5,6]. This has led to a dramatic increase in the number of digital and paper documents in various languages, requiring automatic processing for language identification. Two broad categories exist in the language identification challenge: (i) different languages with similar scripts and (ii) the same language with different scripts. The automatic analysis and recognition of scripts is known as script recognition [7,8,9]. For example, in Ethiopia, languages such as Amharic, Tigrigna, Awigna, and Geez use the Ethiopic alphabet, while Afaan Oromo, Somali, and Afar use the Roman alphabet. The second category involves languages written in different scripts at different times, such as Afaan Oromo. This paper focuses on the identification of low-resource languages written in a similar script, namely Amharic, Tigrigna, Awigna, and Geez.
Figure 1. Sample Multilingual Document Image
Language identification (LID) plays a crucial role in natural language processing (NLP), as it involves determining the language of a document or a part of it [2,10,11]. This step is necessary for many NLP applications, such as machine translation, sentiment analysis, document summarization, document image classification, and retrieval. For instance, Google’s free translation service (https://translate.google.com) attempts to detect the language of the text before translating it into another language [12]. However, when dealing with Ethiopian languages that share a similar script, it often defaults to Amharic instead of identifying each language individually.
OCR systems are commonly used for tasks like character segmentation and recognition [13,14,15]. However, integrating language identification (LID) beforehand is crucial to employing language-specific OCR models, improving accuracy and enhancing data accessibility across diverse applications [12]. LID is essential for document image analysis tasks such as document image classification, document image retrieval, OCR, and search engines. Search engines crawl web pages and need to identify the language of each web page to determine whether it should appear in search results; similarly, they must determine the language of the user’s search query. When employing text processing techniques for information retrieval, it is assumed that the language of the text is known, and many techniques assume that all documents are written in the same language.
Language identification on document images is challenging, as highlighted in [16], and likewise for handwritten recognition, as highlighted in [17]. Nonetheless, LID is essential in machine translation and document image analysis tasks [18]. Identifying the language in a document image can be considered a pre-OCR, document image retrieval, or classification task. In the pre-OCR case, language identification serves as the foundation for correct transcription of a document into an editable document in a multilingual setting. Therefore, adding language identification to OCR would simplify OCR’s role in converting hardcopy documents to editable text, allowing for the preservation of more historical documents written in multiple languages. LID in document images is necessary not only for OCR but also for text translation, sentiment analysis, and document and information retrieval.
Deep learning-based methods have been shown to perform better than traditional methods and have produced promising results in various applications [19,20]. In this paper, we apply deep learning-based methods to tackle the problem of language identification from document images written in Ethiopic script. We explore different state-of-the-art deep learning methods and select suitable techniques to develop multilingual language identification in the context of Ethiopian languages. Specifically, we implement a custom-CNN and pre-trained strategies for feature extraction and model tuning, given their wide application in the field and pivotal role in image processing [21].
This study focuses on identifying languages from printed and synthetic document images written in Amharic, Tigrigna, Awigna, and Geez, which are all written in Ethiopic script. Our aim is to investigate the identification of low-resource languages written in a similar script. To the best of our knowledge, this is the first study to address language identification from multilingual document images with a similar script in the context of Ethiopian languages. Our research makes two key contributions: (i) we investigate various deep learning techniques to accurately predict the language of low-resource document images written in Ethiopic script and provide a generic solution for the task, and (ii) we present a multilingual Ethiopic-script language dataset, manually annotated and generated from various websites and social media sites for four low-resourced Ethiopic languages.
The remaining sections of this paper are organized as follows. Section 2 presents a review of related literature. Section 3 outlines the methodology used to address the language identification problem, along with the process model employed. In Section 4, we present the experimental results of our study, which are based on the dataset we have created. Finally, in Section 5, we conclude the paper by summarizing our findings and discussing future research directions.

2. Related Work

Language identification is the examination of different types of languages in a collection of documents; it is a way to determine which natural language is present in a document. LID can be performed on document images [16,22,23,24,25] as well as text documents [16,26,27,28,29,30,31,32,33]. Language identification can be performed on any language format, including speech, sign language, handwritten text, and handwritten text images [31], and it is applicable to all types of data storage media, including digital, as stated in [10].
In today’s multilingual world, using documents in multiple languages is commonplace. To comprehend a document and use it in other automatic analyses, it is necessary to determine the language of the document. In multilingual settings, languages may share a script but differ in writing style, or share a language but differ in script, making differentiation difficult. As a result, language identification is receiving attention from researchers seeking to address such issues. Language identification analyzes the extracted text of each document to determine the language in which it was written; it reveals how many languages are contained in each document, as well as the proportion of each language. Language identification is the first step before subsequent stages of automatic document analysis in various areas, such as machine translation [27], natural language processing [34], sentiment analysis, and OCR [35]. Language identification from text documents and document images has therefore piqued the interest of a number of researchers.
Managing multilingual documents presents a number of challenges, including script identification, language determination, text with varying character sets, reading direction, and document image noise. To this end, language-based document image analysis is required in a multilingual environment. It facilitates browsing, searching, and accessing documents related to a specific language. Automatic language identification is critical in processing large volumes of document images, particularly when used in a multilingual OCR system and in document image retrieval to obtain sorted document images, select specific OCRs, and search online collections of document images for those containing a specific language [36].
The authors in [28] studied language identification on multilingual character windows of textual documents, addressing language identification in both short and long multilingual texts. As part of their contribution, the authors gathered numerous documents from six datasets containing 131 languages, including HTML markup syntax, and prepared a dataset. In [37], CNN and RNN are combined for language detection such that a CNN is followed by an RNN. The authors were motivated by the difficulty of identifying language in social media texts, where casual styles, closely related language pairs, and code-switching are common. These issues were investigated in a shared task on Language Identification in Code-Switched Data by employing a hierarchical neural network to learn character and contextualized word-level representations in order to make word-level language predictions. The authors submitted labels for both the Spanish-English and Arabic-MSA language pairs that were available (distinguishing Modern Standard Arabic from Arabic dialects).
The authors in [26] developed a technique for effective text-based language identification, focusing on seven Ethiopian languages: Afar, Amharic, Nuer, Oromo, Sidamo, Somali, and Tigrigna. They attempted to address the identification of languages in multilingual documents containing these seven languages. To accomplish this, they chose classic machine learning techniques, namely the Naïve Bayes classifier, the SVM classifier, and a dictionary method. The authors note that no standard corpus exists for research purposes, so they created a document corpus covering those seven languages. The data were then separated into training and testing sets, and character n-grams of size 3 were used as the feature set for training the Naïve Bayes and SVM classifiers, while stop words were used for the dictionary-based approach.
The experimental results show that the Naïve Bayes classifier is 98.37% accurate, the SVM classifier is 99.53% accurate, and the dictionary method is 90.53% accurate. However, the authors demonstrate that incorrect categorization occurs when evaluating homogeneous documents with the 3-gram corpus because the languages have a similar script, making differentiation difficult. Textual or electronic language identification is nearly a solved problem for well-resourced languages (https://bit.ly/3BAYgqr), as in [16]. However, challenging issues remain in recognizing languages from document images. Aside from language recognition, OCR transcription has proven difficult when a document is written in multiple languages: because OCR converts document images into text that is understandable in a specific language, the language of the document image should be recognized before any further work is done. Though much research has been done on the recognition of well-known languages, little focus has been placed on automatic language identification for low-resourced languages.
The identification of languages from document images has been the focus of various researchers for different reasons. For example, language identification from document images has been implemented for OCR and NLP purposes [34]. Similarly, [25] attempts to determine the language of a document image using character shape coding analysis and bi-gram analysis after OCR. The author constructs word shape codes based on character extreme points and the number of horizontal word runs. After the shape codes have been extracted, the degree of similarity between the document vector and the language templates is measured. However, language type detection and identification in multilingual document images remains an unaddressed issue in the context of low-resourced languages, especially Ethiopian languages.
In recent years, several researchers, including [10,25] and [16], have attempted to identify languages from document images and texts. These approaches recognize the script, transcribe the image to text, and then perform language identification, which works when the languages use different scripts. However, such recognition-based identification is generally less effective.
While the detection of languages can be done on textual or electronic documents, it has become necessary to detect languages from document images in order to handle documents that exist in image form. This is important for successive tasks such as document image retrieval, OCR, machine translation, and sentiment analysis. In order to detect languages in multilingual document images, researchers have used various approaches such as character, word, and text-line segmentation methods [22,38]. Among these approaches, text-line segmentation [38] has been recommended, as it requires fewer preprocessing steps.
In this paper, we have employed the text-line segmentation approach for multilingual document image language segmentation and detection. We have explored various pre-trained CNN models and compared them with versions trained from scratch.

3. Methodology

This section presents the development of our deep learning-based language identification model for Ethiopic scripts, including dataset preparation and model development, as outlined in Figure 3.

3.1. Dataset for Language Identification

The first step of the study was to gather data from various sources since there were limited standard datasets available for the intended objective. For collecting multilingual document images written in Ethiopic script, two settings were employed: (i) printed documents were gathered and scanned, and (ii) Ethiopic script texts were collected and converted into PNG format. The former dataset enables us to test our model in realistic settings, while the latter is a synthetic dataset created from several sources, such as Amhara Mass Media (AMICO) and BBC social media pages, and used for training our model.
This research utilized two datasets, namely a printed and a synthetic dataset, as described below.
We used the printed document dataset because our objective was to solve the challenge of languages that have similar scripts. For example, we used Ethiopian spiritual books, as depicted in Figure 2, which allowed our system to perform language detection for two, three, or even four languages that have similar scripts.
The reason we used the synthetic dataset is that electronic documents are becoming increasingly common, which poses a challenge for language detection. We segmented this dataset based on text-line segmentation, where one page produced an average of 30 line images, depending on the number of lines in the document. Overall, we preprocessed 24,444 printed and 97,012 synthetic segmented line images for our model. From these collected document images, we used only 24,444 line-segmented images in the first phase, balanced against the printed dataset.
To enhance the quality of document images and improve overall performance in downstream tasks, we employed various preprocessing techniques, including size normalization, grayscale conversion, skew detection, binarization, and segmentation.

3.1.1. Document image size normalization:

Document images are collected in various settings, from different devices and physical documents. In order to display document content in a clear format, the image size is rescaled to 2000 × 2000 pixels after some adjustments.

3.1.2. Grayscale Normalization:

For language identification, document images should be in grayscale format. However, the collected document images are often in RGB format. These colored images are larger and carry a higher risk of information loss during image processing. The cv2.cvtColor() method in OpenCV is used to convert the images to grayscale.
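For illustration, the following minimal OpenCV sketch combines the size normalization and grayscale conversion steps described above; the file names are placeholders, not from the paper:

```python
import cv2

# Load a scanned page (OpenCV reads it as BGR), convert it to grayscale,
# and rescale it to the 2000 x 2000 size used in this study.
image = cv2.imread("document_page.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
normalized = cv2.resize(gray, (2000, 2000))
cv2.imwrite("document_gray_2000.png", normalized)
```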

3.1.3. Skew Detection and Enhancement:

Printed document images may not be properly aligned, so image correction is required to obtain appropriate lines from the document images. Canny edge detection is used to obtain the edges of text lines, followed by a Hough transform for skew detection and correction. These methods are essential for proper document skew detection and angle selection.
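A sketch of this step follows: Canny edges feed a probabilistic Hough transform, the median line angle is taken as the skew estimate, and the page is rotated accordingly. The Hough parameters (thresholds, line lengths) are assumptions chosen for illustration, not values reported in the paper.

```python
import cv2
import numpy as np

def estimate_skew_angle(gray):
    """Estimate the dominant text-line angle via Canny edges + Hough lines."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 100,
                            minLineLength=200, maxLineGap=20)
    if lines is None:
        return 0.0
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in lines[:, 0]]
    return float(np.median(angles))  # the median is robust to outlier lines

def deskew(gray):
    """Rotate the page by the estimated angle to level the text lines."""
    angle = estimate_skew_angle(gray)
    h, w = gray.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, rot, (w, h), borderValue=255)
```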

3.1.4. Noise Removal:

Printed document images may contain various types of noise that compromise image quality. Filtering approaches, such as median, Gaussian, max, and min filters, are employed to reduce noise in the image. Median filtering was selected for image preprocessing because, among the candidate filtering methods, it produced the minimum Mean Square Error (MSE) and maximum Peak Signal-to-Noise Ratio (PSNR).
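The selection criterion can be sketched as below; the 3×3 kernel sizes are assumptions. With pixel values scaled to [0, 1] (i.e., max_val = 1.0), this PSNR definition is consistent with the magnitudes later reported in Table 1.

```python
import cv2
import numpy as np

def mse(a, b):
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def psnr(a, b, max_val=255.0):
    m = mse(a, b)
    return float("inf") if m == 0 else 10 * np.log10(max_val ** 2 / m)

# Candidate filters compared in this study (3x3 kernels are our assumption).
# On grayscale images, max filtering is dilation and min filtering is erosion.
FILTERS = {
    "median":   lambda img: cv2.medianBlur(img, 3),
    "gaussian": lambda img: cv2.GaussianBlur(img, (3, 3), 0),
    "max":      lambda img: cv2.dilate(img, np.ones((3, 3), np.uint8)),
    "min":      lambda img: cv2.erode(img, np.ones((3, 3), np.uint8)),
}

def best_filter(clean, noisy):
    """Return the filter name with minimum MSE (equivalently maximum PSNR)."""
    scores = {name: mse(clean, f(noisy)) for name, f in FILTERS.items()}
    return min(scores, key=scores.get)
```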

3.1.5. Document Binarization:

Document images are typically represented in a printed format with a white background and black text. Binary representation of documents is achieved by automatically selecting a threshold that distinguishes between foreground and background data. Simple thresholding values are implemented for document image binarization.
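For completeness, a minimal sketch of the thresholding step is shown below. The fixed threshold of 127 is an assumption (the paper reports using simple thresholding but not the value); Otsu's method is included as a common automatic alternative.

```python
import cv2

gray = cv2.imread("document_gray_2000.png", cv2.IMREAD_GRAYSCALE)

# Simple (global) thresholding with an assumed threshold of 127.
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Otsu's method picks the threshold automatically from the histogram.
_, binary_otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("document_binary.png", binary)
```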

3.1.6. Segmentation:

Line-based segmentation is used to extract each individual language from the document. Horizontal projection profiles are used to divide lines in a document, and line-based segmentation is suitable for document image transcription and recognition problems. Active contour and morphological dilation are used to achieve text line segmentation.
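A minimal sketch of projection-profile line segmentation follows; it assumes a binarized page with black text on a white background, and the min_height parameter is an illustrative assumption. Morphological dilation (cv2.dilate on the inverted image) can first be applied to merge broken strokes, as described above.

```python
import numpy as np

def segment_lines(binary, min_height=10):
    """Split a binarized page (black text on white) into text-line images
    using a horizontal projection profile."""
    ink = 255 - binary                 # make text pixels non-zero
    profile = ink.sum(axis=1)          # amount of ink in each row
    has_text = profile > 0
    lines, start = [], None
    for y, row_has_text in enumerate(has_text):
        if row_has_text and start is None:
            start = y                  # a text line begins
        elif not row_has_text and start is not None:
            if y - start >= min_height:
                lines.append(binary[start:y, :])
            start = None               # the line has ended
    if start is not None:
        lines.append(binary[start:, :])
    return lines
```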

3.2. Proposed System Architecture

The proposed system architecture outlines the development of a language identification model for four Ethiopic-script languages using deep learning-based approaches. Figure 3 depicts the proposed model architecture, which consists of several steps. The process starts with collecting multilingual document images of Ethiopic script from various sources, including synthetic and scanned (printed) data, which are structured to form the dataset. The synthetic document images are manually preprocessed by removing superfluous words or quotations, as well as symbols and other languages’ symbols. Then, automatic preprocessing techniques such as document skew detection, grayscale conversion, binarization, and normalization are applied. The document images are then segmented using line segmentation methods and fed to the CNN models for classification.
Figure 3. General Flow Diagram of the Proposed Model
To extract basic features and minimize storage consumption and feature engineering efforts for multilingual document image language identification, we used CNN layers. Classification was accomplished using a supervised machine learning technique, which categorizes document images based on training instances. This paper compares and evaluates the effectiveness of various pretrained deep learning models such as MobileNetV2, Vgg16, and AlexNet for detecting the four Ethiopian languages from their language script and shape. In addition, we employed CNN models trained from scratch for language identification.

3.3. Dataset Split and Model Evaluation Metrics

We computed the assessment results with a 70–30% split after all preprocessing, as specified in the experiments; we noted good accuracy and a low loss score. As a result, the dataset was divided into a training set (70%) and a testing set (30%) for all trials conducted as part of this project. The test set was supplied to the model after training with the four deep learning algorithms: Mobile Net V2, Vgg16, custom-CNN, and Alex Net. Finally, accuracy, recall, F-measure, and precision were used to assess the performance of our model’s predictions, along with a confusion matrix, an n × n matrix used to assess the effectiveness of a classification model, where n is the number of target classes. The model’s predictions are contrasted with the actual target values in the matrix to obtain a comprehensive understanding of the performance and the types of errors of our identification model. To this end, we used a 4 × 4 matrix with four measurements, i.e., true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), to measure the accuracy and precision of the developed prediction model. As stated, the selected languages share the same script, which makes checking their similarity or confusion challenging; hence, the confusion matrix plays a great role.
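A minimal sketch of the 70–30 split follows; the array shapes, stratification, and random seed are our assumptions for illustration, as the paper does not specify them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the segmented line images and their
# four language labels (0..3 = the four Ethiopic-script languages).
X = np.random.rand(1000, 50, 250, 1)
y = np.random.randint(0, 4, size=1000)

# 70% training / 30% testing, stratified so each language keeps its share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```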
Accuracy: the ratio of correctly recognized samples (either positive or negative) to the total number of samples.

$Accuracy = \frac{TP + TN}{Total} \times 100$

Precision (or positive predictive value): the proportion of predicted positives that are actually positive.

$Precision = \frac{TP}{TP + FP}$

Recall: the proportion of actual positives that are predicted as positive.

$Recall = \frac{TP}{TP + FN}$

F1 score: conveys the balance between precision and recall.

$F1 = \frac{2 \times (Precision \times Recall)}{Precision + Recall}$
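As an illustration, all four metrics can be computed directly from a confusion matrix. The sketch below assumes rows hold actual classes and columns hold predicted classes; as a sanity check, feeding it the 2-language matrix reported later in Figure 13 reproduces the 98.4% accuracy of Section 4.5.1.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Per-class precision/recall/F1 and overall accuracy from an n x n
    confusion matrix (rows = actual class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp          # actually the class but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = 100 * tp.sum() / cm.sum()
    return accuracy, precision, recall, f1

# 2-language example with the counts reported in Figure 13.
acc, p, r, f1 = metrics_from_confusion([[2534, 15], [67, 2517]])
print(f"accuracy = {acc:.1f}%")       # -> accuracy = 98.4%
```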

4. Experimental Results and Discussion

This section describes the experimental evaluation of the proposed method for the task of language identification from document images. The experiment is carried out in accordance with the architecture depicted in Figure 3. To this end, the experiment goes through several stages, including data collection, preprocessing, dataset splitting, feature extraction via CNN, model building, training, testing and evaluation. Moreover, the experiments were conducted using two types of datasets: printed and synthetic.

4.1. Data Preparation

Once the required image documents were acquired, the next step was to prepare the dataset so as to facilitate the overall process. The common data preparation techniques employed in this study were skew detection and enhancement for printed documents, and segmentation. Grayscale conversion, binarization, Canny edge detection, and size normalization are all required steps in order to detect and correct skew; in addition, the Hough transform was used for skew detection and correction during dataset preparation. To extract characteristics from the images on a common footing, all characters are brought to a similar size via the normalization process, which transforms images of arbitrary size, such as (2339, 1654), (1654, 2339), and (2286, 2000), among others. The size (2000, 2000) was found to be the best for detecting line edges easily. Each chosen image was then normalized and converted to grayscale, after which it was binarized: a binary (two-level or bi-level) image is one in which the only colors are black and white. To accomplish this, we used thresholding techniques for the binarization of the multilingual document images. The first step in binarization is determining the grayscale threshold value and whether or not a pixel has a specific gray value. Simple thresholding was chosen because it works with the black-and-white document format and is also simple to use, as shown in Figure 4.
Segmentation: From among the various available image segmentation approaches for document image analysis (text-line, word, and character segmentation), we used text-line segmentation because it produces better results than the other techniques, as indicated in [39].
Dilation is one of the morphological image enhancement techniques and is applied to enlarge or thicken the foreground object [40]. We used it to join disparate line flows in the document image. We create a kernel matrix composed only of ones, then move the kernel across the picture to dilate it. A kernel is simply a small matrix, also used for edge detection, sharpening, and blurring. Therefore, we dilated the intended text lines in the binarized document image.
Active contour: a curve connecting all continuous points (along a boundary) of the same hue or intensity. The contours in this situation serve as an effective tool for identifying and locating lines in a document. After grayscale conversion, applying a threshold or Canny edge detection to the black-on-white document image is the step before selecting the proper region of interest, or line, from the document. For the portions of the target object intended for segmentation, the active contour defines a distinct boundary or curvature. Using these related concepts, we segmented the multilingual document images for our study with Canny edge detection supported by active contours; the same edge detection operators have been applied elsewhere to identify glass surface defects [41]. Using this technique, we segmented a number of multilingual document images into line images. Both previously preprocessed datasets, i.e., 880 printed documents and 3,360 synthetic document images, were segmented based on text-line segmentation; one page produces around 30 line images, depending on the lines in the document, giving in total 24,444 printed and 97,012 synthetic segmented line images for our model. These techniques were adopted from [31].
Figure 6. Line-Based Segmented Image
Due to computer storage constraints and to balance with the printed dataset, we used only 24,444 line-segmented and preprocessed images from this large collection; even with two computers and Google Colaboratory accessed via four email addresses, we could not process more than this. Finally, all the manual preprocessing and segmentation resulted in line-segmented image documents for the study. Because these segmented line images from the multilingual documents are used in our study, they had to be divided into the four language categories. Furthermore, we initially applied only limited preprocessing under the assumption that the synthetic documents were noise-free; however, after reviewing the entire dataset, we applied several noise removal techniques to the images.
Noise Removal: Images contain a variety of noises, and camera-captured images frequently contain salt-and-pepper noise. Removing image noise is therefore a critical step in improving document image quality. Noise reduction algorithms should smooth the image while staying away from areas of high contrast in order to reduce or eliminate visible noise. To achieve this goal, we used median noise reduction.
Table 1. Filtering Experiment (MSE and PSNR per noise level)

Filtering technique | Low: MSE, PSNR    | Medium: MSE, PSNR | High: MSE, PSNR    | Average: MSE, PSNR
Median              | 0.0055, 22.5884   | 0.0062, 22.0194   | 0.00629, 22.0133   | 0.00602, 22.2070
Max                 | 0.00471, 16.2310  | 0.0003, 8.0905    | 0.0700, 6.54609    | 0.0250, 10.28921
Min                 | 0.00055, 12.3451  | 0.0005, 8.9000    | 0.02882, 21.7893   | 0.0099, 16.31037
Gaussian            | 0.01536, 18.1357  | 0.0153, 18.1274   | 0.01579, 18.0147   | 0.0155, 18.09265

Pick point (minimum MSE and maximum PSNR): MSE = 0.00602727, PSNR = 22.20704346
The experimental results revealed that the median filtering technique has the lowest MSE and the maximum PSNR values, which leads to the conclusion that it is the optimal filtering strategy for the remainder of the preprocessing steps in our research. Its average MSE across all noise levels is 0.0060273, and its maximum PSNR is 22.2070435. Because median filtering has the lowest MSE value, the error density between the original grayscale image and the denoised image is the lowest. In addition, the PSNR between the two images expresses the peak signal-to-noise ratio in decibels (dB); median filtering produces better-quality reconstructed images, since it has the highest PSNR value. Based on these results, median filtering performs better in terms of noise reduction than the other filtering approaches.
Size Normalization: Although documents are resized before segmentation, size normalization of the individual line-segmented images is a necessary preprocessing step afterwards in order to prevent biased results and inaccurate predictions. In this stage, the image resolution threshold is the same for every image in the data collection: segmented line images have to be normalized to an appropriate size and pixel value. To this end, before normalizing, the size (width, height, and channel) of each image must be checked once all the line-segmented data have been manually labeled and placed in the appropriate folders. We experimented with a variety of sizes to find the proper size for our segmented lines.

4.2. Model Building Specification

After splitting the dataset into training and testing sets, we built the models using deep learning algorithms, namely Alex Net, Mobile Net, Vgg-16, and a custom CNN. A number of hyperparameter tuning steps were performed to regulate the order in which data enters, is stored in, and leaves the network. To implement these algorithms, we imported several packages into a Jupyter Notebook, using the fit() method for training along with model evaluation parameters. The error, or loss, is computed using categorical cross-entropy, and the learning rate for each parameter is adjusted using the Adam optimizer. Max pooling is applied to downsample the features during the learning phase of the model; because each language is derived from a similar script, max pooling is the preferred pooling technique for selecting the maximum feature map from the dataset. As we are working with multinomial class labels, we employed the softmax function in the output layer. Using integer values ranging from 0 to 3, we labeled the four Ethiopic-script classes in array format.
We employ the most popular model type, the sequential type, which enables us to build models layer by layer. The first layer demands that Conv2D accept the input shape parameter for the custom-CNN and the three pre-trained models, which is the input image’s size and channels; in this study, we passed the input image’s parameters (width, height, channel) as (250, 50, 1). Furthermore, dropout (0.2) has been used after max pooling in each chosen algorithm to prevent the overfitting and underfitting issues that frequently arise in machine learning. For the four deep learning algorithms, the dense layer comes with a number of class units, i.e., the number of classes into which the convolved and pooled features of an annotated image are classified. Therefore, we declared the dense layer with 4 class labels, or nodes, in the layer.
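A minimal Keras sketch of such a model is given below. Note that the paper reports the input shape as (width, height, channel) = (250, 50, 1), whereas Keras expects (height, width, channels), hence (50, 250, 1) here. The number of convolutional blocks and the filter counts are assumptions for illustration; the Dropout(0.2) after max pooling, the Adam optimizer, the categorical cross-entropy loss, and the 4-way softmax follow the text.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Dropout,
                                     Flatten, Dense)

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(50, 250, 1)),
    MaxPooling2D((2, 2)),
    Dropout(0.2),                      # after pooling, as described above
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Dropout(0.2),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(4, activation="softmax"),    # classes 0..3: the four languages
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```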

4.3. Results

The experimental results were obtained by applying the deep learning models to two datasets, namely a dataset of synthetic document images and a dataset of printed documents. Typically, the batch size, epoch, and learning rate hyperparameters were tuned. To that end, we present three epoch values (10, 22, 32), two learning rates (0.001, 0.0001), and two batch sizes (16, 32).
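The full sweep can be sketched as a simple grid loop. Here, build_model is a hypothetical helper returning a freshly initialized copy of the CNN sketched in Section 4.2, and the data arrays (with labels one-hot encoded, e.g., via keras.utils.to_categorical) come from the split in Section 3.3.

```python
import itertools
from tensorflow.keras.optimizers import Adam

results = {}
for epochs, lr, batch_size in itertools.product((10, 22, 32),
                                                (0.001, 0.0001),
                                                (16, 32)):
    model = build_model()              # fresh, untrained model per setting
    model.compile(optimizer=Adam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=epochs,
              batch_size=batch_size, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    results[(epochs, lr, batch_size)] = acc
```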

4.4. Experimentation on Printed Document Image Dataset

The main points observed from the four experimental settings on a data set of printed documents are as follows: (i) accuracy increases noticeably when the epoch is changed from 10 to 22, and (ii) accuracy generally decreases when the epoch is changed from 22 to 32, with the exception of Alex Net, the algorithm with the lowest performance.
Figure 7 provides a summary of the results from experiments 1-4. Overall, custom-CNN is the most effective algorithm, followed by Vgg-16, Mobile Net, and Alex Net. In all cases, custom-CNN has an accuracy rate of more than 90%, while Vgg-16 has an accuracy of about 85%. The accuracies of the Mobile Net and Alex Net algorithms oscillate between 76% and 82%, and between 62% and 81%, respectively. Experiment 3 has the highest accuracy for the best-performing algorithm, with an epoch of 22, a batch size of 32, and a learning rate of 0.001. The custom-CNN also performed well in the same experiment using epoch 32. These are the two points where all the algorithms showed improved accuracies. More specifically, Alex Net had its best accuracy at epoch 22, while Vgg-16 and Mobile Net scored their best accuracies at epoch 32 compared to the other experiments.

4.5. Experimentation on Synthetic Dataset

We evaluate the performance of the algorithms employed in this study using the synthetic dataset. The rationale for using synthetic datasets is that we can generate large amounts of such data in a short period of time, allowing us to evaluate the performance of the algorithms with large amounts of data.
We used the same parameters as in the previous experiments on the printed document image dataset. In the first experiment of this section (experiment 5), the batch size is set to 16, the learning rate to 0.001, the dropout to 0.02, and the epochs to 10, 22, and 32. For the second experiment (experiment 6), the learning rate is reduced to 0.0001 on top of the first; in the third setup (experiment 7), the batch size is increased to 32 on top of experiment 5; and in the final setup (experiment 8), the learning rate is reduced to 0.0001 on top of experiment 7. The results of these experiments are discussed below. In experiment 5, custom-CNN and Vgg-16 show higher accuracy than the other algorithms in general. Mobile Net has the third most accurate prediction at epoch 10, and Alex Net has the third most accurate prediction at epoch 32. With the exception of the Mobile Net algorithm, whose trend does not hold when moving from epoch 22 to 32, we can see an increase in accuracy as the epoch increases.
In experiment 6, we show the algorithms’ accuracy at epochs 10, 22, and 32 with a batch size of 16, a learning rate of 0.0001, and a dropout of 0.02. We observe that custom-CNN outperforms the other algorithms across all epochs, followed by Vgg-16 and Mobile Net. Except at epoch 32, Alex Net has the lowest accuracy across all epochs. All the algorithms show improvement when the epoch is changed from 10 to 22. Compared to the trend at the earlier epochs, only Alex Net shows an increase at the highest epoch (i.e., epoch 32).
In experiment 7, we see that custom-CNN performs the best across all epochs, while Alex Net has the lowest prediction accuracy. Furthermore, we notice that the accuracy of all the algorithms improved as the epoch values increased, except for Alex Net, which exhibited unusual behavior, its accuracy oscillating (moving up and down) as the epoch values increased.
In general, our custom-CNN exhibits the best overall prediction performance in experiment 8. Vgg-16 has the second-best prediction accuracy, followed by Mobile Net and Alex Net; Alex Net shows the lowest prediction accuracy. This experiment taught us the following: (i) as epoch size increases, so does the performance of all algorithms; and (ii) the increment from epoch 22 to epoch 32 is smaller for the best-performing algorithm, whereas the others show a significant increment. In addition, we present a comparison of experimental results for the printed and synthetic document image datasets under different hyperparameter tunings, such as batch size, learning rate, and epoch differences. For the sake of clarity, we combined the parameters in four different ways (hp1: BS=16, LR=0.001; hp2: BS=16, LR=0.0001; hp3: BS=32, LR=0.001; hp4: BS=32, LR=0.0001).
First, we note that our custom-CNN outperforms the other algorithms across all epochs for both datasets, followed by Vgg-16 and Alex Net for hp1, as indicated in Figure 8. Mobile Net has the lowest accuracy on the synthetic dataset across all epochs; the printed dataset shows a similar pattern except at epoch 22, where Mobile Net outperforms Alex Net. It is also clear that all algorithms perform better on the synthetic dataset than on the printed document image dataset across all epochs.
We can observe a similar result on hp2 depicted in Figure 9 for both datasets. Custom-CNN and Vgg-16 show higher accuracy than other algorithms in general. Mobile Net and Alex Net show the third and the fourth most accurate predictions. However, Mobile Net and Alex Net behave differently. For example, Mobile Net outperforms Alex Net at epochs 10 and 22, but Alex Net outperforms at epoch 32.
When both the printed and synthetic datasets are used for hp3, as shown in Figure 10, the algorithms perform better on the synthetic dataset. CNN outperforms all algorithms, with Vgg-16 coming in second. On average, Mobile Net and Alex Net have similar prediction accuracy. Compared to hp1, all algorithms show an increase in prediction accuracy as the batch size increases for both datasets, except Alex Net, which increases for the printed dataset but decreases for the synthetic dataset.
hp4 exhibits similar behavior to the previous settings, i.e., all algorithms perform better on synthetic datasets. On both datasets, custom-CNN continues to perform better than Vgg-16 as the top algorithm. Mobile Net is the third best performer algorithm. Alex Net has typically the worst prediction performance in this setup. When the performance results obtained in this setup are compared to hp2, all algorithms show a decrease in prediction performance.

4.5.1. Experiment with the Best Model

To evaluate the effect of dataset quantity on the performance of the selected model, we conducted a further experiment on a dataset of 32,000 document images using the custom-CNN with a batch size of 32, a learning rate of 0.001, and 32 epochs. This yielded an accuracy of 97.27% on the synthetic document image dataset. More specifically, the model was tested with 4,480 sample images and correctly classified 4,458 of them; i.e., 1,077, 1,078, 1,058, and 1,110 images were correctly classified as Amharic, Geez, Awigna, and Tigrigna, respectively, as depicted in the confusion matrix in Figure 12. As the number of epochs increases, training accuracy outpaces validation accuracy; similarly, validation loss grows larger and exceeds training loss.
We further conducted an experiment to assess the performance of the chosen model by examining the effect of language distribution. To this end, we conducted experiments on two- and three-language datasets. The languages were selected on the basis of the individual accuracy of the algorithm measured in previous experiments. Document images of the Ethiopic-script languages Amharic, Tigrigna, and Geez were chosen for this purpose.
As shown in Figure 13, the algorithm correctly predicts 2,534 Amharic document images as Amharic and 2,517 Tigrigna document images as Tigrigna, while 15 Amharic and 67 Tigrigna document images were misclassified. As a result, the overall accuracy of the algorithm on this dataset is 98.4%. Regarding the validation and training accuracy and losses, the training accuracy increases as the number of epochs increases, whereas the validation accuracy fluctuates. Training loss decreases significantly as the epoch count increases, whereas validation loss behaves inconsistently.
We also performed an experiment on three-language document images consisting of Amharic, Tigrigna, and Geez, which produced an accuracy of 97.79%. The confusion matrix, as well as the training and validation accuracy and losses, are depicted in Figure 14 and Figure 15, respectively. As the epoch size increases, the algorithm’s training and validation accuracy improves while its loss decreases.
Testing the effect of language distribution on the performance of the chosen algorithm is one focus of the experiments in this section. The experimental results reveal that language distribution and prediction performance are correlated, in that accuracy decreases as the number of languages increases. This behavior is shown in Table 2: the algorithm performs better when fewer languages are used rather than a combination of more languages.

4.6. Discussion

We conducted extensive experiments on printed and synthetic document images using a custom-CNN and pretrained models across diverse experimental settings. In general, custom-CNN was found to be the best-performing model, as shown in Table 3. The experimental parameters that produce the best results are batch size 32, learning rate 0.001, and epoch 32 on the synthetic dataset. The pre-trained models perform worse than the custom-CNN and achieve their respective best results in different experimental settings. Vgg-16, for example, performs best when the batch size is 16, the learning rate is 0.001, and the epoch is 32. The Mobile Net and Alex Net models perform best with batch sizes of 16 and 32, learning rates of 0.001 and 0.0001, and epoch 32, respectively. To this end, the batch size 32, learning rate 0.001, and epoch 32 settings allow the majority of algorithms to perform optimally.
Custom-CNN is the best model for detecting the language of Ethiopic-script document images. We achieved an accuracy of 97.11% on synthetic and 94.45% on printed document images; both results were obtained with a batch size of 32, a learning rate of 0.001, and epochs of 32 and 22, respectively.
We investigated the effect of dataset size on performance using the best-performing algorithm, i.e., the custom-CNN. To this end, we increased the number of document images from 24,444 to 32,000 and observed an improved accuracy of 97.27%. Although more data generally leads to higher accuracy, we only saw a minor improvement in our case. We wanted to experiment with more data, but due to limited resources we could not process more than 32,000 images. The quality of the document images has a clear impact on the performance of the algorithms, as all of them perform better on the synthetic document image dataset. We also explored the effect of language distribution by conducting experiments with various combinations of the four languages: a 2-language set (Amharic-Tigrigna), a 3-language set (Amharic-Tigrigna-Geez), and the 4-language set. Better results are obtained with fewer language combinations: the 2-language experiment outperforms the 3- and 4-language experiments, and, similarly, the 3-language experiment outperforms the 4-language experiment in similar settings.
In general, we also compared the outcomes of our language identification work to earlier language identification performed on Ethiopic script by Biruk Tadesse [26]. We obtained a testing accuracy of 97.11%, compared to Biruk Tadesse’s score of 99.53% on a textual dataset. Biruk’s higher score can be attributed to the simpler nature of textual datasets, which are easier to identify. In addition, more than half of the languages he identified do not share a similar script, making them simple to distinguish. In contrast, in our work, LID is carried out on document images, and the languages share a similar script, making detection difficult.

5. Conclusions

In this paper, we addressed the problem of language identification in low-resource multilingual document images written in the same script, and introduced a dataset for training and testing such models. We scanned, preprocessed, and segmented document images in Ethiopic script using various techniques, including text-line-based segmentation and CNN-based feature extraction. To find the best model for this task, we conducted extensive experiments using a custom CNN and pre-trained deep learning models, namely AlexNet, MobileNet, and VGG-16. The results showed that the custom CNN outperformed the pre-trained deep learning models employed in this study, while the Vgg-16 pretrained model performed better than the other pretrained models. The best-performing model was able to identify languages in mixed document images with an accuracy of 97.27% on 9,600 test images. It is important to note that the document images considered in this study consisted of line texts written in one language; our model cannot identify languages mixed within a single line, nor can it segment and detect properly if documents contain mixed-in formulas, graphs, and other items. Therefore, as part of future research, we will leverage CNNs for language identification from document images as a step towards fully automated downstream tasks such as OCR, document image retrieval, and machine translation. Moreover, we will consider developing a language identification module for different scripts and languages, as well as including handwritten document images, to achieve full-fledged language identification from document images.

References

1. READ, T. Reading for Ethiopia’s Achievement Developed Technical Assistance (READ TA), 2014.
2. Belay, B.H.; Gebremeskel, G.B.; Bezabih, B.B.; Gebeyehu, S. Factorized Recurrent Neural Network with Attention for Language Identification and Content Detection. ACM Transactions on Asian and Low-Resource Language Information Processing 2023, 22, 1–13.
3. Eberhard, D.M.; Simons, G.F.; Fennig, C.D. Ethnologue: Languages of the World, 23rd edn.; SIL International: Dallas, 2020. Online: https://www.ethnologue.com.
4. Zheng, J.; Liang, Y. User Interest Identification with Social Media Information using Natural Language and Meta-Heuristic Technique. ACM Transactions on Asian and Low-Resource Language Information Processing 2023.
5. Edosomwan, S.; Prakasan, S.K.; Kouame, D.; Watson, J.; Seymour, T. The history of social media and its impact on business. Journal of Applied Management and Entrepreneurship 2011, 16, 79.
6. Sileyew, K.J. Research Design and Methodology; IntechOpen: Rijeka, 2019.
7. Dhivya, S.; Devi, U.G. Study on Automated Approach to Recognize Characters for Handwritten and Historical Document. ACM Transactions on Asian and Low-Resource Language Information Processing 2021, 20.
8. Mengiste, T.; Belay, B.H.; Tilahun, B.; Worku, T.; Tegegne, T. RRConvNet: Recursive-residual Network for Real-life Character Image Recognition. DeLTA, 2022, pp. 110–116.
9. Mahajan, S.; Rani, R. Word level script identification using convolutional neural network enhancement for scenic images. Transactions on Asian and Low-Resource Language Information Processing 2022, 21, 1–29.
10. Jauhiainen, T.; Lui, M.; Zampieri, M.; Baldwin, T.; Lindén, K. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research 2019, 65, 675–782.
11. Sarwar, R.; Rutherford, A.T.; Hassan, S.U.; Rakthanmanon, T.; Nutanong, S. Native language identification of fluent and advanced non-native writers. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 2020, 19, 1–19.
12. Solaiman, I.; Brundage, M.; Clark, J.; Askell, A.; Herbert-Voss, A.; Wu, J.; Radford, A.; Krueger, G.; Kim, J.W.; Kreps, S.; et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.
13. Belay, B.; Habtegebrial, T.; Meshesha, M.; Liwicki, M.; Belay, G.; Stricker, D. Amharic OCR: an end-to-end learning. Applied Sciences 2020, 10, 1117.
14. Belay, B.H.; Habtegebrial, T.A.; Stricker, D. Amharic character image recognition. 2018 IEEE 18th International Conference on Communication Technology (ICCT); IEEE, 2018, pp. 1179–1182.
15. Ma, H.; Doermann, D. Adaptive Hindi OCR using generalized Hausdorff image comparison. ACM Transactions on Asian Language Information Processing (TALIP) 2003, 2, 193–218.
16. Barlas, P.; Hebert, D.; Chatelain, C.; Adam, S.; Paquet, T. Language identification in document images. Electronic Imaging 2016, 2016, 1–16.
17. Bisht, M.; Gupta, R. Offline Handwritten Devanagari Word Recognition Using CNN-RNN-CTC. SN Computer Science 2023, 4, 1–12.
18. Akram, S.; Dar, M.U.D.; Quyoum, A. Document Image Processing: A Review. International Journal of Computer Applications 2010, 10, 35–40.
19. Sarker, I.H. Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Computer Science 2021, 2, 1–20.
20. Siddiqui, S.; Arifeen, M.; Hopgood, A.; Good, A.; Gegov, A.; Hossain, E.; Rahman, W.; Hossain, S.; Al Jannat, S.; Ferdous, R.; et al. Deep Learning Models for the Diagnosis and Screening of COVID-19: A Systematic Review. SN Computer Science 2022, 3, 1–22.
21. Goyal, V.; Sharma, S. Texture classification for visual data using transfer learning. Multimedia Tools and Applications 2022, pp. 1–24.
22. Elgammal, A.M.; Ismail, M.A. Techniques for language identification for hybrid Arabic-English document images. Proceedings of the Sixth International Conference on Document Analysis and Recognition; IEEE, 2001, pp. 1100–1104.
23. Lu, S.; Tan, C.L.; Huang, W. Language identification in degraded and distorted document images. International Workshop on Document Analysis Systems; Springer, 2006, pp. 232–242.
24. Zhu, G.; Yu, X.; Li, Y.; Doermann, D. Language identification for handwritten document images using a shape codebook. Pattern Recognition 2009, 42, 3184–3191.
25. Shijian, L.; Tan, C.L. Script and language identification in noisy and degraded document images. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007, 30, 14–24.
26. Tadesse, B. Automatic identification of major Ethiopian languages. PhD thesis, 2020.
27. Babhulgaonkar, A.; Sonavane, S. Language Identification for Multilingual Machine Translation. 2020 International Conference on Communication and Signal Processing (ICCSP); IEEE, 2020, pp. 401–405.
28. Kocmi, T.; Bojar, O. LanideNN: Multilingual language identification on character window. arXiv preprint arXiv:1701.03338, 2017.
29. Londhe, D.; Kumari, A.; Emmanuel, M. Language identification for multilingual sentiment examination. International Journal of Recent Technology and Engineering 2019, 8, 3571–3576.
30. Kevers, L. L’identification de langue, un outil au service du corse et de l’évaluation des ressources linguistiques. Traitement Automatique des Langues 2022, 62, 13–37.
31. Choudhary, P.; Nain, N. A four-tier annotated Urdu handwritten text image dataset for multidisciplinary research on Urdu script. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 2016, 15, 1–23.
32. Hebert, D.; Barlas, P.; Chatelain, C.; Adam, S.; Paquet, T. Writing type and language identification in heterogeneous and complex documents. 2014 14th International Conference on Frontiers in Handwriting Recognition; IEEE, 2014, pp. 411–416.
33. Nambiar, S.K.; Idicula, S.M. Abstractive Summarization of Text Document in Malayalam Language: Enhancing Attention Model using POS Tagging Feature. Transactions on Asian and Low-Resource Language Information Processing 2022.
34. Sibun, P.; Spitz, A.L. Language determination: Natural language processing from scanned document images. Fourth Conference on Applied Natural Language Processing, 1994, pp. 15–21.
35. Belay, B.H.; Habtegebrial, T.; Liwicki, M.; Belay, G.; Stricker, D. A Blended Attention-CTC Network Architecture for Amharic Text-image Recognition. ICPRAM, 2021, pp. 435–441.
36. Spitz, A.L. In Handbook of Character Recognition and Document Image Analysis; Bunke, H., Wang, P.S.P., Eds.; World Scientific Publishing, 1997; pp. 259–284.
37. Jaech, A.; Mulcaire, G.; Ostendorf, M.; Smith, N.A. A neural model for language identification in code-switched tweets. Proceedings of the Second Workshop on Computational Approaches to Code Switching, 2016, pp. 60–64.
38. Mukherjee, J.; Parui, S.K.; Roy, U. An unsupervised and robust line and word segmentation method for handwritten and degraded printed document. Transactions on Asian and Low-Resource Language Information Processing 2021, 21, 1–31.
39. Dobariya, A.R.; Rathod, V. Horizontal and Vertical Projection techniques for Line & Word Segmentation Process in Offline Handwritten Gujarati Text. National Journal of System and Information Technology 2015, 8, 97.
40. Manhas, P.; Arora, P.; Thakral, S. Image Enhancement Using Morphological Operations: A Case Study. In Computing and Network Sustainability; Springer, 2019; pp. 13–20.
41. Öztürk, Ş.; Akdemir, B. Comparison of edge detection algorithms for texture analysis on glass production. Procedia - Social and Behavioral Sciences 2015, 195, 2675–2682.
Figure 2. Sample Real Multilingual Document Image
Figure 4. Sample Preprocessed Image: (a) Resized and Grayscale Document Image, (b) Binarized Document Image
Figure 5. (a) Dilated Text Line, (b) Active Contour
Figure 7. Summary of Experiment Results on a Printed Dataset
Figure 8. Performance of various Deep Learning Algorithms for the BS=16, LR=0.001 experimental setup
Figure 9. Performance of various Deep Learning Algorithms for the BS=16, LR=0.0001 experimental setup
Figure 10. Performance of various Deep Learning Algorithms for the BS=32, LR=0.001 experimental setup
Figure 11. Performance of various Deep Learning Algorithms for the BS=32, LR=0.0001 experimental setup
Figure 12. Confusion Matrix of the custom-CNN model on a larger dataset
Figure 13. Confusion Matrix of the custom-CNN model on a 2-Language dataset
Figure 14. Confusion Matrix of the custom-CNN Model on a 3-Language dataset
Figure 15. Accuracy and loss for the custom-CNN model on a 3-Language Dataset
Table 2. Comparison of Experimental Results on Different Language Distributions

Language Distribution | Accuracy (%)
2 languages           | 98.45
3 languages           | 97.79
4 languages           | 97.11
Table 3. Summary of Experimental Results Tuning (accuracy in %, per dataset)

Algorithm     BS   Epoch   LR       Printed   Synthetic
Alex Net      16   10      0.001    73        84.88
                           0.0001   75        82
                   22      0.001    75        87.75
                           0.0001   77        82.64
                   32      0.001    84        88.1
                           0.0001   79.3      85.89
              32   10      0.001    77.4      83.5
                           0.0001   62.11     69.28
                   22      0.001    81.88     70.92
                           0.0001   66.34     77.59
                   32      0.001    80.46     84
                           0.0001   71.07     80.48
Custom-CNN    16   10      0.001    92        96.29
                           0.0001   92        95.81
                   22      0.001    94        96.5
                           0.0001   94        96.39
                   32      0.001    94        96.76
                           0.0001   94        95.78
              32   10      0.001    90.69     95.83
                           0.0001   85.43     90.48
                   22      0.001    94.45     96.14
                           0.0001   93.98     94.17
                   32      0.001    94.43     97.11
                           0.0001   93.43     94.91
Mobile Net    16   10      0.001    76        84.98
                           0.0001   77        84.98
                   22      0.001    77        87.5
                           0.0001   80        85.51
                   32      0.001    78        84.55
                           0.0001   79.9      84.83
              32   10      0.001    79.43     80.85
                           0.0001   80.25     85
                   22      0.001    80.32     84.18
                           0.0001   82.12     84.51
                   32      0.001    81        84.88
                           0.0001   81.59     86.02
Vgg-16        16   10      0.001    76        90.52
                           0.0001   78        91.49
                   22      0.001    83        92
                           0.0001   85        92.58
                   32      0.001    84        92.61
                           0.0001   86        92.53
              32   10      0.001    86.67     92
                           0.0001   85.19     87.12
                   22      0.001    84.77     92.32
                           0.0001   86.15     91.03
                   32      0.001    86.48     92.41
                           0.0001   85.15     92.32