Recognition of Brahmi words by Using Deep Convolutional Neural Network

Significant progress has been made in pattern recognition technology. However, one obstacle that has not yet been overcome is the recognition of words in the Brahmi script, specifically the identification of characters, compound characters, and words. This study proposes a deep convolutional neural network (DCNN) with dropout to recognise Brahmi words, and a series of experiments was performed on a standard Brahmi dataset. The method was systematically tested on an accessible Brahmi image database, achieving a 92.47% recognition rate with CNN and dropout, which is among the best results reported in the literature for this task.

The Brahmi script has 368 characters in total: 33 consonants, 10 vowels, and 325 compound characters [18,19]. Brahmi text is written from left to right.

Characteristic
Brahmi consonants are combined with different vowels (Figure 2) to make compound characters (Figure 3). The features added to the consonants to create compound characters are called "Matra". Mostly, these "Matra" appear on the outer edge of the consonants (but not always, because this also depends on the shape of the consonant), or a dot feature (.) is added after the consonant [19]. Nine types of features ("Matra") are used to make compound characters.

Literature review
Of the previous works that have developed Brahmi script recognition systems, only a few focused on Brahmi character recognition and even fewer reported working on Brahmi word recognition. The history of Brahmi script recognition can be traced back to 1983. In this year, Siromoney, et al. [18] developed a system to recognise all Brahmi characters, including all 368 standard vowels, consonants, and compound characters. Their work, however, ignored all complex compound characters. A coded run method was used for the recognition of machine-printed characters of the Brahmi script, with each Brahmi character manually converted into a rectangular binary array. However, the proposed method was a general one, so it could be applied to recognise other scripts. Soumya and Kumar [20] focused on recognising the ancient script of the time of King Ashoka (the Brahmi script), using Gabor features and zone-based feature extraction methods to extract the features of these characters and an ANN for the classification of the extracted features. The accuracy of this system was 80.2%. Next, Vellingiriraj, et al. [11] focused on developing a Brahmi word recognition system in which they used zonal density to extract the features and an ANN to classify the extracted features. Three types of samples (vowels, consonants, and consonantal vowels, i.e. compound characters) were used to measure the performance of the system, achieving accuracies of 93.45%, 93.45%, and 90.24%, respectively. Apart from Brahmi word recognition, in 2016, Gautam, et al. [4] focused on Brahmi character recognition. The zone method was used to extract the features of the Brahmi characters, and the template matching method (coefficient correlation) was used for the classification of all extracted features, achieving an accuracy of 88.83%. However, this method could not recognise non-connected characters.
Another work also developed a system to recognise Brahmi characters [9], focusing on the geometric elements in each zone (corner point, bifurcation point, ending point) and the geometric features (corner point, bifurcation point, ending point, intersection point, circle, and semi-circle) to extract the features. In addition, sets of rules were created for the classification of all these features, yielding a final accuracy of 94.10%. Apart from that, translation of the Brahmi script into the Pali language has also been suggested as an aid to understanding Brahmi words [21].
Various studies have been performed to recognise several scripts. These days, CNN and DCNN are very popular classifiers for classifying characters, digits, words, etc. Both classifiers have been investigated for various script recognition systems involving the English script, the Chinese script, and other scripts under the Abugida system, returning very good performances. The Abugida system is part of the world writing system to which the Brahmi script belongs. Therefore, this review only focused on the script in the Abugida system.
Adak, et al. [22] suggested a method for offline cursive Bengali word recognition. The authors used a hybrid model of a CNN and a recurrent neural network (RNN). The overall word recognition accuracy obtained by the study was 86.96% on the NewISIdb database, and the proposed architecture was segmentation-free. Various studies have focused on word recognition without segmentation. However, a dictionary or a book might be required to predict a word and obtain the information for all possible combinations of characters that form a meaningful word. This study did not find any Brahmi dictionary or book in the literature. Therefore, the further literature review focused only on studies that used isolated characters to train and test the system.
Recently, deep learning-based techniques have attracted growing attention for character recognition [23, 24]. Bangla numerals and digits have been classified using CNN, achieving 77.01% [25] and 98.37% [26] accuracy, respectively; Cecotti and Vajda [25] used 10,000 training samples for 10 classes, i.e. 1,000 samples on average per class, while Maitra, et al. [26] used 13,392 samples for 10 classes, i.e. about 1,339 samples per class. Bangla characters were classified with CNN and achieved 95.6% accuracy, where 25,000 samples were used to train 50 classes [26], i.e. 500 samples on average per class. Similarly, Nair, et al. [27] used two lakh (200,000) samples to train Bangla characters, where the Bangla script has 50 types of characters, including vowels and consonants; thus, the authors used around 4,000 samples on average per class. The same authors [27] suggested that CNN could prove to be a state-of-the-art method for other scripts and could hence also produce a higher recognition rate for Malayalam characters. A DCNN was used to train Bangla compound characters and achieved 90.33% accuracy [28], using the CMATERdb 3 dataset. CNN-based methods [32] were also used to classify Devanagari characters, obtaining 90.58% and 96.02% accuracy, respectively.
Apart from the Bangla and Devanagari scripts, CNN has also been used to classify the Telugu, Tamil, and Oriya scripts. CNN was used to train 2,000 samples of 10 classes of Telugu numerals [26] and 47,428 samples of Telugu characters covering 36 consonant classes and 15 vowel-modifier classes [33], achieving 96.5% and 92.26% accuracy, respectively. Similarly, 18,535 samples were used to train 35 classes of Tamil characters using CNN, reaching 94.4% accuracy [34], and 4,970 samples were used to train ten classes of Oriya numerals using CNN, achieving 97.2% accuracy [26].
Deep learning has the limitation that it requires many samples of each class to train the system. Apart from this issue, two further points emerge: computation time and over-fitting. GPUs can reduce the computation time meaningfully [35], so the time issue largely depends on the performance of the GPU. Various regularisation methods have been introduced to address over-fitting. Dropout is a recently introduced regularisation technique that is widely used in deep learning. It is related to bagging [36], in which a set of models is trained on different subsets of the same training data. Vijayaraghavan and Sra [34] used CNN and dropout together to recognise the characters of the Tamil script, achieving an accuracy of 94.4% on the WFHR-10 Tamil character dataset from the HP Labs India website [34]. Similarly, Alom, et al. [37] tested various combinations of SVM, DBN, CNN, Gaussian filter, Gabor filter, and dropout on an official dataset of Bangla digits and noted that the combination of CNN, Gabor filter, and dropout achieved the highest accuracy of 98.78%.
In conclusion, CNN with Gabor filter performed well in recognising Devanagari characters. In addition, the combination of CNN, Gabor filter, and dropout achieved the highest accuracy in identifying Bangla digits. Therefore, this study will also apply this same combination to determine the accuracy of the proposed Brahmi word recognition system.

Methodology
The recognition of a character image usually involves segmentation (line and character detection), pre-processing (grayscale conversion, binarisation, and size normalisation), feature extraction, and classification. However, all characters had already been isolated in this study, so segmentation was not required. Pre-processing is used to increase the quality of the samples; after that, the feature extraction and classification steps make the final decision.

Input image
The primary job of the word recognition system is to take the samples as input. In this study, images of Brahmi words in JPG format were obtained and treated as the dataset in the first step. The size and resolution of the pictures were not fixed, so a pre-processing step was applied to enhance the quality of the samples and to convert them into useful content. Pre-processing is an essential step for enhancing the visibility and readability of the input. The pre-processing steps employed by this study were thresholding and size normalisation (resizing), further elaborated below.

Binarization:
A grayscale image can be converted into a binary image by binarisation (thresholding). The main motive of thresholding techniques is to enhance the visibility of edges, in which case the shape of the region is more important than the pixel intensities. Two approaches can be used for binarisation: global thresholding and local thresholding. In global thresholding, a single threshold value is used for the whole image according to the background of the image. Hence, this method is useful if the background of the image is simple and uniform. In local thresholding, a different threshold value is chosen for each pixel based on the data of its local area. This approach is usually used for real-life documents because such documents are sometimes designed deliberately with stylistic, colourful, and complex backgrounds.
The background of all images in the Brahmi dataset was white and therefore not complicated [38]. Hence, a global threshold approach was used to perform the thresholding in this study.
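The global thresholding step can be sketched as follows; this is a minimal NumPy sketch, and the threshold value of 128 is an illustrative assumption, as the text does not state the value used:

```python
import numpy as np

def global_threshold(gray, t=128):
    """Binarise a grayscale image with a single global threshold.
    Pixels at or below t (the dark ink) become 1; brighter
    background pixels become 0. The value t=128 is assumed."""
    return (np.asarray(gray) <= t).astype(np.uint8)

# Toy 3x3 "image": a dark vertical stroke on a white background
img = np.array([[250, 40, 250],
                [250, 35, 250],
                [250, 30, 250]])
binary = global_threshold(img)
```

In practice, a library routine such as OpenCV's `cv2.threshold` would typically be used instead of a hand-rolled comparison.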

Resizing:
Normalisation was applied to obtain characters of uniform size. The characters varied in size, so they had to be made uniform in size to enable comparison. For example, the input of an ANN or SVM should be a vector of fixed size. Normalisation may reduce or increase the original size of the image; however, the shape of the image should not be changed.
The required image size depends on the feature extraction and classification methods. Hence, the input images were resized to 32 × 32 pixels for further processing. After that, CNN and CNN with dropout were used to recognise the Brahmi words.
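The size-normalisation step can be sketched with a minimal nearest-neighbour resize to the fixed 32 × 32 grid; this is a stand-in for a library routine such as OpenCV's `cv2.resize`:

```python
import numpy as np

def resize_nearest(img, out_h=32, out_w=32):
    """Nearest-neighbour size normalisation to a fixed grid.
    Preserves the overall shape of the glyph while forcing every
    sample onto the same 32x32 input size."""
    img = np.asarray(img)
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows][:, cols]

small = np.arange(16).reshape(4, 4)        # a tiny 4x4 "character"
normalised = resize_nearest(small)         # enlarged to 32x32
```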

Convolutional Neural Network
Kunihiko Fukushima [39] introduced the Convolutional Neural Network (CNN) in 1980. However, its complicated training algorithm made it hard to use at the time. In the 1990s, promising outcomes were obtained when LeCun, et al. [40] used a gradient-based learning algorithm with CNN. Following this achievement, CNN research improved significantly, achieving better pattern recognition rates. CNN's design allows it to mimic human visual processing, and its structures can be optimised towards the extraction and abstraction of 2D features. CNN's max-pooling layer is particularly effective at absorbing shape variations. Furthermore, compared with a fully connected network of similar size, a CNN requires far fewer parameters. Finally, CNN can be trained with the gradient-based algorithm.
The gradient-based algorithm trains the entire network through a straightforward minimisation of the error criterion; therefore, CNN can output significantly optimised weights. As shown in Figure 4, the overall architecture of the CNN comprises two stages: feature extraction and classification. In the feature extraction layers, the input of each layer of the network is the output of its immediately preceding layer. Convolution, max-pooling, and classification layers are used in the suggested architecture, where the low and middle levels of the network are built from convolutional and max-pooling layers.
Convolution is performed in the even-numbered layers, while max-pooling is performed in the odd-numbered layers. The output nodes of the convolution and max-pooling layers are grouped into a 2D plane, a process known as feature mapping. Each plane of a layer is commonly connected to a combination of one or more planes of the previous layers, and each node of a plane is connected to a small region of each connected plane of the previous layer. The convolution operation is performed on the input nodes, with the feature extraction process conducted in each node of the convolution layer. The max-pooling layer, also operating on its input nodes, abstracts features through an averaging or maximum-propagating operation.
The features propagated from the lower levels are used to obtain the features of the upper levels. During propagation to the highest level, the dimensions of the features are reduced relative to the dimensions of the convolutional and max-pooling masks. However, the number of feature maps usually increases for better recognition performance on the input images. The fully connected network, known as the classification layers, takes its input from the final feature map outputs of the CNN. In this study, a feed-forward neural network was employed for the classification, as it has provided higher performance in previous studies [37].
Feature selection techniques determine the required number of features in the classification layer with regard to the dimensions of the weight matrix of the final ANN. The classification process then uses the extracted features to compute a confidence value for each class of the input sample, and the input image is assigned to the class with the maximum confidence. The next section discusses the mathematical characteristics of the various layers of the CNN.

Convolution Layer:
In the convolution layer, learnable kernels (such as Gabor filters) are convolved with the feature maps of the previous layer. A linear or non-linear activation function (Sigmoid, Hyperbolic Tangent, Softmax, Rectified Linear, or Identity) is then applied to produce the output feature map. The mathematical model of this process is given by Equation 1:

x_j^l = f\left( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l \right)    (1)

where x_j^l, x_i^{l-1}, k_{ij}^l, b_j^l, and M_j represent the output of the current layer, the output of the previous layer, the kernel of the current layer, the bias of the current layer, and the selection of input maps, respectively. For each output map, there is an additive bias b. The input maps are convolved with distinct kernels to produce the corresponding output maps.

Figure 4: The overall architecture of DCNN applied in this study, which contains an input layer and multiple alternating convolutions.
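The single-input-map case of Equation 1 can be illustrated with a minimal "valid" convolution, which also confirms the 32 × 32 → 28 × 28 size reduction used later in the parameter calculation. The sketch below is written as cross-correlation, as is common in CNN implementations, with random data standing in for a real input map and kernel:

```python
import numpy as np

def conv2d_valid(x, k, b=0.0):
    """'Valid' 2D convolution (cross-correlation form) of one input
    map with one kernel, plus an additive bias, as in Equation 1
    restricted to a single input map."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return out

x = np.random.rand(32, 32)   # one 32x32 input map
k = np.random.rand(5, 5)     # one 5x5 kernel, as in this study
y = conv2d_valid(x, k)       # output map shrinks to 28x28
```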

Sub-sampling Layer:
In this layer, a down-sampling operation is performed on the input maps. The number of input maps equals the number of output maps in the sub-sampling layer, but down-sampling reduces the sizes of the output maps, depending on the size of the down-sampling mask. In the research carried out here, a 2 × 2 down-sampling mask was applied. The mathematical formula for this operation is described by Equation 2:

x_j^l = f\left( \beta_j^l \,\mathrm{down}(x_j^{l-1}) + b_j^l \right)    (2)

where down(·) represents the sub-sampling function, which commonly performs a sum over an n × n block of the maps from the preceding layer and chooses the average or maximum value of this block, and \beta_j^l is a multiplicative bias. Accordingly, the output map is n times smaller than the input feature map along each dimension. The output maps are then passed through a linear or non-linear activation function.
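The 2 × 2 max variant of the sub-sampling operation described above can be sketched as follows; it halves each spatial dimension of a feature map, e.g. 28 × 28 → 14 × 14 as in this study's first sub-sampling layer:

```python
import numpy as np

def max_pool_2x2(x):
    """Down-sample a feature map with a non-overlapping 2x2 max mask.
    Each output value is the maximum of one 2x2 block of the input."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]          # trim odd edges if any
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(28 * 28, dtype=float).reshape(28, 28)
pooled = max_pool_2x2(fmap)                # 28x28 -> 14x14
```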

Classification Layer:
This layer uses the features extracted by the previous convolutional layer to calculate the prediction value of each class for each sample. It is a fully connected layer. For the experiment, this study considered the dimensions of the feature map to be 5 × 5 and the classification layer to be a feed-forward neural network. As recommended by most studies in the literature, the Sigmoid function was used as the activation function.
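A minimal sketch of such a fully connected sigmoid classification layer is given below. The 64 feature maps of size 5 × 5 and the 170 output classes follow the counts given in the structure section of this paper, but the random weights are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(feature_maps, W, b):
    """Flatten the final 5x5 feature maps, score each class with a
    fully connected sigmoid layer, and pick the class with the
    maximum confidence value."""
    x = np.ravel(feature_maps)       # 64 maps of 5x5 -> 1600 values
    scores = sigmoid(W @ x + b)      # one confidence per class
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
feats = rng.random((64, 5, 5))               # final feature maps
W = rng.random((170, 64 * 25)) * 0.01        # 170 output classes
b = np.zeros(170)
pred, conf = classify(feats, W, b)
```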

Back-Propagation:
During back-propagation, the convolutional procedure is performed between the convolutional layer and the sub-sampling layer, and meanwhile the filters are modified. Accordingly, the weight matrix is computed for each layer.

Convolutional Neural Network with Dropout
Test errors can be effectively reduced by combining the predictions of various models [37]. However, this is a computationally costly process for large ANNs, and processing can take several days. An efficient alternative method for combining models is known as "dropout" [41]. This method sets the output of each hidden layer neuron to zero with a preset probability, for instance 0.5. The "dropped out" neurons do not have a direct effect on back-propagation.
The dropout technique effectively simplifies the network, since it prevents the co-adaptation of neurons: one set of neurons cannot rely on the presence of another set of neurons. Each neuron is therefore forced to learn robust features that are useful in combination with many different random subsets of the other neurons. A disadvantage of the dropout method is that it requires more iterations to reach the needed level of convergence. In this study, the dropout method was employed in the first two fully connected layers.
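The dropout operation described above can be sketched as follows. This uses the "inverted" dropout scaling (dividing by 1 − p at training time so that expected activations are unchanged), a common implementation detail that the text does not spell out:

```python
import numpy as np

def dropout(activations, p=0.5, rng=None, train=True):
    """Zero each hidden activation with probability p during training;
    at test time, return the activations unchanged. Surviving values
    are scaled by 1/(1-p) (inverted dropout)."""
    if not train:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p   # True = neuron kept
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(42)
h = np.ones(10)                     # toy hidden-layer activations
dropped = dropout(h, p=0.5, rng=rng)
```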

Structure of the DCNN and parameters
The deep convolutional neural network (DCNN) architecture used to recognise Brahmi words in this study has six layers. Two layers each were allocated to convolution and pooling, while the last two layers were allocated to classification. The first and second convolution layers produced 32 and 64 output feature maps, respectively.
The parameters of the convolutional network were computed as follows. The input image size was set to 32 × 32 pixels, and the output of the first (convolutional) layer was 28 × 28 with 32 feature maps. A 5 × 5 filter mask was used for both convolution layers. Therefore, a total of (5 × 5 + 1) × 32 = 832 learnable parameters were used, and 28 × 28 × (5 × 5 + 1) × 32 = 652,288 was the total number of connections. The first sub-sampling layer used zero parameters, and its output was 14 × 14 with 32 feature maps. The parameters of the next convolutional and sub-sampling layers were calculated similarly: ((5 × 5 + 1) × 32) × 64 = 53,248 learnable parameters for the second convolution layer, and 0 for its sub-sampling layer. The following layers were fully connected, with 312 × 64 × (5 × 5 + 1) = 519,168 parameters in the first fully connected layer and 170 × (312 + 1) = 53,210 parameters in the final layer. Hence, the total number of parameters used in this study was 626,453 (Table 1).
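The per-layer counts quoted above can be reproduced with a few lines of arithmetic, following the same counting convention used in the text (each 5 × 5 kernel plus one bias term per map):

```python
# Parameter and connection counts per layer, reproducing the
# figures quoted in the text.
conv1 = (5 * 5 + 1) * 32          # first convolution layer
conn1 = 28 * 28 * conv1           # connections of the first layer
conv2 = ((5 * 5 + 1) * 32) * 64   # second convolution layer
pool = 0                          # sub-sampling layers learn nothing
fc1 = 312 * 64 * (5 * 5 + 1)      # first fully connected layer
fc2 = 170 * (312 + 1)             # final classification layer
```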

Performance evaluation
To compute the accuracy of the suggested architecture (Figure 4), this study used CNN to classify Brahmi words. Of the 6,475 samples, three-quarters were selected as training data and one-quarter as validation data, with a further 536 samples used as test data.
While training the CNN with four convolutional layers for the experiments, this study considered the learning rate, the number of hidden neurons, and the batch size as parameters. It was observed from the results that the CNN worked efficiently when the network scale was increased, with one major drawback: over-fitting, caused by the longer training time. To resolve the over-fitting issue, the dropout technique was used with a dropout parameter of 0.5. The experiment was conducted using two approaches to recognise the Brahmi words: 1) CNN with the Gabor filter and 2) CNN with dropout and the Gabor filter. This study used a total of ten iterations for training and testing. The testing accuracy of CNN with the Gabor filter was 91.65%, while CNN with dropout and the Gabor filter provided 92.47% accuracy.
In conclusion, the performance of CNN with the Gabor filter and dropout was better for Brahmi word recognition (Table 2). This study considered the learning rate, the number of hidden neurons, and the batch size as parameters to train the CNN with four convolutional layers. According to the results, the DCNN worked efficiently as the network scale grew; however, the over-fitting problem appeared during training because of the long duration. Moreover, the batch size can help the system reach its optimum level. However, if the batch size is increased beyond a certain value, the rule of thumb is that the suggested method cannot be trained [42]. The batch size was also limited by the available memory. The accuracy achieved by the system as the batch size was varied is presented in Figure 5.
To minimise the over-fitting issue, this study trained the model in a controlled environment using a momentum value of 0.1, which helped to obtain the optimum performance. Because of the over-fitting issue, this study increased the number of convolutional cores gradually to reach the optimum state of the network. In the case of batch size, relatively large values were required to approach the global gradient. This study settled on a learning rate of 0.0025 and a batch size of 42, yielding an average accuracy of 92.03%. As can be seen in the performance graphs, the number of epochs is directly proportional to the number of hidden neurons required to maximise performance. Note that the number of hidden neurons and the number of classes are two different entities here. In practice, the number of hidden neurons in each new hidden layer equals the number of connections to be made. As mentioned in the structure of the CNN, the internal layers may comprise different numbers of hidden neurons; these neurons help to capture the features of the input image as deeply as possible. It is pertinent to mention that increasing the number of neurons may increase the complexity, but it helps achieve a higher accuracy rate. The same set of experiments was performed with the standard dataset of Brahmi words [38], applying the same set of parameters with different values. With a learning rate of 0.08 and a starting batch size of 50, the final test accuracy was around 91.65% for Brahmi word recognition. Performance with different numbers of hidden neurons for the Brahmi words is shown in Figure 6, where the achieved accuracy was 91.54%, 90.65%, and 91.65% with 10, 30, and 50 hidden neurons, respectively.

Figure 5: Effect of batch size on accuracy
The experiments were also performed using variations of the n-fold cross-validation approach in order to avoid any confusion regarding the ratio of training and testing data. The average accuracy of Brahmi word recognition was 89.34% and 90.84% using 10-fold and 8-fold cross-validation, respectively. Apart from the approaches discussed above, dropout is also useful for avoiding over-fitting. This study used the dropout method along with CNN and achieved 92.47% accuracy, where the outputs of hidden layer neurons are set to zero with a preset probability, for instance 0.5.
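The n-fold splitting used for cross-validation can be sketched as follows; contiguous folds are shown for simplicity, and shuffling the indices before splitting would be typical in practice:

```python
def kfold_indices(n_samples, k):
    """Split sample indices into k nearly equal contiguous folds for
    cross-validation; each fold serves once as the test set while
    the remaining folds form the training set."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 10-fold split over the 6,475 Brahmi word samples
folds = kfold_indices(6475, 10)
```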
This study also compared the deep learning methods (CNN + Gabor filter, CNN + Gabor filter + dropout) with other studies that used zonal density with ANN [11] and Gabor filter + zonal structural features [20]. The achieved accuracy was 91.65% for CNN with the Gabor filter and 92.47% for the combination of CNN with dropout and the Gabor filter. In contrast, the accuracy achieved using zonal density with ANN [11] and Gabor filter + zonal structural features [20] was 80.20% and 91.57%, respectively. Hence, CNN + Gabor filter + dropout had the highest accuracy compared with the previous studies (Figure 7).
Overall, it was observed that our proposed model of CNN showed better predictive accuracy compared with other classification models.

Conclusion
This study focused on a deep convolutional neural network (DCNN) to recognise Brahmi words. While performing experiments on the standard dataset using the DCNN, this study compared the results of different approaches in order to propose recommendations based on parameter tuning. Furthermore, there is a lack of standard data resources in the Brahmi domain for generating benchmark results.
In the field of machine learning, DCNNs have brought a revolutionary change by providing highly efficient results in comparison with conventional approaches. However, there are also some inherent open issues, such as the lack of knowledge of how to determine the number of layers and the number of hidden neurons in each layer. Furthermore, a standard dataset is required to check the validity and efficiency of deep network models; therefore, in the experiments, this study trained the DCNN using samples from a standard dataset [38]. In addition, finding a set of optimal parameters to generate error-free results is also a research issue. Similarly, complex future tasks, such as the recognition of rotated, mirrored, and noisy character images by extracting novel features, could benefit from this approach.
According to Table 8, the proposed system was significantly better than the approaches used in the related literature with regard to the number of parameters and the amount of computation, and it was quite efficient (in terms of accuracy) and effective at performing recognition and classification, since it provided better accuracy than the others.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 28 May 2020 doi:10.20944/preprints202005.0455.v1