The threshold method is an image segmentation technique that separates an image according to a specified threshold on the grayscale value of each pixel. It typically comprises two technical approaches, the global thresholding method and the local thresholding method, both of which classify pixels according to whether their gray values lie above or below the specified threshold.
2.1. Global Threshold Method
The Otsu [1] algorithm, developed in 1979, is a prominent global thresholding technique. It determines an optimal threshold value, denoted as T, by analyzing the grayscale properties of an image and partitioning the image into foreground and background segments. The objective is to minimize the variance within each segment while maximizing the variance between the two segments. The difference in grayscale distribution serves as a measure of the contrast between foreground and background, with a larger difference indicating easier segmentation. The Otsu algorithm is therefore also commonly known as the maximum between-class variance method. The optimal threshold is the value that maximizes the between-class variance, and can be expressed as follows:
$$ T^{*} = \arg\max_{0 \le T < L} \; \omega_0(T)\,\omega_1(T)\,\big[\mu_0(T) - \mu_1(T)\big]^2 $$
where the image has $L$ gray levels, $\omega_0(T)$ and $\omega_1(T)$ are the probability distributions of the target and background when the threshold value is $T$, and $\mu_0(T)$ and $\mu_1(T)$ represent the average gray values of the target and background pixels, respectively. If the pixel value of the input image is greater than $T^{*}$, the pixel is set to white; otherwise, it is set to black.
The Otsu algorithm partitions the entire image with a single threshold, determining the optimal threshold for the image at once. This approach generally yields good separation for images with a uniform background. However, it may perform poorly on images with uneven backgrounds, for example misclassifying the background in document images with significant ink bleed-through or insufficient grayscale contrast. Consequently, no single global thresholding method is universally effective for such contaminated images.
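For concreteness, the following is a minimal NumPy sketch of Otsu's criterion as formulated above; the function and variable names are ours, not from [1].

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold T* maximizing the between-class variance.

    `gray` is assumed to be a 2-D uint8 array with L = 256 gray levels.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()                    # gray-level probabilities

    omega0 = np.cumsum(prob)                    # class weight w0(T)
    omega1 = 1.0 - omega0                       # class weight w1(T)
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean
    mu_total = mu[-1]

    # Class means; divisions by zero occur where a class is empty.
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = mu / omega0
        mu1 = (mu_total - mu) / omega1
        sigma_b2 = omega0 * omega1 * (mu0 - mu1) ** 2

    sigma_b2 = np.nan_to_num(sigma_b2)          # empty classes contribute 0
    return int(np.argmax(sigma_b2))

# Usage: pixels above T* become white (255), the rest black (0).
# T = otsu_threshold(img)
# binary = np.where(img > T, 255, 0).astype(np.uint8)
```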
2.2. Local Threshold Method
The Niblack [2] algorithm was developed to address the limitations of a fixed threshold by introducing a local binarization method. A local window is used to compute the mean and standard deviation in a small neighborhood of each pixel, and these statistics determine a per-pixel threshold for binarizing the image:
$$ T(x, y) = m(x, y) + k \cdot s(x, y) $$
where $m(x, y)$ is the average gray value of the pixels in the local window, $s(x, y)$ is their standard deviation, and $k$ is a constant correction factor that can be adjusted according to the foreground and background conditions of the image.
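A minimal sketch of the Niblack rule above, using SciPy's uniform filter for the local statistics; the window size of 25 and k = -0.2 are illustrative defaults, not values prescribed by [2].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(gray, window=25, k=-0.2):
    """Binarize with the per-pixel Niblack threshold T = m + k * s."""
    g = gray.astype(np.float64)
    m = uniform_filter(g, size=window)           # local mean m(x, y)
    m2 = uniform_filter(g * g, size=window)      # local mean of squares
    s = np.sqrt(np.maximum(m2 - m * m, 0.0))     # local std dev s(x, y)
    T = m + k * s
    return np.where(g > T, 255, 0).astype(np.uint8)
```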
Trier [3] found that Niblack outperforms other local binarization methods on gray images with low contrast, noise, and uneven background intensity. However, Saxena [4] argued that the window size is the main weakness of the local threshold method: both large and small windows generate noise, and the method can detect target pixels even in windows that contain none. Furthermore, the time taken to compute the threshold is proportional to the square of the window size. Similarly, Chaki [5] asserts that a larger value of k adds more pixels to the document image, thereby reducing text readability, whereas a smaller k value results in missing or incomplete characters, reducing the number of candidate pixels. Determining an appropriate value of k is therefore challenging, and even with a well-chosen k the method still generates pepper noise in the shadowed or non-text regions of the image.
The Sauvola [6] algorithm is an improvement upon the Niblack algorithm, designed to suppress its excessive noise. It introduces a new parameter $R$, the dynamic range of the standard deviation, and computes the threshold as:
$$ T(x, y) = m(x, y) \cdot \left[ 1 + k \left( \frac{s(x, y)}{R} - 1 \right) \right] $$
where $m$ and $s$ are the local mean and standard deviation as in Niblack. Although $R$ normalizes the standard deviation, the method still requires a manually chosen $k$ value and window size. For low-contrast targets in a document, such as textured backgrounds or translucent ghost strokes, the Sauvola algorithm may remove them entirely or recover them only partially. It also struggles with targets of different sizes and cannot reliably capture all characters when different font sizes appear in the same text [7].
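The same local statistics yield a sketch of Sauvola's rule; k = 0.5 and R = 128 are often-cited defaults, but as noted above both still require tuning in practice.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_binarize(gray, window=25, k=0.5, R=128.0):
    """Binarize with Sauvola's threshold T = m * (1 + k * (s / R - 1))."""
    g = gray.astype(np.float64)
    m = uniform_filter(g, size=window)           # local mean
    m2 = uniform_filter(g * g, size=window)
    s = np.sqrt(np.maximum(m2 - m * m, 0.0))     # local standard deviation
    T = m * (1.0 + k * (s / R - 1.0))
    return np.where(g > T, 255, 0).astype(np.uint8)
```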
Wolf [8] and colleagues performed a global statistical normalization of image contrast and average gray value, enabling automatic detection of text regions and adaptive threshold selection based on the characteristics of those regions; this optimizes Sauvola's algorithm for the binarization of low-contrast images. In a similar vein, Gatos [9] and his team estimated the background surface of documents from the binary image produced by Sauvola's thresholding algorithm. This method eliminates manual parameter adjustment and effectively addresses degradations such as shadows, non-uniform illumination, and low contrast. Mustafa et al. [10] proposed the WAN algorithm, which recovers lost stroke details by raising the binarization threshold of the Sauvola algorithm.
To address the black-noise problem of the Niblack algorithm, Khurshid et al. [11] introduced the NICK algorithm, which is reported to be more effective for deteriorated and noisy antique documents. Compared with Niblack, it substantially improves the binarization of light-colored page regions by shifting the binarization threshold downward. The threshold is calculated as:
$$ T = m + k \sqrt{ \frac{ \sum p_i^2 - m^2 }{ NP } } $$
where $p_i$ represents the pixel values of the grayscale image, $m$ is the local mean, and $NP$ denotes the number of pixels in the window. Noise is largely removed when the $k$ value approaches $-0.2$, although this may produce interrupted characters or faint drawings; conversely, when $k$ is close to $-0.1$, the text is extracted completely and clearly, but some noise is retained. The selection of $k$ still requires manual adjustment when the characters are thin or the text document has low contrast. Consequently, B. Bataineh [12] contends that the method does not outperform the Niblack algorithm in exceptional circumstances, such as very low-contrast images, variations in text size and thickness, or fine characters with low contrast.
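A sketch of the NICK threshold under the same windowed-statistics approach; the window sum of squared pixels is approximated with a box filter, and k is chosen in the [-0.2, -0.1] range discussed above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def nick_binarize(gray, window=19, k=-0.1):
    """Binarize with NICK's threshold T = m + k * sqrt((sum(p^2) - m^2) / NP)."""
    g = gray.astype(np.float64)
    NP = window * window                               # pixels per window
    m = uniform_filter(g, size=window)                 # local mean m
    sum_p2 = uniform_filter(g * g, size=window) * NP   # windowed sum of p_i^2
    A = np.sqrt(np.maximum((sum_p2 - m * m) / NP, 0.0))
    T = m + k * A
    return np.where(g > T, 255, 0).astype(np.uint8)
```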
Su et al. (2010) [13] introduced a novel image contrast measure that uses the local image maximum and minimum instead of the image gradient, which is particularly beneficial for unevenly lit and degraded documents. The method constructs a contrast image, detects high-contrast pixels near the character stroke boundaries, and segments the text using local thresholds. In a subsequent study in 2013, Su et al. [14] proposed a technique that combines local image contrast with the local image gradient: an adaptive contrast map is constructed for the degraded input document image, the contrast image is binarized and combined with the Canny edge image to identify text stroke edges, and the text is then segmented using local thresholds, with the threshold estimated from the intensity of the detected text stroke edges within a local window. This method achieved high-quality results on the Bickley diary dataset.
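A minimal sketch of the local max/min contrast image described for [13]; the window size and the small epsilon that prevents division by zero are our illustrative choices.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def local_maxmin_contrast(gray, window=3, eps=1e-8):
    """Contrast image C = (max - min) / (max + min + eps) over a local window."""
    g = gray.astype(np.float64)
    f_max = maximum_filter(g, size=window)   # local maximum
    f_min = minimum_filter(g, size=window)   # local minimum
    return (f_max - f_min) / (f_max + f_min + eps)

# High values of C concentrate around stroke boundaries even under uneven
# illumination, because the normalization cancels the local intensity level.
```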
In local binarization, the window size is a crucial user-set parameter: small windows remove noise effectively but may distort the text, while large windows preserve the text but may introduce some noise, so no fixed setting is universally applicable. Bataineh et al. [12] proposed a threshold approach based on dynamic, flexible windows to address low contrast between foreground and background as well as thin text strokes. The method involves two steps: dynamically segmenting the image into windows according to its characteristics, and determining an appropriate threshold for each window. The computation time of the local mean normally depends on the window size; in contrast, T. Romen Singh et al. [15] used an integral (summed-area) image as a preliminary stage of the local-mean calculation, which makes the cost of computing the mean independent of the window size. Unlike other local threshold techniques, this method does not compute standard deviations, reducing computational complexity and accelerating processing.
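To illustrate why an integral image makes the local mean independent of the window size, here is a hedged NumPy sketch (not Singh et al.'s exact formulation): after two cumulative sums, each window mean costs only four array lookups.

```python
import numpy as np

def local_mean_integral(gray, window):
    """Local mean via an integral (summed-area) image; `window` must be odd.

    After two O(N) cumulative sums, every window mean costs just four
    lookups, so the cost no longer grows with the window size.
    """
    g = gray.astype(np.float64)
    pad = window // 2
    gp = np.pad(g, pad, mode="edge")               # replicate borders

    # Integral image with a leading zero row and column.
    S = np.zeros((gp.shape[0] + 1, gp.shape[1] + 1))
    S[1:, 1:] = gp.cumsum(axis=0).cumsum(axis=1)

    h, w = g.shape
    n = window
    # Window sum from four corners: S[y+n,x+n] - S[y,x+n] - S[y+n,x] + S[y,x].
    win_sum = (S[n:n + h, n:n + w] - S[:h, n:n + w]
               - S[n:n + h, :w] + S[:h, :w])
    return win_sum / (n * n)
```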
The global thresholding method is useful for document images with a clear separation between foreground and background, or for images with distinct bimodal histograms, because it segments foreground and background with a single threshold. However, it is less robust and unsuited to images with low contrast or uneven illumination, so it is usually employed for simple document images with a uniform, clean background. The local thresholding method uses several different thresholds to segment the image and can be used for multi-target segmentation, but the segmented targets are poorly connected and the results often exhibit ghosting, i.e., pseudo-strokes appearing in the background region, and the binarization result is strongly affected by noise. In addition, some local thresholding approaches introduce character stroke breakage in camera-captured document photographs. We summarize the global and local threshold methods in Table 1.
2.3. Mixed Threshold Method
To make up for the limitations of global and local thresholding approaches, researchers have proposed hybrid thresholding binarization algorithms. For example, Yang et al. [16] integrated Otsu's and Bernsen's methods. Zemouri et al. [17] enhanced the document image using global thresholding before binarization and then applied a local thresholding strategy for binarization. Chaudhary et al. [18] built a rough estimate of the background, constructed a high-contrast image, and then thresholded it with a hybrid technique. To address the low recognition rate of blurred letters in handwritten document images, K. Ntirogiannis et al. [19] devised a combination of global and local adaptive binarization: first, the background is estimated and the image normalized by background compensation; global binarization is then performed on the normalized image, and typical attributes of the document image such as stroke width and contrast are measured on the binarized image; local adaptive binarization is also performed on the normalized image; finally, the two binarization results are combined. Liang [20] developed a hybrid thresholding technique that determines the trade-off between local and global content by variational optimization. Xiao et al. [21] suggested a model consisting of a global branch and a local branch that take global blocks of the downsampled image and local blocks of the source image as respective inputs; the final binarization merges the outputs of the two branches. Saddami et al. [22] employed an integrated technique combining local and global thresholding to extract text from the background and recover the content of degraded ancient Jawi manuscripts. P. Ranjitha et al. [23] suggested a classification scheme for degraded document images that blends modified local and global binarization algorithms.
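As a toy illustration of the hybrid idea, and not of any specific cited method, the sketch below keeps a pixel as ink only where a global Otsu mask and a local Sauvola mask agree, which suppresses the isolated noise that either branch alone would admit.

```python
import numpy as np
import cv2

def hybrid_binarize(gray, window=25, k=0.2, R=128.0):
    """Toy hybrid: intersect a global Otsu mask with a local Sauvola mask."""
    # Global branch: Otsu's single threshold (foreground assumed dark ink).
    _, g_mask = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Local branch: Sauvola threshold per pixel.
    g = gray.astype(np.float64)
    m = cv2.boxFilter(g, -1, (window, window))        # local mean
    m2 = cv2.boxFilter(g * g, -1, (window, window))
    s = np.sqrt(np.maximum(m2 - m * m, 0.0))          # local std dev
    T = m * (1.0 + k * (s / R - 1.0))
    l_mask = np.where(g <= T, 255, 0).astype(np.uint8)

    # Keep ink only where both branches agree.
    return cv2.bitwise_and(g_mask, l_mask)
```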
He [24] compared Niblack, Sauvola, and their adaptive variants and found that adaptive Niblack and adaptive Sauvola perform slightly better than the originals. Adaptive thresholding scans the image with a sliding window centered on each pixel and compares the central pixel with its neighbors to obtain a different threshold for each pixel. A fixed threshold is usually set manually for the specific task, whereas adaptive thresholding tends to estimate the background surface of the document first and then compute the thresholds from the estimated background surface. Bernsen's algorithm [25], the classical adaptive thresholding method, calculates a separate threshold for each pixel from its neighborhood. Moghaddam et al. [26] estimate the background surface of the document by an adaptive, iterative image-averaging approach. Messaoud et al. [27] apply binarization to selected items of interest by combining a preprocessing stage with a localization step. Pardhi et al. [28] construct local thresholds from a combination of local image contrast and gradient to segment text; this is also an adaptive image contrast technique. Kligler et al. [29] introduced a novel, general algorithm that replaces the grayscale map input with a visibility-based low-light map, claiming that this brings the values of text pixels closer together and separates them better from those of non-text pixels, improving the performance of a binarization algorithm without changing the rest of it. Adaptive approaches can generally handle part of the complexity of document images, but they often overlook the edge aspects of image features and can lead to artifacts. We summarize these methods in Table 2.
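A minimal sketch of Bernsen's per-pixel rule as described above; the window size and minimum-contrast cutoff are illustrative, and low-contrast windows are assigned to the background.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def bernsen_binarize(gray, window=31, contrast_min=15):
    """Bernsen: T = (local max + local min) / 2, with a contrast test."""
    g = gray.astype(np.float64)
    z_hi = maximum_filter(g, size=window)
    z_lo = minimum_filter(g, size=window)
    T = (z_hi + z_lo) / 2.0

    out = np.where(g > T, 255, 0).astype(np.uint8)
    # Windows with too little contrast are treated as homogeneous background.
    out[(z_hi - z_lo) < contrast_min] = 255
    return out
```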
2.4. Image Feature Method
The threshold method based on edge detection first identifies the edge pixels in the image and then uses these edges as partition boundaries to divide the image into different areas. Edge detection typically applies differential operators to locate areas of significant variation in the grayscale values of the image. Commonly used edge detectors include Sobel, Prewitt, Roberts, Laplacian of Gaussian, and Canny; the choice of detector is determined by the characteristics of the image in the practical application. Santhanaprabhu et al. [30] applied the Sobel edge detection technique to extract text and perform document image binarization. Lu et al. [31] used the L1-norm image gradient to identify font edges in the compensated document image. T. Lelore et al. [32] described a fast solution for restoring document images that employs the edge-based FAIR method to locate text in degraded document images. However, the edges detected by edge detection techniques may not completely enclose the candidate text areas and thus require further refinement. Holambe et al. [33] exploited adaptive image contrast in combination with Canny's edge map to identify the edge pixels of the font. Jia et al. [34] used structural symmetry pixels (SSPs) to compute local thresholds within a neighborhood, where an SSP is defined as a pixel around a stroke whose gradient magnitude is sufficiently large and whose gradient directions are symmetric and opposite. The authors extract SSPs by combining adaptive gradient binarization with an iterative stroke-width estimation algorithm; this reduces the influence of document degradation and ensures an appropriate neighborhood size when determining directions. Multi-threshold voting is then used to decide whether a pixel belongs to the foreground text, which handles inaccurate SSP detection. Hadjadj et al. [35] introduced a document image binarization method based on the active contour model, which is widely used in image segmentation and converts the segmentation problem into the minimization of an energy functional. Hadjadj defines the image contrast from the maximum and minimum values of the local image and uses it to automatically generate the initialization of the active contour model. The average threshold value is selected to generate the binarization, enabling the active contour to detect low-contrast regions effectively; when the active contour becomes stationary, the result is obtained by thresholding the level-set function.
In general, these methods work well on degraded printed or handwritten document images. Compared with simple thresholding, edge-based separation can effectively extract text contour information and is fast to compute; however, because it relies on the pixel gradients of the image itself, it is sensitive to illumination changes and easily disturbed by them.
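As a toy sketch of the edge-guided idea, and not the FAIR method itself, the code below seeds the text regions with Canny edges and estimates a threshold from the pixels near those edges, where foreground and background are both well represented.

```python
import numpy as np
import cv2

def edge_guided_binarize(gray):
    """Toy edge-guided binarization: threshold from pixels near Canny edges."""
    edges = cv2.Canny(gray, 50, 150)                  # detect stroke boundaries
    near = cv2.dilate(edges, np.ones((3, 3), np.uint8)) > 0

    if not near.any():                                # no edges: all background
        return np.full_like(gray, 255)

    # Estimate one threshold from the gray values around the detected edges.
    T = gray[near].mean()
    return np.where(gray > T, 255, 0).astype(np.uint8)
```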
Fuzzy theory defines fuzzy sets, computes the degree to which each pixel belongs to each set through fuzzy logic operations, and finally classifies pixels by their membership degrees to achieve segmentation. The idea of clustering is to find correlations, i.e., similarities, within a set of unlabeled data, and image segmentation itself can be regarded as a clustering process. A. Lai et al. [36] used the K-means clustering algorithm to binarize document images. Tong et al. [37] combined the Niblack algorithm with the FCM algorithm to propose NFCM, a camera-based document image binarization algorithm intended to address broken or blurred document images, preserve the fine details of character strokes, and eliminate glare interference. Soua et al. [38] proposed the hybrid-binarization-based K-means method (HBK) and implemented real-time processing of the parallel HBK method in an OCR system. Although K-means is widely used, its hard clustering and sensitivity to noise have led researchers to propose more flexible soft clustering algorithms, such as fuzzy C-means (FCM), which can accommodate uncertainty in the data points; related algorithms include PFCM [39] and KFCM [40]. In addition, Annabestani et al. [41] use the FuzBin-based binarization method to extract text information from document images: they enhance image contrast with FESs and then combine an FES with a pixel-counting algorithm to obtain a range of threshold values, taking the middle value as the final threshold. A method based on mathematical morphology operations has also been used to enhance blurred stroke information [9]. Biswas et al. [42] used a Gaussian filter to blur the input degraded image. Support Vector Machines (SVMs) have likewise been applied to historical document images: Xiong et al. [43] used an SVM to classify image blocks into categories based on statistics such as the mean, variance, and histogram of each region, and determined an optimal global threshold for a preliminary segmentation of foreground and background; after segmentation, the stroke width is estimated by progressive scanning. However, this method considers only a single class, and the accuracy of its recognition results is not ideal.
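In the spirit of the K-means binarization cited above [36], and not that paper's exact procedure, a minimal sketch: one-dimensional K-means with k = 2 on the pixel intensities, where the brighter cluster is taken as background (an assumption that holds for dark ink on light paper).

```python
import numpy as np

def kmeans_binarize(gray, iters=20):
    """Binarize by 1-D K-means (k = 2) on the pixel intensities."""
    x = gray.astype(np.float64).ravel()
    c = np.array([x.min(), x.max()])          # initial cluster centers

    for _ in range(iters):
        # Assign each pixel to its nearest center, then recompute centers.
        labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                c[j] = x[labels == j].mean()

    # Pixels in the brighter cluster become white (background).
    out = np.where(labels.reshape(gray.shape) == c.argmax(), 255, 0)
    return out.astype(np.uint8)
```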
In addition to the methods mentioned above, numerous other techniques exist for binarizing document images. Although these methods may not be as widely used as the commonly adopted algorithms, they still hold unique value in practical applications. They include histogram-based methods such as [44,45,46]; entropy-based methods such as [47]; space binarization-based methods such as [48]; and object-property-based methods such as [49], among others. In general, when faced with the task of document image binarization, it is important to weigh the characteristics of the image and the final requirements to choose the most suitable method. Sometimes, combining or layering multiple methods is an effective way to improve binarization, and such integrated thinking and practice ensure good results in real situations. We summarize the above methods in Table 3.