4.1. Dataset and Implementation details
We build a new training dataset from several existing document binarization datasets to evaluate and compare the performance of the proposed model with other methods. The training dataset consists of H-DIBCO [44,45,46], DIBCO [47,48,49], the Bickley-diary dataset [50], the Persian heritage image binarization dataset (PHIDB) [51], and the Synchromedia Multispectral dataset (S-MS) [52]. Since most of the document images are large, we divided each image into 256 × 256 patches for training efficiency. Random rotation augmentation was performed on the divided patches to construct a dataset of about 160k images in total. 90% of the built dataset was used as the training set, and the remaining 10% as the validation set. Finally, H-DIBCO 2016 [20], DIBCO 2017 [21], H-DIBCO 2018 [22], and DIBCO 2019-B [23] are used for evaluation. The test datasets are not included in the training and validation sets.
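For illustration, a minimal sketch of this patch construction is given below. It is ours, not the authors' released code; the NumPy/Pillow usage, the non-overlapping stride, the border handling, the restriction to 90-degree rotations, and the file name are all assumptions.

```python
# Minimal sketch of the patch construction described above (not the authors' code).
# Assumptions: non-overlapping patches, border remainders dropped, and random
# rotation restricted to 90-degree multiples.
import numpy as np
from PIL import Image

PATCH = 256

def extract_patches(img: np.ndarray, size: int = PATCH):
    """Divide an image into non-overlapping size x size patches."""
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patches.append(img[y:y + size, x:x + size])
    return patches

def augment(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random rotation augmentation (90-degree multiples, an assumption)."""
    k = rng.integers(0, 4)  # 0, 90, 180, or 270 degrees
    return np.rot90(patch, k)

rng = np.random.default_rng(0)
img = np.array(Image.open("document.png").convert("RGB"))  # hypothetical path
dataset = [augment(p, rng) for p in extract_patches(img)]
```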
To train the proposed model, the time step of the diffusion-denoising process is set to 1000, and the AdamW optimizer [53] set to the initial learning rate is used. In addition, the autoencoder was pre-trained for 20 epochs on the training set and was frozen during denoising network training. We use as the network's activation function. The denoising network was trained for about 1M steps with a batch size of 2. We use three NVIDIA RTX 3090 (24GB) GPUs for training.
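A minimal PyTorch-style sketch of this training configuration follows. The learning-rate value (unspecified above), the placeholder modules standing in for the autoencoder and the gated U-Net, and the freezing mechanics are our assumptions rather than the authors' implementation.

```python
# Hedged sketch of the training setup described above (PyTorch).
# The learning-rate value and the placeholder modules are assumptions.
import torch
import torch.nn as nn

T_STEPS = 1000   # diffusion-denoising time steps (from the text)
BATCH_SIZE = 2   # from the text
LR = 1e-4        # assumption: the text does not state the initial value

# Placeholders for the pre-trained autoencoder and the gated U-Net
# denoising network (both hypothetical stand-ins here).
autoencoder = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1))
denoiser = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1))

# The autoencoder is pre-trained (20 epochs) and frozen while the
# denoising network is trained for ~1M steps.
for p in autoencoder.parameters():
    p.requires_grad_(False)
autoencoder.eval()

optimizer = torch.optim.AdamW(denoiser.parameters(), lr=LR)
```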
4.2. Evaluation Metrics
For quantitative evaluation of the proposed model, a total of four evaluation metrics [20,21,22,23] suitable for document binarization are selected. The metrics consist of F-measure (FM), pseudo-Fmeasure (pFM), Peak Signal-to-Noise Ratio (PSNR), and Distance Reciprocal Distortion (DRD), all commonly used in the Document Image Binarization Contest (DIBCO).
F-measure is calculated using the precision and recall between the predicted pixels and the ground-truth pixels as

$$\mathrm{FM} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN},$$

where $TP$, $FP$, and $FN$ denote true-positive, false-positive, and false-negative pixels, respectively.
pseudo-Fmeasure was proposed in [54]; it replaces the recall with the pseudo-recall $p\mathrm{Recall}$, the percentage of the skeletonized ground-truth strokes covered by the predicted strokes, and is calculated as follows,

$$\mathrm{pFM} = \frac{2 \times \mathrm{Precision} \times p\mathrm{Recall}}{\mathrm{Precision} + p\mathrm{Recall}}.$$
PSNR is an image quality evaluation metric that measures the similarity between the predicted image and the ground truth. PSNR is calculated as follows, where $C$ is the maximum value of an image pixel and $\mathrm{MSE}$ is the mean squared error between the two images,

$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{C^{2}}{\mathrm{MSE}}\right).$$
DRD is a metric proposed by [55] to measure visual distortion in binary document images. It measures the distortion over all $S$ flipped pixels as follows,

$$\mathrm{DRD} = \frac{\sum_{k=1}^{S} \mathrm{DRD}_{k}}{\mathrm{NUBN}},$$

where $\mathrm{NUBN}$ is the number of non-uniform (not all black or white pixels) 8 × 8 blocks in the ground-truth image, and $\mathrm{DRD}_{k}$ is the distortion of the k-th flipped pixel as defined in [55], computed from a 5 × 5 normalized reciprocal-distance weight matrix.
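To make these definitions concrete, a minimal NumPy sketch of FM, PSNR, and DRD follows (the skeletonization step of pFM is omitted for brevity). The convention that text pixels are 1 and background pixels are 0, the edge padding, and the zero center weight of the DRD matrix are our assumptions following the standard construction, not code from this paper.

```python
# Minimal sketch of FM, PSNR, and DRD for binary document images, following
# the definitions above. Convention (assumption): text = 1, background = 0,
# and pred/gt are equal-shape NumPy arrays.
import numpy as np

def f_measure(pred: np.ndarray, gt: np.ndarray) -> float:
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def psnr(pred: np.ndarray, gt: np.ndarray, c: float = 1.0) -> float:
    # c is the maximum pixel value (1.0 for binary images in [0, 1]).
    mse = np.mean((pred.astype(float) - gt.astype(float)) ** 2)
    return 10.0 * np.log10(c ** 2 / mse)

def drd(pred: np.ndarray, gt: np.ndarray) -> float:
    # 5x5 reciprocal-distance weight matrix from [55]; zero center weight
    # and normalization to sum 1 are the standard construction (assumption).
    idx = np.arange(5) - 2
    d = np.sqrt(idx[:, None] ** 2 + idx[None, :] ** 2)
    wm = np.zeros((5, 5))
    wm[d > 0] = 1.0 / d[d > 0]
    wm /= wm.sum()

    gt_pad = np.pad(gt.astype(float), 2, mode="edge")
    total = 0.0
    for y, x in np.argwhere(pred != gt):  # all S flipped pixels
        block = gt_pad[y:y + 5, x:x + 5]  # 5x5 GT neighborhood of the flip
        total += np.sum(np.abs(block - pred[y, x]) * wm)

    # NUBN: number of non-uniform (not all black or white) 8x8 GT blocks.
    h, w = (gt.shape[0] // 8) * 8, (gt.shape[1] // 8) * 8
    sums = gt[:h, :w].reshape(h // 8, 8, w // 8, 8).sum(axis=(1, 3))
    nubn = np.sum((sums > 0) & (sums < 64))
    return total / nubn
```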
4.3. Quantitative and Qualitative Comparison
To demonstrate the performance and effectiveness of the proposed model, we use a total of four benchmark datasets. The DIBCO 2016, DIBCO 2017, and DIBCO 2018 datasets are made up of machine-printed and handwritten document images, and DIBCO 2019-B is a challenging benchmark dataset that includes papyrus-like materials and extreme degradation. We use the four evaluation metrics of FM, pFM, PSNR, and DRD introduced in Section 4.2. In addition, for fair comparison, a total of 8 models including the proposed model are reimplemented. We used publicly available source code provided by the authors for reimplementation. The methods compared with the proposed model can be divided into traditional binarization algorithm-based methods and deep learning-based methods. Binarization algorithm-based methods include Otsu [6] and Sauvola [5], and deep learning-based methods include SAE [25], cGANs [13], Akbari et al. [27], Souibgui et al. [29], and Suh et al. [15]. Since the proposed model is a generative model, we compare it with various generative models. Additionally, the performance of the competition winner of each year [20,21,22,23] is added to the experimental results.
Table 1 shows the quantitative evaluation results of the models. Compared to other methods, the proposed model achieves the best mean performance across all datasets, and the best or second-best performance in terms of pFM and PSNR on each of the four datasets. Especially on the DIBCO 2019-B dataset, which contains the most complex degradation and noise the model has never encountered, the proposed model achieves the best performance in all metrics. While existing methods show a sharp performance drop on this dataset, the proposed model maintains an impressive margin over the others even under such severe noise. Although the quantitative differences between the proposed model and the others are marginal on the other datasets, the qualitative evaluation shows that the proposed model performs better at document binarization. This demonstrates that training through the iterative diffusion-denoising process is effective in removing complex noise and is robust to various environments.
First, on the H-DIBCO 2016 dataset, the proposed model shows the highest performance in pFM and the second highest in PSNR. On the DIBCO 2017 dataset, the proposed model shows the highest performance in pFM and PSNR. High pFM means that the model extracts text strokes best among the compared methods, and high PSNR means that it produces high-quality results. On the H-DIBCO 2018 dataset, the proposed model achieves the second highest performance in all four metrics. Results on the DIBCO 2019-B dataset show significantly higher performance in all four metrics compared to other methods. DIBCO 2019-B is a challenging dataset consisting of images with extremely severe degradation on materials such as papyrus and tree bark. Even on this dataset, the proposed model shows the highest performance in all metrics, while the other models perform far worse than they do on the other datasets. This demonstrates that the proposed model successfully extracts text strokes even from extremely degraded document images unseen during training. Averaged over the DIBCO 2016, 2017, 2018, and 2019-B datasets, the proposed model achieves the highest performance in all metrics. That is, we demonstrate the effectiveness of performing image generation through the diffusion-denoising process and text stroke extraction through the gated U-Net.
Qualitative evaluation on the H-DIBCO 2016 dataset is shown in Figure 5. The shown image is the most challenging document image in H-DIBCO 2016, with complex degradation. Figure 5g, the result of [13], achieves the highest quantitative result but has limitations in preserving elaborate text strokes. Figure 5i, the result of [15], preserves text strokes better but has limitations in removing noise. Figure 5j, the result of the proposed model, shows that it not only removes noise such as overwriting better than the other methods, but also performs best in preserving elaborate and accurate text strokes.
Figure 6 shows the results on the DIBCO 2017 dataset for qualitative evaluation. The image contains noise that is difficult to distinguish from text. Traditional algorithm-based methods misclassify such noise as text, and deep learning-based methods also suffer from misclassification when the noise is severe. This noise is hard to distinguish because its shape and size resemble those of text strokes, but Figure 6j shows that the proposed model successfully separates the background and text regions.
The results on the H-DIBCO 2018 dataset are shown in Figure 7. The H-DIBCO 2018 dataset consists of 10 handwritten document images with various degradations such as background intensity variation, shadow, and bleed-through. Figure 7 includes degradation from variable background intensity and ink smearing, which the other methods have difficulty removing. In Figure 7g, the result of [13], the degradation is resolved relatively well, but precise text strokes such as the small footnotes in the document are not preserved. The proposed model, in Figure 7i, handles all types of degradation well and shows high performance in preserving elaborate text strokes.
Figure 8 shows the results on the DIBCO 2019-B dataset, which consists of ancient document images reflecting various types of papyrus quality, ink, and handwriting styles. In particular, it is challenging data with non-homogeneous properties such as resolution, lighting, and noise. Figures 8c and 8d, the results of [13,15], cannot effectively separate text from the background when their intensities are similar, which makes it extremely difficult to preserve text strokes. As shown in Figure 8e, the proposed model classifies text and background effectively and successfully preserves the text regions. In addition, the types of degradation in the DIBCO 2019-B data do not appear in the training data, demonstrating that the proposed model remains effective even under severe, unseen degradation.
Additionally, Figure 9 shows that the proposed model is effective for elaborate region extraction. In small parts like the example image, it is difficult to distinguish between background and text regions and to extract accurate, elaborate text strokes. In this case, even methods with relatively high quantitative scores [13,15,27] have difficulty precisely extracting and preserving text regions, as shown in Figures 9c, 9d, and 9e. However, as shown in Figure 9f, the proposed model successfully extracts and preserves precise text strokes despite these difficulties.
4.4. Ablation study
We perform ablation experiments to evaluate each component of the proposed model. The DIBCO 2016 benchmark dataset is used for evaluation, and the effectiveness of the proposed model is demonstrated using the four metrics in Section 4.2. The baseline is [17], in which the diffusion-denoising process is performed in the latent space using an autoencoder. We compare the baseline with the proposed gated U-Net alone, the proposed pixel-level loss alone, and the model with all components. A total of four models are compared, and all experiments use the same implementation settings except for the component under study.
Table 2 shows the results of the baseline, the proposed model without pixel-level loss, the proposed model without gated U-Net, and the full proposed model. When the gated U-Net is added to the baseline, pFM (95.03) improves by 0.59 over the baseline (94.44), since filter training for the gating values is additionally included and improves text stroke preservation. However, without the pixel-level loss, the feedback to the gated convolution filters is not effective, so there is no significant performance improvement. When gated convolution is excluded, the text and background regions are still updated through the pixel-level loss, resulting in improved performance in all metrics. Finally, the proposed model, with the pixel-level loss training the gated convolution filters in the gated U-Net, improves FM, pFM, PSNR, and DRD by 0.68, 1.00, 0.28, and 0.31, respectively, compared to the baseline. This demonstrates that each component of the proposed model is effective on its own, and that all components combine effectively for a performance gain.
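For reference, a generic gated convolution block is sketched below to illustrate what the gating values being trained are. This follows the common gated-convolution formulation (one branch producing features, one producing soft gates); it is our illustration and not necessarily the exact block used in the proposed gated U-Net.

```python
# Generic gated convolution sketch (our illustration, not necessarily the
# exact block of the proposed gated U-Net): a feature branch and a gating
# branch whose sigmoid output in (0, 1) weights the features per pixel.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A pixel-level loss, as described above, can provide the feedback
        # that shapes these learned gating values.
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))
```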
Additionally, to confirm the effect of each component, the results of training each model for six epochs are shown in Table 3. We adopt FM, which verifies the predicted text and background pixel-wise, and pFM, which verifies each predicted region. When the pixel-level loss is excluded from the proposed model, each metric is similar to or slightly higher than the baseline. The proposed model without gated U-Net denotes the baseline's LDM backbone network with the pixel-level loss used to update the denoising network; this yields a significant improvement of 8.00 in FM and 7.94 in pFM over the baseline. The proposed model, using the gated U-Net for document binarization together with the pixel-level loss that trains the gating values, achieves the highest performance in each metric, with a significant improvement of 8.22 in FM and 8.51 in pFM over the baseline. These results demonstrate that the components of the proposed model are well suited to text stroke preservation and extraction.
Figure 10 shows the result images of the baseline, the proposed model without pixel-level loss, the proposed model without gated U-Net, and the full proposed model. Figure 10f, the result of the proposed model with both pixel-level loss and gated U-Net, distinguishes the text and background of degraded document images more successfully than Figure 10c, the result of the baseline.
Additionally, Figure 11 qualitatively confirms the elaborate text stroke preservation performance. In Figure 11f, the result of the proposed model, noise removal and text stroke preservation are more accurate and precise than in the baseline result shown in Figure 11c. This shows that the diffusion-denoising process with the gated U-Net and pixel-level loss preserves more elaborate text strokes and generates high-quality results.