Submitted: 30 December 2024
Posted: 31 December 2024
Abstract
Keywords:
1. Introduction
- This study proposes a dual-branch feature aggregation strategy that fuses independently cropped single-character image features with their corresponding character probability sequences. The aggregated high-level prior thus focuses on the structural details of the characters themselves, effectively suppressing interference from complex backgrounds and significantly reducing mutual interference among densely distributed neighbouring characters.
- Considering that convolutional kernels of different sizes capture information through their specific receptive fields, an improved Inception module is introduced into the shallow layers to realise a dynamic multi-scale feature extraction strategy. By dynamically weighting convolutional kernels of different scales, global overview features and fine-grained features are adaptively balanced for each input, enriching the feature representation for a more comprehensive understanding of the salient visual content.
- Exploiting the idea that adaptive normalisation can learn the mapping between different image domains, a colour correction operation adaptively adjusts the mean and standard deviation of the pixels of the target images. This improves the quality of the super-resolved reconstruction while keeping the content of the original image unchanged. Experiments on the public TextZoom dataset show the superiority of the proposed model over existing baselines: average recognition accuracy on the test sets with the CRNN, MORAN, and ASTER recognisers improves by 1%, 1.5%, and 0.9%, respectively.
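The colour-correction operation in the last contribution adjusts the per-channel mean and standard deviation of the target image, in the spirit of adaptive instance normalisation. The paper's exact formulation is not reproduced in this outline; the following is only a minimal NumPy sketch under that assumption.

```python
import numpy as np

def adaptive_color_correction(img, ref, eps=1e-5):
    """Shift each colour channel of `img` so its mean/std match `ref`,
    leaving the spatial content (edges, strokes) unchanged.
    Illustrative sketch only, not the paper's actual module."""
    img = img.astype(np.float64)
    ref = ref.astype(np.float64)
    out = np.empty_like(img)
    for c in range(img.shape[-1]):              # per colour channel
        mu_i, sd_i = img[..., c].mean(), img[..., c].std()
        mu_r, sd_r = ref[..., c].mean(), ref[..., c].std()
        # normalise, then re-scale to the reference statistics
        out[..., c] = (img[..., c] - mu_i) / (sd_i + eps) * sd_r + mu_r
    return out
```

After correction, the output's channel statistics match the reference while the relative pixel structure — and hence the text content — is preserved.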
2. Related Works
2.1. Image Super Resolution
2.2. Scene Text Recognition for STISR
2.3. Scene Text Super Resolution
3. The Proposed Network Architecture
3.1. Image Pre-Processing
3.1.1. Dynamic Inception Feature Extraction
3.1.2. Single Character Boundary Detection
3.1.3. Text Recognizer
3.2. Dual Branch Feature Aggregation
3.3. Reconstructed Module
3.4. Loss Function
4. Experimental Results and Discussion
4.1. Dataset and Experimental Details
4.2. Ablation Experiment
4.2.1. The Role of Dual Branch Feature Aggregation
4.2.2. Dynamic Inception Feature Extraction
4.2.3. Validity of the CCB Module
4.2.4. Effectiveness of Different Components
4.3. Comparison with State-of-the-Art Results
4.3.1. TextZoom Quantitative Research
4.3.2. TextZoom Qualitative Research
4.3.3. Generalisation Capabilities on Text Recognition Datasets
4.3.4. Discussions
Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Liu, B.; Chen, K.; Peng, S.-L.; Zhao, M. Adaptive Aggregate Stereo Matching Network with Depth Map Super-Resolution. Sensors 2022, 22, 4548.
- Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2016, 39, 2298–2304.
- Luo, C.; Jin, L.; Sun, Z. MORAN: A multi-object rectified attention network for scene text recognition. Pattern Recognition 2019, 90, 109–118.
- Sheng, F.; Chen, Z.; Mei, T.; Xu, B. A single-shot oriented scene text detector with learnable anchors. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), 2019; pp. 1516–1521.
- Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia 2018, 20, 3111–3122.
- Wang, W.; Xie, E.; Sun, P.; Wang, W.; Tian, L.; Shen, C.; Luo, P. TextSR: Content-aware text super-resolution guided by recognition. arXiv preprint arXiv:1909.07113, 2019.
- Wang, W.; Xie, E.; Liu, X.; Wang, W.; Liang, D.; Shen, C.; Bai, X. Scene text image super-resolution in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 650–666.
- Chen, J.; Li, B.; Xue, X. Scene text telescope: Text-focused scene image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021; pp. 12026–12035.
- Ma, J.; Guo, S.; Zhang, L. Text prior guided scene text image super-resolution. IEEE Transactions on Image Processing 2023, 32, 1341–1353.
- Ma, J.; Liang, Z.; Zhang, L. A text attention network for spatial deformation robust scene text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 5911–5920.
- Chen, J.; Yu, H.; Ma, J.; Li, B.; Xue, X. Text gestalt: Stroke-aware scene text image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022; pp. 285–293.
- Honda, K.; Kurematsu, M.; Fujita, H.; Selamat, A. Multi-task learning for scene text image super-resolution with multiple transformers. Electronics 2022, 11, 3813.
- Guo, K.; Zhu, X.; Schaefer, G.; Ding, R.; Fang, H. Self-supervised memory learning for scene text image super-resolution. Expert Systems with Applications 2024, 258, 125247.
- Shi, Q.; Zhu, Y.; Liu, Y.; Ye, J.; Yang, D. Perceiving multiple representations for scene text image super-resolution guided by text recognizer. Engineering Applications of Artificial Intelligence 2023, 124, 106551.
- Shi, Q.; Zhu, Y.; Fang, C.; Yang, D. TLWSR: Weakly supervised real-world scene text image super-resolution using text label. IET Image Processing 2023, 17, 2780–2790.
- Zhang, X.-G. A new kind of super-resolution reconstruction algorithm based on the ICM and the bilinear interpolation. In Proceedings of the 2008 International Seminar on Future BioMedical Information Engineering, 2008; pp. 183–186.
- Akhtar, P.; Azhar, F. A single image interpolation scheme for enhanced super resolution in bio-medical imaging. In Proceedings of the 2010 4th International Conference on Bioinformatics and Biomedical Engineering, 2010; pp. 1–5.
- Badran, Y.K.; Salama, G.I.; Mahmoud, T.A.; Mousa, A.; Moussa, A. Single image super resolution using discrete cosine transform driven regression tree. In Proceedings of the 2020 37th National Radio Science Conference (NRSC), 2020; pp. 128–136.
- Park, S.C.; Park, M.K.; Kang, M.G. Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine 2003, 20, 21–36.
- Faramarzi, A.; Ahmadyfard, A.; Khosravi, H. Adaptive image super-resolution algorithm based on fractional Fourier transform. Image Analysis and Stereology 2022, 41, 133–144.
- Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution via sparse representation. IEEE Transactions on Image Processing 2010, 19, 2861–2873.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2015, 38, 295–307.
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 1646–1654.
- Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 1874–1883.
- Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016; pp. 1637–1645.
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017; pp. 136–144.
- Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 2022, 479, 47–59.
- Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018, 41, 2035–2048.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 184–199.
- Mou, Y.; Tan, L.; Yang, H.; Chen, J.; Liu, L.; Yan, R.; Huang, Y. PlugNet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 158–174.
- Zhao, C.; Feng, S.; Zhao, B.N.; Ding, Z.; Wu, J.; Shen, F.; Shen, H.T. Scene text image super-resolution via parallelly contextual attention network. In Proceedings of the 29th ACM International Conference on Multimedia, 2021; pp. 2908–2917.
- Zhao, M.; Wang, M.; Bai, F.; Li, B.; Wang, J.; Zhou, S. C3-STISR: Scene text image super-resolution with triple clues. arXiv preprint arXiv:2204.14044, 2022.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015; pp. 1–9.
- Vaswani, A.; et al. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
- Li, X.; Zuo, W.; Loy, C.C. Learning generative structure prior for blind text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023; pp. 10103–10113.
- Stamatopoulos, N.; Gatos, B.; Louloudis, G.; Pal, U.; Alaei, A. ICDAR 2013 handwriting segmentation contest. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (ICDAR), 2013; pp. 1402–1406.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015; pp. 1156–1160.
- Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision (ICCV), 2011; pp. 1457–1464.
- Phan, T.Q.; Shivakumara, P.; Tian, S.; Tan, C.L. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013; pp. 569–576.
- Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 2014, 41, 8027–8048.
- Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021; pp. 1905–1914.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 2004, 13, 600–612.
- Wang, X.; Yu, K.; Dong, C.; Loy, C.C. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018; pp. 606–615.









| Fusion Strategy | Easy | Medium | Hard | avgAcc↑ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|---|
| w/o DBFA | 51.2% | 41.9% | 31.7% | 41.6% | 21.02 | 0.7690 |
| | 61.8% | 52.1% | 37.9% | 50.6% | 21.10 | 0.7819 |
| | 60.3% | 50.4% | 36.9% | 49.2% | 20.87 | 0.7783 |
| TPI | 62.9% | 53.5% | 39.8% | 52.8% | 21.52 | 0.7930 |
| LTFA | 63.1% | 53.8% | 39.8% | 53.1% | 21.43 | 0.7954 |
| DBFA | 63.5% | 55.3% | 39.9% | 53.6% | 21.84 | 0.7997 |
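The table above compares fusion strategies only by their scores; the aggregation mechanism itself is not spelled out in this outline. Purely as a hypothetical illustration (not the paper's actual DBFA), one plausible way to aggregate per-character image features with the recognizer's character probability sequence is a learned gate; `w_proj` and `w_gate` below are invented parameters.

```python
import numpy as np

def dual_branch_aggregate(char_feats, char_probs, w_proj, w_gate):
    """Gated fusion of a visual branch (per-character image features,
    shape (N, D)) with a text-prior branch (per-character probability
    vectors, shape (N, C)) projected into the same feature space.
    Hypothetical sketch, not the paper's actual DBFA module."""
    prior = char_probs @ w_proj                         # (N, C) @ (C, D) -> (N, D)
    both = np.concatenate([char_feats, prior], axis=1)  # (N, 2D)
    gate = 1.0 / (1.0 + np.exp(-(both @ w_gate)))       # sigmoid gate, (N, D)
    # per-dimension convex combination of the two branches
    return gate * char_feats + (1.0 - gate) * prior
```

Because each character is fused independently, the prior attends to the character's own structure rather than to its neighbours or the background, which matches the motivation stated in the contributions.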
| # | DIFE Parameter | Easy | Medium | Hard | avgAcc↑ |
|---|---|---|---|---|---|
| 1 | 9×9 | 62.8% | 53.6% | 38.7% | 52.6% |
| 2 | 1×1, 1×1+5×5 | 62.4% | 52.1% | 38.6% | 52.5% |
| 3 | 1×1, 1×1+7×7 | 63.2% | 53.7% | 38.9% | 52.7% |
| 4 | 1×1, 1×1+9×9 | 63.4% | 53.9% | 39.1% | 52.9% |
| 5 | 1×1, 1×1+3×3, 7×7+1×1 | 62.9% | 53.5% | 39.5% | 53.4% |
| 6 | 1×1, 1×1+9×9, 5×5+1×1 | 63.6% | 54.6% | 39.7% | 53.2% |
| 7 | 1×1, 1×1+5×5, 1×1+7×7, 1×1+9×9, 3×3+1×1 | 63.8% | 54.8% | 39.8% | 53.4% |
| 8 | 1×1, 1×1+5×5, 1×1+7×7, 1×1+9×9, 3×3+1×1 (dynamic) | 63.5% | 55.3% | 39.9% | 53.6% |
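Row 8's "dynamic" variant weights the parallel kernel branches per input rather than fixing their contributions. How the gating is computed is not specified in this outline; the sketch below assumes a tiny gating head over global image statistics and is illustrative only (single channel, stride 1).

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same' 2-D convolution (single channel, odd kernel, stride 1)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dynamic_inception(x, kernels, gate_w):
    """Run parallel branches with different kernel sizes and fuse them with
    input-dependent softmax weights. `gate_w` is an invented gating head
    mapping global statistics of the input to one logit per branch."""
    branches = [conv2d_same(x, k) for k in kernels]
    stats = np.array([x.mean(), x.std()])   # crude global descriptor
    weights = softmax(gate_w @ stats)       # (n_branches,), sums to 1
    fused = sum(w * b for w, b in zip(weights, branches))
    return fused, weights
```

Larger kernels contribute the global overview, small kernels the fine-grained detail; the softmax lets each input shift the balance between them, which is the idea the table's "dynamic" row is testing.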
| Approach | CCB | Easy | Medium | Hard | avgAcc↑ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|---|---|
| TPGSR | × | 61.0% | 49.9% | 36.7% | 49.8% | 21.02 | 0.7690 |
| | √ | 62.1% | 51.6% | 36.7% | 50.4% | 21.32 | 0.7705 |
| TATT | × | 62.6% | 53.4% | 39.8% | 52.6% | 21.52 | 0.7930 |
| | √ | 62.4% | 54.4% | 39.6% | 52.7% | 20.95 | 0.7951 |
| C3-STISR | × | 65.2% | 53.6% | 39.8% | 53.7% | 21.51 | 0.7721 |
| | √ | 65.1% | 54.0% | 39.6% | 53.8% | 21.37 | 0.7853 |
| MNTSR | × | 64.3% | 54.5% | 38.7% | 53.3% | 21.53 | 0.7946 |
| | √ | 64.0% | 54.8% | 38.9% | 53.2% | 21.67 | 0.7964 |
| SCE-STISR | × | 63.3% | 53.93% | 39.8% | 53.0% | 21.43 | 0.7982 |
| | √ | 63.5% | 55.3% | 39.9% | 53.6% | 21.84 | 0.7997 |
| DBFA | DIFE | CCB | Easy | Medium | Hard | avgAcc↑ |
|---|---|---|---|---|---|---|
| - | - | - | 62.26% | 52.73% | 39.09% | 52.1% |
| √ | - | - | 62.94% | 52.73% | 39.46% | 52.4% |
| √ | √ | - | 63.25% | 53.93% | 39.76% | 53.0% |
| √ | √ | √ | 63.53% | 55.31% | 39.95% | 53.6% |
| Method | CRNN Easy (%) | CRNN Medium (%) | CRNN Hard (%) | CRNN Avg (%) | MORAN Easy (%) | MORAN Medium (%) | MORAN Hard (%) | MORAN Avg (%) | ASTER Easy (%) | ASTER Medium (%) | ASTER Hard (%) | ASTER Avg (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bicubic | 36.4 | 21.1 | 21.1 | 26.8 | 60.6 | 37.9 | 30.8 | 44.1 | 67.4 | 42.4 | 31.2 | 48.2 |
| SRCNN | 41.1 | 22.3 | 22.0 | 29.2 | 63.9 | 40.0 | 29.4 | 45.6 | 70.6 | 44.0 | 31.5 | 50.0 |
| SRResNet | 45.2 | 32.6 | 25.5 | 35.1 | 66.0 | 47.1 | 33.4 | 49.9 | 69.4 | 50.5 | 35.7 | 53.0 |
| EDSR | 42.7 | 29.3 | 24.1 | 32.7 | 63.6 | 45.4 | 32.2 | 48.1 | 72.3 | 48.6 | 34.3 | 53.0 |
| RCAN | 46.8 | 27.9 | 26.5 | 34.5 | 63.1 | 42.9 | 33.6 | 47.5 | 67.3 | 46.6 | 35.1 | 50.7 |
| CARN | 40.7 | 27.4 | 24.3 | 31.4 | 58.8 | 42.3 | 31.1 | 45.0 | 62.3 | 44.7 | 31.5 | 47.1 |
| HAN | 51.6 | 35.8 | 29.0 | 39.6 | 67.4 | 48.5 | 35.4 | 51.5 | 71.1 | 52.8 | 39.0 | 55.3 |
| TSRN | 52.5 | 38.3 | 31.4 | 41.4 | 70.1 | 55.3 | 37.9 | 55.4 | 75.1 | 56.3 | 40.1 | 58.3 |
| PCAN | 59.6 | 45.4 | 34.8 | 47.4 | 73.7 | 57.6 | 41.0 | 58.5 | 77.5 | 60.7 | 43.1 | 61.5 |
| TBSRN | 59.6 | 47.1 | 35.3 | 48.1 | 74.1 | 57.0 | 40.8 | 58.4 | 75.7 | 59.9 | 41.6 | 60.1 |
| Gestalt | 61.2 | 47.6 | 35.5 | 48.9 | 75.8 | 57.8 | 41.4 | 59.4 | 77.9 | 60.2 | 42.4 | 61.3 |
| TPGSR | 63.1 | 52.0 | 38.6 | 51.8 | 74.9 | 60.5 | 44.1 | 60.5 | 78.9 | 62.7 | 44.5 | 62.8 |
| TATT | 62.6 | 53.4 | 39.8 | 52.6 | 72.5 | 60.2 | 43.1 | 59.5 | 78.9 | 63.4 | 45.4 | 63.6 |
| C3-STISR | 65.2 | 53.6 | 39.8 | 53.7 | 74.2 | 61.0 | 43.2 | 59.5 | 79.1 | 63.3 | 46.8 | 64.1 |
| PerMR | 65.1 | 50.4 | 37.8 | 52.0 | 76.7 | 58.9 | 42.9 | 60.6 | 80.8 | 62.9 | 45.5 | 64.2 |
| MNTSR | 64.3 | 54.5 | 38.7 | 53.3 | 76.7 | 61.2 | 44.9 | 61.9 | 79.5 | 64.6 | 45.8 | 64.4 |
| SCE-STISR | 63.5 | 55.3 | 39.9 | 53.6 | 73.9 | 59.5 | 44.7 | 60.9 | 80.9 | 63.4 | 45.8 | 64.5 |
| Method | PSNR Easy | PSNR Medium | PSNR Hard | PSNR avg | SSIM Easy | SSIM Medium | SSIM Hard | SSIM avg |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 22.35 | 18.98 | 19.39 | 20.35 | 0.7884 | 0.6254 | 0.6592 | 0.6961 |
| SRCNN | 23.48 | 19.06 | 19.34 | 20.78 | 0.8379 | 0.6323 | 0.6791 | 0.7227 |
| SRResNet | 24.36 | 18.88 | 19.29 | 21.03 | 0.8681 | 0.6406 | 0.6911 | 0.7403 |
| EDSR | 24.26 | 18.63 | 19.14 | 20.68 | 0.8633 | 0.6440 | 0.7108 | 0.7394 |
| RCAN | 22.15 | 18.81 | 19.83 | 20.26 | 0.8525 | 0.6465 | 0.7227 | 0.7406 |
| CARN | 22.70 | 19.15 | 20.02 | 20.62 | 0.8384 | 0.6412 | 0.7172 | 0.7323 |
| HAN | 23.30 | 19.02 | 20.16 | 20.95 | 0.8691 | 0.6537 | 0.7387 | 0.7596 |
| TSRN | 25.07 | 18.86 | 19.71 | 21.42 | 0.8897 | 0.6676 | 0.7302 | 0.7690 |
| PCAN | 24.57 | 19.14 | 20.26 | 21.49 | 0.8830 | 0.6781 | 0.7475 | 0.7752 |
| TBSRN | 23.46 | 19.17 | 19.68 | 20.91 | 0.8729 | 0.6455 | 0.7452 | 0.7603 |
| Gestalt | 23.95 | 18.58 | 19.74 | 20.76 | 0.8611 | 0.6621 | 0.7520 | 0.7584 |
| TPGSR | 23.73 | 18.68 | 20.06 | 20.97 | 0.8805 | 0.6738 | 0.7440 | 0.7719 |
| TATT | 24.72 | 19.02 | 20.31 | 21.52 | 0.9006 | 0.6911 | 0.7703 | 0.7930 |
| C3-STISR | 24.71 | 19.03 | 20.09 | 21.51 | 0.8545 | 0.6674 | 0.7639 | 0.7721 |
| PerMR | 24.89 | 18.98 | 20.42 | 21.43 | 0.9102 | 0.6921 | 0.7658 | 0.7894 |
| MNTSR | 24.93 | 19.28 | 20.38 | 21.50 | 0.9173 | 0.6860 | 0.7806 | 0.7946 |
| SCE-STISR | 24.99 | 19.13 | 20.78 | 21.84 | 0.9038 | 0.6955 | 0.7859 | 0.7951 |
| Method | IC13 | IC15 | CUTE80 | SVT | SVTP |
|---|---|---|---|---|---|
| Bicubic | 9.6% | 10.1% | 35.8% | 3.3% | 10.2% |
| SRResNet | 11.4% | 13.4% | 50.5% | 9.3% | 13.8% |
| TSRN | 15.6% | 18.6% | 66.9% | 10.0% | 16.4% |
| TBSRN | 17.7% | 21.3% | 75.0% | 12.2% | 17.4% |
| TPGSR | 22.7% | 24.2% | 72.6% | 13.7% | 16.5% |
| TATT | 27.6% | 28.6% | 74.7% | 14.2% | 25.9% |
| C3-STISR | 24.7% | 22.7% | 71.5% | 10.2% | 17.7% |
| SCE-STISR | 28.9% | 30.7% | 74.9% | 15.1% | 26.5% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).