Submitted: 10 December 2025
Posted: 15 December 2025
Abstract
Keywords:
1. Introduction
2. Background and Preliminaries
3. Literature Review
- Paper 1 [11]: Maheswari et al. address the problem of automatically generating descriptive captions for accessibility by implementing a standard image captioning model. Their approach combines a ResNet50 CNN for feature extraction with an LSTM network for sequence generation, using GloVe word embeddings. The work involves standard preprocessing and training steps and serves as a demonstration of this common deep learning paradigm for the image captioning task (a minimal sketch of the paradigm is given after this list).
- Paper 2 [12]: Kinghorn et al. tackle the issue that holistic captioning methods may overlook important local details, leading to less descriptive captions. They propose a region-based pipeline that starts with an R-CNN object detector, followed by separate LSTMs for predicting human/object attributes and a CNN for scene classification. An encoder-decoder LSTM then translates these detected elements into refined, descriptive sentences. The authors claim their method generates more detailed captions by focusing on local regions, report improvements over contemporary baselines on the IAPR TC-12 dataset, and show strong cross-domain performance on NYUv2.
- Paper 3 [13]: Yuan et al. address the potential information loss when using only global or only local image features, as well as potential limitations of standard LSTMs in modeling dependencies at multiple timescales. They propose the "3G" model, which fuses global (VGG FC7) and local (VGG Conv5-4 + attention) features via an adaptive gate (sketched after this list). For sequence modeling, they employ a 2-layer Gated Feedback LSTM (GF-LSTM). Their contribution lies in combining global and local information adaptively and in introducing the GF-LSTM to captioning for potentially better handling of dependencies, with strong benchmark results reported.
- Paper 4 [14]: Sasibhooshan et al. focus on extracting finer-grained semantic details and contextual spatial relationships for richer captions. Their approach employs a Wavelet transform-based CNN (WCNN) encoder, a Visual Attention Prediction Network (VAPN), and a Contextual Spatial Relation Extractor (CSE) module with an LSTM decoder, trained using CIDEr optimization. Key contributions include the novel WCNN features, the combined attention mechanism, and the explicit modeling of spatial relations, leading to high reported CIDEr scores.
- Paper 5 [15]: Verma et al. present work on the standard image captioning task, using a VGG16 hybrid CNN (pre-trained on both object and scene data) as the encoder and a standard LSTM decoder without explicit attention. They highlight the use of the hybrid CNN and report competitive results across multiple metrics on standard benchmarks, including qualitative validation on live images, though limitations in caption grammar and detail were noted.
- Paper 6 [1]: Vinyals et al. present the foundational "Show and Tell" model, pioneering the end-to-end CNN-LSTM sequence-to-sequence approach. Using a GoogLeNet encoder to initialize an LSTM decoder and training via maximum likelihood, this work established the paradigm, showed significant BLEU improvements, and demonstrated the generative capabilities of these neural models.
- Paper 7 [16]: Lu et al. explore the specific challenges of remote sensing (RS) image captioning. Their contribution is the creation and release of the large-scale RSICD dataset. They benchmarked standard captioning models on this dataset, identifying RS-specific difficulties and demonstrating the limitations of standard models in this domain.
- Paper 8 [17]: Baig et al. address the description of novel objects not seen during training. They propose a modular post-processing method that uses an external object detector (YOLO9000) and Word2Vec embeddings to identify novel objects and then substitute semantically similar nouns in a base-generated caption, improving scores on the modified captions without retraining the captioner (a simplified sketch of this substitution step follows the list).
- Paper 9 [18]: Zhang et al. use a bidirectional LSTM (Bi-LSTM) decoder to incorporate future context and address potential state misalignment. Their approach pairs a ResNet encoder with visual attention and introduces a "Subsidiary Attention" mechanism to fuse the forward and backward Bi-LSTM states (a generic fusion sketch appears after this list), reporting improved CIDEr performance on MS COCO.
- Paper 10 [19]: Ming et al. provide a comprehensive review and taxonomy of the automatic image captioning field up to early 2022. Their work surveys traditional and deep learning methods, datasets, metrics, state-of-the-art comparisons, challenges, and future research directions.
- Paper 11 [20]: Bai and An offer an earlier survey (up to 2018), classifying image captioning approaches into retrieval-based, template-based, and various neural network categories. They discuss strengths, limitations, benchmark results, and future directions, focusing primarily on neural methods.
- Paper 12 [21]: Feng investigates knowledge-lean caption generation for news images using noisy web data. This early work uses LDA topic models on images and associated articles for content extraction, followed by extractive and abstractive (phrase-based) generation methods, demonstrating feasibility without manual resources.
- Paper 13 [22]: Khademi and Schulte propose a hierarchical, context-aware architecture to improve captioning. They use BiGrid LSTMs for spatial context, integrate region-based text features, and employ a deep Bi-LSTM with dynamic spatial attention implemented via another Grid LSTM, marking the first use of Grid LSTMs for captioning and achieving strong results.
- Paper 14 [23]: Arasi et al. focus on improving performance by optimizing hyperparameters using metaheuristics. Their proposed AIC-SSAIDL technique combines a MobileNetv2 encoder tuned with Sparrow Search Algorithm (SSA) and an Attention Mechanism-LSTM decoder tuned with Fruit Fly Optimization (FFO).
- Paper 15 [24]: Amirian et al. present a concise review emphasizing the algorithmic overlap between deep learning methods for image and video captioning. They discuss shared architectures (CNNs, RNNs/LSTMs, GANs), datasets, metrics, and platforms relevant to both tasks.
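Papers 1 [11], 5 [15], and 6 [1] all follow the same encoder-decoder recipe: a pretrained CNN maps the image to a feature vector that conditions an LSTM language model trained with maximum likelihood. The PyTorch sketch below is a minimal illustration of that paradigm rather than a reproduction of any one paper; the embedding and hidden sizes, the frozen ResNet50 backbone, and the teacher-forcing setup are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Pretrained CNN backbone -> fixed-length image embedding."""
    def __init__(self, embed_size=256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        for p in self.features.parameters():
            p.requires_grad = False                 # freeze the backbone (a common choice)
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):                      # images: (B, 3, 224, 224)
        feats = self.features(images).flatten(1)    # (B, 2048) pooled features
        return self.fc(feats)                       # (B, embed_size)

class LSTMDecoder(nn.Module):
    """Image embedding seeds an LSTM language model trained with teacher forcing."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_embed, captions):         # captions: (B, T) token ids
        words = self.embed(captions[:, :-1])        # shift right: predict token t from t-1
        inputs = torch.cat([img_embed.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                     # (B, T, vocab_size) logits
```

Training minimizes per-token cross-entropy against the reference captions; at inference the image embedding seeds the LSTM and tokens are generated step by step, greedily or with beam search as in [1].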
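The adaptive global/local fusion in Paper 3 [13] reduces to a small gating computation: attention pools the regional features into a single local context vector, and a learned scalar gate decides how much of the global versus the local view to pass to the language model. The sketch below shows one plausible parameterization, assuming a single-layer attention scorer and a sigmoid gate; the exact formulation in the paper may differ.

```python
import torch
import torch.nn as nn

class GlobalLocalGate(nn.Module):
    """Adaptively mixes a global image vector with attention-pooled local features."""
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.att = nn.Linear(feat_dim + hidden_dim, 1)       # scores each spatial region
        self.gate = nn.Linear(feat_dim * 2 + hidden_dim, 1)  # decides global vs. local

    def forward(self, v_global, v_local, h):
        # v_global: (B, D)   v_local: (B, R, D) regional features   h: (B, H) decoder state
        h_rep = h.unsqueeze(1).expand(-1, v_local.size(1), -1)
        alpha = torch.softmax(self.att(torch.cat([v_local, h_rep], -1)).squeeze(-1), dim=-1)
        v_att = (alpha.unsqueeze(-1) * v_local).sum(dim=1)   # (B, D) attended local context
        g = torch.sigmoid(self.gate(torch.cat([v_global, v_att, h], -1)))  # (B, 1)
        return g * v_global + (1 - g) * v_att                # gated fusion fed to the LSTM
```

The gate lets the decoder lean on the global view for scene-level words and on the attended regions for object-level words, which matches the adaptive global/local combination described in [13].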
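Paper 8 [17] leaves the base captioner untouched and instead edits its output: object labels produced by an external detector but missing from the caption replace semantically similar words in the generated sentence. The sketch below illustrates only that substitution step, with pretrained GloVe vectors standing in for the paper's Word2Vec embeddings, a placeholder detection list instead of YOLO9000 output, an assumed similarity threshold, and no part-of-speech filtering of nouns for brevity.

```python
import gensim.downloader as api

# Pretrained GloVe vectors stand in here for the Word2Vec embeddings used in [17].
word_vectors = api.load("glove-wiki-gigaword-100")

def inject_novel_objects(caption, detected_labels, threshold=0.5):
    """Swap caption words for semantically similar detector labels.

    Note: the original method restricts substitution to nouns identified by a
    part-of-speech tagger; this sketch skips that filter for brevity.
    """
    tokens = caption.lower().split()
    out = []
    for word in tokens:
        best, best_sim = word, threshold
        if word in word_vectors:
            for label in detected_labels:
                if label in word_vectors and label not in tokens:
                    sim = word_vectors.similarity(word, label)
                    if sim > best_sim:
                        best, best_sim = label, sim
        out.append(best)
    return " ".join(out)

# Hypothetical detections: the detector sees a zebra, the base caption said "horse".
print(inject_novel_objects("a horse grazing in a field", ["zebra", "grass"]))
```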
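Paper 9 [18] decodes with a bidirectional LSTM and fuses the forward and backward hidden states before predicting each word. The paper's Subsidiary Attention mechanism is not reproduced here; the sketch below only conveys the generic idea of a learned, state-dependent mixture of the two directions, with all dimensions and the softmax-weighted gating form assumed.

```python
import torch
import torch.nn as nn

class DirectionFusion(nn.Module):
    """Learned mixture of forward and backward decoder states (a generic stand-in
    for the Subsidiary Attention fusion described in [18])."""
    def __init__(self, hidden_size=512, vocab_size=10000):
        super().__init__()
        self.score = nn.Linear(hidden_size * 2, 2)        # one weight per direction
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, h_fwd, h_bwd):                      # each: (B, T, H)
        w = torch.softmax(self.score(torch.cat([h_fwd, h_bwd], -1)), dim=-1)
        fused = w[..., :1] * h_fwd + w[..., 1:] * h_bwd   # (B, T, H) fused states
        return self.out(fused)                            # per-step vocabulary logits
```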
4. Comparative Analysis
5. Discussion: Trends, Challenges, and Future Directions
5.1. Observed Trends
5.2. Open Challenges
5.3. Future Research Directions
6. Conclusion
References
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156–3164.
- Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML); Bach, F.; Blei, D., Eds.; PMLR, 2015, Vol. 37, Proceedings of Machine Learning Research, pp. 2048–2057.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2002, pp. 311–318.
- Denkowski, M.; Lavie, A. METEOR Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT). Association for Computational Linguistics, 2014, pp. 376–380.
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575.
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Association for Computational Linguistics, 2004, pp. 74–81.
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic Propositional Image Caption Evaluation. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2016, pp. 382–398. [CrossRef]
- Hodosh, M.; Young, P.; Hockenmaier, J. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research 2013, 47, 853–899. [CrossRef]
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics 2014, 2, 67–78. [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755. [CrossRef]
- Maheswari, A.; Kajal; Selvameena, R.; Kumar, K.V.; Shekar, M.G.; Rahul, M.V. Image Caption Generator Using CNN and LSTM. International Journal for Multidisciplinary Research (IJFMR) 2024, 6.
- Kinghorn, A.; Zhang, L.; Shao, L. A Region-based Image Caption Generator with Refined Descriptions. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 450–454. [CrossRef]
- Yuan, Z.; Li, Y.; Lu, W. 3G Structure for Image Caption Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019, pp. 6329–6334. [CrossRef]
- Sasibhooshan, R.; Kumaraswamy, R.; Sasidharan, S. Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction. Multimedia Tools and Applications 2023, 82, 28143–28167. [CrossRef]
- Verma, V.; Yadav, A.; Kumar, A.; Yadav, D. Automatic Image Caption Generation Using Deep Learning. arXiv preprint arXiv:2212.04531 2022.
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Transactions on Geoscience and Remote Sensing 2017, 56, 132–145. [CrossRef]
- Baig, M.O.; Shah, S.Z.; Wajahat, I.; Zafar, A.; Arif, M. Image Caption Generator with Novel Object Injection. In Proceedings of the 2021 International Conference on Digital Futures and Transformative Technologies (ICoDT2). IEEE, 2021, pp. 1–5. [CrossRef]
- Zhang, H.; Ma, L.; Jiang, T.; Lian, S. Image Caption Generation Using Contextual Information Fusion With Bi-LSTMs. IEEE Transactions on Circuits and Systems for Video Technology 2023, 33, 1770–1782. [CrossRef]
- Ming, Y.; Hu, N.; Fan, C.; Feng, F.; Zhou, J.; Yu, H. Visuals to Text: A Comprehensive Review on Automatic Image Captioning. IEEE/CAA Journal of Automatica Sinica 2022, 9, 1339–1365. [CrossRef]
- Bai, S.; An, S. A Survey on Automatic Image Caption Generation. arXiv preprint arXiv:1804.04464 2018.
- Feng, Y. Automatic Caption Generation for News Images. PhD thesis, School of Informatics, University of Edinburgh, Edinburgh, 2011.
- Khademi, M.; Schulte, O. Image Caption Generation with Hierarchical Contextual Visual Spatial Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 2056–2064.
- Arasi, M.A.; Alshahrani, H.M.; Alruwais, N.; Motwakel, A.; Ahmed, N.A.; Mohamed, A. Automated Image Captioning Using Sparrow Search Algorithm With Improved Deep Learning Model. IEEE Access 2023, 11, 104633–104642. [CrossRef]
- Amirian, S.; Rasheed, K.; Taha, T.R.; Arabnia, H.R. Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap. IEEE Access 2020, 8, 218386–218400. [CrossRef]
| Authors | Technical Approach | Contributions / Advantages | Limitations |
|---|---|---|---|
| Maheswari et al. [11] | ResNet50 encoder; LSTM decoder; GloVe word embeddings. | Implemented CNN+LSTM captioning model. | Standard architecture; no comparative evaluation against baselines reported. |
| Kinghorn, Zhang, Shao [12] | Region-based: R-CNN object detection, LSTMs for attributes, CNN scene classifier, Encoder-Decoder LSTM (labels to sentence). | More detailed captions via local focus; Outperformed baselines (IAPR TC-12); Cross-domain success (NYUv2). | Lower ROUGE-L score; Struggles with complex scenes; Processing time. |
| Yuan, Li, Lu [13] | Fuses global (VGG FC7) & local (VGG Conv5-4 + attention) features; Adaptive global gate; 2-layer Gated Feedback LSTM. | Combines global/local adaptively; Gated Feedback LSTM enhances language model; Strong benchmark results. | Outperformed by methods using external detectors; Minor caption errors. |
| Sasibhooshan, Kumaraswamy, Sasidharan [14] | Wavelet CNN encoder; Visual Attention Prediction Network (atrous, channel+spatial attention); Contextual Spatial Relation Extractor; LSTM decoder. Train: Cross Entropy + Self-critical (CIDEr optimization). | Novel Wavelet CNN features; Combined channel/spatial attention; Explicit spatial relation modeling; High CIDEr score (MS COCO). | Fails with complex scenes or incorrect object/relation recognition. |
| Verma, A. Yadav, Kumar, D. Yadav [15] | Encoder-Decoder: VGG16 Hybrid Places 1365 encoder, standard LSTM decoder. No explicit attention reported. | Hybrid CNN (objects+scenes); Multi-metric results reported; Claims competitive performance; Live image validation. | Captions lack grammatical fluency and detail; High training time; Failure cases noted; Basic architecture (preprint). |
| Vinyals, Toshev, Bengio, Erhan [1] | Encoder-Decoder: CNN (GoogLeNet) features feed LSTM decoder initially. End-to-end training (max likelihood). Beam search inference. | Foundational sequence-to-sequence model (NIC); End-to-end trainable; Significant BLEU score gains; Demonstrated generative diversity. | Overfits small datasets; Many verbatim captions; Gap versus human evaluation. |
| Lu, Wang, Zheng, Li [16] | Benchmarked standard methods (RNN/LSTM, Attention-LSTM) on Remote Sensing (RS) image data. | Created/released RSICD dataset; Identified RS captioning challenges; Showed standard model limitations on RS data. | No new model proposed; Benchmarked models rated 'acceptable' on RS; Dataset duplications noted; Poor cross-dataset performance. |
| Baig, Shah, Wajahat, Zafar, Arif [17] | Post-processing: External object detector (YOLO9000) + Word2Vec identify novel objects; Replaced nouns in base caption using semantics. | Handles novel objects without retraining; Modular approach; Uses external detector/embeddings; Score improvement shown (modified captions only). | Post-processing, not end-to-end; Depends on detector accuracy; Evaluation focused on changed captions. |
| Zhang, Ma, Jiang, Lian [18] | CNN(ResNet) encoder + Visual Attention. Bidirectional LSTM decoder. Novelty: Subsidiary Attention fuses forward/backward states. Train: Cross Entropy + Self-critical (CIDEr optimization). | Bidirectional context generation; Novel state fusion mechanism (Subsidiary Attention); Improved performance (esp. CIDEr) versus standard Bi-LSTM/others (MS COCO). | Bidirectional LSTM increases parameters/latency; Potential high-frequency word bias. |
| Ming, Hu, Fan, Feng, Zhou, Yu [19] | Review Paper: Surveys Traditional (Retrieval, Template) & Deep Learning (Encoder-Decoder, Attention, Training) methods. | Comprehensive survey & taxonomy; Summarizes datasets/metrics; Compares state-of-the-art; Discusses challenges. | Not applicable (Review Paper). |
| Bai, An [20] | Review Paper: Classifies Retrieval, Template, Neural Network-based (Multimodal, Encoder-Decoder, Attention, Compositional, Novel Object) methods. | Survey/summary of image captioning (up to 2018); Compares state-of-the-art; Discusses future directions. | Not applicable (Review Paper). |
| Feng, Y. [21] | Knowledge-lean news captioning: LDA topic model (image+document) for keywords; Extractive & Abstractive (phrase-based) realization. | Uses noisy web data (BBC News); Joint visual-text topic model; Knowledge-lean generation demonstrated; Phrase-based abstractive method. | News domain focus; Relies on topic models; Predates deep learning era. |
| Khademi, Schulte [22] | Bidirectional Grid LSTM (spatial features) + Region Texts (transfer learning) -> Deep Bidirectional LSTM (Layer 1: context, Layer 2: generation) + Dynamic Spatial Attention (Grid LSTM). | Novel context-aware architecture; Grid LSTM for spatial context/attention; Uses region texts; Hierarchical context/generation; State-of-the-art performance (MS COCO). | Model complexity significant; Needs pre-trained dense captioner. |
| Arasi et al. [23] | Encoder: MobileNetv2 + Sparrow Search Algorithm (hyperparameter optimization); Decoder: Attention Mechanism-LSTM + Fruit Fly Optimization (hyperparameter optimization). | Proposed AIC-SSAIDL technique; Uses metaheuristics (SSA, FFO) for hyperparameter tuning; Reported improved results. | Focus on metaheuristic optimization; Gains depend on optimization algorithm success; Computational cost unclear. |
| Amirian et al. [24] | Review (Image & Video): Concise review of Deep Learning methods (CNN, RNN/LSTM, GANs), focusing on algorithmic overlap. | Links image/video captioning methods; Discusses architectures, datasets, platforms; Included case study (video titles). | Concise scope, not comprehensive; Focus only on Deep Learning & overlap. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).