Submitted: 13 August 2024
Posted: 14 August 2024
Abstract
Keywords:
1. Introduction
2. Deep Learning Captioning Models
3. Preliminaries
3.1. Encoder-Decoder Image Captioning Model
3.2. Image Feature Extractors
3.3. Recurrent Neural Network
3.4. Word Embedding Models
3.5. Adaptation, Merging and Word Prediction
3.6. Evaluation Metrics
4. Training and Evaluation
4.1. Dataset and Testing Procedure
4.2. Training
| Image features | No. of model parameters (mln) | Embeddings | Time of sentence generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| Vgg19 | 144.47 | Glove | 2046 | 64.1 | 45.83 | 31.86 | 22.34 | 69.62 | 13.93 |
| Vgg19 | 145.32 | FastText | 2090 | 65.42 | 46.89 | 32.72 | 22.93 | 71.79 | 14.46 |
| Vgg16 | 143.26 | Glove | 2166 | 64.25 | 45.62 | 31.63 | 22.09 | 67.35 | 13.64 |
| Vgg16 | 144.11 | FastText | 2086 | 64.47 | 45.73 | 31.54 | 21.86 | 67.76 | 13.81 |
| Resnet50 | 27.97 | Glove | 2468 | 65.33 | 47.26 | 33.26 | 23.44 | 73.12 | 14.43 |
| Resnet50 | 28.81 | FastText | 2016 | 65.97 | 47.82 | 33.79 | 24.02 | 74.47 | 14.71 |
| Resnet152V2 | 62.71 | Glove | 2834 | 64.91 | 46.78 | 32.57 | 22.86 | 70.77 | 14.08 |
| Resnet152V2 | 63.55 | FastText | 2418 | 65.28 | 46.78 | 32.47 | 22.61 | 70.07 | 14.16 |
| MobileNetV2 | 6.45 | Glove | 2096 | 65.39 | 47.14 | 33.04 | 23.24 | 73.03 | 14.55 |
| MobileNetV2 | 7.29 | FastText | 2144 | 65.13 | 47.22 | 33.17 | 23.32 | 73.79 | 14.62 |
| MobileNet | 8.36 | Glove | 3860 | 64.35 | 46.14 | 32.12 | 22.42 | 69.28 | 13.76 |
| MobileNet | 9.2 | FastText | 1952 | 65.02 | 46.93 | 32.85 | 23.02 | 71.24 | 14.31 |
| Xception | 25.24 | Glove | 2414 | 66.59 | 48.63 | 34.34 | 24.33 | 78.13 | 15.16 |
| Xception | 26.08 | FastText | 2052 | 67.01 | 48.8 | 34.45 | 24.3 | 77.64 | 15.18 |
| InceptionV3 | 26.18 | Glove | 1960 | 66.12 | 47.72 | 33.35 | 23.38 | 74.16 | 14.72 |
| InceptionV3 | 27.02 | FastText | 1922 | 66.15 | 47.87 | 33.57 | 23.63 | 75.04 | 14.83 |
| DenseNet201 | 22.67 | Glove | 1828 | 66.35 | 48.41 | 34.26 | 24.18 | 76.54 | 14.96 |
| DenseNet201 | 23.51 | FastText | 1748 | 66.59 | 48.73 | 34.57 | 24.55 | 76.74 | 14.83 |
| DenseNet121 | 11.16 | Glove | 2468 | 65.03 | 47.02 | 32.96 | 23.26 | 71.94 | 14.13 |
| DenseNet121 | 12.00 | FastText | 2360 | 65.39 | 47.09 | 32.89 | 23.09 | 72.36 | 14.25 |
| Sugano [63] | - | - | - | 71.4 | 50.5 | 35.2 | 24.5 | 63.8 | - |
| Lebret [62] | - | - | - | 73 | 50 | 34 | 23 | - | - |
| Karpathy [18] | - | - | - | 62.5 | 45 | 32.1 | 23 | 66 | - |
| Xu [86] | - | - | - | 67.9 | 49.3 | 34.7 | 24.3 | 75.4 | - |
4.3. Evaluation
5. Results
5.1. Feature Extraction and Word Embedding
5.2. Merging and Word Prediction
5.3. Recurrent Neural Network Model
| No. | RNN size | Adaptation component size | Word prediction component size | No. of model parameters (mln) | Time of sentence generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 512 | 512 | 512 | 28.82 | 5565 | 67.53 | 49.67 | 35.39 | 25.19 | 82.48 | 15.75 |
| 2 | 256 | 256 | - | 25.18 | 4195 | 66.87 | 48.61 | 34.17 | 24.00 | 78.55 | 15.39 |
| 3 | 128 | 128 | 128 | 23.7 | 5100 | 65.86 | 47.94 | 33.54 | 23.40 | 74.92 | 14.77 |
| 4 | 256 | 256 | 256 | 25.2 | 6336 | 66.59 | 48.63 | 34.34 | 24.33 | 78.13 | 15.16 |
| No. | RNN size | Adaptation component size | Word prediction component size | No. of model parameters (mln) | Time of sentence generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 512 | 512 | 1024 | 33.35 | 6024 | 66.31 | 48.47 | 34.29 | 24.22 | 79.05 | 15.57 |
| 2 | 256 | 256 | - | 27.04 | 2814 | 66.96 | 48.74 | 34.32 | 24.28 | 79.07 | 15.54 |
| 3 | 128 | 128 | 256 | 24.68 | 2647 | 67.27 | 49.69 | 35.36 | 25.04 | 80.44 | 15.71 |
| 4 | 256 | 256 | 512 | 27.31 | 2812 | 67.51 | 49.75 | 35.56 | 25.36 | 82.49 | 16.08 |
| 5 | 256 | - | - | 25.56 | 2765 | 65.36 | 47.02 | 32.72 | 22.77 | 75.93 | 14.90 |
| 6 | 256 | - | 512 | 26.66 | 2513 | 66.22 | 48.39 | 34.32 | 24.27 | 78.71 | 15.09 |
| 7 | 256 | 256 | 256 | 25.32 | 2247 | 67.56 | 49.72 | 35.48 | 25.24 | 81.85 | 15.63 |
| 8 | 256 | 256 | 128 | 24.31 | 1905 | 67.18 | 49.47 | 35.19 | 24.90 | 80.73 | 15.40 |
| 9 | 512 | 512 | - | 32.29 | 2393 | 65.24 | 47.33 | 33.30 | 23.54 | 75.86 | 14.88 |
| No. | RNN size | Adaptation component size | Word prediction component size | No. of model parameters (mln) | Time of sentence generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 256 | 256 | 512 | 27.19 | 2211 | 67.32 | 49.62 | 35.51 | 25.45 | 81.41 | 15.60 |
| 2 | 256 | 256 | 256 | 25.19 | 2234 | 67.98 | 50.22 | 35.83 | 25.42 | 82.19 | 15.76 |
| 3 | 256 | 256 | 128 | 24.19 | 1823 | 67.20 | 49.10 | 34.67 | 24.41 | 80.39 | 15.33 |
5.4. Experiments on External Images

| Model configuration | CIDEr | Predicted caption |
|---|---|---|
| Image features: Xception; merge method: concatenate; word prediction component: 512; no adaptation component | 2.2237 | A giraffe standing in a field next to a tree. |
| Image features: VGG16; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.7200 | A giraffe standing next to a tree in a park. |
| Image features: Xception; merge method: concatenate; word prediction component: 256; RNN: GRU | 1.6128 | A giraffe standing in a dirt field next to a building. |
| Image features: MobileNetV2; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.5162 | A giraffe standing in a fenced in area. |
| Image features: Resnet50; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.4194 | A giraffe standing next to a zebra in a zoo. |
| Image features: Xception; merge method: concatenate; word prediction component: 512; RNN: GRU | 1.2934 | A giraffe standing next to a zebra in a field. |
| Image features: InceptionV3; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.2851 | A giraffe standing next to a wooden fence. |
| Image features: Xception; merge method: concatenate; word prediction component: 512; RNN: LSTM | 0.9393 | A couple of giraffe standing next to each other. |

Ground truth captions:
* A giraffe standing outside of a building next to a tree.
* A giraffe standing in a small piece of shade.
* A giraffe finds some sparse shade in his habitat.
* Giraffe standing in a holding pen near a tree stump.
* A giraffe in a zoo enclosure next to a barn.
5.5. Comparison with Transformer-Based Approaches
6. Conclusions
References
- Ramachandram, D.; Taylor, G.W. Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Processing Magazine 2017, 34, 96–108. [CrossRef]
- Zhang, X.; He, S.; Song, X.; Lau, R.W.; Jiao, J.; Ye, Q. Image captioning via semantic element embedding. Neurocomputing 2020, 395, 212–221. [CrossRef]
- Janusz, A.; Kałuża, D.; Matraszek, M.; Grad, Ł.; Świechowski, M.; Ślęzak, D. Learning multimodal entity representations and their ensembles, with applications in a data-driven advisory framework for video game players. Information Sciences 2022, 617, 193–210. [CrossRef]
- Zhang, W.; Sugeno, M. A fuzzy approach to scene understanding. [Proceedings 1993] Second IEEE International Conference on Fuzzy Systems, 1993, pp. 564–569 vol.1. [CrossRef]
- Iwanowski, M.; Bartosiewicz, M. Describing images using fuzzy mutual position matrix and saliency-based ordering of predicates. 2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2021, pp. 1–8. [CrossRef]
- Kuznetsova, P.; Ordonez, V.; Berg, A.; Berg, T.; Choi, Y. Collective Generation of Natural Image Descriptions. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Jeju Island, Korea, 2012; pp. 359–368.
- Li, S.; Kulkarni, G.; Berg, T.L.; Berg, A.C.; Choi, Y. Composing Simple Image Descriptions using Web-scale N-grams. Proceedings of the Fifteenth Conference on Computational Natural Language Learning; Association for Computational Linguistics: Portland, Oregon, USA, 2011; pp. 220–228.
- Mitchell, M.; Han, X.; Dodge, J.; Mensch, A.; Goyal, A.; Berg, A.; Yamaguchi, K.; Berg, T.; Stratos, K.; Daumé, H. Midge: Generating Image Descriptions from Computer Vision Detections. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: USA, 2012; EACL ’12, p. 747–756.
- Farhadi, A.; Hejrati, M.; Sadeghi, M.A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; Forsyth, D. Every Picture Tells a Story: Generating Sentences from Images. Computer Vision – ECCV 2010; Daniilidis, K.; Maragos, P.; Paragios, N., Eds.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2010; pp. 15–29.
- Barnard, K.; Duygulu, P.; Forsyth, D.; Blei, D.; Kandola, J.; Hofmann, T.; Poggio, T.; Shawe-Taylor, J. Matching Words and Pictures. Journal of Machine Learning Research 2003, 3. [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems; Pereira, F.; Burges, C.; Bottou, L.; Weinberger, K., Eds. Curran Associates, Inc., 2012, Vol. 25.
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, pp. 3156–3164.
- Ramisa, A.; Yan, F.; Moreno-Noguer, F.; Mikolajczyk, K. BreakingNews: Article Annotation by Image and Text Processing. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018, 40, 1072–1085.
- Biten, A.F.; Gómez, L.; Rusiñol, M.; Karatzas, D. Good News, Everyone! Context Driven Entity-Aware Captioning for News Images. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019, pp. 12458–12467.
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. ACL, 2018.
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. CVPR, 2021.
- Kiros, R.; Salakhutdinov, R.; Zemel, R. Multimodal Neural Language Models. Proceedings of the 31st International Conference on Machine Learning; Xing, E.P.; Jebara, T., Eds.; PMLR: Beijing, China, 2014; Number 2 in Proceedings of Machine Learning Research, pp. 595–603.
- Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3128–3137. [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587. [CrossRef]
- Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 677–691. [CrossRef]
- Johnson, J.; Karpathy, A.; Fei-Fei, L. Densecap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4565–4574.
- Xiao, X.; Wang, L.; Ding, K.; Xiang, S.; Pan, C. Dense semantic embedding network for image captioning. Pattern Recognition 2019, 90, 285–296. [CrossRef]
- Toshevska, M.; Stojanovska, F.; Zdravevski, E.; Lameski, P.; Gievska, S. Exploration into Deep Learning Text Generation Architectures for Dense Image Captioning. 2020 15th Conference on Computer Science and Information Systems (FedCSIS), 2020, pp. 129–136. [CrossRef]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning; Bach, F.; Blei, D., Eds.; PMLR: Lille, France, 2015; Vol. 37, Proceedings of Machine Learning Research, pp. 2048–2057.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018, pp. 6077–6086.
- Guo, L.; Liu, J.; Tang, J.; Li, J.; Luo, W.; Lu, H. Aligning Linguistic Words and Visual Semantic Units for Image Captioning. Proceedings of the 27th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2019; MM ’19, p. 765–773. [CrossRef]
- Gu, J.; Wang, G.; Cai, J.; Chen, T. An Empirical Study of Language CNN for Image Captioning. 2017 IEEE International Conference on Computer Vision (ICCV) 2017, pp. 1231–1240.
- Liu, S.; Bai, L.; Hu, Y.; Wang, H. Image Captioning Based on Deep Neural Networks. MATEC Web of Conferences 2018, 232, 01052. [CrossRef]
- Subash, R.; Jebakumar, R.; Kamdar, Y.; Bhatt, N. Automatic Image Captioning Using Convolution Neural Networks and LSTM. Journal of Physics: Conference Series 2019, 1362, 012096. [CrossRef]
- Xu, K.; Wang, H.; Tang, P. Image captioning with deep LSTM based on sequential residual. 2017 IEEE International Conference on Multimedia and Expo (ICME), 2017, pp. 361–366. [CrossRef]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Explain Images with Multimodal Recurrent Neural Networks. CoRR 2014, abs/1410.1090, [1410.1090].
- Dong, H.; Zhang, J.; McIlwraith, D.; Guo, Y. I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation. 2017 IEEE International Conference on Image Processing (ICIP). IEEE Press, 2017, p. 2015–2019. [CrossRef]
- Xian, Y.; Tian, Y. Self-Guiding Multimodal LSTM-When We Do Not Have a Perfect Training Dataset for Image Captioning. IEEE Transactions on Image Processing 2017, PP. [CrossRef]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-Critical Sequence Training for Image Captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pp. 1179–1195.
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pp. 3242–3250.
- Delbrouck, J.; Dupont, S. Bringing back simplicity and lightliness into neural image captioning. CoRR 2018, abs/1810.06245, [1810.06245].
- Tanti, M.; Gatt, A.; Camilleri, K. What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? Proceedings of the 10th International Conference on Natural Language Generation; Alonso, J.M.; Bugarín, A.; Reiter, E., Eds.; Association for Computational Linguistics: Santiago de Compostela, Spain, 2017; pp. 51–60. [CrossRef]
- Zhou, L.; Xu, C.; Koch, P.A.; Corso, J.J. Image Caption Generation with Text-Conditional Semantic Attention. ArXiv 2016, abs/1606.04621.
- Chen, X.; Zitnick, C.L. Mind’s eye: A recurrent visual representation for image caption generation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2422–2431. [CrossRef]
- Hessel, J.; Savva, N.; Wilber, M. Image Representations and New Domains in Neural Image Captioning 2015. [CrossRef]
- Song, M.; Yoo, C.D. Multimodal representation: Kneser-ney smoothing/skip-gram based neural language model. 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 2281–2285. [CrossRef]
- Hendricks, L.; Venugopalan, S.; Rohrbach, M.; Mooney, R.; Saenko, K.; Darrell, T. Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 1–10. [CrossRef]
- You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image Captioning with Semantic Attention. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 4651–4659. [CrossRef]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). arXiv: Computer Vision and Pattern Recognition 2014.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv 2020, abs/2010.11929.
- Liu, W.; Chen, S.; Guo, L.; Zhu, X.; Liu, J. CPTR: Full Transformer Network for Image Captioning, 2021, [arXiv:cs.CV/2101.10804].
- Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-linear attention networks for image captioning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10971–10980.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; Sutskever, I. Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning, 2021.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.; Parekh, Z.; Pham, H.; Le, Q.V.; Sung, Y.; Li, Z.; Duerig, T. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. CoRR 2021, abs/2102.05918, [2102.05918].
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified vision-language pre-training for image captioning and vqa. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol. 34, pp. 13041–13049.
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; others. Oscar: Object-semantics aligned pre-training for vision-language tasks. European Conference on Computer Vision. Springer, 2020, pp. 121–137.
- Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. VinVL: Making Visual Representations Matter in Vision-Language Models. CVPR, 2021.
- Ding, Z.; Sun, Y.; Xu, S.; Pan, Y.; Peng, Y.; Mao, Z. Recent Advances and Perspectives in Deep Learning Techniques for 3D Point Cloud Data Processing. Robotics 2023, 12. [CrossRef]
- Zhang, H.; Wang, C.; Yu, L.; Tian, S.; Ning, X.; Rodrigues, J. PointGT: A Method for Point-Cloud Classification and Segmentation Based on Local Geometric Transformation. IEEE Transactions on Multimedia 2024, pp. 1–12. [CrossRef]
- Wang, C.; Ning, X.; Sun, L.; Zhang, L.; Li, W.; Bai, X. Learning Discriminative Features by Covering Local Geometric Space for Point Cloud Analysis. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–15. [CrossRef]
- Wang, C.; Ning, X.; Li, W.; Bai, X.; Gao, X. 3D Person Re-Identification Based on Global Semantic Guidance and Local Feature Aggregation. IEEE Transactions on Circuits and Systems for Video Technology 2024, 34, 4698–4712. [CrossRef]
- Xue, L.; Yu, N.; Zhang, S.; Panagopoulou, A.; Li, J.; Martín-Martín, R.; Wu, J.; Xiong, C.; Xu, R.; Niebles, J.C.; Savarese, S. ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27091–27101.
- Chen, G.; Wang, M.; Yang, Y.; Yu, K.; Yuan, L.; Yue, Y. PointGPT: Auto-regressively Generative Pre-training from Point Clouds. Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Wang, S.S.; Dong, R.Y. Learning Complex Spatial Relation Model from Spatial Data. Journal of Computers 2019, 30, 123–136.
- Yang, Z.; Zhang, Y.; ur Rehman, S.; Huang, Y. Image Captioning with Object Detection and Localization. CoRR 2017, abs/1706.02430, [1706.02430].
- Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image Captioning: Transforming Objects into Words. CoRR 2019, abs/1906.05963, [1906.05963].
- Lebret, R.; Pinheiro, P.O.; Collobert, R. Phrase-Based Image Captioning. Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. JMLR.org, 2015, ICML’15, p. 2085–2094.
- Sugano, Y.; Bulling, A. Seeing with Humans: Gaze-Assisted Neural Image Captioning. ArXiv 2016, abs/1608.05203.
- Li, Y. Image Caption using VGG model and LSTM. Applied and Computational Engineering 2024, 48, 68–77. [CrossRef]
- Bartosiewicz, M.; Iwanowski, M.; Wiszniewska, M.; Frączak, K.; Leśnowolski, P. On Combining Image Features and Word Embeddings for Image Captioning. 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS), 2023, pp. 355–365. [CrossRef]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings; Bengio, Y.; LeCun, Y., Eds., 2015.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A.C.; Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 2015, 115, 211–252. [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. [CrossRef]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9. [CrossRef]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, pp. 2818–2826.
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR 2016, abs/1610.02357, [1610.02357].
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR 2017, abs/1704.04861, [1704.04861].
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Computation 1997, 9, 1735–1780. [CrossRef]
- Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1724–1734. [CrossRef]
- Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations, 2013.
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 2017, 5, 135–146.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics; Association for Computational Linguistics: USA, 2002; ACL ’02, p. 311–318. [CrossRef]
- Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575. [CrossRef]
- Cui, Y.; Yang, G.; Veit, A.; Huang, X.; Belongie, S. Learning to Evaluate Image Captioning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2018; pp. 5804–5812. [CrossRef]
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. European conference on computer vision. Springer, 2016, pp. 382–398.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. Computer Vision – ECCV 2014; Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T., Eds.; Springer International Publishing: Cham, 2014; pp. 740–755.
- Chen, X.; Fang, H.; Lin, T.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft COCO Captions: Data Collection and Evaluation Server. CoRR 2015, abs/1504.00325, [1504.00325].
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations 2014.
- Xu, N.; Liu, A.; Liu, J.; Nie, W.; Su, Y. Scene graph captioner: Image captioning based on structural visual representation. J. Vis. Commun. Image Represent. 2019, 58, 477–485.
- Rohrbach, A.; Hendricks, L.A.; Burns, K.; Darrell, T.; Saenko, K. Object Hallucination in Image Captioning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 4035–4045. [CrossRef]
- OpenAI. DALL·E 3 System Card, 2023. Accessed: 2024-07-12.
- OpenAI. Introducing GPT-4o and More Tools to ChatGPT Free Users, 2024. Accessed: 2024-07-12.
- Stefanini, M.; Cornia, M.; Baraldi, L.; Cascianelli, S.; Fiameni, G.; Cucchiara, R. From Show to Tell: A Survey on Image Captioning. CoRR 2021, abs/2107.06912, [2107.06912].
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. Advances in Neural Information Processing Systems; Guyon, I.; Luxburg, U.V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R., Eds. Curran Associates, Inc., 2017, Vol. 30.
- Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training 2018.
- Wiegreffe, S.; Pinter, Y. Attention is not not Explanation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Inui, K.; Jiang, J.; Ng, V.; Wan, X., Eds.; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 11–20. [CrossRef]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Inui, K.; Jiang, J.; Ng, V.; Wan, X., Eds.; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 5100–5111. [CrossRef]
| 1 | The authors of [81] propose their own metric, but because it is much less popular than SPICE (by more than 10x), we decided to use the latter in the current study. |
| 2 | All execution times reported in this paper were measured for a single caption on a computer with the following parameters: NVIDIA GeForce RTX 4070 with 12 GB VRAM; AMD Ryzen 5 3600 6-core processor; 32 GB RAM. |
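The per-caption timing described in footnote 2 could be measured along the following lines. This is only a minimal sketch: the paper does not state how the timer was implemented, so the function name `time_single_caption`, the repeat count, the use of the median, and the placeholder `dummy_generate_caption` are all assumptions introduced for illustration.

```python
import statistics
import time


def time_single_caption(generate_caption, image, repeats=5):
    """Return the median wall-clock time (ms) of generating one caption.

    `generate_caption` is a hypothetical callable standing in for the
    captioning model; repeating the call and taking the median is an
    assumption, not the procedure reported in the paper.
    """
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate_caption(image)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms)


def dummy_generate_caption(image):
    # Placeholder for a real model call; it returns a fixed string.
    return "a placeholder caption"


elapsed_ms = time_single_caption(dummy_generate_caption, image=None)
print(f"median time per caption: {elapsed_ms:.3f} ms")
```

When timing a model on a GPU, the first call is usually slower because of initialization, so discarding a warm-up run before measuring is a common precaution.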


| Image features | Image | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | Predicted caption | Ground truth captions |
|---|---|---|---|---|---|---|---|---|
| DenseNet121 | 2a | 79.54 | 73.06 | 64.87 | 53.42 | 159.42 | A man riding skis down a snow covered slope. | * A man on skis is posing on a ski slope. * A person on a ski mountain posing for the camera. * A man n a red coat stands on the snow on skis. * A man riding skis on top of a snow covered slope. * A lady is in her ski gear in the snow. |
| Resnet152V2 | 2e | 60.00 | 44.72 | 0.00 | 0.00 | 83.88 | A dog jumping in the air to catch a frisbee. | * A very cute brown dog with a disc in its mouth. * A dog running in the grass with a frisbee in his mouth. * A dog carrying a Frisbee in its mouth running on a grass lawn. * A dog in a grassy field carrying a frisbee. * A brown dog walking across a green field with a frisbee in it’s mouth. |
| VGG19 | 2c | 100.00 | 84.52 | 61.98 | 75.00 | 184.89 | A bathroom with a toilet and a sink. | * A bathroom with a sink. toilet and vanity. * Tiled bathroom with a couple towels hanging up * An old bathroom with a black marble sink. * A bathroom with a black sink counter next to a white toilet. * The corner of a bathroom with light mint green walls above the tile. |
| MobileNet | 2b | 75.00 | 46.29 | 0.00 | 0.00 | 120.58 | A kitchen with a stove and a microwave. | * A microwave is sitting idly in the kitchen. * A shiny silver metal microwave near wooden cabinets. * There are wooden cabinets that have a microwave attached at the bottom of it * A microwave sitting next to and underneath kitchen cupboards. * A kitchen scene with focus on a silver microwave. |
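The zero BLEU-3 and BLEU-4 entries for images 2e and 2b follow from how sentence-level BLEU is defined: the score is a geometric mean of clipped n-gram precisions, so a single order with no matching n-grams drives the whole score to zero. The sketch below illustrates this on the image 2e row. It assumes the standard BLEU definition (clipped precision, closest-reference brevity penalty, no smoothing) and a simple lowercase alphanumeric tokenizer; the function names are hypothetical and this is not necessarily the evaluation code used in the study.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Assumed tokenization: lowercase, keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n):
    """Unsmoothed sentence-level BLEU with uniform weights up to max_n."""
    cand = tokenize(candidate)
    refs = [tokenize(r) for r in references]
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        # For each n-gram, the maximum count observed in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # one empty order zeroes the geometric mean
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty against the closest reference length (ties: shorter).
    closest = min((len(r) for r in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > closest else math.exp(1.0 - closest / len(cand))
    return bp * math.exp(log_precision_sum / max_n)


candidate = "A dog jumping in the air to catch a frisbee."
references = [
    "A very cute brown dog with a disc in its mouth.",
    "A dog running in the grass with a frisbee in his mouth.",
    "A dog carrying a Frisbee in its mouth running on a grass lawn.",
    "A dog in a grassy field carrying a frisbee.",
    "A brown dog walking across a green field with a frisbee in it’s mouth.",
]

for n in range(1, 5):
    print(f"BLEU-{n}: {100 * bleu(candidate, references, n):.2f}")
```

Under these assumptions, 6 of the 10 candidate unigrams and 3 of the 9 bigrams are matched, which gives BLEU-1 = 60.00 and BLEU-2 = 44.72 as in the table, while no candidate trigram occurs in any reference, so BLEU-3 and BLEU-4 are 0.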
| Image features (image 2d) | DenseNet201 | Xception |
|---|---|---|
| BLEU-1 | 38.46 | 75.00 |
| BLEU-2 | 17.90 | 65.47 |
| BLEU-3 | 0.00 | 52.28 |
| BLEU-4 | 0.00 | 41.11 |
| METEOR | 21.90 | 21.81 |
| ROUGE-L | 35.62 | 69.85 |
| CIDEr | 11.12 | 157.97 |
| Predicted caption | A woman in a red dress is holding a white and red toothbrush. | A bride and groom cutting their wedding cake. |

Ground truth captions:
* A man and woman standing in front of a cake.
* A newly wed couple celebrating with a toast.
* A bride and groom celebrate over a cake.
* A bride and groom are celebrating with wedding cake.
* A man and a woman standing next to each other.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).