Submitted: 09 May 2026
Posted: 12 May 2026
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Hephaestus Dataset
2.2. Preprocessing
2.3. Machine Learning Models
2.3.1. Generative Image-to-Text Transformer Model
2.3.2. Bootstrapped Language-Image Pretraining Model
2.3.3. Retrieval-Based Model
3. Results
4. Discussion
4.1. Model Performance
4.2. Limitations
4.3. Future Directions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| CNN | Convolutional Neural Network |
| RNN | Recurrent Neural Network |
| LSTM | Long Short-Term Memory |
| ViT | Vision Transformer |
| UAVs | Unmanned Aerial Vehicles |
| InSAR | Interferometric Synthetic Aperture Radar |
| PNG | Portable Network Graphics |
| P | Palette |
| RGBA | Red, Green, Blue, and Alpha |
| GIT | Generative Image-to-Text Transformer |
| InfoNCE | Information Noise-Contrastive Estimation |
| ITC | Image-Text Contrastive Learning |
| ITM | Image-Text Matching |
| LM | Image-Conditioned Language Modeling |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).