Submitted: 23 May 2025
Posted: 23 May 2025
Abstract
Keywords:
1. Introduction
2. Long Short-Term Memory (LSTM) Based Visual Question Answering Model Integrating Convolutional Layer Non-Linearity
- Activity Recognition: Crafting a narrative to describe activities depicted in a series of images.
- Image Description: Formulating a textual summary for a single image.
- Video Description: Composing a narrative for a sequence of images.
- Feature spatial structure in their input, like the 2D pixel layout of an image or the 1D sequence of words spanning a sentence, paragraph, or document.
- Include temporal structure in their input, such as image sequences in a video or sequential words in text, or require generating outputs with temporal structure, such as the words of a textual description (a minimal model sketch follows this list).
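To make the preceding description concrete, the following is a minimal sketch of an LSTM-based VQA model that fuses CNN image features (with convolutional ReLU non-linearity) and an LSTM question encoding. The Keras framing, all layer sizes, and the vocabulary sizes are illustrative assumptions, not the exact configuration of the proposed model.

```python
# Minimal sketch: CNN image branch with convolutional non-linearity fused
# with an LSTM question encoder for VQA. All sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000   # assumed question vocabulary size
MAX_Q_LEN = 26       # assumed maximum question length (tokens)
NUM_ANSWERS = 1000   # assumed size of the answer vocabulary

# Image branch: convolutions with ReLU non-linearity, pooled to one vector.
image_in = layers.Input(shape=(224, 224, 3))
x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(image_in)
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
img_feat = layers.Dense(512, activation="tanh")(x)

# Question branch: word embedding followed by an LSTM over the sequence.
question_in = layers.Input(shape=(MAX_Q_LEN,), dtype="int32")
q = layers.Embedding(VOCAB_SIZE, 300)(question_in)
q_feat = layers.LSTM(512)(q)

# Element-wise fusion of the two modalities, then answer classification.
fused = layers.multiply([img_feat, q_feat])
answer = layers.Dense(NUM_ANSWERS, activation="softmax")(fused)

model = Model(inputs=[image_in, question_in], outputs=answer)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```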
3. Adaptive Memory Network with Attention Mechanism
3.1. Model Architecture
- A convolution with a 7 × 7 kernel, employing 64 different kernels with a stride of 2.
- A subsequent max pooling operation with a stride of 2.
- Bottleneck convolutions with kernel sizes 1 × 1, 3 × 3, and 1 × 1, repeated 3 times, adding up to 9 layers.
- The same 1 × 1, 3 × 3, 1 × 1 pattern repeated 4 times, totaling 12 layers.
- The pattern repeated 6 times, for a sum of 18 layers.
- The pattern repeated 3 times, resulting in 9 layers.
- The sequence ends with an average pooling operation followed by a fully connected layer of 1000 nodes with a softmax function, contributing 1 layer to the architecture. A code sketch of this 50-layer stack follows the list.
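The layer counts above (1 + 9 + 12 + 18 + 9 + 1) sum to a 50-layer network with the standard ResNet-50 shape. Below is a minimal sketch of that stack; the shortcut projections and filter counts follow the common ResNet-50 convention and should be read as assumptions rather than the paper's exact backbone.

```python
# Sketch of the 50-layer stack enumerated above: 7x7 conv, max pool,
# bottleneck stacks of 3/4/6/3 blocks, average pool, 1000-way softmax.
import tensorflow as tf
from tensorflow.keras import layers, Model

def bottleneck(x, filters, stride=1):
    """1x1 -> 3x3 -> 1x1 bottleneck block with an identity/projection shortcut."""
    shortcut = x
    y = layers.Conv2D(filters, 1, strides=stride, activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(4 * filters, 1)(y)
    if stride != 1 or shortcut.shape[-1] != 4 * filters:
        # Project the shortcut when shape changes (assumed, per ResNet convention).
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride)(shortcut)
    return layers.ReLU()(layers.add([y, shortcut]))

inp = layers.Input(shape=(224, 224, 3))
x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(inp)  # 1 layer
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

# Four bottleneck stages: 3, 4, 6, 3 blocks -> 9 + 12 + 18 + 9 = 48 layers.
for filters, blocks in [(64, 3), (128, 4), (256, 6), (512, 3)]:
    for i in range(blocks):
        x = bottleneck(x, filters, stride=2 if (i == 0 and filters > 64) else 1)

x = layers.GlobalAveragePooling2D()(x)
out = layers.Dense(1000, activation="softmax")(x)  # final fully connected layer
model = Model(inp, out)
```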
3.2. Weight Initialization, Regularization, and Optimization
4. Event-Based Focal Visual-Content Text Attention
- Introduce a new multimodal EFVCTA framework that handles questions about events organized by institutions or organizations.
- A novel EFVCTA system is proposed to identify the similarity between the question’s context and the content of the input video or image; this similarity is used to localize evidence images that justify the generated answers.
- The proposed framework is evaluated against DMN+ (Improved Dynamic Memory Networks) [21], MCB (Multimodal Compact Bilinear Pooling) [22], Soft Attention [23], bi-directional soft attention (BiDAF, Bi-Directional Attention Flow) [24], Spatio-Temporal Reasoning using TGIF Attention [25], and Focal Visual-Text Attention (FVTA) [26]. A sketch of the similarity scoring follows this list.
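As referenced above, the following is a minimal, framework-free sketch of the similarity scoring at the heart of EFVCTA-style focal attention: each time step of the visual/text sequence is scored against the question context, scores are normalized over time, and the highest-weight step is kept as the evidence frame. The scaled-dot-product scoring function and all shapes are illustrative assumptions, not the authors' exact formulation.

```python
# Question-to-sequence focal attention sketch (assumed scoring function).
import numpy as np

def focal_attention(question_vec, frame_feats):
    """question_vec: (d,) question context; frame_feats: (T, d) per-frame features."""
    d = question_vec.shape[-1]
    scores = frame_feats @ question_vec / np.sqrt(d)   # (T,) similarity per time step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over time
    context = weights @ frame_feats                    # attended summary, shape (d,)
    evidence_idx = int(np.argmax(weights))             # most relevant (evidence) frame
    return context, weights, evidence_idx

rng = np.random.default_rng(0)
q = rng.standard_normal(256)
frames = rng.standard_normal((40, 256))
context, w, idx = focal_attention(q, frames)
print(f"evidence frame: {idx}, weight: {w[idx]:.3f}")
```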
5. Comprehensive Analysis and Impact of the Proposed Event-Based Focal Visual-Content Text Attention
6. Practical Applications and Potential Limitations
7. Conclusions
- In this work, an LSTM-based VQA model that integrates convolutional-layer non-linearity is proposed to address question understanding.
- An adaptive memory network with an attention mechanism is developed, employing weight regularization and optimization, to address content-based and text-based search over photos and videos.
- A novel event-based focal visual-content text attention (EFVCTA) model is designed and developed for question-context identification, text embedding with a sequence encoder, temporal focal correlation identification, and visual content-text attention for past-event search with evidence generation.
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| EFVCTA | Event-based Focal Visual-Content Text Attention |
| LSTM | Long Short-Term Memory |
| VQA | Visual Question Answering |
| CNN | Convolutional Neural Network |
| DMN+ | Improved Dynamic Memory Networks |
| MCB | Multimodal Compact Bilinear Pooling |
| BiDAF | Bi-Directional Attention Flow |
| FVTA | Focal Visual Text Attention |
| STTPs | Short Term Training Programs |
| FDPs | Faculty Development Programs |
References
- Sepp Hochreiter; Jürgen Schmidhuber (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. [CrossRef]
- Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (2009). "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868.
- Sak, Hasim; Senior, Andrew; Beaufays, Francoise (2014). "Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling". Archived from the original (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf) on 2021-09-24.
- Li, Xiangang; Wu, Xihong (2014). "Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition". arXiv:1410.4281.
- Calin, Ovidiu (2020). Deep Learning Architectures. Cham, Switzerland: Springer Nature. p. 555. ISBN 978-3-030-36720-6.
- H. Yang, L. Chaisorn, Y. Zhao, S.-Y. Neo, and T.-S. Chua, “VideoQA: Question answering on news video,” in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 632–641.
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 2425–2433.
- Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, “Visual7W: Grounded question answering in images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4995–5004.
- Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “TGIF-QA: Toward spatio-temporal reasoning in visual question answering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1359–1367.
- M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “MovieQA: Understanding stories in movies through question-answering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4631–4640.
- H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 451–466.
- H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, “Are you talking to a machine? Dataset and methods for multilingual image question,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2296–2304.
- J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Neural module networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 39–48.
- J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1988–1997.
- K. Kafle and C. Kanan, “An analysis of visual question answering algorithms,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1983–1991.
- L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann, “Uncovering temporal context for video question and answering,” Int. J. Comput. Vis., vol. 124, no. 3, pp. 409–421, 2017.
- M. Ren, R. Kiros, and R. Zemel, “Exploring models and data for image question answering,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2953–2961.
- L. Yu, E. Park, A. C. Berg, and T. L. Berg, “Visual Madlibs: Fill in the blank description generation and question answering,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 2461–2469.
- Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6325–6334.
- D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” in Proc. 25th ACM Int. Conf. Multimedia, 2017, pp. 1645–1653.
- Caiming Xiong, Stephen Merity, and Richard Socher, “Dynamic Memory Networks for Visual and Textual Question Answering,” arXiv:1603.01417v1 [cs.NE] 4 Mar 2016.
- Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach, “Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding,” arXiv:1606.01847v3 [cs.CV] 24 Sep 2016.
- Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia, “Learning Visual Question Answering by Bootstrapping Hard Attention,” arXiv:1808.00300v1 [cs.CV] 1 Aug 2018.
- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi, “Bi-Directional Attention Flow for Machine Comprehension,” arXiv:1611.01603v6 [cs.CL] 21 Jun 2018.
- Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim, “TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering,” arXiv:1704.04497v3 [cs.CV] 3 Dec 2017.
- Junwei Liang, Lu Jiang, Liangliang Cao, Yannis Kalantidis, Li-Jia Li, and Alexander G. Hauptmann, “Focal Visual-Text Attention for Memex Question Answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, No. 8, pp. 1893-1908, Aug 2019.
| Approach | Top-1 validation error (%) | Top-5 validation error (%) | Top-5 test error (%) |
| --- | --- | --- | --- |
| Proposed Approach (2 nets / 1 net, multi-crop, dense eval) | 22.8 | 7.9 | 7.8 |
| Krizhevsky et al., 2012 (5 nets) | 38.1 | 16.4 | 16.4 |
| OverFeat, Sermanet et al., 2014 (7 nets) | 34.0 | 13.2 | 13.6 |
| MSRA, He et al., 2014 (1 net) | 27.9 | 9.1 | 9.1 |
| GoogLeNet, Szegedy et al., 2014 (7 nets) | - | - | 6.7 |
| Parameters | Set Values |
| --- | --- |
| Fully Connected Kernel Initializer Scale | 0.08 |
| Fully Connected Kernel Regularizer Scale | 1e-6 |
| Fully Connected Activity Regularizer Scale | 0.0 |
| Convolution Kernel Regularizer Scale | 1e-6 |
| Convolution Activity Regularizer Scale | 0.0 |
| Fully Connected Drop Rate | 0.5 |
| GRU Drop Rate | 0.3 |
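One plausible reading of the initialization and regularization settings above, expressed as Keras layer configuration. Assumptions: the initializer scale denotes a uniform range, "scale 0.0" means the activity regularizer is disabled, and the layer widths are placeholders.

```python
# Mapping the table's settings onto Keras layers (interpretation, not the
# authors' code). Layer widths are placeholders; scales come from the table.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, initializers

fc = layers.Dense(
    512,
    kernel_initializer=initializers.RandomUniform(-0.08, 0.08),  # FC initializer scale 0.08
    kernel_regularizer=regularizers.l2(1e-6),                    # FC kernel regularizer scale
    activity_regularizer=None,                                   # activity regularizer scale 0.0
)
conv = layers.Conv2D(
    64, 3, padding="same",
    kernel_regularizer=regularizers.l2(1e-6),                    # conv kernel regularizer scale
)
fc_dropout = layers.Dropout(0.5)                                 # fully connected drop rate
gru = layers.GRU(256, dropout=0.3)                               # GRU drop rate
```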
| Parameters | Set Values |
| --- | --- |
| Number of Epochs | 100 |
| Batch Size | 64 |
| Optimizer | Adam / RMSProp / Momentum / SGD |
| Initial Learning Rate | 0.0001 |
| Learning Rate Decay Factor | 1.0 |
| Number of Steps per Decay | 10000 |
| Clip Gradients | 10.0 |
| Momentum | 0.0 |
| Use Nesterov | True |
| Decay | 0.9 |
| Centered | True |
| Beta 1 (β1) | 0.9 |
| Beta 2 (β2) | 0.999 |
| Epsilon | 1e-5 |
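The optimization settings above can be wired into standard tf.keras optimizers as sketched below; this mapping is an assumption, not the authors' training code. Note that a decay factor of 1.0 yields a constant learning rate, and since the listed momentum is 0.0, Nesterov momentum would have no effect (Keras requires momentum > 0 for it, so it is disabled here).

```python
# Sketch: the table's optimization settings expressed as tf.keras optimizers.
import tensorflow as tf

# Decay factor 1.0 every 10000 steps, i.e. a constant learning rate of 1e-4.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=10000, decay_rate=1.0, staircase=True)

optimizers = {
    "Adam": tf.keras.optimizers.Adam(
        learning_rate=lr_schedule, beta_1=0.9, beta_2=0.999,
        epsilon=1e-5, clipnorm=10.0),
    "RMSProp": tf.keras.optimizers.RMSprop(
        learning_rate=lr_schedule, rho=0.9, centered=True,
        epsilon=1e-5, clipnorm=10.0),
    "Momentum": tf.keras.optimizers.SGD(
        learning_rate=lr_schedule, momentum=0.0, nesterov=False, clipnorm=10.0),
    "SGD": tf.keras.optimizers.SGD(learning_rate=lr_schedule, clipnorm=10.0),
}
```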
| Model | Test Accuracy |
| --- | --- |
| DMN+ (Improved Dynamic Memory Networks) | 48.51% |
| MCB (Multimodal Compact Bilinear Pooling) | 46.23% |
| Soft Attention | 62.08% |
| Soft Attention Bi-directional (BiDAF, Bi-Directional Attention Flow) | 60.09% |
| Spatio-Temporal Reasoning using TGIF Attention | 63.06% |
| Focal Visual-Text Attention (FVTA) | 66.86% |
| Proposed Event-based Focal Visual-Content Text Attention (EFVCTA) | 68.07% |
| Model | Visual-Text Alignment | Temporal Awareness | Attention Mechanism | Multimodal Fusion | Complexity |
| --- | --- | --- | --- | --- | --- |
| DMN+ (Improved Dynamic Memory Networks) | Poor | Moderate (via memory) | Soft Attention | No | Low |
| MCB (Multimodal Compact Bilinear Pooling) | Moderate | Absent | None | Bilinear Pooling | Moderate |
| Soft Attention | Moderate | Absent | Soft Attention | No | Low |
| Soft Attention Bi-directional (BiDAF, Bi-Directional Attention Flow) | Strong (textual) | Implicit | Bi-directional Flow | No | Moderate |
| Spatio-Temporal Reasoning using TGIF Attention | Moderate | Strong (TGIF events) | Spatio-Temporal Attention | Early Fusion | High |
| Focal Visual-Text Attention (FVTA) | Strong | Strong | Focal Attention | Late Fusion | High |
| Proposed Event-based Focal Visual-Content Text Attention (EFVCTA) | Excellent (event-localized) | Excellent (event-based focal) | Local Focal Attention | Hierarchical Fusion | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).