Submitted:
04 July 2024
Posted:
05 July 2024
Abstract
Keywords:
1. Introduction
- (1) This paper proposes a method that fuses multimodal features by combining graph structure, visual elements, and textual information into a more comprehensive data representation. Compared with methods that use single-modal features, this approach provides richer information, which improves the model's performance and generalization ability.
- (2) By jointly considering the similarity scores of structural, visual, and textual features, our model outperforms existing multimodal methods on the knowledge graph link prediction task, with significant improvements on multiple metrics. Analyzing the contribution of each feature to the final result also improves the interpretability and credibility of the model.
- (3) Case analysis shows that multimodal feature fusion reduces the biases that single-modal features may introduce. By combining information from different modalities, the model makes more accurate and robust predictions, which demonstrates the value of multimodal feature fusion in complex tasks.
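
As a rough illustration of contributions (2) and (3), the sketch below combines per-modality similarity scores into one fused score per candidate entity and keeps each modality's share of that score. The function name `fuse_scores`, the equal default weights, and the toy scores are assumptions made for illustration only; the fusion mechanism actually used by the model is not reproduced here.

```python
import numpy as np

def fuse_scores(structural, visual, textual, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-modality similarity scores (hypothetical fusion rule).

    Each input is a 1-D array holding one score per candidate entity.
    Returns the fused scores and each modality's contribution to them.
    """
    scores = np.stack([structural, visual, textual])  # shape (3, n_candidates)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalise weights to sum to 1
    contributions = w[:, None] * scores               # per-modality share of each score
    fused = contributions.sum(axis=0)                 # fused score per candidate
    return fused, contributions

# Toy example: three candidate tail entities scored by each modality.
structural = np.array([0.9, 0.2, 0.4])
visual = np.array([0.1, 0.8, 0.3])
textual = np.array([0.2, 0.7, 0.5])

fused, contributions = fuse_scores(structural, visual, textual)
best = int(np.argmax(fused))  # index of the top-ranked candidate after fusion
```

In this toy example the structural score alone would pick candidate 0, while the fused score picks candidate 1, where the visual and textual evidence agree. The rows of `contributions` show how much each modality adds to every fused score, which is one simple way to inspect per-feature contributions.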
2. Related Work
3. Methodology
3.1. Overall Architecture
3.2. Structural Feature Extraction
3.3. Visual Feature Extraction
3.4. Text Feature Extraction
3.5. Multimodal Feature Fusion
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Baselines
- VisualBERT [17], a pre-trained vision-language model with a single-stream structure.
- ViLBERT [16], a pre-trained vision-language model with a two-stream structure.
- IKRL [36], which extends TransE to learn the visual representations of entities alongside the structural information of knowledge graphs.
- TransAE [20], which combines a multimodal autoencoder with TransE to encode visual and textual knowledge into a unified representation and uses the hidden layer of the autoencoder as the entity representation in the TransE model.
- RSME [22], which designs a forget gate with an MRP metric to select valuable images for multimodal knowledge graph embedding learning.
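
Since IKRL and TransAE both build on TransE [8], a minimal sketch of the TransE scoring function may help: a triple (h, r, t) is considered plausible when the translated head h + r lies close to the tail t. The function name `transe_score` and the toy two-dimensional embeddings are illustrative assumptions, not code from any of the baselines.

```python
import numpy as np

def transe_score(head, relation, tail, norm=1):
    """TransE plausibility score: the negative distance between h + r and t.

    Scores are at most 0; values closer to 0 indicate a more plausible triple.
    `tail` may hold several candidates, one per row.
    """
    return -np.linalg.norm(head + relation - tail, ord=norm, axis=-1)

# Toy 2-D embeddings (illustrative values only).
h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
candidates = np.array([[1.0, 1.0],   # exactly h + r
                       [0.0, 0.0]])  # far from h + r

scores = transe_score(h, r, candidates)
```

Here the first candidate receives the best possible score of 0 and the second receives -2 under the L1 norm, so the first candidate would be ranked higher in link prediction.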
4.1.3. Experiment Details
5. Experimental Results
5.1. Overall Performance
5.2. Ablation Study
5.3. Visual Analysis
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Huang, X.; Zhang, J.; Li, D.; Li, P. Knowledge graph embedding based question answering. In Proceedings of the twelfth ACM international conference on web search and data mining, 30 Jan 2019; pp. 105-113.
- Yih, SW.; Chang, MW.; He, X.; Gao, J. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP, 28 Jul 2015; pp. 1321-1331.
- Zhou, H.; Young, T.; Huang, M.; Zhao, H.; Xu, J.; Zhu, X. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, 13 Jul 2018; pp. 4623-4629.
- Huang, J.; Zhao, WX.; Dou, H.; Wen, JR.; Chang, EY. Improving sequential recommendation with knowledge-enhanced memory networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, 27 Jun 2018; pp. 505-514.
- Zhang, N.; Jia, Q.; Deng, S.; Chen, X.; Ye, H.; Chen, H.; Tou, H.; Huang, G.; Wang, Z.; Hua, N.; Chen, H. Alicg: Fine-grained and evolvable conceptual graph construction for semantic search at alibaba. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 14 Aug 2021; pp. 3895-3905.
- Dietz, L.; Kotov, A.; Meij, E. Utilizing knowledge graphs for text-centric information retrieval. In The 41st international ACM SIGIR conference on research & development in information retrieval, 27 Jun 2018; pp. 1387-1390.
- Yang, Z. Biomedical information retrieval incorporating knowledge graph for explainable precision medicine. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 25 Jul 2020; pp. 2486-2486.
- Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 26, 2787-2795.
- Wang, Z.; Zhang, J.; Feng, J.; Chen, Z. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the AAAI conference on artificial intelligence, 21 Jun 2014; pp. 1112-1119.
- Nathani, D.; Chauhan, J.; Sharma, C.; Kaul, M. Learning attention-based embeddings for relation prediction in knowledge graphs. arXiv preprint arXiv:1906.01195, 4 Jun 2019.
- Nguyen, DQ.; Nguyen, TD.; Nguyen, DQ.; Phung, D. A novel embedding model for knowledge base completion based on convolutional neural network. arXiv preprint arXiv:1712.02121, 6 Dec 2017.
- Pezeshkpour, P.; Chen, L.; Singh, S. Embedding multimodal relational data for knowledge base completion. arXiv preprint arXiv:1809.01341, 5 Sep 2018.
- Mousselly-Sergieh, H.; Botschen, T.; Gurevych, I.; Roth, S. A multimodal translation-based approach for knowledge graph representation learning. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, Jun 2018; pp. 225-234.
- Xie, R.; Liu, Z.; Luan, H.; Sun, M. Image-embodied knowledge representation learning. arXiv preprint arXiv:1609.07028, 22 Sep 2016.
- Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access, 15 May 2019; pp. 63373-63394. [CrossRef]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 6 Aug 2019.
- Li, LH.; Yatskar, M.; Yin, D.; Hsieh, CJ.; Chang, KW. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 6 Aug 2019.
- Chen, YC.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In European conference on computer vision, 30 Aug 2020; pp. 104-120.
- Mousselly-Sergieh, H.; Botschen, T.; Gurevych, I.; Roth, S. A multimodal translation-based approach for knowledge graph representation learning. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, Jun 2018; pp. 225-234.
- Wang, Z.; Li, L.; Li, Q.; Zeng, D. Multimodal data enhanced representation learning for knowledge graphs. In 2019 International Joint Conference on Neural Networks (IJCNN), 14 Jul 2019; pp. 1-8.
- Zhao, Y.; Cai, X.; Wu, Y.; Zhang, H.; Zhang, Y.; Zhao, G.; Jiang, N. MoSE: modality split and ensemble for multimodal knowledge graph completion. CoRR abs/2210.08821, 17 Oct 2022.
- Wang, M.; Wang, S.; Yang, H.; Zhang, Z.; Chen, X.; Qi, G. Is visual context really helpful for knowledge graph? A representation learning perspective. In Proceedings of the 29th ACM International Conference on Multimedia, 17 Oct 2021; pp. 2735-2743.
- Shankar, S.; Thompson, L.; Fiterau, M. Progressive fusion for multimodal integration. arXiv preprint arXiv:2209.00302, 1 Sep 2022.
- Liang, P.P.; Ling, C.K.; Cheng, Y.; Obolenskiy, A.; Liu, Y.; Pandey, R.; Salakhutdinov, R. Quantifying Interactions in Semi-supervised Multimodal Learning: Guarantees and Applications. In The Twelfth International Conference on Learning Representations. 2023.
- Jiang, Y.; Gao, Y.; Zhu, Z.; Yan, C.; Gao, Y. HyperRep: Hypergraph-Based Self-Supervised Multimodal Representation Learning, 22 Sept 2023.
- Golovanevsky, M.; Schiller, E.; Nair, AA.; Singh, R.; Eickhoff, C. One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data. In ICML 2024 Workshop on Efficient and Accessible Foundation Models for Biological Discovery, 18 Jun 2024.
- Zhang, X.; Yoon, J.; Bansal, M.; Yao, H. Multimodal representation learning by alternating unimodal adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 27456-27466.
- Li, X.; Zhao, X.; Xu, J.; Zhang, Y.; Xing, C. IMF: interactive multimodal fusion model for link prediction. In Proceedings of the ACM Web Conference 2023, 30 Apr 2023; pp. 2572-2580.
- Chen, X.; Zhang, N.; Li, L.; Deng, S.; Tan, C.; Xu, C.; Huang, F.; Si, L.; Chen, H. Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, 6 Jul 2022; pp. 904-915.
- Gu, W.; Gao, F.; Lou, X.; Zhang, J. Link prediction via graph attention network. arXiv preprint arXiv:1910.04807, 10 Oct 2019.
- Dosovitskiy, A.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, 3-7 May 2021.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 24 May 2019; pp. 4171-4186.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, AN.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30 Jun 2017; pp. 5998-6008.
- Miller, GA. WordNet: a lexical database for English. Communications of the ACM, 1 Nov 1995; pp. 39-41.
- Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 9 Jun 2008; pp. 1247-1250.
- Xie, R.; Liu, Z.; Luan, H.; Sun, M. Image-embodied knowledge representation learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 19-25 August 2017; pp. 3140-3146.



| Dataset | Ent | Rel | Train | Dev | Test |
|---|---|---|---|---|---|
| FB15K-237-IMG | 14,541 | 237 | 272,115 | 17,535 | 20,466 |
| WN18-IMG | 40,943 | 18 | 141,442 | 5,000 | 5,000 |

| Model | FB15K-237-IMG Hits@1↑ | FB15K-237-IMG Hits@3↑ | FB15K-237-IMG Hits@10↑ | FB15K-237-IMG MR↓ | WN18-IMG Hits@1↑ | WN18-IMG Hits@3↑ | WN18-IMG Hits@10↑ | WN18-IMG MR↓ |
|---|---|---|---|---|---|---|---|---|
| VisualBERT_base [17] | 0.217 | 0.324 | 0.439 | 592 | 0.179 | 0.437 | 0.654 | 122 |
| ViLBERT_base [16] | 0.233 | 0.335 | 0.457 | 483 | 0.223 | 0.552 | 0.761 | 131 |
| IKRL [36] | 0.194 | 0.284 | 0.458 | 298 | 0.127 | 0.796 | 0.928 | 596 |
| TransAE [20] | 0.199 | 0.317 | 0.463 | 431 | 0.323 | 0.835 | 0.934 | 352 |
| RSME [22] | 0.242 | 0.344 | 0.467 | 417 | 0.943 | 0.951 | 0.957 | 223 |
| MM-Transformer | 0.259 | 0.362 | 0.511 | 215 | 0.948 | 0.968 | 0.976 | 117 |

| Features (FB15K-237-IMG) | Hits@1↑ | Hits@3↑ | Hits@10↑ | MR↓ |
|---|---|---|---|---|
| T | 0.241 | 0.345 | 0.457 | 248 |
| S + T | 0.242 | 0.351 | 0.386 | 232 |
| V + T | 0.256 | 0.367 | 0.504 | 221 |
| S + V + T | 0.259 | 0.362 | 0.511 | 215 |

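
The result tables report Hits@k (higher is better) and mean rank, MR (lower is better). The sketch below shows how these metrics are commonly computed from the rank of the correct entity in each test query. The helper names and the toy ranks are assumptions for illustration; ties are broken optimistically, and any filtering of known true triples is left out.

```python
import numpy as np

def rank_of_target(scores, target_index):
    """1-based rank of the correct entity when candidates are sorted by descending score."""
    scores = np.asarray(scores)
    return int(np.sum(scores > scores[target_index])) + 1

def hits_at_k(ranks, k):
    """Fraction of test queries whose correct entity is ranked within the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

def mean_rank(ranks):
    """Average rank of the correct entity over all test queries."""
    return float(np.mean(ranks))

# Toy ranks for four test queries (illustrative values only).
ranks = [1, 3, 12, 2]
h1, h3, h10 = hits_at_k(ranks, 1), hits_at_k(ranks, 3), hits_at_k(ranks, 10)
mr = mean_rank(ranks)

# Rank of candidate 2 among three scored candidates.
example_rank = rank_of_target([0.1, 0.9, 0.5], target_index=2)
```

For these toy ranks, Hits@1 is 0.25, Hits@3 and Hits@10 are both 0.75, and MR is 4.5. A single badly ranked query (rank 12) raises MR noticeably without affecting Hits@1 or Hits@3, which is why the two kinds of metric are usually reported together.
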
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).