Multimodal Named Entity Recognition (MNER) leverages both textual and visual information to improve entity recognition, particularly in unstructured settings such as social media. Existing approaches rely predominantly on raster images (e.g., JPEG, PNG), whereas Scalable Vector Graphics (SVG) offer resolution independence and a structured semantic representation, advantages that remain underexplored in multimodal learning. To fill this gap, we propose MNER-SVG, the first framework that incorporates SVG as a visual modality and enhances it with ChatGPT-generated auxiliary knowledge. Specifically, we introduce a Multimodal Similar Instance Perception Module that retrieves semantically relevant examples and prompts ChatGPT to generate contextual explanations. We further construct a Full Text Graph and a Multimodal Interaction Graph, which are processed by Graph Attention Networks (GATs) to achieve fine-grained cross-modal alignment and feature fusion; a Conditional Random Field (CRF) layer is then employed for structured decoding. To support evaluation, we present SvgNER, the first MNER dataset annotated with SVG-specific visual content. Extensive experiments show that MNER-SVG achieves state-of-the-art performance with an F1 score of 82.23%, significantly outperforming both text-only and existing multimodal baselines. These results validate the feasibility of integrating vector graphics and knowledge generated by large language models into multimodal NER, opening a new research direction for structured visual semantics in fine-grained multimodal understanding.
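
To make the core pipeline concrete, the following is a minimal sketch of graph attention over a joint text-and-image node graph followed by CRF decoding of entity tags. This is an illustration only, not the authors' implementation: the class names (`GATLayer`, `FusionTagger`), the dense single-head attention layer, the toy fully connected adjacency, and the tag count are all assumptions, and the sketch depends on PyTorch plus the pytorch-crf package.

```python
# Hypothetical sketch of GAT fusion + CRF tagging; not the MNER-SVG code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchcrf import CRF  # pip install pytorch-crf


class GATLayer(nn.Module):
    """Single-head graph attention over a dense 0/1 adjacency matrix."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, dim) node features; adj: (N, N) adjacency mask.
        z = self.proj(h)
        n = z.size(0)
        adj = adj + torch.eye(n, device=adj.device)  # self-loops keep softmax defined
        # Pairwise attention logits e_ij = LeakyReLU(a([z_i || z_j])).
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1), negative_slope=0.2)
        e = e.masked_fill(adj == 0, float("-inf"))  # attend only along edges
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ z)  # attention-weighted neighbour aggregation


class FusionTagger(nn.Module):
    """Fuses token and visual nodes with a GAT, then tags tokens via a CRF."""

    def __init__(self, dim: int, num_tags: int):
        super().__init__()
        self.gat = GATLayer(dim)
        self.emit = nn.Linear(dim, num_tags)        # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)  # structured decoding layer

    def decode(self, text_feats, image_feats, adj):
        # Joint graph: first T nodes are tokens, the rest are visual nodes.
        t = text_feats.size(0)
        h = torch.cat([text_feats, image_feats], dim=0)
        fused = self.gat(h, adj)
        emissions = self.emit(fused[:t]).unsqueeze(0)  # (1, T, num_tags)
        return self.crf.decode(emissions)[0]           # Viterbi-best tag sequence


# Toy usage: 6 tokens, 4 SVG-derived visual nodes, 9 BIO tags (all hypothetical).
T, V, DIM, NUM_TAGS = 6, 4, 32, 9
model = FusionTagger(DIM, NUM_TAGS)
adj = torch.ones(T + V, T + V)  # fully connected toy graph
print(model.decode(torch.randn(T, DIM), torch.randn(V, DIM), adj))
```

In the paper's setting, the token features would come from a text encoder, the visual nodes from the SVG structure, and the adjacency from the Full Text Graph and Multimodal Interaction Graph rather than the fully connected toy graph used here.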