Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

Text-to-Image Segmentation with Open-Vocabulary and Multitasking

Version 1 : Received: 8 April 2024 / Approved: 9 April 2024 / Online: 9 April 2024 (11:43:57 CEST)

How to cite: Pan, L.; Yang, Y.; Wang, Z.; Zhang, R. Text-to-Image Segmentation with Open-Vocabulary and Multitasking. Preprints 2024, 2024040631. https://doi.org/10.20944/preprints202404.0631.v1

Abstract

Open-vocabulary learning has recently gained prominence as a means to enable image segmentation of arbitrary categories based on textual descriptions. This advancement has extended the applicability of segmentation systems to a broader range of general-purpose scenarios. However, current methods often revolve around specialized architectures and parameters tailored to specific segmentation tasks, resulting in a fragmented landscape of segmentation models. In response to these challenges, we introduce OVAMTSeg, a versatile framework for Open-Vocabulary and Multitask Image Segmentation. OVAMTSeg harnesses adaptive prompt learning to enable the model to capture category-sensitive concepts, enhancing its robustness across diverse tasks and scenarios. Text prompts are employed to capture the semantic and contextual features of the text, while cross-attention and cross-modal interactions fuse image and text features. Furthermore, a transformer-based decoder is incorporated for dense prediction. Extensive experimental results underscore the effectiveness of OVAMTSeg, showing state-of-the-art performance and superior generalization across three segmentation tasks. Notable results include 47.5 mIoU in referring expression segmentation; 51.6 mIoU on Pascal-VOC with four unseen classes and 46.6 mIoU on Pascal-Context in zero-shot segmentation; and 65.9 mIoU on Pascal-5i and 35.7 mIoU on COCO-20i in one-shot segmentation.
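The cross-modal fusion the abstract describes, in which image features attend to text features via cross-attention, can be sketched as a minimal single-head scaled dot-product attention in NumPy. This is an illustrative sketch under assumed shapes (patch and token embeddings of equal dimension), not the authors' implementation; all names and dimensions here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens):
    """Fuse text features into image features (illustrative sketch).

    Queries come from image patch embeddings, keys/values from text
    token embeddings, so each image patch aggregates text context.
    """
    d_k = text_tokens.shape[-1]
    scores = image_tokens @ text_tokens.T / np.sqrt(d_k)  # (P, T)
    weights = softmax(scores, axis=-1)                    # rows sum to 1
    return weights @ text_tokens                          # (P, d)

# toy example: 4 image patches, 3 text tokens, embedding dim 8
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
txt = rng.standard_normal((3, 8))
fused = cross_attention(img, txt)
print(fused.shape)  # (4, 8): one text-conditioned feature per patch
```

In a full model, the fused features would pass through the transformer-based decoder for dense per-pixel prediction; here the output is simply one text-conditioned vector per image patch.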

Keywords

image segmentation; open vocabulary; multitask; multi-modal interaction

Subject

Computer Science and Mathematics, Computer Vision and Graphics
