Submitted: 08 April 2024
Posted: 09 April 2024
Abstract
Keywords:
1. Introduction
- We introduce OVAMTSeg, a universal open-vocabulary framework that efficiently segments images from arbitrary text or image prompts and handles zero-shot, one-shot, and referring expression segmentation within a single model.
- Adaptive prompt learning lets OVAMTSeg encode category-specific information into a compact textual abstraction, improving generalization to diverse textual descriptions. We further extend the text encoder and introduce a multimodal interaction module to strengthen cross-modal fusion (a minimal sketch of both ideas follows this list).
- We demonstrate the model's efficiency and effectiveness through comprehensive evaluations on standard benchmarks for all three tasks. Extensive experiments show that it outperforms prior methods by a clear margin, making it a strong candidate for multitask deployment.
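To make the prompt-learning and fusion bullets concrete, below is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the paper's implementation: `AdaptivePromptLearner`, `FiLMFusion`, and all shapes and hyperparameters are hypothetical names chosen for the example. The prompt learner follows the general recipe of learnable context tokens shifted by an input-conditioned meta-network, and the interaction module uses FiLM-style feature-wise modulation in the spirit of Dumoulin et al. (cited in the references); OVAMTSeg's actual modules are described in Sections 3.1–3.3.

```python
import torch
import torch.nn as nn


class AdaptivePromptLearner(nn.Module):
    """Hypothetical sketch of adaptive prompt learning: a small set of
    learnable context tokens is shifted by an input-conditioned
    meta-network, folding category-specific information into a compact
    textual abstraction. Names and shapes are assumptions."""

    def __init__(self, n_tokens: int = 4, dim: int = 512):
        super().__init__()
        # Learnable context tokens shared across categories.
        self.context = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        # Lightweight meta-network that adapts the context per input.
        self.meta = nn.Sequential(
            nn.Linear(dim, dim // 16), nn.ReLU(), nn.Linear(dim // 16, dim)
        )

    def forward(self, class_emb: torch.Tensor) -> torch.Tensor:
        # class_emb: (B, dim) frozen text embedding of the class name/phrase.
        shift = self.meta(class_emb).unsqueeze(1)        # (B, 1, dim)
        ctx = self.context.unsqueeze(0) + shift          # (B, n_tokens, dim)
        # Prepend the adapted context to form the prompt sequence.
        return torch.cat([ctx, class_emb.unsqueeze(1)], dim=1)


class FiLMFusion(nn.Module):
    """FiLM-style multimodal interaction: the pooled prompt vector
    scales (gamma) and shifts (beta) the visual tokens before decoding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.to_gamma = nn.Linear(dim, dim)
        self.to_beta = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, dim) patch tokens; cond: (B, dim) pooled prompt.
        gamma = self.to_gamma(cond).unsqueeze(1)         # (B, 1, dim)
        beta = self.to_beta(cond).unsqueeze(1)           # (B, 1, dim)
        return gamma * visual + beta


# Usage with assumed CLIP-like shapes (2 prompts, 14x14 ViT patch grid).
prompter, fusion = AdaptivePromptLearner(), FiLMFusion()
text = torch.randn(2, 512)                 # frozen text-encoder output
patches = torch.randn(2, 196, 512)         # visual patch tokens
tokens = prompter(text)                    # (2, 5, 512) prompt sequence
fused = fusion(patches, tokens.mean(dim=1))  # (2, 196, 512) modulated tokens
```

In a full model, the prompt sequence would be re-encoded by the (extended) text encoder, and the fused visual features decoded into a segmentation mask; this sketch only shows the conditioning pathway.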
2. Related Work
2.1. Open Vocabulary Segmentation
2.2. Multitask Image Segmentation Architecture
2.3. Prompt Learning
3. Methods
3.1. Adaptive Prompt Learning
3.2. Feature Extraction Dual-Encoder
3.3. Multimodal Interaction Module
4. Experimental Results
4.1. Experimental Settings
4.2. Datasets
4.3. Evaluation Metrics
4.4. Comparison to State-of-the-Art Methods
4.5. Ablation Study
4.6. Qualitative Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019; pp. 5693–5703.
- Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 40, 834–848.
- Bucher, M.; Vu, T.; Cord, M.; Pérez, P. Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 2019, 32.
- Radford, A.; Kim, J.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; Sutskever, I. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning (PMLR), July 2021; pp. 8748–8763.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. Proceedings of the 38th International Conference on Machine Learning (PMLR), July 2021; pp. 4904–4916.
- Li, P.; Wei, Y.; Yang, Y. Consistent structural relation learning for zero-shot segmentation. Advances in Neural Information Processing Systems 2020, 33, 10317–10327.
- Xian, Y.; Choudhury, S.; He, Y.; Schiele, B.; Akata, Z. Semantic projection network for zero- and few-label semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019; pp. 8256–8265.
- Lüddecke, T.; Ecker, A. Image segmentation using text and image prompts. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; pp. 7086–7096.
- Li, J.; Qi, Q.; Wang, J.; Ge, C.; Li, Y.; Yue, Z.; Sun, H. OICSR: Out-in-channel sparsity regularization for compact deep neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019; pp. 7046–7055.
- Wu, J.; Li, G.; Liu, S.; Lin, L. Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. Proceedings of the AAAI Conference on Artificial Intelligence, April 2020; pp. 12386–12393.
- Wu, J.; Chen, T.; Wu, H.; Yang, Z.; Luo, G.; Lin, L. Fine-grained image captioning with global-local discriminative objective. IEEE Transactions on Multimedia 2020, 23, 2413–2427.
- Xia, X.; Li, J.; Wu, J.; Wang, X.; Xiao, X.; Zheng, M.; Wang, R. TRT-ViT: TensorRT-oriented vision transformer. arXiv 2022, arXiv:2205.09579.
- Xiao, X.; Yang, Y.; Ahmad, T.; Jin, L.; Chang, T. Design of a very compact CNN classifier for online handwritten Chinese character recognition using DropWeight and global pooling. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), November 2017; pp. 891–895.
- Ren, Y.; Wu, J.; Xiao, X.; Yang, J. Online multi-granularity distillation for GAN compression. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 6793–6803.
- Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems 2021, 34, 17864–17875.
- Qin, J.; Wu, J.; Li, M.; Zheng, M.; Wang, X. Multi-granularity distillation scheme towards lightweight semi-supervised semantic segmentation. European Conference on Computer Vision, October 2022; pp. 481–498.
- Qin, J.; Wu, J.; Xiao, X.; Li, L.; Wang, X. Activation modulation and recalibration scheme for weakly supervised semantic segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, June 2022; pp. 2117–2125.
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for semantic segmentation in street scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018; pp. 3684–3692.
- Li, B.; Weinberger, K.; Belongie, S.; Koltun, V.; Ranftl, R. Language-driven semantic segmentation. arXiv 2022, arXiv:2201.03546.
- Hu, E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685.
- Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.; Lester, B.; Du, N.; Dai, A.; Le, Q. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652.
- Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691.
- Zhou, K.; Yang, J.; Loy, C.; Liu, Z. Learning to prompt for vision-language models. International Journal of Computer Vision 2022, 130, 2337–2348.
- Khattak, M.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F. MaPLe: Multi-modal prompt learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023; pp. 19113–19122.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, October 5–9, 2015; pp. 234–241.
- Dumoulin, V.; Perez, E.; Schucher, N.; Strub, F.; Vries, H.; Courville, A.; Bengio, Y. Feature-wise transformations. Distill 2018, 3, e11.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. NIPS Autodiff Workshop, 2017.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. European Conference on Computer Vision, 2020; pp. 213–229.
- Everingham, M.; Winn, J. The PASCAL visual object classes challenge 2012 (VOC2012) development kit. Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep., 2012.
- Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.; Lee, S.; Fidler, S.; Urtasun, R.; Yuille, A. The role of context for object detection and semantic segmentation in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014; pp. 891–898.
- Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C. Microsoft COCO: Common objects in context. Computer Vision – ECCV 2014, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V; pp. 740–755.
- Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR: Modulated detection for end-to-end multi-modal understanding. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 1780–1790.
- Wu, C.; Lin, Z.; Cohen, S.; Bui, T.; Maji, S. PhraseCut: Language-based image segmentation in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp. 10216–10225.
- Gu, Z.; Zhou, S.; Niu, L.; Zhao, Z.; Zhang, L. Context-aware feature generation for zero-shot semantic segmentation. Proceedings of the 28th ACM International Conference on Multimedia, 2020; pp. 1921–1929.
- Zhang, H.; Ding, H. Prototypical matching and open set rejection for zero-shot semantic segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 6974–6983.
- Baek, D.; Oh, Y.; Ham, B. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 9536–9545.
- Liu, Y.; Zhang, X.; Zhang, S.; He, X. Part-aware prototype network for few-shot semantic segmentation. Computer Vision – ECCV 2020, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX; pp. 142–158.
- Boudiaf, M.; Kervadec, H.; Masud, Z.; Piantanida, P.; Ayed, I.; Dolz, J. Few-shot segmentation without meta-learning: A good transductive inference is all you need? Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021; pp. 13979–13988.
- Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020, 44, 1050–1065.
- Min, J.; Kang, D.; Cho, M. Hypercorrelation squeeze for few-shot segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 6941–6952.
- Chang, Z.; Lu, Y.; Wang, X.; Ran, X. MGNet: Mutual-guidance network for few-shot semantic segmentation. Engineering Applications of Artificial Intelligence 2022, 116, 105431.
- Fang, Z.; Gao, G.; Zhang, Z.; Zhang, A. Hierarchical context-agnostic network with contrastive feature diversity for one-shot semantic segmentation. Journal of Visual Communication and Image Representation 2023, 90, 103754.
- Tang, M.; Zhu, L.; Xu, Y.; Zhao, M. Dual-stream reinforcement network for few-shot image segmentation. Digital Signal Processing 2023, 134, 103911.
- Ding, H.; Zhang, H.; Jiang, X. Self-regularized prototypical network for few-shot semantic segmentation. Pattern Recognition 2023, 133, 109018.


Referring expression segmentation:

| Model | mIoU | IoU_FG | AP |
|---|---|---|---|
| MDETR [34] | 53.7 | - | - |
| HulaNet [35] | 41.3 | 50.8 | - |
| Mask-RCNN top [35] | 39.4 | 47.4 | - |
| RMI [35] | 21.1 | 42.5 | - |
| CLIPSeg [8] | 43.4 | 54.7 | 76.7 |
| Ours | 47.5 | 57.1 | 80.4 |

Zero-shot segmentation (mIoU on seen and unseen classes):

| Model | Pre-training | mIoU (seen) | mIoU (unseen) |
|---|---|---|---|
| SPNet [7] | IN | 67.3 | 21.8 |
| ZS3Net [3] | IN-seen | 66.4 | 23.2 |
| CSRL [6] | IN-seen | 69.8 | 31.7 |
| CaGNet [36] | IN | 69.5 | 40.2 |
| OSR [37] | IN-seen | 75.0 | 44.1 |
| JoEm [38] | IN-seen | 67.0 | 33.4 |
| CLIPSeg [8] | CLIP | 20.8 | 47.3 |
| Ours | CLIP | 28.3 | 51.6 |

Zero-shot segmentation on a second benchmark:

| Model | Pre-training | mIoU (seen) | mIoU (unseen) |
|---|---|---|---|
| SPNet [7] | IN | 36.3 | 18.1 |
| ZS3Net [3] | IN-seen | 37.2 | 24.9 |
| CSRL [6] | IN-seen | 39.8 | 23.9 |
| CaGNet [36] | IN | 24.8 | 18.5 |
| OSR [37] | IN-seen | 41.1 | 43.1 |
| JoEm [38] | IN-seen | 36.9 | 30.7 |
| CLIPSeg [8] | CLIP | 16.8 | 40.2 |
| Ours | CLIP | 25.3 | 46.6 |

One-shot segmentation:

| Model | Visual backbone | mIoU | IoU_BIN | AP |
|---|---|---|---|---|
| PPNet [39] | RN50 | 52.8 | 69.2 | - |
| RePRI [40] | RN50 | 59.1 | - | - |
| PFENet [41] | RN50 | 60.8 | 73.3 | - |
| HSNet [42] | RN50 | 64.0 | 76.7 | - |
| MGNet [43] | RN50 | 52.1 | 68.2 | - |
| HCNet [44] | RN50 | 62.1 | 71.7 | - |
| DRNet [45] | RN50 | 53.3 | 72.8 | - |
| SRPNet [46] | RN50 | 61.5 | - | - |
| CLIPSeg [8] | ViT(CLIP) | 59.5 | 75.0 | 82.3 |
| Ours | ViT(CLIP) | 65.9 | 77.1 | 86.8 |

One-shot segmentation on a second benchmark:

| Model | Visual backbone | mIoU | IoU_BIN | AP |
|---|---|---|---|---|
| PPNet [39] | RN50 | 29.0 | - | - |
| RePRI [40] | RN50 | 34.0 | - | - |
| PFENet [41] | RN50 | 35.8 | - | - |
| HSNet [42] | RN50 | 39.2 | 68.2 | - |
| MGNet [43] | RN50 | 34.9 | 63.9 | - |
| HCNet [44] | RN50 | 40.7 | 63.4 | - |
| DRNet [45] | RN50 | 36.5 | 60.9 | - |
| CLIPSeg [8] | ViT(CLIP) | 33.2 | 58.4 | 40.5 |
| Ours | ViT(CLIP) | 35.7 | 62.5 | 46.3 |

One-shot segmentation on a third benchmark:

| Model | Visual backbone | mIoU | IoU_BIN | AP |
|---|---|---|---|---|
| PFENet [41] | VGG16 | 54.2 | - | - |
| LSeg [21] | ViT(CLIP) | 52.3 | 67.0 | - |
| CLIPSeg [8] | ViT(CLIP) | 72.4 | 83.1 | 93.5 |
| Ours | ViT(CLIP) | 78.5 | 87.3 | 93.8 |

Ablation on prompt design, task paradigm, and text encoder (embedding dimension in parentheses; Ref. = referring expression, ZS = zero-shot, OS = one-shot):

| Component | Setting | Ref. mIoU | Ref. AP | ZS mIoU (seen) | ZS mIoU (unseen) | OS mIoU | OS AP |
|---|---|---|---|---|---|---|---|
| Prompt | Fixed | 45.4 | 78.1 | 57.8 | 49.0 | 63.3 | 85.4 |
| Prompt | Adaptive | 47.5 | 80.4 | 28.3 | 51.6 | 65.9 | 86.6 |
| Task paradigm | Single-task | 41.3 | - | 75.0 | 44.1 | 64.0 | - |
| Task paradigm | Multi-task | 47.5 | 80.4 | 28.3 | 51.6 | 65.9 | 86.6 |
| Text encoder | ViT-B/32 (512) | 43.6 | 79.0 | 25.5 | 49.2 | 63.5 | 85.2 |
| Text encoder | ViT-B/16 (512) | 44.4 | 79.3 | 26.0 | 49.8 | 64.1 | 85.4 |
| Text encoder | RN50×4 (640) | 44.8 | 79.4 | 26.2 | 50.1 | 64.5 | 85.6 |
| Text encoder | RN50×16 (768) | 46.2 | 79.8 | 27.4 | 51.0 | 65.6 | 85.9 |

Component ablation (✓ = enabled, ✗ = disabled):

| Adaptive prompt | Text encoder extension | Multimodal interaction | Ref. mIoU | Ref. AP | ZS mIoU (seen) | ZS mIoU (unseen) | OS mIoU | OS AP |
|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 43.6 | 72.8 | 23.1 | 25.3 | 60.2 | 78.1 |
| ✓ | ✗ | ✗ | 45.9 | 77.6 | 25.9 | 48.8 | 63.7 | 83.5 |
| ✓ | ✓ | ✗ | 46.6 | 78.3 | 26.8 | 49.9 | 64.6 | 84.3 |
| ✓ | ✓ | ✓ | 47.5 | 80.4 | 28.3 | 51.6 | 65.9 | 86.6 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).