Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Leveraging Self-Distillation and Disentanglement Network to Enhance Visual-Semantic Feature Consistency in Generalized Zero-Shot Learning

Version 1 : Received: 9 April 2024 / Approved: 9 April 2024 / Online: 10 April 2024 (07:53:22 CEST)

How to cite: Liu, X.; Wang, C.; Yang, G.; Wang, C.H.; Long, Y.; Liu, J.; Zhang, Z.Y. Leveraging Self-Distillation and Disentanglement Network to Enhance Visual-Semantic Feature Consistency in Generalized Zero-Shot Learning. Preprints 2024, 2024040672. https://doi.org/10.20944/preprints202404.0672.v1 Liu, X.; Wang, C.; Yang, G.; Wang, C.H.; Long, Y.; Liu, J.; Zhang, Z.Y. Leveraging Self-Distillation and Disentanglement Network to Enhance Visual-Semantic Feature Consistency in Generalized Zero-Shot Learning. Preprints 2024, 2024040672. https://doi.org/10.20944/preprints202404.0672.v1

Abstract

Generalized zero-shot learning (GZSL) aims to simultaneously recognize both seen classes and unseen classes by training only on seen class samples and auxiliary semantic descriptions. Recent state-of-the-art methods infer unseen classes based on semantic information or synthesize unseen classes using generative models based on semantic information, all of which rely on the correct alignment of visual-semantic features. However, they often overlook the inconsistency between original visual features and semantic attributes. Additionally, due to the existence of cross-modal dataset biases, the visual features extracted and synthesized by the model may also mismatch with some semantic features, which could hinder the model from properly aligning visual-semantic features. To address this issue, this paper proposes a GZSL framework that enhances the consistency of visual-semantic features using self-distillation and disentanglement network (SDDN). The aim is to utilize self-distillation and disentanglement network to obtain semantically consistent refined visual features and non-redundant semantic features to enhance the consistency of visual-semantic features. Firstly, SDDN utilizes self-distillation technology to refine the extracted and synthesized visual features of the model. Subsequently, the visual-semantic features are then disentangled and aligned using a disentanglement network to enhance the consistency of the visual-semantic features. Finally, the consistent visual-semantic features are fused to jointly train a GZSL classifier. Extensive experiments demonstrate that the proposed method achieves more competitive results on four challenging benchmark datasets (AWA2, CUB, FLO, and SUN).

Keywords

generalized zero-shot Learning; self-distillation, disentanglement network, visual-semantic feature consistency

Subject

Computer Science and Mathematics, Computer Vision and Graphics

Comments (0)

We encourage comments and feedback from a broad range of readers. See criteria for comments and our Diversity statement.

Leave a public comment
Send a private comment to the author(s)
* All users must log in before leaving a comment
Views 0
Downloads 0
Comments 0
Metrics 0


×
Alerts
Notify me about updates to this article or when a peer-reviewed version is published.
We use cookies on our website to ensure you get the best experience.
Read more about our cookies here.