Preprint Article, Version 1 (not peer-reviewed; preserved in Portico)

Multi-view Masked Autoencoder for General Image Representation

Version 1: Received: 4 October 2023 / Approved: 5 October 2023 / Online: 9 October 2023 (12:52:30 CEST)

A peer-reviewed article of this Preprint also exists.

Ji, S.; Han, S.; Rhee, J. Multi-View Masked Autoencoder for General Image Representation. Appl. Sci. 2023, 13, 12413.

Abstract

Self-supervised learning learns general representations from unlabeled data. Masked image modeling (MIM), a generative self-supervised learning method, has drawn attention for its state-of-the-art performance on various downstream tasks, yet its token-level approach yields poor linear separability. In this paper, we propose a contrastive learning-based multi-view masked autoencoder for MIM that takes an image-level approach, learning common features from two differently augmented views. The contrastive loss strengthens MIM by teaching the model long-range global patterns. Our framework adopts a simple encoder-decoder architecture and learns a rich, general representation through a simple process: 1) two views are generated from an input image with random masking, and a contrastive loss captures the semantic distance between the representations produced by the encoder; a high mask ratio (80%) acts as strong augmentation and alleviates the representation collapse problem; 2) with a reconstruction loss, the decoder learns to reconstruct the original image from the masked image. We evaluate our framework on benchmark datasets for image classification, object detection, and semantic segmentation. We achieve 84.3% fine-tuning accuracy and 76.7% linear-probing accuracy on ImageNet-1K classification, exceeding previous studies, and show promising results on the other downstream tasks. The experimental results demonstrate that applying a contrastive loss to masked image modeling yields rich and general image representations.
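To make the two-loss training process concrete, here is a minimal PyTorch-style sketch of one training step combining the contrastive and reconstruction objectives described in the abstract. The `encoder`, `decoder`, `lam` weight, temperature, and mean-pooling of tokens into an image-level embedding are illustrative assumptions, not the paper's exact implementation; the sketch uses a symmetric InfoNCE loss between the pooled features of two masked views plus an MSE reconstruction loss, following the high-level description above.

```python
import torch
import torch.nn.functional as F

def random_mask(patches, mask_ratio=0.8):
    """Keep a random subset of patch tokens; the high ratio (80%)
    acts as strong augmentation, per the abstract."""
    B, N, D = patches.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    noise = torch.rand(B, N, device=patches.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]            # (B, n_keep)
    kept = torch.gather(patches, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE between image-level embeddings of two views;
    matched views in a batch are positives, all others negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(encoder, decoder, view1, view2, target_patches, lam=1.0):
    """encoder/decoder are hypothetical ViT-style modules (assumption):
    encoder maps kept tokens to features; decoder reconstructs all
    patches given the visible features and their indices."""
    kept1, idx1 = random_mask(view1)
    kept2, _ = random_mask(view2)
    feat1 = encoder(kept1)                                 # (B, n_keep, D)
    feat2 = encoder(kept2)
    # Contrastive loss on mean-pooled, image-level representations.
    loss_con = info_nce(feat1.mean(dim=1), feat2.mean(dim=1))
    # Reconstruction loss on one view, MAE-style.
    pred = decoder(feat1, idx1)
    loss_rec = F.mse_loss(pred, target_patches)
    return loss_rec + lam * loss_con
```

In this sketch the decoder receives the kept-token indices so that, as in MAE-style designs, learnable mask tokens can be placed at the dropped positions before reconstruction; how the two losses are weighted (`lam`) is likewise an assumption.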

Keywords

Deep learning; image representation learning; self-supervised learning; masked image modeling; contrastive learning

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
