A Brief History of the Masked Image Modeling Family

14 min read · Oct 15, 2023

Hi everyone, this is my very first post on Medium. 😘 I have been reading Medium a lot lately, which made me curious about contributing some of my own knowledge to this marvellous community. Minor mistakes are inevitable in a first attempt, so please let me know if there is anything I can do to improve the quality of my posts. 😉

Let's jump to the main parts.

Introduction

Self-supervised Learning (SSL) is a hot topic in Deep Learning and Artificial Intelligence research. It is a very promising path toward revolutionizing Machine Learning because:

SSL does not depend on human labelling effort, which costs a considerable amount of time and money when building a curated dataset.
SSL can learn from unlabeled datasets, which is practical in the real world. Every day the amount of information increases, while the human effort available for collecting, cleaning, processing and storing data does not keep up. Moreover, every organization has its own criteria for its datasets, causing many mismatches and troubles in Machine Learning development.

Thanks to Meta’s research [1], diverse SSL approaches and methods have been brought together under a single viewpoint. In this story, I focus on explaining my research into and understanding of one family of self-supervised methods: Masked Image Modeling.

There is a summary table at the end of this post, in case you are in a rush and do not have time to read the whole post.

Masked Image Modelling

A little black girl wearing a face mask. — Source of image: https://www.unicef.org/sudan/everything-you-need-know-about-face-masks

When we interact with each other, we can recognize our friends even when they are wearing face masks. According to some experts, our ancestors had an evolutionary advantage that helped them tell a friend from a foe and decide whom to approach and whom to avoid. A face with fewer visible features therefore remains recognizable, thanks to our strong representation of the full familiar face.

This human ability to reconstruct the missing information of a familiar face from a masked image is a helpful way to understand Masked Image Modeling.

In Machine Learning, Masked Image Modeling is a computer vision technique that involves predicting the missing pixels in an image by using the surrounding pixels as context. This technique is often used in image inpainting, where missing or damaged parts of an image are filled in using information from the surrounding areas.

The first attempt at this approach dates back to 2016.

2016: Context Encoder

The first effort to apply this approach to images was presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), in the paper Context Encoders: Feature Learning by Inpainting by Pathak et al.

The authors present an unsupervised visual feature learning algorithm based on the idea of learning from context.

Overview of the model

An overview architecture of the Context Encoder

The pipeline works as follows:

Large portions of an image are replaced with white.
The data (images with missing parts) are sent to an Encoder network, which captures the context in a latent feature representation.
A Decoder network then uses that representation to reconstruct the missing content of the image.
The Encoder and Decoder are connected by a channel-wise fully-connected layer, which allows each unit in the decoder to reason about the entire image content.
The loss function is computed from the original content and the content produced by the model.

They call their model the Context Encoder for a few reasons:

It is similar to the autoencoder architecture, with some modifications: a plain autoencoder simply compresses the original content into a feature representation, without necessarily capturing the semantic content of the image.
It shares the spirit of Word2Vec, which learns word representations in natural language by predicting a word given its context.
Unlike a plain autoencoder, the Context Encoder is trained to minimize both a reconstruction (L2) loss and an adversarial loss.

During training, the L2 loss captures the overall structure of the missing region in relation to the context, while the Adversarial … As shown in the image above, using only the L2 loss produces a blurry output image, while adding the adversarial loss gives a much better result.
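
To make the two loss terms more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function names, the toy 2×2 "image", the weights lambda_rec and lambda_adv, and the generator-side form of the adversarial term. The discriminator score is just a number standing in for the output of a real network.

```python
import math

def masked_l2_loss(original, predicted, mask):
    """Mean squared error computed only over masked pixels (mask == 1)."""
    total, count = 0.0, 0
    for row_o, row_p, row_m in zip(original, predicted, mask):
        for o, p, m in zip(row_o, row_p, row_m):
            if m:
                total += (o - p) ** 2
                count += 1
    return total / count

def adversarial_loss(disc_score_fake):
    """Generator-side adversarial term (assumed form): small when the
    discriminator scores the inpainted region as real (score near 1)."""
    return -math.log(disc_score_fake)

def joint_loss(original, predicted, mask, disc_score_fake,
               lambda_rec=0.999, lambda_adv=0.001):
    """Weighted sum of both terms; the weights are illustrative assumptions."""
    rec = masked_l2_loss(original, predicted, mask)
    adv = adversarial_loss(disc_score_fake)
    return lambda_rec * rec + lambda_adv * adv

original  = [[0.0, 1.0], [1.0, 0.0]]
predicted = [[0.0, 0.5], [0.5, 0.0]]
mask      = [[0, 1], [1, 0]]          # 1 marks a missing pixel
print(masked_l2_loss(original, predicted, mask))  # 0.25
print(joint_loss(original, predicted, mask, disc_score_fake=0.9))
```

The reconstruction term only looks at the masked pixels, so the model gets no credit for copying the visible context.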

The region masking methods

They use 3 methods of masking:

Central region
Random block
Random region
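
As a rough illustration of the first two strategies, the sketch below builds binary masks in plain Python. The function names, the block sizes and the 8×8 grid are my own assumptions, not values from the paper, and the third strategy (random regions with arbitrary shapes) is left out because it needs externally provided shapes.

```python
import random

def central_mask(height, width, frac=0.5):
    """1 marks a masked pixel. Masks a centered block covering `frac`
    of each side (so frac**2 of the area)."""
    mh, mw = int(height * frac), int(width * frac)
    top, left = (height - mh) // 2, (width - mw) // 2
    return [[1 if top <= r < top + mh and left <= c < left + mw else 0
             for c in range(width)] for r in range(height)]

def random_block_mask(height, width, n_blocks=3, block=2, seed=0):
    """Masks several small square blocks at random positions (may overlap)."""
    rng = random.Random(seed)
    mask = [[0] * width for _ in range(height)]
    for _ in range(n_blocks):
        top = rng.randrange(height - block + 1)
        left = rng.randrange(width - block + 1)
        for r in range(top, top + block):
            for c in range(left, left + block):
                mask[r][c] = 1
    return mask

# Print the central mask as a small picture: '#' is masked, '.' is visible.
for row in central_mask(8, 8):
    print("".join("#" if m else "." for m in row))
```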

The result on pre-text task

As defined by the authors, the pretext task of this model is inpainting the missing region.

The qualitative results in the image above show that the model generally does well at inpainting the semantic regions of the images.

Results on downstream tasks

After being trained on the pretext task of inpainting the missing regions, the model is fine-tuned on three standard downstream tasks: object classification, object detection and semantic segmentation.

They use the test set of the PASCAL Visual Object Classes Challenge 2007 (PASCAL VOC 2007), a standard benchmark for visual object category recognition and detection that contains 20 classes.

They evaluate the pre-trained model on each task as follows:

Object Classification: A standard AlexNet classifier is fine-tuned starting from their trained Context Encoder.
Object Detection: The Fast R-CNN framework is used, with the ImageNet pre-trained network replaced by the authors’ Context Encoder.
Semantic Segmentation: The last evaluation explores how the Context Encoder can be used for pixel-wise segmentation. A Fully Convolutional Network (FCN) is applied, with the classification pre-trained network replaced by the Context Encoder.

The results are encouraging: the Context Encoder outperforms a randomly initialized network and a plain autoencoder.

Conclusion

Although the model performs well on the pretext task (image inpainting), it does not achieve competitive results against supervised pre-training on downstream tasks such as classification, detection and segmentation.
A considerable contribution of this paper is that it opens a promising path toward using unsupervised learning to alleviate the chores of Computer Vision (collecting, storing, labelling data, etc.).

Although it is neither related to the family of Masked Language Modeling nor the family of Computer Vision, ….

2019: BERT — Masked Language Modeling SSL task

In 2019, a turning point came in the world of natural language processing (NLP): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2], presented by Google AI researchers.

The main idea of BERT is to replace some of the text tokens fed to a Transformer language model with a learnable mask token and to teach the model to recover the original text. This is called “masked language modeling” (MLM), and it shares the spirit of this post’s topic: the input is degraded by masking, and the model learns to undo the degradation. To this day, MLM remains a popular SSL objective for large language models.
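
Here is a minimal sketch of the masking step only, in plain Python. The function name, the example sentence and the fixed seed are my own choices, and I assume a 15% masking ratio. This is a simplification: it always substitutes the mask token and does not include the model that would predict the hidden words.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Return (corrupted tokens, {position: original token}).
    A model would be trained to predict the dict values from the
    corrupted sequence."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK
    return corrupted, targets

sentence = "we can recognize our friends despite them wearing face masks".split()
corrupted, targets = mask_tokens(sentence)
print(corrupted)  # the sentence with some words replaced by [MASK]
print(targets)    # what the model has to recover, keyed by position
```

The training signal comes only from the masked positions, which is exactly the "undo the degradation" idea that the image methods below borrow.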

Do you wonder why I mention BERT here even though it does not address any Computer Vision problem?

The reason is kinda simple: BERT inspired the research community to use the Transformer architecture to explore novel self-supervised methods for learning from unlabeled images.

2021: the BERT pre-training strategy for the vision transformer architecture

First introduced by Dosovitskiy et al. in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [3], the Vision Transformer (ViT) is a modified version of the Transformer architecture applied to patches of an image.

The code is available on GitHub.

Overview of the model

An image is split into fixed-size patches.
Each patch is linearly embedded and treated the same way as a token (word) in NLP applications.
Position embeddings are added to the patch embeddings to retain positional information.
An extra learnable “classification token” is added to the sequence.
The sequence is fed to a standard Transformer encoder.
For the self-supervised variant, some patches are corrupted and the model learns to predict a property of each corrupted patch (in the paper, its mean color) from the patch representation.
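
The input preparation steps above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the helper names are hypothetical, the "image" is a 4×4 grid of numbers, and the learned linear projection of each patch is replaced by the identity to keep the example short.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors,
    in row-major patch order. H and W must be divisible by `patch`."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def build_sequence(patches, cls_token, pos_embed):
    """Prepend the classification token, then add position embeddings
    element-wise. A real ViT first applies a learned linear projection
    to each patch; it is omitted here (identity)."""
    tokens = [cls_token] + patches
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, pos_embed)]

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
patches = patchify(image, patch=2)                          # 4 patches of 4 values
cls_token = [0.0] * 4
pos_embed = [[0.1 * i] * 4 for i in range(len(patches) + 1)]
sequence = build_sequence(patches, cls_token, pos_embed)
print(len(sequence), len(sequence[0]))  # 5 4
```

The resulting sequence (one classification token plus one vector per patch) is what would be fed to the Transformer encoder.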

The result

The following observation is quoted from the paper:

We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training.

Conclusion

Despite being very promising, this preliminary method still trailed supervised pre-training and needed further improvement.

2021: BEiT: (BERT Pre-Training of Image Transformers)

In the same year, a group of Microsoft researchers pointed out why BERT-style pre-training cannot be applied directly to image data:

Unlike a language vocabulary, where words are well defined, there is no clear-cut vocabulary for the vision Transformer’s input (image patches). It is therefore hard to use a softmax classifier to predict over all possible candidates for a masked patch.
Treating the task as a regression problem, that is, predicting the raw pixels of masked patches, is not a good option either, because it tends to waste modeling capacity on short-range dependencies and high-frequency details.
Therefore, …

BEiT, which stands for Bidirectional Encoder representation from Image Transformers, was presented by Bao et al. in 2021 [4]. It is a BERT-style self-supervised visual representation learning strategy that, instead of regressing raw pixels, trains the model to predict discrete visual tokens for masked patches.

BEiT uses a discrete variational autoencoder to encode an image into discrete visual tokens. A Transformer is then pre-trained to predict the visual tokens that correspond to the masked patches.

The code is available on GitHub: https://github.com/microsoft/unilm/tree/master/beit

Overview of BEiT

Inspired by BERT [2], they introduce a Masked Image Modeling task, shown in the image above.

For each image, two views are generated: image patches and visual tokens.
The image is tokenized into discrete visual tokens, which are obtained from the latent codes of a discrete variational autoencoder.
The image is also split into patches, a proportion of which are randomly masked. The patches are linearly embedded, and position embeddings are added.
The resulting embeddings are fed into the Transformer backbone network.
The network learns to recover the visual tokens of the original image at the masked positions.
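
Because the targets are discrete tokens, the objective becomes a classification loss over the token vocabulary at the masked positions. The sketch below shows that loss in plain Python. The function names, the tiny vocabulary of 3 tokens and the hand-written logits are assumptions for illustration; in the real setup the logits would come from the Transformer backbone and the token ids from the image tokenizer.

```python
import math

def softmax(logits):
    peak = max(logits)                      # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_token_loss(logits_per_patch, visual_tokens, masked_positions):
    """Average cross-entropy between predicted token distributions and the
    tokenizer's visual tokens, computed only at masked patch positions."""
    loss = 0.0
    for pos in masked_positions:
        probs = softmax(logits_per_patch[pos])
        loss += -math.log(probs[visual_tokens[pos]])
    return loss / len(masked_positions)

# Toy setup: 4 patches, vocabulary of 3 visual tokens (hypothetical sizes).
visual_tokens = [2, 0, 1, 1]          # ids from the image tokenizer
masked_positions = [1, 3]             # patches hidden from the backbone
logits = [[0.0, 0.0, 0.0],
          [4.0, 0.0, 0.0],            # confident and correct (token 0)
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]            # uniform guess
print(masked_token_loss(logits, visual_tokens, masked_positions))
```

A uniform guess over 3 tokens costs log(3), so the averaged loss here sits below that value thanks to the confident, correct prediction at position 1.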

This architecture is a smart adaptation of BERT to image data, but it seems to need a lot of computational power for storing and processing an enormous number of tokens.

The experiment

The BEiT network follows the architecture of ViT-Base [3] with some modifications, and the image tokenizer is borrowed from [5].

Training dataset: 1.2M images of ImageNet-1K [6].

The pre-trained model is then fine-tuned on visual downstream tasks.

Image classification: A simple linear classifier is added as the task layer, and both the parameters of BEiT and the softmax layer are updated during fine-tuning. The ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images is used for evaluating this downstream task.
Semantic segmentation: The task layer of SETR-PUP is used, with the pre-trained BEiT employed as the backbone encoder. The ADE20K benchmark dataset with 25K images and 150 semantic categories is used for evaluating this task.

The parameters used for fine tuning these tasks are as follows.

Results on the downstream task of Image Classification

They compare their base-size BEiT model with other base-size vision Transformers models such as:

ViT384-B: The Vision Transformer “Base” model trained with 384×384 images, borrowed from [3] and mentioned in the previous part of this post. “Base” means 12 layers, a hidden size D = 768, an MLP (multilayer perceptron) size of 3072 and h = 12 attention heads. In total, “Base” has 86M parameters.
ViT384-B-JFT300M: The ViT384-B model trained on the JFT-300M image dataset, which contains over 1B labels for 300M images (a single image can have multiple labels).
ViT384-L: The Vision Transformer “Large” model trained with 384×384 images. “Large” means 24 layers, a hidden size of 1024, an MLP size of 4096 and 16 attention heads. In total, “Large” has 307M parameters.
DeiT-B [7]: the Data-efficient image Transformer “Base” model with 86M parameters, trained on 224×224 images and presented by Touvron et al. in 2021. It is identical in architecture to the ViT384-B model from [3] with one modification: the MLP head is replaced with a linear classifier layer.
DeiT384-B: the same architecture as DeiT-B, trained with larger-resolution images (384×384).
and some other modified vision Transformer-based models; the one exception is iGPT, which contains 1.36B parameters.
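
As a sanity check on the parameter counts quoted above, here is a back-of-the-envelope calculation in Python. It assumes a standard Transformer encoder block (four attention projections, a two-layer MLP and two layer norms, all with biases); the function name is mine. It deliberately ignores the patch embedding, position embeddings and classification head, so it should land slightly below the quoted totals.

```python
def encoder_params(layers, hidden, mlp):
    """Weights and biases of a stack of standard Transformer encoder blocks."""
    attention = 4 * (hidden * hidden + hidden)        # Q, K, V, output projections
    feed_forward = (hidden * mlp + mlp) + (mlp * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                    # scale and shift, twice per block
    return layers * (attention + feed_forward + layer_norms)

base = encoder_params(layers=12, hidden=768, mlp=3072)
large = encoder_params(layers=24, hidden=1024, mlp=4096)
print(f"Base:  {base / 1e6:.1f}M")   # Base:  85.1M
print(f"Large: {large / 1e6:.1f}M")  # Large: 302.3M
```

Note that the number of attention heads does not appear in the formula: the heads split the hidden dimension between them, so changing their count does not change the number of parameters.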

As shown, BEiT has improved the performance significantly and surpassed many other state-of-the-art algorithms even though it was self-supervised pretrained without labeled data.

Results on the downstream task of Semantic Segmentation

This task is to predict the class of each pixel of the input image. The evaluation metric is mean Intersection over Union (mIoU), averaged over all semantic categories.
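
For readers who have not met this metric before, here is a small sketch of how mIoU can be computed on flattened label maps. The function name and the six-pixel toy example are assumptions, and skipping classes that appear in neither map is one possible convention; benchmark toolkits may accumulate the counts over a whole dataset instead of a single image.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for flat lists of per-pixel class ids.
    Classes absent from both prediction and ground truth are skipped."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

target = [0, 0, 1, 1, 2, 2]   # ground-truth class of each pixel
pred   = [0, 0, 1, 2, 2, 2]   # one pixel of class 1 is mislabeled as class 2
print(mean_iou(pred, target, num_classes=3))  # (1 + 1/2 + 2/3) / 3
```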

Ablation studies for BEIT pre-training on image classification and semantic segmentation.

They compare their BEiT pre-trained models with supervised pre-training that relies on the labeled data of ImageNet. BEiT achieves better performance than those baselines.

2022: SimMIM — simplified masked autoencoding

The code is available on GitHub.

The latest family member is SimMIM, which appeared in 2021 and was accepted to CVPR 2022 [8]. What I find interesting about this paper is that it comes from Microsoft (a US company) while being written by researchers based in mainland China.

Overview

In this paper, the authors identify why “masked signal modeling” has been harder to apply to visual data than to language, where it is widely used:

The strong locality of image data: neighboring pixels are highly correlated, so a model can do well by looking at close pixels instead of reasoning semantically.
Visual signals are raw and low-level, while text tokens are human-generated and high-level.
Visual signals are continuous, while text tokens are discrete.

SimMIM (a simple framework for Masked Image Modeling) directly reconstructs the masked image patches rather than discrete image tokens extracted from an encoder. The idea is really simple:

random masking of input image patches, using a linear layer to regress the raw pixel values of the masked area with an ℓ1 loss
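
The recipe can be sketched in plain Python as follows. All names and numbers are illustrative assumptions: the patches hold only 2 pixels each, the "encoder features" are hand-written stand-ins for the output of a real ViT or Swin encoder, and the linear head weights are fixed rather than learned.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.6, seed=0):
    """Pick round(num_patches * mask_ratio) patch indices to hide."""
    rng = random.Random(seed)
    n_mask = round(num_patches * mask_ratio)
    return set(rng.sample(range(num_patches), n_mask))

def linear_head(feature, weights, bias):
    """One linear layer mapping a patch feature to that patch's raw pixels."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weights, bias)]

def masked_l1_loss(pred_patches, true_patches, masked):
    """Mean absolute error over the pixels of masked patches only."""
    total, count = 0.0, 0
    for idx in masked:
        for p, t in zip(pred_patches[idx], true_patches[idx]):
            total += abs(p - t)
            count += 1
    return total / count

# Toy numbers: 5 patches of 2 pixels each, 3-dimensional encoder features.
true_patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
features = [[1.0, 0.0, 0.0]] * 5           # stand-in for encoder outputs
weights, bias = [[0.5, 0.0, 0.0], [0.5, 0.0, 0.0]], [0.0, 0.0]

masked = random_patch_mask(len(true_patches))
pred_patches = [linear_head(f, weights, bias) for f in features]
print(sorted(masked), masked_l1_loss(pred_patches, true_patches, masked))
```

The loss is computed on the masked patches only, so the visible patches act purely as context.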

SimMIM excludes many special designs such as block-wise masking or tokenization via discrete VAE or clustering. The framework contains 4 major components:

Masking strategy: determines which area to mask and how the masked area is transformed before being fed to the model (following the practice of the NLP community and BEiT). The paper considers square masking, block-wise masking and random masking, as shown in the image below.

Illustration of masking area generated by different masking strategies, same ratio of 0.6

Encoder architecture: extracts a latent feature representation of the masked image, which is then used to predict the original signal in the masked area. They use two vision Transformer architectures: ViT [3] and Swin Transformer [9].
Prediction head: To avoid a heavy decoder-style prediction head, they use an extremely lightweight one. In particular, they use a linear layer and also try a 2-layer MLP, an inverse Swin-T and an inverse Swin-B. The table below gives more information on those prediction heads.

Ablation on different prediction heads. A simple linear layer performs the best with lower training costs.

Prediction target: defines the form of the original signal to predict, either the raw pixel values or a transformation of them. SimMIM treats this as a regression problem. Since there is no heavy decoder to restore the full-resolution masked pieces of the image, the prediction starts from the latent feature representation.
The features are then mapped from the feature space back to the full resolution, and an ℓ1 loss is used to learn this mapping. The table below gives more information on different targets.

Ablation on different prediction targets.

As shown in the table:

The three losses (ℓ1, smooth-ℓ1 and ℓ2) perform similarly well in SimMIM.
Other targets, such as clustering (iGPT) or tokenization, perform slightly worse. Color discretization is competitive with the ℓ1 loss but requires a careful choice of the number of bins (e.g. 8 bins).
Therefore, MIM target prediction is best treated as a regression problem, which matches the continuous nature of visual signals.
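
For reference, the three regression losses compared above differ only in how they penalize the per-pixel error. Below is a small sketch in plain Python; the function names are mine, and the smooth-ℓ1 threshold beta = 1.0 is an assumed default, not a value taken from the paper.

```python
def l1(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def l2(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def smooth_l1(pred, target, beta=1.0):
    """Quadratic for errors below beta, linear above (Huber-style)."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        total += 0.5 * err ** 2 / beta if err < beta else err - 0.5 * beta
    return total / len(pred)

target = [0.0, 0.0]
pred = [0.5, 2.0]            # one small error, one large error
print(l1(pred, target))         # 1.25
print(l2(pred, target))         # 2.125
print(smooth_l1(pred, target))  # 0.8125
```

The large error dominates ℓ2 much more than it dominates ℓ1 or smooth-ℓ1, which is the practical difference between the three.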

Critical capability

Recovered images using three different mask types (from left to right): random masking, masking most parts of a major object, and masking the full major object

In terms of recovering the original regions:

The image above shows three different kinds of masking.
With random masking that removes only some parts of the image, the regions are recovered well.
If most of a major object is masked, SimMIM can still recover the existence of the object.
If the whole object is masked, the masked area is inpainted with background.

Therefore, SimMIM appears capable of semantic reasoning about objects rather than memorizing image identities or simply copying neighboring pixels.

Results on down-stream tasks

They conducted three different experiments.

Image classification using iNaturalist (iNat) 2018 dataset which consists of more than 8,000 categories with 437,513 training images and 24,426 validation images.
Object detection using COCO dataset which consists of 80 object categories with 1.5M object instances and 330K images.
Semantic segmentation using ADE20K dataset.

More ablation studies on prediction targets using iNat2018, COCO and ADE20K.

The regression-based prediction target (ℓ1) of SimMIM performs competitively with or better than the classification-based alternatives.

Conclusion

Finally, I sum up the key ideas of this brief timeline of the MIM family in a table.

You can access the table here if you have problems with the image.

Thank you for taking the time to read my post. Please share your thoughts or point out any mistakes I have made, so that I can write better posts in the future.

References

[1] Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., … & Goldblum, M. (2023). A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.

[2] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[4] Bao, H., Dong, L., Piao, S., & Wei, F. (2021). BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.

[5] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., … & Sutskever, I. (2021, July). Zero-shot text-to-image generation. In International Conference on Machine Learning (pp. 8821–8831). PMLR.

[6] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211–252.

[7] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021, July). Training data-efficient image transformers & distillation through attention. In International conference on machine learning (pp. 10347–10357). PMLR.

[8] Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., … & Hu, H. (2022). SimMIM: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9653–9663).

[9] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).

Written by Vyhao