Foundation Models for Computer Vision

9 min read · Apr 9, 2024

Table of contents
· Introduction
· Architectures
∘ CLIP
∘ DINO
∘ Dataset expansion pipeline
∘ Tested models
· Tasks
∘ Classification
∘ Image Search
· Results
∘ Classification
∘ Image Search
· Summary

Introduction

The year 2023 was undoubtedly the year of LLMs: ChatGPT, GPT-4, Llama, Falcon, Alpaca, Dolly, Bloom. Models like these are also called foundation models.

Computer vision was changing at that time too, just as spectacularly, although with less attention. In this part, I will take a look at the DINOv2 model from Meta AI and its main rival, CLIP. The question is whether these models also know the world. Will they work well on faces, animals, and medical images? How do their skills differ from those of models trained only on ImageNet?

Architectures

CLIP

The CLIP model from OpenAI was arguably the first vision model that could be called a foundation model. The main reason was its excellent generalization ability: as befits a foundation model, it knows enough about the world to learn a new task from a few examples (something already observed in LLMs).

How did OpenAI manage to create such a model?
The main idea was to build the model from two branches, a text part and an image part. Using a dataset of photo-description pairs, the model learns the similarity between a photo and its description. This is the main difference from other models used in computer vision: CLIP is a multi-modal model. It is worth mentioning that CLIP is very well suited to zero-shot tasks, i.e. we provide a text as the class (e.g. tomato soup) and look for photos similar to this description.

And that’s it! No training is needed, although you do have to choose an appropriate cutoff value for the similarity metric to filter out the non-tomato-soup cases.
All tests are performed using OpenCLIP models, since that is where the largest and best CLIP-style models can be found.
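
To make the cutoff idea concrete, here is a minimal sketch of zero-shot search over precomputed embeddings. The 3-dimensional vectors, the file names, and the 0.6 threshold are made-up placeholders, not real CLIP outputs; in practice the embeddings would come from the text and image branches of an OpenCLIP model, and the threshold would have to be tuned on held-out data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_search(text_emb, image_embs, threshold):
    """Return (name, score) pairs whose similarity to the text passes the cutoff."""
    scored = [(name, cosine(text_emb, emb)) for name, emb in image_embs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    # Most similar images first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for real model outputs (hypothetical values).
text_emb = [1.0, 0.0, 0.0]  # e.g. the prompt "tomato soup"
image_embs = {
    "soup_1.jpg": [0.9, 0.1, 0.0],
    "soup_2.jpg": [0.7, 0.7, 0.0],
    "car.jpg": [0.0, 0.2, 1.0],
}

matches = zero_shot_search(text_emb, image_embs, threshold=0.6)
print([name for name, _ in matches])
```

Raising the threshold trades recall for precision: with a cutoff of 0.8 only the first soup photo would survive in this toy example.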

When trying to match text to an image, the model does not focus on the entire content of the photo, but only on what could be described in an image caption. This has its pros and cons:

Advantage: the background of the photo does not play a significant role, i.e. the model is largely invariant to the background
Disadvantage: when there is some form of text in the photo, the model focuses mainly on it. This is a side effect of the training objective, which lets the model look for shortcuts when solving the task

DINO

DINOv2: Learning Robust Visual Features without Supervision is another foundation model, presented in 2023. Researchers praise it as a model that performs very well on many tasks, such as segmentation, depth estimation, image search, and point matching, without any adaptation of the original model's features. How was it trained?

The main idea behind the training is Self-Supervised Learning (SSL), a setup in which:

the dataset has no labels. The training signal is generated automatically by transforming the same photo in two different ways
we have one neural network, but it is used twice: once as a student (the one being trained) and once as a teacher (its weights are not trained directly, but are updated from the student's weights using an exponential moving average, EMA)
the goal is to produce embeddings that are as similar as possible for both versions of the photo, each passed through a different branch of the model.
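
The teacher update described above can be sketched in a few lines. This is a simplified illustration, not the actual DINO training code: the weights are assumed to be a plain dictionary of floats, and the momentum value is only an illustrative choice.

```python
def ema_update(teacher, student, momentum):
    """Move every teacher weight a small step toward the matching student weight."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Hypothetical two-parameter "networks".
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

# With momentum 0.9 the teacher keeps 90% of its old value on every step.
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

The closer the momentum is to 1, the more slowly the teacher follows the student, which makes it a smoother and more stable target.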

Dataset expansion pipeline

It is worth taking a closer look at the photo deduplication process, as I consider it the key to the success of DINOv2. Earlier SSL models (such as BYOL, MoCo, SimCLR) performed very well on clean datasets with clearly distinct concepts (e.g. ImageNet contains cats, dogs, houses, fruits, etc.). However, when we throw data from the Internet into one bag, it may happen that, for example, there is only one photo of an Arctic fox, and there are millions of such classes with a single photo. Many researchers have reported difficulties training models in this setting. Therefore, the following pipeline was proposed in DINOv2:

creating a list of concepts based on clean data sets, e.g. ImageNet, Google Landmarks
downloading a huge number of photos from the Internet (~1.2 billion)
removing exact duplicate photos
creating an index of photos based on a strong ViT model
for each image in the concept list, searching the index for its nearest neighbors and adding them to a new set

This means that the goal of the creators of DINOv2 was to expand the dataset by increasing the number of examples per concept, not by increasing the number of concepts. This is interesting because it shows that the maxim “Give me more data” is not entirely true. “Give me more clean and appropriate data” sounds better.

The authors also report results for the same architecture trained on different datasets for the same number of iterations. You can see that uncurated data, i.e. a collection of random concepts, doesn’t perform as well as the other approaches.
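
The expansion steps listed above can be sketched as follows. This is a toy illustration under several assumptions: exact duplicates are detected with a hash of the raw bytes, the index is a plain dictionary searched by brute force, and the 2-dimensional embeddings and file names are invented. The real pipeline runs at a very different scale and would need an approximate nearest-neighbor index.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedupe_exact(photos):
    """Keep the first photo for every distinct byte content."""
    seen = set()
    unique = {}
    for name, (raw_bytes, embedding) in photos.items():
        digest = hashlib.sha256(raw_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = embedding
    return unique


def expand_dataset(curated_embeddings, web_index, k):
    """For each curated image, add its k nearest web photos to the new set."""
    selected = set()
    for query in curated_embeddings:
        ranked = sorted(
            web_index,
            key=lambda name: cosine(query, web_index[name]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return selected


# Hypothetical web crawl: name -> (raw bytes, embedding from some ViT).
web_photos = {
    "fox_a.jpg": (b"fox-a", [0.9, 0.1]),
    "fox_a_copy.jpg": (b"fox-a", [0.9, 0.1]),  # exact duplicate
    "fox_b.jpg": (b"fox-b", [0.8, 0.3]),
    "meme.jpg": (b"meme", [0.0, 1.0]),
}
curated = [[1.0, 0.0]]  # embedding of one curated "Arctic fox" image

index = dedupe_exact(web_photos)
new_set = expand_dataset(curated, index, k=2)
print(sorted(new_set))
```

Note that the unrelated photo never enters the new set: photos are only pulled in when they sit close to a concept that already exists in the curated data.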

Tested models

DINOv2:

ViT-L/14: distilled version of the biggest DINOv2, 2nd biggest model
ViT-B/14: distilled version of the biggest DINOv2, 3rd biggest model

OpenCLIP

ConvNext-XXLarge: trained on LAION-2B, 2nd best model
ViT-L/14: trained on DataComp-1B, 3rd best model
ViT-H/14: trained on LAION-2B, 3rd best model

Models trained on ImageNet:

ViT-L/14
ConvNext

I selected the best models from both repositories, excluding those that are difficult to run on a T4 GPU. In addition, from OpenCLIP I chose the ConvNext model, which is a convolutional network rather than a transformer and is the second-best model in the OpenCLIP repo. CLIP also learns from text, so it does not rely solely on image transformations.

Tasks

Each task uses the kNN evaluation protocol.
Pipeline:

Extract features from the training photos
Extract features from the test photos
For each test photo, find the most similar training photo and assign its label to the test photo

In this way, we can test both classification and image search tasks. The only difference is that in a classification task there is only one correct answer, while in search one query may have many correct answers and we want to find all of them.
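
As a rough sketch of this protocol for the classification case, the snippet below does a 1-nearest-neighbor lookup over precomputed features. Cosine similarity is an assumption here (the article does not state which distance was used), and the 2-dimensional features and labels are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_label(query, train_features, train_labels):
    """Return the label of the training photo most similar to the query."""
    best = max(
        range(len(train_features)),
        key=lambda i: cosine(query, train_features[i]),
    )
    return train_labels[best]


def knn_accuracy(train_features, train_labels, test_features, test_labels):
    """Fraction of test photos whose nearest training neighbor has the right label."""
    predictions = [
        nearest_label(q, train_features, train_labels) for q in test_features
    ]
    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return correct / len(test_labels)


# Hypothetical features; in practice they come from a frozen backbone.
train_features = [[1.0, 0.0], [0.0, 1.0]]
train_labels = ["apple", "carrot"]
test_features = [[0.9, 0.2], [0.1, 0.8], [0.6, 0.5]]
test_labels = ["apple", "carrot", "carrot"]

accuracy = knn_accuracy(train_features, train_labels, test_features, test_labels)
print(f"accuracy = {accuracy:.3f}")
```

No weights are updated anywhere in this procedure, which is why it is a convenient way to compare the raw feature quality of different backbones.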

Metrics

Classification task: Accuracy
Here we only care about the nearest neighbor returned by the search.

Image Search task: mAP (mean Average Precision)
This metric rewards correctly sorted results, i.e. if a given query has 3 correct answers in the gallery, we want those 3 photos to occupy the top 3 places.

Figure: mAP in the case of incorrectly sorted results
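
A small worked example may help to see how sorting affects the score. The sketch below computes average precision for a single query from a ranked list of hit/miss flags, and averages it over queries. It assumes that the whole gallery is ranked, so every correct answer appears somewhere in the list; the two rankings are invented.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is True if the i-th result is correct."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision measured at the position of each correct result.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Both queries have 3 correct answers in a gallery of 5 photos.
perfectly_sorted = [True, True, True, False, False]
badly_sorted = [True, False, True, False, True]

print(average_precision(perfectly_sorted))
print(average_precision(badly_sorted))
print(mean_average_precision([perfectly_sorted, badly_sorted]))
```

For the second ranking the precisions at the three hits are 1/1, 2/3 and 3/5, so the average precision drops to about 0.76 even though all three correct photos were retrieved.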

Classification datasets

I decided to skip ImageNet, as we already have results for it. Many other datasets have also already been tested (such as SVHN, Stanford Cars, SUN397). That’s why I decided to choose three datasets from Kaggle:

Fruits and Vegetables

A typical classification dataset. This is an easy task that serves as a baseline for the models.

iNaturalist-Birds

A subset of the full iNaturalist dataset. The challenge here is the scale (30k photos in training and testing) and the natural similarity between birds. This is a medium-difficulty task.

Chest X-Ray

A collection of medical images, i.e. a typical out-of-distribution (OOD) test. No model has seen this type of image during training, so I consider this task difficult.

Image Search datasets

I decided to test two datasets:

FaceScrub

A face recognition dataset with 530 people. Current face recognition models achieve very high classification scores (99%) on this set. Measured with mAP, however, the task is much harder, mainly because face recognition requires learning the fine details of facial features, and most models have not even seen a human face in their training set (e.g. in DINOv2 all faces are blurred).

Visual Search from AiCrowd

The Visual-Search dataset from the AiCrowd competition. In my experience, models not trained on a similar task perform poorly on it. The main reason is the same as with face recognition: it’s the details of the objects that matter, not their overall appearance.

I chose these two datasets because I haven’t seen anyone else test large vision models on this type of data, and it seems to me that the results of this experiment can say something more about how well the models generalize.

Results

I performed all the tests on Kaggle.

Note: I was not able to run all the tests in a single session due to RAM limitations, so to reproduce the results you need to run CLIP, DINO, and ImageNet separately.

For each dataset, I assign points to each model family: CLIP, DINO, and ImageNet.

3 for the best score (Green color)
2 for a mediocre score (Blue color)
1 for the weakest score (Red color)

Classification

Fruits and Vegetables

iNaturalist-Birds

The results in this category surprised me the most. The difference between the best and the worst model is as much as 34 percentage points! Additionally, the gap between the best CLIP model and the DINOv2 model is 17 percentage points in favor of DINOv2. The DINOv2 base training set includes CUB-200-2011, a dataset of bird photos, which certainly helped in obtaining a better result, but there is no data leakage here.

Chest X-Ray

The worst model in this category is DINOv2, and the best is the model trained on ImageNet (which should not be the case). This shows that none of the models copes well with this task, and that DINOv2 has the most trouble understanding this type of image.

Image Search

FaceScrub

Despite not having been trained on faces, DINOv2 won the face recognition evaluation, followed by CLIP. My intuition told me otherwise; apparently I need to read up on how the training set for OpenCLIP was created.

Visual-Search

The OpenCLIP model performed best in this task, in line with my intuition: this model focuses heavily on the foreground of the photo and ignores the background, which is exactly what is needed to achieve good results here. What is puzzling is the very poor result of the DINO models, only slightly better than the models trained on ImageNet. I think this is due to the way the model is trained: in SSL the model attends to the entire image, not just the foreground.

Summary

The official winner of the CLIP vs DINO battle is CLIP!

But does that mean I should forget about DINO? No! I think my five tests are too few to give a final answer. At the same time, it can be seen that if a model was trained on a specific domain (like DINO on a dataset of different birds), then it generalizes better within that domain. If I were to start a new computer vision project now, I would still test both models and compare their results head-to-head.
An additional conclusion from my tests is that the ConvNext and ViT-L/14 models are very comparable to each other. Despite the same training methodology (but slightly different training sets), each has its own strengths and weaknesses.