
CSE 465

Lecture 17
Advanced Neural Network Architectures
Multi-Modal Models: CLIP
• CLIP (Contrastive Language-Image Pretraining) is a deep learning model
developed by OpenAI
• CLIP’s embeddings for images and text share the same space
• Therefore, it is possible to directly compare between the two modalities
• This is accomplished by training the model to bring related images and
texts closer together while pushing unrelated ones apart.
CLIP
• CLIP can be used for image classification tasks by associating images with
natural language descriptions
• It allows for more versatile and flexible image retrieval systems where users
can search for images using textual queries
• CLIP can be used to moderate content on online platforms by analyzing
images and accompanying text to identify and filter out inappropriate or
harmful content
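• As an illustration of the image-classification use case, below is a minimal zero-shot
classification sketch using the Hugging Face transformers implementation of CLIP; the
checkpoint name, image path, and candidate captions are illustrative assumptions, not
part of the lecture.

# Minimal sketch: zero-shot classification by comparing one image against candidate captions
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
candidate_texts = ["a photo of a cat", "a photo of a dog", "a sunny day at the beach"]

inputs = processor(text=candidate_texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarities -> probabilities
print(dict(zip(candidate_texts, probs[0].tolist())))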
How does CLIP work
• CLIP is designed to predict which of the N × N potential (image, text) pairings
within a batch of N examples are actual matches
• To achieve this, CLIP establishes a multi-modal embedding space through
the joint training of an image encoder and text encoder
• The CLIP loss aims to maximize the cosine similarity between the image and
text embeddings for the N genuine pairs in the batch while minimizing the
cosine similarity for the N² − N incorrect pairings
• The optimization process involves using a symmetric cross-entropy loss
function that operates on these similarity scores
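• Written out, this symmetric objective takes the standard InfoNCE form below (with cosine
similarity sim and temperature τ); the notation is a reconstruction for clarity, not copied
from the slides:

L_{\mathrm{image}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(I_i,T_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(I_i,T_j)/\tau)}, \qquad
L_{\mathrm{text}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(I_i,T_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(I_j,T_i)/\tau)}, \qquad
L = \tfrac{1}{2}\left(L_{\mathrm{image}} + L_{\mathrm{text}}\right)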
Architecture
• CLIP uses two separate architectures as the backbone for encoding the vision
and text data:
• Image encoder: Represents the neural network architecture (e.g., ResNet or Vision
Transformer) responsible for encoding images.
• Text encoder: Represents the neural network architecture (e.g., CBOW, BERT, or Text
Transformer) responsible for encoding textual information.
• The original CLIP model was trained from scratch, without initializing the
image encoder and the text encoder with pre-trained weights, since it was
trained on a very large dataset (400 million image-text pairs)
Input
• The model takes a batch of n pairs of images and texts as input where:
• I[n, h, w, c]: Represents a minibatch of aligned images
• Here n is the batch size, h is the image height, w is the image width, and c is the number of
channels.
• T[n, l]: Represents a minibatch of aligned texts
• Here n is the batch size, and l is the length of the textual sequence.
• Feature extraction
• Image Encoder: Extracts feature representations from the images.
• The shape of the feature set is [n, d_i],
• Here d_i is the dimensionality of the image features.
• Text Encoder: Extracts feature representations from the texts.
• The shape of the text feature set is [n, d_t]
• Here d_t is the dimensionality of the text features.
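• A minimal sketch of the shapes involved, using random PyTorch tensors as stand-ins for
real data and encoders; all dimension values below are illustrative choices, not CLIP's.

import torch

n, h, w, c = 8, 224, 224, 3          # batch size, image height, width, channels (illustrative)
l = 77                               # text sequence length (illustrative)
d_i, d_t = 2048, 512                 # image / text feature dimensionalities (illustrative)

I = torch.randn(n, h, w, c)          # I[n, h, w, c]: minibatch of aligned images
T = torch.randint(0, 50000, (n, l))  # T[n, l]: minibatch of aligned, tokenized texts

# Stand-ins for the encoder outputs: any image/text encoders producing these shapes
img_features = torch.randn(n, d_i)   # image features, shape [n, d_i]
txt_features = torch.randn(n, d_t)   # text features, shape [n, d_t]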

Learned Projections
• CLIP learns a projection matrix that maps image features into the joint embedding space
• The resulting embedding is the new, shared representation
• The shape of this matrix is [d_i, d_e]
• Here d_e is the desired dimensionality of the joint embedding space.
• Similarly, a projection matrix is learned for the text representation
• The shape of this matrix is [d_t, d_e]
• Again, this matrix projects the text representation into the d_e-dimensional space
• The projection operation can be implemented as two linear layers (one per modality)
whose weights are the learned projection matrices, as sketched below
• When adapting CLIP to new datasets, the projection weights are often the only
weights with active gradients, i.e., the only weights that are trained.
• Additionally, the projection layer plays a crucial role in aligning the dimensions of
image and text embeddings, ensuring that they have the same size.
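• A sketch of the projection step under the shapes above; the variable names and the choice
of bias-free linear layers are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_i, d_t, d_e = 2048, 512, 512                 # illustrative dimensionalities

proj_image = nn.Linear(d_i, d_e, bias=False)   # learned image projection: d_i -> d_e
proj_text = nn.Linear(d_t, d_e, bias=False)    # learned text projection: d_t -> d_e

def joint_embed(img_features, txt_features):
    # Project both modalities into the shared d_e-dimensional space and L2-normalize,
    # so that dot products between rows are cosine similarities
    I_e = F.normalize(proj_image(img_features), dim=-1)  # [n, d_e]
    T_e = F.normalize(proj_text(txt_features), dim=-1)   # [n, d_e]
    return I_e, T_e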
Symmetric Loss Function
• CLIP uses contrastive loss to bring related images and texts closer together
while pushing unrelated ones apart.
• logits = (I_e · T_eᵀ)/τ: The n × n matrix of pairwise cosine similarities between the image
and text embeddings, scaled by a temperature τ (referenced in the lines below).
• labels = arange(n): Generates labels representing the indices of the batch, since the
i-th image matches the i-th text.
• loss_i = cross_entropy_loss(logits, labels, axis=0): Computes the cross-entropy loss
along the image axis.
• loss_t = cross_entropy_loss(logits, labels, axis=1): Computes the cross-entropy loss
along the text axis.
• loss = (loss_i + loss_t)/2: Computes the symmetric average of the image and text
losses.
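• Putting the steps above into runnable form, a minimal sketch of the symmetric loss; the
temperature value and function name are illustrative, not taken from the original implementation.

import torch
import torch.nn.functional as F

def clip_symmetric_loss(I_e, T_e, temperature=0.07):
    # I_e, T_e: [n, d_e] L2-normalized image and text embeddings
    n = I_e.shape[0]
    logits = I_e @ T_e.t() / temperature           # [n, n] scaled pairwise cosine similarities
    labels = torch.arange(n, device=I_e.device)    # the i-th image matches the i-th text
    loss_i = F.cross_entropy(logits, labels)       # cross-entropy along the image axis
    loss_t = F.cross_entropy(logits.t(), labels)   # cross-entropy along the text axis
    return (loss_i + loss_t) / 2                   # symmetric average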
Contrastive Loss
• The main idea behind contrastive pre-training is to teach the model to
differentiate between “positive” and “negative” pairs. In the context of
CLIP:
• Positive Pairs: These are pairs of images and texts that are truly related or
semantically meaningful. These pairs will have high cosine similarity. For example,
an image of a cat paired with the description “a cute cat.”
• Negative Pairs: These are pairs where the text and image do not match. For
example, an image of a cat paired with the unrelated description “a sunny day
at the beach.”
• The cosine similarity score provides a measure of how well the input image
and text match in the shared embedding space.
• Higher cosine similarity scores indicate stronger semantic alignment, suggesting that
the text and image are more closely related or that the image corresponds to the
textual description.
• Lower scores suggest weaker alignment or a mismatch.
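• For reference, the cosine similarity between an image embedding u and a text embedding v is

\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}

which reduces to a plain dot product when the embeddings are L2-normalized.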
Contrastive Loss Function
• A contrastive loss is low when two correlated (positive) vectors are close together
in the embedding space, and high otherwise.
Unsupervised Image Embedding
General Visual Representation Learning
• Done on an unlabeled image dataset (unsupervised)
• After unsupervised learning, the learned model and image representations
can be used for downstream applications
• Generative modeling
• Generate or otherwise model pixels in the input space
• Pixel-level generation is computationally expensive
• Generating high-fidelity images may not be necessary for representation learning
• Discriminative modeling
• Train networks to perform pretext tasks where inputs and labels are derived from an
unlabeled dataset
• Heuristic-based pretext tasks: rotation prediction, relative patch location prediction,
colorization, solving jigsaw puzzles
• Many heuristics seem ad-hoc and may be limiting
Self-supervised learning
• Autoencoders are a traditional self-supervised learning approach, trained to
reconstruct the input from a learned low-dimensional representation. Variants
of autoencoders include denoising autoencoders, where noise is added to the
input during training
• Data augmentation is a widely used technique in computer vision research,
with numerous techniques such as flipping, cropping, and coloring. Image
overlay (also known as image mixture) is also a type of augmentation
method, but has been less explored compared to others
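• As a concrete illustration of the autoencoder idea (separate from CLIP and SimCLR), below
is a minimal denoising autoencoder sketch; the layer sizes and noise level are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    # Learns to reconstruct the clean input from a corrupted copy via a low-dimensional code
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x, noise_std=0.3):
        x_noisy = x + noise_std * torch.randn_like(x)  # corrupt the input during training
        code = self.encoder(x_noisy)                   # low-dimensional representation
        return self.decoder(code)                      # reconstruction

# Training minimizes the reconstruction error against the clean input:
# loss = F.mse_loss(model(x), x)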
SimCLR Method
• Maximize the agreement of representations under data transformation,
using a contrastive loss in the latent/feature space
• A framework for contrastive representation learning.
• Two separate stochastic data augmentations t, t′ ∼ T (a family of augmentations) are
applied to each example to obtain two correlated views
• A base encoder network f(.) with a projection head g(.) is trained to
maximize agreement in latent representation
Data augmentation
• SimCLR uses random crop and color distortion for augmentation
• Transforms a given image randomly in two ways, yielding two correlated
views of the same example
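• A sketch of such an augmentation pipeline with torchvision; the crop size and distortion
parameters are illustrative, not necessarily those used in the SimCLR paper.

from torchvision import transforms

# Random crop + color distortion; applying the same stochastic transform twice
# yields the two correlated views t(x) and t'(x)
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(size=224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def two_views(image):
    return simclr_augment(image), simclr_augment(image)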
Base Encoder
• f(x) is the base network that computes internal representations
• SimCLR uses ResNet, however, it is possible to use other networks
• This is the model whose weights we are training for the eventual
downstream task.
Projection Head
• g(h) is a projection network that projects the representation into a latent space
• SimCLR uses 2-layer MLP
• The projection head reduces the dimensionality of the representation vectors before
the loss function is applied
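• A minimal sketch of the base encoder f(·) and projection head g(·), using a torchvision
ResNet-50 backbone; the output size of the head is an illustrative choice.

import torch.nn as nn
from torchvision import models

# Base encoder f(.): ResNet-50 with its classification layer removed
encoder = models.resnet50(weights=None)
feat_dim = encoder.fc.in_features      # 2048 for ResNet-50
encoder.fc = nn.Identity()             # h = f(x), shape [n, 2048]

# Projection head g(.): 2-layer MLP mapping h to a smaller latent vector z
projection_head = nn.Sequential(
    nn.Linear(feat_dim, feat_dim),
    nn.ReLU(),
    nn.Linear(feat_dim, 128),          # z = g(h)
)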
The SimCLR framework
• Maximize agreement using a contrastive task:
• Given a set {x_k} in which two augmented examples x_i and x_j form a positive pair,
identify x_j among {x_k}_{k≠i} for a given x_i
• Loss function
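• For reference, the NT-Xent (normalized temperature-scaled cross-entropy) loss that SimCLR
uses for a positive pair (i, j) is

\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

where z = g(f(x)) are the projected representations, sim is cosine similarity, τ is a
temperature parameter, and the total loss averages ℓ over all positive pairs in the batch.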
