Image and Vision Computing 136 (2023) 104754


Review article

Contrastive learning with semantic consistency constraint


Huijie Guo, Lei Shi *
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, China
* Corresponding author. E-mail addresses: guo_hj@buaa.edu.cn (H. Guo), leishi@buaa.edu.cn (L. Shi).

A R T I C L E  I N F O

Keywords:
Representation learning
Contrastive learning
Semantic consistency

A B S T R A C T

Contrastive representation learning (CL) can be viewed as an anchor-based learning paradigm that learns representations by maximizing the similarity between an anchor and positive samples while reducing the similarity with negative samples. A randomly adopted data augmentation strategy generates the positive and negative samples, resulting in semantic inconsistency in the learning process. The randomness may introduce additional disturbances to the original sample, thereby reversing the sample's identity. Moreover, the negative-sample demarcation strategy causes the negative set to contain samples that are semantically similar to the anchor, called false negative samples. Therefore, CL's maximization and reduction processes cause distractors to be incorporated into the learned feature representation. In this paper, we propose a novel Semantic Consistency Regularization (SCR) method to alleviate this problem. Specifically, we introduce a new regularization term, the pairwise subspace distance, to constrain the consistency of the distributions across different views. Furthermore, we propose a divide-and-conquer strategy to ensure that the proposed SCR remains well-suited to large mini-batch cases. Empirically, results across multiple small and large benchmark datasets demonstrate that SCR outperforms state-of-the-art methods. Code is available at https://github.com/PaulGHJ/SCR.git.

1. Introduction

Driven by deep neural networks and theoretical analysis, machine learning has achieved breakthrough progress in various fields where sufficient labeled data are available [1–5]. The difficulty and cost of annotation have accelerated research on unsupervised learning methods, whose performance is still not comparable to that of supervised learning. Self-supervised learning (SSL), a learning paradigm that does not rely on human annotations but generates labels from the raw data itself, has therefore received extensive attention in recent years [6–10]. A typical and widely applied SSL method is contrastive learning (CL). The key idea behind CL is to learn similar representations for positive sample pairs and discrepant representations for negative samples [11,12]. In general, CL learns feature extractors by minimizing a contrastive loss, e.g., the InfoNCE loss [13], and can be viewed as an instance-level representation learning paradigm, namely instance discrimination [14,15].

Take, for example, a sample from a mini-batch. Most existing CL frameworks first transform one training sample into two augmented views [6,16]. One of the two augmented views is treated as the anchor, and the other is regarded as a positive (same-class) sample relative to the anchor. Meanwhile, the augmented views of the remaining samples are treated as negative-class samples relative to the anchor. Here, our aim is first to emphasize the issue of semantic inconsistency that arises when we decrease the similarity between the negative samples and the anchor.

Since the mini-batch samples are obtained by random sampling from the training dataset [17,18], this leads to sampling bias, or class collision: the negative class partly includes semantically similar samples (false negative samples) [16,19]. As a result, minimizing the InfoNCE loss causes semantically similar samples to be incorrectly separated in the learned feature representation space. Furthermore, we observe that maximizing the similarity between positive samples and anchors also suffers from semantic inconsistency, mainly due to the generation mechanism of the positive samples.

In traditional contrastive learning methods, diverse augmentation strategies, such as random cropping, rotation, and Gaussian filtering, are employed to generate positive samples by transforming images. The similarity between these positive samples should ideally be driven by the foreground information (corresponding to the ground truth). However, the generation mechanism of augmented views in contrastive learning causes the similarity between positive samples to encompass both background and foreground information from the original image, as shown in Fig. 1(a). What is more, if the background occupies a larger proportion of the positive samples than the foreground, as shown in Fig. 1(b), the model may mistakenly perceive the foreground as noise during the early stages of training and consequently process it inadequately.

Fig. 1. The generation mechanism of positive samples causes semantic inconsistency. (a) Rotation $t_a$ is exploited as the data augmentation strategy; (b) random cropping $t_b$ may result in positive samples containing more background than foreground.

To address the semantic inconsistency problem mentioned above, we propose a new method called semantic consistency regularization (SCR) in this paper. SCR is mainly built on a proposed distribution measure named the pairwise subspace distance (PSD). Because the proposed PSD can be viewed as a regularization loss and can be integrated into the standard framework of contrastive learning, we collectively refer to each regularized contrastive learning method as SCR. Since each sample in the mini-batch generates two augmented samples, we treat the data generated by the different augmentations as two separate distributions. SCR can then be used to alleviate semantic inconsistencies from a distributional perspective. The main idea of SCR consists of two aspects. The first is that SCR obtains the eigenvectors of the covariance matrix of the data of the two distributions; the eigenvectors carry important semantic information and are robust to noise contamination of a single point in the input data [21–23]. The second is that SCR aligns the two distributions based on these eigenvectors, making the learned feature extractor pay more attention to semantic information. When faced with a large mini-batch, solving for the eigenvectors incurs a large computational cost; to this end, we propose a divide-and-conquer strategy to alleviate this problem. Our main contributions are summarized as follows:

• From the perspective of semantic inconsistency, we analyze the challenges of contrastive learning. Specifically, the problem of semantic inconsistency occurs not only in reducing the similarity between negative samples and the anchor but also in maximizing the similarity between positive samples and the anchor.
• We propose a new method, called semantic consistency regularization (SCR), to alleviate the semantic inconsistency problem. SCR aligns the two distributions of the augmented views based on eigenvectors so that the learned feature extractor can extract as much important semantic information as possible.
• We propose a pairwise subspace distance (PSD), which can be integrated into most standard contrastive learning frameworks. We also propose a divide-and-conquer strategy to reduce the computational complexity of solving for the eigenvectors in the large mini-batch case.
• Compared with state-of-the-art contrastive learning models, the experimental results show that the proposed SCR improves the performance of typical downstream tasks such as image classification across multiple benchmark datasets. Among alternative distribution distance metrics, PSD proves to be the best when applied to classical contrastive learning loss functions.

2. Related work

2.1. Contrastive learning

CL has achieved excellent performance on visual tasks such as position prediction [24,25], image in-painting [26,27], and automatic colorization [28]. From an information-theoretic perspective, CL focuses on learning representations by maximizing the mutual information between input and output [13,29–31]. CPC [13] uses the InfoNCE loss function for the first time to learn feature representations. CMC [30] then extends the mutual information maximization criterion to multi-view representation learning. SimCLR [6] is also based on InfoNCE but is better regarded as an augmentation-based approach. Specifically, SimCLR uses data augmentation techniques to construct pairs of positive and negative samples, and subsequent contrastive learning methods also employ this augmentation-based learning paradigm.

Although many studies have shown that contrastive learning is concise and practical, it still faces many challenges. The first challenge is its sensitivity to batch size. MoCo [7] proposes to build a dynamic dictionary using queues. BYOL [8] and SimSiam [32] design a novel structure that only uses positive samples, without negative samples. SwAV [33] and PCL [34] learn good-quality negative samples by introducing clustering methods into the training process, thereby reducing the dependence on the number of negative samples. The second challenge is its sensitivity to the selection of positive and negative samples. Zhang et al. [35] propose a hierarchical augmentation method for CL to solve this problem. Some researchers [36,37] also suggest letting the model automatically learn to capture varying and invariant factors for visual representations. Debiased CL [19], HNS [16], and UOTA [38] propose to reweight the negative-class samples. Patch-based NS [39] proposes to generate negative samples by texture-based and patch-based augmentations.

In addition, most contrastive learning methods perform instance discrimination, which does not consider the distribution of samples, and follow-up works supplement this line of research. Inspired by the pairwise distance distribution in metric learning, LMCL [40] proposes a contrastive learning method with a distance polarization regularizer. Wang et al. [41] identify two key properties related to the contrastive loss, alignment and uniformity. Furthermore, W-MSE [42] proposes a new loss function without additional networks or negative samples. In this paper, we analyze CL from the perspective of semantic inconsistency. There are also some methods that deal with this inconsistency. ICL-MSR [20] proposes a meta semantic consistency regularization to deal with the confounder in maximizing the similarity between positive samples and anchors. RINCE [43] treats the process of maximizing the similarity between positive samples and anchors as a robust alignment problem, so the inconsistency can be rescued by a robust InfoNCE objective. This paper argues that semantic inconsistency may appear both in reducing the similarity between negative samples and anchors and in maximizing the similarity between positive samples and anchors, and we propose a new method to deal with this problem.

2.2. Metric of distributions

In machine learning, it is frequently necessary to quantify the distance between probability distributions, whether it involves comparing actual and observed probability distributions or evaluating the dissimilarity between input and generated probability distributions. A commonly used measure based on information theory is the Kullback–Leibler (KL) divergence, which, however, does not possess the symmetry property of a metric. The Jensen–Shannon (JS) divergence, a variant of the KL divergence, addresses the asymmetry issue. However, when two probability distributions are completely non-overlapping, the KL divergence becomes meaningless and the JS divergence remains constant, which can lead to the collapse of the training model [44].

The Wasserstein distance [45–47] provides a meaningful notion of closeness and is capable of handling non-overlapping probability distributions on low-dimensional manifolds. Research has demonstrated that the generalization of the KL divergence, the JS divergence, and the Wasserstein distance can be limited by sample complexity. As shown in Fig. 2, the Sliced Wasserstein distance (SWD) [45] calculates the average Wasserstein distance over all marginal measures obtained by uniform random slicing into sets of 1-dimensional distributions with equal weight. The Max Sliced-Wasserstein distance (Max-SWD) [48] finds a single linear projection that maximizes the distance between the probability measures in the projected space. The Max Generalized Sliced-Wasserstein distance (Max-GSWD) [49] uses non-linear projections to find the critical projection direction, reducing the computational cost induced by the projection operations.

Fig. 2. Several variants based on the Wasserstein distance are used to measure distributional differences, taking a 2-dimensional distribution as an example. Here we only emphasize how the projection direction from the high-dimensional space to the 1-dimensional space is selected. (a) SWD randomly selects a large number of projection directions; (b) Max-SWD looks for a single special linear projection direction; (c) the pairwise subspace distance selects the respective principal component directions of the augmented views.
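
To make the slicing idea concrete, the following is a minimal NumPy sketch of the sliced Wasserstein-2 distance between two empirical samples. The function name, the number of slices, and the Monte-Carlo sampling of random unit directions are our illustrative assumptions, not the implementation of [45].

```python
import numpy as np

def sliced_wasserstein2(x, y, n_slices=64, rng=None):
    """Monte-Carlo sliced Wasserstein-2 distance between two point clouds
    x, y of shape (n, d): project onto random unit directions, then compare
    the sorted 1-D projections (sorting gives the 1-D optimal coupling)."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    # Random directions on the unit sphere, one per slice.
    dirs = rng.normal(size=(n_slices, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # 1-D projections of both samples: shape (n, n_slices).
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    # Average squared 1-D Wasserstein-2 distance over all slices.
    return np.mean((px - py) ** 2)

if __name__ == "__main__":
    x = np.random.randn(256, 128)
    y = np.random.randn(256, 128) + 0.5
    print(sliced_wasserstein2(x, y))
```

As the text notes, the quality of such an estimate depends on how many slices are drawn and on whether the random directions happen to be informative, which is the limitation PSD later avoids by choosing principal component directions instead.
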
3. Methodology

Given a training dataset, we consider a mini-batch sampled from it, denoted by $X=\{x_i\}_{i=1}^n$. Taking SimCLR as the baseline method, the general framework of contrastive learning consists of the following parts: (1) two augmented views, defined as $X^1=T^1(X)$ and $X^2=T^2(X)$, where $T^1$ and $T^2$ represent different data augmentation strategies and $T^1(X)$, $T^2(X)$ transform each sample to obtain the corresponding augmented samples $X^1=\{x_i^1\}_{i=1}^n$, $X^2=\{x_i^2\}_{i=1}^n$; (2) an encoder $f_\theta$ that encodes the augmented views to obtain the corresponding representations $y_i^1=f_\theta(x_i^1)$ and $y_i^2=f_\theta(x_i^2)$; (3) a projection head that maps the learned representations into the embedding spaces $\mathcal{Z}^1$ and $\mathcal{Z}^2$, so that $z_i^1=g_\xi(y_i^1)\in\mathcal{Z}^1$ and $z_i^2=g_\xi(y_i^2)\in\mathcal{Z}^2$. Typical CL methods employ noise contrastive estimation (NCE) as the loss function, which can be formulated as:

$$\mathcal{L}_{NCE}=\mathbb{E}_{z_i\sim\mathcal{Z}^1\cup\mathcal{Z}^2}\left[-\log\frac{e^{\,\mathrm{sim}(z_i,z_i^+)/\tau}}{e^{\,\mathrm{sim}(z_i,z_i^+)/\tau}+\sum_{z_j}e^{\,\mathrm{sim}(z_i,z_j)/\tau}}\right] \tag{1}$$

where $\tau$ is the temperature hyper-parameter and $\mathrm{sim}(\cdot,\cdot)$ is the similarity between samples (i.e., the cosine similarity). In objective (1), CL regards $z_i$ as the anchor, the pair $(z_i,z_i^+)$ as the positive pair, and the pairs $(z_i,z_j)$ as negatives, where $z_j\in(\mathcal{Z}^1\cup\mathcal{Z}^2)\setminus\{z_i,z_i^+\}$.
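
As a concrete reference, here is a minimal PyTorch-style sketch of the NT-Xent/InfoNCE objective of Eq. (1), assuming z1 and z2 hold the embeddings of the two augmented views of one mini-batch. The function and variable names are ours, and the default temperature is a placeholder.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """NT-Xent / InfoNCE loss over a batch of two augmented views.
    z1, z2: (n, d) embeddings of view 1 and view 2 (row i of each is the same image)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2n, d), unit-norm rows
    sim = z @ z.t() / tau                                        # cosine similarities / tau
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                   # exclude self-similarity
    # The positive of sample i is its other augmented view: i <-> i + n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

The cross-entropy over the masked similarity matrix is exactly the negative log-ratio in Eq. (1), averaged over all 2n anchors.
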
3.1. Paired subspace distance

This subsection introduces the proposed paired subspace distance (PSD) for measuring the distance between the two augmented-view distributions. The PSD is defined in the embedding space $\mathcal{Z}$ and aims to align the two distributions based on the eigenvectors of the covariance matrices of the two distributed datasets. Let $\mathcal{D}_Z^1$ and $\mathcal{D}_Z^2$ represent the distributions corresponding to the two augmented-view datasets $Z^1$ and $Z^2$ in the embedding space $\mathcal{Z}$.

Motivation. Traditional contrastive learning methods, such as SimCLR, can be regarded as an n-class classification problem in which each sample is treated as an individual class within a batch. Positive pairs are formed by different augmented views of the same sample and are assigned the same label. Wang et al. [41] pointed out that contrastive learning actually obeys two key principles, alignment and uniformity. Alignment is achieved by enforcing positive pairs to share similar representations in a low-dimensional space, that is, instance-level alignment, which ignores the semantic correlation between views. The learned representation originates from the semantic information shared by the different views, but the uncontrollable data augmentation strategy (such as cropping) makes the shared semantics incomplete, so that only the information in the common area is learned. We propose to use semantic-level alignment to tackle this problem of instance-level alignment methods. Although the augmented views are different in the input space, the distributions of the two views in the embedding space should be naturally aligned. By aligning the two augmented views of the batch, we expect the semantic distribution to contain more image semantics for semantic complementarity.

Some studies [21,22,50,51] have pointed out that the principal component directions, or orthogonal basis, of a sample set are closely related to its semantic information. The eigenvectors are obtained by matrix factorization of the covariance matrix of the training data, and the first k eigenvectors selected represent the k directions with the largest variance of the data distribution. According to information theory, the larger the variance of a data distribution, the higher its information entropy, indicating a larger amount of information. Therefore, we can regard the projection of the samples onto each eigenvector direction as semantics understood at the level of feature representations. Thus, we achieve semantic alignment by forcing the distributions of the different views of a batch to be aligned separately in all principal component directions.

We denote $Z^v=[z_1^v,\ldots,z_i^v,\ldots,z_n^v]^T$, $v=1,2$. The two covariance matrices $Z_{cov}^1$ and $Z_{cov}^2$ can be calculated as $Z_{cov}^1=(Z^1)^T Z^1$ and $Z_{cov}^2=(Z^2)^T Z^2$. In the embedding space $\mathcal{Z}$, the eigenvectors of each covariance matrix can be obtained by orthogonal decomposition. Thus, we have:

$$Z_{cov}^1 = U^1 S^1 (U^1)^T, \qquad Z_{cov}^2 = U^2 S^2 (U^2)^T \tag{2}$$

where $U^1=[u_1^1,\ldots,u_l^1,\ldots,u_k^1]$ and $U^2=[u_1^2,\ldots,u_l^2,\ldots,u_k^2]$ are the orthonormal eigenvector matrices, $u_l^1$ is the l-th orthogonal basis vector of view 1 and $u_l^2$ is that of view 2, $S^1$ and $S^2$ are the eigenvalue matrices corresponding to the eigenvector matrices, and k indicates that we finally select the k leading principal component directions of each view. $U^1$ and $U^2$ thus refer to the projection directions of the principal components. Denoting $\tilde{u}_{2l-1}=u_l^1$ and $\tilde{u}_{2l}=u_l^2$, we can rewrite the pooled set of directions as $U=\{\tilde{u}_1,\tilde{u}_2,\ldots,\tilde{u}_{2k-1},\tilde{u}_{2k}\}$.

We regard these projection directions as a coordinate basis. For the samples in $\mathcal{Z}^1$ and $\mathcal{Z}^2$, we obtain a new representation under this basis:

$$p_1^1,\ldots,p_l^1,\ldots,p_{2k}^1; \qquad p_1^2,\ldots,p_l^2,\ldots,p_{2k}^2 \tag{3}$$

where $p_l^1=Z^1\tilde{u}_l$ and $p_l^2=Z^2\tilde{u}_l$ are the projections of view 1 and view 2 onto the l-th direction, with $\tilde{u}_l\in U$. We can also write $p_l^1=[z_{l,1}^1,\ldots,z_{l,i}^1,\ldots,z_{l,n}^1]$ and $p_l^2=[z_{l,1}^2,\ldots,z_{l,i}^2,\ldots,z_{l,n}^2]$, where $\mathcal{Z}_l^1=\{z_{l,j}^1\}_{j=1}^n$ is the projection set of $\mathcal{Z}^1$ in the l-th principal component direction and $\mathcal{Z}_l^2=\{z_{l,i}^2\}_{i=1}^n$ is the projection set of $\mathcal{Z}^2$ in the l-th principal component direction.

Next, we calculate the disparity between the distributions of the two views by comparing them along the corresponding principal component directions. Considering the existence of non-overlapping low-dimensional manifolds, we design a new measure of the difference between $p_l^1$ and $p_l^2$. Define $\tilde{p}_l^v$ as the ordered (sorted) vector representation of $p_l^v$:

$$\tilde{p}_l^v=\left[\tilde{z}_{l,1}^v,\ldots,\tilde{z}_{l,n}^v\right] \tag{4}$$

where $\tilde{z}_{l,1}^v\ge\cdots\ge\tilde{z}_{l,i}^v\ge\cdots\ge\tilde{z}_{l,n}^v$, $\tilde{z}_{l,i}^v\in\mathcal{Z}_l^v$, and $v\in\{1,2\}$. As shown in Fig. 3, in the l-th principal component direction the difference between view 1 and view 2 is calculated by

$$\mathrm{dist}\left(p_l^1,p_l^2\right)=\left\|\tilde{p}_l^1-\tilde{p}_l^2\right\|_2^2 \tag{5}$$

In the experimental Section 4.7, we also compare the performance impact of whether the samples are sorted.

Fig. 3. The distribution measure of view 1 and view 2 in the direction of the l-th principal component. Orange and blue represent the projections of the view 1 and view 2 samples onto the l-th principal component direction, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Finally, we compute the pairwise subspace distance (PSD) between the two views in the embedding space by aggregating the distribution differences over all principal component directions:

$$\mathrm{dist}\left(\mathcal{Z}^1,\mathcal{Z}^2\right)=\frac{1}{2k}\sum_{l=1}^{2k}\mathrm{dist}\left(p_l^1,p_l^2\right) \tag{6}$$

Analysis. It is worth noting that PSD is very similar to SWD [45], but there are significant differences. As shown in Fig. 2, the idea of SWD is to obtain multiple one-dimensional distributions of a high-dimensional probability distribution through random projections and to combine all sliced Wasserstein distances with equal weights. The effectiveness of SWD therefore depends on the number and quality of the slices, and the importance of individual projections cannot be guaranteed by random slicing [45]. Max-SWD [48] finds a single linear projection that maximizes the distance between the probability measures in the projected space, which does not reveal more semantic or spatial structure information about the two distributions. By choosing the principal component projections of the samples within each single view rather than over all views, PSD reduces the projection bias caused by the view difference and preserves the semantic information within each view to a greater extent.
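
To make Eqs. (2)–(6) concrete, the sketch below follows the construction step by step: eigendecompose the covariance matrix of each view, pool the k leading eigenvectors of both views, project both views onto every pooled direction, sort each projection, and average the squared differences. The function name and tensor conventions are ours, and the absence of centering simply mirrors the $Z^T Z$ form of Eq. (2).

```python
import torch

def pairwise_subspace_distance(z1, z2, k=10):
    """Sketch of the pairwise subspace distance (PSD) of Eqs. (2)-(6).
    z1, z2: (n, d) embeddings of the two augmented views of one mini-batch."""
    def top_eigvecs(z, k):
        cov = z.t() @ z                                  # (d, d) covariance, as Z^T Z in Eq. (2)
        evals, evecs = torch.linalg.eigh(cov)            # eigenvalues in ascending order
        return evecs[:, -k:]                             # k leading eigenvectors, shape (d, k)

    u = torch.cat([top_eigvecs(z1, k), top_eigvecs(z2, k)], dim=1)   # pooled 2k directions, (d, 2k)

    p1 = z1 @ u                                          # Eq. (3): projections of view 1, (n, 2k)
    p2 = z2 @ u                                          # Eq. (3): projections of view 2, (n, 2k)

    p1, _ = torch.sort(p1, dim=0, descending=True)       # Eq. (4): ordered projections per direction
    p2, _ = torch.sort(p2, dim=0, descending=True)
    per_dir = ((p1 - p2) ** 2).sum(dim=0)                # Eq. (5): squared L2 distance per direction

    return per_dir.mean()                                # Eq. (6): average over the 2k directions
```
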

3.2. Semantic consistency regularization

Lastly, we integrate the proposed PSD into the contrastive loss as a regularization term. The objective function of contrastive learning with semantic consistency regularization (SCR) is defined as follows:

$$\min_f \; loss = loss_{cl} + \lambda \cdot loss_{sc} \tag{7}$$

where $loss_{cl}$ is a contrastive loss function such as the NCE loss, $loss_{sc}$ is the semantic consistency loss given by Eq. (6), and $\lambda$ is the regularization parameter that controls the relative importance of the two terms. Note that $loss_{cl}$ can also be any other loss function used in contrastive learning. The learning framework of the proposed SCR is shown in Fig. 4, and the complete training process is shown in Algorithm 1.

Fig. 4. The framework of contrastive learning with semantic consistency. Given a mini-batch $X=\{x_1,\ldots,x_i,\ldots,x_n\}$, we construct data pairs using two different augmentations $T^1,T^2$. We then extract features $Y^1,Y^2$ from the augmented datasets using a shared encoder $f$, and $g$ is a 3-layer multi-linear network that projects the learned representations into the low-dimensional spaces $\mathcal{Z}^1,\mathcal{Z}^2$. Finally, we calculate the contrastive loss and the paired subspace distance.
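
A hedged sketch of one training step under the objective of Eq. (7), reusing the info_nce and pairwise_subspace_distance sketches above; the paper's Algorithm 1 is not reproduced here, and the encoder, projector, optimizer, and hyper-parameter values are placeholders.

```python
import torch

def scr_training_step(x1, x2, encoder, projector, optimizer, lam=1e-3, k=10, tau=0.5):
    """One optimization step of Eq. (7): loss = loss_cl + lambda * loss_sc."""
    z1 = projector(encoder(x1))                               # embeddings of view 1
    z2 = projector(encoder(x2))                               # embeddings of view 2
    loss_cl = info_nce(z1, z2, tau=tau)                       # contrastive term, Eq. (1)
    loss_sc = pairwise_subspace_distance(z1, z2, k=k)         # semantic consistency term, Eq. (6)
    loss = loss_cl + lam * loss_sc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
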

3.3. Divide-and-conquer strategy

When faced with a large mini-batch in the training phase, the computational complexity of solving the eigenvalue problem of the covariance matrix becomes particularly large. To solve this problem, we propose a simple and effective strategy, the divide-and-conquer strategy. Suppose the batch size is N and two augmented views are employed. In the embedding space, we divide the large batch into M parts and define the sample set of the m-th part of view v as $Z_m^v$, $m\in\{1,\ldots,M\}$, $v\in\{1,2\}$. For each part, we obtain the l-th eigenvector $u_{l,m}^v$, $l\in\{1,\ldots,k\}$, of the m-th part of the v-th view separately. We then add all eigenvectors to the final set of eigenvectors $U=\{u_{l,m}^v\}$. Further, we obtain the projection of each view onto these principal component directions, $p_{l,m}^1=Z^1 u_{l,m}^v$ and $p_{l,m}^2=Z^2 u_{l,m}^v$. Finally, we use Eqs. (3)–(6) to calculate the difference between the distributions of the different views.

The divide-and-conquer strategy can be seen as a trick for reducing the time complexity of the singular value decomposition. We measure the time spent on one batch by SimCLR + SCR on the CIFAR-100 dataset with batch size 512 and k = 10. We randomly divide the samples of each augmented view into two parts in the embedding space and obtain the principal component directions of each part separately, which yields 40 projection directions. We then calculate the distribution difference of each view along these 40 directions. Table 1 shows the change in time and accuracy with and without this strategy. The small loss in accuracy is acceptable given the time saved.

Table 1
Effect of divide-and-conquer on accuracy and time spent per batch. w and w/o denote using and not using the strategy, respectively.

         w        w/o
top-1    64.78    65.02
Time     18 (s)   25 (s)
reduction in precision compared to the time it takes. (0.4; 0.4; 0.2; 0, 1) with probability 0.8, gaussian blurring with proba­
bility 1 and 0.1, and solarizing with the probability 0.1 and 0.2.
4. Experiments Architecture and Optimization. For small and medium datasets,
we use a ResNet-18 as the encoder function fθ , followed by a projection
Benchmark Datasets. We evaluate our proposed method by classi­ head gξ for all experiments where the projection head is a three-layer
fication tasks in computer vision on the six datasets: linear network. The output of the encoder function serves as the
learned representation for downstream tasks in the testing phase. While
• The CIFAR-10 dataset [52] consists of 60,000 color images in 10 for ImageNet-100 and ImageNet-1K datasets, all methods are imple­
classes, with 6,000 images in each class. There are 50,000 training mented by a ResNet-50.
images and 10,000 test images. For small and medium datasets, we use the LARS optimizer [56] to
• The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 train the encoder fθ for 200 epochs with a mini-batch size of 128. A basic
classes with 600 images in each class. There are 500 training images learning rate of 8 with a cosine decay schedule without restart is applied
and 100 testing images per class. These 100 classes are further to all datasets. After a learning rate warm-up period of 10 epochs, we use
grouped into 20 superclasses. a cosine decay schedule to reduce the learning rate by a factor of 1000.
• The STL-10 dataset [53] is a subset of labeled examples from For ImageNet-1K, we use Stochastic Gradient Descent (SGD) with a
ImageNet. There are 5,000 labeled training images (500 in each momentum of 0.9 to minimize our objective functions. The self-
class) and 8,000 labeled testing images (800 in each class). An supervised pre-training performed on the ImageNet-1K datasets uses a
single machine with eight GPUs (A100). The models are trained for 100/
200/400 epochs based on the cosine annealing schedule and Automatic
Table 1
Effect of divide-and-conquer on experimental accuracy and time spent. w and w/ Mixed-Precision.
o represent using or not using the strategy respectively. Linear evaluation. We employ the widely adopted linear evaluation
protocol to evaluate the trained model. First, we freeze the learned
top-1 Time top-1 Time
encoder using the Adam optimizer [57] to train a supervised linear
w 64.78 18(s) 65.02 25(s) w/o classifier on the dataset for 200 epochs, where the linear classifier is a
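
For reference, the recipe above roughly corresponds to the following torchvision pipeline. The crop size, the grayscale probability, and the solarization threshold are assumptions not specified in the text, and the two views differ only in their blur and solarization probabilities as stated.

```python
from torchvision import transforms

def make_view_transform(blur_p, solarize_p, size=32):
    """One of the two view-generating pipelines described in the text.
    The stated settings are blur_p=1.0, solarize_p=0.1 for one view and
    blur_p=0.1, solarize_p=0.2 for the other; size depends on the dataset."""
    return transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),                       # grayscale probability assumed
        transforms.RandomApply([transforms.GaussianBlur(kernel_size=3)], p=blur_p),
        transforms.RandomSolarize(threshold=128, p=solarize_p),  # threshold assumed
        transforms.ToTensor(),
    ])

view1_t = make_view_transform(blur_p=1.0, solarize_p=0.1)
view2_t = make_view_transform(blur_p=0.1, solarize_p=0.2)
```
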

Architecture and Optimization. For the small and medium datasets, we use a ResNet-18 as the encoder function $f_\theta$, followed by a projection head $g_\xi$ in all experiments, where the projection head is a three-layer linear network. The output of the encoder serves as the learned representation for downstream tasks in the testing phase. For the ImageNet-100 and ImageNet-1K datasets, all methods are implemented with a ResNet-50.

For the small and medium datasets, we use the LARS optimizer [56] to train the encoder $f_\theta$ for 200 epochs with a mini-batch size of 128. A base learning rate of 8 with a cosine decay schedule without restarts is applied to all datasets: after a learning-rate warm-up period of 10 epochs, we use a cosine decay schedule to reduce the learning rate by a factor of 1000. For ImageNet-1K, we use Stochastic Gradient Descent (SGD) with a momentum of 0.9 to minimize our objective functions. The self-supervised pre-training on ImageNet-1K uses a single machine with eight GPUs (A100). The models are trained for 100/200/400 epochs with a cosine annealing schedule and Automatic Mixed Precision.

Linear evaluation. We employ the widely adopted linear evaluation protocol to evaluate the trained model. First, we freeze the learned encoder and use the Adam optimizer [57] to train a supervised linear classifier on the dataset for 200 epochs, where the linear classifier is a fully connected layer that replaces the projection head $g_\xi$ used in the training phase. The learning rate of the optimizer decays from $3\times 10^{-1}$ to $3\times 10^{-5}$, and the weight decay is set to $10^{-6}$. Finally, the linear classifier's top-1 (or top-5) classification accuracy is used as the final evaluation criterion.
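
A minimal sketch of this linear evaluation protocol: the pre-trained encoder is frozen, the projection head is replaced by a single fully connected layer, and only that layer is trained with Adam. The cosine form of the learning-rate decay, the feature dimension, and the data loader are assumptions.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim=512, n_classes=10,
                 epochs=200, device="cuda"):
    """Linear evaluation: the encoder is frozen, only the classifier is trained."""
    encoder.eval().to(device)
    for p in encoder.parameters():
        p.requires_grad_(False)                        # freeze the learned representation
    clf = nn.Linear(feat_dim, n_classes).to(device)    # replaces the projection head
    opt = torch.optim.Adam(clf.parameters(), lr=3e-1, weight_decay=1e-6)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=3e-5)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feats = encoder(x)                     # frozen features
            loss = ce(clf(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                                   # decay from 3e-1 towards 3e-5
    return clf
```
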
4.1. Comparison with the state of the art

We verify the effectiveness of the proposed SCR on image classification tasks. The methods used for comparison include DIM [29], SimCLR [6], Debiased Contrastive Learning (DCL) [19], Hard Negative based Contrastive Learning (HCL) [16], the clustering-based method SwAV [33], BYOL [8], W-MSE [42], Barlow Twins (BT) [58], RINCE [43] and MetAug [37]. To analyze whether distribution constraints can improve existing contrastive learning methods, we combine the existing contrastive losses with the proposed SCR, yielding SimCLR + SCR, DCL + SCR, BT + SCR, and HCL + SCR. For the experiments on all datasets, we set k = 5 or 10 and λ = 0.001.

Table 2 shows the top-1 classification accuracy of all methods on four small and medium-sized datasets. From the results, we can see that adding the semantic constraint to traditional methods clearly improves classification accuracy. For example, after adding the proposed regularization term, SimCLR improves by 1.1% on CIFAR-100, DCL improves by 1.4% on STL-10, and HCL improves by more than 2% on Tiny ImageNet. These results demonstrate the effectiveness of the semantic consistency regularization term.

Table 2
Classification accuracy (top-1, %) of a linear classifier for all methods on four small and medium datasets with a ResNet-18 encoder. For all datasets, we train for 200 epochs.

Methods        CIFAR-10   CIFAR-100   STL-10   Tiny ImageNet
DIM            74.52      58.76       72.31    43.04
SwAV           73.56      59.81       73.31    43.17
SimCLR         79.62      60.33       74.12    45.26
DCL            80.56      62.73       74.58    45.81
BT             78.50      51.73       68.25    44.73
HCL            80.81      62.91       75.11    46.12
W-MSE          79.21      58.71       74.33    44.93
RINCE          80.41      62.13       74.72    45.81
MetAug         81.32      62.78       75.71    46.62
SimCLR + SCR   79.93      61.49       74.75    45.77
DCL + SCR      81.40      63.26       75.96    47.69
BT + SCR       79.06      52.43       69.50    46.01
HCL + SCR      82.40      64.15       76.81    48.66
W-MSE + SCR    80.75      59.55       74.71    45.03
RINCE + SCR    80.62      62.35       75.11    46.29
MetAug + SCR   82.62      63.86       76.51    47.32

Table 3 shows the classification results on the larger ImageNet-100 dataset. As shown in the table, HCL + SCR achieves the best result, more than 2% better than HCL, while Barlow Twins improves only slightly. In addition, we compare linear classification with a ResNet-50 model pre-trained on the ImageNet-1K dataset. Table 4 shows the classification results of several methods for 100, 200, and 400 epochs. Among all the methods used for comparison, BYOL achieves the highest accuracy, and the accuracy of our proposed SimCLR + SCR is 0.2%–0.7% higher than BYOL.

Table 3
Classification accuracy (top-1, %) on ImageNet-100. All results are based on a ResNet-50 encoder.

Methods   top-1    Methods (+SCR)   top-1
SimCLR    70.15    SimCLR + SCR     72.56
DCL       72.12    DCL + SCR        75.14
BT        69.89    BT + SCR         69.94
HCL       73.45    HCL + SCR        75.89

Table 4
Classification accuracy (top-1, %) of a linear classifier for different methods on ImageNet-1K. All results are based on a ResNet-50 encoder.

Methods        100 epochs   200 epochs   400 epochs
SimCLR         66.5         68.3         70.4
BYOL           66.5         70.6         73.2
BT             66.5         69.1         70.7
SimCLR + SCR   67.2         71.1         73.4

Following the toy experimental settings in ICL-MSR [20], we also perform the same experiment on the COCO dataset. The COCO dataset provides ground-truth bounding boxes of the objects in images, so the area inside the bounding box of each image is considered the foreground image. We use four methods: SimCLR, SimCLR + SCR, MetAug, and MetAug + SCR. The experiments are divided into two groups, FU (training and testing on full images) and FG (training on full images and testing on foreground images). Table 5 shows the experimental results. In group FU, the results show that the proposed regularization term is helpful for downstream classification tasks. The results of group FG show that the proposed regularization term does mine semantic information, i.e., information related to the foreground, during the training phase.

Table 5
Semantic experiment results (top-1, %) of four different methods. FU: training and testing on full images; FG: training on full images and testing on foreground images.

Group   SimCLR   SimCLR + SCR   MetAug   MetAug + SCR
FU      37.83    39.13          41.45    43.25
FG      34.13    36.52          40.23    42.83

4.2. Semi-supervised evaluations on ImageNet-1K dataset

We further evaluate the fine-tuning performance of the self-supervised pre-trained ResNet-50 on subsets of the ImageNet-1K dataset. For a fair comparison, we use the 1% and 10% subsets of the labels randomly selected by [6]. We fine-tune the models on these two subsets for 50 epochs with a classifier learning rate of 1.0 (0.1) and a backbone learning rate of 0.0001 (0.01) for the 1% (10%) subset, decayed by a factor of 10 at the 30th and 40th epochs. Table 6 reports the semi-supervised results obtained on ImageNet-1K against the existing methods (SimCLR, BYOL and SwAV). The results of our proposed methods (+SCR) are about 2% better than the existing methods.

Table 6
Semi-supervised learning using 1% and 10% of the labeled training samples on ImageNet-1K. Best results (%) are in bold.

Methods        Epochs   1% top-1   1% top-5   10% top-1   10% top-5
SimCLR         1000     48.3       75.5       65.6        87.8
BYOL           800      53.2       78.4       68.8        89.1
SwAV           800      53.9       78.5       70.2        89.9
SimCLR + SCR   1000     51.1       77.9       67.6        89.4
BYOL + SCR     800      55.9       81.7       70.7        91.3

4.3. Parametric sensitivities

We experimentally investigate the choice of model parameters in this section. The loss function loss includes two important parameters, the number of projection directions k and the regularization parameter λ. We assume that the few principal component directions selected carry a large amount of semantic information: if there are too few directions, the information is insufficient; on the contrary, if there are too many, the structural information becomes redundant. λ controls the trade-off between discrimination and distributional consistency.

If the loss of the model is biased towards consistency, the learned representation will perform poorly on downstream tasks. To test the above hypothesis, we set k in {1, 5, 10, 20, 30} and λ in {10^0, 10^-1, 10^-2, 10^-3}, respectively.

Table 7 shows the classification accuracy of SimCLR + SCR on the CIFAR-10 dataset under different parameter choices. Under a fixed λ, the classification accuracy first increases and then decreases with increasing k, and the number of projection directions k significantly affects the experimental results. This observation confirms our hypothesis that an inadequate or excessive number of projection directions leads to inferior learning performance. Similarly, as the parameter λ increases, the classification accuracy initially improves but gradually declines thereafter. In contrast to the traditional contrastive loss, which solely focuses on view consistency, the distribution constraint employed here lacks the capability of instance discrimination. Consequently, we introduce the proposed distribution constraint as a regularization term integrated into the loss function.

Table 7
Parametric sensitivities of the number of projection directions k and the hyperparameter λ.

k\λ   10^0    10^-1   10^-2   10^-3
1     51.15   75.12   77.94   76.31
5     57.13   75.49   78.52   76.84
10    53.88   72.98   76.14   73.15
20    51.02   69.93   76.67   72.73
30    50.13   67.37   75.89   70.76

4.4. Study of PSD

We further conduct experiments to study the convergence of the proposed PSD loss. As shown in Fig. 5, the PSD loss on the CIFAR-10 dataset converges after 100 epochs. The results show that the difference in distribution between the augmented views gradually decreases during training.

Fig. 5. The trend of PSD on the CIFAR-10 dataset over 200 epochs using SimCLR + SCR.

4.5. Ablation studies

Further experiments were conducted to examine the individual effects of the different components of the proposed method. The loss function comprises two distinct parts, the contrastive loss loss_cl and the regularization loss loss_sc, with loss = loss_cl + loss_sc. Fig. 6 shows the relationship between the different losses and accuracy on four datasets, where the contrastive loss loss_cl uses a SimCLR-style loss function. The accuracy obtained with loss is better than that obtained with only loss_cl on all datasets, which is consistent with the results in Table 2 and Table 3 and demonstrates the effectiveness of the proposed PSD. Importantly, using only loss_sc yields the lowest accuracy, suggesting that the learned representations lack sufficient discriminative power. The distribution constraint, which emphasizes the consistency of all samples between the two views, differs from instance discrimination. Consequently, the proposed distribution constraint is more suitable as a regularization term that enhances existing methods than as a standalone loss function.

Fig. 6. Classification accuracy with different loss functions loss, loss_cl, and loss_sc.

4.6. Analysis of disturbance

This section investigates the effects of the perturbations introduced by the data augmentation strategies on the experimental results. Firstly, we set different perturbations and observe the change in classification accuracy. We employ the same data augmentation strategy and the same parameter settings in each set of experiments, except for the Gaussian blur: its probability is set in {0.2, 0.5, 0.8}, with different standard deviations σ. Table 9 shows the results of SimCLR + SCR on the STL-10 dataset. From the experimental results, it can be seen that a higher degree of perturbation does lead to lower performance on downstream tasks, which is consistent with our intuition that training data quality affects downstream tasks.

Table 9
The influence of different degrees of disturbance introduced by data augmentation on the experimental results.

σ\p   0.2     0.5     0.8
2     76.85   77.07   77.21
5     76.42   76.90   76.83
8     75.93   75.61   76.14

Then, we compare the classification accuracy under different augmentation strategies, as shown in Table 8. The methods used in this experiment on the CIFAR-10 dataset are SimCLR + SCR, DCL + SCR and HCL + SCR. As can be seen, there is no significant difference in classification accuracy, indicating that our method can be applied with different augmentation strategies. In other words, our method is insensitive to the choice of augmentation strategy.

4.7. Comparison of distribution metrics

In this paper, we adopt the proposed PSD as the measure of distribution difference for constraining semantic consistency. We further compare PSD with existing distribution metrics to investigate the performance on the classification task. The methods used for comparison include the KL divergence, the JS divergence, SWD, Max-SWD, and SCR*. SCR* is a variant of SCR that does not order the samples in the principal component directions. In practical applications, since the distributions of the two augmented views are not completely non-overlapping in the projection space, the KL divergence and the JS divergence can still be used to measure the difference between the views. We choose SimCLR as the benchmark method. Table 10 shows all the experimental results with different distribution constraints on the CIFAR-10 dataset. The results show that distribution constraints between views improve the performance of the model on downstream tasks, and that the +SCR method outperforms the other distribution metrics. In addition, the results of +SCR* are lower than those of +SCR, illustrating the necessity of sorting.

Table 10
Top-1 and top-5 accuracy (%) of various distribution constraints on the CIFAR-10 dataset, trained and tested for 200 epochs.

Methods            top-1   top-5
SimCLR             78.86   99.21
SimCLR + KL        80.52   99.34
SimCLR + JS        80.61   99.32
SimCLR + SWD       80.43   99.30
SimCLR + Max-SWD   80.32   99.38
SimCLR + SCR*      80.19   99.21
SimCLR + SCR       81.43   99.43

Table 8
Comparison of different data augmentations using the ResNet-18 backbone on CIFAR-10. Method+ refers to method + SCR.
ID Data augmentations Methods

Horizontal flip rotate random crop random gray color jitter mixup SimCLR + DCL + HCL+

1 ✓ ✓ 79.51 80.80 82.11


2 ✓ 79.61 81.22 82.32
3 ✓ 79.13 80.71 81.64
4 ✓ 79.20 81.11 81.91
5 ✓ ✓ 79.32 81.05 82.12
6 ✓ ✓ 79.52 81.32 82.01
7 ✓ ✓ ✓ 79.57 81.23 82.22
8 ✓ ✓ ✓ ✓ ✓ ✓ 79.91 81.34 82.34

5. Conclusion

In this paper, we propose semantic consistency regularization (SCR) to relieve the semantic inconsistency problem in contrastive learning. Unlike existing metrics for measuring distribution distance, we introduce a novel metric called the paired subspace distance (PSD), which calculates the distribution distance between two corresponding views. The proposed PSD can be integrated into the standard framework of contrastive learning methods. Experimental results show that SCR outperforms previous methods on self-supervised and semi-supervised classification tasks on multi-scale datasets.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgement

This work was supported by the National Key R&D Program of China (2021YFB3500700), NSFC Grant 62172026, National Social Science Fund of China 22&ZD153, and SKLSDE.

References

[1] X. Su, L. Si, W. Qiang, J. Yu, F. Wu, C. Zheng, F. Sun, Intriguing property and counterfactual explanation of GAN for remote sensing image generation, arXiv preprint arXiv:2303.05240, 2023.
[2] D. Chen, J. Hu, W. Qiang, X. Wei, E. Wu, Rethinking skip connection model as a learnable Markov chain, arXiv preprint arXiv:2209.15278, 2022.
[3] W. Qiang, J. Zhang, L. Zhen, L. Jing, Robust weighted linear loss twin multi-class support vector regression for large-scale classification, Signal Process. 170 (2020) 107449.
[4] W. Qiang, H. Zhang, J. Zhang, L. Jing, TSVM-M3: twin support vector machine based on multi-order moment matching for large-scale multi-class classification, Appl. Soft Comput. 128 (2022) 109506.
[5] W. Qiang, J. Li, B. Su, J. Fu, H. Xiong, J.-R. Wen, Meta attention-generation network for cross-granularity few-shot learning, Int. J. Comput. Vis. 131 (2023) 1211–1233.
[6] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 1597–1607.
[7] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[8] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., Bootstrap your own latent: a new approach to self-supervised learning, Adv. Neural Inf. Proces. Syst. 33 (2020) 21271–21284.
[9] Z. Wen, Y. Li, Toward understanding the feature learning process of self-supervised contrastive learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 11112–11122.
[10] M. Patacchiola, A.J. Storkey, Self-supervised relational reasoning for representation learning, Adv. Neural Inf. Proces. Syst. 33 (2020) 4003–4014.
[11] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, N. Saunshi, A theoretical analysis of contrastive unsupervised representation learning, in: 36th International Conference on Machine Learning, ICML 2019, International Machine Learning Society (IMLS), 2019, pp. 9904–9923.
[12] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2, IEEE, 2006, pp. 1735–1742.
[13] A.V.D. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, 2018.
[14] A. Dosovitskiy, J.T. Springenberg, M. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, Adv. Neural Inf. Proces. Syst. 27 (2014).
[15] Y. Li, P. Hu, Z. Liu, D. Peng, J.T. Zhou, X. Peng, Contrastive clustering, in: 2021 AAAI Conference on Artificial Intelligence (AAAI), 2021.
[16] J.D. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Contrastive learning with hard negative samples, in: International Conference on Learning Representations, 2020.
[17] Z. Wu, Y. Xiong, S.X. Yu, D. Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[18] M. Wu, M. Mosse, C. Zhuang, D. Yamins, N. Goodman, Conditional negative sampling for contrastive learning of visual representations, in: International Conference on Learning Representations, 2020.
[19] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, S. Jegelka, Debiased contrastive learning, Adv. Neural Inf. Proces. Syst. 33 (2020) 8765–8775.
[20] W. Qiang, J. Li, C. Zheng, B. Su, H. Xiong, Interventional contrastive learning with meta semantic regularizer, in: International Conference on Machine Learning, PMLR, 2022, pp. 18018–18030.
[21] S.T. Dumais, et al., Latent semantic analysis, Annu. Rev. Inf. Sci. Technol. 38 (2004) 188–230.
[22] D.I. Martin, M.W. Berry, Mathematical foundations behind latent semantic analysis, in: Handbook of Latent Semantic Analysis, 2007, pp. 35–55.
[23] H. Abdi, L.J. Williams, Principal component analysis, Wiley Interdiscip. Rev. Comput. Stat. 2 (2010) 433–459.
[24] H. Bao, L. Dong, F. Wei, BEiT: BERT pre-training of image transformers, arXiv preprint arXiv:2106.08254, 2021.

[25] K. Purohit, M. Suin, A. Rajagopalan, V.N. Boddeti, Spatially-adaptive image restoration using distortion-guided networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2309–2319.
[26] X. Guo, H. Yang, D. Huang, Image inpainting via conditional texture and structure dual generation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14134–14143.
[27] X. Zhan, X. Pan, B. Dai, Z. Liu, D. Lin, C.C. Loy, Self-supervised scene de-occlusion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3784–3792.
[28] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, in: European Conference on Computer Vision, Springer, 2016, pp. 577–593.
[29] R.D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, in: International Conference on Learning Representations, 2018.
[30] Y. Tian, D. Krishnan, P. Isola, Contrastive multiview coding, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, Springer, 2020, pp. 776–794.
[31] Y.-H.H. Tsai, H. Zhao, M. Yamada, L.-P. Morency, R.R. Salakhutdinov, Neural methods for point-wise dependency estimation, Adv. Neural Inf. Proces. Syst. 33 (2020) 62–72.
[32] X. Chen, K. He, Exploring simple siamese representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
[33] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, Adv. Neural Inf. Proces. Syst. 33 (2020) 9912–9924.
[34] J. Li, P. Zhou, C. Xiong, S. Hoi, Prototypical contrastive learning of unsupervised representations, in: International Conference on Learning Representations, 2020.
[35] J. Zhang, K. Ma, Rethinking the augmentation module in contrastive learning: learning hierarchical augmentation invariance with expanded views, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16650–16659.
[36] T. Xiao, X. Wang, A.A. Efros, T. Darrell, What should not be contrastive in contrastive learning, in: International Conference on Learning Representations, 2020.
[37] J. Li, W. Qiang, C. Zheng, B. Su, H. Xiong, MetAug: contrastive learning via meta feature augmentation, arXiv preprint arXiv:2203.05119, 2022.
[38] Y. Wang, J. Lin, J. Zou, Y. Pan, T. Yao, T. Mei, Improving self-supervised learning with automated unsupervised outlier arbitration, Adv. Neural Inf. Proces. Syst. 34 (2021) 27617–27630.
[39] S. Ge, S. Mishra, C.-L. Li, H. Wang, D. Jacobs, Robust contrastive learning using negative samples with diminished semantics, Adv. Neural Inf. Proces. Syst. 34 (2021) 27356–27368.
[40] S. Chen, G. Niu, C. Gong, J. Li, J. Yang, M. Sugiyama, Large-margin contrastive learning with distance polarization regularizer, in: International Conference on Machine Learning, PMLR, 2021, pp. 1673–1683.
[41] T. Wang, P. Isola, Understanding contrastive representation learning through alignment and uniformity on the hypersphere, in: International Conference on Machine Learning, PMLR, 2020, pp. 9929–9939.
[42] A. Ermolov, A. Siarohin, E. Sangineto, N. Sebe, Whitening for self-supervised representation learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 3015–3024.
[43] C.-Y. Chuang, R.D. Hjelm, X. Wang, V. Vineet, N. Joshi, A. Torralba, S. Jegelka, Y. Song, Robust contrastive learning against noisy views, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16670–16681.
[44] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved training of Wasserstein GANs, Adv. Neural Inf. Proces. Syst. 30 (2017).
[45] S. Kolouri, G.K. Rohde, H. Hoffmann, Sliced Wasserstein distance for learning Gaussian mixture models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3427–3436.
[46] W. Qiang, J. Li, C. Zheng, B. Su, H. Xiong, Robust local preserving and global aligning network for adversarial domain adaptation, IEEE Trans. Knowl. Data Eng. (2021).
[47] W. Qiang, J. Li, C. Zheng, B. Su, Auxiliary task guided mean and covariance alignment network for adversarial domain adaptation, Knowl.-Based Syst. 223 (2021) 107066.
[48] I. Deshpande, Y.-T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, A.G. Schwing, Max-sliced Wasserstein distance and its use for GANs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10648–10656.
[49] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, G. Rohde, Generalized sliced Wasserstein distances, Adv. Neural Inf. Proces. Syst. 32 (2019).
[50] X. Chen, S. Wang, J. Wang, M. Long, Representation subspace distance for domain adaptation regression, in: International Conference on Machine Learning, PMLR, 2021, pp. 1749–1759.
[51] G.H. Golub, C.F. Van Loan, Matrix Computations, JHU Press, 2013.
[52] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, 2009.
[53] A. Coates, A. Ng, H. Lee, An analysis of single-layer networks in unsupervised feature learning, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 215–223.
[54] Y. Le, X. Yang, Tiny ImageNet visual recognition challenge, CS 231N 7, 2015, p. 3.
[55] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252.
[56] Y. You, I. Gitman, B. Ginsburg, Large batch training of convolutional networks, arXiv preprint arXiv:1708.03888, 2017.
[57] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[58] J. Zbontar, L. Jing, I. Misra, Y. LeCun, S. Deny, Barlow twins: self-supervised learning via redundancy reduction, in: International Conference on Machine Learning, PMLR, 2021, pp. 12310–12320.
