
Contrastive Learning methods for Graph

Representation Learning

Chandan Kumar G P

M Tech AI
Indian Institute of Science
Bangalore

July 21, 2023



Graph representation learning:

The aim of graph representation learning is to learn effective representations of graphs.

The main goal is to map each node in a graph to a vector representation in a continuous vector space, commonly referred to as an embedding.

Some representation learning methods are Node2Vec, Graph Neural Networks (GNNs), and Graph Autoencoders.

Some applications are Node Classification, Link Prediction, Graph Clustering and Community Detection, etc.



Contrastive Learning:
Contrastive learning is a self-supervised representation learning method.
Contrastive learning in computer vision:

Figure 1



Contrastive learning in graphs
Generate two different views from a single graph and learn better representations.

Figure 2

Encode the views with encoders and use a contrastive loss to learn better representations.



Contrastive Learning framework:
Let G = (V, E) denote a graph, where V = {v_1, v_2, …, v_N} and E ⊆ V × V represent the node set and the edge set respectively; X ∈ R^{N×F} and A ∈ {0, 1}^{N×N} denote the feature matrix and the adjacency matrix respectively.

Our objective is to learn a GNN encoder f(X, A) ∈ R^{N×F'} with F' ≪ F, i.e., one that produces low-dimensional node embeddings.

Figure 3: Illustrative model
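For concreteness, here is a minimal sketch of such an encoder: a two-layer GCN operating on a dense, symmetrically normalized adjacency matrix. The class name TwoLayerGCN, the layer sizes, and the dense-adjacency formulation are assumptions for illustration, not the specific encoder used by the methods discussed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Minimal GCN encoder f(X, A) -> Z in R^{N x F'} (illustrative sketch)."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, out_dim)

    @staticmethod
    def normalize_adj(A: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
        A_hat = A + torch.eye(A.size(0), device=A.device)
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        return A_hat * d_inv_sqrt.unsqueeze(0) * d_inv_sqrt.unsqueeze(1)

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_norm = self.normalize_adj(A)
        h = F.relu(A_norm @ self.lin1(X))   # first propagation layer
        return A_norm @ self.lin2(h)        # low-dimensional node embeddings
```

For example, TwoLayerGCN(F, 256, 128)(X, A) would map an N×F feature matrix and dense adjacency matrix to 128-dimensional node embeddings.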



Loss function: for each positive pair (u_i, v_i),

\[
\ell(u_i, v_i) = \log \frac{e^{\theta(u_i, v_i)/\tau}}
{\underbrace{e^{\theta(u_i, v_i)/\tau}}_{\text{positive pair}}
 + \underbrace{\sum_{k \neq i} e^{\theta(u_i, v_k)/\tau}}_{\text{inter-view negative pairs}}
 + \underbrace{\sum_{k \neq i} e^{\theta(u_i, u_k)/\tau}}_{\text{intra-view negative pairs}}}
\]

where τ is a temperature parameter, θ(u, v) = s(g(u), g(v)), s(·, ·) is the cosine similarity, and g(·) is a nonlinear projection (implemented with a two-layer perceptron model).

The overall objective to be maximized is the average over all positive pairs,

\[
\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \big[ \ell(u_i, v_i) + \ell(v_i, u_i) \big]
\]
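A minimal PyTorch sketch of this objective, assuming two row-aligned embedding matrices U and V from the two views (row i of each forms a positive pair); the function name grace_loss and the projection head g are illustrative, not fixed by the slides.

```python
import torch
import torch.nn.functional as F

def grace_loss(U: torch.Tensor, V: torch.Tensor, g: torch.nn.Module, tau: float = 0.5) -> torch.Tensor:
    """Symmetric InfoNCE-style contrastive loss over two views (illustrative sketch)."""
    # theta(u, v) = cosine similarity of nonlinearly projected embeddings
    hu = F.normalize(g(U), dim=1)
    hv = F.normalize(g(V), dim=1)

    def one_side(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        inter = torch.exp(a @ b.t() / tau)   # a_i vs all b_k; diagonal entries are positive pairs
        intra = torch.exp(a @ a.t() / tau)   # a_i vs all a_k; off-diagonals are intra-view negatives
        pos = inter.diag()
        denom = inter.sum(dim=1) + intra.sum(dim=1) - intra.diag()
        return torch.log(pos / denom)        # one term per positive pair

    # Average over all positive pairs in both directions; negated so that
    # minimizing this value maximizes the objective stated above.
    return -0.5 * (one_side(hu, hv) + one_side(hv, hu)).mean()
```

Here g could be, for example, torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.ELU(), torch.nn.Linear(d, d)), matching the two-layer perceptron projection described above.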



Adaptive Graph Augmentation:

This augmentation scheme tends to keep important structures and attributes unchanged, while perturbing possibly unimportant links and features.

Topology-level augmentation: we sample a modified edge set Ẽ from the original E, keeping each edge (u, v) with probability

\[
P\big((u, v) \in \tilde{E}\big) = 1 - p^{e}_{uv}
\]

The removal probability p^e_{uv} should reflect the importance of (u, v): important edges should be dropped only rarely.

We define edge centrality as the average of the two adjacent nodes' centrality scores, i.e., w^e_{uv} = (ϕ_c(u) + ϕ_c(v))/2.

On directed graphs, we simply use the centrality of the tail node, i.e., w^e_{uv} = ϕ_c(v).

\[
s^{e}_{uv} = \log w^{e}_{uv}, \qquad
p^{e}_{uv} = \min\!\left( \frac{s^{e}_{\max} - s^{e}_{uv}}{s^{e}_{\max} - \mu^{e}_{s}} \cdot p_e,\; p_\tau \right)
\]

where s^e_max and μ^e_s are the maximum and the mean of the s^e_{uv} values, p_e is a hyperparameter controlling the overall edge-removal probability, and p_τ < 1 is a cut-off that truncates extremely high removal probabilities so the graph structure is not over-corrupted.
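As a rough sketch of how these removal probabilities might be computed with degree centrality as ϕ_c; the function name edge_drop_probs and the default values of p_e and p_τ are assumptions, not taken from the slides.

```python
import torch

def edge_drop_probs(edge_index: torch.Tensor, num_nodes: int,
                    p_e: float = 0.3, p_tau: float = 0.7) -> torch.Tensor:
    """Per-edge removal probabilities p^e_{uv} from degree centrality (illustrative sketch).

    edge_index: LongTensor of shape [2, E], listing each undirected edge (u, v) once.
    """
    # Degree centrality as phi_c
    deg = torch.zeros(num_nodes)
    deg.scatter_add_(0, edge_index.reshape(-1), torch.ones(edge_index.numel()))

    u, v = edge_index[0], edge_index[1]
    w_e = (deg[u] + deg[v]) / 2.0          # edge centrality: mean of endpoint centralities
    s_e = torch.log(w_e + 1e-8)            # log damps the effect of heavy-tailed degrees

    s_max, s_mean = s_e.max(), s_e.mean()
    p = (s_max - s_e) / (s_max - s_mean + 1e-8) * p_e
    return torch.clamp(p, max=p_tau)       # cut-off: never drop an edge with probability > p_tau

# Keep each edge with probability 1 - p^e_{uv}, e.g.:
# keep_mask = torch.rand(edge_index.size(1)) > edge_drop_probs(edge_index, N)
```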



We can use Degree centrality, Eigenvector centrality, or PageRank centrality as the centrality measure ϕ_c.

Node-attribute-level augmentation: we add noise to node attributes by randomly masking a fraction of dimensions of the node features with zeros, masking the i-th dimension with probability p^f_i.

The probability p^f_i should reflect the importance of the i-th dimension of the node features.

For each feature dimension we calculate a weight

\[
w^{f}_{i} = \sum_{u \in V} |x_{u,i}| \cdot \phi_c(u)
\]

and compute the masking probability, analogously to the topology-level case, as

\[
s^{f}_{i} = \log w^{f}_{i}, \qquad
p^{f}_{i} = \min\!\left( \frac{s^{f}_{\max} - s^{f}_{i}}{s^{f}_{\max} - \mu^{f}_{s}} \cdot p_f,\; p_\tau \right)
\]
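A matching sketch for the feature-masking probabilities, again with degree centrality as ϕ_c; feature_mask_probs and its default values are hypothetical.

```python
import torch

def feature_mask_probs(X: torch.Tensor, deg: torch.Tensor,
                       p_f: float = 0.3, p_tau: float = 0.7) -> torch.Tensor:
    """Per-dimension masking probabilities p^f_i (illustrative sketch).

    X: [N, F] node feature matrix; deg: [N] node centrality scores (e.g. degrees).
    """
    w_f = (X.abs() * deg.unsqueeze(1)).sum(dim=0)   # w^f_i = sum_u |x_{u,i}| * phi_c(u)
    s_f = torch.log(w_f + 1e-8)
    s_max, s_mean = s_f.max(), s_f.mean()
    p = (s_max - s_f) / (s_max - s_mean + 1e-8) * p_f
    return torch.clamp(p, max=p_tau)

# Masking: zero out dimension i of every node with probability p^f_i, e.g.:
# mask = torch.rand(X.size(1)) >= feature_mask_probs(X, deg)
# X_aug = X * mask.float()
```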



Canonical Correlation Analysis-based Contrastive Learning:

This introduces a non-contrastive and non-discriminative objective for self-supervised learning, which is inspired by Canonical Correlation Analysis (CCA) methods.
Canonical Correlation Analysis: for two random variables x ∈ R^m and y ∈ R^n, their cross-covariance matrix is Σ_xy = Cov(x, y).

CCA aims at seeking two vectors a ∈ R^m and b ∈ R^n such that the correlation

\[
\rho = \operatorname{corr}(a^{\top} x,\, b^{\top} y)
     = \frac{a^{\top} \Sigma_{xy}\, b}{\sqrt{a^{\top} \Sigma_{xx}\, a}\, \sqrt{b^{\top} \Sigma_{yy}\, b}}
\]

is maximized.

Equivalently, the objective is

\[
\max_{a,\, b} \; a^{\top} \Sigma_{xy}\, b
\quad \text{s.t.} \quad a^{\top} \Sigma_{xx}\, a = b^{\top} \Sigma_{yy}\, b = 1
\]
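As an illustrative sanity check (not from the slides), the correlation for given projection vectors a and b can be estimated empirically from paired samples as follows; all names are hypothetical.

```python
import torch

def canonical_correlation(X: torch.Tensor, Y: torch.Tensor,
                          a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Empirical corr(a^T x, b^T y) for paired samples X:[n, m], Y:[n, p] (illustrative sketch)."""
    u = (X - X.mean(dim=0)) @ a      # centered projections a^T x
    v = (Y - Y.mean(dim=0)) @ b      # centered projections b^T y
    return (u * v).sum() / (u.norm() * v.norm())

# CCA searches over a and b to maximize this quantity, subject to
# a^T Sigma_xx a = b^T Sigma_yy b = 1 (unit variance of the projections).
```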



The linear transformations are replaced with neural networks. Concretely, assume x1 and x2 are two views of an input; the objective is

\[
\max_{\theta_1, \theta_2} \; \operatorname{Tr}\!\big( P_{\theta_1}(x_1)^{\top} P_{\theta_2}(x_2) \big)
\quad \text{s.t.} \quad
P_{\theta_1}(x_1)^{\top} P_{\theta_1}(x_1) = P_{\theta_2}(x_2)^{\top} P_{\theta_2}(x_2) = I
\]

where P_{θ1} and P_{θ2} are two feedforward neural networks and I is an identity matrix.

Still, such computation is really expensive, and soft CCA removes the hard decorrelation constraint by adopting the following Lagrangian relaxation:

\[
\min_{\theta_1, \theta_2} \;
\mathcal{L}_{\text{dist}}\big( P_{\theta_1}(x_1), P_{\theta_2}(x_2) \big)
+ \lambda \big( \mathcal{L}_{\text{SDL}}\big( P_{\theta_1}(x_1) \big) + \mathcal{L}_{\text{SDL}}\big( P_{\theta_2}(x_2) \big) \big)
\]

where L_dist measures the distance between the two views' representations and L_SDL is the stochastic decorrelation loss.

Applied to the normalized node embeddings Z̃_A and Z̃_B of the two graph views, the loss becomes

\[
\mathcal{L} =
\underbrace{\| \tilde{Z}_A - \tilde{Z}_B \|_F^2}_{\text{invariance term}}
+ \lambda \underbrace{\Big( \| \tilde{Z}_A^{\top} \tilde{Z}_A - I \|_F^2 + \| \tilde{Z}_B^{\top} \tilde{Z}_B - I \|_F^2 \Big)}_{\text{decorrelation term}}
\]
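A minimal PyTorch sketch of this invariance-plus-decorrelation loss; the standardization step and the name cca_ssg_loss are assumptions based on the formula above, not code from the slides.

```python
import torch

def cca_ssg_loss(Z_a: torch.Tensor, Z_b: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """Invariance + decorrelation loss over two views' embeddings [N, D] (illustrative sketch)."""
    N = Z_a.size(0)
    # Standardize each dimension and scale by 1/sqrt(N) so Z^T Z approximates a correlation matrix
    Z_a = (Z_a - Z_a.mean(dim=0)) / (Z_a.std(dim=0) + 1e-8) / N ** 0.5
    Z_b = (Z_b - Z_b.mean(dim=0)) / (Z_b.std(dim=0) + 1e-8) / N ** 0.5

    eye = torch.eye(Z_a.size(1), device=Z_a.device)
    invariance = (Z_a - Z_b).pow(2).sum()                       # || Z_a - Z_b ||_F^2
    decorrelation = (Z_a.t() @ Z_a - eye).pow(2).sum() + \
                    (Z_b.t() @ Z_b - eye).pow(2).sum()          # push cross-dimension correlations to zero
    return invariance + lam * decorrelation
```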



Results:

X for node features, A for adjacency matrix, S for diffusion matrix, and Y for node labels.



Thank You

