Download as pdf or txt
Download as pdf or txt
You are on page 1of 21

4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Open in app Sign up Sign In

Boris Knyazev Follow

Aug 4, 2019 · 17 min read · Listen

Save

Tutorial on Graph Neural Networks for


Computer Vision and Beyond
I’m answering questions that AI/ML/CV people not familiar with graphs or graph neural
networks typically ask. I provide PyTorch examples to clarify the idea behind this relatively
new and exciting kind of model.

1.8K 10

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 1/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

A figure from (Bruna et al., ICLR, 2014) depicting an MNIST image on the 3D sphere. While it’s hard to adapt
Convolutional Networks to classify spherical data, Graph Networks can naturally handle it. This is a toy
example, but similar tasks arise in many real applications.

The questions addressed in this part of my tutorial are:

1. Why are graphs useful?

2. Why is it difficult to define convolution on graphs?

3. What makes a neural network a graph neural network?

To answer them, I’ll provide motivating examples, papers and Python code making
it a tutorial on Graph Neural Networks (GNNs). Some basic knowledge of machine
learning and computer vision is expected, however, I’ll provide some background
and intuitive explanation as we go.

First of all, let’s briefly recall what is a graph? A graph G is a set of nodes (vertices)
connected by directed/undirected edges. Nodes and edges typically come from
some expert knowledge or intuition about the problem. So, it can be atoms in
molecules, users in a social network, cities in a transportation system, players in
team sport, neurons in the brain, interacting objects in a dynamic physical system,
pixels, bounding boxes or segmentation masks in images. In other words, in many
practical cases, it is actually you who gets to decide what are the nodes and edges in
a graph.

In many practical cases, it is actually you who gets to


decide what are the nodes and edges in a graph.
This is a very flexible data structure that generalizes many other data structures. For
example, if there are no edges, then it becomes a set; if there are only “vertical”
edges and any two nodes are connected by exactly one path, then we have a tree.
Such flexibility is both good and bad as I’ll discuss in this tutorial.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 2/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Two undirected graphs with 5 and 6 nodes. The order of nodes is arbitrary.

1. Why graphs can be useful?


In the context of computer vision (CV) and machine learning (ML), studying graphs
and the models to learn from them can give us at least four benefits:

1. We can become closer to solving important problems that previously were too
challenging, such as: drug discovery for cancer (Veselkov et al., Nature, 2019);
better understanding of the human brain connectome (Diez & Sepulcre, Nature
Communications, 2019); materials discovery for energy and environmental
challenges (Xie et al., Nature Communications, 2019).

2. In most CV/ML applications, data can be actually viewed as graphs even though
you used to represent them as another data structure. Representing your data as
graph(s) gives you a lot of flexibility and can give you a very different and
interesting perspective on your problem. For instance, instead of learning from
image pixels you can learn from “superpixels” as in (Liang et al., ECCV, 2016)
and in our forthcoming BMVC paper. Graphs also let you impose a relational
inductive bias in data — some prior knowledge you have about the problem. For
instance, if you want to reason about a human pose, your relational bias can be
a graph of skeleton joints of a human body (Yan et al., AAAI, 2018); or if you
want to reason about videos, your relational bias can be a graph of moving
bounding boxes (Wang & Gupta, ECCV, 2018). Another example can be
representing facial landmarks as a graph (Antonakos et al., CVPR, 2015) to make
reasoning about facial attributes and identity.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 3/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

3. Your favourite neural network itself can be viewed as a graph, where nodes are
neurons and edges are weights, or where nodes are layers and edges denote
flow of forward/backward pass (in which case we are talking about a
computational graph used in TensorFlow, PyTorch and other DL frameworks).
An application can be optimization of a computational graph, neural
architecture search, analyzing training behavior, etc.

4. Finally, you can solve many problems, where data can be more naturally
represented as graphs, more effectively. This includes, but is not limited to,
molecule and social network classification (Knyazev et al., NeurIPS-W, 2018) and
generation (Simonovsky & Komodakis, ICANN, 2018), 3D Mesh classification
and correspondence (Fey et al., CVPR, 2018) and generation (Wang et al., ECCV,
2018), modeling behavior of dynamic interacting objects (Kipf et al., ICML,
2018), visual scene graph modeling (see the upcoming ICCV Workshop) and
question answering (Narasimhan, NeurIPS, 2018), program synthesis (Allamanis
et al., ICLR, 2018), different reinforcement learning tasks (Bapst et al., ICML,
2019) and many other exciting problems.

As my previous research was related to recognizing and analyzing faces and


emotions, I particularly like this figure below.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 4/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

A figure from (Antonakos et al., CVPR, 2015) showing representation of a face as a graph of landmarks. This
is an interesting approach, but it is not a sufficient facial representation in many cases, since a lot can be told
from the face texture captured well by convolutional networks. In contrast, reasoning over 3D meshes of a
face looks like a more sensible approach compared to 2D landmarks (Ranjan et al., ECCV, 2018).

2. Why is it difficult to define convolution on graphs?


To answer this question, I first give some motivation for using convolution in
general and then describe “convolution on images” using the graph terminology
which should make the transition to “convolution on graphs” more smooth.

2.1. Why is convolution useful?


Let’s understand why we care about convolution so much and why we want to use it
for graphs. Compared to fully-connected neural networks (a.k.a. NNs or MLPs),
convolutional networks (a.k.a. CNNs or ConvNets) have certain advantages
explained below based on the image of a nice old Chevy.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 5/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

“Chevrolet Vega” according to Google Image Search.

First, ConvNets exploit a natural prior in images, more formally described in


(Bronstein et al., 2016), such as:

1. Shift-invariance — if we translate the car on the image above to the


left/right/up/down, we still should be able to detect and recognize it as a car.
This is exploited by sharing filters across all locations, i.e. applying convolution.

2. Locality — nearby pixels are closely related and often represent some semantic
concept, such as a wheel or a window. This is exploited by using relatively large
filters, which can capture image features in a local spatial neighborhood.

3. Compositionality (or hierarchy)— a larger region in the image is often a


semantic parent of smaller regions it contains. For example, a car is a parent of
doors, windows, wheels, driver, etc. And a driver is a parent of head, arms, etc.
This is implicitly exploited by stacking convolutional layers and applying
pooling.

Second, the number of trainable parameters (i.e. filters) in convolutional layers


does not depend on the input dimensionality, so technically we can train exactly the
same model on 28×28 and 512×512 images. In other words, the model is parametric.

Ideally, our goal is to develop a model that is as


flexible as Graph Neural Nets and can digest and

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 6/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

learn from any data, but at the same time we want to


control (regularize) factors of this flexibility by
turning on/off certain priors.

All these nice properties make ConvNets less prone to overfitting (high accuracy on
the training set and low accuracy on the validation/test set), more accurate in
different visual tasks, and easily scalable to large images and datasets. So, when we
want to solve important tasks where input data are graph-structured, it is appealing
to transfer all these properties to graph neural networks (GNNs) to regularize their
flexibility and make them scalable. Ideally, our goal is to develop a model that is as
flexible as GNNs and can digest and learn from any data, but at the same time we
want to control (regularize) factors of this flexibility by turning on/off certain priors.
This can open research in many interesting directions. However, controlling of this
trade-off is challenging.

2.2. Convolution on images in terms of graphs


Let’s consider an undirected graph G with N nodes. Edges E represent undirected
connections between nodes. Nodes and edges typically come from your intuition
about the problem. Our intuition in the case of images is that nodes are pixels or
superpixels (a group of pixels of weird shape) and edges are spatial distances
between them. For example, the MNIST image below on the left is typically
represented as an 28×28 dimensional matrix. We can also represent it as a set of
N=28*28=784 pixels. So, our graph G is going to have N=784 nodes and edges will
have large values (thicker edges in the Figure below) for closely located pixels and
small values (thinner edges) for remote pixels.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 7/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

An image from the MNIST dataset on the left and an example of its graph representation on the right. Darker
and larger nodes on the right correspond to higher pixel intensities. The figure on the right is inspired by
Figure 5 in (Fey et al., CVPR, 2018)

When we train our neural networks or ConvNets on images, we implicitly define


images on a graph — a regular two-dimensional grid as the one on the figure below.
Since this grid is the same for all training and test images and is regular, i.e. all
pixels of the grid are connected to each other in exactly the same way across all
images (i.e. have the same number of neighbors, length of edges, etc.), this regular
grid graph has no information that will help us to tell one image from another.
Below I visualize some 2D and 3D regular grids, where the order of nodes is color-
coded. By the way, I’m using NetworkX in Python to do that, e.g. G =

networkx.grid_graph([4, 4]) .

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 8/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Examples of regular 2D and 3D grids. Images are defined on 2D grids and videos are on 3D grids.

Given this 4×4 regular grid, let’s briefly look at how 2D convolution works to
understand why it’s difficult to transfer this operator to graphs. A filter on a regular
grid has the same order of nodes, but modern convolutional nets typically have
small filters, such as 3×3 in the example below. This filter has 9 values: W₁,W₂,…,
W₉, which is what we are updating during training using backprop to minimize the
loss and solve the downstream task. In our example below, we just heuristically
initialize this filter to be an edge detector (see other possible filters here):

Example of a 3×3 filter on a regular 2D grid with arbitrary weights w on the left and an edge detector on the
right.

When we perform convolution, we slide this filter in both directions: to the right
and to the bottom, but nothing prevents us from starting in the bottom corner — the
important thing is to slide over all possible locations. At each location, we compute
the dot product between the values on the grid (let’s denote them as X) and the
values of filters, W: X₁W₁+X₂W₂+…+X₉W₉, and store the result in the output image.
In our visualization, we change the color of nodes during sliding to match the colors
of nodes in the grid. In a regular grid, we always can match a node of the filter with
a node of the grid. Unfortunately, this is not true for graphs as I’ll explain later
below.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 9/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

2 steps of 2D convolution on a regular grid. If we don’t apply padding, there will be 4 steps in total, so the
result will be a 2×2 image. To make the resulting image larger, we need to apply padding. See a
comprehensive guide to convolution in deep learning here.

The dot product used above is one of so called “aggregator operators”. Broadly
speaking, the goal of an aggregator operator is to summarize data to a reduced
form. In our example above, the dot product summarizes a 3×3 matrix to a single
value. Another example is pooling in ConvNets. Keep in mind, that such methods as
max or sum pooling are permutation-invariant, i.e. they will pool the same value
from a spatial region even if you randomly shuffle all pixels inside that region. To
make it clear, the dot product is not permutation-invariant simply because in
general: X₁W₁+X₂W₂ ≠X₂W₁+X₁W₂.

Now let’s use our MNIST image and illustrate the meaning of a regular grid, a filter
and convolution. Keeping in mind our graph terminology, this regular 28×28 grid
will be our graph G, so that every cell in this grid is a node, and node features are an
actual image X, i.e. every node will have just a single feature — pixel intensity from
0 (black) to 1 (white).

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 10/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Regular 28×28 grid (left) and an image on that grid (right).

Next, we define a filter and let it be a famous Gabor filter with some (almost)
arbitrary parameters. Once we have an image and a filter, we can perform
convolution by sliding the filter over that image (of digit 7 in our case) and putting
the result of the dot product to the output matrix after each step.

A 28×28 filter (left) and the result of 2D convolution of this filter with the image of digit 7 (right).

This is all cool, but as I mentioned before, it becomes tricky when you try to
generalize convolution to graphs.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 11/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Nodes are a set, and any permutation of this set


does not change it. Therefore, the aggregator
operator that people apply should be permutation-
invariant.
As I have already mentioned, the dot product used above to compute convolution at
each step is sensitive to the order. This sensitivity permits us to learn edge detectors
similar to Gabor filters important to capture image features. The problem is that in
graphs there is no well-defined order of nodes unless you learn to order them, or come
up with some heuristic that will result in a consistent (canonical) order from graph
to graph. In short, nodes are a set, and any permutation of this set does not change
it. Therefore, the aggregator operator that people apply should be permutation-
invariant. The most popular choices are averaging (GCN, Kipf & Welling, ICLR, 2017)
and summation (GIN, Xu et al., ICLR, 2019) of all neighbors, i.e. sum or mean
pooling, followed by projection by a trainable vector W. See Hamilton et al., NIPS,
2017 for some other aggregators.

Illustration of “convolution on graphs” of node features X with filter W centered at node 1 (dark blue).

For example, for the graph above on the left, the output of the summation
aggregator for node 1 will be X₁=(X₁+X₂+X₃+X₄)W₁, for node 2: X₂=(X₁+X₂+X₃+X₅)W₁
and so forth for nodes 3, 4 and 5, i.e. we need to apply this aggregator for all nodes.
In result, we will have the graph with the same structure, but node features will now
https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 12/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

contain features of neighbors. We can process the graph on the right using the same
idea.

Colloquially, people call this averaging or summation “convolution”, since we also


“slide” from one node to another and apply an aggregator operator in each step.
However, it’s important to keep in mind that this is a very specific form of
convolution, where filters don’t have a sense of orientation. Below I’ll show how
those filters look like and give an idea how to make them better.

3. What makes a neural network a graph neural network?


You know how a classical neural network works, right? We have some C-
dimensional features X as the input to the net. Using our running MNIST example,
X will be our C=784 dimensional pixel features (i.e. a “flattened” image). These
features get multiplied by C×F dimensional weights W that we update during
training to get the output closer to what we expect. The result can be directly used
to solve the task (e.g. in case of regression) or can be further fed to some
nonlinearity (activation), like ReLU, or other differentiable (or more precisely, sub-
differentiable) functions to form a multi-layer network. In general, the output of
some layer l is:

Fully-connected layer with learnable weights W. “Fully-connected” means that each output value in X⁽ˡ⁺¹⁾
depends on, or “connected to”, all inputs X⁽ˡ⁾. Typically, although not always, we add a bias term to the output.

The signal in MNIST is so strong, that you can get an accuracy of 91% by just using
the formula above and the Cross Entropy loss without any nonlinearities and other
tricks (I used a slightly modified PyTorch example to do that). Such model is called
multinomial (or multiclass, since we have 10 classes of digits) logistic regression.

Now, how do we transform our vanilla neural network to a graph neural network?
As you already know, the core idea behind GNNs is aggregation over “neighbors”.
Here, it is important to understand that in many cases, it is actually you who
specifies “neighbors”.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 13/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Let’s consider a simple case first, when you are given some graph. For example, this
can be a fragment (subgraph) of a social network with 5 persons and an edge
between a pair of nodes denotes if two people are friends (or at least one of them
think so). An adjacency matrix (usually denoted as A) in the figure below on the
right is a way to represent these edges in a matrix form, convenient for our deep
learning frameworks. Yellow cells in the matrix represent the edge and blue — the
absence of the edge.

Example of a graph and its adjacency matrix. The order of nodes we defined in both cases is random, while
the graph is still the same.

Now, let’s create an adjacency matrix A for our MNIST example based on
coordinates of pixels (complete code is provided in the end of the post):

import numpy as np
from scipy.spatial.distance import cdist

img_size = 28 # MNIST image width and height


col, row = np.meshgrid(np.arange(img_size), np.arange(img_size))
coord = np.stack((col, row), axis=2).reshape(-1, 2) / img_size
dist = cdist(coord, coord) # see figure below on the left
sigma = 0.2 * np.pi # width of a Gaussian
A = np.exp(- dist ** 2 / sigma ** 2) # see figure below in the
middle

This is a typical, but not the only, way to define an adjacency matrix for visual tasks
(Defferrard et al., NIPS, 2016, Bronstein et al., 2016). This adjacency matrix is our
prior, or our inductive bias, we impose on the model based on our intuition that
nearby pixels should be connected and remote pixels shouldn’t or should have very
thin edge (edge of a small value). This is motivated by observations that in natural
images nearby pixels often correspond to the same object or objects that interact
frequently (the locality principle we mentioned in Section 2.1.), so it makes a lot of
sense to connect such pixels.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 14/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Adjacency matrix (NxN) in the form of distances (left) and closeness (middle) between all pairs of nodes.
(right) A subgraph with 16 neighboring pixels corresponding to the adjacency matrix in the middle. Since it’s
a complete subgraph, it’s also called a “clique”.

So, now instead of having just features X we have some fancy matrix A with values
in the range [0,1]. It’s important to note that once we know that our input is a graph,
we assume that there is no canonical order of nodes that will be consistent across
all other graphs in the dataset. In terms of images, it means that pixels are assumed to
be randomly shuffled. Finding the canonical order of nodes is combinatorially
unsolvable in practice. Even though for MNIST we technically can cheat by knowing
this order (because data are originally from a regular grid), it’s not going to work on
actual graph datasets.

Remember that our matrix of features X has 𝑁 rows and C columns. So, in terms of
graphs, each row corresponds to one node and C is the dimensionality of node
features. But now the problem is that we don’t know the order of nodes, so we don’t
know in which row to put features of a particular node. If we just pretend to ignore
this problem and feed X directly to an MLP as we did before, the effect will be the
same as feeding images with randomly shuffled pixels with independent (yet the
same for each epoch) shuffling for each image! Surprisingly, a neural network can
in principle still fit such random data (Zhang et al., ICLR, 2017), however test
performance will be close to random prediction. One of the solutions is to simply
use the adjacency matrix A, we created before, in the following way:

Graph neural layer with adjacency matrix A, input/output features X and learnable weights W.

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 15/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

We just need to make sure that row i in A corresponds to features of node in row i of
X. Here, I’m using 𝓐 instead of plain A, because often you want to normalize A. If
𝓐=A, the matrix multiplication 𝓐X⁽ˡ⁾ will be equivalent to summing features of
neighbors, which turned out to be useful in many tasks (Xu et al., ICLR, 2019). Most
commonly, you normalize it so that 𝓐X⁽ˡ⁾ averages features of neighbors, i.e. 𝓐=A/
ΣᵢAᵢ. A better way to normalize matrix A can be found in (Kipf & Welling, ICLR,
2017).

Below is the comparison of NN and GNN in terms of PyTorch code:

1 import torch
2 import torch.nn as nn
3
4 C = 2 # Input feature dimensionality
5 F = 8 # Output feature dimensionality
6 W = nn.Linear(in_features=C, out_features=F) # Trainable weights
7
8 # Fully connected layer
9 X = torch.randn(1, C) # Input features
10 Z = W(X) # Output features : torch.Size([1, 8])
11
12 #Graph Neural Network layer
13 N = 6 # Number of nodes in a graph
14 X = torch.randn(N, C) # Input feature
15 A = torch.rand(N, N) # Adjacency matrix (edges of a graph)
16 Z = W(torch.mm(A, X)) # Output features: torch.Size([6, 8])

nn_vs_gnn.py hosted with ❤ by GitHub view raw

And HERE is the full PyTorch code to train two models above: python mnist_fc.py --

model fc to train the NN case; python mnist_fc.py --model graph to train the GNN
case. As an exercise, try to randomly shuffle pixels in code in the --model graph

case (don’t forget to shuffle A in the same way) and make sure that it will not affect
the result. Is it going to be true for the --model fc case?

Here is the full PyTorch code to train two models.


After running the code, you may notice that the classification accuracy is actually
about the same. What’s the problem? Aren’t graph networks supposed to work
better? Well, they are, in many cases. But not in this one, because the 𝓐X⁽ˡ⁾
operator we added is actually nothing else, but a Gaussian filter:

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 16/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

2D visualization of a filter used in a graph neural network and it’s effect on the image.

So, our graph neural network turned out to be equivalent to a convolutional neural
network with a single Gaussian filter, that we never update during training,
followed by the fully-connected layer. This filter basically blurs/smooths the image,
which is not a particularly useful thing to do (see the image above on the right).
However, this is the simplest variant of a graph neural network, which nevertheless
works great on graph-structured data. To make GNNs work better on regular graphs,
like images, we need to apply a bunch of tricks. For example, instead of using a
predefined Gaussian filter, we can learn to predict an edge between any pair of
pixels by using a differentiable function like this:

import torch.nn as nn # using PyTorch

nn.Sequential(nn.Linear(4, 64), # map coordinates to a hidden layer


nn.ReLU(), # nonlinearity
nn.Linear(64, 1), # map hidden representation to edge
nn.Tanh()) # squash edge values to [-1, 1]

To make GNNs work better on regular graphs, like


images, we need to apply a bunch of tricks. For
example, instead of using a predefined Gaussian

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 17/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

filter, we can learn to predict an edge between any


pair of pixels.

This idea is similar to Dynamic Filter Networks (Brabander et al., NIPS, 2016), Edge-
conditioned Graph Networks (ECC, Simonovsky & Komodakis, CVPR, 2017) and
(Knyazev et al., NeurIPS-W, 2018). To try it using my code, you just need to add the -

-pred_edge flag, so the entire command is python mnist_fc.py --model graph --

pred_edge . Below I show the animation of the predefined Gaussian and learned
filters. You may notice that the filter we just learned (in the middle) looks weird.
That’s because the task is quite complicated since we optimize two models at the
same time: the model that predicts edges and the model that predicts a digit class.
To learn better filters (like the one on the right), we need to apply some other tricks
from our BMVC paper, which is beyond the scope of this part of the tutorial.

2D filter of a graph neural network centered in the red point. Averaging (left, accuracy 92.24%), learned
based on coordinates (middle, accuracy 91.05%), learned based on coordinates with some tricks (right,
accuracy 92.39%).

The code to generate these GIFs is quite simple:

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 18/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

1 import imageio # to save GIFs


2 import matplotlib as mpl
3 import matplotlib.pyplot as plt
4 import numpy as np
5 from scipy.spatial.distance import cdist
6 import cv2 # optional (for resizing the filter to look better)
7
8 img_size = 28
9 # Create/load some adjacency matrix A (for example, based on coordinates)
10 col, row = np.meshgrid(np.arange(img_size), np.arange(img_size))
11 coord = np.stack((col, row), axis=2).reshape(-1, 2) / img_size
12 dist = cdist(coord, coord) # distances between all pairs of pixels
13 sigma = 0.2 * np.pi # width of a Gaussian (can be a hyperparameter when training a model)
14
15 A = np.exp(- dist / sigma ** 2) # adjacency matrix of spatial similarity
16 # above, dist should have been squared to make it a Gaussian (forgot to do that)
17
18 scale = 4
19 img_list = []
20 cmap = mpl.cm.get_cmap('viridis')
21 for i in np.arange(0, img_size, 4): # for every row with step 4
22 for j in np.arange(0, img_size, 4): # for every col with step 4
23 k = i*img_size + j
24 img = A[k, :].reshape(img_size, img_size)
25 img = (img - img.min()) / (img.max() - img.min())
26 img = cmap(img)
27 img[i, j] = np.array([1., 0, 0, 0]) # add the red dot
28 img = cv2.resize(img, (img_size*scale, img_size*scale))
29 img_list.append((img * 255).astype(np.uint8))
30 imageio.mimsave('filter.gif', img_list, format='GIF', duration=0.2)

generate_gif.py hosted with ❤ by GitHub view raw

I’m also sharing an IPython notebook showing 2D convolution of an image with a


Gabor filter in terms of graphs (using an adjacency matrix) compared to using
circulant matrices, which is often used in signal processing.

In the next part of the tutorial, I’ll tell you about more advanced graph layers that
can lead to better filters on graphs.

Update:

Throught this blog post and in the code the dist variable should have been squared
to make it a Gaussian. Thanks Alfredo Canziani for spotting that. All figures and

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 19/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

results were generated without squaring it. If you observe very different results
after squaring it, I suggest to tune sigma .

Conclusion
Graph Neural Networks are a very flexible and interesting family of neural networks
that can be applied to really complex data. As always, such flexibility must come at
a certain cost. In case of GNNs it is the difficulty of regularizing the model by
defining such operators as convolution. Research in that direction is advancing
quite fast, so that GNNs will see application in increasingly wider areas of machine
learning and computer vision.

See another nice blog post about GNNs from Neptune.ai.

Acknowledgement: A large portion of this tutorial was prepared during my internship at


SRI International under the supervision of Mohamed Amer (homepage) and my PhD
advisor Graham Taylor (homepage).

Find me on Github, LinkedIn and Twitter. My homepage.

If you want to cite this tutorial in your paper, please use:


@misc{knyazev2019tutorial,
title={Tutorial on Graph Neural Networks for Computer Vision and Beyond},
author={Knyazev, Boris and Taylor, Graham W and Amer, Mohamed R},
year={2019}
}

Machine Learning Graph Neural Networks Deep Learning Computer Vision

Pytorch

About Help Terms Privacy

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 20/21
4/21/23, 3:50 PM Tutorial on Graph Neural Networks for Computer Vision and Beyond | by Boris Knyazev | Medium

Get the Medium app

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d 21/21

You might also like