Graph Analytics
Let’s start by looking at a sample graph of friends presented below. I will be using the
same graph in some of the following sections to further explain the concepts of graph
analytics.
The above picture depicts a graph of friends where each node (entity), such as A or B,
depicts a particular individual, and a link (also known as an edge) between any two
individuals depicts a relation (“friendship” in this case) between them.
Generalizing from the above example:
Further, by simply looking at the graph, one can see that A and B have a common
friend C, who is not friends with D. The branch of data science that deals with
extracting information from graphs by performing analysis on them is known as
“Graph Analytics”.
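As a minimal sketch of how such a graph might be stored and queried, the Python snippet below keeps each person's friends in a set and finds common friends by set intersection. The edge list is a small hypothetical one chosen only to match the observation above (A and B share friend C, and C is not friends with D); it is not the full graph from the figure, and the helper names are illustrative.

```python
# Hypothetical friendship edges (undirected); NOT the full graph from the figure.
edges = [("A", "C"), ("B", "C"), ("A", "D")]

# Build an adjacency list: each node maps to the set of its friends.
friends = {}
for u, v in edges:
    friends.setdefault(u, set()).add(v)
    friends.setdefault(v, set()).add(u)


def common_friends(x, y):
    """Return the set of friends shared by x and y."""
    return friends[x] & friends[y]


def are_friends(x, y):
    """Return True if there is an edge between x and y."""
    return y in friends[x]


print(common_friends("A", "B"))  # {'C'}
print(are_friends("C", "D"))     # False
```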
Moving on from the introduction, let’s venture into the world of graph analytics by
exploring some fundamental concepts. In this article we will focus on Centrality-based
concepts used in graph analytics. Don’t fret if that statement does not make sense yet,
as I am going to cover everything from scratch as we move forward.
Centrality
Degree Centrality
The first flavor of Centrality we are going to discuss is “Degree Centrality”. To
understand it, let’s first explore the concept of the degree of a node in a graph.
In a directed graph (each edge has a direction), the degree of a node is further divided
into In-degree and Out-degree. In-degree refers to the number of edges/connections
incident on the node, and Out-degree refers to the number of edges/connections going
from it to other nodes. Let’s look at a sample Twitter graph below where nodes are
individuals and edges with arrows indicate the “Follows” relationship:
Figure 2
We can see that nodes E, C, D and B each have an outgoing edge towards node A and hence
follow node A. Thus, the in-degree of node A is 4, as it has 4 edges incident on it.
We can also see that node B follows both node D and node A, hence its out-degree is 2.
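The counting described above can be sketched in a few lines of Python. The edge list below contains only the “Follows” edges that the text itself names; the original figure may contain more, so treat it as an assumption rather than a faithful copy of Figure 2.

```python
from collections import Counter

# "Follows" edges named in the text, written as (follower, followed).
# The figure may contain more edges; only these are confirmed by the text.
follows = [("E", "A"), ("C", "A"), ("D", "A"), ("B", "A"), ("B", "D")]

# Out-degree: how many edges leave a node (accounts it follows).
out_degree = Counter(src for src, _ in follows)
# In-degree: how many edges arrive at a node (its followers).
in_degree = Counter(dst for _, dst in follows)

print(in_degree["A"])   # 4
print(out_degree["B"])  # 2
```

A `Counter` returns 0 for nodes it has never seen, so a node that follows nobody simply has an out-degree of 0.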
Now let’s briefly discuss a sample application of degree centrality to the graph of
friends shown above. Nodes A and G have a high degree centrality (7 and 5
respectively) and will be ideal candidates if we want to propagate information to a
large part of the network quickly, compared to node L, which only has a degree
centrality of 1. This information is very useful for creating a marketing or
influencing strategy if a new product or idea has to be introduced in the
network. Marketers can focus on nodes with high degree centrality, such as A and G, to
market their product or ideas and reach more nodes in the network.
Similarly, keeping in mind the sample Twitter graph (in Figure 2), if we examine a real
social network such as Twitter with millions of nodes and calculate in-degree
centrality for various nodes, the nodes with high in-degree centrality (such as
Kanye West, Lady Gaga and other celebrities) will be the nodes that have a huge number
of followers and could be ideal candidates to influence the public or promote
commercial products. This is why celebrities and popular people are paid on
social networks such as Instagram and Twitter to say certain things or promote certain
products: companies are aware that these individuals have a very high in-degree and
can influence or reach a large number of people quickly.
Closeness Centrality
The second flavor we are going to discuss is “Closeness Centrality”. To understand
it, let’s first understand the concept of “Geodesic distance” between two nodes in a
graph.
The Geodesic distance d between two nodes a and b is defined as the number of
edges/links between these two nodes on the shortest path (the path with the minimum
number of edges) between them.
Let’s examine the geodesic distance between A and F to further clarify the concept. We
can reach F from A by going through B and E or by going through D. However, the
shortest path from A to F is through D (2 edges), hence the geodesic distance d(A,F)
is 2.
d(a , b) = No. of edges between a and b on the shortest path from a to b, if a path exists
from a to b
d(a , b) = 0, if a = b
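One common way to compute the geodesic distance in an unweighted graph is breadth-first search (BFS). The sketch below assumes a graph containing only the two routes mentioned in the text (A-B-E-F and A-D-F); the real figure may have more edges. Returning `None` when no path exists is a choice made for this example, since the definition above does not cover that case.

```python
from collections import deque

# Edges assumed from the text: routes A-B-E-F and A-D-F (undirected).
edges = [("A", "B"), ("B", "E"), ("E", "F"), ("A", "D"), ("D", "F")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def geodesic_distance(graph, source, target):
    """Number of edges on a shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:  # first visit is the shortest in BFS
                dist[neighbor] = dist[node] + 1
                if neighbor == target:
                    return dist[neighbor]
                queue.append(neighbor)
    return None


print(geodesic_distance(graph, "A", "F"))  # 2
print(geodesic_distance(graph, "A", "A"))  # 0
```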
Again, looking at the previously introduced graph of friends in Figure 1 below, we can
see that the Closeness centrality of node A is 17 while that of node L is 33.
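The values quoted above read like totals of geodesic distances, where a smaller total means a more central node. That reading is an assumption, since the formula itself is not shown here; another common convention takes the reciprocal of this total so that larger means more central. Under the sum-of-distances assumption, and on a small hypothetical graph (not the friends graph from the figure), the calculation might look like this:

```python
from collections import deque

# Hypothetical undirected graph; NOT the friends graph from the figure.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def distances_from(graph, source):
    """BFS distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def total_distance(graph, node):
    """Assumed closeness measure: sum of geodesic distances to all other nodes."""
    return sum(distances_from(graph, node).values())


closeness = {node: total_distance(graph, node) for node in graph}
print(closeness)  # {'A': 5, 'B': 8, 'C': 8, 'D': 6, 'E': 9}
```

Here A has the smallest total and is therefore the most central node under this convention, while the leaf E is the least central.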
Figure 5
Betweenness Centrality
Looking at node A, we can observe that it lies on the shortest path between the
following pairs of nodes: (D,M), (D,E), (G,C), (G,B), (G,F), (G,I), (K,C), (D,C) etc., and
thus has the highest Betweenness Centrality (BC) of all nodes in the graph. We can also
observe that nodes G and C have high BCs compared to the other
nodes (except A) in the graph.
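To make “lies on the shortest path between pairs of nodes” concrete, here is a rough sketch that assumes the usual definition of BC: for every pair of other nodes, add the fraction of their shortest paths that pass through the node. It uses a small hypothetical graph rather than the friends graph, and a simple all-pairs approach that is fine for tiny graphs but slow for large ones.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph; NOT the friends graph from the figure.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def bfs_counts(graph, source):
    """Return (dist, sigma): distances and number of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                sigma[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                # Every shortest path to `node` extends to `neighbor`.
                sigma[neighbor] += sigma[node]
    return dist, sigma


def betweenness(graph):
    """Unnormalized betweenness over unordered node pairs."""
    info = {s: bfs_counts(graph, s) for s in graph}
    score = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in graph:
            if v == s or v == t:
                continue
            dist_v, sigma_v = info[v]
            # v is on a shortest s-t path iff d(s,v) + d(v,t) == d(s,t).
            if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


bc = betweenness(graph)
print(bc)  # {'A': 5.0, 'B': 0.0, 'C': 0.0, 'D': 3.0, 'E': 0.0}
```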
As discussed, if we look at our friends graph above (Figure 6), node A has a very high
BC. If we were to remove it, it would lead to a huge disruption in the network, as there
would be no way for nodes {J,H,G,M,K,E,D} to communicate with nodes {F,B,C,I,L} and
vice versa, and we would end up with two isolated subgraphs. This
highlights the importance of nodes with high BCs.
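The effect of removing a bridge node can be checked by counting connected components before and after the removal. The sketch below uses a small hypothetical graph in which A joins two groups; it is not the friends graph from the figure, and the function name is illustrative.

```python
# Hypothetical undirected graph where A bridges {D, G} and {B, C}.
edges = [("A", "D"), ("A", "G"), ("D", "G"), ("A", "B"), ("A", "C"), ("B", "C")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def components(graph, removed=frozenset()):
    """Connected components of the graph, ignoring any nodes in `removed`."""
    seen = set()
    result = []
    for start in graph:
        if start in seen or start in removed:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(
                n for n in graph[node] if n not in removed and n not in component
            )
        seen |= component
        result.append(component)
    return result


print(len(components(graph)))                 # 1
print(len(components(graph, removed={"A"})))  # 2
```

With A in place the graph is a single component; without it, the two groups can no longer reach each other.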
A real-life use case of the above application is in analyzing global terrorism networks.
For example, if we have a network of terrorists or terrorist groups and other related
individuals represented as nodes of a graph, we can calculate the BC for each node and
identify nodes with high BCs. These nodes (or terrorists in this case) will be bridge
nodes in the network. This information is very useful for defense agencies, as targeting
these nodes can be highly effective in disrupting the whole terrorism network. Another
use case of this metric is to detect and monitor possible bottlenecks or hot-spots in
computer networks or flow networks.
Eigen Vector Centrality
The last flavor of centrality that we will be exploring is known as the Eigen Vector
Centrality. This metric measures the importance of a node in a graph as a function of
the importance of its neighbors. If a node is connected to highly important nodes, it
will have a higher Eigen Vector Centrality score than a node which is
connected to less important nodes.
Let’s look at the graph given below to further explain the concept:
Figure 7
Figure 8
Let’s assume that in the above graph, the importance of each node is measured by its
degree, such that the higher the degree of a node, the more important it is in the
graph. The degrees of the nodes are shown below:
Figure 9
Figure 10
The resultant 1-D vector in the above equation gives the Eigen Vector Centrality (EVC)
score for each of the nodes in the graph. The effect of the first iteration of
multiplication can be visualized as shown below:
Figure 12 showing EVC scores of each node after 1st iteration of multiplication
As you can see above, nodes A and B both have a high score of 8, since both of them are
connected to multiple nodes with high degrees (importance), while node E has a score
of 3, since it’s only connected to a single node of degree 3. It is also important to
observe that the EVC score for each node in the resultant vector is nothing but the sum
of the degrees of its neighboring nodes. For example: EVC score for node A = degree(B) +
degree(C) + degree(D) = 8
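The first multiplication can be reproduced with a few lines of plain Python. The edge list below is an assumption: it is one five-node graph consistent with the numbers quoted in the text (scores of 8 for A and B, and 3 for E, whose only neighbor has degree 3). The scores it produces for C and D are therefore outputs of this assumed graph, not values taken from the text.

```python
# Assumed graph, chosen to be consistent with the scores quoted in the text.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "E")]

# Build the symmetric adjacency matrix of the undirected graph.
index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[index[u]][index[v]] = 1
    adj[index[v]][index[u]] = 1

# Initial importance of each node is its degree (the row sum).
degree = [sum(row) for row in adj]


def multiply(matrix, vector):
    """Matrix-vector product using plain lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


first = multiply(adj, degree)    # 1st iteration: sum of neighbors' degrees
second = multiply(adj, first)    # 2nd iteration: sum of neighbors' 1st scores

print(dict(zip(nodes, degree)))  # {'A': 3, 'B': 3, 'C': 2, 'D': 3, 'E': 1}
print(dict(zip(nodes, first)))   # {'A': 8, 'B': 8, 'C': 6, 'D': 7, 'E': 3}
print(dict(zip(nodes, second)))  # {'A': 21, 'B': 21, 'C': 16, 'D': 19, 'E': 7}
```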
Now if the resultant EVC vector that we got above in the equation (Figure 11) is again
multiplied by the adjacency matrix A, we will get bigger values for EVC score for each
node in the graph, as shown below:
The effect of multiplying the resultant vector again (2nd iteration of multiplication)
with the adjacency matrix can be visualized, as shown below:
Figure 14 showing EVC scores of each node after 2nd iteration of multiplication
Now, why did we multiply the resultant vector again with the adjacency matrix?
In short, multiplying the resultant vector again with the adjacency matrix of the graph
helps the EVC score spread out in the graph, giving each node a more globally
meaningful EVC score rather than a localized one. After the first iteration of
multiplication, each node’s EVC score is a function of only its direct (1st degree)
neighbors, and is thus a localized score
which might not be accurate at a global level in the graph.
Elaborating the above, if we visualize the above operations, we can observe the
following:
After the first iteration of multiplication, each node gets its EVC score from its
direct (1st degree) neighbors.
In the second iteration, when we multiply the resultant vector again with the
adjacency matrix, each node again gets its EVC score from its direct neighbors, but
the difference in the second iteration is that this time, the scores of the direct
neighbors have already been impacted by their own direct (1st degree) neighbors
(from the first iteration of multiplication), which makes the
EVC score of any node a function of its 2nd degree neighboring nodes as well.
Repeated multiplication makes the EVC score of every node eventually depend on
neighbors several degrees away, thereby providing a
globally accurate EVC score for each node. Usually the process of multiplying the EVC
vector with the adjacency matrix is repeated until the EVC values for nodes in the
graph reach an equilibrium or stop showing appreciable change. In practice the vector
is rescaled after each multiplication, since the raw values otherwise keep growing.
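The repeat-until-stable procedure is often called power iteration. Below is a minimal sketch of it, assuming the vector is rescaled to unit length after every multiplication so that the values settle instead of growing. The graph is an assumed five-node example consistent with the scores quoted earlier, and the tolerance and iteration cap are arbitrary choices for this illustration.

```python
import math

# Assumed five-node graph (same assumption as the scores quoted in the text).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "E")]

index = {name: i for i, name in enumerate(nodes)}
adj = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    adj[index[u]][index[v]] = 1
    adj[index[v]][index[u]] = 1


def normalize(vector):
    """Scale a vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def eigenvector_centrality(adj, tol=1e-9, max_iter=1000):
    """Power iteration: multiply by the adjacency matrix until scores settle."""
    scores = normalize([float(sum(row)) for row in adj])  # start from degrees
    for _ in range(max_iter):
        updated = normalize(
            [sum(a * x for a, x in zip(row, scores)) for row in adj]
        )
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores  # best estimate if the cap was reached


scores = eigenvector_centrality(adj)
centrality = dict(zip(nodes, scores))
for name in sorted(centrality, key=centrality.get, reverse=True):
    print(name, round(centrality[name], 3))
```

On this assumed graph, A and B end up with identical top scores because they have the same neighbors, and E, which hangs off a single node, ends up lowest.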
The field of graph analytics is vast and has immense practical applications. This
article covered the fundamentals of Centrality, and I hope it gives the
reader an insight into the fascinating world of Graph Analytics.
Below is a list of libraries and software that can be used for Graph Analytics:
Gephi (https://gephi.org/)
Cytoscape (https://github.com/cytoscape/cytoscape.js)
Neo4j (https://neo4j.com)
GraphAnalyticsLib (https://github.com/jb123/GraphAnalyticsLib)