Graph Analytics


Published in Towards Data Science

Jatin Bhasin

Aug 14, 2019 · 12 min read

Graph Analytics: Introduction and Concepts of Centrality

Photo by Alina Grubnyak on Unsplash


The advent of social networks, big data and e-commerce has re-emphasized the
importance of analyzing a particular type of data structure: one that depicts
relationships among its entities, also known as a graph. Before introducing graph
analytics, it helps to briefly introduce the concept of a “graph” itself.

Let’s start by looking at a sample graph of friends presented below. I will be using the
same graph in some of the following sections to further explain the concepts of graph
analytics.

Figure 1 (this graph was designed in Gephi)

The above picture depicts a graph of friends where each node/entity, such as A or B,
represents a particular individual and a link (also known as an edge) between any two
individuals represents a relation (“friendship” in this case) between them.

Generalizing from the above example:

Graphs can be defined as a representation of relationships between “entities” or
“things”, where these “entities” are the “nodes” (also known as “vertices”) of the
graph and the relationships between them are represented by “links” (also known as
“edges”) of the graph. The study of graphs is known as “Graph Theory”.

Further, by simply looking at the graph, one can see that A and B have a common
friend C, who is not friends with D. The branch of data science that deals with
extracting information from graphs by performing analysis on them is known as
“Graph Analytics”.
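To make the node/edge vocabulary concrete, here is a minimal Python sketch of one
possible in-memory representation: a dictionary that maps each node to the set of its
neighbors. The edge list is hypothetical; it only contains the relationships mentioned
in the text, not the full graph of Figure 1.

```python
from collections import defaultdict

# Hypothetical friendships, chosen only to match the relationships the text
# mentions (A and B share friend C; C and D are not friends).
edges = [("A", "C"), ("B", "C"), ("A", "D")]


def build_graph(edge_list):
    """Build an undirected graph as a dict mapping each node to its neighbor set."""
    graph = defaultdict(set)
    for u, v in edge_list:
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)


friends = build_graph(edges)
common_friends = friends["A"] & friends["B"]  # friends shared by A and B
c_knows_d = "D" in friends["C"]               # is there an edge between C and D?

print(sorted(common_friends))
print(c_knows_d)
```

Set intersection answers the "common friend" question directly, which is one reason
adjacency sets are a convenient choice for small graphs.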

Moving on from the introduction, let’s venture into the world of graph analytics by
exploring some fundamental concepts. In this article we will focus on the
centrality-based concepts used in graph analytics. Don’t fret if that statement is not
yet clear, as I am going to cover everything from scratch as we move forward.

Centrality

In graph analytics, Centrality is a very important concept for identifying important
nodes in a graph. It is used to measure the importance (or “centrality”, as in how
“central” a node is in the graph) of various nodes in a graph. Each node could be
important from some angle, depending on how “importance” is defined. Centrality comes
in different flavors, and each flavor or metric defines the importance of a node from a
different perspective and provides relevant analytical information about the
graph and its nodes.

Degree Centrality

The first flavor of Centrality we are going to discuss is “Degree Centrality”. To
understand it, let’s first explore the concept of the degree of a node in a graph.

In an undirected graph, the degree of a node is defined as the number of direct
connections a node has with other nodes. Looking at the graph below:

In a directed graph (each edge has a direction), the degree of a node is further
divided into In-degree and Out-degree. In-degree refers to the number of
edges/connections incident on the node and Out-degree refers to the number of
edges/connections from it to other nodes. Let’s look at a sample Twitter graph below,
where nodes are individuals and edges with arrows indicate the “Follows” relationship:

Figure 2

We can see that nodes E, C, D and B have an outgoing edge towards node A and hence
follow node A. Thus, the in-degree of node A is 4, as it has 4 edges incident on it.

We can also see that node B follows both node D and node A, hence its out-degree is 2.
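The in-degree and out-degree counts above can be reproduced with a short Python
sketch. The edge list contains only the “Follows” relationships stated in the text;
Figure 2 may contain further edges that are not mentioned, so treat the list as an
assumption.

```python
from collections import Counter

# "Follows" edges stated in the text for Figure 2, as (follower, followed).
# The figure may contain further edges that the text does not mention.
follows = [("E", "A"), ("C", "A"), ("D", "A"), ("B", "A"), ("B", "D")]

# Out-degree counts edges leaving a node, in-degree counts edges arriving at it.
out_degree = Counter(source for source, _ in follows)
in_degree = Counter(target for _, target in follows)

print(in_degree["A"])   # number of followers of A
print(out_degree["B"])  # number of accounts B follows
```

A `Counter` returns 0 for nodes that never appear on one side of an edge, so a node
that follows nobody simply has an out-degree of 0.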

Now, the Degree Centrality metric defines the importance of a node in a graph based on
its degree, i.e. the higher the degree of a node, the more important it is in the
graph.

Re-examining the aforementioned friends graph (Figure 1) below:


Figure 3

The degree centrality of node A is 7, node G is 5, node C is 4 and node L is 1.

Mathematically, Degree Centrality is defined as D(i) for a node “i” as below:
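One common way to write this, assuming the graph has n nodes and an adjacency matrix A
whose entry A_ij is 1 when nodes i and j are linked and 0 otherwise (this notation is
an assumption, chosen to match the adjacency matrix A used in the Eigen Vector
Centrality section):

```latex
D(i) = \sum_{j=1}^{n} A_{ij}
```

For a directed graph, with the convention that A_ij = 1 means an edge from i to j,
summing a row gives the out-degree and summing a column gives the in-degree.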

Now let’s briefly discuss a sample application of degree centrality to the friends
graph shown above. Nodes A and G have a high degree centrality (7 and 5 respectively)
and are ideal candidates if we want to propagate information to a large part of the
network quickly, compared to node L, which has a degree centrality of only 1. This
information is very useful for creating a marketing or influencing strategy if a new
product or idea has to be introduced in the network. Marketers can focus on nodes with
high degree centrality, such as A and G, to market their products or ideas in the
network and ensure higher reachability among nodes.
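The seeding idea can be sketched in a few lines of Python: compute every node's degree
and pick the top few. The edge list and the helper names `degree_centrality` and
`top_k_by_degree` are hypothetical; the graph is not the one from the figures.

```python
# Hypothetical friendship edges (undirected); not the graph from the figures.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("G", "D"), ("G", "E"), ("G", "H"),
    ("C", "L"),
]


def degree_centrality(edge_list):
    """Return {node: degree} for an undirected edge list without duplicates."""
    degree = {}
    for u, v in edge_list:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


def top_k_by_degree(edge_list, k):
    """Return the k nodes with the highest degree (ties broken by name)."""
    degree = degree_centrality(edge_list)
    return sorted(degree, key=lambda node: (-degree[node], node))[:k]


seeds = top_k_by_degree(edges, 2)
print(seeds)
```

Picking seeds purely by degree ignores overlap between the chosen nodes' neighbors, so
this is a first approximation rather than an optimal seeding strategy.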

Similarly, keeping in mind the sample Twitter graph (Figure 2), if we examine a real
social network such as Twitter with millions of nodes and calculate in-degree
centrality for various nodes, the nodes with high in-degree centrality (such as
Kanye West, Lady Gaga and other celebrities) will be the nodes that have a huge number
of followers and could be ideal candidates to influence the public or promote
commercial products. This is why celebrities and other popular people are paid on
social networks such as Instagram and Twitter to say certain things or promote certain
products: companies are aware that these individuals have a very high in-degree and
can influence or reach a large number of people quickly.

The usefulness of analyzing the importance of nodes based on degree centrality is
vast and depends on the nature of the graph/network under consideration.

Closeness Centrality

The second flavor we are going to discuss is “Closeness Centrality”. To understand
it, let’s first understand the concept of “Geodesic distance” between two nodes in a
graph.

The Geodesic distance d between two nodes a and b is defined as the number of
edges/links between these two nodes on the shortest path (the path with the minimum
number of edges) between them.

Let’s look at the graph below:


Figure 4

Let’s examine the geodesic distance between A and F to further clarify the concept. We
can reach F from A by going through B and E or by going through D. However, the
shortest path from A to F is through D (2 edges), hence the geodesic distance d(A,F)
is 2, as there are 2 edges on that path.

Mathematically, Geodesic distance can be defined as below:

d(a , b) = No. of edges between a and b on the shortest path from a to b, if a path exists
from a to b

d(a , b) = 0, if a = b

d(a , b) = ∞ (Infinity) , if no path exists from a to b
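In an unweighted graph, this definition can be computed with a breadth-first search.
The sketch below is a minimal Python version. The edges are the ones implied by the
two A-to-F routes described for Figure 4 (the figure may contain more), and the
isolated node Z is a hypothetical addition used only to show the no-path case.

```python
from collections import deque
import math

# Edges implied by the two A-to-F routes described in the text for Figure 4;
# the isolated node "Z" is a hypothetical addition to show the no-path case.
edges = [("A", "B"), ("B", "E"), ("E", "F"), ("A", "D"), ("D", "F")]
nodes = {"A", "B", "D", "E", "F", "Z"}

adjacency = {node: set() for node in nodes}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)


def geodesic_distance(adj, a, b):
    """Number of edges on a shortest path from a to b (BFS, unweighted graph)."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj[node]:
            if neighbor == b:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return math.inf  # no path exists from a to b


print(geodesic_distance(adjacency, "A", "F"))
```

Breadth-first search visits nodes in order of increasing distance, so the first time
the target is reached the path used is guaranteed to be a shortest one.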

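Further, the closeness centrality metric defines the importance of a node in a graph
by how close it is to all other nodes in the graph. For a node, it is defined as the
sum of the geodesic distances between that node and all other nodes in the network.
With this definition a smaller value means a more central node; many references and
libraries instead report the reciprocal of this sum, so that larger values mean more
central nodes.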

Again, looking at the previously introduced graph of friends (Figure 1), shown again
below, we can see that the Closeness centrality of node A is 17 while that of node L
is 33.

Figure 5

Mathematically, Closeness Centrality C(i) of a node i in a graph can be defined as
below:
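Following the sum-of-distances definition given in the text, and writing d(i, j) for
the geodesic distance and V for the set of nodes (the symbol V is an assumption), one
way to write it is:

```latex
C(i) = \sum_{j \in V,\; j \neq i} d(i, j)
```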

Let’s briefly describe a sample application of Closeness Centrality by examining the
friends graph above in Figure 5. Suppose that each link/edge in the friends graph has
a weight (attribute) of 1 minute associated with it, i.e. it takes 1 minute to
transmit information from a node to its neighboring node, such as A to B or B to C.
Now suppose we want to send a specific piece of information (different for each node)
to each node of the graph, and we need to select a node that can transmit it quickly
to all the nodes in the network.

To solve the above problem, we can calculate the Closeness Centrality measure for all
the nodes in the network. As calculated above, if we select node A, the information
can reach all the nodes by traversing 17 edges in total (i.e. starting at A,
information can be transmitted to all nodes in 17 minutes in the worst case, assuming
A sends the messages one after another), whereas from node L it would take 33 minutes.
This clearly shows the difference in importance of nodes A and L in terms of the
Closeness Centrality measure.
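The selection step can be sketched in Python: compute the sum of geodesic distances
for every node and pick the node with the smallest sum. The graph below is
hypothetical (it is not the friends graph, so the totals differ from 17 and 33), and
each edge is assumed to cost 1 minute as in the text.

```python
from collections import deque

# Hypothetical undirected graph (not the friends graph from the figures):
# a hub A with a short chain D-E-L hanging off it. Each edge costs 1 minute.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "F"), ("D", "E"), ("E", "L")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def distances_from(adj, source):
    """BFS distances (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness_sum(adj, node):
    """Sum of geodesic distances from node to all other nodes (connected graph)."""
    return sum(distances_from(adj, node).values())


scores = {node: closeness_sum(adjacency, node) for node in adjacency}
best = min(scores, key=lambda node: (scores[node], node))  # smallest sum wins
print(best, scores[best], scores["L"])
```

The sketch assumes a connected graph; in a disconnected graph the unreachable nodes
would have infinite distance and the sums would need separate handling.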

Betweenness Centrality

The third flavor of centrality we are going to discuss is known as “Betweenness
Centrality” (BC). This metric defines and measures the importance of a node in a
network based upon how many times it occurs on the shortest paths between all pairs of
nodes in a graph. To elaborate on the metric further, let’s again look at our friends
graph below:

Figure 6

Mathematically, Betweenness Centrality B(i) of a node i in a graph is defined as below:
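A widely used formulation, where σ_ab is the number of shortest paths between nodes a
and b and σ_ab(i) is the number of those paths that pass through node i (the σ
notation is an assumption, not taken from the article):

```latex
B(i) = \sum_{a \neq i \neq b} \frac{\sigma_{ab}(i)}{\sigma_{ab}}
```

In an undirected graph each unordered pair (a, b) is usually counted once.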

Looking at node A, we can observe that it lies on the shortest path between the
following pairs of nodes: (D,M), (D,E), (G,C), (G,B), (G,F), (G,I), (K,C), (D,C) etc.,
and thus has the highest BC among all the nodes in the graph. We can also observe
that nodes G and C have high Betweenness Centralities (BCs) compared to the other
nodes (except A) in the graph.
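For small graphs, betweenness can be computed directly from the definition: count the
shortest paths between every pair of nodes and the share of them that pass through
each intermediate node. The Python sketch below does this with breadth-first search on
a hypothetical graph (two triangles joined by the edge C-D); it is a straightforward
illustration, not an optimized algorithm.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph: two triangles joined through the edge C-D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def bfs_counts(adj, source):
    """Return (dist, sigma): BFS distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                sigma[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                sigma[neighbor] += sigma[node]
    return dist, sigma


def betweenness(adj):
    """Betweenness of every node, counting each unordered pair (s, t) once."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for s, t in combinations(sorted(adj), 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue  # s and t are disconnected
        for v in adj:
            if v in (s, t) or v not in dist_s:
                continue
            # v lies on a shortest s-t path exactly when the distances add up.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return score


bc = betweenness(adjacency)
print(bc)
```

In this graph only C and D sit between the two triangles, so they collect all of the
betweenness while the other nodes score zero.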

As discussed, if we look at our friends graph above (Figure 6), node A has a very high
BC. If we were to remove it, it would lead to a huge disruption in the network, as
there would be no way for nodes {J,H,G,M,K,E,D} to communicate with nodes {F,B,C,I,L}
and vice versa, and we would end up with two isolated subgraphs. This illustrates the
importance of nodes with high BCs.
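The effect of removing such a node can be checked by counting connected components
before and after the removal. The Python sketch below uses a hypothetical graph in
which node C is the only link between two groups; the function name
`connected_components` and its `removed` parameter are illustrative choices.

```python
# Hypothetical undirected graph in which node C is the only link between
# the groups {A, B} and {D, E}.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E")]


def connected_components(edge_list, removed=None):
    """Connected components of the graph after dropping the `removed` node."""
    nodes = {n for edge in edge_list for n in edge if n != removed}
    adj = {n: set() for n in nodes}
    for u, v in edge_list:
        if removed in (u, v):
            continue  # skip every edge that touches the removed node
        adj[u].add(v)
        adj[v].add(u)

    components, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


print(connected_components(edges))
print(connected_components(edges, removed="C"))
```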

A sample application of BC is to find bridge nodes in graphs. Nodes having high BC are
the nodes that are on the shortest paths between a large number of pairs of nodes and
hence are crucial to the communication in a graph, as they connect a high number of
nodes with each other. Removing these nodes from the network would lead to a huge
disruption in the linkage or communication of the network.

A real-life use case of the above application is in analyzing global terrorism
networks. For example, if we have a network of terrorists or terrorist groups and
other related individuals represented as nodes of a graph, we can calculate BC for
each node and identify nodes with high BCs. These nodes (or terrorists in this case)
will be bridge nodes in the network. This information is very useful for defense
agencies, as targeting these nodes can be highly effective in disrupting the whole
terrorism network. Another use case of this metric is to detect and monitor possible
bottlenecks or hot spots in computer networks or flow networks.

Eigen Vector Centrality

The last flavor of centrality that we will be exploring is known as the Eigen Vector
Centrality. This metric measures the importance of a node in a graph as a function of
the importance of its neighbors. If a node is connected to highly important nodes, it
will have a higher Eigen Vector Centrality score than a node which is connected to
less important nodes.

Let’s look at the graph given below to further explain the concept:

Figure 7

The adjacency matrix A of the above graph will be as shown below:

Figure 8

Let’s assume that in the above graph, the importance of each node is measured by its
degree, such that the higher the degree of a node, the more important it is in the
graph. The degrees of the various nodes are shown below:

Figure 9

The above can also be represented as a column vector V, as shown below:

Figure 10

Now, mathematically the Eigen Vector Centrality is calculated as below:

Figure 11 showing Eigen Vector Centrality Calculation — 1st Iteration

The resultant 1-D vector in the above equation gives the Eigen Vector Centrality (EVC)
score for each of the nodes in the graph. Effect of the first iteration of multiplication
can be visualized as shown below:

Figure 12 showing EVC scores of each node after 1st iteration of multiplication

As you can see above, nodes A and B both have a high score of 8, since both of them
are connected to multiple nodes with high degrees (importance), while node E has a
score of 3, since it is only connected to a single node of degree 3. It is also
important to observe that the EVC score for each node in the resultant vector is
nothing but the sum of the degrees of its neighboring nodes. For example: EVC score
for node A = degree(B) + degree(C) + degree(D) = 8
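The first multiplication can be reproduced with a few lines of Python. The edge list
below is hypothetical: it was chosen so that the scores quoted in the text (A = 8,
B = 8, E = 3) come out, but the graph in Figure 7 may differ, and the values for C and
D follow only from this assumed edge list.

```python
# Hypothetical edge list, chosen so that the scores quoted in the text
# (A = 8, B = 8, E = 3) come out; the graph in Figure 7 may differ.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "E")]

index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Adjacency matrix: entry [i][j] is 1 when nodes i and j share an edge.
adjacency = [[0] * n for _ in range(n)]
for u, v in edges:
    adjacency[index[u]][index[v]] = 1
    adjacency[index[v]][index[u]] = 1


def mat_vec(matrix, vector):
    """Multiply a square matrix by a column vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


degrees = [sum(row) for row in adjacency]      # starting vector V of degrees
first_iteration = mat_vec(adjacency, degrees)  # adjacency matrix times V

print(dict(zip(nodes, degrees)))
print(dict(zip(nodes, first_iteration)))
```

Each row of the adjacency matrix selects a node's neighbors, so the product adds up
the neighbors' degrees, which is exactly the observation made in the text.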

Now if the resultant EVC vector that we got above in the equation (Figure 11) is again
multiplied by the adjacency matrix A, we will get bigger values for EVC score for each
node in the graph, as shown below:

Figure 13 showing Eigen Vector Centrality Calculation — 2nd Iteration of multiplication

The effect of multiplying the resultant vector again (2nd iteration of multiplication)
with the adjacency matrix can be visualized, as shown below:

Figure 14 showing EVC scores of each node after 2nd iteration of multiplication

Now, why did we multiply the resultant vector again with the adjacency matrix?
In short, multiplying the resultant vector again with the adjacency matrix of the
graph helps the EVC score spread out in the graph, so that each node gets a globally
meaningful EVC score rather than a localized one. After the first iteration of
multiplication, each node’s EVC score is a function of only its direct (1st degree)
neighbors, and thus is a localized score which might not be accurate at a global level
in the graph.

Elaborating on the above, if we visualize these operations, we can observe the
following:

After the first iteration of multiplication, each node gets its EVC score from its
direct (1st degree) neighbors.

In the second iteration, when we multiply the resultant vector again with the
adjacency matrix, each node again gets its EVC score from its direct neighbors, but
the difference is that this time the scores of the direct neighbors have already been
impacted by their own direct (1st degree) neighbors (from the first iteration of
multiplication), which makes the EVC score of any node a function of its 2nd degree
neighboring nodes as well.

In subsequent iterations of multiplication, the EVC score of graph nodes keeps
getting updated by EVC scores from neighboring nodes of farther degree (3rd, 4th and
so on).

Repeated multiplication eventually makes the EVC score of every node a function of
several degrees of its neighboring nodes, thereby providing a globally accurate EVC
score for each node. Usually the process of multiplying the EVC vector with the
adjacency matrix is repeated until the EVC values for nodes in the graph reach an
equilibrium or stop showing appreciable change. Because each multiplication makes the
raw values larger, the vector is normally rescaled (normalized) after every iteration
so that successive vectors can be compared.
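The repeated multiplication can be sketched as a power iteration in Python. This is a
minimal sketch under stated assumptions: the edge list is a hypothetical stand-in for
Figure 7, the vector is rescaled to unit Euclidean length after every step (one common
choice of normalization), and the tolerance and iteration limit are arbitrary.

```python
import math

# Hypothetical edge list standing in for Figure 7 (an assumption).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "E")]

neighbors = {name: set() for name in nodes}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)


def eigenvector_centrality(adj, tol=1e-9, max_iter=1000):
    """Power iteration: repeat score <- A x score, rescaling after each round."""
    score = {node: float(len(adj[node])) for node in adj}  # start from degrees
    for _ in range(max_iter):
        # One multiplication by the adjacency matrix: sum the neighbors' scores.
        new = {node: sum(score[nb] for nb in adj[node]) for node in adj}
        # Rescale to unit length so values stay bounded and comparable.
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[node] - score[node]) for node in adj) < tol:
            return new
        score = new
    raise RuntimeError("power iteration did not converge")


scores = eigenvector_centrality(neighbors)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 4))
```

On a connected graph that is not bipartite, as here, the iteration settles on a single
vector; on bipartite graphs the plain iteration can oscillate and needs a small
modification.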

The discussion of applications of Eigen Vector Centrality is vast and deserves a
separate article in itself. One sample application of EVC is the PageRank algorithm,
used by Google and many other companies to rank web pages on the internet by
relevance. PageRank is a direct variant of EVC. Web pages on the World Wide Web have
links that point to/from other web pages. You can think of each web page as a node in
the graph and each outgoing/incoming link as a directed edge leading to/from another
web page, thereby making up the whole World Wide Web graph. The graph of web pages
undergoes several iterations of EVC calculation so as to calculate globally accurate
relevance rankings of each web page. The web pages with high EVC scores can then be
targeted for marketing and other commercial purposes.
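As a rough illustration of how PageRank relates to the iteration described above, here
is a minimal Python sketch. It assumes the textbook formulation with a damping factor
(0.85 is a conventional default), which the article does not discuss, and it runs on a
tiny hypothetical web graph. It is not a description of any production ranking system.

```python
# Hypothetical web graph: page -> pages it links to (not from the article).
# Every link target must also appear as a key.
links = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}


def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Iterative PageRank; the damping factor is an assumed textbook default."""
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not out_links[page])
        new = {page: (1.0 - damping) / n + damping * dangling / n for page in pages}
        # Each page passes an equal share of its rank along every outgoing link.
        for page in pages:
            for target in out_links[page]:
                new[target] += damping * rank[page] / len(out_links[page])
        if sum(abs(new[page] - rank[page]) for page in pages) < tol:
            return new
        rank = new
    return rank


ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

As with EVC, a page's score depends on the scores of the pages linking to it; the
damping term mainly keeps the iteration well behaved on directed graphs with dead ends
or isolated regions.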

The field of graph analytics is vast and has immense practical applications. The scope
of this article was to cover the fundamentals of Centrality, and I hope it gives the
reader an insight into the fascinating world of Graph Analytics.

Below is a list of various Graph Analytics libraries and software that can be used for
Graph Analytics:

Gephi (https://gephi.org/)

Cytoscape (https://github.com/cytoscape/cytoscape.js)

Neo4j (https://neo4j.com)

GraphAnalyticsLib (https://github.com/jb123/GraphAnalyticsLib)
