Episode 5: Communities in Networks - Network Law No. 2

What's the rumpus :) I'm Asaf Shapira and this is NETfrix.
In the last episode we learned that since the start of the 21st century, network science has acknowledged that networks are not random and that they follow universal laws. And if the Power Law is the network's rule No. 1, then rule No. 2 is: nodes in networks congregate into communities. What does that mean?
A closer look at the internal structure of a network shows that a network is made up of dense clusters that are highly inter-connected. These clusters are called communities, if you are a network scientist, or clusters, if you are more into computer science.
Understanding communities will allow us to analyze networks of any size easily and efficiently, to discover what we did not know we did not know (the so-called unknown unknowns) and, most importantly, to tell an interesting story about our data. In this episode we'll show how to do it, and what it has to do with cauliflower. So, let's begin.

So, why do clusters or communities form in networks?


Say we're going to a party with a friend. We walk in the door, look around and recognize no one, when suddenly a friend from the other side of the room gives us the hello sign. We join our gang, which we have known for a long time from joint service in the army. Looking around the room we see more groups: guys who work together on some kind of a startup, guys who know each other from college, a group of girls who just arrived, and a bunch of guys we don't know or understand what they are doing here. We naturally formed clusters or groups that each have a common trait: a common workplace or college, common gender, or just common weirdness.
These clusters were formed according to the logic of homophily (from Greek, "love of the same", or just plain "similarity"). Similar nodes connect better to each other and tend to have more interactions between them.
In an example from a previous episode, we analyzed the sociograms or drawings of
Moreno’s social networks.

We'll recall that Moreno, as a social psychologist, researched the social relationships
of pupils within a classroom. Moreno noticed that in third and fourth grade, the
network had split into 2 distinct communities with a single connection between
them. The common trait, or homophily that defined each community was that of
gender. By the way, the study was done in the 1930s in Brooklyn. When I presented
it to teachers in 2019 Israel, they noted that the clustering of pupils into gender-
based communities happens in much earlier grades these days than those depicted
by Moreno.
Moreno's Sociograms: Social ties form a "Boys" Community and a "Girls" Community

In human networks, there are many kinds of homophilic traits by which the network clusters, such as language, religion, ethnicity, common opinions and more. But the most common homophilic trait is probably geography, or more accurately, geographical proximity.
If we think about it for a moment, then the traits of language, religion and ethnic
origin are often very related and even dependent on geographical proximity.
Even shared opinions are often a matter of geography. For example, it can be
assumed that the hard core of Israel's supporters is, well... in Israel.
The same is true of the "Small World" example from a previous episode. Just as a reminder, the "Small World" experiment was conducted by Milgram, a well-known social psychologist, in the 1960s. In this experiment, Milgram set out to find how many mutual acquaintances a chain letter would have to pass through to get from a random source to a random destination. The chains of letters whose source and destination shared geographical proximity were shorter and faster, indicating that geographical proximity makes for better connectivity.
But what about today's on-line networks? Technological developments should have made us indifferent to distance. After all, theoretically, I could offer friendship on Facebook to someone who lives next door just as easily as to anyone on the other side of the globe. In practice, we'll find that the chances of becoming friends with someone who shares your physical space are much higher. Check your friends list on Facebook just to make sure.
We can certainly be friends with people abroad who are not physically close to us, for example through working relationships or rooting for the same sports team (Green Bay Packers, anyone?). Our homophily can be based on working on the same project or shared interests, but these ties are usually fewer and weaker than our close-proximity ones.
For example, I estimate that if we partition Facebook into communities (in the network sense), we will find communities with strong geographical homophily: the Israeli community, the community of France, Spain, etc.
But what about border regions between countries?
If geographical proximity is so significant, will we get border communities? The
answer is probably not.
Visible boundaries, and invisible ones, limit geographical homophily. In the case of countries, it's probably because of language barriers. This means that proximity is a necessary but not a sufficient condition, and so nationalism still carries weight.
We can visually see the effect of borders in an interactive map published in a New York Times article in 2018, showing where American Facebook users' friends are located across a map of the states. In light of everything we've talked about so far, it will come as no surprise that the vast majority of a user's friends are in high proximity, but the number of ties drops very steeply once the state's borders are crossed.
Another example in the context of borders is a network study of gangs in Los Angeles
published by Radil in 2010, in which he showed that the highway in Pasadena serves
both as a physical and a virtual network border between the gangs.
Personally, I recall an example of virtual organizational borders from my service,
when initial attempts were made to produce integrated teams from different units.
The idea originated from the reasonable assumption about synergy, meaning that
combining several disciplines together, makes for a better output.
Some of the early participants of those synergy teams expressed frustration that despite the joint sit-down, there were still barriers and a lack of cooperation between the team partners. Often the participants did not see themselves as part of the team but as representatives of their organizational framework, or to put it bluntly, human sacrifices to the managers' attempt to show cooperation. Therefore, structural reform was also required alongside the physical and geographical regrouping.
Network analysis of the organizational ties could have revealed this sooner, because it would have shown that the network clustered according to same-unit homophily and not geographical homophily.
Yet another example of the homophilic traits of communities can be found in academia. For this, we'll use a database of academic papers and represent it as a graph or network, where each paper (or node) is linked to the papers that cite it.
When running community detection on this network, we'll find that the communities are arranged according to the various fields of interest that concern the academy (a community of physicists, a community of biologists, etc.).

That's because papers mostly cite papers related to their field, which is homophily by practice. Papers in the field of chemistry will most often cite papers in chemistry. Sometimes they will also cite papers from the field of biology or computer science, but their strongest ties will be to their field of practice.

Apart from homophilic traits, communities also have another pretty amazing feature: they are fractal. What does that mean?
At first glance, the communities look like a collection of random clusters, but a
deeper look into them will reveal that these clusters are composed of smaller
clusters, which are also dense and interconnected and so on and so forth. This is the
fractal structure of the network. And you can't say the word fractal and not mention
cauliflower, because it's a great example for fractal structure.
I do not know how close your relationship with cauliflower is, but we are going to take it to the next level. From a distance, a cauliflower looks like a white lump, but on closer inspection we can see it's made up of several lumps, and these lumps are made up of smaller lumps, and so on. The cool thing is that each lump preserves the structure of the original cauliflower, at any resolution.
For the skeptics among us, I suggest going to the fridge and taking a look in the vegetable drawer. Cauliflower - not the innocent-looking vegetable we thought.
These lumps are called fractals and we will expand on them in the episode on
dynamic networks. What is the connection between cauliflower and a dynamic
network? Stay tuned to future episodes.
So, just as the cauliflower's structure is preserved at every resolution, so too is the structure of the network, thanks to its communities. That is, the structure of the network meets Barabási's definition of a structure that is scale-free. A quick reminder: as we've learned in episode 3, Barabási, a famous network scientist, discovered that the Power Law distribution is maintained at any resolution at which we look at the network, making it scale-free. And the same goes for clusters or communities in a network. In each community the Power Law holds: the community will have a few central nodes, and most of the nodes will have few edges.
Even when we break down the communities to sub-communities and so on, the
power law distribution within them will be maintained. On any network.
This feature of communities in a network will help us in our next use case which
involves classification and labeling.
When we seek to classify or label big data, using communities' homophilic nature
will allow us to label entire communities in the network.
And vice versa: in order to see what the homophilic nature of the community is, we don't need to identify or tag all the nodes in it. Why? In the Power Law we trust.
The Degree centrality measure, which is simply the number of edges connected to a node, can point us to the central nodes, which in turn will best characterize the community's homophilic nature.
For example, let's choose a community from the academic paper network we
touched upon earlier. Now, let's look at the top 1% of the most cited papers in it. If
those high-degree papers are from the field of physics, it will suffice to label the
community as the "physics papers". Because of the Power Law, we don't have to
classify each paper in the community.
The same goes if we come across a community on Facebook where the nodes with the highest centrality score are from Boston: it means we've probably reached the "Boston" community.
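As a sketch of this labeling trick, here's how it might look in Python with networkx, on a toy graph whose node names are entirely made up for illustration:

```python
import networkx as nx

# Toy stand-in for the paper-citation network described above;
# node names are hypothetical.
G = nx.Graph()
G.add_edges_from([
    ("physics_A", "physics_B"), ("physics_A", "physics_C"),
    ("physics_A", "physics_D"), ("physics_B", "physics_C"),
    ("bio_X", "bio_Y"),
])
community = {"physics_A", "physics_B", "physics_C", "physics_D"}

# Degree centrality inside the community's induced subgraph
sub = G.subgraph(community)
deg = nx.degree_centrality(sub)

# Inspect only the top-ranked nodes; on real data this would be the top ~1%
top = sorted(deg, key=deg.get, reverse=True)[:1]
print(top)  # the node with the most within-community edges
```

If the few top-degree nodes all share a trait (physics papers, Boston users), the Power Law lets us label the whole community by them.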
So, let's do a quick summing up:
A network is made of communities, which are dense clusters, meaning lots of edges
between the nodes in the community. There's a reason for this density and that is
the community's homophilic nature, meaning nodes that have similar traits tend to
connect more. Usually, but not exclusively, this trait will be geographical proximity.
The nodes within the communities conform to the Power Law, meaning there are a
few central nodes and many nodes that have only a few connections. Through the
central nodes, we can understand what's the community all about.

Another use case of community detection can be found in the field of text mining,
specifically in an original study in 2012 funded by the US army. The army was in
search of alternative ingredients to use in recipes, and so they called for SNA (Social
Network Analysis). The study made a network out of about 46 thousand recipes, by
linking together ingredients that appeared in the same recipe. The communities
detected on this network were discovered to be based on the homophily of different
flavors: salty, spicy and sweet (desserts). Using SNA algorithms, the researchers
were also able to show which ingredient can be replaced with which ingredient
because they shared the same role in the same community. So, this is how SNA
allowed for more flexibility in the army's kitchen, if not elsewhere.
When surveying the different communities, the most interesting findings are of
course the anomalies – the communities where the homophilic reasoning is
different from the rest of the network. What does it mean?
Take Facebook as an example. As we mentioned before, if we use community detection on the entire Facebook network, we'll get geographical communities, meaning a community for each country. It would be very interesting to discover a central community whose homophilic reasoning is different, say a community of cat lovers. This means that the passion for cats is stronger in those people than their geographical identity (and I'm sorry I called cat lovers "those people". Some of my best friends are people). In addition, the central position of the cat-lovers community in the network will testify to the importance of that issue to enough people around the network.
Now let's imagine ourselves as James Bond's arch-nemesis, fondling a cat while brooding over our dreams of world domination. In order to take over the world, we can go from one geographic community to another and take over its influencers, or we can take a shortcut and enlist the help of our global feline-friendly community to spread our message to the world. Knowing cats, I'm sure they are in for it.
OK, so how can we detect communities in the network?
In order to detect them, we'll first need to define what we are looking for, only to find out that, in fact, there is no scientific definition of what a community is. Weird, huh?
Let's differentiate for a second between the concept of "connected components", which we talked about in the episode about the small world, and communities. Recall that connected components are the "islands" in the network, and unlike communities, they have a clear definition or border: any node connected to a component is part of it, and if there's no edge between a node and the nodes in the component, then it's not part of it. The islands themselves can be made up of one community or several communities, but what defines the boundaries between the communities in a component? We will discuss this in more detail later on, but for the time being, we'll content ourselves with the most common definition for a community in data science: the number of edges within a community is greater than the number of edges coming out of it, making the ties in the community denser on the inside than on the outside. In other words, there will be many strong ties within the community and mostly sparse and weak ties to nodes outside the community.
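This working definition is easy to check directly. A minimal sketch in Python with networkx, on a toy graph with made-up node numbers:

```python
import networkx as nx

# A dense triangle {1, 2, 3} with one weak tie leading outside of it
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])
candidate = {1, 2, 3}

# Edges with both endpoints inside the candidate community
internal = G.subgraph(candidate).number_of_edges()
# Edges with exactly one endpoint inside (ties crossing the boundary)
external = sum(1 for u, v in G.edges() if (u in candidate) != (v in candidate))

print(internal, external)  # 3 internal edges vs. 1 boundary edge
```

More edges inside than out, so by the common definition, {1, 2, 3} qualifies as a community.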
And you can't mention "weak ties" without mentioning Granovetter. Granovetter's
paper from 1973: "The Power of Weak Ties", gained over 52,000 citations and is the
most cited paper in the network field.
His paper is considered groundbreaking since, before it, the field had dealt mainly with the strong ties of the individual, based on the intuitive assumption that strong ties are the important and influential ones.
In his paper, Granovetter warns us against this kind of intuition and uses empirical data to examine the strength of weak ties, by demonstrating how people find a new job.
Finding a job is a very significant event in the life of an individual, and we might intuitively assume that the individual's strong ties will play a significant role in this endeavor. Examples of strong ties are close friends, family and so on. What Granovetter showed was that it is precisely the individual's weak ties that made it possible to find a new job. Examples of such weak ties are acquaintances from the past, such as from college or a previous job. The reason he gave for this seeming paradox is the circulation of information.
We usually do not get new information from strong ties. There's a low chance that the people close to us will know of news we don't already possess. It is precisely the weak ties that allow new information to trickle in.
Granovetter's paper also holds additional, lesser-known insights that still hold water even half a century after its publication. Some of them will be dealt with in future episodes. The one we'll mention that has bearing on our case is the paradox of strong ties. Granovetter diagnosed that strong ties produce a paradox at different network resolutions: at the micro level, the strong ties create cohesion between the nodes, but at the macro level they actually contribute to the fragmentation of the network. He claimed that this paradox needed further proof, and later in the episode we will see if he got it right.
As a side note: Unlike current papers in which authors explain how all the previous
researchers were wrong and how their paper is significantly better, Granovetter in
his paper often praises earlier researchers and other papers. Well, it was the
seventies… Simpler times.. (:
Around the time Granovetter's paper was published, another researcher called Wayne Zachary discovered the potential of community detection in his famous study of a karate club.
Zachary mapped friendships at a karate club which included dozens of members. He did so for several years during the 1970s. And as happens in many karate films of the 70's, this club also split due to a conflict between instructors, or as we know it from the movies: the good dojo and the evil dojo.
What made Zachary's study stand out? Though Zachary is an anthropologist, he used
mathematical tools to model his research. Zachary identified that there are intrigues
in the club that are manifested in the flow of information within each group and at
the same time withholding of information from the other group. Thus, to analyze the
flow of information, Zachary used a method called max flow / min cut which is used
to optimize the flow between 2 nodes in a network. To do so, one node is named "the
source" and the other named "the sink" or "the target".
The friendship network in the karate club. The instructors are labeled 1 and 34

This method makes it possible to identify bottlenecks in the flow between two nodes. Zachary chose one instructor as the source and the other as the target because he realized that the conflict between the two instructors over the dojo's values encouraged dissemination of information within each instructor's support group alongside withholding of information from the other group. That is, Zachary assumed that the bottleneck in the network would be the boundary between the 2 groups, and so it was.
The algorithm partitioned Zachary's graph into two communities in almost complete accordance with the actual split that happened in the club, i.e., it gave an excellent result relative to the ground truth of the research.
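Zachary's setup can be reproduced in spirit with networkx, which ships his karate club graph (in this version's numbering the two instructors are nodes 0 and 33; giving every friendship a unit capacity is my assumption for the unweighted case):

```python
import networkx as nx

# Zachary's karate club, bundled with networkx. In this numbering the
# two instructors are nodes 0 and 33.
G = nx.karate_club_graph()
nx.set_edge_attributes(G, 1, "capacity")  # each friendship carries 1 unit of flow

# Max-flow / min-cut between the two instructors: the min cut is the
# "bottleneck" set of friendships separating the two camps.
cut_value, (side_a, side_b) = nx.minimum_cut(G, 0, 33)
print(cut_value)        # number of cross-camp friendships in the bottleneck
print(sorted(side_a))   # members predicted to side with instructor 0
```

The two sides of the cut are the algorithm's prediction for the two camps, which is essentially what Zachary compared against the real-life split.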
The disadvantage of the max-flow/min-cut method for detecting communities in a network is that it imposes prior knowledge on the network. First, a source and a target for the flow must be defined in advance, and we will not always be able to point at two such nodes. In addition, the partition is limited to only two communities, which does not necessarily reflect the complex aspects of reality. Setting this aside, Zachary's breakthrough came from a data-driven approach, that is, letting the data tell its story about the network, by using network algorithms.
This data-driven approach is the reason why this study is so iconic in Network
science or SNA: The community detection algorithm made it possible to give a good
prediction about the split, and which side would each member pick. Another
achievement was the data-driven ability to explain the social reason for the split. By
using the homophilic reasoning for the detected communities, and identifying the
central nodes in them, we can establish that those nodes, i.e., the instructors, were
the ones behind the split. Their high Degree score indicates that they rounded up
followers in support of their view about the future of the club.
Because of the discovery's importance, Zachary's Karate Club is a very well-known concept in network science, but what is perhaps less known is that there is a club called the "Zachary's Karate Club Club" which awards a trophy to scientists in the network field. To win it, a scientist must be the first speaker at a network conference whose first slide is a graph depicting Zachary's Karate Club.
Though only a small country, Israel had its share in this prize thanks to Amir Rubin from Ben-Gurion University in Beer Sheva.
Another winner of the trophy is Mark Newman, who, together with his research partner Girvan, gave wings to community detection research when they showed in the early 21st century how to detect communities using a data-oriented method.
Until then, a common method was to partition the graph to communities by using
hierarchical clustering or dendrograms.

Dendrograms

The intuition behind that method was that networks have a hierarchical structure
and thus so do communities.
Hierarchical structuring of communities starts by advancing the nodes along their strong ties (a bottom-up method), and these ties define the community's boundary.

Other methods used in the field of clustering also required early assumptions about the data, even if not as extreme as in the case of Zachary's max-flow/min-cut method. Here are some examples:
The first example is an algorithm called k-means, which requires prior knowledge of the number of communities we want to get from the network. This means the method is more scenario-driven than data-driven, and perhaps a little supervised. The downside of using k-means is, of course, that the network may partition into a different number of communities than we initially thought. Although k-means allows the partition to be changed based on observation of the result, this requires many manual iterations.
Another clustering algorithm is DBSCAN, a 1996 algorithm whose paper is one of the most cited regarding clustering algorithms, but DBSCAN also requires an early assumption about the data. In this case, the required assumption is: what is the minimal number of nodes that should be in a community?
There are also methods that assume a Gaussian distribution of the data, but since we know our data follows a Power Law, we will politely brush them aside and move on.
Another known method is k-clique, which comes from a family of algorithms based on rigid assumptions about what defines a community.
As we've mentioned before, the definition of what a community is is a bit vague. But
there are definite situations in which it can be said with certainty that a community
is a community. For example, in the case of a clique.
A clique is a group of nodes in which each node is connected to all the other nodes.

No edges can be added, making it a complete graph

For example, to be considered a clique, a community of 4 nodes must have 6 edges to link them, which is the maximum possible number (in general, n(n-1)/2 edges for n nodes; see image for examples).
Because there's no room in a clique for additional edges, it is sometimes called a complete graph.
It's not so unusual to find a three-node clique, but as the number of members increases, the chance of a community being a clique decreases significantly on a non-linear scale. This means there might be many tiny cliques in a network but only a few large ones. Why? Power Law, bit...s!
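To make this concrete, networkx can enumerate maximal cliques. A small sketch on a toy graph (node numbers are hypothetical):

```python
import networkx as nx

# A toy graph with one triangle (3-clique) and one 4-clique
G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3), (3, 4),                   # triangle plus a tail
    (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8),   # complete graph on 4 nodes
])

cliques = list(nx.find_cliques(G))   # all maximal cliques
largest = max(cliques, key=len)
print(sorted(largest))  # the 4-clique: [5, 6, 7, 8]
```

On real networks the output is dominated by tiny cliques, with big ones vanishingly rare, which is exactly the Power Law pattern described above.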
The slim chance of finding a big clique in a network was what aroused the suspicion of the authors of an independent report investigating Twitter campaigns in the 2019 elections in Israel. They suspected that one of the tweeting communities was not spontaneous, since it acted as a large clique, and this, as we've mentioned, is a rare event. The investigation was called the "Big Bot Report", and in the public debate that followed, the authors were accused of claiming or implying that the active users in this community were bots, whereas in fact behind the users were real activists.
We will not go into the debate here; links to the discussion are available here (articles in Hebrew):
- The authors' report
- The critique of the report by Dr. Oren Tzur and his staff, Ben-Gurion University
But what is undisputed by all is that a large clique indicates a significant
coordination that can certainly be suspected of being activated by bots.
The fact that cliques are less common is also the reason why the k-clique algorithm is not ideal. It will give us a very partial picture of the communities in the network, since it will point us only to the most extreme communities - the cliques (which sounds like a 1960's band). Another obstacle we'll face is the difficulty of understanding a community's sub-structure because of the clique's density. Besides, k-clique takes a long time to compute, and who has time to wait so long for a mediocre answer?
So, let's do a quick summary:
We've covered the evolution of clustering algorithms in the 20th century. Not to be an ageist, but those algorithms had their constraints, especially the need for early assumptions about the network, such as requiring communities to be cliques of a certain size, or assessing beforehand how many communities are supposed to be in the network.
Another aspect of those algorithms is their emphasis on the community's intra-connections, or strong ties. And this is where Girvan and Newman come in, with the algorithm that bears their name. The Girvan-Newman algorithm, from the early 21st century, makes no presuppositions about the community or network, neither in terms of the shape of the communities nor in terms of their number. It does this by flipping our perspective on what a community is:
If in all the examples we have given so far the main preoccupation was to find the strong ties that build the community, the complementary view would be to find the weak ties that form the boundary of the community.
Instead of tracking the community from the bottom up, through the strong connections, the Girvan-Newman algorithm first detects the boundaries of the community from the top down, or from the outside inward.
This kind of logic resonates with Granovetter's hypothesis from the beginning of the episode: the strong ties that contribute to the creation of a small community cannot hold a large network, and so the network disintegrates into communities whose weak ties represent the boundaries of its circulation. These are the boundaries that Girvan and Newman used.
To find them, they followed the same logic that led Zachary, who used bottlenecks to partition the network. But instead of using flow analysis between two nodes, they identified the bottlenecks in the graph by calculating the betweenness of the edges. For those cheaters who skipped episode 4 about centrality measures, betweenness is a metric that allows us to find nodes that bridge between different nodes in the network.
What Newman and Girvan did was to assign the betweenness score to the edges that connect the nodes. According to their method, removing the edges with the highest betweenness score, one by one, would gradually break the network down into connected components (or "islands"), each of which would be considered a community. On a side note, Girvan and Newman's paper is characterized not only by a straightforward, easy-to-follow style, but also by the modesty of its writers, who were the first to point out the weaknesses of the method they proposed:
First of all, the computation time for this method is long (because betweenness is a metric that takes a long time to compute).
Secondly, when examining the algorithm on dense networks, with many edges, the
results obtained are problematic.
And here's another problem: when do we stop removing edges? After all, theoretically it is possible to continue removing the edges with the highest betweenness until we are left with a network that consists only of nodes, each of which is a community on its own, with no connections at all.
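networkx implements the Girvan-Newman procedure as a generator that yields successively finer partitions, one split at a time. A quick sketch on the karate club graph:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Each item the generator yields is the partition obtained after enough
# highest-betweenness edges have been removed to split off one more component.
G = nx.karate_club_graph()
splits = girvan_newman(G)

first_split = next(splits)            # the first partition: 2 communities
groups = [sorted(c) for c in first_split]
print(len(groups))                    # 2
```

Pulling more items from the generator keeps removing edges, so the "when to stop" question from above becomes: which of the yielded partitions do we keep?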
The solution to this problem was brought by Newman himself, in 2004, and on the
fly he also solved the other problems he presented in his previous paper.
To decide when to stop removing edges, Newman used the modularity metric. Simply put, this measure is designed to examine the quality of community detection algorithms by giving a score to their partition of the network.
The score is calculated by comparing the density of the edges in the community obtained versus the expected density in a hypothetical random network with the same number of nodes and edges (one that preserves each node's degree).
The denser the community by comparison, the higher the score.
The great advantage of this metric was also that it computes very quickly.
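networkx exposes this metric directly. Here it scores the karate club's real-life split, which is stored in each node's "club" attribute in the bundled graph; a positive score means denser inside than the random baseline:

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Score the karate club's real-life split with the modularity metric.
G = nx.karate_club_graph()
clubs = nx.get_node_attributes(G, "club")
partition = [
    {n for n, c in clubs.items() if c == "Mr. Hi"},
    {n for n, c in clubs.items() if c == "Officer"},
]
score = modularity(G, partition)
print(round(score, 3))  # well above 0: denser inside than the random baseline
```

For the Girvan-Newman generator, the usual recipe is to compute modularity for each yielded partition and keep the one with the highest score.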
So, wait: if we now have a quick tool that scores partitions into communities, why not use it on its own as a partitioning algorithm, instead of using it only as a measure for other algorithms?
And this is exactly what Newman did and this breakthrough led to a flood of
algorithms in the field of community detection.
Perhaps the most famous algorithm that stemmed from the modularity family is the
Louvain algorithm by Blondel, who published it in a paper from Louvain University
in 2008.
So, how does Louvain work?
Ironically, the Louvain algorithm uses the ol' bottom-up method of constructing communities, and it does so in several iterations:
First, it starts at a random node, maps its immediate surrounding nodes, and computes the modularity score for these sub-graphs.

The stages or iterations of the Louvain algorithm (from Blondel's paper)

The scores range on a scale from 1, a perfect community, to -1, a clear-cut case that this is not a community: nothing to see here, keep moving!
If the modularity score of the small community or sub-graph is good enough, Louvain moves to another node and performs a similar process until convergence.
The output of the first iteration will be numerous small communities. In the second iteration, Louvain aggregates the smaller communities to produce larger communities, in the same way as in the first iteration: if the edges between the communities preserve a good modularity score, it continues, and so on in the next iteration, until convergence is achieved.
Louvain is also weighted, that is, the algorithm considers the weight or strength of the edges. For example, if a node is connected to 2 communities, Louvain will assign the node to the community to which its connections are stronger.

Community detection by Louvain. Colors = communities

The final output is aggregated communities, meaning that they are a product of sub-
communities that are connected to each other which in turn are themselves a
product of sub-sub-communities and so on, up to the resolution of a single node or
atom.
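A minimal run of Louvain in Python: networkx (version 2.8 and up) bundles an implementation, and the seed argument fixes the random node the algorithm starts from, so repeated runs return the same partition:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Louvain community detection on the karate club (networkx >= 2.8).
# A fixed seed pins down the random starting node, making the run repeatable.
G = nx.karate_club_graph()
communities = louvain_communities(G, seed=42)

print(len(communities))      # a handful of communities
for c in communities:
    print(sorted(c))
```

The returned sets are the top level of the aggregation; the iterations described above happen internally.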
So, what's so great about the Louvain algorithm?
First of all, speed. It gets the job done very quickly, even on large networks, and can be run on parallel computing platforms. The results obtained are considered relatively reliable and are widely used in network research. In addition, the modularity score that the algorithm gives to each community provides additional insights about the network; we'll cover this in another episode that will deal with best practices in network analysis.
Before I elaborate on the disadvantages of Louvain, I'm required to make a full disclosure: I'm a sucker for Louvain.
The feature I like best about it is that it is unsupervised which means it has no
presuppositions on the data other than the very definition of the community as a
denser part of the network. In contrast to other algorithms that require the
community to have rigid definitions, such as cliques, or that predetermine how
many communities the network should be partitioned into, the Louvain's way sends
a cooler and more liberal vibe.
Now, it seems that the main claim against Louvain is the so-called "resolution
problem". Louvain's output usually looks like this: One huge community, and next
to it a small number of large communities and beyond them a long tail of many tiny
communities.
If this reminds anyone of a Power Law, it's not by chance. Communities are Power
Law distributed.
So, where's the beef?
It depends on your approach. Some researchers would like to see the large
communities disbanded in the first stage of the research into smaller communities,
and we know that this is feasible because this is how Louvain works: it groups small
communities into a large community.
Personally, I see it as a feature, not a bug. This way you can see which communities
are more connected to other communities by being members of the same large
community.
But when it comes to huge networks, which contain millions of nodes, there is a limit
to the resolution of Louvain. After about 3 dives into the sub-communities, the
Louvain "breaks down", that is, you can dive to the sub-communities of a large
community, and to their sub-sub-communities but an attempt to see even smaller
communities will probably result in seeing each node as a community in itself. This
is because Louvain uses only 3 iterations or 3 levels of aggregation.
A possible solution to this is to run the Louvain at any level at which the network is
viewed. But here we encounter another challenge: every time we run Louvain, we
might get a different result for the communities detected. Usually a very negligible
difference, but a difference nonetheless.
Why?
Louvain selects at random the seed node from which its process begins. A different
starting point for the bottom-up aggregation of communities can produce a
difference in the partition into communities.
How different? Not by much, because the Louvain conforms to the basic idea behind
the logic in which communities are formed in the network, and as we said earlier,
they are not randomly formed. A popular practice to deal with this problem is to run
Louvain several times and choose the result that comes back the most, thus also
optimizing the partition into communities.
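A minimal sketch of that practice, again assuming networkx's Louvain (the episode doesn't specify a library). One common variant, shown here, keeps the highest-modularity partition rather than the most frequent one:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()  # stand-in example network

# Each seed gives a different random starting node, so partitions may
# differ slightly; keep the one with the best modularity score.
best_partition, best_score = None, float("-inf")
for seed in range(10):
    partition = louvain_communities(G, seed=seed)
    score = modularity(G, partition)
    if score > best_score:
        best_partition, best_score = partition, score
```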
Optimization via iterations is a main part of the Leiden algorithm, named after
Leiden University as a comeback to Louvain. Published in the 10th year of
Louvain's reign, this algorithm allows for more iterations and corrections, and
claims to be even faster than Louvain. Time will tell, but Leiden might be the new
successor to the throne, for those among us who are willing to embrace change.
A final point regarding Louvain and like-minded algorithms is that they can assign
each node to only one community. This seems like an apparent drawback, especially
when analyzing human networks.
If a community can be defined by homophilic traits such as gender, geography, etc.,
it stands to reason that a person can have more than one of these traits, with the
edges in the network to prove it. So why limit nodes to just one community or just
one aspect of their behavior? Is the Louvain a victim of Identity Politics?
And indeed, there are algorithms that allow a node to be a member of more than one
community. This family of algorithms is called "overlapping communities". K-clique,
for example, is one of them. A node can belong to more than one clique.
Blinded as I am by my love for Louvain, I must say I see this restriction, too, as a
feature, not a bug.
It's clear that a node can be part of countless communities, but the question is
which community it belongs to the most, at least for the period studied. In my
opinion, the slogan "one node to one community" reveals the true nature of the node.
For example, if I try convincing myself that I am a devoted family guy, but a
community detection analysis will assign me to my job community rather than to
my family community, it means something. It means that my ties to my workplace
are more numerous and stronger than to my family.
So, let's do a quick recap: we've covered the modularity family of algorithms, which
is an unsupervised method for community detection. Because they are
unsupervised, these algorithms are not committed to the analyst's predispositions
and thus can produce novel results. This method optimizes the assignment of nodes
to communities via iterations and aggregations, and it does so fast enough to be used
on big data. But just as there are hundreds of algorithms to find key nodes in the
network, stemming from different notions of what it means to be central in a
network, so there are hundreds of algorithms for community detection, stemming
from different notions of what a community is. In future episodes we might touch
on some more of them, like the random-walk family, but I think that's enough for now.

So, we now understand why communities are formed, what they look like and how
to detect them. So, what do you do with them?
Probably one of the common use cases is the use of community detection to classify
or label data. Here's an example from a graph theory meet-up I've been to that dealt
with the classification of products. A product can be associated with several
categories; a table, for example, can be assigned to numerous categories such
as dining, furniture, outdoor equipment and more, depending on its kind. If we
build a network of products by assigning an edge between products that were
purchased together, then community detection (such as Louvain's) will show us
how customers associate the products, even if sometimes our content experts do not
like it.
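As a sketch of the idea, a co-purchase network can be built by linking any two products that appear in the same basket, with repeat co-purchases adding edge weight. The basket data and product names here are made up for illustration, and networkx's Louvain is an assumed tooling choice:

```python
import networkx as nx
from itertools import combinations
from networkx.algorithms.community import louvain_communities

# Hypothetical purchase baskets (illustrative data only).
baskets = [
    ["table", "chairs", "tablecloth"],
    ["table", "chairs"],
    ["grill", "charcoal", "table"],
    ["grill", "charcoal"],
]

G = nx.Graph()
for basket in baskets:
    for a, b in combinations(basket, 2):  # every co-purchased pair
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1  # repeat co-purchase strengthens the tie
        else:
            G.add_edge(a, b, weight=1)

# Communities in this graph approximate how customers group the products.
communities = louvain_communities(G, weight="weight", seed=1)
```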
Another major use case for community detection is as a complementary effort to the
centrality measures, in order to find the centers of gravity in the network that are
relevant to the context of our research. And now we'll come back and address the
Justin Bieber problem we encountered in the previous episode on centrality
measures.
In the previous episode, we cited as an example an American study that tried to find
who are the leaders of the revolution of the "Arab Spring" in Egypt in 2011. The
study was based on the Twitter network, where nodes are twitter accounts and
edges are retweets. What they looked for were the centers of gravity in the
revolution movement, and using centrality measures, they found that the top
ranking twitter account in the network belongs to none other than Justin Bieber.
And here comes community detection to the rescue. Community detection on the
Twitter network has shown that Justin Bieber is indeed a center of gravity, but its
ties are mostly limited to a specific community. The homophilic nature of that
community was Bieberism and bad taste. Not surprisingly, the "Justin Bieber"
community was built in the shape of a star, and you would not believe who was at its
center.
In contrast, the community that actually consisted of the revolutionaries in Egypt
was much denser, or intra-connected, than Justin's community, indicating that it held
more discourse and better echoing of messages. At the center of gravity of this
community were the leading bloggers and people who probably took a more
significant role in the events than Justin, despite having scored lower than him on
the Degree centrality measure.
Twitter communities: Left: The revolution community | Right: Bieber's community

As we mentioned before, you don't need to go through all the nodes in the
community to find out what the community is all about. It's enough to go through
the high Degree nodes as in the case of Justin Bieber's Twitter community. The
highest ranking node will probably be Justin Bieber, making it easier to label the
community's homophilic nature.
By the way, in this example we used a new metric we can apply to networks and
communities, called the Density metric. This measure can be easily calculated by
dividing the number of existing connections or edges in the network by the number
of potential edges that can be in the network. The higher the number, the denser the
network, which indicates greater interactivity. A score of 0 means there are no
connections at all in the network and a score of 1 means the network is actually a
clique or a complete graph.
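The calculation is simple enough to write by hand, and networkx (an assumed tool choice on my part) exposes the same formula directly:

```python
import networkx as nx

# 4 nodes, 4 edges out of a possible 4 * 3 / 2 = 6 potential edges.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
n, m = G.number_of_nodes(), G.number_of_edges()
manual_density = m / (n * (n - 1) / 2)  # existing / potential edges

# A complete graph (clique) scores exactly 1; an edgeless graph scores 0.
clique_density = nx.density(nx.complete_graph(5))
empty_density = nx.density(nx.empty_graph(5))
```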
Density metrics can be used, for example, in a comparative follow-up of new groups.
The group's cohesion can be measured by using the Density metric. A successful
group should usually score high on this index.
But the real power behind community detection is when tackling Big data.
There are too many podcasts and YouTube channels that aim to frighten their
audience with big data and giant networks by pointing to the challenges of analyzing
and deriving meaning from the data, but fortunately, we prefer to come up with
solutions, not problems.
When it comes to analyzing large networks, community detection should be our first
go-to tool. It allows us to view the entire network, Top-Down, and thus analyze it
quickly and elegantly, regardless of its size. This is made possible by the
"Communities' Graph". A community graph is a network of links or edges between
communities, where each node in the graph represents one community which in
turn represents all its nodes. In fact, this graph is just an aggregation of the nodes
graph.
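A communities' graph can be sketched by contracting each detected community to a single node and summing the node-level edges that cross between communities. This is a minimal sketch, again assuming networkx as the tooling:

```python
import networkx as nx
from collections import defaultdict
from networkx.algorithms.community import louvain_communities

G = nx.karate_club_graph()  # stand-in node-level network
communities = louvain_communities(G, seed=7)
node_to_comm = {n: i for i, comm in enumerate(communities) for n in comm}

# Aggregate: edge weight between two community-nodes = number of
# node-level edges running between those two communities.
weights = defaultdict(int)
for u, v in G.edges():
    cu, cv = node_to_comm[u], node_to_comm[v]
    if cu != cv:  # intra-community edges disappear into the node
        weights[tuple(sorted((cu, cv)))] += 1

comm_graph = nx.Graph()
comm_graph.add_nodes_from(range(len(communities)))
for (cu, cv), w in weights.items():
    comm_graph.add_edge(cu, cv, weight=w)
```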
Left: Communities Graph | Right: Node-level Graph

What we'll see in the aggregated communities' graph is the same phenomena seen
in an ordinary graph or network. That is – the communities' graph is also Scale-free.
Doesn't matter in which resolution we view the network – nodes or communities -
the same laws apply to both and the No. 1 law is the Power Law. As the network
grows, there will be few large communities connected to each other that will
constitute a Giant Connected Component while most communities will be connected
components of their own, not connected to any node outside their community.
These communities will form the "long tail" of the Power Law distribution. Most
often, each such community will have a relatively low number of nodes.
So, to understand what the network is about, we need to visit only the central ones,
because the centrality and size of communities are also Power Law distributed.
For example, if we look at the community graph of our academic papers network,
we could see which communities, or rather, fields of study, are more central. In the
previous episode, and in the first part of this one, we've touched the subject of
anomalies in the network, as a good place to find interesting insights. The same goes
for the community graph. If we find a central, but comparatively small community,
it means that this community, regardless of its size, is probably important to others
in the network, making it an interesting subject for further research.
By combining the communities' graph to our previous understanding of
communities' homophilic nature, we'll get an overview of the whole network that
can reveal to us its structure and the logic behind it. But even more, it can reveal to
us what we did not know we did not know or the "Unknown-Unknown".
How can we know something if we do not know it exists?
Let's go back to our network of academic papers example. By applying community
detection to this network, we can reveal and label, using the homophilic nature of
communities, the various fields of interest that occupy the academia and also view
the connections between these fields.
But what if there is a community that we have a hard time labeling?
For example, if we have looked at the academic network several years ago, we would
find an unfamiliar community that contains papers from several different fields such
as biology, chemistry and psychology.
What we have probably stumbled upon is a new academic field, in this case,
neuroscience, which we didn't know we didn't know about, because it was a field in
the making. Applying a data-driven approach brought it to our knowledge.
Now, imagine a network where only some of the data is tagged. If you find central
communities that do not contain even a single tagged node, this could indicate a
significant gap in the field that community represents.

So, in conclusion: We learned that networks are made of communities, which are the
dense clusters that make up the network. Their density is not random but a result
of the homophilic nature of the community, meaning nodes will often connect
to similar nodes. Applying both the 1st law of the network, i.e., Power Law, and the
2nd law which is "nodes congregate to homophilic communities", we can tackle the
world of bigdata in many fields, such as: Classification and Labeling, finding relevant
centers of gravity in the network, finding the unknown-unknown and of course,
stimulate our imagination in order to tell a good story about the network and our
data. Is that all? Of course not. Communities still have many more features, for
example, we haven't touched the fascinating aspects of the dynamic nature of
communities or communities in directed networks. But this calls for restraint, so
we'll save something for future episodes.
