
Detection of Overlapping Communities

in Social Tagging Systems

A thesis submitted for partial fulfilment


of the requirements for the degree of

Master of Technology

by

Abhijnan Chakraborty
10CS60R03

Under the Guidance of

Prof. Niloy Ganguly

Department of Computer Science and Engineering


Indian Institute of Technology, Kharagpur
India
April, 2012
“Man is by nature a social animal; an individual who is unsocial naturally and not
accidentally is either beneath our notice or more than human. Society is something
that precedes the individual. Anyone who either cannot lead the common life or is
so self-sufficient as not to need to, and therefore does not partake of society, is
either a beast or a god.”
– Aristotle

Dedicated to My Parents
Certificate

This is to certify that the thesis entitled ‘Detection of Overlapping Communities in Social Tag-
ging Systems’ submitted by Abhijnan Chakraborty, Roll – 10CS60R03, Department of Computer
Science and Engineering, Indian Institute of Technology, Kharagpur; for partial fulfilment of the
requirements for the degree of Master of Technology in Computer Science and Engineering; is
a bonafide record of the work and investigations carried out by him under my supervision and
guidance.

Prof Niloy Ganguly


Dept. of Computer Science & Engg.
Indian Institute of Technology
Kharagpur – 721302, India

Acknowledgements

While the rest of the thesis is meant to convey the technical work done, this is the only place
where I can take the liberty of expressing personal gratitude. Especially after working on online
‘social’ systems, I do not want to undermine the very basis of such studies – “Man is a social
animal”. No one can even survive, let alone write a thesis, without countless direct and indirect
help from others.
First and foremost, I would like to thank my research advisor, Prof. Niloy Ganguly, for his advice
and support during the work. He gave me the freedom to pursue my ideas and work at my own
pace, and was always available to discuss various problems on the way. I enjoyed spending the
last year with him both at work and otherwise. His attitude towards students and the countless
hours of discussions on different issues have changed me in many ways.
A special thanks to Saptarshi Ghosh, a research scholar in the department, who closely followed
this work. I am highly indebted to him for clarifying my doubts and for providing suggestions
and criticisms on my work. All the members of CNeRG (Complex Networks Research Group)
have extended personal and professional help in times of need.
I was lucky enough to have some outstanding teachers in my school days, especially Mr. Mukunda
Lal Pal and Mr. Dipankar Sen, who stood behind me in every tough situation and never
lost their belief and confidence in me. Words are not enough to express my gratitude to them.
I also thank all my teachers at Jadavpur University, who introduced me to the exciting field of
computer science.
I want to take this opportunity to thank all of my friends for reminding me that there are many
other important things in life than studying. My university friends Sandip, Abhirup, Sourav,
Utsab and folks from ‘Amar Bangla Mess’ – Sudipta, Ambarish, Apurba, Arijit, Debabrata, Sou-
vik, Kaushik, Dhruba have already become a part of my life. My childhood buddies – Anwesha,
Pinaki, Prithwiraj, Jyotirban, Soumya are kind enough not to expect explicit acknowledgements
from me. My life wouldn’t have been complete without them.
Last but not the least, I would like to thank Maa, Baba, Dida, Didivai, Masimoni, Mesomoni,
Valomasi, Valomeso, Papluda, Pappanda, Ashokmama, Benumama for their constant support,
love and encouragement. Their selfless guidance has helped me to find my path in this beautiful
journey called ‘life’.

Abhijnan Chakraborty

Abstract

Some of the most popular sites in the Web today are social tagging systems or folksonomies
(e.g. Delicious, Flickr, LastFm etc.) where users share resources and collaboratively annotate
resources with tags which help in the search, personalized recommendation and organization of
the resources.
Folksonomies are modelled as tripartite (user-resource-tag) hypergraphs in order to study their
network properties, and detecting communities of similar nodes from such networks is a
challenging and well-studied problem. However, most existing algorithms for community detection in
folksonomies assign unique communities to nodes, whereas in reality, nodes in folksonomies are
associated with multiple overlapping communities – users have multiple topical interests, and
the same resource is often tagged with semantically different tags. The few attempts to detect
overlapping communities work on projections of the hypergraph, which results in significant loss
of the information contained in the original tripartite structure.
In this work, we propose the first algorithm to detect overlapping communities in folksonomies
using the complete hypergraph structure. Our algorithm converts a hypergraph into its corre-
sponding weighted line-graph, using measures of hyperedge similarity, whereby any community
detection algorithm on unipartite graphs can be used to produce overlapping communities in the
folksonomy. Through extensive experiments on synthetic as well as real folksonomy data, we
demonstrate that the proposed algorithm can detect better community structures as compared
to existing state-of-the-art algorithms for folksonomies.

Contents

Abstract

1 Introduction
1.1 Folksonomy as Hypergraph
1.2 Existence of Overlapping Communities
1.3 Identifying Overlapping Communities
1.4 Link Clustering
1.5 Organization of the Thesis

2 Related Work
2.1 Community Detection in Graphs
2.2 Detecting Overlapping Communities in Graphs
2.3 Community Detection in Hypergraphs
2.4 Overlapping Community Detection in Folksonomies

3 Our Proposed Algorithm
3.1 Basic Idea of the Algorithm
3.2 Calculating Similarity Between Hyperedges
3.2.1 Expressing Hyperedges as Vectors
3.2.2 Considering Vertex Neighbourhoods
3.2.3 Choosing the Best Similarity Metric
3.3 Detecting Communities in Line Graph
3.3.1 Hierarchical Clustering
3.3.2 Fast Modularity Optimization
3.3.3 Louvain Method
3.3.4 Infomap
3.3.5 Choosing the Best Community Detection Method
3.4 Time Complexity of Our Proposed Algorithm

4 Experiments and Evaluation
4.1 Generation of Synthetic Hypergraphs
4.2 Normalized Mutual Information (NMI)
4.3 Comparison between Different Choices of OHC
4.4 Comparing OHC with Other Algorithms
4.4.1 Performance w.r.t. Number of Hyperedges
4.4.2 Performance in Presence of Scattered Hyperedges
4.4.3 Performance w.r.t. Fraction of Nodes in Multiple Communities
4.4.4 Performance w.r.t. Size of Real Community

5 Experiments on Real World Folksonomies
5.1 Overlapping Communities in Folksonomies
5.2 Evaluation of Communities Detected
5.2.1 Comparison of Conductance Value
5.2.2 Comparing Detected User Communities with Social Network

6 Conclusion

Bibliography

A Publications from the Thesis

List of Figures

1.1 A toy example of a tripartite hypergraph. The three types of nodes are represented as blue
circles, orange rectangles and black diamonds respectively. Each triangle created by connecting
nodes of these three types is a hyperedge.
1.2 Example of Overlapping Community Structure
1.3 Necessity of considering both resources as well as tags to identify users having similar
interests

3.1 Neighbourhood of two adjacent hyperedges

4.1 An example synthetic hypergraph. There are two communities – blue and green. Violet nodes
belong to both the communities.
4.2 Comparison of NMI values for different similarity metrics with varying hyperedge density
4.3 Comparison of NMI values for different community detection algorithms with varying
hyperedge density
4.4 Variation of NMI values with varying hyperedge density when 10% nodes belong to multiple
communities
4.5 Variation of NMI values with varying hyperedge density in presence of scattered hyperedges
4.6 Variation of NMI values with varying fraction of nodes in multiple communities keeping
hyperedge density constant at 0.2
4.7 Comparison of NMI values with varying number of real communities

5.1 Cumulative distribution of the fraction of communities which overlap with a given number (x)
of other communities; main figure – LastFm, inset – MovieLens
5.2 Cumulative distribution of conductance values of tag communities obtained from the
real-world folksonomies: LastFm (main plot), Delicious and MovieLens (both inset) for OHC and HGC.
5.3 Community structure detected by OHC and CL algorithm with the social network in LastFm
5.4 Community structure detected by OHC and CL algorithm with the social network in Delicious

List of Tables

5.1 Statistics of Real Folksonomy Datasets
5.2 Examples of communities detected by proposed OHC algorithm. The algorithm successfully
clusters nodes which are related to a common semantic theme (see Column 2). Nodes related to
multiple themes (boldfaced and italicized) are placed in overlapping communities.

Chapter 1

Introduction

A number of the most popular sites on the Web today are online social systems where users form
social relationships with one another and generate and share various forms of content. Some of
these social systems are specifically designed for content sharing; websites of this type
are known as Social Tagging Systems. Users share content or resources on these sites,
and collaboratively annotate resources with descriptive keywords (known as ‘tags’) in order to
facilitate search and retrieval of interesting resources. Examples of such websites include Delicious
(http://www.delicious.com), Flickr (http://www.flickr.com), LastFm (http://www.last.fm),
MovieLens (http://www.movielens.org) and Bibsonomy (http://www.bibsonomy.org).
Thomas Vander Wal coined the term Folksonomy (footnote 1) to describe social tagging systems. The word
‘Folksonomy’ is a combination of two words – ‘folk’ and ‘taxonomy’. In such systems, ways for
classification and categorization evolve through the practice of collaboratively creating and man-
aging tags. For this reason, folksonomies are also known as Collaborative Tagging Systems.
In this work, we use the terms ‘Social Tagging System’ and ‘Folksonomy’ interchangeably.
With the growing popularity of social media sites in today’s Web, a tremendous amount of re-
sources are being uploaded to the popular folksonomies; consequently it has become practically
impossible for users to discover on their own interesting resources and people having common
interests. Hence it is important to develop algorithms for personalized search [1] and recommen-
dation of resources [2] and potential friends to the users. One approach to these tasks is to group
the entities (resources, tags, users) into communities or clusters which are typically thought of as
groups of entities having more/better interactions among themselves than with entities outside
the group.
For detecting communities as well as studying other network properties, folksonomies are
modelled in the literature [3–5] as tripartite hypergraphs.

1.1 Folksonomy as Hypergraph

The hypergraph model of a folksonomy includes user, resource and tag nodes, where a hyperedge
(u, t, r) indicates that user u has assigned tag t to resource r. Figure 1.1 shows a toy example of
a tripartite hypergraph.

Footnote 1: http://vanderwal.net/folksonomy.html

Figure 1.1: A toy example of a tripartite hypergraph. The three types of nodes are represented
as blue circles, orange rectangles and black diamonds respectively. Each triangle
created by connecting nodes of these three types is a hyperedge.

Detecting communities in such hypergraphs is a challenging problem. Solving it helps not only in
efficient search and recommendation of resources or friends to users, but also in organizing
the vast amount of resources present in folksonomies into semantic categories.
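
As a rough illustration of the model in this section, a folksonomy can be stored as a list of (user, tag, resource) triples. The sketch below is only one possible representation. The node names, the type prefixes and the helper `incident_hyperedges` are assumptions made for this example and do not come from the thesis.

```python
from collections import defaultdict

# Toy folksonomy: each hyperedge is a (user, tag, resource) triple.
# All names here are invented for illustration.
hyperedges = [
    ("alice", "flower", "photo1"),
    ("bob", "yellow", "photo1"),
    ("alice", "daffodil", "photo1"),
    ("bob", "yellow", "photo2"),
]


def incident_hyperedges(edges):
    """Map every typed node to the set of hyperedge indices it takes part in."""
    incidence = defaultdict(set)
    for idx, (user, tag, resource) in enumerate(edges):
        # Prefix each name with its type so the three partite sets stay disjoint.
        for node in (("user", user), ("tag", tag), ("resource", resource)):
            incidence[node].add(idx)
    return dict(incidence)


incidence = incident_hyperedges(hyperedges)
# Here the degree of a node is taken to be the number of hyperedges containing it.
degree = {node: len(ids) for node, ids in incidence.items()}
```

Keeping the type alongside the name avoids collisions when, say, a tag and a user happen to share the same string.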

1.2 Existence of Overlapping Communities

Several algorithms have been proposed for detecting communities in hypergraphs [4,6–9] (details
in Chapter 2). However, almost all of the prior approaches overlook an important
aspect of the problem – they assign a single community to each node, whereas in reality, nodes
in folksonomies frequently belong to multiple overlapping communities. For instance, users have
multiple topics of interest, and thus link to resources and tags of many different semantic cate-
gories. Similarly, the same resource is frequently associated with semantically different tags by
users who appreciate different aspects of the resource.
As a motivating example, consider a popular photo of a daffodil in Flickr (Figure 1.2). Since
many users are likely to tag the photo with ‘flower’ (or ‘daffodil’), as compared to few users
using the tag ‘yellow’, algorithms assigning single communities to nodes would place this photo
in the community related to flowers (or daffodils). Community-based recommendation schemes,
which recommend resources to users based on common memberships in communities, would thus
overlook the fact that this photo is an excellent candidate for recommendation to a user who
favours tagging objects that are yellow-coloured (e.g. photos of yellow cars, sunset etc.). On the
other hand, an algorithm detecting multiple overlapping communities would place the photo in
both communities related to flowers and the colour ‘yellow’, and thus raise the chances that this
popular photo is recommended to the above mentioned user.

Figure 1.2: Example of Overlapping Community Structure

1.3 Identifying Overlapping Communities

To the best of our knowledge, only two studies have addressed the problem of identifying over-
lapping communities in folksonomies.
1. Wang et al. [10] proposed an algorithm to detect overlapping communities of users in folk-
sonomies considering only the user-tag relationships (i.e. the user-tag bipartite projection
of the hypergraph), and
2. Papadopoulos et al. [5] detected overlapping tag communities by taking a projection of the
hypergraph onto the set of tags.
Taking projections (as both these approaches do) results in the loss of some of the information
contained in the original tripartite network, and it is known that the quality of communities
obtained from projected networks is not as good as that of communities obtained from the original
network [11]. Further, neither of these algorithms considers the resource nodes in the hypergraph. However, it
is necessary to detect overlapping communities of users, resources and tags simultaneously for
personalized recommendation of resources to users. Additionally, it is better to consider common
resources as well as common tags in order to identify users having similar interests (i.e. potential
friends).

[Figure 1.3 (plot): fraction of friendship links (y-axis) against number of shared items (x-axis),
with one curve each for ‘Only Resources’, ‘Only Tags’ and ‘Resources or Tags’.]

Figure 1.3: Necessity of considering both resources as well as tags to identify users having similar
interests

To demonstrate this, we present a motivating statistic from real data of the LastFm
folksonomy (footnote 2), which also allows users to create a social network among themselves.
Users who are linked in the social network (i.e. friends) usually have common tastes (a property
known as homophily [12]), and hence can be expected to have similar tagging behaviour in the
folksonomy as well.
Figure 1.3 plots the fraction of friends (i.e. user-pairs who are linked in the social network) who
share k items in the folksonomy for different values of k, where the shared items are
1. only resources
2. only tags
3. resources or tags.
It is seen that the curve for 3 consistently has higher values as compared to the curves for 1 and 2,
which shows the necessity of considering both resources and tags while identifying communities
in folksonomies, without which some of the potential friendship relations cannot be identified.
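
The statistic behind Figure 1.3 can be sketched as follows. This is only an illustrative reading of the description above: it assumes that "share k items" means exactly k common items (an "at least k" reading would replace `==` with `>=`), and the toy users, items and the function name `shared_fraction` are invented for the example.

```python
def shared_fraction(friend_pairs, items_of, k):
    """Fraction of friend pairs whose two users have exactly k items in common."""
    if not friend_pairs:
        return 0.0
    hits = sum(1 for u, v in friend_pairs if len(items_of[u] & items_of[v]) == k)
    return hits / len(friend_pairs)


# Toy data (hypothetical): resources and tags used by each user.
resources = {"u1": {"r1", "r2"}, "u2": {"r1"}, "u3": {"r2", "r3"}}
tags = {"u1": {"t1"}, "u2": {"t1", "t2"}, "u3": {"t3"}}
# "Resources or tags": pool both kinds of items for every user.
either = {u: resources[u] | tags[u] for u in resources}
friends = [("u1", "u2"), ("u1", "u3")]

fraction_resources = shared_fraction(friends, resources, 1)
fraction_tags = shared_fraction(friends, tags, 1)
fraction_either = shared_fraction(friends, either, 2)
```

Pooling resources and tags lets a pair such as (u1, u2) reach a higher count of shared items than either kind of item provides alone, which is the effect the figure points to.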
The goal of this work is to propose such an algorithm that utilizes the complete tripartite struc-
ture to detect overlapping communities, using the concept of link clustering which is explained
next.

Footnote 2: The real folksonomy datasets are detailed in Chapter 5.

1.4 Link Clustering

Though a node in a network can be associated with multiple semantic topics, a link (or edge;
the terms are used interchangeably) is usually associated with only one semantics [13] – for
instance, a user can have multiple topical interests, but each link created by the user is likely to
be associated with exactly one of these interests.
Link clustering algorithms utilize this notion to detect overlapping communities, by clustering
links instead of the more conventional approach of clustering nodes – though each link is placed
in exactly one link cluster, this automatically associates multiple overlapping communities with
the nodes since a node inherits membership of all the communities into which its links are placed.
Link clustering algorithms have recently been proposed for unipartite networks [13,14] and bipar-
tite networks [10]. However, to our knowledge, this is the first attempt to propose a link-clustering
algorithm for tripartite hypergraphs. Thus, the present work takes the first important step to-
wards detecting overlapping communities in folksonomies considering the complete hypergraph
structure.
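
The way nodes inherit overlapping memberships from a non-overlapping clustering of links can be sketched in a few lines. The cluster labels and node names below are hypothetical, and how the link clusters are obtained is deliberately left out.

```python
from collections import defaultdict


def node_communities(link_cluster):
    """Derive node memberships from a mapping {link: cluster id}.

    Each link (a tuple of nodes) belongs to exactly one cluster; a node
    collects the cluster ids of all links it is part of.
    """
    membership = defaultdict(set)
    for link, cluster in link_cluster.items():
        for node in link:
            membership[node].add(cluster)
    return dict(membership)


# Hypothetical link clustering of three (user, tag, resource) hyperedges.
link_cluster = {
    ("alice", "flower", "photo1"): "flowers",
    ("bob", "yellow", "photo1"): "yellow things",
    ("bob", "yellow", "car1"): "yellow things",
}
membership = node_communities(link_cluster)
```

In this toy input, `photo1` ends up in both clusters while `bob` stays in one, even though every link was assigned to a single cluster.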

1.5 Organization of the Thesis

Chapter 2 summarizes prior work on community detection in graphs as well as in
hypergraphs. Our link-clustering based algorithm is detailed in Chapter 3. We compare the
performance of the proposed algorithm with the existing algorithms by Papadopoulos et al. [5]
and Wang et al. [10]. Extensive experiments on synthetically generated hypergraphs show that
our proposed algorithm outperforms both these algorithms (Chapter 4). Further, using data
from three popular real folksonomies – Delicious, MovieLens and LastFm – we also show that the
proposed algorithm can identify better overlapping community structures in real folksonomies
(Chapter 5). Chapter 6 concludes the thesis.

Chapter 2

Related Work

Large networks or graphs are increasingly being used to model various types of complex systems
in the real world. These real world networks are not random graphs, as they display big inho-
mogeneities, revealing a high level of order and organization. The degree distribution is broad,
with a tail that often follows a power law. Therefore, many vertices with low degree coexist with
some vertices with large degree.
Furthermore, the distribution of edges is not only globally, but also locally inhomogeneous, with
high concentrations of edges within special groups of vertices, and low concentrations between
these groups. This feature of real networks is called community structure or clustering. Commu-
nities are groups of vertices which probably share common properties and/or play similar roles
within the graph. Several algorithms have been proposed for finding communities or groups of
‘similar’ nodes in graphs.

2.1 Community Detection in Graphs

Girvan and Newman proposed one of the initial algorithms for community detection [15]. Their
algorithm removes network edges iteratively based on their betweenness centrality, which results
in splitting the network into disconnected components. In a successive work, they introduced
the notion of modularity as a measure of the quality of community structure in a network [16].
Several algorithms were subsequently proposed which attempt to detect community structure in a
network by maximizing the modularity score. For instance, Clauset et al. [17] proposed an
agglomerative hierarchical clustering method which successively joins pairs of communities
(starting from single-node communities) such that each agglomeration results in the maximum
possible increase in modularity. Later, techniques like simulated annealing, extremal optimization
and spectral optimization were presented to maximize the modularity score. Refer to [18] for a
detailed survey of different community detection algorithms for graphs.
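
For concreteness, the modularity score that these methods maximize can be computed as below. The thesis text does not spell out the formula, so this sketch uses the standard Newman-Girvan definition for an undirected, unweighted graph without self-loops, Q = Σ_c [ L_c / m − (D_c / 2m)² ], where L_c is the number of edges inside community c, D_c the total degree of its nodes and m the number of edges. The function name and the toy graph are chosen only for this example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition of an undirected simple graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both end points in community c
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the graph into its two triangles scores higher than putting all nodes into one community, which is the behaviour a greedy agglomerative method exploits when it picks the merge with the largest modularity gain.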
In social networks, every individual typically belongs to more than one community, such as
those of her family members, friends, classmates and co-workers. Hence, a community
detection algorithm should address the issue of overlapping communities. Recently, many
algorithms have been proposed which detect overlapping communities in graphs.

2.2 Detecting Overlapping Communities in Graphs

One of the initial methods to find overlapping communities was designed by Baumes et al. [19].
They defined a community as a subset of actors whose induced subgraph locally optimizes a
given metric based on the edge density of the cluster. As different overlapping subsets may all be
locally optimal, vertices can belong to multiple communities. Detecting communities of a graph
is equivalent to finding the set of all locally optimal clusters.
The Clique Percolation Method (CPM) by Palla et al. [20] is the most widely used overlapping
community detection technique. It builds communities out of k-cliques in the network. The
algorithm first finds all k-cliques for a fixed constant k. Two k-cliques are considered adjacent,
and are joined, if they share k − 1 nodes. Each community is the union of a maximal set of
k-cliques that can be reached from one another through a series of such adjacent k-cliques. Since
one node may belong to several k-cliques that are not connected in this way, it can be a member
of multiple communities.
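
The clique percolation idea described above can be sketched with a brute-force implementation. This is an illustrative toy, not the CFinder implementation or the sequential algorithm cited in the next paragraph: it enumerates k-cliques exhaustively, so it is only usable on very small graphs, and the function name is an assumption of this example.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Return overlapping communities as unions of adjacent k-cliques."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Brute-force enumeration of all k-cliques (exponential; toy graphs only).
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]

    # Union-find over cliques: merge two cliques that share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of all cliques in one merged group.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return list(groups.values())


# Two triangles that share only node 3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
communities = clique_percolation(edges, k=3)
```

With k = 3 the two triangles share one node rather than two, so they are not merged, and node 3 is reported in both communities.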
The Clique Percolation scheme has been extended to different types of real-world networks. Farkas
et al. [21] and Lehmann et al. [22] extended the method to weighted and bipartite graphs
respectively. Adamcsek et al. [23] designed a software package, CFinder (footnote 1), which
implements CPM. Kumpula et al. [24] proposed a faster sequential implementation of the CPM algorithm.
Lancichinetti et al. [25] proposed a local community detection algorithm. Their algorithm tries
to optimize a fitness function, which is defined using the internal and external degrees of the
computed cluster. By varying the parameters in the fitness function, both overlapping and
hierarchical community structures can be obtained using the algorithm.
The well-known modularity metric can be extended to the overlapping community scenario. Nicosia
et al. [26] introduced an overlapping modularity metric. In their scheme, a vector is assigned to
each node in the graph. Each entry of this vector stores the probability that the node belongs to
a particular community. Their definition of overlapping modularity utilizes these vectors. With the
notion of overlapping modularity, any modularity maximization algorithm can be applied to detect
overlapping communities.
Gregory [27] proposed an algorithm which works in multiple stages. First, the vertices with
highest split betweenness are identified. They are the potential vertices which may belong to
multiple communities. Then, these vertices are split into multiple nodes connected by edges. The
original graph is transformed into a larger graph including these vertex sets instead of potential
overlapping nodes. After that, any state-of-the-art non-overlapping clustering technique can be
applied to the resulting graph. Finally, the communities are mapped back into the original graph.
Some of the recent algorithms proposed for detecting overlapping communities [13,14] adopt the
methodology of link clustering i.e. they find groups of ‘similar’ edges unlike conventional attempts
to group similar nodes. Link clustering strategies build from the idea that even though many
actors may belong to multiple groups, their social ties can be classified into a single category.
Evans et al. [14] considered a modified random walk on the line graph of a particular graph along
with other diffusion processes. Ahn et al. [13] proposed to group edges with an agglomerative
hierarchical clustering technique.
The advantage of these algorithms is that while overlapping communities of nodes are indeed
discovered (since a given node inherits membership of all communities that contain the edges
associated with the node), these algorithms are much simpler and more efficient than the ones
which directly find overlapping groups of nodes. Hence in the present study, we adopt the
link-clustering methodology to propose an algorithm for overlapping community detection in
tripartite hypergraphs.

Footnote 1: CFinder is available at http://www.cfinder.org.

2.3 Community Detection in Hypergraphs

Several algorithms have been proposed for detecting communities in hypergraphs. Vazquez [7]
proposed a Bayesian formulation of the problem of finding hypergraph communities. Starting
from a statistical model on hypergraphs, the author uses a Mean Field (MF) approximation
as variational function which resolves the population structure by determining the hypergraph
communities and model parameters from the data. The final Variational Bayes (VB) algorithm
is a self-consistent set of equations for determining the group assignments and the model pa-
rameters. The VB algorithm is based on recursive equations similar to those for the Expectation
Maximization (EM) algorithm.
Bulo et al. [28] proposed a Game Theoretic approach to hypergraph clustering. They have
shown that the hypergraph clustering problem can be converted into a non-cooperative multi-
player clustering game. There the notion of a cluster is equivalent to a classical game-theoretic
equilibrium concept. Zhou et al. [29] generalized spectral clustering techniques to hypergraphs.
Lin et al. [9] proposed an efficient multi-tensor factorization method for community extraction
from hypergraphs.
Neubauer et al. [6] used the concept of modularity to extract communities from hypergraphs. The
original $k$-partite hypergraph is decomposed into $\frac{k(k+1)}{2}$ bipartite graphs. The
algorithm tries to optimize a joint modularity measure, which is based on the average bipartite
modularity in the individual bipartite graphs, in a brute-force, greedy bottom-up fashion. Later,
Murata defined tripartite modularity [30] and proposed an algorithm to detect communities from
hypergraphs using the principle of tripartite modularity maximization [4].

2.4 Overlapping Community Detection in Folksonomies

All the hypergraph community detection algorithms mentioned above assign a single community to
each node. Only two studies have addressed the problem of overlapping community detection in
folksonomies, and neither considers the full tripartite hypergraph structure.
Wang et al. [10] proposed an edge clustering methodology to detect overlapping communities
using only user-tag subscription information (in effect, they consider the projection of a tripar-
tite folksonomy onto a user-tag bipartite graph). Their algorithm is a k-means variant which
maximizes intra-cluster similarity. The network is considered in an edge-centric view and each
centroid only compares to a small set of edges that are correlated to the centroid. Though this
algorithm is computationally fast, it requires the number of communities as an input which is
difficult to predict in real world folksonomies.
Papadopoulos et al. [5] proposed an algorithm to detect overlapping communities of tags. This
algorithm extracts the resource-tag association graph from the tripartite hypergraph, transforms
it into a tag co-occurrence network and then finds overlapping tag communities. Their scheme
searches for core sets in the tag co-occurrence network. Cores are densely connected groups of tag
nodes. Then, the algorithm successively expands the identified cores by maximizing a local
subgraph quality measure.
Taking projections (as both these approaches do) loses some of the information contained in the
original tripartite network. Guimera et al. [11] have shown that the quality of communities
obtained from projected networks is worse than that of communities obtained from the original network. To
the best of our knowledge, the present work is the first algorithm for detecting overlapping
communities in folksonomies considering the complete hypergraph structure.
The proposed algorithm is detailed in the next chapter.

Chapter 3

Our Proposed Algorithm

This chapter details the proposed link-clustering algorithm for detecting overlapping
communities in tripartite hypergraphs. As discussed earlier, a folksonomy is modelled as a tripartite
hypergraph (more specifically, a 3-uniform tripartite hypergraph). We first discuss the notation
used to model a folksonomy as a tripartite hypergraph.
A tripartite hypergraph is denoted as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set
of hyperedges. $V$ is composed of three partite sets (types of vertices) $V^X$, $V^Y$ and $V^Z$. Each
hyperedge in $E$ connects a triple of nodes $(a, b, c)$, where $a \in V^X$, $b \in V^Y$ and $c \in V^Z$.

3.1 Basic Idea of the Algorithm



For a given hypergraph G, we convert G to the weighted line graph G which is a unipartite graph

in which the hyperedges in G are nodes, and two nodes e1 and e2 in G are connected by an

edge if e1 and e2 are similar in G. The weight of the edge (e1 , e2 ) in G represents the similarity
between the two hyperedges e1 and e2 in the hypergraph G. Similarity calculation is detailed in
Section 3.2.

Once the weighted line graph G' is constructed from the given tripartite hypergraph G, any
community detection algorithm for unipartite graphs can be used to cluster the nodes in G'
(i.e. the hyperedges in G). Overlapping community detection algorithms for graphs could also
be used here. However, as discussed earlier, a link is usually associated with one particular
semantics. Hence, we consider only algorithms that produce non-overlapping communities.
The choice of a particular community detection algorithm among them is described in
Section 3.3.

Once the node communities in G' are obtained, each hyperedge in G is placed into a single
link community. This automatically assigns multiple overlapping communities to nodes in G,
since a node inherits membership of all the communities into which its incident hyperedges
are placed.
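
The inheritance step described above can be sketched in a few lines of Python. This is only an illustration: the function name node_memberships, the node labels and the dictionary layout (hyperedge triple mapped to a community id) are assumptions made for the example and are not part of the original implementation.

```python
from collections import defaultdict

def node_memberships(hyperedge_community):
    """Map each node to the set of communities of its incident hyperedges."""
    memberships = defaultdict(set)
    for hyperedge, community in hyperedge_community.items():
        for node in hyperedge:
            # A node inherits the community of every hyperedge it touches.
            memberships[node].add(community)
    return dict(memberships)

# Hypothetical toy data: each hyperedge (user, resource, tag) has one community.
hyperedge_community = {
    ("u1", "r1", "t1"): 0,
    ("u1", "r2", "t2"): 1,
    ("u2", "r2", "t2"): 1,
}
memberships = node_memberships(hyperedge_community)
```

In this toy input the user u1 takes part in hyperedges from both communities, so it ends up with two memberships, while every hyperedge still belongs to exactly one community.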

3.2 Calculating Similarity Between Hyperedges

The similarity between a pair of hyperedges can be computed using different metrics. For exam-
ple, hyperedges can be expressed as feature vectors and then can be compared to find similarity.
Another way of measuring similarity is by considering the neighbourhood of end vertices of
hyperedges.

3.2.1 Expressing Hyperedges as Vectors

In a hypergraph, each hyperedge is associated with three nodes, one each from the sets V^X, V^Y
and V^Z. We express each hyperedge as a vector of size |V^X| + |V^Y| + |V^Z|, where an element of the
vector represents the amount of participation of a particular node in that hyperedge.
Let d_i denote the degree of the node i. Then the i-th entry in the vector representation of a
particular hyperedge is 0 if there is no path from i to any of its end nodes. Otherwise, the
i-th entry is the inverse of the product of degrees of intermediate vertices in the shortest
path from i to any of the end vertices.
For example, in the vector representation X of hyperedge e = (a, b, c), the a-th entry of X is
1/d_a. If the shortest path from a node j to the node b contains the nodes i and k, then the
j-th entry of X is 1/(d_j · d_i · d_k · d_b). Note that, while calculating the j-th entry, we
consider the shortest paths from j to each of a, b and c and take the path having the
minimum hop length among them.
Now, with the vector representation presented above, the following two well known metrics can
be used to find similarity between two hyperedges.
1. Pearson Correlation:
If hyperedges e1 and e2 are expressed as vectors X and Y respectively, then the similarity
between e1 and e2 can be measured by the following equation

\[
\mathrm{sim}(e_1, e_2) = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}
{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}}
\tag{3.1}
\]

2. Cosine Similarity:
It measures the similarity between two vectors by the cosine of the angle between them.
The similarity between e1 and e2 can be expressed as

\[
\mathrm{sim}(e_1, e_2) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}
= \frac{\sum_{i=1}^{n} X_i \times Y_i}{\sqrt{\sum_{i=1}^{n} (X_i)^2} \times \sqrt{\sum_{i=1}^{n} (Y_i)^2}}
\tag{3.2}
\]

For both metrics, the similarity value ranges from −1 to +1, where −1 means exactly opposite,
+1 means exactly the same, and in-between values indicate intermediate similarity or
dissimilarity, with 0 usually indicating independence. Here, we consider only positive similarity
values: e1 and e2 are connected in the line graph G' only if the similarity between the two
hyperedges e1 and e2 in the hypergraph G is more than 0, and the edge weight denotes the
similarity value.
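
As an illustration of Equations 3.1 and 3.2, the following minimal Python sketch computes both metrics for two hyperedge vectors given as plain lists. The function names are hypothetical, and returning 0 when a denominator vanishes is an assumption made here (the text does not discuss that case).

```python
import math

def pearson_similarity(x, y):
    """Pearson correlation of two equal-length vectors (Equation 3.1)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    spread = (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    denominator = math.sqrt(max(spread, 0.0))  # guard against rounding below 0
    # Assumption: a constant vector (zero spread) yields similarity 0.
    return numerator / denominator if denominator > 0 else 0.0

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors (Equation 3.2)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Assumption: an all-zero vector yields similarity 0.
    return dot / (norm_x * norm_y) if norm_x > 0 and norm_y > 0 else 0.0
```

Only pairs whose value is strictly positive would then be linked in the line graph, with that value as the edge weight.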

3.2.2 Considering Vertex Neighbourhoods

Similarity between hyperedges can be measured by the relative overlap among the neighbours of
their end vertices. We measure the similarity between only those hyperedges which are adjacent.
Non-adjacent hyperedges are considered to have zero similarity.
Note that the adjacency of two hyperedges can be defined in either of the following ways:
1. Two hyperedges are adjacent if they have at least one node in common.
2. Two hyperedges are adjacent if they have exactly two nodes in common.
Although the second definition is a special case of the first, the choice has an impact on the
overall performance of the algorithm. Under the second definition, the line graph G' will be
sparser than under the first definition and will contain many disconnected components.
Detecting communities in this sparser G' will be more difficult. Also, in real-world
folksonomies, the condition of having two nodes in common is too rigid. So, in this work, we
use the first definition of adjacency: two hyperedges are considered adjacent if they share
at least one endpoint.
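
The first definition of adjacency can be sketched as follows. The sketch indexes hyperedges by their end nodes so that only hyperedges sharing a node are ever paired; the function name adjacent_pairs and the representation of hyperedges as (a, b, c) tuples are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def adjacent_pairs(hyperedges):
    """Return index pairs (i, j), i < j, of hyperedges sharing at least one node."""
    incident = defaultdict(list)
    for index, (a, b, c) in enumerate(hyperedges):
        # Tag each node with its partite set so equal labels in different
        # sets are never confused with each other.
        incident[("X", a)].append(index)
        incident[("Y", b)].append(index)
        incident[("Z", c)].append(index)
    pairs = set()
    for indices in incident.values():
        pairs.update(combinations(indices, 2))
    return pairs

# Hypothetical toy hypergraph.
hyperedges = [("u1", "r1", "t1"), ("u1", "r2", "t2"),
              ("u2", "r2", "t2"), ("u3", "r3", "t3")]
pairs = adjacent_pairs(hyperedges)
```

Collecting the pairs in a set avoids counting two hyperedges twice when they share more than one node.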

Figure 3.1: Neighbourhood of two adjacent hyperedges

The notations N^X(i), N^Y(i) and N^Z(i) denote the sets of neighbours of node i of type V^X,
V^Y and V^Z respectively (if i ∈ V^X, then N^X(i) = ∅, since nodes in the same partite set are
not linked). Figure 3.1 shows the neighbourhood of two adjacent hyperedges e1 = (a, b, c) and
e2 = (p, q, r), where a, p ∈ V^X; b, q ∈ V^Y; c, r ∈ V^Z, and a = p is assumed.

With the notations discussed, we have considered the following two popular similarity metrics
which can be used to measure hyperedge similarity.
1. Matching Similarity:
It is defined as the size of the overlap between the neighbour sets of the end points. The
matching similarity measure can be expressed as

\[
\mathrm{sim}(e_1, e_2) = |N_1 \cap N_2| \tag{3.3}
\]

where
\[
N_1 = N^X(b) \cup N^Z(b) \cup N^Y(c) \cup N^X(c)
\]
and
\[
N_2 = N^X(q) \cup N^Z(q) \cup N^Y(r) \cup N^X(r)
\]

2. Jaccard Similarity:
It is expressed as the size of the overlap normalized by the size of the union of the neighbour
sets of the end vertices.

\[
\mathrm{sim}(e_1, e_2) =
\frac{|S \cap S'| + |N^Y(c) \cap N^Y(r)| + |N^Z(b) \cap N^Z(q)|}
{|S \cup S'| + |N^Y(c) \cup N^Y(r)| + |N^Z(b) \cup N^Z(q)|}
\tag{3.4}
\]

where S = N^X(b) ∪ N^X(c) and S' = N^X(q) ∪ N^X(r). The Jaccard Similarity value ranges
from 0 to 1.

3.2.3 Choosing the Best Similarity Metric

Vector-based similarity metrics are global metrics that require knowledge of the entire
hypergraph. Moreover, calculating similarity using vectors requires a large amount of memory:
the size of each vector is O(n), where n is the number of nodes in the hypergraph, so if there
are m hyperedges, the space complexity of vector-based similarity calculation is O(m · n).
On the other hand, neighbourhood-based metrics can be computed locally for a pair of
hyperedges and can thus be computed efficiently for large real folksonomies. Also, experiments
on synthetically generated hypergraphs (details in Section 4.3) show that Jaccard Similarity
gives the best performance among the similarity metrics considered. Further, a similar metric
was found to perform well in detecting overlapping communities in unipartite graphs [13].
Hence, for our algorithm, we choose Jaccard Similarity as the similarity metric. The
calculation of Jaccard Similarity is presented in Algorithm 1.

Algorithm 1 Compute Similarity between two Hyperedges
Input: hyperedges e1 = (a, b, c) and e2 = (p, q, r); a, p ∈ V^X; b, q ∈ V^Y; c, r ∈ V^Z
Output: sim, the similarity between e1 and e2

if a ≠ p AND b ≠ q AND c ≠ r then
    /* Hyperedges are non-adjacent */
    sim ← 0
else
    /* Without loss of generality, let a = p; any of the other pairs may be common as well */
    S1 ← N^X(b) ∪ N^X(c), S2 ← N^Y(c), S3 ← N^Z(b)
    S1' ← N^X(q) ∪ N^X(r), S2' ← N^Y(r), S3' ← N^Z(q)
    sim ← (|S1 ∩ S1'| + |S2 ∩ S2'| + |S3 ∩ S3'|) / (|S1 ∪ S1'| + |S2 ∪ S2'| + |S3 ∪ S3'|)
end if
return sim
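
A minimal Python sketch of Algorithm 1 is given below. Positions 0, 1 and 2 of a hyperedge tuple stand for the partite sets V^X, V^Y and V^Z. The helper build_neighbours and the data layout are assumptions made for illustration. The way the 'without loss of generality' step is handled (relabelling the first shared position to play the role of V^X) is one possible reading of the algorithm, not a confirmed detail of the original implementation.

```python
from collections import defaultdict

def build_neighbours(hyperedges):
    """nbrs[(s, node)][t] = neighbours at position t of `node` at position s."""
    nbrs = defaultdict(lambda: defaultdict(set))
    for edge in hyperedges:
        for s in range(3):
            for t in range(3):
                if s != t:
                    nbrs[(s, edge[s])][t].add(edge[t])
    return nbrs

def jaccard_similarity(e1, e2, nbrs):
    """Jaccard similarity between two hyperedges (Algorithm 1, Equation 3.4)."""
    shared = [t for t in range(3) if e1[t] == e2[t]]
    if not shared:
        return 0.0  # hyperedges are non-adjacent
    t = shared[0]  # assumed relabelling: the shared position plays the role of X
    u, v = [s for s in range(3) if s != t]

    def neighbour_sets(e):
        s1 = nbrs[(u, e[u])][t] | nbrs[(v, e[v])][t]  # N^X(b) ∪ N^X(c) when t = 0
        s2 = nbrs[(v, e[v])][u]                       # N^Y(c) when t = 0
        s3 = nbrs[(u, e[u])][v]                       # N^Z(b) when t = 0
        return s1, s2, s3

    pairs = list(zip(neighbour_sets(e1), neighbour_sets(e2)))
    overlap = sum(len(p & q) for p, q in pairs)
    union = sum(len(p | q) for p, q in pairs)
    return overlap / union if union else 0.0

# Hypothetical toy hypergraph of (user, resource, tag) triples.
hyperedges = [("u1", "r1", "t1"), ("u1", "r2", "t2"), ("u2", "r3", "t3")]
nbrs = build_neighbours(hyperedges)
sim_adjacent = jaccard_similarity(hyperedges[0], hyperedges[1], nbrs)
sim_disjoint = jaccard_similarity(hyperedges[0], hyperedges[2], nbrs)
```

Because only the neighbourhoods of the six end nodes are consulted, the computation stays local to the pair of hyperedges.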

3.3 Detecting Communities in Line Graph

With the similarity measure in Algorithm 1, we convert the hypergraph to its corresponding
line graph where any community detection algorithm can be used. We have experimented with
different community detection algorithms to find the best candidate to be used in our proposed
algorithm. We present some of those algorithms below.

3.3.1 Hierarchical Clustering

In the line graph, we use single-linkage hierarchical clustering to construct a dendrogram. We
start with each node in the line graph as an individual cluster; then, at each step, the two most
similar clusters are merged. This procedure continues until all nodes belong to a single cluster,
and cutting the dendrogram at a suitable level gives the final clusters of nodes. The optimal
level for the cut is decided using the Partition Density metric [13], which is computed on the
original hypergraph.
The partition density of a community P_i of hyperedges is the number of hyperedges in P_i,
normalized by the minimum and maximum number of hyperedges possible among the induced nodes
(the nodes touched by the hyperedges in P_i). The global partition density D for a given
partitioning of the hyperedges is the average partition density of all hyperedge communities,
weighted by the fraction of hyperedges present in each community. Algorithm 2 computes D for a
given partitioning of the hyperedges at a certain level of the dendrogram.
The dendrogram is cut at the level at which the global partition density D is maximum [13].

Algorithm 2 Compute Partition Density
Input: {P_1, P_2, ..., P_C}, a partitioning of the M hyperedges in E into C subsets
Output: Global Partition Density D

for all i, 1 ≤ i ≤ C do
    m_i ← |P_i|
    /* Count the number of induced nodes of the three types in P_i */
    n_i^X ← |∪_{(a,b,c)∈P_i} {a}|, n_i^Y ← |∪_{(a,b,c)∈P_i} {b}|, n_i^Z ← |∪_{(a,b,c)∈P_i} {c}|
    /* Compute Partition Density D_i of subset P_i */
    D_i ← (m_i − max{n_i^X, n_i^Y, n_i^Z}) / ((n_i^X × n_i^Y × n_i^Z) − max{n_i^X, n_i^Y, n_i^Z})
end for
D ← (1/M) Σ_i (m_i × D_i)    /* Global Partition Density */
return D
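
Algorithm 2 translates almost directly into Python. The sketch below assumes each community is given as a list of (a, b, c) triples. Setting D_i to 0 when the denominator is zero (for instance, a community holding a single hyperedge) is an assumption made here, since Algorithm 2 does not state how that case is handled.

```python
def partition_density(partition):
    """Global partition density D of a partitioning of hyperedges (Algorithm 2)."""
    total = sum(len(community) for community in partition)  # M
    if total == 0:
        return 0.0
    weighted_sum = 0.0
    for community in partition:
        m = len(community)
        # Count the induced nodes of each of the three types.
        n_x = len({a for a, _, _ in community})
        n_y = len({b for _, b, _ in community})
        n_z = len({c for _, _, c in community})
        largest = max(n_x, n_y, n_z)
        denominator = n_x * n_y * n_z - largest
        # Assumption: a degenerate community contributes density 0.
        d_i = (m - largest) / denominator if denominator > 0 else 0.0
        weighted_sum += m * d_i
    return weighted_sum / total

# Hypothetical example: a complete 2 x 2 x 2 community and a lone hyperedge.
complete = [(a, b, c) for a in ("a1", "a2") for b in ("b1", "b2") for c in ("c1", "c2")]
single = [("a3", "b3", "c3")]
density = partition_density([complete, single])
```

In a dendrogram-based setting, this function would be evaluated once per candidate cut level and the level with the largest value kept.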

3.3.2 Fast Modularity Optimization

Clauset et al. [17] proposed a fast greedy approach¹ to implement the modularity maximization
technique proposed by Newman [31]. Starting from a set of isolated nodes, the links present in
the original graph are iteratively added so as to produce the largest possible increase in
modularity at each step. The algorithm relies on efficient data structures: a sparse matrix
stores the increase in modularity obtained by joining each pair of communities that have at
least one edge between them, and a max-heap is used to reduce the time complexity
to O(n · log² n), where n is the number of nodes in the graph.

3.3.3 Louvain Method

Blondel et al. [32] proposed a multistep technique². In the first step, communities are detected
by local optimization of modularity in the neighbourhood of each node in the graph. In the
next step, a weighted graph is formed whose nodes are the communities detected in the earlier
step. These two steps are iterated until modularity (which is always computed on the original
graph) does not increase any further. The computational complexity of this algorithm is O(m),
where m is the number of edges in the original graph.

3.3.4 Infomap

This is a dynamic algorithm proposed by Rosvall and Bergstrom [33]. The authors have shown
that the problem of finding the best cluster structure of a graph is equivalent to the problem
of optimally compressing the information of a random walk taking place on the graph. The
optimal compression is achieved by optimizing a quality function, the Minimum Description
Length of the random walk. The Minimum Description Length expresses the best trade-off between
minimizing the difference between the original and the compressed information and maximizing
the compression. It can be optimized with a combination of greedy search
and simulated annealing³. The computational complexity of this algorithm is also O(m), where
m is the number of edges in the graph.

¹ Can be found at http://www.cs.unm.edu/~aaron/research/fastmodularity.htm
² Downloadable from https://sites.google.com/site/findcommunities/

3.3.5 Choosing the Best Community Detection Method

We have compared the performance of all the above community detection algorithms using
synthetic hypergraphs (Section 4.3). The Infomap algorithm is found to perform better than the
other algorithms. Lancichinetti et al. [34] also showed that, for community detection in large
graphs, Infomap can identify communities more accurately than several other algorithms,
including Louvain and greedy modularity maximization. Further, as Infomap has low
computational complexity, it can be used efficiently on line graphs of large real folksonomies.
Therefore, we use Infomap as the community detection algorithm.

3.4 Time Complexity of Our Proposed Algorithm

Let the number of nodes in the hypergraph be n and the average node degree be d, which implies
that the number of hyperedges is n·d/3. Each hyperedge is, on average, adjacent to 3·(d − 1)
other hyperedges. So, the line graph has n·d/3 nodes and (n·d/3) × 3·(d − 1) = n·d·(d − 1) = O(n·d²)
edges.
The time complexity of the Infomap algorithm is linear in the size of the graph. So, community
detection in the line graph takes O(n·d²) time. Jaccard similarity calculation in the hypergraph
also takes O(n·d²) time. Therefore, the time complexity of the proposed algorithm is O(n·d²).
Note that real-world folksonomies are known to be sparse, having a small average degree
d. So, the complexity of our algorithm essentially becomes O(n), which makes it
scalable to large real-world folksonomies.
The performance of this algorithm is evaluated in the next chapter.

³ Available at http://www.tp.umu.se/~rosvall/code.html
Chapter 4

Experiments and Evaluation

In this chapter, we evaluate the performance of our proposed algorithm, which we name
‘Overlapping Hypergraph Clustering’ (abbreviated to ‘OHC’). We first compare different choices
of similarity metrics and of community detection algorithms for the line graph to be used in OHC.
Then, we compare the OHC algorithm with the algorithms by Wang et al. [10] and Papadopoulos
et al. [5], which are henceforth referred to as ‘CL’ (abbreviation of ‘Correlational Learning’) and
‘HGC’ (as referred to by the respective authors) respectively.
Since evaluation of clustering is difficult without the knowledge of ‘ground truth’ regarding the
community memberships of nodes, we have used synthetically generated hypergraphs with a
known community structure for evaluation of the algorithms. We discuss the generation of
synthetic hypergraphs and the metric used to evaluate the algorithms, followed by the results of
experiments on synthetic hypergraphs.

4.1 Generation of Synthetic Hypergraphs

Synthetic hypergraphs are generated using a modified version of the method used in [10]. The
generator algorithm takes the following as input:
1. Number of nodes in a partite set
(all three partite sets V^X, V^Y and V^Z are assumed to contain an equal number of nodes)
2. Number of communities C
3. Fraction γ of nodes which belong to multiple communities
4. Hyperedge density β (i.e. the fraction of the total number of hyperedges possible in the hypergraph)
Initially, the nodes in each partite set are evenly distributed among the communities under
consideration (e.g. |V^X|/C nodes of the partite set V^X are assigned to each of the C communities).
Subsequently, a fraction γ of nodes is selected at random from each of V^X, V^Y and V^Z. Each
selected node is assigned to some randomly chosen communities in addition to the one it has
already been assigned to. Nodes assigned to the same community are then randomly selected, one
from each partite set, and interconnected with hyperedges. The number of hyperedges is decided
by the specified density β.

Figure 4.1 shows an example of a generated synthetic hypergraph. In this example, the 4
nodes in each partite set are divided into two communities (i.e. C = 2). The hyperedge density (β)
is 20% and 25% of the nodes belong to both communities (i.e. γ = 0.25).

Figure 4.1: An example synthetic hypergraph. There are two communities – blue and green.
Violet nodes belong to both the communities.
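
The generation procedure of this section can be sketched in Python as follows. The description leaves some details open, so two assumptions are made and marked in the code: each overlapping node joins exactly one additional community, and β is taken as the fraction of the possible intra-community triples in each community. The function name, the parameter names and the fixed random seed are likewise illustrative choices.

```python
import random
from itertools import product

def generate_hypergraph(nodes_per_set, num_communities, gamma, beta, seed=0):
    """Return (membership, hyperedges) for a synthetic tripartite hypergraph."""
    rng = random.Random(seed)
    membership = {}  # (node type, index) -> set of community ids (ground truth)
    for node_type in "XYZ":
        # Distribute the nodes of this partite set evenly among the communities.
        for index in range(nodes_per_set):
            membership[(node_type, index)] = {index % num_communities}
        # Assumption: each selected node joins exactly one extra community.
        for index in rng.sample(range(nodes_per_set), int(gamma * nodes_per_set)):
            home = index % num_communities
            others = [c for c in range(num_communities) if c != home]
            if others:
                membership[(node_type, index)].add(rng.choice(others))
    hyperedges = set()
    for community in range(num_communities):
        members = {t: [node for node, comms in membership.items()
                       if node[0] == t and community in comms] for t in "XYZ"}
        candidates = list(product(members["X"], members["Y"], members["Z"]))
        # Assumption: beta is the fraction of possible intra-community triples.
        hyperedges.update(rng.sample(candidates, round(beta * len(candidates))))
    return membership, sorted(hyperedges)

# Parameters of the example in Figure 4.1: 4 nodes per set, C = 2, γ = 0.25, β = 0.2.
membership, hyperedges = generate_hypergraph(4, 2, gamma=0.25, beta=0.2)
```

The returned membership dictionary plays the role of the ground truth against which detected communities are compared.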

Users in real-world folksonomies often tag a few resources related to topics that are different from
their topics of primary interest, according to their transient interests at different times. Though
such taggings are typically much fewer than those related to the primary interests of users, they
can adversely affect the performance of algorithms that assign a single community to nodes.
To test whether the proposed algorithm can identify both the primary and transient interests
of users, a second set of hypergraphs is generated, where 1% of the generated hyperedges
interconnect randomly selected nodes from different communities; we denote these as ‘scattered’
hyperedges.
The above assignment of communities to nodes constitutes the ‘ground truth’. After a hypergraph
is generated, information about the communities is hidden, and then communities are detected
from the hypergraph by different community detection algorithms. The community structure
detected by each algorithm is compared with the ground truth using the metric ‘Normalized
Mutual Information (NMI)’ which is explained next.

4.2 Normalized Mutual Information (NMI)

Normalized Mutual Information is an information-theoretic measure of similarity between two
partitionings of a set of elements, which can be used to compare two community structures for the
same graph (as identified by different algorithms). It is based on defining a confusion matrix N,
where the rows correspond to the ‘real’ communities and the columns correspond to the ‘found’
communities. The element N_ij of N is simply the number of nodes in the real community i that
appear in the found community j. NMI is then defined in terms of the entries N_ij. Its value
lies in the range [0, 1] and equals 1 only when the two partitions coincide exactly.
This ‘traditional’ definition of NMI does not consider the case of overlapping communities: it
assumes that each node is placed in only one cluster. However, a node may belong to more than
one cluster. Therefore, the membership of the node i is no longer a number x_i ∈ {1, 2, ..., |C|};
it must instead be considered as a binary array of |C| entries, one for each cluster of the
partition C (say (x_i)_k = 1 if the node i is present in the cluster C_k, and (x_i)_k = 0 otherwise).
Lancichinetti et al. [35] proposed an alternative definition of NMI that accounts for
overlapping communities. According to [35], given two community structures / partitions X and Y,
NMI is defined as

\[
NMI(X, Y) = 1 - \frac{1}{2}\left[ H(X|Y)_{norm} + H(Y|X)_{norm} \right] \tag{4.1}
\]

where

\[
H(X|Y)_{norm} = \frac{1}{N_X} \sum_{i} \frac{\min_{j \in \{1, 2, \ldots, N_Y\}} H(X_i|Y_j)}{H(X_i)}
\]

\[
H(Y|X)_{norm} = \frac{1}{N_Y} \sum_{i} \frac{\min_{j \in \{1, 2, \ldots, N_X\}} H(Y_i|X_j)}{H(Y_i)}
\]

Here H(X) and H(Y) are the entropies of X and Y, H(Y|X) and H(X|Y) are conditional entropies,
and N_X and N_Y are the numbers of clusters in X and Y respectively.
This NMI value is computed in two steps.
1. The pairs of clusters that are closest to each other are found from the two clusterings.
2. The mutual information between those pairs of clusters is then averaged.
The value is in the range [0, 1]. The higher the NMI value, the more similar the two community
structures are (refer to [35] for details).
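
To make Equation 4.1 concrete, the following Python sketch evaluates it for two covers given as lists of node sets over n nodes, treating each cluster as a binary membership indicator. It is a simplified illustration and not the reference implementation of [35]: the additional constraint used in [35] to avoid matching a cluster with the complement of another is omitted, clusters with zero entropy are assumed to contribute 0, and both covers are assumed to be non-empty. All function names are hypothetical.

```python
import math

def _h(p):
    """Contribution -p * log2(p) of one probability to an entropy."""
    return -p * math.log2(p) if p > 0 else 0.0

def _entropy(size, n):
    """Entropy of the binary membership indicator of a cluster of `size` nodes."""
    return _h(size / n) + _h((n - size) / n)

def _conditional_entropy(a, b, n):
    """H(A|B) = H(A, B) - H(B) for the indicators of clusters a and b."""
    both = len(a & b)
    only_a = len(a) - both
    only_b = len(b) - both
    neither = n - both - only_a - only_b
    joint = _h(both / n) + _h(only_a / n) + _h(only_b / n) + _h(neither / n)
    return max(joint - _entropy(len(b), n), 0.0)

def _normalized_conditional(cover_x, cover_y, n):
    """H(X|Y)_norm: average over X_i of min_j H(X_i|Y_j) / H(X_i)."""
    total = 0.0
    for x in cover_x:
        h_x = _entropy(len(x), n)
        if h_x == 0:
            continue  # assumption: empty or all-node clusters contribute 0
        total += min(_conditional_entropy(x, y, n) for y in cover_y) / h_x
    return total / len(cover_x)

def overlapping_nmi(cover_x, cover_y, n):
    """NMI of Equation 4.1 for two covers (lists of node sets) over n nodes."""
    return 1 - 0.5 * (_normalized_conditional(cover_x, cover_y, n)
                      + _normalized_conditional(cover_y, cover_x, n))

# Hypothetical covers of six nodes; node 3 belongs to both ground-truth clusters.
truth = [{0, 1, 2, 3}, {3, 4, 5}]
found = [{0, 1, 3}, {2, 4, 5}]
score = overlapping_nmi(truth, found, 6)
```

Comparing a cover with itself gives a value of 1, while the imperfect cover in the example scores strictly lower.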

4.3 Comparison between Different Choices of OHC

To find the best similarity metric and community detection method, we generated synthetic hy-
pergraphs having various hyperedge densities β = 0.1, 0.2, . . ., 1.0. In each of these hypergraphs,
10% of nodes in each partite set belonged to multiple communities (i.e. γ = 0.1).
First, we compare the performances of the different similarity metrics. Infomap is used as the
community detection method in line graph. NMI values between original and detected community
structures are compared. The comparison result is shown in Figure 4.2. We can see that across
every value of hyperedge density, Jaccard Similarity gives the best result.

Figure 4.2: Comparison of NMI values for different similarity metrics (Jaccard, Matching,
Pearson and Cosine) with varying hyperedge density (x-axis: Hyperedge Density; y-axis: NMI)

Once Jaccard Similarity has been chosen as the desired similarity metric, we compare different
community detection methods which can be applied on line graph. Figure 4.3 shows the com-
parison of NMI values. Across all possible hyperedge densities, Infomap algorithm is found to
perform better than other algorithms.

Figure 4.3: Comparison of NMI values for different community detection algorithms (Infomap,
Louvain, Hierarchical and Modularity) with varying hyperedge density (x-axis: Hyperedge
Density; y-axis: NMI)
4.4 Comparing OHC with Other Algorithms

The CL and HGC algorithms produce only user and tag communities respectively. Hence, while
calculating the NMI value for these algorithms, we use the community memberships of
only the user (respectively, tag) nodes according to the ground truth. In contrast, the proposed
OHC algorithm gives composite communities containing all three types of nodes. Hence, to
evaluate the performance of OHC, we consider the community memberships of all three
types of nodes.
For all the following experiments, |V^X| = |V^Y| = |V^Z| = 200 and the number of communities
C = 20. For each result, random hypergraphs were generated 50 times using the same set of
parameter values, and the average performance over all 50 runs is reported.

4.4.1 Performance w.r.t. Number of Hyperedges

To study how the number of hyperedges affects the performance of the clustering algorithms,
we generated synthetic hypergraphs having various hyperedge densities β = 0.1, 0.2, . . ., 1.0. In
each of these hypergraphs, 10% of nodes in each partite set belonged to multiple communities
(i.e. γ = 0.1). The NMI values for the three algorithms are shown in Figure 4.4.

Figure 4.4: Variation of NMI values for OHC, HGC and CL with varying hyperedge density when
10% of nodes belong to multiple communities (x-axis: Hyperedge Density (β); y-axis: NMI)

It can be clearly seen that, across all hyperedge densities, OHC performs significantly better
than the HGC and CL algorithms. A possible explanation is that the proposed OHC algorithm
utilizes the complete tripartite structure of the hypergraph, whereas both the CL and HGC
algorithms work on unweighted projections.
Guimera et al. [11] have shown that taking projections results in the loss of some of the
information contained in the original tripartite network. Moreover, unweighted projections lose
more information than weighted projections. Even for weighted projections, calculating the
weight is the most challenging step and the determining factor for the amount of information
retained. For example, while taking projections from the hypergraph to the user-tag bipartite
network, one does not take into account the relative importance of resource nodes. A resource
node having a higher degree should not be treated the same as another resource node having a
lower degree. The weight calculation algorithm should take this and many other factors into
consideration.
Note that even for very low hyperedge densities, when detecting community structures
is difficult, the proposed OHC algorithm performs very well, resulting in NMI scores above 0.8.
This makes OHC suitable for real-world folksonomies, where hyperedge density is typically low.

4.4.2 Performance in Presence of Scattered Hyperedges

We have also experimented with synthetic hypergraphs in which 1% of the total hyperedges are
‘scattered’. Figure 4.5 shows the result. As the presence of scattered hyperedges disturbs the
community structure in the hypergraph, the performance of all three algorithms degrades, as
expected. However, the performance of OHC is still better than that of the HGC and CL
algorithms. For the OHC algorithm, NMI scores remain above 0.7, which signifies its
effectiveness in detecting community structure even in the presence of noisy or scattered
hyperedges.

Figure 4.5: Variation of NMI values for OHC, HGC and CL with varying hyperedge density in the
presence of scattered hyperedges (x-axis: Hyperedge Density (β); y-axis: NMI)

4.4.3 Performance w.r.t. Fraction of Nodes in Multiple Communities

A node belonging to multiple communities creates hyperedges to nodes in all those communities;
hence, from the perspective of a particular community, the hyperedges created by this member
node to nodes in other communities reduce the exclusivity of this particular community. As
the number of nodes in multiple overlapping communities increases, the fraction of such
inter-community hyperedges increases, making the community structure more difficult to
identify. We now study how this affects the performance of the algorithms.

Figure 4.6: Variation of NMI values for OHC, HGC and CL with varying fraction of nodes in
multiple communities, keeping hyperedge density constant at 0.2 (x-axis: Fraction of Nodes in
Multiple Communities (γ); y-axis: NMI)

We generated synthetic hypergraphs by varying the fraction of nodes in multiple communities
(γ) while keeping the hyperedge density (β) constant at 0.2. This low value of hyperedge density
was chosen to measure the effectiveness of the algorithms in a sparse environment (as in
real-world folksonomies).
Figure 4.6 shows that OHC performs consistently better than the HGC and CL algorithms in this
case as well. Further, as the community structure becomes more complex, the information
loss resulting from projections becomes increasingly crucial; hence, the performance of the
HGC and CL algorithms degrades sharply as γ increases. On the other hand, the performance
of our OHC algorithm remains considerably more stable.

4.4.4 Performance w.r.t. Size of Real Community

We also observed how the performances of different algorithms are affected by the size of each
real community. Hypergraphs having 200 nodes in each partite set were generated changing
the number of real communities. Here hyperedge density is fixed at 0.2 and 10% of total nodes
belong to multiple communities. The results are shown in Figure 4.7.

Figure 4.7: Comparison of NMI values for OHC, HGC and CL with varying number of real
communities (x-axis: Number of Real Communities; y-axis: NMI)

When the number of nodes in one community is large, the random assignment of hyperedges during
the generation of synthetic hypergraphs may create smaller communities inside one large
community. Community detection algorithms find these smaller communities rather than the large
encompassing community. For this reason, as the number of real communities increases, the size
of each community decreases, enabling better NMI performance. Here also, OHC performs better
than the CL and HGC algorithms.
The above experiments clearly validate our motivation and show that considering the complete
tripartite structure of hypergraphs can result in better identification of community structure, as
compared to considering projections (as done in prior studies).
In the next chapter, we use OHC to study the community structure of real world folksonomies.

Chapter 5

Experiments on Real World Folksonomies

In this chapter, we apply the proposed OHC algorithm to gain insights into the community
structures prevalent in real folksonomies. For this, we use the publicly available datasets [36]
having snapshots of the folksonomies – Delicious, LastFm and MovieLens. The statistics of these
data sets are summarized in Table 5.1.

Dataset      Users    Resources   Tags     Hyperedges
Delicious    1,867    69,226      53,388   437,593
LastFm       1,892    17,632      11,946   186,479
MovieLens    2,113    10,197      13,222   47,957

Table 5.1: Statistics of Real Folksonomy Datasets

5.1 Overlapping Communities in Folksonomies

For all three datasets, OHC algorithm successfully groups semantically related resources and
tags and the users tagging these resources. As an illustration, Table 5.2 shows the resources
and tags placed in some example communities for each of the three datasets. It is evident that
the resources and tags that are placed in the same community are often related to a common
semantic theme.

Community                       Theme         Example of Member Nodes
LastFm Artists (resources)      Hard Rock     Van Halen*, Deep Purple*, Aerosmith*, Alice Cooper, Guns N’ Roses, Scorpions, Kiss, Living Colour, White Lion, Bad Company, Bon Jovi, Hardline, The Rolling Stones
LastFm Artists (resources)      Heavy Metal   Van Halen*, Deep Purple*, Aerosmith*, Iron Maiden, Motorhead, Black Sabbath, Metallica, Twisted Sister, Crazy Lixx, Blind Guardian
LastFm Tags                     Metal         blues rock*, psychedelic rock*, rap metal*, nu metal*, metal, symphonic metal, doom metal, progressive metal, speed metal, folk metal, metalcore, viking metal, power metal
LastFm Tags                     Rock          blues rock*, psychedelic rock*, rap metal*, nu metal*, progressive rock, polish rock, art rock, soft rock, gothic rock, polish, punk, punk rock, hard rock, glam rock, pop-rock
MovieLens Movies (resources)    Superhero     The Incredibles*, Shrek*, Shrek 2*, The Incredible Hulk*, Batman Begins, Batman Returns, Batman Forever, Spider-Man, Superman, Superman II, Superman III, X-Men
MovieLens Movies (resources)    Animation     The Incredibles*, Shrek*, Shrek 2*, The Incredible Hulk*, Shrek the Third, Beowulf, WALL-E, Ratatouille, Finding Nemo, Cars, Toy Story, Toy Story 2, Kung fu Panda
MovieLens Tags                  Criticism     violent*, brutal*, too violent, waste of celluloid, disturbing, junk, tragically stupid, lousy script, pointless, waste of money, not very good, confusing plot, worst animated flick ever
MovieLens Tags                  Violence      violent*, brutal*, violence, murder, fatality, civil war, great villain, dark, spanish civil war, serial killer, great war depiction, vietnam war, world war ii, best war film
Delicious Tags                  Web 2.0       socialnetworking, socialweb, socialmedia, web20, php, drupal, xml, cms, webdesign, css3, twitter, skype, ruby, facebook, snippets, wikipedia, blog

Table 5.2: Examples of communities detected by the proposed OHC algorithm. The algorithm
successfully clusters nodes which are related to a common semantic theme (see Column 2). Nodes
related to multiple themes (marked with *) are placed in overlapping communities.

A closer look at Table 5.2 reveals that the algorithm also correctly identifies nodes that are
related to multiple overlapping communities (themes). For instance, the band Van Halen is
placed in two different communities detected from LastFm. The Wikipedia article about Van
Halen¹ justifies this placement, listing their genre as both ‘Hard Rock’ and ‘Heavy Metal’.
Any non-overlapping community detection algorithm would have placed this node in only one of
the two communities (say ‘Hard Rock’). Community-based recommendation schemes, which
recommend resources to users based on common memberships in communities, would then have
recommended this resource only to users who are interested in ‘Hard Rock’. However, this
resource can also be recommended to a user who likes to listen to ‘Heavy Metal’ songs. Our
proposed OHC algorithm places the resource in both communities, thus raising the chance of
proper recommendation to users of real-world folksonomies.

¹ http://en.wikipedia.org/wiki/Van_Halen

Figure 5.1: Cumulative distribution of the fraction of communities which overlap with a given
number (x) of other communities, shown separately for tag, resource and user nodes; main
figure: LastFm, inset: MovieLens (x-axis: Number of Overlapping Communities; y-axis: CDF)

A substantial amount of overlap is detected by the OHC algorithm in all three datasets.
Figure 5.1 shows the cumulative distribution of the fraction of communities which overlap with
a given number of other communities, for LastFm and MovieLens. A similar pattern was also
detected in Delicious.

5.2 Evaluation of Communities Detected

The principal difficulty in evaluating the communities detected in case of real folksonomies is
the absence of ‘ground truth’ regarding the community memberships of nodes in folksonomies,
since their huge size makes it impossible for human experts to evaluate the quality of identified
communities.
Hence, we use the following two methods for evaluation.
1. We use the graph-based metric Conductance, which has been shown to conform well
to the intuitive notion of communities and is extensively used for evaluating the quality of
communities in online social networks (see [37] for details). As conductance is defined only
for unipartite networks, we compare the tag communities detected by HGC with the tag nodes
in the communities identified by our OHC algorithm.
2. In the case of folksonomies which allow users to form a social network among themselves, we
can assume that users having similar interests are likely to be linked in the social network,
or at least to have a common social neighbourhood (a property known as homophily [12]).
We utilize this notion to evaluate the user communities detected by the CL algorithm and the
user nodes in the communities identified by the OHC algorithm.

5.2.1 Comparison of Conductance Value

Conductance (φ(C)) of a community C, which implies a cut (C, G − C) in a graph G, is defined as

\[
\phi(C) = \frac{\sum_{i \in C,\; j \in (G - C)} A_{ij}}{\min\bigl(A(C),\, A(G - C)\bigr)} \tag{5.1}
\]

where A is the adjacency matrix for the network and

\[
A(C) = \sum_{i \in C} \sum_{j \in G} A_{ij}
\]

The conductance value [37] ranges from 0 to 1, where a lower value signifies better community
structure. Figure 5.2 shows the cumulative distribution of conductance values of the tag
communities detected by the two algorithms. Across all three datasets, OHC produces more
communities having lower conductance values, which implies that OHC can find communities of
better quality than those obtained by the HGC algorithm. The reason for this superior
performance is that OHC groups semantically related nodes into relatively smaller cohesive
communities instead of creating a few large, generalized communities. For examples of
semantically related communities, refer to Table 5.2.
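
As an illustration of Eq. 5.1, the conductance of a single community can be computed as sketched
below. The sketch assumes an undirected network stored as a symmetric adjacency matrix (a list of
lists) and a community given as a set of node indices; the function name and the toy graph are
hypothetical and are not taken from the thesis.

```python
def conductance(adj, community):
    """Conductance of `community` (a set of node indices) following Eq. 5.1.

    `adj` is a symmetric adjacency matrix given as a list of lists.
    """
    n = len(adj)
    inside = set(community)
    outside = set(range(n)) - inside

    # Numerator: total weight of the edges crossing the cut (C, G - C).
    cut = sum(adj[i][j] for i in inside for j in outside)

    # A(S): sum of A_ij over i in S and all j in G.
    def volume(nodes):
        return sum(adj[i][j] for i in nodes for j in range(n))

    denom = min(volume(inside), volume(outside))
    if denom == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / denom


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

phi_triangle = conductance(adj, {0, 1, 2})  # one crossing edge over volume 7
phi_partial = conductance(adj, {0, 1})      # two crossing edges over volume 4
```

In this toy graph the full triangle obtains a lower conductance than the partial community {0, 1},
in line with the reading that lower values indicate better community structure.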

[Figure 5.2 plot: CDF (y-axis) against Conductance (x-axis) for OHC and HGC; main panel LastFm,
insets Delicious and MovieLens.]
Figure 5.2: Cumulative distribution of conductance values of tag communities obtained from the
real-world folksonomies: LastFm (main plot), Delicious and MovieLens (both inset) for OHC
and HGC.

5.2.2 Comparing Detected User Communities with Social Network

In folksonomies which allow users to form a social network, there can be two types
of relationships among users: explicit social connections (in the social network) and implicit
connections through their tagging behaviour (e.g. tagging the same resource) in the hypergraph
(see footnote 2). A community detection algorithm for hypergraphs utilizes the implicit
relationships to identify the community structure, and we propose to evaluate the detected
community structure using the explicit connections that the users themselves create (in the
social network). For instance, if a large fraction of the users who are socially linked (or share
a common social neighbourhood in the social network) are placed in the same community (by the
algorithm), the detected community structure can be said to group together users having common
interests.
Hence, to compare the community structure identified by two algorithms, we consider the user-
pairs who are within a certain distance from each other in the social network (where distance 1
implies friends, i.e. two users who are directly linked in the social network), and compute the
fraction of such user-pairs who have been placed in a common community by the algorithm.
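
The measurement described above can be sketched as follows. This is an illustrative sketch under
stated assumptions rather than the procedure used in the thesis: the social network is taken to be
an undirected graph stored as a dictionary of neighbour sets, communities are sets of user
identifiers, and "within a certain distance" is read as a shortest-path distance of at most
`max_dist` hops. All names and the toy data are hypothetical, and user identifiers are assumed to
be mutually comparable (here, strings).

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path hop counts from `source` in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def fraction_in_common_community(graph, communities, max_dist):
    """Fraction of user pairs within `max_dist` hops that share a community."""
    # Communities each user belongs to (a user may belong to several).
    membership = {}
    for idx, members in enumerate(communities):
        for user in members:
            membership.setdefault(user, set()).add(idx)

    close_pairs = 0
    together = 0
    for u in graph:
        for v, d in bfs_distances(graph, u).items():
            # Count each unordered pair once; u < v also excludes u itself.
            if u < v and d <= max_dist:
                close_pairs += 1
                if membership.get(u, set()) & membership.get(v, set()):
                    together += 1
    return together / close_pairs if close_pairs else 0.0


# Toy social network: the path a - b - c - d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
communities = [{"a", "b", "c"}, {"c", "d"}]

frac_d1 = fraction_in_common_community(graph, communities, 1)
frac_d2 = fraction_in_common_community(graph, communities, 2)
frac_d3 = fraction_in_common_community(graph, communities, 3)
```

A breadth-first search from every user is adequate for a small example; for networks of realistic
size the search could be truncated at `max_dist` hops.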
[Figure 5.3 plot: Fraction of User Pairs in Same User Community (y-axis) against Distance in
Social Network (x-axis, 1 to 6) for OHC and CL.]
Figure 5.3: Comparison of the community structure detected by the OHC and CL algorithms with
the social network in LastFm

Figure 5.3 shows the result for the proposed OHC algorithm and the CL algorithm, for the
LastFm dataset. Across all distances, OHC places a larger fraction of the user-pairs who share a
common social neighbourhood in a common community than the CL algorithm does. Also, as the
distance between two users in the social network increases, both algorithms put a smaller fraction
of such user-pairs in the same community.
We can also investigate the reverse question: among the users who are placed in a common
community (by a community detection algorithm), what fraction of these users are actually
connected in the social network (or share a common social neighbourhood)? While investigating
this question, it is to be noted that the ‘quality’ of large communities detected by community
detection algorithms is known to be lower than that of smaller communities [37]. Hence it is
meaningful to answer this question for detected communities taking their size into consideration.

Footnote 2: The social network in LastFm is undirected, while in Delicious, a user can be a ‘fan’
of another user, but this fan-relationship may or may not be reciprocated. We assume two users
are linked if they are in a mutual fan relationship. In the LastFm and Delicious datasets analysed
here, there are 12,717 and 7,668 bi-directional user-user links respectively.

[Figure 5.4 plot: Fraction of User Pairs (y-axis) against Distance in Social Network (x-axis,
1 to 6); curves for CommSize < 20 and CommSize > 20, each by OHC and by CL.]

Figure 5.4: Comparison of the community structure detected by the OHC and CL algorithms with
the social network in Delicious

Figure 5.4 shows the fraction of user-pairs placed in a common community by the OHC and
CL algorithms that are within a certain distance in the social network (where distance 1 implies
friends), for the Delicious dataset.
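
The reverse measurement behind Figure 5.4 can be sketched in the same spirit. The sketch below is
an assumption-laden illustration, not the code used in the thesis: the social network is a
dictionary of neighbour sets, communities are sets of user identifiers, and a hypothetical `keep`
predicate selects communities by size (the toy example uses a threshold of 3, whereas the analysis
here separates communities at 20 users).

```python
from collections import deque
from itertools import combinations


def hops(graph, source, limit):
    """Users reachable from `source` in at most `limit` hops (source excluded)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue  # do not expand beyond the hop limit
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]
    return set(dist)


def fraction_socially_close(graph, communities, max_dist, keep):
    """Among co-members of communities accepted by `keep`, fraction within `max_dist` hops."""
    pairs = set()
    for members in communities:
        if keep(len(members)):
            pairs.update(combinations(sorted(members), 2))
    if not pairs:
        return 0.0

    reach = {}  # cache of truncated BFS results per user
    close = 0
    for u, v in pairs:
        if u not in reach:
            reach[u] = hops(graph, u, max_dist)
        if v in reach[u]:
            close += 1
    return close / len(pairs)


# Toy social network: the path a - b - c - d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
communities = [{"a", "b"}, {"a", "b", "c", "d"}]

small = fraction_socially_close(graph, communities, 1, lambda size: size < 3)
large = fraction_socially_close(graph, communities, 1, lambda size: size >= 3)
```

Passing the size criterion as a predicate keeps the two size classes in one function, so the same
code yields one curve per size class and per algorithm.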
For detected user communities of size smaller than 20, more than 70% of the users who are placed
in a common community by OHC are actually connected in the social network, whereas the
corresponding value for the CL algorithm is much lower. However, for larger detected communities
(having more than 20 users), the fraction of user-pairs who share a common social neighbourhood
is much lower and almost identical for both algorithms.
The above results show that even in the case of real folksonomies (as in the case of
synthetically generated hypergraphs), the proposed OHC algorithm can detect much better community
structure than the existing CL and HGC algorithms. The fact that, in small detected communities,
a very large fraction of the users who are placed in a common community by OHC are actually
friends (i.e. directly linked in the social network) shows that OHC can be used to identify
potential friends directly from the hypergraph structure.

Chapter 6

Conclusion

In this work, we proposed the first algorithm to detect overlapping communities considering the
full tripartite hypergraph structure of folksonomies. Through extensive experiments on synthetic
as well as real folksonomy networks, we showed that the proposed algorithm outperforms existing
algorithms that consider projections of hypergraphs.
In large folksonomies, it is difficult for an individual user to find other like-minded users as well
as resources of her interest. Our algorithm groups nodes into multiple communities, where each
community represents a topic of interest. Based on these interests, like-minded users as well as
resources can be found.
Thus the proposed algorithm can be used for recommending interesting resources and friends to
users in folksonomies. Building such a personalized recommendation system, taking advantage of
the proposed algorithm, is left as future work.

Bibliography

[1] Shengliang Xu, Shenghua Bao, Ben Fei, Zhong Su, and Yong Yu. Exploring folksonomy for
personalized search. In ACM SIGIR, pages 155–162, 2008.
[2] Ioannis Konstas, Vassilios Stathopoulos, and Joemon M. Jose. On social networks and
collaborative recommendation. In ACM SIGIR, pages 195–202, 2009.
[3] Ciro Cattuto, Christoph Schmitz, Andrea Baldassarri, Vito D. P. Servedio, Vittorio Loreto,
Andreas Hotho, Miranda Grahl, and Gerd Stumme. Network properties of folksonomies. AI
Communications, 20(4):245–262, 2007.
[4] Tsuyoshi Murata. Detecting communities from social tagging networks based on tripartite
modularity. In Link Analysis in Heterogeneous Information Networks, July 2011.
[5] Symeon Papadopoulos, Yiannis Kompatsiaris, and Athena Vakali. A graph-based clustering
scheme for identifying related tags in folksonomies. In Data Warehousing and Knowledge
Discovery Conference, pages 65–76, 2010.
[6] Nicolas Neubauer and Klaus Obermayer. Towards Community Detection in k-Partite k-
Uniform Hypergraphs, pages 1–9. 2009.
[7] Alexei Vazquez. Finding hypergraph communities: a Bayesian approach and variational
solution. Journal of Statistical Mechanics: Theory and Experiment, 2009, Jul 2009.
[8] Michael Brinkmeier, Jeremias Werner, and Sven Recknagel. Communities in graphs and
hypergraphs. In ACM CIKM, 2007.
[9] Yu-Ru Lin, Jimeng Sun, Paul Castro, Ravi Konuru, Hari Sundaram, and Aisling Kelliher.
Metafac: community discovery via relational hypergraph factorization. In ACM SIGKDD,
pages 527–536, 2009.
[10] Xufei Wang, Lei Tang, Huiji Gao, and Huan Liu. Discovering Overlapping Groups in Social
Media. In IEEE ICDM, pages 569–578, 2010.
[11] Roger Guimera, Marta Sales-Pardo, and Luis A. Nunes Amaral. Module identification in
bipartite and directed networks. Phys. Rev. E, 76:036102, Sep 2007.
[12] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social
networks. Annual Review of Sociology, 27:415–444, 2001.
[13] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. Link communities reveal multiscale
complexity in networks. Nature, 466(7307):761–764, August 2010.
[14] T. S. Evans and R. Lambiotte. Line graphs, link partitions, and overlapping communities.
Phys. Rev. E, 80:016105, 2009.

[15] M. Girvan and M. E. J. Newman. Community structure in social and biological networks.
Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
[16] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks.
Phys. Rev. E, 69:026113, 2004.
[17] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in
very large networks. Phys. Rev. E, 70:066111, Dec 2004.
[18] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[19] Jeffrey Baumes, Mark K. Goldberg, Mukkai S. Krishnamoorthy, Malik M. Ismail, and
Nathan Preston. Finding communities by clustering a graph into overlapping subgraphs. In
Nuno Guimaraes and Pedro T. Isaias, editors, IADIS AC, pages 97–104. IADIS, 2005.
[20] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. Uncovering the overlapping community
structure of complex networks in nature and society. Nature, 435:814–818, Jun 2005.
[21] Illes Farkas, Daniel Abel, Gergely Palla, and Tamas Vicsek. Weighted network modules.
New Journal of Physics, 9(6):180, 2007.
[22] Sune Lehmann, Martin Schwartz, and Lars Kai Hansen. Biclique communities. Phys. Rev.
E, 78:016108, Jul 2008.
[23] Balazs Adamcsek, Gergely Palla, Illes J. Farkas, Imre Derenyi, and Tamas Vicsek. Cfinder:
locating cliques and overlapping modules in biological networks. Bioinformatics, 22(8):1021–
1023, 2006.
[24] Jussi M. Kumpula, Mikko Kivelä, Kimmo Kaski, and Jari Saramäki. Sequential algorithm
for fast clique percolation. Phys. Rev. E, 78:026109, Aug 2008.
[25] Andrea Lancichinetti and Santo Fortunato. Benchmarks for testing community detection
algorithms on directed and weighted graphs with overlapping communities. Physical Review
E, 80(1):9, 2009.
[26] V. Nicosia, G. Mangioni, V. Carchiolo, and M. Malgeri. Extending the definition of modu-
larity to directed graphs with overlapping communities, 2008.
[27] Steve Gregory. Finding overlapping communities using disjoint community detection algo-
rithms. In Santo Fortunato, Giuseppe Mangioni, Ronaldo Menezes, and Vincenzo Nicosia,
editors, Complex Networks, volume 207 of Studies in Computational Intelligence, pages 47–
61. Springer Berlin / Heidelberg, 2009.
[28] Samuel Rota Bulò and Marcello Pelillo. A game-theoretic approach to hypergraph clustering.
Advances in Neural Information Processing Systems, pages 1–9, 2009.
[29] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs:
Clustering, classification, and embedding. In Advances in Neural Information Processing
Systems (NIPS) 19, page 2006. MIT Press, 2006.
[30] Tsuyoshi Murata. Modularity for heterogeneous networks. In ACM Hypertext and Hyper-
media, pages 129–134, 2010.
[31] M. E. J. Newman. Fast algorithm for detecting community structure in networks. Phys.
Rev. E, 69:066133, Jun 2004.

[32] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast
unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and
Experiment, 2008(10), Oct 2008.
[33] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal
community structure. PNAS, 105:1118–1123, Jan 2008.
[34] Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: a comparative
analysis. Phys. Rev. E, 80:056117, Sep 2009.
[35] A. Lancichinetti, S. Fortunato, and J. Kertesz. Detecting the overlapping and hierarchical
community structure in complex networks. New Journal of Physics, 11:033015, 2009.
[36] Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. Workshop on Information Heterogeneity
and Fusion in Recommender Systems (HetRec 2011). In ACM RecSys, 2011.
[37] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Statistical
properties of community structure in large social and information networks. In ACM WWW,
2008.

Appendix A

Publications from the Thesis

The work presented in the thesis resulted in the following publications:

[1] Abhijnan Chakraborty, Saptarshi Ghosh, Niloy Ganguly. Detecting Overlapping Communities
in Folksonomies. In Proceedings of the 23rd ACM Conference on Hypertext and Social
Media (Hypertext 2012). Milwaukee, Wisconsin, USA. June, 2012.

[2] Abhijnan Chakraborty, Saptarshi Ghosh. Identifying Overlapping Communities in
Folksonomies. In Dynamics on and of Complex Networks: Applications to Biology, Computer
Science, Economics, and the Social Sciences, Volume 2, Ganguly, N., Deutsch, A., and Mukherjee,
A. (eds.), Springer.

[3] Abhijnan Chakraborty, Saptarshi Ghosh, Niloy Ganguly. Detection of Overlapping
Communities in Folksonomies. Poster in the International Workshop on Mathematical Physics of
Complex Networks: From Graph Theory to Biological Physics (MAPCON12). Dresden, Germany.
May, 2012.
