NE7012 Social Network Analysis
Prepared by: A. Rathnadevi, A.V.C College of Engineering
Supposing that we have started out with some data sets in traditional formats (relational
databases, Excel sheets, XML files, etc.), our first step is to convert them into an RDF-
based syntax, which allows us to store the data in an ontology store and manipulate it
with ontology-based tools.
In this process we need to assign identifiers to resources and re-represent our data in
terms of a shared ontology such as FOAF.
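This conversion step can be sketched in a few lines. The following is a minimal illustration without an RDF library: the base URI, the record fields, and the helper name are assumptions made for the example, not part of any standard API.

```python
# Minimal sketch: re-representing tabular records as RDF-style triples
# using FOAF terms. The base URI and record layout are illustrative.
FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "http://example.org/person/"  # hypothetical namespace for new identifiers

def to_triples(records):
    """Assign a URI to each record and emit (subject, predicate, object) triples."""
    triples = []
    for rec in records:
        subject = BASE + rec["id"]  # identifier assigned to the resource
        triples.append((subject, FOAF + "name", rec["name"]))
        triples.append((subject, FOAF + "mbox", "mailto:" + rec["email"]))
    return triples

rows = [{"id": "p1", "name": "Paul", "email": "paul@example.org"}]
print(to_triples(rows))
```

In practice an RDF library would manage namespaces and serialization, but the essence of the step is the same: mint an identifier per record and restate each field as a triple over a shared vocabulary.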
The task of aggregation, however, is not complete yet: we need to find identical resources
across the data sets.
This is a two-step process.
First, it requires capturing the domain-specific knowledge of when to consider two
instances to be the same.
As we will see, the FOAF ontology itself also prescribes ways to infer the equality of two
instances, for example based on their email address.
However, beyond these properties it is likely that we need to introduce some domain-
specific criteria based on domain-specific properties. In order to do this, we need to
consider the general meaning of equality in RDF/OWL.
In practice, only a limited part of the knowledge regarding instance equality can be
captured in RDF or OWL. For example, determining equality is often considered as
applying a threshold value to a similarity measure, which is a weighted combination of
the similarity of certain properties.
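This thresholded, weighted combination can be illustrated directly. In the sketch below the property list, the weights, the similarity function, and the threshold are all illustrative assumptions; a real system would tune them to the domain.

```python
# Sketch of equality as a thresholded, weighted similarity measure.
# The properties, weights, and threshold are illustrative assumptions.
def string_sim(a, b):
    """Crude similarity: fraction of aligned characters that agree."""
    if not a or not b:
        return 0.0
    matches = sum(c1 == c2 for c1, c2 in zip(a, b))
    return matches / max(len(a), len(b))

def same_instance(x, y, weights={"name": 0.7, "affiliation": 0.3}, threshold=0.8):
    """Weighted combination of per-property similarities, compared to a threshold."""
    score = sum(w * string_sim(x.get(p, ""), y.get(p, "")) for p, w in weights.items())
    return score >= threshold

a = {"name": "John Smith", "affiliation": "AVC College"}
b = {"name": "John Smith", "affiliation": "AVC College"}
print(same_instance(a, b))  # identical records exceed the threshold
```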
Once we have determined the rules or procedures that determine equality in our domain,
we need to carry out the actual instance unification or smushing.
Unlike much of the related work on instance unification, we consider smushing as a
reasoning task, where we iteratively execute the rules or procedures that determine
equality until no more equivalent instances can be found.
The advantage of this approach (compared to a one-step computation of a similarity
measure) is that we can take into account the learned equalities in subsequent rounds of
reasoning.
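The iterative character of smushing can be sketched as a fixpoint computation. In the sketch below, the data layout (property maps with sets of values) and the rule signature are illustrative assumptions; the point is that values folded together in one merge can trigger further matches in later rounds.

```python
# Sketch of smushing as iterative reasoning: equality rules are re-applied
# until no new equivalences are found, so equalities learned earlier
# (merged property values) can trigger matches later.
def smush(props, rules):
    """props: {id: {property: set of values}}; rules: predicates over two
    property maps. Returns a map from each id to its representative."""
    merged = {i: {p: set(v) for p, v in pv.items()} for i, pv in props.items()}
    alias = {i: i for i in props}  # id -> current representative
    changed = True
    while changed:  # iterate until a fixpoint is reached
        changed = False
        reps = sorted(set(alias.values()))
        for i, a in enumerate(reps):
            for b in reps[i + 1:]:
                if alias[a] != a or alias[b] != b:  # merged earlier this round
                    continue
                if any(rule(merged[a], merged[b]) for rule in rules):
                    for p, v in merged[b].items():  # fold b's values into a so
                        merged[a].setdefault(p, set()).update(v)  # later rounds see them
                    for k, r in alias.items():
                        if r == b:
                            alias[k] = a
                    changed = True
    return alias

share_mbox = lambda x, y: bool(x.get("mbox", set()) & y.get("mbox", set()))
data = {"p1": {"mbox": {"m1"}}, "p2": {"mbox": {"m1", "m2"}}, "p3": {"mbox": {"m2"}}}
print(smush(data, [share_mbox]))  # all three ids end up with one representative
```

Note that p1 and p3 share no mailbox directly: they are unified only because p2's mailboxes were folded into p1 first, which is exactly the advantage over a one-step similarity computation.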
We will discuss the advantages and disadvantages of using rule-based reasoners and
Description Logic reasoners for this task and the trade-off between forward- and
backward-chaining reasoning. We will also outline the approach for the case where we
need to combine procedural and rule-based reasoning.
RDF and OWL (starting with OWL Lite) allow us to identify resources and to represent
their (in)equality using the owl:sameAs and owl:differentFrom properties.
However, these are only the basic building blocks for defining equality.
In particular, the meaning of equality with respect to the objects of the domain depends
on the domain itself and the goals of the modelling. For example, it is part of the domain
knowledge what it takes for two persons or publications to be considered the same, i.e.
what are the characteristics that we take into account.
Further, depending on the level of modelling we may distinguish, for example, individual
persons or instead consider groups of persons (e.g. based on roles) and consider each
individual within the group to be equivalent (role equivalence).
Nevertheless, it is interesting to take a short de-tour to the domain of philosophy, where a
number of attempts have been made over the centuries to capture the general
characteristics of the notions of identity and equality (indiscernibility).
The most well-known formulation of equality was given by Gottfried Wilhelm Leibniz
in his Discourse on Metaphysics.
The Identity of Indiscernibles, or Leibniz-law, can be loosely phrased as the principle
that all things are identical unless we are able to distinguish (discern) them, or, put
differently: no two distinct objects have exactly the same properties.
The Leibniz-law is formalized using the logical formula given in Formula 5.1. The
converse of the Leibniz-law is called Indiscernibility of Identicals and written as Formula
5.2. The two laws taken together (often also called Leibniz-law) are the basis of the
definition of equality in a number of systems.
∀P : P(x) ↔ P(y) → x = y ---- (5.1)
∀P : P(x) ↔ P(y) ← x = y ---- (5.2)
The owl:sameAs property is reflexive, symmetric, and transitive:
∀s : (s, owl:sameAs, s)
(s1, owl:sameAs, s2) → (s2, owl:sameAs, s1)
(s1, owl:sameAs, s2) ∧ (s2, owl:sameAs, s3) → (s1, owl:sameAs, s3) ---- (5.4)
Note that it is not inconsistent to have resources that are owl:sameAs but have different
stated properties; e.g. Formula 5.5 is not an inconsistent ontology.
The explanation lies again in the open world assumption: we can assume that the missing
statements (that s1 has the foaf:name Paul and s2 has the foaf:name John) exist
somewhere. In a closed world this ontology would be inconsistent.
In our case, we are interested in capturing knowledge about the identity of resources that
allows us to conclude the (in)equality of resources.
In OWL there is a limited set of constructs that can lead to (in)equality statements.
Functional and inverse functional properties (IFPs) and maximum cardinality restrictions
in general can lead us to conclude that two symbols must denote the same resource,
because otherwise the cardinality restriction could not be fulfilled. For example, the
foaf:mbox property, denoting the email address of a person, is inverse functional, as a
mailbox can belong to only a single person.
As another example, consider a hypothetical ex:hasParent property, which has a
maximum cardinality of two. If we state that a single person has three parents (which we
are allowed to state), an OWL reasoner should conclude that at least two of them have to
be the same.
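The IFP case can be sketched without a full OWL reasoner. The following is a toy implementation of just this one inference pattern; the triple data and property names are illustrative, and a real system would run a complete OWL reasoner instead.

```python
# Sketch: inferring owl:sameAs from an inverse functional property (IFP).
# If a property is inverse functional, two subjects sharing a value for it
# must denote the same resource. The triple data here is illustrative.
SAME_AS = "owl:sameAs"

def infer_same_as(triples, ifps):
    """triples: iterable of (s, p, o); ifps: set of inverse functional properties."""
    seen = {}  # (property, object) -> first subject observed with that value
    inferred = []
    for s, p, o in triples:
        if p in ifps:
            if (p, o) in seen and seen[(p, o)] != s:
                inferred.append((seen[(p, o)], SAME_AS, s))
            else:
                seen[(p, o)] = s
    return inferred

data = [("ex:a", "foaf:mbox", "mailto:paul@example.org"),
        ("ex:b", "foaf:mbox", "mailto:paul@example.org")]
print(infer_same_as(data, {"foaf:mbox"}))  # ex:a and ex:b denote one person
```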
Once inferred, the equality of instances can be stated using the owl:sameAs property.
There are more ways to conclude that two symbols do not denote the same object, which
is expressed by the owl:differentFrom property. For example, instances of disjoint
classes must be different from each other.
Often, the knowledge that we need to capture cannot be expressed in OWL, only in more
expressive rule languages.
A typical example is the case of complex keys: for example, we may want to equate two
persons only if both their names and their birth dates match. This is not possible to
express in OWL, but it can be captured in Horn logic, a typical target for rule languages
for the Semantic Web.
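Such a complex-key rule has a simple Horn shape: a sameAs head with a conjunctive body. The sketch below encodes it as a plain predicate; the property names are illustrative stand-ins for the vocabulary terms a real rule language would use.

```python
# Sketch of a complex-key rule in the Horn style: equate two person
# instances only when both name and birth date match.
# Head: sameAs(a, b). Body: name(a) = name(b) AND birthDate(a) = birthDate(b).
def complex_key_rule(a, b, key=("name", "birth_date")):
    """True when every key property is present in a and equal in b."""
    return all(a.get(p) is not None and a.get(p) == b.get(p) for p in key)

p = {"name": "John Smith", "birth_date": "1980-01-02"}
q = {"name": "John Smith", "birth_date": "1980-01-02"}
r = {"name": "John Smith", "birth_date": "1975-06-30"}
print(complex_key_rule(p, q), complex_key_rule(p, r))  # True False
```

A name match alone (p vs. r) is not enough: the conjunction over the whole key is what OWL cannot express but a Horn rule can.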
Reasoning is the inference of new statements (facts) that necessarily follow from the set
of known statements. As discussed in the previous Chapter, what follows from a set of
statements is determined by the semantics of the language.
Every fact we add to our knowledge base has a meaning in the sense that it represents
some constraint on the mapping between the symbols of our representation and some
domain of interpretation. In other words, every piece of additional knowledge excludes
some unintended interpretation of our knowledge base.
3.2.1 Types of Changes
Emerge: A community c(tk) emerges in C(tk) when c(tk) shares no URLs with any
community in C(tk−1). Note that not all URLs in c(tk) newly appear in W(tk); some
URLs in c(tk) may be included in W(tk−1) but not have enough connectivity to form
a community.
Dissolve: A community c(tk−1) in C(tk−1) has dissolved when c(tk−1) shares no URLs
with any community in C(tk). Note that not all URLs in c(tk−1) disappear from
W(tk−1); some URLs in c(tk−1) may still be included in W(tk), having lost connectivity
to any community.
Grow and shrink: When c(tk−1) in C(tk−1) shares URLs with only c(tk) in C(tk), and
vice versa, only two changes can occur to c(tk−1): the community grows when new
URLs appear in c(tk), and shrinks when URLs disappear from c(tk−1). When the
number of appearing URLs is greater than the number of disappearing URLs, it grows;
in the reverse case, it shrinks.
Split: A community c(tk−1) may split into smaller communities. In this case,
c(tk−1) shares URLs with multiple communities in C(tk). A split is caused by
disconnections of URLs in the SDG. Split communities may grow and shrink; they may
also merge (see the next item) with other communities.
Merge: When multiple communities (c(tk−1), d(tk−1), ...) share URLs with a single
community e(tk), these communities are merged into e(tk) by connections of their URLs
in the SDG. Merged communities may grow and shrink; they may also split before
merging.
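The change types above can be detected mechanically from shared URLs between two snapshots. The sketch below represents each community as a set of URLs; the function name and event encoding are illustrative, and merge events are reported per predecessor rather than deduplicated.

```python
# Sketch classifying community changes between two time steps by shared
# URLs, following the definitions above. Communities are sets of URLs.
def classify(prev, curr):
    """prev, curr: lists of URL sets at t(k-1) and t(k). Returns change events."""
    events = []
    for i, c_old in enumerate(prev):
        succ = [j for j, c_new in enumerate(curr) if c_old & c_new]
        if not succ:
            events.append(("dissolve", i, None))       # no shared URLs at t(k)
        elif len(succ) > 1:
            events.append(("split", i, succ))          # shared with many at t(k)
        else:
            j = succ[0]
            preds = [k for k, c in enumerate(prev) if c & curr[j]]
            if len(preds) > 1:
                events.append(("merge", preds, j))     # many predecessors merge
            elif len(curr[j]) > len(c_old):
                events.append(("grow", i, j))
            elif len(curr[j]) < len(c_old):
                events.append(("shrink", i, j))
    for j, c_new in enumerate(curr):
        if not any(c_new & c_old for c_old in prev):
            events.append(("emerge", None, j))         # no shared URLs at t(k-1)
    return events

prev = [{"u1", "u2"}, {"u3"}]
curr = [{"u1", "u2", "u4"}, {"u5"}]
print(classify(prev, curr))
```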
The merge rate, Rmerge(c(tk−1), c(tk)), is the number of URLs absorbed from other
communities by merging per unit time. A higher merge rate means that the community
has obtained URLs mainly by merging. The merge rate is defined as follows.
The split rate, Rsplit(c(tk−1), c(tk)), is the number of URLs split off from c(tk−1) per
unit time. When the split rate is low, c(tk) is larger than the other split communities;
otherwise, c(tk) is smaller than the other split communities. The split rate is defined as
follows.
By combining these metrics, some complex evolution patterns can be represented. For
example, a community has stably grown when its growth rate is positive, and its
disappearance and split rates are low.
Similar evolution patterns can be defined for shrinkage. Longer-range metrics (over more
than one unit time) can be calculated for main lines. For example, the novelty metric of a
main line (c(ti), c(ti+1), ..., c(tj)) is calculated as follows.
Other metrics can be calculated similarly.
3.3.1 Introduction
Relations of real-world entities are often represented as networks, such as social networks
connected with friendships or co-authorships.
In many cases, real social networks contain denser parts and sparser parts. Denser
subnetworks correspond to groups of people that are closely connected with each other.
Such denser subnetworks are called “communities” in this chapter. (There are many
definitions of communities, which will be described later.)
Detecting communities from given social networks is practically important for the
following reasons:
1. Communities can be used for information recommendation because members of the
communities often have similar tastes and preferences. Membership of detected
communities will be the basis of collaborative filtering.
2. Communities will help us understand the structures of given social networks.
Communities are regarded as components of given social networks, and they will clarify
the functions and properties of the networks.
3. Communities will play important roles when we visualize large-scale social networks.
Relations among the communities clarify the processes of information sharing and
information diffusion, and they may give us some insights into the future growth of the
networks.
Modularity is defined as

Q = (1/2m) Σij [ Aij − ki kj / 2m ] δ(ci, cj)

where the sum runs over all pairs of vertices, A is the adjacency matrix, ki is the degree
of vertex i, and m is the total number of edges of the network.
The element Aij of the adjacency matrix is 1 if vertices i and j are connected, and 0
otherwise.
The δ-function yields 1 if vertices i and j are in the same community, and 0 otherwise.
Modularity can be rewritten as

Q = Σs=1..nm [ ls/m − (ds/2m)² ]

where nm is the number of communities, ls is the total number of edges joining vertices
of community s, and ds is the sum of the degrees of the vertices of s.
In the above formula, the first term of each summand is the fraction of edges of the
network inside the community, whereas the second term represents the expected fraction
of edges that would be there if the network were a random network with the same degree
for each vertex.
Therefore, the comparison between real and expected edges is expressed by the
corresponding summand of the above formula. Figure 12.2 illustrates the meaning of
modularity.
The latter formula implicitly shows the definition of a community: a subnetwork is a
community if the number of edges inside it is larger than the expected number in
modularity’s null model.
If this is the case, the vertices of the subnetwork are more tightly connected than
expected. Large positive values of Q are expected to indicate good partitions.
The modularity of the whole network, taken as a single community, is zero.
Modularity is always smaller than one, and it can be negative as well.
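These properties can be checked by computing Q directly from the pairwise definition. The sketch below uses a toy graph of two triangles joined by a single edge; the adjacency-list encoding and function name are illustrative.

```python
# Computing Q from the pairwise definition: for each pair (i, j) in the
# same community, compare A_ij with the expected value k_i*k_j/(2m).
def modularity(adj, community):
    """adj: adjacency matrix as list of lists; community: label per vertex."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)  # 2m equals the sum of all degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:
                q += adj[i][j] - deg[i] * deg[j] / two_m
    return q / two_m

# Two triangles joined by one edge, split into their natural communities.
adj = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
print(round(modularity(adj, [0, 0, 0, 1, 1, 1]), 4))  # good partition: positive Q
print(round(modularity(adj, [0] * 6), 4))             # whole network: Q = 0.0
```

The second call illustrates the statement above: taking the whole network as one community always gives Q = 0, since the observed and expected edge counts coincide.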
Modularity has been employed as a quality function in many algorithms, and modularity
optimization is a popular method for community detection. However, there are some
caveats on the use of modularity.
Modularity values cannot be compared across different networks. Another thing to
remember is that modularity optimization may fail to identify communities smaller than
a certain scale, which depends on the total size of the network and on the degree of
interconnectedness of the communities; this is called the resolution limit.
3.5 Core Methods for Community Detection & Mining
There are naive methods for dividing given networks into subnetworks, such as graph
partitioning, hierarchical clustering, and k-means clustering. However, these methods
need the number of clusters or their sizes to be provided in advance.
It is desirable to devise methods that are able to extract complete information about
the community structure of networks.
The methods for detecting communities are roughly classified into the following
categories: (1) divisive algorithms, (2) modularity optimization, (3) spectral algorithms,
and (4) other algorithms.
All eigenvalues of L are real and non-negative, and L has a full set of n real and
orthogonal eigenvectors.
In order to minimize the above cut, vertices are partitioned based on the signs of the
eigenvector that corresponds to the second smallest eigenvalue of L.
In general, community detection based on repeated bipartitioning is relatively fast.
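A single spectral bipartition step can be sketched directly from the description above: build the Laplacian L = D − A, compute the eigenvector for the second smallest eigenvalue (the Fiedler vector), and split vertices by its sign. The sketch assumes NumPy is available; the toy graph is two triangles joined by one edge.

```python
# Sketch of spectral bipartitioning via the Fiedler vector of the Laplacian.
import numpy as np

def spectral_bipartition(adj):
    """adj: adjacency matrix. Returns the set of vertices with negative
    entries in the eigenvector of the second smallest eigenvalue of L."""
    A = np.asarray(adj, dtype=float)
    L = np.diag(A.sum(axis=1)) - A   # graph Laplacian L = D - A
    vals, vecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    fiedler = vecs[:, 1]             # eigenvector of the second smallest eigenvalue
    return {i for i, v in enumerate(fiedler) if v < 0}

adj = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
part = spectral_bipartition(adj)
print(part, set(range(6)) - part)  # the two triangles separate
```

Repeating this step on each resulting part gives the recursive bipartitioning scheme mentioned above.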
Newman proposes a method for maximizing modularity based on the spectral algorithm.
Fig. 16.3 The bibliography network for the book entitled “graph products: structure and
recognition”
Figure 16.4 provides the community structure as detected using a community mining
algorithm called ICS, in which 147 communities are uncovered from the bibliography
network.
As one would have expected, each community contains some papers and their
corresponding coauthors.
Fig. 16.4 The community structure of the bibliography network as detected using ICS
Solution 2
Methods which use the information already encoded in the partially labeled graph to help
us predict labels. This is based on the paradigm of machine learning and classification.
We identify some “features” of nodes that can be used to guide the classification:
o some features are based on simple graph properties
o other features are derived from properties of the nearby nodes
Two important phenomena can apply in online social networks:
o homophily: a link between individuals is correlated with those individuals being
similar in nature.
o co-citation regularity: similar individuals tend to refer or connect to the same
things.
A secondary aspect of the graph setting is that the classification process can be iterative.
We are given a graph G(V, E, W) with a subset of nodes Vl ⊂ V labeled, where V is
the set of n nodes in the graph and Vu = V \ Vl is the set of unlabeled nodes.
The task is to infer labels Y on all nodes V of the graph.
The output of the node classification problem is labels Ȳ on all nodes in V.
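The iterative character of graph-based classification can be sketched with a simple neighbour-majority scheme: each unlabeled node takes the majority label among its already-labeled neighbours, repeated until no node changes. This is only one simple instance of the paradigm; the graph, labels, and function name are illustrative.

```python
# Sketch of iterative node classification on a partially labeled graph:
# unlabeled nodes adopt the majority label of their labeled neighbours,
# and the process repeats until it stabilizes.
from collections import Counter

def propagate(neighbors, seed_labels, max_rounds=10):
    """neighbors: {node: [neighbour nodes]}; seed_labels: {node: label} for Vl."""
    labels = dict(seed_labels)
    for _ in range(max_rounds):
        changed = False
        for node in neighbors:
            if node in seed_labels:   # seed nodes keep their given labels
                continue
            votes = Counter(labels[n] for n in neighbors[node] if n in labels)
            if votes:
                top = votes.most_common(1)[0][0]
                if labels.get(node) != top:
                    labels[node] = top
                    changed = True
        if not changed:               # fixpoint reached
            break
    return labels

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(propagate(graph, {"a": "X", "d": "Y"}))
```

Node "c" has no labeled neighbour at the start of its own update in a strictly seed-only scheme; here it can already use the label learned for "b" in the same pass, which is the iterative aspect noted above.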
Naive Method
LINK-GROUP