
UNIT II

MODELING AND VISUALIZATION


PART A

1. What is visualization of online social networks?


Visualization is a powerful technique for exploring social relationships within social
networks. With advances in computer graphics technologies, visualizations of social networks
have evolved from hand-drawn images to Web interfaces. Visualization
technologies have been widely applied to social networks to facilitate accessibility and
interoperability through web browsers.

2. Give the taxonomy of visualization.


There are many taxonomic reviews of visualization. Visualizations of structural data are
reviewed in terms of graph layout, navigation, interaction and distortion techniques.

3. Give examples for different types of visualization.


There are two types of visualization: node-edge diagrams and matrix representations.

4. State node density.


Node density: The density of an undirected graph can be defined as 2E / (N(N - 1)), where E is
the number of edges and N is the number of nodes. The density of a directed graph can be
defined as E / (N(N - 1)).
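
The two density formulas can be checked with a few lines of Python. This is only a minimal sketch; the function name `density` and its parameters are illustrative choices, not part of the original notes.

```python
def density(num_nodes: int, num_edges: int, directed: bool = False) -> float:
    """Return the fraction of possible edges that are actually present."""
    if num_nodes < 2:
        return 0.0  # no edge is possible between fewer than two nodes
    possible = num_nodes * (num_nodes - 1)  # ordered pairs of distinct nodes
    if directed:
        return num_edges / possible
    return 2 * num_edges / possible


# A triangle (3 nodes, 3 undirected edges) is a complete graph.
print(density(3, 3))                 # 1.0
print(density(4, 3, directed=True))  # 0.25
```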

5. What is clustering coefficient?


A clustering coefficient measures the degree to which nodes in a graph tend
to be clustered together. For a given node, the clustering coefficient quantifies how close its
neighbors are to being a complete graph. As the nodes grouped in real-world social networks
tend to have a relatively high density of ties, the clustering coefficient is also used for
small-world analysis.

6. Give the primary kinds of data which are often analyzed in the context of social
networks.
A node-edge diagram is an intuitive way to visualize social networks. With the node-edge
visualization, many network analysis tasks, such as component size calculation, centrality
analysis, and pattern sketching, can be better presented in a more straightforward manner.

7. List the three kinds of layouts.


The three kinds of layouts are random layout, force-directed layout, and tree layout.

8. Draw random geographic layout.


A random layout puts the nodes at random geometric locations in the graph, and thus it may
not yield very clear visualization results, particularly when the number of nodes grows
large, e.g. to thousands of nodes.
9. List out the advantages of matrices.
A matrix presentation can help minimize the occlusion problems caused by the node-edge
diagram, so the matrix-based representation of graphs offers an alternative to the traditional
node-edge diagrams. With a matrix-based representation, clusters and associations among the nodes
can also be better discovered when the number of nodes increases.

10. Mention the advantages of node-link diagrams


The principle of node-link diagrams is to graphically represent actors of the network by nodes
and connections by links. Node-link representations are widely used and familiar to a very large
audience, making them a powerful communication tool. Their readability and the message they
convey greatly depend on the positions of their nodes.

11. What are the tools available for interactively manipulating matrix and node-link
representations?
To interactively manipulate matrix and node-link representations, the following set of tools is
provided:
- Interactive specification of visual attributes
- Interactive layout and reordering
- Automatic layout and reordering techniques
- Computer-assisted layout and reordering techniques
- Interactive filtering
- Interactive clustering
- Overview+detail techniques to navigate in both representations

12. What is meant by hybrid visualization?


Hybrid visualizations provide both matrix and node-link diagrams to the user, which has a
number of advantages. There are two hybrid representations: MatLink and NodeTrix. The goal of
these hybrids is to augment one representation to overcome its drawbacks and enrich it with the
advantages of the other one.
13. What is triangulation?
Triangulation is an important strategy for identifying and adjusting for discrepancies and biases
in one or more of the samples. Triangulation is also used to improve the confidence in
inferences based on Social Network Analysis.

14. Give the key problems in aggregating social network data.


The two key problems in aggregating social network data are the identification and
disambiguation of social individuals and the aggregation of information about social
relationships.

15. Mention the principal reasons behind ontological representation of social
relationships.
Ontological representations of social networks such as FOAF need to be extended with a
framework for modelling and characterizing social relationships for two principal reasons: (1) to
support the automated integration of social information on a semantic basis and (2) to capture
established concepts in Social Network Analysis.

16. What is smushing?


Smushing can be considered an optimization task in which one tries to optimize an information
retrieval or clustering-based measure. Smushing is “easier” than ontology mapping, where
one-to-one mappings are enforced.

17. What are the benefits of random walk?


Random walks provide a simple framework for unifying the information from ensembles of
paths between two nodes. The ensemble of paths between two nodes is an essential ingredient of
many popular measures such as personalized page rank, hitting and commute times, the Katz
measure, and harmonic functions.

18. Mention the use of Map Reduce.


Map Reduce is used for
1. Index building
2. Article clustering
3. Statistical machine translation
4. Spam detection
5. Data mining
6. Ad optimization

19. What is the advantage of FOAF in terms of sharing FOAF data?


FOAF (Friend-of-a-Friend) was proposed to visualize human-centric social relationships
based on Semantic Web social metadata. With the XML/RDF format, FOAF relations can be
explicitly defined for further social network analysis and visualization.

20. List the most commonly discussed characteristics of social relationships.


Some of the most commonly discussed characteristics of social relationships are:
Sign, Strength, Provenance, Relationship history, Relationship roles
PART B

1. What is visualization? Explain Social Network visualization on the Web.


Visualization is a powerful technique for exploring social relationships within social
networks. With advances in computer graphics technologies, visualizations of social networks
have evolved from hand-drawn images to Web interfaces. Visualization
technologies have been widely applied to social networks to facilitate accessibility and
interoperability through web browsers.

Various visualization techniques and metaphors were proposed to improve the analysis of social
networks and enhance human-computer interaction. For example, in the 1950s, computational
methods, such as factor analysis and multidimensional scaling (MDS), were proposed to lay out
nodes in social networks. Factor analysis was developed to reduce the number of nodes by
mapping similar nodes into “factors”. MDS was further utilized to lay out nodes in 2D or 3D
so that distances between pairs of nodes on the display correspond to distances between
individuals in the data.

With the evolution of computer technologies and visualization techniques, machine-drawn images
and screen-oriented graphics were developed to visualize social networks with richer
visual components and interactions. Although many visualization techniques have been
discussed, such as displaying fine graph layouts, coloring, and presenting clear node-edge
relations, visualizing complex relations is still a challenge for social network visualization.

In the recent decade, visualization technologies have been widely applied to social networks to
facilitate accessibility and interoperability through the platform of Web browsers. Currently,
many studies have attempted to visualize complex and multilayer node-edge relationships with
innovative metaphors and techniques. However, visualizing social networks with a large number
of nodes and complex relationships on the Internet is still a challenging issue.

In the era of Web 2.0, many forms of online sociality, including e-mail, instant personal
messengers, blogs, and online social services, produced composite social networks and greatly
involved our social lives. Visualizing social networks thus plays an important role of accessing
social networks and connecting people in a more effective and efficient way. Many visualization
applications have been also employed in different online social networks to help people
manipulate their social relationships and access the abundant resources on the Internet.

2. Discuss the taxonomy of visualizations of social networks.


Social networks are built when actors belonging to different social groups are connected to each
other. Since the late 1800s and early 1900s, sociologists have been looking for social relationships,
groups, positions, and organizations in human activities. Social network analysis is still an active
topic that attracts much attention, particularly for analyzing online social networks. Some
important metrics for network analysis are described below to give a brief overview of social
network analysis.
Graph Theory
Many fundamental concepts and metrics in social network analysis are derived from graph
theory, because graph theory formally represents social networks with structural properties.
Node degree:
In graph theory, the degree of a node in a graph is the number of edges incident to the node. If
there are loops in the graph, each loop is counted twice in the degree of its node.
Node density:
The density of an undirected graph can be defined as 2E / (N(N - 1)), where E is the number
of edges and N is the number of nodes. The density of a directed graph can be defined as
E / (N(N - 1)).
Path length:
The path length is the number of edges in the sequence that a walk follows. In a path, all nodes
and edges appear only once in the sequence.
Component size:
The component size is the number of nodes in a connected component, so the connected
components of the graph need to be discovered first.
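
As a rough illustration of node degree and component size, the sketch below counts degrees from an edge list (a loop adds 2 to its node) and finds component sizes with a breadth-first search. The function names and the toy graph are assumptions made for this example only.

```python
from collections import defaultdict, deque


def degrees(edges):
    """Count edge endpoints per node; a loop (u == v) adds 2 to that node."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def component_sizes(nodes, edges):
    """Return the sizes of all connected components, largest first."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        sizes.append(size)
    return sorted(sizes, reverse=True)


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "c"), ("d", "e")]
print(degrees(edges))                 # {'a': 1, 'b': 2, 'c': 3, 'd': 1, 'e': 1}
print(component_sizes(nodes, edges))  # [3, 2]
```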

Centrality
One of the key applications in social networks is to identify the most important or central nodes
in the network. The measure of centrality is thus used to give a rough indication of the social
power of a node based on how well they connect the network.
In the descriptions below, we depict the distinction between the three popular individual
centrality measures: degree centrality, betweenness centrality, and closeness centrality.
- Degree centrality
Degree centrality is defined as the number of edges incident upon a node, and thus it is usually
the first way to identify the nodes that have the most potential to influence other nodes.
- Betweenness centrality
Betweenness centrality is another key metric, measuring the extent to which a node lies
between other nodes in the network. If a node is the only node that links two groups of nodes in
the network, this node is seen as an important node for keeping the social network
together.
- Closeness centrality
Closeness centrality takes into account how distant a node is from the other
nodes in the network. It measures how near a node is to all other nodes in a social network
by calculating the mean shortest path from the node
to all other nodes in the graph.
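
Two of the three measures are easy to sketch in plain Python. The example below assumes a connected, unweighted, undirected graph stored as an adjacency dictionary, and it reports closeness as the reciprocal of the mean shortest-path length (a common convention, assumed here so that larger means more central). Betweenness is left out because it needs shortest-path counting between all pairs.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def degree_centrality(adj):
    """Number of edges incident upon each node."""
    return {n: len(nbs) for n, nbs in adj.items()}


def closeness_centrality(adj):
    """Reciprocal of the mean shortest-path length to all other nodes."""
    result = {}
    for n in adj:
        dist = bfs_distances(adj, n)
        others = [d for m, d in dist.items() if m != n]
        result[n] = len(others) / sum(others)
    return result


# A star: "b" is linked to every other node.
adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
print(degree_centrality(adj)["b"])     # 3
print(closeness_centrality(adj)["b"])  # 1.0
```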

Clustering
Many social networks contain subsets of nodes that are highly connected within the subset and
have relatively few connections to nodes outside the subset. The nodes in such subsets are likely
to share some attributes and form their own communities.
- Clustering coefficient
A clustering coefficient measures the degree to which nodes in a graph tend
to be clustered together. For a given node, the clustering coefficient quantifies how close its
neighbors are to being a complete graph.
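
A minimal sketch of the local clustering coefficient follows, assuming an undirected graph stored as an adjacency dictionary of sets. The coefficient of a node is taken as the number of links among its neighbors divided by the number of links a complete graph on those neighbors would have; the function name is an illustrative choice.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to check
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(adj, "b"))  # 1.0
print(local_clustering(adj, "d"))  # 0.0
```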

3. Explain the Node-edge diagrams to visualize social networks.


A node-edge diagram is an intuitive way to visualize social networks. With the node-edge
visualization, many network analysis tasks, such as component size calculation, centrality
analysis, and pattern sketching, can be better presented in a more straightforward manner.
Three kinds of layouts, namely, random layout, force-directed layout, and tree layout, are
described to explain the node-edge diagrams.

Random Layout
A random layout puts the nodes at random geometric locations in the graph, and thus it may
not yield very clear visualization results, particularly when the number of nodes grows
large, e.g. to thousands of nodes.
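
A random layout takes only a few lines. The sketch below assumes a rectangular drawing area and a seeded generator for reproducibility; the names `random_layout`, `width` and `height` are illustrative.

```python
import random


def random_layout(nodes, width=1.0, height=1.0, seed=None):
    """Assign each node an independent uniform position in the drawing area."""
    rng = random.Random(seed)
    return {n: (rng.uniform(0, width), rng.uniform(0, height)) for n in nodes}


positions = random_layout(["a", "b", "c"], seed=42)
for node, (x, y) in positions.items():
    print(node, round(x, 3), round(y, 3))
```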

Random graphs have been proposed as a possible model to take into account the structural
characteristics of instances that appear in many practical applications. The above figure shows a
random geographic layout.

Force-Directed Layout
A force-directed layout is also known as a spring layout, which simulates the graph as a virtual
physical system. In a force-directed layout, the edges act as springs and the nodes act as repelling
objects, following Hooke’s law and Coulomb’s law respectively.

Generally, an initial random layout is produced first, and then the force-directed algorithm
runs iteratively to adjust the positions of the nodes until the repulsive forces between nodes
and the attractive forces between adjacent nodes converge. Since a force-directed layout may take
hundreds of iterations to obtain a stable layout, the running time is at least O(N log N) or O(E),
where N is the number of nodes and E is the number of edges.

The running cost of a force-directed layout is much higher than
that of a random layout, especially when the number of nodes is large. It is therefore not suitable
for graphs larger than hundreds of nodes.
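
The iteration described above can be sketched as follows. This is a simplified illustration, not a specific published algorithm: every pair of nodes repels with a Coulomb-like force, every edge pulls like a Hooke spring, and each move is capped for stability. All constants (`spring_k`, `repulsion_k`, `step`, `max_move`) are assumed values, and the naive all-pairs repulsion costs O(N^2) per iteration.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, rest_length=1.0,
                          spring_k=0.1, repulsion_k=0.1, step=0.1,
                          max_move=0.2, seed=0):
    """Start from a random layout, then move nodes along the net force."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Coulomb-like repulsion between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion_k / dist ** 2
                force[u][0] += push * dx / dist
                force[u][1] += push * dy / dist
                force[v][0] -= push * dx / dist
                force[v][1] -= push * dy / dist
        # Hooke-like spring force along every edge.
        for u, v in edges:
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = spring_k * (dist - rest_length)
            force[u][0] += pull * dx / dist
            force[u][1] += pull * dy / dist
            force[v][0] -= pull * dx / dist
            force[v][1] -= pull * dy / dist
        # Move each node, capping the displacement per iteration.
        for n in nodes:
            mx, my = step * force[n][0], step * force[n][1]
            length = math.hypot(mx, my)
            if length > max_move:
                mx, my = mx * max_move / length, my * max_move / length
            pos[n][0] += mx
            pos[n][1] += my
    return {n: tuple(p) for n, p in pos.items()}


layout = force_directed_layout(["a", "b"], [("a", "b")])
gap = math.hypot(layout["a"][0] - layout["b"][0],
                 layout["a"][1] - layout["b"][1])
print(round(gap, 2))  # distance at which spring pull and repulsion balance
```

With these constants, two linked nodes settle slightly farther apart than the spring rest length, because the repulsion keeps pushing them outward.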
Tree Layout
A basic tree layout is to choose a node as the root of tree, and the nodes connected to the root
become children of the root node. Nodes that are at more levels away from the root become the
grand-children of the root and so on.
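
A basic tree layout can be sketched with a breadth-first traversal from the chosen root: the depth of a node gives its vertical level, and nodes on the same level are spread evenly from left to right. The coordinate convention (x in (0, 1), y equal to minus the depth) is an assumption made for this example.

```python
from collections import deque


def tree_layout(adj, root):
    """Place each node at (even horizontal slot within its level, -depth)."""
    depth = {root: 0}
    order = [root]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in sorted(adj[node]):
            if nb not in depth:  # first visit: nb becomes a child of node
                depth[nb] = depth[node] + 1
                order.append(nb)
                queue.append(nb)
    levels = {}
    for n in order:
        levels.setdefault(depth[n], []).append(n)
    pos = {}
    for d, members in levels.items():
        for i, n in enumerate(members):
            pos[n] = ((i + 1) / (len(members) + 1), -d)
    return pos


adj = {"r": {"a", "b"}, "a": {"r", "c"}, "b": {"r"}, "c": {"a"}}
pos = tree_layout(adj, "r")
print(pos["r"])  # (0.5, 0)
print(pos["c"])  # (0.5, -2)
```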

A tree layout can display a more structured layout than general graph layouts by considering more
contextual information. Because of the hierarchical nature of a tree layout, trees are easier
for the human eye to grasp than general graphs. Drawing a tree layout is also subject to more
constraints than drawing a general graph, since tree structures are a special case of graphs.

For a better visual presentation of domain-specific information, more suitable variants of the tree
layout were proposed, such as the hyperbolic tree layout and the radial tree layout. These tree
visualizations use the idea of focus+context to improve the visualization with animation
techniques and help users get global and local views of a social network in a 2D display.

4. Explain how to visualize social networks with matrix-based representation. Also
discuss the pros and cons of matrix-based representation.

A social network graph consists of nodes connected by edges, so it can be transformed into a
simple Boolean matrix whose rows and columns represent the vertices of the graph. Moreover,
the Boolean values in the matrix can be further replaced with valued attributes associated with
the edges to provide more informative network visualizations.
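
The transformation into a matrix can be sketched as below for an undirected graph. The optional `weights` dictionary, an assumed parameter keyed by edge, replaces the Boolean entries with valued attributes.

```python
def adjacency_matrix(nodes, edges, weights=None):
    """Build a symmetric matrix; entry [i][j] is 1 (or a weight) per edge."""
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for u, v in edges:
        value = 1 if weights is None else weights[(u, v)]
        matrix[index[u]][index[v]] = value
        matrix[index[v]][index[u]] = value  # undirected: mirror the entry
    return matrix


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
for row in adjacency_matrix(nodes, edges):
    print(row)
# [0, 1, 0]
# [1, 0, 1]
# [0, 1, 0]
```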

A matrix presentation can help minimize the occlusion problems caused by the node-edge
diagram, so the matrix-based representation of graphs offers an alternative to the traditional
node-edge diagrams.

With a matrix-based representation, clusters and associations among the nodes can also be better
discovered when the number of nodes increases. Particularly, when the relationships are
complex, a matrix-based representation can effectively outperform a node-edge diagram in
readability since the high connectivity of a node-edge representation will easily diffuse the focus.
Fig. MatrixExplorer: initial order (left) and TSP order (right)
In 2006, an enhanced matrix-based representation, called MatrixExplorer, was developed to
visualize social networks with a dual representation. MatrixExplorer provides users with
two synchronized representations of the same network: matrix and node-edge. When a social
network is composed of highly interlaced edges, the matrix-based view can help users quickly
recognize the associations between nodes.

The above figure illustrates a matrix-based view of MatrixExplorer with an initial order on the left
and a traveling salesman problem (TSP) order on the right. A reordered matrix can evidently
help users find more clusters. A matrix-based visualization may not entirely replace a
conventional node-edge diagram, yet it can compensate for the shortcomings of a node-edge
diagram and so improve social network visualization.

5. Discuss the various approaches to scale node-link diagrams to large networks with
several thousand or millions of nodes.

Node-link diagrams are the most commonly used representation of graphs and networks. It is
well illustrated by Freeman in his survey and history of social network visualization. Freeman
presents a wide variety of social networks and demonstrates that visual representations are a
powerful tool to illustrate social network analysis concepts such as central actors or
communities.

Node-link representations are widely used and familiar to a very large audience, making them a
powerful communication tool. However, their readability and the message they convey greatly
depend on the positions of their nodes.

Determining what makes a node-link diagram aesthetically pleasing, easy to read or good at conveying
given findings is a difficult challenge. Since the 90s, an entire field of research has been devoted to the
problem of graph drawing, i.e. designing algorithms to place nodes in space according to
certain criteria such as minimizing the number of link crossings.

A good introduction to graph drawing can be found in the book of Di Battista et al. including
more than 300 algorithms to layout graphs in 2D space. Additional state-of-the-art techniques to
draw and navigate in node-link diagrams can be found in Herman et al.

Researchers performed a number of studies to identify which criteria are the most important to
improve human understanding. However, the number of these criteria and their interaction with
each other is so large that it is difficult to identify a core set and thus create the ideal layout
algorithm.

Information visualization has a slightly different perspective on the topic. This field of research
focuses on visual exploration and the discovery or communication of insights about the data.
Different representations may help discover different insights in the data. Thus, information
visualization does not aim at the ideal representation but advocates for the use of multiple
representations and multiple perspectives on the data, supported by interactions to quickly
explore them.
6. Briefly explain the hybrid representation of visualization.
Providing both matrix and node-link diagrams to the user has a number of advantages but also
drawbacks:
- It requires a large amount of display space. At least two display monitors are required to
comfortably use MatrixExplorer.
- Switching from one representation to the other may induce a high cognitive load on the user.

Two hybrid representations were developed, namely MatLink and NodeTrix.

1 AUGMENTING MATRICES
The principle of MatLink is to augment a standard matrix representation with links on its
borders. These links provide a dual encoding of the connections between actors. Two types of
links are added to the representation: static links (in white on the figure) and interactive
links (in a darker shade). When a row or column is selected, the interactive links show a
shortest path to any other row or column placed under the cursor.

Assessing the Readability of MatLink

The readability of MatLink was assessed on specific tasks of social network analysis: finding a
cut point, finding a clique and finding communities (strongly connected groups). MatLink
significantly improves on standard matrix representations. The only task for which node-link
diagrams still perform better is the identification of cut points; with MatLink, this task
requires identifying specific visual patterns of the links.

Using MatLink for Navigating in the Matrix

To improve the readability of matrices, MatLink supports navigation. Since matrices display actors in
rows and columns, they require far more space than node-link diagrams to represent a network.
In MatLink, all links connected to a given actor are displayed when this actor is selected. Thus,
direct visual feedback is provided on the number of neighbors, and the curvature of the links
provides an indication of their distance in the matrix.

Three techniques that provide users with effective tools to navigate in large matrices with
MatLink are listed below:

- Melange: folds the space between two far-away nodes as if it were a piece of paper, so users
may see parts of the matrix that are far apart side by side.
- Bring-and-go: brings the neighbors of an actor closer as if their links were elastic. By moving
the cursor over one of the neighbors and releasing the mouse, the view and the node travel to its
previous location.
- Link Sliding: allows users to lock their cursor to a given link and travel very fast to its
destination.

2 MERGING MATRIX AND NODE-LINK DIAGRAM


Node-link diagrams or matrices perform differently according to the types of visualized
networks. NodeTrix is a hybrid visualization merging node-link diagrams and matrices. The
principle of NodeTrix is to represent the global network as a node-link diagram and the locally
dense subparts as matrices.

Interactive Exploration
NodeTrix provides a number of interactions based on traditional drag-and-drop of objects with
the mouse cursor to ease the creation, exploration and editing of matrices. Matrix representations
have the advantage of placing actors of the network linearly (in rows and in columns), so it
becomes easy to identify the community members connected to external actors. To add or
remove actors from the matrix, users simply select the node or row/column representing an actor
and drag it in or out of the matrix. Other interactions include the possibility to merge two
matrices or split them to get back to the original node-link representation.

Drawback:
It is impossible to place an actor in two different communities.

Presenting Findings:
NodeTrix can be used for both exploration and communication because matrices can be
expanded showing detailed information on actors and connections showing higher-level
connection patterns.

7. Brief the concept of modeling and aggregating social network data.

The most common kind of social network data can be modeled by a graph where the nodes
represent individuals and the edges represent binary social relationships. (Less commonly,
higher-arity relationships may be represented using hyper-edges, i.e. edges connecting multiple
nodes.)
Additionally, social network studies build on attributes of nodes and edges, which can be
formalized as functions operating on nodes or edges.
A number of different, proprietary formats exist for serializing such graphs and attribute data in
machine-processable electronic documents.
The most commonly encountered formats are those used by the popular network analysis
packages Pajek and UCINET. These are text-based formats which have been designed in a way
so that they can be easily edited using simple text editors.
Unfortunately, the two formats are incompatible. Further, researchers in the social sciences often
represent their data initially using Microsoft Excel spreadsheets, which can be exported in the
simple CSV (Comma Separated Values) format.

The GraphML format represents an advancement over the previously mentioned formats
in terms of both interoperability and extensibility.
GraphML originates from the information visualization community, where a shared
format greatly increases the usability of new visualization methods.
GraphML is therefore based on XML with a schema defined in XML Schema. This has
the advantage that GraphML files can be edited, stored, queried, transformed etc. using
generic XML tools.
Common to all these generic graph representations is that they focus on the graph
structure, which is the primary input to network analysis and visualization.
Attribute data, when entered in electronic form, is typically stored separately from network
data in Excel sheets, databases or SPSS tables.
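
Because GraphML is plain XML, a generic XML library is enough to produce it. The sketch below uses Python's standard `xml.etree.ElementTree` to write only the graph structure (the `graphml`, `graph`, `node` and `edge` elements); attribute data and schema validation are left out, and the helper name `to_graphml` is an illustrative choice.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(nodes, edges):
    """Serialize an undirected graph structure as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")
    for n in nodes:
        ET.SubElement(graph, "node", id=n)
    for u, v in edges:
        ET.SubElement(graph, "edge", source=u, target=v)
    return ET.tostring(root, encoding="unicode")


doc = to_graphml(["n0", "n1", "n2"], [("n0", "n1"), ("n1", "n2")])
print(doc)
```

The resulting string can be read back with any XML parser, which is the interoperability advantage the text describes.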

8. Explain how clustering is performed with random walk based measures. Also
discuss the algorithms for computing proximity measures.
A random walk in summary:
Given an undirected graph and a starting point, select a neighbour at random.
Move to the selected neighbour and repeat the same process until a termination condition is
met.
The random sequence of points selected in this way is a random walk on the graph.
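
The steps above can be sketched as follows, assuming the graph is an adjacency dictionary and the termination condition is simply a fixed number of steps (an assumption made for this example).

```python
import random


def random_walk(adj, start, steps, seed=None):
    """Return the sequence of nodes visited by a simple random walk."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(steps):
        neighbours = sorted(adj[walk[-1]])
        if not neighbours:
            break  # isolated node: the walk cannot continue
        walk.append(rng.choice(neighbours))
    return walk


adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
walk = random_walk(adj, "a", steps=10, seed=1)
print(walk)
```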

Important parameters of random walk:

Access time or hitting time: Hij is the expected number of steps before node j is visited,
starting from node i.
Commute time: the expected number of steps to go from i to j and back to i: Hij + Hji.
Cover time: starting from a node/distribution, the expected number of steps to reach
every node.
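
Hitting times satisfy a simple recurrence: the hitting time of the target itself is 0, and for any other node i it is 1 plus the average hitting time of the neighbours of i. The sketch below solves this recurrence by repeated sweeps on a small connected graph; the iterative solver and its tolerance are choices made for this example, not a method prescribed by the notes.

```python
def hitting_times(adj, target, sweeps=10_000, tol=1e-12):
    """Expected number of steps to first reach `target` from every node."""
    h = {n: 0.0 for n in adj}
    for _ in range(sweeps):
        change = 0.0
        for n in adj:
            if n == target:
                continue  # the hitting time of the target itself stays 0
            new = 1.0 + sum(h[m] for m in adj[n]) / len(adj[n])
            change = max(change, abs(new - h[n]))
            h[n] = new
        if change < tol:
            break
    return h


# Path graph 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
to_2 = hitting_times(adj, 2)
to_0 = hitting_times(adj, 0)
commute = to_2[0] + to_0[2]  # H02 + H20
print(round(to_2[0], 6), round(commute, 6))  # 4.0 8.0
```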

Applications of Random Walks on Graphs


Ranking Web Pages
HITS on citation network
Clustering using random walk

1 USE OF HADOOP AND MAP REDUCE


Map reduce
Data-parallel programming model for clusters of commodity machines
Pioneered by Google
- Processes 20 PB of data per day
Popularized by open-source Hadoop project
- Used by Yahoo!, Facebook, Amazon, …

Map Reduce used for


At Google:
1. Index building for Google Search
2. Article clustering for Google News
3. Statistical machine translation
At Yahoo!:
1. Index building for Yahoo! Search
2. Spam detection for Yahoo! Mail
At Facebook:
1. Data mining
2. Ad optimization
3. Spam detection

In research:
Analyzing Wikipedia conflicts (PARC)
Natural language processing (CMU)
Bioinformatics (Maryland)
Particle physics (Nebraska)
Ocean climate simulation (Washington)

Map Reduce Goals


1. Scalability to large data volumes:
Scan 100 TB on 1 node @ 50 MB/s = 24 days
Scan on 1000-node cluster = 35 minutes

2. Cost-efficiency:
Commodity nodes (cheap, but unreliable)
Commodity network
Automatic fault-tolerance (fewer admins)
Easy to use (fewer programmers)
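
The scalability figures above can be reproduced with a short calculation. The sketch assumes binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and perfect parallel speed-up on the cluster; both are assumptions, chosen because they match the quoted numbers.

```python
TB = 2 ** 40  # bytes, binary units assumed
MB = 2 ** 20

scan_seconds = 100 * TB / (50 * MB)      # one node reading at 50 MB/s
days_on_one_node = scan_seconds / 86_400
minutes_on_cluster = scan_seconds / 1000 / 60  # 1000 nodes, ideal speed-up

print(round(days_on_one_node, 1))    # 24.3
print(round(minutes_on_cluster, 1))  # 35.0
```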

TYPICAL HADOOP CLUSTER:

40 nodes/rack, 1000-4000 nodes in cluster
1 Gbps bandwidth in rack, 8 Gbps out of rack
Node specs (Yahoo! terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)

Challenges
Cheap nodes fail, especially if you have many
- Mean time between failures for 1 node = 3 years
- MTBF for 1000 nodes = 1 day
- Solution: Build fault-tolerance into system
Commodity network = low bandwidth
- Solution: Push computation to the data
Programming distributed systems is hard
- Solution: Users write data-parallel “map” and “reduce” functions, system handles work
distribution and faults

Hadoop Components:
Distributed file system (HDFS)
- Single namespace for entire cluster
- Replicates data 3x for fault-tolerance
MapReduce framework
- Executes user jobs specified as “map” and “reduce” functions
- Manages work distribution & fault-tolerance
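
The division of labour between “map” and “reduce” functions can be illustrated with a single-process sketch. This is not the Hadoop API: `run_mapreduce` is a hypothetical stand-in for the framework, which in a real cluster would also distribute the work and handle faults. The example job counts the degree of each node from an edge list.

```python
from collections import defaultdict


def map_edge(edge):
    """Map: emit (node, 1) for both endpoints of an edge."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)


def reduce_counts(key, values):
    """Reduce: sum all the counts emitted for one node."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle" step
    return dict(reducer(k, vs) for k, vs in groups.items())


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(run_mapreduce(edges, map_edge, reduce_counts))
# {'a': 2, 'b': 2, 'c': 3, 'd': 1}
```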

9. Describe random walk and their application.

Ranking is one of the most well known methods in web search. Starting with the well known
page-rank algorithm for ranking web documents, the broad principle can also be applied for
searching and ranking entities and actors in social networks.

The page-rank algorithm uses random walk techniques for the ranking process. The idea is that a
random walk approach is used on the network in order to estimate the probability of visiting each
node. This probability is estimated as the page rank. Clearly, nodes which are structurally well
connected have a higher page-rank, and are also naturally of greater importance.

Random walk techniques can also be used in order to personalize the page-rank computation
process, by biasing the ranking towards particular kinds of nodes. In chapter 3, we present
methods for leveraging random walk techniques for a variety of ranking applications in social
networks.
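
A minimal sketch of page-rank by power iteration is given below, with an optional personalization vector that biases the random jumps towards chosen nodes. The damping factor 0.85 and the fixed iteration count are assumed values, and the function name is illustrative.

```python
def pagerank(adj, damping=0.85, personalization=None, iterations=100):
    """Estimate the visiting probability of each node of a random surfer."""
    nodes = list(adj)
    if personalization is None:
        teleport = {v: 1.0 / len(nodes) for v in nodes}
    else:
        total = sum(personalization.values())
        teleport = {v: personalization.get(v, 0.0) / total for v in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # With probability 1 - damping the surfer jumps to a teleport node.
        new = {v: (1 - damping) * teleport[v] for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:  # otherwise follow one outgoing link at random
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:    # dangling node: spread its mass over teleport nodes
                for w in nodes:
                    new[w] += damping * rank[v] * teleport[w]
        rank = new
    return rank


adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
plain = pagerank(adj)
biased = pagerank(adj, personalization={"a": 1.0})
print(max(plain, key=plain.get))  # hub
print(biased["a"] > biased["b"])  # True
```

The well-connected hub receives the highest rank, and biasing the jumps towards node "a" raises its rank above the otherwise equivalent leaves.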

Application in Computer Vision

A common technique in computer vision is to use a graph representation of an image frame,
where two neighboring pixels share a strong connection if they have similar color, intensity or
texture.
Gorelick et al. use the average hitting time of a random walk from an object boundary to
characterize object shape from silhouettes. Grady et al. introduced a novel graph clustering
algorithm which was shown to have an interpretation in terms of random walks. Hitting times
from all nodes to a designated node were thresholded to produce partitions with various
beneficial theoretical properties. Qiu et al. have used commute-time clustering for robust
multibody motion tracking and image segmentation.

Harmonic functions have been used for colorizing images, and for automated image
segmentation. The colorization application involves adding color to a monochrome image or
movie. An artist annotates the image with a few colored scribbles and the indicated color is
propagated to produce a fully colored image. This can be viewed as a multi-class classification
problem when the class (color) information is available for only a few pixels.

The segmentation example uses user-defined labels for different segments and quickly propagates
the information to produce high-quality segmentation of the image. All of the above examples
rely on the same intuition: neighboring nodes in a graph should have similar labels.
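
The shared intuition can be sketched as a harmonic function on a graph: labeled nodes keep their values, and every unlabeled node repeatedly takes the average of its neighbors until nothing changes. The example below is an unweighted toy version with two classes encoded as 0.0 and 1.0; real image applications would use pixel-similarity weights, which are omitted here.

```python
def harmonic_labels(adj, labeled, sweeps=1000, tol=1e-10):
    """Fill in unlabeled nodes so each equals the mean of its neighbors."""
    value = {n: labeled.get(n, 0.5) for n in adj}
    for _ in range(sweeps):
        change = 0.0
        for n in adj:
            if n in labeled:
                continue  # known labels stay fixed
            new = sum(value[m] for m in adj[n]) / len(adj[n])
            change = max(change, abs(new - value[n]))
            value[n] = new
        if change < tol:
            break
    return value


# Path graph 0 - 1 - 2 - 3 - 4 with the two ends labeled.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
values = harmonic_labels(adj, {0: 0.0, 4: 1.0})
print([round(values[n], 3) for n in range(5)])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```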

Text Analysis

A collection of documents can be represented in graph in many different ways based on the
available information. We will describe a few popular approaches. Zhu et al. build a sparse graph
using feature similarity between pairs of documents, and then a subset of labels are used to
compute the harmonic function for document classification.

Another way to build a graph from documents in a publication database is to build an
entity-relation graph from authors and papers, where papers are connected via citations and co-authors.
The ObjectRank algorithm computes personalized pagerank for keyword-specific ranking in
such a graph built from a publication database. For keyword search, surfers start random walks
from different entities containing that word.

Any surfer either moves randomly to a neighboring node or jumps back to a node containing the
keyword. The final ranking is done based on the resulting probability distribution on the objects
in the database. In essence, the personalized pagerank for each word is computed and stored
offline, and at query time combined linearly to generate keyword-specific ranking.
PART C

1. With a neat sketch, explain the types of clustering.


Random walks provide a natural way of examining the graph structure. They are popular for clustering
applications as well. Spectral clustering is a body of algorithms that clusters datapoints xi using
eigenvectors of a matrix derived from the affinity matrix constructed from the data. Each
datapoint is associated with a node in a graph. The weight on a link between two nodes i and j, i.e.
Aij, is obtained from a measure of similarity between the two datapoints.

Global graph clustering.

It replace traditional shortest-path distances between nodes in a graph by hitting and commute
times and show that standard clustering algorithms (e.g. K-means) produce much better results
when applied to these re-weighted graphs. These techniques exploit the fact that commute times
are robust to noise and provide a finer measure of cluster cohesion thansimple use of edge
weight.

Another line of work presents a general framework for using random walk based measures as
separating operators, which can be applied repeatedly to reveal cluster structure at different
scales of granularity. It proposes using step probabilities, escape probabilities and other
variants to obtain edge separation, and shows how to use these operators as a primitive for other
high-level clustering algorithms such as multi-level and agglomerative clustering.

A further approach uses different powers of the transition matrix P to obtain a clustering of the
data. The authors estimate the number of steps and the number of clusters by optimizing spectral
properties of P.
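
As a rough illustration of why powers of P reveal clusters, the following Python sketch builds an affinity matrix from one-dimensional datapoints, row-normalizes it into a transition matrix P, and computes P^t. The Gaussian similarity, the bandwidth sigma = 1, the sample points and t = 8 are assumptions chosen for the example, not values taken from any particular paper.

```python
import math


def affinity_matrix(points, sigma=1.0):
    """A[i][j] = Gaussian similarity between datapoints i and j (no self-loops)."""
    n = len(points)
    return [
        [
            0.0 if i == j
            else math.exp(-((points[i] - points[j]) ** 2) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def transition_matrix(A):
    """Row-normalize the affinity matrix: P[i][j] = A[i][j] / sum_k A[i][k]."""
    return [[a / sum(row) for a in row] for row in A]


def mat_mul(X, Y):
    n = len(X)
    return [
        [sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(P, t):
    """P^t: entry (i, j) is the probability of being at j after t steps from i."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result


# Two well-separated groups of datapoints (hypothetical data).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
P = transition_matrix(affinity_matrix(points))
P8 = mat_power(P, 8)

# Probability that a walk started at point 0 is still in its own group after 8 steps.
stay_in_cluster = sum(P8[0][:3])
```

For such well-separated groups almost all of the probability mass stays inside the starting group, so rows of P^t that belong to the same cluster look alike.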

Local Graph Clustering.

Random walks provide a natural way of clustering a graph. If a cluster has relatively few
cross-edges compared to the number of edges inside it, then a random walk will tend to stay
inside that cluster. Recently there has been interesting theoretical work [75, 5] on using random
walk based approaches to compute a good-quality local graph partition (cluster) near a given seed
node. The main intuition is that a random walk started inside a good cluster will mostly stay
inside the cluster.

A good-quality cluster has small conductance, resulting from a small number of cross-edges
compared to the total number of edges. The smaller the conductance, the better the cluster
quality. Hence 0 is the perfect score, obtained for a disconnected partition, whereas 1 is the
worst score, obtained for a cluster with no intra-cluster edges. The conductance of a graph is
defined as the minimum conductance over all subsets S of the set of nodes V.
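
A small Python sketch can make the two extreme scores concrete. It assumes the common volume-normalized definition, conductance(S) = cut(S) / min(vol(S), vol(V \ S)), where vol is the sum of degrees; the example graph (two triangles joined by one edge) is hypothetical.

```python
def conductance(graph, cluster):
    """Conductance of a node set in an undirected adjacency-list graph.

    Assumes cluster is a non-empty proper subset with edges on both sides.
    """
    cluster = set(cluster)
    # Cross-edges: edges with exactly one endpoint inside the cluster.
    cut = sum(1 for u in cluster for v in graph[u] if v not in cluster)
    vol_in = sum(len(graph[u]) for u in cluster)
    vol_out = sum(len(graph[u]) for u in graph) - vol_in
    return cut / min(vol_in, vol_out)


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3.
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}

good = conductance(graph, {0, 1, 2})  # one cross-edge, volume 7 on each side
bad = conductance(graph, {0})         # a single node has no intra-cluster edges
```

Here the triangle scores 1/7, while the single-node "cluster" gets the worst score of 1 because every one of its edges is a cross-edge.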

The formal algorithm to compute a low-conductance local partition near a seed node is given. The
algorithm propagates probability mass from the seed node and at every step rounds off the small
probabilities, leading to a sparse representation of the probability distribution. A local cluster
is then obtained by making a sweep over this probability distribution. The running time is nearly
linear in the size of the cluster it outputs. Later work improves upon this result by computing
local cuts from personalized PageRank vectors started from the predefined seed nodes.
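
The propagate, round and sweep steps can be sketched in Python as follows. This is a simplified illustration in the spirit of push-based approximate personalized PageRank followed by a sweep cut, not the exact algorithm from the cited work; the parameters alpha and eps, the lazy-walk push rule and the toy graph are assumptions.

```python
def approximate_ppr(graph, seed, alpha=0.15, eps=1e-4):
    """Push probability mass out from the seed; residuals below eps * degree
    are left unpushed, which keeps the resulting vector sparse."""
    p, r = {}, {seed: 1.0}
    pending = [seed]
    while pending:
        u = pending.pop()
        degree = len(graph[u])
        residual = r.get(u, 0.0)
        if residual < eps * degree:
            continue  # too little mass here: round it away
        p[u] = p.get(u, 0.0) + alpha * residual
        r[u] = (1.0 - alpha) * residual / 2.0
        pending.append(u)
        for v in graph[u]:
            r[v] = r.get(v, 0.0) + (1.0 - alpha) * residual / (2.0 * degree)
            pending.append(v)
    return p


def sweep_cut(graph, p):
    """Order nodes by p[u] / degree(u) and return the prefix of lowest conductance."""
    order = sorted(p, key=lambda u: p[u] / len(graph[u]), reverse=True)
    total_volume = sum(len(neighbors) for neighbors in graph.values())
    best_cluster, best_phi = None, float("inf")
    cluster, volume, cut = set(), 0, 0
    for u in order:
        # Edges from u into the cluster stop being cut edges;
        # the remaining edges of u become cut edges.
        inside = sum(1 for v in graph[u] if v in cluster)
        cut += len(graph[u]) - 2 * inside
        volume += len(graph[u])
        cluster.add(u)
        if volume == total_volume:
            break  # the whole graph is not a valid cut
        phi = cut / min(volume, total_volume - volume)
        if phi < best_phi:
            best_cluster, best_phi = set(cluster), phi
    return best_cluster, best_phi


# Two triangles joined by the edge 2-3 (hypothetical graph); seed node 0.
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}
cluster, phi = sweep_cut(graph, approximate_ppr(graph, seed=0))
```

On this graph the sweep recovers the seed's own triangle, whose conductance is 1/7.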
2. Explain the following: a. Clustering b. Centrality

a. Clustering
Many social networks contain subsets of nodes that are highly connected within the subset and
have relatively few connections to nodes outside the subset. The nodes in such subsets are likely
to share some attributes and form their own communities.
Detecting these community structures is not trivial, so discovering them efficiently and
effectively is important. The main measure described below, the clustering coefficient, helps to
explore such grouping effects.

Clustering coefficient:

The clustering coefficient measures the degree to which nodes in a graph tend to cluster
together. For a given node, it quantifies how close the node's neighbors are to forming a
complete graph. Because groups of nodes in real-world social networks tend to have a relatively
high density of ties, the clustering coefficient is also used in small-world analysis. The
descriptions above give a quick overview of some important metrics used for social network
analysis.
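
As a minimal sketch, the local clustering coefficient of a node with k neighbors can be computed as 2 * (links among the neighbors) / (k * (k - 1)). The convention of returning 0 for nodes with fewer than two neighbors and the small example graph are assumptions.

```python
from itertools import combinations


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected.

    `graph` maps each node to the set of its neighbors (undirected graph).
    """
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than 2 neighbors
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical friendship graph: 0, 1, 2 form a triangle; 3 is attached to 0 only.
graph = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0},
}

c0 = clustering_coefficient(graph, 0)  # 1 of 3 neighbor pairs is linked
c1 = clustering_coefficient(graph, 1)  # the neighbors 0 and 2 are linked
```

A value of 1 means the neighbors form a complete graph, and a value of 0 means no two neighbors know each other.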

b. Centrality
One of the key applications in social networks is to identify the most important or central nodes
in the network. The measure of centrality is thus used to give a rough indication of the social
power of a node based on how well it connects the network. HITS and PageRank are the two most
famous representatives that use centrality for ranking. HITS identifies important nodes by
calculating authority scores (driven by incoming links) and hub scores (driven by outgoing
links), while PageRank scores a node by its incoming links, weighting each link by the rank of
its source divided by that source's out-degree. In social network analysis, “Degree”,
“Betweenness”, and “Closeness” centrality are the most widely adopted measures of centrality.

Degree centrality:
Degree centrality is defined as the number of edges incident upon a node, so it is usually the
first measure used to find the nodes with the greatest potential to influence other nodes. Nodes
that have direct connections to a large number of other nodes score highest. If the edges in a
graph are directed, in-degree centrality is distinguished from out-degree centrality.
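
A short Python sketch of in-degree and out-degree counting on a directed edge list; the edge list and the node names are hypothetical.

```python
from collections import Counter


def degree_centrality(edges):
    """Count incoming and outgoing edges per node for directed (source, target) pairs."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


# Hypothetical "follows" relation: (follower, followed).
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("dan", "cat")]
in_degree, out_degree = degree_centrality(edges)

# Node with the highest in-degree and its count.
most_followed = in_degree.most_common(1)[0]
```

In this example "cat" has the highest in-degree, while "ann" has the highest out-degree.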

Betweenness centrality:
In addition to degree centrality, betweenness centrality is another key metric: it measures the
extent to which a node lies between other nodes in the network. If a node is the only node that
links two groups of nodes in the network, it is an important node for keeping the social network
together.
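
The bridging idea can be illustrated with a compact Python sketch of shortest-path betweenness for an unweighted, undirected graph, using Brandes-style dependency accumulation. The unnormalized scoring (each unordered pair counted once) and the example graph are assumptions.

```python
from collections import deque


def betweenness_centrality(graph):
    """For each node, sum over node pairs the fraction of shortest paths through it."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: value / 2.0 for v, value in score.items()}


# Node "c" is the only link between the groups {a, b} and {d, e}.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d", "e"],
    "d": ["c", "e"],
    "e": ["c", "d"],
}
scores = betweenness_centrality(graph)
```

Every shortest path between the two groups passes through "c", so it receives the whole score while all other nodes score zero.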

Closeness centrality:
Closeness centrality takes into account how distant a node is from the other nodes in the
network. It measures how near a node is to all other nodes in a social network by calculating the
mean shortest-path length from that node to all other nodes in the graph.
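
A minimal Python sketch, assuming an unweighted, connected, undirected graph and defining closeness as the reciprocal of the mean shortest-path length (one common convention); the path graph used as an example is hypothetical.

```python
from collections import deque


def closeness_centrality(graph, node):
    """Reciprocal of the mean shortest-path length from `node` to all other nodes.

    Assumes the graph is connected and has at least two nodes.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    mean_distance = sum(dist.values()) / (len(dist) - 1)
    return 1.0 / mean_distance


# Path graph a - b - c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

middle = closeness_centrality(graph, "b")  # both other nodes are 1 step away
end = closeness_centrality(graph, "a")     # distances 1 and 2, mean 1.5
```

The middle node is closer on average to everyone else, so it gets the higher closeness value.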
3. Write short notes on Node-link diagrams.

Refer to Question No. 5 in PART B.

4. Discuss the applications of random walks approach.

Refer to Question No. 9 in PART B.

5. Briefly explain the use of Hadoop and Map Reduce.

Refer to Question No. 8 in PART B.

6. Briefly explain the ontological representation of social individuals.


(i) The Friend-of-a-Friend (FOAF) ontology that we use in our work is an OWL-based format for
representing personal information.
(ii) FOAF started as an experiment with Semantic Web technology.
(iii) The idea of FOAF was to provide a machine-processable format for representing the kind of
information that made the original Web successful, namely the kind of personal information
described in the homepages of individuals.
(iv) Thus FOAF has a vocabulary for describing personal attribute information typically found
on homepages, such as the name and email address of the individual, projects, interests, links to
work and school homepages etc.
(v) FOAF profiles contain a description of the individual's friends, using the same vocabulary
that is used to describe the individual himself.
(vi) FOAF became the center point of interest in 2003 with the spread of Social Networking
Services such as Friendster, Orkut, LinkedIn etc.
Drawbacks of these centralized services:
1. The information is under the control of the database owner.
2. Centralized systems do not allow users to control the information they
provide on their own terms.
(vii) In contrast, FOAF profiles are created and controlled by the individual user and shared in
a distributed fashion. FOAF profiles are typically posted on the personal website of the
user and linked from the user's home page with the HTML META tag.
(viii) An advantage of FOAF in terms of sharing FOAF data is the relative stability of
the ontology. The number of FOAF users means that the maintainers of the ontology
are obliged to keep the vocabulary and its semantics stable.
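
To make points (iv) and (v) concrete, the Python sketch below assembles a small FOAF profile in Turtle syntax. The FOAF terms used (foaf:Person, foaf:name, foaf:mbox, foaf:homepage, foaf:knows) belong to the FOAF vocabulary; the people, the example.org addresses and the helper function names are hypothetical.

```python
FOAF_NS = "http://xmlns.com/foaf/0.1/"


def describe_person(person):
    """Turtle property list for one person (the same vocabulary for owner and friends)."""
    parts = ["a foaf:Person", f'foaf:name "{person["name"]}"']
    if "mbox" in person:
        parts.append(f'foaf:mbox <mailto:{person["mbox"]}>')
    if "homepage" in person:
        parts.append(f'foaf:homepage <{person["homepage"]}>')
    return parts


def foaf_profile(uri, person, friends):
    """Build a Turtle document describing `person` and the friends they know."""
    parts = describe_person(person)
    for friend in friends:
        # Friends are described as blank nodes with the same properties.
        parts.append("foaf:knows [ " + " ; ".join(describe_person(friend)) + " ]")
    body = " ;\n    ".join(parts)
    return f"@prefix foaf: <{FOAF_NS}> .\n\n<{uri}>\n    {body} .\n"


# Hypothetical profile owner and friends.
profile = foaf_profile(
    "http://example.org/alice#me",
    {
        "name": "Alice Example",
        "mbox": "alice@example.org",
        "homepage": "http://example.org/alice",
    },
    [
        {"name": "Bob Example", "mbox": "bob@example.org"},
        {"name": "Carol Example"},
    ],
)
print(profile)
```

Each friend is described with the same properties as the profile owner, which is what allows profiles published by different users to be linked into one network.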

7. Write about Ontological representation of social relationships.

Ontological representations of social networks such as FOAF need to be extended with a
framework for modeling and characterizing social relationships for two principal reasons:
(1) To support the automated integration of social information on a semantic basis, and
(2) To capture established concepts in Social Network Analysis.

Characteristics of social relationships

• Sign: A relationship can represent both positive and negative attitudes, such as like or
hate. The positive or negative charge of relationships is the subject of balance theory.
• Strength: Tie strength is itself a complex construct of several characteristics of social
relations, including: frequency/frequent contact, reciprocity, trust/enforceable trust,
complementarity, accommodation/adaptation, indebtedness/imbalance, collaboration,
transaction investments, strong history, fungible skills, expectations and social capital.
• Provenance: A social relationship may be viewed differently by the individual
participants of the relationship, sometimes even to the degree that the tie is
unreciprocated. Similarly, outsiders may provide different accounts of the relationship,
which is a well-known bias in SNA.
• Relationship history: Social relationships come into existence through some event involving
two individuals.
• Relationship roles: A social relationship may have a number of social roles associated
with it, which we call relationship roles. For example, in a student/professor relationship
within a university setting there is one individual playing the role of professor, while
another individual is playing the role of a student. Both the relationship and the roles may
be limited in their interpretation and use to a certain social context.

Ideally, all users of all these services would agree on a single shared typology of social
relations and shared characterizations of relations. However, this is neither feasible nor
necessary. What is required from such a representation is that it is minimal, in order to
facilitate adoption, and that it preserves key identifying characteristics, as in the case of
identifying properties for social individuals.

Conceptual model
Social relations could be represented as n-ary predicates; however, n-ary relations are not
supported directly by the RDF/OWL languages. There are several alternatives to n-ary relations
in RDF/OWL.
In all cases dealing with n-ary relations we employ the technique known as reification:
we represent the relation as a class whose instances are concrete relations of that type.
One may recall that RDF itself has a reified representation of statements: the rdf:Statement
object represents the class of statements.
This class has three properties that correspond to the components of a statement, namely
rdf:subject, rdf:predicate and rdf:object.
These properties are used to link the statement instance to the resources involved in the
statement.
In other words, relationships become subclasses of the rdf:Statement class. What the
alternatives have in common is that the new Relationship class is related to a general Parameter
class by the hasParameter relationship.
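
The reification pattern can be sketched in Python by writing one relationship instance out as plain (subject, predicate, object) triples. The rdf: terms are part of the RDF vocabulary; the ex: names (ex:Friendship, ex:hasParameter, ex:Strength, ex:Sign), the identifiers and the parameter values are hypothetical and serve only to illustrate the hasParameter link described above.

```python
def reify_relationship(rel_id, rel_class, subject, predicate, obj, parameters):
    """Represent a relationship as an instance with subject/predicate/object links.

    `parameters` maps a (hypothetical) Parameter class name to a value, e.g.
    the strength or sign of the tie.
    """
    triples = [
        (rel_id, "rdf:type", rel_class),  # rel_class is assumed to subclass rdf:Statement
        (rel_id, "rdf:subject", subject),
        (rel_id, "rdf:predicate", predicate),
        (rel_id, "rdf:object", obj),
    ]
    for index, (param_class, value) in enumerate(sorted(parameters.items())):
        param_id = f"{rel_id}_param{index}"
        triples.append((rel_id, "ex:hasParameter", param_id))
        triples.append((param_id, "rdf:type", param_class))
        triples.append((param_id, "rdf:value", value))
    return triples


triples = reify_relationship(
    rel_id="ex:friendship1",
    rel_class="ex:Friendship",
    subject="ex:alice",
    predicate="foaf:knows",
    obj="ex:bob",
    parameters={"ex:Strength": "0.8", "ex:Sign": "positive"},
)
for triple in triples:
    print(triple)
```

Because the relationship is now a resource of its own, characteristics such as sign or strength can be attached to it, which a plain binary foaf:knows statement cannot carry.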
Cognitive structuring works by applying the generic pattern we associate with such a
relationship to the actual state of affairs we observe. For example, a student/professor
relationship at the Free University of Amsterdam is defined by the social context of the
university, and this kind of relationship may not be recognizable outside of the university.

The figure below shows the Descriptions and Situations ontology design pattern, which provides a
model of context and allows these two layers of representation to be clearly delineated.
