Sna-Unit II Smart Material
6. Give the primary kinds of data which are often analyzed in the context of social networks.
A node-edge diagram is an intuitive way to visualize social networks. With the node-edge visualization, many network analysis tasks, such as component size calculation, centrality analysis, and pattern sketching, can be presented in a more straightforward manner.
11. What are the tools available to interactively manipulate matrix and node-link representations?
To interactively manipulate matrix and node-link representations, the following set of tools is provided:
Interactive specification of visual attributes
Interactive layout and reordering
Automatic layout and reordering techniques
Computer-assisted layout and reordering techniques
Interactive filtering
Interactive clustering
Overview+Detail techniques to navigate in both representations
Various visualization techniques and metaphors have been proposed to improve the analysis of social networks and enhance human-computer interaction. For example, in the 1950s, computational methods such as factor analysis and multidimensional scaling (MDS) were proposed to lay out nodes in social networks. Factor analysis was developed to reduce the number of nodes by mapping similar nodes into “factors”. MDS was further utilized to lay out nodes in 2D or 3D so that distances between pairs of nodes on the display correspond to distances between individuals in the data.
In the past decade, visualization technologies have been widely applied to social networks to
facilitate accessibility and interoperability through the platform of Web browsers. Currently,
many studies have attempted to visualize complex and multilayer node-edge relationships with
innovative metaphors and techniques. However, visualizing social networks with a large number
of nodes and complex relationships on the Internet is still a challenging issue.
In the era of Web 2.0, many forms of online sociality, including e-mail, instant personal messengers, blogs, and online social services, have produced composite social networks and become deeply involved in our social lives. Visualizing social networks thus plays an important role in accessing social networks and connecting people more effectively and efficiently. Many visualization applications have also been employed in different online social networks to help people manage their social relationships and access the abundant resources on the Internet.
Centrality
One of the key applications in social networks is to identify the most important or central nodes
in the network. The measure of centrality is thus used to give a rough indication of the social
power of a node based on how well it connects the network.
In the descriptions below, we outline the distinctions among the three popular individual centrality measures: degree centrality, betweenness centrality, and closeness centrality.
_ Degree centrality
Degree centrality is defined as the number of edges incident upon a node; it is thus usually the first measure used to identify the nodes most likely to influence other nodes.
_ Betweenness centrality
Betweenness centrality is another key metric for computing the extent to which a node lies
between other nodes in the network. If a node is the only node that links two groups of nodes in
the network, this node shall be seen as an important node for keeping the social network
together.
_ Closeness centrality
The measure of closeness centrality takes into account how distant a node is from the other nodes in the network. Hence, closeness centrality measures how near a node is to all other nodes in a social network by calculating the mean shortest path from that node to all other nodes in the graph.
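The degree and closeness measures above can be sketched in a few lines of pure Python on a hypothetical toy network (betweenness is omitted for brevity, since it requires counting shortest paths through each node; all node names and helper functions below are illustrative):

```python
from collections import deque

# Toy undirected graph as an adjacency list (hypothetical 5-node network).
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def degree_centrality(g):
    # Degree centrality: number of edges incident upon each node.
    return {node: len(neighbors) for node, neighbors in g.items()}

def shortest_path_lengths(g, source):
    # Breadth-first search gives shortest path lengths in an unweighted graph.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_centrality(g, node):
    # Closeness: reciprocal of the mean shortest path to all other nodes.
    dist = shortest_path_lengths(g, node)
    total = sum(d for n, d in dist.items() if n != node)
    return (len(g) - 1) / total

print(degree_centrality(graph))   # B, C, and D each have degree 3
print(closeness_centrality(graph, "B"))
```

On this toy network, node B has closeness 0.8 (it is one step from A, C, D and two steps from E), reflecting the intuition that central nodes are near everyone else.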
Clustering
Many social networks contain subsets of nodes that are highly connected within the subset and
have relatively few connections to nodes outside the subset. The nodes in such subsets are likely
to share some attributes and form their own communities.
_ Clustering coefficient
A clustering coefficient measures the degree to which nodes in a graph tend to cluster together. Thus, the clustering coefficient quantifies how close a node's neighbors are to forming a complete graph.
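The local clustering coefficient can be computed directly from an adjacency list; the graph below is a small illustrative example, not data from the text:

```python
# Local clustering coefficient: the fraction of pairs of a node's neighbors
# that are themselves connected (how close the neighborhood is to a clique).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def clustering_coefficient(g, node):
    neighbors = g[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbors; use 0 by convention
    # Count edges among the neighbors themselves (each counted twice, so halve).
    links = sum(1 for u in neighbors for v in g[u] if v in neighbors) // 2
    return 2 * links / (k * (k - 1))

print(clustering_coefficient(graph, "A"))  # one of three neighbor pairs linked
print(clustering_coefficient(graph, "B"))  # both neighbors linked: coefficient 1.0
```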
Random Layout
A random layout places the nodes at random geometric locations in the graph, and thus it may not yield very clear visualization results, particularly when the number of nodes grows large, e.g. beyond thousands of nodes.
Random graphs have been proposed as a possible model to take into account the structural
characteristics of instances that appear in many practical applications. The above figure shows a
random geographic layout.
Force-Directed Layout
A force-directed layout, also known as a spring layout, simulates the graph as a virtual physical system. In a force-directed layout, the edges act as springs and the nodes act as repelling objects, following Hooke's law and Coulomb's law respectively.
Generally, an initial random layout is produced first, and then the force-directed algorithm runs iteratively to adjust the positions of nodes until the repulsive forces between all graph nodes and the attractive forces between adjacent nodes converge. Since a force-directed layout may take hundreds of iterations to obtain a stable layout, the running time is at least O(N log N) or O(E), where N is the number of nodes and E is the number of edges.
Compared with a random layout, the running cost of a force-directed layout is much higher, especially when the number of nodes is large. It is therefore not suitable for graphs larger than hundreds of nodes.
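A heavily simplified force-directed iteration can be sketched in pure Python. The repulsion and attraction formulas below follow the common Fruchterman-Reingold style of spring layout; the constant k, the step cap, and the iteration count are arbitrary illustrative choices, not values from the text:

```python
import math
import random

# Minimal force-directed (spring) layout sketch: Coulomb-like repulsion
# between all node pairs, Hooke-like attraction along edges.
def spring_layout(nodes, edges, iterations=200, k=0.3, seed=1):
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}  # random initial layout
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                rep = k * k / d
                force[u][0] += rep * dx / d
                force[u][1] += rep * dy / d
                force[v][0] -= rep * dx / d
                force[v][1] -= rep * dy / d
        # Attraction along edges.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            att = d * d / k
            force[u][0] -= att * dx / d
            force[u][1] -= att * dy / d
            force[v][0] += att * dx / d
            force[v][1] += att * dy / d
        # Move each node a small, capped step along its net force.
        for n in nodes:
            fx, fy = force[n]
            f = math.hypot(fx, fy) or 1e-9
            step = min(0.05, f)
            pos[n][0] += fx / f * step
            pos[n][1] += fy / f * step
    return pos

pos = spring_layout(["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")])
```

After convergence on this path graph, adjacent nodes such as A and B should sit closer together than the two endpoints A and D, which is exactly the structure the layout is meant to reveal.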
Tree Layout
A basic tree layout chooses a node as the root of the tree, and the nodes connected to the root become children of the root node. Nodes that are further levels away from the root become grandchildren of the root, and so on.
A tree layout can display a more structural layout than general graph layouts by considering more contextual information. Because of the hierarchical nature of a tree layout, trees are easier for the human eye to grasp than general graphs. Drawing a tree layout is thus subject to more constraints than drawing a general graph, since tree structures are a special case of graphs.
For a better visual presentation of domain-specific information, more suitable variants of the tree layout were proposed, such as the hyperbolic tree layout and the radial tree layout. These tree visualizations utilize the idea of focus+context to improve the visualization effects with animation techniques and help users get global and local views of a social network in a 2D display.
A social network graph consists of nodes connected by edges; it can be transformed into a simple Boolean matrix whose rows and columns represent the vertices of the graph. Moreover,
the Boolean values in the matrix can be further replaced with valued attributes associated with
the edges to provide more informative network visualizations.
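The transformation described above can be sketched in a few lines of Python; the node names and edge list below form an illustrative toy network, not data from the text:

```python
# Build a Boolean adjacency matrix from an edge list of a small undirected
# network; rows and columns both index the vertices in a fixed order.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]

index = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1  # undirected graph: the matrix is symmetric

for row in matrix:
    print(row)
```

Replacing the 1s with edge weights (e.g. tie strength) gives the valued-attribute variant mentioned above.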
A matrix representation can help minimize the occlusion problems caused by the node-edge diagram; the matrix-based representation of graphs thus offers an alternative to traditional node-edge diagrams.
With a matrix-based representation, clusters and associations among the nodes can also be better
discovered when the number of nodes increases. Particularly, when the relationships are
complex, a matrix-based representation can effectively outperform a node-edge diagram in
readability since the high connectivity of a node-edge representation will easily diffuse the focus.
Fig. MatrixExplorer: initial order (left) and TSP order (right)
In 2006, an enhanced matrix-based representation, called MatrixExplorer, was developed to
visualize social networks with a Dual-Representation. MatrixExplorer can provide users with
two synchronized representations of the same network: matrix and node-edge. When a social
network is composed of highly interlaced edges, the matrix-based view can help users quickly
recognize the associations between nodes.
The above figure illustrates a matrix-based view of MatrixExplorer with an initial order on the left and a traveling salesman problem (TSP) order on the right. A reordered matrix can evidently help users find more clusters. A matrix-based visualization may not entirely replace a conventional node-edge diagram, yet it can compensate for the shortcomings of a node-edge diagram and improve social network visualization.
5. Discuss the various approaches to scale node-link diagrams to large networks with
several thousand or millions of nodes.
Node-link diagrams are the most commonly used representation of graphs and networks. This is well illustrated by Freeman in his survey and history of social network visualization. Freeman
presents a wide variety of social networks and demonstrates that visual representations are a
powerful tool to illustrate social network analysis concepts such as central actors or
communities.
Node-link representations are widely used and familiar to a very large audience, making them a
powerful communication tool. However, their readability and the message they convey greatly depend on the positions of their nodes.
Determining what makes a node-link diagram aesthetically pleasing, easy to read, or effective at conveying given findings is a difficult challenge. Since the 90s, an entire field of research has been devoted to the problem of graph drawing, i.e. designing algorithms to place nodes in space according to certain criteria, such as minimizing the number of links crossing each other.
A good introduction to graph drawing can be found in the book by Di Battista et al., which includes more than 300 algorithms to lay out graphs in 2D space. Additional state-of-the-art techniques to
draw and navigate in node-link diagrams can be found in Herman et al.
Researchers have performed a number of studies to identify which criteria are the most important for improving human understanding. However, the number of these criteria and their interactions with each other is so large that it is difficult to identify a core set and thus create an ideal layout algorithm.
Information visualization has a slightly different perspective on the topic. This field of research
focuses on visual exploration and the discovery or communication of insights about the data.
Different representations may help discover different insights in the data. Thus, information
visualization does not aim at the ideal representation but advocates for the use of multiple
representations and multiple perspectives on the data, supported by interactions to quickly
explore them.
6. Briefly explain the hybrid representation of visualization.
Providing both matrix and node-link diagrams to the user has a number of advantages but also
drawbacks.
It requires a large amount of display space.
At least two display monitors are required to comfortably use MatrixExplorer;
Switching from one representation to the other may induce a high cognitive load on the user.
Two hybrid representations were developed, namely:
1 AUGMENTING MATRICES
The principle of MatLink is to augment a standard matrix representation with links on its
borders. These links provide a dual encoding of the connections between actors. Two types of links are added to the representation:
static links (in white on the figure) and
interactive links (in a darker shade).
When a row or column is selected, these links show a shortest path to any other row or column
placed under the cursor.
Three techniques that provide users with effective tools to navigate in large matrices with
MatLink were listed below:
Mélange: folds the space between two far-away nodes as if it were a piece of paper, so users may see parts of the matrix that are far apart side by side.
Bring-and-go: brings the neighbors of an actor closer, as if their links were elastic; by moving the cursor over one of the neighbors and releasing the mouse, the view and the node travel to its previous location.
Link sliding: allows users to lock their cursor to a given link and travel very fast to its destination.
Interactive Exploration
NodeTrix provides a number of interactions based on the traditional drag-and-drop of objects with the mouse cursor to ease the creation, exploration, and editing of matrices. Matrix representations
have the advantage of placing actors of the network linearly (in rows and in columns), thus it
becomes easy to identify the community members connected to external actors. To add or
remove actors from the matrix, users simply select the node or row/column representing an actor
and drag it in or out of the matrix. Other interactions include the possibility to merge two
matrices or split them to get back to the original node-link representation.
Drawback:
It is impossible to place an actor in two different communities.
Presenting Findings:
NodeTrix can be used for both exploration and communication because matrices can be expanded to show detailed information on actors and connections, or collapsed to show higher-level connection patterns.
The most common kind of social network data can be modeled by a graph where the nodes represent individuals and the edges represent binary social relationships. (Less commonly, relationships may be represented using hyper-edges, i.e. higher-arity edges connecting multiple nodes.)
Additionally, social network studies build on attributes of nodes and edges, which can be formalized as functions operating on nodes or edges.
A number of different, proprietary formats exist for serializing such graphs and attribute data in machine-processable electronic documents.
The most commonly encountered formats are those used by the popular network analysis
packages Pajek and UCINET. These are text-based formats which have been designed in a way
so that they can be easily edited using simple text editors.
Unfortunately, the two formats are incompatible. Further, researchers in the social sciences often represent their data initially using Microsoft Excel spreadsheets, which can be exported in the simple CSV (Comma Separated Values) format.
The GraphML format represents an advancement over the previously mentioned formats
in terms of both interoperability and extensibility.
GraphML originates from the information visualization community where a shared
format greatly increases the usability of new visualization methods.
GraphML is therefore based on XML with a schema defined in XML Schema. This has
the advantage that GraphML files can be edited, stored, queried, transformed etc. using
generic XML tools.
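As a small illustration of the point about generic XML tools, the snippet below builds a minimal hypothetical GraphML document for a two-node network and queries it with Python's standard-library ElementTree; no graph-specific tooling is needed:

```python
import xml.etree.ElementTree as ET

# A minimal GraphML document (illustrative node ids, not from the text).
graphml = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="alice"/>
    <node id="bob"/>
    <edge source="alice" target="bob"/>
  </graph>
</graphml>"""

# Because GraphML is XML, any generic XML tool can query it; here we use
# namespace-aware XPath-style lookups from the standard library.
ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(graphml)
nodes = [n.get("id") for n in root.findall(".//g:node", ns)]
edges = [(e.get("source"), e.get("target")) for e in root.findall(".//g:edge", ns)]
print(nodes, edges)
```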
Common to all these generic graph representations is that they focus on the graph
structure, which is the primary input to network analysis and visualization.
Attribute data, when entered in electronic form, is typically stored separately from the network data in Excel sheets, databases, or SPSS tables.
8. Explain how clustering is performed with random walk based measures. Also
discuss the algorithms for computing proximity measures.
A random walk in synthesis:
Given an undirected graph and a starting point, select a neighbour at random
Move to the selected neighbour and repeat the same process until a termination condition is satisfied
The random sequence of points selected in this way is a random walk on the graph
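The procedure above can be sketched in a few lines of Python; the graph and starting node are illustrative, and the termination condition is simply a fixed step budget:

```python
import random

# A random walk on an undirected graph: start at a node, then repeatedly
# hop to a uniformly random neighbor until the step budget is spent.
def random_walk(graph, start, steps, seed=0):
    rng = random.Random(seed)
    walk = [start]
    current = start
    for _ in range(steps):
        current = rng.choice(graph[current])  # select a neighbour at random
        walk.append(current)
    return walk

graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
walk = random_walk(graph, "A", 10)
print(walk)
```

Every consecutive pair in the returned sequence is an edge of the graph, which is exactly the defining property of a random walk.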
In research:
Analyzing Wikipedia conflicts (PARC)
Natural language processing (CMU)
Bioinformatics (Maryland)
Particle physics (Nebraska)
Ocean climate simulation (Washington)
2. Cost-efficiency:
Commodity nodes (cheap, but unreliable)
Commodity network
Automatic fault-tolerance (fewer admins)
Easy to use (fewer programmers)
Challenges
Cheap nodes fail, especially if you have many
- Mean time between failures for 1 node = 3 years
- MTBF for 1000 nodes = 1 day
- Solution: Build fault-tolerance into system
Commodity network = low bandwidth
- Solution: Push computation to the data
Programming distributed systems is hard
- Solution: Users write data-parallel “map” and “reduce” functions, system handles work
distribution and faults
Hadoop Components:
Distributed file system (HDFS)
- Single namespace for entire cluster
- Replicates data 3x for fault-tolerance
MapReduce framework
- Executes user jobs specified as “map” and “reduce” functions
- Manages work distribution & fault-tolerance
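The map/reduce programming model described above can be sketched as a single-process Python simulation. This illustrates only the model (user-supplied "map" and "reduce" functions plus a shuffle phase grouping values by key), not Hadoop's distributed execution or fault tolerance; the word-count functions are the conventional teaching example:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map: emit a (key, value) pair per word.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: combine all values seen for one key.
    return (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group all mapped values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(x) for x in inputs):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = mapreduce(["the cat", "the dog"], map_fn, reduce_fn)
print(result)
```

In a real cluster the map calls and reduce calls would run on different machines over HDFS blocks; only the shuffle-by-key contract stays the same.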
Ranking is one of the most well-known methods in web search. Starting with the well-known PageRank algorithm for ranking web documents, the broad principle can also be applied to searching and ranking entities and actors in social networks.
The PageRank algorithm uses random walk techniques for the ranking process. The idea is that a random walk is performed on the network in order to estimate the probability of visiting each node. This probability is taken as the page rank. Clearly, nodes which are structurally well connected have a higher PageRank, and are also naturally of greater importance.
Random walk techniques can also be used in order to personalize the page-rank computation
process, by biasing the ranking towards particular kinds of nodes. In chapter 3, we present
methods for leveraging random walk techniques for a variety of ranking applications in social
networks.
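As a rough sketch of this random-walk ranking idea, the following pure-Python power iteration computes PageRank on a tiny illustrative directed graph. It assumes every node has at least one out-link (no dangling-node handling), and the damping factor 0.85 is the conventional choice:

```python
# Power-iteration PageRank sketch; `links` maps each node to its out-links.
def pagerank(links, damping=0.85, iterations=100):
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleportation term: random jump to any node with prob. 1 - damping.
        new = {node: (1 - damping) / n for node in nodes}
        # Each node distributes its rank equally over its out-links.
        for node, outs in links.items():
            share = rank[node] / len(outs)
            for target in outs:
                new[target] += damping * share
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print(rank)
```

Personalizing the ranking, as discussed above, amounts to biasing the teleportation term toward a preferred set of nodes instead of spreading it uniformly.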
Harmonic functions have been used for colorizing images, and for automated image segmentation. The colorization application involves adding color to a monochrome image or movie. An artist annotates the image with a few colored scribbles and the indicated color is propagated to produce a fully colored image. This can be viewed as a multi-class classification problem when the class (color) information is available for only a few pixels.
The segmentation example uses user-defined labels for different segments and quickly propagates the information to produce a high-quality segmentation of the image. All of the above examples rely on the same intuition: neighboring nodes in a graph should have similar labels.
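The intuition that neighboring nodes should have similar labels can be sketched with a minimal harmonic-function-style label propagation (a toy illustration, not the exact algorithm from the works cited above): labeled nodes are clamped, and each unlabeled node repeatedly takes the average value of its neighbors.

```python
# Iterative (Gauss-Seidel style) relaxation toward the harmonic solution.
def harmonic_labels(graph, labels, iterations=200):
    values = {n: labels.get(n, 0.5) for n in graph}  # unlabeled start at 0.5
    for _ in range(iterations):
        for n in graph:
            if n not in labels:  # only unlabeled nodes are updated
                values[n] = sum(values[m] for m in graph[n]) / len(graph[n])
    return values

# Path A-B-C-D with the two endpoints labeled 0 and 1.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
values = harmonic_labels(graph, {"A": 0.0, "D": 1.0})
print(values)  # B and C converge to 1/3 and 2/3, interpolating the labels
```

Thresholding the converged values at 0.5 then assigns each unlabeled node to the nearer class, which is the binary-classification reading of the harmonic function.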
Text Analysis
A collection of documents can be represented as a graph in many different ways based on the available information. We describe a few popular approaches below. Zhu et al. build a sparse graph using feature similarity between pairs of documents, and a subset of labels is then used to compute the harmonic function for document classification.
Another way to build a graph from documents in a publication database is to build an entity-relation graph from authors and papers, where papers are connected via citations and co-authorship. The ObjectRank algorithm computes personalized PageRank for keyword-specific ranking in such a graph built from a publication database. For keyword search, surfers start random walks from different entities containing that word.
A surfer either moves randomly to a neighboring node or jumps back to a node containing the keyword. The final ranking is based on the resulting probability distribution over the objects in the database. In essence, the personalized PageRank for each word is computed and stored offline, and at query time these are combined linearly to generate a keyword-specific ranking.
PART C
This approach replaces traditional shortest-path distances between nodes in a graph with hitting and commute times, and shows that standard clustering algorithms (e.g. K-means) produce much better results when applied to these re-weighted graphs. These techniques exploit the fact that commute times are robust to noise and provide a finer measure of cluster cohesion than the simple use of edge weight.
It also presents a general framework for using random walk based measures as separating operators which can be repeatedly applied to reveal cluster structure at different scales of granularity. It proposes using step probabilities, escape probabilities, and other variants to obtain edge separation, and shows how to use these operators as primitives for higher-level clustering algorithms such as multi-level and agglomerative clustering.
Different powers of the transition matrix P can also be used to obtain a clustering of the data. The authors estimate the number of steps and the number of clusters by optimizing spectral properties of P.
Random walks provide a natural way of clustering a graph. If a cluster has relatively few cross-edges compared to the number of edges inside it, then a random walk will tend to stay inside that cluster. Recently there has been interesting theoretical work [75, 5] on using random walk based approaches for computing good quality local graph partitions (clusters) near a given seed node. The main intuition is that a random walk started inside a good cluster will mostly stay inside the cluster.
A good-quality cluster has small conductance, resulting from a small number of cross-edges compared to the total number of edges. The smaller the conductance, the better the cluster quality. Hence 0 is a perfect score, achieved by a disconnected partition, whereas 1 is the worst score, achieved by a cluster with no intra-cluster edges. The conductance of a graph is defined as the minimum conductance over all subsets S of the set of nodes V.
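Conductance can be computed directly from an adjacency list. The sketch below uses one common variant of the definition (cut edges divided by the smaller of the two edge volumes) on a hypothetical graph of two triangles joined by a single bridge edge, which yields a low-conductance cut:

```python
# Conductance of a node subset S: cross-edges leaving S divided by the
# smaller of the edge volumes of S and its complement.
def conductance(graph, S):
    S = set(S)
    cross = sum(1 for u in S for v in graph[u] if v not in S)
    vol_S = sum(len(graph[u]) for u in S)
    vol_rest = sum(len(graph[u]) for u in graph if u not in S)
    return cross / min(vol_S, vol_rest)

# Two triangles joined by one bridge edge (C-D): a natural two-cluster graph.
graph = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
print(conductance(graph, {"A", "B", "C"}))  # low: only the bridge is cut
print(conductance(graph, {"A"}))            # 1.0: no intra-cluster edges
```

Cutting along the bridge gives conductance 1/7, while a singleton "cluster" scores the worst possible 1.0, matching the scoring described above.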
A formal algorithm to compute a low-conductance local partition near a seed node has been given. The algorithm propagates probability mass from the seed node and at any step rounds off small probabilities, leading to a sparse representation of the probability distribution. A local cluster is then obtained by making a sweep over this probability distribution. The running time is nearly linear in the size of the cluster the algorithm outputs. Later work improves upon this result by computing local cuts from personalized PageRank vectors from the predefined seed nodes.
2. Explain the following: a. Clustering b. Centrality
a. Clustering
Many social networks contain subsets of nodes that are highly connected within the subset and
have relatively few connections to nodes outside the subset. The nodes in such subsets are likely
to share some attributes and form their own communities.
Since the detection of these community structures is not trivial, how to efficiently and effectively
discover such community structures is important. Therefore, the main measure described below
is to help explore the grouping effects by clustering coefficient.
Clustering coefficient:
A clustering coefficient measures the degree to which nodes in a graph tend to cluster together. Thus, the clustering coefficient quantifies how close a node's neighbors are to forming a complete graph. As the nodes grouped in real-world social networks tend to have a relatively high density of ties, the clustering coefficient is also utilized for small-world analysis. The descriptions above give a quick overview of some important metrics used for social network analysis.
b. Centrality
One of the key applications in social networks is to identify the most important or central nodes
in the network. The measure of centrality is thus used to give a rough indication of the social
power of a node based on how well it connects the network. HITS and PageRank are the two most famous representatives using centrality for ranking. HITS identifies important nodes by calculating authorities (in-degrees) and hubs (out-degrees), while PageRank calculates node values based on out-degrees. In social network analysis, “Degree”, “Betweenness”, and “Closeness” centrality are the most popularly adopted methods to measure the centrality of a social network.
Degree centrality:
Degree centrality is defined as the number of edges incident upon a node; it is thus usually the first measure used to identify the nodes most likely to influence other nodes. For
calculating degree centrality, the nodes that have direct connections to a large number of nodes
are considered. If the edges in a graph are directed, the in-degree centrality is differentiated from
the out-degree centrality.
Betweenness centrality:
In addition to degree centrality, betweenness centrality is another key metric for computing the
extent to which a node lies between other nodes in the network. If a node is the only node that
links two groups of nodes in the network, this node shall be seen as an important node for
keeping the social network together.
Closeness centrality:
The measure of closeness centrality takes into account how distant a node is from the other nodes in the network. Hence, closeness centrality measures how near a node is to all other nodes in a social network by calculating the mean shortest path from that node to all other nodes in the graph.
3. Write short notes on Node-link diagrams.
Refer Question No:5 in PART B
Ideally, all users of all these services would agree to a single shared typology of social relations
and shared characterizations of relations. However, this is neither feasible nor necessary. What
is required from such a representation is that it is minimal in order to facilitate adoption and
that it should preserve key identifying characteristics such as the case of identifying properties
for social individuals.
Conceptual model
Social relations could be represented as n-ary predicates; however, n-ary relations are not
supported directly by the RDF/OWL languages. There are several alternatives to n-ary relations
in RDF/OWL.
In all cases dealing with n-ary relations we employ the technique that is known as reification:
we represent the relation as a class, whose instances are concrete relations of that type.
One may recall that RDF itself has a reified representation of statements: the rdf:Statement object represents the class of statements.
This class has three properties that correspond to the components of a statement, namely rdf:subject, rdf:predicate, and rdf:object.
These properties are used to link the statement instance to the resources involved in the
statement.
In other words, relationships become subclasses of the rdf:Statement class. What these representations have in common is that the new Relationship class is related to a general Parameter class by the hasParameter relationship.
Cognitive structuring works by applying the generic pattern we associate with such a
relationship to the actual state-of-affairs we observe. For example, a student/professor
relationship at the Free University of Amsterdam is defined by the social context of the
university and this kind of relationship may not be recognizable outside of the university.
The figure below shows the Descriptions and Situations ontology design pattern, which provides a model of context and allows one to clearly delineate these two layers of representation.