
Mining and Analyzing Online Social Networks

Emilio Ferrara
emilio.ferrara@unime.it
Department of Mathematics
University of Messina - Italy
Introduction and Objectives

Social Network Analysis and Mining (SNAM) combines techniques from sociology, the social sciences, mathematics, statistics and computer science.

Objectives
- Analysis of the structure of a social network
- Analysis of large sub-networks and connected components
- Discovering nodes of particular interest
- Identifying communities within the network

Advantages
- Large-scale studies, impossible before, are feasible
- Data can be acquired automatically
- A huge amount of information is accessible online
- Data can be acquired at different levels of granularity

Limits
- Problems related to large-scale data mining
- Computational and algorithmic challenges
- Bias in the data should be investigated

Web Data Extraction

WDE Systems: Software platforms for the automatic and intelligent extraction of data from Web pages, in the form of static and/or dynamic content, in order to store them in a database (or other structured data source) and make them available to other applications.
Wrapper: An algorithmic procedure that extracts unstructured information from a data source (such as a Web page) and transforms it into a structured format.
Wrapper maintenance: Web pages change dynamically, not only in content but also in structure, without any forewarning. Web wrappers can stop working properly because of radical or even small changes in that structure.
Automatic Wrapper Adaptation: A novel smart approach to automate the maintenance work has been proposed.

WDE Platform Architecture

Intelligent Agent: A platform (software + architecture) that can autonomously take smart decisions to achieve a goal.
- Each Web wrapper is implemented as an Agent
- Several Agents populate the same environment
- If a Wrapper fails, it adapts itself to changes
- Results are collected in a way that is transparent to users

Figure: Web Data Mining platform architecture

Example of Wrapper Adaptation with CTM

Figure: An example of automatic adaptation to modifications. Elements matched by the original wrapper are identified even with structural differences, by applying the CTM algorithm.

Crawling Facebook with CTM

Figure: FBI Agent i) visits the page containing the friend list, ii) generates a Wrapper to extract the Name and ID of each friend, iii) inserts these data into the graph, and iv) processes the next profile.

Dataset: BFS and Uniform (August 2010)

        N. visited users   N. discovered users   N. edges
BFS     63.4K              8.21M                 12.58M
Uni     48.1K              7.69M                 7.84M

        Avg. deg.   Eigenvec.   Diameter   Clustering   Coverage   Density
BFS     396.8       68.93       8.75       0.0789       98.98%     0.626%
Uni     326.0       23.63       16.32      0.0471       94.96%     0.678%

Remarks: Our data contain the same information we would obtain by acquiring all the friendship relations among all the inhabitants of a middle-size town (e.g., 100k people).

Facebook Network Graph: Visual Results

Clustered Tree Matching

HTML: Web pages are represented as trees, whose nodes contain the elements displayed in the page.
XPath: A standard language for identifying elements within a Web page. Wrappers implement the XPath logic.

BFS Sampling

BFS (breadth-first search): starting from a seed, the graph is visited by exploring all neighbors in order of discovery.
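The BFS visit described here can be illustrated with a minimal sketch. It assumes a hypothetical `neighbors(user)` callable that returns the friend list of a profile (in the poster this role is played by the crawling agent) and a hypothetical `max_visited` budget that stops the visit early, which is what makes the sample incomplete. The three returned collections mirror the visited users, discovered users and edges reported for the dataset.

```python
from collections import deque


def bfs_sample(neighbors, seed, max_visited):
    """Visit at most max_visited profiles in breadth-first order.

    neighbors   -- hypothetical callable: user id -> iterable of friend ids
    seed        -- user id the visit starts from
    max_visited -- hypothetical budget on the number of visited profiles
    Returns (visited, discovered, edges).
    """
    visited = []              # profiles whose friend list was fetched
    discovered = {seed}       # every user id seen so far
    edges = set()             # undirected friendship ties
    queue = deque([seed])
    while queue and len(visited) < max_visited:
        user = queue.popleft()
        visited.append(user)
        for friend in neighbors(user):
            edges.add(frozenset((user, friend)))
            if friend not in discovered:
                discovered.add(friend)
                queue.append(friend)   # explored later, in order of discovery
    return visited, discovered, edges


# Toy friendship graph standing in for the real network.
toy = {1: [2, 3, 4], 2: [1, 5, 6], 3: [1, 7], 4: [1, 8],
       5: [2], 6: [2], 7: [3], 8: [4]}
visited, discovered, edges = bfs_sample(lambda u: toy.get(u, []), seed=1, max_visited=3)
```

With a budget of three visits the sketch fetches only three friend lists but discovers seven users, which shows on a small scale why the number of discovered users far exceeds the number of visited ones.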

Pros
- Optimal solution for unweighted and/or undirected graphs (such as Facebook and other OSNs)
- Intuitive implementation
Cons: Resulting samples are biased towards high-degree nodes in incomplete visits.
Challenge: Obtaining a sub-graph of the Facebook network which preserves the properties of the complete graph.

Figure: BFS (3rd sub-level). Node labels: 1 seed; 2-4 friends; 5-8 friends of friends; 9-12 friends of friends of friends.

Betweenness Centrality Results

Clustered Tree Matching key aspects (Ferrara, 2011)
- Inspired by Simple Tree Matching (STM) (Selkow, 1977)
- Assigns weights to evaluate the importance of matches
- Behaves differently for leaves and for middle-level nodes
- Introduces a degree of accuracy
- Identifies clusters of similar sub-trees
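Since CTM is described as inspired by Simple Tree Matching, a sketch of plain STM may help fix ideas. What follows is the usual dynamic-programming formulation of STM, not CTM itself: the CTM weighting scheme is not reproduced. The tuple-based tree encoding, the `similarity` normalization (matched nodes divided by the average tree size) and the `threshold` parameter are assumptions made for illustration only.

```python
def simple_tree_matching(a, b):
    """Size of the maximum top-down matching between two ordered trees.

    Trees are encoded as (label, [children]) tuples (an assumed encoding).
    """
    if a[0] != b[0]:
        return 0                      # roots differ: nothing below can match
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1             # +1 for the matched roots


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


def similarity(a, b):
    """Assumed normalization: matched nodes over the average tree size."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def matches(a, b, threshold=0.8):
    """Hypothetical acceptance test based on a minimum similarity."""
    return similarity(a, b) >= threshold


A = ("a", [("b", [("d", []), ("e", [])]), ("c", [("f", [])])])
B = ("a", [("b", [("d", [])]), ("c", [("f", []), ("g", [])])])
score = simple_tree_matching(A, B)
```

For these two toy trees the matching covers the nodes a, b, c, d and f, so `score` is 5 out of 6 nodes in each tree.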

Uniform Sampling
Uniform (rejection sampling): a list of random nodes to be
visited is generated.
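Rejection sampling can be sketched as follows. The sketch assumes that user IDs live in a known integer range (`id_space`) and that a hypothetical `user_exists(candidate)` check is available. Neither detail is stated on the poster, so both are illustrative only.

```python
import random


def uniform_sample(user_exists, id_space, n_samples, rng):
    """Draw up to n_samples distinct existing user ids uniformly at random.

    user_exists -- hypothetical callable: candidate id -> bool
    id_space    -- assumed size of the id range [0, id_space)
    rng         -- random.Random instance, passed in for reproducibility
    """
    accepted = []
    tried = set()
    # Stop early if the whole id range has been tried without enough hits.
    while len(accepted) < n_samples and len(tried) < id_space:
        candidate = rng.randrange(id_space)
        if candidate in tried:
            continue
        tried.add(candidate)
        if user_exists(candidate):    # reject ids that map to no profile
            accepted.append(candidate)
    return accepted


# Toy population: only multiples of 7 are valid ids.
existing = set(range(0, 1000, 7))
sample = uniform_sample(existing.__contains__, id_space=1000,
                        n_samples=20, rng=random.Random(0))
```

Because every candidate is drawn independently of the friendship ties, the accepted users do not depend on node degree, at the cost of many rejected draws when the ID space is sparsely populated.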
Pros
- Independent of the structural distribution of friendship ties
- Produces unbiased results
- Simple and efficient implementation
Cons: The resulting graph has disconnected components.
Challenge: Acquiring a uniform sub-graph with a huge connected component.

Figure: Uniform sampling

Figure: A and B are two similar trees. CTM assigns weights to matching nodes. Node f in A has weight 1/2 because in B it appears in a sub-tree with two children. Node h in B has weight 1/3 for the same reason.

Node Degree Distribution
- Power-law distribution, $P(k) \sim k^{-\gamma}$, $\gamma \le 3$
- This means that a relatively small number of users are highly connected to each other
- This distribution can be represented by a Complementary Cumulative Distribution Function
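As a small illustration of the Complementary Cumulative Distribution Function mentioned for the node degree distribution, the sketch below computes the empirical CCDF of a list of degrees. The convention P(K ≥ k) is an assumption (some authors use P(K > k)), and the degree values are toy data, not taken from the dataset.

```python
from collections import Counter


def ccdf(degrees):
    """Empirical CCDF as sorted (k, P(K >= k)) pairs."""
    total = len(degrees)
    counts = Counter(degrees)
    remaining = total                 # nodes with degree >= current k
    points = []
    for k in sorted(counts):
        points.append((k, remaining / total))
        remaining -= counts[k]
    return points


degrees = [1, 1, 1, 1, 2, 2, 3, 8]   # toy degree sequence
points = ccdf(degrees)
```

Plotted on log-log axes, a power-law degree distribution shows up in such a CCDF as an approximately straight line.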

Pros vs Cons

Advantages
- CTM produces an intrinsic measure of similarity (STM calculates only the number of different elements)
- A degree of accuracy can be introduced by establishing a minimum similarity threshold required to match two clusters
- The more complex and similar the structures of the compared trees are, the better the similarity calculated by CTM
Limits: CTM/STM do not produce perfect results when nodes are added/removed in middle levels.
Further considerations: Pure text can be matched using classic string matching techniques, e.g. Jaro-Winkler, bigrams, etc.

FBI Agent (Facebook Intelligent Agent)

Figure: FBI Agent Mining Architecture: Java application - Embedded Firefox browser / HTTP Interface → Facebook

Figure: FBI Agent Logic of Crawling: Diagram of the process of data extraction from Facebook

References

Ferrara E., Baumgartner R. Automatic Wrapper Adaptation by Tree Edit Distance Matching. In: Combinations of Intelligent Methods and Applications, Springer, 2011.
De Meo P., Ferrara E., Fiumara G. Finding similar users in Facebook. In: Social Networking and Community Behavior Modeling: Qualitative and Quantitative Measures, IGI Global, 2011.
Catanese S., De Meo P., Ferrara E., Fiumara G., Provetti A. Crawling Facebook for social network analysis purposes. In: Int. Conf. on Web Intelligence, Mining and Semantics, 2011.
Ferrara E., Baumgartner R. Design of automatically adaptable web wrappers. In: 3rd International Conference on Agents and Artificial Intelligence, 2011.
Catanese S., De Meo P., Ferrara E., Fiumara G. Analyzing the Facebook friendship graph. In: 1st International Workshop on Mining the Future Internet, 2010.
