Poster Surf
Emilio Ferrara
emilio.ferrara@unime.it
Department of Mathematics
University of Messina - Italy
Introduction and Objectives

HTML: Web pages are represented as trees, whose nodes contain the elements displayed in the page.
XPath: A standard language defined to identify elements within a Web page. Wrappers implement the XPath logic.

Example of Wrapper Adaptation with CTM

Figure: A and B are two similar trees. CTM assigns weights to matching nodes. Node f in A has weight 1/2 because in B it appears in a sub-tree with two children. Node h in B has weight 1/3 for the same reason.

Advantages
- CTM produces an intrinsic measure of similarity (STM calculates only the number of different elements)
- It is possible to introduce a degree of accuracy by establishing a minimum similarity threshold, required to match two clusters
- The more complex and similar the structures of the compared trees are, the better the similarity calculated by CTM
Limits: CTM/STM do not produce perfect results when nodes are added or removed at middle levels.
Further considerations: Pure text can be matched using classic string matching techniques, e.g. Jaro-Winkler, bigrams, etc.

Crawling Facebook with CTM

BFS Sampling
BFS (breadth-first search): starting from a seed, the graph is visited by exploring all the neighbors in the order in which they are discovered.

Uniform Sampling
Uniform (rejection sampling): a list of random nodes to be visited is generated.
Pros
- Independent w.r.t. the structural distribution of friendship ties
- Produces unbiased results
- Simple and efficient implementation
Cons: The resulting graph has disconnected components.
Challenge: Acquiring a uniform sub-graph with a huge connected component.

Figure: Uniform sampling

Node Degree Distribution
- Power-law distribution, P(k) ∼ k^{-γ}, γ ≤ 3
- This implies the existence of a relatively small number of users who are highly connected to each other
- This distribution can be represented by a Complementary Cumulative Distribution Function

Figure: FBI Agent Mining Architecture: Java application - Embedded Firefox browser / HTTP Interface → Facebook

Figure: FBI Agent Logic of Crawling: Diagram of the process of data extraction from Facebook

References
- Ferrara E., Baumgartner R. Automatic Wrapper Adaptation by Tree Edit Distance Matching. In: Combinations of Intelligent Methods and Applications, Springer, 2011.
- De Meo P., Ferrara E., Fiumara G. Finding similar users in Facebook. In: Social Networking and Community Behavior Modeling: Qualitative and Quantitative Measures, IGI Global, 2011.
- Catanese S., De Meo P., Ferrara E., Fiumara G., Provetti A. Crawling Facebook for social network analysis purposes. In: Int. Conf. on Web Intelligence, Mining and Semantics, 2011.
- Ferrara E., Baumgartner R. Design of automatically adaptable web wrappers. In: 3rd International Conference on Agents and Artificial Intelligence, 2011.
- Catanese S., De Meo P., Ferrara E., Fiumara G. Analyzing the Facebook friendship graph. In: 1st International Workshop on Mining the Future Internet, 2010.
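
Sketch: XPath over a page tree. The introduction describes Web pages as trees and XPath as the language wrappers use to pick out elements. The following is a minimal illustration of that idea, not the poster's wrapper. The page snippet, the `friends` class and the function name `extract_friends` are all hypothetical, and the standard-library parser used here only accepts well-formed markup and a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (real HTML usually needs a
# more tolerant parser before XPath can be applied).
PAGE = """<html><body>
  <div id="profile">
    <h1>Alice</h1>
    <ul class="friends"><li>Bob</li><li>Carol</li></ul>
  </div>
</body></html>"""


def extract_friends(page_source):
    """Return the text of every <li> under the list marked class='friends'."""
    root = ET.fromstring(page_source)
    # The XPath expression plays the role of the wrapper: it names a
    # location in the tree rather than a position in the raw text.
    return [li.text for li in root.findall(".//ul[@class='friends']/li")]


print(extract_friends(PAGE))
```

If the page layout changes so that the list is no longer found by this path, the wrapper returns nothing, which is the situation wrapper adaptation is meant to handle.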
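
Sketch: tree matching. The wrapper adaptation section compares two similar trees A and B. The block below implements the classic Simple Tree Matching (STM) recursion, not CTM itself, since the poster does not give the CTM weighting rule in full. Trees are assumed to be `(label, children)` tuples, and the normalized `similarity` function is one common convention, introduced here only for illustration.

```python
def simple_tree_matching(a, b):
    """Number of node pairs in a maximum order-preserving matching of a and b."""
    label_a, children_a = a
    label_b, children_b = b
    if label_a != label_b:
        return 0  # roots differ: nothing below them can be matched
    m, n = len(children_a), len(children_b)
    # best[i][j] = best score using the first i children of a and first j of b
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(children_a[i - 1], children_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def tree_size(tree):
    """Total number of nodes in the tree."""
    return 1 + sum(tree_size(child) for child in tree[1])


def similarity(a, b):
    """Matching score divided by the average tree size (assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two hypothetical page trees that differ in one <div> and one <li>.
A = ("html", [("body", [("div", []), ("ul", [("li", []), ("li", [])])])])
B = ("html", [("body", [("ul", [("li", []), ("li", []), ("li", [])])])])
print(simple_tree_matching(A, B), round(similarity(A, B), 3))
```

A minimum similarity threshold, as mentioned under Advantages, could then be applied to the value returned by `similarity`.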
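
Sketch: BFS sampling. The BFS strategy described in the crawling section can be outlined as follows. This is a minimal sketch on an in-memory toy graph; the `neighbors` callback stands in for whatever call retrieves a user's friend list, and `max_nodes` is an assumed crawl budget, neither of which is specified on the poster.

```python
from collections import deque


def bfs_sample(neighbors, seed, max_nodes):
    """Visit up to max_nodes nodes in breadth-first order, starting from seed."""
    order = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(order) < max_nodes:
        node = queue.popleft()
        order.append(node)
        for friend in neighbors(node):
            if friend not in seen:  # enqueue each node once, in discovery order
                seen.add(friend)
                queue.append(friend)
    return order


# Hypothetical friendship graph as an adjacency list.
toy_graph = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
print(bfs_sample(lambda n: toy_graph[n], seed=1, max_nodes=4))
```

Because the visit always expands outward from the seed, the sample is connected. The uniform strategy does not guarantee this.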
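
Sketch: uniform (rejection) sampling. The uniform strategy from the crawling section generates random candidate nodes and keeps only those that exist. The sketch below assumes user IDs are integers drawn from a known range. The names `user_exists`, `id_space_size` and `max_attempts` are hypothetical and not part of the poster.

```python
import random


def uniform_rejection_sample(user_exists, id_space_size, sample_size, rng,
                             max_attempts=100_000):
    """Draw IDs uniformly at random and keep those that map to a real user."""
    accepted = set()
    attempts = 0
    while len(accepted) < sample_size and attempts < max_attempts:
        attempts += 1
        candidate = rng.randrange(id_space_size)
        if user_exists(candidate):  # reject IDs not assigned to any user
            accepted.add(candidate)
    return accepted


# Hypothetical ID space in which only every seventh ID is an existing user.
existing_users = set(range(0, 1000, 7))
sample = uniform_rejection_sample(existing_users.__contains__, 1000, 20,
                                  random.Random(0))
print(len(sample), sample <= existing_users)
```

Each accepted node is chosen independently of the friendship ties, which is why the sample is unbiased and also why the sampled nodes need not be connected to each other.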
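
Sketch: complementary cumulative distribution function (CCDF). The node degree section notes that the degree distribution can be represented by a CCDF. The block below is an illustrative computation of the empirical CCDF, P(K ≥ k), from a list of node degrees. The function name `degree_ccdf` and the sample degrees are assumptions. For a power law with exponent γ the CCDF decays roughly as k^{-(γ-1)}, so it appears approximately linear on log-log axes.

```python
from collections import Counter


def degree_ccdf(degrees):
    """Map each observed degree k to the fraction of nodes with degree >= k."""
    total = len(degrees)
    counts = Counter(degrees)
    ccdf = {}
    remaining = total  # nodes whose degree is >= the current k
    for k in sorted(counts):
        ccdf[k] = remaining / total
        remaining -= counts[k]
    return ccdf


# Hypothetical degrees of four sampled users.
print(degree_ccdf([1, 1, 2, 3]))
```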