Professional Documents
Culture Documents
Temporalwikigraph
Temporalwikigraph
Luciana S. Buriol1 Carlos Castillo1,2 Debora Donato1 Stefano Leonardi1 Stefano Millozzi1
1 2
DIS - Università di Roma “La Sapienza” Universitat Pompeu Fabra
Rome, Italy Barcelona, Spain
1200 9000
Actual Actual
The first few English pages were published in January 15, 1000 6.2% monthly growth
8000
7000
11.3% monthly growth
2001. Four and half years later, Wikipedia has more then 1 800 6000
5000
600
million articles. 400
4000
3000
We opted for this encyclopedia as a good dataset for Web 200
2000
1000
graph-type experiments essentially because i) the content 0
Jan Jan Jan Jan Jan
0
Jan Jan Jan Jan Jan
pendent and heterogeneous individuals as for the Web, ii) (c) Visitors (d) Editors
Wikipedia provides time information associated with nodes,
Number of users (thousands)
35000 1000
Weekly Reach (millions)
Created:
6.0 Before Jan’03 7.5% of the articles), but some articles may have a large
During 2003 number of editors: about 50% of the articles have more than
5.0 During 2004
During 2005 7 different persons involved and about 5% of the articles
4.0
have more than 50 different editors. The average number
3.0 of updates per user has dropped by about 30% in the last
2.0 two years, as shown in Figure 4 (right). In any case, the
1.0 fraction of articles that are updated is very high. Over 80%
Jan Jan Jan Jan
’03 ’04 ’05 ’06
of the articles are updated in the three-months period we are
Time considering for these snapshots. This fraction has remained
basically constant during the last two years, as shown in
Figure 2. Growth of the size of articles written Figure 4 (right).
on or before a certain date.
0.8
change of a page that is edited, measured in terms of how 20
0.7
many bytes the page gains or looses, has remained roughly 15
0.6
10
the same during the entire Wikipedia history (except at the 0.5
5 0.4
very beginning). On average, on each edit a page changes 0 0.3
from 300 to 500 bytes, this is roughly equivalent to a short Jan
’02
Jan
’03
Jan
’04
Jan
’05
Jan
’06
Jan
’02
Jan
’03
Jan
’04
Jan
’05
Jan
’06
Time Time
paragraph of text.
Number of Articles
Fraction of Updates
0.8 0.8
was observed in [18] that in general about half of a certain 0.6 0.6
type of vandalism (the vandalism that involves a mass dele- 0.4 0.4
tion of content in a page) is corrected within the next three 0.2 0.2
minutes after it is done. Our measurements indicate that the 0.0 0.0
Jan Jan Jan Jan Jan Jan
percentage of reverts that are executed in less than one hour ’04 ’05 ’06 ’04 ’05 ’06
Cluster 2 Cluster 3
has remained constant over time and is around 70%. 1.0 1.0
Sometimes the system of reverts is also used editorially,
Fraction of Updates
Fraction of Updates
0.8 0.8
when one editors disagrees with other and rejects his/her 0.6 0.6
changes. This is considered rude in the Wikipedia and can 0.4 0.4
be replied with a second revert by the affected user, reiter- 0.2 0.2
ating his/her editions. This creates a phenomenon known 0.0 0.0
as a “revert war” in the Wikipedia, and can be difficult to Jan
’04
Jan
’05
Jan
’06
Jan
’04
Jan
’05
Jan
’06
Number of Articles
4 4
10 10
3 3
the graph, while the in-degree is a microscopic characteris-
10 10
2 2
tic. The tail of the distribution of PageRank for the usual
10 10
10
1
10
1 parameter setting follows a power-law [5]. In our case, we
10 0
100
observe a power law distribution with exponent γ = 2.1.
1 10 100 1000 10000 1 10 100 1000 10000
Number of In−links Number of Out−links
Previous measures for the Web graphs [9, 17] have mea-
sured the same exponent.
Figure 8. Degree distribution in the English Usually the correlation between PageRank and indegree
version of the Wikipedia as of January 1st, is very low in Web graphs. Interestingly, in the Wikigraph
2006. Left: indegree, right: outdegree. this correlation has increased in the last years, basically fol-
lowing the densification of the graph, as shown in Figure 10.
The power-law exponent of the indegree has remained
remarkably constant even when the graph has grown sub- 0.8
stantially in the studied period (from ≈14,000 articles to
PageRank / In−Degree
0.7
≈1,000,000). 0.6
Correlation
The Wikipedia graph is becoming denser. During the last 0.5
two and a half years, articles have grown from an average 0.4
of 7 out-links per article to an average of 16 (on average one 0.3
new reference every 100 days). This is shown in Figure 9. 0.2
0.1
0.0
18 Jan Jan Jan Jan Jan
Actual ’02 ’03 ’04 ’05 ’06
16 1 new link every 100 days Time
14
Out−degree
12
10
Figure 10. Correlation between PageRank
and indegree over time.
8
6 This means that many of the new links that have been
4 appearing point to pages that already have high Page-
Jan Jan Jan Jan Jan
’02 ’03 ’04 ’05 ’06
Rank. Potentially, this could mean that the low Page-
Time Rank/indegree correlation observed in Web graph is a tran-
sient phenomenon and as the Web matures, this correlation
Figure 9. Average number of links per article could increase.
over time. Finally, with respect to other microscopic measures such
as clustering coefficient or edge reciprocity (how many of
This is a trend observed in other social networks by the edges are reciprocal), the Wikigraph seems to have sta-
Leskovec et al. [13]. They observed that for several net- bilized in the last years (Figure 11). The average cluster-
works, the number of edges grows exponentially with the ing coefficient is roughly 0.23 and the fraction of reciprocal
number of nodes. The exponent is typically small, in the edges is about about 0.13.
range 1.1 to 1.6 depending on the particular network.
In the Wikipedia, one possible explanation of this in-
6 Macroscopic structure of the Wikigraph
crease in the number of out-links could be that, as the size
of the articles is getting longer, there are more concepts that If we disregard the direction of the links, we observe that
should be linked. However, we can factor out the effect of the Wikigraph is almost entirely (weakly) connected. Over
size and note that in fact, now there is more text per each 98.5% of the articles are connected to each other.
out-link than in the past. From an average of 200-250 bytes If we consider the direction of the links, the macroscopic
of text per out-link in the first three years of the Wikipedia, connectivity structure of Web graphs can be characterized
this quantity has grown to about 300-350 bytes per out-link by mapping their strongly connected components. This
in the last year. analysis generated what is known as the bow-tie structure
0.22 this way the content of each article can be fully understood
0.20 while the surfer visits many different articles.
0.18 The increase of the size of the CORE has been mostly
0.16
0.14 at the expense of the OUT component, as shown in Fig-
0.12 ure 12(right). The OUT component contains articles reach-
0.10
Jan Jan Jan Jan Jan
able from the main strongly connected component but that
’02 ’03 ’04 ’05 ’06 do not link back to it. We can explain at least in part the re-
Time duction of the size component if we consider that the num-
ber of out-links per page is becoming larger, as presented in
Figure 11. Clustering coefficient and edge the previous section. The OUT component of the Wikipedia
reciprocity over time. is also very thin, and over 99% of its nodes are directly
reachable from the CORE across all snapshots. The same
happens with the IN component, in contrast with larger
of the Web [7]. This “bow-tie” is formed of four compo-
graphs that exhibit several levels in these components.
nents. The main component is a large strongly connected
Finally, we study the “movement” of articles among
component named CORE, comprised of all nodes that can
components of the different components of the Wikipedia
reach each other along directed edges. The second and third
graph, following [3]. This is depicted by the state dia-
components are the IN and OUT sets. The IN is the set of
gram in Figure 13 in which only CORE, IN and OUT are
nodes that can reach the CORE but cannot be reached from
depicted, and to each arc we have associated the proba-
it, whereas the OUT is the set of nodes that are reached by
bility that that a state change occurs from one snapshot to
the CORE but cannot reach it.
the other. Interestingly, the CORE and OUT components are
The first observation about the evolution of this macro
structure in the Wikigraph is that the CORE component is
getting larger, as can be seen in Figure 12(left). Currently,
about 2/3 of the nodes belong to the larger strongly con-
nected component of the Wikigraph, which is larger than
was observed in 1999 in the Altavista crawl [7] or in 2001
in the WebBase crawl [9] but is consistent with measures in
more recent samples of the Web such as [2, 6, 10].
70% 70%
IN
60%
Size of IN and OUT
60% OUT
Size of CORE
50%
50% 40%
40% 30%
30%
20% Figure 13. Migration of articles among the
10%
20% 0%
components.
Jan Jan Jan Jan Jan Jan Jan Jan Jan Jan
’02 ’03 ’04 ’05 ’06 ’02 ’03 ’04 ’05 ’06
Time Time very stable, with over 60% of the articles remaining in the
same component after three months. Also, these transition
Figure 12. Relative size of the CORE (left) and probabilities have remained basically stable over different
IN , OUT (right) components with respect to snapshots (in the figure we depict the average transition
the rest of the graph probability across snapshots).