
Temporal Analysis of the Wikigraph

Luciana S. Buriol (1), Carlos Castillo (1,2), Debora Donato (1), Stefano Leonardi (1), Stefano Millozzi (1)
(1) DIS - Università di Roma “La Sapienza”, Rome, Italy
(2) Universitat Pompeu Fabra, Barcelona, Spain

Abstract

Wikipedia is an online encyclopedia, available in more than 100 languages and comprising over 1 million articles in its English version. If we consider each Wikipedia article as a node and each hyperlink between articles as an arc, we have a “Wikigraph”, a graph that represents the link structure of Wikipedia.

The Wikigraph differs from other Web graphs studied in the literature in that explicit timestamps are associated with each node’s events. This allows us to carry out a detailed analysis of the evolution of Wikipedia over time.

In the first part of this study we characterize this evolution in terms of users, edits and articles; in the second part, we depict the temporal evolution of several topological properties of the Wikigraph. The insights obtained from the Wikigraphs can be applied to large Web graphs, for which such temporal data is usually not available.

1 Introduction

In the past decade, the Web has experienced a very fast growth rate: recent estimates [11] indicate that the indexable Web exceeds 11.5 billion pages. The Web is also very dynamic: pages are modified, created and deleted continuously. Because of this huge amount of changing data, search engines constantly face the burdensome task of updating their index in order to keep an up-to-date copy of the current Web.

The study of the evolving Web has mainly focused on the degree and the frequency of changes in Web pages. Ntoulas and Cho [15] observed that the evolution of the hyperlinked structure is more dynamic than the evolution of the textual contents of the pages. After one year, the percentage of links still present in the Web is only 24%, against a fraction of unchanged pages that reaches 50%. This fast pace of change in the Web graph is very important if we consider that the hyperlinked structure is the basis of many algorithms that assign an authoritativeness score to pages.

It is worth underlining that extracting the link structure of the Web at a specific point in time is not possible. Downloading pages from the Web, in a certain way, resembles watching the sky on a clear night: what we see reflects the state of the stars at different times, as the light had to travel different distances. What we obtain by Web crawling is not a “snapshot” of the Web, because it does not represent the Web at any given instant of time [4].

An attempt at capturing the dynamics of the Web can be made by collecting a series of frequent static snapshots through sequential crawls. From the study of these snapshots, it can be inferred whether a page has been modified or deleted during a certain time frame. However, it is not possible to determine exactly the instant when the update or deletion occurred.

Kumar et al. [12] overcame this problem by considering the evolution of the Blogspace, that is, the set of Web logs (or blogs). A blog is commonly a page that contains a series of dated entries (or posts) ordered from newest to oldest. Each time a new entry is inserted, the page can be considered updated. In this way, all the information concerning “time” can be directly extracted from the date of each individual entry.

Our approach to the study of the evolution of Web graphs over time is to study Wikipedia, an on-line and free-content encyclopedia written in more than 100 languages and with over 1 million articles in its English version. Details about the evolution of Wikipedia over time are available for every article and every version of an article. This work aims at verifying whether any evolving trend is observable in the statistical properties of the articles and/or the Wikigraph, and how different measures evolve together over time. We want to stress that, up to now, very little research work has been devoted to the evolution of the statistical and topological properties of hyperlinked graphs such as Web graphs, Blog graphs and Wikigraphs.

Besides the fact that Wikipedia has grown over time, this study reveals interesting properties of the Wikigraph:
– the Wikipedia in general has become “denser” over time, both in terms of contents and hyperlinks,

– the degree of vandalism seems to have increased, but still a low fraction of the updates are related to correcting vandalism, and this is usually done very rapidly,
– in some aspects the Wikigraph appears quite mature, while in other aspects it is still evolving rapidly.

In many aspects the static statistical properties of Web graphs and the Wikigraphs are very similar [8], so this study should provide several hints about how the Web at large is evolving and may continue to evolve in the future, in particular with respect to its connectivity. This article aims at gaining insights about the evolution of a hyperlink structure from the study of a coherent, information-rich corpus.

The paper is organized as follows. Section 2 presents our data set and Section 3 depicts the growth of Wikipedia in the studied period. Section 4 studies the dynamics of the article updates. Sections 5 and 6 describe the evolution of the hyperlink graph at the microscopic and macroscopic level, respectively. The last section presents our conclusions.

2 Data set

Wikipedia is an on-line and free-content encyclopedia. The first few English pages were published on January 15, 2001. Four and a half years later, Wikipedia has more than 1 million articles.

We opted for this encyclopedia as a good dataset for Web graph-type experiments essentially because i) the content is created by a large number of independent and heterogeneous individuals, as on the Web, ii) Wikipedia provides time information associated with nodes, and iii) the articles link mainly to other articles in the same dataset.

The dataset, freely available, is a large, compressed XML file containing information about pages and, inside each page, information about all the revisions the page has undergone. For each revision, the time stamp, author, editorial comment and full text of that version of the page are available. The author corresponds to the name of a registered user in the case of normal edits, to the IP address of the originating computer in the case of an anonymous edit, and to the program name in the case of a (semi-)automatic edit. Automatic edits are done to adapt articles to newer standards and to automatically create links, and except in the case of very trivial syntactic changes, they are usually reviewed by a human operator.

We consider articles only in the main name space, that is, encyclopedic articles themselves, as opposed to templates, discussions, user pages and other administrative pages. To generate a graph from the link structure of a dataset, each article corresponds to a node and each hyperlink between existing articles to an edge. An article might also contain some external links that point to pages outside the dataset, but they are usually only a few. These kinds of links are not considered when generating Wikigraphs, since we want to restrict the graph to pages in the set being analyzed. For analyzing the evolution of Wikipedia, we generated 17 snapshots of the graph at different points in time. We started with the Wikipedia graph as of January 1st, 2002 (14,247 articles) and generated one snapshot every 3 months until April 1st, 2006 (1,064,757 articles). We used the COSIN library [14] for the analysis; it is a software package for processing large graphs that implements several algorithms in semi-external memory.
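As a minimal sketch of how these per-revision records can be read from such a dump, the fragment below streams (title, timestamp, author, comment) tuples. It is only an illustration under our own assumptions: the element names follow the Wikipedia XML export schema, but the file name, the use of bz2 streaming and the skipping of the revision text are our choices, not something specified in the paper.

```python
import bz2
import xml.etree.ElementTree as ET

def _local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]

def revisions(dump_path):
    """Stream (title, timestamp, author, comment) tuples from a Wikipedia XML dump."""
    with bz2.open(dump_path, "rb") as stream:
        title = None
        for _, elem in ET.iterparse(stream):
            tag = _local(elem.tag)
            if tag == "title":
                title = elem.text
            elif tag == "revision":
                fields = {_local(child.tag): child for child in elem}
                author = None
                contributor = fields.get("contributor")
                if contributor is not None:
                    sub = {_local(c.tag): c.text for c in contributor}
                    author = sub.get("username") or sub.get("ip")
                timestamp = fields["timestamp"].text if "timestamp" in fields else None
                comment = fields["comment"].text if "comment" in fields else ""
                yield title, timestamp, author, comment
                elem.clear()  # keep memory bounded while streaming

# Hypothetical dump file name, for illustration only:
# for record in revisions("enwiki-pages-meta-history.xml.bz2"):
#     print(record)
```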
3 The growth of Wikipedia

The number of articles, shown in Figure 1 (a), presents a remarkable growth, roughly doubling every year for the last three years, i.e. a 6.2% monthly growth. In the same period, the number of updates (b) has been growing at a rate of 11.3% per month. This means that in the most recent Wikipedia “versions” each individual article receives more attention than in the previous ones.

Figure 1. Growth of the Wikipedia: number of (a) articles, (b) updates, (c) unique visitors and (d) distinct editors.

The numbers of different visitors and of different editors have also been growing at a fast pace, as shown in Figure 1 (c). The number of unique visitors has grown at a rate of ∼15% per month during the last three years, while the number of unique editors (d) has grown at a rate of ∼13%.

Interestingly, not only has the number of articles grown, but individual articles have also become longer. In [18] it is shown that pages with more than 100 edits grow steadily over time. Indeed, if we consider all the pages that existed three years ago (January 2003) and plot their size, as in Figure 2, we observe that they become longer, growing linearly during the last two years at a rate of 1 KB of text every 6-8 months.
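As a quick consistency check of the growth rates reported above, the short sketch below relates the two snapshot article counts quoted in Section 2 to a constant monthly rate; the helper function is ours and not part of the paper’s tooling.

```python
from math import exp, log

def implied_monthly_rate(count_start, count_end, months):
    """Constant monthly growth rate that takes count_start to count_end in `months` months."""
    return exp(log(count_end / count_start) / months) - 1

# Snapshot counts from Section 2: Jan 1st, 2002 and Apr 1st, 2006 are 51 months apart.
print(f"{implied_monthly_rate(14_247, 1_064_757, 51):.1%} per month over the whole period")

# A 6.2% monthly rate doubles the article count roughly once a year:
print(f"doubling time at 6.2%/month: {log(2) / log(1.062):.1f} months")
```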

Newer articles follow a similar growth process; in the plot we also include the articles that were created during 2003, 2004 and 2005. The growth rate of newer articles does not seem to be accelerating and in fact appears to be lower than the growth rate of older articles, so older articles attract more edits than newer ones. This may be interpreted as a sign of maturity of the coverage of Wikipedia, but not of the depth of particular articles (many of them are marked as “stubs” and tend to be very terse), so we expect the average length of the articles to continue to grow for a few more years.

Figure 2. Growth of the size of articles written on or before a certain date.

Even if the number of edits is increasing, the average change of a page that is edited, measured in terms of how many bytes the page gains or loses, has remained roughly the same during the entire Wikipedia history (except at the very beginning). On average, on each edit a page changes by 300 to 500 bytes; this is roughly equivalent to a short paragraph of text.

4 The dynamics of the updates

The updates of pages are a stream of data in the form (time stamp, article, user) representing the events of creating new revisions of pages. The distribution of updates per page follows a power-law, as shown in Figure 3 (left). About 53% of the pages have received more than 10 updates, and about 5% more than 100 updates.

Figure 3. Distribution of the updates per article. Left: number of changes, right: number of different users involved on each article.

The reason for this skewed distribution appears to be two-fold: first, the updates might be related to page views, in the sense that highly-visited articles are updated more often, so the distribution of updates follows the distribution of page importance. Second, Wikipedia is built as a community, and there is a prominent link to see the last 50 changes, plus the option to “watch” an article in order to receive alerts when it is changed. This may cause edits made to an article to attract even more edits, as explained later in this section.

The distribution of the number of different users involved in the editing of each article is also very skewed, as shown in Figure 3 (right). Having a single editor is rare (about 7.5% of the articles), but some articles may have a large number of editors: about 50% of the articles have more than 7 different persons involved, and about 5% of the articles have more than 50 different editors. The average number of updates per user has dropped by about 30% in the last two years, as shown in Figure 4 (left). In any case, the fraction of articles that are updated is very high: over 80% of the articles are updated in the three-month period we are considering for these snapshots. This fraction has remained basically constant during the last two years, as shown in Figure 4 (right).

Figure 4. Left: average number of edits per user. Right: fraction of articles existing in one period that are updated by the end of the same period.

There are many interactions among the actions of different editors. In general, after one article is edited, there is a 7% chance that the same article gets edited again by a different user during the next hour. This probability rises to 13% if we consider 6 hours and 22% if we consider a 24-hour period. This has remained constant during the history of Wikipedia. These numbers reflect actual collaboration among users and do not include the number of “reverts”.

Reverts can be done by any registered user, and with this one-click operation an article can be taken back to a previous version. This is done mostly to fight vandalism. We detect that an update is a revert by searching for the string revert or rv in the comment of the edit (this is inserted automatically). The fraction of editor actions that are reverts is in general small, but it has been steadily increasing in the last years, as can be seen in Figure 5.
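A minimal sketch of this detection rule is shown below; the paper only states that the strings revert or rv are searched for in the edit comment, so the regular expression and function are our own illustration of that idea.

```python
import re

# Match "revert", "reverted", "reverting", ... or the abbreviation "rv" as a word.
REVERT_RE = re.compile(r"\brevert\w*|\brv\b", re.IGNORECASE)

def is_revert(comment: str) -> bool:
    """Classify an edit as a revert from its editorial comment."""
    return bool(comment) and REVERT_RE.search(comment) is not None

assert is_revert("Reverted edits by 127.0.0.1 to last version by Example")
assert is_revert("rv vandalism")
assert not is_revert("fixed a typo in the infobox")
```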

This may signal an increasing amount of vandalism per page, although in general the number of reverts per edit is less than 6%, and it does not seem to be related to the number of anonymous edits, which has remained consistently between 20% and 30% over the last years.

Figure 5. Percentage of editor actions that are reverts.

As the main reason for reverts is to correct vandalism, it was observed in [18] that in general about half of a certain type of vandalism (the vandalism that involves a mass deletion of content in a page) is corrected within the next three minutes after it is done. Our measurements indicate that the percentage of reverts that are executed in less than one hour has remained constant over time and is around 70%.

Sometimes the system of reverts is also used editorially, when one editor disagrees with another and rejects his/her changes. This is considered rude in the Wikipedia and can be answered with a second revert by the affected user, reiterating his/her edits. This creates a phenomenon known as a “revert war” in the Wikipedia, which can be difficult to resolve. In fact, there are technological tools, such as protecting a page, as well as social tools, such as arbitration committees, that can help settle these edit wars.

In November 2004 a 3-revert rule was established: no user should revert a page more than three times in a 24-hour period. This had an effect on the rate of double-reverts, which dropped from 7.8% to less than 4% almost immediately. The rate of double-reverts has remained under 5% since then, as shown in Figure 6.

Figure 6. Fraction of double-reverts.

While the amount of updates in Wikipedia is increasing, if we focus on individual pages we can see that this change is not homogeneous. Some entries are updated in response to external events (such as the articles related to candidates in a political election), while other entries have their updates more evenly distributed in time (such as the articles on biology or mathematics).

An attempt to characterize the differences among different articles is to use clustering, as suggested in [1] in the context of information diffusion in Blog postings. In Figure 7 we clustered the “update profile” of a set of pages with the k-means algorithm, using the Euclidean distance among the update vectors. We obtained 4 clusters: (1) pages that are updated evenly in time, (2) pages that have been updated more than usual in the last 9 months, (3) in the last 6 months, and (4) in the last 3 months.

Figure 7. Clustering of pages per update profile.

By looking at the list of articles in each cluster, there seems to be no topical relationship among them, and no single event explains several changes to the Wikipedia. While an external news event may trigger a large number of updates on a single article, no single event has triggered a massive modification of many pages in the Wikipedia simultaneously.
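A minimal sketch of this clustering step follows. The paper specifies only k-means with Euclidean distance and k = 4 on per-page update vectors; the normalization into fraction-of-updates profiles, the toy data and the use of scikit-learn are our own assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def update_profiles(counts_per_quarter):
    """Turn per-quarter update counts into fraction-of-updates vectors, one row per page."""
    counts = np.asarray(counts_per_quarter, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.where(totals == 0.0, 1.0, totals)

# Toy data: rows are pages, columns are consecutive 3-month periods (oldest first).
profiles = update_profiles([
    [5, 6, 5, 6, 5, 6, 5, 6],   # updated evenly over time
    [1, 1, 1, 2, 3, 5, 8, 9],   # more active in the last 9 months
    [1, 1, 1, 1, 1, 5, 9, 9],   # more active in the last 6 months
    [1, 1, 1, 1, 1, 1, 2, 20],  # burst in the last 3 months
])

# k = 4 as in the paper; KMeans in scikit-learn uses Euclidean distance.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)
print(labels)
```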
5 The evolution of the hyperlinks

As expected, the Wikigraph is a scale-free network in which the indegree distribution follows a power-law, i.e. the probability that a node has indegree i is proportional to 1/i^γ, for γ > 1. We found γ = 2.00, similar to the value that has been observed in the indegree distribution of Web graphs (2.1) [7, 9]. For the outdegree, the distribution appears to be log-normal or double-Pareto, with an exponent in the tail (articles with more than 30 out-links) of γ = 2.46.
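The exponents quoted in this section can be estimated directly from the raw degree values. The sketch below uses the standard discrete maximum-likelihood approximation, gamma ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)); this estimator is our choice for illustration, since the paper does not state how its exponents were fitted.

```python
import math
import random

def power_law_exponent(values, x_min=1):
    """Approximate MLE of gamma for P(x) ~ x^(-gamma), using values >= x_min (discrete data)."""
    tail = [v for v in values if v >= x_min]
    return 1.0 + len(tail) / sum(math.log(v / (x_min - 0.5)) for v in tail)

# Toy check on synthetic heavy-tailed data (continuous Pareto with density ~ x^-2, floored).
random.seed(0)
sample = [max(1, int(random.paretovariate(1.0))) for _ in range(100_000)]
print(f"estimated gamma: {power_law_exponent(sample, x_min=10):.2f}")  # should be close to 2
```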

Both distributions are depicted in Figure 8.

Figure 8. Degree distribution in the English version of the Wikipedia as of January 1st, 2006. Left: indegree, right: outdegree.

The power-law exponent of the indegree has remained remarkably constant even though the graph has grown substantially in the studied period (from ≈14,000 articles to ≈1,000,000).

The Wikipedia graph is becoming denser. During the last two and a half years, articles have grown from an average of 7 out-links per article to an average of 16 (on average one new reference every 100 days). This is shown in Figure 9.

Figure 9. Average number of links per article over time.

This is a trend observed in other social networks by Leskovec et al. [13]. They observed that for several networks the number of edges grows as a power of the number of nodes, with an exponent that is typically small, in the range 1.1 to 1.6 depending on the particular network.

In the Wikipedia, one possible explanation of this increase in the number of out-links could be that, as articles are getting longer, there are more concepts that should be linked. However, we can factor out the effect of size and note that in fact there is now more text per out-link than in the past: from an average of 200-250 bytes of text per out-link in the first three years of Wikipedia, this quantity has grown to about 300-350 bytes per out-link in the last year.

We also studied the distribution of PageRank [16], a link-based ranking function that assigns to every page in the Web graph a number corresponding to the probability of visiting the page when following links at random (on a modified version of the Web graph in which “sinks” are avoided). PageRank represents mesoscopic (mid-range) properties of the graph, while the in-degree is a microscopic characteristic. The tail of the distribution of PageRank for the usual parameter setting follows a power-law [5]. In our case, we observe a power-law distribution with exponent γ = 2.1. Previous measurements on Web graphs [9, 17] found the same exponent.
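For reference, a minimal power-iteration sketch of PageRank as described above is given below. The damping factor of 0.85 and the uniform redistribution of the score of “sink” nodes are conventional choices on our part; the paper does not state which parameters it used.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """PageRank by power iteration; out_links maps each node to a list of target nodes."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # The score of sink nodes (no out-links) is spread uniformly over all nodes.
        sink_mass = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * sink_mass / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

# Tiny example graph: node "c" is a sink.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": []}))
```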
Usually the correlation between PageRank and indegree is very low in Web graphs. Interestingly, in the Wikigraph this correlation has increased in the last years, basically following the densification of the graph, as shown in Figure 10.

Figure 10. Correlation between PageRank and indegree over time.

This means that many of the new links that have been appearing point to pages that already have a high PageRank. Potentially, this could mean that the low PageRank/indegree correlation observed in Web graphs is a transient phenomenon and that, as the Web matures, this correlation could increase.

Finally, with respect to other microscopic measures such as the clustering coefficient or the edge reciprocity (how many of the edges are reciprocal), the Wikigraph seems to have stabilized in the last years (Figure 11). The average clustering coefficient is roughly 0.23 and the fraction of reciprocal edges is about 0.13.

Figure 11. Clustering coefficient and edge reciprocity over time.

6 Macroscopic structure of the Wikigraph

If we disregard the direction of the links, we observe that the Wikigraph is almost entirely (weakly) connected: over 98.5% of the articles are connected to each other.

If we consider the direction of the links, the macroscopic connectivity structure of Web graphs can be characterized by mapping their strongly connected components. This analysis generated what is known as the bow-tie structure of the Web [7]. This “bow-tie” is formed of four components. The main component is a large strongly connected component named CORE, comprised of all nodes that can reach each other along directed edges. The second and third components are the IN and OUT sets. The IN is the set of nodes that can reach the CORE but cannot be reached from it, whereas the OUT is the set of nodes that are reached from the CORE but cannot reach it.
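A minimal sketch of how this decomposition can be computed is shown below. The paper relies on its own semi-external-memory COSIN library [14]; the in-memory networkx version here is only meant to illustrate the definitions of CORE, IN and OUT.

```python
import networkx as nx

def bow_tie(graph: nx.DiGraph):
    """Split nodes into CORE (largest SCC), IN (reach CORE), OUT (reached from CORE), OTHER."""
    core = max(nx.strongly_connected_components(graph), key=len)
    seed = next(iter(core))
    out_set = (set(nx.descendants(graph, seed)) | core) - core
    in_set = (set(nx.ancestors(graph, seed)) | core) - core
    other = set(graph) - core - in_set - out_set
    return core, in_set, out_set, other

# Tiny example: a and b form the CORE, x can reach it, y is reached from it.
g = nx.DiGraph([("a", "b"), ("b", "a"), ("x", "a"), ("b", "y")])
core, in_set, out_set, other = bow_tie(g)
print(core, in_set, out_set, other)  # {'a', 'b'} {'x'} {'y'} set()
```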

The first observation about the evolution of this macro structure in the Wikigraph is that the CORE component is getting larger, as can be seen in Figure 12 (left). Currently, about 2/3 of the nodes belong to the largest strongly connected component of the Wikigraph, which is larger than what was observed in 1999 in the Altavista crawl [7] or in 2001 in the WebBase crawl [9], but is consistent with measurements on more recent samples of the Web such as [2, 6, 10].

Figure 12. Relative size of the CORE (left) and IN, OUT (right) components with respect to the rest of the graph.

In the past we have observed that in the CORE there may be nodes that are connected by only two links (“unstable” members of the CORE). These nodes can amount to 4% to 6% of large Web graphs [10, 6], but in the case of Wikipedia they are less than 1.5%, meaning that the largest strongly connected component of Wikipedia is more tightly knit. We can conclude that the link structure of Wikipedia is well interconnected, in the sense that most of the nodes are in the core, and from any page it is possible to reach almost any other. This is probably due to an implicit aim of an on-line encyclopedia, namely driving the reader to related topics on the same encyclopedia while a topic is described. In this way the content of each article can be fully understood while the surfer visits many different articles.

The increase in the size of the CORE has been mostly at the expense of the OUT component, as shown in Figure 12 (right). The OUT component contains articles that are reachable from the main strongly connected component but do not link back to it. We can explain at least in part the reduction in the size of this component if we consider that the number of out-links per page is becoming larger, as presented in the previous section. The OUT component of the Wikipedia is also very thin, and over 99% of its nodes are directly reachable from the CORE across all snapshots. The same happens with the IN component, in contrast with larger graphs that exhibit several levels in these components.

Finally, we study the “movement” of articles among the different components of the Wikipedia graph, following [3]. This is depicted by the state diagram in Figure 13, in which only CORE, IN and OUT are shown, and each arc is labeled with the probability that the corresponding state change occurs from one snapshot to the next.

Figure 13. Migration of articles among the components.

Interestingly, the CORE and OUT components are very stable, with over 60% of the articles remaining in the same component after three months. Also, these transition probabilities have remained basically stable over different snapshots (in the figure we depict the average transition probability across snapshots).
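A minimal sketch of how such transition probabilities can be tabulated from two consecutive snapshots is given below. The component labels and the dictionary-based bookkeeping are our own illustration; the paper follows the methodology of [3] but does not spell out the computation.

```python
from collections import Counter, defaultdict

def transition_probabilities(labels_t0, labels_t1):
    """Probability of moving between components ('CORE', 'IN', 'OUT', ...) across two snapshots."""
    counts = Counter((labels_t0[page], labels_t1[page])
                     for page in labels_t0 if page in labels_t1)
    totals = defaultdict(int)
    for (src, _), c in counts.items():
        totals[src] += c
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}

# Toy example: three pages observed in two consecutive snapshots.
t0 = {"Rome": "CORE", "Graph": "OUT", "Wiki": "IN"}
t1 = {"Rome": "CORE", "Graph": "CORE", "Wiki": "IN"}
print(transition_probabilities(t0, t1))
```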

7 Concluding remarks

In this paper we presented a link and temporal analysis of the Wikigraph. We performed a series of measurements and observed that the Wikigraph resembles many characteristics of the Web graph. The core of this study was the temporal analysis of Wikigraphs, where we carried out a large number of experiments on the evolution over time of the topological and statistical properties of Wikigraphs and made several observations on the frequency of update of the articles of Wikipedia.

The observation of Wikipedia provides mixed signals of growth and maturity of this collection.

Signs of transient regime (growth):

• The number of articles, updates, visitors and editors is still growing exponentially.
• The size of articles is still growing linearly.
• The number of links per article is also growing linearly, more slowly than the amount of text.
• The number of reverts is growing slowly, which may signal more vandalism, but the number of double reverts (revert wars) has stabilized.

Signs of permanent regime (maturity):

• There is a clear power-law distribution of the indegree and outdegree.
• The average number of edits per user has been mostly constant in the last two years.
• There is a high correlation between PageRank and indegree, indicating that the microscopic connectivity of the encyclopedia resembles its mesoscopic properties.
• The clustering coefficient and edge reciprocity of links have remained basically constant during the last two years.
• Over 2/3 of the articles now belong to the largest strongly connected component.

These are the first observations with this degree of detail of the evolution of a large hyperlinked corpus. In the future, we expect to relate this study with the observed evolution of large samples of pages from the Web.

8 Acknowledgements

This work was partially supported by the FET Open Project IST-2001-33555 (COSIN), and by the EU IST-FET project 001907 (DELIS). We acknowledge partial support of Grant TIN 2005-09201 of the Ministry of Education and Science of Spain.

References

[1] E. Adar, L. Zhang, L. A. Adamic, and R. M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, May 2004.
[2] R. Baeza-Yates, C. Castillo, and E. Efthimiadis. Characterization of national web domains. To appear in ACM TOIT, 2006.
[3] R. Baeza-Yates and B. Poblete. Dynamics of the Chilean web structure. In Proc. of the 3rd WebDyn, 2004.
[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.
[5] L. Becchetti and C. Castillo. The distribution of PageRank follows a power-law only for particular values of the damping factor. In Proc. of the 15th WWW, pages 941–942. ACM Press, 2006.
[6] L. Becchetti, C. Castillo, D. Donato, and A. Fazzone. A comparison of sampling techniques for web characterization. In Proc. of LinkKDD, August 2006.
[7] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web: Experiments and models. In Proc. of the 9th WWW, pages 309–320. ACM Press, May 2000.
[8] L. S. Buriol, D. Donato, S. Leonardi, and S. Millozzi. Link and temporal analysis of wikigraphs. Technical report, DELIS – Dynamically Evolving, Large-Scale Information Systems, 2005.
[9] D. Donato, L. Laura, S. Leonardi, and S. Millozzi. Large scale properties of the webgraph. European Physical Journal B, 38:239–243, March 2004.
[10] D. Donato, S. Leonardi, S. Millozzi, and P. Tsaparas. Mining the inner structure of the web graph. In Proc. of the 8th WebDB, June 2005.
[11] A. Gulli and A. Signorini. The indexable Web is more than 11.5 billion pages. In Poster proc. of the 14th WWW, pages 902–903. ACM Press, 2005.
[12] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. In Proc. of the 12th WWW, pages 568–576. ACM Press, 2003.
[13] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proc. of the 11th KDD, pages 177–187. ACM Press, 2005.
[14] S. Millozzi, D. Donato, L. Laura, and S. Leonardi. COSIN tools: a library for generating and measuring massive webgraphs, 2003.
[15] A. Ntoulas, J. Cho, and C. Olston. What’s new on the web?: the evolution of the web from a search engine perspective. In Proc. of the 13th WWW. ACM Press, May 2004.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.
[17] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize Web structure. In Proc. of the 8th COCOON, volume 2387 of LNCS, pages 330–390. Springer, August 2002.
[18] F. Viegas, M. Wattenberg, and K. Dave. Studying cooperation and conflict between authors with history flow visualizations. In Proc. of SIGCHI, pages 575–582, 2004.