Learning from Labeled and Unlabeled Data on a Directed Graph
to have the same label. Thus we define a functional

\[
\Omega(f) := \frac{1}{2}\sum_{[u,v]\in E}\pi(u)\,p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}, \tag{1}
\]

which sums the weighted variation of a function on each edge of the directed graph. On the other hand, the initial label assignment should be changed as little as possible. Let y denote the function in H(V) defined by y(v) = 1 or −1 if vertex v has been labeled as positive or negative respectively, and 0 if it is unlabeled. Thus we may consider the optimization problem

\[
\operatorname*{argmin}_{f\in H(V)}\;\Omega(f)+\mu\|f-y\|^{2}, \tag{2}
\]

where μ > 0 is the parameter specifying the tradeoff between the two competing terms.

We now motivate the functional defined by (1). At the end of this section, this functional is compared with another choice which may seem more natural; the comparison offers some insight into it. In Section 4, it will be shown that this functional may be naturally derived from a relaxation of the normalized cut criterion for directed graphs.

Lemma 3.1. Let Δ = I − Θ, where I denotes the identity and Θ is the operator whose matrix form is given in (16) below. Then Ω(f) = ⟨f, Δf⟩.

Proof. The idea is to use summation by parts, a discrete analogue of the more common integration by parts:

\[
\sum_{[u,v]\in E}\pi(u)p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}
=\frac{1}{2}\sum_{v\in V}\left[\sum_{u\to v}\pi(u)p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}
+\sum_{v\to u}\pi(v)p(v,u)\left(\frac{f(v)}{\sqrt{\pi(v)}}-\frac{f(u)}{\sqrt{\pi(u)}}\right)^{2}\right]
\]
\[
=\frac{1}{2}\sum_{v\in V}\left[\sum_{u\to v}p(u,v)f^{2}(u)+\sum_{u\to v}\frac{\pi(u)p(u,v)}{\pi(v)}f^{2}(v)-2\sum_{u\to v}\frac{\pi(u)p(u,v)}{\sqrt{\pi(u)\pi(v)}}f(u)f(v)\right]
\]
\[
\phantom{=}+\frac{1}{2}\sum_{v\in V}\left[\sum_{v\to u}p(v,u)f^{2}(v)+\sum_{v\to u}\frac{\pi(v)p(v,u)}{\pi(u)}f^{2}(u)-2\sum_{v\to u}\frac{\pi(v)p(v,u)}{\sqrt{\pi(v)\pi(u)}}f(v)f(u)\right].
\]

Since Σ_{v: u→v} p(u,v) = 1 and Σ_{u: u→v} π(u)p(u,v) = π(v), the squared terms sum to 2‖f‖², while the cross terms sum to −2⟨f, Θf⟩. Hence Ω(f) = ‖f‖² − ⟨f, Θf⟩ = ⟨f, Δf⟩.
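Setting the derivative of the objective in (2) to zero gives a closed-form minimizer. The following short derivation is a sketch using Lemma 3.1; the result is the closed-form solution referred to below as Theorem 3.3, with the shorthand α := 1/(1+μ):

\[
\nabla_{\!f}\Bigl(\langle f,(I-\Theta)f\rangle+\mu\|f-y\|^{2}\Bigr)=2(I-\Theta)f+2\mu(f-y)=0
\;\Longrightarrow\;\bigl((1+\mu)I-\Theta\bigr)f=\mu y,
\]
\[
f^{*}=\mu\bigl((1+\mu)I-\Theta\bigr)^{-1}y=(1-\alpha)\bigl(I-\alpha\Theta\bigr)^{-1}y,\qquad \alpha=\frac{1}{1+\mu}.
\]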
Lemma 3.2. The eigenvalues of the operator Θ are in [−1, 1], and the eigenvector with the eigenvalue equal to 1 is √π.

Proof. It is easy to see that Θ is similar to the operator Ψ : H(V) → H(V) defined by Ψ = (P + Π⁻¹PᵀΠ)/2. Hence Θ and Ψ have the same set of eigenvalues. Assume that f is an eigenvector of Ψ with eigenvalue λ. Choose a vertex v such that |f(v)| = max_{u∈V} |f(u)|. Then we can show that |λ| ≤ 1 by

\[
|\lambda|\,|f(v)|=\Bigl|\sum_{u\in V}\Psi(v,u)f(u)\Bigr|\le\sum_{u\in V}\Psi(v,u)\,|f(v)|
=\frac{|f(v)|}{2}\left(\sum_{v\to u}p(v,u)+\sum_{u\to v}\frac{\pi(u)p(u,v)}{\pi(v)}\right)=|f(v)|.
\]

A canonical choice of random walk is the teleporting random walk. Given that we are currently at a vertex u with d⁺(u) > 0, the next step of this random walk proceeds as follows: (1) with probability 1 − η, jump to a vertex chosen uniformly at random over the whole vertex set except u; and (2) with probability η w([u,v])/d⁺(u), jump to a vertex v adjacent from u. If we are at a vertex u with d⁺(u) = 0, just jump to a vertex chosen uniformly at random over the whole vertex set except u.

Algorithm. Given a directed graph G = (V, E) and a label set Y = {1, −1}, the vertices in a subset S ⊂ V are labeled. Then the remaining unlabeled vertices may be classified as follows:

1. Define a random walk over G with a transition probability matrix P such that it has a unique stationary distribution, such as the teleporting random walk.

2. Let Π denote the diagonal matrix with its diagonal elements being the stationary distribution π of the random walk. Compute the matrix Θ = (Π^{1/2}PΠ^{−1/2} + Π^{−1/2}PᵀΠ^{1/2})/2.

3. Compute the function f = (1 − α)(I − αΘ)^{−1}y, where α = 1/(1 + μ) and y(v) = 1 or −1 if v is labeled positive or negative, and 0 otherwise.

4. Classify each unlabeled vertex v as sign f(v).
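A compact NumPy sketch of this classification procedure follows. The helper names, the teleport parameter eta, and the default μ = 0.1 are illustrative assumptions, and the stationary distribution is obtained by simple power iteration:

```python
import numpy as np

def teleporting_walk(A, eta=0.99):
    """Transition matrix of the teleporting random walk on adjacency matrix A.

    With probability eta follow an out-link of u (weighted by A); with
    probability 1 - eta (or always, if u has no out-links) jump uniformly
    to a vertex other than u.
    """
    n = A.shape[0]
    uniform = (np.ones((n, n)) - np.eye(n)) / (n - 1)
    P = np.zeros((n, n))
    for u in range(n):
        d_out = A[u].sum()
        if d_out > 0:
            P[u] = eta * A[u] / d_out + (1 - eta) * uniform[u]
        else:
            P[u] = uniform[u]
    return P

def stationary(P, iters=1000):
    """Stationary distribution pi of P by power iteration (pi = pi P)."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi / pi.sum()

def classify(A, y, mu=0.1, eta=0.99):
    """Steps 1-4: closed form f* = (1 - alpha)(I - alpha Theta)^{-1} y."""
    P = teleporting_walk(A, eta)
    pi = stationary(P)
    Pi_h, Pi_ih = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    Theta = (Pi_h @ P @ Pi_ih + Pi_ih @ P.T @ Pi_h) / 2
    alpha = 1.0 / (1.0 + mu)
    f = (1 - alpha) * np.linalg.solve(np.eye(len(y)) - alpha * Theta, y)
    return np.sign(f)
```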
The ratio of vol ∂S to vol S is the probability with which the random walk leaves S in the next step under the condition that it is in fact in S now. The ratio of vol ∂S^c to vol S^c may be understood similarly.

In the following, we show that the functional (1) can be recovered from (6). Define an indicator function h ∈ H(V) by h(v) = 1 if v ∈ S, and −1 if v ∈ S^c. Denote by ρ the volume of S. Clearly, we have 0 < ρ < 1 due to S ⊊ G. Then (6) may be written

\[
\mathrm{Ncut}(S)=\frac{\sum_{[u,v]\in E}\pi(u)p(u,v)\bigl(h(u)-h(v)\bigr)^{2}}{8\rho(1-\rho)}.
\]

Define another function g ∈ H(V) by g(v) = 2(1−ρ) if v ∈ S, and −2ρ if v ∈ S^c. We easily see that sign g(v) = sign h(v) for all v ∈ V and h(u) − h(v) = g(u) − g(v) for all u, v ∈ V. Moreover, it is not hard to see that Σ_{v∈V} π(v)g(v) = 0 and Σ_{v∈V} π(v)g²(v) = 4ρ(1−ρ). Therefore

\[
\mathrm{Ncut}(S)=\frac{\sum_{[u,v]\in E}\pi(u)p(u,v)\bigl(g(u)-g(v)\bigr)^{2}}{2\sum_{v\in V}\pi(v)g^{2}(v)}.
\]

Define another function f = Π^{1/2}g. Then the above equation may be further transformed into

\[
\mathrm{Ncut}(S)=\frac{\sum_{[u,v]\in E}\pi(u)p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}}{2\langle f,f\rangle}.
\]

If we allow the function f to take arbitrary real values, then the graph partition problem (6) becomes

\[
\operatorname*{argmin}_{f\in H(V)}\;\Omega(f) \tag{7}
\]

subject to ‖f‖ = 1 and ⟨f, √π⟩ = 0.

From Lemma 3.2, it is easy to see that the solution of (7) is the normalized eigenvector of the operator Θ with the second largest eigenvalue.

Algorithm. Given a directed graph G = (V, E), it may be partitioned into two parts as follows:

1. Define a random walk over G with a transition probability matrix P such that it has a unique stationary distribution.

2. Let Π denote the diagonal matrix with its diagonal elements being the stationary distribution π of the random walk. Compute the matrix Θ = (Π^{1/2}PΠ^{−1/2} + Π^{−1/2}PᵀΠ^{1/2})/2.

3. Compute the eigenvector Φ of Θ corresponding to the second largest eigenvalue, and then partition the vertex set V of G into two parts S = {v ∈ V | Φ(v) ≥ 0} and S^c = {v ∈ V | Φ(v) < 0}.

It is easy to extend this approach to k-partition. Assume a k-partition to be V = V₁ ∪ V₂ ∪ ⋯ ∪ V_k, where V_i ∩ V_j = ∅ for all i ≠ j. Let P_k denote such a k-partition. Then we may obtain a k-partition by minimizing

\[
\mathrm{Ncut}(P_k)=\sum_{1\le i\le k}\frac{\mathrm{vol}\,\partial V_i}{\mathrm{vol}\,V_i}. \tag{8}
\]

It is not hard to show that the solution of the relaxed optimization problem corresponding to (8) can be any orthonormal basis for the linear space spanned by the eigenvectors of Θ pertaining to the k largest eigenvalues.
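A minimal sketch of the two-way spectral partition, reusing the hypothetical teleporting_walk and stationary helpers from the classification sketch above:

```python
import numpy as np

def partition(A, eta=0.99):
    """2-partition by the eigenvector of Theta for the 2nd largest eigenvalue."""
    P = teleporting_walk(A, eta)
    pi = stationary(P)
    Pi_h, Pi_ih = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    Theta = (Pi_h @ P @ Pi_ih + Pi_ih @ P.T @ Pi_h) / 2
    # Theta is symmetric, so eigh returns real eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(Theta)
    phi = vecs[:, -2]                  # eigenvector of the 2nd largest eigenvalue
    S = np.flatnonzero(phi >= 0)
    S_c = np.flatnonzero(phi < 0)
    return S, S_c
```

For a k-partition one would instead keep the eigenvectors of the k largest eigenvalues and cluster their rows; the clustering step is not specified by the relaxation of (8).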
5. Discrete Analysis

We develop discrete analysis on directed graphs. The regularization framework in Section 3 is then reconstructed and generalized using discrete analysis. This work is the discrete analogue of classic regularization theory (Tikhonov & Arsenin, 1977; Wahba, 1990).

We define the graph gradient to be an operator ∇ : H(V) → H(E) which satisfies

\[
(\nabla f)([u,v]):=\sqrt{\pi(u)}\left(\sqrt{\frac{p(u,v)}{\pi(v)}}\,f(v)-\sqrt{\frac{p(u,v)}{\pi(u)}}\,f(u)\right). \tag{9}
\]

For an undirected graph, equation (9) reduces to

\[
(\nabla f)([u,v])=\sqrt{\frac{w([u,v])}{d(v)}}\,f(v)-\sqrt{\frac{w([u,v])}{d(u)}}\,f(u).
\]

We may also define the graph gradient of a function f at each vertex v as ∇f(v) := {(∇f)([v,u]) | [v,u] ∈ E}, which is often denoted by ∇_v f. The norm of the graph gradient ∇f at v is then defined by

\[
\|\nabla_{v}f\|:=\Bigl(\sum_{v\to u}(\nabla f)^{2}([v,u])\Bigr)^{1/2}, \tag{10}
\]

and the p-Dirichlet form by

\[
\Omega_{p}(f):=\frac{1}{2}\sum_{v\in V}\|\nabla_{v}f\|^{p},\qquad p\in[1,\infty[. \tag{11}
\]

Note that Ω₂(f) = Ω(f). Intuitively, the norm of the graph gradient measures the smoothness of a function around a vertex, and the p-Dirichlet form the smoothness of a function over the whole graph. In addition, we define ‖∇f([v,u])‖ := ‖∇_v f‖. Note that ‖∇f‖ is defined in the space H(E) as ‖∇f‖ = ⟨∇f, ∇f⟩^{1/2}_{H(E)}.
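A small numerical sketch of definitions (9)-(11), with P and pi as produced by the hypothetical helpers above and the edge set taken to be the support of P:

```python
import numpy as np

def grad(P, pi, f):
    """Graph gradient (9) as a matrix G with G[u, v] = (grad f)([u, v])."""
    n = len(f)
    G = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            if P[u, v] > 0:  # edge [u, v]
                G[u, v] = np.sqrt(pi[u]) * (np.sqrt(P[u, v] / pi[v]) * f[v]
                                            - np.sqrt(P[u, v] / pi[u]) * f[u])
    return G

def dirichlet(P, pi, f, p=2):
    """p-Dirichlet form (11): half the sum over v of ||grad_v f||^p, cf. (10)."""
    G = grad(P, pi, f)
    vertex_norms = np.sqrt((G ** 2).sum(axis=1))  # row v collects edges leaving v
    return 0.5 * (vertex_norms ** p).sum()
```

For p = 2 this reproduces Ω(f) in (1), since squaring removes the square roots edge by edge.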
We define the graph divergence to be an operator div : H(E) → H(V) which satisfies

\[
\langle\nabla f,g\rangle_{H(E)}=\langle f,-\operatorname{div}g\rangle_{H(V)} \tag{12}
\]
for any function f in H(V) and any function g in H(E). Equation (12) is a discrete analogue of the Stokes theorem¹. It is not hard to show that

\[
(\operatorname{div}g)(v)=\frac{1}{\sqrt{\pi(v)}}\left(\sum_{v\to u}\sqrt{\pi(v)p(v,u)}\,g([v,u])-\sum_{u\to v}\sqrt{\pi(u)p(u,v)}\,g([u,v])\right). \tag{13}
\]

Intuitively, we may think of the graph divergence (div g)(v) as the net outflow of the function g at the vertex v. For a function c : E → ℝ defined by c([u,v]) = √(π(u)p(u,v)), it follows from equation (13) that (div c)(v) = 0 at every vertex v in V.

We define the graph Laplacian to be an operator Δ : H(V) → H(V) which satisfies²

\[
\Delta f:=-\frac{1}{2}\operatorname{div}(\nabla f). \tag{14}
\]

We easily know that the graph Laplacian is linear, self-adjoint and positive semi-definite. Substituting (9) and (13) into (14), we obtain

\[
(\Delta f)(v)=f(v)-\frac{1}{2}\left(\sum_{u\to v}\frac{\pi(u)p(u,v)f(u)}{\sqrt{\pi(u)\pi(v)}}+\sum_{v\to u}\frac{\pi(v)p(v,u)f(u)}{\sqrt{\pi(v)\pi(u)}}\right). \tag{15}
\]

In matrix notation, Δ can be written as

\[
\Delta=I-\frac{\Pi^{1/2}P\Pi^{-1/2}+\Pi^{-1/2}P^{T}\Pi^{1/2}}{2}, \tag{16}
\]

which is just the Laplace matrix for directed graphs proposed by Chung (to appear). For an undirected graph, equation (16) clearly reduces to the Laplacian for undirected graphs (Chung, 1997).

We define the graph p-Laplacian to be an operator Δ_p : H(V) → H(V) which satisfies

\[
\Delta_{p}f:=-\frac{1}{2}\operatorname{div}\bigl(\|\nabla f\|^{p-2}\nabla f\bigr). \tag{17}
\]

Clearly, Δ₂ = Δ, and Δ_p (p ≠ 2) is nonlinear. In addition, Ω_p(f) = ⟨Δ_p f, f⟩.
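The matrix form (16) makes the framework easy to check numerically. A small sketch, reusing the hypothetical P, pi, and dirichlet helpers from the earlier sketches:

```python
import numpy as np

def laplacian(P, pi):
    """Laplace matrix (16) for directed graphs: Delta = I - Theta."""
    Pi_h, Pi_ih = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    Theta = (Pi_h @ P @ Pi_ih + Pi_ih @ P.T @ Pi_h) / 2
    return np.eye(len(pi)) - Theta

# Consistency check with Lemma 3.1 and (11): for any f,
# <f, Delta f> should coincide with the 2-Dirichlet form:
# f = np.random.randn(len(pi))
# assert np.isclose(f @ laplacian(P, pi) @ f, dirichlet(P, pi, f, p=2))
```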
Now we consider the optimization problem

\[
\operatorname*{argmin}_{f\in H(V)}\;\Omega_{p}(f)+\mu\|f-y\|^{2}. \tag{18}
\]

Let f* denote the solution of (18). It is not hard to show that

\[
p\,\Delta_{p}f^{*}+2\mu(f^{*}-y)=0. \tag{19}
\]

When p = 2, this reduces to Δf* + μ(f* − y) = 0, as we have shown before, which leads to the closed form solution in Theorem 3.3. When p ≠ 2, we are not aware of any closed form solution.
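For p ≠ 2 the optimality condition (19) can still be exploited iteratively. The text proposes no particular scheme; the following plain gradient-descent sketch on the objective of (18) is an assumed illustration, reusing the hypothetical grad helper from above (a small eps guards zero gradient norms when p < 2):

```python
import numpy as np

def p_laplacian(P, pi, f, p, eps=1e-12):
    """Graph p-Laplacian (17): -(1/2) div(||grad f||^{p-2} grad f), via (13)."""
    n = len(f)
    G = grad(P, pi, f)                           # gradient matrix, eq. (9)
    norms = np.sqrt((G ** 2).sum(axis=1)) + eps  # ||grad_v f||, eq. (10)
    H = (norms[:, None] ** (p - 2)) * G          # edge norm uses the tail vertex
    div = np.zeros(n)
    for v in range(n):
        out_flow = sum(np.sqrt(pi[v] * P[v, u]) * H[v, u]
                       for u in range(n) if P[v, u] > 0)
        in_flow = sum(np.sqrt(pi[u] * P[u, v]) * H[u, v]
                      for u in range(n) if P[u, v] > 0)
        div[v] = (out_flow - in_flow) / np.sqrt(pi[v])  # eq. (13)
    return -0.5 * div

def solve_p(P, pi, y, p=3, mu=0.1, lr=0.1, iters=500):
    """Descend the objective of (18); the gradient is the left side of (19)."""
    f = y.astype(float).copy()
    for _ in range(iters):
        f -= lr * (p * p_laplacian(P, pi, f, p) + 2 * mu * (f - y))
    return f
```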
¹ Given a compact Riemannian manifold (M, g) with a function f ∈ C^∞(M) and a vector field X ∈ 𝔛(M), it follows from the Stokes theorem that ∫_M ⟨∇f, X⟩ = −∫_M (div X) f.

² The Laplace-Beltrami operator Δ : C^∞(M) → C^∞(M) is defined by Δf = −div(∇f). The additional factor 1/2 in (14) is due to edges being oriented.

6. Experiments

We address the web categorization task on the WebKB dataset (see http://www-2.cs.cmu.edu/webkb/). We only consider a subset containing the pages from the four universities Cornell, Texas, Washington and Wisconsin, from which we remove the isolated pages, i.e., the pages which have no incoming and no outgoing links, resulting in 858, 825, 1195 and 1238 pages respectively, for a total of 4116. These pages have been manually classified into the following seven categories: student, faculty, staff, department, course, project and other.

We may assign a weight to each hyperlink according to the textual content or the anchor text. However, here we are only interested in how much can be obtained from the link structure alone, and hence adopt the canonical weight function.

We compare the approach in Section 3 with its counterpart based on (5). Moreover, we also compare both methods with the schemes in (Zhou et al., 2005; Zhou et al., 2004). For the last approach, we transform a directed graph into an undirected one by defining a symmetric weight function with w([u,v]) = 1 if [u,v] or [v,u] is in E. To distinguish among these approaches, we refer to them as distribution regularization, degree regularization, second-order regularization and undirected regularization respectively. As we have shown, both the distribution and the degree regularization approaches are generalizations of the undirected regularization method.

The investigated task is to discriminate the student pages in a university from the non-student pages in the same university. We further remove the isolated pages in each university. Other categories, including faculty and course, are considered as well. For all approaches, the regularization parameter is set to μ = 0.1 as in (Zhou et al., 2005). In the distribution regularization approach, we adopt the teleporting random walk with a small jumping probability of 0.01 to obtain a unique and positive stationary distribution. The testing errors are averaged over 50 trials. In each trial, it is randomly decided which of the training points get labeled. If no labeled point exists for some class, we sample again.
[Figure 3: six panels, each plotting test error (y-axis) against the number of labeled points (x-axis, 2 to 20): (a) Cornell (student), (b) Texas (student), (c) Washington (student), (d) Wisconsin (student), (e) Cornell (faculty), (f) Cornell (course).]

Figure 3. Classification on the WebKB dataset. Figs. (a)-(d) depict the test errors of the regularization approaches on the classification problem of student vs. non-student in each university. Figs. (e)-(f) illustrate the test errors of these methods on the classification problems of faculty vs. non-faculty and course vs. non-course in Cornell University.
The experimental results are shown in Fig. 3. The distribution regularization approach shows significantly improved results in comparison to the degree regularization method. Furthermore, the distribution regularization approach is comparable with the second-order regularization one. In contrast, the degree regularization approach shows performance similar to the undirected regularization one. Therefore we can conclude that the degree regularization approach takes almost no account of edge directionality.

References

Chung, F. (1997). Spectral graph theory. No. 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society, Providence, RI.

Chung, F. (to appear). Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics.

Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning probabilistic models of link structure. Journal of Machine Learning Research, 3, 679-708.

Henzinger, M. (2001). Hyperlink analysis for the web. IEEE Internet Computing, 5, 45-50.

Henzinger, M. (2003). Algorithmic challenges in web search engines. Internet Mathematics, 1, 115-123.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604-632.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: bringing order to the web (Technical Report). Stanford University.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905.

Spielman, D., & Teng, S. (2003). Solving sparse, symmetric, diagonally-dominant linear systems in time O(m^1.31). Proc. 44th Annual IEEE Symposium on Foundations of Computer Science.

Tikhonov, A., & Arsenin, V. (1977). Solutions of ill-posed problems. W. H. Winston, Washington, DC.

Vapnik, V. (1998). Statistical learning theory. Wiley, NY.

Wahba, G. (1990). Spline models for observational data. No. 59 in CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia.

Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

Zhou, D., Schölkopf, B., & Hofmann, T. (2005). Semi-supervised learning on directed graphs. Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA.