Learning from Labeled and Unlabeled Data on a Directed Graph
to have the same label. Thus we define a functional

\[
\Omega(f) := \frac{1}{2}\sum_{[u,v]\in E}\pi(u)\,p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}, \tag{1}
\]

which sums the weighted variation of a function on each edge of the directed graph. On the other hand, the initial label assignment should be changed as little as possible. Let y denote the function in H(V) defined by y(v) = 1 or −1 if vertex v has been labeled as positive or negative respectively, and 0 if it is unlabeled. Thus we may consider the optimization problem

\[
\operatorname*{argmin}_{f\in H(V)}\;\Omega(f)+\mu\|f-y\|^{2}, \tag{2}
\]

where μ > 0 is the parameter specifying the tradeoff between the two competing terms.

We now motivate the functional defined by (1). At the end of this section, this functional is compared with another choice which may seem more natural; the comparison offers some insight into it. In Section 4, it will be shown that this functional may be naturally derived from a relaxation of the normalized cut criterion for directed graphs.

Lemma 3.1. Let Δ = I − Θ, where I denotes the identity and Θ is the operator whose matrix form is given in (16) below. Then Ω(f) = ⟨f, Δf⟩.

Proof. The idea is to use summation by parts, a discrete analogue of the more common integration by parts:

\[
\sum_{[u,v]\in E}\pi(u)p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}
=\frac{1}{2}\sum_{v\in V}\left[\sum_{u\to v}\pi(u)p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}
+\sum_{v\to u}\pi(v)p(v,u)\left(\frac{f(v)}{\sqrt{\pi(v)}}-\frac{f(u)}{\sqrt{\pi(u)}}\right)^{2}\right]
\]
\[
=\frac{1}{2}\sum_{v\in V}\left[\sum_{u\to v}p(u,v)f^{2}(u)+\sum_{u\to v}\frac{\pi(u)p(u,v)}{\pi(v)}f^{2}(v)-2\sum_{u\to v}\frac{\pi(u)p(u,v)}{\sqrt{\pi(u)\pi(v)}}f(u)f(v)\right]
\]
\[
\phantom{=}+\frac{1}{2}\sum_{v\in V}\left[\sum_{v\to u}p(v,u)f^{2}(v)+\sum_{v\to u}\frac{\pi(v)p(v,u)}{\pi(u)}f^{2}(u)-2\sum_{v\to u}\frac{\pi(v)p(v,u)}{\sqrt{\pi(v)\pi(u)}}f(v)f(u)\right].
\]

Since Σ_{v: u→v} p(u,v) = 1 and Σ_{u: u→v} π(u)p(u,v) = π(v), the squared terms sum to 2‖f‖², while the cross terms sum to −2⟨f, Θf⟩. Hence Ω(f) = ‖f‖² − ⟨f, Θf⟩ = ⟨f, Δf⟩.
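Setting the derivative of the objective in (2) to zero gives a closed-form minimizer. The following short derivation is a sketch using Lemma 3.1; the result is the closed-form solution referred to below as Theorem 3.3, with the shorthand α := 1/(1+μ):

\[
\nabla_{\!f}\Bigl(\langle f,(I-\Theta)f\rangle+\mu\|f-y\|^{2}\Bigr)=2(I-\Theta)f+2\mu(f-y)=0
\;\Longrightarrow\;\bigl((1+\mu)I-\Theta\bigr)f=\mu y,
\]
\[
f^{*}=\mu\bigl((1+\mu)I-\Theta\bigr)^{-1}y=(1-\alpha)\bigl(I-\alpha\Theta\bigr)^{-1}y,\qquad \alpha=\frac{1}{1+\mu}.
\]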
Lemma 3.2. The eigenvalues of the operator Θ are in [−1, 1], and the eigenvector with the eigenvalue equal to 1 is √π.

Proof. It is easy to see that Θ is similar to the operator Ψ : H(V) → H(V) defined by Ψ = (P + Π⁻¹PᵀΠ)/2. Hence Θ and Ψ have the same set of eigenvalues. Assume that f is an eigenvector of Ψ with eigenvalue λ. Choose a vertex v such that |f(v)| = max_{u∈V} |f(u)|. Then we can show that |λ| ≤ 1 by

\[
|\lambda|\,|f(v)|=\Bigl|\sum_{u\in V}\Psi(v,u)f(u)\Bigr|\le\sum_{u\in V}\Psi(v,u)\,|f(v)|
=\frac{|f(v)|}{2}\left(\sum_{v\to u}p(v,u)+\sum_{u\to v}\frac{\pi(u)p(u,v)}{\pi(v)}\right)=|f(v)|.
\]

A canonical choice of random walk is the teleporting random walk. Given that we are currently at a vertex u with d⁺(u) > 0, the next step of this random walk proceeds as follows: (1) with probability 1 − η, jump to a vertex chosen uniformly at random over the whole vertex set except u; and (2) with probability η w([u,v])/d⁺(u), jump to a vertex v adjacent from u. If we are at a vertex u with d⁺(u) = 0, just jump to a vertex chosen uniformly at random over the whole vertex set except u.

Algorithm. Given a directed graph G = (V, E) and a label set Y = {1, −1}, the vertices in a subset S ⊂ V are labeled. Then the remaining unlabeled vertices may be classified as follows:

1. Define a random walk over G with a transition probability matrix P such that it has a unique stationary distribution, such as the teleporting random walk.

2. Let Π denote the diagonal matrix with its diagonal elements being the stationary distribution π of the random walk. Compute the matrix Θ = (Π^{1/2}PΠ^{−1/2} + Π^{−1/2}PᵀΠ^{1/2})/2.

3. Compute the function f = (1 − α)(I − αΘ)^{−1}y, where α = 1/(1 + μ) and y(v) = 1 or −1 if v is labeled positive or negative, and 0 otherwise.

4. Classify each unlabeled vertex v as sign f(v).
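A compact NumPy sketch of this classification procedure follows. The helper names, the teleport parameter eta, and the default μ = 0.1 are illustrative assumptions, and the stationary distribution is obtained by simple power iteration:

```python
import numpy as np

def teleporting_walk(A, eta=0.99):
    """Transition matrix of the teleporting random walk on adjacency matrix A.

    With probability eta follow an out-link of u (weighted by A); with
    probability 1 - eta (or always, if u has no out-links) jump uniformly
    to a vertex other than u.
    """
    n = A.shape[0]
    uniform = (np.ones((n, n)) - np.eye(n)) / (n - 1)
    P = np.zeros((n, n))
    for u in range(n):
        d_out = A[u].sum()
        if d_out > 0:
            P[u] = eta * A[u] / d_out + (1 - eta) * uniform[u]
        else:
            P[u] = uniform[u]
    return P

def stationary(P, iters=1000):
    """Stationary distribution pi of P by power iteration (pi = pi P)."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi / pi.sum()

def classify(A, y, mu=0.1, eta=0.99):
    """Steps 1-4: closed form f* = (1 - alpha)(I - alpha Theta)^{-1} y."""
    P = teleporting_walk(A, eta)
    pi = stationary(P)
    Pi_h, Pi_ih = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    Theta = (Pi_h @ P @ Pi_ih + Pi_ih @ P.T @ Pi_h) / 2
    alpha = 1.0 / (1.0 + mu)
    f = (1 - alpha) * np.linalg.solve(np.eye(len(y)) - alpha * Theta, y)
    return np.sign(f)
```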
The ratio of vol ∂S to vol S is the probability with which the random walk leaves S in the next step under the condition that it is in fact in S now. The ratio of vol ∂S^c to vol S^c may be understood similarly.

In the following, we show that the functional (1) can be recovered from (6). Define an indicator function h ∈ H(V) by h(v) = 1 if v ∈ S, and −1 if v ∈ S^c. Denote by ρ the volume of S. Clearly, we have 0 < ρ < 1 due to S ⊊ G. Then (6) may be written

\[
\mathrm{Ncut}(S)=\frac{\sum_{[u,v]\in E}\pi(u)p(u,v)\bigl(h(u)-h(v)\bigr)^{2}}{8\rho(1-\rho)}.
\]

Define another function g ∈ H(V) by g(v) = 2(1−ρ) if v ∈ S, and −2ρ if v ∈ S^c. We easily see that sign g(v) = sign h(v) for all v ∈ V and h(u) − h(v) = g(u) − g(v) for all u, v ∈ V. Moreover, it is not hard to see that Σ_{v∈V} π(v)g(v) = 0 and Σ_{v∈V} π(v)g²(v) = 4ρ(1−ρ). Therefore

\[
\mathrm{Ncut}(S)=\frac{\sum_{[u,v]\in E}\pi(u)p(u,v)\bigl(g(u)-g(v)\bigr)^{2}}{2\sum_{v\in V}\pi(v)g^{2}(v)}.
\]

Define another function f = Π^{1/2}g. Then the above equation may be further transformed into

\[
\mathrm{Ncut}(S)=\frac{\sum_{[u,v]\in E}\pi(u)p(u,v)\left(\frac{f(u)}{\sqrt{\pi(u)}}-\frac{f(v)}{\sqrt{\pi(v)}}\right)^{2}}{2\langle f,f\rangle}.
\]

If we allow the function f to take arbitrary real values, then the graph partition problem (6) becomes

\[
\operatorname*{argmin}_{f\in H(V)}\;\Omega(f) \tag{7}
\]

subject to ‖f‖ = 1 and ⟨f, √π⟩ = 0.

From Lemma 3.2, it is easy to see that the solution of (7) is the normalized eigenvector of the operator Θ with the second largest eigenvalue.

Algorithm. Given a directed graph G = (V, E), it may be partitioned into two parts as follows:

1. Define a random walk over G with a transition probability matrix P such that it has a unique stationary distribution.

2. Let Π denote the diagonal matrix with its diagonal elements being the stationary distribution π of the random walk. Compute the matrix Θ = (Π^{1/2}PΠ^{−1/2} + Π^{−1/2}PᵀΠ^{1/2})/2.

3. Compute the eigenvector Φ of Θ corresponding to the second largest eigenvalue, and then partition the vertex set V of G into two parts S = {v ∈ V | Φ(v) ≥ 0} and S^c = {v ∈ V | Φ(v) < 0}.

It is easy to extend this approach to k-partition. Assume a k-partition to be V = V₁ ∪ V₂ ∪ ⋯ ∪ V_k, where V_i ∩ V_j = ∅ for all i ≠ j. Let P_k denote such a k-partition. Then we may obtain a k-partition by minimizing

\[
\mathrm{Ncut}(P_k)=\sum_{1\le i\le k}\frac{\mathrm{vol}\,\partial V_i}{\mathrm{vol}\,V_i}. \tag{8}
\]

It is not hard to show that the solution of the relaxed optimization problem corresponding to (8) can be any orthonormal basis for the linear space spanned by the eigenvectors of Θ pertaining to the k largest eigenvalues.
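A minimal sketch of the two-way spectral partition, reusing the hypothetical teleporting_walk and stationary helpers from the classification sketch above:

```python
import numpy as np

def partition(A, eta=0.99):
    """2-partition by the eigenvector of Theta for the 2nd largest eigenvalue."""
    P = teleporting_walk(A, eta)
    pi = stationary(P)
    Pi_h, Pi_ih = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    Theta = (Pi_h @ P @ Pi_ih + Pi_ih @ P.T @ Pi_h) / 2
    # Theta is symmetric, so eigh returns real eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(Theta)
    phi = vecs[:, -2]                  # eigenvector of the 2nd largest eigenvalue
    S = np.flatnonzero(phi >= 0)
    S_c = np.flatnonzero(phi < 0)
    return S, S_c
```

For a k-partition one would instead keep the eigenvectors of the k largest eigenvalues and cluster their rows; the clustering step is not specified by the relaxation of (8).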
5. Discrete Analysis

We develop discrete analysis on directed graphs. The regularization framework in Section 3 is then reconstructed and generalized using discrete analysis. This work is the discrete analogue of classic regularization theory (Tikhonov & Arsenin, 1977; Wahba, 1990).

We define the graph gradient to be an operator ∇ : H(V) → H(E) which satisfies

\[
(\nabla f)([u,v]):=\sqrt{\pi(u)}\left(\sqrt{\frac{p(u,v)}{\pi(v)}}\,f(v)-\sqrt{\frac{p(u,v)}{\pi(u)}}\,f(u)\right). \tag{9}
\]

For an undirected graph, equation (9) reduces to

\[
(\nabla f)([u,v])=\sqrt{\frac{w([u,v])}{d(v)}}\,f(v)-\sqrt{\frac{w([u,v])}{d(u)}}\,f(u).
\]

We may also define the graph gradient of a function f at each vertex v as ∇f(v) := {(∇f)([v,u]) | [v,u] ∈ E}, which is often denoted by ∇_v f. The norm of the graph gradient ∇f at v is then defined by

\[
\|\nabla_{v}f\|:=\Bigl(\sum_{v\to u}(\nabla f)^{2}([v,u])\Bigr)^{1/2}, \tag{10}
\]

and the p-Dirichlet form by

\[
\Omega_{p}(f):=\frac{1}{2}\sum_{v\in V}\|\nabla_{v}f\|^{p},\qquad p\in[1,\infty[. \tag{11}
\]

Note that Ω₂(f) = Ω(f). Intuitively, the norm of the graph gradient measures the smoothness of a function around a vertex, and the p-Dirichlet form the smoothness of a function over the whole graph. In addition, we define ‖∇f([v,u])‖ := ‖∇_v f‖. Note that ‖∇f‖ is defined in the space H(E) as ‖∇f‖ = ⟨∇f, ∇f⟩^{1/2}_{H(E)}.
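A small numerical sketch of definitions (9)-(11), with P and pi as produced by the hypothetical helpers above and the edge set taken to be the support of P:

```python
import numpy as np

def grad(P, pi, f):
    """Graph gradient (9) as a matrix G with G[u, v] = (grad f)([u, v])."""
    n = len(f)
    G = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            if P[u, v] > 0:  # edge [u, v]
                G[u, v] = np.sqrt(pi[u]) * (np.sqrt(P[u, v] / pi[v]) * f[v]
                                            - np.sqrt(P[u, v] / pi[u]) * f[u])
    return G

def dirichlet(P, pi, f, p=2):
    """p-Dirichlet form (11): half the sum over v of ||grad_v f||^p, cf. (10)."""
    G = grad(P, pi, f)
    vertex_norms = np.sqrt((G ** 2).sum(axis=1))  # row v collects edges leaving v
    return 0.5 * (vertex_norms ** p).sum()
```

For p = 2 this reproduces Ω(f) in (1), since squaring removes the square roots edge by edge.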
We define the graph divergence to be an operator div : H(E) → H(V) which satisfies

\[
\langle\nabla f,g\rangle_{H(E)}=\langle f,-\operatorname{div}g\rangle_{H(V)} \tag{12}
\]
for any function f in H(V) and any function g in H(E). Equation (12) is a discrete analogue of the Stokes theorem¹. It is not hard to show that

\[
(\operatorname{div}g)(v)=\frac{1}{\sqrt{\pi(v)}}\left(\sum_{v\to u}\sqrt{\pi(v)p(v,u)}\,g([v,u])-\sum_{u\to v}\sqrt{\pi(u)p(u,v)}\,g([u,v])\right). \tag{13}
\]

Intuitively, we may think of the graph divergence (div g)(v) as the net outflow of the function g at the vertex v. For a function c : E → ℝ defined by c([u,v]) = √(π(u)p(u,v)), it follows from equation (13) that (div c)(v) = 0 at every vertex v in V.

We define the graph Laplacian to be an operator Δ : H(V) → H(V) which satisfies²

\[
\Delta f:=-\frac{1}{2}\operatorname{div}(\nabla f). \tag{14}
\]

We easily know that the graph Laplacian is linear, self-adjoint and positive semi-definite. Substituting (9) and (13) into (14), we obtain

\[
(\Delta f)(v)=f(v)-\frac{1}{2}\left(\sum_{u\to v}\frac{\pi(u)p(u,v)f(u)}{\sqrt{\pi(u)\pi(v)}}+\sum_{v\to u}\frac{\pi(v)p(v,u)f(u)}{\sqrt{\pi(v)\pi(u)}}\right). \tag{15}
\]

In matrix notation, Δ can be written as

\[
\Delta=I-\frac{\Pi^{1/2}P\Pi^{-1/2}+\Pi^{-1/2}P^{T}\Pi^{1/2}}{2}, \tag{16}
\]

which is just the Laplace matrix for directed graphs proposed by Chung (to appear). For an undirected graph, equation (16) clearly reduces to the Laplacian for undirected graphs (Chung, 1997).

We define the graph p-Laplacian to be an operator Δ_p : H(V) → H(V) which satisfies

\[
\Delta_{p}f:=-\frac{1}{2}\operatorname{div}\bigl(\|\nabla f\|^{p-2}\nabla f\bigr). \tag{17}
\]

Clearly, Δ₂ = Δ, and Δ_p (p ≠ 2) is nonlinear. In addition, Ω_p(f) = ⟨Δ_p f, f⟩.
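The matrix form (16) makes the framework easy to check numerically. A small sketch, reusing the hypothetical P, pi, and dirichlet helpers from the earlier sketches:

```python
import numpy as np

def laplacian(P, pi):
    """Laplace matrix (16) for directed graphs: Delta = I - Theta."""
    Pi_h, Pi_ih = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    Theta = (Pi_h @ P @ Pi_ih + Pi_ih @ P.T @ Pi_h) / 2
    return np.eye(len(pi)) - Theta

# Consistency check with Lemma 3.1 and (11): for any f,
# <f, Delta f> should coincide with the 2-Dirichlet form:
# f = np.random.randn(len(pi))
# assert np.isclose(f @ laplacian(P, pi) @ f, dirichlet(P, pi, f, p=2))
```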
Now we consider the optimization problem

\[
\operatorname*{argmin}_{f\in H(V)}\;\Omega_{p}(f)+\mu\|f-y\|^{2}. \tag{18}
\]

Let f* denote the solution of (18). It is not hard to show that

\[
p\,\Delta_{p}f^{*}+2\mu(f^{*}-y)=0. \tag{19}
\]

When p = 2, this reduces to Δf* + μ(f* − y) = 0, as we have shown before, which leads to the closed form solution in Theorem 3.3. When p ≠ 2, we are not aware of any closed form solution.
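For p ≠ 2 the optimality condition (19) can still be exploited iteratively. The text proposes no particular scheme; the following plain gradient-descent sketch on the objective of (18) is an assumed illustration, reusing the hypothetical grad helper from above (a small eps guards zero gradient norms when p < 2):

```python
import numpy as np

def p_laplacian(P, pi, f, p, eps=1e-12):
    """Graph p-Laplacian (17): -(1/2) div(||grad f||^{p-2} grad f), via (13)."""
    n = len(f)
    G = grad(P, pi, f)                           # gradient matrix, eq. (9)
    norms = np.sqrt((G ** 2).sum(axis=1)) + eps  # ||grad_v f||, eq. (10)
    H = (norms[:, None] ** (p - 2)) * G          # edge norm uses the tail vertex
    div = np.zeros(n)
    for v in range(n):
        out_flow = sum(np.sqrt(pi[v] * P[v, u]) * H[v, u]
                       for u in range(n) if P[v, u] > 0)
        in_flow = sum(np.sqrt(pi[u] * P[u, v]) * H[u, v]
                      for u in range(n) if P[u, v] > 0)
        div[v] = (out_flow - in_flow) / np.sqrt(pi[v])  # eq. (13)
    return -0.5 * div

def solve_p(P, pi, y, p=3, mu=0.1, lr=0.1, iters=500):
    """Descend the objective of (18); the gradient is the left side of (19)."""
    f = y.astype(float).copy()
    for _ in range(iters):
        f -= lr * (p * p_laplacian(P, pi, f, p) + 2 * mu * (f - y))
    return f
```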
¹ Given a compact Riemannian manifold (M, g) with a function f ∈ C^∞(M) and a vector field X ∈ 𝔛(M), it follows from the Stokes theorem that ∫_M ⟨∇f, X⟩ = −∫_M (div X) f.

² The Laplace-Beltrami operator Δ : C^∞(M) → C^∞(M) is defined by Δf = −div(∇f). The additional factor 1/2 in (14) is due to edges being oriented.

6. Experiments

We address the web categorization task on the WebKB dataset (see http://www-2.cs.cmu.edu/webkb/). We only consider a subset containing the pages from the four universities Cornell, Texas, Washington and Wisconsin, from which we remove the isolated pages, i.e., the pages which have no incoming and no outgoing links, resulting in 858, 825, 1195 and 1238 pages respectively, for a total of 4116. These pages have been manually classified into the following seven categories: student, faculty, staff, department, course, project and other.

We may assign a weight to each hyperlink according to the textual content or the anchor text. However, here we are only interested in how much can be obtained from the link structure alone, and hence adopt the canonical weight function.

We compare the approach in Section 3 with its counterpart based on (5). Moreover, we also compare both methods with the schemes in (Zhou et al., 2005; Zhou et al., 2004). For the last approach, we transform a directed graph into an undirected one by defining a symmetric weight function with w([u,v]) = 1 if [u,v] or [v,u] is in E. To distinguish among these approaches, we refer to them as distribution regularization, degree regularization, second-order regularization and undirected regularization respectively. As we have shown, both the distribution and the degree regularization approaches are generalizations of the undirected regularization method.

The investigated task is to discriminate the student pages in a university from the non-student pages in the same university. We further remove the isolated pages in each university. Other categories, including faculty and course, are considered as well. For all approaches, the regularization parameter is set to μ = 0.1 as in (Zhou et al., 2005). In the distribution regularization approach, we adopt the teleporting random walk with a small jumping probability of 0.01 to obtain a unique and positive stationary distribution. The testing errors are averaged over 50 trials. In each trial, it is randomly decided which of the training points get labeled. If no labeled point exists for some class, we sample again.
[Figure 3: six panels, each plotting test error (y-axis) against the number of labeled points (x-axis, 2 to 20): (a) Cornell (student), (b) Texas (student), (c) Washington (student), (d) Wisconsin (student), (e) Cornell (faculty), (f) Cornell (course).]

Figure 3. Classification on the WebKB dataset. Figs. (a)-(d) depict the test errors of the regularization approaches on the classification problem of student vs. non-student in each university. Figs. (e)-(f) illustrate the test errors of these methods on the classification problems of faculty vs. non-faculty and course vs. non-course in Cornell University.
The experimental results are shown in Fig. 3. The distribution regularization approach shows significantly improved results in comparison to the degree regularization method. Furthermore, the distribution regularization approach is comparable with the second-order regularization one. In contrast, the degree regularization approach shows performance similar to the undirected regularization one. Therefore we can conclude that the degree regularization approach takes almost no account of edge directionality.

References

Chung, F. (1997). Spectral graph theory. No. 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society, Providence, RI.

Chung, F. (to appear). Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics.

Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning probabilistic models of link structure. Journal of Machine Learning Research, 3, 679-708.

Henzinger, M. (2001). Hyperlink analysis for the web. IEEE Internet Computing, 5, 45-50.

Henzinger, M. (2003). Algorithmic challenges in web search engines. Internet Mathematics, 1, 115-123.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604-632.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: bringing order to the web (Technical Report). Stanford University.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905.

Spielman, D., & Teng, S. (2003). Solving sparse, symmetric, diagonally-dominant linear systems in time O(m^1.31). Proc. 44th Annual IEEE Symposium on Foundations of Computer Science.

Tikhonov, A., & Arsenin, V. (1977). Solutions of ill-posed problems. W. H. Winston, Washington, DC.

Vapnik, V. (1998). Statistical learning theory. Wiley, NY.

Wahba, G. (1990). Spline models for observational data. No. 59 in CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia.

Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

Zhou, D., Schölkopf, B., & Hofmann, T. (2005). Semi-supervised learning on directed graphs. Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA.