$(\forall z,z'\in Z):\ \langle F(z)-F(z'),\,z-z'\rangle\ge0$
(2) find $z_*\in Z:\ \langle F(z),\,z-z_*\rangle\ge0\quad\forall z\in Z;$
note that this error is always $\ge 0$ and equals zero iff $z$ is a solution to (2).
In what follows we impose on $F$, aside of the monotonicity, the requirement
(4) $\|F(z)-F(z')\|_*\le L\|z-z'\|+M\qquad\forall z,z'\in Z,$
where $\|\cdot\|_*$ is the norm conjugate to $\|\cdot\|$:
(5) $\|\xi\|_*=\max_{z:\|z\|\le1}\langle\xi,z\rangle.$
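As a quick sanity check (our own illustration, not part of the paper), the conjugate of the $\ell_1$ norm is the $\ell_\infty$ norm: the maximum in (5) over the $\ell_1$ unit ball is attained at a signed coordinate vector.

```python
import numpy as np

# ||xi||_* = max_{||z||_1 <= 1} <xi, z>: a linear form attains its maximum
# over the l1 ball at a vertex +/- e_j, so the conjugate norm is ||xi||_inf.
rng = np.random.default_rng(0)
xi = rng.standard_normal(5)
vertices = np.vstack([np.eye(5), -np.eye(5)])   # vertices of the l1 unit ball
brute = max(vertices @ xi)                      # brute-force maximum
assert abs(brute - np.abs(xi).max()) < 1e-12
```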
ity problem and Eigenvalue minimization. All technical proofs are collected in the appendix.
We do not include numerical results in this paper; some supporting results can be found in [6]. In that manuscript the proposed approach is applied to bilinear saddle point problems arising in sparse $\ell_1$ recovery, i.e., to variational inequalities with affine monotone operators $F$, and the goal is to accelerate the solution process by replacing the precise values of $F$, which are computationally expensive in the large-scale case, with computationally cheap unbiased random estimates of these values (cf. Section 4.4 below).
Notations. In the sequel, lowercase Latin letters denote vectors (and sometimes matrices). Script capital letters, like $\mathcal E$, $\mathcal Y$, denote Euclidean spaces; the inner product in such a space, say $\mathcal E$, is denoted by $\langle\cdot,\cdot\rangle_{\mathcal E}$ (or merely $\langle\cdot,\cdot\rangle$, when the corresponding space is clear from the context). Linear mappings from one Euclidean space to another, say from $\mathcal E$ to $\mathcal F$, are denoted by boldface capitals like $A$ (there are also some reserved boldface capitals, like $\mathbf E$ for expectation, $\mathbf R^k$ for the $k$-dimensional coordinate space, and $\mathbf S^k$ for the space of $k\times k$ symmetric matrices). $A^*$ stands for the conjugate of a mapping $A$: if $A:\mathcal E\to\mathcal F$, then $A^*:\mathcal F\to\mathcal E$ is given by the identity $\langle f,Ae\rangle_{\mathcal F}=\langle A^*f,e\rangle_{\mathcal E}$ for $f\in\mathcal F$, $e\in\mathcal E$. When both the origin and the destination space of a linear map $A$ are standard coordinate spaces, the map is identified with its matrix $A$, and $A^*$ is identified with $A^T$. For a norm $\|\cdot\|$ on $\mathcal E$, $\|\cdot\|_*$ stands for the conjugate norm, see (5).
For Euclidean spaces $E_1,\dots,E_m$, $E=E_1\times\cdots\times E_m$ denotes their Euclidean direct product, so that a vector from $E$ is a collection $u=[u_1;\dots;u_m]$ (MATLAB notation) of vectors $u_\nu\in E_\nu$, and $\langle u,v\rangle_E=\sum_\nu\langle u_\nu,v_\nu\rangle_{E_\nu}$. Sometimes we allow ourselves to write $(u_1,\dots,u_m)$ instead of $[u_1;\dots;u_m]$.
2. Preliminaries and problem of interest.
2.1. Nash v.i.'s and functional error. In the sequel, we shall be especially interested in a special case of v.i. (2): a Nash v.i. coming from a convex Nash Equilibrium problem, and in the associated functional error measure. The Nash Equilibrium problem can be described as follows: there are $m$ players, the $i$th of them choosing a point $z_i$ from a given set $Z_i$. The loss of the $i$th player is a given function $\phi_i(z)$ of the collection $z=(z_1,\dots,z_m)\in Z=Z_1\times\cdots\times Z_m$ of the players' choices. With a slight abuse of notation, we also write $\phi_i(z_i,z^i)$ for $\phi_i(z)$, where $z^i$ is the collection of choices of all but the $i$th player. Players are interested in minimizing their losses, and a Nash equilibrium $\widehat z$ is a point of $Z$ such that for every $i$ the function $\phi_i(z_i,\widehat z^i)$ attains its minimum over $z_i\in Z_i$ at $z_i=\widehat z_i$ (so that in the state $\widehat z$ no
player has an incentive to change his choice, provided that the other players
stick to their choices).
We call a Nash equilibrium problem convex if, for every $i$, $Z_i$ is a compact convex set, $\phi_i(z_i,z^i)$ is a Lipschitz continuous function convex in $z_i$ and concave in $z^i$, and the function $\Phi(z)=\sum_{i=1}^m\phi_i(z)$ is convex. It is well known (see, e.g., [11]) that, setting $F(z)=[F_1(z);\dots;F_m(z)]$ with $F_i(z)\in\partial_{z_i}\phi_i(z_i,z^i)$, one obtains a monotone operator whose v.i. solutions are exactly the Nash equilibria. In the two-player zero-sum case the associated functional error becomes
$\mathrm{Err}_{\mathrm N}(z_1,z_2)=\max_{(u_1,u_2)\in Z}\big[\phi(z_1,u_2)-\phi(u_1,z_2)\big].$
Recall that the convex-concave saddle point problem $\min_{z_1\in Z_1}\max_{z_2\in Z_2}\phi(z_1,z_2)$ gives rise to the primal-dual pair of convex optimization problems
$(P):\ \min_{z_1\in Z_1}\overline\phi(z_1),\qquad (D):\ \max_{z_2\in Z_2}\underline\phi(z_2),$
where
$\overline\phi(z_1)=\max_{z_2\in Z_2}\phi(z_1,z_2),\qquad \underline\phi(z_2)=\min_{z_1\in Z_1}\phi(z_1,z_2).$
The optimal values Opt(P ) and Opt(D) in these problems are equal, the set
of saddle points of (i.e., the set of Nash equilibria of the underlying convex
Nash problem) is exactly the direct product of the optimal sets of (P ) and
(D), and ErrN (z1 , z2 ) is nothing but the sum of non-optimalities of z1 , z2
considered as approximate solutions to respective optimization problems:
$\mathrm{Err}_{\mathrm N}(z_1,z_2)=\big[\overline\phi(z_1)-\mathrm{Opt}(P)\big]+\big[\mathrm{Opt}(D)-\underline\phi(z_2)\big].$
The minimax problem (7), $\min_{x\in X}\max_{1\le\nu\le m}\psi_\nu(x)$, gives rise to the saddle point reformulation
(8) $\min_{x\in X}\max_{y\in Y}\ \sum_{\nu=1}^m y_\nu\psi_\nu(x),$
where $Y=\{y\in\mathbf R^m_+:\ \sum_{\nu=1}^m y_\nu=1\}$ is the standard simplex. The advantages
of the saddle point reformulation are twofold. First, when all the $\psi_\nu$ are smooth, so is the cost function in (8), in contrast to the objective in (7), which typically is nonsmooth; this makes the saddle point reformulation better suited for processing by first order algorithms. Starting with the breakthrough paper of Nesterov [12], this phenomenon, in its general form, is utilized in the fastest known so far first order algorithms for well-structured nonsmooth convex programs. Second, in the stochastic case, stochastic oracles providing unbiased estimates of the first order information on the $\psi_\nu$, while not inducing a similar oracle for the objective of (7), do induce such an oracle for the v.i. associated with (8) and thus make the problem amenable to first order algorithms.
where the inner functions $\psi_\nu(\cdot)$ are vector-valued, and the outer function $\Phi$ is real-valued. We are about to impose structural restrictions which allow us to reformulate the problem as a "good" convex-concave saddle point problem, specifically, as follows:
A. $X\subset\mathcal X$ is a convex compact set;
B. $\psi_\nu(x):X\to E_\nu$, $1\le\nu\le m$, are Lipschitz continuous mappings taking values in Euclidean spaces $E_\nu$ equipped with closed convex cones $K_\nu$. We assume $\psi_\nu$ to be $K_\nu$-convex, meaning that for any $x,x'\in X$ and $\lambda\in[0,1]$,
$\psi_\nu(\lambda x+(1-\lambda)x')\le_{K_\nu}\lambda\psi_\nu(x)+(1-\lambda)\psi_\nu(x'),$
where the notation $a\le_K b$ (equivalently, $b\ge_K a$) means that $b-a\in K$.
C. $\Phi(\cdot)$ is a convex function on $E=E_1\times\cdots\times E_m$ given by the Fenchel-type representation
(10) $\Phi(u_1,\dots,u_m)=\max_{y\in Y}\Big\{\sum_{\nu=1}^m\langle u_\nu,A_\nu y+b_\nu\rangle_{E_\nu}-\Phi_*(y)\Big\}$
for $u_\nu\in E_\nu$, $1\le\nu\le m$. Here
$Y\subset\mathcal Y$ is a convex compact set,
the affine mappings $y\mapsto A_\nu y+b_\nu:\mathcal Y\to E_\nu$ are such that $A_\nu y+b_\nu\in K_\nu^*$ for all $y\in Y$ and all $\nu$, $K_\nu^*$ being the cone dual to $K_\nu$,
$\Phi_*(y)$ is a given Lipschitz continuous convex function on $Y$.
Under these assumptions, the optimization problem (9) is nothing but the primal problem associated with the saddle point problem
(11) $\min_{x\in X}\max_{y\in Y}\ \Big[\sum_{\nu=1}^m\langle\psi_\nu(x),A_\nu y+b_\nu\rangle_{E_\nu}-\Phi_*(y)\Big],$
and the cost function in the latter problem is Lipschitz continuous and convex-concave due to the convexity of $\Phi_*$, the $K_\nu$-convexity of $\psi_\nu(\cdot)$ and the condition $A_\nu y+b_\nu\in K_\nu^*$ whenever $y\in Y$. The associated Nash v.i. is given by the domain $Z=X\times Y$ and the monotone mapping
(12) $F(z)\equiv F(x,y)=\Big[\sum_{\nu=1}^m[\psi_\nu'(x)]^*[A_\nu y+b_\nu];\ -\sum_{\nu=1}^m A_\nu^*\psi_\nu(x)+\Phi_*'(y)\Big].$
Same as in the case of the minimax problem (7), the advantage of the saddle point reformulation (11) of (9) is that, independently of whether $\Phi$ is smooth, the cost function in (11) is smooth whenever all the $\psi_\nu$ are so. Another advantage, instrumental in the stochastic case, is that $F$ is linear in $\psi_\nu(\cdot)$, so that stochastic oracles providing unbiased estimates of the first order information on the $\psi_\nu$ straightforwardly induce an unbiased SO for $F$.
2.2.2. Example: Matrix Minimax problem. For $1\le\nu\le m$, let $E_\nu=\mathbf S^{p_\nu}$ be the space of symmetric $p_\nu\times p_\nu$ matrices equipped with the Frobenius inner product $\langle A,B\rangle_F=\mathrm{Tr}(AB)$, and let $K_\nu$ be the cone $\mathbf S^{p_\nu}_+$ of symmetric positive semidefinite $p_\nu\times p_\nu$ matrices. Now let $X\subset\mathcal X$ be a convex compact set, and let $\psi_\nu:X\to E_\nu$ be $\mathbf S^{p_\nu}_+$-convex Lipschitz continuous mappings. These data induce the Matrix Minimax problem
(P) $\min_{x\in X}\max_{1\le j\le k}\lambda_{\max}\Big(\sum_{\nu=1}^m P_{j\nu}^T\psi_\nu(x)P_{j\nu}\Big),$
where the $P_{j\nu}$ are given $p_\nu\times q_j$ matrices, and $\lambda_{\max}(A)$ is the maximal eigenvalue of a symmetric matrix $A$. Observe that for a symmetric $q\times q$ matrix $A$ one has
$\lambda_{\max}(A)=\max_{S\in\mathcal S_q}\mathrm{Tr}(AS),$
where $\mathcal S_q=\{S\in\mathbf S^q_+:\ \mathrm{Tr}(S)=1\}$. Denoting by $Y$ the set of all symmetric positive semidefinite block-diagonal matrices $y=\mathrm{Diag}\{y_1,\dots,y_k\}$ with unit trace and diagonal blocks $y_j$ of sizes $q_j\times q_j$, we can represent (P) in the form of (9), (10) with
$\Phi(u):=\max_{1\le j\le k}\lambda_{\max}\Big(\sum_{\nu=1}^m P_{j\nu}^Tu_\nu P_{j\nu}\Big)=\max_{y\in Y}\sum_{j=1}^k\mathrm{Tr}\Big(\Big[\sum_{\nu=1}^m P_{j\nu}^Tu_\nu P_{j\nu}\Big]y_j\Big)=\max_{y\in Y}\sum_{\nu=1}^m\mathrm{Tr}\Big(u_\nu\sum_{j=1}^k P_{j\nu}y_jP_{j\nu}^T\Big)=\max_{y\in Y}\sum_{\nu=1}^m\langle u_\nu,A_\nu y\rangle_F,\qquad A_\nu y=\sum_{j=1}^k P_{j\nu}y_jP_{j\nu}^T.$
$\min_{x\in X}\max_{1\le\nu\le m}\lambda_{\max}(\psi_\nu(x)),$
(S) find $x\in X:\ \psi_\nu(x)\preceq0,\quad 1\le\nu\le m.$
The setup. The setup for the algorithm is given by a norm $\|\cdot\|$ on $E$ and a distance-generating function (d.-g.f.) $\omega(z)$ such that
(a) with $Z^o$ being the set of all points $z\in Z$ such that the subdifferential $\partial\omega(z)$ of $\omega(\cdot)$ at $z$ is nonempty, $\partial\omega(\cdot)$ admits a continuous selection on $Z^o$: there exists a continuous on $Z^o$ vector-valued function $\omega'(z)$ such that $\omega'(z)\in\partial\omega(z)$ for all $z\in Z^o$;
(b) $(\forall z,z'\in Z^o):\ \langle\omega'(z)-\omega'(z'),z-z'\rangle\ge\|z-z'\|^2.$
In order for the SMP associated with the outlined setup to be practical, $\omega(\cdot)$ and $Z$ should fit each other, meaning that one can easily solve problems of the form
(15) $\min_{z\in Z}\big[\omega(z)+\langle e,z\rangle\big],\qquad e\in E.$
The prox-function.
We set
(16) (a) $V(z,u)=\omega(u)-\omega(z)-\langle\omega'(z),u-z\rangle$ $[z\in Z^o,\ u\in Z]$; (b) $z_c=\mathop{\mathrm{argmin}}_{z\in Z}\omega(z)$; (c) $\Theta(z)=\max_{u\in Z}V(z,u)$ $[z\in Z^o]$, $\Omega=\sqrt{2\Theta(z_c)}$.
Note that
(17) $(\forall u\in Z):\ \tfrac12\|u-z_c\|^2\le V(z_c,u)\le\Theta(z_c)\le\max_{z\in Z}\omega(z)-\omega(z_c),$
whence
(18) $Z\subset\{z:\ \|z-z_c\|\le\Omega\}.$
Example 2: Simplex setup. Here $E=\mathbf R^N$, $N>1$, is equipped with the $\ell_1$ norm, and $Z$ is a closed convex subset of the standard simplex
$\Delta_N=\Big\{z\in\mathbf R^N:\ z\ge0,\ \sum_{j=1}^N z_j=1\Big\}$
containing its barycenter. The d.-g.f. is twice the entropy, $\omega(z)=2\sum_{j=1}^N z_j\ln z_j$, so that $\Theta(z_c)\le2\ln(N)$, the associated prox-function is $V(z,u)=2\sum_{j=1}^N u_j\ln(u_j/z_j)$, and the prox-mapping is given in closed form:
$\big[P_z(\xi)\big]_j=\Big(\sum_{i=1}^N z_i\exp\{-\xi_i/2\}\Big)^{-1}z_j\exp\{-\xi_j/2\},\quad 1\le j\le N.$
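The closed form above makes the prox-mapping an $O(N)$ computation. A minimal sketch (the function name and the stability shift are ours; we use the plain entropic scaling $u_j\propto z_je^{-\xi_j}$ — with the doubled entropy the exponent carries an extra factor $1/2$):

```python
import numpy as np

def simplex_prox(z, xi):
    """Entropic prox-mapping P_z(xi) on the standard simplex (a sketch).

    With an entropy d.-g.f., argmin_u <xi,u> + V(z,u) over the simplex
    has the closed form u_j proportional to z_j * exp(-xi_j); the exact
    scaling of the exponent depends on the constant in front of the entropy.
    """
    w = np.log(z) - xi
    w -= w.max()              # shift for numerical stability
    u = np.exp(w)
    return u / u.sum()

z = np.full(4, 0.25)          # barycenter of the simplex in R^4
u = simplex_prox(z, np.array([1.0, 0.0, -1.0, 0.5]))
assert abs(u.sum() - 1.0) < 1e-12 and (u >= 0).all()
assert u[2] > u[1] > u[0]     # larger xi_j pushes mass away from coordinate j
```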
Example 3: Spectahedron setup. This is the matrix analogy of the Simplex setup. Specifically, now $E$ is the space of $N\times N$ block-diagonal symmetric matrices, $N>1$, of a given block-diagonal structure, equipped with the Frobenius inner product $\langle a,b\rangle_F=\mathrm{Tr}(ab)$ and the trace norm $|a|_1=\sum_{i=1}^N|\lambda_i(a)|$, where $\lambda_1(a)\ge\cdots\ge\lambda_N(a)$ are the eigenvalues of a symmetric $N\times N$ matrix $a$; the conjugate norm $|a|_\infty$ is the usual spectral norm (the largest singular value) of $a$. $Z$ is assumed to be a closed convex subset of the spectahedron $\mathcal S=\{z\in E:\ z\succeq0,\ \mathrm{Tr}(z)=1\}$ containing the matrix $N^{-1}I_N$. The d.-g.f. is twice the matrix entropy
$\omega(z)=2\sum_{j=1}^N\lambda_j(z)\ln\lambda_j(z),$
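When $Z$ is the entire spectahedron, the prox-mapping for this d.-g.f. has a closed form analogous to the simplex case, computable from one eigenvalue decomposition: $P_z(\xi)\propto\exp\{\log z-\xi/2\}$, normalized to unit trace. A sketch under that assumption (the function name and the eigenvalue clipping are ours):

```python
import numpy as np

def spectahedron_prox(z, xi):
    """Matrix-entropy prox-mapping on the full spectahedron (a sketch)."""
    lam, q = np.linalg.eigh(z)
    lam = np.clip(lam, 1e-15, None)          # keep the matrix log well defined
    log_z = (q * np.log(lam)) @ q.T
    h = log_z - 0.5 * xi                     # stationarity: log u = log z - xi/2 + c*I
    mu, v = np.linalg.eigh(h)
    mu -= mu.max()                           # numerical stability before exp
    u = (v * np.exp(mu)) @ v.T
    return u / np.trace(u)

n = 4
z = np.eye(n) / n                            # the center N^{-1} I_N
u = spectahedron_prox(z, np.diag([1.0, 0.0, -1.0, 0.5]))
assert abs(np.trace(u) - 1.0) < 1e-10
assert np.all(np.linalg.eigvalsh(u) >= -1e-12)   # u stays PSD with unit trace
```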
Algorithm 1.
1. Initialization: Choose $r_0\in Z^o$ and stepsizes $\gamma_\tau>0$, $1\le\tau\le t$.
2. Step $\tau$, $\tau=1,2,\dots,t$: Given $r_{\tau-1}\in Z^o$, set
(19) $w_\tau=P_{r_{\tau-1}}\big(\gamma_\tau\widehat F(r_{\tau-1})\big),\qquad r_\tau=P_{r_{\tau-1}}\big(\gamma_\tau\widehat F(w_\tau)\big).$
3. At step $t$, output
(20) $\widehat z_t=\Big[\sum_{\tau=1}^t\gamma_\tau\Big]^{-1}\sum_{\tau=1}^t\gamma_\tau w_\tau.$
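In the simplest, Euclidean setup ($\omega(z)=\frac12\|z\|_2^2$, $Z$ the unit ball, so that the prox-mapping is a projected gradient step), one run of Algorithm 1 can be sketched as follows (the toy skew-symmetric operator, whose unique solution is $z_*=0$, and the stepsizes are our choices):

```python
import numpy as np

def mirror_prox(F_hat, prox, r0, gammas):
    """One run of Algorithm 1 (extragradient / Mirror-Prox), a sketch.

    F_hat : oracle returning (an estimate of) F at a point
    prox  : prox-mapping (z, xi) -> P_z(xi) for the chosen setup
    Returns the gamma-weighted average of the intermediate points w_tau.
    """
    r = r0
    acc = np.zeros_like(r0)
    for gamma in gammas:
        w = prox(r, gamma * F_hat(r))   # extrapolation step of (19)
        r = prox(r, gamma * F_hat(w))   # update step of (19), gradient taken at w
        acc += gamma * w
    return acc / np.sum(gammas)         # the average (20)

A = np.array([[0.0, 1.0], [-1.0, 0.0]])  # monotone (skew-symmetric) operator
F = lambda z: A @ z

def ball_prox(z, xi):                    # Euclidean setup: omega = ||z||^2/2
    u = z - xi
    nrm = np.linalg.norm(u)
    return u if nrm <= 1 else u / nrm

z_t = mirror_prox(F, ball_prox, np.array([1.0, 0.0]), [0.1] * 200)
assert np.linalg.norm(z_t) < 0.2         # averaged iterates approach z* = 0
```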
(21) (a) $\big\|\mathbf E\{\Xi(z,\zeta_i)\}-F(z)\big\|_*\le\mu$; (b) $\mathbf E\big\{\|\Xi(z,\zeta_i)-F(z)\|_*^2\big\}\le\sigma^2,$
which is slightly milder than (6). The associated version of Algorithm 1 will be referred to as the Stochastic Mirror Prox (SMP) algorithm.
In some cases, we augment Assumption I by the following
Assumption II: For all $z\in Z$ and all $i$ we have
(22) $\mathbf E\big\{\exp\{\|\Xi(z,\zeta_i)-F(z)\|_*^2/\sigma^2\}\big\}\le\exp\{1\}.$
3.4. Algorithm: Main result. From now on, assume that the starting point $r_0$ in Algorithm 1 is the minimizer $z_c$ of $\omega(\cdot)$ on $Z$. Further, to avoid unnecessarily complicated formulas (and with no harm to the efficiency estimates), we stick to the constant stepsize policy $\gamma_\tau\equiv\gamma$, $1\le\tau\le t$, where $t$ is a fixed in advance number of iterations of the algorithm. Our main result is as follows:
Theorem 1. Let the v.i. (2) with a monotone operator $F$ satisfying (4) be solved by the $t$-step Algorithm 1 using a SO, and let the stepsizes $\gamma_\tau\equiv\gamma$, $1\le\tau\le t$, satisfy $0<\gamma\le\frac1{\sqrt3L}$. Then
(i) Under Assumption I, one has
(23) $\mathbf E\big\{\mathrm{Err}_{\mathrm{vi}}(\widehat z_t)\big\}\le K_0(t)\equiv\frac{\Omega^2}{\gamma t}+\frac{7\gamma}2\big[M^2+2\sigma^2\big]+2\mu\Omega.$
(ii) Under Assumptions I, II, one has, in addition, for any $\Lambda>0$,
(24) $\mathrm{Prob}\Big\{\mathrm{Err}_{\mathrm{vi}}(\widehat z_t)>K_0(t)+\Lambda\Big[\frac{7\gamma\sigma^2}2+\frac{2\sigma\Omega}{\sqrt t}\Big]\Big\}\le\exp\{-\Lambda^2/3\}+\exp\{-\Lambda t\}.$
In the case of a Nash v.i., $\mathrm{Err}_{\mathrm{vi}}(\cdot)$ in (23), (24) can be replaced with $\mathrm{Err}_{\mathrm N}(\cdot)$.
When optimizing the bound (23) in $\gamma$, we get the following
Corollary 1. In the situation of Theorem 1, let the stepsizes $\gamma_\tau\equiv\gamma$ be chosen according to
(25) $\gamma=\min\Big[\frac1{\sqrt3L},\ \sqrt{\frac{2\Omega^2}{7t(M^2+2\sigma^2)}}\,\Big].$
Then
(26) $\mathbf E\big\{\mathrm{Err}_{\mathrm{vi}}(\widehat z_t)\big\}\le K_0(t)\equiv\max\Big[\frac{7\Omega^2L}{4t},\ 7\Omega\sqrt{\frac{M^2+2\sigma^2}{3t}}\,\Big]+2\mu\Omega$
(see (16)). Under Assumptions I, II, one has, in addition to (26), for any $\Lambda>0$,
(27) $\mathrm{Prob}\big\{\mathrm{Err}_{\mathrm{vi}}(\widehat z_t)>K_0(t)+\Lambda K_1(t)\big\}\le\exp\{-\Lambda^2/3\}+\exp\{-\Lambda t\}$
with
(28) $K_1(t)=\frac{7\Omega\sigma}{2\sqrt t}.$
In the case of a Nash v.i., Errvi () in (26), (27) can be replaced with ErrN ().
Remark 1. Observe that the upper bound (26) for the error of Algorithm 1 with stepsize strategy (25), in agreement with the lower bound of [8],
depends in the same way on the size of the perturbation (z, i ) F (z)
and on the bound M for the non-Lipschitz component of F . From now on
to simplify the presentation, with slight abuse of notation, we denote M the
maximum of these quantities. Clearly, the latter implies that the bounds
(21.b) and (22), and thus the bounds (26)(24) of Corollary 1 hold with M
substituted for .
3.5. Comparison with Robust Mirror SA Algorithm. Consider the case of a Nash s.v.i. with an operator $F$ satisfying (4) with $L=0$, and let the SO be unbiased (i.e., $\mu=0$). In this case, the bound (26) reads
(29) $\mathbf E\{\mathrm{Err}_{\mathrm N}(\widehat z_t)\}\le\frac{7M\Omega}{\sqrt t},$
where
$M^2=\max\Big[\sup_{z,z'\in Z}\|F(z)-F(z')\|_*^2,\ \sup_{z\in Z}\mathbf E\big\{\|\Xi(z,\zeta_i)-F(z)\|_*^2\big\}\Big].$
The bound (29) looks very much like the efficiency estimate
(30) $\mathbf E\{\mathrm{Err}_{\mathrm N}(\widetilde z_t)\}\le O(1)\frac{\overline M\Omega}{\sqrt t}$
(from now on, all $O(1)$'s are appropriate absolute positive constants) for the approximate solution $\widetilde z_t$ of the $t$-step Robust Mirror SA (RMSA) algorithm [4]$^{1)}$. In the latter estimate, $\Omega$ is exactly the same as in (29), and $\overline M$ is given by
$\overline M^2=\sup_{z\in Z}\mathbf E\big\{\|\Xi(z,\zeta_i)\|_*^2\big\}.$
Note that we always have $M\le2\overline M$, and typically $M$ and $\overline M$ are of the same order of magnitude; it may happen, however (think of the case when $F$ is
$^{1)}$ In this reference, only the Minimization and the Saddle Point problems are considered. However, the results of [4] can be easily extended to s.v.i.'s.
almost constant) that $M\ll\overline M$. Thus, the bound (29) is never worse, and sometimes can be much better, than the SA bound (30). It should be added that, as far as implementation is concerned, the SMP algorithm is no more complicated than the RMSA (cf. the description of Algorithm 1 with the description
$r_\tau=P_{r_{\tau-1}}\big(\gamma_\tau\widehat F(r_{\tau-1})\big),\qquad \widetilde z_t=\Big[\sum_{\tau=1}^t\gamma_\tau\Big]^{-1}\sum_{\tau=1}^t\gamma_\tau r_\tau$
of the RMSA).
The just outlined advantage of SMP as compared to the usual Stochastic Approximation is not that important, since typically $M$ and $\overline M$ are of the same order. We believe that the most interesting feature of the SMP algorithm is its ability to take advantage of a specific structure of a stochastic optimization problem, namely, its insensitivity to the presence in the objective of large, but smooth and well-observable components.
We are about to consider several less straightforward applications of the outlined insensitivity of the SMP algorithm to smooth well-observed components in the objective.
4. Application to Stochastic Composite minimization. Our present goal is to apply the SMP algorithm to the Composite minimization
problem (9) in the case when the associated monotone operator (12) is given
by a Stochastic Oracle.
Throughout this section, the structural assumptions A — C from Section 2.2 are in force.
4.1. Assumptions. We start with augmenting the description of the problem of interest (9), see Section 2.2, with additional assumptions specifying
the SO and the SMP setup for the v.i. reformulation
(31) find $z_*\in Z:=X\times Y:\ \langle F(z),z-z_*\rangle\ge0\quad\forall z\in Z.$
(33) (a) $\mathbf E\{f_\nu(x,\zeta_i)\}=\psi_\nu(x)$, (b) $\mathbf E\big\{\|f_\nu(x,\zeta_i)-\psi_\nu(x)\|_{(\nu)}^2\big\}\le M_x^2\Omega_x^2$, (c) $\mathbf E\{G_\nu(x,\zeta_i)\}=\psi_\nu'(x)$, $1\le\nu\le m$, (d) $\mathbf E\Big\{\max_{h\in\mathcal X,\ \|h\|_x\le1}\big\|[G_\nu(x,\zeta_i)-\psi_\nu'(x)]h\big\|_{(\nu)}^2\Big\}\le M_x^2$, $1\le\nu\le m$.
$\|\Phi_*'(y)-\Phi_*'(y')\|_{y,*}\le L_y\|y-y'\|_y+M_y.$
(35) $\Xi(x,y,\zeta_i)=\Big[\sum_{\nu=1}^m G_\nu^*(x,\zeta_i)[A_\nu y+b_\nu];\ -\sum_{\nu=1}^m A_\nu^*f_\nu(x,\zeta_i)+\Phi_*'(y)\Big],$
and this is the oracle we will use when solving (9) by the SMP algorithm.
4.2. Setup for the SMP as applied to (31), (12). In retrospect, the setup for SMP we are about to present is, in a sense, the best: it results in the best efficiency estimate (26) one can build from the entities participating in the description of the problem (9) as given by the assumptions A — G. Specifically, we equip the space $\mathcal E=\mathcal X\times\mathcal Y$ with the norm
$\|(x,y)\|=\sqrt{\frac{\|x\|_x^2}{2\Omega_x^2}+\frac{\|y\|_y^2}{2\Omega_y^2}}$
(so that the conjugate norm is $\|(\xi,\eta)\|_*=\sqrt{2\Omega_x^2\|\xi\|_{x,*}^2+2\Omega_y^2\|\eta\|_{y,*}^2}$), and with the d.-g.f.
$\omega(x,y)=\frac1{2\Omega_x^2}\omega_x(x)+\frac1{2\Omega_y^2}\omega_y(y).$
Let
$A=\max_{y\in\mathcal Y:\|y\|_y\le1}\sum_{\nu=1}^m\|A_\nu y\|_{(\nu,*)},\qquad B=\sum_{\nu=1}^m\|b_\nu\|_{(\nu,*)}.$
With this setup, the operator $F$ given by (12) satisfies (4) with the constants
(37) $L=4AL_x\Omega_x^2\Omega_y+\Omega_xB+2\Omega_y^2L_y,\qquad M=[3A\Omega_y+B]\,\Omega_xM_x+\Omega_yM_y.$
Besides this,
(38) $(\forall z\in Z,\ i):\quad \mathbf E\{\Xi(z,\zeta_i)\}=F(z);\qquad \mathbf E\big\{\|\Xi(z,\zeta_i)-F(z)\|_*^2\big\}\le M^2.$
Moreover, if for all $x\in X$ and all $i$ it holds
(39) $\mathbf E\Big\{\exp\Big\{\max_{h\in\mathcal X,\ \|h\|_x\le1}\big\|[G_\nu(x,\zeta_i)-\psi_\nu'(x)]h\big\|_{(\nu)}^2/M_x^2\Big\}\Big\}\le\exp\{1\},\quad 1\le\nu\le m,$
then also
(40) $\mathbf E\big\{\exp\{\|\Xi(z,\zeta_i)-F(z)\|_*^2/M^2\}\big\}\le\exp\{1\}.$
m
X
T
Pj
(x)Pj ,
=1
Sp
max
xX y=Diag{y1 ,...,yk }Y
(42)
A y =
k
X
m
X
=1
h (x), A yiF ,
T
Pj yj Pj
,
j=1
max
q
1jk R j :kk2 =1
m
X
=1
m
X
=1
T
Pj
Pj | ,
B = 0.
(43)
= O(1)A ln(q1 + + qk )x Mx .
(44) find $x\in X:\ \psi_\nu(x)\preceq0,\quad 1\le\nu\le m.$
Assuming from now on that the latter problem is feasible, we can rewrite it as the Matrix Minimax problem
(45) $\min_{x\in X}\max_{1\le\nu\le m}\lambda_{\max}\big(\beta_\nu\psi_\nu(x)\big),$
where the $\beta_\nu>0$ are scale factors we are free to choose. We are about to show how to use this freedom in order to improve the SMP efficiency estimates. Assuming, same as in the case of a general-type Matrix Minimax problem, that D takes place, let us modify the assumptions E, F as follows:
E$'$: the $\mathbf S^{p_\nu}_+$-convex Lipschitz continuous functions $\psi_\nu:X\to\mathbf S^{p_\nu}$ are such that
(46) $\max_{h\in\mathcal X,\ \|h\|_x\le1}\big|[\psi_\nu'(x)-\psi_\nu'(x')]h\big|_\infty\le L_\nu\|x-x'\|_x+M_\nu$
for certain selections $\psi_\nu'(x)\in\partial^{K_\nu}\psi_\nu(x)$, $x\in X$, with some known nonnegative constants $L_\nu$, $M_\nu$.
F$'$: the $\psi_\nu(\cdot)$ are represented by an SO which at the $i$th call, the input being $x\in X$, returns the matrices $\widehat f_\nu(x,\zeta_i)\in\mathbf S^{p_\nu}$ and the linear maps $\widehat G_\nu(x,\zeta_i)$ from $\mathcal X$ to $\mathbf S^{p_\nu}$ ($\{\zeta_i\}$ are i.i.d. random oracle noises) such that for any $x\in X$ it holds
(47) (a) $\mathbf E\{\widehat f_\nu(x,\zeta_i)\}=\psi_\nu(x)$, $\mathbf E\{\widehat G_\nu(x,\zeta_i)\}=\psi_\nu'(x)$, $1\le\nu\le m$;
(b,c) $\mathbf E\Big\{\max_{h\in\mathcal X,\ \|h\|_x\le1}\big|[\widehat G_\nu(x,\zeta_i)-\psi_\nu'(x)]h\big|_\infty^2\Big\}\big/M_\nu^2\le1,\quad 1\le\nu\le m.$
= max , =
1m
, () = (), Lx = 1
x t, Mx = .
associated with the just defined $\psi_1,\dots,\psi_m$ and solve this Stochastic composite problem by the $t$-step SMP algorithm. Combining Lemma 1 and Corollary 1 and taking into account the origin of the quantities $L_x$, $M_x$, and the fact that $A=1$, $B=0$, we arrive at the following result:
Proposition 1. With the outlined construction, the $t$-step SMP algorithm with the setup presented in Section 4.2 (where one uses $A=1$, $B=L_y=M_y=0$ and the just defined $L_x$, $M_x$) and the constant stepsizes $\gamma_\tau\equiv\gamma$ defined by (25) yields an approximate solution $\widehat z_t=(\widehat x_t,\widehat y_t)$ such that
(50) $\mathbf E\Big\{\max_{1\le\nu\le m}\max\big[\lambda_{\max}(\beta_\nu\psi_\nu(\widehat x_t)),0\big]\Big\}\le\mathbf E\Big\{\max_{1\le\nu\le m}\lambda_{\max}(\beta_\nu\psi_\nu(\widehat x_t))\Big\}-\mathrm{Opt}\le K_0(t)\equiv80\,\frac{\Omega_xM_x\sqrt{\ln\big(\sum_{\nu=1}^m p_\nu\big)}}{\sqrt t}$
(cf. (26), and take into account that the optimal value in (49) is nonpositive, since (44) is feasible).
Furthermore, if the assumptions (47.b,c) are strengthened to
$\mathbf E\Big\{\max_{1\le\nu\le m}\exp\big\{|\widehat f_\nu(x,\zeta_i)-\psi_\nu(x)|_\infty^2/(\Omega_xM_\nu)^2\big\}\Big\}\le\exp\{1\},$
$\mathbf E\Big\{\exp\Big\{\max_{h\in\mathcal X,\ \|h\|_x\le1}|[\widehat G_\nu(x,\zeta_i)-\psi_\nu'(x)]h|_\infty^2/M_\nu^2\Big\}\Big\}\le\exp\{1\},\quad 1\le\nu\le m,$
where
$K_1(t)=15\,\frac{\Omega_xM_x\sqrt{\ln\big(\sum_{\nu=1}^m p_\nu\big)}}{\sqrt t}.$
Discussion. Imagine that instead of solving the system of matrix inequalities (44), we were interested in solving just a single matrix inequality $\psi(x)\preceq0$, $x\in X$. When solving this inequality by the SMP algorithm as explained above, the efficiency estimate would be
$\mathbf E\big\{\max[\lambda_{\max}(\psi(\widehat x_t)),0]\big\}\le O(1)\sqrt{\ln(p+1)}\,\Big[\frac{\Omega_x^2L}{t}+\frac{\Omega_xM}{\sqrt t}\Big]$
(recall that the matrix inequality in question is feasible), where $\widehat x_t$ is the
resulting approximate solution. Looking at (50), we see that the expected accuracy of the SMP as applied, in the aforementioned manner, to (44) is only by a factor logarithmic in $\sum_\nu p_\nu$ worse:
(51) $\mathbf E\Big\{\max_{1\le\nu\le m}\max\big[\lambda_{\max}(\psi_\nu(\widehat x_t)),0\big]\Big\}\le O(1)\sqrt{\ln\Big(\sum_{\nu=1}^m p_\nu\Big)}\max_{1\le\nu\le m}\Big[\frac{\Omega_x^2L_\nu}{t}+\frac{\Omega_xM_\nu}{\sqrt t}\Big].$
Here a constraint with $M_\nu=0$ contributes to the bound an $O(1/t)$ term only (an "easy" constraint), while a constraint with $M_\nu>0$ contributes an $O(1/\sqrt t)$ term as well (a "difficult" constraint).
In other words, the violations of the easy and the difficult constraints in (44) decrease at the rates $O(1/t)$ and $O(1/\sqrt t)$, respectively; these are the best rates one can achieve without imposing bounds on $n$ and/or additional restrictions on the $\psi_\nu$'s.
4.4. Eigenvalue optimization via SMP. The problem we are interested in now is
(52) $\mathrm{Opt}=\min_{x\in X}\Big\{f(x):=\lambda_{\max}\big(A_0+x_1A_1+\cdots+x_nA_n\big)\Big\},\qquad X=\Big\{x\in\mathbf R^n:\ x\ge0,\ \sum_{i=1}^n x_i=1\Big\},$
where $A_0,A_1,\dots,A_n$ are given symmetric block-diagonal matrices with $m$ diagonal blocks $A_j^\nu$ of sizes $p_\nu\times p_\nu$, $1\le\nu\le m$ (in the sequel we write $p^{(\kappa)}=\sum_{\nu=1}^m p_\nu^\kappa$ and $p_{\max}=\max_\nu p_\nu$). Setting
$\psi_\nu:X\mapsto E_\nu=\mathbf S^{p_\nu},\qquad \psi_\nu(x)=A_0^\nu+\sum_{j=1}^n x_jA_j^\nu,\quad 1\le\nu\le m,$
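For intuition, evaluating $f$ at a given $x$ costs one symmetric eigenvalue computation, and $f$ is convex on $X$ as a maximum of linear functions of $x$. A sketch with random stand-in matrices (all names and data below are ours, not the paper's):

```python
import numpy as np

def f(x, A):
    """Objective of (52): largest eigenvalue of A[0] + sum_j x_j A[j+1]."""
    S = A[0] + np.einsum('j,jkl->kl', x, A[1:])
    return np.linalg.eigvalsh(S)[-1]          # eigenvalues in ascending order

rng = np.random.default_rng(1)
n, p = 3, 5
A = np.array([0.5 * (M + M.T)                 # symmetrize random stand-ins
              for M in rng.standard_normal((n + 1, p, p))])
u, v = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))  # points of X
# midpoint convexity: f(.5u + .5v) <= .5 f(u) + .5 f(v)
assert f(0.5 * (u + v), A) <= 0.5 * f(u, A) + 0.5 * f(v, A) + 1e-10
```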
the problem (52) becomes the saddle point problem
(53) $\min_{x\in X}\max_{y\in Y}\ \Big\langle y,\ A_0+\sum_{j=1}^n x_jA_j\Big\rangle_F,\qquad Y=\Big\{y=\mathrm{Diag}\{y_1,\dots,y_m\}:\ y_\nu\in\mathbf S^{p_\nu}_+,\ 1\le\nu\le m,\ \mathrm{Tr}(y)=1\Big\}.$
The deterministic Mirror Prox (DMP) algorithm applied to (53) yields, after $t$ steps, an approximate solution $\widetilde x_t$ with
(54) $f(\widetilde x_t)-\mathrm{Opt}\le O(1)\frac{\sqrt{\ln(n)\ln(p^{(1)})}\,\mathcal A}{t},\qquad \mathcal A=\max_{1\le j\le n}|A_j|_\infty.$
This efficiency estimate is the best known so far among those attainable with computationally cheap deterministic methods. On the other hand, the complexity of one step of the algorithm is dominated, up to an absolute constant factor, by the necessity, given $x\in X$ and $y\in Y$, to compute the matrix $A_0+\sum_{j=1}^n x_jA_j$ and the associated prox-mappings. When using standard Linear Algebra, the computational effort per step is
(55) $C_{\mathrm{det}}=O(1)\big[np^{(2)}+p^{(3)}\big]$
arithmetic operations.
We are about to demonstrate that one can equip the deterministic problem in question with an artificial SO in such a way that the associated SMP algorithm, under certain circumstances, exhibits better performance than deterministic algorithms. Let us consider the following construction of the SO for $F$ (different from the SO (35)!). Observe that the monotone operator associated with the saddle point problem (53) is
(56) $F(x,y)=\Big[\underbrace{\big[\langle y,A_1\rangle_F;\dots;\langle y,A_n\rangle_F\big]}_{F^x(x,y)};\ \underbrace{-\Big(A_0+\sum\nolimits_{j=1}^nx_jA_j\Big)}_{F^y(x,y)}\Big].$
The just defined random estimate of $F(x,y)$ can be expressed as a deterministic function $\Xi(x,y,\zeta)$ of $(x,y)$ and a random variable $\zeta$ uniformly distributed on $[0,1]$. Assuming all matrices $A_j$ directly available (so that it takes
$\Xi^k(z,\zeta)=\frac1k\sum_{s=1}^k\Xi(z,\zeta^s),$
where $\zeta=[\zeta^1;\dots;\zeta^k]$ and the $\zeta^s$, $1\le s\le k$, are independent random variables uniformly distributed on $[0,1]$. Note that the arithmetic cost of a single call to $\Xi^k$ is
$C_k=O(1)k\big(n(p_{\max})^2+p^{(2)}\big).$
The Nash v.i. associated with (53) and the stochastic oracle $\Xi^k$ ($k$ is the first parameter of our construction) specify a Nash s.v.i. on the domain $Z=X\times Y$. We equip the standard simplex $X$ and its embedding space $\mathcal X=\mathbf R^n$ with the Simplex setup, and the spectahedron $Y$ and its embedding space with the Spectahedron setup (see Section 3.2). Let us next combine the $x$- and the $y$-setups, exactly as explained in the beginning of Section 4.2, into an SMP setup for the domain $Z=X\times Y$: a d.-g.f. $\omega(\cdot)$ and a norm $\|\cdot\|$ on the embedding space $\mathbf R^n\times(\mathbf S^{p_1}\times\cdots\times\mathbf S^{p_m})$ of $Z$. The SMP-related properties of the resulting setup are summarized in the following statement.
Lemma 2. Let $n\ge3$, $p^{(1)}\ge3$. Then
(i) The parameter of the just defined d.-g.f. $\omega$ w.r.t. the just defined norm $\|\cdot\|$ equals $2$.
(ii) For any $z,z'\in Z$ one has
(59) $\|F(z)-F(z')\|_*\le L\|z-z'\|,\qquad L=\big[2\ln(n)+\ln(p^{(1)})\big]\mathcal A.$
(b)
min[1, k/t]
,1 t
= O(1)
ln(np(1) )A
$\big[\mathcal A=\max_{1\le j\le n}|A_j|_\infty\big]$
satisfying
$\epsilon(\widehat x_t)=\lambda_{\max}\Big(A_0+\sum_{j=1}^n[\widehat x_t]_jA_j\Big)-\mathrm{Opt},$
(61) $\mathbf E\{\epsilon(\widehat x_t)\}\le O(1)\ln(np^{(1)})\,\mathcal A\Big[\frac1t+\frac1{\sqrt{kt}}\Big].$
(1)
)A
1 1+
+
t
kt
Further, assuming that all matrices $A_j$ are directly available, the overall computational effort to compute $\widehat x_t$ is
(62) $O(1)t\big[k\big(n(p_{\max})^2+p^{(2)}\big)+n+p^{(3)}\big]$
arithmetic operations.
$\ln(n)=O(1)\ln(mp)$. Assume also that we are interested in a (perhaps, random) solution $\widehat x_t$ which with probability $\ge1-\delta$ satisfies $\epsilon(\widehat x_t)\le\epsilon$, and let us look at what happens when (some of) the sizes $m$, $n$, $p$ of the problem become large.
Observe first of all that the overall computational effort to solve (52) within relative accuracy $\epsilon$ with the DMP algorithm is
$C_{\mathrm{DMP}}(\epsilon)=O(1)m(n+p)p^2\epsilon^{-1}$
operations (see (54), (55)). As for the SMP algorithm, let us choose $k$ which balances the per step computational effort $O(1)k\big(n(p_{\max})^2+p^{(2)}\big)=O(1)k(n+m)p^2$ to produce the answers of the stochastic oracle and the per step cost of prox mappings $O(1)(n+mp^3)$, that is, let us set $k=\mathrm{Ceil}\big(\frac{mp}{m+n}\big)$.$^{4)}$
With this choice of $k$, Proposition 2 says that to get a solution of the required quality, it suffices to carry out
(63) $t=O(1)\big[1+\ln(1/\delta)k^{-1}\big]\epsilon^{-2}$
steps. The resulting gain in the overall computational effort is
$\mathcal R=\frac{C_{\mathrm{DMP}}(\epsilon)}{C_{\mathrm{SMP}}(\epsilon,\delta)}=O(1)\frac{m(n+p)}{(m+n)\big(k+\ln(1/\delta)\big)\epsilon^{-1}}\qquad\Big[k=\mathrm{Ceil}\Big(\frac{mp}{m+n}\Big)\Big].$
We see that when $\epsilon$, $\delta$ are fixed, $m\ge n/p$, and $n/p$ is large, then $\mathcal R$ is large as well, that is, the randomized algorithm significantly outperforms its deterministic counterpart.
Another interesting observation is as follows. In order to produce, with probability $\ge1-\delta$, an approximate solution to (52) with relative accuracy $\epsilon$, the just defined SMP algorithm requires $t$ steps, with $t$ given by (63), and at every one of these steps it visits $O(1)k(m+n)p^2$ randomly chosen
$^{4)}$ The rationale behind balancing is clear: with the just defined $k$, the arithmetic cost of an iteration is still of the same order as when $k=1$, while the left hand side of the efficiency estimate (61) becomes better than for $k=1$.
entries of the data. The fraction of the data inspected in the course of the entire computation is therefore
$\theta:=\frac{N_{\mathrm{SMP}}}{N_{\mathrm{tot}}}=O(1)\big[k+\ln(1/\delta)\big]\Big[\frac1m+\frac1n\Big]\epsilon^{-2}\qquad\Big[k=\mathrm{Ceil}\Big(\frac{mp}{m+n}\Big)\Big].$
We see that when $\epsilon$, $\delta$ are fixed, $m\ge n/p$, and $n/p$ is large, $\theta$ is small, i.e., the approximate solution of the required quality is built while inspecting only a tiny fraction of the data. This sublinear time behavior [15] was already observed in [4] for the Robust Mirror Descent Stochastic Approximation as applied to a matrix game (the latter problem is the particular case of (52) with $p_1=\cdots=p_m=1$ and $A_0=0$). Note also that an ad hoc sublinear time algorithm for a matrix game, in retrospect close to the one from [4], was discovered in [3] as early as 1995.
Acknowledgment. We are very grateful to the Associate Editor and the Referees for their suggestions, which allowed us to significantly improve the paper's readability.
5. Appendix.
5.1. Preliminaries. We need the following technical result about the algorithm (19), (20):
Theorem 2. Consider the $t$-step Algorithm 1 as applied to a v.i. (2) with a monotone operator $F$ satisfying (4). For $\tau=1,2,\dots$, let us set
$\Delta_\tau=F(w_\tau)-\widehat F(w_\tau);\qquad y_\tau=P_{y_{\tau-1}}(\gamma_\tau\Delta_\tau),\quad y_0=r_0.$
Assume that
(65) $\gamma_\tau\le\frac1{\sqrt3\,L}.$
Then
(66) $\mathrm{Err}_{\mathrm{vi}}(\widehat z_t)\le\Big[\sum_{\tau=1}^t\gamma_\tau\Big]^{-1}\Gamma(t),$
where, with $\epsilon_z:=\|\widehat F(z)-F(z)\|_*$,
(67) $\Gamma(t)=2\Theta(r_0)+\sum_{\tau=1}^t\frac{3\gamma_\tau^2}2\Big[M^2+\big(\epsilon_{r_{\tau-1}}+\epsilon_{w_\tau}\big)^2\Big]+\sum_{\tau=1}^t\gamma_\tau\langle\Delta_\tau,w_\tau-y_{\tau-1}\rangle.$
$1^0$. We start with the following property of the prox-mapping: it is nonexpansive,
(68) $\|P_z(\xi)-P_z(\eta)\|\le\|\xi-\eta\|_*\qquad\forall\xi,\eta\in E.$
V (z, u) + h, u zi +
kk2
.
2
h (v) (z) + , v ui 0 u Z.
(72)
h (w) (z) + , w ui 0 u Z.
kk2
.
2
y0 Z o .
(u Z) :
t
X
, u
=1
V (y0 , u) +
t
X
=1
(74)
=1
t
X
h , y 1 i +
=1
t
1X
k k2 .
2 =1
V (y , u) V (y 1 , u) h , ui h , y i V (y 1 , y ) ;
k k2
.
2
r+ = Pz ()
(b)
1
1
h, u wi + k k2 kw zk2 .
2
2
where the concluding inequalities are due to the strong convexity of $\omega(\cdot)$. To conclude the bound (b) of (76), it suffices to note that, by the Young inequality,
$\langle\xi,w-r_+\rangle\le\frac{\|\xi\|_*^2}2+\frac{\|w-r_+\|^2}2.$
$5^0$. We are able now to prove Theorem 2. By (4) we have (here $\epsilon_z=\|\widehat F(z)-F(z)\|_*$)
(78) $\|\widehat F(w_\tau)-\widehat F(r_{\tau-1})\|_*\le L\|r_{\tau-1}-w_\tau\|+M+\epsilon_{r_{\tau-1}}+\epsilon_{w_\tau},$
whence
$\|\widehat F(w_\tau)-\widehat F(r_{\tau-1})\|_*^2\le3L^2\|w_\tau-r_{\tau-1}\|^2+3M^2+3\big(\epsilon_{r_{\tau-1}}+\epsilon_{w_\tau}\big)^2$
and therefore
$\frac{\gamma_\tau^2}2\|\widehat F(w_\tau)-\widehat F(r_{\tau-1})\|_*^2-\frac12\|w_\tau-r_{\tau-1}\|^2\le\frac{3\gamma_\tau^2}2\Big[L^2\|w_\tau-r_{\tau-1}\|^2+M^2+\big(\epsilon_{r_{\tau-1}}+\epsilon_{w_\tau}\big)^2\Big]-\frac12\|w_\tau-r_{\tau-1}\|^2\ \text{[by (78)]}$
$\le\frac{3\gamma_\tau^2}2\Big[M^2+\big(\epsilon_{r_{\tau-1}}+\epsilon_{w_\tau}\big)^2\Big]\quad\text{[by (65)]}.$
h Fb (w ), w ui
=1
V (r0 , u) V (rt , u) +
(r0 ) +
t
X
32 h
=1
t
X
32 h
=1
M 2 + (r 1 + w )2
i
M 2 + (r 1 + w )2 .
=1
(79)
h F (w ), w ui
(r0 ) +
= (r0 ) +
+
t
X
t
X
32 h
=1
t
X
M + (r 1 + w )
t
X
h , w ui
=1
t
X
i
32 h 2
M + (r 1 + w )2 +
h , w y 1 i
2
=1
=1
h , y 1 ui
=1
t
X
h , y 1 ui V (r0 , u) +
=1
(r0 ) +
t
X
2
=1
t
X
2
=1
k k2
2w ,
(80) $\sum_{\tau=1}^t\gamma_\tau\langle F(w_\tau),w_\tau-u\rangle\le\Gamma(t)$
with $\Gamma(t)$ defined in (67). To complete the proof of (66) in the general case, note that since $F$ is monotone, (80) implies that for all $u\in Z$,
$\sum_{\tau=1}^t\gamma_\tau\langle F(u),w_\tau-u\rangle\le\Gamma(t),$
whence
$(\forall u\in Z):\quad\langle F(u),\widehat z_t-u\rangle\le\Big[\sum_{\tau=1}^t\gamma_\tau\Big]^{-1}\Gamma(t),$
and (66) follows.
m
X
i=1
[i (w ) i (ui , (w )i )]
Setting (z) =
Pm
i=1 i (z),
t
X
=1
t
X
m
X
=1
i=1
hF i (w ), (w )i ui i (t).
we get
"
(w )
m
X
i=1
i (ui , (w )i ) (t).
t
X
=1
#"
"
i (zbt )
(zbt )
m
X
i=1
m
X
i=1
i (ui , (zbt ) )
"
t
X
=1
#1
(t).
"
t
X
=1
#1
(t).
5.2. Proof of Theorem 1. In what follows we use the notation from Theorem 2. By this theorem, in the case of constant stepsizes $\gamma_\tau\equiv\gamma$ we have
(81) $\mathrm{Err}_{\mathrm{vi}}(\widehat z_t)\le[\gamma t]^{-1}\Gamma(t),$
"
t
2
3 2 X
(t) = +
M 2 + (r 1 + w )2 + w
2 =1
3
2
(82) 2 +
t
X
h , w y 1 i
=1
t
t h
i
X
7 2 X
h , w y 1 i.
M 2 + 2r 1 + 2w +
2 =1
=1
For a Nash v.i., $\mathrm{Err}_{\mathrm{vi}}$ in this relation can be replaced with $\mathrm{Err}_{\mathrm N}$.
Let us suppose that the random vectors $\zeta_i$ are defined on a probability space $(\Omega,\mathcal F,\mathrm P)$. We define two nested families of $\sigma$-fields $\mathcal F_i=\sigma(r_0,\zeta_1,\zeta_2,\dots,\zeta_{2i-1})$ and $\mathcal G_i=\sigma(r_0,\zeta_1,\zeta_2,\dots,\zeta_{2i})$, $i=1,2,\dots$, so that $\mathcal F_1\subset\mathcal G_1\subset\cdots\subset\mathcal G_{i-1}\subset\mathcal F_i\subset\mathcal G_i\subset\cdots$. Then by the description of the algorithm $r_{\tau-1}$ is $\mathcal G_{\tau-1}$-measurable and $w_\tau$ is $\mathcal F_\tau$-measurable. Therefore $\epsilon_{r_{\tau-1}}$ is $\mathcal F_\tau$-measurable, and $\epsilon_{w_\tau}$ and $\Delta_\tau$ are $\mathcal G_\tau$-measurable. We conclude that under Assumption I we have
(83) $\mathbf E\{\epsilon_{r_{\tau-1}}^2|\mathcal G_{\tau-1}\}\le\sigma^2,\quad \mathbf E\{\epsilon_{w_\tau}^2|\mathcal F_\tau\}\le\sigma^2,\quad \|\mathbf E\{\Delta_\tau|\mathcal F_\tau\}\|_*\le\mu,$
and under Assumptions I, II,
(84) $\mathbf E\big\{\exp\{\epsilon_{r_{\tau-1}}^2/\sigma^2\}\big|\mathcal G_{\tau-1}\big\}\le\exp\{1\},\qquad \mathbf E\big\{\exp\{\epsilon_{w_\tau}^2/\sigma^2\}\big|\mathcal F_\tau\big\}\le\exp\{1\}.$
Now, let
$\Theta_0(t)=\frac{7\gamma^2}2\sum_{\tau=1}^t\Big[M^2+\epsilon_{r_{\tau-1}}^2+\epsilon_{w_\tau}^2\Big].$
Taking expectations and invoking (83), we get
(85) $\mathbf E\{\Theta_0(t)\}\le\frac{7\gamma^2t}2\big[M^2+2\sigma^2\big].$
Further,
(86) $\mathbf E\big\{\langle\Delta_\tau,w_\tau-y_{\tau-1}\rangle\big|\mathcal F_\tau\big\}=\big\langle\mathbf E\{\Delta_\tau|\mathcal F_\tau\},\,w_\tau-y_{\tau-1}\big\rangle\le\mu\|w_\tau-y_{\tau-1}\|\le2\mu\Omega,$
where the concluding inequality follows from the fact that $Z$ is contained in the $\|\cdot\|$-ball of radius $\Omega$ centered at $z_c$, see (18). From (86) it follows that
$\mathbf E\Big\{\sum_{\tau=1}^t\gamma\langle\Delta_\tau,w_\tau-y_{\tau-1}\rangle\Big\}\le2\mu\Omega\gamma t.$
Combining the latter relation, (81), (82) and (85), we arrive at (23). (i) is proved.
To prove (ii), observe, first, that setting
$J_t=\sum_{\tau=1}^t\Big[\sigma^{-2}\epsilon_{r_{\tau-1}}^2+\sigma^{-2}\epsilon_{w_\tau}^2\Big],$
we get
(87) $\Theta_0(t)=\frac{7\gamma^2M^2t}2+\frac{7\gamma^2\sigma^2}2\,J_t.$
Representing $J_t=\sum_{j=1}^{2t}\xi_j$, where the $\xi_j$ are the quantities $\sigma^{-2}\epsilon_{r_{\tau-1}}^2$, $\sigma^{-2}\epsilon_{w_\tau}^2$ ordered according to the natural filtration $\{\mathcal H_j\}$, we conclude from (84) that $\mathbf E\{\exp\{\xi_{k+1}\}|\mathcal H_k\}\le\exp\{1\}$, whence
$\mathbf E\Big\{\exp\Big\{\sum_{j=1}^{k+1}\xi_j\Big\}\Big\}=\mathbf E\Big\{\exp\Big\{\sum_{j=1}^k\xi_j\Big\}\,\mathbf E\big\{\exp\{\xi_{k+1}\}\big|\mathcal H_k\big\}\Big\}\le\exp\{1\}\,\mathbf E\Big\{\exp\Big\{\sum_{j=1}^k\xi_j\Big\}\Big\}.$
It follows that $\mathbf E\{\exp\{J_t\}\}\le\exp\{2t\}$, whence, by the Chebyshev inequality, $\mathrm{Prob}\{J_t>(2+\lambda)t\}\le\exp\{-\lambda t\}$ for every $\lambda\ge0$. In view of (87), the latter relation implies
(89) $\forall\lambda\ge0:\ \mathrm{Prob}\Big\{\Theta_0(t)>\frac{7\gamma^2t}2\big[M^2+2\sigma^2\big]+\lambda\,\frac{7\gamma^2\sigma^2t}2\Big\}\le\exp\{-\lambda t\}.$
(a) E { |F } D,
(b)
Observe that exp{x} x + exp{9x2 /16} for all x. Thus (90.b) implies for
4
0 s 3R
(
s + exp
9s2 R2
16
9s2 2
16
(
|F
9s2 R2
exp s +
16
22
3R2
|F
exp
3s2 R2 2
+
8
3
4
, the latter quantity is exp{3s2 R2 /4}, which combines with
When s 3R
(91) to imply that for s 0,
(92)
s 0 E exp s
t
X
=1
))
t
X
> t + R t
=1
Finally, we arrive at
(93) $\mathrm{Prob}\Big\{\sum_{\tau=1}^t\gamma\langle\Delta_\tau,w_\tau-y_{\tau-1}\rangle>2\gamma\mu\Omega t+2\Lambda\gamma\sigma\Omega\sqrt t\Big\}\le\exp\{-\Lambda^2/3\}$
for all $\Lambda>0$. Combining (81), (82), (89) and (93), we get (24).
5.3. Proof of Lemma 1.
Proof of (i). We clearly have $Z^o=X^o\times Y^o$, and $\omega(\cdot)$ is indeed continuously differentiable on this set. Let $z=(x,y)$ and $z'=(x',y')$, $z,z'\in Z^o$. Then
$\langle\omega'(z)-\omega'(z'),z-z'\rangle=\frac1{2\Omega_x^2}\langle\omega_x'(x)-\omega_x'(x'),x-x'\rangle+\frac1{2\Omega_y^2}\langle\omega_y'(y)-\omega_y'(y'),y-y'\rangle\ge\frac1{2\Omega_x^2}\|x-x'\|_x^2+\frac1{2\Omega_y^2}\|y-y'\|_y^2\ge\|[x-x';y-y']\|^2.$
Writing $F(z)-F(z')=[\delta_x;\delta_y]$ for the $x$- and $y$-components, we get
(94) $\delta_x=\sum_{\nu=1}^m[\psi_\nu'(x')-\psi_\nu'(x)]^*[A_\nu y+b_\nu]+\sum_{\nu=1}^m[\psi_\nu'(x)]^*A_\nu[y-y'],\qquad \delta_y=\sum_{\nu=1}^m A_\nu^*[\psi_\nu(x)-\psi_\nu(x')]+\Phi_*'(y')-\Phi_*'(y).$
We have
kx kx,
m
X
max hh,
[[ (x ) (x)] [A y + b ] + [ (x)] A [y y]]iX
hX khkx 1
=1
#
"
m
X
hX ,khkx 1
hX
=1 khkx 1
m
X
max
=1
hX khkx 1
k[ (x ) (x)]hk() kA y + b k(,)
+
max
hX khkx 1
k (x)hk() kA [y y]k(,) .
Then by (32),
$\|\delta_x\|_{x,*}\le\big[L_x\|x-x'\|_x+M_x\big]\sum_{\nu=1}^m\|A_\nu y+b_\nu\|_{(\nu,*)}+\Omega_xL_x\sum_{\nu=1}^m\|A_\nu[y-y']\|_{(\nu,*)}\le\big[\Omega_xL_x\|z-z'\|+M_x\big]\big[2A\Omega_y+B\big]+\Omega_xL_xA\Omega_y\|z-z'\|,$
which implies
(95) $\|\delta_x\|_{x,*}\le\big[3A\Omega_y+B\big]\Omega_xL_x\|z-z'\|+\big[2A\Omega_y+B\big]M_x.$
Further,
ky ky, =
max
Y,kky 1
max
Y,kky 1
max
Y,kky 1
max
Y,kky 1
max
Y,kky 1
m
X
=1
m
X
=1
m
X
=1
m
X
=1
m
X
A [ (x)
=1
(x )]
+ (y )
(y)
(96) $\|F(z)-F(z')\|_*\le\big[4AL_x\Omega_x^2\Omega_y+\Omega_xB+2\Omega_y^2L_y\big]\|z-z'\|+\big[2A\Omega_x\Omega_y+\Omega_xB\big]M_x+\Omega_yM_y.$
We have justified (37).
$2^0$. Let us verify (38). The first relation in (38) is readily given by (33.a,c). Let us fix $z=(x,y)\in Z$ and $i$, and let
(97) $\Delta=F(z)-\Xi(z,\zeta_i)=\Big[\underbrace{\textstyle\sum_{\nu=1}^m[\psi_\nu'(x)-G_\nu(x,\zeta_i)]^*[A_\nu y+b_\nu]}_{\Delta_x};\ \underbrace{\textstyle\sum_{\nu=1}^m A_\nu^*[\psi_\nu(x)-f_\nu(x,\zeta_i)]}_{\Delta_y}\Big].$
As we have seen,
(98) $\sum_{\nu=1}^m\|A_\nu y+b_\nu\|_{(\nu,*)}\le2A\Omega_y+B.$
y,
max
Y, kky 1
max
max
Y, kky 1
1m
A u ,
=1
=
Y
max ku k()
X
1m
=1
y,
{z
=(i )
m
X
h, [ (x) G (x, i )]
max
hX , khkx 1
=1
X
m
X
h[ (x) G (x, i )]h, iX
max
hX , khkx 1
=1
m
X
k[ (x) G (x, i )]hk() k k(,)
max
hX , khkx 1
=1
m
X
max
k[ (x) G (x, i )]hk() k k(,)
hX , khkx 1
| {z }
=1 |
{z
}
= (i )
kx kx, =
=
kx kx,
P
1m
m
X
A [ (x) f (x, i )]
where all 0,
u , A
=1
m
X
ku k() kA k(,)
Y, kky 1 1m
ky ky, =
max
Y, kky 1
m
X
2Ay + B and
= (i ) =
max
hX , khkx 1
=1
Denoting by $p^2(\xi)$ the second moment of a scalar random variable $\xi$, observe that $p(\cdot)$ is a norm on the space of square summable random variables representable as deterministic functions of $\zeta_i$, and that
$p(\alpha(\zeta_i))\le\Omega_xM_x,\qquad p(\beta_\nu(\zeta_i))\le M_x$
by (33.b,d). Now by (100), (101),
$\big[\mathbf E\{\|\Delta\|_*^2\}\big]^{1/2}\le p\big(\Omega_x\|\Delta_x\|_{x,*}+\Omega_y\|\Delta_y\|_{y,*}\big)\le\Omega_x\big[2A\Omega_y+B\big]\max_\nu p(\beta_\nu(\zeta_i))+\Omega_yA\,p(\alpha(\zeta_i))\le\Omega_x\big[2A\Omega_y+B\big]M_x+\Omega_yA\Omega_xM_x,$
and the latter quantity is $\le M$, see (37). We have established the second relation in (38).
$3^0$. It remains to prove that in the case of (39), relation (40) takes place. To this end, one can repeat word by word the reasoning from item $2^0$ with the function $\widetilde p(\xi)=\inf\big\{t>0:\ \mathbf E\{\exp\{\xi^2/t^2\}\}\le\exp\{1\}\big\}$ in the role of $p(\xi)$. Note that, similarly to $p(\cdot)$, $\widetilde p(\cdot)$ is a norm on the space of random variables which are deterministic functions of $\zeta_i$ and are such that $\widetilde p(\xi)<\infty$.
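As a toy illustration of this Orlicz-type norm (entirely ours, not from the paper), $\widetilde p(\xi)$ can be estimated by Monte Carlo plus bisection, since $t\mapsto\mathbf E\{\exp\{\xi^2/t^2\}\}$ is nonincreasing in $t$; for a standard Gaussian $\xi$ the exact value is $\sqrt{2/(1-e^{-2})}\approx1.52$:

```python
import numpy as np

def p_tilde(samples, tol=1e-6):
    """Monte-Carlo estimate of inf{ t > 0 : E exp(xi^2 / t^2) <= e }."""
    g = lambda t: np.mean(np.exp(np.minimum(samples**2 / t**2, 700.0))) - np.e
    lo, hi = 1e-6, 100.0            # g is decreasing in t; bracket the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
xi = rng.standard_normal(200_000)
t = p_tilde(xi)
assert 1.0 < t < 2.0                # exact value ~1.52 for N(0,1) samples
```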
5.4. Proof of Lemma 2. Item (i) can be verified exactly as in the case of Lemma 1; the facts expressed in (i) depend solely on the construction from Section 4.2 preceding the latter lemma, and are independent of what the setups for $X,\mathcal X$ and $Y,\mathcal Y$ are.
Let us verify item (ii). Note that we are in the situation
(102)
k(x, y)k =
k(, )k =
{z
} |
Xn
j=1
{z
|y | kx x k max |Aj |
1jn
(xj xj )Aj .
2 ln(n)A kz z k,
and
(103) $\mathbf E\big\{\exp\{\|\Xi^k_x(x,y,\zeta)-F^x(x,y)\|_{x,*}^2/N_{k,x}^2\}\big\}\le\exp\{1\},\qquad \mathbf E\big\{\exp\{\|\Xi^k_y(x,y,\zeta)-F^y(x,y)\|_{y,*}^2/N_{k,y}^2\}\big\}\le\exp\{1\},$
with
$N_{k,y}=2\mathcal A\sqrt2\,\exp\{1/2\}\Big[\sqrt{\ln(p^{(1)})}+3\Big]k^{-1/2}.$
[8] Nemirovski, A., Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons. MR0702836
[9] Nemirovski, A. (2004). Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15, 229–251. MR2112984
[10] Lu, Z., Nemirovski, A., Monteiro, R. (2007). Large-scale semidefinite programming via a saddle point Mirror-Prox algorithm. Math. Progr. 109, 211–237. MR2295141
[11] Nemirovski, A., Onn, S., Rothblum, U. (2010). Accuracy certificates for computational problems with convex structure. Math. of Oper. Res. 35:1, 52–78. MR2676756
[12] Nesterov, Yu. (2005). Smooth minimization of non-smooth functions. Math. Progr. 103, 127–152. MR2166537
[13] Nesterov, Yu. (2005). Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16, 235–249. MR2177777
[14] Nesterov, Yu. (2007). Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Progr. 109, 319–344. MR2295146
[15] Rubinfeld, R. (2006). Sublinear time algorithms. In: Marta Sanz-Solé, Javier Soria, Juan Luis Varona, Joan Verdera, Eds., International Congress of Mathematicians, Madrid 2006, Vol. III, 1095–1110. European Mathematical Society Publishing House. MR2275720
LJK, Université J. Fourier
B.P. 53
38041 Grenoble Cedex 9, France
E-mail: juditsky@imag.fr
claire.tauvel@imag.fr
ISyE
Georgia Institute of Technology
Atlanta, GA 30332, USA
E-mail: nemirovs@isye.gatech.edu