Universal Prediction of Individual Sequences

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 38, NO. 4, JULY 1992
Abstract-The problem of predicting the next outcome of an individual binary sequence using finite memory is considered. The finite-state predictability of an infinite sequence is defined as the minimum fraction of prediction errors that can be made by any finite-state (FS) predictor. It is proved that this FS predictability can be attained by universal sequential prediction schemes. Specifically, an efficient prediction procedure based on the incremental parsing procedure of the Lempel-Ziv data compression algorithm is shown to achieve asymptotically the FS predictability. Finally, some relations between compressibility and predictability are pointed out, and the predictability is proposed as an additional measure of the complexity of a sequence.

Index Terms-Predictability, compressibility, complexity, finite-state machines, Lempel-Ziv algorithm.

Manuscript received April 22, 1991; revised October 31, 1991. This work was supported in part by the Wolfson Research Awards administered by the Israel Academy of Science and Humanities, at Tel-Aviv University. This work was partially presented at the 17th Convention of Electrical and Electronics Engineers in Israel, May 1991.
M. Feder is with the Department of Electrical Engineering-Systems, Tel-Aviv University, Tel-Aviv, 69978, Israel.
N. Merhav is with the Department of Electrical Engineering, Technion-Israel Institute of Technology, Haifa, 32000, Israel.
M. Gutman was with the Department of Electrical Engineering, Technion-Israel Institute of Technology, Haifa, 32000, Israel. He is now with Intel Israel, P.O. Box 1659, Haifa, 31015, Israel.
IEEE Log Number 9106944.

I. INTRODUCTION

Imagine an observer receiving sequentially an arbitrary deterministic binary sequence x_1, x_2, ..., and wishing to predict at time t the next bit x_{t+1} based on the past x_1, x_2, ..., x_t. While only a limited amount of information from the past can be memorized by the observer, it is desired to keep the relative frequency of prediction errors as small as possible in the long run.

It might seem surprising, at first glance, that the past can be useful in predicting the future, because when a sequence is arbitrary, the future is not necessarily related to the past. Nonetheless, it turns out that sequential (randomized) prediction schemes exist that utilize the past, whenever helpful in predicting the future, as well as any finite-state (FS) predictor. A similar observation has been made in data compression [1] and gambling [2]. However, while in these problems a conditional probability of the next outcome is estimated, here a decision is to be made for the value of this outcome, and thus it cannot be deduced as a special case of either of these problems.

Sequential prediction of binary sequences has been considered in [3]-[5], where it was shown that a universal predictor, performing as well as the best fixed (or single-state) predictor, can be obtained using the theory of compound sequential Bayes decision rules developed in [6] and [7] and the approachability-excludability theory [8], [9]. In [5], this predictor is extended to achieve the performance of the best Markov predictor, i.e., an FS predictor whose state is determined by a finite number (order) of successive preceding outcomes. Our work extends these results by proving the existence and showing the structure of universal predictors that perform as well as any FS predictor, and by providing a further understanding of the sequential prediction problem.

Analogously to the FS compressibility defined in [1], or the FS complexity defined in [2], we define the FS predictability of an infinite individual sequence as the minimum asymptotic fraction of errors that can be made by any FS predictor. This quantity takes on values between zero and a half, where zero corresponds to perfect predictability and a half corresponds to total unpredictability. While the definition of FS predictability enables a different optimal FS predictor for each sequence, we demonstrate universal predictors, independent of the particular sequence, that always attain the FS predictability.

This goal is accomplished in several steps. In one of these steps, an auxiliary result which might be interesting in its own right is derived. It states that the FS predictability can always be nearly attained by a Markov predictor. Furthermore, if the Markov order grows with time at an appropriate rate, then the exact value of the FS predictability is attained asymptotically. In particular, a prediction scheme based on the Lempel-Ziv (LZ) parsing algorithm can be viewed as such a Markov predictor with a time-varying order, and hence attains the FS predictability.

The techniques and results presented in this paper are not unique to the prediction problem, and they can be extended to more general sequential decision problems [10]. In particular, when these techniques are applied to the data compression problem, the LZ algorithm can be viewed as a universal Markov encoder of growing order which can be analyzed accordingly. This observation may add insight into why the LZ data compression method works well.

Finally, we introduce the notion of predictability as a reasonable measure of complexity of a sequence. It is demonstrated that the predictability of a sequence is not uniquely determined by its compressibility. Nevertheless, upper and lower bounds on the predictability in terms of the compressibility are derived which imply the intuitively appealing result that a sequence is perfectly predictable iff it is totally redundant and, conversely, a sequence is totally unpredictable iff it is incompressible.
Since the predictability is not uniquely determined by the compressibility, it is a distinct feature of the sequence that can be used to define a predictive complexity which is different from earlier definitions associated with the description length, i.e., [11]-[13]. Our definition of predictive complexity is also different from that of [14] and [15], which again is a description-length complexity, but defined in a predictive fashion.

II. FINITE-STATE PREDICTABILITY

Let x = x_1, x_2, ... be an infinite binary sequence. The prediction rule f(.) of an FS predictor is defined by

    x̂_{t+1} = f(s_t),                                              (1)

where x̂_{t+1} ∈ {0, 1} is the predicted value for x_{t+1}, and s_t is the current state, which takes on values in a finite set S = {1, 2, ..., S}. We allow stochastic rules f, namely, selecting x̂_{t+1} randomly with respect to a conditional probability distribution, given s_t. The state sequence of the finite-state machine (FSM) is generated recursively according to

    s_{t+1} = g(x_t, s_t).                                         (2)

The function g(.,.) is called the next-state function of the FSM. Thus, an FS predictor is defined by a pair (f, g). Consider first a finite sequence x_1^n = x_1, ..., x_n and suppose that the initial state s_1 and the next-state function g (and hence the state sequence) are provided. In this case, as discussed in [2], the best prediction rule for the sequence x_1^n is deterministic and given by

    f(s) = 0 if N_n(s, 0) ≥ N_n(s, 1), and f(s) = 1 otherwise,     (3)

where N_n(s, x), s ∈ S, x ∈ {0, 1}, is the joint count of s_t = s and x_{t+1} = x along the sequence x_1^n. Note that this optimal rule depends on the entire sequence x_1^n and hence cannot be determined sequentially.

Applying (3) to x_1^n, the minimum fraction of prediction errors is

    π(g; x_1^n) = (1/n) Σ_{s=1}^{S} min {N_n(s, 0), N_n(s, 1)}.    (4)

Define the minimum fraction of prediction errors with respect to all FSM's with S states as the S-state predictability of x_1^n,

    π_S(x_1^n) = min_{g ∈ G_S} π(g; x_1^n),                        (5)

where G_S is the set of all S^{2S} next-state functions; next, define the asymptotic S-state predictability

    π_S(x) = limsup_{n→∞} π_S(x_1^n),                              (6)

and finally, define the FS predictability as

    π(x) = lim_{S→∞} π_S(x) = lim_{S→∞} limsup_{n→∞} π_S(x_1^n),   (7)

where the limit as S → ∞ always exists since the minimum fraction of errors, for each n and thus for its limit supremum, is monotonically nonincreasing with S. The definitions (5)-(7) are analogous to those associated with the FS compressibility, [1], [2], and [16].

Observe that, by definition, π(x) is attained by a sequence of FSM's that depends on the particular sequence x. In what follows, however, we will present sequential prediction schemes that are universal in the sense of being independent of x and yet asymptotically achieving π(x).

III. S-STATE UNIVERSAL SEQUENTIAL PREDICTORS

We begin with the case S = 1, i.e., single-state machines. From (3), the optimal single-state predictor employs counts N_n(0) and N_n(1) of zeros and ones, respectively, along the entire sequence x_1^n. It constantly predicts "0" if N_n(0) > N_n(1), and "1" otherwise. The fraction of errors made by this scheme is π_1(x_1^n) = n^{-1} min {N_n(0), N_n(1)}. In this section, we first discuss how to achieve sequentially π_1(x_1^n), and later on extend the result to general S-state machines.

Consider the following simple prediction procedure. At each time instant t, update the counts N_t(0) and N_t(1) of zeros and ones observed so far in x_1^t. Choose a small ε > 0, and let p̂_t(x) = (N_t(x) + 1)/(t + 2), x = 0, 1, be the (biased) current empirical probability of x. Consider the symbol x with the larger count, i.e., p̂_t(x) ≥ 1/2. If in addition p̂_t(x) ≥ 1/2 + ε, guess that the next outcome will be x. If p̂_t(x) ≤ 1/2 + ε, i.e., the counts are almost balanced, use a randomized rule for which the probability that the next outcome is x continuously decreases to 1/2 as p̂_t(x) approaches 1/2. Specifically, the prediction rule is

    x̂_{t+1} = "0", with probability φ(p̂_t(0)),
              "1", with probability φ(p̂_t(1)) = 1 - φ(p̂_t(0)),    (8)

where

    φ(α) = 0,                          0 ≤ α < 1/2 - ε,
           (1/(2ε))(α - 1/2) + 1/2,    1/2 - ε ≤ α ≤ 1/2 + ε,
           1,                          1/2 + ε < α ≤ 1.            (9)
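The randomized rule (8)-(9) is straightforward to simulate. The following minimal Python sketch is illustrative only: the function and variable names are ours, and a fixed ε is used for simplicity, whereas the text also allows a time-varying ε_t. It predicts each bit from the biased counts and reports the fraction of errors.

```python
import random

def phi(alpha, eps):
    """Ramp function of eq. (9): 0 below 1/2 - eps, 1 above 1/2 + eps,
    and a linear slope of 1/(2*eps) in between."""
    if alpha < 0.5 - eps:
        return 0.0
    if alpha > 0.5 + eps:
        return 1.0
    return (alpha - 0.5) / (2 * eps) + 0.5

def predict_single_state(bits, eps=0.05, seed=0):
    """Sequential single-state (S = 1) predictor: guess '0' with
    probability phi(p0) where p0 = (N_t(0)+1)/(t+2) is the biased
    empirical probability of zero; return the fraction of errors."""
    rng = random.Random(seed)
    n0 = 0          # N_t(0), the count of zeros seen so far
    errors = 0
    for t, x in enumerate(bits):
        p0 = (n0 + 1) / (t + 2)
        guess = 0 if rng.random() < phi(p0, eps) else 1
        errors += (guess != x)
        n0 += (x == 0)
    return errors / len(bits)
```

On a strongly biased sequence the counts quickly leave the ramp region and the scheme locks onto the majority symbol, so the error fraction approaches n^{-1} min {N_n(0), N_n(1)}, as the analysis in this section predicts.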
However, this choice might be problematic for some sequences. For example, consider the sequence x_1^n = ...                  (17)

4) It is also verified in Appendix A that the best sequence among all sequences of a given composition {N_n(0), N_n(1)}, in the sense of the smallest expected fraction of errors, has the form ...

Applying Theorem 1 to each x^n(s), we find that

    π̂(g; x_1^n) ≤ (1/n) Σ_{s=1}^{S} [min {N_n(s, 0), N_n(s, 1)} + N_n(s) · δ(N_n(s))].
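The per-state bound above is built on the quantity π(g; x_1^n) of (4), which is easy to compute directly for a given next-state function g. The sketch below (the helper names are ours, not from the paper) runs g over the sequence, accumulates the joint counts N(s, x), and charges min {N(s, 0), N(s, 1)} errors at each state.

```python
def fs_prediction_errors(bits, g, s0=0):
    """Minimum fraction of prediction errors pi(g; x_1^n) of eq. (4)
    for the next-state function g started at state s0: the best f for
    a known state sequence errs min{N(s,0), N(s,1)} times at state s."""
    counts = {}  # state -> [N(s, 0), N(s, 1)]
    s = s0
    for x in bits:
        c = counts.setdefault(s, [0, 0])
        c[x] += 1            # count the outcome seen while in state s
        s = g(x, s)          # s_{t+1} = g(x_t, s_t), eq. (2)
    return sum(min(c0, c1) for c0, c1 in counts.values()) / len(bits)

# Example machine: one bit of memory, the state is the previous bit.
markov1 = lambda x, s: x
```

For the alternating sequence 0101... with the one-bit-of-memory machine, only the very first visit to the initial state is charged, so the fraction of errors is 1/n, while the single-state (S = 1) predictor would err half the time.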
π_S(x_1^n). Later on we present much more efficient schemes that achieve the performance of any FS predictor, without even requiring an advance specification of S.

IV. MARKOV PREDICTORS

An important subclass of FS predictors is the class of Markov predictors. A Markov predictor of order k is an FS predictor with 2^k states where s_t = (x_{t-1}, ..., x_{t-k}). ...

Note that Theorem 2 holds for any arbitrary integers k and S, and it becomes meaningful when 2^k >> S, in contrast to (29) in which S = 2^k.

Proof: The idea in the proof is to consider a predictor which is a refinement of both the Markov machine and a given S-state machine. This refined predictor performs better than both machines. We will show, however, that when the Markov order k is large (relative to ln S), the performance of this refined machine with 2^k × S states is not much better than that of the Markov machine with 2^k states. A fortiori, the S-state machine cannot perform much better than the Markov machine.

¹ Throughout this paper, log x = log_2 x while ln x = log_e x.

Let s_t be the state at time t of the S-state machine g, and consider a machine g_j whose state at time t is s̃_t = (s_{t-j}, x_{t-j}, ..., x_{t-1}). Clearly, for every positive integer j, g_j is a refinement of g. As a result, π(g_j; x_1^n) ≤ π(g; x_1^n), and so

    ... [π̂(x | x^j) - π̂(x | x^j, Y)] ...

where the second inequality follows from (31), the third inequality follows from Lemma 1, and the last inequality
FEDER et al.: UNIVERSAL PREDICTION OF INDIVIDUAL SEQUENCES 1263
follows from Jensen's inequality and the concavity of the square root function. By the chain rule of conditional entropies,

    Σ_{j=0}^{k} Ĥ(X | X^j) = Ĥ(X^k, X) = Ĥ(X^{k+1}),               (36)

    Σ_{j=0}^{k} Ĥ(X | X^j, Y) = Ĥ(X, X^k | Y) = Ĥ(X^{k+1} | Y).    (37)

The chain rule applies since the empirical counts are computed using the cyclic convention, resulting in a shift-invariant empirical measure in the sense that the jth order marginal empirical measure derived from the kth order empirical measure (j ≤ k) is independent of the position of the j-tuple in the k-tuple. Now observe that

    Ĥ(X^{k+1}) - Ĥ(X^{k+1} | Y) = Ĥ(X^{k+1}) - Ĥ(X^{k+1}, Y) + Ĥ(Y) ≤ Ĥ(Y) ≤ log S.    (38)

Combining (35)-(38), ... Since g ∈ G_S is arbitrary, the proof is complete.

Having proved (30), one can take the limit supremum as n → ∞, then the limit k → ∞, and finally the limit S → ∞, and obtain μ(x) ≤ π(x), which together with the obvious relation μ(x) ≥ π(x) leads to

    μ(x) = π(x).                                                   (40)

The fact that Markov machines perform asymptotically as well as any FSM is not unique to the prediction problem. In ...

    Ĥ(X | X^j) - Ĥ(X | Y) ≤ ...                                    (41)

Using (41) for all j ≤ k and following the same steps as in (35)-(38),

    Ĥ(X | X^k) ≤ Ĥ(X | Y) + (log S)/(k + 1).                       (42)

This technique is further exercised in [10] to obtain similar relations between the performances of Markov machines and FS machines for a broad class of sequential decision problems.

Next we demonstrate a sequential universal scheme that attains μ(x) and thus π(x). First observe that, from the discussion in Section III, for a fixed k, the kth order Markov predictability can be achieved asymptotically by the predictor (22) with s_t = (x_{t-k+1}, ..., x_t), i.e.,

    x̂_{t+1} = "0", with probability φ(p̂_t(0 | x_t, ..., x_{t-k+1})),
              "1", with probability φ(p̂_t(1 | x_t, ..., x_{t-k+1})),    (43)

where, e.g., p̂_t(. | x_t, ..., x_{t-k+1}) is the biased empirical conditional probability given the current Markov state, and ε is determined as ε_{N_t(x_{t-k+1}, ..., x_t)}.

To attain μ(x), the order k must grow as more data is available. Otherwise, if the order is at most k*, the scheme may not outperform a Markov predictor of order k > k*. Increasing the number of states corresponds to increasing the number of separate counters for N_t(x^k, x). There are two conflicting goals: On one hand, one wants to increase the order rapidly so that a high-order Markov predictability is reached as soon as possible. On the other hand, one has to increase the order slowly enough to assure that there are enough counts in each state for a reliable estimate of p̂_t(x | x_t, ..., x_{t-k+1}). As will be seen, the order k must grow not faster than O(log t) to satisfy both requirements.

More precisely, denote by μ̂_k(x_1^n) the expected fraction of errors of the predictor (43). Following (23) and (24),

    μ̂_k(x_1^n) ≤ μ_k(x_1^n) + δ_{2^k}(n),                          (44)

where δ_{2^k}(n) = O(√(2^k / n)). Suppose now that the observed data is divided into nonoverlapping segments, x = x(1), x(2), ..., and apply the kth order sequential predictor (43) to the kth segment, x(k). Choose a sequence α_k such that α_k → ∞ monotonically as k → ∞, and let the length of ...

Thus, in each segment the Markov predictability of the respective order is attained as k increases, at a rate that depends on the choice of α_k.

Consider now a finite, arbitrarily long sequence x_1^n, where n = Σ_{k=1}^{k_n} n_k and k_n is the number of segments in x_1^n. The average fraction of errors made by the above predictor, denoted μ̂(x_1^n), satisfies ...
Now, for any fixed k' < k_n,

    ...                                                            (47)

where (47) holds since always μ_k(x(k)) ≤ 1/2, since μ_i(x(k)) ≤ μ_j(x(k)) when i > j, and since adding positive terms only increases the right-hand side (RHS) of (47). Now, the term Σ_{k=1}^{k_n} (n_k/n) μ_{k'}(x(k)) is the fraction of errors made in predicting x_1^n by a machine whose state is determined by the k'th Markov state and the current segment number. This machine is a refinement of the k'th-order Markov predictor. Thus, ...

Also, since ε(k) is monotonically decreasing and since the length of each segment is monotonically increasing, we can write

    ...

where by the Cesàro theorem ε(k_n) → 0 as k_n → ∞. Thus, we can summarize the result of this section in the following theorem.

Theorem 3: For any finite k, any finite S, and for any x_1^n ∈ {0, 1}^n, ...

The LZ parsing algorithm parses an outcome sequence into distinct phrases such that each phrase is the shortest string which is not a previously parsed phrase. For example, the sequence 001010100... is parsed into {0, 01, 010, 1, 0100, ...}. It is convenient to consider this procedure as a process of growing a tree, where each new phrase is represented by a leaf in the tree. The initial tree for binary sequences consists of a root and two leaves, corresponding to the phrases {0, 1}, respectively. At each step, the current tree is used to create an additional phrase by following the path (from the root to a leaf) that corresponds to the incoming symbols. Once a leaf has been reached, the tree is extended at that point, making the leaf an internal node and adding its two offspring to the tree. The process of growing a binary tree for the above example is shown in Fig. 2. The set of phrases which correspond to the leaves of the tree is called a dictionary. Note that in the process of parsing the sequence, each outcome x_t is associated with a node reached by the path corresponding to the string starting at the beginning of the phrase and ending at x_t.

Let K_j be the number of leaves in the jth step (note that K_j = j + 1 for binary sequences) and assign a weight 1/K_j to each leaf. This can be thought of as assigning a uniform probability mass function to the leaves. The weight of each internal node is the sum of the weights of its two offspring (see Fig. 2 for an example). Define the conditional probability P̂^{IP}(x_{t+1} | x_1^t) of a symbol x_{t+1} given its past as the ratio between the weight of the node corresponding to x_{t+1} (0 or 1) that follows the current node x_t, and the weight of the node associated with x_t. Note that if x_{t+1} is the first symbol of a new phrase, the node associated with x_t is the root. This definition of P̂^{IP}(x_{t+1} | x_1^t) as the conditional probability ...
[Fig. 2. The process of growing the parsing tree for the sequence 001010100...: the dictionary after the phrases "0", "01", "010", and "1"; at each step the leaves carry the uniform weights 1/K_j and each internal node carries the sum of the weights of its two offspring.]

Theorem 4: For every sequence x_1^n and any integer k ≥ 0,

    π̂^{IP}(x_1^n) ≤ μ_k(x_1^n) + v(n, k),                          (52)

where for a fixed k, v(n, k) = O(1/√(log n)).

Proof: Following the counting interpretation, the IP predictor is a set of sequential predictors, each operating on a separate bin. Applying Theorem 1 to each one of the c bins and averaging over the bins similarly to (23), we get

    π̂^{IP}(x_1^n) ≤ (1/n) Σ_{j=1}^{c} (min {N^j(0), N^j(1)} + N^j · δ_1(N^j))

                  ≤ (1/n) Σ_{j=1}^{c} min {N^j(0), N^j(1)} + δ_1(n),    (53)

    ... + Σ min {N^j(0), N^j(1)},                                  (54)
A result concerning the compression performance of the Lempel-Ziv algorithm, known as Ziv's inequality and analogous to the theorem above, states that the compression ratio of the LZ algorithm is upper bounded by the kth-order empirical entropy plus an O((log log n)/(log n)) term. This result has been originally shown in [22] (see also [23, ch. 12]). It can be proved, more directly, by a technique similar to the above, utilizing (42) and the O((2^k/n) log (n/2^k)) convergence rate observed in universal coding schemes for Markov sources, [20], [24]-[26], and the fact that the LZ algorithm has an equivalent order of log c = log (n/log n).

For each individual sequence, the compression ratio of the LZ algorithm is determined uniquely by the number of parsed strings c(x_1^n), which is a relatively easily computable quantity. It is well known [1] that this compression ratio, n^{-1} c(x_1^n) log c(x_1^n), is a good estimator for the compressibility of the sequence (and the entropy of a stationary ergodic source). As will be evident from the discussion in the next section, the predictability of a sequence cannot be uniquely determined by its compressibility, and hence neither by c(x_1^n). It is thus an interesting open problem to find an easily calculable estimator for the predictability.

Finally, the IP predictor proposed and analyzed here has been suggested independently in [27] as an algorithm for page prefetching into a cache memory. For this purpose, the algorithm in [27] was suggested and analyzed in a more general setting of nonbinary data, and in the case where one may predict that the next outcome lies in a set of possible values (corresponding to a cache size larger than 1). However, unlike our analysis, which holds for any individual sequence, the analysis in [27] was performed under the assumption that the data is generated by a finite-state probabilistic source.

VI. PREDICTABILITY AND COMPRESSIBILITY

Intuitively, predictability is related to compressibility in the sense that sequences which are easy to compress seem to be also easy to predict and, conversely, incompressible sequences are hard to predict. In this section we try to consolidate this intuition.

The definition of FS predictability is analogous to the definition of the FS compressibility ρ(x), see [1], [2], and [16]. Specifically, the FS compressibility (or FS complexity, FS empirical entropy) was defined in [2] as

    ρ(x) ≜ lim_{S→∞} limsup_{n→∞} min_{g ∈ G_S} ρ(g; x_1^n),       (57)

where

    ρ(g; x_1^n) = (1/n) Σ_{s=1}^{S} N_n(s) h(N_n(s, 0)/N_n(s)),    (58)

and where h(α) = -α log α - (1 - α) log (1 - α) is the binary entropy function. This quantity represents the optimal data compression performance of any FS machine where the integer codeword length constraint is relaxed (this noninteger codelength can be nearly obtained using, e.g., arithmetic coding [28]). Also, utilizing the relation between compression and gambling, [17], [18], the quantity 1 - ρ(x) is the optimal capital growth rate in sequential gambling over the outcome of the sequence using any FS machine.

As was observed in [2], by the concavity of h(.) and Jensen's inequality,

    ρ(g; x_1^n) = (1/n) Σ_{s=1}^{S} N_n(s) h(N_n(s, x̂)/N_n(s)) ≤ h(π(g; x_1^n)),    (59)

where x̂ = arg min_x N_n(s, x). By minimizing over g ∈ G_S, taking the limit supremum as n → ∞ and the limit S → ∞ on both sides of (59), and by the monotonicity of h(.) in the domain [0, 1/2],

    π(x) ≥ h^{-1}(ρ(x)).                                           (60)

An upper bound on the predictability in terms of the compressibility can be derived as well. Since h(α) ≥ 2α for 0 ≤ α ≤ 1/2,

    ρ(g; x_1^n) ≥ 2 π(g; x_1^n),                                   (61)

which leads to

    π(x) ≤ (1/2) ρ(x).                                             (62)

Both the upper bound and the lower bound, as well as any point in the region in between, can be obtained by some sequence. Thus, the compressibility of the sequence does not determine uniquely its predictability. The achievable region in the ρ-π plane is illustrated in Fig. 3.

The lower bound (60) is achieved whenever p̂_n(x̂ | s) = N_n(s, x̂)/N_n(s) are equal for all s. This is the case where the FS compressibility of the sequence is equal to the zero-order empirical entropy of the prediction error sequence (i.e., the prediction error sequence is "memoryless"). Only in this case, a predictive encoder based on the optimal FS predictor will perform as well as the optimal FS encoder.

The upper bound (62) is achieved when at some states p̂_n(x̂ | s) = 0 and in the remaining states p̂_n(x̂ | s) = 1/2, i.e., in a case where the sequence can be decomposed by an FSM into perfectly predictable and totally unpredictable subsequences.

The upper and lower bounds coincide at (ρ = 0, π = 0) and (ρ = 1, π = 1/2), implying that a sequence is perfectly predictable iff it is totally redundant and, conversely, a sequence is totally unpredictable iff it is incompressible.

A new complexity measure may be defined based on the notion of predictability. Analogously to the complexity definitions of Solomonoff [11], Kolmogorov [12], and Chaitin [13], we may define the predictive complexity as the minimum fraction of errors made by a universal Turing machine in sequentially predicting the future of the sequence. A point to observe is that while the complexities above are related to the description length (or program length) and hence, to each ...
... predictability, i.e., π(X_1) ≥ π(X_2 | X_1) ≥ π(X_3 | X_1, X_2), etc. These definitions can be generalized to random vectors and stochastic processes. For example, the predictability of a stationary ergodic process X is defined as

    π(X) = lim_{n→∞} π(X_{n+1} | X_1, ..., X_n) = π(X_0 | X_{-1}, X_{-2}, ...).    (65)

It will be interesting to further explore the predictability measure and its properties: for example, to establish the predictability as the minimum frequency of errors that can be made by any sequential predictor over the outcome of a general (ergodic) source, and the convergence of the performance of prediction schemes to the source's predictability. Another problem of interest is the derivation of a tight lower bound on the rate at which the predictability can be approached asymptotically by a universal predictor for a parametric class of sources. This problem is motivated by an analogous existing result in data compression [14], which states that no lossless code has a compression ratio that approaches the entropy faster than (S/2n) log n, where S is the number of parameters, except for a small subset of the sources corresponding to a small subset of the parameter space. A solution to this problem might follow from [15]. It is interesting, in general, to relate the results and approach of ...

where ℓ(.) = 1 - φ(.). The loss incurred along the downward loop is ... We want to show that α ≤ β. We may write

    α - β = ...

where φ_t(.) denotes the function φ(.) of (9) with a possibly time-varying ε = ε_t. Observe that (N_t(0) + 2)/(t + 3) > (N_t(0) + 1)/(t + 2) > (N_t(0) + 1)/(t + 3) ≥ 1/2, and consider the following two possible cases. In the first case, (N_t(0) + 1)/(t + 2) > 1/2 + ε_t, and when ε_t is nonincreasing we have (N_t(0) + 2)/(t + 3) > 1/2 + ε_{t+1}, and so φ_t((N_t(0) + 1)/(t + 2)) = φ_{t+1}((N_t(0) + 2)/(t + 3)) = 1, in which case α - β = φ_{t+1}((N_t(0) + 1)/(t + 3)) - 1 ≤ 0. In the second case, (N_t(0) + 1)/(t + 2) ≤ 1/2 + ε_t, and we can replace φ_t((N_t(0) + 1)/(t + 2)) by (1/(2ε_t))[(N_t(0) + 1)/(t + 2) - 1/2] + 1/2. Also, note that a continuation of the sloping part of φ(.) to the right serves as an upper bound
    = (N_t(0) - t)/(2ε_{t+1}(t + 3)) - (N_t(0) - t)/(2ε_t(t + 2)).

Thus, as long as ε_t(t + 2) ≤ ε_{t+1}(t + 3), we have again α ≤ β. In other words, whenever ε_t is nonincreasing and the function f(t) = ε_t(t + 2) is nondecreasing (e.g., a constant ε, or an ε_t that monotonically goes to zero, say, as C/√(t + 2), but not faster than C/(t + 2)), the loss incurred in the upward loop is smaller than that of the downward loop.

When N_t(1) > N_t(0), then α is the loss incurred in the downward loop and β is the loss incurred in the upward loop. Using similar observations we can show that β ≤ α under the similar condition on ε_t. Again, the loss incurred over the downward loop can only be greater than the loss incurred in the upward loop. We note that the proof above can be generalized to any nondecreasing φ(α), concave for α ≥ 1/2, such that φ(α) = 1 - φ(1 - α).

Now, given any sequence x_1^n and its trellis, one can replace every upward loop by a downward loop and thereby increase the average loss at each step. After a finite number of such steps one reaches a sequence such that the first N_n(1) pairs of bits (x_{2t-1}, x_{2t}) are all either "01" or "10", and the remaining N_n(0) - N_n(1) bits are all "0". For such a sequence, all upward loops correspond to k = 0 and hence cannot be replaced by downward loops. Thus, every sequence of this structure, in particular the sequence (12), incurs the same maximal loss. Note that by replacing any downward loop with an upward loop one ends up with a sequence of the form (19) which has, as a result, the smallest loss among all the sequences of the given composition.

Proof of (13): Since the sequence of (12), denoted x̃_1^n, is the worst sequence for a given value of π_1(x_1^n), then π̂_1(x_1^n) ≤ π̂_1(x̃_1^n). The average loss over x̃_1^n is

    ...

where in the last inequality we used the fact that 2N_n(1) ≤ n. As for B = Σ ℓ((N_n(1) + k)/(2N_n(1) + k + 1)), some of the terms are zero, and the arguments of ℓ(.) for the nonzero terms must satisfy (N_n(1) + k)/(2N_n(1) + k + 1) ≤ 1/2 + ε; hence, for these terms,

    k ≤ (4εN_n(1) + 2ε)/(1 - 2ε) ≜ K.

Also, the nonzero terms are smaller than 1/2, since the argument of ℓ(.) is greater than 1/2. Thus,

    B ≤ Σ_{k=1}^{K} 1/2 = K/2 = (2εN_n(1) + ε)/(1 - 2ε) < (ε/(1 - 2ε)) n + (1 + 2ε)/(2(1 - 2ε)).    (A.5)

Combining (A.3), (A.4), and (A.5), we get

    ... + (1 + 2ε)/(2(1 - 2ε)),                                    (A.6)

    ... ≤ N_n(1)/2 + (1/2) ∫ ... du ≤ N_n(1)/2 + √(n + 1)/2 + 1/2.    (A.7)
    K ≤ √(2N_n(1) + 1) + 1 ≤ √(2N_n(1) + 1) + 2,

where the second inequality holds since N_n(1) ≥ 0. Now each of these K nonzero terms is smaller than 1/2, and so

    B ≤ Σ_{k=1}^{K} 1/2 = (1/2)√(2N_n(1) + 1) + 1 ≤ (1/2)√(n + 1) + 1.    (A.8)

Combining (A.3), (A.7), and (A.8), we get

    π̂_1(x̃_1^n) ≤ N_n(1)/2 + A + B ≤ N_n(1) + ...,                 (A.9)

which completes the proof of (14).

Note that slightly different expressions for the loss may be obtained by choosing ε_t = ε_0 √2/√(t + 2) with arbitrary ε_0; however, in all these cases the excess loss beyond π_1(x_1^n) decays like O(1/√n), and our choice of ε_0 = 1/(2√2) leads to the tightest bound on the coefficient of 1/√n that is attained by the previous technique.

APPENDIX B

Proof of Lemma 1: The proof is based on Pinsker's inequality (see, e.g., [29, ch. 3, problem 17]), asserting that for every 0 ≤ p ≤ 1 and 0 ≤ q ≤ 1,

    p log (p/q) + (1 - p) log ((1 - p)/(1 - q)) ≥ (2/ln 2)(p - q)².

Since |min {p, 1 - p} - min {q, 1 - q}| ≤ |p - q|,

    p log (p/q) + (1 - p) log ((1 - p)/(1 - q)) ≥ (2/ln 2)(min {p, 1 - p} - min {q, 1 - q})².    (B.1)

Let p̂_n(.) denote an empirical measure based on x_1^n where ... , which completes the proof of the lemma.

REFERENCES

[1] J. Ziv and A. Lempel, "Compression of individual sequences via variable rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, Sept. 1978.
[2] M. Feder, "Gambling using a finite-state machine," IEEE Trans. Inform. Theory, vol. 37, pp. 1459-1465, Sept. 1991.
[3] J. F. Hannan, "Approximation to Bayes risk in repeated plays," in Contributions to the Theory of Games, Vol. III, Annals of Mathematics Studies, no. 39, Princeton, NJ, 1957, pp. 97-139.
[4] T. M. Cover, "Behavior of sequential predictors of binary sequences," in Proc. 4th Prague Conf. Inform. Theory, Statistical Decision Functions, Random Processes, 1965. Prague: Publishing House of the Czechoslovak Academy of Sciences, 1967, pp. 263-272.
[5] T. M. Cover and A. Shenhar, "Compound Bayes predictors for sequences with apparent Markov structure," IEEE Trans. Syst. Man Cybern., vol. SMC-7, pp. 421-424, May/June 1977.
[6] H. Robbins, "Asymptotically subminimax solutions of compound statistical decision problems," in Proc. Second Berkeley Symp. Math. Statist. Prob., 1951, pp. 131-148.
[7] J. F. Hannan and H. Robbins, "Asymptotic solutions of the compound decision problem for two completely specified distributions," Ann. Math. Statist., vol. 26, pp. 37-51, 1955.
[8] D. Blackwell, "An analog to the minimax theorem for vector payoffs," Pac. J. Math., vol. 6, pp. 1-8, 1956.
[9] D. Blackwell, "Controlled random walk," in Proc. Int. Congr. Mathematicians, 1954, Vol. III. Amsterdam: North Holland, 1956, pp. 336-338.
[10] N. Merhav and M. Feder, "Universal schemes for sequential decision from individual data sequences," submitted to Inform. Computat., 1991.
[11] R. J. Solomonoff, "A formal theory of inductive inference, pts. 1 and 2," Inform. Contr., vol. 7, pp. 1-22 and pp. 224-254, 1964.
[12] A. N. Kolmogorov, "Three approaches to the quantitative definition of information," Probl. Inform. Transm., vol. 1, pp. 4-7, 1965.
[13] G. J. Chaitin, "A theory of program size formally identical to information theory," J. ACM, vol. 22, pp. 329-340, 1975.
[14] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inform. Theory, vol. IT-30, pp. 629-636, 1984.
[15] J. Rissanen, "Stochastic complexity and modelling," Ann. Statist., vol. 14, pp. 1080-1100, 1986.
[16] J. Ziv, "Coding theorems for individual sequences," IEEE Trans. Inform. Theory, vol. IT-24, pp. 405-412, July 1978.
[17] T. M. Cover, "Universal gambling schemes and the complexity measures of Kolmogorov and Chaitin," Tech. Rep. 12, Dept. of Statistics, Stanford Univ., Oct. 1974.
[18] T. M. Cover and R. King, "A convergent gambling estimate of the entropy of English," IEEE Trans. Inform. Theory, vol. IT-24, pp. 413-421, July 1978.
[19] G. G. Langdon, "A note on the Lempel-Ziv model for compressing individual sequences," IEEE Trans. Inform. Theory, vol. IT-29, pp. 284-287, Mar. 1983.
[20] J. Rissanen, "A universal data compression system," IEEE Trans. Inform. Theory, vol. IT-29, pp. 656-664, July 1983.
[21] A. Lempel and J. Ziv, "On the complexity of finite sequences," IEEE Trans. Inform. Theory, vol. IT-22, pp. 75-81, Jan. 1976.
[22] E. Plotnik, M. J. Weinberger, and J. Ziv, "Upper bounds on the probability of sequences emitted by finite-state sources and the redundancy of the Lempel-Ziv algorithm," IEEE Trans. Inform. Theory, vol. 38, pp. 66-72, Jan. 1992.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[24] L. D. Davisson, "Universal lossless coding," IEEE Trans. Inform. Theory, vol. IT-19, pp. 783-795, Nov. 1973.
[25] L. D. Davisson, "Minimax noiseless universal coding for Markov sources," IEEE Trans. Inform. Theory, vol. IT-29, pp. 211-215, Mar. 1983.
[26] J. Rissanen, "Complexity of strings in the class of Markov sources," IEEE Trans. Inform. Theory, vol. IT-32, pp. 526-532, July 1986.
[27] J. S. Vitter and P. Krishnan, "Optimal prefetching via data compression," Tech. Rep. CS-91-46, Brown Univ.; also summarized in Proc. Conf. Foundations of Comput. Sci. (FOCS), 1991, pp. 121-130.
[28] J. Rissanen, "Generalized Kraft's inequality and arithmetic coding," IBM J. Res. Dev., vol. 20, no. 3, pp. 198-203, 1976.
[29] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic Press, 1981.