L2 Review of Basics
© E. Merényi, Lecture 2, COMP / ELEC / STAT 502
Outline
Probability basics
  Random variables, CDF, pdf, moments
  Covariance, joint CDF and pdf
Information theory basics
  Information, entropy, differential entropy
  Maximum Entropy principle
  Further entropy definitions
  Mutual information and relation to entropy
Additional properties of random variables
Probability basics

Random variables, CDF, pdf, moments

The cumulative distribution function (CDF) of a random variable \xi is

F_\xi(y) = \Pr(\xi \le y) = \int_{-\infty}^{y} f_\xi(v)\,dv    (2.1)

where f_\xi is the probability density function (pdf) of \xi, obtained from the CDF by differentiation:

F'_\xi(y) = f_\xi(y)    (2.2)
The pdf is nonnegative and normalizes to one:

f_\xi(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f_\xi(x)\,dx = 1    (2.3)

Equivalently to (2.2), the pdf is the limit

f_\xi(x) = \lim_{h \to 0} \frac{F_\xi(x + h/2) - F_\xi(x - h/2)}{h} = \lim_{h \to 0} \frac{\Pr[x - h/2 < \xi \le x + h/2]}{h}    (2.4)

The larger f_\xi(x), the larger the probability that \xi falls close to x.
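To make (2.4) concrete, here is a minimal Python sketch (assuming NumPy and SciPy are available; not part of the lecture) comparing the finite difference of the standard Gaussian CDF with its pdf:

import numpy as np
from scipy.stats import norm

# Finite-difference version of (2.4) for a standard Gaussian:
# [F(x + h/2) - F(x - h/2)] / h should approach f(x) as h -> 0.
x = np.linspace(-3, 3, 13)
h = 1e-5
fd = (norm.cdf(x + h / 2) - norm.cdf(x - h / 2)) / h
print(np.max(np.abs(fd - norm.pdf(x))))  # tiny: the difference quotient matches the pdf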
Moments

The k-th central moment of \xi is

m_\xi^{(k)} = \int_{-\infty}^{\infty} (x - E[\xi])^k f_\xi(x)\,dx    (2.6)

where the mean E[\xi] is the first (raw) moment m_\xi^{(1)}.
Sample estimates from observations \xi_1, \ldots, \xi_n:

Mean = M_\xi^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \xi_i    (2.7)

Variance = m_\xi^{(2)} = \frac{1}{n-1} \sum_{i=1}^{n} (\xi_i - M_\xi^{(1)})^2    (2.8)

Standard Deviation (STD) = \sqrt{\text{Variance}}    (2.9)
Skewness = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\xi_i - M_\xi^{(1)}}{\text{STD}} \right)^{3} = \frac{m_\xi^{(3)}}{(m_\xi^{(2)})^{3/2}}    (2.10)

(Normalized) Kurtosis = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\xi_i - M_\xi^{(1)}}{\text{STD}} \right)^{4} - 3 = \frac{m_\xi^{(4)}}{(m_\xi^{(2)})^2} - 3    (2.11)
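As an illustration, a short NumPy sketch of the sample estimators (2.7)-(2.11); the helper name sample_stats is ours, not lecture notation:

import numpy as np

def sample_stats(xi):
    """Sample estimates (2.7)-(2.11) for a 1-D array of observations."""
    n = len(xi)
    mean = xi.sum() / n                       # (2.7)
    var = ((xi - mean) ** 2).sum() / (n - 1)  # (2.8), unbiased
    std = np.sqrt(var)                        # (2.9)
    z = (xi - mean) / std
    skew = (z ** 3).mean()                    # (2.10)
    kurt = (z ** 4).mean() - 3                # (2.11), normalized
    return mean, var, std, skew, kurt

rng = np.random.default_rng(0)
print(sample_stats(rng.standard_normal(100_000)))  # skewness and kurtosis near 0 for Gaussian data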
Covariance, joint CDF and pdf

The covariance of \xi and \eta is

c_{\xi\eta} = E[(\xi - M_\xi^{(1)})(\eta - M_\eta^{(1)})]    (2.13)

or equivalently

c_{\xi\eta} = E[\xi\eta] - M_\xi^{(1)} M_\eta^{(1)}    (2.14)

For a random vector x, the corresponding quantities are the covariance matrix and the correlation matrix:

C_x = E[(x - M_x^{(1)})(x - M_x^{(1)})^T]    (2.15)

R_x = E[x x^T]    (2.16)
The joint CDF of \xi and \eta is

F_{\xi,\eta}(x, y) = \Pr(\xi \le x, \eta \le y)    (2.17)

= \int_{-\infty}^{x} \int_{-\infty}^{y} f_{\xi,\eta}(u, v)\,du\,dv    (2.18)

where the joint pdf is

f_{\xi,\eta}(x, y) = \frac{\partial^2 F_{\xi,\eta}(x, y)}{\partial x\,\partial y}    (2.19)
Marginal distributions

F_\xi(x) = F_{\xi,\eta}(x, \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{\xi,\eta}(u, v)\,du\,dv    (2.20)

F_\eta(y) = F_{\xi,\eta}(\infty, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{y} f_{\xi,\eta}(u, v)\,du\,dv    (2.21)
Statistical independence

A pair of joint random variables \xi and \eta are statistically independent iff
their joint CDF factorizes into the product of the marginal CDFs:

F_{\xi,\eta}(x, y) = F_\xi(x) F_\eta(y)    (2.22)

Equivalently, the joint pdf factorizes as f_{\xi,\eta}(u, v) = f_\xi(u) f_\eta(v), so that

F_{\xi,\eta}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{\xi,\eta}(u, v)\,du\,dv    (2.23)

= \int_{-\infty}^{x} f_\xi(u)\,du \int_{-\infty}^{y} f_\eta(v)\,dv = F_\xi(x) F_\eta(y)    (2.24)
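A hedged numerical sketch of this factorization (sample sizes and bin counts are arbitrary choices): for independently generated samples, the joint histogram should approximately equal the product of its marginals.

import numpy as np

# Empirical check of (2.22)/(2.24): for independent samples, the joint
# cell probabilities should be close to the product of the marginal ones.
rng = np.random.default_rng(1)
xi = rng.standard_normal(200_000)
eta = rng.uniform(-1, 1, 200_000)          # independent of xi by construction

joint, _, _ = np.histogram2d(xi, eta, bins=20)
joint /= joint.sum()                       # joint probability per cell
px = joint.sum(axis=1, keepdims=True)      # marginal of xi
py = joint.sum(axis=0, keepdims=True)      # marginal of eta
print(np.max(np.abs(joint - px * py)))     # small: joint ~ product of marginals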
Conditional probability

F_{\xi|\eta}(x|y) = F_{\xi,\eta}(x, y) / F_\eta(y)    (2.25)

f_{\xi|\eta}(x|y) = f_{\xi,\eta}(x, y) / f_\eta(y)    (2.26)

From (2.26), the marginal pdf and the conditional expectation follow:

f_\xi(x) = \int_{-\infty}^{\infty} f_{\xi|\eta}(x|y) f_\eta(y)\,dy

E[\xi|\eta = y] = \int_{-\infty}^{\infty} x\, f_{\xi|\eta}(x|y)\,dx
Information theory basics
Information

Let X be a discrete random variable, taking its values from
H = \{x_1, x_2, \ldots, x_n\}. Let p_i be the probability of x_i, with p_i \ge 0 and
\sum_{i=1}^{n} p_i = 1. The amount of information in observing the event X = x_i
was defined by Shannon as

I(x_i) = \log_a(1/p_i)    (2.27)

= -\log_a(p_i)    (2.28)
Shannon entropy

The expected value of I(x_i) over the entire set of events H is the entropy of X:

H(X) = -\sum_{i=1}^{n} p_i \log(p_i)    (2.29)
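A minimal sketch of (2.29) in Python; the helper name shannon_entropy is an assumption, not lecture notation:

import numpy as np

def shannon_entropy(p, base=2):
    """Entropy (2.29) of a discrete distribution p (in bits by default)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -(p * np.log(p)).sum() / np.log(base)

print(shannon_entropy([0.5, 0.5]))    # 1 bit: a fair coin
print(shannon_entropy([0.99, 0.01]))  # ~0.08 bits: a nearly certain event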
The entropy is bounded:

0 \le H(X) \le \log_a(n)    (2.30)

with H(X) = 0 when one outcome is certain, and the maximum attained by the
uniform distribution p_i = 1/n.
Differential entropy

A quantity analogous to the entropy of a discrete random variable can be
defined for continuous random variables.

Reminder: The random variable \xi is continuous if its distribution
function F_\xi(x) is continuous.

Definition: Let f_\xi(x) = F'_\xi(x) be the pdf of \xi. The set S = \{x \mid f_\xi(x) > 0\}
is called the support set of \xi. The differential entropy of \xi is defined as

h(\xi) = -\int_S f_\xi(x) \log f_\xi(x)\,dx = -E[\log f_\xi(x)]    (2.32)
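As a sketch, (2.32) can be evaluated numerically over the support set; for a Gaussian with standard deviation sigma, the result should match the closed form \frac{1}{2}\log(2\pi e \sigma^2) that appears later in (2.44). Grid size and range are arbitrary choices:

import numpy as np
from scipy.stats import norm

# Riemann-sum evaluation of the differential entropy (2.32) for a Gaussian.
sigma = 2.0
x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
dx = x[1] - x[0]
f = norm.pdf(x, scale=sigma)
mask = f > 0                                   # restrict to the support set S
h_num = -np.sum(f[mask] * np.log(f[mask])) * dx
print(h_num, 0.5 * np.log(2 * np.pi * np.e * sigma**2))  # the two values agree closely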
Maximum Entropy principle

Maximize the differential entropy

h(\xi) = -\int f(x) \log f(x)\,dx    (2.34)

over all pdfs f(x) of the r.v. \xi, subject to the following constraints:

f(x) \ge 0    (2.35)

\int f(x)\,dx = 1    (2.36)

\int f(x) g_i(x)\,dx = \alpha_i \quad \text{for } i = 2, \ldots, m    (2.37)

where f(x) = 0 holds outside of the support set of \xi, and the \alpha_i represent the
prior knowledge about \xi.
Use Lagrange multipliers \lambda_i (i = 1, 2, \ldots, m) to formulate the objective function

J(f) = \int \left[ -f(x) \log f(x) + \lambda_1 f(x) + \sum_{i=2}^{m} \lambda_i g_i(x) f(x) \right] dx    (2.38)

Setting the derivative of the integrand with respect to f(x) to zero:

-1 - \log f(x) + \lambda_1 + \sum_{i=2}^{m} \lambda_i g_i(x) = 0    (2.39)
Solving (2.39) for f(x) gives the exponential form

f(x) = \exp\left( -1 + \lambda_1 + \sum_{i=2}^{m} \lambda_i g_i(x) \right)    (2.40)

Example: If the prior knowledge consists of the mean \mu and the variance
\sigma^2 of \xi, then

\int f(x)(x - \mu)^2\,dx = \sigma^2    (2.41)
Solving with these constraints yields the Gaussian density

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2 / 2\sigma^2}    (2.42)

i.e., among all densities with given mean \mu and variance \sigma^2, the Gaussian
has the largest differential entropy.
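A small numerical illustration of this result, using standard closed-form entropies under a matched-variance assumption: among unit-variance uniform, Laplacian, and Gaussian densities, the Gaussian's differential entropy is largest.

import numpy as np

# Closed-form differential entropies (natural log) for variance sigma^2 = 1:
sigma2 = 1.0
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma2)  # ~1.419, cf. (2.44)
h_uniform = 0.5 * np.log(12 * sigma2)              # log(b - a) with (b - a)^2 / 12 = sigma^2; ~1.242
h_laplace = 1 + 0.5 * np.log(2 * sigma2)           # 1 + log(2 / lambda) with 2 / lambda^2 = sigma^2; ~1.347
print(h_gauss, h_uniform, h_laplace)               # the Gaussian value is the largest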
The differential entropy of the Gaussian (2.42) evaluates to

h(\xi) = \frac{1}{2} \log(2\pi e \sigma^2)    (2.44)

which depends only on the variance, not on the mean.
Further entropy definitions

For a pair of continuous random variables, the joint and conditional
differential entropies are defined analogously:

h(\xi, \eta) = -\iint f_{\xi,\eta}(x, y) \log f_{\xi,\eta}(x, y)\,dx\,dy

h(\xi|\eta) = -\iint f_{\xi,\eta}(x, y) \log f_{\xi|\eta}(x|y)\,dx\,dy    (2.45)
Mutual information

Mutual information is the measure of the amount of information one random
variable contains about another:

I(\xi; \eta) = \iint f_{\xi,\eta}(x, y) \log \frac{f_{\xi,\eta}(x, y)}{f_\xi(x) f_\eta(y)}\,dx\,dy    (2.46)

The concept of mutual information is of profound importance in the design of
self-organizing systems, where the objective is to develop an algorithm that
can learn an input-output relationship of interest on the basis of the input
patterns alone. The Maximum mutual information principle of Linsker (1988)
states that the synaptic connections of a multilayered ANN develop so as to
maximize the amount of information that is preserved when signals are
transformed at each processing stage of the network, subject to certain
constraints. This is based on earlier ideas that came from the recognition
that encoding of data from a scene for the purpose of redundancy reduction is
related to the identification of specific features in the scene (by the
biological perceptual machinery).
Rewriting (2.46) with the conditional pdf f_{\xi|\eta}(x|y) = f_{\xi,\eta}(x, y)/f_\eta(y):

I(\xi; \eta) = \iint f_{\xi,\eta}(x, y) \log \frac{f_{\xi,\eta}(x, y)}{f_\xi(x) f_\eta(y)}\,dx\,dy

= \iint f_{\xi,\eta}(x, y) \log \frac{f_{\xi|\eta}(x|y)}{f_\xi(x)}\,dx\,dy

= h(\xi) - h(\xi|\eta)

= h(\eta) - h(\eta|\xi)

= h(\xi) + h(\eta) - h(\xi, \eta)    (2.47)
Mutual information is symmetric and nonnegative:

I(\xi; \eta) = I(\eta; \xi)    (2.48)

I(\xi; \eta) \ge 0    (2.49)

It is the Kullback-Leibler distance between the joint density f_{\xi,\eta}(x, y)
(density #1) and the product of the marginals f_\xi(x) f_\eta(y) (density #2):

I(\xi; \eta) = D_{KL}\left( f_{\xi,\eta}(x, y) \,\|\, f_\xi(x) f_\eta(y) \right)    (2.50)

The larger the K-L distance, the less independent \xi and \eta are, because
f_{\xi,\eta}(x, y) and f_\xi(x) f_\eta(y) are farther apart => I is large. See also
the diagram in CF 2/2 or in Haykin p. 494 (Fig. 10.1) showing insightful
relations among mutual information, entropy, and conditional and joint entropy.
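A sketch estimating (2.46) by a 2-D histogram for a correlated Gaussian pair, compared against the known bivariate-Gaussian value -\frac{1}{2}\ln(1-\rho^2); bin count and sample size are arbitrary choices:

import numpy as np

# Histogram estimate of the mutual information (2.46).
rng = np.random.default_rng(2)
rho = 0.8
n = 500_000
xi = rng.standard_normal(n)
eta = rho * xi + np.sqrt(1 - rho**2) * rng.standard_normal(n)

pxy, _, _ = np.histogram2d(xi, eta, bins=60)
pxy /= pxy.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nz = pxy > 0                                            # skip empty cells (0 log 0 = 0)
mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz]))  # discrete version of (2.46)
print(mi, -0.5 * np.log(1 - rho**2))                    # both ~0.51 nats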
Additional properties of random variables
If \xi and \eta are statistically independent, their joint pdf factorizes,

f_{\xi,\eta}(x, y) = f_\xi(x) f_\eta(y)    (2.51)

and hence for any (integrable) functions g and h

E[g(\xi) h(\eta)] = \iint g(x) h(y) f_{\xi,\eta}(x, y)\,dx\,dy

= \int g(x) f_\xi(x)\,dx \int h(y) f_\eta(y)\,dy

= E[g(\xi)]\, E[h(\eta)]    (2.52)
Uncorrelatedness vs independence

Statistical independence is a stronger property than uncorrelatedness.
\xi and \eta are uncorrelated iff

E[\xi\eta] = E[\xi] E[\eta]    (2.53)

i.e., iff their covariance (2.14) vanishes, c_{\xi\eta} = 0    (2.54)

or equivalently, for random vectors,

R_{xy} = E[x y^T] = E[x] E[y^T] = M_x^{(1)} M_y^{(1)T}    (2.55)
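A compact counterexample sketch: with \eta = \xi^2 for a symmetric \xi, (2.53) holds while (2.52) fails, so the pair is uncorrelated but not independent.

import numpy as np

# Uncorrelated does not imply independent: eta is a deterministic
# function of xi, yet their covariance vanishes by symmetry.
rng = np.random.default_rng(3)
xi = rng.standard_normal(1_000_000)
eta = xi**2
print(np.mean(xi * eta) - np.mean(xi) * np.mean(eta))        # ~0: (2.53) holds
print(np.mean(xi**2 * eta) - np.mean(xi**2) * np.mean(eta))  # ~2, not 0: (2.52) fails with g = h = square/identity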
Whiteness

Random vectors x satisfying

M_x^{(1)} = 0    (2.56)

C_x = R_x = I    (2.57)

i.e., having zero mean and unit covariance (and hence unit correlation)
matrix, are called white. Whiteness is a stronger property than
uncorrelatedness but weaker than statistical independence.
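A minimal whitening sketch (the mixing matrix A and sample size are arbitrary examples): eigendecompose the sample covariance and map the data through C_x^{-1/2} so that (2.56)-(2.57) hold.

import numpy as np

# Whiten correlated 2-D data: after the transform, the sample
# covariance should be approximately the identity matrix (2.57).
rng = np.random.default_rng(4)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.standard_normal((100_000, 2)) @ A.T   # correlated data, rows = samples
x -= x.mean(axis=0)                           # enforce zero mean (2.56)

C = np.cov(x, rowvar=False)                   # sample covariance C_x
evals, E = np.linalg.eigh(C)
V = E @ np.diag(evals**-0.5) @ E.T            # whitening matrix C_x^{-1/2}
z = x @ V.T
print(np.cov(z, rowvar=False).round(3))       # ~ identity matrix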
Examples of densities

Uniform:

f(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b] \\ 0 & \text{elsewhere} \end{cases}    (2.58)

Gaussian:

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / 2\sigma^2}    (2.59)

Laplacian:

f(x) = \frac{\lambda}{2}\, e^{-\lambda|x|}, \quad \lambda > 0    (2.60)
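As a quick tie-back to (2.11), a sketch of the sample kurtosis of these three densities: the uniform is sub-Gaussian (negative kurtosis), the Gaussian is zero, and the Laplacian is super-Gaussian (positive kurtosis).

import numpy as np

# Normalized kurtosis (2.11) of samples from the densities (2.58)-(2.60).
rng = np.random.default_rng(5)
n = 1_000_000

def kurtosis(s):
    z = (s - s.mean()) / s.std()
    return (z**4).mean() - 3

print(kurtosis(rng.uniform(-1, 1, n)))   # ~ -1.2 (uniform)
print(kurtosis(rng.standard_normal(n)))  # ~  0   (Gaussian)
print(kurtosis(rng.laplace(size=n)))     # ~  3   (Laplacian)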