
Stochastic Processes

Lecture notes: course taught by Dr. Rakesh Nigam
Madras School of Economics
PGDM Research and Analytics

Compiled by the PGDM research cell

November 8, 2020

Preface

This book is a compilation of lecture notes from the Stochastic Processes course
module, part of the Madras School of Economics PGDM curriculum, taught by
Dr. Rakesh Nigam. The Data Science and Financial Engineering courses at MSE are
deeply focused on the mathematical foundations behind core concepts such as
regression analysis, financial time series, CAPM, and natural language processing.

In this edition, our focus rests primarily on the fundamentals of stochastic
processes. We start off by revisiting key concepts in probability theory, set
theory and distribution functions, before moving to more involved topics such as
measure theory and limit theorems. The field of stochastics is primarily concerned
with sequences of random variables, and its theorems are cornerstones of important
real-world applications, especially those involving stock returns and financial
analysis. We therefore develop these concepts from the ground up. The aim of this
book is to strengthen one's foundations in probability, statistics and convergence
theorems, which also form the crucial underpinnings of many statistical modeling
and machine learning applications.

Note that the book starts off with some prerequisites, which we must firmly grasp
before moving on to more involved concepts. The actual chapters begin after the
prerequisites sections.
Contents

0.1 Combinatorics
  0.1.1 Permutations
  0.1.2 Combinations
  0.1.3 Binomial Theorem
0.2 Basic Sets
0.3 Conditional probability
  0.3.1 Bayes
  0.3.2 Odds
0.4 Distributions
  0.4.1 Poisson
  0.4.2 Geometric
  0.4.3 Negative Binomial
0.5 Cumulative Frequency distributions
0.6 General points about discrete variables
0.7 Sequences
  0.7.1 General theorems and statements
  0.7.2 Squeeze theorem
  0.7.3 Increasing, Decreasing and Bounded
0.8 Series
  0.8.1 The Ratio and Root test

1 Lecture 1: Aakaash N (2019DMB01)
  1.1 Conditions For Independence
    1.1.1 Example 1
    1.1.2 Example 2
  1.2 Axioms Of Probability
  1.3 Moment Generating Function
  1.4 Consequences
    1.4.1 Example 1
  1.5 Properties Of Moment Generating Function
  1.6 Theorem

2 Lecture 3: Kishan R (2019DMF06)
  2.1 Introduction
  2.2 Application Of CLT
  2.3 Continuity Correction
    2.3.1 Continuity correction for Discrete RV
    2.3.2 Convergence in Probability
    2.3.3 Convergence in Probability is a Stronger notion than Convergence in Distribution
  2.4 Summary

3 Lecture 5 - Shashi Ranjan Mandal (2019DMB08)
  3.1 Weak law of large numbers
  3.2 Converse of Theorem
  3.3 Counter Example
  3.4 General Case
  3.5 Consequence of the Weak Law of Large Numbers
  3.6 WLLN
  3.7 Convergence in Mean

4 Lecture 6 - Venkat Suman Panigrahi (2019DMF12)
  4.1 Recap
  4.2 Convergence of Random Variables
    4.2.1 Convergence in distribution
    4.2.2 Convergence in Probability
    4.2.3 Convergence in Mean
  4.3 Almost Sure Convergence
    4.3.1 Example 1
    4.3.2 Example 2
  4.4 Strong Law of Large Numbers
    4.4.1 Example 3
  4.5 Moment Generating Functions
    4.5.1 Properties of MGF
  4.6 Limit Theorems
    4.6.1 Law of Large Numbers
    4.6.2 Central Limit Theorem

5 Lecture 9 - Akash Gupta (2019DMB02)
  5.1 Fundamental guiding principles
    5.1.1 Sample space
    5.1.2 Working with sets: part 1
    5.1.3 Working with sets: part 2
    5.1.4 Basic laws governing set operations
    5.1.5 Axioms of probability
    5.1.6 Probability of continuous sets
  5.2 Countable and Uncountable sets
  5.3 Concepts of Lecture 8
    5.3.1 Revisiting the background of discussion
  5.4 Lim Sup and Lim Inf
    5.4.1 An illustrative example: 1
    5.4.2 An illustrative example: 2
    5.4.3 Reiterating the Definitions

6 Lecture 12: 2019DMB09 - Sri Rajitha
  6.1 Recap
  6.2 Introduction
  6.3 Properties of Power spectral density Sxx(ω)
  6.4 White Noise (WN)
  6.5 Discrete Time Stochastic Processes
    6.5.1 Sampling a CTSP: Continuous Time Stochastic Processes
  6.6 Strong stationarity
  6.7 Cyclo-Stationary process
  6.8 White Noise is a special Stochastic Process
  6.9 Gaussian Random Process – GRP
  6.10 Summary

7 Lecture 14: 2019DMB04 - Karnam Yogesh
  7.1 ARMA Model
    7.1.1 Introduction
  7.2 White Noise
  7.3 Lag Operators
  7.4 Invertibility
  7.5 Autocovariance Functions and Stationarity of ARMA models
    7.5.1 MA(1)
    7.5.2 MA(q)
  7.6 AR(1)
    7.6.1 Estimation of AR parameters
    7.6.2 AR(P)
  7.7 ARMA Modeling

A Appendix A: Fourier Transforms
  A.1 Introduction
  A.2 Fourier series
  A.3 Amplitudes
  A.4 Alternate forms of writing the series
  A.5 Fourier Transforms
  A.6 Spectrum
    A.6.1 The autocorrelation theorem

B Appendix B: Laplace, Dirac Delta and Fourier Series
  B.1 The Laplace Transform
    B.1.1 Examples and Properties
    B.1.2 Expanding a little on the Heaviside function
    B.1.3 Inverse Laplace transform
  B.2 The Dirac Delta Impulse function
    B.2.1 Filtering property
  B.3 Fourier Series
    B.3.1 Fourier series formula

Prerequisites: Probability foundations

0.1 Combinatorics

Starting with the fundamental principle of counting, suppose that experiment 1
results in any of m possible outcomes and experiment 2 results in any of n
possible outcomes. Then, if these two experiments are performed in succession,
there are a total of mn possible outcomes. The matrix below lists out all the
possible pairs of outcomes from experiments 1 and 2; the item (i, j) corresponds
to the pair in which i was obtained in experiment 1 and j was obtained in
experiment 2.

\begin{pmatrix} (1,1) & (1,2) & \cdots & (1,n) \\ \vdots & \vdots & \ddots & \vdots \\ (m,1) & (m,2) & \cdots & (m,n) \end{pmatrix}    (1)

As a general rule, remember that if there are a total of r experiments to be
performed, and the first has n_1 possible outcomes, the second experiment has
n_2 possibilities, and the r-th experiment has n_r possibilities, then the total
number of possibilities from all the experiments together is:

n_1 \times n_2 \times \cdots \times n_r    (2)

0.1.1 Permutations

Ordered arrangements of elements are called permutations. For example, if we
have the letters a, b, c, then the permutations of these letters (elements) are:

abc, acb, bac, bca, cab, cba    (3)

Each such arrangement is called a permutation. Note that as a general rule, for n
objects there are n! permutations:

n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1    (4)


Things become a bit more involved when we are permuting elements among which
some objects are alike. For example, if we want to find the different
arrangements of the word PEPPER, then at first glance we have a total of 6!
permutations, since there are six letters in the word. But what if we simply
interchange the alike elements in the word? For example, if we simply interchange
the two middle P's, it would not really change our permutation. For this reason
we calculate the total number of permutations of PEPPER by adjusting for the
permutations among the alike elements as well. So the final number of
permutations becomes:

\frac{6!}{3!\,2!}    (5)

Note that 3! refers to the number of permutations among the P's (which are three
in number) and 2! refers to the number of permutations among the E's. As a
general rule we can say:

\frac{n!}{n_1!\, n_2! \cdots n_r!}    (6)

where there are n_1 alike elements of type 1, n_2 alike elements of type 2, and so on.
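As a quick sanity check, the following minimal Python sketch (our illustration,
not from the lectures) counts the distinct arrangements of PEPPER both via
formula (6) and by brute-force enumeration:

```python
from math import factorial
from itertools import permutations

# Formula (6): 6! / (3! * 2!) = 60 distinct arrangements of PEPPER.
word = "PEPPER"
count = factorial(len(word))
for letter in set(word):
    count //= factorial(word.count(letter))  # divide out the alike elements
print(count)  # 60

# Brute-force cross-check: number of distinct orderings of the letters.
print(len(set(permutations(word))))  # 60
```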

0.1.2 Combinations

If we want to form groups of r objects from a total of n objects, where order is
now irrelevant, then we call these groups combinations. The number of such
combinations is given by:

\binom{n}{r} = \frac{n!}{(n-r)!\,r!}    (7)

0.1.3 Binomial Theorem

The Binomial theorem is a general rule for the polynomial expansion of a sum of
two variables. It is given by:

(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}    (8)

As a simple example, consider finding the expansion of (x + y)^3. We can use the
binomial theorem to expand this expression:

(x + y)^3 = \binom{3}{0} x^0 y^3 + \binom{3}{1} x^1 y^2 + \binom{3}{2} x^2 y + \binom{3}{3} x^3    (9)

= y^3 + 3xy^2 + 3x^2 y + x^3    (10)
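The identity is easy to check numerically. The short sketch below (added here as
an illustration; it assumes only the Python standard library) evaluates both
sides of equation (8) for sample values of x and y:

```python
from math import comb

# Binomial theorem: (x + y)^n equals the sum of C(n, k) x^k y^(n-k).
x, y, n = 2.0, 5.0, 3
lhs = (x + y) ** n
rhs = sum(comb(n, k) * x**k * y ** (n - k) for k in range(n + 1))
print(lhs, rhs)  # both 343.0
```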



0.2 Basic Sets

The union of many events E_1, E_2, \cdots, E_n can be expressed as:

\bigcup_{n=1}^{\infty} E_n    (11)

This union consists of all outcomes that lie in at least one of the events E_i. In a
similar manner, the event consisting of outcomes in all of the E_i events is given by
the intersection of these sets:

\bigcap_{n=1}^{\infty} E_n    (12)

Note the all-important De Morgan's laws, given by the following expressions. Also
note that the superscript c refers to the complement of a set (which is nothing but
the set of elements not in the set).

(A \cup B)^c = A^c \cap B^c \;\rightarrow\; \left( \bigcup_{i=1}^{n} E_i \right)^c = \bigcap_{i=1}^{n} E_i^c    (13)

(A \cap B)^c = A^c \cup B^c \;\rightarrow\; \left( \bigcap_{i=1}^{n} E_i \right)^c = \bigcup_{i=1}^{n} E_i^c    (14)

Note an important point: events are nothing but sets of outcomes, and hence
we can denote events as sets and perform set manipulations on them. For example,
we can denote the concept of mutually exclusive events using set notation as
follows:

E_1 \cap E_2 = E_1 E_2 = \emptyset    (15)

The above equation basically means that if the intersection of two sets is the
empty set, then they are mutually exclusive sets or events. We can also
compute the probability of the union of many mutually exclusive events as
follows:

P(E_1 \cup E_2) = P(E_1) + P(E_2) \;\rightarrow\; P\left( \bigcup_{n=1}^{\infty} E_n \right) = \sum_{n=1}^{\infty} P(E_n)    (16)

Moving on, we can state the basic expansion of a union of sets that are not mu-
tually exclusive as:

E \cup F = E + F - EF    (17)

Expanding the union of three sets:

E \cup F \cup G = E + F + G - EF - EG - FG + EFG    (18)

Applying the probability operator, we simply get:

P(E \cup F \cup G) = P(E) + P(F) + P(G) - P(EF) - P(EG) - P(FG) + P(EFG)    (19)

Look closely and you will see that a pattern emerges in the signs of the above
summation. The combined union is the sum of all (positive sign) sets taken
one at a time, all (negative sign) sets taken two at a time, and all (positive sign)
sets taken three at a time. We can generalize this to the union of n sets as follows,
in terms of probability:

P(E_1 \cup E_2 \cdots \cup E_n) = \sum_{i=1}^{n} P(E_i) - \sum_{i_1 < i_2} P(E_{i_1} E_{i_2}) + \cdots + (-1)^{n+1} P(E_1 E_2 \cdots E_n)    (20)
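To make equation (19) concrete, here is a small Python sketch with hypothetical
events E, F, G on a fair die (the event sets are ours, chosen purely for
illustration); it verifies the three-set inclusion–exclusion identity by direct
counting:

```python
from fractions import Fraction

# Hypothetical events on a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}
E, F, G = {1, 2, 3}, {2, 4, 6}, {3, 4, 5}

def P(A):
    # Equally likely outcomes: P(A) = |A| / |omega|.
    return Fraction(len(A), len(omega))

lhs = P(E | F | G)
rhs = (P(E) + P(F) + P(G)
       - P(E & F) - P(E & G) - P(F & G)
       + P(E & F & G))
print(lhs, rhs)  # both 1 here, matching equation (19)
```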

0.3 Conditional probability

This means finding the probability of an event E occurring given the fact that
event F has occurred. Since F has already occurred, we can say that F is now
our new sample space, instead of the entire sample space. So essentially, we want
to find the probability that E and F both occur, given that F has
already occurred. It is given by:

P(E|F) = \frac{P(EF)}{P(F)}    (21)

Note that the above expression can alternatively be written as:

P(EF) = P(E|F)\,P(F)    (22)

Now note an important point. Suppose that there are two sets or events called
E and F. When only these two sets exist in our world, the set E can be
decomposed as the union of the intersection of E with F and the
intersection of E with the complement of F:

E = EF \cup EF^c    (23)

Applying the probability operator to the above sets we get:

P(E) = P(EF) + P(EF^c) = P(E|F)P(F) + P(E|F^c)P(F^c)    (24)

Therefore, from the above expression we can say that the total probability of event
E is the weighted average of the conditional probability of E given that F has
occurred and the conditional probability of E given that F has not occurred.

0.3.1 Bayes

We will introduce the concept of Bayes' theorem with the help of a common example.
Suppose that D is the event that a person has a disease and E is the event that, upon
testing for the disease, the test comes out positive (note that there can also be a
false positive: the test can come out positive even if the person does not have the
disease). Now suppose we want to find the probability that the person has the
disease given that the result is positive:

P(D|E) = \frac{P(DE)}{P(E)}    (25)

= \frac{P(E|D)P(D)}{P(E|D)P(D) + P(E|D^c)P(D^c)}    (26)
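As a worked illustration with hypothetical numbers (a 1% prevalence, 95%
sensitivity and 5% false-positive rate, chosen by us and not taken from the
lecture), the computation in equation (26) looks like this in Python:

```python
# Hypothetical inputs: P(D), P(E|D) and P(E|D^c).
p_d = 0.01             # prevalence of the disease
p_e_given_d = 0.95     # test is positive given disease
p_e_given_dc = 0.05    # false-positive rate

# Denominator of (26): total probability of a positive test.
p_e = p_e_given_d * p_d + p_e_given_dc * (1 - p_d)

# Bayes' theorem, equation (26).
p_d_given_e = p_e_given_d * p_d / p_e
print(round(p_d_given_e, 4))  # ~0.161: a positive test is far from conclusive
```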

0.3.2 Odds

As a quick note, odds are defined as the ratio of the probability of occurrence of an
event to the probability of the non-occurrence of the event. It is given as:

\frac{P(A)}{P(A^c)} = \frac{P(A)}{1 - P(A)}    (27)

0.4 Distributions

Starting with the Bernoulli random variable: we define this random variable as
the outcome of a single trial whose outcomes are of only two types, success
and failure, encoded as 1 and 0 respectively.

p(0) = P(X = 0) = 1 - p    (28)

p(1) = P(X = 1) = p    (29)

Extending the same concept a little further: suppose we have n independent
trials, each of which is associated with a probability of success p and probability
of failure (1 - p), and we define the random variable X as the number of successes
in n trials. What we then have is essentially a binomial random variable:

p(i) = \binom{n}{i} p^i (1 - p)^{n-i}    (30)

Some general properties:

E[X] = np    (31)

VAR[X] = npq = np(1 - p)    (32)

P(X \le i) = \sum_{k=0}^{i} \binom{n}{k} p^k (1 - p)^{n-k}    (33)

0.4.1 Poisson

A random variable X taking on values 0, 1, 2, \cdots is called a Poisson random
variable with parameter λ if:

p(i) = P(X = i) = \frac{e^{-λ} λ^i}{i!}    (34)

Note that this is the approximation of a binomial variable when n is very large
and p is small. Some general properties:

E[X] = λ    (35)

VAR[X] = λ    (36)

The derivation, for optional reading, is presented below in a step-wise manner;
a short numerical sketch after these steps compares the two distributions directly.

• First, we assume a binomial random variable with parameters n and p such
that n is very large and p quite small. Using the binomial probability
mass function, the probability of i successes in n trials is:

P(X = i) = \frac{n!}{(n-i)!\,i!} p^i (1 - p)^{n-i}    (37)

• Now we basically let λ = np, or p = λ/n, and with this we can rewrite the
previous formula in terms of λ as follows:

P(X = i) = \frac{n!}{(n-i)!\,i!} \left( \frac{λ}{n} \right)^i \left( 1 - \frac{λ}{n} \right)^{n-i}    (38)

= \frac{n(n-1) \cdots (n-i+1)}{n^i} \; \frac{λ^i}{i!} \; \frac{(1 - λ/n)^n}{(1 - λ/n)^i}    (39)

• Now we note the following approximations for large n:

\left( 1 - \frac{λ}{n} \right)^n \approx e^{-λ}    (40)

\frac{n(n-1) \cdots (n-i+1)}{n^i} \approx 1    (41)

\left( 1 - \frac{λ}{n} \right)^i \approx 1    (42)

• And finally, we end up with:

P(X = i) = \frac{e^{-λ} λ^i}{i!}    (43)
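The approximation can be inspected numerically. The sketch below (ours, with an
arbitrarily chosen large n and small p) prints the binomial pmf next to its
Poisson counterpart with λ = np:

```python
from math import comb, exp, factorial

# Large n, small p: Binomial(n, p) should be close to Poisson(lam = n p).
n, p = 1000, 0.003
lam = n * p

for i in range(6):
    binom = comb(n, i) * p**i * (1 - p) ** (n - i)
    poisson = exp(-lam) * lam**i / factorial(i)
    print(i, round(binom, 5), round(poisson, 5))  # nearly identical columns
```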

0.4.2 Geometric

Suppose now that there are many independent trials, each having a probability
of success p, such that these trials are performed until a success occurs. Our
random variable X primarily defines the number of trials required until the first
success is encountered:

P(X = n) = (1 - p)^{n-1} p    (44)

Some key points:

E[X] = \frac{1}{p}    (45)

VAR[X] = \frac{1-p}{p^2} = \frac{q}{p^2}    (46)

0.4.3 Negative Binomial

Now suppose that we perform many independent trials, with each trial having
the same probability of success p, and we perform trials until we accumulate r
successes. Here let the primary random variable X denote the number of trials
required to accumulate r successes:

P(X = n) = \binom{n-1}{r-1} p^r (1 - p)^{n-r}    (47)

The main logic is that for us to stop conducting the trials, the r-th success has to
happen at the n-th trial, and therefore we count the combinations of the r - 1
successes that must have occurred in the first n - 1 trials. Some key points:

E[X] = \frac{r}{p}    (48)

VAR[X] = \frac{r(1-p)}{p^2} = \frac{rq}{p^2}    (49)
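A quick Monte Carlo sketch (our illustration; the parameters r and p are
arbitrary) checks the mean and variance formulas (48) and (49) by simulating the
trials directly:

```python
import random

random.seed(0)
r, p, reps = 3, 0.4, 100_000

def trials_until_r_successes():
    # Run Bernoulli(p) trials until r successes; return the trial count.
    n = successes = 0
    while successes < r:
        n += 1
        if random.random() < p:
            successes += 1
    return n

samples = [trials_until_r_successes() for _ in range(reps)]
mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps
print(round(mean, 3), r / p)              # ~7.5   vs  E[X] = r/p
print(round(var, 3), r * (1 - p) / p**2)  # ~11.25 vs VAR[X] = r(1-p)/p^2
```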

0.5 Cumulative Frequency distributions

The cumulative distribution function F for a random variable X is given by:

F(x) = P(X \le x)    (50)

Note that for a distribution function F, the value F(b) denotes the probability that
the random variable takes on values less than or equal to b. Some properties of
CDFs are:

• F is non-decreasing, which essentially means that for a < b we have F(a) \le F(b).

• The following can be defined as a limiting case:

\lim_{b \to \infty} F(b) = 1    (51)

• The following can be defined as a limiting case:

\lim_{b \to -\infty} F(b) = 0    (52)

• The CDF function is right continuous.

0.6 General points about discrete variables

Here is a quick list of some general pointers regarding expectations and variances
of discrete random variables.

E[X] = \sum_x x\, p(x)    (53)

E[g(X)] = \sum_x g(x)\, p(x)    (54)

VAR[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2    (55)

E[aX + b] = aE[X] + b    (56)

VAR[aX + b] = a^2\, VAR[X]    (57)
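These pointers translate directly into a few lines of Python. The sketch below
(our example, using a hypothetical loaded-die pmf) computes E[X], VAR[X] and
E[aX + b] from the definitions:

```python
# Hypothetical pmf of a loaded die; probabilities sum to 1.
pmf = {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.2, 5: 0.2, 6: 0.2}

ex = sum(x * p for x, p in pmf.items())      # E[X]   (eq. 53)
ex2 = sum(x**2 * p for x, p in pmf.items())  # E[X^2] (eq. 54 with g(x) = x^2)
var = ex2 - ex**2                            # VAR[X] (eq. 55)

a, b = 2, 3
e_lin = sum((a * x + b) * p for x, p in pmf.items())
print(ex, round(var, 2))
print(e_lin, a * ex + b)  # equal, illustrating E[aX+b] = aE[X]+b (eq. 56)
```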
Prerequisites: Sequences and Series

0.7 Sequences

A sequence is nothing but a list of numbers written in a specific order. Sequences
can be infinite or finite. A general form of sequences can be shown:

a_1 - first term
a_2 - second term
a_n - n-th term

Some of the common ways we can denote sequences are as follows:

\{a_1, a_2, \cdots, a_n, a_{n+1}, \cdots\}    (58)

\{a_n\}    (59)

\{a_n\}_{n=1}^{\infty}    (60)

To illustrate with an example, here is how we would write the first few terms of a
sequence:

\left\{ \frac{n+1}{n^2} \right\}_{n=1}^{\infty} = \left\{ 2, \frac{3}{4}, \frac{4}{9}, \frac{5}{16}, \cdots \right\}    (61)
n=2 n=3 n=4

An interesting way to think about sequences is as functions that map index values
to the value that the particular sequence takes there. For example, consider the
same sequence as above written as a function, with its values written as tuples of
the form (n, f(n)):

f(n) = \frac{n+1}{n^2}    (62)

values \to (1, 2), (2, 3/4), (3, 4/9), (4, 5/16)    (63)

We do this because we can then plot the values and obtain a graphical
representation of the sequence.


[Figure: plot of the points (n, f(n)) for f(n) = (n+1)/n^2, showing the terms decreasing toward zero.]

We can observe from this graph that as n increases, the value of the sequence terms
gets closer and closer to zero. Hence we can say that the limiting value of this
sequence is zero:

\lim_{n \to \infty} a_n = \lim_{n \to \infty} \frac{n+1}{n^2} = 0    (64)

0.7.1 General theorems and statements

• Informally, if we can make a_n arbitrarily close to L for large values of n,
then the values of a_n approach L as n approaches infinity:

\lim_{n \to \infty} a_n = L

As a more precise definition, we say that \lim_{n \to \infty} a_n = L if for every
number ε > 0 there exists an integer N such that:

|a_n - L| < ε, when: n > N

• We say that \lim_{n \to \infty} a_n = \infty if for every number M > 0 there is an
integer N such that:

a_n > M when: n > N

• We say that \lim_{n \to \infty} a_n = -\infty if for every number M < 0 there exists
a number N such that:

a_n < M when: n > N

• The key insight for us is that for a limit to exist and have a finite value,
all the sequence terms must get closer and closer to that finite value as n
approaches infinity.

• If \lim_{n \to \infty} a_n exists and is finite, we say that the sequence is convergent,
whereas if \lim_{n \to \infty} a_n does not exist or is infinite, we say that the sequence
is divergent.

• Given a sequence \{a_n\}, if we have a function f(x) such that f(n) = a_n and
\lim_{x \to \infty} f(x) = L, then we can say that:

\lim_{n \to \infty} a_n = L

0.7.2 Squeeze theorem

We can state the squeeze theorem for sequences as follows:

if: a_n \le c_n \le b_n for all n > N, for some N,
and if: \lim_{n \to \infty} a_n = \lim_{n \to \infty} b_n = L,
then we can say that: \lim_{n \to \infty} c_n = L.

This theorem is particularly useful when we are trying to compute the limits of
sequences whose terms alternate in sign, for example via the modulus function.
Another important theorem to state, which we shall prove using the squeeze
theorem, is:

if: \lim_{n \to \infty} |a_n| = 0, then: \lim_{n \to \infty} a_n = 0.

Additionally, we note that for this theorem to work, the limit has to be zero. Now
to prove this using the squeeze theorem:

We can first of all note that: -|a_n| \le a_n \le |a_n|.
Then we note that: \lim_{n \to \infty} (-|a_n|) = -\lim_{n \to \infty} |a_n| = 0.
Therefore, now that we have: \lim_{n \to \infty} (-|a_n|) = \lim_{n \to \infty} |a_n| = 0,
by the squeeze theorem we have: \lim_{n \to \infty} a_n = 0.

As an additional theorem of convergence that is closely related we can state:

The sequence \{r^n\}_{n=0}^{\infty} converges if -1 < r \le 1
and diverges for all other values of r.

Mathematically this can also be stated as:

\lim_{n \to \infty} r^n = \begin{cases} 0 & \text{if } -1 < r < 1 \\ 1 & \text{if } r = 1 \\ \infty & \text{if } r > 1 \end{cases}    (65)

and for r \le -1 the limit does not exist, since the terms oscillate in sign.

0.7.3 Increasing, Decreasing and Bounded

Given a sequence {an } we have the following important definitions that explain
key concepts about the nature of the sequence.

• A sequence is increasing if: an < an+1 for every n.

• A sequence is decreasing if: an > an+1 for every n.

• If an is an increasing or decreasing sequence it is known to be monotonic.


Note that a monotonic sequence always either increases or decreases, not
both.

• If there exists a number m such that m \le a_n for every n, then we say that
the sequence is bounded below, and m is called a lower bound of the
sequence.

• If there exists a number M such that a_n \le M for every n, then we say that
the sequence is bounded above, and M is called an upper bound of the
sequence.

• Finally we can say that if {an } is bounded and monotonic then {an } is con-
vergent.

0.8 Series

To begin defining an infinite series, we first start with a sequence \{a_n\}. Note that
a sequence is just a list of numbers, whereas a series represents a summation
operation on those numbers. We can define the partial sums of a basic series as:

s_1 = a_1
s_2 = a_1 + a_2
s_3 = a_1 + a_2 + a_3
s_n = \sum_{i=1}^{n} a_i

We can further note that the successive values of the series itself form a sequence
of numbers, which can be represented as \{s_n\}_{n=1}^{\infty}. This is the sequence of
partial sums. Now we can compute the limiting value of this sequence of partial
sums as:

\lim_{n \to \infty} s_n = \lim_{n \to \infty} \sum_{i=1}^{n} a_i = \sum_{i=1}^{\infty} a_i    (66)

Note that, as in the case of sequences before, if the sequence of partial sums has a
finite limit, then the series is said to be convergent, and if the limit does not exist
then it is divergent. Now we will prove the following theorem:

if \sum a_n converges, then \lim_{n \to \infty} a_n = 0.

• Step 1: We can write the following two partial sums for the given series:

s_{n-1} = \sum_{i=1}^{n-1} a_i = a_1 + a_2 + \cdots + a_{n-1}
s_n = \sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n

• Step 2: Subtracting the two partial sums we get:

a_n = s_n - s_{n-1}

• Step 3: If \sum a_n is convergent, then the sequence of partial sums also
converges to some finite value. Note that the same holds true for the
partial sums indexed by n and by (n - 1):

\{s_n\}_{n=1}^{\infty} \to \lim_{n \to \infty} s_n = s \to \lim_{n \to \infty} s_{n-1} = s

• Step 4: Finally we can write:

\lim_{n \to \infty} a_n = \lim_{n \to \infty} (s_n - s_{n-1}) = s - s = 0

0.8.1 The Ratio and Root test

The ratio test can be applied to check for convergence of a series. Suppose we
have a series given by:

\sum a_n    (67)

Then we can define:

L = \lim_{n \to \infty} \left| \frac{a_{n+1}}{a_n} \right|    (68)

Now the following conditions would hold:

• If L < 1 the series is convergent.

• If L > 1 the series is divergent.

• If L = 1 the test is inconclusive: the series may be divergent or convergent.

Now, to present the root test, suppose we have the series defined by:

\sum a_n    (69)

Then we can define:

L = \lim_{n \to \infty} \sqrt[n]{|a_n|} = \lim_{n \to \infty} |a_n|^{1/n}    (70)

Now the following conditions would hold, as the numerical sketch after this list
also illustrates:

• If L < 1 the series is convergent.

• If L > 1 the series is divergent.

• If L = 1 the test is inconclusive: the series may be divergent or convergent.
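As a small numerical illustration (ours, for the series \sum n/2^n, which the
ratio test classifies as convergent with L = 1/2), the ratio |a_{n+1}/a_n| can be
computed exactly:

```python
from fractions import Fraction

def a(n):
    # Terms of the series: a_n = n / 2^n.
    return Fraction(n, 2**n)

for n in (10, 100, 1000):
    ratio = a(n + 1) / a(n)  # equals (n+1)/(2n)
    print(n, float(ratio))   # -> 0.5 = L < 1, so the series converges
```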


Chapter 1

Lecture 1: Aakaash N (2019DMB01)

1.1 Conditions For Independence

Consider A_1, A_2, A_3, three events in a sample space S. Pairwise independence
between the events can be defined by the following conditions:

P(A_1 A_2) = P(A_1)P(A_2)    (1.1)

P(A_1 A_3) = P(A_1)P(A_3)    (1.2)

P(A_2 A_3) = P(A_2)P(A_3)    (1.3)

Mutual independence between the events can be defined by the following
conditions:

P(A_1 A_2) = P(A_1)P(A_2)    (1.4)

P(A_1 A_3) = P(A_1)P(A_3)    (1.5)

P(A_2 A_3) = P(A_2)P(A_3)    (1.6)

P(A_1 A_2 A_3) = P(A_1)P(A_2)P(A_3)    (1.7)

Mutual independence implies pairwise independence, but pairwise independence
does not imply mutual independence. All of the mutual independence conditions
need to be satisfied for the events to be called independent.

1.1.1 Example 1

Let us consider a fair coin tossed twice. Let the sample space be denoted S, and
let A_1, A_2, A_3 be three events in the sample space S:

S = \{HH, HT, TH, TT\}

A_1 = \{HH, HT\}, \quad A_2 = \{HH, TH\}, \quad A_3 = \{HH, TT\}

Let us see whether independence exists between the events:

P(A_1 A_2) = P\{HH\} = 1/4

P(A_1) = 2/4
P(A_2) = 2/4
P(A_1)P(A_2) = 1/4
P(A_1 A_2) = P(A_1)P(A_2)

Let us check the remaining condition:

P(A_1 A_2 A_3) = P(A_1)P(A_2)P(A_3)    (1.8)

P(A_1 A_2 A_3) = P\{HH\} = 1/4

P(A_1) = 1/2
P(A_2) = 1/2
P(A_3) = 1/2
P(A_1)P(A_2)P(A_3) = 1/8

Equation (1.8) is not satisfied. Therefore the events are pairwise independent but
not mutually independent.
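Since the sample space here is tiny, the whole check can be done by enumeration.
The following Python sketch (our addition) recomputes all four independence
conditions exactly:

```python
from fractions import Fraction
from itertools import product

# Two tosses of a fair coin; each of the 4 outcomes has probability 1/4.
S = set(product("HT", repeat=2))
A1 = {("H", "H"), ("H", "T")}
A2 = {("H", "H"), ("T", "H")}
A3 = {("H", "H"), ("T", "T")}

def P(A):
    return Fraction(len(A), len(S))

print(P(A1 & A2) == P(A1) * P(A2))  # True  (1.4)
print(P(A1 & A3) == P(A1) * P(A3))  # True  (1.5)
print(P(A2 & A3) == P(A2) * P(A3))  # True  (1.6)
print(P(A1 & A2 & A3) == P(A1) * P(A2) * P(A3))  # False: (1.7) fails
```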

1.1.2 Example 2

Let us consider a fair die being rolled:

S = \{1, 2, 3, 4, 5, 6\}

A_1 = \{1, 2, 3, 4\}
A_2 = \{4, 5, 6\}
A_3 = \{4, 5, 6\}

Let us check the following condition:

P(A_1 A_2 A_3) = P(A_1)P(A_2)P(A_3)

P(A_1 A_2 A_3) = P\{4\} = 1/6

P(A_1) = 4/6
P(A_2) = 3/6
P(A_3) = 3/6
P(A_1)P(A_2)P(A_3) = 1/6

Let us check the other conditions:

P(A_1 A_2) = P(A_1)P(A_2)?
P(A_1 A_2) = 1/6
P(A_1) = 4/6
P(A_2) = 1/2
P(A_1)P(A_2) = 1/3

P(A_1 A_2) is not equal to P(A_1)P(A_2). Therefore the events are not independent:
the triple product condition holds, but pairwise independence fails.

1.2 Axioms Of Probability

If S is the sample space, probability is a function P on events. Given any subset A
of S, we have a value P(A):

P : \mathcal{P}(S) \to \mathbb{R}

where \mathbb{R} is the set of real numbers. If the sample space is infinite, then there is
some restriction on which sets A are allowed (they must be measurable).

1.3 Moment Generating Function

Let us consider a function f(x):

f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_n x^n + \cdots

Differentiating f(x), we get:

f'(x) = a_1 + 2a_2 x + 3a_3 x^2 + \cdots + n a_n x^{n-1} + \cdots

Differentiating the above equation, we get:

f''(x) = 2a_2 + 3 \cdot 2\, a_3 x + \cdots + n(n-1) a_n x^{n-2} + \cdots

Putting x = 0:

f(0) = a_0
f'(0) = a_1
f''(0) = 2a_2
f'''(0) = 6a_3
f^{(n)}(0) = n!\, a_n
a_n = \frac{1}{n!} f^{(n)}(0)

The power series expansion for f(x) is unique:

f(x) = f(0) + f'(0)x + \frac{f''(0)x^2}{2!} + \frac{f'''(0)x^3}{3!} + \cdots

1.4 Consequences

Let us consider X and Y, discrete random variables with joint probability mass
function p(x, y).

1. E(X + Y) = E(X) + E(Y):

E(X) = \sum_x x\, p(x)

E(X + Y) = \sum_x \sum_y (x + y)\, p(x, y)

E(X + Y) = \sum_x \sum_y x\, p(x, y) + \sum_x \sum_y y\, p(x, y)

E(X) = \sum_x \sum_y x\, p(x, y)

E(Y) = \sum_x \sum_y y\, p(x, y)

E(X + Y) = E(X) + E(Y)

The same result holds for the continuous case: if X, Y are jointly continuous, then

E(X + Y) = E(X) + E(Y)

In an analogous way, for a finite number of random variables X_1, X_2, \cdots, X_n we get:

E\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} E[X_i]

For an infinite number of random variables we have to worry about convergence.

2. If X and Y are independent:

E(XY) = E(X)E(Y)

If X and Y are jointly discrete with pmf p(x, y):

E(XY) = \sum_x \sum_y (xy)\, p(x, y)

But X and Y are independent, so the joint pmf factorizes:

p(x, y) = p_X(x)\, p_Y(y)

So:

E(XY) = \sum_{x,y} x y\, p_X(x)\, p_Y(y)

E(XY) = \sum_x \sum_y x y\, p_X(x)\, p_Y(y)

E(XY) = \left[ \sum_x x\, p_X(x) \right] \left[ \sum_y y\, p_Y(y) \right]

We know:

E(X) = \sum_x x\, p_X(x)

E(Y) = \sum_y y\, p_Y(y)

Therefore:

E(XY) = E(X)E(Y)

The same result is applicable for the continuous case.

3. var(X + Y) = E[(X + Y)^2] - [E(X + Y)]^2

var(X + Y) = E[X^2 + Y^2 + 2XY] - [E(X) + E(Y)]^2

var(X + Y) = E[X^2 + Y^2 + 2XY] - [E(X)]^2 - [E(Y)]^2 - 2E(X)E(Y)

var(X + Y) = E[X^2] + E[Y^2] + E[2XY] - [E(X)]^2 - [E(Y)]^2 - 2E(X)E(Y)

var(X + Y) = E[X^2] - [E(X)]^2 + E[Y^2] - [E(Y)]^2 + E[2XY] - 2E(X)E(Y)

We know:

var(X) = E[X^2] - [E(X)]^2
var(Y) = E[Y^2] - [E(Y)]^2
2cov(X, Y) = E[2XY] - 2E(X)E(Y)

Therefore:

var(X + Y) = var(X) + var(Y) + 2cov(X, Y)

Note: if X and Y are independent, then cov(X, Y) = 0, and

var(X + Y) = var(X) + var(Y)
Let us see an example: n independent identical trials, each with probability p of
success. Let X be the number of successes, and define the indicator variables

X_j = \begin{cases} 1 & \text{if success on trial } j \\ 0 & \text{otherwise} \end{cases}

for j = 1, 2, \cdots, n. Then:

E(X_j) = p
var(X_j) = pq

But X is the sum of the n independent identical indicators, so:

E(X) = np

var(X) = npq

Note:

cov\left( \sum_i X_i, \sum_j Y_j \right) = \sum_i \sum_j cov(X_i, Y_j)

Definition: if X is a continuous or discrete random variable,

M_X(t) = E(e^{tX})

M_X(t) is the moment generating function of X (MGF of the random variable X).

E(X) = 1st moment of X = mean

E(X^2) = 2nd moment of X

E(X^k) = k-th moment of X

For X a discrete random variable:

M_X(t) = \sum_x e^{tx}\, P(X = x)

For X a continuous random variable:

M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\, dx

M_X(t) = E(e^{tX})

e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}

e^{tx} = 1 + (tx) + \frac{(tx)^2}{2!} + \cdots + \frac{(tx)^k}{k!} + \cdots

E(e^{tX}) = E(1) + tE(X) + \frac{t^2}{2!}E(X^2) + \cdots + \frac{t^k}{k!}E(X^k) + \cdots

E(1) = 1, and E(e^{tX}) is the moment generating function.

1.4.1 Example 1

X is binomial with parameters n and p. What is M_X(t)?

M_X(t) = \sum_{k=0}^{n} e^{tk}\, P(X = k)

M_X(t) = \sum_{k=0}^{n} e^{tk} \binom{n}{k} p^k q^{n-k}

M_X(t) = \sum_{k=0}^{n} \binom{n}{k} (e^t p)^k q^{n-k}

M_X(t) = (e^t p + q)^n

M_X^{(k)}(0) = E(X^k)

f(x) = a_0 + a_1 x + \cdots + a_n x^n + \cdots, \quad a_n = \frac{f^{(n)}(0)}{n!}

M_X(t) = (e^t p + q)^n

M_X'(t) = n(e^t p + q)^{n-1} p e^t

M_X''(t) = n(n-1)(e^t p + q)^{n-2} (e^t p)\, p e^t + n(e^t p + q)^{n-1} p e^t

M_X'(0) = n(p e^0 + q)^{n-1} p e^0 = np = E(X) = mean of the binomial

M_X''(0) = n(n-1)p^2 + np = n^2 p^2 - np^2 + np = E(X^2)

Let us calculate the variance now:

σ^2 = E(X^2) - [E(X)]^2 = n^2 p^2 - np^2 + np - n^2 p^2

σ^2 = npq
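The derivative computations above can be reproduced symbolically. The sketch
below (ours, using the sympy library with a concrete n = 5, an arbitrary choice)
differentiates M_X(t) = (pe^t + q)^n at t = 0 and recovers the binomial mean and
variance:

```python
import sympy as sp

t, p = sp.symbols("t p")
n = 5
q = 1 - p
M = (p * sp.exp(t) + q) ** n          # binomial MGF

first = sp.diff(M, t).subs(t, 0)      # M'(0)  = E[X]
second = sp.diff(M, t, 2).subs(t, 0)  # M''(0) = E[X^2]
var = sp.expand(second - first**2)

print(sp.simplify(first))  # 5*p          == n p
print(sp.factor(var))      # -5*p*(p - 1) == n p (1 - p) = n p q
```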
Fact:

Consider random variables X and Y. If M_X(t) = M_Y(t) for all t, then X and Y
have the same distribution.

Note:

X and Y may be two different functions (as random variables on the sample space)
yet have the same probability mass function or cumulative distribution function,
i.e. the same distribution.

M_X(t) = M_Y(t)
E(X) = E(Y)

This implies the means are the same.

E(X^2) = E(Y^2)

This implies the variances are the same.

Suppose X and Y are discrete random variables:

E(e^{tX}) = p_1 e^{t x_1} + \cdots + p_n e^{t x_n} = M_X(t)
E(e^{tY}) = q_1 e^{t y_1} + \cdots + q_m e^{t y_m} = M_Y(t)

Let us see what happens as t \to \infty: the largest support point dominates,

M_X(t) = E(e^{tX}) \approx p_n e^{t x_n}
M_Y(t) = E(e^{tY}) \approx q_m e^{t y_m}

For M_X(t) = M_Y(t) we need x_n = y_m and p_n = q_m; subtract off the last terms and
apply the argument again. Repeating,

n = m \implies p_i = q_i,\; x_i = y_i

So X and Y have the same distribution.

1.5 Properties Of Moment Generating Function

1. M_{aX}(t) = E[e^{t(aX)}] = E[e^{taX}] = M_X(at)

2. If X and Y are independent random variables:

M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}]

= E(e^{tX})\, E(e^{tY})

since, as X and Y are independent random variables, e^{tX} and e^{tY} are
independent random variables

= M_X(t)\, M_Y(t)

(e^{tX} and e^{tY} are independent by the independence of X and Y, being
functions — power series — of X and Y alone.)

3. If c is a constant:

M_{X+c}(t) = E[e^{t(X+c)}] = E[e^{tX} e^{tc}]

M_{X+c}(t) = e^{tc} E(e^{tX}) = e^{tc} M_X(t)

A random variable X has the normal probability density function (pdf) with
parameters µ and σ:

f_X(x) = \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}\left(\frac{x-µ}{σ}\right)^2}

M_X(t) = ?

Z has pdf f_Z(x) = \frac{1}{\sqrt{2π}} e^{-x^2/2}, the normal pdf with parameters µ = 0, σ^2 = 1, and

M_Z(t) = e^{t^2/2}

X = σZ + µ has the normal pdf with parameters µ and σ^2. Its CDF is

F(x) = P(σZ + µ \le x) = P\left( Z \le \frac{x-µ}{σ} \right) = \frac{1}{\sqrt{2π}} \int_{-\infty}^{(x-µ)/σ} e^{-t^2/2}\, dt

Differentiating gives the pdf of σZ + µ:

F'(x) = \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}\left(\frac{x-µ}{σ}\right)^2} 

We want M_X(t), where X has the normal pdf with parameters µ, σ. Using
X = σZ + µ and the properties above:

M_{σZ+µ}(t) = e^{µt} M_{σZ}(t) = e^{µt} M_Z(σt)

M_Z(t) = e^{t^2/2}

M_X(t) = M_{σZ+µ}(t) = e^{µt} e^{σ^2 t^2/2}

M_X(t) = e^{σ^2 t^2/2 + µt}

M_X'(0) = µ
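A quick Monte Carlo check of this formula (our sketch; the values of µ, σ and t
are arbitrary choices) compares the empirical E[e^{tX}] with e^{µt + σ²t²/2}:

```python
import math
import random

random.seed(1)
mu, sigma, t, reps = 1.0, 2.0, 0.3, 200_000

# Empirical MGF: average of e^{tX} over draws X ~ N(mu, sigma^2).
emp = sum(math.exp(t * random.gauss(mu, sigma)) for _ in range(reps)) / reps
theory = math.exp(mu * t + sigma**2 * t**2 / 2)
print(round(emp, 3), round(theory, 3))  # both ~ exp(0.48) ≈ 1.616
```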

1.6 Theorem

If X and Y are independent random variables,

X \sim N(µ_x, σ_x^2)

Y \sim N(µ_y, σ_y^2)

then X + Y is normal, N(µ_x + µ_y, σ_x^2 + σ_y^2). The moment generating function
completely determines the distribution; since X and Y are independent random
variables,

M_{X+Y}(t) = M_X(t)\, M_Y(t)

M_{X+Y}(t) = e^{(σ_x^2 + σ_y^2) t^2/2 + (µ_x + µ_y) t}

which is the moment generating function of a normal random variable with mean
(µ_x + µ_y) and variance (σ_x^2 + σ_y^2).
Chapter 2

Lecture 3: Kishan R (2019DMF06)

2.1 Introduction:

The Central Limit Theorem states that the sampling distribution of the sample
mean approaches a normal distribution as the sample size increases, irrespective
of the shape of the population distribution. This usually holds true for
sample sizes above 30, i.e. n \ge 30.

Let us consider the version of the CLT for independent and identically distributed
(iid) random variables (RVs). X_1, X_2, \cdots, X_i, \cdots, X_n are iid RVs with

E[X_i] = µ < \infty \quad (finite mean)    (2.1)

Var[X_i] = σ^2 < \infty \quad (finite variance)    (2.2)

Sample\ Mean = \bar{S}_n = \frac{X_1 + X_2 + \cdots + X_i + \cdots + X_n}{n}    (2.3)

For a sample point ω \in Ω, taking the expectation of \bar{S}_n:

\bar{S}_n(ω) = \frac{1}{n} \sum_{i=1}^{n} X_i(ω) \implies E[\bar{S}_n] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n}(µ + \cdots + µ) = \frac{nµ}{n} = µ    (2.4)

[We know that E[X_i] = µ.] Hence E[\bar{S}_n] = µ.

Calculating the variance of \bar{S}_n:

Var(\bar{S}_n) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}(σ^2 + \cdots + σ^2) = \frac{nσ^2}{n^2} = \frac{σ^2}{n}

[We know that Var[X_i] = σ^2.]

\implies SD(\bar{S}_n) = \frac{σ}{\sqrt{n}}

Now we normalize the random variable \bar{S}_n to Z_n, i.e. by subtracting its mean
and dividing by its standard deviation:

Z_n = \frac{\bar{S}_n - E(\bar{S}_n)}{SD(\bar{S}_n)} = \frac{\bar{S}_n - µ}{σ/\sqrt{n}} = \frac{\sqrt{n}(\bar{S}_n - µ)}{σ} = \frac{n\bar{S}_n - nµ}{\sqrt{n}\,σ} = \frac{\left( \sum_{i=1}^{n} X_i \right) - nµ}{\sqrt{n}\,σ}    (2.5)

Here E(Z_n) = 0 and Var(Z_n) = 1, which means that the mean of Z_n is zero and its
variance is 1.

CLT: Z_n converges in distribution to the standard normal N(0, 1) as n \to \infty, the
normal distribution with mean 0 and variance 1:

Z_n \xrightarrow[n \to \infty]{D} Z \sim N(0, 1)    (2.6)

Now we can write:

\lim_{n \to \infty} P(Z_n \le x) = P(Z \le x) = Φ(x) \quad \forall x \in \mathbb{R}    (2.7)

where P(Z \le x) is the CDF of the standard normal.

Note: the convergence of Z_n is independent of what the distribution of the RV X_i
is, i.e. X_i can be a discrete, continuous or mixed random variable.

Z_n = \frac{\bar{S}_n - µ}{σ/\sqrt{n}} = \frac{\left( \sum_{i=1}^{n} X_i \right) - nµ}{\sqrt{n}\,σ} = \frac{Y_n - nµ}{\sqrt{n}\,σ}    (2.8)

Here Z_n is the normalized sum of iid RVs. Therefore, for the sequence of RVs X_i,
iid with E(X_i) = µ < \infty and Var(X_i) = σ^2 < \infty, if

Y_n = \sum_{i=1}^{n} X_i

then \bar{S}_n becomes

\bar{S}_n = \frac{Y_n}{n}    (2.9)

Example 1: Assume X_i \sim Bernoulli(p) \implies E(X_i) = µ = p,
σ^2 = var(X_i) = p(1-p) = pq < \infty, with q = (1-p); here X_i is a discrete RV.

\implies Y_n = \sum_{i=1}^{n} X_i \implies Y_n \sim Bin(n, p)

Z_n = \frac{Y_n - nµ}{\sqrt{n}\,σ} = \frac{Y_n - np}{\sqrt{n}\sqrt{p(1-p)}} = \frac{Y_n - np}{\sqrt{np(1-p)}}    (2.10)

where Z_n is a discrete RV with a PMF; it does not have a PDF since Z_n is not a
continuous RV. Hence the CLT is expressed in terms of CDFs and not PDFs.

Considering the case of the CLT: there is convergence in distribution of Z_n to Z,
the standard normal.

CDF of Z_n \xrightarrow[n \to \infty]{D} CDF of the standard normal

Z_n = \frac{(x_1 + x_2 + \cdots + x_n) - np}{\sqrt{np(1-p)}}, \quad X_i \sim Bern(p)    (2.11)

Y_n \in \{0, 1, 2, \cdots, n\} \implies Z_n = z_n\ (Y_n = y_n \in \{0, 1, 2, \cdots, n\})

n = 1 \implies Y_1 \in \{0, 1\}; \quad n = 2 \implies Y_2 \in \{0, 1, 2\}

Further we see X_i \sim Ber(p) with p = \frac{1}{3}, n = number of samples,
i.e. number of coin tosses, and p = P(H), the probability of heads:

X_i = \begin{cases} 1 & \text{wp } p = 1/3 \\ 0 & \text{wp } (1-p) = 2/3 \end{cases}
With p = \frac{1}{3} and n = 1 toss, the PMF of Z_1, where

Z_1 = \frac{X_1 - p}{\sqrt{p(1-p)}},

puts mass 2/3 at the tails value and mass 1/3 at the heads value computed below.

[Figure 3.1: PMF of Z_1]


z_1(ω_1) = \frac{x_1 - p}{\sqrt{p(1-p)}} = \frac{0 - \frac{1}{3}}{\sqrt{(\frac{1}{3})(\frac{2}{3})}} = \frac{-\frac{1}{3}}{\frac{\sqrt{2}}{3}} = -\frac{1}{\sqrt{2}}, \quad ω = ω_1 \text{ (Tails, T)}

z_1(ω_2) = \frac{1 - \frac{1}{3}}{\frac{\sqrt{2}}{3}} = \frac{\frac{2}{3}}{\frac{\sqrt{2}}{3}} = \frac{2}{\sqrt{2}} = \sqrt{2}, \quad ω = ω_2 \text{ (Heads, H)}

X_1(ω_1) = 0 and X_1(ω_2) = 1



Z_1(ω) = \frac{X_1(ω) - p}{\sqrt{p(1-p)}}, \quad Ω = \{ω_1, ω_2\}, where ω_1 is T and ω_2 is H.

P(Y_n = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad where\ k = 0, 1, 2, \cdots, n    (2.12)

When n = 2 we get Z_2 = \frac{(X_1 + X_2) - 2p}{\sqrt{2p(1-p)}} and Ω^2 = \{HH, HT, TH, TT\}.

When n = 3 we get Z_3 = \frac{(X_1 + X_2 + X_3) - 3p}{\sqrt{3p(1-p)}}, and for a large sample where n = 50 we
get Z_{50} = \frac{\left( \sum_{i=1}^{50} X_i \right) - 50p}{\sqrt{50p(1-p)}},

where (X_1 + X_2) is Y_2, (X_1 + X_2 + X_3) is Y_3, and \sum_{i=1}^{50} X_i is Y_{50}.

The PMF of Z_n is P_{Z_n}(z) = P(Z_n = z).

[Figure 3.2: PMFs of Z_2, Z_3 and Z_n — not symmetric for small n, becoming symmetric and tending to N(0,1) as n grows.]
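The convergence pictured above is easy to reproduce by simulation. The following
sketch (ours; the parameter matches the lecture's p = 1/3) draws many copies of
Z_n and reports its sample mean, standard deviation, and P(Z_n ≤ 0), which
should approach Φ(0) = 0.5:

```python
import random
import statistics

random.seed(2)
p, reps = 1 / 3, 50_000

for n in (1, 3, 50):
    zs = []
    for _ in range(reps):
        y = sum(random.random() < p for _ in range(n))     # Y_n ~ Bin(n, p)
        zs.append((y - n * p) / (n * p * (1 - p)) ** 0.5)  # normalize to Z_n
    below_zero = sum(z <= 0 for z in zs) / reps
    print(n, round(statistics.mean(zs), 3),
          round(statistics.stdev(zs), 3), round(below_zero, 3))
# Mean ~ 0 and stdev ~ 1 for every n; the skewed P(Z_n <= 0) drifts
# toward 0.5 as n grows, showing the PMF becoming symmetric.
```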


Example 2: X_i \sim Uniform(0, 1), continuous iid RVs, with E[X_i] = µ = \frac{1}{2} < \infty
and var(X_i) = σ^2 = \frac{1}{12} < \infty.

Recall: X_i \sim uniform(a, b) \implies E(X_i) = \frac{a+b}{2}, \quad var(X_i) = \frac{(b-a)^2}{12}, \quad σ = \frac{b-a}{\sqrt{12}}

PDF of X_i:

f_{X_i}(x) = \begin{cases} \frac{1}{b-a} & x \in (a, b) \\ 0 & \text{if } x \notin (a, b) \end{cases}

[Figure 3.3: Uniform distribution — density of height 1/(b-a) on (a, b) with total area 1; the special case Uniform(0,1) has height 1 on (0,1).]

Z_n = \frac{(X_1 + X_2 + X_3 + \cdots + X_n) - nµ}{\sqrt{n}\,σ} = \frac{(X_1 + X_2 + X_3 + \cdots + X_n) - \frac{n}{2}}{\sqrt{\frac{n}{12}}}    (2.13)

\implies Z_1 = \frac{X_1 - \frac{1}{2}}{\sqrt{\frac{1}{12}}}, \quad X_1 \sim uniform(0, 1) \implies x_1 \in (0, 1)

Z_2 = \frac{(X_1 + X_2) - \frac{2}{2}}{\sqrt{\frac{2}{12}}}, \quad X_2 \sim uniform(0, 1)

Z_{50} = \frac{(X_1 + X_2 + X_3 + \cdots + X_{50}) - \frac{50}{2}}{\sqrt{\frac{50}{12}}}

\implies Y_{50} = (X_1 + \cdots + X_{50}) \sim IH(0, 50), the Irwin–Hall distribution of a sum of 50 iid uniforms, supported on (0, 50).

[Figure 3.4: PDFs of Z_1, Z_2 and Z_{50} — the density of Z_n tends to N(0,1).]


Z_n is the normalized sum of n iid Uniform(0,1) variables. The shape of the PDF
of Z_n, f_{Z_n}(z), tends to the standard normal as n \to \infty.

Why do we normalize Y_n?

X_i are iid RVs with E(X_i) = µ < \infty and var(X_i) = σ^2 < \infty.

Z_n = \frac{Y_n - nµ}{\sqrt{n}\,σ}, \quad Y_n = \sum_{i=1}^{n} X_i

E(Y_n) = \sum_{i=1}^{n} E(X_i) = µ + µ + \cdots + µ = nµ

Var(Y_n) = \sum_{i=1}^{n} Var(X_i) = σ^2 + σ^2 + \cdots + σ^2 = nσ^2 \quad and \quad SD(Y_n) = \sqrt{n}\,σ    (2.14)

Z_n is the normalized version of Y_n:

Z_n = \frac{Y_n - E(Y_n)}{SD(Y_n)} \implies E(Z_n) = 0 \text{ and } Var(Z_n) = 1.

We normalize Y_n to get finite mean and variance for Z_n, since

E(Y_n) = nµ \xrightarrow{n \to \infty} \infty \quad and \quad Var(Y_n) = nσ^2 \xrightarrow{n \to \infty} \infty

Note: for any fixed n, the CDF of Z_n, F_{Z_n}(z), is obtained by scaling and shifting
the CDF of Y_n. Hence the CDF of Z_n and the CDF of Y_n have similar shapes.

[Figure 3.5: Y_n \sim (nµ, nσ^2) flattens out as n \to \infty; informally Y_\infty \sim (\infty, \infty).]

Also consider \bar{S}_n = \frac{1}{n}\sum_{i=1}^{n} X_i: E[\bar{S}_n] = µ and var(\bar{S}_n) = \frac{σ^2}{n} \xrightarrow{n \to \infty} 0,
while E[\bar{S}_n] = µ \xrightarrow{n \to \infty} µ.

[Figure 3.6: \bar{S}_n \sim (µ, σ^2/n) concentrates at the fixed µ as n \to \infty — in the limit the distribution is concentrated at µ with no spread, S_\infty \sim (µ, 0) — while Z_n \sim (0, 1) stays stable, with Z_\infty \sim N(0, 1).]

2.2 Application Of CLT

1) We can use it when lab experiments are performed and there are certain
measurement criteria: the lab measurement errors can be modeled as normal
random variables.

2) In communication systems there exists noise; hence the Gaussian noise model
is incorporated for the noise.

3) Widely used in finance. The percentage changes in prices of assets are
modeled as normal RVs. Returns of an index, which is a weighted average of
many assets, tend to be normal even if the returns of the individual assets
are not.

4) Simplified computation: instead of summing RVs we can just use the normal
distribution. For n \ge 30 the normal approximation is good for Z_n.

2.3 Continuity Correction

A continuity correction is the name given to adding or subtracting 0.5 to a
discrete x-value. Before modern statistical software existed and calculations had
to be done manually, continuity corrections were often used to find probabilities
involving discrete distributions. It is a topic discussed in statistics classes to
illustrate the relationship between a binomial distribution and a normal
distribution, and to show that it is possible for a normal distribution to
approximate a binomial distribution by applying a continuity correction.
Example 3: Y \sim Bin(n, p) with n = 20 and p = \frac{1}{2}; we know that
Z_n = \frac{Y_n - nµ}{\sqrt{n}\,σ}, where Y is a discrete RV built from the X_i.
Find P(8 \le Y \le 10).

Sol: Y = (X_1 + X_2 + \cdots + X_n), \quad \sqrt{n}\,σ = \sqrt{20} \cdot \frac{1}{2} = \sqrt{5}

P(8 \le Y \le 10) = P\left[ \frac{8 - nµ}{\sqrt{5}} \le \frac{Y - nµ}{\sqrt{5}} \le \frac{10 - nµ}{\sqrt{5}} \right] = P\left[ \frac{8 - 10}{\sqrt{5}} \le \frac{Y - nµ}{\sqrt{5}} \le \frac{10 - 10}{\sqrt{5}} \right]

P(8 \le Y \le 10) \simeq Φ(0) - Φ\left( \frac{-2}{\sqrt{5}} \right) = 0.3145 \quad (CLT, less accurate)

The exact computation:

P(8 \le Y \le 10) = P(Y = 8\ or\ Y = 9\ or\ Y = 10) = P(Y = 8) + P(Y = 9) + P(Y = 10)

= \sum_{k=8}^{10} \binom{n}{k} p^k (1-p)^{n-k} = \left[ \binom{20}{8} + \binom{20}{9} + \binom{20}{10} \right] \left( \frac{1}{2} \right)^{20}

P(8 \le Y \le 10) = 0.4565, the more accurate (binomial) value. The error is due to
the fact that Y is a discrete binomial RV and we are using the continuous normal
distribution (CLT) to find P(8 \le Y \le 10).

Applying the Continuity Correction:

Here we make the continuity correction for the above range of probabilities:

P(8 \le Y \le 10) = P(7.5 < Y < 10.5), where Y is discrete,

= P\left[ \frac{7.5 - 10}{\sqrt{5}} < \frac{Y - nµ}{\sqrt{5}} < \frac{10.5 - 10}{\sqrt{5}} \right] \simeq Φ\left( \frac{0.5}{\sqrt{5}} \right) - Φ\left( \frac{-2.5}{\sqrt{5}} \right) = 0.4567

Clearly, the probability before the correction is different from after the
correction; the continuity-corrected range probability is more accurate.

The continuity correction is used for P(y_1 \le Y \le y_2) when Y \sim Bin and y_1 and y_2
are close together.

Here X_i \sim Bern(p) with E(X_i) = µ = p = \frac{1}{2} and
Var(X_i) = σ^2 = p(1-p) = \frac{1}{4}, \quad nµ = 20 \cdot \frac{1}{2} = 10.
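The three numbers above can be reproduced with a few lines of Python (our
sketch; it uses only the standard library, with the normal CDF built from the
error function):

```python
from math import comb, erf, sqrt

def phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p = 20, 0.5
mu, sd = n * p, sqrt(n * p * (1 - p))   # 10 and sqrt(5)

exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(8, 11))
plain = phi((10 - mu) / sd) - phi((8 - mu) / sd)
corrected = phi((10.5 - mu) / sd) - phi((7.5 - mu) / sd)
print(round(exact, 4), round(plain, 4), round(corrected, 4))
# 0.4565 0.3145 0.4567 -> the continuity correction is far more accurate
```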

2.3.1 Continuity correction for Discrete RV

A continuity correction is usually applied when we want to use a continuous
distribution to approximate a discrete distribution. Mostly, it is used when we
want to use a normal distribution to approximate a binomial distribution.

Y = \sum_{i=1}^{n} X_i, where the X_i are independent discrete RVs \implies Y is a discrete,
integer-valued RV.

P(E) = P(l \le Y \le u), where E is the event, l and u are integers and Y is discrete.

\implies P(E) = P(l \le Y \le u) \simeq P\left( \left( l - \frac{1}{2} \right) \le Y \le \left( u + \frac{1}{2} \right) \right),

used when X_i \sim Bern(p) and Y \sim Bin(n, p).

2.3.2 Convergence in Probability

The concept of convergence in probability is based on the intuition that two
random variables are "close to each other" if there is a high probability that
their difference is very small. Convergence in probability of X_n tells us the tail
probability is small: the probability of the event not occurring is very small, or
near zero.

X_n \xrightarrow[n \to \infty]{P} X = a

For every ε > 0, \lim_{n \to \infty} P[|X_n - a| \ge ε] \to 0

[Figure 3.7: Convergence in probability — as n \to \infty, the distribution of X_n concentrates in a band (a - ε, a + ε) around a; panel A shows the tail, panel B the band.]



In the figure above, label A shows the tail of the RV X_n, and label B shows that
convergence in probability says the tail probability is small, but it does not say
how far out the tail is. This matters for expectations and variances.

Note: convergence in probability of X_n tells us the tail probability is small but
does not tell us how far the tail extends.

For an example with X_n \xrightarrow[n \to \infty]{P} 0, consider

X_n = \begin{cases} n & \text{wp } \frac{1}{n} \to 0 \\ 0 & \text{wp } \left( 1 - \frac{1}{n} \right) \to 1 \end{cases}

E[X_n] = n \cdot \frac{1}{n} + 0 \cdot \left( 1 - \frac{1}{n} \right), so we get E(X_n) = 1

E[X_n^2] = 0^2 \left( 1 - \frac{1}{n} \right) + n^2 \cdot \frac{1}{n} = n, by which we get E[X_n^2] = n \xrightarrow{n \to \infty} \infty

X_n \xrightarrow[n \to \infty]{P} 0    (2.15)

yet E(X_n) = 1 \nrightarrow 0 and var(X_n) \nrightarrow 0.

Almost all of the PMF/PDF of X_n eventually (n \to \infty) gets concentrated (ε > 0)
close to a.
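This example can be seen numerically. The sketch below (ours) simulates X_n for
several n: the tail probability P(|X_n| ≥ ε) shrinks like 1/n, while the sample
mean stays near 1:

```python
import random

random.seed(3)
reps, eps = 100_000, 0.5

for n in (10, 100, 1000):
    xs = [n if random.random() < 1 / n else 0 for _ in range(reps)]
    tail = sum(abs(x) >= eps for x in xs) / reps  # ~ 1/n -> 0
    mean = sum(xs) / reps                         # stays ~ 1
    print(n, round(tail, 4), round(mean, 3))
```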

2.3.3 Convergence in Probability is a Stronger notion than Convergence in Distribution

Convergence in probability (stronger) \implies convergence in distribution (weaker).

X_n \xrightarrow[n \to \infty]{P} X    (2.16)

P[|X_n - X| \ge ε] \xrightarrow{n \to \infty} 0 \quad for\ any\ ε > 0.    (2.17)

Definition: a sequence of RVs X_1, X_2, \cdots, X_n converges in probability to a
random variable X, i.e.

X_n \xrightarrow[n \to \infty]{P} X,

if \lim_{n \to \infty} P[|X_n - X| \ge ε] = 0 \;\; \forall ε > 0.



Example 4: X_n \sim Exp(n). Show X_n \xrightarrow[n \to \infty]{P} 0 = X.

That is, the sequence of RVs X_1, X_2, \cdots, X_n, \cdots converges in probability to the
zero RV X:

\lim_{n \to \infty} P[|X_n - 0| \ge ε] = \lim_{n \to \infty} P(|X_n| \ge ε) = \lim_{n \to \infty} P(X_n \ge ε) = \lim_{n \to \infty} e^{-nε} = 0 \quad \forall ε > 0.

Note: if X_n \xrightarrow[n \to \infty]{P} X then X_n \xrightarrow[n \to \infty]{d} X,

i.e. convergence in probability \implies convergence in distribution.

Example 5: Let X_n = (X + Y_n), where E(Y_n) = \frac{1}{n}, var(Y_n) = \frac{σ^2}{n}, σ > 0 (constant).
Show that X_n converges in probability, i.e. X_n \xrightarrow[n \to \infty]{P} X.

Sol: Recall the triangle inequality (for any triangle, the sum of the lengths of any
two sides is at least the length of the remaining side); for real numbers it reads:

a \in \mathbb{R}, b \in \mathbb{R}, \quad |a + b| \le |a| + |b|.

Let a = [Y_n - E(Y_n)] and b = E(Y_n); then |a + b| = |Y_n| \le |a| + |b|
= |Y_n - E(Y_n)| + |E(Y_n)|

\implies |Y_n| \le |Y_n - E(Y_n)| + \frac{1}{n}

Recall the definition of X_n \xrightarrow[n \to \infty]{P} X: \lim_{n \to \infty} P[|X_n - X| \ge ε] = 0 \; \forall ε > 0. Here
|X_n - X| = |Y_n|, so we study P[|Y_n| \ge ε], which splits into the two tails
P(Y_n > ε) and P(Y_n < -ε).

[Figure 3.8: graphical representation of P[|Y_n| \ge ε] — the two tails of Y_n outside (-ε, ε).]

Define W_n = |Y_n - E(Y_n)| + \frac{1}{n}. Since |Y_n| \le W_n, the event E_n = [|Y_n| \ge ε] implies
[W_n \ge ε]: W_n \ge |Y_n| \ge ε \implies W_n \ge ε,

by transitivity of \ge.

Now let D_n = Y_n - E(Y_n) = \left( Y_n - \frac{1}{n} \right) \implies E(D_n) = 0.

[Figure 3.9: the distribution shifted from Y_n (centered at E(Y_n)) to D_n (centered at 0).]

W_n = |Y_n - E(Y_n)| + \frac{1}{n} = \left| Y_n - \frac{1}{n} \right| + \frac{1}{n}

Writing out the event F_n = [W_n \ge ε] case by case:

F_n = [W_n \ge ε] \implies \begin{cases} Y_n \ge ε & \text{if } Y_n \ge \frac{1}{n} \\ -Y_n + \frac{2}{n} \ge ε \;\left( \iff Y_n \le -ε + \frac{2}{n} \right) & \text{if } Y_n < \frac{1}{n} \end{cases}

while the event E_n = [|Y_n| \ge ε] = (E_n^+ \cup E_n^-) splits as

\begin{cases} Y_n \ge ε & \text{if } Y_n \ge 0 \\ -Y_n \ge ε \;(\iff Y_n \le -ε) & \text{if } Y_n < 0 \end{cases}

F_n = F_n^+ \cup F_n^-, and since the F_n^- region is larger than the E_n^- region, E_n^- \subset F_n^-.
Hence

E_n \subseteq F_n \implies P(E_n) \le P(F_n) \implies for\ any\ ε > 0

P[|X_n - X| \ge ε] = P[|Y_n| \ge ε] \le P\left[ \left\{ |Y_n - E(Y_n)| + \frac{1}{n} \right\} \ge ε \right]



= P\left[ |Y_n - E(Y_n)| \ge \left( ε - \frac{1}{n} \right) \right]

Using Chebyshev's inequality, which provides an upper bound on the probability
that the absolute deviation of a random variable from its mean will exceed a given
threshold:

Recall Chebyshev's inequality: P[|X - µ_x| \ge k] \le \frac{σ_x^2}{k^2}. Applying it with k = ε - \frac{1}{n}:

P[|X_n - X| \ge ε] \le \frac{var(Y_n)}{\left( ε - \frac{1}{n} \right)^2} = \frac{σ^2/n}{\left( ε - \frac{1}{n} \right)^2} = \frac{σ^2}{n \left( ε - \frac{1}{n} \right)^2} \xrightarrow{n \to \infty} 0

\implies X_n \xrightarrow[n \to \infty]{P} X
n→∞

Note: if X_n \xrightarrow[n \to \infty]{P} X, then X_n \xrightarrow[n \to \infty]{d} X (convergence in probability
implies convergence in distribution), but the converse is not true, i.e.

if X_n \xrightarrow[n \to \infty]{d} X, it does not follow that X_n \xrightarrow[n \to \infty]{P} X (convergence in
distribution does not imply convergence in probability).

Example 6: Let X_1, X_2, \cdots, X_i \sim iid Bern(\frac{1}{2}), a sequence of iid RVs.

Let X \sim Bern(\frac{1}{2}) be independent from the X_i's. Then X_n \xrightarrow[n \to \infty]{d} X, since X_n and X
are both given as Bern(\frac{1}{2}); yet, being independent of X, the X_n do not converge
to X in probability.

2.4 Summary

Z_n converges in distribution to the standard normal as n \to \infty, i.e. N(0,1), the
normal distribution with mean 0 and variance 1. The convergence of Z_n is
independent of what the distribution of the RV X_i is, i.e. X_i can be discrete,
continuous or mixed.

A continuity correction is the name given to adding or subtracting 0.5 to a
discrete x-value. A continuity correction is usually applied when we want to use a
continuous distribution to approximate a discrete distribution. Mostly, it is used
when we want to use a normal distribution to approximate a binomial distribution.

For any fixed n, the CDF of Z_n, F_{Z_n}(z), is obtained by scaling and shifting the
CDF of Y_n; hence the CDF of Z_n and the CDF of Y_n have similar shapes.
Convergence in probability of X_n tells us the tail probability is small but does not
tell us how far the tail extends.

If X_n \xrightarrow[n \to \infty]{P} X then X_n \xrightarrow[n \to \infty]{d} X, but the converse is not true, i.e.
X_n \xrightarrow[n \to \infty]{d} X does not imply X_n \xrightarrow[n \to \infty]{P} X.
Chapter 3

Lecture 5 - Shashi Ranjan Mandal (2019DMB08)

3.1 Weak law of large numbers

Consider X_1, X_2, \cdots, X_n, independent identically distributed (iid) random
variables with finite mean E[X_i] = µ < \infty, and

\tilde{S}_n = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i    (3.1)

Then the sample mean converges in probability to the population mean µ:

\tilde{S}_n \xrightarrow[n \to \infty]{P\ (probability)} µ    (3.2)

In the previous lecture, we studied the proof of the theorem that convergence in
probability implies convergence in distribution:

If X_n \xrightarrow[n \to \infty]{P\ (Probability)} X, then X_n \xrightarrow[n \to \infty]{D\ (Distribution)} X    (3.3)

The hypothesis above is the stronger form of convergence. But in this lecture we
will learn that the converse of the above theorem is not true.

3.2 Converse of Theorem

The claim is that convergence in distribution does not imply convergence in
probability:


If X_n \xrightarrow[n \to \infty]{D\ (Distribution)} X, it does not follow that X_n \xrightarrow[n \to \infty]{P\ (Probability)} X    (3.4)

The above statement says that if convergence occurs in distribution, it does not
imply that convergence will occur in probability as well. Let us try to verify the
statement using a counterexample.

3.3 Counter Example

Let X be the standard normal variable with mean zero and standard deviation one,
represented as X \sim N(0,1). Let us also consider X_n = -X for n = 1, 2, 3, 4, \cdots

Using the above definition, we can tell that X_n is also standard normal, with mean
zero and standard deviation one, represented as X_n \sim N(0,1).

Here we can see that both X_n and X have the same cumulative distribution function
for all values of n, i.e. \forall n. This is because we have defined X_n = -X, and -X has
the same cumulative distribution function (cdf) as X by symmetry.

Now, since they have the same distribution, we want to check whether convergence
in probability holds or not.

In order to check that, let us use the definition of convergence in probability,
considering ε greater than 0. For ε > 0 we have:

P[|X_n - X| > ε] = P[|-X - X| > ε] = P[|2X| > ε]    (3.5)

In the above equation we know X_n = -X, hence we substituted it in the equation
and got P[|2X| > ε]. Simplifying the above equation further we get:

P[|2X| > ε] = P\left[ |X| > \frac{ε}{2} \right]    (3.6)

Now, removing the modulus from equation (3.6) and expanding the term, we get:

P\left[ |X| > \frac{ε}{2} \right] = P\left[ X > \frac{ε}{2} \right] + P\left[ X < -\frac{ε}{2} \right] \ne 0    (3.7)

So now we are looking at the two tails, where we have positive probability on
both sides of the tail, as depicted in the graph shown below.

Now we will explain how we remove the modulus from the event:

Event\; \left[ |X| > \frac{ε}{2} \right] = \begin{cases} X > \frac{ε}{2} & \text{if } X \ge 0 \\ X < -\frac{ε}{2} & \text{if } X < 0 \end{cases}    (3.8)

[Figure 3.1: the two tails of the standard normal density beyond \pm ε/2.]

Now the above can be written as the union of the two cases which we obtained
after removing the modulus from X:

Event\; \left[ |X| > \frac{ε}{2} \right] = \left( X > \frac{ε}{2} \right) \cup \left( X < -\frac{ε}{2} \right)    (3.9)

So we see that X_n does not converge to X in probability, since

P[|X_n - X| > ε] > 0, \quad for\ ε > 0    (3.10)
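A short simulation (our sketch) makes the same point: X_n and X are identically
distributed for every n, yet P(|X_n − X| > ε) = P(|2X| > ε) never shrinks:

```python
import random

random.seed(4)
reps, eps = 100_000, 0.5

xs = [random.gauss(0, 1) for _ in range(reps)]    # X ~ N(0,1)
tail = sum(abs(-x - x) > eps for x in xs) / reps  # |X_n - X| = |2X|
print(round(tail, 3))  # ~0.803 = 2(1 - Phi(0.25)); does not depend on n
```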

3.4 General Case

Let us look at the general case of convergence.

Statement 3.4.1. If X_n converges to X in probability, then X_n converges to X in
distribution. This implies that convergence in probability is stronger than
convergence in distribution.

In mathematical form, we can write the above statement as follows:

If X_n \xrightarrow[n \to \infty]{P\ (Probability)} X \Rightarrow X_n \xrightarrow[n \to \infty]{D\ (Distribution)} X    (3.11)

Proof. In order to prove the above statement, let us consider the distribution
function

F_{X_n}(x) = P[X_n \le x]    (3.12)

Now, using the law of total probability, the above equation can be written as the
sum of the two terms below:

P[X_n \le x] = P[X_n \le x, X \le (x + ε)] + P[X_n \le x, X > (x + ε)]    (3.13)



Simplifying the equation further by factoring the first term into a conditional
probability, i.e. conditioning on X, we can write the above equation as follows:

= P[X_n \le x | X \le (x + ε)]\, P[X \le (x + ε)] + P[X_n \le x, X - ε > x]    (3.14)

Now, a conditional probability is at most one, hence we can bound the first term:

P[X_n \le x | X \le (x + ε)]\, P[X \le (x + ε)] \le P[X \le (x + ε)]    (3.15)

Similarly, we can bound the second term of equation (3.14): on that event,
X_n \le x < X - ε, so

P[X_n \le x, X - ε > x] \le P[X_n < X - ε]    (3.16)

Therefore, substituting the values of equations (3.15) and (3.16) into (3.14), we get:

F_{X_n}(x) \le P[X \le (x + ε)] + P[X_n < X - ε]    (3.17)

F_{X_n}(x) \le F_X(x + ε) + P[|X_n - X| > ε]    (3.18)

since X_n < (X - ε) implies |X_n - X| > ε.

(
(Xn − x) > , if Xn ≥ X
Event(E), [|Xn − x| > ] = (3.19)
−(Xn − x) > , if Xn < X
Event E is the union of event E1 and E2 .

Also,

FX(x − ε) = P[X ≤ x − ε]   (3.20)

P[X ≤ x − ε] = P[X ≤ x − ε, Xn ≤ x] + P[X ≤ x − ε, Xn > x]   (3.21)

FX(x − ε) = P[X ≤ x − ε | Xn ≤ x] P[Xn ≤ x] + P[X ≤ x − ε, Xn > x]   (3.22)

Now, since the conditional probability is at most one, we know that

P[X ≤ x − ε | Xn ≤ x] P[Xn ≤ x] ≤ P[Xn ≤ x]   (3.23)

Let us now simplify the second term of equation (3.22). We have

X ≤ x − ε, Xn > x ⇒ X ≤ x − ε and Xn − ε > x − ε   (3.24)

⇒ X ≤ x − ε < Xn − ε   (3.25)

⇒ X < Xn − ε   (3.26)

Using this containment, the second term of equation (3.22) is bounded as

P[X ≤ x − ε, Xn > x] ≤ P[X < Xn − ε]   (3.27)



Therefore, combining equation (3.23) and equation (3.27), we can write equation (3.22) as:

FX(x − ε) ≤ P[Xn ≤ x] + P[X < Xn − ε]   (3.28)

FX(x − ε) ≤ FXn(x) + P[X < Xn − ε]   (3.29)

FX(x − ε) ≤ FXn(x) + P[|Xn − X| > ε]   (3.30)


(
(Xn − x) > , if Xn ≥ X ⇐ F1
F unction(F ) = [|Xn − x| > ] = (3.31)
−(Xn − x) > , if Xn < X ⇐ F2
F unction(F1 ) = [X < Xn − ] = [ < (Xn − x)] (3.32)
Similarly , we can rewrite function F2 as ,
F2 = −(Xn − x) >  = (Xn − X) < − (3.33)
Now we know , Function (F) is union of F1 and F2
F = F1 ∪ F2 (3.34)
Now taking Probability on both sides , we get
P (F ) = P (F1) + P (F2) (3.35)
Now we can see that P (F1) ≤ P(F) , since P (F2) ≥ 0. Now comparing above equa-
tion ,

FXn(x) ≤ FX(x + ε) + P[|Xn − X| > ε]   (3.36)

FX(x − ε) ≤ FXn(x) + P[|Xn − X| > ε]   (3.37)

Equation (3.37) can be rewritten as

FX(x − ε) − P[|Xn − X| > ε] ≤ FXn(x)   (3.38)

Combining equations (3.36) and (3.38), we get

FX(x − ε) − P[|Xn − X| > ε] ≤ FXn(x) ≤ FX(x + ε) + P[|Xn − X| > ε]   (3.39)
Now take the limit n → ∞. Since Xn →(P) X means that for every ε > 0

P[|Xn − X| > ε] → 0 as n → ∞   (3.40)

equation (3.39) implies that

FX(x − ε) ≤ lim_{n→∞} FXn(x) ≤ FX(x + ε), ∀ε > 0   (3.41)

Letting ε → 0, at every point x where FX is continuous,

⇒ lim_{n→∞} FXn(x) = FX(x)   (3.42)

Thus the above equation implies that

Xn →(D) X as n → ∞   (3.43)

3.5 Consequence of Weak Law of Large Number

Note 1. The assumption of finite variance Var(Xi) = σ² < ∞ is not required.

Let us consider the random sample mean S̄n, which converges to the mean of the population µ when n → ∞, as shown below:

S̄n →(P) µ = E(Xi) as n → ∞   (3.44)

Let us consider

Xi(ω) = { 1, if ω ∈ A;  0, if ω ∉ A }   (3.45)

Using the above condition, we can write

E[Xi] = P(ω ∈ A) = P(A)   (3.46)

Now, the random sample mean S̄n is obtained as shown in the equation below:

S̄n = (X1 + X2 + ··· + Xn)/n = (1/n) Σ_{i=1}^{n} Xi = fraction of times ω ∈ A   (3.47)

3.6 WLLN

We define S̄n as the fraction of times the outcome ω ∈ Ω lies in a given set A; it converges in probability to E(Xi) = µ = P(A), the probability of the set (event) A. Here E(Xi) = µ and Var(Xi) = σ². A simulation of this fraction-of-times interpretation is sketched below.
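The following Python sketch illustrates the statement (the event A and its probability p_A = 0.3 are arbitrary choices for the demo): the running fraction of outcomes falling in A settles near P(A).

```python
import numpy as np

rng = np.random.default_rng(1)
p_A = 0.3                                    # P(A), an arbitrary demo value
indicators = rng.random(100_000) < p_A       # X_i(omega) = 1 if omega in A, else 0

n = np.arange(1, indicators.size + 1)
S_bar = np.cumsum(indicators) / n            # running sample mean = fraction of times in A

for k in (10, 100, 10_000, 100_000):
    print(k, S_bar[k - 1])                   # approaches P(A) = 0.3 as n grows
```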

3.7 Convergence in Mean

Convergence in mean is one of the stronger forms of convergence. If

Xn(ω) → X(ω) as n → ∞   (3.48)

then this implies

Xn → X as n → ∞   (3.49)

Thus the above two equations mean that the distance between Xn and X tends to zero, i.e. d(Xn, X) → 0.

Definition 3.7.1. Let r ≥ 1 be a constant. The sequence of random variables (RVs) X1, X2, ..., Xn converges in the rth mean (or in the Lr norm) to the random variable X.

In mathematical form, we can write the above definition as follows:

Xn →(Lr) X as n → ∞   (3.50)

if

lim_{n→∞} d(Xn, X) = lim_{n→∞} E[|Xn − X|^r] = 0   (3.51)

Now, when r = 2, the above gives us mean square convergence (m.s.), as shown in the equation below:

Xn →(m.s.) X as n → ∞   (3.52)

Now let us explain the above definition with the help of an example. Consider a sequence of random variables Xn with Xn ∼ Uniform(0, 1/n). We need to prove that

Xn →(Lr) X = 0, ∀r ≥ 1   (3.53)

The density function of Xn is

fXn(x) = { n, if 0 ≤ x ≤ 1/n;  0, otherwise }   (3.54)

Now, in order to prove the claim, note that

E[|Xn − X|^r] = E[|Xn|^r], as X = 0   (3.55)

E[|Xn|^r] = ∫_{−∞}^{∞} x^r fXn(x) dx = ∫_0^{1/n} x^r n dx = n ∫_0^{1/n} x^r dx   (3.56)

Simplifying the above equation as follows:

E[|Xn|^r] = n [x^{r+1}/(r+1)]_0^{1/n}   (3.57)

After substituting the limits in the above equation, we get

E[|Xn|^r] = (n/(r+1)) (1/n)^{r+1} = n/((r+1) n^{r+1}) = 1/((r+1) n^r)   (3.58)

E[|Xn|^r] = 1/((r+1) n^r) → 0 as n → ∞, ∀r ≥ 1   (3.59)

Thus we proved that the sequence converges to X in the rth mean:

Xn →(Lr) X = 0 (proved)   (3.60)

A quick numerical check of this result follows.
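Below is a Python sketch (sample size, r and n values are arbitrary demo choices): the Monte Carlo estimate of E[|Xn|^r] matches the closed form 1/((r+1)n^r).

```python
import numpy as np

rng = np.random.default_rng(2)
r = 2
for n in (1, 10, 100):
    Xn = rng.uniform(0, 1 / n, 200_000)   # Xn ~ Uniform(0, 1/n)
    mc = np.mean(Xn ** r)                 # Monte Carlo estimate of E[|Xn - 0|^r]
    exact = 1 / ((r + 1) * n ** r)        # closed form from equation (3.58)
    print(n, mc, exact)                   # both shrink like 1/n^r
```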

Theorem 3.7.1. Convergence in mean is stronger than convergence in probability.

Mathematically, we can write the above theorem as follows:

Xn →(Lr) X as n → ∞ ⇒ Xn →(P) X as n → ∞   (3.61)

Proof. Consider, for any ε > 0, the tail probability

P[|Xn − X| ≥ ε] = P[|Xn − X|^r ≥ ε^r], since r ≥ 1   (3.62)

Now, using Markov's inequality, we can write

P[|Xn − X|^r ≥ ε^r] ≤ E[|Xn − X|^r]/ε^r → 0 as n → ∞   (3.63)

We can apply Markov's inequality because |Xn − X|^r ≥ 0, i.e. it is a non-negative random variable, and ε^r > 0.

Now, using the above statement, we can write

P[|Xn − X| ≥ ε] → 0 as n → ∞   (3.64)

Thus we proved that

Xn →(P) X as n → ∞   (3.65)

where we used that

Xn →(Lr) X as n → ∞ ⇒ E[|Xn − X|^r] → 0 as n → ∞   (3.66)

Note 2. The converse of the above theorem is not true: there exist sequences of random variables Xn that converge in probability but do not converge in mean.

Mathematically, Xn →(P) X as n → ∞ ⇏ Xn →(Lr) X as n → ∞   (3.67)
Chapter 4

Lecture 6 - Venkat Suman Panigrahi


(2019DMF12)

4.1 Recap

So far we have introduced the topics of convergence and moment generating functions. We have also looked at the limit theorems in detail. In this lecture we will discuss almost sure convergence and the strong law of large numbers in detail, and in the coming lectures we will build on these ideas and introduce stochastic processes.

4.2 Convergence of Random Variables

Random variables are variables whose values depend on the outcome of a random experiment. Suppose we have a sequence of random variables X1, X2, ..., Xn that converges to a random variable X; that is, Xn gets closer and closer to X as n increases. Suppose we want to observe the value of the random variable X, but we cannot observe it directly. So what we do is come up with some estimation technique to measure X, obtaining an estimate X1. We then refine the estimate to X2, and so on, continuing this process to get X1, X2, .... As we increase n, our estimates get better and better, so we hope that Xn converges to X.

There are different senses in which a sequence can converge. Some of these modes of convergence are stronger than others: if one mode is stronger than another, then convergence in the stronger mode implies convergence in the weaker mode. A sequence can converge in the following senses:

• Convergence in distribution

• Convergence in probability
• Convergence in mean
• Almost sure convergence

For example, using the figure, we conclude that if a sequence of random variables
converges in probability to a random variable X, then the sequence converges in
distribution to X as well.

Figure 4.1: Relations between different types of convergence

4.2.1 Convergence in distribution

With this type of convergence, the next outcome in a sequence of random experiments becomes better and better modeled by a given probability distribution. Convergence in distribution is the weakest form of convergence. However, this form of convergence is widely used; most often it arises from applications of the central limit theorem.

A sequence X1, X2, ..., Xn of random variables is said to converge in distribution, or converge weakly, to a random variable X if

lim_{n→∞} Fn(x) = F(x)   (4.1)

for every number x ∈ R at which F is continuous. Here Fn and F are the CDFs of the random variables Xn and X, respectively.

Figure 4.2: Use case of central limit theorem

Examples of convergence in distribution

• Tossing coins

Let Xn be the fraction of tails after tossing an unbiased coin n times. Then X1 has the Bernoulli distribution with expected value µ = 0.5 and variance σ² = 0.25, and the subsequent random variables X2, X3, ... are derived from the binomial distribution.

As we increase the number of tosses n, the distribution of the (suitably scaled) sample fraction starts converging to a normal distribution. This is explained by the central limit theorem: for large n, the sample mean is approximately normally distributed.

• Dice Problem

Suppose that in a dice-making factory the first batch of dice produced comes out biased or defective, so the outcome from tossing such a die follows a distribution noticeably different from the desired uniform distribution. As the production process improves and the dice become less and less defective, the outcome from throwing a die follows the uniform distribution more and more closely.

Figure 4.3: Convergence in Probability

4.2.2 Convergence in Probability

As the sequence progresses, the probability of an "unusual" outcome (one far from the limit) becomes smaller and smaller.

Convergence in probability is also the type of convergence established by the weak law of large numbers.

A sequence X1, X2, ..., Xn of random variables converges in probability towards the random variable X if, for all ε > 0,

lim_{n→∞} P(|Xn − X| > ε) = 0   (4.2)

Examples of Convergence in Probability

This example should not be taken literally. Consider the following experiment. First, pick a random person in the street; let X be his/her height, which is ex ante a random variable. Then ask other people to estimate this height by eye, and let Xn be the average of the first n responses. Then (provided there is no systematic error), by the law of large numbers, the sequence Xn will converge in probability to the random variable X.

• Note

1. Convergence in probability implies convergence in distribution.

2. Convergence in probability does not imply almost sure convergence.

4.2.3 Convergence in Mean

For the interpretation: when we say that the sequence Xn converges to X, it means the distance between Xn and X gets smaller and smaller. For example, if we define the distance between Xn and X as P(|Xn − X| > ε), we obtain convergence in probability. Another way to measure the distance between Xn and X is

E(|Xn − X|^r),   (4.3)

where r ≥ 1 is a fixed number. This defines convergence in mean.

Definition: A sequence of random variables X1, X2, ..., Xn converges in the rth mean, or in the Lr norm, to a random variable X if

lim_{n→∞} E(|Xn − X|^r) = 0   (4.4)

It is shown by

Xn →(Lr) X   (4.5)

4.3 Almost Sure Convergence

Almost sure convergence is one of the important discoveries in probability and statistics, and it leads to the establishment of the strong law of large numbers. It is also called 'convergence with probability one' (w.p.1). When we say that the probability of an event is zero (0), the event does not occur at all, and if it is one (1), the event occurs all the time. For example, if both sides of a coin are 'heads' (a biased coin), then the probability of 'tails' is zero and the probability of 'heads' is one. An event of probability one is a sure event, i.e. the degenerate case.

'Almost sure' means the convergence occurs almost everywhere: there may be some outcomes where it does not occur. This is pointwise convergence on the sample space, s ∈ S. We have a sequence of random variables X1, X2, ..., Xn defined on an underlying sample space, and we assume S is a finite set, |S| < ∞.

The functions Xn and X map S to the real numbers: Xn : S → R and X : S → R. The limiting random variable is defined on the same sample space. Write

S = {s1, s2, ..., si, ..., sk},  |S| = k < ∞

Here si is the ith outcome. Take a random variable Xn on the sample space: Xn(si) = xni, the ith real-number value, for i = 1, 2, ..., k and n = 1, 2, ....

After a random experiment is performed (for example, a coin is tossed), one of the si will occur (say 'H' occurs). Since this is the outcome of the experiment, the values of the Xn are then known, i.e. the xni are known.

If si ∈ S is the outcome of the experiment, the random variables are realised. The following sequence of real numbers, x1i, x2i, ..., xni, ..., is observed, and we can discuss the convergence of this sequence of real numbers.

Almost sure convergence is defined via the convergence of these sequences: once the outcome is observed, a sequence of real numbers results, and that is where almost sure convergence comes into the picture.

4.3.1 Example 1

Suppose we perform the random experiment of tossing a coin, the outcomes being heads or tails, so |S| = 2 < ∞. Consider a sequence of random variables X1, X2, ..., Xn with

Xn = { n/(n+1), if s = H;  (−1)^n, otherwise }   (4.6)

So, consider each of the outcomes H or T and determine whether the resulting sequence of real numbers converges or not.

If s = H ⇒ Xn(H) = n/(n+1).

We fixed the outcome at heads, so

X1(H) = 1/2, X2(H) = 2/3, X3(H) = 3/4, ...

We see here a sequence of real numbers, and as we increase n, the sequence converges to 1. Hence the sequence converges to 1 as n → ∞ if s = H (outcome fixed at H).

Now, if s = T, the outcome is tails ⇒ Xn(T) = (−1)^n.

So X1(T) = −1, X2(T) = +1, ..., continuing indefinitely.

We see that this sequence does not converge, since it oscillates between −1 and +1 as n becomes larger and larger. Let us define an event on S:

E∞ = {si ∈ S : lim_{n→∞} Xn(s) = 1}   (4.7)

The probability of the event will be

P[E∞] = P[{si ∈ S : lim_{n→∞} Xn(s) = 1}]   (4.8)

The sequence converges exactly when the outcome is heads (s = H). So the probability of the event E∞, i.e. the probability of heads, is 1/2, since it is a single toss of a fair coin.

NOTE: In this example the sequence Xn(s) = xns converged to 1 when s = H, with probability P(H) = 1/2, and the sequence did not converge when s = T. If the probability that the sequence Xn(s) converges to X(s) is equal to 1, then Xn converges to X almost surely (with probability 1):

Xn →(a.s.) X or Xn →(w.p.1) X as n → ∞   (4.9)

• Definition: Consider a sequence of random variables X1, X2, ..., Xn and a random variable X, and consider the event E∞ = {s ∈ S | lim_{n→∞} Xn(s) = X(s)}. Then

Xn →(a.s.) X as n → ∞ if P(E∞) = 1.   (4.10)

A small numerical sketch of Example 1 is given below.
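This plain-Python sketch evaluates the realized sequences of Example 1: fixing s = H yields a convergent real sequence, while s = T yields an oscillating one, so E∞ = {H} and P(E∞) = 1/2.

```python
def Xn(n, s):
    # realized value of X_n for the outcome s in {"H", "T"}, as in equation (4.6)
    return n / (n + 1) if s == "H" else (-1) ** n

for s in ("H", "T"):
    print(s, [Xn(n, s) for n in (1, 2, 3, 100, 101)])
# s = "H": 0.5, 0.666..., 0.75, ... -> converges to 1
# s = "T": -1, +1, -1, ...          -> oscillates, no limit
# Convergence holds only on {H}, so P(E_infinity) = 1/2 for one fair-coin toss
```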

4.3.2 Example 2

Suppose the sample space is S = [0, 1] with the uniform probability measure on S:

P([a, b]) = b − a, 0 ≤ a ≤ b ≤ 1   (4.11)

The probability of an interval is simply its length. Consider the sequence of random variables X1, X2, ..., Xn given by

Xn(s) = { 1, if 0 ≤ s < (n+1)/(2n);  0, otherwise }   (4.12)

Now define a random variable X on the sample space S:

X(s) = { 1, if 0 ≤ s < 1/2;  0, otherwise }   (4.13)

Here (n+1)/(2n) is the bigger interval endpoint and 1/2 is the smaller one. Now we have to show that

Xn →(a.s.) X as n → ∞   (4.14)

Putting in different values of n, the cutoff (n+1)/(2n) takes the values 1, 3/4, 2/3, ...: the intervals are shrinking toward [0, 1/2).

Consider the event E∞ = {si ∈ S : lim_{n→∞} Xn(s) = X(s)}. These are the outcomes where lim_{n→∞} Xn(s) = X(s).

• Case (i) (outcomes in the interval [0, 1/2)): For s ∈ [0, 1/2), X(s) = 1 (from the definition), and since (n+1)/(2n) > 1/2 for every n, we also have Xn(s) = 1 for all n. So [0, 1/2) ⊂ E∞; every such s belongs to E∞.

• Case (ii): Now for s > 1/2 ⇒ X(s) = 0. Since s > 1/2, we have 2s > 1, i.e. 2s − 1 > 0.

Xn(s) = { 1, if 0 ≤ s < (n+1)/(2n);  0, otherwise }   (4.15)

We know 2s − 1 > 0. Now Xn(s) = 0 exactly when s ≥ (n+1)/(2n), which rearranges to n(2s − 1) ≥ 1, i.e. n ≥ 1/(2s − 1). So if we choose n larger and larger, beyond 1/(2s − 1), then Xn(s) = 0. The condition lim_{n→∞} Xn(s) = X(s) = 0 therefore holds for all s > 1/2, which implies s ∈ E∞.

⇒ (1/2, 1] ⊂ E∞.

Now we can write the event as E∞ = [0, 1/2) ∪ (1/2, 1] (the single point s = 1/2 fails, since Xn(1/2) = 1 for all n while X(1/2) = 0, but that point has probability zero). Applying probability, we get

P[E∞] = P{[0, 1/2)} + P{(1/2, 1]}   (4.16)

From the axioms of probability these are disjoint events, and both probabilities are computed with the uniform measure, so each equals one half (1/2):

= 1/2 + 1/2 = 1, so P[E∞] = 1.

So from the cases above we have shown that the sequence of random variables X1, X2, ..., Xn converges almost surely to X as the sample size increases, i.e.

Xn →(a.s.) X as n → ∞

4.4 Strong Law of Large Number

Consider X1, X2, ..., Xn to be independent, identically distributed random variables with finite mean µ = E[Xi] < ∞, and let

S̄n = (X1 + X2 + ··· + Xn)/n = (1/n) Σ_{i=1}^{n} Xi   (4.17)

Then the sample mean converges almost surely to the population mean µ:

S̄n →(a.s.) µ as n → ∞   (4.18)

This is the stronger form of convergence, and it implies the weaker forms: the strong law implies the weak law. The expected value of the sample mean is E(S̄n) = µ (the limit is the degenerate random variable µ), and Var(S̄n) goes to zero as n → ∞. A single simulated path of S̄n is sketched below.
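In this Python sketch (the exponential distribution and its mean are arbitrary demo choices), one realized path of the sample mean settles at µ, which is exactly the almost-sure statement: the path itself converges.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 2.0
X = rng.exponential(mu, 1_000_000)                 # i.i.d. draws with E[X_i] = mu = 2
S_bar = np.cumsum(X) / np.arange(1, X.size + 1)    # one realized path of the sample mean

for k in (10, 1_000, 1_000_000):
    print(k, S_bar[k - 1])                         # the single path approaches mu = 2
```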

4.4.1 Example 3

Consider a sample space S = [0, 1], 0 ≤ s ≤ 1, with the uniform probability distribution. We define a sequence of random variables

Xn(s) = s + s^n and X(s) = s, ∀ s ∈ [0, 1]

For 0 ≤ s < 1 we have s^n → 0 as n → ∞. So, as n tends to infinity, the extra term s^n becomes smaller and smaller, which implies lim_{n→∞} Xn(s) = s, since s^n → 0.

So, we have to show that

lim_{n→∞} Xn(s) = X(s)   (4.19)

Now consider s = 1, which gives Xn(1) = 2 for all n while X(1) = 1. Hence lim_{n→∞} Xn(1) = 2 ≠ X(1) = 1.

So as n → ∞ the limit fails at the single point s = 1, which has probability P[{1}] = 0. But lim_{n→∞} Xn(s) = X(s) for s ∈ [0, 1). So the event becomes E∞ = {s ∈ S | lim_{n→∞} Xn(s) = X(s)} = [0, 1), and the probability of this event is given by

P(E∞) = P{[0, 1)} = 1 − 0 = 1   (4.20)

It is a uniform probability measure, which implies Xn →(a.s.) X as n → ∞.

NOTE: Almost sure convergence is similar to pointwise convergence of a sequence of functions, except that the convergence need not occur on a set D of probability zero:

D = {s ∈ S | lim_{n→∞} Xn(s) ≠ X(s)} = {1}   (4.21)

and the probability of D is zero (P(D) = 0), since the uniform distribution is a continuous distribution, so the probability of a single point is zero.

We have already seen that a random variable X is a function on the probability space, mapping the sample space (the domain of X) to the real numbers (the range of values of X). If h is another function on the real line, then the composition Z = h(X) is again a random variable; this is a composition of two functions, and Z is a function of X:

Z = h(X(ω)), ω ∈ S   (4.22)

S →(X) R →(h) R   (4.23)

We have proved the weak law of large numbers for the finite variance case, where

Var(Xi) = σ² < ∞,   (4.24)

by using Chebyshev's inequality. That means the sample mean of the random variables converges in probability, as n tends to infinity, to the degenerate random variable µ, where µ is a deterministic constant. The mean is also finite:

E(Xi) < ∞.   (4.25)

So,

E(S̄n) = µ,  Var(S̄n) = σ²/n → 0 as n → ∞.   (4.26)

NOTE: If we remove the finite variance assumption, let us check what happens. Let the Xi be independent, identically distributed random variables with a well-defined moment generating function, so that MXi(0) = 1 and M′Xi(0) = µ = E(Xi).

So, by definition,

MXi(t) = E[e^{tXi}]   (4.27)

So now the moment generating function of S̄n is

MS̄n(t) = E[e^{t S̄n}]   (4.28)

= E[e^{(t/n) Σ Xi}]   (4.29)–(4.30)

since S̄n = (1/n) Σ_{i=1}^{n} Xi.

Now,

MS̄n(t) = E[e^{(t/n)[X1 + X2 + ··· + Xn]}]   (4.31)

= E[e^{(t/n)X1} · e^{(t/n)X2} ··· e^{(t/n)Xn}]   (4.32)

Breaking down the equation: as the Xi are independent, the expectation in equation (4.32) factors out:

MS̄n(t) = E[e^{(t/n)X1}] · E[e^{(t/n)X2}] ··· E[e^{(t/n)Xn}]   (4.33)

So what we have obtained is the product of moment generating functions:

MS̄n(t) = MX1(t/n) · MX2(t/n) · MX3(t/n) ··· MXn(t/n)   (4.34)
n n n n

Since all the moment generating functions are identical (each Xi is distributed as the random variable X), we have

MS̄n(t) = [MX(t/n)]^n   (4.35)

Let us see this through the Taylor series expansion. The Taylor series of f about 0, to first order, is

f(h) = f(0) + h f′(0) + ···,   (4.36)

where h = t/n. So as we increase n, h goes to zero.

MX(t/n) = MX(0) + (t/n) M′X(0) + o(t/n)   (4.37)

Expanding to first order about t = 0, with MX(0) = 1 and M′X(0) = µ, we get

⇒ MS̄n(t) = [1 + (t/n)µ]^n   (4.38)

So, as n tends to infinity,

[1 + (t/n)µ]^n → e^{tµ} as n → ∞   (4.39)

Recall the limit from calculus:

lim_{x→0} (1 + rx)^{1/x} = lim_{n→∞} (1 + r/n)^n = e^r   (4.40)

which is what we have applied here. So,

MS̄n(t) → e^{tµ} = Mµ(t) as n → ∞   (4.41)

The moment generating function of S̄n goes to e^{tµ} as n → ∞, which is the moment generating function of the degenerate random variable µ.

So, hence, the distribution of S̄n converges weakly to the distribution of the degenerate random variable Y = µ.

Recall: we saw that convergence in distribution to a constant µ implies convergence in probability; hence we have the weak law of large numbers. So,

[Xn →(D) µ as n → ∞] ⇒ [Xn →(P) µ as n → ∞]   (4.42)

where µ is a constant.

NOTE: Here, using the moment generating function, we can prove the weak law of large numbers without the finite variance assumption. A numerical check of the limit (4.39) is sketched below.
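The Python sketch below uses the exponential distribution (an arbitrary choice with a known MGF, MX(t) = λ/(λ − t) for t < λ, so µ = 1/λ) to check that [MX(t/n)]^n approaches e^{tµ}.

```python
import numpy as np

lam = 2.0                       # X ~ Exponential(rate = 2), so mu = E[X] = 1/lam = 0.5
mu = 1 / lam
t = 0.3                         # evaluation point of the MGF (t < lam)

def M_X(t):
    return lam / (lam - t)      # MGF of the exponential distribution

for n in (1, 10, 100, 10_000):
    M_Sbar = M_X(t / n) ** n    # MGF of the sample mean, equation (4.35)
    print(n, M_Sbar)            # approaches exp(t * mu)
print(np.exp(t * mu))           # the MGF of the degenerate random variable mu
```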

4.5 Moment Generating Functions

One of the important reasons why we use moment generating functions is to determine the distribution of sums of random variables. Moment generating functions are a simple way to find moments like the mean (µ) and variance (σ²). Through the MGF we can represent a probability distribution with a simple one-variable function. Each probability distribution has a unique MGF, which makes them especially useful for problems like finding the distribution of sums of random variables. But before defining the moment generating function, let us define moments.

• Moments

The nth moment of a random variable is the expected value of its nth power.

Definition: Let X be a random variable and n ∈ N. If the expected value

µX(n) = E[X^n]   (4.43)

exists and is finite, then X is said to possess a finite nth moment, and µX(n) is called the nth moment.

For example, the first moment gives the expected value E[X], and the second central moment is the variance of X. Similar to the mean and variance, other moments give useful information about random variables.

The moment generating function MX(s) of a random variable X is defined as

MX(s) = E[e^{sX}]

We say that the MGF of X exists if there is a positive constant a such that MX(s) is finite for all s ∈ [−a, a].

One question that can be raised is: why are moment generating functions useful? There are two reasons. First, the moment generating function of a random variable X gives us all the moments of X; that is why it is called the moment generating function. Second, the MGF uniquely determines the distribution: if two random variables have the same MGF, then they must have the same distribution as well. Thus, if you find the MGF of a random variable, you have indeed determined its distribution.

• Finding Moments from the MGF:

Remember the Taylor series for e^x: for all x ∈ R, we have

e^x = 1 + x + x²/2! + x³/3! + ··· = Σ_{k=0}^{∞} x^k/k!   (4.44)

Now we can write

e^{sX} = Σ_{k=0}^{∞} (sX)^k/k! = Σ_{k=0}^{∞} X^k s^k/k!   (4.45)

So, thus the equation becomes

MX(s) = E[e^{sX}] = Σ_{k=0}^{∞} E[X^k] s^k/k!   (4.46)

We conclude that the kth moment of X is the coefficient of s^k/k! in the Taylor series of MX(s). Thus, if we have the Taylor series of MX(s), we can obtain all the moments of X.

4.5.1 Properties of MGF

Some of the properties of moment generating functions are:

• Property 1

If two random variables have the same moment generating function, then they have the same distribution. Suppose X and Y are two random variables with the same moment generating function MX(s); then X and Y are distributed in the same way (same CDF, etc.).

So we can say that the moment generating function determines the distribution of a random variable. This can come in handy when dealing with an unknown random variable.

• Property 2

When dealing with sums of random variables, moment generating functions make things easier to handle. If there are two independent random variables X and Y and we want the moment generating function of X + Y, we multiply the separate, individual moment generating functions of X and Y.

So, if X and Y are independent, X has moment generating function MX(s) and Y has moment generating function MY(s), then the moment generating function of X + Y is just MX(s)MY(s), the product of the two moment generating functions.

Example 4

If Y ∼ Uniform(0, 1), find E[Y^k] using MY(s).

The MGF of the uniform distribution is

MY(s) = (e^s − 1)/s   (4.47)

Simplifying the above function into parts, we get

= (1/s)(Σ_{k=0}^{∞} s^k/k! − 1)   (4.48)

= (1/s) Σ_{k=1}^{∞} s^k/k!   (4.49)

= Σ_{k=1}^{∞} s^{k−1}/k!   (4.50)

= Σ_{k=0}^{∞} (1/(k+1)) s^k/k!   (4.51)

Thus the coefficient of s^k/k! in the Taylor series for MY(s) is 1/(k+1), so

E[Y^k] = 1/(k+1)   (4.52)

A symbolic verification of this result is sketched below.
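This Python sketch (SymPy assumed available) checks the result: the kth moment is the kth derivative of MY(s) at s = 0, which equals 1/(k+1).

```python
import sympy as sp

s = sp.symbols('s')
M = (sp.exp(s) - 1) / s                 # MGF of Uniform(0, 1), equation (4.47)

for k in range(1, 5):
    # k-th moment = k-th derivative of the MGF at s = 0
    # (take a limit, since s = 0 is a removable singularity of M)
    moment = sp.limit(sp.diff(M, s, k), s, 0)
    print(k, moment)                    # 1/2, 1/3, 1/4, 1/5 = 1/(k+1)
```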

4.6 Limit Theorems

Limit theorems are very important and extremely useful in applied statistics. The first limit theorem is the law of large numbers.

4.6.1 Law of Large Number

Limit theorems help us deal with random variables as we take limits. The first is the law of large numbers, which essentially states that the sample mean will eventually approach the population mean of a random variable as we increase the number of draws to infinity.

• Definition:

Consider i.i.d. random variables X1, X2, ..., Xn, and let the mean of each random variable be µ. We define the sample mean as

X̄n = (X1 + X2 + ··· + Xn)/n   (4.53)

Note that the sample mean X̄n is itself random. It makes sense that this sample mean fluctuates, because the components that make it up (the X terms) are themselves random.

Figure 4.4: Law of Large Number

Based on this concept we have two different types of law of large numbers:

Strong Law of Large Number

The strong law of large numbers states that as n tends to ∞, the sample mean X̄n goes to the population mean µ with probability 1. This is a formal way of saying that the sample mean will definitely approach the true mean. The strong law of large numbers is based on almost sure convergence.

The strong law of large numbers and almost sure convergence were discussed thoroughly in the previous sections of this lecture.

X̄n →(a.s.) µ as n → ∞   (4.54)

Weak Law of Large Number

As the sample size n increases/grows to infinity (∞), the probability that the sample mean differs from the population mean by more than some small amount ε goes to zero. In simple words, the weak law of large numbers, also known as Bernoulli's theorem, states that if you have a sample of independent and identically distributed random variables, then as the sample size grows larger, the sample mean will tend toward the population mean.

• Definition

Let X1, X2, ..., Xn be i.i.d. random variables with a finite expected value E[Xi] = µ < ∞. Then for any ε > 0,

lim_{n→∞} P[|X̄n − µ| ≥ ε] = 0   (4.55)

4.6.2 Central Limit Theorem

The central limit theorem is the second limit theorem. It is equally important when dealing with random variables, and it also concerns the long-run behavior of the sample mean as n grows.

The central limit theorem states that if we choose a sufficiently large random sample from a population with mean µ and variance σ², then the sample mean will be approximately normally distributed with mean µ.

This is an extremely powerful result, because it holds no matter what the distribution of the underlying random variables (i.e., the X's) is.

X̄n →(D) N(µ, σ²/n)   (4.56)

where →(D) means 'converges in distribution'; it is implied here that this convergence takes place as n, the number of underlying random variables, grows.

• NOTE:

The LLN states that the mean of a large number of i.i.d. random variables converges to the expected value. And the CLT states that, under similar conditions, the standardized sum of a large sample of random variables has an approximately normal distribution:

Zn = (X̄n − µ)/(σ/√n)   (4.57)

The central limit theorem states that the CDF of Zn converges to the standard normal CDF. A simulation of this standardization is sketched below.
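In this Python/NumPy sketch (the uniform distribution and the sample sizes are arbitrary demo choices), although the underlying Xi are uniform, Zn behaves like a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.5, np.sqrt(1 / 12)       # mean and std dev of Uniform(0, 1)
n, trials = 1_000, 50_000

samples = rng.random((trials, n))      # underlying X_i are Uniform(0, 1), not normal
Zn = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print(Zn.mean(), Zn.std())             # close to 0 and 1
print(np.mean(Zn <= 1.96))             # close to 0.975, the standard normal CDF at 1.96
```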
Chapter 5

Lecture 9 - Akash Gupta


(2019DMB02)

5.1 Fundamental guiding principles

Before delving into the rather involved concepts of Stochastic Processes, it is ab-
solutely essential for us to have a firm grasp on the fundamentals of Probability
theory, Set theory, Sequences and Limit theorems. In this prelude of sorts, we will
revisit some fundamental principles regarding some of those concepts and then
move on to discussing stochastic processes.

5.1.1 Sample space

Suppose a random experiment is conducted. Then the set of all possible outcomes of the experiment is called the sample space, denoted by Ω or S. Suppose our experiment is about tossing two coins. Then the sample space could be given by:

S = {(H, H), (H, T), (T, H), (T, T)}   (5.1)

Now note that any subset E of the sample space is called an event. This is typically a set that contains various outcomes of the experiment, and we say that if a particular outcome contained within E occurs, then event E has occurred. For example, if we define our event as: E is the event that heads appears on the first coin toss, then our associated set for this event would be:

E = {(H, H), (H, T)}   (5.2)

5.1.2 Working with sets: part 1

For two events E and F belonging to some sample space, we say that the union of
those events is the event that consists of outcomes that are contained in either E


or F or both. Consider E to be defined as in the previous section, and further define F as {(T, H)}. Then the union set would be:

E ∪ F = {(H, H), (H, T), (T, H)}   (5.3)

Now if we consider an event that contains all the outcomes contained in both E and F, then that event is the intersection of the two events, shown as follows. Assume that E = {(H, H), (H, T), (T, H)} and F = {(H, T), (T, H), (T, T)}.

E ∩ F = EF = {(H, T), (T, H)}   (5.4)

Now let us consider another example of two events obtained from rolling two dice, where the associated outcome tuples give the results of the two rolls. Suppose E = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} is the event that the sum of the two dice is 7, and let F = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} be the event that the sum of the two dice is 6. Look carefully and you might notice that the two events have nothing in common: there are no outcomes contained in both sets, and hence the joint event simply cannot occur. Such an event is known as a null event, denoted EF = φ. In this case we say that events E and F are mutually exclusive.

5.1.3 Working with sets: part 2

Just as we defined unions and intersections of two sets in the above section, we can define unions and intersections over n events. Suppose we have many events given by (E1, E2, ···); then their union is typically given by:

∪_{n=1}^{∞} En   (5.5)

In a similar manner, the intersection of many events can be defined as follows:

∩_{n=1}^{∞} En   (5.6)

Now if we want to define an event that contains all those outcomes in the sample space S that are not in event E, then such an event is known as the complement of E, denoted E^c. Note that the complement of the sample space is the null set (S^c = φ). Further, for any two events E and F, if all the outcomes in E are also present in F, then we say that E is a subset of F and, consequently, F is a superset of E. This is denoted as:

E ⊂ F   (5.7)

Note that the condition for equality of two sets is that they are both subsets of each other. That is:

E = F ⇐⇒ E ⊂ F and F ⊂ E   (5.8)

Some of these concepts seem quite intuitive when viewed in the form of Venn diagrams. Basic examples are presented below.

[Venn diagrams of two sets A and B]

5.1.4 Basic laws governing set operations

Some of the basic algebraic rules that govern set operations are mentioned below:

• Commutative law: E ∪ F = F ∪ E and EF = FE.

• Associative law: (E ∪ F) ∪ G = E ∪ (F ∪ G) and (EF)G = E(FG).

• Distributive law: (E ∪ F)G = EG ∪ FG and EF ∪ G = (E ∪ G)(F ∪ G).

• De Morgan's laws: These laws give us meaningful relations between unions, intersections and complements of sets.

(∪_{i=1}^{n} Ei)^c = ∩_{i=1}^{n} Ei^c   (5.9)

(∩_{i=1}^{n} Ei)^c = ∪_{i=1}^{n} Ei^c   (5.10)

5.1.5 Axioms of probability

Consider an experiment with sample space S. For each event E of the sample space we assume that a function P(E) is defined and satisfies the following three axioms:

• 0 ≤ P(E) ≤ 1

• P(S) = 1.

• For a sequence of mutually exclusive events E1, E2, ··· (such that Ei Ej = φ for i ≠ j) we have:

P(∪_{i=1}^{∞} Ei) = Σ_{i=1}^{∞} P(Ei)   (5.11)

5.1.6 Probability of continuous sets

We consider a sequence of events {En} to be an increasing sequence if the following is true:

E1 ⊂ E2 ⊂ ··· ⊂ En ⊂ En+1   (5.12)

Similarly, we define a sequence of events to be decreasing if the following holds:

E1 ⊃ E2 ⊃ ··· ⊃ En ⊃ En+1   (5.13)

Now for an increasing sequence of events we can essentially define a limiting event in the form of:

lim_{n→∞} En = ∪_{i=1}^{∞} Ei   (5.14)

Similarly, we can define the limiting event for a decreasing sequence of events as:

lim_{n→∞} En = ∩_{i=1}^{∞} Ei   (5.15)

Additionally, we note an important proposition that lays down the probability of an increasing or decreasing sequence of events:

lim_{n→∞} P(En) = P(lim_{n→∞} En)   (5.16)

5.2 Countable and Uncountable sets

We will first illustrate the inherent confusion that arises when computing probabilities for a uniform random variable X ∼ U(0, 1). We know that for each X = x, P(X = x) = 0. But at the same time it is also true that P(0 ≤ X ≤ 1) = 1. Now this is where the confusion comes in:

P(0 ≤ X ≤ 1) = Σ_{x∈[0,1]} P(X = x)   (5.17)

Going by the initial statement, if we add up the individual terms P(X = x) = 0, we are bound to get 0 in the overall sum instead of 1. This confusion is cleared by noting that summation over an interval of real numbers is not sensible: we would fall into a trap while deciding which values of x to use in our sum. Suppose we take the first value as 0; then what would be the second value in the sum?

With this in mind, we define infinite sums as a limiting case of finite sums, denoted as:

Σ_{i=1}^{∞} xi = lim_{n→∞} Σ_{i=1}^{n} xi   (5.18)

Therefore we further state that, in order to even have an infinite sum, it has to be possible to arrange the terms in a sequence. Note that if an infinite set of terms can be arranged in a sequence, it is called countable, and otherwise it is uncountable. The positive rationals are said to be countable since we can list them in an order as ratios of integers:

1/2, 2/3, 3/4, ···   (5.19)

However, we must note that the real numbers between 0 and 1 are not countable. Suppose we try to arrange these real numbers into a sequence x1, x2, ···, and choose to express them as:

xj = Σ_{i=1}^{∞} dij 10^{−i}   (5.20)

where dij ∈ (0, 1, 2, ···, 9) is the ith digit after the decimal place of the jth number in the sequence. What is happening here is that we assume any given x can be written as a long decimal expansion; the ith digit after the decimal point may take any value between 0 and 9, as embodied by the set dij. For instance, x1 is written as Σ_{i=1}^{∞} di1 10^{−i}, x2 as Σ_{i=1}^{∞} di2 10^{−i}, and so on.

We assume that with the above sequence we can essentially list out the entire set of reals between 0 and 1. Now consider an indicator random variable such that I(A) = 1 if condition A is true and I(A) = 0 if condition A is false. Then we can define a new number by:

y = Σ_{i=1}^{∞} (1 + I{dii = 1}) 10^{−i}   (5.21)

Now look closely at this number: if the diagonal element dii of the array of xj expansions equals 1, then the ith digit of y is 2, and otherwise it is 1; either way the ith digit of y differs from dii. So in the decimal expansion of y, the first digit differs from the d11 element of x1, the second digit differs from the d22 element of x2, and so on. With this, we have essentially proven that every xj in the sequence differs from the newly defined number y in at least one digit. Therefore, while y does in fact belong to the set (0, 1), it is not equal to any of the xj in the sequence. Hence the elements between 0 and 1 cannot be arranged or explicitly listed out.

5.3 Concepts of Lecture 8

An infimum of a subset S of a partially ordered set T, denoted by inf S, is the greatest element in T that is less than or equal to all elements of S: it is the greatest lower bound. The supremum of a subset S of a partially ordered set T, denoted by sup S, is the least element in T that is greater than or equal to all elements of S: it is the least upper bound.

5.3.1 Revisiting the background of discussion

In the earlier lecture, we saw that if we have an infinite sequence of 0s and 1s, and our event of interest is a particular ek in that sequence turning out to be 1, then there may be infinitely many such events: no matter how far out we choose the cut-off point n in such a sequence, we would still end up with infinitely many occurrences of our event of interest. We then laid down the condition that for every cut-off point n there exists a k ≥ n such that ek = 1, which is written as:

∀ n ∈ N ∃ k ≥ n : ek = 1   (5.22)

Then we translated this broad condition into events: basically, we are interested in the event that Ek happens infinitely many times. Before moving further, note some basic correspondences:

∀ n ∈ N → ∩_{n=1}^{∞} (intersections)   (5.23)

∃ k ≥ n → ∪_{k=n}^{∞} (unions)   (5.24)

With this we can write our condition of Ek happening infinitely many times as:

∩_{n=1}^{∞} ∪_{k=n}^{∞} Ek   (5.25)

After this we saw that if the series of event probabilities is divergent, then (under the assumption that the Ek events are disjoint) as the limit of n tends to infinity, the tail sum (starting at the cut-off point n) also tends to infinity:

lim_{n→∞} Σ_{k=n}^{∞} P(Ek) → ∞   (5.26)

And conversely, if the series is convergent, then the tail of the series goes to 0 as n tends to infinity:

lim_{n→∞} Σ_{k=n}^{∞} P(Ek) → 0   (5.27)

Finally, with these fundamental properties laid out, we formulated the Borel–Cantelli Lemma, which states:

if Σ_{n=1}^{∞} P(En) < ∞ → P(∩_{n=1}^{∞} ∪_{k=n}^{∞} Ek) = 0   (5.28)

5.4 Lim Sup and Lim Inf

For an event An that occurs infinitely often, we define the Lim Sup as follows:

lim sup_{n→∞} An = lim sup An = ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ak   (5.29)

For an event An that occurs all but finitely often, we define the Lim Inf as follows:

lim inf_{n→∞} An = lim inf An = ∪_{n=1}^{∞} ∩_{k=n}^{∞} Ak   (5.30)

Now let us consider Ω to be our sample space, and in that we consider a sample point or an outcome ω ∈ Ω. Then the following condition can be defined:

ω ∈ [lim sup_{n→∞} An] ⇐⇒ ω lies in infinitely many of the individual sets An

We can define a similar condition for the infimum as well:

ω ∈ [lim inf_{n→∞} An] ⇐⇒ ω lies in all but a finite number of the sets

5.4.1 An illustrative example: 1

Consider a set X defined as X = {0, 1}. Now we consider a sequence of subsets of this main set as follows:

{Xn} = {(X1 = {0}), (X2 = {1}), (X3 = {0}), (X4 = {1}), ···}   (5.31)

From this sequence of subsets we can clearly see that the odd-indexed elements are all {0}, whereas the even-indexed elements are all {1}. With these we define two new sequences that separately contain the odd- and even-indexed elements:

{Yn} = {{0}, {0}, ···}   (5.32)

{Zn} = {{1}, {1}, ···}   (5.33)
Now consider the original series {Xn}. In order to evaluate its Lim Sup we need to first compute the successive unions of all the subsets in the series, of the form {0} ∪ {1}. Note that in every union iteration, we get the same set {0, 1}. Now the LimSup follows as:

lim sup Xn = ∩_{n=1}^{∞} [∪_{k=n}^{∞} Xk] = ∩_{n=1}^{∞} {0, 1} = {0, 1}   (5.34)

Now similarly, if we want to compute the Lim Inf of this series, we need to first take successive intersections of all the events, which in this case are found iteratively as follows: {0} ∩ {1} = φ, φ ∩ {0} = φ, and so on; in the end we get only the null set φ. With this the LimInf is:

lim inf Xn = ∪_{n=1}^{∞} [∩_{k=n}^{∞} Xk] = ∪_{n=1}^{∞} [φ] = φ   (5.35)

We notice from equations (5.34) and (5.35) that the LimSup and LimInf of this particular sequence are not equal; when this is the case we say that the limit of the sequence does not exist. Now recall, from equation (5.32), the sequence Yn. We will now compute the LimSup and LimInf for this series and check whether its limit exists, which is essentially the same condition as the LimSup and LimInf being equal.

lim sup Yn = ∩_{n=1}^{∞} [∪_{k=n}^{∞} Yk] = ∩_{n=1}^{∞} {0} = {0}   (5.36)

lim inf Yn = ∪_{n=1}^{∞} [∩_{k=n}^{∞} Yk] = ∪_{n=1}^{∞} {0} = {0}   (5.37)

Hence from the above two equations we see clearly that the limit of the series exists and is given by:

lim sup Yn = lim inf Yn = lim Yn = {0}   (5.38)

A small computational sketch of these set limits is given below.
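This plain-Python sketch computes the set-valued limsup and liminf on a finite truncation of the infinite sequence (the truncation is the approximation being made):

```python
from functools import reduce

def lim_sup(sets):
    # intersection over n of the unions of the tail {A_k : k >= n}, on a finite truncation
    tails = [reduce(set.union, sets[n:]) for n in range(len(sets) - 1)]
    return reduce(set.intersection, tails)

def lim_inf(sets):
    # union over n of the intersections of the tail {A_k : k >= n}, on a finite truncation
    tails = [reduce(set.intersection, sets[n:]) for n in range(len(sets) - 1)]
    return reduce(set.union, tails)

X = [{0} if n % 2 == 1 else {1} for n in range(1, 201)]   # X1={0}, X2={1}, X3={0}, ...
print(lim_sup(X), lim_inf(X))    # {0, 1} and set(): the limit does not exist

Y = [{0}] * 200                  # the constant subsequence Y_n = {0}
print(lim_sup(Y), lim_inf(Y))    # both {0}: the limit exists and equals {0}
```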

5.4.2 An illustrative example: 2

We will now show, through an example, that the limiting behaviour of a sequence does not depend on transients; rather it depends on the long-term pattern shown by the tail of the sequence. Transients are events that occur finitely often, whereas we are more interested in the events that happen infinitely often. The idea is that these finitely occurring transients have no effect on the long-term limiting behaviour of the series. Let us suppose a sequence given by:

{Bn} = { B1 = {50}, B2 = {20}, B3 = {35}, B4 = {−15} (transients), {0}, {1}, {0}, {1}, ··· (tail pattern) }   (5.39)

Now for each value of the cut-off n, specifying the start of the tail, call the union of all events after the cut-off point Dn. With this the LimSup can be specified as:

lim sup Bn = ∩_{n=1}^{∞} [∪_{k=n}^{∞} Bk] = D1 ∩ D2 ∩ D3 ∩ ··· = {0, 1}   (5.40)

For all values of the cut-off point, we see that the event {0, 1} happens infinitely often, since the multiple unions for all n cut-off points resolve to that set. To explicitly break down this process, we list the contents of those Dn sets and how they evolve:

D1 = ∪_{k=1}^{∞} Bk = {50, 20, 35, −15, 0, 1}

D2 = {20, 35, −15, 0, 1}

D3 = {35, −15, 0, 1}

D4 = {−15, 0, 1}

D5 = {0, 1}

D6 = {0, 1}

Now we will compute the LimInf of this series. Note that the successive intersections of the sets in this series are simply the null set. Let us see the workings of this:

lim inf Bn = ∪_{n=1}^{∞} [∩_{k=n}^{∞} Bk] = E1 ∪ E2 ∪ E3 ∪ ··· = φ ∪ φ ∪ ··· = φ   (5.41)

We note that each En is the successive intersection of the events Bk for k ≥ n. These successive intersections boil down to the null set, since successive elements in the tail of the series keep alternating:

E1 = B1 ∩ B2 ∩ B3 ∩ ··· = φ   (5.42)

We now see that the limiting supremum and infimum are not at all affected by the transient, finitely occurring events in the series. Hence, with this example, it is shown that transients do not affect the LimSup and LimInf, and that the limiting behaviour of the sequence is defined by its tail.

5.4.3 Reiterating the Definitions

To reiterate some of the subtleties in the concepts relating to LimSup and LimInf, here we attempt to further strengthen the conceptual understanding by restating key observations and definitions. First, suppose that {Xn} is a sequence of subsets of some large set X. The LimSup of Xn is the set consisting of elements of X which belong to Xn for infinitely many n, where the series is countably infinite. To state this even more precisely we can say:

x ∈ lim sup Xn iff there exists a subsequence {Xnk} of {Xn} such that x ∈ Xnk ∀k

A similar definition for the LimInf can also be stated. The LimInf of Xn is the set consisting of elements of X that belong to Xn for all except finitely many n. We can state this more precisely as follows:

x ∈ lim inf Xn iff there exists some m > 0 such that x ∈ Xn for all n > m

Note that the sequence of infimums of a series is an increasing sequence. We can show this as follows:

In = inf_{m≥n} Xm = ∩_{m=n}^{∞} Xm = Xn ∩ Xn+1 ∩ Xn+2 ∩ ···   (5.43)

We see that the sequence {In} is in fact an increasing sequence, with In ⊂ In+1. Why is this an increasing sequence? Because in each successive iteration of forming In we take fewer intersections among the Xn events, and fewer intersections naturally correspond to bigger sets. Hence the (n+1)th set will be bigger than the nth set. Now we say that the least upper bound of this sequence of infimums (In) is the LimInf, given by:

lim inf_{n→∞} Xn = sup{inf{Xm | m ∈ (n, n+1, ···)}} = ∪_{n=1}^{∞} [∩_{m=n}^{∞} Xm]   (5.44)

Similarly, the LimSup is the greatest lower bound of a decreasing sequence of supremums (unions of sets). Why are successive unions of sets a decreasing sequence? Because as we increase the cut-off point, we take fewer unions each time, which naturally implies a smaller set. This condition is represented as

Jn = sup{Xm | m ∈ (n, n+1, ···)} = ∪_{m=n}^{∞} Xm = Xn ∪ Xn+1 ∪ ···   (5.45)
Chapter 6

Lecture 12: 2019DMB09 - Sri Rajitha

6.1 Recap

We have discussed what random/stochastic processes are. A stochastic process is a process in which some value changes randomly over time; at its simplest, it involves a variable changing at a random rate through time. There are various types of stochastic processes, mainly classified into discrete-time stochastic processes and continuous-time stochastic processes.

6.2 Introduction

For FT[y(t)] to exist, y(t) must be absolutely integrable:

⇒ ∫_{−∞}^{∞} |y(t)| dt < ∞   (6.1)

Recall: for a weakly stationary stochastic process X(t),

|Rxx(τ)| ≤ Rxx(0) = E[x²(t)]   (6.2)

where the autocorrelation function Rxx(τ) is bounded. Thus, instead of working directly with X(t), we deal with the autocorrelation function Rxx(τ), which is bounded and hence absolutely integrable.

Consider a weakly stationary stochastic process, WSSP X(t). (Wiener–Khinchine theorem: this theorem plays a central role in stochastic series analysis, since it relates the power spectral density to the autocorrelation function, ACF.)

FT[Rxx(τ)] = Sxx(ω) = ∫_{−∞}^{∞} Rxx(τ) e^{−jωτ} dτ   (6.3)
Rxx(τ) = FT^{−1}[Sxx(ω)] = (1/2π) ∫_{−∞}^{∞} Sxx(ω) e^{jωτ} dω   (6.4)

where Rxx(τ) is the autocorrelation function, Sxx(ω) is the spectral density, FT[Rxx(τ)] is the Fourier transform and FT^{−1}[Sxx(ω)] is the inverse Fourier transform.

E(x²(t)) = Rxx(0) = (1/2π) ∫_{−∞}^{∞} Sxx(ω) dω   (6.5)

where E(x²(t)), the mean square value of the stochastic process X(t), is the average power of X(t).

6.3 Properties of the Power Spectral Density Sxx(ω):

The power spectral density (PSD) is the frequency response of a random or periodic signal. It tells us where the average power is distributed as a function of frequency. The PSD is deterministic and, for certain types of random signals, is independent of time. This is useful because the Fourier transform of a random time signal is itself random, and therefore of little use for calculating transfer relationships (i.e., finding the output of a filter when the input is random). The signal has to be stationary, which means that the statistics (mean, variance, covariance) do not change as a function of time.

1) Sxx(ω) is a non-negative function of ω:

Sxx(ω) ≥ 0   (6.6)

2) Sxx(ω) is an even function:

Sxx(−ω) = Sxx(ω)   (6.7)

Consider Sxx(−ω) = ∫_{τ=−∞}^{+∞} Rxx(τ) e^{−j(−ω)τ} dτ = ∫_{−∞}^{∞} Rxx(τ) e^{+jωτ} dτ.

Let us take r = −τ, so dτ = −dr and the limits flip:

Sxx(−ω) = ∫_{r=−∞}^{∞} Rxx(−r) e^{−jωr} dr = ∫_{r=−∞}^{∞} Rxx(r) e^{−jωr} dr = Sxx(ω)

using that the autocorrelation function is even, Rxx(−r) = Rxx(r). Hence Sxx is even.

3) The power spectral density Sxx(ω) is a real function when X(t) is real:

Sxx(ω) = ∫_{−∞}^{+∞} Rxx(τ) e^{−jωτ} dτ = ∫_{−∞}^{+∞} Rxx(τ)[cos(ωτ) − j sin(ωτ)] dτ

Sxx(ω) = ∫_{−∞}^{+∞} Rxx(τ) cos(ωτ) dτ − j ∫_{−∞}^{+∞} Rxx(τ) sin(ωτ) dτ

Since Rxx(τ) is even and sin(ωτ) is odd in τ, the imaginary integral vanishes:

⇒ Sxx(ω) = ∫_{−∞}^{+∞} Rxx(τ) cos(ωτ) dτ = 2 ∫_{0}^{+∞} Rxx(τ) cos(ωτ) dτ.

4) The average power of X(t) is E(x²(t)):

E(x²(t)) = Rxx(0) = (1/2π) ∫_{−∞}^{∞} Sxx(ω) dω.   (6.8)

5) Sxx(ω) is a real-valued function:

⇒ S*xx(ω) = Sxx(ω)   (6.9)

where S*xx(ω) is the complex conjugate of Sxx(ω).

Recall: if a = x + jy then a* = x − jy.

⇒ If a = a* ⇒ x + jy = x − jy ⇒ y = −y ⇒ 2y = 0 ⇒ y = 0

⇒ a = x, which is a real-valued quantity.

6) If ∫_{−∞}^{+∞} |Rxx(τ)| dτ < ∞, then Sxx(ω) is a continuous function of ω:

Sxx(ω) = ∫_{−∞}^{+∞} Rxx(τ) e^{−jωτ} dτ = FT[Rxx(τ)]   (6.10)

Note: Since a power spectral density Sxx(ω) must be an even, non-negative, real function, not every function Rxx(τ) can be the autocorrelation function of a WSSP X(t).

For example, e^{−ατ}, τe^{−ατ} or sin(ω0τ) cannot be autocorrelation functions of a WSSP, since the Fourier transform of each of these functions fails these properties (it is complex).

Definition: For two stochastic processes X(t) and Y(t) that are jointly WSS, FT[Rxy(τ)] is the cross power spectral density Sxy(ω). Then

Sxy(ω) = FT[Rxy(τ)] = ∫_{−∞}^{∞} Rxy(τ) e^{−jωτ} dτ   (6.11)

where Sxy(ω) is in general a complex function, even when X(t) and Y(t) are real stochastic processes.

Recall: Rxy(τ) = Ryx(−τ), from property (1):

⇒ Ryx(τ) = Rxy(−τ)   (6.12)


Figure 12.1: the time instants t − τ, t and t + τ on the time axis

⇒ Syx(ω) = Sxy(−ω) = S*xy(ω)   (6.13)

By definition, Rxy(τ) = E[X(t)Y(t + τ)] ⇒ Ryx(τ) = E[Y(t)X(t + τ)]

⇒ Ryx(−τ) = E[Y(t)X(t − τ)] = E[X(t − τ)Y(t)] = Rxy(τ), shifting the time origin by τ.

Then,

Sxy(ω) = ∫_{−∞}^{∞} Rxy(τ) e^{−jωτ} dτ = FT[Rxy(τ)]

S*xy(ω) = ∫_{−∞}^{∞} Rxy(τ) e^{jωτ} dτ = ∫_{−∞}^{∞} Ryx(−τ) e^{jωτ} dτ

Substituting τ1 = −τ (so dτ = −dτ1 and the limits flip),

S*xy(ω) = ∫_{−∞}^{∞} Ryx(τ1) e^{−jωτ1} dτ1 = FT[Ryx(τ)] = Syx(ω).

S*xy(ω) = Syx(ω)   (6.14)

Example: Find the autocorrelation function Rxx(τ) of the stochastic process with power spectral density

Sxx(ω) = { S0, |ω| < ω0;  0, otherwise }

Sol: |ω| < ω0 means −ω0 < ω < ω0.

Figure 12.2: Sxx(ω) equals S0 on the band (−ω0, +ω0) and is zero elsewhere

e^{jx} = cos x + j sin x   (6.15)

e^{−jx} = cos x − j sin x   (6.16)

Adding equations (6.15) and (6.16),

⇒ cos x = (e^{jx} + e^{−jx})/2   (6.17)

Subtracting equation (6.16) from (6.15),

⇒ sin x = (e^{jx} − e^{−jx})/(2j)   (6.18)
Rxx(τ) = FT^{−1}[Sxx(ω)] = (1/2π) ∫_{−∞}^{∞} Sxx(ω) e^{jωτ} dω   (6.19)

Rxx(τ) = (1/2π) ∫_{−ω0}^{ω0} S0 e^{jωτ} dω = (S0/2π) ∫_{−ω0}^{ω0} e^{jωτ} dω

⇒ Rxx(τ) = (S0/2πjτ) [e^{jωτ}]_{−ω0}^{ω0} = (S0/2πjτ) [e^{jω0τ} − e^{−jω0τ}] = (S0/πτ) · (e^{jω0τ} − e^{−jω0τ})/(2j)

⇒ Rxx(τ) = (S0/πτ) sin(ω0τ)   (6.20)

A numerical check of this inverse transform is sketched below.
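In this Python/NumPy sketch (S0, ω0 and τ are arbitrary demo values), numerically integrating (6.19) over the band reproduces the closed form (6.20).

```python
import numpy as np

S0, w0, tau = 1.0, 2.0, 0.7

# numerically evaluate Rxx(tau) = (1/2pi) * integral of S0 e^{j w tau} over (-w0, w0)
w = np.linspace(-w0, w0, 200_001)
R_num = np.trapz(S0 * np.exp(1j * w * tau), w) / (2 * np.pi)

R_closed = S0 * np.sin(w0 * tau) / (np.pi * tau)   # equation (6.20)
print(R_num.real, R_closed)                        # agree; the imaginary part is ~0
```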

6.4 White Noise (WN):

White noise is a random signal having equal intensity at all frequencies, giving it a constant power spectral density. Even a binary signal, which can only take on the values 1 or 0, will be white if the sequence is statistically uncorrelated. Noise having a continuous distribution, such as a normal distribution, can of course also be white.

In statistics and econometrics one often assumes that an observed series of data values is the sum of values generated by a deterministic linear process, depending on certain independent/explanatory variables, and a series of random noise values. If there is non-zero correlation between the noise values underlying different observations, then the estimated model parameters are still unbiased, but estimates of their uncertainties, such as confidence intervals, will be biased.

In time series analysis there are often no explanatory variables other than the past values of the variable being modeled, i.e. the dependent variable. In this case the noise process is often modeled as a moving average process, in which the current value of the dependent variable depends on current and past values of a sequential white noise process. Here, N(t) denotes white noise.

Definition: White noise is a random function N(t) whose power spectral density Snn(ω) is constant for all frequencies ω:

⇒ Snn(ω) = N0/2 is constant ∀ω   (6.21)

where N0 is a real positive constant.

The autocorrelation of white noise is Rnn(τ) = FT^{−1}[Snn(ω)] = FT^{−1}(N0/2):

Rnn(τ) = FT^{−1}(N0/2) = (N0/2) FT^{−1}[1] = (N0/2) δ(τ)

since FT[δ(τ)] = ∫_{−∞}^{∞} δ(τ) e^{−jωτ} dτ = e^{−jω(0)} = e^0 = 1, i.e. FT^{−1}[1] = δ(τ) ⇒ FT[δ(τ)] = 1, where

δ(τ) = { ∞ if τ = 0;  0 if τ ≠ 0 }   (6.22)

Figure 12.3: the power spectral density SNN(ω) = N0/2 is flat in ω, while the autocorrelation function RNN(τ) = (N0/2)δ(τ) is an impulse at τ = 0

Example: Let Y(t) = X(t) + N(t) be a weakly stationary process, with X(t) the actual speed and N(t) a zero-mean noise process with variance σN² and µN = 0. Find the power spectral density of Y(t), i.e. Syy(ω).

Sol:

Syy(ω) = FT[Ryy(τ)] = ∫_{−∞}^{∞} Ryy(τ) e^{−jωτ} dτ

where Ryy(τ) is the autocorrelation function of the WSSP Y(t):

Ryy(τ) = E[Y(t)Y(t + τ)], with Y(t) = X(t) + N(t)

where X(t) and N(t) are independent stochastic processes,

µN = 0 and Var(N(t)) = σN²   (6.23)

Ryy(τ) = E[Y(t)Y(t + τ)] = E[(X(t) + N(t))(X(t + τ) + N(t + τ))]

Ryy(τ) = E[X(t)X(t + τ) + X(t)N(t + τ) + N(t)X(t + τ) + N(t)N(t + τ)]

where E[X(t)N(t + τ)] = 0 and E[N(t)X(t + τ)] = 0, by independence and zero-mean noise. Hence

Ryy(τ) = E[X(t)X(t + τ)] + E[N(t)N(t + τ)]

Ryy(τ) = Rxx(τ) + Rnn(τ) = Rxx(τ) + σN² δ(τ)


 
Recall: Rnn(0) = E[N²(t)]. Since

σN² = Var[N(t)] = E[N²(t)] − (E[N(t)])² = E[N²(t)] − µN²

and µN = 0,

σN² = E[N²(t)] = Rnn(0)

⇒ Rnn(τ) = σN² δ(τ), where δ(τ) = { ∞ if τ = 0;  0 if τ ≠ 0 }

Ryy(τ) = Rxx(τ) + σN² δ(τ)

Now, solving for the power spectral density of Y(t),

Syy(ω) = FT[Ryy(τ)] = FT[Rxx(τ)] + σN² FT[δ(τ)]

Since FT[δ(τ)] = 1,

Syy(ω) = Sxx(ω) + σN²

A simulated illustration of this signal-plus-white-noise spectrum is sketched below.
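In this Python sketch (NumPy/SciPy assumed available; the 5 Hz sinusoid, sampling rate and noise level are arbitrary demo choices), the estimated PSD of Y shows the spectral content of X sitting on the flat floor contributed by the white noise.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(5)
fs, T, sigma_N = 1000, 200, 0.8                 # sampling rate, duration, noise std dev
t = np.arange(0, T, 1 / fs)

x = np.sin(2 * np.pi * 5 * t)                   # stand-in for the signal X(t)
y = x + sigma_N * rng.standard_normal(t.size)   # Y(t) = X(t) + N(t), zero-mean white noise

f, Syy = welch(y, fs=fs, nperseg=4096)          # Welch estimate of the PSD of Y
print(f[np.argmax(Syy)])                        # peak near 5 Hz (the signal part Sxx)
# away from the peak, Syy is roughly flat: the white-noise floor from sigma_N^2
```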

So far we have discussed continuous-time stochastic processes (CTSP). Let us look at discrete-time stochastic processes in the next section.

6.5 Discrete Time Stochastic Processes

A discrete-time stochastic process is essentially a random vector with components indexed by time, and a time series observed in an economic application is one realization of this random vector. We exclusively consider processes in discrete time, i.e. processes observed at equally spaced points of time t = 0, 1, 2, .... In other words, a discrete process is considered to be an approximation of its continuous counterpart.

When interpreted as time, if the index set of a stochastic process has a finite or countable number of elements, such as a finite set of numbers, the set of integers, or the natural numbers, then the stochastic process is said to be in discrete time.

Let $X_n = X[n]$, $n = 0, 1, 2, \ldots$ be a random sequence.

A DTSP $X_n = X[n]$ is obtained by sampling a continuous time stochastic process. Let the sampling interval be $T_s$.

Figure 12.4: The discrete indices $n = 0, 1, 2, \ldots$ correspond to the continuous times $t = 0, T_s, 2T_s, \ldots$

$$X[n] = X(nT_s) \qquad \text{where } n = 0, \pm 1, \pm 2, \pm 3, \ldots$$

The mean of $X[n]$ is $\mu_x[n] = E\left[X(n)\right]$.

The autocorrelation function of $X(n)$ is $R_{xx}(n, n+m) = E\left[X(n)\,X(n+m)\right]$.

Definition: The autocovariance function of $X(n)$ is $C_{xx}(n_1, n_2)$, measured between the samples $X(n_1)$ and $X(n_2)$:

$$C_{xx}(n_1, n_2) = E\left\{\big(X(n_1) - \mu_x(n_1)\big)\big(X(n_2) - \mu_x(n_2)\big)\right\}$$

$$C_{xx}(n_1, n_2) = E\left[X(n_1)X(n_2)\right] - \mu_x(n_1)\,\mu_x(n_2)$$



If $X(n_1)$ and $X(n_2)$ are independent, then $R_{xx}(n_1, n_2) = \mu_x(n_1)\,\mu_x(n_2)$

$\Rightarrow C_{xx}(n_1, n_2) = 0$, i.e. $X(n_1)$ and $X(n_2)$ are uncorrelated random variables.

Definition: A Discrete Time Stochastic Process (DTSP) is called white noise if the random variables $X(n_k)$ are uncorrelated.

Note: If the white noise is a Gaussian WSSP, then $X(n)$ consists of a sequence of IID RVs with variance $\sigma^2$.

The autocorrelation function of Gaussian white noise is $R_{xx}(m) = \sigma^2\,\delta(m)$, where

$$\delta(m) = \begin{cases} 1 & \text{if } m = 0 \\ 0 & \text{if } m \neq 0 \end{cases}$$
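A small numerical sketch (assuming iid zero-mean Gaussian samples as the white-noise DTSP; the variance below is an arbitrary choice): the sample autocorrelation is close to $\sigma^2$ at lag 0 and near zero at all other lags, matching $R_{xx}(m) = \sigma^2\delta(m)$:

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(scale=sigma, size=100_000)   # iid Gaussian white-noise samples

def sample_autocorr(x, m):
    # biased estimate of Rxx(m) = E[X(n) X(n+m)] for a zero-mean sequence
    return np.mean(x[:len(x) - m] * x[m:])

print([round(sample_autocorr(x, m), 3) for m in range(5)])
# ~[4.0, 0, 0, 0, 0]: sigma^2 = 4 at lag 0, approximately zero elsewhere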

Definition: The power spectral density of $X(n)$ is

$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\,e^{-j\Omega m} \qquad (6.24)$$

$$S_{xx}(\Omega) = DFT\left[R_{xx}(m)\right]$$

where $R_{xx}(m)$ is the discrete autocorrelation function of $X(n)$ and $DFT[\cdot]$ denotes the discrete Fourier transform.

$$e^{-j(\Omega + 2\pi)n} = e^{-j\Omega n}\,e^{-j2\pi n} = e^{-j\Omega n}$$

Hence $e^{-j\Omega n}$ is periodic with period $2\pi$ $\Rightarrow S_{xx}(\Omega)$ is periodic with period $2\pi$.

Therefore it is sufficient to define $S_{xx}(\Omega)$ on the range $\Omega \in (-\pi, \pi)$.


Figure 12.5: $S_{xx}(\Omega)$ defined on the interval $(-\pi, \pi)$.

$$\Rightarrow\; \text{Autocorrelation function of } X(n): \quad R_{xx}(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\,e^{j\Omega m}\,d\Omega$$

Properties of the power spectral density $S_{xx}(\Omega)$:

1) Sxx (Ω) is periodic with 2π.


=⇒ Sxx (Ω + 2π) = Sxx (Ω) (6.25)

2) Sxx (Ω) is an even function in Ω.


=⇒ Sxx (−Ω) = Sxx (Ω) (6.26)

3) $S_{xx}(\Omega)$ is real.

$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\,e^{-j\Omega m} \qquad (6.27)$$

$$e^{-j\Omega m} = \cos(\Omega m) - j\sin(\Omega m)$$

$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\left[\cos(\Omega m) - j\sin(\Omega m)\right]$$

$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\cos(\Omega m) \;-\; j\sum_{m=-\infty}^{+\infty} R_{xx}(m)\sin(\Omega m) \qquad (6.28)$$

Since $R_{xx}(m)$ is an even function of $m$, while $\cos(\Omega m)$ is even and $\sin(\Omega m)$ is odd, the imaginary sum vanishes
$\Rightarrow S_{xx}(\Omega)$ is real.

4) $E[X^2(n)]$ is the average power of the DTSP $X(n)$:

$$E[X^2(n)] = R_{xx}(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\,e^{j\Omega \cdot 0}\,d\Omega = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\,d\Omega$$

$$E[X^2(n)] = R_{xx}(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\,d\Omega \qquad (6.29)$$
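As a quick check of the discrete analogue of (6.29) (a sketch, not from the notes), Parseval's theorem makes the time-average power equal the frequency-average of the periodogram exactly:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4096)
N = len(x)

time_power = np.mean(x ** 2)                    # average power E[X^2(n)]
periodogram = np.abs(np.fft.fft(x)) ** 2 / N    # crude estimate of Sxx(Omega_k)
freq_power = np.mean(periodogram)               # discrete form of (1/2pi) * integral
print(time_power, freq_power)                   # equal up to floating point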

Example: Assume $X(n)$ is a real SP, so that $R_{xx}(-m) = R_{xx}(m)$. Find the power spectral density of $X(n)$, i.e. $S_{xx}(\Omega)$.

Sol:

$$S_{xx}(\Omega) = DFT\left[R_{xx}(m)\right] = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\,e^{-j\Omega m}$$

Splitting the sum at $m = 0$:

$$S_{xx}(\Omega) = \sum_{m=-\infty}^{-1} R_{xx}(m)\,e^{-j\Omega m} + \sum_{m=0}^{+\infty} R_{xx}(m)\,e^{-j\Omega m}$$

Introducing a dummy index $k = -m$ in the first sum:

$$S_{xx}(\Omega) = \sum_{k=1}^{\infty} R_{xx}(-k)\,e^{j\Omega k} + R_{xx}(0) + \sum_{k=1}^{\infty} R_{xx}(k)\,e^{-j\Omega k}$$

Since $R_{xx}(-k) = R_{xx}(k)$ (an even function),

$$S_{xx}(\Omega) = \sum_{k=1}^{\infty} R_{xx}(k)\left[e^{j\Omega k} + e^{-j\Omega k}\right] + R_{xx}(0)$$

$$\Rightarrow\; S_{xx}(\Omega) = 2\sum_{k=1}^{\infty} R_{xx}(k)\cos(k\Omega) + R_{xx}(0)$$

Proving $R_{xx}(-m) = R_{xx}(m)$:

$$R_{xx}(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\,e^{j\Omega m}\,d\Omega$$

$$R_{xx}(-m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\,e^{-j\Omega m}\,d\Omega$$

Let $\alpha = -\Omega$, so that $d\Omega = -d\alpha$ and the limits of integration flip:

$$R_{xx}(-m) = \frac{1}{2\pi}\int_{\pi}^{-\pi} S_{xx}(-\alpha)\,e^{j\alpha m}\,(-d\alpha) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(-\alpha)\,e^{j\alpha m}\,d\alpha$$

Since $S_{xx}(\alpha)$ is an even function (property 2),

$$R_{xx}(-m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\alpha)\,e^{j\alpha m}\,d\alpha = R_{xx}(m)$$

$$R_{xx}(-m) = R_{xx}(m) \qquad (6.30)$$

6.5.1 Sampling a CTSP: Continuous Time Stochastic Processes

A stochastic process with the property that almost all sample paths are continuous is called a continuous process. If the index set is some interval of the real line, then time is said to be continuous. An example of a continuous-time stochastic process whose sample paths are not continuous is the Poisson process.

A Discrete Time Stochastic Process (DTSP) is obtained by sampling a CTSP $X(t)$. If the CTSP $X(t)$ is sampled at constant intervals of $T_s$ time units (i.e. $T_s$ is the sampling period), then the samples define the DTSP $X(n)$.

Figure 12.6: Sampling a CTSP at the instants $nT_s$, $n = \ldots, -2, -1, 0, 1, 2, \ldots$, spaced $T_s$ apart.

Let $X(n) = X(nT_s)$ for $n = 0, \pm 1, \pm 2, \ldots$

If the CTSP $X(t)$ has mean $\mu_x(t)$ and autocorrelation $R_{xx}(t_1, t_2)$, then for the DTSP $X(n)$ we have

$$\mu_x(n) = \mu_x(nT_s) \quad \text{and} \quad R_{xx}(n_1, n_2) = R_{xx}(n_1 T_s,\, n_2 T_s)$$

i.e. $\mu_x(n) = \mu_x(nT_s)$ is the continuous-time mean sampled at $nT_s$, and $R_{xx}(n_1, n_2) = R_{xx}(n_1 T_s, n_2 T_s)$ is the continuous-time autocorrelation function sampled at $n_1 T_s$ and $n_2 T_s$.

Note: If $X(t)$ is a WSSP in continuous time, then $X(n)$ is also a WSSP in discrete time, with $\mu_x(n) = \mu_x =$ constant and $R_{xx}(m) = R_{xx}(mT_s)$.

Figure 12.7: One realization $X(\omega_1, t)$ of a CTSP $X(t)$ (a sample path in continuous time $t$) and, after sampling, the corresponding DTSP realization $X_n$, $n = 0, 1, 2, \ldots$ in discrete time.

CTSP $\{X(t),\, t \in T\}$:

For $t_0 \in T$, $X(t_0)$ is a random variable, so its CDF is

$$F_{X(t_0)}(x) = P\left[X(t_0) \le x\right]$$

For $t_1 \in T$ and $t_2 \in T$, the joint CDF of $X(t_1)$ and $X(t_2)$ is

$$F_{X(t_1), X(t_2)}(x_1, x_2) = P\left[X(t_1) \le x_1,\; X(t_2) \le x_2\right]$$

6.6 Strong stationarity:

In mathematics and statistics, a strictly stationary (or strongly stationary) process is a stochastic process whose unconditional joint probability distribution does not change when shifted in time. For many applications strict-sense stationarity is too restrictive. A strong form of stationarity is when the distribution of a time series is exactly the same through time:

$$F_{X(t)}(x) = F_{X(t+\Delta)}(x) \qquad \forall\, t \in T \text{ and } (t + \Delta) \in T$$

The joint CDF of $X(t_1)$ and $X(t_2)$ is the same as the joint distribution of $X(t_1 + \Delta)$ and $X(t_2 + \Delta)$, i.e. a time shift of $\Delta$ does not change the joint distribution.

Definition: A CTSP $\{X(t),\, t \in T\}$ is an SSSP if for all $t_1, t_2, \ldots, t_n \in \mathbb{R}$ and all $\Delta \in \mathbb{R}$, the joint CDF of $X(t_1), X(t_2), \ldots, X(t_n)$ satisfies, for all real numbers $x_1, x_2, \ldots, x_n$,

$$F_{X(t_1), X(t_2), \ldots, X(t_n)}(x_1, x_2, \ldots, x_n) = F_{X(t_1+\Delta),\, X(t_2+\Delta), \ldots, X(t_n+\Delta)}(x_1, x_2, \ldots, x_n)$$

Definition: For a DTSP $X(n)$, $n \in \mathbb{Z}$ (the integer set $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$): for all $n_1, n_2, \ldots, n_k \in \mathbb{Z}$ and all $\Delta \in \mathbb{Z}$, the joint CDF of $X(n_1), X(n_2), \ldots, X(n_k)$ is the same as the joint CDF of $X(n_1+\Delta), X(n_2+\Delta), \ldots, X(n_k+\Delta)$, i.e. for all real numbers $x_1, x_2, \ldots, x_k$,

$$F_{X(n_1), X(n_2), \ldots, X(n_k)}(x_1, x_2, \ldots, x_k) = F_{X(n_1+\Delta),\, X(n_2+\Delta), \ldots, X(n_k+\Delta)}(x_1, x_2, \ldots, x_k)$$

Weak stationarity (WSSP): for $t_1, t_2, \ldots, t_n \in \mathbb{R}$ and all $\tau \in \mathbb{R}$,

1) The mean function does not change under shifts in time and is independent of time: $E[X(t_1)] = E[X(t_2)]$, i.e. $\mu_x(t_1) = \mu_x(t_2) =$ constant.

2) The autocorrelation function does not change under shifts in time and is independent of time: $E[X(t_1)\,X(t_2)] = E[X(t_1+\tau)\,X(t_2+\tau)]$.

Definition: A CTSP is a WSSP if, with $\tau = t_1 - t_2$,

1) $\mu_x(t) = \mu_x$ (constant) $\;\forall\, t \in \mathbb{R}$

2) $R_{xx}(t_1, t_2) = R_{xx}(t_1 - t_2) = R_{xx}(\tau)$

Definition: A DTSP $\{X(n),\, n \in \mathbb{Z}\}$ is a WSSP if

1) $\mu_x(n) = \mu_x$ $\;\forall\, n \in \mathbb{Z}$

2) $R_{xx}(n_1, n_2) = R_{xx}(n_1 - n_2)$ $\;\forall\, n_1, n_2 \in \mathbb{Z}$

For weakly stationary stochastic processes,

$$R_{xx}(\tau) = E\left[X(t)\,X(t+\tau)\right] = E\left[X(t+\tau)\,X(t)\right]$$

$$R_{xx}(0) = E\left[X^2(t)\right]$$

For a WSSP, $E[X^2(t)]$ is not a function of time:

$$E\left[X^2(t)\right] = R_{xx}(0)$$

Since $X^2(t) \ge 0$ $\Rightarrow E[X^2(t)] \ge 0$ $\Rightarrow R_{xx}(0) \ge 0$.

$$R_{xx}(-\tau) = E\left[X(t)\,X(t-\tau)\right] = E\left[X(t+\tau)\,X(t)\right] = R_{xx}(\tau)$$

$R_{xx}(\tau)$ is an even function for all $\tau \in \mathbb{R}$.

Note: $R_{xx}(\tau)$ takes its maximum value at $\tau = 0$; that is, $X(t+\tau)$ and $X(t)$ have the highest correlation at $\tau = 0$.

Theorem: $|R_{xx}(\tau)| \le R_{xx}(0)$ $\;\forall\, \tau \in \mathbb{R}$

Proof: By the Cauchy–Schwarz inequality, $|E[XY]| \le \sqrt{E(X^2)\,E(Y^2)}$, with equality iff $X = \alpha Y$ for some constant $\alpha \in \mathbb{R}$. Take $X = X(t)$ and $Y = X(t - \tau)$:

$$\left|E\left[X(t)\,X(t-\tau)\right]\right| \le \sqrt{E\left[X(t)^2\right]E\left[X(t-\tau)^2\right]} \qquad (6.31)$$

$$|R_{xx}(\tau)| \le \sqrt{R_{xx}(0)\,R_{xx}(0)} = R_{xx}(0) \qquad (6.32)$$

$$\Rightarrow\; |R_{xx}(\tau)| \le R_{xx}(0) \qquad (6.33)$$

Figure 12.8: $R_{XX}(\tau)$ attains its maximum value $R_{XX}(0)$ at $\tau = 0$.

6.7 Cyclostationary process

A cyclostationary process is a signal having statistical properties that vary cyclically with time. We can either view the measurements probabilistically, as an instance of a stochastic process, or alternatively view them deterministically, as a single time series, from which a probability distribution for some event associated with the time series can be defined as the fraction of time that the event occurs over the lifetime of the time series. In both views, the process or time series is said to be cyclostationary if and only if its associated probability distributions vary periodically with time.

A signal that is just a function of time, and not a sample path of a stochastic process, can exhibit cyclostationary properties in the framework of the fraction-of-time point of view. If the signal is further ergodic, all sample paths exhibit the same time averages.

Such a process has a periodic structure: its statistical properties repeat every $T_p$ units of time. That is, if the random variables $X(t_1), X(t_2), \ldots, X(t_n)$ have the same joint CDF as the RVs $X(t_1 + T_p), X(t_2 + T_p), \ldots, X(t_n + T_p)$, then the process is cyclostationary.

For example: $X(t) = A\cos(\omega t)$

$$\Rightarrow X\left(t + \tfrac{2\pi}{\omega}\right) = A\cos\left(\omega\left(t + \tfrac{2\pi}{\omega}\right)\right) = A\cos(\omega t + 2\pi) = A\cos(\omega t) = X(t)$$

$X(t)$ is periodic with period $T_p = \frac{2\pi}{\omega}$.

The statistical properties of $X(t)$ do not change when time is shifted by $T_p$ units.

Note: In the above definition, $\tau = T_p$ or $\Delta = T_p$.

Definition: A DTSP $X(n)$, $n \in \mathbb{Z}$, is cyclostationary if $\exists\, M \in \{1, 2, \ldots\}$ such that

1) $\mu_x(n + M) = \mu_x(n)$ $\;\forall\, n \in \mathbb{Z}$

2) $R_{xx}(n_1 + M,\, n_2 + M) = R_{xx}(n_1, n_2)$ $\;\forall\, n_1, n_2 \in \mathbb{Z}$

Definition: Let $X(t)$ be a CTSP. $X(t)$ is mean square continuous at time $t$ if

$$\lim_{\delta \to 0} E\left[\,|X(t+\delta) - X(t)|^2\,\right] = 0$$

i.e. the difference $|X(t+\delta) - X(t)|$ is small on average.

Note: Mean square continuity does not mean that every possible realization of $X(t)$ is a continuous function.

6.8 White Noise is a special Stochastic Process

A very commonly used random process is white noise. White noise is a random signal having equal intensity at different frequencies, giving it a constant power spectral density.

Definition: $N(t)$ is called white noise if $S_{nn}(\omega) = \frac{N_0}{2}$ $\;\forall\, \omega$.

$$E\left[N^2(t)\right] = \int_{-\infty}^{\infty} S_{nn}(\omega)\,d\omega = \int_{-\infty}^{\infty} \frac{N_0}{2}\,d\omega = \infty$$

The autocorrelation function is

$$R_{nn}(\tau) = FT^{-1}\left[\frac{N_0}{2}\right] = \frac{N_0}{2}\,\delta(\tau)$$

$N(t)$ is the white noise stochastic process:

$$S_{nn}(\omega) = \frac{N_0}{2} \quad \forall\, \omega$$

$$R_{nn}(\tau) = FT^{-1}\left[S_{nn}(\omega)\right] = FT^{-1}\left[\frac{N_0}{2}\right] = \frac{N_0}{2}\,\delta(\tau)$$

where $\delta(\tau)$ is the Dirac delta function:

$$\delta(\tau) = \begin{cases} \infty & \text{if } \tau = 0 \\ 0 & \text{if } \tau \neq 0 \end{cases}$$

$E\left[N^2(t)\right] = R_{nn}(0) = \infty$ $\Rightarrow$ a white noise stochastic process has infinite power.

Also, $R_{nn}(\tau) = 0$ for any $\tau \neq 0$

$\Rightarrow N(t_1)$ and $N(t_2)$ are uncorrelated for any $t_1 \neq t_2$

$\Rightarrow$ white Gaussian noise $GN(t_1)$ and $GN(t_2)$ are independent for any $t_1 \neq t_2$,

since for Gaussian RVs independence $\Leftrightarrow$ uncorrelatedness.

6.9 Gaussian Random Process – GRP

A Gaussian process is a stochastic process i.e. a collection of random variables


indexed by time or space, such that every finite collection of those random vari-
ables has a multivariate normal distribution, i.e. every finite linear combination
of them is normally distributed. If a random process is modelled as a Gaussian
process, the distributions of various derived quantities can be obtained explic-
itly. Such quantities include the average value of the process over a range of times
and the error in estimating the average using sample values at a small set of times.

The Gaussian random variable is clearly the most commonly used and the most important. For continuous variables, possible values are distributed on a continuous scale and the probability density function links every possible value with a given probability intensity, which we can think of as the probability of finding the value of the variable around each possible value. The normal distribution is a theoretical frequency distribution for a random variable, characterized by a bell-shaped curve symmetric about its mean.

$\vec{X} = [X_1, X_2, \ldots, X_n]^T$ is a random vector, and $\vec{a} = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^n$.

The RVs $X_1, X_2, \ldots, X_n$ are jointly normal if, for all $\vec{a} \in \mathbb{R}^n$, the scalar linear combination

$$Y = \vec{a}^T\vec{X} = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$$

is a normal random variable. Jointly Gaussian random variables can thus be characterized by the property that every scalar linear combination of such variables is Gaussian. An important property of jointly normal random variables is that their joint PDF is completely determined by their mean vector and covariance matrix.

$\vec{X}$ is a Gaussian vector if the RVs $X_1, X_2, \ldots, X_n$ are jointly normal.

Note: The joint PDF of $X_1, X_2, \ldots, X_n$ is completely determined by the mean vector $\vec{m}$ and the covariance matrix $C$:

$$\vec{m} = E[\vec{X}], \qquad \vec{X} = [X_1, X_2, \ldots, X_n]^T$$

$$C = E\left[(\vec{X} - \vec{m})(\vec{X} - \vec{m})^T\right], \qquad |C| = \det(C)$$
 

The 1-D Gaussian PDF of $X$ is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (6.34)$$

The $n$-D Gaussian PDF of $\vec{X}$ is

$$f_{\vec{X}}(\vec{x}) = \frac{1}{(2\pi)^{n/2}\sqrt{|C|}}\,e^{-\frac{1}{2}(\vec{x}-\vec{m})^T C^{-1}(\vec{x}-\vec{m})} \qquad (6.35)$$
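A short sketch evaluating (6.35) directly and via scipy for an example mean vector $\vec{m}$ and covariance matrix $C$ (both chosen for illustration, not taken from the lecture):

import numpy as np
from scipy.stats import multivariate_normal

m = np.array([0.0, 1.0])                        # assumed mean vector
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])                      # assumed covariance matrix
x = np.array([0.3, 0.8])

d = x - m
n = len(m)
pdf_manual = np.exp(-0.5 * d @ np.linalg.inv(C) @ d) / \
             np.sqrt((2 * np.pi) ** n * np.linalg.det(C))
pdf_scipy = multivariate_normal(mean=m, cov=C).pdf(x)
print(pdf_manual, pdf_scipy)                    # identical up to floating point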

Definition: An SP $\{X(t),\, t \in \mathbb{R}\}$ is a Gaussian (or normal) random process if $\forall\, t_1, t_2, \ldots, t_n \in \mathbb{R}$ the $n$ RVs $X(t_1), X(t_2), \ldots, X(t_n)$ are jointly normal.

Note: If two jointly normal random processes $X(t)$ and $Y(t)$ are uncorrelated, that is $C_{xy}(t_1, t_2) = 0$ $\forall\, t_1, t_2$, then $X(t)$ and $Y(t)$ are two independent SPs.

Note: For a Gaussian SP, weak stationarity and strong stationarity (SSSP) are equivalent.

Theorem: For a Gaussian SP $\{X(t),\, t \in T\}$, if $X(t)$ is a WSSP then $X(t)$ is an SSSP.

Definition: Two SPs $\{X(t)\}$ and $\{Y(t)\}$ are jointly Gaussian if for all $t_1, t_2, \ldots, t_n \in \mathbb{R}_x$ and $t_1', t_2', \ldots, t_m' \in \mathbb{R}_y$, the RVs $X(t_1), X(t_2), \ldots, X(t_n), Y(t_1'), \ldots, Y(t_m')$ are jointly normal.

Proof:

We need to show that $\forall\, t_1, t_2, \ldots, t_k \in \mathbb{R}$, the variables $X(t_1), X(t_2), \ldots, X(t_k)$ have the same joint CDF as the RVs $X(t_1+\tau), X(t_2+\tau), \ldots, X(t_k+\tau)$.

Since these RVs are jointly Gaussian, it suffices to show that the mean vectors and covariance matrices are the same.

If $X(t)$ is a WSSP, then $\mu_x(t_i) = \mu_x(t_j) = \mu_x =$ constant $\;\forall\, i, j$,

and $C_{xx}(t_i + \tau,\, t_j + \tau) = C_{xx}(t_i, t_j) = C_{xx}(t_i - t_j)$ $\;\forall\, i, j$.

$\Rightarrow$ The mean vector and covariance matrix of $X(t_1), X(t_2), \ldots, X(t_k)$ are the same as the mean vector and covariance matrix of $X(t_1+\tau), X(t_2+\tau), \ldots, X(t_k+\tau)$.

6.10 Summary:

The power spectral density $S_{xx}(\omega)$ is a non-negative, even, real and continuous function of $\omega$. Since the power spectral density must be an even, non-negative, real function, not every function $R_{xx}(\tau)$ can be the autocorrelation function of a WSSP $X(t)$.

White noise is a random function $N(t)$ whose power spectral density $S_{nn}(\omega)$ is constant for all frequencies $\omega$: $S_{nn}(\omega) = \frac{N_0}{2}$ $\forall\, \omega$. A Discrete Time Stochastic Process (DTSP) is called white noise if the random variables $X(n_k)$ are uncorrelated.

The discrete power spectral density $S_{xx}(\Omega)$ is periodic with period $2\pi$, an even function of $\Omega$, and real.

A DTSP is obtained by sampling a CTSP $X(t)$: if the CTSP is sampled at constant intervals of $T_s$ time units (the sampling period), the samples define the DTSP $X(n)$. If $X(t)$ is a WSSP in continuous time, then $X(n)$ is also a WSSP in discrete time, with $\mu_x(n) = \mu_x =$ constant and $R_{xx}(m) = R_{xx}(mT_s)$.

For weak stationarity, the mean function does not change under shifts in time and is independent of time: $E[X(t_1)] = E[X(t_2)]$, i.e. $\mu_x(t_1) = \mu_x(t_2) =$ constant; and the autocorrelation function does not change under shifts in time: $E[X(t_1)\,X(t_2)] = E[X(t_1+\tau)\,X(t_2+\tau)]$.

$R_{xx}(\tau)$ takes its maximum value at $\tau = 0$; that is, $X(t+\tau)$ and $X(t)$ have the highest correlation at $\tau = 0$.

A cyclostationary process has a periodic structure: its statistical properties repeat every $T_p$ units of time, i.e. the random variables $X(t_1), X(t_2), \ldots, X(t_n)$ have the same joint CDF as the RVs $X(t_1 + T_p), X(t_2 + T_p), \ldots, X(t_n + T_p)$.

For a Gaussian SP, weak stationarity and strong stationarity (SSSP) are equivalent.
Chapter 7

Lecture 14: 2019DMB04 - Karnam


Yogesh

7.1 ARMA Model

ARMA is a model of forecasting in which the methods of autoregression (AR) analysis and moving average (MA) are both applied to time-series data that is well behaved. In ARMA it is assumed that the time series is stationary and that, when it fluctuates, it does so uniformly around a particular level.

The ARMA approach, developed by Box and Jenkins (1970) using the time series analysis method, does not consider the part played by explanatory variables grounded in economic or financial theory; instead it opts for an extrapolation mechanism, describing the time series in terms of the changing law of the time series itself. A precondition for developing such a time series model is that the time series is stationary.

ARMA is essential in studying a time series. It is usually utilized in market re-


search for long-term tracking data research. For example, it is used in retail re-
search, to analyze sales volume which has seasonal variation characteristics.

This model is among the high-resolution spectral analysis methods of the model
parameter method, which is used in studying the rational spectrum of the sta-
tionary stochastic processes and is suited for a large class of practical problems.
ARMA has a better and more accurate spectral estimation and resolution perfor-
mance when compared to the AR or MA model, but it has a cumbersome param-
eter estimation.

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + s_t + \theta_1 s_{t-1} + \cdots + \theta_q s_{t-q} \qquad (7.1)$$


7.1.1 Introduction

As we have remarked, dependence is very common in time series observations.


To model this time series dependence, we start with univariate ARMA models. To motivate the model, we can basically track two lines of thinking. First, for a series $x_t$, we can model the level of its current observations as depending on the level of its lagged observations. For example, if we observe a high GDP realization this quarter, we would expect that GDP in the next few quarters is good as well. This way of thinking can be represented by an AR model. The AR(1) (autoregressive of order one) model can be written as:

$$x_t = \phi x_{t-1} + s_t \qquad (7.2)$$

where $s_t \sim WN(0, \sigma^2)$, an assumption we keep throughout this lecture. Similarly, the AR(p) (autoregressive of order p) model can be written as:

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + s_t \qquad (7.3)$$

In a second line of thinking, we can model the observations of a random variable at time $t$ as affected not only by the shock at time $t$, but also by the shocks that have taken place before time $t$. For example, if we observe a negative shock to the economy, say a catastrophic earthquake, then we would expect this negative effect to affect the economy not only at the time it takes place, but also in the near future. This kind of thinking can be represented by an MA model. The MA(1) (moving average of order one) and MA(q) (moving average of order q) models can be written as

$$x_t = s_t + \theta s_{t-1} \qquad (7.4)$$

and

$$x_t = s_t + \theta_1 s_{t-1} + \cdots + \theta_q s_{t-q} \qquad (7.5)$$

If we combine these two models, we get the general ARMA(p, q) model:

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + s_t + \theta_1 s_{t-1} + \cdots + \theta_q s_{t-q} \qquad (7.6)$$

7.2 White Noise

The term white noise was originally an engineering term, and there are subtle but important differences in the way it is defined in various econometric texts. Here we define white noise as a series of uncorrelated random variables with zero mean and uniform variance. If it is necessary to make the stronger assumptions of independence or normality, this will be made clear in the context, and we will refer to independent white noise or normal (Gaussian) white noise. Be careful of the various definitions and of terms like weak, strong and strict white noise.

7.3 Lag Operators

Lag operators enable us to present an ARMA model in a much more concise way. Applying the lag operator (denoted $L$) once, we move the index back one time unit; applying it $k$ times, we move the index back $k$ units:

$$L x_t = x_{t-1}, \qquad L^2 x_t = x_{t-2}, \qquad \ldots, \qquad L^k x_t = x_{t-k}$$

The lag operator is distributive over the addition operator, i.e.

$$L(x_t + y_t) = x_{t-1} + y_{t-1} \qquad (7.7)$$

Using lag operators, we can rewrite the ARMA models as:

AR(1): $(1 - \phi L)x_t = s_t$

AR(p): $(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)x_t = s_t$

MA(1): $x_t = (1 + \theta L)s_t$

MA(q): $x_t = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)s_t$

Let $\phi_0 = 1$, $\theta_0 = 1$ and define the lag polynomials

$$\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p \qquad (7.8)$$

$$\theta(L) = 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q \qquad (7.9)$$

With lag polynomials, we can rewrite an ARMA process in a more compact way:

AR: $\phi(L)x_t = s_t$

MA: $x_t = \theta(L)s_t$

ARMA: $\phi(L)x_t = \theta(L)s_t$


7.4 Invertibility

Given a time series probability model, we can usually find multiple ways to represent it. Which representation to choose depends on our problem. For example, to study impulse-response functions, MA representations may be more convenient; while to estimate an ARMA model, AR representations may be more convenient, since usually $x_t$ is observable while $s_t$ is not. However, not all ARMA processes can be inverted. In this section, we consider under what conditions we can invert an AR model to an MA model, and invert an MA model to an AR model. It turns out that invertibility, which means that the process can be inverted, is an important property of the model. If we let $1$ denote the identity operator, i.e. $1\,y_t = y_t$, then the inversion operator $(1 - \phi L)^{-1}$ is defined to be the operator such that

$$(1 - \phi L)^{-1}(1 - \phi L) = 1$$
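For the AR(1) case this inversion is concrete: when $|\phi| < 1$, $(1 - \phi L)^{-1} = \sum_{k \ge 0} \phi^k L^k$, so the AR(1) inverts to the MA($\infty$) representation $x_t = \sum_{k \ge 0} \phi^k s_{t-k}$. A small numerical sketch (with an illustrative $\phi$ and truncation order) verifies that the truncated expansion matches the recursive simulation:

import numpy as np

rng = np.random.default_rng(4)
phi, n, K = 0.7, 500, 60                  # K: truncation order of the MA series

s = rng.normal(size=n)
x_ar = np.zeros(n)
x_ar[0] = s[0]
for t in range(1, n):                     # recursive AR(1): (1 - phi L) x_t = s_t
    x_ar[t] = phi * x_ar[t - 1] + s[t]

# truncated MA(infinity) representation: x_t ~ sum_{k=0}^{K} phi^k s_{t-k}
x_ma = np.array([sum(phi ** k * s[t - k] for k in range(min(K, t) + 1))
                 for t in range(n)])

print(np.max(np.abs(x_ar - x_ma)[K:]))    # negligible, on the order of phi**K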

7.5 Autocovariance Functions and Stationarity of ARMA


models

7.5.1 MA(1)

$$x_t = s_t + \theta s_{t-1},$$

where $s_t \sim WN(0, \sigma^2)$. It is easy to calculate the first two moments of $x_t$:

$$E(x_t) = E(s_t + \theta s_{t-1}) = 0$$

$$E(x_t^2) = (1 + \theta^2)\sigma^2$$

So for an MA(1) process we have a fixed mean and a covariance function which does not depend on time $t$; hence MA(1) is stationary. The autocorrelation can be computed as $\rho_x(h) = \gamma_x(h)/\gamma_x(0)$, so

$$\rho_x(0) = 1, \qquad \rho_x(1) = \frac{\theta}{1 + \theta^2}, \qquad \rho_x(h) = 0 \;\text{ for } h > 1$$

We proposed in the section on invertibility that for an invertible (noninvertible) MA process, there always exists a noninvertible (invertible) process which is the same as the original process up to the second moment. We use the MA(1) process above as an example.
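A numerical sanity check of these MA(1) moments (a sketch with an illustrative $\theta$):

import numpy as np

rng = np.random.default_rng(5)
theta, sigma = 0.5, 1.0
s = rng.normal(scale=sigma, size=1_000_000)
x = s[1:] + theta * s[:-1]                   # x_t = s_t + theta * s_{t-1}

gamma0 = np.var(x)
rho1 = np.mean(x[1:] * x[:-1]) / gamma0
print(gamma0, (1 + theta**2) * sigma**2)     # ~1.25 vs 1.25
print(rho1, theta / (1 + theta**2))          # ~0.4 vs 0.4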

7.5.2 MA(q)

Time series models known as ARIMA models may include autoregressive terms and/or moving average terms. As we have seen, an autoregressive term in a time series model for a variable is a lagged value of that variable (multiplied by a coefficient). A moving average term in a time series model is a past error (multiplied by a coefficient).

$$x_t = \theta(L)s_t = \left(\sum_{k=0}^{q}\theta_k L^k\right)s_t$$

The first two moments are:

$$E(x_t) = 0$$

$$E(x_t^2) = \sigma^2\sum_{k=0}^{q}\theta_k^2$$

Again, an MA(q) process is stationary for any finite values of $\theta_1, \ldots, \theta_q$.

7.6 AR(1)

In statistics, econometrics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it is used to describe certain time-varying processes in nature, economics, etc. The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term); thus the model takes the form of a stochastic difference equation (or recurrence relation, which should not be confused with a differential equation). Together with the moving-average (MA) model, it is a special case and key component of the more general autoregressive moving-average (ARMA) and autoregressive integrated moving average (ARIMA) models of time series, which have a more complicated stochastic structure; it is also a special case of the vector autoregressive (VAR) model, which consists of a system of more than one interlocking stochastic difference equation in more than one evolving random variable.

$$x_t = \phi x_{t-1} + s_t \qquad (7.10)$$

7.6.1 Estimation of AR parameters

∗ Estimation of autocovariances or autocorrelations. Here each of these terms is estimated separately, using conventional estimates. There are different ways of doing this, and the choice between them affects the properties of the estimation scheme; for example, some choices can produce negative estimates of the variance.

∗ Formulation as a least squares regression problem, in which an ordinary least squares prediction problem is constructed, basing prediction of values of $X_t$ on the $p$ previous values of the same series. This can be thought of as a forward-prediction scheme. The normal equations for this problem can be seen to correspond to an approximation of the matrix form of the Yule–Walker equations, in which each appearance of an autocovariance of the same lag is replaced by a slightly different estimate (see the sketch after this list).

∗ Formulation as an extended form of the ordinary least squares prediction problem. Here two sets of prediction equations are combined into a single estimation scheme and a single set of normal equations. One set is the set of forward-prediction equations and the other is a corresponding set of backward-prediction equations, relating to the backward representation of the AR model.

7.6.2 AR(p)

Autoregressive models of order $p$, abbreviated AR(p), are commonly used in time series analyses. In particular, AR(1) models (and their multivariate extensions) see considerable use in ecology, as we will see later in the course. Recall from lecture that an AR(p) model is written as

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + s_t \qquad (7.11)$$

where $s_t \sim WN(0, \sigma^2)$.

7.7 ARMA Modeling

Recall from Section 7.1.1 the two lines of thinking behind AR and MA models; combining them gives the general ARMA(p, q) model of equation (7.6),

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + s_t + \theta_1 s_{t-1} + \cdots + \theta_q s_{t-q}$$

Now that we understand the theoretical behavior of ARMA processes, we consider how to take an actual observed time series, fit an ARMA model to the data, and forecast future values of the time series. The stages in our process for ARMA modeling a time series, beginning with observed values $z_1, z_2, \ldots, z_n$, are:

1. Remove any nonzero mean from the time series.

2. Estimate the autocorrelation and PACF of the time series. Use these to determine the autoregressive order $p$ and the moving average order $q$.

3. Estimate the model coefficients $a_1, a_2, \ldots$

4. Use the fitted model to forecast $z_{n+1}, z_{n+2}, \ldots$

The first step is removing any nonzero mean from the time series $z$. This is a very straightforward computation: just compute the mean of the time series and subtract it from each element. The next step is determining the autoregressive order $p$ and the moving average order $q$.
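A hedged end-to-end sketch of these four stages using the external statsmodels library (the order (1, 0, 1) and the simulated series are illustrative assumptions, not the notes' data):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
s = rng.normal(size=500)
z = np.zeros(500)
for t in range(1, 500):                  # an ARMA(1,1) series to play the role of z
    z[t] = 0.6 * z[t - 1] + s[t] + 0.4 * s[t - 1]

z = z - z.mean()                         # step 1: remove the nonzero mean
model = ARIMA(z, order=(1, 0, 1))        # steps 2-3: choose (p, q) and estimate
result = model.fit()
print(result.params)                     # fitted AR and MA coefficients
print(result.forecast(steps=3))          # step 4: forecast z_{n+1}, z_{n+2}, ...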
Appendix A

Appendix A: Fourier Transforms

A.1 Introduction

To provide some context for our discussion of Fourier transforms and series, let us imagine a scenario. Since Fourier analysis is concerned with signals and waves, we imagine a musician playing a steady note on a trumpet, with a microphone in front of the trumpet capturing the sound produced. The mic typically has a diaphragm which undergoes pressure from the sound waves of the trumpet, and this pressure then translates into a voltage proportional to the instantaneous air pressure. Now, if we measure this with an oscilloscope, we will get a graph of pressure against time $F(t)$ which turns out to be periodic. Note that it is the reciprocal of the period which is termed the frequency of the note being played on the trumpet. The typical relationship between frequency and time period is:

$$\nu = \frac{1}{T} \qquad (A.1)$$
Let us say that the fundamental frequency of this one note sound is 256Hz. Now
in reality one sine wave of the said frequency is not produced, rather multiple
overtones are produced which are multiples of the fundamental frequency with
various amplitudes and phases. Phase basically determines where in the cycle
the signal would start repeating. Technically, we can analyse the wave by finding
a list of the amplitudes and phases of the various sine waves that comprise the
complex signal. We can plot a graph of amplitudes against frequency denoted by
A(ν). Now since we are effectively bringing the function from the time domain
to the frequency domain we say that A(ν) is the Fourier transform of F (t).


A.2 Fourier series

Continuing with our previous example, we can say that a steady note sound signal
can be described completely by the fundamental frequency, its amplitude and the
amplitudes of its overtones or harmonics. For this we can use a discrete sum:

F (t) = a0 + a1 cos(2πν0 t) + b1 sin(2πν0 t) + a2 cos(4πν0 t) + b2 sin(4πν0 t) + · · · (A.2)

Here ν0 represents the fundamental frequency. The various sine and cosine func-
tions in the series denote the various phases of the signal that are not in step with
the fundamental signal. We can rewrite the previous formula as:

$$F(t) = \sum_{n=-\infty}^{\infty} a_n\cos(2\pi n\nu_0 t) + b_n\sin(2\pi n\nu_0 t) \qquad (A.3)$$

Note that this process of constructing a waveform by adding together the fundamental frequency and its overtones of various amplitudes is called Fourier synthesis. Given that $\cos(-x) = \cos(x)$ and $\sin(x) = -\sin(-x)$, we can rewrite the above expression as:

$$F(t) = \frac{A_0}{2} + \sum_{n=1}^{\infty} A_n\cos(2\pi n\nu_0 t) + B_n\sin(2\pi n\nu_0 t) \qquad (A.4)$$

where $A_n = a_{-n} + a_n$ and $B_n = b_n - b_{-n}$.

A.3 Amplitudes

Now note that the opposite process of extracting the frequencies and amplitudes
from the original signal is called Fourier analysis. We are interested in trying
to find the amplitudes Am and Bm for various instances of m. Now before mov-
ing ahead, we note the utilisation of the orthogonality property of trigonometric
functions - the central idea is that if we take a sine and a cosine, or two sines or two
cosines (as multiples of the fundamental frequency), then take their product and
integrate this product over the period of fundamental frequency, then the result
is zero. Noting that one period is the inverse of frequency, $P = 1/\nu_0$, we have (for $n \neq m$):

$$\int_{t=0}^{P} \cos(2\pi n\nu_0 t)\cos(2\pi m\nu_0 t)\,dt = 0 \qquad (A.5)$$

$$\int_{t=0}^{P} \sin(2\pi n\nu_0 t)\sin(2\pi m\nu_0 t)\,dt = 0 \qquad (A.6)$$

$$\int_{t=0}^{P} \sin(2\pi n\nu_0 t)\cos(2\pi m\nu_0 t)\,dt = 0 \qquad (A.7)$$

Note that in case $m = n$, the first two integrals would instead evaluate to $\frac{1}{2\nu_0}$. We now note the general expressions for the coefficient values:

$$B_m = \frac{2}{P}\int_{t=0}^{P} F(t)\sin(2\pi m\nu_0 t)\,dt \qquad (A.8)$$

$$A_m = \frac{2}{P}\int_{t=0}^{P} F(t)\cos(2\pi m\nu_0 t)\,dt \qquad (A.9)$$

An alternate way of writing the Fourier series is shown below. This expression comes about as a result of taking $A_m = R_m\cos\phi_m$ and $B_m = R_m\sin\phi_m$:

$$F(t) = \frac{A_0}{2} + \sum_{m=1}^{\infty} R_m\cos(2\pi m\nu_0 t + \phi_m) \qquad (A.10)$$

Where Rm is the mth harmonic amplitude and φm is the corresponding phase


value. Note as an additional point that two waves are said to be in phase if their
crests arrive together at a certain point. If however at some point, one wave has a
trough and the other has a crest, then they are said to be completely out of phase
- in this case they are said to have a 180 degrees phase difference.

A.4 Alternate forms of writing the series

We can actually write the Fourier series in the form of complex exponentials instead of trigonometric functions. First, as a reference, we note De Moivre's theorem:

$$(\cos x + i\sin x)^n = \cos(nx) + i\sin(nx) = e^{inx} \qquad (A.11)$$

Now we denote the Fourier series using this notation:

$$F(t) = \sum_{m=-\infty}^{\infty} C_m\,e^{2\pi i m\nu_0 t} \qquad (A.12)$$

The coefficients $C_m$ are complex numbers. Typically, without going into the derivations, we use inversion formulae to get the coefficient values of the real and complex parts of the coefficients:

$$A_m = 2\nu_0\int_0^{1/\nu_0} F(t)\cos(2\pi m\nu_0 t)\,dt \qquad (A.13)$$

$$B_m = 2\nu_0\int_0^{1/\nu_0} F(t)\sin(2\pi m\nu_0 t)\,dt \qquad (A.14)$$

$$C_m = 2\nu_0\int_0^{1/\nu_0} F(t)\,e^{-2\pi i m\nu_0 t}\,dt \qquad (A.15)$$

The above expressions can be rewritten in slightly different notation if we let $\nu_0 = \omega_0/2\pi$. The expressions become:

$$A_m = \frac{\omega_0}{\pi}\int_0^{2\pi/\omega_0} F(t)\cos(m\omega_0 t)\,dt \qquad (A.16)$$

$$B_m = \frac{\omega_0}{\pi}\int_0^{2\pi/\omega_0} F(t)\sin(m\omega_0 t)\,dt \qquad (A.17)$$

$$C_m = \frac{\omega_0}{\pi}\int_0^{2\pi/\omega_0} F(t)\,e^{-im\omega_0 t}\,dt \qquad (A.18)$$

An easy way to remember these formulae is by writing them as:

$$A_m = \frac{2}{\text{period}}\int_{\text{one period}} F(t)\cos\left(\frac{2\pi m t}{\text{period}}\right)dt \qquad (A.19)$$

$$B_m = \frac{2}{\text{period}}\int_{\text{one period}} F(t)\sin\left(\frac{2\pi m t}{\text{period}}\right)dt \qquad (A.20)$$

A.5 Fourier Transforms

We note that whether $F(t)$ is periodic or not, we can give a complete description of the function using combinations of sines and cosines. A non-periodic function can be thought of as the limiting case of a periodic function wherein the period tends to infinity and the fundamental frequency tends to zero. In this case the harmonics are closely spaced and there is a continuum of them, each such harmonic having an infinitesimal amplitude $a(\nu)\,d\nu$. Integrating over all these amplitudes to synthesize the function, we get:

$$F(t) = \int_{-\infty}^{\infty} a(\nu)\cos(2\pi\nu t)\,d\nu + \int_{-\infty}^{\infty} b(\nu)\sin(2\pi\nu t)\,d\nu \qquad (A.21)$$

Writing this in terms of amplitude and phase values we have:

$$F(t) = \int_{-\infty}^{\infty} r(\nu)\cos\big(2\pi\nu t + \phi(\nu)\big)\,d\nu \qquad (A.22)$$

This can alternatively be written as:

$$F(t) = \int_{-\infty}^{\infty} \Phi(\nu)\,e^{2\pi i\nu t}\,d\nu \qquad (A.23)$$

Note that if $F(t)$ is real then $a(\nu)$ and $b(\nu)$ are real as well; however, if the function $F(t)$ is asymmetrical, that is if $F(t) \neq F(-t)$, then $\Phi(\nu)$ takes complex values. In certain cases $F(t)$ is symmetrical, which in turn implies that $\Phi(\nu)$ is real and $F(t)$ consists only of cosines. Our Fourier series would then become:

$$F(t) = \int_{-\infty}^{\infty} \Phi(\nu)\cos(2\pi\nu t)\,d\nu \qquad (A.24)$$

Now comes the interesting bit. We can actually recover the function that contains information about the frequencies, $\Phi(\nu)$, from $F(t)$ by way of inversion:

$$\Phi(\nu) = \int_{-\infty}^{\infty} F(t)\cos(2\pi\nu t)\,dt \qquad (A.25)$$

Finally, we say that $\Phi(\nu)$, which is a function in the frequency domain, is the Fourier transform of $F(t)$, which is in the time domain. Another general formulation of this is given by:

$$\Phi(\nu) = \int_{-\infty}^{\infty} F(t)\,e^{-2\pi i\nu t}\,dt \qquad (A.26)$$

A.6 Spectrum

Note first that the square of the amplitude of oscillation of a wave gives a measure of the power contained in each harmonic of the wave. In case the Fourier transform of $F(t)$, that is $\Phi(\nu)$, is complex, then taking the product of it and its complex conjugate $\Phi^*(\nu)$ gives the power spectrum or spectral power density of $F(t)$:

$$S(\nu) = \Phi(\nu)\,\Phi^*(\nu) \qquad (A.27)$$

A.6.1 The autocorrelation theorem

The autocorrelation function of a function $F(t)$ can be defined as:

$$A(t) = \int_{-\infty}^{\infty} F(t')\,F(t + t')\,dt' \qquad (A.28)$$

The process of autocorrelation can be thought of as multiplying every point of a function by another point that is a distance $t$ away, and then summing all those products. Now let us take the Fourier transform of both sides of the equation:

$$\Gamma(\nu) = \int_{-\infty}^{\infty} A(t)\,e^{2\pi i\nu t}\,dt = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} F(t')\,F(t + t')\,e^{2\pi i\nu t}\,dt'\,dt \qquad (A.29)$$

Solving this, we get the following expression:

$$\Gamma(\nu) = \Phi^*(\nu)\,\Phi(\nu) \qquad (A.30)$$

Note that $\Phi$ is nothing but the Fourier transform of $F(t)$, and the multiplication of the Fourier transform by its complex conjugate gives us the power spectral density of $F(t)$. In other words, the Fourier transform of the autocorrelation function is the power spectrum, and conversely the autocorrelation function is the inverse Fourier transform of the power spectrum we just obtained.

References

[1] J. F. James, A Student's Guide to Fourier Transforms.


Appendix B

Appendix B: Laplace, Dirac Delta and


Fourier Series

B.1 The Laplace Transform

The Laplace transform is primarily a mapping of points in the $t$ domain of a function $F(t)$ to points in the $s$ domain. We must note that the Laplace transform exists for functions that are of exponential order, are bounded and have converging infinite integrals. The mathematical definition of the Laplace transform is as follows:

$$\mathcal{L}\{F(t)\} = f(s) = \int_0^{\infty} F(t)\,e^{-st}\,dt \qquad (B.1)$$

A popular property of the Laplace transform is linearity, which can be stated as:

$$\mathcal{L}\{aF_1(t) + bF_2(t)\} = a\,\mathcal{L}\{F_1(t)\} + b\,\mathcal{L}\{F_2(t)\} \qquad (B.2)$$

Yet another important theorem associated with this transform is called the first shift theorem and can be stated as follows:

$$\mathcal{L}\{e^{-bt}F(t)\} = f(s + b) \qquad (B.3)$$

The proof of this theorem is pretty straightforward:

$$\mathcal{L}\{e^{-bt}F(t)\} = \lim_{T\to\infty}\int_0^{T} e^{-st}e^{-bt}F(t)\,dt \qquad (B.4)$$

$$= \int_0^{\infty} e^{-st}e^{-bt}F(t)\,dt = \int_0^{\infty} e^{-(s+b)t}F(t)\,dt = f(s + b) \qquad (B.5)$$


B.1.1 Examples and Properties

To demonstrate how this transform works, we show a simple example: transforming the function $F(t) = t$. Note that integration by parts is used.

$$\mathcal{L}(t) = \lim_{T\to\infty}\int_0^{T} t\,e^{-st}\,dt \qquad (B.6)$$

$$\int_0^{T} t\,e^{-st}\,dt = \left[-\frac{t}{s}\,e^{-st}\right]_0^{T} + \frac{1}{s}\int_0^{T} e^{-st}\,dt \qquad (B.7)$$

$$= -\frac{T}{s}\,e^{-sT} + \left[-\frac{1}{s^2}\,e^{-st}\right]_0^{T} \qquad (B.8)$$

$$= -\frac{T}{s}\,e^{-sT} - \frac{1}{s^2}\,e^{-sT} + \frac{1}{s^2} \;\xrightarrow{\,T\to\infty\,}\; \frac{1}{s^2} \qquad (B.9)$$

Therefore the Laplace transform of the function $F(t) = t$ is given by $f(s) = 1/s^2$.
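A quick numerical confirmation of $\mathcal{L}\{t\} = 1/s^2$ (a sketch) by applying scipy quadrature to the defining integral for an arbitrary $s > 0$:

import numpy as np
from scipy.integrate import quad

s = 2.5                                 # an arbitrary s > 0
value, _ = quad(lambda t: t * np.exp(-s * t), 0, np.inf)
print(value, 1 / s ** 2)                # both ~0.16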
Now we note some general formulae regarding various Laplace transforms. The
derivation of these expressions is omitted in this section.

• $\mathcal{L}(t^n) = \dfrac{n!}{s^{n+1}}$

• $\mathcal{L}\{t\,e^{at}\} = \dfrac{1}{(s - a)^2}$

• Before the next formula we must recall Euler's formula, which gives us the polar form of complex numbers:

$$e^{it} = \cos(t) + i\sin(t) \qquad (B.10)$$

Now we note that, due to the linearity property, the Laplace transform of $e^{it}$ is given by:

$$\mathcal{L}(e^{it}) = \mathcal{L}(\cos(t)) + i\,\mathcal{L}(\sin(t)) \qquad (B.11)$$

where the Laplace transforms of the individual trigonometric functions are:

$$\mathcal{L}(\cos(t)) = \frac{s}{s^2 + 1} \qquad (B.12)$$

$$\mathcal{L}(\sin(t)) = \frac{1}{s^2 + 1} \qquad (B.13)$$

• $\mathcal{L}\{t\,F(t)\} = -\dfrac{d}{ds}f(s)$

• A popular function whose Laplace transform is immensely useful is Heaviside's unit step function, given by:

$$H(t) = \begin{cases} 0, & \text{if } t < 0 \\ 1, & \text{if } t \ge 0 \end{cases} \qquad (B.14)$$

Consequently its Laplace transform is given by:

$$\mathcal{L}(1) = \frac{1}{s} \qquad (B.15)$$

• The Laplace transform of a first-order differentiable function can be written as:

$$\mathcal{L}\{F'(t)\} = \int_0^{\infty} e^{-st}F'(t)\,dt = -F(0) + s\,f(s) \qquad (B.16)$$

• The Laplace transform of a second-order differentiable function is given as:

$$\mathcal{L}\{F''(t)\} = s^2 f(s) - sF(0) - F'(0) \qquad (B.17)$$

In a similar manner, we can generalize the above two points to write the Laplace transform of an $n$ times differentiable function as:

$$\mathcal{L}\{F^{(n)}(t)\} = s^n f(s) - s^{n-1}F(0) - s^{n-2}F'(0) - \cdots - F^{(n-1)}(0) \qquad (B.18)$$

• $\mathcal{L}\{\sin(wt)\} = \dfrac{w}{s^2 + w^2}$

• $\mathcal{L}\{\cos(wt)\} = \dfrac{s}{s^2 + w^2}$

B.1.2 Expanding a little on Heaviside function

We earlier mentioned that the Laplace transform of the Heaviside function is given by $\mathcal{L}\{H(t)\} = 1/s$. However, we are usually more interested in finding the transform of $H(t - t_0)$ where $t_0 > 0$. Applying the Laplace transform definition to this we get:

$$\mathcal{L}\{H(t - t_0)\} = \int_0^{\infty} H(t - t_0)\,e^{-st}\,dt \qquad (B.19)$$

Note that, as per the way this function is defined, for $t < t_0$ we have $H(t - t_0) = 0$, so the transform evaluates over only those $t$ such that $t > t_0$, where the function equals $H(t - t_0) = 1$:

$$\mathcal{L}\{H(t - t_0)\} = \int_{t_0}^{\infty} e^{-st}\,dt = \left[-\frac{e^{-st}}{s}\right]_{t_0}^{\infty} = \frac{e^{-st_0}}{s} \qquad (B.20)$$

This function assumes special relevance when it is multiplied with another function: multiplying by the Heaviside function is analogous to 'switching on' the other function. With this intuition we can state the second shift theorem:

$$\mathcal{L}\{H(t - t_0)F(t - t_0)\} = e^{-st_0}f(s) \qquad (B.21)$$

Note that with this we can find the Laplace transform of a function that is switched on at $t = t_0$.

B.1.3 Inverse Laplace transform

Finding the inverse of a Laplace transform usually involves a bit of solving using the partial fractions decomposition method. The standard definition for the inverse transform is:

$$\text{if } \mathcal{L}\{F(t)\} = f(s) \qquad (B.22)$$

$$\text{then } \mathcal{L}^{-1}\{f(s)\} = F(t) \qquad (B.23)$$

Now we take a simple example wherein the inverse transform is determined using partial fractions:

$$\mathcal{L}^{-1}\left(\frac{a}{s^2 - a^2}\right) \qquad (B.24)$$

Solving for the undetermined coefficients using partial fractions we get:

$$\frac{a}{s^2 - a^2} = \frac{1}{2}\left[\frac{1}{s - a} - \frac{1}{s + a}\right] \qquad (B.25)$$

Now we can simply apply the linearity property of the inverse transform operator to get:

$$\mathcal{L}^{-1}\left[\frac{a}{s^2 - a^2}\right] = \frac{1}{2}\left(e^{at} - e^{-at}\right) \qquad (B.26)$$
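The same inverse transform can be checked symbolically with the external sympy library (a sketch). For $a/(s^2 - a^2)$ it returns $\sinh(at)$, which equals $\tfrac{1}{2}(e^{at} - e^{-at})$, matching (B.26):

from sympy import symbols, inverse_laplace_transform

a = symbols('a', positive=True)
s, t = symbols('s t', positive=True)
# returns sinh(a*t) (possibly times Heaviside(t), depending on the sympy version)
print(inverse_laplace_transform(a / (s**2 - a**2), s, t))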

B.2 The Dirac Delta Impulse function

It is observed that there exist certain expressions which might not qualify as functions in the true sense. In order to qualify as a function, an expression needs to be defined for all values of the variable in the specified range; if this is not the case, the expression would cease to be well defined. We are usually not interested in such expressions; however, even if some of these expressions are not well defined, if they have some desirable global properties then they indeed turn out to be rather useful. One such "function" is Dirac's $\delta$ function. The definition is as follows:

$$\delta(t) = 0, \quad \forall\, t,\; t \neq 0 \qquad (B.27)$$

$$\int_{-\infty}^{\infty} h(t)\,\delta(t)\,dt = h(0) \qquad (B.28)$$

The above is defined for any function $h(t)$ that is continuous in the interval $(-\infty, \infty)$. The Dirac $\delta$ function can be thought of as the limiting case of a top hat function with unit area as it becomes infinitesimally thin and tall. First we define the top hat function as follows:

$$T_p(t) = \begin{cases} 0, & \text{if } t \le -1/T \\ 0.5\,T, & \text{if } -1/T < t < 1/T \\ 0, & \text{if } t \ge 1/T \end{cases} \qquad (B.29)$$

Figure B.1: Top hat function

The Dirac delta function then models the limiting behaviour of this function and can be written as:

$$\delta(t) = \lim_{T\to\infty} T_p(t) \qquad (B.30)$$

The integral definition can be written as follows:

$$\int_{-\infty}^{\infty} h(t)\lim_{T\to\infty} T_p(t)\,dt = \lim_{T\to\infty}\int_{-\infty}^{\infty} h(t)\,T_p(t)\,dt \qquad (B.31)$$

The value of the integral within the limits indicates the area under the curve $h(t)\,T_p(t)$, and this area approaches the value $h(0)$ as $T \to \infty$. Further, for a very large value of $T$, the interval $[-1/T, 1/T]$ will be small enough for the value of $h(t)$ not to differ appreciably from its value at the origin. With this we can express $h$ in the form $h(t) = h(0) + \epsilon(t)$, where the term $\epsilon(t)$ tends to $0$ as $T$ goes to infinity. Therefore we can say that $h(t)$ tends to $h(0)$ for extremely large values of $T$. Note that $\delta(t)$ is not a true function since it is not defined for $t = 0$; therefore $\delta(0)$ has no value. Writing out the left and right side limits we get:

$$\int_{0^-}^{\infty} h(t)\,\delta(t)\,dt = h(0) \qquad (B.32)$$

$$\int_{-\infty}^{0^+} h(t)\,\delta(t)\,dt = h(0) \qquad (B.33)$$
−∞
As a limiting case of the top hat function, the Dirac delta function then looks as shown in Figure B.2 below. We note an important property: as the interval gets smaller and smaller due to $T$ becoming large, the area under the top hat function is always unity. Hence, in the limiting case, the length of the arrow (which represents the Dirac $\delta$ function) is 1. Therefore, with $h = 1$, we have:

$$\int_{-\infty}^{\infty} \delta(t)\,dt = 1 \qquad (B.34)$$

The Laplace transform of the Dirac delta function is given by:

$$\mathcal{L}\{\delta(t)\} = 1 \qquad (B.35)$$

Figure B.2: Dirac Delta function: limiting case of a top hat

This essentially means that we are reducing the width of the top hat function such that it lies between $0$ and $1/T$ (because in the Laplace transform the limits of integration start from $0$), and increasing the height from $T/2$ to $T$ so as to preserve the unit area.

B.2.1 Filtering property

Going further with the Dirac $\delta$ function, we say that the function $\delta(t - t_0)$ represents an impulse centered at time $t_0$. This can be thought of as a transient signal and as the limiting case of a function $K(t)$ which is a displaced version of the top hat function:

$$K(t) = \begin{cases} 0, & \text{if } t \le t_0 - 1/T \\ 0.5\,T, & \text{if } t_0 - 1/T < t < t_0 + 1/T \\ 0, & \text{if } t \ge t_0 + 1/T \end{cases} \qquad (B.36)$$

Now, as $T \to \infty$, utilising the definition of the Dirac $\delta$ function we get:

$$\int_{-\infty}^{\infty} h(t)\,\delta(t - t_0)\,dt = h(t_0) \qquad (B.37)$$

We can obtain the Laplace transform of this shifted Dirac delta function, provided that $t_0 > 0$, as:

$$\mathcal{L}\{\delta(t - t_0)\} = e^{-st_0} \qquad (B.38)$$

This is called the filtering property, since we can see clearly from the definition that the Dirac $\delta$ function helps us pick out a particular value of a function:

$$\int_{-\infty}^{\infty} h(t)\,\delta(t - t_0)\,dt = h(t_0) \qquad (B.39)$$

B.3 Fourier Series

The central idea behind a Fourier series is that any given function can be expressed as a series of sine and cosine functions. Here we deal mostly with periodic, piecewise continuous functions. Let us first start with functions defined on the closed interval $[-\pi, \pi]$ which possess one-sided limits at $-\pi$ and $\pi$. We have a function that maps values such that $f : [-\pi, \pi] \to \mathbb{C}$. We can now state the Dirichlet theorem as follows: if $f$ is a member of the space of piecewise continuous functions which are $2\pi$-periodic on the closed interval $[-\pi, \pi]$, having both left and right derivatives at the end points, then for each $x \in [-\pi, \pi]$ the Fourier series of $f$ converges to

$$\frac{f(x^-) + f(x^+)}{2} \qquad (B.40)$$

and at both the end points ($x = \pm\pi$) the series converges to

$$\frac{f(\pi^-) + f((-\pi)^+)}{2} \qquad (B.41)$$

Thus, at points of discontinuity, the Fourier series of $f$ takes the mean of the one-sided limits of $f$ as its value at the discontinuous point.

B.3.1 Fourier series formula

Remember that the whole point of a Fourier series is to express a periodic function as a series of sine and cosine functions. The components of such a series are typically periodic functions of period $2\pi$, given as:

$$1,\; \cos(x),\; \sin(x),\; \cos(2x),\; \sin(2x),\; \cdots,\; \cos(nx),\; \sin(nx) \qquad (B.42)$$

These terms together form a trigonometric system, and the resulting series so obtained is called the trigonometric series:

$$a_0 + a_1\cos(x) + b_1\sin(x) + a_2\cos(2x) + b_2\sin(2x) + \cdots \qquad (B.43)$$

$$= a_0 + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big) \qquad (B.44)$$

Here the $a$ and $b$ terms are the coefficients of the series, and if the coefficients are such that the series converges, then its sum has the same period as the individual components, that is $2\pi$. Now, if we have a function $f(x)$ of period $2\pi$ that can be represented by a convergent series of the form in equation (B.44), then we say that the Fourier series of $f(x)$ is:

$$f(x) = a_0 + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big) \qquad (B.45)$$

Consequently, the Fourier coefficients can be found using the following equations:

$$a_0 = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,dx \qquad (B.46)$$

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx \qquad (B.47)$$

$$b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx \qquad (B.48)$$

A crucial point to note is that the underlying concept behind the Fourier series is the orthogonality of the trigonometric system, which means that every term in the trigonometric series is orthogonal to every other, or that their inner product is zero. In terms of integrals we can write this condition as:

$$\int_{-\pi}^{\pi} \cos(nx)\sin(mx)\,dx = 0 \qquad (B.49)$$
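A numerical sketch of (B.46)-(B.48) for the $2\pi$-periodic square wave $f(x) = \mathrm{sign}(x)$ on $[-\pi, \pi]$ (an illustrative choice): its cosine coefficients vanish, while $b_n = 4/(n\pi)$ for odd $n$ and $0$ for even $n$:

import numpy as np
from scipy.integrate import quad

f = lambda x: np.sign(x)               # square wave over one period [-pi, pi]
for n in range(1, 6):
    an, _ = quad(lambda x: f(x) * np.cos(n * x), -np.pi, np.pi, points=[0])
    bn, _ = quad(lambda x: f(x) * np.sin(n * x), -np.pi, np.pi, points=[0])
    print(n, round(an / np.pi, 4), round(bn / np.pi, 4))
# a_n ~ 0 for every n; b_n ~ 1.2732 (=4/pi), 0, 0.4244, 0, 0.2546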

References

[1] P. P. G. Dyke, An Introduction to Laplace Transforms and Fourier Series.

[2] E. Kreyszig, Advanced Engineering Mathematics.
