SIT787: Mathematics for Artificial Intelligence

Topic 3: Probability
Asef Nazari

School of Information Technology, Deakin University

Exploratory Data Analysis

Data Collection and Sampling
Statistics, the art and science of learning from data
Everything starts with a question
Collection, description, analysis, drawing conclusion
Journey from data to models and back
Data is the essence of modelling,
models are prepared to describe data
Data are obtained from experiments and are the result of
measuring some characteristics or property of objects

Each row is for a case, an observation, or an object
These characteristics or
properties are called
Data variables or features
Numerical (quantitative): can be measured on a numerical
scale or counted
A variable is numerical if its observations take numerical values
that represent different magnitudes of the variable
Continuous or discrete
Your weight is continuous, and the number of cars in a
household is discrete
Categorical (qualitative): take values that are non-numerical
in nature
If a variable has observations belonging to one of a set of
categories, it is called categorical.
It is meaningless to do arithmetic on categorical data (like
Categorical data could only be classified into categories, levels
or classes
Nominal or ordinal
”yes” or ”no” for whether someone passed the driving test,
Populations and samples

The set of all the objects of interest is called the population

Populations are not generally available to study, or it is very
difficult and costly to access them
In a census, we actually try to capture the whole population.
Instead of accessing the whole population,
we might take a sample, a smaller subset of the population,
and conduct the study on the sample
Based on the information we get from the sample, we make
inferences about the population
We need to be very careful in choosing a sample.
It should be a good representative of the population
A good sample presents similar characteristics to the population
A random sample: every individual in the population has the
same chance to be selected in the sample
Issues in data collection

there are always sources of inaccuracies in any real data
Missing values
Missing variable
latent variable analysis (latent variable = hidden variable =
missing variable)
Bias: systematically inaccurate estimates of population values
Outliers: data points that lie well beyond the bulk of the sample

Sample statistics and population parameters

Numerical summaries of the population are called parameters
and are generally shown by Greek letters.
A parameter is a numerical summary of the population
A statistic is a numerical summary of a sample
We are interested in learning about the parameters to have a
better understanding about the population.
The parameter values are almost always unknown.
We use sample statistics to estimate the related parameter
We estimate population parameters using the information
obtained from a sample from it.

We must also distinguish between the sample statistics that we
calculated so far and the corresponding population parameters
Population mean µ
Sample mean x̄
Population variance σ^2
Sample variance s^2
Population standard deviation σ
Sample standard deviation s

Data table

Columns are variables or features

Rows are cases or observations; assume we have m
A column is considered as a random variable with unknown distribution
We only have a sample x1 , x2 , . . . , xm
Assume that all the observations are selected randomly
If we have n columns (assume they are all numerical), then
we have a system of random variables (X1 , X2 , . . . , Xn )
So, for each column j, $x_{1j}, x_{2j}, \ldots, x_{mj} \overset{iid}{\sim} X_j$

Exploratory data analysis

Consider single variables

Detect the type
Find summary statistics
measure central tendencies (mean ,median, mode) and
dispersion (variance, std, range, IQR)
frequency and relative frequency tables
Visualisation: Histograms, barplots, boxplots, piecharts
consider them two at a time
Find the correlation between them
Scatter plots, contingency tables and plots

Measures of central tendency

To summarise a categorical variable: frequency table

The category with the highest frequency is called the modal category
Bar plots
we may use relative frequencies
To summarise a numerical variable {x1 , x2 , . . . , xn }
Average or mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median x̃: Sort the data in increasing order, and choose the middle value
for odd n, or the average of the two middle values for even n
It is a value such that about 50% of the data is less than x̃ and about 50% of
the data is larger than x̃.
The median is more robust than the average, as it is much less
affected by outliers
Mode: the most frequent item
Measures of statistical dispersion
Variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation $s = \sqrt{s^2}$
IQR: consider the increasingly sorted data
x1 , x2 , x3 , x4 , x5 , x6 , x7
The median x̃ = x4 , which is called the 50th percentile or second
quartile Q2 .
The median of the left side of x̃ is x2 , which is called the 25th
percentile, or the first quartile Q1
The median of the right side of x̃ is x6 , which is called the
75th percentile or the third quartile Q3
IQR = Q3 − Q1
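The summaries above can be sketched in a few lines of Python. This is only an illustration: the data values are hypothetical, and the helper `quartiles` is an assumed name that follows the "median of each half" rule used on this slide (other quartile conventions give slightly different values).

```python
from statistics import mean, median, variance, stdev

def quartiles(data):
    """Q1, Q2, Q3 using the 'median of each half' rule from the slide."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values to the left of the median
    upper = xs[half + (n % 2):]   # values to the right of the median
    return median(lower), median(xs), median(upper)

data = [2, 4, 4, 5, 7, 9, 11]     # hypothetical sample, n = 7
q1, q2, q3 = quartiles(data)
summary = {
    "mean": mean(data),
    "median": q2,
    "variance": variance(data),   # sample variance, divides by n - 1
    "std": stdev(data),
    "IQR": q3 - q1,
}
print(summary)
```

For seven sorted values the function returns x2, x4 and x6, matching the slide.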

Probability Theory

Probability: big picture

Two interpretations of uncertainty

the fraction of times an event occurs
degree of belief about an event
Sources of uncertainty
uncertainty in the data,
uncertainty in the machine learning model, and
uncertainty in the predictions produced by a model.
Quantifying uncertainty requires the idea of a random variable
Associated with the random variable there is a function called
probability distribution
Random variables and their distributions are meant to
describe populations in Statistics.

Sample space, events, and probability

Consider a random experiment or trial

We know all the outcomes
but we do not know which one will happen
outcome is not predictable with certainty in advance
The sample space Ω
the set of all possible outcomes of the experiment
The event space A
A subset of the sample space is called an event A ⊂ Ω
An event A occurs if the outcome of the experiment is a
member of A
A is the collection of all subsets of Ω
A is the power set of Ω
The probability P
With each A ∈ A we associate a number P (A)
measures the probability that the event will occur
The big picture

Our aim is to learn from data

Any data collection has noise (randomness)
We use probability theory (in particular random variables) to
model the data and deal with the noise
Different types of data features can be modelled using
different types of random variables
We first do exploratory data analysis to learn some aspects of the data
Use descriptive statistics to summarise the data
Use visualisation to represent the data efficiently
We may look at features one at a time, or two together

The data

Each variable or feature will be modelled using a random variable

Machine learning is the same as statistical learning

Each column, variable, or feature is treated as a random variable
Here we need to study a system of random variables
Explanatory Variables vs.
Response Variables
supervised and unsupervised
Examples of sample space

If the outcome of an experiment consists in the determination
of the sex of a newborn child, then
Ω = {m, f }

If the experiment consists of the running of a race among the
seven horses having post positions 1, 2, 3, 4, 5, 6, 7, then
Ω = {all orderings of (1, 2, 3, 4, 5, 6, 7)}

Suppose we are interested in determining the amount of
dosage that must be given to a patient until that patient
reacts positively
Ω = (0, ∞)

Random experiments

Toss a fair (or unfair) coin once, twice, and 3 times

Ω1 = {H, T }
Ω2 = {HH, HT, T H, T T }
Ω3 = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }
roll a fair die, and roll two fair dice
Ω4 = {1, 2, 3, 4, 5, 6}
Ω5 = {(i, j)|i, j ∈ {1, 2, 3, 4, 5, 6}}
toss a fair (or unfair) coin n times (n = 5 for example)
Ω6 = {HHHHH, HHHHT, . . .}, |Ω6 | = 2^n
toss a coin until the first head appears
Ω7 = {H, T H, T T H, T T T H, . . .}
number of arrivals to a shop in a given period of time
Ω8 = {0, 1, 2, 3, . . .}
lottery game played with a coin
Ω9 = {H, T }
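Finite sample spaces such as Ω3 and Ω5 can be listed mechanically. The sketch below uses `itertools.product`; the helper name `coin_tosses` is an assumption made for illustration.

```python
from itertools import product

def coin_tosses(n):
    """All outcomes of n coin tosses, written as strings such as 'HTH'."""
    return ["".join(seq) for seq in product("HT", repeat=n)]

omega3 = coin_tosses(3)                               # three tosses
omega5_dice = list(product(range(1, 7), repeat=2))    # two dice: pairs (i, j)

print(omega3)
print(len(omega5_dice))
print(len(coin_tosses(5)) == 2 ** 5)   # |Ω| = 2^n for n tosses
```

The infinite sample spaces (tossing until the first head, number of arrivals) cannot be enumerated this way.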
Random experiments

Number of emails I receive every day Ω = {0, 1, 2, . . .}

Amount of time someone spent in social media in a day
Ω = [0, 24]

The concept of probability
For a random experiment or trial, the probability is the
likelihood of a particular outcome
in tossing two coins, Ω = {hh, ht, th, tt}
If we are interested in the cases where the first coin lands
heads E = {hh, ht}
When all possible outcomes are equally likely,
$P(E) = \frac{\text{number of outcomes in } E}{\text{number of outcomes in the sample space}}$

$P(E) = \frac{2}{4} = 0.5$
Two events are called disjoint or exclusive if they have no
outcome in common
Event E the first coin lands heads: E = {hh, ht}
Event F the first coin lands tails: F = {tt, th}

Algebra of events

Suppose E1 and E2 are events (E1 , E2 ⊂ Ω)

You can make new events by any set operations that you
The union of events E1 ∪ E2
The intersection of events E1 ∩ E2
The difference of events E1 − E2
The complement of an event is another event: $\Omega^c = \emptyset$, $\emptyset^c = \Omega$

Axioms of probability (Kolmogorov)
1 P (Ω) = 1
2 For E ⊂ Ω, 0 ≤ P (E) ≤ 1
3 For two disjoint events E1 , E2 ⊂ Ω and E1 ∩ E2 = ∅

P (E1 ∪ E2 ) = P (E1 ) + P (E2 )

Some extensions
For any two events E1 , E2 ⊂ Ω
P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 )

The collection of events {E1 , E2 , . . . , En } is called mutually
disjoint if Ei ∩ Ej = ∅ for every i ≠ j
For mutually disjoint events {E1 , E2 , . . . , En }

$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$
Some propositions

$1 = P(\Omega) = P(E \cup E^c) = P(E) + P(E^c)$, which implies that

$P(E^c) = 1 - P(E)$

For any two events E1 , E2 ⊂ Ω

$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$


A total of 28 percent of males living in a city smoke
cigarettes, 6 percent smoke cigars, and 3 percent smoke both
cigars and cigarettes. What percentage of males smoke
neither cigars nor cigarettes?
Let E be the event that a randomly chosen male is a cigarette
smoker, P (E) = 0.28
Let F be the event that he is a cigar smoker, P (F ) = 0.06
P (E ∩ F ) = 0.03
Then, the probability this person is either a cigarette or a cigar
smoker is
P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ) = 0.28 + 0.06 − 0.03 = 0.31
So the probability that he smokes neither is 1 − 0.31 = 0.69, that is, 69 percent

Conditional Probability
and Independent Events

Conditional probability

A and B are two events, A ⊂ Ω and B ⊂ Ω

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, defined when P (B) > 0

P (A ∩ B) = P (A|B)P (B)
P (A ∩ B) = P (B|A)P (A)
useful to check when events are dependent or independent
Two events with non-zero probabilities are independent if any one of these equivalent conditions holds:
P (A ∩ B) = P (A)P (B)
P (A|B) = P (A)
P (B|A) = P (B)
If E and F are independent, then so are E and F c .
If we have several independent models, it is better to make an
ensemble model.
Conditional probabilities: example of die tossing

Toss a die Ω = {1, 2, 3, 4, 5, 6}

Event A: the number is even, A = {2, 4, 6}
Event B: the number is greater than 3, B = {4, 5, 6}
Event C: the number is greater than 4, C = {5, 6}
P (A) = 1/2, P (B) = 1/2, P (C) = 1/3
A ∩ B = {4, 6} and P (A ∩ B) = 1/3
A ∩ C = {6} and P (A ∩ C) = 1/6
B ∩ C = {5, 6} and P (B ∩ C) = 1/3
If P (E ∩ F ) = P (E)P (F ), then they are independent.
Otherwise dependent.
Here A and C are independent (1/6 = 1/2 · 1/3), while A and B, and B and C, are dependent
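The die example can be checked by counting outcomes. The sketch below uses the sets A = {2, 4, 6}, B = {4, 5, 6} and C = {5, 6} from the slide; the helper names `prob` and `independent` are assumptions for illustration, and exact fractions avoid rounding issues.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

def independent(e, f):
    """Check the product rule P(E ∩ F) = P(E) P(F)."""
    return prob(e & f) == prob(e) * prob(f)

A = {2, 4, 6}
B = {4, 5, 6}
C = {5, 6}

print(prob(A & B) / prob(B))   # conditional probability P(A|B)
print(independent(A, B))
print(independent(A, C))
print(independent(B, C))
```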

Conditional probabilities: example

Consider a group of 100 people, |Ω| = 100
H: the chosen person is happy, P (H) = 48/100 = 0.48
M: the chosen person is married, P (M ) = 70/100 = 0.7
P (H|M ) = 42/70 = 0.6
P (H|M^c ) = 6/30 = 0.2

Married \ Happy   yes   no   total
yes                42   28      70
no                  6   24      30
total              48   52     100

Consider a group of 100 people, |Ω| = 100
C: the chosen person likes cheese, P (C) = 80/100 = 0.8
D: a second event, with P (D) = 60/100 = 0.6
P (C|D) = 48/60 = 0.8
P (C|D) = P (C), so C and D are independent

C \ D    yes   no   total
yes       48   32      80
no        12    8      20
total     60   40     100
Independence of more than two events

Two events are independent if P (E ∩ F ) = P (E)P (F )

Three events E, F , G are independent if all of the following hold:
P (E ∩ F ∩ G) = P (E)P (F )P (G)
P (E ∩ F ) = P (E)P (F )
P (E ∩ G) = P (E)P (G)
P (F ∩ G) = P (F )P (G)
The events E1 , E2 , . . . , En are said to be independent if for
every subset E1′ , E2′ , . . . , Er′ , r ≤ n, of these events

P (E1′ ∩ E2′ ∩ . . . ∩ Er′ ) = P (E1′ )P (E2′ ) . . . P (Er′ )

Conditional probability: example

A bin contains 5 defective (that immediately fail when put in
use), 10 partially defective (that fail after a couple of hours of
use), and 25 acceptable transistors. A transistor is chosen at
random from the bin and put into use. If it does not
immediately fail, what is the probability it is acceptable?
Solution: Since the transistor did not immediately fail, we know that it
is not one of the 5 defectives and so the desired probability is:
$P\{\text{acceptable} \mid \text{not defective}\} = \frac{P\{\text{acceptable, not defective}\}}{P\{\text{not defective}\}} = \frac{P\{\text{acceptable}\}}{P\{\text{not defective}\}} = \frac{25/40}{35/40} = \frac{5}{7}$
where the second equality follows since the transistor will be both
acceptable and not defective if it is acceptable.

This file is meant for personal use by only.

Consider two events F and E

F = (F ∩ E) ∪ (F ∩ E^c ), the union of two disjoint events

P (F ) = P (F ∩ E) + P (F ∩ E^c ) = P (F |E)P (E) + P (F |E^c )P (E^c )
Consider events E1 , . . . , En such that Ei ∩ Ej = ∅ for i ≠ j and
$\bigcup_{i=1}^{n} E_i = \Omega$
Then, for any event C, $P(C) = \sum_{i=1}^{n} P(E_i)\,P(C \mid E_i)$

The Ei are hypotheses. Exactly one of them will happen
The law of total probability: example
Example: An insurance company believes that people can be
divided into two classes — those that are accident prone and
those that are not. Their statistics show that an
accident-prone person will have an accident at some time
within a fixed 1-year period with probability .4, whereas this
probability decreases to .2 for a non-accident-prone person. If
we assume that 30 percent of the population is accident
prone, what is the probability that a new policy holder will
have an accident within a year of purchasing a policy?
Solution: We obtain the desired probability by first
conditioning on whether or not the policy holder is accident
prone. Let A1 denote the event that the policy holder will
have an accident within a year of purchase; and let A denote
the event that the policy holder is accident prone. Hence, the
desired probability, P (A1 ), is given by P (A1 ) = P (A1 |A)P (A) + P (A1 |A^c )P (A^c ) = (0.4)(0.3) + (0.2)(0.7) = 0.26
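The law of total probability is a weighted sum, which a short sketch makes concrete. The function name `total_probability` is an assumption for illustration; the numbers are those of the insurance example (30% accident prone with accident probability 0.4, 70% not accident prone with accident probability 0.2).

```python
def total_probability(priors, conditionals):
    """P(C) = sum_i P(E_i) * P(C | E_i) for a partition E_1, ..., E_n."""
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("the probabilities of a partition must sum to 1")
    return sum(p * c for p, c in zip(priors, conditionals))

# priors: P(accident prone), P(not accident prone)
# conditionals: P(accident | each class)
p_accident = total_probability([0.3, 0.7], [0.4, 0.2])
print(round(p_accident, 4))
```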
Bayes’ Theorem

Bayes’ Theorem

The essence of Bayes’ Theorem is updating probabilities when
we receive further knowledge or information.
That’s why we are going to have prior and posterior
probabilities here.
Consider event E, you may know P (E). This is your prior
Another event F occurs.
Now, you are interested to see how your prior probability is
going to be changed.

$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F)} = P(E)\,\frac{P(F \mid E)}{P(F)}$

There is a factor that multiplies your
prior P (E) and gives your posterior P (E|F )
Bayes’ Theorem

Prior probability P (E)

Posterior probability P (E|F )
Likelihood, scaled by the evidence P (F ): $\frac{P(F \mid E)}{P(F)}$

BAYES’ Theorem

Let E and F be events. We can write F = (F ∩ E) ∪ (F ∩ E^c )
P (F ) = P (F ∩ E) + P (F ∩ E^c )
= P (F |E)P (E) + P (F |E^c )P (E^c )
= P (F |E)P (E) + P (F |E^c )[1 − P (E)]
The probability of the event F is a weighted average of the
conditional probability of F given that E has occurred and the
conditional probability of F given that E has not occurred
Bayes’ Theorem

P (F ) = P (F |E)P (E) + P (F |E^c )[1 − P (E)]

Bayes’ Theorem

$P(E \mid F) = \frac{P(E \cap F)}{P(F)} = \frac{P(F \mid E)\,P(E)}{P(F)} = \frac{P(F \mid E)\,P(E)}{P(F \mid E)\,P(E) + P(F \mid E^c)\,[1 - P(E)]}$

Bayes’ Theorem: example

Question: A laboratory blood test is 99 percent effective in detecting a
certain disease when it is, in fact, present. However, the test
also yields a “false positive” result for 1 percent of the healthy
persons tested. (That is, if a healthy person is tested, then,
with probability .01, the test result will imply he or she has
the disease.) If .5 percent of the population actually has the
disease, what is the probability a person has the disease given
that his test result is positive?
Solution: Let D be the event that the tested person has the disease and
E the event that his test result is positive. We are given
P (E|D) = 0.99, P (D) = 0.005, P (E|D^c ) = 0.01, P (D^c ) = 0.995.
The desired probability P (D|E) is

$P(D \mid E) = \frac{P(D \cap E)}{P(E)} = \frac{P(E \mid D)\,P(D)}{P(E \mid D)\,P(D) + P(E \mid D^c)\,P(D^c)} = \frac{(0.99)(0.005)}{(0.99)(0.005) + (0.01)(0.995)} \approx 0.3322$
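The blood test calculation can be reproduced with a small function. This is a sketch under the two-hypothesis form of Bayes' theorem; the function and parameter names (`posterior`, `sensitivity`, `false_positive_rate`) are assumptions chosen for readability and stand for P (D|E), P (E|D) and P (E|D^c ).

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(D | E) from P(D), P(E | D) and P(E | D^c) via Bayes' theorem."""
    # Denominator: law of total probability for P(E)
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence

p = posterior(prior=0.005, sensitivity=0.99, false_positive_rate=0.01)
print(round(p, 4))
```

Even with a 99 percent effective test, the posterior is only about one third because the disease is rare.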
BAYES’ Theorem: extension

F1 , F2 , . . . , Fn are mutually exclusive events with

$\bigcup_{i=1}^{n} F_i = \Omega$

In other words, exactly one of the events
F1 , F2 , . . . , Fn must occur
E is another event, and can be written as $E = \bigcup_{i=1}^{n} (E \cap F_i)$
$P(E) = \sum_{i=1}^{n} P(E \cap F_i) = \sum_{i=1}^{n} P(E \mid F_i)\,P(F_i)$
Suppose now that E has occurred
$P(F_j \mid E) = \frac{P(E \cap F_j)}{P(E)} = \frac{P(E \mid F_j)\,P(F_j)}{\sum_{i=1}^{n} P(E \mid F_i)\,P(F_i)}$
Bayes’ Theorem: explanation

Bayes’ formula

$P(F_j \mid E) = \frac{P(E \cap F_j)}{P(E)} = \frac{P(E \mid F_j)\,P(F_j)}{\sum_{i=1}^{n} P(E \mid F_i)\,P(F_i)}$

If we think of the events Fj as being possible “hypotheses”
about some subject matter,
then Bayes’ formula may be interpreted as showing us how
opinions about these hypotheses held before the experiment
[that is, the P (Fj )] should be modified by the evidence of the experiment
P (Fj ) are our priors
P (Fj |E) are our posteriors
Bayes’ Theorem: simple version
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}$
The probabilities involved: P (A|B), P (B|A), P (A), P (Ac ),
P (B|Ac )
We need to know 4 of these to find the fifth one.

Bayes’ theorem: example
Example: A test is 98% effective at detecting HIV. However,
test has a “false positive” rate of 1%. If 0.5% of a country’s
population has HIV, what is the probability having HIV when
the test is positive?
Let E = someone’s test is positive for HIV with this test
Let F = that person actually has HIV
A test is 98% effective at detecting HIV
If someone has HIV, the probability that the test detects it is
P (E|F ) = 0.98
However, the test has a “false positive” rate of 1%
P (E|F^c ) = 0.01
0.5% of a country’s population has HIV
P (F ) = 0.005 and therefore P (F^c ) = 1 − 0.005 = 0.995
What is P (F |E)?
$P(F \mid E) = \frac{P(E \mid F)\,P(F)}{P(E \mid F)\,P(F) + P(E \mid F^c)\,P(F^c)} = \frac{(0.98)(0.005)}{(0.98)(0.005) + (0.01)(0.995)} \approx 0.33$
Bayes’ theorem: the intuition

Conditioning on a positive result

changes the sample space
People who test positive: P (E|F )P (F ) + P (E|F^c )P (F^c )
People who test positive and have HIV: P (E|F )P (F )

Bayes’ theorem: the intuition
Say we have 1000 people
5 have HIV and test positive
985 do not have HIV and test negative
10 do not have HIV and test positive
So about 5 of the 15 people who test positive have HIV, roughly one third


Sharing or publishing the contents in part or full is liable for legal action.
Random Variables

Random variables

Tools to study the randomness in data

A random variable assigns numbers to the outcomes of a random experiment
Types of random variables
X is the number of heads in 3 tosses of a coin; it is a discrete
random variable
X is the reading of a scale in weighing people, it is a
continuous random variable

Example of a random variable

Experiment: Toss a coin n times. X is the number of heads

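n = 1: $X = \begin{cases} 1 & \text{if } H \\ 0 & \text{if } T \end{cases}$
n = 2: $X = \begin{cases} 2 & \text{if } HH \\ 1 & \text{if } HT \\ 1 & \text{if } TH \\ 0 & \text{if } TT \end{cases}$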

Example of a random variable

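Experiment: Roll a die. X is the square of the number

$X = \begin{cases} 1 & \text{if } 1 \\ 4 & \text{if } 2 \\ 9 & \text{if } 3 \\ 16 & \text{if } 4 \\ 25 & \text{if } 5 \\ 36 & \text{if } 6 \end{cases}$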

The definition

Consider Ω to be a sample space of a random experiment.

A random variable is defined as a function X : Ω → R

For all ω ∈ Ω, X(ω) is a number
The set of all possible values that a random variable X can take is called
the support set of that random variable and is denoted by SX
If we conduct this experiment several times, we will have a
sample of this random variable.
x1 = X(ω1 ), x2 = X(ω2 ), . . . , xn = X(ωn ) or simply

x1 , x2 , . . . , xn

The notion of probability distribution

To represent a random variable concisely.

n = 2: toss a fair coin twice

Outcome (ω)   X(ω)   Probability of outcome, P (ω)
HH             2      1/4
HT             1      1/4
TH             1      1/4
TT             0      1/4

P (X = 0) = P ({ω ∈ Ω|X(ω) = 0}) = P ({T T }) = 1/4
P (X = 1) = P ({ω ∈ Ω|X(ω) = 1}) = P ({T H, HT }) = 2/4
P (X = 2) = P ({ω ∈ Ω|X(ω) = 2}) = P ({HH}) = 1/4

x            0     1     2
P (X = x)   1/4   2/4   1/4
Using the probability distribution

x            0     1     2
P (X = x)   1/4   2/4   1/4

P (X ≥ 1) = P (X = 1) + P (X = 2) = 1/2 + 1/4 = 3/4
Probability mass function (PMF) pX (x) = p(x) = P (X = x)
Let X be a random variable with SX = {x1 , x2 , . . . , xn } and
probabilities {p1 , p2 , . . . , pn } where P (X = xj ) = pj for all j.
pj ≥ 0 for all j
$\sum_{j=1}^{n} p_j = 1$
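A PMF can be built directly from the definition P (X = x) = P ({ω ∈ Ω | X(ω) = x}). The sketch below does this for the number of heads in n tosses of a fair coin; the function name `pmf_number_of_heads` is an assumption for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def pmf_number_of_heads(n):
    """PMF of X = number of heads in n tosses of a fair coin."""
    outcomes = list(product("HT", repeat=n))     # equally likely outcomes
    counts = Counter(seq.count("H") for seq in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

pmf = pmf_number_of_heads(2)
p_at_least_one = sum(p for x, p in pmf.items() if x >= 1)
print(pmf)
print(p_at_least_one)   # P(X >= 1)
```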

Random variables

Discrete random variable: whose set of possible values can be

written either as a finite sequence x1 , . . . , xn , or as an infinite
sequence x1 , x2 , . . .
A random variable whose set of possible values is the set of
nonnegative integers is a discrete random variable
Continuous random variable: random variables that take on a
continuum of possible values
The random variable denoting the lifetime of a car, taking values in an interval (t1 , t2 )

Probability mass function

For a discrete random variable X with values

SX = {x1 , . . . , xn }: the probability mass function p(x) is defined by
p(x) = P {X = x}
p(xi ) > 0, i = 1, 2, . . . , n
p(x) = 0, for other values of x
$\sum_{i=1}^{n} p(x_i) = 1$

Expectation of a Random Variable

Expected value of a random variable
You know the concept of average of x1 , x2 , . . . , xn which is
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
For a random variable the values have different probabilities
and importance.
We want values with higher probability to contribute more to the average
For a discrete random variable X with finite support set
x1 , x2 , . . . , xn and probabilities p1 , p2 , . . . , pn
$\mu = E[X] = \sum_{i=1}^{n} x_i p_i = \sum_{i=1}^{n} x_i P\{X = x_i\}$
For a discrete random variable X with infinite support set
x1 , x2 , . . . and probabilities p1 , p2 , . . .

$\mu = E[X] = \sum_{i=1}^{\infty} x_i p_i$ (this may not exist)
Example of expectation of a random variable
for a random variable X
x x1 x2 ... xn
P (X = x) p1 p2 ... pn
The expectation is
$\mu = E[X] = \sum_{i=1}^{n} x_i p_i$
Lottery game:
win $5 with probability 0.1
lose $1 with probability 0.9
The related random variable
x 5 -1
P (X = x) 0.1 0.9
Lottery game:
win $100 with probability 0.5
lose $100 with probability 0.3
lose $50 with probability 0.2
The related random variable
x            100   -100   -50
P (X = x)    0.5    0.3   0.2
µ = E[X] = 100(0.5) + (−100)(0.3) + (−50)(0.2) = 10
dollars per game
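The two lottery games can be compared with a small helper that computes the probability-weighted average. The function name `expectation` and the dictionary representation of a PMF are assumptions made for this sketch.

```python
def expectation(pmf):
    """E[X] = sum of x * P(X = x); pmf maps each value x to P(X = x)."""
    if abs(sum(pmf.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())

game1 = {5: 0.1, -1: 0.9}                  # win $5 or lose $1
game2 = {100: 0.5, -100: 0.3, -50: 0.2}    # win $100, lose $100 or lose $50

print(round(expectation(game1), 6))   # expected dollars per game, game 1
print(round(expectation(game2), 6))   # expected dollars per game, game 2
```

The first game loses 40 cents per play on average, while the second gains 10 dollars per play on average.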

Expectation example
Find E[X] where X is the outcome when we roll a fair die.
Since p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6, then
E[X] =
(1)p(1) + (2)p(2) + (3)p(3) + (4)p(4) + (5)p(5) + (6)p(6) = 7/2
Note that, for this example, the expected value of X is not a
value that X could possibly assume.
E[X] is the average value of X in a large number of repetitions of the experiment
If I is an indicator random variable for the event A, that is, if
$I = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{if } A \text{ does not occur} \end{cases}$

then E[I] = 1 · P (A) + 0 · P (A^c ) = P (A)

The expectation of the indicator random variable for the event
A is equal to the probability that A occurs
Discrete Random Variables

Discrete random variables

With finite support set
x            x1   x2   ...   xn
P (X = x)    p1   p2   ...   pn
pi ≥ 0 for all i
$\sum_{i=1}^{n} p_i = 1$
$\mu = E[X] = \sum_{i=1}^{n} x_i p_i$

With infinite support set
x            x1   x2   ...   xn   ...
P (X = x)    p1   p2   ...   pn   ...
pi ≥ 0 for all i
$\sum_{i=1}^{\infty} p_i = 1$
$\mu = E[X] = \sum_{i=1}^{\infty} x_i p_i$ (the expectation may not exist)

Discrete random variable with infinite support set

Experiment: Toss a fair coin until the first head

Ω = {H, T H, T T H, T T T H, . . .}

Define X as the number of tosses until the first head
(including the last one)
P (X = 1) = P ({H}) = 1/2
P (X = 2) = P ({T H}) = 1/4

x            1     2       ...   k       ...
P (X = x)   1/2   1/2^2    ...   1/2^k   ...

$\sum_{k=1}^{\infty} \frac{1}{2^k} = 1$

Bernoulli trial

Independent repeated trials of an experiment with exactly two
possible outcomes are called Bernoulli trials.
There is a coin, P (H) = p, P (T ) = 1 − p
A sequence of independent coin tosses
n = 2, H1 : first toss is heads, H2 : second toss is heads;
these two events are independent
P (HH) = P (H1 ∩ H2 ) = p^2
P (T T ) = P (H1^c ∩ H2^c ) = (1 − p)^2
P (T H) = P (HT ) = p(1 − p)
Arbitrary n, |Ω| = 2^n
P (HHT HT ) = p^3 (1 − p)^2
P (a specific sequence with m heads and n − m tails) = p^m (1 − p)^(n−m)
Binomial distribution

Random experiment: Toss a coin with P (H) = p for n times

X is the number of heads
SX = {0, 1, 2, . . . , n}
Consider n = 5

P (X = 0) = P ({T T T T T }) = (1 − p)^5
P (X = 1) = P ({HT T T T, T HT T T, T T HT T, T T T HT, T T T T H})
$= 5p(1-p)^4 = \binom{5}{1} p (1-p)^4$
$P(X = 2) = \binom{5}{2} p^2 (1-p)^3$

For B(n, p): $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for k ∈ SX
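The binomial formula translates directly into code using `math.comb`. In the sketch below the function name `binomial_pmf` and the value p = 0.3 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.3   # p = 0.3 is an arbitrary illustrative value
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print([round(q, 5) for q in pmf])
print(round(sum(pmf), 12))   # the probabilities over SX sum to 1
```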
Geometric and Poisson distributions with infinite support set

Geometric distribution
Toss a coin with P (H) = p until
the first head
X is the number of tosses
(including the last one)
P (X = k) = P ({T T . . . T H}) = p(1 − p)^(k−1) , k ≥ 1

Poisson distribution
X is the number of arrivals
$P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$ for
parameter λ > 0, k ≥ 0
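Both distributions have infinite support, so a program can only add up finitely many terms. The sketch below checks that the partial sums approach 1; the function names and the parameter values p = 0.5 and λ = 2 are assumptions chosen for illustration.

```python
from math import exp, factorial

def geometric_pmf(k, p):
    """P(X = k) = p (1 - p)^(k - 1): first head on toss k, for k >= 1."""
    return p * (1 - p) ** (k - 1)

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^(-lam) / k!, for k >= 0."""
    return lam**k * exp(-lam) / factorial(k)

# Partial sums over the first 60 terms of each support
geo_total = sum(geometric_pmf(k, 0.5) for k in range(1, 60))
poi_total = sum(poisson_pmf(k, 2.0) for k in range(0, 60))
print(geo_total, poi_total)
```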

Variance of a Random Variable

Expectation is not enough
Given a random variable X along with its probability
distribution function
We want to summarize the essential properties of the mass
function by certain suitably defined measures.
One such measure would be E[X], the expected value of X
E[X] yields the weighted average of the possible values of X,
it does not tell us anything about the variation, or spread, of
these values
Example: all have the same expectation
W = 0 with probability 1
$Y = \begin{cases} -1 & \text{with probability } 1/2 \\ 1 & \text{with probability } 1/2 \end{cases}$
$Z = \begin{cases} -100 & \text{with probability } 1/2 \\ 100 & \text{with probability } 1/2 \end{cases}$
We want to measure how far a random variable deviates from its mean
We want to measure X − E[X]
E[X − E[X]] = 0
Let’s consider E[(X − E[X])^2 ], which is the variance
Lottery game 1: µ = E[X] = 5(0.1) + (−1)(0.9) = −0.4
dollars per game

x                  5       -1
P (X = x)          0.1     0.9
X − E[X]           5.4     −0.6
(X − E[X])^2       29.16   0.36

E[(X − E[X])2 ] = (29.16)(0.1) + (0.36)(0.9) = 3.24

The definition:

Var(X) = E[(X − E[X])2 ] = E[(X − µ)2 ]

If X is a random variable with mean µ, then the variance of
X, denoted by Var(X), is defined by
Var(X) = E[(X − µ)2 ]
Also, Var(X) = E[X 2 ] − µ2
Example: Compute Var(X) when X represents the outcome
when we roll a fair die.
Since P {X = i} = 1/6, i = 1, 2, 3, 4, 5, 6, we obtain
$E[X^2] = \sum_{i=1}^{6} i^2 P\{X = i\} = 1^2(\tfrac{1}{6}) + 2^2(\tfrac{1}{6}) + 3^2(\tfrac{1}{6}) + 4^2(\tfrac{1}{6}) + 5^2(\tfrac{1}{6}) + 6^2(\tfrac{1}{6}) = \frac{91}{6}$
We computed the mean before, µ = E[X] = 7/2, so
$\mathrm{Var}(X) = E[X^2] - \mu^2 = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{35}{12}$
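Both variance examples (the fair die and lottery game 1) can be verified with exact fractions. The helper names `expectation` and `variance` and the dictionary form of the PMF are assumptions made for this sketch.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] for a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2], computed directly from the PMF."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {i: Fraction(1, 6) for i in range(1, 7)}
lottery = {5: Fraction(1, 10), -1: Fraction(9, 10)}

var_die = variance(die)
var_lottery = variance(lottery)
# Shortcut formula Var(X) = E[X^2] - mu^2 for the die
shortcut = expectation({x * x: p for x, p in die.items()}) - expectation(die) ** 2

print(var_die, var_lottery, shortcut)
```

The lottery variance prints as 81/25, which equals the 3.24 obtained on the slide.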
Systems of Random Variables

Systems of random variables

A random experiment happens, and we have several random
variables related to this experiment. We want to study them.
We find Ω and compute P (ω) for every ω ∈ Ω
We can define many real functions on Ω. Each is a random variable
X : Ω → R and Y : Ω → R
Considering these together gives a system of random variables

(X, Y )

We want to know whether we can say something about Y
based on our knowledge about X and vice versa
We may consider many random variables together in a larger
system (X1 , . . . , Xn )
Linear transformations

Let X be a random variable and c ∈ R

Y =X +c
Example Y = X + 2 and PMF

x            -1    3     4
P (X = x)    0.2   0.5   0.3

y            1     5     6
P (Y = y)    0.2   0.5   0.3

[Figure: bar plots of P (X = x) against x and P (Y = y) against y]

Linear transformations

Let X be a random variable and c ∈ R

Y = cX
Example Y = 2X and PMF

x            -1    3     4
P (X = x)    0.2   0.5   0.3

y            -2    6     8
P (Y = y)    0.2   0.5   0.3

[Figure: bar plots of P (X = x) against x and P (Y = y) against y]

properties of expected values

E[aX + b] = E[aX] + E[b] = aE[X] + b

E[X + Y ] = E[X] + E[Y ]
E[aX + bY ] = aE[X] + bE[Y ]
For n random variables X1 , X2 , . . . , Xn
$E\left[\sum_{i=1}^{n} c_i X_i\right] = \sum_{i=1}^{n} c_i E[X_i]$

Transformation of a random variable
A function of a random variable X is itself a random variable: for
g : R → R,
Y = g(X)
g(x) = ax + b, then Y = g(X) = aX + b
g(x) = x2 , then Y = X 2
g(x) = (x − E[X])2 , then Y = (X − E[X])2
g(x) = sin(x), then Y = sin(X)
Y = g(X) where Y (ω) = g(X(ω))

x            x1   x2   ...   xn
P (X = x)    p1   p2   ...   pn

y            g(x1 )   g(x2 )   ...   g(xn )
P (Y = y)    p1       p2       ...   pn

$E[X] = \sum_{i=1}^{n} x_i p_i$
$E[Y] = E[g(X)] = \sum_{i=1}^{n} g(x_i)\,p_i$
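The transformation rule and the linearity of expectation can be checked numerically. The sketch below reuses the PMF of X from the linear transformation examples and the map g(x) = 2x + 2, which is an arbitrary illustrative choice; `transform_pmf` and `expectation` are assumed helper names.

```python
from collections import defaultdict

def transform_pmf(pmf, g):
    """PMF of Y = g(X); values of x mapped to the same y have their probabilities added."""
    out = defaultdict(float)
    for x, p in pmf.items():
        out[g(x)] += p
    return dict(out)

def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

pmf_x = {-1: 0.2, 3: 0.5, 4: 0.3}
pmf_y = transform_pmf(pmf_x, lambda x: 2 * x + 2)          # Y = 2X + 2

e_direct = expectation(pmf_y)                              # E[Y] from the PMF of Y
e_lotus = sum((2 * x + 2) * p for x, p in pmf_x.items())   # sum of g(x_i) p_i
e_linear = 2 * expectation(pmf_x) + 2                      # a E[X] + b
print(pmf_y, e_direct, e_lotus, e_linear)
```

Adding probabilities in `transform_pmf` matters when g is not one-to-one, for example g(x) = x^2 on a support containing both −1 and 1.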
Properties of variance

For a random variable X

Var(X) = E[(X − E[X])2 ]

from another perspective

g(x) = (x − E[X])^2 , then Y = g(X) = (X − E[X])^2
$E[Y] = E[(X - E[X])^2] = \sum_{i=1}^{n} g(x_i)\,p_i = \sum_{i=1}^{n} (x_i - \mu)^2 p_i$
Var(X) = E[Y ]
Var(aX + b) = a2 Var(X), a, b ∈ R
adding a constant will shift the cloud of data to the left or
right, it does not change the variability
What about Var(X1 + X2 )?

Properties of variance

Var(aX + b) = a^2 Var(X)
if a = 0, Var(b) = 0
The quantity √Var(X) is called the standard deviation of X.
$\mathrm{std}(X) = \sqrt{\mathrm{Var}(X)}$

$\mathrm{Var}(X_1 + X_2) = E[((X_1 + X_2) - E[X_1 + X_2])^2]$
$= E[((X_1 - E[X_1]) + (X_2 - E[X_2]))^2]$
$= E[(X_1 - E[X_1])^2 + (X_2 - E[X_2])^2 + 2(X_1 - E[X_1])(X_2 - E[X_2])]$
$= E[(X_1 - E[X_1])^2] + E[(X_2 - E[X_2])^2] + 2E[(X_1 - E[X_1])(X_2 - E[X_2])]$
$= \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + 2E[(X_1 - E[X_1])(X_2 - E[X_2])]$

The covariance of two random variables X and Y , written
Cov(X, Y ), is defined by

Cov(X, Y ) = E[(X − µX )(Y − µY )]

Equivalently, Cov(X, Y ) = E[XY ] − E[X]E[Y ]
Cov(X, Y ) = Cov(Y, X)
Cov(aX, Y ) = aCov(X, Y )
Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y )
Cov(X + a, Y ) = Cov(X, Y + b) = Cov(X, Y )
If Y = aX + b, then
Cov(X, Y ) = Cov(X, aX + b) = aCov(X, X) = aVar(X)

Definition: Cov(X, Y ) = E [(X − E[X])(Y − E[Y ])]

Covariance measures association between X and Y
It is strongly related to the dependence of X and Y
If X and Y are independent then Cov(X, Y ) = 0, because E[XY ] = E[X]E[Y ] for independent random variables:
E [(X − E[X])(Y − E[Y ])]

= E[XY − XE[Y ] − Y E[X] + E[X]E[Y ]]

= E[XY ] − E[Y ]E[X] − E[Y ]E[X] + E[X]E[Y ] = E[XY ] − E[X]E[Y ] = 0

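Back to variance

Var(X + X) = Var(2X) = 4Var(X) ≠ Var(X) + Var(X)

$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \mathrm{Cov}(X_i, X_j)$
for n = 2,
Var(X + Y ) = Var(X) + Var(Y ) + Cov(X, Y ) + Cov(Y, X)

If X and Y are independent random variables, then
Cov(X, Y ) = 0
and, for independent X1 , . . . , Xn ,
$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i)$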

Properties of covariance

Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )

If X and Y are independent, then Cov(X, Y ) = 0
If Cov(X, Y ) = 0, we cannot say that X and Y are independent.
Example (joint distribution in the table below):
P (X = 0 ∩ Y = 1) = 0 ≠ P (X = 0)P (Y = 1), so they are not independent
E[X] = 0, E[Y ] = 0
E[XY ] = 0
Cov(X, Y ) = 0

Y \X      -1     0     1    margin
1         1/3    0    1/3    2/3
-2         0    1/3    0     1/3
margin    1/3   1/3   1/3     1
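The claim that zero covariance does not imply independence can be checked from the joint table above. In this sketch the joint PMF is stored as a dictionary keyed by (x, y), and the helper `E` is an assumed name for the expectation of a function of (X, Y ).

```python
from fractions import Fraction

third = Fraction(1, 3)
# Non-zero entries of the joint PMF, keyed by (x, y)
joint = {(-1, 1): third, (1, 1): third, (0, -2): third}

def E(f):
    """Expectation of f(X, Y) under the joint PMF."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

p_x0 = sum(p for (x, _), p in joint.items() if x == 0)   # marginal P(X = 0)
p_y1 = sum(p for (_, y), p in joint.items() if y == 1)   # marginal P(Y = 1)
p_x0_y1 = joint.get((0, 1), Fraction(0))                 # joint P(X = 0, Y = 1)

print(cov)                      # zero covariance
print(p_x0_y1 == p_x0 * p_y1)   # product rule fails, so not independent
```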

It can be shown that a positive value of Cov(X, Y ) is an
indication that Y tends to increase as X does,
a negative value indicates that Y tends to decrease as X increases
The strength of the relationship between X and Y is
indicated by the correlation between X and Y

$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$

It can be shown that this quantity always has a value between -1 and +1


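Correlation of two random variables

$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
If there is a perfect linear relationship between X and Y ,
Y = aX + b with a ≠ 0, then
$\mathrm{Corr}(X, Y) = \mathrm{Corr}(X, aX + b) = \frac{a\,\mathrm{Var}(X)}{\sqrt{\mathrm{Var}(X)\,a^2\,\mathrm{Var}(X)}} = \frac{a}{|a|} = \pm 1$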
Corr(aX, Y ) = Corr(X, Y ) if a > 0
Corr(X, Y ) ∈ [−1, 1]
Corr(X, Y ) = 0 then two random variables are uncorrelated


Sum of random variables

X : Ω → R, Y : Ω → R, Z : Ω → R
Z = X + Y then E[Z] = E[X] + E[Y ]
Suppose Var(X) > 0
Y = −X then Var(Y ) = Var(X)
Z = X + Y , Var(Z) = Var(X − X) = Var(0) = 0
Var(X) + Var(Y ) = 2Var(X)
So in this case Var(X + Y ) ≠ Var(X) + Var(Y )
For the special case that X and Y are independent random
variables, Var(X + Y ) = Var(X) + Var(Y )


Joint Probability

Joint probability distribution
To express interactions between two random variables
Experiment: Tossing two fair coins Ω = {HH, HT, T H, T T }
Let’s define three random variables over this sample space
X = 1 if the 1st toss is H, 0 else
Y = 1 − X
Z = 1 if the 2nd toss is H, 0 else

x            1     0
P (X = x)    0.5   0.5
P (Y = x)    0.5   0.5
P (Z = x)    0.5   0.5

P (X = 0 ∩ Y = 0) = P (X = 0 ∩ X = 1) = 0 impossible!
P (X = 0 ∩ Z = 0) = P ({T T }) = 1/4
Although Y and Z have the same distribution, they have different
interactions with X. We use the joint distribution to express interactions
Joint probability distribution

x            x1   x2   ...   xm
P (X = x)    p1   p2   ...   pm

y            y1   y2   ...   yn
P (Y = y)    q1   q2   ...   qn

P (X = xi ∩ Y = yj ) = pij for i = 1, . . . , m and j = 1, . . . , n

$p_{ij} \ge 0$ and $\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} = 1$

X = 1 if the 1st toss is H, 0 else, and Y = 1 − X

X\Y    0                      1
0      P (X = 0 ∩ Y = 0)      P (X = 0 ∩ Y = 1)
1      P (X = 1 ∩ Y = 0)      P (X = 1 ∩ Y = 1)

X\Y    0     1
0      0     0.5
1      0.5   0

X = 1 if the 1st toss is H, 0 else, and Z = 1 if the 2nd toss is H, 0 else

X\Z    0      1
0      0.25   0.25
1      0.25   0.25
Marginal probabilities

Consider random variable X with SX = {x1 , . . . , xm } and Y
with SY = {y1 , . . . , yn }
We know P (X = xi ∩ Y = yj ) = pij
P (X = xi ) = P (X = xi ∩ Y = y1 ) + . . . + P (X = xi ∩ Y = yn )
= pi1 + . . . + pin

X\Y    0     1
0      0     0.5
1      0.5   0

P (X = 0) = 0 + 0.5 = 0.5
P (X = 1) = 0.5 + 0 = 0.5
P (Y = 0) = 0 + 0.5 = 0.5
P (Y = 1) = 0.5 + 0 = 0.5

X\Z    0      1
0      0.25   0.25
1      0.25   0.25

P (X = 0) = 0.25 + 0.25 = 0.5
P (X = 1) = 0.25 + 0.25 = 0.5
P (Z = 0) = 0.25 + 0.25 = 0.5
Independent random variables

X\Y   0     1              X\Z   0      1
0     0     0.5            0     0.25   0.25
1     0.5   0              1     0.25   0.25

In this example, if we know X we can tell exactly what happens
to Y. But knowledge about X does not say anything about Z.
Consider two random variables defined on the same sample
space Ω: random variable X with SX = {x1, . . . , xm} and Y
with SY = {y1, . . . , yn}
If P(Y = yj | X = xi) = P(Y = yj) for all i and j, then X
and Y are independent.
Equivalently, P(X = xi ∩ Y = yj) = P(X = xi)P(Y = yj) for all i and j
Experiment: three tosses of a fair coin
X is the number of heads in the 1st and 2nd tosses
Y is the number of heads in the 2nd and 3rd tosses
Are these random variables independent?
P(X = 0 ∩ Y = 0) = P({TTT}) = 1/8
P(X = 1 ∩ Y = 0) = P({HTT}) = 1/8
P(X = 2 ∩ Y = 0) = 0, impossible
P(X = 1 ∩ Y = 1) = P({HTH, THT}) = 1/4
P(X = 0 ∩ Y = 0) = 1/8 ≠ P(X = 0)P(Y = 0) = (1/4)(1/4) = 1/16, so X and Y are not independent

X\Y        0     1     2     marginal
0          1/8   1/8   0     1/4
1          1/8   1/4   1/8   1/2
2          0     1/8   1/8   1/4
marginal   1/4   1/2   1/4   1
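
The three-toss table can be reproduced by brute-force enumeration, and independence can be tested cell by cell. This is a sketch under the stated setup; variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(t) for t in product("HT", repeat=3)]  # 8 equally likely outcomes
p = Fraction(1, 8)

joint = {}
for w in omega:
    x = w[:2].count("H")   # heads in tosses 1 and 2
    y = w[1:].count("H")   # heads in tosses 2 and 3
    joint[(x, y)] = joint.get((x, y), 0) + p

# Marginals: sum the joint table over the other variable.
px = {x: sum(joint.get((x, y), 0) for y in range(3)) for x in range(3)}
py = {y: sum(joint.get((x, y), 0) for x in range(3)) for y in range(3)}

# Independence requires every cell to equal the product of its marginals.
independent = all(joint.get((x, y), 0) == px[x] * py[y]
                  for x in range(3) for y in range(3))
print(joint[(0, 0)], px[0] * py[0], independent)
```

The shared second toss is what couples X and Y: the cell (0, 0) has probability 1/8, not 1/16.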
Independent random variables

x          x1  x2  ...  xm        y          y1  y2  ...  yn
P(X = x)   p1  p2  ...  pm        P(Y = y)   q1  q2  ...  qn

If two random variables are independent, then

E[XY] = E[X]E[Y]

$E[XY] = \sum_{i=1}^{m} \sum_{j=1}^{n} x_i y_j P(X = x_i \cap Y = y_j)$
$= \sum_{i} \sum_{j} x_i y_j P(X = x_i) P(Y = y_j)$
$= \sum_{i} \sum_{j} x_i y_j p_i q_j = \left(\sum_{i} x_i p_i\right)\left(\sum_{j} y_j q_j\right) = E[X]E[Y]$
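
The product rule for expectations can be checked on the two-coin variables from earlier in this topic (X for the 1st toss, Y = 1 − X, Z for the 2nd toss). The helper `expect` is a hypothetical name used only in this sketch.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))   # two fair tosses, 4 outcomes
p = Fraction(1, 4)

def expect(g):
    """E[g] for a function g of the outcome, on the uniform sample space."""
    return sum(g(w) * p for w in omega)

X = lambda w: 1 if w[0] == "H" else 0
Y = lambda w: 1 - X(w)
Z = lambda w: 1 if w[1] == "H" else 0

e_xz = expect(lambda w: X(w) * Z(w))
e_xy = expect(lambda w: X(w) * Y(w))
print(e_xz, expect(X) * expect(Z))   # equal: X and Z are independent
print(e_xy, expect(X) * expect(Y))   # differ: Y is determined by X
```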

Continuous random

Continuous random variables

Discrete random variables take distinct, separated values

Continuous random variables can take all values in a continuum,
like an interval
Temperature, height, weight, time
We can never measure such a variable exactly. Even the best
scales have limited accuracy.
Saying something weighs 10 grams ± 0.1 grams means the exact value
is in [10 − 0.1, 10 + 0.1], or in general [a − ε, a + ε]
Therefore, for a continuous random variable X,
P(X = a) = 0

Probability density function
X is a continuous random variable
fX(x) is its probability density function, which satisfies
$f_X(x) \ge 0$
$\int_{-\infty}^{+\infty} f_X(x)\,dx = 1$
$P(X \in [a, b]) = P(a \le X \le b) = \int_a^b f_X(x)\,dx$
$P(X = a) = P(X \in [a, a]) = \int_a^a f_X(x)\,dx = 0$


The cumulative distribution
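$F(a) = P\{X \le a\} = P\{X \in (-\infty, a]\} = \int_{-\infty}^{a} f(x)\,dx$

$\frac{dF(a)}{da} = f(a)$
P{a < X < b} = area under f between a and b (the shaded region in the figure)
For example,
$f(x) = \begin{cases} e^{-x}, & x \ge 0 \\ 0, & x < 0 \end{cases}$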
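
For the example density, the area interpretation can be checked numerically. The sketch below uses a simple midpoint rule (an assumption of this illustration, any quadrature would do) and the closed-form CDF F(a) = 1 − e^{−a} for a ≥ 0, obtained by integrating the density.

```python
import math

def f(x):
    """Density from the example: e^{-x} for x >= 0, zero otherwise."""
    return math.exp(-x) if x >= 0 else 0.0

def F(a):
    """Closed-form CDF of this density: 1 - e^{-a} for a >= 0."""
    return 1.0 - math.exp(-a) if a >= 0 else 0.0

def integrate(g, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * h) for i in range(steps)) * h

a, b = 0.5, 2.0
area = integrate(f, a, b)          # P(a < X < b) as an area under f
total = integrate(f, 0.0, 50.0)    # the tail beyond 50 is negligible
print(area, F(b) - F(a), total)
```

The numerical area matches F(b) − F(a), and the total area under the density is 1 up to rounding.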


Uniform continuous random variable
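A continuous random variable X is uniform on an interval [a, b] if
$f_X(x) = \frac{1}{b-a}$ for $x \in [a, b]$, and $f_X(x) = 0$ otherwise
we can easily check
$f_X(x) \ge 0$
$\int_{-\infty}^{+\infty} f_X(x)\,dx = \int_a^b \frac{1}{b-a}\,dx = 1$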


Cumulative distribution function
X is a continuous or discrete random variable
The cumulative distribution function is defined as

FX(a) = P(X ≤ a)

For a continuous random variable
$F_X(a) = P(X \le a) = \int_{-\infty}^{a} f_X(x)\,dx$
For a discrete random variable
$F_X(a) = P(X \le a) = \sum_{x \le a} p_X(x)$


Properties of CDF

FX (x) = P (X ≤ x)
FX(x) is non-decreasing
$\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to +\infty} F_X(x) = 1$
FX (x) ∈ [0, 1]
Relationship between PDF and CDF
X is a random variable with CDF FX (x) and PDF fX (x)
P(X ∈ [a, b]) = P(X ≤ b) − P(X ≤ a) = FX(b) − FX(a) for continuous X, since P(X = a) = 0
$P(X \in [a, b]) = \int_a^b f_X(x)\,dx$
$\int_a^b f_X(x)\,dx = F_X(b) - F_X(a)$ means FX(x) is an antiderivative of fX(x)
$f_X(x) = F_X'(x)$


Uniform distribution on [a, b], X ∼ U (a, b)

$f_X(x) = \begin{cases} \frac{1}{b-a}, & x \in [a, b] \\ 0, & \text{otherwise} \end{cases}$
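
As a worked step, integrating this density gives the CDF of the uniform distribution. The derivation below follows directly from the definition of the CDF as the integral of the density up to x; the dummy variable t is introduced here.

```latex
F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt =
\begin{cases}
0, & x < a \\[4pt]
\displaystyle\int_a^x \frac{1}{b-a}\,dt = \frac{x-a}{b-a}, & a \le x \le b \\[4pt]
1, & x > b
\end{cases}
```

The CDF rises linearly from 0 at x = a to 1 at x = b, and its derivative on (a, b) is 1/(b − a), recovering the density.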


Cumulative distribution function: example

Question: Suppose the random variable X has distribution function

$F(x) = \begin{cases} 0, & x \le 0 \\ 1 - e^{-x^2}, & x > 0 \end{cases}$
What is the probability that X exceeds 1?
Solution: The desired probability is computed as follows:

$P\{X > 1\} = 1 - P\{X \le 1\} = 1 - F(1) = e^{-1} \approx 0.368$
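
The same calculation in code, as a small sketch. The function name `F` mirrors the slide; the second probability, over the interval (0.5, 1], is an extra illustration not taken from the slides.

```python
import math

def F(x):
    """CDF from the example: 0 for x <= 0, 1 - exp(-x^2) for x > 0."""
    return 0.0 if x <= 0 else 1.0 - math.exp(-x * x)

p_exceeds_1 = 1.0 - F(1.0)           # P(X > 1) = 1 - F(1)
p_between = F(1.0) - F(0.5)          # P(0.5 < X <= 1) = F(1) - F(0.5)
print(p_exceeds_1, math.exp(-1), p_between)
```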


Normal (Gaussian) distribution
X is a normally distributed random variable, X ∼ N(µ, σ²)
The PDF with parameters (µ, σ²)
$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

Standard normal distribution when (µ = 0, σ² = 1),

Z ∼ N(0, 1)
$f_Z(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$


Normal distribution and transformations

The standard normal distribution is obtained from a normal
distribution using a linear transformation
X ∼ N(µ, σ²)
$Z = \frac{X - \mu}{\sigma} = \frac{1}{\sigma}X - \frac{\mu}{\sigma}$
Sometimes data do not have a normal distribution, and we apply a
nonlinear transformation to make them look closer to normal.
For example, the log transformation Y = log(X)
The new variable may have a normal distribution
Other examples: Y = 1/X, Y = X², Y = √X
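
Both kinds of transformation can be tried on simulated data. Everything in this sketch is hypothetical: the parameters µ = 170 and σ = 8, the right-skewed sample built as the exponential of a normal variable, and the use of mean minus median as a rough skewness indicator.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical sample from N(mu, sigma^2) with mu = 170, sigma = 8.
mu, sigma = 170.0, 8.0
xs = [random.gauss(mu, sigma) for _ in range(10_000)]
zs = [(x - mu) / sigma for x in xs]            # linear: Z = (X - mu) / sigma

# Hypothetical right-skewed sample: exp of a normal variable.
skewed = [math.exp(random.gauss(0.0, 1.0)) for _ in range(10_000)]
logged = [math.log(v) for v in skewed]          # nonlinear: Y = log(X)

print(statistics.fmean(zs), statistics.pstdev(zs))            # near 0 and 1
print(statistics.fmean(skewed) - statistics.median(skewed))   # clearly positive: skewed right
print(statistics.fmean(logged) - statistics.median(logged))   # near 0: roughly symmetric
```

After standardising, the sample has mean near 0 and standard deviation near 1. After the log transformation, the gap between mean and median of the skewed sample nearly vanishes.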


Approximating PDFs using histograms

Suppose X is a random variable, but we only have a sample from it

You can think of a column in your data matrix
x1, x2, . . . , xn, an independent sample from X
The histogram of the sample, scaled to have total area 1, approximates the PDF of X
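
A density-scaled histogram can be computed by hand. The sketch below assumes equal-width bins and a simulated normal sample standing in for a data column; the function name `histogram_density` and the bin count are illustrative choices.

```python
import random

random.seed(2)
sample = [random.gauss(0.0, 1.0) for _ in range(5_000)]  # hypothetical data column

def histogram_density(data, bins=20):
    """Return (bin_edges, heights) with heights scaled so the total area is 1."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in data:
        k = min(int((v - lo) / width), bins - 1)   # the maximum goes in the last bin
        counts[k] += 1
    # height = relative frequency / bin width, so that area = relative frequency
    heights = [c / (len(data) * width) for c in counts]
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, heights

edges, heights = histogram_density(sample)
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(area)
```

Dividing each relative frequency by the bin width is what makes bar heights comparable to a density: the bar areas, not the heights, sum to 1.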


Different shapes of distributions

Symmetric, skewed right, skewed left

Unimodal, bimodal, multimodal

Expected value of a continuous random variable
For a discrete random variable X
x          x1  x2  ...  xn
P(X = x)   p1  p2  ...  pn
The expectation is
$\mu = E[X] = \sum_{i=1}^{n} x_i p_i$
For a continuous random variable X with PDF fX(x)
$\mu = E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx$

Example: Uniform distribution X ∼ U(0, 1) on [0, 1],

fX(x) = 1 for x ∈ [0, 1],
$E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx = \int_0^1 (x)(1)\,dx = \frac{1}{2}$
Properties of expected value
E[X + Y ] = E[X] + E[Y ]
E[aX + bY + c] = aE[X] + bE[Y ] + c
$E[g(X)] = \int_{-\infty}^{+\infty} g(x) f_X(x)\,dx$
Variance of a continuous random variable
$\mathrm{Var}(X) = E[(X - E[X])^2] = \int_{-\infty}^{+\infty} (x - \mu)^2 f_X(x)\,dx$
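
These integrals can be approximated numerically for U(0, 1). The midpoint rule below is an assumption of this sketch; the values E[X] = 1/2, E[X²] = 1/3 and Var(X) = 1/12 are the standard results for the uniform distribution on [0, 1].

```python
def integrate(g, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * h) for i in range(steps)) * h

def f(x):
    """Density of U(0, 1); only evaluated on [0, 1] here."""
    return 1.0

mean = integrate(lambda x: x * f(x), 0.0, 1.0)                 # E[X]
second = integrate(lambda x: x ** 2 * f(x), 0.0, 1.0)          # E[g(X)] with g(x) = x^2
var = integrate(lambda x: (x - mean) ** 2 * f(x), 0.0, 1.0)    # Var(X)
print(mean, second, var)
```

Since the density is zero outside [0, 1], integrating over [0, 1] is the same as integrating over the whole real line.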

Properties of expected value

Suppose we are given a random variable X and its probability
distribution
we are interested in the expected value of some function of X,
say g(X)
If X is a discrete random variable with probability mass
function p(x), then for any real-valued function g,
$E[g(X)] = \sum_{x} g(x) p(x)$

If X is a continuous random variable with probability density
function f(x), then for any real-valued function g,
$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$


Properties of expected value

The expected value of a random variable X, E[X], is also
referred to as the mean or the first moment of X
The quantity E[X^n], n ≥ 1, is called the nth moment of X
$E[X^n] = \begin{cases} \sum_{x} x^n p(x), & \text{if discrete} \\ \int_{-\infty}^{\infty} x^n f(x)\,dx, & \text{if continuous} \end{cases}$


Joint PDF and CDF

Let X and Y be two random variables defined on the same
probability space
Joint PMF for discrete random variables

pX,Y(x, y) = P(X = x ∩ Y = y)
Joint PDF for continuous random variables
fX,Y(x, y)

Joint CDF

FX,Y (x, y) = P (X ≤ x ∩ Y ≤ y)


Independent random variables

X and Y are independent if

FX,Y(x, y) = FX(x)FY(y) for all x and y
equivalent to P(X ≤ x ∩ Y ≤ y) = P(X ≤ x)P(Y ≤ y)
For continuous random variables this is equivalent to fX,Y(x, y) = fX(x)fY(y)
Covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Correlation: $\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
If X and Y are independent, then Cov(X, Y) = 0. The converse
may not be true.
If X and Y are independent, then

Var(X + Y) = Var(X) + Var(Y)
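
A standard counterexample shows why zero covariance does not imply independence. The choice of X uniform on {−1, 0, 1} with Y = X² is an illustration added here, not taken from the slides.

```python
from fractions import Fraction

# Hypothetical example: X uniform on {-1, 0, 1} and Y = X^2.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(x * p for x in support)           # E[X] = 0
e_y = sum(x * x * p for x in support)       # E[Y] = E[X^2] = 2/3
cov = sum((x - e_x) * (x * x - e_y) * p for x in support)

# Independence would need P(X = 0, Y = 0) = P(X = 0) P(Y = 0).
p_joint = p              # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p        # P(X = 0) = 1/3 and P(Y = 0) = 1/3
print(cov, p_joint, p_product)
```

The covariance is exactly 0, yet Y is completely determined by X. Covariance only detects linear association.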


Descriptive statistics

Descriptive statistics
Generally we have a sample from a population (modelled as a
random variable)
For example, a column holding the age of all the cases in a
data table is a sample from an underlying random variable whose
distribution we do not know.
Suppose the average age is 56.75, the variance is 90.2, and the standard deviation is 9.5. We can
summarise the column as 56.75 ± 9.5


Random variables as populations and samples

Consider a random variable X, and a sample from it
x1, . . . , xn
Using the properties of the sample, we can estimate the
properties of the population (random variable)

Properties of X        Properties of the sample
µ = E[X]               x̄
σ² = Var(X)            s²
σ = std(X)             s
m                      x̃


Study two random variables
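Consider two random variables X and Y
σX,Y = Cov(X, Y) = E[(X − µX)(Y − µY)]
$\rho_{X,Y} = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
Instead of random variables we usually have samples of them, for example when we
want to study two numerical columns in a data matrix
x1, . . . , xn and y1, . . . , yn
Sample covariance
$s_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Sample correlation
$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_X s_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$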
Correlation coefficient through scatter plots

−1 ≤ r ≤ 1
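
The sample covariance and correlation can be computed directly from two columns. This is a minimal sketch using the n − 1 denominator; the data values and function names are made up for illustration.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation r = s_XY / (s_X * s_Y)."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]           # hypothetical column 1
ys = [2.1, 3.9, 6.2, 8.0, 9.8]           # hypothetical column 2, nearly linear in xs

r = sample_corr(xs, ys)
r_neg = sample_corr(xs, [-x for x in xs])   # perfectly decreasing relationship
print(sample_cov(xs, ys), r, r_neg)
```

Note that the sample variance is the sample covariance of a column with itself, which is why `sample_cov(xs, xs)` appears under the square root.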

Approximately normal datasets

Consider X whose distribution (histogram) looks like a
symmetric bell shape
The empirical rule
Approximately 68% of the observations lie within x̄ ± s
Approximately 95% of the observations lie within x̄ ± 2s
Approximately 99.7% of the observations lie within x̄ ± 3s
Using the standard normal
P(−1 ≤ Z ≤ +1) ≈ 0.68
P(−2 ≤ Z ≤ +2) ≈ 0.95
P(−3 ≤ Z ≤ +3) ≈ 0.997

If a value sits more than 3 standard deviations from the mean, it is called an outlier
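
The three percentages of the empirical rule come from the standard normal CDF. A short check using `statistics.NormalDist` from the Python standard library (available from Python 3.8):

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)    # standard normal

coverage = {}
for k in (1, 2, 3):
    coverage[k] = Z.cdf(k) - Z.cdf(-k)    # P(-k <= Z <= k)
    print(k, round(coverage[k], 4))       # 0.6827, 0.9545, 0.9973
```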

Important Random Variables

Important discrete random variables

X is a Bernoulli random variable, X ∼ Br(p)

Suppose that an experiment whose outcome can be classified
as either a success or a failure is performed
Let X = 1 when the outcome is a success and X = 0
when it is a failure. Then
P(X = 1) = p
P(X = 0) = 1 − p
where p, 0 ≤ p ≤ 1, is the probability that the trial is a success.
E[X] = p, Var(X) = p(1 − p)


Important discrete random variables
X is a Binomial random variable, X ∼ Binom(n, p)
the number of successes in n independent trials, when each
trial is a success with probability p.
SX = {0, 1, . . . , n}
Probability mass function for k ∈ SX
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k} = \frac{n!}{k!\,(n - k)!}\, p^k (1 - p)^{n-k}$
E[X] = np, Var(X) = np(1 − p)
X is a Poisson random variable with parameter λ > 0,
X ∼ Pois(λ)
SX = {0, 1, 2, . . .}
Probability mass function for k ∈ SX
$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$
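
Both probability mass functions are easy to evaluate with the standard library. The sketch below uses hypothetical parameters (n = 10, p = 0.3, λ = 2) and checks that the probabilities sum to 1 and that the binomial mean and variance match np and np(1 − p).

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(total, mean, var)      # about 1, n*p = 3, n*p*(1-p) = 2.1

lam = 2.0
pois_total = sum(poisson_pmf(k, lam) for k in range(100))   # truncated infinite sum
print(pois_total)
```

The Poisson support is infinite, so the sum is truncated at k = 99; for λ = 2 the omitted tail is far below floating-point precision.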
Important continuous random variables

X is a uniform random variable on the interval [a, b],
X ∼ Unif(a, b)
SX = R
Probability density function
$f(x) = \begin{cases} \frac{1}{b-a}, & \text{if } a \le x \le b \\ 0, & \text{otherwise} \end{cases}$
X is a normal random variable, X ∼ N(µ, σ²),
SX = R
Probability density function
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

E[X] = µ, Var(X) = σ 2
Important continuous random variables

X is an exponential random variable with parameter λ > 0,
X ∼ Exp(λ)
SX = R
Probability density function
$f(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases}$
It models the distribution of the amount of time until some specific
event occurs, for example
the amount of time (starting from now) until an earthquake occurs,
or until a new war breaks out,
or until you receive a telephone call


Parameter Estimation and

Random variables as populations and samples

Consider a random variable X, and a sample from it
x1, . . . , xn
Using the properties of the sample, we can estimate the
properties of the population (random variable)

Population parameters     Point estimates
µ = E[X]                  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
σ² = Var(X)               $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
σ = std(X)                $s = \sqrt{s^2}$
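
The three point estimates map directly onto functions in Python's `statistics` module, whose `variance` and `stdev` use the n − 1 denominator. The data below are a made-up sample.

```python
import statistics

ages = [52, 61, 47, 58, 66, 49, 55, 63]   # hypothetical sample

x_bar = statistics.mean(ages)        # estimates mu
s2 = statistics.variance(ages)       # divides by n - 1, estimates sigma^2
s = statistics.stdev(ages)           # square root of s2, estimates sigma
print(x_bar, s2, s)
```

`statistics.pvariance` and `statistics.pstdev` divide by n instead and are meant for a complete population rather than a sample.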


Sampling distribution

µ, σ², σ describe a population
x̄, s², s come from one sample
X̄, S², S are used when we consider many samples
From a population with mean µ and variance σ²:
X̄ is the random variable of the sample means across samples
S² is the random variable of the sample variances across samples


Central limit theorem for the mean

Consider a population with mean µ and variance σ²

Random variable X̄ is the sample mean of randomly selected
samples of size n
The CLT says
X̄ is approximately normally distributed
$E[\bar{X}] = \mu$ and $\mathrm{std}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
As a rule of thumb, the approximation is good when
the original population is normal, OR
the original population is symmetric and n ≥ 10, OR
n ≥ 30, for any population
In practice we have only one sample
If you are interested in a better approximation of µ, use a larger
sample size
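
The statement about the mean and standard deviation of X̄ can be explored by simulation. The sketch assumes an Exp(1) population, which is strongly skewed and has µ = 1 and σ = 1; the sample size n = 30 and the number of repeated samples are arbitrary choices.

```python
import random
import statistics

random.seed(3)

lam = 1.0                        # Exp(1) population: mu = 1, sigma = 1
mu, sigma = 1 / lam, 1 / lam
n = 30                           # size of each sample
num_samples = 5_000              # how many samples we draw

# One sample mean per simulated sample: draws from the distribution of X-bar.
sample_means = [
    statistics.fmean(random.expovariate(lam) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
print(mean_of_means, mu)                 # close to mu
print(std_of_means, sigma / n ** 0.5)    # close to sigma / sqrt(n)
```

Increasing n shrinks the spread of the sample means in proportion to 1/√n, which is why a larger sample gives a better approximation of µ.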
Confidence Interval

Think about a population with parameters µ and σ²

You have a sample x1, . . . , xn and you computed the point
estimate x̄
You can build a confidence interval of the form $\bar{x} \pm z \frac{s}{\sqrt{n}}$
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, n is the sample size
z is the critical value set by the level of confidence
z = 1.645 for 90% CI for the mean
z = 1.96 for 95% CI for the mean
z = 2.576 for 99% CI for the mean
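
The interval can be computed in a few lines. In this sketch the z values are taken from the standard normal quantile function rather than typed in; the function name `mean_confidence_interval` and the data are hypothetical, and the sample is kept small only for readability.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(data, level=0.95):
    """Return (lower, upper) for x_bar ± z * s / sqrt(n)."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)                     # n - 1 denominator
    z = NormalDist().inv_cdf(0.5 + level / 2)      # e.g. the 0.975 quantile for 95%
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Critical values for the three levels listed above.
z_values = {lvl: round(NormalDist().inv_cdf(0.5 + lvl / 2), 3)
            for lvl in (0.90, 0.95, 0.99)}
print(z_values)   # {0.9: 1.645, 0.95: 1.96, 0.99: 2.576}

data = [52, 61, 47, 58, 66, 49, 55, 63]   # hypothetical sample
lo, hi = mean_confidence_interval(data)
lo99, hi99 = mean_confidence_interval(data, level=0.99)
print(lo, hi)
```

A higher confidence level gives a larger z and therefore a wider interval around the same point estimate.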


Point estimate vs. confidence interval

