
SIT787: Mathematics for Artificial Intelligence

Topic 3: Probability
Asef Nazari

School of Information Technology, Deakin University

Asef Nazari Math AI - SIT787 Topic 3 1 / 128
Exploratory Data Analysis
Data Collection and Sampling
Statistics, the art and science of learning from data
Everything starts with a question
Collection, description, analysis, drawing conclusion
Journey from data to models and back
Data is the essence of modelling,
models are prepared to describe data
Data are obtained from experiments and are the result of
measuring some characteristics or property of objects
Each row is for a case, an observation, or an object
These characteristics or properties are called variables
Data variables or features
Numerical (quantitative): can be measured on a numerical scale or counted
A variable is numerical if its observation takes numerical values
that represent different magnitudes of the variable
Continuous or discrete
your weight is continuous and the number of cars in a
household is discrete
Categorical (qualitative): take values that are non-numerical in nature
If a variable has observations belonging to one of a set of
categories, it is called categorical.
It is meaningless to do arithmetic on categorical data (like
postcodes)
Categorical data could only be classified into categories, levels
or classes
Nominal or ordinal
“yes” or “no” for whether someone passed the driving test, gender
Level of education
Populations and samples

The set of all the objects of interest is called the population


Populations are not generally available to study, or it is very
difficult and costly to access them
In a census, we are actually trying to capture the whole population.
Instead of accessing the whole population,
we might take a sample, a smaller subset of the population
and conduct the study over the sample
Based on the information we get from the sample, we make
inferences about the population
We need to be very careful in choosing a sample.
It should be a good representative of the population
A good sample presents similar characteristics to the
population
A random sample: every individual in the population has the
same chance to be selected in the sample
Issues in data collection

Noise
there are always sources of inaccuracies in any real data
collected
Missing values
Missing variable
latent variable analysis (latent variable = hidden variable = missing variable)
Bias
systematically inaccurate estimates of population values
Outliers
These are data points that lie well beyond the bulk of samples

Sample statistics and population parameters

Numerical summaries of the population are called parameters


and generally shown by Greek letters.
A parameter is a numerical summary of the population
A statistic is a numerical summary of a sample
We are interested in learning about the parameters to have a better understanding of the population.
The parameter values are almost always unknown.
We use sample statistics to estimate the related parameter
values.
We estimate population parameters using the information
obtained from a sample from it.

Notations

We must also distinguish between the sample statistics that we have calculated so far and the corresponding population parameters
Population mean µ
Sample mean x̄
Population variance σ²
Sample variance s²
Population standard deviation σ
Sample standard deviation s

Data table

Columns are variables or features


rows are cases or observations; assume we have m observations
A column is considered a random variable with unknown parameters
we only have a sample x1 , x2 , . . . , xm
Assume that all the observations are selected randomly
if we have n columns (assume they are all numerical), then we have a system of random variables (X1 , X2 , . . . , Xn )
So, for each column j, x1j , x2j , . . . , xmj ∼iid Xj

Exploratory data analysis

Consider single variables
Detect the type
Find summary statistics
measure central tendencies (mean, median, mode) and dispersion (variance, std, range, IQR)
frequency and relative frequency tables
Visualisation: histograms, barplots, boxplots, piecharts
Consider variables two at a time
Find the correlation between them
Scatter plots, contingency tables and plots

Measures of central tendency

To summarise a categorical variable: frequency table


The category with the highest frequency is called the modal
category
Bar plots
we may use relative frequencies
To summarise a numerical variable {x1 , x2 , . . . , xn }
Average or mean x̄ = (x1 + x2 + · · · + xn )/n
Median x̃: sort the data increasingly, and choose the middle value for odd n, or the average of the two middle values for even n
It is the element such that 50% of the data is less than x̃ and 50% of the data is larger than x̃.
The median is more robust than the average, as it is not affected by outliers
Mode: the most frequent item
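As a sketch, these central-tendency measures can be computed with Python's standard library; the sample data below is made up for illustration:

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 9, 100]  # hypothetical sample; 100 is an outlier

print(mean(data))    # the outlier pulls the mean up to about 18.4
print(median(data))  # the median stays at the middle value, 5
print(mode(data))    # the most frequent item, 3
```

Comparing the three on the same data illustrates the robustness claim above: one extreme value moves the mean a lot but leaves the median untouched.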
Measures of statistical dispersion
Variance s² = Σᵢ (xi − x̄)² / (n − 1), summing over i = 1, . . . , n
Standard deviation s = √s²
IQR: consider the increasingly sorted data x1 , x2 , x3 , x4 , x5 , x6 , x7
The median x̃ = x4 , which is called the 50th percentile or second quartile Q2 .
The median of the left side of x̃ is x2 , which is called the 25th percentile, or the first quartile Q1
The median of the right side of x̃ is x6 , which is called the 75th percentile or the third quartile Q3
IQR = Q3 − Q1
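A small sketch of these dispersion measures, using the slide's median-of-halves definition of the quartiles (note that library functions such as numpy's percentile interpolate and may give slightly different quartiles):

```python
from statistics import median, stdev, variance

data = sorted([1, 2, 3, 4, 5, 6, 7])  # x1, ..., x7 in increasing order
n = len(data)

s2 = variance(data)  # sample variance, divides by n - 1
s = stdev(data)      # sample standard deviation, sqrt(s2)

# Quartiles as medians of the halves on either side of the median
q2 = median(data)
q1 = median(data[: n // 2])     # left half, excludes the median for odd n
q3 = median(data[-(n // 2):])   # right half
iqr = q3 - q1

print(s2, s, q1, q2, q3, iqr)
```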

Probability Theory
Probability: big picture

Two interpretations of uncertainty


the fraction of times an event occurs
degree of belief about an event
Sources of uncertainty
uncertainty in the data,
uncertainty in the machine learning model, and
uncertainty in the predictions produced by a model.
Quantifying uncertainty requires the idea of a random variable
Associated with the random variable there is a function called
probability distribution
Random variables and their distributions are meant to
describe populations in Statistics.

This file is meant for personal use by amangupta0141@gmail.com only.


Sharing or publishing the contents in part or full is liable for legal action.
Asef Nazari Math AI - SIT787 Topic 3 14 / 128
Sample space, events, and probability

Consider a random experiment or trial


We know all the outcomes
but we do not know which one will happen
outcome is not predictable with certainty in advance
The sample space Ω
the set of all possible outcomes of the experiment
The event space A
A subset of the sample space is called an event A ⊂ Ω
An event A occurs if the outcome of the experiment is a member of A
A is the collection of all subsets of Ω
A is the power set of Ω
The probability P
With each A ∈ A we associate a number P (A)
measures the probability that the event A will occur
(Ω, A, P ) is called a probability space
The big picture

Our aim is to learn from data


Any data collection has noise (randomness)
We use probability theory (in particular random variables) to model the data and deal with the noise
Different types of data features can be modelled using different types of random variables
We first do exploratory data analysis to learn some aspects of the data
Using descriptive statistics to summarise the data
use visualisation to represent the data efficiently
We may look at features one at a time, or two together

The data

Variables or features each will be modelled using a random


variable
Machine learning is the same as statistical learning

Each column, variable, or feature is treated as a random variable
Here we need to study a system of random variables together.
Explanatory Variables vs. Response Variables
supervised and unsupervised learning
Examples of sample space

If the outcome of an experiment consists in the determination of the sex of a newborn child, then
Ω = {m, f }

If the experiment consists of the running of a race among the seven horses having post positions 1, 2, 3, 4, 5, 6, 7, then
Ω = {all orderings of (1, 2, 3, 4, 5, 6, 7)}

Suppose we are interested in determining the amount of dosage that must be given to a patient until that patient reacts positively
Ω = (0, ∞)

Random experiments

Toss a fair (or unfair) coin once, twice, and 3 times


Ω1 = {H, T }
Ω2 = {HH, HT, T H, T T }
Ω3 = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }
roll a fair die, and roll two fair dice
Ω4 = {1, 2, 3, 4, 5, 6}
Ω5 = {(i, j)|i, j ∈ {1, 2, 3, 4, 5, 6}}
toss a fair (or unfair) coin n times (n = 5 for example)
Ω6 = {HHHHH, HHHHT, . . .}, |Ω6 | = 2ⁿ
toss a coin until the first head appears
Ω7 = {H, T H, T T H, T T T H, . . .}
number of arrivals to a shop in a given period of time
Ω8 = {0, 1, 2, 3, . . .}
lottery game playing with a coin
Ω9 = {H, T }
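The finite sample spaces above can be generated programmatically; a quick sketch with itertools:

```python
from itertools import product

# Omega_3: all sequences of H/T for three coin tosses
omega3 = [''.join(t) for t in product('HT', repeat=3)]
print(len(omega3), omega3[:4])  # 8 outcomes, starting HHH, HHT, HTH, HTT

# |Omega| doubles with each extra toss: 2^n outcomes for n tosses
assert all(len(list(product('HT', repeat=n))) == 2 ** n for n in range(1, 6))

# Omega_5: ordered pairs (i, j) for two fair dice
omega_dice = list(product(range(1, 7), repeat=2))
assert len(omega_dice) == 36
```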
Random experiments

Number of emails I receive every day Ω = {0, 1, 2, . . .}


Amount of time someone spends on social media in a day
Ω = [0, 24]

The concept of probability
For a randomised experiment or trial, the probability is the
likeliness of a particular outcome
in tossing two coins, Ω = {hh, ht, th, tt}
If we are interested in the cases where the first coin lands
heads E = {hh, ht}
When all possible outcomes are equally likely
P (E) = (number of outcomes in E) / (number of outcomes in the sample space)
P (E) = 2/4 = 0.5
Two events are called disjoint or exclusive if they have no
outcome in common
Event E the first coin lands heads: E = {hh, ht}
Event F the first coin lands tails: F = {tt, th}

E ∩ F = ∅
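The equally-likely formula can be checked by brute-force enumeration; a minimal sketch for the two-coin example:

```python
from itertools import product

# Sample space for tossing two coins
omega = [''.join(t) for t in product('ht', repeat=2)]  # hh, ht, th, tt

# Event E: the first coin lands heads; event F: the first coin lands tails
E = [w for w in omega if w[0] == 'h']
F = [w for w in omega if w[0] == 't']

p_E = len(E) / len(omega)  # equally likely outcomes: |E| / |Omega|
print(p_E)                 # 0.5

assert set(E) & set(F) == set()  # E and F are disjoint
```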
Algebra of events

Suppose E1 and E2 are events (E1 , E2 ⊂ Ω)


You can make new events by any set operations that you
know
The union of events E1 ∪ E2
The intersection of events E1 ∩ E2
The difference of events E1 − E2
The complement of an event E c = Ω − E is another event; for example Ωc = ∅, ∅c = Ω
Axioms of probability (Kolmogorov)
1 P (Ω) = 1
2 For E ⊂ Ω, 0 ≤ P (E) ≤ 1
3 For two disjoint events E1 , E2 ⊂ Ω and E1 ∩ E2 = ∅

P (E1 ∪ E2 ) = P (E1 ) + P (E2 )

Some extensions
For any two events E1 , E2 ⊂ Ω
P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 )
The collection of events {E1 , E2 , . . . , En } is called mutually disjoint if Ei ∩ Ej = ∅ for every i ≠ j
For mutually disjoint events {E1 , E2 , . . . , En }
P (E1 ∪ E2 ∪ . . . ∪ En ) = P (E1 ) + P (E2 ) + . . . + P (En )
Some propositions

1 = P (Ω) = P (E ∪ E c ) = P (E) + P (E c ), which implies that
P (E c ) = 1 − P (E)
For any two events E1 , E2 ⊂ Ω
P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 )
Example

A total of 28 percent of males living in a city smoke cigarettes, 6 percent smoke cigars, and 3 percent smoke both cigars and cigarettes. What percentage of males smoke neither cigars nor cigarettes?
Let E be the event that a randomly chosen male is a cigarette smoker: P (E) = 0.28
Let F be the event that he is a cigar smoker: P (F ) = 0.06
P (E ∩ F ) = 0.03
Then, the probability this person is either a cigarette or a cigar smoker is
P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ) = 0.28 + 0.06 − 0.03 = 0.31
Hence the probability he smokes neither is 1 − P (E ∪ F ) = 1 − 0.31 = 0.69, i.e. 69 percent.
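The inclusion-exclusion step and the final complement can be sketched in a few lines:

```python
# Smoking example: inclusion-exclusion, then the complement rule
p_cigarettes = 0.28
p_cigars = 0.06
p_both = 0.03

p_either = p_cigarettes + p_cigars - p_both  # P(E union F)
p_neither = 1 - p_either                     # complement rule

print(round(p_either, 2), round(p_neither, 2))  # 0.31 0.69
```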

This file is meant for personal use by amangupta0141@gmail.com only.


Sharing or publishing the contents in part or full is liable for legal action.
Asef Nazari Math AI - SIT787 Topic 3 25 / 128
Conditional Probability
and Independent Events
Conditional probability

A and B are two events, A ⊂ Ω and B ⊂ Ω


P (A|B) = P (A ∩ B) / P (B)

P (A ∩ B) = P (A|B)P (B)
P (A ∩ B) = P (B|A)P (A)
useful to check whether events are dependent or independent
Two events with non-zero probabilities are independent iff
P (A ∩ B) = P (A)P (B)
P (A|B) = P (A)
P (B|A) = P (B)
If E and F are independent, then so are E and F c .
If we have several independent models, it is better to make an
ensemble model.
Conditional probabilities: example of die tossing

Toss a die Ω = {1, 2, 3, 4, 5, 6}


Event A: the number is even, A = {2, 4, 6}
Event B: the number is at least 4, B = {4, 5, 6}
Event C: the number is at least 5, C = {5, 6}
P (A) = 1/2, P (B) = 1/2, P (C) = 1/3
A ∩ B = {4, 6} and P (A ∩ B) = 1/3
A ∩ C = {6} and P (A ∩ C) = 1/6
B ∩ C = {5, 6} and P (B ∩ C) = 1/3
If P (E ∩ F ) = P (E)P (F ), then the events are independent. Otherwise they are dependent.
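A sketch that verifies these pairwise checks with exact fractions:

```python
from fractions import Fraction

def P(event):
    """Probability of an event for one fair die (equally likely outcomes)."""
    return Fraction(len(event), 6)

A = {2, 4, 6}   # even
B = {4, 5, 6}   # at least 4
C = {5, 6}      # at least 5

assert P(A & C) == P(A) * P(C)  # 1/6 = 1/2 * 1/3: A and C independent
assert P(A & B) != P(A) * P(B)  # 1/3 != 1/2 * 1/2: A and B dependent
assert P(B & C) != P(B) * P(C)  # 1/3 != 1/2 * 1/3: B and C dependent
print("checks pass")
```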

Conditional probabilities: example

Example 1: consider a group of 100 people, |Ω| = 100
H: the chosen person is happy, P (H) = 48/100 = 0.48
M: the chosen person is married, P (M ) = 70/100 = 0.7
P (H|M ) = 42/70 = 0.6
P (H|M c ) = 6/30 = 0.2

             Happy
             yes    no
Married yes   42    28    70
        no     6    24    30
              48    52   100

Example 2: consider a group of 100 people, |Ω| = 100
C: the chosen person likes cheese, P (C) = 80/100 = 0.8
D: the chosen person likes dogs, P (D) = 60/100 = 0.6
P (C|D) = 48/60 = 0.8
P (C|D) = P (C), so C and D are independent

            Dogs
            yes    no
Cheese yes   48    32    80
       no    12     8    20
             60    40   100
Independence of more than two events

Two events are independent if P (E ∩ F ) = P (E)P (F )
Three events are independent if
P (E ∩ F ∩ G) = P (E)P (F )P (G)
P (E ∩ F ) = P (E)P (F )
P (E ∩ G) = P (E)P (G)
P (F ∩ G) = P (F )P (G)
The events E1 , E2 , . . . , En are said to be independent if for every subset E1′ , E2′ , . . . , Er′ , r ≤ n, of these events

P (E1′ ∩ E2′ ∩ . . . ∩ Er′ ) = P (E1′ )P (E2′ ) . . . P (Er′ )

Conditional probability: example

A bin contains 5 defective (that immediately fail when put in


use), 10 partially defective (that fail after a couple of hours of
use), and 25 acceptable transistors. A transistor is chosen at
random from the bin and put into use. If it does not
immediately fail, what is the probability it is acceptable?
Solution: Since the transistor did not immediately fail, we know that it is not one of the 5 defectives, and so the desired probability is:
P {acceptable|not defective} = P {acceptable, not defective} / P {not defective}
= P {acceptable} / P {not defective} = (25/40) / (35/40) = 5/7
where the last equality follows since the transistor will be both
acceptable and not defective if it is acceptable.
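The same computation as a short sketch with exact fractions:

```python
from fractions import Fraction

# Bin: 5 defective, 10 partially defective, 25 acceptable (40 transistors)
p_acceptable = Fraction(25, 40)
p_not_defective = Fraction(35, 40)  # partially defective or acceptable

# Acceptable implies not defective, so the joint probability is p_acceptable
p = p_acceptable / p_not_defective
print(p)  # 5/7
```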

The law of total probability
Consider two events F and E
F = (F ∩ E) ∪ (F ∩ E c ), the union of two disjoint events
P (F ) = P (F ∩ E) + P (F ∩ E c ) = P (F |E)P (E) + P (F |E c )P (E c )
Consider events E1 , . . . , En s.t. Ei ∩ Ej = ∅ and E1 ∪ E2 ∪ . . . ∪ En = Ω
P (C) = Σᵢ P (Ei )P (C|Ei ), summing over i = 1, . . . , n
The Ei are hypotheses. Exactly one of them will happen
Now find P (Ei |C)
The law of total probability: example
Example: An insurance company believes that people can be
divided into two classes — those that are accident prone and
those that are not. Their statistics show that an
accident-prone person will have an accident at some time
within a fixed 1-year period with probability .4, whereas this
probability decreases to .2 for a non-accident-prone person. If
we assume that 30 percent of the population is accident prone, what is the probability that a new policy holder will have an accident within a year of purchasing a policy?
Solution: We obtain the desired probability by first conditioning on whether or not the policy holder is accident prone. Let A1 denote the event that the policy holder will have an accident within a year of purchase; and let A denote the event that the policy holder is accident prone. Hence, the desired probability P (A1 ) is given by
P (A1 ) = P (A1 |A)P (A) + P (A1 |Ac )P (Ac ) = (.4)(.3) + (.2)(.7) = .26
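The conditioning step can be sketched directly from the law of total probability:

```python
# Insurance example: condition on whether the policy holder is accident prone
p_prone = 0.3
p_acc_given_prone = 0.4
p_acc_given_not_prone = 0.2

p_accident = (p_acc_given_prone * p_prone
              + p_acc_given_not_prone * (1 - p_prone))
print(round(p_accident, 2))  # 0.26
```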
Bayes’ Theorem
Bayes’ Theorem

The essence of Bayes’ Theorem is updating probabilities when we receive further knowledge or information.
That’s why we are going to have prior and posterior probabilities here.
Consider event E; you may know P (E). This is your prior probability
Another event F occurs.
Now, you are interested to see how your prior probability is going to change.

P (E|F ) = P (F |E)P (E) / P (F ) = [P (F |E) / P (F )] P (E)

There is a factor which multiplies your prior P (E) and gives your posterior P (E|F )
Bayes’ Theorem

Prior probability P (E)
Posterior probability P (E|F )
Likelihood ratio P (F |E) / P (F ), the factor that updates the prior

Bayes’ Theorem

Let E and F be events. We can write F = (F ∩ E) ∪ (F ∩ E c )

P (F ) = P (F ∩ E) + P (F ∩ E c )
= P (F |E)P (E) + P (F |E c )P (E c )
= P (F |E)P (E) + P (F |E c )[1 − P (E)]

The probability of the event F is a weighted average of the conditional probability of F given that E has occurred and the conditional probability of F given that E has not occurred
Bayes’ Theorem

P (F ) = P (F |E)P (E) + P (F |E c )[1 − P (E)]

Bayes’ Theorem

P (E|F ) = P (E ∩ F ) / P (F )
= P (F |E)P (E) / P (F )
= P (F |E)P (E) / (P (F |E)P (E) + P (F |E c )[1 − P (E)])
Bayes’ Theorem: example

Question: A laboratory blood test is 99 percent effective in detecting a certain disease when it is, in fact, present. However, the test also yields a “false positive” result for 1 percent of the healthy persons tested. (That is, if a healthy person is tested, then, with probability .01, the test result will imply he or she has the disease.) If .5 percent of the population actually has the disease, what is the probability a person has the disease given that his test result is positive?
Solution: Let D be the event that the tested person has the disease and E the event that his test result is positive. The desired probability P (D|E) is obtained from P (E|D) = 0.99, P (D) = 0.005, P (E|Dc ) = 0.01, P (Dc ) = 0.995

P (D|E) = P (D ∩ E) / P (E) = P (E|D)P (D) / (P (E|D)P (D) + P (E|Dc )P (Dc )) = 0.3322
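A small helper (the function and parameter names are mine, not from the slide) that applies this formula:

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Bayes' theorem: P(D|E) from P(D), P(E|D) and P(E|D^c)."""
    evidence = (p_pos_given_disease * prior
                + p_pos_given_healthy * (1 - prior))
    return p_pos_given_disease * prior / evidence

p = posterior(prior=0.005, p_pos_given_disease=0.99, p_pos_given_healthy=0.01)
print(round(p, 4))  # 0.3322
```

Even with a 99 percent effective test, the low prevalence keeps the posterior at about one third.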
Bayes’ Theorem: extension

F1 , F2 , . . . , Fn are mutually exclusive events and

F1 ∪ F2 ∪ . . . ∪ Fn = Ω

In other words, exactly one of the events F1 , F2 , . . . , Fn must occur
E is another event, and can be written as E = (E ∩ F1 ) ∪ . . . ∪ (E ∩ Fn )
P (E) = Σᵢ P (E ∩ Fi ) = Σᵢ P (E|Fi )P (Fi ), summing over i = 1, . . . , n
Suppose now that E has occurred

P (Fj |E) = P (E ∩ Fj ) / P (E) = P (E|Fj )P (Fj ) / Σᵢ P (E|Fi )P (Fi )
Bayes’ Theorem: explanation

Bayes’ formula

P (Fj |E) = P (E ∩ Fj ) / P (E) = P (E|Fj )P (Fj ) / Σᵢ P (E|Fi )P (Fi )

If we think of the events Fj as being possible “hypotheses” about some subject matter,
P (Fj ) are our priors
then Bayes’ formula may be interpreted as showing us how opinions about these hypotheses held before the experiment [that is, the P (Fj )] should be modified by the evidence of the experiment.
P (Fj |E) are our posteriors
Bayes’ Theorem: simple version
P (A|B) = P (B|A)P (A) / (P (B|A)P (A) + P (B|Ac )P (Ac ))
The probabilities involved: P (A|B), P (B|A), P (A), P (Ac ), P (B|Ac )
We need to know four of these to find the fifth one.

Bayes’ theorem: example
Example: A test is 98% effective at detecting HIV. However, the test has a “false positive” rate of 1%. If 0.5% of a country’s population has HIV, what is the probability of having HIV when the test is positive?
Let E = someone’s test is positive for HIV with this test
Let F = that person actually has HIV
A test is 98% effective at detecting HIV
If someone has HIV, the success of the test in prediction is P (E|F ) = 0.98
However, the test has a “false positive” rate of 1%
P (E|F c ) = 0.01
0.5% of a country’s population has HIV
P (F ) = 0.005 and therefore P (F c ) = 1 − 0.005 = 0.995
What is P (F |E)?

P (F |E) = P (E|F )P (F ) / (P (E|F )P (F ) + P (E|F c )P (F c ))
= (0.98)(0.005) / ((0.98)(0.005) + (0.01)(1 − 0.005)) = 0.33
Bayes’ theorem: the intuition

Conditioning on a positive result changes the sample space
People who test positive: P (E|F )P (F ) + P (E|F c )P (F c )
People who test positive and have HIV: P (E|F )P (F )
Bayes’ theorem: the intuition
Say we have 1000 people
5 have HIV and test positive
985 do not have HIV and test negative
10 do not have HIV and test positive
So P (F |E) ≈ 5/(5 + 10) = 0.33
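The counting argument can be reproduced numerically (a sketch; the slide rounds the 4.9 expected true positives to 5):

```python
# Out of 1000 people, count who tests positive with and without HIV
n = 1000
have_hiv = n * 0.005                 # 5 people
true_pos = have_hiv * 0.98           # about 4.9, rounded to 5 on the slide
false_pos = (n - have_hiv) * 0.01    # about 10 false positives

p = true_pos / (true_pos + false_pos)
print(round(p, 2))  # 0.33
```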

Random Variables
Random variables

Tools to study the randomness in data


A random variable assigns numbers to the outcomes of a random experiment
types of random variables
X is the number of heads in 3 tosses of a coin; it is a discrete random variable
X is the reading of a scale in weighing people, it is a
continuous random variable

Example of a random variable

Experiment: Toss a coin n times. X is the number of heads
n = 1:
X = 1 if H
X = 0 if T
n = 2:
X = 2 if HH
X = 1 if HT or T H
X = 0 if T T
Example of a random variable

Experiment: Roll a die. X is the square of the number
X = 1 if the die shows 1
X = 4 if 2
X = 9 if 3
X = 16 if 4
X = 25 if 5
X = 36 if 6
The definition

Consider Ω to be a sample space of a random experiment.


A random variable is defined as

X:Ω→R

For all ω ∈ Ω, X(ω) is a number


All possible values that a random variable X can take form the support set of that random variable, denoted by SX
If we conduct this experiment several times, we will have a sample of this random variable:
x1 = X(ω1 ), x2 = X(ω2 ), . . . , xn = X(ωn ), or simply x1 , x2 , . . . , xn

The notion of probability distribution

To represent a random variable concisely.
n = 2: toss a fair coin twice

Outcome (ω)   X(ω)   probability of outcome, P (ω)
HH            2      1/4
HT            1      1/4
TH            1      1/4
TT            0      1/4

P (X = 0) = P ({ω ∈ Ω|X(ω) = 0}) = P ({T T }) = 1/4
P (X = 1) = P ({ω ∈ Ω|X(ω) = 1}) = P ({T H, HT }) = 2/4
P (X = 2) = P ({ω ∈ Ω|X(ω) = 2}) = P ({HH}) = 1/4

x             0     1     2
P (X = x)     1/4   2/4   1/4
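The whole construction, from sample space to PMF, can be sketched in a few lines:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# X = number of heads in two tosses of a fair coin
omega = [''.join(t) for t in product('HT', repeat=2)]
counts = Counter(w.count('H') for w in omega)
pmf = {x: Fraction(c, len(omega)) for x, c in counts.items()}

print({x: str(p) for x, p in sorted(pmf.items())})  # {0: '1/4', 1: '1/2', 2: '1/4'}
assert sum(pmf.values()) == 1             # probabilities sum to one
assert pmf[1] + pmf[2] == Fraction(3, 4)  # P(X >= 1)
```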
Using the probability distribution

x             0     1     2
P (X = x)     1/4   2/4   1/4

P (X ≥ 1) = P (X = 1) + P (X = 2) = 1/2 + 1/4 = 3/4
Probability mass function (PMF): pX (x) = p(x) = P (X = x)
Let X be a random variable with SX = {x1 , x2 , . . . , xn } and probabilities {p1 , p2 , . . . , pn }, where P (X = xj ) = pj for all j.
pj ≥ 0
p1 + p2 + . . . + pn = 1

This file is meant for personal use by amangupta0141@gmail.com only.


Sharing or publishing the contents in part or full is liable for legal action.
Asef Nazari Math AI - SIT787 Topic 3 52 / 128
Random variables

Discrete random variable: one whose set of possible values can be written either as a finite sequence x1, ..., xn, or as an infinite sequence x1, x2, ...
A random variable whose set of possible values is the set of nonnegative integers is a discrete random variable.
Continuous random variable: a random variable that takes on a continuum of possible values.
Example: the random variable denoting the lifetime of a car, taking values in an interval (t1, t2).
Probability mass function

For a discrete random variable X with values SX = {x1, ..., xn}, the probability mass function p is defined as

p(x) = P{X = x}

p(xi) > 0, i = 1, 2, ...
p(x) = 0 for other values of x
∑_{i=1}^{∞} p(xi) = 1
Expectation of a Random Variable
Expected value of a random variable

You know the concept of the average of x1, x2, ..., xn, which is

x̄ = (∑ xi) / n

For a random variable, the values have different probabilities and importance.
We want values with higher probability to contribute more to the average.
For a discrete random variable X with finite support set x1, x2, ..., xn and probabilities p1, p2, ..., pn:

µ = E[X] = ∑_{i=1}^{n} xi pi = ∑_i xi P{X = xi}

For a discrete random variable X with infinite support set x1, x2, ... and probabilities p1, p2, ...:

µ = E[X] = ∑_{i=1}^{∞} xi pi   (this sum may not exist)
Example of expectation of a random variable

For a random variable X:

x          x1   x2   ...   xn
P(X = x)   p1   p2   ...   pn

The expectation is

µ = E[X] = ∑_{i=1}^{n} xi pi

Lottery game: win $5 with probability 0.1, lose $1 with probability 0.9.
The related random variable:

x          5     -1
P(X = x)   0.1   0.9

µ = E[X] = 5(0.1) + (−1)(0.9) = −0.4 dollars per game
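The lottery computation is a one-line weighted sum; a minimal Python sketch (the `expectation` helper is my own name, not from the unit):

```python
# Expected value of a discrete random variable: E[X] = sum_i x_i * p_i.
def expectation(values, probs):
    assert abs(sum(probs) - 1.0) < 1e-12, "probabilities must sum to 1"
    return sum(x * p for x, p in zip(values, probs))

# Lottery from the slide: win $5 w.p. 0.1, lose $1 w.p. 0.9.
mu = expectation([5, -1], [0.1, 0.9])
print(mu)  # -0.4: lose 40 cents per game on average
```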
Examples

Lottery game: win $100 with probability 0.5, lose $100 with probability 0.3, lose $50 with probability 0.2.
The related random variable:

x          100   -100   -50
P(X = x)   0.5   0.3    0.2

µ = E[X] = 100(0.5) + (−100)(0.3) + (−50)(0.2) = 10 dollars per game
Expectation example

Find E[X] where X is the outcome when we roll a fair die.
Since p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6, then
E[X] = (1)p(1) + (2)p(2) + (3)p(3) + (4)p(4) + (5)p(5) + (6)p(6) = 7/2
Note that, for this example, the expected value of X is not a value that X could possibly assume.
It is the average value of X in a large number of repetitions of the experiment.
If I is an indicator random variable for the event A, that is, if

I = 1 if A occurs, and I = 0 if A does not occur,

then E[I] = 1·P(A) + 0·P(A^c) = P(A).
The expectation of the indicator random variable for the event A is just the probability that A occurs.
Discrete Random Variables
Discrete random variables

With finite support set:

x          x1   x2   ...   xn
P(X = x)   p1   p2   ...   pn

pi ≥ 0 for all i
∑_{i=1}^{n} pi = 1
Expectation: µ = E[X] = ∑_{i=1}^{n} xi pi

With infinite support set:

x          x1   x2   ...   xn   ...
P(X = x)   p1   p2   ...   pn   ...

pi ≥ 0 for all i
∑_{i=1}^{∞} pi = 1
Expectation (may not exist): µ = E[X] = ∑_{i=1}^{∞} xi pi
Discrete random variable with infinite support set

Experiment: toss a fair coin until the first head.
Ω = {H, TH, TTH, TTTH, ...}
Define X as the number of tosses until the first head (including the last one).
P(X = 1) = P({H}) = 1/2
P(X = 2) = P({TH}) = 1/4

x          1     2      ...   k       ...
P(X = x)   1/2   1/2²   ...   1/2^k   ...

∑_{i=1}^{∞} 1/2^i = 1
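That the probabilities 1/2, 1/4, 1/8, ... sum to 1 can be checked exactly with rational arithmetic; a small sketch (illustrative only):

```python
from fractions import Fraction

# P(X = k) = (1/2)^k for the number of tosses until the first head
# with a fair coin; the probabilities over k = 1, 2, ... sum to 1.
def p(k):
    return Fraction(1, 2) ** k

partial = sum(p(k) for k in range(1, 51))

# Geometric partial sum: 1 - sum_{k=1}^{n} (1/2)^k = (1/2)^n exactly.
assert Fraction(1) - partial == Fraction(1, 2) ** 50
print(float(partial))  # approaches 1 as more terms are added
```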
Bernoulli trial

Independent repeated trials of an experiment with exactly two possible outcomes are called Bernoulli trials.
There is a coin with P(H) = p, P(T) = 1 − p, and we make a sequence of independent coin tosses.
n = 2; H1: first toss is heads, H2: second toss is heads. These two events are independent.
P(HH) = P(H1 ∩ H2) = p²
P(TT) = P(H1^c ∩ H2^c) = (1 − p)²
P(TH) = P(HT) = p(1 − p)
Arbitrary n: |Ω| = 2^n
P(HHTHT) = p³(1 − p)²
P(a specific sequence with m heads and n − m tails) = p^m (1 − p)^(n−m)
Binomial distribution

Random experiment: toss a coin with P(H) = p for n times.
X is the number of heads; SX = {0, 1, 2, ..., n}.
Consider n = 5:
P(X = 0) = P({TTTTT}) = (1 − p)^5
P(X = 1) = P({HTTTT, THTTT, TTHTT, TTTHT, TTTTH}) = 5p(1 − p)^4 = C(5,1) p(1 − p)^4
P(X = 2) = C(5,2) p²(1 − p)³

For B(n, p):  P(X = k) = C(n,k) p^k (1 − p)^(n−k)  for k ∈ SX
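The B(n, p) formula can be sketched directly with `math.comb`; for a fair coin and n = 5 it reproduces the hand counts above (illustrative code, not from the unit):

```python
from math import comb

# Binomial PMF: P(X = k) = C(n, k) p^k (1 - p)^(n - k) for X ~ B(n, p).
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, n = 5, matching the slide's hand counts.
n, p = 5, 0.5
assert binom_pmf(0, n, p) == (1 - p)**5          # P(X = 0) = (1-p)^5
assert binom_pmf(1, n, p) == 5 * p * (1 - p)**4  # 5 sequences with one head
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1) < 1e-12
```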
Geometric and Poisson distributions with infinite support set

Geometric distribution
Toss a coin with P(H) = p until the first head.
X is the number of tosses (including the last one).
P(X = k) = P({TT...TH}) = p(1 − p)^(k−1), k ≥ 1

Poisson distribution
X is the number of arrivals.
P(X = k) = (λ^k / k!) e^(−λ) for parameter λ > 0, k ≥ 0
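As a sanity check on the Poisson PMF, the masses over the infinite support sum to one; truncating the sum far into the tail gets within floating-point error (illustrative sketch):

```python
from math import exp, factorial

# Poisson PMF: P(X = k) = lambda^k e^(-lambda) / k!, for k = 0, 1, 2, ...
def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0
# Terms beyond k = 99 are astronomically small for lambda = 3.
total = sum(poisson_pmf(k, lam) for k in range(100))
print(total)  # ~1.0: the PMF sums to one over the (infinite) support
```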
Variance of a Random Variable
Expectation is not enough

Given a random variable X along with its probability distribution function,
we want to summarize the essential properties of the mass function by certain suitably defined measures.
One such measure would be E[X], the expected value of X.
E[X] yields the weighted average of the possible values of X; it does not tell us anything about the variation, or spread, of these values.
Example: all of the following have the same expectation.

W = 0 with probability 1
Y = −1 with probability 1/2, and 1 with probability 1/2
Z = −100 with probability 1/2, and 100 with probability 1/2
Variance

A measure of how far a random variable deviates from its mean.
We want to measure X − E[X], but E[X − E[X]] = 0.
Let's consider E[(X − E[X])²], which is the variance.
Lottery game 1: µ = E[X] = 5(0.1) + (−1)(0.9) = −0.4 dollars per game

x              5       -1
P(X = x)       0.1     0.9
X − E[X]       5.4     −0.6
(X − E[X])²    29.16   0.36

E[(X − E[X])²] = (29.16)(0.1) + (0.36)(0.9) = 3.24

The definition:

Var(X) = E[(X − E[X])²] = E[(X − µ)²]
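The lottery variance can be computed both from the definition and from the shortcut Var(X) = E[X²] − µ² introduced on the next slide; a small sketch (helper names are mine):

```python
# Variance of a discrete random variable, computed two equivalent ways:
# Var(X) = E[(X - mu)^2] and Var(X) = E[X^2] - mu^2.
def expectation(values, probs):
    return sum(x * p for x, p in zip(values, probs))

values, probs = [5, -1], [0.1, 0.9]   # the lottery from the slide
mu = expectation(values, probs)

var_def = expectation([(x - mu) ** 2 for x in values], probs)
var_alt = expectation([x ** 2 for x in values], probs) - mu ** 2

print(var_def, var_alt)  # both 3.24, up to floating-point rounding
```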
Variance

If X is a random variable with mean µ, then the variance of X, denoted by Var(X), is defined by Var(X) = E[(X − µ)²].
Also, Var(X) = E[X²] − µ².
Example: compute Var(X) when X represents the outcome when we roll a fair die.
Since P{X = i} = 1/6, i = 1, 2, 3, 4, 5, 6, we obtain

E[X²] = ∑_{i=1}^{6} i² P{X = i} = 1²(1/6) + 2²(1/6) + 3²(1/6) + 4²(1/6) + 5²(1/6) + 6²(1/6) = 91/6

We computed the mean before, µ = E[X] = 7/2, so

Var(X) = E[X²] − µ² = 91/6 − (7/2)² = 35/12
Systems of random variables
Systems of random variables

A random experiment happened, and we have several random variables related to this experiment that we want to study.
We find Ω and compute P(ω) for every ω ∈ Ω.
We can define many real functions on Ω. Each is a random variable:
X : Ω → R and Y : Ω → R
Considering these together gives a system of random variables (X, Y).
We want to know whether we can say something about Y based on our knowledge about X, and vice versa.
We may consider many random variables together in a larger system (X1, ..., Xn).
Linear transformations

Let X be a random variable and c ∈ R: Y = X + c.
Example: Y = X + 2 and the PMFs

x          -1    3     4
P(X = x)   0.2   0.5   0.3

x          1     5     6
P(Y = x)   0.2   0.5   0.3

(PMF plots: the plot of Y is the plot of X shifted right by 2.)
Linear transformations

Let X be a random variable and c ∈ R: Y = cX.
Example: Y = 2X and the PMFs

x          -1    3     4
P(X = x)   0.2   0.5   0.3

x          -2    6     8
P(Y = x)   0.2   0.5   0.3

(PMF plots: the plot of Y is the plot of X with the x-axis scaled by 2.)
Properties of expected values

E[aX + b] = E[aX] + E[b] = aE[X] + b
E[X + Y] = E[X] + E[Y]
E[aX + bY] = aE[X] + bE[Y]
For n random variables X1, X2, ..., Xn:

E[∑_{i=1}^{n} ci Xi] = ∑_{i=1}^{n} ci E[Xi]
Transformation of a random variable

A function of a random variable X is a random variable: for g : R → R, Y = g(X).
Examples:
g(x) = ax + b, then Y = g(X) = aX + b
g(x) = x², then Y = X²
g(x) = (x − E[X])², then Y = (X − E[X])²
g(x) = sin(x), then Y = sin(X)
Y = g(X) where Y(ω) = g(X(ω)).

x          x1   x2   ...   xn
P(X = x)   p1   p2   ...   pn

x          g(x1)   g(x2)   ...   g(xn)
P(Y = x)   p1      p2      ...   pn

E[X] = ∑_{i=1}^{n} xi pi
E[Y] = E[g(X)] = ∑_{i=1}^{n} g(xi) pi
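The rule E[g(X)] = ∑ g(xi) pi means we never need the PMF of Y = g(X) itself; a sketch using the fair die (the `expect_g` name is mine):

```python
from fractions import Fraction

# E[g(X)] = sum_i g(x_i) p_i for a discrete X -- no need to first
# derive the PMF of Y = g(X).
def expect_g(g, values, probs):
    return sum(g(x) * p for x, p in zip(values, probs))

# Fair die; g(x) = x^2 gives E[X^2] = 91/6, matching the variance slide.
faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

assert expect_g(lambda x: x, faces, probs) == Fraction(7, 2)
assert expect_g(lambda x: x * x, faces, probs) == Fraction(91, 6)
```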
Properties of variance

For a random variable X: Var(X) = E[(X − E[X])²].
From another perspective, with g(x) = (x − E[X])² and Y = g(X) = (X − E[X])²:

E[Y] = E[(X − E[X])²] = ∑_{i=1}^{n} g(xi) pi = ∑_{i=1}^{n} (xi − µ)² pi

Var(X) = E[Y]
Var(aX + b) = a²Var(X), a, b ∈ R
Adding a constant shifts the cloud of data to the left or right; it does not change the variability.
What about Var(X1 + X2)?
Properties of variance

Var(aX + b) = a²Var(X); if a = 0, Var(b) = 0.
The quantity √Var(X) is called the standard deviation of X: std(X) = √Var(X).

Var(X1 + X2) = E[((X1 + X2) − E[X1 + X2])²]
= E[((X1 − E[X1]) + (X2 − E[X2]))²]
= E[(X1 − E[X1])² + (X2 − E[X2])² + 2(X1 − E[X1])(X2 − E[X2])]
= E[(X1 − E[X1])²] + E[(X2 − E[X2])²] + 2E[(X1 − E[X1])(X2 − E[X2])]
= Var(X1) + Var(X2) + 2E[(X1 − E[X1])(X2 − E[X2])]
Covariance

The covariance of two random variables X and Y, written Cov(X, Y), is defined by

Cov(X, Y) = E[(X − µX)(Y − µY)]

Equivalently, Cov(X, Y) = E[XY] − E[X]E[Y].
Cov(X, Y) = Cov(Y, X)
Cov(aX, Y) = a Cov(X, Y)
Cov(X1 + X2, Y) = Cov(X1, Y) + Cov(X2, Y)
Cov(X + a, Y) = Cov(X, Y + b) = Cov(X, Y)
For Y = aX + b: Cov(X, Y) = Cov(X, aX + b) = a Cov(X, X) = a Var(X)
Covariance

Definition: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Covariance measures association between X and Y.
It is strongly related to the dependence of X and Y.
If X and Y are independent then Cov(X, Y) = 0:

E[(X − E[X])(Y − E[Y])]
= E[XY − X E[Y] − Y E[X] + E[X]E[Y]]
= E[XY] − E[X]E[Y] − E[Y]E[X] + E[X]E[Y]
= E[XY] − E[X]E[Y] = 0,

since independence gives E[XY] = E[X]E[Y].
Back to variance

Var(X + X) = 4Var(X) ≠ Var(X) + Var(X)

Var(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} Var(Xi) + ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} Cov(Xi, Xj)

For n = 2: Var(X + Y) = Var(X) + Var(Y) + Cov(X, Y) + Cov(Y, X)
If X and Y are independent random variables, then Cov(X, Y) = 0 and

Var(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} Var(Xi)
Properties of covariance

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
If X and Y are independent, then Cov(X, Y) = 0.
If Cov(X, Y) = 0, we cannot say X and Y are independent. Counterexample:

Y\X      -1    0     1     margin
1        1/3   0     1/3   2/3
-2       0     1/3   0     1/3
margin   1/3   1/3   1/3

P(X = 0 ∩ Y = 1) = 0 ≠ P(X = 0)P(Y = 1), so they are not independent.
Yet E[X] = 0, E[Y] = 0, and E[XY] = 0, hence Cov(X, Y) = 0.
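The counterexample above can be verified mechanically from its joint PMF (an illustrative sketch using exact fractions):

```python
from fractions import Fraction

# Joint PMF from the slide: Cov(X, Y) = 0 yet X and Y are dependent.
F = Fraction
joint = {(-1, 1): F(1, 3), (0, -2): F(1, 3), (1, 1): F(1, 3)}

ex  = sum(x * p for (x, y), p in joint.items())
ey  = sum(y * p for (x, y), p in joint.items())
exy = sum(x * y * p for (x, y), p in joint.items())
cov = exy - ex * ey          # Cov(X, Y) = E[XY] - E[X]E[Y]

# Dependence: P(X = 0, Y = 1) = 0, but P(X = 0) P(Y = 1) = (1/3)(2/3).
px0 = sum(p for (x, y), p in joint.items() if x == 0)
py1 = sum(p for (x, y), p in joint.items() if y == 1)
```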
Correlation

It can be shown that a positive value of Cov(X, Y) is an indication that Y tends to increase as X does,
while a negative value indicates that Y tends to decrease as X increases.
The strength of the relationship between X and Y is indicated by the correlation between X and Y:

Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))

This quantity always has a value between −1 and +1.
Correlation of two random variables

Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))
If there is a perfect linear relationship between X and Y, Y = aX + b:

Corr(X, Y) = Corr(X, aX + b) = aVar(X) / √(Var(X) · a²Var(X)) = a/|a| = ±1

Corr(aX, Y) = Corr(X, Y) if a > 0
Corr(X, Y) ∈ [−1, 1]
If Corr(X, Y) = 0, the two random variables are uncorrelated.
Sum of random variables

X : Ω → R, Y : Ω → R, Z : Ω → R
Expectation: if Z = X + Y then E[Z] = E[X] + E[Y].
Variance: suppose Var(X) > 0.
If Y = −X then Var(Y) = Var(X).
For Z = X + Y: Var(Z) = Var(X − X) = Var(0) = 0, while Var(X) + Var(Y) = 2Var(X).
So, in general, Var(X + Y) ≠ Var(X) + Var(Y).
For the special case that X and Y are independent random variables, Var(X + Y) = Var(X) + Var(Y).
Joint Probability Distribution
Joint probability distribution

To express interactions between two random variables.
Experiment: tossing two fair coins, Ω = {HH, HT, TH, TT}.
Let's define three random variables over this sample space:

X = 1 if the 1st toss is H, 0 else:   P(X = 1) = 0.5, P(X = 0) = 0.5
Y = 1 − X:                            P(Y = 1) = 0.5, P(Y = 0) = 0.5
Z = 1 if the 2nd toss is H, 0 else:   P(Z = 1) = 0.5, P(Z = 0) = 0.5

P(X = 0 ∩ Y = 0) = P(X = 0 ∩ X = 1) = 0, impossible!
P(X = 0 ∩ Z = 0) = P({TT}) = 1/4
Although they look similar, they have different interactions. We use the joint distribution to express interactions between them.
Joint probability distribution

x          x1   x2   ...   xm
P(X = x)   p1   p2   ...   pm

y          y1   y2   ...   yn
P(Y = y)   q1   q2   ...   qn

P(X = xi ∩ Y = yj) = pij for i = 1, ..., m and j = 1, ..., n
pij ≥ 0 and ∑_{i=1}^{m} ∑_{j=1}^{n} pij = 1

X = 1 if the 1st toss is H, 0 else, and Y = 1 − X:

X\Y   0                          1
0     P(X = 0 ∩ Y = 0) = 0       P(X = 0 ∩ Y = 1) = 0.5
1     P(X = 1 ∩ Y = 0) = 0.5     P(X = 1 ∩ Y = 1) = 0

X = 1 if the 1st toss is H, and Z = 1 if the 2nd toss is H:

X\Z   0                          1
0     P(X = 0 ∩ Z = 0) = 0.25    P(X = 0 ∩ Z = 1) = 0.25
1     P(X = 1 ∩ Z = 0) = 0.25    P(X = 1 ∩ Z = 1) = 0.25
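Both joint tables can be built by enumerating the sample space, exactly as in the slide; an illustrative sketch (the `joint` helper is my own):

```python
from fractions import Fraction
from itertools import product

# Two fair coin tosses; X = indicator of heads on toss 1, Y = 1 - X,
# Z = indicator of heads on toss 2.  Build joint PMFs by enumeration.
outcomes = list(product('HT', repeat=2))
p = Fraction(1, 4)                     # every outcome is equally likely

def joint(f, g):
    table = {}
    for w in outcomes:
        key = (f(w), g(w))
        table[key] = table.get(key, Fraction(0)) + p
    return table

X = lambda w: 1 if w[0] == 'H' else 0
Y = lambda w: 1 - X(w)
Z = lambda w: 1 if w[1] == 'H' else 0

jxy = joint(X, Y)   # mass only on (0, 1) and (1, 0)
jxz = joint(X, Z)   # 1/4 in every cell: X and Z are independent
```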
Marginal probabilities

Consider random variable X with SX = {x1, ..., xm} and Y with SY = {y1, ..., yn}.
We know P(X = xi ∩ Y = yj) = pij.
P(X = xi) = P(X = xi ∩ Y = y1) + ... + P(X = xi ∩ Y = yn) = pi1 + ... + pin

X\Y   0     1
0     0     0.5
1     0.5   0

P(X = 0) = 0 + 0.5 = 0.5
P(X = 1) = 0.5 + 0 = 0.5
P(Y = 0) = 0 + 0.5 = 0.5
P(Y = 1) = 0.5 + 0 = 0.5

X\Z   0      1
0     0.25   0.25
1     0.25   0.25

P(X = 0) = 0.25 + 0.25 = 0.5
P(X = 1) = 0.25 + 0.25 = 0.5
P(Z = 0) = 0.25 + 0.25 = 0.5
P(Z = 1) = 0.25 + 0.25 = 0.5
Independent random variables

X\Y   0     1
0     0     0.5
1     0.5   0

X\Z   0      1
0     0.25   0.25
1     0.25   0.25

In this example, if we know X we can tell what is happening to Y. But knowledge about X does not say anything about Z.
Consider two random variables defined on the same sample space Ω: X with SX = {x1, ..., xm} and Y with SY = {y1, ..., yn}.
If P(Y = yj | X = xi) = P(Y = yj) for all i and j, then X and Y are independent.
Equivalently, P(X = xi ∩ Y = yj) = P(X = xi)P(Y = yj) for all i and j.
Example

Experiment: three tosses of a fair coin.
X is the number of heads in the 1st and 2nd tosses.
Y is the number of heads in the 2nd and 3rd tosses.
Are these random variables independent?
P(X = 0 ∩ Y = 0) = P({TTT}) = 1/8
P(X = 1 ∩ Y = 0) = P({HTT}) = 1/8
P(X = 2 ∩ Y = 0) = 0, impossible
P(X = 1 ∩ Y = 1) = P({HTH, THT}) = 1/4
P(X = 0 ∩ Y = 0) = 1/8 ≠ P(X = 0)P(Y = 0) = (1/4)(1/4), so X and Y are not independent.

X\Y        0     1     2     marginal
0          1/8   1/8   0     1/4
1          1/8   1/4   1/8   1/2
2          0     1/8   1/8   1/4
marginal   1/4   1/2   1/4
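The table above can be rebuilt by enumerating all eight outcomes, which also confirms the dependence (illustrative sketch):

```python
from fractions import Fraction
from itertools import product

# Three fair coin tosses; X = heads in tosses 1-2, Y = heads in tosses 2-3.
outcomes = list(product('HT', repeat=3))
p = Fraction(1, 8)

joint = {}
for w in outcomes:
    x = (w[0] == 'H') + (w[1] == 'H')
    y = (w[1] == 'H') + (w[2] == 'H')
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + p

assert joint[(0, 0)] == Fraction(1, 8)   # only TTT
assert joint[(1, 1)] == Fraction(1, 4)   # {HTH, THT}
assert (2, 0) not in joint               # impossible combination

# Not independent: P(X=0, Y=0) != P(X=0) P(Y=0) = (1/4)(1/4).
px0 = sum(v for (x, y), v in joint.items() if x == 0)
py0 = sum(v for (x, y), v in joint.items() if y == 0)
assert joint[(0, 0)] != px0 * py0
```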
Independent random variables

x          x1   x2   ...   xm
P(X = x)   p1   p2   ...   pm

y          y1   y2   ...   yn
P(Y = y)   q1   q2   ...   qn

If two random variables are independent, then E[XY] = E[X]E[Y]:

E[XY] = ∑_{i=1}^{m} ∑_{j=1}^{n} xi yj P(X = xi ∩ Y = yj)
= ∑_i ∑_j xi yj P(X = xi)P(Y = yj)
= ∑_i ∑_j xi yj pi qj = (∑_i xi pi)(∑_j yj qj) = E[X]E[Y]
Continuous random variables
Continuous random variables

Discrete random variables can take distinct values.
Continuous random variables can take all values in a continuum, like an interval.
Examples: temperature, height, weight, time.
The nature of this kind of variable is that it never takes exact values; even the best scales have limited accuracy.
Something weighs 10 grams ± 0.1 grams: the exact value is in [10 − 0.1, 10 + 0.1], or [a − ε, a + ε].
Therefore, for a continuous random variable X, P(X = a) = 0.
Probability density function

X is a continuous random variable.
fX(x) is its probability density function, given that
fX(x) ≥ 0
∫_{−∞}^{+∞} fX(x) dx = 1
P(X ∈ [a, b]) = P(a ≤ X ≤ b) = ∫_a^b fX(x) dx
P(X = a) = P(X ∈ [a, a]) = ∫_a^a fX(x) dx = 0
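Both conditions can be checked numerically for a concrete density, here the exponential f(x) = e^(−x) used on the next slide; a sketch with a simple midpoint-rule integrator (my own helper, not from the unit):

```python
from math import exp

# Exponential PDF f(x) = e^(-x) for x >= 0: check numerically that it
# integrates to 1, and compute P(a <= X <= b) as an integral.
def f(x):
    return exp(-x) if x >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h  # midpoint rule

total = integrate(f, 0.0, 50.0)   # the tail beyond 50 is negligible
prob  = integrate(f, 1.0, 2.0)    # P(1 <= X <= 2) = e^-1 - e^-2
```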
Probability density function

The cumulative distribution:

F(a) = P{X ≤ a} = P{X ∈ (−∞, a]} = ∫_{−∞}^{a} f(x) dx

dF(a)/da = f(a)
P{a < X < b} = area of the region under f between a and b.
For example,

f(x) = e^(−x) for x ≥ 0, and f(x) = 0 for x < 0
Uniform continuous random variable

A continuous random variable X on an interval [a, b] with

fX(x) = 1/(b − a)

We can easily check:
fX(x) ≥ 0
∫_{−∞}^{+∞} fX(x) dx = 1
Cumulative distribution function

X is a continuous or discrete random variable.
The cumulative distribution function is defined as

FX(a) = P(X ≤ a)

For a continuous random variable: FX(a) = P(X ≤ a) = ∫_{−∞}^{a} fX(x) dx
For a discrete random variable: FX(a) = P(X ≤ a) = ∑_{x ≤ a} pX(x)
Properties of CDF

FX(x) = P(X ≤ x)
FX(x) is non-strictly increasing.
lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1
FX(x) ∈ [0, 1]

Relationship between PDF and CDF: X is a random variable with CDF FX(x) and PDF fX(x).
P(X ∈ [a, b]) = P(X ≤ b) − P(X ≤ a) = FX(b) − FX(a)
P(X ∈ [a, b]) = ∫_a^b fX(x) dx
∫_a^b fX(x) dx = FX(b) − FX(a), which means FX(x) is an antiderivative of fX(x):
fX(x) = FX′(x)
Examples

Uniform distribution on [a, b], X ∼ U(a, b):

fX(x) = 1/(b − a) for x ∈ [a, b], and fX(x) = 0 otherwise
Cumulative distribution function: example

Question: suppose the random variable X has distribution function

F(x) = 0 for x ≤ 0, and F(x) = 1 − e^(−x²) for x > 0

What is the probability that X exceeds 1?
Solution: the desired probability is computed as follows:

P{X > 1} = 1 − P{X ≤ 1} = 1 − F(1) = e^(−1)
Normal (Gaussian) distribution

X is a normally distributed random variable, X ∼ N(µ, σ²).
The PDF with parameters (µ, σ²):

fX(x) = (1 / (σ√(2π))) e^(−(1/2)((x − µ)/σ)²)

Standard normal distribution when (µ = 0, σ² = 1), Z ∼ N(0, 1):

fZ(x) = (1 / √(2π)) e^(−x²/2)
Normal distribution and transformations

The standard normal distribution is obtained from a normal distribution using a linear transformation:
X ∼ N(µ, σ²)
Z = (X − µ)/σ, i.e. Z = (1/σ)X − µ/σ
Sometimes data do not have a normal distribution, and we apply a nonlinear transformation to make them look closer to normal.
For example, the log transformation Y = log(X): the new variable may have a normal distribution.
Other examples: Y = 1/X, Y = X², Y = √X.
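The standardisation Z = (X − µ)/σ can be seen empirically on simulated data: after the transformation the sample mean is near 0 and the sample standard deviation near 1. An illustrative sketch (the seed, sample size, and tolerances are my own choices):

```python
import random
import statistics

# Standardising samples: if X ~ N(mu, sigma^2), then Z = (X - mu)/sigma
# should have mean close to 0 and standard deviation close to 1.
random.seed(42)
mu, sigma = 10.0, 2.0
xs = [random.gauss(mu, sigma) for _ in range(100_000)]
zs = [(x - mu) / sigma for x in xs]

print(statistics.mean(zs), statistics.stdev(zs))  # close to 0 and 1
```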
Approximating PDFs using histograms

Consider X a random variable, of which we only have a sample.
You can think about a column in your data matrix:
x1, x2, ..., xn, an independent sample from X.
A histogram of the sample approximates the shape of the PDF of X.
Different shapes of distributions

Symmetric, skewed right, skewed left
Unimodal, bimodal, multimodal
Expected value of a continuous random variable

For a discrete random variable X:

x          x1   x2   ...   xn
P(X = x)   p1   p2   ...   pn

µ = E[X] = ∑_{i=1}^{n} xi pi

For a continuous random variable X with PDF fX(x):

µ = E[X] = ∫_{−∞}^{+∞} x fX(x) dx

Example: uniform distribution X ∼ U(0, 1) on [0, 1], fX(x) = 1:

E[X] = ∫_{−∞}^{+∞} x fX(x) dx = ∫_0^1 (x)(1) dx = 1/2
Properties of expected value

E[X + Y] = E[X] + E[Y]
E[aX + bY + c] = aE[X] + bE[Y] + c
E[g(X)] = ∫_{−∞}^{+∞} g(x) fX(x) dx

Variance of a continuous random variable:

Var(X) = E[(X − E[X])²] = ∫_{−∞}^{+∞} (x − µ)² fX(x) dx
Properties of expected value

Suppose we are given a random variable X and its probability distribution,
and we are interested in the expected value of some function of X, say g(X).
If X is a discrete random variable with probability mass function p(x), then for any real-valued function g,

E[g(X)] = ∑_x g(x) p(x)

If X is a continuous random variable with probability density function f(x), then for any real-valued function g,

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
Properties of expected value

The expected value of a random variable X, E[X], is also referred to as the mean or the first moment of X.
The quantity E[X^n], n ≥ 1, is called the nth moment of X:

E[X^n] = ∑_x x^n p(x) if X is discrete
E[X^n] = ∫_{−∞}^{∞} x^n f(x) dx if X is continuous
Joint PDF and CDF

Let X and Y be two random variables defined on the same probability space.
Joint PMF for discrete random variables:

pX,Y(x, y) = P(X = x ∩ Y = y)

Joint PDF for continuous random variables: fX,Y(x, y)
Joint CDF:

FX,Y(x, y) = P(X ≤ x ∩ Y ≤ y)
Independent random variables
X and Y are independent if

F_{X,Y}(x, y) = F_X(x) F_Y(y) for all x and y

equivalent to P(X ≤ x ∩ Y ≤ y) = P(X ≤ x) P(Y ≤ y)
For continuous variables, f_{X,Y}(x, y) = f_X(x) f_Y(y)
Covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Correlation: Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
If X and Y are independent, then Cov(X, Y) = 0. The converse may not be true.
If X and Y are independent, then Var(X + Y) = Var(X) + Var(Y)
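A quick Python illustration of why the converse fails (the distribution is chosen for the example): take X uniform on {−1, 0, 1} and Y = X². Y is a deterministic function of X, so they are clearly dependent, yet Cov(X, Y) = 0:

```python
# X uniform on {-1, 0, 1} and Y = X^2: dependent, but covariance is zero.
xs = [-1, 0, 1]
px = 1 / 3  # P(X = x) for each x in the support

ex = sum(x * px for x in xs)          # E[X] = 0 by symmetry
ey = sum(x**2 * px for x in xs)       # E[Y] = E[X^2] = 2/3
exy = sum(x * x**2 * px for x in xs)  # E[XY] = E[X^3] = 0 by symmetry
cov = exy - ex * ey                   # Cov(X, Y) = E[XY] - E[X]E[Y]
print(cov)  # 0.0, even though Y is a function of X
```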
Descriptive statistics
Descriptive statistics
Generally we have a sample from a population (modelled as a random variable).
For example, a column representing the age of all the cases in a data table is a sample from an underlying random variable that we don't know.
If the average age is 56.75, the variance is 90.2, and the standard deviation is 9.5, we can summarise the column as 56.75 ± 9.5.
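As a minimal sketch in Python (the ages below are made up), the standard library's statistics module computes these summaries directly:

```python
import statistics

# Hypothetical "age" column from a data table.
ages = [45, 52, 58, 61, 49, 67, 55, 63]

mean = statistics.mean(ages)     # sample mean
var = statistics.variance(ages)  # sample variance (n - 1 in the denominator)
std = statistics.stdev(ages)     # sample standard deviation, sqrt(variance)
print(f"{mean:.2f} ± {std:.2f}")  # summary in the "mean ± stdev" form
```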
Random variables as populations and samples
Consider a random variable X, and a sample from it: x₁, …, xₙ.
Using the properties of the sample, we can estimate the properties of the population (random variable).

Properties of X        Properties of the sample
µ = E[X]               x̄
σ² = Var(X)            s²
σ = std(X)             s
m (median)             x̃
Study two random variables
Consider two random variables X and Y.
σ_{X,Y} = Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]
ρ_{X,Y} = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
Instead of the random variables themselves, we have samples of them, or we want to study two numerical columns in a data matrix: x₁, …, xₙ and y₁, …, yₙ.
Sample covariance:

s_{X,Y} = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Sample correlation:

r = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) s_X s_Y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² )
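The sample correlation formula can be sketched directly in Python (the data points are made up for illustration):

```python
import math

def sample_corr(xs, ys):
    """r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs)
                    * sum((y - ybar) ** 2 for y in ys))
    return num / den

# A perfectly linear relationship gives r = 1.
print(sample_corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```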
Correlation coefficient through scatter plots
−1 ≤ r ≤ 1
Approximately normal datasets
Consider X whose distribution (histogram) has a symmetric bell shape.
The empirical rule:
Approximately 68% of the observations lie within x̄ ± s
Approximately 95% of the observations lie within x̄ ± 2s
Approximately 99.7% of the observations lie within x̄ ± 3s
Using the standard normal distribution:
P(−1 ≤ Z ≤ +1) ≈ 0.68
P(−2 ≤ Z ≤ +2) ≈ 0.95
P(−3 ≤ Z ≤ +3) ≈ 0.997
If a value sits more than 3 standard deviations from the mean, it is called an outlier.
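These three probabilities come from the standard normal CDF; in Python they can be checked with the error function, since P(−k ≤ Z ≤ k) = erf(k/√2):

```python
import math

# For Z ~ N(0, 1), P(-k <= Z <= k) = erf(k / sqrt(2)).
for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2))
    print(f"P(-{k} <= Z <= {k}) = {p:.4f}")  # ≈ 0.6827, 0.9545, 0.9973
```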
Important Random Variables
Important discrete random variables
X is a Bernoulli random variable, X ∼ Br(p).
Suppose that an experiment, whose outcome can be classified either as a success or as a failure, is performed.
Let X = 1 when the outcome is a success and X = 0 when it is a failure:

P(X = 1) = p
P(X = 0) = 1 − p

where p, 0 ≤ p ≤ 1, is the probability that the trial is a success.
E[X] = p, Var(X) = p(1 − p)
Important discrete random variables
X is a Binomial random variable, X ∼ Binom(n, p):
the number of successes in n independent trials, when each trial is a success with probability p.
S_X = {0, 1, …, n}
Probability mass function, for k ∈ S_X:

P(X = k) = (n choose k) p^k (1 − p)^{n−k} = n! / (k! (n − k)!) · p^k (1 − p)^{n−k}

E[X] = np, Var(X) = np(1 − p)
X is a Poisson random variable with parameter λ > 0, X ∼ Pois(λ):
S_X = {0, 1, 2, …}
Probability mass function, for k ∈ S_X:

P(X = k) = e^{−λ} λ^k / k!

E[X] = Var(X) = λ
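A small Python check (n and p are arbitrary example values) that the binomial PMF sums to 1 and reproduces E[X] = np and Var(X) = np(1 − p):

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
var = sum(k**2 * binom_pmf(k, n, p) for k in range(n + 1)) - mean**2
print(total, mean, var)  # ≈ 1.0, 3.0 (= np), 2.1 (= np(1 - p))
```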
Important continuous random variables
X is a uniform random variable on the interval [a, b], X ∼ Unif(a, b).
S_X = R
Probability density function:

f(x) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise

X is a normal random variable, X ∼ N(µ, σ²).
S_X = R
Probability density function:

f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}

E[X] = µ, Var(X) = σ²
Important continuous random variables
X is an exponential random variable with parameter λ > 0, X ∼ Exp(λ).
S_X = R
Probability density function:

f(x) = λ e^{−λx} if x ≥ 0, and 0 if x < 0

E[X] = 1/λ, Var(X) = 1/λ²
It models the distribution of the amount of time until some specific event occurs:
the amount of time (starting from now) until an earthquake occurs,
or until a new war breaks out,
or until you receive a telephone call.
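A numerical sanity check in Python (λ = 2 is an arbitrary choice): a Riemann sum of x·f(x) over [0, 20] approximates E[X] = 1/λ, since the tail beyond 20 is negligible:

```python
import math

lam = 2.0  # rate parameter of Exp(lam)

def exp_pdf(x):
    """Density of Exp(lam): lam * e^(-lam * x) for x >= 0, else 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

# Riemann-sum approximation of E[X] = integral of x * f(x) dx.
dx = 1e-4
mean = sum(i * dx * exp_pdf(i * dx) * dx for i in range(int(20 / dx)))
print(mean)  # ≈ 0.5 = 1/lam
```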
Parameter Estimation and
Sampling
Random variables as populations and samples
Consider a random variable X, and a sample from it: x₁, …, xₙ.
Using the properties of the sample, we can estimate the properties of the population (random variable).

Population parameters    Point estimates
µ = E[X]                 x̄ = (Σᵢ₌₁ⁿ xᵢ) / n
σ² = Var(X)              s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
σ = std(X)               s = √(s²)
Statistical Inference
Sampling distribution
µ, σ², σ for a population
x̄, s², s coming from one sample
X̄, S², S when we have many samples
From a population with µ and σ²:
X̄ is the random variable of the sample means of the samples
S² is the random variable of the sample variances of the samples
Central limit theorem for the mean
Consider a population with mean µ and variance σ².
The random variable X̄ is the sample mean of randomly selected samples of size n.
The CLT says:
X̄ is approximately distributed as a normal distribution
E[X̄] = µ and std(X̄) = σ/√n
Conditions:
The original population is normal, OR
the original population is symmetric and n ≥ 10, OR
any population, with n ≥ 30
Cautions:
In practice we have only one sample
If you are interested in a better approximation of µ, use a larger sample size
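A simulation sketch in Python (the population, sample size, and number of replicates are arbitrary choices): draw many samples of size n from Uniform(0, 1), where µ = 0.5 and σ = √(1/12), and check that the sample means concentrate around µ with spread σ/√n:

```python
import math
import random
import statistics

random.seed(0)  # make the simulation reproducible

# Population: Uniform(0, 1), so mu = 0.5 and sigma = sqrt(1/12).
n, reps = 30, 2000
means = [statistics.mean(random.random() for _ in range(n)) for _ in range(reps)]

print(statistics.mean(means))   # ≈ 0.5, since E[Xbar] = mu
print(statistics.stdev(means))  # ≈ sqrt(1/12)/sqrt(30) ≈ 0.0527, since std(Xbar) = sigma/sqrt(n)
```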
Confidence Interval
Think about a population with parameters µ and σ².
You have a sample x₁, …, xₙ and you have computed the point estimate x̄.
You can make a confidence interval of the form x̄ ± z s/√n, where

x̄ = (Σᵢ₌₁ⁿ xᵢ) / n,  s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ),  and n is the sample size

z depends on the level of confidence:
z = 1.645 for a 90% CI for the mean
z = 1.96 for a 95% CI for the mean
z = 2.576 for a 99% CI for the mean
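A sketch of the computation in Python (the sample values and the 95% level are illustrative choices):

```python
import math
import statistics

def confidence_interval(sample, z=1.96):
    """xbar ± z * s / sqrt(n); z = 1.96 gives an approximate 95% CI for the mean."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # n - 1 in the denominator
    half = z * s / math.sqrt(n)
    return xbar - half, xbar + half

lo, hi = confidence_interval([52, 48, 55, 50, 47, 53, 49, 51])
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```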
Point estimate vs. confidence interval