Unit7 Probability Statistics I-1


Foundations of Data Science

7COM1073

Introduction to Probability and Statistics (I)


Learning Outcomes
At the end of this unit, you should have

• A good understanding of the standard univariate distributions and their properties
• A good understanding of the use of conditional probability
Probability
• Probability – deducing what is likely to happen when an ‘experiment’ is
performed.

• Some terms people use for probability: chance, percentage, likelihood, odds, proportion

• A number between zero and one


➢ The classical approach
▪ It is a mathematical approach using counting rules.
▪ It is applied to random processes under certain assumptions, such as equally likely outcomes.
➢ The relative frequency approach
▪ It is based on collecting data and finding the proportion of times that an event 𝐸 occurred in that data.
➢ The subjective approach
➢ The logical approach
Axiomatic Probability Theory
Let (𝛺, 𝜮, 𝑃) denote a probability space, where
• 𝛺 is the set of all possible outcomes, known as the sample space
➢For example: A coin is flipped three times in succession.
• 𝜮 is a collection of subsets of 𝛺, each subset being called an event.
• 𝑃 is a probability measure defined as a real valued function of the elements of 𝜮
satisfying the following axioms of probability:
➢ Axiom 1: $0 \le P(A) \le 1$ for all $A \in \Sigma$.
➢ Axiom 2: $P(\Omega) = 1$.
➢ Axiom 3: If two events $A$ and $B$ are mutually exclusive (that is, they have no elements in
common), then the probability of either $A$ or $B$ occurring is the probability
of $A$ occurring plus the probability of $B$ occurring:
$P(A \cup B) = P(A) + P(B)$
Probability
If $E$ is an event, $P(E)$ is the probability that $E$ occurs. When all outcomes in the sample space are equally likely,

$P(E) = \dfrac{\text{number of elements in event } E}{\text{size of the sample space}}$

The maximum probability of any event is 1.


[Figure: event E drawn as a region inside the sample space]
Example
Rolling a fair six-sided die.
• Let $\Omega = \{1, 2, 3, 4, 5, 6\}$
• $A = \{2, 4, 6\}$
• $B = \{3, 5\}$
• $C = \{2, 3, 5\}$
• $P(A \cup B) = ?$
• $P(A \cup C) = ?$
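One way to check answers to the two questions above is to enumerate the events as Python sets. This is only a sketch under the classical assumption of equally likely outcomes; the helper name `prob` is made up for illustration. It also shows that Axiom 3 applies to $A$ and $B$ (no shared outcomes) but not directly to $A$ and $C$ (both contain 2).

```python
from fractions import Fraction

# Sample space and events from the example (fair six-sided die).
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {3, 5}
C = {2, 3, 5}


def prob(event):
    """Classical probability |event| / |omega|, assuming equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))


p_a_or_b = prob(A | B)  # A and B are mutually exclusive
p_a_or_c = prob(A | C)  # A and C share the outcome 2

print(p_a_or_b, prob(A) + prob(B))                 # 5/6 5/6
print(p_a_or_c, prob(A) + prob(C) - prob(A & C))   # 5/6 5/6
```

Adding $P(A)$ and $P(C)$ without subtracting the overlap would give 1, which overcounts the shared outcome.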
Example
Rolling two fair six-sided dice. What is the probability of getting two
even numbers?
1 2 3 4 5 6
1 (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
2 (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
3 (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
4 (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
5 (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
6 (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Event: getting two even numbers


The size of the sample space = 36
The number of elements in the event = 9
$P(\text{getting two even numbers}) = \dfrac{9}{36} = \dfrac{1}{4}$
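The counting in the table can be reproduced by listing all 36 ordered pairs. A minimal sketch, assuming two fair dice and equally likely pairs; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

# Event: both numbers are even.
event = [(a, b) for a, b in sample_space if a % 2 == 0 and b % 2 == 0]

p_two_even = Fraction(len(event), len(sample_space))
print(len(sample_space), len(event), p_two_even)  # 36 9 1/4
```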
Exercise
• A coin is flipped three times in succession. What is the probability of
getting two heads?

➢Sample space:
➢E: getting two heads
➢Size of the sample space
➢Number of times E occurs
➢ $P(E) = \dfrac{\text{number of times } E \text{ occurs}}{\text{size of the sample space}}$
Discrete Random Variables
Informally, a random variable is a map from the outcome space (𝛺) to
the real numbers.

Example: Consider throwing two fair dice.


➢ The sample space is $\Omega = \{(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), \dots, (6,6)\}$

➢ To each element $\omega \in \Omega$, we assign the real number $X(\omega)$. For example:

▪ We are interested in ‘the total score obtained’, which is a random variable and may be denoted by $X$

▪ $X((1,1)) = 2$, $X((2,3)) = 5$
Discrete Random Variables
Event defined by random variables

➢ If $X$ is a random variable and $x$, $x_1$ and $x_2$ are fixed real numbers, we may have the following events:

$(X = x)$, $(X \le x)$, $(X > x)$ or $(x_1 < X \le x_2)$.

➢ These events have probabilities which are denoted by

$P(X = x)$, $P(X \le x)$, $P(X > x)$ or $P(x_1 < X \le x_2)$.


Discrete Random Variables
• A random variable 𝑋 is called discrete if it only takes values in the integers
or (possibly) some other countable set of real numbers.

• Probability mass function (or probability density function): the probability that $X$ takes on a certain value:

$f_X(x_k) = P(X = x_k)$

• Cumulative distribution function

$F_X(x) = P(X \le x) = \sum_{x_k \le x} f_X(x_k)$

➢ $F_X(x)$ is a staircase function.


Example
A fair coin is flipped three times. Let $X$ be the number of heads in the three tosses.
• Probability mass function:
$f_X(0) = P(X = 0) = 1/8$; $f_X(1) = P(X = 1) = 3/8$;
$f_X(2) = P(X = 2) = 3/8$; $f_X(3) = P(X = 3) = 1/8$.
• Cumulative distribution function:
$x$ | $(X \le x)$ | $F_X(x)$
-1 | ∅ | 0
0 | {TTT} | 1/8
1 | {TTT, HTT, THT, TTH} | 4/8
2 | {TTT, HTT, THT, TTH, HHT, HTH, THH} | 7/8
3 | {TTT, HTT, THT, TTH, HHT, HTH, THH, HHH} | 1
4 | {TTT, HTT, THT, TTH, HHT, HTH, THH, HHH} | 1
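The probability mass function and the staircase cumulative distribution function in this example can be rebuilt by enumerating the eight outcomes. This is a sketch, assuming a fair coin so that every outcome has probability 1/8; the function name `cdf` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The eight equally likely outcomes, e.g. "HTT".
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# X = number of heads; count how many outcomes give each value of X.
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {k: Fraction(counts[k], len(outcomes)) for k in sorted(counts)}


def cdf(x):
    """F_X(x) = sum of f_X(x_k) over all x_k <= x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))


for x in (-1, 0, 1, 2, 3, 4):
    print(x, cdf(x))
```

Note that `Fraction` prints 4/8 in lowest terms as 1/2; the value is the same as in the table.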
Continuous Random Variables [4]
𝑋 is a continuous random variable if there is a real-valued function 𝑓𝑋 ,
called the probability density function of 𝑋, which satisfies
• $f_X$ is piecewise continuous;
• $f_X(x) \ge 0$;
• $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
Credit: https://wildart.github.io/CSC21700/CSC_21700_L03.html

Cumulative distribution function (cdf)


• $F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$; $0 \le F_X(x) \le 1$; $F_X(x)$ is a nondecreasing function
• $P(a < X < b) = \int_{a}^{b} f_X(x)\,dx$
• $P(a < X \le b) = F_X(b) - F_X(a)$
• $P(X > a) = 1 - F_X(a)$
Mean
• The mean (or expected value) of a random variable $X$, denoted by $\mu_X$ or $E(X)$, is defined by [4]

$\mu_X = E(X) = \begin{cases} \sum_k x_k f_X(x_k) & X \text{ discrete} \\ \int_{-\infty}^{\infty} x f_X(x)\,dx & X \text{ continuous} \end{cases}$

The expected value can be regarded as the long-run average value of $X$.


Variance
• The variance of a random variable $X$, denoted by $\sigma_X^2$ or $Var(X)$, is defined by [4]
$\sigma_X^2 = Var(X) = E\{[X - E(X)]^2\}$

$\sigma_X^2 = \begin{cases} \sum_k (x_k - \mu_X)^2 f_X(x_k) & X \text{ discrete} \\ \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx & X \text{ continuous} \end{cases}$

The variance should be regarded as the average of the squared differences of the actual values from the mean.
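As an illustration of the discrete formulas for the mean and the variance, the sketch below applies them to the number of heads in three tosses of a fair coin, using the probability mass function 1/8, 3/8, 3/8, 1/8 from the earlier coin example. The dictionary layout is just one convenient choice.

```python
from fractions import Fraction

# pmf of X = number of heads in three fair coin tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

# E(X) = sum over k of x_k * f_X(x_k)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = sum over k of (x_k - mean)^2 * f_X(x_k)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean, variance)  # 3/2 3/4
```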
Example
Three products are selected at random from 9 products, of which 2 are
defective. The sample space consists of the distinct, equally likely
samples of size 3. Let 𝑋 be the random variable which counts the
number of defective items in a sample. The values of 𝑋 are 0, 1, and 2.
What is the expected value of defective products in a sample of size 3?

$E(X) = \sum_k x_k f_X(x_k)$ ($X$ discrete)

$C(n, k) = \binom{n}{k} = \dfrac{n!}{k!\,(n-k)!}$
• Solution:
➢ The number of ways of choosing $x_i$ defectives from 2 defectives and choosing $3 - x_i$ nondefectives from 7 nondefectives is $\binom{2}{x_i}\binom{7}{3 - x_i}$
➢ The total number of possible outcomes is $\binom{9}{3}$
➢ The probability of the value $x_i$ of $X$ is
$p_i = \binom{2}{x_i}\binom{7}{3 - x_i} \Big/ \binom{9}{3}$, $(x_i = 0, 1, 2)$

➢ $E(X) = \sum_i x_i p_i$ ($X$ discrete)
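The steps of the solution can be carried through numerically. A sketch using `math.comb` for the binomial coefficients; the variable names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb

total, defective, sample_size = 9, 2, 3
good = total - defective  # 7 nondefective products

# p_i = C(2, x) * C(7, 3 - x) / C(9, 3) for x = 0, 1, 2
pmf = {
    x: Fraction(comb(defective, x) * comb(good, sample_size - x), comb(total, sample_size))
    for x in range(defective + 1)
}

expected = sum(x * p for x, p in pmf.items())

print(pmf[0], pmf[1], pmf[2])  # 5/12 1/2 1/12
print(expected)                # 2/3
```

Under these assumptions the expected number of defective products in a sample of size 3 comes out as 2/3.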
Example
Let $X$ be a random variable on the interval [0, 1] with the probability density function

$f_X(x) = \begin{cases} 0 & \text{if } x < 0 \text{ or } x > 1 \\ 1 & \text{if } 0 \le x \le 1 \end{cases}$

Compute $E(X)$.
Solution

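• $E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$
$= \int_{-\infty}^{0} x \times 0\,dx + \int_{0}^{1} x \times 1\,dx + \int_{1}^{\infty} x \times 0\,dx$
$= 0 + \frac{1}{2}x^2 \Big|_0^1 + 0$
$= \frac{1}{2}$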
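The integral in this solution can also be approximated numerically. The sketch below uses a simple midpoint rule over [0, 1], where the density is nonzero; `expected_value` and the step count are illustrative choices, not part of the original solution.

```python
def f(x):
    """Density from the example: 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def expected_value(pdf, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of x * pdf(x) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        mid = lo + (i + 0.5) * width
        total += mid * pdf(mid)
    return total * width


approx = expected_value(f, 0.0, 1.0)
print(round(approx, 6))  # 0.5
```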
The Discrete Uniform Distribution
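Interval $[a, b]$

$n = b - a + 1$

$f_X(x) = \begin{cases} \dfrac{1}{n} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$

$\mu_X = E(X) = \dfrac{a+b}{2}$

$\sigma_X^2 = Var(X) = \dfrac{n^2 - 1}{12}$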
For the special distributions, you just need to know how to use the formulas to get the mean and the variance. You do not need to derive them yourself.

Example: Suppose we throw a die. Let $X$ be the random variable denoting the number obtained.
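For the die example, the formulas for the discrete uniform distribution can be compared with a direct calculation from the definitions of mean and variance. A sketch, assuming a fair die so that $a = 1$, $b = 6$ and each face has probability $1/n$.

```python
from fractions import Fraction

a, b = 1, 6
n = b - a + 1
values = range(a, b + 1)

# Direct calculation from the definitions.
mean_direct = sum(Fraction(x, n) for x in values)
var_direct = sum((x - mean_direct) ** 2 * Fraction(1, n) for x in values)

# Closed-form formulas for the discrete uniform distribution.
mean_formula = Fraction(a + b, 2)
var_formula = Fraction(n * n - 1, 12)

print(mean_direct, mean_formula)  # 7/2 7/2
print(var_direct, var_formula)    # 35/12 35/12
```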
The Bernoulli Distribution
• If $X$ is a random variable with this distribution, then

$P(X = 1) = p$
$P(X = 0) = 1 - p$
The probability mass function $f$ of this distribution is:
$f(X; p) = \begin{cases} p & \text{if } X = 1 \\ q = 1 - p & \text{if } X = 0 \end{cases}$

Generate 5000 Bernoulli random numbers with the probability of success $p = 0.2$

Credit: http://cmdlinetips.com/2018/03/probability-distributions-in-python/
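A possible way to generate such numbers with only the Python standard library is sketched below. It is not the code behind the credited figure; the seed and variable names are arbitrary choices made here so that the run is repeatable. The sample mean and variance should land close to $p$ and $p(1 - p)$.

```python
import random

random.seed(0)  # arbitrary fixed seed, only for repeatability

p = 0.2
n_samples = 5000

# A Bernoulli(p) draw is 1 with probability p and 0 otherwise.
samples = [1 if random.random() < p else 0 for _ in range(n_samples)]

sample_mean = sum(samples) / n_samples
sample_var = sum((x - sample_mean) ** 2 for x in samples) / n_samples

print(sample_mean, sample_var)
```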
The Bernoulli Distribution

$\mu_X = E(X) = p$
$\sigma_X^2 = Var(X) = p(1 - p)$

➢ Coin tosses: a coin lands heads up or tails up.
➢ Birth: boy or girl.
➢ Use in epidemiology: an outcome such as death or survival.
The Binomial Distribution

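Here $X = X_1 + \dots + X_n$ is the number of successes in $n$ independent Bernoulli trials, each with success probability $p$.

$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$, $k = 0, 1, \dots, n$

$\mu_X = E(X) = E(X_1 + \dots + X_n) = p + \dots + p = np$

$\sigma_X^2 = Var(X) = Var(X_1 + \dots + X_n) = p(1-p) + \dots + p(1-p) = np(1-p)$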

Examples:
• The number of defective/non-defective
products in a production run.
• Yes/No survey
• The number of successful sales calls.
The Binomial Distribution
For example: run 20 independent experiments, each
having a Bernoulli distribution with parameter p=0.6.
Repeat the whole procedure 10000 times.

$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$

$\mu_X = E(X) = np$

$\sigma_X^2 = Var(X) = np(1 - p)$

For example, for $X = 12$, that is, setting $k$ to 12, we have:

$P(X = 12) = \dfrac{20!}{12!\,(20-12)!}\, 0.6^{12} (1 - 0.6)^{20-12} \approx 0.1797$
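The value 0.1797 and the simulation described on this slide (20 Bernoulli trials with $p = 0.6$, repeated 10000 times) can be sketched with the standard library. The function name `binomial_pmf` and the seed are assumptions made for illustration.

```python
import random
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.6
exact = binomial_pmf(12, n, p)

random.seed(0)  # arbitrary fixed seed, only for repeatability
repeats = 10_000

# Each repetition counts the successes among n Bernoulli(p) trials.
counts = [sum(random.random() < p for _ in range(n)) for _ in range(repeats)]
simulated = counts.count(12) / repeats
simulated_mean = sum(counts) / repeats

print(round(exact, 4))            # 0.1797
print(simulated, simulated_mean)  # close to 0.1797 and to n * p = 12
```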
Example
• 90% of all students pass the module
• A sample of 10 new students is selected
• Find the probability that exactly seven will pass

Do we satisfy the conditions of the binomial distribution model?


➢ There are only two possible mutually exclusive outcomes: pass or fail
➢ There is a fixed number of trials (students): 10
➢ It is reasonable to assume that the students are independent
➢ The probability of passing is 0.9 for each student
➢ We have $n = 10$, $p = 0.90$, $q = 0.10$, $k = 7$
$P(X = 7) = \dfrac{10!}{7!\,(10-7)!}\, 0.90^{7} (1 - 0.90)^{10-7} \approx 5.74\%$
The Poisson Distribution
$P(X = k \text{ events in an interval}) = e^{-\lambda} \dfrac{\lambda^k}{k!}$, $k = 0, 1, \dots$
$\lambda > 0$ is the average number of events per interval
$e \approx 2.71828$ is Euler's number

$\mu_X = E(X) = \lambda$
$\sigma_X^2 = Var(X) = \lambda$

Applications:
➢The number of customers entering a supermarket during various intervals of time.
➢The number of misprints on a page of a document.
The Poisson Distribution
$P(X = k \text{ events in an interval}) = e^{-\lambda} \dfrac{\lambda^k}{k!}$, $k = 0, 1, \dots$

Generate 20,000 random numbers following the Poisson distribution with $\lambda = 0.4$

For example, $P(X \ge 1) = 1 - P(X = 0) = 1 - e^{-0.4} \dfrac{0.4^0}{0!} \approx 0.3297$
Example
A doctor sees 3 patients an hour on average. Find the probability that she sees exactly 5 patients in the next hour.
$P(X = 5) = e^{-3} \dfrac{3^5}{5!} \approx 0.1008$
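Both Poisson calculations in this unit ($\lambda = 0.4$ and the doctor example with $\lambda = 3$) can be checked with a few lines. The function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


# At least one event when the average is 0.4 events per interval.
p_at_least_one = 1 - poisson_pmf(0, 0.4)

# Exactly 5 patients in an hour when the average is 3 per hour.
p_five_patients = poisson_pmf(5, 3)

print(round(p_at_least_one, 4), round(p_five_patients, 4))  # 0.3297 0.1008
```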
The Law of Large Numbers
The law of large numbers states that if we repeat a procedure over and
over, the relative frequency probability will approach the actual
probability.

The frequentist approach is to calculate the following:

$P(E) = \dfrac{\text{number of times } E \text{ occurred}}{\text{number of times the procedure was repeated}}$
The Law of Large Numbers

A coin is flipped three times in succession. What is the probability of getting two heads?

➢ Sample space:
➢ E: getting two heads
➢ Size of the sample space
➢ Number of times E occurs
➢ $P(E) = \dfrac{\text{number of times } E \text{ occurs}}{\text{size of the sample space}}$

Credit:
http://cmdlinetips.com/2018/03/probability-distributions-in-python/
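The law of large numbers can be seen by simulating the three-flip experiment many times and watching the relative frequency settle. This is a sketch, not the code behind the credited figure; it assumes a fair coin and reads "two heads" as exactly two heads, and the seed and trial counts are arbitrary.

```python
import random
from itertools import product

# Exact (classical) probability by enumerating the sample space.
outcomes = list(product("HT", repeat=3))
exact = sum(o.count("H") == 2 for o in outcomes) / len(outcomes)

random.seed(0)  # arbitrary fixed seed, only for repeatability


def relative_frequency(trials):
    """Fraction of repeated three-flip experiments that give exactly two heads."""
    hits = 0
    for _ in range(trials):
        flips = [random.choice("HT") for _ in range(3)]
        if flips.count("H") == 2:
            hits += 1
    return hits / trials


print("exact:", exact)  # exact: 0.375
for trials in (10, 100, 1_000, 10_000, 100_000):
    print(trials, relative_frequency(trials))
```

With few repetitions the relative frequency can be far from 0.375; with many it should stay close.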
The Continuous Uniform Distribution

Example: A student’s height can take any value within a range

Credit: http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/process_simulations_sensitivity_analysis_and_error_analysis_modeling/distributions_for_assigning_random_values.htm
The Gaussian or Normal Distribution
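A random variable $X$ is distributed normally with mean $\mu$ and variance $\sigma^2$ if its density is

$f_X(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$

We write $X \sim N(\mu, \sigma^2)$

$\mu_X = E(X) = \mu$

$\sigma_X^2 = Var(X) = \sigma^2$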

Examples: https://studiousguy.com/real-life-examples-normal-distribution/
The Gaussian or Normal Distribution

Credit: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
The Normal Distribution

Credit: https://www.mathsisfun.com/data/standard-normal-distribution.html
Z-scores
• Measure how far a single data point is from the mean, in units of standard deviation
$z = \dfrac{x - \bar{x}}{s}$ or $z = \dfrac{x - \mu}{\sigma}$
➢ $x$ is the data point
➢ $\bar{x}$ is the sample mean; $\mu$ is the population mean
➢ $s$ is the sample standard deviation;
➢ $\sigma$ is the population standard deviation
• Table: areas under the standard normal curve to the left of Z.
Credit: http://www.z-table.com/
Example

Cumulative distribution function: gives the area under the probability density function over the interval from negative infinity to x.

We can see: 1) the probability of x taking on values less than -2.5 is nearly 0;
2) the values sampled from x will mostly be less than 2.5.
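The two observations above, and the table lookup for a z-score, can be reproduced with `statistics.NormalDist` from the Python standard library. The sketch assumes the plotted variable is standard normal (mean 0, standard deviation 1); the data point 130 with mean 100 and standard deviation 15 is a made-up illustration.

```python
from statistics import NormalDist


def z_score(x, mean, std):
    """Number of standard deviations between x and the mean."""
    return (x - mean) / std


standard_normal = NormalDist(mu=0.0, sigma=1.0)  # assumed distribution

# Area to the left of -2.5 and of 2.5 under the standard normal curve.
left_of_minus_2_5 = standard_normal.cdf(-2.5)
left_of_plus_2_5 = standard_normal.cdf(2.5)

# Hypothetical data point: x = 130, mean = 100, standard deviation = 15.
z = z_score(130.0, 100.0, 15.0)
area_left = standard_normal.cdf(z)  # same value a z-table would give

print(round(left_of_minus_2_5, 4), round(left_of_plus_2_5, 4))  # 0.0062 0.9938
print(z, round(area_left, 4))                                   # 2.0 0.9772
```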
Central Limit Theorem [3]
Let $X_1, \dots, X_n$ be $n$ independent random variables, each of which has
mean $\mu$ and standard deviation $\sigma$. Let $Y = (X_1 + \dots + X_n)/n$ be the
average; thus, $Y$ has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. If $n$ is large,
then the cumulative distribution of $Y$ is very nearly equal to the
cumulative distribution of the Gaussian with mean $\mu$ and standard
deviation $\sigma/\sqrt{n}$.

Applications:
https://en.wikipedia.org/wiki/Central_limit_theorem#Applications_and_examples
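A small simulation can illustrate the theorem. The sketch below averages $n = 30$ fair die rolls (mean 3.5, variance 35/12, taken from the discrete uniform formulas) and repeats this many times; the choices of $n$, the number of repetitions, and the seed are arbitrary assumptions.

```python
import random
from statistics import pstdev

random.seed(0)  # arbitrary fixed seed, only for repeatability

n = 30            # die rolls per average
repeats = 20_000  # number of averages to collect

mu = 3.5                 # mean of one fair die roll
sigma = (35 / 12) ** 0.5  # standard deviation of one fair die roll
std_of_mean = sigma / n**0.5

averages = [sum(random.randint(1, 6) for _ in range(n)) / n for _ in range(repeats)]

observed_mean = sum(averages) / repeats
observed_std = pstdev(averages)

# For a Gaussian, about 68% of values lie within one standard deviation of the mean.
within_one = sum(abs(y - mu) <= std_of_mean for y in averages) / repeats

print(observed_mean, observed_std, std_of_mean, within_one)
```

The observed mean and standard deviation of the averages should be close to $\mu$ and $\sigma/\sqrt{n}$, and the share within one standard deviation should be roughly 0.68.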
Compound Events
• The probability that either $A$ or $B$ occurs
$P(A \cup B) = P(A \text{ or } B)$

• The probability that both $A$ and $B$ occur
$P(A \cap B) = P(A \text{ and } B)$
Example [1]
• 100 people showed up for a new test for cancer
• Event A: the person actually has cancer
• Event B: the person’s test result was positive (it claimed that they had cancer)
• Number of people in event A: 25
• Number of people in event B: 30
• Among the 30 people whose test results were positive, 20 actually have cancer
[Figure: Venn diagram of events A and B inside the sample space, with the overlap $A \cap B$ marked]
• $P(A \cap B) = ?$
• $P(A \cup B) = ?$
Solution
• $P(A \cap B) = \dfrac{20}{100} = 20\%$
• $P(A \cup B) = \dfrac{35}{100} = 35\%$
The Rules of Probability - The Addition Rule

• The addition rule

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

• The probability that someone either has cancer or has a positive test result

$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.25 + 0.30 - 0.20 = 0.35$
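The addition rule can be checked against a direct count. In the sketch below the 100 people are given hypothetical labels 0 to 99, arranged only so that the counts match the example (25 in A, 30 in B, 20 in both); the labelling itself carries no meaning.

```python
from fractions import Fraction

people = range(100)

# Hypothetical labels chosen to reproduce the counts in the example.
A = set(range(0, 25))  # 25 people who actually have cancer
B = set(range(5, 35))  # 30 people with a positive test; labels 5..24 are also in A


def prob(event):
    """Proportion of the 100 people who belong to the event."""
    return Fraction(len(event), len(people))


p_union_direct = prob(A | B)
p_union_rule = prob(A) + prob(B) - prob(A & B)

print(prob(A & B), p_union_direct, p_union_rule)  # 1/5 7/20 7/20
```

Here 1/5 = 20% and 7/20 = 35%, matching the values in the example.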


The Rules of Probability - Mutual exclusivity

Mutual exclusivity: two events are mutually exclusive if they cannot occur at the same time, that is, $A \cap B = \emptyset$, and $P(A \cap B) = 0$.

$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B)$
• Example:
For a given module, the events ‘a student passes the module’ and ‘a student fails
the module’ are mutually exclusive.
Conditional Probability
$P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$, $P(A) > 0$

Problem: A math teacher gave her class two tests. 25% of the
class passed both tests and 42% of the class passed the first
test. What percent of those who passed the first test also
passed the second test?

Credit: https://www.mathgoodies.com/lessons/vol6/conditional

See: http://setosa.io/ev/conditional-probability/
Exercise
• Flipping a fair coin three times.
• Let 𝐴 be the event that the first flip is a head.
• Let 𝐵 be the event of getting exactly two heads.

Compute 𝑃(𝐵|𝐴)
Solution

Let $\Omega$ be the sample space of all eight outcomes of flipping a coin three times, and
let each outcome have probability $\frac{1}{8}$.

$A = \{HHH, HHT, HTH, HTT\}$; $B = \{HHT, HTH, THH\}$

$A \cap B = \{HHT, HTH\}$

$P(B \mid A) = \dfrac{2/8}{4/8} = \dfrac{1}{2}$
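The solution can be verified by enumerating the sample space and applying $P(B \mid A) = P(A \cap B)/P(A)$. A sketch assuming a fair coin; the last two lines apply the same formula to the two-test problem from the Conditional Probability slide (25% passed both tests, 42% passed the first).

```python
from fractions import Fraction
from itertools import product

omega = ["".join(flips) for flips in product("HT", repeat=3)]

A = {o for o in omega if o[0] == "H"}        # first flip is a head
B = {o for o in omega if o.count("H") == 2}  # exactly two heads


def prob(event):
    """Classical probability, assuming all eight outcomes are equally likely."""
    return Fraction(len(event), len(omega))


p_b_given_a = prob(A & B) / prob(A)
print(sorted(A & B), p_b_given_a)  # ['HHT', 'HTH'] 1/2

# Two-test problem: P(second | first) = P(both) / P(first).
p_second_given_first = Fraction(25, 100) / Fraction(42, 100)
print(p_second_given_first, round(float(p_second_given_first), 3))  # 25/42 0.595
```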
Exercise
Sarah has 2 children. You learn that she has a son, Mark. What is the
probability that Mark’s sibling is a brother?

Conditional probability: $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$
➢What is the sample space?
➢What is the event A?
➢What is the event B?
References
[1] Principles of Data Science, Chapters 5-6, Sinan Ozdemir, Packt Publishing, 2016
[2] Probability Theory, Fundamentals of Machine Learning (Part 1), William Fleshman
[3] Linear Algebra and Probability for Computer Science Applications, Ernest Davis, 2012
[4] Schaum’s Outlines Probability, Random Variables, & Random Processes, Hwei Hsu, 1997
