
Lesson 2

Contents
1. Conditional Probability
   1.1 Independence
2. Bayes' Theorem
3. Bernoulli and binomial distributions
   3.1 Bernoulli Distribution
      3.1.1 Expected value
      3.1.2 Variance
   3.2 Binomial Distribution
4. Uniform Distribution
5. Exponential and Normal Distribution

1. Conditional Probability

In this segment, I will discuss conditional probability and Bayes' theorem.


Bayes' theorem is the theoretical underpinning of most of what we do within the Bayesian
statistical framework.
Conditional probability is when we're trying to consider two events that are related to each other. So we can ask: what is the probability of event A, given that we know event B happened? This is defined as the probability that both A and B happen, divided by the probability that B happens.

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

Example
Consider a class of 30 students. Suppose within this class there are 9 female students. Suppose also that we have 12 computer science majors, of which 4 are female.
We can then ask questions about probabilities within segments of this population.

               Female   Not Female   Total
CS Major          4          8         12
Not CS Major      5         13         18
Total             9         21         30
We can say:

 The probability that someone is female is P(F) = \frac{9}{30} = \frac{3}{10}
 The probability that someone is a computer science major is P(CS) = \frac{12}{30} = \frac{2}{5}
 The probability that someone is both female and a computer science major is P(F \cap CS) = \frac{4}{30} = \frac{2}{15}
So now we can ask conditional probability questions. What's the probability that someone is
female given they're a computer science major?

P(F \mid CS) = \frac{P(F \cap CS)}{P(CS)} = \frac{2/15}{2/5} = \frac{2}{15} \cdot \frac{5}{2} = \frac{1}{3}

We can also think about this in the original framework: of the 12 computer science majors, what fraction are female? That's exactly what this probability is saying, the probability that they're female given they're a computer science major. So there we could just look at 4 over 12, which gives the same answer of 1/3. An intuitive way to think about conditional probability is that we're looking at a sub-segment of the original population, and asking a probability question within that segment.
We can also look in the other direction. Suppose we want to know the probability that someone is female given they're not a computer science major, P(F | NCS), where we might also denote "not CS" as the complement, CS^c:

P(CS) + P(NCS) = 1 \;\Rightarrow\; P(NCS) = P(CS^c) = 1 - P(CS) = 1 - \frac{12}{30} = \frac{18}{30}

So,

P(F \mid CS^c) = \frac{P(F \cap CS^c)}{P(CS^c)} = \frac{5/30}{18/30} = \frac{5}{18}
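To make the arithmetic concrete, here's a minimal Python sketch (my own illustration, not part of the original lesson) that recovers all of these probabilities directly from the counts in the table:

```python
# Counts from the class example: 30 students, 9 female, 12 CS majors, 4 both.
total = 30
female = 9
cs = 12
female_and_cs = 4

p_f = female / total                 # P(F) = 9/30 = 3/10
p_cs = cs / total                    # P(CS) = 12/30 = 2/5
p_f_and_cs = female_and_cs / total   # P(F ∩ CS) = 4/30 = 2/15

# Conditional probability: P(F | CS) = P(F ∩ CS) / P(CS)
p_f_given_cs = p_f_and_cs / p_cs     # (2/15) / (2/5) = 1/3

# Other direction: P(F | CS^c) = P(F ∩ CS^c) / P(CS^c)
p_f_given_not_cs = (female - female_and_cs) / (total - cs)  # 5/18

print(p_f_given_cs, p_f_given_not_cs)  # 0.3333..., 0.2777...
```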

1.1 Independence

There's a concept of independence, which is when one event doesn't depend on the other. When two events are independent, the probability of A given B is equal to just the probability of A; it doesn't matter whether or not B occurred. When this is true, we also get that the probability of A and B happening is just the probability of A times the probability of B. This is a useful equality that we'll use throughout this course.

P(A \mid B) = P(A) \text{ if } A \text{ and } B \text{ are independent} \;\Rightarrow\; P(A \cap B) = P(A)\,P(B)

In this case, we can see that the probability of being female given they're a computer science major, 1/3, is not equal to the marginal probability that they're female, 3/10. So being female and being a computer science major are not independent.
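Reusing the variables from the sketch above, both forms of the independence condition fail on the class data, confirming this numerically:

```python
# Independence would require P(F | CS) == P(F),
# or equivalently P(F ∩ CS) == P(F) * P(CS).
print(p_f_given_cs, p_f)       # 0.3333... vs 0.3 -- not equal
print(p_f_and_cs, p_f * p_cs)  # 0.1333... vs 0.12 -- not equal
```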

2. Bayes’ Theorem

Bayes' Theorem is used to reverse the direction of conditioning.


Suppose we want to ask what P(A | B) is, but we know it in terms of P(B | A). So we can write

P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}

Using the Total Probability Theorem, we have:

Total Probability Theorem

The events A_1, \dots, A_N form a partition of the sample space \Omega if:

 the A_i are mutually exclusive: A_i \cap A_j = \emptyset for i \neq j
 A_1 \cup A_2 \cup \dots \cup A_N = \Omega

Let A_1, \dots, A_N be a partition of \Omega. Then for any event B,

P(B) = \sum_{i=1}^{N} P(A_i)\,P(B \mid A_i)

P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}
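As a small sketch of this two-event form (my own, not from the lesson), note that the denominator is exactly the total probability of B:

```python
def bayes_rule(p_b_given_a, p_a, p_b_given_ac):
    """Return P(A|B) from P(B|A), P(A), and P(B|A^c)."""
    p_b = p_b_given_a * p_a + p_b_given_ac * (1 - p_a)  # total probability theorem
    return p_b_given_a * p_a / p_b

# Class example below: A = CS major, B = female.
# P(F|CS) = 1/3, P(CS) = 2/5, P(F|CS^c) = 5/18  =>  P(CS|F) = 4/9
print(bayes_rule(1/3, 2/5, 5/18))  # 0.4444...
```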

In the example of computer science majors and females in the class, we can compute:

P(CS \mid F) = \frac{P(F \mid CS)\,P(CS)}{P(F \mid CS)\,P(CS) + P(F \mid CS^c)\,P(CS^c)} = \frac{\frac{1}{3} \cdot \frac{2}{5}}{\frac{1}{3} \cdot \frac{2}{5} + \frac{5}{18} \cdot \frac{18}{30}} = \frac{4}{9} \quad (1)

We can compare this to the direct answer:

P(CS \mid F) = \frac{P(CS \cap F)}{P(F)} = \frac{2/15}{3/10} = \frac{2}{15} \cdot \frac{10}{3} = \frac{4}{9}

In this case, it's straightforward to get the answer either way. The direct calculation just uses the definition, while in equation (1) we have the probability in terms of one direction of conditioning, P(F | CS), and we flip it to the other direction of conditioning, P(CS | F). This is particularly useful in the many problems that are posed in that direction.

Example
Another example comes from an early test for HIV antibodies known as the ELISA test. In that case,

P(+ \mid HIV) = 0.977 \qquad P(- \mid \text{no HIV}) = 0.926

\Rightarrow P(+ \mid \text{no HIV}) = 1 - P(- \mid \text{no HIV}) = 0.074

So this is a pretty accurate test: over 90% of the time, it'll give you an accurate result. Today's tests are much more accurate, but in the early days this was still pretty good. At that point, a study found that the probability that a North American had HIV was about 0.0026:

P ( HIV )=0.0026

So then we could ask: if we randomly selected someone from North America, tested them, and they tested positive for HIV, what's the probability that they actually have HIV, given they've tested positive?

P(HIV \mid +) = \,?

In this case we don't have all the information to compute this directly as easily as in the earlier example, but we do have the information in the reverse direction of conditioning, so it's the perfect time to use Bayes' Theorem:

P(HIV \mid +) = \frac{P(+ \mid HIV)\,P(HIV)}{P(+ \mid HIV)\,P(HIV) + P(+ \mid \text{no HIV})\,P(\text{no HIV})} = \frac{(0.977)(0.0026)}{(0.977)(0.0026) + (0.074)(0.9974)} \approx 0.033

The probability they have HIV is less than 4%, even though they've tested positive on a test that is nominally quite accurate. Why is this so? It's because this is a rare disease, and this is actually a fairly common problem for rare diseases. The mass of false positives, (1 − 0.926)(1 − 0.0026), greatly outnumbers that of the true positives because the disease is rare. So even though the test is very accurate, we get more false positives than true positives.
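A quick numerical check (my own sketch, using the figures quoted above) makes the false-positive effect explicit:

```python
p_pos_given_hiv = 0.977      # P(+ | HIV)
p_neg_given_no_hiv = 0.926   # P(- | no HIV)
p_hiv = 0.0026               # prevalence at the time

true_pos = p_pos_given_hiv * p_hiv                  # ~0.0025
false_pos = (1 - p_neg_given_no_hiv) * (1 - p_hiv)  # ~0.0738, much larger

p_hiv_given_pos = true_pos / (true_pos + false_pos) # Bayes' theorem
print(p_hiv_given_pos)                              # ~0.033, about 3.3%
```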
This obviously has important policy implications for things like mandatory testing. It makes much more sense to test in a subpopulation where the prevalence of HIV is higher, rather than in a general population where it's quite rare.
To conclude, Bayes' Theorem is an important part of our approach to Bayesian statistics. We can use it to update our information: we start with prior beliefs, we collect data, and we then condition on the data to arrive at our posterior beliefs. Bayes' Theorem is the coherent way to do this updating.

3. Bernoulli and binomial distributions

In this segment we will review some basic probability distributions, both discrete and continuous.

3.1 Bernoulli Distribution

The first one is the Bernoulli distribution. It's used when we have two possible outcomes, such as flipping a coin, where it could be heads or tails, or cases where we have a success or a failure. To denote this, let's say a random variable X follows a Bernoulli distribution with probability p:

X \sim \mathrm{B}(p)

where p is the probability of success, or the probability of heads.


The \sim reads "is distributed as," saying X follows this distribution. In this case, the probability of a success, or heads, is p:

P(X = 1) = p

and correspondingly

P(X = 0) = 1 - p

We can write this as a function over all the different possible outcomes and ask: what's the probability that the random variable X takes the value little x, given a specific value of p?

f(X = x \mid p) = \,?

The notation here is that capital letters refer to the random variable, and lower-case letters refer to a possible value it might take; then we have a probability parameter p specified for it. Later, when such parameters are unknown, we'll represent them with a Greek letter rather than a Roman letter.
We may write this in shorthand as:

f(X = x \mid p) = f(x \mid p)

In the case of a Bernoulli, mathematically this works out to be:

f(x \mid p) = p^x (1 - p)^{1 - x}

for the case that x is either 0 or 1.


One way to write this is with an indicator function:

f(x \mid p) = p^x (1 - p)^{1 - x}\, I_{\{x \in \{0,1\}\}}(x)

This indicator function is a function of x; it's a step function, sometimes referred to as a Heaviside function. It takes the value 1 when its argument is true, and the value 0 when its argument is false. This will be really useful notation for us in the rest of this course. The indicator function takes precedence in the order of operations, so we always evaluate it first; this is a way we can avoid doing things such as taking the log or the square root of a negative number.
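As a small illustration of mine (not from the lesson), the indicator-function form translates directly into code, with the indicator evaluated first:

```python
def bernoulli_pmf(x, p):
    """f(x|p) = p**x * (1-p)**(1-x) * I{x in {0,1}}(x)."""
    if x not in (0, 1):   # evaluate the indicator first
        return 0.0
    return p**x * (1 - p)**(1 - x)

print(bernoulli_pmf(1, 0.3))  # 0.3
print(bernoulli_pmf(0, 0.3))  # 0.7
print(bernoulli_pmf(2, 0.3))  # 0.0 -- outside the support
```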
In basic textbooks, this is referred to as the probability mass function: it gives the probability of the different outcomes of the random variable. Some textbooks make a strong distinction between discrete variables, where these are probability mass functions, and continuous variables, where these are probability density functions. It turns out that if you get far enough along in math, up to the measure theory level, you can view everything as a density. So I'm going to refer to this as a density function in a measure-theoretic sense.

3.1.1 Expected value

One more concept is that of an expected value. This is the theoretical average, or the theoretical mean. We write it with a capital E, and the expected value of X is the sum, over all possible values little x, of x times the probability that the random variable takes the value x:

E[X] = \sum_x x\,P(X = x)


In this case it's really simple. One possible outcome is 1, which occurs with probability p; the other possible outcome is 0, which occurs with probability 1 − p. So the expected value for a Bernoulli is just the probability p:

E[X] = \sum_x x\,P(X = x) = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) = 0(1 - p) + p = p

3.1.2 Variance

Similarly, we can talk about the variance, which is the square of the standard deviation. For a Bernoulli, the variance works out to be:

Var(X) = p(1 - p)
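A short simulation (a sketch of mine, not part of the lesson) confirms both the mean p and the variance p(1 − p):

```python
import random

p = 0.3
draws = [1 if random.random() < p else 0 for _ in range(100_000)]

mean = sum(draws) / len(draws)                          # close to p = 0.3
var = sum((d - mean) ** 2 for d in draws) / len(draws)  # close to p(1-p) = 0.21
print(mean, var)
```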

3.2 Binomial Distribution

The generalization of the Bernoulli to the case of n repeated trials is the binomial: a binomial is just the sum of n independent Bernoullis. We say X follows a binomial distribution with parameters n and p:

X \sim \mathrm{Bin}(n, p)

In this case, the probability that X takes some value little x, given p, is n choose x, times p to the x, times (1 − p) to the n − x:

f(x \mid p) = \binom{n}{x} p^x (1 - p)^{n - x}

where n choose x is the binomial coefficient:

\binom{n}{x} = \frac{n!}{x!\,(n - x)!} \quad \text{for } x \in \{0, 1, \dots, n\}

This is all for X taking values 0, 1, up to n.


The expected value for a binomial is

E[X] = np

and the variance for a binomial is

Var(X) = np(1 - p)
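Here is a minimal sketch of mine for the binomial pmf, using Python's exact binomial coefficient, with the mean and variance recovered by direct summation:

```python
from math import comb

def binomial_pmf(x, n, p):
    """f(x|p) = C(n, x) * p**x * (1-p)**(n-x) for x in {0, 1, ..., n}."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
print(mean, var)  # about 3.0 and 2.1, i.e., np and np(1-p)
```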

4. Uniform Distribution

As we move over to continuous random variables, the math gets a little more complicated and requires a little calculus. We can define a continuous random variable based on its probability density function, or PDF for short. The PDF is, loosely, proportional to the probability that the random variable takes a particular value, in a differential calculus sense, because the variable can take infinitely many possible values. The key idea is that if you integrate the PDF over an interval, it gives you the probability that the random variable falls in that interval.
As a specific example, let's consider a uniform distribution.
Suppose X is uniformly distributed between 0 and 1:

X \sim U[0, 1]

So it can take any value between 0 and 1, all equally likely. In this case, we can write the density f(x) as 1 if x is in the interval [0, 1], and 0 otherwise:

f(x) = \begin{cases} 1 & \text{if } x \in [0,1] \\ 0 & \text{otherwise} \end{cases}

We could also represent this with an indicator function; it's just the indicator that x is between 0 and 1:

f(x) = I_{\{0 \le x \le 1\}}(x)

We can then think about how we might plot this as a function of x.

[Figure: plot of f(x), equal to 1 on the interval [0, 1] and 0 elsewhere.]

This is probably the simplest density function there is.


We can then ask questions about probabilities.
What's the probability that X will be between 0 and one-half?

P\left(0 \le X \le \tfrac{1}{2}\right) = \,?

To get this probability, we can just integrate the density function. This is the integral from 0 to 1/2 of f(x) dx, which is the integral from 0 to 1/2 of just dx, because the density is 1 on that range, and so the answer is one-half:

P\left(0 \le X \le \tfrac{1}{2}\right) = \int_0^{1/2} f(x)\,dx = \int_0^{1/2} 1\,dx = \frac{1}{2}

If we look at the picture, we can see that this integral corresponds to the dashed area under the density.

[Figure: f(x) = 1 on [0, 1], with the area between 0 and 1/2 marked.]

We could ask a similar question: what's the probability that X is between 0 and one-half inclusive, versus not including the endpoints?

P\left(0 \le X \le \tfrac{1}{2}\right) \;\text{vs.}\; P\left(0 < X < \tfrac{1}{2}\right)

Well, the integral doesn't depend on whether the endpoints are included or not; you're going to get the same answer, one-half.
In the calculus sense, we can also ask: what's the probability that X is equal to exactly one-half?

P\left(X = \tfrac{1}{2}\right) = \,?

Well, if we integrate from one-half to one-half, we just get 0: there are infinitely many possible outcomes, so the probability of taking any one particular value is 0.

P\left(X = \tfrac{1}{2}\right) = \int_{1/2}^{1/2} f(x)\,dx = 0
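Numerically, both facts can be checked with a quadrature routine; this is a sketch of mine that assumes SciPy is available:

```python
from scipy.integrate import quad

def f(x):
    """Density of U[0, 1]."""
    return 1.0 if 0 <= x <= 1 else 0.0

p_interval, _ = quad(f, 0, 0.5)  # P(0 <= X <= 1/2) = 0.5
p_point, _ = quad(f, 0.5, 0.5)   # P(X = 1/2) = 0.0 (zero-width interval)
print(p_interval, p_point)
```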

Some key rules for probability density functions:

 It is always true that the integral from minus infinity to infinity of f(x) dx is 1: with probability 1, something happens.

\int_{-\infty}^{+\infty} f(x)\,dx = 1

 It also has to be true that densities are non-negative for all possible values of x:

f(x) \ge 0

 We can write the expected value of a continuous random variable as the integral of x times f(x) dx; this is analogous to the sum we have for a discrete variable.

E[X] = \int_{-\infty}^{+\infty} x\,f(x)\,dx

In general, the expected value of some function of X, g(X), can be written as the integral of g(x) f(x) dx:

E[g(X)] = \int_{-\infty}^{+\infty} g(x)\,f(x)\,dx

 If we take the expectation of some constant c times a random variable X, that's just the constant times the expected value of X; we can pull the constant out:

E[cX] = c\,E[X]

 Similarly, if we're looking at the expectation of a sum of two random variables, that's the sum of the two expectations:

E[X + Y] = E[X] + E[Y]

 Finally, if X is independent of Y (the symbol ⊥ means independent), then the expected value of X times Y is just the product of the expected values:

E[XY] = E[X]\,E[Y]
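All three rules are easy to check by simulation; here is a sketch of mine with X and Y as independent uniforms on [0, 1]:

```python
import random

n = 100_000
xs = [random.random() for _ in range(n)]  # X ~ U[0,1], E[X] = 0.5
ys = [random.random() for _ in range(n)]  # Y ~ U[0,1], independent of X

def mean(v):
    return sum(v) / len(v)

print(mean([3 * x for x in xs]))              # E[cX] = c E[X], about 1.5
print(mean([x + y for x, y in zip(xs, ys)]))  # E[X+Y] = E[X] + E[Y], about 1.0
print(mean([x * y for x, y in zip(xs, ys)]))  # E[XY] = E[X] E[Y], about 0.25
```
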
5. Exponential and Normal Distribution

Some particular examples of continuous distributions are next.

Exponential Distribution

Let's start with the exponential distribution. We write X follows an exponential distribution with rate parameter λ:

X \sim \mathrm{Exp}(\lambda)

So, these are events that occur at a particular rate, and the exponential is the waiting time between events. For example, you're waiting for a bus that comes once every ten minutes; the exponential is your waiting time. This is particularly true for events coming from a Poisson process. The density function for x given λ is λ times e to the minus λx, for non-negative values of x:

f(x \mid \lambda) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0

The expected value is 1 over λ, and the variance is 1 over λ squared:

E[X] = \frac{1}{\lambda} \qquad Var(X) = \frac{1}{\lambda^2}

So in the case of a bus that comes once every ten minutes, the rate is one-tenth.
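Continuing the bus example as a sketch of mine: with rate λ = 1/10 per minute, simulated waiting times average about ten minutes, with variance about 100:

```python
import random

lam = 1 / 10  # rate: one bus every ten minutes on average
waits = [random.expovariate(lam) for _ in range(100_000)]

mean = sum(waits) / len(waits)                          # close to 1/lam = 10
var = sum((w - mean) ** 2 for w in waits) / len(waits)  # close to 1/lam**2 = 100
print(mean, var)
```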

Uniform Distribution

Another example is the uniform distribution.


We saw this already in the case where it's uniform on [0, 1]. But it could be on a larger interval; in fact, the endpoints could be unknown. It could be uniform between endpoints θ₁ and θ₂:

X \sim U[\theta_1, \theta_2]

In this case, we write its density as:

f(x \mid \theta_1, \theta_2) = \frac{1}{\theta_2 - \theta_1}\, I_{\{\theta_1 \le x \le \theta_2\}}(x)

Normal Distribution
The final example I want to mention right now is the Normal Distribution, also known as the Gaussian Distribution.

X \sim N(\mu, \sigma^2)

In this case, its density function is 1 over the square root of 2πσ², times e to the minus 1 over 2σ² times (x − μ) squared:

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}

This distribution has mean μ and variance σ².
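As a final sketch of mine, the density formula can be coded directly and sanity-checked against random draws:

```python
import math
import random

def normal_pdf(x, mu, sigma2):
    """f(x | mu, sigma^2) = (2*pi*sigma^2)**-0.5 * exp(-(x-mu)**2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(normal_pdf(0.0, 0.0, 1.0))  # standard normal at its mode: ~0.3989

draws = [random.gauss(2.0, 3.0) for _ in range(100_000)]  # mu = 2, sigma = 3
print(sum(draws) / len(draws))    # sample mean close to mu = 2
```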
