Frequentist Versus Bayesian Inference: B I T S, P
Contents

1 Frequentist Inference
2 Bayesian Inference
Appendices
Part I
Chapter 1
Frequentist Inference
Frequentist inference is the type of statistical inference that treats the probability of an event as the long-run "frequency" of its occurrence and draws conclusions from the proportion of outcomes observed in sample data. It is based on the frequentist definition of probability, which defines the probability of an event E as the limit of its relative frequency as the number of trials grows without bound:

P(E) = lim_{n→∞} n(E)/n

where n(E) is the number of trials in which E occurs.
As this definition suggests, the frequentist approach is all about repeated experiments, and it needs large amounts of data to draw conclusions. This is straightforward when the experiment can be repeated essentially without limit. Consider, for example, rolling a fair die. To estimate the probability of getting the number 3, the experiment can be conducted as many times as needed to approximate the probability. The following simulation shows how the probability of getting any particular number on a fair die converges to 1/6.
Figure 1.1: Die-rolling simulation for 100, 1000, and 1000000 experiments
However, when the experimenter is not at liberty to conduct a large number of experiments, or cannot conduct any new experiment at all and can only draw inferences from existing data, the frequentist approach is of little help. One example is predicting whether a certain plant X is toxic for human consumption, where it is impossible to conduct experiments for ethical reasons. We can, however, test a hypothesis about a new case based on prior beliefs formed from observing multiple cases in the past. This type of inference requires the Bayesian interpretation of probability.
Chapter 2
Bayesian Inference
Definition (Bayes’ Theorem). Given two events A and B with P(B) ≠ 0, the probability of A occurring given that B is true is

P(A|B) = P(B|A) P(A) / P(B)
Bayes’ theorem is used to continuously update probabilities as new evidence is found, bringing the estimate closer to the true value. The equivalent form of Bayes’ theorem, which gives the probability of a hypothesis H given some new evidence E, is:

P(H|E) = P(E|H) P(H) / P(E)
Here E stands for "evidence": the new data that was not accounted for in the prior probability. H stands for "hypothesis", whose probability may be affected by the new evidence E. P(H) is the "prior probability" of the hypothesis H, i.e., the probability that H is true before the evidence E is observed. P(H|E) is the "posterior probability": the probability of H after E is observed. P(E|H) is the "likelihood": the probability of observing the evidence E given that our hypothesis H is true. Finally, P(E) is the "marginal likelihood": the overall probability of the evidence, taking all possible hypotheses into account.
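As a quick numeric illustration, a single application of Bayes' theorem can be written in a few lines of Python. The numbers here are hypothetical (not taken from the text) and purely illustrative:

p_h = 0.5              # prior P(H)
p_e_given_h = 0.8      # likelihood P(E|H)
p_e_given_not_h = 0.3  # likelihood of E if H is false

# Marginal likelihood P(E): average over both hypotheses.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior P(H|E) by Bayes' theorem.
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)  # about 0.727: the evidence raised P(H) from 0.5

Because the evidence is more probable under H than under its negation, the update pushes the probability of H up; with the roles of the likelihoods reversed, it would push it down.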
The choice of prior is important, as it influences the results while evidence accumulates. However, as the amount of data increases, the posterior moves closer and closer to the true value, and errors in the prior wash out. Hence, just as the frequentist probability estimate improves with more and more experiments, the Bayesian estimate improves with more new evidence.
An example where Bayesian inference works well is estimating the probability of heads for a biased coin. The likelihood is taken to be binomial; up to a constant binomial coefficient (which cancels in Bayes’ theorem), its probability mass function is:

P(E|H) = H^x (1 − H)^(n−x),   x = 0, 1, ..., n

where H is the hypothesized proportion of heads and x is the number of heads obtained in n throws in the new evidence E.
4
We set up the prior as a triangular distribution with its peak (the most likely estimate) at 0.5.
First we analyze the case of a single toss in which one head and no tail is observed (Fig. 2.2(a)). Then we analyze the case of 1000 tosses in which heads occur 750 times (Fig. 2.2(b)). Using the prior probabilities and the likelihood function, the posterior probability computed via Bayes’ theorem is plotted below:

Figure 2.2: (a) Trials = 1, No. of heads = 1; (b) Trials = 1000, No. of heads = 750
It is observed that the likelihood function and the posterior probabilities follow almost the same trend when the evidence is large, further showing that the Bayesian posterior becomes insensitive to the prior as more evidence accumulates.
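The same grid update performed by the R code in Appendix B can be sketched compactly in Python. The version below is our own minimal reimplementation (not the original listing) of the 750-heads-in-1000-tosses case; the binomial coefficient is dropped since it cancels on normalization:

import numpy as np

# Grid of candidate head-proportions H.
theta = np.arange(0.0, 1.01, 0.01)

# Triangular prior peaked at 0.5, normalized to sum to 1.
prior = np.minimum(theta, 1.0 - theta)
prior = prior / prior.sum()

# Unnormalized binomial likelihood of 750 heads in 1000 tosses.
likelihood = theta**750 * (1.0 - theta)**250

# Posterior via Bayes' theorem on the grid.
posterior = likelihood * prior
posterior = posterior / posterior.sum()

print(theta[np.argmax(posterior)])  # peak at 0.75, despite the prior peaking at 0.5

The posterior mode sits at the observed proportion 0.75, illustrating how the data overwhelm the triangular prior for a large number of tosses.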
Part II
Although the Bayesian and frequentist approaches each have their own advantages and disadvantages, in many situations Bayesian inference gives better answers than frequentist inference. Here we discuss some examples where the Bayesian approach outperforms the frequentist one.
The maximum likelihood estimate of p, given that David has won 8 of the 13 rounds observed so far, is:

p̂ = 8/13
Taking the MLE as our estimate, Roger wins the game only if he wins the following 5 rounds, which has probability:

P(R) = (1 − p̂)^5 = 0.0084

Hence, the probability of Roger winning is 0.0084, i.e., the odds against Roger are 118 to 1.
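This frequentist calculation is short enough to check directly (a minimal sketch; the variable names are our own):

p_hat = 8 / 13              # MLE of p: David won 8 of the 13 rounds
p_roger = (1 - p_hat) ** 5  # Roger must win all of the next 5 rounds
odds_against = (1 - p_roger) / p_roger

print(round(p_roger, 4))    # 0.0084
print(round(odds_against))  # 118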
The observed score so far is (n_D, n_R) = (8, 5).
We want to calculate P(R|E), i.e., the probability of Roger winning the game given the information in the current event E. We can find P(R|E) by integrating the joint distribution P(R, p|E) over p, so that the posterior does not depend on any single estimate of p. Hence we have:

P(R|E) = ∫_{−∞}^{∞} P(R, p|E) dp

       = ∫_{−∞}^{∞} P(R|p, E) P(p|E) dp                                   (conditional probability law)

       = ∫_{−∞}^{∞} P(R|p, E) [P(E|p) P(p) / P(E)] dp                     (Bayes’ theorem)

       = [∫_{−∞}^{∞} P(R|p, E) P(E|p) P(p) dp] / P(E)

       = ∫_{−∞}^{∞} P(R|p, E) P(E|p) P(p) dp / ∫_{−∞}^{∞} P(E|p) P(p) dp  (total probability for P(E))
Now P(R|p, E) is the probability of Roger winning for given p and event E, which is (1 − p)^5. P(E|p) is the likelihood of the event E for given p, which in this case is p^8 (1 − p)^5 (up to a constant, using the binomial distribution). Since we have assumed that Gary’s program is a perfect random number generator for p between 0 and 1, p has a uniform distribution over [0, 1] and hence:

P(p) = 1 for p ∈ [0, 1], and 0 otherwise.
Hence, we obtain the posterior probability as:

P(R|E) = [∫_0^1 p^8 (1 − p)^10 dp] / [∫_0^1 p^8 (1 − p)^5 dp] = β(11, 9) / β(6, 9)
where β(m, n) denotes the Beta function. Estimating this value with Python [Appendix C] gives the posterior probability P(R|E) = 0.02167, i.e., the odds against Roger are about 45 to 1.
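This ratio of Beta functions can be evaluated with nothing beyond the standard library, using the identity β(m, n) = Γ(m)Γ(n)/Γ(m + n) (a minimal sketch; `beta_fn` is our own helper):

from math import gamma

def beta_fn(m, n):
    # Beta function via the Gamma-function identity B(m, n) = Γ(m)Γ(n)/Γ(m+n).
    return gamma(m) * gamma(n) / gamma(m + n)

p_roger = beta_fn(11, 9) / beta_fn(6, 9)
print(round(p_roger, 5))  # 0.02167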
2.1.3 Discussion
The results given by the two approaches differ significantly. We check the true value by simulating millions of games [Appendix D].

From the simulation, we obtain odds against Roger of 45 to 1, the same as the result obtained using Bayesian inference. The frequentist approach gives a very poor result because its answer depends entirely on the maximum likelihood estimate (MLE) of p, whereas in Bayesian inference p is integrated out, making the answer almost independent of any single estimate of p.
Appendices
A Simulation for die throw (frequentist approach)
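A die-throw simulation of the kind plotted in Figure 1.1 can be sketched as follows (a minimal illustrative Python version, using the same trial counts as the figure; it is not the original listing):

import numpy as np

np.random.seed(1)

for n in [100, 1000, 1000000]:
    rolls = np.random.randint(1, 7, size=n)          # n fair-die throws, faces 1..6
    freqs = np.bincount(rolls, minlength=7)[1:] / n  # relative frequency of each face
    print(n, freqs)  # each frequency approaches 1/6 as n grows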
B Simulation for coin toss (Bayesian approach)
require(graphics)
library(dplyr)
library(knitr)
library(argparse)
library(Cairo)
library(ggplot2)
library(cowplot)
library(glue)

#' Generates a "Triangle" Prior Probability Distribution
#'
#' @param vals Sample space of all possible parameter values.
#' @return 2-column dataframe containing the parameter and its corresponding
#'   prior probability.
get_prior_distr <- function(vals) {
  vals_pmin <- pmin(vals, 1 - vals)

  # Normalize the prior so that it sums to 1.
  tibble::tibble(
    theta = vals,
    prior = vals_pmin / sum(vals_pmin)
  )
}

# Define the space of all theta values
theta_vals <- seq(0, 1, 0.01)

theta_prior_distr_df <- get_prior_distr(theta_vals)

#' Plots the Prior Probability Distribution
#'
#' @param prior_distr_df Prior probability distribution dataframe from
#'   get_prior_distr().
#' @param plot_x_labels Plot the parameter values on the x-axes that are taken
#'   from the input data.
#' @return ggplot of the prior probability distribution
plot_prior_distr <- function(prior_distr_df, plot_x_labels = TRUE) {

  theta_prior_p <-
    prior_distr_df %>%
    ggplot(aes(x = theta, y = prior)) +
    geom_point() +
    geom_segment(aes(x = theta, xend = theta, y = prior, yend = 0)) +
    xlab(expression("H (proportion of heads)")) +
    ylab(expression(paste("P(H)"))) +
    ggtitle("Prior Distribution")

  if (plot_x_labels) {
    theta_vals <- prior_distr_df[["theta"]]

    theta_prior_p <-
      theta_prior_p +
      scale_x_continuous(breaks = c(theta_vals), labels = theta_vals)
  }

  return(theta_prior_p)
}

#' Get the Likelihood Probability Distribution
#'
#' Generates a likelihood probability distribution dataframe.
#'
#' @param theta_vals Vector of theta values for the binomial distribution.
#' @param num_heads Number of heads.
#' @param num_tails Number of tails.
#' @return Dataframe of the likelihood probability distribution.
get_likelihood_df <- function(theta_vals, num_heads, num_tails) {
  likelihood_vals <- dbinom(num_heads, num_heads + num_tails, theta_vals)
  likelihood_df <-
    tibble::tibble(
      theta = theta_vals,
      likelihood = likelihood_vals
    )

  return(likelihood_df)
}

#' Get Posterior Probability Distribution
#'
#' Generate a posterior probability distribution dataframe.
#'
#' @param likelihood_df Likelihood distribution dataframe from
#'   get_likelihood_df().
#' @param prior_distr_df Prior distribution dataframe from
#'   get_prior_distr().
#' @return Dataframe with 4 columns:
#'   * theta: Theta value.
#'   * likelihood: Binomial likelihood of the observed data with the specific
#'     theta.
#'   * prior: Prior of the theta value.
#'   * post_prob: Posterior probability.
get_posterior_df <- function(likelihood_df, prior_distr_df) {

  likelihood_prior_df <-
    dplyr::left_join(likelihood_df, prior_distr_df, by = "theta")

  marg_likelihood <-
    likelihood_prior_df %>%
    dplyr::mutate(
      likelihood_theta = .data[["likelihood"]] * .data[["prior"]]
    ) %>%
    dplyr::pull("likelihood_theta") %>%
    sum()

  posterior_df <-
    dplyr::mutate(
      likelihood_prior_df,
      post_prob = (likelihood * prior) / marg_likelihood  # Bayes' theorem
    )

  return(posterior_df)
}

#' Plots Likelihood Probability Distribution
plot_likelihood_prob_distr <- function(likelihood_df) {
  likelihood_df %>%
    ggplot(aes(x = theta, y = likelihood)) +
    geom_point() +
    geom_segment(aes(x = theta, xend = theta, y = likelihood, yend = 0)) +
    xlab(expression("H (proportion of heads)")) +
    ylab(expression(paste("P(E|H)"))) +
    ggtitle("Likelihood Distribution")
}

#' Plots Posterior Probability Distribution
plot_posterior_prob_distr <- function(posterior_df, theta_vals) {
  posterior_df %>%
    ggplot(aes(x = theta, y = post_prob)) +
    geom_point() +
    geom_segment(aes(x = theta, xend = theta, y = post_prob, yend = 0)) +
    xlab(expression("H (proportion of heads)")) +
    ylab(expression(paste("P(H|E)"))) +
    ggtitle("Posterior Distribution")
}

# likelihood_df <- get_likelihood_df(theta_vals, 1, 0)
likelihood_df <- get_likelihood_df(theta_vals, 750, 250)
posterior_df <- get_posterior_df(likelihood_df, theta_prior_distr_df)

plot_grid(
  plot_prior_distr(theta_prior_distr_df, plot_x_labels = FALSE),
  plot_likelihood_prob_distr(likelihood_df),
  plot_posterior_prob_distr(posterior_df, theta_vals),
  nrow = 3,
  align = "v"
)
Output: plots of the prior, likelihood, and posterior distributions (as in Fig. 2.2).

D Simulation of games to check the true value
import numpy as np

# Set seed
np.random.seed(1)

for i in [1000000, 2000000, 5000000]:

    # Create an array of i different p values from 0 to 1
    p = np.random.random(i)

    # Do 19 rounds on each game (because 19 tosses are sufficient to decide the winner)
    rounds = np.random.random((19, len(p)))

    # Count the number of wins for David and Roger
    David = np.cumsum(rounds < p, 0)
    Roger = np.cumsum(rounds >= p, 0)

    # Find all the games with (8, 5) win ratio after 13th round and only include them
    good_games = Roger[12] == 5
    David = David[:, good_games]
    Roger = Roger[:, good_games]

    # Determine which of these games Roger won
    Roger_won = np.sum(Roger[17] == 10)

    # Compute the probability
    prob = Roger_won.sum() * 1. / good_games.sum()
    print("Total number of games simulated = " + str(i))
    print("Number of good games = " + str(good_games.sum()))
    print("Probability of Roger winning = {0:.5f}".format(prob))
    print("Odds against Roger winning = " + str((1 - prob) / prob) + " to 1")
    print()
Output: