
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI

APPLIED STATISTICAL METHODS

MATH F432

Frequentist versus Bayesian Inference

BY

GROUP 6 (MAHALANOBIS GROUP)

OCTOBER 10, 2021


Contents

Part I   Introduction to Frequentist and Bayesian Inferences

1   Frequentist Inference
2   Bayesian Inference

Part II  Comparison between Frequentist and Bayesian Inference

2.1   Example 1: Estimation of probability
      2.1.1   Frequentist Approach
      2.1.2   Bayesian Approach
      2.1.3   Discussion

Appendices

Part I

Introduction to Frequentist and Bayesian Inferences

Chapter 1

Frequentist Inference

Frequentist inference is the type of statistical inference that interprets the probability of an event in terms of the "frequency" with which that event occurs, and draws conclusions from a sample using the proportions observed in the data. The frequentist approach is based on frequentist probability, which defines the probability of an event E as the limit of its relative frequency as the number of trials grows without bound, i.e.:

\[
P(E) = \lim_{n \to \infty} \frac{n(E)}{n}
\]
As we can see, the frequentist approach relies on repeated experiments and needs large amounts of data to draw conclusions. This is easy when the experiment can be repeated essentially without limit. Consider, for example, rolling a fair die. To estimate the probability of getting the number 3, the experiment can be repeated as many times as needed and the relative frequency used as the estimate. The following simulation shows how the relative frequency of each face of a fair die converges to 1/6.

Figure 1.1: Die-throw simulation for 100, 10,000, and 1,000,000 throws
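As a quick illustration of the limiting-frequency idea, the short Python sketch below (independent of the fuller simulation in Appendix A; the seed and sample sizes are illustrative choices) tracks how the relative frequency of rolling a 3 approaches 1/6 as the number of trials grows.

import numpy as np

rng = np.random.default_rng(1)
for n in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)   # fair-die rolls: integers 1..6
    freq = np.mean(rolls == 3)           # relative frequency of rolling a 3
    print(f"n = {n:>9}: estimated P(3) = {freq:.4f}")   # approaches 1/6 = 0.1667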

However, when the person conducting the experiment cannot perform a large number of trials, or cannot perform any new experiment at all and must draw inferences from previously collected data, the frequentist approach is of little help. One example is predicting whether a certain plant X is toxic for human consumption, where new experiments are impossible for ethical reasons. We can, however, test a hypothesis about a new case based on beliefs formed from observing multiple similar cases in the past. This type of inference requires the Bayesian interpretation of probability.

Chapter 2

Bayesian Inference

In Bayesian inference, probabilities are interpreted as degrees of belief in an event, based on previous observations. From this definition it is apparent that Bayesian probabilities take the form "the probability of event A given that some other event B has happened", which is exactly the setting of Bayes' theorem.

Definition (Bayes' Theorem). Given two events A and B with P(B) ≠ 0, the probability of A occurring given that B is true is

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]
Bayes' theorem is used to update probabilities continuously as new evidence arrives, bringing the estimate closer to its true value. The equivalent form of Bayes' theorem that gives the probability of a hypothesis H given some new evidence E is:

\[
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
\]
Here E stands for "evidence", the new data that was not accounted for in the prior. H stands for "hypothesis", whose probability may be affected by the new evidence E. P(H) is the "prior probability" of the hypothesis H, i.e., the probability that H is true before the evidence E is observed. P(H | E) is the "posterior probability", the probability of the hypothesis H after the evidence E is observed. P(E | H) is the "likelihood", the probability of observing the evidence E given that our hypothesis H is true, and P(E) is the "marginal likelihood", a measure of the overall probability of the evidence that takes all possible hypotheses into account.
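As a toy numerical illustration of these terms (all numbers below are invented for illustration and are unrelated to the examples that follow), consider two competing hypotheses H1 and H2 and a single piece of evidence E:

# Toy Bayes' theorem calculation; the priors and likelihoods are made-up values.
prior = {"H1": 0.7, "H2": 0.3}           # P(H): prior probability of each hypothesis
likelihood = {"H1": 0.2, "H2": 0.6}      # P(E|H): probability of the evidence under each hypothesis

# P(E): marginal likelihood, obtained by summing over all hypotheses
p_evidence = sum(prior[h] * likelihood[h] for h in prior)

# P(H|E): posterior probability of each hypothesis via Bayes' theorem
posterior = {h: prior[h] * likelihood[h] / p_evidence for h in prior}
print(posterior)   # {'H1': 0.4375, 'H2': 0.5625} -- the evidence shifts belief towards H2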
The choice of prior is important, as it influences the results while evidence is still scarce. However, as the amount of data increases, the posterior moves closer and closer to the true probability and errors in the prior wash out. Hence, just as the frequentist estimate improves with more and more experiments, the Bayesian estimate improves with each new piece of evidence.
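The short Python sketch below illustrates this updating cycle on a toy coin (the grid of candidate values, the flat starting prior, and the toss sequence are illustrative assumptions): after each observed toss, the current posterior becomes the prior for the next toss.

import numpy as np

H = np.linspace(0.01, 0.99, 99)       # candidate values for the head probability
prior = np.ones_like(H) / H.size      # flat prior: no initial preference

for outcome in [1, 1, 0, 1, 1, 1, 0, 1]:      # 1 = head, 0 = tail (toy evidence)
    likelihood = H if outcome == 1 else (1 - H)
    posterior = likelihood * prior
    posterior /= posterior.sum()      # normalize by the marginal likelihood P(E)
    prior = posterior                 # today's posterior is tomorrow's prior

print("Posterior mean of H:", (H * posterior).sum())   # drifts towards the observed head fraction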
An example where Bayesian inference works well is estimating the probability of heads for a biased coin. The likelihood function is assumed to be binomial, with probability mass function (pmf):

\[
P(E \mid H) = \binom{n}{x} H^x (1 - H)^{\,n - x}, \qquad x = 0, 1, \ldots, n
\]

where H is the hypothesized proportion of heads and x is the number of heads obtained in the n tosses that make up the new evidence E.

We set up the prior probabilities as a triangular function, shown below, with the most likely estimate at 0.5:

Figure 2.1: Prior probabilities for proportion of heads

First we analyze the case of a single toss resulting in one head and no tails (Fig. 2.2(a)). Then we analyze the case of 1000 tosses in which heads occur 750 times (Fig. 2.2(b)). Using the prior probabilities and the likelihood function, the posterior probability computed with Bayes' theorem has the following plots:

(a) Trials = 1, No. of heads = 1 (b) Trials = 1000, No. of heads = 750

Figure 2.2: Posterior probabilities for proportion of heads

It is observed that the likelihood function and the posterior probabilities follow almost the same trend when the amount of evidence is large, further showing that the Bayesian posterior becomes insensitive to the prior as more evidence accumulates.
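For reference, the following compact Python sketch mirrors the R code of Appendix B (the grid spacing and variable names here are our own choices): it applies the same triangular prior and binomial likelihood for 750 heads in 1000 tosses.

import numpy as np
from scipy.stats import binom

H = np.linspace(0, 1, 101)                    # grid of candidate head proportions
prior = np.minimum(H, 1 - H)
prior /= prior.sum()                          # triangular prior, peaked at 0.5

likelihood = binom.pmf(750, 1000, H)          # P(E | H) for 750 heads in 1000 tosses
posterior = likelihood * prior
posterior /= posterior.sum()                  # Bayes' theorem with a discrete marginal likelihood

print("Posterior mode of H:", H[np.argmax(posterior)])   # close to 0.75, despite the prior peak at 0.5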

Part II

Comparison between Frequentist and Bayesian Inference

Although the Bayesian and frequentist approaches each have their own advantages and disadvantages, there are many situations in which Bayesian inference gives the better answer. Here we discuss an example where the Bayesian approach outperforms the frequentist one.

2.1 Example 1: Estimation of probability


David and Roger decide to try their luck on the "Who is the lucky one?" game that their friend Gary programmed. The program first picks a number p uniformly at random between 0 and 1 and then, in each round, outputs the binary digit $(1)_2$ with probability p and $(0)_2$ otherwise. David and Roger agree between themselves that David wins a round if the program outputs $(1)_2$ and Roger wins if it outputs $(0)_2$. We assume that the random number generator used by Gary is "perfect".
Gary decides that the first player to win 10 rounds wins the game. With this setup, we seek to answer the question: if after 13 rounds David has won 8 and Roger has won 5, what is the probability that Roger goes on to win the game?

2.1.1 Frequentist Approach


We first estimate the probability that Gary's program outputs $(1)_2$, based on the information available to us. Since David won 8 of the 13 rounds, the maximum likelihood estimate (MLE) of the proportion p of $(1)_2$ outputs is:

\[
\hat{p} = \frac{8}{13}
\]
Taking the MLE as our estimate, Roger can only win the game by winning the next 5 rounds, so the probability of Roger winning is:

\[
P(R) = (1 - \hat{p})^5 = 0.0084
\]

Hence, the probability of Roger winning is 0.0084, i.e., the odds against Roger are about 118 to 1.
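The same numbers can be reproduced with a few lines of Python (a trivial sketch of the arithmetic above):

p_hat = 8 / 13                         # maximum likelihood estimate of p
p_roger = (1 - p_hat) ** 5             # Roger must win the next 5 rounds
print(f"P(Roger wins) = {p_roger:.4f}")                             # about 0.0084
print(f"Odds against Roger = {(1 - p_roger) / p_roger:.0f} to 1")   # about 118 to 1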

2.1.2 Bayesian Approach


Let R denote the event that Roger wins the game and let E denote the event that we observe the data $(n_D, n_R) = (8, 5)$.
We want to calculate P(R | E), the probability that Roger wins the game given the observed data E. We can find P(R | E) by integrating the joint distribution P(R, p | E) over p, so that the posterior does not depend on any single point estimate of p. Hence, we have:

\[
\begin{aligned}
P(R \mid E) &= \int_{-\infty}^{\infty} P(R, p \mid E)\, dp \\
            &= \int_{-\infty}^{\infty} P(R \mid p, E)\, P(p \mid E)\, dp && \text{(conditional probability law)} \\
            &= \int_{-\infty}^{\infty} P(R \mid p, E)\, \frac{P(E \mid p)\, P(p)}{P(E)}\, dp && \text{(Bayes' theorem)} \\
            &= \frac{\int_{-\infty}^{\infty} P(R \mid p, E)\, P(E \mid p)\, P(p)\, dp}{P(E)} \\
            &= \frac{\int_{-\infty}^{\infty} P(R \mid p, E)\, P(E \mid p)\, P(p)\, dp}{\int_{-\infty}^{\infty} P(E \mid p)\, P(p)\, dp} && \text{(total probability of $E$)}
\end{aligned}
\]

Now P(R | p, E) is the probability that Roger wins the game for a given p and the event E, which is $(1 - p)^5$. P(E | p) is the likelihood of the event E for a given p, which in this case is proportional to $p^8 (1 - p)^5$ by the binomial distribution (the binomial coefficient cancels in the ratio below). Since we have assumed that Gary's program is a perfect random number generator, p is uniformly distributed over [0, 1], and hence:

\[
P(p) =
\begin{cases}
1, & p \in [0, 1] \\
0, & \text{otherwise}
\end{cases}
\]
Hence, we obtain the posterior probability as:

\[
P(R \mid E) = \frac{\int_0^1 p^8 (1 - p)^{10}\, dp}{\int_0^1 p^8 (1 - p)^5\, dp} = \frac{\beta(11, 9)}{\beta(6, 9)}
\]

where β(m, n) denotes the Beta function. Evaluating this with Python [Appendix C] gives the posterior probability P(R | E) = 0.02167, i.e., the odds against Roger are about 45 to 1.
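As a sanity check on the closed-form result, the ratio of integrals can also be evaluated numerically (a small sketch using SciPy's quad routine, separate from the Beta-function calculation in Appendix C):

from scipy.integrate import quad

num, _ = quad(lambda p: p**8 * (1 - p)**10, 0, 1)   # numerator integral
den, _ = quad(lambda p: p**8 * (1 - p)**5, 0, 1)    # denominator integral
prob = num / den
print(f"P(R|E) = {prob:.5f}")                                  # about 0.02167
print(f"Odds against Roger = {(1 - prob) / prob:.1f} to 1")    # about 45 to 1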

2.1.3 Discussion
The results given by the two approaches differ significantly. To check which is closer to the truth, we estimate the true value by simulating millions of games [Appendix D].
From the simulation, the odds against Roger winning come out to about 45 to 1, which matches the value obtained from Bayesian inference. The frequentist approach gives a poor result because its answer depends entirely on the single maximum likelihood estimate (MLE) of p, whereas the Bayesian answer averages over all values of p, weighted by how well each explains the observed data.

Appendices

A Simulation for die throw (frequentist approach)

import collections

import matplotlib.pyplot as plt
import numpy as np
from numpy.random import seed, randint

seed(1)

# Generate 3 random samples of sizes 100, 10^4, 10^6
s1 = randint(1, 7, 100)
s2 = randint(1, 7, 10000)
s3 = randint(1, 7, 1000000)

# Count the frequency of each face in every sample
freq1 = [0, 0, 0, 0, 0, 0]
freq2 = [0, 0, 0, 0, 0, 0]
freq3 = [0, 0, 0, 0, 0, 0]

for key, value in collections.Counter(s1).items():
    freq1[key - 1] = value

for key, value in collections.Counter(s2).items():
    freq2[key - 1] = value

for key, value in collections.Counter(s3).items():
    freq3[key - 1] = value

# Plot the frequencies
plt.style.use('ggplot')

fig = plt.figure()

plt1 = fig.add_subplot(131)
plt2 = fig.add_subplot(132)
plt3 = fig.add_subplot(133)

x = [1, 2, 3, 4, 5, 6]

plt1.stem(x, freq1, linefmt='black', basefmt=' ', markerfmt='ko')
plt2.stem(x, freq2, linefmt='black', basefmt=' ', markerfmt='ko')
plt3.stem(x, freq3, linefmt='black', basefmt=' ', markerfmt='ko')

plt1.set_xlabel('Outcome of die throw')
plt2.set_xlabel('Outcome of die throw')
plt3.set_xlabel('Outcome of die throw')

plt1.set_ylabel('Frequency (per hundred throws)')
plt2.set_ylabel('Frequency (per ten-thousand throws)')
plt3.set_ylabel('Frequency (per million throws)')

# plt.subplot_tool()
plt.show()

B Simulation for coin toss (Bayesian approach)

require(graphics)
library(dplyr)
library(knitr)
library(argparse)
library(Cairo)
library(ggplot2)
library(cowplot)
library(glue)

#' Generates a "Triangle" Prior Probability Distribution
#'
#' @param vals Sample space of all possible parameter values.
#' @return 2-column dataframe containing the parameter and its corresponding
#'   prior probability.
get_prior_distr <- function(vals) {
  vals_pmin <- pmin(vals, 1 - vals)

  # Normalize the prior so that it sums to 1.
  tibble::tibble(
    theta = vals,
    prior = vals_pmin / sum(vals_pmin)
  )
}

# Define the space of all theta values
theta_vals <- seq(0, 1, 0.01)

theta_prior_distr_df <- get_prior_distr(theta_vals)

#' Plots the Prior Probability Distribution
#'
#' @param prior_distr_df Prior probability distribution dataframe from
#'   get_prior_distr().
#' @param plot_x_labels Plot the parameter values on the x-axis that are taken
#'   from the input data.
#' @return ggplot of the prior probability distribution
plot_prior_distr <- function(prior_distr_df, plot_x_labels = TRUE) {

  theta_prior_p <-
    prior_distr_df %>%
    ggplot(aes(x = theta, y = prior)) +
    geom_point() +
    geom_segment(aes(x = theta, xend = theta, y = prior, yend = 0)) +
    xlab(expression("H (proportion of heads)")) +
    ylab(expression(paste("P(H)"))) +
    ggtitle("Prior Distribution")

  if (plot_x_labels) {
    theta_vals <- prior_distr_df[["theta"]]

    theta_prior_p <-
      theta_prior_p +
      scale_x_continuous(breaks = c(theta_vals), labels = theta_vals)
  }

  return(theta_prior_p)
}

#' Get the Likelihood Probability Distribution
#'
#' Generates a likelihood probability distribution dataframe
#'
#' @param theta_vals Vector of theta values for the binomial distribution.
#' @param num_heads Number of heads.
#' @param num_tails Number of tails.
#' @return Dataframe of the likelihood probability distribution.
get_likelihood_df <- function(theta_vals, num_heads, num_tails) {
  likelihood_vals <- dbinom(num_heads, num_heads + num_tails, theta_vals)
  likelihood_df <-
    tibble::tibble(
      theta = theta_vals,
      likelihood = likelihood_vals
    )

  return(likelihood_df)
}

#' Get Posterior Probability Distribution
#'
#' Generate a posterior probability distribution dataframe.
#'
#' @param likelihood_df Likelihood distribution dataframe from
#'   get_likelihood_df().
#' @param prior_distr_df Prior distribution dataframe from get_prior_distr().
#' @return Dataframe with 4 columns:
#'   * theta: Theta value.
#'   * likelihood: Binomial likelihood of the observed data with the specific
#'     theta.
#'   * prior: Prior of the theta value.
#'   * post_prob: Posterior probability.
get_posterior_df <- function(likelihood_df, prior_distr_df) {

  likelihood_prior_df <-
    dplyr::left_join(likelihood_df, prior_distr_df, by = "theta")

  marg_likelihood <-
    likelihood_prior_df %>%
    dplyr::mutate(
      likelihood_theta = .data[["likelihood"]] * .data[["prior"]]
    ) %>%
    dplyr::pull("likelihood_theta") %>%
    sum()

  posterior_df <-
    dplyr::mutate(
      likelihood_prior_df,
      post_prob = (likelihood * prior) / marg_likelihood  # Bayes' theorem
    )

  return(posterior_df)
}

#' Plots Likelihood Probability Distribution
plot_likelihood_prob_distr <- function(likelihood_df) {
  likelihood_df %>%
    ggplot(aes(x = theta, y = likelihood)) +
    geom_point() +
    geom_segment(aes(x = theta, xend = theta, y = likelihood, yend = 0)) +
    xlab(expression("H (proportion of heads)")) +
    ylab(expression(paste("P(E|H)"))) +
    ggtitle("Likelihood Distribution")
}

#' Plots Posterior Probability Distribution
plot_posterior_prob_distr <- function(posterior_df, theta_vals) {
  posterior_df %>%
    ggplot(aes(x = theta, y = post_prob)) +
    geom_point() +
    geom_segment(aes(x = theta, xend = theta, y = post_prob, yend = 0)) +
    xlab(expression("H (proportion of heads)")) +
    ylab(expression(paste("P(H|E)"))) +
    ggtitle("Posterior Distribution")
}

# likelihood_df <- get_likelihood_df(theta_vals, 1, 0)
likelihood_df <- get_likelihood_df(theta_vals, 750, 250)
posterior_df <- get_posterior_df(likelihood_df, theta_prior_distr_df)

plot_grid(
  plot_prior_distr(theta_prior_distr_df, plot_x_labels = FALSE),
  plot_likelihood_prob_distr(likelihood_df),
  plot_posterior_prob_distr(posterior_df, theta_vals),
  nrow = 3,
  align = "v"
)

C Calculation of posterior probability using Beta function in Example 1

from scipy.special import beta

pos_prob = beta(9, 11) / beta(9, 6)
print("Posterior probability = " + str(pos_prob))

Output:

Posterior probability = 0.021671826625386997

D Simulation for Example 1

import numpy as np

# Set seed
np.random.seed(1)

for i in [1000000, 2000000, 5000000]:

    # Create an array of i different p values drawn uniformly from [0, 1]
    p = np.random.random(i)

    # Play 19 rounds of each game (19 rounds are always enough to decide the winner)
    rounds = np.random.random((19, len(p)))

    # Cumulative number of wins for David and Roger after each round
    David = np.cumsum(rounds < p, 0)
    Roger = np.cumsum(rounds >= p, 0)

    # Keep only the games with an (8, 5) score after the 13th round
    good_games = Roger[12] == 5
    David = David[:, good_games]
    Roger = Roger[:, good_games]

    # Determine in which of these games Roger reached 10 wins
    Roger_won = np.sum(Roger[17] == 10)

    # Compute the probability
    prob = Roger_won.sum() * 1. / good_games.sum()
    print("Total number of games simulated = " + str(i))
    print("Number of good games = " + str(good_games.sum()))
    print("Probability of Roger winning = {0:.5f}".format(prob))
    print("Odds against Roger winning = " + str((1 - prob) / prob) + " to 1")
    print()

Output:

Total number of games simulated = 1000000
Number of good games = 71561
Probability of Roger winning = 0.02213
Odds against Roger winning = 44.17739898989899 to 1

Total number of games simulated = 2000000
Number of good games = 142953
Probability of Roger winning = 0.02164
Odds against Roger winning = 45.2032967032967 to 1

Total number of games simulated = 5000000
Number of good games = 357761
Probability of Roger winning = 0.02200
Odds against Roger winning = 44.453055520264265 to 1

