Sampling


STAA 567 Lec 5: Simulation and Integration

(Instructor : Nishant Panda) (Additional Reference (IMC) : Introducing Monte Carlo Methods with R,
Christian P. Robert & George Casella, Springer)

Introduction

Simulation is an important technique in applied mathematics. It typically means using computers to
conduct experiments! In statistics, it is a very useful way to check whether a statistical model has the necessary
properties, and this involves sampling from probability distributions. In most real-life situations, analytical
derivation of the statistical properties of your model is rarely possible, and one usually works with large-sample
approximations. To test properties in finite samples, simulation studies are crucial!

Random Variable generation

The core step in simulation is drawing samples from probability distributions. This will be the topic of
interest in this lecture.
How does R generate samples from a named distribution? For example, rnorm generates samples from a
normal distribution.
# Generates 10,000 samples from a normal distribution
# with mean 2 and standard deviation 0.5
norm.samples <- rnorm(10000, mean = 2, sd = 0.5)
hist(norm.samples, breaks = 21, col = "steelblue")
[Figure: Histogram of norm.samples (x-axis: norm.samples, y-axis: Frequency)]
# Plot the "density"
plot(density(norm.samples))
polygon(density(norm.samples), col="steelblue", border="red")

[Figure: kernel density estimate of norm.samples (y-axis: Density; N = 10000, Bandwidth = 0.07076)]


As it turns out, all that is needed to generate samples from any distribution is a uniform random number generator.
Let us assume that this is God-given (i.e., God wrote runif).
unif.samples <- runif(100000, 0, 2)
plot(density(unif.samples))
polygon(density(unif.samples), col="steelblue", border="red")
[Figure: kernel density estimate of unif.samples (y-axis: Density; N = 100000, Bandwidth = 0.05201)]


Note that there is no such thing as a truly random number generator on a computer! The computer generates numbers
deterministically (pseudo-random numbers) that satisfy the statistical properties of the distribution. This can be
seen through the set.seed function in R: resetting the seed reproduces exactly the same draws.
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
Methods that produce an endless supply of random samples from distributions are termed Monte Carlo
methods.
For a nice overview see https://en.wikipedia.org/wiki/Monte_Carlo_method

Inverse Transform method


The inverse transform theorem is a technical result in probability theory which states that if $X$ is
a random variable with the C.D.F. $F(x)$, then if $u$ is a random sample from the uniform(0,1) then
$F^{-1}(u)$ is a random sample of $X$.
(Technically $F^{-1}$ is NOT the inverse of the C.D.F. but for the purpose of this course thinking of it as an
inverse function will suffice!)
If we want to generate random samples of $X$, what this means is the following:
1. First generate random samples $u_i$ from uniform $U(0, 1)$.
2. Let $x_i = F^{-1}(u_i)$. Then the $x_i$'s are random samples from $X$.
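
A one-line justification of the two steps above, as a sketch under the simplifying assumption (not stated in the lecture) that $F$ is continuous and strictly increasing, so that $F^{-1}$ is an ordinary inverse. For $U \sim U(0, 1)$ and any $x$,

$$P\left(F^{-1}(U) \le x\right) = P\left(U \le F(x)\right) = F(x),$$

where the last equality uses $P(U \le t) = t$ for $t \in [0, 1]$. So $F^{-1}(U)$ has C.D.F. $F$, which is exactly the distribution of $X$.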

(Home Assignment!) Let $X$ be exponentially distributed with rate 1. Show that the C.D.F. of $X$ is

$$F(x) = 1 - e^{-x}$$

Example 1: Using the inverse transform method, generate 10000 samples from the Exponential
Distribution with rate 1, i.e., generate samples from $X \sim \mathrm{Exp}(1)$.
First we need to get the inverse function $F^{-1}$. This means that if $F(x) = y$, then we solve for $x$. From Home
Assignment 1 in this Lecture, we know that $F(x) = 1 - e^{-x}$. Thus, solving for $x$,

$$\begin{aligned}
y &= 1 - e^{-x} \\
\implies e^{-x} &= 1 - y \\
\implies -x &= \log(1 - y) \\
\implies x &= -\log(1 - y)
\end{aligned}$$

Hence, the inverse function is given by $F^{-1}(y) = -\log(1 - y)$.


# Step 1: generate n samples from uniform random variable
n <- 10000
u <- runif(n)

# Step 2: compute F^-1(u) (log is vectorized, so no sapply or loop is needed)
my.exp.samples <- -1.0*log(1-u)
Let us test this visually with a plot.
plot(my.exp.samples)

[Figure: scatter plot of my.exp.samples against Index]
Oops, what happened? This doesn't look like an exponential distribution, does it? That is because we are plotting
the samples against their index and not the distribution! Let us look at the histogram!
# Plot histogram
hist(my.exp.samples, breaks = 21, col = "steelblue", main = "Histogram of Exp(1) samples" )
[Figure: Histogram of Exp(1) samples (x-axis: my.exp.samples, y-axis: Frequency)]
# Plot density
plot(density(my.exp.samples), xlim = c(0.5, 5), main = "Density plot for Exp(1)")
# polygon() takes no xlim argument; it draws inside the limits already set by plot()
polygon(density(my.exp.samples), col="steelblue", border="red")
[Figure: Density plot for Exp(1) (y-axis: Density; N = 10000, Bandwidth = 0.1196)]


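Let us look at the R version, rexp.
# Plot density of samples drawn with R's built-in rexp
# (draw once so that plot and polygon use the same samples)
r.exp.dens <- density(rexp(10000))
plot(r.exp.dens, xlim = c(0.5, 5), main = "Density plot for Exp(1) from rexp")
polygon(r.exp.dens, col="steelblue", border="red")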
[Figure: Density plot for Exp(1) from rexp (y-axis: Density; N = 10000, Bandwidth = 0.1178)]
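
Beyond comparing plots by eye, the inverse transform samples can also be checked numerically. The following is a minimal sketch, not part of the original lecture: the seed value and the variable names my.exp.check and probs are arbitrary choices made for illustration. It compares the sample mean, variance and quantiles with the known values for Exp(1) and runs a Kolmogorov-Smirnov test against the Exp(1) C.D.F.

```r
# Sketch: numerical check of inverse transform samples for Exp(1).
set.seed(567)  # arbitrary seed, used only for reproducibility
n <- 10000
u <- runif(n)
my.exp.check <- -log(1 - u)  # F^-1(u) for Exp(1)

# Exp(1) has mean 1 and variance 1
mean(my.exp.check)
var(my.exp.check)

# Compare sample quantiles with the theoretical quantiles from qexp
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
round(rbind(sample = quantile(my.exp.check, probs),
            theory = qexp(probs)), 3)

# Kolmogorov-Smirnov test against the Exp(1) C.D.F.
ks.test(my.exp.check, "pexp", rate = 1)
```

If the method is working, the sample mean and variance should both be close to 1 and the two rows of quantiles should nearly agree.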


(From IMC) The generation of uniform random variables is therefore a key determinant of the
behavior of simulation methods for other probability distributions since those distributions can
be represented as a deterministic transformation of uniform random variables.

(Home Assignment!) Let $X$ follow the Logistic distribution with the C.D.F. given by

$$F(x) = \frac{1}{1 + e^{-(x - \mu)/\beta}}$$

for two parameters $\mu$ and $\beta$. Generate 1000 random samples using the inverse transform
method. Take $\mu = 5$ and $\beta = 2$.
R uses the inverse transform method to generate samples from the Normal distribution. Can we use the
inverse transform method to generate samples from discrete distributions like the Binomial? Yes, we can! But
we won't delve further into this.

Accept-Reject simulation method


Typically, a C.D.F. won't come in a nice closed-form expression for us to invert! When we come up against
such a wall, the next simulation method of choice is almost always going to be the Accept-Reject method.
The Accept-Reject method falls under a large class of simulation methods called Indirect methods.

Again, for a great read see https://en.wikipedia.org/wiki/Rejection_sampling


With indirect methods we generate samples from a candidate (simpler) random variable and only
accept it as the sample of our target random variable subject to passing a test.
Notation:
1. $X \sim f$ is the random variable of interest and $f(x)$ is called the target density.
2. Let $g(x)$ be another density from which we (or R!) can generate samples, having the same support as
$f(x)$. In other words, $g(x) > 0$ exactly for those $x$ for which $f(x) > 0$. We call $g$ a candidate density.
3. Let $e(x)$ be the envelope function that serves as an upper bound for $f$, having the property
$e(x) = M g(x) \ge f(x)$, where $M \ge 1$ is some constant.
Goal: Draw a sample of $X$, i.e., simulate from $f$.
The accept-reject algorithm can be compactly stated as follows:
1. Generate $Y \sim g$ and $U \sim U(0, 1)$.
2. Set $X = Y$ if $U < \frac{f(Y)}{e(Y)}$,
3. else, go back to 1.
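
The three steps above can be sketched as a small generic R function. This is an illustrative sketch, not code from the lecture: the function name accept.reject and its arguments (f, rg, dg, M) are hypothetical names introduced here, and the toy target f(x) = 2x on [0, 1] is chosen only because its mean (2/3) is easy to check.

```r
# Hypothetical helper (not from the lecture): a generic accept-reject sampler.
#   n  : number of samples wanted
#   f  : target density
#   rg : function(k) returning k draws from the candidate density g
#   dg : candidate density g
#   M  : constant with M * dg(x) >= f(x) on the support of f
accept.reject <- function(n, f, rg, dg, M) {
  samples <- numeric(n)
  n.accepted <- 0
  n.proposed <- 0
  while (n.accepted < n) {
    # Step 1: generate Y ~ g and U ~ U(0,1)
    y <- rg(1)
    u <- runif(1)
    n.proposed <- n.proposed + 1
    # Step 2: accept Y if U < f(Y) / e(Y), where e(Y) = M * g(Y)
    if (u < f(y) / (M * dg(y))) {
      n.accepted <- n.accepted + 1
      samples[n.accepted] <- y
    }
    # Step 3: otherwise loop back to Step 1
  }
  list(samples = samples, acceptance.rate = n / n.proposed)
}

# Usage with a toy target: f(x) = 2x on [0, 1], candidate U(0,1), M = 2
set.seed(1)
out <- accept.reject(1000, f = function(x) 2 * x,
                     rg = runif, dg = dunif, M = 2)
mean(out$samples)     # should be close to E[X] = 2/3
out$acceptance.rate   # fraction of candidates that were accepted
```

Returning the acceptance rate alongside the samples is a design choice made here so that the efficiency of a given envelope can be inspected directly.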
We see that picking an envelope (i.e picking M and a candidate density) is crucial for this method to work.
Let us illustrate this with the following example.

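Example 2: Generate 1000 samples from Beta(4, 3) using the accept-reject method.
We know that if $X \sim \mathrm{Beta}(\alpha, \beta)$, then its density $f$ is given by

$$f(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$$

Plugging in $\alpha = 4$ and $\beta = 3$, we get

$$f(x) = 60 x^3 (1 - x)^2$$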
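
The constant 60 and the resulting density can be checked against R's built-in functions. A minimal sketch, where the names x.check and f.check are introduced here purely for illustration:

```r
# Normalizing constant Gamma(4 + 3) / (Gamma(4) * Gamma(3))
gamma(4 + 3) / (gamma(4) * gamma(3))

# Compare the hand-derived density with R's dbeta on a grid of points
x.check <- seq(0, 1, by = 0.1)
f.check <- 60 * x.check^3 * (1 - x.check)^2
all.equal(f.check, dbeta(x.check, shape1 = 4, shape2 = 3))
```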

What is the support of f ?


Any Beta distribution is supported on [0, 1]. Let us plot this density.
x.seq <- seq(from = 0, to = 1, length.out = 101)
f <- sapply(x.seq, function(x) 60*x^3*(1-x)^2)
library(ggplot2)
# put the x and y values in a data frame
plot_data <- data.frame("x" = x.seq, "f" = f)

# create the base layer of ggplot with geom_line
plot_f <- ggplot(plot_data, aes(x = x, y = f)) +
  geom_line(color = "red", size = 1.5) +
  labs(x = "x", y = "f(x| 4,3)")

plot_f

[Figure: plot of the Beta(4, 3) density (x-axis: x, y-axis: f(x| 4,3))]
Now, our candidate density function g should have the same support as f and should be a valid density. Let's
take g to be the uniform density U(0, 1).
A naive choice for the envelope e(x) = M g(x) would be to take M = max(f(x)). Thus, we can guarantee
that $e(x) \ge f(x)$. This naive choice works whenever f has compact support and we can find its global
maximum. Let's find M. We can use the optimize function in R to find the maximum!
max_p <- optimize(function(x) 60*x^3*(1-x)^2, c(0.05, 0.95), maximum = TRUE)
round(max_p$maximum, 4)
## [1] 0.6
Thus, M = f (0.6).
# Lets create our beta density f
beta.f <- function(x) {
60*x^3*(1-x)^2
}

M <- beta.f(0.6)
print(M)
## [1] 2.0736
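
For this particular density the maximum can also be found by hand, which serves as a check on the numerical result from optimize. Differentiating $f(x) = 60x^3(1-x)^2$,

$$f'(x) = 60\left[3x^2(1-x)^2 - 2x^3(1-x)\right] = 60\,x^2(1-x)(3 - 5x),$$

so the only critical point in the open interval $(0, 1)$ is $x = 3/5$, and

$$M = f\left(\tfrac{3}{5}\right) = 60 \cdot \frac{27}{125} \cdot \frac{4}{25} = \frac{6480}{3125} = 2.0736.$$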
Our candidate density function is g(x) = 1 for any x in [0, 1] because we chose it to be uniform(0,1). Hence,
our envelope function is e(y) = M. Now, let us write a simple accept-reject algorithm for this example.
num.samp <- 1000
beta.samples <- rep(0, num.samp)
count.samp <- 0

# keep running the accept-reject method until you have drawn 1000 samples
while(count.samp < num.samp) {
  # Step 1: generate sample y from g and u from unif(0,1)
  y <- runif(1) # g is U(0,1)
  u <- runif(1)

  # Step 2: Check the test
  if (u < beta.f(y)/M) {
    # increase the counter
    count.samp <- count.samp + 1

    # set X = Y
    beta.samples[count.samp] <- y
  }
}
Let us check if this worked
hist(beta.samples,prob=T,ylab="f(x)",xlab="x",ylim=c(0,max(beta.f(x.seq))),
main="Histogram of draws from Beta(4,3)" )
lines(x.seq,f,lty=2,col = "red")
[Figure: Histogram of draws from Beta(4,3) with the density curve overlaid (x-axis: x, y-axis: f(x))]
Let us visualize which samples were accepted and which were rejected.
# Total number of samples
num.sim <- 3000

# testing condition : u < f/e and hence ue < f

# e is just M, so ue is u * M
ue <- M*runif(num.sim)

# y is a sample from g and g is U(0,1)


y <- runif(num.sim)
# create all the pairs y and ue
mat <- cbind(y, ue)

# put them in a data frame
plot_data <- data.frame(mat)
colnames(plot_data) <- c("y", "ue")
# create a factor variable to see which were accepted
plot_data$Accepted <- ue < beta.f(y)

# show the first 5 pairs


head(plot_data,5)
## y ue Accepted
## 1 0.5521227 0.6982361 TRUE
## 2 0.8818690 1.0105790 FALSE
## 3 0.9236169 1.3849691 FALSE
## 4 0.2549313 0.9364871 FALSE
## 5 0.1082588 0.5137490 FALSE
# we will create a scatter plot
# "x" values will be y (i.e samples from g)
# "y" values will be ue (u * e)

# If (u*e < f(y)) then samples are accepted otherwise rejected


# Hence, in scatter plot we will color by the Accepted variable

plot_ar <- ggplot(plot_data, aes(x=y, y=ue)) +
  geom_point(shape=20, aes(color=Accepted)) +
  stat_function(fun = beta.f) +
  geom_vline(xintercept = 0.0) +
  geom_vline(xintercept = 1.0) +
  geom_hline(yintercept = M) +
  geom_hline(yintercept = 0)

# beautify: increase font size in axes
plot_ar + theme(axis.text.x = element_text(face = "bold", size = 12),
                axis.text.y = element_text(face = "bold", size = 12),
                axis.line = element_line(colour = "black",
                                         size = 1, linetype = "solid"))

[Figure: scatter plot of ue against y, colored by Accepted (TRUE/FALSE), with the Beta(4, 3) density curve overlaid]
Let us check what proportion of the samples were accepted
round(sum(plot_data$Accepted)/nrow(plot_data),2)
## [1] 0.48
round((1/M),2)
## [1] 0.48
This is not a coincidence! 1/M is the expected proportion of candidates that are accepted. If we find a smaller
M (i.e., a tighter envelope) we will on average have fewer rejections.
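
The value of the acceptance probability follows from a short calculation, sketched here in the notation of this lecture ($Y \sim g$, $U \sim U(0, 1)$ drawn independently of $Y$, and $e(y) = M g(y)$). Conditioning on the candidate $Y = y$,

$$P(\text{accept}) = \int P\left(U < \frac{f(y)}{M g(y)}\right) g(y)\, dy = \int \frac{f(y)}{M g(y)}\, g(y)\, dy = \frac{1}{M} \int f(y)\, dy = \frac{1}{M}.$$

For the Beta(4, 3) example this gives $1/2.0736 \approx 0.48$, matching the simulated proportion. Since each candidate is accepted independently with probability $1/M$, the number of candidates needed per accepted sample is geometric with mean $M$.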
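A good envelope should have the following properties:
1. It exceeds the target f everywhere.
2. Its candidate density g is easy to sample from.
3. It generates few rejections.
If f is supported on an interval [a, b], then a naive but good choice is the constant envelope
$e(x) = \max_x f(x)$, i.e., we take the candidate density to be the uniform U(a, b), $g(x) = \frac{1}{b-a}$, and
$M = (b - a)\max_x f(x)$. (In Example 2, a = 0 and b = 1, so M = max(f(x)).) Will this method work if f is the
density of $N(\mu, \sigma)$?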

(Home Assignment!) Consider the strange density

$$f(x) = c\, e^{-x^2/2} \left( \sin^2(6x) + 3\cos^2(x)\sin^2(4x) + 1 \right).$$

Use the Accept-Reject algorithm to sample from f by filling out the following steps:
1. We don't need to know the normalizing constant c to sample from f! We can disregard c in this
question! The support of f is $(-\infty, \infty)$. Take g(x) to be the standard normal density.
2. Find a suitable M (Hint: use optimize!).
3. Plot M g(x) and f (without the c).
4. Generate 1000 samples from f (without the c!).
5. Mimic the plot in this lecture to show the accepted and rejected samples.
