Sampling

STAA 567 Lec 5: Simulation and Integration
(Instructor : Nishant Panda) (Additional Reference (IMC) : Introducing Monte Carlo Methods with R,
Christian P. Robert & George Casella, Springer)
Introduction
Simulation is an important technique in applied mathematics. It typically means using computers to

conduct experiments! In statistics, it is a very useful way to check if a statistical model has the necessary
properties and this involves sampling from probability distributions. In most real life situations, analytical
derivations of statistical properties for your model is rarely possible and one usually works with Large sample
approximations. To test properties in finite samples simulation studies are extremely crucial!
Random Variable generation
The core step in simulation is drawing samples from probability distributions. This will be the topic of
interest in this lecture.
How does R generate samples from a named distribution? For example rnorm generates samples from a
normal distribution.
# Generates 10,000 samples from a normal distribution
# with mean 2 and standard deviation 0.5
norm.samples <- rnorm(10000, mean = 2, sd = 0.5)
hist(norm.samples, breaks = 21, col = "steelblue")
Histogram of norm.samples
1500
Frequency
0 500
0 1 2 3 4
norm.samples
# Plot the "density"
plot(density(norm.samples))
polygon(density(norm.samples), col="steelblue", border="red")
1
density.default(x = norm.samples)
0.8
Density
0.4
0.0
0 1 2 3 4
N = 10000 Bandwidth = 0.07076

As it turns out, all that is needed to generate samples from any distribution is a uniform number generator.
Let us assume that this is God Given (i.e God wrote the runif).
unif.samples <- runif(100000, 0, 2)
plot(density(unif.samples))
polygon(density(unif.samples), col="steelblue", border="red")
density.default(x = unif.samples)
0.4
Density
0.2
0.0
0.0 0.5 1.0 1.5 2.0
N = 100000 Bandwidth = 0.05201

Note, that there is no such thing as a random number generator! The computer generates numbers
deterministically that satistisfies all the properties of the distribution. This can be seen through set.seed
method in R.
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
set.seed(1)
runif(5)
## [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
Methods that produce an endless supply of random samples from distributions are termed as Monte Carlo
methods.
2
For a nice overview see https://en.wikipedia.org/wiki/Monte_Carlo_method
Inverse Transform method

The inverse transform theorem is a technical result in probability theory which states that if X is
a random variable with the C.D.F F (x), then if u is a random sample from the uniform(0,1) then
F 1 (u) is a random sample of X.
(Technically F 1 is NOT the inverse of the C.D.F but for the purpose of this course thinking of it as an
inverse function will suffice!)
If we want to generate random samples of X, what this means is the following:
1. First generate random samples ui from uniform U (0, 1).
2. Let xi = F 1 (ui ). Then xi s are random samples from X.
(Home Assignment!) Let X exponentially with rate 1. Show that the C.D.F of X is
F (x) = 1 ex
Example 1: Using the inverse transform method, generate 10000 samples from Exponential
Distribution with rate 1 i.e generate samples from X Exp(1).
First we need to get the inverse function F 1 . This means that if F (x) = y, then we solve for x. From Home
Assignment 1 in this Lecture, we know that F (x) = 1 ex . Thus solving for y,
y = 1 ex
= ex = 1 y
= x = log(1 y)
= x = log(1 y)
Hence, the inverse function is given by F 1 (y) = log(1 y).

# Step 1: generate n samples from uniform random variable
n <- 10000
u <- runif(n)
# Step 2: compute F^-1(u)

my.exp.samples <- sapply(u, function(u) -1.0*log(1-u))
Let us test this visually with a plot.
plot(my.exp.samples)
3
my.exp.samples
8
4
0
0 2000 4000 6000 8000
Index
Oops What happened! This doesnt look like an exponential distribution, does it? We are plotting samples
and not the distribution! Let us look at the histogram!
# Plot histogram
hist(my.exp.samples, breaks = 21, col = "steelblue", main = "Histogram of Exp(1) samples" )
Histogram of Exp(1) samples
4000
Frequency
2000
0
0 2 4 6 8 10
my.exp.samples
# Plot density
plot(density(my.exp.samples), xlim = c(0.5, 5), main = "Density plot for Exp(1)")
polygon(density(my.exp.samples), col="steelblue", border="red", xlim = c(0.5, 5))
Density plot for Exp(1)
0.8
Density
0.4
0.0
1 2 3 4 5
N = 10000 Bandwidth = 0.1196

Let us look at the R version
4
# Plot density
plot(density(rexp(10000)), xlim = c(0.5, 5), main = "Density plot for Exp(1) from rexp")
polygon(density(rexp(10000)), col="steelblue", border="red")
Density plot for Exp(1) from rexp
0.8
Density
0.4
0.0
1 2 3 4 5
N = 10000 Bandwidth = 0.1178

(From IMC) The generation of uniform random variables is therefore a key determinant of the
behavior of simulation methods for other probability distributions since those distributions can
be represented as a deterministic transformation of uniform random variables.
(Home Assignment!) Let X be the Logistic distribution with the C.D.F given by
1
F (x) = (x)
( )
1+e
for two parameters and . Generate 1000 random samples using the inverse transform
method. Take = 5 and = 2.
R uses the inverse transform method to generate samples from the Normal distribution. Can we use the
inverse transform method to generate samples from discrete distributions like Binomial etc? Yes we can! But,
we wont delve further into this.
Accept-Reject simulation method

Typically, a C.D.F wont come up in nice closed expression for us to invert! When we come up against
such a wall, the next simulation method of choice is almost always going to be the Accept-Reject method.
Accept-Reject method fall under a large class of simulation methods called Indirect methods.
Again, for a great read see https://en.wikipedia.org/wiki/Rejection_sampling

With indirect methods we generate samples from a candidate (simpler) random variable and only
accept it as the sample of our target random variable subject to passing a test.
Notation:
1. X f is the random variable of interest and f (x) is called the target density.
2. Let g(x) be another density from which we (or R!) can generate samples having the same support as
f (x). In other words, g(x) > 0 exactly for those x for which f (x) > 0. We call g a candidate density.
5
3. Let e(x) be the envelope function that serves as an upper bound for f having the property e(x) =
M g(x) f (x), where M 1 is some constant.
Goal: Draw a sample of X i.e simulate from f .
The accept-reject algorithm can be compactly stated as follows:
1. Generate Y g and U U (0, 1).
2. Set X = Y if
f (Y )
U< ,
e(Y )
3. else, go back to 1.
We see that picking an envelope (i.e picking M and a candidate density) is crucial for this method to work.
Let us illustrate this with the following example.
Example 2: Generate 1000 samples from Beta(4, 3) using the accept-reject method.
We know that if X Beta(, ), then its density f is given by,
( + ) 1
f (x|, ) = x (1 x)1
()()
Plugging in = 4 and = 3, we get

f (x) = 60x3 (1 x)2
What is the support of f ?

Any Beta distribution is supported on [0, 1]. Let us plot this density.
x.seq <- seq(from = 0, to = 1, length.out = 101)
f <- sapply(x.seq, function(x) 60*x^3*(1-x)^2)
library(ggplot2)
# put the x and y values in a data frame
plot_data <- data.frame("x" = x.seq, "f" = f)
# create the base layer of ggplot with geom_Line

plot_f <- ggplot(plot_data, aes(x = x, y = f)) +
geom_line(color = "red", size = 1.5) +
labs(x = "x", y = "f(x| 4,3)")
plot_f
6
2.0
1.5
f(x| 4,3)
1.0
0.5
0.0
0.00 0.25 0.50 0.75 1.00
x
Now, our candidate density function g should have the same support as f and should be a valid density. Lets
take g to be uniform density U (0, 1).
A naive choice for the envelope e(x) = M g(x), would be to take M = max(f (x)). Thus, we can guarantee
that e(x) f (x). This naive choice works whenver we have a compact support for f (x) and can find a global
maxima. Lets find M . We can use the optimize function in R to find the maxima!
max_p <- optimize(function(x) 60*x^3*(1-x)^2, c(0.05, 0.95), maximum = TRUE)
round(max_p$maximum, 4)
## [1] 0.6
Thus, M = f (0.6).
# Lets create our beta density f
beta.f <- function(x) {
60*x^3*(1-x)^2
}
M <- beta.f(0.6)
print(M)
## [1] 2.0736
Our candidate density function g(x) = 1 for any x in [0, 1] because we chose it to be uniform(0,1). Hence,
our envelope function e(y) = M . Now, let us write a simple accept-reject algorithm for this example
num.samp <- 1000
beta.samples <- rep(0, num.samp)
count.samp <- 0
# keep running the accept-reject method until you have drawn 1000 samples
while(count.samp < num.samp) {

# Step 1: generat sample y from g and u from unif(0,1)
y <- runif(1) # g is U(0,1)
u <- runif(1)
# Step 2: Check the test

if (u < beta.f(y)/M) {
7
# increase the counter
count.samp <- count.samp + 1
# set X = Y
beta.samples[count.samp] <- y
}
}
Let us check if this worked
hist(beta.samples,prob=T,ylab="f(x)",xlab="x",ylim=c(0,max(beta.f(x.seq))),
main="Histogram of draws from Beta(4,3)" )
lines(x.seq,f,lty=2,col = "red")
Histogram of draws from Beta(4,3)
2.0
1.5
f(x)
1.0
0.5
0.0
0.2 0.4 0.6 0.8 1.0
x
Let us visualize which samples were accepted and which were rejected.
# Total number of samples
num.sim <- 3000
# testing condition : u < f/e and hence ue < f
# e is just M, so ue is u * M
ue <- M*runif(num.sim)
# y is a sample from g and g is U(0,1)

y <- runif(num.sim)
# create all the pairs y and ue
mat <- cbind(y, ue)
8
# put them in a data frame
plot_data <- data.frame(mat)
colnames(plot_data) <- c("y", "ue")
# create a factor variable to see which were accepted
plot_data$Accepted <- ue < beta.f(y)
# show the first 5 pairs

head(plot_data,5)
## y ue Accepted
## 1 0.5521227 0.6982361 TRUE
## 2 0.8818690 1.0105790 FALSE
## 3 0.9236169 1.3849691 FALSE
## 4 0.2549313 0.9364871 FALSE
## 5 0.1082588 0.5137490 FALSE
# we will create a scatter plot
# "x" values will be y (i.e samples from g)
# "y" values will be ue (u * e)
# If (u*e < f(y)) then samples are accepted otherwise rejected

# Hence, in scatter plot we will color by the Accepted variable
plot_ar <- ggplot(plot_data, aes(x=y, y=ue)) +

geom_point(shape=20, aes(color=Accepted)) +
stat_function(fun = beta.f) +
geom_vline(xintercept = 0.0) +
geom_vline(xintercept = 1.0) +
geom_hline(yintercept = M) +
geom_hline(yintercept = 0)
# beautify : Increase font size in axes

plot_ar + theme(axis.text.x =
element_text(face = "bold",size = 12),
axis.text.y =
element_text(face = "bold", size = 12),
axis.line = element_line(colour = "black",
size = 1, linetype = "solid"))
9
2.0
1.5
Accepted
ue
FALSE
1.0
TRUE
0.5
0.0
0.00 0.25 0.50 0.75 1.00
y
Let us check what proportion of the samples were accepted
round(sum(plot_data$Accepted)/nrow(plot_data),2)
## [1] 0.48
round((1/M),2)
## [1] 0.48
1
This is not a coincidence! M is the expected proportion of candidates that are accepted. If we find a better
M (i.e a tighter envelope) we will on avergae have fewer rejections.
A good envelope should have the following properties:
1. Exceeds the target f everywhere
2. Easy to sample from g.
3. Generete few rejections.
1
If f is supported on a set [a, b], then a naive but good choice for e is max(f (x)) ba i.e we take the candidate
density to be uniform U (a, b) and M = max(f (x)). Will this method work if f N (, )?
(Home Assignment!) Consider the strange density

2
f (x) = cex /2
sin2 (6x) + 3cos2 (x)sin2 (4x) + 1 .

Use the Accept-Reject algorithm to sample from f by filling out the following steps: 1. We
dont need to know the normalizing constant c to sample from f ! We can disregard c in this
10
question! The support of f is (, ). Take g(x) to be the standard normal density. 2. Find
a suitable M (Hint, use optimize!). 3. Plot M g(x) and f (without the c). 4. Generate 1000
samples from f (without the c!). 5. Mimic the plot in this lecture to show the accepted and
rejected samples.
11

Sampling

Uploaded by

Copyright:

Available Formats

You might also like

Sampling

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Sampling

Uploaded by

Copyright:

Available Formats

STAA 567 Lec 5: Simulation and Integration

Simulation is an important technique in applied mathematics. It typically means using computers to

Random Variable generation

N = 10000 Bandwidth = 0.07076

0.0 0.5 1.0 1.5 2.0

N = 100000 Bandwidth = 0.05201

Inverse Transform method

Hence, the inverse function is given by F 1 (y) = log(1 y).

# Step 2: compute F^-1(u)

0 2000 4000 6000 8000

N = 10000 Bandwidth = 0.1196

N = 10000 Bandwidth = 0.1178

Accept-Reject simulation method

Again, for a great read see https://en.wikipedia.org/wiki/Rejection_sampling

Plugging in = 4 and = 3, we get

What is the support of f ?

# create the base layer of ggplot with geom_Line

while(count.samp < num.samp) {

# Step 2: Check the test

0.2 0.4 0.6 0.8 1.0

# testing condition : u < f/e and hence ue < f

# y is a sample from g and g is U(0,1)

# show the first 5 pairs

# If (u*e < f(y)) then samples are accepted otherwise rejected

plot_ar <- ggplot(plot_data, aes(x=y, y=ue)) +

# beautify : Increase font size in axes

(Home Assignment!) Consider the strange density

You might also like