Lab1 Sol

STAT 135 Lab 1 Solutions
January 26, 2015
Introduction
To complete this lab, you will need to have access to R and RStudio. If you have not already done so, you can
download R from http://cran.cnr.berkeley.edu/, and RStudio from http://www.rstudio.com/. You will
have the 2 hours in this lab to begin the lab questions, although you are not required to turn in your lab as they
will not be graded. There will be 3-4 assignments throughout the semester (worth a total of 20%) which require
knowledge of R, hence if you do not finish the lab during the section, you are encouraged to finish the lab in your
own time.
Question 1
The content of this question refers to the file “Question1.R” located in BCourses. This R script
contains an example of some code which generates a Bernoulli(0.5) random variable by manipulat-
ing a Uniform(0,1) random variable. This question involves editing the code in the “Question1.R”
file in order to generate a Bernoulli(0.3, 7) random variable.
The code provided in Question1.R to generate a bernoulli(0.5) random variable was
# generate a sample from the Uniform(0,1) distribution. Call it random.uniform

random.uniform <- runif(1, 0, 1) # to see what these arguments correspond to, type ?runif
# using random.uniform, generate a bernoulli(0.5) distributed random variable

random.bernoulli <- as.numeric(random.uniform < 0.5)
# to understand what is happening above,
#note that the uniformly distirbuted random variable will be less than 0.5 half of the time.
# Moreover:
# - "random.uniform < 0.5" returns TRUE when random.uniform is less than 0.5
# - "random.uniform < 0.5" returns FALSE when random.uniform is greater or equal to 0.5
# - "as.numeric(x)" converts x to a number. When x is TRUE, as.numeric(x) will return 1,
# when x is FALSE, as.numeric(x) will return 0
1
1.1
Edit the code provided to generate a Binomial(0.3, 7) random variable.
Hint: there are several ways to do this, however, for a particularly simple way, you only need to:
- change one number in each line of code, and
- add a final line of code which involves the sum() function
(Recall that a Binomial(p,n) random variable can be considered as the sum of n Bernoulli(p) random
variables).
The code above can be edited as follows to generate a Binomial(0.3, 7) random variable.
# generate 7 samples from the Uniform(0,1) distribution. Call it random.uniform

random.uniform <- runif(7, 0, 1)
# using random.uniform, generate a bernoulli(0.3) distributed random variables

# add them up to get a binomial random variable (number of successes in 7 trials)

random.binomial <- sum(random.bernoulli)
2
1.2
Using your code from the previous part, use a for loop to generate 10,000 Binomial(0.3, 7) random
variables. Plot a histogram, using hist(), of the generated numbers (use this as a check to make
sure that your code is working).
The following code uses a for loop to generate 10,000 Binomial(0.3, 7) random variables. The histogram is
presented in Figure 1.
# define an empty vector random.binomial

random.binomial <- c()
for(i in 1:10000) {
# generate 7 samples from the Uniform(0,1) distribution. Call it random.uniform
random.uniform <- runif(7, 0, 1)
# using random.uniform, generate a bernoulli(0.3) distributed random variables

# add them up to get a binomial random variable (numebr of successes in 7 trials)

# assign the value to the ith element of the vector random.binomial
random.binomial[i] <- sum(random.bernoulli)
}
hist(random.binomial,
nclass = 8,
main="Histogram of simulated Bin(0.3,7)",
xlab = "simulated value",
right=FALSE)
Note that you can make much nicer and more intuitive graphics using ggplot(). First, you will need to install
the ggplot2 package using
install.packages("ggplot2")
and load the package into the current environment using
library(ggplot2)
Next, ggplot() works by first generating an empty plot object with a specified data frame which will be used.
On top of this empty plot, we can add whatever we want. Here we want a histogram, so we add on a histogram
where we need to specify which variable/column of the data frame we are interested in plotting.
First, let’s turn our vector into a data frame.
3
Histogram of simulated Bin(0.3,7)
3000
2000
Frequency
1000
0
0 1 2 3 4 5 6
simulated value
Figure 1: A histogram of a simulation of size 10,000 from the binomial(0.3,7) distribution using the Uniform
distribution
4
# define the data frame binom.df and call the column/variable (we only have 1 variable) "binom"
binom.df <- data.frame(binom = random.binomial)
# instead of a vector, we now have a data frame with one column

# the head() function allows us to just loop at the first 6 rows
head(binom.df)
## binom
## 1 1
## 2 3
## 3 1
## 4 2
## 5 3
## 6 3
Then plot the histogram (Figure 2.)
# the first parameter inside the ggplot() function specifies which data frame we want to access
ggplot(binom.df) +
# add on a histogram using geom_histogram
# we specify which column to plot on the x axis
# and add a white border around the boxes to make it prettier
geom_histogram(aes(x = binom), binwidth = 1, col = "white") +
# add an informative title!
ggtitle("Histogram of simulated Bin(0.3,7)") +
# add an informative axis label
scale_x_continuous(name="simulated value")
5
Histogram of simulated Bin(0.3,7)
3000
2000
count
1000
0.0 2.5 5.0

simulated value
Figure 2: A histogram of a simulation of size 10,000 from the binomial(0.3,7) distribution using the Uniform
distribution
6
Histogram of rbinom() Bin(0.3,7)
2500
Frequency
1500
0 500
0 1 2 3 4 5 6 7
value
Figure 3: A histogram of a simulation of size 10,000 from the binomial(0.3,7) distribution using rbinom()
1.3
Use the inbuilt R function, rbinom(), to generate 1000 Binomial(0.3, 7) random variables, and plot
a histogram of the generated numbers. Compare with the histogram generated in the previous
question.
Using the inbuilt R function, the code is as follows:
random.binom.R <- rbinom(10000, 7, 0.3)
and the histogram is presented in Figure 3.
hist(random.binom.R,
nclass = 8,
main="Histogram of rbinom() Bin(0.3,7)",
xlab = "value",
right = FALSE)
Again, here is the ggplot version
7
Histogram of rbinom() Bin(0.3,7)
3000
2000
count
1000
0 2 4 6 8
value
Figure 4: A histogram of a simulation of size 10,000 from the binomial(0.3,7) distribution using rbinom()
# turn vector into data frame

binom.R.df <- data.frame(binom = random.binom.R)
# the first parameter inside the ggplot() function specifies which data frame we want to access
ggplot(binom.R.df) +
# add on a histogram using geom_histogram
# we specify which column to plot on the x axis
# and add a white border around the boxes to make it prettier
geom_histogram(aes(x = binom), binwidth = 1, col = "white") +
# add an informative title!
ggtitle("Histogram of rbinom() Bin(0.3,7)") +
# add an informative axis label
scale_x_continuous(name="value")
8
Question 2
Suppose that there are n < 365 people in a room. Ignoring leap years (i.e. ignoring February 29), we
are interested in finding the probability that no two people have the same birthday. Assume that
all birthdays are equally probable so that the probability of a given birthday for a person chosen
from the entire population at random is 1/365.
2.1
Show that the probability that no two people have the same birthday is given by
365!
P (no two people have the same birthday) =
(365 − n)!365n
Start with an arbitrary person’s birthday. Note that the probability that the second person’s birthday is different
is
365 − 1 364
P (Second person’s birthday is different from the first person’s birthday) = =
365 365
Next, the probability that the first, second and third person each has a different birthday is
365 − 1 365 − 2 364 × 363
P (First, second and third person each has a different birthday) = =
365 365 3652
since the second person’s birthday must be different from the first person’s birthday (so can fall on any of the re-
maining 364 days) and the third person’s birthday must be different again (and so can fall on any of the remaining
363 days).
Continuing on in this way, we find that the probability of no two people having the same birthday is
365 − 1 365 − 2 365 − (n − 1)

P (No two people have the same birthday) = ...
365 365 365
(365 − 1)(365 − 2) . . . (365 − (n − 1))
=
365n−1
365!
=
(365 − n)!365n
9
2.2
Hence, calculate the probability that two or more people out of a group of n do have the same
birthday. Hint: identify first the complementary event.
The trick here is to note that the event “two or more people have the same birthday” is the complement of the
event that “no two people have the same birthday”. Thus
P (Two or more people have the same birthday) = 1 − P (No two people have the same birthday)
365!
=1−
(365 − n)!365n
10
2.3
Suppose that n = 50. Verify this formula in R by taking n samples from a Uniform(1,365) (using the
runif() function) distribution and identifying whether or not two or more people have the same
birthday. Repeat this N = 1000 times using a for loop to estimate the probability that two or more
people have the same birthday.
Note that an easy way to convert the continuous Uniform(1,365) RV to a discrete Uniform(1,365) RV is to use
the floor function on the continuous Uniform(1, 366) distribution. So we can take a sample of 50 Uniform(1,365)
random variables (representing taking a sample of 50 birthdays) using the following code
set.seed(123) # set the seed so we get the same results every time.
sample <- floor(runif(50,1,366))
To identify whether or not there are any birthdays in common, we can use the duplicated command which tells
us if any individual entry of the vector is duplicated.
duplicated(sample)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
It looks like there are two people with the same birthday in this sample. Let’s do this 1000 times to see the
proportion of such samples.
N <- 10000 # we will do 10000 simulations

duplicated.birthday <- c() # define an empty vector that we will fill up
# begin the for loop

for(i in 1:N) {
sample <- floor(runif(50, 1, 366)) # take the sample
# identify if there are duplicates in the sample
duplicated.birthday[i] <- sum(duplicated(sample))>0
}
# print the proportion of samples which contained duplicated birthdays

sum(duplicated.birthday)/N
## [1] 0.9697
Now to check that this is an approximate estimation using the formula we derived.
11
n <- 50
# note that we can't simply use the factorial formula, since the number is far too big!
prob <- 1 - factorial(365)/((factorial(365 - n))*(365^n))
## Warning in factorial(365): value out of range in ’gammafn’

## Warning in factorial(365 - n): value out of range in ’gammafn’
# we can instead use a for loop!

prod <- 1
for(i in 1:(n-1)){
prod <- prod*(365 - i)/365
}
prob <- 1 - prod

prob
## [1] 0.9703736
So our estimation was pretty accurate: with probability 0.970, two or more prople in a room of 50 people will have
the same birthday.
12
Question 3
Suppose we have a population x1 , x2 , ...., xN , with population mean µ and population variance σ 2 .
Denote the distinct values assumed by the population members by a1 , a2 , ..., am , and denote the
number of population members that have the value ak by nk , k = 1, ..., m. Then given a sample
X1 , ..., Xn drawn from this population, we have that each Xi is a discrete random variable with
probability mass function
nk
P (Xi = ak ) =
N
3.1
Show that the expected value of Xi is equal to the population mean. That is, show that EXi = µ.
m
X
EXi = ak P (Xi = ak )
k=1
m
1 X
= nk ak
N
k=1
=µ (by definition of the population mean)
3.2
Show that the variance of Xi is equal to the population variance. That is, show that Var(Xi ) = σ 2 .
V ar(Xi ) = E(Xi2 ) − [E(Xi )]2

m
X
= a2k P (Xi = ak ) − µ2 (where the second term follows from the previous question)
k=1
m
1 X
= nk a2k − µ2
N
k=1
2
=σ (by definition of the population variance)
PN 2
Pm 2
where we have used that i=1 xi = k=1 nk ak .
13
To complete the remainder of this exercise, you will need to download the “jester” dataset. This
dataset contains data from 14,116 users (rows) who have rated each of 30 jokes (columns). The
ratings fall between -10 (not at all funny), and 10 (extremely funny). The text for the jokes are
provided in the ‘jokes’ folder.
3.3
Using the read.csv() command, define an object called jester containing a data frame corresponding
to the data found in the jester.csv file (ensure that your working directory contains the jester.cv
file).
jester <- read.csv("jester.csv")
3.4
Read through some of the jokes, and plot histograms of a joke that you found funny and a joke
that you did not (note that you can plot histograms using the hist() function). Comment on the
distribution of ratings for each joke.
I found joke 89 fairly funny:

A radio conversation of a US naval ship with Canadian authorities ...
Americans: Please divert your course 15 degrees to the North to avoid a collision.
Canadians: Recommend you divert YOUR course 15 degrees to the South to avoid a collision.
Americans: This is the Captain of a US Navy ship. I say again, divert YOUR course.
Canadians: No. I say again, you divert YOUR course.
Americans: This is the aircraft carrier USS LINCOLN, the second largest ship in the United States’ Atlantic
Fleet. We are accompanied by three destroyers, three cruisers and numerous support vessels. I demand that you
change your course 15 degrees north, that’s ONE FIVE DEGREES NORTH, or counter-measures will be under-
taken to ensure the safety of this ship.
Canadians: This is a lighthouse. Your call.
Whereas joke 16 wasn’t too funny:

Q. What is orange and sounds like a parrot?
A. A carrot.
Based on the histograms, most people agreed with me! (See Figure 3.)
14
Joke 89 Joke 16
2500
2500
Frequency
Frequency
1500
1500
500
0 500
0
−10 −5 0 5 10 −10 −5 0 5 10
ratings ratings
Figure 5: Histograms of the ratings given for jokes 16 and 89
par(mfrow=c(1,2))
hist(jester$joke89, nclass = 10, main = "Joke 89", xlab = "ratings")
hist(jester$joke16, nclass = 10, main = "Joke 16", xlab = "ratings")
3.5
Calculate and print the mean and standard deviation (in R) of the ratings for each joke (hint: the
apply() function will be very useful here). What can you conclude about the overall response to
jokes 43 and 91?
apply(jester, 2, mean)
## joke1 joke6 joke8 joke10 joke16 joke18

## 1.3252040 1.5527883 -0.2901374 1.5411519 -2.5953039 -0.3386441
## 0.6582793 3.2320898 1.6761384 1.1824532 3.2709216 1.7891938
## -0.4789154 -1.6339239 0.7786370 -1.5145105 0.5239345 0.7913191
## 0.1348172 0.9980852 0.5014770 2.0142703 3.7558600 2.2426112
15
## 1.4010031 1.4087383 1.2873973 1.7013169 1.7959974 0.1801608
apply(jester, 2, sd)
## joke1 joke6 joke8 joke10 joke16 joke18 joke23 joke27

## 5.115485 4.871420 4.901709 5.011477 5.027276 4.848822 5.068341 4.590847
## 4.957192 4.922782 4.531652 4.892373 5.263245 5.319676 5.174673 5.446493
## 5.114424 4.716020 5.343211 5.100844 5.223714 4.815500 4.763649 5.062386
## 5.120330 5.155950 5.167159 4.817724 5.007766 5.203852
Joke 43 has a mean rating of -0.47 (not too funny) and joke 91 has a mean rating of 3.76 (a bit funnier), note that
the standard deviation for all questions are fairly similar around 5, indicating that there is quite a large spread of
ratings for all questions.
3.6
Using R, calculate the correlation and covariance between the ratings for jokes 43 and 91. Comment
on the results.
cor(jester$joke43, jester$joke91)
## [1] 0.2815762
cov(jester$joke43, jester$joke91)
## [1] 7.50248
The ratings for the two jokes don’t seem to be extremely correlated.
16
3.7
Prove that
−1 ≤ ρ ≤ 1
where ρ = √ Cov(X,Y ) is the correlation between X and Y .
Var(X)Var(Y )
Since the variance of a random variable is nonnegative,

X Y
0 ≤ Var +
σX σY

X Y X Y
= Var + Var + 2Cov ,
σX σY σX σY
Var(X) Var(Y ) Cov(X, Y )
= 2 + +
σX σY2 σX σY
= 2(1 + ρ)
So that ρ ≥ −1.
To show the upper bound, similarly consider the inequality

X Y
0 ≤ Var −
σX σY
and show that

0 ≤ 2(1 − ρ)
3.8
Cov(X,Y )
Using the results of part 6, verify in R that ρ = √ , where X corresponds to the ratings
Var(X)Var(Y )
for question 43, and Y corresponds to the ratings for question 91.
var.X <- var(jester$joke43)

var.Y <- var(jester$joke91)
cov.XY <- cov(jester$joke43, jester$joke91)
cov.XY/sqrt(var.Y * var.X)
## [1] 0.2815762
cor(jester$joke43, jester$joke91)
## [1] 0.2815762
17
Some R tips
For a (free) comprehensive introduction to R, see http://cran.r-project.org/doc/manuals/r-release/R-intro.
pdf. However, if you’re not sure how to use a particular function, you can use the ? command. For example, to
see how to use the runif() command, you can type ?runif.
18

Lab1 Sol

Uploaded by

Copyright:

Available Formats

You might also like

Lab1 Sol

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lab1 Sol

Uploaded by

Copyright:

Available Formats

STAT 135 Lab 1 Solutions

January 26, 2015

The code provided in Question1.R to generate a bernoulli(0.5) random variable was

# generate a sample from the Uniform(0,1) distribution. Call it random.uniform

# using random.uniform, generate a bernoulli(0.5) distributed random variable

# generate 7 samples from the Uniform(0,1) distribution. Call it random.uniform

# using random.uniform, generate a bernoulli(0.3) distributed random variables

# add them up to get a binomial random variable (number of successes in 7 trials)

# define an empty vector random.binomial

# using random.uniform, generate a bernoulli(0.3) distributed random variables

# add them up to get a binomial random variable (numebr of successes in 7 trials)

and load the package into the current environment using

First, let’s turn our vector into a data frame.

# instead of a vector, we now have a data frame with one column

Then plot the histogram (Figure 2.)

0.0 2.5 5.0

Using the inbuilt R function, the code is as follows:

random.binom.R <- rbinom(10000, 7, 0.3)

and the histogram is presented in Figure 3.

Again, here is the ggplot version

# turn vector into data frame

365 − 1 365 − 2 365 − (n − 1)

N <- 10000 # we will do 10000 simulations

# begin the for loop

# print the proportion of samples which contained duplicated birthdays

## Warning in factorial(365): value out of range in ’gammafn’

# we can instead use a for loop!

prob <- 1 - prod

V ar(Xi ) = E(Xi2 ) − [E(Xi )]2

jester <- read.csv("jester.csv")

I found joke 89 fairly funny:

Canadians: No. I say again, you divert YOUR course.

Canadians: This is a lighthouse. Your call.

Whereas joke 16 wasn’t too funny:

Figure 5: Histograms of the ratings given for jokes 16 and 89

## joke1 joke6 joke8 joke10 joke16 joke18

## joke1 joke6 joke8 joke10 joke16 joke18 joke23 joke27

Since the variance of a random variable is nonnegative,

To show the upper bound, similarly consider the inequality

and show that

var.X <- var(jester$joke43)

You might also like