STAT 135 Lab 1 Solutions

January 26, 2015

Introduction
To complete this lab, you will need access to R and RStudio. If you have not already done so, you can
download R from http://cran.cnr.berkeley.edu/, and RStudio from http://www.rstudio.com/. You will
have the 2 hours of this lab section to begin the lab questions. You are not required to turn in your lab, as labs
will not be graded. There will be 3-4 assignments throughout the semester (worth a total of 20%) which require
knowledge of R, so if you do not finish the lab during the section, you are encouraged to finish it in your
own time.

Question 1
The content of this question refers to the file “Question1.R” located in BCourses. This R script
contains an example of some code which generates a Bernoulli(0.5) random variable by manipulating
a Uniform(0,1) random variable. This question involves editing the code in the “Question1.R”
file in order to generate a Binomial(0.3, 7) random variable.

The code provided in Question1.R to generate a Bernoulli(0.5) random variable was

# generate a sample from the Uniform(0,1) distribution. Call it random.uniform
random.uniform <- runif(1, 0, 1) # to see what these arguments correspond to, type ?runif

# using random.uniform, generate a Bernoulli(0.5) distributed random variable
random.bernoulli <- as.numeric(random.uniform < 0.5)
# to understand what is happening above,
# note that the uniformly distributed random variable will be less than 0.5 half of the time.
# Moreover:
# - "random.uniform < 0.5" returns TRUE when random.uniform is less than 0.5
# - "random.uniform < 0.5" returns FALSE when random.uniform is greater than or equal to 0.5
# - "as.numeric(x)" converts x to a number. When x is TRUE, as.numeric(x) will return 1,
#   when x is FALSE, as.numeric(x) will return 0
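As a quick sanity check on the construction above, the following sketch repeats it many times and looks at the proportion of ones, which should be close to 0.5. The names n.draws, uniform.draws and bernoulli.draws are illustrative and do not come from Question1.R.

```r
set.seed(1)                          # fix the seed so the run is reproducible
n.draws <- 10000                     # illustrative number of repetitions

# draw n.draws Uniform(0,1) values and threshold each one at 0.5
uniform.draws <- runif(n.draws, 0, 1)
bernoulli.draws <- as.numeric(uniform.draws < 0.5)

# the sample proportion of ones estimates P(uniform < 0.5) = 0.5
mean(bernoulli.draws)
```

Replacing 0.5 with another threshold p between 0 and 1 gives Bernoulli(p) draws in the same way.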

1.1
Edit the code provided to generate a Binomial(0.3, 7) random variable.
Hint: there are several ways to do this, however, for a particularly simple way, you only need to:
- change one number in each line of code, and
- add a final line of code which involves the sum() function
(Recall that a Binomial(p,n) random variable can be considered as the sum of n Bernoulli(p) random
variables).

The code above can be edited as follows to generate a Binomial(0.3, 7) random variable.

# generate 7 samples from the Uniform(0,1) distribution. Call it random.uniform
random.uniform <- runif(7, 0, 1)

# using random.uniform, generate 7 Bernoulli(0.3) distributed random variables
random.bernoulli <- as.numeric(random.uniform < 0.3)

# add them up to get a binomial random variable (number of successes in 7 trials)
random.binomial <- sum(random.bernoulli)
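The three steps above can be wrapped into a small function so that the number of trials and the success probability become arguments. This is only a sketch: the function name sim.binomial is hypothetical and is not part of Question1.R.

```r
# hypothetical helper: one Binomial(p, n) draw as a sum of n Bernoulli(p) indicators
sim.binomial <- function(n, p) {
  random.uniform <- runif(n, 0, 1)         # n Uniform(0,1) samples
  sum(as.numeric(random.uniform < p))      # count how many fall below p
}

set.seed(1)
sim.binomial(7, 0.3)   # one Binomial(0.3, 7) draw, an integer between 0 and 7
```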

1.2
Using your code from the previous part, use a for loop to generate 10,000 Binomial(0.3, 7) random
variables. Plot a histogram, using hist(), of the generated numbers (use this as a check to make
sure that your code is working).

The following code uses a for loop to generate 10,000 Binomial(0.3, 7) random variables. The histogram is
presented in Figure 1.

# define an empty vector random.binomial
random.binomial <- c()

for(i in 1:10000) {
  # generate 7 samples from the Uniform(0,1) distribution. Call it random.uniform
  random.uniform <- runif(7, 0, 1)

  # using random.uniform, generate 7 Bernoulli(0.3) distributed random variables
  random.bernoulli <- as.numeric(random.uniform < 0.3)

  # add them up to get a binomial random variable (number of successes in 7 trials)
  # and assign the value to the ith element of the vector random.binomial
  random.binomial[i] <- sum(random.bernoulli)
}

hist(random.binomial,
     nclass = 8,
     main = "Histogram of simulated Bin(0.3,7)",
     xlab = "simulated value",
     right = FALSE)
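The question asks for a for loop, but the same simulation can also be written without one. The sketch below is an alternative, not the lab's method: it draws all 70,000 uniforms at once, arranges them in a matrix with one row per simulation, and counts the successes in each row. The names n.sim, uniform.matrix and random.binomial.vec are illustrative.

```r
set.seed(1)
n.sim <- 10000   # number of Binomial(0.3, 7) draws

# one row per simulated binomial, one column per Bernoulli trial
uniform.matrix <- matrix(runif(n.sim * 7, 0, 1), nrow = n.sim, ncol = 7)

# comparing with 0.3 gives a logical matrix; rowSums counts the TRUE entries per row
random.binomial.vec <- rowSums(uniform.matrix < 0.3)

mean(random.binomial.vec)   # should be close to n * p = 7 * 0.3 = 2.1
```

Avoiding the loop also avoids growing the result vector one element at a time, which tends to be slow in R for large simulations.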

Note that you can make much nicer and more intuitive graphics using ggplot(). First, you will need to install
the ggplot2 package using

install.packages("ggplot2")

and load the package into the current environment using

library(ggplot2)

Next, ggplot() works by first generating an empty plot object with a specified data frame which will be used.
On top of this empty plot, we can add whatever we want. Here we want a histogram, so we add on a histogram
where we need to specify which variable/column of the data frame we are interested in plotting.

First, let’s turn our vector into a data frame.

Figure 1: A histogram of a simulation of size 10,000 from the Binomial(0.3, 7) distribution using the Uniform
distribution (x-axis: simulated value; y-axis: frequency)

# define the data frame binom.df and call the column/variable (we only have 1 variable) "binom"
binom.df <- data.frame(binom = random.binomial)

# instead of a vector, we now have a data frame with one column
# the head() function allows us to just look at the first 6 rows
head(binom.df)

## binom
## 1 1
## 2 3
## 3 1
## 4 2
## 5 3
## 6 3

Then plot the histogram (Figure 2).

# the first parameter inside the ggplot() function specifies which data frame we want to access
ggplot(binom.df) +
  # add on a histogram using geom_histogram
  # we specify which column to plot on the x axis
  # and add a white border around the boxes to make it prettier
  geom_histogram(aes(x = binom), binwidth = 1, col = "white") +
  # add an informative title!
  ggtitle("Histogram of simulated Bin(0.3,7)") +
  # add an informative axis label
  scale_x_continuous(name = "simulated value")

Figure 2: A histogram of a simulation of size 10,000 from the Binomial(0.3, 7) distribution using the Uniform
distribution, drawn with ggplot() (x-axis: simulated value; y-axis: count)

Figure 3: A histogram of a simulation of size 10,000 from the Binomial(0.3, 7) distribution using rbinom()
(x-axis: value; y-axis: frequency)

1.3
Use the inbuilt R function, rbinom(), to generate 1000 Binomial(0.3, 7) random variables, and plot
a histogram of the generated numbers. Compare with the histogram generated in the previous
question.

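Using the built-in R function (with 10,000 draws rather than 1000, so that the histogram is directly comparable
with the one from the previous part), the code is as follows: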

random.binom.R <- rbinom(10000, 7, 0.3)

and the histogram is presented in Figure 3.

hist(random.binom.R,
nclass = 8,
main="Histogram of rbinom() Bin(0.3,7)",
xlab = "value",
right = FALSE)
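Beyond comparing the two histograms by eye, the simulated frequencies can be set against the exact Binomial(0.3, 7) probabilities given by dbinom(). The following is a sketch under that assumption; the names draws, empirical and theoretical are illustrative.

```r
set.seed(135)
draws <- rbinom(10000, size = 7, prob = 0.3)

# empirical proportion of each value 0, 1, ..., 7 (factor() keeps values that never occur)
empirical <- as.numeric(table(factor(draws, levels = 0:7))) / length(draws)

# exact probability mass function of the Binomial(0.3, 7) distribution
theoretical <- dbinom(0:7, size = 7, prob = 0.3)

round(cbind(value = 0:7, empirical, theoretical), 4)
```

The two columns should agree to roughly two decimal places for a simulation of this size.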

Again, here is the ggplot version (Figure 4).

Figure 4: A histogram of a simulation of size 10,000 from the Binomial(0.3, 7) distribution using rbinom(),
drawn with ggplot() (x-axis: value; y-axis: count)

# turn vector into data frame
binom.R.df <- data.frame(binom = random.binom.R)

# the first parameter inside the ggplot() function specifies which data frame we want to access
ggplot(binom.R.df) +
  # add on a histogram using geom_histogram
  # we specify which column to plot on the x axis
  # and add a white border around the boxes to make it prettier
  geom_histogram(aes(x = binom), binwidth = 1, col = "white") +
  # add an informative title!
  ggtitle("Histogram of rbinom() Bin(0.3,7)") +
  # add an informative axis label
  scale_x_continuous(name = "value")

Question 2
Suppose that there are n < 365 people in a room. Ignoring leap years (i.e. ignoring February 29), we
are interested in finding the probability that no two people have the same birthday. Assume that
all birthdays are equally probable so that the probability of a given birthday for a person chosen
from the entire population at random is 1/365.

2.1
Show that the probability that no two people have the same birthday is given by
\[
P(\text{no two people have the same birthday}) = \frac{365!}{(365 - n)!\,365^{n}}
\]

Start with an arbitrary person’s birthday. Note that the probability that the second person’s birthday is different
is
\[
P(\text{second person's birthday is different from the first person's birthday}) = \frac{365 - 1}{365} = \frac{364}{365}
\]
Next, the probability that the first, second and third person each has a different birthday is
\[
P(\text{first, second and third person each has a different birthday}) = \frac{365 - 1}{365} \cdot \frac{365 - 2}{365} = \frac{364 \times 363}{365^{2}}
\]
since the second person’s birthday must be different from the first person’s birthday (so can fall on any of the
remaining 364 days) and the third person’s birthday must be different again (and so can fall on any of the remaining
363 days).

Continuing on in this way, we find that the probability of no two people having the same birthday is
\begin{align*}
P(\text{no two people have the same birthday})
  &= \frac{365 - 1}{365} \cdot \frac{365 - 2}{365} \cdots \frac{365 - (n - 1)}{365} \\
  &= \frac{(365 - 1)(365 - 2) \cdots (365 - (n - 1))}{365^{n-1}} \\
  &= \frac{365!}{(365 - n)!\,365^{n}}
\end{align*}
where the last step multiplies the numerator and denominator by 365 and uses
$365 \cdot 364 \cdots (365 - (n - 1)) = 365!/(365 - n)!$.
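As a check on the closed form, it can be evaluated for the two small cases already worked out by hand in this derivation. Both agree with the step-by-step probabilities:

```latex
\[
n = 2:\quad \frac{365!}{363!\,365^{2}} = \frac{365 \cdot 364}{365^{2}} = \frac{364}{365},
\qquad
n = 3:\quad \frac{365!}{362!\,365^{3}} = \frac{365 \cdot 364 \cdot 363}{365^{3}} = \frac{364 \times 363}{365^{2}}.
\]
```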

2.2
Hence, calculate the probability that two or more people out of a group of n do have the same
birthday. Hint: identify first the complementary event.

The trick here is to note that the event “two or more people have the same birthday” is the complement of the
event that “no two people have the same birthday”. Thus
\begin{align*}
P(\text{two or more people have the same birthday})
  &= 1 - P(\text{no two people have the same birthday}) \\
  &= 1 - \frac{365!}{(365 - n)!\,365^{n}}
\end{align*}
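To get a feel for how quickly this probability grows with n, the formula can be evaluated as a product of ratios, which avoids computing large factorials. This is a sketch: the function name prob.shared is hypothetical.

```r
# hypothetical helper: P(two or more of n people share a birthday), for 1 <= n <= 365
prob.shared <- function(n) {
  # product of (365 - 0)/365, (365 - 1)/365, ..., (365 - (n - 1))/365
  1 - prod((365 - 0:(n - 1)) / 365)
}

# evaluate for group sizes 1 to 60
probs <- sapply(1:60, prob.shared)

# smallest group size with a better-than-even chance of a shared birthday
min(which(probs > 0.5))
```

The first factor in the product equals 1, so it matches the formula above, whose product starts at 365 − 1.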

2.3
Suppose that n = 50. Verify this formula in R by taking n samples from a Uniform(1,365) (using the
runif() function) distribution and identifying whether or not two or more people have the same
birthday. Repeat this N = 1000 times using a for loop to estimate the probability that two or more
people have the same birthday.

Note that an easy way to obtain a discrete uniform random variable on the days 1, 2, ..., 365 is to apply the
floor function to a continuous Uniform(1, 366) random variable. So we can take a sample of 50 such
random variables (representing taking a sample of 50 birthdays) using the following code

set.seed(123) # set the seed so we get the same results every time.
sample <- floor(runif(50,1,366))

To identify whether or not there are any birthdays in common, we can use the duplicated() function, which returns
TRUE for every entry of the vector that repeats an earlier entry.

duplicated(sample)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE

Two entries are flagged as TRUE, so this sample contains repeated birthdays. Let’s repeat this 10,000 times (more
than the N = 1000 suggested in the question, for a more precise estimate) to see the proportion of such samples.

N <- 10000 # we will do 10000 simulations

duplicated.birthday <- c() # define an empty vector that we will fill up

# begin the for loop
for(i in 1:N) {
  sample <- floor(runif(50, 1, 366)) # take the sample
  # identify if there are duplicates in the sample
  duplicated.birthday[i] <- sum(duplicated(sample)) > 0
}

# print the proportion of samples which contained duplicated birthdays
sum(duplicated.birthday)/N

## [1] 0.9697
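The question asks for runif() and a for loop, but the same estimate can be sketched more compactly with sample(), which draws the birthdays directly from 1, ..., 365, and replicate(), which repeats the experiment. This is an alternative formulation, not the lab's method; the names n.people, n.sim and has.shared are illustrative.

```r
set.seed(123)
n.people <- 50     # people in the room
n.sim <- 10000     # number of simulated rooms

has.shared <- replicate(n.sim, {
  # draw n.people birthdays uniformly from the 365 days, with replacement
  birthdays <- sample(1:365, n.people, replace = TRUE)
  # anyDuplicated() returns 0 when all entries are distinct
  anyDuplicated(birthdays) > 0
})

# proportion of simulated rooms with at least one shared birthday
mean(has.shared)
```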

Now let us check this estimate against the formula we derived.

n <- 50
# note that we can't simply use the factorial formula, since the number is far too big!
prob <- 1 - factorial(365)/((factorial(365 - n))*(365^n))

## Warning in factorial(365): value out of range in ’gammafn’


## Warning in factorial(365 - n): value out of range in ’gammafn’

# we can instead use a for loop!
prod <- 1
for(i in 1:(n-1)){
  prod <- prod*(365 - i)/365
}

prob <- 1 - prod
prob

## [1] 0.9703736

So our estimate was pretty accurate: with probability 0.970, two or more people in a room of 50 people will have
the same birthday.
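Another way around the overflow in factorial(365) is to work on the log scale with lfactorial(), which returns the logarithm of a factorial, and to exponentiate only at the end. The sketch below assumes n = 50 as in the question; the names log.prob.none and prob.log are illustrative.

```r
n <- 50

# log of 365! / ((365 - n)! * 365^n), computed without forming the huge factorials
log.prob.none <- lfactorial(365) - lfactorial(365 - n) - n * log(365)

# probability that two or more people share a birthday
prob.log <- 1 - exp(log.prob.none)
prob.log
```

This should agree with the value produced by the for loop up to floating-point rounding.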

Question 3
Suppose we have a population $x_1, x_2, \ldots, x_N$, with population mean $\mu$ and population variance $\sigma^2$.
Denote the distinct values assumed by the population members by $a_1, a_2, \ldots, a_m$, and denote the
number of population members that have the value $a_k$ by $n_k$, $k = 1, \ldots, m$. Then given a sample
$X_1, \ldots, X_n$ drawn from this population, we have that each $X_i$ is a discrete random variable with
probability mass function
\[
P(X_i = a_k) = \frac{n_k}{N}
\]

3.1
Show that the expected value of $X_i$ is equal to the population mean. That is, show that $E X_i = \mu$.

\begin{align*}
E X_i &= \sum_{k=1}^{m} a_k P(X_i = a_k) \\
      &= \frac{1}{N} \sum_{k=1}^{m} n_k a_k \\
      &= \mu \quad \text{(by definition of the population mean)}
\end{align*}

3.2
Show that the variance of $X_i$ is equal to the population variance. That is, show that $\mathrm{Var}(X_i) = \sigma^2$.

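\begin{align*}
\mathrm{Var}(X_i) &= E(X_i^2) - [E(X_i)]^2 \\
  &= \sum_{k=1}^{m} a_k^2 P(X_i = a_k) - \mu^2 \quad \text{(where the second term follows from the previous question)} \\
  &= \frac{1}{N} \sum_{k=1}^{m} n_k a_k^2 - \mu^2 \\
  &= \sigma^2 \quad \text{(by definition of the population variance)}
\end{align*}
where we have used that $\sum_{i=1}^{N} x_i^2 = \sum_{k=1}^{m} n_k a_k^2$.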
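Both identities can be checked numerically on a small made-up population. The sketch below uses a hypothetical population of six values (not taken from the lab) and computes the mean and variance once directly from the population and once from the probability mass function $n_k/N$.

```r
# hypothetical toy population with repeated values
x <- c(1, 1, 2, 5, 5, 5)
N <- length(x)

# population mean and population variance (dividing by N, unlike var(), which divides by N - 1)
mu <- mean(x)
sigma2 <- mean((x - mu)^2)

# distinct values a_k and their probabilities n_k / N
counts <- table(x)
a <- as.numeric(names(counts))
pmf <- as.numeric(counts) / N

# expectation and variance of a single draw X_i computed from the pmf
EX <- sum(a * pmf)
VarX <- sum(a^2 * pmf) - EX^2

c(EX = EX, mu = mu, VarX = VarX, sigma2 = sigma2)
```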

To complete the remainder of this exercise, you will need to download the “jester” dataset. This
dataset contains data from 14,116 users (rows) who have rated each of 30 jokes (columns). The
ratings fall between -10 (not at all funny) and 10 (extremely funny). The text of the jokes is
provided in the ‘jokes’ folder.

3.3
Using the read.csv() command, define an object called jester containing a data frame corresponding
to the data found in the jester.csv file (ensure that your working directory contains the jester.csv
file).

jester <- read.csv("jester.csv")

3.4
Read through some of the jokes, and plot histograms of a joke that you found funny and a joke
that you did not (note that you can plot histograms using the hist() function). Comment on the
distribution of ratings for each joke.

I found joke 89 fairly funny:


A radio conversation of a US naval ship with Canadian authorities ...
Americans: Please divert your course 15 degrees to the North to avoid a collision.

Canadians: Recommend you divert YOUR course 15 degrees to the South to avoid a collision.

Americans: This is the Captain of a US Navy ship. I say again, divert YOUR course.

Canadians: No. I say again, you divert YOUR course.

Americans: This is the aircraft carrier USS LINCOLN, the second largest ship in the United States’ Atlantic
Fleet. We are accompanied by three destroyers, three cruisers and numerous support vessels. I demand that you
change your course 15 degrees north, that’s ONE FIVE DEGREES NORTH, or counter-measures will be
undertaken to ensure the safety of this ship.

Canadians: This is a lighthouse. Your call.

Whereas joke 16 wasn’t too funny:


Q. What is orange and sounds like a parrot?

A. A carrot.

Based on the histograms, most people agreed with me! (See Figure 5.)

Figure 5: Histograms of the ratings given for jokes 16 and 89 (left panel: Joke 89; right panel: Joke 16;
x-axis: ratings; y-axis: frequency)

par(mfrow=c(1,2))
hist(jester$joke89, nclass = 10, main = "Joke 89", xlab = "ratings")
hist(jester$joke16, nclass = 10, main = "Joke 16", xlab = "ratings")

3.5
Calculate and print the mean and standard deviation (in R) of the ratings for each joke (hint: the
apply() function will be very useful here). What can you conclude about the overall response to
jokes 43 and 91?

apply(jester, 2, mean)

##      joke1      joke6      joke8     joke10     joke16     joke18
##  1.3252040  1.5527883 -0.2901374  1.5411519 -2.5953039 -0.3386441
##     joke23     joke27     joke28     joke34     joke35     joke38
##  0.6582793  3.2320898  1.6761384  1.1824532  3.2709216  1.7891938
##     joke43     joke44     joke55     joke57     joke63     joke70
## -0.4789154 -1.6339239  0.7786370 -1.5145105  0.5239345  0.7913191
##     joke79     joke82     joke86     joke87     joke89     joke91
##  0.1348172  0.9980852  0.5014770  2.0142703  3.7558600  2.2426112
##     joke92     joke94     joke95     joke96     joke97     joke99
##  1.4010031  1.4087383  1.2873973  1.7013169  1.7959974  0.1801608

apply(jester, 2, sd)

##    joke1    joke6    joke8   joke10   joke16   joke18   joke23   joke27
## 5.115485 4.871420 4.901709 5.011477 5.027276 4.848822 5.068341 4.590847
##   joke28   joke34   joke35   joke38   joke43   joke44   joke55   joke57
## 4.957192 4.922782 4.531652 4.892373 5.263245 5.319676 5.174673 5.446493
##   joke63   joke70   joke79   joke82   joke86   joke87   joke89   joke91
## 5.114424 4.716020 5.343211 5.100844 5.223714 4.815500 4.763649 5.062386
##   joke92   joke94   joke95   joke96   joke97   joke99
## 5.120330 5.155950 5.167159 4.817724 5.007766 5.203852

Joke 43 has a mean rating of -0.48 (not too funny) and joke 91 has a mean rating of 2.24 (a bit funnier). Note that
the standard deviations for all jokes are fairly similar, around 5, indicating that there is quite a large spread of
ratings for every joke.
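To see what apply(jester, 2, ...) is doing, it may help to run it on a tiny data frame where the answers can be checked by hand. The data frame toy and its columns jokeA and jokeB below are made up for illustration and are not part of the jester data.

```r
# hypothetical ratings: 4 users (rows) by 2 jokes (columns)
toy <- data.frame(jokeA = c(-2, 0, 4, 6), jokeB = c(1, 3, 5, 9))

# MARGIN = 2 applies the function to each column in turn
apply(toy, 2, mean)

# colMeans() gives the same column means directly
colMeans(toy)

# sapply() also works column by column on a data frame
sapply(toy, sd)
```

Here the column means are (-2 + 0 + 4 + 6)/4 = 2 and (1 + 3 + 5 + 9)/4 = 4.5.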

3.6
Using R, calculate the correlation and covariance between the ratings for jokes 43 and 91. Comment
on the results.

cor(jester$joke43, jester$joke91)

## [1] 0.2815762

cov(jester$joke43, jester$joke91)

## [1] 7.50248

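The correlation of 0.28 is positive but weak: users who rated joke 43 highly tended to rate joke 91 slightly higher
as well, but the ratings for the two jokes are not strongly correlated.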
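One reason to comment on the correlation rather than the covariance is that the covariance depends on the scale of the variables, while the correlation does not. The sketch below illustrates this with simulated data (not the jester ratings): multiplying x by 10 multiplies the covariance by 10 but leaves the correlation unchanged.

```r
set.seed(1)
# simulated data with a weak positive relationship (illustrative only)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

# covariance and correlation on the original scale
c(cov = cov(x, y), cor = cor(x, y))

# after rescaling x, the covariance changes but the correlation does not
c(cov = cov(10 * x, y), cor = cor(10 * x, y))
```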

3.7
Prove that
\[
-1 \le \rho \le 1
\]
where $\rho = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$ is the correlation between $X$ and $Y$.

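Since the variance of a random variable is nonnegative,
\begin{align*}
0 &\le \mathrm{Var}\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right) \\
  &= \mathrm{Var}\left(\frac{X}{\sigma_X}\right) + \mathrm{Var}\left(\frac{Y}{\sigma_Y}\right) + 2\,\mathrm{Cov}\left(\frac{X}{\sigma_X}, \frac{Y}{\sigma_Y}\right) \\
  &= \frac{\mathrm{Var}(X)}{\sigma_X^2} + \frac{\mathrm{Var}(Y)}{\sigma_Y^2} + \frac{2\,\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \\
  &= 2(1 + \rho)
\end{align*}
where $\sigma_X^2 = \mathrm{Var}(X)$ and $\sigma_Y^2 = \mathrm{Var}(Y)$, so that $\rho \ge -1$.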

To show the upper bound, similarly consider the inequality
\[
0 \le \mathrm{Var}\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right)
\]
and show that
\[
0 \le 2(1 - \rho)
\]
so that $\rho \le 1$.
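For completeness, the upper bound can be written out in the same way as the lower bound. The only change is the sign of the covariance term:

```latex
\begin{align*}
0 &\le \mathrm{Var}\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) \\
  &= \frac{\mathrm{Var}(X)}{\sigma_X^2} + \frac{\mathrm{Var}(Y)}{\sigma_Y^2} - \frac{2\,\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \\
  &= 1 + 1 - 2\rho \\
  &= 2(1 - \rho)
\end{align*}
```

Dividing by 2 and rearranging gives $\rho \le 1$.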

3.8
Using the results of part 3.6, verify in R that $\rho = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$, where $X$ corresponds to the ratings
for joke 43, and $Y$ corresponds to the ratings for joke 91.

var.X <- var(jester$joke43)


var.Y <- var(jester$joke91)
cov.XY <- cov(jester$joke43, jester$joke91)

cov.XY/sqrt(var.Y * var.X)

## [1] 0.2815762

cor(jester$joke43, jester$joke91)

## [1] 0.2815762

Some R tips
For a (free) comprehensive introduction to R, see http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf.
However, if you’re not sure how to use a particular function, you can use the ? command. For example, to
see how to use the runif() command, you can type ?runif.
