Bacs HW3

HW3
112078516
2024-03-06
Question 1) Let’s reexamine how to standardize data: subtract the mean of a

vector from all its values, and divide this difference by the standard deviation
to get a vector of standardized values.
a) Create a normal distribution (mean=940, sd=190) and standardize it (let’s call it

rnorm_std)
rnorm_std <- rnorm(n=1000, mean=940, sd=190)
standardize <- function(numbers) {

numbers <- (numbers - mean(numbers)) / sd(numbers)
return(numbers)
}
# Combining them into a composite dataset
rnorm_std <- standardize(rnorm_std)
i) What should we expect the mean and standard deviation of standardized to have a mean
of 0 and a standard deviation of 1 to be, and why?
mean(rnorm_std)
## [1] -2.746245e-16
sd(rnorm_std)
## [1] 1
Function “standardize” (Standardization) would smooth the samples and make the data more balanced,
which causes mean to 0 and standard deviation to 1.
ii) What should the distribution (shape) of rnorm_std look like, and why?
# Plot the density function

plot(density(rnorm_std), col="blue", lwd=2, main = "rnorm_std", xlim = c(-4,4))
1
rnorm_std
0.4
0.3
Density
0.2
0.1
0.0
−4 −2 0 2 4
N = 1000 Bandwidth = 0.2231
The standardized distribution often appears bell-shaped due to the Central Limit Theorem (CLT), which
shows the data distributed in Symmetry and concentration in the middle.
iii) What do we generally call distributions that are normal and standardized?
A: Standard Normal Distribution.
b) Create a standardized version of minday discussed in question 3
(let’s call it minday_std)
bookings <- read.table("first_bookings_datetime_sample.txt", header=TRUE)

hours <- as.POSIXlt(bookings$datetime, format="%m/%d/%Y %H:%M")$hour
mins <- as.POSIXlt(bookings$datetime, format="%m/%d/%Y %H:%M")$min
minday <- hours*60 + mins
minday_std <- standardize(minday)
i) What should we expect the mean and standard deviation of minday_std to be, and why?
mean(minday_std)
## [1] -4.25589e-17
sd(minday_std)
## [1] 1
2
Function “standardize” (Standardization) would smooth the samples and make the bookings datetime data
more balanced, which causes mean to 0 and standard deviation to 1.
ii) What should the distribution of minday_std look like compared to minday, and why?
par(mfrow=c(1,2))
plot(density(minday), main="minday", col="blue", lwd=2)
plot(density(minday_std), main="minday_std", col="red", lwd=2)
minday minday_std
0.004
0.6
0.003
Density
Density
0.002
0.4
0.001
0.2
0.000
0.0
0 500 1000 1500 −4 −2 0 2
N = 100000 Bandwidth = 17.07 N = 100000 Bandwidth = 0.09
These two distributions look quite the same, but only with different scale. This is because doing standard-
ization will not change the shape of distribution, but it will rescale the values from distribution to mean =
0, sd = 1.
3
Question 2)
Install the “compstatslib” package from CRAN (see week 2 class handout) and
run the plot_sample_ci() function that simulates samples drawn randomly from
a population.
Each sample is a horizontal line with a dark band for its 95% CI, and a lighter
band for its 99% CI, and a dot for its mean.
The population mean is a vertical black line. Samples whose 95% CI includes
the population mean are blue, and others are red.
library(compstatslib)
plot_sample_ci()
100
80
60
Samples
40
20
0
−0.4 −0.2 0.0 0.2 0.4
Confidence Intervals
a) Simulate 100 samples (each of size 100), from a normally distributed population of 10,000:
plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000, distr_func=rnorm,

mean=20, sd=3)
plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000, distr_func=rnorm, mean=20, sd=3)
4
100
80
60
Samples
40
20
0
18.5 19.0 19.5 20.0 20.5 21.0 21.5
i) How many samples do we expect to NOT include the population mean in its 95% CI?
5 samples (100 * 95%), the meaning of 95% CI is, for example take 100 different samples and compute a
95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain
the true mean.
ii) How many samples do we expect to NOT include the population mean in their 99% CI?
Likewise, we can expect that only 1 sample is NOT include the population mean in their 99% CI.
b) Rerun the previous simulation with the same number of samples, but larger sample size
(sample_size=300):
plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000, distr_func=rnorm, mean=20, sd=3)
5
100
80
60
Samples
40
20
0
18.5 19.0 19.5 20.0 20.5 21.0 21.5
i) Now that the size of each sample has increased, do we expect their 95% and 99% CI to
become wider or narrower than before? The 95% and 99% CI will become narrower than before, this
is owing to the formula of Confidence Interval of the mean:
𝑠
𝑥 ± 𝑍 𝛼2 √
𝑛
Since the size of each sample has increased, which makes the value of 𝑍 𝛼2 √𝑠𝑛 decrease.
Therefore, the result of confidence interval become narrower than before.
ii) This time, how many samples (out of the 100) would we expect to NOT include the popu-
lation mean in its 95% CI?
A: 5 samples. Same reason as above. Besides, it won’t get affected by the param sample_size (the times of
simulation).
c) If we ran the above two examples (a and b) using a uniformly distributed population (specify
parameter distr_func=runif for plot_sample_ci), how do you expect your answers to (a) and
(b) to change, and why?
# plot_sample_ci with sample size=100 distr_func=runif

plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000, distr_func=runif)
6
100
80
60
Samples
40
20
0
0.35 0.40 0.45 0.50 0.55 0.60 0.65
plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000, distr_func=runif)
7
100
80
60
Samples
40
20
0
0.35 0.40 0.45 0.50 0.55 0.60 0.65
The answer for (a) and (b) will not change, which still expect 5 samples not in 95% CI and 1 sample not in
99% CI, with the same explanation above.
Besides, by looking up the formula of Confidence Interval of mean 𝑥 ± 𝑍 𝛼2 √𝑠𝑛 , the result of confidence
interval become narrower than before with the same reason in (b).
Question 3) The company EZTABLE has an online restaurant reservation plat-

form that is accessible by mobile and web. Imagine that EZTABLE would like
to start a promotion for new members to make their bookings earlier in the day.
We have a sample of data about their new members, in particular the date and time for which they make
their first ever booking (i.e., the booked time for the restaurant) using the EZTABLE platform. Here is
some sample code to explore the data:
bookings <- read.table("first_bookings_datetime_sample.txt", header=TRUE)

bookings$datetime[1:9]
## [1] "4/16/2014 17:30" "1/11/2014 20:00" "3/24/2013 12:00" "8/8/2013 12:00"

## [5] "2/16/2013 18:00" "5/25/2014 15:00" "12/18/2013 19:00" "12/23/2012 12:00"
## [9] "10/18/2013 20:00"
# [1] 4/16/2014 17:30 1/11/2014 20:00 3/24/2013 12:00 ...

# 18416 Levels: 1/1/2012 17:15 1/1/2012 19:00 ... 9/9/2014 19:30
hours <- as.POSIXlt(bookings$datetime, format="%m/%d/%Y %H:%M")$hour
mins <- as.POSIXlt(bookings$datetime, format="%m/%d/%Y %H:%M")$min
8
minday <- hours*60 + mins
plot(density(minday), main="Minute (of the day) of first ever booking", col="blue", lwd=2)
Minute (of the day) of first ever booking

0.004
0.003
Density
0.002
0.001
0.000
0 500 1000 1500
N = 100000 Bandwidth = 17.07
a) What is the “average” booking time for new members making their first restaurant booking?
(use minday, which is the absolute minute of the day from 0-1440)
i) Use traditional statistical methods to estimate the population mean of minday, its standard
error, and the 95% confidence interval (CI) of the sampling means
mean(minday)
## [1] 942.4964
sd(minday)/sqrt(length(minday))
## [1] 0.5997673
c(mean(minday)-1.96*sd(minday)/sqrt(length(minday)),mean(minday)+1.96*sd(minday)/sqrt(length(minday)))
## [1] 941.3208 943.6719
9
ii) Bootstrap to produce 2000 new samples from the original sample
compute_sample_mean <- function(n_sample) {

resample <- sample(n_sample, length(n_sample), replace=TRUE)
mean(resample)
}
mean_boots <- replicate(2000,compute_sample_mean(minday))
iii) Visualize the means of the 2000 bootstrapped samples
hist(mean_boots)
Histogram of mean_boots
500
Frequency
300
100
0
940 941 942 943 944 945
mean_boots
iv) Estimate the 95% CI of the bootstrapped means using the quantile function The 95% CI of
the bootstrapped means:
mean(mean_boots)
## [1] 942.4838
The bootstrap mean:
quantile(mean_boots, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 941.3266 943.6235
10
b) By what time of day, have half the new members of the day already arrived at their
restaurant?
i) Estimate the median of minday
median(minday)
## [1] 1040
ii) Visualize the medians of the 2000 bootstrapped samples
# Create a median computing function

compute_sample_median <- function(n_sample) {
resample <- sample(n_sample, length(n_sample), replace=TRUE)
median(resample)
}
bootmedian <- replicate(2000, compute_sample_median(minday))
hist(bootmedian)
Histogram of bootmedian
1000
800
600
Frequency
400
200
0
1020 1025 1030 1035 1040 1045 1050
bootmedian
iii) Estimate the 95% CI of the bootstrapped medians using the quantile function
quantile(bootmedian, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 1020 1050
11
95% CI of the bootstrapped medians is (1020, 1050).
min_to_datetime <- function(minday) {

# Compute hour and minutes
hours <- floor(minday / 60)
minutes <- minday %% 60
# return datetime (string format)

datetime <- paste(hours, ":", minutes, sep = "")
return(datetime)
}
min_to_datetime(1050)
## [1] "17:30"
It means that when 17:30(minday of 1050), the store have half the new members of the day already arrived
at their restaurant.
12

Bacs HW3

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Bacs HW3

Uploaded by

Copyright:

Available Formats

HW3

Question 1) Let’s reexamine how to standardize data: subtract the mean of a

a) Create a normal distribution (mean=940, sd=190) and standardize it (let’s call it

rnorm_std <- rnorm(n=1000, mean=940, sd=190)

standardize <- function(numbers) {

# Plot the density function

N = 1000 Bandwidth = 0.2231

b) Create a standardized version of minday discussed in question 3

(let’s call it minday_std)

bookings <- read.table("first_bookings_datetime_sample.txt", header=TRUE)

0 500 1000 1500 −4 −2 0 2

N = 100000 Bandwidth = 17.07 N = 100000 Bandwidth = 0.09

−0.4 −0.2 0.0 0.2 0.4

plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000, distr_func=rnorm,

plot_sample_ci(num_samples = 100, sample_size = 100, pop_size=10000, distr_func=rnorm, mean=20, sd=3)

18.5 19.0 19.5 20.0 20.5 21.0 21.5

plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000, distr_func=rnorm, mean=20, sd=3)

18.5 19.0 19.5 20.0 20.5 21.0 21.5

# plot_sample_ci with sample size=100 distr_func=runif

0.35 0.40 0.45 0.50 0.55 0.60 0.65

plot_sample_ci(num_samples = 100, sample_size = 300, pop_size=10000, distr_func=runif)

0.35 0.40 0.45 0.50 0.55 0.60 0.65

Question 3) The company EZTABLE has an online restaurant reservation plat-

bookings <- read.table("first_bookings_datetime_sample.txt", header=TRUE)

## [1] "4/16/2014 17:30" "1/11/2014 20:00" "3/24/2013 12:00" "8/8/2013 12:00"

# [1] 4/16/2014 17:30 1/11/2014 20:00 3/24/2013 12:00 ...

Minute (of the day) of first ever booking

0 500 1000 1500

N = 100000 Bandwidth = 17.07

## [1] 941.3208 943.6719

compute_sample_mean <- function(n_sample) {

iii) Visualize the means of the 2000 bootstrapped samples

940 941 942 943 944 945

The bootstrap mean:

quantile(mean_boots, probs=c(0.025, 0.975))

i) Estimate the median of minday

ii) Visualize the medians of the 2000 bootstrapped samples

# Create a median computing function

1020 1025 1030 1035 1040 1045 1050

quantile(bootmedian, probs=c(0.025, 0.975))

min_to_datetime <- function(minday) {

# return datetime (string format)

You might also like