R Basics: 26-JULY-2019
VECTORS
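tapply: Applies a function to groups of values in a vector (for example, a dataframe column), where the groups are defined by one or more factors. It can be used to create pivot tables.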
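A minimal sketch of tapply used as a pivot table. The vectors sales, region and quarter are made-up illustration data, not taken from the slides.

```r
# Hypothetical data: one numeric vector and two grouping factors
sales   <- c(10, 20, 30, 40, 50, 60)
region  <- factor(c("East", "West", "East", "West", "East", "West"))
quarter <- factor(c("Q1", "Q1", "Q2", "Q2", "Q1", "Q2"))

# One grouping factor: mean of sales within each region
mean_by_region <- tapply(sales, region, mean)
print(mean_by_region)

# Two grouping factors: a region-by-quarter table of summed sales
pivot <- tapply(sales, list(region, quarter), sum)
print(pivot)
```

With two factors the result is a matrix with one row per region and one column per quarter, which is the pivot-table layout.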
Binomial Distribution:
Cumulative Distribution: pbinom(k, n, p)
Inverse Cumulative Distribution: qbinom(cdfvalue, n, p)
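As a small illustration of the two calls, assume n = 10 trials with success probability p = 0.5 (values chosen only for this example):

```r
n <- 10    # number of trials (assumed for illustration)
p <- 0.5   # success probability (assumed for illustration)

# Cumulative distribution: P(X <= 3)
cdf_at_3 <- pbinom(3, n, p)
print(cdf_at_3)

# Inverse cumulative distribution: smallest k with P(X <= k) >= 0.5
median_k <- qbinom(0.5, n, p)
print(median_k)
```

Here pbinom returns a probability and qbinom maps a probability back to a count, so the two functions work in opposite directions.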
PROBABILITY AND STATISTICS IN R
Hypothesis Testing
t-test
t.test(sample1, [sample2], mu=, alternative=, var.equal= , conf.level=)
z.test (BSDA)
z.test(sample1, [sample2], mu= , sigma.x= , [sigma.y=], alternative= , conf.level= )
F-test
var.test(sample1, sample2, ratio=, alternative= , conf.level= )
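The calls above might be used as in the sketch below. The samples x and y are invented for illustration. z.test is left out because it needs the BSDA package, while t.test and var.test ship with base R.

```r
# Hypothetical samples
x <- c(5.1, 4.9, 5.6, 5.8, 6.0)
y <- c(4.8, 5.2, 5.0, 5.5, 4.7, 5.1)

# One-sample t-test: is the mean of x greater than 5?
one_sample <- t.test(x, mu = 5, alternative = "greater", conf.level = 0.95)

# Two-sample t-test assuming equal variances (pooled variance)
two_sample <- t.test(x, y, var.equal = TRUE)

# F-test for the ratio of the two variances
f_result <- var.test(x, y, ratio = 1, alternative = "two.sided", conf.level = 0.95)

print(one_sample$p.value)
print(two_sample$parameter)   # degrees of freedom: n1 + n2 - 2
print(f_result$parameter)     # numerator and denominator degrees of freedom
```

Each call returns an object of class "htest" whose statistic, parameter, p.value and conf.int fields can be read directly.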
Case: Cheating in an IQ test
▪ Suppose five students seated near one another in an IQ test had above-average scores of 125, 120, 112, 118 and 105.
What is the probability that they might have cheated? Equivalently, what is the probability that such an event
happens in the 'normal' course of things?
▪ Given that IQ is normally distributed with mean 100 and standard deviation 16, and under the null hypothesis
that the students did not copy and were randomly chosen, what is the probability of a sample mean at least as
high as theirs?
Step 1: Compute the mean IQ of the 5 students: 116
Step 2: Compute the z statistic (since we are given the population stdev, we can use the normal distribution):
$$z = \frac{116 - 100}{16/\sqrt{5}} = \frac{16}{7.155} \approx 2.24$$
Step 3: The one-sided 5% critical value is z = 1.645 (1.96 is the two-sided value). Since z = 2.24 exceeds both,
the null hypothesis can be rejected with error < 5%, and we can say that the sample mean is significantly
larger than 100.
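The three steps can be checked in base R. This is a sketch that uses only the scores and the population parameters stated in the case. The variable names are illustrative.

```r
scores <- c(125, 120, 112, 118, 105)
mu0    <- 100   # population mean under the null hypothesis
sigma  <- 16    # known population standard deviation

# Step 1: sample mean
xbar <- mean(scores)

# Step 2: z statistic for the sample mean
z <- (xbar - mu0) / (sigma / sqrt(length(scores)))

# Step 3: one-sided critical value and p-value
z_crit  <- qnorm(0.95)
p_value <- pnorm(z, lower.tail = FALSE)

print(c(xbar = xbar, z = z, z_crit = z_crit, p_value = p_value))
```

The p-value is the probability, under the null hypothesis, of a sample mean at least as large as the one observed.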
Two sample comparisons
▪ Suppose that a chain store is selling an uncommon item: an expensive Persian carpet. They want to
gauge whether the use of a "deal sweetener", such as buy 1, get any other item at a 20% discount, would
help the sales of this carpet. Let us report sales numbers in units of 10,000 per month. They could, for example,
compare the sales figures between two stores, one which has the incentive and the other that does not.
Let us assume that historically the two stores have had very similar sales of most items, that the sample
means are $\bar{X}_1$ and $\bar{X}_2$, and that the sample sizes are $n_1$ and $n_2$. Assume that $n_1$ and $n_2$ are small (< 30). Since we
would have limited data for this uncommon item, we do not have the population variance / stdev. We can
use the pooled variance of the two samples and the null hypothesis that the two populations have the
same mean and variance.
▪ Null hypothesis H0: the two stores have the same sales with and without the deal sweetener
▪ Decision: reject H0 or fail to reject it
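A sketch of the pooled-variance comparison for small samples. The monthly sales figures below are invented for illustration (the slides do not give the data), and the names with_deal and without_deal are assumptions.

```r
# Hypothetical monthly sales (in units of 10,000) for the two stores
with_deal    <- c(10.2, 11.1, 9.8, 10.9, 11.4, 10.6)
without_deal <- c(9.7, 10.1, 9.4, 10.3, 9.9)

n1 <- length(with_deal)
n2 <- length(without_deal)

# Pooled variance of the two samples
sp2 <- ((n1 - 1) * var(with_deal) + (n2 - 1) * var(without_deal)) / (n1 + n2 - 2)

# t statistic computed by hand
t_manual <- (mean(with_deal) - mean(without_deal)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Same test through t.test with equal variances assumed
tt <- t.test(with_deal, without_deal, var.equal = TRUE)

print(t_manual)
print(tt$statistic)
print(tt$p.value)
```

The hand-computed statistic and the one reported by t.test agree, and the test has n1 + n2 - 2 degrees of freedom.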
Two sample comparisons
▪ Null hypothesis H0: the two stores have the same sales with and without the deal sweetener
▪ Test: compute the z statistic $z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$, where $\sigma_1$ and $\sigma_2$ are the population stdevs.
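▪ Comparison: Check whether the value of this statistic lies inside the acceptance region for the chosen confidence level (for example |z| < 1.96 at 95%, two-sided)
▪ Decision: reject H0 if the statistic lies outside that region, otherwise fail to reject it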
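The z statistic for two samples with known population stdevs could be wrapped in a small helper. The function name two_sample_z and the input numbers are hypothetical and chosen only to show the mechanics.

```r
# Hypothetical helper: z statistic for the difference of two sample means
two_sample_z <- function(xbar1, xbar2, sigma1, sigma2, n1, n2) {
  (xbar1 - xbar2) / sqrt(sigma1^2 / n1 + sigma2^2 / n2)
}

# Illustrative inputs (not from the slides)
z <- two_sample_z(xbar1 = 10, xbar2 = 12, sigma1 = 3, sigma2 = 4, n1 = 36, n2 = 64)

# Two-sided p-value and decision at the 5% level
p_value <- 2 * pnorm(-abs(z))
reject  <- abs(z) > qnorm(0.975)

print(c(z = z, p_value = p_value))
print(reject)
```

For these inputs the standard error is sqrt(0.25 + 0.25), so z is about -2.83 and the null hypothesis is rejected at the 5% level.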
Two sample comparisons when sample is large
Values surviving from the slide's data table: 9.84 and 10.80
▪ Test: compute the z statistic $z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$, where $\sigma_1$ and $\sigma_2$ are the population stdevs.
▪ Comparison: Check whether the value of this statistic lies inside the acceptance region for the chosen confidence level
z statistic = -10.78, which lies far outside ±1.96, so H0 is rejected
▪ Null hypothesis H0: p ≤ 0.85 (proportion of orders filled correctly is ≤ 0.85)
$p = \frac{94}{100} = 0.94$
$$Z = \frac{p - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.94 - 0.85}{\sqrt{\frac{0.85(1 - 0.85)}{100}}} = 2.52$$
▪ Compute the p-value as 1 - 0.9941 = 0.0059, which is less than 0.01. Therefore, the null hypothesis can be
rejected with less than 0.01 error.
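The same one-sided proportion test can be reproduced in base R. The sketch uses only the numbers given above (94 correct orders out of 100, p0 = 0.85). The variable names are illustrative.

```r
x  <- 94     # orders filled correctly
n  <- 100    # orders sampled
p0 <- 0.85   # proportion under the null hypothesis

p_hat <- x / n

# z statistic for a single proportion
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Upper-tail p-value, since the alternative is p > 0.85
p_value <- pnorm(z, lower.tail = FALSE)

print(c(z = z, p_value = p_value))
```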
Z test for proportion
▪ A Wall Street Journal poll asked respondents if they trusted energy efficiency ratings on cars and appliances: 552
responded yes and 531 no. At the 0.05 level of significance, is there evidence that the percentage of people who
trust energy efficiency ratings differs from 50%?
▪ Null hypothesis H0: p = 0.5 (proportion of people who trust energy ratings = 0.5)
$p = \frac{552}{1083} = 0.5097$
$$Z = \frac{p - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.5097 - 0.5}{\sqrt{\frac{0.5(1 - 0.5)}{1083}}} = 0.638$$
▪ Compute the two-sided p-value as 2*(1 - 0.7383) = 0.523, which is greater than 0.05. Therefore, the null hypothesis
cannot be rejected at the 0.05 level.
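One way to check the arithmetic for the poll is to compute the statistic by hand and compare it with prop.test from base R (without continuity correction). The counts come from the problem statement. The variable names are illustrative.

```r
yes <- 552
no  <- 531
n   <- yes + no   # 1083 respondents
p0  <- 0.5

p_hat <- yes / n

# z statistic and two-sided p-value computed by hand
z       <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(-abs(z))

# prop.test reports a chi-squared statistic equal to z^2 when correct = FALSE
pt <- prop.test(x = yes, n = n, p = p0, alternative = "two.sided", correct = FALSE)

print(c(p_hat = p_hat, z = z, p_value = p_value))
print(pt$p.value)
```

With these counts z is about 0.64 and the two-sided p-value is about 0.52, so the data give no evidence against a 50% proportion at the 0.05 level.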
F test for ratio of variances
▪ A professor in the accounting dept. of a B-school claims there is more variability in the final exam scores of students
taking accounting as a requirement than as a major. To test this, he surveys 13 non-accounting and 10 accounting
majors.
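$n_1 = 13$, $S_1^2 = 210.2$
$n_2 = 10$, $S_2^2 = 36.5$
$$F = \frac{S_1^2}{S_2^2} = \frac{210.2}{36.5} = 5.76$$
At the 0.05 level, $F_{crit}^{12,9} = 3.07$ and $F^{12,9} > F_{crit}^{12,9}$. Therefore, the null hypothesis is rejected. There is evidence of
greater variability among the non-accounting majors.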
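The F statistic and its critical value can be recomputed with qf and pf. The sketch treats 210.2 and 36.5 as the two sample variances, which is the reading under which their ratio gives the stated F = 5.76. The variable names are illustrative.

```r
n1 <- 13; s1_sq <- 210.2   # non-accounting majors: sample size and variance
n2 <- 10; s2_sq <- 36.5    # accounting majors: sample size and variance

# F statistic: ratio of the sample variances
f_stat <- s1_sq / s2_sq

# Upper-tail critical value at the 0.05 level with (n1 - 1, n2 - 1) degrees of freedom
f_crit <- qf(0.95, df1 = n1 - 1, df2 = n2 - 1)

# Upper-tail p-value
p_value <- pf(f_stat, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)

print(c(f_stat = f_stat, f_crit = f_crit, p_value = p_value))
```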
F test for ratio of variances
▪ Is there a difference in the variation of the yield of different types of investment between banks:
$F_{crit}^{4,5} = 5.19$
$F^{4,4} = 4.5446 < F_{crit}$. Hence the null hypothesis is not rejected.
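The yield data are not reproduced here, so the sketch below only checks the critical value. The slide labels the degrees of freedom as both (4, 5) and (4, 4), so both are computed, on the assumption that one of the two is the intended pair.

```r
f_stat <- 4.5446   # F statistic reported on the slide

# Upper-tail 0.05 critical values for both candidate degrees of freedom
f_crit_4_5 <- qf(0.95, df1 = 4, df2 = 5)
f_crit_4_4 <- qf(0.95, df1 = 4, df2 = 4)

print(c(f_crit_4_5 = f_crit_4_5, f_crit_4_4 = f_crit_4_4))

# The statistic is below the critical value in either case
print(f_stat < f_crit_4_5)
print(f_stat < f_crit_4_4)
```

Under either choice of degrees of freedom the statistic stays below the critical value, so the conclusion does not depend on which label is correct.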