
R BASICS

26-JULY-2019
VECTORS

• Create a vector: vec <- vector("numeric", n)
• Refer to the kth element: vec[k]
• Perform an operation on all elements of a vector: vec <- vec/10
• Append to a vector: vec[n+1] <- 1.2
• Append to a vector: vec <- append(vec, vec2, after = length(vec))
• Append can also be done using: vec <- c(vec, vec2)
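A runnable sketch of the vector operations above (names and values are illustrative):
vec <- vector("numeric", 5)      # five zeros
vec[3]                           # refer to the third element
vec <- vec / 10                  # operate on all elements
vec[6] <- 1.2                    # append by index
vec2 <- c(7, 8)
vec <- append(vec, vec2, after = length(vec))
vec <- c(vec, vec2)              # append by concatenation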
VECTORS
• seq(0, n, step) gives a sequence from 0 to n in increments of step
• rep(0, n) gives a vector of n zeros
• c stands for concatenate: c(vec1, vec2, vec3, ...) combines the vectors into
one vector
VECTORS
• Length of a vector: length(vec)
• Find positions of elements meeting a criterion: which(vec > 0.5 & vec < 0.8)
• Find elements satisfying the criterion: vec[which(vec > 0.5 & vec < 0.8)]
• Subsetting a vector directly: vec[vec > 0.5 & vec < 0.8]
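For example:
vec <- c(0.2, 0.6, 0.75, 0.9)
length(vec)                        # 4
which(vec > 0.5 & vec < 0.8)       # positions 2 and 3
vec[which(vec > 0.5 & vec < 0.8)]  # 0.60 0.75
vec[vec > 0.5 & vec < 0.8]         # same result via logical subsetting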
MATRICES
• Create a matrix from a vector (fill row-wise): mat <- matrix(vec, 3, 7, byrow = TRUE)
• Fill column-wise: mat <- matrix(vec, 3, 7, byrow = FALSE)
• Finding the dimensions of a matrix: dim(mat), nrow(mat), ncol(mat)
• Printing out using a print command:
print("Hello World!")
• Concatenating strings: paste("matrix dimensions are", nrow(mat), "rows")
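A runnable sketch of the matrix commands above (toy values):
vec <- 1:21
mat <- matrix(vec, 3, 7, byrow = TRUE)   # 3 x 7, filled row-wise
dim(mat)                                 # 3 7
print("Hello World!")
paste("matrix dimensions are", nrow(mat), "rows and", ncol(mat), "columns")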
MATRICES
• Printing out the kth column: mat[, k]
• Printing out the kth row: mat[k, ]
• Removing the 4th column of a 3 x 7 matrix:
mat_trim <- mat[, c(1:3, 5:7)]
MATRICES
• # adding a column
• mat <- cbind(mat, rep(5, nrow(mat)))
• mat
• # adding a row
• mat <- rbind(mat, rep(4.12, ncol(mat)))
• mat
• # reshaping a matrix
• mat_resize <- matrix(mat, 7, 3, byrow = FALSE)
is.condition
• is.vector()
• is.numeric()
• is.na()
• is.character(“a”)
• is.logical(FALSE)
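Examples of these type checks:
is.vector(c(1, 2, 3))    # TRUE
is.numeric("a")          # FALSE
is.na(c(1, NA, 3))       # FALSE TRUE FALSE
is.character("a")        # TRUE
is.logical(FALSE)        # TRUE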
data.frame

• Check column names: names(athletes)


• Check the dimensions: dim(athletes)
• Check the number of rows: nrow(athletes)
• Extract the Age column: athletes$Age
• Or: athletes[, "Age"]
• ith row, jth column: athletes[i, j]
data.frame

• Adding a row: rbind (see the sketch after this list)

• Subsetting: subset(athletes, Gender == "F")
• Or: athletes[athletes$Gender == "F", ]
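A minimal sketch of adding a row and subsetting; the athletes frame here is a toy example:
athletes <- data.frame(Name = c("Asha", "Ben"), Age = c(24, 31),
                       Gender = c("F", "M"))
athletes <- rbind(athletes,
                  data.frame(Name = "Cara", Age = 27, Gender = "F"))  # add a row
subset(athletes, Gender == "F")        # rows where Gender is "F"
athletes[athletes$Gender == "F", ]     # equivalent bracket form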
Programming with R
Group manipulation
• Order by the Age column: athletes[order(athletes$Age), ]
• Aggregate one column by grouping along another:
• Average age of female and male athletes:
tapply(athletes$Age, list(AthleteGender = athletes$Gender), mean)
• A bit nicer:
aggregate(Age ~ Gender, athletes, mean)
aggregate(Age ~ Gender + Sport, athletes, mean)
aggregate(cbind(Age, WorldRank) ~ Gender + Sport, athletes, mean)
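A toy athletes data frame (column values are illustrative) on which these group commands run:
athletes <- data.frame(Age = c(24, 31, 27, 22),
                       Gender = c("F", "M", "F", "M"),
                       Sport = c("Tennis", "Tennis", "Golf", "Golf"),
                       WorldRank = c(5, 12, 3, 40))
athletes[order(athletes$Age), ]
tapply(athletes$Age, list(AthleteGender = athletes$Gender), mean)
aggregate(Age ~ Gender + Sport, athletes, mean)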
Group manipulation: comparison
Command – What it does
apply – applies a function over a "margin" of a matrix (a row or a column)
lapply – applies a function to a list and returns a list
sapply – applies a function to a list and returns a simplified (vector) output
tapply – applies a function to groups of a vector defined by factors; can be used to create pivot tables
aggregate – can easily be used to create highly granular pivots
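A quick demonstration of the apply family on toy inputs:
mat <- matrix(1:6, 2, 3)
apply(mat, 1, sum)               # margin 1: row sums
apply(mat, 2, sum)               # margin 2: column sums
lst <- list(a = 1:3, b = 4:6)
lapply(lst, mean)                # returns a list
sapply(lst, mean)                # simplifies to a named vector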


OTHER USEFUL COMMANDS

• merge: students <- merge(students, StudentIDs)
• unique: vec <- c(1, 2, 3, 4, 2, 3, 4, 4, 5, 2, 1, 3); unique(vec)
• duplicated: duplicated(athletes)
• match: match(unique(athletes$Name), athletes$Name)
• na.omit: athletes <- na.omit(athletes)
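A runnable sketch of these commands (the data frame df is illustrative):
vec <- c(1, 2, 3, 4, 2, 3, 4, 4, 5, 2, 1, 3)
unique(vec)                      # 1 2 3 4 5
duplicated(vec)                  # TRUE at repeated positions
match(unique(vec), vec)          # first position of each distinct value
df <- data.frame(x = c(1, NA, 3))
na.omit(df)                      # drops rows containing NA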
PROGRAMMING IN R

• Loops: for (i in 1:10) { print(i) }
• Functions:
factorial <- function(x) {   # note: this masks base R's factorial()
  ans <- prod(1:x)
  return(ans)
}
• Conditionals (at top level, else must follow the closing brace on the same line):
if (length(numbers) == 2) { print(numbers[1]) } else { print("nothing to do") }
DPLYR: SPLIT, GROUP AND SUMMARIZE
• Allows the creation of complex pivots with custom functions – group_by,
summarize
• Provides additional row/column operators and subsetting for data frames –
slice, select and filter
• Is faster than alternatives in many cases
• Interfaces well with databases
• Provides functions which can take parsable text commands as inputs
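A sketch of these verbs, assuming the toy athletes data frame built earlier:
library(dplyr)
athletes %>%
  filter(Age > 20) %>%               # row subsetting
  group_by(Gender, Sport) %>%        # split into groups
  summarize(MeanAge = mean(Age))     # one summary row per group
athletes %>% select(Age, Gender) %>% slice(1:2)   # column and row selection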
PROBABILITY AND STATISTICS IN R
Normal Distribution:
Cumulative: pnorm(z, mean, sd) gives the area under the curve from -∞ to z.
Inverse Cumulative: qnorm(p, mean, sd) gives the z value for a given area p.
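For example, for the standard normal:
pnorm(1.96, mean = 0, sd = 1)    # ~0.975: area from -Inf to z = 1.96
qnorm(0.975, mean = 0, sd = 1)   # ~1.96: z value with cumulative area 0.975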
PROBABILITY AND STATISTICS IN R
Uniform Distribution:
Random number generation: runif(n, min= , max= )

Binomial Distribution:
Cumulative Distribution: pbinom(k, n, p)
Inverse Cumulative Distribution: qbinom(cdfvalue, n, p)
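Minimal examples of these distribution functions:
runif(5, min = 0, max = 10)          # five uniform draws on [0, 10]
pbinom(3, size = 10, prob = 0.5)     # P(X <= 3) for Binomial(10, 0.5)
qbinom(0.5, size = 10, prob = 0.5)   # smallest k with CDF >= 0.5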
PROBABILITY AND STATISTICS IN R
Hypothesis Testing
t-test
t.test(sample1, [sample2], mu=, alternative=, var.equal= , conf.level=)
z.test (BSDA)
z.test(sample1, [sample2], mu= , sigma.x= , [sigma.y=], alternative=,
conf.level= )
F test (the R function is var.test)
var.test(sample1, sample2, ratio=, alternative=, conf.level=)
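A sketch of these calls on simulated data (sample sizes and parameters are illustrative):
x <- rnorm(30, mean = 5, sd = 2)
y <- rnorm(30, mean = 6, sd = 2)
t.test(x, y, mu = 0, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)
var.test(x, y, ratio = 1, alternative = "two.sided", conf.level = 0.95)
# z.test requires the BSDA package:
# library(BSDA); z.test(x, y, mu = 0, sigma.x = 2, sigma.y = 2)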
Case: Cheating in an IQ test
▪ Suppose five students seated near one another in an IQ test had above-average scores of 125, 120, 112, 118 and 105. What is the probability that they might have cheated? Alternatively, what is the probability that such an event happens in the 'normal' course of things?
▪ Given that IQ is normally distributed with mean 100 and standard deviation 16, and under the null hypothesis that the students did not copy and were randomly chosen, what is the probability that they had the above scores?
Step 1: Compute the mean IQ of the 5 students: 116
Step 2: Compute the Z statistic (since the population stdev is given, we can use the normal distribution):
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{116 - 100}{16 / \sqrt{5}} = 2.24$
Step 3: Since Z = 2.24 exceeds the 95% two-sided critical value of 1.96, the null hypothesis can be rejected with error < 5%, and we can say that the sample mean is significantly larger than 100.

▪ The null hypothesis can be rejected with 95+% confidence, i.e., with a < 5% Type I error.
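The same computation in R:
scores <- c(125, 120, 112, 118, 105)
z <- (mean(scores) - 100) / (16 / sqrt(length(scores)))   # 2.24
pnorm(z, lower.tail = FALSE)                              # one-sided p-value ~0.013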
Case: Mercury levels in fish
▪ A 150-pound human can safely consume no more than 7 mcg of mercury in a day, or 50 mcg in a week. Assuming that in a geoethnic community 1 kg of fish is eaten every week on average, the safety level works out to 50 ppm (1 mcg per kg is 1 ppm). A lot of ~1000 fish is poured into a broth every day at a public canteen, and 10 fish are drawn to estimate the mean mercury level of the entire lot. On one day, the 10 sampled fish have the following mercury levels in ppm: (67, 46, 48, 61, 65, 68, 66, 57, 48, 51). What is the probability that this data could result under the null hypothesis that the fish contain at most 50 ppm of mercury on average?

Step 1: Compute the sample mean: 57.7
Step 2: Compute the t statistic (sample stdev s ≈ 8.8):
$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} = \frac{57.7 - 50}{8.8 / \sqrt{10}} = 2.77$
Step 3: The one-sided 95% critical t value for 9 degrees of freedom is 1.833, and 2.77 > 1.833.
Hence the null hypothesis is rejected with 95% confidence.
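The same test in one line of R, using the sampled mercury levels:
hg <- c(67, 46, 48, 61, 65, 68, 66, 57, 48, 51)
t.test(hg, mu = 50, alternative = "greater")   # t ~ 2.77, p ~ 0.011: reject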
Two sample comparisons
▪ Suppose that a chain store is selling an uncommon item: an expensive Persian carpet. They want to
gauge whether a "deal sweetener" such as "buy 1, get any other item at 20% discount" would
help the sales of this carpet. Sales figures are reported in units of Rs 10,000 per month. They could, for example,
compare the sales figures between two stores, one which has the incentive and the other that does not.
Assume that historically the two stores have very similar sales of most items, that the sample
means are $\bar{X}_1$ and $\bar{X}_2$, and that the sample sizes $n_1$ and $n_2$ are small (< 30). Since we
would have limited data for this uncommon item, we do not have the population variance/stdev. We can
use the pooled variance of the two samples and the null hypothesis that the two populations have the
same mean and variance.

▪ Null hypothesis H0: the two stores have the same sales with and without the deal sweetener

▪ Test: compute the pooled-variance t statistic
$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$, where $S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$
▪ Comparison: Check if the value of this statistic lies in the confidence interval

▪ Reject / Accept
Two sample comparisons
▪ Null hypothesis H0: the two stores have the same sales with and without the deal sweetener

▪ Test: compute the pooled-variance t statistic
$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$, where $S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$
▪ Data: 15 months of sales data

Sample mean 1: 4.29        Sample mean 2: 7.11
Sample stdev 1: 1.64       Sample stdev 2: 2.08
Sample variance 1: 2.68    Sample variance 2: 4.34
Pooled variance: 3.51
t statistic: -4.13

▪ |t| = 4.13 > 2.15, the 95% critical value of t with 14 degrees of freedom, so the null hypothesis is rejected at the 95% level.

All sales figures are in units of Rs 10,000 per month.
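A sketch of this pooled-variance test from the summary statistics, assuming 15 months of data per store (consistent with t = -4.13):
n1 <- 15; n2 <- 15                 # assumed sample sizes
m1 <- 4.29; m2 <- 7.11
v1 <- 2.68; v2 <- 4.34
sp2 <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance = 3.51
t   <- (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))         # -4.13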


Two sample comparisons when sample is large
▪ Suppose that a chain store is selling a common item: a bed sheet. They want to gauge whether the use
of a "deal sweetener" such as "buy 1, get any other item at 20% discount" would help the sales of this
bed sheet. Sales figures are in units of Rs 100 per day. Assume that historically the two stores have very
similar sales of most items, that the sample means are $\bar{X}_1$ and $\bar{X}_2$, and that the sample sizes $n_1$ and $n_2$
are large enough that the sample stdevs can be taken as the population stdevs. We
can then use a normal test with the variances of the two samples and the null hypothesis that the two
populations have the same mean.

▪ Null hypothesis H0: the two stores have the same sales with and without the deal sweetener

▪ Test: compute the z statistic
$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$, where $\sigma_1$ and $\sigma_2$ are the population stdevs.

▪ Comparison: Check if the value of this statistic lies in the confidence interval

▪ Reject / Accept
Two sample comparisons when sample is large

Sample size 1: 200          Sample size 2: 200
Sample mean 1: 50.43        Sample mean 2: 61.00
Sample stdev 1: 9.84        Sample stdev 2: 10.80
Sample variance 1: 96.84    Sample variance 2: 116.84

All sales figures are in units of Rs 100 per day.


Two sample comparisons when samples are large
▪ Null hypothesis H0: the two stores have the same sales.

▪ Test: compute the z statistic
$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$, where $\sigma_1$ and $\sigma_2$ are the population stdevs.

▪ Comparison: Check if the value of this statistic lies in the confidence interval

Sample mean 1: 50.43        Sample mean 2: 61.00
Sample stdev 1: 9.84        Sample stdev 2: 10.81
Sample variance 1: 96.84    Sample variance 2: 116.84
z statistic: -10.23
z 95% critical value (two-sided): -1.96

▪ Since |z| = 10.23 far exceeds 1.96, the null hypothesis is rejected.
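A sketch computing this z statistic from the summary statistics in R:
n1 <- 200; n2 <- 200
m1 <- 50.43; m2 <- 61.00
v1 <- 96.84; v2 <- 116.84
z <- (m1 - m2) / sqrt(v1 / n1 + v2 / n2)   # about -10.2
2 * pnorm(z)                               # two-sided p-value, effectively 0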


Paired t-test: Does Amazon sell cheaper?
▪ Is the mean price of books cheaper on Amazon than at the book store?

Textbook Book Store Amazon


Concepts in Federal Taxation 138.21 143.95
Intermediate Accounting 151.92 152.70
The Middle East and Central Asia 52.06 53.00
West's Business Law 159.31 143.95
Leadership: Theory & Practice 49.59 48.95
Making Choices for Multicultural Education 71.74 56.95
Direct Instruction Reading 98.12 97.35
Essentials of Economics 102.12 99.60
Marriage and Family 106.92 100.98
America and its People 100.44 95.20
Oceanography 105.18 128.95
Calculus: Early Transcendental Single Variable 115.00 133.50
Access to Health 93.47 88.60
Women and Globalization 29.54 18.48
Paired t-test: Does Amazon sell cheaper?
▪ Take the set of differences (book store price minus Amazon price); this is a sample. Compute its mean
and standard deviation and apply a t test.

▪ Sample mean: 0.82         Sample stdev: 11.10
▪ t statistic: 0.276        One-sided 95% critical value (13 df): 1.771

▪ Since 0.276 < 1.771, the null hypothesis cannot be rejected.
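A sketch of this paired test in R, using the prices from the table on the previous slide:
store  <- c(138.21, 151.92, 52.06, 159.31, 49.59, 71.74, 98.12,
            102.12, 106.92, 100.44, 105.18, 115.00, 93.47, 29.54)
amazon <- c(143.95, 152.70, 53.00, 143.95, 48.95, 56.95, 97.35,
            99.60, 100.98, 95.20, 128.95, 133.50, 88.60, 18.48)
t.test(store, amazon, paired = TRUE, alternative = "greater")  # t ~ 0.28, p ~ 0.39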


Z test for proportion
▪ A fast food chain has developed a new process to improve the proportion of orders filled correctly.
Given that the previous process was 85% correct and the new process fills 94 out of 100 orders correctly in
one measurement, does the improvement work?

▪ Null hypothesis H0: p0 ≤ 0.85 (the proportion of orders filled correctly is ≤ 0.85)

p = 94/100 = 0.94
$Z = \frac{p - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} = \frac{0.94 - 0.85}{\sqrt{\frac{0.85 (1 - 0.85)}{100}}} = 2.52$

▪ The p-value is 1 - 0.9941 = 0.0059, which is less than 0.01. Therefore, the null hypothesis can be
rejected with less than 0.01 error.
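A minimal sketch of this proportion z test in R, computing the statistic directly:
p0 <- 0.85; n <- 100
p  <- 94 / n
z  <- (p - p0) / sqrt(p0 * (1 - p0) / n)   # 2.52
pnorm(z, lower.tail = FALSE)               # one-sided p-value ~0.0059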
Z test for proportion
▪ A Wall Street Journal poll asked respondents whether they trusted energy-efficiency ratings on cars and appliances; 552
responded yes and 531 no. At the 0.05 level of significance, is there evidence that the percentage of people who
trust energy-efficiency ratings differs from 50%?

▪ Null hypothesis H0: p0 = 0.5 (the proportion of people who trust energy ratings is 0.5)

p = 552/1083 = 0.5097
$Z = \frac{p - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} = \frac{0.5097 - 0.5}{\sqrt{\frac{0.5 (1 - 0.5)}{1083}}} = 0.64$

▪ The two-sided p-value is 2 × (1 - 0.7389) = 0.52, which is greater than 0.05. Therefore, the null
hypothesis cannot be rejected at the 0.05 level.
F test for ratio of variances
▪ A professor in the accounting dept. of a B-school claims there is more variability in the final exam scores of students
taking accounting as a requirement than as a major. To test this, he surveys 13 non-accounting majors and 10 accounting
majors.
$n_1 = 13$, $S_1^2 = 210.2$
$n_2 = 10$, $S_2^2 = 36.5$

$F = \frac{S_1^2}{S_2^2} = \frac{210.2}{36.5} = 5.76$

At the 0.05 level, $F_{crit}(12, 9) = 3.07$ and $F > F_{crit}$. Therefore, the null hypothesis is rejected: there is a difference in
variability.
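With only summary statistics, this F test can be checked in R via the F distribution functions (a sketch):
F <- 210.2 / 36.5                             # ratio of sample variances = 5.76
qf(0.95, df1 = 12, df2 = 9)                   # critical value ~3.07
pf(F, df1 = 12, df2 = 9, lower.tail = FALSE)  # one-sided p-value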
F test for ratio of variances
▪ Is there a difference in the variation of the yield of different types of investment between banks?

Money market accounts: 4.55 4.50 4.40 4.38 4.38
One-year CDs: 4.94 4.90 4.85 4.85 4.85 4.87

$F = \frac{S_1^2}{S_2^2} = 4.54 < F_{crit}(4, 5) = 5.19$. Hence the null hypothesis cannot be rejected.
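With the raw yields, var.test runs the same F test directly (a sketch):
mm <- c(4.55, 4.50, 4.40, 4.38, 4.38)         # money market yields
cd <- c(4.94, 4.90, 4.85, 4.85, 4.85, 4.87)   # one-year CD yields
var.test(mm, cd)   # F ~ 4.54 on (4, 5) df; p > 0.05, cannot reject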
