R Cheat Sheet v3


Basic Data Exploration

nrow
nrow(dataframe)

Output: number of rows in a data frame


Best for: Large data sets; getting a general sense of how much data you're working with
(it tells you the number of observations in the data set)

summary
summary(dataframe)
summary(dataframe$variable)

Output: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum


Best for: Descriptive statistics of interval-ratio variables; category frequencies (including
a count of NAs) in categorical variables
*Note: ALWAYS determine level of measurement through metadata first to
understand output
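
Example (illustrative, using R's built-in mtcars data set rather than a course data set; also shows nrow from above):

nrow(mtcars)                 # number of observations
summary(mtcars$mpg)          # descriptive statistics for an interval-ratio variable
summary(factor(mtcars$cyl))  # category counts when the variable is treated as a factor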

table
table(dataframe$variable)
table(dataframe$variable1, dataframe$variable2,
dnn=c("NameVariable1", "NameVariable2"))

Output: Frequencies
Best for: Showing frequencies of categorical variables only
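
Example (illustrative, using the built-in mtcars data set):

table(mtcars$cyl)   # frequency of each cylinder category
table(mtcars$cyl, mtcars$am, dnn = c("Cylinders", "Transmission"))   # labeled crosstab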

prop.table
prop.table(table(dataframe$variable)) #percentage within dataframe overall
prop.table(table(dataframe$variable1, dataframe$variable2)) #cell percentages
prop.table(table(dataframe$variable1, dataframe$variable2), 1) #row percentages
prop.table(table(dataframe$variable1, dataframe$variable2), 2) #column percent.

Output: Proportions
Best for: Showing proportions for one or more categorical variables in a data frame
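
Example (illustrative, using the built-in mtcars data set):

crosstab <- table(mtcars$cyl, mtcars$am)   # save the crosstab first
prop.table(crosstab)      # cell proportions
prop.table(crosstab, 1)   # row proportions
prop.table(crosstab, 2)   # column proportions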

freqtable
freqtable(dataframe$variable)

Output: Category, frequency, proportion of the whole, and proportion with NAs excluded


Best for: Showing frequency and proportion for categorical variables

aggregate
aggregate(IRvariable ~ CategoricalVariable, dataframe, mean)
aggregate(IRvariable ~ CategoricalVariable, dataframe, median)
aggregate(IRvariable ~ CategoricalVariable, dataframe, sd)

Output: List of categories within the categorical variable with the corresponding descriptive
statistic of the paired interval-ratio variable
Best for: Comparing a descriptive statistic (mean, median, sd) of an interval-ratio variable
across the categories of a categorical variable
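
Example (illustrative, using the built-in mtcars data set):

aggregate(mpg ~ cyl, mtcars, mean)   # mean mpg within each cylinder category
aggregate(mpg ~ cyl, mtcars, sd)     # standard deviation of mpg within each category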

Descriptive Statistics

range
range(dataframe$IRvariable)

Output: Lowest response and highest response

max
max(dataframe$IRvariable)

Output: Highest response (maximum)

var
var(dataframe$IRvariable)

Output: Variance

sd
sd(dataframe$IRvariable)

Output: Standard deviation

IQR
IQR(dataframe$IRvariable)

Output: Interquartile range
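
Example for this whole group of functions (illustrative, using the built-in mtcars data set):

range(mtcars$mpg)   # lowest and highest values
max(mtcars$mpg)     # highest value
var(mtcars$mpg)     # variance
sd(mtcars$mpg)      # standard deviation
IQR(mtcars$mpg)     # interquartile range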

quantile
quantile(dataframe$IRvariable)
quantile(dataframe$IRvariable, seq(0, 1, 0.1))
*0 - starting quantile (between 0 and 1)
*1 - ending quantile (between 0 and 1)
*0.1 - step size between quantiles, in decimal form (0.1 reports every 10th percentile)

Output: Percentiles
Best for: IR variables only; getting a general sense of where the data are concentrated; a great
checkpoint before creating categories from IR variables
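
Example (illustrative, using the built-in mtcars data set):

quantile(mtcars$mpg)                   # quartiles (0%, 25%, 50%, 75%, 100%)
quantile(mtcars$mpg, seq(0, 1, 0.1))   # every 10th percentile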

Charts

hist
hist(dataframe$IRvariable)

Output: Histogram
Best for: Graphic display of IR variable; watch out for outliers
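
Example (illustrative, using the built-in mtcars data set):

hist(mtcars$mpg)   # distribution of an IR variable; check the tails for outliers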

pie
pie(table(dataframe$CategoricalVariable))

Output: Pie chart


Best for: Graphic display of categorical/grouped variable; only use when categories are
mutually exclusive and exhaustive (so there is no double-counting of
observations and so the total adds up to 100%)
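
Example (illustrative, using the built-in mtcars data set):

pie(table(mtcars$cyl))   # one slice per cylinder category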

barplot
barplot(table(dataframe$variable))

Output: Bar chart


Best for: Graphic display of categorical/grouped variables

scatterplot

plot(dataframe$dependentvariable~dataframe$independentvariable)
abline(lm(dataframe$dependentvariable~dataframe$independentvariable))

Output: plot: scatter plot; abline: best fit line


Best for: Graphic display of correlation/relationship between two IR variables. When the best-fit
line is added, it also displays the regression slope.
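
Example (illustrative, using the built-in mtcars data set):

plot(mtcars$mpg ~ mtcars$wt)         # dependent ~ independent
abline(lm(mtcars$mpg ~ mtcars$wt))   # add the best-fit line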

boxplot

boxplot(dataframe$IRvariable~dataframe$CategoricalVariable)

Output: Boxplot of IR variable by categorical variable


Best for: Graphic display of the distribution of an IR variable within each category of a categorical
variable. Shows minimum, first quartile, median, third quartile, and maximum.
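
Example (illustrative, using the built-in mtcars data set):

boxplot(mtcars$mpg ~ mtcars$cyl)   # mpg distribution within each cylinder category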

Note: To label and color any chart, add: main="ChartTitle", xlab="XaxisTitle",
ylab="YaxisTitle", col=c("color1", "color2") before the last parenthesis.
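
Example of a labeled and colored bar chart (illustrative; titles and colors are arbitrary):

barplot(table(mtcars$cyl),
        main = "Cars by Cylinder Count", xlab = "Cylinders",
        ylab = "Frequency", col = c("grey", "steelblue", "tomato"))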

Significance Testing

Single Sample Hypothesis Testing

t.test
t.test(dataframe$IRvariable, mu=X, conf.level=0.95)

Notes: mu is μ, or the mean of the population or comparison group you're comparing your
data against. Confidence level is given as a proportion between 0 and 1 (e.g.
conf.level=0.9 is a 90% confidence level; 0.1 is a 10% confidence level). If you do not
specify a confidence level, it defaults to 0.95 (95%, an alpha of 0.05)

Output: One-sample t-test: test statistic, degrees of freedom, p-value, confidence interval,
and your sample mean
Best for: Performing single-sample hypothesis testing on IR variables
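
Example (illustrative, using the built-in mtcars data set; the comparison mean of 23 is made up):

t.test(mtcars$mpg, mu = 23)                     # defaults to a 95% confidence level
t.test(mtcars$mpg, mu = 23, conf.level = 0.9)   # same test at a 90% confidence level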

prop.test

prop.test(FrequencyofYes, TotalResponses, p=NullProportion)

Notes: All inputs into this function are actual numbers, not variables. The first
('FrequencyofYes') will be the frequency of instances for the thing you're
measuring (e.g. is there a statistically significant difference between the number of
smokers and non-smokers? To measure for smokers, you'd put the total number of
smokers in this spot). The second ('TotalResponses') is the total number of
responses, minus any NAs. The third ('p') is the proportion you are testing against
under the null hypothesis, written as a decimal.

Output: P-value, confidence interval


Best for: Performing single-sample hypothesis testing on categorical variables
(proportions)
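
Example (all numbers made up for illustration): 14 smokers out of 80 valid responses, tested against a null proportion of 0.25:

prop.test(14, 80, p = 0.25)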

Two Sample Hypothesis Testing

t.test
t.test(IRVariable~CategoricalVariable, dataframe)

Notes: The categorical variable can have a maximum of two "levels" (categories) for
this test to work. See Lab 6 for a workaround.

Output: Two sample hypothesis test: test statistic, degrees of freedom, p-value,
confidence interval, and raw means for both categories being tested.
Best for: Performing a two-sample hypothesis test where one variable is categorical and the
other is interval-ratio (the test compares the means of the IR variable)
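
Example (illustrative, using the built-in mtcars data set):

t.test(mpg ~ am, mtcars)   # compares mean mpg between the two transmission groups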

prop.test

prop.test(x=c(IndepVarG1, IndepVarG2), n=c(TotalDepG1, TotalDepG2))

Notes: Like the single-sample prop.test, all inputs will be actual numbers that
you can gather from running a table function with both variables. The first number
will be the frequency of the first group within the independent variable, and the
second number will be the frequency of the second group (e.g. if you're
measuring how many low birth weight babies (independent) have moms who
smoke (dependent), Group 1 will be moms of low birth weight babies who don't smoke, and
Group 2 will be moms of low birth weight babies who do smoke).

The second set of numbers needs to match the first; if we use the example we were already
working with, TotalDepG1 would be the total number of moms who don't smoke,
and TotalDepG2 would be the total number of moms who do smoke.

Output: test statistic, p-value, confidence interval, raw proportions


Best for: Performing a two-sample hypothesis test on two categorical variables
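
Example (all numbers made up for illustration): suppose 29 of 115 non-smoking moms and 30 of 74 smoking moms had low birth weight babies:

prop.test(x = c(29, 30), n = c(115, 74))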

ANOVA

variableXvariable.anova.results <- aov(IRvariable ~ CatVariable, dataframe)
summary(variableXvariable.anova.results)

Output: F-obtained, Pr(>F)


Best for: Performing a hypothesis test where the dependent variable is IR and the
independent is nominal/ordinal with more than 2 categories. (You are
comparing means across multiple categories of the nominal/ordinal variable.)
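
Example (illustrative, using the built-in mtcars data set; cyl is wrapped in factor() so it is treated as categorical):

mpgXcyl.anova.results <- aov(mpg ~ factor(cyl), mtcars)
summary(mpgXcyl.anova.results)   # F-obtained and Pr(>F)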

Post Hoc

TukeyHSD

TukeyHSD(variableXvariable.anova.results)

Notes: Must be run on a saved ANOVA result (the aov object). Uses a 95% confidence level by default.
Output: Shows matched pairs with difference in means, lower and upper bounds
of the confidence interval, and, most importantly, p-adjusted.
Best for: Determining where the significant difference is detected in ANOVA
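
Example (run on the illustrative ANOVA object created above):

TukeyHSD(mpgXcyl.anova.results)   # pairwise differences in mean mpg between cylinder groups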

Bonferroni

pairwise.t.test(dataframe$IRvariable, dataframe$CategoricalVariable,
p.adj = "bonferroni")

Notes: Doesn't rely on having done any other hypothesis testing beforehand.
More conservative than Tukey.
Output: Matrix of adjusted p-values for each pair of categories.
Best for: Post-hoc pairwise comparisons of an IR variable across the categories of a categorical variable
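
Example (illustrative; the first argument is the IR variable, the second is the grouping variable):

pairwise.t.test(mtcars$mpg, mtcars$cyl, p.adj = "bonferroni")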

Chi Square

chisq.test(CrosstabDataframe, correct=FALSE)

Notes: You must first save a crosstab as a new object (I recommend calling it
crosstabxsomething) and run the chi-squared function on that object. Always add
'correct=FALSE' to disable the continuity correction. See Lab 8 for details.

Output: X squared, degrees of freedom, and p-value


Best for: Performing hypothesis testing on two nominal or ordinal variables
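
Example (illustrative, using two categorical variables from the built-in mtcars data set):

crosstabxexample <- table(mtcars$am, mtcars$vs)   # save the crosstab first
chisq.test(crosstabxexample, correct = FALSE)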

Measures of Association

cor
cor(dataframe$dependent, dataframe$independent,
method="pearson", use="complete.obs")

Output: Pearson's r: a number between -1 and 1; the sign gives the direction of the
association and values farther from 0 indicate a stronger association
Best for: Testing the strength of an association between two IR variables
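
Example (illustrative, using the built-in mtcars data set):

cor(mtcars$mpg, mtcars$wt, method = "pearson", use = "complete.obs")   # strong negative correlation between weight and mpg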

cor.test

cor.test(dataframe$dependent, dataframe$independent,
method="pearson", use="complete.obs")

Output: Significance and Pearson’s R


Best for: Testing significance of association between two IR variables
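
Example (illustrative, using the built-in mtcars data set):

cor.test(mtcars$mpg, mtcars$wt, method = "pearson")   # Pearson's r plus its p-value and confidence interval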

Simple Regression + Multivariate Regression

lm
lm(dataframe$variable1 ~ dataframe$variable2)
test_regression <- lm(dataframe$depvariable1 ~ dataframe$indepvariable2)
summary(test_regression)

Notes: You can add additional variables to run regressions on by adding
+ dataframe$variable3 before the last parenthesis in the lm line.

Output: Coefficients with their standard errors, test statistics, and p-values (Pr(>|t|)). Also
includes degrees of freedom, R-squared, and adjusted R-squared.
Best for: Performing bivariate and multivariate regression
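
Example (illustrative, using the built-in mtcars data set):

simple_regression <- lm(mtcars$mpg ~ mtcars$wt)               # bivariate regression
multi_regression <- lm(mtcars$mpg ~ mtcars$wt + mtcars$hp)    # multivariate: adds a second predictor
summary(multi_regression)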

Lab Topics for Reference:
- Lab 1: Very basic descriptives
- Lab 2: Frequency tables, crosstabs, more descriptive statistics, basic graphing
- Lab 3: Categorizing variables
- Lab 4: Advanced visualization from Nicole Mader
- Lab 5a: Subsetting (Can use for dummy variables)
- Lab 5b: Single-Sample Hypothesis Testing (t.test and prop.test)
- Lab 6: Two-Sample Hypothesis Testing (t.test and prop.test)
- Lab 7: ANOVA and post-hoc tests
- Lab 8: Chi Squared
- Lab 9: Measures of Association
- Lab 10: Multivariate Regression and Dummy Variables

*bolded are longer-style references that weren’t mentioned in this cheat sheet
