R Cheat Sheet

Basic Data Exploration


Output: number of rows in a data frame

Best for: Large data sets; gives a general sense of how much data you’re working with
by telling you the number of observations in the data set
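
The command for this entry is not shown above. A minimal sketch, assuming it is base R's nrow() and using the built-in mtcars data set purely for illustration:

```r
# Assumption: nrow() is the command this entry describes.
# mtcars is a built-in example data set, not part of the original sheet.
n_obs <- nrow(mtcars)   # number of rows (observations)
n_obs
dim(mtcars)             # rows and columns together
```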


Output: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum

Best for: Descriptive statistics of interval-ratio variables; category frequencies in
categorical variables with NAs
*Note: ALWAYS determine level of measurement through metadata first to
understand output
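
The command for this entry is not shown above. The listed output matches base R's summary(), so a sketch under that assumption (built-in mtcars data used for illustration) might look like:

```r
# Assumption: summary() is the command this entry describes.
summary(mtcars$mpg)                # IR variable: min, quartiles, median, mean, max

cyl_factor <- factor(mtcars$cyl)   # categorical variable stored as a factor
summary(cyl_factor)                # category frequencies; NAs, if any, get their own count
```

The same call gives different output depending on how the variable is stored, which is why checking the level of measurement first matters.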

table(dataframe$variable1, dataframe$variable2,
dnn=c("NameVariable1", "NameVariable2"))

Output: Frequencies
Best for: Showing frequencies of categorical variables only

prop.table(table(dataframe$variable)) #proportion of each category overall
prop.table(table(dataframe$variable1, dataframe$variable2)) #cell proportions
prop.table(table(dataframe$variable1, dataframe$variable2), 1) #row proportions
prop.table(table(dataframe$variable1, dataframe$variable2), 2) #column proportions

Output: Proportions
Best for: Showing proportions for one or more categorical variables in data frame
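
A short worked example of table() and prop.table() together, using the built-in mtcars data set as a stand-in (the variable names and labels are illustrative only):

```r
# Crosstab of cylinders by transmission type in the built-in mtcars data
tab <- table(mtcars$cyl, mtcars$am, dnn = c("Cylinders", "Transmission"))
tab

cell_prop <- prop.table(tab)      # each cell divided by the grand total
row_prop  <- prop.table(tab, 1)   # each row sums to 1
col_prop  <- prop.table(tab, 2)   # each column sums to 1

round(100 * row_prop, 1)          # multiply by 100 to read as percentages
```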


Output: Category, frequency, proportion of the whole, proportion minus NAs

Best for: Showing frequency and proportion for categorical variables

aggregate(IRvariable ~ CategoricalVariable, dataframe, mean)

aggregate(IRvariable ~ CategoricalVariable, dataframe, median)
aggregate(IRvariable ~ CategoricalVariable, dataframe, sd)

Output: List of categories within categorical variable with corresponding descriptive
statistic about the paired interval-ratio variable
Best for: Comparing a descriptive statistic (mean, median, sd) across categories; one
categorical, one interval-ratio variable
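
As an illustration of the aggregate() pattern, here is a sketch with the built-in mtcars data, treating mpg as the interval-ratio variable and cyl as the categorical one (both choices are assumptions made for the example):

```r
# Mean and standard deviation of mpg within each cylinder category
mean_by_cyl <- aggregate(mpg ~ cyl, mtcars, mean)
sd_by_cyl   <- aggregate(mpg ~ cyl, mtcars, sd)

mean_by_cyl   # one row per category, one column with the statistic
sd_by_cyl
```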

Descriptive Statistics


Output: Lowest response and highest response


Output: Highest response (maximum)


Output: Variance


Output: Standard deviation


Output: Interquartile range
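
The commands for the five entries above are not shown. The base R functions below produce the listed outputs; treating them as the intended commands is an assumption, and mtcars is used only as example data:

```r
x <- mtcars$mpg   # any interval-ratio variable, e.g. dataframe$IRvariable

range(x)   # lowest and highest response
max(x)     # highest response (min(x) gives the lowest)
var(x)     # variance
sd(x)      # standard deviation
IQR(x)     # interquartile range

# If the variable contains NAs, add na.rm = TRUE, e.g. sd(x, na.rm = TRUE)
```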

quantile(dataframe$IRvariable, seq(0, 1, 0.1))
*0 - starting point of the sequence (lowest percentile, as a decimal between 0 and 1)
*1 - ending point of the sequence (highest percentile, as a decimal between 0 and 1)
*0.1 - step size (in decimal) between percentiles; 0.1 gives every 10th percentile

Output: Percentiles
Best for: IR variables only; getting general sense of where data is concentrated; great
checkpoint before creating categories with IR variables
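
A small sketch of how the seq() arguments change the result, again using mtcars$mpg as a stand-in variable:

```r
# Deciles: 0%, 10%, ..., 100%
deciles <- quantile(mtcars$mpg, seq(0, 1, 0.1))
deciles

# Quartiles: a step of 0.25 instead of 0.1
quartiles <- quantile(mtcars$mpg, seq(0, 1, 0.25))
quartiles
```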

Output: Histogram
Best for: Graphic display of IR variable; watch out for outliers
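
The command for this entry is not shown. Assuming it is base R's hist(), a minimal sketch (example data and labels are illustrative):

```r
# Assumption: hist() is the command this entry describes.
h <- hist(mtcars$mpg,
          main = "Miles per Gallon",
          xlab = "MPG",
          col  = "lightblue")
h$counts   # number of observations in each bin
```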


Output: Pie chart

Best for: Graphic display of categorical/grouped variable; only use when categories are
mutually exclusive and exhaustive (so there is not double-counting of
observations and so the total adds up to 100%)


Output: Bar chart

Best for: Graphic display of categorical/grouped variables
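
The pie chart and bar chart commands are not shown. Base R's pie() and barplot() both take a frequency table, so a sketch under that assumption could be:

```r
# Assumption: pie() and barplot() are the commands these entries describe.
counts <- table(mtcars$cyl)   # frequencies of a categorical variable

pie(counts,
    main = "Cars by Cylinders",
    col  = c("tomato", "gold", "steelblue"))

bar_mid <- barplot(counts,
                   main = "Cars by Cylinders",
                   xlab = "Cylinders",
                   ylab = "Frequency",
                   col  = "steelblue")
```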



Output: plot: scatter plot; abline: best fit line

Best for: Graphic display of correlation/relationship between two IR variables. When the
line is added, it also displays the regression slope.
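
The plot and abline commands are not shown. One common way to get a scatter plot with a best fit line in base R is sketched below; the variables are placeholders taken from mtcars:

```r
# Scatter plot of two IR variables (x first, then y)
plot(mtcars$wt, mtcars$mpg,
     main = "MPG by Weight",
     xlab = "Weight",
     ylab = "MPG")

# Fit a simple regression and draw its line over the points
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")

coef(fit)   # intercept and slope of the line just drawn
```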



Output: boxplot of IR variable by categorical variable

Best for: Graphic display of distribution of IR variable within each category of the
categorical variable. Shows minimum, first quartile, median, third quartile, and maximum
(outliers are drawn as separate points).
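
The boxplot command is not shown. Assuming base R's boxplot() with a formula, a minimal sketch:

```r
# Assumption: boxplot(IRvariable ~ CategoricalVariable, dataframe) is the intended form.
b <- boxplot(mpg ~ cyl, data = mtcars,
             main = "MPG by Cylinders",
             xlab = "Cylinders",
             ylab = "MPG")

b$stats   # one column per category: lower whisker, Q1, median, Q3, upper whisker
```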

Note: to label and color any chart, add: main="ChartTitle", xlab="XaxisTitle",
ylab="YaxisTitle", col=c("color1", "color2") before the last parenthesis.

Significance Testing

Single Sample Hypothesis Testing

t.test(dataframe$IRvariable, mu=X, conf.level=0.95)

Notes: mu is μ, the mean of the population or comparison group you’re comparing
your data against. The confidence level is entered as a decimal (e.g. conf.level=0.9
is a 90% confidence level; 0.1 is a 10% confidence level). If you do not specify a
confidence level, it defaults to 0.95 (or 95%, an alpha of 0.05)

Output: One-sample t-test: test statistic, degrees of freedom, p-value, confidence interval,
and your sample mean
Best for: Performing single-sample hypothesis testing on IR variables
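
A worked sketch of the one-sample t-test, using mtcars$mpg as the sample and a comparison mean of 20 (both chosen only for illustration):

```r
# Is the mean mpg different from a hypothetical population mean of 20?
result <- t.test(mtcars$mpg, mu = 20, conf.level = 0.95)
result

# Individual pieces of the output
result$statistic   # t statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # confidence interval for the mean
result$estimate    # sample mean
```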


prop.test(FrequencyofYes, TotalResponses, p=NullProportion)

Notes: All inputs into this function are actual numbers, not variables. The first
(‘FrequencyofYes’) will be the frequency of instances for the thing you’re
measuring (e.g. is there a statistically significant difference between the number of
smokers and non smokers? To measure for smokers, you’d put the total number of
smokers in this spot). The second (‘TotalResponses’) is the total number of
responses, minus any NAs. The third (p) is the proportion you are testing against
under the null hypothesis, as a decimal between 0 and 1; it is not the alpha level.
If you leave it out, it defaults to 0.5.

Output: P-value, confidence interval

Best for: Performing single-sample hypothesis testing on categorical variables
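
A sketch of the single-sample proportion test with made-up counts (45 smokers out of 120 valid responses, tested against a null proportion of 0.5; all three numbers are hypothetical):

```r
smokers <- 45    # hypothetical frequency of "yes"
total   <- 120   # hypothetical total responses, NAs already excluded

result <- prop.test(smokers, total, p = 0.5)
result

result$estimate   # sample proportion
result$p.value    # p-value
result$conf.int   # confidence interval for the proportion
```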

Two Sample Hypothesis Testing

t.test(IRVariable~CategoricalVariable, dataframe)

Notes: Categorical variable can only have a maximum of two “levels” (categories) for
this test to work. See Lab 6 for a workaround.

Output: Two sample hypothesis test: test statistic, degrees of freedom, p-value,
confidence interval, and raw means for both categories being tested.
Best for: Performing two-sample hypothesis test where one variable is categorical and the
other is interval-ratio (the test is comparing the means of the IR variable)
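
A minimal example of the two-sample form, using mtcars with mpg as the IR variable and am (two categories) as the grouping variable; these stand in for your own variables:

```r
# Compare mean mpg between the two transmission groups
result <- t.test(mpg ~ am, data = mtcars)
result

result$estimate   # mean of the IR variable in each group
result$p.value    # p-value for the difference in means
```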


prop.test(x=c(IndepVarG1, IndepVarG2), n=c(TotalDepG1, TotalDepG2))

Notes: Like the single-sample prop.test, all inputs will be actual numbers that
you can gather from running a table function with both variables. The first number
will be the frequency of the first group within the independent variable, the
second number will be the frequency of the second group (e.g., if you’re
measuring how many low birth weight babies (independent) have moms who
smoke (dependent), Group 1 will be moms of low birth weight babies who don’t
smoke, and Group 2 will be moms of low birth weight babies who do smoke).

The second section needs to match the first; if we use the example we were already
working with, TotalDepG1 would be the total number of moms who don’t smoke,
and TotalDepG2 would be the total number of moms who do smoke.

Output: test statistic, p-value, confidence interval, raw proportions

Best for: Performing a two sample hypothesis test on two categorical variables
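
A sketch of the two-sample proportion test following the birth weight example. The counts below are invented for illustration; in practice you would read them off a table() of the two variables:

```r
# Hypothetical counts of low birth weight babies in each group
low_bw <- c(29, 30)    # group 1: moms who don't smoke, group 2: moms who smoke
totals <- c(115, 74)   # total moms in each group, same order as above

result <- prop.test(x = low_bw, n = totals)
result

result$estimate   # proportion in each group
result$p.value    # p-value for the difference in proportions
```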


variableXvariable.anova.results <- aov(IRvariable~CatVariable, dataset)


Output: F-obtained, Pr(>F)

Best for: Performing a two sample hypothesis test where dependent variable is IR and
independent is nominal/ordinal with more than 2 categories. (You are
comparing means across multiple categories of nominal/ordinal variable)
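
The aov() line only stores the model; F and Pr(>F) appear once the stored object is summarized. A sketch assuming summary() is the follow-up step, with mtcars as example data:

```r
# cyl is numeric in mtcars, so wrap it in factor() to treat it as categorical
mpg_by_cyl.anova.results <- aov(mpg ~ factor(cyl), data = mtcars)

anova_table <- summary(mpg_by_cyl.anova.results)
anova_table   # Df, Sum Sq, Mean Sq, F value, Pr(>F)
```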

Post Hoc



Notes: Must be run on the results of a previous ANOVA. Uses a 95% confidence level by default
Output: Shows matched pairs with difference in means, lower and upper bounds
of the confidence interval, and, most importantly, p-adjusted.
Best for: Determining where the significant difference is detected in ANOVA
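
The command for this entry is not shown. The notes and output match base R's TukeyHSD(), so the sketch below assumes that function, applied to a stored aov() result:

```r
# Assumption: TukeyHSD() is the command this entry describes.
anova_fit <- aov(mpg ~ factor(cyl), data = mtcars)

tukey <- TukeyHSD(anova_fit)   # conf.level defaults to 0.95
tukey                          # columns: diff, lwr, upr, p adj
```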


pairwise.t.test(dataframe$variable1, dataframe$variable2,
p.adj = "bonferroni")

Notes: Doesn’t rely on having done any other hypothesis testing beforehand.
More conservative than Tukey. The first variable is the IR variable; the second
is the categorical (grouping) variable.
Output: Matrix of relationships between both variables showing adjusted p-value.
Best for: Comparing the mean of an IR variable between each pair of categories
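
A minimal example with mtcars, where mpg is the IR variable and cyl is the grouping variable (illustrative choices):

```r
pw <- pairwise.t.test(mtcars$mpg, mtcars$cyl,
                      p.adjust.method = "bonferroni")
pw

pw$p.value   # lower-triangle matrix of adjusted p-values, one per pair of groups
```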

Chi Square

chisq.test(CrosstabDataframe, correct=FALSE)

Notes: You must make a new dataset (I recommend calling it crosstabxsomething) in
order to run the Chi Squared function on it. Always add ‘correct=FALSE’ to
disable continuity correction. See Lab 8 for details.

Output: X-squared (the chi-squared statistic), degrees of freedom, and p-value

Best for: Performing hypothesis testing on two nominal or ordinal variables
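
One way to build the crosstab object is with table(); that choice is an assumption here, since the lab's own method is not shown. Using two categorical variables from mtcars as placeholders:

```r
# Store the crosstab first, then pass it to chisq.test()
crosstab_vs_am <- table(mtcars$vs, mtcars$am)
crosstab_vs_am

result <- chisq.test(crosstab_vs_am, correct = FALSE)
result

result$expected   # expected counts under independence
```

Checking result$expected is a quick way to see whether any cell's expected count is small enough to make the chi-squared approximation questionable.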

Measures of Association

cor(dataframe$dependent, dataframe$independent,
method="pearson", use="complete.obs")

Output: Correlation coefficient (r) between -1 and 1; the sign shows the direction, and
values farther from 0 indicate a stronger association

Best for: Testing the strength of an association between two IR variables


cor.test(dataframe$dependent, dataframe$independent,
method="pearson")

Notes: cor.test drops incomplete pairs on its own, so use="complete.obs" is not needed.

Output: Significance (p-value), confidence interval, and Pearson’s r

Best for: Testing significance of association between two IR variables
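
A short sketch showing cor() and cor.test() side by side on two IR variables from mtcars (placeholder variables):

```r
# Strength and direction of the association
r <- cor(mtcars$mpg, mtcars$wt, method = "pearson", use = "complete.obs")
r

# Same coefficient, plus a significance test and confidence interval
ct <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
ct$estimate   # Pearson's r
ct$p.value    # p-value
ct$conf.int   # confidence interval for r
```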

Simple Regression + Multivariate Regression

lm(dataframe$variable1 ~ dataframe$variable2)
test_regression <- lm(dataframe$depvariable1 ~ dataframe$indepvariable2)

Notes: you can add additional variables to run regressions on by adding
+ dataframe$variable3 before the last parenthesis in the lm line.

Output: Coefficients including standard error, test statistic, and Pr value. Also
includes degrees of freedom, R-squared, adjusted R-squared.
Best for: Performing bivariate and multivariate regression
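
Printing an lm object shows only the coefficients; the standard errors, Pr values, and R-squared listed above come from summarizing it. A sketch assuming summary() is the intended follow-up, written with the data= argument as an alternative to the dataframe$ style:

```r
# Multivariate regression: mpg explained by weight and horsepower (example variables)
test_regression <- lm(mpg ~ wt + hp, data = mtcars)

reg_summary <- summary(test_regression)
reg_summary

reg_summary$coefficients    # estimate, std. error, t value, Pr(>|t|)
reg_summary$r.squared       # R-squared
reg_summary$adj.r.squared   # adjusted R-squared
```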

Lab Topics for Reference:
- Lab 1: Very basic descriptives
- Lab 2: Frequency tables, crosstabs, more descriptive statistics, basic graphing
- Lab 3: Categorizing variables
- Lab 4: Advanced visualization from Nicole Mader
- Lab 5a: Subsetting (Can use for dummy variables)
- Lab 5b: Single-Sample Hypothesis Testing (t.test and prop.test)
- Lab 6: Two-Sample Hypothesis Testing (t.test and prop.test)
- Lab 7: ANOVA and post-hoc tests
- Lab 8: Chi Squared
- Lab 9: Measures of Association
- Lab 10: Multivariate Regression and Dummy Variables

*bolded are longer-style references that weren’t mentioned in this cheat sheet

