R Cheat Sheet v3


Basic Data Exploration

nrow
nrow(dataframe)

Output: number of rows in a data frame


Best for: Large data sets; getting a general sense of how much data you're working with
(it tells you the number of observations in the data set)

summary
summary(dataframe)
summary(dataframe$variable)

Output: Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum


Best for: Descriptive statistics of interval-ratio variables; category frequencies (including
a count of NAs) in categorical variables
*Note: ALWAYS determine level of measurement through metadata first to
understand output
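
Example (illustrative, using R's built-in mtcars data set rather than a course data set; also shows nrow from above):

nrow(mtcars)                 # number of observations
summary(mtcars$mpg)          # descriptive statistics for an interval-ratio variable
summary(factor(mtcars$cyl))  # category counts when the variable is treated as a factor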

table
table(dataframe$variable)
table(dataframe$variable1, dataframe$variable2,
dnn=c("NameVariable1", "NameVariable2"))

Output: Frequencies
Best for: Showing frequencies of categorical variables only
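
Example (illustrative, using the built-in mtcars data set):

table(mtcars$cyl)   # frequency of each cylinder category
table(mtcars$cyl, mtcars$am, dnn = c("Cylinders", "Transmission"))   # labeled crosstab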

prop.table
prop.table(table(dataframe$variable)) #percentage within dataframe overall
prop.table(table(dataframe$variable1, dataframe$variable2)) #cell percentages
prop.table(table(dataframe$variable1, dataframe$variable2), 1) #row percentages
prop.table(table(dataframe$variable1, dataframe$variable2), 2) #column percent.

Output: Proportions
Best for: Showing proportions for one or more categorical variables in a data frame
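
Example (illustrative, using the built-in mtcars data set):

crosstab <- table(mtcars$cyl, mtcars$am)   # save the crosstab first
prop.table(crosstab)      # cell proportions
prop.table(crosstab, 1)   # row proportions
prop.table(crosstab, 2)   # column proportions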

freqtable
freqtable(dataframe$variable)

Output: Category, frequency, proportion of the whole, and proportion with NAs excluded


Best for: Showing frequency and proportion for categorical variables

aggregate
aggregate(IRvariable ~ CategoricalVariable, dataframe, mean)
aggregate(IRvariable ~ CategoricalVariable, dataframe, median)
aggregate(IRvariable ~ CategoricalVariable, dataframe, sd)

Output: List of categories within the categorical variable with the corresponding descriptive
statistic of the paired interval-ratio variable
Best for: Comparing a descriptive statistic (mean, median, sd) of an interval-ratio variable
across the categories of a categorical variable
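
Example (illustrative, using the built-in mtcars data set):

aggregate(mpg ~ cyl, mtcars, mean)   # mean mpg within each cylinder category
aggregate(mpg ~ cyl, mtcars, sd)     # standard deviation of mpg within each category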

Descriptive Statistics

range
range(dataframe$IRvariable)

Output: Lowest response and highest response

max
max(dataframe$IRvariable)

Output: Highest response (maximum)

var
var(dataframe$IRvariable)

Output: Variance

sd
sd(dataframe$IRvariable)

Output: Standard deviation

IQR
IQR(dataframe$IRvariable)

Output: Interquartile range
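
Example for this whole group of functions (illustrative, using the built-in mtcars data set):

range(mtcars$mpg)   # lowest and highest values
max(mtcars$mpg)     # highest value
var(mtcars$mpg)     # variance
sd(mtcars$mpg)      # standard deviation
IQR(mtcars$mpg)     # interquartile range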

quantile
quantile(dataframe$IRvariable)
quantile(dataframe$IRvariable, seq(0, 1, 0.1))
*0 - starting quantile (between 0 and 1)
*1 - ending quantile (between 0 and 1)
*0.1 - step size between quantiles, in decimal form (0.1 reports every 10th percentile)

Output: Percentiles
Best for: IR variables only; getting a general sense of where the data are concentrated; a great
checkpoint before creating categories from IR variables
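
Example (illustrative, using the built-in mtcars data set):

quantile(mtcars$mpg)                   # quartiles (0%, 25%, 50%, 75%, 100%)
quantile(mtcars$mpg, seq(0, 1, 0.1))   # every 10th percentile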

Charts

hist
hist(dataframe$IRvariable)

Output: Histogram
Best for: Graphic display of IR variable; watch out for outliers
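
Example (illustrative, using the built-in mtcars data set):

hist(mtcars$mpg)   # distribution of an IR variable; check the tails for outliers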

pie
pie(table(dataframe$CategoricalVariable))

Output: Pie chart


Best for: Graphic display of categorical/grouped variable; only use when categories are
mutually exclusive and exhaustive (so there is no double-counting of
observations and so the total adds up to 100%)
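
Example (illustrative, using the built-in mtcars data set):

pie(table(mtcars$cyl))   # one slice per cylinder category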

barplot
barplot(table(dataframe$variable))

Output: Bar chart


Best for: Graphic display of categorical/grouped variables

scatterplot

plot(dataframe$dependentvariable~dataframe$independentvariable)
abline(lm(dataframe$dependentvariable~dataframe$independentvariable))

Output: plot: scatter plot; abline: best fit line


Best for: Graphic display of correlation/relationship between two IR variables. When the best-fit
line is added, it also displays the regression slope.
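
Example (illustrative, using the built-in mtcars data set):

plot(mtcars$mpg ~ mtcars$wt)         # dependent ~ independent
abline(lm(mtcars$mpg ~ mtcars$wt))   # add the best-fit line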

boxplot

boxplot(dataframe$IRvariable~dataframe$CategoricalVariable)

Output: Boxplot of IR variable by categorical variable


Best for: Graphic display of the distribution of an IR variable within each category of a categorical
variable. Shows minimum, first quartile, median, third quartile, and maximum.
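
Example (illustrative, using the built-in mtcars data set):

boxplot(mtcars$mpg ~ mtcars$cyl)   # mpg distribution within each cylinder category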

Note: To label and color any chart, add: main="ChartTitle", xlab="XaxisTitle",
ylab="YaxisTitle", col=c("color1", "color2") before the last parenthesis.
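
Example of a labeled and colored bar chart (illustrative; titles and colors are arbitrary):

barplot(table(mtcars$cyl),
        main = "Cars by Cylinder Count", xlab = "Cylinders",
        ylab = "Frequency", col = c("grey", "steelblue", "tomato"))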

Significance Testing

Single Sample Hypothesis Testing

t.test
t.test(dataframe$IRvariable, mu=X, conf.level=0.95)

Notes: mu is μ, or the mean of the population or comparison group you're comparing your
data against. Confidence level is given as a proportion between 0 and 1 (e.g.
conf.level=0.9 is a 90% confidence level; 0.1 is a 10% confidence level). If you do not
specify a confidence level, it defaults to 0.95 (95%, an alpha of 0.05)

Output: One-sample t-test: test statistic, degrees of freedom, p-value, confidence interval,
and your sample mean
Best for: Performing single-sample hypothesis testing on IR variables
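
Example (illustrative, using the built-in mtcars data set; the comparison mean of 23 is made up):

t.test(mtcars$mpg, mu = 23)                     # defaults to a 95% confidence level
t.test(mtcars$mpg, mu = 23, conf.level = 0.9)   # same test at a 90% confidence level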

prop.test

prop.test(FrequencyofYes, TotalResponses, p=NullProportion)

Notes: All inputs into this function are actual numbers, not variables. The first
('FrequencyofYes') will be the frequency of instances for the thing you're
measuring (e.g. is there a statistically significant difference between the number of
smokers and non-smokers? To measure for smokers, you'd put the total number of
smokers in this spot). The second ('TotalResponses') is the total number of
responses, minus any NAs. The third ('p') is the proportion you are testing against
under the null hypothesis, written as a decimal.

Output: P-value, confidence interval


Best for: Performing single-sample hypothesis testing on categorical variables
(proportions)
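
Example (all numbers made up for illustration): 14 smokers out of 80 valid responses, tested against a null proportion of 0.25:

prop.test(14, 80, p = 0.25)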

Two Sample Hypothesis Testing

t.test
t.test(IRVariable~CategoricalVariable, dataframe)

Notes: The categorical variable can have a maximum of two "levels" (categories) for
this test to work. See Lab 6 for a workaround.

Output: Two sample hypothesis test: test statistic, degrees of freedom, p-value,
confidence interval, and raw means for both categories being tested.
Best for: Performing a two-sample hypothesis test where one variable is categorical and the
other is interval-ratio (the test compares the means of the IR variable)
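
Example (illustrative, using the built-in mtcars data set):

t.test(mpg ~ am, mtcars)   # compares mean mpg between the two transmission groups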

prop.test

prop.test(x=c(IndepVarG1, IndepVarG2), n=c(TotalDepG1, TotalDepG2))

Notes: Like the single-sample prop.test, all inputs will be actual numbers that
you can gather from running a table function with both variables. The first number
will be the frequency of the first group within the independent variable, and the
second number will be the frequency of the second group (e.g. if you're
measuring how many low birth weight babies (independent) have moms who
smoke (dependent), Group 1 will be moms of low birth weight babies who don't smoke, and
Group 2 will be moms of low birth weight babies who do smoke).

The second set of numbers needs to match the first; if we use the example we were already
working with, TotalDepG1 would be the total number of moms who don't smoke,
and TotalDepG2 would be the total number of moms who do smoke.

Output: test statistic, p-value, confidence interval, raw proportions


Best for: Performing a two-sample hypothesis test on two categorical variables
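
Example (all numbers made up for illustration): suppose 29 of 115 non-smoking moms and 30 of 74 smoking moms had low birth weight babies:

prop.test(x = c(29, 30), n = c(115, 74))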

ANOVA

variableXvariable.anova.results <- aov(IRvariable ~ CatVariable, dataframe)
summary(variableXvariable.anova.results)

Output: F-obtained, Pr(>F)


Best for: Performing a hypothesis test where the dependent variable is IR and the
independent is nominal/ordinal with more than 2 categories. (You are
comparing means across multiple categories of the nominal/ordinal variable.)
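
Example (illustrative, using the built-in mtcars data set; cyl is wrapped in factor() so it is treated as categorical):

mpgXcyl.anova.results <- aov(mpg ~ factor(cyl), mtcars)
summary(mpgXcyl.anova.results)   # F-obtained and Pr(>F)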

Post Hoc

TukeyHSD

TukeyHSD(variableXvariable.anova.results)

Notes: Must be run on a saved ANOVA result (the aov object). Uses a 95% confidence level by default.
Output: Shows matched pairs with difference in means, lower and upper bounds
of the confidence interval, and, most importantly, p-adjusted.
Best for: Determining where the significant difference is detected in ANOVA
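
Example (run on the illustrative ANOVA object created above):

TukeyHSD(mpgXcyl.anova.results)   # pairwise differences in mean mpg between cylinder groups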

Bonferroni

pairwise.t.test(dataframe$IRvariable, dataframe$CategoricalVariable,
p.adj = "bonferroni")

Notes: Doesn't rely on having done any other hypothesis testing beforehand.
More conservative than Tukey.
Output: Matrix of adjusted p-values for each pair of categories.
Best for: Post-hoc pairwise comparisons of an IR variable across the categories of a categorical variable
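
Example (illustrative; the first argument is the IR variable, the second is the grouping variable):

pairwise.t.test(mtcars$mpg, mtcars$cyl, p.adj = "bonferroni")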

Chi Square

chisq.test(CrosstabDataframe, correct=FALSE)

Notes: You must first save a crosstab as a new object (I recommend calling it
crosstabxsomething) and run the chi-squared function on that object. Always add
'correct=FALSE' to disable the continuity correction. See Lab 8 for details.

Output: X squared, degrees of freedom, and p-value


Best for: Performing hypothesis testing on two nominal or ordinal variables
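
Example (illustrative, using two categorical variables from the built-in mtcars data set):

crosstabxexample <- table(mtcars$am, mtcars$vs)   # save the crosstab first
chisq.test(crosstabxexample, correct = FALSE)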

Measures of Association

cor
cor(dataframe$dependent, dataframe$independent,
method="pearson", use="complete.obs")

Output: Pearson's r: a number between -1 and 1; the sign gives the direction of the
association and values farther from 0 indicate a stronger association
Best for: Testing the strength of an association between two IR variables
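
Example (illustrative, using the built-in mtcars data set):

cor(mtcars$mpg, mtcars$wt, method = "pearson", use = "complete.obs")   # strong negative correlation between weight and mpg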

cor.test

cor.test(dataframe$dependent, dataframe$independent,
method="pearson", use="complete.obs")

Output: Significance and Pearson’s R


Best for: Testing significance of association between two IR variables
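
Example (illustrative, using the built-in mtcars data set):

cor.test(mtcars$mpg, mtcars$wt, method = "pearson")   # Pearson's r plus its p-value and confidence interval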

Simple Regression + Multivariate Regression

lm
lm(dataframe$variable1 ~ dataframe$variable2)
test_regression <- lm(dataframe$depvariable1 ~ dataframe$indepvariable2)
summary(test_regression)

Notes: You can add additional variables to run regressions on by adding
+ dataframe$variable3 before the last parenthesis in the lm line.

Output: Coefficients with their standard errors, test statistics, and p-values (Pr(>|t|)). Also
includes degrees of freedom, R-squared, and adjusted R-squared.
Best for: Performing bivariate and multivariate regression
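
Example (illustrative, using the built-in mtcars data set):

simple_regression <- lm(mtcars$mpg ~ mtcars$wt)               # bivariate regression
multi_regression <- lm(mtcars$mpg ~ mtcars$wt + mtcars$hp)    # multivariate: adds a second predictor
summary(multi_regression)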

Lab Topics for Reference:
- Lab 1: Very basic descriptives
- Lab 2: Frequency tables, crosstabs, more descriptive statistics, basic graphing
- Lab 3: Categorizing variables
- Lab 4: Advanced visualization from Nicole Mader
- Lab 5a: Subsetting (Can use for dummy variables)
- Lab 5b: Single-Sample Hypothesis Testing (t.test and prop.test)
- Lab 6: Two-Sample Hypothesis Testing (t.test and prop.test)
- Lab 7: ANOVA and post-hoc tests
- Lab 8: Chi Squared
- Lab 9: Measures of Association
- Lab 10: Multivariate Regression and Dummy Variables

*bolded are longer-style references that weren’t mentioned in this cheat sheet
