Mis Notas de R PDF
The main bootstrapping function is boot( ) and has the following format:
bootobject <- boot(data= , statistic= , R=, ...) where
parameter   description
data        A vector, matrix, or data frame
statistic   A function that produces the k statistics to be bootstrapped (k=1 if
            bootstrapping a single statistic).
            The function should include an indices parameter that the boot() function
            can use to select cases for each replication (see examples below).
R           Number of bootstrap replicates
...         Additional parameters to be passed to the function that produces the
            statistic of interest
boot( ) calls the statistic function R times. Each time, it generates a set of random indices, with
replacement, from the integers 1:nrow(data). These indices are used within the statistic function to
select a sample. The statistics are calculated on the sample and the results are accumulated in the
bootobject. The bootobject structure includes
element   description
t0        The observed values of k statistics applied to the original data.
t         An R x k matrix where each row is a bootstrap replicate of the k statistics.
You can access these as bootobject$t0 and bootobject$t.
R in Action significantly expands upon this material. Use promo code ria38 for a 38% discount.
Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to
examine the results. If the results look reasonable, you can use the boot.ci( ) function to obtain confidence
intervals for the statistic(s).
The format is
boot.ci(bootobject, conf=, type= ) where
parameter    description
bootobject   The object returned by the boot function
conf         The desired confidence interval (default: conf=0.95)
type         The type of confidence interval returned. Possible values are "norm",
             "basic", "stud", "perc", "bca" and "all" (default: type="all")
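As a minimal sketch of how boot( ) and boot.ci( ) fit together, the code below bootstraps the R-squared of a regression. The mtcars data, the mpg ~ wt + disp model, the helper name rsq, and the choice of 500 replicates are assumptions made for illustration only.

```r
library(boot)   # recommended package shipped with R

# statistic function: boot() passes the data and a vector of resampled row indices;
# formula arrives through the ... argument of boot()
rsq <- function(data, indices, formula) {
  d <- data[indices, ]              # select the bootstrap sample
  fit <- lm(formula, data = d)
  summary(fit)$r.squared
}

set.seed(1234)                      # make the resampling reproducible
results <- boot(data = mtcars, statistic = rsq, R = 500,
                formula = mpg ~ wt + disp)

# percentile confidence interval for R-squared
ci <- boot.ci(results, conf = 0.95, type = "perc")
ci
```

Here results$t0 holds the R-squared from the original data and results$t holds the 500 bootstrap replicates.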
# view results
results
plot(results)
# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp
Going Further
The boot( ) function can generate both nonparametric and parametric resampling. For the
nonparametric bootstrap, resampling methods include ordinary, balanced, antithetic and permutation.
For the nonparametric bootstrap, stratified resampling is supported. Importance resampling weights can
also be specified.
The boot.ci( ) function takes a bootobject and generates 5 different types of two-sided nonparametric
confidence intervals. These include the first order normal approximation, the basic bootstrap interval,
the studentized bootstrap interval, the bootstrap percentile interval, and the adjusted bootstrap
percentile (BCa) interval.
Learning More
Good sources of information include Resampling Methods in R: The boot Package by Angelo Canty,
Getting started with the boot package by Ajay Shah, Bootstrapping Regression Models by John Fox,
and Bootstrap Methods and Their Applications by Davison and Hinkley.
Correspondence Analysis
Correspondence analysis provides a graphic method of exploring the relationship between variables in a
contingency table. There are many options for correspondence analysis in R. I recommend the ca
package by Nenadic and Greenacre because it supports supplementary points, subset analyses, and
comprehensive graphics. You can obtain the package here.
Although ca can perform multiple correspondence analysis (more than two categorical variables), only
simple correspondence analysis is covered here. See their article for details on multiple CA.
# Correspondence Analysis
library(ca)
mytable <- with(mydata, table(A,B)) # create a 2 way table
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
fit <- ca(mytable)
print(fit) # basic results
summary(fit) # extended results
plot(fit) # symmetric map
plot(fit, mass = TRUE, contrib = "absolute", map =
   "rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map
Row points (column points) that are closer together have more similar column profiles (row profiles).
Keep in mind that you cannot interpret the distance between row and column points directly.
The second graph is asymmetric, with rows in the principal coordinates and columns in reconstructions
of the standardized residuals. Additionally, mass is represented by points and columns are represented
by arrows. Point intensity (shading) corresponds to the absolute contributions for the rows. This
example is included to highlight some of the available options.
Tree-Based Models
Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of
data, while developing easy to visualize decision rules for predicting a categorical (classification tree)
or continuous (regression tree) outcome. This section briefly describes CART modeling, conditional
inference trees, and random forests.
CART Modeling via rpart
Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be
generated through the rpart package. Detailed information on rpart is available in An Introduction to
Recursive Partitioning Using the RPART Routines. The general steps are provided below followed by two
examples.
1. Grow the tree
To grow a tree, use
rpart(formula, data=, method=, control=) where
formula    is in the format
           outcome ~ predictor1+predictor2+predictor3+etc.
data=      specifies the data frame
method=    "class" for a classification tree
           "anova" for a regression tree
control=   optional parameters for controlling tree growth. For example,
           control=rpart.control(minsplit=30, cp=0.001) requires that the minimum
           number of observations in a node be 30 before attempting a split and that a
           split must decrease the overall lack of fit by a factor of 0.001 (cost
           complexity factor) before being attempted.
2. Examine the results
The functions printcp( ), plotcp( ), print( ), summary( ), plot( ), and text( ) help us to examine the results.
In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs
below).
3. Prune the tree
Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that
minimizes the cross-validated error, the xerror column printed by printcp( ).
Specifically, use printcp( ) to examine the cross-validated error results, select the complexity
parameter associated with minimum error, and place it into the prune( ) function. Alternatively, you
can extract the complexity parameter associated with the smallest cross-validated error automatically
from the cptable element of the fitted object.
Thanks to HSAUR for this idea.
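The three steps above can be sketched as follows. This is a minimal illustration that assumes the kyphosis data shipped with rpart; the names best_cp and pfit are introduced here for the example.

```r
library(rpart)   # recommended package shipped with R

set.seed(42)     # cross-validation folds are random, so fix the seed

# 1. grow the tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method = "class", data = kyphosis)

# 2. examine the results
printcp(fit)     # cp table, including the cross-validated error (xerror)

# 3. prune at the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = best_cp)
```

Because the cross-validated error depends on the random folds, a different seed may select a different cp value.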
# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
method="class", data=kyphosis)
# plot tree
plot(fit, uniform=TRUE,
main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
method="anova", data=cu.summary)
# plot tree
plot(fit, uniform=TRUE,
main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
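It turns out that this produces the same tree as the original.
Conditional Inference Trees via party
Conditional inference trees can be created with the ctree( ) function in the party package:
ctree(formula, data=)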
The type of tree created will depend on the outcome variable (nominal factor, ordered factor,
numeric, etc. ). Tree growth is based on statistical stopping rules, so pruning should not be required.
Random Forests
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based
on random samples of variables), classifying a case using each tree in this new "forest", and deciding a
final predicted outcome by combining the results across all of the trees (an average in regression, a
majority vote in classification). Breiman and Cutler's random forest approach is implemented via the
randomForest package.
Here is an example.
Going Further
This section has only touched on the options available. To learn more, see the CRAN Task View on
Machine Learning & Statistical Learning.
Cluster Analysis
R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the
many approaches: hierarchical agglomerative, partitioning, and model based. While there are no best
solutions for the problem of determining the number of clusters to extract, several approaches are
given below.
Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
Partitioning
K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters. The analyst looks for a bend in the
plot similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).
A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).
The function pamk( ) in the fpc package is a wrapper for pam that also prints the suggested number of
clusters based on optimum average silhouette width.
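The partitioning approach described above might look like the sketch below. The standardized mtcars data, the range of 1 to 10 clusters, and the final choice of 3 clusters are assumptions for illustration, not recommendations.

```r
library(cluster)            # for pam(); recommended package shipped with R

mydata <- scale(mtcars)     # standardize variables

# within groups sum of squares for 1 to 10 clusters
set.seed(123)               # kmeans() uses random starting centers
wss <- numeric(10)
for (k in 1:10) {
  wss[k] <- kmeans(mydata, centers = k, nstart = 10)$tot.withinss
}
plot(1:10, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")   # look for a bend

# K-means solution with 3 clusters
fit <- kmeans(mydata, centers = 3, nstart = 10)
# cluster means on the standardized variables
aggregate(as.data.frame(mydata), by = list(cluster = fit$cluster), FUN = mean)

# medoid-based alternative
pfit <- pam(mydata, k = 3)
```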
Hierarchical Agglomerative
There is a wide range of hierarchical clustering approaches. I have had good luck with Ward's method
described below.
# Ward Hierarchical clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
The pvclust( ) function in the pvclust package provides p-values for hierarchical clustering based on
multiscale bootstrap resampling. Clusters that are highly supported by the data will have large
p-values. Interpretation details are provided by Suzuki. Be aware that pvclust clusters columns, not rows.
Transpose your data before using.
Model Based
Model based approaches assume a variety of data models and apply maximum likelihood estimation and
Bayes criteria to identify the most likely model and number of clusters. Specifically, the Mclust( )
function in the mclust package selects the optimal model according to BIC for EM initialized by
hierarchical clustering for parameterized Gaussian mixture models (phew!). One chooses the model and
number of clusters with the largest BIC. See help(mclustModelNames) for details on the model chosen as
best.
Discriminant Function Analysis
Unless prior probabilities are specified, lda( ) and qda( ) assume proportional prior
probabilities (i.e., prior probabilities are based on sample sizes). In the examples below, lower case
letters are numeric variables and upper case letters are categorical factors.
Linear Discriminant Function
# Linear Discriminant Analysis with Jackknifed Prediction
library(MASS)
fit <- lda(G ~ x1 + x2 + x3, data=mydata,
   na.action="na.omit", CV=TRUE)
fit # show results
The code above performs an LDA, using listwise deletion of missing data. CV=TRUE generates jackknifed
(i.e., leave one out) predictions. The accuracy of the prediction can be assessed by cross-tabulating
the observed groups against the predicted classes in fit$class.
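A minimal sketch of that accuracy assessment, assuming the built-in iris data in place of mydata and Species in place of the grouping factor G:

```r
library(MASS)   # recommended package shipped with R

# leave-one-out (jackknifed) LDA
fit <- lda(Species ~ ., data = iris, CV = TRUE)

# rows: observed group, columns: predicted group
ct <- table(iris$Species, fit$class)
ct
diag(prop.table(ct, 1))       # proportion correct within each group
sum(diag(prop.table(ct)))     # overall proportion correct
```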
# Quadratic Discriminant Analysis with 3 groups applying
# resubstitution prediction and equal prior probabilities.
library(MASS)
fit <- qda(G ~ x1 + x2 + x3 + x4, data=na.omit(mydata),
   prior=c(1,1,1)/3)
Note the alternate way of specifying listwise deletion of missing data. Re-substitution (using the same
data to derive the functions and evaluate their prediction accuracy) is the default method unless
CV=TRUE is specified. Re-substitution will be overly optimistic.
Visualizing the Results
You can plot each observation in the space of the first 2 linear discriminant functions. Points are
identified with the group ID.
You can also display histograms and density plots for the observations in each group on the first
linear discriminant dimension. There is one panel for each group and they all appear lined up on the
same graph.
The partimat( ) function in the klaR package can display the results of a linear or quadratic
classification 2 variables at a time.
You can also produce a scatterplot matrix with color coding by group.
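The base and MASS plots described in this section can be sketched as below. The iris data are an assumption; the partimat( ) display is left out because it needs the klaR package.

```r
library(MASS)

# fit without CV=TRUE so that the result is an lda object that can be plotted
fit <- lda(Species ~ ., data = iris)
scores <- predict(fit)$x              # observations on LD1 and LD2

plot(fit)                             # scatterplot in discriminant space, labelled by group
plot(fit, dimen = 1, type = "both")   # histograms and density plots on LD1, one panel per group

# scatterplot matrix with color coding by group
pairs(iris[1:4], col = as.integer(iris$Species))
```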
Test Assumptions
See (M)ANOVA Assumptions for methods of evaluating multivariate normality and homogeneity of
covariance matrices.
Principal Components and Factor Analysis
This section covers principal components and factor analysis. The latter includes both exploratory and
confirmatory methods.
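As a minimal sketch of a principal components analysis with the base princomp( ) function (the USArrests data are an assumption chosen for illustration):

```r
# principal components from the correlation matrix
fit <- princomp(USArrests, cor = TRUE)

summary(fit)                # proportion of variance accounted for
loadings(fit)               # component loadings
plot(fit, type = "lines")   # scree plot
head(fit$scores)            # the principal component scores
biplot(fit)
```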
Use cor=FALSE to base the principal components on the covariance matrix. Use the covmat= option to
enter a correlation or covariance matrix directly. If entering a covariance matrix, include the option
n.obs=.
The principal( ) function in the psych package can be used to extract and rotate principal components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".
Exploratory Factor Analysis
The factanal( ) function produces maximum likelihood factor analysis.
The rotation= options include "varimax", "promax", and "none". Add the option scores="regression" or
"Bartlett" to produce factor scores. Use the covmat= option to enter a correlation or covariance matrix
directly. If entering a covariance matrix, include the option n.obs=.
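A minimal sketch of a maximum likelihood factor analysis from a covariance matrix, assuming the built-in ability.cov object (a list holding a covariance matrix and its n.obs) and a two-factor solution chosen only for illustration:

```r
# covmat= accepts a list such as ability.cov, which already carries n.obs
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")
print(fit, digits = 2, cutoff = 0.3, sort = TRUE)

load <- loadings(fit)   # variables x factors matrix of loadings
```

Factor scores cannot be produced in this case because they require the raw data rather than a covariance matrix.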
The factor.pa( ) function in the psych package offers a number of factor analysis related functions,
including principal axis factoring.
mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
Rotation can be "varimax" or "promax".
Going Further
The FactoMineR package offers a large number of additional functions for exploratory factor analysis.
This includes the use of both quantitative and qualitative variables, as well as the inclusion of
supplementary variables and observations. Here is an example of the types of graphs that you can
create with this package.
The GPArotation package offers a wealth of rotation options beyond varimax and promax.
Confirmatory Factor Analysis
[Figure: path diagram of the two-factor model, with residuals e1 to e6]
Assume that we have six observed variables (X1, X2, ..., X6). We
hypothesize that there are two unobserved latent factors (F1, F2) that underlie the observed variables
as described in this diagram. X1, X2, and X3 load on F1 (with loadings lam1, lam2, and lam3). X4, X5,
and X6 load on F2 (with loadings lam4, lam5, and lam6). The double headed arrow indicates the
covariance between the two latent factors (F1F2). e1 thru e6 represent the residual variances (variance
in the observed variables not accounted for by the two latent factors). We set the variances of F1 and
F2 equal to one so that the parameters will have a scale. This will result in F1F2 representing the
correlation between the two latent factors.
For sem, we need the covariance matrix of the observed variables - thus the cov( ) statement in the
code below. The CFA model is specified using the specify.model( ) function. The format is arrow
specification, parameter name, start value. Choosing a start value of NA tells the program to choose a
start value rather than supplying one yourself. Note that the variances of F1 and F2 are fixed at 1 (NA in
the second column). The blank line is required to end the RAM specification.
You can use the boot.sem( ) function to bootstrap the structural equation model. See help(boot.sem)
for details. Additionally, the function mod.indices( ) will produce modification indices. Using
modification indices to improve model fit by respecifying the parameters moves you from a
confirmatory to an exploratory analysis.
For more information on sem, see Structural Equation Modeling with the sem Package in R, by John
Fox.
Generalized Linear Models
Generalized linear models are fit using the glm( ) function. The form of the glm function is
glm(formula, family=familytype(link=linkfunction), data=)
See help(glm) for other modeling options. See help(family) for other allowable link functions for each
family. Three subtypes of generalized linear models will be covered here: logistic regression, Poisson
regression, and survival analysis.
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables. It is frequently preferred over discriminant function analysis because of its less
restrictive assumptions.
You can use anova(fit1, fit2, test="Chisq") to compare nested models. Additionally, cdplot(F~x,
data=mydata) will display the conditional density plot of the binary outcome F on the continuous x
variable.
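A minimal sketch of a logistic regression and a nested model comparison. The mtcars data, with transmission type am as the binary outcome and wt and hp as predictors, are an assumption for illustration.

```r
# binary outcome: am (0 = automatic, 1 = manual)
fit1 <- glm(am ~ wt, data = mtcars, family = binomial())
fit2 <- glm(am ~ wt + hp, data = mtcars, family = binomial())

summary(fit2)                            # display results
exp(coef(fit2))                          # exponentiated coefficients (odds ratios)
head(predict(fit2, type = "response"))   # predicted probabilities

anova(fit1, fit2, test = "Chisq")        # compare the nested models

cdplot(factor(am) ~ wt, data = mtcars)   # conditional density plot
```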
Poisson Regression
Poisson regression is useful when predicting an outcome variable representing counts from a set of
continuous predictor variables.
# Poisson Regression
# where count is a count and
# x1-x3 are continuous predictors
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) # display results
If you have overdispersion (see if residual deviance is much larger than degrees of freedom), you may
want to use quasipoisson() instead of poisson().
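The overdispersion check above can be sketched as follows, assuming the built-in warpbreaks data (counts of yarn breaks) purely for illustration:

```r
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson())

# residual deviance relative to its degrees of freedom;
# values well above 1 suggest overdispersion
ratio <- deviance(fit) / df.residual(fit)
ratio

# same mean model, but the dispersion is estimated rather than fixed at 1
qfit <- glm(breaks ~ wool + tension, data = warpbreaks, family = quasipoisson())
summary(qfit)$dispersion
```

The coefficient estimates are the same in both fits; only the standard errors change.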
Survival Analysis
Survival analysis (also called event history analysis or reliability analysis) covers a set of techniques for
modeling the time to an event. Data may be right censored - the event may not have occurred by the
end of the study or we may have incomplete information on an observation but know that up to a
certain time the event had not occurred (e.g. the participant dropped out of study in week 10 but was
alive at that time).
While generalized linear models are typically analyzed using the glm( ) function, survival analysis is
typically carried out using functions from the survival package. The survival package can handle one
and two sample problems, parametric accelerated failure models, and the Cox proportional hazards
model.
Data are typically entered in the format start time, stop time, and status (1=event occurred, 0=event
did not occur). Alternatively, the data may be in the format time to event and status (1=event
occurred, 0=event did not occur). A status=0 indicates that the observation is right censored. Data are
bundled into a Surv object via the Surv( ) function prior to further analyses.
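A minimal sketch of bundling data into a Surv object and fitting two common models. It assumes the lung data shipped with the survival package, where status is coded 1=censored and 2=dead, so the event indicator is written as status == 2.

```r
library(survival)   # recommended package shipped with R

# Kaplan-Meier survival curves by sex
km <- survfit(Surv(time, status == 2) ~ sex, data = lung)
plot(km, lty = 1:2, xlab = "Days", ylab = "Survival probability")

# Cox proportional hazards model with age and sex as predictors
cox <- coxph(Surv(time, status == 2) ~ age + sex, data = lung)
summary(cox)
```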
# display results
MaleMod
See Thomas Lumley's R news article on the survival package for more information. Other good sources
include Mai Zhou's Use R Software to do Survival Analysis and Simulation and M. J. Crawley's chapter
on Survival Analysis.
Matrix Algebra
Most of the methods on this website actually describe the programming of matrices. It is built deeply
into the R language. This section will simply cover operators and functions specifically suited to linear
algebra. Before proceeding you may want to review the sections on Data Types and Operators.
Operator or Function   Description
A * B                  Element-wise multiplication
A %*% B                Matrix multiplication
A %o% B                Outer product. AB'
crossprod(A,B)         A'B and A'A respectively.
crossprod(A)
t(A)                   Transpose
diag(x)                Creates diagonal matrix with elements of x in the principal diagonal
diag(A)                Returns a vector containing the elements of the principal diagonal
diag(k)                If k is a scalar, this creates a k x k identity matrix. Go figure.
solve(A, b)            Returns vector x in the equation b = Ax (i.e., A^-1 b)
solve(A)               Inverse of A where A is a square matrix.
ginv(A)                Moore-Penrose Generalized Inverse of A.
                       ginv(A) requires loading the MASS package.
y <- eigen(A)          y$val are the eigenvalues of A
                       y$vec are the eigenvectors of A
y <- svd(A)            Singular value decomposition of A.
                       y$d = vector containing the singular values of A
                       y$u = matrix with columns containing the left singular vectors of A
                       y$v = matrix with columns containing the right singular vectors of A
R <- chol(A)           Choleski factorization of A. Returns the upper triangular factor, such that R'R = A.
y <- qr(A)             QR decomposition of A.
                       y$qr has an upper triangle that contains the decomposition and a lower
                       triangle that contains information on the Q decomposition.
                       y$rank is the rank of A.
                       y$qraux a vector which contains additional information on Q.
                       y$pivot contains information on the pivoting strategy used.
cbind(A,B,...)         Combine matrices (vectors) horizontally. Returns a matrix.
rbind(A,B,...)         Combine matrices (vectors) vertically. Returns a matrix.
rowMeans(A)            Returns vector of row means.
rowSums(A)             Returns vector of row sums.
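A few of the operators and functions in the table, applied to a small symmetric matrix. The values of A and b are made up for illustration.

```r
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(1, 2)

A * A            # element-wise multiplication
A %*% A          # matrix multiplication
t(A)             # transpose
diag(2)          # 2 x 2 identity matrix

x <- solve(A, b) # solves b = Ax
Ainv <- solve(A) # inverse of A

R <- chol(A)     # upper triangular factor, t(R) %*% R equals A
e <- eigen(A)    # e$values and e$vectors
s <- svd(A)      # s$d, s$u, s$v
```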
Matlab Emulation
The matlab package contains wrapper functions and variables used to replicate MATLAB function calls
as best possible. This can help porting MATLAB applications and code to R.
Going Further
The Matrix package contains functions that extend R to support highly dense or sparse matrices. It
provides efficient access to BLAS (Basic Linear Algebra Subroutines), Lapack (dense matrix), TAUCS
(sparse matrix) and UMFPACK (sparse matrix) routines.
Multidimensional Scaling
R provides functions for both classical and nonmetric multidimensional scaling. Assume that we have N
objects measured on p numeric variables. We want to represent the distances among the objects in a
parsimonious (and visual) way (i.e., a lower k-dimensional space).
# classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d, eig=TRUE, k=2) # k is the number of dim
fit # view results
Nonmetric MDS
Nonmetric MDS is performed using the isoMDS() function in the MASS package.
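A minimal sketch of nonmetric MDS, assuming the built-in eurodist distances between 21 European cities in place of dist(mydata):

```r
library(MASS)

d <- eurodist                            # an object of class "dist"
fit <- isoMDS(d, k = 2, trace = FALSE)   # k is the number of dimensions

# plot the two-dimensional solution, labelled by city
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2",
     main = "Nonmetric MDS")
text(x, y, labels = attr(d, "Labels"), cex = 0.7)

fit$stress                               # final stress, in percent
```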
Time Series
# save a numeric vector containing 48 monthly observations
# from Jan 2009 to Dec 2012 as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2012, 12), frequency=12)
# subset the time series (June 2012 to December 2012)
myts2 <- window(myts, start=c(2012, 6), end=c(2012, 12))
# plot series
plot(myts)
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the
stl() function. Note that a series with multiplicative effects can often be transformed into a series with
additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="periodic")
plot(fit)
# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)
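As a sketch of the log transformation followed by a decomposition, the code below uses the built-in AirPassengers series (an assumption for illustration), whose seasonal swings grow with the level of the series.

```r
logap <- log(AirPassengers)               # multiplicative effects become additive

fit <- stl(logap, s.window = "periodic")  # seasonal, trend and remainder
plot(fit)

parts <- fit$time.series                  # one column per component
head(parts)
```

The three components add back up to the log series.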
Exponential Models
Both the HoltWinters() function in the base installation, and the ets() function in the forecast package,
can be used to fit exponential models.
# simple exponential - models level
fit <- HoltWinters(myts, beta=FALSE, gamma=FALSE)
# double exponential - models level and trend
fit <- HoltWinters(myts, gamma=FALSE)
# triple exponential - models level, trend, and seasonal components
fit <- HoltWinters(myts)
# predictive accuracy
library(forecast)
accuracy(fit)
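A minimal sketch using only base R, assuming the built-in Nile (annual) and co2 (monthly) series in place of myts:

```r
# simple exponential smoothing on an annual series (no trend, no seasonality)
fit1 <- HoltWinters(Nile, beta = FALSE, gamma = FALSE)
fit1$alpha                           # estimated smoothing parameter

# triple exponential smoothing on a monthly series
fit3 <- HoltWinters(co2)
pred <- predict(fit3, n.ahead = 12)  # forecast the next 12 months
plot(fit3, pred)                     # fitted values plus forecasts
```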
ARIMA Models
The arima() function can be used to fit an autoregressive integrated moving averages model. Other
useful functions include acf() and pacf(), which plot the autocorrelation and partial autocorrelation
functions.
Note that the forecast package has somewhat nicer versions of acf() and pacf() called Acf() and Pacf()
respectively.
# predictive accuracy
library(forecast)
accuracy(fit)
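A minimal sketch of fitting and checking an ARIMA model with base R only. The built-in lh series and the AR(1) order are assumptions for illustration.

```r
acf(lh)      # autocorrelation function
pacf(lh)     # partial autocorrelation function

# fit an ARIMA(1,0,0) model
fit <- arima(lh, order = c(1, 0, 0))
fit

# check the residuals for remaining autocorrelation
Box.test(residuals(fit), type = "Ljung-Box")

# forecast the next 5 observations with standard errors
fc <- predict(fit, n.ahead = 5)
fc$pred
fc$se
```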
Automated Forecasting
The forecast package provides functions for the automatic selection of exponential and ARIMA models.
The ets() function supports both additive and multiplicative models. The auto.arima() function can
handle both seasonal and nonseasonal ARIMA models. Models are chosen to maximize one of several fit
criteria.
library(forecast)
# Automated forecasting using an exponential model
fit <- ets(myts)
Going Further
There are many good online resources for learning time series analysis with R. These include A little
book of R for time series by Avril Coghlan, and Forecasting: principles and practice by Rob Hyndman and
George Athanasopoulos. Vito Ricci has created a time series reference card. There is also a time series
tutorial by Walter Zucchini and Oleg Nenadic that is quite useful.
See also the comprehensive book Time Series Analysis and its Applications with R Examples by Robert
Shumway and David Stoffer.
Data Types
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices,
data frames, and lists.
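As a minimal sketch, the three kinds of vectors named above can be created with c( ). The values and the names b and flags are made up for illustration.

```r
a <- c(1, 2, 5.3, 6, -2, 4)       # numeric vector
b <- c("one", "two", "three")     # character vector
flags <- c(TRUE, TRUE, FALSE)     # logical vector

class(a)      # "numeric"
class(b)      # "character"
class(flags)  # "logical"
```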
a[c(2,4)] # 2nd and 4th elements of vector
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The
general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
   dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the columns
and rows.
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
   dimnames=list(rnames, cnames))
Data Frames
A data frame is more general than a matrix, in that different columns can have different modes
(numeric, character, factor, etc.). This is similar to SAS and SPSS datasets.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID", "Color", "Passed") # variable names
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
Date Values
The following symbols can be used with the format( ) function to print dates.
Symbol   Meaning                   Example
%d       day as a number (01-31)   01-31
%a       abbreviated weekday       Mon
%A       unabbreviated weekday     Monday
%m       month (01-12)             01-12
%b       abbreviated month         Jan
%B       unabbreviated month       January
%y       2-digit year              07
%Y       4-digit year              2007
Here is an example.
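A minimal sketch of the format( ) symbols in use. A fixed date is assumed so that the results are reproducible; the month and weekday names produced by %B and %A depend on the locale.

```r
today <- as.Date("2007-06-22")         # fixed date, for illustration

format(today, format = "%d/%m/%y")     # "22/06/07"
format(today, format = "%Y-%m-%d")     # "2007-06-22"
format(today, format = "%A, %B %d %Y") # names depend on the locale

# dates are stored as days since 1970-01-01, so they support arithmetic
days <- as.Date("2007-06-22") - as.Date("2007-06-01")
days
```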
Date Conversion
Character to Date
You can use the as.Date( ) function to convert character data to dates. The format is as.Date(x,
"format"), where x is the character data and format gives the appropriate format.
Date to Character
You can convert dates to character data using the as.character( ) function.
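Both conversions can be sketched as below; the date strings are made up for illustration.

```r
# character to date: ISO strings (yyyy-mm-dd) need no format
mydates <- as.Date(c("2007-06-22", "2004-02-13"))

# other layouts need a format built from the symbols above
strDates <- c("01/05/1965", "08/16/1975")
dates <- as.Date(strDates, "%m/%d/%Y")

# date to character
strBack <- as.character(dates)
strBack
```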
Learning More
See help(as.Date) and help(strftime) for details on converting character data to dates. See
help(ISOdatetime) for more information about formatting date/times.
Access to Database Management Systems (DBMS)
library(RODBC)
myconn <- odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)
Other Interfaces
The RMySQL package provides an interface to MySQL.
Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Designed by WebTemplateOcean.com
Exporting Data
There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you
will need to load the foreign package. For Excel, you will need the xlsReadWrite package.
To A Tab Delimited Text File
write.table(mydata, "c:/mydata.txt", sep="\t")
To SPSS
# write out text datafile and
# an SPSS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
To Stata
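A minimal sketch of exporting to Stata with write.dta( ) from the foreign package. The small data frame and the temporary file path are assumptions for illustration; in practice the path would be something like "c:/mydata.dta".

```r
library(foreign)   # recommended package shipped with R

df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
path <- tempfile(fileext = ".dta")   # hypothetical output location

write.dta(df, path)      # export to Stata binary format
back <- read.dta(path)   # read it back to confirm the round trip
back
```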
Importing Data
Importing data into R is fairly simple. For Stata and Systat, use the foreign package. For SPSS and SAS I
would recommend the Hmisc package for ease and functionality. See the Quick-R section on packages
for information on obtaining and installing these packages. Examples of importing data are provided
below.
From A Comma Delimited Text File
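A minimal sketch of reading a comma delimited file with read.table( ). The file contents and the temporary path are made up so the example is self-contained; the path stands in for something like "c:/mydata.csv".

```r
# write a small comma delimited file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender",
             "a1,25,male",
             "a2,30,female"), path)

# the first row contains variable names and a comma is the separator;
# the id column is used for the row names
mydata <- read.table(path, header = TRUE, sep = ",", row.names = "id")
mydata
```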
From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.
# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors
From SAS
# save SAS dataset in transport format
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;
# in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
# character variables are converted to R factors
From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")
From Systat
# input Systat file
library(foreign)
mydata <- read.systat("c:/mydata.dta")
Keyboard Input
Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an
ASCII file. To create it interactively, you can do something like the following.
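The example that belongs here is missing from this copy. A minimal sketch of building a small data frame by typing the values directly; the variable names and values are illustrative assumptions.

```r
# create a data frame from scratch (values are illustrative)
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age, gender, weight)
mydata
```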
You can also use R's built in spreadsheet to enter the data interactively, as in the following example.
# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
  weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!
Missing Data
In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing
by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for
character and numeric data.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
The function complete.cases() returns a logical vector indicating which cases are complete.
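A short sketch of how these pieces fit together on a data frame. The data are made up; is.na(), complete.cases() and na.omit() are base R functions.

```r
mydata <- data.frame(x = c(1, 2, NA, 3), y = c("a", NA, "c", "d"))

missing_x <- is.na(mydata$x)     # TRUE where x is missing
ok <- complete.cases(mydata)     # TRUE only for rows with no missing values
newdata <- na.omit(mydata)       # keep only the complete rows
newdata
```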
Value Labels
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
  levels = c(1,2,3),
  labels = c("red", "blue", "green"))
Use the factor() function for nominal data and the ordered() function for ordinal data. R statistical
and graphic functions will then treat the data appropriately.
Note: factor and ordered are used the same way, with the same arguments. The former creates factors
and the latter creates ordered factors.
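To make the difference concrete, here is a small sketch with made-up codes. The label sets are assumptions chosen for illustration.

```r
v1 <- c(1, 3, 2, 3)
# nominal: categories with no order
color <- factor(v1, levels = c(1, 2, 3), labels = c("red", "blue", "green"))

# ordinal: same arguments, but the levels are ordered
rating <- ordered(c(2, 1, 3), levels = c(1, 2, 3),
                  labels = c("Low", "Medium", "High"))

table(color)
higher <- rating[1] > rating[2]   # comparisons are allowed on ordered factors
```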
Variable Labels
R's ability to handle variable labels is somewhat unsatisfying.
If you use the Hmisc package, you can take advantage of some labeling features.
library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)
Unfortunately the label is only in effect for functions provided by the Hmisc package, such as
describe(). Your other option is to use the variable label as the variable name and then refer to the
variable by position index.
names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable
Aggregating Data
It is relatively easy to collapse data in R using one or more BY variables and a defined function.
# aggregate data frame mtcars by cyl and vs, returning means
# for numeric variables
attach(mtcars)
aggdata <- aggregate(mtcars, by=list(cyl,vs),
  FUN=mean, na.rm=TRUE)
print(aggdata)
detach(mtcars)
When using the aggregate() function, the by variables must be in a list (even if there is only one). The
function can be built-in or user provided.
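As a sketch of the last two points, the following uses a single by variable (still wrapped in a list) and a user provided function. The function name spread is hypothetical.

```r
# user-provided function: width of the range (max minus min)
spread <- function(v) max(v) - min(v)

# one BY variable, still supplied as a list
aggdata <- aggregate(mtcars[c("mpg", "hp")],
                     by = list(cyl = mtcars$cyl),
                     FUN = spread)
print(aggdata)
```

Naming the list element (cyl = ...) gives the grouping column a readable name instead of Group.1.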
Control Structures
R has the standard control structures you would expect. expr can be multiple (compound) statements
by enclosing them in braces { }. It is more efficient to use built-in functions rather than control
structures whenever possible.
for
while
switch
switch(expr, ...)
ifelse
ifelse(test,yes,no)
Example
# try it
z <- matrix(1:10, nrow=5, ncol=2)
tz <- mytrans(z)
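The definition of mytrans() is not included in this copy. The sketch below assumes it is meant to be a loop-based transpose (a poor substitute for the built-in t(), but a convenient way to show nested for loops and if), and then shows while, switch and ifelse on toy values.

```r
# hypothetical mytrans(): loop-based transpose, for illustration only
mytrans <- function(x) {
  if (!is.matrix(x)) {
    warning("argument is not a matrix: returning NA")
    return(NA_real_)
  }
  y <- matrix(1, nrow = ncol(x), ncol = nrow(x))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(ncol(x))) {
      y[j, i] <- x[i, j]
    }
  }
  y
}
z <- matrix(1:10, nrow = 5, ncol = 2)
tz <- mytrans(z)

# while: repeat until the condition is FALSE
i <- 0
while (i < 3) i <- i + 1

# switch: pick a branch by name
kind <- switch("b", a = "first", b = "second")

# ifelse: vectorised test
size <- ifelse(c(1, 5, 10) > 4, "big", "small")
```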
Built-in Functions
Almost everything in R is done through functions. Here I'm only referring to numeric and character
functions that are commonly used in creating or recoding variables.
Character Functions
Function                    Description
substr(x, start=n1,         Extract or replace substrings in a character vector.
  stop=n2)                  x <- "abcdef"
                            substr(x, 2, 4) is "bcd"
                            substr(x, 2, 4) <- "22222" is "a222ef"
grep(pattern, x,            Search for pattern in x. If fixed=FALSE then pattern is a regular
  ignore.case=FALSE,        expression. If fixed=TRUE then pattern is a text string. Returns
  fixed=FALSE)              matching indices.
                            grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement,   Find pattern in x and replace with replacement text. If
  x, ignore.case=FALSE,     fixed=FALSE then pattern is a regular expression.
  fixed=FALSE)              If fixed=TRUE then pattern is a text string.
                            sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split)          Split the elements of character vector x at split.
                            strsplit("abc", "") returns a list holding the 3 element vector "a","b","c"
paste(..., sep="")          Concatenate strings after using sep string to separate them.
                            paste("x",1:3,sep="") returns c("x1","x2","x3")
                            paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
                            paste("Today is", date())
toupper(x)                  Uppercase
tolower(x)                  Lowercase
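The character functions in the table can be tried together in one short session. This is only a sketch that reuses the table's own examples; the result variable names are arbitrary.

```r
x <- "abcdef"
mid <- substr(x, 2, 4)                           # extract characters 2 to 4
substr(x, 2, 4) <- "22222"                       # replace in place; only 3 characters fit

hits <- grep("A", c("b", "A", "c"), fixed = TRUE)  # index of the matching element
dotted <- sub("\\s", ".", "Hello There")         # first whitespace replaced by "."
letters3 <- strsplit("abc", "")[[1]]             # strsplit returns a list; take element 1
labels <- paste("x", 1:3, sep = "M")
upper <- toupper("abc")
```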
Function                    Description
mean(x, trim=0,             mean of object x
  na.rm=FALSE)              # trimmed mean, removing any missing values and
                            # 5 percent of highest and lowest scores
                            mx <- mean(x,trim=.05, na.rm=TRUE)
sd(x)                       standard deviation of object(x). Also look at var(x) for variance and
                            mad(x) for median absolute deviation.
median(x)                   median
quantile(x, probs)          quantiles where x is the numeric vector whose quantiles are desired and
                            probs is a numeric vector with probabilities in [0,1].
                            # 30th and 84th percentiles of x
                            y <- quantile(x, c(.3,.84))
range(x)                    range
sum(x)                      sum
diff(x, lag=1)              lagged differences, with lag indicating which lag to use
min(x)                      minimum
max(x)                      maximum
scale(x, center=TRUE,       column center or standardize a matrix.
  scale=TRUE)
Note that while the examples on this page apply functions to individual variables, many can be applied
to vectors and matrices as well.
Merging Data
# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
Adding Rows
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have
the same variables, but they do not have to be in the same order.
If dataframeA has variables that dataframeB does not, then either:
1. Delete the extra variables in dataframeA or
2. Create the additional variables in dataframeB and set them to NA (missing)
before joining them with rbind( ).
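A minimal sketch of the second option, using two small made-up data frames. The names dataframeA, dataframeB and their columns are illustrative assumptions.

```r
dataframeA <- data.frame(ID = 1:2, score = c(90, 85), group = c("a", "b"))
dataframeB <- data.frame(ID = 3:4, score = c(70, 95))

# dataframeB lacks "group": create it and set it to NA before stacking
dataframeB$group <- NA
total <- rbind(dataframeA, dataframeB)
total
```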
Operators
R's binary and logical operators will look very familiar to programmers. Note that binary operators work
on vectors and matrices as well as scalars.
Logical Operators
Operator     Description
<            less than
<=           less than or equal to
>            greater than
>=           greater than or equal to
==           exactly equal to
!=           not equal to
!x           Not x
x | y        x OR y
x & y        x AND y
isTRUE(x)    test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
Reshaping Data
R provides a variety of methods for reshaping data prior to analysis.
Transpose
Use the t() function to transpose a matrix or a data frame. In the latter case, rownames become
variable (column) names.
Basically, you "melt" data so that each row is a unique id-variable combination. Then you "cast" the
melted data into any shape you would like. Here is a very simple example.
mydata
id time x1 x2
1  1    5  6
1  2    3  5
2  1    6  1
2  2    2  4
# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)
subjmeans
id x1 x2
1  4  5.5
2  4  2.5
timemeans
time x1  x2
1    5.5 3.5
2    2.5 4.5
There is much more that you can do with the melt( ) and cast( ) functions. See the documentation for
more details.
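The melt() call that creates mdata is not shown in this copy. As a cross-check that needs no extra package, the sketch below uses base R aggregate() to compute the same subject and time means. It is a substitute technique, not the melt/cast approach itself, and the data values are an assumption chosen to be consistent with the means printed above.

```r
# data values assumed; they reproduce the means printed above
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))

# mean of x1 and x2 per subject, then per time point
subjmeans <- aggregate(cbind(x1, x2) ~ id, data = mydata, FUN = mean)
timemeans <- aggregate(cbind(x1, x2) ~ time, data = mydata, FUN = mean)
subjmeans
timemeans
```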
Sorting Data
To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING. Prepend the
sorting variable with a minus sign to indicate DESCENDING order. Here are some examples.
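The examples themselves are missing from this copy. A minimal sketch using the built-in mtcars data; the choice of sorting variables is an assumption.

```r
# sort by mpg (ascending)
newdata <- mtcars[order(mtcars$mpg), ]

# sort by mpg, then by cyl within ties
newdata2 <- mtcars[order(mtcars$mpg, mtcars$cyl), ]

# sort by mpg (ascending) and cyl (descending)
newdata3 <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]
head(newdata3)
```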
Subsetting Data
R has powerful indexing features for accessing object elements. These features can be used to select
and exclude variables and observations. The following code snippets demonstrate ways to keep or
delete variables and observations and to take random samples from a dataset.
Excluding (DROPPING) Variables
# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]
Selecting Observations
# first 5 observations
newdata <- mydata[1:5,]
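Two further cases mentioned in the introduction, selecting by condition and taking a random sample, can be sketched as follows. The example uses the built-in mtcars data and arbitrary cut-offs.

```r
# keep rows meeting a condition and only some variables
newdata <- subset(mtcars, mpg >= 25 & cyl == 4, select = c(mpg, cyl, wt))

# random sample of 10 rows, drawn without replacement
set.seed(1234)   # only to make the draw repeatable
mysample <- mtcars[sample(1:nrow(mtcars), 10, replace = FALSE), ]
```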
Going Further
R has extensive facilities for sampling, including drawing and calibrating survey samples (see the
sampling package), analyzing complex survey data (see the survey package and its homepage) and
bootstrapping.
Data Type Conversion
Type conversions in R work as you would expect. For example, adding a character string to a numeric
vector converts all the elements in the vector to character.
Use is.foo to test for data type foo. Returns TRUE or FALSE
Use as.foo to explicitly convert it.
is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
Examples
                   to one long vector     to matrix             to data frame
from vector        c(x,y)                 cbind(x,y)            data.frame(x,y)
                                          rbind(x,y)
from matrix        as.vector(mymatrix)                          as.data.frame(mymatrix)
from data frame                           as.matrix(myframe)
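A few of these conversions in action, as a sketch on toy values:

```r
x <- c(1, 2, 3)
y <- c(x, "a")                 # adding a string converts every element to character

m  <- cbind(a = x, b = x * 2)  # vector -> matrix
df <- as.data.frame(m)         # matrix -> data frame
v  <- as.vector(m)             # matrix -> one long vector (column by column)
n  <- as.numeric(c("10", "20"))  # character -> numeric
```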
Dates
You can convert dates to and from character or numeric data. See date values for more information.
User-written Functions
One of the great strengths of R is the user's ability to add functions. In fact, many of the functions in R
are actually functions of functions. The structure of a function is given below.
Objects in the function are local to the function. The object returned can be any data type. Here is an
example.
# function example - get measures of central tendency
# and spread for a numeric vector x. The user has a
# choice of measures and whether the results are printed.
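The body of the example function is missing from this copy. The sketch below is one way such a function could look; the name mysummary, its arguments npar and show, and the choice of median/mad versus mean/sd are all assumptions rather than the author's code.

```r
# hypothetical function; name and arguments are assumptions
mysummary <- function(x, npar = TRUE, show = TRUE) {
  if (!npar) {
    center <- mean(x)      # parametric measures
    spread <- sd(x)
  } else {
    center <- median(x)    # nonparametric measures
    spread <- mad(x)
  }
  if (show) {
    cat("Center =", center, "\n", "Spread =", spread, "\n")
  }
  # return both values as a list, without auto-printing
  invisible(list(center = center, spread = spread))
}

# invoking the function
set.seed(1234)
x <- rpois(500, 4)
y <- mysummary(x, npar = FALSE, show = FALSE)
```

Returning a list lets the caller pick out y$center or y$spread later, and center and spread stay local to the function.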
It can be instructive to look at the code of a function. In R, you can view a function's code by typing
the function name without the ( ). If this method fails, look at the following R Wiki link for hints on
viewing function source code.
Finally, you may want to store your own functions, and have them available in every session. You can
customize the R environment to load your functions at start-up.
Creating new variables
Use the assignment operator <- to create new variables. A wide array of operators and functions are
available here.
Recoding variables
In order to recode data, you will probably use one or more of R's control structures.
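A minimal sketch of both ideas, assuming a made-up age variable and arbitrary cut-offs. The target column is initialised first so that the indexed assignments always have a full-length vector to fill.

```r
mydata <- data.frame(age = c(15, 42, 80, 67))

# new variable from an expression
mydata$age_months <- mydata$age * 12

# recode into 2 categories with ifelse
mydata$agecat2 <- ifelse(mydata$age > 70, "old", "young")

# recode into 3 categories with logical indexing
mydata$agecat <- NA_character_
mydata$agecat[mydata$age > 75] <- "Elder"
mydata$agecat[mydata$age > 45 & mydata$age <= 75] <- "Middle Aged"
mydata$agecat[mydata$age <= 45] <- "Young"
```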
Value Labels
R
> mydata$workshop <- factor(mydata$workshop,
+   levels = c(1, 2, 3, 4),
+   labels = c("R", "SAS", "SPSS", "Stata"))
SPSS
CD 'C:\myRfolder'.
GET FILE='mydata.sav'.
VARIABLE LEVEL workshop (NOMINAL)
  /q1 TO q4 (SCALE).
VALUE LABELS
  /workshop
  1 'R'
  2 'SAS'
  3 'SPSS'
  4 'Stata'
  /q1 TO q4
  1 'Strongly Disagree'
  2 'Disagree'
  3 'Neutral'
  4 'Agree'
  5 'Strongly Agree'.
SAVE OUTFILE = "mydata.sav".
Variable Labels
# Filename: VarLabels.R
setwd("c:/myRfolder")
load(file = "mydata.RData")
options(width = 63)
mydata
# Using the Hmisc label attribute.
library("Hmisc")
label(mydata$q1) <- "The instructor was well prepared."
label(mydata$q2) <- "The instructor communicated well."
label(mydata$q3) <- "The course materials were helpful."
label(mydata$q4) <- "Overall, I found this workshop useful."

# Hmisc describe function uses the labels.
describe(mydata[ ,3:6])

# Built-in summary function
# ignores the labels.
summary(mydata[ ,3:6])
SPSS
Enhancing Output
SAS
In SAS, getting tabular results into your word processor or Web page is as easy as setting
the style when you install it and then saving the output directly in the format you need. To
get a subset of output, copy and paste works fine.
SPSS
In SPSS, getting tabular results into your word processor or Web page is as easy as
setting the style when you install it and then saving the output directly in the format you
need. To get a subset of output, copy and paste works fine.
Stata
To make it easy to get started, the ggplot2 package offers two main functions: quickplot()
and ggplot(). The quickplot() function, also known as qplot(), mimics R's traditional
plot() function in many ways. It is particularly easy to use for simple plots. Below is an
example of the default plots that qplot() makes. The command that created each plot is
shown in the title of each graph. Most of them are useful except for the middle one in the left
column, qplot(workshop, gender). A plot like that of two factors simply shows the
combinations of the factors that exist, which is certainly not worth doing a graph to
discover.
[Figure: grid of default qplot() plots, including qplot(workshop, gender), qplot(posttest, workshop) and qplot(pretest, posttest); axes show workshop (R, SAS, SPSS, Stata), pretest and posttest]
While qplot() is easy to use for simple graphs, it does not use the powerful grammar of
graphics. The ggplot() function does that. To understand ggplot, you need to ask yourself,
what are the fundamental parts of every data graph? They are:
• Aesthetics - these are the roles that the variables play in each graph. A variable may
control where points appear, the color or shape of a point, the height of a bar and so
on.
• Geoms - these are the geometric objects. Do you need bars, points, lines?
• Statistics - these are the functions like linear regression you might need to draw a
line.
• Scales - these are legends that show, for example, which point symbol represents
females and which represents males.
• Facets - these are the groups in your data. Faceting by gender would cause the graph
to repeat for the two genders.
In R for SAS and SPSS Users and R for Stata Users I showed how to create almost all the
graphs using both qplot() and ggplot(). For the remainder of this page I use only
ggplot() because it is the more flexible function and by focusing on it, I hope to make it
easier to learn.
Let us start our use of the ggplot() function with a single stacked bar plot.
It is not a very popular plot, but it helps demonstrate how different the grammar of
graphics perspective is. On the x-axis there really is no variable, so I plugged in a call to
the factor() function that creates an empty one on the fly. I then fill the single bar in
using the fill argument. There is only one type of geometric object on the plot, which I
add with geom_bar. The colors are a bit garish, but they are chosen so that colorblind
people (10% of males) can still read them.
[Figure: single stacked bar of counts filled by workshop (R, SAS, SPSS, Stata); x-axis labeled factor("")]
The x-axis comes out labeled as "factor("")" but we can over-write that with a title for the
x-axis. What is particularly interesting is that this can become a pie chart simply by
changing its coordinate system to polar. The final line of code changes the label on the
discrete x-axis to blank with "".
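The code for these two plots is not present in this copy. A minimal sketch of what the text describes, assuming the ggplot2 package is installed; the small mydata100 built here is a made-up stand-in for the practice data, which is not included.

```r
library(ggplot2)

# hypothetical stand-in for the practice data set
mydata100 <- data.frame(
  workshop = factor(rep(c("R", "SAS", "SPSS", "Stata"),
                        times = c(31, 24, 25, 19)))
)

# single stacked bar: an empty factor on the x-axis, workshop as the fill
p_bar <- ggplot(mydata100, aes(x = factor(""), fill = workshop)) +
  geom_bar()

# polar coordinates turn the stacked bar into a pie chart;
# scale_x_discrete("") blanks the label on the discrete x-axis
p_pie <- p_bar +
  coord_polar(theta = "y") +
  scale_x_discrete("")
```

Printing p_bar or p_pie at the console draws the plot.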
[Figure: pie chart of counts by workshop (R, SAS, SPSS, Stata)]
Bar Plots
The upper left corner of the first plot above shows a bar plot of workshop
created with qplot(). From the grammar of graphics approach, that graph has only one
type of geometric object: bars. The ggplot() function itself only needs to specify the data
set to use. Note the unusual use of the plus sign "+" to add the effect of geom_bar() to
ggplot(). Only one variable plays an "aesthetic" role: workshop. The aes() function sets
that role. So here is one way to write the code:
[Figure: bar plot of count by workshop (R, SAS, SPSS, Stata)]
A very useful feature of the ggplot() function is that it can pass aesthetic roles to all the
functions that are "added" to it. For example, we can create exactly the same barplot with
this code:
In our case it's just as easy either way but I like the first approach since it ties the
aesthetic role clearly to the bars. However, as our graphs become more complex, it can be
a big time-saver to set as many aesthetic roles as possible in the ggplot() function call
and let it pass them through to the various other functions that we add on to build a more
complex plot.
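Neither version of the bar plot code survives in this copy. The sketch below shows the two placements of aes() that the text contrasts, assuming ggplot2 is installed and using a made-up stand-in for mydata100.

```r
library(ggplot2)

# hypothetical stand-in for the practice data set
mydata100 <- data.frame(
  workshop = factor(rep(c("R", "SAS", "SPSS", "Stata"),
                        times = c(31, 24, 25, 19)))
)

# first approach: the aesthetic role is set inside the geom
p1 <- ggplot(mydata100) +
  geom_bar(aes(workshop))

# second approach: the role is set in ggplot() and passed down to the geom
p2 <- ggplot(mydata100, aes(workshop)) +
  geom_bar()
```

Both objects draw the same bars; they differ only in where the mapping is declared.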
The grammar of graphics way of creating plots looks quite odd at first, especially when
you consider that qplot(workshop) also does the above plot! However, as graphs get more
complex, ggplot() can handle it using the same ideas while qplot() cannot.
Flipping from vertical to horizontal bars is easy with the addition of the coord_flip() function.
> ggplot(mydata100, aes(workshop)) +
+   geom_bar() + coord_flip()
[Figure: horizontal bar plot of count by workshop (R, SAS, SPSS, Stata)]
If you want to fill the bars with color, you can do that using the "fill" argument.
[Figure: bar plot of count by workshop, with bars filled by workshop]
The use of color above was, well, colorful, but it did not add any useful information.
However, when displaying bar plots of two factors, the fill argument becomes very useful.
You can display it several ways. Below I use fill to color the bars by workshop and set the
"position" to stack.
[Figure: stacked bar plot of count by gender (Female, Male), filled by workshop]
In the plot above, the height of the bars represents the total number of males and
females. This is fine if you want to compare counts, but if you want to compare
proportions of each gender that took each class, you would have to make the bars equal
heights. You can do that by simply changing the position to "fill".
[Figure: bar plot by gender (Female, Male) with equal-height bars showing proportions, filled by workshop]
Here is the same plot changing only the bar position to be "dodge".
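The code for the stack, fill and dodge variants is not present in this copy. A minimal sketch, assuming ggplot2 is installed; the simulated mydata100 is a stand-in for the practice data.

```r
library(ggplot2)

# hypothetical stand-in for the practice data set
set.seed(1)
mydata100 <- data.frame(
  workshop = factor(sample(c("R", "SAS", "SPSS", "Stata"), 100, replace = TRUE)),
  gender   = factor(sample(c("Female", "Male"), 100, replace = TRUE))
)

# shared base: gender on the x-axis, bars colored by workshop
base <- ggplot(mydata100, aes(gender, fill = workshop))

p_stack <- base + geom_bar(position = "stack")  # heights are total counts
p_fill  <- base + geom_bar(position = "fill")   # equal heights, proportions
p_dodge <- base + geom_bar(position = "dodge")  # bars side by side
```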
[Figure: dodged bar plot of count by gender (Female, Male), filled by workshop]
You can change any of the above colored graphs to shades of grey by simply adding the
scale_fill_grey() function. Here is the plot immediately above repeated in greyscale.
[Figure: dodged bar plot of count by gender (Female, Male), filled by workshop in greyscale]
You can get the same information that is in the above plot by making small separate plots
for one of the groups. You can accomplish that with the facet_grid() function. It accepts a
formula in the form "rows ~ columns", so using "gender ~ ." asks for two rows for the
genders (three if we had not removed missing values) and no columns.
[Figure: bar plots of count by workshop, one row of panels per gender]
Pre-summarized Data
The ggplot2 package summarizes your data for you. If it is already summarized, you can
create a small data frame of the results to plot.
[Figure: bar plot of a summary measure by myGroup (After, Before)]
Dot Charts
Dot charts are similar to bar charts, but since they are plotting points on both an x- and
y-axis, they require a special variable called "..count..". It calculates the counts and lets
you plot them on the y-axis. The points use the "bin" statistic. Since dot charts are
usually shown "sideways" I am adding the coord_flip() function.
[Figure: dot chart of count by workshop (R, SAS, SPSS, Stata), in two panels]
To add a title, use the opts() function and its title argument. Adding titles to axes is
trickier. You use four different functions depending on the axis and whether or not it is
discrete:
scale_x_discrete
scale_y_discrete
scale_x_continuous
scale_y_continuous
For a bar plot, the x-axis is discrete so I will use scale_x_discrete to assign a label to it. The
character sequence "\n" tells R to go to a new line in all R packages.
[Figure: bar plot of count by workshop titled "Workshop Attendance", with the x-axis labeled "Statistics Package Workshops" on two lines]
Histograms
Recall from our first example that you can use qplot to get a quick histogram:
qplot(posttest). However, as things get more complicated, ggplot() is easier to control.
The geom_histogram function is all you need. I have set the color of the bar edges to
white. Without that, the bars all run together in the same shade of grey.
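The histogram code is not present in this copy. A minimal sketch of the call the text describes, assuming ggplot2 is installed; the simulated posttest scores are a stand-in for the practice data.

```r
library(ggplot2)

# hypothetical stand-in for the practice data set
set.seed(1)
mydata100 <- data.frame(posttest = round(rnorm(100, mean = 82, sd = 8)))

# color sets the bar edges; white edges keep neighbouring bars distinct
p_hist <- ggplot(mydata100, aes(posttest)) +
  geom_histogram(color = "white")
```

With no binwidth given, ggplot2 picks a default number of bins and prints a message suggesting you choose your own.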
[Figure: histogram of posttest with white bar edges; y-axis count]
You can change the number of bars used using the binwidth argument. Since this many
bars do not touch, I did not bother setting the edge color to white.
> ggplot(mydata100, aes(posttest)) +
+   geom_histogram(binwidth = 0.5)
[Figure: histogram of posttest with binwidth 0.5; y-axis count]
> ggplot(mydata100, aes(posttest)) +
+   geom_density()
[Figure: density plot of posttest; y-axis density]
It is easy to layer many different geometric objects onto your plots. In this case to get the
same axis on the histogram as the density uses, I used a special ggplot2 variable named
"..density.." on the y-axis. I also added a "rug" of carpet-like tick marks on the x-axis.
[Figure: histogram of posttest on a density scale with an overlaid density curve and rug]
[Figure: histograms of posttest in two stacked panels]
Normal QQ Plots
Normal QQ plots are done in ggplot with the stat_qq() function and the sample aesthetic.
[Figure: normal QQ plot; x-axis theoretical, y-axis sample]
Strip Plots
With fairly small data sets, you can do strip plots using the point geom.
[Figure: strip plot of posttest by workshop (R, SAS, SPSS, Stata)]
With large data sets, you can use the jitter geom instead. Our data is so small that the
default amount of jitter makes it hard to even notice where each group ends. See the
books for details on controlling the amount of jitter.
[Figure: jittered strip plot of posttest by workshop (R, SAS, SPSS, Stata)]
Scatter and Line Plots
Various types of scatter and line plots can be done using different
geoms as shown below. You can, of course, add multiple geoms to a plot. For example, you might
want both points and lines, in which case you would simply add both geoms.
[Figure: scatter plot of posttest versus pretest]
When you add a line geom, ggplot sorts the data along the x-axis automatically. If you
had time-series data that were not sorted by date, it would do so.
> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_line()
[Figure: line plot of posttest versus pretest]
The path geom leaves the order of the data as it is; it does not sort it before connecting
the points. See the books for more examples.
> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_path()
[Figure: path plot of posttest versus pretest, points connected in data order]
Large data sets provide a challenge since so many points are obscured by other points.
First let us create a data set with 5,000 points.
[Figure: scatter plot of posttest2 versus pretest2 for 5,000 points]
Next I will use very small sized points and lay a set of 2D density contours on top of
them. To help see the contours more clearly, I will not jitter the points.
[Figure: scatter plot of posttest2 versus pretest2 with small points and 2D density contours]
Finally, I will create a hexbin plot, which replaces bunches of points with a larger
hexagonal symbol.
[Figure: hexbin plot of posttest2 versus pretest2, with a count legend from 20 to 100]
The ggplot() function makes it particularly easy to add fit lines to scatter plots. Simply
adding the geom_smooth() function does the trick.
[Figure: scatter plot of posttest versus pretest with a smoothed fit line]
Adding a linear regression fit requires only the addition of the "method = lm" argument.
> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_point() + geom_smooth(method = lm)
[Figure: scatter plot of posttest versus pretest with a linear regression fit]
To plot labels instead of point characters, add the label aesthetic. I placed "size = 3" in
the geom_text function to clarify its role. I could have put it in the aes() function call
within the ggplot() call, but then it would have added a useless legend indicating what 3
represented, when it is merely a size.
> ggplot(mydata100,
+   aes(pretest, posttest, label = as.character(gender))) +
+   geom_text(size = 3)
[Figure: scatter plot of posttest versus pretest with each point drawn as the text label Female or Male]
To use point shapes to represent the value of a third variable, simply set the shape
aesthetic.
> ggplot(mydata100, aes(pretest, posttest)) +
+   geom_point(aes(shape = gender))
[Figure: scatter plot of posttest versus pretest with point shape set by gender]
One way to use a different fit for each group is to do them on the same plot. This involves
setting aesthetics for both linetype and point shape. You can place these in the main
ggplot() function call, but since linetype applies only to geom_smooth and shape applies
only to geom_point, I prefer to place them in those function calls. I tend to think of lines
being added to the scattered points, but in this case I placed the geom_point() call last so
that the shading from the gray confidence intervals would not shade the points
themselves.
[Figure: scatter plot of posttest versus pretest with separate fit lines and point shapes for Female and Male; legend titled gender]
Another way to display linear fits per group is to facet the plot.
[Figure: scatter plots of posttest versus pretest with linear fits, one panel per gender]
Box Plots
The ggplot package offers considerable control over how you can do box plots. Here I plot
the raw points and then the boxes on top of them. This hides the points that are actually
in the middle 50% of the data. They are usually dense and of less interest than the points
that are further out. If you have a lot of data you might consider using geom_jitter() to
spread the points around, preventing over-plotting.
[Figure: box plots of posttest by workshop (R, SAS, SPSS, Stata), drawn over the raw points]
The ggplot package offers a nearly endless array of combinations to visualize your data. I
Graphics, Traditional
R offers three main graphics packages: traditional (or base), lattice and ggplot2. Traditional
graphics are built into R, create nice looking graphs, and are very flexible. However, they require
a lot of work when repeating a graph for different groups in your data. Lattice graphics excel at
repeating graphs for various groups. The ggplot2 package also deals with groups well and is quite
a bit more flexible than lattice graphics. This section deals just with traditional R graphics
functions. Our books devote 130 pages to describing the relationship among these packages and
explaining how to create each type of plot. However, if you look at the examples below, you will
often be able to plug your variables into the code to create the graph you need. The practice data
set is shown here. The programs and the data they use are also available for download
here.
R for SAS and SPSS Users and R for Stata Users contain examples for advanced users. Paul
Murrell's book R Graphics also offers excellent coverage of traditional graphics in great
detail.
Bar Plots
[Figure: bar plot by myGroup (After, Before)]
This bar plot summarizes the variable q4. If the data is a factor (q4 is not) then plot() will
do it automatically since it is a generic function. You can also use barplot() if the data is
summarized with table() first.
[Figure: bar plot of frequencies for the values 2 through 5]
This plots a factor, gender. The plot() function recognizes that it is a factor and so it summarizes it
before plotting it. Alternatively, you can use barplot() on the frequencies obtained by table().
> plot(gender) # or...
> barplot(table(gender))
[Figure: bar plot of frequencies for gender (Female, Male)]
Using traditional graphics functions you can turn a graph on its side by adding the
argument, horiz = TRUE. As before, plot() will do frequencies automatically and barplot()
requires some sort of summarization, in this case table().
[Figure: horizontal bar plot of frequencies by workshop]
A stacked bar plot is like a rectangular pie chart. You can make one by converting the
output from table() into a matrix. Note that it drops the value labels so we would have to
add them to make this useful. More on that later.
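As a minimal sketch of the conversion described above, the following builds a one-column matrix from a frequency table and plots it as a single stacked bar. The workshop values are simulated stand-ins for the practice data, and the legend.text argument is my addition to restore the labels.

```r
# Simulated stand-in for the practice data (an assumption, not the real file).
set.seed(1)
workshopNames <- c("R", "SAS", "SPSS", "Stata")
workshop <- factor(sample(workshopNames, 100, replace = TRUE),
                   levels = workshopNames)

counts  <- table(workshop)      # one-dimensional frequency table
stacked <- as.matrix(counts)    # 4 x 1 matrix: one column gives one stacked bar

# Each row of the matrix becomes one segment of the bar.
barplot(stacked, legend.text = rownames(stacked))
```

Because the matrix has a single column, barplot() stacks its rows instead of drawing them side by side.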
[Figure: stacked bar plot.]
You can visualize frequencies on two factors by using plot(). It will label the x- and y-axes so I
have suppressed that using the arguments xlab="" and ylab="". The barplot() function can also do
this plot if you first summarize the data with another function, table() in this case. However, it
does not label the genders on the y-axis, making it a bit more work.
You can also do this plot using the mosaicplot() function. It uses table() to get frequencies. It
would display "table(workshop, gender)" in the main title if I had not suppressed it with the
argument main="".
[Figure: mosaic plot of gender by workshop.]
The mosaicplot() function can also handle more than two factors. Our practice data set only has
two, so we will use the Titanic survivor data that comes with R. The plot below is much larger
than the others because displaying the third variable takes quite a bit more space.
[Figure: mosaic plot titled "Titanic", showing Sex (Male, Female) and survival (Yes, No).]
So far we have been plotting frequencies. An advantage barplot() has over plot() is that you can
get the height of bars to represent any measure you like. Below I use tapply() to get the means of
q1 by gender, store it in myMeans and then plot those means.
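A minimal sketch of that approach, using simulated values in place of the practice data (the data-generating lines are assumptions; only tapply(), barplot() and the name myMeans come from the text):

```r
# Simulated stand-in for the practice data.
set.seed(2)
gender <- factor(rep(c("Female", "Male"), each = 30))
q1     <- sample(1:5, 60, replace = TRUE)

myMeans <- tapply(q1, gender, mean)   # one mean per level of gender
barplot(myMeans, ylab = "Mean of q1")
```

The bar heights are now group means, not counts, which plot() on a factor cannot do.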
[Figure: bar plot of mean q1 for Female and Male.]
We can get means broken down by both gender and workshop by including both of them on the
tapply() call. To include more than one factor in tapply(), you must supply the factors in a list (or
a data frame, which is a type of a list). Note that we do not get labels for the workshops. You could
add them with the legend() function (see next example) but this is a good example of something
the ggplot() function would do automatically. I never use the barplot() function for more than one
factor.
[Figure: bar plot of mean q1 by workshop within Female and Male.]
All the traditional graphics function calls can be embellished with a variety of
arguments: col for color, xlab/ylab for x- and y-axis labels, main and sub for main and
sub-titles. The legend function provides an extreme level of control over legends or scales.
However, the functions in the ggplot2 package will do very nice legends automatically.
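The two-factor case and the embellishments can be sketched together. The data below are simulated, and the colors, titles and legend position are arbitrary choices of mine, not the original settings.

```r
# Simulated, balanced stand-in for the practice data.
set.seed(3)
workshopNames <- c("R", "SAS", "SPSS", "Stata")
gender   <- factor(rep(c("Female", "Male"), each = 40))
workshop <- factor(rep(workshopNames, times = 20), levels = workshopNames)
q1       <- sample(1:5, 80, replace = TRUE)

# Factors go in a list; the result is a workshop-by-gender matrix of means.
myMeans <- tapply(q1, list(workshop, gender), mean)

myColors <- gray(c(0.9, 0.7, 0.5, 0.3))
barplot(myMeans, beside = TRUE, col = myColors,
        main = "Mean q1 by Workshop and Gender",
        xlab = "Gender", ylab = "Mean of q1")

# barplot() does not label the workshops, so add a legend by hand.
legend("topright", legend = rownames(myMeans), fill = myColors)
```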
[Figure: bar plot by Workshop with a legend for Female and Male.]
Graphics parameters control R's traditional plot functions. You can get a list of them by
simply running par(). One of the parameters is mfrow. It sets up multi-frame plots in
rows. Once you have set how many rows (first value) and columns (second) then all the
plots that follow will fill the rows as we read: left to right, top to bottom. Here is an
example.
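A minimal sketch of a 2 by 2 layout follows. The posttest values are simulated and the choice of four plots is arbitrary; saving and restoring the old settings is a habit I find useful, not something the text requires.

```r
set.seed(4)
posttest <- round(rnorm(100, mean = 82, sd = 7))   # simulated scores

oldPar <- par(mfrow = c(2, 2))   # 2 rows, 2 columns; keep the old settings
hist(posttest, main = "Histogram")
boxplot(posttest, main = "Box plot")
qqnorm(posttest, main = "Normal QQ plot")
plot(posttest, main = "Index plot")
par(oldPar)                      # back to one plot per page
```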
[Figure: multi-frame plot.]
Pie Charts
A pie chart is easy to do using pie() but the slices are empty by default. The col argument
can fill in shades of gray or colors.
> pie( table(workshop),
>   col = c("white", "gray90", "gray60", "black") )
Dot Charts
The dotchart() function works just like barplot() in that you either provide it values
directly or use other summarization functions to get those values. Below I use table() to
get frequencies. By default dotchart() uses open circles as its plot character. The
argument pch = 19 changes that to be a solid circle, and the cex argument makes the
character bigger through character expansion.
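A short sketch of those arguments, with made-up workshop counts standing in for the practice data:

```r
# Made-up frequencies for illustration.
workshopNames <- c("R", "SAS", "SPSS", "Stata")
workshop <- factor(rep(workshopNames, times = c(12, 10, 14, 16)),
                   levels = workshopNames)
counts <- table(workshop)

# as.numeric() avoids the warning dotchart() gives for table objects.
dotchart(as.numeric(counts), labels = names(counts),
         pch = 19, cex = 1.5, xlab = "Frequency")
```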
[Figure: dot chart of workshop frequencies (R, SAS, SPSS, Stata) within Female and Male.]
Histograms
A basic histogram is very easy to get. Note that it adds its own main title to the plot. You
can suppress that by adding: main = "".
> hist(posttest)
[Figure: histogram of posttest.]
You can change the number of bars in the histogram with the breaks argument. The
lines() function can add a density curve to the plot, and the rug() function adds shag
carpet-like tick marks to the x-axis where each data point appears.
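A minimal sketch with simulated scores. Note the probability = TRUE argument, which is my assumption about how the curve was made to fit: without it the bars are counts and the density curve would sit near zero.

```r
set.seed(6)
posttest <- rnorm(100, mean = 82, sd = 7)   # simulated scores

# breaks is a suggestion; R picks "pretty" cut points near that number.
hist(posttest, breaks = 20, probability = TRUE, main = "")
lines(density(posttest))   # kernel density estimate on the same scale
rug(posttest)              # one tick per observation on the x-axis
```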
[Figure: histogram of posttest with a density curve and rug.]
You can select subsets of your data to plot by using logical selections as in any R function
call. Here I have selected the males. It displays the logical selection as the label on the x-
axis. You can suppress that with: xlab = "". I have also used the col argument to change
the color of the bars to gray.
[Figure: histogram of posttest[gender == "Male"]; y-axis: Frequency.]
Normal QQ Plots
A normal QQ plot displays a fairly straight line when the data are normally distributed. R
has a qqnorm() function that is built in, but I prefer the qq.plot() function in the car
(Companion to Applied Regression) package since it includes a 95% confidence interval.
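For readers without the car package, here is a sketch of the built-in route using simulated scores. It draws a reference line with qqline() but no confidence band.

```r
set.seed(7)
posttest <- rnorm(100, mean = 82, sd = 7)   # simulated scores

qq <- qqnorm(posttest)   # draws the plot and returns its coordinates
qqline(posttest)         # line through the first and third quartiles
```

Points that hug the line suggest the data are close to normal.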
[Figure: normal QQ plot of posttest against norm quantiles.]
Strip Charts
Strip charts are scatter plots of single variables. To prevent points from obscuring one
another, they can either be moved around a bit at random (jittered) or stacked upon one
another. The multi-frame plot below shows both approaches.
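Both approaches can be sketched with the method argument of stripchart(). The scores are simulated and the titles are my own wording.

```r
set.seed(8)
posttest <- round(rnorm(100, mean = 82, sd = 7))   # simulated scores

oldPar <- par(mfrow = c(2, 1))   # two frames, one above the other
stripchart(posttest, method = "jitter", main = "Stripchart with Jittering")
stripchart(posttest, method = "stack",  main = "Stripchart with Stacking")
par(oldPar)
```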
[Figure: two strip charts of posttest, one jittered and one titled "Stripchart with Stacking".]
You can display either type of strip chart by group.
[Figure: strip chart of posttest by workshop (R, SAS, SPSS, Stata).]
Scatter plots are the default when you supply two numeric variables to the plot() function.
[Figure: scatter plot of posttest against pretest.]
The type argument controls whether plot() displays points ("p"), lines ("l"), both ("b"), or
histogram-like needles to each point ("h"). Displaying lines would make sense if our data were
collected through time and displayed overall or seasonal trends. The order of the points in our data
set means nothing so we get a mess of zig-zagging lines. Note that "main" changes the title, not the
display; that is controlled only by the type argument.
> plot( pretest, posttest, type = "p", main = 'type="p"' )
> plot( pretest, posttest, type = "l", main = 'type="l"' )
> plot( pretest, posttest, type = "b", main = 'type="b"' )
> plot( pretest, posttest, type = "h", main = 'type="h"' )
[Figure: four scatter plots of posttest against pretest, one for each value of type.]
If you have many points plotting on top of one another, as often happens with 1 to 5
Likert-scale data, you can add some jitter (random variation) to the points to get a better
view of overall trends.
> plot( q1, q4, main = "Likert Scale Without Jitter" )
> plot( jitter(q1, 3), jitter(q4, 3), main = "Likert Scale With Jitter" )
[Figure: scatter plots of q4 against q1 and against jitter(q1, 3).]
The problem of overplotting becomes severe when you have thousands of points. In the
next example I generate a new data set containing 5,000 observations. Then I plot them
first using the default settings (left). Many of the points are obscured by other points. On
the right side I plot the data using a much smaller point character and add some jitter so
you can see many more of the points.
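A sketch of that comparison. How the original 5,000 observations were generated is not shown, so the rnorm() calls below are assumptions; pretest2 and posttest2 follow the axis labels in the figures.

```r
set.seed(9)
pretest2  <- round(rnorm(5000, mean = 80, sd = 5))
posttest2 <- round(pretest2 + rnorm(5000, mean = 3, sd = 3))

oldPar <- par(mfrow = c(1, 2))
plot(pretest2, posttest2)                     # default: heavy overplotting
plot(jitter(pretest2, 4), jitter(posttest2, 4),
     pch = ".")                               # tiny points plus jitter
par(oldPar)
```

Rounding to whole numbers is what stacks the points on top of each other; jitter() spreads them back out.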
[Figure: scatter plots of posttest2 against pretest2 (left) and against jitter(pretest2, 4) (right).]
Another way to do scatter plots of large data sets is to replace a set of points with a single large
hexagonal point. The hexbin() function does just that. The plot below is shown larger than most
because the values in the scale showing the number of counts in each hexagon overwrite one
another in smaller sizes.
[Figure: hexbin plot of posttest2 against pretest2 with a Counts legend.]
A final way to get a scatter plot for large numbers of observations is to use the smoothScatter()
function shown below. The white lines that divide the scatter into rectangles look oddly spaced in
this low-resolution image, but they look much better in a high-resolution version for publication.
smoothScatter(pretest2, posttest2)
[Figure: smoothScatter plot of posttest2 against pretest2.]
To see what type of line might fit your data, the lowess() function is a good place to start.
The lines() function adds the lowess() fit to the data.
[Figure: scatter plot of posttest against pretest with a lowess fit.]
That looks like a fairly straight line so we might want to fit that with a simple linear regression.
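Both fits can be sketched on one plot. The data are simulated, and the name myModel and the dashed line type are my choices.

```r
set.seed(10)
pretest  <- round(rnorm(100, mean = 75, sd = 5))
posttest <- round(pretest + rnorm(100, mean = 7, sd = 4))

plot(pretest, posttest)
lines(lowess(pretest, posttest))   # locally weighted smooth

myModel <- lm(posttest ~ pretest)  # simple linear regression
abline(myModel, lty = 2)           # dashed least-squares line
```

If the lowess curve and the dashed line nearly coincide, a straight-line model is a reasonable summary.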
[Figure: scatter plot of posttest against pretest with a linear regression line.]
To use point characters to identify groups in your data, you can add the pch argument. It accepts
numeric vectors, so to use a factor like gender, nest it within the as.numeric() function. By using
logical selections on the variables, such as gender == "Male" you can easily get the abline()
function to do separate lines for each group. You can also use which(gender == "Male") to
select groups while eliminating missing values a bit more cleanly; see the books for
obsessive details on that. The lty argument sets the line type for each group.
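A sketch of the whole recipe with simulated data. The legend and the helper name isMale are my additions.

```r
set.seed(11)
gender   <- factor(rep(c("Female", "Male"), each = 50))
pretest  <- round(rnorm(100, mean = 75, sd = 5))
posttest <- round(pretest + rnorm(100, mean = 7, sd = 4))

# as.numeric() turns the factor into codes: Female = 1, Male = 2.
plot(pretest, posttest, pch = as.numeric(gender))

isMale <- which(gender == "Male")   # which() silently drops NA comparisons
abline(lm(posttest[isMale]  ~ pretest[isMale]),  lty = 1)   # males: solid
abline(lm(posttest[-isMale] ~ pretest[-isMale]), lty = 2)   # females: dashed

legend("topleft", legend = levels(gender), pch = 1:2, lty = c(2, 1))
```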
[Figure: scatter plot of posttest against pretest with separate point characters and lines by gender.]
Plotting Labels Instead of Points
A helpful way to display group membership on a scatter plot is to plot labels instead of
other plot characters. If one character will suffice, the pch argument will do this. Since
gender is a factor and I need character values for this plot, I enclose gender in the
as.character() function. Below you can easily see the lowest scoring person on both
pretest and posttest is male.
[Figure: scatter plot of posttest against pretest with points drawn as M and F.]
The pch argument will only display the first character. That's often a good idea since it minimizes
point overlap. However, you can use whole labels by suppressing all plot characters with type =
"n" and adding them with the text() function. Although this plot is quite a mess with our practice
data set, it works quite well with small data sets.
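Both variants can be sketched on a small made-up data set (the twelve values below are invented for illustration):

```r
gender   <- factor(c("Female", "Male", "Male", "Female", "Male", "Female",
                     "Male", "Female", "Female", "Male", "Male", "Female"))
pretest  <- c(62, 65, 68, 70, 71, 73, 75, 77, 79, 81, 83, 85)
posttest <- c(66, 70, 75, 74, 80, 79, 84, 82, 88, 86, 92, 90)

# One character per point: pch uses only the first letter, "F" or "M".
plot(pretest, posttest, pch = as.character(gender))

# Whole labels: draw empty axes first, then place the text.
plot(pretest, posttest, type = "n")
text(pretest, posttest, labels = as.character(gender))
```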
[Figure: scatter plot of posttest against pretest with points drawn as the labels Female and Male.]
Box Plots
Box plots are easy to do with the plot() function. R also has a boxplot() function that
allows for additional control. See books for details.
[Figure: box plots.]
Statistics
Below is a comparison of the commands used to perform various statistical analyses in R,
SAS, SPSS and Stata. For R functions that are not included in base R, the library()
function loads the package that contains the function right before it is used. The variables
gender and workshop are categorical factors and q1 to q4, pretest and posttest are
considered continuous and normally distributed.
The practice data set is shown here. The programs and the data they use are also
available for download here. Detailed step-by-step explanations are in the books along
with the output of each analysis.
Analysis of Variance
R
Stata
anova posttest workshop
Correlate, Pearson
cor( mydata[3:6],
  method = "pearson",
  use = "pairwise" )
cor.test(mydata$q1,
  mydata$q2, use = "pairwise")
# Again, adjusting p-values for multiple testing.
library("Rcmdr")
rcorr.adjust( mydata[3:6] )
SAS
PROC CORR;
VAR q1-q4;
RUN;
SPSS
CORRELATIONS
/VARIABLES=q1 TO q4.
Correlate, Spearman
R
cor( mydata[3:6],
  method = "spearman",
  use = "pairwise" )
cor.test(mydata$q1,
  mydata$q2,
  use = "pairwise")
# Again, adjusting p-values for multiple testing.
library("Rcmdr")
rcorr.adjust(mydata[3:6], type = "spearman")
SAS
PROC CORR SPEARMAN;
VAR q1-q4;
RUN;
SPSS
NONPAR CORR
/VARIABLES=q1 to q4
/PRINT=SPEARMAN.
Stata
spearman q*
SAS
PROC FREQ;
TABLES workshop*gender / CHISQ;
RUN;
SPSS
CROSSTABS
/TABLES=workshop BY gender
/FORMAT=AVALUE TABLES
/STATISTIC=CHISQ
/CELLS=COUNT ROW
/COUNT ROUND CELL
Stata
Descriptive Statistics
summary(mydata)
library("Hmisc")
describe(mydata)
SAS
PROC MEANS;
VAR q1--posttest;
SPSS
Stata
summary q*
Frequencies
summary(mydata)
library ("Deducer")
frequencies(mydata)
SAS
PROC FREQ;
TABLES workshop--q4;
RUN;
SPSS
Stata
Kruskal-Wallis
SAS
PROC NPAR1WAY;
CLASS workshop;
VAR posttest;
SPSS
NPAR TESTS
/K-W=posttest BY
workshop(1 3).
Stata
Linear Regression
SAS
PROC REG;
MODEL q4 = q1-q3;
SPSS
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT q4
/METHOD=ENTER q1 q2 q3.
Stata
regress q4 q1-q3
lvr2plot
Sign Test
library("PASWR")
SIGN.test(posttest, pretest,
conf.level = .95)
SAS
myDiff=posttest-pretest;
PROC UNIVARIATE;
VAR myDiff;
RUN;
SPSS
NPTESTS
/RELATED TEST(q1 q2) SIGN
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05
CILEVEL=95.
Stata
t-Test, Independent
t.test(q1 ~ gender,
  data = mydata100)
SAS
PROC TTEST;
CLASS gender ;
VAR q1;
RUN;
SPSS
T-TEST
GROUPS = gender('m' 'f')
/VARIABLES = q1.
Stata
t-Test, Paired
SAS
PROC TTEST;
PAIRED pretest*posttest;
RUN;
SPSS
T-TEST
PAIRS=pretest WITH
posttest (PAIRED).
Stata
Variance Test
SAS
SPSS
Stata
* Or ...
sdtest posttest gender
SAS
SPSS
NPTESTS
/RELATED TEST(pretest posttest) SIGN WILCOXON.
Stata
SAS
SPSS
NPTESTS
/RELATED TEST(q1 q2) WILCOXON
/MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE
/CRITERIA ALPHA=0.05 CILEVEL=95.
Stata
signrank q1 gender
R provides a variety of methods for summarising data in tabular and other forms.
It is sometimes useful to work with a smaller version of a large data frame, by creating a representative
subset of the data, via random sampling:
A.small <- A[sample(nrow(A), 4), ] # select 4 rows at random
"which.min" & "which.max" return the element number of the lowest/highest value:
set.seed(123) # allow reproducible random numbers
r-show_data.html[27/01/2014 22:23:59]
R show data - summarize and tabulate data with R
x <- sample(10)
> which.max(x)
[1] 7
> x[which.max(x)]
[1] 10
This can be used in a data frame to extract the corresponding row containing the min/max value of one of the
columns:
A <- data.frame(x=rnorm(10), y=runif(10))
A[which.min(A$x), ]
#--Alternatively:
subset(A, x == min(x))
Other summaries:
x <- rnorm(100)
fivenum(x) # Tukey's five number summary, used to construct a boxplot:
boxplot(x) # see "?boxplot.stats" for more details
stem(x) # A stem-and-leaf plot
Matrix summaries:
A <- matrix(rnorm(50), nrow=10) # create 10x5 random number matrix
colSums(A); rowSums(A); colMeans(A); rowMeans(A) # self-explanatory
max.col(A) # maximum position for each row of a matrix, same as:
which.max(A[1,]); which.max(A[2,]) # etc.
Tables
Load some data on a sample of 20 galaxy clusters with a categorical classification status ("cctype")
indicating whether there is a cool core or not and a factor ("det") specifying which of two detectors was used
to make the X-ray observation of the cluster:
file <- "http://www.sr.bham.ac.uk/~ajrs/papers/sanderson09/sanderson09_table2.txt"
a <- read.table(file, header=TRUE, sep="|")
#
table(a$cctype) # count numbers in each cctype category
table(a$cctype, a$det) # 2-way table
xtabs(~ cctype + det, data=a) # alternative (formula) syntax
addmargins(xtabs(~ cctype + det, data=a)) # add row/col summary (default is sum)
prop.table(xtabs(~ cctype + det, data=a)) # show counts as proportions of total
-there is marginal evidence (p=0.07) of an interaction: clusters observed with ACIS-S are more likely to have
a cool core than not.
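The source does not show which test produced the quoted p-value. One plausible route is the chi-squared test of independence that summary() reports for a table. The sketch below uses made-up counts, not the galaxy cluster file, and the level labels are placeholders.

```r
# Made-up 20-row data frame; the counts and labels are assumptions.
a <- data.frame(
  cctype = factor(rep(c("CC", "NCC"), times = c(9, 11))),
  det    = factor(c(rep(c("I", "S"), times = c(3, 6)),
                    rep(c("I", "S"), times = c(8, 3))))
)

tab <- xtabs(~ cctype + det, data = a)
s <- summary(tab)   # chi-squared test of independence of the two factors
s
fisher.test(tab)    # exact alternative, often preferred for small counts
```

With cell counts this small the chi-squared approximation is rough, which is why the exact test is shown alongside it.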
For further information, you can find out more about how to access, manipulate, summarise, plot and analyse data
using R.
Also, why not check out some of the graphs and plots shown in the R gallery, with the accompanying R source
code used to create them.
Recode Data in R with R Recode Examples | RProgramming.net
Recode Data in R
specific rows is replaced. Note that if you want to replace NA with some
value you cannot use ==NA. You must use is.na(). See below for an
example.
The third example shows how to replace data based on more than one
The fourth example shows how to make a copy of an existing field.
Sometimes you don't want to recode data but instead just want another
column containing the same data. This example makes a new column
called CopyOfGrade and fills it with the data from Grade. This isn't exactly
recoding but is related and comes up a lot since it is usually a good idea
to make a copy of a field and then do the recoding on the copy rather
# Then copy the data from the existing column into the new
one.
SchoolData$CopyOfGrade <- SchoolData$Grade
The fifth example shows how to recode data into a new numeric field
based on criteria from a numeric field. Note that with numeric fields you do
not surround the value with quotation marks. With a character field you do
surround the value with quotation marks (next example).
This example creates a new field called NewGrade based on the field
Grade. Note that, as with the above examples, you can again use & or
The sixth example shows how to recode data into a new character field
based on criteria from a numeric field. This example again creates a new
specified rows
SchoolData$NewGrade[SchoolData$Grade==5] <- "Grade Five"
The seventh example shows how to recode data into a new character field
based on criteria from a character field. This example creates a new field
called NewGrade based on the field Grade.
Recode Data in R with R Recode Examples _ RProgramming net htm(27/0 l/2014 22:23:47]
SchoolData$NewGrade <- NA
# Then recode the old field into the new one for the
specified rows
SchoolData$NewGrade[SchoolData$Grade=="Grade Five"] <- "Grade Five"
The eighth example shows how to recode data into a new numeric field
based on criteria from a character field. This example again creates a new
field called NewGrade based on the field Grade.
This is where things get a little weird. If you want to recode data into a
field and pull that data from another field, you have to specify the criteria
on both sides of the <-. If you don't, R will still recode but you won't get
the results you're expecting. For example, let's say you want to copy the
data from Grade into NewGrade but only where SchoolType is
"Elementary". You might think that this will work:
And it will work! But you won't get the results you are expecting. R won't
copy the data from Grade for only the rows where SchoolType is
Elementary. Instead, it will start at the top of the data frame and copy
each row. To recode correctly you have to specify the criteria on both
sides of the <-.
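The original code for this step is missing from the source, so the following is only a sketch of the behavior described. The four-row SchoolData frame and the helper names isElem and wrong are made up for illustration.

```r
# Tiny made-up data frame; column names follow the text.
SchoolData <- data.frame(
  SchoolType = c("Elementary", "Middle", "Elementary", "High"),
  Grade      = c(3, 7, 5, 10),
  stringsAsFactors = FALSE
)
isElem <- SchoolData$SchoolType == "Elementary"

# Criteria on the left side only: R fills the selected rows with values
# taken from the TOP of Grade (3, then 7) and warns about the length mismatch.
wrong <- rep(NA_real_, nrow(SchoolData))
suppressWarnings(wrong[isElem] <- SchoolData$Grade)

# Criteria on both sides: each selected row receives its own Grade.
SchoolData$NewGrade <- NA
SchoolData$NewGrade[isElem] <- SchoolData$Grade[isElem]
```

Here row 3 ends up with 7 in the one-sided version, which is the Grade of row 2, not its own value of 5.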
The recode() command from the car package is another great way to
recode data in R. Recode from car can be very powerful and is a good
alternative to the code above.
If you want to recode from car you have to first install the car package and
then load it for use.
If you want to recode based on text, use the ' mark around the text.
Recode can recode data into a new field. This code creates a new field
called NewGrade based on Grade. Note that if you don't specify that a value
is recoded R will just copy the existing value into the new field.
# Recode grade 3 to NA
SchoolData$Grade <- recode(SchoolData$Grade, "3=NA")
# or recode NA to 7
SchoolData$Grade <- recode(SchoolData$Grade, "NA=7")
One advantage to recode is that it can recode multiple values in one line
of code.
# Recode grades
SchoolData$Grade <- recode(SchoolData$Grade,
  "1:5='Elementary'; 6:8='Middle'; else='High'")
There are other options that can be used with recode in car. See the official
car package manual to learn more: http://cran.r-project.org/web/packages/car/car.pdf.
[Figure: Recode in action.]
The main bootstrapping function is boot( ) and has the following format:
bootobject <- boot(data= , statistic= , R=, ...) where
parameter   description
data        A vector, matrix, or data frame
statistic   A function that produces the k statistics to be bootstrapped (k=1 if
            bootstrapping a single statistic).
            The function should include an indices parameter that the boot() function
            can use to select cases for each replication (see examples below).
R           Number of bootstrap replicates
...         Additional parameters to be passed to the function that produces the
            statistic of interest
boot( ) calls the statistic function R times. Each time, it generates a set of random indices, with
replacement, from the integers 1:nrow(data). These indices are used within the statistic function to
select a sample. The statistics are calculated on the sample and the results are accumulated in the
bootobject. The bootobject structure includes
element     description
t0          The observed values of k statistics applied to the original data.
t           An R x k matrix where each row is a bootstrap replicate of the k statistics.
You can access these as bootobject$t0 and bootobject$t.
Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to
examine the results. If the results look reasonable, you can use the boot.ci( ) function to obtain confidence
intervals for the statistic(s).
The format is
boot.ci(bootobject, conf=, type= ) where
parameter   description
bootobject  The object returned by the boot function
conf        The desired confidence interval (default: conf=0.95)
type        The type of confidence interval returned. Possible values are "norm",
            "basic", "stud", "perc", "bca" and "all" (default: type="all")
# view results
results
plot(results)
# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp
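The fragments above assume a results object whose creation is not shown in this copy. A sketch of how a three-statistic object might be produced follows. The built-in mtcars data, the function name bs, the formula and R = 500 are my assumptions, chosen to match the intercept, wt and disp comments.

```r
library(boot)   # ships with standard R installations

# Statistic function: regression coefficients on the resampled rows.
# boot() passes the data first and the row indices second.
bs <- function(data, indices, formula) {
  d <- data[indices, ]
  coef(lm(formula, data = d))
}

set.seed(1234)
results <- boot(data = mtcars, statistic = bs, R = 500,
                formula = mpg ~ wt + disp)

results$t0       # coefficients on the original data (k = 3)
dim(results$t)   # R x k matrix of bootstrap replicates

# Percentile interval for the second statistic (the wt coefficient).
boot.ci(results, conf = 0.95, type = "perc", index = 2)
```

The formula argument is not a boot() parameter, so it travels through ... to the statistic function.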
Going Further
The boot( ) function can generate both nonparametric and parametric resampling. For the
nonparametric bootstrap, resampling methods include ordinary, balanced, antithetic and permutation.
For the nonparametric bootstrap, stratified resampling is supported. Importance resampling weights can
also be specified.
The boot.ci( ) function takes a bootobject and generates 5 different types of two-sided nonparametric
confidence intervals. These include the first order normal approximation, the basic bootstrap interval,
the studentized bootstrap interval, the bootstrap percentile interval, and the adjusted bootstrap
percentile (BCa) interval.
Learning More
Good sources of information include Resampling Methods in R: The boot Package by Angelo Canty,
Getting started with the boot package by Ajay Shah, Bootstrapping Regression Models by John Fox,
and Bootstrap Methods and Their Applications by Davison and Hinkley.
Correspondence Analysis
Correspondence analysis provides a graphic method of exploring the relationship between variables in a
contingency table. There are many options for correspondence analysis in R. I recommend the ca
package by Nenadic and Greenacre because it supports supplementary points, subset analyses, and
comprehensive graphics. You can obtain the package here.
Although ca can perform multiple correspondence analysis (more than two categorical variables), only
simple correspondence analysis is covered here. See their article for details on multiple CA.
# Correspondence Analysis
library(ca)
mytable <- with(mydata, table(A,B)) # create a 2 way table
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
fit <- ca(mytable)
print(fit) # basic results
summary(fit) # extended results
plot(fit) # symmetric map
plot(fit, mass = TRUE, contrib = "absolute", map =
   "rowgreen", arrows = c(FALSE, TRUE)) # asymmetric map
Row points (column points) that are closer together have more similar column profiles (row profiles).
Keep in mind that you can not interpret the distance between row and column points directly.
The second graph is asymmetric, with rows in the principal coordinates and columns in reconstructions
of the standardized residuals. Additionally, mass is represented by points and columns are represented
by arrows. Point intensity (shading) corresponds to the absolute contributions for the rows. This
example is included to highlight some of the available options.
Tree-Based Models
Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of
data, while developing easy to visualize decision rules for predicting a categorical (classification tree)
or continuous (regression tree) outcome. This section briefly describes CART modeling, conditional
inference trees, and random forests.
CART Modeling via rpart
Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be
generated through the rpart package. Detailed information on rpart is available in An Introduction to
Recursive Partitioning Using the RPART Routines. The general steps are provided below followed by two
examples.
formula    is in the format
           outcome ~ predictor1+predictor2+predictor3+etc.
data=      specifies the data frame
method=    "class" for a classification tree
           "anova" for a regression tree
control=   optional parameters for controlling tree growth. For example,
           control=rpart.control(minsplit=30, cp=0.001) requires that the minimum
           number of observations in a node be 30 before attempting a split and that a
           split must decrease the overall lack of fit by a factor of 0.001 (cost
           complexity factor) before being attempted.
2. Examine the results
The following functions help us to examine the results.
In trees created by rpart( ), move to the LEFT branch when the stated condition is true (see the graphs
below).
3. Prune tree
Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that
minimizes the cross-validated error, the xerror column printed by printcp( ).
Specifically, use printcp( ) to examine the cross-validated error results, select the complexity
parameter associated with minimum error, and place it into the prune( ) function. Alternatively, you
can use the code fragment
to automatically select the complexity parameter associated with the smallest cross-validated error.
Thanks to HSAUR for this idea.
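The code fragment itself did not survive in this copy. One way to carry out the selection just described is sketched below. It relies on the cptable component of an rpart fit and on the kyphosis data that ships with rpart; the names bestCp and pfit are mine.

```r
library(rpart)   # ships with standard R installations

set.seed(42)     # the cross-validation inside rpart() is random
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method = "class", data = kyphosis)

printcp(fit)     # table of CP values with cross-validated error (xerror)

# Pick the CP value on the row with the smallest cross-validated error.
bestCp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit   <- prune(fit, cp = bestCp)
```

Because the cross-validated error depends on random fold assignment, the selected value can change between runs unless the seed is fixed.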
library(rpart)
# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
method="class", data=kyphosis)
# plot tree
plot(fit, uniform=TRUE,
main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
method="anova", data=cu.summary)
# plot tree
plot(fit, uniform=TRUE,
main="Regression Tree for Mileage ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
It turns out that this produces the same tree as the original.
ctree(formula, data=)
The type of tree created will depend on the outcome variable (nominal factor, ordered factor,
numeric, etc. ). Tree growth is based on statistical stopping rules, so pruning should not be required.
Random Forests
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based
on random samples of variables), classifying a case using each tree in this new "forest", and deciding a
final predicted outcome by combining the results across all of the trees (an average in regression, a
majority vote in classification). Breiman and Cutler's random forest approach is implemented via the
randomForest package.
Here is an example.
Going Further
This section has only touched on the options available. To learn more, see the CRAN Task View on
Machine & Statistical Learning.
Cluster Analysis
R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the
many approaches: hierarchical agglomerative, partitioning, and model based. While there are no best
solutions for the problem of determining the number of clusters to extract, several approaches are
given below.
Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
Partitioning
K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters. The analyst looks for a bend in the
plot similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).
A robust version of K-means based on medoids can be invoked by using pam( ) instead of kmeans( ).
The function pamk( ) in the fpc package is a wrapper for pam that also prints the suggested number of
clusters based on optimum average silhouette width.
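The scree-style plot and the final K-means fit described above might look like the sketch below. It assumes the built-in iris measurements stand in for mydata, and that 3 clusters is the choice suggested by the bend in the plot; both are illustrative assumptions.

```r
set.seed(123)                    # k-means uses random starting centers
mydata <- scale(iris[, 1:4])     # assumption: iris stands in for mydata

# within groups sum of squares for 1 to 10 clusters
wss <- sapply(1:10, function(k) {
  kmeans(mydata, centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

# K-means solution with the chosen number of clusters
fit <- kmeans(mydata, centers = 3, nstart = 10)

# cluster means, and cluster membership appended to the data
aggregate(as.data.frame(mydata), by = list(cluster = fit$cluster), FUN = mean)
clustered <- data.frame(mydata, cluster = fit$cluster)
```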
Hierarchical Agglomerative
There are a wide range of hierarchical clustering approaches. I have had good luck with Ward's method
described below.
# Ward Hierarchical clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D") # "ward" was renamed "ward.D" in R 3.1.0
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
The pvclust( ) function in the pvclust package provides p-values for hierarchical clustering based on
multiscale bootstrap resampling. Clusters that are highly supported by the data will have large p-values.
Interpretation details are provided by Suzuki. Be aware that pvclust clusters columns, not rows.
Transpose your data before using.
Model Based
Model based approaches assume a variety of data models and apply maximum likelihood estimation and
Bayes criteria to identify the most likely model and number of clusters. Specifically, the Mclust( )
function in the mclust package selects the optimal model according to BIC for EM initialized by
hierarchical clustering for parameterized Gaussian mixture models. (phew!). One chooses the model and
number of clusters with the largest BIC. See help(mclustModelNames) for details on the model chosen as
best.
probabilities (i.e., prior probabilities are based on sample sizes). In the examples below, lower case
letters are numeric variables and upper case letters are categorical factors.
Linear Discriminant Function
# Linear Discriminant Analysis with Jackknifed Prediction
library(MASS)
fit <- lda(G ~ x1 + x2 + x3, data=mydata,
   na.action="na.omit", CV=TRUE)
fit # show results
The code above performs an LDA, using listwise deletion of missing data. CV=TRUE generates jackknifed
(i.e., leave one out) predictions. The code below assesses the accuracy of the prediction.
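One way to assess the accuracy of jackknifed predictions is to cross-tabulate observed and predicted groups. The sketch below assumes the built-in iris data, with Species playing the role of the grouping factor G; the names ct and total_correct are illustrative.

```r
library(MASS)

# leave-one-out (jackknifed) LDA; fit$class holds the predicted groups
fit <- lda(Species ~ ., data = iris, CV = TRUE)

# rows are observed groups, columns are predicted groups
ct <- table(observed = iris$Species, predicted = fit$class)
ct

# percent correct within each observed group
diag(prop.table(ct, 1))

# total proportion correct
total_correct <- sum(diag(prop.table(ct)))
total_correct
```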
# Quadratic Discriminant Analysis with 3 groups applying
# resubstitution prediction and equal prior probabilities.
library(MASS)
fit <- qda(G ~ x1 + x2 + x3 + x4, data=na.omit(mydata),
   prior=c(1,1,1)/3)
Note the alternate way of specifying listwise deletion of missing data. Re-substitution (using the same
data to derive the functions and evaluate their prediction accuracy) is the default method unless
CV=TRUE is specified. Re-substitution will be overly optimistic.
Visualizing the Results
You can plot each observation in the space of the first 2 linear discriminant functions using the
following code. Points are identified with the group ID.
The following code displays histograms and density plots for the observations in each group on the first
linear discriminant dimension. There is one panel for each group and they all appear lined up on the
same graph.
The partimat( ) function in the klaR package can display the results of linear or quadratic
classifications 2 variables at a time.
You can also produce a scatterplot matrix with color coding by group.
Test Assumptions
See (M)ANOVA Assumptions for methods of evaluating multivariate normality and homogeneity of
covariance matrices.
Principal Components and Factor Analysis
This section covers principal components and factor analysis. The latter includes both exploratory and
confirmatory methods.
Use cor=FALSE to base the principal components on the covariance matrix. Use the covmat= option to
enter a correlation or covariance matrix directly. If entering a covariance matrix, include the option
n.obs=.
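A principal components run using the cor= option might be sketched as follows. The built-in USArrests data is an assumption standing in for mydata.

```r
# principal components from the correlation matrix (cor = TRUE);
# cor = FALSE would use the covariance matrix instead
fit <- princomp(USArrests, cor = TRUE)

summary(fit)          # variance accounted for by each component
loadings(fit)         # component loadings
plot(fit, type = "lines")  # scree plot
head(fit$scores)      # the principal component scores
```

With cor = TRUE the component variances (fit$sdev^2) sum to the number of variables, which is a handy check that the analysis was based on the correlation matrix.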
The principal( ) function in the psych package can be used to extract and rotate principal components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
rotate can be "none", "varimax", "quatimax", "promax", "oblimin", "simplimax", or "cluster".
Exploratory Factor Analysis
The factanal( ) function produces maximum likelihood factor analysis.
The rotation= options include "varimax", "promax", and "none". Add the option scores="regression" or
"Bartlett" to produce factor scores. Use the covmat= option to enter a correlation or covariance matrix
directly. If entering a covariance matrix, include the option n.obs=.
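As a hedged illustration of the covmat= route, the sketch below fits a two-factor model to ability.cov, a covariance list that ships with R and already carries its own n.obs. The choice of two factors and the print settings are assumptions made for the example.

```r
# maximum likelihood factor analysis from a covariance matrix;
# ability.cov is a list with components cov, center and n.obs
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

# show loadings, hiding small values and sorting by factor
print(fit, digits = 2, cutoff = 0.3, sort = TRUE)

# uniquenesses (variance not explained by the factors)
fit$uniquenesses
```

Factor scores are not available in this form, because scores= needs the raw data rather than a covariance matrix.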
The factor.pa( ) function in the psych package offers a number of factor analysis related functions,
including principal axis factoring.
mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used.
Rotation can be "varimax" or "promax".
Going Further
The FactoMineR package offers a large number of additional functions for exploratory factor analysis.
This includes the use of both quantitative and qualitative variables, as well as the inclusion of
supplementary variables and observations. Here is an example of the types of graphs that you can
create with this package.
The GPArotation package offers a wealth of rotation options beyond varimax and promax.
[Path diagram: latent factors F1 and F2, observed variables X1 to X6, residuals e1 to e6]
Assume that we have six observed variables (X1, X2, ... , X6). We
hypothesize that there are two unobserved latent factors (F1, F2) that underlie the observed variables
as described in this diagram. X1, X2, and X3 load on F1 (with loadings lam1, lam2, and lam3). X4, X5,
and X6 load on F2 (with loadings lam4, lam5, and lam6). The double headed arrow indicates the
covariance between the two latent factors (F1F2). e1 thru e6 represent the residual variances (variance
in the observed variables not accounted for by the two latent factors). We set the variances of F1 and
F2 equal to one so that the parameters will have a scale. This will result in F1F2 representing the
correlation between the two latent factors.
For sem, we need the covariance matrix of the observed variables, thus the cov( ) statement in the
code below. The CFA model is specified using the specify.model( ) function. The format is arrow
specification, parameter name, start value. Choosing a start value of NA tells the program to choose a
start value rather than supplying one yourself. Note that the variances of F1 and F2 are fixed at 1 (NA in
the second column). The blank line is required to end the RAM specification.
You can use the boot.sem( ) function to bootstrap the structural equation model. See help(boot.sem)
for details. Additionally, the function mod.indices( ) will produce modification indices. Using
modification indices to improve model fit by respecifying the parameters moves you from a
confirmatory to an exploratory analysis.
For more information on sem, see Structural Equation Modeling with the sem Package in R, by John
Fox.
Generalized Linear Models
Generalized linear models are fit using the glm( ) function. The form of the glm function is
glm(formula, family=familytype(link=linkfunction), data=)
See help(glm) for other modeling options. See help(family) for other allowable link functions for each
family. Three subtypes of generalized linear models will be covered here: logistic regression, poisson
regression, and survival analysis.
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables. It is frequently preferred over discriminant function analysis because of its less
restrictive assumptions.
You can use anova(fit1,fit2, test="Chisq") to compare nested models. Additionally, cdplot(F~x,
data=mydata) will display the conditional density plot of the binary outcome F on the continuous x
variable.
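A logistic regression and a nested-model comparison could be sketched as below. The built-in mtcars data is an assumption, with the 0/1 variable vs as the binary outcome and mpg and wt as continuous predictors.

```r
# logistic regression: binary outcome vs, continuous predictor mpg
fit1 <- glm(vs ~ mpg, data = mtcars, family = binomial())
summary(fit1)             # display results
exp(coef(fit1))           # exponentiated coefficients (odds ratios)
p <- predict(fit1, type = "response")  # predicted probabilities

# larger model that nests fit1, then a likelihood ratio comparison
fit2 <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
cmp <- anova(fit1, fit2, test = "Chisq")
cmp

# conditional density plot needs a factor outcome
cdplot(factor(vs) ~ mpg, data = mtcars)
```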
Poisson Regression
Poisson regression is useful when predicting an outcome variable representing counts from a set of
continuous predictor variables.
# Poisson Regression
# where count is a count and
# x1-x3 are continuous predictors
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) # display results
If you have overdispersion (see if residual deviance is much larger than degrees of freedom), you may
want to use quasipoisson() instead of poisson().
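The overdispersion check can be made concrete with a short sketch. It assumes the built-in warpbreaks data (a count outcome with two factor predictors, used here only because it ships with R); the name ratio is illustrative.

```r
# Poisson fit to a count outcome
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson())

# residual deviance relative to its degrees of freedom;
# values well above 1 suggest overdispersion
ratio <- deviance(fit) / df.residual(fit)
ratio

# refit with quasipoisson, which estimates a dispersion parameter
qfit <- glm(breaks ~ wool + tension, data = warpbreaks, family = quasipoisson())
summary(qfit)$dispersion
```

The coefficients are the same in both fits; only the standard errors change, scaled by the square root of the estimated dispersion.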
Survival Analysis
Survival analysis (also called event history analysis or reliability analysis) covers a set of techniques for
modeling the time to an event. Data may be right censored: the event may not have occurred by the
end of the study or we may have incomplete information on an observation but know that up to a
certain time the event had not occurred (e.g. the participant dropped out of study in week 10 but was
alive at that time).
While generalized linear models are typically analyzed using the glm( ) function, survival analysis is
typically carried out using functions from the survival package. The survival package can handle one
and two sample problems, parametric accelerated failure models, and the Cox proportional hazards
model.
Data are typically entered in the format start time, stop time, and status (1=event occurred, 0=event
did not occur). Alternatively, the data may be in the format time to event and status (1=event
occurred, 0=event did not occur). A status=0 indicates that the observation is right censored. Data are
bundled into a Surv object via the Surv( ) function prior to further analyses.
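A minimal sketch of this workflow, assuming the lung data that ships with the survival package (there, status is coded 1=censored and 2=dead, so the event indicator is status == 2):

```r
library(survival)

# bundle time and event indicator into a Surv object
survobj <- Surv(lung$time, lung$status == 2)

# Kaplan-Meier survival curves, overall and by sex
km_all <- survfit(Surv(time, status == 2) ~ 1, data = lung)
km_sex <- survfit(Surv(time, status == 2) ~ sex, data = lung)
plot(km_sex, xlab = "Days", ylab = "Proportion surviving")

# log-rank test for a difference between the two groups
survdiff(Surv(time, status == 2) ~ sex, data = lung)

# Cox proportional hazards model with two predictors
cox <- coxph(Surv(time, status == 2) ~ age + sex, data = lung)
summary(cox)
```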
# display results
MaleMod
See Thomas Lumley's R news article on the survival package for more information. Other good sources
include Mai Zhou's Use R Software to do Survival Analysis and Simulation and M. J. Crawley's chapter
on Survival Analysis.
Matrix Algebra
Most of the methods on this website actually describe the programming of matrices. It is built deeply
into the R language. This section will simply cover operators and functions specifically suited to linear
algebra. Before proceeding you may want to review the sections on Data Types and Operators.
Operator or Function    Description
A * B                   Element-wise multiplication
A %*% B                 Matrix multiplication
A %o% B                 Outer product. AB'
crossprod(A,B)          A'B and A'A respectively.
crossprod(A)
t(A)                    Transpose
diag(x)                 Creates diagonal matrix with elements of x in the principal diagonal
diag(A)                 Returns a vector containing the elements of the principal diagonal
diag(k)                 If k is a scalar, this creates a k x k identity matrix. Go figure.
solve(A, b)             Returns vector x in the equation b = Ax (i.e., A^-1 b)
solve(A)                Inverse of A where A is a square matrix.
ginv(A)                 Moore-Penrose Generalized Inverse of A.
                        ginv(A) requires loading the MASS package.
y <- eigen(A)           y$val are the eigenvalues of A
                        y$vec are the eigenvectors of A
y <- svd(A)             Singular value decomposition of A.
                        y$d = vector containing the singular values of A
                        y$u = matrix with columns containing the left singular vectors of A
                        y$v = matrix with columns containing the right singular vectors of A
R <- chol(A)            Choleski factorization of A. Returns the upper triangular factor, such that R'R = A.
y <- qr(A)              QR decomposition of A.
                        y$qr has an upper triangle that contains the decomposition and a lower
                        triangle that contains information on the Q decomposition.
                        y$rank is the rank of A.
                        y$qraux a vector which contains additional information on Q.
                        y$pivot contains information on the pivoting strategy used.
cbind(A,B,...)          Combine matrices(vectors) horizontally. Returns a matrix.
rbind(A,B,...)          Combine matrices(vectors) vertically. Returns a matrix.
rowMeans(A)             Returns vector of row means.
rowSums(A)              Returns vector of row sums.
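A few of the operators in the table, applied to a small example. The matrix A and vector b below are made-up values, chosen so that A is symmetric and positive definite (which chol( ) requires).

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # symmetric, positive definite
b <- c(1, 2)

A * A        # element-wise multiplication
A %*% A      # matrix multiplication
crossprod(A) # same as t(A) %*% A

x    <- solve(A, b)   # solves A x = b
Ainv <- solve(A)      # inverse of A

e <- eigen(A)         # e$values and e$vectors
s <- svd(A)           # s$d, s$u, s$v
R <- chol(A)          # upper triangular, t(R) %*% R equals A

# checks (each should be TRUE)
all.equal(as.vector(A %*% x), b)
all.equal(A %*% Ainv, diag(2))
all.equal(t(R) %*% R, A)
```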
Matlab Emulation
The matlab package contains wrapper functions and variables used to replicate MATLAB function calls
as best possible. This can help porting MATLAB applications and code to R.
Going Further
The Matrix package contains functions that extend R to support highly dense or sparse matrices. It
provides efficient access to BLAS (Basic Linear Algebra Subroutines), Lapack (dense matrix), TAUCS
(sparse matrix) and UMFPACK (sparse matrix) routines.
Multidimensional Scaling
R provides functions for both classical and nonmetric multidimensional scaling. Assume that we have N
objects measured on p numeric variables. We want to represent the distances among the objects in a
parsimonious (and visual) way (i.e., a lower k-dimensional space).
# classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim
fit # view results
Nonmetric MDS
Nonmetric MDS is performed using the isoMDS() function in the MASS package.
Time Series
# save a numeric vector containing 48 monthly observations
# from Jan 2009 to Dec 2012 as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2012, 12), frequency=12)
# subset the time series (June 2012 to December 2012)
myts2 <- window(myts, start=c(2012, 6), end=c(2012, 12))
# plot series
plot(myts)
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the
stl() function. Note that a series with multiplicative effects can often be transformed into a series with
additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)
# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)
Exponential Models
Both the HoltWinters() function in the base installation, and the ets() function in the forecast package,
can be used to fit exponential models.
# simple exponential - models level
fit <- HoltWinters(myts, beta=FALSE, gamma=FALSE)
# double exponential - models level and trend
fit <- HoltWinters(myts, gamma=FALSE)
# triple exponential - models level, trend, and seasonal components
fit <- HoltWinters(myts)
# predictive accuracy
library(forecast)
accuracy(fit)
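The decomposition and exponential smoothing steps can be tried end to end with base R alone. The sketch below assumes the built-in AirPassengers series in place of myts, and log-transforms it because its seasonal swings grow with the level.

```r
myts <- log(AirPassengers)   # assumption: AirPassengers stands in for myts

# seasonal decomposition into seasonal, trend and remainder
fit_stl <- stl(myts, s.window = "periodic")
plot(fit_stl)

# triple exponential smoothing: level, trend and seasonal components
fit_hw <- HoltWinters(myts)

# forecast the next 12 months and return to the original scale
fc <- predict(fit_hw, n.ahead = 12)
exp(fc)
```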
ARIMA Models
The arima() function can be used to fit an autoregressive integrated moving averages model. Other
useful functions include:
Note that the forecast package has somewhat nicer versions of acf() and pacf() called Acf() and Pacf()
respectively.
# predictive accuracy
library(forecast)
accuracy(fit)
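A short ARIMA sketch using only base R. The data (log of the built-in AirPassengers series) and the model orders are assumptions picked for illustration, not a recommendation for other series.

```r
myts <- log(AirPassengers)

# inspect autocorrelation and partial autocorrelation of the differenced series
acf(diff(myts))
pacf(diff(myts))

# seasonal ARIMA(0,1,1)(0,1,1) with a 12 month period
fit <- arima(myts, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fit

# forecast 12 months ahead; $pred holds point forecasts, $se standard errors
fc <- predict(fit, n.ahead = 12)
exp(fc$pred)
```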
Automated Forecasting
The forecast package provides functions for the automatic selection of exponential and ARIMA models.
The ets() function supports both additive and multiplicative models. The auto.arima() function can
handle both seasonal and nonseasonal ARIMA models. Models are chosen to maximize one of several fit
criteria.
library(forecast)
# Automated forecasting using an exponential model
fit <- ets(myts)
Going Further
There are many good online resources for learning time series analysis with R. These include A little
book of R for time series by Avril Coghlan, and Forecasting: principles and practice by Rob Hyndman and
George Athanasopoulos. Vito Ricci has created a time series reference card. There is also a time series
tutorial by Walter Zucchini and Oleg Nenadic that is quite useful.
See also the comprehensive book Time Series Analysis and its Applications with R Examples by Robert
Shumway and David Stoffer.
Data Types
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices,
data frames, and lists.
a[c(2,4)] # 2nd and 4th elements of vector
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The
general format is
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the columns
and rows.
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
   dimnames=list(rnames, cnames))
Data Frames
A data frame is more general than a matrix, in that different columns can have different modes
(numeric, character, factor, etc.). This is similar to SAS and SPSS datasets.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
The following symbols can be used with the format( ) function to print dates.
Symbol  Meaning                   Example
%d      day as a number (01-31)   01-31
%a      abbreviated weekday       Mon
%A      unabbreviated weekday     Monday
%m      month (01-12)             01-12
%b      abbreviated month         Jan
%B      unabbreviated month       January
%y      2-digit year              07
%Y      4-digit year              2007
Here is an example.
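A brief sketch of the format symbols in use. The date 2007-06-22 is an arbitrary value chosen for illustration; the weekday and month names printed by %A and %B depend on the locale, so their output is not shown.

```r
d <- as.Date("2007-06-22")

# numeric symbols give the same result in any locale
format(d, "%Y/%m/%d")   # "2007/06/22"
format(d, "%d-%m-%y")   # "22-06-07"

# name symbols (%a, %A, %b, %B) are locale dependent
format(d, "%A, %B %d, %Y")

# character to Date with an explicit input format, and back again
d2 <- as.Date("01/05/1965", format = "%m/%d/%Y")
as.character(d2)        # "1965-01-05"

# dates support arithmetic: number of days between two dates
as.numeric(as.Date("2007-07-22") - d)   # 30
```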
Date Conversion
Character to Date
You can use the as.Date( ) function to convert character data to dates. The format is as.Date(x,
"format"), where x is the character data and format gives the appropriate format.
Date to Character
You can convert dates to character data using the as.character() function.
Learning More
See help(as.Date) and help(strftime) for details on converting character data to dates. See
help(ISOdatetime) for more information about formatting date/times.
Access to Database Management Systems (DBMS)
library(RODBC)
myconn <- odbcConnect("mydsn", uid="Rob", pwd="aardvark")
crimedat <- sqlFetch(myconn, "Crime")
pundat <- sqlQuery(myconn, "select * from Punishment")
close(myconn)
Other Interfaces
The RMySQL package provides an interface to MySQL.
Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Designed by WebTemplateOcean.com
Exporting Data
There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you
will need to load the foreign package. For Excel, you will need the xlsReadWrite package.
write.table(mydata, "c:/mydata.txt", sep="\t")
To SPSS
# write out text datafile and
# an SPSS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
To Stata
Importing Data
Importing data into R is fairly simple. For Stata and Systat, use the foreign package. For SPSS and SAS I
would recommend the Hmisc package for ease and functionality. See the Quick-R section on packages
for information on obtaining and installing these packages. Examples of importing data are provided
below.
From A Comma Delimited Text File
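Reading a comma delimited file might look like the sketch below. To keep it self-contained it first writes a small file to a temporary path (the name csv_path is hypothetical) and then reads it back; with your own data only the read.table( ) call is needed.

```r
# create a small comma delimited file to read (illustration only)
csv_path <- tempfile(fileext = ".csv")
write.table(head(mtcars), csv_path, sep = ",",
            row.names = TRUE, col.names = NA)

# first row contains variable names, comma is the separator,
# and the first column supplies the row names
mydata <- read.table(csv_path, header = TRUE, sep = ",", row.names = 1)
str(mydata)
```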
From SPSS
# save SPSS dataset in transport format
get file='c:\mydata.sav'.
export outfile='c:\mydata.por'.
# in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
# last option converts value labels to R factors
From SAS
# save SAS dataset in transport format
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;
# in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
# character variables are converted to R factors
From Stata
# input Stata file
library(foreign)
mydata <- read.dta("c:/mydata.dta")
From Systat
# input Systat file
library(foreign)
mydata <- read.systat("c:/mydata.dta")
Keyboard Input
Usually you will obtain a data frame by importing it from SAS, SPSS, Excel, Stata, a database, or an
ASCII file. To create it interactively, you can do something like the following.
You can also use R's built in spreadsheet to enter the data interactively, as in the following example.
# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
   weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!
Missing Data
In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing
by zero) are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for
character and numeric data.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
The function complete.cases() returns a logical vector indicating which cases are complete.
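The missing-data tools mentioned here can be seen together in a short sketch. The vector x and the data frame df are made-up values for illustration.

```r
x <- c(1, 2, NA, 3)
is.na(x)               # FALSE FALSE TRUE FALSE
mean(x, na.rm = TRUE)  # 2

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# logical vector flagging rows with no missing values
complete.cases(df)     # TRUE FALSE FALSE

# list the rows that have missing values
df[!complete.cases(df), ]

# listwise deletion: keep only complete rows
newdata <- na.omit(df)
```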
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
mydata$v1 <- factor(mydata$v1,
   levels = c(1,2,3),
   labels = c("red", "blue", "green"))
Use the factor() function for nominal data and the ordered() function for ordinal data. R statistical
and graphic functions will then treat the data appropriately.
Note: factor and ordered are used the same way, with the same arguments. The former creates factors
and the latter creates ordered factors.
Variable Labels
R's ability to handle variable labels is somewhat unsatisfying.
If you use the Hmisc package, you can take advantage of some labeling features.
library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)
Unfortunately the label is only in effect for functions provided by the Hmisc package, such as
describe(). Your other option is to use the variable label as the variable name and then refer to the
variable by position index.
names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable
Aggregating Data
It is relatively easy to collapse data in R using one or more BY variables and a defined function.
# aggregate data frame mtcars by cyl and vs, returning means
# for numeric variables
attach(mtcars)
aggdata <- aggregate(mtcars, by=list(cyl,vs),
   FUN=mean, na.rm=TRUE)
print(aggdata)
detach(mtcars)
When using the aggregate() function, the by variables must be in a list (even if there is only one). The
function can be built-in or user provided.
Control Structures
R has the standard control structures you would expect. expr can be multiple (compound) statements
by enclosing them in braces { }. It is more efficient to use built-in functions rather than control
structures whenever possible.
for
while
switch
switch(expr, ...)
ifelse
ifelse(test,yes,no)
Example
# try it
z <- matrix(1:10, nrow=5, ncol=2)
tz <- mytrans(z)
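The definition of mytrans( ) is not shown above. The sketch below is one hypothetical way to write it, as a loop-based transpose that exercises if, for, and early return; it is an assumption for illustration, and the built-in t( ) is the better choice in practice.

```r
# hypothetical definition: transpose a matrix with explicit loops
mytrans <- function(x) {
  if (!is.matrix(x)) {
    warning("argument is not a matrix: returning NA")
    return(NA_real_)
  }
  y <- matrix(1, nrow = ncol(x), ncol = nrow(x))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(ncol(x))) {
      y[j, i] <- x[i, j]
    }
  }
  y
}

z  <- matrix(1:10, nrow = 5, ncol = 2)
tz <- mytrans(z)

# vectorized and multi-way alternatives to if-else
size  <- ifelse(c(1, 5, 10) > 4, "high", "low")   # "low" "high" "high"
label <- switch("b", a = "first", b = "second")   # "second"

# while loop: repeat until the condition is false
n <- 0
while (n < 3) n <- n + 1
```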
Built-in Functions
Almost everything in R is done through functions. Here I'm only referring to numeric and character
functions that are commonly used in creating or recoding variables.
Character Functions
Function                    Description
substr(x, start=n1, stop=n2)
                            Extract or replace substrings in a character vector.
                            x <- "abcdef"
                            substr(x, 2, 4) is "bcd"
                            substr(x, 2, 4) <- "22222" is "a222ef"
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
                            Search for pattern in x. If fixed=FALSE then pattern is a regular
                            expression. If fixed=TRUE then pattern is a text string. Returns
                            matching indices.
                            grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
                            Find pattern in x and replace with replacement text. If
                            fixed=FALSE then pattern is a regular expression.
                            If fixed=TRUE then pattern is a text string.
                            sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split)          Split the elements of character vector x at split.
                            strsplit("abc", "") returns 3 element vector "a","b","c"
paste(..., sep="")          Concatenate strings after using sep string to separate them.
                            paste("x",1:3,sep="") returns c("x1","x2","x3")
                            paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
                            paste("Today is", date())
toupper(x)                  Uppercase
tolower(x)                  Lowercase
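The character functions in the table can be checked directly. One detail worth noting is that strsplit( ) returns a list with one element per input string, so the three-element vector is its first component.

```r
x <- "abcdef"
substr(x, 2, 4)                  # "bcd"
substr(x, 2, 4) <- "22222"       # only 3 characters are replaced
x                                # "a222ef"

grep("A", c("b", "A", "c"), fixed = TRUE)   # 2
sub("\\s", ".", "Hello There")              # "Hello.There"

parts <- strsplit("abc", "")     # a list of length 1
parts[[1]]                       # "a" "b" "c"

paste("x", 1:3, sep = "")        # "x1" "x2" "x3"
paste("x", 1:3, sep = "M")       # "xM1" "xM2" "xM3"
toupper("abc")                   # "ABC"
```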
Function                    Description
mean(x, trim=0, na.rm=FALSE)
                            mean of object x
                            # trimmed mean, removing any missing values and
                            # 5 percent of highest and lowest scores
                            mx <- mean(x,trim=.05, na.rm=TRUE)
sd(x)                       standard deviation of object(x). also look at var(x) for variance and
                            mad(x) for median absolute deviation.
median(x)                   median
quantile(x, probs)          quantiles where x is the numeric vector whose quantiles are desired and
                            probs is a numeric vector with probabilities in [0,1].
                            # 30th and 84th percentiles of x
                            y <- quantile(x, c(.3,.84))
range(x)                    range
sum(x)                      sum
diff(x, lag=1)              lagged differences, with lag indicating which lag to use
min(x)                      minimum
max(x)                      maximum
scale(x, center=TRUE, scale=TRUE)
                            column center or standardize a matrix.
Note that while the examples on this page apply functions to individual variables, many can be applied
to vectors and matrices as well.
Merging Data
# merge two data frames by ID
total <- merge(dataframeA,dataframeB,by="ID")
# merge two data frames by ID and Country
total <- merge(dataframeA,dataframeB,by=c("ID","Country"))
Adding Rows
To join two data frames (datasets) vertically, use the rbind function. The two data frames must have
the same variables, but they do not have to be in the same order.
If dataframeA has variables that dataframeB does not, then either:
1. Delete the extra variables in dataframeA or
2. Create the additional variables in dataframeB and set them to NA (missing)
before joining them with rbind( ).
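A small worked example of both operations. The data frames below are made-up values; dataframeC deliberately lists its columns in a different order and lacks one variable, to show the second option above.

```r
dataframeA <- data.frame(ID = c(1, 2, 3), score = c(90, 85, 70))
dataframeB <- data.frame(ID = c(2, 3, 4), grade = c("B", "C", "A"))

# horizontal join: by default only IDs present in both are kept
total <- merge(dataframeA, dataframeB, by = "ID")
total

# keep every row of dataframeA, filling unmatched values with NA
left <- merge(dataframeA, dataframeB, by = "ID", all.x = TRUE)

# vertical join: same variables, different column order is fine
dataframeC <- data.frame(score = 60, ID = 5)
stacked <- rbind(dataframeA, dataframeC)

# a missing variable must be created first, set to NA
dataframeD <- data.frame(ID = 6)
dataframeD$score <- NA
stacked2 <- rbind(stacked, dataframeD)
```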
Operators
R's binary and logical operators will look very familiar to programmers. Note that binary operators work
on vectors and matrices as well as scalars.
Logical Operators
Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          Not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
x < 5
T T T T F F F F F F
x > 8 | x < 5
T T T T F F F F T T
x[c(T,T,T,T,F,F,F,F,T,T)]
1 2 3 4 9 10
Reshaping Data
R provides a variety of methods for reshaping data prior to analysis.
Transpose
Use the t() function to transpose a matrix or a data frame. In the latter case, rownames become
variable (column) names.
Basically, you "melt" data so that each row is a unique id-variable combination. Then you "cast" the
melted data into any shape you would like. Here is a very simple example.
mydata
id time x1 x2
1 1 5 6
1 2 3 5
2 1 6 1
2 2 2 4
# cast the melted data
# cast(data, formula, function)
subjmeans <- cast(mdata, id~variable, mean)
timemeans <- cast(mdata, time~variable, mean)
subjmeans
id x1 x2
1 4 5.5
2 4 2.5
timemeans
time x1 x2
1 5.5 3.5
2 2.5 4.5
There is much more that you can do with the melt( ) and cast( ) functions. See the documentation for
more details.
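As a cross-check of the subject and time means shown above, here is a sketch that uses base R's aggregate( ) instead of melt( ) and cast( ). It is a swapped-in technique, not the reshape workflow itself. The data values are an assumption, chosen so that they reproduce the subjmeans and timemeans tables above.

```r
# Example data consistent with the means shown above (values are an assumption)
mydata <- data.frame(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2),
                     x2   = c(6, 5, 1, 4))

# Mean of x1 and x2 for each subject, then for each time point
subjmeans <- aggregate(cbind(x1, x2) ~ id,   data = mydata, FUN = mean)
timemeans <- aggregate(cbind(x1, x2) ~ time, data = mydata, FUN = mean)
```

The melt/cast approach is more general because the melted data can be recast into many shapes; aggregate( ) only covers the grouped-summary case.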
Sorting Data
To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING. Prepend the
sorting variable by a minus sign to indicate DESCENDING order. Here are some examples.
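A minimal sketch of sorting with order( ), using the built-in mtcars data frame. The choice of sorting variables is only for illustration.

```r
# sort by mpg (ascending)
by_mpg <- mtcars[order(mtcars$mpg), ]

# sort by cyl (ascending) and, within cyl, by mpg (descending)
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]

head(by_cyl_mpg[, c("cyl", "mpg")])
```

The minus sign works for numeric sorting variables; for character variables, order( ) has a decreasing= argument instead.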
Subsetting Data
R has powerful indexing features for accessing object elements. These features can be used to select
and exclude variables and observations. The following code snippets demonstrate ways to keep or
delete variables and observations and to take random samples from a dataset.
Excluding (DROPPING) Variables
# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3")
newdata <- mydata[!myvars]
Selecting Observations
# first 5 observations
newdata <- mydata[1:5,]
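The indexing patterns above (keeping variables, dropping variables, selecting observations, and random sampling) can be sketched on the built-in mtcars data. The variable choices and the seed are assumptions made for the example.

```r
mydata <- mtcars

# keep variables by name
kept <- mydata[c("mpg", "hp", "wt")]

# drop variables by name
dropped <- mydata[!(names(mydata) %in% c("mpg", "hp"))]

# select observations that satisfy a condition
subset6 <- mydata[mydata$cyl == 6 & mydata$mpg > 20, ]

# random sample of 10 observations, without replacement
set.seed(1)  # arbitrary seed so the sample is reproducible
mysample <- mydata[sample(1:nrow(mydata), 10, replace = FALSE), ]
```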
Going Further
R has extensive facilities for sampling, including drawing and calibrating survey samples (see the
sampling package), analyzing complex survey data (see the survey package and its homepage) and
bootstrapping.
Data Type Conversion
Type conversions in R work as you would expect. For example, adding a character string to a numeric
vector converts all the elements in the vector to character.
Use is.foo to test for data type foo. Returns TRUE or FALSE
Use as.foo to explicitly convert it.
is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
Examples
                  to one long vector     to matrix            to data frame
from vector       c(x,y)                 cbind(x,y)           data.frame(x,y)
                                         rbind(x,y)
from matrix       as.vector(mymatrix)                         as.data.frame(mymatrix)
from data frame                          as.matrix(myframe)
Dates
You can convert dates to and from character or numeric data. See date values for more information.
User-written Functions
One of the great strengths of R is the user's ability to add functions. In fact, many of the functions in R
are actually functions of functions. The structure of a function is given below.
Objects in the function are local to the function. The object returned can be any data type. Here is an
example.
# function example - get measures of central tendency
# and spread for a numeric vector x. The user has a
# choice of measures and whether the results are printed.
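The body of the example function did not survive in these notes. The following is a sketch of what such a function could look like, not the original code: the name central_spread and its arguments robust and print are hypothetical.

```r
# Hypothetical function: returns a measure of center and spread for x.
# robust = TRUE uses median/MAD, otherwise mean/SD; print controls output.
central_spread <- function(x, robust = FALSE, print = TRUE) {
  if (robust) {
    center <- median(x)
    spread <- mad(x)
  } else {
    center <- mean(x)
    spread <- sd(x)
  }
  if (print) {
    cat("Center =", center, "\n")
    cat("Spread =", spread, "\n")
  }
  # center and spread are local objects; only this list is returned
  list(center = center, spread = spread)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
res <- central_spread(x, robust = FALSE, print = FALSE)
res$center
```

Returning a list lets the caller pick out individual results by name, which is a common convention when a function produces more than one value.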
It can be instructive to look at the code of a function. In R, you can view a function's code by typing
the function name without the ( ). If this method fails, look at the following R Wiki link for hints on
viewing function source code.
Finally, you may want to store your own functions, and have them available in every session. You can
customize the R environment to load your functions at start-up.
Creating new variables
Use the assignment operator <- to create new variables. A wide array of operators and functions are
available here.
Recoding variables
In order to recode data, you will probably use one or more of R's control structures.
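A minimal sketch of creating and recoding variables. The data frame, the variable names, and the age cut-offs are all assumptions made for illustration.

```r
# Hypothetical data
mydata <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30), age = c(25, 50, 80))

# create new variables with the assignment operator
mydata$sum  <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2) / 2

# recode a numeric variable into categories with nested ifelse()
mydata$agecat <- ifelse(mydata$age > 75, "Elder",
                 ifelse(mydata$age > 45, "Middle Aged", "Young"))
```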
For finer control or for modularization, you can use the functions described below.
title(main="main title", sub="sub-title",
  xlab="x-axis label", ylab="y-axis label")
Many other graphical parameters (such as text size, font, rotation, and color) can also be specified in
the title( ) function.
Text Annotations
Text can be added to graphs using the text( ) and mtext( ) functions. text( ) places text within the
graph while mtext( ) places text in one of the four margins.
text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)
Common options are described below.
option description
location location can be an x,y coordinate. Alternatively, the text can be placed
interactively via mouse by specifying location as locator(1).
pos position relative to location. 1=below, 2=left, 3=above, 4=right. If you
specify pos, you can specify offset= in percent of character width.
side which margin to place text. 1=bottom, 2=left, 3=top, 4=right. You can
specify line= to indicate the line in the margin starting with 0 and moving
out. You can also specify adj=0 for left/bottom alignment or adj=1 for
top/right alignment.
Other common options are cex, col, and font (for size, color, and font style respectively).
Labeling points
You can use the text( ) function (see above) for labeling points as well as for adding other text
annotations. Specify location as a set of x, y coordinates and specify the text to place as a vector of
labels. The x, y, and label vectors should all be the same length.
Math Annotations
You can add mathematical formulas to a graph using TeX-like rules. See help(plotmath) for details
and examples.
Axes
You can create custom axes using the axis( ) function.
axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)
where
option description
side an integer indicating the side of the graph to draw the axis (1=bottom, 2=left,
3=top, 4=right)
at a numeric vector indicating where tick marks should be drawn
labels a character vector of labels to be placed at the tick marks
(if NULL, the at values will be used)
pos the coordinate at which the axis line is to be drawn.
(i.e., the value on the other axis where it crosses)
lty line type
col the line and tick mark color
las labels are parallel (=0) or perpendicular (=2) to axis
tck length of tick mark as fraction of plotting region (negative number is outside
graph, positive number is inside, 0 suppresses ticks, 1 creates gridlines) default
is -0.01
(...) other graphical parameters
If you are going to create a custom axis, you should suppress the axis automatically generated by your
high level plotting function. The option axes=FALSE suppresses both x and y axes. xaxt="n" and
yaxt="n" suppress the x and y axis respectively. Here is a (somewhat overblown) example.
# add x vs. 1/x
lines(x, z, type="b", pch=22, col="blue", lty=2)
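Only a fragment of the custom-axis example survives above. As a smaller, self-contained sketch of the same idea (suppress the default axes, then draw your own with axis( )), assuming made-up data x and y and a null PDF device so that it runs without a display:

```r
pdf(NULL)                        # null device: nothing is written to disk
x <- 1:10
y <- x^2

# suppress the automatically generated axes
plot(x, y, axes = FALSE, xlab = "x", ylab = "y")

# bottom axis with a tick at every x value
axis(1, at = x, labels = x)

# left axis with custom tick positions and perpendicular labels
yticks <- seq(0, 100, by = 25)
axis(2, at = yticks, labels = yticks, las = 2)

box()                            # redraw the frame removed by axes=FALSE
invisible(dev.off())
```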
Reference Lines
Add reference lines to a graph using the abline( ) function.
abline(h=yvalues, v=xvalues)
Other graphical parameters (such as line type, color, and width) can also be specified in the abline( )
function.
Note: You can also use the grid( ) function to add reference lines.
Legend
Add a legend with the legend() function.
legend(location, title, legend, ...)
option description
location There are several ways to indicate the location of the legend. You can give
an x,y coordinate for the upper left hand corner of the legend. You can use
locator(1), in which case you use the mouse to indicate the location of the
legend. You can also use the keywords "bottom", "bottomleft", "left",
"topleft", "top", "topright", "right", "bottomright", or "center". If you use a
keyword, you may want to use inset= to specify an amount to move the
legend into the graph (as fraction of plot region).
title A character string for the legend title (optional)
legend A character vector with the labels
... Other options. If the legend labels colored lines, specify col= and a vector of
colors. If the legend labels point symbols, specify pch= and a vector of point
symbols. If the legend labels line width or line style, use lwd= or lty= and a
vector of widths or styles. To create colored boxes for the legend (common
in bar, box, or pie charts), use fill= and a vector of colors.
Other common legend options include bty for box type, bg for background color, cex for size, and
text.col for text color. Setting horiz=TRUE sets the legend horizontally rather than vertically.
# Legend Example
attach(mtcars)
boxplot(mpg~cyl, main="Mileage by Number of Cylinders",
  yaxt="n", xlab="Mileage", horizontal=TRUE,
  col=terrain.colors(3))
legend("topright", inset=.05, title="Number of Cylinders",
  c("4","6","8"), fill=terrain.colors(3), horiz=TRUE)
For more on legends, see help(legend). The examples in the help are particularly informative.
Correlograms
Correlograms help us visualize the data in correlation matrices. For details, see Corrgrams:
Exploratory displays for correlation matrices.
In R, correlograms are implemented through the corrgram(x, order = , panel=, lower.panel=,
upper.panel=, text.panel=, diag.panel=) function in the corrgram package.
Options
x is a data frame with one observation per row.
order=TRUE will cause the variables to be ordered using principal component analysis of the
correlation matrix.
panel= refers to the off-diagonal panels. You can use lower.panel= and upper.panel= to choose
different options below and above the main diagonal respectively. text.panel= and diag.panel= refer
to the main diagonal. Allowable parameters are given below.
off diagonal panels
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)
Graphics with ggplot2
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating
elegant and complex plots. Its popularity in the R community has exploded in recent years. Originally
based on Leland Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that
represent both univariate and multivariate numerical and categorical data in a straightforward manner.
Grouping can be represented by color, symbol, size, and transparency. The creation of trellis plots
(i.e., conditioning) is relatively simple.
Mastering the ggplot2 language can be challenging (see the Going Further section below for helpful
resources). There is a helper function called qplot() (for quick plot) that can hide much of this
complexity when creating standard graphs.
main, sub Character vectors specifying the title and subtitle
method, formula If geom="smooth", a loess fit line and confidence limits are added by default. When
the number of observations is greater than 1,000, a more efficient smoothing
algorithm is employed. Methods include "lm" for regression, "gam" for generalized
additive models, and "rlm" for robust regression. The formula parameter gives the
form of the fit.
For example, to add simple linear regression lines, you'd specify geom="smooth",
method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a
quadratic fit. Note that the formula uses the letters x and y, not the names of the
variables.
For method="gam", be sure to load the mgcv package. For method="rlm", load the
MASS package.
x, y Specifies the variables placed on the horizontal and vertical axis. For univariate
plots (for example, histograms), omit y
xlab, ylab Character vectors specifying horizontal and vertical axis labels
xlim, ylim Two-element numeric vectors giving the minimum and maximum values for the
horizontal and vertical axes, respectively
Notes:
Here are some examples using automotive data (car mileage, weight, number of gears, number of
cylinders, etc.) contained in the mtcars data frame.
# ggplot2 examples
library(ggplot2)
Going Further
We have only scratched the surface here. To learn more, see the ggplot reference site, and Winston
Chang's excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data
Analysis is still the definitive book on this subject.
Interactive Graphics
There are several ways to interact with R graphics in real time. Three methods are described below.
Once GGobi is installed, you can use the ggobi( ) function in the package rggobi to run GGobi from
within R. This gives you interactive graphics access to all of your R data! See An Introduction to
RGGOBI.
iPlots
The iplots package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots
and histograms that can be linked and color brushed. iplots is implemented through the Java GUI for R.
For more information, see the iplots website.
The layout( ) function has the form layout(mat) where
mat is a matrix object specifying the location of the N figures to plot.
Optionally, you can include widths= and heights= options in the layout( ) function to control the size of
each figure more precisely. These options have the form
widths= a vector of values for the widths of columns
heights= a vector of values for the heights of rows.
Relative widths are specified with numeric values. Absolute widths (in centimetres) are specified with
the lcm( ) function.
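A minimal sketch of layout( ) with relative widths and heights. The particular matrix, the size ratios, and the histograms are assumptions for illustration; a null PDF device is used so the code runs without a display.

```r
pdf(NULL)                       # null device: nothing is written to disk

# Figure 1 spans the top row; figures 2 and 3 share the bottom row
mat <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)

# column 1 is three times as wide as column 2,
# row 2 is twice as tall as row 1
nfig <- layout(mat, widths = c(3, 1), heights = c(1, 2))

hist(mtcars$wt)
hist(mtcars$mpg)
hist(mtcars$disp)
invisible(dev.off())
```

layout( ) returns the number of figures it set up, which is stored here in nfig.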
To understand this graph, think of the full graph area as going from (0,0) in the lower left corner to
(1,1) in the upper right corner. The format of the fig= parameter is a numerical vector of the form
c(x1, x2, y1, y2). The first fig= sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on
the y axis. The top boxplot goes from 0 to 0.8 on the x axis and 0.55 to 1 on the y axis. I chose 0.55
rather than 0.8 so that the top figure will be pulled closer to the scatter plot. The right hand boxplot
goes from 0.65 to 1 on the x axis and 0 to 0.8 on the y axis. Again, I chose a value to pull the right
hand boxplot closer to the scatterplot. You have to experiment to get it just right.
fig= starts a new plot, so to add to an existing plot use new=TRUE.
You can use this to combine several plots in any arrangement into one graph.
Visualizing Categorical Data
The vcd package provides a variety of methods for visualizing multivariate categorical data, inspired by
Michael Friendly's wonderful "Visualizing Categorical Data". Extended mosaic and association plots are
described here. Each provides a method of visualizing complex data and evaluating deviations from a
specified independence model. For more details, see The Strucplot framework.
Going Further
Both functions are complex and offer multiple input and output options. See help(mosaic) and
help(assoc) for more details.
Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Designed by WebTemplateOcean.com
Graphical Parameters
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.
One way is to specify these options through the par( ) function. If you set parameter values here, the
changes will be in effect for the rest of the session or until you change them again. The format is
par(optionname=value, optionname=value, ...)
A second way to specify graphical parameters is by providing the optionname=value pairs directly to a
high level plotting function. In this case, the options are only in effect for that specific graph.
Text and Symbol Size
The following options can be used to control text and symbol size in graphs.
option description
cex number indicating the amount by which plotting text and symbols should be
scaled relative to the default. 1=default, 1.5 is 50% larger, 0.5 is 50%
smaller, etc.
cex.axis magnification of axis annotation relative to cex
cex.lab magnification of x and y labels relative to cex
cex.main magnification of titles relative to cex
cex.sub magnification of subtitles relative to cex
Plotting Symbols
Use the pch= option to specify symbols to use when plotting points. For symbols 21 through 25, specify
border color (col=) and fill color (bg=).
[Figure: chart of plotting symbols, pch=0 through pch=25]
Lines
You can change lines using the following options. This is particularly useful for reference lines, axes,
and fit lines.
option description
lty line type. see the chart below.
lwd line width relative to the default (default=1). 2 is twice as wide.
[Figure: chart of line types, lty=1 through lty=6]
Colors
Options that specify colors include the following.
option description
col Default plotting color. Some functions (e.g. lines) accept a vector of values
that are recycled.
col.axis color for axis annotation
col.lab color for x and y labels
col.main color for titles
col.sub color for subtitles
fg plot foreground color (axes, boxes - also sets col= to same)
bg plot background color
The following chart was produced with code developed by Earl F. Glynn. See his Color Chart for all the
details you would ever need about using colors in R.
You can also create a vector of n contiguous colors using the functions rainbow(n), heat.colors(n),
terrain.colors(n), topo.colors(n), and cm.colors(n).
Fonts
You can easily set font size and style, but font family is a bit more complicated.
option description
font Integer specifying font to use for text.
1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.axis font for axis annotation
font.lab font for x and y labels
font.main font for titles
font.sub font for subtitles
ps font point size (roughly 1/72 inch)
text size=ps*cex
family font family for drawing text. Standard values are "serif", "sans", "mono",
"symbol". Mapping is device dependent.
In Windows, mono is mapped to "TT Courier New", serif is mapped to "TT Times New Roman", sans is
mapped to "TT Arial", and symbol is mapped to "TT Symbol" (TT=True Type). You can add your own
mappings.
option description
mar numerical vector indicating margin size c(bottom, left, top, right) in lines.
default = c(5, 4, 4, 2) + 0.1
mai numerical vector indicating margin size c(bottom, left, top, right) in inches
pin plot dimensions (width, height) in inches
Going Further
See help(par) for more information on graphical parameters. The customization of plotting axes and
text annotations is covered in the next section.
Probability Plots
This section describes creating probability plots in R for both didactic purposes and for data analyses.
For a comprehensive list, see Statistical Distributions on the R wiki. The functions available for each
distribution follow this format:
name description
dname( ) density or probability function
pname( ) cumulative distribution function
qname( ) quantile function
rname( ) random deviates
For example, pnorm(0) = 0.5 (the area under the standard normal curve to the left of zero).
qnorm(0.9) = 1.28 (1.28 is the 90th percentile of the standard normal distribution). rnorm(100)
generates 100 random deviates from a standard normal distribution.
Each function has parameters specific to that distribution. For example, rnorm(100, m=50, sd=10)
generates 100 random deviates from a normal distribution with mean 50 and standard deviation 10.
You can use these functions to demonstrate various aspects of probability distributions. Two common
examples are given below.
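The d/p/q/r naming convention described above can be checked directly for the normal distribution. This is a small sketch; the seed is an arbitrary assumption so that the random draws are reproducible.

```r
p_left <- pnorm(0)      # area to the left of 0 under the standard normal: 0.5
q90    <- qnorm(0.9)    # 90th percentile of the standard normal, about 1.28
d0     <- dnorm(0)      # density of the standard normal at 0

set.seed(42)            # arbitrary seed for reproducibility
draws <- rnorm(100, mean = 50, sd = 10)   # 100 deviates from N(50, 10^2)
```

The same prefixes apply to other distributions, for example dt( ), pt( ), qt( ) and rt( ) for Student's t.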
for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}
mean=100; sd=15
lb=80; ub=120
For a comprehensive view of probability plotting in R, see Vincent Zoonekynd's Probability Distributions.
Fitting Distributions
There are several methods of fitting distributions in R. Here are some options.
You can use the qqnorm() function to create a Quantile-Quantile plot evaluating the fit of sample data
to the normal distribution. More generally, the qqplot( ) function creates a Quantile-Quantile plot for
any theoretical distribution.
# Q-Q plots
par(mfrow=c(1,2))
# normal fit
qqnorm(x); qqline(x)
# t(3Df) fit
qqplot(rt(1000,df=3), x, main="t(3) Q-Q Plot",
  ylab="Sample Quantiles")
abline(0,1)
The fitdistr( ) function in the MASS package provides maximum-likelihood fitting of univariate
distributions. The format is fitdistr(x, densityfunction) where x is the sample data and densityfunction
is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f", "gamma", "geometric",
"log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t" or "weibull".
# estimate parameters
library(MASS)
fitdistr(x, "lognormal")
Finally R has a wide range of goodness of fit tests for evaluating if it is reasonable to assume that a
random sample comes from a specified theoretical distríbution. These include chi-square, Kolmogorov-
Smirnov, and Anderson-Darling.
For more details on fitting distributions, see Vito Ricci's Fitting Distributions with R. For general (non
R) advice, see Bill Huber's Fitting Distributions to Data.
Lattice Graphs
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing
better defaults and the ability to easily display multivariate relationships. In particular, the package
supports the creation of trellis graphs - graphs that display a variable or the relationship between
variables, conditioned on one or more other variables.
The typical format is graph_type(formula, data=)
where graph_type is selected from those listed below. formula specifies the variable(s) to display and
any conditioning variables. For example ~x|A means display numeric variable x for each level of factor
A. y~x | A*B means display the relationship between numeric variables y and x separately for every
combination of factor A and B levels. ~x means display numeric variable x alone.
graph_type description formula examples
barchart bar chart x~A or A~x
bwplot boxplot x~A or A~x
cloud 3D scatterplot z~x*y|A
contourplot 3D contour plot z~x*y
densityplot kernel density plot ~x|A*B
dotplot dotplot ~x|A
histogram histogram ~x
levelplot 3D level plot z~y*x
parallel parallel coordinates plot data frame
splom scatterplot matrix data frame
stripplot strip plots A~x or x~A
xyplot scatterplot y~x|A
wireframe 3D wireframe graph z~y*x
Here are some examples. They use the car data (mileage, weight, number of gears, number of
cylinders, etc.) from the mtcars data frame.
# kernel density plot
densityplot(~mpg,
  main="Density Plot",
  xlab="Miles per Gallon")
# kernel density plots by factor level
densityplot(~mpg|cyl.f,
  main="Density Plot by Number of Cylinders",
  xlab="Miles per Gallon")
# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
  main="MTCARS Data")
Note, as in graph 1, that specifying a conditioning variable is optional. The difference between
graphs 2 & 3 is the use of the layout option to control the placement of panels.
Going Further
Lattice graphics are a comprehensive graphical system in their own right. Deepayan Sarkar's book
Lattice: Multivariate Data Visualization with R is the definitive reference. Additionally, see the Trellis
Graphics homepage and the Trellis User's Guide. Dr. Ihaka has created a wonderful set of slides on the
subject. An excellent early consideration of trellis graphs can be found in W.S. Cleveland's classic book
Visualizing Data.
ANOVA
If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R's
approach less coherent and user-friendly. A good online presentation on ANOVA in R can be found in
the ANOVA section of the Personality Project. (Note: I have found that these pages render fine in Chrome
and Safari browsers, but can appear distorted in iExplorer.)
1. Fit a Model
In the following examples lower case letters are numeric variables and upper case letters are factors.
# One Way Anova (Completely Randomized Design)
fit <- aov(y ~ A, data=mydataframe)
# Randomized Block Design (B is the blocking factor)
fit <- aov(y ~ A + B, data=mydataframe)
# Two Way Factorial Design
fit <- aov(y ~ A + B + A:B, data=mydataframe)
fit <- aov(y ~ A*B, data=mydataframe) # same thing
# Analysis of Covariance
fit <- aov(y ~ A + x, data=mydataframe)
For within subjects designs, the data frame has to be rearranged so that each measurement on a
subject is a separate observation. See R and Analysis of Variance.
# One Within Factor
fit <- aov(y~A+Error(Subject/A),data=mydataframe)
2. Look at Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
Multiple Comparisons
You can get Tukey HSD tests using the function below. By default, it calculates post hoc comparisons on
each factor in the model. You can specify specific factors as an option. Again, remember that results
are based on Type I SS!
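The code for this step did not survive in these notes. As a sketch of fitting a model and requesting Tukey HSD comparisons for one factor, using the warpbreaks data set that ships with R (the choice of data set and model is an assumption for illustration):

```r
# breaks is numeric; wool and tension are factors
fit <- aov(breaks ~ wool + tension, data = warpbreaks)
summary(fit)                     # ANOVA table (sequential, Type I SS)

# post hoc comparisons restricted to the tension factor
tukey <- TukeyHSD(fit, which = "tension")
tukey
```

Omitting which= produces comparisons for every factor in the model, as the text above describes.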
Visualizing Results
Use box plots and line plots to visualize group differences. There are also two functions specifically
designed for visualizing mean differences in ANOVA layouts. interaction.plot( ) in the base stats
package produces plots for two-way interactions. plotmeans( ) in the gplots package produces mean
plots for single factors, and includes confidence intervals.
MANOVA
If there is more than one dependent (outcome) variable, you can test them simultaneously using a
multivariate analysis of variance (MANOVA). In the following example, let Y be a matrix whose
columns are the dependent variables.
Other test options are "Wilks", "Hotelling-Lawley", and "Roy". Use summary.aov( ) to get univariate
statistics. TukeyHSD( ) and plot( ) will not work with a MANOVA fit. Run each dependent variable
separately to obtain them. Like ANOVA, MANOVA results in R are based on Type I SS. To obtain Type III
SS, vary the order of variables in the model and rerun the analyses. For example, fit y~A*B for the
Type III B effect and y~B*A for the Type III A effect.
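The MANOVA example code is missing from these notes. A minimal sketch using the built-in iris data, where the choice of dependent variables and the Wilks test is an assumption for illustration:

```r
# Y is a matrix whose columns are the dependent variables
Y <- cbind(iris$Sepal.Length, iris$Petal.Length)

fit <- manova(Y ~ Species, data = iris)
mv  <- summary(fit, test = "Wilks")   # multivariate test
uni <- summary.aov(fit)               # one univariate ANOVA per column of Y
mv
```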
Going Further
R has excellent facilities for fitting linear and generalized linear mixed-effects models. The latest
implementation is in package lme4. See the R News Article on Fitting Mixed Linear Models in R for
details.
Assessing Classical Test Assumptions
In classical parametric procedures we often assume normality and constant variance for the model error
term. Methods of exploring these assumptions in an ANOVA/ANCOVA/MANOVA framework are discussed
here. Regression diagnostics are covered under multiple linear regression.
Outliers
Since outliers can severely affect normality and homogeneity of variance, methods for detecting
disparate observations are described first.
The aq.plot( ) function in the mvoutlier package allows you to identify multivariate outliers by plotting
the ordered squared robust Mahalanobis distances of the observations against the empirical distribution
function of the MD_i^2. Input consists of a matrix or data frame. The function produces 4 graphs and
returns a boolean vector identifying the outliers.
# Detect Outliers in the MTCARS Data
library(mvoutlier)
outliers <-
aq.plot(mtcars[c("mpg","disp","hp","drat","wt","qsec")])
outliers # show list of outliers
Univariate Normality
You can evaluate the normality of a variable using a Q-Q plot.
# Q-Q Plot for variable MPG
attach(mtcars)
qqnorm(mpg)
qqline(mpg)
Significant departures from the line suggest violations of normality.
You can also perform a Shapiro-Wilk test of normality with the shapiro.test(x) function, where x is a
numeric vector. Additional functions for testing normality are available in the nortest package.
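A short sketch of the Shapiro-Wilk test mentioned above, applied to the same mpg variable used in the Q-Q plot. The interpretation of any particular p-value cut-off is left to the reader.

```r
# Shapiro-Wilk test of normality for miles per gallon
sw <- shapiro.test(mtcars$mpg)

sw$statistic   # W statistic
sw$p.value     # a small p-value suggests departure from normality
```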
Multivariate Normality
MANOVA assumes multivariate normality. The function mshapiro.test( ) in the mvnormtest package
produces the Shapiro-Wilk test for multivariate normality. Input must be a numeric matrix.
Homogeneity of Variances
The bartlett.test( ) function provides a parametric K-sample test of the equality of variances. The
fligner.test( ) function provides a non-parametric test of the same. In the following examples y is a
numeric variable and G is the grouping variable.
The hovPlot( ) function in the HH package provides a graphic test of homogeneity of variances based on
Brown-Forsythe. In the following example, y is numeric and G is a grouping factor. Note that G must be
of type factor.
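The example code for the variance tests is missing from these notes. A sketch of bartlett.test( ) and fligner.test( ) with y numeric and G a grouping factor, using the built-in InsectSprays data as an assumed stand-in:

```r
y <- InsectSprays$count   # numeric variable
G <- InsectSprays$spray   # grouping factor

bt <- bartlett.test(y ~ G)   # parametric K-sample test
ft <- fligner.test(y ~ G)    # non-parametric alternative

bt$p.value
ft$p.value
```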
Homogeneity of Covariance Matrices
MANOVA and LDF assume homogeneity of variance-covariance matrices. The assumption is usually
tested with Box's M. Unfortunately the test is very sensitive to violations of normality, leading to
rejection in most typical cases. Box's M is not included in R, but code is available.
Correlations
You can use the cor( ) function to produce correlations and the cov( ) function to produce
covariances.
A simplified format is cor(x, use=, method= ) where
Option Description
x Matrix or data frame
use Specifies the handling of missing data. Options are all.obs (assumes no
missing data - missing data will produce an error), complete.obs (listwise
deletion), and pairwise.complete.obs (pairwise deletion)
method Specifies the type of correlation. Options are pearson, spearman or kendall.
# Correlations/covariances among numeric variables in
# data frame mtcars. Use listwise deletion of missing data.
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
You can use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and
the columns of Y. This is similar to the VAR and WITH commands in SAS PROC CORR.
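The cor(X, Y) form can be sketched with two column subsets of mtcars. The choice of columns is arbitrary and only for illustration.

```r
# Correlations between the columns of X (rows of the result)
# and the columns of Y (columns of the result)
X <- mtcars[c("mpg", "wt")]
Y <- mtcars[c("disp", "hp", "drat")]
r_xy <- cor(X, Y)
r_xy
```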
Other Types of Correlations
# polychoric correlation
# x is a contingency table of counts
library(polycor)
polychor(x)
# partial correlations
library(ggm)
data(mydata)
pcor(c("a", "b", "x", "y", "z"), var(mydata))
# partial corr between a and b controlling for x, y, z
Visualizing Correlations
Use corrgram( ) to plot correlograms.
A great example of a plotted correlation matrix can be found in the R Graph Gallery.
Descriptive Statistics
R provides a wide range of functions for obtaining summary statistics. One method of obtaining
descriptive statistics is to use the sapply( ) function with a specified summary statistic.
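A minimal sketch of the sapply( ) approach, using three numeric columns of mtcars as a stand-in for mydata. Extra arguments such as na.rm=TRUE are passed through to the summary function.

```r
# Apply one summary statistic to every column of a data frame
vars <- mtcars[c("mpg", "hp", "wt")]
col_means <- sapply(vars, mean, na.rm = TRUE)
col_sds   <- sapply(vars, sd, na.rm = TRUE)
col_means
col_sds
```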
# mean,median,25th and 75th quartiles,min,max
summary(mydata)
Using the Hmisc package
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores
Using the pastecs package
library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var
Using the ~ package
Summary Statistics by Group
A simple way of generating summary statistics by grouping variable is available in the psych package.
library(psych)
describe.by(mydata, group, ...) # named describeBy( ) in current versions of psych
The doBy package provides much of the functionality of SAS PROC SUMMARY. It defines the desired
table using a model formula and a function. Here is a simple example.
library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
FUN = function(x) { c(m = mean(x), s = sd(x)) } )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs
Generating Frequency Tables
R provides many methods for creating frequency and contingency tables. Three are described below. In
the following examples, assume that A, B, and C represent categorical variables.
table
You can generate frequency tables using the table( ) function, tables of proportions using the
prop.table( ) function, and marginal frequencies using margin.table( ).
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
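The marginal and proportion functions mentioned above can be sketched as follows. Here A and B are taken from the cyl and gear columns of mtcars purely as an assumption for illustration.

```r
A <- mtcars$cyl   # stand-in categorical variable (rows)
B <- mtcars$gear  # stand-in categorical variable (columns)
mytable <- table(A, B)

row_totals <- margin.table(mytable, 1)  # A frequencies (summed over B)
col_totals <- margin.table(mytable, 2)  # B frequencies (summed over A)

cell_props <- prop.table(mytable)       # cell proportions
row_props  <- prop.table(mytable, 1)    # row proportions
col_props  <- prop.table(mytable, 2)    # column proportions
```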
table( ) can also generate multidimensional tables based on 3 or more categorical variables. In this
case, use the ftable( ) function to print the results more attractively.
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
table( ) ignores missing values. To include NA as a category in counts, include the table option
exclude=NULL if the variable is a vector. If the variable is a factor you have to create a new factor
using newfactor <- factor(oldfactor, exclude=NULL).
If a variable is included on the left side of an xtabs( ) formula, it is assumed to be a vector of frequencies
(useful if the data have already been tabulated).
Crosstable
The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ
in SAS or CROSSTABS in SPSS. It has a wealth of options.
Tests of Independence
Chi-Square Test
For 2-way tables you can use chisq.test(mytable) to test independence of the row and column variable.
By default, the p-value is calculated from the asymptotic chi-squared distribution of the test statistic.
Optionally, the p-value can be derived via Monte Carlo simulation.
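Both ways of obtaining the p-value can be sketched on a 2-way table. This example assumes the built-in HairEyeColor data, collapsed over sex, as the table; the seed and the number of replicates B are arbitrary choices.

```r
# Hair colour by eye colour, summed over sex
tab <- margin.table(HairEyeColor, c(1, 2))

# Default: p-value from the asymptotic chi-squared distribution
asym <- chisq.test(tab)

# Alternative: p-value from Monte Carlo simulation
set.seed(123)
mc <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

asym$p.value
mc$p.value
```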
Mantel-Haenszel test
Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null
hypothesis that two nominal variables are conditionally independent in each stratum, assuming that
there is no three-way interaction. x is a 3 dimensional contingency table, where the last dimension
refers to the strata.
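As an illustration, the built-in UCBAdmissions data is a 2 x 2 x 6 table (admission by gender, stratified by department), so it already has the strata in the last dimension.

```r
# Cochran-Mantel-Haenszel test: is admission independent of gender
# within each department (the strata)?
dim(UCBAdmissions)  # 2 x 2 x 6
cmh <- mantelhaen.test(UCBAdmissions)
cmh
```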
Loglinear Models
You can use the loglm( ) function in the MASS package to produce log-linear models. For example, let's
assume we have a 3-way contingency table based on variables A, B, and C.
library(MASS)
mytable <- xtabs(~A+B+C, data=mydata)
We can perform the following tests:
Mutual Independence (A, B, and C are mutually independent)
loglm(~A+B+C, mytable)
Partial Independence (A is independent of the composite variable BC)
loglm(~A+B+C+B*C, mytable)
Conditional Independence (A is independent of B, given C)
loglm(~A+B+C+A*C+B*C, mytable)
No Three-Way Interaction
loglm(~A+B+C+A*B+A*C+B*C, mytable)
Martin Theus and Stephan Lauer have written an excellent article on Visualizing Loglinear Models, using
mosaic plots. There is also a great tutorial example by Kevin Quinn on analyzing loglinear models via glm.
Measures of Association
The assocstats(mytable) function in the vcd package calculates the phi coefficient, contingency
coefficient, and Cramer's V for an rxc table. The Kappa(mytable) function in the vcd package calculates
Cohen's kappa and weighted kappa for a confusion matrix. See Richard Darlington's article on Measures
of Association in Crosstab Tables for an excellent review of these statistics.
Visualizing results
Use bar and pie charts for visualizing frequencies in one dimension.
Use the vcd package for visualizing relationships among categorical data (e.g. mosaic and association
plots).
Use the ca package for correspondence analysis (visually exploring relationships between rows and
columns in contingency tables).
For the wilcox.test you can use the alternative="less" or alternative="greater" option to specify a one
tailed test.
Parametric and resampling alternatives are available.
The package npmc provides nonparametric multiple comparisons. (Note: This package has been
withdrawn but is still available in the CRAN archives.)
Visualizing Results
Use box plots or density plots to visualize group differences.
Power Analysis
The following four quantities have an intimate relationship:
1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 - P(Type II error) = probability of finding an effect that is there
pwr.t.test t-tests (one sample, 2 sample, paired)
pwr.t2n.test t-test (two samples with unequal n)
For each of these functions, you enter three of the four quantities (effect size, sample size,
significance level, power) and the fourth is calculated.
The significance level defaults to 0.05. Therefore, to calculate the significance level, given an effect
size, sample size, and power, use the option "sig.level=NULL".
Specifying an effect size can be a daunting task. ES formulas and Cohen's suggestions (based on social
science research) are provided below. Cohen's suggestions should only be seen as very rough guidelines.
Your own subject matter experience should be brought to bear.
Cohen suggests that d values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.
You can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test.
A two tailed test is the default.
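The "supply all but one quantity" pattern can be sketched without installing anything by using power.t.test( ) from base R's stats package. This is a stand-in for the pwr functions, not the same API: it takes delta and sd rather than d, so setting sd=1 makes delta equal to Cohen's d.

```r
# Solve for sample size: medium effect (d = 0.5), alpha = 0.05, power = 0.80
need_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
ceiling(need_n$n)   # required n per group, rounded up

# Solve for power instead: fix n and leave power unspecified
get_power <- power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)
get_power$power
```

Whichever argument is left out is the one that gets computed, which mirrors the behaviour described above.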
ANOVA
For a one-way analysis of variance use
pwr.anova.test(k = , n = , f = , sig.level = , power = )
where k is the number of groups and n is the common sample size in each group.
For a one-way ANOVA the effect size f is
$$f = \sqrt{\frac{\sum_{i=1}^{k} p_i \, (\mu_i - \mu)^2}{\sigma^2}}$$
where
$p_i = n_i / N$,
$n_i$ = number of observations in group i
$N$ = total number of observations
$\mu_i$ = mean of group i
$\mu$ = grand mean
$\sigma^2$ = error variance within groups
Cohen suggests that f values of 0.1, 0.25, and 0.4 represent small, medium, and large effect sizes
respectively.
Correlations
For correlation coefficients use
pwr.r.test(n = , r = , sig.level = , power = )
where n is the sample size and r is the correlation. We use the population correlation coefficient as
the effect size measure. Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium,
and large effect sizes respectively.
Linear Models
For linear models (e.g., multiple regression) use
pwr.f2.test(u = , v = , f2 = , sig.level = , power = )
where u and v are the numerator and denominator degrees of freedom. We use f2 as the effect size
measure.
The first formula is appropriate when we are evaluating the impact of a set of predictors on an
outcome. The second formula is appropriate when we are evaluating the impact of one set of predictors
above and beyond a second set of predictors (or covariates). Cohen suggests f2 values of 0.02, 0.15,
and 0.35 represent small, medium, and large effect sizes.
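The two formulas referred to above did not survive in this copy. As an assumption based on Cohen's standard definitions (not recovered from the original), they are usually written as follows, where $R^2$ is the squared multiple correlation, $R^2_A$ is the variance accounted for by the covariate set A, and $R^2_{AB}$ is the variance accounted for by sets A and B together:

```latex
% First formula: a single set of predictors
f^2 = \frac{R^2}{1 - R^2}

% Second formula: set B above and beyond set A
f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}}
```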
Tests of Proportions
When comparing two proportions use
pwr.2p.test(h = , n = , sig.level = , power = )
where h is the effect size and n is the common sample size in each group.
Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.
For both two sample and one sample proportion tests, you can specify alternative="two.sided", "less", or
"greater" to indicate a two-tailed, or one-tailed test. A two tailed test is the default.
Chi-square Tests
For chi-square tests use
pwr.chisq.test(w = , N = , df = , sig.level = , power = )
where w is the effect size, N is the total sample size, and df is the degrees of freedom. The effect size
w is defined as
Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes
respectively.
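The definition of w is missing from this copy. As an assumption based on Cohen's standard definition (not recovered from the original), it is usually written as follows, where $p_{0i}$ is the cell probability in the i-th cell under the null hypothesis, $p_{1i}$ is the cell probability under the alternative, and m is the number of cells:

```latex
w = \sqrt{\sum_{i=1}^{m} \frac{(p_{0i} - p_{1i})^2}{p_{0i}}}
```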
Some Examples
library(pwr)
pwr.2p.test(n=30, sig.level=0.01, power=0.75)
library(pwr)
# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
xlab="Correlation Coefficient (r)",
ylab="Sample Size (n)" )
Regression Diagnostics
An excellent review of regression diagnostics is provided in John Fox's aptly named Overview of
Regression Diagnostics. Dr. Fox's car package provides advanced utilities for regression modeling.
This example is for exposition only. We will ignore the fact that this may not be a great way of
modeling this particular set of data!
Outliers
leverage plot
Non-normality
# Normality of Residuals
# qq plot for studentized resid
qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length=40)
yfit <- dnorm(xfit)
lines(xfit, yfit)
Multi-collinearity
# Evaluate Collinearity
vif(fit) # variance inflation factors
sqrt(vif(fit)) > 2 # problem?
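To show what a variance inflation factor measures, here is a base R sketch that computes it by hand: each predictor is regressed on the others and VIF = 1 / (1 - R^2). The model below (mpg on disp, hp, and wt from mtcars) is an assumed example, not the fit used in the original text, and vif_manual is an illustrative name.

```r
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
X <- model.matrix(fit)[, -1]  # predictor columns, intercept dropped

# VIF for predictor v: regress v on the remaining predictors
vif_manual <- sapply(colnames(X), function(v) {
  others <- X[, colnames(X) != v, drop = FALSE]
  r2 <- summary(lm(X[, v] ~ others))$r.squared
  1 / (1 - r2)
})
vif_manual
sqrt(vif_manual) > 2  # same rule of thumb as above
```

For a model with only numeric main effects, these values match the diagonal of the inverse of the predictor correlation matrix.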
Nonlinearity
# Evaluate Nonlinearity
# component + residual plot
crPlots(fit)
# Ceres plots
ceresPlots(fit)
Non-independence of Errors
# Test for Autocorrelated Errors
durbinWatsonTest(fit)
Going Further
If you would like to delve deeper into regression diagnostics, two books written by John Fox can help:
Applied regression analysis and generalized linear models (2nd ed) and An R and S-Plus companion to
applied regression.
Multiple (Linear) Regression
R provides comprehensive support for multiple linear regression. The topics below are provided in order
of increasing complexity.
Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
Comparing Models
You can compare nested models with the anova( ) function. The following code provides a simultaneous
test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
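A runnable version of the same comparison, assuming mtcars in place of mydata: the reduced model uses wt and hp, and the full model adds qsec and drat. The variable choice is only illustrative.

```r
fit_reduced <- lm(mpg ~ wt + hp, data = mtcars)
fit_full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# F test that qsec and drat jointly add to prediction beyond wt and hp
cmp <- anova(fit_reduced, fit_full)
cmp
```

Both models must be fit to the same rows of the same data for the comparison to be valid.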
Cross Validation
You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package.
# K-fold cross-validation
library(DAAG)
cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation
Sum the MSE for each fold, divide by the number of observations, and take the square root to get the
cross-validated standard error of estimate.
You can assess R2 shrinkage via K-fold cross-validation. Using the crossval() function from the
bootstrap package, do the following:
library(bootstrap)
# define functions
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
# matrix of predictors
X <- as.matrix(mydata[c("x1","x2","x3")])
# vector of observed outcome values
y <- as.matrix(mydata[c("y")])
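The idea of R2 shrinkage can also be sketched in base R with a hand-rolled K-fold loop. This is a substitute illustration, not the crossval( ) workflow: the data (mtcars), the predictors, the number of folds, and the seed are all assumptions made for the example.

```r
set.seed(1234)
dat <- mtcars[c("mpg", "hp", "wt", "qsec")]
k <- 4
# Randomly assign each row to one of k folds
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Predict each fold from a model fit to the other folds
pred <- numeric(nrow(dat))
for (i in seq_len(k)) {
  fit_i <- lm(mpg ~ hp + wt + qsec, data = dat[folds != i, ])
  pred[folds == i] <- predict(fit_i, newdata = dat[folds == i, ])
}

raw_r2 <- summary(lm(mpg ~ hp + wt + qsec, data = dat))$r.squared
cv_r2  <- cor(dat$mpg, pred)^2  # cross-validated R2
c(raw = raw_r2, cross_validated = cv_r2)
```

The gap between the two values gives a rough sense of how much the in-sample R2 overstates out-of-sample fit.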
Variable Selection
Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial
topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from
the MASS package. stepAIC( ) performs stepwise model selection by exact AIC.
# Stepwise Regression
library(MASS)
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps
package. In the following code nbest indicates the number of subsets of each size to report. Here, the
ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.).
Other options for plot( ) are bic, Cp, and adjr2. Other options for plotting with
subset( ) are bic, cp, adjr2, and rss.
Relative Importance
The relaimpo package provides measures of relative importance for each of the predictors in the
model. See help(calc.relimp) for details on the four measures of relative importance provided.
Graphic Enhancements
The car package offers a wide variety of plots for regression, including added variable plots, and
enhanced diagnostic and scatter plots.
Going Further
Nonlinear Regression
The nls package provides functions for nonlinear regression. See John Fox's Nonlinear Regression and
Nonlinear Least Squares for an overview. Huet and colleagues' Statistical Tools for Nonlinear Regression:
A Practical Guide with S-PLUS and R Examples is a valuable reference book.
Robust Regression
There are many functions in R to aid with robust regression. For example, you can perform robust
regression with the rlm( ) function in the MASS package. John Fox's (who else?) Robust Regression
provides a good starting overview. The UCLA Statistical Computing website has Robust Regression
Examples.
The robust package provides a comprehensive library of robust methods, including regression. The
robustbase package also provides basic robust statistics including model selection methods. And David
Olive has provided a detailed online review of Applied Robust Statistics with sample R code.
Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Resampling Statistics
The coin package provides the ability to perform a wide variety of re-randomization or permutation
based statistical tests. These tests do not assume random sampling from well-defined populations. They
can be a reasonable alternative to classical procedures when test assumptions cannot be met. See
coin: A Computational Framework for Conditional Inference for details.
In the examples below, lower case letters represent numerical variables and upper case letters
represent categorical factors. Monte-Carlo simulations are available for all tests. Exact tests are
available for 2 group procedures.
Independent Two- and K-Sample Location Tests
distribution=approximate(B=9999))
Many other univariate and multivariate tests are possible using the functions in the coin package. See A
Lego System for Conditional Inference for more details.
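To make the re-randomization idea concrete, here is a hand-rolled base R sketch of a two-sample permutation test on a difference in means. It does not use the coin package; the data (mpg by transmission type in mtcars), the seed, and the number of permutations are assumptions for illustration.

```r
set.seed(42)
y <- mtcars$mpg          # numeric outcome
G <- factor(mtcars$am)   # two-level grouping factor

# Observed difference in group means
obs <- diff(tapply(y, G, mean))

# Re-randomize the group labels many times and recompute the statistic
B <- 9999
perm <- replicate(B, diff(tapply(y, sample(G), mean)))

# Two-sided Monte Carlo p-value (the +1 counts the observed arrangement)
p_value <- (sum(abs(perm) >= abs(obs)) + 1) / (B + 1)
p_value
```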
t-tests
The t.test( ) function produces a variety of t-tests. Unlike most statistical packages, the default
assumes unequal variance and applies the Welch df modification.
# paired t-test
t.test(y1,y2,paired=TRUE) # where y1 & y2 are numeric
# one sample t-test
t.test(y,mu=3) # Ho: mu=3
You can use the var.equal=TRUE option to specify equal variances and a pooled variance estimate.
You can use the alternative="less" or alternative="greater" option to specify a one tailed test.
Visualizing Results
Use box plots or density plots to visualize group differences.
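The effect of the var.equal and alternative options can be sketched on an independent two-group comparison. The example assumes mpg by transmission type (am) from mtcars.

```r
# Default: Welch test (unequal variances, adjusted df)
welch <- t.test(mpg ~ am, data = mtcars)

# Pooled-variance test
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# One-tailed test: is the mean of the first group (am = 0) less than the second?
one_tail <- t.test(mpg ~ am, data = mtcars, alternative = "less")

welch$parameter    # fractional df from the Welch modification
pooled$parameter   # n1 + n2 - 2
```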
Using with( ) and by( )
There are two functions that can help write simpler and more efficient code.
By
The by( ) function applies a function to each level of a factor or factors. It is similar to BY processing in
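A short sketch of both functions, assuming mtcars as the data frame: with( ) evaluates an expression using the data frame's columns as variables, and by( ) applies a function to the rows belonging to each level of a factor.

```r
# with( ): refer to columns without repeating the data frame name
overall_mean <- with(mtcars, mean(mpg))

# by( ): column means of three variables for each level of cyl
by_cyl <- by(mtcars[c("mpg", "hp", "wt")], mtcars$cyl, colMeans)
by_cyl
```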
Axes and Text
Many high level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options
(as well as other graphical parameters). For example
For finer control or for modularization, you can use the functions described below.
title(main="main title", sub="sub-title",
xlab="x-axis label", ylab="y-axis label")
Many other graphical parameters (such as text size, font, rotation, and color) can also be specified in
the title( ) function.
Text Annotations
Text can be added to graphs using the text( ) and mtext( ) functions. text( ) places text within the
graph while mtext( ) places text in one of the four margins.
text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)
Common options are described below.
option description
location location can be an x,y coordinate. Alternatively, the text can be placed
interactively via mouse by specifying location as locator(1).
pos position relative to location. 1=below, 2=left, 3=above, 4=right. If you
specify pos, you can specify offset= in percent of character width.
side which margin to place text. 1=bottom, 2=left, 3=top, 4=right. You can
specify line= to indicate the line in the margin starting with 0 and moving
out. You can also specify adj=0 for left/bottom alignment or adj=1 for
top/right alignment.
Other common options are cex, col, and font (for size, color, and font style respectively).
Labeling points
You can use the text( ) function (see above) for labeling points as well as for adding other text
annotations. Specify location as a set of x, y coordinates and specify the text to place as a vector of
labels. The x, y, and label vectors should all be the same length.
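A minimal sketch of point labeling, assuming the mtcars data with car names as labels. The call pdf(NULL) opens a device that writes no file so the example runs anywhere; drop it (and dev.off( )) to draw on screen.

```r
x <- mtcars$wt
y <- mtcars$mpg
point_labels <- rownames(mtcars)  # one label per point

pdf(NULL)  # null device: nothing is written to disk
plot(x, y, main = "Mileage vs. Car Weight",
     xlab = "Weight", ylab = "Mileage", pch = 18, col = "blue")
# pos = 4 places each label to the right of its point
text(x, y, labels = point_labels, cex = 0.6, pos = 4, col = "red")
# margin text on the top side, half a line out
mtext("points labeled with text( )", side = 3, line = 0.5)
dev.off()
```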
Math Annotations
You can add mathematical formulas to a graph using TeX-like rules. See help(plotmath) for details
and examples.
Axes
You can create custom axes using the axis( ) function.
axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)
where
option description
side an integer indicating the side of the graph to draw the axis (1=bottom, 2=left,
3=top, 4=right)
at a numeric vector indicating where tick marks should be drawn
labels a character vector of labels to be placed at the tick marks
(if NULL, the at values will be used)
pos the coordinate at which the axis line is to be drawn.
(i.e., the value on the other axis where it crosses)
lty line type
col the line and tick mark color
las labels are parallel (=0) or perpendicular (=2) to axis
tck length of tick mark as fraction of plotting region (negative number is outside
graph, positive number is inside, 0 suppresses ticks, 1 creates gridlines) default
is -0.01
(...) other graphical parameters
If you are going to create a custom axis, you should suppress the axis automatically generated by your
high level plotting function. The option axes=FALSE suppresses both x and y axes. xaxt="n" and
yaxt="n" suppress the x and y axis respectively. Here is a (somewhat overblown) example.
# add x vs. 1/x
lines(x, z, type="b", pch=22, col="blue", lty=2)
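Only a fragment of the original example survives, so here is a separate, smaller sketch of the same steps: suppress the default axes, then draw custom ones. The data and tick positions are made up for illustration, and pdf(NULL) is used so no file is written.

```r
x <- 1:10
y <- x^2
y_ticks <- seq(0, 100, by = 25)

pdf(NULL)
# axes=FALSE suppresses both default axes
plot(x, y, type = "b", axes = FALSE, xlab = "x", ylab = "x squared")
# custom x axis on the bottom with a tick at every x value
axis(1, at = x, labels = x)
# custom y axis on the left, labels perpendicular to the axis, shorter ticks
axis(2, at = y_ticks, labels = y_ticks, las = 2, tck = -0.02, col = "blue")
box()  # redraw the frame that axes=FALSE removed
dev.off()
```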
Reference Lines
Add reference lines to a graph using the abline( ) function.
abline(h=yvalues, v=xvalues)
Other graphical parameters (such as line type, color, and width) can also be specified in the abline( )
function.
Note: You can also use the grid( ) function to add reference lines.
Legend
Add a legend with the legend() function.
legend(location, title, legend, ...)
option description
location There are several ways to indicate the location of the legend. You can give
an x,y coordinate for the upper left hand corner of the legend. You can use
locator(1), in which case you use the mouse to indicate the location of the
legend. You can also use the keywords "bottom", "bottomleft", "left",
"topleft", "top", "topright", "right", "bottomright", or "center". If you use a
keyword, you may want to use inset= to specify an amount to move the
legend into the graph (as fraction of plot region).
title A character string for the legend title (optional)
legend A character vector with the labels
... Other options. If the legend labels colored lines, specify col= and a vector of
colors. If the legend labels point symbols, specify pch= and a vector of point
symbols. If the legend labels line width or line style, use lwd= or lty= and a
vector of widths or styles. To create colored boxes for the legend (common
in bar, box, or pie charts), use fill= and a vector of colors.
Other common legend options include bty for box type, bg for background color, cex for size, and
text.col for text color. Setting horiz=TRUE sets the legend horizontally rather than vertically.
# Legend Example
attach(mtcars)
boxplot(mpg~cyl, main="Mileage by Car Weight",
yaxt="n", xlab="Mileage", horizontal=TRUE,
col=terrain.colors(3))
legend("topright", inset=.05, title="Number of Cylinders",
c("4", "6", "8"), fill=terrain.colors(3), horiz=TRUE)
For more on legends, see help(legend). The examples in the help are particularly informative.
Correlograms
Correlograms help us visualize the data in correlation matrices. For details, see Corrgrams:
Exploratory displays for correlation matrices.
In R, correlograms are implemented through the corrgram(x, order = , panel=, lower.panel=,
upper.panel=, text.panel=, diag.panel=) function in the corrgram package.
Options
x is a data frame with one observation per row.
order=TRUE will cause the variables to be ordered using principal component analysis of the
correlation matrix.
panel= refers to the off-diagonal panels. You can use lower.panel= and upper.panel= to choose
different options below and above the main diagonal respectively. text.panel= and diag.panel= refer
to the main diagonal. Allowable parameters are given below.
off diagonal panels
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)
Graphics with ggplot2
The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating
elegant and complex plots. Its popularity in the R community has exploded in recent years. Originally
based on Leland Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that
represent both univariate and multivariate numerical and categorical data in a straightforward manner.
Grouping can be represented by color, symbol, size, and transparency. The creation of trellis plots
(i.e., conditioning) is relatively simple.
Mastering the ggplot2 language can be challenging (see the Going Further section below for helpful
resources). There is a helper function called qplot() (for quick plot) that can hide much of this
complexity when creating standard graphs.
main, sub Character vectors specifying the title and subtitle
method, formula If geom="smooth", a loess fit line and confidence limits are added by default. When
the number of observations is greater than 1,000, a more efficient smoothing
algorithm is employed. Methods include "lm" for regression, "gam" for generalized
additive models, and "rlm" for robust regression. The formula parameter gives the
form of the fit.
For example, to add simple linear regression lines, you'd specify geom="smooth",
method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a
quadratic fit. Note that the formula uses the letters x and y, not the names of the
variables.
For method="gam", be sure to load the mgcv package. For method="rlm", load the
MASS package.
x, y Specifies the variables placed on the horizontal and vertical axis. For univariate
plots (for example, histograms), omit y
xlab, ylab Character vectors specifying horizontal and vertical axis labels
xlim, ylim Two-element numeric vectors giving the minimum and maximum values for the
horizontal and vertical axes, respectively
Notes:
Here are some examples using automotive data (car mileage, weight, number of gears, number of
cylinders, etc.) contained in the mtcars data frame.
# ggplot2 examples
library(ggplot2)
Going Further
We have only scratched the surface here. To learn more, see the ggplot reference site, and Winston
Chang's excellent Cookbook for R site. Though slightly out of date, ggplot2: Elegant Graphics for Data
Analysis is still the definitive book on this subject.
Interactive Graphics
There are several ways to interact with R graphics in real time. Three methods are described below.
Once GGobi is installed, you can use the ggobi( ) function in the package rggobi to run GGobi from
within R. This gives you interactive graphics access to all of your R data! See An Introduction to
RGGOBI.
iPlots
The iplots package provides interactive mosaic plots, bar plots, box plots, parallel plots, scatter plots
and histograms that can be linked and color brushed. iplots is implemented through the Java GUI for R.
For more information, see the iplots website.
The layout( ) function has the form layout(mat) where
mat is a matrix object specifying the location of the N figures to plot.
Optionally, you can include widths= and heights= options in the layout() function to control the size of
each figure more precisely. These options have the form
widths= a vector of values for the widths of columns
heights= a vector of values for the heights of rows.
Relative widths are specified with numeric values. Absolute widths (in centimetres) are specified with
the lcm() function.
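A sketch of layout( ) with relative widths and heights, assuming three histograms from mtcars: figure 1 spans the whole top row, and figures 2 and 3 share the bottom row. pdf(NULL) is used so no file is written.

```r
# Matrix entries give the figure number that occupies each cell
mat <- matrix(c(1, 1,
                2, 3), nrow = 2, ncol = 2, byrow = TRUE)

pdf(NULL)
# Column 1 is three times as wide as column 2; row 2 is twice as tall as row 1
n_figs <- layout(mat, widths = c(3, 1), heights = c(1, 2))
hist(mtcars$wt)    # figure 1
hist(mtcars$mpg)   # figure 2
hist(mtcars$disp)  # figure 3
dev.off()
```

layout( ) returns the number of figures it defined, which is stored in n_figs here.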
To understand this graph, think of the full graph area as going from (0,0) in the lower left corner to
(1,1) in the upper right corner. The format of the fig= parameter is a numerical vector of the form
c(x1, x2, y1, y2). The first fig= sets up the scatterplot going from 0 to 0.8 on the x axis and 0 to 0.8 on
the y axis. The top boxplot goes from 0 to 0.8 on the x axis and 0.55 to 1 on the y axis. I chose 0.55
rather than 0.8 so that the top figure will be pulled closer to the scatter plot. The right hand boxplot
goes from 0.65 to 1 on the x axis and 0 to 0.8 on the y axis. Again, I chose a value to pull the right
hand boxplot closer to the scatterplot. You have to experiment to get it just right.
fig= starts a new plot, so to add to an existing plot use new=TRUE.
You can use this to combine several plots in any arrangement into one graph.
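The arrangement described above can be sketched as follows. The coordinates are the ones given in the text; the data (wt and mpg from mtcars) and axis labels are assumptions, and pdf(NULL) is used so no file is written.

```r
pdf(NULL)
# Scatterplot in the lower left region
par(fig = c(0, 0.8, 0, 0.8))
plot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", ylab = "Miles Per Gallon")
# Horizontal boxplot above it; new=TRUE adds to the existing page
par(fig = c(0, 0.8, 0.55, 1), new = TRUE)
boxplot(mtcars$wt, horizontal = TRUE, axes = FALSE)
# Vertical boxplot to the right
par(fig = c(0.65, 1, 0, 0.8), new = TRUE)
boxplot(mtcars$mpg, axes = FALSE)
last_fig <- par("fig")  # region used by the most recent figure
dev.off()
```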
Visualizing Categorical Data
The vcd package provides a variety of methods for visualizing multivariate categorical data, inspired by
Michael Friendly's wonderful "Visualizing Categorical Data". Extended mosaic and association plots are
described here. Each provides a method of visualizing complex data and evaluating deviations from a
specified independence model. For more details, see The Strucplot framework.
Going Further
Both functions are complex and offer multiple input and output options. See help(mosaic) and
help(assoc) for more details.
Graphical Parameters
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic options.
One way is to specify these options through the par() function. If you set parameter values here, the
changes will be in effect for the rest of the session or until you change them again. The format is
par(optionname=value, optionname=value, ...)
A second way to specify graphical parameters is by providing the optionname=value pairs directly to a
high level plotting function. In this case, the options are only in effect for that specific graph.
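Because par( ) settings persist, a common pattern is to save the current settings, change them, and restore them afterwards. The sketch below assumes a histogram of mtcars$mpg as the plot and uses pdf(NULL) so no file is written.

```r
pdf(NULL)
opar <- par(no.readonly = TRUE)      # copy of the current settings
par(col.lab = "red", cex.axis = 0.8) # red axis labels, smaller axis text
during <- par("col.lab")
hist(mtcars$mpg)                     # drawn with the modified settings
par(opar)                            # restore the original settings
after <- par("col.lab")

# Second way: the option applies to this one plot only
hist(mtcars$mpg, col.lab = "red")
dev.off()
```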
Text and Symbol Size
The following options can be used to control text and symbol size in graphs.
cex number indicating the amount by which plotting text and symbols should be
scaled relative to the default. 1=default, 1.5 is 50% larger, 0.5 is 50%
smaller, etc.
cex.axis magnification of axis annotation relative to cex
cex.lab magnification of x and y labels relative to cex
cex.main magnification of titles relative to cex
cex.sub magnification of subtitles relative to cex
Plotting Symbols
Use the pch= option to specify symbols to use when plotting points. For symbols 21 through 25, specify
border color (col=) and fil! color (bg=).
[Figure: chart of the plotting symbols available through pch=, numbered 0 through 25]
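As a small illustration of the col= and bg= behavior for symbols 21 through 25, the sketch below draws those five symbols with a blue border and a yellow fill. The colors and layout are arbitrary choices for the example.

```r
# Symbols 21-25 take a border color (col) and a fill color (bg)
symbols_used <- 21:25
plot(1:5, rep(1, 5), pch = symbols_used, col = "blue", bg = "yellow",
     cex = 2, xlab = "", ylab = "", main = "pch = 21 to 25")
# label each symbol with its pch number
text(1:5, rep(1.2, 5), labels = symbols_used)
```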
Lines
You can change lines using the following options. This is particularly useful for reference lines, axes,
and fit lines.
option description
lty line type. See the chart below.
lwd line width relative to the default (default=1). 2 is twice as wide.
[Figure: chart of line types, lty = 1 through 6]
Colors
Options that specify colors include the following.
option description
col Default plotting color. Some functions (e.g. lines) accept a vector of values
that are recycled.
col.axis color for axis annotation
col.lab color for x and y labels
col.main color for titles
col.sub color for subtitles
fg plot foreground color (axes, boxes - also sets col= to same)
bg plot background color
The following chart was produced with code developed by Earl F. Glynn. See his Color Chart for all the
details you would ever need about using colors in R.
You can also create a vector of n contiguous colors using the functions rainbow(n), heat.colors(n),
terrain.colors(n), topo.colors(n), and cm.colors(n).
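For instance, the palette functions return a character vector of color codes that can be passed to any col= argument. A short sketch (the bar plots are only a convenient way to display the colors):

```r
n <- 6
cols <- rainbow(n)                 # character vector of n color codes
barplot(rep(1, n), col = cols, main = "rainbow(6)")
barplot(rep(1, n), col = heat.colors(n), main = "heat.colors(6)")
```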
Fonts
You can easily set font size and style, but font family is a bit more complicated.
option description
font Integer specifying font to use for text.
1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.axis font for axis annotation
font.lab font for x and y labels
font.main font for titles
font.sub font for subtitles
ps font point size (roughly 1/72 inch)
text size=ps*cex
family font family for drawing text. Standard values are "serif", "sans", "mono",
"symbol". Mapping is device dependent.
In Windows, mono is mapped to "TT Courier New", serif is mapped to "TT Times New Roman", sans is
mapped to "TT Arial", and symbol is mapped to "TT Symbol"
(TT=True Type). You can add your own mappings.
option description
mar numerical vector indicating margin size c(bottom, left, top, right) in lines.
default = c(5, 4, 4, 2) + 0.1
mai numerical vector indicating margin size c(bottom, left, top, right) in inches
pin plot dimensions (width, height) in inches
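A brief sketch of adjusting margins with mar. The specific margin values and the histogram are illustrative assumptions; the point is the c(bottom, left, top, right) ordering.

```r
opar <- par(no.readonly = TRUE)    # remember the current settings
# shrink the top and right margins; order is bottom, left, top, right (in lines)
par(mar = c(5, 4, 2, 1) + 0.1)
hist(mtcars$mpg, main = "", xlab = "Miles per Gallon")
new_mar <- par("mar")              # the margins now in effect
par(opar)                          # restore the defaults
```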
Going Further
See help(par) for more information on graphical parameters. The customization of plotting axes and
text annotations is covered in the next section.
Probability Plots
This section describes creating probability plots in R for both didactic purposes and for data analyses.
For a comprehensive list of distributions, see Statistical Distributions on the R Wiki. The functions available for each
distribution follow this format:
name description
dname( ) density or probability function
pname( ) cumulative distribution function
qname( ) quantile function
rname( ) random deviates
For example, pnorm(0) = 0.5 (the area under the standard normal curve to the left of zero).
qnorm(0.9) = 1.28 (1.28 is the 90th percentile of the standard normal distribution). rnorm(100)
generates 100 random deviates from a standard normal distribution.
Each function has parameters specific to that distribution. For example, rnorm(100, m=50, sd=10)
generates 100 random deviates from a normal distribution with mean 50 and standard deviation 10.
You can use these functions to demonstrate various aspects of probability distributions. Two common
examples are given below.
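The d/p/q/r naming scheme can be checked directly at the console. A quick sketch for the normal distribution (the seed and the plotting range are arbitrary choices):

```r
p0  <- pnorm(0)        # area to the left of zero under the standard normal
q90 <- qnorm(0.9)      # 90th percentile of the standard normal
set.seed(42)           # only so the random draw is repeatable
x <- rnorm(100, mean = 50, sd = 10)   # 100 deviates, mean 50, sd 10

# dnorm() gives the height of the density curve
z <- seq(-4, 4, length.out = 200)
plot(z, dnorm(z), type = "l", xlab = "z", ylab = "Density")
```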
for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}

mean=100; sd=15
lb=80; ub=120
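Only fragments of the two examples survive in this copy. The following is a self-contained sketch in the same spirit, not the original code: the degrees of freedom, colors, and legend are assumptions, and mu and sigma stand in for the mean and sd values shown in the fragment.

```r
# Example 1: compare t distributions with a standard normal curve
x <- seq(-4, 4, length.out = 100)
degf <- c(1, 3, 8, 30)                          # assumed degrees of freedom
colors <- c("red", "blue", "darkgreen", "gold") # assumed colors
plot(x, dnorm(x), type = "l", lty = 2, xlab = "x value",
     ylab = "Density", main = "Comparison of t Distributions")
for (i in 1:4) {
  lines(x, dt(x, degf[i]), lwd = 2, col = colors[i])
}
legend("topright", legend = c(paste("df =", degf), "normal"),
       lwd = 2, lty = c(1, 1, 1, 1, 2), col = c(colors, "black"))

# Example 2: probability that a normal variable (mean 100, sd 15)
# falls between the lower bound 80 and the upper bound 120
mu <- 100; sigma <- 15
lb <- 80; ub <- 120
area <- pnorm(ub, mu, sigma) - pnorm(lb, mu, sigma)
area
```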
For a comprehensive view of probability plotting in R, see Vincent Zoonekynd's Probability Distributions.
Fitting Distributions
There are several methods of fitting distributions in R. Here are some options.
You can use the qqnorm() function to create a Quantile-Quantile plot evaluating the fit of sample data
to the normal distribution. More generally, the qqplot( ) function creates a Quantile-Quantile plot for
any theoretical distribution.
# Q-Q plots
# x is assumed to hold the sample data, e.g. x <- rt(100, df=3)
par(mfrow=c(1,2))
# normal fit
qqnorm(x); qqline(x)
# t(3Df) fit
qqplot(rt(1000,df=3), x, main="t(3) Q-Q Plot",
   ylab="Sample Quantiles")
abline(0,1)
The fitdistr( ) function in the MASS package provides maximum-likelihood fitting of univariate
distributions. The format is fitdistr(x, densityfunction) where x is the sample data and densityfunction
is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f", "gamma", "geometric",
"log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t" or "weibull".
# estimate parameters
library(MASS)
fitdistr(x, "lognormal")
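To see what fitdistr( ) returns, it helps to fit data whose true parameters are known. The sketch below simulates log-normal data (the sample size, seed, and parameter values are assumptions for illustration) and recovers the parameters.

```r
library(MASS)
set.seed(123)                                  # for a repeatable sample
x <- rlnorm(200, meanlog = 1, sdlog = 0.5)     # simulated sample data
fit <- fitdistr(x, "lognormal")                # maximum-likelihood fit
fit$estimate                                   # estimated meanlog and sdlog
fit$sd                                         # standard errors of the estimates
```

The estimates should land close to the values used in the simulation, with standard errors that shrink as the sample grows.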
Finally, R has a wide range of goodness of fit tests for evaluating if it is reasonable to assume that a
random sample comes from a specified theoretical distribution. These include chi-square, Kolmogorov-
Smirnov, and Anderson-Darling.
For more details on fitting distributions, see Vito Ricci's Fitting Distributions with R. For general (non
R) advice, see Bill Huber's Fitting Distributions to Data.
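As one concrete case, the Kolmogorov-Smirnov test is available in base R as ks.test( ). A sketch on simulated data, assuming the reference distribution's parameters are specified in advance (when they are estimated from the same sample, the reported p-value is only approximate):

```r
set.seed(123)
x <- rnorm(200, mean = 50, sd = 10)            # simulated sample
# compare the sample with a fully specified normal distribution
ks <- ks.test(x, "pnorm", mean = 50, sd = 10)
ks$statistic                                   # maximum distance between the CDFs
ks$p.value
```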
Lattice Graphs
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing
better defaults and the ability to easily display multivariate relationships. In particular, the package
supports the creation of trellis graphs - graphs that display a variable or the relationship between
variables, conditioned on one or more other variables.
The typical format is
graph_type(formula, data=)
where graph_type is selected from the list below. formula specifies the variable(s) to display and
any conditioning variables. For example ~x|A means display numeric variable x for each level of factor
A. y~x | A*B means display the relationship between numeric variables y and x separately for every
combination of factor A and B levels. ~x means display numeric variable x alone.
graph_type description formula examples
barchart bar chart x~A or A~x
bwplot boxplot x~A or A~x
cloud 3D scatterplot z~x*y|A
contourplot 3D contour plot z~x*y
densityplot kernel density plot ~x|A*B
dotplot dotplot ~x|A
histogram histogram ~x
levelplot 3D level plot z~y*x
parallel parallel coordinates plot data frame
splom scatterplot matrix data frame
stripplot strip plots A~x or x~A
xyplot scatterplot y~x|A
wireframe 3D wireframe graph z~y*x
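A short sketch of the formula notation in use, assuming the mtcars data and a factor version of cyl (the factor labels are made up for the example). Lattice functions return objects, so print( ) is called explicitly to draw them.

```r
library(lattice)
# add a factor version of cyl to use as the conditioning variable
mt <- transform(mtcars,
                cyl.f = factor(cyl, levels = c(4, 6, 8),
                               labels = c("4 cyl", "6 cyl", "8 cyl")))
p1 <- densityplot(~ mpg, data = mt)              # ~x : one variable alone
p2 <- xyplot(mpg ~ wt | cyl.f, data = mt,        # y~x|A : one panel per level of A
             layout = c(3, 1))                   # 3 columns, 1 row of panels
print(p1)
print(p2)
```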
Here are some examples. They use the car data (mileage, weight, number of gears, number of
cylinders, etc.) from the mtcars data frame.
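# Lattice examples (the opening lines are reconstructed; cyl.f is
# assumed to be the number of cylinders converted to a factor)
library(lattice)
cyl.f <- factor(mtcars$cyl, levels=c(4,6,8),
   labels=c("4cyl","6cyl","8cyl"))
# kernel density plot
densityplot(~mpg, data=mtcars,
   main="Density Plot",
   xlab="Miles per Gallon")
# kernel density plots by factor level
densityplot(~mpg|cyl.f, data=mtcars,
   main="Density Plot by Number of Cylinders",
   xlab="Miles per Gallon")
# scatterplot matrix
splom(mtcars[c(1,3,4,5,6)],
   main="MTCARS Data")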
Note, as in graph 1, that specifying a conditioning variable is optional. The difference between
graphs 2 & 3 is the use of the layout option to control the placement of panels.
Going Further
Lattice graphics are a comprehensive graphical system in their own right. Deepayan Sarkar's book
Lattice: Multivariate Data Visualization with R is the definitive reference. Additionally, see the Trellis
Graphics homepage and the Trellis User's Guide. Dr. Ihaka has created a wonderful set of slides on the
subject. An excellent early consideration of trellis graphs can be found in W.S. Cleveland's classic book
Visualizing Data.
R Tutorials--Counts and Proportions
Count data--data derived from counting things--are often treated as if they are assumed to be binomial distributed or
Poisson distributed. The binomial model applies when the counts are derived from independent Bernoulli trials in which
the probability of a "success" is known, and the number of trials (i.e., the maximum possible value of the count) is also
known. The classic example is coin tossing. We toss a coin 100 times and count the number of times the coin lands heads
side up. The maximum possible count is 100, and the probability of adding one to the count (a "success") on each trial is
known to be 0.5 (assuming the coin is fair). Trials are independent; i.e., previous outcomes do not influence the probability
of success on current or future trials. Another requirement is that the probability of success remain constant over trials.
Poisson counts are often assumed to occur when the maximum possible count is not known, but is assumed to be large, and
the probability of adding one to the count at each moment (trials are often ill defined) is also unknown, but is assumed to
be small. An example may help.
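Before the example, here is a quick numerical sketch of both models. The coin-tossing numbers come from the text; the values of n and p in the second half are made up to show a large maximum count paired with a small probability, where binomial and Poisson probabilities nearly coincide.

```r
# Binomial: 100 tosses of a fair coin
p_50_heads <- dbinom(50, size = 100, prob = 0.5)   # P(exactly 50 heads)
set.seed(1)
tosses <- rbinom(1000, size = 100, prob = 0.5)     # repeat the experiment 1000 times
range(tosses)                                      # counts never exceed 100

# Large n, small p: the binomial is close to a Poisson with lambda = n * p
n <- 10000; p <- 0.0005
b3 <- dbinom(3, size = n, prob = p)
p3 <- dpois(3, lambda = n * p)
c(binomial = b3, poisson = p3)
```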
Suppose we are counting traffic fatalities during a given month. The maximum possible count is quite high, we can
imagine, but in advance we can't say exactly (or even approximately) what it might be. Furthermore, the probability of a
fatal accident at any given moment in time is unknown but small. This sounds like it might be a Poisson distributed
variable. But is it?
The built in data set "Seatbelts" allows us to have a look at such a count variable. "Seatbelts" is not a data frame, it's a time
series, so extracting the counts of fatalities will take a bit of trickery...
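The extraction step is not shown in this copy. One way to do it, sketched under the assumption that the counts of interest are in the "DriversKilled" column (its mean matches the 122.8 quoted in the next paragraph):

```r
class(Seatbelts)         # a multivariate time series, not a data frame
colnames(Seatbelts)      # the available series
# pull one column out and drop the time-series attributes
deaths <- as.vector(Seatbelts[, "DriversKilled"])
length(deaths)
mean(deaths)
```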
Next we'll look at a histogram of the "deaths" vector, and plotted over top of that we will put the Poisson distribution with
mean = 122.8...
> hist(deaths, breaks=15, prob=T, ylim=c(0,.04))
> lines(x<-seq(65,195,10),dpois(x,lambda=mean(deaths)),lty=2,col="red")
And I'm not gonna lie to you. That took some trial and error! The histogram function asks for 15 break points, turns on
density plotting, and sets the y-axis to go from 0 to .04. The Poisson density function was plotted using the lines( )
function. The x-values were generated by using seq( ) and stored into "x" on the fly. The y-values were generated using
the dpois( ) function. We also requested a dashed, red line. Examination of the figure shows what we suspected. The
empirical distribution does not match the theoretical Poisson distribution.
proport.html[27/01/2014 22:19:22]
Finally, you may recall (if you read that tutorial) that the qqnorm( ) function can be used to check a distribution
for normality. The qqplot( ) function will compare any two distributions to see if they have the same shape. If they
do, the plotted points will fall along a straight line. The resulting plot doesn't look too terribly bad until we realize
these two distributions should not only both be Poisson, they should also both have the same mean (or lambda
value). Thus, the points should fall along a line with intercept 0 and slope 1. The moral of the story: just because
count data sound like they might fit a certain distribution doesn't mean they will. R provides a number of mechanisms
for checking this.
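A sketch of such a check, again assuming the counts are the "DriversKilled" series from Seatbelts. The simulated comparison sample and the variance-to-mean ratio are additions for illustration: a Poisson variable has variance equal to its mean, so a ratio far above 1 is another sign of a poor fit.

```r
deaths <- as.vector(Seatbelts[, "DriversKilled"])
set.seed(1)
# a simulated Poisson sample with the same mean and the same size
sim <- rpois(length(deaths), lambda = mean(deaths))
qqplot(sim, deaths, xlab = "Simulated Poisson counts",
       ylab = "Observed deaths")
abline(0, 1, lty = 2)        # where the points should fall under a good fit

# for a Poisson variable this ratio should be close to 1
ratio <- var(deaths) / mean(deaths)
ratio
```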
Suppose we set up a classic card-guessing test for ESP using a 25-card deck of Zener cards, which consists of 5 cards each
of 5 different symbols. If the null hypothesis is correct (H0 : no ESP) and the subject is just guessing at random, then we
should expect pN correct guesses from N independent Bernoulli trials on which the probability of a success (correct guess)
is p = 0.2. Suppose our subject gets 9 correct guesses. Is this out of line with what we should expect just by random
chance?
A number of proportion tests could be applied here, but the sample size is fairly small (just 25 guesses), so an exact
binomial test is our best choice...
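The test output is not reproduced in this copy. The call itself is short: binom.test( ) takes the number of successes, the number of trials, and the null probability. The second call below is the one-tailed variant, requested through the alternative= option of the same function.

```r
# 9 correct guesses in 25 trials, chance probability 0.2
two_sided <- binom.test(9, 25, p = 0.2)
one_sided <- binom.test(9, 25, p = 0.2, alternative = "greater")
two_sided$p.value
one_sided$p.value       # equals P(X >= 9) under the null hypothesis
two_sided$conf.int      # confidence interval for the true success rate
```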
We might argue at this point that we should have done a one-tailed test. (A two-tailed test is the default.) Of course, this
decision should be made in advance, but if the subject is displaying evidence of ESP, we would expect his success rate to
be not just different from chance but greater than chance. To take this into account in the test, we need to set the
"alternative=" option. Choices are "less", "greater", and "two.sided" (the default)...
The subject keeps guessing because, of course, we'd like to see this above chance performance repeated. He has now made
400 passes through the deck for a total of 10,000 independent guesses. He has guessed correctly 2,022 times. What should
we conclude?
An exact binomial test is probably not the best choice here as the sample size is now very large. We'll substitute a single-
sample proportion test...
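A sketch of that test with the numbers from the text, using prop.test( ) with a single count, the number of trials, and the null proportion:

```r
# 2022 correct guesses in 10000 trials, tested against p = 0.2
pt <- prop.test(2022, 10000, p = 0.2)
pt$estimate     # observed proportion of correct guesses
pt$p.value
```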
applied. If you don't want it, set the "correct=" option to FALSE. The default value is TRUE. (This value must be set to
FALSE to make the test mathematically equivalent to the uncorrected z-test of a proportion.)
A random sample of 428 adults from Myrtle Beach reveals 128 smokers. A random sample of 682 adults from San
Francisco reveals 170 smokers. Is the proportion of adult smokers in Myrtle Beach different from that in San Francisco?
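For two independent samples, prop.test( ) accepts a vector of success counts and a vector of sample sizes. A sketch with the survey numbers above (the output is not shown in this copy, so none is claimed here):

```r
smokers <- c(128, 170)       # Myrtle Beach, San Francisco
sampled <- c(428, 682)
pt2 <- prop.test(smokers, sampled)
pt2$estimate                 # the two sample proportions
pt2$p.value
pt2$conf.int                 # interval for the difference in proportions
```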
R incorporates a function for calculating the power of a 2-proportions test. The syntax is illustrated here from the help
page...
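The help-page output is missing from this copy. The function is power.prop.test( ); leave exactly one of n, power, or the proportions unspecified and it solves for that quantity. The proportions below are illustrative values, not results from any study.

```r
# sample size per group needed to detect 0.50 vs 0.75 with 90% power
pw <- power.prop.test(p1 = 0.50, p2 = 0.75, power = 0.90)
pw$n

# power achieved with 50 subjects per group
pw50 <- power.prop.test(n = 50, p1 = 0.50, p2 = 0.75)
pw50$power
```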
The two-proportions test generalizes directly to a multiple proportions test. The example from the help page should suffice
to illustrate...
> example(prop.test)
> ## Data from Fleiss (1981), p. 139.
> ## H0: The null hypothesis is that the four populations from which
> ## the patients were drawn have the same true proportion of smokers.
> ## A: The alternative is that this proportion is different in at
> ## least one of the populations.
>
R Tutorials--Data Frames
DATA FRAMES
Preamble
There is plenty to say about data frames because they are the primary data structure in R. Some of what follows is essential
knowledge. Some of it will be satisfactorily learned for now if you remember that "R can do that." I will try to point out
which parts are which. Set aside some time. This is a long one!
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one
variable, and each row contains one case. As we shall see, a "case" is not necessarily the same as an experimental subject
or unit, although they are often the same. Technically, in R a data frame is a list of column vectors, although there is only
one reason why you might need to remember such an arcane thing. Unlike an array, the data you store in the columns of a
data frame can be of various types. I.e., one column might be a numerical variable, another might be a factor, and a third
might be a character variable. All columns have to be the same length (contain the same number of data items).
Let's say we've collected data on one response variable or DV from 15 subjects, who were divided into three experimental
groups called control ("contr"), treatment one ("treat1"), and treatment two ("treat2"). We might be tempted to table the
data as follows...
contr treat1 treat2
---------------------------
22 32 30
18 35 28
25 30 25
25 42 22
20 31 33
---------------------------
While this is a perfectly acceptable table, it is NOT a data frame, because values on our one response variable have been
divided into three columns (and so have values on the grouping or independent variable). A data frame has the name of the
variable at the top of the column, and values of that variable in the column under the variable name. So the data above
should be tabled as follows...
scores group
----------------
22 contr
18 contr
25 contr
25 contr
20 contr
32 treat1
35 treat1
30 treat1
42 treat1
31 treat1
30 treat2
28 treat2
25 treat2
22 treat2
33 treat2
----------------
This is a proper data frame (leave out the dashed lines, although in actual fact R could read this table just as you see it
here). It does not matter what order you type the columns in, as long as each column contains values of one variable, and
every recorded value of that variable is in that column.
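As a sketch, the table above can be typed in as two vectors and combined. The object names are arbitrary; rep( ) is used only to avoid typing each group label five times.

```r
scores <- c(22, 18, 25, 25, 20,     # contr
            32, 35, 30, 42, 31,     # treat1
            30, 28, 25, 22, 33)     # treat2
group <- factor(rep(c("contr", "treat1", "treat2"), each = 5))
dv.data <- data.frame(scores, group)
# with one variable per column, group summaries are one call away
group_means <- tapply(dv.data$scores, dv.data$group, mean)
group_means
```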
dataframes.html[26/01/2014 15:20:41]
In a previous tutorial we used the data object "women" as an example of a data frame...
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
In this data frame we have two numerical variables and no real explanatory variables (IVs) or response variables (DVs).
Notice when R prints out a data frame, it numbers the rows. These numbers are for convenience only and are not part of
the data frame, and I'll have much more to say about them shortly.
We can refer to any value, or subset of values, in this data frame using the already familiar notation...
> women[12,2] # row 12, column 2; note: square brackets
[1] 150
> women[8,] # row 8, all columns
height weight
8 65 135
> women[1:5,] # rows 1 to 5, all columns
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
> women[,2] # all rows, column 2
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> women[c(1,3,7,13),] # rows 1, 3, 7, and 13, all columns
height weight
1 58 115
3 60 120
7 64 132
13 70 154
> women[c(1,3,7,13),1] # rows 1, 3, 7, and 13, column 1
[1] 58 60 64 70
Here's the catch. Those index numbers do NOT necessarily correspond to the numbers you see printed out with the data
frame. This can be confusing at first, and it is something you need to keep in mind. I will explain in a moment.
Another built-in data object that is a data frame is "warpbreaks". This data frame contains 54 cases, so I will print out only
every third one. I do this with the sequence function, since this function creates a vector just as the c( ) function did in
the above examples...
> warpbreaks[seq(1,54,3),]
breaks wool tension
1 26 A L
4 25 A L
7 51 A L
10 18 A M
13 17 A M
16 35 A M
19 36 A H
22 18 A H
25 28 A H
28 27 B L
31 19 B L
34 41 B L
37 42 B M
40 16 B M
43 21 B M
46 20 B H
49 17 B H
52 15 B H
In this data frame we have one numerical variable (number of breaks), and two categorical variables (type of wool and
tension on the wool). We don't have to look at the data frame itself to get this information. We can also use the str( )
function, which displays a breakdown of the structure of a data frame...
> str(warpbreaks)
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
> sleep
extra group
1 0.7 1
2 -1.6 1
3 -0.2 1
4 -1.2 1
5 -0.1 1
6 3.4 1
7 3.7 1
8 0.8 1
9 0.0 1
10 2.0 1
11 1.9 2
12 0.8 2
13 1.1 2
14 0.1 2
15 -0.1 2
16 4.4 2
17 5.5 2
18 1.6 2
19 4.6 2
20 3.4 2
Here we have two variables, the change in sleep time a subject got ("extra"), and what drug the subject received ("group").
In this case, the first variable (the dependent variable, DV, response variable, etc.) is numerical and the second (the
independent variable, IV, explanatory variable, grouping variable, etc.) is categorical, even though the categorical variable
is coded as a number. Once again, it does not matter in what order the columns occur. Put the IV in the first column and
the DV in the second column if you want.
However, if categorical variables are coded as numbers (a common practice), R will not know this until you tell it...
> str(sleep)
'data.frame': 20 obs. of 2 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
In this case, the fact that "group" is a factor is stored internally in the data frame, but that will not always be the case. So
it's worth taking a look to make sure things you intend to be factors are being interpreted as factors by R. You can do this
with str( ), but you can also do it with summary( ), because numerical variables and factors are summarized
differently...
> summary(sleep)
extra group
Min. :-1.600 1:10
1st Qu.:-0.025 2:10
Median : 0.950
Mean : 1.540
3rd Qu.: 3.400
Max. : 5.500
Notice that numerical variables (extra) are summarized with numerical summary statistics, while factors are summarized
with a frequency table. In these data, there are 10 subjects in group 1 and 10 subjects in group 2.
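When a grouping variable has been entered as plain numbers, factor( ) does the telling. A small sketch with a hypothetical four-row data frame (the values are borrowed from the sleep data above):

```r
d <- data.frame(extra = c(0.7, -1.6, 1.9, 0.8),
                group = c(1, 1, 2, 2))       # group entered as numbers
was_factor <- is.factor(d$group)             # FALSE: R sees a numeric variable
d$group <- factor(d$group)                   # declare it categorical
summary(d)                                   # group now gets a frequency table
```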
Entering data into a data frame sometimes involves making a tough decision as to what your variables are. The following
example is from a built-in data object called "anorexia". This data set is not in the libraries that are loaded by default when
R starts, so to see it, the first thing we need to do is attach the correct library to the search path. Let's see how that works...
> search()
[1] ".GlobalEnv" "tools:RGUI" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"
This is the default search path, the one you have right after R starts. (It will be a little different in different operating
systems.) We want to see an object in the MASS library (or package), which is not currently in the search path. So to get it
into the search path, do this...
> library(MASS)
> search()
[1] ".GlobalEnv" "package:MASS" "tools:RGUI"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"
Notice we have added "package:MASS" to the search path in position 2. This means if we request an R object, R will look
first in the global environment (the workspace), and if the object is not found there, R will look next in MASS, then in
RGUI, then in stats, and so on, until the object either is found or R runs out of places to look for it. The "anorexia" data
frame is 72 cases long, so to conserve space we will look at only every fifth row of it...
> anorexia[seq(1,72,5),]
Treat Prewt Postwt
1 Cont 80.7 80.2
6 Cont 88.3 78.1
11 Cont 77.6 77.4
16 Cont 77.3 77.3
21 Cont 85.5 88.3
26 Cont 89.0 78.8
31 CBT 79.9 76.4
36 CBT 80.5 82.1
41 CBT 70.0 90.9
46 CBT 84.2 83.9
51 CBT 83.3 85.2
56 FT 83.8 95.2
61 FT 79.6 76.7
66 FT 81.6 77.8
71 FT 86.0 91.7
The data frame contains data from women who underwent treatment for anorexia. In the first column we have the
treatment variable ("Treat"). The second column contains the pretreatment body weight in pounds ("Prewt"). The third
column contains the posttreatment body weight in pounds ("Postwt"). So where is the ambiguity?
Here's the awkward question. In our analysis of these data, do we wish to treat weight as two variables (pre and post) each
measured once on each subject, or as one variable (weight) measured twice on each subject? The data frame is currently
arranged as if the plan was for an analysis of covariance, with "Postwt" being the response, "Treat" the explanatory
variable, and "Prewt" the covariate. Prewt and Postwt are treated as two variables.
If the plan was for a repeated measures ANOVA, then the data frame is wrong, because in this case, "weight" is ONE
variable measured twice ("pre" and "post") on each woman. In this analysis, we would also need to add a "subject"
variable to the data frame as well, since each subject would have two lines, a "pre" line and a "post" line.
It's not a disaster. The data frame is easy enough to rearrange on the fly, and we will do so below.
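One possible rearrangement, sketched with base R only. This is an illustration of the idea and not necessarily the method used later in the tutorial; the names subject, time, and weight are invented here.

```r
library(MASS)                      # provides the anorexia data frame
n <- nrow(anorexia)
anorexia.long <- data.frame(
  subject = factor(rep(1:n, times = 2)),             # each woman appears twice
  Treat   = rep(anorexia$Treat, times = 2),
  time    = factor(rep(c("pre", "post"), each = n),
                   levels = c("pre", "post")),
  weight  = c(anorexia$Prewt, anorexia$Postwt)       # ONE weight variable
)
# show the "pre" and "post" lines of the first few subjects together
head(anorexia.long[order(anorexia.long$subject), ])
```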
By the way, this is how you get the MASS package out of the search path if you no longer need it...
> detach("package:MASS")
The easiest way--and the usual way--of getting a data frame into the R workspace is to read it in from a file. We will do
that in the next tutorial. Sometimes it becomes necessary to create one at the console, however. Here are the steps
involved:
Use the data.frame( ) function to create a data frame from the vectors.
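The individual steps are missing from this copy, but they can be sketched from the printed data frame that appears further on: enter each variable as a vector, then combine the vectors. The stringsAsFactors = TRUE argument is an assumption, added so that current versions of R store the character variables as factors, as the summary output later in the tutorial implies.

```r
name <- c("Bob", "Fred", "Barb", "Sue", "Jeff")
age  <- c(21, 18, 18, 24, 20)
hgt  <- c(70, 67, 64, 66, 72)
wgt  <- c(180, 156, 128, 1118, 202)
race <- c("Cauc", "Af.Am", "Af.Am", "Cauc", "Asian")
year <- c("Jr", "Fr", "Fr", "Sr", "So")
SAT  <- c(1080, 1210, 840, 1340, 880)
# combine the vectors; character vectors become factors
my.data <- data.frame(name, age, hgt, wgt, race, year, SAT,
                      stringsAsFactors = TRUE)
my.data
```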
Tah dah! It's as simple as that. You wouldn't want to have to do that with a large data set, however, and that's why we'll
learn how to read them in from a file in the next tutorial. DON'T clean up your workspace. We will carry this example
over into the next section.
First, let's look at a few functions that allow us to get general information about a data frame...
> summary(my.data)
name age hgt wgt race year
Barb:1 Min. :18.0 Min. :64.0 Min. : 128.0 Af.Am:2 Fr:2
Bob :1 1st Qu.:18.0 1st Qu.:66.0 1st Qu.: 156.0 Asian:1 Jr:1
Fred:1 Median :20.0 Median :67.0 Median : 180.0 Cauc :2 So:1
Jeff:1 Mean :20.2 Mean :67.8 Mean : 356.8 Sr:1
Sue :1 3rd Qu.:21.0 3rd Qu.:70.0 3rd Qu.: 202.0
Max. :24.0 Max. :72.0 Max. :1118.0
SAT
Min. : 840
1st Qu.: 880
Median :1080
Mean :1070
3rd Qu.:1210
Max. :1340
Or at least that would be useful if the data frame were larger!
There are four ways to get at the data inside a data frame, and this is NOT one of them...
> SAT
[1] 1080 1210 840 1340 880
That only seemed to work, because remember when you created the data frame, you started by putting a vector called
"SAT" into the workspace. THAT'S what you're seeing now! You are not seeing the SAT variable from inside the data
frame.
Let's erase all those vectors EXCEPT "age", which we will keep to illustrate something that you will need to remember
about R...
> ls()
[1] "age" "hgt" "my.data" "name" "race" "SAT" "wgt"
[8] "year"
> rm(hgt, name, race, SAT, wgt, year) ### Don't erase my.data!
> ls()
[1] "age" "my.data"
Now if we try to see SAT as we did above...
> SAT
Error: object 'SAT' not found
...we get an error. R will not look inside data frames for variables unless you tell it to. Here are the four ways to do that...
by using $
by using with( )
by using data=
by using attach( )
A data frame is a list of column vectors. We can extract items from inside it by using the usual list indexing device, $. To
do this, type the name of the data frame, a dollar sign, and the name of the variable you want to work with...
> my.data$SAT
[1] 1080 1210 840 1340 880
> mean(my.data$SAT)
[1] 1070
If that dollar sign stuff gets hard to read, you can put spaces around the $ to make the command line easier to read...
> mean(my.data $ SAT)
[1] 1070
This can certainly be a nuisance, because it will mean that in some commands you have to type the data frame name
multiple times. An example is the command that calculates a correlation...
> cor(my.data$hgt, my.data$wgt)
[1] -0.2531835
In this case, you can use the with( ) function to tell R where to get the data from...
> with(my.data, cor(hgt, wgt))
[1] -0.2531835
It doesn't save much typing in this example, but there are cases where that will save a LOT of typing! Notice the syntax of
this function. You type the name of the data frame first, followed by a comma, followed by the function you want to
execute, then you close the parentheses on with( ).
As we will learn later, some functions, especially significance tests, take what's called a formula interface. When that's the
case, there is always a data= option to specify the name of the data frame where the variables are to be found. I'll just show
you an example for now. We'll have plenty of time to examine the formula interface later...
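The example is missing from this copy. A sketch of the idea using the built-in sleep data frame shown earlier (the choice of test is an assumption; any function with a formula interface works the same way):

```r
# extra ~ group reads "extra as a function of group";
# data= tells R which data frame holds those variables
res <- t.test(extra ~ group, data = sleep)
res$p.value
boxplot(extra ~ group, data = sleep)    # graphics functions accept it too
```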
Finally, there is the dreaded attach( ) function. This attaches the data frame to your search path (in position 2) so that
R will know to look there for data objects that are referenced by name. Some people use this device routinely when
working with data frames, but it can cause problems, and we are about to see one...
> attach(my.data)
The following object(s) are masked _by_ .GlobalEnv :
age
Say what? When an object is masked (or shadowed) by the global environment, that means there is a data object in the
workspace that has this name AND there is a variable inside the data frame that has this name. I can now ask for any
variable inside the data frame EXCEPT age...
> SAT
[1] 1080 1210 840 1340 880
> mean(SAT)
[1] 1070
> table(year)
year
Fr Jr So Sr
2 1 1 1
> age
[1] 21 18 18 24 20
You might think you are seeing my.data$age here, but YOU ARE NOT! You're seeing "age" from the workspace. In this
case they're the same, but that won't always be true...
> age = 112
> age
[1] 112
The assignment changed the value of "age" in the workspace, but not in the data frame...
> my.data$age
[1] 21 18 18 24 20
If we remove age from the workspace, R will then search inside the data frame for it...
> rm(age)
> age
[1] 21 18 18 24 20
The lesson is, when you get one of these masking (or shadowing) conflicts, WATCH OUT! Be extra careful to know
which version of the variable you're working with. This has tripped up many an R user, including me. This is why you
want to keep your workspace as clean as possible. The best strategy here is to remove the masking variable from the
workspace. If you want to keep it, at least rename it and then remove the conflicting version from the workspace. You'll
eventually be sorry if you don't!
> detach(my.data)
When you're done with an attached data frame, ALWAYS detach it. This will remove it from the search path so that R will
no longer look inside it for variables. You'll have to go back to using $ to reference variables inside the data frame after it
is detached. This isn't necessary if you're going to quit your R session right away. Quitting detaches everything that was
attached. But if you're going to continue working, detach data frames you no longer need. Otherwise, your search path will
get messy, and you'll get more and more masking conflicts as other objects are attached.
This will cost you BIGTIME eventually if you don't pay close attention!
Look at the printed data frame. Suppose we wanted to extract Barb's weight. That's the value in row 3 and column 4, so we
could get it this way...
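The commands are missing from this copy. A sketch of three ways to reach that value, using a cut-down version of my.data (only the first four columns, rebuilt here so the example stands alone): by position, by index number with the column name, and by row name with the column name.

```r
my.data <- data.frame(name = c("Bob", "Fred", "Barb", "Sue", "Jeff"),
                      age  = c(21, 18, 18, 24, 20),
                      hgt  = c(70, 67, 64, 66, 72),
                      wgt  = c(180, 156, 128, 1118, 202))
by_position <- my.data[3, 4]          # row index 3, column index 4
by_index    <- my.data[3, "wgt"]      # row index 3, column name
by_rowname  <- my.data["3", "wgt"]    # row NAME "3" (note the quotes)
c(by_position, by_index, by_rowname)
```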
Those last two ways seem to be the same, BUT THEY ARE NOT!!!
Let's sort the data frame using the age variable. Sorting a data frame is done using the order( ) function. Remember
how it worked when we sorted a vector? If a call to the order( ) function is put in place of the row index the data
frame will be sorted on whatever variable is specified inside that function. You will have to use the full name of the
variable; i.e., you will have to use the $ notation. (Why?) Otherwise, R will be looking in your workspace for a variable
called "age", not finding it, and giving a "not found" error. It happens to me a lot, so you might as well just get used to it!
> my.data[order(my.data$age),]
name age hgt wgt race year SAT
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
5 Jeff 20 72 202 Asian So 880
1 Bob 21 70 180 Cauc Jr 1080
4 Sue 24 66 1118 Cauc Sr 1340
Observe the row names! They have also sorted, haven't they? Let's save this into a new data object so we can play with it a
bit...
> my.data[order(my.data$age),] -> my.data.sorted # Did you remember up arrow?
> my.data.sorted
name age hgt wgt race year SAT
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
5 Jeff 20 72 202 Asian So 880
1 Bob 21 70 180 Cauc Jr 1080
4 Sue 24 66 1118 Cauc Sr 1340
Now let's try to extract Barb's weight from this new data frame...
> my.data.sorted[3,4] ### Wrong!
[1] 202
> my.data.sorted[3,"wgt"] ### Also wrong!
[1] 202
> my.data.sorted["3","wgt"] ### Correct!
[1] 128
> my.data.sorted[2,4] ### Also correct!
[1] 128
Confused yet?
Here's what you have to remember. Those numbers that often print out on the left side of a data frame ARE NOT
NUMBERS. They're row names. So data frames have both row and column names, whether you like it or not! The point
becomes clearer when we give the rows actual names. Let's erase the names from my.data and then re-enter them as row
names...
> rm(my.data.sorted) # Get rid of that first.
> my.data$name <- NULL # This is how you erase a variable.
> my.data # See?
age hgt wgt race year SAT
1 21 70 180 Cauc Jr 1080
2 18 67 156 Af.Am Fr 1210
3 18 64 128 Af.Am Fr 840
4 24 66 1118 Cauc Sr 1340
5 20 72 202 Asian So 880
> rownames(my.data) <- c("Bob","Fred","Barb","Sue","Jeff")
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 1118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
> my.data["Barb", "wgt"] # Makes getting Barb's weight a lot easier!
[1] 128
Notice the numbers are gone now because we have actual row names. And OF COURSE they sort with the rest of the data
frame...
> my.data[order(my.data$age),]
age hgt wgt race year SAT
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Jeff 20 72 202 Asian So 880
Bob 21 70 180 Cauc Jr 1080
Sue 24 66 1118 Cauc Sr 1340
It would be absolutely silly if they didn't! Just remember: Data frames ALWAYS have row names. Sometimes those row
names just happen to look like numbers. It's the row names that print out to your console when you ask to see the data
frame, or any part of it, and NOT the index numbers.
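Here is a minimal sketch of the same point, assuming the data frame is rebuilt from the values printed above. The data.frame( ) call is illustrative, not the original input, and only three of the columns are kept:

```r
# Rebuild part of the example data frame from the values shown above.
my.data <- data.frame(
  name = c("Bob", "Fred", "Barb", "Sue", "Jeff"),
  age  = c(21, 18, 18, 24, 20),
  wgt  = c(180, 156, 128, 1118, 202)
)
sorted <- my.data[order(my.data$age), ]

rownames(sorted)      # row names are character strings, not numbers
sorted[3, "wgt"]      # third row by POSITION (Jeff)
sorted["3", "wgt"]    # row whose NAME is "3" (Barb)
```

A bare 3 is an index (a position), while a quoted "3" is a row name, and after sorting the two no longer agree.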
While this isn't strictly against the law, it's a bad idea and can get very confusing as to exactly what it is you've modified. I
could try to explain it, but I'm not sure I understand it myself. So just don't do it!
The time will come when you want to change a data frame in some way. Here are some examples of how to do that. You
may have noticed that Sue (in the my.data data frame) is a wee bit on the chunky side. This was an innocent mistake. I
really didn't do that on purpose. How do we fix it? The value was supposed to be 118, but let's change it to 135 just for
kicks...
Just remember that "wgt" is now in column 3, since the row names don't count as a column.
I have to warn you about modifying data frames. It's always a good idea to make a backup copy in the workspace first,
because some of the commands that modify data frames can really screw things up if they go wrong! But let's live
dangerously. Suppose we wanted "wgt" to be in kilograms instead of pounds. Easy enough...
> my.data$wgt / 2.2
[1] 81.81818 70.90909 58.18182 53.63636 91.81818
> my.data # Nothing has changed yet. Why not?
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
> my.data$wgt / 2.2 -> my.data$wgt # Aha! It has to be stored back into my.data.
> my.data
age hgt wgt race year SAT
Bob 21 70 81.81818 Cauc Jr 1080
Fred 18 67 70.90909 Af.Am Fr 1210
Barb 18 64 58.18182 Af.Am Fr 840
Sue 24 66 53.63636 Cauc Sr 1340
Jeff 20 72 91.81818 Asian So 880
> round(my.data$wgt, 1) -> my.data$wgt # A little rounding for good measure.
> my.data
age hgt wgt race year SAT
Bob 21 70 81.8 Cauc Jr 1080
Fred 18 67 70.9 Af.Am Fr 1210
Barb 18 64 58.2 Af.Am Fr 840
Sue 24 66 53.6 Cauc Sr 1340
Jeff 20 72 91.8 Asian So 880
Now that we've rounded them off, we've lost the original weight data in pounds...
> my.data$wgt*2.2
We could have avoided this by making a backup copy of my.data first, or by putting the new weight in kilograms into a
new column in the data frame.
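Both precautions might look like this. It's a sketch on a cut-down stand-in for my.data, and the names my.data.backup and wgt.kg are made up for the example:

```r
# Small stand-in for my.data (weights in pounds, from the table above).
my.data <- data.frame(
  wgt = c(180, 156, 128, 118, 202),
  row.names = c("Bob", "Fred", "Barb", "Sue", "Jeff")
)
my.data.backup <- my.data                       # backup copy first
my.data$wgt.kg <- round(my.data$wgt / 2.2, 1)   # new column; the pounds stay untouched
my.data
```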
You can clean up now. We're done with this data frame.
Do this...
> library(MASS)
> data(Cars93)
> attach(Cars93)
> str(Cars93) # Output not shown.
This is a data frame with 93 observations on 27 variables. You can see what the variables represent by looking at the help
page for this data set: ?Cars93. We're interested in the variable "Luggage.room" in particular, which is the trunk space in
cubic feet, to the nearest cubic foot...
> summary(Luggage.room)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
6.00 12.00 14.00 13.89 15.00 22.00 11.00
This is a numerical variable, so we get the summary we are accustomed to by now. But what are those NAs? Whether we
like it or not, data sets often have missing values, and we need to know how to deal with them. R's standard code for
missing values is "NA", for "not available". The number associated with NA is a frequency. There are 11 cases in this data
frame in which "Luggage.room" is a missing value. If you looked at the help page, you know why.
Some functions fail to work when there are missing values, but this can (almost always) be fixed with a simple option...
> mean(Luggage.room)
[1] NA
> mean(Luggage.room, na.rm=TRUE)
[1] 13.89024
> mean(Luggage.room, na.rm=T)
[1] 13.89024
There is no mean when some of the values are missing, so the "na.rm" option removes them when set to TRUE (must be all
caps, but the shorter form T also works provided you haven't assigned another value to it). If you want to clean the data set
by dropping every case that has a missing value on any variable (casewise deletion), use the na.omit( ) function...
> na.omit(Cars93) # Output not shown.
I will not reproduce the output here because it is extensive, but it is also instructive, so take a look at it. Scroll the console
window backwards to see all of it. Of course, to use this cleaned data frame, you would have to assign it to a new data
object.
The which( ) function does not work to identify which of the values are missing. Use is.na( ) instead...
[23] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[67] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[89] TRUE FALSE FALSE FALSE FALSE
> which(is.na(Luggage.room))
[1] 16 17 19 26 36 56 57 66 70 87 89
Finally, some data sets come with other codes for missing values. 999 is a common missing value code, as are blank
spaces. Blanks are a very bad idea. If you find a data set with blanks in it, it may have to be edited in a text editor or
spreadsheet before the file can be read into R. It depends on how the file is formatted. In some cases, R will automatically
assign NA to blank values, but in other cases it will not. Other missing value codes are not a problem, as they can be
recoded...
> ifelse(is.na(Luggage.room), 999, Luggage.room) -> temp
> temp
[1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 999 999
[18] 20 999 15 14 17 11 13 14 999 16 11 11 15 12 12 13 12
[35] 18 999 18 21 10 11 8 12 14 11 12 9 14 15 14 9 19
[52] 22 16 13 14 999 999 12 15 6 15 11 14 12 14 999 14 14
[69] 16 999 17 8 17 13 13 16 18 14 12 10 15 14 10 11 13
[86] 15 999 10 999 14 15 14 15
> # first we'll mess it up
> # and then we'll fix it
> ifelse(temp == 999, NA, temp) -> fixed
> fixed
[1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 NA NA 20 NA 15 14 17 11
[24] 13 14 NA 16 11 11 15 12 12 13 12 18 NA 18 21 10 11 8 12 14 11 12 9
[47] 14 15 14 9 19 22 16 13 14 NA NA 12 15 6 15 11 14 12 14 NA 14 14 16
[70] NA 17 8 17 13 13 16 18 14 12 10 15 14 10 11 13 15 NA 10 NA 14 15 14
[93] 15
The ifelse( ) function is very handy for recoding a data vector, so let me take a moment to explain it. Inside the
parentheses, the first thing you give is a test, which R applies to every value in the vector, one at a time. In the second
of the commands above, where we are going from the messed up variable back to "fixed", the test was "is this value of
temp equal to 999?" Notice the double equals sign meaning "equal". (I still get this wrong a lot!) The second thing you
give is how to recode the values that pass the test, and finally you tell what to do with the values that don't pass. So the
whole command reads like this: "For each value of temp, if it is equal to 999, assign it the value NA, else assign it the
value that is currently in temp."
In the first instance of the function, we had to use is.na, since nothing can really be "equal to" something that is not
available! Try these, and say them in words as you're typing them...
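Here are a few small commands of this kind, as a sketch on a made-up vector (scores and its 999 code are assumptions for illustration only):

```r
scores <- c(12, 999, 15, NA, 9)    # hypothetical data with two kinds of "missing"

# "If a value of scores equals 999, make it NA, else keep it."
cleaned <- ifelse(scores == 999, NA, scores)

# "If a value of cleaned is missing, make it 0, else keep it."
filled <- ifelse(is.na(cleaned), 0, cleaned)

# "If a value of filled is above 10, call it high, else call it low."
labels <- ifelse(filled > 10, "high", "low")
```

Note that the comparison scores == 999 itself returns NA for the value that was already missing, which is why the second command has to use is.na( ).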
> data(USArrests)
> head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Here is another useful function for looking at a data frame. The head( ) function shows the first six lines of data (cases)
inside a data frame. There is also a tail( ) function that shows the last six lines, and the number of lines shown can be
changed with an option (see the help pages).
In this case we have a data frame with row names set to state names and containing variables that give the crime rates (per
100,000 population) for Murder, Assault, and Rape, as well as the percentage of the population that lives in urban areas.
These data are from 1973 so are not current.
Because state names are used as row names, to see the data for any state, all we have to be able to do is spell the name of
the state...
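For example (a sketch; the particular states are an arbitrary choice):

```r
data(USArrests)
USArrests["Alabama", ]                   # one state, all variables
USArrests[c("Alaska", "Colorado"), ]     # several states at once
USArrests["California", "UrbanPop"]      # one state, one variable
```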
Suppose we wanted to work with data only from these states. How can we extract them from the data frame and make a
new data frame that contains only those states? I'm glad you asked...
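Here's one way to do it, sketched with a hypothetical selection: states with a murder rate above 13 per 100,000. The cutoff and the object names are arbitrary choices for illustration.

```r
data(USArrests)
high.murder <- USArrests[USArrests$Murder > 13, ]   # logical test in place of the row index
high.murder
rownames(high.murder)                               # the state names come along as row names

# subset( ) does the same thing with a little less typing.
same <- subset(USArrests, Murder > 13)
```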
Suppose someone has retained your services as a data analyst and gives you his data (from an Excel file or something) in
this format...
Now do this...
> stack(wrong.data) -> correct.data
> correct.data
values ind
1 22 contr
2 18 contr
3 25 contr
4 25 contr
5 20 contr
6 32 treat1
7 35 treat1
8 30 treat1
9 42 treat1
10 31 treat1
11 30 treat2
12 28 treat2
13 25 treat2
14 22 treat2
15 33 treat2
And there you go. Now you have a proper data frame.
There is also an unstack( ) function that does the reverse of this, and it will work automatically on a data frame that
has been created by stack( ), but otherwise is a little trickier to use. You probably won't have to use it much, so I'll
refer you to the help page if you ever need it.
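Here is a sketch of the whole round trip, with the wide data frame rebuilt from the values in the stacked output above. The data.frame( ) call is a reconstruction, not the original input, and back.again is a made-up name:

```r
wrong.data <- data.frame(
  contr  = c(22, 18, 25, 25, 20),
  treat1 = c(32, 35, 30, 42, 31),
  treat2 = c(30, 28, 25, 22, 33)
)
correct.data <- stack(wrong.data)       # one row per measurement: columns "values" and "ind"
back.again   <- unstack(correct.data)   # back to one column per group
```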
You can remove these data objects. We won't use them again.
Going From Wide to Long and Long to Wide (eventually you'll probably need to know this)
I mention this above under "An Ambiguous Case." There are two kinds of data frames in R, and in most statistical
software: wide ones and long ones. Let's fetch the "anorexia" data again (and we'll do it without attaching the MASS
package this time)...
I also shortened up the name of our data frame, because we're going to be typing it a lot.
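Here is a sketch of those two steps. The package= argument to data( ) loads a data set without attaching its package. The row subset is an assumption, inferred from the subject numbers that appear in the output further down:

```r
data(anorexia, package = "MASS")                         # MASS itself is not attached
anor <- anorexia[c(1, 2, 3, 27, 28, 29, 56, 57, 58), ]   # nine cases, three per treatment
anor
```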
This is a wide data frame. It's wide because each line of the data frame contains information on ONE SUBJECT, even
though that subject was measured multiple times (twice) on weight (Prewt, Postwt). So all the data for each subject goes on
ONE LINE, even though we could interpret this as a repeated measures design, or longitudinal data.
In a long data frame, each value of weight would define a case. So each of these subjects would have two lines in such a
data frame, one for the subject's Prewt, and one for her Postwt. A wide data frame would be used, for example, in analysis
of covariance. A long data frame would be used in repeated measures analysis of variance. Do we have to retype the data
frame to get from wide to long? Fortunately not! Because R has a function called reshape( ) which will do the work
for us.
It is not an easy function to understand, however (and don't count on the help page being a whole lot of help!). So let me
illustrate it, and then I will explain what's happening...
In the second line of this command, I specified varying= as a vector of variable names in anor that correspond to the
repeated measures or longitudinal measures (the time-varying variables). These values will be given in one column in the
new data frame, so I named that new column using the v.names= option.
A long data frame needs two things that a wide one does not have. One of those things is a column identifying the subject
(case or experimental unit) from which the data in a row of the data frame come. This is necessary because each
subject will have multiple rows of data in a long data frame. So I used the idvar= option to specify the name of this new
column that would identify the subjects. I then used ids= to specify how the subjects were to be named. I told it to use the
row names from anor, which is a sensible thing to do.
The other thing a long format data frame needs that a wide one does not is a variable giving the condition (or time) in
which the subject is being measured for this particular row of data. In the wide format, this information is in the column
(variable) names, but that will no longer be true in the long format. We need to know which measure is Prewt and which
measure is Postwt for each subject, since these will be on different rows of the data frame in long format. I named this new
variable using the timevar= option, and I gave its possible values in a vector using the times= option. The order in which
those values should be listed is the same as the order in which the corresponding columns occur in the wide data frame.
Finally, I closed the parentheses on the reshape( ) function and assigned the output to a new data object. Done!
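Putting those options together, the call would look roughly like this. It is a reconstruction from the description above and from the names that show up in later output (Weight, subject, PrePost). The row subset used to build anor is likewise an assumption:

```r
data(anorexia, package = "MASS")
anor <- anorexia[c(1, 2, 3, 27, 28, 29, 56, 57, 58), ]

anor.long <- reshape(
  data      = anor,
  direction = "long",
  varying   = c("Prewt", "Postwt"),   # the time-varying columns
  v.names   = "Weight",               # the single column they collapse into
  idvar     = "subject",              # new column identifying the subject
  ids       = row.names(anor),        # values for that column
  timevar   = "PrePost",              # new column identifying the condition
  times     = c("Prewt", "Postwt")    # its values, in the order of the wide columns
)
anor.long
```

Nine subjects with two weights each give eighteen rows in the long data frame.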
This can also be made to work if you have more than one repeated measures variable, in which case all I can say is may
the saints be with you!
If the data frame results from a reshape( ) command, then it can be converted back very simply. All you have to do is
this...
> reshape(anor.long)
Treat subject Prewt Postwt
1.Prewt Cont 1 80.7 80.2
2.Prewt Cont 2 89.4 80.1
3.Prewt Cont 3 91.8 86.4
27.Prewt CBT 27 80.5 82.2
28.Prewt CBT 28 84.9 85.6
29.Prewt CBT 29 81.5 81.4
56.Prewt FT 56 83.8 95.2
57.Prewt FT 57 83.3 94.3
58.Prewt FT 58 86.0 91.5
The row names have gone a little screwy, but all the correct information is there. This isn't very useful actually, because we
already have the data in wide format in the data frame anor, which we were smart enough not to overwrite. So let's see
how to convert from long to wide the hard way.
+ idvar="subject",
+ timevar="PrePost"
+ )
Treat subject Weight.Prewt Weight.Postwt
1 Cont 1 80.7 80.2
2 Cont 2 89.4 80.1
3 Cont 3 91.8 86.4
4 CBT 27 80.5 82.2
5 CBT 28 84.9 85.6
6 CBT 29 81.5 81.4
7 FT 56 83.8 95.2
8 FT 57 83.3 94.3
9 FT 58 86.0 91.5
We didn't quite recover the original table, but then we probably didn't really want to. The first two options name the data
frame we are reshaping and tell the direction we are reshaping TO. The next option, v.names=, gives the name of the
time-varying variable that will be split into two (or more) columns. The idvar= option gives the name of the variable that is the
subject identifier. Finally, the timevar= option gives the name of the variable that contains the conditions under which the
longitudinal information was collected; i.e., there were two weights, a Prewt and a Postwt. Notice these values were used to
name the two new columns of Weight data. Want a mnemonic to help you remember all that? Yeah, me too!
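The same options can be tried on a small self-contained example. This is a sketch: the long data frame below is typed in by hand from three of the subjects in the table above, and the object names long and wide are made up.

```r
# A small long-format data frame: two rows per subject.
long <- data.frame(
  subject = rep(c(1, 2, 3), times = 2),
  Treat   = "Cont",
  PrePost = rep(c("Prewt", "Postwt"), each = 3),
  Weight  = c(80.7, 89.4, 91.8, 80.2, 80.1, 86.4)
)

wide <- reshape(
  data      = long,
  direction = "wide",
  v.names   = "Weight",     # the column to spread out
  idvar     = "subject",    # one output row per subject
  timevar   = "PrePost"     # its values become the new column suffixes
)
wide
```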
R Tutorials--Describing Data Graphically
Introduction
The graphical procedures in R are extremely powerful. I'm told there are some people who use R not so much for data
analysis as for its ability to produce top notch publication quality graphics. I will only scratch the surface of these
capabilities here. A later tutorial will fill in a few more of the details.
R's graphics functions come in three kinds:
High level plotting functions that will create a more or less complete graph, often with axis labels, titles, and so
forth.
Low level plotting functions that allow additional information to be added to an existing graph, or that allow graphs
to be drawn from scratch.
Interactive graphics functions that allow you to extract information from an existing graph, or to label points and so
on.
I'm going to do something unusual, and perhaps ill-advised, and cover the low level functions first, so that you will be
ready to use them in conjunction with the high level functions when we get to those. If you just want the quick and dirty
approach, then skip this first section (for now).
Just about anything can be drawn into a graphics window in R if you are clever enough. I'm not that clever, so I'll keep it
simple. To conserve space, I'm also not going to reproduce the output of every single example. If you have R open and are
following along, you can see it on your own screen.
High level plotting functions open a graphics device (window) automatically, but the low level functions do not. So to get
a graph and some axes to work with, the following command will get us started without actually drawing a graph...
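Judging from the complete script given further down, that command is a plot( ) call with type="n", which sets up the axes but draws nothing:

```r
# Open a graphics device with both axes running from 1 to 100, but plot no points.
plot(1:100, 1:100, type = "n", xlab = "", ylab = "")
```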
One thing we can do is plot a curve from an algebraic equation. Let's say the equation is y = 0.01 x^2 ...
The text( ) function takes, first, arguments that give x,y-coordinates at which the text will be centered (and this can
take some careful eyeballing or some trial and error), and then it takes quoted text or a mathematical expression. The
syntax for the expression( ) function is an art form in itself (similar to LaTeX), and I have not mastered it, but it can
be used to produce some very fancy mathematical expressions. There are also options for controlling font face and size as
well as spacing, etc.
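For instance, the two flavors of label look like this. The first two text( ) calls use coordinates from the complete script further down. The third, with cex= and font=, is an added illustration, and the plot( ) call is only there to provide axes to draw on:

```r
plot(1:100, 1:100, type = "n", xlab = "", ylab = "")
text(80, 50, "This is a graph of")                       # plain quoted text
text(80, 37, expression(y == frac(1, 100) * x^2))        # == prints "=", frac( ) stacks a fraction
text(20, 90, "bigger, bold text", cex = 1.5, font = 2)   # cex= scales the size, font=2 is bold
```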
graphically.html[27/01/2014 22:18:39]
You can experiment for yourself to find out what the rest of them look like.
Now let's draw a straight line through a couple of those points, say the one at (20, 4) and the one at (90, 81). The draw-a-
straight-line function is abline( ), and in this case its arguments are "a=the y-intercept" and "b=the slope" of the
desired line...
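The intercept and slope follow from the two points by ordinary algebra. Here is a sketch of the arithmetic, drawn on the same kind of empty 1-to-100 plot (the variable names are made up):

```r
x <- c(20, 90)
y <- c(4, 81)
b <- diff(y) / diff(x)     # slope: (81 - 4) / (90 - 20) = 1.1
a <- y[1] - b * x[1]       # intercept: 4 - 1.1 * 20 = -18

plot(1:100, 1:100, type = "n", xlab = "", ylab = "")
points(x, y, pch = 6)
abline(a = a, b = b, col = "red")   # passes through both points
```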
The "lty=" option specifies the line type. (1=solid, 2=dashed, 3=dotted.) You can also change the color of these lines with
col=, and the width of the lines with lwd= options. Try repeating that last command but set lty=1 and lwd=3.
We can also draw lines and/or points using the lines( ) function...
> lines(x=c(40, 40, 60, 60), y=c(80, 100, 100, 80), type="b")
> lines(x=c(40, 60), y=c(80, 80), type="l")
Once again, the first vector gives the x-coordinates, the second vector the y-coordinates, and the "type=" option tells
whether you want just points (type="p"), just lines (type="l"), or both (type="b"). Note: for just lines, use a lower case L.
This example shows that type="l" and type="b" behave a bit differently in terms of where the line begins and terminates.
Finally (at least as far as this tutorial is concerned!), titles and axis labels can be added using the title( ) function. We
already have axis labels (they are set by default in the plot( ) function), so I'll use a little trick that SOMETIMES
works to erase one of them. I'll write over it in the background color of the graph...
Beautiful! Okay, so it's a little first-graderish as R graphics go. There are entire books on R graphics, and I am but a
beginner! Here is the entire script if you just now have decided you want to see this happen on your own monitor.
# Start copying here.
plot(1:100, 1:100, type="n", xlab="", ylab="")
curve(x^2/100, add=TRUE)
text(80, 50, "This is a graph of")
text(80, 45, "the equation")
text(80, 37, expression(y == frac(1,100) * x^2))
points(c(20,60,90), c(4,36,81), pch=6)
points(rep(100,10), seq(0,90,10), pch=0:9)
abline(a=-18, b=1.1, col="red")
abline(h=20, lty=2)
abline(v=20, lty=3)
lines(c(40,40,60,60), c(80,100,100,80), type="b")
lines(c(40,60), c(80,80), type="l")
title(main="A Drawing To Put On the Refrigerator!")
title(xlab="x", col.lab="white")
title(xlab="This is the x-axis", col.lab="black")
# Stop copying here and paste to your R Console.
Usually, we don't want to fuss that much. We just want to see a graph of some data we're examining. If we want to dress it
up for publication, THEN we'll worry about the low-level functions and various options.
The basic high level plotting function is plot( ), and it works differently depending upon what you're asking it to plot.
The basic syntax is plot(x, y, ...), where x is a vector of x-coordinates, y is a vector of y-coordinates, and ...
represents further refinements and options, as will be illustrated...
> data(faithful)
> attach(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> plot(waiting, eruptions) # x is num., y is num., plot is scatterplot
> detach(faithful)
> rm(faithful)
>
>
> data(ToothGrowth)
> attach(ToothGrowth)
> names(ToothGrowth)
[1] "len" "supp" "dose"
> plot(supp, len) # x is factor, y is num., plot is boxplots
> plot(factor(dose), len) # coercing dose to a factor
> detach(ToothGrowth)
> rm(ToothGrowth)
>
>
> data(sunspots)
> class(sunspots)
[1] "ts"
> plot(sunspots) # x is time series, y missing, plot is a
> rm(sunspots) # time-series plot
>
>
> data(UCBAdmissions)
> class(UCBAdmissions)
[1] "table"
> plot(UCBAdmissions) # x is table, y missing, plot is a mosaic plot
> rm(UCBAdmissions)
>
>
> data(mtcars)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17.0 18.6 19.4 17.0 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> plot(mtcars) # x is dataframe of num. vars., y missing,
> rm(mtcars) # plot is a scatterplot matrix
>
As we go through the individual data analyses in future tutorials, we will see these various plots again, and we will dress
them up a bit. So for now, let me just illustrate a few other things R can do.
When a single categorical variable is being graphed, the customary way is to use a piechart or a barplot. Statisticians are
somewhat biased against piecharts, and I suppose for good reason, but I'll illustrate them anyway, just in case you have a
hankerin' to flout good statistical practice.
The data set UCBAdmissions, which we were using above, is the Berkeley admissions data we used in a different form in
a previous tutorial. The data set is a 3-D table, and we need a 1-D table to illustrate a basic piechart and barplot, so...
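Here is a sketch of that step, assuming we collapse the table over everything but department (the third dimension). The object name Dept.table is made up:

```r
data(UCBAdmissions)
Dept.table <- margin.table(UCBAdmissions, 3)   # number of applicants per department
Dept.table
pie(Dept.table)
barplot(Dept.table, xlab = "Department", ylab = "Number of applicants")
```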
The pie( ) function in R is limited because, as I mentioned above, many statisticians (including the R folks) consider
pie charts to be poor statistical practice. However, if you want something flashy like a 3D exploded pie chart, you can get
it by installing an optional graphics package called "plotrix", which contains a function called pie3D( ), which has an
"explode" option. It just goes to show, if you want it, someone has probably written an R package that will do it! To see an
example of an exploded pie chart produced with this package, try this link. In fact, I recommend the plotrix package if you
want some useful extensions to the basic R graphics capabilities.
If you want to look at two categorical variables at once, a stacked barplot, or better yet, a side-by-side barplot is usually
the way to go...
> margin.table(UCBAdmissions, c(1,3)) -> Admit.by.Dept
> barplot(Admit.by.Dept)
> barplot(Admit.by.Dept, beside=T, ylim=c(0,1000), legend=T,
+ main="Admissions by Department")
Notice a stacked barplot is the default. To change that, set the "beside=" option to TRUE. Also, I dressed up the second
barplot a bit by adding a main title, and by changing the limits on the y-axis to make room for a legend. I need to adjust
the font size a bit in the legend, and maybe change its location, but that's a future tutorial!
Histograms
When you have one numerical variable to look at, a histogram is appropriate. I'll use the "faithful" data set again to
illustrate...
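In its simplest form the call needs only the variable. This sketch uses $ notation, whereas the commands further down assume faithful has been attached:

```r
data(faithful)
hist(faithful$waiting)   # default breakpoints, default title and axis labels
```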
It doesn't get much more straightforward than that! And by the way, in case you're wondering, I resized the graphic by
resizing the graphics device window before saving it. There are better ways, but that works in a pinch.
If you want more or fewer bars, you can refine your plot by using the "breaks=" option and defining your own
breakpoints...
> range(waiting)
[1] 43 96
> hist(waiting, breaks=seq(40,100,10))
By default, R includes the right limit (right side of the bar) but not the left limit in the intervals. Usually, I prefer it the
other way around, so I change it with the "right=" option, which by default is TRUE...
> hist(waiting, breaks=seq(40,100,10), right=F)
There are many, many other options as well, which you can examine by looking at the help page for this function: ?hist.
R also incorporates many functions for data smoothing, including kernel density smoothing of histograms. If you'd rather
see a smooth curve than a boxy histogram, it can be done as follows...
> plot(density(waiting))
> # Or, getting fancier...
> hist(waiting, prob=T)
> lines(density(waiting))
> detach(faithful)
The density( ) function does kernel density smoothing, which can be refined by adjusting the options of the function.
To plot the smoothed curve on top of a histogram, set the "prob=" option to TRUE inside the hist( ) function. This
plots densities rather than frequencies. Also, use lines( ) rather than plot( ) to plot the smoothed curve. This low
level graphics function will add the smoothed curve to the histogram rather than drawing a new plot and thereby erasing
the histogram.
When you have a numerical variable indexed by a categorical variable or factor, you might want a group-by-group
summary in graphical form. The primary way R offers to achieve this is side-by-side boxplots...
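Here is a sketch using the built-in chickwts data, an assumed choice of data set. As shown earlier with ToothGrowth, plot( ) draws boxplots when x is a factor and y is numeric:

```r
data(chickwts)
with(chickwts, plot(feed, weight))   # feed is a factor, weight is numeric
```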
The function boxplot( ), which takes a formula interface, can also be used. Here is the example copied and pasted off
the "chickwts" help page...
> boxplot(weight ~ feed, data = chickwts, col = "lightgray",
+ varwidth = TRUE, notch = TRUE, main = "chickwt data",
+ ylab = "Weight at six weeks (gm)")
Warning message:
In bxp(list(stats = c(216, 271.5, 342, 373.5, 404, 108, 136, 151.5, :
some notches went outside hinges ('box'): maybe set notch=FALSE
Notice that several options are set, including an option to color the boxes, the "varwidth=" option, which sets the width of
the box according to the sample size, the "notch=" option, which gives a confidence interval around the median, and
options to print a main title and y-axis label. The procedure generated a warning message, which you will understand when
you look at the graphic (which I have not reproduced here).
Scatterplots
For examining the relationship between two numerical variables, you can't beat a scatterplot. R has several functions for
producing them, two of which will be demonstrated here...
Some explanations are in order. First, I didn't want to attach the MASS package to the search path, so I used an option
when I copied the "mammals" data frame that told R to look for it there. The data frame contains brain and body weights
from 62 species of land mammals. Second, to produce a linear plot, I had to do a log transform on both variables, and I did
that "on the fly." Third, the two functions produced the same scatterplot, but the scatter.smooth( ) function also
plots a smoothed, nonparametric regression line on the plot. This line is computed using the loess technique and is called
the "loess line" (locally weighted scatterplot smoothing, sometimes also called "lowess", although I understand some
sources use the two acronyms differently). Both functions have options that allow the plots to be modified in several ways.
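Here is a sketch of commands matching that description. The exact original calls are an assumption. data( ) with package=, plot( ), and scatter.smooth( ) are all standard R:

```r
data(mammals, package = "MASS")                          # copy the data frame without attaching MASS
plot(log(mammals$body), log(mammals$brain))              # plain scatterplot, logs taken on the fly
scatter.smooth(log(mammals$body), log(mammals$brain))    # same plot plus a loess line
```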
R supplies several functions that allow you to interact with the graphics window, including functions that allow you to
identify and label points on the graph. See the help pages for the locator( ) and identify( ) functions for details.
I'll discuss these briefly in a later tutorial.
R Tutorials--Functions and Scripts
Functions
One of the advantages of using a scriptable statistics language like R is, if you don't like the way it does something, you
can change it. Or if there is a function missing you'd like to have, write it. Writing basic functions is not difficult. If you
can calculate it at the command line, you can write a function to calculate it.
There is no function in the R base packages to calculate the standard error of the mean. So let's create one. The standard
error of the mean is calculated from a sample (I should say estimated from a sample) by taking the square root of the
sample variance divided by the sample size. So from the command line...
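Here is a sketch of both steps, the calculation by hand and then the same thing wrapped in a function. The vector nums is made up for illustration, so its result will not match the values printed further down:

```r
nums <- c(12, 15, 9, 20, 14)       # hypothetical data

sqrt(var(nums) / length(nums))     # standard error, straight from the definition

sem = function(x)
{
    sqrt(var(x) / length(x))
}
sem(nums)                          # same calculation, now reusable
```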
First, we checked to make sure "sem" was not already used as a keyword by asking for a help page. (That's no guarantee,
but it's a good check.) Then we typed "sem=function(x)", which requires a bit of explanation. The name we desire for our
function is "sem", so that's the first thing we type. Then we tell R we want to define this as a function by typing
"=function". The sem is going to be calculated on a data object--a vector--so we have to pass the data to the function, and
that is the point of "(x)". This tells R to expect one argument to be passed to the function. It doesn't have to be called "x".
This is just a dummy variable, so call it "fred" if you want, as long as you call it the same thing throughout the function
definition.
After you hit the Enter key, R will see that you are defining a function, and it will give you the + prompt, meaning "tell me
more." Type an open curly brace and hit Enter again. (A more common practice is to type this on the first line, but it
doesn't matter, and I learned it otherwise.) Then type the calculations needed to get the standard error. Spacing is optional,
but I think it makes it a bit easier to understand if you use some indenting here. Hit Enter. Type a closed curly brace and hit
Enter again. Your function has been defined and is now in your workspace to be used whenever you want...
> ls()
[1] "nums" "sem"
And it will stay in your workspace for whatever working directory you are in PROVIDED you save your workspace when
you quit R. You use the function just like you use any other function in R...
> sem(nums)
[1] 2.584941
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
So next week you fire up R, you see "sem" in your workspace, and you wonder what it is (if you're like me). Easy enough
to find out...
> class(sem)
[1] "function"
> sem
function(x)
{
sqrt(var(x)/length(x))
}
Just like any other data object, typing its name without an argument prints it out.
I don't like it. Our sem function is good enough, but if there are missing values in the data vector, sem( ) will choke...
> nums[20] = NA # create a missing value
> sem(nums)
Error in var(nums) : missing observations in cov/cor
So let's fix it...
> rm(sem) # out with the old...
> ls()
[1] "nums"
> sem = function(x)
+ {
+ n = sum(x,na.rm=T)/mean(x,na.rm=T)
+ sqrt(var(x,na.rm=T)/n)
+ }
> ls()
[1] "nums" "sem"
> sem(nums)
[1] 2.641737
By the way, we couldn't use the length( ) function in this calculation because it has no "na.rm=" option. However, another
way to get the length of the vector without counting missing values, and perhaps a more elegant way, is: n=sum(!is.na(x)).
This tests each value of the vector to see if it's missing. If it is NOT (the ! means NOT), then it returns TRUE for that
position in the vector. Finally, the values returned as TRUE are counted with sum( ).
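The counting idiom just described can be tried on a small vector. This is only a sketch: the vector "v" and the function name "sem2" are made up for the illustration and are not part of the tutorial's workspace.

```r
# A small made-up vector with one missing value (illustration only).
v = c(10, 12, NA, 15, 11)

length(v)         # counts the NA as if it were a data value
sum(is.na(v))     # number of missing values
sum(!is.na(v))    # number of nonmissing values

# The standard error function rewritten with the sum(!is.na(x)) idiom.
# "sem2" is a hypothetical name, chosen so it does not overwrite sem.
sem2 = function(x)
{
    n = sum(!is.na(x))
    sqrt(var(x, na.rm=TRUE)/n)
}
sem2(v)
```

One practical reason to prefer this idiom over sum(x)/mean(x) is that the latter divides zero by zero when the mean of the data happens to be exactly zero.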
The length( ) function counts NAs as data values and doesn't tell you. (Which is why we couldn't use it above--it would
have given the wrong value for n.) Let's create another function for sample size that reports on NAs...
> ?samp.size
No documentation for 'samp.size' in specified packages and libraries:
you could try 'help.search("samp.size")'
> samp.size = function(x)
+ {
+ n = length(x) - sum(is.na(x))
+ nas = sum(is.na(x))
+ out = c(n, nas)
+ names(out) = c("", "NAs")
+ out
+ }
> ls()
[1] "nums" "samp.size" "sem"
> samp.size(nums)
NAs
24 1
Now samp.size( ) returns a vector the first element of which is the number of nonmissing values in the data object we feed
into it. So...
> sqrt(var(nums,na.rm=T)/samp.size(nums)[1])
2.641737
If you ask me, R has some annoying idiosyncrasies. Take the tapply( ) function for example. What could "tapply" possibly
mean? And who came up with that convoluted syntax? Don't like it? Then change it!
> ?calculate
No documentation for 'calculate' in specified packages and libraries:
you could try 'help.search("calculate")'
> calculate = function(FUN, of, by)
+ {
+ tapply(of, by, FUN)
+ }
> ls()
[1] "calculate" "nums" "samp.size" "sem"
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
Which makes more sense to you?! You should also know that these one-liners can be entered all on one line...
> rm(calculate)
> ls()
[1] "nums" "samp.size" "sem"
> calculate = function(FUN, of, by) tapply(of, by, FUN)
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
You don't even need the curly braces, although you can use them if you want. I usually do.
Another R function that annoys the crap out of me is summary( ) when applied to a numerical vector. I want a standard
deviation, and I want a sample size...
> ?describe
No documentation for 'describe' in specified packages and libraries:
you could try 'help.search("describe")'
> describe = function(x)
+ {
+ m=mean(x,na.rm=T)
+ s=sd(x,na.rm=T)
+ N=sum(is.na(x))
+ n=length(x)-N
+ se=s/sqrt(n)
+ out=c(m,s,se,n,N)
+ names(out)=c("mean","sd","sem","n","NAs")
+ round(out,4)
+ }
> ls()
[1] "calculate" "describe" "nums" "samp.size" "sem"
> describe(nums)
mean sd sem n NAs
96.5680 12.9418 2.6417 24.0000 1.0000
That's better! And don't forget to SAVE YOUR WORKSPACE when you quit if you want to keep these functions.
Scripts
A script is just a plain text file with R commands in it. You can prepare a script in any text editor, such as vim,
TextWrangler, or Notepad. You can also prepare a script in a word processor, like Word, Writer, TextEdit, or WordPad,
PROVIDED you save the script in plain text (ASCII) format. In Windows, this will append a ".txt" file extension to the file.
Drop the script into your working directory, and then read it into R using the source( ) function.
In a moment you're going to see a link. Click on it and a text page will appear with a sample script on it. Use your browser
to save this page to your desktop. Then move the saved file into your R working directory. Or, if you really want to be
adventurous, type the script into a text editor like Notepad, save it in your working directory, and you are ready to go.
Okay, here is the link...
Now that you've got it in your working directory one way or another, do this in R...
> source("sample_script.txt") # Don't forget those quotes!
A note: R does not like spaces in script names, so don't put spaces in your script names! Now, what didn't happen that you
expected to happen? Go back to the link and read the script again if you have to. What happened to the mean of "y" and
the mean of "x"?
The script has created the variables "x" and "y" in your workspace (and has erased any old objects you had by that name--
sorry). You can see them with the ls( ) function. Executing a script does everything typing those commands in the Console
would do, EXCEPT print things to the Console. Do this...
> x
[1] 22 39 50 25 18
> mean(x)
[1] 30.8
See? It's there. But if you want to be sure a script will print it to the Console, you should use the print( ) function...
> print(x)
[1] 22 39 50 25 18
> print(mean(x))
[1] 30.8
When you're working in the Console, the print( ) is understood (implicit) when you type a command or data object name.
This is not necessarily so in a script.
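The difference between implicit and explicit printing can be seen without downloading anything. The sketch below writes a three-line script to a temporary file (the file and its contents are invented for this illustration) and then sources it.

```r
# Write a tiny script to a temporary file; tempfile() picks the name.
script = tempfile(fileext=".txt")
writeLines(c("x = c(22, 39, 50, 25, 18)",
             "mean(x)",             # computed, but nothing is shown
             "print(mean(x))"),     # explicitly printed
           script)

# Capture whatever source() sends to the Console.
shown = capture.output(source(script))
shown

# With echo=TRUE, source() shows each command and its result,
# much as the Console would if the lines were typed in by hand.
source(script, echo=TRUE)
```

Only the line wrapped in print( ) produces output in the first call. The echo=TRUE argument of source( ) is an alternative when you want to see everything without editing the script.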
A script is a good way to keep track of what you're doing. If you have a long analysis, and you want to be able to recreate
it later, a good idea is to type it into a script. If you're working in the Windows R GUI (also in the Mac R GUI), there is
even a built-in script editor. To get to it, pull down the File menu and choose New Script (New Document on a Mac). A
window will open in which you can type your script. Type this script into the open window...
Hit the Enter key after the last line. Now, in the editor window, pull down the Edit menu and choose Run All. (On a Mac,
highlight all the lines of the script and choose Execute.) The script should execute in your R Console. Pull down the File
Menu and choose Save As... Give the file a nice name, like "script2.txt". R will NOT save it by default with a file
extension, so be sure you give it one. Close the editor window. Now, in the R Console, do this...
> source("script2.txt")
Nothing happens! Why not? Actually, something did happen. The "aov.out" object was created in your workspace.
However, nothing was echoed to your Console because you didn't tell it to print( ). Go to File and choose New Script. In
the script editor, pull down File and choose Open Script... In the Open Script dialog that appears, change Files Of Type to
all files. Then choose to open "script2.txt". Edit it to look like this...
Pull down File and choose Save. Close the script editor window(s). And FINALLY...
> source("script2.txt")
R Tutorials--Logistic Regression
LOGISTIC REGRESSION
Preliminaries
Model Formulae
You will need to know a bit about Model Formulae to understand this tutorial.
When you go to the track, how do you know which horse to bet on? You look at the odds. In the program, you may see the
odds for your horse, Sea Brisket, are 8 to 1, which are the odds AGAINST winning. This means in nine races Sea Brisket
would be expected to win 1 and lose 8. In probability terms, Sea Brisket has a probability of winning of 1/9, or 0.111. But
the odds of winning are 1/8, or 0.125. Odds are actually the ratio of two probabilities...
odds = p(one outcome) / p(the other outcome) = p(success) / p(failure) = p / q, where q = 1 - p
So for Sea Brisket, odds(winning) = (1/9)/(8/9) = 1/8. Notice that odds have these properties:
Logistic regression fits b0 and b1, the regression coefficients (which were 0 and 1, respectively, for the graph above). It
should have already struck you that this curve is not linear. However, the point of the logit transform is to make it linear...
logit(y) = b0 + b1x
Hence, logistic regression is linear regression on the logit transform of y, where y is the proportion (or probability) of
success at each value of x. However, you should avoid the temptation to do a traditional least-squares regression at this
point, as neither the normality nor the homoscedasticity assumption will be met.
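For readers who want to see the logit transform in action, here is a minimal sketch. The function names "logit" and "inv.logit" are defined here for illustration; base R supplies the same pair as qlogis( ) and plogis( ).

```r
# Logit (log odds) and its inverse, written out by hand.
logit = function(p) log(p/(1 - p))
inv.logit = function(z) 1/(1 + exp(-z))

p = c(0.1, 0.25, 0.5, 0.75, 0.9)
logit(p)               # zero at p = 0.5, symmetric around it
inv.logit(logit(p))    # recovers the original probabilities

# The built-in equivalents give the same answers.
all.equal(logit(p), qlogis(p))
all.equal(inv.logit(logit(p)), plogis(qlogis(p)))
```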
Odds ratio might best be illustrated by returning to our horse race. Suppose in the same race Seattle Stew is given odds of
2 to 1, which is to say, two expected losses for each expected win. Seattle Stew's odds of winning are 1/2, or 0.5. How
much better is this than the winning odds for Sea Brisket? The odds ratio tells us: 0.5 / 0.125 = 4.0. The odds of Seattle
Stew winning are four times the odds of Sea Brisket winning. Be careful not to say "times as likely to win," which would
not be correct. The probability (likelihood, chance) of Seattle Stew winning is 1/3 and for Sea Brisket is 1/9, resulting in a
likelihood ratio of 3.0. Seattle Stew is three times as likely to win as Sea Brisket.
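The horse race arithmetic is easy to check at the command line. In this sketch the function "odds.against" and its argument names are invented for the illustration.

```r
# Convert "lose to win" odds against into a probability of winning
# and the odds of winning. Name and arguments are illustrative only.
odds.against = function(lose, win)
{
    p = win/(win + lose)          # probability of winning
    c(p=p, odds=p/(1 - p))        # odds of winning = p/q
}

brisket = odds.against(8, 1)      # Sea Brisket, 8 to 1 against
stew = odds.against(2, 1)         # Seattle Stew, 2 to 1 against

odds.ratio = unname(stew["odds"]/brisket["odds"])
prob.ratio = unname(stew["p"]/brisket["p"])
odds.ratio     # ratio of the odds of winning
prob.ratio     # ratio of the probabilities of winning
```

The two ratios differ, which is the point of the warning above: a ratio of odds is not a ratio of probabilities.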
In the "MASS" library there is a data set called "menarche" (Milicer, H. and Szczotka, F., 1966, Age at Menarche in
Warsaw girls in 1965, Human Biology, 38, 199-203), in which there are three variables: "Age" (average age of age
homogeneous groups of girls), "Total" (number of girls in each group), and "Menarche" (number of girls in the group who
have reached menarche)...
> library("MASS")
> data(menarche)
> str(menarche)
'data.frame': 25 obs. of 3 variables:
$ Age : num 9.21 10.21 10.58 10.83 11.08 ...
$ Total : num 376 200 93 120 90 88 105 111 100 93 ...
$ Menarche: num 0 0 0 2 2 5 10 17 16 29 ...
> summary(menarche)
Age Total Menarche
Min. : 9.21 Min. : 88.0 Min. : 0.00
1st Qu.:11.58 1st Qu.: 98.0 1st Qu.: 10.00
Median :13.08 Median : 105.0 Median : 51.00
Mean :13.10 Mean : 156.7 Mean : 92.32
3rd Qu.:14.58 3rd Qu.: 117.0 3rd Qu.: 92.00
Max. :17.58 Max. :1049.0 Max. :1049.00
> plot(Menarche/Total ~ Age, data=menarche)
From the graph at right, it appears a logistic fit is called for here. The fit would be done this way...
> glm.out = glm(cbind(Menarche, Total-Menarche) ~ Age,
+ family=binomial(logit), data=menarche)
Numerous explanations are in order! First, glm( ) is the function used to do generalized linear models, and will be explained
more completely in another tutorial. With "family=" set to "binomial" with a "logit" link, glm( ) produces a logistic
regression. Because we are using glm( ) with binomial errors in the response variable, the ordinary assumptions of least
squares linear regression (normality and homoscedasticity) don't apply. Second, our data frame does not contain a row for
every case (i.e., every girl upon whom data were collected). Therefore, we do not have a binary (0,1) coded response
variable. No problem! If we feed glm( ) a table (or matrix) in which the first column is number of successes and the
second column is number of failures, R will take care of the coding for us. In the above analysis, we made that table on the
fly inside the model formula by binding "Menarche" and "Total − Menarche" into the columns of a table using cbind( ).
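As a cross-check on the two-column response, the same model can also be specified with the observed proportion as the response and the group sizes as weights. This is a sketch of that alternative, not something the tutorial itself does; the object names "fit.counts" and "fit.props" are made up here.

```r
library(MASS)    # provides the menarche data set

# Two-column response: successes and failures, as in the tutorial.
fit.counts = glm(cbind(Menarche, Total - Menarche) ~ Age,
                 family=binomial(logit), data=menarche)

# Proportion response, with the number of girls per group as weights.
fit.props = glm(Menarche/Total ~ Age, weights=Total,
                family=binomial(logit), data=menarche)

coef(fit.counts)
coef(fit.props)
```

Both calls should report the same coefficients, since they describe the same binomial likelihood.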
Let's look at how closely the fitted values from our logistic regression match the observed values...
I'm impressed! I don't know about you. The numerical results are extracted like this...
> summary(glm.out)
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit),
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4
The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and
anova(glm.out).
Recall that the response variable is log odds, so the coefficient of "Age" can be interpreted as "for every one year increase
in age the odds of having reached menarche increase by exp(1.632) = 5.11 times."
To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the
print out. Null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean).
This is essentially a chi square value on 24 degrees of freedom, and indicates very little fit (a highly significant difference
between fitted values and observed values). Adding in our predictors--just "Age" in this case--decreased the deviance by
3667 points on 1 degree of freedom. Again, this is interpreted as a chi square value and indicates a highly significant
decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the
model by once again treating this as a chi square value. A chi square of 26.7 on 23 degrees of freedom yields a p-value of
0.269. The null hypothesis (i.e., the model) is not rejected. The fitted values are not significantly different from the
observed values.
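The chi square tests described above can be computed from the fitted model rather than read off the print out. A minimal sketch, refitting the same model so the block stands on its own (the names "drop.dev", "drop.df" and "p.fit" are illustrative):

```r
library(MASS)
glm.out = glm(cbind(Menarche, Total - Menarche) ~ Age,
              family=binomial(logit), data=menarche)

# Multiplier on the odds for each one-year increase in age.
exp(coef(glm.out)["Age"])

# Drop in deviance due to Age, treated as chi square on 1 df.
drop.dev = glm.out$null.deviance - glm.out$deviance
drop.df = glm.out$df.null - glm.out$df.residual
pchisq(drop.dev, df=drop.df, lower.tail=FALSE)

# Residual deviance as an overall goodness-of-fit test.
p.fit = pchisq(glm.out$deviance, df=glm.out$df.residual, lower.tail=FALSE)
p.fit
```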
During the Fall semester of 2005, two students in our program--Rachel Mullet and Lauren Garafola--did a senior research
project in which they studied a phenomenon called Inattentional Blindness (IB). IB refers to situations in which a person
fails to see an obvious stimulus right in front of his eyes. (For details see this website.) In their study, Rachel and Lauren
had subjects view an online video showing college students passing basketballs to each other, and the task was to count the
number of times students in white shirts passed the basketball. During the video, a person in a black gorilla suit walked
through the picture in a very obvious way. At the end of the video, subjects were asked if they saw the gorilla. Most did
not!
Rachel and Lauren hypothesized that IB could be predicted from performance on the Stroop Color Word test. This test
produces three scores: "W" (word alone, i.e., a score derived from reading a list of color words such as red, green, black),
"C" (color alone, in which a score is derived from naming the color in which a series of Xs are printed), and "CW" (the
Stroop task, in which a score is derived from the subject's attempt to name the color in which a color word is printed when
the word and the color do not agree). The data are in the following table, in which the response, "seen", is coded as 0=no
and 1=yes...
seen W C CW
1 0 126 86 64
2 0 118 76 54
3 0 61 66 44
4 0 69 48 32
5 0 57 59 42
6 0 78 64 53
7 0 114 61 41
8 0 81 85 47
9 0 73 57 33
10 0 93 50 45
11 0 116 92 49
12 0 156 70 45
13 0 90 66 48
14 0 120 73 49
15 0 99 68 44
16 0 113 110 47
17 0 103 78 52
18 0 123 61 28
19 0 86 65 42
20 0 99 77 51
21 0 102 77 54
22 0 120 74 53
23 0 128 100 56
24 0 100 89 56
25 0 95 61 37
26 0 80 55 36
27 0 98 92 51
28 0 111 90 52
29 0 101 85 45
30 0 102 78 51
31 1 100 66 48
32 1 112 78 55
33 1 82 84 37
34 1 72 63 46
35 1 72 65 47
36 1 89 71 49
37 1 108 46 29
38 1 88 70 49
39 1 116 83 67
40 1 100 69 39
41 1 99 70 43
42 1 93 63 36
43 1 100 93 62
44 1 110 76 56
45 1 100 83 36
46 1 106 71 49
47 1 115 112 66
48 1 120 87 54
49 1 97 82 41
To get them into R, try this first...
> file = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/gorilla.csv"
> read.csv(file) -> gorilla
> str(gorilla)
If that doesn't work (and it should), try copying and pasting this script into R at the command prompt...
### Begin copying here.
gorilla = data.frame(rep(c(0,1),c(30,19)),
c(126,118,61,69,57,78,114,81,73,93,116,156,90,120,99,113,103,123,
86,99,102,120,128,100,95,80,98,111,101,102,100,112,82,72,72,
89,108,88,116,100,99,93,100,110,100,106,115,120,97),
c(86,76,66,48,59,64,61,85,57,50,92,70,66,73,68,110,78,61,65,
77,77,74,100,89,61,55,92,90,85,78,66,78,84,63,65,71,46,70,
83,69,70,63,93,76,83,71,112,87,82),
c(64,54,44,32,42,53,41,47,33,45,49,45,48,49,44,47,52,28,42,51,54,
53,56,56,37,36,51,52,45,51,48,55,37,46,47,49,29,49,67,39,43,36,
62,56,36,49,66,54,41))
colnames(gorilla) = c("seen","W","C","CW")
str(gorilla)
### End copying here.
And if that doesn't work, well, you know what you have to do!
The Stroop scale scores are moderately positively correlated with each other, but none of them appears to be related to the
"seen" response variable, at least not to any impressive extent. There doesn't appear to be much here to look at. Let's have
a go at it anyway.
Since the response is a binomial variable, a logistic regression can be done as follows...
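The commands for this step did not survive in this copy of the notes. What follows is a sketch under a stated assumption: the 7 degrees of freedom mentioned in the discussion suggest a model with all three Stroop scores and all of their interactions, so that is what is fitted here. The data are rebuilt from the table above so the block runs on its own.

```r
# Rebuild the gorilla data frame from the table above.
gorilla = data.frame(
    seen = rep(c(0,1), c(30,19)),
    W = c(126,118,61,69,57,78,114,81,73,93,116,156,90,120,99,113,103,123,
          86,99,102,120,128,100,95,80,98,111,101,102,100,112,82,72,72,
          89,108,88,116,100,99,93,100,110,100,106,115,120,97),
    C = c(86,76,66,48,59,64,61,85,57,50,92,70,66,73,68,110,78,61,65,
          77,77,74,100,89,61,55,92,90,85,78,66,78,84,63,65,71,46,70,
          83,69,70,63,93,76,83,71,112,87,82),
    CW = c(64,54,44,32,42,53,41,47,33,45,49,45,48,49,44,47,52,28,42,51,54,
           53,56,56,37,36,51,52,45,51,48,55,37,46,47,49,29,49,67,39,43,36,
           62,56,36,49,66,54,41))

# ASSUMED model: main effects plus all interactions (7 terms in all).
glm.out = glm(seen ~ W * C * CW, family=binomial(logit), data=gorilla)
summary(glm.out)                  # coefficients, standard errors, z-tests
anova(glm.out, test="Chisq")      # deviance reduction, term by term
```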
The first gives us what amount to regression coefficients with standard errors and a z-test, as we saw in the single variable
example above. None of the coefficients are significantly different from zero (but a few are close). The deviance was
reduced by 8.157 points on 7 degrees of freedom, for a p-value of...
The second print out shows the same overall reduction in deviance, from
65.438 to 57.281 on 7 degrees of freedom. In this print out, however, the
reduction in deviance is shown for each term, added sequentially first to
last. Of note is the three-way interaction term, which produced a nearly
significant reduction in deviance of 3.305 on 1 degree of freedom
(p = 0.069).
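The p-values for these deviance reductions come from the chi square distribution. A short sketch using only the numbers quoted above (the names "p.overall" and "p.three" are illustrative):

```r
# Overall reduction in deviance: 65.438 - 57.281 on 7 df.
overall = 65.438 - 57.281
p.overall = pchisq(overall, df=7, lower.tail=FALSE)
p.overall

# Three-way interaction term: 3.305 on 1 df.
p.three = pchisq(3.305, df=1, lower.tail=FALSE)
p.three
```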
In the event you are encouraged by any of this, the following graph might
be revealing...
> plot(glm.out$fitted)
> abline(v=30.5,col="red")
> abline(h=.3,col="green")
> abline(h=.5,col="green")
> text(15,.9,"seen = 0")
> text(40,.9,"seen = 1")
Let's re-examine the "UCBAdmissions" data set, which we looked at in a previous tutorial...
                Admit Admitted Rejected
Gender Dept
Male   A                   512      313
       B                   353      207
       C                   120      205
       D                   138      279
       E                    53      138
       F                    22      351
Female A                    89       19
       B                    17        8
       C                   202      391
       D                   131      244
       E                    94      299
       F                    24      317
The data are from 1973 and show admissions by gender to the top six grad programs at the University of California,
Berkeley. Looked at as a two-way table, there appears to be a bias against admitting women...
> dimnames(UCBAdmissions)
$Admit
[1] "Admitted" "Rejected"
$Gender
[1] "Male" "Female"
$Dept
[1] "A" "B" "C" "D" "E" "F"
> margin.table(UCBAdmissions, c(2,1))
Admit
Gender Admitted Rejected
Male 1198 1493
Female 557 1278
However, there are also relationships between "Gender" and "Dept" as well as between "Dept" and "Admit", which means
the above relationship may be confounded by "Dept" (or "Dept" might be a lurking variable, in the language of traditional
regression analysis). Perhaps a logistic regression with the binomial variable "Admit" as the response can tease these
variables apart.
If there is a way to conveniently get that flat table into a data frame (without splitting an infinitive), I don't know it. So I
had to do this...
I used a trick here of storing the model formula in a data object, and then entering the name of this object into the glm( )
function. That way, if I made a mistake in the model formula (or want to run an alternative model), I have only to edit the
"mod.form" object to do it.
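The code for this step is missing from this copy of the notes, so here is a sketch of what such an analysis could look like. The data frame name "ucb.df" and its column names are assumptions, the counts are typed in from the flat table, and the formula is stored as a formula object in "mod.form".

```r
# Counts typed in from the flat table, one row per gender-by-department cell.
ucb.df = data.frame(
    gender = rep(c("Male","Female"), c(6,6)),
    dept = rep(LETTERS[1:6], 2),
    yes = c(512,353,120,138,53,22,89,17,202,131,94,24),
    no = c(313,207,205,279,138,351,19,8,391,244,299,317))

# Store the model formula in an object, then hand the object to glm().
mod.form = cbind(yes, no) ~ gender * dept
glm.out = glm(mod.form, family=binomial(logit), data=ucb.df)

exp(coef(glm.out)["genderMale"])    # odds of admission, male relative to female
coef(glm.out)["deptC"]              # log odds, department C relative to A
```

Because "Female" and "A" come first alphabetically, they serve as the base levels, so every coefficient is a comparison against a female applicant to department A.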
This shows that men were actually at a significant disadvantage when department and the interaction are controlled. The
odds of a male being admitted were only 0.35 times the odds of a female being admitted. The reciprocal of this turns it on
its head. All else being equal, the odds of a female being admitted were 2.86 times the odds of a male being admitted.
Each coefficient compares the corresponding predictor to the base level. So...
> exp(-2.2046)
[1] 0.1102946
...the odds of being admitted to department C were only about 1/9th the odds of being admitted to department A, all else
being equal. If you want to compare, for example, department C to department D, do this...
> exp(-2.2046) / exp(-2.1662) # C:A / D:A leaves C:D
[1] 0.962328
All else equal, the odds of being admitted to department C were 0.96 times the odds of being admitted to department D.
(To be honest, I'm not sure I'm comfortable with the interaction in this model. You might want to examine the interaction,
and if you think it doesn't merit inclusion, run the model again without it. Statistics are nice, but in the end it's what makes
sense that should rule the day.)
R Tutorials--Model Formulae
MODEL FORMULAE
This is a short tutorial on writing model formulae for ANOVA and regression analyses. It will be linked to from those
tutorials, but you are welcome to read it just for kicks if you'd like.
R functions such as aov( ), lm( ), and glm( ) use a formula interface to specify the variables to be included in the analysis.
The formula determines the model that will be built (and tested) by the R procedure. The basic format of such a formula
is...
y ~ x
...where "x" is the explanatory variable or IV, and "y" is the response variable or DV. Additional explanatory variables
would be added in as follows...
y ~ x + z
...which would make this a multiple regression with two predictors. This raises a critical issue that must be understood to
get model formulae correct. Symbols used as mathematical operators in other contexts do not have their usual
mathematical meaning inside model formulae. The following table lists the meaning of these symbols when used in a
formula.
You may have noticed already that some formula structures can be specified in more than one way...
All three of these specify a model in which the variables "u", "v", "w", and all the interactions between them are included.
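Any of the following formats, by contrast, specifies the three main effects and all the two-way interactions but leaves out the three-way interaction...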
y ~ u + v + w + u:v + u:w + v:w
y ~ u * v * w - u:v:w
y ~ (u + v + w)^2
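You can ask R which terms it derives from a formula, which is a handy way to confirm that two ways of writing a model are equivalent. A minimal sketch; the helper name "labels.of" is made up for the illustration, and no data are needed.

```r
# Term labels that R derives from a formula.
labels.of = function(f) attr(terms(f), "term.labels")

a = labels.of(y ~ u + v + w + u:v + u:w + v:w)
b = labels.of(y ~ u * v * w - u:v:w)
d = labels.of(y ~ (u + v + w)^2)
a
b
d

# For comparison, the full factorial model adds the three-way term.
full = labels.of(y ~ u * v * w)
full
```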
The nature of the variables--binary, categorical (factors), numerical--will determine the nature of the analysis. For example,
if "u" and "v" are factors...
y ~ u + v
...dictates an analysis of variance (without the interaction term). If "u" and "v" are numerical, the same formula would
dictate a multiple regression. If "u" is numerical and "v" is a factor, then an analysis of covariance is dictated.
That ought to do it for now. Specific examples will appear in the tutorials devoted to specific analyses.
R Tutorials--More Descriptive Statistics
Categorical Data
You summarize categorical data basically by counting up frequencies and by calculating proportions and percentages.
Categorical data are commonly encountered in three forms: a frequency table or crosstabulation, a flat table, or a case-by-
case data frame. Let's begin with the last of these. Copy and paste the following lines ALL AT ONCE into R. That is,
highlight these lines with your mouse, hit Ctrl-C on your keyboard, click at a command prompt in R, and hit Ctrl-V on
your keyboard, and hit Enter if necessary, i.e., if R hasn't returned to a command prompt. On the Mac, use Command-C
and Command-V. This will execute these lines as a script and create a data frame called "ucb" in your workspace.
WARNING: Your workspace will also be cleared, so save anything you don't want to lose first.
Data sets that are purely categorical are not economically represented in case-by-case data frames, and so the built-in data
sets that are purely categorical come in the form of tables (contingency tables or crosstabulations). We have just taken the
data from one of these (the "UCBAdmissions" built-in data set) and turned it into a case-by-case data frame. It's the classic
University of California, Berkeley, admissions data from 1973 describing admissions into six different graduate programs
broken down by gender.
gender, admitted, and department the person applied to. All three variables are coded as factors, but this doesn't matter. For
our present purposes, categorical variables and factors are the same thing.
> summary(ucb)
    gender     admitted   department
 female:1835   no :2771   A:933
 male  :2691   yes:1755   B:585
                          C:918
                          D:792
                          E:584
                          F:714
We now have frequency tables for each of the variables. We could get the same information for any individual variable
using the table( ) function...
> table(ucb$gender)
female male
1835 2691
>
> table(ucb$admitted)
no yes
2771 1755
>
> table(ucb$department)
A B C D E F
933 585 918 792 584 714
These tables can easily be turned into relative frequency tables using the prop.table( ) function...
> table(ucb$department) -> dept.table # Requires a table as an argument.
> prop.table(dept.table) # Calculate proportions.
A B C D E F
0.2061423 0.1292532 0.2028281 0.1749890 0.1290323 0.1577552
>
> prop.table(dept.table) * 100 # Or calculate percentages.
A B C D E F
20.61423 12.92532 20.28281 17.49890 12.90323 15.77552
As we see, 20.6% of the applicants applied to department A, 12.9% to department B, 20.3% to department C, etc.
The prop.table( ) function requires a table object as its argument, so the data must first be tabled before being
prop.tabled.
Contingency tables or crosstabs can be produced using either the table( ) or xtabs( ) function. Table is easier, so
I'll illustrate that first...
When using prop.table( ) on a multidimensional table, it's necessary to specify which marginal sums you want to
use to calculate the proportions. To use the row sums, specify 1; to use the column sums, specify 2...
gender no yes
female 0.4612053 0.3173789
male 0.5387947 0.6826211
The name of this option is "margin=", so this form could also have been used...
> prop.table(my.table, margin=1)
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
Since "margin=" is the first thing prop.table( ) expects after the table name, either form will work here.
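The effect of the margin can be verified by checking which sums come out to 1. This sketch uses the built-in "UCBAdmissions" table, collapsed over department, rather than the "ucb" data frame, so it runs in a fresh session; the object names are illustrative.

```r
# Gender-by-admission counts, collapsed over department.
gen.adm = margin.table(UCBAdmissions, c(2,1))

by.row = prop.table(gen.adm, 1)           # proportions within each gender
by.col = prop.table(gen.adm, margin=2)    # proportions within each admit status

rowSums(by.row)    # each row sums to 1
colSums(by.col)    # each column sums to 1
```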
The xtabs( ) function works quite a bit differently. It uses a formula interface. The formula interface is used most often
in model building and significance testing, so we'll see it a lot, and it's discussed in detail in another tutorial. Formulas can
become quite complex, but their most basic form is as follows...
Instead of using with( ) to give the name of the data frame, we could also have used the data= option, since we are
using a formula interface in the function...
> xtabs(~ gender + admitted, data=ucb)
admitted
gender no yes
female 1278 557
male 1493 1198
The resulting table could also have been stored and operated on with other functions. Here are some examples...
> xtabs(~ gender + admitted, data=ucb) -> gen.adm.table
> prop.table(gen.adm.table, 1) # Get proportions relative to row sums.
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
>
> addmargins(gen.adm.table) # Add marginal sums to table.
admitted
gender no yes Sum
female 1278 557 1835
male 1493 1198 2691
Sum 2771 1755 4526
>
> margin.table(gen.adm.table, 1) # Collapse over admitted (row marginals).
gender
female male
1835 2691
>
> margin.table(gen.adm.table, 2) # Collapse over gender (column marginals).
admitted
no yes
2771 1755
Six different crosstabulations of the entire data set are possible, depending upon the order in which we list the variables...
> with(ucb, table(gender, department, admitted))
, , admitted = no
department
gender A B C D E F
female 19 8 391 244 299 317
male 313 207 205 279 138 351
, , admitted = yes
department
gender A B C D E F
female 89 17 202 131 94 24
male 512 353 120 138 53 22
> with(ucb, table(admitted, department, gender))
, , gender = female
department
admitted A B C D E F
no 19 8 391 244 299 317
yes 89 17 202 131 94 24
, , gender = male
department
admitted A B C D E F
no 313 207 205 279 138 351
yes 512 353 120 138 53 22
Etc.
A flat table is produced from the data frame by using the ftable( ) function. Use the "col.vars=" option to control
which variable goes in the columns...
Flat tables can also be made from contingency tables, the order of the row variables can be changed, and multiple column
variables can be specified...
> with(ucb, table(admitted,department,gender)) -> my.table # A 3D contingency table.
> ftable(my.table)
gender female male
admitted department
no A 19 313
B 8 207
C 391 205
D 244 279
E 299 138
F 317 351
yes A 89 512
B 17 353
C 202 120
D 131 138
E 94 53
F 24 22
>
> ftable(my.table, col.vars="admitted")
admitted no yes
department gender
A female 19 89
male 313 512
B female 8 17
male 207 353
C female 391 202
male 205 120
D female 244 131
male 279 138
E female 299 94
male 138 53
F female 317 24
male 351 22
>
> ftable(my.table, row.vars=c("gender","department"), col.vars="admitted")
admitted no yes
gender department
female A 19 89
B 8 17
C 391 202
D 244 131
E 299 94
F 317 24
male A 313 512
B 207 353
C 205 120
D 279 138
E 138 53
F 351 22
>
> ftable(my.table, row.vars="department", col.vars=c("gender","admitted"))
gender female male
admitted no yes no yes
department
A 19 89 313 512
B 8 17 207 353
C 391 202 205 120
D 244 131 279 138
E 299 94 138 53
F 317 24 351 22
Use the "row.vars=" option to control the order in which the row variables occur. If more than one variable name is to be
used in either the "row.vars" or "col.vars" option, use vector notation to specify them. (HINT: In R, everything is a vector,
even if it has only one element! The notation row.vars="department" is just a shortcut for row.vars=c("department").)
You can also get a VERY useful and more efficient data frame from a contingency table as follows...
> as.data.frame(my.table) -> my.df.table
> my.df.table
admitted department gender Freq
1 no A female 19
2 yes A female 89
3 no B female 8
4 yes B female 17
5 no C female 391
6 yes C female 202
7 no D female 244
8 yes D female 131
9 no E female 299
10 yes E female 94
11 no F female 317
12 yes F female 24
13 no A male 313
14 yes A male 512
15 no B male 207
16 yes B male 353
17 no C male 205
18 yes C male 120
19 no D male 279
20 yes D male 138
21 no E male 138
22 yes E male 53
23 no F male 351
24 yes F male 22
There are only 24 possible unique cases in these data (2 genders times 2 admit statuses times 6 departments). So why put
this information into a data frame with 4500+ rows in it? The form above is much more efficient. It lists the possible
unique cases in the first three columns (every possible combination of factors) and then gives a frequency in the last Freq
column.
When there is a Freq column in the data frame, you CANNOT use table( ) to get crosstabulations. You must use
xtabs( ), AND you must specify Freq as the DV.
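Here is a minimal sketch of why table( ) fails on this kind of data frame and how xtabs( ) handles it. It builds the frequency data frame from the built-in "UCBAdmissions" table, so the variable names are Admit, Gender and Dept rather than the lowercase names used in "ucb"; the object names are illustrative.

```r
# A data frame with one row per cell and a Freq column.
ucb.freq = as.data.frame(UCBAdmissions)

# table() only counts rows, so every cell comes out as 6.
row.counts = with(ucb.freq, table(Gender, Admit))
row.counts

# xtabs() with Freq on the left side sums the frequencies instead.
freq.counts = xtabs(Freq ~ Gender + Admit, data=ucb.freq)
freq.counts
```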
Numerical Data
Imagine a number line stretching all the way across the universe from negative infinity to positive infinity. Somewhere
along this number line is your little mound of data, and you have to tell someone how to find it. What information would it
be useful to give this person?
First, you might want to tell this person how big the mound is. Is she to look for a sizeable mound of data or just a little
speck? Second, you might want to tell this person about where to look. Where is the mound centered on the number line?
Third, you might want to tell this person how spread out to expect the mound to be. Are the data all in a nice, compact
mound, or are they spread out all over the place? Finally, you might want to tell this person what shape to expect the mound to
be.
The data frame "faithful" consists of 272 observations on the Old Faithful geyser in Yellowstone National Park. Each
observation consists of two variables: "eruptions" (how long an eruption lasted in minutes), and "waiting" (how long in
minutes the geyser was quiet before that eruption)...
> data(faithful)
> str(faithful)
'data.frame': 272 obs. of 2 variables:
$ eruptions: num 3.60 1.80 3.33 2.28 4.53 ...
$ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
I'm going to attach this so that we can easily look at the variables inside. Remind me to detach it when we're done!
> attach(faithful)
We'll begin with the "waiting" variable. Let's see if we can characterize it.
First, we want to know how many values are in this variable. If this variable is our mound of data somewhere on that
number line, how big a mound is it?
> length(waiting)
[1] 272
I don't suppose you were much surprised by that. We already knew that this data frame contains 272 cases, so each
variable must be 272 cases long. There is a slight catch, however. The length( ) function counts missing values (NAs)
as cases, and there is no way to stop it from doing so. That is, there is no na.rm= option for this function. In fact, there are
no options at all. It will not even reveal the presence of missing values, and for this reason the length( ) function can
give a misleading result when missing values are present.
The following command will tell you how many of the values in a variable are NAs (missing values)...
> sum(is.na(waiting))
[1] 0
It does this as follows. The is.na( ) function tests every value in a vector for missingness, returning either TRUE or
FALSE for each value. Remember, in R, TRUEs add as ones and FALSEs add as zeroes, so when we sum the result of
testing for missingness, we get the number of TRUEs, i.e., the number of missing values. It's kind of a nuisance to
remember that, so I propose a new function, one that doesn't actually exist in R.
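The counting trick is easy to see on a small vector. This is a sketch with made-up values, not part of the faithful data.

```r
v <- c(4, NA, 7, NA, 10)     # a made-up vector with two missing values
is.na(v)                     # one TRUE or FALSE per element
n.missing <- sum(is.na(v))   # TRUEs count as 1, FALSEs as 0
n.missing
```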
Here's one of the nicest things about R. If it doesn't do something you want it to do, you can write a function that does it.
We'll talk about this in detail in a future tutorial, but for right now the basic idea is this. If you can calculate it at the
command line, you can write a function to do it. In this case...
> length(waiting) - sum(is.na(waiting))
[1] 272
...would give us the number of nonmissing cases in "waiting". So do this (and be VERY CAREFUL with your typing)...
> sampsize = function(x) length(x) - sum(is.na(x))
> sampsize(waiting)
[1] 272
> ls()
[1] "faithful" "sampsize"
The first line creates a function called "sampsize" and gives it a mathematical definition (i.e., tells how to calculate it). The
variable "x" is called a dummy variable, or a place holder. When we actually use the function, as we did in the second line,
we put in the name of the vector we want "sampsize" calculated for. This takes the place of "x" in the calculation. Notice
that an object called "sampsize" has been added to your workspace. It will be there until you remove it (or neglect to save
the workspace at the end of this R session). Better yet, if you go back to your default working directory and save it there, it
will load automatically every time you start R. We'll do that at the end of this tutorial.
The second thing we want to know about our little mound of data is where it is located on our number line. Where is it
centered? In other words, we want a measure of center or of central tendency. There are three that are commonly used:
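mean, median, and mode. The mode is not very useful and is rarely used in statistical calculations, so there is no R
function for it. (The mode( ) function that does exist in R reports how an object is stored, not the most frequent value.) Mean and median, on the other hand, are straightforward...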
> mean(waiting)
[1] 70.89706
> median(waiting)
[1] 76
Now we know to hunt around somewhere in the seventies for our data.
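If you ever do need the statistical mode, it can be pulled out of a frequency table. A minimal sketch, assuming numeric data; stat.mode is a hypothetical helper name, not a function that ships with R.

```r
# Hypothetical helper (not part of base R): the most frequent value(s) in x.
stat.mode <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  as.numeric(top)                              # assumes x is numeric
}

stat.mode(faithful$waiting)
stat.mode(c(1, 1, 2, 2, 3))   # ties are all returned
```

Returning every tied value is a design choice; a version that returns only the first would hide the fact that a variable can have more than one mode.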
How spread out should we expect it to be? This is given to us by a measure of spread, also called a measure of variability
or a measure of dispersion. There are several of these in common usage: the range, the interquartile range, the variance,
and the standard deviation. Here's how to get them...
> range(waiting)
[1] 43 96
> IQR(waiting)
[1] 24
> var(waiting)
[1] 184.8233
> sd(waiting)
[1] 13.59497
There are several things you should be aware of here. First, range( ) does not actually give the range (difference
between the maximum and minimum values), it gives the minimum and maximum value. So the values in "waiting" range
from a minimum of 43 to a maximum of 96.
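If the single-number range is what you are after, wrap range( ) in diff( ). A short sketch using the same variable:

```r
w <- faithful$waiting
w.range <- diff(range(w))   # maximum minus minimum
w.range
max(w) - min(w)             # the same thing, spelled out
```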
Second, the interquartile range (IQR) is defined as the difference between the value at the 3rd quartile and the value at the
1st quartile. However, there is not universal agreement on how these values should be calculated, and different software
packages do it differently. R allows nine different methods to calculate these values, and the one used by default is NOT
the one you were probably taught in elementary statistics. So the result given by IQR( ) is not the one you might get if
you do the calculation by hand (or on a TI-84 calculator). It should be very close though. If your instructor insists that you
get the "calculator" value, do the calculation this way...
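One way to do it is sketched here, under the assumption that the "calculator" method means taking the median of the lower half and the median of the upper half of the sorted data (leaving out the middle value when n is odd). The helper name calc.iqr is hypothetical.

```r
# Hypothetical helper: IQR by the "median of each half" method.
# Assumes at least two nonmissing values.
calc.iqr <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  lower <- x[seq_len(floor(n / 2))]     # lower half of the sorted data
  upper <- x[(ceiling(n / 2) + 1):n]    # upper half of the sorted data
  median(upper) - median(lower)         # 3rd quartile minus 1st quartile
}

calc.iqr(faithful$waiting)
IQR(faithful$waiting)                   # R's default method, for comparison
```

The nine built-in methods are also available through the type= option of quantile( ) and IQR( ), if one of them matches what your instructor wants.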
The last two things you need to be aware of are that the variance and standard deviation calculated above are the sample
values. That is, they are calculated as estimates from a sample of the population values (the n − 1 method). There are no
functions for the population variance and standard deviation, although they are easily enough calculated if you need them.
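A minimal sketch of the population versions, obtained by rescaling the sample variance from n − 1 to n. The names pop.var and pop.sd are hypothetical.

```r
# Hypothetical helpers: population variance and standard deviation.
pop.var <- function(x) {
  x <- x[!is.na(x)]          # drop missing values first
  n <- length(x)
  var(x) * (n - 1) / n       # rescale the n - 1 estimate so it divides by n
}
pop.sd <- function(x) sqrt(pop.var(x))

pop.var(faithful$waiting)
pop.sd(faithful$waiting)
```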
You may remember from your elementary stats course a statistic called the standard error of the mean, or SEM.
Technically, SEM is a measure of error and not of variability, but you may need to calculate it nevertheless. There is no R
function for it, so let's write one.
The standard error of the sample mean of "waiting" can be calculated like this...
> sqrt(var(waiting)/length(waiting))
[1] 0.8243164
...the square root of the (variance divided by the sample size). This calculation will choke on missing values, however.
Let's see that by adding a missing value to "waiting"...
> waiting -> wait2
> wait2[100] <- NA
> sqrt(var(wait2)/length(wait2))
[1] NA
Adding the na.rm=T option to var( ), as in sqrt(var(wait2, na.rm=T)/length(wait2)), seems to fix things. But it doesn't. The result isn't quite the correct answer, and that is because length( ) gave us the wrong
sample size. (It counted the case that was missing.) We could try this...
> sqrt(var(wait2, na.rm=T)/sampsize(wait2))
[1] 0.8263412
...but that depends upon the existence of the "sampsize" function. If that somehow gets erased, this calculation will fail.
Here's one of the disadvantages of using R: You have to know what you're doing. Unlike other statistical packages, such as
SPSS, to use R you occasionally have to think. It appears some of that is necessary here...
> sem = function(x) sqrt(var(x, na.rm=T)/(length(x)-sum(is.na(x))))
> sem(wait2)
[1] 0.8263412
> ls()
[1] "faithful" "sampsize" "sem" "wait2"
We now have a perfectly good function for calculating SEM in our workspace, and we will not let it get away!
Before we move on, I should remind you that the summary( ) function will give you quite a bit of the above information
in one go...
> summary(waiting)
Min. 1st Qu. Median Mean 3rd Qu. Max.
43.0 58.0 76.0 70.9 82.0 96.0
It will tell you if there are missing values, too...
> summary(wait2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
43.00 58.00 76.00 70.86 82.00 96.00 1.00
The last thing we want to know (for the time being anyway) about our little mound of data is what its shape is. Is it a nice
mound-shaped mound, or is it deformed in some way, skewed or bimodal or worse? This will involve looking at some sort
of frequency distribution.
The table( ) function will do that as well for a numerical variable as it will for a categorical variable, but the result
may not be pretty (try this with eruptions--I dare you!)...
> table(waiting)
waiting
43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
1 3 5 4 3 5 5 6 5 7 9 6 4 3 4 7 6 4 3 4 3 2 1 1 2 4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
5 1 7 6 8 9 12 15 10 8 13 12 14 10 6 6 2 6 3 6 1 1 2 1 1
Looking at this reveals that we have one value of 43, three values of 45, five values of 46, and so on. Surely there's a better
way! We could try a grouped frequency distribution...
> cut(waiting, breaks=10) -> wait3
> table(wait3)
wait3
(42.9,48.3] (48.3,53.6] (53.6,58.9] (58.9,64.2] (64.2,69.5] (69.5,74.8]
16 28 26 24 9 23
(74.8,80.1] (80.1,85.4] (85.4,90.7] (90.7,96.1]
62 55 23 6
The cut( ) function cuts a numerical variable into class intervals, the number of class intervals given (approximately) by
the breaks= option. (R has a bit of a mind of its own, so if you pick a clumsy number of breaks, R will fix that for you.)
The notation (x,y] means the class interval goes from x to y, with x not being included in the interval and y being included.
Another disadvantage of using R is that it is intended to be utilitarian. The output will be useful but not necessarily pretty.
We can pretty that up a little bit with a trick...
> as.data.frame(table(wait3))
wait3 Freq
1 (42.9,48.3] 16
2 (48.3,53.6] 28
3 (53.6,58.9] 26
4 (58.9,64.2] 24
5 (64.2,69.5] 9
6 (69.5,74.8] 23
7 (74.8,80.1] 62
8 (80.1,85.4] 55
9 (85.4,90.7] 23
10 (90.7,96.1] 6
You can also specify the break points yourself in a vector, if you are that anal retentive...
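A sketch of what that might look like. The object name wait4 is taken from the clean-up command further on; the break points themselves are an assumption chosen for illustration.

```r
# Break points chosen by hand (an illustrative choice): 40, 50, ..., 100.
wait4 <- cut(faithful$waiting, breaks = seq(40, 100, by = 10))
table(wait4)

# right = FALSE closes the intervals on the left instead: [40,50), [50,60), ...
table(cut(faithful$waiting, breaks = seq(40, 100, by = 10), right = FALSE))
```

Be sure the break points cover the whole range of the data. Any value that falls outside them is turned into NA without complaint.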
A stem-and-leaf display, produced by the stem( ) function, shows the shape of the distribution while keeping the individual data values in view...
> stem(waiting)
The decimal point is 1 digit(s) to the right of the |
4 | 3
4 | 55566666777788899999
5 | 00000111111222223333333444444444
5 | 555555666677788889999999
6 | 00000022223334444
6 | 555667899
7 | 00001111123333333444444
7 | 555555556666666667777777777778888888888888889999999999
8 | 000000001111111111111222222222222333333333333334444444444
8 | 55555566666677888888999
9 | 00000012334
9 | 6
> stem(waiting, scale=.5)
The decimal point is 1 digit(s) to the right of the |
4 | 355566666777788899999
5 | 00000111111222223333333444444444555555666677788889999999
6 | 00000022223334444555667899
7 | 00001111123333333444444555555556666666667777777777778888888888888889
8 | 00000000111111111111122222222222233333333333333444444444455555566666
9 | 000000123346
Use the "scale=" option to determine how many lines are in the display. The value of scale= is not actually the number of
lines, so this may take some fiddling. The first of those displays clearly shows the bimodal structure of this variable.
Oftentimes, we are faced with summarizing numerical data by group membership, or as indexed by a factor. The
tapply( ) and by( ) functions are most useful here...
> detach(faithful)
> rm(faithful, wait2, wait3, wait4) # Out with the old...
> as.data.frame(state.x77) -> states
> states$region <- state.region
> head(states)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
region
Alabama South
Alaska West
Arizona West
Arkansas South
California West
Colorado West
The "state.x77" data set in R is a collection of information about the U.S. states (in 1977), but it is in the form of a matrix.
We converted it to a data frame. All the variables are numerical, so we created a factor in the data frame by adding
information about census region, from the built-in data set state.region. Now we can do interesting things like break down
life expectancy by regions of the country. But hold on there! Some very foolish person has named that variable "Life Exp",
with a space in the middle of the name. How do we deal with that?
> names(states)
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder"
[6] "HS Grad" "Frost" "Area" "region"
> by(data=states[4], IND=states[9], FUN=mean)
region: Northeast
[1] 71.26444
------------------------------------------------------------
region: South
[1] 69.70625
------------------------------------------------------------
region: North Central
[1] 71.76667
------------------------------------------------------------
region: West
[1] 71.23462
>
> tapply(X=states[,4], IND=states[,9], FUN=mean)
Northeast South North Central West
71.26444 69.70625 71.76667 71.23462
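That's one way. There are a couple things to notice here. The arguments to by( ) are listed in the default order, so you
don't really have to name them. The same is true of tapply( ). Notice also that you can be a lot more loosey goosey
with the indexing in by( ) than in tapply( ). You only have to give the column number that you want from the data
frame. (That held for the version of R used here. In recent versions, mean( ) returns NA with a warning when it is handed a data frame, so with by( ) the function has to pull the column out of the data frame first.)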
Of course, we could also rename the column and use the new, more sensible name...
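A sketch of the renaming approach. The replacement names Life.Exp and HS.Grad are illustrative choices, and aggregate( ) is shown only as an alternative that works from a formula.

```r
states <- as.data.frame(state.x77)
states$region <- state.region

# Replace the spaces in the two awkward column names (new names are illustrative).
names(states)[names(states) == "Life Exp"] <- "Life.Exp"
names(states)[names(states) == "HS Grad"]  <- "HS.Grad"

# The column can now be reached with $ like any other.
life.by.region <- tapply(states$Life.Exp, states$region, mean)
life.by.region

# aggregate() gives the same means and returns them as a data frame.
aggregate(Life.Exp ~ region, data = states, FUN = mean)
```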
The next time you start R, those functions will be loaded automatically.
On to graphical summaries!
R Tutorials--Objects
OBJECTS
Data.
Your data are the information upon which you wish to do a statistical analysis. By the way, the word "data" is plural, so
ordinarily you would not say "data is" or "data was." Correct are "data are" and "data were." I'm not the grammar police,
but I will nail you on that one!
Maintaining a data set is one of the most important things a statistician needs to know how to do. Most statistical software
requires that the data set be in a very specific format, called a data table or, in R, a data frame (one word or two, take your
pick). Data frames will be covered in detail in a future tutorial.
This is where R truly shines. R is much more flexible in that it does not require that you use the data frame format for your
data. If it is more convenient to keep your data in a contingency table, or a list, or a matrix, or a single vector, you can do
so. This flexibility has a price--more to learn. In the end, however, it makes R a much more convenient way to analyze
data sets, especially simple ones.
In the behavioral and social sciences, the unit of analysis is usually a subject, human or animal. In the more general case,
subjects are called "cases" or "observations" or "experimental units." I prefer cases. There will actually come a time when
we have to distinguish between subjects and cases, so you should not think of these two terms as being exactly equivalent.
Let's say you've collected data from five subjects: Bob, Fred, Barb, Sue, and Jeff. From each subject you have collected
information about age, height, weight, race, year in school (they are all college students), and SAT score. Your cases are
Bob, Fred, Barb, Sue, and Jeff. Age, height, weight, race, year in school, and SAT score are called variables. You would
ordinarily put this information into a data frame as follows:
name age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
Notice that the cases, or subjects, go into rows in this table, and each variable has its own column. This is the standard
form for maintaining a data table (data frame). It looks a lot like a spreadsheet, and in fact, using spreadsheet software is a
very good way to manage data. The first row in this table is called the header. It contains the variable names. Having a
header row is optional but usually a good idea.
I should call your attention to the fact that we have two fundamentally different kinds of variables in this data frame. Some
are numbers, like age and weight. These are called numerical variables. Other variables contain just the names of
categories that the subject falls into. Race is an example of such a variable, called a categorical variable. It's absolutely
essential that you be able to distinguish these two types of variables. You can't do statistics otherwise! Categorical variables
are often called factors in R. Just to make matters a bit more confusing, examine the "year" variable. What would you call
it, numerical or categorical? If those were your only choices, you'd have to call it categorical. In fact, in this variable the
categories have a natural order to them: Fr, So, Jr, Sr. Sometimes such a categorical variable is called an ordered factor in
R.
You may be more familiar with the terms nominal, ordinal, interval, and ratio variables. Nominal variables and categorical
variables are roughly the same thing. Factors are usually nominal. However, ordered factors are ordinal. Numerical
variables are either interval or ratio variables, and it usually doesn't matter which. One more catch to all this--examine the
column labeled "name" in the table above. Is this a variable? I suppose it is since its value is different for everyone.
Usually when we think of categorical variables or factors, we are thinking of variables that have relatively few possible
values. These values are called levels. The levels of year, for example, are Fr, So, Jr, Sr. When a variable has a different
objects.html[27/01/2014 22:19:11]
value for everyone, like the subject's name or address for example, it's often called a character variable. You will see R
make this distinction, and it's a useful one, so remember it.
You get data into R by creating data objects, so let's see how that is done.
Assignment.
In R you create things, called "objects", by a process called assignment. Start an R session and set the working directory to
Rspace. Also, clear the workspace...
> setwd("Rspace") # There is a menu item for this in the GUI, btw.
> rm(list=ls()) # Or use the menus to do this.
If you don't know what this means or have forgotten to create the Rspace directory, you can find out how in the tutorial
called Preliminaries.
There are three ways to assign data to an object name in R (actually four, but one is rarely used). Here is one way...
> x = 7
This SHOULD NOT be read as "x equals 7", which will result in confusion later. Instead, the single equals sign means
"takes the value" or "is assigned the value." R is not usually picky about spacing, so all of the following are equivalent...
> x=7
> x = 7
> x= 7
> x = 7
> x = # Press Enter here.
+ 7 # Press Enter again.
Use spacing to make your typed input look "pretty." Or not. It's (generally) up to you. There are a few situations where R
will get uppity about spacing, but usually it is not an issue. DON'T, however, be so silly as to put a space in the middle of
the name of something. That would be bad!
Two things to note here. First, R is perfectly willing to let you be stupid and overwrite things you have in your workspace.
There is no warning. If you assign something to an object name that already exists, the old object is gone! Second, the
arrow assignment works from either direction. The equal sign does not! When using =, you must give the object name first
followed by the value you wish to give it.
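The assignment forms can be compared side by side in a short sketch. Treating assign( ) as the rarely used fourth way is an assumption here.

```r
x = 7             # equals sign: object name on the left, value on the right
x <- 10           # left arrow; silently overwrites the old value of x
12 -> x           # right arrow: value first, then the object name
assign("x", 15)   # assign() by quoted name (assumed to be the rarely used way)
x
```

Each line replaces the previous value of x without any warning, which is the first of the two points made above.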
Objects.
vectors
lists
arrays
matrices
tables
data frames
Some of these are more important than others. And there are more, but these are the ones we need to know about for now.
Let's begin at the beginning.
R doesn't care much what you name things, whether they are variables or complete data objects. As noted in the last
tutorial, however, DO NOT put spaces or dashes in your names. Thus, all of these are acceptable (and different) object or
variable names:
x
X
x2
x.2
x_2
myData
MyData
my_data
my.data
my.data.from.the.learning.experiment
fred
Fred
FRED
Rutherford.B.Hayes
Be creative! But if you make your object names too long, you'll be sorry, because you'll be typing them a lot! Another
warning: It is generally safest to confine yourself to letters, numbers, dots, and underline characters and to start your
variable names with letters. Also, try to avoid using names that are also functions in R, like "mean" for example, although
R will usually work around this. The only names I would warn you against are T and F. Avoid those as variable names
because, as we will see later, R uses them to mean true and false. If you assign them another value, that could cause
trouble.
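A short sketch of the kind of trouble meant here. T and F are ordinary objects that can be overwritten, while TRUE and FALSE are reserved words that cannot.

```r
T <- 0                  # legal: T is an ordinary variable, not a reserved word
shadowed <- isTRUE(T)   # FALSE now, so any code that relies on T misbehaves
rm(T)                   # remove our copy; the built-in T is visible again
restored <- isTRUE(T)
c(shadowed, restored)
```

Writing out TRUE and FALSE in full in your own code sidesteps the problem entirely.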
Remember, R has a large number of built-in data objects. Some of them will be used below to illustrate the various kinds
of R data objects. For example, here is a data object containing the lengths of major North American rivers (in miles)...
> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315
[15] 870 906 202 329 290 1000 600 505 1450 840 1243 890 350 407
[29] 286 280 525 720 390 250 327 230 265 850 210 630 260 230
[43] 360 730 600 306 390 420 291 710 340 217 281 352 259 250
[57] 470 680 570 350 300 560 900 625 332 2348 1171 3710 2315 2533
[71] 780 280 410 460 260 255 431 350 760 618 338 981 1306 500
[85] 696 605 250 411 1054 735 233 435 490 310 460 383 375 1270
[99] 545 445 1885 380 300 380 377 425 276 210 800 420 350 360
[113] 538 1100 1205 314 237 610 360 540 1038 424 310 300 444 301
[127] 268 620 215 652 900 525 246 360 529 500 720 270 430 671
[141] 1770
(The output on your screen may be slightly different, depending upon how wide you have your R Console window set to.)
In this R output, everything is numbered, but only the number of the first item on each output line is printed. Thus, the
value 1205 (third line from the bottom, three items in--may be different on your screen) is item number 115 in this output.
These index numbers are NOT PART OF THE DATA ITSELF! This will be made clearer in the following section. The
object "rivers" is a vector, so...
Vectors.
One kind of vector consists of numbers, as was the case just above for the vector "rivers". This is called a numerical
vector, cleverly enough. Any item in this vector can be addressed by using its index number...
> rivers[10:20]
[1] 600 330 336 280 315 870 906 202 329 290 1000
In R, a colon has two meanings. This is one of them. When two numbers are separated by a colon, it means "to" as in "10
to 20". Try this...
> 10:20 # Output not shown.
Since no function is specified to operate on these numbers, R assumes you meant print(10:20). If you want to see
items 18, 104, and 168, do this...
> rivers[c(18,104,168)]
[1] 329 380 NA
"NA" means not available, or missing. The "rivers" vector is only 141 items long, so you just asked for something that
doesn't exist. The point is, to see specific items within a vector, enter a vector of index numbers inside the square brackets.
You can also use relational operators (about which more later) to pick out certain items from a vector. If you just want to
see the data values greater than 500, do this...
> rivers[rivers > 500]
[1] 735 524 1459 600 870 906 1000 600 505 1450 840 1243 890 525 720
[16] 850 630 730 600 710 680 570 560 900 625 2348 1171 3710 2315 2533
[31] 780 760 618 981 1306 696 605 1054 735 1270 545 1885 800 538 1100
[46] 1205 610 540 1038 620 652 900 525 529 720 671 1770
I will tell you how to find out which rivers those are in a later tutorial.
To create a vector, use the c( ) function (short for concatenate, or combine)...
> x = c(12, 14, 15, 17, 19, 8, 10)
> x
[1] 12 14 15 17 19 8 10
Once again, R isn't picky about spacing. None of the spaces in the above command need to be there. Or you can put more
in if you like. I won't mention this again. I assume if you get curious about some special case, you will experiment and find
the answer for yourself.
If the values you wish to enter into a vector are consecutive, then this is sufficient:
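A minimal sketch of the shortcut, using the colon operator already seen in the indexing examples (the values 1 to 10 are arbitrary):

```r
x <- 1:10                                # consecutive integers, no c() needed
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)    # the long way
all(x == y)                              # same values either way
```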
Vectors can also contain words or character values. When you enter these values, they must be in double or single quotes...
> x = c("Bob","Carol","Ted","Alice")
> x
[1] "Bob" "Carol" "Ted" "Alice"
Two vectors can also be concatenated into one with the concatenate function as follows...
> y = c("John","Joy","Fred","Frances")
> z = c(x, y)
> z
[1] "Bob" "Carol" "Ted" "Alice" "John" "Joy" "Fred"
[8] "Frances"
What would have happened if, instead, you had done this?
> z = c("x", "y")
> z
It's worth finding out, so don't just sit there wondering. Type! One thing I had a bit of trouble getting used to in R is when
to put things in quotes and when not to. The basic rule is: If it's an already defined object, don't quote it. If you want to
refer to the values inside already existing x and y vectors, don't quote. If it's a new character value (i.e., a string--someone's
or something's name), use quotes. R assumes anything not in quotes is an object name (an already defined vector, list,
dataframe, etc.), and it will hunt for that object in the search path. If it doesn't find it, you will be told so...
> Joy # Print out object Joy.
Error: object "Joy" not found
> "Joy" # Print out "Joy".
[1] "Joy"
> y[2] # Print out the second value in vector y.
[1] "Joy"
> Joy = 6 # Create a new object named Joy.
> Joy
[1] 6
Now do this...
> islands # Only first four lines of output shown.
Africa Antarctica Asia Australia
11506 5500 16988 2968
Axel Heiberg Baffin Banks Borneo
16 184 23 280
...
Confusing, right? You'll get used to it. This is a helpful example to study and play around with.
The vector "x" now contains the names of the actors in the movie "Bob and Carol, Ted and Alice." The names( )
function was used to label these values with the names of the characters they played in the movie. Then we used the name
of the character to retrieve the name of the actor. Dyan Cannon could also have been referred to as x[4]. Try it. (I have a
very funny story about this movie, but this is not the place for it!)
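The steps just described can be sketched as follows. The cast list is filled in from the film's credits, and the exact commands are an assumption.

```r
# Actors, in the order Bob, Carol, Ted, Alice.
x <- c("Robert Culp", "Natalie Wood", "Elliott Gould", "Dyan Cannon")

# Label each value with the name of the character the actor played.
names(x) <- c("Bob", "Carol", "Ted", "Alice")
x

x["Alice"]   # retrieve by name...
x[4]         # ...or by position; both give the same item
```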
In the "islands" vector, the data values are the size of the land mass in thousands of square miles. Each data value is named
with the name of the land mass. Thus, to retrieve the area of Cuba, we do not need to know which of the data values is
Cuba. We can retrieve the value by name. The name is put inside of square brackets just as if it were an index number, and
it is quoted...
> islands["Cuba"]
Cuba
43
Cuba has a land area of 43,000 square miles. Suppose you wanted to work with this data vector, but you wanted the land
areas in square kilometers instead of square miles. The following procedure will allow this. First, use the data( )
function to write a copy of "islands" to your workspace. Then do the conversion. The converted values can either be stored
back into the "islands" vector, in which case the old values are overwritten, or they can be stored into a new vector with a
new name...
> data(islands) # writes a copy to your workspace
> ls()
[1] "Alice" "islands" "Joy" "x" "y" "z"
> km_islands <- islands * 2.59 # probably the best way
> km_islands["Cuba"]
Cuba
111.37
> islands <- islands * 2.59 # overwrites the original data values
> islands["Cuba"]
Cuba
111.37
And finally...
> ls()
[1] "Alice" "islands" "Joy" "km_islands" "x"
[6] "y" "z"
> rm(list=ls()) # clean up!
> ls()
character(0)
Vectors are used a lot in R. You should take some time to understand them.
Lists.
Lists are collections of other R objects collected into one place. To create a list, use the list( ) function...
> x=1:10 # a vector
> y=matrix(1:12,nrow=3) # a matrix
> z="Bill" # a character variable
> my.list=list(x,y,z) # create the list
> my.list # view the list
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
[[3]]
[1] "Bill"
The output of a lot of R functions is actually composed of lists. Notice that items in a list are indexed by values inside
double brackets. Thus...
> my.list[[3]] # The third item in my.list.
[1] "Bill"
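In R, the $ is used for list indexing. That is, it allows you to pull elements out of lists by name, provided the items in the list have been given names. Name the items, then type the name of the
list, followed by $, followed by the name of the item in the list. For example...
> names(my.list) = c("my.vector", "my.matrix", "my.name") # illustrative item names
> my.list$my.name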
[1] "Bill"
Kinda trivial in this case, but it won't be when you have a much longer list. That's enough on lists for now.
> ls()
[1] "my.list" "x" "y" "z"
> rm(my.list,x,y,z) # Don't forget to clean up!
There is one more thing you should remember about lists. Data frames are actually lists. In fact, this is probably the most
important thing you need to remember about lists!
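A quick way to convince yourself, sketched with the built-in data frame "women":

```r
is.list(women)     # a data frame is a list of equal-length columns
names(women)       # the list item names are the column names
women$height       # $ pulls out a column, as with any named list
identical(women$height, women[["height"]])   # double brackets work too
length(women)      # number of list items, i.e., number of columns
```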
Matrices and Arrays.
Essentially, matrices and arrays are both table-like objects. You saw how to create a matrix in the last section on lists. That's really
enough for now. Except maybe for extracting values from one. The syntax is my.matrix[row,col], as follows...
> y = matrix(1:16, nrow=4) # First we need a matrix! With 4 rows.
> class(y) # y is an object of class "matrix"
[1] "matrix"
> y
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> y[3,2]
[1] 7
Remember this! Always put the row index first followed by the column index, and always put the indexes inside of square
brackets. The matrix( ) function, which is used to create a matrix, takes a vector as its argument, and then the option
"nrow=" tells how many rows to break the vector into. The matrix is filled "down the columns" first, although there is
another option that will change this behavior. Notice our matrix has no row names or column names. The notation [1,]
means "row one, all columns". To recall an entire row or an entire column of a matrix (or an array or a table), do this...
> y[1,] # all values in row 1
[1] 1 5 9 13
> y[,3] # all values in column 3
[1] 9 10 11 12
More later on matrices, including how to name the rows and columns.
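As a preview, here is a sketch of the option that changes the fill order (byrow=), along with one way to attach row and column names. The names used are made up for illustration.

```r
m <- matrix(1:12, nrow = 3, byrow = TRUE)   # fill across the rows instead
rownames(m) <- c("r1", "r2", "r3")          # illustrative row names
colnames(m) <- c("a", "b", "c", "d")        # illustrative column names
m
m["r2", "c"]                                # index by name: row r2, column c
```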
An array is like a matrix, except it can have more than two dimensions. In other words, a matrix is just a two-dimensional
array.
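A small sketch of a three-dimensional array, with dimensions picked arbitrarily for illustration:

```r
a <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
dim(a)
a[2, 3, 4]                           # one index per dimension
a[, , 1]                             # the first layer, all rows and columns
is.matrix(a[, , 1])                  # a single layer is an ordinary matrix
```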
Tables.
If the function to create a matrix is matrix( ), and the function to create an array is array( ), I bet you can guess
what function is used to create a table. It's used quite a bit differently, however. The table( ) function is used to create
frequency tables or crosstabulations from raw data contained in a vector or a data frame. The result is something that looks,
in many cases, very much like a matrix or an array, and behaves very much like one as well. For now, we will confine
ourselves to one relatively simple example. First, we have to create some raw data...
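A minimal sketch of this step, assuming the raw data are random normal values rounded to whole numbers. The sample size, mean, standard deviation, and seed are all illustrative assumptions, so the numbers will not match the ones discussed in the text.

```r
set.seed(42)                                  # arbitrary seed, for repeatability
y <- round(rnorm(200, mean = 100, sd = 15))   # 200 made-up scores
table(y)                                      # data values on top, counts below
```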
The top row of numbers contains the data values, which we can see range from 64 to 129, and the bottom row of numbers
gives the frequencies. The data value (i.e., y-value) of 100, for example, occurs 6 times in the data vector. (Once again,
your result will be different.) Tables, of course, just like everything else in R, can be stored and then used for further
analysis...
> table(y) -> myTable # Store it.
> barplot(myTable)
> ls()
[1] "myTable" "y"
> rm(myTable, y) # And remember to clean up.
This table is (was!) one-dimensional. The "HairEyeColor" object we were playing with previously was a multidimensional
table of frequencies, also called a crosstabulation.
Data Frames.
Data frames are so important that I will devote an entire tutorial just to them. For now, if you want to see one, try this...
> women
The basic structure of a data frame is illustrated here. It's basically a table (in fact, it's a list of column vectors) in which
each variable goes in its own column and each case goes in its own row.
Usually, data frames are read into the R workspace from external files, which may have been created using a spreadsheet.
Small ones can be typed in at the command line, however. Let's use the data at the beginning of this tutorial to see how
that would work.
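A sketch of one way to type it in, using the data.frame( ) function with one vector per variable. The object name my.data and the decision to make "year" an ordered factor are assumptions.

```r
my.data <- data.frame(
  name = c("Bob", "Fred", "Barb", "Sue", "Jeff"),
  age  = c(21, 18, 18, 24, 20),
  hgt  = c(70, 67, 64, 66, 72),
  wgt  = c(180, 156, 128, 118, 202),
  race = factor(c("Cauc", "Af.Am", "Af.Am", "Cauc", "Asian")),
  year = factor(c("Jr", "Fr", "Fr", "Sr", "So"),
                levels = c("Fr", "So", "Jr", "Sr"), ordered = TRUE),
  SAT  = c(1080, 1210, 840, 1340, 880),
  stringsAsFactors = FALSE     # keep "name" as a character variable
)
my.data
str(my.data)                   # shows the type of each column
```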
Last Word.
Further details as needed on these data objects will be covered in future tutorials. For now, you should get the general idea.
R Tutorials--Resampling Techniques
RESAMPLING TECHNIQUES
Caveat emptor
Resampling techniques depend upon repeated (re)randomization or simulation of a sample. Computers do not generate
random numbers. Since an algorithm is used to produce the results of a function like runif( ), these results are technically
referred to as pseudorandom. I've done a few casual tests of the R random number generator, and it seems to be very good,
but I'm not an expert on pseudorandom number generators. So I will begin with a warning: computer intensive resampling
is only as good as your pseudorandom number generator.
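One practical consequence of pseudorandomness is that a simulation can be repeated exactly by setting the seed. A short sketch (the seed value is arbitrary):

```r
set.seed(2014)        # any integer works; 2014 is arbitrary
first <- runif(3)
set.seed(2014)        # same seed, same "random" numbers
second <- runif(3)
identical(first, second)
RNGkind()[1]          # name of the generator currently in use
```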
Although the term "resampling" is often used to refer to any repeated random or pseudorandom sampling simulation, when
the "resampling" is done from a known theoretical distribution, the correct term is "Monte Carlo" simulation. I will use
such a scheme here to demonstrate how power can be estimated for the two-sample t-test using Student's sleep data...
> data(sleep)
> str(sleep)
'data.frame': 20 obs. of 2 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
> attach(sleep)
> tapply(extra, group, mean)
1 2
0.75 2.33
> tapply(extra, group, sd)
1 2
1.789010 2.002249
> tapply(extra, group, length)
1 2
10 10
> t.test(extra~group)
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
> power.t.test(n=10, delta=(2.33-.75), sd=1.9, sig.level=.05,
+ type="two.sample", alternative="two.sided")
Two-sample t test power calculation
n = 10
delta = 1.58
sd = 1.9
sig.level = 0.05
power = 0.4208457
alternative = two.sided
NOTE: n is number in *each* group
resample.html[27/01/2014 22:19:37]
There is the traditional analysis. R claims this t-test has a power of 0.42.
Supposedly, we have a 42% chance of finding a two-tailed significant difference between two samples of 10 chosen from
normal populations with a common standard deviation of 1.9 (the pooled sd of the two samples) but with a true difference
in means of 1.58. To do the same calculation by simulation, we will use the rnorm( ) function to draw the samples, the
t.test( ) function to get a p-value, and then we will simply look to see what percentage of those p-values are less than
alpha=.05 after running this procedure a goodly number of times. But first, a couple of explanations.
What we are about to do is, perhaps, best done with a script. That way, if something goes wrong, we don't have to re-enter
multiple lines into the R console. However, we will live dangerously for the sake of illustration. To get this to work with
minimal effort on our part--always a good thing--we will need to get R to do the same set of calculations many times. This
is done with a loop, and in R, loops are most often done using for( ). The syntax is "for (counter in 1:R)", where R is the number of
simulations. The statements to be looped through repeatedly then follow enclosed in curly braces. I feel a bit like I'm
trying to describe an elephant to someone who's never seen one before. Perhaps it would be best to just show you the
elephant...
> R = 999
> alpha = numeric(R)
> for (i in 1:R) {
+ group1 = rnorm(10, mean=.75, sd=1.9)
+ group2 = rnorm(10, mean=2.33, sd=1.9)
+ alpha[i] = t.test(group1,group2)$p.value
+ }
The first line defines the number of random simulations we want and stores this value in the semi-official data object that
R typically uses for such things, namely R. We've done 999 sims, because that's the number that is normally done. The
second line creates a numeric vector with 999 elements, which will hold the results of each simulation. The third line
begins our for loop: "for i (the counter) going from 1 to 999", do the following stuff. I.e., do the following stuff 999 times,
keeping track by incrementing the counter, i, at each step. Group 1 is generated. Group 2 is generated. The t-test is done,
and the p-value is extracted and stored in "alpha", our designated storage place. In each simulation, the storage occurs in
the ith position of the alpha vector. Now all that remains is to determine how many of those values are less than .05 ...
> mean(alpha<.05)
[1] 0.3983984
To do this, we played a nice little trick. Remember, "alpha<.05" generates a logical vector of TRUEs and FALSEs, in
which TRUEs count as ones and FALSEs count as zeros. We took the mean of that vector, which is to say, we calculated
by a bit of trickery the proportion of TRUEs. Our simulation tells us we have about a 40% chance of rejecting the null
hypothesis under these conditions, pretty close to what we found above. Your results will differ, of course, because we are
using a randomizing procedure after all.
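Since this kind of thing is best done with a script, here is a sketch of the same simulation in script form. It assumes the same conditions as above (n = 10 per group, means of .75 and 2.33, a common sd of 1.9, alpha = .05). The seed value and the names p.values, power.est, and se.est are my own choices, and replicate( ) is swapped in for the for( ) loop.

```r
# Monte Carlo estimate of power for the two-sample t-test.
set.seed(2014)                     # arbitrary seed, so the run can be repeated
R = 999
p.values = replicate(R, {
    group1 = rnorm(10, mean=.75, sd=1.9)
    group2 = rnorm(10, mean=2.33, sd=1.9)
    t.test(group1, group2)$p.value # Welch t-test, as in the loop above
})
power.est = mean(p.values < .05)                  # proportion of rejections
se.est = sqrt(power.est * (1 - power.est) / R)    # Monte Carlo standard error
power.est
se.est
```

The standard error gives a rough idea of how far the simulated power can be expected to wander from one run to the next.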
Permutation Tests
Oops! I forgot to clean up. So I will use the sleep data to illustrate our next topic as well, which is permutation tests. If you
were more on the ball than I am and detached the "sleep" data frame, reattach it now.
The logic behind a permutation test is straightforward. It says, "Okay, these are the twenty (in this example) scores we got,
and the way they are divided up between the two groups here is one possible permutation of them. There are, in fact...
> choose(20,10)
[1] 184756
...possible permutations (technically combinations--but same idea) of these data into two groups of ten, and each is equally
likely to occur if we were to pick one at random out of a hat. Most of these permutations would give no or little difference
between the group means, but a few of them would give large differences. How extreme is the obtained case?" In other
words, the logic of the permutation test is quite similar to the logic of the t-test. If the obtained case is in the most extreme
5% of possible results, then we reject the null hypothesis of no difference between the means (assuming we were looking at
differences between means). The advantage of the permutation test is it does not make any assumption about normality,
and in fact, doesn't make any assumption at all about parent distributions. Furthermore, a permutation test is generally more
powerful than a "traditional" nonparametric test.
The disadvantage of a permutation test is the number of permutations that must be generated. Even small samples, as we
have here, will choke a typical desktop computer with repetitive calculations. Therefore, instead of generating all possible
permutations, we generate only a random sample of them. This procedure is often called a "randomization test" instead of a permutation test.
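The original listing for this step is missing from these notes. What follows is a sketch of how such a randomization test could be run on the sleep data, not the author's code. The names scores, t.obt, t.values, and p.rand, and the seed, are assumptions.

```r
# Randomization test on the sleep data: shuffle the 20 scores into two
# groups of 10 many times and see how often t is as extreme as the obtained t.
set.seed(1)                                  # arbitrary seed
scores = sleep$extra                         # all 20 scores, pooled
t.obt = t.test(extra ~ group, data=sleep)$statistic   # obtained t, -1.8608
R = 999
t.values = numeric(R)
for (i in 1:R) {
    index = sample(1:20, size=10, replace=FALSE)   # 10 scores for "group 1"
    group1 = scores[index]
    group2 = scores[-index]                        # the other 10 are "group 2"
    t.values[i] = t.test(group1, group2)$statistic
}
# Two-tailed p-value: proportion of shuffles at least as extreme as obtained.
p.rand = mean(abs(t.values) >= abs(t.obt))
p.rand
```

Sampling is done WITHOUT replacement, so every shuffle uses each of the twenty scores exactly once.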
The resampling scheme tells us that a little more than 9% of the sims gave us a result as extreme or more extreme than the
obtained case. This is the p-value resulting from the randomization test. We can see that it is pretty close to the p-value
obtained in the actual t-test above (.0794). The difference may be nothing more than random noise (the standard error is
.009), or we may be seeing that the t-test was a bit too generous here, perhaps due to nonnormality.
Bootstrap Resampling
Bootstrap resampling is similar to the above randomization procedure, except the resampling is done WITH replacement.
The idea is to consider the pooled sample to be a mini-population of scores. Presumably, we were sampling from a larger
parent distribution, but we know nothing about it for certain. The only information we have about the parent distribution is
the sample we've obtained from it. Therefore, we will calculate a sampling distribution for the test statistic by resampling
from our mini-population WITH replacement, calculating the desired statistic, and taking a look at what we get...
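The listing that belongs here was lost, so here is a sketch of the idea under the same setup as the randomization test. The only change is replace=TRUE, with each group drawn independently from the pooled scores. The object names and the seed are assumptions.

```r
# Bootstrap version: both groups are drawn WITH replacement from the
# pooled sample, which stands in for the unknown parent population.
set.seed(1)                                  # arbitrary seed
scores = sleep$extra
t.obt = t.test(extra ~ group, data=sleep)$statistic
R = 999
t.star = numeric(R)
for (i in 1:R) {
    group1 = sample(scores, size=10, replace=TRUE)
    group2 = sample(scores, size=10, replace=TRUE)
    t.star[i] = t.test(group1, group2)$statistic
}
p.boot = mean(abs(t.star) >= abs(t.obt))     # two-tailed bootstrap p-value
p.boot
```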
Note: R contains a library called "boot", the main function within which is boot( ). Supposedly, this automates the whole
resampling procedure. However, I have found it almost impossible to use, and certainly more difficult than the above
methods. All the examples I've seen are much more complex than the problems above and (I might add) poorly explained.
I've yet to get the function to produce a meaningful result. When I do, there will be a revision here. If you wish to give it a
try, read the help page first, then you might try Canty (2002) and Rizzo (2008). And good luck to you. I hope you have
better luck than I have!
Bootstrap resampling is useful for estimating confidence intervals from samples when the sample is from an unknown (and
clearly nonnormal) distribution. The data set "crabs" in the MASS package supplies an example. We will look at the
carapace length of blue crabs...
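The code for this example is missing from these notes. As a sketch under that caveat, a bootstrap confidence interval for the mean carapace length could be built by hand like this. The vector name cara matches the later output, but boot.means, boot.sample, and the seed are assumptions.

```r
library(MASS)                          # MASS supplies the "crabs" data
cara = crabs$CL[crabs$sp == "B"]       # carapace lengths of the 100 blue crabs
set.seed(1)                            # arbitrary seed
R = 999
boot.means = numeric(R)
for (i in 1:R) {
    # Resample the whole vector WITH replacement, same size as the original.
    boot.sample = sample(cara, size=length(cara), replace=TRUE)
    boot.means[i] = mean(boot.sample)
}
quantile(boot.means, c(.025, .975))    # percentile 95% CI for the mean
```

The middle 95% of the bootstrapped means serves as the interval. It can be compared with the normal-theory interval computed from the standard error.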
> mean(cara)-1.96*sd(cara)/sqrt(length(cara))
[1] 28.70507
> mean(cara)+1.96*sd(cara)/sqrt(length(cara))
[1] 31.41093
I finally found a more-or-less sensible explanation of the boot( ) function in Crawley (2007). I can't claim anything near
"complete understanding", but I'm working on it. Meanwhile, here is a heavily commented example showing how to
bootstrap a confidence interval...
> library(boot)
> ### The syntax is: boot(data= , statistic= , R= )
> ### Send it the name of a vector or data frame for "data". "R" is the
> ### number of replications you want. The tricky part is "statistic=",
> ### where you have to send it a function you've written. If you're
> ### going to get fancy, you might want to read the Writing Your Own
> ### Functions tutorial first.
> data(crabs, package="MASS")
> ### Not necessary if you haven't yet cleaned up from the previous section.
> cara = crabs$CL[crabs$sp=="B"]
> ### The carapace lengths of 100 blue crabs. See previous section.
> ### And now we have to write a function to calculate the means of these...
> the.means = function(cara, i) {mean(cara[i])}
> ### Finally, we run the bootstrap...
> boot(data=cara, statistic=the.means, R=999)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = cara, statistic = the.means, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 30.058 0.02926827 0.6820252
You MUST write your own function to calculate the statistic you are bootstrapping, even if that function is already built
into R. If it is a built-in function, then writing your own function is a simple matter, as you witnessed above. You must
also include the index i: on each replication, boot( ) passes your function a vector of randomly drawn indices, and
cara[i] is the resulting resample. I'm led to believe this makes the boot( ) function much more versatile. Finally, the
output is interpreted as follows. The value "original" is the mean of the original data vector "cara". The value "bias" is
the mean of the bootstrapped values for this statistic minus "original". Clearly, "bias" ought to be close to zero. The
value "std. error" is the standard deviation of the bootstrapped means. To get a confidence interval, the output of boot( )
should be stored...
> boot(data=cara, statistic=the.means, R=999) -> boot.out
> quantile(boot.out$t, c(.025,.975))
2.5% 97.5%
28.73275 31.42255
The result is similar, but not identical, to the result we got in the previous section.
Bootstrapping is more usefully applied to statistics like the median, for which there is no formula for CIs when the
distribution is not normal...
Call:
boot(data = cara, statistic = the.medians, R = 999)
Bootstrap Statistics :
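The commands that produced this output were lost. A sketch of what they might have looked like follows. The function name the.medians comes from the Call shown above, but its body, the seed, and the use of boot.ci( ) are my assumptions.

```r
library(boot)
library(MASS)                              # for the "crabs" data
cara = crabs$CL[crabs$sp == "B"]
# The statistic function: x is the data, i the indices boot( ) supplies.
the.medians = function(x, i) median(x[i])
set.seed(1)                                # arbitrary seed
boot.out = boot(data=cara, statistic=the.medians, R=999)
boot.out
quantile(boot.out$t, c(.025, .975))        # percentile CI by hand
boot.ci(boot.out, conf=.95, type="perc")   # percentile CI from boot.ci( )
```

Note that boot.ci( ) interpolates when it picks the percentile endpoints, so its interval may differ slightly from the quantile( ) result.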
Let's do something a little more complex than a confidence interval! Yesterday, I wrote the tutorial on Oneway ANOVA,
and in that, I used the data frame "InsectSprays" for an example. However, "InsectSprays" has some severe problems with
normality and homogeneity of variance, which means the theoretical F-distribution may not apply. How can we get a
sampling distribution that does apply? First, let's use Monte Carlo simulation to reproduce the theoretical distribution...
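The simulation script itself is missing here. The following is a sketch of one way to do it, assuming six groups of twelve drawn from a standard normal population with the null hypothesis true. It uses fewer replicates than the run described below so that it finishes quickly, and the names spray and y and the seed are assumptions.

```r
# Monte Carlo simulation of the F-distribution with df1=5 and df2=66.
set.seed(1)                                # arbitrary seed
R = 2000                                   # raise this for a smoother histogram
spray = gl(6, 12, labels=LETTERS[1:6])     # six groups of twelve cases each
Fstar = numeric(R)
for (i in 1:R) {
    y = rnorm(72)                          # equal means, equal variances, normal
    Fstar[i] = oneway.test(y ~ spray, var.equal=TRUE)$statistic
}
quantile(Fstar, .95)     # simulated critical value
qf(.95, 5, 66)           # theoretical critical value, for comparison
```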
Go get a snack here! Well, that took a while, didn't it? "Fstar" should now contain our simulated F-distribution for df1=5
and df2=66 degrees of freedom...
> hist(Fstar, prob=T)
> x=seq(.25,5.25,.5)
> points(x,y=df(x,5,66),type="b",col="red")
Now let's do the same, except this time we will use the InsectSprays data to get the Fstar-distribution. We will center all
our groups on the same mean (zero), but we will leave the variance and shape of the individual group distributions
undisturbed (copy and paste this script into R)...
rm(list=ls())                    # start clean (this deletes ALL objects in the workspace)
data(InsectSprays)
### Group means, used to center each group on zero (the null hypothesis).
meanstar = with(InsectSprays, tapply(count,spray,mean))
grpA = InsectSprays$count[InsectSprays$spray=="A"] - meanstar[1]
grpB = InsectSprays$count[InsectSprays$spray=="B"] - meanstar[2]
grpC = InsectSprays$count[InsectSprays$spray=="C"] - meanstar[3]
grpD = InsectSprays$count[InsectSprays$spray=="D"] - meanstar[4]
grpE = InsectSprays$count[InsectSprays$spray=="E"] - meanstar[5]
grpF = InsectSprays$count[InsectSprays$spray=="F"] - meanstar[6]
simspray = InsectSprays$spray    # the original group labels, 12 per spray
R = 10000                        # number of bootstrap replicates
Fstar = numeric(R)               # storage for the bootstrapped F values
for (i in 1:R) {
    ### Resample each centered group separately, WITH replacement.
    groupA = sample(grpA, size=12, replace=TRUE)
    groupB = sample(grpB, size=12, replace=TRUE)
    groupC = sample(grpC, size=12, replace=TRUE)
    groupD = sample(grpD, size=12, replace=TRUE)
    groupE = sample(grpE, size=12, replace=TRUE)
    groupF = sample(grpF, size=12, replace=TRUE)
    simcount = c(groupA,groupB,groupC,groupD,groupE,groupF)
    simdata = data.frame(simcount,simspray)
    ### Oneway ANOVA on the resampled data; keep only the F statistic.
    Fstar[i] = oneway.test(simcount~simspray, var.equal=TRUE, data=simdata)$statistic
}
Wait for it! (That took about a minute and 20 seconds on my computer.) We now have a bootstrapped "F"-distribution in
"Fstar" based on equal means (the null hypothesis), but normality and homogeneity are no longer assumed. Let's see what
it looks like...
> max(Fstar)
[1] 10.54805
> hist(Fstar, breaks=seq(0,11,.5), ylim=c(0,.7), prob=T)
> x=seq(.25,6.75,.5)
> points(x,y=df(x,5,66),type="b",col="red")
It's a bit different from the theoretical distribution in that it appears to be heavier in the tails...
> qf(.95,5,66)
[1] 2.353809
> quantile(Fstar,.95)
95%
2.790884
The critical value of F(5,66) at alpha=.05 is 2.35, but from the bootstrapped Fstar distribution it appears to be 2.79. Either
way, Fobt=34.7 allows the null to be rejected. It might have been a different story though if the effect had been a marginal
one.
If there is an easier way to do that using the boot( ) function, I'd love to hear about it. I can't get it to work!
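For what it's worth, here is a sketch of one possible way to do it with boot( ), using its strata= argument so that resampling happens within each spray group. This is my own attempt rather than anything from the sources cited above. The names centered, F.stat, and boot.F and the seed are assumptions, and R is kept small so that it runs quickly.

```r
library(boot)
# Center each group on zero so that the null hypothesis (equal means) holds.
centered = InsectSprays
centered$count = centered$count - ave(centered$count, centered$spray)
# The statistic: F from a oneway ANOVA on the rows selected by the indices i.
F.stat = function(d, i) {
    oneway.test(count ~ spray, data=d[i, ], var.equal=TRUE)$statistic
}
set.seed(1)                                # arbitrary seed
# strata= tells boot( ) to resample within each spray group separately,
# so every replicate still has 12 cases per group.
boot.F = boot(data=centered, statistic=F.stat, R=999, strata=centered$spray)
quantile(boot.F$t, .95)                    # bootstrapped critical value
```

The "original" value that boot( ) reports is the F for the centered data, which is essentially zero by construction. Only the bootstrapped values in boot.F$t are of interest here.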
R Tutorials--Simple Data Entry
A Couple Tips
One reason people don't like command line programs is because, if you make a mistake in typing a long command, you
have to start all over from scratch. Not so in R. Suppose you were trying to set your working directory to "Rspace", and
you accidentally typed this...
If you continue pressing the up arrow key, R will bring older and older commands to the command line. Thus, if you did
something five commands ago, and you want to do it again, press the up arrow key five times to recall the command, then
press Enter.
Here's another tip, and one you might be a bit miffed I didn't tell you earlier. You can copy and paste stuff into R. For
example, suppose I told you to execute the following command...
Note to my Mac friends: On the Mac keyboard the shortcuts for copy and paste are Command-C and Command-V,
respectively. On older Mac keyboards, the Command key is the one to the left of the space bar with the little flowery thing
on it.
Now you know. Of course, you will have to type your own command eventually.
Creating a Vector
Using built-in data objects is fine and dandy for demonstration purposes, but eventually you're going to want to enter and
analyze your own data. If the data set is small, you can do this easily from within R. The following data were collected by
a student doing his senior research project here at CCU. The numbers represent number of items recalled correctly on a
digit span task, supposedly a measure of short term memory. The explanatory variable ("IV") was whether or not the
subject admitted to regularly smoking marijuana.
smokers     16 20 14 21 20 18 13 15 17 18
nonsmokers  18 22 21 17 20 17 23 20 22 21
dataentry.html[27/01/2014 22:16:01]
It might seem a little silly to go to the trouble of formally entering such a small data set into a data frame or a spreadsheet
and then reading it into R, when the whole thing can be typed into an R console session in just a few seconds. The thing
you need to realize is that all these scores are ON THE SAME VARIABLE, the response variable, and therefore, they need
to go into the same data object or vector. So...
> scores = c(16,20,14,21,20,18,13,15,17,18,18,22,21,17,20,17,23,20,22,21)
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
> summary(scores)
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 17.00 19.00 18.65 21.00 23.00
The scores have been entered into a vector using the c( ) function. Since that was an assignment statement, it wrote
nothing to the screen. Then we asked to see the scores, a good check, since (confession) it took me three tries to get the
scores typed in correctly. (ALWAYS double check your data entry!) Then the summary( ) function was used to produce
a preliminary descriptive summary.
That's probably the most annoying way to get data into a vector--all those commas! So here is a more convenient way
when typing data at the keyboard. First, remove the "scores" vector. Then recreate it using scan( ). The scan( )
function allows you to type in numbers one at a time, hitting Enter after each one, rather than putting commas between
them...
> rm(scores)
> scores <- scan()
1: 16 # press Enter
2: 20 # press Enter
3: 14 # etc.
4: 21
5: 20
6: 18
7: 13
8: 15
9: 17
10: 18
11: 18
12: 22
13: 21
14: 17
15: 20
16: 17
17: 23
18: 20
19: 22
20: 21
21: # press Enter here to end data input
Read 20 items
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
This is handy when you're using a numeric keypad. But it gets better. You don't have to hit the Enter key between each
data value. You only have to leave some white space...
> rm(scores)
> scan() -> scores
1: 16 20 14 21 20 18 13 15
9: 17 18 18 22 21 17 20 17 23 20
19: 22 21
21:
Read 20 items
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
The Enter key can be hit at any time to start a new line. Items entered into scan( ) must be separated by white space: a
space or spaces, a tab, a newline, a carriage return. Notice also that it doesn't matter whether left or right arrow assignment
is used. Better still, you can copy and paste the numbers from this webpage...
> rm(scores)
> scores = scan() # The = assignment can also be used.
1: 16 20 14 21 20 18 13 15 17 18 # Copied and pasted from above.
11: 18 22 21 17 20 17 23 20 22 21 # Copied and pasted from above.
21: # Remember to hit Enter to end entry.
Read 20 items
> scores
[1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
You can also copy and paste comma separated values, but not into the scan( ) function. Copy comma separated values
into c( ). However, you can copy and paste a spreadsheet column (but not a row) into the scan( ) function.
Now, about that summary--what we want, of course, is a summary by groups, and not of all the scores at once. You can
probably think of one way to do this...
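The listing for this step is missing, so here is a sketch of one way it could go. The vector name groups matches the by( ) call shown in these notes, but building it with rep( ) is my assumption about how it was created.

```r
scores = c(16,20,14,21,20,18,13,15,17,18,18,22,21,17,20,17,23,20,22,21)
# The first ten scores are smokers, the last ten nonsmokers.
groups = rep(c("smoker", "nonsmoker"), each=10)
groups
tapply(scores, groups, summary)    # a summary for each group
tapply(scores, groups, mean)       # or just the group means
```

The grouping vector must be the same length as the scores and in the same order, so that each score is paired with its own group label.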
The syntax of the tapply( ) function can be put into words like this: "Apply the summary function to scores by
groups." The by( ) function does something similar, but the output format is a bit different...
> by(data=scores, IND=groups, FUN=summary)
groups: nonsmoker
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 18.50 20.50 20.10 21.75 23.00
----------------------------------------------------------
groups: smoker
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 15.25 17.50 17.20 19.50 21.00
Now might be a good time to mention this. The summary( ) function is very versatile, and its output will depend upon
what you are asking for a summary of, as we will have ample opportunity to see. When a numerical vector is summarized,
the output is the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. There is a qualification. The quartiles
are calculated assuming the vector contains a continuous numerical variable. The variable in this example is not
continuous. Therefore, the quartiles may not come out to have the same values as if you'd used the method you were taught
in elementary statistics to calculate them. We will return to this in a future tutorial. For now, I'll simply say that R can use
any of nine different methods to calculate these values.
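As a small illustration of that last point, the quantile( ) function takes a type= argument numbered 1 through 9. The sketch below uses the smokers' scores. The default, type=7, reproduces the first quartile reported by summary( ), while type=6 (which the quantile help page says is used by Minitab and SPSS) gives a different answer.

```r
smokers = c(16,20,14,21,20,18,13,15,17,18)
quantile(smokers, .25)             # default type=7: 15.25, as in summary( )
quantile(smokers, .25, type=6)     # type=6: 14.75
# The first quartile under all nine methods, side by side:
sapply(1:9, function(k) quantile(smokers, .25, type=k))
```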
> rm(list=ls()) # Clean up.
There is no way to get around it. Entering categorical data, or character values, is a pain in the posterior! However, once
they are entered, R handles them in a much more versatile way than any other statistical software I have ever used. For
example, if you are going to use a categorical variable (entered as character values) in a regression analysis, you do not
have to recode. R will do the appropriate recoding for you.
There are some cautions about entering character values that it will be very healthy to know about right up front. Suppose
we enter the following vector into R...
being case sensitive, counted that as a different value from the uncapitalized "male"s. The one that can really puzzle you is
the difference between "male" and "male ". This can be a real mystery, in fact, when you've entered data using another
program, like a spreadsheet, and then read it into R. Moral of the story: BE CAREFUL TYPING CHARACTER DATA! If
you put a space on the beginning or end of a value, R will assume you mean it to be that way.
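If the damage has already been done, it can usually be repaired after the fact. Here is a minimal sketch, using a made-up vector (the values are hypothetical, chosen only to show the problem) along with the base R functions trimws( ) and tolower( ).

```r
# Hypothetical character data with a stray capital and stray spaces.
gender = c("male", "Male", "male ", "female", " female")
table(gender)                      # R sees five different values here
# Strip leading/trailing white space, then force everything to lower case.
clean = tolower(trimws(gender))
table(clean)                       # now just "female" and "male"
```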
> rm(country)
> country = scan(what="character")
1: England
2: Russia
3: United States
5: England
6: England
7:
Read 6 items
> table(country)
country
England Russia States United
3 1 1 1
And there it is! Now you see the problem. Let's do it right...
> rm(country)
> country = scan(what="character")
1: England
2: Russia
3: United.States
4: England
5: England
6:
Read 5 items
> table(country)
country
England Russia United.States
3 1 1
The default data type for scan( ) is numeric. Using scan( ) to enter character data is very convenient because you
can avoid typing commas and quotes, but you do have to remember to specify that you are entering character data by using
the what= option.
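There is another way around the "United States" problem, sketched here as a suggestion rather than as part of the original tutorial: tell scan( ) that items are separated by newlines instead of by white space. The text= argument is used below only so the example can run without typing at the console. Interactively, you would leave it out.

```r
# With sep="\n", each line is one item, so embedded spaces are kept.
country = scan(text="England\nRussia\nUnited States\nEngland\nEngland",
               what="character", sep="\n")
country
table(country)
```

The catch is that any accidental spaces at the beginning or end of a line are now kept as well, so the earlier warning about stray spaces applies with full force.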
Large data sets, however, will probably be typed into a spreadsheet and then read into R. In this case, you will have to be
careful how you tell R the file is formatted. More about that when we get to reading and writing external files.
> summary(country)
Length Class Mode
5 character character
> country = factor(country)
> summary(country)
England Russia United.States
3 1 1
Until you declare your entered vector to be a factor, R will consider it character data. Sometimes that is what you want, but
usually not. If you mean it to be a factor, use factor( ) to declare it as such.
> rm(list=ls()) # Clean up.
Sometimes you have data that someone has already done the work of putting into a table for you. (This happens especially
with problems out of a textbook.) The following data occur in "A Handbook of Small Data Sets" by Hand et al. (1994)...
First, I entered the table row by row into separate vectors. Then I used the rbind( ), or "row bind", function to bind the
rows into a table. (There is also a cbind( ) function, if you prefer to enter your tables column by column.) Then I added
names to the various dimensions of the table, making liberal use of the Enter key and space bar so the screen did not scroll
as I was typing. Notice the row names were entered first followed by the column names. The same method would be used
to name the dimensions in an array or a matrix. It's worth taking a few minutes to examine the syntax of the
dimnames( ) function. Notice it takes a list of the variable names, and the individual levels of each variable are
assigned via vectors typed within the list. Tricky!
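The table from Hand et al. and the commands used to enter it did not survive in these notes. To show the same steps, this sketch substitutes different numbers: the admission totals from the built-in UCBAdmissions data, summed over departments. The object names admitted, rejected, and my.table are assumptions.

```r
# Enter the table row by row...
admitted = c(1198, 557)
rejected = c(1493, 1278)
# ...bind the rows into a matrix...
my.table = rbind(admitted, rejected)
# ...and name the dimensions: row variable first, then column variable.
dimnames(my.table) = list(Admit = c("Admitted", "Rejected"),
                          Gender = c("Male", "Female"))
my.table
```

The names inside list( ) become the variable names, and the character vectors assigned to them become the level names.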
I don't like this table, and the reason I don't is because it's customary to put the explanatory variable in the rows and the
response variable in the columns of a contingency table (but not required). So I'm going to flip it using the t( ), for
"transpose", function...
Now let's look at a few functions for extracting information from this table...
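The original listings are missing, so the functions below are my guess at the sort of thing that was shown, not a record of it. The sketch again uses the UCBAdmissions totals as a stand-in table, flips it with t( ), and then pulls out dimensions, marginal totals, and proportions.

```r
my.table = apply(UCBAdmissions, c(1, 2), sum)   # Admit in rows, Gender in columns
flipped = t(my.table)                           # now Gender in rows
flipped
dim(flipped)                    # number of rows and columns
margin.table(flipped, 1)        # row totals (by Gender)
margin.table(flipped, 2)        # column totals (by Admit)
prop.table(flipped, 1)          # proportions within each row
addmargins(flipped)             # the table with totals added
```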
R Tutorials--Simple Linear Regression
Correlation
Correlation is used to test for a relationship between two numerical variables or two ranked (ordinal) variables. In this
tutorial, we assume the relationship (if any) is linear.
To demonstrate, we will begin with a data set called "cats" from the "MASS" library, which contains information on
various anatomical features of house cats...
> library("MASS")
> data(cats)
> str(cats)
'data.frame': 144 obs. of 3 variables:
$ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
$ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
> summary(cats)
Sex Bwt Hwt
F:47 Min. :2.000 Min. : 6.30
M:97 1st Qu.:2.300 1st Qu.: 8.95
Median :2.700 Median :10.10
Mean :2.724 Mean :10.63
3rd Qu.:3.025 3rd Qu.:12.12
Max. :3.900 Max. :20.50
"Bwt" is the body weight in kilograms, "Hwt" is the heart weight in grams, and "Sex" should be obvious. There are no
missing values in any of the variables, so we are ready to begin by looking at a scatterplot...
> with(cats, plot(Bwt, Hwt))
> title(main="Heart Weight (g) vs. Body Weight (kg)\nof Domestic Cats")
simplelinear.html[27/01/2014 22:19:51]
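The first correlation commands are missing from these notes. A minimal sketch of the default analysis, assuming the same "cats" data, would be something like this (the name r is my own).

```r
library(MASS)                          # supplies the "cats" data
r = with(cats, cor(Bwt, Hwt))          # Pearson r is the default
r
r^2                                    # proportion of variance accounted for
with(cats, cor.test(Bwt, Hwt))         # two-sided test, 95% CI by default
```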
Since we would expect a positive correlation here, we might have set the alternative to "greater"...
> with(cats, cor.test(Bwt, Hwt, alternative="greater", conf.level=.8))
Pearson's product-moment correlation
data: Bwt and Hwt
t = 16.1194, df = 142, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
80 percent confidence interval:
0.7776141 1.0000000
sample estimates:
cor
0.8041274
There is also a formula interface for cor.test( ), but it's tricky. Both variables should be listed after the tilde...
> with(cats, cor.test(~ Bwt + Hwt)) # output not shown
Using the formula interface makes it easy to subset the data by rows of the data frame...
> with(cats, cor.test(~ Bwt + Hwt, subset=(Sex=="F")))
Pearson's product-moment correlation
data: Bwt and Hwt
t = 4.2152, df = 45, p-value = 0.0001186
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2890452 0.7106399
sample estimates:
cor
0.5320497
The "subset=" option is not available unless you use the formula interface.
If a data frame (or other table-like object) contains more than two numerical variables, then the cor( ) function will result in
a correlation matrix...
> rm(cats) # if you haven't already
> data(cement) # also in the MASS library
> str(cement)
'data.frame': 13 obs. of 5 variables:
$ x1: int 7 1 11 11 7 11 3 1 2 21 ...
$ x2: int 26 29 56 31 52 55 71 31 54 47 ...
$ x3: int 6 15 8 8 6 9 17 22 18 4 ...
$ x4: int 60 52 20 47 33 22 6 44 22 26 ...
$ y : num 78.5 74.3 104.3 87.6 95.9 ...
> cor(cement)
x1 x2 x3 x4 y
x1 1.0000000 0.2285795 -0.82413376 -0.24544511 0.7307175
x2 0.2285795 1.0000000 -0.13924238 -0.97295500 0.8162526
x3 -0.8241338 -0.1392424 1.00000000 0.02953700 -0.5346707
x4 -0.2454451 -0.9729550 0.02953700 1.00000000 -0.8213050
y 0.7307175 0.8162526 -0.53467068 -0.82130504 1.0000000
If you prefer a covariance matrix, use cov( )...
> cov(cement)
x1 x2 x3 x4 y
x1 34.60256 20.92308 -31.051282 -24.166667 64.66346
x2 20.92308 242.14103 -13.878205 -253.416667 191.07949
x3 -31.05128 -13.87821 41.025641 3.166667 -51.51923
x4 -24.16667 -253.41667 3.166667 280.166667 -206.80833
y 64.66346 191.07949 -51.519231 -206.808333 226.31359
If you want a visual representation of the correlation matrix (i.e., a scatterplot matrix)...
> pairs(cement)
The command plot(cement) would also have done the same thing.
If the data are ordinal rather than true numerical measures, or have been converted to ranks to fix some problem with
distribution or curvilinearity, then R can calculate a Spearman rho coefficient or a Kendall tau coefficient. Suppose we
have two athletic coaches ranking players by skill...
> ls()
[1] "cement" "cov.matr"
> rm(cement, cov.matr) # clean up first
> coach1 = c(1,2,3,4,5,6,7,8,9,10)
> coach2 = c(4,8,1,5,9,2,10,7,3,6)
> cor(coach1, coach2, method="spearman")
[1] 0.1272727
> cor.test(coach1, coach2, method="spearman")
Spearman's rank correlation rho
data: coach1 and coach2
S = 144, p-value = 0.72
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1272727
> cor(coach1, coach2, method="kendall")
[1] 0.1111111
> cor.test(coach1, coach2, method="kendall")
Kendall's rank correlation tau
data: coach1 and coach2
T = 25, p-value = 0.7275
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.1111111
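To see where the Spearman numbers come from, here is a short check by hand. With no tied ranks, rho can be computed from the squared differences between the two sets of ranks, and the sum of those squared differences is the "S" that cor.test( ) reports. The names d, n, S, and rho are mine.

```r
coach1 = c(1,2,3,4,5,6,7,8,9,10)
coach2 = c(4,8,1,5,9,2,10,7,3,6)
d = coach1 - coach2                  # differences between the ranks
n = length(d)
S = sum(d^2)                         # 144, the S in the cor.test( ) output
rho = 1 - 6 * S / (n * (n^2 - 1))    # shortcut formula, valid when no ties
rho
```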
> ls()
R Tutorials--Time Series Analysis
Definition
When a process is measured over time--i.e., in a sense, "time" is the independent or explanatory variable--then the
resulting sequence of measured values is called a time series. The difference between time series data and independent
measurements that just happen to be made over time is that in time series data the successive data points are often
correlated. For example, the built-in data set "sunspots" is a count of the number of sunspots observed during every month
from 1749 through 1983. The autocorrelation function reveals how successive data points in the series are correlated...
> acf(sunspots)
As seen in this graph, the data points are perfectly correlated with themselves (lag = 0), but successive points are
highly correlated as well. For example, a point is correlated with the next point at higher than r = 0.9, and this
autocorrelation between points in the series does not drop to near zero until three years have passed.
This means traditional techniques that assume independent measurements should not be used in the analysis of time series
data.
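If you want the numbers behind the graph, acf( ) will return them instead of plotting. This is a small sketch, and the name sun.acf is mine. Because "sunspots" is a monthly series, a lag.max of 36 covers three years.

```r
sun.acf = acf(sunspots, lag.max=36, plot=FALSE)
sun.acf$acf[1]       # lag 0: every point correlates perfectly with itself
sun.acf$acf[2]       # lag 1: each month with the following month
sun.acf$acf[37]      # lag 36: points three years apart
```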
[NOTE: Who am I kidding? When I first began writing this back in 2010, I had big plans to teach myself time series
analysis. It hasn't happened. My knowledge of time series analysis is rudimentary to say the least. You should not assume
this is all that R can do with time series--it's just all I can do. In fact, R contains extensive facilities, many in optional
packages, for dealing with time series. If you want an elementary introduction to time series, I'm told that Chatfield (2003)
is an excellent source.]
timeseries.html[27/01/2014 22:20:05]
ANOVA
If you have been analyzing ANOVA designs in traditional statistical packages, you are likely to find R's
approach less coherent and user-friendly. A good online presentation on ANOVA in R can be found in the
ANOVA section of the Personality Project. (Note: I have found that these pages render fine in Chrome
and Safari browsers, but can appear distorted in iExplorer.)
1. Fit a Model
In the following examples lower case letters are numeric variables and upper case letters are factors.
# One Way Anova (Completely Randomized Design)
fit <- aov(y ~ A, data=mydataframe)
# Randomized Block Design (B is the blocking factor)
fit <- aov(y ~ A + B, data=mydataframe)
# Two Way Factorial Design
fit <- aov(y ~ A + B + A:B, data=mydataframe)
fit <- aov(y ~ A*B, data=mydataframe) # same thing
# Analysis of Covariance
fit <- aov(y ~ A + x, data=mydataframe)
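The names y, A, B, and mydataframe above are placeholders. As a concrete sketch, the same formulas can be tried on the built-in "warpbreaks" data, where breaks is numeric and wool and tension are factors. The choice of data set and the names fit1 to fit3 are mine.

```r
# One way design: breaks by tension
fit1 <- aov(breaks ~ tension, data=warpbreaks)
# Two way factorial design, written out in full...
fit2 <- aov(breaks ~ wool + tension + wool:tension, data=warpbreaks)
# ...and with the * shorthand, which expands to the same model
fit3 <- aov(breaks ~ wool * tension, data=warpbreaks)
summary(fit1)
summary(fit3)
```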
For within subjects designs, the data frame has to be rearranged so that each measurement on a
subject is a separate observation. See R and Analysis of Variance.
# One Within Factor
fit <- aov(y~A+Error(Subject/A),data=mydataframe)
2. Look at Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
Advanced Statistics
Basic Graphs
Multiple Comparisons
You can get Tukey HSD tests using the TukeyHSD( ) function. By default, it calculates post hoc comparisons on
each factor in the model. You can specify specific factors as an option. Again, remember that results
are based on Type I SS!
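A minimal sketch of these post hoc comparisons, assuming the built-in warpbreaks data (an assumption; the original names no data set here):

```r
# Sketch only: warpbreaks stands in for the reader's own data
fit <- aov(breaks ~ wool + tension, data = warpbreaks)

# Comparisons on every factor in the model (the default)
tk_all <- TukeyHSD(fit)

# Comparisons restricted to one factor
tk_tension <- TukeyHSD(fit, "tension")
tk_tension

# plot(tk_tension) draws the confidence intervals for each pairwise difference
```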
Visualizing Results
Use box plots and line plots to visualize group differences. There are also two functions specifically
designed for visualizing mean differences in ANOVA layouts. interaction.plot( ) in the base stats
package produces plots for two-way interactions. plotmeans( ) in the gplots package produces mean
plots for single factors, and includes confidence intervals.
MANOVA
If there is more than one dependent (outcome) variable, you can test them simultaneously using a
multivariate analysis of variance (MANOVA). In the following example, let Y be a matrix whose
columns are the dependent variables.
Other test options are "Wilks", "Hotelling-Lawley", and "Roy". Use summary.aov( ) to get univariate
statistics. TukeyHSD( ) and plot( ) will not work with a MANOVA fit. Run each dependent variable
separately to obtain them. Like ANOVA, MANOVA results in R are based on Type I SS. To obtain Type III
SS, vary the order of variables in the model and rerun the analyses. For example, fit y~A*B for the
Type III B effect and y~B*A for the Type III A effect.
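The MANOVA steps described above can be sketched as follows. The iris data set and the choice of two outcome columns are assumptions made for illustration only.

```r
# Sketch only: iris stands in for the reader's own data (an assumption)
# Y is a matrix whose columns are the dependent variables
Y <- as.matrix(iris[, c("Sepal.Length", "Petal.Length")])

fit <- manova(Y ~ Species, data = iris)

summary(fit, test = "Pillai")  # multivariate test (Pillai is the default)
summary(fit, test = "Wilks")   # one of the other test options
summary.aov(fit)               # univariate statistics for each outcome
```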
Going Further
R has excellent facilities for fitting linear and generalized linear mixed-effects models. The latest
implementation is in package lme4. See the R News Article on Fitting Mixed Linear Models in R for
details.
Assessing Classical Test Assumptions
In classical parametric procedures we often assume normality and constant variance for the model error
term. Methods of exploring these assumptions in an ANOVA/ANCOVA/MANOVA framework are discussed
here. Regression diagnostics are covered under multiple linear regression.
Outliers
Since outliers can severely affect normality and homogeneity of variance, methods for detecting
disparate observations are described first.
The aq.plot() function in the mvoutlier package allows you to identify multivariate outliers by plotting
the ordered squared robust Mahalanobis distances of the observations against the empirical distribution
function of the squared distances (MD_i^2). Input consists of a matrix or data frame. The function
produces 4 graphs and returns a boolean vector identifying the outliers.
# Detect Outliers in the MTCARS Data
library(mvoutlier)
outliers <-
aq.plot(mtcars[c("mpg","disp","hp","drat","wt","qsec")])
outliers # show list of outliers
Univariate Normality
You can evaluate the normality of a variable using a Q-Q plot.
# Q-Q Plot for variable MPG
attach(mtcars)
qqnorm(mpg)
qqline(mpg)
Significant departures from the line suggest violations of normality.
You can also perform a Shapiro-Wilk test of normality with the shapiro.test(x) function, where x is a
numeric vector. Additional functions for testing normality are available in the nortest package.
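A minimal sketch of the Shapiro-Wilk test, assuming the mpg column of the built-in mtcars data as the numeric vector:

```r
# Sketch only: mtcars$mpg is used as the example numeric vector
sw <- shapiro.test(mtcars$mpg)

sw$statistic  # the W statistic
sw$p.value    # a small p-value suggests a departure from normality
```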
Multivariate Normality
MANOVA assumes multivariate normality. The function mshapiro.test( ) in the mvnormtest package
produces the Shapiro-Wilk test for multivariate normality. Input must be a numeric matrix.
Homogeneity of Variances
The bartlett.test( ) function provides a parametric K-sample test of the equality of variances. The
fligner.test( ) function provides a non-parametric test of the same. In the following examples y is a
numeric variable and G is the grouping variable.
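As a hedged sketch of both tests, the built-in InsectSprays data (an assumption, not from the original) supplies a numeric count and a six-level grouping factor spray:

```r
# Sketch only: count plays the role of y, spray the role of G
bt <- bartlett.test(count ~ spray, data = InsectSprays)  # parametric
ft <- fligner.test(count ~ spray, data = InsectSprays)   # non-parametric

bt$p.value
ft$p.value
```

Both tests return an object of class "htest"; small p-values indicate that the group variances differ.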
The hovPlot( ) function in the HH package provides a graphic test of homogeneity of variances based on
Brown-Forsythe. In the following example, y is numeric and G is a grouping factor. Note that G must be
of type factor.
Homogeneity of Covariance Matrices
MANOVA and LDF assume homogeneity of variance-covariance matrices. The assumption is usually
tested with Box's M. Unfortunately the test is very sensitive to violations of normality, leading to
rejection in most typical cases. Box's M is not included in R, but code is available.
Correlations
You can use the cor( ) function to produce correlations and the cov( ) function to produce
covariances.
A simplified format is cor(x, use=, method= ) where
Option    Description
x         Matrix or data frame
use       Specifies the handling of missing data. Options are all.obs (assumes no
          missing data - missing data will produce an error), complete.obs (listwise
          deletion), and pairwise.complete.obs (pairwise deletion)
method    Specifies the type of correlation. Options are pearson, spearman or kendall.
# Correlations/covariances among numeric variables in
# data frame mtcars. Use listwise deletion of missing data.
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
You can use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and
the columns of Y. This is similar to the VAR and WITH commands in SAS PROC CORR.
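A minimal sketch of the cor(X, Y) form, assuming two arbitrary column subsets of the built-in mtcars data:

```r
# Sketch only: the column choices are arbitrary and for illustration
X <- mtcars[, c("mpg", "wt")]
Y <- mtcars[, c("disp", "hp", "drat")]

# Rows correspond to the columns of X, columns to the columns of Y
r_xy <- cor(X, Y)
r_xy
```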
Other Types of Correlations
# polychoric correlation
# x is a contingency table of counts
library(polycor)
polychor(x)

# partial correlations
library(ggm)
data(mydata)
pcor(c("a", "b", "x", "y", "z"), var(mydata))
# partial corr between a and b controlling for x, y, z
Visualizing Correlations
Use corrgram( ) to plot correlograms.
A great example of a plotted correlation matrix can be found in the R Graph Gallery.
Descriptive Statistics
R provides a wide range of functions for obtaining summary statistics. One method of obtaining
descriptive statistics is to use the sapply( ) function with a specified summary statistic.
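As a minimal sketch of the sapply( ) approach, assuming the built-in mtcars data frame stands in for mydata:

```r
# Sketch only: mtcars is a stand-in for mydata (an assumption)
# Apply one summary statistic to every column, excluding missing values
means <- sapply(mtcars, mean, na.rm = TRUE)
sds   <- sapply(mtcars, sd, na.rm = TRUE)

means
sds
```

Any function that takes a numeric vector, such as median, var, min, max, or range, can be substituted for mean.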
# mean,median,25th and 75th quartiles,min,max
summary(mydata)

Using the Hmisc package
library(Hmisc)
describe(mydata)
# n, nmiss, unique, mean, 5,10,25,50,75,90,95th percentiles
# 5 lowest and 5 highest scores

library(pastecs)
stat.desc(mydata)
# nbr.val, nbr.null, nbr.na, min max, range, sum,
# median, mean, SE.mean, CI.mean, var, std.dev, coef.var
Using the ~ package
Summary Statistics by Group
A simple way of generating summary statistics by grouping variable is available in the psych package.
library(psych)
describeBy(mydata, group, ...) # named describe.by( ) in older versions of psych
The doBy package provides much of the functionality of SAS PROC SUMMARY. It defines the desired
table using a model formula and a function. Here is a simple example.
library(doBy)
summaryBy(mpg + wt ~ cyl + vs, data = mtcars,
  FUN = function(x) { c(m = mean(x), s = sd(x)) } )
# produces mpg.m wt.m mpg.s wt.s for each
# combination of the levels of cyl and vs
Generating Frequency Tables
R provides many methods for creating frequency and contingency tables. Three are described below. In
the following examples, assume that A, B, and C represent categorical variables.
table
You can generate frequency tables using the table( ) function, tables of proportions using the
prop.table( ) function, and marginal frequencies using margin.table( ).
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
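The marginal frequencies and proportions mentioned above can be sketched as follows, assuming the cyl and gear columns of the built-in mtcars data play the roles of A and B:

```r
# Sketch only: cyl and gear stand in for the categorical variables A and B
mytable <- table(mtcars$cyl, mtcars$gear)

row_freq <- margin.table(mytable, 1)  # A frequencies (summed over B)
col_freq <- margin.table(mytable, 2)  # B frequencies (summed over A)

cell_prop <- prop.table(mytable)      # cell proportions (sum to 1)
row_prop  <- prop.table(mytable, 1)   # row proportions (each row sums to 1)
col_prop  <- prop.table(mytable, 2)   # column proportions (each column sums to 1)
```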
table( ) can also generate multidimensional tables based on 3 or more categorical variables. In this
case, use the ftable( ) function to print the results more attractively.

# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
table( ) ignores missing values. To include NA as a category in counts, include the table option
exclude=NULL if the variable is a vector. If the variable is a factor you have to create a new factor
using newfactor <- factor(oldfactor, exclude=NULL).
If a variable is included on the left side of the formula, it is assumed to be a vector of frequencies
(useful if the data have already been tabulated).
Crosstable
The CrossTable( ) function in the gmodels package produces crosstabulations modeled after PROC FREQ
in SAS or CROSSTABS in SPSS. It has a wealth of options.
Tests of Independence
Chi-Square Test
For 2-way tables you can use chisq.test(mytable) to test independence of the row and column variable.
By default, the p-value is calculated from the asymptotic chi-squared distribution of the test statistic.
Optionally, the p-value can be derived via Monte Carlo simulation.
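A minimal sketch of both ways of obtaining the p-value, assuming the hair-by-eye margin of the built-in HairEyeColor table as the 2-way table:

```r
# Sketch only: a 4 x 4 hair-by-eye table built from HairEyeColor (an assumption)
mytable <- margin.table(HairEyeColor, c(1, 2))

# Asymptotic chi-squared p-value (the default)
res_asym <- chisq.test(mytable)

# Monte Carlo p-value; B is the number of simulated tables
set.seed(123)
res_sim <- chisq.test(mytable, simulate.p.value = TRUE, B = 2000)

res_asym$parameter  # degrees of freedom of the asymptotic test
res_sim$p.value     # simulated p-value (no degrees of freedom are reported)
```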
Mantel-Haenszel test
Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null
hypothesis that two nominal variables are conditionally independent in each stratum, assuming that
there is no three-way interaction. x is a 3 dimensional contingency table, where the last dimension
refers to the strata.
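As a hedged sketch, the built-in UCBAdmissions table (an assumption, not from the original) is a 2 x 2 x 6 array of Admit by Gender by Dept, so Dept, the last dimension, serves as the stratum:

```r
# Sketch only: UCBAdmissions is a 3-dimensional contingency table
# Dimensions: Admit (2) x Gender (2) x Dept (6); Dept is the stratum
dim(UCBAdmissions)

cmh <- mantelhaen.test(UCBAdmissions)
cmh$statistic  # Mantel-Haenszel chi-squared statistic
cmh$p.value
```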
Loglinear Models
You can use the loglm( ) function in the MASS package to produce log-linear models. For example, let's
assume we have a 3-way contingency table based on variables A, B, and C.
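library(MASS)
mytable <- xtabs(~A+B+C, data=mydata)

We can perform the following tests:
# Mutual Independence
loglm(~A+B+C, mytable)
# Partial Independence (A independent of the composite variable BC)
loglm(~A+B+C+B*C, mytable)
# Conditional Independence (A independent of B, given C)
loglm(~A+B+C+A*C+B*C, mytable)
# No Three-Way Interaction
loglm(~A+B+C+A*B+A*C+B*C, mytable)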
Martin Theus and Stephan Lauer have written an excellent article on Visualizing Loglinear Models, using
mosaic plots. There is also a great tutorial example by Kevin Quinn on analyzing loglinear models via glm.
Measures of Association
The assocstats(mytable) function in the vcd package calculates the phi coefficient, contingency
coefficient, and Cramer's V for an rxc table. The Kappa(mytable) function in the vcd package calculates
Cohen's kappa and weighted kappa for a confusion matrix. See Richard Darlington's article on Measures
of Association in Crosstab Tables for an excellent review of these statistics.
Visualizing results
Use bar and pie charts for visualizing frequencies in one dimension.
Use the vcd package for visualizing relationships among categorical data (e.g. mosaic and association
plots).
Use the ca package for correspondence analysis (visually exploring relationships between rows and
columns in contingency tables).
For the wilcox.test you can use the alternative="less" or alternative="greater" option to specify a one
tailed test.
Parametric and resampling alternatives are available.
The package npmc provides nonparametric multiple comparisons. (Note: This package has been
withdrawn but is still available in the CRAN archives.)
Visualizing Results
Use box plots or density plots to visualize group differences.
Power Analysis
The following four quantities have an intimate relationship:
1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 - P(Type II error) = probability of finding an effect that is there
function        power calculations for
pwr.t.test      t-tests (one sample, 2 sample, paired)
pwr.t2n.test    t-test (two samples with unequal n)
For each of these functions, you enter three of the four quantities (effect size, sample size,
significance level, power) and the fourth is calculated.
The significance level defaults to 0.05. Therefore, to calculate the significance level, given an effect
size, sample size, and power, use the option "sig.level=NULL".
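The "supply three, solve for the fourth" pattern can be sketched with power.t.test( ) from the base stats package. This is a swapped-in stand-in for the pwr functions, chosen so the sketch runs without extra packages; it expresses the effect as delta and sd rather than Cohen's d.

```r
# Sketch only: base R power.t.test( ) rather than the pwr package
# Two-sample t-test with delta/sd = 0.5 (a "medium" effect)

# Solve for sample size per group (n is left unspecified)
res_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
ceiling(res_n$n)

# Solve for power (power is left unspecified)
res_p <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)
res_p$power

# Solve for the significance level: it has a default, so pass NULL explicitly
res_a <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = NULL, power = 0.80)
res_a$sig.level
```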
Specifying an effect size can be a daunting task. ES formulas and Cohen's suggestions (based on social
science research) are provided below. Cohen's suggestions should only be seen as very rough guidelines.
Your own subject matter experience should be brought to bear.
Cohen suggests that d values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.
You can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test.
A two tailed test is the default.
ANOVA
For a one-way analysis of variance use

pwr.anova.test(k = , n = , f = , sig.level = , power = )

where k is the number of groups and n is the common sample size in each group.
The effect size f is defined as

$$ f = \sqrt{\frac{\sum_{i=1}^{k} p_i \, (\mu_i - \mu)^2}{\sigma^2}} $$

where
$p_i = n_i / N$,
$n_i$ = number of observations in group i
$N$ = total number of observations
$\mu_i$ = mean of group i
$\mu$ = grand mean
$\sigma^2$ = error variance within groups
Cohen suggests that f values of 0.1, 0.25, and 0.4 represent small, medium, and large effect sizes
respectively.
Correlations
For correlation coefficients use

pwr.r.test(n = , r = , sig.level = , power = )

where n is the sample size and r is the correlation. We use the population correlation coefficient as
the effect size measure. Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium,
and large effect sizes respectively.
Linear Models
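For linear models (e.g., multiple regression) use

pwr.f2.test(u = , v = , f2 = , sig.level = , power = )

where u and v are the numerator and denominator degrees of freedom. We use $f^2$ as the effect size
measure:

$$ f^2 = \frac{R^2}{1 - R^2} $$

$$ f^2 = \frac{R^2_{AB} - R^2_{A}}{1 - R^2_{AB}} $$

where $R^2$ is the squared multiple correlation, $R^2_{A}$ is the variance accounted for by variable set A,
and $R^2_{AB}$ is the variance accounted for by variable sets A and B together.
The first formula is appropriate when we are evaluating the impact of a set of predictors on an
outcome. The second formula is appropriate when we are evaluating the impact of one set of predictors
above and beyond a second set of predictors (or covariates). Cohen suggests $f^2$ values of 0.02, 0.15,
and 0.35 represent small, medium, and large effect sizes.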
Tests of Proportions
When comparing two proportions use

pwr.2p.test(h = , n = , sig.level = , power = )

where h is the effect size and n is the common sample size in each group.
Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.
For both two sample and one sample proportion tests, you can specify alternative="two.sided", "less", or
"greater" to indicate a two-tailed, or one-tailed test. A two tailed test is the default.
Chi-square Tests
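For chi-square tests use

pwr.chisq.test(w = , N = , df = , sig.level = , power = )

where w is the effect size, N is the total sample size, and df is the degrees of freedom. The effect size
w is defined as

$$ w = \sqrt{\sum_{i=1}^{m} \frac{(p0_i - p1_i)^2}{p0_i}} $$

where $p0_i$ is the cell probability in the ith cell under H0 and $p1_i$ is the cell probability in the ith
cell under H1.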
Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes
respectively.
Some Examples
library(pwr)
pwr.2p.test(n=30, sig.level=0.01, power=0.75)

library(pwr)
# set up graph
# (r, p, and samsize are assumed to have been computed earlier;
# their definitions are not shown here)
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
  xlab="Correlation Coefficient (r)",
  ylab="Sample Size (n)" )
Regression Diagnostics
An excellent review of regression diagnostics is provided in John Fox's aptly named Overview of
Regression Diagnostics. Dr. Fox's car package provides advanced utilities for regression modeling.
This example is for exposition only. We will ignore the fact that this may not be a great way of
modeling this particular set of data!
Outliers
[Figure: leverage plot]
Non-normality
# Normality of Residuals
# qq plot for studentized resid
qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
library(MASS)
sresid <- studres(fit)
hist(sresid, freq=FALSE,
  main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
Multi-collinearity
# Evaluate Collinearity
vif(fit) # variance inflation factors
sqrt(vif(fit)) > 2 # problem?
Nonlinearity
# Evaluate Nonlinearity
# component + residual plot
crPlots(fit)
# Ceres plots
ceresPlots(fit)
Non-independence of Errors
# Test for Autocorrelated Errors
durbinWatsonTest(fit)
Going Further
If you would like to delve deeper into regression diagnostics, two books written by John Fox can help:
Applied regression analysis and generalized linear models (2nd ed) and An R and S-Plus companion to
applied regression.
Multiple (Linear) Regression
R provides comprehensive support for multiple linear regression. The topics below are provided in order
of increasing complexity.
Diagnostic Plots
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
Comparing Models
You can compare nested models with the anova( ) function. The following code provides a simultaneous
test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
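A runnable sketch of the same nested comparison, assuming the built-in mtcars data with wt and hp in the role of x1 and x2, and qsec and drat in the role of x3 and x4:

```r
# Sketch only: mtcars columns stand in for y and x1..x4 (an assumption)
fit_full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit_reduced <- lm(mpg ~ wt + hp, data = mtcars)

# F test of whether qsec and drat jointly improve on wt and hp
cmp <- anova(fit_reduced, fit_full)
cmp
```

The second row of the table reports the 2 degrees of freedom used by the added predictors along with the F statistic and its p-value.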
Cross Validation
You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package.
# K-fold cross-validation
library(DAAG)
cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation
Sum the MSE for each fold, divide by the number of observations, and take the square root to get the
cross-validated standard error of estimate.
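The arithmetic can be sketched by hand in base R. This is not the DAAG implementation; it is a simplified illustration that pools the squared prediction errors across folds, assuming the built-in mtcars data and an arbitrary model.

```r
# Sketch only: manual 3-fold cross-validation in base R (not cv.lm itself)
set.seed(42)
k <- 3
n <- nrow(mtcars)

# Randomly assign each observation to one of k folds
fold <- sample(rep(1:k, length.out = n))

sq_err <- numeric(n)
for (i in 1:k) {
  held_out <- fold == i
  # Fit on the other folds, predict the held-out fold
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[!held_out, ])
  pred  <- predict(fit_i, newdata = mtcars[held_out, ])
  sq_err[held_out] <- (mtcars$mpg[held_out] - pred)^2
}

# Pool squared errors over all folds, divide by n, take the square root
cv_se <- sqrt(sum(sq_err) / n)
cv_se
```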
You can assess R2 shrinkage via K-fold cross-validation. Using the crossval() function from the
bootstrap package, do the following:
library(bootstrap)
# define functions
theta.fit <- function(x,y){lsfit(x,y)}
theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
# matrix of predictors
X <- as.matrix(mydata[c("x1","x2","x3")])
# vector of observed response values
y <- as.matrix(mydata[c("y")])
Variable Selection
Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial
topic. You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from
the MASS package. stepAIC( ) performs stepwise model selection by exact AIC.
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Alternatively, you can perform all-subsets regression using the leaps( ) function from the leaps
package. In the following code nbest indicates the number of subsets of each size to report. Here, the
ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.).
Other options for plot( ) are bic, Cp, and adjr2. Other options for plotting with
subset( ) are bic, cp, adjr2, and rss.
Relative Importance
The relaimpo package provides measures of relative importance for each of the predictors in the
model. See help(calc.relimp) for details on the four measures of relative importance provided.
Graphic Enhancements
The car package offers a wide variety of plots for regression, including added variable plots, and
enhanced diagnostic and scatter plots.
Going Further
Nonlinear Regression
The nls( ) function in the stats package fits nonlinear regression models. See John Fox's Nonlinear
Regression and Nonlinear Least Squares for an overview. Huet and colleagues' Statistical Tools for
Nonlinear Regression: A Practical Guide with S-PLUS and R Examples is a valuable reference book.
Robust Regression
There are many functions in R to aid with robust regression. For example, you can perform robust
regression with the rlm( ) function in the MASS package. John Fox's (who else?) Robust Regression
provides a good starting overview. The UCLA Statistical Computing website has Robust Regression
Examples.
The robust package provides a comprehensive library of robust methods, including regression. The
robustbase package also provides basic robust statistics including model selection methods. And David
Olive has provided a detailed online review of Applied Robust Statistics with sample R code.
Copyright © 2012 Robert I. Kabacoff, Ph.D. | Sitemap
Resampling Statistics
The coin package provides the ability to perform a wide variety of re-randomization or permutation
based statistical tests. These tests do not assume random sampling from well-defined populations. They
can be a reasonable alternative to classical procedures when test assumptions can not be met. See
coin: A Computational Framework for Conditional Inference for details.
In the examples below, lower case letters represent numerical variables and upper case letters
represent categorical factors. Monte-Carlo simulation is available for all tests. Exact tests are
available for 2 group procedures.
Independent Two- and K-Sample Location Tests
distribution=approximate(B=9999))
Many other univariate and multivariate tests are possible using the functions in the coin package. See A
Lego System for Conditional Inference for more details.
t-tests
The t.test( ) function produces a variety of t-tests. Unlike most statistical packages, the default
assumes unequal variance and applies the Welch df modification.
# paired t-test
t.test(y1,y2,paired=TRUE) # where y1 & y2 are numeric

# one sample t-test
t.test(y,mu=3) # Ho: mu=3
You can use the var.equal=TRUE option to specify equal variances and a pooled variance estimate.
You can use the alternative="less" or alternative="greater" option to specify a one tailed test.
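A minimal sketch of these options for an independent two-group test, assuming the built-in mtcars data with mpg as the numeric outcome and am (0 = automatic, 1 = manual) as the binary grouping variable:

```r
# Sketch only: mtcars is a stand-in data set (an assumption)
# Default: Welch test, unequal variances assumed
res_welch <- t.test(mpg ~ am, data = mtcars)

# Equal variances with a pooled variance estimate
res_pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# One tailed test: is the mean of the first group (am = 0) less than the second?
res_less <- t.test(mpg ~ am, data = mtcars, alternative = "less")

res_welch$parameter   # Welch-adjusted (non-integer) degrees of freedom
res_pooled$parameter  # n1 + n2 - 2 degrees of freedom
```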
Visualizing Results
Use box plots or density plots to visualize group differences.
Using with( ) and by( )
There are two functions that can help write simpler and more efficient code.
By
The by( ) function applies a function to each level of a factor or factors. It is similar to BY processing in