Articles
R Programming 1
Introduction 2
Sample Session 4
Manage your workspace 7
Settings 9
Documentation 15
Control Structures 19
Working with functions 23
Debugging 27
Using C or Fortran 29
Utilities 29
Estimation utilities 32
Packages 34
Data types 36
Working with data frames 45
Importing and exporting data 52
Text Processing 57
Times and Dates 67
Graphics 70
Grammar of graphics 85
Publication quality output 86
Descriptive Statistics 92
Mathematics 99
Optimization 107
Probability Distributions 110
Random Number Generation 116
Maximum Likelihood 120
Method of Moments 122
Bayesian Methods 122
Bootstrap 123
Multiple Imputation 124
Nonparametric Methods 124
Linear Models 128
Quantile Regression 136
Binomial Models 137
Multinomial Models 141
Tobit And Selection Models 143
Count Data Models 144
Duration Analysis 146
Time Series 146
Factor Analysis 148
Ordination 150
Clustering 152
Network Analysis 153
Profiling R code 154
Parallel computing with R 155
Sources 155
Index 156
References
Article Sources and Contributors 158
Image Sources, Licenses and Contributors 160
Article Licenses
License 161
R Programming
Prerequisites
We assume that readers have a background in statistics. This book is not a book about statistics but a book about
how to implement statistical methods using R. We try to use terms which are already defined on Wikipedia such that
people can refer to the corresponding wikipedia page each time they have some doubts on a notion.
We also assume that readers are familiar with computers and that they know how to use software with a
command-line interface. There are some graphical user interfaces for R but we are not going to explain how to use
them in this textbook. Beginners should have a look at the Sample session for a first session with R. They can also
have a look at the Statistical Analysis: an Introduction using R book.
References
[1] R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
[2] CRAN manuals (http://cran.r-project.org/doc/manuals/)
[3] R on Stackoverflow (http://stackoverflow.com/questions/tagged/r)
[4] See R-Bloggers (http://www.r-bloggers.com/)
[5] The CRAN Task View is already organized by discipline.
Introduction
What is R?
R is statistical software used for data analysis. It includes a huge number of statistical procedures such as
t-tests, chi-square tests, standard linear models, instrumental variables estimation, local polynomial regressions, etc. It
also provides high-level graphics capabilities.
Why use R?
R is free software. R is an official GNU project and is distributed under the Free Software Foundation's General
Public License (GPL).
R is a powerful data-analysis package with many standard and cutting-edge statistical functions. See the
Comprehensive R Archive Network's Task Views [1] to get an idea of what you can do with R.
R is a programming language, so its abilities can easily be extended through the use of user-defined functions. A
large collection of user-contributed functions and packages can be found at CRAN's Contributed Packages [2]
page.
R is widely used in political science, statistics, econometrics, actuarial sciences, sociology, finance, etc.
R is available for all major operating systems (Windows, Mac OS, GNU-Linux).
R is object oriented. Virtually anything (e.g., complex data structures) can be stored as an R object.
R is a matrix language.
R syntax is much more systematic than Stata or SAS syntax.
R can be installed on your USB stick[3].
Alternatives to R
S-PLUS is a commercial implementation of the S programming language, of which R is a free implementation.
Gretl is free software for econometrics. It has a graphical user interface and is nice for beginners.
SPSS is proprietary software which is often used in sociology, psychology and marketing. It is known to be easy
to use.
GNU PSPP is a free-software alternative to SPSS.
SAS is proprietary software that can be used with very large datasets such as census data.
Stata is proprietary software that is often used in economics and epidemiology.
MATLAB is proprietary software used widely in the mathematical sciences and engineering.
Octave is free software similar to MATLAB. The syntax is largely the same, and most MATLAB code can be used in Octave.
Python is a general programming language. It includes some specific libraries for data analysis such as Pandas.
Beginners can have a look at GNU PSPP or Gretl. Intermediate users can check out Stata. Advanced users who like
matrix programming may prefer MATLAB or Octave. Very advanced users may use C or Fortran.
R programming style
R is an object oriented programming language. This means that virtually everything can be stored as an R object.
Each object has a class. This class describes what the object contains and what each function does with it. For
instance, plot(x) produces different outputs depending on whether x is a regression object or a vector.
The assignment symbol is "<-". Alternatively, the classical "=" symbol can be used. The following two statements
are equivalent:
> a <- 2
> a = 2
Comments start with "#" and run to the end of the line:
mean(rnorm(1000)^2)
# This is a comment
5 + 7 # This is also a comment
Commands are normally separated by a newline. If you want to put more than one statement on a line, you can
use the ";" delimiter.
See Also
Google's R Style Guide [4] : a set of rules for R programmers
References
[1] http://cran.r-project.org/web/views/
[2] http://cran.r-project.org/web/packages/
[3] Portable R by Andrew Redd http://sourceforge.net/projects/rportable/
[4] http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
Sample Session
This page is an introduction to the R programming language. It shows how to perform very simple tasks using R.
First you need to have R installed (see the Settings page). If you use Windows or Mac OS, the easiest solution is to
use the R Graphical User Interface (click on its icon). If you use Linux, open a terminal and type R at the command
prompt.
Usually when you open R, you see a message similar to the following in the console:
>
You can type your code after the angle bracket >.
R can be used as a simple calculator, and we can perform any simple computation:
> 2 + 3
[1] 5
> log(2)
[1] 0.6931472
> Height <- c(168, 177, 177, 177, 178, 172, 165, 171, 178, 170) #store a vector
> Height # print the vector
[1] 168 177 177 177 178 172 165 171 178 170
>
> Height[2] # Print the second component
[1] 177
> Height[2:5] # Print the second, the 3rd, the 4th and 5th component
[1] 177 177 177 178
>
> (obs <- 1:10) # Define a vector as a sequence (1 to 10)
[1] 1 2 3 4 5 6 7 8 9 10
>
> Weight <- c(88, 72, 85, 52, 71, 69, 61, 61, 51, 75)
>
> BMI <- Weight/((Height/100)^2) # Performs a simple calculation using vectors
> BMI
[1] 31.17914 22.98190 27.13141 16.59804 22.40879 23.32342 22.40588 20.86112
[9] 16.09645 25.95156
We can also describe the vector with length(), mean() and var().
> length(Height)
[1] 10
> mean(Height) # Compute the sample mean
[1] 173.3
> var(Height)
[1] 22.23333
> plot(Height,Weight,ylab="Weight",xlab="Height",main="Corpulence")
You can save an R session (all the objects in memory) and load the session.
> save.image(file="~/Documents/Logiciels/R/test.rda")
> load("~/Documents/Logiciels/R/test.rda")
We can define a working directory. Note for Windows users: R uses forward slashes ("/") in paths instead of
backslashes ("\").
> setwd("~/Desktop") # Sets working directory (character string enclosed in "...")
> getwd() # Returns current working directory
[1] "/Users/username/Desktop"
> dir() # Lists the content of the working directory
> 0/0
[1] NaN
> 1/0
[1] Inf
We can exit R using q(). The "no" argument specifies that the workspace is not saved.
q("no")
Manage your workspace
Basic functions
ls() lists the objects in your workspace.
list.files() (or dir()) lists the files in the working directory.
rm() removes objects from your workspace; rm(list = ls()) removes them all.
Each object can be saved to the disk using the save() function. They can then be loaded into memory using
load().
# assume you want to save an object called 'df'
save(df, file = "file.Rda")
...
load("file.Rda")
Memory usage
Note: memory.size() and memory.limit() are Windows-specific; in R version 2.15.2 on Linux and Mac they only produce a warning.
memory.size() gives the total amount of memory currently used by R.
> memory.size()
[1] 10.18
memory.limit() without any argument gives the memory limit of the R process. It can also be used to increase
the limit; the maximum amount is limited by the memory of the computer.
> memory.limit()
[1] 1535
> memory.limit(size=2000) # 2000 stands for 2000 MB
[1] 2000
object.size() returns the size of an R object. You can print the result and choose the unit (bytes, kilobytes,
megabytes, etc.).
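For instance:

```r
x <- rnorm(1e6)                      # a vector of one million doubles
print(object.size(x), units = "MB")  # roughly 7.6 Mb
```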
> memory.profile()
NULL symbol pairlist closure environment promise
1 4959 61794 1684 255 3808
language special builtin char logical integer
14253 46 687 5577 2889 4060
double complex character ... any list
523 1 11503 0 0 1024
expression bytecode externalptr weakref raw S4
1 0 497 117 118 642
gc() triggers a garbage collection and reports how much memory is currently in use:
gc()
External links
Dumping functions from the global environment into an R script file (http://www.r-statistics.com/2010/09/
dumping-functions-from-the-global-environment-into-an-r-script-file/)
Settings
This page shows how to install R, customize it and choose a working environment. Once you have installed R, you
may want to choose a working environment. This can be a simple text editor (such as Emacs, Vim or Gedit), an
integrated development environment (IDE) or a graphical user interface (GUI). RStudio is now a popular option.
Installation
Linux
Installing R on Debian-based GNU/Linux distributions (e.g. Ubuntu or Debian itself) is as simple as typing
sudo aptitude install r-base or sudo apt-get install r-base (note that this has to
be done as root), or installing the package r-base using your favourite package manager, for example Synaptic.
There are also many packages extending R for different purposes. Their names begin with r-. Take a closer look
at the package r-recommended. It is a metapackage that depends on a set of packages that are recommended by
the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this
package, as it depends on r-base.
Installation with apt-get (Debian, Ubuntu and all Linux distributions based on Debian)
Installation with aptitude (Debian, Ubuntu and all Linux distributions based on Debian)
Mac OS
Installation: Visit the R project website (http://r-project.org/), select the "CRAN" page and choose a mirror.
Download the disk image (dmg file) and install R.
The default graphical user interface for Mac is much better than the one for Windows. It includes
a dataframe manager,
a history of all commands,
a program editor which supports syntax highlighting.
Windows
(Section source [1])
Download
To install R under the Windows operating system you have to download the binaries from the web. First go to
r-project.org [2], click CRAN under the download section on the left panel and select a mirror site from which to
download the required content. The best idea is to pick a mirror close to your actual geographical location, but
other ones should work as well. Then click Windows and, in the subdirectories, base. The Windows binary is the exe
file, in the form R-x.x.x-win32.exe, where x denotes the actual version of the program. Regardless of the version,
the setup has the same steps.
Setup
As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems.
However, there are a few things that you can alter.
1. On the welcome screen click Next.
2. Read or just notice the GNU license, and click Next.
3. Select the location where R should be installed. If you have no preferred location on your hard disk,
the default choice will be OK for you.
4. During the next step you can specify which parts of R you want to install. Choices are: User installation, Minimal
user installation, Full installation and Custom installation. Notice the required space under the selection panel
(varies between 20 and 66 MB). If you are a beginner in R, choose the default User installation.
5. In this step you can choose between two ways. If you accept the defaults, you skip the three "extra" steps during
installation (see below).
6. You can specify the Start menu folder.
7. In the next step you can choose between shortcut possibilities (desktop icon and/or quick launch icon) and
specify registry entries.
With these steps you can customize the R graphical user interface.
You can choose whether you want an R graphical user interface covering the whole screen (MDI) or a smaller
window (SDI).
You can select the style in which the Help screen is displayed in R. You will use help a lot, so this may be an
important decision. It is up to you which style you prefer. Please note that the content of the help file will be the
same regardless of your choice; here you specify just the appearance of that particular window.
In the next step you can specify whether you want to use internet2.dll. If you are a beginner, pick the Standard
option here.
Update
Updating R on Windows requires several steps:
1. Downloading/installing the latest version of R
2. Copying your packages from the library folder to the one in the new R installation [3]
Both of these steps can easily be done using the installr package, by running the following command (which
both installs the package and updates R) [4]:
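The command itself did not survive in this copy; based on the installr package's documented workflow, it presumably looks like this (updateR() is the function provided by installr):

```r
# requires an internet connection; updateR() opens an interactive wizard on Windows
install.packages("installr")
library(installr)
updateR()
```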
There is also the possibility of using a "global" package library, see here [5] for more details.
Working environment
Once you have installed R, you need to choose a working environment. In this section, we review the possible
working environments: a basic terminal, integrated development environments (hereafter IDE), text editors and
graphical user interfaces (hereafter GUI).
A graphical user interface provides menus which make it possible to run R without writing code. This is a
good solution for beginners.
A text editor makes it easy to write code.
An integrated development environment provides a text editor and tools to run and debug code, which makes it
easy to write R scripts, to run them and to correct them.
Note that there are some task-specific GUIs. For instance, speedR provides a GUI to import data into R.
Terminal
For Linux and Mac OS users it is possible to use R from
the terminal.
$ R
> q("no") # to leave R and return to the terminal
R Gui
For Mac OS and Windows users, there is a graphical user interface. On Mac OS, the GUI includes a package
manager, a program editor with syntax highlighting and a data browser. On Windows, the GUI is no better than a
terminal.
The pmg package provides another simple GUI:
> library(pmg)
R commander
Rcommander [9], developed by John Fox, provides a menu in the standard graphical user interface (screenshots [10]).
It works on Linux, Mac and Windows.
It is a good interface for beginners and for people who are not used to script editing.
Ubuntu users can also install R Commander from the software center.
RStudio
RStudio [11][12] is a free, cross-platform IDE for R [13][14].
RKward
RKward is an IDE and a GUI for Linux (KDE) (Screenshots [15]). RKWard aims to provide an easily extensible,
easy to use IDE/GUI for R. RKWard tries to combine the power of the R-language with the (relative) ease of use of
commercial statistics tools.
Rattle GUI
Rattle [17] is a data mining GUI for Linux, Windows and Mac (screenshots [18])[19].
Tinn R
For Windows only
Tinn-R [20] is a good IDE for Windows users. One can easily define keyboard shortcuts to execute selected R code
from Tinn-R.
WinEdt
How to use R for Windows with the RWinEdt extension?, by Andy Eggers [26]
WinEdt is not open source.
WinEdt is for Windows only.
Install the RWinEdt package.
Customizing R
R profile
R can be customized using the Rprofile file. On Linux, this file is stored in the home directory. You can edit it by
running the following command in a terminal:
$ gedit ~/.Rprofile
If you use some packages very often, you can load them systematically using the Rprofile file. You can also change
the default options.
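As an illustration, a minimal ~/.Rprofile might look like this (the options chosen here are arbitrary examples, not recommendations):

```r
# ~/.Rprofile -- evaluated at the start of every session
options(digits = 4)                 # print fewer significant digits
options(show.signif.stars = FALSE)  # drop significance stars in model summaries
if (interactive()) {
  suppressMessages(library(stats))  # load your frequently used packages here
}
```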
Options
The function options() without any argument shows all options:
> options()
> Sys.setlocale()
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/en_US.UTF-8"
By default, error messages are in the local language. However, it is possible to get them in English using
Sys.setenv():
Sys.setenv(LANGUAGE='en')
References
[1] This section was imported from the Wikiversity project Installation, How to use R course
[2] http://www.r-project.org/
[3] http://cran.r-project.org/web/packages/installr/
[4] Updating R from R (on Windows) using the {installr} package (http://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/)
[5] http://www.r-statistics.com/2010/04/changing-your-r-upgrading-strategy-and-the-r-code-to-do-it-on-windows/
[6] Portable R http://sourceforge.net/projects/rportable/
[7] http://jgr.markushelbig.org/JGR.html
[8] http://www.rforge.net/JGR/screenshots.html
[9] http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/
[10] http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/Rcmdr-screenshot.html
[11] http://www.rstudio.com/
[12] rstudio.org (http://www.rstudio.org/)
[13] John Verzani, "Getting Started with RStudio: An Integrated Development Environment for R", O'Reilly Media, September 2011
[14] Jeffrey Racine (forthcoming), "RStudio: A Platform-Independent IDE for R and Sweave", Journal of Applied Econometrics
[15] http://sourceforge.net/apps/mediawiki/rkward/index.php?title=Screenshots
[16] StatET: http://www.walware.de/goto/statet
[17] Rattle: http://rattle.togaware.com/
[18] http://rattle.togaware.com/rattle-screenshots.html
[19] Graham J Williams. Rattle: A Data Mining GUI for R. The R Journal, 1(2):45-55, December 2009
[20] Tinn stands for Tinn Is Not Notepad http://www.sciviews.org/Tinn-R/
[21] Note that Notepad++ can be installed on a USB stick http://sourceforge.net/projects/notepadpluspe/
[22] NPPtoR is also portable software http://sourceforge.net/projects/npptor/
[23] http://www.vim.org/scripts/script.php?script_id=2628
[24] ESS: http://ess.r-project.org/
[25] Vincent Goulet's Emacs page http://vgoulet.act.ulaval.ca/emacs
[26] http://www.people.fas.harvard.edu/~aeggers/RWinEdt_installation.pdf
Documentation
Obtaining Help
For each package, a reference manual is available as an HTML file from within R or as a PDF on the CRAN
website. There are also often vignettes and comprehensive articles in the R Journal, the Journal of Statistical
Software, etc.
> library(help="package_name")
> vignette("np", package="np")
> vignette(all=FALSE) # vignettes for all attached packages
> vignette(all=TRUE) # vignettes for all packages on the computer
You can search for help inside all loaded packages using help() or ?. Usually you do not need to add quotes to
function names, but sometimes it can be useful. args() gives the full syntax of a function.
> help(lm)
> ?lm
> ?"for"
> ?"[["
> args("lm")
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
NULL
apropos() and find() look for all the functions in the loaded packages whose names match a keyword or a regular
expression[1].
> apropos("norm")
[1] "dlnorm" "dnorm" "plnorm"
[4] "pnorm" "qlnorm" "qnorm"
[7] "qqnorm" "qqnorm.default" "rlnorm"
[10] "rnorm" "normalizePath"
You can search for help in all installed packages using help.search() or its shortcut ??.
> ??"lm"
> help.search("covariance")
RSiteSearch() looks for help in all packages and in the R mailing lists. The sos package improves the
RSiteSearch() function with the findFn() function. ??? is a wrapper for findFn().
> RSiteSearch("spline")
> library("sos")
> findFn("spline", maxPages = 2)
> ???"spline"(2)
Handouts
An Introduction to R [2], the R reference manual
Robert Kabacoff's Quick R [3]
Grant Farnsworth's Econometrics in R [4] The best introduction for an economist (about 20 pages)
UCLA R Computing Resources [5]
A Handbook of Statistical Analyses Using R [6] by Brian S. Everitt and Torsten Hothorn
fr+en Arthur Charpentier's R for actuaries [7]
Dan Goldstein's video tutorial [8]
fr Julien Barnier's introduction to R for sociologists [9]
Rosetta Code [10] presents solutions to the same task in different programming languages.
'R language for programmers', by John Cook [11]
A Brief Guide to R for Beginners in Econometrics [12]
R Tutorial by Kelly Black [13]
Reference Sheet
List of Cheat Sheets [14]
Daniel Kaplan Reference Sheet [15]
Teaching Resources
François Briatte has a nice introduction to data analysis using R[16]
Simon Jackman Political Methodology Classes [17]
Jonathan Katz Political Methodology Classes [18]
A Brief Guide to R for Beginners in Econometrics [12]
PRISM luncheons [19]
Statistical Analysis: an Introduction using R - which includes a course on R
Biostatistics with R, an R companion to Wayne Daniel's Biostatistics book
Blogs
Planet R [20] the first R blog aggregator
R Bloggers [21] The news pulse for the R blogosphere
"R" you Ready ? [22]
One R Tip a Day [23]
Revolution computing blog [24]
Yu Sung Su's Blog:R [25]
(fr) Freakonometrics (in French) [26] lots of code chunks
(fr) Baptiste Coulmont (in French) [27]
(fr) Quanti Sciences Sociales (in French) [28] R blog for sociologists
Journals
The R Journal [29]
Journal of Statistical Software [30] contains lots of articles on R packages.
The Political Methodologist [31] contains lots of articles on R for political scientists.
Books
Venables and Ripley : Modern Applied Statistics with S [32]
A very good introduction to R covering numerous topics.
A Handbook of Statistical Analyses Using R [6] (Brian S. Everitt and Torsten Hothorn, Chapman & Hall/CRC,
2008)
An Introduction to Data Technologies, by Paul Murrell [33]
Everything you need to know about data management
A first course in statistical programming with R, John Braun and Duncan Murdoch.
Peter Dalgaard (2009). ISwR: Introductory Statistics with R. R package version 2.0-4. http://CRAN.R-project.
org/package=ISwR
Springer Use R Series [34]
John Fox : An R and S-PLUS Companion to Applied Regression [35]
Gelman and Hill : Data Analysis Using Regression and Multilevel/Hierarchical Models [36]
Search Engine
R seek [41]
Google Code Search with the keyword "lang:r" [42] gives access to R programs matching the query. For instance,
the request optim lang:r [43] gives access to all the R programs containing optim.
Q&A / Forums
Nabble R http://r.789695.n4.nabble.com/
Stackoverflow [44]
The #rstats hashtag [45] on Twitter
IRC: #r@freenode
r-soc [46] : mailing list for French sociologists
References
[1] If you want to know more about regular expressions, have a look at the Regular expressions section in the Text Processing page.
[2] http://cran.r-project.org/doc/manuals/R-intro.html
[3] http://www.statmethods.net/
[4] http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf
[5] http://www.ats.ucla.edu/stat/r/
[6] http://cran.r-project.org/web/packages/HSAUR2/index.html
[7] http://perso.univ-rennes1.fr/arthur.charpentier/slides-R.pdf
[8] http://www.decisionsciencenews.com/?p=261
[9] http://alea.fr.eu.org/j/intro_R.html
[10] http://rosettacode.org/wiki/Category:R
[11] http://www.johndcook.com/R_language_for_programmers.html
[12] http://people.su.se/~ma/R_intro/R_intro.pdf
[13] http://www.cyclismo.org/tutorial/R/
[14] http://devcheatsheet.com/tag/r/
[15] http://www.macalester.edu/~kaplan/ISM/r-commands.pdf
[16] Introduction to Data Analysis (http://f.briatte.org/teaching/ida/)
[17] http://jackman.stanford.edu/classes/index.php
[18] http://jkatz.caltech.edu/classes/ss228.html
[19] http://polisci.osu.edu/prism/luncheons.htm
[20] http://planetR.stderr.org
[21] http://www.r-bloggers.com/
[22] http://ryouready.wordpress.com/
[23] http://onertipaday.blogspot.com/
[24] http://blog.revolution-computing.com/
[25] http://yusung.blogspot.com/search/label/R
[26] http://freakonometrics.blog.free.fr/index.php
[27] http://coulmont.com/blog/tag/r/
[28] http://quanti.hypotheses.org/tag/r/
[29] http://journal.r-project.org/current.html
[30] http://www.jstatsoft.org/
[31] http://polmeth.wustl.edu/thepolmeth.php
[32] http://www.stats.ox.ac.uk/pub/MASS4/
[33] http://www.stat.auckland.ac.nz/~paul/ItDT/HTML/
Control Structures
Conditional execution
Help for programming:
> ?Control
> if (condition) {
+   statement
+ } else {
+   alternative
+ }
The condition must be a single logical value: TRUE or FALSE (T or F, or 1 or 0 also work), or a statement using
the comparison operators:
x == y "x is equal to y"
x != y "x is not equal to y"
x > y "x is greater than y"
x < y "x is less than y"
x <= y "x is less than or equal to y"
x >= y "x is greater than or equal to y"
These can be combined using the & or && operators for AND and the | or || operators for OR.
> if(TRUE){
+ print("This is true")
+ }
[1] "This is true"
> x <- 2 # x gets the value 2
> if (x == 3) {
+   print("This is true")
+ } else {
+   print("This is false")
+ }
[1] "This is false"
> y <- 4 # y gets the value 4
> if(x==2 && y>2){
+ print("x equals 2 and y is greater than 2")
+ }
[1] "x equals 2 and y is greater than 2"
The ifelse() command takes as first argument the condition, as second argument the treatment if the condition
is true and as third argument the treatment if the condition is false. In this case, the condition can be a vector. For
instance, we generate a sequence from 1 to 10 and we want to display the values which are lower than 5 or greater
than 8.
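The example itself is missing from this copy; one way to write it (using NA for the values that are not displayed) is:

```r
x <- 1:10
ifelse(x < 5 | x > 8, x, NA)   # keeps 1 2 3 4 and 9 10, NA elsewhere
```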
Sets
R has some very useful handlers for sets to select a subset of a vector:
> x = runif(10)
> x<.5
[1] TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
> x
[1] 0.32664759 0.57826623 0.98171138 0.01718607 0.24564238 0.62190808 0.74839301
[8] 0.32957783 0.19302650 0.06013694
> x[x<.5]
[1] 0.32664759 0.01718607 0.24564238 0.32957783 0.19302650 0.06013694
> x = 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x[-1:-5]
[1] 6 7 8 9 10
Loops
Implicit loops
R has support for implicit loops, which is called vectorization. This is built into many functions and standard
operators. For example, the + operator can add two arrays of numbers without the need for an explicit loop.
Explicit loops are generally slow, and it is better to avoid them when possible.
[Figure: Example of fast code using vectorisation]
apply() applies a function to the elements of a matrix or an array, either the rows of a matrix (1) or the columns (2).
lapply() applies a function to each column of a dataframe and returns a list.
sapply() is similar, but the output is simplified. It may be a vector or a matrix depending on the function.
tapply() applies the function for each level of a factor.
> N <- 10
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> male <- rbinom(N,1,.48)
> y <- 1 + x1 + x2 + male + rnorm(N)
> mydat <- data.frame(y,x1,x2,male)
> lapply(mydat,mean) # returns a list
$y
[1] 3.247
$x1
[1] 0.1415
$x2
[1] 1.29
$male
[1] 0.5
See also aggregate() which is similar to tapply() but is applied to a dataframe instead of a vector.
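For instance, rebuilding a small data frame like the one above (with hypothetical randomly drawn data), tapply() computes the mean of y for each level of male:

```r
set.seed(42)
male  <- rbinom(10, 1, 0.5)
y     <- 1 + male + rnorm(10)
mydat <- data.frame(y, male)
tapply(mydat$y, mydat$male, mean)              # one mean of y per level of male
aggregate(y ~ male, data = mydat, FUN = mean)  # same result, returned as a data frame
```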
Explicit loops
R provides three ways to write loops: for, repeat and while. The for statement is the simplest: you define an
index (here k) and a vector (in the example below, the vector is 1:5) and you specify the action you want between
braces.
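The example the text refers to is missing from this copy; a loop matching the description (index k, vector 1:5) would be:

```r
for (k in 1:5) {
  cat("iteration", k, "\n")   # prints the current value of k
}
```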
When it is not possible to use a for statement, you can use while or repeat together with break, by specifying a
stopping rule. One should be careful with this kind of loop, since if the stopping rule is misspecified the loop will
never end. In the two examples below, draws are taken from the standard normal distribution as long as the value
is lower than 1. The cat() function is used to display the current value on screen.
> repeat {
+ g <- rnorm(1)
+ if (g > 1.0) break
+ cat(g,"\n")
+ }
-1.214395
0.6393124
0.05505484
-1.217408
> g <- 0
> while (g < 1){
+ g <- rnorm(1)
+ cat(g,"\n")
+ }
-0.08111594
0.1732847
-0.2428368
0.3359238
-0.2080000
0.05458533
0.2627001
1.009195
Iterators
Loops in R are generally slow; iterators may be more efficient than loops. See this entry on the Revolution
Computing blog [1].
References
[1] http://blog.revolution-computing.com/2009/07/counting-with-iterators.html
Working with functions
You can display the source code of a function by typing its name at the prompt; page() shows it in a separate
window. With the TinnR package, trCopy() can be used to copy it for use with the Tinn-R editor.
> lm
> page(lm)
> library(TinnR)
> trCopy(lm)
Returning an object
By default the value of the last expression is returned. In the following example, we have a simple function with
two objects. The last one is returned.
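The example itself is missing from this copy; a minimal version consistent with the description is:

```r
f <- function() {
  a <- 1
  b <- 2
  b          # the value of the last line is what f() returns
}
f()          # [1] 2
```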
Adding arguments
It is possible to add arguments.
The ... argument means that you can add other arguments which will be passed to functions inside the function.
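As an illustration of the ... mechanism (the helper name trimmed_mean is made up for this sketch):

```r
# arguments in ... are forwarded to mean() inside the function
trimmed_mean <- function(x, ...) mean(x, trim = 0.1, ...)
trimmed_mean(c(1, 2, 3, NA), na.rm = TRUE)   # na.rm travels through ...; [1] 2
```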
For estimation commands it is possible to add formulas as arguments. For instance, we can create our own function
for ordinary least square using a formula interface.
N <- 100
u <- rnorm(N)
x <- rnorm(N) + 1
y <- 1 + x + u
ols(y~x)
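The definition of ols() itself is missing from this copy. A minimal sketch consistent with the call above (the internals are an assumption, using the usual matrix algebra) could be:

```r
# hypothetical ols(): builds the design matrix from the formula and
# returns the OLS coefficients (X'X)^(-1) X'y
ols <- function(formula) {
  mf <- model.frame(formula)          # evaluates variables in the formula's environment
  X  <- model.matrix(formula, mf)
  y  <- model.response(mf)
  drop(solve(crossprod(X), crossprod(X, y)))
}
```

With the data generated above, ols(y~x) should agree with coef(lm(y~x)).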
Recursive functions
R supports recursive functions. The function below computes Fibonacci numbers recursively.
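The function is missing from this copy; a standard recursive definition is:

```r
fib <- function(n) if (n < 2) n else fib(n - 1) + fib(n - 2)
fib(10)   # [1] 55
```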
Functions as Objects
R functions can be treated as objects
> a <- function(n) function(a) runif(a)
> b <- a(10)
> b(10)
[1] 0.8726873 0.9512367 0.5971435 0.5540743 0.6378967 0.4030071 0.2750673 0.1777123 0.6960378 0.3969920
This can be useful when you want to create many different kinds of functions.
Higher-order functions
You can use higher-order functions in R. Contrary to common belief, using them instead of loops is not faster,
because the apply functions have a for loop inside their definition. Use them only to improve the clarity of your
code. (Reference: Patrick Burns, The R Inferno, p. 24)
apply
apply is the most basic of R's map functions. lapply, sapply and mapply are convenient interfaces for apply that work
on lists, vectors and multiple vectors respectively.
apply takes as arguments an array, a vector of the dimension to map along and a function. The following example is
based on the apply documentation. It uses apply to compute column and row sums of a matrix.
x <- matrix(round(rnorm(100)),10,10)
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
tapply
tapply is similar to apply, but applies a function to each cell of a ragged array, that is to each (non-empty) group of
values given by a unique combination of the levels of certain factors.
Reduce
Reduce() successively combines the elements of a vector; the example from the Reduce documentation
cumulatively adds them.
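The example is missing here; the cumulative sum from the Reduce documentation is:

```r
Reduce(`+`, 1:5, accumulate = TRUE)   # [1] 1 3 6 10 15
```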
References
[1] http://r.789695.n4.nabble.com/Optional-data-argument-for-a-function-tp850247p850247.html
Debugging
Some basic tips
Use print() statements in your functions to print variable values. Although this technique is considered
low-tech or even old fashioned by some, it can still be a quick and easy way to trace an error.
Place a browser() statement in the function just before the crashing line. When the function is called, it will
be executed up to the browser() line. The command-line interface then switches to the function environment,
so that all variables in the function can be inspected or changed. See below for commands available in
browser() mode.
> myFun()
Error in myFun() : Woops! An error
After an error is raised, the traceback() function allows you to show the call stack leading to the error. For
example, the function below calls myFun.
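The definitions of myFun() and myFun2() are not shown in this copy; versions consistent with the transcripts are:

```r
# hypothetical definitions matching the error messages shown here
myFun  <- function() stop("Woops! An error")
myFun2 <- function() myFun()
```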
> myFun2()
Error in myFun() : Woops! An error
> traceback()
3: stop("Woops! An error")
2: myFun()
1: myFun2()
The traceback() function can be executed automatically each time an error is raised with the option
options(error=traceback). The default behaviour is restored with:
options(error=NULL)
A function can be flagged for debugging with
debug(FUNCTION_NAME)
Then, each time the function is called, a browser in that function's environment is opened so that it can be executed line by line. In the debugging browser, apart from all standard R functionality, the following commands are available.
Command Meaning
c, cont Continue to the end of the current context, e.g. to the end of the loop within a loop or to the end of the function.
Debugging is switched off again with
undebug(FUNCTION_NAME)
Another option is options(error=recover): after an error, a menu of the active calls is shown and one of their environments can be selected for inspection.
> options(error=recover)
> myFun2()
Error in myFun() : Woops! An error
1: myFun2()
2: myFun()
Selection:
By typing '1' or '2' at the Selection: prompt, the browser jumps to the selected environment. Once in the browser, all standard R functionality is at your disposal, as well as the commands in the table below.
Command Meaning
c, cont Exit the browser and continue at the next statement. An empty line will do the same.
where Print a stack trace of active function calls (where are you in the stack?).
Q Exit the browser, do not continue at the next statement but go back to the top-level R browser.
The default error handling is restored with
options(error=NULL)
Using C or Fortran
For some tasks, R can be slow. In that case, it is possible to write a program in C or Fortran and to use it from R.
This page is for advanced programmers only.
See wikiversity Connecting Fortran and R
Link C with R [1]
References
[1] http:/ / yusung. blogspot. com/ 2008/ 08/ link-c-with-r. html
Utilities
This page covers some utility functions. Most of the functions presented here have nothing to do with statistical analysis but may be useful when working on a project. Many of them are similar to standard Unix commands.
System (Unix/DOS)
system() gives access to the operating system shell (Unix or DOS). The option wait=FALSE means that R does not wait for the task to finish.
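A few short sketches (the exact commands depend on your operating system; these assume a Unix-like shell):

```r
system("ls")                          # list the files in the working directory
out <- system("date", intern = TRUE)  # capture the command's output in R
system("sleep 10", wait = FALSE)      # launch a task without blocking R
```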
Some examples:
You can convert an image from PS to PNG using the Unix convert utility. To learn more about this utility, open a Terminal application and type man convert (this should work on Mac OS and Linux).
You can open Stata and run a program.
You can run pdflatex from R and directly open the pdf in a pdf browser.
File Handling
dir() lists all the files in a directory. It is similar to the Unix function ls. dir.create() creates a new
directory. It is similar to mkdir in Unix.
file.info() gives information about a file.
> file.info("taille.txt")
size isdir mode mtime ctime atime exe
taille.txt 444 FALSE 666 2009-06-26 12:25:44 2009-06-26 12:25:43 2009-06-26 12:25:43 no
file.remove() deletes files. For instance, to remove all the .log files in a directory:
file.remove(dir(path="directoryname", pattern="\\.log$"))
Internet
browseURL() opens a URL in a web browser. download.file() downloads a file from the internet.
> browseURL("http://en.wikibooks.org/wiki/R_Programming")
getOption("browser") # shows the current default browser
We can change the default browser using the options() command. It is safer to store the previous options first, so that they can be restored afterwards.
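A minimal sketch (the browser path here is only an example):

```r
old <- options(browser = "/usr/bin/firefox")  # hypothetical browser path
getOption("browser")                          # check the new setting
options(old)                                  # restore the previous setting
```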
You can download a file from the internet using download.file(). Note that very often you don't need to
download a file from the internet and you can directly load it into R from the internet using standard functions. For
instance, if you want to read a text file from the internet, you can use read.table(), scan() or
readLines().
# For example, we download "http://en.wikibooks.org/wiki/R_Programming/Text_Processing" on our Desktop
download.file(url="http://en.wikibooks.org/wiki/R_Programming/Text_Processing",destfile= "~/Desktop/test_processing.html")
Computing time
If you perform computationally intensive tasks, you may want to measure the computing time. Two functions are available: system.time() and proc.time(). Both return a vector of timings (user CPU time, system CPU time, elapsed time and, on some platforms, the times of child processes).
> system.time(x<-rnorm(10^6))
[1] 1.14 0.07 1.83 0.00 0.00
Computing process
user.prompt() (Zelig) pauses the computation (useful if you want to do a demo). waitReturn() (cwhmisc) does the same job. Sys.sleep() suspends execution for a given number of seconds.
> user.prompt()
It is possible to stop the computing process if a logical condition is not true using stopifnot().
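For example:

```r
x <- 10
stopifnot(is.numeric(x), x > 0)  # all conditions TRUE: nothing happens
# stopifnot(x < 0)               # would stop with "x < 0 is not TRUE"
```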
Miscellaneous
trCopy() (TinnR package) copies an object to the clipboard. It is useful for copying a large object, for instance the code of a function that you want to paste into a text editor.
> trCopy(lm)
[1] TRUE
sessionInfo() describes the current session (R version and loaded packages). This function may be useful for reproducible computing. getRversion() gives the current R version, R.version gives more details about the platform, and R.Version() returns the same information as a list.
See Also
See the R.utils package[1]
References
[1] Henrik Bengtsson (2009). R.utils: Various programming utilities. R package version 1.1.7. http:/ / CRAN. R-project. org/ package=R. utils
Estimation utilities
This page deals with methods which are available for most estimation commands. They can be useful for all kinds of regression models.
Formulas
Most estimation commands use a formula interface. The outcome is left of the ~ and the covariates are on the right.
y ~ x1 + x2
It is easy to include a categorical variable as a predictor in a model. If the variable is not already a factor,
one just needs to use the as.factor() function. This creates a set of dummy variables.
y ~ as.factor(x)
For instance, we can use the Star data in the Ecdat package :
library("Ecdat")
data(Star)
summary(lm(tmathssk ~ as.factor(classk), data = Star))
I() takes its argument "as is". For instance, if you want to include a modified variable in your equation, such as a squared term or the sum of two variables, you may use I().
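For instance, to include a squared term in a regression (simulated data for illustration):

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + x + x^2 + rnorm(100)
fit <- lm(y ~ x + I(x^2))  # I() protects ^ from its special formula meaning
names(coef(fit))           # "(Intercept)" "x" "I(x^2)"
```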
It is easy to include interactions between variables by using : or *. x1:x2 adds only the interaction term, whereas x1*x2 adds the interaction term and the individual terms.
It is also possible to generate polynomials using the poly() function with option raw = TRUE.
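These expansions can be checked with model.matrix() (simulated data for illustration):

```r
set.seed(1)
d <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
colnames(model.matrix(y ~ x1 * x2, d))
# "(Intercept)" "x1" "x2" "x1:x2"  -- * adds main effects and the interaction
colnames(model.matrix(y ~ poly(x1, 2, raw = TRUE), d))
# intercept plus the linear and squared terms in x1
```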
There is also an advanced formula interface which is useful for instrumental variables models and mixed models. For
instance ivreg() (AER) uses this advanced formulas interface. The instrumental variables are entered after the |.
See the Instrumental Variables section if you want to learn more.
library("AER")
ivreg(y ~ x | z)
Output
In addition to the summary() and print() functions which display the output for most estimation commands,
some authors have developed simplified output functions. One of them is the display() function in the arm
package. Another one is the coefplot() in the arm package which displays the coefficients with confidence
intervals in a plot. Following the standards proposed by Nathaniel Beck[1], Jeff Gill developed
graph.summary()[2]. This command omits rarely useful auxiliary statistics.
Delta Method
If you want to know the standard error of a transformation of one of your parameters, you can use the delta method:
deltamethod() in the msm package[3].
delta.method() in the alr3 package.
deltaMethod() in the car package.
References
[1] Nathaniel Beck "Making regression and related output more helpful to users" The Political Methodologist 2010 http:/ / politics. as. nyu. edu/
docs/ IO/ 2576/ beck_tpm_edited. pdf
[2] Jeff Gill graph.summary() http:/ / artsci. wustl. edu/ ~jgill/ Models/ graph. summary. s
[3] See the example on the UCLA Statistics webpage : http:/ / www. ats. ucla. edu/ stat/ r/ faq/ deltamethod. htm
[4] Kosuke Imai, Gary King and Olivia Lau (2009). Zelig: Everyone's Statistical Software. R package version 3.4-5. http:/ / CRAN. R-project.
org/ package=Zelig
Packages
An R package includes a set of functions and datasets. Packages are often developed as supplementary material to
books. For instance the MASS package was developed by Venables and Ripley for their book Modern Applied
Statistics with S, and the car package was developed by John Fox for his book An R and S-PLUS Companion to
Applied Regression.
Load a package
A package is loaded into the current R environment using the library() function. A list of functions and
datasets included in a package can be obtained with the help argument of the library() function, e.g. library(help = "MASS").
A package can be detached from the current environment by using the detach() function :
> detach("package:prettyR")
Without any arguments, the library() function lists all of the packages currently available to the user. env()
(gdata) describes all loaded environments (i.e. packages). search() gives the list of all loaded packages.
> library() # returns the description of all the packages available on the computer
> dir(.libPaths()) # returns the name of all the packages available on the computer (quicker than the previous one)
> search()
> env(unit="MB")
> current.packages("sem")
> .libPaths()
[1] "/Users/username/Library/R/library"
[2] "/Library/Frameworks/R.framework/Resources/library"
> .libPaths("W:/AppData/R/library")
> install.packages("faraway")
> install.packages("rgrs", lib="W:/AppData/R/library" ,
repos=c("http://r-forge.r-project.org","http://cran.fr.r-project.org/"),
dep=TRUE)
Stay up to date
If you want to be aware of the latest packages, type new.packages() in R or visit the Revolution Computing Blog [24], which publishes each month a list of the new and updated packages.
> new.packages() # displays all the packages available in the repositories
> update.packages() # updates all the packages installed with the newest version available in the repositories
> install.packages("ctv")
> library("ctv")
> install.views("Econometrics")
> update.views("Econometrics")
Building R Packages
You can write your own R packages. All packages submitted to CRAN (or Bioconductor) must follow
specific guidelines, including the folder structure of the package and files such as DESCRIPTION and NAMESPACE.
See Friedrich Leisch's introduction (PDF [1] 20 pages)[2]
See also Duncan Murdoch's tools for building packages using Windows[3]
References
[1] http:/ / cran. r-project. org/ doc/ contrib/ Leisch-CreatingPackages. pdf
[2] Friedrich Leisch Creating R Packages : A Tutorial http:/ / cran. r-project. org/ doc/ contrib/ Leisch-CreatingPackages. pdf
[3] http:/ / www. r-project. org/ conferences/ useR-2008/ slides/ Murdoch. pdf
Data types
Vectors are the simplest R objects, an ordered list of primitive R objects of a given type (e.g. real numbers, strings,
logicals). Vectors are indexed by integers starting at 1. Factors are similar to vectors but where each element is
categorical, i.e. one of a fixed number of possibilities (or levels). A matrix is like a vector but with a specific
instruction for the layout such that it looks like a matrix, i.e. the elements are indexed by two integers, each starting
at 1. Arrays are similar to matrices but can have more than 2 dimensions. A list is similar to a vector, but the
elements need not all be of the same type. The elements of a list can be indexed either by integers or by named
strings, i.e. an R list can be used to implement what is known in other languages as an "associative array", "hash
table", "map" or "dictionary". A dataframe is like a matrix but does not assume that all columns have the same type.
A dataframe is a list of variables/vectors of the same length. Classes define what objects of a certain type look like.
Classes are attached to objects as an attribute. Every R object has a class and a type, and may have dimensions.
> class(object)
> typeof(object)
> dim(object)
Vectors
You can create a vector using the c() function, which concatenates elements. You can create a sequence
using the : operator or the seq() function. For instance, 1:5 gives all the integers from 1 to 5. The seq()
function lets you specify the interval between successive numbers. You can repeat a pattern using the
rep() function. You can also create a numeric vector of zeros using numeric(), a character vector of
empty strings using character() and a logical vector of FALSE values using logical().
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> c("a","b","c","d","e")
[1] "a" "b" "c" "d" "e"
> c(T,F,T,F)
[1] TRUE FALSE TRUE FALSE
> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> seq(1,5)
[1] 1 2 3 4 5
> seq(1,5,by=.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> rep(1,5)
[1] 1 1 1 1 1
> rep(1:2,5)
[1] 1 2 1 2 1 2 1 2 1 2
> numeric(5)
[1] 0 0 0 0 0
> logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
> character(5)
[1] "" "" "" "" ""
The length() function computes the length of a vector. last() (sfsmisc [1]) returns the last element of a vector, but this can also be achieved without an extra package.
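For instance:

```r
x <- c(4, 8, 15, 16, 23, 42)
length(x)     # 6
x[length(x)]  # 42, the last element, without any extra package
tail(x, 1)    # the same
```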
Factors
factor() transforms a vector into a factor. A factor can also be ordered, using the option ordered=T or the
function ordered(). levels() returns the levels of a factor. gl() generates factors: n is the number of
levels, k the number of repetitions of each level and length the total length of the factor; labels is optional
and gives labels to each level.
Factors can be most easily thought of as categorical variables. An important function for summarizing factors is the
table() function. When considering the types of statistical data (nominal, ordinal, interval and ratio), factors can
be nominal, ordinal or interval. Nominal factors are categorical names, for example country names paired with some
other information. An example of an ordinal factor would be a set of race times for a particular athlete paired with
the athlete's finishing place (first, second, ...). When summarizing such a factor, see the ordinal example below for
how to order the levels yourself. Finally, an example of interval-level factors would be age brackets such as
"20 - 29", "30 - 39", etc. In general, R can automatically order numbers stored as factors appropriately, but a
programmer may use the same techniques to order the levels in the manner most appropriate to their application.
See also is.factor(), as.factor(), is.ordered() and as.ordered().
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes no yes maybe maybe no maybe no no
Levels: maybe no yes
>
> factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"), ordered = T)
[1] yes no yes maybe maybe no maybe no no
Levels: maybe < no < yes
>
> ordered(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
[1] yes no yes maybe maybe no maybe no no
Levels: maybe < no < yes
>
> ordered(as.factor(c("First","Third","Second","Fifth","First","First","Third")),
+ levels = c("First","Second","Third","Fourth","Fifth"))
[1] First Third Second Fifth First First Third
Levels: First < Second < Third < Fourth < Fifth
>
> gl(n=2, k=2, length=10, labels = c("Male", "Female")) # generate factor levels
[1] Male Male Female Female Male Male Female Female Male Male
Levels: Male Female
Matrix
If you want to create a new matrix, one way is to use the matrix() function. You enter a vector of
data, the number of rows and/or columns, and you can specify whether R should fill the matrix by row or
by column (the default).
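For example:

```r
matrix(1:6, nrow = 2)                # filled column by column (the default)
matrix(1:6, nrow = 2, byrow = TRUE)  # filled row by row
```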
Functions cbind() and rbind() combine vectors into matrices in a column by column or row by row mode:
> rbind(v1,v2)
[,1] [,2] [,3] [,4] [,5]
v1 1 2 3 4 5
v2 5 4 3 2 1
The dimensions of a matrix can be obtained using the dim() function. Alternatively, nrow() and ncol()
return the number of rows and columns of a matrix:
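For example:

```r
X <- matrix(1:15, nrow = 3)
dim(X)   # 3 5
nrow(X)  # 3
ncol(X)  # 5
```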
> t(X)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 1 6
[2,] 2 7 12 2 7
[3,] 3 8 13 3 8
[4,] 4 9 14 4 9
[5,] 5 10 15 5 10
> a=matrix(2,2,2)
> a
[,1] [,2]
[1,] 2 2
[2,] 2 2
> a = rbind(a,c("A","A"))
> a
[,1] [,2]
[1,] "2" "2"
[2,] "2" "2"
[3,] "A" "A"
Arrays
An array is composed of n dimensions where each dimension is a vector of R objects of the same type. An array of
one dimension of one element may be constructed as follows.
> x = array(c(T,F),dim=c(1))
> print(x)
[1] TRUE
The array x was created with a single dimension (dim=c(1)) drawn from the vector of possible values c(T,F). A
similar array, y, can be created with a single dimension and two values.
> y = array(c(T,F),dim=c(2))
> print(y)
[1] TRUE FALSE
> z = array(1:27,dim=c(3,3,3))
> dim(z)
[1] 3 3 3
> print(z)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27
R arrays are accessed in a manner similar to arrays in other languages: by integer index, starting at 1 (not 0). The
following code shows how the third slice of the 3 by 3 by 3 array can be accessed; it is a 3 by 3 matrix.
> z[,,3]
[,1] [,2] [,3]
[1,] 19 22 25
[2,] 20 23 26
[3,] 21 24 27
> z[,3,3]
[1] 25 26 27
> z[3,3,3]
[1] 27
> z[,c(2,3),c(2,3)]
, , 1
[,1] [,2]
[1,] 13 16
[2,] 14 17
[3,] 15 18
, , 2
[,1] [,2]
[1,] 22 25
[2,] 23 26
[3,] 24 27
Arrays need not have the same extent in every dimension. The following code creates a 3 by 3 by 2 array, i.e. a pair of 3 by 3 matrices.
> w = array(1:18,dim=c(3,3,2))
> print(w)
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18
The elements composing an array must all be of the same type, but they need not be numbers.
> u = array(c(T,F),dim=c(3,3,2))
> print(u)
, , 1

      [,1]  [,2]  [,3]
[1,]  TRUE FALSE  TRUE
[2,] FALSE  TRUE FALSE
[3,]  TRUE FALSE  TRUE

, , 2

      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE  TRUE
[3,] FALSE  TRUE FALSE
Lists
A list is a collection of R objects. list() creates a list and unlist() transforms a list into a vector. The objects in a
list need not be of the same type or length.
> x = list(1, FALSE, matrix(1:4, 2, 2, byrow = TRUE))
> x
[[1]]
[1] 1

[[2]]
[1] FALSE

[[3]]
     [,1] [,2]
[1,]    1    2
[2,]    3    4
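unlist() flattens a list into a vector, coercing all elements to a common type:

```r
l <- list(a = 1:2, b = "text")
unlist(l)  # a named character vector: "1" "2" "text"
```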
> a = list()
> a
list()
> a[[1]] = "A"
> a
[[1]]
[1] "A"
> a[[2]]="B"
> a
[[1]]
[1] "A"
[[2]]
[1] "B"
By name:
> a = list()
> a
list()
> a$fruit = "Apple"
> a
$fruit
[1] "Apple"
> a$color = "green"
> a
$fruit
[1] "Apple"
$color
[1] "green"
> a = list()
> a[[1]] = "house"
> a$park = "green's park"
> a
[[1]]
[1] "house"
$park
[1] "green's park"
> a[[1]] = list(a[[1]], address = "1 main st.")
> a
[[1]]
[[1]][[1]]
[1] "house"
[[1]]$address
[1] "1 main st."
$park
[1] "green's park"
Using the scoping rules in R one can also dynamically name and create list elements
> a = list()
> n = 1:10
> fruit = paste("number of coconuts in bin",n)
> my.number = paste("I have",10:1,"coconuts")
> for (i in 1:10)a[fruit[i]] = my.number[i]
> a$'number of coconuts in bin 7'
[1] "I have 4 coconuts"
Data Frames
A dataframe has been referred to as "a list of variables/vectors of the same length". In the following example, a
dataframe of two vectors is created, each of five elements. The first vector, v1, is composed of a sequence of the
integers 1 through 5. A second vector, v2, is composed of five logical values (T and F). The dataframe
is then created, composed of the vectors. The columns of the data frame can be accessed using integer subscripts or
the column name and the $ symbol.
> v1 = 1:5
> v2 = c(T,T,F,F,T)
> df = data.frame(v1,v2)
> print(df)
v1 v2
1 1 TRUE
2 2 TRUE
3 3 FALSE
4 4 FALSE
5 5 TRUE
> df[,1]
[1] 1 2 3 4 5
> df$v2
[1] TRUE TRUE FALSE FALSE TRUE
The dataframe may also be created directly. In the following code, each vector composing the dataframe is named as part of the argument list.
> df = data.frame(foo=1:5,bar=c(T,T,F,F,T))
> print(df)
foo bar
1 1 TRUE
2 2 TRUE
3 3 FALSE
4 4 FALSE
5 5 TRUE
External links
data.frame objects in R [2] (a sample chapter from the R in Action book)
Aggregation and Restructuring of data.frame objects [3] (a sample chapter from the R in Action book)
References
[1] http:/ / cran. r-project. org/ web/ packages/ sfsmisc/ index. html
[2] http:/ / www. r-statistics. com/ 2011/ 12/ data-frame-objects-in-r-via-r-in-action/
[3] http:/ / www. r-bloggers. com/ aggregation-and-restructuring-data-from-%E2%80%9Cr-in-action%E2%80%9D/
A data frame (or any R object) can be saved to and reloaded from a binary .Rda file:
save(mydata,file="mydata.Rda")
load("mydata.Rda")
Example Datasets
Most packages include example datasets to test the functions.
The data() function without argument gives the list of all example datasets in all the loaded packages.
If you want to load them in memory, you just need to use the data function and include the name of the dataset as
an argument.
str_data() (sfsmisc) gives the structure of all datasets in a package.
> data() # lists all the datasets in all the packages in memory
> data(package="datasets") # lists all the datasets in the "datasets" package
> data(Orange) # loads the orange dataset in memory
> ?Orange # Help for the "Orange" Datasets
> str_data("datasets") # gives the structure of all the datasets in the datasets package.
You can also build a data frame from simulated data:
N <- 100
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + u
mydat <- data.frame(y,x1,x2)
R has a spreadsheet-style data editor. One can use it to enter the data into a spreadsheet.
You can also read space delimited tables in your code using gsource() (Zelig). Here is an example with Yule
1899 data[3].
Browsing data
You can browse your data in a spreadsheet using View(). Depending on your operating system, this option is
not always available and the result is not always the same.
You can print the first lines using head() and the last lines using tail().
View(mydata)
head(mydata, n = 20) # n = 20 means that the first 20 lines are printed in the R console
Attaching data
One of the big advantages of R over Stata is that you can work with multiple datasets at the same time. You just
need to prefix each variable name with the name of its dataset and a "$" symbol (for instance mydat1$var1 and
mydat2$var1). If you only work with one dataset and don't want to write the dataset name again and again as a
prefix, you can use attach().
mydata$var1
attach(mydata)
var1
detach(mydata)
Detecting duplicates
When you want to clean up a data set, it is often useful to check whether the same information appears twice in
the data. R provides some functions to detect duplicates.
duplicated() flags duplicated elements and returns a logical vector. You can use table() to summarize
this vector.
Duplicated() (sfsmisc) generalizes this command, marking unique values with NA.
remove.dup.rows() (cwhmisc) removes duplicated rows.
unique() keeps only the unique rows of a dataset.
library("Zelig")
mydat <- gsource(
variables = "
1 1 1 1
1 1 1 1
1 2 3 4
1 2 3 4
1 2 2 2
1 2 3 2")
unique(mydat) # keep unique rows
library(cwhmisc)
remove.dup.rows(mydat) # similar to unique()
table(duplicated(mydat)) # table duplicated lines
mydat$dups <- duplicated(mydat) # add a logical variable for duplicates
If you want to delete a variable in a dataset, you can assign NULL to that variable:
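For example:

```r
mydat <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9)
mydat$x3 <- NULL  # deletes the variable x3
names(mydat)      # "x1" "x2"
```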
Renaming variables
It is possible to rename variables by redefining the names vector of a data frame.
There is also a rename() function in the reshape package.
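For example:

```r
mydat <- data.frame(a = 1:3, b = 4:6)
names(mydat) <- c("x1", "x2")  # rename all the variables at once
names(mydat)[2] <- "y"         # rename only the second variable
names(mydat)                   # "x1" "y"
```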
N <- 100
x1 <- rnorm(N)
x2 <- 1 + rnorm(N) + x1
x3 <- rnorm(N) + x2
mydat <- data.frame(x1,x2,x3)
subset() selects the rows matching a condition and keeps (or drops) columns:
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = c(x1,x2))
subset(x = mydat, subset = x1 > 0 & x2 < 0, select = - x3) # the same.
A data frame can be sorted by one or several variables using order():
mydat[order(mydat$var1, mydat$var2),]
Suppose you want to randomize the order in a data set. You just need to generate a vector from a uniform
distribution and to sort following that vector.
df[order(runif(nrow(df))),]
complete.cases() returns TRUE for each row without missing values; table() summarizes how many complete rows the data frame has.
> table(complete.cases(df))
Reshaping a dataframe
This topic is important if you deal with panel data. Panel data can be stored in a wide format with one observation
per unit and a variable for each time period or in a long format with one observation per unit and time period.
reshape() reshapes a dataset in a wide or long format.
> country <- c("'Angola'","'UK'","'France'")
> gdp.1960 <- c(1,2,3)
> gdp.1970 <- c(2,4,6)
> mydat <- data.frame(country,gdp.1960,gdp.1970)
> mydat # wide format
country gdp.1960 gdp.1970
1 Angola 1 2
2 UK 2 4
3 France 3 6
> reshape( data = mydat, varying = list(2:3) , v.names = "gdp", direction = "long") # long format
country time gdp id
1.1 Angola 1 1 1
2.1 UK 1 2 2
3.1 France 1 3 3
1.2 Angola 2 2 1
2.2 UK 2 4 2
3.2 France 2 6 3
External links
Printing nested tables in R bridging between the {reshape} and {tables} packages [7]
Expanding a dataset
Sometimes we need to duplicate some lines in a dataset, for instance to generate a fake dataset with a panel data
structure. In that case, we would first generate the time-invariant variables and then duplicate each line a given
number of times in order to create the time-varying variables.
One option is the expand() function in the epicalc package, which replicates each line a given number of times.
N <- 1000
T <- 5
wide <- data.frame(id = 1:N,f = rnorm(N), rep = T)
library("epicalc")
long <- expand(wide,index.var = "rep")
long$time <- rep(1:T,N)
We can also use a do-it-yourself solution or create our own function. The idea is simple. We create a vector which
gives, for each line, the number of times it should be replicated (dups in the following example). Then we use the
rep() function to create a vector which repeats the line numbers according to that pattern. The last step creates a
new dataset which repeats lines according to the desired pattern.
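A minimal sketch of this do-it-yourself approach (the variable names are only illustrative):

```r
wide <- data.frame(id = 1:3, f = c(0.5, -1.2, 0.3))
dups <- c(2, 1, 3)                       # replicate line 1 twice, line 2 once, line 3 three times
long <- wide[rep(1:nrow(wide), dups), ]  # repeat the row numbers, then subset
nrow(long)                               # 6
```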
Merging dataframes
Merging data can be very confusing, especially in the case of multiple merges. Here is a simple example.
We have one table describing books and another describing authors.
We want to merge tables books and authors by author's name ("surname" in the first dataset and "name" in the
second one). We use the merge() command. We specify the name of the first and the second datasets, then by.x and
by.y specify the identifiers in both datasets. all.x and all.y specify whether we want to keep all the observations of the
first and the second dataset. In this case we keep all the observations from the books dataset but only those
observations from the authors dataset which match an observation in the books dataset.
> final <- merge(books, authors, by.x = "name", by.y = "surname", sort=F,all.x=T,all.y=F)
> final
name title other.author nationality deceased
1 Tukey Exploratory Data Analysis <NA> US yes
2 Venables Modern Applied Statistics ... Ripley Australia no
3 Tierney LISP-STAT <NA> US no
4 Ripley Spatial Statistics <NA> UK no
5 Ripley Stochastic Simulation <NA> UK no
6 McNeil Interactive Data Analysis <NA> Australia no
7 R Core An Introduction to R Venables & Smith <NA> <NA>
It is also possible to merge two data.frame objects while preserving the rows order by one of the two merged
objects[8].
Resources
R Data Manual [9][10].
Paul Murrell's Introduction to Data Technologies[11].
References
[1] The AER Package http:/ / cran. r-project. org/ web/ packages/ AER/ index. html
[2] The EcDat Package http:/ / cran. r-project. org/ web/ packages/ Ecdat/ index. html
[3] "An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I.)" - GU Yule -
Journal of the Royal Statistical Society, June 1899, p 283
[4] http:/ / www. stat. auckland. ac. nz/ ~paul/ Talks/ viewer. pdf
[5] Reshaping Data with the reshape Package : http:/ / www. jstatsoft. org/ v21/ i12
[6] vignette for the tables package: http:/ / cran. r-project. org/ web/ packages/ tables/ vignettes/ tables. pdf
[7] http:/ / www. r-statistics. com/ 2012/ 01/ printing-nested-tables-in-r-bridging-between-the-reshape-and-tables-packages/
[8] Merging data frames while preserving the rows (http:/ / www. r-statistics. com/ 2012/ 01/
merging-two-data-frame-objects-while-preserving-the-rows-order/ )
[9] http:/ / cran. r-project. org/ doc/ manuals/ R-data. html
[10] R Data Manual http:/ / cran. r-project. org/ doc/ manuals/ R-data. html
[11] Paul Murrell introduction to Data Technologies http:/ / www. stat. auckland. ac. nz/ ~paul/ ItDT/
The speedR package provides a graphical user interface for importing data:
library(speedR)
speedR()
CSV (csv,txt,dat)
You can import data from a text file (often CSV) using read.table(), read.csv() or read.csv2(). The
option header = TRUE indicates that the first line of the file should be interpreted as variable names, and
the sep option gives the field separator (generally "," or ";").
csv.get() (Hmisc) is another possibility.
Note that these functions can also read data directly from a URL on the internet.
By default, strings are converted to factors. If you want to avoid this conversion, you can specify the option
stringsAsFactors = FALSE.
You can export data to a text file using write.table().
write.table(mydat,file="mydat.csv",quote=T,append=F,sep=",",eol = "\n", na = "NA", dec = ".", row.names = T,col.names = T)
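A minimal round trip through a temporary file (note stringsAsFactors = FALSE to keep strings as characters):

```r
tmp <- tempfile(fileext = ".csv")
write.table(data.frame(x = 1:2, y = c("a", "b")),
            file = tmp, sep = ",", row.names = FALSE)
mydat <- read.table(tmp, header = TRUE, sep = ",",
                    stringsAsFactors = FALSE)
str(mydat)  # y is a character vector, not a factor
```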
Stata (dta)
We can read Stata data using read.dta() in the foreign package and export to Stata data format using
write.dta().
Note that string variables in Stata are limited to 244 characters. This can be an issue during export.
See also Stata.file() in the memisc package and stata.get in the Hmisc package.
> library("foreign")
> mydata <- read.dta("StataData.dta")
> names(mydata)
SAS (sas7bdat)
Experimental support for SAS databases having the sas7bdat extension is provided by the sas7bdat[3] package.
However, sas7bdat files generated by 64 bit versions of SAS, and SAS running on non-Microsoft Windows
platforms are not yet supported.
SAS (xpt)
See also sasxport.get() and sas.get() in the Hmisc package.
See also the SASxport package.
library("foreign")
mydata<-read.xport("SASData.xpt")
names(mydata)
SPSS (sav)
read.spss() (foreign) and spss.get() (Hmisc)
> library("foreign")
> mydata<-read.spss("SPSSData.sav")
> names(mydata)
EViews
readEViews() in the hexView package for EViews files.
Excel (xls,xlsx)
Importing data from Excel is not easy. The solution depends on your operating system. If none of the methods below
works, you can always export each Excel spreadsheets to CSV format and read the CSV in R. This is often the
simplest and quickest solution.
XLConnect supports reading and writing both the xls and xlsx file formats. Since it is based on Apache POI, it only
requires a Java installation and as such works on many platforms, including Windows, UNIX/Linux and Mac.
Besides reading and writing data, it provides a number of additional features such as adding plots, cell styling and
style actions.
require("XLConnect")
wb <- loadWorkbook("myfile.xls", create = FALSE)
# Show a summary of the workbook (shows worksheets,
# defined names, hidden sheets, active sheet name, ...)
summary(wb)
# Read data from a worksheet interpreting the first row as column names
df1 <- readWorksheet(wb, sheet = "mysheet")
# Read data from a named region/range interpreting the first row as column
# names
df2 <- readNamedRegion(wb, name = "myname", header = TRUE)
Another option, on Windows, is the RODBC package:
library("RODBC")
channel <- odbcConnectExcel("Graphiques pourcent croissance.xls") # creates a connection
sqlTables(channel) # List all the tables
effec <- sqlFetch(channel, "effec") # Read one spreadsheet as an R table
odbcClose(channel) # close the connection (don't forget)
The read.xls() function (xlsReadWrite package) is another option:
mydat <- read.xls("myfile.xls", colNames = T, sheet = "mysheet", type = "data.frame", from = 1, checkNames = TRUE)
"sheet" specifies the name or the number of the sheet you want to import.
"from" specifies the first row of the spreadsheet.
The gnumeric package[4]. This package uses an external program called ssconvert, which is usually installed with
gnumeric, the Gnome Office spreadsheet. The read.gnumeric.sheet() function reads xls and xlsx files.
library("gnumeric")
df1 <- read.gnumeric.sheet(file = "df.xls", head = TRUE, sheet.name = "Feuille1")
df2 <- read.gnumeric.sheet(file = "df.xlsx", head = TRUE, sheet.name = "Feuille1")
See also xlsx for Excel 2007 documents and read.xls() (gdata).
gnumeric spreadsheets
The gnumeric package reads these natively: read.gnumeric.sheet() reads one sheet, and read.gnumeric.sheets()
reads all sheets and stores them in a list.
library("gnumeric")
df <- read.gnumeric.sheet(file = "df.gnumeric", head = TRUE, sheet.name = "df.csv")
View(df)
df <- read.gnumeric.sheets(file = "df.gnumeric", head = TRUE)
View(df$df.csv)
OpenDocument spreadsheets (ods) can also be read with the gnumeric package:
library("gnumeric")
df <- read.gnumeric.sheet(file = "df.ods", head = TRUE, sheet.name = "Feuille1")
library("speedR")
df <- speedR.importany(file = "df.ods")
Note that you can also use the speedR graphical user interface (speedR()) which will return the command line for
replication.
library("speedR")
speedR()
library("ROpenOffice")
df <- read.ods(file = "df.ods")
JSON
JSON (JavaScript Object Notation) is a very common format on the internet. The rjson library makes it easy to
import data from the JSON format[8].
It is also easy to export a list or a dataframe to JSON using the toJSON() function:
# df : a data frame
library("rjson")
json <- toJSON(df)
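Conversely, a JSON string can be parsed back into an R object with fromJSON(). A minimal sketch, assuming the rjson package is installed:

```r
library("rjson")

# Parse a JSON string into an R list
x <- fromJSON('{"a": 1, "b": [1, 2]}')
x$a  # the number 1
x$b  # a numeric vector containing 1 and 2
```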
dBase (dbf)
read.dbf() in the foreign package.
library("foreign")
df <- read.dbf("file.dbf")
str(df)
Resources
R Data Manual[11].
Paul Murrell's Introduction to Data Technologies[12].
References
[1] Stat Transfer http://www.stattransfer.com/
[2] speedR http://speedr.r-forge.r-project.org/
[3] sas7bdat http://cran.r-project.org/web/packages/sas7bdat/index.html
[4] This command has been tested using Ubuntu 10.10 and R 2.11.1
[5] http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html
[6] http://www.omegahat.org/RGoogleDocs/
[7] The Omega Project for Statistical Computing (http://www.omegahat.org/)
[8] http://cran.r-project.org/web/packages/rjson/index.html
[9] http://cran.r-project.org/web/packages/hdf5/index.html
[10] Brandon Whitcher, Volker J. Schmid, Andrew Thornton "Working with the DICOM and NIfTI Data Standards in R", Journal of Statistical Software Vol. 44, Issue 6, Oct 2011, link (http://www.jstatsoft.org/v44/i06)
[11] R Data Manual http://cran.r-project.org/doc/manuals/R-data.html
[12] Paul Murrell Introduction to Data Technologies http://www.stat.auckland.ac.nz/~paul/ItDT/
Text Processing
This page includes all the material you need to deal with strings in R. The section on regular expressions may be
useful for understanding the rest of the page, but it is not necessary if you only need to perform some simple tasks.
This page may be useful to:
perform statistical text analysis.
collect data from an unformatted text file.
deal with character variables.
In this page, we learn how to read a text file and how to use R functions for characters. There are two kinds of
functions for characters: simple functions and regular expressions. Many functions are part of the standard R base
package.
However, their names and syntax are not intuitive to all users. Hadley Wickham has developed the stringr
package, which defines functions with similar behaviour but names that are easier to remember and a much more
systematic syntax[1].
Keywords: text mining, natural language processing
See CRAN Task view on Natural Language Processing[2]
See also the following packages tm, tau, languageR, scrapeR.
We can write the content of an R object into a text file using cat() or writeLines(). By default, cat()
concatenates vectors when writing to the text file. You can change this by adding the option sep="\n" or
fill=TRUE. The default encoding depends on your computer.
cat(text,file="file.txt",sep="\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = FALSE)
Before reading a text file, you can look at its properties. nlines() (parser package) and countLines()
(R.utils package) count the number of lines in the file. count.chars() (parser package) counts the number of
bytes and characters in each line of a file. You can also display a text file using file.show().
Character encoding
R provides functions to deal with various encoding schemes. This is useful if you deal with text files which
have been created on another operating system, especially if the language is not English and has many accents
and specific characters. For instance, the standard encoding scheme on Linux is "UTF-8" whereas the standard
encoding scheme on Windows is "Latin1". The Encoding() function returns the encoding of a string.
iconv() is similar to the Unix command iconv and converts the encoding.
iconvlist() gives the list of available encoding schemes on your computer.
readLines(), scan() and file.show() also have an encoding option.
is.utf8() (tau) tests if the encoding is "utf8".
is.locale() (tau) tests if the encoding is the same as the default encoding on your computer.
translate() (tau) translates the encoding into the current locale.
fromUTF8() (descr) is less general than iconv().
utf8ToInt() (base)
Example
The following example was run under Windows. Thus, the default encoding is "latin1".
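A minimal sketch of the encoding functions (the ASCII input keeps the example platform-independent; with accented characters, the result of Encoding() depends on your locale):

```r
# Convert a string from one encoding to another with iconv()
x <- "abc"
iconv(x, from = "latin1", to = "UTF-8")  # "abc" (plain ASCII is unchanged)

# Encoding() reports the declared encoding of a string
Encoding(x)                              # often "unknown" for plain ASCII

# utf8ToInt() gives the Unicode code points
utf8ToInt("A")                           # 65
```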
Regular Expressions
A regular expression describes a specific pattern in a set of strings. For instance, one could have the following pattern: 2
digits, 2 letters and 4 digits. R provides powerful functions to deal with regular expressions. Two types of regular
expressions are used in R[3]:
extended regular expressions, used by perl = FALSE (the default),
Perl-like regular expressions, used by perl = TRUE.
There is also an option fixed = TRUE, which makes the pattern be matched literally.
fixed() (stringr) is equivalent to fixed = TRUE in the standard regex functions. These functions are
case sensitive by default. This can be changed by specifying the option ignore.case = TRUE.
If you are not a specialist in regular expressions, you may find glob2rx() useful. This function translates a
wildcard ("glob") pattern into the corresponding regular expression:
> glob2rx("abc.*")
[1] "^abc\\."
Examples
If you want to remove space characters in a string, you can use the \\s Perl-style shorthand for whitespace.
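For instance, a short sketch using gsub() (introduced below) to strip all whitespace:

```r
# \\s matches any whitespace character (space, tab, ...)
gsub("\\s", "", "a b\tc", perl = TRUE)  # "abc"
```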
Concatenating strings
paste() concatenates strings.
str_c() (stringr) does a similar job.
cat() prints and concatenates strings.
Examples
> paste("toto","tata",sep=' ')
[1] "toto tata"
> paste("toto","tata",sep=",")
[1] "toto,tata"
> str_c("toto","tata",sep=",")
[1] "toto,tata"
> x <- c("a","b","c")
> paste(x,collapse=" ")
[1] "a b c"
> str_c(x, collapse = " ")
[1] "a b c"
> cat(c("a","b","c"), sep = "+")
a+b+c
Splitting a string
strsplit() splits the elements of a character vector x into substrings according to the matches of the substring
split within them.
See also str_split() (stringr).
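A short example of strsplit():

```r
# Split a date string on the "-" separator; the result is a list
strsplit("2010-12-20", split = "-")
# [[1]]
# [1] "2010" "12"   "20"
```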
> nchar("abcdef")
[1] 6
> str_length("abcdef")
[1] 6
> nchar(NA)
[1] 2
> str_length(NA)
[1] NA
cpos() and substring.location() (cwhmisc) locate a substring within a string:
> cpos("abcdefghijklmnopqrstuvwxyz","p",start=1)
[1] 16
> substring.location("abcdefghijklmnopqrstuvwxyz","def")
$first
[1] 4
$last
[1] 6
> string <- "23 mai 2000" # example input, reconstructed to match the output below
> regexp <- "([[:digit:]]+) ([[:alpha:]]+) ([[:digit:]]+)"
> str_match(string,regexp)
[,1] [,2] [,3] [,4]
[1,] "23 mai 2000" "23" "mai" "2000"
> str_match_all(string,regexp)
[[1]]
[,1] [,2] [,3] [,4]
[1,] "23 mai 2000" "23" "mai" "2000"
sub(pattern = regexp, replacement = "\\1", x = string) # returns the first part of the regular expression
In the following example, we compare the outcome of sub() and gsub(). The first one removes the first space
whereas the second one removes all spaces in the text.
> chartr(old="a",new="o",x="baba")
[1] "bobo"
> chartr(old="ab",new="ot",x="baba")
[1] "toto"
> replacechar("abc.def.ghi.jkl",".","_")
[1] "abc_def_ghi_jkl"
> str_replace_all("abc.def.ghi.jkl","\\.","_")
[1] "abc_def_ghi_jkl"
> tolower("ABCdef")
[1] "abcdef"
> toupper("ABCdef")
[1] "ABCDEF"
> capitalize("abcdef")
[1] "Abcdef"
> library("cwhmisc")
> padding("abc",10," ","center") # adds blanks such that the length of the string is 10.
[1] " abc "
> str_pad("abc",width=10,side="center", pad = "+")
[1] "+++abc++++"
> str_pad(c("1","11","111","1111"),3,side="left",pad="0")
[1] "001" "011" "111" "1111"
Note that str_pad() is very slow. For instance, for a vector of length 10,000, computing time is very long.
padding() does not seem to handle character vectors, but a good solution is to combine sapply()
and padding().
> library("stringr")
> library("cwhmisc")
> a <- rep(1,10^4)
> system.time(b <- str_pad(a,3,side="left",pad="0"))
   user  system elapsed
 50.968   0.208  73.322
> system.time(c <- sapply(a, padding, space = 3, with = "0", to = "left"))
   user  system elapsed
  7.700   0.020  12.206
> library("memisc")
> trimws(" abc def ")
[1] "abc def"
> library("gdata")
> trim(" abc def ")
[1] "abc def"
> "abc"=="abc"
[1] TRUE
> "abc"=="abd"
[1] FALSE
stringMatch() (MiscPsycho) computes the Levenshtein distance between two strings:
> library("MiscPsycho")
> stringMatch("test","tester",normalize="NO",penalty=1,case.sensitive = TRUE)
[1] 2
Approximate matching
agrep() searches for approximate matches using the Levenshtein distance.
If value = TRUE, it returns the matching strings themselves.
If value = FALSE, it returns their positions.
max specifies the maximal Levenshtein distance allowed.
> agrep(pattern = "laysy", x = c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE)
[1] "1 lazy"
> agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 3, value = TRUE)
[1] "1 lazy"
Miscellaneous
deparse() : Turn unevaluated expressions into character strings.
char.expand() (base) expands a string with respect to a target.
pmatch() (base) and charmatch() (base) seek matches for the elements of their first argument among
those of their second.
make.unique() makes the elements of a character vector unique. This is useful if you want to use strings as
identifiers in your data.
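Short examples of these matching helpers (all in base R):

```r
pmatch("mea", c("mean", "median"))    # 1: unique partial match
pmatch("me", c("mean", "median"))     # NA: ambiguous partial match
charmatch("me", c("mean", "median"))  # 0: charmatch flags ambiguity with 0
make.unique(c("a", "a", "b"))         # "a" "a.1" "b"
```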
References
[1] Hadley Wickham "stringr: modern, consistent string processing" The R Journal, December 2010, Vol 2/2, http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
[2] http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
[3] In former versions (< 2.10) R also had basic regular expressions:
extended regular expressions, used by extended = TRUE (the default),
basic regular expressions, used by extended = FALSE (obsolete in R 2.10).
Since basic regular expressions (extended = FALSE) are now obsolete, the extended option is obsolete in version 2.11.
[4] http://www.markvanderloo.eu/yaRb/2013/09/07/a-bit-of-benchmarking-with-string-distances/
Format
Many time and date units are recognised. These include:
%y  2-digit year (e.g. 84)
%m  numerical month (e.g. 03)
%M  minutes (e.g. 35)
%S  seconds (e.g. 52)
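These codes combine into a format string for functions such as strptime() and format():

```r
# Parse a character string into a date-time using the format codes
x <- strptime("84-03-05 23:35:52", format = "%y-%m-%d %H:%M:%S")
format(x, "%M")  # "35"
format(x, "%m")  # "03"
```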
> Sys.time()
[1] "2010-02-13 23:12:24 COT"
> format(Sys.time(),"%H %M") # in a different format and without the date
[1] "23 13"
> Sys.Date()
[1] "2010-02-13"
> date() # returns the current date and time
[1] "Wed Jul 18 10:59:42 2012"
> weekdays(my.date) # Get a string representing the weekday of the specified date
[1] "Monday"
> months(my.date) # Get the month as well
[1] "December"
> my.date
[1] "2010-12-20"
> julian(my.date) # Get the integer number of days since the beginning of epoch
[1] 14963
attr(,"origin")
[1] "1970-01-01"
Note that weekdays() and months() return results in the local language. For instance, if you switch R to
French, you get weekdays and months in French[1]:
> require("lubridate")
> Sys.setlocale(locale="fr_FR.UTF-8")
[1] "fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8"
> mydate <- ymd("2002-04-21")
> weekdays(mydate)
[1] "Dimanche"
> months(mydate)
[1] "avril"
References
[1] Issue on Stackoverflow (http://stackoverflow.com/questions/17836966/french-names-using-wday-in-lubridate)
External links
Do more with dates and times in R with lubridate 1.1.0 (http://www.r-statistics.com/2012/03/
do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/) (a sample chapter from the R in Action book)
Graphics
R includes at least three graphical systems, the standard graphics package, the lattice package for Trellis graphs[1]
and the grammar-of-graphics ggplot2 package[2]. R has good graphical capabilities but there are some alternatives
like gnuplot.
Interactive Graphics
This section discusses some ways to draw graphics without using R scripts.
The playwith package provides a graphical user interface to customize graphs: add a title, a grid, some text, etc.
It exports the R code you need if you want to replicate the analysis[3]. If you want to know more, you can have a
look at the screenshots on the website (link [4]). See also the example on "R you Ready" [5]. This package requires
the GTK+ libraries.
library("playwith")
playwith(plot(x1))
There is also the GrapheR graphical user interface, which makes it very easy for beginners to draw graphs. This
solution is cross-platform.
> library(GrapheR)
Standard R graphs
In this section we present what you need to know if you want to customize your graphs in the default graph system.
plot() is the main function for graphics. The arguments can be a single point such as 0 or c(.3,.7), a single
vector, a pair of vectors or many other R objects.
par() is another important function which defines the default settings for plots.
There are many other plot functions which are specific to some tasks such as hist(), boxplot(), etc. Most of
them take the same arguments as the plot() function.
> N <- 10^2
> x1 <- rnorm(N) # x1 and x2 are two random vectors used in the examples below
> x2 <- rnorm(N)
> plot(0)
> plot(0,1)
> plot(x1)
> plot(x1,x2) # scatter plot with x1 on the horizontal axis and x2 on the vertical axis
> plot(x2 ~ x1) # the same but using a formula (x2 as a function of x1)
> methods(plot) # show all the available methods for plot (depending on the loaded packages)
Titles
main gives the main title, sub the subtitle. They can be passed as argument of the plot() function or using the
title() function. xlab the name of the x axis and ylab the name of the y axis.
plot(x1,x2, main = "Main title", sub = "sub title" , ylab = "Y axis", xlab = "X axis")
plot(x1,x2 , ylab = "Y axis", xlab = "X axis")
title(main = "Main title", sub = "sub title" )
The size of the text can be modified using the parameters cex.main, cex.lab, cex.sub and cex.axis. These
parameters define a scaling factor, i.e. the value of the parameter multiplies the size of the text. If you choose
cex.main = 2, the main title will be twice as big as usual.
Legend
legend(). The position can be "bottomleft", "bottomright", "topleft", "topright" or exact coordinates.
mtext() writes text into one of the figure margins:
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 1) # side 1: the bottom margin
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 2) # side 2: the left margin
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 3) # side 3: the top margin
plot(x1, type = "l", col = 1, lty = 1) ; mtext("some text", side = 4) # side 4: the right margin
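A minimal legend() sketch (x1 and x2 stand for two numeric vectors, as in the earlier examples):

```r
x1 <- rnorm(50)
x2 <- rnorm(50)
plot(x1, type = "l", col = "black", lty = 1, ylim = range(c(x1, x2)))
lines(x2, col = "red", lty = 2)
# Place the legend in the top-right corner of the plot region
legend("topright", legend = c("x1", "x2"), col = c("black", "red"), lty = 1:2)
```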
Mathematical annotations
We can add mathematical symbols using expression() and make substitutions in a formula using
substitute().
Types
The type of a plot can be :
n for none (nothing is printed),
p for points,
l for lines,
b for both,
o for both overlayed,
h for histogram-like
and s/S for steps.
x1 <- rnorm(50)
png("plottype.png")
par(mfrow = c(2,2))
plot(x1, type = "p", main = "points", ylab = "", xlab = "")
plot(x1, type = "l", main = "lines", ylab = "", xlab = "")
plot(x1, type = "b", main = "both", ylab = "", xlab = "")
plot(x1, type = "o", main = "both overplot", ylab = "", xlab = "")
dev.off()
Axes
The default output prints the axes. We can remove them with axes = FALSE. We can also change them using the
axis() function.
> plot(x1,x2,axes=FALSE)
>
> plot(x1,x2,axes=FALSE)
> axis(1,col="red",col.axis="blue",font.axis=3)
> axis(2,col="red",col.axis="blue",font.axis=2,las=2)
x1 <- rnorm(100)
par(mfrow = c(2,2))
plot(x1, las = 0, main = "las = 0", sub = "always parallel to the axis", xlab = "", ylab = "")
plot(x1, las = 1, main = "las = 1", sub = "always horizontal", xlab = "", ylab = "")
plot(x1, las = 2, main = "las = 2", sub = "always perpendicular to the axis", xlab = "", ylab = "")
plot(x1, las = 3, main = "las = 3", sub = "always vertical", xlab = "", ylab = "")
Margins
Margins can be computed in inches or in lines. The default is par(mar = c(5,4,4,2)), which means that
there are 5 lines at the bottom, 4 lines on the left, 4 lines at the top and 2 lines on the right. This can be modified
using the par() function. If you want to specify margins in inches, use par(mai = c(bottom, left,
top, right)). If you want to modify margins in lines, use par(mar = c(bottom, left, top, right)).
See ?par to learn more about the topic.
Colors
The color of the points or lines can be changed using the col argument, fg for foreground colors (boxes and axes)
and bg for background colors.
show.col() (Hmisc) plots the main R colors with their numeric codes.
The list of all colors in R (pdf [7])
We can also generate new colors using the rgb() function. The first argument is the intensity of red, the
second the intensity of green and the third the intensity of blue. They vary between 0 and 1 by default, but this
can be modified with the option max = 255. col2rgb() returns the RGB code of R colors. col2hex()
(gplots) gives the hexadecimal code. col2grey() and col2gray() (TeachingDemos) convert colors to
grey scale.
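For example:

```r
rgb(1, 0, 0)               # "#FF0000", pure red
rgb(255, 0, 0, max = 255)  # the same color on a 0-255 scale
col2rgb("red")             # a matrix with rows red = 255, green = 0, blue = 0
```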
Points
For points, the symbols can be changed using the pch option, which takes integer values between 0 and 25 or a
single character. pch can also take a vector as argument. In that case the first point will use the first element of
the vector as its symbol, and so on.
The following code displays all the symbols on the same plot :
x <- rep(1,25)
plot(x, pch = 1:25, axes = F, xlab = "", ylab = "")
text(1:25,.95,labels = 1:25)
Lines
We can change the line type with lty. The argument is a string ("blank", "solid", "dashed", "dotted", "dotdash",
"longdash", or "twodash") or an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash,
6=twodash). The line width can be changed with lwd. The default is lwd=1. lwd=2 means that the width is twice
the normal width.
abline() adds a horizontal line (h=), a vertical line (v=) or a linear function to the current plot (a= for the
intercept and b= for the slope). abline() can also plot a regression line.
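A sketch of the three uses of abline():

```r
x1 <- rnorm(100)
x2 <- x1 + rnorm(100)
plot(x1, x2)
abline(h = 0, v = 0, col = "grey")  # horizontal and vertical reference lines
abline(a = 0, b = 1, lty = 2)       # line with intercept 0 and slope 1
abline(lm(x2 ~ x1), col = "red")    # regression line of x2 on x1
```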
Boxes
Each graph is framed by a box. bty specifies the box type.
Grid
grid() adds a grid to the current graph.
> plot(x1)
> grid()
Although grid has an optional argument nx for setting the number of grid lines, it is not possible to tell it explicitly
where to place those lines (it will usually not place them at integer values). A more precise and manageable
alternative is to use abline().
Other figures
We can also add a circle to a plot with the circle() function in the calibrate package.
Background
You can choose the background of your plot. For instance, you can change the background color with par(bg=).
par(bg="whitesmoke")
par(bg="transparent")
Overlaying plots
matplot() can plot several plots at the same time.
N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N) + x1 + 1
y <- 1 + x1 + x2 + rnorm(N)
mydat <- data.frame(y,x1,x2)
matplot(mydat[,1],mydat[,2:3], pch = 1:2)
Multiple plots
With par() we can display multiple figures on the same page. mfrow = c(3,2) prints 6 figures with 3 rows
and 2 columns, filled by row. mfcol = c(3,2) does the same but the figures are filled by column.
par(mfrow = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
plot(x1, type = "l")
plot(x1, type = "h")
plot(x1, type = "s")
plot(x1, type = "S")
par(mfcol = c(3,2))
plot(x1, type = "n")
plot(x1, type = "p")
Plotting a function
curve() plots a function. This can be added to an existing plot with the option add = TRUE.
plot() can also plot functions.
plot(rnorm(100))
curve((x/100)^2, add = TRUE, col = "red")
Exporting graphs
How can you export a graph?
First you can plot the graph and use the context menu (right click on Windows and Linux, or control + click on
Mac) to copy or save the graph. The available options depend on your operating system. On Windows, you can
also copy the current graph to the clipboard as a bitmap file (raster graphics) using CTRL + C or as a
Windows Metafile (vector graphics) using CTRL + W. You can then paste it into another application.
You can export a plot to pdf, png, jpeg, bmp or tiff by adding pdf("filename.pdf"),
png("filename.png"), jpeg("filename.jpg"), bmp("filename.bmp") or
tiff("filename.tiff") prior to the plotting, and dev.off() after the plotting.
You can also use the savePlot() function to save existing graphs.
Sweave also produces ps and pdf graphics (see the Sweave section).
It is better to use vector devices such as pdf, ps or svg.
How can you list all the available devices?
?Devices
Use the capabilities() function to see the list of available devices on your computer.
> capabilities()
jpeg png tiff tcltk X11 aqua http/ftp sockets
TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
libxml fifo cledit iconv NLS profmem cairo
TRUE FALSE TRUE TRUE TRUE TRUE FALSE
postscript(file="graph1.ps",horizontal=F,pagecentre=F,paper="special",width=8.33,height=5.56)
plot(x1, main = "Example")
dev.off()
The RSvgDevice library, which was used in earlier versions of R, now seems outdated.
Advanced topics
Animated plots
The animation package provides dynamic graphics capabilities. It is possible to export the animation in flash, mpeg
or gif format. There are more examples on the aniwiki website: http://animation.yihui.name/.
You can also create motion charts using the googleVis package[8].
Examples
Interactive Graphics
The iplots package provides a way to have interactive data visualization in R[9][10].
The Deducer R GUI (version 0.4-2) also offers interactive graphics by connecting with iplots[11].
To create an interactive, animated plot viewable in a web browser, the animint package can be used[12]. The main
idea is to define an interactive animation as a list of ggplots with two new aesthetics:
showSelected=variable means that only the subset of the data that corresponds to the selected value of variable
will be shown.
clickSelects=variable means that clicking a plot element will change the currently selected value of variable.
Graphics gallery
In this section, we review all kinds of statistical plots and the alternatives for drawing them with R. This includes
code for the standard graphics package, the lattice package and the ggplot2 package. We also add some examples
from the Commons repository. We only add examples which come with their R code. You can click on any
graph and find the R code.
Line plot
To draw a line plot, use the generic plot() function by setting type="l".
Then, you can add further lines on the same plot using the lines() function.
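For instance:

```r
x <- 1:10
plot(x, x^2, type = "l")      # first line
lines(x, 2 * x, col = "red")  # add a second line to the same plot
```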
Examples
Scatter plot
plot(x,y)
plot(y ~ x)
xyplot(y ~ x) (lattice)
qplot(x,y) (ggplot2)
Log scale
Sometimes it is useful to plot the log of a variable and to have a log scale on the axes. It is possible to plot the log of
a variable using the log option of the plot() function.
For a log-log plot, use log = "xy".
For a log scale on the x axis only, use log = "x".
For a log scale on the y axis only, use log = "y".
N <- 10
u <- rnorm(N)
x <- 1 + rnorm(N)
y <- 1 + x + u
plot(x, y)
library("calibrate") # textxy() comes from the calibrate package
textxy(x, y, labs = signif(x,3), cx = 0.7)
Examples
Histogram
hist()
histogram() (lattice)
You can learn more about histograms in the Non parametric methods page.
Examples
Box plot
Box plot :
boxplot()
Examples
Bar charts
See Bar charts on wikipedia.
barplot() takes a table as argument and returns a bar chart.
qplot() (ggplot2) with the option geom = "bar" takes a variable as argument and returns a bar chart[13].
barchart() (lattice) takes a variable as argument and returns a bar chart.
Examples
Dot plot
See also Dot plot on Wikipedia.
dotchart()
Examples
Pie charts
pie()
Examples
Treemap
The tmPlot() function in the treemap package makes it easy to draw a treemap.
Error bars
errbar() (Hmisc) plots error bars[14]. There is another errbar() function in the sfsmisc package.
plotCI() (gplots) also plots error bars.
plotmeans() (gplots)
ciplot() (hacks)
See also Error bar on Wikipedia
3D plots
contour(), image(), persp()
plot3d() (rgl)
wireframe() (lattice)
Examples
Example with
wireframe() (lattice)
Diagrams
grid package by Paul Murrell[15]
diagram package [16]
Rgraphviz package
igraph package
Arc Diagrams
It is also possible to draw Arc Diagrams.
Dendrograms
It is possible to plot dendrograms in R.
Treemap
It is possible to draw a treemap using the treemap() function in the treemap package[17].
Wordcloud
There are:
the wordcloud() function in the wordcloud package
the tagcloud() function in the tagcloud package
Timeline
timeline() in the timeline package
Resources
Tables 2 Graphs [18]
R Graphics by Paul Murrell[19]
ggplot2 [20]
References
[1] D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer, 2008. ISBN 9780387759685.
[2] ggplot2: Elegant Graphics for Data Analysis (Use R) by Hadley Wickham and a list of examples on his own website: http://had.co.nz/ggplot2/
[3] playwith: http://code.google.com/p/playwith/
[4] http://code.google.com/p/playwith/wiki/Screenshots
[5] http://ryouready.wordpress.com/2010/03/23/playing-with-the-playwith-package/
[6] http://code.google.com/p/latticist/
[7] http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
[8] Tutorial for the googleVis package: http://stackoverflow.com/questions/4646779/embedding-googlevis-charts-into-a-web-site/4649753#4649753
[9] http://www.r-bloggers.com/interactive-graphics-with-the-iplots-package-from-%E2%80%9Cr-in-action%E2%80%9D/
[10] Interactive Graphics with the iplots Package - a chapter from the R in Action book: http://www.r-statistics.com/2012/01/interactive-graphics-with-the-iplots-package-from-r-in-action/
[11] http://www.r-statistics.com/2010/10/r-gui-now-offers-interactive-graphics-deducer-0-4-2-connects-with-iplots/
[12] https://github.com/tdhock/animint
[13] Hadley Wickham ggplot2: Elegant Graphics for Data Analysis, Springer Verlag, 2009
[14] The default output in errbar() changed between R version 2.8.1 and R version 2.9.2. Axes are not displayed by default anymore.
[15] Paul Murrell Drawing Diagrams with R, The R Journal, 2009 http://journal.r-project.org/2009-1/RJournal_2009-1_Murrell.pdf
[16] (example: Using a binary tree diagram for describing a Bernoulli process (http://www.r-statistics.com/2011/11/diagram-for-a-bernoulli-process-using-r/))
[17] http://cran.r-project.org/web/packages/treemap/treemap.pdf
[18] http://tables2graphs.com/doku.php
[19] http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html
[20] http://had.co.nz/ggplot2/
Grammar of graphics
Hadley Wickham has developed ggplot2, a graphics library designed according to the principles of the
Grammar of Graphics.
Plotting a function
We use qplot() with the option stat = "function":
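A sketch, assuming ggplot2 is installed (in recent ggplot2 versions, stat_function() is the usual way to plot a function; qplot() with stat = "function" was the older equivalent):

```r
library("ggplot2")

# Plot the standard normal density over [-3, 3]
p <- ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm)
p
```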
Bibliography
Leland Wilkinson, The Grammar of Graphics (Statistics and Computing), Springer, 2005
Hadley Wickham, ggplot2: Elegant Graphics for Data Analysis, Use R!, Springer, 2009
Resources
Documentation page for ggplot2 [1]
References
[1] http://docs.ggplot2.org/current/
Publication quality output
Sweave
Sweave[1] is a literate programming tool which integrates LaTeX and R code. A Sweave file generates a
LaTeX file and an R file which can in turn be compiled. Roger Koenker[2], Meredith and Racine (2009)[3] and
Charles Geyer[4] argue that Sweave favors reproducible econometric/statistical research.
There are some alternatives to Sweave for literate programming. One of them is Babel, which is included in Emacs
Org-mode[5]. This tool allows export to LaTeX and HTML. It is also possible to include code chunks for various
programming languages (R, Ruby, etc.).
Syntax
The main idea is that you write a file which includes LaTeX and R code. LaTeX code begins with @ and R code
with <<>>= (some options can be included between << and >>).
@
% Some LaTeX code
\section{Results}
I show that ...
<<>>=
# Some R code
qnorm(.975)
@
% Some LaTeX code
$$
\Phi^{-1}(.975) = 1.96
$$
The file is stored with the extension .Rnw or .rnw. At the end, you extract an R file from it using Stangle()
and a LaTeX file using Sweave(). Here is an example with a file called file.Rnw which generates file.tex
and file.R:
> Sweave("file.Rnw")
Writing to file file.tex
Processing code chunks ...
1 : echo keep.source term verbatim pdf
2 : echo keep.source term verbatim pdf
> Stangle("file.Rnw")
Writing to file file.R
Then you can run LaTeX on your file.tex. This can be done from R using the system() function or texi2dvi().
Note that you may need to download Sweave.sty from the internet since it is not part of the standard MiKTeX
distribution.
You can also include your results in your text using \Sexpr{}.
$
\Phi^{-1}(.975) = \Sexpr{qnorm(.975)}
$
Options
There are several options, which can be included for each code chunk or passed to the Sweave() function.
For figures, you can either include them in the tex file using fig=T or leave them out using fig=F.
By default, figures are exported as pdf and eps files. If you only want one format, suppress the other one with the
pdf=F or eps=F option.
The R code can be displayed in the tex file using echo=T. If you don't want to include it in the tex file, use
echo=F.
The R code can be evaluated using eval=T. If you don't want to evaluate the R code, use eval=F.
The results:
results=tex treats the output as LaTeX code,
results=verbatim treats the output as verbatim text (the default),
results=hide does not include the results in the LaTeX output.
<<fig=T,pdf=T,eps=F>>=
plot(rnorm(100), col = "red")
@
Export to LaTeX
R has many functions for exporting results to LaTeX[6].
General functions
toLatex() in the utils package.
Note that toLatex() does not handle matrices.
toLatex() has been adapted to handle matrices and ftables in the memisc package.
> toLatex(sessionInfo())
\begin{itemize}
\item R version 2.2.0, 2005-10-06, \verb|powerpc-apple-darwin7.9.0|
\item Base packages: base, datasets, grDevices,
graphics, methods, stats, utils
\end{itemize}
tex.table() (cwhmisc) exports a data frame as a LaTeX table:
> tex.table(mydat)
\begin{table}[ht]
\begin{center}
\begin{footnotesize}
\begin{tabular}{r|rrr}
\hline
& y & x1 & x2\\ \hline
1 & -0.09 & -0.37 & -1.04\\
2 & 0.31 & 0.19 & -0.09\\
3 & 3.78 & 0.58 & 0.62\\
4 & 2.09 & 1.40 & -0.95\\
5 & -0.18 & -0.73 & -0.54\\
6 & 3.16 & 1.30 & 0.58\\
7 & 2.78 & 0.34 & 0.77\\
8 & 2.59 & 1.04 & 0.46\\
9 & -1.96 & 0.92 & -0.89\\
10 & 0.91 & 0.72 & -1.1\\
\hline
\end{tabular}
\end{footnotesize}
\end{center}
\end{table}
xtable() (xtable) exports various objects, including tables, data frames, lm, aov, and anova, to LaTeX.
> # lm example
> library(xtable)
> x <- rnorm(100)
> y <- 2*x + rnorm(100)
> lin <- lm(y~x)
> xtable(lin)
% latex table generated in R 2.15.1 by xtable 1.7-0 package
% Sun Sep 23 21:54:04 2012
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrrrr}
\hline
& Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\
\hline
(Intercept) & -0.0407 & 0.0984 & -0.41 & 0.6803 \\
x & 2.0466 & 0.1043 & 19.63 & 0.0000 \\
\hline
\end{tabular}
\end{center}
\end{table}
See also :
The highlight package by Romain François exports R code to LaTeX and HTML.
format.df() and latex() in the Hmisc package.
The memisc and quantreg packages each include their own latex() function.
Descriptive statistics
estout package.
The reporttools package includes functions for tables of descriptive statistics[7].
Estimation results
The stargazer package provides an easy way to export the results of regressions to LaTeX[8].
texreg provides the same kind of features[9].
The estout package provides functions similar to Stata's esttab and estout utilities[10]. Estimates are
stored using eststo() and printed using esttab(). They can be exported to CSV and LaTeX. These
functions support lm, glm and plm objects (see the plm package).
apsrtable() (apsrtable) exports the results of multiple regressions to LaTeX in a way similar to the American
Political Science Review publication standard.
xtable() (xtable package) exports data frames, matrices and estimation results[11]. xtable() can also be used
to export the results to an HTML file.
The outreg() function [12][13] developed by Paul Johnson is similar to the Stata outreg function[14]. See the "R
you ready?" post [15] on this topic.
mtable() and toLatex() in the memisc package.
N <- 10^3
u <- rnorm(N)
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
y <- 1 + x1 + x2 + u
lm1 <- lm(y ~ x1 + x2 )
lm2 <- lm(y ~ x1 + x2 + I(x1*x2))
library(estout)
estclear() # clear all the eststo objects
eststo(lm1)
eststo(lm2)
esttab() # print it
library("apsrtable")
apsrtable(lm1,lm2)
library(xtable)
xtable(lm1)
tab <- xtable(lm1)
print(tab,type="html")
source("http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R")
outreg(list(lm1,lm2))
library("memisc")
toLatex(mtable(lm1,lm2))
Export to HTML
rpublisher[16] is a literate programming tool which publishes results in HTML (it is based on Python and
was last updated in 2008).
See R2HTML, xtable, hwriter, prettyR, highlight, HTMLUtils
wiki.table() in the hacks package exports a matrix or a data frame into MediaWiki [17] table markup (as used on
this wiki and many others).
> wiki.table(matrix(1:16,4),caption="Test")
{|
|+ Test
| 1 || 5 || 9 || 13
|-
| 2 || 6 || 10 || 14
|-
| 3 || 7 || 11 || 15
|-
| 4 || 8 || 12 || 16
|}
References
[1] The Sweave Homepage http://www.stat.uni-muenchen.de/~leisch/Sweave/
[2] http://www.econ.uiuc.edu/~roger/repro.html
[3] Meredith, E. and J.S. Racine (2009), Towards Reproducible Econometric Research: The Sweave Framework, Journal of Applied Econometrics, Volume 24, pp 366-374.
[4] Charles Geyer "Why Reproducible Research is the Right Thing" http://www.stat.umn.edu/~charlie/Sweave/
[5] Babel in Emacs Org-mode http://orgmode.org/worg/org-contrib/babel/intro.html
[6] See the LaTeX Wikibook if you want to learn about LaTeX
[7] reporttools: R Functions to Generate LaTeX Tables of Descriptive Statistics (http://www.jstatsoft.org/v31/c01)
[8] http://www.r-statistics.com/2013/01/stargazer-package-for-beautiful-latex-tables-from-r-statistical-models-output/
[9] http://www.r-bloggers.com/texreg-a-package-for-beautiful-and-easily-customizable-latex-regression-tables-from-r/
[10] estout: http://repec.org/bocode/e/estout/
[11] xtable on the dataninja blog (http://archive.is/20121225194652/dataninja.wordpress.com/2006/02/11/getting-tables-from-r-output/)
[12] http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R
[13] The outreg() function http://pj.freefaculty.org/R/WorkingExamples/outreg-worked.R
[14] Stata outreg http://ideas.repec.org/c/boc/bocode/s375201.html
[15] http://ryouready.wordpress.com/2009/06/19/r-function-to-create-tables-in-latex-or-lyx-to-display-regression-models-results/
[16] rpublisher: http://code.google.com/p/rpublisher/
[17] http://www.mediawiki.org
Descriptive Statistics
In this section, we present descriptive statistics, i.e. a set of tools to describe and explore data. This mainly includes
univariate and bivariate statistical tools.
Generic Functions
We introduce some functions to describe a dataset.
names() gives the names of each variable
str() gives the structure of the dataset
summary() gives the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each variable in the data.
> summary(mydat)
> library("Hmisc")
> describe(mydat)
> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
Univariate analysis
Continuous variable
Moments
mean() computes the mean.
var() computes the variance.
sd() computes the standard deviation.
skewness() computes the skewness (fUtilities, moments or e1071).
kurtosis() computes the kurtosis (fUtilities, moments or e1071).
moment() and all.moments() (moments) compute moments of any order.
> library(moments)
> x <- rnorm(1000)
> moment(x,order = 2) # the variance
[1] 0.999782
> all.moments(x, order.max = 4) # raw moments of order 0 to 4
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # the skewness
[1] 0.0626506
Order statistics
range() returns the minimum and maximum of a vector, min() the minimum and max() the maximum.
IQR() computes the interquartile range, median() the median and mad() the median absolute deviation.
quantile(), hdquantile() in the Hmisc package and kuantile() in the quantreg package
compute the sample quantiles of a continuous vector. kuantile() may be more efficient when the sample
size is large.
> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> seq <- seq(0, 1, 0.25)
> quantile(x, probs = seq, na.rm = FALSE, names = TRUE)
0% 25% 50% 75% 100%
-3.07328999 -0.66800917 0.02010969 0.72620061 2.92897970
> hdquantile(x, probs = seq, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
0.00 0.25 0.50 0.75 1.00
-3.07328999 -0.66901899 0.02157989 0.72378407 2.92897970
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
0% 25% 50% 75% 100%
-3.07328999 -0.66800917 0.02010969 0.72620061 2.92897970
attr(,"class")
[1] "kuantile"
Inequality Index
The Gini coefficient : Gini() (ineq) and gini() (reldist).
ineq() (ineq) computes various inequality indices.
> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) # Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694
Concentration index
> library(ineq)
> Herfindahl(x)
[1] 0.003091162
> Rosenbluth(x)
[1] 0.002141646
Poverty index
> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
# ?pov # learn more about poverty indices
> qqnorm(x)
> qqline(x, col="red") # does not create a plot but adds a line to the existing one
> ks.test(y,"punif") # test whether y fits a uniform distribution (the beta(1,1) distribution is the uniform distribution)
Some tests are specific to the normal distribution. The Lilliefors test is an extension of the KS test for the case where
the parameters are unknown; it is implemented as lillie.test() in the nortest package. shapiro.test()
implements the Shapiro-Wilk normality test.
data: x
D = 0.0955, p-value = 0.9982
> shapiro.test(x)
data: x
W = 0.9916, p-value = 0.7902
> library("nortest")
> ad.test(x)
data: x
A = 0.2541, p-value = 0.7247
See also the package ADGofTest for another version of this test[1].
Shapiro-Francia normality test :
> sf.test(x)
data: x
W = 0.9866, p-value = 0.9953
> library("nortest")
> pearson.test(x)
data: x
P = 0.8, p-value = 0.8495
> cvm.test(x)
data: x
W = 0.0182, p-value = 0.9756
Jarque-Bera test :
> jarque.bera.test(x)
data: x
X-squared = 0.6245, df = 2, p-value = 0.7318
Discrete variable
We generate a discrete variable using sample() and we tabulate it using table(). We can plot using a pie chart
(pie()), a bar chart (barplot() or barchart() (lattice)) or a dot chart (dotchart() or dotplot()
(lattice)).
freq() (descr) prints the frequency, the percentages and produces a barplot. It supports weights.
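A minimal sketch of this workflow on simulated data (the group labels are illustrative):

```r
set.seed(1)
grp <- sample(c("A", "B", "C"), 100, replace = TRUE)   # a discrete variable
tab <- table(grp)                                      # tabulate it
tab
pie(tab)                                         # pie chart
barplot(tab)                                     # bar chart
dotchart(as.numeric(tab), labels = names(tab))   # dot chart
```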
Multivariate analysis
Continuous variables
Covariance : cov()
Pearson's linear correlation : cor().
cor.test() performs Pearson's correlation test.
Spearman's rank correlation :
cor() with method = "spearman".
spearman() (Hmisc)
Spearman's rank correlation test :
spearman2() (Hmisc)
spearman.test() (Hmisc)
spearman.test() (pspearman package) performs Spearman's rank correlation test with a precomputed
exact null distribution for n <= 22.
Kendall's correlation : cor() with method = "kendall". See also the Kendall package.
Discrete variables
table(), xtabs() and prop.table() for contingency tables.
assocplot() and mosaicplot() for graphical display of contingency table.
CrossTable() (descr) is similar to SAS Proc Freq. It returns a contingency table with chi-squared and Fisher
independence tests.
my.table.NA() and my.table.margin() (cwhmisc)
chisq.detail() (TeachingDemos)
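A small sketch of these contingency-table tools on simulated data (the variable names are illustrative):

```r
set.seed(42)
smoker <- sample(c("yes", "no"), 200, replace = TRUE)
sex <- sample(c("male", "female"), 200, replace = TRUE)
tab <- table(smoker, sex)   # 2 x 2 contingency table
tab
prop.table(tab)      # cell proportions
prop.table(tab, 1)   # row proportions
chisq.test(tab)      # chi-squared independence test
```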
Equality of two sample means: t.test() and wilcox.test(). Equality of variances: var.test(). Equality
of two distributions: ks.test().
N <- 100
x <- sample(0:1,N, replace = T)
y <- x + rnorm(N)
t.test(y ~ x )
wilcox.test(y ~ x)
References
[1] Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http://CRAN.R-project.org/package=ADGofTest
Mathematics
Basics
?Arithmetic
?Special
Linear Algebra
Vectors
Matrix Algebra
If you want to create a new matrix, one way is to use the matrix() function. You enter a vector of data, the
number of rows and/or columns, and you can specify whether R should read your vector by row (byrow = TRUE)
or by column (the default). You can also combine vectors using cbind() or rbind(). The
dimensions of a matrix can be obtained using the dim() function or, alternatively, nrow() and ncol().
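For example (the values are illustrative):

```r
M <- matrix(1:6, nrow = 2)                  # filled by column (the default)
M2 <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled by row
dim(M)                 # dimensions: 2 rows, 3 columns
nrow(M)
ncol(M)
rbind(M, c(7, 8, 9))   # append a row
cbind(M, c(7, 8))      # append a column
```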
> library(matlab)
> eye(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> ones(3)
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> zeros(3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
[3,] 0 0 0
Diagonal matrix
> diag(3)
Upper triangular : use upper.tri(). For n = 3, the strictly upper-triangular pattern is
[,1] [,2] [,3]
[1,] 0 1 1
[2,] 0 0 1
[3,] 0 0 0
Lower triangular
Same as upper triangular but using lower.tri() instead.
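A short sketch of both helpers:

```r
M <- matrix(0, 3, 3)
M[upper.tri(M)] <- 1   # strictly upper-triangular entries
M
L <- matrix(0, 3, 3)
L[lower.tri(L)] <- 1   # strictly lower-triangular entries
L
```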
Create a Hilbert matrix using hilbert() (fUtilities).
Matrix calculations
Compute a matrix product with X %*% Y.
Kronecker product [1] :
> library(fUtilities)
> kron(I,M)
[,1] [,2] [,3] [,4]
[1,] 2 2 0 0
[2,] 2 2 0 0
[3,] 0 0 2 2
[4,] 0 0 2 2
Matrix transposition
Transpose the matrix
> t(M)
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 2
[3,] 0 0 1
Matrix inversion
Invert a matrix using solve() or inv() (fUtilities). We can also compute the generalized inverse using
ginv() in the MASS package.
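A self-contained sketch of these calls (the session printed below comes from a setup on an earlier page; the matrix A here is illustrative):

```r
library(MASS)

A <- matrix(c(2, 1, 1, 3), 2, 2)   # an invertible 2 x 2 matrix
b <- c(1, 2)
Ainv <- solve(A)   # the inverse of A
x <- solve(A, b)   # solves A %*% x = b directly (preferred over Ainv %*% b)
A %*% x            # recovers b
ginv(A)            # Moore-Penrose generalized inverse
```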
[,1]
[1,] 2.388889
[2,] 1.388889
>
> k=solve(m,l)
> k
[,1]
[1,] -0.9111111
[2,] 3.3000000
>
> m%*%k #checking the answer
[,1]
[1,] 2.388889
[2,] 1.388889
>
> eigen(M)
$values
[1] 1 1 1
$vectors
[,1] [,2] [,3]
[1,] 0 2.220446e-16 0.000000e+00
[2,] 0 0.000000e+00 1.110223e-16
[3,] 1 -1.000000e+00 -1.000000e+00
Misc
compute the norm of a matrix using norm() (fUtilities).
check whether a matrix is positive definite with isPositiveDefinite() (fUtilities).
make a matrix positive definite with makePositiveDefinite() (fUtilities).
compute row and column statistics (fUtilities).
extract the upper and lower parts of a matrix with triang() and Triang() (fUtilities).
See also the matrix, matlab, matrixcalc, matrixStats packages.
Analysis
Polynomial equations
To solve a x^2 + b x + c = 0 (and, more generally, a polynomial of any degree), where a, b, c are given
numbers, pass the coefficients to polyroot() in increasing order of degree:
> polyroot(c(c, b, a))
So, for example, to calculate the roots of the equation 2x^2 - 5x - 3 = 0 one would do as follows:
> polyroot(c(-3,-5,2))
[1] -0.5+0i 3.0-0i
Derivatives
Symbolic calculations
R can compute the derivative of an expression. You need to convert your function into an expression using the
expression() function; otherwise you get an error message.
Here are some examples :
> D(expression(x^n),"x")
x^(n - 1) * n
> D(expression(exp(a*x)),"x")
exp(a * x) * a
> D(expression(1/x),"x")
-(1/x^2)
> D(expression(x^3),"x")
3 * x^2
> D(expression(pnorm(x)),"x")
dnorm(x)
> D(expression(dnorm(x)),"x")
-(x * dnorm(x))
Numerical approximation
numDeriv package
Integration
R can perform one-dimensional integration. For example, we can integrate the density of the normal distribution
between -Inf and Inf, or over narrower intervals :
> integrate(dnorm,-Inf,Inf)
1 with absolute error < 9.4e-05
> integrate(dnorm,-1.96,1.96)
0.9500042 with absolute error < 1.0e-11
> integrate(dnorm,-1.64,1.64)
0.8989948 with absolute error < 6.8e-14
# we can also store the result in an object
> ci90 <- integrate(dnorm,-1.64,1.64)
> ci90$value
[1] 0.8989948
> integrate(dnorm,-1.64,1.64)$value
[1] 0.8989948
> library(adapt)
> ?adapt
> ir2pi <- 1/sqrt(2*pi)
> fred <- function(z) { ir2pi^length(z) * exp(-0.5 * sum(z * z))}
>
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred)
value relerr minpts lenwrk ifail
1.039222 0.0007911264 231 73 0
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-4)
value relerr minpts lenwrk ifail
1.000237 1.653498e-05 655 143 0
> adapt(2, lo = c(-5,-5), up = c(5,5), functn = fred, eps = 1e-6)
value relerr minpts lenwrk ifail
1.000039 3.22439e-07 1719 283 0
Probability
The number of combinations of k elements among n :
> choose(100, 5)
[1] 75287520
Arithmetics
> factorial(3)
[1] 6
> prod(1:3)
[1] 6
Note that by convention 0! = 1 : factorial(0) returns 1. This is not the case with the prod() function.
> factorial(0)
[1] 1
> prod(0)
[1] 0
Factorials grow very quickly and cannot be computed in double precision for large values.
> factorial(170)
[1] 7.257416e+306
> factorial(171)
[1] Inf
Warning message:
In factorial(171) : value out of range in 'gammafn'
> 5 %% 2 # modulo
[1] 1
> 5 %/% 2 # integer division
[1] 2
Note that %% and %/% also accept non-integer operands; floating-point representation can then produce surprising results.
Geometry
pi is the constant π.
cos(), sin() and tan() are the trigonometric functions.
Symbolic calculus
rSymPy (rsympy [2]) provides sympy (link [3]) functions in R.
If you want to do more symbolic calculus, see Maxima[4], SAGE[5], Mathematica[6]
References
[1] http://en.wikipedia.org/wiki/Kronecker_product
[2] http://code.google.com/p/rsympy/
[3] http://code.google.com/p/sympy/
[4] Maxima is open source http://maxima.sourceforge.net/
[5] SAGE is an open source package which includes R and Maxima : http://www.sagemath.org/
[6] Mathematica is not open source http://www.wolfram.com/products/mathematica/index.html
Optimization
optimize() is devoted to one-dimensional optimization problems.
optim(), nlm() and ucminf() (ucminf) can be used for multidimensional optimization problems.
nlminb() for constrained optimization.
See also the quadprog, minqa, rgenoud and trust packages.
Work is ongoing to improve optimization in R. See Updating and improving optim() (Use R 2009 slides [1][2]),
the R-forge optimizer page [3][4] and the corresponding packages, including optimx [5].
Numerical Methods
> func <- function(x) (x - 2)^2 # illustrative objective (any unimodal function works)
> grid <- seq(-10, 10, by = 0.1)
> plot(grid, func(grid))
>
> # you can find the minimum using the optimize function
> optimize(f = func, interval = c(-10, 10))
$minimum
[1] 2
$objective
[1] 0
Newton-Raphson
nlm() provides a Newton algorithm.
maxLik package for maximization of a likelihood function. This package includes the Newton Raphson method.
newtonraphson() in the spuRs package.
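As a sketch, nlm() minimizing a simple quadratic (the function and starting values are illustrative):

```r
# minimize f(p) = (p1 - 1)^2 + (p2 - 2)^2, whose minimum is at (1, 2)
f <- function(p) (p[1] - 1)^2 + (p[2] - 2)^2
res <- nlm(f, p = c(0, 0))
res$estimate   # close to c(1, 2)
res$minimum    # close to 0
```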
BFGS
The BFGS method
Simulation methods
Simulated Annealing
Simulated annealing is an algorithm which is useful for maximising non-smooth functions. It is
implemented in optim() (method = "SANN").
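A minimal sketch using optim() with method = "SANN" on a non-smooth objective (the function is illustrative):

```r
set.seed(1)
f <- function(x) abs(x - 2) + 0.1 * abs(x + 1)   # non-smooth, minimum at x = 2
res <- optim(par = 0, fn = f, method = "SANN")
res$par   # near 2
```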
Genetic Algorithm
rgenoud package for genetic algorithm[7]
gaoptim package for genetic algorithm[8]
References
Venables and Ripley, Chapter 16.
Cameron and Trivedi, Microeconometrics, chapter 10
Braun and Murdoch (Chapter 7)[9] is a very good reference on optimization using R.
[1] http://www.agrocampus-ouest.fr/math/useR-2009/slides/Nash+Varadhan.pdf
[2] Updating and improving optim(), Use R 2009 slides http://www.agrocampus-ouest.fr/math/useR-2009/slides/Nash+Varadhan.pdf
[3] http://optimizer.r-forge.r-project.org/
[4] R-forge optimizer http://optimizer.r-forge.r-project.org/
[5] http://r-forge.r-project.org/R/?group_id=395
[6] http://www.stat.umn.edu/geyer/trust/
[7] Jasjeet Sekhon homepage : http://sekhon.berkeley.edu/rgenoud/
[8] gaoptim on CRAN: http://cran.r-project.org/web/packages/gaoptim/index.html
[9] A first course in statistical programming with R http://portal.acm.org/citation.cfm?id=1385416
Probability Distributions
This page reviews the main probability distributions and describes the main R functions for dealing with them.
R has lots of probability functions.
r is the generic prefix for random variable generator such as runif(), rnorm().
d is the generic prefix for the probability density function such as dunif(), dnorm().
p is the generic prefix for the cumulative distribution function such as punif(), pnorm().
q is the generic prefix for the quantile function such as qunif(), qnorm().
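For example, for the normal distribution the four prefixes fit together as follows (qnorm() is the inverse of pnorm()):

```r
dnorm(0)       # density of N(0,1) at 0, i.e. 1/sqrt(2*pi)
pnorm(1.96)    # cumulative probability, about 0.975
qnorm(0.975)   # quantile, about 1.96 (inverse of pnorm)
set.seed(1)
rnorm(3)       # three random draws from N(0,1)
```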
Discrete distributions
Benford Distribution
The Benford distribution [1] is the distribution of the first digit of a number. It is due to Benford 1938[2] and
Newcomb 1881[3].
> library(VGAM)
> dbenf(c(1:9))
[1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679 0.05799195 0.05115252 0.04575749
Bernoulli
We can draw from a Bernoulli distribution [4] using sample(), runif() or rbinom() with size = 1.
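A sketch of the three equivalent approaches:

```r
set.seed(1)
rbinom(10, size = 1, prob = 0.5)   # rbinom with size = 1
sample(0:1, 10, replace = TRUE)    # sample from {0, 1}
as.integer(runif(10) < 0.5)        # threshold uniform draws
```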
Binomial
We can sample from a binomial distribution [5] using the rbinom() function with arguments n for the number of
samples to take, size defining the number of trials and prob defining the probability of success in each trial.
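For example:

```r
set.seed(1)
x <- rbinom(n = 1000, size = 10, prob = 0.3)
mean(x)   # close to size * prob = 3
var(x)    # close to size * prob * (1 - prob) = 2.1
```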
Hypergeometric distribution
We can sample n times from a hypergeometric distribution [6] using the rhyper() function.
Geometric distribution
The geometric distribution [7].
Multinomial
The multinomial distribution [8].
Poisson distribution
We can draw n values from a Poisson distribution [10] with a mean set by the argument lambda.
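For example:

```r
set.seed(1)
x <- rpois(n = 1000, lambda = 4)
mean(x)                  # close to lambda
var(x)                   # also close to lambda (mean = variance for the Poisson)
dpois(0:3, lambda = 4)   # P(X = 0), ..., P(X = 3)
```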
Zipf's law
The distribution of word frequencies is known as Zipf's law [11]. It is also a good description of the
distribution of city sizes[12]. See dzipf() and pzipf() (VGAM).
> library(VGAM)
> dzipf(x=2, N=1000, s=2)
Continuous distributions
Dirichlet
Several packages provide the Dirichlet distribution [14] :
> library(gtools)
> ?rdirichlet
> library(bayesm)
> ?rdirichlet
> library(MCMCpack)
> ?Dirichlet
Cauchy
We can sample n values from a Cauchy distribution [15] with a given location parameter (default is 0) and
scale parameter (default is 1) using the rcauchy() function.
Exponential
We can sample n values from an exponential distribution [17] with a given rate (default is 1) using the rexp()
function.
Fisher-Snedecor
We can draw the density of a Fisher distribution [18] (F-distribution) :
> par(mar=c(3,3,1,1))
> x <- seq(0,5,len=1000)
> plot(range(x),c(0,2),type="n")
> grid()
> lines(x,df(x,df1=1,df2=1),col="black",lwd=3)
> lines(x,df(x,df1=2,df2=1),col="blue",lwd=3)
> lines(x,df(x,df1=5,df2=2),col="green",lwd=3)
> lines(x,df(x,df1=100,df2=1),col="red",lwd=3)
> lines(x,df(x,df1=100,df2=100),col="grey",lwd=3)
> legend(2, 1.5, legend = c("n1=1, n2=1", "n1=2, n2=1", "n1=5, n2=2", "n1=100, n2=1", "n1=100, n2=100"), col = c("black", "blue", "green", "red", "grey"), lwd = 3, bty = "n")
Gamma
We can sample n values from a gamma distribution [19] with a given shape parameter and scale parameter
using the rgamma() function. Alternatively a shape parameter and rate parameter can be given.
Levy
We can sample n values from a Levy distribution [20] with a given location parameter (defined by the argument
m, default is 0) and scaling parameter (given by the argument s, default is 1) using the rlevy() function.
Log-normal distribution
We can sample n values from a log-normal distribution [21] with a given meanlog (default is 0) and sdlog
(default is 1) using the rlnorm() function.
> qnorm(.95)
[1] 1.644854
> qnorm(.975)
[1] 1.959964
> qnorm(.99)
[1] 2.326348
> library(mvtnorm)
> sig <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
> r <- rmvnorm(1000, sigma = sig)
> cor(r)
[,1] [,2]
[1,] 1.0000000 0.8172368
[2,] 0.8172368 1.0000000
Pareto Distributions
Generalized Pareto [23] dgpd() in evd
dpareto(), ppareto(), rpareto(), qpareto() in actuar
The VGAM package also has functions for the Pareto distribution.
Student's t distribution
Quantile of the Student t distribution [24]
> qt(.975,30)
[1] 2.042272
> qt(.975,100)
[1] 1.983972
> qt(.975,1000)
[1] 1.962339
The following lines plot the 0.975th quantile of the t distribution as a function of the degrees of freedom :
curve(qt(.975,x), from = 2 , to = 100, ylab = "Quantile 0.975 ", xlab = "Degrees of freedom", main = "Student t distribution")
abline(h=qnorm(.975), col = 2)
Uniform distribution
We can sample n values from a uniform distribution [25] (also known as a rectangular distribution) between two
values (defaults are 0 and 1) using the runif() function.
Weibull
We can sample n values from a Weibull distribution [26] with a given shape and scale parameter (default is
1) using the rweibull() function.
References
[1] http://en.wikipedia.org/wiki/Benford_distribution
[2] Benford, F. (1938) The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78, 551-572.
[3] Newcomb, S. (1881) Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics, 4, 39-40.
[4] http://en.wikipedia.org/wiki/Bernoulli_distribution
[5] http://en.wikipedia.org/wiki/Binomial_distribution
[6] http://en.wikipedia.org/wiki/Hypergeometric_distribution
[7] http://en.wikipedia.org/wiki/Geometric_distribution
[8] http://en.wikipedia.org/wiki/Multinomial_distribution
[9] http://en.wikipedia.org/wiki/Negative_binomial_distribution
[10] http://en.wikipedia.org/wiki/Poisson_distribution
[11] http://en.wikipedia.org/wiki/Zipf%27s_Law
[12] Gabaix, Xavier (August 1999). "Zipf's Law for Cities: An Explanation". Quarterly Journal of Economics 114 (3): 739-767. doi:10.1162/003355399556133. ISSN 0033-5533. http://pages.stern.nyu.edu/~xgabaix/papers/zipf.pdf
[13] http://en.wikipedia.org/wiki/Beta_distribution
[14] http://en.wikipedia.org/wiki/Dirichlet_distribution
[15] http://en.wikipedia.org/wiki/Cauchy_distribution
[16] http://en.wikipedia.org/wiki/Chi-square_distribution
[17] http://en.wikipedia.org/wiki/Exponential_distribution
[18] http://en.wikipedia.org/wiki/F-distribution
[19] http://en.wikipedia.org/wiki/Gamma_distribution
Random Number Generation
?RNGkind
It is possible to use true random numbers. Some of them are collected on random.org (link [1]). The random
package (link [2]) gives access to them.
Randu
Randu is an old linear congruential pseudorandom number generator. A dataset generated with Randu is available
in the datasets package; the function used to generate it is given on its help page.
library("datasets")
?randu
Seed
A pseudo-random number generator is an algorithm that starts from a value called the "seed". If you want to perform
an exact replication of your program, you have to set the seed using the function set.seed(). The argument of
set.seed() has to be an integer.
> set.seed(1)
> runif(1)
[1] 0.2655087
> set.seed(1)
> runif(1)
[1] 0.2655087
Sampling in a vector
Toss 10 coins
> sample(0:1,10,replace=T)
[1] 1 0 0 0 1 0 0 1 1 1
Roll 10 dice
> sample(1:6,10,replace=T)
[1] 4 1 5 3 2 5 5 6 3 2
Draw 6 numbers out of 49 without replacement (a lottery draw)
> sample(1:49,6,replace=F)
[1] 18 35 29 1 33 11
Misspecified argument
Note that if you pass a vector instead of a single number as the first argument of rnorm(), R uses the length of
the vector instead of returning an error. Here is an example :
N <- 100
qnorm(runif(N))
This draws from the same distribution as the rnorm() function but the computing time is higher :
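A rough timing comparison (elapsed times depend on your machine):

```r
N <- 10^6
t1 <- system.time(rnorm(N))["elapsed"]          # direct generator
t2 <- system.time(qnorm(runif(N)))["elapsed"]   # inversion method
c(direct = t1, inversion = t2)
```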
Importance sampling
See Jeff Gill's routine [3]
Gibbs algorithm
rgs package (link [4])
> library(randtoolbox)
> halton(10, dim = 2, init = TRUE, normal = FALSE, usetime = FALSE)
[,1] [,2]
[1,] 0.5000 0.33333333
[2,] 0.2500 0.66666667
[3,] 0.7500 0.11111111
[4,] 0.1250 0.44444444
[5,] 0.6250 0.77777778
[6,] 0.3750 0.22222222
[7,] 0.8750 0.55555556
[8,] 0.0625 0.88888889
[9,] 0.5625 0.03703704
[10,] 0.3125 0.37037037
You can compare Halton draws with the standard R (pseudo) random number generator. Halton draws are much
more systematic.
Examples
Resources
Revolution Computing entry on pseudo random numbers [7]
Statistical Distributions Module Wessa.net [8] it is online application which generates random numbers using R.
You can have access to the R code and use it in your own programs.
References
[1] http://www.random.org/
[2] http://dirk.eddelbuettel.com/code/random.html
[3] http://artsci.wustl.edu/~jgill/papers/sir4.txt
[4] http://code.google.com/p/rgs-package/
[5] http://bm2.genes.nig.ac.jp/RGM2/R_current/library/randtoolbox/man/quasiRNG.html
[6] http://mathworld.wolfram.com/QuasirandomSequence.html
[7] http://blog.revolution-computing.com/2009/02/how-to-choose-a-random-number-in-r.html
[8] http://www.wessa.net/distributions.wasp
Maximum Likelihood
Introduction
Maximum likelihood estimation is just an optimization problem. You have to write down your log-likelihood
function and use some optimization technique. Sometimes you also need to write your score (the first derivative of
the log-likelihood) and/or the Hessian (the second derivative of the log-likelihood).
One dimension
If there is only one parameter, we can optimize the log likelihood using optimize().
> library(actuar)
> y <- rpareto1(1000, shape = 1, min = 500)
> ll <- function(mu, x) {
+ sum(dpareto1(x,mu[1],min = min(x),log = TRUE))
+ }
> optimize(f = ll, x = y, interval = c(0,10), maximum = TRUE)
Multiple dimension
fitdistr() (MASS package) fits univariate distributions by maximum likelihood. It is a wrapper for
optim().
If you need to program your maximum likelihood estimator (MLE) yourself, you have to use a built-in optimizer
such as nlm() or optim(). R also includes the following optimizers :
mle() in the stats4 package
The maxLik package
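As a sketch of a hand-rolled MLE with optim() (normal model on simulated data; the log-sd parameterization is one common way to keep the scale parameter positive):

```r
set.seed(1)
x <- rnorm(500, mean = 2, sd = 3)

# negative log-likelihood; sd is parameterized as exp(theta[2]) so it stays positive
negll <- function(theta) -sum(dnorm(x, mean = theta[1], sd = exp(theta[2]), log = TRUE))

fit <- optim(c(0, 0), negll)
fit$par[1]        # estimated mean, close to 2
exp(fit$par[2])   # estimated sd, close to 3
```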
> # draw from a gumbel distribution using the inverse cdf simulation method
> e.1 <- -log(-log(runif(10000,0,1)))
> e.2 <- -log(-log(runif(10000,0,1)))
> u <- e.2 - e.1 # u follows a logistic distribution (difference between two gumbels.)
> fitdistr(u,densfun=dlogis,start=list(location=0,scale=1))
Tests
Resources
Charles Geyer : Maximum Likelihood in R (pdf, 9 pages) [2]
Marco Steenbergen Maximum Likelihood Programming in R (pdf, 7 pages) [3]
References
[1] Achim Zeileis, Torsten Hothorn (2002). Diagnostic Checking in Regression Relationships. R News 2(3), 7-10.
Method of Moments
The gmm package implements the generalized method of moments and the generalized empirical likelihood.
First, it is possible to estimate a simple linear model, or a simple linear model with instrumental variables, using the
gmm() function. The GMM method is often used to estimate heteroskedastic instrumental variable models.
Bayesian Methods
Introduction
R has lots of Bayesian estimation procedures, many more than Stata or SAS.
LearnBayes by Jim Albert
bayesm by Peter Rossi and Rob McCulloch
BaM by Jeff Gill
arm package by Jennifer Hill and Andrew Gelman.
MCMCpack package.
mcsm package by Christian Robert and George Casella.
umacs (link [1]) by Jouni Kerman and Andrew Gelman.
Resources
Christian Robert and Jean Michel Marin The Bayesian Core (link [3] including slides and R scripts)
Jim Albert : Bayesian Computation with R, Use R!, Springer 2007.
Christian Robert and George Casella Introducing Monte Carlo Methods with R, Use R!, Springer 2009 (link [4]
including the mcsm package).
Peter Rossi, Greg Allenby, Robert McCulloch : Bayesian Statistics and Marketing and "bayesm" package [5].
CRAN task view for bayesian statistics (link [6])
References
[1] http://www.stat.columbia.edu/~kerman/Research/umacs.html
[2] http://www.stat.columbia.edu/~gelman/bugsR/
[3] http://www.ceremade.dauphine.fr/~xian/BCS/
[4] http://www.ceremade.dauphine.fr/~xian/books.html
[5] http://faculty.chicagobooth.edu/peter.rossi/research/bsm.html
[6] http://cran.r-project.org/web/views/Bayesian.html
Bootstrap
boot package includes functions from the book Bootstrap Methods and Their Applications by A. C. Davison and
D. V. Hinkley (1997, CUP)
bootstrap package.
Quick how-to
Do a bootstrap of some data for some function (here, mean):
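A minimal sketch with the boot package (the data and statistic are illustrative):

```r
library(boot)

set.seed(1)
x <- rnorm(100, mean = 5)

# the statistic must accept the data and a vector of resampled indices
boot.mean <- function(data, indices) mean(data[indices])

b <- boot(data = x, statistic = boot.mean, R = 1000)
b                           # original estimate and bootstrap standard error
boot.ci(b, type = "perc")   # percentile confidence interval
```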
References
Instructions for the boot package: http://www.statmethods.net/advstats/bootstrapping.html
Sample using the boot package: http://druedin.com/2012/11/10/bootstrapping-in-r/
Multiple Imputation
Multiple imputation of missing data generally includes two steps. First, an imputation step which results in multiple
complete datasets. Second, combining the results obtained by applying the chosen technique to each separate
dataset. The packages needed for these two steps are not necessarily the same, but can be.
"mi" package Andrew Gelman Blog Entry on the "mi" package [1]
"mice" package [2].
Amelia [3]
mitools [4] by Thomas Lumley
survey and Zelig have support for multiply imputed datasets.
References
[1] http://www.stat.columbia.edu/~cook/movabletype/archives/2009/06/multiple_imputa_4.html
[2] http://www.multiple-imputation.com/
[3] http://cran.r-project.org/web/packages/Amelia/index.html
[4] http://cran.us.r-project.org/web/packages/mitools/index.html
Nonparametric Methods
This page deals with a set of non-parametric methods [1] including the estimation of a cumulative distribution
function (CDF), the estimation of probability density function (PDF) with histograms and kernel methods and the
estimation of flexible regression models such as local regressions and generalized additive models.
For an introduction to nonparametric methods you can have a look at the following books or handout :
Nonparametric Econometrics: A Primer by Jeffrey S. Racine[2].
Li and Racine's handbook, Nonparametric econometrics[3].
Larry Wasserman, All of Nonparametric Statistics[4]
Density Estimation
Histogram
hist() is the standard function for drawing histograms. If you store the histogram as an object the estimated
parameters are returned in this object.
> x <- rnorm(1000)
> hist(x, probability = T) # The default uses Sturges method.
> # Sturges, H. A. (1926) The choice of a class interval.
> # Journal of the American Statistical Association 21, 65-66.
> hist(x, breaks = "Sturges", probability = T)
>
> # Freedman, D. and Diaconis, P. (1981) On the histogram as a density estimator: L_2 theory.
> # Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 453-476.
> # (n^1/3 * range)/(2 * IQR).
> hist(x, breaks = "FD", probability = T)
>
> # Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66, 605-610.
> # ceiling[n^1/3 * range/(3.5 * s)].
> hist(x, breaks = "scott", probability = T)
>
> # Wand, M. P. (1995). Data-based choice of histogram binwidth.
> # The American Statistician, 51, 59-64.
> library("KernSmooth")
> h <- dpih(x)
> bins <- seq(min(x)-h, max(x)+h, by=h)
> hist(x, breaks=bins, probability = T)
n.bins() (car package) includes several methods to compute the number of bins for a histogram.
histogram() (lattice)
truehist() (MASS)
hist.scott() (MASS) plots a histogram with automatic bin width selection, using the Scott or
Freedman-Diaconis formulae.
histogram package.
Choose the kernel function with kernel : "gaussian", "epanechnikov", "rectangular", "triangular",
"biweight", "cosine", "optcosine".
> x <- rnorm(10^3)
> plot(density(x, kernel = "gaussian"))   # plotting calls are a sketch; only the legend survived in the original
> lines(density(x, kernel = "epanechnikov"), col = 2)
> lines(density(x, kernel = "rectangular"), col = 3)
> lines(density(x, kernel = "triangular"), col = 4)
> legend("topright", legend = c("gaussian", "epanechnikov", "rectangular", "triangular"), col = 1:4, lty = 1)
tkdensity() (sfsmisc) is a nice function which allows you to dynamically choose the kernel and the bandwidth
with a handy graphical user interface. This is a good way to check the sensitivity of the density estimate to the
bandwidth and/or kernel choice.
Examples
Local Regression
loess() is the standard function for local linear regression.
lowess() is similar to loess() but does not use the standard regression syntax y ~ x. It is the
ancestor of loess() (with different defaults!).
ksmooth() (stats) computes the Nadaraya-Watson kernel regression estimate.
locpoly() (KernSmooth package)
npreg() (np package)
locpol computes local polynomial estimators
locfit local regression, likelihood and density estimation
Examples
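A short loess() sketch on simulated data:

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
fit <- loess(y ~ x, span = 0.3)   # smaller span = more local fit
plot(x, y)
ord <- order(x)
lines(x[ord], predict(fit)[ord], col = "red", lwd = 2)
```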
> library(mgcv)
> g1 <- gam(y ~ s(x1) + s(x2) ) # x1 and x2 are locally estimated
> par(mfrow=c(1,2))
> plot(g1, se = T)
References
[1] http://en.wikipedia.org/wiki/Non-parametric_statistics
[2] Jeffrey S. Racine, Nonparametric Econometrics: A Primer http://socserv.mcmaster.ca/racine/ECO0301.pdf and the R code examples http://socserv.mcmaster.ca/racine/primer_code.zip
[3] Qi Li, Jeffrey S. Racine, Nonparametric Econometrics, Princeton University Press, 2007.
[4] Wasserman, Larry, All of Nonparametric Statistics, Springer (2007) (ISBN: 0387251456)
[5] http://en.wikipedia.org/wiki/Kernel_Density_Estimation
Linear Models
Standard linear model
In this section we present estimation functions for the standard linear model estimated by ordinary least squares
(OLS). Heteroskedasticity and endogeneity are treated below. The main estimation function is lm().
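The fit object used in the examples below was created earlier in the book; a minimal sketch of an equivalent setup on simulated data (the coefficients are illustrative):

```r
set.seed(1)
N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + x1 + x2 + rnorm(N)   # illustrative data-generating process
fit <- lm(y ~ x1 + x2)
summary(fit)
```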
> library("arm")
> display(fit)
> coefplot(fit)
fit is a list of objects. You can see the list of these objects by typing names(fit). We can also apply functions
to fit.
We can get the estimated coefficients using fit$coeff or coef(fit).
> fit$coeff
(Intercept) x1 x2
1.2026522 0.8427403 1.5146775
> coef(fit)
(Intercept) x1 x2
1.2026522 0.8427403 1.5146775
> output <- summary(fit)
> coef(output)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1945847 0.2298888 5.196359 0.001258035
x1 0.6458170 0.3423214 1.886581 0.101182585
x2 0.6175165 0.2083628 2.963660 0.020995713
> fit$fitted
> fitted(fit)
> fit$resid
> residuals(fit)
> fit$df
Confidence intervals
We can get the confidence intervals using confint() or conf.intervals() in the alr3 package.
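For instance (a self-contained sketch; confint() is in the base stats package):

```r
> x1 <- rnorm(100); x2 <- rnorm(100)
> y <- 1 + x1 + x2 + rnorm(100)
> fit <- lm(y ~ x1 + x2)
> confint(fit)                # 95% confidence intervals by default
> confint(fit, level = 0.90)  # 90% intervals
```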
Tests
coeftest() (lmtest) performs the Student t test and z test on coefficients.
> library("lmtest")
> coeftest(fit) # t-test
> coeftest(fit,df=Inf) # z-test (for large samples)
linear.hypothesis() (car) performs a finite-sample F test of a linear hypothesis, or an asymptotic Wald test
using the chi-squared statistic.
> library("car")
> linear.hypothesis(fit,"x1 = x2") # tests Beta1 = Beta2
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(1,3)) # Tests Beta0 = Beta1 = Beta2 = 1
> linear.hypothesis(fit,c("(Intercept)", "x1","x2"),rep(0,3)) # Tests Beta0 = Beta1 = Beta2 = 0
> linear.hypothesis(fit,c("x1","x2"),rep(0,2)) # Tests Beta1 = Beta2 = 0
Analysis of variance
We can also make an analysis of variance using anova().
> anova(fit)
> library(stats4)
> ?BIC
> lm1 <- lm(Fertility ~ . , data = swiss)
> AIC(lm1)
[1] 326.0716
> BIC(lm1)
[1] 339.0226
The step() function performs a stepwise model search using the Akaike Information Criterion.
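A sketch using the swiss data set already loaded above:

```r
> lm1 <- lm(Fertility ~ ., data = swiss)
> step(lm1)  # drops or adds regressors as long as the AIC improves
```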
Zelig
Least-squares regression is also supported in Zelig with the model = "ls" option.
Bayesian estimation
MCMCregress() (MCMCpack)
BLR() (BLR)
Heteroskedasticity
See the lmtest and sandwich packages.
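For instance, heteroskedasticity-robust (White) standard errors can be obtained by combining coeftest() (lmtest) with vcovHC() (sandwich); the simulated data here are only for illustration:

```r
> library("lmtest")
> library("sandwich")
> x <- rnorm(100)
> y <- 1 + x + x * rnorm(100)  # errors whose variance depends on x
> fit <- lm(y ~ x)
> coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```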
gls() (nlme) computes the generalized least squares estimator.
See "Cluster-robust standard errors using R" (pdf) [1] by Mahmood Arai. He provides two functions for cluster-robust
standard errors: clx() allows for one-way clustering and mclx() for two-way clustering. They can be
loaded with the following command: source("http://people.su.se/~ma/clmclx.R").
Robustness
Cook's distance
> library(car)
> cookd(fit)
1 2 3 4 5
0.0006205008 0.0643213760 0.2574810866 1.2128206779 0.2295047699
6 7 8 9 10
0.3130578329 0.0003365221 0.0671830241 0.0048474954 0.0714255871
Influence plot:
> influence.plot(fit)
Leverage plots:
> leverage.plot(fit, term.name = "x1")
> leverage.plot(fit, term.name = "x2")
> outlier.test(fit)
Observation: 3
Instrumental Variables
ivreg() in the AER package[3]
tsls() in the sem package.
It is also possible to use the gmm() command in the gmm package. See Methods of moments for an example.
> N <- 1000
> z <- rnorm(N); u <- rnorm(N)
> x <- 1 + z + u + rnorm(N) # x is correlated with the error term u (endogeneity) and with the instrument z
> y <- 1 + x + u
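The model can then be estimated by two-stage least squares with ivreg() (AER), where the instrument appears after the | in the formula; the simulation is repeated so the block is self-contained:

```r
> library("AER")
> N <- 1000
> z <- rnorm(N)
> u <- rnorm(N)
> x <- 1 + z + u + rnorm(N)  # endogenous regressor
> y <- 1 + x + u
> summary(ivreg(y ~ x | z))  # IV slope should be close to the true value 1
> summary(lm(y ~ x))         # OLS is biased here
```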
Panel Data
plm() (plm) implements the standard random effects, fixed effects and first-differences methods[4]. It is similar to Stata's
xtreg command.
Note that plm output is not compatible with xtable() and mtable() for publication-quality output.
lme4 and gee implement random effects and multilevel models.
See also BayesPanel
We estimate the random effect model with the plm() function and the model = "random" option.
> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> # panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> re <- plm(eq, model = "random", data=panel)
> summary(re)
We first transform our data into a plm data frame using plm.data(). We estimate the fixed effects model using plm()
with the model = "within" option. Then we compare the estimates with the random effects model and
perform a Hausman test. At the end, we plot the density of the fixed effects.
> library("plm")
> panel <- plm.data(long, index = c("id","year"))
> #panel <- pdata.frame(long,c("id","year"))
> eq <- y ~ x1 + x2
> fe <- plm(eq, model = "within", data=panel)
> summary(fe)
> re <- plm(eq, model = "random", data=panel)
> summary(re)
> phtest(fe, re)
> plot(density(fixef(fe)))
> rug(fixef(fe))
References
[1] http:/ / people. su. se/ ~ma/ clustering. pdf
[2] http:/ / www. ats. ucla. edu/ stat/ r/ dae/ rreg. htm
[3] Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL
http://CRAN.R-project.org/package=AER
[4] Yves Croissant, Giovanni Millo (2008). Panel Data Econometrics in R: The plm Package. Journal of Statistical Software 27(2). URL http:/ /
www. jstatsoft. org/ v27/ i02/ .
[5] http:/ / en. wikipedia. org/ wiki/ Random_effects_model
[6] M Arellano, S Bond "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations" - The
Review of Economic Studies, 1991
[7] David Roodman, XTABOND2: Stata module to extend xtabond dynamic panel data estimator, http:/ / ideas. repec. org/ c/ boc/ bocode/
s435901. html
External links
Visualization of regression coefficients (http://www.r-statistics.com/2010/07/
visualization-of-regression-coefficients-in-r/)
Quantile Regression
Quantile regression is an old method which has become popular only in recent years thanks to advances in
computing. One of the main researchers in this area is also an R practitioner and has developed a specific package for
quantile regression (quantreg)[1][2].
In theory, quantile regression models are also linear and thus could have been included in the Linear Models page.
However, this is a very specific topic and we think it is worth a page of its own.
N <- 10^3
u <- runif(N)
x <- 1 + rnorm(N)
y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x
We estimate the quantile model for some values of tau (the quantile) and plot the coefficients :
We then plot the scatterplot, the predicted values using a standard linear model and the predicted values using a
quantile linear model :
We can also estimate the model for all quantiles at the same time:
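The three steps just described can be sketched with the quantreg package; the set of taus is arbitrary, and the simulation from the previous block is repeated so the snippet is self-contained:

```r
library(quantreg)
N <- 10^3
u <- runif(N)
x <- 1 + rnorm(N)
y <- qnorm(u, mean = 0, sd = 2) + qnorm(u, mean = 1, sd = 1) * x
fit <- rq(y ~ x, tau = c(0.1, 0.25, 0.5, 0.75, 0.9))
plot(summary(fit))                          # coefficients as a function of tau
plot(x, y, cex = 0.2)
abline(lm(y ~ x), col = "red")              # OLS prediction
abline(rq(y ~ x, tau = 0.5), col = "blue")  # median regression prediction
fit.all <- rq(y ~ x, tau = -1)              # tau outside [0,1]: the whole quantile process
```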
Computing time
For large data sets it is better to use the "fn" or "pfn" method.
Resources
Koenker, Roger (2005) Quantile Regression, Cambridge University Press. ISBN 0-521-60827-9
References
[1] Roger Koenker (2010). quantreg: Quantile Regression. R package version 4.50. http:/ / CRAN. R-project. org/ package=quantreg
[2] Roger Koenker's personal webpage (http:/ / www. econ. uiuc. edu/ ~roger/ research/ rq/ rq. html)
Binomial Models
In this section, we look at the binomial model. We have one outcome which is binary and a set of explanatory
variables.
This kind of model can be analyzed using a linear probability model. However, a drawback of this model for the
parameter of the Bernoulli distribution is that, unless restrictions are placed on the coefficients, the estimated coefficients can
imply probabilities outside the unit interval [0, 1]. For this reason, models such as the logit model or the probit
model are more commonly used. If you want to estimate a linear probability model, have a look at the Linear Models
page.
Logit model
The model takes the form P(y = 1 | x) = Λ(xβ), with the inverse link function Λ(z) = exp(z)/(1 + exp(z)). It can
be estimated by maximum likelihood using glm() with the family = binomial(link = "logit") option.
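A minimal maximum-likelihood sketch with simulated data (the coefficients are chosen arbitrarily):

```r
> N <- 1000
> x1 <- rnorm(N)
> xbeta <- -1 + x1
> proba <- exp(xbeta)/(1 + exp(xbeta))  # inverse logit link
> y <- rbinom(N, 1, proba)
> fit <- glm(y ~ x1, family = binomial(link = "logit"))
> summary(fit)
```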
Zelig
The Zelig package makes it easy to compute all the quantities of interest.
We develop a new example. First we simulate a new dataset with two continuous explanatory variables and we
estimate the model using zelig() with the model = "logit" option.
We then look at the predicted values of y at the mean of x1 and x2.
Next we look at the predicted values when x1 = 0 and x2 = 0.
Finally, we look at what happens when x1 changes from the 3rd to the 1st quartile.
> x1 <- 1 + rnorm(1000)
> x2 <- -1 + x1 + rnorm(1000)
> xbeta <- -1 + x1 + x2
> proba <- exp(xbeta)/(1 + exp(xbeta))
> y <- ifelse(runif(1000,0,1) < proba,1,0)
> mydat <- data.frame(y,x1,x2)
> table(y)
>
> z.out <- zelig(y ~ x1 + x2, model = "logit", data = mydat) # estimating the model
> summary(z.out)
> x.out <- setx(z.out, x1 = mean(x1), x2 = mean(x2)) # setting values for the explanatory variables
> s.out <- sim(z.out, x = x.out) # simulating the quantities of interest
> summary(s.out)
> plot(s.out) # plot the quantities of interest
> # What happens if x1 change from the 3rd quartile to the 1st quartile ?
> x.high <- setx(z.out, x1 = quantile(mydat$x1,.75), x2 = mean(mydat$x2))
> x.low <- setx(z.out, x1 = quantile(mydat$x1,.25), x2 = mean(mydat$x2))
> s.out2 <- sim(z.out, x = x.low, x1 = x.high) # simulate the first difference
> summary(s.out2)
Bayesian estimation
bayesglm() in the arm package
MCMClogit() in the MCMCpack for a bayesian estimation of the logit model.
> library("arm")
> res <- bayesglm(y ~ x, family = binomial(link=logit))
> summary(res)
Probit model
The probit model is a binary model in which we assume that the link function is the cumulative distribution function of a
standard normal distribution.
We simulate fake data. First, we draw two random variables x1 and x2 from any distributions (this does not matter).
Then we create the vector xbeta as a linear combination of x1 and x2. We apply the link function to that vector and
we draw the binary variable y as a Bernoulli random variable.
Maximum likelihood
We can use the glm() function with the family = binomial(link = probit) option, or the probit()
function in the sampleSelection package, which is a wrapper around the former.
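Putting the two steps together, a sketch of the simulation and the glm() estimation; this also creates the mydat data frame used in the Bayesian example below (the particular coefficients are our own choice):

```r
> N <- 1000
> x1 <- rnorm(N)
> x2 <- -1 + x1 + rnorm(N)
> xbeta <- -1 + x1 + x2
> proba <- pnorm(xbeta)  # probit link: the standard normal CDF
> y <- rbinom(N, 1, proba)
> mydat <- data.frame(y, x1, x2)
> fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = mydat)
> summary(fit)
```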
Bayesian estimation
MCMCprobit() (MCMCpack)
> library("MCMCpack")
> post <- MCMCprobit(y ~ x1 + x2 , data = mydat)
> summary(post)
> plot(post)
See Also
There is an example of a probit model with R on the UCLA statistical computing website[3].
Semi-Parametric models
Klein and Spady estimator[4] is implemented in the np package[5] (see npindex() with method =
"kleinspady" option).
References
[1] Kosuke Imai, Gary King, and Oliva Lau. 2008. "logit: Logistic Regression for Dichotomous Dependent Variables" in Kosuke Imai, Gary
King, and Olivia Lau, "Zelig: Everyone's Statistical Software," http:/ / gking. harvard. edu/ zelig
[2] http:/ / www. ats. ucla. edu/ stat/ r/ dae/ logit. htm
[3] UCLA statistical computing probit example http:/ / www. ats. ucla. edu/ stat/ R/ dae/ probit. htm
[4] Klein, R. W. and R. H. Spady (1993), An efficient semiparametric estimator for binary response models, Econometrica, 61, 387-421.
[5] Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). URL
http:/ / www. jstatsoft. org/ v27/ i05/ .
Multinomial Models
Multinomial Logit
mlogit package.
multinom() in the nnet package
vglm() with the multinomial() family in the VGAM package
Conditional Logit
clogit() in the survival package
mclogit package.
Multinomial Probit
mprobit package [1]
MNP [2] package to fit a multinomial probit.
N <- 10000
u <- rlogis(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(y,x)
library(MASS)
fit <- polr(y ~ x, method = "logistic", data = df)
summary(fit)
Bayesian estimation
bayespolr() (arm) performs a bayesian estimation of the multinomial ordered logit
library("arm")
fit <- bayespolr(y ~ x, method = "logistic", data = df)
summary(fit)
N <- 1000
u <- rnorm(N)
x <- rnorm(N)
ys <- x + u
mu <- c(-Inf,-1,0,1, Inf)
y <- cut(ys, mu)
plot(y,ys)
df <- data.frame(x,y)
library(MASS)
fit <- polr(y ~ x, method = "probit", data = df)
summary(fit)
Bayesian estimation
bayespolr() (arm) performs a bayesian estimation of the multinomial ordered probit
References
[1] Harry Joe, Laing Wei Chou and Hongbin Zhang (2006). mprobit: Multivariate probit model for binary/ordinal response. R package
version 0.9-2.
[2] http:/ / imai. princeton. edu/ software/ MNP. html
[3] Beggs, S., Cardell, S., Hausman, J., 1981. Assessing the potential demand for electric cars. Journal of Econometrics 17 (1), 119
(September).
[4] Jonathan Wand, Gary King, Olivia Lau (2009). anchors: Software for Anchoring Vignette Data. Journal of Statistical Software, Forthcoming.
URL
http://www.jstatsoft.org/.
N <- 1000
u <- rnorm(N)
x <- - 1 + rnorm(N)
ystar <- 1 + x + u
y <- ystar*(ystar > 0)
hist(y)
library(AER)
tobit <- tobit(y ~ x,left=0,right=Inf,dist = "gaussian")
N <- 1000
u <- rnorm(N)
v <- rnorm(N)
x <- - 1 + rnorm(N)
z <- 1 + rnorm(N)
d <- (1 + x + z + u + v> 0)
ystar <- 1 + x + u
y <- ystar*(d == 1)
hist(y)
library(sampleSelection)
heckit.ml <- heckit(selection = d ~ x + z, outcome = y ~ x, method = "ml")
summary(heckit.ml)
Truncation
truncreg package
DTDA "An R package for analyzing truncated data" pdf [4].
References
[1] Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL
http://CRAN.R-project.org/package=AER
[2] Sample Selection Models in R: Package sampleSelection http:/ / www. jstatsoft. org/ v27/ i07
[3] James Heckman "Sample selection bias as a specification error", Econometrica: Journal of the econometric society, 1979
[4] http:/ / www. agrocampus-ouest. fr/ math/ useR-2009/ slides/ Moreira+ DeUnaAlvarez+ Crujeiras. pdf
N <- 1000
x <- rnorm(N)
alpha <- c(1,1)
y <- rpois(N,exp(alpha[1] + alpha[2] * x))
df <- data.frame(x,y)
plot(x,y)
Maximum likelihood
We estimate this simple model using the glm() function with family = poisson as option.
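Continuing with the simulated Poisson data (repeated here so the snippet is self-contained):

```r
N <- 1000
x <- rnorm(N)
y <- rpois(N, exp(1 + x))  # true coefficients: 1, 1
df <- data.frame(x, y)
fit <- glm(y ~ x, family = poisson, data = df)
summary(fit)
```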
Bayesian estimation
The model can also be estimated using bayesian methods with the MCMCpoisson() function which is provided in
the MCMCpack.
library("MCMCpack")
posterior <- MCMCpoisson(y ~ x, data = df)
plot(posterior)
summary(posterior)
Overdispersion test
dispersiontest() (AER package) provides a test for equidispersion.
References
See UCLA website for an example [2]
Zeileis, A., Kleiber, C. and Jackman, S. Regression Models for Count Data in R [3]
Replication files for Cameron and Trivedi's 1998 book[4] are provided in the AER package[5]. You can simply
type ?CameronTrivedi1998 and you will find the source code.
[1] Markus Jochmann (2010). zic: Bayesian Inference for Zero-Inflated Count Models. R package version 0.5-3. http:/ / CRAN. R-project. org/
package=zic
[2] http:/ / www. ats. ucla. edu/ stat/ r/ dae/ poissonreg. htm
[3] http:/ / cran. r-project. org/ web/ packages/ pscl/ vignettes/ countreg. pdf
[4] Cameron, A.C. and Trivedi, P.K. (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press.
[5] Christian Kleiber and Achim Zeileis (2008). Applied Econometrics with R. New York: Springer-Verlag. ISBN 978-0-387-77316-2. URL
http://CRAN.R-project.org/package=AER
Duration Analysis
Using R for Survival Analysis (pdf) [1]
See the survival package
bootkm() Bootstrap Kaplan-Meier Estimates in Hmisc package
event.chart() Flexible Event Chart for Time-to-Event Data in the Hmisc package
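As a small sketch (not in the original text), Kaplan-Meier curves by group using the lung data bundled with the survival package:

```r
library(survival)
fit <- survfit(Surv(time, status) ~ sex, data = lung)
plot(fit, lty = 1:2, xlab = "days", ylab = "survival")
legend("topright", legend = c("male", "female"), lty = 1:2)
```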
References
[1] http:/ / www. math. unm. edu/ ~bedrick/ PIBBS/ Rsurv. pdf
Time Series
Introduction
In the following examples we illustrate the basic time-series tools with small artificial data sets. A real data set such
as Mpyr from the Ecdat package can be loaded with data(Mpyr, package = "Ecdat").
> data.a<-seq(1,24,by=1)
> is.ts(data.a)
[1] FALSE
> ts(data.a, start=c(2005,1), frequency=12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2005 1 2 3 4 5 6 7 8 9 10 11 12
2006 13 14 15 16 17 18 19 20 21 22 23 24
> data.b<-seq(1,24,by=1)
> is.ts(data.b)
[1] FALSE
> is.ts(as.ts(data.b))
[1] TRUE
> data.a<-seq(1,12,by=1)
> ts.a<-ts(data.a, start=c(2005,1), frequency=4)
> lag.a<-lag(ts.a,k=1)
> diff.a<-diff(ts.a,lag=1,difference=1)
> ts.a
Qtr1 Qtr2 Qtr3 Qtr4
2005 1 2 3 4
2006 5 6 7 8
2007 9 10 11 12
> lag.a
Qtr1 Qtr2 Qtr3 Qtr4
2004 1
2005 2 3 4 5
2006 6 7 8 9
2007 10 11 12
> diff.a
Qtr1 Qtr2 Qtr3 Qtr4
2005 1 1 1
2006 1 1 1 1
2007 1 1 1 1
Autocorrelation function
The function acf() computes (and by default plots) estimates of the autocovariance or autocorrelation function.
Function pacf() is the function used for the partial autocorrelations. Function ccf() computes the cross-correlation or
cross-covariance of two univariate series.[1]
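For instance, on a simulated AR(1) series (our own example):

```r
> x <- arima.sim(model = list(ar = 0.7), n = 200)
> acf(x)   # autocorrelation function, plotted by default
> pacf(x)  # partial autocorrelations
```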
Useful R-packages
fBasics, tis, zoo, tseries, xts, urca
References
[1] http:/ / www. inside-r. org/ r-doc/ stats/ acf
http://cran.r-project.org/web/views/TimeSeries.html
http://cran.r-project.org/doc/contrib/Ricci-refcard-ts.pdf
Factor Analysis
Introduction
Factor analysis is a set of techniques to reduce the dimensionality of the data. The goal is to describe the dataset with
a smaller number of variables (i.e. underlying factors). Factor analysis was developed in the early part of the 20th
century by L. L. Thurstone and others. Correspondence analysis was originally developed by Jean-Paul Benzécri in
the 1960s and 1970s. Factor analysis is mainly used in marketing, sociology and psychology. It is also known as data
mining, multivariate data analysis or exploratory data analysis.
There are three main methods. Principal Component Analysis deals with continuous variables. Correspondence
Analysis deals with a contingency table (two qualitative variables) and Multiple correspondence analysis is a
generalization of the correspondence analysis with more than two qualitative variables. The major difference
between Factor Analysis and Principal Components Analysis is that in FA, only the variance which is common to
multiple variables is analysed, while in PCA, all of the variance is analysed. Factor Analysis is a difficult procedure
to use properly, and is often misapplied in the psychological literature. One of the major issues in FA (and PCA) is
the number of factors to extract from the data. Incorrect numbers of factors can cause difficulties with the
interpretation and analysis of the data.
There are a number of techniques which can be applied to assess how many factors to extract. The two most useful
are parallel analysis and the minimum average partial criterion. Parallel analysis works by simulating a matrix of the
same rank as the data and extracting eigenvalues from the simulated data set. The point at which the simulated
eigenvalues are greater than those of the data is the point at which the "correct" number of factors have been
extracted. The Minimum Average Partial criterion uses a different approach but can often be more accurate.
Simulation studies have established these two methods as the most accurate. Both of these methods are available in
the psych package under the fa.parallel and the VSS commands.
Another issue in factor analysis is which rotation (if any) to choose. Essentially, the rotations transform the scores
such that they are more easily interpretable. There are two major classes of rotations, orthogonal and oblique.
Orthogonal rotations assume that the factors are uncorrelated, while oblique rotations allow the factors to correlate
(but do not force this). Oblique rotations are recommended by some (e.g. MacCallum et al 1999) as an orthogonal
solution can be obtained from an oblique rotation, but not vice versa.
One of the issues surrounding factor analysis is that there are an infinite number of rotations which explain the same
amount of variance, so it can be difficult to assess which model is correct. In response to such concerns, Structural
Equation Modelling (SEM), which is also known as Confirmatory Factor Analysis (CFA), was developed by
Jöreskog in the 1970s. The essential principle of SEM is that, given a model, it attempts to reproduce the observed
covariance matrix seen in the data. The ability of a model to reproduce the data can be used as a test of that model's
truth. SEM is implemented in R in the sem and lavaan packages, as well as the OpenMx package (which is not
available on CRAN).
See the following packages: FactoMineR (website [1]), amap, ade4, anacor, vegan, psych
N <- 1000
factor1 <- rnorm(N)
factor2 <- rnorm(N)
x1 <- rnorm(N) + factor1
x2 <- rnorm(N) + factor1
x3 <- rnorm(N) + factor2
x4 <- rnorm(N) + factor2
mydat <- data.frame(x1,x2,x3,x4)
pca <- prcomp(mydat)
names(pca)
plot(pca) # plot the eigenvalues
biplot(pca) # A two dimensional plot
pca2 <- princomp(~ x1 + x2 + x3 + x4, data = mydat) # princomp with a formula syntax
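The prcomp()/princomp() calls above perform PCA; a proper factor analysis can be run with factanal() (stats). A sketch with six simulated indicators and two underlying factors (with only four variables, as above, a two-factor model would not have enough degrees of freedom):

```r
N <- 1000
f1 <- rnorm(N); f2 <- rnorm(N)
x1 <- f1 + rnorm(N); x2 <- f1 + rnorm(N); x3 <- f1 + rnorm(N)
x4 <- f2 + rnorm(N); x5 <- f2 + rnorm(N); x6 <- f2 + rnorm(N)
fa <- factanal(data.frame(x1, x2, x3, x4, x5, x6), factors = 2, rotation = "varimax")
print(fa$loadings, cutoff = 0.3)  # x1-x3 load on one factor, x4-x6 on the other
```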
References
[1] http:/ / factominer. free. fr/ index. html
[2] http:/ / www. jstatsoft. org/ v20/ i03/ paper
[3] http:/ / www. carme-n. org/
[4] http:/ / www. statmethods. net/ advstats/ ca. html
[5] http:/ / cran. r-project. org/ web/ packages/ anacor/ vignettes/ anacor. pdf
[6] http:/ / www. jstatsoft. org/ v31/ i05
Ordination
Overview
This page provides basic code for creating a distance matrix [1] and running and plotting a Non-metric
Multidimensional Scaling [2] (NMDS) ordination.
Read more about Ordination [3] on Wikipedia.
This code relies on package vegan in R by Jari Oksanen [4].
Data
First, import data and load required libraries:
require(MASS)
require(vegan)
data(varespec) # species data
data(varechem) # environmental data
Distance matrix
bray <- vegdist(varespec, method = "bray") # calculate a distance matrix
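The NMDS ordination announced above can then be run with metaMDS() (vegan), which can also compute the Bray-Curtis distances internally when given the community matrix directly:

```r
require(vegan)
data(varespec)
nmds <- metaMDS(varespec, distance = "bray", k = 2)
plot(nmds, type = "t")  # text labels for sites and species
```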
Unconstrained Ordination
PCA biplot
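The PCA biplot itself was not reproduced in this text. With vegan, an unconstrained rda() call is a PCA, and its biplot() method plots site and species scores together (a sketch, using the same data):

```r
require(vegan)
data(varespec)
pca <- rda(varespec)      # rda() without constraints = principal component analysis
biplot(pca, scaling = 2)  # sites and species on the first two components
```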
References
[1] http:/ / en. wikipedia. org/ wiki/ Distance_matrix
[2] http:/ / en. wikipedia. org/ wiki/ Multidimensional_scaling
[3] http:/ / en. wikipedia. org/ wiki/ Ordination_(statistics)
[4] http:/ / cc. oulu. fi/ ~jarioksa/
Clustering
Basic clustering
K-Means Clustering
You can use the kmeans() function.
First create some data:
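The data-creation step was not reproduced here; a common sketch (two well-separated Gaussian clusters, in the spirit of the kmeans() help page):

```r
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
cl <- kmeans(x, centers = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)  # cluster centers
```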
Hierarchical Clustering
The basic hierarchical clustering function is hclust(), which works on a dissimilarity structure as produced by the
dist() function:
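A minimal sketch (our own simulated data):

```r
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))
hc <- hclust(dist(x), method = "complete")
plot(hc)                     # dendrogram
groups <- cutree(hc, k = 2)  # cut the tree into two clusters
```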
Available alternatives
See packages class, amap and cluster
See The R bioinformatic page on clustering [1]
References
"The Elements of Statistical Learning" [2]
External links
Clustergram: visualization and diagnostics for cluster analysis [3]
[1] http:/ / manuals. bioinformatics. ucr. edu/ home/ R_BioCondManual#CLUSTERBACK
[2] http:/ / www-stat. stanford. edu/ ~tibs/ ElemStatLearn/
[3] http:/ / www. r-statistics. com/ 2010/ 06/ clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
Network Analysis
Introduction
We mainly use the following packages to demonstrate network analysis in R: statnet, sna and igraph. This is not a
complete list, however; see the task view of gR, graphical models in R [1] for a complete list.
library(igraph)
After loading igraph you can choose your preferred input format. Below are examples of data provided as an edge list
and as an adjacency matrix.
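The original examples were not reproduced here; a minimal hypothetical sketch of both input formats:

```r
library(igraph)
# a graph from an edge list (one row per edge)
el <- matrix(c(1, 2,  2, 3,  3, 1,  3, 4), ncol = 2, byrow = TRUE)
g1 <- graph.edgelist(el, directed = FALSE)
# the same kind of graph from a (symmetric) adjacency matrix
A <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4)
g2 <- graph.adjacency(A, mode = "undirected")
plot(g1)
```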
References
Statnet website [2] includes all the documentation on network analysis using R.
Julien Barnier's introduction (in French) [3]
Journal of Statistical Software #24 Special Issue on Networks in R [4]
[1] http:/ / cran. r-project. org/ web/ views/ gR. html
[2] http:/ / csde. washington. edu/ statnet/
[3] http:/ / alea. fr. eu. org/ j/ reseaux_R. html
[4] http:/ / www. jstatsoft. org/ v24
Profiling R code
Before starting with parallel or high performance computing it is important to analyze and optimize R code. R
provides some useful tools to analyze and profile R code. A good and short introduction is provided in the R
extension documentation [1].
Soon we are going to provide some example code:
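In the meantime, a minimal profiling session with the built-in Rprof() might look like this (the workload is arbitrary):

```r
Rprof("profile.out")                 # start writing profiling samples
for (i in 1:50) solve(matrix(rnorm(200^2), 200))
Rprof(NULL)                          # stop profiling
summaryRprof("profile.out")$by.self  # time spent in each function
```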
References
[1] http:/ / cran. r-project. org/ doc/ manuals/ R-exts. html#Tidying-and-profiling-R-code
Parallel computing with R
References
[1] http:/ / cran. r-project. org/ web/ views/ HighPerformanceComputing. html
[2] http:/ / www. jstatsoft. org/ v31/ i01
Sources
For the following resources, authors have explicitly given the permission to include their material on the R
programming wikibook. Remember that even if they have given their permission, they should be correctly cited.
Blogs
R-statistics (the R category) [1] (A link to a post [2] which provides proper licence for approving this content for
use).
GETTING GENETICS DONE [3] - R tag. The R content is available from here:
http://gettinggeneticsdone.blogspot.com/search/label/R. The R code is copyrighted under The open source BSD
license (as is described here: http://gettinggeneticsdone.blogspot.com/p/copyright.html). The content itself is
licensed under a Creative Commons Attribution-Share-Alike 3.0 Unported License (as is shown at the bottom of
every post). Bottom line - the R code and written content can be used freely (with attribution).
Struggling Through Problems: http://strugglingthroughproblems.blogspot.com/search/label/R
Backsidesmack R-stuff category [4]. Copyright information is in the footer and explicit permission is in this post
[5]
Al3xandr3: http://al3xandr3.github.com/tags/r.html
Cloudnumbers.com (the R category) [6]: Posts about high-performance computing and cloud computing with R. A
link [7] to a post which provides proper license for approving this content for use.
The R Tutorial Series (http://rtutorialseries.blogspot.com) by John M. Quick [8] provides a collection of
user-friendly guides to researchers, students, and others who want to learn how to use R for their statistical
analyses. Its content is available for use in the R Programming wikibook under a Creative Commons BY-SA
License.
Exploring Indian census data using R [9] and useful scripts to download weather related data from websites. The
content is available for use in the R wikibook under cc-sa license.
Plain Data Analysis tips at www.danielmarcelino.com [10]. Topics covered in the blog are related to social
sciences, but there is a great variety of them.
References
[1] http:/ / www. r-statistics. com/ category/ r/
[2] http:/ / www. r-statistics. com/ 2011/ 06/ calling-r-lovers-and-bloggers-to-work-together-on-the-r-programming-wikibook/
[3] http:/ / GettingGeneticsDone. com/
[4] http:/ / www. backsidesmack. com/ category/ r-stuff/
[5] http:/ / www. backsidesmack. com/ 2011/ 06/ no-steal-this-blog/
[6] http:/ / www. cloudnumbers. com/ category/ Rproject/
[7] http:/ / cloudnumbers. com/ the-r-programming-wikibook
[8] http:/ / www. johnmquick. com
[9] http:/ / anandram. wordpress. com/ tag/ r/
[10] http:/ / www. danielmarcelino. com
Index
This page provides tables which make it easy to find functions for common statistical tasks in R, SAS and Stata. Other
software, such as SPSS, may also be included in the future.
Data management
Function R Stata SAS
Descriptive Statistics
Function R Stata SAS
Regression models
Function R Stata SAS
Programming
Function R Stata SAS
Introduction Source: http://en.wikibooks.org/w/index.php?oldid=2561505 Contributors: Adrignola, Albmont, Alice Springs, DavidCary, Dcljr, Eddelbuettel, Lannajin, PAC, PAC2, Panic2k4,
Petter Lquist, Seemu, Tayste, TomyDuby, Twarzin, 13 anonymous edits
Sample Session Source: http://en.wikibooks.org/w/index.php?oldid=2549960 Contributors: Dcljr, F.jackson, Fiero9, PAC2, 2 anonymous edits
Manage your workspace Source: http://en.wikibooks.org/w/index.php?oldid=2665727 Contributors: Fercho, Fiero9, PAC2, Richierocks, Talgalili, Xania, 2 anonymous edits
Settings Source: http://en.wikibooks.org/w/index.php?oldid=2584255 Contributors: Adrignola, Albmont, Calimo, DimAlice, F.jackson, PAC, PAC2, Panic2k4, Pep Roca, Talgalili, ,
12 anonymous edits
Documentation Source: http://en.wikibooks.org/w/index.php?oldid=2665928 Contributors: Adrignola, Dcljr, Eddelbuettel, Herbee, PAC, PAC2, Panic2k4, Timcdlucas, ZeroOne, 1 anonymous
edits
Control Structures Source: http://en.wikibooks.org/w/index.php?oldid=2359198 Contributors: Adam majewski, Dcljr, Edgester, PAC2, Timcdlucas, 10 anonymous edits
Working with functions Source: http://en.wikibooks.org/w/index.php?oldid=2622195 Contributors: Gibravo, PAC2, Thomas Levine, 3 anonymous edits
Packages Source: http://en.wikibooks.org/w/index.php?oldid=2630876 Contributors: Derek Farn, HgDeviasse, Karstew, PAC2, Shabbychef
Data types Source: http://en.wikibooks.org/w/index.php?oldid=2597748 Contributors: Asterisk-lee, Benwing, PAC2, Rob smith, Talgalili, Thomas Levine, Timcdlucas, 11 anonymous edits
Working with data frames Source: http://en.wikibooks.org/w/index.php?oldid=2610783 Contributors: Adrignola, Gibravo, PAC, PAC2, Rob smith, Sigma 7, Talgalili, ZeroOne, 15
anonymous edits
Text Processing Source: http://en.wikibooks.org/w/index.php?oldid=2676765 Contributors: Edgester, EtudiantEco, NigelW, PAC, PAC2, Robbiemorrison, Tom Morris, 7 anonymous edits
Times and Dates Source: http://en.wikibooks.org/w/index.php?oldid=2546276 Contributors: Edgester, PAC2, Pep Roca, Talgalili, Timcdlucas, 3 anonymous edits
Graphics Source: http://en.wikibooks.org/w/index.php?oldid=2575663 Contributors: Adrignola, Albmont, Dcljr, Gibravo, Henrybissonnette, Mwtoews, Orderud, PAC, PAC2, QuiteUnusual,
Saric, Talgalili, 21 anonymous edits
Publication quality ouput Source: http://en.wikibooks.org/w/index.php?oldid=2676375 Contributors: Adrignola, Avraham, Dcljr, Gibravo, PAC, PAC2, Rotlink, Rp, Talgalili, Timcdlucas, 14
anonymous edits
Descriptive Statistics Source: http://en.wikibooks.org/w/index.php?oldid=2482743 Contributors: Adam majewski, Adrignola, Dcljr, Mlm2764, PAC, PAC2, Timcdlucas, ZeroOne, 4
anonymous edits
Mathematics Source: http://en.wikibooks.org/w/index.php?oldid=2619859 Contributors: Adrignola, Albmont, Dcljr, Mcneale, PAC, PAC2, RuneL87, Taxman, Timcdlucas, ZeroOne, 7
anonymous edits
Optimization Source: http://en.wikibooks.org/w/index.php?oldid=2488152 Contributors: Adrignola, Dcljr, PAC, PAC2, Timcdlucas, ZeroOne, 5 anonymous edits
Probability Distributions Source: http://en.wikibooks.org/w/index.php?oldid=2195106 Contributors: Albmont, Mmessner, PAC, PAC2, Recent Runes, Timcdlucas, Xania, ZeroOne, 3
anonymous edits
Random Number Generation Source: http://en.wikibooks.org/w/index.php?oldid=2213476 Contributors: Adrignola, Billymac00, Dcljr, Derek Farn, PAC, PAC2, ZeroOne, 2 anonymous edits
Maximum Likelihood Source: http://en.wikibooks.org/w/index.php?oldid=2621042 Contributors: Adrignola, Dcljr, Fishpi, PAC, PAC2, Thenub314, Timcdlucas, Wfoolhill, 4 anonymous edits
Bayesian Methods Source: http://en.wikibooks.org/w/index.php?oldid=2443557 Contributors: Adrignola, PAC, PAC2, ZeroOne, 4 anonymous edits
Bootstrap Source: http://en.wikibooks.org/w/index.php?oldid=2457470 Contributors: Adrignola, Dcljr, Dourouc05, PAC, PAC2, 1 anonymous edits
Nonparametric Methods Source: http://en.wikibooks.org/w/index.php?oldid=2213501 Contributors: Adrignola, Dcljr, PAC, PAC2, Timcdlucas, ZeroOne
Linear Models Source: http://en.wikibooks.org/w/index.php?oldid=2546268 Contributors: Adrignola, Dcljr, PAC, PAC2, Talgalili, Timcdlucas, ZeroOne, 2 anonymous edits
Quantile Regression Source: http://en.wikibooks.org/w/index.php?oldid=2485148 Contributors: Adrignola, Dcljr, PAC, PAC2, Timcdlucas, 1 anonymous edits
Binomial Models Source: http://en.wikibooks.org/w/index.php?oldid=2536965 Contributors: Adrignola, PAC, PAC2, Timcdlucas, ZeroOne, 1 anonymous edits
Time Series Source: http://en.wikibooks.org/w/index.php?oldid=2147584 Contributors: PAC, PAC2, Recent Runes, RuneL87, Timcdlucas, 5 anonymous edits
Factor Analysis Source: http://en.wikibooks.org/w/index.php?oldid=2679297 Contributors: Adrignola, Fishpi, PAC, PAC2, Panic2k4, Richiemorrisroe, Timcdlucas, 2 anonymous edits
Clustering Source: http://en.wikibooks.org/w/index.php?oldid=2552057 Contributors: Gibravo, PAC, PAC2, Talgalili, 2 anonymous edits
Network Analysis Source: http://en.wikibooks.org/w/index.php?oldid=2425027 Contributors: Adrignola, Gibravo, PAC, PAC2, Seemu, 3 anonymous edits
Sources Source: http://en.wikibooks.org/w/index.php?oldid=2564621 Contributors: Dmsilv, Markus.schmidberger, PAC2, Protonk, Talgalili, 8 anonymous edits
License
Creative Commons Attribution-Share Alike 3.0
//creativecommons.org/licenses/by-sa/3.0/