
R Tutorial Lecture Notes

Estadística 2 (Universidad del Pacífico Perú)




Lecture Notes on
Introductory Statistics using R

Walter Bazán-Palomino
School of Economics and Finance
Universidad del Pacífico (University of the Pacific)

This document is designed to give you an introduction to the use of R and RStudio. In
addition, these notes introduce Basic Statistics and Inferential Statistics using RStudio.
The goal of this document is not to show all the features of RStudio, or to replace a
standard textbook in Statistics, but to complement it. The notes start with basic commands
and plots, followed by "lower-level" and "medium-level" statistics material.


1. R AND RSTUDIO
R is an open-source (GPL) statistical environment modeled after S and S-Plus. The R project
was started by Robert Gentleman and Ross Ihaka of the Statistics Department of the University
of Auckland in 1995.
R is free, which is a nontrivial feature. Its syntax is easy to learn, it ships with many built-in
statistical functions, and the language allows us to create our own (user-written) functions.
R and RStudio run on UNIX, Windows and Macintosh.
The command language is a programming language, so students must learn to pay attention to
syntax issues.
"R can do anything you can imagine," and this is hardly an overstatement. With R you can
write functions, do calculations, apply most available statistical techniques, create simple or
complicated graphs, and even write your own library functions. A large user group supports it.
Many research institutes, companies, and universities have migrated to R.
For the purpose of this course, R has excellent statistical facilities. Nearly everything you may
need in terms of statistics has already been programmed and made available in R (either as
part of the main package or as a user-contributed package).
RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-
highlighting editor that supports direct code execution, and tools for plotting, history,
debugging and workspace management.
How to install R and RStudio in Windows
You can download the latest version of R from
https://cran.r-project.org/bin/windows/base/
Once the installation is done, you need to download RStudio. RStudio is an Integrated
Development Environment (IDE) for R that runs on top of R and provides a friendlier interface.
https://www.rstudio.com/products/rstudio/download/

1.1. RStudio (and R) Environment


For the purpose of this course, the RStudio and R environments can be treated as synonymous;
there is no crucial difference between them.


Figure 1. Four panels of RStudio

The console is the "command window." Type 2+2 after the > symbol (which is where the cursor
appears), and press Enter. The spacing in your command is not relevant.
> 2+2
[1] 4

The simple example shows how RStudio (R) works; you type something, press enter, and
RStudio will carry out your commands. The trick is to type in sensible things. Mistakes can
easily be made. For example, suppose you want to calculate the logarithm of 2 with base 10.
You may type log(2) and receive 0.6931472 as the answer. However, this is the natural
logarithm, so 0.693 is not the correct answer; log10(2) gives the base-10 logarithm.
> log(2)
[1] 0.6931472
> log10(2)
[1] 0.30103

All variables created in RStudio are stored in a common workspace. To see which variables are
defined in the workspace, you can use the function ls() (short for "list"). Remember that you
cannot omit the parentheses in ls().
> ls()
[1] "counts" "h" "heading" "i"
"n"
[6] "opts" "poly.model" "predicted.intervals" "splin
e.model" "x"
[11] "x.eval" "xfit" "y" "yfit"


If you want to delete some of the objects, you use the command rm() (short for "remove"), so that
> rm(height, weight) #deletes the variables height and weight

The entire workspace can be cleared using rm(list=ls()).


The Environment tab of Figure 1 shows you the names of all the data objects (like vectors,
matrices, and data frames) that you have defined in your current R session. You can also see
information like the number of observations and rows in data objects. The tab also has a few
clickable actions like "Import Dataset," which will open a graphical user interface (GUI) for
importing data into R. However, I almost never look at this menu.
The History tab of Figure 1 simply shows you a history of all the code you've previously
evaluated in the Console. To be honest, I never look at this. In fact, I didn't realize it was even
there until I started writing this tutorial.
As you get more comfortable with R, you might find the Environment / History panel useful. But
for now you can just ignore it. If you want to declutter your screen, you can even minimize
the window by clicking the minimize button on the top right of the panel.
The Files / Plots / Packages / Help panel of Figure 1 shows you lots of helpful information. Let's
go through each tab in detail:
Files - The Files panel gives you access to the file directory on your hard drive. One nice
feature of the "Files" panel is that you can use it to set your working directory: once you
navigate to a folder you want to read and save files to, click "More" and then "Set As
Working Directory." We'll talk about working directories in more detail soon.
Plots - The Plots panel (no big surprise) shows all your plots. There are buttons for
opening the plot in a separate window and exporting the plot as a pdf or jpeg (though
you can also do this with code using the pdf() or jpeg() functions).
Packages - Shows a list of all the R packages installed on your hard drive and indicates
whether or not they are currently loaded. Packages that are loaded in the current
session are checked, while those that are installed but not yet loaded are unchecked.
We'll discuss packages in more detail in the next section.
Help - Help menu for R functions. You can either type the name of a function in the
search window, or search for a function by name from the code, as shown below.
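For example (a quick sketch using the built-in mean function; any function name works the same way):
> ?mean                               # open the help page for mean in the Help tab
> help("mean")                        # equivalent to ?mean
> help.search("standard deviation")   # search help pages by topic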
The basic object oriented syntax of R
The most important command is <c" which allows user to combine/store a list of numbers/things
in a vector. Within R commands, note that # means comment. Everything after this symbol in a
line is ignored by R.
> c(2,4.5,"walter")
[1] "2" "4.5" "walter"

The above output shows that numbers and words can be included in a vector. Of course, the
words must be placed in plain quotes (not the smart quotes of MS Word). Note that when numbers
and words are mixed, everything is stored as character, hence the quotes around "2" and "4.5" above.


If R is expecting a continuation of a command, it prompts with "+". This may indicate that
something is wrong, for example unmatched parentheses or another syntax error, and it may be
better to press the Escape key to get out.
Naming an object x
R is an object-oriented language. Almost everything is an object with a name. x = 4 and x <- 4 are
both assignment operations that create an R object named x.
> x <- 4
> x
[1] 4

R object names are almost arbitrary, except that they cannot start with numbers or contain
symbols. It is not advisable to use common commands as names of R objects (e.g. sum, mean,
sd, c, sin, cos, pi, exp, etc., described later). Everything in R, including object names, is case-
sensitive. Note that 3x is not a valid name of an R object.
> 3x=1:4
Error: unexpected symbol in "3x"

For example, x=5 means that 5 is stored under the name x. Also, x <- c(1,2,3,4) defines the
variable x as the vector (1,2,3,4). Alternatively, use x=1:4.
> x=1:4
> x
[1] 1 2 3 4

Manipulating different objects


sum(..., na.rm = FALSE) shows that sum is a function always available in R, where ... stands for
its arguments (here, the vector x). na.rm=FALSE is an optional argument whose default value,
FALSE, means that if there are missing values (NA's, or not-available data values) the sum will
also be NA. This is a useful warning.
> sum(x) # Calculates the sum of elements in vector x.
[1] 10

> y=c(1:3,NA,4);y
[1] 1 2 3 NA 4
> sum(y)
[1] NA
> sum(y,na.rm=TRUE)
[1] 10

The above output shows that sum(y) is NA unless we recognize the presence of NA and
explicitly ask R to remove it (na.rm means remove NAs) before computing the sum. The option
na.rm=TRUE is also available for the computation of the mean, median, standard deviation, variance, etc.
Less sophisticated software gives an incorrect number of observations and wrong answers in the
presence of NA's.


R allows us to manipulate objects as a calculator. Note that exp() is a function in R and exp(1)
means e raised to the power 1. Note also that the 'c' function of R defines a catalog or list of two or
more values; R does not understand a mere list of things without the 'c' command. The print
command below needs 'c' because we want to print more than one thing from a list.
#First Example
> pi
[1] 3.141593
> exp(1)
[1] 2.718282
> print(c(pi,exp(1))) #prints to screen values of the pi symbol and e symbol
[1] 3.141593 2.718282

#Second Example
> x <- 4; y<- 9
> x+y
[1] 13

#Third Example
> x <- c(1,2,3,4); y<- c(5,6,7,8)
> x+y
[1] 6 8 10 12
> x <- 1:4; y<- 5:8
> x+y
[1] 6 8 10 12
> x*y
[1] 5 12 21 32
> sum(x*y)
[1] 70
> x%*%y
[,1]
[1,] 70

1.2. Scripts
Perhaps you do not want to work with R on a line-by-line basis. For instance, if you have
entered eight commands over eight lines and realize that you made a mistake, it is better to work
with R scripts instead of re-entering all the commands. The collection of these lines of
commands can be stored in a file called a script. Such a file is usually saved with a .R extension.
In RStudio, you can run a line of code of an R script file by placing the cursor anywhere on that line
(while being careful not to highlight any subset of that line) and pressing the shortcut keys
Ctrl+Enter on a Windows keyboard or Command+Enter on a Mac.
You can also run an entire block of code by selecting all lines to be run and then pressing
Ctrl+Enter/Command+Enter. Or, you can run the entire R script by pressing
Ctrl+Alt+R in Windows or Command+Option+R on a Mac.

1.3. Packages
An R installation contains one or more libraries of packages. Some of these packages are part
of the basic installation. Installing a package simply means downloading the package code onto


your personal computer. There are two main ways to install new packages. The first, and most
common, method is to download them from the Comprehensive R Archive Network (CRAN).
CRAN is the central repository for R packages and currently hosts thousands of packages for
various purposes. You can even create your own packages. To install a new R package from
CRAN, you can simply run the code install.packages("name"), where "name" is the name of the
package.
Once you have installed a package, it is on your computer. However, just because it is on your
computer does not mean R is ready to use it. If you want to use something, like a function or a
dataset, from a package, you always need to load the package in your R session first. To load a
package, you use the library("name") function.
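As a short illustration (the MASS package is used here purely as an example; it ships with most R installations):
> install.packages("MASS")   # download the package from CRAN (needed only once)
> library(MASS)              # load it into the current session
> ginv(diag(2))              # functions from MASS, such as ginv(), are now available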

R Relational Operators
Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to

R Logical Operators
Operator Description
! Logical NOT
& Element-wise logical AND
&& Logical AND
| Element-wise logical OR
|| Logical OR
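A brief sketch of how these operators behave (the single versions & and | work element-wise on vectors; && and || are meant for single logical values):
> x <- c(2, 5, 8)
> x > 4                      # element-wise comparison
[1] FALSE  TRUE  TRUE
> x == 5 | x < 3             # element-wise logical OR
[1]  TRUE  TRUE FALSE
> (x[1] > 1) && (x[2] != 5)  # single TRUE/FALSE result
[1] FALSE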


2. DATA TYPE
There is a wide variety of data types, including vectors (numerical, character, and logical),
matrices, strings, factors, arrays, lists, and data frames.

2.1. Vectors
> a <- c(1,2,5.3,6,-2,4) # numeric vector
> a
[1] 1.0 2.0 5.3 6.0 -2.0 4.0
> b <- c("one","two","three") # character vector
> b
[1] "one" "two" "three"
> c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
> c
[1] TRUE TRUE TRUE FALSE TRUE FALSE
> a[c(2,4)] # 2nd and 4th elements of vector
[1] 2 6

2.2. Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same
length. The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the
columns and rows.
#First Example
#generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
> y
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20

#Second Example
> vector4matrix <- c(47,6,213,9)
> vector4matrix
[1] 47 6 213 9
> rownames <- c("R1", "R2")
> colnames <- c("C1", "C2")
> mymatrix <- matrix(vector4matrix, nrow=2, ncol=2, byrow=TRUE,
+ dimnames=list(rownames, colnames))
> mymatrix
C1 C2
R1 47 6
R2 213 9


2.3. Strings
A string is specified by using quotes, and you can store it like a "vector".
> a <- "Good"
> a
[1] "Good"
> b <- c("Good","Bye")
> b
[1] "Good" "Bye"

2.4. Factors
Conceptually, factors are variables in R which take on a limited number of different values; such
variables are often referred to as categorical variables. One of the most important uses of
factors is in statistical modeling; since categorical variables enter into statistical models
differently from continuous variables, storing data as factors ensures that the modeling functions
will treat such data correctly.
Factors in R are stored as a vector of integer values with a corresponding set of character
values to use when the factor is displayed. The factor function is used to create a factor.
> data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
> fdata = factor(data)
> fdata
[1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Levels: 1 2 3
> rdata = factor(data,labels=c("I","II","III"))
> rdata
[1] I II II III I II III III I II III III I
Levels: I II III

Factors represent a very efficient way to store character values, because each unique character
value is stored only once, and the data itself is stored as a vector of integers. Because of this,
read.table() will automatically convert character variables to factors unless the as.is argument
is specified. See the Data Management section for details.
> mons = c("March","April","January","November","January", "September","Octob
er","September","November","August",
+ "January","November","November","February","May","August",
+ "July","December","August","August","September","November",
+ "February","April")
> mons = factor(mons)
> table(mons)
mons
April August December February January July March M
ay November October September
2 4 1 2 3 1 1
1 5 1 3


2.5. Arrays
Arrays are the R data objects which can store data in more than two dimensions. Arrays are
similar to matrices but can have more than two dimensions. See help(array) for details.
> # Create two vectors of different lengths.
> vector1 <- c(1,1,0)
> vector2 <- c(47,48,45,50,51,49)
> # Take these vectors as input to the array.
> array.1 <- array(c(vector1,vector2),dim = c(3,3,2))
> print(array.1)
, , 1

[,1] [,2] [,3]


[1,] 1 47 50
[2,] 1 48 51
[3,] 0 45 49

, , 2

[,1] [,2] [,3]


[1,] 1 47 50
[2,] 1 48 51
[3,] 0 45 49

Let's name the rows, columns and each layer of the array. I am going to call each layer
"Matrix."
> col.names <- c("COLUMN1","COLUMN2","COLUMN3")
> row.names <- c("ROW1","ROW2","ROW3")
> matrix.names <- c("Matrix1","Matrix2")
> array.2 <- array(c(vector1,vector2),dim = c(3,3,2),
+                  dimnames = list(row.names, col.names, matrix.names))
> print(array.2)
, , Matrix1

COLUMN1 COLUMN2 COLUMN3


ROW1 1 47 50
ROW2 1 48 51
ROW3 0 45 49

, , Matrix2

COLUMN1 COLUMN2 COLUMN3


ROW1 1 47 50
ROW2 1 48 51
ROW3 0 45 49

Now we are going to manipulate arrays.


> vector1 <- c(7,8,9)
> vector2 <- c(1,2,3,4,5,6)
> array1 <- array(c(vector1,vector2),dim = c(3,3,2))
> vector3 <- c(4,5,6)
> vector4 <- c(1,0,1,3,4,11)


> array2 <- array(c(vector3,vector4),dim = c(3,3,2))


> # create matrices from these arrays.
> matrix1 <- array1[,,2]
> matrix2 <- array2[,,2]
> # Add the matrices.
> result <- matrix1+matrix2
> print(result)
[,1] [,2] [,3]
[1,] 11 2 7
[2,] 13 2 9
[3,] 15 4 17

2.6. Lists
Lists provide a way to store a variety of objects of possibly varying modes in a single R object.
> mylist = list(c(1,2,3),"GDP",TRUE,"Inflation",40,c(9,10,11))
> mylist
[[1]]
[1] 1 2 3

[[2]]
[1] "GDP"

[[3]]
[1] TRUE

[[4]]
[1] "Inflation"

[[5]]
[1] 40

[[6]]
[1] 9 10 11

You can merge many lists into one list.


> # Create two lists.
> list1 <- list(31,30,30,31)
> list2 <- list("March","June","September","December")
> # Merge the two lists.
> merged.list <- c(list1,list2)
> merged.list
[[1]]
[1] 31

[[2]]
[1] 30

[[3]]
[1] 30

[[4]]
[1] 31


[[5]]
[1] "March"

[[6]]
[1] "June"

[[7]]
[1] "September"

[[8]]
[1] "December"

2.7. Data Frame


A data frame is a two-dimensional object in R; it is a list of vectors of equal length, displayed
like a table.
Let's create a data frame.
> # create data frame in R.
> subject=c("English","Maths","Chemistry","Physics") # vector1, named subject
> percentage=c(80,100,85,95) # vector2, named percentage
> students_df=data.frame(subject,percentage) # vector1 and vector2 together as a data frame
> students_df
subject percentage
1 English 80
2 Maths 100
3 Chemistry 85
4 Physics 95

> # rename columns


> names(students_df)<-c("Course","Score")
> students_df
Course Score
1 English 80
2 Maths 100
3 Chemistry 85
4 Physics 95

The structure of the data frame can be seen by using the str() function.
> str(students_df)
'data.frame': 4 obs. of 2 variables:
$ subject : Factor w/ 4 levels "Chemistry","English",..: 2 3 1 4
$ percentage: num 80 100 85 95

The statistical summary and nature of the data can be obtained by applying the summary() function.
> summary(students_df)
       subject   percentage
 Chemistry:1   Min.   : 80.00


 English  :1   1st Qu.: 83.75
 Maths    :1   Median : 90.00
 Physics  :1   Mean   : 90.00
               3rd Qu.: 96.25
               Max.   :100.00

Recall
- length(object)             # number of elements or components
- str(object)                # structure of an object
- class(object)              # class or type of an object
- names(object)              # names
- c(object, object, ...)     # combine objects into a vector
- cbind(object, object, ...) # combine objects as columns
- rbind(object, object, ...) # combine objects as rows
- object                     # prints the object
- ls()                       # list current objects
- rm(object)                 # delete an object
- sum(object)                # returns the sum of all the values present in its arguments
- seq(object)                # generate a regular sequence
- rep(object)                # replicate the values in its arguments


3. FUNCTIONS
In this section, we discuss applying some simple functions to the data, such as the mean or the
mean of a single data subset. Basic functions are available by default. Other functions are
contained in packages that can be attached to a current session as needed, and you can create
your own functions.
R has data structures or objects (vectors, matrices, arrays, and data frames) that you can
operate on through functions that perform statistical analyses and create graphs.

3.1. Matrix Algebra


Operator or Function           Description
A * B                          Element-wise multiplication
A %*% B                        Matrix multiplication
A %o% B                        Outer product, AB'
crossprod(A,B), crossprod(A)   A'B and A'A, respectively
t(A)                           Transpose
diag(x)                        Creates a diagonal matrix with the elements of x in the principal diagonal
diag(A)                        Returns a vector containing the elements of the principal diagonal
diag(k)                        If k is a scalar, this creates a k x k identity matrix. Go figure.
solve(A, b)                    Returns vector x in the equation b = Ax (i.e., A^(-1) b)
solve(A)                       Inverse of A, where A is a square matrix
ginv(A)                        Moore-Penrose generalized inverse of A. ginv(A) requires loading the MASS package.
y <- eigen(A)                  y$val are the eigenvalues of A; y$vec are the eigenvectors of A
y <- svd(A)                    Singular value decomposition of A. y$d is a vector containing the singular values of A;
                               y$u is a matrix whose columns contain the left singular vectors of A; y$v is a matrix
                               whose columns contain the right singular vectors of A
R <- chol(A)                   Choleski factorization of A. Returns the upper triangular factor, such that R'R = A.
y <- qr(A)                     QR decomposition of A. y$qr has an upper triangle that contains the decomposition and a
                               lower triangle that contains information on the Q decomposition; y$rank is the rank of A;
                               y$qraux is a vector which contains additional information on Q; y$pivot contains
                               information on the pivoting strategy used.
cbind(A,B,...)                 Combine matrices (vectors) horizontally. Returns a matrix.


rbind(A,B,...)                 Combine matrices (vectors) vertically. Returns a matrix.
rowMeans(A)                    Returns a vector of row means.
rowSums(A)                     Returns a vector of row sums.
colMeans(A)                    Returns a vector of column means.
colSums(A)                     Returns a vector of column sums.
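A minimal sketch of a few of these operations on a small made-up matrix (not part of the original notes, just an illustration):
> A <- matrix(c(2, 1, 1, 3), nrow = 2)  # 2 x 2 matrix, filled by column
> t(A)                                  # transpose
> A %*% A                               # matrix multiplication
> solve(A)                              # inverse of A
> solve(A, c(1, 2))                     # solves A x = b for b = (1, 2)
> diag(A)                               # elements of the principal diagonal
> eigen(A)$values                       # eigenvalues of A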

In the following example, I will show you how to use four of the most important R functions for
descriptive statistics: colSums, rowSums, colMeans, and rowMeans.
> set.seed(1234) # Set seed
> data <- data.frame(matrix(round(runif(20, 4, 34)), # Create example data
+ nrow = 4, ncol = 5))
> data
X1 X2 X3 X4 X5
1 7 30 24 12 13
2 23 23 19 32 12
3 22 4 25 13 10
4 23 11 20 29 11

> colSums(data) # Basic application of colSums


X1 X2 X3 X4 X5
75 68 88 86 46
> rowSums(data) # Basic application of rowSums
[1] 86 109 74 94
> colMeans(data) # Basic application of colMeans
X1 X2 X3 X4 X5
18.75 17.00 22.00 21.50 11.50
> rowMeans(data) # Basic application of rowMeans
[1] 17.2 21.8 14.8 18.8

> data_ext1 <- data.frame(data,                      # Add rowSums & rowMeans to data
+                         rowSums = rowSums(data),
+                         rowMeans = rowMeans(data))
> data_ext1


X1 X2 X3 X4 X5 rowSums rowMeans
1 7 30 24 12 13 86 17.2
2 23 23 19 32 12 109 21.8
3 22 4 25 13 10 74 14.8
4 23 11 20 29 11 94 18.8

3.2. Statistics Functions


cumsum Calculate the cumulative sum of the elements of a numeric vector.
dbeta Return corresponding value of beta density.
dbern Return corresponding value of bernoulli PDF.
dbinom Return corresponding value of binomial density.
dcauchy Return corresponding value of cauchy density.
dchisq Return corresponding value of chi-square PDF.
density Draw Kernel Density Plot.
dexp Return corresponding value of exponential density.
df Return corresponding value of F PDF.
dgamma Return corresponding value of gamma density.


dgeom Return corresponding value of geometric PDF.


dhyper Return corresponding value of hypergeometric PDF.
dlnorm Return corresponding value of log normal PDF.
dlogis Return corresponding value of logistic PDF.
dnbinom Return corresponding value of negative binomial density.
dnorm Return corresponding value of normal density.
dpois Return value of poisson density.
dt Return corresponding value of Student t PDF.
dtukey Return corresponding value of studentized range PDF.
dunif Return corresponding value of uniform PDF.
dweibull Return corresponding value of weibull density.
dwilcox Return corresponding PDF value of wilcoxon rank sum statistic.
ecdf Compute the Empirical Cumulative Distribution Function (ECDF).
hist Creates a histogram.
mean Compute the arithmetic mean.
median Compute the median.
ppois Return value of poisson cumulative distribution function.
psignrank Return corresponding CDF value of wilcoxon signed rank statistic.
pt Return corresponding value of Student t CDF.
ptukey Return corresponding value of studentized range CDF.
punif Return corresponding value of uniform CDF.
pweibull Return corresponding value of weibull CDF.
pwilcox Return corresponding CDF value of wilcoxon rank sum statistic.
qbern Return corresponding value of bernoulli quantile function.
qbeta Return corresponding value of beta quantile function.
qbinom Return corresponding value of binomial quantile function.
qcauchy Return corresponding value of cauchy quantile function.
qchisq Return corresponding value of chi-square quantile function.
qexp Return corresponding value of exponential quantile function.
qgamma Return corresponding value of gamma quantile function.
qgeom Return corresponding value of geometric quantile function.
qhyper Return corresponding value of hypergeometric quantile function.
qlnorm Return corresponding value of log normal quantile function.
qlogis Return corresponding value of logistic quantile function.
qnbinom Return corresponding value of negative binomial quantile function.
qnorm Return value of quantile function.
qpois Return value of poisson quantile function.
qqnorm Create a normal QQplot.
qt Return corresponding value of Student t quantile function.
qtukey Return corresponding value of studentized range quantile function.
quantile Compute sample quantiles.
qunif Return corresponding value of uniform quantile function.
qweibull Return corresponding value of weibull quantile function.


qwilcox Return corresponding quantile function value of wilcoxon rank sum statistic.
rbern Return bernoulli distributed random number.
rbeta Draw random number from beta density.
rbinom Draw random number from binomial density.
rcauchy Draw random number from cauchy density.
rchisq Return chi-square distributed random number.
sd Compute standard deviation.
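Most of these names follow a common convention: the prefix d gives the density (PDF), p the cumulative distribution function, q the quantile function, and r random draws. A small sketch with the normal distribution:
> dnorm(0)       # standard normal density at 0
[1] 0.3989423
> pnorm(1.96)    # P(Z <= 1.96) for Z ~ N(0, 1)
[1] 0.9750021
> qnorm(0.975)   # 97.5th percentile of the standard normal
[1] 1.959964
> rnorm(3)       # three random draws from N(0, 1)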

3.3. Basic Important Functions


abs           Compute the absolute value of a numeric data object.
all           Check whether all values of a logical vector are TRUE.
any           Check whether any values of a logical vector are TRUE.
apropos       Return character vector with names of objects that contain the input.
attach        Give access to variables of a data.frame.
attr          Return or set a specific attribute of a data object.
attributes    Return or set all attributes of a data object.
as.factor     Convert a data object to the class factor.
as.numeric    Convert a data object to the class numeric.
boxplot       Create a boxplot.
break         Break a for-loop in R.
cbind         Combine vectors, matrices and/or data frames by column.
ceiling       Round numeric up to the next higher integer.
find          Return location where objects of a given name can be found.
floor         Round numeric down to the next lower integer.
geom_bar      Create a barplot.
get           Search for and call a data object.
get0          Call an existing data object or return an alternative value.
gregexpr      Search for match of certain character pattern.
grep          Search for match of certain character pattern and return indices.
grepl         Search for match of certain character pattern and return logical.
gsub          Replace all matches in character string.
deparse       Convert an expression to the character class.
detach        Remove the attachment of a data.frame or unload a package.
diff          Compute difference between pairs of consecutive elements of a vector.
difftime      Calculate the time difference of two date or time objects.
dim           Return the dimension (e.g. the number of columns and rows) of a matrix, array or data frame.


dir           Return a character vector of file and/or folder names within a directory.
eval          Evaluate an expression and return the result.
is.na         Return a logical vector or matrix indicating which elements are missing.
is.nan        Return a logical vector or matrix indicating which elements are not a number.
is.null       Return a logical value indicating whether a data object is of the data class NULL.
is.unsorted   Check whether an input is unsorted.
julian        Return the number of days between two date objects.
length        Return the length of data objects such as vectors or lists.
list.files    List files with a specific extension in the working directory.
load          Load an RData workspace file into R.
match         Return position of the first match between two data objects.
max           Compute the maximum value of a vector or column.
rbind         Combine vectors, matrices and/or data frames by row.
segments      Draw a line segment between two pairs of points.
summary       Give basic information on variables.
with          Evaluate an expression in an environment constructed from a data frame.
within        Evaluate an expression in an environment and modify the data frame.

Here is an example:


> a = c(5,11,6,13,2,8,3,17,-2)
> b = sort(a)
> c = sort(a,decreasing = TRUE)
> min(a)
[1] -2
> max(a)
[1] 17
> sum(a)
[1] 63
> cumsum(a)
[1] 5 16 22 35 37 45 48 65 63

The tapply, sapply, and lapply functions


tapply() is a function that applies a function to each cell of a ragged array, that is to each (non-
empty) group of values given by a unique combination of the levels of certain factors.
lapply() is a function that returns a list of the same length as X, each element of which is the
result of applying FUN to the corresponding element of X.
sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if
simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify =
FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)


lapply(X, FUN, ...)


sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
> set.seed(1234)
> dat.art = data.frame(matrix(round(runif(100,4, 34)), nrow=20, ncol=5))
> cat = c(1, 2, 1, 1, 1, 3, 3, 2, 3, 2, 1, 1, 2, 2, 3, 2, 1, 1, 3, 2)
> dat.art = cbind(dat.art, cat)
> m1 = mean(dat.art$X1[cat==1])
> m2 = mean(dat.art$X1[cat==2])
> m3 = mean(dat.art$X1[cat==3])
> c(m1,m2,m3)
[1] 19.00000 19.57143 14.80000
> tapply(dat.art$X1, dat.art$cat, mean)
1 2 3
19.00000 19.57143 14.80000
> sapply(dat.art, sd)
X1 X2 X3 X4 X5 cat
8.0868639 9.1254416 6.7395143 8.9494193 9.2125487 0.8127277
> lapply(dat.art, var)
$X1
[1] 65.39737
$X2
[1] 83.27368
$X3
[1] 45.42105

$X4
[1] 80.09211
$X5
[1] 84.87105
$cat
[1] 0.6605263

It is important to realize that tapply calculates the mean (or any other function) for subsets of
observations of a variable, whereas lapply and sapply calculate the mean (or any other function)
of one or more variables, using all observations.

3.4. Writing your own functions in R


There are literally thousands of functions written by thousands of programmers around the world
and freely available in R packages.
Defining a function
Functions are defined by code with a specific format:
> function.name <- function(arg1, arg2, ..., argn){
+ #code or function body
+ }

function.name: the function's name. This can be any valid variable name, but you should
avoid using names that are used elsewhere in R, such as dir, function, plot, etc.


arg1, arg2, ..., argn: these are the arguments of the function, also called formals. You can write a
function with any number of arguments. These can be any R object: numbers, strings, arrays,
data frames, or even pointers to other functions; anything that is needed for the function.name
function to run.
Some arguments can have default values specified (for example, arg3 = 10). Arguments
without a default must have a value supplied for the function to run. You do not need to provide
a value for arguments with a default, as the function will use the default value.
The '...' argument: the ..., or ellipsis, element in the function definition allows other
arguments to be passed into the function and passed on to another function. This technique
is often used in plotting, but has uses in many other places.
Function body: The function code within the {} braces is run every time the
function is called. This code might be very long or very short. Ideally functions are short and do
just one thing – problems are rarely too small to benefit from some abstraction. Sometimes a
large function is unavoidable, but usually these can be in turn constructed from a bunch of small
functions. More on that below.
Return value: The last line of the code is the value that will be returned by the function. It is not
necessary that a function return anything, for example a function that makes a plot might not
return anything, whereas a function that does a mathematical operation might return a number,
or a list.
Load the function into the R session
For R to be able to execute your function, it first needs to be read into memory. This is just like
loading a library: until you do it, the functions contained within it cannot be called.
There are two methods for loading functions into the memory:
1. Copy the function text and paste it into the console
2. Use the source() function to load your functions from file.
Our recommendation for writing nice R code is that in most cases, you should use the second of
these options. Put your functions into a file with an intuitive name, like plotting-fun.R and save
this file within the R folder in your project.
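For instance, if your functions live in a (hypothetical) file named R/plotting-fun.R, a single call reads them all into the current session:
> source("R/plotting-fun.R")   # every function defined in the file is now available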
Examples
Let's start by defining a function fahrenheit_to_celsius that converts temperatures from
Fahrenheit to Celsius:
> fahrenheit_to_celsius <- function(temp_F) {
+ temp_C <- (temp_F - 32) * 5 / 9
+ return(temp_C)
+ }
> fahrenheit_to_celsius(52)
[1] 11.11111

Here is a function to compute the permutations of n things taken k at a time. The formula is nPk =
n! / (n - k)!. Let the function name be permute. Note that the word `function' must be present and
that it represents an algorithm needing some inputs (arguments) and it returns some output


denoted as `out'. A function can be a very long command extending over many lines. Any long
command in R can be entered on several lines simply by using curly braces. Thus curly braces
have a special meaning in R. The arguments of functions are always in simple parentheses ().
> permute=function(n,k){
+ out=factorial(n)/factorial(n-k)
+ return(out)
+ }
> permute(4,2)
[1] 12

Now we illustrate a function to compute an alternative version of hyper-geometric distribution as


follows.
> myhyper=function(N,n,k){
+ x=0:min(n,k)
+ px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n)
+ return(px)}
> myhyper(N=10,n=4,k=3)
[1] 0.16666667 0.50000000 0.30000000 0.03333333

When you write your own function, it is important to check it against a known answer given
above for N=10, n=4, k=3.


4. DATA MANAGEMENT
Now that we have learned matrix algebra, statistics functions, and other basic functions, we
are ready to work with and analyze different datasets.

4.1. Importing and Exporting Data


There are numerous methods for importing and exporting R objects, usually as data frames, to and
from other formats. To import and export objects/data between R and one of these programs (SPSS,
SAS, and Stata), you will need to load the foreign package. One of the best ways to read (or
export) an Excel file is to convert it to a comma-delimited file and import it with the utils
package (read.csv). Alternatively, you can use the xlsx package to access Excel files directly;
the same package (write.xlsx) can be used to export data to Excel.
Generic export syntax
command(Your Data, file="Path where you'd like to export the Data\\File Name.csv", row.names = FALSE)
For reading and writing ".csv" files, we have the following syntax
newdata <- read.csv("Path where your Data is\\File Name.csv", header = TRUE)
write.csv(Your Data, file="Path where you'd like to export the Data\\File Name.csv", row.names
= FALSE)
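A minimal round-trip sketch (the file name my_data.csv and the data frame are made up for illustration and are written to the current working directory):
> df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))
> write.csv(df, file = "my_data.csv", row.names = FALSE)   # export
> df2 <- read.csv("my_data.csv", header = TRUE)            # import it back
> str(df2)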

Importing Data Files


From TXT File to R
newdata <- read.table("<FileName>.txt", header = TRUE)
newdata <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
header = FALSE)
mydata = read.csv("<FileName>.csv");

From Excel to R
library("readxl")
read_excel("Path where your Excel file is stored\\File Name.xlsx",sheet = "Your sheet name")

From Stata to R
library(foreign)
mydata <- read.dta("c:/mydata.dta")


Note: R may import a variable as a factor when there is an alphanumeric value in the column or
when there are problems with the decimal separator.
Exporting Data Files
From R to TXT File
write.table prints its required argument x (after converting it to a data frame if it is not one nor a
matrix) to a file or connection.
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

From R to Excel
write.csv(Your Data, file="Path where you'd like to export the Data\\File Name.csv", row.names
= FALSE)
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")

From R to Stata
library(foreign)
write.dta(mydata, "c:/mydata.dta")

4.2. Data cleaning


Data are rarely, if ever, perfect or structured the way you'd like. Below are some methods of imputing
missing values. One caveat to remember is that there is no standard solution for missing data:
in effect you are making up data, so the aim is to avoid bad assumptions rather than to find
the perfect value. You need to explore and understand your variables first (head(), tail(), str(), plot(),
etc.).
Replacing NA’s with 0’s
We are going to generate a sample data.frame
>set.seed(321)
>m = matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
>d = as.data.frame(m)
>d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 6 2 9 5 NA 6 NA 6


2 10 3 10 7 3 3 NA 4 1 1
3 2 8 10 7 5 NA 7 NA 5 6
4 2 NA 5 10 9 1 1 10 10 10
5 4 6 6 7 9 7 8 10 6 9
6 3 2 8 4 3 7 NA 8 3 5
7 4 6 10 7 1 1 9 4 6 9
8 3 4 4 NA 9 2 9 9 8 10
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5

> d[is.na(d)] = 0
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 6 2 9 5 0 6 0 6
2 10 3 10 7 3 3 0 4 1 1
3 2 8 10 7 5 0 7 0 5 6
4 2 0 5 10 9 1 1 10 10 10
5 4 6 6 7 9 7 8 10 6 9
6 3 2 8 4 3 7 0 8 3 5
7 4 6 10 7 1 1 9 4 6 9
8 3 4 4 0 9 2 9 9 8 10
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5

Replacing NA with "None" - for factors


The replacement does not have to be a numeric value, if we have a factor variable, we can
replace a missing value with the word "None".
> d = as.data.frame(m)
> d[is.na(d)]="None"
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 6 2 9 5 None 6 None 6
2 10 3 10 7 3 3 None 4 1 1
3 2 8 10 7 5 None 7 None 5 6
4 2 None 5 10 9 1 1 10 10 10
5 4 6 6 7 9 7 8 10 6 9
6 3 2 8 4 3 7 None 8 3 5
7 4 6 10 7 1 1 9 4 6 9
8 3 4 4 None 9 2 9 9 8 10
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5

Removing Incomplete Rows


> d = as.data.frame(m)
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 6 2 9 5 NA 6 NA 6
2 10 3 10 7 3 3 NA 4 1 1
3 2 8 10 7 5 NA 7 NA 5 6
4 2 NA 5 10 9 1 1 10 10 10
5 4 6 6 7 9 7 8 10 6 9
6 3 2 8 4 3 7 NA 8 3 5
7 4 6 10 7 1 1 9 4 6 9
8 3 4 4 NA 9 2 9 9 8 10
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5
> nrow(d[!complete.cases(d),]) #How many rows contain "NA"


[1] 6
> d[complete.cases(d),] # Remove all of the rows with "NA"
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
5 4 6 6 7 9 7 8 10 6 9
7 4 6 10 7 1 1 9 4 6 9
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5

Replacing NA with column means


> for(i in 1:ncol(d)){
+ d[is.na(d[,i]), i] = mean(d[,i], na.rm = TRUE)
+ }
>
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 6 2.000000 9 5 5.571429 6.000000 6 6
2 10 3 10 7.000000 3 3 5.571429 4.000000 1 1
3 2 8 10 7.000000 5 4 7.000000 6.444444 5 6
4 2 5 5 10.000000 9 1 1.000000 10.000000 10 10
5 4 6 6 7.000000 9 7 8.000000 10.000000 6 9
6 3 2 8 4.000000 3 7 5.571429 8.000000 3 5
7 4 6 10 7.000000 1 1 9.000000 4.000000 6 9
8 3 4 4 6.222222 9 2 9.000000 9.000000 8 10
9 4 3 1 7.000000 2 3 4.000000 5.000000 8 1
10 8 7 6 5.000000 1 7 1.000000 2.000000 7 5

Note: in the section on loops, we will learn more about cleaning data.

4.3. Basic Graphs


We have demonstrated the use of R tools for importing data, manipulating data, extracting
subsets of data, and making simple calculations, such as mean, variance, standard deviation,
and the like. In this subsection, we introduce basic graph plotting tools.
Basic parameters
plot(x, y, main="title", sub="subtitle", xlab="X-axis label", ylab="y-axix label",
xlim=c(xmin, xmax), ylim=c(ymin, ymax))
Use the title( ) function to add labels to a plot.
title(main="main title", sub="sub-title", xlab="x-axis label", ylab="y-axis label")
Text can be added to graphs using the text( ) and mtext( ) functions. text( ) places text within the
graph while mtext( ) places text in one of the four margins.
text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)
Example
> attach(mtcars)
> plot(wt, mpg, main="Milage vs. Car Weight",
+ xlab="Weight", ylab="Mileage", pch=18, col="blue")
> text(wt, mpg, row.names(mtcars), cex=0.6, pos=4, col="red")


By default, the plot function uses open circles (open dots) as plotting characters, but characters
can be selected from about 20 additional symbols. The plotting character is specified with the
pch option in the plot function.

Histograms
> plot(wt, mpg)
> abline(lm(mpg~wt))
> title("Regression of MPG on Weight")
> hist(mtcars$mpg)
> hist(mtcars$mpg, breaks=12, col="red")
> x <- mtcars$mpg
> h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon", main="Histogram w
ith Normal Curve")
> xfit<-seq(min(x),max(x),length=40)
> yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
> yfit <- yfit*diff(h$mids[1:2])*length(x)
> lines(xfit, yfit, col="blue", lwd=2)

Dot Chart
> dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
+ main="Gas Milage for Car Models",
+ xlab="Miles Per Gallon")

Bar Plot
> # Simple Bar Plot
> counts <- table(mtcars$gear)
> barplot(counts, main="Car Distribution",
+ xlab="Number of Gears")
> # Simple Horizontal Bar Plot with Added Labels
> counts <- table(mtcars$gear)
> barplot(counts, main="Car Distribution", horiz=TRUE,
+ names.arg=c("3 Gears", "4 Gears", "5 Gears"))
> # Stacked Bar Plot with Colors and Legend
> counts <- table(mtcars$vs, mtcars$gear)
> barplot(counts, main="Car Distribution by Gears and VS",
+ xlab="Number of Gears", col=c("darkblue","red"),
+ legend = rownames(counts))
> # Grouped Bar Plot


> counts <- table(mtcars$vs, mtcars$gear)


> barplot(counts, main="Car Distribution by Gears and VS",
+ xlab="Number of Gears", col=c("darkblue","red"),
+ legend = rownames(counts), beside=TRUE)

4.3.1. The package ggplot2


ggplot2 is the most elegant and aesthetically pleasing graphics framework available in R. It has
a nicely planned structure to it. This subsection focuses on exposing the underlying structure
you can use to make any ggplot.
The qplot() function can be used to create the most common graph types.
qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=,
ylim=, xlab=, ylab=, main=, sub=)

option            description
alpha             Alpha transparency for overlapping elements, expressed as a fraction between 0 (complete
                  transparency) and 1 (complete opacity)
color, shape,     Associates the levels of a variable with symbol color, shape, or size. For line plots, color
size, fill        associates levels of a variable with line color. For density and box plots, fill associates
                  fill colors with a variable. Legends are drawn automatically.
data              Specifies a data frame
facets            Creates a trellis graph by specifying conditioning variables. Its value is expressed as
                  rowvar ~ colvar. To create trellis graphs based on a single conditioning variable, use
                  rowvar ~ . or . ~ colvar
geom              Specifies the geometric objects that define the graph type. The geom option is expressed as a
                  character vector with one or more entries. geom values include "point", "smooth", "boxplot",
                  "line", "histogram", "density", "bar", and "jitter".
main, sub         Character vectors specifying the title and subtitle
method, formula   If geom="smooth", a loess fit line and confidence limits are added by default. When the number
                  of observations is greater than 1,000, a more efficient smoothing algorithm is employed.
                  Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for
                  robust regression. The formula parameter gives the form of the fit. For example, to add simple
                  linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x. Changing the
                  formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters
                  x and y, not the names of the variables. For method="gam", be sure to load the mgcv package.
                  For method="rlm", load the MASS package.
x, y              Specifies the variables placed on the horizontal and vertical axes. For univariate plots
                  (for example, histograms), omit y
xlab, ylab        Character vectors specifying horizontal and vertical axis labels


xlim, ylim        Two-element numeric vectors giving the minimum and maximum values for the horizontal and
                  vertical axes, respectively

> mtcars$gear <- factor(mtcars$gear,levels=c(3,4,5),


+ labels=c("3gears","4gears","5gears"))
> mtcars$am <- factor(mtcars$am,levels=c(0,1),
+ labels=c("Automatic","Manual"))
> mtcars$cyl <- factor(mtcars$cyl,levels=c(4,6,8),
+ labels=c("4cyl","6cyl","8cyl"))
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6cyl 160 110 3.90 2.620 16.46 0 Manual 4gears 4
Mazda RX4 Wag 21.0 6cyl 160 110 3.90 2.875 17.02 0 Manual 4gears 4
Datsun 710 22.8 4cyl 108 93 3.85 2.320 18.61 1 Manual 4gears 1
Hornet 4 Drive 21.4 6cyl 258 110 3.08 3.215 19.44 1 Automatic 3gears 1
Hornet Sportabout 18.7 8cyl 360 175 3.15 3.440 17.02 0 Automatic 3gears 2
Valiant 18.1 6cyl 225 105 2.76 3.460 20.22 1 Automatic 3gears 1

> qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5),


+ main="Distribution of Gas Milage", xlab="Miles Per Gallon",
+ ylab="Density")

[Plot: Distribution of Gas Milage]


> qplot(hp, mpg, data=mtcars, shape=am, color=am,
+ facets=gear~cyl, size=I(3),
+ xlab="Horsepower", ylab="Miles per Gallon")


[Plot: Miles per Gallon vs. Horsepower, faceted by gear and cyl]
> qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
+ method="lm", formula=y~x, color=cyl,
+ main="Regression of MPG on Weight",
+ xlab="Weight", ylab="Miles per Gallon")

[Plot: Regression of MPG on Weight, with separate regression lines by cyl]

> qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"),


+ fill=gear, main="Mileage by Gear Number",
+ xlab="", ylab="Miles per Gallon")


[Plot: Mileage by Gear Number (boxplots with jitter)]

> p <- qplot(hp, mpg, data=mtcars, shape=am, color=am,


+ facets=gear~cyl, main="Scatterplots of MPG vs. Horsepower",
+ xlab="Horsepower", ylab="Miles per Gallon")
> # White background and black grid lines
> p + theme_bw()
>
> # Large brown bold italics labels
> # and legend placed at top of plot
> p + theme(axis.title=element_text(face="bold.italic",
+                                   size="12", color="brown"),
+           legend.position="top")


Customize Graph
Unlike base R graphs, ggplot2 graphs are not affected by many of the options set in the par(
) function. They can be modified using the theme() function, and by adding graphic parameters
within the qplot() function. For greater control, use ggplot() and other functions provided by the
package. Note that ggplot2 functions can be chained with "+" signs to generate the final plot.
> dat <- data.frame(
+ time = factor(c("Lunch","Dinner"), levels=c("Lunch","Dinner")),
+ total_bill = c(14.89, 17.23)
+ )
> dat
time total_bill
1 Lunch 14.89
2 Dinner 17.23
> ggplot(data=dat, aes(x=time, y=total_bill)) +
+ geom_bar(stat="identity")
> ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
+ geom_bar(stat="identity")
> ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
+ geom_bar(colour="black", stat="identity")
> ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
+ geom_bar(colour="black", stat="identity") +
+ guides(fill=FALSE)

4.4. Tables
Categorical data are usually described in the form of tables. This section outlines how you can
create tables from your data and calculate relative frequencies.
The matrix function needs an argument containing the table values as a single vector and also
the number of rows in the argument nrow. By default, the values are entered columnwise; if
rowwise entry is desired, then you need to specify byrow=T.
> caff.marital <- matrix(c(652,1537,598,242,36,46,38,21,218,327,106,67), nrow=3, byrow=T)


> caff.marital
[,1] [,2] [,3] [,4]
[1,] 652 1537 598 242
[2,] 36 46 38 21
[3,] 218 327 106 67

You might also give the number of columns instead of rows using ncol. If exactly one of ncol
and nrow is given, R will compute the other one so that it fits the number of values. If both ncol
and nrow are given and it does not fit the number of values, the values will be "recycled", which
in some (other!) circumstances can be useful. To get readable printouts, you can add row and
column names to the matrices.
> colnames(caff.marital) <- c("0","1-150","151-300",">300")
> rownames(caff.marital) <- c("Married","Prev.married","Single")
> caff.marital
0 1-150 151-300 >300
Married 652 1537 598 242
Prev.married 36 46 38 21
Single 218 327 106 67

There is a "table" class for which special methods exist, and you can convert to that class using
as.table(caff.marital). The table function below returns an object of class "table". For most
elementary purposes, you can use matrices where two-dimensional tables are expected. One
important case where you do need as.table() is when converting a table to a data frame of
counts:
> as.data.frame(as.table(caff.marital))
Var1 Var2 Freq
1 Married 0 652
2 Prev.married 0 36
3 Single 0 218
4 Married 1-150 1537
5 Prev.married 1-150 46
6 Single 1-150 327
7 Married 151-300 598
8 Prev.married 151-300 38
9 Single 151-300 106
10 Married >300 242
11 Prev.married >300 21
12 Single >300 67

Turning now to another example, we are going to create a table from raw data and call it
"smoke." This data set was created only to be used as an example. This is not as direct a
method as might be desired: here we create a matrix of numbers, specify the row and column
names, and then convert it to a table.
> smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) <- c("High","Low","Middle")
> rownames(smoke) <- c("current","former","never")
> smoke <- as.table(smoke)
> smoke
High Low Middle
current 51 43 22
former 92 28 21
never 68 22 9


> barplot(smoke,legend=T,beside=T,main='Smoking Status by SES')

> plot(smoke,main="Smoking Status By Socioeconomic Status")

There are a number of ways to get the marginal distributions using the margin.table() command.
If you just give the command the table it calculates the total number of observations. You can
also calculate the marginal distributions across the rows or columns based on the one optional
argument:
> margin.table(smoke)
[1] 356


> margin.table(smoke,1)
current former never
116 141 99
> margin.table(smoke,2)
High Low Middle
211 93 52

Combining these commands you can get the proportions:


> smoke/margin.table(smoke)
High Low Middle
current 0.14325843 0.12078652 0.06179775
former 0.25842697 0.07865169 0.05898876
never 0.19101124 0.06179775 0.02528090
> margin.table(smoke,1)/margin.table(smoke)
current former never
0.3258427 0.3960674 0.2780899
> margin.table(smoke,2)/margin.table(smoke)
High Low Middle
0.5926966 0.2612360 0.1460674

That is a little obtuse, so fortunately, there is a better way to get the proportions using the
prop.table() command. You can specify the proportions with respect to the different marginal
distributions using the optional argument:
> prop.table(smoke)
High Low Middle
current 0.14325843 0.12078652 0.06179775
former 0.25842697 0.07865169 0.05898876
never 0.19101124 0.06179775 0.02528090
> prop.table(smoke,1)
High Low Middle
current 0.43965517 0.37068966 0.18965517
former 0.65248227 0.19858156 0.14893617
never 0.68686869 0.22222222 0.09090909
> prop.table(smoke,2)
High Low Middle
current 0.2417062 0.4623656 0.4230769
former 0.4360190 0.3010753 0.4038462
never 0.3222749 0.2365591 0.1730769

4.5. Data Analysis


Now that we have learned how to clean our datasets in the previous lectures, it's time to
learn how to perform different types of analysis. This will not be a repetition of your statistics and
econometrics classes, but rather an introduction to the most common types of
statistical analysis in R that you will encounter or are likely to perform.
We will need the following packages for this lecture; please install them with the following line:
install.packages(c("MASS","NNS")).
To calculate the bivariate correlation, we use the cor() command:
> cor(cars)
speed dist
speed 1.0000000 0.8068949
dist 0.8068949 1.0000000


Our first analysis is the linear regression. It uses the lm() command.
> reg1=lm(dist~speed,data=cars)
> summary(reg1)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom


Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12

We are going to plot the data and add the regression line to the plot
> plot(dist~speed, data=cars)
> # Name the regression object
> reg1=lm(dist~speed,data=cars)
> # Add the regression line to the plot
> abline(reg1)

An lm object is a list of named objects. We can access all of these names via names().
> names(reg1)


[1] "coefficients" "residuals" "effects" "rank" "fitted.


values" "assign" "qr"
[8] "df.residual" "xlevels" "call" "terms" "model"

There is also a plot method for lm objects that gives the diagnostic information:
> par(mfrow=c(2,2))
> # Plot the objects
> plot(reg1)


5. STATISTICS
The concepts of randomness and probability are central to statistics. It is an empirical fact that
most experiments and investigations are not perfectly reproducible. The degree of
irreproducibility may vary: Some experiments in physics may yield data that are accurate to
many decimal places, whereas data on biological systems are typically much less reliable.
However, the view of data as something coming from a statistical distribution is vital to
understanding statistical methods. In this section, we outline the basic ideas of probability and
the functions that R has for random sampling and handling of theoretical distributions.

5.1. Descriptive Statistics


install.packages(c("MASS","NNS"))
In R, you can simulate random sampling with the sample() function. If you want to pick five
numbers at random from the set 1:40, then you can write sample(1:40,5). The first argument (x)
is a vector of values to be sampled from and the second (size) is the sample size. Actually,
sample(40,5) would suffice, since a single number is interpreted as the length of a
sequence of integers.
> sample(1:40,5)
[1] 3 13 28 19 6

Notice that the default behaviour of sample is sampling without replacement. That is, the
samples will not contain the same number twice, and size obviously cannot be bigger than the
length of the vector to be sampled. If you want sampling with replacement, then you need to add
the argument replace=TRUE.
> sample(c("H","T"), 10, replace=T)
[1] "T" "H" "T" "H" "T" "T" "H" "H" "H" "H"

As described in the previous section, R provides a wide range of functions for obtaining summary
statistics: mean, sd, var, min, max, median, range, and quantile.
> data <- data.frame(matrix(round(runif(50, 4, 34)), # Create example data
+ nrow = 10, ncol = 5))
> data
X1 X2 X3 X4 X5
1 7 25 13 18 21
2 23 20 13 12 23
3 22 12 9 13 13
4 23 32 5 19 23
5 30 13 11 9 14
6 23 29 28 27 19
7 4 13 20 10 24
8 11 12 31 12 19
9 24 10 29 34 11
10 19 11 5 28 27

You can also use the summary() function and the describe() function (the latter from the Hmisc package).


> summary(data)
X1 X2 X3 X4 X5

Min. : 4.0 Min. :10.00 Min. : 5.0 Min. : 9.0 Min. :11.00
1st Qu.:13.0 1st Qu.:12.00 1st Qu.: 9.5 1st Qu.:12.0 1st Qu.:15.25
Median :22.5 Median :13.00 Median :13.0 Median :15.5 Median :20.00
Mean :18.6 Mean :17.70 Mean :16.4 Mean :18.2 Mean :19.40
3rd Qu.:23.0 3rd Qu.:23.75 3rd Qu.:26.0 3rd Qu.:25.0 3rd Qu.:23.00
Max. :30.0 Max. :32.00 Max. :31.0 Max. :34.0 Max. :27.00

> describe(data)
data
5 Variables 10 Observations
-----------------------------------------------------------------------------
--------------------------------------------
X1
n missing distinct Info Mean Gmd
10 0 8 0.976 18.6 9.467

Value 4 7 11 19 22 23 24 30
Frequency 1 1 1 1 1 3 1 1
Proportion 0.1 0.1 0.1 0.1 0.1 0.3 0.1 0.1
-----------------------------------------------------------------------------
--------------------------------------------
X2
n missing distinct Info Mean Gmd
10 0 8 0.988 17.7 9.178

Value 10 11 12 13 20 25 29 32
Frequency 1 1 2 2 1 1 1 1
Proportion 0.1 0.1 0.2 0.2 0.1 0.1 0.1 0.1
-----------------------------------------------------------------------------
--------------------------------------------
X3
n missing distinct Info Mean Gmd
10 0 8 0.988 16.4 11.64

Value 5 9 11 13 20 28 29 31
Frequency 2 1 1 2 1 1 1 1
Proportion 0.2 0.1 0.1 0.2 0.1 0.1 0.1 0.1
-----------------------------------------------------------------------------
--------------------------------------------
X4
n missing distinct Info Mean Gmd
10 0 9 0.994 18.2 10.04

Value 9 10 12 13 18 19 27 28 34
Frequency 1 1 2 1 1 1 1 1 1
Proportion 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.1
-----------------------------------------------------------------------------
--------------------------------------------
X5
n missing distinct Info Mean Gmd
10 0 8 0.988 19.4 6.222

Value 11 13 14 19 21 23 24 27
Frequency 1 1 1 2 1 2 1 1
Proportion 0.1 0.1 0.1 0.2 0.1 0.2 0.1 0.1
-----------------------------------------------------------------------------
--------------------------------------------

A common application of loops (we are going to learn looping in the next section) is to apply a
function to each element of a set of values or vectors and collect the results in a single

structure. In R this is abstracted by the functions lapply() and sapply(). The former always
returns a list (hence the 'l'), whereas the latter tries to simplify (hence the 's') the result to a
vector or a matrix if possible.
> sapply(data, mean, na.rm=TRUE)
X1 X2 X3 X4 X5
18.6 17.7 16.4 18.2 19.4
> sapply(data, sd, na.rm=TRUE)
X1 X2 X3 X4 X5
8.395766 8.192815 9.924157 8.689713 5.253570
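
To see the difference between the two, compare lapply() and sapply() on the same columns; this is a
minimal sketch using the data frame created above:
> lapply(data[ , 1:2], mean)  # returns a list with one element per column
> sapply(data[ , 1:2], mean)  # the same result, simplified to a named vector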

Sometimes you just want to repeat something a number of times but still collect the results as a
vector. Obviously, this makes sense only when the repeated computations actually give different
results, the common case being simulation studies. This can be done using sapply, but there is
a simplified version called replicate(), in which you just have to give a count and the expression
to evaluate.
> replicate(5, mean(data[,1]))
[1] 18.6 18.6 18.6 18.6 18.6

A similar function, apply(), allows you to apply a function to the rows or columns of a matrix (or,
more generally, over the indices of a multidimensional array).
> apply(data,2,max)
X1 X2 X3 X4 X5
30 32 31 34 27

The second argument is the index (or vector of indices) that defines what the function is applied
to; in this case we get the columnwise maxima.
Empirical quantiles may be obtained with the function quantile().
> quantile(data[,1])
0% 25% 50% 75% 100%
4.0 13.0 22.5 23.0 30.0

It is also possible to obtain other quantiles; this is done by adding an argument containing the
desired percentage points.
> pvec <- seq(0,1,0.1)
> pvec
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> quantile(data[,1],pvec)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
4.0 6.7 10.2 16.6 20.8 22.5 23.0 23.0 23.2 24.6 30.0

5.2. Inferential Statistics


Probability and distributions
The standard distributions that turn up in connection with model building and statistical tests
have been built into R, and it can therefore completely replace traditional statistical tables. Here
we look only at the most common distributions.

How do we create random numbers from the uniform density? As described in Section 3, in R "unif"
means uniform, and the prefix determines which of the related functions you get (a combined example follows the list of distributions below):

 d means density
 p means cumulative probability
 q means quantile
 r means random numbers from that density
Distributions (R name)

 Beta (beta), Lognormal (lnorm), Binomial (binom), Negative Binomial (nbinom), Cauchy
(cauchy), Normal (norm), Chi-square (chisq), Poisson (pois), Exponential (exp), Student t
(t), F (f), Uniform (unif), Gamma (gamma), Tukey (tukey), Geometric (geom), Weibull
(weibull), Hypergeometric (hyper), Wilcoxon (wilcox), Logistic (logis)
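
Combining a prefix with one of these names gives the corresponding function. As a brief sketch for the
uniform distribution on [0, 1] (the random draws will differ on your machine):
> dunif(0.5)   # density at 0.5 (equal to 1 everywhere on [0,1])
> punif(0.25)  # cumulative probability P(X <= 0.25) = 0.25
> qunif(0.9)   # the 90th percentile, which is 0.9
> runif(3)     # three random numbers from U(0,1)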
Recall that the density for a continuous distribution is a measure of the relative probability of
"getting a value close to x". The probability of getting a value in a particular interval is the area
under the corresponding part of the curve.
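As a quick illustration of this idea for the standard normal distribution (a sketch, not part of the
original example), the area between two points can be obtained either as a difference of cumulative
probabilities or by integrating the density numerically:
> pnorm(1) - pnorm(-1)                     # P(-1 < X < 1), about 0.683
> integrate(dnorm, lower = -1, upper = 1)  # numerical integration gives the same area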
For discrete distributions, where variables can take on only distinct values, it is preferable to
draw a pin diagram, here for the binomial distribution with n = 50 and p = 0.33.
> x <- 0:50
> plot(x,dbinom(x,size=50,prob=.33),type="h")

The function dbinom() gives the binomial probability of exactly x successes when the probability of
success in one trial is p and the number of trials is n. Note that n is also the largest possible
number of successes.
> p=0.5; n=3; x=0:n
> db=dbinom(x,prob=p,size=n);db
[1] 0.125 0.375 0.375 0.125

> names(db)=x #show labels on x axis


> barplot(db, xlab="x")

The Poisson distribution can be described as the limiting case of the binomial distributions
when the size parameter N increases while the expected number of successes λ = Np is kept fixed.
This is useful for describing rare events in large populations. The code name for Poisson is 'pois',
and all the same prefixes (d, p, q, r) mean the same thing as they did for the uniform density.
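To illustrate this limiting relationship (a sketch, not part of the original notes), compare binomial
probabilities with a large N and small p to Poisson probabilities with the same expected value λ = Np:
> lambda <- 4
> dbinom(0:3, size = 1000, prob = lambda/1000)  # binomial with N = 1000, p = 0.004
> dpois(0:3, lambda)                            # Poisson probabilities are very close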

Turning now to the normal distribution, I will start with the generation of an artificial data vector
x of 150 normally distributed observations. It is used in examples throughout this section.

> x <- rnorm(150)


> hist(x)

The density function is probably the least used in practice of the four function types, but if, for
instance, you want to draw the well-known bell curve of the normal distribution, it can be done like
this:
> x1 <- seq(-4,4,0.1)
> plot(x1,dnorm(x1),type="l")

The empirical cumulative distribution function is defined as the fraction of data smaller than or
equal to x. That is, if x is the k-th smallest observation, then the proportion k/n of the data is
smaller than or equal to x (7/10 if x is no. 7 of 10).
> n <- length(x)
> plot(sort(x),(1:n)/n,type="s",ylim=c(0,1))

One purpose of calculating the empirical cumulative distribution function (c.d.f.) is to see
whether data can be assumed normally distributed. For a better assessment, you might plot the
k-th smallest observation against the expected value of the k-th smallest observation out of n in
a standard normal distribution. The point is that in this way you would expect to obtain a straight
line if data come from a normal distribution with any mean and standard deviation. You can produce
this plot with the qqnorm() function.
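For the simulated vector x, a minimal sketch of this normal quantile-quantile plot, with a reference
line added by qqline(), is:
> qqnorm(x)  # observed quantiles against theoretical normal quantiles
> qqline(x)  # straight line through the first and third quartiles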

Create a dataset and compute the Z-scores. Z-scores are used to calculate the probability of
a score occurring within a normal distribution, and allow us to compare two or more variables
from different normal distributions.
> students = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10",
+ "S11", "S12", "S13", "S14", "S15")
> scores <- c(60, 62, 63, 66, 67, 70, 69, 69, 68, 64, 65, 71, 62, 63, 64)
> newdata <- as.data.frame(cbind(students, scores))
> pop_sd <- sd(scores)*sqrt((length(scores)-1)/(length(scores)))
> pop_mean <- mean(scores)
> Z_scores <- (scores-pop_mean)/pop_sd
> Z_scores

Add the Z-scores to the original dataset and export it as a ".csv" or ".txt" file.
> New_ZScoreData <- cbind(newdata, Z_scores)
> write.csv(New_ZScoreData,"New_ZScore.csv")
> write.table(New_ZScoreData, "New_ZScore.txt")

The number of intervals (called "bins") is an important parameter of a histogram because it splits
the range of the data into equal-width intervals. Each and every observation (or value) in the data
set is placed in the appropriate bin.
> x1 = c(10.2, 11.9, 11.3, 12.2, 12.7, 12.8, 14.3, 14.5, 14.6, 15.9,
+ 14.8, 15.0, 15.5, 13.2, 13.9, 18.5, 18.9, 18.4, 18.9,
+ 19.0, 19.5, 16.1, 16.2, 16.5, 16.8, 16.9, 16.7, 17.3,
+ 20.2, 20.5, 20.9, 20.8, 20.2, 22.5, 22.7, 22.9)
> hist(x1, breaks=7, main="Simulated Data - x1")
> x2 = x1*rnorm(length(x1), mean=1, sd=1)
> hist(x2, breaks=7, main="Simulated Data - x2")

Let's compare the z-scores of the two simulated random variables.


> zscore1 = (x1-mean(x1))/sd(x1)
> zscore2 = (x2-mean(x2))/sd(x2)
> par(mfcol=c(2,1))
> hist(zscore1, breaks=7, main="Z-scores1")
> hist(zscore2, breaks=7, main="Z-scores2")
> par(mfcol=c(1,1))

The area under the curve


> problessthan68 <- pnorm(68,pop_mean,pop_sd)
> problessthan68
[1] 0.7780236

Example
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
The pnorm(0) =0.5 is the area under the standard normal curve to the left of zero. Likewise, the
qnorm(0.80) = 0.84 means that 0.84 is the 80th percentile of the standard normal distribution.
rnorm(100) generates 100 random deviates from a standard normal distribution (mean=0, and
standard deviation=1). Each function has parameters specific to that distribution. For example,
rnorm(100, m=50, sd=10) generates 100 random deviates from a normal distribution with mean
50 and standard deviation 10.
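These statements can be checked directly in the console (a short sketch; the random draws will of
course differ from run to run):
> pnorm(0)     # 0.5, the area to the left of zero
> qnorm(0.80)  # about 0.84, the 80th percentile
> y <- rnorm(100, mean = 50, sd = 10)
> c(mean(y), sd(y))  # close to 50 and 10, but not exactly equal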
The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate
distributions. The format is fitdistr(x, densityfunction), where x is the sample data and
densityfunction is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f",
"gamma", "geometric", "log-normal", "lognormal", "logistic", "negative binomial", "normal",
"Poisson", "t" or "weibull".
Exercise 1: Children's IQ scores are normally distributed with a mean of 100 and a standard
deviation of 15. What proportion of children are expected to have an IQ between 80 and 120?
> mean=100; sd=15
> lb=80; ub=120
> x <- seq(-4,4,length=100)*sd + mean
> hx <- dnorm(x,mean,sd)
> plot(x, hx, type="n", xlab="IQ Values", ylab="",
+ main="Normal Distribution", axes=FALSE)
> i <- x >= lb & x <= ub
> lines(x, hx)
> polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")
> area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
> result <- paste("P(",lb,"< IQ <",ub,") =",signif(area, digits=3))
> mtext(result,3)
> axis(1, at=seq(40, 160, 20), pos=0)

5.2.1. Point Estimate and Interval Estimation


Find a point estimate of the mean university student height using the sample data in survey.
> library(MASS) # load the MASS package
> head(survey)
> height.survey = survey$Height
> mean(height.survey, na.rm=TRUE) # skip missing values
[1] 172.3809

Assume the population standard deviation σ of the student height in survey is 9.48. Find the
margin of error and interval estimate at 95% confidence level.
> height.response = na.omit(survey$Height)
> n = length(height.response)
> sigma = 9.48 # population standard deviation
> sem = sigma/sqrt(n); sem # standard error of the mean
[1] 0.6557453
> E = qnorm(.975)*sem; E # margin of error
[1] 1.285237
> E
[1] 1.285237
> xbar = mean(height.response) # sample mean
> xbar + c(-E, E)
[1] 171.0956 173.6661

Without assuming the population standard deviation of the student height in survey, find the
margin of error and interval estimate at 95% confidence level.
> height.response = na.omit(survey$Height)
> n = length(height.response)
> s = sd(height.response) # sample standard deviation
> SE = s/sqrt(n); SE # standard error estimate
[1] 0.6811677
> E = qt(.975, df=n-1)*SE; E
[1] 1.342878
> xbar = mean(height.response) # sample mean
> xbar + c(-E, E)
[1] 171.0380 173.7237

Sample Size for the Population Mean. The quality of a sample survey can be improved by
increasing the sample size.
Assume the population standard deviation σ of the student height in survey is 9.48. Find the
sample size needed to achieve a 1.2 centimeters margin of error at 95% confidence level.
> zstar = qnorm(.975)
> sigma = 9.48
> E = 1.2
> zstar^2 * sigma^2 / E^2
[1] 239.7454

Since the sample size must be a whole number, we round up: at least 240 students are needed to achieve
a 1.2-centimeter margin of error at the 95% confidence level.

5.2.2. Hypothesis Testing


One Sample t-Test
> set.seed(100)
> x <- rnorm(50, mean = 10, sd = 0.5)
> t.test(x, mu=10)
One Sample t-test

data: x
t = 0.70372, df = 49, p-value = 0.4849
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
9.924374 10.157135
sample estimates:
mean of x
10.04075

Two Sample t-Test


> x = rnorm(10)
> y = rnorm(10)
> t.test(x,y)

Welch Two Sample t-test

data: x and y
t = -0.42744, df = 14.721, p-value = 0.6752
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.4379188 0.9581946
sample estimates:
mean of x mean of y
-0.04322973 0.19663237

Wilcoxon Rank Sum Test


The Wilcoxon test is a non-parametric alternative to the t-test when the data are not assumed to
follow a normal distribution. With a single sample, wilcox.test() performs the Wilcoxon signed rank
test of whether the median equals a hypothesized value; with two samples, as in the example below,
it performs the Wilcoxon rank sum test of whether one distribution is shifted relative to the other.
> x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
> y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> wilcox.test(x, y, alternative = "g") # g for greater

Wilcoxon rank sum test

data: x and y
W = 35, p-value = 0.1272
alternative hypothesis: true location shift is greater than 0

Shapiro Test
To test if a sample follows a normal distribution. The null hypothesis here is that the sample
being tested is normally distributed.

> normaly_disb <- rnorm(100, mean=5, sd=1) # generate a normal distribution


> shapiro.test(normaly_disb) # the shapiro test.

Shapiro-Wilk normality test

data: normaly_disb
W = 0.98836, p-value = 0.535

Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is used to check whether two samples follow the same distribution.
> x <- rnorm(50)
> y <- runif(50)
> ks.test(x, y)

Two-sample Kolmogorov-Smirnov test

data: x and y
D = 0.64, p-value = 6.079e-10
alternative hypothesis: two-sided

> x <- rnorm(50)


> y <- rnorm(50)
> ks.test(x, y)

Two-sample Kolmogorov-Smirnov test

data: x and y
D = 0.12, p-value = 0.8693
alternative hypothesis: two-sided

Fisher's F-Test
F-Test is used to check if two samples have the same variance.
> var.test(x,y)

F test to compare two variances

data: x and y
F = 0.90857, num df = 49, denom df = 49, p-value = 0.7385
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5155919 1.6010719
sample estimates:
ratio of variances
0.9085702

Chi-Squared Test
This test is used to examine whether two categorical variables are independent. The input is a
contingency table containing the counts of each combination of the two variables. We use the
chisq.test() function to perform the chi-square test of independence.
> data_frame <- read.csv("https://goo.gl/j6lRXD")
> table(data_frame$treatment, data_frame$improvement)

improved not-improved
not-treated 26 29
treated 35 15
> chisq.test(data_frame$treatment, data_frame$improvement, correct=FALSE)

Pearson's Chi-squared test


data: data_frame$treatment and data_frame$improvement
X-squared = 5.5569, df = 1, p-value = 0.01841

In the built-in data set survey, the Smoke column records the students' smoking habits, while the
Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul"
(regularly), "Occas" (occasionally) and "Never". As for Exer, they are "Freq" (frequently), "Some"
and "None".
We can tally the students' smoking habits against their exercise level with the table() function in R.
The result is called the contingency table of the two variables.
Question: Test the hypothesis that the students' smoking habit is independent of their
exercise level at the .05 significance level.
> tbl = table(survey$Smoke, survey$Exer)
> tbl # the contingency table
Freq None Some
Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7
> chisq.test(tbl)

Pearson's Chi-squared test

data: tbl
X-squared = 5.4885, df = 6, p-value = 0.4828

Since the p-value of 0.4828 is greater than the .05 significance level, we do not reject the null
hypothesis that the students' smoking habit is independent of their exercise level.

6. LOOPING
It is possible to write your own R functions (see section 3). Also, if you are likely to use a set of
calculations more than once, you would be well advised to present the code in such a way that it
can be reused with minimal typing. Quite often, this brings you into the world of functions and
loops. In fact, looping is a major aspect and attraction of working with the system in the long run.
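As a small illustration (a sketch, not part of the original notes), you can wrap a loop inside your own
function and then reuse the whole calculation with a single call:
> # A user-written function that sums the first n squares using a for loop
> sum.squares <- function(n) {
+   total <- 0
+   for (i in 1:n) {
+     total <- total + i^2
+   }
+   return(total)
+ }
> sum.squares(5)  # 1 + 4 + 9 + 16 + 25 = 55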
Recall what we have learnt so far about operators.

6.1. Basic Flow Decision Charts

While Loop
> i <- 1
> while (i < 6) {
+ print(i)
+ i = i+1
+ }
[1] 1

[1] 2
[1] 3
[1] 4
[1] 5

For Loop
> x <- c(2,5,3,9,8,11,6)
> count <- 0
> for (val in x) {
+ if(val %% 2 == 0) count = count+1
+ }
> print(count)
[1] 3

Break Statement
> x <- 1:5
> for (val in x) {
+ if (val == 3){
+ break
+ }
+ print(val)
+ }
[1] 1
[1] 2

Next Statement

> x <- 1:5


> for (val in x) {
+ if (val == 3){
+ next
+ }
+ print(val)
+ }
[1] 1
[1] 2
[1] 4
[1] 5

Repeat Loop
> x <- 1
> repeat {
+ print(x)
+ x = x+1
+ if (x == 5){
+ break
+ }
+ }
[1] 1
[1] 2
[1] 3
[1] 4

6.2. Sample mean and variance


In this example, the variable of interest is X with X ∼ N(μ = 2, σ = 2). We will investigate some
properties of the sample mean and variance when taking a random sample of size n. In other
words, we use R to generate n random drawings of X. This happens in each simulation step.
> mu<-2
> sigma<-2
> n<-10

> asim<-10
> xbar<-rep(NA,asim)
> xvar<-rep(NA,asim)
> for(i in 1:asim)
+ {set.seed(i)
+ # x contains a random sample of size n of the variable X
+ x<-rnorm(n,mu,sigma)
+ xbar[i]<-mean(x)
+ xvar[i]<-var(x)
+ }

Is the sample mean an unbiased estimator for the population mean? Is the sample variance
also an unbiased estimator for the population variance? Increase the number of simulations to
100 and then finally to 1,000. What do you notice?
> mu<-2
> sigma<-2
> n<-10
> asim<-1000
> xbar<-rep(NA,asim)
> xbar2<-rep(NA,asim)
> xbar3<-rep(NA,asim)
> xvar<-rep(NA,asim)
> xvar2<-rep(NA,asim)
> xvar3<-rep(NA,asim)
> for(i in 1:asim) {
+ set.seed(i)
+ x<-rnorm(n,mu,sigma)
+ x2<-rnorm(n*10,mu,sigma)
+ x3<-rnorm(n*100,mu,sigma)
+ xbar[i]<-mean(x)
+ xbar2[i]<-mean(x2)
+ xbar3[i]<-mean(x3)
+ xvar[i]<-var(x)
+ xvar2[i]<-var(x2)
+ xvar3[i]<-var(x3)
+ }
> mean(xbar)
[1] 2.003151
> mean(xbar2)
[1] 1.999852
> mean(xbar3)
[1] 2.002405
> sd(xbar)
[1] 0.6463367
> sd(xbar2)
[1] 0.1908427
> sd(xbar3)
[1] 0.06513537

Figure. Histogram and normal Q-Q plot of xbar after 1,000 simulations

Now, let's assume that the variable X is no longer normally distributed but follows a chi-squared
distribution with 4 degrees of freedom.
> n<-10
> asim<-1000
> xbar<-rep(NA,asim)
> xbar2<-rep(NA,asim)
> xbar3<-rep(NA,asim)
> xvar<-rep(NA,asim)
> xvar2<-rep(NA,asim)
> xvar3<-rep(NA,asim)
> for(i in 1:asim) {
+ set.seed(i)
+ x<-rchisq(n,4)
+ x2<-rchisq(n*10,4)
+ x3<-rchisq(n*100,4)
+ xbar[i]<-mean(x)
+ xbar2[i]<-mean(x2)
+ xbar3[i]<-mean(x3)
+ xvar[i]<-var(x)
+ xvar2[i]<-var(x2)
+ xvar3[i]<-var(x3)
+ }
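
The results can be inspected in the same way as before (your exact numbers will differ slightly). For a
chi-squared distribution with 4 degrees of freedom the population mean is 4 and the population variance
is 8, so the averages of the simulated sample means and variances should be close to these values; a
quick sketch:
> mean(xbar); mean(xbar2); mean(xbar3)  # should all be close to 4
> mean(xvar); mean(xvar2); mean(xvar3)  # should all be close to 8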

6.3. Confidence Intervals


> mu<-2
> sigma<-2
> n<-100
> asim<-1000
> l95<-rep(NA,asim)
> r95<-rep(NA,asim)
> l99<-rep(NA,asim)
> r99<-rep(NA,asim)
> for(i in 1:asim){
+ set.seed(i)
+ x<-rnorm(n,mu,sigma)
+ l95[i]<-mean(x)-qnorm(0.975)*sqrt(sigma^2/n)
+ r95[i]<-mean(x)+qnorm(0.975)*sqrt(sigma^2/n)
+ l99[i]<-mean(x)-qnorm(0.995)*sqrt(sigma^2/n)
+ r99[i]<-mean(x)+qnorm(0.995)*sqrt(sigma^2/n)
+ }
> plot(1:1000, l95, xlab='observations', ylab="value", type='l', ylim=c(min(l99)-0.5, max(r99)+0.5))
> lines(1:1000, r95, typ="l", col="black")
> lines(1:1000, r99, typ="l", col="red")
> lines(1:1000, l99, typ="l", col="red")
> title(main="95% Confidence Interval")
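
Because the data are simulated from a population with known mean, you can also count how often the
intervals actually contain μ = 2 (a sketch; with 1,000 simulations the observed coverage should be
close to the nominal levels):
> mean(l95 <= mu & mu <= r95)  # empirical coverage of the 95% intervals
> mean(l99 <= mu & mu <= r99)  # empirical coverage of the 99% intervals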

Figure. Confidence Intervals: black (95%) and red (99%)


6.4. Graphs
Functions often go hand in hand with loops, as they both help to automate commands. Suppose
you have 1000 datasets, and for each dataset you need to make a graph and save it as a jpeg.
It would take a great deal of time to do this manually, and a mechanism that can repeat the
same (or similar) commands any number of times without human intervention would be
invaluable. This is where a loop comes in.

> Owls <- read.table(file = "Owls.txt", header = TRUE)


> names(Owls)
[1] "Nest" "FoodTreatment" "SexParent" "ArrivalTi
me" "SiblingNegotiation"
[6] "BroodSize" "NegPerChick"
> unique(Owls$Nest)
[1] AutavauxTV Bochet Champmartin ChEsard Chevroux CorcellesFavres Etrabloz
[8] Forel Franex GDLV Gletterens Henniez Jeuss LesPlanches
[15] Lucens Lully Marnand Moutet Murist Oleyes Payerne

[22] Rueyes Seiry SEvaz StAubin Trey Yvonnand
27 Levels: AutavauxTV Bochet Champmartin ChEsard Chevroux CorcellesFavres Etrabloz Forel Franex GDLV ... Yvonnand
> Owls.ATV = Owls[Owls$Nest=="AutavauxTV",]
> plot(x = Owls.ATV$ArrivalTime, y = Owls.ATV$NegPerChick, main = "AutavauxTV",
+      xlab = "Arrival Time", ylab = "Negotiation behaviour")

> AllNests <- unique(Owls$Nest)


> for (i in 1:27){
+ Nest.i <- AllNests[i]
+ Owls.i <- Owls[Owls$Nest == Nest.i, ]
+ YourFileName <- paste(Nest.i, ".jpg", sep = "")
+ jpeg(file = YourFileName)
+ plot(x = Owls.i$ArrivalTime, y = Owls.i$NegPerChick,
+ xlab = "Arrival Time",
+ ylab = "Negotiation behaviour", main = Nest.i)
+ dev.off()
+ }
