R Notes Chapter 1. Data Type and Data Entry
C.Y. Ng
Chapter 1. Data Type and Data Entry
Learning Objectives
Before we do any kinds of data analysis in R, we need to import or enter data. R has a wide
variety of data types.
Data type      Example
Numeric 12.3, 7, 8e2
Integer 2L, 34L, 0L
Complex 3 - 2i
Logical TRUE, FALSE
Character 'a' , "good", "TRUE", '23.4'
We assign a variable with a particular value using the symbol “<-”. For example, to set a
variable with name a to 12.3, we use
> a <- 12.3
You can use whatever variable names you like as long as they fulfill the following conditions.
Reserved words are those reserved for use as names for functions or other special purposes, and
cannot be used as variable names. A valid variable name consists of letters, numbers and the dot
or underline characters and starts with a letter or the dot not followed by a number. Names such
as .2way are not valid.
The function class() returns the class of a variable. You can verify the class of a variable by
using the following:
> class(a)
[1] "numeric"
(The funny [1] indicates that the output is a vector, and “numeric” is the first element.)
In R, the most basic data type is the vector. In this section we start with vectors and simple
operations and functions that work on vectors. Other commonly used data types (matrices,
arrays, data frames and lists) will be introduced in the next section.
Vector
The combine function c() is used to form a vector. The following are numeric, character and
logical vector assignments:
> a <- c(1.0, 2, 6, -2, 7.3, 9)
> b <- c("Anthony", "Becky", "Cheung", "Dan", "Eric") # note the use of "
> c <- c(TRUE, FALSE, TRUE)
(Things after # are comments, and are ignored by the R interpreter.) There is no specific type for
scalars in R. Scalars are treated as one-element vectors.
> d <- 6.89
> e <- -5e-2 # That means e = -0.05
> f <- "LEO"
> g <- FALSE
You can index elements of vectors using brackets like the following:
> a[3] # Note the use of brackets []. For functions, we use ()
[1] 6
> a[-c(1, 2, 5)] # all except the first, second and fifth elements
[1] 6 -2 9
The colon operator generates a sequence of numbers. For example, 2:4 means c(2, 3, 4).
> a[2:4]
[1]  2  6 -2
Similarly, 6.6:9.6 means 6.6 7.6 8.6 9.6. You can create something that is hard to read. E.g.
3.8:9.6 means 3.8 4.8 5.8 6.8 7.8 8.8. If you want to generate a more complicated
sequence of numbers, you can use the function seq(from, to, by). For example,
> h <- seq(1, 10, 2)
> h
[1] 1 3 5 7 9
You can assign the value of specific elements of a vector by using <-.
> a <- 1:3
> a
[1] 1 2 3
> a[3] <- 2.5
> a
[1] 1.0 2.0 2.5
You can even assign to a position of a vector that does not yet exist:
> a[6] <- -4
> a
[1] 1.0 2.0 2.5 NA NA -4.0
a[4] and a[5] have NA, which means their values are missing.
If you want to use a spreadsheet-like environment to make changes to vectors (but not to other
data types), you can do so by typing
> data.entry(h)
A panel will then pop up and you can make changes directly. You can also click “Edit” in the
ribbon in R and enter the variable that you would like to edit to call up the panel.
Arithmetic operations and relational operators are among the simplest mathematical operations.
In R, for two vectors of equal length, the operators act on each element of the vectors. In the
examples below, v is c(2, 4.5, 7.5) and t is c(8, 3, -2.5) (these values can be read off from the
scalar examples further down):
/ (divide the 1st vector by the 2nd vector)
    > v/t
    [1]  0.25  1.50 -3.00
^ or ** (the 1st vector raised to the exponent of the 2nd vector)
    > v^t
    [1] 2.560000e+02 9.112500e+01 6.491527e-03
Note: You can also do scalar addition and multiplication. For example,
v+1 gives 3.0 5.5 8.5                      0.5*t gives 4.00 1.50 -1.25
v/3 gives 0.6666667 1.5000000 2.5000000    t^2 gives 64.00 9.00 6.25
When vectors are of unequal length, the operation can still be done, but a warning message
will pop up:
> j <- c(2, 3, 4, 5)
> jj <- c(1, 2, 3)
> jj+j
[1] 3 5 7 6
> jj*j
[1] 2 6 12 5
> jj^j
[1] 1 8 81 1
Warning message: Longer object length is not a multiple of shorter object
length
So how does the last element appear? The elements of the shorter vector are recycled to complete
the operations! Now you understand why v+1 and t^2 above make sense: The scalar is recycled!
!= (check if each element of the 1st vector is unequal to the corresponding element of the 2nd vector)
    > v!=u
    [1] FALSE TRUE TRUE
Again, you can also mix a vector with a scalar in relational operators. For example,
v>3 gives FALSE TRUE TRUE          v!=4.5 gives TRUE FALSE TRUE
v==2 gives TRUE FALSE FALSE
The following table shows the logical operators supported by R. They are applicable only to vectors
with logical, numeric or complex elements. All non-zero numbers are treated as TRUE.
The logical operators && and || consider only the first element of each of the two vectors and give
a single element as output. Replacing & with && and | with || in the examples above, we get TRUE
and TRUE.
Recall that the [] brackets are used for indexing. Indexing starts at position 1. Giving a
negative value in the index drops that element from the result. In R, the logical values TRUE and
FALSE can also be used for indexing. Now you should be able to understand the following:
> x<-seq(0,10,2) # x is 0 2 4 6 8 10
> x[x>3] # The operation x>3 gives FALSE FALSE TRUE TRUE TRUE TRUE
[1] 4 6 8 10
> x[x>6|x<=1]
[1] 0 8 10
> x[x>1&x<=6]
[1] 2 4 6
Simple Functions
The following is a list of functions that are frequently used to operate on vectors:
Mathematical functions
When the input x is a vector, the function returns a vector of values. For example,
> log10(c(0, 1, 10, 100))
[1] -Inf    0    1    2
The following are mathematical functions that play special roles in statistics.
lchoose(n, k)
    natural logarithm of the absolute value of the choose function
    > lchoose(10, 3)
    [1] 4.787492
gamma(x)
    gamma function of x
    > gamma(6)
    [1] 120
lgamma(x)
    natural logarithm of the absolute value of the gamma function
    > lgamma(0.5)
    [1] 0.5723649
Constants
month.name
    the English names for the months of the year
    > month.name[3]
    [1] "March"
month.abb
    the three-letter abbreviations for the English month names
    > month.abb[3]
    [1] "Mar"
Probability distributions
For example,
dpois(2, lambda = 0.5) gives 0.07581633 (= 0.5^2 * exp(-0.5) / 2!)
dnbinom(2, size = 2, prob = 0.2) gives 0.0768 (= 3C2 × 0.2^2 × 0.8^2)
dnorm(-1) gives 0.2419707 (= exp(-(-1)^2 / 2) / sqrt(2*pi))
pbinom(2, size = 4, prob = 0.2) gives 0.9728 (= 0.8^4 + 4 × 0.2 × 0.8^3 + 6 × 0.2^2 × 0.8^2)
pexp(10, rate = 0.2) gives 0.8646647 (= 1 − exp(−0.2 × 10))
qnorm(0.95, mean = 10, sd = 2) gives 13.28971 (≈ 10 + 1.645 × 2)
qchisq(0.95, df = 1) gives 3.841459
qf(0.9, df1 = 10, df2 = 6) gives 2.936935
runif(10) generates 10 uniform random numbers
Optional: In Monte Carlo experiments in option pricing, it is useful to simulate from the
multivariate normal distribution. The library MASS provides a function to do so by entering the
mean vector and covariance matrix. The following illustrates how we can simulate 50 trivariate
normal observations (and store them as a data frame with column headings). Read this after you finish the next
section.
> library(MASS)
> mean<-c(0,0.3,-0.3)
> sigma <-matrix(c(0.09, 0.072, 0.045,
0.072, 0.16, -0.02,
0.045, -0.02, 0.25), nrow=3, ncol=3)
> mydata <- mvrnorm(50, mean, sigma)
> mydata <- as.data.frame(mydata)
> names(mydata) <- c("x1", "x2", "x3")
The following functions take a vector (or matrix, see the next section) as their input argument.
We have t<-c(8.65,4.19,8.25,7.17,7.82,3.38,2.3,5.18,5.32,4.75)
range(x)
    the minimum and maximum of the data
    > range(c(1,3,8,8))
    [1] 1 8
sum(x), prod(x)
    sum and product of numbers
    > sum(c(1,3,8,8))
    [1] 20
    > prod(c(1,3,8,8))
    [1] 192
cumsum(x), cumprod(x)
    cumulative sum and product of numbers
    > cumsum(c(1,3,8,8))
    [1] 1 4 12 20
    > cumprod(c(1,3,8,8))
    [1] 1 3 24 192
quantile(x, probs, type = n)
    percentiles, where x is the vector, probs is a vector of probabilities in [0, 1], and n
    specifies the interpolation method (n = 6 for the smoothed empirical percentile)
    > quantile(t, c(0.25, 0.75), type = 6)
       25%    75%
    3.9875 7.9275
scale(x, center = b, scale = c)
    demean (center = TRUE) and/or scale by the sample sd (scale = TRUE); see below
diff(x, lag = m, differences = n)
    lagged differences; the defaults of both lag and differences are 1 (this function is
    mainly used in time series analysis)
    > diff(c(1,3,8,8))
    [1] 2 5 0
    > diff(c(1,3,8,8), differences = 2)
    [1]  3 -5
    > diff(c(1,3,8,8), lag = 2)
    [1] 7 5
For example, scale(t) demeans and scales the vector t; the end of its output looks like
 [9,] -0.1755346
[10,] -0.4381455
attr(,"scaled:center")
[1] 5.701
attr(,"scaled:scale")
[1] 2.170512
The attributes report the sample mean and sample sd used in the standardization.
The third case (which is rarely used in practice) is center = FALSE and scale = TRUE. The
non-centered data will be divided by the root-mean-square sqrt(sum(x^2) / (n - 1)).
Other useful functions
Vectors form the basis of more complicated data types. It is very important to know all these data
types because specific statistical functions in packages only accept particular data types as input.
For example, the OLS function lm only takes a data frame or vectors as data input.
Matrix
A matrix is a two-dimensional rectangular data set. All elements in the matrix have to be in the
same mode (e.g. all numeric, all logical). A matrix can be created using a vector input to the
matrix function. The general format is shown in the sketch below: vector contains the elements
for the matrix, nrow and ncol specify the row and column dimensions, and the other input arguments
are optional. byrow defaults to FALSE, i.e. the matrix is filled column by column.
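A minimal sketch of matrix creation, assuming the matrix used in the indexing examples below is
called y and holds the numbers 1 to 20 (which is consistent with the outputs shown):
> y <- matrix(1:20, nrow = 5, ncol = 4)   # 5 rows, 4 columns, filled column by column (byrow = FALSE)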
You can identify elements, rows and columns of a matrix by using brackets and comma. X[i,]
refers to the ith row, X[,j] refers to the jth column, and X[i,j] refers to the i,j-th element.
> y[1,4]
[1] 16
> y[2,]
[1] 2 7 12 17
> y[,2]
[1] 6 7 8 9 10
> y[,c(2,4)]
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20
> y[c(1,5),c(2,4)]
[,1] [,2]
[1,] 6 16
[2,] 10 20
> y[1:3,3:4]
[,1] [,2]
[1,] 11 16
[2,] 12 17
[3,] 13 18
> y[1:3,-c(1,3)]
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions. An
array can be created using the array() function, with a vector of elements and a dim argument
giving the dimensions. When a three-dimensional array is printed, R displays it one
two-dimensional slice at a time, with headers , , 1 and , , 2 and so on.
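A minimal sketch (the object name z and the dimensions 2 × 3 × 4 are only for illustration):
> z <- array(1:24, dim = c(2, 3, 4))   # a 2 x 3 x 4 array; prints as four 2 x 3 slices , , 1 to , , 4
> z[1, 2, 3]                           # indexing uses one subscript per dimension
[1] 15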
Like matrices, all elements in the array must belong to the same mode.
Data Frames
Data frames are tabular data objects. Unlike a matrix, each column of a data frame can hold a
different mode of data, but every column should have the same length. A data frame can be
created with the data.frame() function, which stacks columns of vectors: col1, col2, col3, and
so on are column vectors of any type, and the variable names of the vectors become the names
for the columns. Consider the following example:
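The data.frame() call is not shown; based on the output below it was presumably something like
the following (in older versions of R, character columns are converted to factors by default,
which is why gender prints with levels):
> BMI <- data.frame(gender = c("Male", "Male", "Female"),
                    height = c(152.0, 171.5, 167.0),
                    weight = c(81, 78, 63),
                    age = c(26, 37, 29))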
> names(BMI)
[1] "gender" "height" "weight" "age"
> dim(BMI)
[1] 3 4
There are many ways to index the elements of a data frame. You can use the methods used in
matrix before, or you can also use the column name.
> BMI[2:3]
height weight
1 152.0 81
2 171.5 78
3 167.0 63
> BMI[c("gender","age")]
gender age
1 Male 26
2 Male 37
3 Female 29
> BMI$gender
[1] Male Male Female
Levels: Female Male
The row names 1, 2, 3 in the above can be changed by using the rownames(dataframe). For
example, you can use
> rownames(BMI) <- c("Andrew", "Johnson", "Julie")
> print(BMI)
gender height weight age
Andrew Male 152.0 81 26
Johnson Male 171.5 78 37
Julie Female 167.0 63 29
Lists
The list is the most complicated data type in R, and many outputs of specific R functions
(e.g. the output of a linear regression or the output of a clustering) are lists. A list can be
thought of as an ordered collection of components, each of which may have a name (which can be
revealed by using the names(name_of_list) function). A list can be created with the list()
function, where each component can be a vector, a matrix, an array, or a data frame. Consider the
following example:
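The list() call is not shown; a sketch consistent with the output below, reusing the matrix y and
the data frame BMI created earlier (only the second component is given a name):
> mylist <- list(c(-1, 0, 1, 2, 3), anotherhead = y, BMI)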
> mylist
[[1]]
[1] -1 0 1 2 3
$anotherhead
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
[[3]]
gender height weight age
Andrew Male 152.0 81 26
Johnson Male 171.5 78 37
Julie Female 167.0 63 29
> names(mylist)
[1] "" "anotherhead" ""
Notice in the above that only the second component of the list has a name! This is because
in the declaration of mylist, only the second component is declared with the structure name =
component.
To specify a component of a list, you need to either use double brackets [[1]], [[2]], etc. or
$name if that component has a name. E.g. both mylist[[2]] and mylist$anotherhead
produce the matrix shown above.
3 Loading Data
Data come from a variety of sources and in a great variety of formats. You can use the keyboard to
enter data. However, most of the time the data are imported from text files, Excel, Access, SQL,
or even files generated by other statistical software (such as SPSS, SAS and Stata). In this
section we review the data input procedure from the keyboard, from txt and csv files, and from Excel files.
The function read.table is the most important and commonly used function to import simple
data in the form of a data frame.
You can embed data directly into your code. For example,
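A minimal sketch (the variable names Age, Gender and Weight are just for illustration):
> Age <- c(25, 31, 56)
> Gender <- c("M", "F", "M")
> Weight <- c(78, 63, 90)
> mydata <- data.frame(Age, Gender, Weight)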
Such a method is convenient for small datasets. You can also open up a spreadsheet-like
template for easy input. The following code first creates an empty data frame with the variable
names and modes you want to have:
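One possible way to do this (the column names and modes are illustrative only):
> mydata <- data.frame(Age = numeric(0), Gender = character(0), Weight = numeric(0))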
The following code then invokes the text editor on the object, which allows you to view and even
enter data:
> fix(mydata)
You cannot directly copy cell values from Excel into the text editor, and hence this method is not
too convenient.
For larger datasets, it is easier to load data from a data file. For convenience, we assume that
R's working directory contains the data file. (You can change the default working directory
by clicking File and Change dir in R, or Tools and Global Options in RStudio.) The
txt file ExKNN.txt contains the data for Example 1.1 of Chapter 1.
If the first row of the data file does not contain variable names (header = FALSE), you can use
col.names to specify a character vector containing the variable names. If there is no header and
you do not specify variable names, then the variables will be named V1, V2 and so on.
> mydata <- read.table("wwww.txt", header = FALSE,
col.names = c("Age", "Gender", "Weight"))
To enter data from csv files (csv stands for "comma-separated values"), use
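A sketch (the file name is hypothetical); read.csv is read.table with header = TRUE and sep = ","
as its defaults:
> mydata <- read.csv("mydata.csv", header = TRUE)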
To load Excel files with multiple sheets, you first need to install a package. For R, click
"Packages" and "Install package(s)" to install XLConnect. For RStudio, go to "Tools" in the
ribbon and select "Install Packages", or use
> install.packages("XLConnect")
(For library packages you only need to install them once. Later on we will install more libraries for
performing special statistical calculations.) Now you can load the library to use it:
> library(XLConnect)
The following is an example that reads three sets of data from the same Excel file:
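A sketch using XLConnect's loadWorkbook() and readWorksheet() functions (the file and sheet names
are hypothetical):
> wb <- loadWorkbook("mydata.xlsx")             # open the workbook once
> data1 <- readWorksheet(wb, sheet = "Sheet1")  # read each sheet into a data frame
> data2 <- readWorksheet(wb, sheet = "Sheet2")
> data3 <- readWorksheet(wb, sheet = "Sheet3")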
Chapter 2 Regularization
Learning Objectives
In this chapter we illustrate how ridge, Lasso, principal component and partial least squares
regressions can be fitted using R. The libraries needed are glmnet (Lasso and Elastic-Net
Regularized GLMs) for ridge and Lasso regressions, and pls (Partial Least Squares and
Principal Component Regression) for principal component and partial least squares
regressions. Go to "Packages" and "Install package(s)" in the ribbon to install them. Then load
them when needed:
> library(glmnet)
> library(pls)
The dataset we use is the credit data set saved in Chapter 4.txt.
> credit <- read.table("Chapter 4.txt", header = T)
> fix(credit)
> dim(credit)
[1] 400 12
1 L2 and L1 Regularization
The glmnet function does not use the y ~ x syntax of the lm function. We have to specify the
response and the predictors directly. Also, x needs to be a matrix of predictors and y a vector of
responses, not data frames. So
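A sketch, assuming the response is the Balance column of the credit data and a decreasing λ grid
running from 10^4 down to 10^0 (this grid reproduces λ = 1000 in column 3 and λ = 10 in column 7
of the coefficient matrix discussed below):
> x <- model.matrix(Balance ~ ., data = credit)[, -1]   # matrix of predictors (drop the intercept column)
> y <- credit$Balance                                   # vector of responses
> grid <- 10^seq(4, 0, length.out = 9)                  # decreasing sequence of lambda values
> ridgeunstd.mod <- glmnet(x, y, alpha = 0, lambda = grid, standardize = FALSE)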
In the above, alpha = 0 means ridge regression, and standardize = FALSE means the data are not
standardized in the fitting process (by default, glmnet standardizes the data automatically).
After the code above is run, associated with each value of λ is a vector of ridge regression
coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 12 × 9 matrix,
with 12 rows (one for each predictor, plus an intercept) and 9 columns (one for each value of λ).
If you wish to supply more than one value for lambda (just as above), make sure that you supply
a decreasing sequence of values.
> print(coef(ridgeunstd.mod))
The L2 norm of the coefficients is decreasing in λ. For example, we can compute the norm
for λ = 1000 and λ = 10, respectively:
> sqrt(sum(coef(ridgeunstd.mod)[-1,3]^2))
[1] 19.44854
> sqrt(sum(coef(ridgeunstd.mod)[-1,7]^2))
[1] 341.5019
After fitting the model, we can do prediction. Suppose we predict the y based on the x of the first
200 observations:
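A sketch of the prediction step (s = 10 picks the λ = 10 fit; the choice is only illustrative):
> predict(ridgeunstd.mod, s = 10, newx = x[1:200, ])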
The following is a snapshot of what you will get, together with the actual value of y:
Now let us consider fitting a model with standardized x and y using the first 200 observations,
and then using the fitted model to predict the y for the remaining 200 observations. We can then
compute an estimate of the test MSE (under the holdout method).
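A sketch under the assumptions above (the object names are illustrative):
> train <- 1:200
> ridgestd.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)  # standardize = TRUE by default
> ridge.pred <- predict(ridgestd.mod, s = 10, newx = x[-train, ])
> mean((ridge.pred - y[-train])^2)                                        # holdout estimate of the test MSE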
To determine the best value of lambda, we need to perform cross-validation to compute the test
MSE for a large range of λ values using the training data set (Note: Here we treat λ as a tuning
parameter and hence it should be estimated from the training data set). The function cv.glmnet()
performs 10-fold cross-validation to estimate the test MSE (you can use nfolds = n to change
the default 10 to other numbers).
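A sketch of the cross-validation step:
> cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)   # 10-fold CV by default
> plot(cv.out)                                           # CV error against log(lambda)
> cv.out$lambda.min                                      # the lambda with the smallest CV error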
It is interesting to see that for this data set, the test MSE seems to be monotonically increasing in
λ! In fact the best λ seems to be the smallest of the values being tested. Let us use a finer grid
search for small values of λ:
Indeed there is a minimum, but the U shape in the test MSE curve is not pronounced at all. (Note:
The result depends on how the subsets are drawn, and hence you should get a different result if you
run the code!)
Lasso regression can be done similarly, by using alpha = 1. Let us try fitting one model using all the data:
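A sketch (glmnet picks its own λ grid when lambda is not supplied):
> lasso.mod <- glmnet(x, y, alpha = 1)
> plot(lasso.mod)          # coefficient paths
> coef(lasso.mod, s = 1)   # coefficients at a chosen lambda, say 1 (illustrative)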
PCR can be done in R very easily by using the pcr function in the pls library. The syntax of
pcr is similar to that of lm, with the additional option scale = T or F to determine whether the data
are standardized prior to the generation of the principal components. Also, setting validation =
"CV" causes R to compute the ten-fold CV for each possible value of M. After fitting the model,
the results can be examined using summary(). You can also use validation = "LOO" to
compute the LOOCV estimate of the test MSE.
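The fitting code is presumably something like the following (the data frame name multo is taken
from the plsr example later in this chapter; the output below shows X dimension 18 × 3):
> pcr.fit <- pcr(y ~ x1 + x2 + x3, data = multo, scale = TRUE, validation = "CV")
> summary(pcr.fit)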
Data: X dimension: 18 3
Y dimension: 18 1
Fit method: svdpc
Number of components considered: 3
VALIDATION: RMSEP
Cross-validated using 10 random segments.
The CV scores above are root mean squared errors of prediction. (The adjusted CV is a bias-corrected
CV estimate.) Using more principal components results in a better fit and hence a smaller CV. We
can plot the cross-validated MSE:
> validationplot(pcr.fit, val.type = "MSEP") # plot the MSE
From the plot it is evident that using the first component is enough. Actually, while using only 1
PC explains only 66.29% of the variance of the predictors, 98.71% of the variance of y is already
explained!
After determining how many components to use based on the CV result, we can fit the final
model with the number of components specified by ncomp = n:
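A sketch with one component, matching the coefficient table below:
> pcr.fit1 <- pcr(y ~ x1 + x2 + x3, data = multo, scale = TRUE, ncomp = 1)
> coef(pcr.fit1)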
y
x1 5.4149147
x2 5.4199296
x3 0.2082247
The example used in the lecture notes is mainly used to illustrate multi-collinearity. To illustrate
another aspect of PCR, let us use the data set called gasoline in the pls package.
> data(gasoline)
> names(gasoline)
[1] "octane" "NIR"
This is a data set with NIR (near infrared spectroscopy) spectra and octane numbers of 60
gasoline samples. Each observation has an octane number and a record of the NIR spectral
measurements (log(1/R), where R is the diffuse reflectance at a particular wavelength) from 900 nm
to 1700 nm (in 2 nm intervals, giving 401 wavelengths). So p = 401, n = 60, and dimension
reduction is needed. Our goal is to predict the octane number using the 401 predictors.
Let us decompose the data set into train and test data sets:
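A sketch consistent with the output below (45 training samples; predictions are later made for
samples 46 to 60):
> gasTrain <- gasoline[1:45, ]
> gasTest <- gasoline[46:60, ]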
> gas1.fit <- pcr(octane ~ NIR, data = gasTrain, ncomp = 10, validation =
"CV")
> summary(gas1.fit)
Data: X dimension: 45 401
Y dimension: 45 1
Fit method: svdpc
Number of components considered: 10
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 1.568 1.552 1.359 0.2605 0.2707 0.2722 0.2777
adjCV 1.568 1.547 1.280 0.2565 0.2677 0.2697 0.2757
7 comps 8 comps 9 comps 10 comps
CV 0.2459 0.2197 0.2058 0.2077
adjCV 0.2413 0.2168 0.2008 0.2028
Then we can fit a model with M = 3 using the training data set and then estimate the test
MSE of the model using the test data set.
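A sketch (the object names are illustrative):
> gas3.fit <- pcr(octane ~ NIR, data = gasTrain, ncomp = 3)
> gas3.pred <- predict(gas3.fit, newdata = gasTest, ncomp = 3)
> mean((gas3.pred - gasTest$octane)^2)   # test MSE under the holdout method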
octane
46 88.43762
47 88.17775
48 88.89864
49 88.29018
50 88.52651
51 88.10377
52 87.37123
53 88.32798
54 85.03583
55 85.32728
56 84.59276
57 87.57041
58 86.93111
59 89.25013
60 87.11733
Partial least squares regression can be fitted using plsr, which has the same syntax as pcr.
> plsr.fit <- plsr(y ~ x1 + x2 + x3, data = multo, scale=T, validation = "CV")
> summary(plsr.fit)
Data: X dimension: 18 3
Y dimension: 18 1
Fit method: kernelpls
Number of components considered: 3
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps
CV 11.19 1.913 1.483 1.322
adjCV 11.19 1.790 1.465 1.302
After determining how many components to use based on the CV result, we can fit the final
model with the number of components specified by ncomp = n:
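A sketch with one component, matching the coefficient table below:
> plsr.fit1 <- plsr(y ~ x1 + x2 + x3, data = multo, scale = TRUE, ncomp = 1)
> coef(plsr.fit1)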
, , 1 comps
y
x1 5.3917515
x2 5.4466237
x3 0.1349532
The PLSR with M = 1 is practically the same as the PCR with M = 1 in view of the MSE and the
fitted coefficients.
> gas1.fit <- plsr(octane ~ NIR, data = gasTrain, ncomp = 10, validation =
"CV")
> summary(gas1.fit)
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 1.568 1.397 0.3002 0.2364 0.2111 0.1959 0.1924
adjCV 1.568 1.398 0.2875 0.2376 0.2034 0.1923 0.1889
7 comps 8 comps 9 comps 10 comps
CV 0.1902 0.2032 0.2530 0.2796
adjCV 0.1874 0.1999 0.2464 0.2702
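The predictions below (headed , , 2 comps) were presumably obtained from the fitted model above
using two components, e.g.:
> gas2.pred <- predict(gas1.fit, newdata = gasTest, ncomp = 2)
> gas2.pred
> mean((gas2.pred - gasTest$octane)^2)   # test MSE for the 2-component PLSR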
, , 2 comps
octane
46 88.35589
47 88.15756
48 88.78035
49 88.24472
50 88.43657
51 87.98744
52 87.32389
53 88.24144
54 84.90116
55 85.28262
56 84.60781
57 87.36257
58 86.81301
59 89.10257
60 86.97763
This “simpler” model is comparable to the PCR model with 3 principal components.
Learning Objectives
In this chapter we deal with non-linear regression and classification models. We are going to
discuss KNN in Section 1, and classification and regression trees in Section 2. In Section 3 we
deal with random forests and boosting for CARTs.
First you need to install the libraries e1071 and caret (caret is short for Classification And
Regression Training). caret is an extremely useful library and it serves as an interface to
many libraries that conduct regression or classification.
Classification
We are going to use the wine data stored in wine.csv as an illustration for KNN classification.
The data set has n = 178. The first column, named Type, is the class of wine (it can take the
values 1, 2, and 3) and is the y of the data set. The remaining 13 variables are the predictors.
Since the scales of the predictors differ vastly, standardization is needed before conducting
KNN classification.
To perform a KNN classification, the response y cannot be numerical. So let us create the
corresponding categorical version (“factor” in R) from the numeric version:
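A sketch (it mirrors the code used again in the random forest section later):
> wine <- read.csv("wine.csv", header = TRUE)
> wine$Type <- factor(wine$Type)   # turn the numeric class label into a factor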
KNN is different from linear regression models in the sense that the model output is the
prediction itself, not beta coefficients. In fact, the only model output is the prediction. So one must
specify the training data as well as the predictors for prediction purposes.
In caret, we can randomly split the data into a training data set and a test data set by using the
following code:
> library(caret)
> set.seed(1234)
> train_index <- createDataPartition(y = wine$Type, p = 0.7, list = F)
> training <- wine[train_index,]
> testing <- wine[-train_index,]
I set the seed so that you can replicate my results below. The function createDataPartition
randomly partitions the 178 observations into 2 partitions in the ratio 7 : 3, as specified in the p
argument. The last argument tells R to return the output in the form of a list (T) or a matrix (F)
of record indices. The argument y is used so that the training data set and the original
data set have approximately the same proportions of the different types of y.
For example, the proportions of the three types in the training data set and in the full data set are
    1     2     3
0.376 0.352 0.272
        1         2         3
0.3314607 0.3988764 0.2696629
Using the function trainControl, we set the method used to compute the average error rate in the
training data set in order to determine the best value of the tuning parameter K. Commonly used
methods are cv, repeatedcv, and LOOCV. Cross-validation was discussed in Chapter 1 of the
lecture notes. Since the average error rate (or test MSE for regression) computed from cross-
validation depends on how the subsamples are drawn, we can repeat the cross-validation a number
of times to get a series of average error rates and take a final average; this is called repeated
cross-validation. The two additional arguments are number = 10, which gives 10 folds in each
cross-validation, and repeats = 3, which repeats the cross-validation 3 times.
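A sketch of the control object (the name ctrl is illustrative):
> ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)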
In the function train, Type ~ . is similar to the usual linear regression formula: y is Type and . means all
predictors are used. (You can also replace the part "Type ~ ., data = training" with
"training[, -1], training[, 1]". That is, you can directly enter the predictors and the
response.) Finally, tuneLength specifies the number of candidate values of K to be examined
in the computation of the test MSE.
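A sketch of the call (here standardization is handled through preProcess, and tuneLength = 20
gives the 20 candidate values of k shown in the output below):
> knn.fit <- train(Type ~ ., data = training, method = "knn",
                   trControl = ctrl, preProcess = c("center", "scale"),
                   tuneLength = 20)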
> knn.fit
k-Nearest Neighbors
125 samples
13 predictor
3 classes: '1', '2', '3'
k Accuracy Kappa
5 0.9590326 0.9379247
7 0.9513098 0.9260016
9 0.9451049 0.9165983
11 0.9431208 0.9138134
13 0.9508436 0.9256436
15 0.9561855 0.9337326
17 0.9617799 0.9421534
19 0.9648102 0.9466802
21 0.9673743 0.9505841
23 0.9617799 0.9420737
25 0.9590021 0.9379070
27 0.9594683 0.9386432
29 0.9596820 0.9389060
31 0.9569042 0.9347393
33 0.9569042 0.9347393
35 0.9569042 0.9347393
37 0.9596820 0.9389060
39 0.9596820 0.9389060
41 0.9596820 0.9389060
43 0.9596820 0.9389060
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 21.
The accuracy is 1 − error rate, as discussed in the lecture notes. The higher it is, the better the
model's predictive power. In the final model for prediction, k is selected to be 21.
Finally we can do prediction on our test data set and compute the resulting confusion matrix:
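A sketch of the prediction step; the confusion matrix further below then comes from
confusionMatrix(test_pred, testing$Type):
> test_pred <- predict(knn.fit, newdata = testing)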
> test_pred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2
[38] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
Reference
Prediction 1 2 3
1 12 2 0
2 0 24 0
3 0 1 14
Overall Statistics
Accuracy : 0.9434
95% CI : (0.8434, 0.9882)
……
The error rate is 3 / 53 (or 1 − 0.9434) = 5.66%. This is a very good fit!
If you want to specify the value of K directly, remove the cross-validation part and supply the
value of K using tuneGrid, as in the sketch below. The following is another run (with a different
partition into training and validation data sets):
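A sketch with K fixed at 5 (the value is illustrative):
> knn.fit5 <- train(Type ~ ., data = training, method = "knn",
                    preProcess = c("center", "scale"),
                    trControl = trainControl(method = "none"),
                    tuneGrid = data.frame(k = 5))
> confusionMatrix(predict(knn.fit5, newdata = testing), testing$Type)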
Reference
Prediction 1 2 3
1 17 3 0
2 0 18 0
3 0 0 14
Overall Statistics
Accuracy : 0.9423
95% CI : (0.8405, 0.9879)
……
Regression
We are going to use the same example as in the lecture notes as illustration for KNN regression.
k-Nearest Neighbors
38 samples
1 predictor
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
To fit CART we use the library rpart. To plot beautiful trees I recommend using the library
rpart.plot. If you have installed caret, then rpart should have already been installed.
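The general format of the call, showing only the main arguments (described next), is roughly
rpart(formula, data = , method = , parms = , control = )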
where
formula is in the format: y ~ x1 + x2 … as in lm (interaction term not allowed!)
data specifies the data frame
method "class" for classification, "anova" for regression
control optional parameters for pruning, the default parameters are:
minsplit = 20, minbucket = round(minsplit/3) , cp = 0.01, maxdepth = 30
(here minsplit is the minimum number of observations that must exist in a node for a split to
be attempted, minbucket is the minimum number of observations in any terminal leaf, and cp
is the cost complexity parameter in the objective function.)
parms is the method of split for a classification tree. Use either parms = list(split =
"gini") for the Gini index or parms = list(split = "information") for cross entropy. The default is the
Gini index. For regression trees, there is no such parameter.
After we have fitted a tree, we can examine the results by using the following functions:
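Commonly used ones (this list is only a sketch; fit denotes the fitted rpart object) include:
> printcp(fit)          # display the complexity parameter (cp) table
> plotcp(fit)           # plot the cross-validated error against cp
> summary(fit)          # detailed results
> print(fit)            # print the tree in text form
> plot(fit); text(fit)  # draw the tree and add labels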
Classification
As a simple illustration, consider the golf example in the lecture notes. Load the data set golf.txt.
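A sketch of the loading and first attempt (the default minsplit of 20 exceeds the 14 observations,
so no split is produced):
> golf <- read.table("golf.txt", header = TRUE)
> golf.fit <- rpart(Golf ~ ., data = golf, method = "class")
> golf.fit   # only the root node is reported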
What is happening? In our toy example, the number of observations is smaller than the default
minsplit! So let us change the parameters:
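The refitted call is echoed in the output below; it was presumably along the lines of
> golf.fit <- rpart(Golf ~ ., data = golf, method = "class", minsplit = 2)
> summary(golf.fit)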
Call:
rpart(formula = Golf ~ ., data = golf, method = "class", minsplit = 2)
n= 14
Variable importance
Outlook Temperature Humidity Windy
30 26 23 20
n= 14
node), split, n, loss, yval, (yprob)
* denotes terminal node
This is obviously too hard to understand! Let us plot the tree out:
> plot(golf.fit)
You will then see a figure with no labels but only splits! The plot function in rpart frequently
gives funny results because of its mysterious default margin setting. Without deleting the figure,
let us try to fine tune the tree’s labels and fonts:
> text(golf.fit, use.n=TRUE, all=TRUE, cex=.8)
This looks unprofessional. Hence you need the library that was created just to plot trees!
> library(rpart.plot)
> rpart.plot(golf.fit, type = 3, fallen.leaves = T, extra = 104)
There are so many things that you can tune in this plot function that I had better give you a link to
the documentation on the R website for details:
https://www.rdocumentation.org/packages/rpart.plot/versions/3.0.6/topics/rpart.plot
Finally we can use the tree to make predictions for 2 combinations of predictors:
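A sketch; here two rows of the original data are reused as the new cases (the actual new cases in
the original example are not shown, so the probabilities below need not match this particular choice):
> newcases <- golf[c(1, 5), names(golf) != "Golf"]   # two combinations of predictor values
> predict(golf.fit, newdata = newcases)              # class probabilities for a classification tree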
No Yes
1 0.5 0.5
2 0.0 1.0
A probability distribution is reported. In the first case, the prediction is 50% No and 50% Yes! If
you want the majority vote rather than a probability distribution, you can use the following:
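A sketch, asking predict() for the class labels instead:
> predict(golf.fit, newdata = newcases, type = "class")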
In this case we have built the largest tree possible. Now let us use the wine example again to fit a
classification tree.
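A sketch; the call is echoed in the output below:
> wine.fit <- rpart(Type ~ ., data = training, method = "class")
> printcp(wine.fit)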
Classification tree:
rpart(formula = Type ~ ., data = training, method = "class")
n= 126
The tree with cp = 0.01 seems to have too few splits. Let us relax that a bit:
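A sketch; the relaxed call is echoed in the output below:
> wine.fit2 <- rpart(Type ~ ., data = training, method = "class",
                     cp = 1e-04, minsplit = 10)
> printcp(wine.fit2)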
Classification tree:
rpart(formula = Type ~ ., data = training, method = "class",
cp = 1e-04, minsplit = 10)
n= 126
There is no change. So we use the tree with 3 splits to do prediction. The tree looks like this:
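A sketch of the plotting and prediction steps (wine_predict matches the row label of the
confusion matrix below):
> rpart.plot(wine.fit, type = 3, fallen.leaves = TRUE, extra = 104)
> wine_predict <- predict(wine.fit, newdata = testing, type = "class")
> table(wine_predict, testing$Type)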
wine_predict 1 2 3
1 14 2 0
2 3 19 0
3 0 0 14
By using only three predictors, we reach an error rate of 5 / (178 − 126) = 9.6%.
Regression
To illustrate regression tree and pruning, let us use the default data cu.summary in the caret
library. The data set contains 117 observations, with Mileage as the response, and Price, Country,
Reliability, and Type as predictors. While I use all the data for fitting, many predictors are
missing and they are deleted before fitting.
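A sketch of the data preparation and fit (the call is echoed in the output below; cu is the name
also used in the boosting section later):
> library(rpart)
> cu <- cu.summary[complete.cases(cu.summary), ]       # drop observations with missing values
> cu.fit <- rpart(Mileage ~ ., data = cu, method = "anova")
> printcp(cu.fit)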
Regression tree:
rpart(formula = Mileage ~ ., data = cu, method = "anova")
The tree with cp = 0.011604 gives the smallest cross-validated test MSE. (Another rule is to
choose the simplest tree with rel error + xstd < xerror. This is why rpart reports these three
values. rel error is related to the training MSE and it always goes down as the number of splits
increases.)
Random Forests
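The general call, with the main arguments described below, is roughly
randomForest(formula, data = , subset = , mtry = , ntree = 500, importance = FALSE)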
Here,
mtry is the m in the lecture notes. The default setting in this library is m = sqrt(p) for
classification trees and m = p / 3 for regression trees.
ntree is the number of trees to grow (the B in the lecture notes). Default is 500.
subset is an index vector indicating which rows should be used as the training data set.
importance (default is F): should the importance of predictors be assessed? (If you use T,
then later on you can also call up the variable importance plot (VIP) by using either
importance(output) or varImpPlot(output).)
> library(randomForest)
> wine <- read.csv("wine.csv", header = T)
> wine$Type <- factor(wine$Type)
> train_index <- createDataPartition(y = wine$Type, p = 0.7, list = F)
> training <- wine[train_index,]; testing <- wine[-train_index,]
> winebag <- randomForest(Type~., data = wine, subset = train_index,
mtry = 13, importance = T)
> varImpPlot(winebag)
After running the above, we use the predict function to compute the predictions.
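A sketch; the fit with the default mtry is echoed in the Call below, and the test-set predictions
can then be obtained with predict():
> winerf <- randomForest(Type ~ ., data = wine, subset = train_index, importance = T)
> winerf
> winerf.pred <- predict(winerf, newdata = testing)
> table(winerf.pred, testing$Type)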
Call:
randomForest(formula = Type ~ ., data = wine, importance = T, subset =
train_index)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
So you can see the error rate for the random forest is only half of that of bagging! The VIP is
shown below.
Boosting
For boosting, you need to install the library gbm. (GBM stands for gradient boosting machine.)
Also, you need to enter the 3 parameters for the number of trees, the depth of the trees, and the
shrinkage parameter.
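The general call is roughly (the defaults quoted in the next paragraph are the ones shown here)
gbm(formula, data = , distribution = , n.trees = 100, interaction.depth = 1, shrinkage = 0.1)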
The above is self-explanatory. 100, 1 and 0.1 are the default values and you can skip them if you
wish to use the defaults. For distribution, use "bernoulli" for binary classification trees or
"gaussian" for regression trees.
Let us use the cu.summary data set again as an example. Before we plug in the data set, we need
to remove all observations with missing values because gbm does not automatically do that.
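A sketch of the fit with the default settings (cu as prepared in the regression tree section above):
> library(gbm)
> cubt <- gbm(Mileage ~ ., data = cu, distribution = "gaussian")   # n.trees = 100, depth 1, shrinkage 0.1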
> summary(cubt)
var rel.inf
Type Type 36.434368
Price Price 27.792783
Country Country 26.986571
Reliability Reliability 8.786278
As in the regression tree obtained by cost complexity pruning, Type and Price are the two most
important predictors. Now we compute the training MSE by first obtaining all predicted values:
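A sketch, mirroring the code used for the slow-learning fit further below (Mileage is column 4 of cu):
> cubt.pred <- predict(cubt, newdata = cu[, -4], n.trees = 100)
> mean((cubt.pred - cu[, 4])^2)   # training MSE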
Then let us repeat with a smaller shrinkage parameter and a larger number of trees:
> cubt <- gbm(Mileage ~., data = cu, distribution = "gaussian", n.trees =
10000, interaction.depth = 1, shrinkage = 0.05)
> cubt.pred <- predict(cubt, newdata = cu[, -4], n.trees = 10000)
> mean((cubt.pred-cu[,4])^2)
[1] 2.435577
So, learning slowly leads to a better fit because the training MSE is smaller. (In this example, we
are not looking at the test MSE, though, because the data set is quite small.)
Learning Objectives
In this chapter we look at PCA and clustering. We will use R to analyze the default dataset
USArrests in the basic R package. You can use
> print(USArrests)
to view the whole dataset. The first "column" contains the row names and is not treated as data in
the data frame. Here is how we can generate a scatterplot matrix of the four variables:
> pairs(USArrests, main = "Basic Scatter Plot Matrix")
There are two functions to perform PCA in the base R package: prcomp() and princomp().
Since the former employs a numerically more stable method to perform the eigenvalue and
eigenvector calculation (singular value decomposition), here I am going to introduce only
prcomp().
[Note: If your dataset contains multiple columns and you only wish to conduct PCA on a subset
of columns, you can use the formula interface of prcomp():
prcomp(formula = ~ col1 + col2 + … , data = name_of_dataset, center = T / F,
scale = T / F)
The default for the optional argument center (for mean-centering the data) is T, while the
default for the optional argument scale (for scaling the data) is F. See page 8 and page 9 for
what these mean in R.] The list returned by prcomp contains five outputs:
sdev, rotation, center, scale, x
After running prcomp, we can use summary(output) to investigate the cumulative proportion of
variance explained, and also use plot(output) to get the scree plot.
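A sketch of the call (pc.out is the object name used in the plots below; the standard deviations
in the output match PCA on the scaled data):
> pc.out <- prcomp(USArrests, scale = TRUE)
> summary(pc.out)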
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
> plot(pc.out)
Since the scree plot is usually viewed as a line chart, we can use the following:
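One possibility:
> screeplot(pc.out, type = "lines")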
We can call out the z scores easily. If we want to look at the z scores for the first 2 components,
we can use
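A sketch (the scores are stored in the x component of the prcomp output):
> pc.out$x[, 1:2]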
We can use biplot() to generate the biplot for any two components (e.g. PC1 and PC2):
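A sketch:
> biplot(pc.out, choices = c(1, 2), scale = 0)   # scale = 0 keeps the arrows proportional to the loadings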
You may not like that the arrows all point in the negative directions. Let us flip them:
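A sketch; negating the rotation and the scores flips the directions without changing the analysis:
> pc.out$rotation <- -pc.out$rotation
> pc.out$x <- -pc.out$x
> biplot(pc.out, choices = c(1, 2), scale = 0)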
In this section, we use the function agnes() (not the function hclust() specified in the Exam
SRM textbook) in the R library cluster to do HAC. agnes is the acronym for Agglomerative
Nesting. I personally prefer agnes() because it can directly accept the data as input, while
for hclust() you must first change the data into a dissimilarity matrix using the function
dist(x).
You can apply the following functions to the output of agnes() to get the dendrogram and also
cut it:
pltree(): plot the dendrogram
cutree(): cut the tree at a particular height and give the cluster labels for each observation
> library(cluster)
> agg.out<-agnes(x = USArrests, diss = F, stand = T, method = "complete")
> pltree(agg.out, main = "Dendrogram of USArrests, complete linkage")
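The cluster labels printed below come from cutting the dendrogram into 4 clusters, e.g.:
> cut.4 <- cutree(agg.out, k = 4)
> cut.4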
[1] 1 2 2 3 2 2 3 4 2 1 4 3 2 4 3 4 3 1 3 2 4 2 3 1 4 3 3 2 3 4 2 2 1 3 4 4 4
[38] 4 4 1 3 1 2 4 3 3 4 3 3 3
This is quite hard to visualize. So let us do the following to show the names of the states (which
are the row labels of the data frame object USArrests).
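A sketch (this is the code explained in the paragraph after the output):
> sapply(unique(cut.4), function(g) rownames(USArrests)[cut.4 == g])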
[[1]]
[1] "Alabama" "Georgia" "Louisiana" "Mississippi"
[5] "North Carolina" "South Carolina" "Tennessee"
[[2]]
[1] "Alaska" "Arizona" "California" "Colorado" "Florida"
[6] "Illinois" "Maryland" "Michigan" "Nevada" "New Mexico"
[11] "New York" "Texas"
[[3]]
[1] "Arkansas" "Connecticut" "Idaho" "Iowa"
[5] "Kentucky" "Maine" "Minnesota" "Montana"
[9] "Nebraska" "New Hampshire" "North Dakota" "South Dakota"
[13] "Vermont" "Virginia" "West Virginia" "Wisconsin"
[17] "Wyoming"
[[4]]
[1] "Delaware" "Hawaii" "Indiana" "Kansas"
[5] "Massachusetts" "Missouri" "New Jersey" "Ohio"
[9] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
[13] "Utah" "Washington"
To explain the code above: the function unique(x) returns a vector of the elements of x but with
duplicate elements / rows removed, so unique(cut.4) returns the 4 elements 1, 2, 3, and 4. Then
the function sapply loops the function in the second argument,
rownames(USArrests)[cut.4 == g], by putting g = each of the values in the first argument,
and returns the vectors of outputs. Recall that cut.4 == g returns a vector of T and F for each g.
For example, cut.4 == 1 returns T, F (× 8), T, F (× 7), T, F (× 5), T, F (× 8), T, F (× 6), T, F,
T, F (× 8). Then rownames(USArrests)[cut.4 == 1] reports the row names for those rows
with T.
If you are still not satisfied, you can even do the tree cutting graphically. The codes are much
harder, though…
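A sketch consistent with the description below (the agnes output is first converted to an hclust
object so that rect.hclust() can be used):
> hc <- as.hclust(agg.out)
> plot(hc, cex = 0.6, hang = -0.1, main = "")
> title("Dendrogram of USArrests, complete linkage")
> rect.hclust(hc, k = 4, border = 2:5)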
In the above, a dendrogram with state labels of font size 0.6 is drawn. Then a title heading is
added. The parameter setting hang = -0.1 causes the state labels to hang down from height 0.
Then 4 rectangles that correspond to the 4 clusters are drawn. The parameter border = 2:5
selects the colours of the borders of the 4 rectangles (colours numbered 2 to 5). The general
format of the function rect.hclust() is as follows:
rect.hclust(tree, k = NULL, h = NULL, border = 2)
The most important parameters are k and h: you can cut the dendrogram such that either exactly
k clusters are produced or by cutting at height h.
3 K Means Clustering
K means clustering can be done using the function kmeans() in the R package stats.
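The main arguments are roughly
kmeans(x, centers, iter.max = 10, nstart = 1)
and the components of the output are listed below.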
cluster is the main result. It is a series of numbers which identifies the cluster to which
each observation is assigned.
centers gives the centroid of each cluster.
totss gives the total sum of squares without any clustering.
withinss gives the individual within-cluster sum of squares (SSEi).
tot.withinss gives the sum of the individual within-cluster sum of squares (SSE).
betweenss is totss − tot.withinss.
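The output below was presumably produced by something like the following (the exact cluster
assignments depend on the random initial centroids, so a seed is set here only for reproducibility):
> set.seed(1)
> km.out <- kmeans(USArrests, centers = 4, nstart = 20)
> km.out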
Cluster means:
Murder Assault UrbanPop Rape
1 8.214286 173.2857 70.64286 22.84286
2 5.590000 112.4000 65.60000 17.27000
3 2.950000 62.7000 53.90000 11.51000
4 11.812500 272.5625 68.31250 28.37500
Clustering vector:
Alabama Alaska Arizona Arkansas California
4 4 4 1 4
Colorado Connecticut Delaware Florida Georgia
1 2 4 4 1
Hawaii Idaho Illinois Indiana Iowa
3 2 4 2 3
Kansas Kentucky Louisiana Maine Maryland
2 2 4 3 4
Massachusetts Michigan Minnesota Mississippi Missouri
1 4 3 4 1
Montana Nebraska Nevada New Hampshire New Jersey
2 2 4 3 1
New Mexico New York North Carolina North Dakota Ohio
4 4 4 3 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
1 1 2 1 4
South Dakota Tennessee Texas Utah Vermont
3 1 1 2 3
Virginia Washington West Virginia Wisconsin Wyoming
1 1 3 3 1
Available components:
Do the HAC with the dendrogram cut at 4 clusters and the 4-means clustering give the same result?
Let us check:
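A sketch of the comparison:
> table(cut.4, km.out$cluster)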
cut.4 1 2 3 4
1 2 0 0 5
2 2 0 0 10
3 3 5 9 0
4 7 5 1 1
So the cluster assignments are different! While the majority of the elements in HAC's cluster 1
are in 4-means clustering's cluster 4, 2 of them are in 4-means clustering's cluster 1. The elements
in HAC's cluster 3 are assigned to clusters 1, 2, and 3 in 4-means clustering.
You may wonder if this is because the data are not standardized. So let us try one more time:
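A sketch (the object name and nstart are illustrative):
> km.std <- kmeans(scale(USArrests), centers = 4, nstart = 20)
> km.std
> table(cut.4, km.std$cluster)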
Cluster means:
Murder Assault UrbanPop Rape
1 1.4118898 0.8743346 -0.8145211 0.01927104
2 0.6950701 1.0394414 0.7226370 1.27693964
3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
4 -0.4894375 -0.3826001 0.5758298 -0.26165379
Clustering vector:
Alabama Alaska Arizona Arkansas California
1 2 2 1 2
Colorado Connecticut Delaware Florida Georgia
2 4 4 2 1
Hawaii Idaho Illinois Indiana Iowa
4 3 2 4 3
Kansas Kentucky Louisiana Maine Maryland
4 3 1 3 2
Massachusetts Michigan Minnesota Mississippi Missouri
4 2 3 1 2
Montana Nebraska Nevada New Hampshire New Jersey
3 3 2 3 4
New Mexico New York North Carolina North Dakota Ohio
2 2 1 3 4
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
4 4 4 4 1
South Dakota Tennessee Texas Utah Vermont
3 1 2 4 3
Virginia Washington West Virginia Wisconsin Wyoming
4 4 3 3 4
cut.4 1 2 3 4
1 7 0 0 0
2 0 12 0 0
3 1 0 13 3
4 0 1 0 13