
R Notes

C.Y. Ng

Chapter 1 Data Type and Data Entry

Learning Objectives

Data types, simple functions, importing data

1 Getting Started with Vectors and Functions

Before we do any kind of data analysis in R, we need to import or enter data. R has a wide
variety of data types.

Data        Example
Numeric     12.3, 7, 8e2
Integer     2L, 34L, 0L
Complex     3 - 2i
Logical     TRUE, FALSE
Character   'a', "good", "TRUE", '23.4'

We assign a variable with a particular value using the symbol “<-”. For example, to set a
variable with name a to 12.3, we use
> a <- 12.3

The following is equivalent, though not common:


> 12.3 -> a

You can use whatever variable names you like as long as they fulfill the following conditions.
Reserved words are those reserved for use as names for functions or other special purposes, and
cannot be used as variable names. A valid variable name consists of letters, numbers and the dot
or underline characters and starts with a letter or the dot not followed by a number. Names such
as .2way are not valid.
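For example, the following names illustrate the rules (a quick illustration only; the last two lines are rejected by R):

> my.var_1 <- 5    # valid: letters, digits, dot and underscore
> .hidden <- "ok"  # valid: starts with a dot not followed by a number
> .2way <- 1       # invalid: the dot is followed by a number
> for <- 3         # invalid: for is a reserved word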

The function class() returns the class of a variable. You can verify the class of a variable by
using the following:
> class(a)
[1] "numeric"

(The funny [1] indicates that the output is a vector, and “numeric” is the first element.)


In R, the most basic data type is the vector. In this section we start with vectors and simple
operations and functions that work on vectors. Other commonly used data types (matrices,
arrays, data frames and lists) will be introduced in the next section.

Vector

The combine function c() is used to form a vector. The following are numeric, character and
logical vector assignments:
> a <- c(1.0, 2, 6, -2, 7.3, 9)
> b <- c("Anthony", "Becky", "Cheung", "Dan", "Eric") # note the use of "
> c <- c(TRUE, FALSE, TRUE)

(Things after # are comments, and are ignored by the R interpreter.) There is no specific type for
scalars in R. Scalars are treated as one-element vectors.
> d <- 6.89
> e <- -5e-2 # That means e = -0.05
> f <- "LEO"
> g <- FALSE

You can index elements of vectors using brackets like the following:
> a[3] # Note the use of brackets []. For functions, we use ()
[1] 6

> a[c(1, 2, 5)]


[1] 1.0 2.0 7.3

> a[-2] # all except the second element
[1] 1.0 6.0 -2.0 7.3 9.0

> a[-c(1, 2, 5)] # all except the first, second and fifth elements
[1] 6 -2 9

The colon operator generates a sequences of numbers. For example, 2:4 means c(2, 3, 4).
> a[2:4]
[1] 2 6 2

Similarly, 6.6:9.6 means 6.6 7.6 8.6 9.6. You can create something that is hard to read. E.g.
3.8:9.6 means 3.8 4.8 5.8 6.8 7.8 8.8. If you want to generate more complicated
sequence of numbers, you can use the function seq(x, y, i). For example,
> h <- seq(1, 10, 2)
> h
[1] 1 3 5 7 9

> h <- seq(10, -1, -1.5)


> h
[1] 10.0 8.5 7.0 5.5 4.0 2.5 1.0 -0.5


You can assign the value of specific elements of a vector by using <-.
> a <- 1:3
> a
[1] 1 2 3
> a[3] <- 2.5
> a
[1] 1.0 2.0 2.5

You can even assign to a position of the vector that does not yet exist:
> a[6] <- -4
> a
[1] 1.0 2.0 2.5 NA NA -4.0

a[4] and a[5] have NA, which means their values are missing.
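You can test which elements are missing with is.na(). For example, continuing with the vector a above:

> is.na(a)
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE
> a[!is.na(a)] # keep only the non-missing elements
[1]  1.0  2.0  2.5 -4.0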

Spreadsheet-like Data Entry

If you want to use a spreadsheet-like environment to make changes to vectors (but not to other
data types), you can do so by typing
> data.entry(h)

A panel will then pop up and you can make changes directly. You can also click “Edit” in the
ribbon in R and enter the variable that you would like to edit to call up the panel.

Arithmetic, Relational and Logical Operations

Arithmetic operations and relational operators are among the simplest mathematical operations.
In R, for two vectors of equal length, the operators act on each element of the vectors:

Operator    What it does                           Example
                                                   v <- c(2, 4.5, 7.5)
                                                   t <- c(8, 3, -2.5)
+           Adds the two vectors                   > v+t
                                                   [1] 10.0  7.5  5.0
-           Subtracts the 2nd vector from the      > v-t
            1st vector                             [1] -6.0  1.5 10.0
*           Multiplies the two vectors             > v*t
                                                   [1]  16.00  13.50 -18.75
/           Divides the 1st vector by the          > v/t
            2nd vector                             [1]  0.25  1.50 -3.00
^ (or **)   Raises the 1st vector to the exponent  > v^t
            of the 2nd vector                      [1] 2.560000e+02 9.112500e+01 6.491527e-03

Note: You can also do scalar addition and multiplication. For example
v+1 gives 3.0 5.5 8.5              0.5*t gives 4.00 1.50 -1.25
v/3 gives 0.6666667 1.5000000 2.5000000              t^2 gives 64.00 9.00 6.25


When vectors are of unequal length, then the operation can still be done. But a warning message
will pop up:
j <- c(2, 3, 4, 5)
jj <- c(1, 2, 3)

> jj+j
[1] 3 5 7 6
> jj*j
[1] 2 6 12 5
> jj^j
[1] 1 8 81 1
Warning message: Longer object length is not a multiple of shorter object
length

So how does the last element appear? The elements of the shorter vector are recycled to complete
the operations! Now you understand why v+1 and t^2 above make sense: The scalar is recycled!
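When the longer length is an exact multiple of the shorter one, the recycling happens silently with no warning. For example:

> k <- c(1, 2, 3, 4, 5, 6)
> k * c(1, 10) # c(1, 10) is recycled three times; no warning is given
[1]  1 20  3 40  5 60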

Operator   What it does                                             Example
> (<)      Checks whether each element of the 1st vector is         > v>t
           greater than the corresponding element of the 2nd        [1] FALSE TRUE TRUE
           vector
>= (<=)    Checks whether each element of the 1st vector is         > u <- c(2, 7.5, 5)
           greater than or equal to the corresponding element       > v>=u # not v=>u !
           of the 2nd vector                                        [1] TRUE FALSE TRUE
==         Checks whether each element of the 1st vector is         > v==u
           equal to the corresponding element of the 2nd vector     [1] TRUE FALSE FALSE
!=         Checks whether each element of the 1st vector is         > v!=u
           unequal to the corresponding element of the 2nd vector   [1] FALSE TRUE TRUE

Again, you can also mix a vector with a scalar for relational operators. For example
v>3 gives FALSE TRUE TRUE          v!=4.5 gives TRUE FALSE TRUE
v==2 gives TRUE FALSE FALSE

The following table shows the logical operators supported by R. They are applicable only to vectors
with logical, numeric or complex elements. All nonzero numbers are treated as TRUE and zero as FALSE.

Operator   What it does                                   Example
&          Element-wise "And" operation                   > v<-c(3,-1, TRUE)
                                                          > t<-c(1, 2, FALSE)
                                                          > v&t
                                                          [1] TRUE TRUE FALSE
|          Element-wise "Or" operation                    > v<-c(3,-1, FALSE)
                                                          > t<-c(1, 2, FALSE)
                                                          > v|t
                                                          [1] TRUE TRUE FALSE
!          Element-wise "Not" operation. Changes TRUE     > v <- c(3, 0, TRUE)
           to FALSE, and FALSE to TRUE                    > !v
                                                          [1] FALSE TRUE FALSE


The logical operators && and || consider only the first element of each of the two vectors and give
a single element as output. Replacing & with && and | with || in the examples above, we get TRUE
and TRUE respectively.
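For instance, using the v and t from the & example above (recent versions of R insist that && and || take length-one operands, so we pick out the first elements explicitly):

> v <- c(3, -1, TRUE)
> t <- c(1, 2, FALSE)
> v[1] && t[1] # 3 and 1 are both nonzero, hence TRUE
[1] TRUE
> v[1] || t[1]
[1] TRUE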

Recall that the [] brackets are used for indexing. Indexing starts with position 1. Giving a
negative value in the index drops that element from the result. Logical vectors (TRUE / FALSE) can
also be used for indexing: the elements in the TRUE positions are kept. Now you should be able to
understand the following:
> x<-seq(0,10,2) # x is 0 2 4 6 8 10
> x[x>3] # The operation x>3 gives FALSE FALSE TRUE TRUE TRUE TRUE
[1] 4 6 8 10
> x[x>6|x<=1]
[1] 0 8 10
> x[x>1&x<=6]
[1] 2 4 6

Simple Functions

The following is a list of functions that are frequently used to operate on vectors:

Mathematical functions

Function                What it does                          Example
exp(x)                  exponential                           > exp(-1.2)
                                                              [1] 0.3011942
log(x, base = n)        logarithm of x to base n              > log(4, 2)
                                                              [1] 2
log(x)                  natural log                           > log(exp(-1.2))
log10(x)                log base 10                           [1] -1.2
sqrt(x)                 square root                           > sqrt(12)
                                                              [1] 3.464102
sin(x), cos(x), tan(x)  trigonometric functions (in radians)  > sin(-1)
                                                              [1] -0.841471
asin(x), acos(x),       inverse trigonometric functions       > acos(1/sqrt(2))/pi
atan(x)                                                       [1] 0.25
abs(x)                  absolute value                        > abs(sin(-1))
                                                              [1] 0.841471
round(x, digits = n)    rounds x to n decimal places          > round(sin(-1),4)
                                                              [1] -0.8415
signif(x, digits = n)   rounds x to n significant digits      > signif(20.458,1)
                                                              [1] 20
trunc(x)                truncates x towards 0                 > trunc(-1.2)
                                                              [1] -1
floor(x)                largest integer not greater than x    > floor(-1.2)
                                                              [1] -2
ceiling(x)              smallest integer not less than x      > ceiling(4.9)
                                                              [1] 5

When the input x is a vector, the function returns a vector of values. For example,
> log10(c(0, 1, 10, 100))
[1] -Inf    0    1    2


Note: Inf means positive infinity, -Inf means negative infinity.

Special mathematical functions

The following are mathematical functions that play special roles in statistics.

Function       What it does                                Example
factorial(x)   factorial (= gamma(x + 1))                  > factorial(4)
                                                           [1] 24
lfactorial(x)  natural logarithm of the factorial of x     > lfactorial(13)
                                                           [1] 22.55216
choose(n, k)   n(n - 1)...(n - k + 1) / k!, where n is     > choose(10,2)
               real and k is a non-negative integer        [1] 45
                                                           > choose(-10,2)
                                                           [1] 55
lchoose(n, k)  natural logarithm of the absolute value     > lchoose(10,3)
               of the choose function                      [1] 4.787492
gamma(x)       gamma function of x                         > gamma(6)
                                                           [1] 120
lgamma(x)      natural logarithm of the absolute value     > lgamma(0.5)
               of the gamma function                       [1] 0.5723649

Constants

Function     What it means                           Example
pi           the constant π                          > pi
                                                     [1] 3.141593
letters,     the 26 lower-case and upper-case        > letters[26]
LETTERS      letters of the Roman alphabet           [1] "z"
                                                     > LETTERS[4]
                                                     [1] "D"
                                                     > LETTERS[27]
                                                     [1] NA
month.name   the English names for the months        > month.name[3]
             of the year                             [1] "March"
month.abb    the three-letter abbreviations for      > month.abb[3]
             the English month names                 [1] "Mar"

Probability distributions

In R, probability distribution functions take the form [dpqr]distribution_abbreviation(),
where the first letter refers to the aspect of the distribution returned:

d = density function / probability mass function
p = cumulative distribution function
q = quantile (percentile) function
r = random variate generation


Distribution Abbreviation Distribution Abbreviation


Beta beta Logistic logis
Binomial binom Lognormal lnorm
Cauchy ( t(1)) cauchy Multinomial multinom
Chi-square chisq Negative binomial nbinom
Exponential exp Normal norm
F f Poisson pois
Gamma gamma t t
Geometric geom Uniform unif
Hypergeometric hyper Weibull weibull

For example,
dpois(2, lambda = 0.5) gives 0.07581633 (= 0.5^2 e^(-0.5) / 2!)
dnbinom(2, size = 2, prob = 0.2) gives 0.0768 (= C(3,2) × 0.2^2 × 0.8^2)
dnorm(-1) gives 0.2419707 (= e^(-(-1)^2/2) / sqrt(2π))
pbinom(2, size = 4, prob = 0.2) gives 0.9728 (= 0.8^4 + 4 × 0.2 × 0.8^3 + 6 × 0.2^2 × 0.8^2)
pexp(10, rate = 0.2) gives 0.8646647 (= 1 - exp(-0.2 × 10))
qnorm(0.95, mean = 10, sd = 2) gives 13.28971 (≈ 10 + 1.645 × 2)
qchisq(0.95, df = 1) gives 3.841459
qf(0.9, df1 = 10, df2 = 6) gives 2.936935
runif(10) generates 10 uniform random numbers

You can find the parameterization of the remaining distributions in


https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Distributions.html.

Optional: In Monte Carlo experiments in option pricing, it is useful to simulate from the
multivariate normal distribution. The library MASS provides a function, mvrnorm, to do so by
entering the mean vector and covariance matrix. The following illustrates how we can simulate 50
trivariate normal observations (and store them as a data frame with column headings). Read this
after you finish the next section.

> library(MASS)
> mean <- c(0, 0.3, -0.3)
> sigma <- matrix(c(0.09, 0.072, 0.045,
                    0.072, 0.16, -0.02,
                    0.045, -0.02, 0.25), nrow = 3, ncol = 3)
> mydata <- mvrnorm(50, mean, sigma)
> mydata <- as.data.frame(mydata)
> names(mydata) <- c("x1", "x2", "x3")


Other Statistical functions

The following functions take a vector (or a matrix, see the next section) as input argument.

We have t<-c(8.65,4.19,8.25,7.17,7.82,3.38,2.3,5.18,5.32,4.75)

Function                What it does                               Example
min(x)                  minimum and maximum                        > min(t)        > max(t)
max(x)                                                             [1] 2.3         [1] 8.65
mean(x)                 mean and median                            > mean(t)       > median(t)
median(x)                                                          [1] 5.701       [1] 5.25
sd(x)                   sample standard deviation and              > sd(t)
var(x)                  sample variance                            [1] 2.170512
range(x)                the minimum and maximum of x,              > range(c(1,3,8,8))
                        returned as a length-2 vector              [1] 1 8
sum(x)                  sum and product of the numbers             > sum(c(1,3,8,8))
prod(x)                                                            [1] 20
                                                                   > prod(c(1,3,8,8))
                                                                   [1] 192
cumsum(x)               cumulative sum and cumulative              > cumsum(c(1,3,8,8))
cumprod(x)              product of the numbers                     [1] 1 4 12 20
                                                                   > cumprod(c(1,3,8,8))
                                                                   [1] 1 3 24 192
quantile(x, probs,      percentiles, where x is the vector,        > quantile(t, c(0.25,0.75), type = 6)
type = n)               probs is a vector of probabilities in         25%    75%
                        [0, 1], and n specifies the interpolation  3.9875 7.9275
                        method (n = 6 for the smoothed
                        empirical percentile)
scale(x, center = b,    mean-centers (center = TRUE) and/or        See below
scale = c)              scales by the sample sd (scale = TRUE)
diff(x, lag = m,        lagged differences; the default of both    > diff(c(1,3,8,8))
differences = n)        lag and differences is 1 (this function    [1] 2 5 0
                        is mainly used in time series analysis)    > diff(c(1,3,8,8), differences = 2)
                                                                   [1]  3 -5
                                                                   > diff(c(1,3,8,8), lag = 2)
                                                                   [1] 7 5

Examples for scale: we mostly do mean-centering and standardization.

> scale(t) # default: data are first mean-centered and then divided by the sd


[,1]
[1,] 1.3586658
[2,] -0.6961492
[3,] 1.1743774
[4,] 0.6767989
[5,] 0.9762675
[6,] -1.0693331
[7,] -1.5669116
[8,] -0.2400356


[9,] -0.1755346
[10,] -0.4381455
attr(,"scaled:center")
[1] 5.701
attr(,"scaled:scale")
[1] 2.170512

> scale(t, scale = FALSE) # data are only mean-centered


[,1]
[1,] 2.949
[2,] -1.511
[3,] 2.549
[4,] 1.469
[5,] 2.119
[6,] -2.321
[7,] -3.401
[8,] -0.521
[9,] -0.381
[10,] -0.951
attr(,"scaled:center")
[1] 5.701

The third case (which is rarely used in practice) is center = FALSE and scale = TRUE. The
non-centered data are divided by the root-mean-square value sqrt(sum(x^2) / (n - 1)).
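A minimal check of this, using the vector t from above:

> rms <- sqrt(sum(t^2) / (length(t) - 1)) # root-mean-square of the raw data
> all.equal(as.vector(scale(t, center = FALSE, scale = TRUE)), t / rms)
[1] TRUE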
Other useful functions

Function      What it does                                        Example
length(x)     Returns the length of x                             > length(t)
                                                                  [1] 10
rep(x, n)     Repeats x n times                                   > rep(1:3, 3)
                                                                  [1] 1 2 3 1 2 3 1 2 3
unique(x)     Removes duplicated values in x                      > unique(c(1,2,2,3))
                                                                  [1] 1 2 3
pretty(x, n)  Creates pretty breakpoints. Divides a continuous    > pretty(t, 3)
              variable x into about n intervals by selecting      [1] 2 4 6 8 10
              about n + 1 equally spaced rounded values.
              Usually used in formatting the labels of axes
              in plotting.
dim(x)        Returns the dimension of a matrix or a data         See the section on data
              frame x                                             frames below
names(x)      Returns the column headings of x                    See the section on data
                                                                  frames below
ls()          Lists the objects in the current workspace


2 More General Data Types

Vectors form the basis of more complicated data types. It is very important to know all these data
types because specific statistical functions in packages accept only particular data types as input.
For example, the OLS function lm takes only a data frame or vectors as its data input.

Matrix

A matrix is a two-dimensional rectangular data set. All elements in the matrix have to be in the
same mode (e.g. all numeric, all logical). A matrix can be created using a vector input to the
matrix function. The general format is

mymatrix <- matrix(vector, nrow = number_of_rows, ncol = number_of_columns,


byrow = logical_value,
dimnames = list(char_vector_rowname, char_vector_colname) )

vector contains the elements for the matrix, and nrow and ncol specify the row and column
dimensions. The other input arguments are optional. byrow defaults to FALSE, i.e. the matrix is
filled column by column.

Consider the following two examples:


> y <- matrix(1:20, nrow = 5, ncol = 4)
> y
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20

> cells <- c(1, 3, 5, 9)


> rn <- c("R1", "R2")
> cn <- c("C1", "C2")
> z <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE, dimnames = list(rn,cn))
> z
C1 C2
R1 1 3
R2 5 9

You can identify elements, rows and columns of a matrix by using brackets and comma. X[i,]
refers to the ith row, X[,j] refers to the jth column, and X[i,j] refers to the i,j-th element.
> y[1,4]
[1] 16
> y[2,]
[1] 2 7 12 17
> y[,2]
[1] 6 7 8 9 10

> y[,c(2,4)]


[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20

> y[c(1,5),c(2,4)]
[,1] [,2]
[1,] 6 16
[2,] 10 20

> y[1:3,3:4]
[,1] [,2]
[1,] 11 16
[2,] 12 17
[3,] 13 18

> y[1:3,-c(1,3)]
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18

Arrays

While matrices are confined to two dimensions, arrays can be of any number of dimensions. An
array can be created with the following:

myarray <- array(vector, dimensions, dimnames) # dimnames is optional, can


declare later on too

> z <- array(1:24, c(2, 3, 4))


> z

, , 1

[,1] [,2] [,3]


[1,] 1 3 5
[2,] 2 4 6

, , 2

[,1] [,2] [,3]


[1,] 7 9 11
[2,] 8 10 12

, , 3

[,1] [,2] [,3]


[1,] 13 15 17
[2,] 14 16 18

, , 4


[,1] [,2] [,3]


[1,] 19 21 23
[2,] 20 22 24

Like matrices, all elements in the array must belong to the same mode.
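Elements of an array are indexed with one subscript per dimension, in the same way as matrices. For example, with the array z above:

> z[1, 2, 3] # row 1, column 2 of the third 2-by-3 slice
[1] 15
> z[, , 2] # the whole second slice
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12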

Data Frames

Data frames are tabular data objects. Unlike a matrix, each column of a data frame can have a
different mode of data. Each column in a data frame should have the same length. A data frame
can be created with the data.frame() function, which stacks columns of vectors:

mydata <- data.frame(col1, col2, col3, . . . )

where col1, col2, col3, and so on are column vectors of any type. The variable names of the
vectors become the names for each column. Consider the following example:

> gender <- c("Male", "Male", "Female")


> height <- c(152, 171.5, 167)
> weight <- c(81, 78, 63)
> age <- c(26, 37, 29)
> BMI <- data.frame(gender, height, weight, age)
> print(BMI)
gender height weight age
1 Male 152.0 81 26
2 Male 171.5 78 37
3 Female 167.0 63 29

> names(BMI)
[1] "gender" "height" "weight" "age"
> dim(BMI)
[1] 3 4

There are many ways to index the elements of a data frame. You can use the methods used in
matrix before, or you can also use the column name.

> BMI[2:3]
height weight
1 152.0 81
2 171.5 78
3 167.0 63

> BMI[c("gender","age")]
gender age
1 Male 26
2 Male 37
3 Female 29


Alternatively, you can use $ sign to call out a column:

> BMI$gender
[1] Male Male Female
Levels: Female Male

The row names 1, 2, 3 in the above can be changed by using the rownames(dataframe). For
example, you can use
> rownames(BMI) <- c("Andrew", "Johnson", "Julie")
> print(BMI)
gender height weight age
Andrew Male 152.0 81 26
Johnson Male 171.5 78 37
Julie Female 167.0 63 29

You can insert new columns in a data frame by using data.frame:

> IQ <-c(150, 0, -150)


> BMI <- data.frame(BMI, IQ)
> print(BMI)
gender height weight age IQ
Andrew Male 152.0 81 26 150
Johnson Male 171.5 78 37 0
Julie Female 167.0 63 29 -150

> BMI <- BMI[, -5]


> print(BMI)
gender height weight age
Andrew Male 152.0 81 26
Johnson Male 171.5 78 37
Julie Female 167.0 63 29

Lists

The list is the most complicated data type in R, and many of the outputs of specific R functions
(e.g. the output of a linear regression or of a clustering) are lists. A list can be thought of
as an ordered collection of components, each of which may have a name (the names can be revealed by
using the names(name_of_list) function). A list can be created by

mylist <- list(component 1, component 2, component 3, …),

where each of the component can be a vector, a matrix, an array, or a data frame. Consider the
following example:

> x <- -1:3


> y <- matrix(1:20, nrow = 5)
> mylist <- list(x, anotherhead = y, BMI)

> mylist
[[1]]
[1] -1 0 1 2 3


$anotherhead
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20

[[3]]
gender height weight age
Andrew Male 152.0 81 26
Johnson Male 171.5 78 37
Julie Female 167.0 63 29

> names(mylist)
[1] "" "anotherhead" ""

Notice in the above that in the list, only the second component has a name! This is because
in the declaration of mylist, only the second component is declared with the structure name =
component.

To specify the components in a list, you need to use either double brackets [[1]], [[2]], etc. or
$name if that component has a name. E.g. both mylist[[2]] and mylist$anotherhead
produce

[,1] [,2] [,3] [,4]


[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20

3 Loading Data

Data come from a variety of sources and in a great variety of formats. You can enter data from the
keyboard. However, most of the time data are imported from text files, Excel, Access, SQL databases,
and even files generated by other statistical software (such as SPSS, SAS and Stata). In this
section we review data input from the keyboard, from txt and csv files, and from Excel files.

The function read.table is the most important and commonly used function to import simple
data in the form of a data frame.

Entering data from the keyboard

You can embed data directly into your code. For example,


BMI <- "gender height weight age


Male 152.0 81 26
Male 171.5 78 37
Female 167.0 63 29 "
BMI <- read.table(header=TRUE, text=BMI)

Such a method is convenient for small datasets. You can also open up a spreadsheet-like
template for easy input. The following code first creates an empty data frame with the variable
names and modes you want to have:

> mydata <- data.frame(gender = character(0), height = numeric(0), weight =


numeric(0), age = numeric(0))

The following code then invokes the text editor on the object, allowing you to view and even
enter data:
> fix(mydata)

You cannot directly copy cell values from Excel into the text editor, and hence this method is not
too convenient.

Entering data from txt or csv file

For larger datasets, it is easier to load data from a data file. For convenience, we assume that
R's working directory contains the data file. (You can change the working directory by clicking
File and then Change dir in R, or Tools and then Global Options in RStudio.) The
txt file ExKNN.txt contains the data for Example 1.1 of Chapter 1.
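You can also check or change the working directory from the console; the path below is only a placeholder for wherever your data files are stored:

> getwd() # show the current working directory
> setwd("C:/FINA3295/data") # hypothetical path; replace it with your own folder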

> ex1 <- read.table("ExKNN.txt", header = TRUE)


> dim(ex1)
[1] 50 2
> str(ex1) # str compactly displays the internal structure of an object

'data.frame': 50 obs. of 2 variables:


$ x: num 5.2 5.4 6 6.5 6.6 6.7 6.9 7.1 7.2 7.5 ...
$ y: num 41.9 46.6 59.2 64.5 58.8 66.1 68.9 72.3 74 76.5 ...

If the first row of the data file does not contain variable names (header = FALSE), you can use
col.names to specify a character vector containing the variable names. If there is no header and
you do not specify variable names, then the variables will be named V1, V2 and so on.
> mydata <- read.table("wwww.txt", header = FALSE,
col.names = c("Age", "Gender", "Weight"))

To enter data from csv files (csv stands for "comma-separated values"), use

> wine <- read.csv("wine.csv", header = TRUE)


Entering data from Excel file

To load Excel files with multiple sheets, you first need to install a package. For R, click
"Packages" and "Install package(s)" to install XLConnect. For RStudio, go to "Tools" in the
ribbon, choose "Install Packages" and enter XLConnect, or use

> install.packages("XLConnect")

(Library packages only need to be installed once. Later on we will install more libraries for
performing special statistical calculations.) Now you can load the library to use it:

> library(XLConnect)

Then you can use readWorksheetFromFile to enter data.

> mydata <- readWorksheetFromFile("Chapter 1.xlsx", sheet = 1,


startRow = 4, endCol = 2)

The following is an example that reads three sets of data from the same Excel file:

> mydata <- readWorksheetFromFile("yyy.xlsx",


sheet = c("Sheet 1", "Sheet 1", "Sheet 2"),
header = TRUE,
startRow = c(2,2,3),
startCol = c(2,5,2),
endRow = c(9,15,153),
endCol = c(6,8,6))


Chapter 2 Regularization

Learning Objectives

Fit ridge, Lasso, PC and PLS regressions

In this chapter we illustrate how ridge, Lasso, principal component and partial least squares
regressions can be fitted using R. The libraries needed are glmnet (Lasso and Elastic-Net
Regularized GLMs) for ridge and Lasso regressions, and pls (Partial Least Squares and
Principal Component Regression) for principal component and partial least squares regressions.
Go to "Packages" and "Install package(s)" in the ribbon to install them. Then load them before use:
> library(glmnet)
> library(pls)

The dataset we use is the credit data set saved in Chapter 4.txt.
> credit <- read.table("Chapter 4.txt", header = T)
> fix(credit)
> dim(credit)
[1] 400 12

The full documentations of the two packages can be obtained in


https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
and
https://cran.r-project.org/web/packages/pls/pls.pdf

1 L2 and L1 Regularization

The glmnet function does not use the y ~ x syntax as in the lm function. We have to specify the
response and the predictors directly. Also, x needs to be a matrix of predictors and y a vector of
responses, not data frames. So

> x <- as.matrix(credit[-c(12)]) # change data frame to matrix


> y <- credit$Balance # this is a vector
> grid <- c(10000, 5000, 1000, 500, 100, 50, 10, 1, 0)
> ridgeunstd.mod <- glmnet(x, y, alpha = 0, lambda = grid, standardize = F)


In the above, alpha = 0 means ridge regression, and the data are not standardized in the
fitting process. By default, the data are automatically standardized.
After the code above is run, associated with each value of λ is a vector of ridge regression
coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a 12 × 9 matrix,
with 12 rows (one for each predictor, plus an intercept) and 9 columns (one for each value of λ).
If you wish to supply more than one value for lambda (just as above), make sure that you supply
a decreasing sequence of values.

You can use the following to extract the coefficients:

> print(coef(ridgeunstd.mod))

The L2 norm of the coefficients is decreasing in λ. For example, we can compute the norm
for λ = 1000 and λ = 10, respectively:

> sqrt(sum(coef(ridgeunstd.mod)[-1,3]^2))
[1] 19.44854
> sqrt(sum(coef(ridgeunstd.mod)[-1,7]^2))
[1] 341.5019

After fitting the model, we can do prediction. Suppose we predict the y based on the x of the first
200 observations:


> newy <- predict(ridgeunstd.mod, s = 50, x[1:200,])


> yout <- cbind(newy, y[1:200])
> fix(yout)

The following is a snapshot of what you will get, together with the actual value of y:

Now let us consider fitting a model with standardized x and y using the first 200 observations,
and then use the fitted model to predict the y for the remaining 200 observations. We can then
compute an estimate of the test MSE (under the holdout method).

> ridge.mod <- glmnet(x[1:200,], y[1:200], alpha = 0,


lambda = 50, standardize = T)
> newy <- predict(ridge.mod, s = 50, x[201:400,])
> mean((newy - y[201:400])^2)
[1] 14556.73

> yout <- cbind(newy, y[201:400])


> fix(yout)


To determine the best value of lambda we need to perform cross-validation to compute the test
MSE for a large range of λ using the training data set. (Note: Here we treat λ as a tuning
parameter and hence it should be estimated from the training data set.) The function cv.glmnet()
performs 10-fold cross-validation to estimate the test MSE (you can use nfolds = n to change
the default 10 to other numbers).

> cv.out<-cv.glmnet(x[1:200,], y[1:200], alpha = 0)


> plot(cv.out)
> cv.out$lambda.min

It is interesting to see that for this data set, the test MSE seems to be monotonically increasing in
λ! In fact the optimal λ seems to be the smallest λ being tested. Let us use a finer grid
search for small values of λ:

> cv.out<-cv.glmnet(x, y, alpha = 0, lambda = seq(0,40, length = 1000))


> plot(cv.out)
> cv.out$lambda.min
[1] 0.1601602

Indeed there is a minimum, but the U shape in the test MSE curve is not pronounced at all. (Note:
The optimal λ depends on how the folds are drawn, and hence you may get a different result if you
run the code!)


Lasso regression can be done similarly, by using alpha = 1. Let us try fitting one model using all data:

> ridge.mod <- glmnet(x, y, alpha = 1, lambda = 50, standardize = T)


> print(coef(ridge.mod))

12 x 1 sparse Matrix of class "dgCMatrix"


s0
(Intercept) -291.10069343
Income -1.00638791
Limit 0.05108697
Rating 1.66528206
Cards .
Age .
Education .
Female .
Student 236.20776793
Married .
Asian .
Caucasian .

So only 4 predictors are retained.


> cv.out = cv.glmnet(x, y, alpha = 1)
> plot(cv.out)
> cv.out$lambda.min
[1] 0.7784687
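Once the best λ has been found, you can predict from the cv.glmnet object directly at that value, or extract the corresponding coefficients. A minimal sketch (the choice of the first five rows of x is just for illustration):

> lasso.pred <- predict(cv.out, s = cv.out$lambda.min, newx = x[1:5, ])
> coef(cv.out, s = cv.out$lambda.min) # coefficients at the CV-selected lambda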

2 PCR and PLSR

Principal Component Regression

PCR can be done in R very easily by using the pcr function in the pls library. The syntax of
pcr is similar to that of lm, with the additional option scale = T or F to determine whether the
data are standardized prior to the generation of the principal components. Also, setting
validation = "CV" causes R to compute the ten-fold CV error for each possible value of M. After
fitting the model, the resulting fit can be examined using summary(). You can also use
validation = "LOO" to compute the LOOCV estimate of the test MSE.

> multo <- read.table("PCR.txt", header = T)


> fix(multo)
> pcr.fit = pcr(y ~ x1 + x2 + x3, data = multo, scale = T, validation = "CV")
> summary(pcr.fit)

Data: X dimension: 18 3
Y dimension: 18 1
Fit method: svdpc
Number of components considered: 3

VALIDATION: RMSEP
Cross-validated using 10 random segments.


(Intercept) 1 comps 2 comps 3 comps


CV 11.19 1.491 1.432 1.363
adjCV 11.19 1.317 1.419 1.341

TRAINING: % variance explained


1 comps 2 comps 3 comps
X 66.29 99.63 100.00
y 98.71 98.72 99.16

The CV scores above are root mean squared errors of prediction (RMSEP). (The adjusted CV is a
bias-corrected CV estimate.) In this example, using more principal components results in a better
fit and a smaller CV error. We can plot the cross-validated MSE:
> validationplot(pcr.fit, val.type = "MSEP") # plot the MSE

From the plot it is evident that using the first component is enough. Actually, while using only 1
PC explains only 66.29% of the variance of the predictors, 98.71% of the variance of y is already
explained!

After determining how many components to use based on the CV result, we can fit the final
model with the number of components specified by ncomp = n:

> pcr.fit <- pcr(y ~ x1 + x2 + x3, data = multo, scale = T, ncomp = 1)


> coef(pcr.fit) # extract the regression coefficients
, , 1 comps

y
x1 5.4149147
x2 5.4199296
x3 0.2082247

> pcr.pred <- predict(pcr.fit, multo[, -4], ncomp = 1)


> mean((pcr.pred - multo$y)^2)
[1] 1.445569

The example used in the lecture notes is mainly used to illustrate multi-collinearity. To illustrate
another aspect of PCR, let us use the data set called gasoline in the pls package.

> data(gasoline)
> names(gasoline)
[1] "octane" "NIR"

This is a data set with NIR (near infrared spectroscopy) spectra and octane numbers of 60
gasoline samples. Each observation has an octane number and a record of the NIR spectra
measurement (log(1/R), where R is the diffuse reflectance at a particular wavelength) from 900 nm
to 1700 nm (in 2 nm intervals, giving 401 wavelengths). So p = 401, n = 60, and dimension
reduction is needed. Our goal is to predict the octane number using the 401 predictors.
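You can verify these dimensions directly; the NIR spectra are stored as a matrix inside the data frame:

> dim(gasoline$NIR)
[1]  60 401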

Let us decompose the data set into train and test data sets:

> gasTrain <- gasoline[1:45, ]


> gasTest <- gasoline[46:60, ]

Then we fit a PCR model. Let us look at 10 principal components.

> gas1.fit <- pcr(octane ~ NIR, data = gasTrain, ncomp = 10, validation =
"CV")
> summary(gas1.fit)
Data: X dimension: 45 401
Y dimension: 45 1
Fit method: svdpc
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 1.568 1.552 1.359 0.2605 0.2707 0.2722 0.2777
adjCV 1.568 1.547 1.280 0.2565 0.2677 0.2697 0.2757
7 comps 8 comps 9 comps 10 comps
CV 0.2459 0.2197 0.2058 0.2077
adjCV 0.2413 0.2168 0.2008 0.2028

TRAINING: % variance explained


1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 82.01 88.46 94.44 96.62 97.96 98.58 98.92
octane 12.84 56.45 97.67 97.68 97.71 97.91 98.60
8 comps 9 comps 10 comps
X 99.17 99.34 99.46
octane 98.96 99.17 99.19

> validationplot(gas1.fit, val.type = "MSEP")


Three principal components seem to be enough.

Then we can fit a model with M  3 using the training data set and then final estimate the test
MSE for the model using the test data set.

> gas1.pred <- predict(gas1.fit, gasTest, ncomp = 3)


> gas1.pred
, , 3 comps

octane
46 88.43762
47 88.17775
48 88.89864
49 88.29018
50 88.52651
51 88.10377
52 87.37123
53 88.32798
54 85.03583
55 85.32728
56 84.59276
57 87.57041
58 86.93111
59 89.25013
60 87.11733

> sqrt(mean((gas1.pred-gasTest$octane)^2)) # Root MSE


[1] 0.202396

Actually, an even quicker way is to use the following:


> RMSEP(gas1.fit, newdata = gasTest)
(Intercept) 1 comps 2 comps 3 comps 4 comps
1.4785 1.1910 1.7471 0.2024 0.1989
5 comps 6 comps 7 comps 8 comps 9 comps
0.2076 0.2882 0.2835 0.3324 0.2912
10 comps
0.2679


Partial Least Squares Regression

Partial least squares regression can be fitted using plsr, which has the same syntax as pcr.

> plsr.fit <- plsr(y ~ x1 + x2 + x3, data = multo, scale=T, validation = "CV")
> summary(plsr.fit)

Data: X dimension: 18 3
Y dimension: 18 1
Fit method: kernelpls
Number of components considered: 3

VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps
CV 11.19 1.913 1.483 1.322
adjCV 11.19 1.790 1.465 1.302

TRAINING: % variance explained


1 comps 2 comps 3 comps
X 66.28 99.24 100.00
y 98.72 98.74 99.16

> validationplot(plsr.fit, val.type = "MSEP") # plot the MSE

After determining how many components to use based on the CV result, we can fit the final
model with the number of components specified by ncomp = n:

> plsr.fit <- plsr(y ~ x1 + x2 + x3, data = multo, scale = T, ncomp = 1)


> coef(plsr.fit) # extract the regression coefficients

, , 1 comps

y
x1 5.3917515
x2 5.4466237
x3 0.1349532

> plsr.pred = predict(plsr.fit, multo[, -4], ncomp = 1)


> mean((plsr.pred - multo$y)^2)
[1] 1.425637

The PLSR with M  1 is practically the same as the PCR with M  1 in view of the MSE and the
fitted coefficients.

Then let us move on to the gasoline data set.

> gas1.fit <- plsr(octane ~ NIR, data = gasTrain, ncomp = 10, validation =
"CV")
> summary(gas1.fit)


Data: X dimension: 45 401


Y dimension: 45 1
Fit method: kernelpls
Number of components considered: 10

VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 1.568 1.397 0.3002 0.2364 0.2111 0.1959 0.1924
adjCV 1.568 1.398 0.2875 0.2376 0.2034 0.1923 0.1889
7 comps 8 comps 9 comps 10 comps
CV 0.1902 0.2032 0.2530 0.2796
adjCV 0.1874 0.1999 0.2464 0.2702

TRAINING: % variance explained


1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 79.23 88.25 94.18 95.03 96.50 97.84 98.77
octane 27.90 97.75 98.04 98.99 99.16 99.23 99.28
8 comps 9 comps 10 comps
X 99.07 99.22 99.36
octane 99.34 99.44 99.54

Two PLS components seem to be enough.

Then we can fit a model with M = 2 using the training data set and then estimate the test
MSE for the model using the test data set.

> gas1.pred <- predict(gas1.fit, gasTest, ncomp = 2)


> gas1.pred


, , 2 comps

octane
46 88.35589
47 88.15756
48 88.78035
49 88.24472
50 88.43657
51 87.98744
52 87.32389
53 88.24144
54 84.90116
55 85.28262
56 84.60781
57 87.36257
58 86.81301
59 89.10257
60 86.97763

> sqrt(mean((gas1.pred-gasTest$octane)^2)) # Root MSE


[1] 0.2111421

This “simpler” model is comparable to the PCR model with 3 principal components.


Chapter 3 Non-linear Regression Models

Learning Objectives

KNN, tree-based models, ensemble methods for tree models

In this chapter we deal with non-linear regression and classification models. We are going to
discuss KNN in Section 1, and classification and regression trees in Section 2. In Section 3 we
deal with the random forest and boosting for CARTs.

1 KNN Classification and Regression

First you need to install the libraries e1071 and caret. (caret is short for Classification And
Regression Training.) caret is an extremely useful library and it serves as an interface to control
many libraries that conduct regression or classification.

Classification

We are going to use the wine data stored in wine.csv as illustration for KNN classification.

> wine <- read.csv("wine.csv", header = T)


> str(wine)

'data.frame': 178 obs. of 14 variables:


$ Type : int 1 1 1 1 1 1 1 1 1 1 ...
$ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
$ Malic.acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
$ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
$ Alkalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
$ Mg : int 127 100 101 113 118 112 96 121 97 98 ...
$ Phenol : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
$ Falvanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
$ Nonfavonoids.phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
$ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
$ Color.intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
$ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
$ OD280.OD315 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
$ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The data set has n  178. The first column named Type is the class of wine (and can take the
value 1, 2, and 3) and is the y of the data set. The remaining 13 variables are the predictors.


Noting that the scales of the predictors differ vastly, standardization is needed before conducting
KNN classification.

To perform a KNN classification, the response y cannot be numerical. So let us create the
corresponding categorical version (“factor” in R) from the numeric version:

> wine$Type <- factor(wine$Type)

KNN is different from linear regression models in the sense that the model output is the
prediction itself, not beta coefficients. In fact, the only model output is the prediction. So one
must specify the training data as well as the predictor values at which predictions are required.

In caret, we can randomly split the data into a training data set and a test data set by using the
following code:

> library(caret)
> set.seed(1234)
> train_index <- createDataPartition(y = wine$Type, p = 0.7, list = F)
> training <- wine[train_index,]
> testing <- wine[-train_index,]

I set the seed so that you can replicate my results below. The function createDataPartition
randomly partitions the 178 observations into 2 partitions in the ratio 7 : 3, as specified in the p
argument. The last argument tells R whether to return the selected record indices as a list (T) or
as a matrix (F). The argument y is used so that the training data set has about the same
proportion of each type of y as the original data set.

> dim(training); dim(testing)


[1] 125 14
[1] 53 14

> prop.table(table(training$Type)); prop.table(table(wine$Type))

1 2 3
0.376 0.352 0.272

1 2 3
0.3314607 0.3988764 0.2696629

Now we are ready to fit a KNN classification model:

> trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)


> knn.fit <- train(Type ~., data = training, method = "knn", trControl =
trctrl, preProcess = c("center", "scale"), tuneLength = 20)

Using the function trainControl, we set the method to compute the average error rate in the
training data set in order to determine the best value of the tuning parameter K. Commonly used
methods are cv, repeatedcv, and LOOCV. Cross validation was discussed in Chapter 1 of the
lecture notes. Since the average error rate (or test MSE for regression) computed from cross-
validation depends on how the subsamples are drawn, we can repeat it a number of times to get a
series of average error rates and take a final average; this is called repeated cross-validation.
The two additional arguments are number = 10, which gives 10 folds in each cross-validation, and
repeats = 3, which repeats the cross-validation 3 times.

In the function train, Type ~. is similar to the usual linear regression, y is Type and . means all
predictors are used. (You can also replace the part “Type ~., data = training” with
“training[, -1], training[, 1]”. That is, you can directly enter the predictors and
response.) Finally, tuneLength specifies the number of candidate values of K to be examined when
tuning the KNN by cross-validation.

> knn.fit
k-Nearest Neighbors

125 samples
13 predictor
3 classes: '1', '2', '3'

Pre-processing: centered (13), scaled (13)


Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 112, 111, 114, 113, 112, 112, ...
Resampling results across tuning parameters:

k Accuracy Kappa
5 0.9590326 0.9379247
7 0.9513098 0.9260016
9 0.9451049 0.9165983
11 0.9431208 0.9138134
13 0.9508436 0.9256436
15 0.9561855 0.9337326
17 0.9617799 0.9421534
19 0.9648102 0.9466802
21 0.9673743 0.9505841
23 0.9617799 0.9420737
25 0.9590021 0.9379070
27 0.9594683 0.9386432
29 0.9596820 0.9389060
31 0.9569042 0.9347393
33 0.9569042 0.9347393
35 0.9569042 0.9347393
37 0.9596820 0.9389060
39 0.9596820 0.9389060
41 0.9596820 0.9389060
43 0.9596820 0.9389060

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 21.

The accuracy is 1  error rate as discussed in the lecture notes. The higher it is, the better is the
model’s predictive power. In the final model for prediction, k is selected to be 21.

Finally we can do prediction on our test data set and compute the resulting confusion matrix:

> test_pred <- predict(knn.fit, newdata = testing)


> test_pred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2
[38] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3

> confusionMatrix(test_pred, testing$Type)

Confusion Matrix and Statistics

Reference
Prediction 1 2 3
1 12 2 0
2 0 24 0
3 0 1 14

Overall Statistics

Accuracy : 0.9434
95% CI : (0.8434, 0.9882)
……

The error rate is 3 / 53 (or 1 - 0.9434) ≈ 5.66%. This is a very good fit!

If you want to specify the value of K directly, remove the cross-validation part and also supply the
value of K as follows using tuneGrid. The following is another run (with a different partition
into training and test data sets):

> wine$Type <- factor(wine$Type)


> train_index <- createDataPartition(y = wine$Type, p = 0.7, list = F)
> training <- wine[train_index,]; testing <- wine[-train_index,]
> knn2.fit <- train(Type ~., data = training, method = "knn", trControl =
trainControl(method = "none"), preProcess = c("center", "scale"),
tuneGrid = expand.grid(k=2))
> test_pred2 <- predict(knn2.fit, newdata = testing)
> confusionMatrix(test_pred2, testing$Type)

Confusion Matrix and Statistics

Reference
Prediction 1 2 3
1 17 3 0
2 0 18 0
3 0 0 14

Overall Statistics

Accuracy : 0.9423
95% CI : (0.8405, 0.9879)

……


Regression

We are going to use the same example as in the lecture notes as illustration for KNN regression.

> ex1 <- read.table("ExKNN.txt", header = TRUE)


> set.seed(1234)
> train_index <- createDataPartition(y = ex1$y, p = 0.7, list = F)
> training <- ex1[train_index,]
> testing <- ex1[-train_index,]
> knn.fit <- train(y ~ x, data = training, method = "knn", trControl = trctrl,
preProcess = c("center", "scale"), tuneLength = 10)
> knn.fit

k-Nearest Neighbors

38 samples
1 predictor

Pre-processing: centered (1), scaled (1)


Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 34, 34, 34, 34, 34, 35, ...
Resampling results across tuning parameters:

k RMSE Rsquared MAE


5 6.451475 0.8232691 4.838463
7 6.740841 0.8484769 4.976954
9 7.649581 0.7746582 5.528364
11 8.587700 0.7552668 6.203062
13 9.406253 0.7332156 6.782014
15 10.271306 0.7232533 7.437779
17 10.966694 0.7016144 8.021498
19 11.584828 0.6900777 8.596868
21 12.330446 0.6214820 9.425329
23 12.842971 0.5744659 9.975974

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.

> test_pred <- predict(knn.fit, newdata = testing)


> test_pred
[1] 65.06000 84.58000 90.22000 92.16667 92.16667 97.64000 87.30000 86.51667
[9] 86.20000 86.20000 77.74000 77.74000

> RMSE(test_pred, testing$y)


[1] 7.539977
> plot(test_pred ~ testing$y)


2 Regression and Classification Trees

To fit CART we use the library rpart. To plot beautiful trees I recommend using the library
rpart.plot. If you have installed caret, then rpart should have already been installed.

To use rpart, use the following code:

rpart(formula, data = …, method = …, control = …, parms = …)

where
- formula is in the format y ~ x1 + x2 + … as in lm (interaction terms not allowed!)
- data specifies the data frame
- method = "class" for classification, "anova" for regression
- control = optional parameters for pruning; the defaults are
  minsplit = 20, minbucket = round(minsplit/3), cp = 0.01, maxdepth = 30
  (here minsplit is the minimum number of observations that must exist in a node for a split to
  be attempted, minbucket is the minimum number of observations in any terminal leaf, and cp
  is the cost complexity parameter in the objective function; see the sketch after this list)
- parms is the method of split for classification trees. Use either parms = list(split =
  "gini") for the Gini index or parms = list(split = "information") for cross entropy. The
  default is the Gini index. For regression trees, there is no such parameter.
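As a quick illustration of passing these tuning parameters explicitly, here is a sketch using the built-in mtcars data (which is not one of the notes' data sets) and rpart.control:

> library(rpart)
> toy.fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
                   control = rpart.control(minsplit = 5, cp = 0.005))
> printcp(toy.fit) # inspect the cost complexity table of the fitted tree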

After we have fitted a tree, we can examine the results by using the following functions:


Function What it does


plot(fit) plot decision tree
summary(fit) detailed results of the fit
print(fit) print results of the fit
printcp(fit) display the table for cost complexity parameter

Classification

As a simple illustration, consider the golf example in the lecture notes. Load the data set golf.txt.

> golf <- read.table("golf.txt", header = T)


> golf.fit <- rpart(Golf~., data= golf, method = "class")
> plot(golf.fit)
Error in plot.rpart(golf.fit) : fit is not a tree, just a root

What is happening? In our toy example, the number of observations is smaller than the default
minsplit! So let us change the parameters:

> golf.fit <- rpart(Golf~., data= golf, method = "class", minsplit = 2)


> summary(golf.fit)

Call:
rpart(formula = Golf ~ ., data = golf, method = "class", minsplit = 2)
n= 14

CP nsplit rel error xerror xstd


1 0.30 0 1.0 1.0 0.3585686
2 0.10 2 0.4 1.2 0.3703280
3 0.01 6 0.0 1.4 0.3741657

Variable importance
Outlook Temperature Humidity Windy
30 26 23 20

Node number 1: 14 observations, complexity param=0.3


predicted class=Yes expected loss=0.3571429 P(node) =1
class counts: 5 9
probabilities: 0.357 0.643
left son=2 (10 obs) right son=3 (4 obs)
Primary splits:
Outlook splits as RLL, improve=1.4285710, (0 missing)
Humidity splits as LR, improve=1.2857140, (0 missing)
Windy splits as RL, improve=0.4285714, (0 missing)
Temperature splits as RLR, improve=0.2285714, (0 missing)

Node number 2: 10 observations, complexity param=0.3


predicted class=No expected loss=0.5 P(node) =0.7142857
class counts: 5 5
probabilities: 0.500 0.500
left son=4 (5 obs) right son=5 (5 obs)
Primary splits:
Humidity splits as LR, improve=1.8000000, (0 missing)
Temperature splits as RLR, improve=1.2500000, (0 missing)


Windy splits as RL, improve=0.8333333, (0 missing)


Outlook splits as -RL, improve=0.2000000, (0 missing)
Surrogate splits:
Temperature splits as RLL, agree=0.8, adj=0.6, (0 split)
Outlook splits as -RL, agree=0.6, adj=0.2, (0 split)

(The remaining results for other splits skipped.)


> print(golf.fit)

n= 14
node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 14 5 Yes (0.3571429 0.6428571)


2) Outlook=Rain,Sunny 10 5 No (0.5000000 0.5000000)
4) Humidity=High 5 1 No (0.8000000 0.2000000)
8) Outlook=Sunny 3 0 No (1.0000000 0.0000000) *
9) Outlook=Rain 2 1 No (0.5000000 0.5000000)
18) Windy=Yes 1 0 No (1.0000000 0.0000000) *
19) Windy=No 1 0 Yes (0.0000000 1.0000000) *
5) Humidity=Normal 5 1 Yes (0.2000000 0.8000000)
10) Windy=Yes 2 1 No (0.5000000 0.5000000)
20) Temperature=Cool 1 0 No (1.0000000 0.0000000) *
21) Temperature=Mild 1 0 Yes (0.0000000 1.0000000) *
11) Windy=No 3 0 Yes (0.0000000 1.0000000) *
3) Outlook=Overcast 4 0 Yes (0.0000000 1.0000000) *

This is obviously too hard to understand! Let us plot the tree out:
> plot(golf.fit)

You will then see a figure with no labels but only splits! The plot function in rpart frequently
gives funny results because of its mysterious default margin setting. Without deleting the figure,
let us try to fine tune the tree’s labels and fonts:
> text(golf.fit, use.n=TRUE, all=TRUE, cex=.8)


This looks unprofessional. Hence you need the library that was created just to plot trees!

> library(rpart.plot)
> rpart.plot(golf.fit, type = 3, fallen.leaves = T, extra = 104)

There are so many things that you can tune in this plot function that I better give you a link for
the R website for details:

https://www.rdocumentation.org/packages/rpart.plot/versions/3.0.6/topics/rpart.plot

Finally we can use the tree to make predictions for 2 combinations of predictors:

> newdata <- "Temperature Outlook Humidity Windy


Hot Sunny Normal Yes
Cool Rain High No "
> newdata <- read.table(header = T, text = newdata)
> predict(golf.fit, newdata)

No Yes
1 0.5 0.5
2 0.0 1.0

A probability distribution is reported. In the first case, the response is a 50% No and 50% Yes! If
you want the majority vote but not a probability distribution, you can use the following:


> predict(golf.fit, newdata, type = "class")

In this case we have built the largest tree possible. Now let us use the wine example again to fit a
classification tree.

> wine.fit <- rpart(Type~., data = training, method = "class")


> printcp(wine.fit)

Classification tree:
rpart(formula = Type ~ ., data = training, method = "class")

Variables actually used in tree construction:


[1] Alcohol Color.intensity Falvanoids

Root node error: 76/126 = 0.60317

n= 126

CP nsplit rel error xerror xstd


1 0.460526 0 1.000000 1.00000 0.072259
2 0.434211 1 0.539474 0.75000 0.073513
3 0.013158 2 0.105263 0.25000 0.052853
4 0.010000 3 0.092105 0.22368 0.050459

The tree with   0.01 seems to have too few splits. Let us relax that a bit:

> wine.fit <- rpart(Type~., data = training, method = "class", cp = 0.0001,


minsplit = 10)
> printcp(wine.fit)

Classification tree:
rpart(formula = Type ~ ., data = training, method = "class",
cp = 1e-04, minsplit = 10)

Variables actually used in tree construction:


[1] Color.intensity Falvanoids Proline

Root node error: 76/126 = 0.60317

n= 126

CP nsplit rel error xerror xstd


1 0.460526 0 1.000000 1.00000 0.072259
2 0.434211 1 0.539474 0.67105 0.072496
3 0.052632 2 0.105263 0.21053 0.049176
4 0.000100 3 0.052632 0.17105 0.044927

The number of splits does not increase. So we use the tree with 3 splits to do prediction. The tree looks like this:

> rpart.plot(wine.fit, type = 3, fallen.leaves = T)


> wine_predict <- predict(wine.fit, testing, type = "class")

> table(wine_predict, testing$Type)

wine_predict 1 2 3
1 14 2 0
2 3 19 0
3 0 0 14

By using only three predictors, we reach an error rate of 5 / (178 - 126) ≈ 9.6%.

Regression

To illustrate regression trees and pruning, let us use the data set cu.summary that comes with the
rpart library. The data set contains 117 observations, with Mileage as the response, and Price, Country,
Reliability, and Type as predictors. While I use all the data for fitting, many observations have
missing values and are deleted before fitting.

> cu <- cu.summary


> fix(cu)
> cu.fit <- rpart(Mileage ~. , data = cu, method = "anova")
> printcp(cu.fit)

Regression tree:
rpart(formula = Mileage ~ ., data = cu, method = "anova")

Variables actually used in tree construction:


[1] Price Type

Root node error: 1354.6/60 = 22.576

n=60 (57 observations deleted due to missingness)


CP nsplit rel error xerror xstd


1 0.622885 0 1.00000 1.02732 0.173314
2 0.132061 1 0.37711 0.52773 0.100274
3 0.025441 2 0.24505 0.36893 0.080618
4 0.011604 3 0.21961 0.40335 0.081832
5 0.010000 4 0.20801 0.40649 0.079706

The tree with   0.011604 gives the smallest cross-validated test MSE. (Another rule is to
choose the simplest tree with rel error + xstd < xerror. This is why rpart reports these three
values. Rel error is related to the training MSE and it always go down as the number of splits
increases.)
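If you prefer to pick the cp value programmatically rather than reading it off the printed table, the same numbers are stored in the cptable component of the fitted object. A minimal sketch (note that the minimum-xerror rule picks the 2-split tree here):

> cptab <- cu.fit$cptable # the matrix displayed by printcp(cu.fit)
> cptab[which.min(cptab[, "xerror"]), "CP"] # cp of the tree with the smallest xerror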

Now we generate the best tree:

> prunedtree <- prune(cu.fit, cp = 0.011605)


> rpart.plot(prunedtree,type = 3, fallen.leaves = T, digits=5)


3 Random Forests and Boosting for CART

Random Forests

The library randomForest in R performs random forest very easily.

randomForest(formula, data  , subset, ntree  , mtry  , importance  F)

Here,
 mtry is the m in the lecture notes. The default setting in this library is m  p for
classification trees and p / 3 for regression trees.
 ntree is the number of trees to grow (the B in the lecture notes). Default is 500.
 subset is an index vector indicating which rows should be used as the training data set.
 importance (default is F): should the importance of predictors be assessed? (If you use T,
then later on you would also call out the variable importance plot (VIP) by using either
importance(output) or varImpPlot(output)).

First let us illustrate bagging (mtry  p) on the wine data.

> library(randomForest)
> wine <- read.csv("wine.csv", header = T)
> wine$Type <- factor(wine$Type)
> train_index <- createDataPartition(y = wine$Type, p = 0.7, list = F)
> training <- wine[train_index,]; testing <- wine[-train_index,]
> winebag <- randomForest(Type~., data = wine, subset = train_index,
mtry = 13, importance = T)

(Note: you can also use


> winebag <- randomForest(Type~., data = training, mtry = 13, importance = T)

if you want to avoid entering the training index.)


> winebag
Call:
randomForest(formula = Type ~ ., data = wine, mtry = 13,
importance = T, subset = train_index)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 13

OOB estimate of error rate: 5.56%


Confusion matrix:
1 2 3 class.error
1 40 2 0 0.04761905
2 1 47 2 0.06000000
3 0 2 32 0.05882353


> varImpPlot(winebag)

After running the above, we use the predict function to compute the predictions.

> winebagpred <- predict(winebag, newdata = testing)

Now let us run a genuine random forest with the default mtry:

> winerf <- randomForest(Type~., data = wine, subset = train_index,
                         importance = T)
> winerf

Call:
randomForest(formula = Type ~ ., data = wine, importance = T, subset =
train_index)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3

OOB estimate of error rate: 2.38%


Confusion matrix:
1 2 3 class.error
1 42 0 0 0.00
2 1 47 2 0.06
3 0 0 34 0.00

So you can see the OOB error rate for the random forest (2.38%) is less than half that of bagging (5.56%)! The VIP is again drawn with varImpPlot(winerf).
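(Similarly, as a sketch, the two fits can also be compared on the held-out test set; winerfpred is just an illustrative name.)

> winerfpred <- predict(winerf, newdata = testing)
> table(winerfpred, testing$Type)      # test confusion matrix for the random forest
> mean(winerfpred != testing$Type)     # compare with mean(winebagpred != testing$Type)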


Boosting

For boosting, you need to install the library gbm. (GBM stands for gradient boosting machine.)
Also, you need to enter the three parameters for the number of trees, the depth of the trees, and
the shrinkage parameter.

gbm(formula, data = , distribution = , n.trees = 100, interaction.depth = 1,
    shrinkage = 0.1)

The above is self-explanatory. 100, 1 and 0.1 are the default values and you can omit these
arguments if you wish to use the defaults. For distribution, use "bernoulli" for binary
classification trees and "gaussian" for regression trees.

Let us use the cu.summary data set again as an example. Before we plug in the data set, we need
to remove all observations with missing values, because gbm does not do that automatically.

> cu <- cu.summary[complete.cases(cu.summary),]
> cubt <- gbm(Mileage ~ ., data = cu, distribution = "gaussian", n.trees = 1000,
              interaction.depth = 1, shrinkage = 0.2)

The summary function produces a (relative) VIP:


> summary(cubt)
var rel.inf
Type Type 36.434368
Price Price 27.792783
Country Country 26.986571
Reliability Reliability 8.786278

As in the regression tree obtained by cost complexity pruning, Type and Price are the two most
important predictors. Now we compute the training MSE by first obtaining all predicted values:

> cubt.pred <- predict(cubt, newdata = cu[, -4], n.trees = 1000)
> mean((cubt.pred - cu[,4])^2)
[1] 3.291057

Then let us repeat with a smaller shrinkage parameter and a large number of trees:

> cubt <- gbm(Mileage ~., data = cu, distribution = "gaussian", n.trees =
10000, interaction.depth = 1, shrinkage = 0.05)
> cubt.pred <- predict(cubt, newdata = cu[, -4], n.trees = 10000)
> mean((cubt.pred-cu[,4])^2)
[1] 2.435577

So, learning slowly leads to a better fit, because the training MSE is smaller. (In this example we
are not looking at the test MSE, though, because the data set is quite small.)
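(If you do want some protection against overfitting even on this small data set, gbm can run internal cross-validation. The following is only a sketch; cv.folds = 5 is an arbitrary choice and best.iter is an illustrative name.)

> cubt.cv <- gbm(Mileage ~ ., data = cu, distribution = "gaussian",
                 n.trees = 10000, interaction.depth = 1, shrinkage = 0.05,
                 cv.folds = 5)
> best.iter <- gbm.perf(cubt.cv, method = "cv")   # CV estimate of the best number of trees
> cubt.pred <- predict(cubt.cv, newdata = cu[, -4], n.trees = best.iter)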


Chapter 4 Unsupervised Learning

Learning Objectives

Principal component analysis, HAC and K-means clustering

In this chapter we look at PCA and clustering. We will use R to analyze the default dataset
USArrests in the basic R package. You can use

> print(USArrests)

to view the whole data set. The first “column” contains the row names, which are not treated as
data in the data frame. Here is how we can generate a scatterplot matrix of the four variables:
> pairs(USArrests, main = "Basic Scatter Plot Matrix")


1 Principal Component Analysis

There are two functions that perform PCA in the basic R package: prcomp() and princomp().
Since the former employs a numerically more stable method for the eigenvalue and eigenvector
calculation (singular value decomposition), here I am going to introduce only prcomp().

The function is used like this:

prcomp(dataset, center = T / F, scale = T / F)

[Note: If your dataset contains multiple columns and you only wish to conduct PCA on a subset
of columns, you can use the formula interface of prcomp():
prcomp(formula = ~ col1 + col2 + … , data = name_of_dataset, center = T / F,
scale = T / F)

where col1, col2 are the names of the columns. ]

The default for the optional argument center (for mean centering the data) is T, while the
default for the optional argument scale (for scaling the data) is F. See page 8 and page 9 for
what these mean in R. The list returned by prcomp contains five outputs:
sdev, rotation, center, scale, x

•  sdev is the vector of standard deviations of the principal components (their squares are the
   eigenvalues),
•  rotation is the matrix whose columns are the eigenvectors (loadings) of the principal
   components,
•  center and scale are the vectors of means and standard deviations of each column of data,
•  x is the matrix of resulting z scores computed from the data and the eigenvectors.

After running prcomp, we can use summary(output) to investigate the cumulative proportion of
variance explained, and also use plot(output) to get the scree plot.

Consider the following run:

> pc.out <- prcomp(USArrests, scale = T)
> summary(pc.out)

Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion 0.6201 0.8675 0.95664 1.00000

> plot(pc.out)


Since the scree plot is usually viewed as a line chart, we can use the following:

> plot(pc.out, type = "line")
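As a cross-check (a sketch, not in the original notes), the proportions of variance explained reported by summary() can be recomputed directly from sdev:

> pve <- pc.out$sdev^2 / sum(pc.out$sdev^2)   # variances of the principal components
> cumsum(pve)                                 # matches the "Cumulative Proportion" row above
> plot(cumsum(pve), type = "b", ylim = c(0, 1),
       xlab = "Principal component", ylab = "Cumulative PVE")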

To print out the eigenvectors, we use

> pc.out$rotation
              PC1        PC2        PC3         PC4
Murder -0.5358995 0.4181809 -0.3412327 0.64922780
Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
Rape -0.5434321 -0.1673186 0.8177779 0.08902432

We can call out the z scores easily. If we want to look at the z scores for the first 2 components,
we can use


> pc.out$x[, 1:2]
                    PC1         PC2
Alabama -0.97566045 1.12200121
Alaska -1.93053788 1.06242692
Arizona -1.74544285 -0.73845954
Arkansas 0.13999894 1.10854226
California -2.49861285 -1.52742672
Colorado -1.49934074 -0.97762966
Connecticut 1.34499236 -1.07798362

… (skip the rest)

The scatterplot matrix of the z scores looks like this:

> pairs(pc.out$x, main = "Scatter Plot Matrix of z scores")

We can use biplot() to generate the biplot for any two components (e.g. PC1 and PC2):

> biplot(pc.out, choices = c(1,2))


You may not like that the arrows all point in the negative directions. Let us flip them:

> pc.out$rotation[c(1:8)] = -pc.out$rotation[c(1:8)]
> biplot(pc.out, choices = c(1,2))
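Note that the line above flips only the loadings; the scores in pc.out$x are left unchanged. If you want the flipped biplot to stay internally consistent (a suggestion rather than part of the original notes), flip the corresponding scores as well before calling biplot():

> pc.out$x[, 1:2] <- -pc.out$x[, 1:2]   # flip the PC1 and PC2 scores to match the flipped loadings
> biplot(pc.out, choices = c(1,2))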


2 Hierarchical Agglomerative Clustering

In this section, we use the function agnes() (not the function hclust() specified in the Exam
SRM textbook) in the R library cluster to do HAC. Agnes is the acronym for Agglomerative
Nesting. I personally prefer agnes() because it directly accepts the data as the input, while for
hclust() you must first change the data into a dissimilarity matrix using the function dist(x).

agnes(x, diss = T/F, metric = "euclidean", stand = FALSE, method = "average")

•  x is the data set if diss is F, and the dissimilarity matrix if diss is T.
•  The metric can be "euclidean" for Euclidean distance (default) or "manhattan" for city block
   distance. If x is a dissimilarity matrix, this input is ignored.
•  stand = T / F according to whether the data is / is not to be standardized before performing
   clustering. Again, if x is a dissimilarity matrix, this input is ignored.
•  method can be any of the following 6 types: complete, average, single, ward, weighted
   average linkage, flexible (the last 3 are not covered in this course). The default is average.

You can apply the following functions to the output of agnes() to get the dendrogram and also
cut it:
•  pltree(): plot the dendrogram
•  cutree(): cut the tree at a particular height (or into a given number of clusters) and return
   the cluster label for each observation

> library(cluster)
> agg.out<-agnes(x = USArrests, diss = F, stand = T, method = "complete")
> pltree(agg.out, main = "Dendrogram of USArrests, complete linkage")


> cut.4 <- cutree(agg.out, k = 4)
> cut.4
 [1] 1 2 2 3 2 2 3 4 2 1 4 3 2 4 3 4 3 1 3 2 4 2 3 1 4 3 3 2 3 4 2 2 1 3 4 4 4
[38] 4 4 1 3 1 2 4 3 3 4 3 3 3

This is quite hard to visualize, so let us do the following to show the names of the states (which
are the row labels of the data frame object USArrests).

> sapply(unique(cut.4),function(g)rownames(USArrests)[cut.4 == g])

[[1]]
[1] "Alabama" "Georgia" "Louisiana" "Mississippi"
[5] "North Carolina" "South Carolina" "Tennessee"

[[2]]
[1] "Alaska" "Arizona" "California" "Colorado" "Florida"
[6] "Illinois" "Maryland" "Michigan" "Nevada" "New Mexico"
[11] "New York" "Texas"

[[3]]
[1] "Arkansas" "Connecticut" "Idaho" "Iowa"
[5] "Kentucky" "Maine" "Minnesota" "Montana"
[9] "Nebraska" "New Hampshire" "North Dakota" "South Dakota"
[13] "Vermont" "Virginia" "West Virginia" "Wisconsin"
[17] "Wyoming"

[[4]]
[1] "Delaware" "Hawaii" "Indiana" "Kansas"
[5] "Massachusetts" "Missouri" "New Jersey" "Ohio"
[9] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
[13] "Utah" "Washington"

To explain the code above: the function unique(x) returns the elements of x with duplicates
removed, so unique(cut.4) returns the 4 values 1, 2, 3 and 4. The function sapply then loops the
function in its second argument, rownames(USArrests)[cut.4 == g], over the values in its first
argument, setting g to each value in turn, and returns the vectors of outputs. Recall that
cut.4 == g returns a vector of T and F for each g. For example, cut.4 == 1 returns T, F (× 8), T,
F (× 7), T, F (× 5), T, F (× 8), T, F (× 6), T, F, T, F (× 8). Then rownames(USArrests)[cut.4 == 1]
reports the row names of the rows with T.

If you are still not satisfied, you can even do the tree cutting graphically. The code is a bit
harder, though…

> pltree(agg.out, cex = 0.6, hang = -0.1,
         main = "Dendrogram for USArrests, complete linkage")
> rect.hclust(agg.out, k = 4, border = 2:5)

In the above, a dendrogram with state labels of font size 0.6 is drawn and a title heading is
added. The parameter setting hang = -0.1 causes the state labels to hang down from height 0.
Then 4 rectangles that correspond to the 4 clusters are drawn. The parameter border = 2:5
selects the colours of the borders of the 4 rectangles (colours numbered 2 to 5). The general
format of the function rect.hclust() is as follows:
rect.hclust(tree, k = NULL, h = NULL, border = 2)

The most important parameters are k and h: you can cut the dendrogram so that either exactly
k clusters are produced, or the cut is made at height h.
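For reference, here is a sketch of the same analysis with hclust() from the stats package (the function used in the Exam SRM textbook). Note that agnes(stand = T) standardizes by the mean absolute deviation while scale() uses the standard deviation, and the cluster numbers need not match, so the comparison is only indicative:

> d <- dist(scale(USArrests))                 # Euclidean dissimilarity matrix
> hc.out <- hclust(d, method = "complete")
> plot(hc.out, cex = 0.6, hang = -0.1)
> table(cutree(hc.out, k = 4), cut.4)         # compare with the agnes clusters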

3 K Means Clustering

K means clustering can be done using the function kmeans() in the R package stats.

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)

•  x is the data set
•  centers is the number of clusters
•  iter.max is the maximum number of iterations allowed.


•  nstart is the number of initial random configurations. I personally use 25 or 50.
•  algorithm specifies how the initialization and the updating of the centroids are carried out.
   We have discussed MacQueen and Forgy initialization in the lecture notes. The default
   algorithm is Hartigan-Wong, which employs a time-saving way of checking for the closest
   cluster. I did not discuss this in class.

The following objects are the output of the kmeans() function:

cluster, centers, totss, withinss, tot.withinss, betweenss, size, iter, ifault

•  cluster is the main result. It is a series of numbers identifying the cluster to which each
   observation is assigned.
•  centers gives the centroid of each cluster.
•  totss gives the total sum of squares without any clustering.
•  withinss gives the individual within-cluster sums of squares (SSEi).
•  tot.withinss gives the sum of the individual within-cluster sums of squares (SSE).
•  betweenss is totss − tot.withinss.

Let us perform a 4-means clustering on the USArrests data set:

> k4.out <- kmeans(USArrests, 4, nstart = 25)
> k4.out
K-means clustering with 4 clusters of sizes 14, 10, 10, 16

Cluster means:
Murder Assault UrbanPop Rape
1 8.214286 173.2857 70.64286 22.84286
2 5.590000 112.4000 65.60000 17.27000
3 2.950000 62.7000 53.90000 11.51000
4 11.812500 272.5625 68.31250 28.37500

Clustering vector:
Alabama Alaska Arizona Arkansas California
4 4 4 1 4
Colorado Connecticut Delaware Florida Georgia
1 2 4 4 1
Hawaii Idaho Illinois Indiana Iowa
3 2 4 2 3
Kansas Kentucky Louisiana Maine Maryland
2 2 4 3 4
Massachusetts Michigan Minnesota Mississippi Missouri
1 4 3 4 1
Montana Nebraska Nevada New Hampshire New Jersey
2 2 4 3 1
New Mexico New York North Carolina North Dakota Ohio
4 4 4 3 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
1 1 2 1 4
South Dakota Tennessee Texas Utah Vermont
3 1 1 2 3
Virginia Washington West Virginia Wisconsin Wyoming
1 1 3 3 1


Within cluster sum of squares by cluster:
[1]  9136.643  1480.210  4547.914 19563.863
 (between_SS / total_SS =  90.2 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
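(These components can be accessed with the $ operator; for example, the printed ratio can be reproduced as follows.)

> k4.out$betweenss / k4.out$totss   # about 0.902, the 90.2% printed above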

Does the HAC with the dendrogram cut at 4 clusters and the 4-means clustering give the same
result? Let us check:

> table(cut.4, k4.out$cluster)

cut.4 1 2 3 4
1 2 0 0 5
2 2 0 0 10
3 3 5 9 0
4 7 5 1 1

So the cluster assignments are different! While the majority of the elements in HAC’s cluster 1
are in 4-means clustering’s cluster 4, 2 of them are in 4-means clustering’s cluster 1. The elements
in HAC’s cluster 3 are spread over clusters 1, 2, and 3 of the 4-means clustering.

You may wonder if this is because the data is not standardized. So let us try one more time:

> k4.out <- kmeans(scale(USArrests), 4, nstart = 25)
> k4.out
K-means clustering with 4 clusters of sizes 8, 13, 13, 16

Cluster means:
Murder Assault UrbanPop Rape
1 1.4118898 0.8743346 -0.8145211 0.01927104
2 0.6950701 1.0394414 0.7226370 1.27693964
3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
4 -0.4894375 -0.3826001 0.5758298 -0.26165379

Clustering vector:
Alabama Alaska Arizona Arkansas California
1 2 2 1 2
Colorado Connecticut Delaware Florida Georgia
2 4 4 2 1
Hawaii Idaho Illinois Indiana Iowa
4 3 2 4 3
Kansas Kentucky Louisiana Maine Maryland
4 3 1 3 2
Massachusetts Michigan Minnesota Mississippi Missouri
4 2 3 1 2
Montana Nebraska Nevada New Hampshire New Jersey
3 3 2 3 4
New Mexico New York North Carolina North Dakota Ohio
2 2 1 3 4
Oklahoma Oregon Pennsylvania Rhode Island South Carolina


4 4 4 4 1
South Dakota Tennessee Texas Utah Vermont
3 1 2 4 3
Virginia Washington West Virginia Wisconsin Wyoming
4 4 3 3 4

Within cluster sum of squares by cluster:
[1]  8.316061 19.922437 11.952463 16.212213
 (between_SS / total_SS =  71.2 %)

> table(cut.4, k4.out$cluster)

cut.4 1 2 3 4
1 7 0 0 0
2 0 12 0 0
3 1 0 13 3
4 0 1 0 13

So this resulting clustering looks much closer to the HAC result!
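(As an optional sketch that goes slightly beyond the notes, tot.withinss can also be used to compare different numbers of clusters, e.g. with an elbow plot; the object name wss is just for illustration.)

> wss <- sapply(1:10, function(K) kmeans(scale(USArrests), K, nstart = 25)$tot.withinss)
> plot(1:10, wss, type = "b",
       xlab = "Number of clusters K", ylab = "Total within-cluster SS")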
