
WST 212 - Applications in Data Science

University of Pretoria
Faculty of Natural and Agricultural Sciences
Department of Statistics

Revised by Priyanka Nagar and Jocelyn Mazarura, 2023

© Copyright reserved, University of Pretoria
Contents

1 Introduction to R software 3
1.1 Packages, libraries, help and syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Variable assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Logical operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Conditional statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Databases 10
2.1 Introduction to databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Basic Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Overview of SQL procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Specifying columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2.1 Calculated columns . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2.2 Working with Dates in SQL . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Specifying rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3.1 Selecting a subset of rows . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4 Ordering rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.5 Summarising data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.5.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.5.2 Grouping Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 SQL Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Inner joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Outer joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 SQL Subqueries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 Noncorrelated subqueries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.2 In-line views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 Machine Learning 31
3.1 Introduction to machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.2 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.3 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.1.4 Variance-Bias tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.5 The model building process . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Data Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Data Wrangling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.1 K-means clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Chapter 1

Introduction to R software

R software is a free open source programming language for statistical computing and data visualisation. In this part we will learn some basic coding in R and get comfortable with this new language.

1.1 Packages, libraries, help and syntax


Packages and libraries
Since R is open source software, there are certain functions that require additional packages not included in the base R installation. These packages include functions which help simplify coding various operations in R. To use the functionality provided by a package we first need to install the package, and once installed we need to remind R that we would like to use the package (whenever necessary). To install a package in R you need internet access and the correct name of the package.
Please note that R is case sensitive. The function used to install packages is:
install.packages("package name")

To check whether a package still needs to be installed, or whether you have already installed it, use the following function:
require("package name")

When you want to use some of the functions within these packages you need to remind R that you
are using additional packages by specifying:
library("package name")

If the functions above give an error it means either the name is spelled incorrectly or the package
does not exist.
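As a brief sketch of this workflow, using the sqldf package that appears later in these notes (any installed package name works the same way):
install.packages("sqldf")   # run once; requires internet access and the exact package name
library("sqldf")            # run in every new R session before using functions from the package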

Help
The R help function is extremely valuable if you already know which function you need help with.
If you are uncertain about what function to use to perform a specific operation then you will need
to consult either your notes or Google. For example, R help is extremely useful for a call such as:
help("mean")

However, it is not useful when:


help("how to calculate the expected value?")

Syntax
Please note that R is case sensitive, both with function names and when you need to specify certain arguments within a function, which can make coding in R tricky. R is also very sensitive to white space and tabs. When coding, if you press enter and R automatically inserts a tab or white space, do not remove it. The converse also holds: do not include white space or tabs just to make your code look fancy or spaced out. Brackets, and the type of brackets you use, mean different things in R, so make sure you use them correctly. If you are going to use inverted commas, please be consistent; do not change from single to double inverted commas in your code.

1.2 Basics
The most basic commands in R are the arithmetic operators, for example:
2 + 5

## [1] 7
8 - 3

## [1] 5
3*4

## [1] 12
16/2

## [1] 8
3^2

## [1] 9
5%%2

## [1] 1
[Remember that the modulo operator returns the remainder after division. So for 5 mod 2 above the result is 1, as 5 divided by 2 has a quotient of 2 and a remainder of 1.]

1.3 Variable assignment
In the example above we used the operators to calculate a value. If we want to use this value again in another formula, then it is easier to assign the operation to a variable. A variable allows you to store a value or an object of any data type. [Remember to use descriptive variable names.] In R we use the assignment operator, <-, to assign a value/object to a variable. For example:
x <- 4
y <- 3
x + y

## [1] 7
Or you can assign the sum of x+y to another variable:
z <- x + y

Now z will store the sum of x+y. If we want to view the answer we need to print the variable:
print(z)

## [1] 7

1.4 Data type


There are various data types in R, the ones we will work with the most include:
• Decimal values termed as numerics.
• Natural numbers termed as integers.
• Boolean values (TRUE or FALSE) termed as logical.
• Text or string termed as characters.
• Datasets or tables termed as data frames.
• Categorical variables with limited categories termed as factors.
To determine the data type in R we make use of the class() function:
x <- 2
class(x)

## [1] "numeric"
y <- "Some words"
class(y)

## [1] "character"
z <- TRUE
class(z)

## [1] "logical"
To convert data to a different data type we use the as.new_data_type() function in R.
z <- TRUE
as.numeric(z)

## [1] 1
Note that logical values convert to 0 (FALSE) or 1 (TRUE) when converted to numeric.
Factors are a little tricky. We use factors when we have categorical data with a limited number of categories. There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable. A nominal variable is a categorical variable without an implied order, whereas ordinal variables do have a natural ordering. For example, if you consider a variable that has three temperature indicators (low, medium and high), these categories can be ordered. If you have a list of animals in a specific area, for example elephants, donkeys and horses, then these categories cannot be ordered in any natural way. If a is a factor and we print(a) we will see the following result:
print(a)

## [1] M F F F M M
## Levels: F M
class(a)

## [1] "factor"
The levels in the output specify how many categories there are and how the categories are defined.
To determine the categories within a factor variable we can use the levels() function:
levels(a)

## [1] "F" "M"
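For reference, a factor such as a above could be created as follows; both vectors are hypothetical examples, not part of the original notes:
a <- factor(c("M", "F", "F", "F", "M", "M"))   # nominal factor; the levels default to F and M
temp <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)                 # ordinal factor with a natural ordering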

1.5 Logical operators


Logical comparisons work in R, the operators are:
• < less than.
• > greater than.
• <= less than or equal to.
• >= greater than or equal to.
• == equal to.
• != not equal to.
• & the and operator.
• | the or operator.
Note that for an equality comparison a double = sign (==) is used. A logical operator will return a Boolean
value stating whether the comparison is TRUE or FALSE, for example:
2 > 3

## [1] FALSE
4 <= 8

## [1] TRUE
7 != 3

## [1] TRUE
6 == (3+2)

## [1] FALSE
3 < 4 | 7 < 6

## [1] TRUE
5 <= 5 & 2 < 3

## [1] TRUE

1.6 Conditional statements


An “if” statement is a conditional statement in programming that executes an expression or returns a value only if a given condition is true. For example, Jonny can only have a chocolate after dinner if he eats all his vegetables. The general syntax for an if statement is:
if(condition){
...expression...
}

The expression will execute only if the condition evaluates to TRUE. To include an expression for when the condition evaluates to FALSE, an else statement is needed. The syntax is then:
if(condition){
...expression TRUE...
} else{
...expression FALSE...
}

For example:
age <- 24

if(age < 35){

print("You are young.")
} else{
print("You are not so young.")
}

## [1] "You are young."


age <- 50

if(age < 35){


print("You are young.")
} else{
print("You are not so young.")
}

## [1] "You are not so young."


Take note that it is important for the else{ } to start on the same line as the closing bracket of the if statement.

1.7 Functions
In R there are many built-in functions ready to use. Well known functions in R include mean(), sd(), levels(), names(), print(), abs(), etc. There are also additional functions available with add-on packages. Most functions in R require one or more arguments to compute a result. Using the help(function name) function we can see which arguments need to be defined. We can also create our own functions. For example, the code below creates a function called quad_eq which computes the value of the quadratic equation for any set of constants a, b and c and any x value:
quad_eq <- function(a, b, c, x){
f <- a*(x^2) + b*x + c
return(f)
}

quad_eq(2, 2, 2, 1)

## [1] 6
The quad_eq() function takes four arguments, namely a, b, c and x. When we call the function, all these arguments need to be specified in the correct order for the function to execute. For example, on your own, copy the quad_eq function into R and then try calling the function using the statement quad_eq(2,3) and see what happens.
Functions are useful when you need to calculate an expression multiple times for different sets of parameter (argument) values.
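For instance, the same quad_eq() function can be reused for different constants; an illustrative sketch:
quad_eq(1, 0, -4, 2)   # 1*(2^2) + 0*2 - 4 = 0
quad_eq(1, 0, -4, 3)   # 1*(3^2) + 0*3 - 4 = 5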
Some built-in functions that are handy when working with data frames are:

• head() gives us the first 6 rows of the data frame (by default).
• tail() gives us the last 6 rows of the data frame (by default).
• str() gives an overview of the structure of the data frame.
• names() gives the column names of the data frame.
• summary() gives summary statistics for each column (for a numeric column: min, Q1, median, mean, Q3 and max).
• dim() gives the dimensions of the data frame.
• nrow() gives the number of rows in a data frame.
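As a short illustration, these functions can be tried on the built-in mtcars data frame:
head(mtcars)          # first 6 rows
str(mtcars)           # structure: 32 observations of 11 numeric variables
names(mtcars)         # column names
dim(mtcars)           # dimensions: 32 rows and 11 columns
summary(mtcars$mpg)   # summary statistics for a single column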

Chapter 2

Databases

In this part, we will focus on data storage, management and information retrieval. The main aim
of this part is to understand how to use a relational database efficiently.

2.1 Introduction to databases


From the earliest days of computers, storing and manipulating data have been a key application
focus. The first universal Data Base Management System (DBMS) was called the Integrated
Data Store. In the late 1960’s, International Business Machines (IBM) developed the Information
Management System (IMS) DBMS used even today in many major installations. In 1970 a new
data representation framework called the relational data model was proposed. The benefits were
widely recognized, and the use of DBMS’s for managing corporate data became standard practice.
The Structured Query Language (SQL) query language for relational databases, developed as part
of IBM’s System R project, is now the standard query language. Specialized systems have been
developed by numerous organisations for creating data warehouses, consolidating data from several
databases, and for carrying out specialized analysis.
To understand the need for a DBMS, let us consider a scenario: a company has a large collection (say, 500 GB) of data on employees, departments, products, sales, and so on. Several employees
access this data simultaneously. Questions about the data must be answered quickly, changes made
to the data by different users must be applied consistently, and access to certain parts of the data
(e.g., salaries) must be restricted. We can try to manage the data by storing it in operating system
files. This approach has many drawbacks, including the following:
• We probably do not have 500 GB of main memory to hold all the data. We must therefore
store data in a storage device such as a cloud or hard drive and bring relevant parts into
main memory for processing as needed.
• Even if we have 500 GB of main memory, on computer systems with 32-bit addressing we cannot refer directly to more than about 4 GB of data. We have to program some method of identifying all data items.
• We have to write special programs to answer each question a user may want to ask about

the data. These programs are likely to be complex because of the large volume of data to be
searched.
• We must protect the data from inconsistent changes made by different users accessing the
data simultaneously. If applications must address the details of such simultaneous access, this
adds greatly to their complexity.
• We must ensure that data is restored to a consistent state if the system crashes while changes
are being made.
• Operating systems provide only a password mechanism for security. This is not sufficiently
flexible to enforce security policies in which different users have permission to access different
subsets of the data.
By storing data in a DBMS rather than as a collection of operating system files, we can use the
DBMS’s features to manage the data in a robust and efficient manner. Using a DBMS to manage
data has many advantages:
• Data Independence: Application programs should not be exposed to details of data represen-
tation and storage.
• Efficient Data Access: A DBMS utilizes a variety of sophisticated techniques to store and
retrieve data efficiently. This feature is especially important if the data is stored on external
storage devices.
• Data Integrity and Security: If data is always accessed through the DBMS, the DBMS can
enforce integrity constraints. For example, before inserting salary information for an employee,
the DBMS can check that the department budget is not exceeded. Also, it can enforce access
controls that govern what data is visible to different classes of users.
• Data Administration: When several users share the data, centralizing the administration
of data can offer significant improvements. Experienced professionals who understand the
nature of the data being managed, and how different groups of users use it, can be responsible
for organizing the data representation to minimize redundancy and for fine-tuning the storage
of the data to make retrieval efficient.
• Concurrent Access and Crash Recovery: A DBMS schedules concurrent accesses to the data
in such a manner that users can think of the data as being accessed by only one user at a
time. Further, the DBMS protects users from the effects of system failures.
• Reduced Application Development Time: Clearly, the DBMS supports important functions
that are common to many applications accessing data in the DBMS. This, in conjunction
with the high-level interface to the data, facilitates quick application development. DBMS
applications are also likely to be more robust than similar stand-alone applications because
many important tasks are handled by the DBMS (and do not have to be debugged and tested
in the application).
In today’s day and age, with all the buzz around big data, which is largely motivated by the immense rate at which the world creates and stores data, it becomes more and more important to store data in a manner that allows quick querying, sorting and manipulation of the data, so that the data at hand can be analysed effectively and efficiently and its security ensured. These systems (DBMSs) are important links in the creation and management of data.
What is a database? A database is defined as “a structured set of data held in a computer, especially
one that is accessible in various ways”. In other words, a database is simply a way of storing and
organizing information in such a way that it can be easily accessed, managed and updated.
What is a relational database? The word “relational” has nothing to do with relationships. It’s
about relations from relational algebra. In a relational database, each relation is a set of rows.
Each row is a list of attributes, which represents a single item in the database. Each row in a
relation (“table”) shares the same attributes (“columns”) where each attribute has a well-defined
data type, defined ahead of time.
What is Structured Query Language (SQL)? SQL is a programming language used to manipulate
and retrieve data in a relational database.
A large part of analytics and statistical analysis consists of first getting the data into a form which
will allow effective analysis. The two main reasons why data will not already be in a usable format are:
• Large data sets consist of so many records and characteristics that it will not always be efficient to store all the data in one table.
• Statistical analysis will often explore the use of more than one set of data, with different
characteristics and very often different origins. The analyst will need to accurately combine
these data sets and transform them into the required form to start the statistical investigation.
When working with small data sets, efficiency is not always a concern. However, once you start to
run statistical procedures on one or two million records, a simple process can take hours or even
days to run. A simple step like joining two data sets can take hours. This is where efficient coding
starts to make a big difference.
The R package “sqldf”, which utilizes SQL code, uses very effective rules to combine data sets and manipulate the data to get it into usable forms. If you are comfortable with the SQL procedure you will find it much easier to get to know the SQL database language. An aspect of SQL that makes it great is that the syntax is the same regardless of the software used to implement the query. There might be slight variations that are software dependent, but the overall structure remains the same. In SAS, the PROC SQL procedure is used to execute the queries. In R, the “sqldf” package is required and the same syntax as in the SAS procedure is used within the sqldf() function, as well as within the Microsoft SQL database language system. Other software where the SQL language is (partially) standardised includes Oracle, Sybase, DB2, SQL Server and MySQL.
An illustrative example of a relational database is shown in Figure 2.1.
From Figure 2.1, we can view a relational database as follows:
• Relational database: collection of tables (also called relations)
• Table:

Figure 2.1: Illustration of a database

– Collection of rows (also called tuples or records).
– Each row in a table contains a set of columns (also called fields or attributes).
– Each column has a type:
– String: VARCHAR(20)
– Integer: INTEGER
– Floating-point: FLOAT, DOUBLE
– Date or time DATE, TIME, DATETIME
– Several others...
• Primary key: provides a unique identifier for each row (need not be present, but usually is in
practice).
• Schema: the structure of the database, including for each table
– The table name
– The names and types of its columns
– Various optional additional information (constraints, etc.)
– The number of rows in a table is not part of the schema

2.2 Basic Queries


In this chapter, we will focus on the structure and syntax of the SQL procedure and on producing summarized reports.

2.2.1 Overview of SQL procedure


In this section, we will focus on the general syntax and statements of the SQL procedure.
The SQL procedure in R is initiated by first loading the “sqldf” package and then by wrapping the query in the sqldf() function.
Within this shell, multiple statements can be included where each statement defines a process to
be executed. Each statement contains smaller building blocks called clauses. A statement can have
multiple clauses to help build your query.
SQL statements include the following:
• SELECT - identifies columns to be selected.
• CREATE - build new tables, views, or indexes.
• DESCRIBE - displays table attributes or view definitions.
• INSERT - adds rows of data to tables.

• RESET - adds to or changes SQL options without re-invoking the procedure.
The statement that we will be focusing on is the SELECT statement. The SELECT statement is
used to query from one or more tables. It specifies the desired columns and column order. A required
clause for the SELECT statement is the FROM clause which specifies the data sources/table names.
Other optional clauses that can be used with the SELECT statement are:
• WHERE - specifies data that meets certain conditions.
• GROUP BY - groups data into categories.
• HAVING - specifies groups that meet certain conditions.
• ORDER BY - specifies an order for the data.
A generic program displaying the SELECT statement with all the optional clauses is given below.
Please note the clauses must be specified in the order displayed below.
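The generic program appears as a figure in the original notes; a sketch of the general form, with placeholder table, column and condition names, is:
ex <- sqldf("SELECT column1, column2
             FROM table_name
             WHERE row_condition
             GROUP BY column1
             HAVING group_condition
             ORDER BY column1")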

Example 2.1.1
Based on the Highway data. You have been given the task of obtaining the name of the road being
studied, the year in which the observation was taken and the number of crashes on the road during
that year. Using the sqldf function with the SELECT statement, this information can be easily
obtained without any optional clauses using the following code:
ex1 <- sqldf("Select Road, Year, N_Crashes
From crashes")
head(ex1)

## Road Year N_Crashes


## 1 Interstate 65 1991 25
## 2 Interstate 65 1992 37
## 3 Interstate 65 1993 45
## 4 Interstate 65 1994 46
## 5 Interstate 65 1995 46
## 6 Interstate 65 1996 59
The query above can be altered to provide only the information for crashes in the year 1995 and
can be ordered from lowest to highest based on the number of crashes.
ex2 <- sqldf("Select Road, Year, N_Crashes
From crashes

WHERE Year = 1995
Order by N_Crashes")
head(ex2)

## Road Year N_Crashes


## 1 Interstate 275 1995 28
## 2 Interstate 65 1995 46
## 3 US-36 1995 67
## 4 US-40 1995 75
## 5 Interstate 70 1995 81
Note that the output is given according to the same order in which the columns were specified in
the SELECT statement.
In Example 2.1.1, we specify the column names within a specific table that we would like to display. However, what do we do if there is uncertainty about the contents of a table? Recall from the Introduction to R that we can use the head(), tail(), names(), class(), etc. functions to determine how our table looks and what information is contained in which table.

2.2.2 Specifying columns


In this section, we will focus on displaying columns directly from tables, columns calculated from
other columns and conditional columns.

Querying all columns in a table


We have learnt how to specify the columns we need in the SELECT statement; however, if we would like to display all the columns in a specific table, it would be cumbersome to have to specify each column name. Instead, you can use an asterisk (*) in the SELECT statement to select all the columns in a table.
Example 2.2.1 Based on the Highway data. You have been given the task of obtaining all the
information about the crashes.
ex3 <- sqldf("Select *
From crashes")
head(ex3)

## Year Road N_Crashes Volume


## 1 1991 Interstate 65 25 40000
## 2 1992 Interstate 65 37 41000
## 3 1993 Interstate 65 45 45000
## 4 1994 Interstate 65 46 45600
## 5 1995 Interstate 65 46 49000
## 6 1996 Interstate 65 59 51000

2.2.2.1 Calculated columns
In the SQL procedure you can modify the query to create new columns in the display (not saved in
the database tables) that represent a function of another column, i.e. a calculated column.
Example 2.2.2 You are asked to modify the display in Example 2.1.1. The output must display the
name of the road being studied, the year in which the observation was taken and the number of
crashes on the road during that year and a new Average_Num_Crashes column which contains the
average number of crashes per month for the year for each road.
ex4 <- sqldf("Select Road, Year, N_Crashes,
N_Crashes/12 AS Average_Num_Crashes
From crashes")
head(ex4)

## Road Year N_Crashes Average_Num_Crashes


## 1 Interstate 65 1991 25 2
## 2 Interstate 65 1992 37 3
## 3 Interstate 65 1993 45 3
## 4 Interstate 65 1994 46 3
## 5 Interstate 65 1995 46 3
## 6 Interstate 65 1996 59 4
The new column name Average_Num_Crashes is called an alias. An alias is assigned with the AS keyword. It is optional to assign an alias to a new column; if no alias is assigned, the column name will appear blank in the display.
New columns can also be calculated conditionally using the CASE expression. For example, if you
would like to have a Bonus column but the percentage should be conditional on the position of the
employee then the CASE expression can be used when creating the new calculated column Bonus.
A generic outline of the CASE expression is given below.
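The generic code appears as a figure in the original notes; a sketch of the general form, with placeholder condition, value and column names, is:
case
    when condition1 then value1
    when condition2 then value2
    else default_value
end AS new_column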

A basic example of the case expression is given in the following example.


Example 2.2.3 Consider the dataset consisting of email addresses. You would like to create a new column which groups the domains into different express groups. For example, if the email domain belongs to Gmail, Yahoo or Hotmail it should be in group 1, and all other domains in group 0.

data <- data.frame(E_MAIL = c("x@x.com", "q@yahoo.com"))
ex5 <- sqldf("Select E_MAIL,
case
when E_MAIL like '%gmail%' then 1
when E_MAIL like '%yahoo%' then 1
when E_MAIL like '%hotmail%' then 1
else 0
end express
From data")
head(ex5)

## E_MAIL express
## 1 x@x.com 0
## 2 q@yahoo.com 1
If you are confused about the syntax used in the code above, do not worry. This is covered in
the Specifying rows section.

2.2.2.2 Working with Dates in SQL


Working with dates in sqldf requires some additional steps. We first need to make sure the column that contains the date or time information is in the correct format and is of the right class/data type. Thereafter, two additional packages are needed, namely the “lubridate” and “RH2” packages.
Example 2.2.4 Consider the housing dataset. The table contains a column “date” which stores dates in “yyyy/mm/dd” format. You have been given the task of creating a report which displays the city, sales, volume, the month in a new column labelled month, and the year in a new column labelled year.
housing$date <- as.Date(housing$date, format="%Y/%m/%d")

ex6 <- sqldf("Select city, sales, volume,


month(date) AS month,
year(date) AS year
From housing")
head(ex6)

## city sales volume month year


## 1 Abilene 72 5380000 1 2000
## 2 Abilene 98 6505000 2 2000
## 3 Abilene 130 9285000 3 2000
## 4 Abilene 98 9730000 4 2000
## 5 Abilene 141 10590000 5 2000
## 6 Abilene 156 13910000 6 2000
Note that calculated columns are not stored in the database and only used to display the results of
the query. If you would like to create the table containing the new calculated columns you need to
use the CREATE statement in other SQL languages. In R, we can assign the query to a variable

which will store the new query results. In this manner, we can create new tables.

2.2.3 Specifying rows


In this section, we will focus on selecting only a subset of rows in a query.

2.2.3.1 Selecting a subset of rows


So far, we have specified the columns that we want from a table. However, what do we do when we
want to select only specific rows in a table? The WHERE clause is used to specify a condition
that the data must meet before being selected. Only one WHERE clause is required in a SELECT
statement to specify multiple subsetting criteria. The WHERE clause can combine comparison
operators, logical operators and/or special operators.
Comparison operators include: = (equal to), <> or != (not equal to), < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to).
Logical operators include: AND, OR and NOT.
Special operators include: BETWEEN ... AND ... (a range of values), IN (a list of values), LIKE (pattern matching, where % matches any string and _ matches a single character), IS NULL and IS NOT NULL.
Example 2.2.5 Based on the Highway database. You would like to display a list of roads where the
number of crashes exceed 35.
ex7 <- sqldf("Select Road, N_Crashes
From crashes
WHERE N_Crashes > 35")
head(ex7)

## Road N_Crashes
## 1 Interstate 65 37
## 2 Interstate 65 45
## 3 Interstate 65 46
## 4 Interstate 65 46
## 5 Interstate 65 59
## 6 Interstate 65 76
Example 2.2.6 If you were to modify the report in Example 2.2.4 to only include the details for the
year 2001 then the “WHERE” clause would be required with a calculated column.
ex8 <- sqldf("Select city, sales, volume,
month(date) AS month,
year(date) AS year
From housing
WHERE year(date) = 2001 ")
head(ex8)

## city sales volume month year


## 1 Abilene 75 5730000 1 2001
## 2 Abilene 112 8670000 2 2001
## 3 Abilene 118 9550000 3 2001
## 4 Abilene 105 8705000 4 2001
## 5 Abilene 150 11850000 5 2001
## 6 Abilene 139 11290000 6 2001
In the query ex8, if you used “WHERE year = 2001” an error would have occurred. Important note: a calculated column which has an alias cannot be called by its alias in the “WHERE” clause. Thus, if a restriction is required for a calculated column, the column needs to be specified in the same way as in the “SELECT” statement. Only the SELECT and WHERE clauses do not allow the use of the alias; in the ORDER BY and HAVING clauses the alias can be used.

2.2.4 Ordering rows


In the SELECT statement of a query, the order in which the columns are displayed is specified.
However, if you would like to sort the rows of the specified columns then the ORDER BY clause is
used. The default sort order of the ORDER BY clause is ascending. The keyword DESC must
follow the column name in the clause if a descending order is required. More than one column may
be specified in the clause, however, the first item specified in the clause will determine the major
sort order.
Example 2.2.7 Based on the Highway data. You have been given the task of obtaining the name of
the road being studied, the year in which the observation was taken and the number of crashes on
the road during that year. The report must display the number of crashes from lowest to highest.
ex9 <- sqldf("Select Road, Year, N_Crashes
From crashes
Order by N_Crashes")
head(ex9)

## Road Year N_Crashes


## 1 Interstate 275 1994 21
## 2 Interstate 275 1998 21
## 3 Interstate 275 2008 21
## 4 Interstate 275 1993 22
## 5 Interstate 275 1996 22
## 6 Interstate 275 1999 22
If you would like to produce the same report but with the number of crashes from highest to lowest then the DESC option is added to the ORDER BY clause as follows:
ex10 <- sqldf("Select Road, Year, N_Crashes
From crashes
Order by N_Crashes desc")
head(ex10)

## Road Year N_Crashes


## 1 Interstate 65 2012 190
## 2 Interstate 65 2011 183
## 3 Interstate 65 2010 182
## 4 Interstate 65 2009 165
## 5 Interstate 65 2008 157
## 6 Interstate 65 2007 141

2.2.5 Summarising data
In this section, we will focus on creating summary queries. Sometimes not all the details contained in a table are needed; instead, a summary of the information contained in the table is more useful.

2.2.5.1 Functions
To obtain summaries of the data we can use functions. Commonly used summary functions are:

SQL Description
AVG Returns the mean (average) value
COUNT Returns the number of non-missing values
MAX Returns the largest value
MIN Returns the smallest value
SUM Returns the sum of the values
sum(is.na()) Counts the number of missing values
std Returns the standard deviation
var Returns the variance

The first column lists the summary functions in the SQL language. The outcome of a summary function is based on its argument, which usually takes a column variable. To specify row or column operations, different functions will be required.
Example 2.2.8 Consider the Orange dataset available in R. Write a query that reports the average
circumference of the trees.
ex11 <- sqldf("SELECT AVG(circumference)
FROM Orange")
head(ex11)

## AVG("circumference")
## 1 115.8571
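As a further sketch on the same built-in Orange data (the object name ex11b and the column aliases are illustrative), several summary functions can be combined in one query:
ex11b <- sqldf("SELECT COUNT(circumference) AS n,
                       MIN(circumference) AS min_circ,
                       MAX(circumference) AS max_circ,
                       SUM(circumference) AS total_circ
                FROM Orange")
head(ex11b)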

2.2.5.2 Grouping Data


Another method of summarizing data is by grouping data based on certain characteristics within
one or more columns. The GROUP BY clause is used for classifying data into groups based on
values of one or more columns and for calculating statistics based on each unique value of the
grouping columns. Refer to the general syntax for the placing of the GROUP BY clause.
Example 2.2.9 Consider Example 2.2.8, now write a query to create a report that calculates the
average circumference for each tree type.
ex12 <- sqldf("SELECT tree, AVG(circumference) AS meancirc
FROM Orange
GROUP BY tree")
head(ex12)

## tree meancirc
## 1 1 99.57143
## 2 2 135.28571
## 3 3 94.00000
## 4 4 139.28571
## 5 5 111.14286
Now only display the results for trees with average circumference larger than 110.
ex13 <- sqldf("SELECT tree, AVG(circumference) AS meancirc
FROM Orange
GROUP BY tree
HAVING meancirc > 110")
head(ex13)

## tree meancirc
## 1 2 135.2857
## 2 4 139.2857
## 3 5 111.1429
In Example 2.2.9 the HAVING clause was used. This clause subsets groups based on the expression value and determines which groups are displayed. The HAVING clause is processed after the GROUP BY clause. The WHERE clause versus the HAVING clause: the WHERE clause determines which individual rows are available for grouping, whereas the HAVING clause determines the groups displayed and is always preceded by the GROUP BY clause.

2.3 SQL Joins


In this chapter, we will focus on combining data horizontally from multiple tables using different
types of join methods.
In SQL we use joins to combine tables horizontally. A join involves matching data from one row
in one table with a corresponding row in another table. A join can be performed on one or more
columns. There are two main types of joins:
• Inner join - returns only matching rows.
• Outer join - returns all rows from one or both the tables being joined. (Includes matching
and non-matching rows.)
Details and application of these joins will be discussed in the following sections.

2.3.1 Inner joins


When joining two or more tables horizontally and displaying only the matching rows an inner
join is used. In this section, we will explore the syntax of an inner join and determine when it is
appropriate to use such a join.

The standard syntax for an inner join is as follows:
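The standard syntax appears as a figure in the original notes; a sketch of the general form, with placeholder table and column names, is:
sqldf("select a.column1, b.column2
       from table1 as a
       inner join table2 as b
       on a.key_column = b.key_column")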

Example 2.3.1 Based on the Highway database. Write a report that displays the Year, Road
,Number of Crashes, Volume, District and Length for roads which have all the information.
ex1 <- sqldf("select
a.*,
b.District,
b.Length
from crashes as a
inner join roads as b
on a.Road = b.Road")
head(ex1)

## Year Road N_Crashes Volume District Length


## 1 1991 Interstate 65 25 40000 Greenfield 262
## 2 1992 Interstate 65 37 41000 Greenfield 262
## 3 1993 Interstate 65 45 45000 Greenfield 262
## 4 1994 Interstate 65 46 45600 Greenfield 262
## 5 1995 Interstate 65 46 49000 Greenfield 262
## 6 1996 Interstate 65 59 51000 Greenfield 262
Note: In the above example, the * was used in the SELECT statement. Thus, all columns from the crashes table appear, as well as the District and Length columns from the roads table. To avoid the repetition of columns in the display, column names must be specified explicitly in the SELECT statement.

2.3.2 Outer joins


When joining two or more tables horizontally to include all rows an outer join is used. In this
section, we will explore the syntax of an outer join and determine when it is appropriate to use
such a join.
There are three types of outer joins: left, full and right outer join

The standard syntax for an outer join is as follows:
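The standard syntax appears as a figure in the original notes; a sketch of the general form for a left outer join (the outer join supported by sqldf), with placeholder table and column names, is:
sqldf("select a.column1, b.column2
       from table1 as a
       left join table2 as b
       on a.key_column = b.key_column")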

The ON clause specifies the criteria for the outer join. Consider the position of the tables in the
FROM clause.

• Left join: returns all matching and non-matching rows from the left table, and all matching rows from the right table.
• Full join: returns all matching and non-matching rows from both tables.
• Right join: returns all matching and non-matching rows from the right table, and all matching rows from the left table.

Please note that currently the sqldf package does not support RIGHT or FULL outer joins. Hence,
the focus in this course will be on INNER and LEFT joins.
Example 2.3.2 Based on the Highway database. Write a report that displays the Year, Road
,Number of Crashes, Volume, District and Length for all roads which have had crashes.
ex2 <- sqldf("select
a.*,
b.District,
b.Length
from crashes as a
left join roads as b
on a.Road = b.Road")
head(ex2)

## Year Road N_Crashes Volume District Length


## 1 1991 Interstate 65 25 40000 Greenfield 262
## 2 1992 Interstate 65 37 41000 Greenfield 262
## 3 1993 Interstate 65 45 45000 Greenfield 262
## 4 1994 Interstate 65 46 45600 Greenfield 262
## 5 1995 Interstate 65 46 49000 Greenfield 262
## 6 1996 Interstate 65 59 51000 Greenfield 262
Note that both partial outputs look alike for Example 2.3.1 and Example 2.3.2 but if the full report
is viewed the difference is observed.
Example 2.3.3 Consider the Highway database. You would like a report of the crashes on roads that
had or are having roadwork done. You would like the report to include the Year, Road, Number of
Crashes, Volume, District and Length.
ex3 <- sqldf("select
a.*,
b.District,
b.Length
from crashes as a
inner join roads as b
on a.Road = b.Road
inner join works as c
on a.Road=c.Road")
head(ex3)

## Year Road N_Crashes Volume District Length


## 1 1991 Interstate 70 84 76000 Vincennes 156
## 2 1992 Interstate 70 82 79000 Vincennes 156
## 3 1993 Interstate 70 84 82000 Vincennes 156
## 4 1994 Interstate 70 84 91000 Vincennes 156
## 5 1995 Interstate 70 81 93000 Vincennes 156
## 6 1996 Interstate 70 76 93200 Vincennes 156

The previous example required a double join. The syntax and logic for a double join follow the same pattern as for a single join.

2.4 SQL Subqueries


2.4.1 Noncorrelated subqueries
A subquery is a query that resides within an outer query. Subqueries are also known as nested
queries, inner queries and sub-selects. A subquery must be resolved before the outer query can be
resolved. A subquery:
• Returns values to be used in the outer query’s WHERE or HAVING clause.
• Returns a single column.
• Can return multiple values or a single value.

Correlated vs Noncorrelated subqueries


There are two types of subqueries; correlated and noncorrelated. A noncorrelated subquery is a
self-contained query. It executes independently of the outer query. A correlated subquery requires
a value or values to be passed to it by the outer query before it can be successfully resolved.

Figure 2.2: Subquery syntax

Example 2.4.1 Consider the staff table, which contains salary and job title information on employees
at a specific company. You would like to create a report that shows the job titles with an average
salary greater than the average salary of the company. Logically we can break this query down into
3 steps:
1. Calculate the average salary of the company.

2. Calculate the average salary for each job title.
3. Compare the two and keep only the job titles for which the average salary is greater than the
value calculated in step 1.
ex1 <- sqldf("select Employee_Job_Title,
AVG(Salary) as MeanSalary
from staff
group by Employee_Job_Title
having AVG(Salary) >
(select AVG(Salary)
from staff)")

head(ex1)

## Employee_Job_Title MeanSalary
## 1 Administration Manager 46230
## 2 Director 163040
## 3 Sales Manager 98115
Note the subquery is enclosed in brackets and will execute first, followed by the outer query. Example
2.4.1 is an example of a noncorrelated subquery as it is self-contained.
Example 2.4.2 Consider the Company database. The CEO sends a birthday card to each employee
having a birthday in February (the current month). Create a report listing the names, cities and
countries of employees with birthdays in February.
info$Employee_Birth_Date <- as.Date(info$Employee_Birth_Date ,
format="%Y/%m/%d")

ex2 <- sqldf("select Employee_ID, Employee_Name, City, Country


from addresses
where Employee_ID in (select Employee_ID
from info
where month(Employee_Birth_Date) =2)
order by Employee_ID")

head(ex2)

## Employee_ID Employee_Name City Country


## 1 120114 Buddery, Jeannette Sydney AU
Example 2.4.2 provides an additional example of a noncorrelated subquery.
Please note in this course we will focus on noncorrelated subqueries. However, an example of a correlated
subquery is provided.
Example 2.4.3 The code below illustrates a correlated subquery. This query is not stand alone and
requires information from the main/outer query to execute.

ex3 <- sqldf("select Employee_ID, AVG(Salary) as MeanSalary
from employee
where 'AU' = (select Country
from supervisors
where employee.Employee_ID = supervisors.Employee_ID)")

Subqueries that return multiple values


If a subquery returns multiple values and the “equal” operator is used an error is generated. For
subqueries which return multiple values the “IN” operator or a comparison operator with the
keywords “ANY” or “ALL” must be used.
The “ANY” keyword is true for any of the values returned by the subquery whereas the “ALL”
keyword is true for all the values returned by the subquery.

Keyword Signifies
= ANY (20, 30, 40)   = 20 or = 30 or = 40
> ANY (20, 30, 40)   > 20
> ALL (20, 30, 40)   > 40
< ALL (20, 30, 40)   < 20

2.4.2 In-line views


An in-line view is a query expression (SELECT statement) that resides in a FROM clause. It
acts as a temporary table in a query. In-line views are mostly used to simplify complex queries.
An in-line view can consist of any valid SQL query; however, it may not contain an ORDER BY clause. An in-line view can be assigned an alias as if it were a table. [Hint: always test in-line views independently while building a complex query, to avoid errors.]

Figure 2.3: in-line syntax

Example 2.4.4 List all active Sales Department employees who have annual salaries significantly lower (less than 95%) than the average salary for everyone with the same job title. Logically we can break this query down into 2 steps:
1. Calculate the average salary for each job title among the active employees in the Sales department.
2. Match each employee to a job title group and compare the employee’s salary to the group’s average to determine whether it is less than 95% of that average.
ex4_1 <- sqldf("select Job_Title,
avg(Salary) as Job_Avg
from payroll as p inner join
organisation as o
on p.Employee_ID=o.Employee_ID
where Employee_Term_Date is NULL
and Department = 'Sales'
group by Job_Title ")

head(ex4_1)

## Job_Title Job_Avg
## 1 Sales Rep. II 27347
## 2 Sales Rep. I 26575
## 3 Sales Rep. III 29213
## 4 Sales Rep. IV 31588
ex4_2 <- sqldf("select Employee_Name, emp.Employee_Job_Title,
Employee_Annual_Salary, Job_Avg
from (select Job_Title,
AVG(p.Salary) as Job_Avg
from payroll as p inner join
organisation as o
on p.Employee_ID=o.Employee_ID
where Employee_Term_Date is NULL
and Department = 'Sales'
group by Job_Title) as job inner join
salesstaff as emp
on emp.Employee_Job_Title=job.Job_Title
where Employee_Annual_Salary<(Job_Avg*0.95)
order by Job_Title, Employee_Name ")

head(ex4_2)

## Employee_Name Employee_Job_Title Employee_Annual_Salary Job_Avg


## 1 Ould, Tulsidas Sales Rep. I 22710 26575
## 2 Polky, Asishana Sales Rep. I 25110 26575
## 3 Tilley, Kimiko Sales Rep. I 25185 26575
## 4 Voron, Tachaun Sales Rep. I 25125 26575

Chapter 3

Machine Learning

In this chapter, we will focus on machine learning techniques.

3.1 Introduction to machine learning


In this chapter, we will introduce the jargon of machine learning and define some concepts that are
crucial when applying any machine learning technique.
In this era of “big data”, data are amassed by networks of instruments and computers. There
are new opportunities for finding and characterizing patterns using techniques described as data
mining, machine learning, data visualization, data modelling, and so on. This process involves, but
is not limited to, understanding our data, acquiring skills for data wrangling (a process of preparing
data for visualization and other modern techniques of statistical interpretation) and using the data
to answer statistical questions via modelling and visualization. Doing so inevitably involves the
ability to reason statistically and utilize computational and algorithmic capacities. The trick is not
mastering programming but rather learning to think in terms of these operations.
What is machine learning?
Machine learning is a subset of artificial intelligence (AI) that involves the development of algorithms
and statistical models that enable computer systems to learn from and make predictions or decisions
based on data, without being explicitly programmed to do so. The aim of machine learning is
to enable machines to learn from data and perform tasks that would otherwise require human
expertise or time-consuming manual effort.
The specific aims of machine learning can vary depending on the problem domain and the type of
task involved. Some common aims of machine learning include:
• Prediction: The goal of prediction is to develop models that can accurately predict an outcome
or target variable based on input features. For example, predicting whether a customer is
likely to churn or not, based on their past behavior.
• Classification: The goal of classification is to develop models that can assign new data to one
of several predefined categories or classes. For example, classifying images of animals into

different species.
• Clustering: The goal of clustering is to group similar data points into clusters or segments
based on their features. For example, clustering customers into different groups based on
their demographics and buying behavior.
• Anomaly detection: The goal of anomaly detection is to identify unusual or rare data points
or events that deviate from the norm. For example, detecting fraudulent transactions in a
credit card dataset.
• Recommendation: The goal of recommendation is to develop models that can suggest
personalized items or content to users based on their past behavior or preferences. For
example, movie and series recommendations that pop up on your Netflix accounts; product
recommendation on Take-a-lot to customers based on their previous purchases or ratings.
The slight difference between machine learning and classical statistics is that in machine learning
the model must predict unseen observations, whereas in classical statistics the model must fit,
explain or describe data. In other words, in classical statistics the focus is more on model fitting,
(building models) and machine learning is all about predictions and decision-making (application of
models).
Machine learning techniques can be divided into several branches, which we refer to as supervised
learning, unsupervised learning, semi-supervised learning and reinforcement learning.
In summary:
• Supervised learning: In this type of machine learning, the algorithm is trained on *labeled*
data (data that is already tagged with the correct output), and the goal is to learn a function
that can map new, unseen data to the correct output.
• Unsupervised learning: In this type of machine learning, the algorithm is trained on *unlabeled*
data (data without any predefined output), and the goal is to discover patterns or structure
in the data.
• Semi-supervised learning: In this type of machine learning, the algorithm is trained on
*partially labeled* data, i.e. some of the observations are labelled and others are not.
• Reinforcement learning: In this type of machine learning, the algorithm learns to make
decisions or take actions based on rewards or penalties received for certain actions.
In this course, we will focus on supervised and unsupervised learning techniques only.
Supervised learning techniques include linear regression, logistic regression, naive Bayes, linear
discriminant analysis, decision trees, k-nearest neighbours, neural networks and support vector
machines.
Unsupervised learning techniques include hierarchical clustering, k-means clustering, neural networks
and principal component analysis.
The data science process can be summarised as in Figure 3.1.

Figure 3.1: An overview of the data science process.

3.1.1 Validation
Since the aim of machine learning is to predict responses, we need to test how well our model will
perform on “new” data. Testing the model on the same data used to build the model results in very optimistic performance estimates and overfitting (discussed in the next section).
A method used to ensure that models do not overfit is by splitting the data into two parts; a
training set and test set. The model is built on the training set and the test set is used to evaluate
the performance of the model. The test set can be viewed as the “new” data set for prediction.
The norm is usually to split the data into 80% training and 20% test set, however, this could differ
depending on the size of the data set. This concept is illustrated in Figure 3.2.
The train/test split is sufficient for small datasets. However, in the case of large data sets (big data) a train/validation/test split is more appropriate. This method is similar to the train/test split, but the extra validation set allows for fine-tuning of parameters. Another method is cross-fold validation, which is similar to the above, but instead the data set is divided into train/test sets multiple times and then evaluated. For this course, we will focus only on a train/test set.
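A minimal sketch of an 80/20 train/test split in R, assuming a hypothetical data frame called mydata:
set.seed(123)                                       # for reproducibility
n <- nrow(mydata)
train_index <- sample(1:n, size = round(0.8 * n))   # sample 80% of the row indices
train <- mydata[train_index, ]                      # training set (80%)
test  <- mydata[-train_index, ]                     # test set (remaining 20%)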
Note that for unsupervised learning techniques, it is not required to split the data into a train and
test set since the data is not labelled and thus the correct classification is unknown.

Figure 3.2: Illustration of train/test split

3.1.2 Overfitting
Overfitting is when the model fits the training data too closely, but performs poorly on new, unseen
data. In other words, the model learns the details/behaviour (and the noise!) of the training data
so well that it fits the training set perfectly but fails to fit any additional data or predict responses
for any new observations.
An example of overfitting in regression is shown in the far right image below:

Figure 3.3: Illustration of overfitting and underfitting

3.1.3 Performance measures


There are different ways to determine whether a model is good or not. The accuracy, computation
time and interpretability need to be considered when assessing a model. Different machine learning
techniques require different performance metrics. The metric depends on the type of machine
learning technique you are using:
For regression techniques, we use the root mean square error (RMSE) to evaluate a model.
This measure calculates the average distance between the predicted and observed outcomes. (This measure captures the bias-variance trade-off, discussed in the next section.)
For classification techniques, we use a confusion matrix to obtain the accuracy, precision, recall

and specificity measures. A confusion matrix gives the following two-way table when we consider
“True” to be the positive outcome.

Figure 3.4: Confusion matrix

The diagonals provide the number of correctly classified observations (observed=predicted) and the
off-diagonal entries indicate the number of incorrectly classified observations. From the confusion matrix it
is possible to calculate the accuracy, precision, recall and specificity as follows:
• Accuracy = (correctly classified observations) / (total number of observations) = (TP + TN) / (TP + FP + FN + TN).
• The error of the model = 1 − Accuracy.
• Precision = TP / (TP + FP).
• Recall (sensitivity) = TP / (TP + FN).
• Specificity = TN / (TN + FP).
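As an illustrative sketch in R, using hypothetical counts for TP, FP, FN and TN:
TP <- 50; FP <- 10; FN <- 5; TN <- 35    # hypothetical counts from a confusion matrix
accuracy    <- (TP + TN) / (TP + FP + FN + TN)
error       <- 1 - accuracy
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)            # also called sensitivity
specificity <- TN / (TN + FP)
round(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity), 3)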
For clustering techniques, since the observations are not labelled, we use similarity measures: the similarity within clusters, measured by the within-cluster sum of squares (WSS), and the similarity between clusters, measured by the between-cluster sum of squares (BSS).

3.1.4 Variance-Bias tradeoff


It is important to understand prediction errors (bias and variance) when working with predictive
models. Understanding these errors assists us in building accurate models by avoiding under/over-
fitting. There is a tradeoff between a model’s ability to minimize bias and variance.
Bias is the difference between the average prediction of our model and the correct value which we
are trying to predict. It can also be thought of as the inability of a machine learning model to
capture the true relationship between the data variables due to erroneous assumptions that are
inherent to the learning algorithm. Models with high bias pay very little attention to the training
data and oversimplify the model. This leads to high errors on the training and test set.
Variance is the amount by which the estimated model would change if we estimated it using different training data. Since training data is used to fit the statistical learning method, different training data sets will result in different estimated models. Ideally the estimated model should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in the predicted model. Models with high variance pay a lot of attention to the training data and do not generalize to other data. This leads to models performing very well on training data, but having high error rates on test data.

Figure 3.5: Relationship between bias and variance

If the model is too simple and has very few parameters then it may have high bias and low variance.
On the other hand, if the model has a large number of parameters then it might have a high
variance and low bias. Thus, a balance needs to be found. This tradeoff in complexity is why there
is a tradeoff between bias and variance.
Overfitting revisited

Figure 3.6: Illustration of overfitting and underfitting

The problem with overfitting is that it results in high variance, but less bias (the converse is true
for underfitting). It also results in the identification of patterns that only exist in the training set
and do not generalise to other data. The more complex the model is, the more data points it will
capture, and the lower the bias will be. However, complexity will make the model “move” more to
capture the data points, and hence its variance will be larger. Overfitting can be limited by using
methods such as cross validation or train/validation/test sets of the data. An underfit model will
be less flexible and cannot account for the data.

The relationship between model complexity, overfitting/underfitting and bias/variance is illustrated in Figure 3.7.

Figure 3.7: Illustration of the relationship between model complexity, overfitting/underfitting and bias/variance

In summary:

Overfitting Underfitting
High variance High bias
Model too specific Model too general

3.1.5 The model building process


Before we can jump to model building, there are a few important skills that we need to gain to
fully understand our data and identify the technique that will best suit our problem. These skills
include data visualisation and data wrangling.

Figure 3.8: The model building process

3.2 Data Visualisation


In this section, we will focus on well-designed visualisations and how to use them for investigation
and effective communication of the data. Visualization is not only an important communication
skill but it also allows us to uncover structures in data that cannot be detected from numerical
summaries.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” -
Exploratory Data Analysis, Tukey (1977).
When creating a visualization it is crucial that the plot represents the key points of an analysis,
hence, the type of plot and variables used should be chosen after careful consideration.
Before plotting the data, we first have to know the data type. Data can be classified into two
groups:
• Quantitative (Numeric)
– Continuous (e.g., amount of rainfall per season).
– Discrete (e.g., number of siblings).
• Qualitative (Categorical)
– Nominal (e.g., type of pet, country).
– Ordinal (e.g., Zomato rating, education level)
Once you have identified the data type, you can decide on the variables you want to plot. If you
would like to explore a single variable then you would use a univariate plot, but if you want to
compare two variables or see the effect of a certain variable on another then you would make use of
a bivariate plot. A basic summary of the univariate and bivariate plot options is given in the tables
below.

In R, the “ggplot2” package is very effective for data visualizations. There are base functions in R
that plot variables but the “ggplot2” package provides a unifying framework for describing and
specifying graphics. It allows for the creation of custom data graphics that support visual display
in a purposeful way.
The ggplot2 package is based on the idea that every graph can be built from the same three
components: a dataset, a co-ordinate system and geoms (visual marks that represent the data).
In the syntax of ggplot2, an aesthetic is an explicit mapping between a variable and the visual
cues that represent its values. In ggplot2, a plot is created with the ggplot() command, and any
arguments to that function are applied across any subsequent plotting directives. Graphics in
ggplot2 are built incrementally from elements. Consider the point plot in Example 3.2 below: the only elements are points, which are plotted using the geom_point() function. The arguments to geom_point() specify where and how the points are drawn. Here, the two aesthetics (aes()) map the vertical (y) coordinate to the dist variable and the horizontal (x) coordinate to the speed variable. The size argument to geom_point() changes the size of all of the symbols/markers. Note that here, every marker is the same size.

The general syntax for plotting a graph is:
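(a generic sketch, with my_dataset, x_var and y_var as placeholder names)

ggplot(data = my_dataset, mapping = aes(x = x_var, y = y_var)) +
  geom_point()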

In the first line of code the data and the (x,y) co-ordinates are specified. The second line “adds”
on the type of plot. How to build a ggplot:
• Create a plot object with ggplot( ).
• Define the data frame you want to plot.
• Combine components in different ways with the "+" operator to produce a variety of plots.
• Map the aesthetics/features used to represent the data, e.g., x, y locations, color, symbol size.
• Specify what graphics shapes, geoms, to view the data, e.g., point, bars, lines. Add these to
the plot with geom_point, geom_bar or geom_line.
The example below shows the step by step procedure to build a plot using ggplot.
Example 3.2
Step 1: Specify the data
ggplot(
data = cars
)

Step 2: A mapping from data onto plot aesthetics
ggplot(
data = cars,
mapping = aes(
x = speed,
y = dist
)
)

[Plot: an empty plotting canvas with dist on the vertical axis and speed on the horizontal axis; no layers have been added yet.]

Step 3: Add a plot layer


ggplot(
data = cars,
mapping = aes(
x = speed,
y = dist
)
) +
geom_point()

[Plot: scatter plot of dist against speed.]

In this step, we can also add layer-specific parameters such as:


• x-axis location.
• y-axis location
• colour of marker
• fill of a marker
• shape of a marker, etc.
For example,
ggplot(
data = cars,
mapping = aes(
x = speed,
y = dist
)
) +
geom_point(color="red", size=3)

[Plot: scatter plot of dist against speed, drawn with red, enlarged markers.]

We can also choose the type of plot we want to use, for example:

• points
• lines
• histograms, etc.
In this manner we can add multiple options to a single plot. We can also add multiple plot types,
for example we can add a line plot to the above point plot as follows:
ggplot(
data = cars,
mapping = aes(
x = speed,
y = dist
)
) +
geom_point() +
geom_line()

[Plot: scatter plot of dist against speed with a connecting line added through the points.]

Step 4: Add a title and label the axes


ggplot(
data = cars,
mapping = aes(
x = speed,
y = dist
)
) +
geom_point() +
labs(x = "Speed (m/s)",
y = "Distance (m)",
title = "Point plot of Speed vs Distance for Cars data")

[Plot: “Point plot of Speed vs Distance for Cars data”, with Speed (m/s) on the horizontal axis and Distance (m) on the vertical axis.]

Step 5: Interpretation of the plot


In the plot above, a positive linear relationship is observed between speed and distance: as the speed increases, so does the distance. There is one possible outlier (an unusually large distance) observed around a speed of 24, which needs to be investigated.
A list of the geom functions that produce common graphs:
• geom_density() - density plot for a univariate continuous variable.
• geom_histogram() - histogram for a univariate continuous variable.
• geom_bar() - bar graph for a univariate discrete variable.
• geom_point() - scatter plot of points for bivariate variables.
• geom_rug() - rug plot for univariate or bivariate variables.
• geom_boxplot() - plots the five number summary.
• geom_violin() - violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values. (A violin plot is a mirrored density plot displayed in the same way as a boxplot.)
• geom_line() - plots a line graph.
• geom_hline(), geom_vline(), geom_abline() - adds a horizontal/vertical/sloped line to the graph.
View the ggplot2 package help for a full list of functions. Data visualization does not end at
simply plotting variable(s). The plots need to help the reader to understand the meaning of the
visual representation by providing context. Context can be added in two ways: making the plots
descriptive and adding an interpretation to the plots. The interpretation should include a reason
justifying the use of a specific plot as well as the information provided by the plot.

Adding context to plots


To ensure that a plot is descriptive the following need to be added:
• Label axes, include units.
• Add reference lines and markers for important values, these values should be discussed in the
interpretation.
• Label points of unusual/interesting observations, these values should be discussed in the
interpretation.
• Use color and plotting symbols to convey additional information.
• Add legends and labels, especially when plotting multiple variables on the same grid.
• Describe what you see in the caption; captions should be comprehensive and describe what has been graphed.
Every graph should be accompanied with an interpretation. Communication of findings and results
is key in statistics. The interpretation should explain the patterns or lack thereof in the plot in
layman terms.
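As an illustration, a minimal sketch (using the cars data from Example 3.2) of how labels, a reference line and an annotation can make a plot more descriptive:

ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_vline(xintercept = 24, linetype = "dashed") +                     # reference line near the possible outlier
  annotate("text", x = 24, y = 120, label = "possible outlier", hjust = 1.1) +
  labs(x = "Speed (m/s)", y = "Distance (m)",
       title = "Stopping distance increases with speed",
       caption = "Each point is one car; the dashed line marks the speed of the possible outlier.")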

3.3 Data Wrangling


In this chapter, we will focus on data wrangling. Wrangling skills provide a sound practical foundation for working with modern data.
What is data wrangling? It is the art of getting your data in a useful form via the process of
cleaning and unifying messy, incomplete and complex data sets for easy access and analysis. It is a
core data analysis technique; working as an iterative process that helps get to the cleanest, most
usable data possible prior to analysis. Each step in the process exposes new potential ways for the
data to be “re-wrangled”, with the ultimate goal of generating the most robust data for analysis.
The process of transforming a raw table into a data table that contains information explicitly is called data wrangling. Data wrangling can be broken down into the following six processes:
• Discovering – understanding the data and how it can be used for analytic exploration and
analysis.
• Structuring – formatting the data of different shapes and sizes into a data table that can be
used in analysis.

• Cleaning – correcting errors resulting from data storage or data entry and standardizing the
data.
• Enriching – improvements that can be added to the representation of information to allow for
efficient analysis and reporting.
• Validating – identifying consistency issues and data quality.
• Publishing – providing correctly formatted and structured data that can be used for analysis.
We will group the above processes into two steps:
• exploring the data; and
• preparing the data for analysis.
We will focus on data wrangling in R, mainly using the “dplyr”, “tidyr” and “lubridate” packages.

Step 1: Exploration
In the exploration step, we discover the data, understand the structure of the data and visualize
the data. Given a dataset, to understand the data we need to determine certain aspects such as
the size of the dataset (number of observations and variables) and the type of data it contains
(categorical or quantitative). It would also be great to be able to preview the dataset without
printing out the entire dataset, especially for large datasets. To help understand the data, the
following functions in R can be used:
• class( ) – returns the class of a data object. Refer to Introduction to R software chapter for
data types in R.
• dim( ) – returns the dimension of the data.
• names( ) – returns the column names of the data.
• str( ) – returns a preview of the data with details.
• summary( ) – returns a summary of the data.
Note: The glimpse( ) function can also be used to preview the data with details; however, this function requires the “dplyr” package.

Example
Consider the Iris dataset, stored here in a data frame called dat. To explore this dataset we will apply each of the above functions to gain a better understanding of how the dataset looks.
class(dat)

## [1] "data.frame"
dim(dat)

## [1] 150 5

names(dat)

## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"


str(dat)

## 'data.frame': 150 obs. of 5 variables:


## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
glimpse(dat)

## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~
summary(dat)

## Sepal.Length Sepal.Width Petal.Length Petal.Width


## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
To get a preview of how the data looks, numerically and visually, without printing the entire dataset, we will use the following functions in R:
• head( ) – gives a view of the top of a dataset.
• tail( ) – gives a view of the bottom of a dataset.
• hist( ) – prints a histogram of a specific variable.

• plot( ) – prints a plot of two variables.
Note: The print( ) function is also available to view the entire dataset but it is not recommended
especially for large datasets.

Example
Consider the dataset Iris. For a preview of the data, we apply the functions above as follows:
head(dat)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(dat)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
hist(dat$Sepal.Length)

[Plot: Histogram of dat$Sepal.Length, with Frequency on the vertical axis and Sepal.Length values ranging from about 4 to 8 on the horizontal axis.]

Now that we have explored the data and have a better understanding of what the data looks like
and the information it contains, we can move to the next step and prepare the data.

Step 2: Preparation
In this step, we tidy, clean and validate the data to ensure that it is ready to use for analysis.

Structuring the Data


Firstly, we tidy the structure of the data by making sure that the columns and rows in the dataset
are organized in a manner that allows for efficient analysis and avoids redundancy of variables
and/or observations. Given a dataset, it is important that certain principles hold. The rows of the
dataset should contain observations; the columns should contain the variables (or attributes) and
there should be one observational unit per dataset. Depending on the aim of a study, datasets can
be designed to be either long or wide to ensure the principles hold. This can be achieved as in Part
I or by using the equivalent built-in R functions. Data are called ‘tidy’ because they are organized
as follows:
• Each observation is saved in its own row.
• Each variable is saved in its own column.
• Each cell contains a single value.

When data are in tidy form, it is relatively straightforward to transform the data into arrangements
that are more useful for answering interesting questions.
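For example, a minimal sketch of reshaping a wide table into tidy (long) form with the “tidyr” package; the wide table below is hypothetical:

library(tidyr)

# A hypothetical wide table: one row per student, one column per test
wide <- data.frame(student = c("A", "B"),
                   test1   = c(70, 65),
                   test2   = c(80, 72))

# Reshape to tidy (long) form: one row per student-test observation
long <- pivot_longer(wide, cols = c(test1, test2),
                     names_to = "test", values_to = "mark")
long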

Variable Types
The class( ) function is used to determine the type of a data object. The type of a variable can be converted using the corresponding as. function, for example as.numeric( ), as.factor( ) or as.character( ). If the variable is stored as a list, the unlist( ) function in R can be used to flatten the list into a vector.

Example
Consider the dataset Iris. The variable type for “Species” can be converted to a factor for grouping
as follows:
new_species <- as.factor(dat$Species)
class(new_species)

## [1] "factor"

Date Formats
Certain datasets have dates and times recorded in different formats in the same column. This violates the tidy data principles. To ensure that the date and time formats align, we use the “lubridate” package in R. This package allows us to specify the format of the date and time.

Example
The example below illustrates how the lubridate package in R can be used to format dates.
ymd("2020-02-20")

## [1] "2020-02-20"
ymd("2020 February 25")

## [1] "2020-02-25"
mdy("August 24, 2020")

## [1] "2020-08-24"
hms("13:33:09")

## [1] "13H 33M 9S"


ymd_hms("2020/02/20 13:33.09")

## [1] "2020-02-20 13:33:09 UTC"


Now that the rows and columns are in order, we need to clean the entries.

Missing and Special values
Missing values and special values need to be identified and corrected. Missing values may be
random or sometimes are associated with an outcome/variable. It is not advised to make prior
assumptions regarding the missing values. Missing values are represented differently in different software packages: for example, in R a missing value is “NA”, in Excel it is #N/A, and other representations include an empty string or a single dot. Special values include infinity, represented as “Inf” in R, and not-a-number entries,
represented as “NaN” in R. How do we check for these values?
The is.na(data_frame_name) function is used to check for any missing values. This function will
return a logical matrix specifying “FALSE” if the values are not NA and “TRUE” for any NA
values. However, this function requires manual inspection of missing values. Instead, we can use
the following functions to determine whether there are any missing values and if so how many
missing values are present,
• The any(is.na(data frame)) function returns one logical value stating “TRUE” if there are
missing values and “FALSE” otherwise.
• The sum(is.na(data frame)) specifies the number of missing values.
• Another method for identifying missing values as well as in which columns they appear is by
using the summary(data frame) function.
A number of principled approaches have been developed to account for missing values, most notably
multiple imputation (process of replacing missing values with substitute values based on the data).
In this course, we will only focus on identifying missing values.
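For completeness, a minimal sketch of the simplest option, dropping incomplete rows, is shown below; my_data is a placeholder name, and whether dropping rows is appropriate depends on why the values are missing.

complete_rows <- my_data[complete.cases(my_data), ]   # keep only rows with no missing values
# equivalently: na.omit(my_data)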

Example
Investigating missing values. Consider the dataset created below.
dat <- data.frame(A = c(7, NA, 5, 2),
B = c(1, 6, NA, 3),
C = c(NA, 5, 3, 1))

is.na(dat)

## A B C
## [1,] FALSE FALSE TRUE
## [2,] TRUE FALSE FALSE
## [3,] FALSE TRUE FALSE
## [4,] FALSE FALSE FALSE
any(is.na(dat))

## [1] TRUE
sum(is.na(dat))

## [1] 3

summary(dat)

## A B C
## Min. :2.000 Min. :1.000 Min. :1
## 1st Qu.:3.500 1st Qu.:2.000 1st Qu.:2
## Median :5.000 Median :3.000 Median :3
## Mean :4.667 Mean :3.333 Mean :3
## 3rd Qu.:6.000 3rd Qu.:4.500 3rd Qu.:4
## Max. :7.000 Max. :6.000 Max. :5
## NA's :1 NA's :1 NA's :1
The other types of values that need to be inspected are outliers and obvious errors. Outliers are
extreme values that differ from other values. Causes include:
• Valid measurements.
• Variability in measurement.
• Experimental error.
• Data entry error.
These values may be discarded or retained depending on cause. Outliers may only be discarded if
there is a strong justification to prove that the value is an error, otherwise, the value is retained in
the data set. Outliers can generally be detected by a boxplot.

Example
Consider the dataset, X, which contains simulated values. The box plot below shows the outliers.
x <- c(rnorm(50, mean = 15, sd = 5), -5, 28, 35, 40)
boxplot(x, horizontal = TRUE)

[Plot: horizontal boxplot of x, showing a few outlying points beyond the whiskers.]

Obvious errors may appear in many forms such as values so extreme they cannot be plausible (e.g.,
person aged 243) and values that do not make sense (e.g. negative age). There are several causes:
• Measurement error.
• Data entry error.
• Special code for missing data (e.g. -1 means missing).
These values should generally be removed or replaced. Obvious errors can be detected by using a
boxplot, histogram or a five number summary.

Example
Consider the dataset, dat2, which contains simulated values.
dat2 <- data.frame(A = rnorm(100, 50, 10),
B = c(rnorm(99, 50, 10), 750),
C = c(rnorm(99, 50, 10), -20))
summary(dat2)

## A B C
## Min. :17.09 Min. : 28.08 Min. :-20.00
## 1st Qu.:44.64 1st Qu.: 43.58 1st Qu.: 41.78
## Median :49.76 Median : 51.28 Median : 50.90

## Mean :50.27 Mean : 57.95 Mean : 49.78
## 3rd Qu.:56.31 3rd Qu.: 56.76 3rd Qu.: 57.35
## Max. :74.67 Max. :750.00 Max. : 77.22
hist(dat2$B, breaks = 20)

[Plot: Histogram of dat2$B, with Frequency on the vertical axis; almost all values lie near 50, with a single extreme value far to the right (around 750).]

boxplot(dat2)

[Plot: boxplots of A, B and C side by side; the extreme values in B (around 750) and C (around -20) stand out clearly.]

3.4 Supervised Learning


In this chapter, we will focus on techniques used for supervised learning, specifically on classification
techniques.

3.4.1 Logistic Regression


In this section, we will focus on building a logistic regression model for predictive analysis.
Classification models are used when the response variable is qualitative and there are predetermined
classes. For a logistic regression model we will consider the case when the response variable is
binary. In general, the logistic model is given by,
$$\ln\left(\frac{p(y=1)}{1-p(y=1)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

Here, y is the outcome variable that takes the value 0 or 1, where 1 is typically the outcome of interest. This outcome is linked to a set of independent variables, $x_1, \dots, x_n$. The ratio $\frac{p(y=1)}{1-p(y=1)}$ is the odds that the outcome variable takes the value 1. To ensure a linear relationship between the outcome variable and the independent variables, we take the log of the odds. The probabilities from the logistic regression can be calculated as

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}$$
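For instance, the mapping from the linear predictor to a probability can be checked directly in R; the coefficient and predictor values below are made up:

b0 <- -2; b1 <- 0.5; x1 <- 3        # made-up coefficients and a single predictor value
eta <- b0 + b1 * x1                  # linear predictor (the log-odds)
exp(eta) / (1 + exp(eta))            # probability from the formula above
plogis(eta)                          # the same value using R's built-in inverse-logit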

The glm( ) function in R makes it easy for us to fit a logistic regression model. The general syntax for this model is:

glm(y ~ x1 + x2 + x3, data = my_dataset, family = "binomial")

OR

glm(y ~ ., data = my_dataset, family = "binomial")

where the first argument is the model formula, the second argument is the dataset and the third argument specifies that the response is binary (binomial family). Note that in the alternative code the model formula is written as “y ~ .”; this is shorthand for using all the variables in the dataset, except the response variable, as predictors.
The glm( ) function is used to build the model, and the predict( ) function is used to make predictions based on this model:

prob <- predict(model, new_dataset, type = "response")

The response variable in the dataset is categorical; however, the fitted model will return probabilities. To obtain a valid outcome we need to set a threshold value to classify each predicted outcome into a specific category. The norm is to use 0.5: for example, if a probability is greater than 0.5 then the observation is classified into group 1, and if it is less than or equal to 0.5 then it is classified into group 0.

This is made easy in R: using the ifelse( ) function, the probabilities are classified into the groups.

pred <- ifelse(prob > 0.50, 1, 0)

where the first argument is the logical test, the second argument is the value assigned if the test is true and the third argument is the value assigned if the test is false.

Dummy Variables
We have defined the response variable to be a binary (categorical) variable. This means that the
variable may be composed of characters such as gender, or default status for loan repayment. To
help represent these categories in a numeric form we introduce a dummy variable. A dummy
variable takes on the value 0 or 1 to indicate the absence or presence of a factor. For example, if
we consider the response variable to be the default status of a loan repayment then we can create a
dummy variable that takes on the value of 1 if the loan was not repaid and 0 if the loan was repaid.
When we build our model we will now use the dummy variable as our response variable.

Train/Test split
We will split the data set into a training set and test set. The logistic model will be built on the
training set and the performance will be evaluated using the test set.

Real world examples of binary classification problems


In case you were wondering what kind of problems you can use logistic regression for, here are some
real world binary classification problems:
• Spam detection: Predicting whether an email is spam or not.
• Credit card fraud: Predicting if a given transaction is fraud or not.
• Health: Predicting if a given mass of tissue is benign or malignant.
• Marketing: Predicting if a given user will buy a product or not.
• Banking: Predicting if a customer will default on a loan.

Example
Consider the Breast Cancer dataset. You would like to model and predict if a given specimen (row
in dataset) is benign or malignant, based on 9 other cell features. A preview of the dataset:
head(bc)

## X Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei


## 1 1 5 1 1 1 2 1
## 2 2 5 4 4 5 7 10
## 3 3 3 1 1 1 2 2
## 4 4 6 8 8 1 3 4
## 5 5 4 1 1 3 2 1
## 6 6 8 10 10 8 7 10
## Bl.cromatin Normal.nucleoli Mitoses Class
## 1 3 1 1 benign
## 2 3 2 1 benign
## 3 3 1 1 benign
## 4 3 7 1 benign
## 5 3 1 1 benign
## 6 9 7 1 malignant
Step 1: We need to define a dummy variable for the response variable as the data type is a
character. We will indicate a malignant response with a 1 and a benign response with 0. Also it is
important that we define this dummy variable to be a factor variable and not numeric.
# Change class values to 1's (malignant) and 0's (benign)
bc$Class <- ifelse(bc$Class == "malignant", 1, 0)
bc$Class <- factor(bc$Class, levels = c(0, 1))

Step 2: We randomly split the data into a training and test set (80/20 split).
# set split size
split <- round(nrow(bc) * 0.80)

# randomly sample row indices for the training set
train_ind <- sample(1:nrow(bc), split, replace = FALSE)

# Create train set
trainData <- bc[train_ind, ]
# Create test set
testData <- bc[-train_ind, ]

Step 3: Build a logistic regression model with Class being our response variable and the other 9
features the predictors.
# Build Logistic Model
logistic_mod <- glm(Class ~ ., family = "binomial", data = trainData)
summary(logistic_mod)

##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.78677 -0.05162 -0.02000 0.00474 2.48121
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.455260 1.937466 -5.396 6.8e-08 ***
## X -0.006133 0.002594 -2.364 0.018080 *
## Cl.thickness 0.588112 0.205214 2.866 0.004159 **
## Cell.size -0.045847 0.281516 -0.163 0.870631
## Cell.shape 0.410975 0.304455 1.350 0.177058
## Marg.adhesion 0.516888 0.182849 2.827 0.004701 **
## Epith.c.size -0.057226 0.213813 -0.268 0.788974
## Bare.nuclei 0.447618 0.127547 3.509 0.000449 ***
## Bl.cromatin 0.534924 0.245687 2.177 0.029462 *
## Normal.nucleoli 0.391077 0.163532 2.391 0.016783 *
## Mitoses 0.881714 0.388030 2.272 0.023070 *
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## (Dispersion parameter for binomial family taken to be 1)
##

## Null deviance: 697.81 on 545 degrees of freedom
## Residual deviance: 54.13 on 535 degrees of freedom
## AIC: 76.13
##
## Number of Fisher Scoring iterations: 9
Step 4: Predict the response variables for the test set.
pred <- predict(logistic_mod, newdata = testData, type = "response")

Step 5: Calculate the performance metrics. The confusionMatrix( ) function in R (from the “caret” package) provides the performance metric values. Note that for the confusion matrix, both input arguments need to be the same type of variable, i.e. they both need to be factors. Hence, we first convert the numeric predicted responses to factors.
# Recode factors
y_pred_num <- ifelse(pred > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels = c(0,1))
y_act <- testData$Class

# Performance metrics
confusionMatrix(data = y_pred, reference = testData$Class)

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 78 2
## 1 4 53
##
## Accuracy : 0.9562
## 95% CI : (0.9071, 0.9838)
## No Information Rate : 0.5985
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9094
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9512
## Specificity : 0.9636
## Pos Pred Value : 0.9750
## Neg Pred Value : 0.9298
## Prevalence : 0.5985
## Detection Rate : 0.5693
## Detection Prevalence : 0.5839
## Balanced Accuracy : 0.9574

##
## Positive Class : 0
##

3.5 Unsupervised Learning


In this chapter, we will explore unsupervised learning techniques. We will specifically focus on
K-means clustering.

3.5.1 K-means clustering


Clustering is a technique that groups samples such that objects within a group are similar (homogeneous) and objects between groups are dissimilar (heterogeneous). There are different types of clustering methods, for example, hierarchical clustering and k-means clustering.
For k-means clustering the objective is to divide the data into k clusters and assign each observation to one of the clusters. It is an iterative algorithm that calculates the centroids (means of the groups) of the k clusters and assigns each data point to the cluster whose centroid is closest to it. To fit a k-means model the following algorithm is used:
• Firstly, the number of clusters k is decided upon. An approach is to plot a scatter plot of the
data and choose k according to the number of distinct groups visible from the plot. Another
approach is to plot a scree plot to choose the optimal k value.
• Once k is selected, the “kmeans(data, centers = k, nstart = 20)” function in R is applied, where the data is specified as the first argument, the number of clusters is specified in the second argument, and nstart = 20 specifies the number of random starting configurations to try (the best solution is kept).
• The results will provide the centroids for each cluster as well as the classification of each
object in the dataset.
Once we obtain the model, the final step is to perform a model evaluation.

Example
Consider the dataset Iris. The dataset consists of five variables, however we will focus on Petal.Width
and Petal.Length. We would like to cluster these observations into categories that are similar
within clusters and different between clusters.
Step 1: Plot the data to try to identify the number of clusters.
ggplot(dat,aes(x = Petal.Length, y = Petal.Width)) + geom_point()

[Plot: scatter plot of Petal.Width against Petal.Length for the Iris data, showing clearly separated groups of points.]

From the plot, we could say that maybe 2 clusters will be a good fit. We can perform a k-means
clustering with k = 2 and test if it is a good fit using an error measure. Another method of
obtaining an optimal k is by using a scree plot. The scree plot shows the within-cluster sum of squares for different values of k. The optimal k is chosen as the point where an elbow occurs
in the plot. The ideal plot will have an elbow where the measure improves more slowly as the
number of clusters increases. This indicates that the quality of the model is no longer improving
substantially as the number of clusters increases. In other words, the elbow indicates the number
of clusters inherent in the data. A scree plot for the data set will produce the following plot:

[Plot: scree plot of the within cluster sum of squares against the number of clusters (k), with a clear elbow at k = 3.]

From the scree plot, we can conclude that the optimal number of clusters is 3, as the elbow occurs at this point.
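A minimal sketch of how such a scree plot can be produced, refitting k-means for k = 1, ..., 10 on the same two columns and recording the total within-cluster sum of squares:

wss <- numeric(10)
for (k in 1:10) {
  wss[k] <- kmeans(dat[, 3:4], centers = k, nstart = 20)$tot.withinss
}
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Within cluster sum of squares")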
Step 2: Now that the number of clusters has been determined, we can use k-means with 3 clusters
to classify the observations.
k_means <- kmeans(dat[,3:4], centers = 3, nstart = 20)
k_means

## K-means clustering with 3 clusters of sizes 50, 48, 52


##
## Cluster means:
## Petal.Length Petal.Width
## 1 1.462000 0.246000
## 2 5.595833 2.037500
## 3 4.269231 1.342308
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 2 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 3 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2
## [149] 2 2

##
## Within cluster sum of squares by cluster:
## [1] 2.02200 16.29167 13.05769
## (between_SS / total_SS = 94.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
• The cluster means return the centroid values for each cluster. The clustering vector classifies
each row of the dataset into a category.
• To evaluate the performance we calculate the within sum of squares and the between sum of squares; a short sketch of extracting these from the fitted object is given below.
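For example, using the fitted object above:

k_means$withinss                     # within-cluster sum of squares for each cluster
k_means$tot.withinss                 # total within-cluster sum of squares (WSS)
k_means$betweenss                    # between-cluster sum of squares (BSS)
k_means$betweenss / k_means$totss    # proportion of total variation explained (the 94.3% above)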
We can compare the clusters for each of the species:
table(k_means$cluster, dat$Species)

##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 2 46
## 3 0 48 4
A visual representation of the clustered observations is given below, where each cluster is shown in a different colour and the centroid of each cluster is indicated by a triangle.
plot(dat[, c("Petal.Length", "Petal.Width")], col = k_means$cluster)
points(k_means$centers[, c("Petal.Length", "Petal.Width")], col = 1:3, pch = 2, cex = 2)

[Plot: scatter plot of Petal.Width against Petal.Length, with points coloured by cluster and the three cluster centroids marked by triangles.]
