
Managing and Understanding Data
R data structures
• There are numerous types of data structures across
programming languages, each with strengths and
weaknesses specific to particular tasks.
• Since R is a programming language used widely for
statistical data analysis, the data structures it utilizes are
designed to make it easy to manipulate data for this type of
work.
• The R data structures used most frequently in machine
learning are vectors, factors, lists, arrays, and data frames.
• Each of these data types is specialized for a specific data
management task, which makes it important to understand
how they will interact in your R project.
Vectors
• The fundamental R data structure is the vector, which stores an ordered
set of values called elements.
• A vector can contain any number of elements. However, all the elements
must be of the same type; for instance, a vector cannot contain both
numbers and text.
• There are several vector types commonly used in machine learning:
integer (numbers without decimals), numeric (numbers with decimals),
character (text data), or logical (TRUE or FALSE values).
• There are also two special values: NULL, which indicates the absence of
any value, and NA, which indicates a missing value.
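As a minimal sketch of the vector types listed above, the patient vectors used in the following examples could be created with c(). The values are taken from the CSV listing of this dataset later in the section; the variable name mixed is purely illustrative.

```r
# Three parallel vectors describing the same three patients
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")  # character
temperature  <- c(98.1, 98.6, 101.4)                        # numeric
flu_status   <- c(FALSE, FALSE, TRUE)                       # logical

class(subject_name)  # "character"
class(temperature)   # "numeric"
class(flu_status)    # "logical"

# Combining numbers and text does not raise an error:
# R coerces every element to one common type (here, character)
mixed <- c(1, "a")
class(mixed)         # "character"

# NA occupies a position in the vector; NULL has no length at all
length(c(98.1, NA))  # 2
length(NULL)         # 0
```

Note that the single-type rule is enforced by coercion rather than by an error, so an accidental mix of numbers and text silently produces a character vector.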
Because R vectors are inherently ordered, an element can be accessed by its
position in the set, counting from 1, written inside square brackets ([ and ])
after the name of the vector. For instance, to obtain the body temperature for
patient Jane Doe, which is element 2 in the temperature vector, simply type:
> temperature[2]
[1] 98.6
• R offers a variety of convenient methods for extracting data from vectors.
• A range of values can be obtained using the colon operator. For instance,
to obtain the body temperature of Jane Doe and Steve Graves, type:
> temperature[2:3]
[1] 98.6 101.4
• Items can be excluded by specifying a negative item number. To exclude Jane
Doe's temperature data, type:
> temperature[-2]
[1] 98.1 101.4
• Finally, it is also sometimes useful to specify a logical vector indicating
whether each item should be included. For example, to include the first two
temperature readings but exclude the third, type:
> temperature[c(TRUE, TRUE, FALSE)]
[1] 98.1 98.6
Factors
• Features that represent a characteristic with categories of
values are known as nominal.
• Although it is possible to use a character vector to store
nominal data, R provides a data structure known as a factor
specifically for this purpose.
• A factor is a special case of vector that is solely used for
representing nominal variables. In the medical dataset we are
building, we might use a factor to represent gender, because
it uses two categories: MALE and FEMALE.
• Why not use character vectors?
• An advantage of using factors is that they are generally more efficient
than character vectors because the category labels are stored only once.
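• Rather than storing MALE, MALE, FEMALE, the computer may store integer
codes such as 2, 2, 1, where each code points to a level (with the default
alphabetical ordering, FEMALE is level 1 and MALE is level 2). This can save
memory.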
• Additionally, certain machine learning algorithms use special routines to
handle categorical variables.
• Coding categorical variables as factors ensures that the model will treat
this data appropriately.
• To create a factor from a character vector, simply apply the
factor() function.
• For example:
> gender <- factor(c("MALE", "FEMALE", "MALE"))
> gender
[1] MALE FEMALE MALE
Levels: FEMALE MALE
• Notice that when the gender data was displayed, R printed
additional information indicating the levels of the gender
factor. The levels comprise the set of possible categories the
data could take, in this case MALE or FEMALE.
• When factors are created, we can add additional levels that
may not appear in the data.
• Suppose we added another factor for blood type as shown in
the following example:
> blood <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))
> blood
[1] O AB A
Levels: A B AB O
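The following sketch inspects the two factors defined above to show what the levels mean in practice. It uses only base R functions; the comments describe the default behaviour of factor(), which sorts levels alphabetically unless levels is given.

```r
gender <- factor(c("MALE", "FEMALE", "MALE"))

levels(gender)      # "FEMALE" "MALE"  (sorted alphabetically by default)
as.integer(gender)  # 2 1 2            (the integer codes stored internally)

# Declaring levels explicitly keeps categories that are absent from the data
blood <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))

nlevels(blood)      # 4
table(blood)        # B is reported with a count of 0
```

Keeping the unused level B means later summaries and models still know that B is a valid blood type, even though no patient in this small dataset has it.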
Lists
• Another special type of vector, a list, is used for
storing an ordered set of values.
• However, unlike a vector that requires all elements
to be the same type, a list allows different types of
values to be collected.
• Due to this flexibility, lists are often used to store
various types of input and output data and sets of
configuration parameters for machine learning
models.
• To illustrate lists, consider the medical patient dataset we have been constructing,
with data for three patients stored in five vectors. If we wanted to display all the
data on John Doe (subject 1), we would need to enter five R commands:
> subject_name[1]
[1] "John Doe"
> temperature[1]
[1] 98.1
> flu_status[1]
[1] FALSE
> gender[1]
[1] MALE
Levels: FEMALE MALE
> blood[1]
[1] O
Levels: A B AB O
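A list lets all of one patient's values be grouped and examined together instead of through five separate commands. The sketch below is one way to do that; the object name subject1 and the component name fullname are illustrative choices, not names defined elsewhere in these slides.

```r
# One list holding values of different types for a single patient
subject1 <- list(
  fullname    = "John Doe",
  temperature = 98.1,
  flu_status  = FALSE,
  gender      = factor("MALE", levels = c("FEMALE", "MALE")),
  blood       = factor("O", levels = c("A", "B", "AB", "O"))
)

length(subject1)                # 5 components

# $ and [[ ]] return the component itself
subject1$temperature            # 98.1
subject1[["fullname"]]          # "John Doe"

# Single brackets return a shorter list, not the value inside it
class(subject1["temperature"])  # "list"
```

The difference between [ ] and [[ ]] matters when the value is passed on to a function that expects a number or a string rather than a list.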
Data Frames
• By far the most important R data structure utilized in machine
learning is the data frame, a structure analogous to a
spreadsheet or database since it has both rows and columns
of data.
• In R terms, a data frame can be understood as a list of vectors
or factors, each having exactly the same number of values.
• Because the data frame is literally a list of vectors, it
combines aspects of both vectors and lists.
• Let's create a data frame for our patient
dataset. Using the patient data vectors we
created previously, the data.frame() function
combines them into a data frame:
> pt_data <- data.frame(subject_name, temperature, flu_status,
                        gender, blood, stringsAsFactors = FALSE)
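For a self-contained version of this step, the sketch below rebuilds the five patient vectors (values taken from the CSV listing later in this section) and then shows a few common ways to pull data out of the resulting data frame. Since R 4.0.0, stringsAsFactors already defaults to FALSE, so the argument mainly matters on older versions.

```r
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature  <- c(98.1, 98.6, 101.4)
flu_status   <- c(FALSE, FALSE, TRUE)
gender       <- factor(c("MALE", "FEMALE", "MALE"))
blood        <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))

pt_data <- data.frame(subject_name, temperature, flu_status, gender, blood,
                      stringsAsFactors = FALSE)

dim(pt_data)                # 3 rows, 5 columns
pt_data$subject_name        # one column, returned as a character vector
pt_data[2, "temperature"]   # row 2, column temperature: 98.6

# Rows where flu_status is TRUE, restricted to two columns
pt_data[pt_data$flu_status, c("subject_name", "temperature")]

str(pt_data)                # compact summary of column types
```

Rows are selected before the comma and columns after it, which mirrors the [row, column] convention used for matrixes.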
Matrixes and arrays
• In addition to data frames, R provides other structures that
store values in tabular form.
• A matrix is a data structure that represents a two-
dimensional table, with rows and columns of data.
• R matrixes can contain any single type of data, although they
are most often used for mathematical operations and
therefore typically store only numeric data.
• To create a matrix, simply supply a vector of data to the matrix() function,
along with a parameter specifying the number of rows (nrow) or number
of columns (ncol).
• For example, to create a 2x2 matrix storing the first four letters of the
alphabet, we can use the nrow parameter to request the data to be
divided into two rows:
> m <- matrix(c('a', 'b', 'c', 'd'), nrow = 2)
> m
     [,1] [,2]
[1,] "a"  "c"
[2,] "b"  "d"
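The output above shows that matrix() fills column by column. The short sketch below confirms this and contrasts it with row-wise filling; the name m_byrow is illustrative.

```r
# Default: values are loaded column by column
m <- matrix(c("a", "b", "c", "d"), nrow = 2)
m[1, 2]        # "c"
m[, 1]         # first column: "a" "b"

# byrow = TRUE loads the same values row by row instead
m_byrow <- matrix(c("a", "b", "c", "d"), nrow = 2, byrow = TRUE)
m_byrow[1, 2]  # "b"

dim(m)         # 2 2
```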
Types of Operators

We have the following types of operators in R programming:
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators
Live Demonstration
Relational Operators
• >
• <
• >=
• <=
• ==
• !=
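As a brief illustration, relational operators compare element by element and return a logical vector, which can in turn be used to select elements. The temperature values below are the patient readings used earlier in these slides.

```r
temperature <- c(98.1, 98.6, 101.4)

temperature > 100    # FALSE FALSE  TRUE
temperature <= 98.6  #  TRUE  TRUE FALSE
temperature == 98.6  # FALSE  TRUE FALSE
temperature != 98.6  #  TRUE FALSE  TRUE

# A comparison result can be used directly as a logical index
temperature[temperature >= 98.6]  # 98.6 101.4
```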
Logical Operators
• &
• |
• !
• &&
• ||
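The sketch below contrasts the element-wise operators (&, |, !) with the single-value forms (&& and ||), which are intended for conditions such as those in an if statement and stop evaluating as soon as the result is known. The variable names are illustrative.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b  #  TRUE FALSE FALSE  (element-wise AND)
a | b  #  TRUE  TRUE  TRUE  (element-wise OR)
!a     # FALSE  TRUE FALSE  (negation)

# || works on single values and short-circuits: because is.null(x) is TRUE,
# the right-hand side is never evaluated
x <- NULL
is.null(x) || x > 0  # TRUE
```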
Assignment Operator
• Left Assignment Operator: <- or = or <<-
• Right Assignment Operator: -> or ->>
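A small sketch of the assignment forms listed above. The names x, y, z, counter and increment are hypothetical; the point of the last part is that <<- assigns to a variable in an enclosing environment rather than creating a local one.

```r
x <- 5   # left assignment, the conventional form
y = 6    # also left assignment
7 -> z   # right assignment

counter <- 0
increment <- function() {
  # <<- updates the existing counter outside the function body
  counter <<- counter + 1
  invisible(counter)
}

increment()
increment()
counter  # 2
```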
Miscellaneous operators
• :
• %in%
• %*%
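The three miscellaneous operators can be tried as follows: the colon builds an integer sequence, %in% tests membership, and %*% performs matrix multiplication. The matrix A is an illustrative example.

```r
1:5                  # 1 2 3 4 5

3 %in% c(1, 3, 5)    # TRUE
4 %in% c(1, 3, 5)    # FALSE

# A has columns (1, 2) and (3, 4)
A <- matrix(1:4, nrow = 2)
A %*% diag(2)        # multiplying by the identity matrix returns A
A %*% A              # rows: 7 15 and 10 22
```

Note that A * A would multiply element by element instead, which is a different operation from %*%.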
Decision Making in R
• Decision making structures require the
programmer to specify one or more
conditions to be evaluated or tested by the
program, along with a statement or
statements to be executed if the condition is
determined to be true, and optionally, other
statements to be executed if the condition is
determined to be false.
Statement and Description
• if statement: An if statement consists of a Boolean
expression followed by one or more statements.
• if...else statement: An if statement can be followed
by an optional else statement, which executes when
the Boolean expression is false.
• switch statement: A switch statement allows a
variable to be tested for equality against a list of
values.
IF statement syntax
• Syntax
The basic syntax for creating an if statement in R is:
if (boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
}
If...Else Statement

• Syntax
• The basic syntax for creating an if...else statement in R is:
if (boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
} else {
   # statement(s) will execute if the boolean expression is false.
}
if...else if...else Statement

• Syntax
• The basic syntax for creating an if...else if...else statement in R is:
if (boolean_expression_1) {
   # Executes when boolean expression 1 is true.
} else if (boolean_expression_2) {
   # Executes when boolean expression 2 is true.
} else if (boolean_expression_3) {
   # Executes when boolean expression 3 is true.
} else {
   # Executes when none of the above conditions is true.
}
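The if, if...else and if...else if...else forms above can be seen together in one small example. The function name classify_temperature and the thresholds 100.4 and 99 are assumptions chosen for illustration only, not clinical guidance.

```r
# Hypothetical helper: label a body temperature given in degrees Fahrenheit
classify_temperature <- function(temp_f) {
  if (temp_f >= 100.4) {
    "fever"
  } else if (temp_f >= 99) {
    "elevated"
  } else {
    "normal"
  }
}

classify_temperature(98.1)   # "normal"
classify_temperature(99.5)   # "elevated"
classify_temperature(101.4)  # "fever"
```

Keeping else on the same line as the preceding closing brace is the safe layout: at the top level of a script or the console, an else that starts its own line is a syntax error. The condition must also be a single TRUE or FALSE; for whole vectors, ifelse() is the element-wise alternative.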
Switch statement
• Syntax
• The basic syntax for creating a switch
statement in R is −
switch(expression, case1, case2, case3, ...)
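A short sketch of switch() in both of its modes. When the expression is a character string, it is matched against the argument names and an unnamed final argument acts as the default; when it is a number, it selects an argument by position. The function name blood_description is hypothetical.

```r
blood_description <- function(blood_type) {
  switch(blood_type,
         "A"  = "type A",
         "B"  = "type B",
         "AB" = "type AB",
         "O"  = "type O",
         "unknown")  # unnamed last argument: used when nothing matches
}

blood_description("AB")  # "type AB"
blood_description("X")   # "unknown"

# Numeric expression: picks the second alternative
switch(2, "first", "second", "third")  # "second"
```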
Loops
Repeat Loop
• The basic syntax for creating a repeat loop in R is:
repeat {
   commands
   if (condition) {
      break
   }
}
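As a concrete instance of this pattern, the sketch below adds up the integers 1 to 5. A repeat loop has no condition of its own, so the break inside the if is the only way out; the variable names are illustrative.

```r
total <- 0
i <- 0

repeat {
  i <- i + 1
  total <- total + i
  if (i >= 5) {
    break  # without this the loop would never end
  }
}

total  # 15
```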
R functions
• A function is a set of statements organized together to
perform a specific task. R has a large number of built-in
functions, and users can create their own functions.
• In R, a function is an object, so the R interpreter can pass
control to the function, along with any arguments the
function needs to do its work.
• The function in turn performs its task and returns control to
the interpreter, along with any result, which may be stored in
other objects.
Function Definition
• An R function is created by using the
keyword function. The basic syntax of an R
function definition is as follows:
function_name <- function(arg_1, arg_2, ...) {
   Function body
}
Function Components
The different parts of a function are:
• Function Name: This is the actual name of the function. It is
stored in the R environment as an object with this name.
• Arguments: An argument is a placeholder. When a function
is invoked, you pass a value to the argument. Arguments are
optional; that is, a function may contain no arguments. Arguments
can also have default values.
• Function Body: The function body contains a collection of
statements that defines what the function does.
• Return Value: The return value of a function is the value of the
last expression evaluated in the function body, unless return()
is called explicitly.
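The four components can be pointed out in a small example. The function fahrenheit_to_celsius and its digits argument are hypothetical names introduced for illustration.

```r
# Name: fahrenheit_to_celsius
# Arguments: temp_f (required) and digits (optional, default value 1)
fahrenheit_to_celsius <- function(temp_f, digits = 1) {
  # Body: convert, then round
  celsius <- (temp_f - 32) * 5 / 9
  round(celsius, digits)  # last expression evaluated, so this is returned
}

fahrenheit_to_celsius(98.6)               # 37
fahrenheit_to_celsius(101.4, digits = 2)  # 38.56

# The function itself is an ordinary R object
is.function(fahrenheit_to_celsius)        # TRUE
```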
Built-in functions
• Simple examples of built-in functions
are seq(), mean(), max(), sum() and paste().
They are called directly by user-written
programs.
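Each of the built-in functions named above can be called directly, as in this short sketch. The temperature values are the patient readings used earlier in these slides.

```r
seq(1, 9, by = 2)     # 1 3 5 7 9

temperature <- c(98.1, 98.6, 101.4)
mean(temperature)     # the average of the three readings
max(temperature)      # 101.4

sum(1:10)             # 55
paste("Jane", "Doe")  # "Jane Doe" (joined with a space by default)
```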
User Defined functions
• We can create user-defined functions in R.
They are specific to what a user wants and
once created they can be used like the built-in
functions.
• Function with arguments
• Function without arguments
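A minimal sketch of both cases follows. The function names greet and squares_up_to are made up for this example.

```r
# Function without arguments
greet <- function() {
  "Hello from R"
}

# Function with one argument
squares_up_to <- function(n) {
  (1:n)^2
}

greet()               # "Hello from R"
squares_up_to(4)      # 1 4 9 16   (argument matched by position)
squares_up_to(n = 3)  # 1 4 9      (argument matched by name)
```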
Working with strings
• Live demonstration
Working with csv files
• Datasets for R: tinyurl.com/rdatasets
• Live demonstration
Managing data with R
• Saving and loading R data structures
When you have spent a lot of time getting a particular data frame into the
format that you want, you shouldn't need to recreate your work each
time you restart your R session.
To save a particular data structure to a file that can be reloaded later or
transferred to another system, you can use the save() function.
The save() function writes R data structures to the location specified by
the file parameter. R data files have the file extension .RData.
• If we had three objects named x, y, and z, we could save them
to a file mydata.RData using the following command:
> save(x, y, z, file = "mydata.RData")
• Regardless of whether x, y, and z are vectors, factors, lists, or
data frames, they will be saved to the file.
• The load() command will recreate any data structures that were
already saved to an .RData file. To load the mydata.RData
file we saved in the preceding code, simply type:
• > load("mydata.RData")
• This will recreate the x, y, and z data structures.
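The full round trip can be sketched as follows. To avoid leaving a file in the working directory, this version writes to R's temporary directory; the objects x, y and z are arbitrary examples of different structures.

```r
x <- c(1, 2, 3)                 # a numeric vector
y <- factor(c("a", "b"))        # a factor
z <- list(id = 1, ok = TRUE)    # a list

path <- file.path(tempdir(), "mydata.RData")
save(x, y, z, file = path)

# Remove the objects, then restore them from the file
rm(x, y, z)
load(path)

x         # 1 2 3
class(y)  # "factor"
z$ok      # TRUE
```

Note that load() restores objects under their original names, so it silently overwrites any existing objects called x, y or z.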
Importing and saving data from CSV files
• A CSV file representing the medical dataset
constructed previously would look as follows:
subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A
• To load this CSV file into R, the read.csv() function is used as
follows:
> pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)
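The sketch below makes this step reproducible by first writing the CSV content shown above to a temporary file, then reading it back and saving a modified copy with write.csv(). The paths and the output file name pt_data_out.csv are assumptions for illustration.

```r
# Write the example CSV to a temporary location
csv_path <- file.path(tempdir(), "pt_data.csv")
writeLines(c(
  "subject_name,temperature,flu_status,gender,blood_type",
  "John Doe,98.1,FALSE,MALE,O",
  "Jane Doe,98.6,FALSE,FEMALE,AB",
  "Steve Graves,101.4,TRUE,MALE,A"
), csv_path)

# The first line is treated as the header by default
pt_data <- read.csv(csv_path, stringsAsFactors = FALSE)
str(pt_data)  # text columns arrive as character, flu_status as logical

# Convert a nominal column to a factor after loading
pt_data$gender <- factor(pt_data$gender)

# Save the data frame back to CSV without the row-number column
out_path <- file.path(tempdir(), "pt_data_out.csv")
write.csv(pt_data, out_path, row.names = FALSE)
```

Reading text as character first and converting selected columns to factors afterwards keeps names such as subject_name from being treated as categories.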
