
Introduction to Programming

Module II: Data Structures in R

Vectors, Factors, Lists, Arrays, and Dataframes

Dr. Anuj Sharma


Information Systems Area
Outline
• Introduction
• Data and Data Structures
• Vectors
• Factors
• Matrices
• Arrays
• Data frames
• Subsetting
• Sorting Numeric, Character, and Factor Vectors
• Summary
Data Structures

• R supports virtually any type of data
• Numbers, characters, logicals (TRUE/FALSE)
• Arrays of virtually unlimited sizes/ dimensions
• Simplest: Vectors and Matrices
• Lists: Can Contain mixed type variables
• Data Frame: Rectangular Data Set with rows/columns
Data Structure in R

                        Linear              Rectangular          Higher
                        (One Dimensional)   (Two Dimensional)    Dimensional data
All Same Type           VECTORS             MATRIX               Array
(Homogeneous/Atomic)
Mixed                   LIST                DATA FRAME
(Heterogeneous)
In an R Session…

• First, read data from other sources
• Use packages, libraries, and functions
• Write functions wherever necessary
• Conduct Data Analysis
• Save outputs to files, write tables
• Save R workspace if necessary (exit prompt)
Types of Data Structures
Vectors, Matrices and Arrays
Vectors
• Vector is the most basic data structure in R that can
store multiple values
• It comes in two flavours: atomic vectors and lists
• They have three common properties:
– Type, typeof(): what is it?
– Length, length(): how many elements does it contain?
– Attributes, attributes(): additional arbitrary metadata
• Atomic vectors must share the same type
• On the contrary, elements that are present in a list can
have different data types
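The three properties above can be inspected directly. The following is a minimal sketch; the variable names grades and student are chosen here for illustration only.

```r
# An atomic vector: every element has the same type
grades <- c(100, 70, 90)
typeof(grades)       # "double"
length(grades)       # 3
attributes(grades)   # NULL, because no metadata has been attached

# A list: elements may have different types
student <- list(id = "R11", grade = 100, passed = TRUE)
typeof(student)      # "list"
length(student)      # 3
attributes(student)  # the names "id", "grade", "passed"
```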
Need of a Vector
• Suppose, we are grading an assignment of 12 students
• The grades for the assignment are as follows
100, 70, 90, 92, 94, 94, 94, 55, 20, 90, 98, 68
• How can we create a single variable that can contain
multiple values?
• To create vector, we use the combine function called c( )
• Then pass the values we want to store in the vector to
the c( ) function, separated by commas
grades <- c(100, 70, 90, 92, 94, 94, 94, 55, 20, 90, 98, 68)
grades
[1] 100 70 90 92 94 94 94 55 20 90 98 68
Vectors
• Vector: an ordered collection of data of the same type
• Vectors are usually created with c(), short for combine

> a = c(1,2,3)
> a*2
[1] 2 4 6

• Example: the heights of all 53 students in a class: a vector of 53 numbers
• In R, a single number is the special case of a vector with 1
element.
• Other vector types: character strings, logical
Atomic Vectors
• There are five common types of R Atomic
Vectors:
– Numeric Data Type
– Integer Data Type
– Character Data Type
– Logical Data Type
– Complex Data Type
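As a quick illustration of the types listed above, the sketch below creates one value of each and asks R for its class. The variable names are illustrative only.

```r
num <- 3.14     # numeric (stored as double)
int <- 42L      # integer: the L suffix requests an integer
chr <- "hello"  # character
lgl <- TRUE     # logical
cpx <- 2 + 3i   # complex

# class() reports the type of each value
classes <- sapply(list(num, int, chr, lgl, cpx), class)
classes
# [1] "numeric"   "integer"   "character" "logical"   "complex"
```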
Atomic Vectors- Five Classes
Vector.R
Getting Information out of Vectors
• x[1] identifies first element in vector x
• The number in the square brackets [ ] is the index
• The index of a vector begins with the number 1
• Aside: Most programming languages begin indexing at
0, not 1
• To extract a few elements of a vector you can use the c( )
combine function to indicate which elements you’d like
to extract
• Altering elements in a vector-replace only those
elements in the vector that need to be replaced
• We can replace values through reassignment
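A short sketch of extraction and replacement, reusing the grades vector from the earlier slide. The replacement values are made up for illustration.

```r
grades <- c(100, 70, 90, 92, 94, 94, 94, 55, 20, 90, 98, 68)

grades[1]           # first element: 100
grades[c(2, 8, 9)]  # several elements at once: 70 55 20

# Replace only the elements that need to change
grades[9] <- 60               # one element
grades[c(2, 8)] <- c(75, 65)  # two elements in one assignment
grades
# [1] 100  75  90  92  94  94  94  65  60  90  98  68
```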
Accessing Variables
• Subscripts essential tools
– x[1] identifies first element in vector x
– y[1,] identifies first row in matrix y
– y[,1] identifies first column in matrix y
• $ sign used for lists and dataframes
– myframe$age gets age variable of myframe
– attach(dataframe) -> extract by variable name
practice on vectors.R
Vectors Operations
To create a vector:
# c() command to create vector x
x=c(12,32,54,33,21,65)
# c() to add elements to vector x
x=c(x,55,32)
# seq() command to create sequence of numbers
years=seq(1990,2003)
# to count in steps of .5
a=seq(3,5,.5)
# can use : to step by 1
years=1990:2003
# rep() command to create data that follow a regular pattern
b=rep(1,5)
c=rep(1:2,4)

To perform operations:
# mathematical operations on vectors
y=c(3,2,4,3,7,6,1,1)
x+y; 2*y; x*y; x/y; y^2

To access vector elements:
# 2nd element of x
x[2]
# first five elements of x
x[1:5]
# all but the 3rd element of x
x[-3]
# values of x that are < 40
x[x<40]
# values of y such that x is < 40 (x and y must have the same length)
y[x<40]
Vectors Operations.R
Coercion .R
• Changing the mode of an object is often called 'coercion'
• The mode of an object can change without necessarily
changing the class
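The sketch below shows explicit coercion with the as.*() functions, implicit coercion inside c(), and a matrix whose mode changes while its class stays a matrix. The variable names are illustrative.

```r
x <- c(1, 0, 2)
as.character(x)  # "1" "0" "2"
as.logical(x)    # TRUE FALSE TRUE (zero is FALSE, non-zero is TRUE)

# Mixing types in c() coerces everything to the most flexible type
mixed <- c(1, "a", TRUE)
mixed            # "1" "a" "TRUE"

# A value that cannot be converted becomes NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("abc"))

# Mode changes, class does not: m is still a matrix afterwards
m <- matrix(1:4, nrow = 2)
mode(m) <- "character"
mode(m)                # "character"
inherits(m, "matrix")  # TRUE
```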
Factors
• Data falls into a limited number of categories very often
• For example, humans are either male or female
• In R, categorical data is stored in factors
• The term factor refers to a statistical data type used to
store categorical variables
• The difference between a categorical variable and a
continuous variable is that a categorical variable can
belong to a “limited number of categories”
• A continuous variable on the other hand, can
correspond to an infinite number of values
Example
• Factors are the main way to represent a nominal scale
variable
• If you created a variable called eyes, the possible
values of black, blue, green, hazel, etc. have no
ordering, rank, or true zero point
• We can use the as.factor( ) function to convert our
variable
Eyes<-c("black", "blue", "green", "hazel",
"gray")
Eyes_Fact<-as.factor(Eyes)
Creating Factors
• To create factors in R you make use of the function factor()
• First thing to do is create a vector that contains all the
observations belonging to a limited number of categories
• For example, gender vector contains the gender of 5 different
individuals:
gender.vector <- c("Male","Female","Female","Male","Male")
• It is clear, that there are 2 categories, or in R-terms factor
levels, here: "Male" and "Female"
• The function factor() will encode the vector as a factor:
factor.gender.vector <- factor(gender.vector)
factor.gender.vector
Factors()
• Tell R that a variable is nominal by making it a factor
• Nominal variables store values that have no
relationship between the different possibilities
(categories)
• The factor stores the nominal values as a vector of
integers in the range [ 1... k ]
• where k is the number of unique values in the
nominal variable, and
• an internal vector of character strings (the original
values) mapped to these integers
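To make the integer encoding concrete, the sketch below reuses the gender vector from the earlier slide and looks at both parts of the factor: the character levels and the integer codes.

```r
gender.vector <- c("Male", "Female", "Female", "Male", "Male")
factor.gender.vector <- factor(gender.vector)

# The internal character vector of unique values, sorted alphabetically
levels(factor.gender.vector)      # "Female" "Male"

# The integer codes in the range 1..k (here k = 2)
as.integer(factor.gender.vector)  # 2 1 1 2 2

# Count the observations in each category
counts <- table(factor.gender.vector)
counts
```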
Factors.R
Nominal and Ordinal Categorical Variables
• A nominal variable is a categorical variable without an implied
order
• This means it is impossible to say that one is worth more than the
other
• Think for example of the categorical variable hair.color.vector,
with the categories "Black", "Blonde", "White" and "Grey"
• Here, it is impossible to say one stands above or below the other
• Ordinal variables do have a natural ordering
• Example- the categorical variable temperature.vector with the
categories: "Low", "Medium" and "High"
• Here, it is easy to see that "Medium" stands above "Low", and
"High" stands above "Medium"
Matrices and Arrays
Matrices
• A matrix is a two-dimensional rectangular atomic
data set, so it can be created by passing a vector
to the matrix() function
• A matrix is a collection of values arranged into a
fixed number of rows and columns
• The matrix() function lays the elements of the input
vector out as rows and columns
• Because a matrix is atomic, the data elements must
all be of the same basic type
Matrices
• Matrix is similar to vector but additionally contains
the dimension attribute
• All attributes of an object can be checked with
the attributes() function
• Dimensions of a matrix can be checked directly with
the dim() function
• We can check if a variable is a matrix or not with
the class() function
• Matrix can be created using the matrix() function.
• Dimension of the matrix can be defined by passing
appropriate value for arguments nrow and ncol
Matrices
• All columns in a matrix must have the same mode (numeric,
character, etc.) and the same length
• The general format is-
mymatrix <- matrix(vector, nrow=r, ncol=c,
byrow=FALSE,
dimnames=list(char_vector_rownames,
char_vector_colnames))

• byrow=TRUE indicates that the matrix should be filled by rows.
• byrow=FALSE indicates that the matrix should be filled by columns
(the default)
• dimnames provides optional labels for the columns and rows
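The effect of byrow and dimnames is easiest to see side by side. The sketch below fills the same six values both ways; the row and column labels are invented for illustration.

```r
# Default: filled column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col[1, ]  # first row: 1 3 5

# byrow = TRUE: filled row by row, with labels attached
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                 dimnames = list(c("r1", "r2"), c("a", "b", "c")))
by_row[1, ]        # first row: 1 2 3
by_row["r2", "c"]  # element addressed by its labels: 6
dim(by_row)        # 2 3
```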
Matrix.R (Read from Code)
Matrices & Matrix Operations
To create a matrix:
# matrix() command to create matrix A with rows and cols
A=matrix(c(54,49,49,41,26,43,49,50,58,71), nrow=5, ncol=2)
B=matrix(1,nrow=4, ncol=4)

To access matrix elements:
# matrix_name[row_no, col_no]
A[2,1] # 2nd row, 1st column element
A[3,] # 3rd row
A[,2] # 2nd column of the matrix
A[2:4,c(3,1)] # submatrix of 2nd-4th rows of the 3rd and 1st columns
              # (requires a matrix with at least 3 columns)
A["KC",] # access row by name, "KC" (requires row names to be set)

Statistical operations:
rowSums(A)
colSums(A)
rowMeans(A)
colMeans(A)
# max of each column
apply(A,2,max)
# min of each row
apply(A,1,min)

Element by element ops (A and B must have the same dimensions):
2*A+3; A+B; A*B; A/B;

Matrix/vector multiplication (ncol(A) must equal nrow(B)):
A %*% B;
Matrix Operations.R
Functions for Vectors and Matrices
• Find # of elements or dimensions
• length(v), length(A), dim(A)
• Transpose
• t(v), t(A)
• Matrix inverse
• solve(A)
• Sort vector values
• sort(v)
• Statistics
• min(), max(), mean(), median(), sum(), sd(), quantile()
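A small worked example of the functions listed above. The matrix A and vector v are chosen here so the results are easy to check by hand.

```r
A <- matrix(c(2, 0, 0, 4), nrow = 2)  # a 2 x 2 diagonal matrix
dim(A)     # 2 2
length(A)  # 4, the total number of elements
t(A)       # transpose (equal to A here, since A is symmetric)

A_inv <- solve(A)  # inverse: diagonal entries 0.5 and 0.25
A %*% A_inv        # multiplying by the inverse gives the identity matrix

v <- c(3, 1, 2)
sort(v)    # 1 2 3
mean(v)    # 2
median(v)  # 2
sd(v)      # 1
```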
Arrays
• Arrays are multi-dimensional data structures
• In a three-dimensional array, data is stored as a stack
of matrices, each with rows and columns
• We can use the row index, column index, and matrix
level to access the array elements
• An array is created using the array() function
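A minimal sketch of array(), assuming a three-dimensional array of 2 rows, 3 columns, and 4 matrix levels. The name arr is illustrative.

```r
# 24 values arranged as four 2 x 3 matrices
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)      # 2 3 4

# Index order is [row, column, matrix level]
arr[2, 3, 1]  # row 2, column 3 of the first matrix: 6
arr[1, 1, 2]  # first element of the second matrix: 7

# Leaving an index empty selects everything along that dimension
second <- arr[, , 2]  # the whole second matrix (values 7 to 12)
dim(second)           # 2 3
```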
Lists
• Vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5

• List: an ordered collection of data of arbitrary types.
> emp = list(name="john", age=28, married=F)
> emp$name
[1] "john"
> emp$age
[1] 28
• Typically, vector elements are accessed by their index (an integer),
list elements by their name (a character string).
Lists
• List is a data structure having components of mixed
data types
• A vector having all elements of the same type is
called atomic vector but a vector having elements of
different type is called list
• We can check if it’s a list with typeof() function and
find its length using length()
• List can be created using the list() function
• Its structure can be examined with the str() function
Lists
• An ordered collection of heterogeneous objects (components)
• A list allows you to gather a variety of (possibly unrelated)
objects under one name

# example of a list with 4 components -
# a string, a number, a matrix, and a scalar
# (y is assumed to be an existing matrix)
w <- list(name="Fred", mynumbers=9696857468, creditcard=y,
age=53)

# example of combining two existing lists into one list
v <- c(list1, list2)

• Identify elements of a list using the [[]] convention
• mylist[[2]] # 2nd component of the list
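The sketch below reuses the emp list from the earlier slide to contrast the three access styles and the inspection functions mentioned above.

```r
emp <- list(name = "john", age = 28, married = FALSE)

typeof(emp)  # "list"
length(emp)  # 3
str(emp)     # compact display of each component and its type

emp[[2]]     # 2nd component itself: 28
emp$age      # same component, accessed by name: 28
emp["age"]   # single brackets return a list of length 1, not the value

age_value <- emp[["age"]]
age_sublist <- emp["age"]
```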
List.R (Read from Code)
Data frames
• Data frame: is supposed to represent the typical data table
that researchers come up with – like a spreadsheet
• It is a rectangular table with rows and columns; data
within each column has the same type (e.g. number, text,
logical), but different columns may have different types
• R’s data frames offer you a great first step by allowing you
to store your data in rectangular grids that are easy to survey
• Each row of these grids corresponds to observations or an
instance, while each column is a vector containing
measurements data for a specific variable
Data frames
• A data frame is used for storing data tables
• It is a list of vectors of equal length
• For example, the following variable df is a data frame
containing three vectors n, s, b.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b) # df is a data frame
Need of a Data Frame
• We have created two vectors, grades and ids:
grades <- c("A", "B", "A", "C", "D", "A", "F",
"B")
ids <- c("R11", "R01", "R05", "R08", "R09", "R22",
"R34", "R55")
• It would be more helpful if the grades were associated
with student ids
• Since we created both vectors we know that grades[1]
corresponds to ids[1], but there’s no real relationship at
this point
Need of a Data Frame
grades <- c("A", "B", "A", "C", "D", "A", "F",
"B")
ids <- c("R11", "R01", "R05", "R08", "R09",
"R22", "R34", "R55")
• If we re-ordered the data in ids to be alphabetical, ids[1]
would no longer correspond to the correct value in
grades[1]
• This is because any time we manipulate ids, grades is
unaffected.
• This can cause data integrity issues
• To reduce data integrity issues, we combine the grades
and ids together using the data.frame( ) function
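A minimal sketch of that combination: once both vectors live in one data frame, reordering the rows moves each grade along with its id. The names students and sorted are chosen here for illustration.

```r
grades <- c("A", "B", "A", "C", "D", "A", "F", "B")
ids <- c("R11", "R01", "R05", "R08", "R09", "R22", "R34", "R55")

# Each row now ties one id to one grade
students <- data.frame(ids, grades)

# Sort the rows alphabetically by id; grades travel with their ids
sorted <- students[order(students$ids), ]
head(sorted, 3)

# R11 still has grade "A" after sorting
r11_grade <- as.character(sorted$grades[sorted$ids == "R11"])
r11_grade
```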
Data frames
• The top line of the table, called the header, contains
the column names
• Each horizontal line afterward denotes a data row,
which begins with the name of the row, and then
followed by the actual data
• Each data member of a row is called a cell
• To retrieve data in a cell, we would enter its row and
column coordinates in the single square
bracket "[]" operator
• The two coordinates are separated by a comma
• In other words, the coordinates begin with the row
position, followed by a comma, and end with the
column position
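The row-comma-column convention can be sketched with the small data frame used elsewhere in these slides.

```r
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE))

df[2, 1]    # row 2, column 1: 3
df[2, "s"]  # a column name works in place of a position: "bb"
df[3, ]     # empty column position: the whole 3rd row
df[, "b"]   # empty row position: the whole b column

cell <- df[2, 1]
</gr_insert_placeholder>
```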
Characteristics of R Data Frame
• The column names should be non-empty
• The row names should be unique
• The data frame can hold the data which can
be a numeric, character or of factor type
• Each column should contain the same number
of data items
data frames.R
How To Change A Data Frame’s Row And
Column Names
• Data frames can also have a names attribute, by
which you can see the names of the variables that
you have included into your data frame.
• In other words, you can also set the header for your
data frame
n=c(2,3,5)
s=c("aa","bb","cc")
b=c(TRUE,FALSE,TRUE)
df=data.frame(n,s,b)
#df is a data frame
df
names(df)
names(df)<-c("Number","name","Decision")
df

Output of the final df:
  Number name Decision
1      2   aa     TRUE
2      3   bb    FALSE
3      5   cc     TRUE
How To Check A Data Frame’s Dimensions

• The data frame is similar to a matrix, which means that its size
is determined by how many rows and columns you have
combined into it.
• To check how many rows and columns you have in your data
frame, you can use the dim() function:
dim(df)
dim(df)[1] #Number of rows
dim(df)[2] #Number of columns
• You can also use the functions nrow() and ncol(), to retrieve
the number of rows or columns, respectively
nrow(df)
ncol(df)
Subset Data
• Using subset function
– subset() will subset the dataframe
• Subscripting from data frames
– myframe[,1] gives first column of myframe
• Specifying a vector
– myframe[1:5,] gives first 5 rows of data
• Using logical expressions
– myframe[myframe[,1] < 5,] gets all rows whose value in
the first column is less than 5
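The subsetting styles above, applied to a small made-up data frame. The column names score and name are assumptions for illustration.

```r
myframe <- data.frame(score = c(3, 8, 1, 6),
                      name = c("a", "b", "c", "d"))

myframe[, 1]    # first column: 3 8 1 6
myframe[1:2, ]  # first 2 rows

# Logical expression: keep rows whose first column is below 5
low <- myframe[myframe[, 1] < 5, ]
low

# subset() expresses the same filter by column name
low2 <- subset(myframe, score < 5)
```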
Difference between Dataframe and List

• A vector is a series of data indexed by position in a
single dimension
• A matrix is similar, but indexed in two dimensions (row and column)
• A data frame is [can be thought of as] a collection of
vectors that don't all have to be of the same data type
• A data frame has properties of both a matrix and a
list: for example, it has names(), nrow(), and ncol(), and
it also has length(), which is the length of the
underlying list (the number of columns)
• Data Frame is a list of vectors of equal length but of
usually different types
Flow Control in R

Dr. Anuj Sharma


Information Systems Area
Outline
• if statement in R
• Branching- if…else statement
• if…else Ladder
• ifelse() Function
• For Loops
• While Loops
• Repeat
• Next and Break
• Nested Loops
Flow Control in R
if statement in R
• Decision making is an important part of
programming
• This can be achieved in R programming using the
conditional if...else statement
• if statement
if (test_expression) {
statement
}

• If the test_expression is TRUE, the statement gets
executed. But if it’s FALSE, nothing happens
Example
x <- 5
if(x > 0){
print("Positive number")
}

Output

[1] "Positive number"


Branching- if…else statement

if (logical expression) {
statements
} else {
alternative statements
}

The else branch is optional and is only evaluated if the logical
expression is FALSE

It is important to note that else must be on the same line as the
closing brace of the if statement.
Flowchart of if…else statement

x <- -5
if(x > 0){
print("Positive number")
} else {
print("Negative number")
}

Output
[1] "Negative number"
Example: Check Odd and Even Number
# Program to check if the input number is odd or even
# A number is even if division by 2 gives a remainder of 0
# If remainder is 1, it is odd

num = as.integer(readline(prompt="Enter a number: "))

if((num %% 2) == 0) {
print(paste(num,"is Even"))
} else {
print(paste(num,"is Odd"))
}

Output 1 (copy the program into RStudio and run it)
Enter a number: 89
[1] "89 is Odd"
if…else Ladder
• The if…else ladder (if…else if…else) statement allows you to
execute one block of code among more than 2 alternatives
• The syntax of the if…else if…else statement is:
if ( test_expression1) {
statement1
} else if ( test_expression2) {
statement2
} else if ( test_expression3) {
statement3
…
} else {
statement4
}
• Only one statement will get executed, depending upon the
test_expressions
if…else Ladder
Example

x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else {
print("Zero")
}

Output

[1] "Zero"
Summary- Conditional Statements
• Perform different commands in different situations
• if (condition) command_if_true
• Can add else command_if_false to end
• Group multiple commands together with braces {}
• if (cond1) {cmd1; cmd2;} else if (cond2) {cmd3; cmd4;}
• Conditions use relational operators
• ==, !=, <, >, <=, >=
• Do not confuse = (assignment) with == (equality)
• = is a command, == is a question

• Combine conditions with and (&&) and or (||)
• Use & and | for vectors of length > 1 (element-wise)
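The difference between the scalar and element-wise operators can be sketched as follows. The values of x and v are made up for illustration.

```r
# && and || combine single TRUE/FALSE values, as needed inside if()
x <- 7
msg <- "other"
if (x > 0 && x %% 2 == 1) {
  msg <- "positive and odd"
}
msg

# & and | work element by element on whole vectors
v <- c(-1, 4, 7)
both <- v > 0 & v %% 2 == 1  # FALSE FALSE TRUE
either <- v < 0 | v > 5      # TRUE FALSE TRUE
```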
ifelse() Function
• This is a shorthand function to the traditional if…else
statement
ifelse(test_expression, x, y)
• Here, test_expression must be a logical vector (or
an object that can be coerced to logical)
• The return value is a vector with the same length
as test_expression.
a = c(5,7,2,9)
ifelse(a %% 2 == 0,"even","odd")

Output
[1] "odd" "odd" "even" "odd"
If else.R
Loops in R
• Loops are used in
programming to repeat a
specific block of code
Syntax
for (val in sequence)
{
statement
}

Here, sequence is a vector and val takes on each of its
values during the loop.
In each iteration, statement is evaluated.
Loops
• When the same or similar tasks need to be performed
multiple times; for all elements of a list; for all columns of an
array; etc

for(i in 1:10) {
print("Hello")
}

i=1
while(i<=10) {
print(i)
i=i+1
}
For Loop

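• For loops are used when you need to execute a
block of code a known number of times
• It has 3 steps-
– Initialization: the loop variable takes the first value of the sequence
– Checking the condition: is there another value left in the sequence?
– When the condition becomes false (no values left), the loop exits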
Loops
• Most common type of loop is the for loop
• for (x in v) { loop_commands; }
• v is a vector, commands repeat for each value in v
• Variable x becomes each value in v, in order
• Example: adding the numbers 1-10
• total = 0; for (x in 1:10) total = total + x;

• The other common type of loop is the while loop
• while (condition) { loop_commands; }
• The condition is written the same way as in an if statement
• Commands are repeated until the condition is false
• while loops are useful when you don’t know number of iterations
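The two loop types can be compared on small tasks. The for loop below is the 1-10 sum from the slide; the while loop is an invented example where the number of iterations is not known in advance.

```r
# for: the number of iterations is known (10)
total <- 0
for (x in 1:10) {
  total <- total + x
}
total  # 55

# while: keep halving until the value drops below 1
n <- 100
steps <- 0
while (n >= 1) {
  n <- n / 2
  steps <- steps + 1
}
steps  # 7
```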
while Loop
• “while loops” are used to loop until a
specific condition is met
while (test_expression)
{
Statement
}

• Here, test_expression is evaluated and
the body of the loop is entered if the
result is TRUE
• This is repeated each time
until test_expression evaluates to FALSE
Loops.R
repeat loop
• A repeat loop is used to iterate a
block of code multiple times
• There is no condition check in
repeat loop to exit the loop
• We must put a condition explicitly
inside the loop body and use
break statement to exit the loop

repeat {
statement
}

Read from loops.R


repeat loop
x <- 1
repeat {
  print(x)
  x = x+1
  if (x == 6){
    break
  }
}

Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
How these two statements-
while and repeat differ?
Difference between the repeat and
while statement
• A while loop tests its condition before entering the loop,
so the statements may never be executed
• A repeat loop has no entry test; the exit condition is checked
inside the body, after statements have been executed
• So these two statements are known as an entry control
loop and an exit control loop
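A small sketch of the difference, assuming a starting value for which the loop condition is already false. The counter names are illustrative.

```r
x <- 10

# Entry control: the condition is FALSE up front, so the body never runs
while_runs <- 0
while (x < 5) {
  while_runs <- while_runs + 1
  x <- x + 1
}

# Exit control: the body runs once before the exit check is reached
repeat_runs <- 0
repeat {
  repeat_runs <- repeat_runs + 1
  if (x >= 5) {
    break
  }
}

while_runs   # 0
repeat_runs  # 1
```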
break Statement
• A break statement is used inside a
loop (repeat, for, while) to stop the
iterations and flow the control
outside of the loop
• In a nested looping situation, where
there is a loop inside another loop,
this statement exits from the
innermost loop that is being
evaluated.
if (test_expression){
break
}

Read from loops.R


break Statement
# break statement

x <- 1:5
for (val in x) {
  if (val == 3){
    break
  }
  print(val)
}

Output
[1] 1
[1] 2
next Statement
• A next statement is useful when
we want to skip the current
iteration of a loop without
terminating it
• On encountering next, R skips the rest of
the current iteration
and starts the next iteration of the
loop
if (test_condition){
next
}

Read from loops.R


next Statement
# Next statement

x <- 1:5
for (val in x) {
  if (val == 3){
    next
  }
  print(val)
}

Output
[1] 1
[1] 2
[1] 4
[1] 5
Switch branching
• A switch statement is a selection control mechanism
that allows the value of an expression to change the
control flow of program execution via map and
search
• The switch statement is used in place of long if
statements which compare a variable with several
integral values
• It is a multi-way branch statement which provides
an easy way to dispatch execution for different parts
of code.
• This statement allows a variable to be tested for
equality against a list of values
Switch branching
vtr <- c(150,200,250,300,350,400)
option <-"min"
switch(option,
"mean" = print(mean(vtr)),
"min" = print(min(vtr)),
"max" = print(max(vtr)),
"median" = print(median(vtr))
)
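The same dispatch can be wrapped in a function with a fallback for unmatched options: in switch(), a final unnamed argument acts as the default. The function name summarise_vector is hypothetical, introduced here for illustration.

```r
# Hypothetical helper: returns the requested summary of a numeric vector
summarise_vector <- function(values, option) {
  switch(option,
         "mean" = mean(values),
         "min" = min(values),
         "max" = max(values),
         "median" = median(values),
         stop("unknown option: ", option))  # default branch
}

vtr <- c(150, 200, 250, 300, 350, 400)
summarise_vector(vtr, "min")     # 150
summarise_vector(vtr, "median")  # 275
```

Returning the value instead of printing it inside each branch lets the caller store or reuse the result.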
Nested Loops
• A nested loop is a loop placed inside the body of another loop
• The foreach package offers an alternative: its foreach loop is
similar to the standard for loop, which makes
it easy to convert a for loop to a foreach loop
• Unlike many parallel programming packages for
R, foreach doesn’t require the body of the for
loop to be turned into a function
• Its %:% operator can be called “a nesting operator” because it is
used to create nested foreach loops
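For comparison, a plain nested loop in base R needs no extra package. The sketch below is an illustration that uses ordinary for loops rather than foreach; it fills a small multiplication table, with the name tab chosen here.

```r
# Outer loop walks the rows, inner loop walks the columns
tab <- matrix(0, nrow = 3, ncol = 4)
for (i in 1:3) {
  for (j in 1:4) {
    tab[i, j] <- i * j
  }
}
tab
tab[3, 4]  # 12
```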
