Ds Module2 24
Once you open an R script file, RStudio shows four panels. The console,
environment/history and files/plots panels are still there, and a new window at the
top left now holds the script file. You are now ready to write a script file or a
program in RStudio.
Writing Scripts in an R File
Writing a script in an R file is demonstrated here with an example of three lines of code.
The first line assigns the value 11 to a variable a. The second line evaluates a times 10
and assigns the result to b. The third statement, print(c(a, b)), concatenates a and b
into a vector and prints the result.
This is how a script file is written in R. After writing a script file, you need to
save it before execution.
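The three statements described above can be sketched as the following script. It only restates the description in the text, so nothing beyond a, b and the value 11 is assumed.

```r
# Assign 11 to a
a <- 11
# Multiply a by 10 and store the result in b
b <- a * 10
# Combine a and b into one vector and print it
print(c(a, b))
```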
Saving an R File
Let us see how to save the R file. In the File menu you can choose either Save or
Save As. If you click Save, the file is saved under the default name Untitled x,
where x is 1, 2 and so on, depending on how many R scripts you have already opened.
It is a better idea to use Save As, just below Save, so that you can name the script
file as you wish. Suppose we have clicked Save As. A window pops up in which you can
rename the script file, for example to test.R. After renaming, click the Save button
to save the script file.
So now, we have seen how to open an R script and how to write some code in the R
script file and save the file.
The next task is to execute the R file.
Execution of an R file
There are several ways to execute the commands in an R file.
Using the run command:
This "run" command can be executed from the GUI by pressing the Run button, or
with the shortcut key Ctrl + Enter.
What does it do?
It executes the line on which the cursor is placed.
Using the source command:
This "source" command can be executed from the GUI by pressing the Source
button, or with the shortcut key Ctrl + Shift + S.
What does it do?
It executes the whole R file and prints only the output that you explicitly asked
to print.
Using the source with echo command:
This "source with echo" command can be executed from the GUI by pressing the
Source with Echo button, or with the shortcut key Ctrl + Shift + Enter.
What does it do?
It executes the whole R file and prints the commands as well as the output.
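The two source options can also be tried from the console with the base R source() function. The sketch below is only an illustration: it writes a small script to a temporary file (the file and its contents are assumptions, not part of the original notes) and then runs it both ways.

```r
# Write a small illustrative script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("a <- 11", "b <- a * 10", "print(c(a, b))"), script_path)

# Run the whole file; only explicit print() output is shown
source(script_path)

# Run the whole file and echo each command before its output
source(script_path, echo = TRUE)
```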
R Data Types
A variable can store different types of values, such as numbers and characters.
These different types of data that we can use in our code are
called data types. R has six basic data types:
logical
numeric
integer
complex
character
raw
1. Logical Data Type
bool1 <- TRUE
print(bool1)
print(class(bool1))
bool2 <- FALSE
print(bool2)
print(class(bool2))
Output
[1] TRUE
[1] "logical"
[1] FALSE
[1] "logical"
is_weekend <- F
print(class(is_weekend)) # "logical"
2. Numeric Data Type
# floating point values
weight <- 63.5
print(weight)
print(class(weight))
# real numbers
height <- 182
print(height)
print(class(height))
Output
[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"
3. Integer Data Type
integer_variable <- 186L
print(class(integer_variable))
Output
[1] "integer"
Here, 186L is an integer data. So we get "integer" when we print the class
of integer_variable .
4. Complex Data Type
complex_value <- 3 + 2i
print(class(complex_value))
Output
[1] "complex"
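5. Character Data Type
# create a string variable (the value is illustrative)
fruit <- "Apple"
print(class(fruit))
# create a character variable (the value is illustrative)
my_char <- 'A'
print(class(my_char))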
Output
[1] "character"
[1] "character"
Here, both the variables - fruit and my_char - are of character data type.
6. Raw Data Type
A raw data type specifies values as raw bytes. You can use the following
methods to convert character data types to a raw data type and vice-versa:
charToRaw() - converts character data to raw data
rawToChar() - converts raw data to character data
For example,
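# convert character to raw
raw_variable <- charToRaw("Welcome to Programiz")
print(raw_variable)
print(class(raw_variable))
# convert raw back to character
char_variable <- rawToChar(raw_variable)
print(char_variable)
print(class(char_variable))
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"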
R Objects
In contrast to other programming languages like C and Java, variables in R
are not declared with a data type. Variables are assigned R-objects, and the
data type of the R-object becomes the data type of the variable. There are
many types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object, and there are six data types of
these atomic vectors, also termed the six classes of vectors. The other R-objects
are built upon the atomic vectors.
In R programming, the most basic data types are the R-objects called vectors, which
hold elements of the classes listed above. Note that in R the number of classes is
not confined to these six types. For example, we can combine atomic vectors to
create an array, whose class is array.
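As a minimal sketch of this point, the class of an object can be checked with class(). The variable names and values below are illustrative only.

```r
# An atomic vector of doubles
v <- c(1.5, 2.5, 3.5)
# An array built from an integer vector
arr <- array(1:24, dim = c(2, 3, 4))

class(v)     # "numeric"
class(arr)   # "array"
typeof(arr)  # "integer": the array is still built on an atomic vector
```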
Vectors
Vectors are the most basic R data objects and there are six types of atomic vectors.
They are logical, integer, double, complex, character and raw.
To combine a list of items into a vector, use the c() function and separate
the items with commas.
In the example below, we create a vector variable called fruits that combines
strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
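Example
# Vector of numerical values (the values are illustrative)
numbers <- c(1, 2, 3)
# Print numbers
numbers
Example
# Vector with numerical values in a sequence, created with the : operator
numbers <- 1:10
numbers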
You can also create numerical values with decimals in a sequence, but note
that if the last element does not belong to the sequence, it is not used:
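Example
# Sequence with decimals where the last element belongs to the sequence
numbers1 <- 1.5:6.5
numbers1
# Sequence with decimals where the last element (6.3) is not used
numbers2 <- 1.5:6.3
numbers2
Result:
[1] 1.5 2.5 3.5 4.5 5.5 6.5
[1] 1.5 2.5 3.5 4.5 5.5
Example
# Vector of logical values (the values are illustrative)
log_values <- c(TRUE, FALSE, TRUE, FALSE)
log_values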
Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")
length(fruits)
Sort a Vector
Example
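To sort the items of a vector alphabetically or numerically, the base R sort() function can be used. A minimal sketch with illustrative vectors:

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)

sort(fruits)   # sorts the strings alphabetically
sort(numbers)  # sorts the numbers in ascending order
```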
Example
Example
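Vector items are accessed by their index number inside [] brackets, and the first item has index 1. The sketch below uses an illustrative fruits vector and shows access to one item and to several items at once.

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access the first item
fruits[1]
# Access the first and third items by passing several indexes with c()
fruits[c(1, 3)]
```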
You can also use negative index numbers to access all items except the ones
specified:
Example
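A minimal sketch of negative indexing, using the same illustrative fruits vector:

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access all items except the first one
fruits[c(-1)]
```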
Change an Item
To change the value of a specific item, refer to the index number:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Change the first item ("banana") to "pear" (the new value is illustrative)
fruits[1] <- "pear"
# Print fruits
fruits
Repeat Vectors
To repeat vectors, use the rep() function:
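Example
# Repeat each value three times
repeat_each <- rep(c(1, 2, 3), each = 3)
repeat_each
Example
# Repeat the whole sequence three times
repeat_times <- rep(c(1, 2, 3), times = 3)
repeat_times
Example
# Repeat each value an independent number of times
repeat_indepent <- rep(c(1, 2, 3), times = c(5, 2, 1))
repeat_indepent
Example
# Create a sequence with the : operator
numbers <- 1:10
numbers
Example
# Make bigger steps in a sequence with seq()
numbers <- seq(from = 0, to = 100, by = 20)
numbers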
Note: The seq() function has three parameters: from is where the sequence
starts, to is where the sequence stops, and by is the interval of the sequence.
Lists
A list in R can contain many different data types inside it. A list is a collection
of data which is ordered and changeable.
Example
# List of strings
thislist <- list("apple", "banana", "cherry")
You can access the list items by referring to its index number, inside
brackets. The first item has index 1, the second item has index 2, and so on:
Example
thislist[1]
List Length
To find out how many items a list has, use the length() function:
Example
length(thislist)
Example
append(thislist, "orange")
To add an item to the right of a specified index, add " after=index number" in
the append() function:
Example
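A minimal sketch of the after argument of append(), with an illustrative list:

```r
thislist <- list("apple", "banana", "cherry")

# Add "orange" to the right of index 2 ("banana")
thislist <- append(thislist, "orange", after = 2)
thislist
```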
You can also remove list items. The following example creates a new,
updated list without an "apple" item:
Example
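One way to do this, sketched here with an illustrative list, is to select everything except the first index with a negative index number:

```r
thislist <- list("apple", "banana", "cherry")

# Create a new list without the first item ("apple")
newlist <- thislist[-1]
newlist
```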
Range of Indexes
You can specify a range of indexes by specifying where to start and where to
end the range, by using the : operator:
Example
Return the second, third, fourth and fifth item:
thislist <-
list("apple", "banana", "cherry", "orange", "kiwi", "melon", "mango")
(thislist)[2:5]
Note: The search will start at index 2 (included) and end at index 5
(included).
Example
for (x in thislist) {
print(x)
}
There are several ways to join two lists. The most common way is to use the
c() function, which combines the two lists together:
Example
list1 <- list("a", "b", "c")
list2 <- list(1,2,3)
list3 <- c(list1,list2)
list3
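A list can be converted to a vector with the unlist() function, so that its
elements can be used in arithmetic:
list1 <- list(1:5)
list2 <- list(10:14)
print(list2)
# Convert the lists to vectors
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Add the vectors
result <- v1 + v2
print(result)
Output
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19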
Matrices
A matrix is a two dimensional data set with columns and rows.
Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
Example
thismatrix
Example
thismatrix[1, 2]
The whole row can be accessed if you specify a comma after the number in
the bracket:
Example
thismatrix[2,]
The whole column can be accessed if you specify a comma before the
number in the bracket:
Example
thismatrix[,2]
More than one row can be accessed if you use the c() function:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
thismatrix[c(1,2),]
More than one column can be accessed if you use the c() function:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
thismatrix[, c(1,2)]
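Use the cbind() function to add additional columns to a Matrix:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
# Add a column (the new values are illustrative)
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix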
Note: The cells in the new column must be of the same length as the
existing matrix.
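Use the rbind() function to add additional rows to a Matrix:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
# Add a row (the new values are illustrative)
newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix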
Note: The cells in the new row must be of the same length as the existing
matrix.
Example
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"),
nrow = 3, ncol =2)
thismatrix
Example
Use the dim() function to find the number of rows and columns in a Matrix:
Example
dim(thismatrix)
Matrix Length
Use the length() function to find the total number of cells in a Matrix, that is,
the number of rows multiplied by the number of columns:
Example
length(thismatrix)
You can loop through a Matrix using a for loop. The loop will start at the first
row, moving right:
Example
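A minimal sketch of such a loop, assuming a small illustrative 2 x 2 matrix. The outer loop walks over the rows and the inner loop over the columns; the visited vector is only there to record the order of the visit.

```r
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

visited <- character(0)
for (rows in seq_len(nrow(thismatrix))) {
  for (columns in seq_len(ncol(thismatrix))) {
    # Print the current cell and remember the order of the visit
    print(thismatrix[rows, columns])
    visited <- c(visited, thismatrix[rows, columns])
  }
}
```

Because matrix() fills its cells column by column, the first row of this matrix holds "apple" and "cherry".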
Example
# Combine matrices
Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"), nrow = 2,
ncol = 2)
Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"), nrow
= 2, ncol = 2)
# Adding it as a rows
Matrix_Combined <- rbind(Matrix1, Matrix2)
Matrix_Combined
# Adding it as a columns
Matrix_Combined <- cbind(Matrix1, Matrix2)
Matrix_Combined
Arrays
Compared to matrices, arrays can have more than two dimensions.
We can use the array() function to create an array, and the dim parameter to
specify the dimensions:
Example
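The later examples in this section use a vector named thisarray and an array named multiarray created with dim = c(4, 3, 2). A minimal sketch of how they can be created, assuming the values 1 to 24 (the notes do not show the values, so this choice is an assumption):

```r
# A vector with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray

# An array with more than one dimension: 4 rows, 3 columns, 2 matrices
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
```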
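Example Explained
In a call such as array(thisarray, dim = c(4, 3, 2)), the first and second numbers
in dim give the number of rows and columns, and the last number gives how many
such matrices the array holds.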
You can access the array elements by referring to the index position. You can
use the [] brackets to access the desired elements from an array:
Example
multiarray[2, 3, 2]
You can also access the whole row or column from a matrix in an array, by
using the c() function:
# Access all the items from the first row from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[c(1),,1]
# Access all the items from the first column from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[,c(1),1]
A comma (,) before c() means that we want to access the column.
A comma (,) after c() means that we want to access the row.
Example
2 %in% multiarray
Use the dim() function to find the amount of rows and columns in an array:
Example
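A minimal sketch, assuming an array built from the values 1 to 24:

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

# Number of rows, columns and matrices
dim(multiarray)
```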
Array Length
Example
length(multiarray)
Example
for(x in multiarray){
print(x)
}
Factors
Factors are used to categorize data. Examples of factors are:
Demography: Male/Female
Music: Rock, Pop, Classic, Jazz
Training: Strength, Stamina
To create a factor, use the factor() function and add a vector as argument:
Example
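# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
# Print the factor
music_genre
Result:
[1] Jazz    Rock    Classic Classic Pop     Jazz    Rock    Jazz
Levels: Classic Jazz Pop Rock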
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
levels(music_genre)
Result:
[1] "Classic" "Jazz"    "Pop"     "Rock"
You can also set the levels, by adding the levels argument inside
the factor() function:
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))
levels(music_genre)
Result:
[1] "Classic" "Jazz"    "Pop"     "Rock"    "Other"
Factor Length
Use the length() function to find out how many items there are in the factor:
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
length(music_genre)
Result:
[1] 8
Access Factors
To access the items in a factor, refer to the index number, using [] brackets:
Example
Access the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
music_genre[3]
Result:
[1] Classic
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
# Change the value of the third item
music_genre[3] <- "Pop"
music_genre[3]
Result:
[1] Pop
Note that you cannot change the value of a specific item to a value that is not
already one of the levels of the factor. The following example produces a warning,
and the item becomes NA:
Example
Trying to change the value of the third item ("Classic") to an item that does not
exist/is not predefined ("Opera"):
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
music_genre[3] <- "Opera"
music_genre[3]
Result:
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "Opera") :
  invalid factor level, NA generated
However, if you have already specified it inside the levels argument, it will
work:
Example
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Opera"))
music_genre[3] <- "Opera"
music_genre[3]
Result:
[1] Opera
Data Frames
Data Frames are data displayed in a table format.
Data Frames can have different types of data inside them. While the first column
can be character, the second and third can be numeric or logical. However,
each column should have the same type of data.
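Example
# Create a data frame (the same one that is used later in this section)
Data_Frame <- data.frame(Training = c("Strength", "Stamina", "Other"), Pulse = c(100, 150, 120), Duration = c(60, 30, 45))
# Print the data frame
Data_Frame
Example
# Summarize the data
summary(Data_Frame)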
You will learn more about the summary() function in the statistical part of the R
tutorial.
Access Items
We can use single brackets [ ], double brackets [[ ]] or $ to access columns
from a data frame:
Example
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
Add Rows
Example
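Rows can be added with the rbind() function. The sketch below uses the Data_Frame from this section and an illustrative new row. Passing the new row as a one-row data frame is a choice made here so that the numeric columns stay numeric.

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Add a new row (the values are illustrative)
New_row_DF <- rbind(Data_Frame, data.frame(Training = "Strength", Pulse = 110, Duration = 110))
# Print the new data frame
New_row_DF
```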
Add Columns
Example
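Columns can be added with the cbind() function. A minimal sketch, where the Steps column and its values are illustrative assumptions:

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Add a new column named Steps
New_col_DF <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))
# Print the new data frame
New_col_DF
```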
Use the c() function to remove rows and columns in a Data Frame:
Example
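A minimal sketch of removing rows and columns with negative indexes built with c(), using the Data_Frame from this section:

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Remove the first row and the first column
Data_Frame_New <- Data_Frame[-c(1), -c(1)]
# Print the new data frame
Data_Frame_New
```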
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
dim(Data_Frame)
You can also use the ncol() function to find the number of columns
and nrow() to find the number of rows:
Example
ncol(Data_Frame)
nrow(Data_Frame)
Example
length(Data_Frame)
Example
Use the rbind() function to combine two or more data frames in R vertically, and
the cbind() function to combine them horizontally:
Example
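A minimal sketch of combining data frames with rbind() and cbind(). Apart from the first data frame, which matches the one used in this section, the data frame names and values are illustrative assumptions.

```r
Data_Frame1 <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
Data_Frame2 <- data.frame(
  Training = c("Stamina", "Stamina", "Strength"),
  Pulse = c(140, 150, 160),
  Duration = c(30, 30, 20)
)
# Vertical combination: the columns must match
Combined_rows <- rbind(Data_Frame1, Data_Frame2)

Data_Frame3 <- data.frame(
  Steps = c(3000, 6000, 2000),
  Calories = c(300, 400, 300)
)
# Horizontal combination: the number of rows must match
Combined_cols <- cbind(Data_Frame1, Data_Frame3)
```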
OPERATOR
An operator is a symbol that tells the interpreter to perform specific mathematical or
logical manipulations. The R language is rich in built-in operators and provides the
following types of operators.
Types of Operators
We have the following types of operators in R programming −
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language. The
operators act on each element of the vector.
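As a minimal sketch of element-wise arithmetic, the following uses two illustrative vectors v and t:

```r
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)

v + t    # addition
v - t    # subtraction
v * t    # multiplication
v / t    # division
v %% t   # remainder of the division
v %/% t  # integer quotient of the division
v ^ t    # exponentiation
```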
Relational Operators
Following table shows the relational operators supported by R language. Each
element of the first vector is compared with the corresponding element of the second
vector. The result of comparison is a Boolean value.
Operator: ==
Description: Checks if each element of the first vector is equal to the
corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
it produces the following result −
[1] FALSE FALSE FALSE TRUE
Operator: !=
Description: Checks if each element of the first vector is unequal to the
corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
it produces the following result −
[1] TRUE TRUE TRUE FALSE
Logical Operators
The following table shows the logical operators supported by the R language. They are
applicable only to vectors of type logical, numeric or complex. All non-zero numbers
are treated as the logical value TRUE, and zero is treated as FALSE.
Each element of the first vector is compared with the corresponding element of the
second vector. The result of comparison is a Boolean value.
Operator: &
Description: It is called the Element-wise Logical AND operator. It combines each
element of the first vector with the corresponding element of the second vector
and gives an output TRUE if both the elements are TRUE.
Example:
v <- c(3,1,TRUE,2+3i)
t <- c(4,1,FALSE,2+3i)
print(v&t)
it produces the following result −
[1] TRUE TRUE FALSE TRUE
Operator: !
Description: It is called the Logical NOT operator. It takes each element of the
vector and gives the opposite logical value.
Example:
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result −
[1] FALSE TRUE FALSE FALSE
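The logical operators && and || work on single values rather than element by element
and give a single-element result. Older versions of R silently used only the first
element of each vector; since R 4.3.0, passing a vector longer than one is an error,
so the first elements are selected explicitly here.
Operator: ||
Description: Called the Logical OR operator. It takes the first element of both
vectors and gives TRUE if one of them is TRUE.
Example:
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v[1] || t[1])
it produces the following result −
[1] FALSE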
Assignment Operators
These operators are used to assign values to vectors.
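A minimal sketch of the leftward and rightward assignment operators, with illustrative variable names. The operators <-, = and <<- assign to the name on the left, while -> and ->> assign to the name on the right.

```r
# Leftward assignment
v1 <- c(3, 1, 0)
v2 = c(3, 1, 0)
v3 <<- c(3, 1, 0)

# Rightward assignment
c(3, 1, 0) -> v4
c(3, 1, 0) ->> v5

print(v1)
print(v4)
```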
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or
logical computation.
Operator: %in%
Description: This operator is used to identify if an element belongs to a vector.
Example:
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE
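The readline() function reads a line typed by the user in the console. It returns
the input as a character string, so the input is converted when another type is
needed.
Syntax:
var = readline();
var = as.integer(var);
Note that one can use "<-" instead of "=".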
Example:
R
# R program to illustrate
var = readline();
var = as.integer(var);
print(var)
Output:
255
[1] 255
One can also show a message in the console window to tell the user what to
enter. To do this, use the prompt argument of the readline() function. The
prompt argument is optional.
Syntax:
var1 = readline(prompt = "Enter any number : ");
or,
var1 = readline("Enter any number : ");
Example:
R
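# R program to illustrate
# taking input with a prompt message
var = readline(prompt = "Enter any number : ");
# convert the inputted value to an integer
var = as.integer(var);
# print the value
print(var)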
Output:
Syntax:
var1 = readline("Enter 1st number : ");
var2 = readline("Enter 2nd number : ");
var3 = readline("Enter 3rd number : ");
var4 = readline("Enter 4th number : ");
or,
{
var1 = readline("Enter 1st number : ");
var2 = readline("Enter 2nd number : ");
var3 = readline("Enter 3rd number : ");
var4 = readline("Enter 4th number : ");
}
}
Example:
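# R program to illustrate
# using braces
{
var1 = readline("Enter 1st number : ");
var2 = readline("Enter 2nd number : ");
var3 = readline("Enter 3rd number : ");
var4 = readline("Enter 4th number : ");
}
var1 = as.integer(var1);
var2 = as.integer(var2);
var3 = as.integer(var3);
var4 = as.integer(var4);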
Output:
Taking string input works the same way as taking an integer. readline() always
returns its input as a character string, so no conversion is needed for strings.
R has no separate type for a single character: a single character is simply a
character string of length one. Calling as.character() on the input is therefore
harmless, but it is not required.
Syntax:
string:
var1 = readline(prompt = "Enter your name : ");
print(var1)
character:
var1 = readline(prompt = "Enter any character : ");
var1 = as.character(var1)
print(var1)
Example:
R
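# R program to illustrate
# string input
var1 = readline(prompt = "Enter your name : ");
# character input
var2 = readline(prompt = "Enter any character : ");
# convert to character
var2 = as.character(var2)
# printing values
print(var1)
print(var2)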
Output:
Syntax:
x = scan()
The scan() method takes input continuously. To terminate the input process,
press the Enter key 2 times on the console.
Example:
This is a simple way to take input using the scan() method: some integer
numbers are taken as input, and those values are printed on the next line of
the console.
R
# R program to illustrate
x = scan()
print(x)
Output:
1: 1 2 3 4 5 6
7: 7 8 9 4 5 6
13:
Read 12 items
[1] 1 2 3 4 5 6 7 8 9 4 5 6
Explanation:
A total of 12 integers are taken as input over 2 lines. When control reaches the
3rd line, pressing the Enter key 2 times terminates the input process.
Syntax:
x = scan(what = double())     # for double
x = scan(what = " ")          # for string
x = scan(what = character())  # for character
Example:
R
# R program to illustrate
# taking double, string and character input
d = scan(what = double())
s = scan(what = " ")
c = scan(what = character())
print(d) # double
print(s) # string
print(c) # character
Output:
Syntax:
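x = scan("fileDouble.txt", what = double())   # for double
x = scan("fileString.txt", what = " ")        # for string
x = scan("fileChar.txt", what = character())  # for character
Example:
# R program to illustrate
# reading string, double and character data from files
s = scan("fileString.txt", what = " ")
d = scan("fileDouble.txt", what = double())
c = scan("fileChar.txt", what = character())
print(s) # string
print(d) # double
print(c) # character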
Output:
Read 7 items
Read 5 items
Read 13 items
[1] "geek" "for" "geeks" "gfg" "c++" "java" "python"
[1] 123.321 523.458 632.147 741.250 855.360
[1] "g" "e" "e" "k" "s" "f" "o" "r" "g" "e" "e" "k" "s"
Save the data file in the same location where the program is saved for easier
access. Otherwise, the full path of the file needs to be given inside
the scan() method.
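Reading from a file with scan() can be tried without preparing any files by hand. The sketch below is an illustration under stated assumptions: it writes a few numbers to a temporary file (the file and its contents are made up here) and reads them back, and it also uses the text argument of scan() to read words from a string.

```r
# Create a temporary data file with five numbers
path <- tempfile(fileext = ".txt")
writeLines("123.321 523.458 632.147 741.25 855.36", path)

# Read the numbers back as doubles
d <- scan(path, what = double())
print(d)

# Read whitespace-separated words from a string instead of a file
words <- scan(text = "geek for geeks", what = character())
print(words)
```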
Control Structures in R
R provides different control structures that can be used on their own and
even in combinations to control the flow of the program. These control
structures are:
1. If – else
2. ifelse() function
3. Switch
4. For loops
5. While loops
6. Break statement
7. Next statement
8. Repeat loops
Let’s take a look at these structures one at a time:
1. if – else
The if-else statements in R enforce conditional execution of code. They are an
important part of R's decision-making capability and allow us to make a decision
based on the result of a condition. The if statement contains a condition that
evaluates to a logical output. It runs the enclosed code block if the condition
evaluates to TRUE. It skips the code block if the condition evaluates
to FALSE.
We can use the if statement on its own like:
Code:
a <- 5
b <- 6
if(a<b){
print("a is smaller than b")  # illustrative message
}
Output:
[1] "a is smaller than b"
We use the else statement with the if statement to enact a choice
between two alternatives. If the condition within the if statement evaluates
to FALSE, it runs the code within the else statement. For example:
Code:
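a <- 5
b <- 6
if(a>b){
print("a is greater than b")  # illustrative messages
} else{
print("b is greater than a")
}
Output:
[1] "b is greater than a"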
We can use the else if statement to select between multiple options. For
example:
Code:
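a <- 5
b <- 5
if(a<b){
print("a is smaller than b")  # illustrative messages
} else if(a==b) {
print("a is equal to b")
} else {
print("a is greater than b")
}
Output:
[1] "a is equal to b"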
2. ifelse() Function
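The ifelse() function acts like the if-else structure, but it works element by
element on a vector. The following is the syntax of the ifelse() function in R:
ifelse(test, yes, no)
It returns the value from yes where test is TRUE and the value from no where test
is FALSE.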
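A minimal sketch of ifelse() applied to a vector. The vector and the labels "big" and "small" are illustrative.

```r
a <- c(3, 8, 5, 12)

# For each element: "big" if it is greater than 5, otherwise "small"
result <- ifelse(a > 5, "big", "small")
print(result)
```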
3. switch
The switch is an easier way to choose between multiple alternatives than
multiple if-else statements. The R switch takes a single input argument and
executes a particular code based on the value of the input. Each possible
value of the input is called a case. For example:
Code:
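a <- 4
# Illustrative cases: a numeric input selects the case by its position
switch(a,
"first case",
"second case",
"third case",
"fourth case")
Output:
[1] "fourth case"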
4. for loops
The for loop in R iterates over a sequence to perform repeated tasks. It uses a
loop variable that takes each value of the sequence in turn. The following
is the syntax of for loops in R:
for(variable in sequence){
Code_to_repeat
}
Code:
vec <- c(1:10)  # illustrative vector
for(i in vec){
print(vec[i])
}
Output:
5. while Loops
The while loop in R evaluates a condition. If the condition evaluates
to TRUE it loops through a code block, whereas if the condition evaluates
to FALSE it exits the loop. The while loop in R keeps looping through the
enclosed code block as long as the condition is TRUE. This can also result in
an infinite loop sometimes which is something to avoid. The while loop’s
syntax is as follows:
while(condition){
code_to_run
}
Code:
i <- 0
while(i<10){
print(i)
i <- i+1
}
Output:
6. break Statement
The break statement can break out of a loop. Imagine a loop searching a
specific element in a sequence. The loop needs to keep going until either it
finds the element or until the end of the sequence. If it finds the element
early, further looping is not needed. In such a case, the R break statement
can “break” us out of the loop early. For example:
Code:
vec <- c(1:10)  # illustrative vector
for(i in vec){
if(i==7){
print("break!!")
break
}
print(i)
}
Output:
7. next Statement
The next statement in R causes the loop to skip the current iteration and
start the next one. For example:
Code:
vec <- c(1:10)  # illustrative vector
for(i in vec){
if(i==5 || i==7){
print("next!!")
next
}
print(i)
}
Output:
8. repeat loop
The repeat loop in R initiates an infinite loop from the get-go. The only
way to get out of the loop is to use the break statement. The repeat loop is
useful when you don’t know the required number of iterations. For
example:
Code:
x <- 15
i <- 1
repeat{
if(i == x){
print("found it!!")
break
}
print("not found!")
i <- i+1
}
Output:
R Functions
A function is a block of code which only runs when it is called.
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() { # create a function with the name my_function
print("Hello World!")
}
Call a Function
To call a function, use the function name followed by parenthesis,
like my_function():
Example
my_function <- function() {
print("Hello World!")
}
my_function() # call the function named my_function
Arguments
Information can be passed into functions as arguments.
Arguments are specified after the function name, inside the parentheses. You
can add as many arguments as you want, just separate them with a comma.
The following example has a function with one argument (fname). When the
function is called, we pass along a first name, which is used inside the
function to print the full name:
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
Parameters or Arguments?
The terms "parameter" and "argument" can be used for the same thing:
information that is passed into a function.
Number of Arguments
By default, a function must be called with the correct number of arguments.
Meaning that if your function expects 2 arguments, you have to call the
function with 2 arguments, not more, and not less:
Example
This function expects 2 arguments, and gets 2 arguments:
my_function <- function(fname, lname) paste(fname, lname)
my_function("Peter", "Griffin")
If you try to call the function with 1 or 3 arguments, you will get an error:
Example
This function expects 2 arguments, and gets 1 argument:
my_function <- function(fname, lname) paste(fname, lname)
my_function("Peter")
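Default Parameter Value
If we call a function without an argument, it uses the default value of the
parameter:
Example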
my_function <- function(country = "Norway") {
paste("I am from", country)
}
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
Return Values
To let a function return a result, use the return() function:
Example
my_function <- function(x) {
return (5 * x)
}
print(my_function(3))
print(my_function(5))
print(my_function(9))
Nested Functions
There are two ways to create a nested function: call a function within another
function, or write a function within a function.
Example
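Call a function within another function:
Nested_function <- function(x, y) x + y
Nested_function(Nested_function(2,2), Nested_function(3,3))
Example Explained
The function tells x to add y. The first input, Nested_function(2,2), is "x" of
the main function, and the second input, Nested_function(3,3), is "y" of the main
function. The output is therefore (2+2) + (3+3) = 10.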
Example
Write a function within a function:
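A minimal sketch of a function written inside another function, using the names Outer_func and Inner_func from the explanation in this section. The values 3 and 5 are illustrative.

```r
Outer_func <- function(x) {
  # Inner_func is defined inside Outer_func and can use its argument x
  Inner_func <- function(y) {
    a <- x + y
    return(a)
  }
  return(Inner_func)
}

# Calling Outer_func returns the inner function with x fixed to 3
output <- Outer_func(3)
# Call the returned function with y = 5
output(5)
```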
Example Explained
You cannot directly call the function because the Inner_func has been
defined (nested) inside the Outer_func.
We then print the output with the desired value of "y", which in this case is
5.
Recursion
R also accepts function recursion, which means a defined function can call
itself.
The developer should be very careful with recursion as it can be quite easy to
slip into writing a function which never terminates, or one that uses excess
amounts of memory or processor power. However, when written correctly,
recursion can be a very efficient and mathematically-elegant approach to
programming.
For a new developer it can take some time to work out exactly how this
works. The best way to find out is by testing and modifying it.
Example
tri_recursion <- function(k) {
if (k > 0) {
result <- k + tri_recursion(k - 1)
print(result)
} else {
result = 0
return(result)
}
}
tri_recursion(6)
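date() Function
date() function is used to return the current system date and time as a character
string.
Syntax: date()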
Parameters:
Does not accept any parameters
Ex: date()
[1] "Sun Mar 17 22:11:39 2024"
Sys.Date() Function
Sys.Date() function is used to return the system's date.
Syntax: Sys.Date()
Parameters:
Does not accept any parameters
Ex: Sys.Date()
[1] "2024-03-17"
Sys.time()
Sys.time() function is used to return the system's date and time.
Syntax: Sys.time()
Parameters:
Does not accept any parameters
Ex: Sys.time()
[1] "2024-03-17 22:09:02 IST"
Sys.timezone()
Sys.timezone() function is used to return the current time zone.
Syntax: Sys.timezone()
Parameters:
Does not accept any parameters
Ex: Sys.timezone()
[1] "Asia/Calcutta"
Times in R are represented by the POSIXct or POSIXlt class and dates are
represented by the Date class. The as.Date() function handles dates in R
without time. This function takes the date as a string in the format YYYY-MM-DD
or YYYY/MM/DD and internally represents it as the number of days since
1970-01-01. Times are stored internally as the number of seconds since
1970-01-01.
Following are the Date formats that are used to specify the date, I will use these
in the examples below.
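A few commonly used format codes can be tried with the base R format() function. This is a minimal sketch with an illustrative date, limited to numeric codes so that the result does not depend on the language settings of the system.

```r
d <- as.Date("2024-03-17")

format(d, "%d")        # day of the month as a two-digit number
format(d, "%m")        # month as a two-digit number
format(d, "%Y")        # year with century
format(d, "%y")        # year without century
format(d, "%d/%m/%Y")  # codes combined into a custom format
```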
The default input format for Date consists of the year, followed by the month
and day, separated by slashes or dashes.
# Date examples
x <- as.Date("1970-01-01")
y <- as.Date("1970-01-02")
print(x)
print(y)
Yields the output below. Note that typeof(x) returns "double"
and class(x) returns "Date".
> x <- as.Date("1970-01-01")
> y <- as.Date("1970-01-02")
> print(x)
[1] "1970-01-01"
> print(y)
[1] "1970-01-02"
To check the internal value of these dates, use the unclass() function.
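A minimal sketch of that check, using the two dates from the example above:

```r
x <- as.Date("1970-01-01")
y <- as.Date("1970-01-02")

# Number of days since 1970-01-01
unclass(x)
unclass(y)
```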
If your input dates are not in the standard format, you have to specify the
format as shown below.
z <- as.Date("01/01/1970",format='%m/%d/%Y')
print(z)
> z <- as.Date("01/01/1970",format='%m/%d/%Y')
> print(z)
[1] "1970-01-01"
2. Times in R
Times are stored internally as the number of seconds since 1970-01-01. Times in R
are represented either by the POSIXct or the POSIXlt class. Let’s see with
examples what each time class brings to us.
POSIXct stores the time as a single large number, whereas POSIXlt stores the time
as a list of values such as day, month, week and year. This comes in handy if you
want to get a specific part of the time.
The default input format for POSIX dates consists of the year, followed by the
month and day, separated by slashes or dashes; for time values, the date may be
followed by white space and a time in the form hour:minutes:seconds or
hour:minutes followed by timezone.
2.1 POSIXct Time Class
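The as.POSIXct() function takes a date and time as input and returns a value of
class POSIXct, which internally represents the time as a single large number.
Use unclass() to check this value. Note that the trailing "PST" in the input
string below is not parsed as a time zone: the time is interpreted in the time
zone of the session (IST in the output shown), and a time zone is set explicitly
with the tz argument.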
# Using POSIXct
timect <- as.POSIXct("2022-11-08 22:14:35 PST")
print(timect)
class(timect)
unclass(timect)
Output:
> timect <- as.POSIXct("2022-11-08 22:14:35 PST")
> print(timect)
[1] "2022-11-08 22:14:35 IST"
> class(timect)
[1] "POSIXct" "POSIXt"
> unclass(timect)
[1] 1667925875
attr(,"tzone")
[1] ""
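To make the result independent of the time zone of the session, the tz argument of as.POSIXct() can be set. A minimal sketch, assuming UTC as the time zone for illustration:

```r
# Same clock time as above, but explicitly interpreted as UTC
t_utc <- as.POSIXct("2022-11-08 22:14:35", tz = "UTC")
print(t_utc)

# Seconds since 1970-01-01 00:00:00 UTC
as.numeric(t_utc)
```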
2.2 POSIXlt Time Class
The as.POSIXlt() function also takes the date and time as a string and
returns a value of class POSIXlt, which internally stores the date and
time parts as a list containing day, week, month, year, hour, minute,
second and so on. You can check these values by calling unclass().
# Using as.POSIXlt
timelt <- as.POSIXlt("2022-11-08 22:14:35 PST")
print(timelt)
class(timelt)
unclass(timelt)
Yields below output.
> timelt <- as.POSIXlt("2022-11-08 22:14:35 PST")
> print(timelt)
[1] "2022-11-08 22:14:35 IST"
> class(timelt)
[1] "POSIXlt" "POSIXt"
> unclass(timelt)
$sec
[1] 35
$min
[1] 14
$hour
[1] 22
$mday
[1] 8
$mon
[1] 10
$year
[1] 122
$wday
[1] 2
$yday
[1] 311
$isdst
[1] 0
$zone
[1] "IST"
$gmtoff
[1] NA
attr(,"tzone")
[1] "" "IST" "+0630"
attr(,"balanced")
[1] TRUE
You can perform mathematical operations such as + and - on dates and times,
and you can also compare them with ==, >, < and so on. Following are some
examples.
In the example below, one date is subtracted from another, which gives the
difference in number of days.
# Date Diff
dateDiff <- as.Date("2021-01-01") - as.Date("2020-01-01")
print(dateDiff)
# Output
# Time difference of 366 days
3.2 Subtract Times in R
Now let's subtract one time from another; the result is a decimal
value.
#Time Diff
x <- as.POSIXlt("2022-11-08 03:14:35 PST")
y <- as.POSIXlt("2022-11-09 00:00:00 PST")
timeDiff <- y - x
print(timeDiff)
# Output
# Time difference of 20.75694 hours
3.3 Extract Parts of Date & Time
Since POSIXlt stores the time as a list with one element per date and time
field, you can use the $ operator on the object to get the values. The fields
you can use are “sec”, “min”, “hour”, “mday”, “mon”, “year”, “wday”, “yday”,
“isdst”, “zone”, and “gmtoff”.
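The code for this example did not survive in these notes. A minimal sketch that yields values like those in the output below, assuming the same timestamp used in section 2.2 (the variable name timelt and the tz = "UTC" argument are choices made here so the result does not depend on the local time zone):

```r
# Parse the timestamp in UTC so the fields are the same on every machine
timelt <- as.POSIXlt("2022-11-08 22:14:35", tz = "UTC")

# Extract individual fields with the $ operator
print(timelt$sec)    # seconds
print(timelt$wday)   # day of the week, where 0 = Sunday

# Note: timelt$mon counts months from 0 and timelt$year counts years
# from 1900, so add 1 and 1900 respectively to get calendar values.
```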
# Output
#[1] 35
#[1] 2
3.4 Add Days to Dates
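The original code for this step is missing from these notes. One way to get the result shown in the output below is to add an integer number of days to a Date object; the starting date here is an assumption chosen to match that output:

```r
# Adding an integer to a Date moves it forward by that many days
startDate <- as.Date("2021-01-01")   # assumed starting date
newDate <- startDate + 3
print(newDate)
```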
# Output
#[1] "2021-01-04"
4. Dates & Times Functions in R
4.1 Sys.time()
Sys.time() returns the current system date and time in a format such as
“2022-11-09 20:05:17 PST”. The result has the classes “POSIXct” and “POSIXt”.
x <- Sys.time()
print(x)
class(x)
> x <- Sys.time()
> print(x)
[1] "2024-03-17 23:18:06 IST"
> class(x)
[1] "POSIXct" "POSIXt"
4.2 Find Interval Between Dates
If you have a vector of dates or times, you can use the diff() function to get
the differences between consecutive elements. The result of the example below
is the difference between the first and second dates and the difference
between the second and third dates.
# Differences
datesVec <- as.Date(c("2020-04-21", "2021-06-30", "2021-11-04"))
diff(datesVec)
# Output
#Time differences in days
#[1] 435 127
4.3 Generate Sequence of Dates
By using seq() function you can generate the sequence of dates with the
specified length. The below example generates 5 dates by month difference.
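The call itself is missing from these notes. A sketch that would produce the five monthly dates listed in the output below, assuming the start date 2020-04-21 that appears there:

```r
# Generate 5 dates, one month apart, starting from an assumed start date
datesSeq <- seq(as.Date("2020-04-21"), by = "month", length.out = 5)
print(datesSeq)
```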
# Output
#[1] "2020-04-21" "2020-05-21" "2020-06-21" "2020-07-21" "2020-08-21"
4.4 Truncate Date & Time
The trunc() function is used to truncate date and time values. The examples
below demonstrate truncation to minutes, days, and years.
#truncate
x <- as.POSIXlt("2022-11-08 03:14:35 PST")
trunc(x, "mins")
trunc(x, "days")
trunc(x, "year")
# Output
[1] "2022-11-08 03:14:00 PST"
[1] "2022-11-08 PST"
[1] "2022-01-01 PST"
4.5 strptime()
If you have dates and times in a non-standard format, use strptime() to
convert them to the POSIXlt class. This function takes a character vector of
dates and times, together with a format string, and returns a POSIXlt object.
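As a minimal sketch, the example below parses a day/month/year string. The input string, the variable names, and the choice of tz = "UTC" are illustrative assumptions, not taken from the original notes:

```r
# A timestamp written as day/month/year, which as.POSIXlt() would not
# recognise without a format string
raw <- "08/11/2022 22:14:35"

# %d = day, %m = month, %Y = 4-digit year, %H:%M:%S = time of day
parsed <- strptime(raw, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

print(parsed)
class(parsed)
```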
Scoping Rules in R
The scope of a variable is the part of a program in which the variable can be
found and accessed.
Global variables are variables that exist throughout the execution of a program.
They can be changed and accessed from any part of the program, both inside and
outside functions.
s = "software"
fun1 = function() {
  # s is not defined inside the function, so R finds the global s
  paste("R is", s)
}
fun1()
Output:
"R is software"
Global variables can also be created inside a function by using the <<-
operator (the global assignment operator).
a = "Python"
fun1 = function(x) {
  print(x)   # the argument passed to the function
  print(a)   # the global variable a
}
fun1(5)
Output:
5
"Python"
f2 = function() {
  # <<- assigns to a in the global environment
  a <<- "a studio"
  print(a)
}
f2()
Output:
"a studio"
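Local variables exist only inside the function in which they are created.
f2 = function() {
  b = "language"   # b is local to f2
  paste("R is a", b)
}
f2()
Output:
"R is a language"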
Example:
x = "python"
func = function() {
  # This local x hides the global x inside the function
  x = "cobol"
  paste(x, "is a language")
}
func()
Output:
"cobol is a language"
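The difference between ordinary assignment and <<- inside a function can be sketched as follows. The names counter, bump_local, and bump_global are hypothetical and used only for illustration:

```r
counter <- 0   # global variable

# Ordinary assignment creates a local copy; the global counter is untouched
bump_local <- function() {
  counter <- counter + 1
  counter
}

# <<- modifies the variable in the enclosing (here global) environment
bump_global <- function() {
  counter <<- counter + 1
  counter
}

local_result <- bump_local()
after_local <- counter        # global value after the local assignment

global_result <- bump_global()
after_global <- counter       # global value after the <<- assignment

print(c(after_local = after_local, after_global = after_global))
```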
Loop Functions
apply() function
apply() takes a data frame or matrix as input and returns a vector, list, or
array. It is primarily used to avoid explicit loop constructs. It is the most
basic of the apply family of functions and works over the rows or columns of a
matrix.
The syntax for apply() is as follows:
apply(x,MARGIN,FUN,…)
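Parameters
Parameter Condition Description
x Required An array or matrix
MARGIN Required 1 to apply the function over rows, 2 to apply it over columns
FUN Required The function to apply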
Examples
# Get the sum of each column
data <- matrix(1:9, nrow=3, ncol=3)
data
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
apply(data, 2, sum)
[1] 6 15 24
# Get the sum of each row
apply(data, 1, sum)
[1] 12 15 18
You can use user-defined functions as well.
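As a short sketch of a user-defined function with apply(), the helper sum_sq below is a hypothetical function written for this illustration; it is applied to the same 3 x 3 matrix used above:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# Hypothetical helper: sum of squared values of a vector
sum_sq <- function(v) sum(v^2)

# Apply the named function to each row (MARGIN = 1)
row_sum_sq <- apply(m, 1, sum_sq)
print(row_sum_sq)

# An anonymous function works the same way; here the range of each column
col_range <- apply(m, 2, function(col) max(col) - min(col))
print(col_range)
```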
lapply() function
lapply() takes a list, vector, or data frame as input and always returns a
list. The returned list has the same length as the input, and each element is
the result of applying FUN to the corresponding element of the input.
Parameters
Parameter Condition Description
x Required A list
Example
# Get the sum of each list item
data <- list(item1 = 1:5,
item2 = seq(4,36,8),
item4 = c(1,3,5,7,9))
data
$item1
[1] 1 2 3 4 5
$item2
[1] 4 12 20 28 36
$item4
[1] 1 3 5 7 9
lapply(data, sum)
$item1
[1] 15
$item2
[1] 100
$item4
[1] 25
The l in lapply() stands for list. The difference between lapply() and apply()
lies in what they return: the output of lapply() is always a list. lapply() can
also be used on other objects such as data frames and vectors.
A simple example is converting the strings in a character vector to lower case
with the tolower function. We construct a vector with the names of famous
movies, written in upper case.
movies <- c("SPIDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <-lapply(movies, tolower)
str(movies_lower)
Output:
## List of 4
##  $ : chr "spiderman"
##  $ : chr "batman"
##  $ : chr "vertigo"
##  $ : chr "chinatown"
We can use unlist() to convert the list into a vector.
movies_lower <-unlist(lapply(movies,tolower))
str(movies_lower)
Output:
## chr [1:4] "spiderman" "batman" "vertigo" "chinatown"
sapply() function
sapply() takes a list, vector, or data frame as input and returns a vector or
matrix. It does the same job as lapply(), but instead of returning a list it
simplifies the result to a vector (or a matrix) whenever possible.
Parameters
Parameter Condition Description
x Required A list
Example
# Get the sum of each list item and simplify the result into a vector
data <- list(item1 = 1:5,
item2 = seq(4,36,8),
item4 = c(1,3,5,7,9))
data
$item1
[1] 1 2 3 4 5
$item2
[1] 4 12 20 28 36
$item4
[1] 1 3 5 7 9
sapply(data, sum)
item1 item2 item4
15 100 25
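To make the difference in return types concrete, here is a small sketch. The list name items is chosen here to avoid clashing with the data object above; its contents are the same:

```r
items <- list(item1 = 1:5,
              item2 = seq(4, 36, 8),
              item4 = c(1, 3, 5, 7, 9))

# lapply() always returns a list
sums_list <- lapply(items, sum)
class(sums_list)

# sapply() simplifies one value per element into a named vector
sums_vec <- sapply(items, sum)
class(sums_vec)

# When every result has length 2, sapply() simplifies to a matrix
ranges <- sapply(items, range)
print(ranges)
```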
tapply() function
The tapply() function breaks the data set up into groups and applies a function to each
group.
Syntax
The syntax for tapply() is as follows:
tapply(x,INDEX,FUN,…,simplify)
Parameters
Parameter Condition Description
x Required A vector
Example
# Find the age of youngest male and female
data <- data.frame(name=c("Amy","Max","Ray","Kim","Sam","Eve","Bob"),
age=c(24, 22, 21, 23, 20, 24, 21),
gender=factor(c("F","M","M","F","M","F","M")))
data
name age gender
1 Amy 24 F
2 Max 22 M
3 Ray 21 M
4 Kim 23 F
5 Sam 20 M
6 Eve 24 F
7 Bob 21 M
tapply(data$age, data$gender, min)
F M
23 20
To understand how it works, let’s use the iris dataset. This dataset is very famous
in the world of machine learning. The purpose of this dataset is to predict the class
of each flower among three species: Setosa, Versicolor, and Virginica. The dataset
records the length and width of the sepals and petals of each flower.
As preliminary work, we can compute the median sepal width for each species.
tapply() is a quick way to perform this computation.
data(iris)
tapply(iris$Sepal.Width, iris$Species, median)
Output:
## setosa versicolor virginica
## 3.4 2.8 3.0
mapply()
The mapply() function in R can be used to apply a function to multiple list or vector arguments.
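Its syntax is:
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
where:
FUN: The function to apply
...: The arguments to vectorize over
MoreArgs: A list of other arguments to FUN
SIMPLIFY: Whether or not to reduce the result to a vector or matrix
USE.NAMES: Whether or not to use names if the first ... argument has names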
The following examples show how to use this function in different scenarios.
The following code shows how to use mapply() to create a matrix by repeating the values c(1, 2, 3)
each 5 times:
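The call that produced the matrix below is missing from these notes. One call that yields a 5 x 3 matrix of this shape is sketched here; the manual version is an assumed equivalent included only for comparison:

```r
# Call rep() once for each of 1, 2, 3, always with times = 5;
# the three length-5 results are simplified into the columns of a matrix
rep_matrix <- mapply(rep, 1:3, times = 5)
print(rep_matrix)

# The same matrix built by hand
manual_matrix <- matrix(c(rep(1, 5), rep(2, 5), rep(3, 5)), ncol = 3)
```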
#create matrix
[1,] 1 2 3
[2,] 1 2 3
[3,] 1 2 3
[4,] 1 2 3
[5,] 1 2 3
Notice how this is much more efficient than typing out the following:
[1,] 1 2 3
[2,] 1 2 3
[3,] 1 2 3
[4,] 1 2 3
[5,] 1 2 3
The following code shows how to use mapply() to find the max value for corresponding elements in
two vectors:
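The original code for this example is not present in these notes. A minimal sketch, with two made-up vectors a and b:

```r
# Two illustrative vectors of equal length
a <- c(2, 4, 6, 8)
b <- c(1, 5, 7, 3)

# max() is called on each pair: (2, 1), (4, 5), (6, 7), (8, 3)
pair_max <- mapply(max, a, b)
print(pair_max)
```

For this particular task the built-in pmax(a, b) gives the same result; mapply() is the more general tool when no vectorized equivalent exists.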
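mapply() can also vary more than one argument at a time. For example,
mapply(rep, 1:4, 4:1) returns the following list:
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
This passes the sequence 1:4 to the first argument of rep() and the
sequence 4:1 to the second argument.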
While no single debugging process fits every programming language, here is
a general process to get you started.
One popular approach is to split the code into parts. You test these
individual parts and try to figure out which one is not working. Then you
split the faulty code further and further until you isolate the part where the
problem is occurring.
3. Making it repeatable
Once you have isolated a part of the code that you believe is the root of the
problem, you will have to repeatedly execute it with different changes to
identify the bug and debug it. You have to make it so that you can execute
that part of the code on its own again and again. After that, make small
changes in the code and execute it. Every try will give you more insight into
the problem and its causes.
Debugging in R
Debugging is the process of finding errors in your code to figure out why it’s behaving in
unexpected ways. This typically involves:
messages give the user a hint that something is wrong, or may be missing. They can
be ignored, or suppressed altogether with suppressMessages().
warnings don’t stop the execution of a function, but rather give a heads up that
something unusual is happening. They display potential problems.
errors are problems that are fatal, and result in the execution stopping altogether.
Errors are used when there is no way for the function to continue with its task.
There are many ways to approach these problems when they arise. For example, condition
handling using tools like try(), tryCatch(), and withCallingHandlers() can increase your
code’s robustness by proactively steering error handling.
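As a brief sketch of condition handling with tryCatch(), the function safe_log below is a hypothetical wrapper written for this illustration. It returns NA instead of stopping when log() raises a warning or an error:

```r
# Hypothetical wrapper: log() that never stops the script
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      # e.g. log(-1) warns that NaNs were produced
      message("warning caught: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      # e.g. log("a") fails with a non-numeric argument error
      message("error caught: ", conditionMessage(e))
      NA_real_
    }
  )
}

ok_value <- safe_log(10)     # normal result
warn_value <- safe_log(-1)   # warning converted to NA
err_value <- safe_log("a")   # error converted to NA
```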
R also includes several advanced debugging tools that can be very helpful for quickly and
efficiently locating problems, which will be the focus of this article. To illustrate, we’ll use an
example adapted from an excellent paper by Roger D. Peng, and show how these tools work
along with some updated ways to interact with them via RStudio. In addition to working with
errors, the debugging tools can also be used on warnings by converting them to errors
via options(warn = 2).
traceback()
If we’ve run our code and it has already crashed, we can use traceback() to try to locate
where this happened. traceback() does this by printing a list of the functions that were called
before the error occurred, called the “call stack.” The call stack is read from bottom to top:
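The example that followed here is missing from these notes. A small stand-in is sketched below; the functions f, g, and h are hypothetical. In an interactive session you would run f("a") and then call traceback(), which lists the calls with the most recent one (the stop() inside h) at the top and the original call f("a") at the bottom.

```r
# Hypothetical three-level call chain: f() calls g(), which calls h()
f <- function(x) g(x)
g <- function(x) h(x)
h <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  sqrt(x)
}

# Interactively: run f("a"), then traceback().
# In a script, try() keeps execution going so the error can be inspected.
result <- try(f("a"), silent = TRUE)
cat(conditionMessage(attr(result, "condition")), "\n")
```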
Another way we can use traceback(), besides calling it directly after an error, is
as an error handler (meaning that it will be called immediately whenever an error
occurs). This can be done using options(error = traceback).
We can also access traceback() directly through the button on the right-hand side of the error
message in RStudio:
Debug Mode
While traceback() is certainly useful, it doesn’t show us where, exactly, an error occurred
within a function. For this, we need “debug mode.”
Entering debug mode will pause your function and let you examine and interact with the
environment of the function itself, rather than the usual global environment. In the function’s
runtime environment you’re able to do some useful new things. For example, the
environment pane shows the objects that are saved in the function’s local environment, which
can be inspected by typing their name into the browser prompt.
You can also run code and view the results that normally only the function would see.
Beyond just viewing, you’re able to make changes directly inside debug mode.
You’ll notice that while debugging, the prompt changes to Browse[1]> to let you know that
you’re in debug mode. In this state you’ll still have access to all the usual commands, but also
some extra ones. These can be used via the toolbar that shows up, or by entering the
commands into the console directly: n (execute the next line), s (step into a function
call), f (finish the current loop or function), c (continue running), and Q (quit debug
mode).
Debug mode sounds pretty useful, right? Here are some ways we can access it.
browser()
One way to enter debug mode is to insert a browser() statement into your code manually,
allowing you to step into debug mode at a pre-specified point.
If you want to use a manual browser() statement on installed code, you can use
print(functionName) to print the function code (or you can download the source code
locally), and use browser() just like you would on your own code.
While you don’t have to run any special code to quit browser(), do remember to remove
the browser() statement from your code once you’re done.
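A minimal sketch of a conditional browser() call follows. The function scale_values is hypothetical, and the interactive() guard is an addition made here so that the same code can also run unattended in a script:

```r
# Hypothetical function: centre and scale a numeric vector
scale_values <- function(x) {
  # Pause in debug mode only when there is something suspicious to inspect
  # and a person is at the console
  if (interactive() && anyNA(x)) browser()
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

scaled <- scale_values(c(1, 2, 3))
print(scaled)
```

Because browser() sits inside an if() statement, debug mode is entered only for the inputs you care about, which editor breakpoints cannot do.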
debug()
Once you’re done with debug(), you’ll need to call undebug(), otherwise it’ll enter debug
mode every time the function is called. An alternative is to use debugonce(). You can check
whether a function is in debug mode using isdebugged().
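The calls mentioned above can be sketched as follows. The function half is a hypothetical example; it is deliberately not called while flagged, since that would open the debugger:

```r
# Hypothetical function to be debugged
half <- function(x) x / 2

debug(half)                      # every call to half() now enters debug mode
flag_after_debug <- isdebugged(half)

undebug(half)                    # back to normal behaviour
flag_after_undebug <- isdebugged(half)

# debugonce(half) would flag the function for the next call only,
# so no matching undebug() is needed afterwards.
print(c(flag_after_debug, flag_after_undebug))
```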
Options in RStudio
In addition to debug() and browser(), you can also enter debug mode by setting “editor
breakpoints” in RStudio, either by clicking to the left of the line number or by selecting
the line and pressing Shift+F9. Editor breakpoints are denoted by a red circle on the
left-hand side, indicating that debug mode will be entered at this line once the source is run.
Editor breakpoints avoid having to modify code with a browser() statement, though it is
important to note that there are some instances where editor breakpoints won’t function
properly, and they cannot be used conditionally (unlike browser(), which can be used in
an if() statement).
You can also have RStudio enter the debug mode for you. For example, you can have
RStudio stop the execution when an error is raised via Debug (on the top bar) > On Error, and
changing it from “Error Inspector” to “Break in Code.”
To prevent debug mode from opening every time an error occurs, RStudio won’t invoke the
debugger unless it looks like some of your own code is on the stack. If this is causing
problems for you, navigate to Tools > Global Options > General > Advanced, and unclick
“Use debug error handler only when my code contains errors.”
If you just want to invoke debug mode every single time there’s ever an error,
use options(error = browser). Note that the function itself is passed, without
parentheses.
recover()
recover() is similar to browser(), but lets you choose which function in the call stack you
want to debug. recover() is not used directly, but rather as an error handler by
calling options(error = recover).
Once put in place, when an error is encountered, recover() will pause R, print the call stack
(though note that this call stack will be upside-down relative to the order in traceback()), and
allow you to select which function’s browser you’d like to enter. This is helpful because
you’ll be able to browse any function on the call stack, even before the error occurred, which
is important if the root cause is a few calls prior to where the error actually takes place.
Once you’ve found the problem, you can switch back to default error handling by removing
the option from your .Rprofile file. Note that previously options(error = NULL) was used to
accomplish this, but this became illegal in R 3.6.0 and as of September 2019 may cause
RStudio to crash the next time you try running certain things, such as .Rmd files.
trace()
The trace() function is slightly more complicated to use, but can be useful when you don’t
have access to the source code (for example, with base functions). trace() allows you to insert
any code at any location in a function, and the functions are only modified indirectly (without
re-sourcing them).
Note that if called with no additional arguments beyond the function name,
trace(yourFunction) just prints a message each time the function is called:
Let’s try it out:
If we want to see the tracing code to get a better understanding of what’s going on, we can
use body(yourFunction):
At this point, if we call on the function func1(), debug mode will open if r is not a number.
When you’re done, you can remove tracing from a function using untrace().
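The original examples for trace() are missing from these notes. The sketch below uses a hypothetical function circle_area and a tracer that only prints a message; replacing the tracer with quote(if (!is.numeric(r)) browser()) would instead open debug mode when r is not a number.

```r
# Hypothetical function standing in for code we do not want to edit
circle_area <- function(r) pi * r^2

# Insert an expression at the start of the function without touching its source
trace(circle_area,
      tracer = quote(message("circle_area() called with r = ", r)),
      print = FALSE)

traced_value <- circle_area(2)    # runs the tracer, then the original body

# Remove the tracing code again
untrace(circle_area)
untraced_value <- circle_area(2)
```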
Simulation in R
Simulation is an important (and big) topic for both statistics and for a variety of other
areas where there is a need to introduce randomness. Sometimes you want to implement a
statistical procedure that requires random number generation or sampling (e.g. Markov
chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to
simulate a system, where random number generators can be used to model random inputs.
R comes with a set of pseudo-random number generators that allow you to simulate from
well-known probability distributions like the Normal, Poisson, and binomial. Some
example functions for probability distributions in R are:
rnorm: generate random Normal variates with a given mean and standard deviation
dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or
vector of points)
pnorm: evaluate the cumulative distribution function for a Normal distribution
rpois: generate random Poisson variates with a given rate
For each probability distribution there are typically four functions available that start with
an “r”, “d”, “p”, and “q”. The “r” function is the one that actually simulates random
numbers from that distribution. The four prefixes stand for:
d for density
r for random number generation
p for cumulative distribution
q for quantile function (inverse cumulative distribution)
If you’re only interested in simulating random numbers, then you will likely only need the
“r” functions and not the others. However, if you intend to simulate from arbitrary
probability distributions using something like rejection sampling, then you will need the
other functions too.
Probably the most common probability distribution to work with is the Normal distribution
(also known as the Gaussian). Working with the Normal distributions requires using these
four functions
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
Here we simulate standard Normal random numbers with mean 0 and standard deviation 1.
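The code for this step is not present in these notes. A minimal sketch is given below; the seed value, the sample sizes, and the variable names are arbitrary choices made for illustration:

```r
## Set the seed so the random draws are reproducible
set.seed(10)

## Five draws from the standard Normal (mean 0, sd 1)
z <- rnorm(5)

## Five draws from a Normal with mean 10 and standard deviation 2
w <- rnorm(5, mean = 10, sd = 2)

## The other three functions for the same distribution
density_at_0 <- dnorm(0)              # height of the density at 0
prob_below <- pnorm(1.96)             # P(Z <= 1.96), roughly 0.975
quantile_975 <- qnorm(0.975)          # inverse of pnorm, roughly 1.96

print(z)
print(c(density_at_0, prob_below, quantile_975))
```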
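Suppose we want to simulate from the following linear model
$y = \beta_0 + \beta_1 x + \varepsilon$
where $\varepsilon \sim N(0, 2^2)$. Assume $x \sim N(0, 1^2)$, $\beta_0 = 0.5$, and $\beta_1 = 2$. The variable x might represent an
important predictor of the outcome y. Here’s how we could do that in R.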
> ## Always set your seed!
> set.seed(20)
>
> ## Simulate predictor variable
> x <- rnorm(100)
>
> ## Simulate the error term
> e <- rnorm(100, 0, 2)
>
> ## Compute the outcome via the model
> y <- 0.5 + 2 * x + e
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.4084 -1.5402 0.6789 0.6893 2.9303 6.5052
We can plot the results of the model simulation.
> plot(x, y)
We can also simulate from a generalized linear model where the errors are no longer from a
Normal distribution but come from some other distribution. For example, suppose we
want to simulate from a Poisson log-linear model where
$Y \sim \text{Poisson}(\mu)$
$\log \mu = \beta_0 + \beta_1 x$
and $\beta_0 = 0.5$ and $\beta_1 = 0.3$. We need to use the rpois() function for this.
> set.seed(1)
>
> ## Simulate the predictor variable as before
> x <- rnorm(100)
Now we need to compute the log mean of the model and then exponentiate it to get the
mean to pass to rpois().
> log.mu <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log.mu))
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 1.55 2.00 6.00
> plot(x, y)
You can build arbitrarily complex models like this by simulating more predictors or
making transformations of those predictors (e.g. squaring, log transformations, etc.).
Random Sampling
The sample() function draws randomly from a specified set of (scalar) objects allowing
you to sample from arbitrary distributions of numbers.
> set.seed(1)
> sample(1:10, 4)
[1] 9 4 7 1
> sample(1:10, 4)
[1] 2 7 3 6
>
> ## Doesn't have to be numbers
> sample(letters, 5)
[1] "r" "s" "a" "u" "w"
>
> ## Do a random permutation
> sample(1:10)
[1] 10 6 9 2 1 5 8 4 3 7
> sample(1:10)
[1] 5 10 2 8 6 1 4 3 9 7
>
> ## Sample w/replacement
> sample(1:10, replace = TRUE)
[1] 3 6 10 10 6 4 4 10 9 7
To sample more complicated things, such as rows from a data frame or a list, you can
sample the indices into an object rather than the elements of the object itself.
> library(datasets)
> data(airquality)
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Now we just need to create the index vector indexing the rows of the data frame and
sample directly from that index vector.
> set.seed(20)
>
> ## Create index vector
> idx <- seq_len(nrow(airquality))
>
> ## Sample from the index vector
> samp <- sample(idx, 6)
> airquality[samp, ]
Ozone Solar.R Wind Temp Month Day
107 NA 64 11.5 79 8 15
120 76 203 9.7 97 8 28
130 20 252 10.9 80 9 7
98 66 NA 4.6 87 8 6
29 45 252 14.9 81 5 29
45 NA 332 13.8 80 6 14
Other more complex objects can be sampled in this way, as long as there’s a way to index
the sub-elements of the object.
Code Profiling
Typically code profilers are used by developers to help identify performance problems
without having to touch their code. Profilers can answer questions like, “How many times is
each method in my code called?” and, “How long do each of these methods take?” Profilers
also track things like memory allocations and garbage collection. Some profilers can even
track key methods in your code, so you can understand how often SQL statements and web
services are called. In addition, some profilers can track web requests and trace those
transactions to understand the performance of transactions within your code.
Code profilers can track all the way down to each individual line of code. However, most
developers only use profilers when chasing a CPU or memory problem, and need to go out of
their way to try and find those problems. This is because many profilers make applications
run a hundred times slower than usual. While most consider profilers to be a situational tool
not meant for daily use, code profiling can be a total lifesaver when you need it.
Profilers are great for finding the hot path in your code. Figuring out what is using twenty
percent of the total CPU usage of your code, and then determining how to improve that
would be a great example of when to use a code profiler. In addition, profilers are also great
for finding memory leaks early, as well as understanding the performance of dependency
calls and transactions. Profilers help you look for methods that can lead to improvement over
time. A former mentor once told me, “If you can improve something one percent every day,
then over the course of a month, you’ll improve by thirty percent.” What really makes a
difference is continued improvement over time.
Desktop code profiling is slower and requires a lot of overhead, potentially making your app
much slower than it should be. This kind of profiler usually tracks the performance of every
line of code within each individual method. These types of profilers also track memory
allocations and garbage collection to help with memory leaks. Desktop profilers are very good
at finding that hot path, figuring out every single method that’s being called, and identifying
what uses the most CPU.
But there’s also another solution. For the sake of simplicity, we’ll call it a hybrid profiler.
These hybrid code profilers merge key data from server-based profiling with code-level
details on your desktop for use every day. These profilers provide server level insights
combined with the ability to track key methods, every transaction, dependency calls, errors,
and logs.
Some options for desktop code profilers include Visual Studio, NProfiler, and others.
Profiling R Code
R comes with a profiler to help you optimize your code and improve its performance. In
general, it’s usually a bad idea to focus on optimizing your code at the very beginning of
development. Rather, in the beginning it’s better to focus on translating your ideas into
code and writing code that’s coherent and readable. The problem is that heavily optimized
code tends to be obscure and difficult to read, making it harder to debug and revise. Better
to get all the bugs out first, then focus on optimizing.
Of course, when it comes to optimizing code, the question is what should you optimize?
Well, clearly you should optimize the parts of your code that are running slowly, but how do
we know what parts those are?
This is what the profiler is for. Profiling is a systematic way to examine how much time is
spent in different parts of a program.
Sometimes profiling becomes necessary as a project grows and layers of code are placed
on top of each other. Often you might write some code that runs fine once. But then later,
you might put that same code in a big loop that runs 1,000 times. Now the original code
that took 1 second to run is taking 1,000 seconds to run! Getting that little piece of original
code to run faster will help the entire loop.
It’s tempting to think you just know where the bottlenecks in your code are. The reality is
that profiling is better than guessing. Better to collect some data than to go on hunches
alone. Ultimately, getting the biggest impact on speeding up code depends on knowing
where the code spends most of its time. This cannot be done without some sort of rigorous
performance analysis or profiling.
We should forget about small efficiencies, say about 97% of the time: premature
optimization is the root of all evil —Donald Knuth
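The basic principles of optimizing your code, as discussed above, are to get the code
working first and optimize afterwards, and to measure (collect data) rather than guess.
If you’re going to be a scientist, you need to apply the same principles here!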
The R Profiler
Using system.time() allows you to test certain functions or code blocks to see if they are
taking excessive amounts of time. However, this approach assumes that you already know
where the problem is and can call system.time() on that piece of code. What if you don’t
know where to start?
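For reference, a short sketch of timing a single expression with system.time(). The function slow_sum is a hypothetical, deliberately loop-based example:

```r
# Hypothetical slow function: add up 1..n with an explicit loop
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# system.time() evaluates the expression and reports user, system,
# and elapsed (wall-clock) time in seconds
timing <- system.time(result <- slow_sum(1e6))
print(timing)
timing["elapsed"]
```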
This is where the profiler comes in handy. The Rprof() function starts the profiler in R.
Note that R must be compiled with profiler support (but this is usually the case). In
conjunction with Rprof(), we will use the summaryRprof() function which summarizes the
output from Rprof() (otherwise it’s not really readable). Note that you should NOT
use system.time() and Rprof() together, or you will be sad.
Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates
how much time is spent inside each function. By default, the profiler samples the function
call stack every 0.02 seconds. This means that if your code runs very quickly (say, under
0.02 seconds), the profiler is not useful. But if your code runs that fast, you probably
don’t need the profiler.
The profiler is started by calling the Rprof() function.
> Rprof() ## Turn on the profiler
You don’t need any other arguments. By default it will write its output to a file
called Rprof.out. You can specify the name of the output file if you don’t want to use this
default.
Once you call the Rprof() function, everything that you do from then on will be measured
by the profiler. Therefore, you usually only want to run a single R function or expression
once you turn on the profiler and then immediately turn it off. The reason is that if you
mix too many function calls together when running the profiler, all of the results will be
mixed together and you won’t be able to sort out where the bottlenecks are. In reality, I
usually only run a single function with the profiler on.
The profiler can be turned off by passing NULL to Rprof().
> Rprof(NULL) ## Turn off the profiler
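Putting the steps together, a complete profiling run might look like the sketch below. The simulated data, the repetition count, and the use of a temporary output file are assumptions made for this illustration, and it presumes an R build with profiling support:

```r
## Simulated data for a simple linear model (sizes chosen so the
## run lasts long enough for the profiler to collect samples)
set.seed(1)
x <- rnorm(2e5)
y <- 0.5 + 2 * x + rnorm(2e5)

## Write profiler output to a temporary file instead of Rprof.out
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file)                      ## Turn on the profiler
for (i in 1:10) fit <- lm(y ~ x)      ## The code being profiled
Rprof(NULL)                           ## Turn off the profiler

## Summarize the raw call-stack samples
prof <- summaryRprof(prof_file)
head(prof$by.self)
prof$sampling.time
```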
The raw output from the profiler looks something like this. Here I’m calling
the lm() function on some data with the profiler running.
## lm(y ~ x)
sample.interval=10000
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"lm.fit" "lm"
"lm.fit" "lm"
"lm.fit" "lm"
At each line of the output, the profiler writes out the function call stack. For example, on
the very first line of the output you can see that the code is 8 levels deep in the call stack.
This is where you need the summaryRprof() function to help you interpret this data.
Using summaryRprof()
The summaryRprof() function tabulates the R profiler output and calculates how much
time is spent in which function. There are two methods for normalizing the data.
“by.total” divides the time spent in each function by the total run time
“by.self” does the same as “by.total” but first subtracts out time spent in functions above
the current function in the call stack. I personally find this output to be much more useful.
$by.self
self.time self.pct total.time total.pct
"lm.fit" 2.99 40.35 3.50 47.23
"as.list.data.frame" 0.82 11.07 0.82 11.07
"[.data.frame" 0.79 10.66 1.03 13.90
"structure" 0.73 9.85 0.73 9.85
"na.omit.data.frame" 0.49 6.61 1.30 17.54
"list" 0.46 6.21 0.46 6.21
"lm" 0.30 4.05 7.41 100.00
"model.matrix.default" 0.27 3.64 0.79 10.66
"na.omit" 0.24 3.24 1.54 20.78
"as.character" 0.18 2.43 0.18 2.43
"model.frame.default" 0.12 1.62 2.24 30.23
"anyDuplicated.default" 0.02 0.27 0.02 0.27
Now you can see that only about 4% of the runtime is spent in the actual lm() function,
whereas over 40% of the time is spent in lm.fit(). In this case, this is no surprise since
the lm.fit() function is the function that actually fits the linear model.
You can see that a reasonable amount of time is spent in functions not necessarily
associated with linear modeling (i.e. as.list.data.frame, [.data.frame). This is because
the lm() function does a bit of pre-processing and checking before it actually fits the
model. This is common with modeling functions—the preprocessing and checking is
useful to see if there are any errors. But those two functions take up over 1.5 seconds of
runtime. What if you want to fit this model 10,000 times? You’re going to be spending a
lot of time in preprocessing and checking.
The final bit of output that summaryRprof() provides is the sampling interval and the total
runtime.
$sample.interval
[1] 0.02
$sampling.time
[1] 7.41