Mmsac FDP Tutorial

MITCOE, Department of Computer Engineering

Two-week FDP on
“Mathematical Modelling and Statistical Analysis in Computation”
4th June to 13th June 2015
R Tutorial and Hands-on Session

1. What is R Programming
R is one of the most popular platforms for data analysis and visualization. It is free and
open source software, with versions for Windows, Mac OS X and Linux.
R is a programming language and software environment for statistical computation and
graphics (ref. Wikipedia).
It is widely used by statisticians and data miners for data analytics and mining.
The R language is an implementation of the S programming language, which was created by
John Chambers at Bell Labs.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New
Zealand.
The R software environment is written in C, Fortran and R.
R provides a wide variety of statistical and graphical techniques, including linear and
nonlinear modelling, classical statistical tests, time series analysis, classification and clustering.

2. R and R Studio installation on Ubuntu


To have R on your Ubuntu machine, first install r-base:
sudo apt-get install r-base
After installing r-base, open a new terminal and type R to get the command-line
prompt for R programming.
You can also use an IDE for R. This tutorial uses RStudio.
Download RStudio from http://www.rstudio.com/products/rstudio/download/
Versions for Windows and for 32-bit and 64-bit Ubuntu are available on this site.
Download the .deb package of RStudio and install it using the Software Center.

3. Begin your R session


To begin a session in R, double-click on the R/RStudio icon.
You can type commands at the prompt (">") and execute them.
To scroll back through previously typed commands, use the 'up' arrow key (and 'down' to scroll
forward again).
To exit from a session, type q() or quit() at the prompt.
You will then be asked whether you wish to save the objects in the workspace.
If at any point you want to save the transcript of your session, click on 'File' and then 'Save
History', which saves a copy of the commands you have used for later use.

4. R Work Space Commands

1. getwd() : Shows the current working directory


> getwd()
[1] "/home/sumitra"

Conducted by: Rekha Sugandhi and Sumitra Pundlik, MIT College of Engineering, Pune Page 1

2. setwd("mydirectory") : Sets the working directory to the specified directory

> setwd("/home/sumitra/rexamples")

3. ls() : Lists all objects that exist in the current workspace


> ls()
[1] "good" "missing" "name" "x" "y"

4. rm(objectlist) : Removes the specified object(s) from the workspace


> x<-1;
> ls()
[1] "x"
> rm(x)
> ls()
character(0)

5. help(topic) : Shows help on the specified topic.

6. options() : Views or sets the current options.

7. savehistory("rexample") : Saves the history of executed commands to the specified file. If no
file name is given, the default file is .Rhistory.

8. loadhistory("rexample") : Loads the command history from the specified file. If no file name is
given, the default file is .Rhistory.

9. save.image("rexample") : Saves an image of the workspace to the specified file. If no file name
is given, the default file is .RData.

10. save(objectlist, file="rexample") : Saves specific objects to the specified file.

11. load("rexample") : Loads a saved workspace or saved objects from the specified file into the
current session.

12. q() : Quits your session, asking first whether to save the workspace
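
As a rough sketch of how the workspace commands fit together, the session below saves one object to a file, removes it, and loads it back. The object name scores and the use of a temporary file are assumptions made for illustration only.

```r
# Hypothetical object; any R object could be saved the same way
scores <- c(72, 75, 84)

# Save to a temporary file so the example does not touch the working directory
rdata_file <- tempfile(fileext = ".RData")
save(scores, file = rdata_file)

# Remove the object and confirm that it is gone
rm(scores)
exists_after_rm <- exists("scores")

# load() restores the object under its original name
load(rdata_file)
print(scores)
```

Using save() with an explicit object list keeps the file small; save.image() would store every object in the workspace instead.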

5. Input and Output In R


Input can be processed through script files. A script file is a file containing R statements.

Input:
source("filename")
This function submits a script to the current session.
Example : source("myscript.R")
By convention, a script file is saved with .R as its extension.


Output:
sink("filename")
This function redirects output to the named file. If the file already exists, its
contents are overwritten.
Two options are provided for sink():
1. append=TRUE
This option appends the output to an existing file instead of overwriting it.
2. split=TRUE
This option sends output to both the screen and the file.

Graphics Output:
sink() has no effect on graphics output. For graphics output, use the following functions.
1. pdf("filename.pdf") : To save output in PDF format.
2. win.metafile("filename.wmf") : To save output as a Windows metafile (available on Windows only).
3. png("filename.png") : To save output in PNG format.
4. jpeg("filename.jpeg") : To save output in JPEG format.
5. bmp("filename.bmp") : To save output in BMP format.
6. postscript("filename.ps") : To save output in PostScript format.

##Example
source("script1.R")
sink("myoutput", append=TRUE, split=TRUE)
pdf("mygraph.pdf")
source("script2.R")
dev.off() # closes the PDF device so that mygraph.pdf is written
sink() # returns text output to the terminal
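
The example above depends on script files that may not exist on your machine. A self-contained sketch of the same idea follows; the temporary file paths and the plotted data are assumptions chosen for illustration.

```r
# Write a plot to a PDF in a temporary location
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)                          # open the PDF graphics device
plot(1:10, (1:10)^2, main = "Squares")
dev.off()                               # the file is complete only after the device is closed

# Send printed output to a text file, then restore the console
text_file <- tempfile(fileext = ".txt")
sink(text_file)
print(summary(1:10))
sink()                                  # calling sink() with no arguments ends the redirection

file.exists(plot_file)
readLines(text_file)
```

Note that graphics and text output are redirected independently: dev.off() ends the first, sink() ends the second.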

6. Packages in R
To date, 6695 packages are available for R. Information on all packages is available at
http://cran.r-project.org/web/packages/available_packages_by_date.html

What do you mean by Packages?


Packages are collections of R functions, data and compiled code in a well-defined format. On
your computer, installed packages are stored in a directory called the library.
.libPaths() : Shows where your library is located.
library() : Lists the packages stored in the library.

Installing Packages
install.packages("name of package") : Installs the named package.
update.packages() : Updates the packages that are already installed. To update a single
package, install it again with install.packages().
installed.packages() : Shows the packages that are currently installed.
library(package_name) : Loads an installed package for use in the current session.
help(package="package_name") : Shows information about a particular package.
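
A short sketch of how these commands can be combined to check for a package before loading it. It uses the stats package, which ships with R, so that nothing needs to be downloaded; the variable names are illustrative.

```r
# Directories in which packages are installed on this machine
lib_dirs <- .libPaths()

# Names of all installed packages
pkgs <- rownames(installed.packages())

# "stats" is part of every standard R installation
has_stats <- "stats" %in% pkgs

# Load a package only if it is available, instead of failing with an error
if (requireNamespace("stats", quietly = TRUE)) {
  library(stats)
}
print(has_stats)
```

For a package that is not yet installed, the same check could be followed by install.packages() with the package name.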


7. Some simple operations


R uses objects to store information in its workspace.
Some types of objects are scalars, vectors, matrices, lists and dataframes.

An important feature of R is that it will do different things on different types of objects. For
example, type:
> 4+6
The result should be
[1] 10
So, R does scalar arithmetic returning the scalar value 10. (In actual fact, R returns a vector of
length 1, hence the [1] denoting the first element of the vector.)
We can assign objects values for subsequent use. For example:
x<-6
y<-4
z<-x+y

These commands would do the same calculation as above, storing the result in an object called z. We can look at
the contents of the object by simply typing its name:
>z
[1] 10

At any time we can list the objects which we have created:


> ls()
[1] "x" "y" "z"

> sqrt(16)
[1] 4
calculates the square root of 16.

Objects can be removed from the current workspace with the rm function. For example:
> rm(x,y)
There are many standard functions available in R (such as sqrt above), and it is also possible to create new ones.

To remove all the objects, we do


> rm(list = ls())
> ls()
character(0)
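
This section mentions that new functions can be created. A minimal sketch of what that looks like is given here; the function name mean_square is hypothetical and is not part of R.

```r
# Hypothetical helper: the mean of the squared elements of a numeric vector
mean_square <- function(v) {
  sum(v^2) / length(v)
}

x <- c(4, 2, 6)
result <- mean_square(x)   # (16 + 4 + 36) / 3
print(result)
```

Once defined, the function appears in ls() like any other object and can be removed with rm().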

8. Data types and operations in R


Objects
R has five basic or "atomic" classes of objects:
character
numeric (real numbers)
integer
complex
logical (TRUE/FALSE)
The most basic object is a vector.
A vector can only contain objects of the same class.
The one exception is a list, which is represented as a vector but can contain objects of
different classes (indeed, that's usually why we use them).
Empty vectors can be created with the vector() function.
------------------------------------------------------------------------------------------------------------

Numbers
Numbers in R are generally treated as numeric objects (i.e. double precision real numbers).
If you explicitly want an integer, you need to specify the L suffix.
Ex: Entering 1 gives you a numeric object; entering 1L explicitly gives you an integer.
There is also a special number Inf which represents infinity, e.g. 1 / 0. Inf can be used in ordinary
calculations, e.g. 1 / Inf is 0.
The value NaN represents an undefined value ("not a number"), e.g. 0 / 0. NaN can also be thought
of as a missing value (more on that later).
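
The statements about numbers can be checked directly at the prompt. The following lines are a small sketch of such a check; the variable names are illustrative.

```r
num_class <- class(1)     # "numeric": plain numbers are double precision
int_class <- class(1L)    # "integer": the L suffix requests an integer

pos_inf <- 1 / 0          # Inf
zero    <- 1 / Inf        # 0
undef   <- 0 / 0          # NaN

is.infinite(pos_inf)      # TRUE
is.nan(undef)             # TRUE
is.na(undef)              # TRUE, because NaN is also treated as missing
```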

Attributes: R objects can have attributes


names, dimnames
dimensions (e.g. matrices, arrays)
class
length
other user-defined attributes/metadata
Attributes of an object can be accessed using the attributes() function.

Entering Inputs
At the R prompt we type expressions. The <- symbol is the assignment operator.
> x<-1
> print (x)
[1] 1

>x

[1] 1

> msg<-"hello"
> msg
[1] "hello"

Scanning Inputs from console:


The scan() function can be used to input values into vector variables. Call the scan function for
scanning values into vector variable, say ‘z’
> z<-scan()

At prompt 1: Enter values to be stored in the vector. While entering values, separate them by spaces.
After entering the last value, press ‘enter’ twice

1: 3 4 5 6 7 8
7:
Read 6 items
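
When a session is not interactive (for example inside a script), the same values can be supplied through the text argument of scan() instead of being typed at the prompt. This is only a sketch of that variant:

```r
# Read whitespace-separated values from a character string
z <- scan(text = "3 4 5 6 7 8")
print(z)
length(z)   # 6
```

By default scan() reads numbers; character data would need the what argument, e.g. what = "character".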

##Task: Create the following dataset (vector) using ‘c’ function.


3, 1, 5, 7, 5, 8, 9
Store the list in a variable
##Task: Use scan() function for the following list of values and store it in variable list1:
72, 75, 84, 84, 98, 94, 55, 62
Print the list

Expression Evaluation
The grammar of the language determines whether an expression is complete or not.
> x <- ## Incomplete expression

The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored.

When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated
expression is returned. The result may be auto-printed.

>x <- 5 ## nothing printed

>x ## auto-printing occurs


[1] 5

>print(x) ## explicit printing


[1] 5

The [1] indicates that x is a vector and 5 is the first element.


------------------------------------------------------------------------------------------------------------


Vectors
Vectors can be created in R in a number of ways. The c() function can be used to create vectors of objects.

> z<-c(1, 2, 3, 4, 7, 9)

Note the use of the function c to concatenate or `glue together' individual elements. This function can
be used much more widely, for example
> x<-c(1,2,3)
>x
[1] 1 2 3
> y<-c(4,7,9)
>y
[1] 4 7 9
> z<-c(x,y)
>z
[1] 1 2 3 4 7 9
would lead to the same result by gluing together two vectors to create a single vector.

Some types of vectors


> x <- c(0.5, 0.6) ## numeric

> x <- c(TRUE, FALSE) ## logical

> x <- c(T, F) ## logical

> x <- c("a", "b", "c") ## character

> x <- 9:29 ## integer

> x <- c(1+0i, 2+4i) ## complex

Using the vector() function


> x <- vector("numeric", length = 10)
> x
[1] 0 0 0 0 0 0 0 0 0 0

Implicit coercion/ Mixing objects


> y <- c(1.7, "a") ## implicit coercion to character

> y <- c(TRUE, 2) ## implicit coercion to numeric

> y <- c("a", TRUE) ## implicit coercion to character


When different objects are mixed in a vector, coercion occurs so that every element in the vector is of
the same class.
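
To see which class R settles on after implicit coercion, class() can be applied to each mixed vector. A small sketch, with illustrative variable names:

```r
y1 <- c(1.7, "a")    # the number becomes the string "1.7"
y2 <- c(TRUE, 2)     # TRUE becomes 1
y3 <- c("a", TRUE)   # TRUE becomes the string "TRUE"

class(y1)   # "character"
class(y2)   # "numeric"
class(y3)   # "character"
print(y2)   # 1 2
```

The pattern is that R picks the most general class present: logical gives way to numeric, and numeric gives way to character.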

Explicit coercion
Objects can be explicitly coerced from one class to another using the as.* functions.
##Examples

> x<- 0:6


> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
> as.complex(x)
[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i

Nonsensical coercion results in NA values

##Examples
> x <- c("a", "b", "c")
> as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion
> as.logical(x)
[1] NA NA NA
------------------------------------------------------------------------------------------------------------
Lists
Lists can be created in R in a number of ways. They are a special type of vector that can contain elements of
different classes and are a very important data type in R.

##Example
> x<-list(1, "a", TRUE, 1 + 4i)
>x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i
------------------------------------------------------------------------------------------------------------
Factors
Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a
factor as an integer vector where each integer has a label.
Factors are treated specially by modelling functions like lm() and glm().
Using factors with labels is better than using integers because factors are self-describing;
having a variable that has values "Male" and "Female" is better than a variable that has values
1 and 2.

##Example

> x <- factor(c("yes", "yes", "no", "yes", "no"))

>x
[1] yes yes no yes no
Levels: no yes

> table(x)
x
no yes
2 3

> unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"

The order of the levels can be set using the levels argument to factor(). This can be important in linear
modelling because the first level is used as the baseline level.
##Example

> x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
>x
[1] yes yes no yes no
Levels: yes no
------------------------------------------------------------------------------------------------------------

Missing Values


Missing values are denoted by NA; NaN (Not a Number) denotes the result of an undefined mathematical operation.
is.na() is used to test whether objects are NA
is.nan() is used to test for NaN
NA values have a class also, so there are integer NA, character NA, etc.
A NaN value is also NA but the converse is not true

##Example

> x <- c(1, 2, NA, 10, 3)

> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE

> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE

> x <- c(1, 2, NA, NaN, 3)

> is.na(x)
[1] FALSE FALSE TRUE TRUE FALSE

> is.nan(x)
[1] FALSE FALSE FALSE TRUE FALSE
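
Missing values propagate through most calculations, so summary functions return NA unless told to skip them. The sketch below illustrates this with the na.rm argument of mean(), using the first vector from the example above.

```r
x <- c(1, 2, NA, 10, 3)

mean_with_na    <- mean(x)                 # NA, because one value is missing
mean_without_na <- mean(x, na.rm = TRUE)   # (1 + 2 + 10 + 3) / 4 = 4
n_missing       <- sum(is.na(x))           # 1

print(mean_without_na)
```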
------------------------------------------------------------------------------------------------------------
Data Frames
Data frames are used to store tabular data
They are represented as a special type of list where every element of the list has to have the same
length
Each element of the list can be thought of as a column and the length of each element of the list
is the number of rows
Unlike matrices, data frames can store different classes of objects in each column (just like lists);
matrices must have every element be the same class
Data frames also have a special attribute called row.names
Data frames are usually created by calling read.table() or read.csv()
A data frame can be converted to a matrix by calling data.matrix()

##Example
> dFrame<-data.frame(srNo= 1:6, code=c('M', 'F', 'F', 'F', 'M', 'F'))
> dFrame
  srNo code
1    1    M
2    2    F
3    3    F
4    4    F
5    5    M
6    6    F

> nrow(dFrame)
[1] 6

> ncol(dFrame)
[1] 2
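
Because a data frame is a list of equal-length columns, each column can be inspected or used to filter rows. The following sketch rebuilds the same dFrame and selects the rows whose code is "F"; the names col_classes and females are illustrative.

```r
dFrame <- data.frame(srNo = 1:6, code = c("M", "F", "F", "F", "M", "F"))

# Class of each column (the class of 'code' depends on the R version and
# on the stringsAsFactors setting)
col_classes <- sapply(dFrame, class)
print(col_classes)

# Keep only the rows where code equals "F"
females <- dFrame[dFrame$code == "F", ]
nrow(females)    # 4
females$srNo     # 2 3 4 6
```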
------------------------------------------------------------------------------------------------------------
Names
R objects can also have names, which is very useful for writing readable code and self-describing
objects.
##Example
> x<-1:3
> names(x)
NULL
> names(x)<-c("one", "two", "three")
>x
one two three
1 2 3

Lists can also have names

##Example

> x<-list(a = 1, b = 2, c = 3)
>x
$a
[1] 1

$b
[1] 2

$c
[1] 3

Matrices have dimnames (for rows and columns).

##Example

> mat<-matrix(c(1, 3, 0, 9, 5, -1), nrow=3)
> mat
     [,1] [,2]
[1,]    1    9
[2,]    3    5
[3,]    0   -1

> dimnames(mat)<- list( c("row1", "row2", "row3"), c("col1", "col2"))

> mat
     col1 col2
row1    1    9
row2    3    5
row3    0   -1

Summary on Data Types


Data Types
atomic classes: numeric, logical, character, integer, complex
vectors, lists
factors
missing values
data frames
names


9. Vectorized operations
Many operations in R are vectorized making code more efficient, concise, and easier to read.
> x<- 1:4; y<- 6:9

>x+y
[1] 7 9 11 13

>x>2
[1] FALSE FALSE TRUE TRUE

> y == 8
[1] FALSE FALSE TRUE FALSE

>x*y
[1] 6 14 24 36

>x/y
[1] 0.1666667 0.2857143 0.3750000 0.4444444

As explained above, R will often adapt to the objects it is asked to work on.
##Example:
> x<-c(7, 4, -6)
> y<-c(23, 7, 55)

> x + y
[1] 30 11 49

> x * y
[1] 161 28 -330

Above example shows that R uses component-wise arithmetic on vectors. R will also try to make sense
if objects are mixed. For example,
##Example
> (x + y)-2
[1] 28 9 47

Though care should be taken to make sure that R is doing what you would like it to in these
circumstances.
Two particularly useful functions worth remembering are length() which returns the length of a vector
(i.e. the number of elements it contains) and sum() which calculates the sum of the elements of a
vector.
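
A brief sketch of length() and sum(), together with the recycling rule that lets R combine vectors of different lengths. The vector x is the one used above; the other names are illustrative.

```r
x <- c(7, 4, -6)

n     <- length(x)    # 3 elements
total <- sum(x)       # 7 + 4 - 6 = 5
avg   <- total / n    # the same value as mean(x)

# A single number is recycled across every element
doubled <- x * 2      # 14 8 -12

# A shorter vector is recycled too: (1+10, 2+20, 3+10, 4+20)
recycled <- c(1, 2, 3, 4) + c(10, 20)
print(recycled)       # 11 22 13 24
```

Recycling is the reason care is needed when objects are mixed: if the longer length is not a multiple of the shorter one, R still returns a result but issues a warning.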


------------------------------------------------------------------------------------------------------------
Matrices
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of
length 2 (nrow, ncol)

##Example – Basic method for matrix creation using matrix function


> m<-matrix(nrow=2, ncol=3)
>m
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA NA NA

Matrices are constructed column-wise. The basic parameters passed are matrix elements, and
dimensions in the form of number of rows and columns.

> m<- matrix (1:6, nrow=2, ncol=3)

>m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

Matrices can be created in R in a variety of ways other than the matrix() function. Perhaps the simplest
is to create the columns and then glue them together with the command cbind().

##Example – Basic method for matrix creation using column-wise binding


> x<-c(5,7,9)
> y<-c(6,3,4)
> z<-cbind(x, y)
> z
     x y
[1,] 5 6
[2,] 7 3
[3,] 9 4

##Example – Basic method for matrix creation using row-wise binding

##The matrix can also be created using row-wise binding using the command rbind()
> mat1<-rbind(x,y)
> mat1
[,1] [,2] [,3]
x 5 7 9
y 6 3 4


The functions cbind and rbind can also be applied to matrices themselves (provided the dimensions
match) to form larger matrices.
##Example – Basic method for matrix creation using other matrices
> bigMat<-rbind(mat1, mat1)
> bigMat
[,1] [,2] [,3]
x 5 7 9
y 6 3 4
x 5 7 9
y 6 3 4

The matrix function explained earlier has a few default parameters.


##Example – Matrix creation using defaults: ncol is inferred from nrow, and the matrix is filled column-wise
> m<-matrix(c(1,2,3,4,5,6), nrow = 3)
>m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

As an alternative we could have specified the number of columns with the argument ncol=2
(obviously, it is unnecessary to give both).
Notice that the matrix is 'filled up' column-wise. If instead you wish to fill up row-wise, add the option
byrow=T.

##Example – Matrix creation in which the default column-wise filling is overridden by the
##byrow parameter, which fills the matrix row-wise

> m1<-matrix(c(1,2,3,4,5,6), nr= 3)


> m1
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

> mat<-matrix(c(1,2,3,4,5,6), nr= 3, byrow=T)


> mat
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6

Notice that the argument nrow has been abbreviated to nr. Such abbreviations are always possible for
function arguments provided it induces no ambiguity - if in doubt always use the full argument name.


Matrices can also be created directly from vectors by adding a dimension attribute. The dimension of a
matrix can be checked with the dim command.
##Example
> m<- 1:10
> dim(m)
NULL
> dim(m)<-c(2, 5)
>m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> dim(m)
[1] 2 5
## That is, two rows and five columns

Sequences
Sequences are vectors of values, generally with uniform intervals. For example, the sequence of
integers from 1 to 10 can be stored in a variable ‘x’ as:
> x<-1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10

More general sequences can be generated using the seq command.


##Example:
> seq(5, 25, by = 3)

[1] 5 8 11 14 17 20 23

and

##Example: Generates a sequence of the required length over the given range
> seq(4, 79, length = 6)
[1] 4 19 34 49 64 79

If we type the command seq(2, 25, 3), R treats 3 as the value of the "by" parameter, because
"by" is the third argument of seq. So you get

> seq(2, 25, 3)


[1] 2 5 8 11 14 17 20 23


These examples illustrate that many functions in R have optional arguments, in this case, either the
step length or the total length of the sequence (it doesn't make sense to use both). If you leave out both
of these options, R will make its own default choice, in this case assuming a step length of 1.

So, for example,


##Example:
> x<-seq(1,10)
also generates a vector of integers from 1 to 10.
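
The two optional arguments of seq can be compared side by side. In this sketch the full argument name length.out is used; the examples above write length, which works through argument-name abbreviation.

```r
# Fixed step: start at 5, add 3 until 25 would be exceeded
by_step <- seq(5, 25, by = 3)            # 5 8 11 14 17 20 23

# Fixed number of points: 4 equally spaced values from 1 to 10
by_count <- seq(1, 10, length.out = 4)   # 1 4 7 10

# Neither given: step of 1
default_step <- seq(1, 10)               # same values as 1:10

print(by_step)
print(by_count)
```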

At this point it's worth mentioning the help facility. If you don't know how to use a function, or don't
know what the options or default values are, type help(function_name) where function_name is the
name of the function you are interested in. This will usually help and will often include examples to
make things even clearer.

Another useful function for building vectors is the rep command for repeating things.
For example, to generate a vector containing twenty-five 2s, we run,
##Example:
> rep(2, 25)
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

##Example: To generate repeated sequences


> rep(2:9, 3)
[1] 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9

##Examples: Variation on the use of rep function with other functions


> rep(1:3, 6)
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> rep(1:3, c(6, 6, 6))
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

Observe the difference in the output in both variations given above.


rep(1:3, c(6, 6, 6)) could also be written as

> rep(1:3, rep(6,3))


[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

Vectorized Matrix Operations

> x<-matrix(1:4, 2, 2) ##constructed column-wise by default


> y<- matrix (rep(10,4), 2,2) ##repeats 10 four times

>x
[,1] [,2]
[1,] 1 3
[2,] 2 4


>y
[,1] [,2]
[1,] 10 10
[2,] 10 10

##Example – Matrix multiplication, component-wise


> x*y ##element-wise multiplication
[,1] [,2]
[1,] 10 30
[2,] 20 40
Notice, multiplication here is component-wise rather than conventional matrix multiplication.

> x %*% y ##true matrix multiplication


[,1] [,2]
[1,] 40 40
[2,] 60 60

>x/y
[,1] [,2]
[1,] 0.1 0.3
[2,] 0.2 0.4

Other useful functions on matrices:


##Example – Matrix transpose, uses function t()
> mat
[,1] [,2]
[1,] 1 9
[2,] 3 5
[3,] 0 -1
> t(mat)
[,1] [,2] [,3]
[1,] 1 3 0
[2,] 9 5 -1

##Example – Matrix inverse, uses function solve()

> sqMat<-matrix(c(1, 5, 4, 7, 0, 9, 2, 8, 5), nrow=3)

> sqMat
     [,1] [,2] [,3]
[1,]    1    7    2
[2,]    5    0    8
[3,]    4    9    5

> solve(sqMat)
[,1] [,2] [,3]
[1,] -1.0746269 -0.25373134 0.83582090
[2,] 0.1044776 -0.04477612 0.02985075
[3,] 0.6716418 0.28358209 -0.52238806
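
One way to check an inverse is to multiply it by the original matrix and compare the product with the identity matrix. The sketch below does this for sqMat and also shows solve() with a right-hand side; the vector b is an arbitrary example, not taken from the tutorial.

```r
sqMat <- matrix(c(1, 5, 4, 7, 0, 9, 2, 8, 5), nrow = 3)

# The product of a matrix and its inverse should be the 3 x 3 identity
inv     <- solve(sqMat)
product <- sqMat %*% inv
is_identity <- isTRUE(all.equal(product, diag(3)))
print(is_identity)

# solve(A, b) solves the linear system A x = b without forming the inverse
b   <- c(1, 2, 3)
sol <- solve(sqMat, b)
residual_ok <- isTRUE(all.equal(as.vector(sqMat %*% sol), b))
print(residual_ok)
```

all.equal() is used instead of == because floating-point rounding makes exact equality unreliable.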

As with vectors, it is useful to be able to extract sub-components of matrices. In this case, we
may wish to pick out individual elements, rows or columns. As before, the [ ] notation is used to
subscript. The following examples should make things clear:

> z<-matrix(1:6, nrow=3)
> z
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

> z[1,1]
[1] 1

> z[c(2,3), 2]
[1] 5 6

> z[, 2]
[1] 4 5 6

> z[1:2,]
[,1] [,2]
[1,] 1 4
[2,] 2 5
So, in particular, it is necessary to specify which rows and columns are required, whilst omitting the
integer for either dimension implies that every element in that dimension is selected.

##Tutorials##
1. Define
> x<-c(4,2,6)
> y<-c(1,0,-1)
Decide what the result will be of the following:
(a) length(x)
(b) sum(x)
(c) sum(x^2)
(d) x+y
(e) x*y
(f) x-2
(g) x^2
Use R to check your answers.


2. Decide what the following sequences are and use R to check your answers:
(a) 7:11
(b) seq(2,9)
(c) seq(4,10,by=2)
(d) seq(3,30,length=10)
(e) seq(6,-4,by=-2)

3. Determine what the result will be of the following R expressions, and then use R to check if you are
right:
(a) rep(2,4)
(b) rep(c(1,2),4)
Suggest an alternative for above function:
(c) rep(c(1,2),c(4,4))
(d) rep(1:4,4)
(e) rep(1:4,rep(3,4))

4. Use the rep function to define simply the following vectors in R.


(a) 6, 6, 6, 6, 6, 6
(b) 5, 8, 5, 8, 5, 8, 5, 8
(c) 5, 5, 5, 5, 8, 8, 8, 8

##Tutorials on matrices##
Exercises
1. Create in R the matrices
x = 3 2     and     y = 1 4 0
    1 1                 0 1 1

Calculate the following and check your answers in R:


(a) 2 * x
(b) x * x
(c) x %*% x
(d) x %*% y
(e) t (y)
(f) solve (x)

2. With x and y as above, calculate the effect of the following subscript operations and check
your answers in R.
(a) x [1, ]
(b) x [2, ]
(c) Extract second column elements of y:
(d) y [1, 2]
(e) y [ ,2:3]


10. Subsetting

There are a number of operators that can be used to extract subsets of R objects.
[ always returns an object of the same class as the original; can be used to select more than one
element
[[ is used to extract elements of a list or a data frame; it can only be used to extract a single
element and the class of the returned object will not necessarily be a list or data frame
$ is used to extract elements of a list or data frame by name; semantics are similar to that of [[.
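
The difference between the three operators is easiest to see by checking the class of what each one returns. A small sketch with an illustrative list:

```r
x <- list(foo = 1:3, bar = "a")

single_bracket <- x[1]      # still a list, containing the element foo
double_bracket <- x[[1]]    # the element itself, an integer vector
dollar         <- x$bar     # the element itself, selected by name

class(single_bracket)   # "list"
class(double_bracket)   # "integer"
class(dollar)           # "character"
```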

##Examples on general subsetting


> x <- c("a", "b", "c", "c", "d", "a")

> x[1]
[1] "a"

> x[3:5]
[1] "c" "c" "d"

> x[ x > "b"]


[1] "c" "c" "d"

> res<- x > "b"


> res
[1] FALSE FALSE TRUE TRUE TRUE FALSE

> x[res]
[1] "c" "c" "d"

Matrices can be subset in the usual way with (i, j) type indices


##Examples on matrix subsetting
> mat<- matrix(1:6, nrow=3)
> mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

> mat[3, 2]
[1] 6

> mat[, 2]
[1] 4 5 6


> mat[2,]
[1] 2 5

By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather
than a 1 x 1 matrix; likewise, a single row or column is returned as a vector. This behavior can be
turned off by setting drop = FALSE.

> mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

> mat[2,]
[1] 2 5
> mat[2, , drop = FALSE]
[,1] [,2]
[1,] 2 5
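
The effect of drop can be confirmed with dim(), which returns NULL for a plain vector and the dimensions for a matrix. A minimal sketch using the same mat:

```r
mat <- matrix(1:6, nrow = 3)

row_as_vector <- mat[2, ]                 # dimensions are dropped
row_as_matrix <- mat[2, , drop = FALSE]   # stays a 1 x 2 matrix
col_as_matrix <- mat[, 2, drop = FALSE]   # stays a 3 x 1 matrix

dim(row_as_vector)   # NULL
dim(row_as_matrix)   # 1 2
dim(col_as_matrix)   # 3 1
```

Keeping the matrix form matters when the result is passed on to matrix operations such as %*%.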

##Example on list subsetting


> x<-list("foo"= 1:6, "bar" = 0.6)
>x
$foo
[1] 1 2 3 4 5 6

$bar
[1] 0.6

> x[1]
$foo
[1] 1 2 3 4 5 6

> x[1][1]
$foo
[1] 1 2 3 4 5 6

> x$foo
[1] 1 2 3 4 5 6

> x$bar
[1] 0.6

> x[2]
$bar
[1] 0.6

> x[2][1]
$bar
[1] 0.6

> x[1][2]
$<NA>
NULL

> x["bar"]
$bar
[1] 0.6

> x[["bar"]][1]
[1] 0.6

> x[[2]][[2]]
Error in x[[2]][[2]] : subscript out of bounds

##Example- Extracting multiple elements of a list

> x<-list("foo"= 1:4, "bar" = 0.6, "third"= "testing")


> x[c(1,3)]
$foo
[1] 1 2 3 4

$third
[1] "testing"

The [[ operator can be used with computed indices; $ can only be used with literal names.
##Example- Extracting list elements using computed indices (named indices)
> name<-"foo"

> x[[name]]
[1] 1 2 3 4

> x$name ## the $ operator doesn't work for computed indices


NULL

> x$foo
[1] 1 2 3 4

##Example: Subsetting nested elements of a list


> x<-list(a= list(10, 12, 14), b = list(15.5, 16.6))

> x[[2]]
[[1]]
[1] 15.5
[[2]]
[1] 16.6

> x[[2]][[1]]
[1] 15.5

> x[[2]][[2]]
[1] 16.6

> x[[1]][[3]]
[1] 14

Partial matching of names is allowed with [[ and $.


##Example: Subsetting using partial matching
> x<- list(abcdefabc = 0:5)
> x
$abcdefabc
[1] 0 1 2 3 4 5

> x[[a]]
Error: object 'a' not found
> x$a
[1] 0 1 2 3 4 5
> x[["a"]]
NULL
> x[["a", exact = FALSE]]
[1] 0 1 2 3 4 5

To extract the non-missing values (that is, to drop the NAs) from a given vector:


##Example:
> x<-c(1, 2, NA, 4, NA, 6)
> missing<-is.na(x)
> x[!missing]
[1] 1 2 4 6

Extracting non-NA elements from multiple vectors


##Example 1

> x<-c(1, 2, NA, 4, NA, 6)


> y<-c(NA, "b", NA, "d", "e", NA)

> good<- complete.cases(x, y)


> good
[1] FALSE TRUE FALSE TRUE FALSE FALSE

> x[good]
[1] 2 4

> y[good]
[1] "b" "d"

##Example 2
> airquality[1:6,]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> good<- complete.cases(airquality)

> airquality[good,][1:6,]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8

11. Creation of Dataset in R


In data analysis, the first step is to create a dataset that needs to be analyzed, in an appropriate
format that meets your need.

Creation of data set :


This task involves two major steps:

1. Select appropriate data structure to hold your data

The data structures are as follows (we have already studied these data structures in earlier
sessions)
Vectors
Lists
Factors
Matrices
Data Frames

2. Entering or importing data into selected data structure.

In R, data can be imported through various sources as follows

a. Keyboard
b. Statistical Packages (namely SAS, SPSS, Stata)
c. Text Files (namely ASCII, XML, web scraping)
d. Other (namely Excel, netCDF, HDF5)
e. Database Management Systems (namely SQL, MySQL, MongoDB, Oracle, Access)

a. Entering data from the keyboard

The edit() function in R invokes a text editor that allows you to enter your data manually. Here are the
steps involved:
1. Create an empty data frame (or matrix) with the variable names and modes you want to have in the
final dataset.
2. Invoke the text editor on this data object, enter your data, and save the results back to the data
object.

## Example
In the following example, you’ll create a data frame named mydata with three variables:
age (numeric) , gender (character) , and weight (numeric) .

You’ll then invoke the text editor, add your data, and save the results.

> mydata <- data.frame(age=numeric(0),
+ gender=character(0), weight=numeric(0))
> mydata <- edit(mydata)

> mydata
age gender weight
1 1 F 20

b. Importing data from a delimited text file


You can import data from delimited text files using read.table(), a function that reads a file in table
format and saves it as a data frame.

Syntax:
mydataframe <- read.table(file, header=logical_value, sep="delimiter", row.names="name")

where
file is a delimited ASCII file ,
header is a logical value indicating whether the first row contains variable names ( TRUE or FALSE ),
sep specifies the delimiter separating data values
row.names is an optional parameter specifying one or more variables to represent row identifiers.

## Example

> hw_data<- read.table("hw1_data.csv", header=TRUE, sep=",")


> View(hw_data) ## V in caps
> names(hw_data)
[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
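
The arguments of read.table() described above can also be tried without a file on disk. The following is a minimal sketch that assumes a small, made-up comma-separated table supplied through the text argument of read.table(); the column names id, age and weight are hypothetical and do not come from hw1_data.csv.

```r
# Hypothetical inline data standing in for a delimited text file
csv_text <- "id,age,weight
a,30,60.5
b,25,70.1"

# header = TRUE     : the first row holds the variable names
# sep = ","         : values are separated by commas
# row.names = "id"  : use the id column as row identifiers
people <- read.table(text = csv_text, header = TRUE, sep = ",", row.names = "id")

print(people)
print(names(people))     # column names, without the id column
print(rownames(people))  # row identifiers taken from the id column
```

Because id is used for the row names, it no longer appears among the columns of the resulting data frame.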

12. Reading Data in R


read.table , read.csv , for reading tabular data
readLines , for reading lines of a text file
source , for reading in R code files ( inverse of dump )
dget , for reading in R code files ( inverse of dput )
load , for reading in saved workspaces
unserialize , for reading single R objects in binary form

Various other read options


gzfile, opens a connection to a file compressed with gzip
bzfile, opens a connection to a file compressed with bzip2
url, opens a connection to a webpage

13. Writing Data in R

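There are analogous functions for writing data to files:

write.table , for writing tabular data
writeLines , for writing lines to a text file
dump , for writing R objects to an R code file ( inverse of source )
dput , for writing a single R object to an R code file ( inverse of dget )
save , for saving R objects to a workspace file ( inverse of load )
serialize , for writing a single R object in binary form ( inverse of unserialize )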
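
To see how the writing functions pair up with the reading functions, the following sketch writes a small data frame and reads it back in three ways. The data frame df and the temporary file names are made up for this illustration.

```r
# Small made-up data frame used only for this illustration
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Tabular text: write.table() paired with read.table()
txt_file <- tempfile(fileext = ".txt")
write.table(df, txt_file, sep = ",", row.names = FALSE)
df_txt <- read.table(txt_file, header = TRUE, sep = ",")

# R code representation: dput() paired with dget()
r_file <- tempfile(fileext = ".R")
dput(df, file = r_file)
df_code <- dget(r_file)

# Workspace file: save() paired with load()
rda_file <- tempfile(fileext = ".RData")
save(df, file = rda_file)
loaded_names <- load(rda_file)  # load() returns the names of the restored objects

print(df_txt)
print(all.equal(df, df_code))
print(loaded_names)
```

Note that dput() and save() keep the column types, whereas a text file written with write.table() has its types re-detected by read.table() when it is read back.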

14. Attaching to objects

R includes a number of datasets that it is convenient to use for examples. You can get a description
of what's available by typing
> data()

To access any of these datasets, you then type data(dataset) where dataset is the name of the dataset
you wish to access.

##Example
> data(trees)

Typing
> trees[1:5,]
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
gives us the first 5 rows of these data, and we can now see that the columns represent measurements of
girth, height and volume of trees (actually cherry trees: see help(trees)) respectively.

Now, if we want to work on the columns of these data, we can use the subscripting technique
explained above: for example, trees[, 2] gives all of the heights. This is a bit tedious however, and it
would be easier if we could refer to the heights more explicitly.
We can achieve this by attaching to the trees dataset:
> attach(trees)
Effectively, this adds trees to R's search path, much like a directory: if we type the name of an object, R will
also look inside trees to find it. Since Height is the name of one of the columns of trees, R now
recognises this object when we type the name. Hence, for example,

> mean(Height)
[1] 76
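
Attaching is convenient, but it leaves the columns of trees on the search path until they are removed again. As a sketch of two ways to keep this tidy (one possible approach, not part of the session above), with() evaluates an expression against the columns of a data frame, and detach() undoes an earlier attach().

```r
data(trees)

# with() evaluates an expression using the columns of trees,
# without changing the search path
mean_height <- with(trees, mean(Height))
print(mean_height)

# If attach() is used instead, pair it with detach() when finished
attach(trees)
mean_girth <- mean(Girth)
detach(trees)
print(mean_girth)
```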

7. Various functions performed on data


1. head(x,n) : The 'head()' function is an easy way to extract the first few elements of an R object.

##Example
> head(trees, 3)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2

##Example
> head(trees)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7

2. names(x) : You can get the column names of a data frame with the 'names()' function.
> names(trees)
[1] "Girth" "Height" "Volume"

3. tail(x,n): The 'tail()' function is an easy way to extract the last few elements of an R object.
> tail(trees, 3)
Girth Height Volume
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0

4. nrow(x) : You can use the 'nrow()' function to compute the number of rows in a data frame.
> nrow(trees)
[1] 31

5. ncol(x) :You can use the 'ncol()' function to compute the number of columns in a data frame.
> ncol(trees)
[1] 3

6. Subsetting operations

Where
x= A matrix/data frame/vector
n= The first n rows (head) /last n rows (tail)

##Tutorials on Dataset creation reading and writing ##


1. Read hw1_data.csv file using R function and store it in hw object
2. Answer following questions
1. In the dataset provided, what are the column names of the dataset?

2. Extract the first 3 rows of the data frame and print them to the console. What does the output
look like?
3. What is the value of Ozone in the 51st row?
4. Identify the class of hw object
5. How many missing values are in the Ozone column of this data frame?

15. Control Structures

Control structures in R allow you to control the flow of execution of the program, depending on
runtime conditions.
Run the command ?Control to get information on all control structures in R.
> ?Control

if, else: testing a condition


for: execute a loop a fixed number of times
while: execute a loop while a condition is true
repeat: execute an infinite loop
break: break the execution of a loop
next: skip an iteration of a loop
return: exit a function

1. If, else
if (condition) {
# do something
} else {
# do something else
}
Without else
if(condition)
{
}

##Example

Open RStudio, choose File -> New File -> R Script, and type the following code in the editor:
> x<-1
> if (x>3) {
+ y<-10
+ } else {
+ y<-5
+}
>y
[1] 5

is same as
> y <- if (x>3 ) {
+ 10
+ } else {
+ 5
+}
>y
[1] 5

2. for and nested for loops


A for loop works on an iteration variable and assigns successive values till the end of a sequence.
##Example 1
> for (i in 1:10) { print(i) }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

##Example 2

> x <- c("apples", "oranges", "bananas", "strawberries")


> for (i in 1:4) {
+ print(x[i])
+}
[1] "apples"
[1] "oranges"
[1] "bananas"
[1] "strawberries"

OR

> for (i in seq(x)) {


+ print(x[i])
+}
[1] "apples"
[1] "oranges"
[1] "bananas"
[1] "strawberries"

OR

> for (i in 1:4) print(x[i])


[1] "apples"
[1] "oranges"
[1] "bananas"
[1] "strawberries"

Nested loops
##Example
> m <- matrix(1:10, 2)
> for (i in seq( nrow(m))) {
+ for (j in seq( ncol(m))) {
+ print(m[i, j])
+ }
+}
[1] 1
[1] 3
[1] 5
[1] 7
[1] 9
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10

3. While
##Example
> i <- 1
> while (i < 10) {
+ print(i)
+ i <- i + 1
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9

Be sure there is a way to exit a while loop. If the condition never becomes false, the loop runs forever.
4. Repeat and break
##Example
> sum <- 1
> repeat
+{
+ sum <- sum + 2;
+ print(sum);
+ if(sum>11)
+ break;
+}
[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13

5. Next
##Example
> for (i in 1:20) {
+ if (i%%2 == 1) {
+ next
+ } else {
+ print(i)
+ }
+}
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20

## Example
Write a script agecount.R that counts the homicides of victims of a given age:
Check that "age" is not less than 10; otherwise print the message "Do not consider age for count"
Read the "homicides.txt" data file by using readLines()
Extract the ages of victims; ignore records where no age is given
Return an integer containing the count of homicides for that age

age <- 57
homicides <- readLines("homicides.txt")
if (age < 10) {
  print("Do not consider age for count")
} else {
  # match text such as " 57 years old"; records without an age never match
  pattern <- sprintf("\\s+%d\\s+years\\s+old", age)
  res <- grep(pattern, homicides, ignore.case = TRUE)
  length(res)
}

6. Looping functions in R
· lapply : Loop over a list and evaluate a function on each element
· sapply : Same as lapply but try to simplify the result
· apply : Apply a function over the margins of an array
· tapply : Apply a function over subsets of a vector
· mapply : Multivariate version of lapply
An auxiliary function split is also useful, particularly in conjunction with lapply.

1. lapply
lapply takes three arguments:
1. a list X ;
2. a function (or the name of a function) FUN ;
3. other arguments via its ... argument.
If X is not a list, it will be coerced to a list using as.list .
lapply always returns a list, regardless of the class of the input.
##Example
> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
>x
$a
[,1] [,2]
[1,] 1 3
[2,] 2 4

$b
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

An anonymous function for extracting the first column of each matrix.


> lapply(x, function(elt) elt[,1])

2. sapply
sapply will try to simplify the result of lapply if possible.
· If the result is a list where every element is length 1, then a vector is returned
· If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
· If it can’t figure things out, a list is returned
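
The three simplification rules can be seen side by side in a short sketch. The input list here is made up for the illustration; only the shape of each result matters.

```r
inputs <- list(a = 1:3, b = 1:5)

# Every result has length 1, so sapply() returns a vector
lens <- sapply(inputs, length)

# Every result has the same length (> 1), so sapply() returns a matrix
rng <- sapply(inputs, range)

# The results have different lengths, so sapply() falls back to a list
sq <- sapply(inputs, function(v) v^2)

print(lens)
print(rng)
print(sq)
```

Calling lapply() with the same arguments would return a list in all three cases.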
3. apply
apply is used to evaluate a function (often an anonymous one) over the margins of an array.
· It is most often used to apply a function to the rows or columns of a matrix
· It can be used with general arrays, e.g. taking the average of an array of matrices
· It is not really faster than writing a loop, but it works in one line!
Syntax :
> str(apply)
function (X, MARGIN, FUN, ...)
· X is an array
· MARGIN is an integer vector indicating which margins should be “retained”.
· FUN is a function to be applied
· ... is for other arguments to be passed to FUN

## Example
Margin 1 refers to rows, whereas margin 2 refers to columns.
> x<-matrix(1:24,nrow=4)
> x
> apply(x,1,sum) ## sum of each row
> apply(x,2,sum) ## sum of each column
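
For reference, the following sketch runs the same calls on a 4 x 6 matrix and stores the results, so the effect of the two margins can be compared. The variable names row_totals and col_totals are chosen for this illustration.

```r
x <- matrix(1:24, nrow = 4)  # 4 rows, 6 columns, filled column by column

row_totals <- apply(x, 1, sum)  # MARGIN = 1: one result per row
col_totals <- apply(x, 2, sum)  # MARGIN = 2: one result per column

print(row_totals)  # 66 72 78 84
print(col_totals)  # 10 26 42 58 74 90
```

For plain sums and means, the built-in rowSums(), colSums(), rowMeans() and colMeans() give the same results.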

4. mapply
mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,USE.NAMES = TRUE)
· FUN is a function to apply
· ... contains arguments to apply over
· MoreArgs is a list of other arguments to FUN .
· SIMPLIFY indicates whether the result should be simplified


##Example
The following is tedious to type
list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
Instead we can do
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4

5. tapply
tapply is used to apply a function over subsets of a vector. t stands for table.
> str(tapply)
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
· X is a vector
· INDEX is a factor or a list of factors (or else they are coerced to factors)
· FUN is a function to be applied
· ... contains other arguments to be passed to FUN
· simplify , should we simplify the result?
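
As a small illustration of these arguments, the sketch below computes a mean per group. The marks and section vectors are made-up data, not taken from any dataset used in this session.

```r
# Hypothetical marks for six students in two sections
marks   <- c(60, 70, 80, 50, 100, 30)
section <- c("A", "B", "A", "B", "A", "B")

# X = marks, INDEX = section (coerced to a factor), FUN = mean
section_means <- tapply(marks, section, mean)
print(section_means)  # A: 80, B: 50
```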

6. split
split takes a vector or other objects and splits it into groups determined by a factor or list of factors.
> str(split)
function (x, f, drop = FALSE, ...)
· x is a vector (or list) or data frame
· f is a factor (or coerced to one) or a list of factors
· drop indicates whether empty factor levels should be dropped
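
A common pattern is to split a vector and then apply a function to each piece. The sketch below does this for the Ozone column of the built-in airquality dataset, grouped by Month; the variable names are chosen for this illustration.

```r
# Split the Ozone readings of the built-in airquality data by month
ozone_by_month <- split(airquality$Ozone, airquality$Month)
print(names(ozone_by_month))  # one group per month: "5" "6" "7" "8" "9"

# Apply a function to each group; na.rm = TRUE is passed on to mean()
monthly_mean <- sapply(ozone_by_month, mean, na.rm = TRUE)
print(monthly_mean)
```

The same per-month means can be obtained in one step with tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE).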
##Tutorials on Control Structure ##
1. For given number identify odd or even number
2. Define list of months and print by using for loop
3. Define list of marks for a student for five subject and calculate sum of marks.

16. Some Functions in R

str - Compactly displays the internal structure of an R object


Used as a diagnostic function
##Example 1

> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6

> str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

##Example 2
> str(str)
function (object, ...)

##Example 3
> str(ls)
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
pattern)

##Example 4
> x<-c(1, 2, NA, 4, NA, 6)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.00 1.75 3.00 3.25 4.50 6.00 2
> str(x)
num [1:6] 1 2 NA 4 NA 6

Lexical Scoping and functions


R allows nested function definitions.
##Example
make.power<-function(n){
pow<-function(x){
x^n
}
pow
}

##Example of Execution
>cube<-make.power(3)
>square<-make.power(2)
>cube(4)
[1] 64
> square(5)
[1] 25

Environments consist of a frame, or collection of named objects, and a pointer to an enclosing
environment. The most common example is the frame of variables local to a function call; its
enclosure is the environment where the function was defined (unless changed subsequently).
To find the environment of a particular function, we can call environment() with the
function as its argument. The names defined in that environment (the function's closure) can then
be listed using the ls() function.

> ls(environment(cube))
[1] "n" "pow"
> get("n", environment(cube))
[1] 3

> ls(environment(square))
[1] "n" "pow"
> get("n", environment(square))
[1] 2

R uses lexical scoping: a free variable is looked up in the environment where the function was defined.


##Example:
y<-10
f<-function(x){
y<-2
y^2 + g(x)
}

g<-function(x){
x*y
}

Now, if we execute f(3), what do we get?


Note that g is defined outside f (at the top level), so f is not its enclosing environment.
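
To check the answer to the question above, the definitions can be run as written. This is a sketch of the reasoning: when g looks up the free variable y, it finds the y in the environment where g was defined (value 10), not the local y inside f (value 2).

```r
y <- 10

f <- function(x) {
  y <- 2          # local to f; not visible from inside g
  y^2 + g(x)
}

g <- function(x) {
  x * y           # y is a free variable, looked up where g was defined
}

result <- f(3)
print(result)     # 2^2 + 3 * 10 = 34
```

Under dynamic scoping, g would instead have used the y from its caller f, giving 2^2 + 3 * 2 = 10.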

17. Some Statistical Commands


mean(), median(), max(), min(), sd(), var() - compute the mean, median, maximum, minimum, standard deviation,
and variance of an object (typically a numeric vector), respectively. If the object is a matrix or data
frame, the function var() will assume that the columns of the object are different variables and return
the variance-covariance matrix.
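
As a brief sketch of this behaviour, var() can be applied to the built-in trees data frame, which has three numeric columns. The variable name vc is chosen for this illustration.

```r
data(trees)

# For a data frame or matrix, var() treats each column as a variable
vc <- var(trees)
print(dim(vc))   # 3 x 3: one row and one column per variable
print(vc)

# The diagonal holds the variance of each individual column
print(diag(vc))
```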

Summary Statistics of Data


Let's suppose we've collected some data from an experiment and stored them in an object x:
>x<-c(7.5,8.2,3.1,5.6,8.2,9.3,6.5,7.0,9.3,1.2,14.5,6.2)

Measures of Central Tendency


Some simple summary statistics of these data can be produced:
1. Mean: The mean is the summation of all the data values divided by the number of data values.
##Example:
> mean(x)
[1] 7.216667

2. Median: The median of a sequence is the middle score in a set of values that have been ranked in
numeric order. With an even number of values, it is the average of the two middle scores. The median can be calculated as:
##Example:
> median(x)
[1] 7.25

3. Mode: The mode is the most frequently occurring score in the data set. There is no direct function in
R to calculate the mode of a dataset.
##Example:
> xt<-table(x); xt
x
1.2 3.1 5.6 6.2 6.5 7 7.5 8.2 9.3 14.5
1 1 1 1 1 1 1 2 2 1
> which(xt==max(xt)) ->m;
> mode<-xt[m]
> mode
x ##bi-modal
8.2 9.3
2 2
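
Since the steps above are likely to be repeated, they can be wrapped in a small function. The following is a sketch only: get_modes is a hypothetical helper name, not a function provided by R.

```r
# get_modes is a hypothetical helper, not a base R function
get_modes <- function(v) {
  counts <- table(v)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # values with the highest frequency
  as.numeric(top)                              # table names are character strings
}

x <- c(7.5, 8.2, 3.1, 5.6, 8.2, 9.3, 6.5, 7.0, 9.3, 1.2, 14.5, 6.2)
modes <- get_modes(x)
print(modes)   # 8.2 9.3
```

Using a name such as get_modes also avoids masking the built-in function mode(), which reports the storage mode of an object and is unrelated to the statistical mode.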

4. Other functions:
##Example:
> length(x); min(x); max(x);
[1] 12
[1] 1.2
[1] 14.5

Measures of Variability
These measures are used to determine the degree of variation within a population or sample. These
measures include the range, variance and standard deviation.
1. Range: This value is simply the difference between the highest and lowest values in the data set.
Note that range() returns the minimum and maximum themselves; diff(range(x)) gives their difference (13.3 here).
##Example:
> range(x)
[1] 1.2 14.5

2. Variance: A more informative measure of variability is variance. This measure represents the degree
to which the data scores tend to vary from their mean. It is more informative than the range, because it
takes into account every score in the dataset, rather than the min and max values as in range().
Variance is, roughly, the average of the squared deviations from the mean. The steps to calculate the variance
for a set of data scores are:
a. Find the mean score
b. Find the deviation of each raw score from the mean. For this, subtract the mean
from each raw score.
c. Square the deviation scores. (Reason: negative differences are made positive and
extreme scores are given more weight)
d. Find the sum of the squared deviation scores
e. Divide the sum by the number of scores to give the (population) variance. Note that var() in R
computes the sample variance, which divides by the number of scores minus one (n - 1).
##Example:
> var(x)
[1] 11.00879
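
The steps above can be followed by hand and compared with var(). This is a sketch using the same vector x; the intermediate variable names are chosen for this illustration.

```r
x <- c(7.5, 8.2, 3.1, 5.6, 8.2, 9.3, 6.5, 7.0, 9.3, 1.2, 14.5, 6.2)

m    <- mean(x)          # step a: mean score
devs <- x - m            # step b: deviation of each score from the mean
sq   <- devs^2           # step c: squared deviations
ss   <- sum(sq)          # step d: sum of squared deviations
n    <- length(x)

pop_var    <- ss / n        # step e: divide by the number of scores
sample_var <- ss / (n - 1)  # what var() computes

print(pop_var)
print(sample_var)
print(var(x))
```

The value printed for sample_var matches var(x), while pop_var is slightly smaller because it divides by n instead of n - 1.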

3. Standard Deviation: Simply the square root of the variance. This value is important because it
is expressed in the same units as the raw data values.
> sd(x)
[1] 3.317949

Finally,
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.200 6.050 7.250 7.217 8.475 14.500

which should all be self explanatory.


It may be, however, that we subsequently learn that the first
6 data correspond to measurements made on one machine, and the second six on another machine.
This might suggest summarizing the two sets of data separately, so we would need to extract from
x the two relevant sub-vectors.

This is achieved by subscripting:


> x[1:6]
[1] 7.5 8.2 3.1 5.6 8.2 9.3
> summary(x[1:6])
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.100 6.075 7.850 6.983 8.200 9.300

and
> x[7:12]
[1] 6.5 7.0 9.3 1.2 14.5 6.2
> summary(x[7:12])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.200 6.275 6.750 7.450 8.725 14.500

##Tutorial
1. The data y<-c(33,44,29,16,25,45,33,19,54,22,21,49,11,24,56) contain sales of milk
in litres for 5 days in three different shops (the first 3 values are for shops 1,2 and 3 on
Monday, etc.) Produce a statistical summary of the sales for each day of the week and also for each
shop.

Attaching to objects
##Example:
> data(trees)
> attach(trees)
> head(trees)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7

> length(trees)
[1] 3

> str(trees)
'data.frame': 31 obs. of 3 variables:
$ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
$ Height: num 70 65 63 72 81 83 66 75 80 75 ...
$ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> mean(trees[, 1])
[1] 13.24839
> mean(trees[, 2])
[1] 76

OR

> mean(Height) ##works only if the dataset is attached
[1] 76

OR
> mean(trees$Height) ##works even if the dataset is not attached
[1] 76

> mean(trees[, 3])
[1] 30.17097

In actual fact, trees is an object called a data-frame, essentially a matrix with named columns (though a
data-frame, unlike a matrix, may also include non-numerical variables, such as character names).

Because of this, there is another equivalent syntax to extract, for example, the vector of heights:
> trees$Height
which can also be used without having first attached to the dataset.

##Tutorial

1. Attach to the dataset quakes and produce a statistical summary of the variables depth and
mag.
2. Attach to the dataset mtcars and find the mean weight and mean fuel consumption for vehicles in the
dataset (type help(mtcars) for a description of the variables available).

The apply function


It is possible to write loops in R, but they are best avoided whenever possible. A common situation is
where we want to apply the same function to every row or column of a matrix. For example, we may
want to find the mean value of each variable in the trees dataset. Obviously, we could operate on each
column separately but this can be tedious, especially if there are many columns.

The function apply simplifies things. It is easiest understood by example:


> apply(trees, 2, mean)
Girth Height Volume
13.24839 76.00000 30.17097

has the effect of calculating the mean of each column (dimension 2) of trees. We'd have used a 1
instead of a 2 if we wanted the mean of every row.

Any function can be applied in this way, though if optional arguments to the function are required
these need to be specified as well - see help(apply) for further details.
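
As an example of passing an optional argument through apply, the sketch below computes column means of the built-in airquality dataset, whose Ozone and Solar.R columns contain missing values. The variable names are chosen for this illustration.

```r
# Column means of airquality; Ozone and Solar.R contain NAs
plain_means <- apply(airquality, 2, mean)
clean_means <- apply(airquality, 2, mean, na.rm = TRUE)  # na.rm is passed on to mean()

print(plain_means)  # NA for the columns with missing values
print(clean_means)
```

Any argument given after the function name is handed unchanged to that function for every row or column.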

##Tutorial
1. Repeat the analyses of the datasets quakes and mtcars using the function apply to simplify the
calculations.
2. If y is the 2 × 3 matrix
    | 1 4 0 |
y = | 0 1 1 |
what is the result of apply(y[,2:3],1,mean)? Check your answer in R.

Statistical Computation and Simulation


Many of the tedious statistical computations that would once have had to have been done from
statistical tables can be easily carried out in R. This can be useful for finding confidence intervals etc.
##Example of The Normal Distribution
There are functions in R to evaluate the density function, the distribution function and the quantile
function (the inverse distribution function). These functions are, respectively, dnorm, pnorm and
qnorm.
Description
Density, distribution function, quantile function and random generation for the normal distribution
with mean equal to mean and standard deviation equal to sd.
Usage
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number required.
mean vector of means.
sd vector of standard deviations.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are P[X ≤ x] otherwise, P[X > x].
Details
If mean or sd are not specified they assume the default values of 0 and 1, respectively.
The normal distribution has density
f(x) = 1/(√(2 π) σ) e^-((x - μ)^2/(2 σ^2))
where μ is the mean of the distribution and σ the standard deviation.
Value
dnorm gives the density, pnorm gives the distribution function, qnorm gives the quantile function,
and rnorm generates random deviates

For example, suppose X ~ N(3, 2^2), then


> dnorm(x, 3,2)
[1] 1.586983e-02 6.791485e-03 1.992220e-01 8.568430e-02 6.791485e-03 1.397129e-03
[7] 4.313866e-02 2.699548e-02 1.397129e-03 1.330426e-01 1.319622e-08 5.546042e-02
will calculate the density function at the points contained in the vector x. Note that dnorm will assume mean 0
and standard deviation 1 unless these are specified. Note also that the function expects
the standard deviation rather than the variance. As an example of the defaults:
> dnorm(x) ##default mean=0, sd=1
[1] 2.434321e-13 9.998379e-16 3.266819e-03 6.182621e-08 9.998379e-16 6.604580e-20
[7] 2.669557e-10 9.134720e-12 6.604580e-20 1.941861e-01 8.824755e-47 1.793784e-09

> dnorm(5, 3, 2)
[1] 0.1209854
evaluates the density of the N(3, 4) distribution at x = 5.

As a further example
> y<-seq(-5, 10, 0.1)
> dnorm(y, 3, 2)
calculates the density function of the same distribution at intervals of 0.1 over the range [-5, 10].

The functions pnorm and qnorm work in an identical way - use help for further information.
Similar functions exist for other distributions. For example, dt, pt and qt for the t-distribution, though
in this case it is necessary to give the degrees of freedom rather than the mean and standard deviation.
Other distributions available include the binomial, exponential, Poisson and gamma, though care is
needed interpreting the functions for discrete variables.
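
A short sketch of pnorm and qnorm for the same N(3, 2^2) distribution is given below, followed by one call for the t-distribution. The evaluation point 5 and the 5 degrees of freedom are arbitrary choices for illustration.

```r
# X ~ N(3, 2^2): mean 3, standard deviation 2
p     <- pnorm(5, mean = 3, sd = 2)             # P(X <= 5)
q     <- qnorm(p, mean = 3, sd = 2)             # quantile function inverts pnorm, giving 5 back
upper <- pnorm(5, 3, 2, lower.tail = FALSE)     # P(X > 5)

print(p)
print(q)
print(p + upper)   # the two tail probabilities sum to 1

# t-distribution: degrees of freedom are given instead of mean and sd
print(pt(0, df = 5))   # 0.5 by symmetry
```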

One further important technique for many statistical applications is the simulation of data from
specified probability distributions. R enables simulation from a wide range of distributions, using a
syntax similar to the above.
For example, to simulate 100 observations from the N(3, 4) distribution we write
> rnorm(100,3,2)
Similarly, rt, rpois for simulation from the t and Poisson distributions, etc.
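
Simulated values change from run to run unless the random number generator is seeded. The sketch below assumes an arbitrary seed and a larger sample of 1000 values (both are choices made for this illustration) and compares the sample statistics with the true values of the N(3, 4) distribution.

```r
set.seed(123)  # arbitrary seed so that the run can be repeated

sim <- rnorm(1000, mean = 3, sd = 2)  # 1000 draws from N(3, 4)

print(length(sim))
print(mean(sim))   # should be close to 3
print(sd(sim))     # should be close to 2
print(var(sim))    # should be close to 4
```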
Exercises
1. Suppose X ~ N(2, 0.25). Denote by f and F the density and distribution functions of X respectively.
Use R to calculate
(a) f(0.5)
(b) F(2.5)
(c) F^-1(0.95) (recall that F^-1 is the quantile function)
(d) Pr(1 ≤ X ≤ 3)
2. Repeat question 1 in the case that X has a t-distribution with 5 degrees of freedom.
3. Use the function rpois to simulate 100 values from a Poisson distribution with a parameter
of your own choice. Produce a statistical summary of the result and check that the mean and
variance are in reasonable agreement with the true population values.
4. Repeat the previous question replacing rpois with rexp.

18. Graphics
R has many facilities for producing high quality graphics. A useful facility before beginning is to
divide a page into smaller pieces so that more than one figure can be displayed.
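
One way to divide the page is the mfrow setting of par(). The following sketch assumes a layout of 1 row and 2 columns and uses the built-in trees dataset; the choice of plots is only for illustration.

```r
data(trees)

# Split the plotting page into 1 row and 2 columns; par() returns the old settings
old_par <- par(mfrow = c(1, 2))
layout_in_use <- par("mfrow")

hist(trees$Height, main = "Height", xlab = "Height")
boxplot(trees$Girth, main = "Girth")

par(old_par)  # restore the previous single-figure layout
```

Saving the value returned by par() and restoring it afterwards keeps later plots from being squeezed into the split layout.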

R has very good graphics capability. There are basically two types of graphics functions in R:
1. High-level functions: create a new graph
2. Low-level functions: add elements to an existing graph
The various graphics functions are listed in the following table:
Sr.No  Function      Graph Type
1      plot()        Scatter plot (from vectors of x and y values)
2      hist()        Histogram
3      boxplot()     Box-and-whisker plot
4      stripchart()  Strip chart
5      barplot()     Bar graph
6      stem()        Stem-and-leaf display
Some additional arguments of the graphics functions:
Sr.No  Argument  Description
1      main      Title of the plot
2      xlab      Label of the x-axis
3      ylab      Label of the y-axis
4      xlim      Limits of the x-axis
5      ylim      Limits of the y-axis
6      type      Type of plot:
                 l = lines
                 p = points
                 b = both points and lines
                 o = lines overplotted on the points
7      pch       Plotting character (integer symbol code; 0 to 25 are built in)
8      lty       Line type
9      cex       Scaling factor for the size of text and plotting symbols
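
The arguments in the table above can be combined in a single call to plot(). The following is a minimal sketch; the data and all argument values are made up for illustration.

```r
# Illustrative data (made up for this sketch)
x <- 1:10
y <- x^2

plot(x, y,
     main = "y = x^2",         # title
     xlab = "x", ylab = "y",   # axis labels
     xlim = c(0, 12),          # limits of the x-axis
     ylim = c(0, 120),         # limits of the y-axis
     type = "b",               # both points and lines
     pch  = 16,                # filled circle as plotting symbol
     lty  = 2,                 # dashed line
     cex  = 1.2)               # symbols drawn 20% larger than the default
```

Changing one argument at a time (for example type = "l" or pch = 2) is a quick way to see what each of them controls.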

Low level graphics functions

Sr.No  Function   Description
1      lines()    Adds lines
2      abline()   Adds a straight line given by intercept and slope
3      points()   Adds points
4      text()     Adds text to the plot
5      legend()   Adds a legend (list of symbols)
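
As a rough illustration of how the low-level functions build on a high-level call, the sketch below first creates a scatter plot and then adds elements to it. The simulated data, the seed and the label text are assumptions chosen for this example.

```r
# Simulated data around the line y = 1 + 2x (values chosen only for illustration)
set.seed(1)
x <- 1:20
y <- 1 + 2 * x + rnorm(20, sd = 3)

# High-level call: creates a new graph
plot(x, y, main = "Low-level additions")

# Low-level calls: each one adds to the existing graph
abline(1, 2, lty = 2)                           # line with intercept 1 and slope 2
lines(x, y, col = "grey")                       # connect the observations
big <- which.max(y)                             # index of the largest observation
points(x[big], y[big], pch = 16, col = "red")   # highlight it
text(x[big], y[big], labels = "max", pos = 2)   # label it, to the left of the point
legend("topleft",
       legend = c("observations", "y = 1 + 2x"),
       pch = c(1, NA), lty = c(NA, 2))
```

Calling a low-level function without an open plot gives an error, which is why plot() comes first.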

1. Plot Scatter Plot:

A scatter plot (also called a scatterplot or scattergraph) is a diagram that uses Cartesian
coordinates to display the values of two variables for a set of data.
> data(mtcars)

> View(mtcars)
> help(mtcars)
> plot(mtcars$mpg,mtcars$cyl)

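The ggplot2 package offers qplot() as a quick equivalent of plot(). The package must be installed
first, for example with install.packages("ggplot2"). Recent ggplot2 releases mark qplot() as
deprecated in favour of ggplot().

> library(ggplot2)
> qplot(mtcars$mpg,mtcars$cyl)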


X<-seq(0,2,by=0.2)
y<-X
y1<-y^2
y2<-y^3

plot(X,y,"o",lty=1,xlab="Xaxis",ylab="Yaxis",ylim=range(0,max(y2)),cex=0.7,lwd=2)
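
The vectors y1 and y2 are defined above and the y-axis is sized to hold y2, so a natural next step is to overlay both curves on the same graph. The sketch below is one possible way to do that; the colours, line types and legend text are assumptions and not part of the original example.

```r
# Same data as in the example above
X  <- seq(0, 2, by = 0.2)
y  <- X
y1 <- y^2
y2 <- y^3

# High-level call draws the first curve; ylim leaves room for the largest one
plot(X, y, type = "o", lty = 1, xlab = "Xaxis", ylab = "Yaxis",
     ylim = range(0, max(y2)), cex = 0.7, lwd = 2)

# Low-level calls add the other two curves to the same graph
lines(X, y1, type = "o", lty = 2, col = "red")
lines(X, y2, type = "o", lty = 3, col = "blue")

# Legend matching the line types and colours used above
legend("topleft", legend = c("y = X", "y = X^2", "y = X^3"),
       lty = 1:3, col = c("black", "red", "blue"))
```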


Try:
plot(mtcars$mpg,mtcars$cyl,type="l")
points(mtcars$mpg,mtcars$cyl)
lines(mtcars$mpg,mtcars$cyl , col="red")
points(mtcars$mpg,mtcars$cyl , col="red")

Plot Barplot
barplot(table(mtcars$cyl))
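
The call above works because table() counts how many cars fall in each cylinder class, and barplot() draws one bar per count. A slightly extended sketch with a title and axis labels follows; the label text and the variable name counts are assumptions for illustration.

```r
data(mtcars)

# Frequency of each cylinder class (4, 6 and 8 cylinders)
counts <- table(mtcars$cyl)
counts

# One bar per class, with a title and axis labels
barplot(counts,
        main = "Cars by number of cylinders",
        xlab = "Cylinders",
        ylab = "Number of cars")
```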


Plot Histogram
hist(mtcars$mpg)


Plot BoxPlot

boxplot(mtcars$cyl,mtcars$mpg)
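
The call above draws one box for cyl and one for mpg side by side, although the two variables are on different scales. A common alternative, sketched here as a suggestion rather than as part of the original example, is the formula interface, which draws one box of mpg for each value of cyl. The labels and the name bp are assumptions.

```r
data(mtcars)

# One box of miles per gallon for each cylinder class
bp <- boxplot(mpg ~ cyl, data = mtcars,
              main = "Mileage by number of cylinders",
              xlab = "Cylinders",
              ylab = "Miles per gallon")

# boxplot() invisibly returns the statistics it used; names holds the group labels
bp$names
```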

Plot pie chart

pie(1:8,col=1:8)


##Some More Examples:


> par(mfrow=c(2,2))
creates a graphics window divided into a grid of 2 rows and 2 columns. With this choice the panels
are filled row by row. Use mfcol instead of mfrow to fill them column by column. The function par()
is a general function for setting graphical parameters. There are many options: see help(par).
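
A self-contained sketch of this layout mechanism is given below. It uses the built-in trees data set and refers to its columns as trees$Height and trees$Volume so that no earlier attach() is needed; the titles and the name old_par are assumptions for illustration.

```r
data(trees)

# Split the graphics window into a 2 x 2 grid, filled row by row.
# par() returns the previous settings, which are kept so they can be restored.
old_par <- par(mfrow = c(2, 2))

hist(trees$Height, main = "Height", xlab = "Height")
boxplot(trees$Height, main = "Height")
hist(trees$Volume, main = "Volume", xlab = "Volume")
boxplot(trees$Volume, main = "Volume")

# Restore the previous layout so that later plots fill the whole window again
par(old_par)
```

Restoring the old settings is optional, but without it every later plot keeps being drawn in a quarter of the window.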

Figure 1: Tree heights and volumes


So, for example
> par(mfrow = c(2,2))
> hist(Height)
> boxplot(Height)
> hist(Volume)
> boxplot(Volume)

We can also plot one variable against another using the function plot:
> plot(Height, Volume)


Figure 2: Scatterplot matrix for tree data

R can also produce a scatterplot matrix (a matrix of scatterplots for each pair of variables)
using the function pairs:
> pairs(trees)

Like many other functions, plot is object-specific: its behaviour depends on the object to which it is
applied. For example, if the object is a data frame with more than two columns, plot produces the
same scatterplot matrix as pairs: try plot(trees).
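
The object-specific behaviour can be explored by checking the class of an object before plotting it, since that is what plot looks at when choosing what to draw. The sketch below does this for a few built-in data sets; it is only an illustration of the idea.

```r
data(trees); data(nhtemp); data(faithful); data(HairEyeColor)

# The class of each object determines which kind of plot is produced
class(trees)          # data frame with three columns
class(faithful)       # data frame with two columns
class(nhtemp)         # time series
class(HairEyeColor)   # contingency table

plot(trees)           # scatterplot matrix, as with pairs(trees)
plot(faithful)        # ordinary scatter plot of the two columns
plot(nhtemp)          # time series plot
plot(HairEyeColor)    # mosaic plot
```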
For some other possibilities try:
> data(nhtemp)
> str(nhtemp)
Time-Series [1:60] from 1912 to 1971: 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9
49.3 51.9 ...
> head(nhtemp)
[1] 49.9 52.3 49.4 51.1 49.4 47.9
> length(nhtemp)
[1] 60
> plot(nhtemp)


> data(faithful)
> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
> str(faithful)
'data.frame': 272 obs. of 2 variables:
$ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
$ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
> plot(faithful)

> data(HairEyeColor)
> plot(HairEyeColor)

There are also many optional arguments in most plotting functions that can be used to control colours,
plotting characters, axis labels, titles etc. The functions points and lines are useful for adding points
and lines respectively to a current graph. The function abline is useful for adding a line with specified
intercept and slope.
To print a graph, point the cursor over the graphics window and press the right button on the mouse.
This should open up a menu which includes `print' as an option. You also have the option to save the
figure in various formats, for example as a postscript file, for storage and later use.
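
A figure can also be saved from code instead of through the menu, by opening a file device, plotting, and closing the device. The sketch below writes a PDF into the session's temporary directory; the file name, the figure size and the labels are assumptions for illustration. The functions postscript() and png() open other file formats in the same way.

```r
data(nhtemp)

# Hypothetical output file in the temporary directory of this R session
out_file <- file.path(tempdir(), "nhtemp.pdf")

pdf(out_file, width = 6, height = 4)   # open a PDF device (size in inches)
plot(nhtemp, main = "Annual mean temperature", ylab = "Temperature")
dev.off()                              # close the device so the file is written

file.exists(out_file)
```

Nothing appears on screen while the file device is open; the plot goes to the file, and it is complete only after dev.off() has been called.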

##Tutorial
1. Use
> x<-rnorm(100)
or something similar, to generate some data. Produce a figure showing a histogram and
boxplot of the data. Modify the axis names and title of the plot in an appropriate way.
2. Type the following:
> x<- (-10):10
> n<-length(x)
> y<-rnorm(n,x,4)
> plot(x,y)
> abline(0,1)
Try to understand the effect of each command and the graph that is produced.
3. Type the following:
> data(nhtemp)
> plot(nhtemp)
This produces a time series plot of measurements of annual mean temperatures in New Hampshire,
U.S.A.
To show both points and lines, use type='b' instead.

