INTRODUCTION TO R:
R IS A PROGRAMMING LANGUAGE
R IS AN ANALYTICAL TOOL
R IS A SCRIPTING LANGUAGE
32/64bit.
Features of R:
• Easier automation
• Faster computation
• It's free
• State-of-the-art graphics
R Basics
Control Structures
Conditional Executions
Comparison Operators
equal: ==
not equal: !=
greater/less than: > <
greater/less than or equal: >= <=
Logical Operators
and: &
or: |
not: !
If Statements
If statements operate on length-one logical vectors.
Syntax
Example
if(1==0) {
print(1)
} else {
print(2)
}
[1] 2
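Keep '} else' on one line: if 'else' starts a new line at the top level, R treats the if statement as complete at the closing brace and reports an error at 'else'.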
Ifelse Statements
Ifelse statements operate on vectors of variable length.
Syntax
Example
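As an illustration of the vectorized form, here is a minimal sketch. The vector x and the two labels are made up for this example and are not from the original notes.

```r
# ifelse(test, yes, no) evaluates the test element by element.
x <- c(3, 8, 5, 10)                      # hypothetical input vector
labels <- ifelse(x > 5, "high", "low")   # one result per element of x
print(labels)
```

Unlike if, which looks at a single logical value, ifelse returns a vector of the same length as the test.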
Loops
The most commonly used loop structures in R are for loops, while loops and the apply family of functions. Less
common are repeat loops. The break statement exits a loop,
and next stops processing the current iteration and moves on to the next value of the looping index.
For Loop
For loops are controlled by a looping vector. In every iteration of the loop one value in
the looping vector is assigned to a variable that can be used in the statements of the
body of the loop. Usually, the number of loop iterations is defined by the number of
values stored in the looping vector and they are processed in the same order as they are
stored in the looping vector.
Syntax
for(variable in sequence) {
statements
}
While Loop
Similar to the for loop, but the iterations are controlled by a conditional statement.
Syntax
while(condition) statements
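The two loop forms, together with next and break, can be sketched as follows. The variable names and values are illustrative only.

```r
# for loop: iterate over a looping vector, skipping even values with next
odd_sum <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next    # skip the rest of this iteration
  odd_sum <- odd_sum + i
}
print(odd_sum)

# while loop: runs as long as the condition holds; break leaves it early
n <- 0
while (TRUE) {
  n <- n + 1
  if (n >= 5) break        # exit the loop once n reaches 5
}
print(n)
```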
Functions
A very useful feature of the R environment is the possibility to expand existing functions
and to easily write custom functions. In fact, most of the R software can be viewed as a
series of R functions.
The value returned by a function is the value of the function body, which is usually the
final unassigned expression. A value can also be returned explicitly with return().
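A small sketch of a custom function, written both ways. The function names are hypothetical and chosen only for this example.

```r
# The last evaluated expression is the return value.
square_plus_one <- function(x) {
  x^2 + 1            # returned implicitly
}

# The same computation with an explicit return().
square_plus_one_explicit <- function(x) {
  return(x^2 + 1)
}

print(square_plus_one(3))
print(square_plus_one_explicit(1:3))   # functions built from vectorized operators accept vectors
```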
Running R Programs
(1) Executing an R script from the R console
source("my_script.R")
Examples of usage
help(): help(mean)
Variables are assigned R objects, and the data type of the R object
becomes the data type of the variable. There are many types of R
objects. The frequently used ones are:
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
Vectors:
When you want to create a vector with more than one element, use the
c() function, which combines the elements into a vector.
# Create a vector (the element values are illustrative).
apple <- c("red", "green", "yellow")
print(apple)
print(class(apple))
Lists
A list is an R object that can contain many different types of elements,
such as vectors, functions and even other lists.
# Create a list.
list1 <- list(c(2, 5, 3), 21.3, sin)
print(list1)
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created
using a vector input to the matrix function.
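# Create a matrix (the values and dimensions are illustrative).
M <- matrix(c("a", "a", "b", "c", "b", "a"), nrow = 2, ncol = 3, byrow = TRUE)
print(M)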
Arrays
While matrices are confined to two dimensions, arrays can have any
number of dimensions. The array function takes a dim argument, which
sets the size of each dimension. In the example below we
create an array with two elements, each of which is a 3x3 matrix.
# Create an array (the element values are illustrative).
a <- array(c("green", "yellow"), dim = c(3, 3, 2))
print(a)
, , 2
Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a
data frame can contain a different mode of data. The first column can be
numeric while the second column is character and the third column is
logical. A data frame is a list of vectors of equal length.
# Only the Age column of the original example survives; the other columns are elided.
BMI <- data.frame(Age = c(42, 38, 26))
print(BMI)
Finding Variables
To list all the variables currently available in the workspace, use
the ls() function.
print(ls())
The ls() function can also match variable names against a pattern:
print(ls(pattern = "var"))
Deleting Variables
Variables can be deleted by using the rm() function. Below we delete the
variable var.3. Printing the variable afterwards throws an error.
rm(var.3)
print(var.3)
[1] "var.3"
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm() and ls() functions
together.
rm(list = ls())
print(ls())
>View(women) in Console.
>nrow(women)
>ncol(women)
>summary(women)
>str(women)
>dim(women)
>women$height
To check the type (or class) of a variable, the class function can be used:
>class(women)
PROGRAM 2 :
INTERACT DATA THROUGH .csv Files(Import and Export to .csv Files)
print(getwd())
setwd("/web/com")
print(getwd())
This result depends on your OS and your current directory where you are
working.
You can create this file using Windows Notepad by copying and pasting
this data. Save the file as input.csv using the Save As, All files (*.*)
option in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
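data <- read.csv("input.csv")            # read the file created above
print(data)
retval <- subset(data, dept == "IT")     # illustrative filter; the original selection is not shown
write.csv(retval, "output.csv")
newdata <- read.csv("output.csv")
print(newdata)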
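The export and import steps can also be tried without touching the working directory. The following is a minimal sketch that assumes a three-row data frame modelled on the sample data above and uses a temporary file; the names emp, csv_path, emp_back and it_staff are hypothetical.

```r
# Write a small data frame to a temporary CSV file and read it back.
emp <- data.frame(
  id = 1:3,
  name = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611),
  dept = c("IT", "Operations", "IT")
)
csv_path <- tempfile(fileext = ".csv")        # temporary file location
write.csv(emp, csv_path, row.names = FALSE)   # row.names = FALSE avoids an extra index column
emp_back <- read.csv(csv_path)
it_staff <- subset(emp_back, dept == "IT")    # keep only the IT rows
print(it_staff)
```

Passing row.names = FALSE is a design choice: without it, write.csv adds a column of row names that shows up as an extra column when the file is read back.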
Swirl is a platform for learning (and teaching) statistics and R simultaneously and
interactively. It presents a choice of course lessons and interactively tutors a user
through them. A user may be asked to watch a video, to answer a multiple-choice
or fill-in-the-blanks question, or to enter a command in the R console precisely as
if he or she were using R in practice. Emphasis is on the last, interacting with the R
console. User responses are tested for correctness and hints are given if
appropriate. Progress is automatically saved so that a user may quit at any time and
later resume without losing work.
WHAT IS SWIRL() IN R
• swirl is a software package for the R programming language that turns the R console into an interactive learning environment. Users receive immediate feedback as they are guided through self-paced lessons in data science and R programming.
install.packages("swirl")
library(swirl)
install_from_swirl("Getting and Cleaning Data")
Packages in Swirl()
dplyr()
Specifically, dplyr supplies five 'verbs' that cover most fundamental data
manipulation tasks: select(), filter(), arrange(), mutate() and summarize().
The easiest way to install and run swirl is by typing the following from the R
console:
install.packages("swirl")
library(swirl)
swirl()
What is dplyr?
dplyr is a powerful R-package to transform and summarize tabular data with rows
and columns
To install dplyr
install.packages("dplyr")
To load dplyr
library(dplyr)
• You only need to install a package once per computer, but you need
to load it every time you open a new R session and want to use that
package.
• To select columns of a data frame, use select(). The first argument to this function is
the data frame (ToothGrowth), and the subsequent arguments are the columns to
keep.
>aa<-select(ToothGrowth,len,supp,dose)
>plot(aa)
• filter():
To choose rows, pass a logical condition. Comparison uses a double equals sign.
• filter(ToothGrowth, len == 5)
Pipes (%>%)
• Pipes let you take the output of one function and send it directly to the next, which is
useful when you need to do many things to the same data set.
>ToothGrowth %>%
+ select(len,supp,dose)
MUTATE():
>ToothGrowth %>%
+ mutate(len = len/ 4)
• If this runs off your screen and you just want to see the first few rows, you can use a
pipe to view the head() of the data
>ToothGrowth %>%
+ mutate(len=len/4) %>%
+ head()
group_by():
• group_by() splits the data into groups upon which some operations can be run
>ToothGrowth %>%
+ group_by(len) %>%
+ tally()
summarize():
• group_by() is often used together with summarize(), which collapses each
group into a single-row summary of that group.
>ToothGrowth %>%
+ group_by(len) %>%
+ summarize(n = n())  # illustrative summary; the original call is cut off after the pipe
The functions we are discussing in this chapter are mean, median and
mode.
Mean
It is calculated by taking the sum of the values and dividing with the
number of values in a data series.
Syntax
The basic syntax for calculating mean in R is:
mean(x, trim = 0, na.rm = FALSE, ...)
x is the input vector.
trim is the fraction of observations to drop from each end of the sorted
vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
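The effect of the trim and na.rm arguments can be sketched on the same vector. The trim fraction 0.3 and the appended NA are choices made for this illustration.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

plain_mean   <- mean(x)               # sum of values divided by their count
trimmed_mean <- mean(x, trim = 0.3)   # drops 3 values from each end of the sorted vector

x_na         <- c(x, NA)              # same data with one missing value
mean_with_na <- mean(x_na)            # NA, because the missing value propagates
mean_na_rm   <- mean(x_na, na.rm = TRUE)

print(c(plain_mean, trimmed_mean, mean_with_na, mean_na_rm))
```

With ten values and trim = 0.3, the sorted vector loses -21, -5, 2 at the low end and 12, 18, 54 at the high end, so the trimmed mean is the average of 3, 4.2, 7 and 8.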
Median
The middle-most value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is:
median(x, na.rm = FALSE)
x is the input vector.
na.rm is used to remove the missing values from the input vector.
Example
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
median.result <- median(x)
print(median.result)
[1] 5.6
Mode
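The mode is the value that has the highest number of occurrences in a set of
data. Unlike mean and median, mode can be computed for both numeric and
character data. Base R has no built-in function for the statistical mode, so a small
user-defined function is used.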
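# Create the function: return the most frequent value of v.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
result <- getmode(v)
print(result)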
RANGE
> range(y,finite=TRUE)
[1] -4 43
Range Of Values
range returns a vector containing the minimum and maximum of all the given arguments.
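Usage
range(..., na.rm = FALSE, finite = FALSE)
Arguments
...      any numeric or character objects.
na.rm    logical, indicating if NA's should be omitted.
finite   logical, indicating if all non-finite elements should be omitted.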
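The difference between na.rm and finite can be sketched with a short vector. The vector y below is hypothetical, chosen so that the finite range matches the output shown earlier.

```r
y <- c(12, -4, NA, 43, Inf)               # contains a missing and an infinite value

range_na_rm  <- range(y, na.rm = TRUE)    # NA dropped, Inf kept
range_finite <- range(y, finite = TRUE)   # NA and Inf both dropped

print(range_na_rm)
print(range_finite)
```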
Histogram
A histogram consists of parallel vertical bars that graphically show the frequency
distribution of a quantitative variable. The area of each bar is equal to the frequency of
items found in each class.
Example
In the data set faithful, the histogram of the eruptions variable is a collection of parallel
vertical bars showing the number of eruptions classified according to their durations.
Problem
Find the histogram of the eruption durations in faithful.
Solution
We apply the hist function to produce the histogram of the eruptions variable.
> duration = faithful$eruptions
> hist(duration, # apply the hist function
+ right=FALSE) # intervals closed on the left
Answer
The histogram of the eruption durations is:
Enhanced Solution
To colorize the histogram, we select a color palette and set it in
the col argument of hist. In addition, we update the titles for readability.
> colors = c("red", "yellow", "green", "violet", "orange",
+ "blue", "pink", "cyan")
> hist(duration, # apply the hist function
+ right=FALSE, # intervals closed on the left
+ col=colors, # set the color palette
+ main="Old Faithful Eruptions", # the main title
+ xlab="Duration minutes") # x-axis label
> View(emp.data)
> print(emp.data)
> emp.data[1:2,]
> emp.data[c(3,5),c(2,4)]
emp_name start_date
3 kamala 0021-09-19
5 prava 0013-09-20
> A = emp.data$emp_id
> B = emp.data$emp_name
> C = data.frame(A,B)
> print(C)
A B
1 1 ratna
2 2 kumar
3 3 kamala
4 4 prajwal
5 5 prava
> D = emp.data$salary
A = emp.data$emp_id
B = emp.data$emp_name
C = data.frame(A, B)
print(C)
emp.data[1:2, ]
c) Extract 3rd and 5th row with 2nd and 4th columns.
emp.data[c(3,5),c(2,4)]
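The extraction steps above assume that emp.data already exists. A self-contained sketch follows; the employee names are taken from the printed output, while the salary and start_date values are assumptions made up for this example.

```r
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("ratna", "kumar", "kamala", "prajwal", "prava"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),             # assumed values
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),         # assumed values
  stringsAsFactors = FALSE
)

first_two <- emp.data[1:2, ]              # rows 1 and 2, all columns
subset_35 <- emp.data[c(3, 5), c(2, 4)]   # rows 3 and 5, columns 2 and 4
print(first_two)
print(subset_35)
```

Building start_date with as.Date from four-digit-year strings avoids dates being read with a truncated year.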
The data frame above can be normalized using the Min-Max normalization technique, which
applies the following formula to each value of the features to be
normalized. This technique is traditionally used with K-Nearest Neighbors
(KNN) classification problems.
(X - min(X))/(max(X) - min(X))
This can be programmed as the following function in R:
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# Note df[2]
dfNorm <- as.data.frame(lapply(df[2], normalize))
# Note df["Salary"]
dfNorm <- as.data.frame(lapply(df["Salary"], normalize))
Z-Score Standardization
The disadvantage of the min-max normalization technique is that it tends to bring data
towards the mean. If outliers need to be weighted more than the other
values, the z-score standardization technique suits better. To achieve z-score
standardization, one could use R’s built-in scale() function. In the following
example, scale() is applied to the “df” data frame mentioned above.
dfNormZ <- as.data.frame(scale(df[1:2]))
The following gets printed as dfNormZ:
         Age      Salary
1 -0.9271726 -1.03490978
2 -0.1324532  0.07392213
3  1.0596259  0.96098765
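Both techniques can be compared side by side in a short sketch. The data frame below is hypothetical (its Age and Salary values are not the ones behind the output above), and normalize is a helper written from the min-max formula.

```r
# Min-max helper: rescales a numeric vector to the interval [0, 1].
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

df <- data.frame(Age = c(25, 35, 50),
                 Salary = c(30000, 52000, 69600))   # hypothetical values

dfNorm  <- as.data.frame(lapply(df, normalize))     # each column now spans 0 to 1
dfNormZ <- as.data.frame(scale(df))                 # each column: mean 0, standard deviation 1

print(dfNorm)
print(dfNormZ)
```

After min-max scaling the smallest value of each column is 0 and the largest is 1, whereas after scale() each column is centred on 0 and measured in standard deviations.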
rbind(x1, x2, ...)
x1, x2: vectors, matrices, data frames
cbind(x1, x2, ...)
x1, x2: vectors, matrices, data frames
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
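The same two functions also work on data frames. A minimal sketch, using two made-up data frames df1 and df2 with matching column names:

```r
df1 <- data.frame(id = 1:2, score = c(90, 85))
df2 <- data.frame(id = 3:4, score = c(70, 95))

stacked <- rbind(df1, df2)                        # appends rows; column names must match
widened <- cbind(df1, passed = df1$score >= 88)   # appends a column; row counts must match

print(stacked)
print(widened)
```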
> library(dplyr)
> library(ggplot2)
> library(datasets)
> data(ToothGrowth)
> str(ToothGrowth)
> summary(ToothGrowth)
      len        supp         dose
 Min.   : 4.20   OJ:30   Min.   :0.500
 1st Qu.:13.07   VC:30   1st Qu.:0.500
 Median :19.25           Median :1.000
 Mean   :18.81           Mean   :1.167
 3rd Qu.:25.27           3rd Qu.:2.000
 Max.   :33.90           Max.   :2.000
> scatter.smooth(ToothGrowth)
What is dplyr
dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It
provides some great, easy-to-use functions that are very handy when performing
exploratory data analysis and manipulation. Here, I will provide a basic overview of some of
> library(dplyr)
> library("swirl")
> install.packages("downloader")
> library(downloader)
> head(msleep)
R's "mtcars" dataset contains a series of variables relating to motor cars that can
be plotted to explore correlation, with a linear regression model fitted to the
points.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)                 # fit a simple linear regression of y on x
print(relation)
png(file = "linearregression.png")
plot(x, y, col = "blue", pch = 16)    # scatter plot of the data
abline(relation)                      # add the fitted regression line
dev.off()                             # close the device so the file is written
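Once a model has been fitted, predict() can be used to estimate y for a new x. This is a sketch on the same two vectors; the new value 170 is an arbitrary choice for illustration.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

relation  <- lm(y ~ x)                      # simple linear regression of y on x
new_point <- data.frame(x = 170)            # column name must match the predictor in the formula
predicted <- predict(relation, new_point)   # estimated y at x = 170
print(predicted)
```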
R - Multiple Regression
We create the regression model using the lm() function in R. The model
determines the value of the coefficients using the input data. Next we
can predict the value of the response variable for a given set of predictor
variables using these coefficients.
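We create a subset of these variables from the mtcars data set for this
purpose.
input <- mtcars[, c("mpg", "disp", "hp")]
print(head(input))
# Fit mpg as a linear function of disp and hp.
model <- lm(mpg ~ disp + hp, data = input)
summary(model)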
Using the above equation we can predict the value of mpg based on disp and hp.
Step3: Predicting the output.
plot(model)
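The prediction step can be sketched in two equivalent ways: with predict(), and by plugging values into the fitted equation by hand. The predictor values disp = 200 and hp = 120 describe a hypothetical car, not a row of mtcars.

```r
input <- mtcars[, c("mpg", "disp", "hp")]
model <- lm(mpg ~ disp + hp, data = input)

coefs   <- coef(model)                         # intercept and one slope per predictor
new_car <- data.frame(disp = 200, hp = 120)    # hypothetical car

by_predict <- predict(model, new_car)
by_hand    <- coefs[["(Intercept)"]] + coefs[["disp"]] * 200 + coefs[["hp"]] * 120

print(by_predict)
print(by_hand)
```

Both numbers agree, because predict() evaluates the same linear equation that the coefficients define.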
The purpose of this analysis was to answer two questions regarding fuel
economy. The answer to the first, “Is an automatic or manual transmission
better for MPG?”, is this:
> data("iris")
> summary(iris)
> install.packages("dplyr")
>
select(): This function selects data by column name. You can select any number of columns in a
few different ways.
> selected <- select(iris, Sepal.Length, Sepal.Width)  # columns chosen for illustration
> head(selected, 3)
mutate()
# Create a new column that stores logical values for Sepal.Width greater
# than half of Sepal.Length.
newCol <- mutate(iris, greater.half = Sepal.Width > 0.5 * Sepal.Length)
tail(newCol)
## greater.half
## 145 FALSE
## 146 FALSE
## 147 FALSE
Visualization
Any powerful analysis will visualize the data to give a better picture (wink wink) of the data.
Below is a general plot of the iris dataset:
> plot(iris)
kmeans(data, 2, nstart = 100)
Generally, the way K-Means algorithms work is via an iterative refinement process:
1. Each data point is randomly assigned to a cluster (number of clusters is given before
hand).
2. Each cluster’s centroid (mean within cluster) is calculated.
3. Each data point is assigned to its nearest centroid (iteratively to minimise the within-
cluster variation) until no major differences are found.
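The three steps above can be sketched by hand as follows. This is an illustrative, simplified implementation and not the algorithm used internally by R's kmeans(); the function name simple_kmeans, the re-seeding of empty clusters and the stopping rule are assumptions made for this example.

```r
# Minimal k-means sketch following the three steps listed above.
simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  n <- nrow(data)
  # Step 1: random initial assignment; rep_len gives every cluster at least one point
  cluster <- sample(rep_len(seq_len(k), n))
  centroids <- matrix(NA_real_, nrow = k, ncol = ncol(data))
  for (iter in seq_len(max_iter)) {
    # Step 2: each centroid is the mean of the points currently in that cluster
    for (j in seq_len(k)) {
      members <- data[cluster == j, , drop = FALSE]
      # an emptied cluster is re-seeded with a random data point
      centroids[j, ] <- if (nrow(members) > 0) colMeans(members) else data[sample(n, 1), ]
    }
    # Step 3: squared Euclidean distance of every point to every centroid
    dists <- matrix(sapply(seq_len(k), function(j) {
      rowSums(sweep(data, 2, centroids[j, ])^2)
    }), ncol = k)
    new_cluster <- max.col(-dists, ties.method = "first")   # index of the nearest centroid
    if (all(new_cluster == cluster)) break                  # assignments stable: stop
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centroids)
}

set.seed(20)
fit <- simple_kmeans(iris[, 3:4], k = 3)
print(table(fit$cluster, iris$Species))
```

In practice the built-in kmeans() with several random starts (nstart) is preferable, since a single random start can settle in a poor local optimum.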
Plot(
> library(datasets)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> library(ggplot2)
> ggplot(iris, aes(Petal.Length, Petal.Width, color = Species))+ geom_point()
> set.seed(20)
> irisCluster <- kmeans(iris[, 3:4],3,nstart = 20)
> irisCluster
PROGRAM 12:
R - Decision Tree
A decision tree is a graph that represents choices and their results in the form of a tree. The nodes
in the graph represent an event or choice and the edges of the graph represent the
decision rules or conditions. It is mostly used in Machine Learning and Data Mining
applications using R.
Examples of the use of decision trees are predicting whether an email is spam or not spam, predicting
whether a tumor is cancerous, or predicting whether a loan is a good or bad credit risk based on the
factors in each of these. Generally, a model is created with observed data, also called
training data. Then a set of validation data is used to verify and improve the model. R has
packages which are used to create and visualize decision trees. For a new set of predictor
variables, we use this model to arrive at a decision on the category (yes/no, spam/not
spam) of the data.
Install R Package
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("party")
The package "party" has the function ctree() which is used to create and analyze decision
trees.
Syntax
The basic syntax for creating a decision tree in R is −
ctree(formula, data)
Input Data
We will use the data set named readingSkills, which ships with the party package, to create a decision tree. It
records each person's reading-skills score together with the variables
"age" and "shoeSize" and whether the person is a native speaker or not.
# Load the party package. It will automatically load other dependent packages.
library(party)
print(head(readingSkills))
When we execute the above code, it produces the following result and chart −
  nativeSpeaker age shoeSize    score
1           yes   5 24.83189 32.29385
2           yes   6 25.95238 36.63105
3            no  11 30.42170 49.60593
4           yes   7 28.66450 40.28456
5           yes  11 31.88207 55.46085
6           yes  10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
Example
We will use the ctree() function to create the decision tree and see its graph.
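# Load the party package. It will automatically load other dependent packages.
library(party)
# Training subset; the row range is an assumption, the original selection is not shown.
input.dat <- readingSkills[c(1:105), ]
png(file = "decision_tree.png")
# Model nativeSpeaker from the other three variables.
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score,
  data = input.dat)
plot(output.tree)
dev.off()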
Conclusion
From the decision tree shown above we can conclude that anyone whose readingSkills
score is less than 38.3 and age is more than 6 is not a native Speaker.
> install.packages("party")
Installing package into ‘C:/Users/Linda/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/party_1.0-15.zip'
Content type 'application/zip' length 731049 bytes (713 Kb)
opened URL
downloaded 713 Kb
> library("party")
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
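Call function ctree to build a decision tree. The first parameter is a formula,
which defines a target variable and a list of independent variables.
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
> print(iris_ctree)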
Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150
> plot(iris_ctree)