Data Analysis Using R
Abstract
This abstract discusses data analysis using R and studies the topic in depth. Data analysis is steadily gaining popularity, and the question of how to perform it well matters because R is a tool that enables analysts to carry out both data analysis and visualization. A central term in data analytics with R is exploratory data analysis (EDA), an approach to summarizing and visualizing a data set. EDA focuses on the data's basic structures and variables in order to build an understanding of the data set deep enough to decide which methods of statistical analysis would be appropriate. To explain this concept in detail, we proceed step by step: first we introduce R as it is used for data analysis, and then we discuss when and how R can be used to analyze data efficiently.
Introduction
Computationally intensive tasks can be written in C, C++, .NET, Python, or FORTRAN and linked into R for efficiency. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows, and macOS. R is free software distributed under a GNU-style copyleft, and an official part of the GNU project, called GNU S.
R is command-driven software: a data analyst works by giving written commands that tell the software what to do. One advantage of R is that it lets analysts collect large sets of data, chain different commands together, and then process all of those commands at once. R is well suited to data analysis because it can process a large number of commands together, saves the data and the work in progress, and lets analysts fix small mistakes easily, without having to search back through a series of commands to find each error and correct it.
Fig 1.1 Why use R?
Data analysis using R increases efficiency because R enables analysts to process data sets that are traditionally considered large. For example, it was previously impractical to process a data set of 300,000 cases at once, but with R, on a machine with at least 2 GB of memory, data sets of 300,000 cases and around 120 variables can be processed.
Features of R
Packages are part of R programming; they collect sets of R functions into a single unit.
R has effective data-handling and storage facilities.
R can print reports of the analysis performed in the form of graphs, either on screen or on hard copy.
Job scenario: R is a good option for start-ups and for companies looking for cost efficiency.
Applications of R Programming
Many data analysts and research programmers use R because it is one of the most prevalent languages in the field; for the same reason, R is a fundamental tool in finance.
R is well suited to data science because it offers a broad variety of statistics and provides an environment for statistical computing and design. R can be regarded as an alternative implementation of the S language.
R-Environment Setup
Try it online
You do not really need to set up your own environment to start learning the R programming language, because R programming environments are already available online, where you can compile and execute the examples while you work through the theory. This builds confidence in what you are reading and lets you check the results with different options. Feel free to modify any example and execute it online.
Try the following example using the Try it option available at the top right corner of the sample code on the website:
Local Environment Setup
If you still want to set up your own environment for R, you can follow the steps given below.
Windows Installation
You can download the Windows installer version of R from R-3.2.2 for Windows (32/64
bit) and save it in a local directory.
The installer is a Windows executable (.exe) with a name of the form "R-version-win.exe". You can simply double-click it and run the installer, accepting the default settings. If your Windows is a 32-bit version, it installs the 32-bit version of R; if your Windows is 64-bit, it installs both the 32-bit and 64-bit versions.
After installation you can locate the icon to run the program in a directory structure such as "R\R-3.2.2\bin\i386\Rgui.exe" under Windows Program Files. Clicking this icon brings up the R GUI, which is the R console for doing R programming.
Linux Installation
$ yum install R
The above command installs the core functionality of R along with the standard packages. If you need additional packages, you can launch the R prompt as follows:
$ R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Now you can use install.packages() at the R prompt to install a required package. For example, the following command installs the plotrix package, which is required for 3D charts:
> install.packages("plotrix")
R Software Platform
Before developing a deep understanding of what data analytics using R involves, it is important to understand the basic interface of R. The R software has four basic features: the R console, the R script, the R environment, and the graphical output; these features are summarized below. R enables analysts to write code in the console, run commands through a script, inspect variables and data sets in the R environment, and then present the data in the form of graphical output.
In four simple steps, users can analyze data using R by performing the following tasks:
1. R-Console:
The R console is the interface where analysts can write code to run on the data and later view the output; the code itself can be written using an R script.
2. R-Script:
The R script is the interface where analysts write their code. The process is simple: write the code, then run it by pressing Ctrl+Enter or by using the "Run" button at the top of the R script window.
3. R Environment:
The R environment is the space that holds external objects: the actual data set, plus the variables, vectors, and functions used to analyze it. You can load all your data here and check whether it has been loaded accurately into the environment.
4. Graphical Output:
Once the scripts and code are in place and the data sets and variables have been added to R, the graphical output feature can be used to create graphs after the exploratory data analysis has been performed.
Thus, analyzed in terms of the features above, data analytics using R entails writing code and scripts, uploading data sets and variables (that is, uploading the information you know in order to obtain the information you want to find out), and then representing the results using visual graphs.
R-Basic syntax
R Command Prompt
Once you have the R environment set up, it is easy to start the R command prompt: just type the following command at your shell prompt:
$ R
This launches the R interpreter and gives you a > prompt, where you can start typing your program as follows:
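The session described in the next sentence can be sketched as follows (the same code also appears later in this section):

```r
myString <- "Hello, World!"   # assign a string to the variable myString
print(myString)               # display the value stored in myString
# [1] "Hello, World!"
```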
Here the first statement defines a string variable, myString, to which we assign the string "Hello, World!"; the next statement uses print() to display the value stored in myString.
R Script File
Usually you will do your programming by writing programs in script files, which you then execute at your command prompt with the help of the R interpreter called Rscript. So let's start by writing the following code in a text file called test.R:
myString <- "Hello, World!"
print(myString)
Save the above code in a file named test.R and execute it at the Linux command prompt as shown below. Even if you are using Windows or another system, the syntax remains the same.
$ Rscript test.R
Comments
Comments are helping text in your R program; they are ignored by the interpreter while executing the actual program. A single-line comment is written by putting # at the beginning of the statement, as follows:
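For example (a minimal sketch):

```r
# My first program in R Programming
myString <- "Hello, World!"   # everything after # on a line is ignored by R
print(myString)
# [1] "Hello, World!"
```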
R does not support multi-line comments but you can perform a trick which is something
as follows:
if(FALSE){
"This is a demo for multi-line comments and it should be put inside either a single of double
quote"
}
myString <- "Hello, World!"
print ( myString)
Although the quoted block above is evaluated by the R interpreter, it does not interfere with your actual program. Put such multi-line comments inside either single or double quotes.
R-Data Types
Data analytics with R is performed using the four features of R mentioned above: the R console, R script, R environment, and graphical output. R for data science offers many features and packages that can be installed to analyze different types of data. The objects used to hold data include vectors, lists, matrices, arrays, factors, and data frames.
The simplest of these objects is the vector, and there are six data types of atomic vectors, also termed the six classes of vectors. The other R objects are built upon the atomic vectors.
Logical: TRUE, FALSE
v <- TRUE
print(class(v))
[1] "logical"

Numeric: 12.3, 6, 876
v <- 45.5
print(class(v))
[1] "numeric"

Integer: 2L, 34L, 0L
v <- 2L
print(class(v))
[1] "integer"

Complex: 3 + 2i
v <- 2+5i
print(class(v))
[1] "complex"

Character: 'd', "bad", "TRUE", '23.4', "YES"
v <- "YES"
print(class(v))
[1] "character"

Raw: "Hello" is stored as 48 65 6c 6c 6f
v <- charToRaw("Hello")
print(class(v))
[1] "raw"
Vector:
Vector data sets group together objects from the same class; for example, one vector may hold numeric values and another integers. R does, however, allow different objects to be mixed: different vectors can be grouped together for analysis. This article does not go in depth into the specific commands for grouping objects, but combining objects of different classes into one vector causes coercion, in which all elements are converted to a single common class; the class() function reports the resulting class.
When you want to create a vector with more than one element, use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
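To see the coercion described above in action, mix classes in one call to c(); all elements are converted to the most flexible class present (a minimal sketch):

```r
# Numeric, character, and logical values are all coerced to character.
v <- c(1.7, "a", TRUE)
print(class(v))
# [1] "character"
```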
Matrices:
A matrix data set is created when a vector data set is arranged into rows and columns; the data still contains elements of the same class, but in matrix form the data structure is two-dimensional.
A matrix is a two-dimensional rectangular data set. It can be created using a vector input
to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow=2, ncol=3, byrow = TRUE)
print(M)
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "c"  "b"  "a"
Data Frame:
A data frame can be considered an advanced form of matrix: it is a collection of vectors with different elements. The difference between a matrix and a data frame is that a matrix must contain elements of the same class, whereas a data frame can group together lists of vectors of different classes. Data frame commands can be more complex than those for the other objects.
Data frames are tabular data objects. Unlike a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric, the second character, and the third logical. A data frame is a list of vectors of equal length.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age =c(42,38,26)
)
print(BMI)
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
List:
A list is a vector-like data set that groups together data from different classes.
A list is an R-object which can contain many different types of elements inside it like
vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Arrays:
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim attribute, which creates the required number of dimensions. In the example below we create an array with two elements, each a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim=c(3,3,2))
print(a)
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
Factors:
Factors are R objects created from a vector. A factor stores the vector along with the distinct values of its elements as labels. The labels are always character strings, irrespective of whether the input vector is numeric, character, or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels function gives the count of
levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
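Continuing the example, the vector can be turned into a factor and its levels counted, using the factor() and nlevels() functions mentioned above (a minimal sketch):

```r
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object from the vector.
factor_apple <- factor(apple_colors)
print(factor_apple)
# Levels: green red yellow
# Count the distinct levels.
print(nlevels(factor_apple))
# [1] 3
```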
Apart from supporting the analysis of different types of data, R for data science allows different types of variables to be used, such as:
Continuous Variables:
Continuous variables can take any value, including decimal values, e.g. 1, 2.5, 4.6, 7, etc.
Categorical Variables:
Categorical variables can only take values from a fixed set of categories, such as 1, 2, 3, 4, 5. Factors are used to represent categorical variables in data analytics with R.
Missing Values:
Missing values are a painful yet crucial part of data analytics, including data analytics with R. R marks missing values with NA, and many commands can perform calculations while excluding missing values; it is important to mark missing entries explicitly so that such calculations behave correctly when performing data analytics with R.
R-Decision Making
Decision making structures require the programmer to specify one or more conditions to
be evaluated or tested by the program, along with a statement or statements to be
executed if the condition is determined to be true, and optionally, other statements to be
executed if the condition is determined to be false.
if...else statement: An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
If statement :-
An if statement consists of a Boolean expression followed by one or more
statements.
Syntax :- if(boolean_expression) {
// statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to true, the block of code inside the if statement is executed. If it evaluates to false, the first statement after the end of the if statement (after the closing curly brace) is executed.
Example
x <- 30L
if(is.integer(x)){
print("X is an Integer")
}
When the above code is compiled and executed, it produces the following result:
[1] "X is an Integer"
If...Else Statement :-
An if statement can be followed by an optional else statement which executes
when the boolean expression is false.
Syntax
if(boolean_expression) {
// statement(s) will execute if the boolean expression is true.
}
else {
// statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to be true, then the if block of code will be
executed, otherwise else block of code will be executed.
Example –
x <- c("what","is","truth")
if("Truth" %in% x){
print("Truth is found")
} else {
print("Truth is not found")
}
When the above code is compiled and executed, it produces the following result:
[1] "Truth is not found"
Here "Truth" does not match any element of x because the comparison is case-sensitive, so the else block is executed.
When using if, else if, and else statements, there are a few points to keep in mind:
An if can have zero or one else, and it must come after any else ifs.
An if can have zero to many else ifs, and they must come before the else.
Once an else if succeeds, none of the remaining else ifs or the else will be tested.
Syntax
if(boolean_expression 1) {
// Executes when the boolean expression 1 is true.
}else if( boolean_expression 2) {
// Executes when the boolean expression 2 is true.
}else if( boolean_expression 3) {
// Executes when the boolean expression 3 is true.
}else {
// executes when none of the above condition is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x){
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
When the above code is compiled and executed, it produces the following result:
[1] "truth is found the second time"
Switch Statement
A switch statement allows a variable to be tested for equality against a list of
values. Each value is called a case, and the variable being switched on is checked
for each case.
Syntax
switch(expression, case1, case2, case3....)
If the value of expression is not a character string, it is coerced to integer.
You can have any number of case statements within a switch. Each case is followed by the value to be compared to and a colon.
If the value of the integer is between 1 and nargs()-1 (the maximum number of arguments), then the corresponding case element is evaluated and its result returned.
If expression evaluates to a character string, that string is matched (exactly) against the names of the elements.
If there is more than one match, the first matching element is returned.
No default argument is available.
In the case of no match, if there is an unnamed element of ..., its value is returned. (If there is more than one such argument, an error is signalled.)
Example
x <- switch(
3,
"first",
"second",
"third",
"fourth"
)
print(x)
When the above code is compiled and executed, it produces the following result:
[1] "third"
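When the expression is a character string, matching is by name, as described above. A minimal sketch:

```r
# Character expressions are matched exactly against element names.
y <- switch(
   "color",
   "color" = "red",
   "shape" = "square",
   "length" = 5
)
print(y)
# [1] "red"
```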
R-Loops
There may be situations in which you need to execute a block of code several times. In general, statements are executed sequentially: the first statement in a function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more
complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple times. R supports the following kinds of loops:
Repeat loop: Executes a sequence of statements repeatedly, testing the stop condition at the end of the loop body.
While loop: Repeats a statement or group of statements while a given condition is true, testing the condition before executing the loop body.
For loop: Executes a sequence of statements multiple times and abbreviates the code that manages the loop variable.
Repeat Loop
The Repeat loop executes the same code again and again until a stop condition is
met.
Syntax
repeat {
commands
if(condition){
break
}
}
Example
v <- c("Hello","loop")
cnt <- 2
repeat{
print(v)
cnt <- cnt+1
if(cnt > 5){
break
}
}
When the above code is compiled and executed, it produces the following result:
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
While Loop
The while loop repeatedly executes the same code as long as its test condition is true, checking the condition before each iteration.
Syntax
while (test_expression) {
statement
}
The key point of the while loop is that it might not run at all: the condition is tested first, and if the result is false the loop body is skipped and the first statement after the while loop is executed.
Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7){
print(v)
cnt = cnt + 1
}
When the above code is compiled and executed, it produces the following result :
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
For Loop
A for loop is a repetition control structure that allows you to efficiently write a
loop that needs to execute a specific number of times.
Syntax :
for (value in vector) {
statements
}
Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
When the above code is compiled and executed, it produces the following result:
[1] "A"
[1] "B"
[1] "C"
[1] "D"
Loop control statements change execution from its normal sequence. When execution
leaves a scope, all automatic objects that were created in that scope are destroyed.
R supports the following control statements
Break statement: Terminates the loop statement and transfers execution to the statement immediately following the loop.
Next statement: Skips the remainder of the loop body and starts the next iteration of the loop.
Break Statement
The break statement in R programming language has the following two usages:
When the break statement is encountered inside a loop, the loop is
immediately terminated and program control resumes at the next statement
following the loop.
It can be used to terminate a case in the switch statement (covered in the
next chapter).
Syntax :- break
Example -
v <- c("Hello","loop")
cnt <- 2
repeat{
print(v)
cnt <- cnt+1
if(cnt > 5){
break
}
}
When the above code is compiled and executed, it produces the following result:
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
Next Statement
The next statement in R programming language is useful when we want to skip
the current iteration of a loop without terminating it. On encountering next, the R
parser skips further evaluation and starts next iteration of the loop.
Syntax:- next
Example
v <- LETTERS[1:6]
for ( i in v){
if (i == "D"){
next
}
print(i)
}
When the above code is compiled and executed, it produces the following result:
[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"
R-Function
Function Definition
An R function is created by using the keyword function. The basic syntax of an R function definition is as follows:
function_name <- function(arg_1, arg_2, ...) {
   Function body
}
Function Components - The different parts of a function are:
Function name: the actual name of the function, stored in the R environment as an object of this name.
Arguments: placeholders; when a function is invoked, you pass values to the arguments. Arguments are optional and can have default values.
Function body: a collection of statements that defines what the function does.
Return value: the last expression in the function body to be evaluated.
R has many built-in functions that can be called directly in a program without being defined first. We can also create and use our own functions, referred to as user-defined functions.
Built-in Functions
Simple examples of built-in functions are seq(), mean(), max(), sum(x), and paste(...). They are called directly from user-written programs. For example:
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find the mean of numbers from 25 to 82.
print(mean(25:82))
# Find the sum of numbers from 41 to 68.
print(sum(41:68))
When we execute the above code, it produces the following result:
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Functions
We can create user-defined functions in R. They are specific to what the user wants, and once created they can be used like built-in functions. Below is an example of how a function is created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}
Calling a Function
new.function(6)
When we execute the above code, it produces the following result:
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
Calling a Function with Argument Values (by position and by name)
The arguments to a function call can be supplied in the same order as defined in the function, or assigned to argument names in any order.
# Create a function with arguments.
new.function <- function(a, b, c) {
   result <- a * b + c
   print(result)
}
# Call the function by position of arguments.
new.function(5, 3, 11)
# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
This produces:
[1] 26
[1] 58
Calling a Function with Default Arguments
We can define default values for arguments; the function then uses those defaults when called without arguments.
# Create a function with default arguments.
new.function <- function(a = 3, b = 6) {
   result <- a * b
   print(result)
}
# Call the function without giving any argument.
new.function()
# Call the function with new values for the arguments.
new.function(9, 5)
This produces:
[1] 18
[1] 45
The following example shows that arguments are evaluated lazily, so an error for a missing argument occurs only when that argument is actually used:
new.function <- function(a, b) {
print(a^2)
print(a)
print(b)
}
# Evaluate the function without supplying one of the arguments.
new.function(6)
When we execute the above code, it produces the following result:
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
R-Strings
Any value written within a pair of single or double quotes in R is treated as a string. Internally R stores every string within double quotes, even those created with single quotes.
Rules Applied in String Construction :-
The quotes at the beginning and end of a string must both be double quotes or both be single quotes; they cannot be mixed.
Double quotes can be inserted into a string that starts and ends with single quotes.
Single quotes can be inserted into a string that starts and ends with double quotes.
Double quotes cannot be inserted into a string that starts and ends with double quotes.
Single quotes cannot be inserted into a string that starts and ends with single quotes.
Example of a valid string:
d <- 'Double quotes " in between single quote'
print(d)
Example of an invalidly constructed string, which produces an error:
g <- "Double quotes " inside double quotes"
Error: unexpected symbol in: "g <- "Double quotes " inside"
String Manipulation
Concatenating Strings - paste() function
Strings in R are commonly combined using the paste() function, which can take any number of arguments to be combined together.
Syntax :- paste(..., sep = " ", collapse = NULL)
a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c))
print(paste(a,b,c, sep = "-"))
print(paste(a,b,c, sep = "", collapse = ""))
When we execute the above code, it produces the following result:
[1] "Hello How are you? "
[1] "Hello-How-are you? "
[1] "HelloHoware you? "
Counting the number of characters in a string - nchar() function
This function counts the number of characters in a string, including spaces.
Syntax:- nchar(x)
x is the vector input.
Example
result <- nchar("Count the number of characters")
print(result)
When we execute the above code, it produces the following result:
[1] 30
Extracting parts of a string - substring() function
Example
# Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
When we execute the above code, it produces the following result:
[1] "act"
R-Operators
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical manipulation. The R language is rich in built-in operators and provides the following types of operators.
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
The following are the arithmetic operators supported by the R language. The operators act element-wise on vectors.
+ (adds two vectors):
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v + t)
[1] 10.0 8.5 10.0

- (subtracts the second vector from the first):
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v - t)
[1] -6.0 2.5 2.0

* (multiplies both vectors):
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v * t)
[1] 16.0 16.5 24.0

/ (divides the first vector by the second):
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v / t)
[1] 0.250000 1.833333 1.500000

%% (remainder of the first vector divided by the second):
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v %% t)
[1] 2.0 2.5 2.0

%/% (integer division of the first vector by the second, the quotient):
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v %/% t)
[1] 0 1 1

^ (the first vector raised to the exponent of the second):
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v ^ t)
[1] 256.000 166.375 1296.000
Relational Operators
The following are the relational operators supported by the R language. Each element of the first vector is compared with the corresponding element of the second vector, and the result of each comparison is a Boolean value.
> (checks if each element of the first vector is greater than the corresponding element of the second vector):
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v > t)
[1] FALSE TRUE FALSE FALSE

< (checks if each element of the first vector is less than the corresponding element of the second vector):
print(v < t)
[1] TRUE FALSE TRUE FALSE

== (checks if each element of the first vector is equal to the corresponding element of the second vector):
print(v == t)
[1] FALSE FALSE FALSE TRUE

<= (checks if each element of the first vector is less than or equal to the corresponding element of the second vector):
print(v <= t)
[1] TRUE FALSE TRUE TRUE

>= (checks if each element of the first vector is greater than or equal to the corresponding element of the second vector):
print(v >= t)
[1] FALSE TRUE FALSE TRUE

!= (checks if each element of the first vector is unequal to the corresponding element of the second vector):
print(v != t)
[1] TRUE TRUE TRUE FALSE
Logical Operators
The following are the logical operators supported by the R language. They apply only to vectors of type logical, numeric, or complex. All non-zero values are treated as the logical value TRUE, and zero as FALSE.
Each element of the first vector is combined with the corresponding element of the second vector, and the result is a Boolean value.
& (element-wise logical AND; combines each element of the first vector with the corresponding element of the second and gives TRUE if both elements are TRUE):
v <- c(3, 1, TRUE, 2+3i)
t <- c(4, 1, FALSE, 2+3i)
print(v & t)
[1] TRUE TRUE FALSE TRUE

| (element-wise logical OR; gives TRUE if at least one of the elements is TRUE):
v <- c(3, 0, TRUE, 2+2i)
t <- c(4, 0, FALSE, 2+3i)
print(v | t)
[1] TRUE FALSE TRUE TRUE

! (logical NOT; takes each element of the vector and gives the opposite logical value):
v <- c(3, 0, TRUE, 2+2i)
print(!v)
[1] FALSE TRUE FALSE FALSE
Note: the logical operators && and || consider only the first element of each vector and give a single-element vector as output.
Assignment Operators
These operators are used to assign values to vectors.
<- or = or <<- (left assignment):
v1 <- c(3, 1, TRUE, 2+3i)
v2 <<- c(3, 1, TRUE, 2+3i)
v3 = c(3, 1, TRUE, 2+3i)
print(v1)
print(v2)
print(v3)
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

-> or ->> (right assignment):
c(3, 1, TRUE, 2+3i) -> v1
c(3, 1, TRUE, 2+3i) ->> v2
print(v1)
print(v2)
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

&& (logical AND on the first elements of both vectors; gives TRUE only if both are TRUE):
v <- c(3, 0, TRUE, 2+2i)
t <- c(1, 3, TRUE, 2+3i)
print(v && t)
[1] TRUE

|| (logical OR on the first elements of both vectors; gives TRUE if at least one is TRUE):
v <- c(0, 0, TRUE, 2+2i)
t <- c(0, 3, TRUE, 2+3i)
print(v || t)
[1] FALSE
Miscellaneous Operators
These operators serve specific purposes rather than general mathematical or logical computation.
: (colon operator; creates a series of numbers in sequence for a vector):
v <- 2:8
print(v)
[1] 2 3 4 5 6 7 8

%in% (checks whether an element belongs to a vector):
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
[1] TRUE
[1] FALSE
R-Graphics
Now we will see how to use R as a function plotter by drawing sine and cosine functions on the interval from 0 to 10. First we create a table of values for x and y; to get a smooth curve, it is reasonable to choose a small step size. As a rule of thumb, about 100 to 400 steps are a good compromise between smoothness and memory requirements, so let's set the step size to 0.1:
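A minimal sketch of the table of values and the first plot:

```r
x <- seq(0, 10, 0.1)  # 101 x-values with step size 0.1
y <- sin(x)           # corresponding sine values
plot(x, y)            # scatter plot of the points
```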
Instead of plotting points, it is of course also possible to draw continuous lines. This is indicated by supplying the optional argument type="l". Important: the symbol used for type is the lowercase letter "L" (for "line"), not the numeral "1" (one), which looks very similar in print. Note also that in R, optional arguments can be given as "keyword = value" pairs. This has the advantage that the order of arguments does not matter, because arguments are referenced by name:
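For example:

```r
x <- seq(0, 10, 0.1)
y <- sin(x)
plot(x, y, type = "l")  # type = "l" draws a continuous line
```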
Now we want to add a cosine function with another color. This can be done with one of
the function lines or points, for adding lines or points to an existing figure:
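Continuing the sketch above, lines() adds a second curve to an existing figure:

```r
x <- seq(0, 10, 0.1)
plot(x, sin(x), type = "l")
lines(x, cos(x), col = "red")  # add the cosine curve in another color
```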
With the help of text it is also possible to add arbitrary text, by specifying first the x- and
y- coordinates and then the text:
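For example, placing a label at the point (4, 0.5) (the coordinates and label text here are illustrative):

```r
x <- seq(0, 10, 0.1)
plot(x, sin(x), type = "l")
text(4, 0.5, "sine curve")  # x- and y-coordinates first, then the text
```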
Many options exist to modify the behavior of most graphics functions. The following call specifies user-defined coordinate limits (xlim, ylim), axis labels (xlab, ylab), and a heading (main):
plot(x, y, xlim = c(-10, 10), ylim = c(-2, 2), xlab = "x-Values", ylab = "y-Values", main =
"Example Graphics")
The above example is a rather long command and may not fit on a single line. In such cases, R displays a + (plus sign) to indicate that the command must be continued, e.g. because a closing parenthesis or closing quote is still missing. Such a + at the beginning of a line is an automatic prompt, similar to the ordinary > prompt, and must never be typed in manually. If the + continuation prompt appears by accident, press ESC to cancel this mode.
In contrast to the long line continuation prompt, it is also possible to write several
commands on one line, separated by a semi-colon “;”.
Finally, a number (or hash) symbol # means that the following part of the line is a
comment and should therefore be ignored by R.
In order to explore the wealth of graphical functions, you may now have a more extensive
look into the online help, especially regarding ?plot or ?plot.default, and you should
experiment a little bit with different plotting parameters, like lty, pch, lwd, type, log etc.
R offers countless possibilities to take full control over the style and content of your graphics, e.g. with user-specified axes (axis), legends (legend), or user-defined lines and areas (abline, rect, polygon). The general style of figures (font size, margins, line width) can be influenced with the par() function.
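A minimal sketch of adjusting global figure style with par() (the parameter values here are illustrative):

```r
op <- par(cex = 1.2, lwd = 2, mar = c(4, 4, 2, 1))  # larger text, thicker lines, tighter margins
plot(1:10, (1:10)^2, type = "l")
par(op)  # restore the previous settings
```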
In addition, R and its packages contain numerous high-level graphics functions for specific purposes. To demonstrate a few, we first generate a data set of normally distributed random numbers (mean 0, standard deviation 1), then we plot them and create a histogram. Here the function par(mfrow = c(2, 2)) divides the plotting area into 2 rows and 2 columns, showing 4 separate figures:
par(mfrow = c(2, 2))
x <- rnorm(100)
plot(x)
hist(x)
Now, we add a so-called normal probability plot and a second histogram with relative
frequencies together with the bell-shaped density curve of the standard normal
distribution. The optional argument probability = TRUE makes sure that the histogram
has the same scaling as the density function, so that both can be overlaid:
qqnorm(x)
qqline(x, col="red")
hist(x, probability = TRUE)
xx <- seq(-3, 3, 0.1)
lines(xx, dnorm(xx, 0, 1), col = "red")
Here it may also be a good opportunity to do a little summary statistics, e.g. mean(x),
var(x), sd(x), range(x), summary(x), min(x), max(x), ...
Or we may want to test whether the generated random numbers x are really normally
distributed by using the Shapiro-Wilk test:
x <- rnorm(100)
shapiro.test(x)
Shapiro-Wilk normality test
data: x
W = 0.99388, p-value = 0.9349
A p-value bigger than 0.05 tells us that the test has no objections against normal
distribution of the data. The concrete results may differ, because x contains random
numbers, so it makes sense to repeat this several times. It can also be useful to compare
these normally distributed random numbers generated with rnorm with uniformly
distributed random numbers generated with runif:
par(mfrow=c(2,2))
y <- runif(100)
plot(y)
hist(y)
qqnorm(y)
qqline(y, col="red")
mean(y)
var(y)
min(y)
max(y)
hist(y, probability=TRUE)
yy <- seq(min(y), max(y), length = 50)
lines(yy, dnorm(yy, mean(y), sd(y)), col = "red")
shapiro.test(y)
At the end, we compare the pattern of both data sets with box-and-whisker plots:
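The box plot commands themselves are not shown in the text; a minimal sketch, using x and y as generated above, might be:

```r
x <- rnorm(100)   # normally distributed sample
y <- runif(100)   # uniformly distributed sample
# side-by-side box-and-whisker plots of both data sets
boxplot(x, y, names = c("rnorm", "runif"))
```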
flavor, number
pistachio,6
mint chocolate chip,7
vanilla,5
chocolate,10
strawberry,2
neopolitan,4
This data represents the number of students in a class that prefer a particular flavor of soy
ice cream. We can read the file into a variable called favs as follows:
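The import command itself is missing in the text; a sketch, assuming the data above was saved under the hypothetical name favorites.txt (the writeLines() step only recreates the file so the example is self-contained):

```r
# recreate the ice-cream data file (hypothetical file name)
writeLines(c("flavor,number",
             "pistachio,6", "mint chocolate chip,7", "vanilla,5",
             "chocolate,10", "strawberry,2", "neopolitan,4"),
           "favorites.txt")
# read it into a data frame called favs
favs <- read.table("favorites.txt", sep = ",", header = TRUE)
```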
If you get an error that there is no such file or directory, give R the full path name to your
data set or, alternatively, run the following command:
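That command is not shown in the text; presumably it is a file.choose() call along these lines (wrapped in interactive() here because the dialog only works in an interactive session):

```r
if (interactive()) {
  # file.choose() opens an open-file dialog and returns the chosen path
  favs <- read.table(file.choose(), sep = ",", header = TRUE)
}
```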
The preceding command brings up an open file dialog for letting you navigate to the file
you’ve just created.
The argument sep="," tells R that each data element in a row is separated by a comma.
Other common data formats have values separated by tabs and pipes ("|"). The value of
sep should then be "\t" and "|", respectively.
The argument header=TRUE tells R that the first row of the file should be
interpreted as the names of the columns. Remember, you can enter ?read.table at the
console to learn more about these options.
Reading from files in this comma-separated-values format (usually with the .csv file
extension) is so common that R has a more specific function just for it. The preceding
data import expression can best be written simply as:
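Presumably the shorter form is the following (the file name is an assumption, and the file is recreated here so the sketch runs standalone):

```r
writeLines(c("flavor,number", "pistachio,6", "vanilla,5"),
           "favorites.txt")         # recreate a small sample file
favs <- read.csv("favorites.txt")   # sep = "," and header = TRUE are the defaults
```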
Now, we have all the data in the file held in a variable of class data.frame. A data frame
can be thought of as a rectangular array of data that you might see in a spreadsheet
application. In this way, a data frame can also be thought of as a matrix; indeed, we
can use matrix-style indexing to extract elements from it. A data frame differs from a
matrix, though, in that a data frame may have columns of differing types. For
example, whereas a matrix would only allow one of these types, the data set we just
loaded contains character data in its first column, and numeric data in its second column.
Let’s check out what we have by using the head() command, which will show us the first
few lines of a data frame:
> head(favs)
               flavor number
1           pistachio      6
2 mint chocolate chip      7
3             vanilla      5
4           chocolate     10
5          strawberry      2
6          neopolitan      4
> class(favs)
[1] "data.frame"
> class(favs$flavor)
[1] "factor"
> class(favs$number)
[1] "numeric"
I lied, ok! So what?! Technically, flavor is a factor data type, not a character type. We
haven’t seen factors yet, but the idea behind them is really simple. Essentially, factors are
codings for categorical variables, which are variables that take on one of a finite
number of categories—think {"high", "medium", and "low"} or {"control",
"experimental"}.
Though factors are extremely useful in statistical modeling in R, the fact that R, by
default, automatically interprets a column from the data read from disk as a type factor if
it contains characters, is something that trips up novices and seasoned useRs alike.
Because of this, we will usually prevent this behavior by adding the optional
stringsAsFactors keyword argument to the read.* commands:
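A sketch of the call (hypothetical file name as before; note that since R 4.0, stringsAsFactors = FALSE is already the default):

```r
writeLines(c("flavor,number", "pistachio,6", "vanilla,5"),
           "favorites.txt")   # recreate a small sample file
favs <- read.csv("favorites.txt", stringsAsFactors = FALSE)
class(favs$flavor)            # "character" instead of "factor"
```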
Much better, for now! If you'd like to make this behavior the new default, read the
?options manual page. We can always convert to factors later on if we need to! If you
haven't noticed already, I've snuck a new operator on you: $, the extract operator. This is
the most popular way to extract attributes (or columns) from a data frame. You
can also use double square brackets ([[ and ]]) to do this.
These are both in addition to the canonical matrix indexing option. The following three
statements are thus, in this context, functionally identical:
> favs$flavor
[1] "pistachio" "mint chocolate chip" "vanilla"
[4] "chocolate" "strawberry" "neopolitan"
> favs[["flavor"]]
[1] "pistachio" "mint chocolate chip" "vanilla"
[4] "chocolate" "strawberry" "neopolitan"
> favs[, 1]
[1] "pistachio" "mint chocolate chip" "vanilla"
[4] "chocolate" "strawberry" "neopolitan"
Note
Notice how R has now printed another number in square brackets—besides [1]—along
with our output. This is to show us that chocolate is the fourth element of the vector that
was returned from the extraction.
You can use the names() function to get a list of the columns available in a data frame.
You can even reassign names using the same function:
> names(favs)
[1] "flavor" "number"
> names(favs)[1] <- "flav"
> names(favs)
[1] "flav" "number"
Lastly, we can get a compact display of the structure of a data frame by using the str()
function on it:
> str(favs)
'data.frame': 6 obs. of 2 variables:
 $ flav  : chr "pistachio" "mint chocolate chip" "vanilla" "chocolate" ...
 $ number: num 6 7 5 10 2 4
Actually, you can use this function on any R structure—the property of functions that
change their behavior based on the type of input is called polymorphism.
Fig 1.5: Import Text Data wizard of R Studio.
Hint: Do not forget to set a Name for the resulting data frame (e.g. dat), otherwise R uses
the file name.
Debugging
For debugging purposes, i.e. if we suspect that something is wrong, it can be necessary to
inspect values of internal variables. For such cases, it would be possible to output internal
variables with print or, as an alternative, to switch the debug mode for this function on
with debug(twosam). This mode can be switched off again with undebug(twosam).
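The function twosam is defined elsewhere in the text; with a stand-in definition, the debug workflow can be sketched as:

```r
# stand-in for the text's twosam function
twosam <- function(y1, y2) mean(y1) - mean(y2)
debug(twosam)        # the next call to twosam() would enter the browser
isdebugged(twosam)   # TRUE while debug mode is active
undebug(twosam)      # switch the debug mode off again
```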
R contains many additional possibilities, e.g. the use of optional named arguments with
default values, the ellipsis (the ...-argument, three dots), closures, object orientation, or
linking to C or Fortran routines. Details about this can be found in "An Introduction to R"
and "Writing R Extensions", which are both part of the official R documentation.
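A small sketch (not from the text) of default values and the ...-argument:

```r
# hypothetical helper: n is required, mean and sd have defaults,
# and "..." forwards any further arguments to plot()
plotnorm <- function(n, mean = 0, sd = 1, ...) {
  x <- rnorm(n, mean, sd)
  plot(x, ...)
  invisible(x)
}
plotnorm(50)                          # uses the defaults
plotnorm(50, mean = 10, col = "red")  # col is passed through to plot()
```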
This principle is one of the most important fundamentals, not only in statistics but also in
science in general. However, over-simplification has to be avoided as well, especially in
complex fields like ecology.
Probability
According to the classical definition, probability p is the chance of a specific event, i.e.
the number of events we are interested in, divided by the total number of possible events.
In a die roll, for example, the probability of each particular number is 1/6, because the die
has 6 sides and each number occurs only once.
Axiom I: 0 ≤ p ≤ 1
Axiom II: impossible events have p = 0, certain events have p = 1
Axiom III: for events A and B that exclude each other, i.e. in set theory A ∩ B = ∅, the
following holds: p(A ∪ B) = p(A) + p(B)
Random sampling means that the individual objects for measurement are selected at
random from the population (examples: random selection of sampling sites by means of
numbered grid cells; random placement of experimental units in a climate chamber,
random order of treatments in time).
By convention, the "true" but unknown parameters of a population are symbolized with
Greek letters (µ, σ, γ, α, β); the calculated parameters (x̄, s, r², ...) are called "estimates".
A single value xᵢ of a random variable X can also be assumed to consist of an expectation
value E(X) of the random variable and an individual error εᵢ. The expectation value (e.g.
a mean) of this error term is zero:

xᵢ = E(X) + εᵢ   (3.1)
E(ε) = 0   (3.2)
Statistical Tests
Statistical tests are employed for testing hypotheses about data, e.g. specific properties of
distributions or their parameters. The basic idea is to estimate probabilities for a
hypothesis about the population from a sample.
In contrast to this, significance means that an effect really exists with a certain probability
and that it is unlikely a result of random fluctuations alone.
For testing whether an effect is significant or not, it is important to formulate clear
hypotheses in terms of statistical language, a so-called Null hypothesis (H0) and one or
more alternative Hypotheses (Ha):
H0: Null hypothesis that two populations are not different with respect to a certain
parameter or property. It is assumed, that an observed effect is purely random and that the
true effect is zero.
Ha: The alternative hypothesis (experimental hypothesis) claims that a specific effect
exists. An alternative hypothesis is never completely true or "proven". The acceptance of
Ha means only that H0 is unlikely.
Table 1.1: Decisions of a statistical test compared with reality.

Decision of the test    H0 true                    H0 false
H0 retained             correct decision, 1 − α    β error
H0 rejected             α error                    correct decision, power 1 − β
Statistical tests can only test for differences between samples, not for equality. This means
that H0 is formulated to be rejected and that it is, for fundamental reasons, impossible to
test whether two populations are equal. If a test cannot reject "the null", it means only that
an observed effect can also be explained as a result of random variability or error and that
a potential effect was too small to be "significantly" detected, given the available sample
size.
Note: Not significant does not necessarily mean “no effect”, it means “no effect or
sample size too small”.
To sum up, there are two possibilities for wrong decisions (cf. Table 1.1):
1. H0 falsely rejected (error of the first kind or α error),
2. H0 falsely retained (error of the second kind or β error).
It is common convention to use α = 0.05 as the critical value in many sciences, but any
other small value (e.g. 0.01 or 0.1) would be possible as well. The essential thing, however,
is that this value has to be defined before doing the test and must not be adjusted
afterwards.
In contrast to this, β is often unknown. It depends on the relative effect size, the
sample size and the power of a certain test. It can be estimated before an experiment by
using power analysis methods, or, even better, β is set to a certain value, e.g. 0.2, and
power analysis is then used to determine the required sample size.
We may be tempted to answer the first question by just using the model with the best fit,
for example the lowest mean square error or the highest R², but such an approach would
lead to overly complex models, because a model with many parameters has much more
flexibility to fit a complicated data set. Therefore, what we need is not the model that
maximally fits the data, but the model with the best compromise between goodness of fit
and model complexity, i.e. the smallest model that fits the data reasonably well.
This approach is called the model selection technique, an essential concept of modern
statistics that sometimes supersedes p-value based testing.
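The text does not give code for this; one common approach (an assumption, not shown in the text) is to compare models with AIC(), which penalizes extra parameters, so the smaller value indicates the better compromise:

```r
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 2)   # data with a truly linear relationship
m1 <- lm(y ~ x)                        # simple model
m2 <- lm(y ~ poly(x, 5))               # needlessly complex model
AIC(m1, m2)                            # m1 should show the lower (better) AIC
```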
Application in R
1. Variance
2. Standard deviation
3. Range
4. Quartiles
5. Interquartile range
6. Same, calculated from quartiles
7. Median absolute deviation
Example:
x <- rnorm(100, mean = 50, sd = 10)  # 100 random numbers
var(x)    # variance
[1] 95.75608
range(x)  # range
[1] 25.02415 75.79907
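The remaining measures from the list above can be obtained as follows (values will differ, as x is random):

```r
x <- rnorm(100, mean = 50, sd = 10)
sd(x)                               # standard deviation
diff(range(x))                      # range as a single number
quantile(x)                         # quartiles
IQR(x)                              # interquartile range
diff(quantile(x, c(0.25, 0.75)))    # same, calculated from quartiles
mad(x)                              # median absolute deviation
```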
The sheer power of R lies in its incredible packages. In R, most data handling tasks can
be performed in two ways: using R packages or using R base functions. In this tutorial, I'll
also introduce you to the most handy and powerful R packages. To install a package,
simply type:
install.packages("package name")
As a first-time user, a pop-up might appear asking you to select your CRAN mirror
(country server); choose accordingly and press OK.
Note: You can type this either in console directly and press ‘Enter’ or in R script and click
‘Run’.
Useful R Packages
Out of the ~7800 packages listed on CRAN, I've listed some of the most powerful and
commonly used packages for predictive modeling in this article. Since I have explained
the method of installing packages, you can go ahead and install them now; sooner or later
you'll need them.
Importing Data: R offers a wide range of packages for importing data available in any
format, such as .txt, .csv, .json, .sql etc. To import large files of data quickly, it is advisable
to install and use data.table, readr, RMySQL, sqldf and jsonlite.
Data Visualization: R has built-in plotting commands as well. They are good for creating
simple graphs, but become cumbersome when it comes to creating advanced graphics.
Hence, you should install ggplot2.
Data Exploration is a crucial stage of predictive modeling. You can't build great and
practical models unless you learn to explore the data from beginning to end. This stage
forms a concrete foundation for data manipulation (the very next stage). Let's understand
it in R. I have taken the data set from Big Mart Sales Prediction. Before we start, you
must get familiar with these terms:
Train Data: The predictive model is always built on the train data set. An intuitive way to
identify the train data is that it always includes the 'response variable'.
Test Data: Once the model is built, its accuracy is 'tested' on the test data. This data set
always contains fewer observations than the train data set. Also, it does not include the
'response variable'.
Right now, you should download the data set. Take a good look at train and test data.
Cross check the information shared above and then proceed.
#working directory
path <- ".../Data/BigMartSales"
setwd(path)
As a beginner, keep the train and test files in your working directory to avoid unnecessary
directory troubles. Once the directory is set, we can easily import the .csv files using the
commands below.
#Load Datasets
train <- read.csv("Train_UWu5bXk.csv")
test <- read.csv("Test_u94Q5KV.csv")
In fact, even prior to loading data in R, it’s a good practice to look at the data in Excel.
This helps in strategizing the complete prediction modeling process. To check if the data
set has been loaded successfully, look at R environment. The data can be seen there. Let’s
explore the data quickly.
> dim(train)
[1] 8523 12
> dim(test)
[1] 5681 11
We have 8523 rows and 12 columns in the train data set and 5681 rows and 11 columns in
the test data set. This makes sense: test data should always have one column less
(mentioned above, right?). Let's dig deeper into the train data set now.
To begin with, first check if this data has missing values. This can be done by using:
> table(is.na(train))
FALSE TRUE
100813 1463
In train data set, we have 1463 missing values. Let’s check the variables in which these
values are missing. It’s important to find and locate these missing values. Many data
scientists have repeatedly advised beginners to pay close attention to missing value in
data exploration stages.
> colSums(is.na(train))
Item_Identifier Item_Weight
0 1463
Item_Fat_Content Item_Visibility
0 0
Item_Type Item_MRP
0 0
Outlet_Identifier Outlet_Establishment_Year
0 0
Outlet_Size Outlet_Location_Type
0 0
Outlet_Type Item_Outlet_Sales
0 0
Hence, we see that column Item_Weight has 1463 missing values. Let’s get more
inferences from this data.
> summary(train)
Here are some quick inferences drawn from variables in train data set:
Graphical Representation of Variables
I’m sure you would understand these variables better when explained visually. Using
graphs, we can analyze the data in two ways: univariate analysis and bivariate analysis.
Univariate analysis uses one variable; bivariate analysis uses two. Univariate analysis is
a lot easier to do. Let's now experiment with bivariate analysis and carve out hidden
insights.
For visualization, I’ll use ggplot2 package. These graphs would help us understand the
distribution and frequency of variables in the data set.
We can see that the majority of sales has been obtained from products having visibility
less than 0.2. This suggests that Item_Visibility < 0.2 must be an important factor in
determining sales. Let's plot a few more interesting graphs and explore such hidden stories.
Here, we infer that OUT027 has contributed the majority of sales, followed by OUT035.
OUT010 and OUT019 have probably the least footfall, thereby contributing the least to
outlet sales.
From this graph, we can infer that Fruits and Vegetables contribute to the highest amount
of outlet sales followed by snack foods and household products. This information can also
be represented using a box plot chart. The benefit of using a box plot is that you get to see
the outliers and the spread of the corresponding levels of a variable (shown below).
> ggplot(train, aes(Item_Type, Item_MRP)) + geom_boxplot() +
    theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) +
    xlab("Item Type") + ylab("Item MRP") + ggtitle("Item Type vs Item MRP")
The black points you see are outliers. The mid line you see in the box is the median value
of each item type.
Now, we have an idea of the variables and their importance on response variable. Let’s
now move back to where we started. Missing values. Now we’ll impute the missing
values.
Let’s first combine the data sets. This will save our time as we don’t need to write
separate codes for train and test data sets. To combine the two data frames, we must make
sure that they have equal columns, which is not the case.
> dim(train)
[1] 8523 12
> dim(test)
[1] 5681 11
The test data set has one column less (the response variable). Let's first add that column.
We can give this column any value; an intuitive approach would be to extract the mean
value of sales from the train data set and use it as a placeholder for the test variable
Item_Outlet_Sales. Anyway, let's make it simple for now: I've taken the value 1. Now,
we'll combine the data sets.
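These two steps are not shown in the text; a sketch with tiny stand-in data frames (the real train and test sets are much larger):

```r
# stand-ins for the real train/test sets, reduced to two columns
train <- data.frame(Item_Weight = c(9, 12), Item_Outlet_Sales = c(100, 200))
test  <- data.frame(Item_Weight = c(8, NA))
test$Item_Outlet_Sales <- 1    # placeholder response, value 1 as in the text
combi <- rbind(train, test)    # works now that both have the same columns
nrow(combi)                    # 8523 + 5681 = 14204 on the real data
```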
We impute the missing values by the median. I'm using the median because it is known to
be highly robust to outliers. Moreover, for this problem, our evaluation metric is RMSE,
which is also highly affected by outliers; hence, the median is better in this case.
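The imputation code is not shown in the text; a self-contained sketch with a stand-in column (on the real data, table(is.na(combi)) afterwards reports only FALSE, as shown below):

```r
# stand-in for the combined data set with missing weights
combi <- data.frame(Item_Weight = c(9, NA, 12, NA, 10))
combi$Item_Weight[is.na(combi$Item_Weight)] <-
  median(combi$Item_Weight, na.rm = TRUE)
table(is.na(combi))   # only FALSE: no missing values remain
```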
FALSE
14204
It's important to learn to deal with continuous and categorical variables separately in a
data set; in other words, they need special attention. In this data set, we have only 3
continuous variables, and the rest are categorical in nature. If you are still confused, I
suggest you look at the data set once again using str() and then proceed.
Let's take up Item_Visibility. In the graph above, we saw that item visibility also takes the
value zero, which is practically not feasible. Hence, we'll consider it a missing value and
once again impute it with the median.
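Again the code is not shown; one way to sketch it with stand-in data:

```r
# stand-in column with implausible zero visibilities
combi <- data.frame(Item_Visibility = c(0.05, 0, 0.2, 0, 0.1))
combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0,
                                median(combi$Item_Visibility),
                                combi$Item_Visibility)
all(combi$Item_Visibility > 0)   # TRUE: zeros have been replaced
```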
Let's proceed to categorical variables now. During exploration, we saw there are
mismatched levels in some variables, which need to be corrected.
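The renaming commands themselves are missing from the text; a base-R sketch with stand-in factors (the level names are assumptions based on the description):

```r
# stand-in factors showing the mismatched levels described in the text
combi <- data.frame(
  Outlet_Size      = factor(c("", "Medium", "Small")),
  Item_Fat_Content = factor(c("LF", "low fat", "Regular"))
)
# give the unnamed Outlet_Size level the name "Other"
levels(combi$Outlet_Size)[levels(combi$Outlet_Size) == ""] <- "Other"
# merge the differently spelled fat-content levels into one
levels(combi$Item_Fat_Content)[levels(combi$Item_Fat_Content) %in%
                                 c("LF", "low fat")] <- "Low Fat"
```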
Using the commands above, I've assigned the name 'Other' to the unnamed level in the
Outlet_Size variable. For the rest, I've simply renamed the various levels of
Item_Fat_Content.
Data Manipulation in R
Let's call it the advanced level of data exploration. We will learn about feature
engineering and other useful aspects.
Feature Engineering:
This component separates an intelligent data scientist from a merely technically enabled
one. You might have access to large machines to run heavy computations and algorithms,
but the power delivered by new features just can't be matched. We create new variables to
extract and provide as much 'new' information to the model as possible, to help it make
accurate predictions.
There are 10 unique outlets in this data. This variable will give us information on the
count of outlets in the data set. The more records an outlet has, the more sales it is likely
to contribute.
> library(dplyr)
> a <- combi %>%
    group_by(Outlet_Identifier) %>%
    tally()
> head(a)
Source: local data frame [6 x 2]

  Outlet_Identifier     n
             (fctr) (int)
1 OUT010 925
2 OUT013 1553
3 OUT017 1543
4 OUT018 1546
5 OUT019 880
6 OUT027 1559
As you can see, the dplyr package makes data manipulation quite effortless. You no longer
need to write long functions. In the code above, I've simply stored the new data frame in a
variable a. Later, the new column Outlet_Count is added to our original 'combi' data set.
Similarly, we can compute the count of item identifiers too. It's a good practice to fetch
more information from unique ID variables using their counts. This will help us to
understand which items occur most frequently.
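The code that builds b is missing in the text; with dplyr it would mirror the Outlet_Identifier pipeline above (b <- combi %>% group_by(Item_Identifier) %>% tally()). An equivalent base-R sketch with stand-in data:

```r
# stand-in data; the real combi has many more identifiers
combi <- data.frame(Item_Identifier = c("DRA12", "DRA12", "DRA24"))
b <- as.data.frame(table(combi$Item_Identifier))  # count per identifier
names(b) <- c("Item_Identifier", "Item_Count")
```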
> names(b)[2] <- "Item_Count"
> head(b)
Item_Identifier Item_Count
(fctr) (int)
1 DRA12 9
2 DRA24 10
3 DRA59 10
4 DRB01 8
5 DRB13 9
6 DRB24 8
3. Outlet Years –
This variable represents how long a particular outlet has existed as of the year 2013. Why
2013? You'll find the answer in the problem statement. My hypothesis: the older the outlet,
the greater the footfall, the larger the base of loyal customers, and the larger the outlet sales.
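The derivation is not shown in the text; presumably it subtracts the establishment year from 2013:

```r
# stand-in establishment years
combi <- data.frame(Outlet_Establishment_Year = c(1999, 1987, 2009))
combi$Outlet_Year <- 2013 - combi$Outlet_Establishment_Year
combi$Outlet_Year   # 14 26 4: an outlet established in 1999 is 14 years old
```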
This suggests that outlets established in 1999 were 14 years old in 2013 and so on.
4. Item Type New –
Now, pay attention to the Item_Identifiers. We are about to discover a new trend: there is
a pattern in the identifiers starting with "FD", "DR" and "NC". Now, check the
Item_Types corresponding to these identifiers in the data set. You'll discover that items
corresponding to "FD" are mostly food, items corresponding to "DR" are drinks, and
items corresponding to "NC" are products which can't be consumed; let's call them
non-consumable. Let's extract this information into a new variable.
Here I will use the substr() and gsub() functions to extract and rename the prefixes,
respectively. Let's now add this information to our data set with a variable named
'Item_Type_New'.
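A sketch of these two steps with stand-in identifiers (the category names are assumptions):

```r
combi <- data.frame(Item_Identifier = c("FDA15", "DRC01", "NCD19"),
                    stringsAsFactors = FALSE)
# extract the two-letter prefix, then rename it
combi$Item_Type_New <- substr(combi$Item_Identifier, 1, 2)
combi$Item_Type_New <- gsub("FD", "Food", combi$Item_Type_New)
combi$Item_Type_New <- gsub("DR", "Drinks", combi$Item_Type_New)
combi$Item_Type_New <- gsub("NC", "Non-Consumable", combi$Item_Type_New)
```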
Linear regression
Regression analysis aims to describe the dependency between two (simple regression) or
several variables (multiple regression) by means of a linear or nonlinear function. In
case of not only multiple independent variables (x-variables), but also multiple
dependent variables (y-variables) this is referred to as multivariate regression. In contrast
to correlation analysis, which tests for the existence of a relationship, regression analysis
focuses on the quantitative description of this relationship by means of a statistical model.
Besides that, there are also differences in the underlying statistical assumptions. While
correlation analysis assumes that all variables are random and contain errors (model II),
regression analysis generally makes a distinction between independent variables (without
error) and dependent variables (with errors), which is called model I.
Only if the following prerequisites are met can the parameters (a, b) be estimated without
bias and the significance tests be reliable.
In this context it is especially important that the variance of the residuals is homogenous
along the whole range of x, i.e. the variation of the residuals must not increase or decrease
or show any systematic pattern.
Implementation in R
Model formula in R
R has a special formula syntax for describing statistical models. A simple linear model (y
versus x) can be described as:
y ~ x + 1
In contrast to the former, "- 1" means a regression model with zero intercept:
y ~ x - 1
A simple example: consider the following data set.
plot(cherry$Height, cherry$Volume)
Then, a linear model can be fitted with the linear model function lm():
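The fitting call is not shown in the text. Judging from the output below (intercept -87.12, slope 1.54), cherry appears to be R's built-in trees data set, so the fit can be sketched as:

```r
cherry <- trees   # assumption: the built-in cherry tree data (Height, Volume)
cherry.lm <- lm(Volume ~ Height, data = cherry)
```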
The R object delivered by lm (called cherry.lm in our case) now contains the complete
results of the regression, from which significance tests, residuals and further information
can be extracted by using specific R functions. Thus,
summary(cherry.lm)
Output:
Call:
lm(formula = Volume ~ Height, data = cherry)
Residuals:
Min 1Q Median 3Q Max
-21.274 -9.894 -2.894 12.067 29.852
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.1236 29.2731 -2.976 0.005835 **
Height 1.5433 0.3839 4.021 0.000378 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gives us the most important results of the regression, e.g. the coefficients, with intercept
being equivalent to parameter a and the dependence on Height being equivalent to the
slope parameter b. The coefficient of determination is found as Multiple R-squared.
Besides that, standard errors of the parameters and significance levels are also printed. The
regression line can be added to the plot with the universal line drawing function of R,
abline that directly accepts a linear model object (cherry.lm) as its argument:
abline(cherry.lm)
Classical tests
Sophisticated methods such as generalized mixed models or labor-intensive multivariate
analysis are not always necessary. In many cases so-called classical tests are sufficient.
Here, the principle of parsimony is applies too: the simplest method should be preferred.
We can test for differences between mean values with the help of the classical t-test. The
one-sample t-test tests if a sample is from a population with a given mean value µ:
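The call producing the output below is not shown; judging from df = 9 and mu = 5, it was presumably something like:

```r
x <- rnorm(10, 5, 0.5)   # a sample of 10 values (hence df = 9)
t.test(x, mu = 5)        # test against the population mean mu = 5
```

Since x is random, the concrete numbers will differ from the output shown.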
Output:-
One Sample t-test
data: x
t = -1.116, df = 9, p-value = 0.2933
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.404882 5.201915
sample estimates:
mean of x
4.803398
x1 <- rnorm(12, 5, 1)
x2 <- rnorm(12, 5.7, 1)
t.test(x1, x2)
Output:-
Welch Two Sample t-test
data: x1 and x2
t = -2.5904, df = 17.393, p-value = 0.01882
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.8064299 -0.1862296
sample estimates:
mean of x mean of y
4.683433 5.679763
That means that both samples differ significantly (p < 0.05). It has to be mentioned that R
did not perform the “normal” t-test but the Welch test (also termed heteroscedastic t-test),
for which the variances of both samples are not required to be identical.
The classical procedure would be as follows:
1. Perform a check for the identity of both variances with var.test beforehand:
2. If p < 0.05, the variances are significantly different, so the Welch test (see above)
needs to be used.
3. If p > 0.05, the variances are likely to be equal and the "normal" t-test can be applied:
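A sketch of this procedure, reusing the x1 and x2 samples from above:

```r
x1 <- rnorm(12, 5, 1)
x2 <- rnorm(12, 5.7, 1)
var.test(x1, x2)                   # step 1: compare the variances
t.test(x1, x2, var.equal = TRUE)   # step 3: "normal" t-test if p > 0.05
```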
Paired t-Test
If the data are paired (e.g. an allergy test on the left and right arm) the paired t-test is
used. Its advantage is that the influence of individual differences (i.e. a covariate) is
eliminated. The downside is that the number of degrees of freedom is reduced. But if
data are really paired, it is definitely advantageous to take that information into account:
x1 <- c(2, 3, 4, 5, 6)
x2 <- c(3, 4, 7, 6, 8)
t.test(x1, x2, var.equal=TRUE) # p=0.20 not significant
Two Sample t-test
data: x1 and x2
t = -1.372, df = 8, p-value = 0.2073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.28924 1.08924
sample estimates:
mean of x mean of y
4.0 5.6
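The paired call that produces the output below is:

```r
x1 <- c(2, 3, 4, 5, 6)
x2 <- c(3, 4, 7, 6, 8)
t.test(x1, x2, paired = TRUE)   # p = 0.016, significant
```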
Paired t-test
data: x1 and x2
t = -4, df = 4, p-value = 0.01613
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.710578 -0.489422
sample estimates:
mean of the differences
-1.6
It can be seen that the paired t-test has a greater discriminatory power.
Conclusion
R programming for data science is not that complex, and the reasons for its popularity are
its ease of use and free availability. But in order to learn data analytics with R, it is
important to study the software in detail, learn the different commands and structures that
R provides, and then apply those commands to analyze data effectively. To help
familiarize you with R, we have already described the basics of data analytics with R, and
we have prepared some tips that could help you study R for data analytics. R is one of the
few languages that allow statisticians to perform the most complicated and intricate
analyses without getting lost in too much detail. With so many benefits for data science,
R has gradually gained popularity among big data professionals. According to a 2014
survey, R is one of the most powerful and popular programming languages used by data
scientists today.
We have studied the introduction to R programming in detail. So, it is clear from the
above information that R is a popular and strong option, as it interfaces with many other
programming languages. Also, R is open source and has far more capabilities and better
availability of extensions than many other tools. Still, if you have any doubts about this R
tutorial, please comment.
Hope you like the Introduction to R Programming Tutorial.
References
1. Thomas Petzoldt, "Data Analysis with R: Selected Topics and Examples", October 21,
2018.
2. Tony Fischetti, "Data Analysis with R: Load, wrangle, and analyze your data using the
world's most powerful statistical programming language".
3. Ross Ihaka, "R: Past and Future History", a draft of a paper for Interface '98, Statistics
Department, The University of Auckland, Auckland, New Zealand.
4. Edward Boone, "Introduction to Data Analysis Using R", Department of Statistical
Sciences and Operations Research, Computation Seminar Series.
5. http://blog.sollers.edu/data-science/importance-of-r-language-for-data-science
6. https://data-flair.training/blogs/r-tutorial/
7. https://www.analyticsvidhya.com/blog/2016/02/complete-tutorial-learn-data-science-scratch/