
Data Analysis using R

Abstract

In this abstract we discuss data analysis using R and take a deeper look at the topic. Data analysis is steadily gaining popularity, and the question of how to perform it with R matters because R is a tool that enables analysts to carry out both data analysis and visualization. An important term in data analytics with R is exploratory data analysis (EDA), an approach to data analysis for summarizing and visualizing a data set. The focus of the approach is to analyze the data's basic structures and variables, developing an understanding of the data set that is deep enough to decide which methods of statistical analysis are appropriate. In order to explain this concept in detail, we proceed step by step: first we get to know R as a tool for data analysis, and then we discuss when and how R can be used to analyze data efficiently.

Introduction

R is a programming language and free software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. It is widely used among statisticians and data miners for developing statistical software and for data analysis.

R is an interpreted computer language which allows branching and looping as well as modular programming using functions. R allows integration with procedures written in C, C++, .Net, Python or FORTRAN for efficiency. It is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac. R is free software distributed under a GNU-style copyleft, and an official part of the GNU project called GNU S.

R is command-driven software: as a data analyst using R, you give the software written commands that indicate what you want to do. The advantage of using R is that it lets analysts collect large sets of data, chain different commands together, and then process all the commands at once. R is worth using for data analysis because it can process a large number of commands together, saves all data and work in progress, and enables analysts to easily correct small mistakes without having to search back through every command to find the error and fix it.

Why analyze data using R?

 R provides straightforward handling of analyses through its collection of operators for calculations on arrays, vectors and matrices.

 R is easy for beginners to learn and includes conditionals, loops and user-defined recursive functions.

 R provides both simple and advanced options for analysis.

 R makes it easy to fix mistakes.

Fig 1.1 Why use R?

Data analysis using R increases efficiency, because data analytics with R enables analysts to process data sets that are traditionally considered large. For example, it was previously impractical to process data sets of 300,000 cases together, but with R, on a machine with at least 2 GB of memory, data sets of 300,000 cases and around 120 variables can be processed.

Features of R

 It supports procedural programming with functions and object-oriented programming with generic functions. Procedural programming includes procedures, records, modules, and procedure calls, while object-oriented programming includes classes, objects, and generic functions.

 Packages are part of R programming; they are useful for collecting sets of R functions into a single unit.

 R's programming features include database input, exporting data, viewing data, variable labels, missing data, etc.

 R is an interpreted language, so it can be accessed through a command-line interpreter.

 R supports matrix arithmetic.

 It has effective data handling and storage facilities.

 R supports a large pool of operators for performing operations on arrays and matrices.

 It has facilities to print reports of the analysis performed, in the form of graphs, either on screen or on hardcopy.

Comparison Of R With Other Technologies

 Data handling capabilities – good data handling capabilities and options for parallel computation.

 Availability / cost – R is open source and can be used anywhere.

 Advancement in tool – if you are working on the latest technologies, R gets the latest features early.

 Ease of learning – R has a learning curve. It is a comparatively low-level language, so simple procedures can require long code.

 Job scenario – it is a good option for start-ups and companies looking for cost efficiency.

 Graphical capabilities – R has highly advanced graphical capabilities.

 Customer service support and community – R has one of the biggest and fastest-growing online communities.

Applications of R Programming

 Many data analysts and research programmers use R because it is a very prevalent language; hence, R is used as a fundamental tool in finance.

 Many quantitative analysts use R as their programming tool, since R helps with data importing and cleaning, whatever kind of strategy you are working on.

 R is well suited for data science because it offers a broad variety of statistics. In addition, R provides an environment for statistical computing and design; indeed, R can be considered an alternative implementation of the S language.

Topics Details & Literature Survey

R-Environment Setup

Try it online

You do not really need to set up your own environment to start learning the R programming language, because an R programming environment is already set up online, so that you can compile and execute all the available examples online while you do your theory work. This gives you confidence in what you are reading and lets you check the results with different options. Feel free to modify any example and execute it online.

Try the following example using the Try it option available at the top right corner of the sample code on the website:

# Print Hello World.
print("Hello World")

# Add two numbers.
print(23.9 + 11.6)

Local Environment Setup

If you are still willing to set up your own environment for R, you can follow the steps given below.

Windows Installation

You can download the Windows installer version of R (e.g. R-3.2.2 for Windows, 32/64 bit) and save it in a local directory.

Since it is a Windows installer (.exe) with a name like "R-version-win.exe", you can just double-click and run the installer, accepting the default settings. If your Windows is a 32-bit version, it installs the 32-bit version of R. If your Windows is 64-bit, it installs both the 32-bit and 64-bit versions.

After installation you can locate the icon to run the program in the directory structure "R\R-3.2.2\bin\i386\Rgui.exe" under Windows Program Files. Clicking this icon brings up the R GUI, which is the R console for doing R programming.

Linux Installation

R is available as a binary for many versions of Linux at the location R Binaries.

The installation instructions vary from flavor to flavor; they are given under each type of Linux version at the mentioned link. However, if you are in a hurry, you can use the yum command to install R as follows:

$ yum install R

The above command installs the core functionality of R along with the standard packages. If you need an additional package, you can launch the R prompt as follows:

$ R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>

Now you can use the install.packages() command at the R prompt to install a required package. For example, the following command installs the plotrix package, which is required for 3D charts.

> install.packages("plotrix")

R Software Platform

Before developing a deep understanding of what exactly data analytics using R involves, it is important to understand the basic interface of R. The R software has four basic features: the R console, R script, the R environment, and graphical output. In summary, R enables analysts to write code in the console, run commands through a script, analyze variables and data sets in the R environment, and then present the data in the form of graphical output.

In four simple steps, users can analyze data using R by performing the following tasks:

1. R Console:

The R console is the interface where analysts write code for running on the data and can later view the output; the code itself can also be written using an R script.

2. R Script:

The R script is the interface where analysts write code. The process is quite simple: users just write the code, and to run it they press Ctrl+Enter or use the "Run" button at the top of the R script pane.

3. R Environment:

The R environment is the space that holds external objects: this involves adding the actual data set, and then the variables, vectors and functions needed to analyze the data. You can add all your data here and also check whether your data has been loaded accurately into the environment.

4. Graphical Output:

Once the scripts and code are written and the data sets and variables have been added to R, the graphical output feature can be used to create graphs from the exploratory data analysis performed.

Fig 1.2 R Software Platform

Thus, if the functioning of data analytics using R is analyzed on the basis of the above features, then data analytics with R entails writing code and scripts and uploading sets of data and variables, i.e. uploading the information you know in order to obtain the information you want to find out, and then representing the results with visual graphs.
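To make these four steps concrete, here is a minimal sketch of the workflow (the data values and object names are invented for illustration). It can be typed into the R script pane and run line by line with Ctrl+Enter:

# 1. Write code in the console or script: create a small data set.
scores <- c(67, 72, 80, 91, 58)

# 2. Run the commands and analyze the variable.
avg <- mean(scores)
print(avg)

# 3. Inspect the objects now listed in the R environment.
ls()

# 4. Produce graphical output.
hist(scores, main = "Distribution of scores")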

R-Basic syntax

We will start learning R programming by writing a "Hello, World!" program. Depending on your needs, you can program either at the R command prompt or in an R script file. Let's check both, one by one.

R Command Prompt

Once you have the R environment set up, it is easy to start the R command prompt by just typing the following command at your shell prompt:

$ R

This will launch the R interpreter and you will get a prompt > where you can start typing your program as follows:

> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"

Here the first statement defines a string variable myString, to which we assign the string "Hello, World!"; the next statement uses print() to print the value stored in myString.

R Script File

Usually, you will do your programming by writing your programs in script files, and then you execute those scripts at the command prompt with the help of the R interpreter called Rscript. Let's start by writing the following code in a text file called test.R:

# My first program in R Programming
myString <- "Hello, World!"
print(myString)

Save the above code in the file test.R and execute it at the Linux command prompt as given below. Even if you are using Windows or another system, the syntax will remain the same.

$ Rscript test.R

When we run the above program, it produces the following result.

[1] "Hello, World!"

Comments

Comments are like helping text in your R program; they are ignored by the interpreter while executing the actual program. A single comment is written using # at the beginning of the statement, as follows:

# My first program in R Programming

R does not support multi-line comments, but you can use a trick such as the following:

if(FALSE){
   "This is a demo for multi-line comments and it should be put inside either a single or double quote"
}
myString <- "Hello, World!"
print(myString)

Although the text above is parsed by the R interpreter, the if(FALSE) block is never executed, so it does not interfere with your actual program. You should put such comments inside either single or double quotes.

R-Data Types

Data analytics with R is performed using the four features of R mentioned above: the R console, R script, the R environment and graphical output. R programming for data science offers different features and packages that can be installed to analyze different types of data. R enables the user to analyze data of the following types:

Fig 1.3 Data Types of R

The simplest of these objects is the vector object, and there are six data types of these atomic vectors, also termed the six classes of vectors. The other R objects are built upon the atomic vectors.

Data Type   Example                               Verify                     Result
Logical     TRUE, FALSE                           v <- TRUE                  [1] "logical"
                                                  print(class(v))
Numeric     12.3, 6, 876                          v <- 45.5                  [1] "numeric"
                                                  print(class(v))
Integer     2L, 34L, 0L                           v <- 2L                    [1] "integer"
                                                  print(class(v))
Complex     3 + 2i                                v <- 2+5i                  [1] "complex"
                                                  print(class(v))
Character   'd', "bad", "TRUE", '23.4', "YES"     v <- "YES"                 [1] "character"
                                                  print(class(v))
Raw         "Hello" is stored as 48 65 6c 6c 6f   v <- charToRaw("Hello")    [1] "raw"
                                                  print(class(v))

 Vector:

Vector data sets group together objects from the same class, e.g. a vector could contain numerics, integers, etc. However, R also allows mixing objects of different classes. We will not go in depth here into the specific commands for grouping different objects, but combining objects of different classes into one vector causes coercion: all elements are converted so that the result becomes one object of a single class, which can be checked with the class() function.
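As a small illustration of coercion, mixing classes in c() converts all elements to the most flexible common class, which can be checked with class():

# Mixing numeric and character values: everything is coerced to character.
v <- c(1.7, "a")
print(class(v))    # [1] "character"

# Mixing logical and numeric values: the logical is coerced to numeric.
w <- c(TRUE, 2)
print(class(w))    # [1] "numeric"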

When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.

# Create a vector.
apple <- c('red','green',"yellow")
print(apple)

# Get the class of the vector.


print(class(apple))

When we execute the above code, it produces the following result:

[1] "red" "green" "yellow"


[1] "character"

 Matrices:

A matrix data set is created when a vector is divided into rows and columns; the data still contains elements of the same class, but in matrix form the data structure is two-dimensional.
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix() function.

# Create a matrix.
M = matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)

When we execute the above code, it produces the following result:


[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"

 Data Frame:

A data frame can be considered an advanced form of matrix: it is a collection of vectors with different kinds of elements. The difference between a matrix and a data frame is that a matrix must have elements of the same class, while a data frame can group together lists of vectors with different classes. Data frame commands can be more complex than the rest.

Data frames are tabular data objects. Unlike in a matrix, in a data frame each column can contain a different mode of data: the first column can be numeric while the second column can be character and the third column can be logical. It is a list of vectors of equal length.

Data frames are created using the data.frame() function.

# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)
print(BMI)

When we execute the above code, it produces the following result:

  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26

 List:

A list is a data structure that groups together data from different classes.

A list is an R object which can contain many different types of elements inside it, like vectors, functions and even other lists.

# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)

When we execute the above code, it produces the following result:

[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")

 Arrays:

While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim attribute which creates the required number of dimensions. In the example below we create an array with two elements, each of which is a 3x3 matrix.

# Create an array.
a <- array(c('green','yellow'),dim=c(3,3,2))
print(a)

When we execute the above code, it produces the following result:


, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"

 Factors:

Factors are R objects created from a vector. They store the vector together with the distinct values of its elements as labels. The labels are always characters, irrespective of whether the input vector is numeric, character or Boolean. Factors are useful in statistical modeling.

Factors are created using the factor() function. The nlevels() function gives the count of levels.

# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.


factor_apple <- factor(apple_colors)

# Print the factor.


print(factor_apple)
print(nlevels(factor_apple))

When we execute the above code, it produces the following result:

[1] green  green  yellow red    red    red    green
Levels: green red yellow

# applying the nlevels function we can know the number of distinct values
[1] 3

Apart from the R programming features that allow the analysis of different types of data, R for data science also allows different types of variables to be used, such as:

 Continuous Variables:

Continuous variables can take any value, including decimals, e.g. 1, 2.5, 4.6, 7, etc.

 Categorical Variables:

Categorical variables can only take values from a fixed set of categories, such as 1, 2, 3, 4, 5. Factors are used for representing categorical variables in data analytics with R.

 Missing Values:

Missing values are painful yet a crucial part of data analytics, including data analytics with R. A missing value is represented by NA, and there are commands and arguments for performing calculations while skipping the missing values; when values are missing, it is important to mark them as such so that the analysis can handle them correctly. A short example follows after Fig 1.4.

Fig 1.4 Variables of R
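As a brief sketch of handling missing values (the vector below is invented), is.na() flags the missing entries and the na.rm argument lets summary functions skip them:

# A vector with one missing value.
x <- c(1, 2.5, NA, 7)
print(is.na(x))              # [1] FALSE FALSE  TRUE FALSE
print(mean(x))               # [1] NA, because NA propagates by default
print(mean(x, na.rm = TRUE)) # [1] 3.5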

R-Decision Making

Decision making structures require the programmer to specify one or more conditions to
be evaluated or tested by the program, along with a statement or statements to be
executed if the condition is determined to be true, and optionally, other statements to be
executed if the condition is determined to be false.

R provides the following types of decision making statements.

Statement            Description
if statement         An if statement consists of a Boolean expression followed by one or more statements.
if...else statement  An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
switch statement     A switch statement allows a variable to be tested for equality against a list of values.

 If statement :-
An if statement consists of a Boolean expression followed by one or more
statements.

Syntax :- if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
}

If the Boolean expression evaluates to true, the block of code inside the if statement is executed. If it evaluates to false, the first statement after the end of the if statement (after the closing curly brace) is executed.
Example

x <- 30L
if(is.integer(x)){
print("X is an Integer")
}

When the above code is compiled and executed, it produces the following result:
[1] "X is an Integer"

 If...Else Statement :-
An if statement can be followed by an optional else statement which executes
when the boolean expression is false.

Syntax
if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
} else {
   # statement(s) will execute if the boolean expression is false.
}

If the Boolean expression evaluates to true, the if block of code is executed; otherwise the else block of code is executed.

Example –

x <- c("what","is","truth")
if("Truth" %in% x){
print("Truth is found")
} else {
print("Truth is not found")
}

When the above code is compiled and executed, it produces the following result:

[1] "Truth is not found"

Here "Truth" and "truth" are two different strings.

 The if...else if...else Statement

An if statement can be followed by an optional else if...else statement, which is very useful for testing various conditions using a single if...else if statement.

When using if, else if and else statements, there are a few points to keep in mind:
 An if can have zero or one else, and it must come after any else ifs.
 An if can have zero to many else ifs, and they must come before the else.
 Once an else if succeeds, none of the remaining else ifs or the else will be tested.

Syntax

if(boolean_expression 1) {
   # Executes when boolean expression 1 is true.
} else if(boolean_expression 2) {
   # Executes when boolean expression 2 is true.
} else if(boolean_expression 3) {
   # Executes when boolean expression 3 is true.
} else {
   # Executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x){
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}

When the above code is compiled and executed, it produces the following result:

[1] "truth is found the second time"

 Switch Statement
A switch statement allows a variable to be tested for equality against a list of
values. Each value is called a case, and the variable being switched on is checked
for each case.

Syntax
switch(expression, case1, case2, case3....)

The following rules apply to a switch statement:

 If the value of expression is not a character string, it is coerced to integer.
 You can have any number of case statements within a switch. Each case is followed by the value to be compared to and a colon.
 If the value of the integer is between 1 and nargs()-1 (the maximum number of arguments), then the corresponding element of the case condition is evaluated and the result returned.
 If expression evaluates to a character string, then that string is matched (exactly) against the names of the elements.
 If there is more than one match, the first matching element is returned.
 No default argument is available.
 In the case of no match, if there is an unnamed element of ..., its value is returned. (If there is more than one such argument, an error is returned.)

A string-matching example is shown after the integer example below.

Example
x <- switch(
3,
"first",
"second",
"third",
"fourth"
)
print(x)

When the above code is compiled and executed, it produces the following result:

[1] "third"

R-Loops

There may be situations when you need to execute a block of code several times. In general, statements are executed sequentially: the first statement in a function is executed first, followed by the second, and so on.

Programming languages provide various control structures that allow for more complicated execution paths.

A loop statement allows us to execute a statement or a group of statements multiple times.

The R programming language provides the following kinds of loops to handle looping requirements.

Loop Type    Description
repeat loop  Executes a block of statements repeatedly until a stop (break) condition is met; the condition is tested inside the loop body.
while loop   Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
for loop     Executes a sequence of statements once for each element of a vector, abbreviating the code that manages the loop variable.

 Repeat Loop
The Repeat loop executes the same code again and again until a stop condition is
met.

Syntax
repeat {
   commands
   if(condition){
      break
   }
}
Example-

v <- c("Hello","loop")
R Programming
44

23
cnt <- 2
repeat{
print(v)
cnt <- cnt+1
if(cnt > 5){
break
}
}

When the above code is compiled and executed, it produces the following result:

[1] "Hello" "loop"


[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"

 While Loop
The while loop executes the same code again and again as long as a given condition is true.
Syntax
while (test_expression) {
statement
}
The key point of the while loop is that it might not run at all: when the condition is tested and the result is false, the loop body is skipped and the first statement after the while loop is executed.

Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7){
print(v)
cnt = cnt + 1
}
When the above code is compiled and executed, it produces the following result :

[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"

 For Loop
A for loop is a repetition control structure that allows you to efficiently write a
loop that needs to execute a specific number of times.
Syntax :
for (value in vector) {
statements
}

Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
When the above code is compiled and executed, it produces the following result:

[1] "A"
[1] "B"
[1] "C"
[1] "D"

Loop Control Statements

Loop control statements change execution from its normal sequence. When execution
leaves a scope, all automatic objects that were created in that scope are destroyed.
R supports the following control statements

Control Statement  Description
break statement    Terminates the loop statement and transfers execution to the statement immediately following the loop.
next statement     Skips the remainder of the current iteration and continues with the next iteration of the loop.

 Break Statement
The break statement in R programming language has the following two usages:
 When the break statement is encountered inside a loop, the loop is
immediately terminated and program control resumes at the next statement
following the loop.
 It can be used to terminate a case in the switch statement (covered above).
Syntax :- break
Example -
v <- c("Hello","loop")
cnt <- 2
repeat{
print(v)
cnt <- cnt+1
if(cnt > 5){
break
}
}
When the above code is compiled and executed, it produces the following result:
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"

 Next Statement
The next statement in R programming language is useful when we want to skip
the current iteration of a loop without terminating it. On encountering next, the R
parser skips further evaluation and starts next iteration of the loop.

Syntax:- next
Example
v <- LETTERS[1:6]
for ( i in v){
   if (i == "D"){
      next
   }
   print(i)
}

When the above code is compiled and executed, it produces the following result:

[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"

R-Function

A function is a set of statements organized together to perform a specific task. R has a


large number of in-built functions and the user can create their own functions.
In R, a function is an object so the R interpreter is able to pass control to the function,
along with arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well as any
result which may be stored in other objects.

 Function Definition
An R function is created by using the keyword function. The basic syntax of an R
function definition is as follows:

function_name <- function(arg_1, arg_2, ...) {
   # Function body
}

Function Components - The different parts of a function are:

 Function Name: This is the actual name of the function. It is stored in R


environment as an object with this name.
 Arguments: An argument is a placeholder. When a function is invoked, you
pass a value to the argument. Arguments are optional; that is, a function may
contain no arguments. Also arguments can have default values.
 Function Body: The function body contains a collection of statements that
defines what the function does.
 Return Value: The return value of a function is the last expression in the
function body to be evaluated.

R has many in-built functions which can be directly called in the program without
defining them first. We can also create and use our own functions referred as user defined
functions.

 Built-in Function
Simple examples of built-in functions are seq(), mean(), max(), sum(x) and paste(...), etc. They are directly called by user-written programs. You can refer to the most widely used R functions.

# Create a sequence of numbers from 32 to 44.


print(seq(32,44))

# Find mean of numbers from 25 to 82.


print(mean(25:82))

# Find sum of numbers from 41 to 68.


print(sum(41:68))

When we execute the above code, it produces the following result:

[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526

 User-defined Function
We can create user-defined functions in R. They are specific to what a user wants
and once created they can be used like the built-in functions. Below is an example
of how a function is created and used.

# Create a function to print squares of numbers in sequence.


new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}

 Calling a Function

# Create a function to print squares of numbers in sequence.


new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}

# Call the function new.function supplying 6 as an argument.

new.function(6)

When we execute the above code, it produces the following result:

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36

 Calling a Function without an Argument

# Create a function without an argument.


new.function <- function() {
   for(i in 1:5) {
      print(i^2)
   }
}
# Call the function without supplying an argument.
new.function()
When we execute the above code, it produces the following result:
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

 Calling a Function with Argument Values (by position and by name)

The arguments to a function call can be supplied in the same sequence as defined in the function, or they can be supplied in a different sequence but assigned to the names of the arguments.

# Create a function with arguments.


new.function <- function(a,b,c) {
   result <- a*b+c
   print(result)
}

# Call the function by position of arguments.
new.function(5,3,11)

# Call the function by names of the arguments.
new.function(a=11,b=5,c=3)

When we execute the above code, it produces the following result:

[1] 26
[1] 58

 Calling a Function with Default Argument


We can define default values for the arguments in the function definition and then call the function without supplying any argument, to get the default result. We can also call such functions by supplying new values for the arguments, to get a non-default result.

# Create a function with arguments.


new.function <- function(a = 3, b = 6) {
   result <- a*b
   print(result)
}

# Call the function without giving any argument.


new.function()

# Call the function giving new values for the arguments.


new.function(9,5)

When we execute the above code, it produces the following result:

[1] 18
[1] 45

 Lazy Evaluation of Function

Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.

# Create a function with arguments.

new.function <- function(a, b) {
   print(a^2)
   print(a)
   print(b)
}
# Evaluate the function without supplying one of the arguments.
new.function(6)
When we execute the above code, it produces the following result:
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default

R-Strings
Any value written within a pair of single or double quotes in R is treated as a string. Internally, R stores every string within double quotes, even when you create it with single quotes.
Rules Applied in String Construction :-

 The quotes at the beginning and end of a string should be both double quotes or
both single quote. They cannot be mixed.
 Double quotes can be inserted into a string starting and ending with single quote.
 Single quote can be inserted into a string starting and ending with double quotes.
 Double quotes cannot be inserted into a string starting and ending with double
quotes.
 Single quote cannot be inserted into a string starting and ending with single quote.

Examples of Valid Strings

a <- 'Start and end with single quote'


print(a)

b <- "Start and end with double quotes"


print(b)

c <- "single quote ' in between double quotes"


print(c)

d <- 'Double quotes " in between single quote'
print(d)

When the above code is run we get the following output:

[1] "Start and end with single quote"


[1] "Start and end with double quotes"
[1] "single quote ' in between double quote"
[1] "Double quote \" in between single quote"

Examples of Invalid Strings

e <- 'Mixed quotes"


print(e)
f <- 'Single quote ' inside single quote'
print(f)
g <- "Double quotes " inside double quotes"
print(g)

When we run the script it fails giving below results.


...: unexpected INCOMPLETE_STRING

.... unexpected symbol


1: f <- 'Single quote ' inside

unexpected symbol
1: g <- "Double quotes " inside

 String Manipulation

 Concatenating Strings - paste() function

Many strings in R are combined using the paste() function. It can take any number of arguments to be combined together.
Syntax :- paste(..., sep = " ", collapse = NULL)

 ... represents any number of arguments to be combined.
 sep represents a separator between the arguments. It is optional.
 collapse is used to eliminate the space between two strings, but not the space within the words of a single string.
Example

a <- "Hello"
b <- 'How'
c <- "are you? "
print(paste(a,b,c))
print(paste(a,b,c, sep = "-"))
print(paste(a,b,c, sep = "", collapse = ""))

When we execute the above code, it produces the following result:

[1] "Hello How are you? "


[1] "Hello-How-are you? "
[1] "HelloHoware you? "

 Formatting numbers & strings - format() function

Numbers and strings can be formatted to a specific style using the format() function.
Syntax:- format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
 x is the vector input.
 digits is the total number of digits displayed.
 nsmall is the minimum number of digits to the right of the decimal
point.
 scientific is set to TRUE to display scientific notation.
 width indicates the minimum width to be displayed by padding
blanks in the beginning.
 justify is the display of the string to left, right or center.
Example
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 9)
print(result)

# Display numbers in scientific notation.


result <- format(c(6, 13.14521), scientific = TRUE)
print(result)

# The minimum number of digits to the right of the decimal point.


result <- format(23.47, nsmall = 5)
print(result)

# Format treats everything as a string.


result <- format(6)
print(result)

# Numbers are padded with blank in the beginning for width.


result <- format(13.7, width = 6)
print(result)

# Left justify strings.


result <- format("Hello",width = 8, justify = "l")
print(result)

# Justify string with center.


result <- format("Hello",width = 8, justify = "c")
print(result)

When we execute the above code, it produces the following result:


[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] " 13.7"
[1] "Hello "
[1] " Hello "

 Counting number of characters in a string - nchar() function

This function counts the number of characters in a string, including spaces.
Syntax:- nchar(x)
 x is the vector input.
Example
result <- nchar("Count the number of characters")
print(result)
When we execute the above code, it produces the following result:
[1] 30

 Changing the case - toupper() & tolower() functions

These functions change the case of the characters of a string.
Syntax:- toupper(x)
         tolower(x)

 x is the vector input.


Example
# Changing to Upper case.
result <- toupper("Changing To Upper")
print(result)

# Changing to lower case.


result <- tolower("Changing To Lower")
print(result)

When we execute the above code, it produces the following result:


[1] "CHANGING TO UPPER"
[1] "changing to lower"

 Extracting parts of a string - substring() function

This function extracts parts of a string.
Syntax :- substring(x, first, last)
 x is the character vector input.
 first is the position of the first character to be extracted.
 last is the position of the last character to be extracted.

Example
# Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)

When we execute the above code, it produces the following result:


[1] "act"

R-Operators
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical manipulation. The R language is rich in built-in operators and provides the following types of operators.

Types of Operators

We have the following types of operators in R programming:

 Arithmetic Operators
 Relational Operators
 Logical Operators
 Assignment Operators
 Miscellaneous Operators

 Arithmetic Operators
The following table shows the arithmetic operators supported by the R language. The operators act on each element of the vector. In all examples below, v <- c(2, 5.5, 6) and t <- c(8, 3, 4).

Operators  Description                                    Example       Result
+          Adds two vectors                               print(v+t)    [1] 10.0  8.5 10.0
-          Subtracts the second vector from the first     print(v-t)    [1] -6.0  2.5  2.0
*          Multiplies both vectors                        print(v*t)    [1] 16.0 16.5 24.0
/          Divides the first vector by the second         print(v/t)    [1] 0.250000 1.833333 1.500000
%%         Remainder of dividing the first vector         print(v%%t)   [1] 2.0 2.5 2.0
           by the second
%/%        Quotient of dividing the first vector by       print(v%/%t)  [1] 0 1 1
           the second (integer division)
^          The first vector raised to the exponent        print(v^t)    [1]  256.000  166.375 1296.000
           of the second vector

 Relational Operators
The following table shows the relational operators supported by the R language. Each element of the first vector is compared with the corresponding element of the second vector, and the result of the comparison is a Boolean value. In all examples below, v <- c(2, 5.5, 6, 9) and t <- c(8, 2.5, 14, 9).

Operators  Description                                                   Example      Result
>          Checks if each element of the first vector is greater         print(v>t)   [1] FALSE  TRUE FALSE FALSE
           than the corresponding element of the second vector.
<          Checks if each element of the first vector is less            print(v<t)   [1]  TRUE FALSE  TRUE FALSE
           than the corresponding element of the second vector.
==         Checks if each element of the first vector is equal           print(v==t)  [1] FALSE FALSE FALSE  TRUE
           to the corresponding element of the second vector.
<=         Checks if each element of the first vector is less than       print(v<=t)  [1]  TRUE FALSE  TRUE  TRUE
           or equal to the corresponding element of the second vector.
>=         Checks if each element of the first vector is greater than    print(v>=t)  [1] FALSE  TRUE FALSE  TRUE
           or equal to the corresponding element of the second vector.
!=         Checks if each element of the first vector is unequal         print(v!=t)  [1]  TRUE  TRUE  TRUE FALSE
           to the corresponding element of the second vector.

 Logical Operators
The following table shows the logical operators supported by the R language. They are applicable only to vectors of type logical, numeric or complex. All non-zero values are treated as the logical value TRUE; zero is treated as FALSE.

Each element of the first vector is compared with the corresponding element of the second vector, and the result of the comparison is a Boolean value.

Operators  Description                                              Example                   Result
&          Element-wise logical AND. Combines each element of       v <- c(3,1,TRUE,2+3i)     [1]  TRUE  TRUE FALSE  TRUE
           the first vector with the corresponding element of the   t <- c(4,1,FALSE,2+3i)
           second and gives TRUE if both elements are TRUE.         print(v&t)
|          Element-wise logical OR. Combines each element of        v <- c(3,0,TRUE,2+2i)     [1]  TRUE FALSE  TRUE  TRUE
           the first vector with the corresponding element of the   t <- c(4,0,FALSE,2+3i)
           second and gives TRUE if at least one element is TRUE.   print(v|t)
!          Logical NOT. Takes each element of the vector and        v <- c(3,0,TRUE,2+2i)     [1] FALSE  TRUE FALSE FALSE
           gives the opposite logical value.                        print(!v)

Note:- The logical operators && and || consider only the first element of each vector and give a vector of a single element as output.

&&         Logical AND. Takes the first element of both vectors     v <- c(3,0,TRUE,2+2i)     [1] TRUE
           and gives TRUE only if both are TRUE.                    t <- c(1,3,TRUE,2+3i)
                                                                    print(v&&t)
||         Logical OR. Takes the first element of both vectors      v <- c(0,0,TRUE,2+2i)     [1] FALSE
           and gives TRUE if at least one of them is TRUE.          t <- c(0,3,TRUE,2+3i)
                                                                    print(v||t)

 Assignment Operators
These operators are used to assign values to vectors.

Operators           Description       Example                    Result
<-  or  =  or  <<-  Left assignment   v1 <- c(3,1,TRUE,2+3i)     [1] 3+0i 1+0i 1+0i 2+3i
                                      v2 <<- c(3,1,TRUE,2+3i)    [1] 3+0i 1+0i 1+0i 2+3i
                                      v3 = c(3,1,TRUE,2+3i)      [1] 3+0i 1+0i 1+0i 2+3i
                                      print(v1)
                                      print(v2)
                                      print(v3)
->  or  ->>         Right assignment  c(3,1,TRUE,2+3i) -> v1     [1] 3+0i 1+0i 1+0i 2+3i
                                      c(3,1,TRUE,2+3i) ->> v2    [1] 3+0i 1+0i 1+0i 2+3i
                                      print(v1)
                                      print(v2)

 Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical computation.

Operators  Description                              Example                        Result
:          Colon operator. Creates a series of      v <- 2:8                       [1] 2 3 4 5 6 7 8
           numbers in sequence for a vector.        print(v)
%in%       Identifies whether an element            v1 <- 8                        [1] TRUE
           belongs to a vector.                     v2 <- 12                       [1] FALSE
                                                    t <- 1:10
                                                    print(v1 %in% t)
                                                    print(v2 %in% t)
%*%        Matrix multiplication; here it is        M = matrix(c(2,6,5,1,10,4),         [,1] [,2]
           used to multiply a matrix with its         nrow=2, ncol=3, byrow=TRUE)  [1,]   65   82
           transpose.                               t = M %*% t(M)                 [2,]   82  117
                                                    print(t)

R- Graphics
Now we will see how to use R as a function plotter by drawing sine and cosine functions on an interval between 0 and 10. First we create a table of values for x and y; in order to get a smooth curve, it is reasonable to choose a small step size. As a rule of thumb, about 100 to 400 small steps are a good compromise between smoothness and memory requirements, so let's set the step size to 0.1:

x <- seq(0, 10, 0.1)
y <- sin(x)
plot(x, y)

Instead of plotting points, it is of course also possible to draw continuous lines. This is
indicated by supplying an optional argument type="l". Important: the symbol used here
for type is the small letter “L” for “line” and not the – in printing very similar – numeral
“1” (one)! Note also that in R optional arguments can be given by using a “keyword =
value” pair. This has the advantage that the order of arguments does not matter, because
arguments are referenced by name:

plot(x, y, type = "l")

Now we want to add a cosine function with another color. This can be done with one of
the function lines or points, for adding lines or points to an existing figure:

y1 <- cos(x) lines(x, y1, col = "red")

With the help of text it is also possible to add arbitrary text, by specifying first the x- and
y- coordinates and then the text:

x1 <- 1:10
text(x1, sin(x1), x1, col = "green")

Many options exist to modify the behavior of most graphics functions so the following
specifies user-defined coordinate limits (xlim, ylim), axis labels and a heading (xlab,
ylab, main).

plot(x, y, xlim = c(-10, 10), ylim = c(-2, 2), xlab = "x-Values", ylab = "y-Values", main =
"Example Graphics")

The above example is a rather long command and may not fit on a single line. In such
cases, R displays a + (plus sign) to indicate that a command must be continued, e.g.
because a closing parenthesis or a closing quote are still missing. Such a + at the
beginning of a line is an automatic “prompt” similar to the ordinary “>” prompt and must
never be typed in manually. If, however, the “+” continuation prompt occurs by accident,
press “ESC” to cancel this mode.

In contrast to the long line continuation prompt, it is also possible to write several
commands on one line, separated by a semi-colon “;”.
Finally, a number (or hash) symbol # means that the following part of the line is a
comment and should therefore be ignored by R.

In order to explore the wealth of graphical functions, you may now have a more extensive
look into the online help, especially regarding ?plot or ?plot.default, and you should
experiment a little bit with different plotting parameters, like lty, pch, lwd, type, log etc.
R contains uncountable possibilities to get full control over the style and content of your
graphics, e.g. with user-specified axes (axis), legends (legend) or user-defined lines and
areas (abline, rect, polygon). The general style of figures like (font size, margins, line
width) can be influenced with the par() function.
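As a small sketch of such lower-level functions (the colors and labels are chosen arbitrarily), a dashed reference line and a legend can be added to the sine/cosine figure from above:

plot(x, y, type = "l", col = "blue")   # sine curve from above
lines(x, y1, col = "red")              # cosine curve from above
abline(h = 0, lty = "dashed")          # horizontal reference line at y = 0
legend("topright", legend = c("sin", "cos"), col = c("blue", "red"), lty = 1)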

In addition, R and its packages contain numerous "high level" graphics functions for specific purposes. To demonstrate a few, we first generate a dataset of normally distributed random numbers (mean 0, standard deviation 1), then we plot them and create a histogram. Here the function par(mfrow = c(2, 2)) divides the plotting area into 2 rows and 2 columns for showing 4 separate figures:

par(mfrow = c(2, 2))
x <- rnorm(100)
plot(x)
hist(x)

Now, we add a so-called normal probability plot and a second histogram with relative
frequencies together with the bell-shaped density curve of the standard normal
distribution. The optional argument probability = TRUE makes sure that the histogram
has the same scaling as the density function, so that both can be overlayed:

qqnorm(x)
qqline(x, col="red")
hist(x, probability = TRUE)
xx <- seq(-3, 3, 0.1)
lines(xx, dnorm(xx, 0, 1),col = "red")

Here it may also be a good opportunity to do a little summary statistics, e.g. mean(x), var(x), sd(x), range(x), summary(x), min(x), max(x), ...

Or we may consider testing whether the generated random numbers x really are normally distributed, by using the Shapiro-Wilk test:

x <- rnorm(100)
shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.99388, p-value = 0.9349

A p-value bigger than 0.05 tells us that the test has no objections against a normal distribution of the data. The concrete results may differ, because x contains random numbers, so it makes sense to repeat this several times. It can also be useful to compare these normally distributed random numbers generated with rnorm with uniformly distributed random numbers generated with runif:

par(mfrow=c(2,2))
y <- runif(100)
plot(y)
hist(y)
qqnorm(y)
qqline(y, col="red")
mean(y)
var(y)
min(y)
max(y)
hist(y, probability=TRUE)
yy <- seq(min(y), max(y), length = 50)
lines(yy, dnorm(yy, mean(y), sd(y)), col = "red")
shapiro.test(y)

At the end, we compare the pattern of both data sets with box-and-whisker plots:

par(mfrow=c(1, 1))
boxplot(x, y)

Loading data into R


Thus far, we've only been entering data directly into the interactive R console. For any data set of non-trivial size this is, obviously, an intractable solution. Fortunately for us, R has a robust suite of functions for reading data directly from external files. Go ahead and create a file on your hard disk called favorites.txt that looks like this:

flavor, number
pistachio,6
mint chocolate chip,7
vanilla,5
chocolate,10
strawberry ,2
neopolitan,4

This data represents the number of students in a class that prefer a particular flavor of soy
ice cream. We can read the file into a variable called favs as follows:

> favs <- read.table("favorites.txt", sep=",", header=TRUE)

If you get an error that there is no such file or directory, give R the full path name to your
data set or, alternatively, run the following command:

>favs <- read.table(file.choose(), sep=",", header=TRUE)

The preceding command brings up an open file dialog for letting you navigate to the file
you’ve just created.
The argument sep="," tells R that each data element in a row is separated by a comma.
Other common data formats have values separated by tabs and pipes ("|"). The value of
sep should then be "\t" and "|", respectively.
The argument header=TRUE tells R that the first row of the file should be
interpreted as the names of the columns. Remember, you can enter ?read.table at the
console to learn more about these options.
Reading from files in this comma-separated-values format (usually with the .csv file
extension) is so common that R has a more specific function just for it. The preceding
data import expression can be best written simply as:

>favs <- read.csv("favorites.txt")

Now, we have all the data in the file held in a variable of class data.frame. A data frame
can be thought of as a rectangular array of data that you might see in a spreadsheet
application. In this way, a data frame can also be thought of as a matrix; indeed, we
can use matrix-style indexing to extract elements from it. A data frame differs from a
matrix, though, in that a data frame may have columns of differing types. For
example, whereas a matrix would only allow one of these types, the data set we just
loaded contains character data in its first column, and numeric data in its second column.
Let’s check out what we have by using the head() command, which will show us the first
few lines of a data frame:

>head(favs)

               flavor number
1           pistachio      6
2 mint chocolate chip      7
3             vanilla      5
4           chocolate     10
5          strawberry      2
6          neopolitan      4

>class(favs)
[1]"data.frame"

>class(favs$flavor)
[1]"factor"

>class(favs$number)
[1]"numeric"

I lied, ok! So what?! Technically, flavor is a factor data type, not a character type. We
haven’t seen factors yet, but the idea behind them is really simple. Essentially, factors are
codings for categorical variables, which are variables that take on one of a finite
number of categories—think {"high", "medium", and "low"} or {"control",
"experimental"}.

Though factors are extremely useful in statistical modeling in R, the fact that R, by
default, automatically interprets a column from the data read from disk as a type factor if
it contains characters, is something that trips up novices and seasoned useRs alike.
Because of this, we will primarily prevent this behavior manually by adding the
stringsAsFactors optional keyword argument to the read.* commands:

>favs <- read.csv("favorites.txt", stringsAsFactors=FALSE)


> class(favs$flavor)
[1]"character"

Much better, for now! If you'd like to make this behavior the new default, read the ?options manual page. We can always convert to factors later on if we need to! If you haven't noticed already, I've snuck a new operator on you: $, the extract operator. This is the most popular way to extract attributes (or columns) from a data frame. You can also use double square brackets ([[ and ]]) to do this.
These are both in addition to the canonical matrix indexing option. The following three statements are thus, in this context, functionally identical:
> favs$flavor
[1] "pistachio" "mint chocolate chip" "vanilla"
[4] "chocolate" "strawberry" "neopolitan"

>favs[["flavor"]]
[1] "pistachio" "mint chocolate chip" "vanilla"
[4] "chocolate" "strawberry" "neopolitan"

>favs[,1]
[1] "pistachio" "mint chocolate chip" "vanilla"
[4] "chocolate" "strawberry" "neopolitan"

Note
Notice how R has now printed another number in square brackets—besides [1]—along
with our output. This is to show us that chocolate is the fourth element of the vector that
was returned from the extraction.
You can use the names() function to get a list of the columns available in a data frame.
You can even reassign names using the same:

> names(favs)
[1] "flavor" "number"
>names(favs)[1]<-"flav"
>names(favs)
[1] "flav" "number"

Lastly, we can get a compact display of the structure of a data frame by using the str()
function on it:
> str(favs)
'data.frame':   6 obs. of  2 variables:
 $ flav  : chr  "pistachio" "mint chocolate chip" "vanilla" "chocolate" ...
 $ number: num  6 7 5 10 2 4

Actually, you can use this function on any R structure—the property of functions that
change their behavior based on the type of input is called polymorphism.

The “Import Dataset” Wizard of R Studio


R Studio contains a nice feature that makes importing data more convenient. Essentially, this "Import Dataset" wizard helps us to construct the correct read.table or read.csv call interactively. The upper right window in Figure 1.5 shows the original input file, and the lower window indicates whether the data frame was correctly recognized. It is possible to try different options until a satisfying result is obtained. For this purpose:

1. From the menu select: File – Import Dataset – From CSV.
2. Select the requested file and choose suitable options, such as the name of the variable the data are to be assigned to, the delimiter character (comma or tab) and whether the first row of the file contains variable names.
3. Hint: The Code Preview contains the commands that the wizard created (see the sketch below). If you copy these commands to the script pane, you can re-read the data several times without going back to the menu system.
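For orientation, the generated commands typically look similar to the following sketch (the file and variable names here are examples; recent versions of R Studio use the readr package for this wizard):

library(readr)
dat <- read_csv("favorites.txt")
View(dat)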

Fig 1.5: Import Text Data wizard of R Studio.

Hint: Do not forget to set a Name for the resulting data frame (e.g. dat), otherwise R uses
the file name.

Debugging
For debugging purposes, i.e. if we suspect that something is wrong, it can be necessary to inspect the values of internal variables. In such cases it is possible to output internal variables with print or, as an alternative, to switch on the debug mode for a function with debug(twosam). This mode can be switched off with undebug(twosam).
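As a minimal sketch (twosam itself is not defined in this text, so a trivial stand-in function is used here), the debug mode works as follows:

f <- function(a, b) {   # stand-in for a function such as twosam
   s <- a + b
   print(s)             # inspecting an internal value with print
   s
}
debug(f)                # the next call to f() runs step by step in the browser
f(1, 2)
undebug(f)              # switch the debug mode off again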

R contains many additional possibilities, e.g. the usage of optional named arguments with default values, the ellipsis (the ...-argument, three dots), closures, object orientation, and linking to C or Fortran routines. Details about this can be found in "An Introduction to R" and "Writing R Extensions", which are both part of the official R documentation.

The principle of parsimony


The principle of parsimony, sometimes also called "Occam's razor", is attributed to an English philosopher from the 14th century who stated that "simpler explanations are, other things being equal, generally better than more complex ones". In relation to statistical analysis and modeling, this implies that:
• “models should have as few parameters as possible
• linear models should be preferred to non-linear models
• experiments relying on few assumptions should be preferred to those relying on many
• models should be pared down until they are minimal adequate
• simple explanations should be preferred to complex explanation”

This principle is one of the most important fundamentals, not only in statistics but also in
science in general. However, over-simplification has to be avoided as well, especially in
complex fields like ecology.

 Probability

According to the classical definition, probability p is the chance of a specific event, i.e. the number of outcomes we are interested in divided by the total number of possible outcomes. In a die roll, for example, the probability of any particular number is 1/6, because the die has 6 sides and each number occurs only once.

Unfortunately, this classical definition is not applicable to non-countable populations, because either the denominator or the numerator (or both) may be undefined or infinite. What is, for example, the probability that the height of a tree is exactly 12.45 m?

To solve this, mathematicians use an axiomatic definition of probability:

Axiom I: 0 ≤ p ≤ 1
Axiom II: impossible events have p = 0, certain events have p = 1
Axiom III: for mutually exclusive events A and B, i.e. A ∩ B = ∅ in set theory:
p(A ∪ B) = p(A) + p(B)

Sample and Population


The objects from which we have measurements or observations form a sample. In
contrast to this, a population is the set of all objects that had the same chance to become
part of the sample. This means that the population is defined by the way samples are
taken, i.e. by how representative our sample is for our (intended) observation object. In
principle, there are two different sampling strategies:

Random sampling means that the individual objects for measurement are selected at
random from the population (examples: random selection of sampling sites by means of
numbered grid cells; random placement of experimental units in a climate chamber;
random order of treatments in time).

Stratified sampling requires that the population is divided into sub-populations (strata).
These are sampled separately (at random!) and the population characteristics are
estimated using weighted mean values. It is therefore essential to have valid
information about the size of the strata to derive the weighting coefficients. Examples are
election forecasts, the volumetric mean for a water body derived from measurements of
different layers, or the gut content of fish estimated from size classes. The advantage of
stratified sampling is that it yields better estimates from smaller samples, but this holds
only if the weighting coefficients are correct!

By convention, the “true” but unknown parameters of a population are symbolized with
Greek letters (µ, σ, γ, α, β); the calculated parameters (x̄, s, r², ...) are called “estimates”. A
single value xi of a random variable X can also be assumed to consist of an expectation
value E(X) of the random variable and an individual error εi. The expectation value (e.g. a
mean) of this error term is zero:

xi = E(X) + εi (3.1)

E(ε) = 0 (3.2)

Statistical Tests
Statistical tests are employed for testing hypotheses about data, e.g. specific properties of
distributions or their parameters. The basic idea is to estimate probabilities for a
hypothesis about the population from a sample.

Effect size and significance


In statistical testing, significance has to be clearly distinguished from effect size. The
effect size ∆ measures the size of an observed effect, like the difference between two mean
values (x̄1 − x̄2) or the size of a correlation coefficient (r²). Even more important is the
relative effect size, the ratio between an effect and the random error, or the signal-to-noise
ratio, which is often represented as the ratio between the effect (e.g. the difference between
mean values) and the standard deviation:
δ = (µ1 − µ2)/σ = ∆/σ
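As a small hedged sketch (the samples and values here are invented for illustration), the
relative effect size of two simulated samples could be estimated like this:

x1 <- rnorm(20, mean = 5.0, sd = 1)
x2 <- rnorm(20, mean = 5.7, sd = 1)
# relative effect size: difference of means divided by a pooled standard deviation
(mean(x1) - mean(x2)) / sqrt((var(x1) + var(x2)) / 2)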

In contrast to this, significance means that an effect really exists with a certain probability
and that it is unlikely to be the result of random fluctuations alone.
For testing whether an effect is significant or not, it is important to formulate clear
hypotheses in statistical language: a so-called null hypothesis (H0) and one or
more alternative hypotheses (Ha):

H0: The null hypothesis that two populations are not different with respect to a certain
parameter or property. It is assumed that an observed effect is purely random and that the
true effect is zero.
Ha: The alternative hypothesis (experimental hypothesis) claims that a specific effect
exists. An alternative hypothesis is never completely true or “proven”. The acceptance of
Ha means only that H0 is unlikely.

Table 1.1: α and β errors

                        Reality
Decision of the test    H0 true                   H0 false
H0 retained             correct decision (1−α)    β error
H0 rejected             α error                   correct decision (power 1−β)

Statistical tests can only test for differences between samples, not for equality. This means
that H0 is formulated to be rejected and that it is, for fundamental reasons, impossible to
test whether two populations are equal. If a test cannot reject “the null”, it means only that
an observed effect can also be explained as a result of random variability or error, and that
a potential effect was too small to be detected as “significant”, given the available sample
size.

Note: Not significant does not necessarily mean “no effect”; it means “no effect or
sample size too small”.

Whether an effect is significant or not is determined by comparing the p-value of a test
with a pre-defined critical value, the significance level α (or probability of error). Here,
the p-value estimates the probability of observing an effect at least as large as the one
found if the null hypothesis were true, while α is the rate of false positives, i.e. wrong
rejections of H0, that we are willing to tolerate.

To sum up, there are two possibilities for wrong decisions (cf. Table 1.1):
1. H0 falsely rejected (error of the first kind or α error),
2. H0 falsely retained (error of the second kind or β error).

It is common convention in many sciences to use α = 0.05 as the critical value, but any
other small value (e.g. 0.01 or 0.1) would be possible as well. The essential thing, however,
is that this value has to be defined before doing the test and must not be adjusted
afterwards.

In contrast to this, β is often unknown. It depends on the relative effect size, the
sample size and the power of the chosen test. It can be estimated before an experiment by
using power analysis methods or, even better, β is set to a certain value (e.g. 0.2) and
power analysis is then used to determine the required sample size.
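For example, base R's power.t.test() can perform such a power analysis; a minimal
sketch, assuming a relative effect size δ = 0.5:

power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
# n is left unspecified, so R returns the required sample size per group (about 64)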

The model selection approach


Significance testing is one of the most important concepts in statistics, but it is not the
only one. Instead of testing whether a certain effect is significant or not, one can also
test which of several candidate models is more appropriate to explain a given dataset.
This is a direct application of the principle of parsimony, but there are essentially two
questions:

1. What is the best model?
2. What is the most parsimonious model?

We may be tempted to answer the first question by just using the model with the best fit,
for example with the lowest mean square error or the highest R², but such an approach
would lead to very complex, overcomplicated models, because a complex model with
many parameters has much more flexibility to fit a complicated dataset. Therefore, what
we need is not the model that maximally fits the data, but the model with the best
compromise between goodness of fit and model complexity, i.e. the smallest model that
fits the data reasonably well.

This approach is called the model selection technique, an essential concept of modern
statistics that sometimes supersedes p-value based testing.
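A minimal sketch of this idea (the data and candidate models are invented for
illustration): fit two models and compare them with AIC, which penalizes additional
parameters, so the smaller value indicates the better fit/complexity compromise.

set.seed(42)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 3) # data from a simple linear relationship
m1 <- lm(y ~ x)                      # simple candidate model
m2 <- lm(y ~ poly(x, 5))             # more complex candidate model
AIC(m1, m2) # m1 should win: similar fit with fewer parameters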

Application in R

All common measures of variation can be easily calculated in R:

1. Variance
2. Standard deviation
3. Range
4. Quartiles
5. Interquartile range
6. The same, calculated from quartiles
7. Median absolute deviation

Example:

x <- rnorm(100, mean=50, sd=10) # 100 random numbers

var(x) # variance
[1] 95.75608

sd(x) # standard deviation
[1] 9.785504

range(x) # range
[1] 25.02415 75.79907

quantile(x, c(0.25, 0.75)) # quartiles
     25%      75%
44.75210 56.65558

IQR(x) # interquartile range
[1] 11.90348

diff(quantile(x, c(0.25, 0.75))) # same, calculated from quartiles
     75%
11.90348

mad(x) # median absolute deviation
[1] 8.627367

How to install R Packages?

The sheer power of R lies in its incredible packages. In R, most data handling tasks can
be performed in two ways: using R packages or using R base functions. In this tutorial, I'll
also introduce you to the most handy and powerful R packages. To install a package,
simply type:

install.packages("package name")

As a first-time user, a pop-up might appear asking you to select your CRAN mirror
(country server); choose accordingly and press OK.

Note: You can type this either directly in the console and press ‘Enter’, or in an R script
and click ‘Run’.

Useful R Packages

Out of the ~7800 packages listed on CRAN, I've listed some of the most powerful and
commonly used packages in predictive modeling in this article. Since I have already
explained the method of installing packages, you can go ahead and install them now;
sooner or later you'll need them.

Importing Data: R offers a wide range of packages for importing data available in almost
any format, such as .txt, .csv, .json, .sql, etc. To import large files of data quickly, it is
advisable to install and use data.table, readr, RMySQL, sqldf or jsonlite.
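For example, a quick sketch of a fast import with data.table (the file name data.csv is
hypothetical):

> library(data.table)
> dt <- fread("data.csv") # fast import with automatic type detection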

Data Visualization: R has built-in plotting commands as well. They are good for creating
simple graphs, but become cumbersome when it comes to creating advanced graphics.
Hence, you should install ggplot2.

Data Manipulation: R has a fantastic collection of packages for data manipulation.
These packages allow you to perform basic and advanced computations quickly. Among
them are dplyr, plyr, tidyr, lubridate and stringr.

Modeling / Machine Learning: For modeling, the caret package in R is powerful
enough to cater to every need for creating machine learning models. However, you can
also install packages by algorithm, such as randomForest, rpart, gbm, etc.

Practice Assignment: As a part of this assignment, install the ‘swirl’ package in R.
Then type library(swirl) to initiate the package and complete its interactive
R tutorial. If you have followed this article thoroughly, this assignment should be an easy
task for you!

Exploratory Data Analysis in R

Data Exploration is a crucial stage of predictive modeling. You can't build great and
practical models unless you learn to explore the data from beginning to end. This stage
forms a concrete foundation for data manipulation (the very next stage). Let's understand
it in R.

I have taken the data set from Big Mart Sales Prediction. Before we start, you must get
familiar with these terms:

Response Variable (a.k.a. Dependent Variable): In a data set, the response variable (y) is
the one on which we make predictions. In this case, we'll predict ‘Item_Outlet_Sales’.

Predictor Variable (a.k.a. Independent Variable): In a data set, the predictor variables (Xi)
are those from which the prediction of the response variable is made.

Train Data: The predictive model is always built on the train data set. An intuitive way to
identify the train data is that it always has the ‘response variable’ included.

Test Data: Once the model is built, its accuracy is ‘tested’ on the test data. This data set
always contains fewer observations than the train data set and does not include the
‘response variable’.

Right now, you should download the data set. Take a good look at the train and test data,
cross-check the information shared above, and then proceed.

Let’s now begin with importing and exploring data.

# working directory
path <- ".../Data/BigMartSales"

# set working directory
setwd(path)

As a beginner, keep the train and test files in your working directory to avoid unnecessary
directory troubles. Once the directory is set, we can easily import the .csv files using
the commands below.

#Load Datasets
train <- read.csv("Train_UWu5bXk.csv")
test <- read.csv("Test_u94Q5KV.csv")

In fact, even prior to loading data in R, it’s a good practice to look at the data in Excel.
This helps in strategizing the complete prediction modeling process. To check if the data
set has been loaded successfully, look at R environment. The data can be seen there. Let’s
explore the data quickly.

# check dimensions (number of rows & columns) in the data sets
> dim(train)
[1] 8523 12

> dim(test)
[1] 5681 11

We have 8523 rows and 12 columns in the train data set and 5681 rows and 11 columns in
the test data set. This makes sense: the test data should have one column less (as
mentioned above). Let's dig deeper into the train data set now.

# check the variables and their types in train
> str(train)
'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298
759 697 739 441 991 ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8
3 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002
2007 ...
$ Outlet_Size : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
$ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...

Let’s do some quick data exploration.

To begin with, first check if this data has missing values. This can be done by using:

> table(is.na(train))

FALSE TRUE
100813 1463

In the train data set, we have 1463 missing values. Let's check the variables in which these
values are missing. It's important to find and locate these missing values; many data
scientists have repeatedly advised beginners to pay close attention to missing values in the
data exploration stage.

> colSums(is.na(train))
Item_Identifier Item_Weight
0 1463
Item_Fat_Content Item_Visibility
0 0
Item_Type Item_MRP
0 0
Outlet_Identifier Outlet_Establishment_Year
0 0
Outlet_Size Outlet_Location_Type
0 0
Outlet_Type Item_Outlet_Sales
0 0

Hence, we see that the column Item_Weight has 1463 missing values. Let's draw more
inferences from this data.

> summary(train)

Here are some quick inferences drawn from variables in train data set:

1. Item_Fat_Content has mis-matched factor levels.

2. The minimum value of Item_Visibility is 0. Practically, this is not possible: if an item
occupies shelf space in a grocery store, it ought to have some visibility. We'll treat
all 0's as missing values.

3. Item_Weight has 1463 missing values (already explained above).

4. Outlet_Size has an unnamed (empty) factor level.

These inferences will help us treat these variables more accurately.

Graphical Representation of Variables

I'm sure you will understand these variables better when they are explained visually.
Using graphs, we can analyze the data in two ways: univariate analysis and bivariate
analysis.

Univariate analysis is done with one variable; bivariate analysis is done with two
variables. Univariate analysis is a lot easier to do. Let's now experiment with bivariate
analysis and carve out hidden insights.

For visualization, I'll use the ggplot2 package. These graphs will help us understand the
distribution and frequency of variables in the data set.
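Assuming ggplot2 has already been installed as described earlier, load it once per session
before plotting:

> library(ggplot2)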

> ggplot(train, aes(x = Item_Visibility, y = Item_Outlet_Sales)) +
    geom_point(size = 2.5, color = "navy") +
    xlab("Item Visibility") + ylab("Item Outlet Sales") +
    ggtitle("Item Visibility vs Item Outlet Sales")

We can see that the majority of sales has been obtained from products having visibility
less than 0.2. This suggests that Item_Visibility < 0.2 must be an important factor in
determining sales. Let's plot a few more interesting graphs and explore such hidden stories.

> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) +
    geom_bar(stat = "identity", color = "purple") +
    theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) +
    ggtitle("Outlets vs Total Sales") + theme_bw()

Here, we infer that OUT027 has contributed the majority of sales, followed by OUT035.
OUT010 and OUT019 have probably the least footfall, thereby contributing the least to
outlet sales.

> ggplot(train, aes(Item_Type, Item_Outlet_Sales)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "navy")) +
    xlab("Item Type") + ylab("Item Outlet Sales") + ggtitle("Item Type vs Sales")

From this graph, we can infer that Fruits and Vegetables contribute the highest amount
of outlet sales, followed by snack foods and household products. This information can also
be represented using a box plot. The benefit of a box plot is that you get to see the
outliers and the spread of the corresponding levels of a variable (shown below).

> ggplot(train, aes(Item_Type, Item_MRP)) + geom_boxplot() +
    theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) +
    xlab("Item Type") + ylab("Item MRP") + ggtitle("Item Type vs Item MRP")

The black points you see are outliers. The mid line in each box is the median value of the
corresponding item type.

Now we have an idea of the variables and their importance for the response variable. Let's
move back to where we started: missing values. Now we'll impute them.

We saw that the variable Item_Weight has missing values. Item_Weight is a continuous
variable, hence in this case we can impute the missing values with the mean or median of
Item_Weight. These are the most commonly used methods of imputing missing values.

Let's first combine the data sets. This will save time, as we don't need to write separate
code for the train and test data sets. To combine the two data frames, we must make sure
that they have the same columns, which is not the case here.

> dim(train)
[1] 8523 12

> dim(test)
[1] 5681 11

The test data set has one column less (the response variable). Let's first add that column.
We can give this column any value; an intuitive approach would be to extract the mean
value of sales from the train data set and use it as a placeholder for the test variable
Item_Outlet_Sales. Anyway, let's keep it simple for now: I've used the value 1. Now we'll
combine the data sets.

> test$Item_Outlet_Sales <- 1
> combi <- rbind(train, test)

We impute the missing values with the median. I'm using the median because it is known
to be highly robust to outliers (a tiny illustration of this follows the imputation below).
Moreover, for this problem our evaluation metric is RMSE, which is itself highly affected
by outliers; hence the median is the better choice here.

> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight,
    na.rm = TRUE)
> table(is.na(combi$Item_Weight))

FALSE
14204
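As promised, here is a small sketch (with made-up numbers) of why the median resists
outliers while the mean does not:

> x <- c(10, 11, 12, 13, 500) # one extreme outlier
> mean(x) # pulled strongly towards the outlier
[1] 109.2
> median(x) # barely affected
[1] 12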

Trouble with Continuous and Categorical Variables

It's important to learn to deal with continuous and categorical variables separately in a
data set; in other words, they need special attention. In this data set we have only 3
continuous variables and the rest are categorical in nature. If you are still confused, I
suggest you once again look at the data set using str() and then proceed.

Let's take up Item_Visibility. In the graph above, we saw that item visibility also has the
value zero, which is practically not feasible. Hence, we'll consider it a missing value and
once again impute it with the median.

> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0,
    median(combi$Item_Visibility), combi$Item_Visibility)

Let's proceed to the categorical variables now. During exploration, we saw that there are
mis-matched levels in some variables which need to be corrected.

> levels(combi$Outlet_Size)[1] <- "Other"
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content,
    c("LF" = "Low Fat", "reg" = "Regular"))
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content,
    c("low fat" = "Low Fat"))
> table(combi$Item_Fat_Content)
Low Fat Regular
9185 5019

Using the commands above, I've assigned the name ‘Other’ to the unnamed level in the
Outlet_Size variable, and simply renamed the various levels of Item_Fat_Content.

Data Manipulation in R

Let's call this the advanced level of data exploration. We will learn about feature
engineering and other useful aspects.

Feature Engineering:

This component separates an intelligent data scientist from a merely technically enabled
one. You might have access to large machines to run heavy computations and
algorithms, but the power delivered by new features just can't be matched. We create
new variables to extract and provide as much ‘new’ information to the model as possible,
to help it make accurate predictions.

1. Count of Outlet Identifiers –

There are 10 unique outlets in this data. This variable gives us the count of each outlet in
the data set. The higher the count of an outlet, the higher the sales contributed by it are
likely to be.

> library(dplyr)
> a <- combi%>%
group_by(Outlet_Identifier)%>%
tally()

> head(a)
Source: local data frame [6 x 2]

  Outlet_Identifier     n
             (fctr) (int)
1            OUT010   925
2            OUT013  1553
3            OUT017  1543
4            OUT018  1546
5            OUT019   880
6            OUT027  1559

> names(a)[2] <- "Outlet_Count"
> combi <- full_join(a, combi, by = "Outlet_Identifier")

As you can see, the dplyr package makes data manipulation quite effortless; you no longer
need to write long functions. In the code above, I've simply stored the new data frame in a
variable a. Later, the new column Outlet_Count is added to our original ‘combi’ data set.

2. Count of Item Identifiers –

Similarly, we can compute the count of item identifiers too. It's a good practice to fetch
more information from unique ID variables using their counts. This will help us understand
which items occur most frequently.

> b <- combi %>%
    group_by(Item_Identifier) %>%
    tally()

> names(b)[2] <- "Item_Count"
> head(b)
  Item_Identifier Item_Count
           (fctr)      (int)
1           DRA12          9
2           DRA24         10
3           DRA59         10
4           DRB01          8
5           DRB13          9
6           DRB24          8

> combi <- merge(b, combi, by = "Item_Identifier")

3. Outlet Years –

This variable represents how long a particular outlet has existed, counted from the year
2013. Why 2013? You'll find the answer in the problem statement. My hypothesis is: the
older the outlet, the greater the footfall, the larger the base of loyal customers, and the
larger the outlet sales.

> c <- combi %>%
    select(Outlet_Establishment_Year) %>%
    mutate(Outlet_Year = 2013 - Outlet_Establishment_Year)
> head(c)
Outlet_Establishment_Year Outlet_Year
1 1999 14
2 2009 4
3 1999 14
4 1998 15
5 1987 26
6 2009 4

> combi <- full_join(c, combi)

This suggests that outlets established in 1999 were 14 years old in 2013 and so on.

4. Item Type New –

Now, pay attention to the Item_Identifiers. We are about to discover a new trend: there is
a pattern in the identifiers, which start with “FD”, “DR” or “NC”. Now check the
corresponding Item_Types of these identifiers in the data set. You'll discover that items
corresponding to “FD” are mostly food items, items corresponding to “DR” are drinks,
and items corresponding to “NC” are products which can't be consumed; let's call them
non-consumable. Let's extract this pattern into a new variable.

Here I will use the substr() and gsub() functions to extract and rename the categories,
respectively.

> q <- substr(combi$Item_Identifier, 1, 2)
> q <- gsub("FD", "Food", q)
> q <- gsub("DR", "Drinks", q)
> q <- gsub("NC", "Non-Consumable", q)
> table(q)
Drinks Food Non-Consumable
1317 10201 2686

Let's now add this information to our data set as a variable named ‘Item_Type_New’.

> combi$Item_Type_New <- q

Linear regression

Regression analysis aims to describe the dependency between two variables (simple
regression) or several variables (multiple regression) by means of a linear or nonlinear
function. In the case of not only multiple independent variables (x-variables) but also
multiple dependent variables (y-variables), this is referred to as multivariate regression. In
contrast to correlation analysis, which tests for the existence of a relationship, regression
analysis focuses on the quantitative description of this relationship by means of a
statistical model.
Besides that, there are also differences in the underlying statistical assumptions. While
correlation analysis assumes that all variables are random and contain errors (model II),
regression analysis generally makes a distinction between independent variables (without
error) and dependent variables (with errors), which is called model I.

Assumptions of the linear regression

Only if the following prerequisites are met can the parameters (a, b) be estimated without
bias and the significance tests be considered reliable.

1. Model I is assumed (x is defined, y is a random variable).
2. For every value of x, y is a random variable with mean µ(y|x) and variance σ²(y|x).
3. The y values are independent and identically distributed (no autocorrelation,
homogeneous variance σ²).
4. For multiple regression, the x_j must not be correlated with each other (multicollinearity
condition).
5. The residuals ε and the y values need to be normally distributed.

In this context it is especially important that the variance of the residuals is homogeneous
along the whole range of x, i.e. the variation of the residuals must not increase or decrease
or show any systematic pattern.

Implementation in R

Model formula in R
R has a special formula syntax for describing statistical models. A simple linear model (y
versus x) can be described as:

y ~ x + 1

or shorter, because “+ 1” can be omitted:

y ~ x

In contrast to the former, “- 1” means a regression model with zero intercept:

y ~ x - 1
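A few further formula variants may be useful to know (a sketch of common cases, not
needed for the example below):

y ~ x1 + x2 # multiple regression with two independent variables
y ~ x + I(x^2) # quadratic term, protected by I()
log(y) ~ x # transformed dependent variable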
A simple example: consider the following data set.

cherry <- read.csv("H:\\cherry.csv", header = TRUE)

First, we plot the data with:

plot(cherry$Height, cherry$Volume)

Then, a linear model can be fitted with the linear model function lm():

cherry.lm <- lm(Volume ~ Height, data = cherry)

The R object delivered by lm (called cherry.lm in our case) now contains the complete
results of the regression, from which significance tests, residuals and further information
can be extracted by using specific R functions. Thus,

summary(cherry.lm)

Output:

Call:

lm(formula = Volume ~ Height, data = cherry)

Residuals:
Min 1Q Median 3Q Max
-21.274 -9.894 -2.894 12.067 29.852

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.1236 29.2731 -2.976 0.005835 **
Height 1.5433 0.3839 4.021 0.000378 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 29 degrees of freedom


Multiple R-Squared: 0.3579, Adjusted R-squared: 0.3358
F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784

gives us the most important results of the regression, e.g. the coefficients, with the
intercept being equivalent to parameter a and the dependence on Height being equivalent
to the slope parameter b. The coefficient of determination is reported as Multiple
R-squared. Besides that, standard errors of the parameters and significance levels are
printed. The regression line can be added to the plot with R's universal line drawing
function abline, which directly accepts a linear model object (cherry.lm) as its argument:

abline(cherry.lm)

Plot the data and the fitted line:

par(mfrow=c(1,1))
plot(cherry$Height, cherry$Volume)
abline(cherry.lm)
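To check the assumptions listed above visually, the fitted model object can also be passed
to plot(), which produces the standard diagnostic plots (a common sketch):

par(mfrow = c(2, 2))
plot(cherry.lm) # residuals vs fitted, normal Q-Q, scale-location, leverage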

Classical tests
Sophisticated methods such as generalized mixed models or labor-intensive multivariate
analyses are not always necessary. In many cases so-called classical tests are sufficient.
The principle of parsimony applies here too: the simplest suitable method should be
preferred.

One sample t-Test

We can test for differences between mean values with the help of the classical t-test. The
one-sample t-test tests whether a sample comes from a population with a given mean
value µ:

x <- rnorm(10, 5, 1) # normally distributed sample with mu=5, sd=1
t.test(x, mu=5) # does x come from a population with mu=5?

Output:
One Sample t-test

data: x
t = -1.116, df = 9, p-value = 0.2933
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.404882 5.201915
sample estimates:
mean of x
4.803398

Two sample t-Test

In the two-sample t-test, two independent samples are compared:

x1 <- rnorm(12, 5, 1)
x2 <- rnorm(12, 5.7, 1)
t.test(x1, x2)

Output:
Welch Two Sample t-test
data: x1 and x2
t = -2.5904, df = 17.393, p-value = 0.01882
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.8064299 -0.1862296
sample estimates:
mean of x mean of y
4.683433 5.679763
This means that both samples differ significantly (p < 0.05). It should be mentioned that R
did not perform the “normal” t-test but the Welch test (also termed heteroscedastic t-test),
for which the variances of the two samples are not required to be identical.

The classical procedure would be as follows:
1. Perform a check for the identity of both variances with var.test beforehand:

var.test(x1, x2) # F-Test

F test to compare two variances


data: x1 and x2
F = 3.1211, num df = 11, denom df = 11, p-value = 0.0719
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8984895 10.8417001
sample estimates:
ratio of variances
3.121082

2. If p < 0.05, the variances are significantly different, so the Welch test (see above)
needs to be used.

3. If p > 0.05, the variances are likely to be equal and the “normal” t-test can be applied:

t.test(x1, x2, var.equal=TRUE) # classical t-test

Two Sample t-test


data: x1 and x2
t = -2.5904, df = 22, p-value = 0.0167
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.793999 -0.198661
sample estimates:
mean of x mean of y
4.683433 5.679763

Paired t-Test

If the data are paired (e.g. an allergy test on the left and right arm) the paired t-test
is used. Its advantage is that the influence of individual differences (i.e. a covariate) is
eliminated. The downside is that the number of degrees of freedom is reduced. But if the
data are really paired, it is definitely advantageous to take that information into account:
x1 <- c(2, 3, 4, 5, 6)
x2 <- c(3, 4, 7, 6, 8)
t.test(x1, x2, var.equal=TRUE) # p=0.20 not significant
Two Sample t-test
data: x1 and x2
t = -1.372, df = 8, p-value = 0.2073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.28924 1.08924
sample estimates:
mean of x mean of y
4.0 5.6

t.test(x1, x2, paired=TRUE) # p=0.016 significant

Paired t-test
data: x1 and x2
t = -4, df = 4, p-value = 0.01613
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.710578 -0.489422
sample estimates:
mean of the differences
-1.6

It can be seen that the paired t-test has a greater discriminatory power.

Conclusion

R programming for data science is not that complex, and the reasons for its popularity are
its ease of use and its free availability. In order to learn data analytics with R, however, it
is important to study the software in detail, learn the different commands and structures
that R offers, and then apply them to analyze data effectively. To help familiarize you
with R, we have described the basics of data analytics with R and prepared some tips to
help you study R for data analytics. R allows statisticians to perform the most complicated
and intricate analyses without getting lost in too many details. With so many benefits for
data science, R has steadily gained popularity among big data professionals. According to
a 2014 survey, R is one of the most powerful and popular programming languages used
by data scientists today.

We have studied the introduction to R programming in detail. It should now be clear that
R is a popular and capable option, as it is open source, integrates with several other
programming languages, and offers broad capabilities and availability. Still, if you have
any doubts about this R tutorial, please comment.
We hope you liked this introduction to R programming.

References

1. Thomas Petzoldt, "Data Analysis with R: Selected Topics and Examples", October 21,
2018.
2. Tony Fischetti, "Data Analysis with R: Load, wrangle and analyze your data using the
world's most powerful statistical programming language".
3. Ross Ihaka, "R: Past and Future History", a draft of a paper for Interface '98, Statistics
Department, The University of Auckland, Auckland, New Zealand.
4. Edward Boone, "Introduction to Data Analysis using R", Department of Statistical
Sciences and Operations Research, Computation Seminar Series.
5. http://blog.sollers.edu/data-science/importance-of-r-language-for-data-science
6. https://data-flair.training/blogs/r-tutorial/
7. https://www.analyticsvidhya.com/blog/2016/02/complete-tutorial-learn-data-
science-scratch/
