
Department of Statistics

UNIT 1

Overview of R, R data types and objects, reading and writing data, Essentials of R
language, Running R, Packages in R, Variable names and assignment, Operators,
Integers, Factors, Logical operations, Operations on Scalars, Vectors, Lists, Arrays,
Matrices, Data frames, Control structures and functions

Basic Syntax
z<-"Good morning"
To print
print(z)
output:
[1] "Good morning"
#To compare two objects
x="raju"
y="RAJU"
z="RAJU"
x==y
output:
[1] FALSE
x==z
output:
[1] FALSE
y==z
output:
[1] TRUE
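Because == compares strings character by character, letter case matters. One common way to compare while ignoring case is to convert both sides with tolower() first. A minimal sketch reusing the values above (the result variable names are only illustrative):

```r
x <- "raju"
y <- "RAJU"

# Exact comparison: case matters, so this is FALSE
same_exact <- (x == y)

# Convert both strings to lower case before comparing
same_ignoring_case <- (tolower(x) == tolower(y))

print(same_exact)
print(same_ignoring_case)
```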
#To clear the environment window
rm(list = ls())
#R as calculator
2+2
[1] 4
2*4
[1] 8
585*5
[1] 2925
2/2
[1] 1
Arithmetic operations
x=5
y=6
add = x+y
add
[1] 11
multiple=x*y
multiple
[1] 30
sub=x-y
sub
[1] -1
z=9
x*z
[1] 45
x^y
[1] 15625
#modulo operator, which gives us the remainder
x%%2
[1] 1
Relational Operators
x<y
[1] TRUE
x>y
[1] FALSE
x<=y
[1] TRUE
x>=y
[1] FALSE
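The results of relational operators can be combined with the logical operators & (and), | (or) and ! (not). A short sketch using the same x and y as above; the names both, either and negate are chosen here only for illustration:

```r
x <- 5
y <- 6

both   <- (x < y) & (x > 0)    # TRUE only if both comparisons are TRUE
either <- (x > y) | (x == 5)   # TRUE if at least one comparison is TRUE
negate <- !(x < y)             # reverses TRUE/FALSE

print(c(both, either, negate))
```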
Data Types

1. Numeric data type

# Assign a decimal value to x


x = 5.6

# print the class name of variable


print(class(x))

# print the type of variable


print(typeof(x))

Output:

[1] "numeric"
[1] "double"

2. Complex data type


x = 4 + 3i

# print the class name of x


print(class(x))

# print the type of x


print(typeof(x))

Output:
[1] "complex"
[1] "complex"
3. Character Data Type
fruit="apple"
print(class(fruit))
[1] "character"
4. Integer Data Type
x=350L
print(class(x))
[1] "integer"
ls() function

This built-in function is used to know all the present variables in the workspace.

Syntax:
ls()
rm() function
This is again a built-in function used to delete an unwanted variable within your workspace.

Find sum of numbers 4 to 6.
print(sum(4:6))

Find max of numbers 4 to 6.


print(max(4:6))

Find min of numbers 4 to 6.


print(min(4:6))

Vector
A vector is an ordered collection of elements of the same data type.

x = c(1, 3, 5, 7, 8)

# Printing those elements in console


print(x)

Output:
[1] 1 3 5 7 8
Subsetting a Vector
x <- c("a", "b", "c", "c", "d", "a")
x[1] # Extract the first element
[1] "a"
x[2] # Extract the second element
[1] "b"
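Besides single positions, a vector can be subset with a range of positions, a negative index, or a logical condition. The sketch below reuses the vector x from above; the result names are illustrative.

```r
x <- c("a", "b", "c", "c", "d", "a")

first_three   <- x[1:3]        # positions 1 to 3
without_first <- x[-1]         # a negative index drops that position
only_a        <- x[x == "a"]   # keep elements where the condition is TRUE

print(first_three)
print(without_first)
print(only_a)
```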
List
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(empId, empName, numberOfEmp)

print(empList)

Output:
[[1]]
[1] 1 2 3 4
[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"
[[3]]
[1] 4
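List components can also be given names, which makes them easier to extract. A minimal sketch, assuming the same employee data as above; the component names id, name and count are introduced here for illustration:

```r
empList <- list(
  id    = c(1, 2, 3, 4),
  name  = c("Debi", "Sandeep", "Subham", "Shiba"),
  count = 4
)

all_names       <- empList$name            # extract a component by name
second_name     <- empList[["name"]][2]    # [[ ]] returns the component itself
first_component <- empList[[1]]            # extract a component by position

print(second_name)
```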

Dataframes

Name = c("Amiya", "Raj", "Asish")


Language = c("R", "Python", "Java")
Age = c(22, 25, 45)

df = data.frame(Name, Language, Age)

print(df)

Output:
Name Language Age
1 Amiya R 22
2 Raj Python 25
3 Asish Java 45
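Once a data frame exists, individual columns and rows can be extracted from it. The following sketch rebuilds the data frame above and shows a few common accessors; the names ages, older and dims are illustrative.

```r
df <- data.frame(
  Name     = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age      = c(22, 25, 45)
)

ages  <- df$Age               # one column as a vector
older <- df[df$Age > 23, ]    # rows that satisfy a condition
dims  <- dim(df)              # number of rows and columns

str(df)                       # compact overview of the structure
print(older)
```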
Matrices
A = matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3, byrow = TRUE)
print(A)
Output:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
Maths=c(55,64,86)
Maths
[1] 55 64 86
#No of elements
length(Maths)
[1] 3
#To read the 2nd value from the 'Maths' vector
Maths[2]
[1] 64
Maths[2:3]
[1] 64 86
Arrays
A = array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2))
print(A)
Output:
,,1
[,1] [,2]
[1,] 1 3
[2,] 2 4
,,2
[,1] [,2]
[1,] 5 7
[2,] 6 8
Factors
fac = factor(c("Male", "Female", "Male",
"Male", "Female", "Male", "Female"))
print(fac)
Output:
[1] Male Female Male Male Female Male Female
Levels: Female Male
Sorting elements of a Vector
x <- c(8, 2, 7, 1, 11, 2)
A <- sort(x)
cat('ascending order', A, '\n')
B <- sort(x, decreasing = TRUE)
cat('descending order', B)
Output:
ascending order 1 2 2 7 8 11
descending order 11 8 7 2 2 1
Creating a vector by seq() function
V = seq(1, 40, by= 4)
# Printing the vector
print(V)
# Printing the fifth element of the vector
print(V[5])
Output:
[1] 1 5 9 13 17 21 25 29 33 37
[1] 17
To sort the numbers in the vector
x=c(5,-2,3,-7)
sort(x)
[1] -7 -2 3 5
sort(x,decreasing = T)
[1] 5 3 -2 -7
Reverse function
rev(x)
[1] -7 3 -2 5
Concatenate the Strings
u<-paste("abc","de","f")
u
[1] "abc de f"
a<-paste("abc","de","f",u) #Concatenate the Strings
a
[1] "abc de f abc de f"
Separating
w<-paste("abc","de","f",sep = "/") #Concatenate the Strings
w
[1] "abc/de/f"
Splitting
x<-strsplit(w,split = "/")
x
[[1]]
[1] "abc" "de"  "f"
Install packages
install.packages("package_name")
Example : install.packages("dplyr")
Read the data
dt<-read.csv("Loan_data_new.csv")
names(dt)
head(dt)
tail(dt)
View(dt)
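Writing data works the same way in reverse with write.csv(). Since the file Loan_data_new.csv is not available here, the sketch below uses a small made-up data frame and a temporary file (both are assumptions for illustration) to show a complete write-then-read round trip.

```r
# Illustrative data, not taken from the loan file above
scores <- data.frame(name = c("A", "B", "C"), marks = c(55, 64, 86))

# Write to a temporary CSV file; row.names = FALSE avoids an extra index column
path <- tempfile(fileext = ".csv")
write.csv(scores, path, row.names = FALSE)

# Read the file back and inspect it
back <- read.csv(path)
names(back)
head(back)
```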
To Know the Current working path
getwd()
To change the Current working path
setwd("E:/BBA R code")
Data Frames
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26) )
print(BMI)
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26

Strings

# R program to access
# characters in a string

# Accessing characters
# using substr() function
substr("Learn Code Tech", 1, 1)

# Create a string

string <- "Hello, World!"

# Replace "World" with "Universe"

string <- gsub("World", "Universe", string)

# Print the updated string

print(string)
Output
[1] "Hello, Universe!"

# Create two variables with values


x <- 42
y <- 3.14159

# Format a string with the two variable values


result <- sprintf("The answer is %d, and pi is %.2f.", x, y)

# Print the result


print(result)

[1] "The answer is 42, and pi is 3.14."

In this example, the %d format specifier inserts x as a whole number, and the %.2f format
specifier inserts the floating-point value y rounded to two decimal places. The formatted
string is saved in the variable result and then written to the console using the print
function. Running the code prints "The answer is 42, and pi is 3.14.", which is the format
string with the values of x and y substituted for the format specifiers.

#-----------------------------------------------------------------------------------------------------------
----

Loops

R if statement

The syntax of if statement is:

if (test_expression) {
statement
}

If the test_expression is TRUE, the statement gets executed. But if it's FALSE, nothing
happens.
Here, test_expression should be a single logical or numeric value. Older versions of R used
only the first element of a longer vector (with a warning); from R 4.2.0 onwards, a condition
of length greater than one is an error.
In the case of a numeric value, zero is taken as FALSE and any other number as TRUE.
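A small sketch of the if statement in action, including a numeric condition. The variables x, remaining and the messages are made up for illustration.

```r
x <- 7
msg <- "x is even"

# The block runs only when the condition is TRUE
if (x %% 2 == 1) {
  msg <- "x is odd"
}
print(msg)

# A numeric condition: 0 counts as FALSE, any other number as TRUE
remaining <- 3
status <- "nothing left"
if (remaining) {
  status <- "items left"
}
print(status)
```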

if else statement
The syntax of if else statement is:
if (test_expression) {
statement1
} else {
statement2
}

The else part is optional and is only evaluated if test_expression is FALSE.


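It is important to note that else must be on the same line as the closing brace of the if
block; otherwise, at the top level, R treats the if statement as complete and reports an
error when it reaches else.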

if else Ladder
The if else ladder (if..else..if) statement allows you to execute one block of code among
more than 2 alternatives.

The syntax of the if else ladder is:

if ( test_expression1) {
statement1
} else if ( test_expression2) {
statement2
} else if ( test_expression3) {
statement3
} else {
statement4
}

Only one statement will get executed depending upon the test_expressions.

#----------------------------------------------------------------------------------------------------------

Break

With the break statement, we can stop the loop before it has looped through all the items:

Example

Stop the loop at "cherry":

fruits <- list("apple", "banana", "cherry")

for (x in fruits) {
if (x == "cherry") {
break
}
print(x)
}

The loop will stop at "cherry" because we have chosen to finish the loop by using
the break statement when x is equal to "cherry" (x == "cherry").
#-----------------------------------------------------------------------------------------------------------
----

Next

With the next statement, we can skip an iteration without terminating the loop:

Example

Skip "banana":

fruits <- list("apple", "banana", "cherry")

for (x in fruits) {
if (x == "banana") {
next
}
print(x)
}

#----------------------------------------------------------------------------------------------------------

x=0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else
{print("Zero")
}

#-----------------------------------------------------------------------------------------------------------
----

x <- -10
# check if x is positive
if (x > 0) {
# check if x is even or odd
if (x %% 2 == 0) {
print("x is a positive even number")
} else {
print("x is a positive odd number")
}

# execute if x is not positive


} else {

# check if x is even or odd


if (x %% 2 == 0) {
print("x is a negative even number")
} else {
print("x is a negative odd number")
}
}

#-----------------------------------------------------------------------------------------------------------
----

for (val in 1: 5)
{
# statement
print(val)
}

#--------------------------------------------------------------------------------------------- --------------
----

val = 1
# using while loop
while (val <= 5)
{
# statements
print(val)
val = val + 1
}

#----------------------------------------------------------------------------------------------------------

# R program to illustrate
# application of while loop

# assigning value to the variable


# whose factorial will be calculated
n <- 5

# assigning the factorial variable


# and iteration variable to 1
factorial <- 1
i <- 1

# using while loop


while (i <= n)
{

# multiplying the factorial variable


# with the iteration variable
factorial = factorial * i

# incrementing the iteration variable


i=i+1
}
# displaying the factorial
print(factorial)

#----------------------------------------------------------------------------------------------------------

# R program to illustrate
# the application of repeat loop

# initializing the iteration variable with 0


i <- 0

# using repeat loop


repeat
{
# statement to be executed multiple times
print("Geeks 4 geeks!")

# incrementing the iteration variable


i=i+1

# checking the stop condition

if (i == 5)
{
# using break statement
# to terminate the loop
break
}
}

#-----------------------------------------------------------------------------------------------------------

# R program to illustrate
# the use of break statement

# using for loop


# to iterate over a sequence
for (val in 1: 5)
{
# checking condition
if (val == 3)
{
# using break keyword
break
}

# displaying items in the sequence


print(val)
}

#----------------------------------------------------------------------------------------------------------


# R program to access
# characters in a string

# Accessing characters
# using substr() function
a=substr("Learn Code Tech", 1, 2)
a

#_______________________________________________________________________

R Switch Statement

A switch statement is a selection control mechanism that allows the value of an expression to
change the control flow of program execution via map and search.

The switch statement is used in place of long if statements which compare a variable with
several integral values. It is a multi-way branch statement which provides an easy way to
dispatch execution for different parts of code. This code is based on the value of the expression.

This statement allows a variable to be tested for equality against a list of values. A switch
statement is a little bit complicated. To understand it, we have some key points which are as
follows:

o If the expression evaluates to a character string, it is matched against the names of the
listed cases.
o If there is more than one match, the first matching element is used.
o There is no default keyword; a trailing unnamed case acts as the default.
o If no case is matched, that unnamed case (if present) is used; otherwise the result is NULL.
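The matching and default rules listed above can be sketched as follows. The helper function pick() and its "No match" text are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: look up a greeting by a character key
pick <- function(key) {
  switch(key,
         "9"  = "Hello Arpita",
         "12" = "Hello Vaishali",
         "No match")   # trailing unnamed value: used when no name matches
}

matched  <- pick("12")   # the name "12" matches
fallback <- pick("99")   # no name matches, so the unnamed value is used

# With a numeric expression there is no default:
# an index outside the list of cases gives NULL
by_index <- switch(5, "a", "b", "c")

print(matched)
print(fallback)
print(is.null(by_index))
```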

1) Based on Index

If the cases are plain values, such as the elements of a character vector, and the
expression evaluates to a number, then the expression's result is used as an index to select
the case.

2) Based on Matching Value

When the cases have both case value and output value like ["case_1"="value1"], then the
expression value is matched against case values. If there is a match with the case, the
corresponding value is the output.
The basic syntax of the switch statement is as follows:

switch(expression, case1, case2, case3....)


x <- switch(

3,

"Shubham",

"Nishka",

"Gunjan",

"Sumit"

)

print(x)

#________________________________________________________________________

ax= 1

bx = 2

y = switch(

ax+bx,

"Hello, Shubham",

"Hello Arpita",

"Hello Vaishali",

"Hello Nishka"

)

print (y)

#________________________________________________________________________

y = "18"

x = switch(

y,

"9"="Hello Arpita",

"12"="Hello Vaishali",

"18"="Hello Nishka",

"21"="Hello Shubham"

)

print (x)

#_________________________________________________________________________

x= "2"

y="1"

a = switch(

paste(x,y,sep=""),

"9"="Hello Arpita",

"12"="Hello Vaishali",

"18"="Hello Nishka",

"21"="Hello Shubham"

)

print (a)

#________________________________________________________________________

R Packages

R packages are collections of R functions, sample data, and compiled code. In the R
environment, these packages are stored under a directory called "library." During installation,
R installs a set of packages. We can add packages later when they are needed for some specific
purpose. Only the default packages are available when we start the R console. Other
packages that are already installed must be loaded explicitly before an R program can use them.

The following commands are used to check, verify, and use R packages.
Check Available R Packages

To check the available R packages, we have to find the library location in which R packages
are contained. R provides the .libPaths() function to find the library locations.

.libPaths()

When the above code executes, it prints the library path or paths, which may vary depending
on the local settings of our PCs and laptops.

Getting the list of all the packages installed

R provides library() function, which allows us to get the list of all the installed packages.

library()

When we execute the above function, it produces the following result, which may vary
depending on the local settings of our PCs or laptops.

Packages in library 'C:/Program Files/R/R-3.6.1/library':

#_____________________________________________________________________

Like library() function, R provides search() function to get all packages currently loaded in the
R environment.

search()

#________________________________________________________________________
Install a New Package

In R, there are two techniques to add new R packages. The first technique is installing package
directly from the CRAN directory, and the second one is to install it manually after
downloading the package to our local system.

Install directly from CRAN

The following command is used to get the packages directly from CRAN webpage and install
the package in the R environment. We may be prompted to choose the nearest mirror. Choose
the one appropriate to our location.

install.packages("Package Name")

For example, the XML package is installed as follows:

install.packages("XML")

#__________________________________________________________________________
Load Package to Library

We cannot use a package in our code until it has been loaded into the current R
environment. This also applies to a package that was installed previously but is not yet
loaded in the current session.

There is the following command to load a package:

library("package Name", lib.loc = "path to library")

#---------------------------------------------------------------------------------------------------------

Data Reshaping in R Programming


Generally, in R Programming Language, data processing is done by taking data as an
input from a data frame where the data is organized into rows and columns. Data frames are
mostly used since extracting data is much simpler and hence easier. But sometimes we need
to reshape the format of the data frame from the one we receive. Hence, in R, we can split,
merge and reshape the data frame using various functions.

The various forms of reshaping data in a data frame are:

• Transpose of a Matrix
• Joining Rows and Columns
• Merging of Data Frames
• Melting and Casting
Why Is Data Reshaping in R Important?
The data obtained from an experiment or study is often not in the shape that an analysis or
an analytic function expects. The obtained data usually has one or more columns that
identify a row, followed by a number of columns that represent the measured values. The
columns that identify a row play the same role as a composite key of a table in a database.
Transpose of a Matrix
We can easily calculate the transpose of a matrix in R with the help of the t()
function. The t() function takes a matrix or data frame as input and gives the transpose of
that matrix or data frame as its output.
Syntax:
t(Matrix/ Data frame)
# R program to find the transpose of a matrix

#--------------------------------------------------------------------------------------------------------------

first <- matrix(c(1:12), nrow=4, byrow=TRUE)

print("Original Matrix")

first

first <- t(first)

print("Transpose of the Matrix")

first

#--------------------------------------------------------------------------------------------------------------
Joining Rows and Columns in Data Frame

In R, we can join two vectors or merge two data frames using functions. There are basically
two functions that perform these tasks:
cbind():
We can combine vectors, matrix or data frames by columns using cbind() function.
Syntax: cbind(x1, x2, x3)
where x1, x2 and x3 can be vectors or matrices or data frames.
rbind():
We can combine vectors, matrix or data frames by rows using rbind() function.
Syntax: rbind(x1, x2, x3)
where x1, x2 and x3 can be vectors or matrices or data frames.
#--------------------------------------------------------------------------------------------------------------

# Cbind and Rbind function in R

name <- c("Shaoni", "esha", "soumitra", "soumi")

age <- c(24, 53, 62, 29)

address <- c("puducherry", "kolkata", "delhi", "bangalore")

# Cbind function
info <- cbind(name, age, address)

print("Combining vectors into a character matrix using cbind ")

print(info)

# creating new data frame

newd <- data.frame(name=c("sounak", "bhabani"),

age=c("28", "87"),

address=c("bangalore", "kolkata"))

#--------------------------------------------------------------------------------------------------------------

# Rbind function

new.info <- rbind(info, newd)

print("Combining data frames using rbind ")

print(new.info)

#--------------------------------------------------------------------------------------------------------------
Merging two Data Frames

In R, we can merge two data frames using the merge() function. By default, merge() joins on
the columns that have the same names in both data frames, so the two data frames should share
at least one column name. We may also merge the two data frames based on a chosen key column.
Syntax: merge(dfA, dfB, …)

# Merging two data frames in R

d1 <- data.frame(name=c("shaoni", "soumi", "arjun"),

ID=c("111", "112", "113"))

d2 <- data.frame(name=c("sounak", "esha"),

ID=c("114", "115"))

total <- merge(d1, d2, all=TRUE)

print(total)
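Merging on a key column is more typical when the two data frames hold different information about the same items. The sketch below uses made-up marks (an assumption, not data from this document) to show an inner join and a left join with the by, all.x arguments of merge().

```r
# Illustrative data: the scores are invented for this sketch
students <- data.frame(ID = c(111, 112, 113),
                       name = c("shaoni", "soumi", "arjun"))
marks    <- data.frame(ID = c(111, 113, 114),
                       score = c(78, 91, 66))

# Inner join: only IDs present in both data frames
inner <- merge(students, marks, by = "ID")

# Left join: keep every student; missing scores become NA
left <- merge(students, marks, by = "ID", all.x = TRUE)

print(inner)
print(left)
```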

#--------------------------------------------------------------------------------------------------------------
Melting and Casting
Data reshaping can involve several steps to obtain the required format. A popular approach
is to melt the data, which converts each row into unique id-variable combinations, and then
to cast it into the desired shape. The two functions used for this process, melt() and
dcast(), are provided by the reshape2 package:
melt():
It is used to convert a data frame into a molten (long-format) data frame.
Syntax: melt(data, ..., na.rm=FALSE, value.name="value")
where,
data: data to be melted
... : further arguments passed to or from other methods
na.rm: whether NA values are removed (converts explicit missings into implicit missings)
value.name: name of the column used to store the values

dcast():
It is used to cast (and optionally aggregate) the molten data frame into a new form.
Syntax: dcast(data, formula, fun.aggregate)
where,
data: molten data to be cast
formula: formula that defines how to cast
fun.aggregate: aggregation function, used if there is a data aggregation
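To show the idea of melting and casting without installing an extra package, the sketch below uses base R substitutes: stack() to produce a long ("molten") layout and tapply() to aggregate it again. This is a swapped-in technique, not the melt()/dcast() functions themselves, and the data frame wide with its marks is invented for illustration.

```r
# Illustrative wide data: one row per student, one column per subject
wide <- data.frame(id = c(1, 2), math = c(90, 80), sci = c(70, 60))

# "Melt": stack the measurement columns into a value column (values)
# and a label column (ind), and repeat the id for each subject
long <- cbind(id = rep(wide$id, times = 2), stack(wide[c("math", "sci")]))
print(long)

# "Cast" with aggregation: mean value for each subject
means <- tapply(long$values, long$ind, mean)
print(means)
```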
UNIT 2

Computation of descriptive Statistics (Measures of Central tendencies and Dispersion,


Moments, Skewness and Kurtosis)

1. Compute descriptive statistics for the following data


65,81,72,59,71,53,85,66,66,70,72,71,79,76,77,68,65,73,64,72,82,73,77,75,80,85,89,74,
86,83,87,77,67,80,78,69,64,67,79,60,62,78,59,92,74,68,63,69,67,67,84,83,69,72,62,
74,73,68,74,65
Syntax

# Capturing the data in the object ‘x’

x<-c(65,81,72,59,71,53,85,66,66,70,72,71,79,76,77,68,65,73,64,
72,82,73,77,75,80,85,89,74,86,83,87,77,67,80,78,69,64,67,79,
60,62,78,59,92,74,68,63,69,67,67,84,83,69,72,62,74,73,68,74, 65)

Measures of Central Tendency

Arithmetic Mean

AM=mean(x)

AM

[1] 72.66667

cat("Arithmetic Mean=",AM)

Arithmetic Mean= 72.66667

Median

Md=median(x)

Md

[1] 72

cat("Median=",Md)

Median= 72

Mode

Mo=table(x)

Mode=names(Mo)[Mo==max(Mo)]

Mode

[1] "67" "72" "74"


cat("Mode=",Mode)

Mode= 67 72 74

Measures of Dispersion

Range

R=diff(range(x))

cat("Range=",R)

Range= 39

Quartile Deviation

# To compute First Quartile

Q1=quantile(x,0.25)

cat("First Quartile=",Q1)

First Quartile= 67

# To compute Third Quartile

Q3=quantile(x,0.75)

cat("Third Quartile=",Q3)

Third Quartile= 78.25

# To compute Quartile Deviation

QD=(Q3-Q1)/2

cat("Quartile Deviation=",QD)

Quartile Deviation= 5.625

Mean Deviation

n=length(x)

MD=sum(abs(x-AM))/n

cat("Mean Deviation=",MD)

Mean Deviation= 6.688889

Variance and Standard Deviation

V=var(x)

variance=((n-1)/n)*V
cat("Variance=",variance)

Variance= 67.62222

Sd=sqrt(variance)

cat("Standard Deviation=",Sd)

Standard Deviation= 8.223273
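The factor (n-1)/n is needed because var() in R divides by n - 1 (sample variance), while the population variance divides by n. The sketch below checks this on a small made-up vector v (an assumption for illustration, not the data above).

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data
n <- length(v)

sample_var <- var(v)                      # sum of squared deviations / (n - 1)
pop_var    <- sum((v - mean(v))^2) / n    # sum of squared deviations / n

# Rescaling the sample variance gives the population variance
rescaled <- ((n - 1) / n) * sample_var

print(c(sample_var, pop_var, rescaled))
```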

Quartiles

quartiles=quantile(x)

quartiles

0% 25% 50% 75% 100%

53.00 67.00 72.00 78.25 92.00

Summary

summary(x)

Min. 1st Qu. Median Mean 3rd Qu. Max.

53.00 67.00 72.00 72.67 78.25 92.00

Relative measures of dispersion

Coefficient of Range

CR=(max(x)-min(x))/(max(x)+min(x))

CR

[1] 0.2689655

Coefficient of Quartile Deviation

Q1=quantile(x,0.25)

cat("First Quartile=",Q1)

First Quartile= 67

Q3=quantile(x,0.75)

cat("Third Quartile=",Q3)

Third Quartile= 78.25

QD=(Q3-Q1)/2
cat("Quartile Deviation=",QD)

Quartile Deviation= 5.625

CQD=(Q3-Q1)/(Q3+Q1)

cat("Coefficient of Quartile Deviation=",CQD)

Coefficient of Quartile Deviation= 0.07745267

Coefficient of Mean Deviation

n=length(x)

AM=mean(x)

n=length(x)

MD=sum(abs(x- AM))/n

cat("Mean Deviation=",MD)

Mean Deviation= 6.688889

CMD=MD/AM

cat("Coefficient of Mean Deviation=",CMD)

Coefficient of Mean Deviation= 0.09204893

Variance and Standard Deviation

V=var(x)

variance=((n-1)/n)*V

cat("Variance=",variance)

Variance= 67.62222

Sd=sqrt(variance)

cat("Standard Deviation=",Sd)

Standard Deviation= 8.223273

Coefficient of variation (CV), used to compare two groups.

# Marks for Student A and Student B in different subjects

# Marks are stored in Stud_A and Stud_B

Stud_A=c(82,73,95,46,54,61)
Stud_B=c(41,84,66,75,93,82)

For Student A

#To compute mean

Stud_A_AM=mean(Stud_A)

# To compute Variance and Standard Deviation

# n must be the number of marks of this student, not the length of the earlier data

n=length(Stud_A)

V=var(Stud_A)

variance=((n-1)/n)*V

cat("Variance=",variance)

Variance= 279.5833

Sd=sqrt(variance)

cat("Standard Deviation=",Sd)

Standard Deviation= 16.72075

For Student B

#To compute mean

Stud_B_AM=mean(Stud_B)

# To compute Variance and Standard Deviation

n1=length(Stud_B)

V1=var(Stud_B)

variance1=((n1-1)/n1)*V1

cat("Variance=",variance1)

Variance= 279.5833

Std=sqrt(variance1)

cat("Standard Deviation=",Std)

Standard Deviation= 16.72075

CV1=Sd/Stud_A_AM

cat("Coefficient of variation=",CV1)

Coefficient of variation= 0.2440985

CV2=Std/Stud_B_AM
cat("Coefficient of variation=",CV2)

Coefficient of variation= 0.2274931

Therefore, since the coefficient of variation for Student A is greater than that for
Student B, the marks of Student A are more variable (less consistent) than those of Student B.
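Since the same steps are repeated for each student, they can be wrapped in a small function. The function name cv() below is hypothetical (it is not a built-in R function); computing n inside the function ensures that each vector's own length is used.

```r
# Hypothetical helper: coefficient of variation based on the population SD
cv <- function(v) {
  n <- length(v)
  pop_sd <- sqrt(sum((v - mean(v))^2) / n)   # divide by n, not n - 1
  pop_sd / mean(v)
}

Stud_A <- c(82, 73, 95, 46, 54, 61)
Stud_B <- c(41, 84, 66, 75, 93, 82)

cv_A <- cv(Stud_A)
cv_B <- cv(Stud_B)

cat("CV of Student A =", cv_A, "\n")
cat("CV of Student B =", cv_B, "\n")
```

The student with the larger value has the more variable marks.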

Moments, Skewness and Kurtosis

#Installing the package “moments”

install.packages("moments")

#Calling the package “moments”

library(moments)

x=c(6,8,17,21,15,11)

#To compute central moments

central_moments<-all.moments((x),order.max=4,central=T)

central_moments

[1] 1 0 27 18 1235

#To compute raw moments

raw_moments<-all.moments((x),order.max=4,central=F)

raw_moments

[1] 1 13 196 3268 58110
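The moments returned above can be checked directly from their definitions: the r-th raw moment is the mean of x^r, and the r-th central moment is the mean of (x - mean)^r. A minimal base R sketch (the names central and raw are illustrative, and no package is required):

```r
x <- c(6, 8, 17, 21, 15, 11)

# Central moments of order 0 to 4: mean of (x - mean(x))^r
central <- sapply(0:4, function(r) mean((x - mean(x))^r))

# Raw moments of order 0 to 4: mean of x^r
raw <- sapply(0:4, function(r) mean(x^r))

print(central)
print(raw)
```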

Skewness

#Skewness through moments

skewness<-skewness(x)

skewness

[1] 0.1283001

cat("Skewness through moments=",skewness)

Skewness through moments= 0.1283001

Therefore, the distribution is positively skewed

# Skewness through Karl Pearson’s Coefficient

n=length(x)
AM=mean(x)

Md=median(x)

sd=sqrt(var(x)*(n-1)/n)

Skp=3*(AM-Md)/sd

cat("Skewness through Karl Pearson’s Coefficient =",Skp)

Skewness through Karl Pearson’s Coefficient = 0

Therefore, the distribution is symmetric

#Skewness through Bowley's Coefficient

Q1=quantile(x,0.25)

Q2=quantile(x,0.50)

Q3=quantile(x,0.75)

Skb=(Q3+Q1-2*Q2)/(Q3-Q1)

cat("Skewness through Bowley's Coefficient =",Skb)

Skewness through Bowley's Coefficient = -0.09677419

Therefore, the distribution is negatively skewed

Kurtosis

Ku=kurtosis(x)

cat("Kurtosis through moments=",Ku)

Kurtosis through moments= 1.694102

lb<-seq(5,13,2);lb

[1] 5 7 9 11 13

Since kurtosis < 3, the distribution is platykurtic.
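The moment-based skewness and kurtosis reported above follow from the central moments m2, m3 and m4: skewness = m3 / m2^(3/2) and kurtosis = m4 / m2^2. The base R sketch below reproduces them without the moments package; the variable names are illustrative.

```r
x <- c(6, 8, 17, 21, 15, 11)

d  <- x - mean(x)    # deviations from the mean
m2 <- mean(d^2)      # second central moment
m3 <- mean(d^3)      # third central moment
m4 <- mean(d^4)      # fourth central moment

skew <- m3 / m2^1.5  # moment coefficient of skewness
kurt <- m4 / m2^2    # moment coefficient of kurtosis

cat("Skewness =", skew, "\n")
cat("Kurtosis =", kurt, "\n")
```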


What is Data Visualization?

Data Visualization is the representation of data or information in the form of graphs,
charts, maps, plots. Using these visual effects, it is easy to understand huge and
complex data. For employees and business people, it is an easy way to present data to
non-technical people. The popular tools for data visualization are Tableau, Plotly, R,
Google Charts, Infogram, etc. R programming is designed for statistical computing and
graphics; it is flexible and requires minimal code when using different packages.

Advantages

• R is open source which offers various visualization libraries.


• R can easily customize visualization of data by changing axes, fonts, legends and
labels.
• Visualization looks more attractive and easier to understand rather than written data.
• R offers multi panel charts and 3D models.
• Its applications allow us to display a lot of information in a small space which is more
efficient.
Disadvantages

• To generate reports, small companies need to employ professionals to create charts,
which increases costs.
• Data visualization using R is slow for large amounts of data.
Libraries used for Data Visualization in R

R offers set of inbuilt functions and libraries for data visualization. Some of them are
mentioned below:

• ggplot2
• Lattice
• highcharter
• Leaflet
• RColorBrewer
• Plotly
• sunburstR
• RGL
• dygraphs
Data Visualization- Diagrammatic Presentation (Bar and Pie)

Bar Charts

Bar charts are the pictorial representation of data with rectangular bars whose heights are
proportional to the values they represent. The function barplot() is used in R to create
bar charts.

Syntax

barplot(H, xlab, ylab, main, names.arg, col)

where

• H: This parameter is a vector or matrix containing numeric values which are
used in bar chart.
• xlab: This parameter is the label for x axis in bar chart.
• ylab: This parameter is the label for y axis in bar chart.
• main: This parameter is the title of the bar chart.
• names.arg: This parameter is a vector of names appearing under each bar in bar
chart.
• col: This parameter is used to give colors to the bars in the graph.

1. The total number of runs scored by a few players in one-day match is given.

Players       1    2    3    4    5    6

No. of runs   30   60   10   50   70   40

Draw suitable bar diagram for the above data.

Simple bar chart

#Frequencies

runs<-c(30,60,10,50,70,40)

#Plotting

barplot(runs, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart", col ="green")

barplot(runs, horiz = TRUE, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart", col
="green")
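The names.arg parameter described in the syntax above can be used to label each bar. The sketch below is one possible variant of the same chart; the labels built with paste() and the axis titles are choices made here for illustration. barplot() also returns the x positions of the bar midpoints, which can be used to write the values on the bars.

```r
runs    <- c(30, 60, 10, 50, 70, 40)
players <- paste("P", 1:6, sep = "")   # illustrative labels: P1 ... P6

# barplot() returns the midpoints of the bars on the x axis
mids <- barplot(runs, names.arg = players,
                xlab = "Players", ylab = "No. of runs",
                main = "Runs scored", col = "green")

# Write each value just below the top of its bar
text(mids, runs, labels = runs, pos = 1)
```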
2. The following table gives the value (in crores) of contracts secured from abroad, in
respect of civil construction, industrial turnkey projects and software consultancy in
three financial years. Construct a suitable bar diagram to denote the share of activity in
total export earnings from the three projects.

Years                  1994-95   1995-96   1996-97

Civil Construction         260       312       338

Turnkey Projects           442       712       861

Consultancy Services      1740      1800      2000

Total                     2442      2824      3199

Multiple Bar

#Assigning colors

colors = c("green", "orange", "brown")


#Inputting the data

years<-c("1994-95","1995-96","1996-97")

years

[1] "1994-95" "1995-96" "1996-97"

projects<-c("Civil Construction","Turnkey Projects","Consultancy Services")

#Frequencies using matrix

values<-matrix(c(260,312,338,442,712,861,1740,1800,2000),

nrow=3,ncol=3,byrow=TRUE)

values

[,1] [,2] [,3]
[1,] 260 312 338
[2,] 442 712 861
[3,] 1740 1800 2000

#Plotting
barplot(values, main = "Contracts", names.arg = years,

xlab = "Year", ylab = "Value (in crores)",

col = colors, beside = TRUE)

legend("topleft", projects, cex = 0.7, fill = colors)


3. The following data relate to the monthly expenditure (in rupees) of two families A and
B.
Items of expenditure   Food   Clothing   Rent   Light and Fuel   Miscellaneous

Family A               1600   800        600    200              800

Family B               1200   600        500    100              600

Represent the above data by suitable bar diagram.


Stacked bar or Component bar graph or Subdivided graph

#Assigning Colours (one colour per family, i.e. per component of each bar)

colors = c("green", "orange")

#Inputting data
family<-c("family A","family B")
items<-c("Food","Clothing","Rent","Light and fuel","Miscellaneous")
#Frequencies
values<-matrix(c(1600,800,600,200,800,1200,600,500,100,600),
nrow=2,ncol=5,byrow=TRUE)
#Plotting
barplot(values, main = "Monthly Expenditure", names.arg = items,
xlab = "Items of Expenditure", ylab = "Expenditure (in rupees)",
col = colors)
legend("topright", family, cex = 0.7, fill = colors)
4. The number of hours spent by a school student on various activities on a working day,
is given below. Construct a pie chart.

Activity Sleep School Play Homework Others

No of hours 8 6 3 3 4

Pie Chart

A pie chart is a representation of values as slices of a circle with different colors. In R,
the pie() function is used; it takes a vector of non-negative numbers.
Syntax
pie(x, labels, radius, main, col, clockwise)
where
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give description to the slices.
• radius indicates the radius of the circle of the pie chart.(value between −1 and
+1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating if the slices are drawn clockwise or anti
clockwise.
Syntax
#Frequencies
hours<-c(8,6,3,3,4)
activity<-c("Sleep","School","Play","Homework","Others")

#Calculating Percentages
piepercent<- round(100 * hours / sum(hours), 1)
# Plot the chart.
pie(hours, labels = piepercent,
main = "Pie chart", col = rainbow(length(hours)))
legend("topright", activity,
cex = 0.5, fill = rainbow(length(hours)))
Data visualization - Graphical Presentation (Histogram, frequency polygon, Ogives)
and their interpretations

1. Construct a histogram using the data given below:


155,155,159,160,153,156,155,160,161,150,154,156,153,160,153,162,150,156,160,154

Histogram
A histogram is used to plot a continuous variable. It breaks the data into bins and shows
the frequency distribution of these bins.
Syntax
hist(v, main, xlab, xlim, ylim, breaks, col, border)
where
• v: This parameter contains numerical values used in histogram.
• main: This parameter main is the title of the chart.
• col: This parameter is used to set color of the bars.
• xlab: This parameter is the label for horizontal axis.
• border: This parameter is used to set border color of each bar.
• xlim: This parameter sets the range of values shown on the x-axis.
• ylim: This parameter sets the range of values shown on the y-axis.
• breaks: This parameter sets the break points between bins, or the suggested number of bins.

Example
#Frequencies
heights <- c(155,155,159,160,153,156,155,160,161,150,154,156,153,
             160,153,162,150,156,160,154,157,157,157,157,157)
#Histogram
hist(heights, xlab = "Heights in cms", col = "green",border = "black", xlim =
c(140,170),ylim = c(0,10), breaks = 5)
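Besides drawing the chart, hist() returns the bins it computed. The sketch below is a hedged illustration: it passes plot = FALSE so that nothing is drawn, and it uses explicit break points 150, 155, 160, 165, which are an assumption chosen for this example rather than the breaks = 5 suggestion used above.

```r
heights <- c(155,155,159,160,153,156,155,160,161,150,154,156,153,
             160,153,162,150,156,160,154,157,157,157,157,157)

# Explicit break points give the bins [150,155], (155,160], (160,165]
h <- hist(heights, breaks = seq(150, 165, by = 5), plot = FALSE)

h$breaks   # bin boundaries
h$counts   # number of observations in each bin
```

Inspecting h$counts is a quick way to confirm that every observation was assigned to a bin before interpreting the picture.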
2. In a batch of 425 students, the heights of the students are given in the following table.
Represent the data with a histogram.

Heights (in cms.)   140-150   150-160   160-170   170-180   180-190
No. of Students        74       163       135        28        25

Example

## Listing of Lower limits


ll=c(140,150,160,170,180)
## Listing of Upper limits
ul=c(150,160,170,180,190)
#Mid values
mid=(ll+ul)/2
mid
[1] 145 155 165 175 185
#Frequencies
f=c(74,163,135,28,25)
## pairing mid values and f
y=rep(mid,f)
#Histogram
hist(y, main = "Histogram", xlab = "Heights in cms", ylab = "No. of Students",
     col = "pink", border = "black", breaks = seq(140, 190, by = 10))
Frequency Polygon

## Listing of Lower limits


ll=c(140,150,160,170,180)
## Listing of Upper limits
ul=c(150,160,170,180,190)
#Mid values
mid=(ll+ul)/2
mid
[1] 145 155 165 175 185
#Frequencies
f=c(74,163,135,28,25)
# For mid values adding 140 – lower limit and 190 – upper limit
x0=c(140,mid,190)
x0
[1] 140 145 155 165 175 185 190
# For frequencies adding zero before and after
f0=c(0,f,0)
f0
[1] 0 74 163 135 28 25 0
#Plotting
plot(x0, f0, main = "Frequency Polygon", xlab = "Heights in cms",
     ylab = "No. of Students", type = "o", col = "purple")
Histogram and frequency polygon

## Listing of Lower limits


ll=c(140,150,160,170,180)
## Listing of Upper limits
ul=c(150,160,170,180,190)
#Mid Values
mid=(ll+ul)/2
mid
[1] 145 155 165 175 185
#Frequencies
f=c(74,163,135,28,25)
# For mid values adding 140 – lower limit and 190 – upper limit
x0=c(140,mid,190)
x0
[1] 140 145 155 165 175 185 190
# For frequencies adding zero before and after
f0=c(0,f,0)
f0
[1] 0 74 163 135 28 25 0
## pairing mid values and f
y=rep(mid,f)
#Histogram
hist(y, main = "Histogram", xlab = "Heights in cms", ylab = "No. of Students",
     col = "purple", border = "black", breaks = seq(140, 190, by = 10))
lines(x0,f0)
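The grouped-data histograms above depend on rep(mid, f) expanding each class mid value as many times as its frequency. A small check, sketched here under the assumption that the bins are the class boundaries 140, 150, ..., 190, confirms that binning the expanded data recovers the original frequencies.

```r
ll <- c(140, 150, 160, 170, 180)   # lower class limits
ul <- c(150, 160, 170, 180, 190)   # upper class limits
f  <- c(74, 163, 135, 28, 25)      # class frequencies
mid <- (ll + ul) / 2

# Repeat each mid value as many times as its class frequency
y <- rep(mid, f)

# Bin the expanded data on the class boundaries without drawing
h <- hist(y, breaks = seq(140, 190, by = 10), plot = FALSE)

length(y)            # total number of observations
all(h$counts == f)   # TRUE when every class count is recovered
```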
Ogive Curve

#Frequencies

f=c(74,163,135,28,25)

#Less than cumulative frequencies


lcf<-cumsum(f)
lcf
[1] 74 237 372 400 425
lcf1<-c(0,lcf)
lcf1
[1] 0 74 237 372 400 425
#Reversing the frequencies
rf<-rev(f)
rf
[1] 25 28 135 163 74
cf<-cumsum(rf)
cf
[1] 25 53 188 351 425
#Greater than cumulative frequencies
gcf<-rev(cf)
gcf
[1] 425 351 188 53 25
gcf1<-c(gcf,0)
gcf1
[1] 425 351 188 53 25 0
#Class boundaries (140 to 190 in steps of 10)
lbx<-seq(140,190,10)
lbx
[1] 140 150 160 170 180 190
ubx<-seq(140,190,10)
ubx
[1] 140 150 160 170 180 190
#Plotting
plot(ubx, lcf1, type = "o", xlim = c(140,190), main = "Ogive curve",
     xlab = "Heights in cms", ylab = "No. of Students", lwd = 2)
lines(lbx, gcf1, type = "o", lwd = 2)
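One common interpretation of the ogive is that the median can be read off the less-than curve at half the total frequency, which is also where the less-than and greater-than curves cross. The sketch below estimates that point by linear interpolation with approx(). It assumes the class boundaries 140, 150, ..., 190, and the names bounds, n_total and median_est are illustrative.

```r
f <- c(74, 163, 135, 28, 25)
bounds <- seq(140, 190, by = 10)   # class boundaries
lcf1 <- c(0, cumsum(f))            # less-than cumulative frequencies

n_total <- sum(f)

# Interpolate the height at which the cumulative frequency reaches N/2
median_est <- approx(x = lcf1, y = bounds, xout = n_total / 2)$y
median_est
```

The result falls in the 150-160 class, the class in which the cumulative frequency first exceeds half of the total.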
Data visualization- Stem & leaf Plot, Box-Whiskers Plot and their interpretation.

Box plot
A box plot, also known as a box-and-whisker plot, is a chart used in exploratory data
analysis to show the distribution of data visually. It displays the minimum, the lower (first)
quartile, the median, the upper (third) quartile and the maximum.

It is also useful for comparing the distribution of data across data sets by drawing a box plot
for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
boxplot(x, data, notch, varwidth, names, main)
where
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set it to TRUE to draw a notch.
varwidth is a logical value. Set it to TRUE to draw the width of each box proportional to the
square root of the sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.

Example

Draw a box plot for the following data and interpret it.

Store1<-c(350,460,20,160,580,250,210,120,200,510,290,380)
Store1

Store2<-c(520,180,260,380,80,500,630,420,210,70,440,140)
Store2
boxplot(Store1, Store2, names = c("Store1", "Store2"), notch = TRUE,
        main = "boxplot", col = "Orange")
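To interpret the box plot it helps to see the numbers behind the boxes. A minimal sketch using fivenum(), which returns the minimum, lower hinge, median, upper hinge and maximum; the names five1 and five2 are illustrative.

```r
Store1 <- c(350,460,20,160,580,250,210,120,200,510,290,380)
Store2 <- c(520,180,260,380,80,500,630,420,210,70,440,140)

# Minimum, lower hinge, median, upper hinge, maximum
five1 <- fivenum(Store1)
five2 <- fivenum(Store2)

# One row per store for side-by-side comparison
rbind(Store1 = five1, Store2 = five2)
```

For these data the median of Store2 is higher than that of Store1, and the box of Store2 is wider, which indicates a larger spread in the middle half of its values.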

Stem and leaf plot


x<-c(110, 175, 161, 157, 155, 108, 164, 128, 114, 178, 165, 133, 195, 151, 71, 94, 97, 42, 30,
62, 138, 156, 167, 124, 164, 146, 116, 149, 104, 141, 103, 150, 162, 149, 79, 113, 69, 121,
93, 143, 140, 144, 187, 184, 197, 87, 40, 122, 203, 148)
x
 [1] 110 175 161 157 155 108 164 128 114 178 165 133 195 151  71  94  97  42  30  62
[21] 138 156 167 124 164 146 116 149 104 141 103 150 162 149  79 113  69 121  93 143
[41] 140 144 187 184 197  87  40 122 203 148

stem(x, scale = 1, width = 80, atom = 1e-08)

Output :

The decimal point is 1 digit(s) to the right of the |

   3 | 0
   4 | 02
   5 |
   6 | 29
   7 | 19
   8 | 7
   9 | 347
  10 | 348
  11 | 0346
  12 | 1248
  13 | 38
  14 | 01346899
  15 | 01567
  16 | 124457
  17 | 58
  18 | 47
  19 | 57
  20 | 3
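To see how the display is built, the stems and leaves can be computed by hand. This is only a sketch of the idea for whole-number data such as these, where the stem is the value divided by 10 and the leaf is the last digit; it is not how stem() is implemented internally, and the names stems, leaves and leaf_table are illustrative.

```r
x <- c(110, 175, 161, 157, 155, 108, 164, 128, 114, 178, 165, 133, 195,
       151, 71, 94, 97, 42, 30, 62, 138, 156, 167, 124, 164, 146, 116,
       149, 104, 141, 103, 150, 162, 149, 79, 113, 69, 121, 93, 143,
       140, 144, 187, 184, 197, 87, 40, 122, 203, 148)

stems  <- x %/% 10   # tens and hundreds digits
leaves <- x %% 10    # units digit

# Collect the sorted leaves belonging to each stem into one string
leaf_table <- tapply(leaves, stems,
                     function(l) paste(sort(l), collapse = ""))
leaf_table
```

Unlike stem(), this table omits stems that have no leaves (stem 5 here).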

Scatter Plot
A scatter plot, also known as a scatter graph, scatter chart or scatter diagram, is a graph that
shows the relationship between two variables in a data set.

Syntax
plot(x, y, main, xlab, ylab, xlim, ylim, axes)

where
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label on the horizontal axis.
• ylab is the label on the vertical axis.
• xlim is the limits of the values of x used for plotting.
• ylim is the limits of the values of y used for plotting.
• axes is a logical value indicating whether both axes should be drawn on the plot.
Example

weight<-c(2.620,2.875,2.320,3.215,3.440,3.460,3.783,4.067,4.333,4.578,
          2.567,3.554,2.678,2.569,3.5567,4.321,3.4567)
mpg<-c(21.0,21.0,22.8,21.4,18.7,18.1,19.2,17.3,16.4,19.5,
       15.64,16.57,15.89,17.54,16.877,15.776,20167)
plot(x = weight, y = mpg,
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(2.5,5),
     ylim = c(15,30),
     main = "Weight vs Mileage"
)
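A scatter plot is usually read together with the correlation coefficient and a fitted trend line. The sketch below illustrates this on the built-in mtcars data set rather than on the vectors above; that substitution is an assumption made so the example is self-contained.

```r
# mtcars ships with R: wt is the car weight, mpg is miles per gallon
r <- cor(mtcars$wt, mtcars$mpg)

plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight", ylab = "Mileage",
     main = sprintf("Weight vs Mileage (r = %.2f)", r))

# Least-squares line to make the trend visible
abline(lm(mpg ~ wt, data = mtcars), col = "red")
```

A negative r with points falling from upper left to lower right indicates that heavier cars tend to have lower mileage.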
Scatter Plot Matrices
When there are more than two variables, a scatter plot matrix is used. The function pairs() is
used in R.
Syntax
pairs(formula, data)
where
• formula represents the series of variables used in pairs.
• data represents the data set from which the variables will be taken.
Example
head(iris)

pairs(iris[,1:4], pch = 19)


To show only upper panel

pairs(iris[,1:4], pch = 19, lower.panel = NULL)


Colouring points by groups

my_cols <- c("pink", "purple", "green")

pairs(iris[,1:4], pch = 19, cex = 0.5,
      col = my_cols[iris$Species],
      lower.panel = NULL)

Adding correlations in the scatter plots

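upper.panel<-function(x, y){
  # Draw the points, colored by species
  points(x,y, pch=19, col=c("red", "green", "blue")[iris$Species])
  # Correlation coefficient for this pair of variables
  r <- round(cor(x, y), digits=2)
  txt <- paste0("R = ", r)
  # Switch to unit coordinates for the label and restore them on exit
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}

pairs(iris[,1:4], lower.panel = NULL,
      upper.panel = upper.panel)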
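The values printed in the panels can be cross-checked against the full correlation matrix. A minimal sketch; cor_matrix is an illustrative name.

```r
# Pairwise correlations of the four numeric iris columns, rounded as in the panels
cor_matrix <- round(cor(iris[, 1:4]), 2)
cor_matrix
```

Each off-diagonal entry corresponds to one panel of the scatter plot matrix, so the strongest linear relationships can be located in the table first and then examined in the plot.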

Area Chart

The ggplot2 package provides the function geom_area() to create area charts.

Syntax: ggplot(Data, aes(x=x_variable, y=y_variable, fill=group_variable)) + geom_area()


Parameters:
• Data: the data set used in the stacked area chart.
• x: the numeric variable plotted on the x-axis.
• y: the numeric variable plotted on the y-axis.
• fill: the grouping column of Data; each group is drawn as a separate filled layer of
the stack.
Example
#Import packages (only ggplot2 is needed for this example)
library(ggplot2)
grocery_stores <- rep(c("Store1","Store2","Store3","Store4"),times=4)
year <- as.numeric(rep(seq(2017,2020),each=4))
profits <- runif(16, 50, 100)

data <- data.frame(year, profits, grocery_stores)


ggplot(data, aes(x=year, y=profits, fill=grocery_stores)) + geom_area()
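In a stacked area chart the top edge of the stack in each year is the sum of the groups' values. The base-R sketch below computes those totals for the same kind of data; the call to set.seed(1) is an assumption added so that the random profits are reproducible, and totals is an illustrative name.

```r
set.seed(1)  # assumed seed, not part of the original example
grocery_stores <- rep(c("Store1","Store2","Store3","Store4"), times = 4)
year <- rep(seq(2017, 2020), each = 4)
profits <- runif(16, 50, 100)
data <- data.frame(year, profits, grocery_stores)

# Height of the full stack in each year = sum of the four stores' profits
totals <- aggregate(profits ~ year, data = data, FUN = sum)
totals
```

Comparing these totals with the chart is a simple way to check that the stack is being read correctly.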
Line Chart

A line chart is a graph that connects a series of points by drawing line segments between them.
The points are ordered by one of their coordinates (usually the x-coordinate). Line charts
are commonly used to identify trends in data.
The plot() function in R is used to create a line graph.
Syntax
The basic syntax to create a line chart in R is:
plot(v, type, col, xlab, ylab, main)

The parameters are described below:


• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw only the lines and
"o" to draw both points and lines.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the Title of the chart.
• col is used to give colors to both the points and lines.

# Create the data for the chart.

v <- c(17, 25, 38, 13, 41)

# Plot the line chart.

plot(v, type = "o")


#-------------------------------------------------------

Adding Title, Color and Labels in Line Graphs in R

Approach: To create a colored and labeled line chart, pass the title to plot() through main,
the axis labels through xlab and ylab, and the color of the points and lines through col.

#-------------------------------------------------------

# Create the data for the chart.

v <- c(17, 25, 38, 13, 41)

# Plot the line chart.

plot(v, type = "o", col = "green",
     xlab = "Month", ylab = "Articles Written",
     main = "Articles Written chart")


Multiple Lines in a Line Graph in R

Approach: The examples above draw a single line per graph. To compare several series,
draw the first one with plot() and add the others to the same graph with lines().

#Example:

# Create the data for the chart.

v<-c(17,25,38,13,41)

t<-c(22,19,36,19,23)

m<-c(25,14,16,34,29)

# Plot the line chart for the first series.

plot(v, type = "o", col = "red",
     xlab = "Month", ylab = "Articles Written",
     main = "Articles Written chart")

# Add the other two series to the same chart.

lines(t, type = "o", col = "blue")

lines(m, type = "o", col = "green")
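In the multi-line chart, plot() chooses the y-axis range from the first series only, so values of a later series outside that range would not be shown. A hedged sketch that computes a common range with range() and adds a legend; y_range is an illustrative name.

```r
v <- c(17, 25, 38, 13, 41)
t <- c(22, 19, 36, 19, 23)
m <- c(25, 14, 16, 34, 29)

# y-axis range wide enough for every series
y_range <- range(v, t, m)

plot(v, type = "o", col = "red", ylim = y_range,
     xlab = "Month", ylab = "Articles Written",
     main = "Articles Written chart")
lines(t, type = "o", col = "blue")
lines(m, type = "o", col = "green")

# Legend identifying the three lines
legend("topleft", legend = c("v", "t", "m"),
       col = c("red", "blue", "green"), lty = 1, pch = 1)
```

For these particular vectors the first series already spans the full range, but computing ylim explicitly keeps the chart correct when the data change.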


Violin plots

Violin plots help us to visualize a numerical variable across one or more categories. Like
box plots, they summarize a numerical distribution, but they also show the density of the
data, so the shape of the distribution in each category can be compared.

Syntax: ggplot( dataframe, aes( x, y, fill, color)) + geom_violin()

Parameters:

dataframe: determines the dataset used in the plot.

x, y: the categorical variable and the numerical variable to be plotted.

fill: determines the color of the interior of each violin.

color: determines the color of the outline of each violin.

Creating basic Violin Plots

# load library ggplot2
library(ggplot2)

# Basic violin plot
# The diamonds data frame, which ships with ggplot2, is used here.
ggplot(diamonds, aes(x=cut, y=price)) +
  # geom_violin() draws the violin plot
  geom_violin()
#----------------------------------------------

# load library ggplot2
library(ggplot2)

# Violin plot with colored outlines
# The diamonds data frame, which ships with ggplot2, is used here.
# The color aesthetic colors the outline of each violin by category.
ggplot(diamonds, aes(x=cut, y=price, color=cut)) +
  # geom_violin() draws the violin plot
  geom_violin()
#----------------------------------------------

# load library ggplot2
library(ggplot2)

# Violin plot with filled interiors
# The diamonds data frame, which ships with ggplot2, is used here.
# The fill aesthetic colors the interior of each violin by category.
ggplot(diamonds, aes(x=cut, y=price, fill=cut)) +
  # geom_violin() draws the violin plot
  geom_violin()
#--------------------------------------

# load library ggplot2
library(ggplot2)

# Horizontal violin plot
# The diamonds data frame, which ships with ggplot2, is used here.
ggplot(diamonds, aes(x=cut, y=price)) +
  # geom_violin() draws the violin plot
  geom_violin() +
  # coord_flip() swaps the axes to make the violins horizontal
  coord_flip()
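The width of a violin at a given height reflects the estimated density of the data at that value. The base-R sketch below illustrates the idea by mirroring a kernel density estimate; it uses the built-in iris data instead of diamonds so that it runs without ggplot2, and it is a rough illustration of the concept rather than a reproduction of what geom_violin() draws.

```r
# Kernel density estimate of one numeric variable (built-in iris data)
d <- density(iris$Sepal.Length)

# Mirror the density around zero to get a violin-like outline
plot(c(d$y, rev(-d$y)), c(d$x, rev(d$x)), type = "l",
     xlab = "Density (mirrored)", ylab = "Sepal length",
     main = "Violin outline from a density estimate")

# The estimated density integrates to approximately 1
area <- sum(d$y) * diff(d$x[1:2])
area
```

Wide parts of the outline mark values where observations are concentrated, and narrow parts mark values that are rare.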
