
Chandigarh College of Engineering

Landran, Mohali-140307

Department of Computer Science & Engineering

LAB MANUAL

SUBJECT: Data Analytics using R Lab


(BTDS602-20)

Coordinator: Dr. Seema Rani

B.Tech III Year –VI Semester


(Branch: AIDS)

Name:
Roll No.
Index
Sr. No. | Experiment Name | Remarks

R - BASICS
1. Downloading, installing and setting path for R
2. Give an idea of R data types
3. R as a calculator: perform some arithmetic operations in R
4. Demonstrate the process of creating a user-defined function in R
5. Perform some logical operations in R
6. Write an R script to change the structure of a data frame
7. Write an R script to demonstrate loops
8. Write an R script to demonstrate conditional statements: if, if else, switch
9. Write an R script to convert a vector to factors
10. Write an R script to expand a data frame

R - INTERMEDIATE
11. Write an R script to demonstrate R objects
12. Demonstrate the following aggregate functions in R: sum, mean, count, min, max
13. Write an R script to read and write different files
14. Write an R script to find a subset of a dataset
15. Elucidate the process of data exploration in R using read(), summary(), nrow(), ncol(), str()
16. Write an R script to handle missing values in a dataset
17. Write an R script to handle outliers
18. Write an R script to handle invalid values
19. Visualize the iris dataset using a mosaic plot
20. Visualize the correlation between sepal length and petal length in the iris dataset using a scatter plot

R - ADVANCED
21. Linear Regression: Consider the following mice data: Height: 140, 142, 150, 147, 139, 152, 154, 135, 148, 147; Weight: 59, 61, 66, 62, 57, 68, 69, 58, 63, 62. Derive relationship coefficients and a summary for the above data
22. Consider the above data, predict the weight of a mouse for a given height, and plot the results using a graph
23. Logistic Regression: Analyse the iris dataset using logistic regression (note: create a subset of the iris dataset with two species)
24. Perform logistic regression analysis on the above mice data (Task No. 21) and plot the results
25. Decision Tree: Implement the ID3 algorithm in R
26. Implement the C4.5 algorithm in R
27. Time Series: Write an R script to decompose time series data into random, trend and seasonal components
28. Write an R script to forecast time series data using the single exponential smoothing method
29. Clustering: Implement the K-means algorithm in R
30. Implement the CURE algorithm in R

Beyond the syllabus

1. Fit a multiple linear regression using the iris data, considering Sepal.Length as the response variable
2. Random Forest for Classification
3. Decision Tree for Classification and Regression


Task 1

Downloading, installing and setting path for R

Installation of R
R is a very popular programming language, and to work with it we install two things: R and
RStudio. R and RStudio work together when creating an R project.

Installing R on a local computer is easy. First, we must know which operating system we are
using so that we can download the appropriate installer.

The official site https://cloud.r-project.org provides binary files for major operating systems
including Windows, Linux, and Mac OS. In some Linux distributions, R is installed by default,
which we can verify from the console by entering R.

To install R, either we can get it from the site https://cloud.r-project.org or can use commands from
the terminal.

Install R in Windows

There are following steps used to install the R in Windows:

Step 1:

First, we have to download the R setup from https://cloud.r-project.org/bin/windows/base/.


Step 2:

When we click on Download R 3.6.1 for Windows, the download of the R setup starts.
Once the download is finished, we run the R setup in the following way:

1) Select the path where we want to install R and proceed to Next.

2) Select the components we want to install, and then proceed to Next.

3) In the next step, either customize the startup options or accept the defaults, and then
proceed to Next.

4) When we proceed to Next, the installation of R on our system starts.

5) Finally, we click Finish; R is now successfully installed on our system.


Install R in Linux

There are only three steps to install R on Debian/Ubuntu-based Linux:

Step 1:

In the first step, we update the package lists on our system using the sudo apt-get update
command:

Step 2:

In the second step, we install R on our system with the help of sudo apt-get install r-base:
Step 3:

In the last step, we type R and press Enter to start the R console.
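
Once R starts, a quick sanity check (not part of the installation itself) is to print the installed
version and platform details from the R console:

# Print the installed R version and basic platform details
print(R.version.string)
sessionInfo()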


Task 2
Give an idea of R Data Types.
A variable can store different types of values such as numbers, characters etc. These different types
of data that we can use in our code are called data types. For example,

x <- 123L

Here, 123L is an integer data. So the data type of the variable x is integer.

We can verify this by printing the class of x.

x <- 123L

# print value of x
print(x)

# print type of x
print(class(x))

Output

[1] 123
[1] "integer"

Here, x is a variable of data type integer.

Different Types of Data Types

In R, there are 6 basic data types:

 logical
 numeric
 integer
 complex
 character
 raw

Let's discuss each of these R data types one by one.

1. Logical Data Type

The logical data type in R is also known as boolean data type. It can only have two values: TRUE
and FALSE. For example,
bool1 <- TRUE
print(bool1)
print(class(bool1))

bool2 <- FALSE


print(bool2)
print(class(bool2))

Output:
[1] TRUE
[1] "logical"
[1] FALSE
[1] "logical"

In the above example,

 bool1 has the value TRUE,


 bool2 has the value FALSE.

Here, we get "logical" when we check the type of both variables.

Note: You can also define logical variables with a single letter - T for TRUE or F for FALSE. For
example,

is_weekend <- F
print(class(is_weekend)) # "logical"

2. Numeric Data Type

In R, the numeric data type represents all real numbers with or without decimal values. For
example,

# floating point values


weight <- 63.5

print(weight)
print(class(weight))

# real numbers
height <- 182

print(height)
print(class(height))

Output

[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"

Here, both weight and height are variables of numeric type

3. Integer Data Type

The integer data type specifies real values without decimal points. We use the suffix L to specify
integer data. For example,

integer_variable <- 186L


print(class(integer_variable))

Output

[1] "integer"

Here, 186L is an integer data. So we get "integer" when we print the class of integer_variable.

4. Complex Data Type

The complex data type is used to specify purely imaginary values in R. We use the suffix i to
specify the imaginary part. For example,

# 2i represents imaginary part


complex_value <- 3 + 2i

# print class of complex_value


print(class(complex_value))

Output

[1] "complex"

Here, 3 + 2i is of complex data type because it has an imaginary part 2i.

5. Character Data Type

The character data type is used to specify character or string values in a variable.

In programming, a string is a set of characters. For example, 'A' is a single character and "Apple"
is a string.
You can use either single quotes '' or double quotes "" to represent character data; R does not
distinguish between single characters and longer strings, and double quotes are the usual
convention. For example,

# create a string variable


fruit <- "Apple"

print(class(fruit))

# create a character variable


my_char <- 'A'

print(class(my_char))

Output

[1] "character"
[1] "character"

Here, both the variables - fruit and my_char - are of character data type.

6. Raw Data Type

A raw data type specifies values as raw bytes. You can use the following methods to convert
character data types to a raw data type and vice-versa:

 charToRaw() - converts character data to raw data
 rawToChar() - converts raw data to character data

For example,

# convert character to raw


raw_variable <- charToRaw("Welcome to Programiz")

print(raw_variable)
print(class(raw_variable))

# convert raw to character


char_variable <- rawToChar(raw_variable)

print(char_variable)
print(class(char_variable))

Output

[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 50 72 6f 67 72 61 6d 69 7a
[1] "raw"
[1] "Welcome to Programiz"
[1] "character"
Task 3
R as a calculator: Perform some arithmetic operations in R

R can be used as a powerful calculator by entering equations directly at the prompt in the command
console. Simply type your arithmetic expression and press ENTER. R will evaluate the expressions
and respond with the result. While this is a simple interaction interface, there could be problems if
you are not careful. R will normally execute your arithmetic expression by evaluating each item
from left to right, but some operators have precedence in the order of evaluation. Let's start with
some simple expressions as examples.

Simple arithmetic expressions

The operators R uses for basic arithmetic are:

+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponentiation

Let's experiment with some arithmetic expressions.

4+8 will return the result 12


5 * 14 will return the result 70
7/4 will return the result 1.75
4+5+3 will return the result 12
4^3 will return the result 64
The following is a short list of standard mathematical functions.
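A few commonly used ones (an illustrative selection, with their results shown as comments):

sqrt(16)          # square root: 4
abs(-3.5)         # absolute value: 3.5
exp(1)            # e^1: 2.718282
log(100)          # natural logarithm: 4.60517
log10(100)        # base-10 logarithm: 2
round(3.567, 2)   # round to 2 decimal places: 3.57
floor(2.9)        # round down: 2
ceiling(2.1)      # round up: 3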
Task 4

Demonstrate the process of creating a user defined function in R

We can create user-defined functions in R. They are specific to what a user wants and once created
they can be used like the built-in functions. Below is an example of how a function is created and
used.

# Create a function to print squares of numbers in sequence.


new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}

Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}

# Call the function new.function supplying 6 as an argument.


new.function(6)

When we execute the above code, it produces the following result −

Output

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36

Calling a Function without an Argument

# Create a function without an argument.


new.function <- function() {
for(i in 1:5) {
print(i^2)
}
}

# Call the function without supplying an argument.


new.function()
When we execute the above code, it produces the following result −

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

Calling a Function with Argument Values (by position and by name)

The arguments to a function call can be supplied in the same sequence as defined in the function or
they can be supplied in a different sequence but assigned to the names of the arguments.

# Create a function with arguments.


new.function <- function(a,b,c) {
result <- a * b + c
print(result)
}

# Call the function by position of arguments.


new.function(5,3,11)

# Call the function by names of the arguments.


new.function(a = 11, b = 5, c = 3)

When we execute the above code, it produces the following result −

Output

[1] 26
[1] 58

Calling a Function with Default Argument

We can define the value of the arguments in the function definition and call the function without
supplying any argument to get the default result. But we can also call such functions by supplying
new values of the argument and get non default result.

# Create a function with arguments.


new.function <- function(a = 3, b = 6) {
result <- a * b
print(result)
}

# Call the function without giving any argument.


new.function()

# Call the function with giving new values of the argument.


new.function(9,5)

When we execute the above code, it produces the following result −

Output

[1] 18
[1] 45
Task 5

Perform some logical operations in R

Logical operators are used to carry out Boolean operations like AND, OR etc.

Operator Description
! Logical NOT
& Element-wise logical AND
&& Logical AND
| Element-wise logical OR
|| Logical OR

Operators & and | perform element-wise operations, producing a result with the length of the
longer operand.

Operators && and || examine only the first element of each operand and return a single logical
value. (Note: from R 4.3.0 onwards, && and || signal an error when an operand has length greater
than one, so the x && y and x || y calls below only run on older versions of R.)

Zero is considered FALSE and non-zero numbers are taken as TRUE. Let's see an example for this:

x <- c(TRUE, FALSE, 0, 6)


y <- c(FALSE, TRUE, FALSE, TRUE)
!x
x&y
x && y
x|y
x || y

Output
[1] FALSE TRUE TRUE FALSE
[1] FALSE FALSE FALSE TRUE
[1] FALSE
[1] TRUE TRUE FALSE TRUE
[1] TRUE
Task 6

Write an R script to change the structure of a Data frame

# Create a sample data frame


data <- data.frame(
Name = c("John", "Jane", "Bob"),
Age = c(25, 30, 22),
Score = c(85, 92, 78)
)

# Display the original data frame


print("Original Data Frame:")
print(data)

# Add a new column


data$Grade <- c("A", "A", "B")

# Display the data frame with the new column


print("Data Frame with New Column 'Grade':")
print(data)

# Remove a column
data <- data[, -3]

# Display the data frame after removing the 'Score' column


print("Data Frame after Removing 'Score' Column:")
print(data)

# Reshape the data frame (convert wide to long format using tidyr)
library(tidyr)
data_long <- gather(data, key = "Variable", value = "Value", -Name)

# Display the long-format data frame


print("Long-format Data Frame:")
print(data_long)
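
gather() still works but is superseded in current tidyr; a sketch of the same reshape using
pivot_longer() instead (the mixed Age/Grade column types are coerced to character first, because
pivot_longer() refuses to combine incompatible types into one column):

# Coerce every column to character so Age and Grade can share one Value column
data_chr <- data.frame(lapply(data, as.character), stringsAsFactors = FALSE)

# Modern equivalent of the gather() call above
data_long2 <- pivot_longer(data_chr, cols = -Name,
                           names_to = "Variable", values_to = "Value")
print(data_long2)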
Task 7

Write an R script to demonstrate loops


Syntax of for loop
for (val in sequence)
{
statement
}

Here, sequence is a vector and val takes on each of its value during the loop. In each iteration,
statement is evaluated.

x <- c(2, 5, 3, 9, 8, 11, 6)
count <- 0
for (val in x) {
  if (val %% 2 == 0) count = count + 1
}
print(count)

Output

[1] 3

Syntax of while loop


while (test_expression)
{
statement
}

Here, test_expression is evaluated and the body of the loop is entered if the result is TRUE.

The statements inside the loop are executed and the flow returns to evaluate the test_expression
again.

This is repeated each time until test_expression evaluates to FALSE, in which case, the loop
exits.

i <- 1
while (i < 6) {
print(i)
i = i+1
}
Output

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Syntax of repeat loop


repeat {
statement
}

In the statement block, we must use the break statement to exit the loop.

x <- 1
repeat {
print(x)
x = x+1
if (x == 6){
break
}
}

Output

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Task 8

Write an R script to demonstrate conditional statements: if, if else, switch

R if statement
The syntax of if statement is:

if (test_expression) {
statement
}

If test_expression is TRUE, the statement gets executed. But if it's FALSE, nothing happens.

Here, test_expression can be a logical or numeric vector, but only the first element is taken into
consideration.

In the case of a numeric vector, zero is taken as FALSE, rest as TRUE.

Example 1: if statement
x <- 5
if (x > 0) {
  print("Positive number")
}

Output

[1] "Positive number"

if else statement

The syntax of if else statement is:

if (test_expression) {
statement1
} else {
statement2
}

The else part is optional and is only evaluated if test_expression is FALSE.


Example 2: if else statement
x <- -5
if(x > 0){
print("Non-negative number")
} else {
print("Negative number")
}

Output

[1] "Negative number"

The above conditional can also be written in a single line as follows:

if(x > 0) print("Non-negative number") else print("Negative number")

This feature of R allows us to write constructs as shown below:

x <- -5
y <- if(x > 0) 5 else 6
y

Output

[1] 6
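
The task also lists switch; a minimal sketch (the day labels below are illustrative, not from the
manual):

# switch() picks a branch by matching a character value; the final unnamed
# argument acts as the default when nothing matches
day_type <- function(day) {
  switch(day,
         "Sat" = "Weekend",
         "Sun" = "Weekend",
         "Weekday")
}
print(day_type("Sun"))   # [1] "Weekend"
print(day_type("Mon"))   # [1] "Weekday"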
Task 9

Write an R script to convert a vector to factors

The as.factor() function in R is used to convert the passed object (usually a vector) into a factor.

Syntax: as.factor(object)
Parameters:
object: the vector to be converted

as.factor() Function in R Example

Example 1: Convert a vector to a factor in R
# Creating a vector
x<-c("female", "male", "male", "female")

# Using as.factor() Function


# to convert vector into factor
as.factor(x)

Output:
[1] female male male female
Levels: female male
Task 10

Write an R script to expand a data frame

# R program to create a dataframe


# with combination of vectors

# Creating vectors
x1 <- c("abc", "cde", "def")
x2 <- c(1, 2, 3)
x3 <- c("M", "F")

# Calling expand.grid() Function


expand.grid(x1, x3)

Output:

Var1 Var2
1 abc M
2 cde M
3 def M
4 abc F
5 cde F
6 def F
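
The vector x2 is defined above but not used; passing all three vectors to expand.grid() produces
every combination (3 x 3 x 2 = 18 rows):

# Expand over all three vectors
expand.grid(x1, x2, x3)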
Task 11

Write an R script to demonstrate R objects

# Scalar (numeric)
scalar <- 42

# Vector
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "orange")

# Matrix
matrix_example <- matrix(1:9, nrow = 3, ncol = 3)

# Data Frame
data_frame_example <- data.frame(
Name = c("John", "Jane", "Bob"),
Age = c(25, 30, 22),
Score = c(85, 92, 78)
)

# List
list_example <- list(
"Numeric Vector" = numeric_vector,
"Character Vector" = character_vector,
"Matrix" = matrix_example,
"Data Frame" = data_frame_example
)

# Display the objects (cat() is used for the labels so that "\n" is printed as a newline)

cat("Scalar:\n")
print(scalar)

cat("\nNumeric Vector:\n")
print(numeric_vector)

cat("\nCharacter Vector:\n")
print(character_vector)

cat("\nMatrix:\n")
print(matrix_example)

cat("\nData Frame:\n")
print(data_frame_example)

cat("\nList:\n")
print(list_example)
Task 12

Demonstrate the following aggregate functions in R: sum, mean, count, min,


max
# Create a sample numeric vector
numeric_vector <- c(10, 20, 15, 25, 30, 18, 22)

# Sum of the vector elements


total_sum <- sum(numeric_vector)

# Mean (average) of the vector elements


average <- mean(numeric_vector)

# Count of elements in the vector


count <- length(numeric_vector)

# Minimum and maximum values in the vector


minimum_value <- min(numeric_vector)
maximum_value <- max(numeric_vector)

# Display the results


cat("Numeric Vector:", numeric_vector, "\n\n")

cat("Sum of the Vector:", total_sum, "\n")


cat("Mean of the Vector:", average, "\n")
cat("Count of Elements in the Vector:", count, "\n")
cat("Minimum Value in the Vector:", minimum_value, "\n")
cat("Maximum Value in the Vector:", maximum_value, "\n")
Task 13

Write an R script to read and write different files

# Install necessary packages if not already installed

if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("writexl", quietly = TRUE)) install.packages("writexl")
if (!requireNamespace("readxl", quietly = TRUE)) install.packages("readxl")

# Load the required libraries

library(readr)    # read_csv(), write_csv()
library(writexl)  # write_xlsx()
library(readxl)   # read_xlsx()

# Create a sample data frame


data <- data.frame(
Name = c("John", "Jane", "Bob"),
Age = c(25, 30, 22),
Score = c(85, 92, 78)
)

# Display the original data frame


cat("Original Data Frame:\n")
print(data)

# Write the data frame to a CSV file


write_csv(data, "example_data.csv")

# Write the data frame to an Excel file


write_xlsx(data, "example_data.xlsx")

# Read data from the CSV file


read_csv_data <- read_csv("example_data.csv")

# Read data from the Excel file


read_xlsx_data <- read_xlsx("example_data.xlsx")

# Display the data read from files


cat("\nData Read from CSV:\n")
print(read_csv_data)

cat("\nData Read from Excel:\n")


print(read_xlsx_data)
Task 14

Write an R script to find subset of a dataset


# R program to create
# subset of a data frame

# Creating a Data Frame


df<-data.frame(row1 = 0:2, row2 = 3:5, row3 = 6:8)
print ("Original Data Frame")
print (df)

# Creating a Subset
df1 <- subset(df, select = row2)
print("Modified Data Frame")
print(df1)
Task 15

Elucidate the process of data exploration in R using read(),summary(), nrow(),


ncol(),str().

mydata <- read.csv("D:/Course File/Data Analytics Using R/Book1.csv", header=TRUE)


summary(mydata)
nrow(mydata)
ncol(mydata)
str(mydata)
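
The file path above is machine-specific; the same exploration steps can be reproduced on a
built-in dataset such as iris:

# Data exploration on the built-in iris data frame
data(iris)
head(iris)      # first six rows
summary(iris)   # per-column summaries
nrow(iris)      # number of rows (150)
ncol(iris)      # number of columns (5)
str(iris)       # structure: column types and example values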
Task 16

Write an R script to handle missing values in a dataset

In data science, one of the common tasks is dealing with missing data. If we have missing data in
a dataset, there are several ways to handle it in R. One way is to simply remove any rows or
columns that contain missing data. Another way is to impute the missing values using a statistical
method, i.e., to replace the missing values with estimates based on the other values in the
dataset. For example, we can replace missing values with the mean or median of the variable in
which the missing values are found.

Missing Data

In R, the NA symbol is used to represent missing values, and to represent the result of undefined
arithmetic operations (like 0/0) we use the NaN symbol, which stands for "not a number". In simple
words, both the NA and NaN symbols represent missing values in R.

Let us consider a scenario in which a teacher is entering the marks (or data) of all the students
in her class in a spreadsheet but by mistake forgets to enter the data of one student. Thus,
missing data/values arise naturally in practice.

Finding Missing Data in R

R provides us with inbuilt functions using which we can find the missing values. Such inbuilt
functions are explained in detail below −

Using the is.na() Function

We can use the built-in is.na() function in R to check for NA values. This function returns a
vector that contains only logical values (TRUE or FALSE). For the NA values in the original
dataset, the corresponding element of the returned vector is TRUE; otherwise it is FALSE.

Example
# vector with some data
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
myVector

Output
[1] NA "TP" "4" "6.7" "c" NA "12"

Let’s find the NAs

# finding NAs
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
is.na(myVector)
Output
[1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE

Let’s identify NAs in Vector

myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)


which(is.na(myVector))

Output
[1] 1 6

Let’s identify total number of NAs −

myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)


sum(is.na(myVector))

Output
[1] 2

As you can see in the output, this function produces a vector with a TRUE value at those
positions in which myVector holds an NA value.

Using the is.nan() Function

We can apply the is.nan() function to check for NaN values. This function returns a vector
containing logical values (TRUE or FALSE). If there are NaN values present in the vector, it
returns TRUE at the corresponding positions; otherwise it returns FALSE.

Example
myVector <- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)

is.nan(myVector)

Output
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE

As you can see in the output, this function produces a vector with a TRUE value at those
positions in which myVector holds a NaN value.

Some of the traits of missing values are listed below −

 Multiple NA or NaN values can exist in a vector.
 To deal with NA-type missing values in a vector, we can use the is.na() function by
passing the vector as an argument.
 To deal with NaN-type missing values in a vector, we can use the is.nan() function by
passing the vector as an argument.
 Generally, NaN values are also treated as NA (is.na() returns TRUE for them), but the
reverse is not true: is.nan() returns FALSE for a plain NA.
Removing Missing Data/ Values

Let us consider a scenario in which we want to filter values except for missing values. In R, we
have two ways to remove missing values. These methods are explained below −

Remove Values Using Filter Functions

The first way to remove missing values from a dataset is through R's modeling functions. These
functions accept an na.action parameter that tells the function what to do when an NA value is
encountered; this makes the modeling function invoke one of the missing-value filter functions.

These filter functions effectively replace the original dataset with a new dataset in which the NA
values have been handled. The default setting is na.omit, which completely removes a row if that
row contains any missing value. The available filter functions are −

 na.omit − simply rules out any rows that contain a missing value and drops those rows.
 na.exclude − also drops rows having at least one missing value, but remembers their
positions so that residuals and predictions can be padded back to the original length.
 na.pass − takes no action.
 na.fail − terminates the execution if any missing values are found.

Example
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.exclude(myVector)

Output
[1] "TP" "4" "6.7" "c" "12"
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "exclude"

Example
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.omit(myVector)

Output
[1] "TP" "4" "6.7" "c" "12"
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "omit"

Example
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.fail(myVector)
Output
Error in na.fail.default(myVector) : missing values in object

As you can see in the output, execution halted because the object contains missing values.

Selecting values that are not NA or NAN

In order to select only those values which are not missing, we first produce a logical vector
whose elements are TRUE for NA or NaN values and FALSE for the other values in the given vector,
and then negate it when indexing.

Example

Let logicalVector be such a vector (we can easily get this vector by applying is.na() function).

myVector1 <- c(200, 112, NA, NA, NA, 49, NA, 190)
logicalVector1 <- is.na(myVector1)
newVector1 = myVector1[! logicalVector1]
print(newVector1)
Output
[1] 200 112 49 190

Applying the is.nan() function

myVector2 <- c(100, 121, 0 / 0, 123, 0 / 0, 49, 0 / 0, 290)


logicalVector2 <- is.nan(myVector2)
newVector2 = myVector2[! logicalVector2]
print(newVector2)
Output
[1] 100 121 123 49 290

As you can see in the output missing values of type NA and NAN have been successfully removed
from myVector1 and myVector2 respectively.

Filling Missing Values with Mean or Median

In this section, we will see how we can fill or populate missing values in a dataset using mean and
median. We will use the apply method to get the mean and median of missing columns.

Step 1 − The very first step is to get the list of columns that contain at least one missing (NA)
value.
Example
# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(NA, 84, 93, 87),
Mathematics = c(91, 86, NA, NA) )
#Print dataframe
print(dataframe)
Output
Name Physics Chemistry Mathematics
1 Bhuwanesh 98 NA 91
2 Anil 87 84 86
3 Jai 91 93 NA
4 Naveen 94 87 NA

Let’s print the column names having at least one NA value.

listMissingColumns <- colnames(dataframe)[ apply(dataframe, 2, anyNA)]


print(listMissingColumns)
Output
[1] "Chemistry" "Mathematics"

In our dataframe, we have two columns with NA values.

Step 2 − Now we are required to compute the mean and median of the corresponding columns.
Since we need to omit NA values in the missing columns, we pass the "na.rm = TRUE" argument
to the apply() function.

meanMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns],


2, mean, na.rm = TRUE)
print(meanMissing)
Output
Chemistry Mathematics
88.0 88.5

The mean of Column Chemistry is 88.0 and that of Mathematics is 88.5.

Now let’s find the median of the columns −

medianMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns],


2, median, na.rm = TRUE)

print(medianMissing)
Output
Chemistry Mathematics
87.0 88.5
The median of Column Chemistry is 87.0 and that of Mathematics is 88.5.

Step 3 − Now our mean and median values of the corresponding columns are ready. In this step, we
will replace the NA values with the mean and median using the mutate() function from the
"dplyr" package.

Example
# Importing library
library(dplyr)

# Create a data frame


dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(NA, 84, 93, 87),
Mathematics = c(91, 86, NA, NA) )

listMissingColumns <- colnames(dataframe)[ apply(dataframe, 2, anyNA)]

meanMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns],


2, mean, na.rm = TRUE)

medianMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns],


2, median, na.rm = TRUE)

newDataFrameMean <- dataframe %>% mutate(


Chemistry = ifelse(is.na(Chemistry), meanMissing[1], Chemistry),
Mathematics = ifelse(is.na(Mathematics), meanMissing[2], Mathematics))

newDataFrameMean
Output
Name Physics Chemistry Mathematics
1 Bhuwanesh 98 88 91.0
2 Anil 87 84 86.0
3 Jai 91 93 88.5
4 Naveen 94 87 88.5

Notice the missing values are filled with the mean of the corresponding column.

Example

Now let’s fill the NA values with the median of the corresponding column.

# Importing library
library(dplyr)

# Create a data frame


dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(NA, 84, 93, 87),
Mathematics = c(91, 86, NA, NA) )

listMissingColumns <- colnames(dataframe)[ apply(dataframe, 2, anyNA)]

meanMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns],


2, mean, na.rm = TRUE)

medianMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns],


2, median, na.rm = TRUE)

newDataFrameMedian <- dataframe %>% mutate(


Chemistry = ifelse(is.na(Chemistry), medianMissing[1], Chemistry),
Mathematics = ifelse(is.na(Mathematics), medianMissing[2],Mathematics))

print(newDataFrameMedian)
Output
Name Physics Chemistry Mathematics
1 Bhuwanesh 98 87 91.0
2 Anil 87 84 86.0
3 Jai 91 93 88.5
4 Naveen 94 87 88.5

The missing values are filled with the median of the corresponding column.
Task 17
Write an R script to handle outliers

Handling outliers in R typically involves identifying and dealing with data points that are
significantly different from the majority of the observations. There are various approaches to
handle outliers, such as removing them, transforming the data, or applying robust statistical
methods. Here's a simple example script that demonstrates how to identify and handle outliers
using summary(), boxplot(), and a small user-defined winsorize() function:

# Sample data with outliers


data <- c(25, 30, 22, 28, 35, 200, 21, 26, 27, 24)

# Display summary statistics and boxplot before handling outliers


cat("Summary Statistics before handling outliers:\n")
print(summary(data))

# Create a boxplot to visualize outliers


boxplot(data, main = "Boxplot before Outlier Handling", ylab = "Values")

# Function to winsorize data (replace outliers with upper/lower bounds)


winsorize <- function(x, trim = 0.05) {
q <- quantile(x, c(trim, 1 - trim))
x[x < q[1]] <- q[1]
x[x > q[2]] <- q[2]
return(x)
}

# Apply winsorization to handle outliers


data_winsorized <- winsorize(data)

# Display summary statistics and boxplot after handling outliers

cat("\nSummary Statistics after handling outliers:\n")
print(summary(data_winsorized))

# Create a boxplot to visualize winsorized data


boxplot(data_winsorized, main = "Boxplot after Outlier Handling", ylab = "Values")
This script first displays summary statistics and a boxplot of the original data with outliers. Then, it
defines a winsorize() function to replace outliers with upper and lower bounds. The script
applies winsorization to the data and displays summary statistics and a boxplot of the winsorized
data.

Note: In practice, the approach to handling outliers depends on the specific characteristics of your
data and the goals of your analysis. Winsorization is just one of many possible methods. Other
methods include trimming, transformation, or using robust statistical techniques. Choose the
method that best fits your data and analysis requirements.
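
As one of those alternative methods, here is a sketch of the common 1.5 x IQR rule applied to the
same data vector (outliers are simply dropped here rather than capped):

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and remove them
q1 <- quantile(data, 0.25)
q3 <- quantile(data, 0.75)
iqr_value <- q3 - q1
data_trimmed <- data[data >= q1 - 1.5 * iqr_value & data <= q3 + 1.5 * iqr_value]
print(data_trimmed)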
Task 18

Write an R script to handle invalid values.

Handling invalid values in R involves identifying and addressing missing or incorrect data.
Common strategies include imputation for missing values, cleaning or transforming data, or
removing observations with invalid values. Here's a basic example script that demonstrates
removing missing values with the complete.cases() and na.omit() functions, and imputing missing
values with the mean using ifelse() and mean():

# Sample data with invalid values


data <- c(25, 30, NA, 28, 35, 200, 21, 26, "Invalid", 24)

# Display the original data


cat("Original Data:\n")
print(data)

# Identify and remove observations with missing values


clean_data <- data[complete.cases(data)]

# Display the data after removing missing values


cat("\nData after Removing Missing Values:\n")
print(clean_data)

# Identify and remove observations with missing values using na.omit()


clean_data_naomit <- na.omit(data)

# Display the data after removing missing values using na.omit()


cat("\nData after Removing Missing Values using na.omit():\n")
print(clean_data_naomit)

# Convert to numeric; non-numeric entries such as "Invalid" become NA (with a coercion warning)

numeric_data <- suppressWarnings(as.numeric(data))

# Impute missing values with the mean

imputed_data <- ifelse(is.na(numeric_data), mean(numeric_data, na.rm = TRUE), numeric_data)

# Display the data after imputing missing values with the mean
cat("\nData after Imputing Missing Values with Mean:\n")
print(imputed_data)

This script first displays the original data, then removes observations with missing values using
both complete.cases() indexing and the na.omit() function. Finally, it converts the vector to
numeric (which turns the "Invalid" entry into NA) and imputes the missing values with the mean
using ifelse() and mean().
Task 19

Visualize iris dataset using mosaic plot

A mosaic plot can be used for plotting categorical data very effectively, with the area of each
tile showing the relative proportion of that combination of categories. The built-in HairEyeColor
contingency table is used here as the classic illustration.

data(HairEyeColor)
mosaicplot(HairEyeColor)
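
The iris data set itself has only one categorical variable (Species), so to mosaic-plot it we
first need a contingency table; one option (an illustrative sketch that bins Petal.Length into
three groups) is:

# Cross-tabulate Species against binned petal length, then mosaic-plot the table
data(iris)
petal_group <- cut(iris$Petal.Length, breaks = 3,
                   labels = c("short", "medium", "long"))
iris_table <- table(Species = iris$Species, PetalLength = petal_group)
mosaicplot(iris_table, main = "Iris: Species vs. binned Petal.Length", color = TRUE)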
Task 20

Visualize correlation between sepal length and petal length in iris data set using
scatter plot.
# Load the Iris dataset
data(iris)
# Create a scatter plot for Sepal.Length vs. Petal.Length
SL=iris$Sepal.Length
PL=iris$Petal.Length
plot(SL, PL)
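
To quantify the relationship alongside the plot, the correlation coefficient can also be computed
and the plot labelled (a small optional extension):

# Pearson correlation between sepal length and petal length (about 0.87 for iris)
cor(SL, PL)

# Same scatter plot with axis labels and a title
plot(SL, PL,
     xlab = "Sepal Length", ylab = "Petal Length",
     main = "Iris: Sepal Length vs. Petal Length", pch = 19)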
Task 21

Linear Regression: Consider the following mice data:


Height:140,142,150,147,139,152,154,135,148, 147. Weight: 59, 61, 66, 62, 57, 68,
69, 58, 63, 62. Derive relationship coefficients and summary for the above data .

# Step 1: Create the data frame


height <- c(140, 142, 150, 147, 139, 152, 154, 135, 148, 147)
weight <- c(59, 61, 66, 62, 57, 68, 69, 58, 63, 62)

mice_data <- data.frame(Height = height, Weight = weight)

# Step 2: Fit a linear regression model


model <- lm(Weight ~ Height, data = mice_data)

# Step 3: Extract coefficients and summary


coefficients <- coef(model)
summary_stats <- summary(model)

# Display the results


cat("Coefficients:\n")
print(coefficients)

cat("\n\nSummary Statistics:\n")
print(summary_stats)
Task 22

Consider the above data and predict the weight of a mouse for a given height
and plot the results using a graph

# Use the existing 'mice_data' and 'model' from the previous code

# Predict the weight for a given height


new_height <- 145 # Replace with the height for which you want to predict the weight
predicted_weight <- predict(model, newdata = data.frame(Height = new_height))

# Plot the original data points and the regression line


plot(mice_data$Height, mice_data$Weight, main = "Linear Regression for Mice Data",
xlab = "Height", ylab = "Weight", pch = 16, col = "blue")

# Add the regression line to the plot


abline(model, col = "red")

# Highlight the predicted point


points(new_height, predicted_weight, pch = 19, col = "green")
text(new_height, predicted_weight, labels = paste("Predicted Weight:", round(predicted_weight, 2)),
pos = 3, col = "green")
Task 23

Logistic Regression: Analyse iris data set using Logistic Regression. Note: create
a subset of iris dataset with two species..

# Load the iris dataset


data(iris)

# Create a subset with two species (setosa and versicolor)


iris_subset <- subset(iris, Species %in% c("setosa", "versicolor"))

# Convert the Species variable to a binary outcome (0 or 1)


iris_subset$Species <- as.factor(ifelse(iris_subset$Species == "setosa", 0, 1))

# Fit logistic regression model


logistic_model <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = iris_subset,
family = binomial)

# Display the summary of the logistic regression model


summary(logistic_model)
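
Because setosa and versicolor are linearly separable on these features, glm() usually warns that
"fitted probabilities numerically 0 or 1 occurred"; the model can still be used for
classification. A minimal sketch of turning the fitted probabilities into class labels (the 0.5
threshold is an assumption):

# Fitted probabilities for the training rows
pred_prob <- predict(logistic_model, type = "response")

# Probability > 0.5 -> versicolor (coded 1), otherwise setosa (coded 0)
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Compare predictions with the recoded Species labels
table(Predicted = pred_class, Actual = iris_subset$Species)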
Task 24

Perform Logistic Regression analysis on the above mice data(Task.No.21) and


plot the results dataset .

# Mice data
height <- c(140, 142, 150, 147, 139, 152, 154, 135, 148, 147)
weight <- c(59, 61, 66, 62, 57, 68, 69, 58, 63, 62)

# Linear regression
linear_model <- lm(weight ~ height)

# Summary of linear regression


summary(linear_model)

# Plotting linear regression


plot(height, weight, main="Linear Regression", xlab="Height", ylab="Weight")
abline(linear_model, col="red")

# Assuming a binary outcome variable (e.g., obese or not obese)


outcome <- c(0, 0, 1, 0, 0, 1, 1, 0, 1, 0)

# Logistic regression
logistic_model <- glm(outcome ~ height + weight, family=binomial)

# Summary of logistic regression


summary(logistic_model)

# Plotting logistic regression results


# Assuming height and weight as predictor variables
plot(height, weight, col=outcome+1, pch=19, main="Logistic Regression", xlab="Height",
ylab="Weight")
# legend() adds the class key; col 1 = Not Obese, col 2 = Obese
legend("topright", legend=c("Not Obese", "Obese"), col=1:2, pch=19)

# Adding logistic regression decision boundary


x_vals <- seq(min(height), max(height), length.out=100)
y_vals <- seq(min(weight), max(weight), length.out=100)
grid <- expand.grid(height=x_vals, weight=y_vals)
probs <- predict(logistic_model, newdata=grid, type="response")
contour(x_vals, y_vals, matrix(probs, length(x_vals), length(y_vals)), levels=0.5, add=TRUE,
col="blue")
[Figure: "Logistic Regression" scatter plot of Weight (58-68) against Height (135-150), points
coloured by class (Not Obese / Obese), with the 0.5 probability contour drawn as the decision
boundary.]
Task 25

Decision Tree: Implement ID3 algorithm in R

library(data.tree)

entropy <- function(q) {


# Calculate the entropy for a value.
-1 * (q * log2(q) + (1 - q) * log2(1 - q))
}

positiveRatio <- function(data, outcomeCol = ncol(data)) {


# Calculate the ratio of T by the total samples.
positiveCount <- length(which(data[, outcomeCol] == T))
sum(positiveCount / nrow(data))
}

gain <- function(data, attributeCol, outcomeCol = ncol(data), precision=3) {
  # Calculate the information gain for an attribute.
  # First, calculate the total entropy for this attribute by using its positive ratio.
  systemEntropy <- round(entropy(positiveRatio(data, outcomeCol)), precision)

# Get the list of all T and all F outcomes.


positives <- data[which(data[,outcomeCol] == T),]
negatives <- data[which(data[,outcomeCol] == F),]

  # Split the attribute into groups by its possible values (sunny, overcast, rainy).
  attributeValues <- split(data, data[,attributeCol])

# Sum the entropy for each positive attribute value.


gains <- sum(sapply(attributeValues, function(attributeValue) {
# Calculate the ratio for this attribute value by all measurements.
itemRatio <- nrow(attributeValue) / nrow(data)

# Calculate the entropy for this attribute value.


outcomeEntropy <- entropy(length(which(attributeValue[,outcomeCol] == T)) /
nrow(attributeValue))

# Cast NaN to 0 and return the result.


result <- itemRatio * outcomeEntropy
round(ifelse(is.nan(result), 0, result), precision)
}))

  # The information gain is the system entropy minus the attribute value gains.
  systemEntropy - gains
}

pure <- function(data, outcomeCol = ncol(data)) {
  length(unique(data[, outcomeCol])) == 1
}

ID3 <- function(node, data, outcomeCol = ncol(data)) {
  node$obsCount <- nrow(data)
# If the data-set contains all the same outcome values, then make a leaf.
if (pure(data, outcomeCol)) {
# Construct a leaf having the name of the attribute value.
child <- node$AddChild(unique(data[,outcomeCol]))
node$feature <- tail(names(data), 1)
child$obsCount <- nrow(data)
child$feature <- ''
}
else {
# Chose the attribute with the highest information gain.
gains <- sapply(colnames(data)[-outcomeCol], function(colName)
{ gain(data, which(colnames(data) == colName), outcomeCol)
})

feature <- names(gains)[gains == max(gains)][1]

node$feature <- feature

# Take the subset of the data-set having that attribute value.


    childObs <- split(data[, !(names(data) %in% feature)], data[, feature], drop = TRUE)

for(i in 1:length(childObs)) {
# Construct a child having the name of that attribute value.
child <- node$AddChild(names(childObs)[i])

# Call the algorithm recursively on the child and the subset.


ID3(child, childObs[[i]])
}
}
}

# Read dataset.
data <- read.table('weather.tsv', header=T)
# Convert the last column to a boolean.
data[, ncol(data)] <- ifelse(tolower(data[, ncol(data)]) == 'yes', T, F)

# Test calculating information gain for all columns.


sapply(1:ncol(data), function(i) { print(gain(data, i)) })

# Train ID3 to build a decision tree.


tree <- Node$new('Should_Play')
ID3(tree, data)
print(tree, 'feature')
Task 26

Implement C4.5 algorithm in R


C4.5

The C4.5 algorithm, created by Ross Quinlan, implements decision trees. The algorithm starts with
all instances in the same group, then repeatedly splits the data based on attributes until each item is
classified. To avoid overfitting, sometimes the tree is pruned back. C4.5 attempts this
automatically. C4.5 handles both continuous and discrete attributes.

load packages

First we load the RWeka and caret packages.

J48 is an open-source Java implementation of the C4.5 algorithm available in the Weka package.

The caret package (Classification And REgression Training) is a set of functions that streamline
learning by providing functions for data splitting, feature selection, model tuning, and more.

library(RWeka)
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.4

Create the model

First, use caret to create a 10-fold training set. Then train the model. We are using the well-known
iris data set here that is automatically included in R.

set.seed(1958) # set a seed to get replicable results


train <- createFolds(iris$Species, k=10)
C45Fit <- train(Species ~., method="J48", data=iris,
tuneLength = 5,
trControl =
trainControl( method="cv",
indexOut=train))
Look at the model results

The results first describe the data:

 150 samples
 4 predictor attributes
 3 classes: setosa, versicolor, virginica

Next it tells us that it did 10-fold cross validation.


Finally, we get an accuracy of 0.96 and a Kappa of 0.94. Pretty good!

C45Fit
## C4.5-like Trees
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results:
##
## Accuracy Kappa
## 0.96 0.94
##
## Tuning parameter 'C' was held constant at a value of 0.25
##

Look at the model

Looking at the tree we see that it first split on whether or not the petal width was 0.6 or less. Each
indentation is the next split in the tree.

C45Fit$finalModel
## J48 pruned tree
##
##
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## | Petal.Width <= 1.7
## | | Petal.Length <= 4.9: versicolor (48.0/1.0)
## | | Petal.Length > 4.9
## | | | Petal.Width <= 1.5: virginica (3.0)
## | | | Petal.Width > 1.5: versicolor (3.0/1.0)
## | Petal.Width > 1.7: virginica (46.0/1.0)
##
## Number of Leaves : 5
##
## Size of the tree : 9
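
As an optional check (not part of the original write-up), the fitted caret model can also be used
to predict species on the full iris data and the predictions tabulated against the true labels:

# Predict with the trained J48 model and cross-tabulate against the actual species
preds <- predict(C45Fit, newdata = iris)
table(Predicted = preds, Actual = iris$Species)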
Task 27

Time Series: Write R script to decompose time series data into random, trend
and seasonal components.
library(astsa, quietly=TRUE, warn.conflicts=FALSE)
library(ggplot2)
library(knitr)
library(printr)
library(plyr)
library(dplyr)
library(lubridate)
library(gridExtra)
library(reshape2)
library(TTR)
kings <- scan('http://robjhyndman.com/tsdldata/misc/kings.dat', skip=3)
head(kings)

## [1] 60 43 67 50 56 42

kings <- ts(kings)


kings

## Time Series:
## Start = 1
## End = 42
## Frequency = 1
## [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69
## [24] 59 48 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

births <- ts(births, frequency = 12, start = c(1946, 1))


births

## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1946 26.663 23.598 26.931 24.740 25.806 24.364 24.477 23.901 23.175 23.227
## 1947 21.439 21.089 23.709 21.669 21.752 20.761 23.479 23.824 23.105 23.110
## 1948 21.937 20.035 23.590 21.672 22.222 22.123 23.950 23.504 22.238 23.142
## 1949 21.548 20.000 22.424 20.615 21.761 22.874 24.104 23.748 23.262 22.907
## 1950 22.604 20.894 24.677 23.673 25.320 23.583 24.671 24.454 24.122 24.252
## 1951 23.287 23.049 25.076 24.037 24.430 24.667 26.451 25.618 25.014 25.110
## 1952 23.798 22.270 24.775 22.646 23.988 24.737 26.276 25.816 25.210 25.199
## 1953 24.364 22.644 25.565 24.062 25.431 24.635 27.009 26.606 26.268 26.462
## 1954 24.657 23.304 26.982 26.199 27.210 26.122 26.706 26.878 26.152 26.379
## 1955 24.990 24.239 26.721 23.475 24.767 26.219 28.361 28.599 27.914 27.784
## 1956 26.217 24.218 27.914 26.975 28.527 27.139 28.982 28.169 28.056 29.136
## 1957 26.589 24.848 27.543 26.896 28.878 27.390 28.065 28.141 29.048 28.484
## 1958 27.132 24.924 28.963 26.589 27.931 28.009 29.229 28.759 28.405 27.945
## 1959 26.076 25.286 27.660 25.951 26.398 25.565 28.865 30.000 29.261 29.012
## Nov Dec
## 1946 21.672 21.870
## 1947 21.759 22.073
## 1948 21.059 21.573
## 1949 21.519 22.025
## 1950 22.084 22.991
## 1951 22.964 23.981
## 1952 23.162 24.707
## 1953 25.246 25.180
## 1954 24.712 25.688
## 1955 25.693 26.881
## 1956 26.291 26.987
## 1957 26.634 27.735
## 1958 25.912 26.619
## 1959 26.992 27.897

Next loading data on beach town souvenir shop.

gift <- scan("http://robjhyndman.com/tsdldata/data/fancy.dat")


gift<- ts(gift, frequency=12, start=c(1987,1))
gift
## Jan Feb Mar Apr May Jun Jul
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1988 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1989 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1990 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1991 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 1992 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## 1993 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 1987 3566.34 5021.82 6423.48 7600.60 19756.21
## 1988 4752.15 5496.43 5835.10 12600.08 28541.72
## 1989 8176.62 8573.17 9690.50 15151.84 34061.01
## 1990 7979.25 8093.06 8476.70 17914.66 30114.41
## 1991 12552.22 11637.39 13606.89 21822.11 45060.69
## 1992 19888.61 23933.38 25391.35 36024.80 80721.71
## 1993 28586.52 30505.41 30821.33 46634.38 104660.67

Plotting Time Series

Plot the kings data.

plot.ts(kings)
At this point we could guess that this time series could be described using an additive model, since
the random fluctuations in the data are roughly constant in size over time.

Plotting the births data.

plot.ts(births)

We can see from this time series that there is certainly some seasonal variation in the number of
births per month: there is a peak every summer and a trough every winter. Again, it seems like
this could be described using an additive model, as the seasonal fluctuations are roughly constant
in size over time, do not seem to depend on the level of the time series, and the random
fluctuations also seem constant over time.

plot.ts(gift)
In this case, an additive model is not appropriate, since the size of the seasonal and random
fluctuations changes over time and with the level of the time series. It is then appropriate to
transform the time series (here, with a log transform) so that we can model the data with a
classic additive model.

logGift <- log(gift)


plot.ts(logGift)

Decomposing Time Series

Decomposing a time series means separating it into its constituent components, which are often a
trend component and a random component, and, if the data is seasonal, a seasonal component.

Decomposing non-Seasonal Data

Recall that non-seasonal time series consist of a trend component and a random component.
Decomposing the time series involves trying to separate the time series into these individual
components.

One way to do this is using some smoothing method, such as a simple moving average. The SMA()
function in the TTR R package can be used to smooth time series data using a moving average. The
SMA() function takes the order of the moving average as its span argument n; to calculate the
moving average of order 5, we set n = 5.

Let's start with n = 3 to see a clearer picture of the trend component of the Kings dataset.

kingsSMA3 <- SMA(kings, n=3)


plot.ts(kingsSMA3)
It seems like there are still some random fluctuations in the data, so we might want to try a
somewhat larger smoother. Let's try n = 8.

kingsSMA8 <- SMA(kings, n=8)

plot.ts(kingsSMA8)

This is better; we can see that the age at death of English kings declined from ~55 years to ~40
years for a brief period, followed by a rapid increase back up to ages in the 70s.

Decomposing Seasonal Data

A seasonal time series, in addition to the trend and random components, also has a seasonal
component. Decomposing a seasonal time series means separating the time series into these three
components. In R we can use the decompose() function to estimate the three components of the
time series.
Lets estimate the trend, seasonal, and random components of the New York births dataset.

birthsComp <- decompose(births)

birthsComp
## $x
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1946 26.663 23.598 26.931 24.740 25.806 24.364 24.477 23.901 23.175 23.227
## 1947 21.439 21.089 23.709 21.669 21.752 20.761 23.479 23.824 23.105 23.110
## 1948 21.937 20.035 23.590 21.672 22.222 22.123 23.950 23.504 22.238 23.142
## 1949 21.548 20.000 22.424 20.615 21.761 22.874 24.104 23.748 23.262 22.907
## 1950 22.604 20.894 24.677 23.673 25.320 23.583 24.671 24.454 24.122 24.252
## 1951 23.287 23.049 25.076 24.037 24.430 24.667 26.451 25.618 25.014 25.110
## 1952 23.798 22.270 24.775 22.646 23.988 24.737 26.276 25.816 25.210 25.199
## 1953 24.364 22.644 25.565 24.062 25.431 24.635 27.009 26.606 26.268 26.462
## 1954 24.657 23.304 26.982 26.199 27.210 26.122 26.706 26.878 26.152 26.379
## 1955 24.990 24.239 26.721 23.475 24.767 26.219 28.361 28.599 27.914 27.784
## 1956 26.217 24.218 27.914 26.975 28.527 27.139 28.982 28.169 28.056 29.136
## 1957 26.589 24.848 27.543 26.896 28.878 27.390 28.065 28.141 29.048 28.484
## 1958 27.132 24.924 28.963 26.589 27.931 28.009 29.229 28.759 28.405 27.945
## 1959 26.076 25.286 27.660 25.951 26.398 25.565 28.865 30.000 29.261 29.012
## Nov Dec
## 1946 21.672 21.870
## 1947 21.759 22.073
## 1948 21.059 21.573
## 1949 21.519 22.025
## 1950 22.084 22.991
## 1951 22.964 23.981
## 1952 23.162 24.707
## 1953 25.246 25.180
## 1954 24.712 25.688
## 1955 25.693 26.881
## 1956 26.291 26.987
## 1957 26.634 27.735
## 1958 25.912 26.619
## 1959 26.992 27.897
##
## $seasonal
## Jan Feb Mar Apr May Jun
## 1946 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1947 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1948 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1949 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1950 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1951 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1952 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1953 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1954 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1955 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1956 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1957 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1958 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## 1959 -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## Jul Aug Sep Oct Nov Dec
## 1946 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1947 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1948 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1949 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1950 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1951 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1952 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1953 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1954 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1955 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1956 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1957 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1958 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
## 1959 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
##
## $trend
## Jan Feb Mar Apr May Jun Jul
## 1946 NA NA NA NA NA NA 23.98433
## 1947 22.35350 22.30871 22.30258 22.29479 22.29354 22.30562 22.33483
## 1948 22.43038 22.43667 22.38721 22.35242 22.32458 22.27458 22.23754
## 1949 22.06375 22.08033 22.13317 22.16604 22.17542 22.21342 22.27625
## 1950 23.21663 23.26967 23.33492 23.42679 23.50638 23.57017 23.63888
## 1951 24.00083 24.12350 24.20917 24.28208 24.35450 24.43242 24.49496
## 1952 24.27204 24.27300 24.28942 24.30129 24.31325 24.35175 24.40558
## 1953 24.78646 24.84992 24.92692 25.02362 25.16308 25.26963 25.30154
## 1954 25.92446 25.92317 25.92967 25.92137 25.89567 25.89458 25.92963
## 1955 25.64612 25.78679 25.93192 26.06388 26.16329 26.25388 26.35471
## 1956 27.21104 27.21900 27.20700 27.26925 27.35050 27.37983 27.39975
## 1957 27.44221 27.40283 27.44300 27.45717 27.44429 27.48975 27.54354
## 1958 27.68642 27.76067 27.75963 27.71037 27.65783 27.58125 27.49075
## 1959 26.96858 27.00512 27.09250 27.17263 27.26208 27.36033 NA
## Aug Sep Oct Nov Dec
## (the Aug-Dec columns of the trend component are omitted here)
##
## $random
## Jan Feb Mar Apr May
## 1946 NA NA NA NA NA
## 1947 -0.237305288 0.863252404 0.543893429 0.175887019 -0.793193109
## 1948 0.183819712 -0.318705929 0.340268429 0.121262019 -0.354234776
## 1949 0.161444712 0.002627404 -0.571689904 -0.749362981 -0.666068109
## 1950 0.064569712 -0.292705929 0.479560096 1.047887019 1.561973558
## 1951 -0.036638622 1.008460737 0.004310096 0.556595353 -0.176151442
## 1952 0.203153045 0.079960737 -0.376939904 -0.853612981 -0.576901442
## 1953 0.254736378 -0.122955929 -0.224439904 -0.159946314 0.016265224
## 1954 -0.590263622 -0.536205929 0.189810096 1.079303686 1.062681891
## 1955 0.021069712 0.535169071 -0.073439904 -1.787196314 -1.647943109
## 1956 -0.316846955 -0.918039263 -0.155523237 0.507428686 0.924848558
## 1957 -0.176013622 -0.471872596 -0.762523237 0.240512019 1.182056891
## 1958 0.122778045 -0.753705929 0.340851763 -0.319696314 0.021515224
## 1959 -0.215388622 0.363835737 -0.295023237 -0.419946314 -1.115734776
## Jun Jul Aug Sep Oct
## 1946 NA -0.963379006 -0.925718750 -0.939949519 -0.709369391
## 1947 -1.391369391 -0.311879006 0.347739583 0.150592147 0.076797276
## 1948 0.001672276 0.256412660 0.119531250 -0.623449519 0.289547276
## 1949 0.813838942 0.371704327 0.225906250 0.081758814 -0.578161058
## 1950 0.166088942 -0.423920673 -0.467718750 -0.433157853 -0.418577724
## 1951 0.387838942 0.499995994 -0.030385417 -0.116407853 -0.033536058
## 1952 0.538505609 0.414370994 0.206656250 0.025133814 -0.161411058
## 1953 -0.481369391 0.251412660 0.100156250 0.148592147 0.110880609
## 1954 0.380672276 -0.679670673 -0.269052083 -0.550157853 -0.282411058
## 1955 0.118380609 0.550245994 1.029447917 0.768592147 0.359422276
## 1956 -0.087577724 0.126204327 -0.437093750 -0.087907853 0.927213942
## 1957 0.053505609 -0.934587340 -0.592927083 0.724717147 0.030713942
## 1958 0.581005609 0.282204327 0.132572917 0.290758814 -0.171994391
## 1959 -1.642077724 NA NA NA NA
## Nov Dec
## 1946 -0.082484776 -0.298388622
## 1947 0.591098558 0.095819712
## 1948 0.154806891 -0.076221955
## 1949 -0.356859776 -0.761638622
## 1950 -0.679651442 -0.513680288
## (rows for 1951-1958 omitted)
## 1959 NA NA
##
## $figure
## [1] -0.6771947 -2.0829607 0.8625232 -0.8016787 0.2516514 -0.1532556
## [7] 1.4560457 1.1645938 0.6916162 0.7752444 -1.1097652 -0.3768197
##
## $type
## [1] "additive"
##
## attr(,"class")
## [1] "decomposed.ts"

Now lets plot the components .

plot(birthsComp)

Seasonally Adjusting

If you have a seasonal time series, you can seasonally adjust the series by estimating the
seasonal component and subtracting it from the original time series. The adjusted series then
consists only of the trend and random components, as shown below.
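
A minimal sketch of this adjustment, reusing the birthsComp decomposition computed above:

# Subtract the estimated seasonal component from the original series
birthsSeasonAdj <- births - birthsComp$seasonal

# The adjusted series now contains only the trend and random components
plot.ts(birthsSeasonAdj, main = "Seasonally adjusted NY births")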
Task 28:

Write R script to forecast time series data using the single exponential smoothing method.

Exponential smoothing is a quantitative method of forecasting time series data using an
exponential window function. In this method, the weights decrease exponentially: recent values
are given greater weights while older values are given lesser weights.

This method is quite intuitive and is applied to a huge range of time series data. There are many
types of exponential smoothing, but we are only going to discuss simple exponential smoothing in
this recipe.

The simple exponential smoothing technique is used when the time series data has no trend or
seasonal variation. The weight of each observation is determined by the smoothing constant alpha
(α), described below:

𝐹𝑡 = 𝐹(𝑡−1) + α(𝐴(𝑡−1) – 𝐹(𝑡−1)

where:

1. 𝐹_𝑡 = new forecast
2. 𝐹_(𝑡−1) = previous period's forecast
3. α = smoothing constant (0 ≤ α ≤ 1)
4. 𝐴_(𝑡−1) = previous period's actual demand
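
For example, with a smoothing constant α = 0.3, a previous forecast of 100 units and a previous
actual demand of 110 units, the new forecast is 100 + 0.3 × (110 − 100) = 103 units.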

In this recipe, we will carry out simple exponential smoothing on Google stock data from the fpp2 package.

Step 1: Loading the required packages


# for data manipulation
library(tidyverse)
# to get the stock data
install.packages("fpp2")
library(fpp2)

package 'fpp2' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\Divit\AppData\Local\Temp\Rtmp23rv5H\downloaded_packages

Warning message:
"package 'fpp2' was built under R version 3.6.3"
Registered S3 method overwritten by 'xts':
  method     from
  as.zoo.xts zoo
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo
-- Attaching packages -------------------------------------------- fpp2 2.4 --
v forecast  8.13     v expsmooth 2.3
v fma       2.4
Warning message:
"package 'forecast' was built under R version 3.6.3"
Warning message:
"package 'fma' was built under R version 3.6.3"
Warning message:
"package 'expsmooth' was built under R version 3.6.3"
-- Conflicts ------------------------------------------- fpp2_conflicts --
x purrr::flatten() masks jsonlite::flatten()

Step 2: Get the Google stock data


google.train = window(goog, end = 900)
google.tst = window(goog, start = 901)
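
Before smoothing, it can help to look at the training window (a quick sketch; autoplot() comes with the forecast package loaded by fpp2):

# Inspect the training window of the Google closing prices
autoplot(google.train) +
  ggtitle("Google closing stock price (training window)")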

Step 3: Performing Simple Exponential Smoothing

We will use the ses(train_data, alpha, h) function from the forecast package, where h is the number of steps to forecast ahead, to carry out the task.

ses_google = ses(google.train,
alpha = 0.3, # alpha value to be 0.3
h = 100)
autoplot(ses_google)

[Plot: "Forecasts from Simple exponential smoothing"; y-axis: google.train, x-axis: Time]
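
Because the last observations were held back in google.tst, the forecast can also be checked against them; a minimal sketch, assuming ses_google and google.tst from the steps above:

# Compare the 100-step forecast with the held-out test window
accuracy(ses_google, google.tst)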
Task 29

Clustering: Implement K-means algorithm in R.

# Installing Packages
install.packages("ClusterR")
install.packages("cluster")

# Loading package
library(ClusterR)
library(cluster)

# Removing initial label of Species from original dataset
iris_1 <- iris[, -5]

# Fitting K-Means clustering model to the iris dataset
set.seed(240) # Setting seed
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re

# Cluster identification for each observation
kmeans.re$cluster

# Confusion Matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm

# Model evaluation and visualization
plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster,
     main = "K-means with 3 clusters")

## Plotting cluster centers
kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]

# cex controls the symbol size, pch selects the plotting symbol
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)

## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
         y_kmeans,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste("Cluster iris"),
         xlab = 'Sepal.Length',
         ylab = 'Sepal.Width')

Output:

Model kmeans.re:

 Three clusters of sizes 50, 62, and 38 are formed. The ratio of between-cluster sum of squares
to total sum of squares (between_SS / total_SS) is 88.4%, which indicates well-separated clusters.

 Cluster identification: kmeans.re$cluster gives the cluster number assigned to each
observation. Comparing these assignments with the true species labels shows how well the
clusters recover the species (see the purity check after this list).

 Confusion Matrix:
 All 50 Setosa fall in one cluster. Of the 50 Versicolor, 48 fall in the second cluster and 2 in
the third. Of the 50 virginica, 36 fall in the third cluster and 14 in the second, which is why the
cluster sizes are 50, 62, and 38.

 K-means with 3 clusters plot: The scatter plot of Sepal.Length against Sepal.Width shows
the three clusters in three different colors.

 Plotting cluster centers: The cluster centers are drawn over the scatter plot as large asterisks
(pch = 8, cex = 3), one per cluster.
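
As a rough numeric check of how well the clusters recover the species, the dominant species count in each cluster column of the confusion matrix can be summed and divided by the total number of observations (a simple purity measure); a minimal sketch, assuming cm from above:

# Cluster purity: share of observations that belong to the dominant species of their cluster
sum(apply(cm, 2, max)) / sum(cm)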


Task 30

Implement CURE algorithm in R


# CURE algorithm implementation in R

# Function to calculate Euclidean distance between two points
euclidean_distance <- function(point1, point2) {
  sqrt(sum((point1 - point2)^2))
}

# Function to find the index of the closest point to a given point in a set of points
find_closest_point <- function(point, points) {
  distances <- apply(points, 1, function(p) euclidean_distance(point, p))
  return(which.min(distances))
}

# Function to perform a simplified CURE-style clustering
# Note: this sketch only refines num_representatives representative points;
# k is kept in the signature to mirror the usual CURE interface but is not used here
cure_cluster <- function(data, k, num_representatives) {
  # Randomly select initial representatives
  representatives <- data[sample(nrow(data), num_representatives), , drop = FALSE]

  # Assign each point to the closest representative
  assignments <- apply(data, 1, function(point) find_closest_point(point, representatives))

  # Iteratively refine the representatives
  for (iter in 1:10) { # You can adjust the number of iterations based on your data
    # Update representatives based on the mean of the points in each cluster
    for (i in 1:num_representatives) {
      cluster_points <- data[assignments == i, , drop = FALSE]
      if (nrow(cluster_points) > 0) {
        representatives[i, ] <- colMeans(cluster_points, na.rm = TRUE)
      }
    }

    # Reassign points to the updated representatives
    assignments <- apply(data, 1, function(point) find_closest_point(point, representatives))
  }

  return(assignments)
}

# Example usage
set.seed(123) # for reproducibility
data <- matrix(rnorm(200), ncol = 2)
k <- 3
num_representatives <- 5

cluster_assignments <- cure_cluster(data, k, num_representatives)
print(cluster_assignments)
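
To see the resulting groups, the assignments can be plotted over the two simulated columns; a quick sketch using the objects created above:

# Colour each simulated point by the representative it was assigned to
plot(data, col = cluster_assignments, pch = 19,
     main = "Simplified CURE-style clustering",
     xlab = "Dimension 1", ylab = "Dimension 2")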
Beyond the syllabus
1. Fit a multiple linear regression using iris data by considering the response
variable as Sepal.length

lm(iris$Sepal.Length~(iris$Sepal.Width+iris$Petal.Length+iris$Petal.Width))

Output:

Call:
lm(formula = iris$Sepal.Length ~ (iris$Sepal.Width + iris$Petal.Length +
    iris$Petal.Width))

Coefficients:
      (Intercept)   iris$Sepal.Width  iris$Petal.Length   iris$Petal.Width
           1.8560             0.6508             0.7091            -0.5565
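
The same model can be fitted more idiomatically with the data argument, which keeps the coefficient names short; a sketch:

# Multiple linear regression with Sepal.Length as the response
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
summary(fit)   # coefficients, standard errors, R-squared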
2. Random Forest for Classification

install.packages("randomForest")

library(randomForest)

# Load a sample dataset (e.g., Iris dataset)

data(iris)

# Split the dataset into training and testing sets

set.seed(123)

train_indices <- sample(1:nrow(iris), 0.7 * nrow(iris))

train_data <- iris[train_indices, ]

test_data <- iris[-train_indices, ]

# Train a Random Forest model

rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100)

# Make predictions on the test set

predictions <- predict(rf_model, test_data)

# Evaluate the accuracy

accuracy <- sum(predictions == test_data$Species) / nrow(test_data)

print(paste("Accuracy:", round(accuracy, 2)))


3. Decision Tree for Classification and Regression

install.packages("rpart")
library(rpart)
# Load a sample dataset (e.g., Iris dataset)
data(iris)

# Split the dataset into training and testing sets


set.seed(123)
train_indices <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]

# Train a decision tree model


dt_model <- rpart(Species ~ ., data = train_data, method = "class")

# Make predictions on the test set


predictions <- predict(dt_model, test_data, type = "class")

# Evaluate the accuracy


accuracy <- sum(predictions == test_data$Species) / nrow(test_data)
print(paste("Accuracy:", round(accuracy, 2)))
# Load a sample dataset (e.g., mtcars dataset)
data(mtcars)

# Split the dataset into training and testing sets


set.seed(123)
train_indices <- sample(1:nrow(mtcars), 0.7 * nrow(mtcars))
train_data <- mtcars[train_indices, ]
test_data <- mtcars[-train_indices, ]

# Train a decision tree model


dt_model <- rpart(mpg ~ ., data = train_data, method = "anova")

# Make predictions on the test set


predictions <- predict(dt_model, test_data)

# Evaluate the performance (e.g., Mean Squared Error)


mse <- mean((predictions - test_data$mpg)^2)
print(paste("Mean Squared Error:", round(mse, 2)))
Viva Questions
1. What is R programming language?
- Answer: R is a programming language and environment specifically designed
for statistical computing and data analysis.

2. Explain the significance of data visualization in R.


- Answer:Data visualization in R helps to represent complex data in a graphical
or pictorial format, making it easier to understand patterns, trends, and insights
within the data.

3. What is the difference between a data frame and a matrix in R?


- Answer: A data frame is a 2-dimensional data structure where columns can be of
different data types, while a matrix is a 2-dimensional data structure where all
elements must be of the same data type.

4. How can you read data from a CSV file into R?


- Answer: The `read.csv()` function in R is commonly used to read data from
a CSV file into a data frame.

5. Explain the concept of missing values in R. How can you handle them?
- Answer: Missing values in R are represented by `NA`. Handling methods
include removing missing values using `na.omit()`, filling them with a specific value
using `na.fill()`, or interpolating values using methods like `na.approx()` (the latter
two functions come from the zoo package).
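
A small illustration of these options on a toy vector (a sketch; `na.approx()` assumes the zoo package is installed):

x <- c(4, NA, 7, NA, 10)

# Drop the missing values
na.omit(x)

# Replace missing values with a constant (base R)
x_filled <- x
x_filled[is.na(x_filled)] <- 0

# Linearly interpolate the missing values (zoo package)
library(zoo)
na.approx(x)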

6. What is ggplot2, and how is it useful in data visualization?


- Answer: ggplot2 is a powerful data visualization package in R. It follows the
Grammar of Graphics principles, allowing users to create complex and
customized plots with ease.
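
For example, a minimal ggplot2 scatter plot of the iris data might look like this (a sketch):

library(ggplot2)

# Sepal length vs petal length, coloured by species
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()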

7. How can you perform descriptive statistics in R?


- Answer: Descriptive statistics can be computed using functions like `mean()`,
`median()`, `sd()` for standard deviation, `summary()` for a summary of statistics,
and `table()` for frequency tables.
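
For instance, on the iris data these functions can be used as follows (a short sketch):

mean(iris$Sepal.Length)     # arithmetic mean
median(iris$Sepal.Length)   # median
sd(iris$Sepal.Length)       # standard deviation
summary(iris$Sepal.Length)  # min, quartiles, mean, max
table(iris$Species)         # frequency table of a categorical variable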

8. Explain the concept of hypothesis testing in the context of R.


- Answer: Hypothesis testing in R involves using functions like `t.test()` or
`wilcox.test()` to assess whether observed data provides enough evidence to reject a
null hypothesis.
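
As a sketch, a two-sample t-test comparing the mean sepal length of two iris species could be run as:

# Two-sample t-test on sepal length for setosa vs versicolor
t.test(Sepal.Length ~ Species,
       data = droplevels(subset(iris, Species %in% c("setosa", "versicolor"))))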

9. What is the purpose of the dplyr package in R?


- Answer: The dplyr package is used for data manipulation tasks. It provides a set
of functions like `filter()`, `select()`, `mutate()`, etc., for efficient and readable data
manipulation.
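
A minimal dplyr pipeline on the iris data might look like this (a sketch):

library(dplyr)

# Mean sepal length per species, for flowers with petals wider than 1 cm
iris %>%
  filter(Petal.Width > 1) %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))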

10. How can you install and load a package in R?


- Answer: To install a package, use `install.packages("package_name")`. To load
a package, use `library(package_name)`.

11. Question: What is R and why is it commonly used for data analytics?

Answer: R is a programming language and environment specifically designed for


statistical computing and graphics. It is widely used in data analytics due to its
extensive statistical and graphical capabilities, open-source nature, and a vast
collection of packages for data manipulation and analysis.

12. Question: Explain the difference between data frames and matrices in R.

Answer: In R, matrices can only hold one data type, while data frames can store
different data types. Data frames are more flexible for handling heterogeneous data
as they can accommodate both numeric and character data, making them suitable
for real-world datasets.

13. Question: What is the purpose of the "tidyverse" in R, and name some of its
core packages.

Answer: The "tidyverse" is a collection of R packages designed for data science.


Core packages include dplyr for data manipulation, ggplot2 for data visualization,
and tidyr for data tidying. It promotes a consistent and efficient workflow in data
analysis.

14. Question: How can you handle missing values in a dataset using R?

Answer: In R, you can handle missing values using functions like `na.omit()` to
remove observations with missing values, `complete.cases()` to identify
complete cases, or `na.fill()` (from the zoo package) to replace missing values with a specified value.

15. Question: What is the significance of the ggplot2 package in R for


data visualization?

Answer: The ggplot2 package is a powerful tool for creating a wide variety of static
and dynamic visualizations in R. It follows the Grammar of Graphics principles,
providing a consistent and intuitive syntax for creating complex plots, making it
highly effective for exploratory data analysis.
