Data Cleaning

DATA CLEANING ON AIR QUALITY DATASET
a. Load Air Quality dataset

b. Check all the observations with missing values
c. Check the outliers with boxplot
d. Clean the data by removing outliers and treat missing values.
e. Impute the missing values in the original dataset with "mean" of the respective variables
#airquality is a dataset from the package "datasets"

>airqual=airquality
# summary to check the missing values, mean and median of the data distributions
>summary(airquality)
# we can see there are 37 observations missing in Ozone for the variable and 7 observation missing in
Solar.R
# Checking all the obsrvations with missing values
>airqual[!complete.cases(airqual),]
# Checking the outliers with boxplot

>boxplot(airqual)
# we can see the variables Ozone and Wind has outliers.
# Steps to clean the data by removing outliers and treating missing values.
# a) Mean is badly affected by outliers, hence to impute missing values with mean we have to remove outliers.
# b) The outliers are removed and data is stored in new dataset called Updated_airqual.
# c) The purpose of Updated_airqual dataset is to impute missing values in airqual dataset.
# d) Impute the missing values in airqual dataset with respective mean from Updated_airqual dataset.
# e) Now airqual dataset has no missing values, so we can remove outliers as we did in the step-b
# f) Data is ready for analysis.
# Lets exclude the outliers for this two variables

>boxplot(airqual$Ozone,horizontal = TRUE)
# we see that observation above 120 could be outlier, hence we will consider all below 120
# Similarly for ozone

>boxplot(airqual$Wind,horizontal = TRUE)
# We can see that observations above 17 are outliers, hence we will consider all below 17
# therefore removing the outliers

>Updated_airqual=subset(airqual,Ozone<130 & Wind<17)
# lets check if all the outliers are removed

>boxplot(Updated_airqual)
# Using subset will also remove all the NA's present in the variable selected hence reducing the dataset.
# So after removing the outliers we have to fill missing values in the original dataset("airquality") to restore.
# Let impute the missing values in the original dataset with "mean" of the respective variables.
>airqual$Ozone[is.na(airqual$Ozone)]<-mean(Updated_airqual$Ozone)
# Check summary if their are any mising values in airquality data

>summary(airqual)
# There are no NA's for Ozone.
# Similarly we have to do the same with Wind variable.

>airqual$Solar.R[is.na(airqual$Solar.R)]<-mean(Updated_airqual$Solar.R,na.rm = TRUE)
# Check summary if all the missing values are treated with mean.
>summary(airqual)
# Now we remove the outliers from the airqual dataset.

>data_airquality=subset(airqual,Ozone<70 & Wind<17)
>boxplot(data_airquality)
>nrow(data_airquality)
>data_airquality=subset(airqual,Ozone<70 & Wind<17 & Wind>2)

>boxplot(data_airquality)
PRINCIPAL COMPONENT ANALYSIS (PCA)
Perform Multivariate Analysis using PCA on IRIS data set for developing a predictive model.
#Source Code
# Load the iris dataset
data("iris")
# Compactly displaying the internal structure of a dataset

str(iris)
#The summary function returned descriptive statistics

summary(iris)
# Partition Data
set.seed(111)
# sample() function allows to take a random sample of elements from a dataset

ind <- sample(2, nrow(iris),
replace = TRUE,
prob = c(0.8, 0.2))
training <- iris[ind==1,]
testing <- iris[ind==2,]
# Scatter Plot & Correlations

library(psych)
# pairs.panels [in psych package] is used to create a scatter plot of matrices
pairs.panels(training[,-5],
gap = 0,
bg = c("red", "yellow", "blue")[training$Species],
pch=21)
# Principal Component Analysis

pc <- prcomp(training[,-5],
center = TRUE,
scale. = TRUE)
attributes(pc)
pc$center
pc$scale
print(pc)
summary(pc)
# Orthogonality of PCs
pairs.panels(pc$x,
gap=0,
bg = c("red", "yellow", "blue")[training$Species],
pch=21)
# Bi-Plot
library(devtools)
# install_github("ggbiplot", "vqv")
library(ggbiplot)
g <- ggbiplot(pc,
obs.scale = 1,
var.scale = 1,
groups = training$Species,
ellipse = TRUE,
circle = TRUE,
ellipse.prob = 0.68)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)
# Prediction with Principal Components

trg <- predict(pc, training)
trg <- data.frame(trg, training[5])
tst <- predict(pc, testing)
tst <- data.frame(tst, testing[5])
# Multinomial Logistic regression with First Two PCs

library(nnet)
trg$Species <- relevel(trg$Species, ref = "setosa")
mymodel <- multinom(Species~PC1+PC2, data = trg)
summary(mymodel)
# Confusion Matrix & Misclassification Error - training

p <- predict(mymodel, trg)
tab <- table(p, trg$Species)
tab
1 - sum(diag(tab))/sum(tab)
# Confusion Matrix & Misclassification Error - testing

p1 <- predict(mymodel, tst)
tab1 <- table(p1, tst$Species)
tab1
1 - sum(diag(tab1))/sum(tab1)

Data Cleaning

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Cleaning

Uploaded by

Copyright:

Available Formats

DATA CLEANING ON AIR QUALITY DATASET

a. Load Air Quality dataset

#airquality is a dataset from the package "datasets"

# Checking the outliers with boxplot

# Lets exclude the outliers for this two variables

# Similarly for ozone

# therefore removing the outliers

# lets check if all the outliers are removed

# Check summary if their are any mising values in airquality data

# There are no NA's for Ozone.

# Similarly we have to do the same with Wind variable.

# Now we remove the outliers from the airqual dataset.

>data_airquality=subset(airqual,Ozone<70 & Wind<17 & Wind>2)

# Compactly displaying the internal structure of a dataset

#The summary function returned descriptive statistics

# sample() function allows to take a random sample of elements from a dataset

# Scatter Plot & Correlations

# Principal Component Analysis

# Prediction with Principal Components

# Multinomial Logistic regression with First Two PCs

# Confusion Matrix & Misclassification Error - training

# Confusion Matrix & Misclassification Error - testing

You might also like