
Mini Project – Factor Hair Analysis
Sravanthi.M

Table of Contents
1. Project Objective...............................................................................................................................3
2. Assumptions......................................................................................................................................3
3. Exploratory Data Analysis – Step by step approach...........................................................................3
3.1. Environment Set up and Data Import........................................................................................3
3.1.1.Install necessary Packages and Invoke Libraries.................................................................3
3.1.2.Set up working Directory....................................................................................................3
3.1.3.Import and Read the Dataset.............................................................................................4
3.2. Variable Identification................................................................................................................4
4. Conclusion.........................................................................................................................................5
5. Detailed Explanation of Findings........................................................................................................5

1. Perform exploratory data analysis on the dataset. Showcase some charts and graphs. Check for
outliers and missing values.

1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs

1.2 EDA - Check for Outliers and missing values and check the summary of the dataset

2. Is there evidence of multicollinearity? Showcase your analysis

3. Perform simple linear regression for the dependent variable with every independent variable.

4. Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the factors.

4.1 Perform PCA/FA and Interpret the Eigen Values (apply Kaiser Normalization Rule)

4.2 Output interpretation: explain why only 4 factors are asked for in the question and whether
choosing 4 factors is correct. Name the factors with proper explanations.

5. Perform multiple linear regression with customer satisfaction as the dependent variable and the
four factors as independent variables. Comment on the model output and validity. Your remarks
should make it meaningful for everybody.

5.1 Create a data frame with a minimum of 5 columns, 4 of which are different factors and the
5th column is Customer Satisfaction

5.2 Perform Multiple Linear Regression with Customer Satisfaction as the Dependent Variable
and the four factors as Independent Variables

5.3 MLR summary interpretation and significance (R, R², Adjusted R², Degrees of Freedom,
F-statistic, coefficients along with p-values)

5.4 Output Interpretation <making it meaningful for everybody>

6. Source Code
1 Project Objective
The objective of the report is to explore the Factor Hair data in R and generate insights about the
data set. This exploration report will consist of the following:

 Importing the dataset in R


 Understanding the structure of dataset
 Graphical exploration
 Descriptive statistics
 Insights from the dataset

2 Assumptions
 Is there evidence of multicollinearity?
 Perform factor analysis by extracting four factors.
 Name four factors.
 Perform multiple linear regression with customer satisfaction as the dependent variable
and the four factors as independent variables.

3 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Environment Set up and Data Import


2. Check Multicollinearity
3. Factor analysis
4. Four factors Identification
5. Feature Exploration
6. The dataset has 12 variables used for market segmentation in the context of product
service management. The variables and their expanded names are mentioned below

We shall follow these steps in exploring the provided dataset.

3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Having all
the packages in the same place improves code readability. For installation we use
install.packages("Package name")
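As an illustrative sketch, the package names loaded later in this report (section 6) can be gathered into one vector so installation and loading stay in one place; the install line is left commented because it should run only once per machine:

```r
# Sketch: the packages this report relies on (names taken from section 6).
pkgs <- c("corrplot", "tidyverse", "psych", "car", "caTools")
# install.packages(pkgs)                                    # run once per machine
# invisible(lapply(pkgs, library, character.only = TRUE))   # then load them
```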

3.1.2 Set up working Directory


Setting a working directory at the start of the R session makes importing and exporting data
files and code files easier. Basically, the working directory is the location/folder on the PC where
you keep the data, code, etc. related to the project. For setting it up and checking it we use the
syntax below:
Syntax → setwd() & getwd()

Please refer 6 for Source Code.


3.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the
file.

Please refer 6 for Source Code.

3.2 Variable Identification


We are using:
 setwd() : sets the working directory

 getwd() : returns an absolute file path representing the current working directory

 dim() : returns the dimensions (the number of rows and columns)

 str() : displays the internal structure of the data, variable by variable

 names() : returns the names of the columns

 summary() : a generic function used to produce result summaries; it invokes particular
methods depending on the class of the first argument

 attach() : attaches the data so columns can be referenced by name

 hist() : plots a histogram

 boxplot() : plots a boxplot

4 Conclusion
From the given problem we have seen how factor analysis can be used to reduce the
dimensionality of a dataset, after which multiple linear regression is run on the dimensionally
reduced columns for further analysis/prediction. The points covered are:
1. Checked for multicollinearity
2. Performed factor analysis
3. Named the factors - Sales.Distri, Marketing, After.Sales.Service, Value.For.Money
4. Performed multiple linear regression with Cust.Satisf (customer satisfaction) as the dependent
variable and Sales.Distri, Marketing, After.Sales.Service, Value.For.Money as independent variables.

5 Detailed Explanation of Findings


1. Perform exploratory data analysis on the dataset. Showcase some charts and graphs. Check for
outliers and missing values.

1.1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs


1.2 EDA - Check for outliers and missing values and check the summary of the dataset.
Ans: For the basic data summary we first import the data. As mentioned in 3.2, we use the
functions listed there to analyze the data.

## Setting up and getting the working directory

setwd("D:/College Data/Advance stats/Project")

getwd()

## Reading the file

Factorhair <- read.csv("Hair.csv",header = TRUE)

## Variable names

variables <- c("Product Quality", "E-Commerce", "Technical Support",
               "Complaint Resolution", "Advertising", "Product Line",
               "Salesforce Image", "Competitive Pricing",
               "Warranty & Claims", "Order & Billing", "Delivery Speed",
               "Customer Satisfaction")

## Checking dimensions of the data

dim(Factorhair)

## names of the columns

names(Factorhair)

## structure of the data

str(Factorhair)

## summary of the data

summary(Factorhair)

Output:

From the summary we notice that the first column, "ID", is just the row number and is not
required further; hence we remove it and rename the dataset hair.
 We need to find missing values
Syntax: sum(is.na(hair))
Output:

Graphical representation of Factor Hair Data set

 Histogram of dependent variable (Customer satisfaction)


Syntax:
hist(`Customer Satisfaction`, breaks = c(0:11), labels = TRUE,
     include.lowest = TRUE, right = TRUE,
     col = "blue", border = "green",
     main = "Histogram of Customer Satisfaction",
     xlab = "Customer Satisfaction", ylab = "Count",
     xlim = c(0,11), ylim = c(0,35))

 Box plot of dependent variable (Customer satisfaction)
Syntax: boxplot(`Customer Satisfaction`, horizontal = TRUE, xlab = variables[12],
col = "pink", border="blue",ylim = c(0,11))

 Histogram of the independent variable


Syntax:
par(mfrow = c(3,4))   # split plotting space into 12 panels
for (i in 1:11) {
  h <- round(max(hair[,i]), 0) + 1
  l <- round(min(hair[,i]), 0) - 1
  n <- variables[i]
  hist(hair[,i], breaks = seq(l, h, (h - l)/6), labels = TRUE,
       include.lowest = TRUE, right = TRUE,
       col = "pink", border = "blue",
       main = NULL, xlab = n, ylab = NULL,
       cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1,
       xlim = c(0,11), ylim = c(0,70))
}

 Boxplot of independent variables
par(mfrow = c(2,1))
boxplot(hair[,-12], las = 2, names = variables[-12], col = "blue", border = "pink", cex.axis = 1)

 Bivariate Analysis - Scatter Plot of independent variables against the dependent variable
Syntax:
par(mfrow = c(3,3))
for (i in 1:11) {
  plot(hair[,i], `Customer Satisfaction`, xlab = variables[i], ylab = NULL,
       col = "red", cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1,
       xlim = c(0,10), ylim = c(0,10))
  abline(lm(`Customer Satisfaction` ~ hair[,i]), col = "blue")
}

 Finding Outliers in variables
Syntax:
OutLiers <- hair[1:12, ]            # placeholder frame to collect outliers
for (i in 1:12) {
  Box_Plot <- boxplot(hair[,i], plot = FALSE)$out
  OutLiers[,i] <- NA
  if (length(Box_Plot) > 0) {
    OutLiers[1:length(Box_Plot), i] <- Box_Plot
  }
}

OutLiers <- OutLiers[(1:6),]

# Write outliers list in csv

write.csv(OutLiers, "OutLiers.csv")
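As a side note, base R's boxplot.stats() returns the points beyond the whiskers directly, which can shorten the loop above; a toy sketch on illustrative data (not the Factor Hair dataset):

```r
# boxplot.stats() reports the values outside the whiskers without plotting.
set.seed(1)
v <- c(rnorm(98), 9, -9)          # two planted outliers
out <- boxplot.stats(v)$out       # contains 9 and -9
```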

2. Is there evidence of multicollinearity? Showcase your analysis


Ans: First we create the correlation matrix and plot it for the Factor Hair dataset.
Then we check the multicollinearity of the independent variables using VIF.
Syntax:
## Create correlation matrix
corlnMtrx <- cor(hair[,-12])
corlnMtrx

## Correlation Plot for Data hair.


corrplot.mixed(corlnMtrx, lower = "number", upper = "pie", tl.col = "black",tl.pos = "lt")

## Check multicollinearity in independent variables using VIF


vifmatrix <- vif(lm(`Customer Satisfaction` ~., data = hair))
vifmatrix
write.csv(vifmatrix, "vifmatrix.csv")

Variable                  VIF
Product Quality        1.6358
E-Commerce             2.7567
Technical Support      2.9768
Complaint Resolution   4.7304
Advertising            1.5089
Product Line           3.4882
Salesforce Image       3.4394
Competitive Pricing    1.6350
Warranty & Claims      3.1983
Order & Billing        2.9030
Delivery Speed         6.5160
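A common rule of thumb (an assumption here, not stated in the report) treats a VIF above 5 as a flag for multicollinearity, so Delivery Speed (6.52) stands out. The definition behind car::vif can be sketched by hand on toy data:

```r
# Hand-rolled VIF on toy data: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes
# from regressing predictor j on all the other predictors.
set.seed(1)
d <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.3)   # x2 deliberately collinear with x1
vif_manual <- sapply(names(d), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(d), v), v), data = d))$r.squared
  1 / (1 - r2)
})
vif_manual                            # x1 and x2 large, x3 near 1
```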

3. Perform simple linear regression for the dependent variable with every independent variable.
Ans: From the above correlation matrix we perform the Bartlett test. If the p-value is less than
0.05, the data is a good candidate for dimension reduction.
Syntax: cortest.bartlett(corlnMtrx, 100)
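The report answers this question with the Bartlett test above; the per-variable simple regressions themselves can be sketched in a loop. Toy data and hypothetical names here — on the real data the frame would be hair and the response the Customer Satisfaction column:

```r
# Sketch: regress the dependent variable on each predictor separately and
# collect the slope estimate and its p-value (toy data, hypothetical names).
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.8 * d$x1 + rnorm(100)
slr <- t(sapply(c("x1", "x2"), function(v) {
  fit <- summary(lm(reformulate(v, "y"), data = d))
  fit$coefficients[v, c("Estimate", "Pr(>|t|)")]
}))
slr                                   # one row per predictor
```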

4. Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the factors.

4.1 Perform PCA/FA and interpret the eigenvalues (apply the Kaiser Normalization Rule)
4.2 Output interpretation: explain why only 4 factors are asked for in the question and whether
choosing 4 factors is correct. Name the factors with proper explanations.
Ans: Kaiser-Meyer-Olkin (KMO) Test is a measure of how suited your data is for Factor Analysis.
Syntax: KMO(corlnMtrx)

 The KMO statistic of 0.65 is also large (greater than 0.50). Hence Factor Analysis is considered as
an appropriate technique for further analysis of the data.
 Calculate the Eigen values for the variables

Syntax:
A <- eigen(corlnMtrx)
EV <- A$values
EV

plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values", pch = 20, col = "blue")
lines(EV, col = "red")
abline(h = 1, col = "green", lty = 2)

 By the Kaiser rule, only factors with eigenvalues greater than 1 are retained.

 Hence from the above scree plot we consider only 4 factors out of the 11 variables.
 Factor names are as follows: Sales.Distri, Marketing, After.Sales.Service, Value.For.Money
 Sales.Distri – Delivery Speed, Complaint Resolution, and Order & Billing are grouped as one
factor because all three relate to purchasing the product, from placing the order to billing and
delivery.
 Marketing – Salesforce Image, E-Commerce, and Advertising are grouped as one factor because
these variables relate to sales and spending on advertising.
 After.Sales.Service – Technical Support and Warranty & Claims are grouped as one factor
because post-purchase support is covered here.
 Value.For.Money – Competitive Pricing, Product Line, and Product Quality are grouped as one factor.
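The Kaiser rule above can be illustrated on toy data: the eigenvalues of a correlation matrix always sum to the number of variables, so a factor with an eigenvalue above 1 explains more variance than a single original variable (an illustrative sketch, not the Factor Hair data):

```r
# Eigenvalues of a correlation matrix sum to the number of variables;
# the Kaiser rule keeps only factors whose eigenvalue exceeds 1.
set.seed(1)
x <- matrix(rnorm(500), ncol = 5)     # 100 observations, 5 variables
ev <- eigen(cor(x))$values
n_keep <- sum(ev > 1)                 # factors retained by the Kaiser rule
```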

5. Perform multiple linear regression with customer satisfaction as the dependent variable and the
four factors as independent variables. Comment on the model output and validity. Your remarks
should make it meaningful for everybody.

5.1 Create a data frame with a minimum of 5 columns, 4 of which are different factors and the
5th column is Customer Satisfaction
5.2 Perform Multiple Linear Regression with Customer Satisfaction as the Dependent Variable
and the four factors as Independent Variables
5.3 MLR summary interpretation and significance (R, R², Adjusted R², Degrees of Freedom,
F-statistic, coefficients along with p-values)
5.4 Output Interpretation
Ans: As per the above scree plot we extract 4 factors from the 11 variables.
 Without rotation
Syntax:
FourFactor = fa(r= hair[,-12], nfactors =4, rotate ="none", fm ="pa")
print(FourFactor)

Loading <- print(FourFactor$loadings,cutoff = 0.3)

write.csv(Loading, "loading.csv")

                        PA1      PA2      PA3      PA4
Product Quality      0.2013  -0.4080  -0.0581   0.4626
E-Commerce           0.2901   0.6592   0.2700   0.2159
Technical Support    0.2777  -0.3808   0.7381  -0.1663
Complaint Resolution 0.8623   0.0117  -0.2553  -0.1840
Advertising          0.2861   0.4572   0.0824   0.1288
Product Line         0.6895  -0.4534  -0.1424   0.3148
Salesforce Image     0.3945   0.8007   0.3458   0.2508
Competitive Pricing -0.2316   0.5530  -0.0444  -0.2861
Warranty & Claims    0.3793  -0.3245   0.7355  -0.1530
Order & Billing      0.7470   0.0208  -0.1752  -0.1809
Delivery Speed       0.8951   0.0983  -0.3035  -0.1976

fa.diagram(FourFactor)

 With varimax rotation

Syntax:
FourFactor1 = fa(r = hair[,-12], nfactors = 4, rotate = "varimax", fm = "pa")
print(FourFactor1)

Loading1 <- print(FourFactor1$loadings,cutoff = 0.3)

write.csv(Loading1, "Loading1.csv")

                        PA1      PA2      PA3      PA4
Product Quality      0.0240  -0.0700   0.0157   0.6470
E-Commerce           0.0676   0.7874   0.0279  -0.1132
Technical Support    0.0198  -0.0252   0.8832   0.1164
Complaint Resolution 0.8977   0.1295   0.0535   0.1317
Advertising          0.1662   0.5300  -0.0429  -0.0624
Product Line         0.5255  -0.0353   0.1273   0.7118
Salesforce Image     0.1154   0.9715   0.0635  -0.1345
Competitive Pricing -0.0757   0.2129  -0.2089  -0.5904
Warranty & Claims    0.1026   0.0566   0.8851   0.1280
Order & Billing      0.7682   0.1267   0.0882   0.0887
Delivery Speed       0.9487   0.1852  -0.0049   0.0874

fa.diagram(FourFactor1)

 Create a new data frame using the scores for the four factors and the dependent variable

hair1 <- cbind(hair[,12],FourFactor1$scores)

 Check head of the data

head(hair1)

 Name the columns for hair1

colnames(hair1) <-
c("Cust.Satisf","Sales.Distri","Marketing","After.Sales.Service","Value.For.Money")

 Check head of the data


head(hair1)

 Check class of the hair1

class(hair1)

 convert matrix to data.frame

hair1 <- as.data.frame(hair1)

 Correlation plot for the data hair1

corrplot.mixed(cor(hair1),lower = "number", upper = "pie", tl.col = "black",tl.pos = "lt")

 setting the seed for reproducibility

set.seed(1)

 creating two datasets, one to train the model and another to test it

spl = sample.split(hair1$Cust.Satisf, SplitRatio = 0.8)

Train = subset(hair1, spl==TRUE)

Test = subset(hair1, spl==FALSE)

 check dimensions of the Train and Test data

cat(" Train Dimension: ", dim(Train), "\n", "Test Dimension : ", dim(Test))
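For reference, the same 80/20 split can be done in base R without caTools — a sketch on row indices (caTools::sample.split additionally tries to preserve the outcome's distribution across the split, which this simple version does not):

```r
# Base-R 80/20 split of row indices (no outcome balancing, unlike
# caTools::sample.split).
set.seed(1)
n <- 100
train_rows <- sort(sample(n, size = 0.8 * n))
test_rows  <- setdiff(seq_len(n), train_rows)
```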

linearModel = lm(Cust.Satisf ~., data = Train)

summary(linearModel)

vif(linearModel)

pred = predict(linearModel, newdata = Test)

 Compute R-squared for the test data

 Check SST - total sum of squares

SST = sum((Test$Cust.Satisf - mean(Train$Cust.Satisf))^2)

 Check SSE - sum of squared deviations of actual values from predicted values

SSE = sum((pred - Test$Cust.Satisf)^2)

 check SSR - sum of squared deviations of predicted values (predicted using regression)

SSR = sum((pred - mean(Train$Cust.Satisf))^2)

R.square.test <- SSR/SST

cat(" SST :", SST, "\n", "SSE :", SSE, "\n","SSR :", SSR, "\n","R squared Test :" , R.square.test)
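One caveat worth noting: the report defines test R² as SSR/SST, while a common alternative for held-out data is 1 − SSE/SST, which penalizes prediction error directly (the two agree only in-sample for OLS with an intercept). A toy sketch of the error-based version on illustrative data:

```r
# Toy sketch of the error-based out-of-sample R^2 (illustrative data only).
set.seed(1)
train <- data.frame(x = rnorm(80)); train$y <- 2 * train$x + rnorm(80)
test  <- data.frame(x = rnorm(20)); test$y  <- 2 * test$x  + rnorm(20)
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)
SST  <- sum((test$y - mean(train$y))^2)
SSE  <- sum((pred - test$y)^2)
r2_alt <- 1 - SSE / SST               # error-based definition
```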

6 Source Code
## Setting up and getting the working directory

setwd("D:/College Data/Advance stats/Project")

getwd()

##Importing packages

library(corrplot)
install.packages("tidyverse")
library(tidyverse)
library(ggplot2)
install.packages("psych")
library(psych)
library(car)
install.packages("caTools")
library(caTools)

## Reading the file

Factorhair <- read.csv("Hair.csv",header = TRUE)

## Variable names

variables <- c("Product Quality", "E-Commerce", "Technical Support",
               "Complaint Resolution", "Advertising", "Product Line",
               "Salesforce Image", "Competitive Pricing",
               "Warranty & Claims", "Order & Billing", "Delivery Speed",
               "Customer Satisfaction")

## Checking dimensions of the data

dim(Factorhair)

## names of the columns

names(Factorhair)

## structure of the data

str(Factorhair)

## summary of the data

summary(Factorhair)

## Creating new data set with hair name and removing column ID

hair <- Factorhair[,-1]

dim(hair)

## changing names of the columns

colnames(hair) <-variables

summary(hair)

## attaching the data

attach(hair)

hair

## check whether any missing values are present

sum(is.na(hair))

##Histogram of dependent variable(Customer satisfaction)

hist(`Customer Satisfaction`, breaks = c(0:11), labels = TRUE,
     include.lowest = TRUE, right = TRUE,
     col = "blue", border = "green",
     main = "Histogram of Customer Satisfaction",
     xlab = "Customer Satisfaction", ylab = "Count",
     xlim = c(0,11), ylim = c(0,35))

## box plot of dependent variable (Customer satisfaction)

boxplot(`Customer Satisfaction`, horizontal = TRUE, xlab = variables[12],
        col = "pink", border = "blue", ylim = c(0,11))

##Histogram of the independent variable

par(mfrow = c(3,4))   # split plotting space into 12 panels

for (i in 1:11) {
  h <- round(max(hair[,i]), 0) + 1
  l <- round(min(hair[,i]), 0) - 1
  n <- variables[i]
  hist(hair[,i], breaks = seq(l, h, (h - l)/6), labels = TRUE,
       include.lowest = TRUE, right = TRUE,
       col = "pink", border = "blue",
       main = NULL, xlab = n, ylab = NULL,
       cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1,
       xlim = c(0,11), ylim = c(0,70))
}

## Boxplot of independent variables

par(mfrow = c(2,1))
boxplot(hair[,-12], las = 2, names = variables[-12], col = "blue",
border = "pink", cex.axis = 1)

## Bivariate Analysis

##Scatter Plot of independent variables against the dependent variable

par(mfrow = c(3,3))

for (i in 1:11) {
  plot(hair[,i], `Customer Satisfaction`, xlab = variables[i], ylab = NULL,
       col = "red", cex.lab = 1, cex.axis = 1, cex.main = 1, cex.sub = 1,
       xlim = c(0,10), ylim = c(0,10))
  abline(lm(`Customer Satisfaction` ~ hair[,i]), col = "blue")
}

## Finding Outliers in variables

OutLiers <- hair[1:12, ]            # placeholder frame to collect outliers
for (i in 1:12) {
  Box_Plot <- boxplot(hair[,i], plot = FALSE)$out
  OutLiers[,i] <- NA
  if (length(Box_Plot) > 0) {
    OutLiers[1:length(Box_Plot), i] <- Box_Plot
  }
}

OutLiers <- OutLiers[(1:6),]

# Write outliers list in csv

write.csv(OutLiers, "OutLiers.csv")

## Create correlation matrix

corlnMtrx <- cor(hair[,-12])

corlnMtrx

## Correlation Plot for Data hair.

corrplot.mixed(corlnMtrx, lower = "number", upper = "pie",
               tl.col = "black", tl.pos = "lt")

## Check multicollinearity in independent variables using VIF

vifmatrix <- vif(lm(`Customer Satisfaction` ~., data = hair))


vifmatrix
write.csv(vifmatrix, "vifmatrix.csv")

## Check corlnMtrx with the Bartlett Test

cortest.bartlett(corlnMtrx, 100)

# If the p-value is less than 0.05, the data is a good candidate for
# dimension reduction.

## Kaiser-Meyer-Olkin (KMO) Test is a measure of how suited the data is
## for Factor Analysis.

KMO(corlnMtrx)

## Calculate the Eigen values for the variables

A <- eigen(corlnMtrx)

EV <- A$values

EV

plot(EV, main = "Scree Plot", xlab = "Factors", ylab = "Eigen Values",
     pch = 20, col = "blue")

lines(EV, col = "red")

abline(h = 1, col = "green", lty = 2)


## As per the above scree plot extracting 4 factors from 11 variables

## Without rotation

FourFactor = fa(r= hair[,-12], nfactors =4, rotate ="none", fm ="pa")

print(FourFactor)

Loading <- print(FourFactor$loadings,cutoff = 0.3)

write.csv(Loading, "loading.csv")

fa.diagram(FourFactor)

## With varimax rotation

FourFactor1 = fa(r = hair[,-12], nfactors = 4, rotate = "varimax", fm = "pa")

print(FourFactor1)

Loading1 <- print(FourFactor1$loadings,cutoff = 0.3)

write.csv(Loading1, "Loading1.csv")

fa.diagram(FourFactor1)

## Create a new data frame using the scores for the four factors and the
## dependent variable

hair1 <- cbind(hair[,12],FourFactor1$scores)

##Check head of the data

head(hair1)

## Name the columns for hair1

colnames(hair1) <- c("Cust.Satisf", "Sales.Distri", "Marketing",
                     "After.Sales.Service", "Value.For.Money")

##Check head of the data

head(hair1)

##Check class of the hair1

class(hair1)

# convert matrix to data.frame

hair1 <- as.data.frame(hair1)

## Correlation plot for the data hair1

corrplot.mixed(cor(hair1), lower = "number", upper = "pie",
               tl.col = "black", tl.pos = "lt")

## setting the seed for reproducibility

set.seed(1)

## creating two datasets, one to train the model and another to test it

spl = sample.split(hair1$Cust.Satisf, SplitRatio = 0.8)

Train = subset(hair1, spl==TRUE)

Test = subset(hair1, spl==FALSE)

## check dimensions of the Train and Test data

cat(" Train Dimension: ", dim(Train), "\n", "Test Dimension : ", dim(Test))

linearModel = lm(Cust.Satisf ~., data = Train)

summary(linearModel)

vif(linearModel)

pred = predict(linearModel, newdata = Test)

## Compute R-squared for the test data

## Check SST - total sum of squares

SST = sum((Test$Cust.Satisf - mean(Train$Cust.Satisf))^2)

## Check SSE - sum of squared deviations of actual values from predicted values

SSE = sum((pred - Test$Cust.Satisf)^2)

## Check SSR - sum of squared deviations of predicted values (predicted using
## regression)

SSR = sum((pred - mean(Train$Cust.Satisf))^2)

R.square.test <- SSR/SST

cat(" SST :", SST, "\n", "SSE :", SSE, "\n", "SSR :", SSR, "\n",
    "R squared Test :", R.square.test)
