ADVANCED STATISTICS - PROJECT

by: Vivek

Data Insights:
> names(Data)

[1] "ID" "ProdQual" "Ecom" "TechSup" "CompRes" "Advertising"


[7] "ProdLine" "SalesFImage" "ComPricing" "WartyClaim" "OrdBilling" "DelSpeed"
[13] "Satisfaction"

> dim(Data)
[1] 100 13

class(Data)
[1] "data.frame"

str(Data)
'data.frame': 100 obs. of 13 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
$ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
$ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
$ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
$ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
$ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
$ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
$ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
$ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
$ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
$ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
$ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...

summary(Data)
       ID            ProdQual          Ecom          TechSup         CompRes       Advertising       ProdLine
 Min.   :  1.00   Min.   : 5.000   Min.   :2.200   Min.   :1.300   Min.   :2.600   Min.   :1.900   Min.   :2.300
 1st Qu.: 25.75   1st Qu.: 6.575   1st Qu.:3.275   1st Qu.:4.250   1st Qu.:4.600   1st Qu.:3.175   1st Qu.:4.700
 Median : 50.50   Median : 8.000   Median :3.600   Median :5.400   Median :5.450   Median :4.000   Median :5.750
 Mean   : 50.50   Mean   : 7.810   Mean   :3.672   Mean   :5.365   Mean   :5.442   Mean   :4.010   Mean   :5.805
 3rd Qu.: 75.25   3rd Qu.: 9.100   3rd Qu.:3.925   3rd Qu.:6.625   3rd Qu.:6.325   3rd Qu.:4.800   3rd Qu.:6.800
 Max.   :100.00   Max.   :10.000   Max.   :5.700   Max.   :8.500   Max.   :7.800   Max.   :6.500   Max.   :8.400

  SalesFImage      ComPricing      WartyClaim      OrdBilling       DelSpeed      Satisfaction
 Min.   :2.900   Min.   :3.700   Min.   :4.100   Min.   :2.000   Min.   :1.600   Min.   :4.700
 1st Qu.:4.500   1st Qu.:5.875   1st Qu.:5.400   1st Qu.:3.700   1st Qu.:3.400   1st Qu.:6.000
 Median :4.900   Median :7.100   Median :6.100   Median :4.400   Median :3.900   Median :7.050
 Mean   :5.123   Mean   :6.974   Mean   :6.043   Mean   :4.278   Mean   :3.886   Mean   :6.918
 3rd Qu.:5.800   3rd Qu.:8.400   3rd Qu.:6.600   3rd Qu.:4.800   3rd Qu.:4.425   3rd Qu.:7.625
 Max.   :8.200   Max.   :9.900   Max.   :8.100   Max.   :6.700   Max.   :5.500   Max.   :9.900
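Two objects used later, Data1 (the set of independent variables) and corr (their correlation matrix), are not created anywhere in the listing. A minimal sketch of how they can be built, assuming Data1 simply drops the ID and Satisfaction columns:

# Assumption: Data1 holds the 11 independent variables only
Data1 = Data[, !(names(Data) %in% c("ID", "Satisfaction"))]

colSums(is.na(Data1))        # quick check: no missing values expected

# Correlation matrix of the regressors, used by corrplot() and KMO() below
corr = cor(Data1)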

1. Multicollinearity
Explanation: Multicollinearity is a high degree of correlation amongst the independent variables.

How can we detect the problem?


Examine the correlation matrix of the regressors and also carry out auxiliary regressions amongst
the regressors (multicollinearity is a data problem, not a mis-specification problem).
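An auxiliary regression regresses one independent variable on all the others; an R² close to 1 flags multicollinearity. A minimal sketch using Data1 from above (DelSpeed as the response is only an illustrative choice):

# Auxiliary regression: DelSpeed on the remaining regressors.
# A high R-squared means DelSpeed is nearly a linear combination of the others.
aux = lm(DelSpeed ~ ., data = Data1)
summary(aux)$r.squared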

What are its consequences?


It may be difficult to separate out the effects of the individual regressors. Standard errors may
be overestimated and t‐values depressed.

Correlation amongst the Independent variables

> library(corrplot)
> corrplot(corr, method = "number", type = "upper")

Inference:
Pairs of independent variables such as SalesFImage with ComPricing and WartyClaim with OrdBilling, among a few others, are highly correlated.
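To list such pairs directly from the correlation matrix, a small sketch (the 0.6 cutoff is an arbitrary illustration):

# Report variable pairs whose absolute correlation exceeds the cutoff
high = which(abs(corr) > 0.6 & upper.tri(corr), arr.ind = TRUE)
data.frame(var1 = rownames(corr)[high[, 1]],
           var2 = colnames(corr)[high[, 2]],
           r    = round(corr[high], 2))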

2. Factor Analysis

Before moving to factor analysis, it is necessary to check whether the data are suitable for factoring. For this we use the KMO test.

Kaiser-Meyer-Olkin (KMO) Test is a measure of how suited the data is for Factor Analysis.
The statistic is a measure of the proportion of variance among variables that might be
common variance. KMO returns values between 0 and 1.

> library(psych)
> KMO(corr)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = corr)
Overall MSA = 0.65 (0.60 – 0.69: mediocre)

MSA for each item =
   ProdQual        Ecom     TechSup     CompRes Advertising
       0.51        0.63        0.52        0.79        0.78
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling
       0.62        0.62        0.75        0.51        0.76
   DelSpeed
       0.67

Eigenvalues:

The eigenvalue is a measure of how much of the variance of the observed variables a factor
explains. Any factor with an eigenvalue ≥1 explains more variance than a single observed
variable.

library(nFactors)                 # parallel(), nScree(), plotnScree()
Eigen = eigen(corr)               # eigenvalues of the correlation matrix
AP = parallel(subject = nrow(Data1), var = ncol(Data1), rep = 100, cent = 0.5)
NS = nScree(x = Eigen$values, aparallel = AP$eigen$qevpea)
plotnScree(NS)                    # scree plot with the parallel-analysis criterion

> prll=fa.parallel(Data1,fm="minres",fa="fa")

Parallel analysis suggests that the number of factors = 3 and the number of components = NA

Based on the above parallel analysis, the suggested number of factors is 3. It is also reasonable to consider one factor more or fewer in order to explain the maximum variance.

As per the project requirement, the number of factors considered is 4.

Factor Analysis:

Factor analysis assumes that multiple observed variables have similar patterns of responses because they are all associated with an underlying latent (i.e. not directly measured) variable, the factor.

Factor analysis without rotation:

nfactors = 4     # number of factors chosen above
Fit = factanal(Data1, nfactors, scores = c("regression"), rotation = "none")
print(Fit, digits = 2, sort = TRUE)

Call:
factanal(x = Data1, factors = nfactors, scores = c("regression"), rotation = "none")

Uniquenesses:
   ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed
       0.68        0.36        0.23        0.18        0.68        0.00        0.02        0.64        0.16        0.35        0.08

Loadings:
Factor1 Factor2 Factor3 Factor4
CompRes 0.59 0.32 0.53 0.30
ProdLine 1.00
DelSpeed 0.63 0.36 0.58 0.24
Ecom 0.79
Advertising 0.56 0.11
SalesFImage 0.99
TechSup 0.20 -0.48 0.71
WartyClaim 0.28 0.12 -0.49 0.71
ProdQual 0.47 -0.15 -0.23 -0.16
ComPricing -0.49 0.25 0.24
OrdBilling 0.45 0.27 0.49 0.36

Factor1 Factor2 Factor3 Factor4


SS loadings 2.52 2.32 1.47 1.32
Proportion Var 0.23 0.21 0.13 0.12
Cumulative Var 0.23 0.44 0.57 0.69

Test of the hypothesis that 4 factors are sufficient.


The chi square statistic is 24.26 on 17 degrees of freedom.

The p-value is 0.113

Inference:

From the unrotated solution it is difficult to tell which variables underlie which factor.

Varimax rotation: a varimax rotation is used to simplify the expression of a particular subspace in terms of just a few major items each. The actual coordinate system is unchanged; it is the orthogonal basis that is being rotated to align with those coordinates.
Factor analysis after rotation:

Fit3 = factanal(Data1, 4, scores = c("regression"), rotation = "varimax")
print(Fit3, digits = 2, sort = TRUE)

Call:
factanal(x = Data1, factors = 4, scores = c("regression"), rotation = "varimax")

Uniquenesses:
   ProdQual        Ecom     TechSup     CompRes Advertising    ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling    DelSpeed
       0.68        0.36        0.23        0.18        0.68        0.00        0.02        0.64        0.16        0.35        0.08

Loadings:
Factor1 Factor2 Factor3 Factor4
CompRes 0.88
OrdBilling 0.79
DelSpeed 0.93
Ecom 0.79
Advertising 0.52
SalesFImage 0.97
TechSup 0.87
WartyClaim 0.89
ProdQual 0.56
ProdLine 0.50 0.86
ComPricing -0.51

Factor1 Factor2 Factor3 Factor4


SS loadings 2.59 1.98 1.64 1.42
Proportion Var 0.24 0.18 0.15 0.13
Cumulative Var 0.24 0.42 0.56 0.69

Test of the hypothesis that 4 factors are sufficient.


The chi square statistic is 24.26 on 17 degrees of freedom.
The p-value is 0.113

Inference:
The 4 factors together explain 69% of the variance in the data (cumulative variance = 0.69).

load5 = Fit3$loadings[, 1:4]                    # loadings on the four rotated factors
plot(load5, type = "n")                         # empty plot (Factor1 vs Factor2)
text(load5, labels = names(Data1), cex = .9)    # label each variable by name
abline(h = 0, v = 0)

3. Naming the Factors
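The data frame regpca used below (the dependent variable plus the four factor scores) is not constructed in the listing. A minimal sketch, assuming it combines Data$Satisfaction with the regression-method scores from Fit3:

# Assumption: regpca = Satisfaction plus the four regression-method factor scores
regpca = data.frame("Data$Satisfaction" = Data$Satisfaction, Fit3$scores,
                    check.names = FALSE)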
> names(regpca)
[1] "Data$Satisfaction" "Factor1"           "Factor2"           "Factor3"           "Factor4"

1. Customer_handling
> names(regpca)[names(regpca) == "Factor1"] <- "Customer_handling"
CompRes 0.86
OrdBilling 0.82
DelSpeed 0.97

2. Advertisment

> names(regpca)[names(regpca) == "Factor2"] <- "Advertisment"

Ecom 0.79
Advertising 0.52
SalesFImage 0.98

3. Replacement

> names(regpca)[names(regpca) == "Factor3"] <- "Replacement"

TechSup 0.99
WartyClaim 0.79

4. Quality

> names(regpca)[names(regpca) == "Factor4"] <- "Quality"

ProdQual 0.61
ProdLine 0.80
ComPricing -0.54

5. Dependent variable = satisfaction


> names(regpca)[names(regpca) == "Data$Satisfaction"] <- "satisfaction"

The dependent variable Data$Satisfaction is renamed to satisfaction.


> names(regpca)
[1] "satisfaction"      "Customer_handling" "Advertisment"      "Replacement"       "Quality"

Univariate analysis - Satisfaction (dependent variable)
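The plots for this section are not reproduced in the extracted text. A minimal sketch of a typical univariate summary of the dependent variable (summary statistics, histogram and boxplot):

# Univariate view of the dependent variable
summary(Data$Satisfaction)
hist(Data$Satisfaction, main = "Distribution of Satisfaction", xlab = "Satisfaction")
boxplot(Data$Satisfaction, horizontal = TRUE, main = "Satisfaction")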

4. Multiple Regression Analysis
Multiple regression is used when we want to predict the value of a dependent variable based on the values of two or more independent variables.

Null hypothesis (H0): Satisfaction does not depend on any of the independent factors (Customer_handling, Advertisment, Replacement, Quality).

Alternative hypothesis (Ha): Satisfaction depends on at least one of the independent factors (Customer_handling, Advertisment, Replacement, Quality).

Regression=lm(satisfaction~.,data = Finaldata)
summary(Regression)

Call:
lm(formula = satisfaction ~ ., data = Finaldata)

Residuals:
Min 1Q Median 3Q Max
-2.07220 -0.53652 0.04008 0.51946 1.64675

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91800 0.07351 94.113 < 2e-16 ***
Customer_handling 0.55858 0.07410 7.538 2.93e-11 ***
Advertisment 0.59908 0.07413 8.082 2.15e-12 ***
Replacement 0.10616 0.07408 1.433 0.155
Quality 0.48487 0.07595 6.384 6.48e-09 ***
order_Billing 0.06519 0.07459 0.874 0.384
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7351 on 94 degrees of freedom


Multiple R-squared: 0.6388, Adjusted R-squared: 0.6196
F-statistic: 33.25 on 5 and 94 DF, p-value: < 2.2e-16

Inference:
The p-value is < 2.2e-16, hence we reject the null hypothesis.

Satisfaction depends on one or more of the independent variables (Customer_handling, Advertisment, Replacement, Quality).

Model Validation:

Feature selection:

stepfwd=stepAIC(Regression,direction = "forward")

Start: AIC=-55.74
satisfaction ~ Customer_handling + Advertisment + Replacement +
Quality + order_Billing

stepfwd=stepAIC(Regression,direction = "backward")

Step: AIC=-56.94
satisfaction ~ Customer_handling + Advertisment + Replacement +
Quality

stepfwd=stepAIC(Regression,direction = "both")

Step: AIC=-56.94
satisfaction ~ Customer_handling + Advertisment + Replacement +
Quality

>Backward feature selection as the least AIC(Aic = -56.94)
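The reduced data set Newreg used for the final fit below is not built in the listing. A minimal sketch, assuming it drops order_Billing from the regression data Finaldata and that Newregression is the refitted model used for the VIF check later:

# Assumption: Newreg drops order_Billing, as suggested by backward selection
Newreg = Finaldata[, names(Finaldata) != "order_Billing"]
Newregression = lm(satisfaction ~ ., data = Newreg)
summary(Newregression)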

Call:
lm(formula = satisfaction ~ ., data = Newreg)

Residuals:
Min 1Q Median 3Q Max
-2.03808 -0.51119 0.07321 0.51371 1.57192

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91800 0.07342 94.230 < 2e-16 ***
Customer_handling 0.55887 0.07400 7.552 2.62e-11 ***
Advertisment 0.59900 0.07404 8.091 1.94e-12 ***
Replacement 0.10617 0.07399 1.435 0.155
Quality 0.48453 0.07585 6.388 6.19e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7342 on 95 degrees of freedom


Multiple R-squared: 0.6359, Adjusted R-squared: 0.6206
F-statistic: 41.48 on 4 and 95 DF, p-value: < 2.2e-16

Prediction error of the final model:

> library(Metrics)                               # mse() and mape()
> pred_full = predict(stepboth, newdata = Newreg)
> mse_full = mse(Newreg$satisfaction, pred_full)
> mse_full
[1] 0.5120377

> sqrt(mse_full)                                 # RMSE
[1] 0.7155681

> mape_full = mape(Newreg$satisfaction, pred_full)
> mape_full
[1] 0.09136993

> library(car)                                   # vif()
> vif(Newregression)
Customer_handling      Advertisment       Replacement           Quality
         1.000117          1.000088          1.000026          1.000213

Comments:
The model error (MAPE) is only about 9%, the selected model has the lowest AIC from feature selection, and all four factors contribute to the model. The VIF scores (all close to 1) indicate no multicollinearity.

Hence the model is valid.
