
Advanced Statistics

Group Assignment

Group-7:

 Hari Shankar V K
 Malavika Ravindra Kumar
 Patrick Priyadharshan
 P Rajenthiran

Contents

1. Problem 1

1.1 Problem Statement

1.2 Approach and Initial Findings

1.3 Exploratory Data Analysis

1.4 Factor Analysis

1.5 Conclusion

2. Problem 2

2.1 Problem Statement

2.2 Approach and Initial Findings

2.3 Exploratory Data Analysis

2.3.1 Outlier & Missing Value Treatment for Price

2.3.2 Outlier & Missing Value Treatment for Size

2.3.3 Outlier & Missing Value Treatment for Elevation

2.3.4 Outlier & Missing Value Treatment for Sewer

2.3.5 Outlier & Missing Value Treatment for Distance

2.4 Regression Analysis

2.5 Conclusion

Problem 1: Cereal Data Factor Analysis

1.1 Problem statement


This is a study of consumer consideration of ready-to-eat cereals, in which a set of respondents were asked to evaluate three preferred brands of cereal on twenty-five attributes using a five-point Likert scale. The data thus collected from the study is provided for analysis. The consideration behavior of the twelve brands needs to be characterized.

Cereal Brands and Attributes:

Cereal Brand            Attributes 1-12   Attributes 13-25
All Bran                Filling           Family
Cerola Muesli           Natural           Calories
Just Right              Fibre             Plain
Kellogg's corn flakes   Sweet             Crisp
Komplete                Easy              Regular
Nutrigrain              Salt              Sugar
Purina Muesli           Satisfying        Fruit
Rice Bubbles            Energy            Process
Special K               Fun               Quality
Sustain                 Kids              Treat
Vitabrit                Soggy             Boring
Weetbix                 Economical        Nutritious
                                          Health

1.2 Approach and Initial Findings
1. Read the data set.
2. Explore the data for outliers and missing values.
3. Perform factor analysis.
4. Profile the factors to derive useful insights.

Loading the Dataset and getting useful insights:

The first column of the dataset is removed, since it contains the brand names, which are not needed for factor analysis.
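These loading steps can be sketched in R as follows (the file name cereal.csv is taken from the conclusion of this section; the exact calls in the original screenshots may differ):

```r
# Read the cereal ratings data set
cereal <- read.csv("cereal.csv", header = TRUE)

# Drop the first column (brand names) - it is not needed for factor analysis
cereal_data <- cereal[, -1]

# Initial look at the structure and summary statistics
str(cereal_data)
summary(cereal_data)
```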

1.3 Exploratory Data Analysis

Since the cereal brands are evaluated on a five-point Likert scale, any value greater than five needs to be capped at five, the highest possible value. The new data set and its summary are shown below.

The library "tidyverse" has been used to replace 6s with 5s.
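The replacement code itself is not reproduced in this copy; one way to do it with tidyverse (assuming the ratings data frame is called cereal_data) is:

```r
library(tidyverse)

# Cap every rating at 5, the maximum of the Likert scale:
# stray 6s are replaced with 5s
cereal_data <- cereal_data %>%
  mutate(across(everything(), ~ replace(.x, .x > 5, 5)))

summary(cereal_data)
```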

Summary of new data set

Now, we can see that Soggy and Boring are the two variables that carry a negative connotation. They need to be reverse-coded so that all variables point in the same direction for the factor analysis. The following operation reverses those variables.

The new data set shows that those two columns have been reversed and replaced.
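A minimal sketch of the reversal (again assuming the data frame cereal_data): on a 1-5 scale, reverse-coding maps a rating x to 6 - x, so 1 and 5 swap while 3 stays put.

```r
# Reverse-code the negatively worded items
cereal_data$Soggy  <- 6 - cereal_data$Soggy
cereal_data$Boring <- 6 - cereal_data$Boring
```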

1.4 Factor Analysis


The library "nFactors" was used to perform the factor analysis.

Eigen values and Correlation matrix

Before jumping into factor analysis, we need to find the eigenvalues and the correlation matrix for the data set.
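A sketch of these computations (assuming the cleaned data frame cereal_data from the previous section):

```r
# Correlation matrix of the 25 attributes
corr_mat <- cor(cereal_data)

# Eigen decomposition of the correlation matrix;
# the eigenvalues drive the choice of the number of factors
ev <- eigen(corr_mat)
ev$values
```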

Scree Plot

We need to determine the number of factors so as to reduce the dimensionality before performing the factor analysis. A scree plot was drawn to determine the optimal number of factors.

As per the Kaiser rule, only factors with an eigenvalue > 1 are retained when deciding the optimal number of factors for further analysis. Thus, four factors are considered.
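The scree plot can be reproduced with base R as below (assuming the cleaned data frame cereal_data; the original may instead have used a helper from the nFactors package):

```r
# Scree plot of the eigenvalues with the Kaiser cut-off marked
ev <- eigen(cor(cereal_data))$values
plot(ev, type = "b", xlab = "Component number", ylab = "Eigenvalue",
     main = "Scree Plot")
abline(h = 1, lty = 2)  # Kaiser rule: retain factors with eigenvalue > 1
```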
The table above lists the absolute loadings of each attribute on the principal components (columns 2 to 5). Column 6 (h2) shows the communality readings, i.e., the proportion of each variable's variance shared across all the factors.

The first row tabulates the eigenvalues, and the second row the proportion of variance explained by each factor. The cumulative variance of 0.61741 indicates that 61.74% of the total information content is retained.

In the standardized loadings output of the factor analysis, many variables with contrasting traits show a high degree of correlation with each of the principal-component factors. Hence the factor loadings need to be rotated to draw a meaningful inference and identify the patterns.

The rotated factor loading output is as below.
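The rotation call is not shown in this copy; a common way to obtain a four-factor varimax-rotated solution is psych::principal (an assumption - the original analysis may have used a different function):

```r
library(psych)

# Four-factor principal components solution with varimax rotation
fit_rotated <- principal(cereal_data, nfactors = 4, rotate = "varimax")

# Suppress small loadings to make the factor pattern easier to read
print(fit_rotated$loadings, cutoff = 0.5)
```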

Though the eigenvalues and the proportional variance change in the new fit, the communality is the same as in the unrotated factor loading.

Considering the variables which are highly correlated against each of the Rotated components –
four factors are identified as below

Factor 1 – Health Conscious

Filling (74.65%), Natural (77.91%), Fiber (83.76%), Satisfying (67.07%), Energy (70.09%), Health (83.3%), Regular (67.61%), Quality (66.7%) and Nutritious (83.24%)

Factor 2 – Experience

Fun (68.02%), Plain (70.03%), Treat (84.57%) and Boring (84.57%)

Even though Crisp and Fruit belong to this factor their contribution can be neglected for this
factor.

Factor 3 - Audience and Cereal Preference

Kids (91.12%), Family (77.69%), Soggy (91.12%)

Even though Easy and Economical are part of this factor calculation, their contribution happens
to be minimal for this factor.

Factor 4 – Taste

Sweet (69.8%), Salt (78.09%), Calories (72.16%), Sugar (80.74%)

Even though Process belongs to this factor as per R's calculation, we ignore it since it has minimal impact on this factor.

Table Summary

1.5 Conclusion
We find that four factors underlie the ratings of the cereals listed in the file cereal.csv. Looking at how each cereal performs on these four factors (Health, Experience, Audience & Preference, Taste) gives us a better basis for analysing each brand.

Problem 2: Leslie Salt Data Regression Analysis

Problem Statement:
The Leslie property (Mountain View City) contained 246.8 acres and was located right on San Francisco Bay. The land had been used for salt evaporation, and the city of Mountain View intended to fill the property and use it for a city park. Appraisers were hired to ascertain the price, but the process was difficult because there were few sales of bayland property, and none of them corresponded exactly to the characteristics of the Leslie property. Data on 31 bayland properties sold during the last 10 years were available.

In addition to the transaction price for each property, data on a large number of other factors, including size, time of sale, elevation, location, and access to sewers, were collected. A listing of these data includes only those variables deemed relevant for this exercise. A description of the variables is provided below.

Approach:
A regression model is used to better understand which variables influence the market valuation (price) of the Leslie Salt property.

DV denotes dependent variable and IV denotes independent variable

Price – DV

County – IV

Size - IV

Elevation - IV

Sewer - IV

Date - IV

Flood - IV

Distance - IV

Data Loading:
We load the data from the file “Dataset_LeslieSalt.xlsx” into a new data frame called datasource.
> datasource=read.csv(file.choose(),header = TRUE)
> View(datasource)

Exploratory Data Analysis


> str(datasource)

'data.frame': 31 obs. of 8 variables:


$ Price : num 4.5 10.6 1.7 5 5 3.3 5.7 6.2 19.4 3.2 ...
$ County : int 1 1 0 0 0 1 1 1 1 1 ...
$ Size : num 138.4 52 16.1 1695.2 845 ...
$ Elevation: int 10 4 0 1 1 2 4 4 20 0 ...
$ Sewer : int 3000 0 2640 3500 1000 10000 0 0 1300 6000 ...
$ Date : int -103 -103 -98 -93 -92 -86 -68 -64 -63 -62 ...
$ Flood : int 0 0 1 0 1 0 0 0 0 0 ...
$ Distance : num 0.3 2.5 10.3 14 14 0 0 0 1.2 0 ...

After taking a closer look at the structure, we find that the County and Flood variables are of integer type, while they should be factors.

Changing County and Flood to Factors


> datasource$County=factor(datasource$County,levels=c("0","1"),labels=c("San Mateo","Santa Clara"))
> datasource$Flood=factor(datasource$Flood,levels=c("0","1"),labels=c("No","Yes"))

Checking the structure after the transformation
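The screenshot of this check is not reproduced here; it is simply:

```r
# County and Flood should now show as Factor w/ 2 levels
str(datasource)
```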

Plotting the data to visually check for pattern resemblance


> plot(datasource)

Outlier detection with a boxplot

10 values have been detected as outliers in this dataset. Let's look at the individual variables to detect and treat the outliers.

One outlier in Price (dependent variable)


> boxplot(datasource$Price,main="Boxplot of Price",xlab="Price",ylab="scale of measure",col="#AA9812")

> out.price=boxplot.stats(datasource$Price)$out
> out.price

Most of the observations lie between $5,000 and $15,000 per acre (Price is recorded in thousands of dollars per acre), and the median is around $12,000. The one outlier detected carries the value $37,200.
> summary(datasource$Price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.70 5.35 11.70 11.95 16.05 37.20
> bench.price= 16.05+1.5*IQR(datasource$Price)
> bench.price
[1] 32.1
> datasource$Price[datasource$Price > bench.price]= median(datasource$Price)
> datasource$Price
[1] 4.5 10.6 1.7 5.0 5.0 3.3 5.7 6.2 19.4 3.2 4.7 6.9 8.1 11.6 19.3 11.7
[17] 13.3 15.1 12.4 15.3 12.2 18.1 16.8 5.9 4.0 11.7 18.2 15.1 22.9 15.2 21.9

Three outliers in Size (independent variable)

Most of the observations of the Size variable lie between 6.90 and 145.70 acres. The three outliers carry the values 1695.20, 845.0 and 320.6.
> summary(datasource$Size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.90 20.35 51.40 139.97 104.10 1695.20
> bench.size=104.10+1.5*IQR(datasource$Size)
> bench.size
[1] 229.725
> boxplot.stats(datasource$Size)
$stats
[1] 6.90 20.35 51.40 104.10 145.70

$n
[1] 31

$conf
[1] 27.63373 75.16627

$out
[1] 1695.2 845.0 320.6

> datasource$Size[datasource$Size > bench.size]= median(datasource$Size)


> datasource$Size
[1] 138.4 52.0 16.1 51.4 51.4 6.9 105.9 56.6 51.4 22.1 22.1 27.7 18.6
[14] 69.9 145.7 77.2 26.2 102.3 49.5 12.2 51.4 9.9 15.3 55.2 116.2 15.0
[27] 23.4 132.8 12.0 67.0 30.8

One outlier for Elevation (independent variable)

Most of the observations lie between 0 and 11 feet above sea level. The one outlier carries the value 20 feet above sea level.
> summary(datasource$Elevation)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.000 4.645 7.000 20.000
> bench.elevation=7+1.5*IQR(datasource$Elevation)
> bench.elevation
[1] 14.5
> boxplot.stats(datasource$Elevation)
$stats
[1] 0 2 4 7 11

$n
[1] 31

$conf
[1] 2.581118 5.418882

$out
[1] 20

> datasource$Elevation[datasource$Elevation > bench.elevation]= median(datasource$Elevation)
> datasource$Elevation
[1] 10 4 0 1 1 2 4 4 4 0 0 3 5 8 10 9 8 6 11 8 0 5 2 0 2 5
[27] 5 2 5 2 2

One outlier for Sewer (independent variable)

Most of the observations lie between 0 and 3,500 feet from the nearest sewer connection, with the range ending at 6,000 feet. The one outlier carries the value 10,000 feet to the nearest sewer connection. Sewer values of 0 are treated as missing and replaced with the median (900 feet).
> median(datasource$Sewer)
[1] 900
> datasource=datasource%>%mutate(Sewer=replace(Sewer,Sewer=="0",median(Sewer,na.rm = TRUE)))
> datasource$Sewer
[1] 3000 900 2640 3500 1000 10000 900 900 1300 6000 6000 4500 5000
[14] 900 900 900 900 900 900 900 4000 900 900 1320 900 900
[27] 4420 2640 3400 900 900
> summary(datasource$Sewer)
Min. 1st Qu. Median Mean 3rd Qu. Max.
900 900 900 2359 3450 10000
> bench.sewer=3450+1.5*IQR(datasource$Sewer)
> bench.sewer
[1] 7275

Three outliers in Distance (independent variable)

Most of the observations lie between 0 and 12 miles. The three extreme values carry 14.0, 14.0 and 16.5 miles. As with Sewer, Distance values of 0 are treated as missing and replaced with the median (4.9 miles) before the outlier treatment.
> median(datasource$Distance)
[1] 4.9
> datasource=datasource%>%mutate(Distance=replace(Distance,Distance=="0",median(Distance,na.rm = TRUE)))
> datasource$Distance
[1] 0.3 2.5 10.3 14.0 14.0 4.9 4.9 4.9 1.2 4.9 4.9 4.9 0.5 4.4 4.2 4.5
[17] 4.7 4.9 4.6 5.0 16.5 5.2 5.5 11.9 5.5 7.2 5.5 10.2 5.5 5.5 5.5
> summary(datasource$Distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.300 4.650 4.900 6.081 5.500 16.500
> bench.distance.upper=5.5+1.5*IQR(datasource$Distance)
> bench.distance.upper
[1] 6.775
> bench.distance.lower=4.65-1.5*IQR(datasource$Distance)
> bench.distance.lower
[1] 3.375
> boxplot.stats(datasource$Distance)
$stats
[1] 4.20 4.65 4.90 5.50 5.50

$n
[1] 31

$conf
[1] 4.65879 5.14121

$out
[1] 0.3 2.5 10.3 14.0 14.0 1.2 0.5 16.5 11.9 7.2 10.2

> datasource$Distance[datasource$Distance > bench.distance.upper]= median(datasource$Distance)
> datasource$Distance[datasource$Distance < bench.distance.lower]= bench.distance.lower
> datasource$Distance
[1] 3.375 3.375 4.900 4.900 4.900 4.900 4.900 4.900 3.375 4.900 4.900 4.900 3.375
[14] 4.400 4.200 4.500 4.700 4.900 4.600 5.000 4.900 5.200 5.500 4.900 5.500 4.900
[27] 5.500 4.900 5.500 5.500 5.500

Checking the correlation between all variables (except County and Flood)

> cor(datasource[,-c(2,7)])
Price Size Elevation Sewer Date Distance
Price 1.0000000 -0.04199400 0.35213671 -0.26438725 0.63738811 0.15262149
Size -0.0419940 1.00000000 0.29794356 -0.24038160 -0.08107814 -0.25911106
Elevation 0.3521367 0.29794356 1.00000000 -0.35338447 -0.04597188 -0.38577392
Sewer -0.2643872 -0.24038160 -0.35338447 1.00000000 -0.04516885 -0.04862628
Date 0.6373881 -0.08107814 -0.04597188 -0.04516885 1.00000000 0.55491408
Distance 0.1526215 -0.25911106 -0.38577392 -0.04862628 0.55491408 1.00000000

Price shows a strong correlation with Date (0.64).

Regression Analysis
> reg=lm(Price~ .,data = datasource)
> summary(reg)

Call:
lm(formula = Price ~ ., data = datasource)

Residuals:
Min 1Q Median 3Q Max
-7.2857 -2.2803 0.0825 2.7883 6.2263

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.1999788 9.4738190 3.188 0.00410 **
CountySanta Clara -2.6062163 2.0732909 -1.257 0.22135
Size -0.0234471 0.0196519 -1.193 0.24499
Elevation 0.3461871 0.3095115 1.118 0.27490
Sewer -0.0009262 0.0003994 -2.319 0.02963 *
Date 0.1428219 0.0380449 3.754 0.00103 **
FloodYes -6.7512507 2.5744595 -2.622 0.01523 *
Distance -1.2632360 1.5491357 -0.815 0.42318
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.851 on 23 degrees of freedom
Multiple R-squared: 0.6974, Adjusted R-squared: 0.6053
F-statistic: 7.573 on 7 and 23 DF, p-value: 9.102e-05

The R² value (coefficient of determination) of 69.74% explains the contribution made by the regression in explaining the variation in the dependent variable, and the adjusted R² is 60.53%.

Null hypothesis: there is no linear relationship between the DV and the IVs (all betas = 0).

Alternate hypothesis: there is a linear relationship between the DV and the IVs (at least one beta is not equal to zero).

The p-value in the above table (9.102e-05) is less than 0.05 (5% level of significance), so we reject the null hypothesis and conclude that there is a linear relationship between the DV and the IVs.

Size, Elevation, Distance and County are not significant enough to be included in the model we are about to build. Hence we run the regression analysis once again without these four variables.
> lreg=lm(Price~ .-Distance-Elevation-Size-County, data = datasource)
> summary(lreg)

Call:
lm(formula = Price ~ . - Distance - Elevation - Size - County,
data = datasource)

Residuals:
Min 1Q Median 3Q Max
-6.7549 -2.0249 -0.2863 2.5275 6.7021

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.0725045 1.8996071 12.146 1.88e-12 ***
Sewer -0.0010025 0.0003297 -3.040 0.005207 **
Date 0.1439904 0.0288820 4.985 3.17e-05 ***
FloodYes -7.0345229 1.9007486 -3.701 0.000971 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.809 on 27 degrees of freedom


Multiple R-squared: 0.6524, Adjusted R-squared: 0.6138
F-statistic: 16.9 on 3 and 27 DF, p-value: 2.23e-06

The R² value is 65.24% and the adjusted R² is 61.38%. The p-value of 2.23e-06 is less than 0.05 (5% level of significance), hence we construct a model based on the above output.

The regression analysis yields the following model:

Price = 23.07 - 0.001 Sewer + 0.143 Date - 7.034 Flood

Let's check how our model responds to a set of values. We have assumed the values of a few variables to estimate the price of the Leslie Salt property.

County = Santa Clara

Size = 246.8 Acres

Elevation = 0 (On Par with the sea level)

Sewer = 0 (0 feet from the nearest sewer connection)

Date = 6 (We’re assuming this property will be sold in the next 6 months)

Flood = 0 ( as it’s diked)

Distance = 0 (Distance is Zero)

> leslie_salt <- data.frame(0,"Santa Clara",246.8,0,0,6,"No",0)
> colnames(leslie_salt) <- c("Price", "County", "Size", "Elevation", "Sewer", "Date", "Flood", "Distance")
> data<-rbind(datasource,leslie_salt)
> leslie_salt_price <- predict(lreg, newdata = data[32,])
> leslie_salt_price
23.93645

The resulting price for the Leslie Salt property would be around $23,936 per acre.

