Advance Stats Group 7 - Final
Group Assignment
Group-7:
Hari Shankar V K
Malavika Ravindra Kumar
Patrick Priyadharshan
P Rajenthiran
Contents
1. Problem 1
2. Problem 2
2.5 Conclusion
Problem 1: Cereal Data Factor Analysis
1.2 Approach and Initial Findings
1. Reading the data set
2. Exploring the data for outliers and missing values
3. Performing Factor Analysis
4. Profiling the factors to draw useful insights
We remove the first column of the data set since it contains the brand names, which are not needed for Factor Analysis.
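A minimal sketch of these first steps in R; the file name cereal.csv is taken from the conclusion, and the data-frame names are assumptions:

```r
# Read the cereal ratings; the brand name is assumed to be the first column
cereal <- read.csv("cereal.csv", header = TRUE)

# Drop the brand-name column, keeping only the numeric rating variables
cereal.num <- cereal[, -1]
str(cereal.num)
```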
1.3 Exploratory Data Analysis
Since the cereal brands are evaluated on a five-point Likert scale, any value greater than five needs to be capped at five, the highest possible value. The new data set and its summary are shown below.
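This capping can be sketched in R as follows (assuming the cleaned ratings are in a data frame cereal.num):

```r
# Cap every rating at 5, the maximum of the five-point Likert scale
cereal.num[cereal.num > 5] <- 5

# Verify that no variable now exceeds 5
summary(cereal.num)
```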
Summary of new data set
Now we find that Soggy and Boring are the two variables that carry a negative connotation. We need to reverse these variables so that the factor analysis results point in a single direction. The following operation is performed to reverse those variables.
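A hedged sketch of this reversal in R (the data-frame name cereal.num is an assumption): on a five-point scale the reversed score is 6 minus the original, so a 1 becomes a 5 and vice versa.

```r
# Reverse-score the negatively worded items on the 1-5 scale
cereal.num$Soggy  <- 6 - cereal.num$Soggy
cereal.num$Boring <- 6 - cereal.num$Boring
```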
It can be seen in the new data set that those two columns are reversed and replaced.
Before jumping into factor analysis, we need to find the eigenvalues, the eigenvectors, and the correlation matrix of the data set.
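A minimal sketch in R, assuming the cleaned ratings are in a data frame cereal.num:

```r
# Correlation matrix of the ratings
corr.mat <- cor(cereal.num)

# Eigen decomposition of the correlation matrix
ev <- eigen(corr.mat)
ev$values    # eigenvalues
ev$vectors   # eigenvectors (eigen matrix)
```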
Scree Plot
We need to determine the number of factors in order to reduce the dimensionality before performing the factor analysis. A scree plot was drawn to determine the optimal number of factors.
As per the Kaiser rule, only factors with an eigenvalue > 1 are retained for further analysis. Thus, four factors are considered.
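Such a scree plot can be sketched in R as below (assuming the eigenvalues of the correlation matrix are available; the object names are illustrative):

```r
# Scree plot: eigenvalues of the correlation matrix in decreasing order
ev <- eigen(cor(cereal.num))
plot(ev$values, type = "b",
     xlab = "Factor number", ylab = "Eigenvalue",
     main = "Scree Plot")

# Kaiser rule reference line: keep factors with eigenvalue > 1
abline(h = 1, lty = 2)
```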
The table above lists the absolute strength of each attribute on the principal components (columns 2 to 5). Column 6, h2, shows the communalities, i.e., the proportion of each variable's variance shared with all the factors.
The first row tabulates the eigenvalues and the second row the proportion of variance explained by each factor. The cumulative variance of 0.61741 means that 61.74% of the total information content is retained.
Further, in the standardized loadings output of the factor analysis, many variables with contrasting traits show a high degree of correlation with each of the principal-component factors. Hence the factor loadings need to be rotated to draw a meaningful inference and identify the patterns.
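A hedged sketch of such a rotated four-factor fit using the psych package (the data-frame name, extraction method, and loading cutoff are assumptions, not the authors' exact call):

```r
library(psych)

# Four-factor solution with varimax rotation, principal-axis extraction
fit <- fa(cereal.num, nfactors = 4, rotate = "varimax", fm = "pa")

# Suppress small loadings to make the pattern easier to read
print(fit$loadings, cutoff = 0.4)
```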
Though the eigenvalues and the proportional variance change in the new fit, the communalities are the same as in the unrotated factor loadings.
Considering the variables that are highly correlated with each of the rotated components, four factors are identified as below.
Factor 1 – Health
Factor 2 – Experience
Even though Crisp and Fruit belong to this factor, their contribution can be neglected.
Factor 3 – Audience & Preference
Even though Easy and Economical are part of this factor's calculation, their contribution happens to be minimal.
Factor 4 – Taste
Even though Process belongs to this factor as per R's calculation, we ignore it since its impact is minimal.
Table Summary
1.5 Conclusion
We conclude that four factors underlie the ratings of the cereals listed in the file cereal.csv. Looking at how each cereal performs on these four factors (Health, Experience, Audience & Preference, Taste) gives us better room for analysing each brand.
Problem 2: Leslie Salt Data Regression Analysis
Problem Statement:
The Leslie property (Mountain View City) contained 246.8 acres and was located right on the San Francisco Bay. The land had been used for salt evaporation, and the city of Mountain View intended to fill the property and use it for a city park. Appraisers were hired to ascertain the price, but what made the process difficult was that there were few sales of bayland property, and none of them corresponded exactly to the characteristics of the Leslie property. Data on 31 bayland properties sold during the last 10 years were available.
In addition to the transaction price for each property, data on a large number of other factors, including size, time of sale, elevation, location, and access to sewers, was collected.
A listing of these data includes only those variables deemed relevant for this exercise. A description of the variables is provided below.
Approach:
The approach of regression model technique is used to better understand which variables
influence market evaluation (price) of the Leslie Salt property.
Price – DV
County – IV
Size - IV
Elevation - IV
Sewer - IV
Date - IV
Flood - IV
Distance - IV
Data Loading:
We load the data from the file “Dataset_LeslieSalt.xlsx” and copy it into a new data frame called datasource.
> datasource=read.csv(file.choose(),header = TRUE)
> View(datasource)
After taking a closer look at the structure, we find that the County and Flood variables are of integer type while they should be factors.
> datasource$Flood=factor(datasource$Flood,levels=c("0","1"),labels=c("No","Yes"))
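The analogous conversion for County can be sketched as below; the 0/1 coding and the San Mateo / Santa Clara labels are assumptions inferred from the "CountySanta Clara" term in the regression output:

```r
# Convert County to a factor; level coding and labels are assumed
datasource$County <- factor(datasource$County, levels = c(0, 1),
                            labels = c("San Mateo", "Santa Clara"))

# Confirm both variables are now factors
str(datasource)
```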
Outlier detection with a boxplot
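A sketch of this boxplot in R; the column indices excluded (the factors County and Flood) follow the cor() call used later in this report:

```r
# Boxplots of the numeric variables to flag outliers;
# columns 2 (County) and 7 (Flood) are factors and are excluded
boxplot(datasource[, -c(2, 7)], main = "Outlier detection")
```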
10 values have been detected to be outliers in this dataset. Let’s look at individual variables to
detect and treat outliers.
> out.price=boxplot.stats(datasource$Price)$out
> out.price
Most of the observations are between $5,000 and $15,000 per acre, and the median is around $12,000. The one outlier detected carries the value $37,200.
> summary(datasource$Price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.70 5.35 11.70 11.95 16.05 37.20
> bench.price= 16.05+1.5*IQR(datasource$Price)
> bench.price
[1] 32.1
> datasource$Price[datasource$Price > bench.price]= median(datasource$Price)
> datasource$Price
 [1]  4.5 10.6  1.7  5.0  5.0  3.3  5.7  6.2 19.4  3.2  4.7  6.9  8.1 11.6 19.3 11.7
[17] 13.3 15.1 12.4 15.3 12.2 18.1 16.8  5.9  4.0 11.7 18.2 15.1 22.9 15.2 21.9
Most of the observations for the Size variable lie between 6.90 and 300 acres. The three outliers carry the values 1695.20, 845.0 and 320.6.
> summary(datasource$Size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.90 20.35 51.40 139.97 104.10 1695.20
> bench.size=104.10+1.5*IQR(datasource$Size)
> bench.size
[1] 229.725
> boxplot.stats(datasource$Size)
$stats
[1] 6.90 20.35 51.40 104.10 145.70
$n
[1] 31
$conf
[1] 27.63373 75.16627
$out
[1] 1695.2 845.0 320.6
Most of the observations lie from 0 to 11 feet above sea level. The one outlier carries the value
20 feet above sea level.
> summary(datasource$Elevation)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.000 4.645 7.000 20.000
> bench.elevation=7+1.5*IQR(datasource$Elevation)
> bench.elevation
[1] 14.5
> boxplot.stats(datasource$Elevation)
$stats
[1] 0 2 4 7 11
$n
[1] 31
$conf
[1] 2.581118 5.418882
$out
[1] 20
Most of the observations lie between 0 and 3,500 feet from the nearest sewer connection. The range ends at 6,000 feet. The one outlier carries the value 10,000 feet to the nearest sewer connection.
> median(datasource$Sewer)
[1] 900
> datasource=datasource%>%mutate(Sewer=replace(Sewer,Sewer=="0",median(Sewer,na.rm = TRUE)))
> datasource$Sewer
 [1]  3000   900  2640  3500  1000 10000   900   900  1300  6000  6000  4500  5000
[14]   900   900   900   900   900   900   900  4000   900   900  1320   900   900
[27]  4420  2640  3400   900   900
> summary(datasource$Sewer)
Min. 1st Qu. Median Mean 3rd Qu. Max.
900 900 900 2359 3450 10000
> bench.sewer=3450+1.5*IQR(datasource$Sewer)
> bench.sewer
[1] 7275
Most of the observations lie between 0 and 12 miles. The three outliers in this case carry the values 14, 14 and 16.5 miles.
> median(datasource$Distance)
[1] 4.9
> datasource=datasource%>%mutate(Distance=replace(Distance,Distance=="0",median(Distance,na.rm = TRUE)))
> datasource$Distance
 [1]  0.3  2.5 10.3 14.0 14.0  4.9  4.9  4.9  1.2  4.9  4.9  4.9  0.5  4.4  4.2  4.5
[17]  4.7  4.9  4.6  5.0 16.5  5.2  5.5 11.9  5.5  7.2  5.5 10.2  5.5  5.5  5.5
> summary(datasource$Distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.300 4.650 4.900 6.081 5.500 16.500
> bench.distance.upper=5.5+1.5*IQR(datasource$Distance)
> bench.distance.upper
[1] 6.775
> bench.distance.lower=4.65-1.5*IQR(datasource$Distance)
> bench.distance.lower
[1] 3.375
> boxplot.stats(datasource$Distance)
$stats
[1] 4.20 4.65 4.90 5.50 5.50
$n
[1] 31
$conf
[1] 4.65879 5.14121
$out
[1] 0.3 2.5 10.3 14.0 14.0 1.2 0.5 16.5 11.9 7.2 10.2
Checking the correlation between all variables (except County and Flood)
> cor(datasource[,-c(2,7)])
Price Size Elevation Sewer Date Distance
Price 1.0000000 -0.04199400 0.35213671 -0.26438725 0.63738811 0.15262149
Size -0.0419940 1.00000000 0.29794356 -0.24038160 -0.08107814 -0.25911106
Elevation 0.3521367 0.29794356 1.00000000 -0.35338447 -0.04597188 -0.38577392
Sewer -0.2643872 -0.24038160 -0.35338447 1.00000000 -0.04516885 -0.04862628
Date 0.6373881 -0.08107814 -0.04597188 -0.04516885 1.00000000 0.55491408
Distance 0.1526215 -0.25911106 -0.38577392 -0.04862628 0.55491408 1.00000000
Regression Analysis
> reg=lm(Price~ .,data = datasource)
> summary(reg)
Call:
lm(formula = Price ~ ., data = datasource)
Residuals:
Min 1Q Median 3Q Max
-7.2857 -2.2803 0.0825 2.7883 6.2263
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.1999788 9.4738190 3.188 0.00410 **
CountySanta Clara -2.6062163 2.0732909 -1.257 0.22135
Size -0.0234471 0.0196519 -1.193 0.24499
Elevation 0.3461871 0.3095115 1.118 0.27490
Sewer -0.0009262 0.0003994 -2.319 0.02963 *
Date 0.1428219 0.0380449 3.754 0.00103 **
FloodYes -6.7512507 2.5744595 -2.622 0.01523 *
Distance -1.2632360 1.5491357 -0.815 0.42318
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.851 on 23 degrees of freedom
Multiple R-squared: 0.6974, Adjusted R-squared: 0.6053
F-statistic: 7.573 on 7 and 23 DF, p-value: 9.102e-05
Null hypothesis: there is no linear relationship between the DV and the IVs (all betas = 0).
Alternate hypothesis: there is a linear relationship between the DV and the IVs (at least one beta is not equal to zero).
The p-value in the above output (9.102e-05) is less than 0.05 (5% level of significance), so we reject the null hypothesis and conclude that there is a linear relationship between the DV and the IVs.
Size, Elevation, Distance and County are not significant enough to be included in the model we are about to build. Hence we run the regression analysis once again without these four variables.
> lreg=lm(Price~ .-Distance-Elevation-Size-County, data = datasource)
> summary(lreg)
Call:
lm(formula = Price ~ . - Distance - Elevation - Size - County,
data = datasource)
Residuals:
Min 1Q Median 3Q Max
-6.7549 -2.0249 -0.2863 2.5275 6.7021
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.0725045 1.8996071 12.146 1.88e-12 ***
Sewer -0.0010025 0.0003297 -3.040 0.005207 **
Date 0.1439904 0.0288820 4.985 3.17e-05 ***
FloodYes -7.0345229 1.9007486 -3.701 0.000971 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The R² value is 65.24% and the adjusted R² is 61.38%. The p-value is 2.23e-06, which is less than 0.05 (5% level of significance); hence we construct a model based on the above output.
The regression analysis yields the following model (coefficients taken from the output above):
Price = 23.0725 - 0.0010025 × Sewer + 0.1439904 × Date - 7.0345 × Flood(Yes)
Let’s check how our model reacts to a set of values. We’ve assumed the values of a few variables to estimate the price of the Leslie Salt property:
Date = 6 (we’re assuming this property will be sold in the next 6 months)
Sewer = 0 and Flood = No (no sewer connection on the property and no flooding)
> leslie_salt <- data.frame(0,"Santa Clara",246.8,0,0,6,"No",0)
> colnames(leslie_salt) <- c("Price", "County", "Size", "Elevation", "Sewer",
"Date", "Flood", "Distance")
> data<-rbind(datasource,leslie_salt)
> leslie_salt_price <- predict(lreg, newdata = data[32,])
> leslie_salt_price
23.93645
The resulting price for the Leslie Salt property would be around $23,936 per acre.