
Statistical Modeling for Outlook Group

Summer Internship
BY
Abhishek Yadav
Content
Task 1: Create Five Problem Statements and Model the Relationships

Task 3: Multiple regression Analysis (MRA) Analytics

Task 4: Forecasting

Task 2: Decision Tree Making


Task 1: Create Five Problem Statements
and Model the Relationships

Introduction
• House price prediction helps buyers decide whether the house they want to buy is worth its price.
• With a house price prediction system, a seller can decide which features to add to a house so that it sells for a higher price.
• OBJECTIVE:
  • The objective of this task is to predict house prices based on various parameters.
Regression Analysis:
• Dependent variable: Price
• Independent variables:

1. year       1978 or 1981
2. age        age of house
3. agesq      age^2
4. nbh        neighbourhood #, 1 to 6
5. cbd        dist. to central bus. dstrct, feet
6. intst      dist. to interstate, feet
7. lintst     log(intst)
8. price      selling price
9. rooms      # rooms in house
10. area      square footage of house
11. land      square footage lot
12. baths     # bathrooms
13. dist      dist. from house to incinerator, feet
14. ldist     log(dist)
15. wind      perc. time wind incin. to house
16. lprice    log(price)
17. y81       =1 if year == 1981
18. larea     log(area)
19. lland     log(land)
20. y81ldist  y81*ldist
21. lintstsq  lintst^2
22. nearinc   =1 if dist <= 15840
23. y81nrinc  y81*nearinc
24. rprice    price, 1978 dollars
25. lrprice   log(rprice)
Descriptive Statistics

        rooms   area      land       baths      dist       ldist      wind   lprice     price
count   321     321       321        321        321        321        321    321        321
mean    6.58    2106.72   39629.89   2.339564   20715.57   9.837414   6.97   11.37812   96100.6604
std     0.90    694.95    39514.39   0.770526   8508.18    0.478383   2.66   0.438174   43223.7289
min     4       735       1710       1          5000       8.517193   3      10.16585   26000
25%     6       1560      16935      2          13400      9.50301    5      11.08214   65000
50%     7       2056      43560      2          19900      9.898475   7      11.36094   85900
75%     7       2544      46100      3          27200      10.21097   11     11.69525   120000
max     10      5136      544500     4          40000      10.59663   11     12.61154   300000
Descriptive Statistics

        lland      y81ldist   larea      nearinc    y81nrinc   rprice       lrprice    y81
count   321        321        321        321        321        321          321        321
mean    10.30186   4.342579   7.597232   0.299065   0.124611   83721.3555   11.26138   0.442368
std     0.801751   4.892513   0.340723   0.458563   0.330792   33118.7858   0.3879     0.497443
min     7.444249   0          6.599871   0          0          26000        10.16585   0
25%     9.74       0          7.35       0          0          59000        10.99      0
50%     10.68      0          7.63       0          0          82000        11.31      0
75%     10.74      9.82       7.84       1          0          100230.40    11.52      1
max     13.20762   10.56875   8.54403    1          1          300000       12.61154   1

Descriptive Statistics

        year       age         agesq        nbh        cbd          intst        lintst     lintstsq
count   321        321         321          321        321          321          321        321
mean    1979.327   18.009346   1381.56698   2.208723   15822.4299   16442.3676   9.480513   90.48225
std     1.492329   32.565845   4801.78876   2.164353   8967.1063    9033.13065   0.777165   14.06647
min     1978       0           0            0          1000         1000         6.9078     47.7177
25%     1978       0           0            0          9000         9000         9.11       82.90
50%     1978       4           16           2          14000        16000        9.68       93.71
75%     1981       22          484          4          23000        24000        10.09      101.73
max     1981       189         35721        6          35000        34000        10.434     108.8684

Scatter Plot
Problem Statement 1: Predicting the price of
house using multivariable linear regression.

• Dependent variable: Price (modelled in logs as lprice)

• Independent variables:
  • age – Age of House
  • lintst – Log(Distance to Interstate in Feet)
  • larea – Log(Square Footage of House)
  • lland – Log(Square Footage of Lot)
  • rooms – No. of Rooms in a House
  • baths – No. of Bathrooms in a House

Scatter plots: 1) x=log(land), y=log(price)  2) x=area, y=price

Joint Plot
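Regression Analysis: Results

lprice = 6.46584626601122
         - 0.00357375 * age
         - 0.03122348 * lintst
         + 0.50968087 * larea
         + 0.0787914 * lland
         + 0.05111123 * rooms
         + 0.10779092 * baths

The left-hand side is the log of price: evaluated at the sample means, the equation gives about 11.38, which is the mean of lprice.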
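The fitted equation can be checked against the descriptive statistics. The sketch below is plain Python; the names `predict` and `sample_means` are illustrative and do not come from the slides. It evaluates the equation at the sample means. An ordinary least-squares fit with an intercept passes through the means, so the result should be close to the mean of the dependent variable.

```python
# Coefficients copied from the regression result above.
INTERCEPT = 6.46584626601122
COEFFICIENTS = {
    "age": -0.00357375,
    "lintst": -0.03122348,
    "larea": 0.50968087,
    "lland": 0.0787914,
    "rooms": 0.05111123,
    "baths": 0.10779092,
}


def predict(features):
    """Return the fitted value for one house given its feature values."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )


# Sample means taken from the descriptive statistics tables.
sample_means = {
    "age": 18.009346,
    "lintst": 9.480513,
    "larea": 7.597232,
    "lland": 10.30186,
    "rooms": 6.58,
    "baths": 2.339564,
}

fitted_at_means = predict(sample_means)
print(round(fitted_at_means, 3))
```

The value is about 11.38, which agrees with the mean of lprice (11.37812) rather than the mean of price (96100.66), so the fitted values are on the log scale.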
PS 2: Relationship between the price of the house and its
distances from the incinerator and the interstate

• Dependent variable: Price (modelled in logs as lprice)

• Independent variables:
  • ldist: Log(dist. from house to incinerator)
  • lintst: Log(dist. to interstate)

Pair Plot
Joint Plot: x=ldist, y=lprice
PS2 Regression Analysis: Results

lprice = 8.868987307663577 + 0.0584945 * ldist + 0.20396533 * lintst
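Because the price and both distances enter in logs (an assumption supported by the joint plot, which uses y=lprice), each coefficient can be read as an elasticity. The sketch below is illustrative only; the helper name `pct_change_in_price` is not from the slides. It converts the two coefficients into the percent change in price when a distance doubles.

```python
# Coefficients copied from the PS2 regression result.
B_LDIST = 0.0584945    # log(distance to incinerator)
B_LINTST = 0.20396533  # log(distance to interstate)


def pct_change_in_price(coef, distance_ratio):
    """Percent change in price when a distance is multiplied by
    distance_ratio, assuming a log-log model (log price on log distance)."""
    return (distance_ratio ** coef - 1.0) * 100.0


double_dist = pct_change_in_price(B_LDIST, 2.0)
double_intst = pct_change_in_price(B_LINTST, 2.0)
print(round(double_dist, 2), round(double_intst, 2))
```

Under that assumption, doubling the distance to the incinerator is associated with a price roughly 4% higher, and doubling the distance to the interstate with a price roughly 15% higher, holding the other variable fixed.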
PS 3: Relationship between the distance to the incinerator and the
percentage of time the wind blows from the incinerator to the house

Pair Plot: 'wind', 'dist', 'ldist'
Joint Plot: x='ldist', y='wind'

Lmplot: lmplot combines regplot and FacetGrid. The FacetGrid class helps visualize
the distribution of one variable, as well as the relationship between multiple
variables, separately within subsets of the dataset using multiple panels.
PS-4: Heat map of the house price with respect to
the number of rooms and bathrooms

• index='baths', columns='rooms', values='lprice'
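The three arguments describe a pivot: one row per value of baths, one column per value of rooms, and lprice in each cell. The slides do not say how several houses in the same cell are combined, so the plain-Python sketch below assumes the mean. The function name `pivot_mean` and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def pivot_mean(rows, index, columns, values):
    """Average `values` for every (index, columns) pair, like a pivot table."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = (row[index], row[columns])
        sums[key] += row[values]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical rows for illustration only.
houses = [
    {"baths": 2, "rooms": 6, "lprice": 11.0},
    {"baths": 2, "rooms": 6, "lprice": 11.4},
    {"baths": 3, "rooms": 7, "lprice": 11.8},
]

grid = pivot_mean(houses, index="baths", columns="rooms", values="lprice")
for (baths, rooms), mean_lprice in sorted(grid.items()):
    print(baths, rooms, round(mean_lprice, 2))
```

Each key of `grid` is one cell of the heat map; cells with no houses are simply absent.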


PS-5: Relationship between the square footage of the house,
the area of the lot and the price of the house
• 1) x='larea', y='lprice', hue='nbh'  2) x='lland', y='lprice', hue='nbh'
Omitted Variable Bias

1. Front Road Widths
2. Market Reservation Time
3. South-facing
4. Transport facility
Task 3: Multiple Regression Analysis (MRA)
Analytics

• Introduction
  • Customers buying a used car want assurance that the money they invest is worth it.
  • So there is a need for a used car price prediction system that effectively determines the worth of a car from a variety of features.
• OBJECTIVE:
  • The objective of this project is to analyze the factors affecting the price of a used car using multiple regression analysis.
Regression Analysis:

• Dependent variable: Price

• Independent variables:

Variable        Description
Model           Model Description
Age_08_04       Age in months as in August 2004
KM              Accumulated Kilometers on odometer
HP              Horsepower
CC              Cylinder Volume in cubic centimetres
Doors           Number of doors
Quarterly_Tax   Quarterly road tax in EUROs
Weight          Weight in Kilograms
Descriptive Analysis

        Price      Age     KM         HP       cc        Doors   Quarterly_Tax   Weight
count   1436       1436    1436       1436     1436      1436    1436            1436
mean    10730.82   55.94   68533.25   101.50   1576.85   4.03    87.12           1072.4596
std     3626.96    18.59   37506.45   14.98    424.38    0.95    41.13           52.64112
min     4350       1       1          69       1300      2       19              1000
25%     8450       44      43000      90       1400      3       69              1040
50%     9900       61      63389.5    110      1600      4       85              1070
75%     11950      70      87020.75   110      1600      5       85              1085
max     32500      80      243000     192      16000     5       283             1615
Scatter Plot
Joint Plot: 1) x='KM', y='Price'  2) x='Age', y='Price'


Conclusions
The final regression equation for our analysis is:

Price = -2752.19
        - 121.747074 * Age
        - 0.0205478161 * KM
        + 3.37 * HP
        - 0.125041688 * cc
        - 22.5862778 * Doors
        + 4.12770307 * Quarterly_Tax
        + 16.9791623 * Weight

From the above equation it can be concluded that:

• Price decreases as the Age and the distance travelled (KM) of the car increase.
• Price increases with Horsepower (HP) and with the quarterly tax.
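The two conclusions can be illustrated by comparing predictions for cars that differ in a single feature. This is a minimal sketch in plain Python: the coefficients are copied as printed in the equation, the baseline car is hypothetical (built from the 50% row of the descriptive table), and the names `predict_price`, `base`, `older` and `more_km` are illustrative.

```python
# Coefficients copied as printed in the regression equation.
INTERCEPT = -2752.19
COEFFICIENTS = {
    "Age": -1.21747074e+02,
    "KM": -2.05478161e-02,
    "HP": 3.37,
    "cc": -1.25041688e-01,
    "Doors": -2.25862778e+01,
    "Quarterly_Tax": 4.12770307e+00,
    "Weight": 1.69791623e+01,
}


def predict_price(car):
    """Return the fitted price for one car given its feature values."""
    return INTERCEPT + sum(COEFFICIENTS[k] * car[k] for k in COEFFICIENTS)


# Hypothetical baseline car using the medians from the descriptive table.
base = {"Age": 61, "KM": 63389.5, "HP": 110, "cc": 1600,
        "Doors": 4, "Quarterly_Tax": 85, "Weight": 1070}

older = dict(base, Age=base["Age"] + 12)      # same car, 12 months older
more_km = dict(base, KM=base["KM"] + 10000)   # same car, 10,000 km more

age_effect = predict_price(older) - predict_price(base)
km_effect = predict_price(more_km) - predict_price(base)
print(round(age_effect, 2), round(km_effect, 2))
```

With everything else held fixed, twelve more months of age lowers the fitted price by about 1461, and 10,000 more kilometres lowers it by about 205, in the units of Price.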
Task 4: Forecasting

• Forecasting is a planning tool that helps management in its attempts to cope with the uncertainty of the future, relying mainly on data from the past and present and analysis of trends.

• Problem formulation and data collection:
  • Quarterly sales of goods are given for five years (1991 to 1995). Using forecasting methods, fill in the tables of trend, index values, seasonal sales, etc.

Quarterly Sales: sales per quarter (x $10,000)

Year   I    II   III   IV
1991   16   21   9     18
1992   15   20   10    18
1993   17   24   13    22
1994   17   25   11    21
1995   18   26   14    25
Model building and evaluation:

• Actual to Moving Data: We use Holt’s linear exponential smoothing.

Year   Quarter   Actual   4-quarter      4-quarter        4-quarter centred   Percentage of actual to
                 sales    moving total   moving average   moving average      moving average values
1991   I         16       NA             NA               NA                  NA
       II        21       NA             NA               NA                  NA
       III       9        64             16               15.875              56.7
       IV        18       63             15.75            15.625              115.2
1992   I         15       62             15.5             15.625              96.0
       II        20       63             15.75            15.75               127.0
       III       10       63             15.75            16                  62.5
       IV        18       65             16.25            16.75               107.5
1993   I         17       69             17.25            17.625              96.5
       II        24       72             18               18.5                129.7
       III       13       76             19               19                  68.4
       IV        22       76             19               19.125              115.0
1994   I         17       77             19.25            19                  89.5
       II        25       75             18.75            18.625              134.2
       III       11       74             18.5             18.625              59.1
       IV        21       75             18.75            18.875              111.3
1995   I         18       76             19               19.375              92.9
       II        26       79             19.75            20.25               128.4
       III       14       83             20.75            NA                  NA
       IV        25       NA             NA               NA                  NA

Each moving total is listed on the third quarter of its four-quarter window. No centred value is available for 1995 III and IV, since a centred average needs two complete four-quarter windows.
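The moving-average columns can be recomputed from the quarterly sales table. This is a minimal sketch in plain Python; the names `moving_totals`, `centred` and `percent` are illustrative. Only complete four-quarter windows are used, so the first two and the last two quarters have no centred value.

```python
# Quarterly sales 1991 I to 1995 IV, from the sales table.
sales = [16, 21, 9, 18,
         15, 20, 10, 18,
         17, 24, 13, 22,
         17, 25, 11, 21,
         18, 26, 14, 25]


def moving_totals(values, window=4):
    """Sum of every complete run of `window` consecutive values."""
    return [sum(values[i:i + window]) for i in range(len(values) - window + 1)]


totals = moving_totals(sales)
averages = [t / 4 for t in totals]

# Averaging two neighbouring 4-quarter averages centres the result on a
# quarter: centred[i] belongs to sales[i + 2].
centred = [(averages[i] + averages[i + 1]) / 2 for i in range(len(averages) - 1)]

# Actual sales as a percentage of the centred moving average.
percent = [100 * sales[i + 2] / c for i, c in enumerate(centred)]

print(totals[:3], centred[0], round(percent[0], 1))
```

The first centred value belongs to 1991 III and the last to 1995 II.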
Deseasonalization:

Intercept   14.91052632
Slope       0.189473684

SN   Year   Quarter   Actual sales   Seasonal index/100   Deseasonalized sales
1    1991   I         16             0.946745562          15.1
2           II        21             1.286335404          15.3
3           III       9              0.657807309          15.5
4           IV        18             1.094520548          15.7
5    1992   I         15             0.934306569          15.9
6           II        20             1.274131274          16.0
7           III       10             0.673640167          16.2
8           IV        18             1.07860262           16.4
9    1993   I         15             0.928909953          16.6
10          II        20             1.272108844          16.8
11          III       10             0.681818182          17.0
12          IV        18             1.048192771          17.2
13   1994   I         17             0.918918919          17.4
14          II        24             1.282442748          17.6
15          III       13             0.728971963          17.8
16          IV        18             1.063829787          17.9
17   1995   I         17             0.894736842          18.1
18          II        24             1.220338983          18.3
19          III       13             0.742857143          18.5
20          IV        22             1                    18.7
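The intercept and slope are those of a least-squares line through the Actual sales column of this table, with x = 1 to 20 as the quarter number. The sketch below refits that line in plain Python; the function name `fit_line` is illustrative. It uses the column exactly as printed here, which differs from the quarterly sales table for 1993 to 1995.

```python
def fit_line(ys):
    """Least-squares line y = a + b*x for x = 1, 2, ..., len(ys)."""
    n = len(ys)
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# "Actual sales" column as printed in the deseasonalization table.
actual_as_printed = [16, 21, 9, 18, 15, 20, 10, 18, 15, 20,
                     10, 18, 17, 24, 13, 18, 17, 24, 13, 22]

intercept, slope = fit_line(actual_as_printed)
trend = [intercept + slope * x for x in range(1, 21)]
print(round(intercept, 8), round(slope, 9), round(trend[0], 1), round(trend[-1], 1))
```

The trend values run from 15.1 at x = 1 to 18.7 at x = 20, matching the last column of the table.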
Cyclical Variation:

Year   Quarter   Deseasonalized   x    Seasonal      Seasonalized sales   Percent of trend
                 sales (Y)             index/100     (Y = a + bx)
1991   I         15.1             1    0.922222222   13.92555556          92.2
       II        15.3             2    1.288888889   19.70643275          128.9
       III       15.5             3    0.633333333   9.803333333          63.3
       IV        15.7             4    1.155555556   18.10573099          115.6
1992   I         15.9             5    0.922222222   14.62450292          92.2
       II        16.0             6    1.288888889   20.68327485          128.9
       III       16.2             7    0.633333333   10.28333333          63.3
       IV        16.4             8    1.155555556   18.98152047          115.6
1993   I         16.6             9    0.922222222   15.32345029          92.2
       II        16.8             10   1.288888889   21.66011696          128.9
       III       17.0             11   0.633333333   10.76333333          63.3
       IV        17.2             12   1.155555556   19.85730994          115.6
1994   I         17.4             13   0.922222222   16.02239766          92.2
       II        17.6             14   1.288888889   22.63695906          128.9
       III       17.8             15   0.633333333   11.24333333          63.3
       IV        17.9             16   1.155555556   20.73309942          115.6
1995   I         18.1             17   0.922222222   16.72134503          92.2
       II        18.3             18   1.288888889   23.61380117          128.9
       III       18.5             19   0.633333333   11.72333333          63.3
       IV        18.7             20   1.155555556   21.60888889          115.6
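The Seasonal index/100 column is consistent with the simple-averages method: the mean sales of each quarter over 1991 to 1995 divided by the mean of all twenty quarters. The Seasonalized sales column is then the trend value a + bx multiplied by that index. The sketch below reproduces both in plain Python; the names `seasonal_index` and `seasonalized` are illustrative, and a and b are the intercept and slope quoted in the deseasonalization table.

```python
# Quarterly sales by year, from the sales table (columns I, II, III, IV).
sales_by_year = {
    1991: [16, 21, 9, 18],
    1992: [15, 20, 10, 18],
    1993: [17, 24, 13, 22],
    1994: [17, 25, 11, 21],
    1995: [18, 26, 14, 25],
}

years = list(sales_by_year.values())
grand_mean = sum(sum(row) for row in years) / (4 * len(years))

# Simple-averages seasonal index: quarter mean divided by the overall mean.
seasonal_index = [
    (sum(row[q] for row in years) / len(years)) / grand_mean for q in range(4)
]

A, B = 14.91052632, 0.189473684  # intercept and slope of the trend line


def seasonalized(x):
    """Trend value at quarter number x (1-based) times its seasonal index."""
    return (A + B * x) * seasonal_index[(x - 1) % 4]


print([round(s, 4) for s in seasonal_index], round(seasonalized(1), 4))
```

The four indices sum to 4, so the seasonal adjustment leaves the overall level of the trend unchanged.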
Task 2: Decision Tree Making

• Objective: Analyse the dataset and the nature of its variables, and find the course of action using a suitable decision tree.

There are 14 attributes in each case of the dataset. They are:
CRIM      per capita crime rate by town
ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centres
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      Median value of owner-occupied homes in $1000
Pair Wise Scatter Plot for all variables:
Correlation Matrix:
Decision Tree:
Accuracy: 0.9210526315789473
Precision: 0.68
Recall: 0.8095
Confusion Matrix: [[123 8] [4 17]]
Mean Squared Error: 0.07894736842105263
Mean Absolute Error: 0.07894736842105263
Root Mean Squared Error: 0.28097574347450816
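All of the reported scores follow from the confusion matrix. The sketch below recomputes them in plain Python, assuming that rows are actual classes and columns are predicted classes, with the second class as the positive one. The slides do not state the layout, but this reading reproduces the reported precision and recall.

```python
import math

# Confusion matrix as reported; layout assumed: rows = actual, cols = predicted.
confusion = [[123, 8],
             [4, 17]]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# With 0/1 labels every error has size 1, so the mean squared error and the
# mean absolute error both equal the misclassification rate.
mse = (fp + fn) / total
mae = (fp + fn) / total
rmse = math.sqrt(mse)

print(accuracy, precision, round(recall, 4), mse, rmse)
```

The identity accuracy = 1 - MSE holds here because the errors are counted on binary labels.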
