Capstone Project Report
Structure:
1. Introduction
2. Data Loading
3. EDA - Univariate
4. EDA - Bivariate
5. Data Preprocessing
6. Model Building with Dataset-1
7. Hypertuning Dataset-1
8. Summary - Dataset-1
9. Model Building with Dataset-2
10. Hypertuning Dataset-2
11. Summary - Dataset-2
12. Conclusion
13. Pickle file creation
Note:
Dataset - 1 = 22 features
['price', 'room_bed', 'room_bath', 'living_measure', 'lot_measure', 'ceil', 'coast', 'sight', 'condition', 'quality', 'ceil_measure', 'basement', 'yr_built',
'living_measure15', 'lot_measure15', 'furnished', 'total_area', 'month_year', 'City', 'has_basement', 'HouseLandRatio', 'has_renovated']
Dataset - 2 = 31 features (important features after creating dummy variables and analyzing different models)
['price', 'room_bed', 'room_bath', 'living_measure', 'lot_measure', 'ceil', 'sight', 'condition', 'ceil_measure', 'basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
'long', 'living_measure15', 'lot_measure15', 'total_area', 'coast_1', 'quality_3', 'quality_4', 'quality_5', 'quality_6', 'quality_7', 'quality_8', 'quality_9',
'quality_10', 'quality_11', 'quality_12', 'quality_13', 'furnished_1']
1. Need to add file USA ZipCodes_1.xlsx to current working directory to access this data
2. Add the folder WA to your current working directory
3. Install the below 2 libraries
conda install -c conda-forge/label/cf201901 geopandas
conda install -c conda-forge/label/cf201901 shapely
This Jupyter Notebook was prepared as part of the PGPML Great Learning Programme Capstone Project. Let's first define the problem and the
objective of this exercise.
The problem statement is well defined in the given document, as follows:
INTRODUCTION
Problem Statement
A house's value is determined by more than just location and square footage. Like the features that make up a person, an educated party would
want to know all the aspects that give a house its value. For example, if we want to sell a house we don't know what price to ask, as it can't
be too low or too high. To find a house's price we usually look for similar properties in the neighbourhood and, based on the collected data,
try to assess our house's price.
Problem Definition
When a person or business wants to sell or buy a house, they face this issue: they don't know what price they should offer. Because of this
they might offer too little or too much for the property. Therefore, we can analyze the available data on properties in the area and predict
the price. We need to find how these attributes influence house prices. Right pricing is a very important aspect of selling a house, so it is
important to understand what the factors are and how they influence the house price. The objective is to predict the right price of a house
based on its attributes.
Objective
Build a model that predicts the house price when the required features are passed to it. So we will:
Find the significant features in the given dataset which affect the house price the most.
Build the best feasible model to predict the house price at a 95% confidence level.
Business Reason
As people often don't know which features/aspects cumulatively determine a property's price, we can provide HouseBuyingSelling guidance
services in the area, so they can buy or sell their property at the most suitable price tag and not lose their hard-earned money by offering
too low a price, or keep waiting for buyers by asking too high a price.
DATA LOADING
First, we will load the data from the given CSV (comma-separated values) file provided as part of the Capstone Project.
In [2]:
#Supress warnings
import warnings
warnings.filterwarnings('ignore')
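The cell that actually reads the project CSV into house_df (and sets up pandas, matplotlib/seaborn, and the plotSizeX/plotSizeY constants used throughout) is not shown in this export. A minimal sketch of the load step; the two-row inline sample is a stand-in, since the real filename and data are not reproduced here:

```python
import io
import pandas as pd

# Stand-in for the (unshown) load cell: in the notebook, pd.read_csv is
# called on the project CSV. This inline sample mimics its first columns.
sample = io.StringIO(
    "cid,dayhours,price,room_bed\n"
    "3034200666,20141107T000000,808100,4\n"
    "8731981640,20141204T000000,277500,4\n"
)
house_df = pd.read_csv(sample)
print(house_df.shape)  # (2, 4)
```

In the notebook itself, house_df holds the full 21613 × 23 dataset.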
In [3]:
# let's check whether data loaded successfully or not, by checking first few records
house_df.head()
Out[3]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... basement yr_built yr_renovated zipcode
0 3034200666 20141107T000000 808100 4 3.25 3020 13457 1.0 0 0 ... 0 1956 0 98133
1 8731981640 20141204T000000 277500 4 2.50 2550 7500 1.0 0 0 ... 800 1976 0 98023
2 5104530220 20150420T000000 404000 3 2.50 2370 4324 2.0 0 0 ... 0 2006 0 98038
3 6145600285 20140529T000000 300000 2 1.00 820 3844 1.0 0 0 ... 0 1916 0 98133
4 8924100111 20150424T000000 699000 2 1.50 1400 4050 1.0 0 0 ... 0 1954 0 98115
5 rows × 23 columns
Data is loaded successfully, as we can see the first 5 records of the dataset.
Data Understanding
After loading data into our pandas library dataframe, we can now try to understand the kind of data we have with us.
In [4]:
# print the number of records and features/aspects we have in the provided file
house_df.shape
Out[4]:
(21613, 23)
In [5]:
house_df.columns
Out[5]:
From the above we can see the different columns we have in the dataset.
1. cid: Notation for a house. It will not be of use to us, so we will drop this column
2. dayhours: Represents Date, when house was sold.
3. price: It's our TARGET feature, that we have to predict based on other featues
4. room_bed: Represents number of bedrooms in a house
5. room_bath: Represents number of bathrooms
6. living_measure: Represents square footage of house
7. lot_measure: Represents square footage of lot
8. ceil: Represents number of floors in house
9. coast: Represents whether house has waterfront view. It seems to be a categorical variable. We will see in our further data analysis
10. sight: Represents how many times the property has been viewed.
11. condition: Represents the overall condition of the house. It's kind of rating given to the house.
12. quality: Represents grade given to the house based on grading system
13. ceil_measure: Represents square footage of house apart from basement
14. basement: Represents square footage of basement
15. yr_built: Represents the year when house was built
16. yr_renovated: Represents the year when house was last renovated
17. zipcode: Represents zipcode as name implies
18. lat: Represents latitude coordinates
19. long: Represents longitude coordinates
20. living_measure15: Represents square footage of the house as measured in 2015, since the house area may or may not have changed after any
renovation
21. lot_measure15: Represents square footage of the lot as measured in 2015, since the lot area may or may not have changed after any
renovation
22. furnished: Tells whether house is furnished or not. It seems to be categorical variable as description implies
23. total_area: Represents total area i.e. area of both living and lot
In [6]:
house_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 23 columns):
cid 21613 non-null int64
dayhours 21613 non-null object
price 21613 non-null int64
room_bed 21613 non-null int64
room_bath 21613 non-null float64
living_measure 21613 non-null int64
lot_measure 21613 non-null int64
ceil 21613 non-null float64
coast 21613 non-null int64
sight 21613 non-null int64
condition 21613 non-null int64
quality 21613 non-null int64
ceil_measure 21613 non-null int64
basement 21613 non-null int64
yr_built 21613 non-null int64
yr_renovated 21613 non-null int64
zipcode 21613 non-null int64
lat 21613 non-null float64
long 21613 non-null float64
living_measure15 21613 non-null int64
lot_measure15 21613 non-null int64
furnished 21613 non-null int64
total_area 21613 non-null int64
dtypes: float64(4), int64(18), object(1)
memory usage: 3.8+ MB
In the dataset, we have more than 21k records and 23 columns, out of which
4 features are of float type
18 features are of integer type
1 feature is of object type (we may need to convert this object type to specific datatype)
In [7]:
house_df.isnull().sum()
Out[7]:
cid 0
dayhours 0
price 0
room_bed 0
room_bath 0
living_measure 0
lot_measure 0
ceil 0
coast 0
sight 0
condition 0
quality 0
ceil_measure 0
basement 0
yr_built 0
yr_renovated 0
zipcode 0
lat 0
long 0
living_measure15 0
lot_measure15 0
furnished 0
total_area 0
dtype: int64
We don't have any null or missing values for any of the columns
In [8]:
# let's check whether there's any duplicate record in our dataset or not. If present, we have to remove them
house_df.duplicated().sum()
Out[8]:
0
We don't have any duplicate records in our dataset, so we can say we have more than 21k unique records.
In [9]:
house_df.describe().transpose()
Out[9]:
Most columns' distributions are right-skewed; only a few features are left-skewed (like room_bath, yr_built, lat).
The columns that are categorical in nature are: coast, yr_renovated, furnished
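The skewness read off describe() can be confirmed numerically with pandas' skew(); a toy illustration on a stand-in series (positive skewness means a long right tail, i.e. right-skewed):

```python
import pandas as pd

# A series with one large value has a long right tail, so skew() > 0.
s = pd.Series([1, 2, 2, 3, 50])
print(s.skew() > 0)  # True
```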
In [10]:
#cid - cid appears multiple times; it seems the data contains houses that were sold multiple times
cid_count=house_df.cid.value_counts()
cid_count[cid_count>1].shape
Out[11]:
(176,)
We have 176 properties that were sold more than once in the given data
In [12]:
#we will create new data frame that can be used for modeling
#We will convert the dayhours to 'month_year' as sale month-year is relevant for analysis
house_dfr=house_df.copy()
house_df.dayhours=house_df.dayhours.str.replace('T000000', "")
house_df.dayhours=pd.to_datetime(house_df.dayhours,format='%Y%m%d')
house_df['month_year']=house_df['dayhours'].apply(lambda x: x.strftime('%B-%Y'))
house_df['month_year'].head()
Out[12]:
0 November-2014
1 December-2014
2 April-2015
3 May-2014
4 April-2015
Name: month_year, dtype: object
In [13]:
house_df['month_year'].value_counts()
Out[13]:
April-2015 2231
July-2014 2211
June-2014 2180
August-2014 1940
October-2014 1878
March-2015 1875
September-2014 1774
May-2014 1768
December-2014 1471
November-2014 1411
February-2015 1250
January-2015 978
May-2015 646
Name: month_year, dtype: int64
In [14]:
house_df.groupby(['month_year'])['price'].agg('mean')
Out[14]:
month_year
April-2015 561933.463021
August-2014 536527.039691
December-2014 524602.893270
February-2015 507919.603200
January-2015 525963.251534
July-2014 544892.161013
June-2014 558123.736239
March-2015 544057.683200
May-2014 548166.600113
May-2015 558193.095975
November-2014 522058.861800
October-2014 539127.477636
September-2014 529315.868095
Name: price, dtype: float64
So the timeline of the sale data of the properties runs from May-2014 to May-2015, and April has the highest mean price.
In [15]:
house_df.price.describe()
Out[15]:
count 2.161300e+04
mean 5.401822e+05
std 3.673622e+05
min 7.500000e+04
25% 3.219500e+05
50% 4.500000e+05
75% 6.450000e+05
max 7.700000e+06
Name: price, dtype: float64
In [16]:
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df['price'])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x225ef84d550>
In [17]:
house_df['room_bed'].value_counts()
Out[17]:
3 9824
4 6882
2 2760
5 1601
6 272
1 199
7 38
8 13
0 13
9 6
10 3
11 1
33 1
Name: room_bed, dtype: int64
The value of 33 seems to be an outlier; we need to check this data point before treating it.
In [18]:
house_df[house_df['room_bed']==33]
Out[18]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... yr_built yr_renovated zipcode lat long
750 2402100895 2014-06-25 640000 33 1.75 1620 6000 1.0 0 0 ... 1947 0 98103 47.6878 -122.331
1 rows × 24 columns
We will delete this data point after bivariate analysis, as it looks like an outlier: the price is low for a 33-bedroom property.
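One hedged way to back up flagging the 33-bedroom record (the notebook's exact criterion isn't shown) is the usual 1.5×IQR rule, sketched here on a small stand-in series:

```python
import pandas as pd

# Flag values outside 1.5*IQR of the quartiles (illustrative data).
room_bed = pd.Series([3, 4, 2, 5, 3, 4, 3, 33])
q1, q3 = room_bed.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = room_bed[(room_bed < q1 - 1.5 * iqr) | (room_bed > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [33]
```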
In [19]:
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot(house_df.room_bed,color='green')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x225ef14f780>
In [20]:
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot(house_df.room_bath,color='green')
house_df['room_bath'].value_counts().sort_index()
Out[20]:
0.00 10
0.50 4
0.75 72
1.00 3852
1.25 9
1.50 1446
1.75 3048
2.00 1930
2.25 2047
2.50 5380
2.75 1185
3.00 753
3.25 589
3.50 731
3.75 155
4.00 136
4.25 79
4.50 100
4.75 23
5.00 21
5.25 13
5.50 10
5.75 4
6.00 6
6.25 2
6.50 2
6.75 2
7.50 1
7.75 1
8.00 2
Name: room_bath, dtype: int64
In [21]:
plt.figure(figsize=(plotSizeX, plotSizeY))
print("Skewness is :",house_df.room_bath.skew())
sns.distplot(house_df.room_bath)
Skewness is : 0.511107573347417
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x225ef14f748>
In [22]:
Skewness is : 1.471555426802092
Out[22]:
count 21613.000000
mean 2079.899736
std 918.440897
min 290.000000
25% 1427.000000
50% 1910.000000
75% 2550.000000
max 13540.000000
Name: living_measure, dtype: float64
Data distribution tells us, living_measure is right-skewed.
In [23]:
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x225ef0f5b70>
There are many outliers in living measure. Need to review further to treat the same.
In [24]:
#checking the no. of data points with Living measure greater than 8000
house_df[house_df['living_measure']>8000]
Out[24]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... yr_built yr_renovated zipcode lat
264 9208900037 2014-09-19 6890000 6 7.75 9890 31374 2.0 0 4 ... 2001 0 98039 47.6305 -122
668 1924059029 2014-06-17 4670000 5 6.75 9640 13068 1.0 1 4 ... 1983 2009 98040 47.5570 -122
1123 2303900035 2014-06-11 2890000 5 6.25 8670 64033 2.0 0 4 ... 1965 2003 98177 47.7295 -122
4789 1247600105 2014-10-20 5110000 5 5.25 8010 45517 2.0 1 4 ... 1999 0 98033 47.6767 -122
16785 6762700020 2014-10-13 7700000 6 8.00 12050 27600 2.5 0 3 ... 1910 1987 98102 47.6298 -122
18393 6072800246 2014-07-02 3300000 5 6.25 8020 21738 2.0 0 0 ... 2001 0 98006 47.5675 -122
19888 9808700762 2014-06-11 7060000 5 4.50 10040 37325 2.0 1 2 ... 1940 2001 98004 47.6500 -122
20740 1225069038 2014-05-05 2280000 7 8.00 13540 307752 3.0 0 4 ... 1999 0 98053 47.6675 -121
20917 2470100110 2014-08-04 5570000 5 5.75 9200 35069 2.0 0 0 ... 2001 0 98039 47.6289 -122
9 rows × 24 columns
We have only 9 properties/houses with living_measure above 8,000, so we will treat these outliers.
In [25]:
Skewness is : 13.06001895903175
Out[25]:
count 2.161300e+04
mean 1.510697e+04
std 4.142051e+04
min 5.200000e+02
25% 5.040000e+03
50% 7.618000e+03
75% 1.068800e+04
max 1.651359e+06
Name: lot_measure, dtype: float64
In [26]:
#checking the no. of data points with Lot measure greater than 1250000
house_df[house_df['lot_measure']>1250000]
Out[26]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... yr_built yr_renovated zipcode lat
1113 1020069017 2015-03-27 700000 4 1.0 1300 1651359 1.0 0 3 ... 1920 0 98022 47.2313 -122.02
1 rows × 24 columns
We have only 1 property with lot_measure above 1,250,000, so we need to treat this.
In [27]:
#ceil - most houses have 1 or 2 floors
house_df.ceil.value_counts()
Out[27]:
1.0 10680
2.0 8241
1.5 1910
3.0 613
2.5 161
3.5 8
Name: ceil, dtype: int64
In [28]:
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot('ceil',data=house_df)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x225ef19ff60>
The above graph confirms the same: most properties have 1 or 2 floors
In [29]:
#coast - most houses do not have a waterfront view; very few are waterfront
house_df.coast.value_counts()
Out[29]:
0 21450
1 163
Name: coast, dtype: int64
In [30]:
#sight - most properties have not been viewed
house_df.sight.value_counts()
Out[30]:
0 19489
2 963
3 510
1 332
4 319
Name: sight, dtype: int64
In [31]:
#condition - overall, most houses are rated 3 or above for their condition
house_df.condition.value_counts()
Out[31]:
3 14031
4 5679
5 1701
2 172
1 30
Name: condition, dtype: int64
Analyzing Feature: quality
In [32]:
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x225eedbd358>
In [33]:
Out[33]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... yr_built yr_renovated zipcode lat
264 9208900037 2014-09-19 6890000 6 7.75 9890 31374 2.0 0 4 ... 2001 0 98039 47.6305 -122
1123 2303900035 2014-06-11 2890000 5 6.25 8670 64033 2.0 0 4 ... 1965 2003 98177 47.7295 -122
1583 2426039123 2015-01-30 2420000 5 4.75 7880 24250 2.0 0 2 ... 1996 0 98177 47.7334 -122
7095 2303900100 2014-09-11 3800000 3 4.25 5510 35000 2.0 0 4 ... 1997 0 98177 47.7296 -122
8509 4139900180 2015-04-20 2340000 4 2.50 4500 35200 1.0 0 0 ... 1988 0 98006 47.5477 -122
9446 1068000375 2014-09-23 3200000 6 5.00 7100 18200 2.5 0 0 ... 1933 2002 98199 47.6427 -122
10387 7237501190 2014-10-10 1780000 4 3.25 4890 13402 2.0 0 0 ... 2004 0 98059 47.5303 -122
12320 1725059316 2014-11-20 2390000 4 4.00 6330 13296 2.0 0 2 ... 2000 0 98033 47.6488 -122
12686 853200010 2014-07-01 3800000 5 5.50 7050 42840 1.0 0 2 ... 1978 0 98004 47.6229 -122
16785 6762700020 2014-10-13 7700000 6 8.00 12050 27600 2.5 0 3 ... 1910 1987 98102 47.6298 -122
17322 9831200500 2015-03-04 2480000 5 3.75 6810 7500 2.5 0 0 ... 1922 0 98102 47.6285 -122
20892 3303850390 2014-12-12 2980000 5 5.50 7400 18898 2.0 0 3 ... 2001 0 98006 47.5431 -122
20917 2470100110 2014-08-04 5570000 5 5.75 9200 35069 2.0 0 0 ... 2001 0 98039 47.6289 -122
13 rows × 24 columns
There are only 13 properties which have the highest quality rating.
Analyzing Feature: ceil_measure
In [34]:
Skewness is : 1.4466644733818372
Out[34]:
count 21613.000000
mean 1788.390691
std 828.090978
min 290.000000
25% 1190.000000
50% 1560.000000
75% 2210.000000
max 9410.000000
Name: ceil_measure, dtype: float64
In [35]:
Out[35]:
<seaborn.axisgrid.FacetGrid at 0x225ef353f28>
In [36]:
#basement_measure
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.basement)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x225f1238080>
We can see 2 gaussians, which tells us that some properties don't have basements and some do.
In [37]:
house_df[house_df.basement==0].shape
Out[37]:
(13126, 24)
In [38]:
#13126 houses have a basement measure of zero i.e. they do not have basements
#let's plot a boxplot for properties which have basements only
house_df_base=house_df[house_df['basement']>0]
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.boxplot(house_df_base['basement'])
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x225f0f92a20>
We can clearly see there are outliers. We need to treat these before building our model.
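The treatment method isn't shown at this point in the notebook; one common option is capping values at the 1.5×IQR upper whisker, sketched here on stand-in data:

```python
import pandas as pd

# Cap extreme values at the boxplot's upper whisker (q3 + 1.5*IQR).
s = pd.Series([100, 200, 300, 400, 5000])
q1, q3 = s.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
capped = s.clip(upper=upper)
print(capped.tolist())  # values above the whisker are pulled down to it
```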
In [39]:
#checking the no. of data points with 'basement' greater than 4000
house_df[house_df['basement']>4000]
Out[39]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... yr_built yr_renovated zipcode lat
668 1924059029 2014-06-17 4670000 5 6.75 9640 13068 1.0 1 4 ... 1983 2009 98040 47.5570 -122
20740 1225069038 2014-05-05 2280000 7 8.00 13540 307752 3.0 0 4 ... 1999 0 98053 47.6675 -121
2 rows × 24 columns
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x225f102bd30>
In [41]:
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x225f125f5c0>
The built years of the properties range from 1900 to 2014, and we can see an upward trend over time
In [42]:
house_df[house_df['yr_renovated']>0].shape
Out[42]:
(914, 24)
In [43]:
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x225ef896208>
Now we will create an age column from the columns yr_built & yr_renovated.
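The age-derivation cell itself is not shown in this export; a sketch of one plausible construction, where the reference sale year 2015 is an assumption based on the data's May-2014 to May-2015 timeline:

```python
import pandas as pd

# Stand-in rows; in the notebook this runs on house_df's real columns.
df = pd.DataFrame({"yr_built": [1956, 1976], "yr_renovated": [0, 2009]})
# Count age from the renovation year when one exists, else from yr_built.
effective_year = df["yr_renovated"].where(df["yr_renovated"] > 0, df["yr_built"])
df["age"] = 2015 - effective_year
print(df["age"].tolist())  # [59, 6]
```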
In [46]:
#For geographic visual
import geopandas as gpd
from shapely.geometry import Point, Polygon
#For current working directory
import os
cwd = os.getcwd()
In [47]:
## Need to add file USA ZipCodes_1.xlsx to current working directory to access this data
USAZip=pd.read_excel("USA ZipCodes_1.xlsx",sheet_name="Sheet8")
USAZip.head()
Out[47]:
In [48]:
house_df=house_df.merge(USAZip,how='left',on='zipcode')
#house_df.drop_duplicates()
In [49]:
Out[49]:
(21613, 27)
In [5]:
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ccf1142588>
In [51]:
Out[51]:
In [52]:
house_df.Type.value_counts()
Out[52]:
Standard 21613
Name: Type, dtype: int64
As the Type value is the same for all records, we will remove this column from further analysis
In [53]:
house_df.City.value_counts()
Out[53]:
Seattle 8977
Renton 1597
Bellevue 1407
Kent 1203
Redmond 979
Kirkland 977
Auburn 912
Sammamish 800
Federal Way 779
Issaquah 733
Maple Valley 590
Woodinville 471
Snoqualmie 310
Kenmore 283
Mercer Island 282
Enumclaw 234
North Bend 221
Bothell 195
Duvall 190
Carnation 124
Vashon 118
Black Diamond 100
Fall City 81
Medina 50
Name: City, dtype: int64
In [54]:
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.countplot('furnished',data=house_df)
house_df.furnished.value_counts()
Out[54]:
0 17362
1 4251
Name: furnished, dtype: int64
Most properties are not furnished. The furnished column needs to be converted into a categorical column
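The conversion mentioned above can be done with pandas' categorical dtype; a sketch on a small stand-in frame (the same cast applies to the other 0/1 indicator columns):

```python
import pandas as pd

# Cast 0/1 indicator columns to pandas' categorical dtype.
df = pd.DataFrame({"furnished": [0, 1, 0], "coast": [0, 0, 1]})
for col in ["furnished", "coast"]:
    df[col] = df[col].astype("category")
print(df.dtypes.astype(str).tolist())  # ['category', 'category']
```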
BIVARIATE ANALYSIS
PairPlot
In [55]:
# let's plot all the variables and confirm our above deduction with more confidence
sns.pairplot(house_df, diag_kind = 'kde')
Out[55]:
<seaborn.axisgrid.PairGrid at 0x225f12a3a90>
From the above pair plot, we observed/deduced the following:
1. price: the price distribution is right-skewed, as we deduced earlier from the five-number summary
2. room_bed: the plot of our target variable (price) against room_bed is not linear. Its distribution has many gaussians
3. room_bath: its plot with price shows a somewhat linear relationship. The distribution has a number of gaussians.
4. living_measure: the plot against price shows a strong linear relationship. It also has a linear relationship with the room_bath variable, so
we might remove one of these 2. The distribution is right-skewed.
5. lot_measure: no clear relationship with price.
6. ceil: no clear relationship with price. We can see it has only 6 unique values, therefore we can convert this column into a categorical
column.
7. coast: no clear relationship with price. Clearly a categorical variable with 2 unique values.
8. sight: no clear relationship with price. Has 5 unique values; can be converted to a categorical variable.
9. condition: no clear relationship with price. Has 5 unique values; can be converted to a categorical variable.
10. quality: somewhat linear relationship with price. Has discrete values from 1 - 13; can be converted to a categorical variable.
11. ceil_measure: strong linear relationship with price, and also with the room_bath and living_measure features. The distribution is
right-skewed.
12. basement: no clear relationship with price.
13. yr_built: no clear relationship with price.
14. yr_renovated: no clear relationship with price. Can be converted to a categorical variable which tells whether the house was renovated or
not.
15. zipcode, lat, long: no clear relationship with price or any other feature.
16. living_measure15: somewhat linear relationship with the target feature. It is essentially the same as living_measure, therefore we can drop
this variable.
17. lot_measure15: no clear relationship with price or any other feature.
18. furnished: no clear relationship with price or any other feature. Has 2 unique values, so it can be converted to a categorical variable.
19. total_area: no clear relationship with price, but a very strong linear relationship with lot_measure, so one of the two can be dropped.
In [56]:
house_corr = house_df.corr()
house_corr
Out[56]:
cid price room_bed room_bath living_measure lot_measure ceil coast sight condition ... basement yr_built
cid 1.000000 -0.016797 0.001286 0.005160 -0.012258 -0.132109 0.018525 -0.002721 0.011592 -0.023783 ... -0.005151 0.021380
price -0.016797 1.000000 0.308338 0.525134 0.702044 0.089655 0.256786 0.266331 0.397346 0.036392 ... 0.323837 0.053982
room_bed 0.001286 0.308338 1.000000 0.515884 0.576671 0.031703 0.175429 -0.006582 0.079532 0.028472 ... 0.303093 0.154178
room_bath 0.005160 0.525134 0.515884 1.000000 0.754665 0.087740 0.500653 0.063744 0.187737 -0.124982 ... 0.283770 0.506019
living_measure -0.012258 0.702044 0.576671 0.754665 1.000000 0.172826 0.353949 0.103818 0.284611 -0.058753 ... 0.435043 0.318049
lot_measure -0.132109 0.089655 0.031703 0.087740 0.172826 1.000000 -0.005201 0.021604 0.074710 -0.008958 ... 0.015286 0.053080
ceil 0.018525 0.256786 0.175429 0.500653 0.353949 -0.005201 1.000000 0.023698 0.029444 -0.263768 ... -0.245705 0.489319
coast -0.002721 0.266331 -0.006582 0.063744 0.103818 0.021604 0.023698 1.000000 0.401857 0.016653 ... 0.080588 -0.026161
sight 0.011592 0.397346 0.079532 0.187737 0.284611 0.074710 0.029444 0.401857 1.000000 0.045990 ... 0.276947 -0.053440
condition -0.023783 0.036392 0.028472 -0.124982 -0.058753 -0.008958 -0.263768 0.016653 0.045990 1.000000 ... 0.174105 -0.361417
quality 0.008130 0.667463 0.356967 0.664983 0.762704 0.113621 0.458183 0.082775 0.251321 -0.144674 ... 0.168392 0.446963
ceil_measure -0.010842 0.605566 0.477600 0.685342 0.876597 0.183512 0.523885 0.072075 0.167649 -0.158214 ... -0.051943 0.423898
basement -0.005151 0.323837 0.303093 0.283770 0.435043 0.015286 -0.245705 0.080588 0.276947 0.174105 ... 1.000000 -0.133124
yr_built 0.021380 0.053982 0.154178 0.506019 0.318049 0.053080 0.489319 -0.026161 -0.053440 -0.361417 ... -0.133124 1.000000
yr_renovated -0.016907 0.126442 0.018841 0.050739 0.055363 0.007644 0.006338 0.092885 0.103917 -0.060618 ... 0.071323 -0.224874
zipcode -0.008224 -0.053168 -0.152668 -0.203866 -0.199430 -0.129574 -0.059121 0.030285 0.084827 0.003026 ... 0.074845 -0.346869
lat -0.001891 0.306919 -0.008931 0.024573 0.052529 -0.085683 0.049614 -0.014274 0.006157 -0.014941 ... 0.110538 -0.148122
long 0.020799 0.021571 0.129473 0.223042 0.240223 0.229521 0.125419 -0.041910 -0.078400 -0.106500 ... -0.144765 0.409356
living_measure15 -0.002901 0.585374 0.391638 0.568634 0.756420 0.144608 0.279885 0.086463 0.280439 -0.092824 ... 0.200355 0.326229
lot_measure15 -0.138798 0.082456 0.029244 0.087175 0.183286 0.718557 -0.011269 0.030703 0.072575 -0.003406 ... 0.017276 0.070958
furnished -0.010009 0.565991 0.259268 0.484923 0.632947 0.118883 0.347749 0.069882 0.220250 -0.121902 ... 0.092847 0.305225
total_area -0.131844 0.104796 0.044310 0.104050 0.194209 0.999763 0.002637 0.023809 0.080693 -0.010219 ... 0.024832 0.059889
22 rows × 22 columns
We have linear relationships among several features, as we learned from the above matrix.
We can plot a heatmap to easily confirm the above findings
In [57]:
# Plotting heatmap
plt.subplots(figsize =(15, 8))
sns.heatmap(house_corr,cmap="YlGnBu",annot=True)
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x2258d4f79b0>
In [58]:
#month,year in which house is sold. Price is not influenced by it, though there are outliers and they can be easily seen.
house_df['month_year'] = pd.to_datetime(house_df['month_year'], format='%B-%Y')
house_df.sort_values(["month_year"], axis=0,
ascending=True, inplace=True)
house_df["month_year"] = house_df["month_year"].dt.strftime('%B-%Y')
Out[58]:
month_year
The mean price of the houses tends to be high during the March, April, May period compared to the September, October, November, December period.
In [59]:
#room_bed - outliers can be seen easily. Mean and median of price increase with the number of bedrooms up to a point
#and then drop
sns.factorplot(x='room_bed',y='price',data=house_df, size=4, aspect=2)
#groupby
house_df.groupby('room_bed')['price'].agg(['mean','median','size'])
Out[59]:
mean median size
room_bed
0 4.102231e+05 288000.0 13
7 9.514478e+05 728580.0 38
8 1.105077e+06 700000.0 13
9 8.939998e+05 817000.0 6
10 8.200000e+05 660000.0 3
11 5.200000e+05 520000.0 1
33 6.400000e+05 640000.0 1
In [60]:
#room_bath - outliers can be seen easily. Overall mean and median price increase with increasing room_bath
sns.factorplot(x='room_bath',y='price',data=house_df,size=4, aspect=2)
plt.xticks(rotation=90)
#groupby
house_df.groupby('room_bath')['price'].agg(['mean','median','size'])
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:3666: UserWarning: The `factorplot
` function has been renamed to `catplot`. The original name will be removed in a future release. Ple
ase update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'`
in `catplot`.
warnings.warn(msg)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:3672: UserWarning: The `size` para
mter has been renamed to `height`; please update your code.
warnings.warn(msg, UserWarning)
Out[60]:
room_bath
In [61]:
AxesSubplot(0.125,0.125;0.775x0.755)
Out[61]:
count 21613.000000
mean 2079.899736
std 918.440897
min 290.000000
25% 1427.000000
50% 1910.000000
75% 2550.000000
max 13540.000000
Name: living_measure, dtype: float64
There is a clear increase in the price of the property with increasing living measure, but there seems to be one outlier to this trend. We need
to evaluate it.
AxesSubplot(0.125,0.125;0.775x0.755)
Out[62]:
count 2.161300e+04
mean 1.510697e+04
std 4.142051e+04
min 5.200000e+02
25% 5.040000e+03
50% 7.618000e+03
75% 1.068800e+04
max 1.651359e+06
Name: lot_measure, dtype: float64
In [63]:
#lot_measure <25000
plt.figure(figsize=(plotSizeX, plotSizeY))
x=house_df[house_df['lot_measure']<25000]
print(sns.scatterplot(x['lot_measure'],x['price']))
x['lot_measure'].describe()
AxesSubplot(0.125,0.125;0.775x0.755)
Out[63]:
count 19713.000000
mean 7762.510577
std 4252.549162
min 520.000000
25% 4997.000000
50% 7253.000000
75% 9620.000000
max 24969.000000
Name: lot_measure, dtype: float64
About 91% of the houses (19,713 of 21,613) have lot_measure < 25,000, but there is no clear trend between lot_measure and price
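As a quick check, the share implied by the counts printed above:

```python
# Share of houses below the 25,000 lot_measure cutoff, from the counts
# shown above (19713 of 21613 records).
share = 19713 / 21613
print(round(share * 100, 1))  # 91.2
```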
In [64]:
AxesSubplot(0.125,0.125;0.775x0.755)
Out[65]:
ceil
In [66]:
#coast - mean and median price of waterfront houses are high, however such houses are very few compared to non-waterfront
#Also, living_measure mean and median are greater for waterfront houses.
print(sns.factorplot(x='coast',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('coast')['living_measure','price'].agg(['median','mean'])
Out[66]:
living_measure price
coast
Properties with a waterfront tend to have higher prices compared to non-waterfront properties
In [67]:
#sight - has outliers. Houses viewed more often have higher prices (mean and median) and larger living areas as well.
print(sns.factorplot(x='sight',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('sight')['price','living_measure'].agg(['mean','median','size'])
Out[67]:
price living_measure
sight
Properties with higher prices have more viewings (sight) compared to houses with lower prices
In [68]:
AxesSubplot(0.125,0.125;0.775x0.755)
The above graph also shows that properties with higher prices have more viewings compared to houses with lower prices
In [69]:
#condition - as the condition rating increases, the mean and median of price and living measure also increase.
print(sns.factorplot(x='condition',y='price',data=house_df, size = 4, aspect = 2))
#groupby
house_df.groupby('condition')['price','living_measure'].agg(['mean','median','size'])
Out[69]:
price living_measure
condition
The price of the house increases with condition rating of the house
In [70]:
#Condition - Viewed in relation with price and living_measure. Most houses are rated as 3 or more.
#We can see some outliers as well
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['condition'],palette='Paired',legend='full'))
AxesSubplot(0.125,0.125;0.775x0.755)
So we found that smaller houses are in better condition, and better-condition houses have higher prices
Analyzing Bivariate for Feature: quality
In [71]:
#quality - with grade increase price and living_measure increase (mean and median)
Out[71]:
price living_measure
quality
There is clear increase in price of the house with higher rating on quality
In [72]:
#quality - Viewed in relation with price and living_measure. Most houses are graded as 6 or more.
#We can see some outliers as well
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['quality'],palette='coolwarm_r',legend='full'))
AxesSubplot(0.125,0.125;0.775x0.755)
In [73]:
AxesSubplot(0.125,0.125;0.775x0.755)
Out[73]:
count 21613.000000
mean 1788.390691
std 828.090978
min 290.000000
25% 1190.000000
50% 1560.000000
75% 2210.000000
max 9410.000000
Name: ceil_measure, dtype: float64
In [74]:
AxesSubplot(0.125,0.125;0.775x0.755)
Out[74]:
count 21613.000000
mean 291.509045
std 442.575043
min 0.000000
25% 0.000000
50% 0.000000
75% 560.000000
max 4820.000000
Name: basement, dtype: float64
We will create a categorical variable 'has_basement' distinguishing houses with and without a basement. This variable will be used for further analysis.
In [75]:
house_df['has_basement'] = house_df['basement'].apply(create_basement_group)
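The `create_basement_group` helper used above is defined earlier in the notebook; a minimal sketch consistent with how `has_basement` is used later (the exact labels are an assumption):

```python
import pandas as pd

# Hypothetical reconstruction of the helper used above: it bins the
# continuous 'basement' square footage into a two-level categorical.
def create_basement_group(basement_sqft):
    # Assumption: any positive basement area counts as "has basement".
    return 1 if basement_sqft > 0 else 0

basement = pd.Series([0, 0, 560, 4820])
has_basement = basement.apply(create_basement_group)
print(has_basement.tolist())  # → [0, 0, 1, 1]
```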
In [76]:
#basement - after binning, the data shows houses with a basement are costlier and have higher
#living measure (mean & median)
print(sns.catplot(x='has_basement', y='price', data=house_df, height=4, aspect=2))
house_df.groupby('has_basement')[['price','living_measure']].agg(['mean','median','size'])
Out[76]:
price living_measure
has_basement
Houses with a basement fetch a higher price than houses without one.
In [77]:
AxesSubplot(0.125,0.125;0.775x0.755)
In [78]:
AxesSubplot(0.125,0.125;0.775x0.755)
Out[78]:
yr_built
We will create a new variable, HouseLandRatio: the proportion of living area in the total area of the property. We will explore the trend of price against this ratio.
In [79]:
Out[79]:
17786 19.0
3782 16.0
10069 16.0
7114 24.0
10080 22.0
Name: HouseLandRatio, dtype: float64
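The sample values above (19.0, 16.0, 16.0, 24.0, 22.0) are consistent with `HouseLandRatio` being the living area expressed as a rounded percentage of `total_area`; a sketch under that assumption, using the rows displayed elsewhere in the notebook:

```python
import pandas as pd

# Rows 17786, 3782, 10069, 7114, 10080 as displayed later in df_model.head().
df = pd.DataFrame({
    "living_measure": [2550, 1540, 2290, 1940, 2320],
    "total_area":     [13710, 9487, 14337, 7940, 10420],
})
# Assumed definition: living area as a percentage of total area, rounded.
df["HouseLandRatio"] = (df["living_measure"] / df["total_area"] * 100).round()
print(df["HouseLandRatio"].tolist())  # → [19.0, 16.0, 16.0, 24.0, 22.0]
```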
In [80]:
#yr_renovated -
plt.figure(figsize=(plotSizeX, plotSizeY))
x=house_df[house_df['yr_renovated']>0]
print(sns.scatterplot(x['yr_renovated'],x['price']))
#groupby
x.groupby('yr_renovated')['price'].agg(['mean','median','size'])
AxesSubplot(0.125,0.125;0.775x0.755)
Out[80]:
yr_renovated
69 rows × 3 columns
So most renovations happened after the 1980s. We will create a new categorical variable 'has_renovated' to mark each property as renovated or non-renovated, and use this variable for further analysis.
In [81]:
house_df['has_renovated'] = house_df['yr_renovated'].apply(create_renovated_group)
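Like `create_basement_group`, the `create_renovated_group` helper is defined earlier in the notebook; a minimal sketch consistent with its use here (the labels are an assumption):

```python
# Hypothetical reconstruction: a property counts as renovated when
# 'yr_renovated' records a renovation year (non-zero), else not renovated.
def create_renovated_group(yr_renovated):
    return 1 if yr_renovated > 0 else 0

print([create_renovated_group(y) for y in (0, 1991, 2005)])  # → [0, 1, 1]
```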
In [84]:
#has_renovated - renovated houses have higher mean and median prices; however, this alone does not
#confirm whether renovation actually increased the price of a given house
#HouseLandRatio - renovated houses utilize more of the land area for the house itself
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['has_renovated']))
#groupby
house_df.groupby(['has_renovated'])[['price','HouseLandRatio']].agg(['mean','median','size'])
AxesSubplot(0.125,0.125;0.775x0.755)
Out[84]:
price HouseLandRatio
has_renovated
Renovated properties have higher prices than comparable ones with the same living space.
In [85]:
#pd.crosstab(house_df['yearbuilt_group'],house_df['has_renovated'])
In [86]:
AxesSubplot(0.125,0.125;0.775x0.755)
In [87]:
#furnished - furnished houses have higher prices and greater living_measure
plt.figure(figsize=(plotSizeX, plotSizeY))
print(sns.scatterplot(house_df['living_measure'],house_df['price'],hue=house_df['furnished']))
#groupby
house_df.groupby('furnished')[['price','living_measure','HouseLandRatio']].agg(['mean','median','size'])
AxesSubplot(0.125,0.125;0.775x0.755)
Out[87]:
furnished
Furnished houses command higher prices than non-furnished houses.
Analyzing Bivariate for Feature: city
In [88]:
Out[88]:
City
In [89]:
indx=city_price.index
overall_price_mean=np.mean(house_df['price'])
overall_price_median=np.median(house_df['price'])
The cities below have notably higher mean house prices:
1. Bellevue
2. Fall City
3. Federal Way
4. Kirkland
5. Medina
6. Mercer Island
7. Redmond
8. Sammamish
9. Woodinville
In [90]:
As we can see from the above graph, the cities below have notably higher median house prices:
1. Bellevue
2. Bothell
3. Issaquah
4. Kirkland
5. Medina
6. Mercer Island
7. Redmond
8. Sammamish
9. Snoqualmie
10. Woodinville
In [91]:
#let's make a copy of the dataframe before making any further changes
house_df_bdp=house_df.copy()
DATA PROCESSING
Treating Outliers
We have seen outliers in columns room_bed (a 33-bedroom record), room_bath, living_measure, lot_measure, ceil_measure and basement
In [92]:
def outlier_treatment(datacolumn):
    # No pre-sorting needed: np.percentile handles unsorted data
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
Using the above function, let's get the lower and upper bound values for ceil_measure:
In [93]:
lowerbound,upperbound = outlier_treatment(house_df.ceil_measure)
print(lowerbound,upperbound)
-340.0 3740.0
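The IQR fences can be sanity-checked on a small synthetic sample (the values here are illustrative, not from the dataset):

```python
import numpy as np

def outlier_treatment(datacolumn):
    # Tukey's rule: values beyond 1.5 * IQR from the quartiles are outliers.
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    return Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

sample = [1, 2, 3, 4, 100]  # 100 is an obvious outlier
low, high = outlier_treatment(sample)
print(low, high)                                   # → -1.0 7.0
print([v for v in sample if v < low or v > high])  # → [100]
```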
In [94]:
Out[94]:
[Output truncated in export: DataFrame of rows flagged as ceil_measure outliers (611 rows per the subsequent shape change), with columns cid, dayhours, price, room_bed, room_bath, living_measure, lot_measure, ceil, coast, sight, ..., lot_measure15, furnished, total_area, month_year]
In [95]:
In [96]:
house_df.shape
Out[96]:
(21002, 30)
In [97]:
#ceil_measure
print("Skewness is :", house_df.ceil_measure.skew())
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.ceil_measure)
house_df.ceil_measure.describe()
Skewness is : 0.8198869256569326
Out[97]:
count 21002.000000
mean 1712.238168
std 696.044073
min 290.000000
25% 1180.000000
50% 1540.000000
75% 2140.000000
max 3740.000000
Name: ceil_measure, dtype: float64
After treating outliers of ceil_measure, the dataset shrank by about 600 (~3%) data points, but the data is now nicely distributed
Treating outliers for column - basement
In [98]:
lowerbound_base,upperbound_base = outlier_treatment(house_df.basement)
print(lowerbound_base,upperbound_base)
-855.0 1425.0
In [99]:
house_df[(house_df.basement < lowerbound_base) | (house_df.basement > upperbound_base)]
Out[99]:
[Output truncated in export: DataFrame of rows flagged as basement outliers (about 400 rows per the subsequent shape change), with columns cid, dayhours, price, room_bed, room_bath, living_measure, lot_measure, ceil, coast, sight, ..., lot_measure15, furnished, total_area, month_year]
In [101]:
house_df.shape
Out[101]:
(20594, 30)
In [102]:
#basement_measure
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.basement)
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x22593a3e5f8>
After treating outliers of basement, about 400 (~2%) data points were removed. In total, about 5% of the data has been removed after treating
ceil_measure and basement.
In [103]:
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x22593921d30>
Treating outliers for column - living_measure
In [104]:
lowerbound_lim,upperbound_lim = outlier_treatment(house_df.living_measure)
print(lowerbound_lim,upperbound_lim)
-160.0 4000.0
In [105]:
house_df[(house_df.living_measure < lowerbound_lim) | (house_df.living_measure > upperbound_lim)]
Out[105]:
[Output truncated in export: DataFrame of rows flagged as living_measure outliers (178 rows per the subsequent shape change), with columns cid, dayhours, price, room_bed, room_bath, living_measure, lot_measure, ceil, coast, sight, ..., lot_measure15, furnished, total_area, month_year]
In [107]:
Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x22593a3e240>
In [108]:
plt.figure(figsize=(plotSizeX, plotSizeY))
sns.distplot(house_df.living_measure)
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x22595886198>
By treating outliers of living_measure, we lost 178 more data points, and the distribution now looks close to normal
In [109]:
Out[109]:
(20416, 30)
Treating outliers for column - lot_measure
In [110]:
lowerbound_lom,upperbound_lom = outlier_treatment(house_df.lot_measure)
print(lowerbound_lom,upperbound_lom)
-2774.875 17958.125
In [111]:
Out[111]:
[Output truncated in export: DataFrame of rows flagged as lot_measure outliers (2,128 rows per the subsequent shape change), with columns cid, dayhours, price, room_bed, room_bath, living_measure, lot_measure, ceil, coast, sight, ..., lot_measure15, furnished, total_area, month_year]
We found 2,155 outlier records. Let's drop them.
In [112]:
In [113]:
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x22593975eb8>
In [114]:
house_df.shape
Out[114]:
(18288, 30)
Treating lot_measure removed 2,128 data points. We are still going ahead with dropping these records; we will analyze later whether this has
any impact on the dataset.
In [115]:
#As we know, room_bed = 33 was an outlier from our earlier findings; let's see the record and drop it
house_df[house_df['room_bed']==33]
Out[115]:
cid dayhours price room_bed room_bath living_measure lot_measure ceil coast sight ... lot_measure15 furnished total_area month_year
750 2402100895 2014-06-25 640000 33 1.75 1620 6000 1.0 0 0 ... 4700 0 7620 June-2014
1 rows × 30 columns
In [116]:
In [117]:
house_df.shape
Out[117]:
(18287, 30)
In summary, after treating outliers we have lost about 15% of the data. We will analyse the impact of this data loss during model
evaluation.
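The drop pattern repeated above for ceil_measure, basement, living_measure and lot_measure can be collected into one helper; a sketch with illustrative data, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

def drop_iqr_outliers(df, column):
    """Drop rows whose `column` value lies outside the 1.5 * IQR fences."""
    Q1, Q3 = np.percentile(df[column], [25, 75])
    IQR = Q3 - Q1
    low, high = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    return df[(df[column] >= low) & (df[column] <= high)]

demo = pd.DataFrame({"living_measure": [1000, 1500, 2000, 2500, 50000]})
kept = drop_iqr_outliers(demo, "living_measure")
print(len(kept))  # → 4 (the 50000 row is dropped)
```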
In [118]:
Out[118]:
As we already have this information in other features, we will drop the unwanted columns from the new copied dataframe instance:
cid, dayhours, yr_renovated, zipcode, lat, long, county, type
In [119]:
In [120]:
#let's check the new copy of the dataframe by printing the first few records
df_model.head()
Out[120]:
[Output truncated in export: first 5 rows of df_model (indices 17786, 3782, 10069, 7114, 10080), with columns cid, dayhours, price, room_bed, room_bath, living_measure, lot_measure, ceil, coast, sight, ..., lot_measure15, furnished, total_area, month_year]
5 rows × 30 columns
In [121]:
Out[121]:
In [122]:
In [123]:
df_final.shape
Out[123]:
(18287, 22)
In [124]:
df_final.head()
Out[124]:
price room_bed room_bath living_measure lot_measure ceil coast sight condition quality ... yr_built living_measure15 lot_measure15 furnished
17786 430000 3 2.75 2550 11160 2.0 0 0 3 8 ... 1994 1020 7440
3782 385500 3 2.00 1540 7947 1.0 0 0 3 7 ... 1961 1910 7950
10069 736000 4 2.50 2290 12047 2.0 0 0 4 9 ... 1988 3130 15666
7114 580000 5 2.00 1940 6000 1.0 0 0 5 7 ... 1945 1700 6000
10080 315000 5 1.75 2320 8100 1.0 0 0 4 7 ... 1956 1410 7271
5 rows × 22 columns
In [125]:
df_final.columns
Out[125]:
Creating dummies for categorical variables: 'room_bed', 'room_bath', 'ceil', 'coast', 'sight', 'condition', 'quality', 'furnished','City',
'has_basement', 'has_renovated'
In [126]:
# Getting dummies for the categorical columns listed above
dff = pd.get_dummies(df_final, columns=['room_bed', 'room_bath', 'ceil', 'coast', 'sight', 'condition',
                                        'quality', 'furnished', 'City', 'has_basement', 'has_renovated'],
                     drop_first=True)
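`pd.get_dummies` with `drop_first=True` keeps k-1 indicator columns per categorical variable, which is why, e.g., `quality_3` through `quality_13` appear in Dataset-2 while the lowest grade is dropped as the baseline. A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({"quality": [7, 8, 7, 9], "coast": [0, 1, 0, 0]})
# drop_first=True drops the first level of each variable (quality_7, coast_0)
# to avoid the dummy-variable trap.
dummies = pd.get_dummies(toy, columns=["quality", "coast"], drop_first=True)
print(list(dummies.columns))  # → ['quality_8', 'quality_9', 'coast_1']
```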
In [127]:
Out[127]:
(18287, 92)
In [128]:
dff.columns
Out[128]:
In [129]:
dff.head()
Out[129]:
price living_measure lot_measure ceil_measure basement yr_built living_measure15 lot_measure15 total_area month_year ... City_North Bend City_Red
17786 430000 2550 11160 2550 0 1994 1020 7440 13710 May-2014 ... 0
3782 385500 1540 7947 1120 420 1961 1910 7950 9487 May-2014 ... 0
10069 736000 2290 12047 2290 0 1988 3130 15666 14337 May-2014 ... 0
7114 580000 1940 6000 970 970 1945 1700 6000 7940 May-2014 ... 0
10080 315000 2320 8100 1160 1160 1956 1410 7271 10420 May-2014 ... 0
5 rows × 92 columns
In [130]:
In [131]:
In [132]:
In [133]:
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)
(11703, 90)
(3658, 90)
(2926, 90)
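The shapes above (11703 / 3658 / 2926 rows out of 18287) are consistent with two chained `train_test_split` calls at `test_size=0.2` each; a sketch under that assumption (the `random_state` is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(18287 * 2).reshape(18287, 2)  # stand-in for the 90 features
y = np.arange(18287)

# First carve out 20% as the hold-out test set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# ... then 20% of the remainder as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=1)
print(X_train.shape[0], X_test.shape[0], X_val.shape[0])  # → 11703 3658 2926
```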
In [134]:
dff.head()
Out[134]:
price living_measure lot_measure ceil_measure basement yr_built living_measure15 lot_measure15 total_area HouseLandRatio ... City_North Bend City
17786 430000 2550 11160 2550 0 1994 1020 7440 13710 19.0 ... 0
3782 385500 1540 7947 1120 420 1961 1910 7950 9487 16.0 ... 0
10069 736000 2290 12047 2290 0 1988 3130 15666 14337 16.0 ... 0
7114 580000 1940 6000 970 970 1945 1700 6000 7940 24.0 ... 0
10080 315000 2320 8100 1160 1160 1956 1410 7271 10420 22.0 ... 0
5 rows × 91 columns
Model building
Let's build the model and see their performances
Linear Regression (with Ridge and Lasso)
In [135]:
In [136]:
LR1 = LinearRegression()
LR1.fit(X_train, y_train)
#predicting result over test data
y_LR1_predtr= LR1.predict(X_train)
y_LR1_predvl= LR1.predict(X_val)
LR1.coef_
Out[136]:
In [137]:
LR1_vlscore=r2_score(y_val,y_LR1_predvl)
LR1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_LR1_predvl))
LR1_vlMSE=mean_squared_error(y_val, y_LR1_predvl)
LR1_vlMAE=mean_absolute_error(y_val, y_LR1_predvl)
Compa_df
Out[137]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
The linear regression model scored 0.73 on the training set and 0.72 on the validation set.
In [138]:
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
sns.jointplot(x=y_val, y=y_LR1_predvl, kind="reg", color="k")
Lasso model
In [139]:
Lasso1 = Lasso(alpha=1)
Lasso1.fit(X_train, y_train)
Lasso1.coef_
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: Convergen
ceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting
data with very small alpha may cause precision problems.
ConvergenceWarning)
Out[139]:
In [140]:
#predicting result over train and validation data
y_Lasso1_predtr= Lasso1.predict(X_train)
y_Lasso1_predvl= Lasso1.predict(X_val)
Lasso1_vlscore=r2_score(y_val,y_Lasso1_predvl)
Lasso1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_Lasso1_predvl))
Lasso1_vlMSE=mean_squared_error(y_val, y_Lasso1_predvl)
Lasso1_vlMAE=mean_absolute_error(y_val, y_Lasso1_predvl)
Compa_df
Out[140]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
The lasso regression model scored 0.73 on the training set and 0.72 on the validation set. The coefficient of one variable is shrunk to almost 0,
signifying that this variable can be dropped.
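The "almost 0" check can be made explicit by counting coefficients below a small threshold; a sketch on synthetic data (the threshold `1e-6`, `alpha`, and the data are illustrative):

```python
# Count Lasso coefficients shrunk to (near) zero on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first three features carry signal; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
n_dropped = int(np.sum(np.abs(lasso.coef_) < 1e-6))
print(n_dropped, "coefficients are ~0; those features are candidates to drop")
```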
In [141]:
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
sns.jointplot(x=y_val, y=y_Lasso1_predvl, kind="reg", color="k")
Ridge model
In [142]:
Ridge1 = Ridge(alpha=0.5)
Ridge1.fit(X_train, y_train)
Ridge1.coef_
Out[142]:
In [143]:
#predicting result over train and validation data
y_Ridge1_predtr= Ridge1.predict(X_train)
y_Ridge1_predvl= Ridge1.predict(X_val)
Ridge1_vlscore=r2_score(y_val,y_Ridge1_predvl)
Ridge1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_Ridge1_predvl))
Ridge1_vlMSE=mean_squared_error(y_val, y_Ridge1_predvl)
Ridge1_vlMAE=mean_absolute_error(y_val, y_Ridge1_predvl)
Compa_df
Out[143]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
The ridge regression model scored 0.73 on the training set and 0.72 on the validation set. All coefficients in the ridge model are non-zero,
indicating that none of the variables can be dropped.
In [144]:
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
sns.jointplot(x=y_val, y=y_Ridge1_predvl, kind="reg", color="k")
In summary, the linear models performed almost identically with and without regularization.
KNN Regressor
In [145]:
In [146]:
knn1 = KNeighborsRegressor(n_neighbors=4,weights='distance')
knn1.fit(X_train, y_train)
In [147]:
#predicting result over train and validation data
y_knn1_predtr= knn1.predict(X_train)
y_knn1_predvl= knn1.predict(X_val)
knn1_vlscore=r2_score(y_val,y_knn1_predvl)
knn1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_knn1_predvl))
knn1_vlMSE=mean_squared_error(y_val, y_knn1_predvl)
knn1_vlMAE=mean_absolute_error(y_val, y_knn1_predvl)
Compa_df
Out[147]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
Support Vector Regressor (SVR)
In [148]:
In [149]:
y_SVR1_predtr= SVR1.predict(X_train)
y_SVR1_predvl= SVR1.predict(X_val)
In [150]:
SVR1_vlscore=r2_score(y_val,y_SVR1_predvl)
SVR1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_SVR1_predvl))
SVR1_vlMSE=mean_squared_error(y_val, y_SVR1_predvl)
SVR1_vlMAE=mean_absolute_error(y_val, y_SVR1_predvl)
Compa_df
Out[150]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
The negative scores of the SVR model above result from the model failing to learn on the training set, which carries over to poor performance
on the validation set.
In [151]:
SVR2 = SVR(gamma='auto',C=0.1,kernel='linear')
SVR2.fit(X_train, y_train)
y_SVR2_predtr= SVR2.predict(X_train)
y_SVR2_predvl= SVR2.predict(X_val)
SVR2_vlscore=r2_score(y_val,y_SVR2_predvl)
SVR2_vlRMSE=np.sqrt(mean_squared_error(y_val, y_SVR2_predvl))
SVR2_vlMSE=mean_squared_error(y_val, y_SVR2_predvl)
SVR2_vlMAE=mean_absolute_error(y_val, y_SVR2_predvl)
Compa_df
Out[151]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
The SVR model with modified parameters still performed poorly, scoring only ~0.45 on both the training and validation sets.
Decision Tree Regressor
In [152]:
DT1 = DecisionTreeRegressor()
DT1.fit(X_train, y_train)
y_DT1_predtr= DT1.predict(X_train)
y_DT1_predvl= DT1.predict(X_val)
DT1_vlscore=r2_score(y_val,y_DT1_predvl)
DT1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_DT1_predvl))
DT1_vlMSE=mean_squared_error(y_val, y_DT1_predvl)
DT1_vlMAE=mean_absolute_error(y_val, y_DT1_predvl)
Compa_df
Out[153]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
The initial decision tree model overfits: it scores 0.99 on the training set but performs much worse on the validation set.
In [154]:
DT2 = DecisionTreeRegressor(max_depth=10,min_samples_leaf=5)
DT2.fit(X_train, y_train)
y_DT2_predtr= DT2.predict(X_train)
y_DT2_predvl= DT2.predict(X_val)
DT2_vlscore=r2_score(y_val,y_DT2_predvl)
DT2_vlRMSE=np.sqrt(mean_squared_error(y_val, y_DT2_predvl))
DT2_vlMSE=mean_squared_error(y_val, y_DT2_predvl)
DT2_vlMAE=mean_absolute_error(y_val, y_DT2_predvl)
Compa_df
Out[154]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
The decision tree with modified parameters performed better on both the training and validation sets than the initial tree, but overall the
decision tree still did not outperform the linear regression models.
In [155]:
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
sns.jointplot(x=y_val, y=y_DT2_predvl, kind="reg", color="k")
In summary, the KNN regressor and decision tree models underperformed the linear regression models.
Ensemble techniques
In [156]:
In [157]:
y_GB1_predtr= GB1.predict(X_train)
y_GB1_predvl= GB1.predict(X_val)
GB1_vlscore=r2_score(y_val,y_GB1_predvl)
GB1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_GB1_predvl))
GB1_vlMSE=mean_squared_error(y_val, y_GB1_predvl)
GB1_vlMAE=mean_absolute_error(y_val, y_GB1_predvl)
Compa_df
Out[157]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
Gradient boosting model has provided good scores in both training and validation sets
In [158]:
y_BGG1_predtr= BGG1.predict(X_train)
y_BGG1_predvl= BGG1.predict(X_val)
BGG1_vlscore=r2_score(y_val,y_BGG1_predvl)
BGG1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_BGG1_predvl))
BGG1_vlMSE=mean_squared_error(y_val, y_BGG1_predvl)
BGG1_vlMAE=mean_absolute_error(y_val, y_BGG1_predvl)
Compa_df
Out[158]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
The bagging model also performed well on the training and validation sets, though the training score suggests some overfitting. We will analyse
this further during hyperparameter tuning.
Random forest
In [159]:
RF1=RandomForestRegressor()
RF1.fit(X_train, y_train)
y_RF1_predtr= RF1.predict(X_train)
y_RF1_predvl= RF1.predict(X_val)
RF1_vlscore=r2_score(y_val,y_RF1_predvl)
RF1_vlRMSE=np.sqrt(mean_squared_error(y_val, y_RF1_predvl))
RF1_vlMSE=mean_squared_error(y_val, y_RF1_predvl)
RF1_vlMAE=mean_absolute_error(y_val, y_RF1_predvl)
Compa_df
Out[160]:
Method Val Score RMSE_vl MSE_vl MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg Model1 0.718749 137733.698415 1.897057e+10 93994.455301 0.730112 132958.367261 1.767793e+10 92391.001786
0 Linear-Reg Lasso1 0.719117 137643.639712 1.894577e+10 93939.441186 0.730092 132963.180396 1.767921e+10 92403.854117
0 Linear-Reg Ridge1 0.718929 137689.597398 1.895843e+10 93992.809617 0.729789 133037.735155 1.769904e+10 92497.255174
The random forest model performed well on the training and validation sets; there is scope for further analysis of this model.
Ensemble models: in summary, the ensemble models performed well on the training and validation sets. These models are selected for further
analysis with hyperparameter tuning and feature selection.
In [161]:
#feature importance
rf_imp_feature_1=pd.DataFrame(RF1.feature_importances_, columns = ["Imp"], index = X_val.columns)
rf_imp_feature_1.sort_values(by="Imp",ascending=False)
rf_imp_feature_1['Imp'] = rf_imp_feature_1['Imp'].map('{0:.5f}'.format)
rf_imp_feature_1=rf_imp_feature_1.sort_values(by="Imp",ascending=False)
rf_imp_feature_1.Imp=rf_imp_feature_1.Imp.astype("float")
rf_imp_feature_1[:30].plot.bar(figsize=(plotSizeX, plotSizeY))
#First 20 features have an importance of 90.5% and first 30 have importance of 95.15
print("First 20 feature importance:\t",(rf_imp_feature_1[:20].sum())*100)
print("First 30 feature importance:\t",(rf_imp_feature_1[:30].sum())*100)
Above are the top 30 important features, which account for 95% of the feature importance in the model. These need to be analysed further
during hyperparameter tuning for better scores.
Ensemble methods are performing better than the linear models. Of all the ensemble models, the gradient boosting regressor gives the best R2
score. We identified the top 30 features that account for 95% of the feature importance in the random forest model. We will tune the
hyperparameters to improve performance, and will further explore and evaluate the features while tuning the ensemble models.
rf_imp_feature_1[:30]
Out[162]:
Imp
furnished_1 0.28448
yr_built 0.14227
living_measure 0.09463
living_measure15 0.06691
quality_8 0.05062
HouseLandRatio 0.04008
lot_measure15 0.03731
City_Bellevue 0.02532
ceil_measure 0.02459
quality_9 0.02049
total_area 0.01527
lot_measure 0.01319
City_Seattle 0.01268
City_Kirkland 0.01245
City_Kent 0.01089
sight_4 0.00945
quality_7 0.00942
basement 0.00908
City_Redmond 0.00830
coast_1 0.00648
City_Medina 0.00556
quality_10 0.00545
City_Renton 0.00521
room_bed_4 0.00393
City_Sammamish 0.00379
sight_3 0.00351
City_Issaquah 0.00303
In [163]:
In [164]:
# The function head was lost in the export; reconstructed here (the name
# run_model and the exact signature are assumptions)
def run_model(model, y_train_set, y_train_predict, y_val, y_val_predict):
    trscore=r2_score(y_train_set,y_train_predict)
    trRMSE=np.sqrt(mean_squared_error(y_train_set,y_train_predict))
    trMSE=mean_squared_error(y_train_set,y_train_predict)
    trMAE=mean_absolute_error(y_train_set,y_train_predict)
    vlscore=r2_score(y_val,y_val_predict)
    vlRMSE=np.sqrt(mean_squared_error(y_val,y_val_predict))
    vlMSE=mean_squared_error(y_val,y_val_predict)
    vlMAE=mean_absolute_error(y_val,y_val_predict)
    result_df=pd.DataFrame({'Method':[model],'val score':vlscore,'RMSE_val':vlRMSE,'MSE_val':vlMSE,'MAE_val':vlMAE,
                            'train Score':trscore,'RMSE_tr':trRMSE,'MSE_tr':trMSE,'MAE_tr':trMAE})
    return result_df
The function above scores a model and returns its R2, RMSE, MSE and MAE on the training and validation sets as a one-row dataframe.
In [165]:
result_dff
Out[165]:
Method val score RMSE_val MSE_val MAE_val train Score RMSE_tr MSE_tr MAE_tr
The sequence of steps above, using the pipeline function, runs all the models and compiles their scores in the result_dff dataframe. These two
steps are far more concise than running each model and compiling the scores individually, as done earlier.
Gradient boosting clearly gives the best result of the ensemble methods, and its training score of 0.82 indicates the model is not overfitting.
In [166]:
result_ds1=result_dff.copy()
result_ds1
Out[166]:
Method val score RMSE_val MSE_val MAE_val train Score RMSE_tr MSE_tr MAE_tr
In [167]:
dff.shape
Out[167]:
(18287, 91)
In [168]:
dff.columns
Out[168]:
In [169]:
In [170]:
numerical_cols = df_pca.copy()
numerical_cols.shape
Out[170]:
(18287, 90)
In [171]:
# For PCA on the independent columns, standardize them first (90 features after dummy encoding)
numerical_cols = numerical_cols.apply(zscore)
cov_matrix = np.cov(numerical_cols.T)
print('Covariance Matrix\n', cov_matrix)
Covariance Matrix
 [[ 1.00005469 0.20028185 0.84597846 ... 0.01415428 0.20094885
0.05257785]
[ 0.20028185 1.00005469 0.1663024 ... 0.08035946 -0.02988448
-0.00617414]
[ 0.84597846 0.1663024 1.00005469 ... 0.01649371 -0.27730605
0.01739462]
...
[ 0.01415428 0.08035946 0.01649371 ... 1.00005469 -0.0056238
-0.01445085]
[ 0.20094885 -0.02988448 -0.27730605 ... -0.0056238 1.00005469
0.04524435]
[ 0.05257785 -0.00617414 0.01739462 ... -0.01445085 0.04524435
1.00005469]]
Eigen Vectors
 [[ 3.38140157e-01 -5.91272225e-02 2.10933458e-01 ... -5.27282174e-03
5.54142192e-03 -1.67124034e-04]
[ 7.12659835e-02 -4.34121260e-01 -8.88436080e-02 ... -1.68818774e-02
4.57078107e-03 -8.52995967e-03]
[ 3.49772357e-01 -8.00383876e-03 -4.05156781e-02 ... -4.68496505e-03
-1.39683527e-02 2.43243338e-03]
...
[ 1.29688720e-02 -3.80398560e-02 -3.41813657e-02 ... 1.07174062e-01
8.41737918e-02 -1.95413367e-01]
[-2.50075399e-02 -3.74622282e-02 4.39476539e-01 ... 3.99982110e-03
4.08181763e-02 2.32617269e-02]
[-3.18537004e-03 -6.23000043e-04 1.02661626e-01 ... 2.84579669e-02
-1.47963772e-02 -6.62692406e-02]]
Eigen Values
 [ 6.40030103e+00 4.23053272e+00 3.02200570e+00 2.36069955e+00
1.72278028e+00 1.70533047e+00 5.17634008e-02 7.84864255e-02
1.23323929e-01 1.58239483e+00 1.94704947e-01 2.10588552e-01
2.45372409e-01 3.37764061e-01 3.52383334e-01 2.24756725e-03
9.93351422e-04 1.28503648e-04 8.54683326e-05 1.51669793e+00
3.97816689e-01 1.48400510e+00 4.25049450e-01 -5.20329656e-16
-1.83229560e-15 3.57406920e-15 1.39212554e+00 1.33812387e+00
5.71411667e-01 6.48215227e-01 6.60453404e-01 1.27455883e+00
6.90208644e-01 7.30900855e-01 1.22358633e+00 1.21781188e+00
7.54916613e-01 7.61951753e-01 7.89272221e-01 1.19439921e+00
1.18354682e+00 8.08765828e-01 8.31761100e-01 1.17521503e+00
1.16073113e+00 8.62975337e-01 1.14847039e+00 8.79158894e-01
1.11948938e+00 1.10960276e+00 8.90644524e-01 8.88567656e-01
9.01761603e-01 1.10493861e+00 9.16012433e-01 9.31041146e-01
1.09143428e+00 1.08460485e+00 1.08273453e+00 1.07118893e+00
9.33793856e-01 1.06368893e+00 9.41694315e-01 9.44273389e-01
9.49801385e-01 9.52927340e-01 1.05455290e+00 1.04955645e+00
1.04815072e+00 1.04163633e+00 9.69808694e-01 1.03813696e+00
1.03345195e+00 1.02768165e+00 1.02381893e+00 9.82887562e-01
9.81254198e-01 9.86986081e-01 1.01521697e+00 9.89390243e-01
9.93680575e-01 9.93261992e-01 1.00195909e+00 1.01363137e+00
1.01189827e+00 1.00051625e+00 1.00419597e+00 1.00622516e+00
1.00552678e+00 1.00928050e+00]
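The cell that computed this eigen decomposition is missing from the export; a self-contained sketch of that step (stand-in data; `eigh` is used here since the covariance matrix is symmetric):

```python
# Eigen decomposition of a covariance matrix, as used in the PCA step above.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))          # stand-in for the scaled feature matrix
cov_matrix = np.cov(data.T)

# eigh is the routine for symmetric matrices, so eigenvalues come back real
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# pair eigenvalues with their eigenvectors and sort by eigenvalue, largest first
# (sorting the bare tuples would compare numpy arrays on ties, so use a key)
eig_pairs = [(eigenvalues[i], eigenvectors[:, i]) for i in range(len(eigenvalues))]
eig_pairs.sort(key=lambda p: p[0], reverse=True)
print([round(float(p[0]), 3) for p in eig_pairs])
```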
In [173]:
# Let's Sort eigenvalues in descending order
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest with respect to eigenvalue
eig_pairs.sort()
eig_pairs.reverse()
print(eig_pairs)
In [174]:
tot = sum(eigenvalues)
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)]
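The `cum_var_exp` printed in the next cell is presumably the cumulative sum of `var_explained`; a minimal sketch with illustrative eigenvalues:

```python
# Cumulative explained variance from eigenvalues (illustrative values).
import numpy as np

eigenvalues = [6.4, 4.2, 3.0, 2.4, 1.7, 1.2, 0.7, 0.4]  # stand-in eigenvalues
tot = sum(eigenvalues)
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)]
cum_var_exp = np.cumsum(var_explained)
print(cum_var_exp)  # monotonically increasing, ending at 1.0
```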
In [175]:
print(len(var_explained))
print((cum_var_exp))
90
[0.07111057 0.11811392 0.15168992 0.17791848 0.19705944 0.21600652
0.23358772 0.250439 0.26692704 0.28239426 0.29726149 0.31142248
0.32501714 0.33854764 0.35181802 0.36496782 0.37802505 0.39092136
0.40368144 0.41611953 0.42844778 0.4407242 0.45285059 0.46490109
0.47693082 0.48883227 0.50065039 0.512367 0.5240281 0.53567358
0.54724669 0.55878091 0.57026308 0.58168114 0.59305629 0.60433586
0.61559781 0.62684051 0.63805413 0.6492338 0.66040571 0.67156283
0.6826951 0.69381134 0.70485163 0.71588727 0.72687989 0.73784581
0.74876618 0.75966841 0.77044347 0.78103098 0.79158375 0.8020751
0.8125378 0.82291272 0.83325705 0.84343441 0.85345344 0.86334895
0.87322138 0.88298928 0.89257737 0.90181865 0.91080445 0.91957366
0.92803933 0.93642683 0.94454751 0.95221608 0.95955405 0.96675604
0.97310471 0.97782723 0.98224717 0.98616233 0.98991506 0.99264127
0.99498101 0.99714428 0.99851447 0.9993865 0.99996161 0.99998659
0.99999762 0.99999905 1. 1. 1. 1. ]
From the output above, about 72 principal components account for roughly 96% of the variance.
In [176]:
plt.figure(figsize=(plotSizeX, plotSizeY))
plt.bar(range(0,90), np.array(var_explained), alpha = 0.5, align='center', label='individual explained variance')
plt.step(range(0,90), np.array(cum_var_exp), where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()
About 72 components cover ~97% of the variance in the data, so the feature space could be reduced to 72 dimensions.
Next, we revisit the ensemble models from the initial run and compare feature selection based on each model's feature importances.
In [177]:
#Building a function to fit a model and return its top-30 feature importances
#(the function head and fit/importance lines were lost in the export and are
#reconstructed here; the call sites below show the intended signature)
def modelfit(alg, X_train_set, y_train_set, printFeatureImportance=True):
    alg.fit(X_train_set, y_train_set)
    predictors = [x for x in dff.columns if x not in ['price']]
    alg_imp_feature_1 = pd.Series(alg.feature_importances_, index=predictors)
    alg_imp_feature_1 = alg_imp_feature_1.sort_values(ascending=False)
    feat_30list = list(alg_imp_feature_1.index[:30])
    if printFeatureImportance:
        alg_imp_feature_1[:30].plot.bar(figsize=(plotSizeX, plotSizeY))
        print("First 25 feature importance:\t", (alg_imp_feature_1[:25].sum())*100)
        print("First 30 feature importance:\t", (alg_imp_feature_1[:30].sum())*100)
    return feat_30list
We run the function above with the ensemble models: gradient boosting, random forest and bagging.
In [178]:
Out[178]:
['furnished_1',
'living_measure',
'yr_built',
'living_measure15',
'quality_8',
'City_Bellevue',
'City_Seattle',
'lot_measure15',
'HouseLandRatio',
'City_Kent',
'quality_9',
'sight_4',
'City_Federal Way',
'coast_1',
'City_Mercer Island',
'City_Kirkland',
'City_Medina',
'City_Redmond',
'quality_11',
'ceil_measure',
'quality_7',
'City_Renton',
'City_Maple Valley',
'quality_6',
'total_area',
'quality_10',
'basement',
'City_Issaquah',
'City_Sammamish',
'condition_5']
The top 30 features cover about 98% of the feature importance in the gradient boosting model, which is very good coverage from roughly a
third of the variables.
In [179]:
Out[179]:
['furnished_1',
'yr_built',
'living_measure',
'living_measure15',
'quality_8',
'HouseLandRatio',
'lot_measure15',
'quality_9',
'ceil_measure',
'City_Bellevue',
'total_area',
'lot_measure',
'City_Seattle',
'City_Kirkland',
'City_Kent',
'City_Federal Way',
'coast_1',
'basement',
'City_Mercer Island',
'quality_7',
'City_Redmond',
'sight_4',
'City_Renton',
'City_Maple Valley',
'City_Medina',
'City_Sammamish',
'quality_10',
'has_renovated_Yes',
'room_bath_2.5',
'room_bed_3']
The top 30 features cover about 95% of the feature importance in the random forest model.
Now we extract the top-30 feature lists from the models above.
In [180]:
feat_list_GB1=modelfit(GB1,X_train,y_train, printFeatureImportance=False)
print(feat_list_GB1)
feat_list_RF1=modelfit(RF1,X_train,y_train, printFeatureImportance=False)
print(feat_list_RF1)
From the two feature lists above, we consolidate all the features by taking their union.
In [181]:
Key_feat=list(set(feat_list_GB1).union(feat_list_RF1))
print(len(Key_feat))
print(Key_feat)
33
['City_Mercer Island', 'condition_5', 'City_Sammamish', 'yr_built', 'sight_4', 'City_Seattle',
'City_Federal Way', 'City_Maple Valley', 'City_Bellevue', 'furnished_1', 'City_Kent', 'quality_9',
'City_Redmond', 'City_Issaquah', 'quality_8', 'total_area', 'quality_7', 'ceil_measure', 'City_Medina',
'coast_1', 'condition_3', 'lot_measure15', 'HouseLandRatio', 'City_Kirkland', 'City_Renton',
'living_measure15', 'basement', 'room_bed_4', 'quality_6', 'lot_measure', 'quality_10', 'quality_11',
'living_measure']
The union of the two models' lists gives 33 important features. We freeze this list of 33 and build another dataframe from these features (along with 'price').
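The cell that builds `dff33` is empty in this export; presumably it subsets `dff` to the 33 key features plus the target. A minimal sketch of that pattern on a stand-in frame:

```python
# Subset a dataframe to a frozen feature list plus the target column.
import pandas as pd

demo = pd.DataFrame({'price': [1, 2], 'a': [3, 4], 'b': [5, 6], 'c': [7, 8]})
key_feat = ['a', 'c']                      # stand-in for the 33-item Key_feat
demo33 = demo[['price'] + key_feat]
print(demo33.shape)  # (2, 3)
```

In the notebook this would read `dff33 = dff[['price'] + Key_feat]`, giving the (18287, 34) shape shown next.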
In [182]:
In [183]:
dff33.shape
Out[183]:
(18287, 34)
In [184]:
dff33.head()
Out[184]:
price basement City_Bellevue coast_1 HouseLandRatio City_Seattle quality_10 quality_9 ceil_measure City_Renton ... quality_8 City_Kent qualit
5 rows × 34 columns
In [185]:
X3 = dff33.drop("price" , axis=1)
y3 = dff33["price"]
print(X3_train.shape)
print(X3_test.shape)
print(X3_val.shape)
(11703, 33)
(3658, 33)
(2926, 33)
Even though PCA helps us reduce the data to about 72 dimensions, the top 30 features in our random forest
model already explain about 95% of the feature importance, and the top 30 in the gradient boosting model
about 98%.
Hence we conclude that we will select features using the feature-importance function of the individual
models. This gives us 33 important features.
In [186]:
Since the gradient boosting model performs best, we will tune its hyperparameters to improve the score.
The following are the parameters we tune for the gradient boosting model.
In [187]:
param_grid = {
'loss':['ls','lad','huber'],
'bootstrap': ['True','False'],  # note: bootstrap is not a GradientBoostingRegressor parameter; it is dropped in the staged searches below
'max_depth': range(5,11,1),
'max_features': ['auto','sqrt'],
'learning_rate': [0.05,0.1,0.2,0.25],
'min_samples_leaf': [4,10,20],
'min_samples_split': [5,10,1000],
'n_estimators': [10,50,100,150,200],
'subsample':[0.8,1]
}
In [188]:
GBR_test=GradientBoostingRegressor(random_state=22)
In [189]:
param_grid1 = {'n_estimators': range(50,401,50)}
In [190]:
grid_search1.fit(X_train,y_train)
grid_search1.best_params_
Out[191]:
{'n_estimators': 400}
In [192]:
grid_search1.best_params_, grid_search1.best_score_
Out[192]:
({'n_estimators': 400}, 0.7757647547223905)
n_estimators of 400 is the best in the range 50 to 400, i.e. at the boundary, so we extend the search up to 1000.
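The construction of `grid_search1` is not shown in the export; a minimal self-contained sketch of this staged-search pattern (tiny stand-in data and grid so it runs quickly):

```python
# Staged GridSearchCV over n_estimators for a gradient boosting regressor.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(22)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

GBR_test = GradientBoostingRegressor(random_state=22)
grid_search1 = GridSearchCV(GBR_test, param_grid={'n_estimators': [25, 50, 100]},
                            cv=3, n_jobs=1)
grid_search1.fit(X, y)
print(grid_search1.best_params_, round(grid_search1.best_score_, 3))
```

If the best value lands on the grid boundary, the next stage widens the range around it, exactly as done with 400 above.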
In [193]:
Out[193]:
GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_sampl...te=22, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=2,
param_grid={'n_estimators': range(400, 1001, 200)},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)
In [194]:
grid_search2.cv_results_,grid_search2.best_params_, grid_search2.best_score_
Out[194]:
Out[195]:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_sampl...te=22, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=3,
param_grid={'n_estimators': range(1000, 2000, 300)},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)
In [196]:
grid_search2.best_params_, grid_search2.best_score_
Out[196]:
({'n_estimators': 1000}, 0.7885965739886799)
In [197]:
param_grid3 = {
'learning_rate': [0.1,0.2],
'min_samples_leaf': [5,10,20],
'min_samples_split': [5,10,20],
'n_estimators': [500,1000],
}
In [198]:
GBR_test=GradientBoostingRegressor(random_state=22)
Out[198]:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_sampl...te=22, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=3,
param_grid={'learning_rate': [0.1, 0.2], 'min_samples_leaf': [5, 10, 20], 'min_samples_split'
: [5, 10, 20], 'n_estimators': [500, 1000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)
In [199]:
grid_search3.best_params_, grid_search3.best_score_
Out[199]:
({'learning_rate': 0.1,
'min_samples_leaf': 10,
'min_samples_split': 5,
'n_estimators': 1000},
0.7880978276736184)
In the combination of four parameters, the values above give the best result; n_estimators of 1000 is again the best. Next we change the ranges
of the other three parameters.
In [200]:
param_grid4 = {
'learning_rate': [0.1,0.15],
'max_depth': [5,10],
'min_samples_leaf': [5,8],
'min_samples_split': [20,30],
'n_estimators': [1000],
}
In [201]:
GBR_test=GradientBoostingRegressor(random_state=22)
Out[201]:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_sampl...te=22, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=3,
param_grid={'learning_rate': [0.1, 0.15], 'max_depth': [5, 10], 'min_samples_leaf': [5, 8], '
min_samples_split': [20, 30], 'n_estimators': [1000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)
In [202]:
grid_search4.best_params_, grid_search4.best_score_
Out[202]:
({'learning_rate': 0.1,
'max_depth': 5,
'min_samples_leaf': 8,
'min_samples_split': 20,
'n_estimators': 1000},
0.7821899364744039)
param_grid5 = {
'learning_rate': [0.1],
'max_depth': [5],
'min_samples_leaf': [8,10],
'min_samples_split': [30,40],
'n_estimators': [1000],
}
GBR_test=GradientBoostingRegressor(random_state=22)
Out[203]:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_sampl...te=22, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=2,
param_grid={'learning_rate': [0.1], 'max_depth': [5], 'min_samples_leaf': [8, 10], 'min_sampl
es_split': [30, 40], 'n_estimators': [1000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)
In [204]:
grid_search5.best_params_, grid_search5.best_score_
Out[204]:
({'learning_rate': 0.1,
'max_depth': 5,
'min_samples_leaf': 10,
'min_samples_split': 40,
'n_estimators': 1000},
0.7844535606632613)
param_grid6 = {
'learning_rate': [0.1],
'max_depth': [5],
'min_samples_leaf': [8],
'min_samples_split': [40,50],
'n_estimators': [1000],
}
GBR_test=GradientBoostingRegressor(random_state=22)
Out[205]:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_sampl...te=22, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=2,
param_grid={'learning_rate': [0.1], 'max_depth': [5], 'min_samples_leaf': [8], 'min_samples_s
plit': [40, 50], 'n_estimators': [1000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)
In [206]:
grid_search6.best_params_, grid_search6.best_score_
Out[206]:
({'learning_rate': 0.1,
'max_depth': 5,
'min_samples_leaf': 8,
'min_samples_split': 50,
'n_estimators': 1000},
0.7828068526559553)
There is only a marginal improvement in the score. Among 30, 40 and 50, min_samples_split of 40 gives the best score.
We will now tune the final set of parameters together with the ones finalized above.
In [207]:
param_grid7 = {
'loss':['ls','lad','huber'],
'max_features': ['auto','sqrt'],
'learning_rate': [0.1],
'max_depth': [5],
'min_samples_leaf': [8],
'min_samples_split': [40],
'n_estimators': [1000],
'subsample':[0.8,1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
Out[207]:
GridSearchCV(cv=5, error_score='raise-deprecating',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_sampl...te=22, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
fit_params=None, iid='warn', n_jobs=2,
param_grid={'loss': ['ls', 'lad', 'huber'], 'max_features': ['auto', 'sqrt'], 'learning_rate'
: [0.1], 'max_depth': [5], 'min_samples_leaf': [8], 'min_samples_split': [40], 'n_estimators': [1000
], 'subsample': [0.8, 1]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=1)
In [208]:
grid_search7.best_params_, grid_search7.best_score_
Out[208]:
({'learning_rate': 0.1,
'loss': 'huber',
'max_depth': 5,
'max_features': 'sqrt',
'min_samples_leaf': 8,
'min_samples_split': 40,
'n_estimators': 1000,
'subsample': 1},
0.7965973506104334)
There is an improvement in the score. We will try one more iteration, varying the other parameters.
In [209]:
param_gridF = {
'loss':['huber'],
'max_features': ['sqrt'],
'learning_rate': [0.1,0.2],
'max_depth': [5,8],
'min_samples_leaf': [5],
'min_samples_split': [40,50],
'n_estimators': [1000],
'subsample':[1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
Out[209]:
({'learning_rate': 0.1,
'loss': 'huber',
'max_depth': 5,
'max_features': 'sqrt',
'min_samples_leaf': 5,
'min_samples_split': 40,
'n_estimators': 1000,
'subsample': 1},
0.7958994895003749)
'learning_rate': 0.1, 'loss': 'huber', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 50, 'n_estimators': 1000,
'subsample': 1
result_leafs_tr=r2_score(y_train,y_GBR_predtr)
train_results.append(result_leafs_tr)
result_leafs_vl=r2_score(y_val,y_GBR_predvl)
val_results.append(result_leafs_vl)
min_samples_splits = [10,15,30,50,100,500,700,1000]
train_results_spt = []
val_results_spt = []
for min_samples_split in min_samples_splits:
    GBR_test=GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=min_samples_split,
        min_samples_leaf=5,
        max_depth=5,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train,y_train)
    y_GBR_predtr= GBR_test.predict(X_train)
    y_GBR_predvl= GBR_test.predict(X_val)
    result_spt_tr=r2_score(y_train,y_GBR_predtr)
    train_results_spt.append(result_spt_tr)
    result_spt_vl=r2_score(y_val,y_GBR_predvl)
    val_results_spt.append(result_spt_vl)
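The notebook evaluates each setting but the plotting cell is not shown; a minimal sketch of how the train/validation curves could be plotted (the score values below are placeholders, not the notebook's results):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Placeholder scores; in the notebook these are the train_results_spt /
# val_results_spt lists collected by the loop above.
min_samples_splits = [10, 15, 30, 50, 100, 500, 700, 1000]
train_results_spt = [0.90, 0.89, 0.88, 0.87, 0.85, 0.80, 0.78, 0.75]
val_results_spt = [0.80, 0.80, 0.79, 0.79, 0.78, 0.76, 0.75, 0.73]

plt.plot(min_samples_splits, train_results_spt, label="train R2")
plt.plot(min_samples_splits, val_results_spt, label="validation R2")
plt.xscale("log")  # the candidate values span two orders of magnitude
plt.xlabel("min_samples_split")
plt.ylabel("R2 score")
plt.legend()
plt.savefig("min_samples_split_tuning.png")
```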
From the plot above, min_samples_split of about 10 gives the best validation score. We will expand the search range around 10.
In [212]:
min_samples_splits = [10,15,20,30,40,50,60,70,80,90,100]
train_results_spt = []
val_results_spt = []
for min_samples_split in min_samples_splits:
    GBR_test=GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=min_samples_split,
        min_samples_leaf=5,
        max_depth=5,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train,y_train)
    y_GBR_predtr= GBR_test.predict(X_train)
    y_GBR_predvl= GBR_test.predict(X_val)
    result_spt_tr=r2_score(y_train,y_GBR_predtr)
    train_results_spt.append(result_spt_tr)
    result_spt_vl=r2_score(y_val,y_GBR_predvl)
    val_results_spt.append(result_spt_vl)
min_samples_splits = [7,8,9,10,11,12,13,14,15,20]
train_results_spt = []
val_results_spt = []
for min_samples_split in min_samples_splits:
    GBR_test=GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=min_samples_split,
        min_samples_leaf=5,
        max_depth=5,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train,y_train)
    y_GBR_predtr= GBR_test.predict(X_train)
    y_GBR_predvl= GBR_test.predict(X_val)
    result_spt_tr=r2_score(y_train,y_GBR_predtr)
    train_results_spt.append(result_spt_tr)
    result_spt_vl=r2_score(y_val,y_GBR_predvl)
    val_results_spt.append(result_spt_vl)
max_depths = range(3,11,1)
train_results_dpt = []
val_results_dpt = []
for max_depth in max_depths:
    GBR_test=GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=1000,
        subsample=1.0,
        min_samples_split=10,
        min_samples_leaf=6,
        max_depth=max_depth,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train,y_train)
    y_GBR_predtr= GBR_test.predict(X_train)
    y_GBR_predvl= GBR_test.predict(X_val)
    result_dpt_tr=r2_score(y_train,y_GBR_predtr)
    train_results_dpt.append(result_dpt_tr)
    result_dpt_vl=r2_score(y_val,y_GBR_predvl)
    val_results_dpt.append(result_dpt_vl)
From the plot above, max_depth of about 6 gives the best validation score without overfitting the training set.
In [215]:
estimators = range(100,1500,100)
train_results_est = []
val_results_est = []
for n_estimators in estimators:
    GBR_test=GradientBoostingRegressor(
        loss='huber',
        learning_rate=0.1,
        n_estimators=n_estimators,
        subsample=1.0,
        min_samples_split=30,
        min_samples_leaf=6,
        max_depth=9,
        random_state=22,
        alpha=0.9,
    )
    GBR_test.fit(X_train,y_train)
    y_GBR_predtr= GBR_test.predict(X_train)
    y_GBR_predvl= GBR_test.predict(X_val)
    result_est_tr=r2_score(y_train,y_GBR_predtr)
    train_results_est.append(result_est_tr)
    result_est_vl=r2_score(y_val,y_GBR_predvl)
    val_results_est.append(result_est_vl)
In [217]:
param_gridF = {
    'loss': ['huber'],
    'max_features': ['sqrt'],
    'learning_rate': [0.1],
    'max_depth': [6],
    'min_samples_leaf': [6],
    'min_samples_split': [12],
    'n_estimators': [1000],
    'subsample': [1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
# grid-search lines reconstructed; Out[217] shows the cross-validated score
grid_searchF=GridSearchCV(estimator=GBR_test,param_grid=param_gridF,cv=5,n_jobs=2,verbose=1)
grid_searchF.fit(X_train,y_train)
grid_searchF.best_score_
Out[217]:
0.7934419703161365
In [218]:
param_gridF = {
    'loss': ['huber'],
    'max_features': ['sqrt'],
    'learning_rate': [0.1],
    'max_depth': [5],
    'min_samples_leaf': [5],
    'min_samples_split': [50],
    'n_estimators': [1000],
    'subsample': [1]
}
GBR_test=GradientBoostingRegressor(random_state=22)
# grid-search lines reconstructed; Out[218] shows the score and parameters
grid_searchF=GridSearchCV(estimator=GBR_test,param_grid=param_gridF,cv=5,n_jobs=2,verbose=1)
grid_searchF.fit(X_train,y_train)
grid_searchF.best_score_, grid_searchF.best_params_
Out[218]:
(0.7928868850462906,
{'learning_rate': 0.1,
'loss': 'huber',
'max_depth': 5,
'max_features': 'sqrt',
'min_samples_leaf': 5,
'min_samples_split': 50,
'n_estimators': 1000,
'subsample': 1})
We can conclude from the above that GridSearchCV gives better results than tuning individual parameters by the graphical method.
The final parameters giving the best result on the training set are:
'learning_rate': 0.1, 'loss': 'huber', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 50, 'n_estimators': 1000,
'subsample': 1
CONFIDENCE INTERVAL
In [219]:
GBR_bestparam=GradientBoostingRegressor(
loss='huber',
learning_rate=0.1,
n_estimators=1000,
subsample=1.0,
min_samples_split=50,
min_samples_leaf=5,
max_depth=5,
random_state=22,
alpha=0.9,
)
GBR_bestparam.fit(X_train,y_train)
y_GBRF_predtr= GBR_bestparam.predict(X_train)
y_GBRF_predvl= GBR_bestparam.predict(X_val)
y_GBRF_predts= GBR_bestparam.predict(X_test)
In [220]:
GBRF_vlscore=r2_score(y_val,y_GBRF_predvl)
GBRF_vlRMSE=np.sqrt(mean_squared_error(y_val, y_GBRF_predvl))
GBRF_vlMSE=mean_squared_error(y_val, y_GBRF_predvl)
GBRF_vlMAE=mean_absolute_error(y_val, y_GBRF_predvl)
GBRF_trscore=r2_score(y_train,y_GBRF_predtr)
GBRF_trRMSE=np.sqrt(mean_squared_error(y_train, y_GBRF_predtr))
GBRF_trMSE=mean_squared_error(y_train, y_GBRF_predtr)
GBRF_tsscore=r2_score(y_test,y_GBRF_predts)
GBRF_tsRMSE=np.sqrt(mean_squared_error(y_test, y_GBRF_predts))
GBRF_tsMSE=mean_squared_error(y_test, y_GBRF_predts)
GBRF_tsMAE=mean_absolute_error(y_test, y_GBRF_predts)
# DataFrame construction reconstructed from the Out[220] table below
GBRF_df=pd.DataFrame({'Method':['GBRF'],'Val Score':[GBRF_vlscore],'RMSE_vl':[GBRF_vlRMSE],'MSE_vl':[GBRF_vlMSE],
                      'train Score':[GBRF_trscore],'RMSE_tr':[GBRF_trRMSE],'MSE_tr':[GBRF_trMSE],
                      'test Score':[GBRF_tsscore],'RMSE_ts':[GBRF_tsRMSE],'MSE_ts':[GBRF_tsMSE]})
GBRF_df
Out[220]:
Method Val Score RMSE_vl MSE_vl train Score RMSE_tr MSE_tr test Score RMSE_ts MSE_ts
0 GBRF 0.80096 115867.988855 1.342539e+10 0.898909 81372.879729 6.621546e+09 0.793584 114695.310542 1.315501e+10
In [221]:
num_folds = 50
seed = 7
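The cross-validation cell that follows num_folds and seed is not shown; a minimal sketch of a k-fold confidence interval on stand-in data (make_regression, 10 folds and 50 estimators are assumptions to keep the sketch fast; the notebook uses num_folds=50, seed=7):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data; in the notebook this would be the house dataset.
X, y = make_regression(n_samples=400, n_features=8, noise=10, random_state=7)
model = GradientBoostingRegressor(n_estimators=50, random_state=22)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # notebook uses 50 folds
scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")

# A rough 95% interval around the mean fold score
mean, std = scores.mean(), scores.std()
print("R2: %.3f (+/- %.3f)" % (mean, 1.96 * std))
```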
For further improvement, new datasets can be built by treating outliers in different ways and the ensemble models can be hyper-tuned again.
Dataset-2
In [2]:
In [224]:
## Need to add file USA ZipCodes_1.xlsx to current working directory to access this data
USAZip=pd.read_excel("USA ZipCodes_1.xlsx",sheet_name="Sheet8")
USAZip.head()
Out[224]:
In [239]:
house_df = pd.read_csv('innercity.csv')
In [240]:
house_df1=house_df.merge(USAZip,how='left',on='zipcode')
#house_df.drop_duplicates()
house_df.shape
Out[240]:
(21613, 23)
In [5]:
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ccf1142588>
In [241]:
For this iteration we treat coast, furnished and quality as categoricals, since transforming many features in the previous version did not give the
desired result.
TREATING OUTLIERS
Removing data points which fall into the criteria below:
We lose 20 records, which is 0.09% of the available data. These records are extreme values for which we do not have enough data to estimate
well, hence we remove them.
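The exact removal criteria are elided above; a hypothetical illustration of criterion-based row filtering (the column values and threshold below are made up):

```python
import pandas as pd

# Toy frame; in the notebook the filter runs on the full house dataframe
# with the project's own thresholds.
df = pd.DataFrame({"room_bed": [3, 4, 33], "living_measure": [1800, 2400, 290]})
df_clean = df[df["room_bed"] < 10]  # e.g. drop an implausible 33-bedroom record
print(df.shape, "->", df_clean.shape)
```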
In [242]:
Out[242]:
(21593, 21)
In [243]:
house_df_2.columns
Out[243]:
In [252]:
In [253]:
house_df_final.columns
Out[253]:
In [254]:
house_df_final.shape
Out[254]:
(21593, 31)
In [268]:
Out[268]:
Index(['price', 'room_bed', 'room_bath', 'living_measure', 'lot_measure',
'ceil', 'sight', 'condition', 'ceil_measure', 'basement', 'yr_built',
'yr_renovated', 'zipcode', 'lat', 'long', 'living_measure15',
'lot_measure15', 'total_area', 'coast_1', 'quality_3', 'quality_4',
'quality_5', 'quality_6', 'quality_7', 'quality_8', 'quality_9',
'quality_10', 'quality_11', 'quality_12', 'quality_13', 'furnished_1'],
dtype='object')
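The coast_1, quality_3..quality_13 and furnished_1 columns are consistent with one-hot encoding via pd.get_dummies with drop_first=True (the lowest level of each categorical is dropped); a minimal sketch on a toy frame, with values chosen for illustration:

```python
import pandas as pd

# Toy frame with the three categoricals encoded in Dataset-2.
df = pd.DataFrame({"coast": [0, 1, 0], "quality": [1, 3, 13], "furnished": [1, 0, 1]})
encoded = pd.get_dummies(df, columns=["coast", "quality", "furnished"], drop_first=True)
print(list(encoded.columns))
```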
Showing the data correlation between attributes with a heatmap
In [256]:
#total_area is highly correlated with lot_measure, ceil_measure is highly correlated with living_measure
house_corr_2 = house_df_final.corr(method ='pearson')
house_corr_2.to_excel("house_corr_2.xls")
plt.figure(figsize=(35,20))
sns.heatmap(house_corr_2,cmap="coolwarm", annot=True,annot_kws={"size":9},fmt='.2')
Out[256]:
<matplotlib.axes._subplots.AxesSubplot at 0x225943454a8>
In [257]:
In [258]:
In [259]:
print(df_train.shape)
print(df_test.shape)
print(df_val.shape)
(13819, 31)
(4319, 31)
(3455, 31)
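The split cells are not shown, but the printed shapes are consistent with a two-stage train_test_split: 20% held out as test, then 20% of the remainder as validation. A sketch, with the random_state values assumed (only the fractions matter for the shapes):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 21593 rows, as after outlier removal above
df = pd.DataFrame({"price": range(21593)})
df_trainval, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_trainval, test_size=0.2, random_state=1)
print(df_train.shape, df_test.shape, df_val.shape)  # row counts: 13819, 4319, 3455
```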
In [260]:
Out[260]:
1320 330000
16628 245000
2923 369000
15818 532000
4665 506400
Name: price, dtype: int64
In [261]:
Out[261]:
6030 225000
16781 373500
17420 325000
4147 260000
17992 233000
Name: price, dtype: int64
In [262]:
# Split the 'df_test' set into X and y
X_test2 = df_test.drop(['price'],axis=1)
y_test2 = df_test['price']
X_test2.shape
len_test=len(X_test2)
y_test2.head()
Out[262]:
19155 510000
10450 264500
14277 266000
7601 735000
6563 600000
Name: price, dtype: int64
We will use an XGBoost model in addition to the models used earlier on Dataset-1.
Creating a dataframe for results and a function to compute the scores of each model on its train and validation
datasets
In [24]:
#Function to give the results of a model on its train and validation datasets.
#As input it requires a model name to display, the algorithm, train independent variables, train dependent variable,
#validation independent variables and validation dependent variable.
def result(model,pipe_model,X_train_set,y_train_set,X_val_set,y_val_set):
    pipe_model.fit(X_train_set,y_train_set)
    #predicting results over the train and validation data
    y_train_predict= pipe_model.predict(X_train_set)
    y_val_predict= pipe_model.predict(X_val_set)
    trscore=r2_score(y_train_set,y_train_predict)
    trRMSE=np.sqrt(mean_squared_error(y_train_set,y_train_predict))
    trMSE=mean_squared_error(y_train_set,y_train_predict)
    trMAE=mean_absolute_error(y_train_set,y_train_predict)
    vlscore=r2_score(y_val_set,y_val_predict)
    vlRMSE=np.sqrt(mean_squared_error(y_val_set,y_val_predict))
    vlMSE=mean_squared_error(y_val_set,y_val_predict)
    vlMAE=mean_absolute_error(y_val_set,y_val_predict)
    result_df=pd.DataFrame({'Method':[model],'val score':vlscore,'RMSE_val':vlRMSE,'MSE_val':vlMSE,'MAE_vl':vlMAE,
                            'train Score':trscore,'RMSE_tr':trRMSE,'MSE_tr':trMSE,'MAE_tr':trMAE})
    #Plot between actual and predicted values
    plt.figure(figsize=(18,10))
    sns.lineplot(range(len(y_val_set)),y_val_set,color='blue',linewidth=1.5)
    sns.lineplot(range(len(y_val_set)),y_val_predict,color='hotpink',linewidth=.5)
    plt.title('Actual and Predicted', fontsize=20) # Plot heading
    plt.xlabel('Index', fontsize=10) # X-label
    plt.ylabel('Values', fontsize=10) # Y-label
    return result_df
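The feat_imp helper called throughout the model cells is not defined in this extract; a plausible sketch, assuming it wraps the fitted estimator's feature_importances_ into a sorted DataFrame as the outputs suggest:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Sketch of feat_imp: importance of each feature, sorted descending,
# mirroring the "Imp" tables shown in the outputs.
def feat_imp(model, X):
    imp = pd.DataFrame(model.feature_importances_, columns=["Imp"], index=X.columns)
    return imp.sort_values(by="Imp", ascending=False)

# Tiny demo: feature "a" carries all the signal, "b" is constant.
X_demo = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [0, 0, 0, 0, 0]})
y_demo = [10, 20, 30, 40, 50]
tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
print(feat_imp(tree, X_demo))
```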
LINEAR REGRESSION
In [26]:
clf=LinearRegression()
pipe_lr = Pipeline([('LR', clf)])
result_dff=pd.concat([result_dff,result('Linear Reg',pipe_lr,X_train,y_train,X_val,y_val)])
result_dff
Out[27]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
In [28]:
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ccf527d438>
RIDGE REGRESSION
In [29]:
In [30]:
clf=Ridge()
pipe_ridge = Pipeline([('Ridge', clf)])
result_dff=pd.concat([result_dff,result('Ridge_Reg_1',pipe_ridge,X_train,y_train,X_val,y_val)])
result_dff
Out[30]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ccf5c6cd30>
In [32]:
#Iteration 2
clf=Ridge(alpha=0.08)
pipe_ridge_1 = Pipeline([('Ridge',clf )])
result_dff=pd.concat([result_dff,result('Ridge_Reg_2',pipe_ridge_1,X_train,y_train,X_val,y_val)])
result_dff
Out[32]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ccf5d78358>
LASSO REGRESSION
In [34]:
clf=Lasso(alpha=10, max_iter=1000)
pipe_lasso_1 = Pipeline([('Lasso',clf )])
result_dff=pd.concat([result_dff,result('Lasso_Reg_1',pipe_lasso_1,X_train,y_train,X_val,y_val)])
result_dff
Out[35]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
Out[36]:
quality_13 1282757.93547
quality_12 720634.25526
lat 603385.77684
coast_1 515951.63187
furnished_1 356060.28711
quality_11 254292.72718
quality_8 51062.02158
sight 48526.08682
quality_3 47977.01520
room_bath 44364.15660
condition 35706.89861
ceil 28507.08484
living_measure 126.78842
living_measure15 33.97555
yr_renovated 23.50688
total_area 0.35066
quality_10 -0.00000
lot_measure -0.19029
lot_measure15 -0.29272
basement -8.54716
ceil_measure -15.67109
zipcode -512.28283
yr_built -2269.56051
quality_7 -16734.79835
room_bed -18988.08235
quality_6 -63515.14160
quality_4 -89548.16152
quality_5 -97142.30729
long -172566.03480
quality_9 -177720.13306
dtype: float64
KNN Regressor
In [37]:
Out[37]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [38]:
DECISION TREE
In [39]:
#feature importance
plt.figure(figsize=(10,10))
imp_feature_1[:30].plot.bar(figsize=(15,5))
#Import library
from sklearn.tree import DecisionTreeRegressor
clf=DecisionTreeRegressor(random_state=1)
pipe_DT_1=Pipeline([('DT1',clf)])
result_dff=pd.concat([result_dff,result('DT1',pipe_DT_1,X_train,y_train,X_val,y_val)])
result_dff
Out[40]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [41]:
#Feature importance
feat_imp(clf,X_train)
Imp
furnished_1 0.33440
living_measure 0.19412
lat 0.17853
long 0.06748
coast_1 0.03510
ceil_measure 0.03389
yr_built 0.03233
living_measure15 0.03192
lot_measure 0.01480
zipcode 0.01341
lot_measure15 0.01192
total_area 0.00832
quality_9 0.00781
room_bath 0.00697
sight 0.00633
quality_8 0.00496
basement 0.00436
condition 0.00266
quality_12 0.00247
quality_10 0.00206
room_bed 0.00199
ceil 0.00180
yr_renovated 0.00080
quality_13 0.00048
quality_11 0.00044
quality_7 0.00030
quality_6 0.00026
quality_5 0.00008
quality_4 0.00000
quality_3 0.00000
In [42]:
from sklearn.ensemble import RandomForestRegressor
In [43]:
clf=RandomForestRegressor(random_state=2)
pipe_RF_1=Pipeline([('RF1',clf)])
result_dff=pd.concat([result_dff,result('RF1',pipe_RF_1,X_train,y_train,X_val,y_val)])
result_dff
Out[43]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [44]:
#Feature importance
feat_imp(clf,X_train)
Imp
furnished_1 0.30826
living_measure 0.23477
lat 0.17234
long 0.06825
living_measure15 0.03089
yr_built 0.02564
coast_1 0.02493
sight 0.01985
ceil_measure 0.01696
zipcode 0.01531
lot_measure15 0.01387
quality_9 0.01243
total_area 0.01047
lot_measure 0.00850
room_bath 0.00705
basement 0.00688
quality_8 0.00417
room_bed 0.00380
condition 0.00321
quality_12 0.00262
yr_renovated 0.00247
ceil 0.00221
quality_11 0.00169
quality_10 0.00148
quality_13 0.00096
quality_7 0.00063
quality_6 0.00030
quality_5 0.00005
quality_4 0.00001
quality_3 0.00000
clf=RandomForestRegressor(n_estimators=50,max_depth=18,min_samples_leaf=10,random_state=3)
pipe_RF_2=Pipeline([('RF2',clf)])
result_dff=pd.concat([result_dff,result('RF2',pipe_RF_2,X_train,y_train,X_val,y_val)])
result_dff
Out[45]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [46]:
#Feature importance
feat_imp(clf,X_train)
Imp
furnished_1 0.34209
living_measure 0.25693
lat 0.18194
long 0.07106
living_measure15 0.02514
yr_built 0.02336
sight 0.01984
ceil_measure 0.01841
zipcode 0.01135
quality_9 0.00908
coast_1 0.00864
lot_measure15 0.00801
total_area 0.00561
quality_8 0.00449
lot_measure 0.00336
room_bath 0.00277
basement 0.00172
quality_12 0.00139
condition 0.00123
quality_11 0.00095
room_bed 0.00073
quality_10 0.00073
quality_7 0.00044
ceil 0.00036
yr_renovated 0.00022
quality_6 0.00017
quality_5 0.00001
quality_4 0.00000
quality_3 0.00000
quality_13 0.00000
clf=GradientBoostingRegressor(random_state=4)
pipe_GB_1=Pipeline([('GB1',clf)])
result_dff=pd.concat([result_dff,result('GB1',pipe_GB_1,X_train,y_train,X_val,y_val)])
result_dff
Out[47]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [48]:
#Feature importance
feat_imp(clf,X_train)
Imp
living_measure 0.32718
furnished_1 0.21738
lat 0.17507
long 0.06494
living_measure15 0.03217
coast_1 0.03081
yr_built 0.03081
sight 0.02848
zipcode 0.01718
quality_9 0.01411
ceil_measure 0.01139
quality_12 0.00933
quality_8 0.00850
room_bath 0.00848
quality_11 0.00673
quality_13 0.00363
lot_measure15 0.00331
condition 0.00300
basement 0.00221
total_area 0.00147
yr_renovated 0.00103
lot_measure 0.00079
quality_7 0.00052
ceil 0.00048
quality_10 0.00046
room_bed 0.00037
quality_6 0.00017
quality_3 0.00000
quality_4 0.00000
quality_5 0.00000
clf=GradientBoostingRegressor(n_estimators=150,max_depth=5,random_state=5)
pipe_GB_2=Pipeline([('GB2',clf)])
result_dff=pd.concat([result_dff,result('GB2',pipe_GB_2,X_train,y_train,X_val,y_val)])
result_dff
Out[49]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [50]:
#Feature importance
feat_imp(clf,X_train)
Imp
living_measure 0.28697
furnished_1 0.22921
lat 0.17826
long 0.07063
living_measure15 0.04054
yr_built 0.03118
coast_1 0.03031
quality_9 0.02114
sight 0.02033
zipcode 0.01644
ceil_measure 0.01361
quality_8 0.00939
quality_10 0.00815
total_area 0.00797
room_bath 0.00609
lot_measure15 0.00533
lot_measure 0.00417
basement 0.00412
quality_12 0.00352
quality_11 0.00347
condition 0.00311
quality_13 0.00196
yr_renovated 0.00142
room_bed 0.00107
ceil 0.00096
quality_7 0.00053
quality_6 0.00009
quality_5 0.00004
quality_3 0.00000
quality_4 0.00000
XGBOOST REGRESSOR
In [51]:
clf=XGBRegressor(objective='reg:squarederror',random_state=6)
pipe_XGB_1=Pipeline([('XGB1',clf)])
result_dff=pd.concat([result_dff,result('XGB1',pipe_XGB_1,X_train,y_train,X_val,y_val)])
result_dff
Out[51]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [52]:
#Feature importance
feat_imp(clf,X_train)
Imp
furnished_1 0.44495
quality_9 0.15441
living_measure 0.08844
coast_1 0.04030
sight 0.03631
quality_8 0.03330
lat 0.03246
long 0.02696
quality_12 0.02049
yr_built 0.01917
living_measure15 0.01869
room_bath 0.01360
zipcode 0.01226
quality_11 0.01098
quality_7 0.00875
ceil_measure 0.00861
quality_13 0.00664
condition 0.00428
lot_measure15 0.00364
yr_renovated 0.00294
basement 0.00252
lot_measure 0.00238
ceil 0.00213
total_area 0.00198
quality_6 0.00197
room_bed 0.00186
quality_3 0.00000
quality_4 0.00000
quality_5 0.00000
quality_10 0.00000
clf=XGBRegressor(n_estimators=150,max_depth=5,random_state=7)
pipe_XGB_2=Pipeline([('XGB2',clf)])
result_dff=pd.concat([result_dff,result('XGB2',pipe_XGB_2,X_train,y_train,X_val,y_val)])
result_dff
Out[53]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [54]:
#Feature importance
feat_imp(clf,X_train)
Imp
furnished_1 0.59499
living_measure 0.06767
quality_9 0.06470
coast_1 0.04365
quality_8 0.03345
lat 0.03122
quality_10 0.03027
sight 0.02396
long 0.01961
quality_12 0.01526
living_measure15 0.01122
yr_built 0.01016
quality_11 0.00679
quality_13 0.00676
zipcode 0.00626
ceil_measure 0.00527
quality_7 0.00439
condition 0.00407
total_area 0.00362
room_bath 0.00278
lot_measure15 0.00274
yr_renovated 0.00219
lot_measure 0.00217
basement 0.00205
quality_6 0.00153
ceil 0.00153
room_bed 0.00113
quality_5 0.00055
quality_4 0.00000
quality_3 0.00000
ADABOOST REGRESSOR
In [55]:
clf= AdaBoostRegressor(DecisionTreeRegressor(random_state=8))
pipe_ADAB_1=Pipeline([('ADAB1',clf)])
result_dff=pd.concat([result_dff,result('ADAB1',pipe_ADAB_1,X_train,y_train,X_val,y_val)])
result_dff
Out[55]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [56]:
#Feature importance
feat_imp(clf,X_train)
Imp
living_measure 0.50994
lat 0.09959
furnished_1 0.06601
long 0.06142
coast_1 0.04096
living_measure15 0.04011
sight 0.03042
ceil_measure 0.02662
yr_built 0.01886
lot_measure15 0.01721
zipcode 0.01391
total_area 0.01116
room_bath 0.01004
lot_measure 0.00888
quality_11 0.00824
basement 0.00793
quality_12 0.00540
quality_13 0.00373
quality_9 0.00355
room_bed 0.00343
ceil 0.00261
yr_renovated 0.00252
condition 0.00235
quality_8 0.00226
quality_10 0.00209
quality_7 0.00055
quality_6 0.00017
quality_5 0.00002
quality_4 0.00000
quality_3 0.00000
clf= AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),n_estimators=250,learning_rate=0.005,random_state=9)
pipe_ADAB_2=Pipeline([('ADAB2',clf)])
result_dff=pd.concat([result_dff,result('ADAB2',pipe_ADAB_2,X_train,y_train,X_val,y_val)])
result_dff
Out[57]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
In [58]:
#Feature importance
feat_imp(clf,X_train)
Imp
living_measure 0.31020
furnished_1 0.22982
lat 0.16848
long 0.07221
living_measure15 0.03353
coast_1 0.02876
yr_built 0.02456
ceil_measure 0.02078
sight 0.01669
zipcode 0.01550
lot_measure15 0.01533
total_area 0.01060
lot_measure 0.00862
quality_9 0.00846
room_bath 0.00701
basement 0.00560
quality_8 0.00364
room_bed 0.00335
condition 0.00317
quality_12 0.00265
quality_11 0.00260
yr_renovated 0.00229
ceil 0.00218
quality_10 0.00175
quality_13 0.00100
quality_7 0.00080
quality_6 0.00032
quality_5 0.00008
quality_4 0.00001
quality_3 0.00000
BAGGING REGRESSION
In [59]:
clf= BaggingRegressor(random_state=10)
pipe_BAG_1=Pipeline([('BAG1',clf)])
result_dff=pd.concat([result_dff,result('BAG1',pipe_BAG_1,X_train,y_train,X_val,y_val)])
result_dff
Out[59]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature Importance
feature_importances = np.mean([ tree.feature_importances_ for tree in clf.estimators_], axis=0)
bg_imp_feature=pd.DataFrame(feature_importances, columns = ["Imp"],index=X_train.columns)
bg_imp_feature.sort_values(by="Imp",ascending=False)
Out[60]:
Imp
furnished_1 0.32952
living_measure 0.21044
lat 0.17412
long 0.06964
living_measure15 0.03440
yr_built 0.03000
coast_1 0.02448
ceil_measure 0.01991
zipcode 0.01548
sight 0.01531
lot_measure15 0.01498
total_area 0.00974
quality_9 0.00967
lot_measure 0.00809
room_bath 0.00737
basement 0.00434
room_bed 0.00403
quality_8 0.00399
condition 0.00313
yr_renovated 0.00237
ceil 0.00228
quality_11 0.00182
quality_12 0.00148
quality_10 0.00137
quality_13 0.00084
quality_7 0.00068
quality_6 0.00042
quality_5 0.00010
quality_4 0.00001
quality_3 0.00000
In [61]:
clf= BaggingRegressor(DecisionTreeRegressor(max_depth=12),n_estimators=250,random_state=11)
pipe_BAG_2=Pipeline([('BAG2',clf)])
result_dff=pd.concat([result_dff,result('BAG2',pipe_BAG_2,X_train,y_train,X_val,y_val)])
result_dff
Out[61]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature Importance
pd.options.display.float_format = '{:.5f}'.format
feature_importances = np.mean([ tree.feature_importances_ for tree in clf.estimators_], axis=0)
bg_imp_feature=pd.DataFrame(feature_importances, columns = ["Imp"],index=X_train.columns)
bg_imp_feature.sort_values(by="Imp",ascending=False)
Out[62]:
Imp
furnished_1 0.31748
living_measure 0.23735
lat 0.17613
long 0.06834
living_measure15 0.02947
coast_1 0.02891
yr_built 0.02585
ceil_measure 0.01984
sight 0.01504
zipcode 0.01456
lot_measure15 0.01222
quality_9 0.00935
total_area 0.00814
lot_measure 0.00665
room_bath 0.00590
basement 0.00445
quality_8 0.00431
quality_12 0.00272
quality_11 0.00231
condition 0.00229
room_bed 0.00227
yr_renovated 0.00182
ceil 0.00158
quality_10 0.00150
quality_13 0.00076
quality_7 0.00048
quality_6 0.00021
quality_5 0.00006
quality_4 0.00000
quality_3 0.00000
In [ ]:
We have used Linear Regression, Ridge, Lasso, KNN and ensemble techniques - Decision Trees, Random Forest, Bagging, AdaBoost,
Gradient Boost and XGBoost (gradient boosting with regularisation, and faster). R2 scores on validation are in the range 70%-87% with RMSE in the range
76000-107000. The models are showing better results. Let's hyper-tune to see if results can be improved further. We will hyper-tune Random Forest,
Gradient Boosting, XGBoost and AdaBoost, dropping the features whose importance is zero or very close to zero in all four algorithms -
quality_5, quality_3, quality_4.
In [ ]:
#Dropping features
X_train_ht=X_train.drop(['quality_5', 'quality_3', 'quality_4'],axis=1)
X_test_ht=X_test.drop(['quality_5', 'quality_3', 'quality_4'],axis=1)
X_val_ht=X_val.drop(['quality_5', 'quality_3', 'quality_4'],axis=1)
In [ ]:
In [65]:
Out[65]:
In [66]:
Out[66]:
{'max_depth': 18,
'max_features': 8,
'min_samples_leaf': 5,
'min_samples_split': 18,
'n_estimators': 81}
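The search cell producing these parameters is elided; the non-round best values (n_estimators=81, min_samples_split=18) suggest a randomized search. A sketch on stand-in data (the parameter ranges, n_iter and random_state are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data; in the notebook this would be X_train_ht / y_train.
X, y = make_regression(n_samples=200, n_features=10, random_state=0)
param_dist = {
    "n_estimators": np.arange(50, 120),
    "max_depth": np.arange(5, 25),
    "max_features": np.arange(2, 9),
    "min_samples_leaf": np.arange(3, 10),
    "min_samples_split": np.arange(10, 25),
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=12), param_dist,
                            n_iter=5, cv=3, random_state=12, n_jobs=2)
search.fit(X, y)
print(search.best_params_)
```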
In [67]:
result_dff=pd.concat([result_dff,result('RF_ht',rf_best,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
Out[67]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature importance
feat_imp(rf_best,X_train_ht)
Imp
living_measure 0.20762
furnished_1 0.16841
lat 0.15958
living_measure15 0.08067
ceil_measure 0.07752
long 0.05371
room_bath 0.04081
yr_built 0.03216
sight 0.02628
zipcode 0.02266
quality_9 0.02174
coast_1 0.01627
basement 0.01216
quality_8 0.01187
total_area 0.01125
lot_measure15 0.01110
quality_11 0.00995
lot_measure 0.00946
quality_10 0.00673
quality_7 0.00643
condition 0.00437
quality_12 0.00272
quality_6 0.00216
room_bed 0.00135
yr_renovated 0.00130
ceil 0.00118
quality_13 0.00057
GB_ht=GradientBoostingRegressor()
params = {"n_estimators": np.arange(138,143,1),"learning_rate":[0.08,0.09],"max_depth": np.arange(8, 11,1),
          "max_features":np.arange(5,8,1),'min_samples_leaf': range(16, 21, 1)}
GB_GV_1 = GridSearchCV(estimator = GB_ht, param_grid = params,cv=skf,verbose=1,return_train_score=True,n_jobs=2)
GB_GV_1.fit(X_train_ht,y_train)
Out[69]:
{'learning_rate': 0.09,
'max_depth': 8,
'max_features': 7,
'min_samples_leaf': 17,
'n_estimators': 142}
In [70]:
result_dff=pd.concat([result_dff,result('GB_ht',gb_best,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
Out[70]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature importance
feat_imp(gb_best,X_train_ht)
Imp
living_measure 0.21739
lat 0.15607
furnished_1 0.13874
living_measure15 0.10908
long 0.06063
ceil_measure 0.05424
room_bath 0.04955
sight 0.03128
yr_built 0.02943
coast_1 0.02644
zipcode 0.02530
lot_measure15 0.01702
quality_9 0.01336
total_area 0.01053
lot_measure 0.01020
basement 0.00833
condition 0.00829
quality_7 0.00649
quality_12 0.00592
quality_8 0.00494
quality_11 0.00487
quality_10 0.00320
quality_6 0.00318
room_bed 0.00243
yr_renovated 0.00185
ceil 0.00125
quality_13 0.00000
ADABOOST HYPERTUNE
In [72]:
ADAB_ht=AdaBoostRegressor(DecisionTreeRegressor(max_depth=28))
params = {"n_estimators": np.arange(176,183,1),"learning_rate":[0.4,0.5,0.6],'loss':['linear','square']}
ADAB_GV_1 = GridSearchCV(estimator = ADAB_ht, param_grid = params,cv=skf,verbose=1,return_train_score=True,n_jobs=2)
ADAB_GV_1.fit(X_train_ht,y_train)
Out[72]:
In [73]:
Out[73]:
In [74]:
adab_best = AdaBoostRegressor(DecisionTreeRegressor(max_depth=28),n_estimators=180,learning_rate=0.5,loss='linear',
                              random_state=15)
result_dff=pd.concat([result_dff,result('ADAB_ht',adab_best,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
Out[74]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature importance
feat_imp(adab_best,X_train_ht)
Imp
living_measure 0.48898
furnished_1 0.10561
lat 0.09726
long 0.05701
coast_1 0.03784
living_measure15 0.03747
sight 0.02427
ceil_measure 0.02000
yr_built 0.01993
lot_measure15 0.01562
zipcode 0.01534
room_bath 0.01240
total_area 0.01094
lot_measure 0.00964
basement 0.00810
quality_9 0.00798
quality_11 0.00571
quality_12 0.00445
room_bed 0.00382
yr_renovated 0.00322
condition 0.00291
quality_10 0.00286
ceil 0.00268
quality_8 0.00253
quality_13 0.00252
quality_7 0.00073
quality_6 0.00017
XGBoost Regressor
In [76]:
Out[76]:
In [77]:
Out[77]:
{'colsample_bytree': 0.68,
'learning_rate': 0.2,
'n_estimators': 185,
'subsample': 0.67}
In [78]:
result_dff=pd.concat([result_dff,result('xgb_1_ht',xgb_best_1,X_train_ht,y_train,X_val_ht,y_val)])
result_dff
Out[78]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature importance
feat_imp(xgb_best_1,X_train_ht)
Imp
furnished_1 0.45991
living_measure 0.08495
quality_9 0.07530
lat 0.05495
sight 0.04581
coast_1 0.04306
quality_8 0.02980
long 0.02709
quality_12 0.02143
living_measure15 0.01880
quality_6 0.01387
quality_11 0.01343
quality_13 0.01221
zipcode 0.01197
room_bath 0.01166
yr_built 0.01035
condition 0.01023
quality_10 0.00940
ceil_measure 0.00728
lot_measure15 0.00666
basement 0.00661
total_area 0.00550
ceil 0.00534
yr_renovated 0.00469
lot_measure 0.00348
room_bed 0.00313
quality_7 0.00312
params2 = {
'min_child_weight':[6,7,8,9,10],"max_depth": [3,4,5],
}
xgb_best_2.fit(X_train_ht, y_train)
Out[80]:
{'max_depth': 5, 'min_child_weight': 7}
In [81]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature importance
feat_imp(xgb_best_2,X_train_ht)
Imp
furnished_1 0.46173
quality_9 0.08325
living_measure 0.07545
coast_1 0.06976
lat 0.04473
sight 0.02880
room_bath 0.02802
quality_8 0.02780
long 0.02034
quality_10 0.01916
quality_7 0.01823
yr_built 0.01647
living_measure15 0.01506
quality_12 0.01277
zipcode 0.01129
quality_11 0.01026
quality_13 0.00990
lot_measure15 0.00685
condition 0.00637
ceil_measure 0.00625
lot_measure 0.00535
room_bed 0.00455
total_area 0.00408
ceil 0.00406
basement 0.00383
yr_renovated 0.00322
quality_6 0.00243
params3 = {
    'gamma': [i / 1.0 for i in range(50, 55)]
}
xgb_best_3.fit(X_train_ht, y_train)
Out[83]:
{'gamma': 50.0}
In [84]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature importance
feat_imp(xgb_best_3,X_train_ht)
Imp
furnished_1 0.55026
living_measure 0.11252
coast_1 0.04955
sight 0.04906
lat 0.04026
quality_8 0.03380
long 0.01515
quality_6 0.01480
quality_11 0.01418
quality_12 0.01409
living_measure15 0.01227
quality_9 0.00959
zipcode 0.00910
condition 0.00887
quality_10 0.00752
ceil_measure 0.00718
yr_built 0.00665
total_area 0.00638
yr_renovated 0.00607
quality_13 0.00578
room_bath 0.00527
room_bed 0.00497
ceil 0.00429
lot_measure 0.00399
basement 0.00379
lot_measure15 0.00335
quality_7 0.00128
We executed many models and, after comparing results, hyper-tuned four of them. All four perform well, with R2 scores greater than 86% and RMSE below 132600.
The best of all is Extreme Gradient Boosting (XGBoost), an enhanced version of gradient boosting: it adds regularisation and is faster too. It gives an R2 score of around 89.5% with an RMSE of around 109000.
Going forward, this model can be improved further, as we do not have much data for very high-priced houses. When more data comes in, we can revisit the model and make the necessary changes to accommodate more variation in the data and deliver better results, ideally reducing the RMSE.
Finally, let's run the model on the test data, which we have not used until now, and see how it performs.
result_dff=pd.concat([result_dff,result('xgb_test',xgb_best_3,X_test_ht,y_test,X_val_ht,y_val)])
result_dff
Out[86]:
Method val score RMSE_val MSE_val MAE_vl train Score RMSE_tr MSE_tr MAE_tr
0 Linear Reg 0.71763 180818.45737 32695314526.97058 117107.93415 0.72770 181882.36852 33081195977.82660 116936.92426
0 KNN Reg 0.49520 241764.35295 58450002356.08145 151384.98932 0.99935 8894.71112 79115885.95018 727.31898
#Feature importance
feat_imp(xgb_best_3,X_test_ht)
Imp
furnished_1 0.53507
living_measure 0.13423
sight 0.05095
coast_1 0.03399
lat 0.03394
quality_9 0.02366
quality_8 0.02151
long 0.01854
quality_7 0.01645
ceil_measure 0.01541
room_bath 0.01410
living_measure15 0.01192
condition 0.01068
yr_renovated 0.00998
yr_built 0.00925
quality_11 0.00810
zipcode 0.00625
lot_measure15 0.00602
total_area 0.00579
quality_6 0.00567
basement 0.00538
quality_12 0.00534
quality_10 0.00493
lot_measure 0.00478
ceil 0.00428
room_bed 0.00379
quality_13 0.00000
num_folds = 200
seed = 7
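The num_folds = 200 and seed = 7 settings above suggest the 95% confidence interval quoted in the conclusion was obtained by repeatedly resampling and scoring the model; the exact cell is not preserved in this export. A sketch of one way to compute such an interval (with toy data and a small stand-in estimator, both assumptions) is:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

num_folds = 200
seed = 7

X, y = make_regression(n_samples=300, n_features=5, random_state=seed)
# 200 random train/test splits, scored with R2 on each held-out part
cv = ShuffleSplit(n_splits=num_folds, test_size=0.2, random_state=seed)
scores = cross_val_score(GradientBoostingRegressor(n_estimators=30,
                                                   random_state=seed),
                         X, y, cv=cv, scoring='r2')
# Empirical 95% interval of the resampled R2 scores
lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"R2 = {scores.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```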
In [89]:
sns.set(style="darkgrid", color_codes=True)
with sns.axes_style("white"):
Some other important features that strongly affect price are living measure, latitude, above-average build quality, and a coastal location.
CONCLUSION:
We have built different models on the two datasets. The performance (score and 95% confidence interval of scores) of the model built on dataset-1 is better than that of dataset-2, as the 95% confidence interval for dataset-1 is very narrow compared to that of dataset-2. Even though the score of the dataset-2 model is higher, its performance scores span a very wide range.
The top key features to consider for pricing a property are: 'furnished_1', 'yr_built', 'living_measure', 'quality_8', 'lot_measure15', 'quality_9', 'ceil_measure', 'total_area'. These are almost the same in both models.
So, a seller needs to thoroughly inspect the property on the parameters suggested and list its price accordingly; similarly, a prospective buyer should check the features suggested above and calculate the predicted price, which can then be compared to the listed price.
For further improvement, the datasets can be rebuilt by treating outliers in different ways and hyper-tuning the ensemble models. Creating polynomial features and improving model performance can also be explored.
Pickle file Creation
First we define the data-preprocessing function that every record must run through before reaching the model. Then we call the same function when predicting the price (target) of a property.
The pickle file is created following the steps used for dataset-2.
In [9]:
def model(data):
    X_test = pd.read_excel(data)
    # Removing outliers
    X_test_1 = X_test[(X_test['living_measure'] <= 9000) & (X_test['price'] <= 4000000) &
                      (X_test['room_bed'] <= 10) & (X_test['room_bath'] <= 6)]
    # Drop id and date columns (bug fix: drop from X_test_1, not X_test,
    # so the outlier filter above is kept)
    cols = ['cid', 'dayhours']
    X_test_final = X_test_1.drop(cols, axis=1)
    # Create the dummy columns expected by the dataset-2 model
    for i in range(1, 2):
        X_test_final['coast_' + str(i)] = 0
        X_test_final['furnished_' + str(i)] = 0
    for i in range(1, 14):
        X_test_final['quality_' + str(i)] = 0
    # Set the dummy matching each category value (.bool() assumes a single-row record)
    for i in range(1, 2):
        if ((X_test_final['coast'] == i).bool()):
            X_test_final['coast_' + str(i)] = 1
    for i in range(1, 2):
        if ((X_test_final['furnished'] == i).bool()):
            X_test_final['furnished_' + str(i)] = 1
    for i in range(1, 14):
        if ((X_test_final['quality'] == i).bool()):
            X_test_final['quality_' + str(i)] = 1
    # Drop reference-level dummies and the target
    X_test_final = X_test_final.drop(['quality_3', 'quality_4', 'quality_1',
                                      'quality_2', 'quality_5', 'price'], axis=1)
    # Drop the original categorical columns replaced by dummies
    # (categ is defined earlier in the notebook; this list is an assumption)
    categ = ['coast', 'quality', 'furnished']
    X_test_final = X_test_final.drop(categ, axis=1)
    return X_test_final
In [ ]:
import pickle
with open('model_pickle','wb') as f:
pickle.dump(xgb_best_3,f)
In [11]:
with open('model_pickle','rb') as f:
mp=pickle.load(f)
In [14]:
X_test=model('innercity.xlsx')
mp.predict(X_test)
#X_test.columns
Out[14]:
array([314002.16], dtype=float32)
We can see that, with the given parameters, the pickled model has processed the record and returned the predicted price of the property.
In [ ]: