Housing Prices Notebook
The objective of this notebook is to predict the sales price for each house. I explored various questions during my analysis to identify
patterns within the data, leveraging Plotly to create interactive visualizations for deeper insights.
I encourage you to join me on this journey! Feel free to fork this notebook and share your insights and suggestions for further
improvement in the comments. Together, we can unlock the secrets to house price prediction. ^^
# Data handling libraries
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
%matplotlib inline
sns.set_context("paper", font_scale=1.25)
# Import models and modeling utilities
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.model_selection import KFold
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import mean_squared_error
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Create a function to show the count and percentage of missing values in each column, and the column types.
def nulls_count(df):
    """Show the count and percentage of missing values in each column, and the column dtypes."""
    # Identify and count the missing values in each feature.
    missing_num = df.isna().sum()
    # Show only features with nulls.
    missing_num = missing_num[missing_num > 0].to_frame(name='count')
    # Calculate the percentage of null values.
    missing_num['percentage'] = (missing_num['count'] / df.shape[0]) * 100
    # Add each column's dtype.
    missing_num['dtype'] = df.dtypes[missing_num.index]
    return missing_num
# Create a function to split the concatenated DataFrame back into the train and test DataFrames.
def split_df(df):
    """Split the concatenated DataFrame back into the train and test DataFrames."""
    train = df.iloc[:train_df.shape[0]]
    test = df.iloc[train_df.shape[0]:].drop("SalePrice", axis=1)
    return train, test
def check_skewness_outliers(df):
    """Show the skewness and the number of outliers per numeric column."""
    # Numeric column list.
    numeric_cols = df.select_dtypes(include='number').columns
    # Compute feature skewness.
    skewed = df.skew(numeric_only=True).to_frame().reset_index()
    skewed.columns = ['feature', 'skewness']
    # Count outliers per column (assuming the standard 1.5 * IQR rule).
    q1 = df[numeric_cols].quantile(0.25)
    q3 = df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    outliers = ((df[numeric_cols] < q1 - 1.5 * iqr) | (df[numeric_cols] > q3 + 1.5 * iqr)).sum()
    skewed['outliers'] = skewed['feature'].map(outliers)
    return skewed.sort_values('skewness', ascending=False)
# Load data
train_df= pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_df= pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
# Preview a random sample of the training data.
train_df.sample(5)

        Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC  Fence MiscFeature
693    694          30       RL        60.00     5400   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN    NaN         ...
927    928          60       RL          NaN     9900   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN  GdPrv         ...
1324  1325          20       RL        75.00     9986   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN    NaN         ...
1019  1020         120       RL        43.00     3013   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN    NaN         ...
389    390          60       RL        96.00    12474   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN    NaN         ...
5 rows × 81 columns
# Show numeric feature summary statistics, sorted by standard deviation in descending order.
train_df.describe().sort_values('std', axis=1, ascending=False)
SalePrice LotArea GrLivArea MiscVal BsmtFinSF1 BsmtUnfSF TotalBsmtSF 2ndFlrSF Id 1stFlrSF ... YrSold OverallCond
count 1460.00 1460.00 1460.00 1460.00 1460.00 1460.00 1460.00 1460.00 1460.00 1460.00 ... 1460.00 1460.00
mean 180921.20 10516.83 1515.46 43.49 443.64 567.24 1057.43 346.99 730.50 1162.63 ... 2007.82 5.58
std 79442.50 9981.26 525.48 496.12 456.10 441.87 438.71 436.53 421.61 386.59 ... 1.33 1.11
min 34900.00 1300.00 334.00 0.00 0.00 0.00 0.00 0.00 1.00 334.00 ... 2006.00 1.00
25% 129975.00 7553.50 1129.50 0.00 0.00 223.00 795.75 0.00 365.75 882.00 ... 2007.00 5.00
50% 163000.00 9478.50 1464.00 0.00 383.50 477.50 991.50 0.00 730.50 1087.00 ... 2008.00 5.00
75% 214000.00 11601.50 1776.75 0.00 712.25 808.00 1298.25 728.00 1095.25 1391.25 ... 2009.00 6.00
max 755000.00 215245.00 5642.00 15500.00 5644.00 2336.00 6110.00 2065.00 1460.00 4692.00 ... 2010.00 9.00
8 rows × 38 columns
# Show categorical feature summary statistics.
train_df.describe(include='object')

       Utilities Street Condition2 RoofMatl Heating LandSlope CentralAir Functional PavedDrive Electrical ... Exterior1st Exterior2nd MasVnrType
count       1460   1460       1460     1460    1460      1460       1460       1460       1460       1459 ...        1460        1460        ...
unique         2      2          8        8       6         3          2          7          3          5 ...          15          16        ...
top       AllPub   Pave       Norm  CompShg    GasA       Gtl          Y        Typ          Y      SBrkr ...     VinylSd     VinylSd        ...
freq        1459   1454       1445     1434    1428      1382       1365       1360       1340       1334 ...         515         504        ...
4 rows × 43 columns
[Figure: distribution of SalePrice with a box plot panel; x-axis ranges from 0 to 600k.]
2. Cleaning data
The data description file says that the null values in the following columns indicate the absence of that feature (e.g., no alley access, no pool). Therefore, we will replace them with None.
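A sketch of both steps; dfs (the concatenation of the train and test sets used throughout the cleaning section) and the exact column list are assumptions based on the competition's data_description.txt:

# Combine train and test so cleaning is applied consistently (assumed cell).
dfs = pd.concat([train_df, test_df], ignore_index=True)

# Columns where NaN means the feature is absent, per data_description.txt (assumed list).
none_cols = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
             'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
             'PoolQC', 'Fence', 'MiscFeature']
dfs[none_cols] = dfs[none_cols].fillna('None')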
LotFrontage is the linear feet of street connected to the property. It depends on the neighborhood's zoning rules and the location of the property. Therefore, we impute the missing LotFrontage values using the median LotFrontage for each Neighborhood and LotConfig group.
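A minimal sketch of this grouped-median imputation (the exact cell is not shown; the fallback to the overall median is an added safeguard for groups with no observed values):

# Impute LotFrontage with the median of each (Neighborhood, LotConfig) group.
dfs['LotFrontage'] = dfs.groupby(['Neighborhood', 'LotConfig'])['LotFrontage'].transform(
    lambda x: x.fillna(x.median()))
# Fall back to the overall median for groups with no known LotFrontage at all.
dfs['LotFrontage'] = dfs['LotFrontage'].fillna(dfs['LotFrontage'].median())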
For MasVnrArea, we fill the missing values with the median MasVnrArea value for each MasVnrType category.
# Create a dictionary of the median MasVnrArea values for each MasVnrType group.
masvnr_dict= dfs.groupby('MasVnrType')['MasVnrArea'].median().to_dict()
# Fill Nulls in MasVnrArea column.
dfs['MasVnrArea']= dfs['MasVnrArea'].fillna(dfs['MasVnrType'].map(masvnr_dict))
For MSZoning, we fill the missing values with the most frequent MSZoning value for each Neighborhood group.
# Fill the missing values in the MSZoning column with the most frequent MSZoning value for each Neighborhood group.
dfs['MSZoning']= dfs.groupby(['Neighborhood'])['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
For TotalBsmtSF, BsmtUnfSF, BsmtFullBath, BsmtHalfBath, BsmtFinSF1, and BsmtFinSF2, a null means the property has no basement. Therefore, we set them equal to 0.
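A one-line sketch of this zero-fill, using the basement columns listed above:

# Nulls in the basement columns mean "no basement", so fill them with 0.
bsmt_cols = ['TotalBsmtSF', 'BsmtUnfSF', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2']
dfs[bsmt_cols] = dfs[bsmt_cols].fillna(0)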
For other categorical and numerical columns, we will impute missing values with the mode and median, respectively.
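A sketch of that generic fallback, assuming the same combined dfs (SalePrice is skipped because the test rows lack it by design):

# Impute remaining nulls: mode for categorical columns, median for numeric ones.
for col in dfs.columns[dfs.isna().any()]:
    if col == 'SalePrice':
        continue  # test rows have no SalePrice by design
    if dfs[col].dtype == 'object':
        dfs[col] = dfs[col].fillna(dfs[col].mode()[0])
    else:
        dfs[col] = dfs[col].fillna(dfs[col].median())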
# Check Nulls.
nulls_count(dfs)
2.2 Duplicates
# Check duplicates
print(f"- There are {dfs.duplicated().sum()} duplicate rows in the dataset.")
    feature        skewness  outliers
0   MiscVal           21.96        37
1   PoolArea          16.91        13
3   LowQualFinSF      12.09        35
4   3SsnPorch         11.38        30
5   KitchenAbvGr       4.30         3
9   BsmtHalfBath       3.93         2
11  OpenPorchSF        2.54        96
12  SalePrice          1.88        58
13  WoodDeckSF         1.84        57
14  1stFlrSF           1.47        41
15  LotFrontage        1.44        45
16  BsmtFinSF1         1.43        15
17  MSSubClass         1.38         4
18  GrLivArea          1.27        66
19  TotalBsmtSF        1.16        44
20  BsmtUnfSF          0.92        45
21  2ndFlrSF           0.86         8
22  TotRmsAbvGrd       0.76         5
23  Fireplaces         0.73         2
25  BsmtFullBath       0.63         1
26  OverallCond        0.57         5
27  BedroomAbvGr       0.33         4
28  GarageArea         0.24        40
29  OverallQual        0.20         1
31  FullBath           0.17         1
33  GarageCars        -0.22         2
35  YearBuilt         -0.60         5
Our analysis identified a significant number of outliers in most numeric features. To mitigate potential bias and improve model performance, we explored several outlier-handling techniques. We found that standardization, which rescales features to zero mean and unit variance, yielded the best results based on the public leaderboard score (0.12274).
# Apply the dict1 value mappings (built in an earlier cell) to their columns.
for col, values in dict1.items():
    dfs[col] = dfs[col].replace(values)
After finishing the data cleaning, we return to the training and test sets.
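Presumably via the split_df helper defined earlier (the exact cell is not shown):

# Split the combined DataFrame back into the original train and test sets.
train, test = split_df(dfs)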
3. Exploratory data analysis
plt.title("House price distribution by year built and sold.", fontdict= {'fontsize': 16})
plt.title("House price distribution per Proximity to various conditions.", fontdict= {'fontsize': 16})
plt.title("House price distribution per Rating of basement finished area.", fontdict= {'fontsize': 16})
4. Feature engineering
# One-hot encode the categorical features, dropping the first level of each to avoid redundancy.
dfs = pd.get_dummies(dfs, drop_first=True)
# Remove features with a correlation coefficient above 0.95 to reduce multicollinearity and improve model stability.
corr = dfs.drop('SalePrice', axis=1).corr().abs()
# Mask the lower triangle and diagonal so each feature pair is considered once.
mask = np.tril(np.ones_like(corr, dtype=bool))
tri_df = corr.mask(mask)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]
print(f"We dropped {len(to_drop)} features.", to_drop)
dfs = dfs.drop(to_drop, axis=1)
## Drop low variance features to improve model performance and reduce overfitting.
# Set the variance threshold and keep only the features that exceed it.
sel = VarianceThreshold(threshold=0.005)
features = dfs.drop('SalePrice', axis=1)
sel.fit(features)
dfs = pd.concat([dfs['SalePrice'], features.loc[:, sel.get_support()]], axis=1)
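Feature engineering was applied to the combined DataFrame, so we split it back into the train and test sets once more with the split_df helper (the exact cell is not shown):

# Recover the engineered train and test sets.
train, test = split_df(dfs)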
train.sample(5)
MSSubClass LotFrontage LotArea Alley LotShape LandSlope OverallQual OverallCond MasVnrArea ExterQual ... SaleType_ConLD Sale
5. Modeling
# Store the feature matrix and target variable.
X = train.drop(['SalePrice'], axis=1).values
# Normalize the target variable by taking the logarithm (log1p).
y = np.log1p(train['SalePrice']).values
# Shuffled 8-fold cross-validation (SEED is the notebook's global random seed).
kf = KFold(n_splits=8, shuffle=True, random_state=SEED)
As we said before, many features have high variance and very different scales, so we standardize them.
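The scaling and hold-out split cells are not shown above; here is a minimal sketch of this step, assuming a StandardScaler and an 80/20 split (the names X_train_scaled, X_test_scaled, y_train, and y_test are chosen to match the cells below):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out a validation split (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

# Standardize: fit the scaler on the training split only, then transform both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

The estimator definitions are not shown either; a hedged sketch consistent with the imports above (the hyperparameters are assumptions):

# A Lasso with built-in CV and a stochastic gradient boosting regressor (subsample < 1),
# combined in a voting ensemble. The cell that fits `voting` is not shown in the notebook.
lasso = LassoCV(cv=kf, random_state=SEED)
sgbr = GradientBoostingRegressor(subsample=0.8, random_state=SEED)
voting = VotingRegressor([('lasso', lasso), ('sgbr', sgbr)])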
# fit model
sgbr.fit(X_train_scaled, y_train)
0.921089316513142
0.9020028424435507
# Over/underfitting check
y_pred_train = voting.predict(X_train_scaled)
y_pred = voting.predict(X_test_scaled)
print(f"""R-squared train score: {voting.score(X_train_scaled, y_train)*100: 0.1f}%, test score: {voting.score(X_test_scaled, y_test)*100: 0.1f}%,
and train average error = {mean_squared_error(y_train, y_pred_train, squared=False): 0.3f}, test average error = {mean_squared_error(y_test, y_pred, squared=False): 0.3f}""")
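The submission cell is not shown; a hedged sketch of how the preview below was likely produced (np.expm1 inverts the log1p applied to the target; reusing scaler and voting from above is an assumption):

# Predict on the processed test features, invert the log transform, and build the submission.
test_pred = np.expm1(voting.predict(scaler.transform(test.values)))
submission = pd.DataFrame({'Id': test_df['Id'], 'SalePrice': test_pred})
submission.to_csv('submission.csv', index=False)
submission.head()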
Id SalePrice
0 1461 116981.63
1 1462 163090.13
2 1463 182496.25
3 1464 185388.99
4 1465 198106.75