Professional Documents
Culture Documents
Evan Marie Carr - Python and SKlearn
Evan Marie Carr - Python and SKlearn
Evan Marie Carr - Python and SKlearn
In this assignment, you're going to predict the price of a house using information like its location, area, no. of
rooms etc. You'll use the dataset from the House Prices - Advanced Regression Techniques competition on Kaggle.
We'll follow a step-by-step process to train our model:
As you go through this notebook, you will nd a ??? in certain places. Your job is to replace the ??? with
appropriate code or values, to ensure that the notebook runs properly end-to-end and your machine learning
model is trained properly without errors.
Guidelines
1. Make sure to run all the code cells in order. Otherwise, you may get errors like NameError for unde ned
variables.
2. Do not change variable names, delete cells, or disturb other existing code. It may cause problems during
evaluation.
3. In some cases, you may need to add some code cells or new statements before or after the line of code
containing the ???.
4. Since you'll be using a temporary online service for code execution, save your work by running
jovian.commit at regular intervals.
5. Review the "Evaluation Criteria" for the assignment carefully and make sure your submission meets all the
criteria.
6. Questions marked (Optional) will not be considered for evaluation and can be skipped. They are for your
learning.
7. It's okay to ask for help & discuss ideas on the community forum, but please don't post full working code, to
give everyone an opportunity to solve the assignment on their own.
Important Links:
Option 2: Running on your computer locally: To run the code on your computer locally, you'll need to set up
Python, download the notebook and install the required libraries. Click the Run button at the top of this page,
select the Run Locally option, and follow the instructions.
Saving your work: You can save a snapshot of the assignment to your Jovian pro le, so that you can access it
later and continue your work. Keep saving your work by running jovian.commit from time to time.
import jovian
# key: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmcmVzaCI6ZmFsc2UsImlhdCI6MTY2MzAwNjYxNSw
jovian.commit(project='python-sklearn-assignment', privacy='secret')
'https://jovian.ai/evanmarie/python-sklearn-assignment'
!pip install numpy pandas matplotlib seaborn plotly opendatasets jovian --quiet
dataset_url = 'https://github.com/JovianML/opendatasets/raw/master/data/house-prices-ad
We'll use the urlretrieve function from the module urllib.request to dowload the dataset.
urlretrieve(dataset_url, 'house-prices.zip')
with ZipFile('house-prices.zip') as f:
f.extractall(path='house-prices')
The dataset is extracted to the folder house-prices . Let's view the contents of the folder using the os module.
import os
data_dir = 'house-prices'
os.listdir(data_dir)
Use the "File" > "Open" menu option to browse the contents of each le. You can also check out the dataset
description on Kaggle to learn more.
We'll use the data in the le train.csv for training our model. We can load the for processing using the Pandas
library.
import pandas as pd
pd.options.display.max_columns = 200
pd.options.display.max_rows = 200
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode()
'house-prices/train.csv'
QUESTION 1: Load the data from the le train.csv into a Pandas data frame.
prices_df = pd.read_csv(train_csv_path)
prices_df
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotCon
... ... ... ... ... ... ... ... ... ... ... .
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub Insid
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub Insid
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub Insid
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub Insid
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub Insid
prices_df.sample(5)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotCon
1411 1412 50 RL 80.0 9600 Pave NaN Reg Lvl AllPub Insid
872 873 20 RL 74.0 8892 Pave NaN Reg Lvl AllPub Corne
286 287 50 RL 77.0 9786 Pave NaN IR1 Bnk AllPub Insid
173 174 20 RL 80.0 10197 Pave NaN IR1 Lvl AllPub Insid
Let's explore the columns and data types within the dataset.
prices_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
QUESTION 2: How many rows and columns does the dataset contain?
n_rows = 1460
n_cols = 81
(OPTIONAL) QUESTION: Before training the model, you may want to explore and visualize data from the
various columns within the dataset, and study their relationship with the price of the house (using
scatter plot and correlations). Create some graphs and summarize your insights using the empty cells
below.
prices_df.columns
style_price_avg = prices_df.groupby('HouseStyle')['SalePrice'].mean().sort_values(ascen
style_price_avg
HouseStyle
2.5Fin 220000.000000
2Story 210051.764045
1Story 175985.477961
SLvl 166703.384615
2.5Unf 157354.545455
1.5Fin 143116.740260
SFoyer 135074.486486
1.5Unf 110150.000000
Name: SalePrice, dtype: float64
house_style_avg_prices = {
'2.5Fin': 220000.0,
'2Story': 210051.0,
'1Story': 175985.0,
'1.5Fin': 166703.0,
'SFoyer': 135074.0,
'SLvl': 166703.0,
'1.5Unf': 110150.0,
'2.5Unf': 128000.0,
'2.5Unf': 157354.0
}
fig_01 = px.bar(x = house_style_avg_prices.keys(), y = house_style_avg_prices.values())
fig_01.update_layout(
font_family="Verdana",
font_color="white",
showlegend = False,
paper_bgcolor='#333333',
font_size=14,
title='House Style and Average Price',
xaxis_title = 'House Style',
yaxis_title = 'Average Cost',
legend_title = 'Country',
plot_bgcolor = 'purple')
fig_01.show()
lot_area_categories = {
'very small': 4999, # 1200 - 4999
'small': 5000, # 5000 - 14999
'medium': 10000, # 15000 - 29999
'large': 20000, # 30000 - 59999
'very large': 210000 # 60000 - 225000
}
prices_df['LotAreaCategory'] = prices_df['LotArea'].apply(lambda x: 'very small' if x <
lot_area_avg_prices = prices_df.groupby('LotAreaCategory')['SalePrice'].mean().sort_val
prices_df[['LotArea', 'LotAreaCategory']].sample(10)
LotArea LotAreaCategory
42 9180 medium
prices_df['LotAreaCategory'].value_counts()
medium 695
large 565
very small 142
very large 53
small 5
Name: LotAreaCategory, dtype: int64
lot_area_avg_prices
LotAreaCategory
very large 253926.283019
large 213890.171681
medium 156540.615827
very small 144731.535211
small 98260.000000
Name: SalePrice, dtype: float64
fig_02.update_layout(
font_family="Verdana",
font_color="white",
showlegend = False,
paper_bgcolor='#333333',
font_size=14,
xaxis_title = 'Lot Area Category',
yaxis_title = 'House Sale Price',
title = 'Lot Size and Average Price',
plot_bgcolor = 'aquamarine')
fig_02.show()
fig_03.update_layout(yaxis_title='Sale Price',
xaxis_title='Year Home was Built',
title='Sale Price, Year Built, and Overall Condition',
legend_title = 'Overall Condition',
font_family="Verdana",
font_color="white",
legend_font_color = 'black',
legend_bgcolor='#72F2EB',
legend_borderwidth=2,
paper_bgcolor='#333333',
plot_bgcolor='#666666',
legend_bordercolor='#0A7373',
font_size=14)
fig_03.update_traces(marker=dict(
size = 8,
opacity = 0.8,
line=dict(width=0.5,
color='black')),
selector=dict(mode='markers'))
fig_03.show()
import jovian
jovian.commit()
'https://jovian.ai/evanmarie/python-sklearn-assignment'
1. Identify the input and target column(s) for training the model.
The rst column Id is a unique ID for each house and isn't useful for training the model.
The last column SalePrice contains the value we need to predict i.e. it's the target column.
Data from all the other columns (except the rst and the last column) can be used as inputs to the model.
prices_df
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotCon
... ... ... ... ... ... ... ... ... ... ... .
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub Insid
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub Insid
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub Insid
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub Insid
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub Insid
QUESTION 3: Create a list input_cols of column names containing data that can be used as input to
train the model, and identify the target column as the variable target_col .
# Identify the name of the target column (a single string, not a list)
target_col = "SalePrice"
print(list(input_cols))
len(input_cols)
79
print(target_col)
SalePrice
Make sure that the Id and SalePrice columns are not included in input_cols .
Now that we've identi ed the input and target columns, we can separate input & target data.
inputs_df = prices_df[input_cols].copy()
targets = prices_df[target_col]
inputs_df
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotCon g Lan
... ... ... ... ... ... ... ... ... ... ...
0 208500
1 181500
2 223500
3 140000
4 250000
...
1455 175000
1456 210000
1457 266500
1458 142125
1459 147500
Name: SalePrice, Length: 1460, dtype: int64
jovian.commit()
'https://jovian.ai/evanmarie/python-sklearn-assignment'
prices_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 82 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
81 LotAreaCategory 1460 non-null object
dtypes: float64(3), int64(35), object(44)
memory usage: 935.4+ KB
QUESTION 4: Crate two lists numeric_cols and categorical_cols containing names of numeric
and categorical input columns within the dataframe respectively. Numeric columns have data types
int64 and float64 , whereas categorical columns have the data type object .
import numpy as np
categorical_cols = inputs_df.select_dtypes('object').columns.tolist()
print(list(numeric_cols))
print(list(categorical_cols))
jovian.commit()
'https://jovian.ai/evanmarie/python-sklearn-assignment'
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]
LotFrontage 259
GarageYrBlt 81
MasVnrArea 8
dtype: int64
Machine learning models can't work with missing data. The process of lling missing values is called imputation.
There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the
average value in the column using the SimpleImputer class from sklearn.impute .
QUESTION 5: Impute ( ll) missing values in the numeric columns of inputs_df using a
SimpleImputer .
SimpleImputer()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SimpleImputer
SimpleImputer()
After imputation, none of the numeric columns should contain any missing values.
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0] # should be an empty list
jovian.commit()
'https://jovian.ai/evanmarie/python-sklearn-assignment'
max 190.0 313.0 215245.0 10.0 9.0 2010.0 2010.0 1600.0 564
A good practice is to scale numeric features to a small range of values e.g. (0, 1). Scaling numeric features
ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also
work better in practice with smaller numbers.
QUESTION 6: Scale numeric values to the (0, 1) range using MinMaxScaler from
sklearn.preprocessing .
MinMaxScaler()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
MinMaxScaler
MinMaxScaler()
After scaling, the ranges of all numeric columns should be (0, 1).
inputs_df[numeric_cols].describe().loc[['min', 'max']]
'https://jovian.ai/evanmarie/python-sklearn-assignment'
inputs_df[categorical_cols].nunique().sort_values(ascending=False)
Neighborhood 25
Exterior2nd 16
Exterior1st 15
SaleType 9
Condition1 9
Condition2 8
HouseStyle 8
RoofMatl 8
Functional 7
BsmtFinType2 6
Heating 6
RoofStyle 6
SaleCondition 6
BsmtFinType1 6
GarageType 6
Foundation 6
Electrical 5
FireplaceQu 5
HeatingQC 5
GarageQual 5
GarageCond 5
MSZoning 5
LotConfig 5
ExterCond 5
BldgType 5
BsmtExposure 4
MiscFeature 4
Fence 4
LotShape 4
LandContour 4
BsmtCond 4
KitchenQual 4
MasVnrType 4
ExterQual 4
BsmtQual 4
LandSlope 3
GarageFinish 3
PavedDrive 3
PoolQC 3
Utilities 2
CentralAir 2
Street 2
Alley 2
dtype: int64
Since machine learning models can only be trained with numeric data, we need to convert categorical data to
numbers. A common technique is to use one-hot encoding for categorical columns.
One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
QUESTION 7: Encode categorical columns in the dataset as one-hot vectors using OneHotEncoder
from sklearn.preprocessing . Add a new binary (0/1) column for each category
OneHotEncoder(handle_unknown='ignore', sparse=False)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
OneHotEncoder
OneHotEncoder(handle_unknown='ignore', sparse=False)
/opt/conda/lib/python3.9/site-packages/pandas/core/frame.py:3678: PerformanceWarning:
inputs_df
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotCon g La
... ... ... ... ... ... ... ... ... ... ...
1455 0.235294 RL 0.140411 0.030929 Pave NaN Reg Lvl AllPub Inside
1456 0.000000 RL 0.219178 0.055505 Pave NaN Reg Lvl AllPub Inside
1457 0.294118 RL 0.154110 0.036187 Pave NaN Reg Lvl AllPub Inside
1458 0.000000 RL 0.160959 0.039342 Pave NaN Reg Lvl AllPub Inside
1459 0.000000 RL 0.184932 0.040370 Pave NaN Reg Lvl AllPub Inside
jovian.commit()
train_inputs
1023 0.588235 0.075342 0.008797 0.666667 0.500 0.963768 0.933333 0.008750 0.00
810 0.000000 0.195205 0.041319 0.555556 0.625 0.739130 0.816667 0.061875 0.11
1384 0.176471 0.133562 0.036271 0.555556 0.500 0.485507 0.000000 0.000000 0.03
626 0.000000 0.167979 0.051611 0.444444 0.500 0.637681 0.466667 0.000000 0.00
813 0.000000 0.184932 0.039496 0.555556 0.625 0.623188 0.133333 0.151875 0.10
1095 0.000000 0.195205 0.037472 0.555556 0.500 0.971014 0.933333 0.000000 0.00
1130 0.176471 0.150685 0.030400 0.333333 0.250 0.405797 0.000000 0.000000 0.11
1294 0.000000 0.133562 0.032120 0.444444 0.750 0.601449 0.666667 0.000000 0.02
860 0.176471 0.116438 0.029643 0.666667 0.875 0.333333 0.800000 0.000000 0.00
1126 0.588235 0.109589 0.011143 0.666667 0.500 0.978261 0.950000 0.081250 0.00
train_targets
1023 191000
810 181000
1384 105000
626 139900
813 157900
...
1095 176432
1130 135000
1294 115000
860 189950
1126 174000
Name: SalePrice, Length: 1095, dtype: int64
val_inputs
892 0.000000 0.167808 0.033252 0.555556 0.875 0.659420 0.883333 0.000000 0.11
1105 0.235294 0.263699 0.051209 0.777778 0.500 0.884058 0.750000 0.226250 0.18
413 0.058824 0.119863 0.035804 0.444444 0.625 0.398551 0.000000 0.000000 0.00
522 0.176471 0.099315 0.017294 0.555556 0.750 0.543478 0.000000 0.000000 0.07
1036 0.000000 0.232877 0.054210 0.888889 0.500 0.978261 0.966667 0.043750 0.18
988 0.235294 0.167979 0.050228 0.555556 0.625 0.753623 0.433333 0.186250 0.02
243 0.823529 0.184932 0.044226 0.555556 0.625 0.782609 0.500000 0.000000 0.00
1342 0.235294 0.167979 0.037743 0.777778 0.500 0.942029 0.866667 0.093125 0.00
1057 0.235294 0.167979 0.133955 0.666667 0.625 0.884058 0.733333 0.000000 0.10
1418 0.000000 0.171233 0.036944 0.444444 0.500 0.659420 0.216667 0.000000 0.00
val_targets
892 154500
1105 325000
413 115000
522 159000
1036 315500
...
988 195000
243 120000
1342 228500
1057 248000
1418 124000
Name: SalePrice, Length: 365, dtype: int64
jovian.commit()
'https://jovian.ai/evanmarie/python-sklearn-assignment'
Instead, we'll use Ridge Regression, a variant of linear regression that uses a technique called L2 regularization to
introduce another loss term that forces the model to generalize better. Learn more about ridge regression here:
https://www.youtube.com/watch?v=Q81RR3yKn30
QUESTION 8: Create and train a linear regression model using the Ridge class from
sklearn.linear_model .
Ridge()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Ridge
Ridge()
model.fit uses the following strategy for training the model (source):
3. We compare the model's predictions with the actual targets using the loss function.
4. We use an optimization technique (like least squares, gradient descent etc.) to reduce the loss by adjusting
the weights & biases of the model
5. We repeat steps 1 to 4 till the predictions from the model are good enough.
jovian.commit()
'https://jovian.ai/evanmarie/python-sklearn-assignment'
QUESTION 9: Generate predictions and compute the RMSE loss for the training and validation sets.
Hint: Use the mean_squared_error with the argument squared=False to compute RMSE loss.
train_preds = model.predict(train_inputs)
train_preds
val_preds = model.predict(val_inputs)
val_preds
Feature Importance
Let's look at the weights assigned to different columns, to gure out which columns in the dataset are the most
important.
QUESTION 10: Identify the weights (or coe cients) assigned to for different features by the model.
weights = model.coef_.tolist()
weights_df = pd.DataFrame({
'columns': train_inputs.columns,
'weight': weights
}).sort_values('weight', ascending=False)
weights_df
columns weight
15 GrLivArea 74388.503609
13 2ndFlrSF 62433.996447
3 OverallQual 62055.123269
12 1stFlrSF 60726.614752
21 KitchenAbvGr -26474.353422
Can you tell which columns have the greatest impact on the price of the house?
Making Predictions
The model can be used to make predictions on new inputs using the following helper function:
def predict_input(single_input):
predict_input
input_df = pd.DataFrame([single_input])
input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
input_df[encoded_cols] = encoder.transform(input_df[categorical_cols].values)
X_input = input_df[numeric_cols + encoded_cols]
return model.predict(X_input)[0]
predicted_price = predict_input(sample_input)
/opt/conda/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning:
X does not have valid feature names, but OneHotEncoder was fitted with feature names
/opt/conda/lib/python3.9/site-packages/pandas/core/frame.py:3678: PerformanceWarning:
Change the values in sample_input above and observe the effects on the predicted price.
import joblib
house_price_predictor = {
'model': model,
'imputer': imputer,
'scaler': scaler,
'encoder': encoder,
'input_cols': input_cols,
'target_col': target_col,
'numeric_cols': numeric_cols,
'categorical_cols': categorical_cols,
'encoded_cols': encoded_cols
}
joblib.dump(house_price_predictor, 'house_price_predictor.joblib')
['house_price_predictor.joblib']
Congratulations on training and evaluating your rst machine learning model using scikit-learn ! Let's save
our work before continuing. We'll include the saved model as an output.
jovian.commit(outputs=['house_price_predictor.joblib'])
'https://jovian.ai/evanmarie/python-sklearn-assignment'
Make Submission
To make a submission, just execute the following cell:
jovian.submit('zerotogbms-a1')
You can also submit your Jovian notebook link on the assignment page: https://jovian.ai/learn/machine-learning-
with-python-zero-to-gbms/assignment/assignment-1-train-your- rst-ml-model
Make sure to review the evaluation criteria carefully. You can make any number of submissions, and only your nal
submission will be evalauted.