Multiple Linear Regression
In [3]:
from sklearn.datasets import fetch_california_housing
In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
In [5]:
california=fetch_california_housing()
In [6]:
california
In [7]:
california.keys()
In [8]:
print(california.DESCR)
.. _california_housing_dataset:
https://github.com/tapanpati001/multiplrlinearegression/blob/main/MultipleRegression.ipynb 3/14
4/7/23, 7:46 PM multiplrlinearegression/MultipleRegression.ipynb at main · tapanpati001/multiplrlinearegression
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
In [9]:
california.data
In [10]:
california.data.shape
Out[10]: (20640, 8)
In [11]:
california.target_names
Out[11]: ['MedHouseVal']
In [12]:
california.feature_names
Out[12]: ['MedInc',
'HouseAge',
'AveRooms',
'AveBedrms',
'Population',
'AveOccup',
'Latitude',
'Longitude']
In [13]:
california.target
In [14]:
## Let's prepare the dataset
dataset=pd.DataFrame(california.data,columns=california.feature_names)
dataset
In [15]:
dataset['Price']=california.target
In [16]:
dataset.head()
Out[16]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Price
In [17]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 Price 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
In [18]:
dataset.describe()
In [19]:
dataset.isnull().sum()
Out[19]: MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
Price 0
dtype: int64
In [20]:
dataset.corr()
In [21]:
import seaborn as sns
sns.heatmap(dataset.corr(),annot=True)
In [22]:
dataset.head()
Out[22]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Price
In [23]:
##independent and dependent features
X=dataset.iloc[:,:-1]#independent features
Y=dataset.iloc[:,-1]#dependent feature
In [24]:
X.head()
In [25]:
Y
Out[25]: 0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
...
20635 0.781
20636 0.771
20637 0.923
20638 0.847
20639 0.894
Name: Price, Length: 20640, dtype: float64
In [26]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.33,random_state=10)
In [27]:
X_train.shape,Y_train.shape,X_test.shape,Y_test.shape
In [28]:
from sklearn.preprocessing import StandardScaler
In [30]:
scaler=StandardScaler()
In [31]:
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
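The scaler is fitted on the training split only, and the same learned mean and standard deviation are then applied to the test split. A minimal pure-Python sketch of that idea for a single feature column is given here; the helper names and the toy values are made up for illustration, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature column."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]


# Toy values, not taken from the California housing data.
train_column = [2.0, 4.0, 6.0, 8.0]
test_column = [5.0, 9.0]

# Statistics come from the training column only ...
mean, std = fit_standardizer(train_column)

# ... and are reused unchanged for the test column.
train_scaled = standardize(train_column, mean, std)
test_scaled = standardize(test_column, mean, std)
```

After this step the training column has mean 0 and standard deviation 1, while the test column generally does not, because it was scaled with the training statistics rather than its own.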
In [32]:
X_train_scaled
In [33]:
X_test_scaled
Model Training
In [34]:
from sklearn.linear_model import LinearRegression
In [35]:
regressor=LinearRegression()
In [36]:
regressor
Out[36]: LinearRegression()
In [37]:
regressor.fit(X_train_scaled,Y_train)
Out[37]: LinearRegression()
In [38]:
##slopes of 8 features
regressor.coef_
In [39]:
regressor.intercept_
Out[39]: 2.0634768086491184
In [40]:
##prediction
Y_pred_test=regressor.predict(X_test_scaled)
In [41]:
Y_pred_test
In [42]:
from sklearn.metrics import mean_squared_error,mean_absolute_error
In [43]:
mse=mean_squared_error(Y_test,Y_pred_test)
mae=mean_absolute_error(Y_test,Y_pred_test)
rmse=np.sqrt(mse)
print(mse)
print(mae)
print(rmse)
0.5522332399363619
0.537105694300796
0.7431239734636219
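The three numbers printed above are the mean squared error, the mean absolute error, and the root mean squared error. As a rough sketch of what these metrics compute, the function below evaluates them by hand on a few made-up values; `regression_errors` is a hypothetical helper, not part of scikit-learn.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MSE, MAE, RMSE) for two equally long sequences of numbers."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / len(residuals)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(mse)  # RMSE is the square root of MSE
    return mse, mae, rmse


# Toy values, not taken from the California housing data.
toy_true = [3.0, 1.0, 2.0]
toy_pred = [2.5, 1.5, 2.0]
toy_mse, toy_mae, toy_rmse = regression_errors(toy_true, toy_pred)
```

Because the target is expressed in units of $100,000, the MAE and RMSE are in those units too, whereas the MSE is in squared units.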
In [44]:
## R-squared and adjusted R-squared
from sklearn.metrics import r2_score
score=r2_score(Y_test,Y_pred_test)
print(score)
0.593595852643664
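The score above is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small hand-rolled sketch on made-up values follows; `r_squared` is a hypothetical helper written for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, not taken from the California housing data.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.1, 1.9, 3.2, 3.8]
toy_r2 = r_squared(toy_true, toy_pred)

# A model that always predicts the mean of the true values scores 0.
baseline_r2 = r_squared(toy_true, [2.5, 2.5, 2.5, 2.5])
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean.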
In [45]:
#display adjusted R-squared
1 - (1-score)*(len(Y_test)-1)/(len(Y_test)-X_test.shape[1]-1)
Out[45]: 0.5931179409607519
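The expression in the cell above is the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), with n test samples and p features. The sketch below wraps it in a hypothetical helper and plugs in the numbers from this notebook. The test-set size of 6812 is an assumption, derived from rounding 0.33 × 20640 up to the next integer.

```python
import math


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalises R-squared for the number of features."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


r2 = 0.593595852643664            # score printed by r2_score above
n_total = 20640                   # rows in the dataset
n_test = math.ceil(0.33 * n_total)  # assumed test-set size (6812)
n_features = 8                    # columns in X

adj = adjusted_r2(r2, n_test, n_features)
```

With several thousand samples and only eight features, the penalty is tiny, which is why the adjusted value differs from the plain score only in the fourth decimal place.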
Pickling
Python's pickle module is used for serialising and de-serialising a Python object structure.
Most Python objects can be pickled so that they can be saved on disk. Pickle first
serialises the object and then writes it to a file. Pickling is a way to convert a
Python object (list, tuple, etc.) into a byte stream. The idea is that this byte stream
contains all the information necessary to reconstruct the object in another Python script.
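A minimal sketch of that round trip is given here. To keep it runnable without scikit-learn, it pickles a plain dictionary as a stand-in for the trained model; the dictionary and its keys are made up for illustration, with the intercept value copied from the output above.

```python
import pickle

# Stand-in object; a fitted estimator could be pickled in the same way.
model_summary = {"intercept": 2.0634768086491184, "n_features": 8}

# Serialise the object into a byte stream ...
payload = pickle.dumps(model_summary)

# ... and reconstruct an equal but separate object from those bytes.
restored = pickle.loads(payload)
```

To save to disk instead of memory, `pickle.dump(obj, f)` and `pickle.load(f)` do the same thing with a file opened in binary mode (`"wb"` and `"rb"`).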
In [46]:
import pickle