Linear Regression and SVR

Importing libraries
In [1]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Data ingestion
In [2]: df = pd.read_csv("C:\\Users\\Rajesh Singh\\Downloads\\household_power_consumption\\
In [3]: df
Out[3]: Date Time Global_active_power Global_reactive_power Voltage Global_intens
0 16/12/2006 17:24:00 4.216 0.418 234.840 18.4
1 16/12/2006 17:25:00 5.360 0.436 233.630 23.0
2 16/12/2006 17:26:00 5.374 0.498 233.290 23.0
3 16/12/2006 17:27:00 5.388 0.502 233.740 23.0
4 16/12/2006 17:28:00 3.666 0.528 235.680 15.8
... ... ... ... ... ...
2075254 26/11/2010 20:58:00 0.946 0.000 240.430 4.0
2075255 26/11/2010 20:59:00 0.944 0.000 240.000 4.0
2075256 26/11/2010 21:00:00 0.938 0.000 239.820 3.8
2075257 26/11/2010 21:01:00 0.934 0.000 239.700 3.8
2075258 26/11/2010 21:02:00 0.932 0.000 239.550 3.8
2075259 rows × 9 columns
EDA
In [4]: df.shape
(2075259, 9)
Out[4]:
In [5]: ## selecting 30000 sample data
df1= df.sample(30000)
In [6]: df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1032508 to 1532620
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 30000 non-null object
1 Time 30000 non-null object
2 Global_active_power 30000 non-null object
3 Global_reactive_power 30000 non-null object
4 Voltage 30000 non-null object
5 Global_intensity 30000 non-null object
6 Sub_metering_1 30000 non-null object
8 Sub_metering_3 29619 non-null float64
dtypes: float64(1), object(8)
memory usage: 2.3+ MB
Feature Information
Global_active_power: household global minute-averaged active power (in kilowatt)
global_reactive_power: household global minute-averaged reactive power (in
kilowatt)
voltage: minute-averaged voltage (in volt)
global_intensity: household global minute-averaged current intensity (in ampere)
date: Date in format dd/mm/yyyy
time: time in format hh:mm:ss
sub_metering_1, sub_metering_1, and sub_metering_1 are the meter readings
separating date, month and year
In [7]: df1['Date']= pd.to_datetime(df1['Date'])
In [8]: df1['date']= df1['Date'].dt.day
In [9]: df1['month']= df1['Date'].dt.month
In [10]: df1['year']= df1['Date'].dt.year
separating hours, minutes and seconds
In [12]: df1['hour'] = pd.to_datetime(df1['Time'], format='%H:%M:%S').dt.hour
df1['Minutes'] = pd.to_datetime(df1['Time'], format='%H:%M:%S').dt.minute
replacing special characters from data
In [14]: df1.replace('?',np.nan, inplace =True)
df1.replace(',',np.nan, inplace =True)
df1.replace(' ', np.nan, inplace =True)
In [15]: df1.info()
--- ------ -------------- -----
0 Date 30000 non-null datetime64[ns]
2 Global_active_power 29619 non-null object
3 Global_reactive_power 29619 non-null object
4 Voltage 29619 non-null object
5 Global_intensity 29619 non-null object
9 date 30000 non-null int64
10 month 30000 non-null int64
11 year 30000 non-null int64
12 hour 30000 non-null int64
13 Minutes 30000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(5), object(7)
Converting the datatypes
In [16]: df1['Global_active_power']= df1['Global_active_power'].astype(float)
df1['Global_reactive_power']= df1['Global_reactive_power'].astype(float)
df1['Voltage']= df1['Voltage'].astype(float)
df1['Global_intensity']= df1['Global_intensity'].astype(float)
df1['Sub_metering_1']= df1['Sub_metering_1'].astype(float)
df1['Sub_metering_2']= df1['Sub_metering_2'].astype(float)
Creating total metering column
In [17]: df1['total_metering']= df1['Sub_metering_1']+df1['Sub_metering_2']+df1['Sub_meterin
In [18]: df1.info()
--- ------ -------------- -----
0 Date 30000 non-null datetime64[ns]
2 Global_active_power 29619 non-null float64
3 Global_reactive_power 29619 non-null float64
4 Voltage 29619 non-null float64
5 Global_intensity 29619 non-null float64
14 total_metering 29619 non-null float64
dtypes: datetime64[ns](1), float64(8), int64(5), object(1)
Dropping columns not necessary

In [19]: dff= df1.drop(columns=['Date','Time','Sub_metering_1','Sub_metering_2','Sub_meterin
In [20]: dff.info()
--- ------ -------------- -----
0 Global_active_power 29619 non-null float64
1 Global_reactive_power 29619 non-null float64
2 Voltage 29619 non-null float64
3 Global_intensity 29619 non-null float64
9 total_metering 29619 non-null float64
dtypes: float64(5), int64(5)
memory usage: 2.5 MB
In [21]: dff.describe()
Out[21]: Global_active_power Global_reactive_power Voltage Global_intensity date
count 29619.000000 29619.000000 29619.000000 29619.00000 30000.000000 3
mean 1.101824 0.124000 240.810522 4.66880 15.722133
std 1.066910 0.113641 3.254252 4.48595 8.818710
min 0.078000 0.000000 225.450000 0.20000 1.000000
25% 0.310000 0.048000 238.950000 1.40000 8.000000
50% 0.620000 0.100000 240.980000 2.80000 16.000000
75% 1.534000 0.196000 242.860000 6.40000 23.000000
max 8.702000 1.100000 253.140000 38.00000 31.000000
Primary Observation:
There are null values in Global_active_power, Global_reactive_power, Voltage,
Global_intensity and total_metering.
And Global_active_power, Global_reactive_power,
Global_intensity and total_metering features seems to have outliers and hence data is
skewed.
Global_active_power : has observed mean power of 1.087975 and standard deviation
1.052817, minimum power 0.078000, range og 25th percentile to 75th percentile is
[0.308000 to 1.530000], but maximum power 10.670000.
This clearly shows data is
skewed and presence of outliers.
Global_reactive_power : has observed mean power of 0.123332 and standard deviation
0.113058, minimum power 0, range og 25th percentile to 75th percentile is [0.048000 to
0.194000], but maximum power 1.186000.
This clearly shows data is skewed and
presence of outliers.
Voltage : has observed mean voltage of 240.853321 and standard deviation 3.241289,
minimum voltage 225.140000, range og 25th percentile to 75th percentile is
[238.990000 to 242.910000], and maximum voltage 253.420000.
Global_intensity : has observed mean current intensity of 4.612513 and standard
deviation 4.426749, minimum is 0.200000, range og 25th percentile to 75th percentile is
[1.400000 to 6.400000], but maximum value is 46.400000.
This clearly shows data is
skewed and presence of outliers.
total_metering : has observed mean total metering is 8.815839 and standard deviation
12.710086, minimum reading 0, range og 25th percentile to 75th percentile is [0 to 18],
but maximum reading is 126.
This is skewed ans have outliers.
In [22]: dff.duplicated() ###checking presence of duplicate
1032508 False
Out[22]:
1794736 False
590021 False
1946737 False
691314 False
...
340664 False
2063841 False
1437739 False
1027465 False
1532620 False
Length: 30000, dtype: bool
In [23]: dff.skew()
Global_active_power 1.805078
Out[23]:
Global_reactive_power 1.272971
Voltage -0.329061
Global_intensity 1.868954
date -0.000066
month -0.000640
year -0.015933
hour 0.002056
Minutes 0.003394
total_metering 2.244892
dtype: float64
Handling null values{replacing with mean when features are without outliers and replacing with median
when features are with outliers}
In [24]: dff['Global_active_power'] = dff['Global_active_power'].fillna(dff['Global_active_p

dff['Global_reactive_power'] = dff['Global_reactive_power'].fillna(dff['Global_reac
dff['Voltage'] = dff['Voltage'].fillna(dff['Voltage'].mean())
dff['Global_intensity'] = dff['Global_intensity'].fillna(dff['Global_intensity'].me
dff['total_metering'] = dff['total_metering'].fillna(dff['total_metering'].median()
In [25]: dff.head()
Out[25]: Global_active_power Global_reactive_power Voltage Global_intensity date month ye
1032508 1.838 0.128 242.58 7.6 12 2 200
1794736 1.756 0.190 241.51 7.2 16 5 20
590021 1.688 0.054 239.15 7.2 30 1 200
1946737 1.600 0.308 239.54 6.8 29 8 20
691314 1.368 0.000 239.24 5.6 4 9 200

In [26]: plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis', fontsize=20, fontweight='bold', alpha=0.8, y=1

for i in range(0, len(dff.columns)):
plt.subplot(5, 3, i+1)
sns.kdeplot(x=dff[dff.columns[i]],shade=True, color='b')
plt.xlabel(dff.columns[i])
plt.tight_layout()
In [27]: plt.figure(figsize=(15, 15))
plt.suptitle('Relation of features with total metering', fontsize=20)
plt.subplot(5, 3, i+1)
sns.scatterplot(data = dff, x=dff[dff.columns[i]], y = dff['total_metering'], c

plt.xlabel(dff.columns[i])
plt.tight_layout()
In [28]: sns.lineplot(x='hour', y='total_metering', data=dff, color = 'blue')
<AxesSubplot:xlabel='hour', ylabel='total_metering'>
Out[28]:
Observation:
total reading incraeses in the morning around 7 am then with a dip aroung 3 or 4 of
afternoon, it again rises in evening time then dip after 8 or 9 of the night.
In [29]: sns.lineplot(x='month', y='total_metering', data=dff, color = 'blue')
<AxesSubplot:xlabel='month', ylabel='total_metering'>
Out[29]:
Observation:
In july there is a dip in power consumption.
In [30]: sns.lineplot(x='year', y='total_metering', data=dff, color = 'blue')
<AxesSubplot:xlabel='year', ylabel='total_metering'>
Out[30]:
Observation:
The power consumption has decreased from 2006 onwards.
Checking correlation
In [31]: dff.corr()
Out[31]: Global_active_power Global_reactive_power Voltage Global_intensity
Global_active_power 1.000000 0.262954 -0.398298 0.998943 -0.0
Global_reactive_power 0.262954 1.000000 -0.112563 0.281364 0.0
Voltage -0.398298 -0.112563 1.000000 -0.409977 0.0
Global_intensity 0.998943 0.281364 -0.409977 1.000000 -0.0
date -0.010590 0.004627 0.006404 -0.010543 1.0
month 0.005835 0.013825 0.042346 0.004858 0.0
year -0.041554 0.045964 0.253848 -0.045567 -0.0
hour 0.284929 0.137597 -0.182374 0.285519 0.0
Minutes 0.000974 -0.019742 0.018049 0.000006 -0.0
total_metering 0.846845 0.190819 -0.342922 0.843842 -0.0
In [32]: plt.figure(figsize=(15,15))
sns.heatmap(data=dff.corr(), annot=True);
Observation:
Global_active_power and global_intensity are highly correlated
Handling multicollinearity
In [33]: from statsmodels.stats.outliers_influence import variance_inflation_factor
In [34]: vif_data = pd.DataFrame()
vif_data['VIF'] = [variance_inflation_factor(dff.values,i) for i in range(len(dff.c

vif_data['features'] = dff.columns
vif_data
Out[34]: VIF features
0 1317.390296 Global_active_power
1 2.937605 Global_reactive_power
2 7550.255484 Voltage
3 1337.011908 Global_intensity
4 4.179946 date
5 4.530054 month
6 7638.125596 year
7 4.222324 hour
8 3.889487 Minutes
9 5.372782 total_metering
In [35]: ## 'Global_active_power', 'Voltage', 'Global_intensity', and 'year' has VIF value g

## Dropping 'Global_active power'as both 'global active power' and 'global intensit
## although voltage's VIF is also high but will not drop it as it does not have col
## dropping year feature as high VIF value and also not having this will not affect
dff.drop(columns=['Global_active_power','year'],axis=1, inplace = True)
In [36]: dff.head()
Out[36]: Global_reactive_power Voltage Global_intensity date month hour Minutes total_me
1032508 0.128 242.58 7.6 12 2 17 52
1794736 0.190 241.51 7.2 16 5 1 40
590021 0.054 239.15 7.2 30 1 11 5
1946737 0.308 239.54 6.8 29 8 15 1
691314 0.000 239.24 5.6 4 9 19 18
Checking outliers
plt.suptitle("Outliers Analysis",fontsize = 20, fontweight = 'bold', alpha= 0.8)
plt.subplot(5,3,i+1)
sns.boxplot(dff[dff.columns[i]])
plt.tight_layout()
Importing winsorizer- to handle outliers
In [39]: from feature_engine.outliers.winsorizer import Winsorizer
In [40]: winsorizer = Winsorizer(capping_method = 'iqr', # choose skewed for IQR rule bound
tail = 'both', # cap left, right or both tails
fold = 1.5, # 1.5 times of iqr
variables = ['Global_reactive_power'])
# capping_methods = 'iqr' - 25th quantile & 75th quantile
dff['Global_reactive_power'] = winsorizer.fit_transform(dff[['Global_reactive_power
variables = ['Voltage'])
dff['Voltage'] = winsorizer.fit_transform(dff[['Voltage']])
variables = ['Global_intensity'])
dff['Global_intensity'] = winsorizer.fit_transform(dff[['Global_intensity']])
variables = ['total_metering'])
dff['total_metering'] = winsorizer.fit_transform(dff[['total_metering']])
Checking outliers after outlier treatment
plt.suptitle("Outliers Analysis",fontsize = 20, fontweight = 'bold', alpha= 0.8)
plt.subplot(5,3,i+1)
sns.boxplot(dff[dff.columns[i]])
plt.tight_layout()
Saving the cleaned data
In [45]: dff.to_csv("power_consumption_data.csv")
we will try to upload this dataset in Mongodb
In [46]: import pymongo
In [47]: ## creating connection with mongodb server
client = pymongo.MongoClient("mongodb://raje:mongodb@ac-tl0bnvo-shard-00-00.bkfbasy
db = client.test
In [48]: database = client['power_consumption'] ##creating a database at their in mongodb
In [49]: collection = database["power_consumption_data"]
In [50]: ## first we have to convert this dataframe into dict or josn format as this is the
In [51]: dff_dict = dff.to_dict(orient = 'records')
In [52]: ## now lets insert the data in mongdb
collection.insert_many(dff_dict)
loading the data from mongodb

In [53]: ## here connection is already established so no need to establish connection again,
#client = pymongo.MongoClient("mongodb://raje:mongodb@ac-tl0bnvo-shard-00-00.bkfbas
#db = client.test
#database = client['power_consumption']
#collection = database["power_consumption_data"]
In [54]: ## fetching data from the collection in mongodb{using find will return all the occu
data = pd.DataFrame(list(collection.find()))
In [55]: data
Out[55]: _id Global_reactive_power Voltage Global_intensity date month
0 636e4c605d7ad8332ed99cd5 0.182 241.410 13.4 1 6
1 636e4c605d7ad8332ed99cd7 0.000 238.130 0.6 22 5
2 636e4c605d7ad8332ed99ce3 0.000 239.580 0.8 17 10
3 636e4c605d7ad8332ed99ce6 0.000 233.335 4.6 19 7
4 636e4c605d7ad8332ed99ced 0.176 239.320 6.2 21 4
... ... ... ... ... ... ...
29995 636e4c605d7ad8332eda11c1 0.104 244.810 1.4 1 3
29996 636e4c605d7ad8332eda11cf 0.254 240.480 1.4 22 8
29997 636e4c605d7ad8332eda11d9 0.112 241.310 1.4 26 3
29998 636e4c605d7ad8332eda11e4 0.102 238.750 1.4 28 7
29999 636e4c605d7ad8332eda11f6 0.124 241.800 9.2 29 9
In [56]: #### dropping id column
data.drop(columns=['_id'],axis=1,inplace=True)
In [57]: data.head()
Out[57]: Global_reactive_power Voltage Global_intensity date month hour Minutes total_metering
0 0.182 241.410 13.4 1 6 19 34 18.0
1 0.000 238.130 0.6 22 5 11 17 1.0
2 0.000 239.580 0.8 17 10 16 55 0.0
3 0.000 233.335 4.6 19 7 8 18 17.0
4 0.176 239.320 6.2 21 4 12 30 18.0
splitting the data into dependent and independent features
In [58]: X = data.drop("total_metering", axis=1)
In [59]: X
Out[59]: Global_reactive_power Voltage Global_intensity date month hour Minutes
0 0.182 241.410 13.4 1 6 19 34
1 0.000 238.130 0.6 22 5 11 17
2 0.000 239.580 0.8 17 10 16 55
3 0.000 233.335 4.6 19 7 8 18
4 0.176 239.320 6.2 21 4 12 30
... ... ... ... ... ... ... ...
29995 0.104 244.810 1.4 1 3 13 57
29996 0.254 240.480 1.4 22 8 10 7
29997 0.112 241.310 1.4 26 3 18 5
29998 0.102 238.750 1.4 28 7 1 54
29999 0.124 241.800 9.2 29 9 7 35
In [60]: y = data['total_metering']
In [61]: y.head()
0 18.0
Out[61]:
1 1.0
2 0.0
3 17.0
4 18.0
Name: total_metering, dtype: float64
Splitting the data into train and test data
In [62]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_st
In [63]: X_train
13707 0.098 243.79 1.2 18 8 9 20
10403 0.256 240.06 3.4 4 5 19 23
6673 0.000 245.50 1.0 18 2 4 17
28904 0.274 244.23 4.2 4 9 16 28
2987 0.112 245.65 1.4 18 3 4 14
... ... ... ... ... ... ... ...
28017 0.098 243.68 1.2 2 2 3 9
17728 0.202 239.73 2.4 7 7 23 49
29199 0.180 238.16 6.8 3 9 8 9
7293 0.054 241.76 8.4 31 8 22 22
17673 0.104 239.95 1.4 31 8 3 20
In [64]: X_test
20412 0.000 243.45 1.0 23 12 23 19
1296 0.058 240.72 7.2 11 5 7 57
3906 0.136 242.68 10.4 1 1 4 10
20454 0.078 245.20 1.2 4 9 14 52
5200 0.070 241.62 4.2 10 8 0 2
... ... ... ... ... ... ... ...
7783 0.152 241.96 2.0 25 12 12 53
16480 0.000 243.42 6.2 14 11 22 27
18914 0.170 234.49 13.9 25 3 20 22
3418 0.000 243.03 1.0 10 2 15 45
4855 0.068 240.11 6.0 30 5 20 15
In [65]: y_train
13707 0.0
Out[65]:
10403 1.0
6673 0.0
28904 11.0
2987 0.0
...
28017 0.0
17728 0.0
29199 19.0
7293 19.0
17673 2.0
Name: total_metering, Length: 22500, dtype: float64
Standardising data
In [66]: from sklearn.preprocessing import StandardScaler
In [67]: scaler = StandardScaler()
In [68]: #### Using fit transform to standardise train data
In [69]: X_train = scaler.fit_transform(X_train,)
In [70]: X_train
array([[-0.21494871, 0.93559813, -0.85225693, ..., 0.4274415 ,
Out[70]:
-0.36264487, -0.5475558 ],
[ 1.29044859, -0.26217727, -0.26946362, ..., -0.44053182,
1.08034131, -0.37407043],
[-1.14867616, 1.48471233, -0.90523815, ..., -1.30850514,
-1.08413797, -0.72104116],
...,
[ 0.56633343, -0.87230415, 0.63121697, ..., 0.71676594,
-0.50694349, -1.18366881],
[-0.63417328, 0.28372573, 1.05506665, ..., 0.4274415 ,
1.51323716, -0.43189889],
[-0.15778173, -0.2975004 , -0.79927572, ..., 0.4274415 ,
-1.22843658, -0.5475558 ]])
In [71]: #### using only transform to avoid data leakage
In [72]: X_test = scaler.transform(X_test)
X_test
array([[-1.14867616, 0.82641753, -0.90523815, ..., 1.58473926,
Out[72]:
1.65753578, -0.60538425],
[-0.59606196, -0.05023846, 0.73717939, ..., -0.44053182,
-0.65124211, 1.59209705],
[ 0.14710886, 0.57915559, 1.58487876, ..., -1.59782958,
-1.08413797, -1.12584035],
...,
[ 0.47105512, -2.05081239, 2.51204995, ..., -1.0191807 ,
1.22463993, -0.43189889],
[-1.14867616, 0.69154738, -0.90523815, ..., -1.30850514,
0.50314684, 0.89815559],
[-0.50078365, -0.2461213 , 0.41929213, ..., -0.44053182,
1.22463993, -0.83669807]])
Pickling
In [73]: ## pickling{saving some preprocessing objects/models}

import pickle
# Writing different model files to file
with open('sandardScalar.sav', 'wb') as f:
pickle.dump(scaler,f)
## here we have pickled the standard scaler object
Linear Regression
In [74]: from sklearn.linear_model import LinearRegression
In [75]: ## Creating linear regression model
linear_reg = LinearRegression()
In [76]: ## training the model
linear_reg.fit(X_train, y_train)
LinearRegression()
Out[76]:
In [77]: ## coefficients and intercept of the best fit hyperplane
print('1. the coefficients of independent features: {}'.format(linear_reg.coef_))
print('2. intercept of the best fit hyperplane: {}'.format(linear_reg.intercept_))
1. the coefficients of independent features: [-4.50357114e-01 -1.05664480e-01 9.5

8753698e+00 7.08918901e-03
-3.03128453e-02 -8.95250419e-01 -8.64638801e-02]
2. intercept of the best fit hyperplane: 8.405333333333331
In [78]: ## predicting the test data using the model
y_pred = linear_reg.predict(X_test)
y_pred
array([-1.31742958, 16.20173682, 24.57747909, ..., 31.47360105,
Out[78]:
-0.32246875, 11.67767303])
cost functions
In [79]: from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
In [80]: print("Mean squared error is {}".format(round(mean_squared_error(y_test, y_pred),2)

print("Mean squared error is {}".format(round(mean_absolute_error(y_test, y_pred),2
print("Root Mean squared error is {}".format(round(np.sqrt(mean_squared_error(y_tes
Mean squared error is 38.2
Root Mean squared error is 6.18
Validating the model using assumptions of Linear regression

Linear relationship
Test truth data and predicted data should follow linear relationship
{this is an indiaction
of a good model}
In [81]: plt.scatter(x=y_test, y = y_pred)
plt.xlabel("Test truth data")
plt.ylabel("Predicted data")
Text(0, 0.5, 'Predicted data')

Out[81]:
Residual distribution
Residuals should follow normal distribution
In [82]: residual_linear= y_test-y_pred
residual_linear.head()
20412 1.317430
Out[82]:
1296 2.798263
3906 -24.577479
20454 1.194867
5200 -3.652491
Name: total_metering, dtype: float64
In [83]: sns.displot(x=residual_linear, kind='kde')
<seaborn.axisgrid.FacetGrid at 0x15bb818faf0>
Out[83]:
Uniform distribution
Residuals vs Predictions should follow a uniform dstribution.
In [84]: plt.scatter(x=y_pred, y=residual_linear)

plt.xlabel('Predictions')
plt.ylabel('Residuals')
Text(0, 0.5, 'Residuals')

Out[84]:
Accuracy of the model with train data and with test data
In [85]: linear_reg.score(X_train,y_train)
0.6849425651586472
Out[85]:
In [86]: linear_reg.score(X_test,y_test)
0.690139882731237
Out[86]:
Performance Matrics
R Square and Adjusted R Square values
In [87]: from sklearn.metrics import r2_score
In [88]: r2_score_lr= r2_score(y_test, y_pred)
print("Linear Regression model has {} % accuracy".format(round(r2_score_lr*100,3)))

adjr2_score_lr=1-((1-r2_score_lr)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print("Adjusted R square accuracy is {} percent".format(round(adjr2_score_lr*100,2)
Linear Regression model has 69.014 % accuracy
Adjusted R square accuracy is 68.99 percent
Ridge Regression
In [89]: from sklearn.linear_model import Ridge
In [90]: ## creating ridge regrssion model
ridge_reg= Ridge()
In [91]: ## training the model
ridge_reg.fit(X_train, y_train)
Ridge()
Out[91]:
In [92]: # Printing co-efficients and intercept of best fit hyperplane
print("coefficient of independent feature is {}".format(ridge_reg.coef_))
print("Intercept of best fit hyperplane is {}".format(ridge_reg.intercept_))
coefficient of independent feature is [-4.50239810e-01 -1.05849300e-01 9.58696058

e+00 7.08439131e-03
-3.02969840e-02 -8.95086518e-01 -8.64597295e-02]
Intercept of best fit hyperplane is 8.405333333333331
In [93]: # predicting test data
y_pred_r = ridge_reg.predict(X_test)
y_pred_r
array([-1.31690495, 16.20114673, 24.57627609, ..., 31.47276509,
Out[93]:
-0.32214097, 11.67760058])
In [94]: print("Mean squared error is {}".format(round(mean_squared_error(y_test, y_pred_r),

print("Mean absolute error is {}".format(round(mean_absolute_error(y_test, y_pred_r
Mean absolute error is 4.18
In [95]: ridge_reg.score(X_train,y_train)
0.6849425631543848
Out[95]:
In [96]: ridge_reg.score(X_test,y_test)
0.6901402931343383
Out[96]:
Validating model using performace matrics

In [97]: ridge_r2_score=r2_score(y_test, y_pred_r)
print("Our Ridge regression model has {} % accuracy".format(round(ridge_r2_score*10

ridge_adjr2_score=1-((1-ridge_r2_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1
print("Adjusted R square accuracy is {} percent".format(round(ridge_adjr2_score*100
Our Ridge regression model has 69.014 % accuracy
Lasso Regression
In [98]: from sklearn.linear_model import Lasso
In [99]: # creating Lasso regression model
lasso_reg = Lasso()
lasso_reg
Lasso()
Out[99]:
In [100… ## training the model
lasso_reg.fit(X_train, y_train)
Lasso()
Out[100]:
In [101… # Printing co-efficients and intercept of best fit hyperplane
print("coefficient of independent feature is {}".format(lasso_reg.coef_))
print("Intercept of best fit hyperplane is {}".format(lasso_reg.intercept_))
coefficient of independent feature is [-0. -0. 8.25775306 -0.

-0. -0.
-0. ]
Intercept of best fit hyperplane is 8.405333333333331
In [102… ## predicting the test data
y_pred_lasso = lasso_reg.predict(X_test)
y_pred_lasso
array([ 0.93010027, 14.4927787 , 21.49287079, ..., 29.14922151,
Out[102]:
0.93010027, 11.86774416])
In [103… print("Mean squared error is {}".format(round(mean_squared_error(y_test, y_pred_las

print("Mean absolute error is {}".format(round(mean_absolute_error(y_test, y_pred_l
Mean absolute error is 4.41
In [104… lasso_reg.score(X_train,y_train)
0.6693582767991895
Out[104]:
In [105… lasso_reg.score(X_test,y_test)
0.6743540592319746
Out[105]:
Validating model using performance matrics

In [106… lasso_r2_score=r2_score(y_test, y_pred_lasso)
print("Our Ridge regression model has {} % accuracy".format(round(lasso_r2_score*10

lasso_adjr2_score=1-((1-lasso_r2_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1
print("Adjusted R square accuracy is {} percent".format(round(lasso_adjr2_score*100
Our Ridge regression model has 67.435 % accuracy
Elastic-Net Regression
In [107… from sklearn.linear_model import ElasticNet
In [108… # creating Elastic-Net regression model
elastic_reg=ElasticNet()
elastic_reg
ElasticNet()
Out[108]:
elastic_reg.fit(X_train, y_train)
ElasticNet()
Out[109]:
In [110… # Printing co-efficients and intercept of best fit hyperplane
print("1. Co-efficients of independent features is {}".format(elastic_reg.coef_))
print("2. Intercept of best fit hyper plane is {}".format(elastic_reg.intercept_))
1. Co-efficients of independent features is [ 0. -0.64573766 5.66694515 -

0. 0. 0.
-0. ]
2. Intercept of best fit hyper plane is 8.405333333333331
In [111… elastic_y_pred=elastic_reg.predict(X_test)
elastic_y_pred
array([ 2.7417495 , 12.61532936, 17.01277178, ..., 23.9652694 ,
Out[111]:
2.82884024, 10.9403686 ])
In [112… print("Mean squared error is '{}'".format(round(mean_squared_error(y_test, elastic_

print("Mean absolute error is '{}'".format(round(mean_absolute_error(y_test, elasti
print("Root Mean squared error is '{}'".format(round(np.sqrt(mean_squared_error(y_t
Mean squared error is '50.16'
Mean absolute error is '5.5'
Root Mean squared error is '7.08'
In [113… elastic_reg.score(X_train,y_train)
0.5870637265888694
Out[113]:
In [114… elastic_reg.score(X_test,y_test)
0.5931641182844434
Out[114]:
Validating model using performance matrices

In [115… elastic_reg_r2_score=r2_score(y_test, elastic_y_pred)
print("Our Elastic-Net regression model has {} % accuracy".format(round(elastic_reg

elastic_reg_adj_r2_score=1-((1-elastic_reg_r2_score)*(len(y_test)-1)/(len(y_test)-X
print("Adjusted R square accuracy is {} percent".format(round(elastic_reg_adj_r2_sc
Our Elastic-Net regression model has 59.316 % accuracy
SVR
In [116… from sklearn.svm import SVR
In [117… # creating SVR model
svr = SVR()
svr
SVR()
Out[117]:
svr.fit(X_train, y_train)
SVR()
Out[118]:
In [119… ## predicting the dependent feature value w.r.t. test data
svr_y_pred= svr.predict(X_test)
svr_y_pred
array([-0.21238191, 21.55678922, 20.86542642, ..., 30.91950729,
Out[119]:
0.21280761, 12.93475299])
In [120… print("Mean squared error is '{}'".format(round(mean_squared_error(y_test, svr_y_pr

print("Mean absolute error is '{}'".format(round(mean_absolute_error(y_test, svr_y_
print("Root Mean squared error is '{}'".format(round(np.sqrt(mean_squared_error(y_t
Mean squared error is '33.16'
Mean absolute error is '3.38'
Root Mean squared error is '5.76'
In [121… svr.score(X_train, y_train)
0.7336778695440571
Out[121]:
In [122… svr.score(X_test, y_test)
0.7310080566997736
Out[122]:
Performance Matrix SVR

In [123… svr_r2_score=r2_score(y_test, svr_y_pred)
print("SVR model has {} % accuracy".format(round(svr_r2_score*100,3)))
svr_adj_r2_score=1-((1-svr_r2_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1
print("Adjusted R square accuracy is {} percent".format(round(svr_adj_r2_score*100,
SVR model has 73.101 % accuracy
Apply hyperparameter tuning

In [124… from sklearn.model_selection import GridSearchCV
In [125… model_params = {
'Ridge Regression': {
'model': Ridge(),
'params' : {
'alpha': [1,5,10,20]
},
'Lasso Regression': {
'model': Lasso(),
'params' : {
'alpha': [1,5,10,20]
},
'Elastic-Net Regression' : {
'model': ElasticNet(),
'params': {
'alpha': [1,5,10,20],
'l1_ratio':[0.5,1,1.5,2]
},
'SVR':{
'model': SVR(),
'params':{
'kernel': ['linear', 'poly', 'sigmoid', 'rbf'],
'C':[1,5,10,20]
In [126… model_params.items()
dict_items([('Ridge Regression', {'model': Ridge(), 'params': {'alpha': [1, 5, 10,

Out[126]:
20]}}), ('Lasso Regression', {'model': Lasso(), 'params': {'alpha': [1, 5, 10, 2
0]}}), ('Elastic-Net Regression', {'model': ElasticNet(), 'params': {'alpha': [1,
5, 10, 20], 'l1_ratio': [0.5, 1, 1.5, 2]}}), ('SVR', {'model': SVR(), 'params':
{'kernel': ['linear', 'poly', 'sigmoid', 'rbf'], 'C': [1, 5, 10, 20]}})])
In [127… ##scaling the independent features before fitting it inside grid object{to simplify
X1= scaler.fit_transform(X)
In [128… scores = []
for model_name, mp in model_params.items():
clf = GridSearchCV(mp['model'], mp['params'], cv=5, n_jobs=-1,return_train_sco

clf.fit(X1, y)
scores.append({
'model': model_name,
'best_score': clf.best_score_,
'best_params': clf.best_params_
})
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df
Out[128]: model best_score best_params
0 Ridge Regression 0.686089 {'alpha': 1}
1 Lasso Regression 0.670507 {'alpha': 1}
2 Elastic-Net Regression 0.670507 {'alpha': 1, 'l1_ratio': 1}
3 SVR 0.765643 {'C': 20, 'kernel': 'rbf'}
Conclusion:
The SVR model with 'rbf kernel' is the best model for this household power
consumption data
In [ ]:

Linear Regression and SVR

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Linear Regression and SVR

Uploaded by

Copyright:

Available Formats

Importing libraries

In [1]: import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

In [2]: df = pd.read_csv("C:\\Users\\Rajesh Singh\\Downloads\\household_power_consumption\\

Out[3]: Date Time Global_active_power Global_reactive_power Voltage Global_intens

0 16/12/2006 17:24:00 4.216 0.418 234.840 18.4

1 16/12/2006 17:25:00 5.360 0.436 233.630 23.0

2 16/12/2006 17:26:00 5.374 0.498 233.290 23.0

3 16/12/2006 17:27:00 5.388 0.502 233.740 23.0

4 16/12/2006 17:28:00 3.666 0.528 235.680 15.8

... ... ... ... ... ...

2075254 26/11/2010 20:58:00 0.946 0.000 240.430 4.0

2075255 26/11/2010 20:59:00 0.944 0.000 240.000 4.0

2075256 26/11/2010 21:00:00 0.938 0.000 239.820 3.8

2075257 26/11/2010 21:01:00 0.934 0.000 239.700 3.8

2075258 26/11/2010 21:02:00 0.932 0.000 239.550 3.8

2075259 rows × 9 columns

In [5]: ## selecting 30000 sample data

Int64Index: 30000 entries, 1032508 to 1532620

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 30000 non-null object

1 Time 30000 non-null object

2 Global_active_power 30000 non-null object

3 Global_reactive_power 30000 non-null object

4 Voltage 30000 non-null object

5 Global_intensity 30000 non-null object

6 Sub_metering_1 30000 non-null object

7 Sub_metering_2 30000 non-null object

8 Sub_metering_3 29619 non-null float64

dtypes: float64(1), object(8)

memory usage: 2.3+ MB

separating date, month and year

In [7]: df1['Date']= pd.to_datetime(df1['Date'])

In [8]: df1['date']= df1['Date'].dt.day

In [9]: df1['month']= df1['Date'].dt.month

In [10]: df1['year']= df1['Date'].dt.year

separating hours, minutes and seconds

In [12]: df1['hour'] = pd.to_datetime(df1['Time'], format='%H:%M:%S').dt.hour

df1['Minutes'] = pd.to_datetime(df1['Time'], format='%H:%M:%S').dt.minute

replacing special characters from data

In [14]: df1.replace('?',np.nan, inplace =True)

df1.replace(',',np.nan, inplace =True)

df1.replace(' ', np.nan, inplace =True)

Int64Index: 30000 entries, 1032508 to 1532620

Data columns (total 14 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 30000 non-null datetime64[ns]

1 Time 30000 non-null object

2 Global_active_power 29619 non-null object

3 Global_reactive_power 29619 non-null object

4 Voltage 29619 non-null object

5 Global_intensity 29619 non-null object

6 Sub_metering_1 29619 non-null object

7 Sub_metering_2 29619 non-null object

8 Sub_metering_3 29619 non-null float64

9 date 30000 non-null int64

10 month 30000 non-null int64

11 year 30000 non-null int64