Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

Importing libraries

In [1]: import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

import warnings

warnings.filterwarnings('ignore')

Data ingestion

In [2]: df = pd.read_csv("C:\\Users\\Rajesh Singh\\Downloads\\household_power_consumption\\

In [3]: df

Out[3]: Date Time Global_active_power Global_reactive_power Voltage Global_intens

0 16/12/2006 17:24:00 4.216 0.418 234.840 18.4

1 16/12/2006 17:25:00 5.360 0.436 233.630 23.0

2 16/12/2006 17:26:00 5.374 0.498 233.290 23.0

3 16/12/2006 17:27:00 5.388 0.502 233.740 23.0

4 16/12/2006 17:28:00 3.666 0.528 235.680 15.8

... ... ... ... ... ...

2075254 26/11/2010 20:58:00 0.946 0.000 240.430 4.0

2075255 26/11/2010 20:59:00 0.944 0.000 240.000 4.0

2075256 26/11/2010 21:00:00 0.938 0.000 239.820 3.8

2075257 26/11/2010 21:01:00 0.934 0.000 239.700 3.8

2075258 26/11/2010 21:02:00 0.932 0.000 239.550 3.8

2075259 rows × 9 columns

EDA

In [4]: df.shape

(2075259, 9)
Out[4]:

In [5]: ## selecting 30000 sample data

df1= df.sample(30000)

In [6]: df1.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 30000 entries, 1032508 to 1532620

Data columns (total 9 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 30000 non-null object

1 Time 30000 non-null object

2 Global_active_power 30000 non-null object

3 Global_reactive_power 30000 non-null object

4 Voltage 30000 non-null object

5 Global_intensity 30000 non-null object

6 Sub_metering_1 30000 non-null object

7 Sub_metering_2 30000 non-null object

8 Sub_metering_3 29619 non-null float64

dtypes: float64(1), object(8)

memory usage: 2.3+ MB

Feature Information
Global_active_power: household global minute-averaged active power (in kilowatt)
global_reactive_power: household global minute-averaged reactive power (in
kilowatt)
voltage: minute-averaged voltage (in volt)
global_intensity: household global minute-averaged current intensity (in ampere)
date: Date in format dd/mm/yyyy
time: time in format hh:mm:ss
sub_metering_1, sub_metering_1, and sub_metering_1 are the meter readings

separating date, month and year

In [7]: df1['Date']= pd.to_datetime(df1['Date'])

In [8]: df1['date']= df1['Date'].dt.day

In [9]: df1['month']= df1['Date'].dt.month

In [10]: df1['year']= df1['Date'].dt.year

separating hours, minutes and seconds

In [12]: df1['hour'] = pd.to_datetime(df1['Time'], format='%H:%M:%S').dt.hour

df1['Minutes'] = pd.to_datetime(df1['Time'], format='%H:%M:%S').dt.minute

replacing special characters from data

In [14]: df1.replace('?',np.nan, inplace =True)

df1.replace(',',np.nan, inplace =True)

df1.replace(' ', np.nan, inplace =True)

In [15]: df1.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 30000 entries, 1032508 to 1532620

Data columns (total 14 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 30000 non-null datetime64[ns]

1 Time 30000 non-null object

2 Global_active_power 29619 non-null object

3 Global_reactive_power 29619 non-null object

4 Voltage 29619 non-null object

5 Global_intensity 29619 non-null object

6 Sub_metering_1 29619 non-null object

7 Sub_metering_2 29619 non-null object

8 Sub_metering_3 29619 non-null float64

9 date 30000 non-null int64

10 month 30000 non-null int64

11 year 30000 non-null int64

12 hour 30000 non-null int64

13 Minutes 30000 non-null int64

dtypes: datetime64[ns](1), float64(1), int64(5), object(7)

memory usage: 3.4+ MB

Converting the datatypes

In [16]: df1['Global_active_power']= df1['Global_active_power'].astype(float)

df1['Global_reactive_power']= df1['Global_reactive_power'].astype(float)

df1['Voltage']= df1['Voltage'].astype(float)

df1['Global_intensity']= df1['Global_intensity'].astype(float)

df1['Sub_metering_1']= df1['Sub_metering_1'].astype(float)

df1['Sub_metering_2']= df1['Sub_metering_2'].astype(float)

Creating total metering column

In [17]: df1['total_metering']= df1['Sub_metering_1']+df1['Sub_metering_2']+df1['Sub_meterin

In [18]: df1.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 30000 entries, 1032508 to 1532620

Data columns (total 15 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 30000 non-null datetime64[ns]

1 Time 30000 non-null object

2 Global_active_power 29619 non-null float64

3 Global_reactive_power 29619 non-null float64

4 Voltage 29619 non-null float64

5 Global_intensity 29619 non-null float64

6 Sub_metering_1 29619 non-null float64

7 Sub_metering_2 29619 non-null float64

8 Sub_metering_3 29619 non-null float64

9 date 30000 non-null int64

10 month 30000 non-null int64

11 year 30000 non-null int64

12 hour 30000 non-null int64

13 Minutes 30000 non-null int64

14 total_metering 29619 non-null float64

dtypes: datetime64[ns](1), float64(8), int64(5), object(1)

memory usage: 3.7+ MB

Dropping columns not necessary


In [19]: dff= df1.drop(columns=['Date','Time','Sub_metering_1','Sub_metering_2','Sub_meterin

In [20]: dff.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 30000 entries, 1032508 to 1532620

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Global_active_power 29619 non-null float64

1 Global_reactive_power 29619 non-null float64

2 Voltage 29619 non-null float64

3 Global_intensity 29619 non-null float64

4 date 30000 non-null int64

5 month 30000 non-null int64

6 year 30000 non-null int64

7 hour 30000 non-null int64

8 Minutes 30000 non-null int64

9 total_metering 29619 non-null float64

dtypes: float64(5), int64(5)

memory usage: 2.5 MB

In [21]: dff.describe()

Out[21]: Global_active_power Global_reactive_power Voltage Global_intensity date

count 29619.000000 29619.000000 29619.000000 29619.00000 30000.000000 3

mean 1.101824 0.124000 240.810522 4.66880 15.722133

std 1.066910 0.113641 3.254252 4.48595 8.818710

min 0.078000 0.000000 225.450000 0.20000 1.000000

25% 0.310000 0.048000 238.950000 1.40000 8.000000

50% 0.620000 0.100000 240.980000 2.80000 16.000000

75% 1.534000 0.196000 242.860000 6.40000 23.000000

max 8.702000 1.100000 253.140000 38.00000 31.000000

Primary Observation:
There are null values in Global_active_power, Global_reactive_power, Voltage,
Global_intensity and total_metering.
And Global_active_power, Global_reactive_power,
Global_intensity and total_metering features seems to have outliers and hence data is
skewed.
Global_active_power : has observed mean power of 1.087975 and standard deviation
1.052817, minimum power 0.078000, range og 25th percentile to 75th percentile is
[0.308000 to 1.530000], but maximum power 10.670000.
This clearly shows data is
skewed and presence of outliers.
Global_reactive_power : has observed mean power of 0.123332 and standard deviation
0.113058, minimum power 0, range og 25th percentile to 75th percentile is [0.048000 to
0.194000], but maximum power 1.186000.
This clearly shows data is skewed and
presence of outliers.
Voltage : has observed mean voltage of 240.853321 and standard deviation 3.241289,
minimum voltage 225.140000, range og 25th percentile to 75th percentile is
[238.990000 to 242.910000], and maximum voltage 253.420000.
Global_intensity : has observed mean current intensity of 4.612513 and standard
deviation 4.426749, minimum is 0.200000, range og 25th percentile to 75th percentile is
[1.400000 to 6.400000], but maximum value is 46.400000.
This clearly shows data is
skewed and presence of outliers.
total_metering : has observed mean total metering is 8.815839 and standard deviation
12.710086, minimum reading 0, range og 25th percentile to 75th percentile is [0 to 18],
but maximum reading is 126.
This is skewed ans have outliers.

In [22]: dff.duplicated() ###checking presence of duplicate

1032508 False

Out[22]:
1794736 False

590021 False

1946737 False

691314 False

...

340664 False

2063841 False

1437739 False

1027465 False

1532620 False

Length: 30000, dtype: bool

In [23]: dff.skew()

Global_active_power 1.805078

Out[23]:
Global_reactive_power 1.272971

Voltage -0.329061

Global_intensity 1.868954

date -0.000066

month -0.000640

year -0.015933

hour 0.002056

Minutes 0.003394

total_metering 2.244892

dtype: float64

Handling null values{replacing with mean when features are without outliers and replacing with median
when features are with outliers}

In [24]: dff['Global_active_power'] = dff['Global_active_power'].fillna(dff['Global_active_p


dff['Global_reactive_power'] = dff['Global_reactive_power'].fillna(dff['Global_reac
dff['Voltage'] = dff['Voltage'].fillna(dff['Voltage'].mean())

dff['Global_intensity'] = dff['Global_intensity'].fillna(dff['Global_intensity'].me
dff['total_metering'] = dff['total_metering'].fillna(dff['total_metering'].median()

In [25]: dff.head()

Out[25]: Global_active_power Global_reactive_power Voltage Global_intensity date month ye

1032508 1.838 0.128 242.58 7.6 12 2 200

1794736 1.756 0.190 241.51 7.2 16 5 20

590021 1.688 0.054 239.15 7.2 30 1 200

1946737 1.600 0.308 239.54 6.8 29 8 20

691314 1.368 0.000 239.24 5.6 4 9 200


In [26]: plt.figure(figsize=(15, 15))

plt.suptitle('Univariate Analysis', fontsize=20, fontweight='bold', alpha=0.8, y=1


for i in range(0, len(dff.columns)):

plt.subplot(5, 3, i+1)

sns.kdeplot(x=dff[dff.columns[i]],shade=True, color='b')

plt.xlabel(dff.columns[i])

plt.tight_layout()

In [27]: plt.figure(figsize=(15, 15))

plt.suptitle('Relation of features with total metering', fontsize=20)

for i in range(0, len(dff.columns)):

plt.subplot(5, 3, i+1)

sns.scatterplot(data = dff, x=dff[dff.columns[i]], y = dff['total_metering'], c


plt.xlabel(dff.columns[i])

plt.tight_layout()

In [28]: sns.lineplot(x='hour', y='total_metering', data=dff, color = 'blue')

<AxesSubplot:xlabel='hour', ylabel='total_metering'>
Out[28]:

Observation:

total reading incraeses in the morning around 7 am then with a dip aroung 3 or 4 of
afternoon, it again rises in evening time then dip after 8 or 9 of the night.

In [29]: sns.lineplot(x='month', y='total_metering', data=dff, color = 'blue')

<AxesSubplot:xlabel='month', ylabel='total_metering'>
Out[29]:
Observation:
In july there is a dip in power consumption.

In [30]: sns.lineplot(x='year', y='total_metering', data=dff, color = 'blue')

<AxesSubplot:xlabel='year', ylabel='total_metering'>
Out[30]:

Observation:

The power consumption has decreased from 2006 onwards.

Checking correlation

In [31]: dff.corr()

Out[31]: Global_active_power Global_reactive_power Voltage Global_intensity

Global_active_power 1.000000 0.262954 -0.398298 0.998943 -0.0

Global_reactive_power 0.262954 1.000000 -0.112563 0.281364 0.0

Voltage -0.398298 -0.112563 1.000000 -0.409977 0.0

Global_intensity 0.998943 0.281364 -0.409977 1.000000 -0.0

date -0.010590 0.004627 0.006404 -0.010543 1.0

month 0.005835 0.013825 0.042346 0.004858 0.0

year -0.041554 0.045964 0.253848 -0.045567 -0.0

hour 0.284929 0.137597 -0.182374 0.285519 0.0

Minutes 0.000974 -0.019742 0.018049 0.000006 -0.0

total_metering 0.846845 0.190819 -0.342922 0.843842 -0.0

In [32]: plt.figure(figsize=(15,15))

sns.heatmap(data=dff.corr(), annot=True);

Observation:

Global_active_power and global_intensity are highly correlated

Handling multicollinearity

In [33]: from statsmodels.stats.outliers_influence import variance_inflation_factor

In [34]: vif_data = pd.DataFrame()

vif_data['VIF'] = [variance_inflation_factor(dff.values,i) for i in range(len(dff.c


vif_data['features'] = dff.columns

vif_data

Out[34]: VIF features

0 1317.390296 Global_active_power

1 2.937605 Global_reactive_power

2 7550.255484 Voltage

3 1337.011908 Global_intensity

4 4.179946 date

5 4.530054 month

6 7638.125596 year

7 4.222324 hour

8 3.889487 Minutes

9 5.372782 total_metering

In [35]: ## 'Global_active_power', 'Voltage', 'Global_intensity', and 'year' has VIF value g


## Dropping 'Global_active power'as both 'global active power' and 'global intensit
## although voltage's VIF is also high but will not drop it as it does not have col
## dropping year feature as high VIF value and also not having this will not affect
dff.drop(columns=['Global_active_power','year'],axis=1, inplace = True)

In [36]: dff.head()

Out[36]: Global_reactive_power Voltage Global_intensity date month hour Minutes total_me

1032508 0.128 242.58 7.6 12 2 17 52

1794736 0.190 241.51 7.2 16 5 1 40

590021 0.054 239.15 7.2 30 1 11 5

1946737 0.308 239.54 6.8 29 8 15 1

691314 0.000 239.24 5.6 4 9 19 18

Checking outliers
In [37]: plt.figure(figsize=(15,15))

plt.suptitle("Outliers Analysis",fontsize = 20, fontweight = 'bold', alpha= 0.8)

for i in range(0, len(dff.columns)):

plt.subplot(5,3,i+1)

sns.boxplot(dff[dff.columns[i]])

plt.tight_layout()

Importing winsorizer- to handle outliers

In [39]: from feature_engine.outliers.winsorizer import Winsorizer

In [40]: winsorizer = Winsorizer(capping_method = 'iqr', # choose skewed for IQR rule bound
tail = 'both', # cap left, right or both tails

fold = 1.5, # 1.5 times of iqr

variables = ['Global_reactive_power'])

# capping_methods = 'iqr' - 25th quantile & 75th quantile

dff['Global_reactive_power'] = winsorizer.fit_transform(dff[['Global_reactive_power

In [41]: winsorizer = Winsorizer(capping_method = 'iqr', # choose skewed for IQR rule bound
tail = 'both', # cap left, right or both tails

fold = 1.5, # 1.5 times of iqr

variables = ['Voltage'])

# capping_methods = 'iqr' - 25th quantile & 75th quantile

dff['Voltage'] = winsorizer.fit_transform(dff[['Voltage']])

In [42]: winsorizer = Winsorizer(capping_method = 'iqr', # choose skewed for IQR rule bound
tail = 'both', # cap left, right or both tails

fold = 1.5, # 1.5 times of iqr

variables = ['Global_intensity'])

# capping_methods = 'iqr' - 25th quantile & 75th quantile

dff['Global_intensity'] = winsorizer.fit_transform(dff[['Global_intensity']])

In [43]: winsorizer = Winsorizer(capping_method = 'iqr', # choose skewed for IQR rule bound
tail = 'both', # cap left, right or both tails

fold = 1.5, # 1.5 times of iqr

variables = ['total_metering'])

# capping_methods = 'iqr' - 25th quantile & 75th quantile

dff['total_metering'] = winsorizer.fit_transform(dff[['total_metering']])

Checking outliers after outlier treatment

In [44]: plt.figure(figsize=(15,15))

plt.suptitle("Outliers Analysis",fontsize = 20, fontweight = 'bold', alpha= 0.8)

for i in range(0, len(dff.columns)):

plt.subplot(5,3,i+1)

sns.boxplot(dff[dff.columns[i]])

plt.tight_layout()

Saving the cleaned data

In [45]: dff.to_csv("power_consumption_data.csv")

we will try to upload this dataset in Mongodb

In [46]: import pymongo

In [47]: ## creating connection with mongodb server

client = pymongo.MongoClient("mongodb://raje:mongodb@ac-tl0bnvo-shard-00-00.bkfbasy
db = client.test

In [48]: database = client['power_consumption'] ##creating a database at their in mongodb

In [49]: collection = database["power_consumption_data"]

In [50]: ## first we have to convert this dataframe into dict or josn format as this is the

In [51]: dff_dict = dff.to_dict(orient = 'records')

In [52]: ## now lets insert the data in mongdb

collection.insert_many(dff_dict)

loading the data from mongodb


In [53]: ## here connection is already established so no need to establish connection again,
#client = pymongo.MongoClient("mongodb://raje:mongodb@ac-tl0bnvo-shard-00-00.bkfbas
#db = client.test

#database = client['power_consumption']

#collection = database["power_consumption_data"]

In [54]: ## fetching data from the collection in mongodb{using find will return all the occu
data = pd.DataFrame(list(collection.find()))

In [55]: data

Out[55]: _id Global_reactive_power Voltage Global_intensity date month

0 636e4c605d7ad8332ed99cd5 0.182 241.410 13.4 1 6

1 636e4c605d7ad8332ed99cd7 0.000 238.130 0.6 22 5

2 636e4c605d7ad8332ed99ce3 0.000 239.580 0.8 17 10

3 636e4c605d7ad8332ed99ce6 0.000 233.335 4.6 19 7

4 636e4c605d7ad8332ed99ced 0.176 239.320 6.2 21 4

... ... ... ... ... ... ...

29995 636e4c605d7ad8332eda11c1 0.104 244.810 1.4 1 3

29996 636e4c605d7ad8332eda11cf 0.254 240.480 1.4 22 8

29997 636e4c605d7ad8332eda11d9 0.112 241.310 1.4 26 3

29998 636e4c605d7ad8332eda11e4 0.102 238.750 1.4 28 7

29999 636e4c605d7ad8332eda11f6 0.124 241.800 9.2 29 9

30000 rows × 9 columns

In [56]: #### dropping id column

data.drop(columns=['_id'],axis=1,inplace=True)

In [57]: data.head()

Out[57]: Global_reactive_power Voltage Global_intensity date month hour Minutes total_metering

0 0.182 241.410 13.4 1 6 19 34 18.0

1 0.000 238.130 0.6 22 5 11 17 1.0

2 0.000 239.580 0.8 17 10 16 55 0.0

3 0.000 233.335 4.6 19 7 8 18 17.0

4 0.176 239.320 6.2 21 4 12 30 18.0

splitting the data into dependent and independent features

In [58]: X = data.drop("total_metering", axis=1)

In [59]: X

Out[59]: Global_reactive_power Voltage Global_intensity date month hour Minutes

0 0.182 241.410 13.4 1 6 19 34

1 0.000 238.130 0.6 22 5 11 17

2 0.000 239.580 0.8 17 10 16 55

3 0.000 233.335 4.6 19 7 8 18

4 0.176 239.320 6.2 21 4 12 30

... ... ... ... ... ... ... ...

29995 0.104 244.810 1.4 1 3 13 57

29996 0.254 240.480 1.4 22 8 10 7

29997 0.112 241.310 1.4 26 3 18 5

29998 0.102 238.750 1.4 28 7 1 54

29999 0.124 241.800 9.2 29 9 7 35

30000 rows × 7 columns

In [60]: y = data['total_metering']

In [61]: y.head()

0 18.0

Out[61]:
1 1.0

2 0.0

3 17.0

4 18.0

Name: total_metering, dtype: float64

Splitting the data into train and test data

In [62]: from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_st

In [63]: X_train

Out[63]: Global_reactive_power Voltage Global_intensity date month hour Minutes

13707 0.098 243.79 1.2 18 8 9 20

10403 0.256 240.06 3.4 4 5 19 23

6673 0.000 245.50 1.0 18 2 4 17

28904 0.274 244.23 4.2 4 9 16 28

2987 0.112 245.65 1.4 18 3 4 14

... ... ... ... ... ... ... ...

28017 0.098 243.68 1.2 2 2 3 9

17728 0.202 239.73 2.4 7 7 23 49

29199 0.180 238.16 6.8 3 9 8 9

7293 0.054 241.76 8.4 31 8 22 22

17673 0.104 239.95 1.4 31 8 3 20

22500 rows × 7 columns

In [64]: X_test

Out[64]: Global_reactive_power Voltage Global_intensity date month hour Minutes

20412 0.000 243.45 1.0 23 12 23 19

1296 0.058 240.72 7.2 11 5 7 57

3906 0.136 242.68 10.4 1 1 4 10

20454 0.078 245.20 1.2 4 9 14 52

5200 0.070 241.62 4.2 10 8 0 2

... ... ... ... ... ... ... ...

7783 0.152 241.96 2.0 25 12 12 53

16480 0.000 243.42 6.2 14 11 22 27

18914 0.170 234.49 13.9 25 3 20 22

3418 0.000 243.03 1.0 10 2 15 45

4855 0.068 240.11 6.0 30 5 20 15

7500 rows × 7 columns

In [65]: y_train

13707 0.0

Out[65]:
10403 1.0

6673 0.0

28904 11.0

2987 0.0

...

28017 0.0

17728 0.0

29199 19.0

7293 19.0

17673 2.0

Name: total_metering, Length: 22500, dtype: float64

Standardising data

In [66]: from sklearn.preprocessing import StandardScaler

In [67]: scaler = StandardScaler()

In [68]: #### Using fit transform to standardise train data

In [69]: X_train = scaler.fit_transform(X_train,)

In [70]: X_train

array([[-0.21494871, 0.93559813, -0.85225693, ..., 0.4274415 ,

Out[70]:
-0.36264487, -0.5475558 ],
[ 1.29044859, -0.26217727, -0.26946362, ..., -0.44053182,

1.08034131, -0.37407043],
[-1.14867616, 1.48471233, -0.90523815, ..., -1.30850514,

-1.08413797, -0.72104116],
...,

[ 0.56633343, -0.87230415, 0.63121697, ..., 0.71676594,

-0.50694349, -1.18366881],
[-0.63417328, 0.28372573, 1.05506665, ..., 0.4274415 ,

1.51323716, -0.43189889],
[-0.15778173, -0.2975004 , -0.79927572, ..., 0.4274415 ,

-1.22843658, -0.5475558 ]])

In [71]: #### using only transform to avoid data leakage

In [72]: X_test = scaler.transform(X_test)

X_test

array([[-1.14867616, 0.82641753, -0.90523815, ..., 1.58473926,

Out[72]:
1.65753578, -0.60538425],
[-0.59606196, -0.05023846, 0.73717939, ..., -0.44053182,

-0.65124211, 1.59209705],
[ 0.14710886, 0.57915559, 1.58487876, ..., -1.59782958,

-1.08413797, -1.12584035],
...,

[ 0.47105512, -2.05081239, 2.51204995, ..., -1.0191807 ,

1.22463993, -0.43189889],
[-1.14867616, 0.69154738, -0.90523815, ..., -1.30850514,

0.50314684, 0.89815559],
[-0.50078365, -0.2461213 , 0.41929213, ..., -0.44053182,

1.22463993, -0.83669807]])

Pickling

In [73]: ## pickling{saving some preprocessing objects/models}


import pickle

# Writing different model files to file

with open('sandardScalar.sav', 'wb') as f:

pickle.dump(scaler,f)

## here we have pickled the standard scaler object

Linear Regression
In [74]: from sklearn.linear_model import LinearRegression

In [75]: ## Creating linear regression model

linear_reg = LinearRegression()

In [76]: ## training the model

linear_reg.fit(X_train, y_train)

LinearRegression()
Out[76]:

In [77]: ## coefficients and intercept of the best fit hyperplane

print('1. the coefficients of independent features: {}'.format(linear_reg.coef_))

print('2. intercept of the best fit hyperplane: {}'.format(linear_reg.intercept_))

1. the coefficients of independent features: [-4.50357114e-01 -1.05664480e-01 9.5


8753698e+00 7.08918901e-03

-3.03128453e-02 -8.95250419e-01 -8.64638801e-02]

2. intercept of the best fit hyperplane: 8.405333333333331

In [78]: ## predicting the test data using the model

y_pred = linear_reg.predict(X_test)

y_pred

array([-1.31742958, 16.20173682, 24.57747909, ..., 31.47360105,

Out[78]:
-0.32246875, 11.67767303])

cost functions
In [79]: from sklearn.metrics import mean_squared_error

from sklearn.metrics import mean_absolute_error

In [80]: print("Mean squared error is {}".format(round(mean_squared_error(y_test, y_pred),2)


print("Mean squared error is {}".format(round(mean_absolute_error(y_test, y_pred),2
print("Root Mean squared error is {}".format(round(np.sqrt(mean_squared_error(y_tes

Mean squared error is 38.2

Mean squared error is 4.18

Root Mean squared error is 6.18

Validating the model using assumptions of Linear regression


Linear relationship

Test truth data and predicted data should follow linear relationship
{this is an indiaction
of a good model}

In [81]: plt.scatter(x=y_test, y = y_pred)

plt.xlabel("Test truth data")

plt.ylabel("Predicted data")

Text(0, 0.5, 'Predicted data')


Out[81]:

Residual distribution

Residuals should follow normal distribution

In [82]: residual_linear= y_test-y_pred

residual_linear.head()

20412 1.317430

Out[82]:
1296 2.798263

3906 -24.577479

20454 1.194867

5200 -3.652491

Name: total_metering, dtype: float64

In [83]: sns.displot(x=residual_linear, kind='kde')

<seaborn.axisgrid.FacetGrid at 0x15bb818faf0>
Out[83]:

Uniform distribution
Residuals vs Predictions should follow a uniform dstribution.

In [84]: plt.scatter(x=y_pred, y=residual_linear)


plt.xlabel('Predictions')

plt.ylabel('Residuals')

Text(0, 0.5, 'Residuals')


Out[84]:

Accuracy of the model with train data and with test data

In [85]: linear_reg.score(X_train,y_train)

0.6849425651586472
Out[85]:

In [86]: linear_reg.score(X_test,y_test)

0.690139882731237
Out[86]:

Performance Matrics
R Square and Adjusted R Square values

In [87]: from sklearn.metrics import r2_score

In [88]: r2_score_lr= r2_score(y_test, y_pred)

print("Linear Regression model has {} % accuracy".format(round(r2_score_lr*100,3)))


adjr2_score_lr=1-((1-r2_score_lr)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))

print("Adjusted R square accuracy is {} percent".format(round(adjr2_score_lr*100,2)

Linear Regression model has 69.014 % accuracy

Adjusted R square accuracy is 68.99 percent

Ridge Regression
In [89]: from sklearn.linear_model import Ridge

In [90]: ## creating ridge regrssion model

ridge_reg= Ridge()

In [91]: ## training the model

ridge_reg.fit(X_train, y_train)

Ridge()
Out[91]:

In [92]: # Printing co-efficients and intercept of best fit hyperplane

print("coefficient of independent feature is {}".format(ridge_reg.coef_))

print("Intercept of best fit hyperplane is {}".format(ridge_reg.intercept_))

coefficient of independent feature is [-4.50239810e-01 -1.05849300e-01 9.58696058


e+00 7.08439131e-03

-3.02969840e-02 -8.95086518e-01 -8.64597295e-02]

Intercept of best fit hyperplane is 8.405333333333331

In [93]: # predicting test data

y_pred_r = ridge_reg.predict(X_test)

y_pred_r

array([-1.31690495, 16.20114673, 24.57627609, ..., 31.47276509,

Out[93]:
-0.32214097, 11.67760058])

In [94]: print("Mean squared error is {}".format(round(mean_squared_error(y_test, y_pred_r),


print("Mean absolute error is {}".format(round(mean_absolute_error(y_test, y_pred_r
print("Root Mean squared error is {}".format(round(np.sqrt(mean_squared_error(y_tes

Mean squared error is 38.2

Mean absolute error is 4.18

Root Mean squared error is 6.18

In [95]: ridge_reg.score(X_train,y_train)

0.6849425631543848
Out[95]:

In [96]: ridge_reg.score(X_test,y_test)

0.6901402931343383
Out[96]:

Validating model using performace matrics


R Square and Adjusted R Square values

In [97]: ridge_r2_score=r2_score(y_test, y_pred_r)

print("Our Ridge regression model has {} % accuracy".format(round(ridge_r2_score*10


ridge_adjr2_score=1-((1-ridge_r2_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1
print("Adjusted R square accuracy is {} percent".format(round(ridge_adjr2_score*100

Our Ridge regression model has 69.014 % accuracy

Adjusted R square accuracy is 68.99 percent

Lasso Regression
In [98]: from sklearn.linear_model import Lasso

In [99]: # creating Lasso regression model

lasso_reg = Lasso()

lasso_reg

Lasso()
Out[99]:

In [100… ## training the model

lasso_reg.fit(X_train, y_train)

Lasso()
Out[100]:

In [101… # Printing co-efficients and intercept of best fit hyperplane

print("coefficient of independent feature is {}".format(lasso_reg.coef_))

print("Intercept of best fit hyperplane is {}".format(lasso_reg.intercept_))

coefficient of independent feature is [-0. -0. 8.25775306 -0.


-0. -0.

-0. ]

Intercept of best fit hyperplane is 8.405333333333331

In [102… ## predicting the test data

y_pred_lasso = lasso_reg.predict(X_test)

y_pred_lasso

array([ 0.93010027, 14.4927787 , 21.49287079, ..., 29.14922151,

Out[102]:
0.93010027, 11.86774416])

In [103… print("Mean squared error is {}".format(round(mean_squared_error(y_test, y_pred_las


print("Mean absolute error is {}".format(round(mean_absolute_error(y_test, y_pred_l
print("Root Mean squared error is {}".format(round(np.sqrt(mean_squared_error(y_tes

Mean squared error is 40.15

Mean absolute error is 4.41

Root Mean squared error is 6.34

In [104… lasso_reg.score(X_train,y_train)

0.6693582767991895
Out[104]:

In [105… lasso_reg.score(X_test,y_test)

0.6743540592319746
Out[105]:

Validating model using performance matrics


R Square and Adjusted R Square values

In [106… lasso_r2_score=r2_score(y_test, y_pred_lasso)

print("Our Ridge regression model has {} % accuracy".format(round(lasso_r2_score*10


lasso_adjr2_score=1-((1-lasso_r2_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1
print("Adjusted R square accuracy is {} percent".format(round(lasso_adjr2_score*100

Our Ridge regression model has 67.435 % accuracy

Adjusted R square accuracy is 67.4 percent

Elastic-Net Regression
In [107… from sklearn.linear_model import ElasticNet

In [108… # creating Elastic-Net regression model

elastic_reg=ElasticNet()

elastic_reg

ElasticNet()
Out[108]:

In [109… ## training the model

elastic_reg.fit(X_train, y_train)

ElasticNet()
Out[109]:

In [110… # Printing co-efficients and intercept of best fit hyperplane

print("1. Co-efficients of independent features is {}".format(elastic_reg.coef_))

print("2. Intercept of best fit hyper plane is {}".format(elastic_reg.intercept_))

1. Co-efficients of independent features is [ 0. -0.64573766 5.66694515 -


0. 0. 0.

-0. ]

2. Intercept of best fit hyper plane is 8.405333333333331

In [111… elastic_y_pred=elastic_reg.predict(X_test)

elastic_y_pred

array([ 2.7417495 , 12.61532936, 17.01277178, ..., 23.9652694 ,

Out[111]:
2.82884024, 10.9403686 ])

In [112… print("Mean squared error is '{}'".format(round(mean_squared_error(y_test, elastic_


print("Mean absolute error is '{}'".format(round(mean_absolute_error(y_test, elasti
print("Root Mean squared error is '{}'".format(round(np.sqrt(mean_squared_error(y_t

Mean squared error is '50.16'

Mean absolute error is '5.5'

Root Mean squared error is '7.08'

In [113… elastic_reg.score(X_train,y_train)

0.5870637265888694
Out[113]:

In [114… elastic_reg.score(X_test,y_test)

0.5931641182844434
Out[114]:

Validating model using performance matrices


R Square and Adjusted R Square values

In [115… elastic_reg_r2_score=r2_score(y_test, elastic_y_pred)

print("Our Elastic-Net regression model has {} % accuracy".format(round(elastic_reg


elastic_reg_adj_r2_score=1-((1-elastic_reg_r2_score)*(len(y_test)-1)/(len(y_test)-X
print("Adjusted R square accuracy is {} percent".format(round(elastic_reg_adj_r2_sc

Our Elastic-Net regression model has 59.316 % accuracy

Adjusted R square accuracy is 59.28 percent

SVR
In [116… from sklearn.svm import SVR

In [117… # creating SVR model

svr = SVR()

svr

SVR()
Out[117]:

In [118… ## training the model

svr.fit(X_train, y_train)

SVR()
Out[118]:

In [119… ## predicting the dependent feature value w.r.t. test data

svr_y_pred= svr.predict(X_test)

svr_y_pred

array([-0.21238191, 21.55678922, 20.86542642, ..., 30.91950729,

Out[119]:
0.21280761, 12.93475299])

In [120… print("Mean squared error is '{}'".format(round(mean_squared_error(y_test, svr_y_pr


print("Mean absolute error is '{}'".format(round(mean_absolute_error(y_test, svr_y_
print("Root Mean squared error is '{}'".format(round(np.sqrt(mean_squared_error(y_t

Mean squared error is '33.16'

Mean absolute error is '3.38'

Root Mean squared error is '5.76'

In [121… svr.score(X_train, y_train)

0.7336778695440571
Out[121]:

In [122… svr.score(X_test, y_test)

0.7310080566997736
Out[122]:

Performance Matrix SVR


R Square and Adjusted R Square values

In [123… svr_r2_score=r2_score(y_test, svr_y_pred)

print("SVR model has {} % accuracy".format(round(svr_r2_score*100,3)))

svr_adj_r2_score=1-((1-svr_r2_score)*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1
print("Adjusted R square accuracy is {} percent".format(round(svr_adj_r2_score*100,

SVR model has 73.101 % accuracy

Adjusted R square accuracy is 73.08 percent

Apply hyperparameter tuning


In [124… from sklearn.model_selection import GridSearchCV

In [125… model_params = {

'Ridge Regression': {

'model': Ridge(),

'params' : {

'alpha': [1,5,10,20]

},

'Lasso Regression': {

'model': Lasso(),

'params' : {

'alpha': [1,5,10,20]

},

'Elastic-Net Regression' : {

'model': ElasticNet(),

'params': {

'alpha': [1,5,10,20],

'l1_ratio':[0.5,1,1.5,2]

},

'SVR':{

'model': SVR(),

'params':{

'kernel': ['linear', 'poly', 'sigmoid', 'rbf'],

'C':[1,5,10,20]

In [126… model_params.items()

dict_items([('Ridge Regression', {'model': Ridge(), 'params': {'alpha': [1, 5, 10,


Out[126]:
20]}}), ('Lasso Regression', {'model': Lasso(), 'params': {'alpha': [1, 5, 10, 2
0]}}), ('Elastic-Net Regression', {'model': ElasticNet(), 'params': {'alpha': [1,
5, 10, 20], 'l1_ratio': [0.5, 1, 1.5, 2]}}), ('SVR', {'model': SVR(), 'params':
{'kernel': ['linear', 'poly', 'sigmoid', 'rbf'], 'C': [1, 5, 10, 20]}})])

In [127… ##scaling the independent features before fitting it inside grid object{to simplify
X1= scaler.fit_transform(X)

In [128… scores = []

for model_name, mp in model_params.items():

clf = GridSearchCV(mp['model'], mp['params'], cv=5, n_jobs=-1,return_train_sco


clf.fit(X1, y)

scores.append({

'model': model_name,

'best_score': clf.best_score_,

'best_params': clf.best_params_

})

df = pd.DataFrame(scores,columns=['model','best_score','best_params'])

df

Out[128]: model best_score best_params

0 Ridge Regression 0.686089 {'alpha': 1}

1 Lasso Regression 0.670507 {'alpha': 1}

2 Elastic-Net Regression 0.670507 {'alpha': 1, 'l1_ratio': 1}

3 SVR 0.765643 {'C': 20, 'kernel': 'rbf'}

Conclusion:
The SVR model with 'rbf kernel' is the best model for this household power
consumption data
In [ ]:

You might also like