End To End Implementation of Data Science Pipeline in The Linear Regression Model



Here we are going to discuss an end-to-end implementation of Linear Regression. If you are interested in building a basic end-to-end Linear Regression model, then this tutorial is for you. From this point on, we will work through a step-by-step coding example, implementing regression in the most straightforward Python style with the sklearn library.


The steps of this tutorial are demonstrated in the figure below:

End To End Pipeline of Linear Regression [Image Created By Dheeraj Kumar K]

In Python, pipelining is a relatively new approach that implements a machine learning model by holding all of its steps together; a quick sketch of the idea is shown below, and the rest of this tutorial implements each step manually. Here we are going to discuss the end-to-end implementation on the Bike Sharing Demand regression dataset.
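As an illustration only (a minimal sketch, not the code this tutorial uses), an sklearn Pipeline chains preprocessing and a model into a single estimator:

# Minimal sketch of an sklearn Pipeline: scaling chained with regression.
# The tutorial below performs each of these steps manually instead.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),   # preprocessing step
    ("model", LinearRegression()),  # final estimator
])
# Usage: pipe.fit(x_train, y_train); pipe.predict(x_test)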
1. Loading Dataset: This step is the data connection layer, where we fetch the data from a source such as a SQL database, an Excel file, JSON, MongoDB, etc.

First, import the basic required libraries:

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Load Dataset (a raw string avoids backslash-escape issues in Windows paths)
df = pd.read_csv(r"E:\Downlload\Bike Sharing Dataset\hour.csv")
df

The above code will return this output:

Loaded Data Using Pandas as DataFrame

2. Data Visualization: Exploratory Data Analysis (EDA) is considered the most important step in machine learning modeling, because many problems can be solved with good EDA and it yields enormous insight into the data.

def show_hist(x):
    plt.rcParams["figure.figsize"] = 15, 18
    x.hist()

show_hist(df)

## The Above Code will return this output

Histogram [Image Created By Dheeraj Kumar K]

In this histogram we can see that most of the variables are not normally distributed. Normality is one of the most important assumptions data scientists rely on, yet the nature of the data is sometimes non-normal; in that case we may need to transform the data before modeling, as we do later in this tutorial.

def Show_PairPlot(x):
    sns.pairplot(x)

Show_PairPlot(df)

## The Above Code will return this output

Pair plot [Image Created By Dheeraj Kumar K]

A pair plot gives you insight into both the relationships and the distributions of the variables. It is a great technique for identifying trends for follow-up analysis, and it can be implemented with just the code above.

3. Data Preprocessing: Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, or improperly formatted. It is the most important step in data science; in an end-to-end data science project, roughly 60% of the work is data cleaning.

Let's work on some cleaning.

# Missing Values
df.isna().sum()

The output of checking Missing Values [Image Created By Dheeraj Kumar K]
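The Bike Sharing dataset typically has no missing values, but if any column did, a hypothetical sketch like the following (an addition, not from the original post) could fill the gaps:

# Hypothetical sketch: fill numeric gaps with the median and
# categorical gaps with the mode, column by column.
for col in df.columns:
    if df[col].isna().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])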

Some more cleaning: split dteday into year, month, and date, then drop the redundant columns.

# Split 'dteday' (YYYY-MM-DD) into numeric Year, Month, and Date columns
df['Year'] = df['dteday'].str.split('-').str[0]
df['Year'] = df['Year'].astype(int)
df['Month'] = df['dteday'].str.split('-').str[1]
df['Month'] = df['Month'].astype(int)
df['Date'] = df['dteday'].str.split('-').str[2]
df['Date'] = df['Date'].astype(int)

# Drop the original date string and the now-redundant yr/mnth columns
df = df.drop(['dteday'], axis=1)
df = df.drop(['yr'], axis=1)
df = df.drop(['mnth'], axis=1)

## The Above Code will return this output

[Image Created By Dheeraj Kumar K]

Outliers

Outliers are observations that fall far outside the overall distribution of the data. The most common reasons outliers occur include errors in measurement or data entry, corrupt data, and genuine observations that simply lie outside the normal range. Because of the varied nature of datasets in data science, there is no single mathematical definition of an outlier.

Outliers Detection

def outlier(x):
    # Interquartile-range (IQR) fences: points below q1 - 1.5*IQR or
    # above q3 + 1.5*IQR are flagged as outliers
    q1 = x.quantile(.25)
    q3 = x.quantile(.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    outlier = x.loc[(x < low) | (x > high)]
    return outlier

outlier(df['cnt']).count()

## The Above Code will return this output

The output of Outliers in the Response Variable [Image Created By Dheeraj Kumar K]

Removal of Outliers

q1 = df['cnt'].quantile(.25)
q3 = df['cnt'].quantile(.75)
iqr = q3 - q1
df_new = df[~((df['cnt'] < (q1 - 1.5 * iqr)) | (df['cnt'] > (q3 + 1.5 * iqr)))]
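A quick sanity check (an addition, not in the original post) shows how many rows the filter dropped:

# Compare row counts before and after filtering the response variable
print("rows before:", len(df))
print("rows after :", len(df_new))
print("outliers removed:", len(df) - len(df_new))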

Basic Statistical Normality Test

from scipy.stats import anderson, shapiro, kstest

print('Anderson Darling Test :: ', anderson(df['cnt']))
print('=' * 70)
print('Shapiro Wilk Test :: ', shapiro(df['cnt']))
print('=' * 70)
print('Kolmogorov-Smirnov Test :: ', kstest(df['cnt'], 'norm'))
print('=' * 70)

Statistical Analysis Result [Image Created By Dheeraj Kumar K]
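As a rule of thumb for reading these results (an addition, not from the original post), a Shapiro-Wilk p-value below 0.05 is commonly taken as evidence against normality:

# Interpreting the Shapiro-Wilk result with the conventional 0.05 cutoff
stat, p = shapiro(df['cnt'])
if p < 0.05:
    print("p = %.4f -> reject normality; consider transforming 'cnt'" % p)
else:
    print("p = %.4f -> no evidence against normality" % p)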

4. Feature Engineering: What is a feature, and why do we need to engineer it? All machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly; this is where the need for feature engineering arises. I think feature engineering efforts mainly have two goals:

● Preparing an input dataset compatible with the machine learning algorithm's requirements.

● Improving the performance of machine learning models.

Let's implement it:

from scipy import stats
import pylab
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson power transform; standardize=True also z-scores the output
pt = PowerTransformer(method='yeo-johnson', standardize=True)
df = pt.fit_transform(df)
df = pd.DataFrame(df)

# Restore readable column names (fit_transform returns a bare array)
df = df.rename(columns={0: 'Intent', 1: 'Season', 2: 'hr', 3: 'holiday',
                        4: 'Weekday', 5: 'Workingday', 6: 'Weathersit',
                        7: 'hum', 8: 'Windspeed', 9: 'registered',
                        10: 'cnt', 11: 'year', 12: 'Month', 13: 'Date',
                        14: 'atemp', 15: 'Temp', 16: 'Casual'})
df

## The Above Code will return this output

[Image Created By Dheeraj Kumar K]

def show_hist(x):
    plt.rcParams["figure.figsize"] = 15, 18
    x.hist()

show_hist(df)

## The Above Code will return this output

Histogram After Transformation [Image Created By Dheeraj Kumar K]
In the histogram above you can see major differences in normality compared to the histogram we plotted with the raw data. As I said earlier, sometimes transformations work and sometimes they do not; it depends on the nature of the data. Because the transformation method used here already standardizes the data, separate feature scaling is not required; if we want to do it anyway, there is no issue, as in the sketch below.
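A minimal sketch of that optional scaling step (an addition; redundant here because PowerTransformer(standardize=True) already standardizes):

# Optional explicit feature scaling, shown only for illustration
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)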

5. Model Building: When you are working on a model and want to train it, you obviously have a dataset. But after training, the model has to be evaluated on a test dataset that is different from the training set you used earlier. It might not always be possible to have that much data during the development phase. In such cases, the obvious solution is to split the dataset you have into two sets, one for training and the other for testing, and to do this before you start training your model.


x = df.drop(['cnt'], axis=1)
y = df['cnt']

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=40)

Building the statsmodels OLS model:

import statsmodels.api as sm

model2 = sm.OLS(y_train, x_train).fit()
model2.summary()

StatsModel Output [Image Created By Dheeraj Kumar K]
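One note worth adding (not in the original post): sm.OLS fits without an intercept unless a constant column is appended. A variant with an explicit intercept looks like this:

# sm.add_constant prepends a column of ones so OLS estimates an intercept
x_train_const = sm.add_constant(x_train)
model_with_intercept = sm.OLS(y_train, x_train_const).fit()
print(model_with_intercept.params.head())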

Linear Regression Model

from sklearn.linear_model import LinearRegression

# Default settings; n_jobs=0 is invalid and the `normalize` parameter
# was removed in recent sklearn versions, so neither is passed here
lr = LinearRegression(copy_X=True, fit_intercept=True)
LR = lr.fit(x_train, y_train)
LR_Pred = lr.predict(x_test)

from sklearn import metrics
print('ROOT MEAN SQUARE ERROR:', np.sqrt(metrics.mean_squared_error(y_test, LR_Pred)))

from sklearn.metrics import r2_score
print('R SQUARE:', r2_score(y_test, LR_Pred))

import pickle
pickle.dump(lr, open('LinearRegression.pkl', 'wb'))

Linear Regression Score [Image Created By Dheeraj Kumar K]
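As a quick round-trip check (an addition, not from the original post), the pickled model can be reloaded and compared against the in-memory estimator:

# Reload the pickle and confirm it reproduces the same predictions
loaded = pickle.load(open('LinearRegression.pkl', 'rb'))
print(np.allclose(loaded.predict(x_test), LR_Pred))  # expect True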

FLASK Deployment

What Is Flask?
Flask is a popular Python web framework, meaning it is a third-party Python library used for developing web applications.

Why Flask?
● Easy to use.
● Built-in development server and debugger.
● Integrated unit testing support.
● RESTful request dispatching.
● Extensively documented.

Project Structure
This project has four parts:

1. model.py — This contains the code for the machine learning model that predicts Bike Sharing Demand.

2. app.py — This contains the Flask APIs that receive prediction details through the GUI or API calls, compute the predicted value based on our model, and return it.

3. request.py — This uses the requests module to call the APIs defined in app.py and displays the returned value.

4. HTML/CSS — This contains the HTML template and CSS styling that allow the user to enter bike-sharing details and display the predicted demand.

First, let's code the HTML and CSS:

<html>
<!--From https://codepen.io/frytyler/pen/EGdtg-->
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
    <title>Home</title>
    <!-- <link rel="stylesheet" type="text/css" href="../static/css/styles.css"> -->
    <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<style>
    body {
        width: 100%;
        height: 10%;
        font-family: 'Open Sans', sans-serif;
        background: #6b66b8;
        color: #fff;
        font-size: 28px;
        text-align: center;
        letter-spacing: 1.2px;
    }
</style>

<style>
    #ip4 {
        border-radius: 15px 50px 30px;
        border: 2px solid #609;
        padding: 20px;
        width: 300px;
        height: 15px;
    }
</style>

<body>
<div>
    <!--navbar portion-->
    <nav class="navbar navbar-expand-sm bg-dark fixed-top">
        <a class="navbar-brand" href="#"><b>Bike Sharing Demand Prediction</b></a>
        <a class="navbar-brand" href="#"><b><font size="2">BY DHEERAJ KUMAR</font></b></a>
        <ul class="navbar-nav">
            <li class="nav-item"></li>
            <li class="nav-item"></li>
        </ul>
    </nav>
    <br>
</div>

<br>
<div class="login">
    <h1>Bike Sharing Demand Prediction</h1>

    <!-- Main Input For Receiving Query to our ML -->
    <form action="{{ url_for('predict') }}" method="post">
        <input type="text" name="year" placeholder="year" required="required" id="ip4"/>
        <input type="text" name="month" placeholder="month" required="required" id="ip4"/>
        <input type="text" name="holiday" placeholder="holiday" required="required" id="ip4"/>
        <input type="text" name="weekday" placeholder="weekday" required="required" id="ip4"/>
        <input type="text" name="workingday" placeholder="workingday" required="required" id="ip4"/>
        <input type="text" name="weathersit" placeholder="weathersit" required="required" id="ip4"/>
        <input type="text" name="temp" placeholder="temp" required="required" id="ip4"/>
        <input type="text" name="atemp" placeholder="atemp" required="required" id="ip4"/>
        <input type="text" name="hum" placeholder="hum" required="required" id="ip4"/>
        <input type="text" name="windspeed" placeholder="windspeed" required="required" id="ip4"/>
        <input type="text" name="registered" placeholder="registered" required="required" id="ip4"/>
        <input type="text" name="hr" placeholder="hr" required="required" id="ip4"/>
        <br>
        <br>
        <input type="submit" class="btn-primary" value="predict">
    </form>
    <br>
    <br>
    {{ prediction_text }}
</div>
</body>
</html>

Let's code the Flask app:

import numpy as np
from flask import Flask, request, jsonify, render_template
import pickle

app = Flask(__name__)
lr = pickle.load(open('LinearRegression.pkl', 'rb'))

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on the HTML GUI
    '''
    int_features = [int(x) for x in request.form.values()]
    final_features = [np.array(int_features)]
    prediction = lr.predict(final_features)
    output = round(prediction[0], 2)
    return render_template('index.html',
                           prediction_text='Bike Sharing Demand Count should be {}'.format(output))

@app.route('/predict_api', methods=['POST'])
def predict_api():
    '''
    For direct API calls through requests
    '''
    data = request.get_json(force=True)
    prediction = lr.predict([np.array(list(data.values()))])
    output = prediction[0]
    return jsonify(output)

if __name__ == "__main__":
    app.run(debug=True)
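The project structure above names request.py as the client for these APIs. A minimal sketch of it could look like this (the URL, port, and feature values below are assumptions for illustration, not from the original post):

# request.py -- minimal sketch of calling the /predict_api endpoint
import requests

url = 'http://localhost:5000/predict_api'
payload = {"year": 1, "month": 6, "holiday": 0, "weekday": 3,
           "workingday": 1, "weathersit": 1, "temp": 0.5, "atemp": 0.48,
           "hum": 0.6, "windspeed": 0.2, "registered": 150, "hr": 9}
r = requests.post(url, json=payload)
print(r.json())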
Output

Conclusion
In this tutorial, we have walked through an end-to-end implementation of linear regression from scratch, from loading and cleaning the data to deploying the trained model with Flask.
