End To End Implementation of Data Science Pipeline in The Linear Regression Model



Here we are going to discuss an end-to-end implementation of Linear Regression. If you are interested in building a basic end-to-end Linear Regression model, then this tutorial is for you. From this point on, we will work through a step-by-step coding example, implementing regression in the most straightforward Python style with the sklearn library.


The steps of this tutorial are demonstrated in the figure below:

End To End Pipeline of Linear Regression [Image Created By Dheeraj Kumar K]

In Python, pipelining is a relatively new approach that implements a machine learning model by holding all of its steps together; a quick sketch of the idea is shown below, and the rest of this tutorial implements each step manually. Here we are going to discuss the end-to-end implementation on the Bike Sharing Demand regression dataset.
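As an illustration only (a minimal sketch, not the code this tutorial uses), an sklearn Pipeline chains preprocessing and a model into a single estimator:

# Minimal sketch of an sklearn Pipeline: scaling chained with regression.
# The tutorial below performs each of these steps manually instead.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),   # preprocessing step
    ("model", LinearRegression()),  # final estimator
])
# Usage: pipe.fit(x_train, y_train); pipe.predict(x_test)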
1. Loading Dataset: This step is the data connection layer, where we fetch the data from a source such as a SQL database, an Excel file, JSON, MongoDB, etc.

First, import the basic required libraries:

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Load Dataset (a raw string avoids backslash-escape issues in Windows paths)
df = pd.read_csv(r"E:\Downlload\Bike Sharing Dataset\hour.csv")
df

The above code will return this output:

Loaded Data Using Pandas as DataFrame

2. Data Visualization: Exploratory Data Analysis (EDA) is considered the most important step in machine learning modeling, because many problems can be solved with good EDA and it yields enormous insight into the data.

def show_hist(x):
    plt.rcParams["figure.figsize"] = 15, 18
    x.hist()

show_hist(df)

## The Above Code will return this output

Histogram [Image Created By Dheeraj Kumar K]

In this histogram we can see that most of the variables are not normally distributed. Normality is one of the most important assumptions data scientists rely on, yet the nature of the data is sometimes non-normal; in that case we may need to transform the data before modeling, as we do later in this tutorial.

def Show_PairPlot(x):
    sns.pairplot(x)

Show_PairPlot(df)

## The Above Code will return this output

Pair plot [Image Created By Dheeraj Kumar K]

A pair plot gives you insight into both the relationships and the distributions of the variables. It is a great technique for identifying trends for follow-up analysis, and it can be implemented with just the code above.

3. Data Preprocessing: Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, or improperly formatted. It is the most important step in data science; in an end-to-end data science project, roughly 60% of the work is data cleaning.

Let's work on some cleaning.

# Missing Values
df.isna().sum()

The output of checking Missing Values [Image Created By Dheeraj Kumar K]
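The Bike Sharing dataset typically has no missing values, but if any column did, a hypothetical sketch like the following (an addition, not from the original post) could fill the gaps:

# Hypothetical sketch: fill numeric gaps with the median and
# categorical gaps with the mode, column by column.
for col in df.columns:
    if df[col].isna().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])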

Some more cleaning: split dteday into year, month, and date, then drop the redundant columns.

# Split 'dteday' (YYYY-MM-DD) into numeric Year, Month, and Date columns
df['Year'] = df['dteday'].str.split('-').str[0]
df['Year'] = df['Year'].astype(int)
df['Month'] = df['dteday'].str.split('-').str[1]
df['Month'] = df['Month'].astype(int)
df['Date'] = df['dteday'].str.split('-').str[2]
df['Date'] = df['Date'].astype(int)

# Drop the original date string and the now-redundant yr/mnth columns
df = df.drop(['dteday'], axis=1)
df = df.drop(['yr'], axis=1)
df = df.drop(['mnth'], axis=1)

## The Above Code will return this output

[Image Created By Dheeraj Kumar K]

Outliers

Outliers are observations that fall far outside the overall distribution of the data. The most common reasons outliers occur include errors in measurement or data entry, corrupt data, and genuine observations that simply lie outside the normal range. Because of the varied nature of datasets in data science, there is no single mathematical definition of an outlier.

Outliers Detection

def outlier(x):
    # Interquartile-range (IQR) fences: points below q1 - 1.5*IQR or
    # above q3 + 1.5*IQR are flagged as outliers
    q1 = x.quantile(.25)
    q3 = x.quantile(.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    outlier = x.loc[(x < low) | (x > high)]
    return outlier

outlier(df['cnt']).count()

## The Above Code will return this output

The output of Outliers in the Response Variable [Image Created By Dheeraj Kumar K]

Removal of Outliers

q1 = df['cnt'].quantile(.25)
q3 = df['cnt'].quantile(.75)
iqr = q3 - q1
df_new = df[~((df['cnt'] < (q1 - 1.5 * iqr)) | (df['cnt'] > (q3 + 1.5 * iqr)))]
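A quick sanity check (an addition, not in the original post) shows how many rows the filter dropped:

# Compare row counts before and after filtering the response variable
print("rows before:", len(df))
print("rows after :", len(df_new))
print("outliers removed:", len(df) - len(df_new))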

Basic Statistical Normality Test

from scipy.stats import anderson, shapiro, kstest

print('Anderson Darling Test :: ', anderson(df['cnt']))
print('=' * 70)
print('Shapiro Wilk Test :: ', shapiro(df['cnt']))
print('=' * 70)
print('Kolmogorov-Smirnov Test :: ', kstest(df['cnt'], 'norm'))
print('=' * 70)

Statistical Analysis Result [Image Created By Dheeraj Kumar K]
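As a rule of thumb for reading these results (an addition, not from the original post), a Shapiro-Wilk p-value below 0.05 is commonly taken as evidence against normality:

# Interpreting the Shapiro-Wilk result with the conventional 0.05 cutoff
stat, p = shapiro(df['cnt'])
if p < 0.05:
    print("p = %.4f -> reject normality; consider transforming 'cnt'" % p)
else:
    print("p = %.4f -> no evidence against normality" % p)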

4. Feature Engineering: What is a feature, and why do we need to engineer it? All machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly; this is where the need for feature engineering arises. I think feature engineering efforts mainly have two goals:

● Preparing an input dataset compatible with the machine learning algorithm's requirements.

● Improving the performance of machine learning models.

Let's implement it:

from scipy import stats
import pylab
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson power transform; standardize=True also z-scores the output
pt = PowerTransformer(method='yeo-johnson', standardize=True)
df = pt.fit_transform(df)
df = pd.DataFrame(df)

# Restore readable column names (fit_transform returns a bare array)
df = df.rename(columns={0: 'Intent', 1: 'Season', 2: 'hr', 3: 'holiday',
                        4: 'Weekday', 5: 'Workingday', 6: 'Weathersit',
                        7: 'hum', 8: 'Windspeed', 9: 'registered',
                        10: 'cnt', 11: 'year', 12: 'Month', 13: 'Date',
                        14: 'atemp', 15: 'Temp', 16: 'Casual'})
df

## The Above Code will return this output

[Image Created By Dheeraj Kumar K]

def show_hist(x):
    plt.rcParams["figure.figsize"] = 15, 18
    x.hist()

show_hist(df)

## The Above Code will return this output

Histogram After Transformation [Image Created By Dheeraj Kumar K]
In the histogram above you can see major differences in normality compared to the histogram we plotted with the raw data. As I said earlier, sometimes transformations work and sometimes they do not; it depends on the nature of the data. Because the transformation method used here already standardizes the data, separate feature scaling is not required; if we want to do it anyway, there is no issue, as in the sketch below.
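A minimal sketch of that optional scaling step (an addition; redundant here because PowerTransformer(standardize=True) already standardizes):

# Optional explicit feature scaling, shown only for illustration
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)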

5. Model Building: When you are working on a model and want to train it, you obviously have a dataset. But after training, the model has to be evaluated on a test dataset that is different from the training set you used earlier. It might not always be possible to have that much data during the development phase. In such cases, the obvious solution is to split the dataset you have into two sets, one for training and the other for testing, and to do this before you start training your model.


x = df.drop(['cnt'], axis=1)
y = df['cnt']

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=40)

Building the statsmodels OLS model:

import statsmodels.api as sm

model2 = sm.OLS(y_train, x_train).fit()
model2.summary()

StatsModel Output [Image Created By Dheeraj Kumar K]
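One note worth adding (not in the original post): sm.OLS fits without an intercept unless a constant column is appended. A variant with an explicit intercept looks like this:

# sm.add_constant prepends a column of ones so OLS estimates an intercept
x_train_const = sm.add_constant(x_train)
model_with_intercept = sm.OLS(y_train, x_train_const).fit()
print(model_with_intercept.params.head())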

Linear Regression Model

from sklearn.linear_model import LinearRegression

# Default settings; n_jobs=0 is invalid and the `normalize` parameter
# was removed in recent sklearn versions, so neither is passed here
lr = LinearRegression(copy_X=True, fit_intercept=True)
LR = lr.fit(x_train, y_train)
LR_Pred = lr.predict(x_test)

from sklearn import metrics
print('ROOT MEAN SQUARE ERROR:', np.sqrt(metrics.mean_squared_error(y_test, LR_Pred)))

from sklearn.metrics import r2_score
print('R SQUARE:', r2_score(y_test, LR_Pred))

import pickle
pickle.dump(lr, open('LinearRegression.pkl', 'wb'))

Linear Regression Score [Image Created By Dheeraj Kumar K]
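As a quick round-trip check (an addition, not from the original post), the pickled model can be reloaded and compared against the in-memory estimator:

# Reload the pickle and confirm it reproduces the same predictions
loaded = pickle.load(open('LinearRegression.pkl', 'rb'))
print(np.allclose(loaded.predict(x_test), LR_Pred))  # expect True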

FLASK Deployment

What Is Flask?
Flask is a popular Python web framework, meaning it is a third-party Python library used for developing web applications.

Why Flask?
● Easy to use.
● Built-in development server and debugger.
● Integrated unit testing support.
● RESTful request dispatching.
● Extensively documented.

Project Structure
This project has four parts:

1. model.py — This contains the code for the machine learning model that predicts Bike Sharing Demand.

2. app.py — This contains the Flask APIs that receive prediction details through the GUI or API calls, compute the predicted value based on our model, and return it.

3. request.py — This uses the requests module to call the APIs defined in app.py and displays the returned value.

4. HTML/CSS — This contains the HTML template and CSS styling that allow the user to enter bike-sharing details and display the predicted demand.

First, let's code the HTML and CSS:

<html>
<!--From https://codepen.io/frytyler/pen/EGdtg-->
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js"></script>
    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
    <title>Home</title>
    <!-- <link rel="stylesheet" type="text/css" href="../static/css/styles.css"> -->
    <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<style>
    body {
        width: 100%;
        height: 10%;
        font-family: 'Open Sans', sans-serif;
        background: #6b66b8;
        color: #fff;
        font-size: 28px;
        text-align: center;
        letter-spacing: 1.2px;
    }
</style>

<style>
    #ip4 {
        border-radius: 15px 50px 30px;
        border: 2px solid #609;
        padding: 20px;
        width: 300px;
        height: 15px;
    }
</style>

<body>
<div>
    <!--navbar portion-->
    <nav class="navbar navbar-expand-sm bg-dark fixed-top">
        <a class="navbar-brand" href="#"><b>Bike Sharing Demand Prediction</b></a>
        <a class="navbar-brand" href="#"><b><font size="2">BY DHEERAJ KUMAR</font></b></a>
        <ul class="navbar-nav">
            <li class="nav-item"></li>
            <li class="nav-item"></li>
        </ul>
    </nav>
    <br>
</div>

<br>
<div class="login">
    <h1>Bike Sharing Demand Prediction</h1>

    <!-- Main Input For Receiving Query to our ML -->
    <form action="{{ url_for('predict') }}" method="post">
        <input type="text" name="year" placeholder="year" required="required" id="ip4"/>
        <input type="text" name="month" placeholder="month" required="required" id="ip4"/>
        <input type="text" name="holiday" placeholder="holiday" required="required" id="ip4"/>
        <input type="text" name="weekday" placeholder="weekday" required="required" id="ip4"/>
        <input type="text" name="workingday" placeholder="workingday" required="required" id="ip4"/>
        <input type="text" name="weathersit" placeholder="weathersit" required="required" id="ip4"/>
        <input type="text" name="temp" placeholder="temp" required="required" id="ip4"/>
        <input type="text" name="atemp" placeholder="atemp" required="required" id="ip4"/>
        <input type="text" name="hum" placeholder="hum" required="required" id="ip4"/>
        <input type="text" name="windspeed" placeholder="windspeed" required="required" id="ip4"/>
        <input type="text" name="registered" placeholder="registered" required="required" id="ip4"/>
        <input type="text" name="hr" placeholder="hr" required="required" id="ip4"/>
        <br>
        <br>
        <input type="submit" class="btn-primary" value="predict">
    </form>
    <br>
    <br>
    {{ prediction_text }}
</div>
</body>
</html>

Let's code the Flask app:

import numpy as np
from flask import Flask, request, jsonify, render_template
import pickle

app = Flask(__name__)
lr = pickle.load(open('LinearRegression.pkl', 'rb'))

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on the HTML GUI
    '''
    int_features = [int(x) for x in request.form.values()]
    final_features = [np.array(int_features)]
    prediction = lr.predict(final_features)
    output = round(prediction[0], 2)
    return render_template('index.html',
                           prediction_text='Bike Sharing Demand Count should be {}'.format(output))

@app.route('/predict_api', methods=['POST'])
def predict_api():
    '''
    For direct API calls through requests
    '''
    data = request.get_json(force=True)
    prediction = lr.predict([np.array(list(data.values()))])
    output = prediction[0]
    return jsonify(output)

if __name__ == "__main__":
    app.run(debug=True)
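The project structure above names request.py as the client for these APIs. A minimal sketch of it could look like this (the URL, port, and feature values below are assumptions for illustration, not from the original post):

# request.py -- minimal sketch of calling the /predict_api endpoint
import requests

url = 'http://localhost:5000/predict_api'
payload = {"year": 1, "month": 6, "holiday": 0, "weekday": 3,
           "workingday": 1, "weathersit": 1, "temp": 0.5, "atemp": 0.48,
           "hum": 0.6, "windspeed": 0.2, "registered": 150, "hr": 9}
r = requests.post(url, json=payload)
print(r.json())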
Output

Conclusion
In this tutorial, we have walked through an end-to-end implementation of linear regression from scratch, from loading and cleaning the data to deploying the trained model with Flask.
