Simple Linear Regression

● Linear regression is the simplest machine learning algorithm you'll encounter
○ Especially simple linear regression
● It was initially developed in the field of statistics, where it was studied as
a model for understanding the relationship between input and output variables
● It is a linear model: it assumes a linear relationship between the input variables (X) and
the output variable (y)
● Used to predict continuous values (e.g., weight, price...)

Simple vs. Multiple linear regression

● Simple linear regression solves problems with only one input feature
● Multiple linear regression solves problems with multiple input features

Assumptions

1. Linear Assumption: the model assumes the relationship between the input and output variables is linear
2. No Noise: the model assumes that the input and output variables are not noisy, so
remove outliers if possible
3. No Collinearity: the model will overfit when you have highly correlated input
variables (this only matters for multiple linear regression, where there is more than one input)
4. Normal Distribution: the model will make more reliable predictions if your input
and output variables are normally distributed. If that’s not the case, try applying
transforms to your variables to make them look more normal
5. Rescaled Inputs: use scalers or normalizers to make more reliable predictions
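
● Two of the assumptions above (linearity and rescaled inputs) can be checked or applied with a few lines of NumPy. The following is a minimal sketch on hypothetical toy data; the variable names and the choice of z-score standardization are assumptions for illustration, not part of the original notebook:

```python
import numpy as np

# Hypothetical toy data: a roughly linear relationship with some noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 50.0, 100)
y = 3.0 * x + rng.normal(loc=0.0, scale=5.0, size=x.shape)

# A Pearson correlation close to +1 or -1 suggests a linear relationship
pearson_r = np.corrcoef(x, y)[0, 1]

# Z-score standardization: subtract the mean, divide by the standard deviation
x_scaled = (x - x.mean()) / x.std()

print(f'Pearson r: {pearson_r:.3f}')
print(f'Scaled mean: {x_scaled.mean():.3f}, scaled std: {x_scaled.std():.3f}')
```

● For the closed-form solution used in this notebook, rescaling the single input changes the coefficients but not the predictions, so rescaling matters more for models trained iteratively or with several inputs.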

Take-home point

● Training a simple linear regression model is as simple as solving a couple of
equations

Math behind
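● In a nutshell, simple linear regression is based on two coefficients, $\beta_0$ and $\beta_1$, which you need
to find in order to solve a line equation:

Line equation:

$$\hat{y} = \beta_0 + \beta_1 x$$

● The $\beta_1$ coefficient has to be calculated first
● It tells you the slope of the line

B1 coefficient:

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

● The $\beta_0$ coefficient relies on the slope, so it is calculated second
● It represents the Y-intercept: the location at which the line intercepts the Y-axis

B0 coefficient:

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$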

● Let's implement simple linear regression with pure Numpy next
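
● Before the full class, here is a small worked example of the slope and intercept calculation (slope = sum of products of deviations divided by sum of squared X deviations; intercept = mean of y minus slope times mean of X). The five data points are made up for illustration and are small enough to verify by hand:

```python
import numpy as np

# Hypothetical five-point dataset, small enough to check by hand
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()  # 3.0 and 4.0

# Slope: products of deviations sum to 6, squared x deviations sum to 10
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: mean of y minus slope times mean of x
b0 = y_mean - b1 * x_mean

print(f'B1 = {b1:.2f}, B0 = {b0:.2f}')  # B1 = 0.60, B0 = 2.20
```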

Implementation

● You'll need only Numpy to implement the logic
● Matplotlib is used for optional visualizations
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = (14, 7)
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

● The SimpleLinearRegression class is written to follow the familiar Scikit-Learn
syntax
● The coefficients are set to None at the start - __init__() method
● The fit() method calculates the coefficients
● The predict() method essentially implements the line equation
○ Before it does so, it makes sure the coefficients have been
calculated
In [2]:
class SimpleLinearRegression:
    '''
    A class which implements a simple linear regression model.
    '''
    def __init__(self):
        # Coefficients stay None until fit() is called
        self.b0 = None
        self.b1 = None

    def fit(self, X, y):
        '''
        Used to calculate slope and intercept coefficients.

        :param X: array, single feature
        :param y: array, true values
        :return: None
        '''
        numerator = np.sum((X - np.mean(X)) * (y - np.mean(y)))
        denominator = np.sum((X - np.mean(X)) ** 2)
        self.b1 = numerator / denominator
        self.b0 = np.mean(y) - self.b1 * np.mean(X)

    def predict(self, X):
        '''
        Makes predictions using the simple line equation.

        :param X: array, single feature
        :return: array, predicted values
        '''
        # Compare with None explicitly: a fitted coefficient of exactly 0 is valid
        if self.b0 is None or self.b1 is None:
            raise Exception('Please call `SimpleLinearRegression.fit(X, y)` before making predictions.')
        return self.b0 + self.b1 * X
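
● One way to sanity-check the closed-form calculation used in fit() is to compare it with NumPy's own degree-1 polynomial fit. This is a sketch on hypothetical seeded data; np.polyfit is used here only as an independent reference and is not part of the original notebook:

```python
import numpy as np

# Hypothetical data: a line with slope 2 and intercept 5, plus Gaussian noise
rng = np.random.default_rng(42)
X = np.arange(1, 101, dtype=float)
y = 2.0 * X + 5.0 + rng.normal(0.0, 3.0, size=X.shape)

# Closed-form coefficients, same formulas as in fit()
b1 = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = y.mean() - b1 * X.mean()

# np.polyfit returns coefficients from the highest power down: [slope, intercept]
slope, intercept = np.polyfit(X, y, deg=1)

print(f'Closed form: B0 = {b0:.4f}, B1 = {b1:.4f}')
print(f'np.polyfit:  B0 = {intercept:.4f}, B1 = {slope:.4f}')
```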

Testing

● Let's create some dummy data
○ X contains a list of numbers between 1 and 300 (1, 2, 3, ..., 299,
300)
○ y contains normally distributed values centered around X with
standard deviation of 20
● The source data is then visualized:
In [13]:
X = np.arange(start=1, stop=301)
y = np.random.normal(loc=X, scale=20)

plt.scatter(X, y, s=200, c='#087E8B', alpha=0.65)


plt.title('Source dataset', size=20)
plt.xlabel('X', size=14)
plt.ylabel('Y', size=14)
plt.show()
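
● The data above is generated without a random seed, so the exact values change on every run. If you want reproducible numbers, one option is a seeded generator. This is a sketch; the seed value 42 is an arbitrary assumption, and the results will not match the figures quoted in this notebook:

```python
import numpy as np

# Seeded generator so that the same y values are produced on every run
rng = np.random.default_rng(seed=42)

X = np.arange(start=1, stop=301)
y = rng.normal(loc=X, scale=20)  # one draw per element of X, centered on it

# The spread of y around X should be close to the requested 20
noise_std = np.std(y - X)
print(X.shape, y.shape)
print(f'Empirical noise std: {noise_std:.2f}')
```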

● For validation's sake, we'll split the dataset into training and testing parts:
In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
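
● To see what a shuffled 80/20 split does under the hood, here is a NumPy-only sketch. The helper simple_train_test_split is hypothetical and written for illustration; it is not Scikit-Learn's implementation and will not select the same rows as train_test_split:

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle the indices, then cut off a test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))          # shuffled row positions
    n_test = int(round(len(X) * test_size))    # number of test rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(start=1, stop=301)
y = X * 1.0  # placeholder targets, just to show the shapes

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(X_train.shape, X_test.shape)  # (240,) (60,)
```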

● You can now initialize and train the model, and afterwards make predictions:
In [5]:
model = SimpleLinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)

● Here's how you can get the coefficients:
In [6]:
model.b0, model.b1

● These are the predictions:
In [7]:
preds

● And these are the original values
● Original and predicted differ, but not much
In [8]:
y_test

● You can now evaluate the model by calculating RMSE
○ Root Mean Squared Error
● The RMSE is 21.35, so a typical prediction is off by roughly 21 units
● It makes sense, as the noise added to y has a standard deviation of 20
In [9]:
from sklearn.metrics import mean_squared_error

rmse = lambda y, y_pred: np.sqrt(mean_squared_error(y, y_pred))


rmse(y_test, preds)
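
● RMSE can also be computed with NumPy alone: square the errors, average them, and take the square root. The function name rmse_numpy and the four sample values below are made up for illustration:

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the mean of squared errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical values: errors are 1, 0, -1, -2, so squared errors are 1, 0, 1, 4
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

score = rmse_numpy(y_true, y_pred)  # sqrt(6 / 4) = sqrt(1.5)
print(f'RMSE: {score:.4f}')  # RMSE: 1.2247
```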

Visualize the Best-Fit line

● If you re-train the model on the entire dataset and then make predictions for the
entire dataset, you'll get the best fit line
● You can then visualize this line with Matplotlib:
In [14]:
model_all = SimpleLinearRegression()
model_all.fit(X, y)
preds_all = model_all.predict(X)

plt.scatter(X, y, s=200, c='#087E8B', alpha=0.65, label='Source data')
plt.plot(X, preds_all, color='#000000', lw=3, label=f'Best fit line > B0 = {model_all.b0:.2f}, B1 = {model_all.b1:.2f}')
plt.title('Best fit line', size=20)
plt.xlabel('X', size=14)
plt.ylabel('Y', size=14)
plt.legend()
plt.show()
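
● Two properties of a least-squares line with an intercept make a handy numerical check on the best fit line: the residuals average to zero, and the line passes through the point (mean of X, mean of y). The sketch below verifies both on hypothetical seeded data rather than the notebook's own values:

```python
import numpy as np

# Hypothetical seeded data in the same style as the dummy dataset
rng = np.random.default_rng(7)
X = np.arange(start=1, stop=301)
y = rng.normal(loc=X, scale=20)

# Closed-form least-squares coefficients
b1 = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = y.mean() - b1 * X.mean()

# Residuals: observed values minus values on the fitted line
residuals = y - (b0 + b1 * X)

print(f'Mean residual: {residuals.mean():.2e}')  # zero up to rounding error
print(f'Line at mean X: {b0 + b1 * X.mean():.4f}, mean y: {y.mean():.4f}')
```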

Comparison with Scikit-Learn

● We want to know if our model is good, so let's compare it with LinearRegression
model from Scikit-Learn
● The input data must be reshaped beforehand:
In [11]:
from sklearn.linear_model import LinearRegression

sk_model = LinearRegression()
sk_model.fit(np.array(X_train).reshape(-1, 1), y_train)
sk_preds = sk_model.predict(np.array(X_test).reshape(-1, 1))
sk_model.intercept_, sk_model.coef_

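● Our coefficients were (-1.357484948041531, 1.0026529556316826)
● Not bit-for-bit identical, but the differences are within floating-point rounding error, since both models solve the same least-squares problem
● Let's check the RMSE: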
In [12]:
rmse(y_test, sk_preds)

21.351850699502783
● Ours was 21.351850699502787, so nearly identical.
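
● Two details from this comparison can be shown in isolation: what reshape(-1, 1) does to a one-dimensional array (Scikit-Learn estimators expect a 2D feature matrix of shape (n_samples, n_features)), and how to compare two floats that differ only in the last digits. The RMSE values are the ones quoted above; the variable names are assumptions for illustration:

```python
import numpy as np

# reshape(-1, 1) turns a flat array of n values into n rows with 1 column
X_1d = np.arange(1, 6)
X_2d = X_1d.reshape(-1, 1)
print(X_1d.shape, X_2d.shape)  # (5,) (5, 1)

# RMSE values quoted in the text above
ours = 21.351850699502787
sk_value = 21.351850699502783

# np.isclose compares within a tolerance instead of requiring exact equality
same = np.isclose(ours, sk_value)
print(same)  # True
print(abs(ours - sk_value))  # tiny difference from floating-point rounding
```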
