
Lab 5 - Linear Regression

Python is a popular programming language that is reliable, flexible, easy to learn, free to use on all operating systems, and supported by
both a strong developer community and many free libraries. Python supports all manner of development, including web applications, web
services, desktop apps, scripting, data science, scientific computing, and Jupyter notebooks. It is used by universities, scientists, casual
developers, and professional developers alike.

You can learn more about the language on python.org and Python for Beginners.

An understanding of different number types is important when working with data in analysis. This lab exercise is based on Chapter 1 of the
book Doing Math with Python by Amit Saha. Because all the code is inside code cells, you can run each code cell inline rather than using a
separate Python interactive window.

This Jupyter notebook is written for Python 3.x.

Note: This notebook is designed to have you run code cells one by one, and several code cells contain deliberate errors for
demonstration purposes. As a result, if you use the Cell > Run All command, some code cells past the error won't be run. To
resume running the code in each case, use Cell > Run All Below from the cell after the error.

Comments
Many of the examples in this lab exercise include comments. Comments in Python start with the hash character, #, and extend to the end
of the physical line. A comment may appear at the start of a line or following whitespace or code, but not within a string literal; a hash
character within a string literal is just a hash character. Since comments exist to clarify code and are not interpreted by Python, they may
be omitted when typing in the examples.
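A short sketch illustrating each of these cases (the variable names are arbitrary example values):

# A comment on its own line
hours = 5  # a comment following code on the same line
label = "# this hash is inside a string literal, not a comment"
print(label)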

Simple Linear Regression


The goal of this lab is to build a simple linear regression model from the ground up using NumPy.
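Concretely, the model is a straight line, y = m*x + b, where m is the slope and b is the intercept. The quality of a candidate line is
measured by the mean squared error over the N data points, E(m, b) = (1/N) * sum_i (y_i - (m*x_i + b))^2, and gradient descent
searches for the m and b that minimize it.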

In [1]:
%matplotlib inline

# Imports
import numpy as np
import matplotlib.pyplot as plt

Import the data


Here, we're using a dataset with two columns containing the number of hours studied and the test scores students achieved, respectively.

In [2]:
points = np.genfromtxt('data.csv', delimiter=',')

# Extract the columns
x = np.array(points[:, 0])
y = np.array(points[:, 1])

# Plot the dataset
plt.scatter(x, y)
plt.xlabel('Hours of study')
plt.ylabel('Test scores')
plt.title('Dataset')
plt.show()

Defining the hyperparameters


In [3]:
# Hyperparameters
learning_rate = 0.0001  # step size for each gradient descent update
initial_b = 0           # initial intercept
initial_m = 0           # initial slope
num_iterations = 10     # number of gradient descent iterations

Define cost function


In [4]:
def compute_cost(b, m, points):
    total_cost = 0
    N = float(len(points))

    # Compute the sum of squared errors
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        total_cost += (y - (m * x + b)) ** 2

    # Return the average of the squared errors
    return total_cost / N
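As a quick sanity check, the same cost can be computed without an explicit loop using vectorized NumPy operations. A minimal sketch
(compute_cost_vectorized is a hypothetical helper name, not part of the lab):

def compute_cost_vectorized(b, m, points):
    # Mean squared error over all points, computed with array operations
    return np.mean((points[:, 1] - (m * points[:, 0] + b)) ** 2)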

Define Gradient Descent functions

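Each step moves m and b a small distance opposite the gradient of the cost. With E(m, b) defined as above, the partial derivatives are
dE/dm = -(2/N) * sum_i x_i * (y_i - (m*x_i + b)) and dE/db = -(2/N) * sum_i (y_i - (m*x_i + b)), and each update subtracts
learning_rate times the corresponding gradient. This is exactly what step_gradient below computes.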

In [5]:
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    cost_graph = []

    # For every iteration, record the current cost, then optimize b and m
    for i in range(num_iterations):
        cost_graph.append(compute_cost(b, m, points))
        b, m = step_gradient(b, m, np.array(points), learning_rate)

    return [b, m, cost_graph]

def step_gradient(b_current, m_current, points, learning_rate):
    m_gradient = 0
    b_gradient = 0
    N = float(len(points))

    # Calculate the gradient of the cost with respect to m and b
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        m_gradient += -(2 / N) * x * (y - (m_current * x + b_current))
        b_gradient += -(2 / N) * (y - (m_current * x + b_current))

    # Update the current m and b by stepping opposite the gradient
    m_updated = m_current - learning_rate * m_gradient
    b_updated = b_current - learning_rate * b_gradient

    # Return the updated parameters
    return b_updated, m_updated

Run gradient_descent_runner() to get optimized parameters b and m


In [6]:
b, m, cost_graph = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)

# Print the optimized parameters
print('Optimized b:', b)
print('Optimized m:', m)

# Print the cost with the optimized parameters
print('Minimized cost:', compute_cost(b, m, points))

Optimized b: 0.02963934787473239
Optimized m: 1.4774173755483797
Minimized cost: 112.65585181499746
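As a cross-check, NumPy's np.polyfit computes the closed-form least-squares fit directly. A small comparison sketch (with only 10
iterations and a small learning rate, the gradient descent estimates will generally not match these values exactly):

# Closed-form least-squares fit of a degree-1 polynomial
m_ls, b_ls = np.polyfit(x, y, 1)
print('Least-squares m:', m_ls)
print('Least-squares b:', b_ls)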

Plotting the cost per iteration


In [7]:
plt.plot(cost_graph)
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
plt.title('Cost per iteration')
plt.show()

Gradient descent converges after about 5 iterations. For linear regression the mean-squared-error cost is convex, so this minimum is also
the global minimum.

Plot line of best fit


In [8]:
# Plot the dataset
plt.scatter(x, y)

# Predict y values with the fitted parameters
pred = m * x + b

# Plot the predictions as the line of best fit
plt.plot(x, pred, c='r')
plt.xlabel('Hours of study')
plt.ylabel('Test scores')
plt.title('Line of best fit')
plt.show()
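With the fitted parameters, the model can also predict a score for an unseen study time. A small sketch (the 7-hour input is an arbitrary
example value):

hours = 7  # hypothetical new input
predicted_score = m * hours + b
print('Predicted test score for', hours, 'hours of study:', predicted_score)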
