
Visvesvaraya National Institute of Technology (VNIT), Nagpur

Machine Learning with Python (ECL443)

Lab Report

Submitted by :
Sakshi Gupta (BT19ECE037), Semester 7

Submitted to :
Dr. Saugata Sinha
(Course Instructor)
Department of ECE,
VNIT Nagpur

Linear Regression
Abstract: Machine Learning, as the name suggests, focuses on making machines
(computers) learn. Various algorithms are designed so that, given data, a machine
can recognise patterns in the data and predict values for future, unseen inputs.
The accuracy of an algorithm is judged by how close its predictions are to the
true values.

Introduction: Linear regression is a machine learning algorithm in which the
dependent variable (y) is written as a weighted combination of functions of the
independent variables (x1, x2, x3, ...). The model learns the optimum values of
the weights from the given dataset so as to minimise the error between its
predictions and the true values.
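In the simplest case, where each function is just the variable itself, the model
with three independent variables takes the form

    \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3

where w_0 is the bias (intercept) term and w_1, w_2, w_3 are the weights learned
from the training data.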
Problem Statement:
Given the number of registered vehicles, the number of licensed drivers, and the
number of miles travelled by the vehicles in each state, predict the number of
traffic fatalities using linear regression.

Method/Procedure:

1. The given data file is "Matlab_accidents.mat". Load the file into a dataframe and
   split it into training and testing datasets using the function
   BT19ECE037_train_test_split.
2. Once the data is loaded into a dataframe, we need to recognise the dependent
and independent variables from the data so as to generate X and y.
3. For this problem, the columns Licensed drivers (thousands), Registered vehicles
   (thousands), and Vehicle-miles traveled (millions) are chosen as the
   independent variables, and the column Traffic fatalities is chosen as the
   dependent variable.
4. First, we use the Pseudo Inverse Method. The predicted values for the test
   dataset are computed and then compared with the actual values so as to obtain
   the accuracy.
5. Then, the Gradient Descent Algorithm is used to do the same and its accuracy
   is obtained. (Both weight-finding rules are sketched just after this list.)
6. Now, to change the relationship between the input and output variables, we
   choose an arbitrary relationship in which an extra term is added by squaring
   the Registered vehicles (thousands) column, and the accuracy is calculated again.
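As a quick reference, below is a minimal sketch of the two weight-finding rules
named in steps 4 and 5; the full, runnable versions appear in the Appendix. The
design matrix X is assumed to already carry a leading column of ones for the
bias term.

import numpy as np

def fit_pseudo_inverse(X, y):
    # Closed-form least-squares solution: w = pinv(X) @ y
    return np.linalg.pinv(X) @ y

def fit_gradient_descent(X, y, iterations=1000, learning_rate=0.001):
    # Start from random weights and repeatedly step against the
    # gradient of the squared error 0.5 * ||Xw - y||^2
    w = np.random.randn(X.shape[1])
    for _ in range(iterations):
        gradient = X.T @ (X @ w - y)
        w = w - learning_rate * gradient
    return w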
Results/Discussion:


1. The independent variables chosen are Licensed drivers (thousands), Registered
   vehicles (thousands), and Vehicle-miles traveled (millions). The dependent
   variable chosen is Traffic fatalities. The Root Mean Squared Accuracy obtained
   using the Pseudo Inverse Method is 85.32%.

Figure 1: Predicted and True Values using Pseudo Inverse
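"Root Mean Squared Accuracy" is not defined explicitly in this report; a plausible
reading, given that the targets are normalised to the range [0, 1] in the Appendix
code, is the following (this definition is an assumption, not stated in the original):

# Hypothetical accuracy measure, assuming targets normalised to [0, 1]:
# accuracy (%) = (1 - RMSE) * 100
rms_accuracy = 100.0 * (1.0 - mean_squared_error(testy, y_preds, squared=False))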

2. Using the Gradient Descent Algorithm, the Root Mean Squared Accuracy obtained
   is 91.52%. As we can see, this is a significant improvement over the Pseudo
   Inverse Method.


Figure 2: Predicted and True Values using Gradient Descent
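This result was obtained with the Appendix defaults (1000 iterations, learning
rate 0.001), random initial weights, and a randomly shuffled train/test split,
so the exact accuracy can vary from run to run.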

3. Now, in order to change the relationship, an extra feature is added by
   arbitrarily squaring the Registered vehicles (thousands) column. The RMS
   Accuracy we now obtain is 85.88%, which is nearly equal to the accuracy
   obtained with the Pseudo Inverse Method of plain linear regression.

Figure 3: Predicted and True Values after changing the relationship between input
and output
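The augmented design matrix for this changed relationship can be built as in the
following sketch, mirroring the changed_relationship function in the Appendix:

import numpy as np

def augment_with_squared_column(X):
    # Prepend a bias column of ones and a column holding the square of
    # the second feature (Registered vehicles) to the original features
    ones = np.ones((X.shape[0], 1))
    squared = np.square(X[:, 1]).reshape(-1, 1)
    return np.hstack((ones, squared, X))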

Conclusion: Linear Regression consists of formulating the dependent variable as a
weighted combination of functions of the independent variables. The model then
tries to find the optimum values of the weights so that the error is minimal.
This can be done using two methods: the Pseudo Inverse and Gradient Descent. As
obtained above, the accuracy of the Gradient Descent method is better than that
of the Pseudo Inverse method.

Appendix: The code for linear regression is given below:

# Importing required libraries and Getting the data ready

## Importing required libraries

#!pip install mat4py

import pandas as pd
import os
from mat4py import loadmat
import numpy as np
import matplotlib.pyplot as plt
import sys
from sklearn.metrics import mean_squared_error, mean_absolute_error

"""## Getting the data

**Importing data into a dataframe and Splitting into Train and Test Datasets**
"""

def BT19ECE120_datasetdivshuffle(filepath, traintestratio=0.2):
    ext = os.path.splitext(filepath)[1]
    # Import the data according to the file extension
    if ext == ".csv":
        data = pd.read_csv(filepath)
    elif ext == ".xlsx":
        data = pd.read_excel(filepath)
    elif ext == ".mat":
        load_data = loadmat(filepath)
        datamat = load_data["accidents"]
        data = pd.DataFrame(datamat["hwydata"], columns=datamat["hwyheaders"])
        states = [x[0] for x in datamat["statelabel"]]
        data.insert(loc=1, column="State", value=states)
    else:
        print("File not found")
        return None
    trainfrac = 1 - traintestratio
    # Randomly sample the training rows, then take the remaining rows
    # as the test set so that the two sets do not overlap
    train = data.sample(frac=trainfrac)
    test = data.drop(train.index)
    return train, test

train, test = BT19ECE120_datasetdivshuffle("./Matlab_accidents.mat")
train.head()

"""**Gathering the required Dependent and Independent Variables**

Here, *Licensed drivers (thousands)*, *Registered vehicles (thousands)* and
*Vehicle-miles traveled (millions)* are the independent variables and
*Traffic fatalities* is the dependent variable.
"""

trainX = train[['Licensed drivers (thousands)', 'Registered vehicles (thousands)',
                'Vehicle-miles traveled (millions)']].copy()
trainy = train['Traffic fatalities']
testX = test[['Licensed drivers (thousands)', 'Registered vehicles (thousands)',
              'Vehicle-miles traveled (millions)']].copy()
testy = test['Traffic fatalities']

"""**Normalizing train and test variables**"""

for column in trainX:
    trainX[column] = trainX[column] / np.amax(trainX[column])
trainy = trainy / np.amax(trainy)
for column in testX:
    testX[column] = testX[column] / np.amax(testX[column])
testy = testy / np.amax(testy)

trainX = np.array(trainX)
testX = np.array(testX)
trainy = np.array(trainy)
testy = np.array(testy)

"""# Solving using Pseudo-Inverse Method"""

def linreg_pseudoinv(trainX, trainy, testX, testy):
    # Creating the theta (design) matrices for training and testing
    ones_train = np.ones([trainX.shape[0], 1])
    train_theta = np.hstack((ones_train, trainX))

    ones_test = np.ones([testX.shape[0], 1])
    test_theta = np.hstack((ones_test, testX))

    # Finding the optimum weights
    weights = np.matmul(np.linalg.pinv(train_theta), trainy)

    # Predicting y using the optimum weights
    y_preds = np.matmul(test_theta, weights)
    plt.plot(testy)
    plt.plot(y_preds)
    print("Mean Squared Error: ", mean_squared_error(testy, y_preds))
    print("Root Mean Squared Error: ", mean_squared_error(testy, y_preds, squared=False))
    print("Mean Absolute Error: ", mean_absolute_error(testy, y_preds))

linreg_pseudoinv(trainX, trainy, testX, testy)

"""# Solving Using Gradient Descent"""

def gradient_descent(trainX, trainy, testX, testy, iterations=1000, learning_rate=0.001):
    # Creating the theta (design) matrices for training and testing
    ones_train = np.ones([trainX.shape[0], 1])
    train_theta = np.hstack((ones_train, trainX))

    ones_test = np.ones([testX.shape[0], 1])
    test_theta = np.hstack((ones_test, testX))

    # Considering some random values for the weights initially,
    # one weight per column of the design matrix
    weights = np.random.randn(train_theta.shape[1])

    # Applying gradient descent to find the optimum weights
    for i in range(iterations):
        temp = np.matmul(train_theta, weights) - trainy
        error = np.matmul(train_theta.T, temp)
        weights = weights - learning_rate * error

    # Predicting y using the optimum weights obtained
    y_preds = np.matmul(test_theta, weights)
    plt.plot(testy)
    plt.plot(y_preds)
    print("Mean Squared Error: ", mean_squared_error(testy, y_preds))
    print("Root Mean Squared Error: ", mean_squared_error(testy, y_preds, squared=False))
    print("Mean Absolute Error: ", mean_absolute_error(testy, y_preds))

gradient_descent(trainX, trainy, testX, testy)

"""# Changing relationship between input and output variables"""

def changed_relationship(trainX, trainy, testX, testy):
    # Redefining the theta matrix for a new relationship: an extra
    # column holding the square of the Registered vehicles feature
    ones_train = np.ones([trainX.shape[0], 1])
    squaredcoltrain = np.square(trainX.T[1]).reshape(trainX.shape[0], 1)
    train_theta = np.hstack((ones_train, squaredcoltrain, trainX))

    ones_test = np.ones([testX.shape[0], 1])
    squaredcoltest = np.square(testX.T[1]).reshape(testX.shape[0], 1)
    test_theta = np.hstack((ones_test, squaredcoltest, testX))

    # Finding the optimum weights
    weights = np.matmul(np.linalg.pinv(train_theta), trainy)

    # Predicting y using the optimum weights
    y_preds = np.matmul(test_theta, weights)
    plt.plot(testy)
    plt.plot(y_preds)
    print("Mean Squared Error: ", mean_squared_error(testy, y_preds))
    print("Root Mean Squared Error: ", mean_squared_error(testy, y_preds, squared=False))
    print("Mean Absolute Error: ", mean_absolute_error(testy, y_preds))

changed_relationship(trainX, trainy, testX, testy)
