Nova Southeastern University

College of Engineering and Computing


MMIS 643 Data Mining
Winter 2016

Multiclass Classification using Logistic Regression applied to


Handwriting Recognition: The Technical Details
Table of Contents

1. Introduction
2. Review of pertinent Supervised Learning algorithms
2.1 Linear Regression
2.2 Cost Function and the objective of Linear Regression
2.3 The use of matrices in solving models
2.4 Logistic Regression and the use of the Sigmoid function
2.5 Logistic Regression Cost Function
2.6 Solving the Regression Models using Gradient Descent
2.7 Solving the problem of Over-fitting with Regularization
2.8 Multiclass Classification: One Vs All
3. Solving a real-world Multiclass Classification Problem
4. Conclusion
5. Works Cited
6. Table of Figures
7. Annex 1 – Code

1. Introduction
For this project, I have chosen to study and examine Logistic Regression in a Multiclass Classification problem. The objective of this document is to expose the mathematical details behind a common supervised machine learning algorithm: Multiclass Classification.

The mathematical concepts covered in this document are simplified and introduced gradually. I start by explaining the Linear Regression method, followed by the notion of cost function and the objective behind solving Linear Regression problems. I use this as a base to explain Logistic Regression and how it can be applied to Multiclass Classification problems. Other concepts involved in solving the different models, like Gradient Descent and Regularization, are also introduced along with their mathematical formulation. Finally, a simple program written in Octave uses these concepts to demonstrate how Logistic Regression applied to a Multiclass Classification problem can be used in a real-world situation: the recognition of handwritten characters.

2. Review of pertinent Supervised Learning algorithms
2.1 Linear Regression

Linear regression is based on finding the polynomial function h_θ(x) that best fits a set of given data points (also called examples). This function, also called the hypothesis, can then be used to predict the outcome of any given set of features:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \dots + \theta_n x_n \quad (2.1)$$

Let m be the number of training examples and n be the number of features (or predictors) predicting the outcome. Let y be the outcome and y^(i) the value of the outcome in our i-th example, where i ∈ {1..m}. Let's say we have n variables predicting the values of y; then x_j^(i), where j ∈ {1..n}, represents the value of the j-th feature in our i-th example.

2.2 Cost Function and the objective of Linear Regression

The objective of linear regression is to find the value of each θ_j so that h_θ(x^(i)) is as close as possible to y^(i), thus minimizing h_θ(x^(i)) − y^(i) for each i of our example set. In order to measure the accuracy of the model, we are going to define a cost function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \quad (2.2)$$

The cost function associated with our hypothesis is half the mean of the squared errors. (Ng, 2004) explains why the least-squares cost function is a reasonable choice to estimate the cost. The objective of linear regression is to find the values of all θ_j in h_θ(x) that will minimize the cost function J(θ).
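
As a minimal illustration (not part of the project code, and the function name linearCost is chosen only for this sketch), the cost in (2.2) can be computed in a vectorized way in Octave:

function J = linearCost(theta, X, y)
  % X is m x (n+1) with a leading column of ones, theta is (n+1) x 1, y is m x 1
  m = length(y);                     % number of training examples
  errors = X * theta - y;           % h_theta(x^(i)) - y^(i) for every example
  J = (errors' * errors) / (2 * m); % equation (2.2)
end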

2.3 The use of matrices in solving models

Let

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \dots + \theta_n x_n \quad (2.3)$$

be our linear regression model to solve. Our hypothesis h_θ(x) can also be expressed using a vector notation:

$$\text{If } x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \text{ and } \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}, \text{ then } h_\theta(x) = \theta^T x \text{ with } x_0 = 1 \quad (2.4)$$
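
As a tiny illustration of the vector notation in (2.4) (the numeric values are made up for this sketch), the following Octave fragment shows that θᵀx reproduces the expanded sum of (2.3):

theta = [3; 1; 2];        % theta_0, theta_1, theta_2
x     = [1; 5; 7];        % x_0 = 1, followed by the feature values x_1 and x_2
h     = theta' * x;       % theta^T x = 3 + 1*5 + 2*7 = 22, same as equation (2.3)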

2.4 Logistic Regression and the use of the Sigmoid function

Logistic regression is based on finding a polynomial function

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \dots + \theta_n x_n \quad (2.5)$$

that divides a set of data points into 2 categories. This polynomial function, different from the hypothesis of the linear regression, is called the decision boundary. In Figure 1, it is the first-degree polynomial (or line) that separates the black marks (our '1' category) from the yellow marks (our '0' category).

Figure 1 Students are admitted or not based on the scores of 2 exams (Stanford University, 2015)

Let the variable x_1 be the Exam 1 score and the variable x_2 be the Exam 2 score. According to Figure 1, we can say that:

y = 1 when x_1 + x_2 ≥ 92, and
y = 0 when x_1 + x_2 < 92,

or equivalently,

y = 1 when x_1 + x_2 − 92 ≥ 0, and
y = 0 when x_1 + x_2 − 92 < 0.

Finally, using the vector notation from (2.4), we can write:

y = 1 when θ^T x ≥ 0, and
y = 0 when θ^T x < 0,

with

$$\theta = \begin{bmatrix} -92 \\ 1 \\ 1 \end{bmatrix} \quad \text{and} \quad x_0 = 1$$

In order to solve a classification problem, we need to define a function h_θ(x) that will predict a value of 0 or 1, given a set of features. This function has to be a function of the boundary function θ^T x so that, at least approximately,

h_θ(θ^T x) = 1 when θ^T x ≥ 0, and
h_θ(θ^T x) = 0 when θ^T x < 0.

For this, we will use the sigmoid function:

$$h_\theta(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \quad (2.6)$$

Figure 2 Sigmoid function using different values of θ (Kohl)

Shown in Figure 2 is the sigmoid function using different values of θ. The function shows that:

0.5 ≤ h_θ(x) ≤ 1 when θ^T x ≥ 0, and
0 ≤ h_θ(x) < 0.5 when θ^T x < 0.

Therefore (2.6) becomes a good approximation of the probability that the output equals 1 given an input x, and this is why we use it as our prediction function. It is called the logistic response function (Galit Shmueli, 2010, p. 195) and is used as our hypothesis for solving the logistic regression.
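
As an illustrative sketch (the exam scores are made-up values, and the computation mirrors the sigmoid.m listed in Annex 1), the hypothesis in (2.6) can be evaluated for the decision boundary of Figure 1 as follows:

theta = [-92; 1; 1];                 % decision boundary x1 + x2 - 92 = 0
x     = [1; 50; 60];                 % x_0 = 1, Exam 1 = 50, Exam 2 = 60 (made-up scores)
h     = 1 / (1 + exp(-theta' * x));  % sigmoid(18), approximately 1, so we predict y = 1 (admitted)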

Note that boundary functions can be of any degree. Figure 3 shows a boundary using a polynomial of degree greater than 1.

Figure 3 Acceptance/Rejection of microchips and test results (Stanford University, 2015)

2.5 Logistic Regression Cost Function

As we mentioned earlier, the goal of linear regression is to pick θ in h_θ(x) so that it minimizes the cost function J(θ), based on a least-squares cost function. How do we formulate J(θ) for logistic regression?

Given that h_θ(x) is now non-linear, because of the sigmoid function, the least-squares cost function becomes a noisy non-convex function (a function with multiple local minima). It is important to formulate a cost function 1) that gives a good approximation of the cost (for obvious reasons) and 2) that mathematically enables us to find its minimum. In the next section, we will explain one particular method to find a function's minima; it will then become clear why a cost function with multiple minima would not be a good fit for solving a regression problem. For now, let's just assume that we have to find another function, more "convex" - that is, more of a single "bowl" shape - to model the cost of our hypothesis.

In order to measure the accuracy of our prediction function h_θ(x), we are going to intuitively formulate an error that best describes the 2-class classification problem:

if y = 1, then error = −log(h_θ(x))  (2.7)
if y = 0, then error = −log(1 − h_θ(x))  (2.8)

If we predict h_θ(x) = 0 but our example shows y = 1, the value of (2.7) tends toward ∞; on the other hand, if y = 1 and we predict h_θ(x) = 1, then according to (2.7) the error is going to be 0. Similarly, if we predict h_θ(x) = 1 but our example shows y = 0, the value of (2.8) tends toward ∞; on the other hand, if y = 0 and we predict h_θ(x) = 0, then according to (2.8) the error is going to be 0. Now that we have validated the assumption, let's translate this into a cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \quad (2.9)$$

where

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

and where

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}, \quad \text{and } x_0 = 1$$

The objective of the logistic regression is to find the values of all θ_j, j ∈ {0..n}, that will minimize the cost function J(θ) in (2.9).
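
As a minimal, unregularized sketch of (2.9) (the regularized version actually used in the project is lrCostFunction.m in Annex 1, and the function name logisticCost is chosen only for this sketch), the cost can be computed in Octave as:

function J = logisticCost(theta, X, y)
  % X is m x (n+1) with a leading column of ones, theta is (n+1) x 1, y is m x 1 of 0/1 labels
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                  % h_theta(x^(i)) for every example
  J = -(y' * log(h) + (1 - y)' * log(1 - h)) / m;  % equation (2.9)
end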

2.6 Solving the Regression Models using Gradient Descent

A good way to localize the minima of a function is to visualize the function with a contour plot or a graph. Figure 4 and Figure 5 show a surface plot and a contour plot of a simple linear regression cost function J(θ_0, θ_1). Looking at these visual representations of J(θ_0, θ_1) gives us a good approximation of where to find θ_0 and θ_1.

Figure 4 Cost function visualization using a 3D surface plot (Stanford University, 2015)
Figure 5 Cost function visualization using a contour plot (Stanford University, 2015)

But one can immediately notice that if n > 2, with θ_j ≠ 0, it becomes difficult to visualize J(θ).

What if we had a program that selectively picked and tried values of θ_j until the combination that minimizes J(θ) is found? There are many numerical methods that do just that. Among them, and probably the easiest to understand, is Gradient Descent. Gradient Descent, or Steepest Descent, is a method to solve equations that iteratively computes a coefficient value θ_j with the use of partial derivatives, making sure that each iteration computes a new value of θ_j, in the vicinity of the previous θ_j, that ensures the convergence of J(θ) toward its nearest local minimum (Curry, 1944). Gradient Descent applies this calculation simultaneously for every j = 0, ..., n.

Let's look at the Gradient Descent algorithm:

Repeat {
    $$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
} until θ_j converges  (2.10)

In (2.10), α is called the learning rate, and ∂J(θ)/∂θ_j is called the derivative or gradient. The learning rate controls the size of the step taken while updating the value of θ_j from one iteration to the next; the derivative makes sure that the next value of θ_j is taken in the direction of the minimum of J(θ). One can immediately see that, for gradient descent to work, the function has to be smooth and convex enough not to trap the algorithm in a poor local minimum.

Without going into the details of finding the derivative, when we replace its value in (2.10), the gradient descent algorithm becomes:

Repeat {
    $$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
} until θ_j converges  (2.11)

with, of course,

$$h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}$$

When using gradient descent, it is part of the formulation of the model to pick a value of α that will 1) be small enough to make the algorithm converge toward a value of θ_j without oscillating, and 2) be large enough that convergence will happen within a reasonable computing cost.
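
A minimal sketch of the update (2.11) in Octave might look like the following (the function name and the choice of alpha and num_iters are made up for this illustration; X, y, and theta follow the same conventions as before):

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  % X is m x (n+1) with a leading column of ones, y is m x 1, theta is (n+1) x 1
  m = size(X, 1);
  for iter = 1:num_iters
    h     = 1 ./ (1 + exp(-X * theta));  % predictions for all m examples
    grad  = (X' * (h - y)) / m;          % partial derivatives dJ/dtheta_j, for all j at once
    theta = theta - alpha * grad;        % simultaneous update, equation (2.11)
  end
end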

2.7 Solving the problem of Over-fitting with Regularization

If we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples. Picking a higher-degree polynomial to define our decision boundary can also lead to over-fitting. Figure 6, Figure 7, and Figure 8 illustrate how choosing a higher-degree polynomial as our decision boundary will fit most training data but may fail to be a good predictor.

Figure 6 Underfitting
Figure 7 Good fit
Figure 8 Over-fitting (this is just an example; the curve doesn't render the actual function)

Let's say we choose to use a higher-degree polynomial to define our decision boundary and then use a mechanism to reduce the effect of the higher-degree terms. The effect would be a smoother, less noisy decision boundary. The solution is to minimize the risk of over-fitting by implementing a mechanism to shrink all of the coefficients except θ_0. Let's take our cost function (2.9)

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and add a 'shrinking' term that reduces the effect of every θ_j where 1 ≤ j ≤ n:

$$\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \quad (2.12)$$

(2.12) is called the regularization term, and λ is called the regularization parameter. λ controls a 'trade-off' between 1) fitting the training data well by having a lot of features and higher-degree polynomial terms and 2) keeping the number of features low and reducing the effect of the higher-degree polynomial terms.

If we introduce the regularization term, (2.9) becomes:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \quad (2.13)$$

If λ is very large, we end up heavily penalizing all the θ_j where 1 ≤ j ≤ n and we end up with a hypothesis close to h_θ(x) = θ_0, which is a flat line that under-fits the data.

So, if we add regularization to our gradient descent algorithm (keeping in mind that we do not want to regularize θ_0), (2.11) becomes:

Repeat {
    $$\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$
    $$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \quad \text{for } j = 1, \dots, n$$
} until θ_j converges  (2.14)
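
As a sketch of how one regularized update (2.14) can be computed without penalizing θ_0 (this mirrors the reg_term(1) = 0 approach used in lrCostFunction.m in Annex 1; the function name is chosen only for this sketch):

function theta = regularizedStep(X, y, theta, alpha, lambda)
  % one regularized gradient descent step, equation (2.14)
  m      = size(X, 1);
  h      = 1 ./ (1 + exp(-X * theta)); % predictions
  grad   = (X' * (h - y)) / m;         % unregularized gradient, all j at once
  reg    = (lambda / m) * theta;       % regularization term (lambda / m) * theta_j
  reg(1) = 0;                          % do not regularize theta_0
  theta  = theta - alpha * (grad + reg);
end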

2.8 Multiclass Classification: One Vs All

After understanding Logistic Regression, it becomes easy to understand how we can use it to classify a set of features into one out of multiple categories using the One-vs-All method. Let's say we have k classes; the idea behind the One-vs-All method is to use Logistic Regression to classify one class against the (k − 1) other classes. In other words, Multiclass Classification using Logistic Regression consists of training k classifiers, one for each class, and, on a new input x, making a prediction by picking the class whose classifier returns the largest h_θ(x). Remember, 0 ≤ h_θ(x) ≤ 1.
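
As an illustrative sketch of the prediction step (the full version used in the project is predictOneVsAll.m in Annex 1; the function name here is chosen only for this sketch), once the k classifiers are trained and their coefficients stored as the rows of a matrix all_theta, picking the best class for every example can be done as follows:

function p = predictAllClasses(all_theta, X)
  % X is m x (n+1) with the intercept column, all_theta is k x (n+1), one row per classifier
  h = 1 ./ (1 + exp(-(X * all_theta')));  % m x k matrix of h_theta(x) values
  [~, p] = max(h, [], 2);                 % for each example, pick the class with the largest h_theta(x)
end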

3. Solving a real-world Multiclass Classification Problem

We now have some mathematical knowledge to solve a real Multiclass Classification problem. We are going to use all the concepts presented in the previous chapter and apply them to handwriting recognition.

Let's look at our dataset. It consists of m = 5000 randomly organized 20x20-pixel images representing handwritten single-digit numbers from 0 to 9 and their corresponding labels y, with y ∈ {0, 1, 2, .., 9}.

Our features {x_1, x_2, .., x_n} are floating-point numbers representing the grayscale value of a single pixel in an input image, and n = 400 is the total number of pixels per image.

We have to define k = 10 classifiers, one for each possible class (0 to 9).

The classification algorithm is implemented using Octave¹. The entry point of the program is located in the file ex3.m. The function oneVsAll trains a classifier for each digit, whose objective is to provide a set of coefficients θ that minimize the cost function associated with the Logistic Regression for that classifier. The function predictOneVsAll is used to apply all classifiers to our examples and select the one with the best prediction.

¹ The code provided in this project was written by me using a template published by Stanford University in the context of their Machine Learning Course offered on Coursera.

The for loop located in the function oneVsAll is used to train each classifier. For this particular application, I took advantage of Octave's fminunc, which implements an alternative to the gradient descent method called a quasi-Newton method. This method, which is also based on the computation of ∂J(θ)/∂θ_j but does not require the coefficient α, is far more efficient and significantly faster than gradient descent (Shanno, 1970). fminunc takes as parameters a handle to the function that computes the cost J(θ) and the gradient ∂J(θ)/∂θ_j for given θ values (here lrCostFunction, with the regularization parameter λ fixed), along with an initial θ, and computes the final set of θ for the current classifier; a minimal sketch of this call appears after the steps below. Here are the steps that summarize the implementation process:

Step 1: Code a visualizing function. It's always better to be able to visualize our data before working on a dataset. So, to start, we display a random set of 100 handwritten digits taken among the 5000 examples.

Step 2: Code the sigmoid function, which takes a vector of dimension m as an input and returns a vector of dimension m containing the element-wise sigmoid values.

Step 3: Code the cost function and the gradient. These two operations are done in the lrCostFunction.m file. The inputs of this function are a vector of dimension n + 1 representing the coefficients θ, an m x (n + 1) matrix X holding the values of {x_1, x_2, .., x_n} for our m examples (plus the intercept column of ones), and a vector y of dimension m holding the outcome values (0 or 1) corresponding to the classifier being trained. For example, if we are training the classifier for the digit '7', then the i-th element of y is 1 only if the i-th training example is labeled '7'. The outputs of lrCostFunction are J, a scalar containing the value of the cost function for the current values of the coefficients θ, and grad, a vector of dimension n + 1 containing the value of the derivative term of the gradient descent for the current values of the coefficients θ.

Step 4: Predict the outcome by applying all the trained classifiers to our 5000 examples. For each example, pick the best classifier and report the accuracy of the model by comparing the prediction to the example's label.
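
As promised above, here is a minimal sketch of the training call, following the pattern used in oneVsAll.m in Annex 1 (it assumes X, with the intercept column already added, y, lambda, n, and num_labels are defined as in that file):

options       = optimset('GradObj', 'on', 'MaxIter', 50);  % tell fminunc we supply the gradient
initial_theta = zeros(n + 1, 1);
all_theta     = zeros(num_labels, n + 1);
for c = 1:num_labels
  % train the "class c vs. all the others" classifier: the labels passed in are (y == c)
  theta = fminunc(@(t)(lrCostFunction(t, X, (y == c), lambda)), initial_theta, options);
  all_theta(c, :) = theta';
end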

4. Conclusion
This method can be extended to the whole alphabet by training 26 more classifiers and providing more examples to train those classifiers. This can be done by simply changing a few constants in the code.

Not in the scope of this document, but necessary in order to pursue this work, the data could have been partitioned into training and validation sets in order to measure the effect of over-fitting and under-fitting.

The set error, which consists in running all the examples of a set through our hypothesis and computing the error

$$J_{set}(\theta) = -\frac{1}{m_{set}} \sum_{i=1}^{m_{set}} \left[ y_{set}^{(i)} \log h_\theta(x_{set}^{(i)}) + \left(1 - y_{set}^{(i)}\right) \log\left(1 - h_\theta(x_{set}^{(i)})\right) \right]$$

for each set, indicates whether our model is over-fitting our training examples or is performing equally well on new data.
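
A minimal sketch of such a check for one of the binary classifiers could look like the following in Octave (the function name and the Xtrain/ytrain and Xval/yval split variables are hypothetical, not part of the project code):

function J = setError(theta, Xset, yset)
  % unregularized cost (2.9) of one binary classifier, evaluated on a given set
  % Xset is m_set x (n+1) with the intercept column, yset is m_set x 1 of 0/1 labels
  m = length(yset);
  h = 1 ./ (1 + exp(-Xset * theta));
  J = -(yset' * log(h) + (1 - yset)' * log(1 - h)) / m;
end

Comparing setError(theta, Xtrain, ytrain) with setError(theta, Xval, yval) would then suggest over-fitting when the validation error is much larger than the training error.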

Training and validation errors reflect the overall performance of our model and give us a comparison base in the model selection process, allowing us to fine-tune the model by trying different options and comparing the resulting performance. In this case, we could have investigated the effect on the performance of the system of choosing a lower/higher-degree polynomial for our boundary function, picking a higher/lower value for the regularization parameter λ, or increasing the number of samples.

5. Works Cited

Curry, H. B. (1944). The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics, 2(3), 258-261.
Kohl, N. (n.d.). Role of bias in neural networks. Stack Overflow. Retrieved from http://natekohl.net/media/sigmoid-scale.png
Ng, A. (2004). Supervised learning - CS229 lecture notes. Retrieved April 5, 2016, from Stanford.edu: http://cs229.stanford.edu/notes/cs229-notes1.pdf
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111), 647-656.
Shmueli, G., Patel, N. R., & Bruce, P. C. (2010). Data Mining for Business Intelligence. Hoboken, New Jersey: Wiley.
Stanford University. (2015). Machine Learning [online course]. Retrieved April 3, 2016, from http://www.coursera.org

6. Table of Figures
Figure 1 Students are admitted or not based on the scores of 2 exams (Stanford University, 2015)
Figure 2 Sigmoid function using different values of θ (Kohl)
Figure 3 Acceptance/Rejection of microchips and test results (Stanford University, 2015)
Figure 4 Cost function visualization using a 3D surface plot (Stanford University, 2015)
Figure 5 Cost function visualization using a contour plot (Stanford University, 2015)
Figure 6 Underfitting
Figure 7 Good fit
Figure 8 Over-fitting (this is just an example; the curve doesn't render the actual function)

7. Annex 1 – Code
File ex3.m
%% One-vs-all

%% Initialization
clear ; close all; clc

%% Setup the parameters you will use for this part of the exercise
input_layer_size = 400; % 20x20 Input Images of Digits
num_labels = 10; % 10 labels, from 1 to 10
% (note that we have mapped "0" to label 10)

%% =========== Loading and Visualizing Data =============


% We start by first loading and visualizing the dataset.

% Load Training Data


fprintf('Loading and Visualizing Data ...\n')

load('ex3data1.mat'); % training data stored in arrays X, y


m = size(X, 1);

% Randomly select 100 data points to display


rand_indices = randperm(m);
sel = X(rand_indices(1:100), :);

displayData(sel);

fprintf('Program paused. Press enter to continue.\n');


pause;

%% ============ Vectorize Logistic Regression ============

fprintf('\nTraining One-vs-All Logistic Regression...\n')

lambda = 0.1;
[all_theta] = oneVsAll(X, y, num_labels, lambda);

fprintf('Program paused. Press enter to continue.\n');


pause;

%% ================ Part 3: Predict for One-Vs-All ================


% After ...
pred = predictOneVsAll(all_theta, X);

fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);

File DisplayData.m
function [h, display_array] = displayData(X, example_width)
%DISPLAYDATA Display 2D data in a nice grid
% [h, display_array] = DISPLAYDATA(X, example_width) displays 2D data
% stored in X in a nice grid. It returns the figure handle h and the
% displayed array if requested.

% Set example_width automatically if not passed in


if ~exist('example_width', 'var') || isempty(example_width)
example_width = round(sqrt(size(X, 2)));
end

% Gray Image
colormap(gray);

% Compute rows, cols


[m n] = size(X);
example_height = (n / example_width);

% Compute number of items to display


display_rows = floor(sqrt(m));
display_cols = ceil(m / display_rows);

% Between images padding


pad = 1;

% Setup blank display


display_array = - ones(pad + display_rows * (example_height + pad), ...
pad + display_cols * (example_width + pad));

% Copy each example into a patch on the display array


curr_ex = 1;
for j = 1:display_rows
for i = 1:display_cols
if curr_ex > m,
break;
end
% Copy the patch

% Get the max value of the patch


max_val = max(abs(X(curr_ex, :)));
display_array(pad + (j - 1) * (example_height + pad) + (1:example_height), ...
              pad + (i - 1) * (example_width + pad) + (1:example_width)) = ...
    reshape(X(curr_ex, :), example_height, example_width) / max_val;
curr_ex = curr_ex + 1;
end
if curr_ex > m,
break;
end
end

% Display Image
h = imagesc(display_array, [-1 1]);

% Do not show axis


axis image off

drawnow;

end

File oneVsAll.m
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta
%corresponds to the classifier for label i
% [all_theta] = ONEVSALL(X, y, num_labels, lambda) trains num_labels
% logistic regression classifiers and returns each of these classifiers
% in a matrix all_theta, where the i-th row of all_theta corresponds
% to the classifier for label i

% Some useful variables


m = size(X, 1);
n = size(X, 2);

% You need to return the following variables correctly


all_theta = zeros(num_labels, n + 1);

% Add ones to the X data matrix


X = [ones(m, 1) X];

% Set Initial theta


initial_theta = zeros(n + 1, 1);

% Set options for fminunc


options = optimset('GradObj', 'on', 'MaxIter', 50);

% Run fminunc to obtain the optimal theta


% This function will return theta and the cost
for c = 1:num_labels
[theta] = ...
fminunc (@(t)(lrCostFunction(t, X, (y == c), lambda)), initial_theta, options);
all_theta(c, :) = theta';
end;

end

File lrCostFunction.m
function [J, grad] = lrCostFunction(theta, X, y, lambda)

%LRCOSTFUNCTION Compute cost and gradient for logistic regression with


%regularization
% J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.

% Initialize some useful values


m = length(y); % number of training examples

J = 0;
grad = zeros(size(theta));

n = length(theta);
h = sigmoid(X * theta);
J1 = sum(-y .* log(h) - (1-y) .* log(1-h)) / m;
J2 = sum(theta(2:n,:).^2) * lambda / (2*m);
J = J1 + J2;

reg_term = (lambda * theta) / m;


reg_term(1) = 0;
grad = X' * (h - y) / m;
grad = grad + reg_term;

% =============================================================

grad = grad(:);

end

File sigmoid.m
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
% g = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));


end

File predictOneVsAll.m
function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels
%are in the range 1..K, where K = size(all_theta, 1).
% p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
% for each example in the matrix X. Note that X contains the examples in
% rows. all_theta is a matrix where the i-th row is a trained logistic
% regression theta vector for the i-th class. You should set p to a vector
% of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
% for 4 examples)

m = size(X, 1);
num_labels = size(all_theta, 1);

% You need to return the following variables correctly


p = zeros(size(X, 1), 1);

% Add ones to the X data matrix


X = [ones(m, 1) X];

all_predicts = all_theta * X';   % num_labels x m matrix: one score per classifier per example

% for each example (column), keep the index of the classifier with the highest score;
% transpose so that p is a column vector of labels, matching y in ex3.m
[~, p] = max(all_predicts);
p = p';

end
