Nova Southeastern University

College of Engineering and Computing


MMIS 643 Data Mining
Winter 2016

Multiclass Classification using Logistic Regression applied to


Handwriting Recognition: The Technical Details
Table of Contents

1. Introduction
2. Review of pertinent Supervised Learning algorithms
2.1 Linear Regression
2.2 Cost Function and the objective of Linear Regression
2.3 The use of matrices in solving models
2.4 Logistic Regression and the use of the Sigmoid function
2.5 Logistic Regression Cost Function
2.6 Solving the Regression Models using Gradient Descent
2.7 Solving the problem of Over-fitting with Regularization
2.8 Multiclass Classification: One Vs All
3. Solving a real-world Multiclass Classification Problem
4. Conclusion
5. Works Cited
6. Table of Figures
7. Annex 1 – Code

1. Introduction
For this project, I have chosen to study and examine Logistic Regression in a Multiclass Classification problem. The objective of this document is to expose the mathematical details behind a common supervised machine learning algorithm: Multiclass Classification.

The mathematical concepts covered in this document are simplified and introduced gradually. I start by explaining the Linear Regression method, followed by the notion of cost function and the objective behind solving Linear Regression problems. I use this as a base to explain Logistic Regression and how it can be applied to Multiclass Classification problems. Other concepts involved in solving the different models, like Gradient Descent and Regularization, are also introduced along with their mathematical formulation. Finally, a simple program written in Octave uses these concepts to demonstrate how Logistic Regression applied to a Multiclass Classification problem can be used in a real-world situation: the recognition of handwritten characters.

2. Review of pertinent Supervised Learning algorithms
2.1 Linear Regression

Linear regression is based on finding the polynomial function h_θ(x) that best fits a set of given data points (also called examples). This function, also called the hypothesis, can then be used to predict the outcome of any given set of features:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \dots + \theta_n x_n \quad (2.1)$$

Let m be the number of training examples and n be the number of features (or predictors) predicting the outcome. Let y be the outcome and y^(i) the value of the outcome in our i-th example, where i ∈ {1..m}. Let's say we have n variables predicting the values of y; then x_j^(i), where j ∈ {1..n}, represents the value of the j-th feature in our i-th example.

2.2 Cost Function and the objective of Linear Regression

The objective of linear regression is to find the value of each θ_j so that h_θ(x^(i)) is as close as possible to y^(i), thus minimizing h_θ(x^(i)) − y^(i) for each i of our example set. In order to measure the accuracy of the model, we are going to define a cost function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \quad (2.2)$$

The cost function associated with our hypothesis is half the mean of the squared errors. (Ng, 2004) explains why the least-squares cost function is a reasonable choice to estimate the cost. The objective of linear regression is to find the values of all θ_j in h_θ(x) that will minimize the cost function J(θ).
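
As a minimal illustration (not part of the project code, and the function name linearCost is chosen only for this sketch), the cost in (2.2) can be computed in a vectorized way in Octave:

function J = linearCost(theta, X, y)
  % X is m x (n+1) with a leading column of ones, theta is (n+1) x 1, y is m x 1
  m = length(y);                     % number of training examples
  errors = X * theta - y;           % h_theta(x^(i)) - y^(i) for every example
  J = (errors' * errors) / (2 * m); % equation (2.2)
end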

2.3 The use of matrices in solving models

Let

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \dots + \theta_n x_n \quad (2.3)$$

be our linear regression model to solve. Our hypothesis h_θ(x) can also be expressed using a vector notation:

$$\text{If } x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \text{ and } \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}, \text{ then } h_\theta(x) = \theta^T x \text{ with } x_0 = 1 \quad (2.4)$$
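
As a tiny illustration of the vector notation in (2.4) (the numeric values are made up for this sketch), the following Octave fragment shows that θᵀx reproduces the expanded sum of (2.3):

theta = [3; 1; 2];        % theta_0, theta_1, theta_2
x     = [1; 5; 7];        % x_0 = 1, followed by the feature values x_1 and x_2
h     = theta' * x;       % theta^T x = 3 + 1*5 + 2*7 = 22, same as equation (2.3)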

2.4 Logistic Regression and the use of the Sigmoid function

Logistic regression is based on finding a polynomial function

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \dots + \theta_n x_n \quad (2.5)$$

that divides a set of data points into 2 categories. This polynomial function, different from the hypothesis of the linear regression, is called the decision boundary. In Figure 1, it is the first-degree polynomial (or line) that separates the black marks (our '1' category) from the yellow marks (our '0' category).

Figure 1 Students are admitted or not based on the scores of 2 exams (Stanford University, 2015)

Let the variable x_1 be the Exam 1 score and the variable x_2 be the Exam 2 score. According to Figure 1, we can say that:

y = 1 when x_1 + x_2 ≥ 92, and
y = 0 when x_1 + x_2 < 92,

or equivalently,

y = 1 when x_1 + x_2 − 92 ≥ 0, and
y = 0 when x_1 + x_2 − 92 < 0.

Finally, using the vector notation from (2.4), we can write:

y = 1 when θ^T x ≥ 0, and
y = 0 when θ^T x < 0,

with

$$\theta = \begin{bmatrix} -92 \\ 1 \\ 1 \end{bmatrix} \quad \text{and} \quad x_0 = 1$$

In order to solve a classification problem, we need to define a function h_θ(x) that will predict a value of 0 or 1, given a set of features. This function has to be a function of the boundary function θ^T x so that, at least approximately,

h_θ(θ^T x) = 1 when θ^T x ≥ 0, and
h_θ(θ^T x) = 0 when θ^T x < 0.

For this, we will use the sigmoid function:

$$h_\theta(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \quad (2.6)$$

Figure 2 Sigmoid function using different values of θ (Kohl)

Shown in Figure 2 is the sigmoid function using different values of θ. The function shows that:

0.5 ≤ h_θ(x) ≤ 1 when θ^T x ≥ 0, and
0 ≤ h_θ(x) < 0.5 when θ^T x < 0.

Therefore (2.6) becomes a good approximation of the probability that the output equals 1 given an input x, and this is why we use it as our prediction function. It is called the logistic response function (Galit Shmueli, 2010, p. 195) and is used as our hypothesis for solving the logistic regression.
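
As an illustrative sketch (the exam scores are made-up values, and the computation mirrors the sigmoid.m listed in Annex 1), the hypothesis in (2.6) can be evaluated for the decision boundary of Figure 1 as follows:

theta = [-92; 1; 1];                 % decision boundary x1 + x2 - 92 = 0
x     = [1; 50; 60];                 % x_0 = 1, Exam 1 = 50, Exam 2 = 60 (made-up scores)
h     = 1 / (1 + exp(-theta' * x));  % sigmoid(18), approximately 1, so we predict y = 1 (admitted)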

Note that boundary functions can be of any degree. Figure 3 shows a boundary using a polynomial of degree greater than 1.

Figure 3 Acceptance/Rejection of microchips and test results (Stanford University, 2015)

2.5 Logistic Regression Cost Function

As we mentioned earlier, the goal of linear regression is to pick θ in h_θ(x) so that it minimizes the cost function J(θ), based on a least-squares cost function. How do we formulate J(θ) for logistic regression?

Given that h_θ(x) is now non-linear, because of the sigmoid function, the least-squares cost function becomes a noisy non-convex function (a function with multiple local minima). It is important to formulate a cost function 1) that gives a good approximation of the cost (for obvious reasons) and 2) that mathematically enables us to find its minimum. In the next section, we will explain one particular method to find a function's minima; it will then become clear why a cost function with multiple minima would not be a good fit for solving a regression problem. For now, let's just assume that we have to find another function, more "convex" - that is, more of a single "bowl" shape - to model the cost of our hypothesis.

In order to measure the accuracy of our prediction function h_θ(x), we are going to intuitively formulate an error that best describes the 2-class classification problem:

if y = 1, then error = −log(h_θ(x))  (2.7)
if y = 0, then error = −log(1 − h_θ(x))  (2.8)

If we predict h_θ(x) = 0 but our example shows y = 1, the value of (2.7) tends toward ∞; on the other hand, if y = 1 and we predict h_θ(x) = 1, then according to (2.7) the error is going to be 0. Similarly, if we predict h_θ(x) = 1 but our example shows y = 0, the value of (2.8) tends toward ∞; on the other hand, if y = 0 and we predict h_θ(x) = 0, then according to (2.8) the error is going to be 0. Now that we have validated the assumption, let's translate this into a cost function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \quad (2.9)$$

where

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

and where

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}, \quad \text{and } x_0 = 1$$

The objective of the logistic regression is to find the values of all θ_j, j ∈ {0..n}, that will minimize the cost function J(θ) in (2.9).
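
As a minimal, unregularized sketch of (2.9) (the regularized version actually used in the project is lrCostFunction.m in Annex 1, and the function name logisticCost is chosen only for this sketch), the cost can be computed in Octave as:

function J = logisticCost(theta, X, y)
  % X is m x (n+1) with a leading column of ones, theta is (n+1) x 1, y is m x 1 of 0/1 labels
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                  % h_theta(x^(i)) for every example
  J = -(y' * log(h) + (1 - y)' * log(1 - h)) / m;  % equation (2.9)
end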

2.6 Solving the Regression Models using Gradient Descent

A good way to localize the minima of a function is to visualize the function with a contour plot or a graph. Figure 4 and Figure 5 show a surface plot and a contour plot of a simple linear regression cost function J(θ_0, θ_1). Looking at these visual representations of J(θ_0, θ_1) gives us a good approximation of where to find θ_0 and θ_1.

Figure 4 Cost function visualization using a 3D surface plot (Stanford University, 2015)
Figure 5 Cost function visualization using a contour plot (Stanford University, 2015)

But one can immediately notice that if n > 2, with θ_j ≠ 0, it becomes difficult to visualize J(θ).

What if we had a program that selectively picked and tried values of θ_j until the combination that minimizes J(θ) is found? There are many numerical methods that do just that. Among them, and probably the easiest to understand, is Gradient Descent. Gradient Descent, or Steepest Descent, is a method to solve equations that iteratively computes a coefficient value θ_j with the use of partial derivatives, making sure that each iteration computes a new value of θ_j, in the vicinity of the previous θ_j, that ensures the convergence of J(θ) toward its nearest local minimum (Curry, 1944). Gradient Descent applies this calculation simultaneously for every j = 0, ..., n.

Let's look at the Gradient Descent algorithm:

Repeat {
    $$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
} until θ_j converges  (2.10)

In (2.10), α is called the learning rate, and ∂J(θ)/∂θ_j is called the derivative or gradient. The learning rate controls the size of the step taken while updating the value of θ_j from one iteration to the next; the derivative makes sure that the next value of θ_j is taken in the direction of the minimum of J(θ). One can immediately see that, for gradient descent to work, the function has to be smooth and convex enough not to trap the algorithm in a poor local minimum.

Without going into the details of finding the derivative, when we replace its value in (2.10), the gradient descent algorithm becomes:

Repeat {
    $$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
} until θ_j converges  (2.11)

with, of course,

$$h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}$$

When using gradient descent, it is part of the formulation of the model to pick a value of α that will 1) be small enough to make the algorithm converge toward a value of θ_j without oscillating, and 2) be large enough that convergence will happen within a reasonable computing cost.
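
A minimal sketch of the update (2.11) in Octave might look like the following (the function name and the choice of alpha and num_iters are made up for this illustration; X, y, and theta follow the same conventions as before):

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  % X is m x (n+1) with a leading column of ones, y is m x 1, theta is (n+1) x 1
  m = size(X, 1);
  for iter = 1:num_iters
    h     = 1 ./ (1 + exp(-X * theta));  % predictions for all m examples
    grad  = (X' * (h - y)) / m;          % partial derivatives dJ/dtheta_j, for all j at once
    theta = theta - alpha * grad;        % simultaneous update, equation (2.11)
  end
end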

2.7 Solving the problem of Over-fitting with Regularization

If we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples. Picking a higher-degree polynomial to define our decision boundary can also lead to over-fitting. Figure 6, Figure 7, and Figure 8 illustrate how choosing a higher-degree polynomial as our decision boundary will fit most training data but may fail to be a good predictor.

Figure 6 Underfitting
Figure 7 Good fit
Figure 8 Over-fitting (this is just an example; the curve doesn't render the actual function)

Let's say we choose to use a higher-degree polynomial to define our decision boundary and then use a mechanism to reduce the effect of the higher-degree terms. The effect would be a smoother, less noisy decision boundary. The solution is to minimize the risk of over-fitting by implementing a mechanism to shrink all of the coefficients except θ_0. Let's take our cost function (2.9)

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and add a 'shrinking' term that reduces the effect of every θ_j where 1 ≤ j ≤ n:

$$\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \quad (2.12)$$

(2.12) is called the regularization term, and λ is called the regularization parameter. λ controls a 'trade-off' between 1) fitting the training data well by having a lot of features and higher-degree polynomial terms and 2) keeping the number of features low and reducing the effect of the higher-degree polynomial terms.

If we introduce the regularization term, (2.9) becomes:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \quad (2.13)$$

If λ is very large, we end up heavily penalizing all the θ_j where 1 ≤ j ≤ n and we end up with a hypothesis close to h_θ(x) = θ_0, which is a flat line that under-fits the data.

So, if we add regularization to our gradient descent algorithm (keeping in mind that we do not want to regularize θ_0), (2.11) becomes:

Repeat {
    $$\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$
    $$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \quad \text{for } j = 1, \dots, n$$
} until θ_j converges  (2.14)
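
As a sketch of how one regularized update (2.14) can be computed without penalizing θ_0 (this mirrors the reg_term(1) = 0 approach used in lrCostFunction.m in Annex 1; the function name is chosen only for this sketch):

function theta = regularizedStep(X, y, theta, alpha, lambda)
  % one regularized gradient descent step, equation (2.14)
  m      = size(X, 1);
  h      = 1 ./ (1 + exp(-X * theta)); % predictions
  grad   = (X' * (h - y)) / m;         % unregularized gradient, all j at once
  reg    = (lambda / m) * theta;       % regularization term (lambda / m) * theta_j
  reg(1) = 0;                          % do not regularize theta_0
  theta  = theta - alpha * (grad + reg);
end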

2.8 Multiclass Classification: One Vs All

After understanding Logistic Regression, it becomes easy to understand how we can use it to classify a set of features into one out of multiple categories using the One-vs-All method. Let's say we have k classes; the idea behind the One-vs-All method is to use Logistic Regression to classify one class against the (k − 1) other classes. In other words, Multiclass Classification using Logistic Regression consists of training k classifiers, one for each class, and, on a new input x, making a prediction by picking the class whose classifier returns the largest h_θ(x). Remember, 0 ≤ h_θ(x) ≤ 1.
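
As an illustrative sketch of the prediction step (the full version used in the project is predictOneVsAll.m in Annex 1; the function name here is chosen only for this sketch), once the k classifiers are trained and their coefficients stored as the rows of a matrix all_theta, picking the best class for every example can be done as follows:

function p = predictAllClasses(all_theta, X)
  % X is m x (n+1) with the intercept column, all_theta is k x (n+1), one row per classifier
  h = 1 ./ (1 + exp(-(X * all_theta')));  % m x k matrix of h_theta(x) values
  [~, p] = max(h, [], 2);                 % for each example, pick the class with the largest h_theta(x)
end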

3. Solving a real-world Multiclass Classification Problem

We now have some mathematical knowledge to solve a real Multiclass Classification problem. We are going to use all the concepts presented in the previous chapter and apply them to handwriting recognition.

Let's look at our dataset. It consists of m = 5000 randomly organized 20x20-pixel images representing handwritten single-digit numbers from 0 to 9 and their corresponding labels y, with y ∈ {0, 1, 2, .., 9}.

Our features {x_1, x_2, .., x_n} are floating-point numbers representing the grayscale value of a single pixel in an input image, and n = 400 is the total number of pixels per image.

We have to define k = 10 classifiers, one for each possible class (0 to 9).

The classification algorithm is implemented using Octave¹. The entry point of the program is located in the file ex3.m. The function oneVsAll trains a classifier for each digit, whose objective is to provide a set of coefficients θ that minimize the cost function associated with the Logistic Regression for that classifier. The function predictOneVsAll is used to apply all classifiers to our examples and select the one with the best prediction.

¹ The code provided in this project was written by me using a template published by Stanford University in the context of their Machine Learning Course offered on Coursera.

The for loop located in the function oneVsAll is used to train each classifier. For this particular application, I took advantage of Octave's fminunc, which implements an alternative to the gradient descent method called a quasi-Newton method. This method, which is also based on the computation of ∂J(θ)/∂θ_j but does not require the coefficient α, is far more efficient and significantly faster than gradient descent (Shanno, 1970). fminunc takes as parameters a handle to the function that computes the cost J(θ) and the gradient ∂J(θ)/∂θ_j for given θ values (here lrCostFunction, with the regularization parameter λ fixed), along with an initial θ, and computes the final set of θ for the current classifier; a minimal sketch of this call appears after the steps below. Here are the steps that summarize the implementation process:

Step 1: Code a visualizing function. It's always better to be able to visualize our data before working on a dataset. So, to start, we display a random set of 100 handwritten digits taken among the 5000 examples.

Step 2: Code the sigmoid function, which takes a vector of dimension m as an input and returns a vector of dimension m containing the element-wise sigmoid values.

Step 3: Code the cost function and the gradient. These two operations are done in the lrCostFunction.m file. The inputs of this function are a vector of dimension n + 1 representing the coefficients θ, an m x (n + 1) matrix X holding the values of {x_1, x_2, .., x_n} for our m examples (plus the intercept column of ones), and a vector y of dimension m holding the outcome values (0 or 1) corresponding to the classifier being trained. For example, if we are training the classifier for the digit '7', then the i-th element of y is 1 only if the i-th training example is labeled '7'. The outputs of lrCostFunction are J, a scalar containing the value of the cost function for the current values of the coefficients θ, and grad, a vector of dimension n + 1 containing the value of the derivative term of the gradient descent for the current values of the coefficients θ.

Step 4: Predict the outcome by applying all the trained classifiers to our 5000 examples. For each example, pick the best classifier and report the accuracy of the model by comparing the prediction to the example's label.
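
As promised above, here is a minimal sketch of the training call, following the pattern used in oneVsAll.m in Annex 1 (it assumes X, with the intercept column already added, y, lambda, n, and num_labels are defined as in that file):

options       = optimset('GradObj', 'on', 'MaxIter', 50);  % tell fminunc we supply the gradient
initial_theta = zeros(n + 1, 1);
all_theta     = zeros(num_labels, n + 1);
for c = 1:num_labels
  % train the "class c vs. all the others" classifier: the labels passed in are (y == c)
  theta = fminunc(@(t)(lrCostFunction(t, X, (y == c), lambda)), initial_theta, options);
  all_theta(c, :) = theta';
end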

4. Conclusion
This method can be extended to the whole alphabet by training 26 more classifiers and providing more examples to train those classifiers. This can be done by simply changing a few constants in the code.

Not in the scope of this document, but necessary in order to pursue this work, the data could have been partitioned into training and validation sets in order to measure the effect of over-fitting and under-fitting.

The set error, which consists in running all the examples of a set through our hypothesis and computing the error

$$J_{set}(\theta) = -\frac{1}{m_{set}} \sum_{i=1}^{m_{set}} \left[ y_{set}^{(i)} \log h_\theta(x_{set}^{(i)}) + \left(1 - y_{set}^{(i)}\right) \log\left(1 - h_\theta(x_{set}^{(i)})\right) \right]$$

for each set, indicates whether our model is over-fitting our training examples or is performing equally well on new data.
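
A minimal sketch of such a check for one of the binary classifiers could look like the following in Octave (the function name and the Xtrain/ytrain and Xval/yval split variables are hypothetical, not part of the project code):

function J = setError(theta, Xset, yset)
  % unregularized cost (2.9) of one binary classifier, evaluated on a given set
  % Xset is m_set x (n+1) with the intercept column, yset is m_set x 1 of 0/1 labels
  m = length(yset);
  h = 1 ./ (1 + exp(-Xset * theta));
  J = -(yset' * log(h) + (1 - yset)' * log(1 - h)) / m;
end

Comparing setError(theta, Xtrain, ytrain) with setError(theta, Xval, yval) would then suggest over-fitting when the validation error is much larger than the training error.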

Training and validation errors reflect the overall performance of our model and give us a comparison base in the model selection process, allowing us to fine-tune the model by trying different options and comparing the resulting performance. In this case, we could have investigated the effect on the performance of the system of choosing a lower/higher-degree polynomial for our boundary function, picking a higher/lower value for the regularization parameter λ, or increasing the number of samples.

5. Works Cited

Curry, H. B. (1944). The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics, 2(3), 258-261.
Kohl, N. (n.d.). Role of bias in neural networks. Stack Overflow. Retrieved from http://natekohl.net/media/sigmoid-scale.png
Ng, A. (2004). Supervised learning - CS229 lecture notes. Retrieved April 5, 2016, from Stanford.edu: http://cs229.stanford.edu/notes/cs229-notes1.pdf
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111), 647-656.
Shmueli, G., Patel, N. R., & Bruce, P. C. (2010). Data Mining for Business Intelligence. Hoboken, New Jersey: Wiley.
Stanford University. (2015). Machine Learning [online course]. Retrieved April 3, 2016, from http://www.coursera.org

6. Table of Figures
Figure 1 Students are admitted or not based on the scores of 2 exams (Stanford University, 2015)
Figure 2 Sigmoid function using different values of θ (Kohl)
Figure 3 Acceptance/Rejection of microchips and test results (Stanford University, 2015)
Figure 4 Cost function visualization using a 3D surface plot (Stanford University, 2015)
Figure 5 Cost function visualization using a contour plot (Stanford University, 2015)
Figure 6 Underfitting
Figure 7 Good fit
Figure 8 Over-fitting (this is just an example; the curve doesn't render the actual function)

7. Annex 1 – Code
File ex3.m
%% One-vs-all

%% Initialization
clear ; close all; clc

%% Setup the parameters you will use for this part of the exercise
input_layer_size = 400; % 20x20 Input Images of Digits
num_labels = 10; % 10 labels, from 1 to 10
% (note that we have mapped "0" to label 10)

%% =========== Loading and Visualizing Data =============


% We start by first loading and visualizing the dataset.

% Load Training Data


fprintf('Loading and Visualizing Data ...\n')

load('ex3data1.mat'); % training data stored in arrays X, y


m = size(X, 1);

% Randomly select 100 data points to display


rand_indices = randperm(m);
sel = X(rand_indices(1:100), :);

displayData(sel);

fprintf('Program paused. Press enter to continue.\n');


pause;

%% ============ Vectorize Logistic Regression ============

fprintf('\nTraining One-vs-All Logistic Regression...\n')

lambda = 0.1;
[all_theta] = oneVsAll(X, y, num_labels, lambda);

fprintf('Program paused. Press enter to continue.\n');


pause;

%% ================ Part 3: Predict for One-Vs-All ================


% After ...
pred = predictOneVsAll(all_theta, X);

fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);

File DisplayData.m
function [h, display_array] = displayData(X, example_width)
%DISPLAYDATA Display 2D data in a nice grid
% [h, display_array] = DISPLAYDATA(X, example_width) displays 2D data
% stored in X in a nice grid. It returns the figure handle h and the
% displayed array if requested.

% Set example_width automatically if not passed in


if ~exist('example_width', 'var') || isempty(example_width)
example_width = round(sqrt(size(X, 2)));
end

% Gray Image
colormap(gray);

% Compute rows, cols


[m n] = size(X);
example_height = (n / example_width);

% Compute number of items to display


display_rows = floor(sqrt(m));
display_cols = ceil(m / display_rows);

% Between images padding


pad = 1;

% Setup blank display


display_array = - ones(pad + display_rows * (example_height + pad), ...
pad + display_cols * (example_width + pad));

% Copy each example into a patch on the display array


curr_ex = 1;
for j = 1:display_rows
for i = 1:display_cols
if curr_ex > m,
break;
end
% Copy the patch

% Get the max value of the patch


max_val = max(abs(X(curr_ex, :)));
display_array(pad + (j - 1) * (example_height + pad) + (1:example_height), ...
              pad + (i - 1) * (example_width + pad) + (1:example_width)) = ...
    reshape(X(curr_ex, :), example_height, example_width) / max_val;
curr_ex = curr_ex + 1;
end
if curr_ex > m,
break;
end
end

% Display Image
h = imagesc(display_array, [-1 1]);

% Do not show axis


axis image off

drawnow;

end

File oneVsAll.m
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta
%corresponds to the classifier for label i
% [all_theta] = ONEVSALL(X, y, num_labels, lambda) trains num_labels
% logistic regression classifiers and returns each of these classifiers
% in a matrix all_theta, where the i-th row of all_theta corresponds
% to the classifier for label i

% Some useful variables


m = size(X, 1);
n = size(X, 2);

% You need to return the following variables correctly


all_theta = zeros(num_labels, n + 1);

% Add ones to the X data matrix


X = [ones(m, 1) X];

% Set Initial theta


initial_theta = zeros(n + 1, 1);

% Set options for fminunc


options = optimset('GradObj', 'on', 'MaxIter', 50);

% Run fminunc to obtain the optimal theta


% This function will return theta and the cost
for c = 1:num_labels
[theta] = ...
fminunc (@(t)(lrCostFunction(t, X, (y == c), lambda)), initial_theta, options);
all_theta(c, :) = theta';
end;

end

File lrCostFunction.m
function [J, grad] = lrCostFunction(theta, X, y, lambda)

%LRCOSTFUNCTION Compute cost and gradient for logistic regression with


%regularization
% J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.

% Initialize some useful values


m = length(y); % number of training examples

J = 0;
grad = zeros(size(theta));

n = length(theta);
h = sigmoid(X * theta);
J1 = sum(-y .* log(h) - (1-y) .* log(1-h)) / m;
J2 = sum(theta(2:n,:).^2) * lambda / (2*m);
J = J1 + J2;

reg_term = (lambda * theta) / m;


reg_term(1) = 0;
grad = X' * (h - y) / m;
grad = grad + reg_term;

% =============================================================

grad = grad(:);

end

File sigmoid.m
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
% g = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));


end

File predictOneVsAll.m
function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels
%are in the range 1..K, where K = size(all_theta, 1).
% p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
% for each example in the matrix X. Note that X contains the examples in
% rows. all_theta is a matrix where the i-th row is a trained logistic
% regression theta vector for the i-th class. You should set p to a vector
% of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
% for 4 examples)

m = size(X, 1);
num_labels = size(all_theta, 1);

% You need to return the following variables correctly


p = zeros(size(X, 1), 1);

% Add ones to the X data matrix


X = [ones(m, 1) X];

all_predicts = all_theta * X';   % num_labels x m matrix: one score per classifier per example

% for each example (column), keep the index of the classifier with the highest score;
% transpose so that p is a column vector of labels, matching y in ex3.m
[~, p] = max(all_predicts);
p = p';

end
