Sample Research Paper
1. Introduction
2. Review of pertinent Supervised Learning algorithms
2.1 Linear Regression
2.2 Cost Function and the objective of Linear Regression
2.3 The use of matrices in solving models
2.4 Logistic Regression and the use of the Sigmoid function
2.5 Logistic Regression Cost Function
2.6 Solving the Regression Models using Gradient Descent
2.7 Solving the problem of Over-fitting with Regularization
2.8 Multiclass Classification: One Vs All
3. Solving a real-world Multiclass Classification Problem
4. Conclusion
5. Works Cited
6. Table of Figures
7. Annex 1 – Code
1. Introduction
For this project, I have chosen to study and examine Logistic Regression in a Multiclass Classification
problem. The objective of this document is to expose the mathematical details behind a common
The mathematical concepts covered in this document are simplified and introduced gradually. I start
by explaining the Linear Regression method followed by the notion of cost function and the objective
behind solving Linear Regression problems. I use this as a base to explain Logistic Regression and how it
can be applied to Multiclass Classification problems. Other concepts involved in solving the different
models, like Gradient Descent and Regularization, are also introduced along with their mathematical
formulation. Finally, a simple program written with Octave uses these concepts to demonstrate how
Logistic Regression applied to a Multiclass Classification problem can be used in a real-world situation:
handwriting recognition.
2. Review of pertinent Supervised Learning algorithms
2.1 Linear Regression
Linear regression is based on finding the polynomial function h_θ(x) that best fits a set of given data
points (also called examples). This function, also called the hypothesis, can then be used to predict the
Let m be the number of training examples and n be the number of features (or predictors) predicting
the outcome. Let y be the outcome and y^(i) the value of the outcome in our i-th example, where i ∈ {1..m}.
Let's say we have n variables predicting the values of y; then x_j^(i), where j ∈ {1..n}, represents the value of
The objective of linear regression is to find the value of each θ_j so that h_θ(x^(i)) is as close as
possible to y^(i), thus minimizing h_θ(x^(i)) − y^(i) for each i of our example set. In order to measure the
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{2.2} \]
The cost function associated with our hypothesis is the mean of the squared errors, divided by 2.
Ng (2004) explains why the least-squares cost function is a reasonable choice to
estimate the cost. The objective of linear regression is to find the values of all θ_j in h_θ(x) that will
minimize J(θ).
2.3 The use of matrices in solving models
Let
\[ h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \tag{2.3} \]
be our linear regression model to solve. Then our hypothesis h_θ(x) can also be expressed using
vector notation:
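\[ \text{If } x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}, \quad \text{then } h_\theta(x) = \theta^T x \text{ with } x_0 = 1 \tag{2.4} \]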
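As a minimal sketch of how the least-squares cost (2.2) can be evaluated with the vector notation (2.4), the Octave fragment below computes J(θ) on a small made-up dataset. The name linearCost and the sample values are illustrative assumptions and are not part of the project code.

```octave
% Hypothetical helper: least-squares cost (2.2) written with the vector form (2.4).
% X holds one example per row, with x_0 = 1 in the first column.
linearCost = @(theta, X, y) sum((X * theta - y) .^ 2) / (2 * length(y));

% Made-up data: a single feature, generated by y = 2 * x_1.
X = [1 1; 1 2; 1 3];
y = [2; 4; 6];

J_exact = linearCost([0; 2], X, y);  % theta reproduces the data exactly, so the cost is 0
J_zero  = linearCost([0; 0], X, y);  % theta = 0 predicts 0 for every example
fprintf('J_exact = %g, J_zero = %g\n', J_exact, J_zero);
```

The closer the hypothesis is to the examples, the smaller J(θ) becomes, which is why the remaining sections focus on minimizing it.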
that divides a set of data points into 2 categories. This polynomial function, different from the
hypothesis of the linear regression, is called the decision boundary. In Figure 1, it is the first degree
polynomial (or line) that separates the black marks (our ‘1’ category) from the yellow marks (our ‘0’
category).
Figure 1 Students are Admitted or not based on the score of 2 exams (Stanford University, 2015)
Let the variable x_1 be the Exam 1 score and the variable x_2 be the Exam 2 score. According to Figure 1,
more so,
And finally, using the vector notation from (2.4), we can write:
with
\[ \theta = \begin{bmatrix} -92 \\ 1 \\ 1 \end{bmatrix} \text{ and } x_0 = 1 \]
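To make the role of this θ concrete, here is a small Octave sketch that evaluates the boundary function θ^T x for two made-up pairs of exam scores. The function names and the scores are assumptions chosen for illustration; only the vector θ comes from the text.

```octave
theta = [-92; 1; 1];                               % coefficients given in the text
boundary = @(x1, x2) theta' * [1; x1; x2];         % theta^T x with x_0 = 1
category = @(x1, x2) double(boundary(x1, x2) >= 0); % 1 on or above the line, 0 below

c_high = category(60, 50);   % scores sum to 110, so theta^T x = 18
c_low  = category(40, 45);   % scores sum to 85, so theta^T x = -7
fprintf('c_high = %d, c_low = %d\n', c_high, c_low);
```

With this θ the boundary is the line x_1 + x_2 = 92, so a point is assigned to category '1' exactly when the two scores add up to at least 92.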
In order to solve a Classification problem, we need to define a function h_θ(x) that will predict a value
of 0 or 1, given a set of features. This function has to be a function of the boundary function θ^T x so that
h_θ(θ^T x) = 1 when θ^T x ≥ 0 and
h_θ(θ^T x) = 0 when θ^T x < 0
\[ h_\theta(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{2.6} \]
Shown in Figure 2 is the sigmoid function using different values of θ. The function shows that:
0.5 ≤ h_θ(x) ≤ 1 when θ^T x ≥ 0, and
0 ≤ h_θ(x) < 0.5 when θ^T x < 0.
Therefore (2.6) becomes a good approximation of the probability that the output equals '1' given
an input x, and this is why we use it as our prediction function. It is called the logistic response function
(Galit Shmueli, 2010, p. 195) and is used as our hypothesis for solving the logistic regression.
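The behavior of the logistic response function can be checked numerically. The following Octave sketch defines a local anonymous version of the sigmoid (an illustrative stand-in for the sigmoid.m file of the annex) and evaluates it on a few made-up values of θ^T x.

```octave
% Local anonymous version of the sigmoid, for illustration only.
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = [-10 -1 0 1 10];   % made-up values of theta^T x
h = sigmoid(z);
disp([z; h]);          % first row: theta^T x, second row: h_theta(x)
```

The value at θ^T x = 0 is exactly 0.5, negative inputs give values below 0.5, and positive inputs give values above 0.5, which is what allows h_θ(x) to be read as a probability.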
Note that boundary functions can be of any degree. Figure 3 shows a boundary using a polynomial
of degree greater than 1.
As we mentioned earlier, the goal of Linear Regression is to pick θ in h_θ(x) so that it minimizes
the cost function J(θ) based on a least-squares cost function. How do we formulate J(θ) for logistic
regression?
Given that h_θ(x) is now non-linear because of the sigmoid function, the least-squares function
becomes a noisy, non-convex function (a function with multiple local minima). It is important to
formulate a cost function 1) that gives a good approximation of the cost and 2) that
mathematically enables us to find its minimum. In the next section, we will explain one particular
method to find a function's minima. It will then become clear why a cost function with multiple
minima would not be a good fit to solve a regression problem. But for now, let's just assume that we
have to find another function, more "convex" (that is, shaped more like a single "bowl"), to model the cost
of our hypothesis.
In order to measure the accuracy of our prediction function h_θ(x), we are going to intuitively formulate
If we predict h_θ(x) = 0 but our example shows y = 1, the value of (2.7) tends toward ∞. On the other
hand, if y = 1 and we predict h_θ(x) = 1, still according to (2.7), the error is going to be 0.
If we predict h_θ(x) = 1 but our example shows y = 0, the value of (2.8) tends toward ∞. On the other
hand, if y = 0 and we predict h_θ(x) = 0, still according to (2.8), the error is going to be 0. Now that we
have validated the assumption, let's translate this into a cost function:
\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] \tag{2.9} \]
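where
\[ h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \]
and where
\[ x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}, \quad \text{and } x_0 = 1 \]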
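The logistic cost function can be sketched in a few lines of Octave. The fragment below is an illustration on made-up data, assuming that the cost is averaged over the examples and that X stores one example per row with x_0 = 1 in its first column; the name logisticCost is hypothetical.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical helper: logistic regression cost, averaged over the examples.
logisticCost = @(theta, X, y) ...
    -mean(y .* log(sigmoid(X * theta)) + (1 - y) .* log(1 - sigmoid(X * theta)));

% Made-up data: negative x_1 belongs to category 0, positive x_1 to category 1.
X = [1 -2; 1 -1; 1 1; 1 2];
y = [0; 0; 1; 1];

J_zero = logisticCost([0; 0], X, y);    % h = 0.5 everywhere, so the cost is log(2)
J_good = logisticCost([0; 3], X, y);    % predictions agree with y: small cost
J_bad  = logisticCost([0; -3], X, y);   % predictions contradict y: large cost
fprintf('J_zero = %f, J_good = %f, J_bad = %f\n', J_zero, J_good, J_bad);
```

Confident predictions that contradict the examples are penalized much more heavily than undecided ones, which matches the intuition described for y = 1 and y = 0 above.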
The objective of the logistic regression is to find the values of all θ_j, j ∈ {0..n}, that will minimize the cost function J(θ).
A good way to localize the minima of a function is to visualize the function with a contour plot or a
graph. Figure 4 and Figure 5 show a surface and a contour plot of a simple linear regression cost function
find θ_0 and θ_1.
Figure 4 Cost function visualization using a 3D surface plot
Figure 5 Cost function visualization using a contour plot
But one can immediately notice that if n > 2, with θ_j ≠ 0, it becomes difficult to visualize J(θ).
What if we had a program that selectively picked and tried values of θ_j until the combination
that minimizes J(θ) is found? There are many numeric methods that do just that. Among them, and
probably the easiest to understand, is Gradient Descent. Gradient Descent, or Steepest Descent, is a method
that iteratively computes a coefficient value θ_j with the use of partial derivatives,
making sure that each iteration computes a new value of θ_j, in the vicinity of the previous one, that ensures the
convergence of J(θ) toward its nearest local minimum (Curry, 1944). Gradient Descent applies this
Repeat {
\[ \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \tag{2.10} \]
} until θ_j converges
In (2.10), α is called the learning rate, and ∂J(θ)/∂θ_j is called the derivative or gradient. The learning
rate controls the size of the step taken while updating the value of θ_j from one iteration to the next; the
derivative makes sure that the next value of θ_j is taken in the direction of the minimum of J(θ). One can
immediately see that in order for gradient descent to work, the function has to be smooth enough not to
Without going into the details of finding the derivative, when we replace its value in (2.10), the
Repeat {
\[ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{2.11} \]
} until θ_j converges
with, of course
\[ h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}} \]
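The update rule can be sketched as a short Octave loop. This is only an illustration under stated assumptions: the data, the learning rate α = 0.1 and the fixed number of iterations are made up, and the gradient is averaged over the m examples (the factor 1/m could equivalently be folded into α).

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
cost = @(theta, X, y) ...
    -mean(y .* log(sigmoid(X * theta)) + (1 - y) .* log(1 - sigmoid(X * theta)));

% Made-up data, one example per row with x_0 = 1. The two categories overlap,
% so the coefficients stay finite.
X = [1 0.5; 1 1.5; 1 2.0; 1 3.0; 1 3.5; 1 4.5];
y = [0; 0; 1; 0; 1; 1];
m = length(y);

alpha = 0.1;           % learning rate (assumed value)
num_iters = 500;       % fixed iteration count instead of a convergence test
theta = zeros(2, 1);
J_start = cost(theta, X, y);
for iter = 1:num_iters
  grad = X' * (sigmoid(X * theta) - y) / m;  % all partial derivatives at once
  theta = theta - alpha * grad;              % simultaneous update of every theta_j
end
J_end = cost(theta, X, y);
fprintf('J went from %f to %f\n', J_start, J_end);
```

Computing the whole gradient before touching θ guarantees that every θ_j is updated from the same previous values, which is what the "Repeat" block requires.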
When using gradient descent, it is part of the formulation of the model to pick a value of α that will 1)
be small enough to make the algorithm converge towards a value of θ j without oscillating 2) large enough
If we have too many features, the learned hypothesis may fit the training set very well but fail to
generalize to new examples. Picking a higher-degree polynomial to define our decision boundary can also lead
to over-fitting. Figure 6, Figure 7, and Figure 8 illustrate how choosing a higher-degree polynomial as our
decision boundary will fit most training data but may fail to be a good predictor.
Figure 8 Over fitting (this is just an example. The curve doesn’t render the actual function)
Let's say we choose to use a higher-degree polynomial to define our decision boundary and then use a
mechanism to reduce the effect of the higher-degree terms. The effect would be a smoother, less noisy
decision boundary. The solution is to minimize the risk of over-fitting by implementing a mechanism to
shrink all of the coefficients except θ_0. Let's use our cost function (2.9)
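\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] \]
and add a 'shrinking' term that reduces the effect of θ_j where 1 ≤ j ≤ n
\[ \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \tag{2.12} \]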
(2.12) is called the regularization term, and λ is called the regularization parameter. λ controls a
‘trade-off’ between 1) fitting the training data well by having a lot of features and higher polynomial
terms and, 2) keeping the number of features low and reducing the effect of the higher polynomial terms.
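\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \tag{2.13} \]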
If λ is very large, we end up heavily penalizing all the θ_j where 1 ≤ j ≤ n and end up with a decision
So, if we add regularization to our gradient descent algorithm (keeping in mind that we do not want to
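Repeat {
\[ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \]
\[ \theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right], \quad j \in \{1..n\} \tag{2.14} \]
} until θ_j converges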
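The shrinking effect of λ can be observed with a short Octave experiment. The sketch below runs a regularized gradient descent twice on made-up data, once with λ = 0 and once with λ = 10, assuming a penalty gradient of (λ/m)·θ_j for j ≥ 1 and leaving θ_0 unpenalized. The data, α, the iteration count and the λ values are all illustrative assumptions.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up data, one example per row with x_0 = 1.
X = [1 0.5; 1 1.5; 1 2.0; 1 3.0; 1 3.5; 1 4.5];
y = [0; 0; 1; 0; 1; 1];
m = length(y);

alpha = 0.1;                      % learning rate (assumed value)
lambdas = [0 10];                 % no regularization vs. strong regularization
thetas = zeros(2, numel(lambdas));
for k = 1:numel(lambdas)
  lambda = lambdas(k);
  theta = zeros(2, 1);
  for iter = 1:2000
    grad = X' * (sigmoid(X * theta) - y) / m;
    % Penalize every coefficient except theta_0 (stored in theta(1)).
    grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
    theta = theta - alpha * grad;
  end
  thetas(:, k) = theta;
end
disp(thetas);   % one column of coefficients per value of lambda
```

The coefficient θ_1 obtained with λ = 10 is noticeably closer to zero than the one obtained without regularization, which is the trade-off that λ controls.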
After understanding Logistic Regression, it becomes easy to understand how we can use it to classify a
set of features into one of multiple categories using the One Vs All method. Let's say we have k classes.
The idea behind the One Vs All method is to use Logistic Regression to classify one class against the (k-1)
other classes. In other words, Multiclass Classification using Logistic Regression consists of training k
classifiers, one for each class, and, on a new input x, making a prediction by picking the class whose
classifier maximizes h_θ(x). Remember, 0 ≤ h_θ(x) ≤ 1.
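The prediction step of the One Vs All method can be sketched as follows in Octave. The coefficients in all_theta are made up for k = 3 hypothetical classifiers (they are not trained values), and X stores one example per row with x_0 = 1 in its first column.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Made-up coefficients for k = 3 classifiers, one row per class.
all_theta = [ 2 -1  0;
             -1  1 -1;
             -2  0  1];

% Made-up inputs, one example per row: [x_0 x_1 x_2] with x_0 = 1.
X = [1 0.5 0.2;
     1 4.0 0.5;
     1 1.0 5.0];

H = sigmoid(X * all_theta');                 % H(i, c) = h_theta of classifier c on example i
[best_h, predicted_class] = max(H, [], 2);   % pick the most confident classifier per row
disp(predicted_class');                      % classes 1, 2 and 3 for this data
```

Each row of H holds the k values of h_θ(x) for one example, so taking the column index of the row maximum is the same as picking the class that maximizes h_θ(x).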
3. Solving a real-world Multiclass Classification Problem
We now have some mathematical knowledge to solve a real Multiclass Classification problem. We
are going to use all the concepts presented in the previous chapter and apply them to solve handwriting
recognition.
Let's look at our dataset. It comprises m = 5000 randomly ordered 20x20-pixel images
representing handwritten single-digit numbers from 0 to 9 and their corresponding values
y, with y ∈ {0, 1, 2, ..., 9}.
Our features {x_1, x_2, ..., x_n} are floating-point numbers representing the grayscale value of a single pixel
in one input image, and n = 400 is the total number of pixels per image.
The classification algorithm is implemented using Octave¹. The entry point of the program is located
in the file ex3.m. The function oneVsAll trains a classifier for each digit, whose objective is to provide
a set of coefficients θ that minimize the cost function associated with the Logistic Regression for that
classifier. The function predictOneVsAll is used to apply all classifiers on our examples and select
¹ The code provided in this project was written by me using a template published by Stanford University in the
context of their Machine Learning Course offered on Coursera.
The for loop located in the function oneVsAll is used to train each classifier. For this particular
application, I took advantage of Octave's fminunc, which implements an alternative to the gradient
descent method called the quasi-Newton method. This method, which is also based on the computation of
the gradient ∂J(θ)/∂θ_j (3.15) but does not require the coefficient α, is far more efficient and significantly faster than
gradient descent (Shanno, 1970). fminunc is given a function that returns the value of the cost function
J(θ) and the value of the gradient ∂J(θ)/∂θ_j for the current θ values and the regularization parameter λ,
together with the initial θ values, and computes the final set of θ for the current classifier. Here are the
steps that summarize the implementation process:
Step 1: Code a visualizing function. It’s always better to be able to visualize our data before working
on a dataset. So, to start, we display a random set of 100 handwritten digits taken among the 5000
examples.
Step 2: Code the sigmoid function that will take a vector of dimension m as an input and return a
Step 3: Code the cost function and the gradient. These two operations are done in the
lrCostFunction.m file. The inputs of this function are a vector of dimension n + 1 representing the
coefficients θ, an m x (n + 1) matrix X holding the values of {x_0, x_1, ..., x_n} for our m examples (one
example per row, with x_0 = 1), a vector y of dimension m holding the outcome values (0 or 1)
corresponding to the current classifier, and the regularization parameter λ. For example, if
we are training the digit '7' classifier, then y^(i) = 1 only if the i-th training example is labeled '7'. The
output of lrCostFunction is the scalar J, containing the value of the cost function
for the current values of the coefficients θ, and the vector grad, of the same dimension as θ, containing the value of the derivative term of the
Step 4: Predict the outcome by applying all the trained classifiers to our 5000 examples. For each
example, pick the best classifier and show the accuracy of the model by comparing it to the example.
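Two small operations used in these steps are worth illustrating: recoding the labels into a 0/1 outcome for a single classifier (Step 3) and measuring the accuracy of the predictions (Step 4). The Octave sketch below uses made-up labels and predictions; the variable names are illustrative.

```octave
% Made-up labels for five examples; the annex code maps the digit 0 to label 10.
y = [7; 3; 7; 10; 1];

% Binary outcome used to train the '7' classifier: 1 where the label is 7, 0 elsewhere.
y_seven = double(y == 7);

% Made-up predictions for the same five examples (the third one is wrong).
p = [7; 3; 1; 10; 1];

% Accuracy: percentage of examples whose predicted label matches the actual label.
accuracy = mean(double(p == y)) * 100;
fprintf('Accuracy: %g%%\n', accuracy);   % 4 correct out of 5 gives 80
```

The same comparison y == c is repeated for every class c when training the classifiers, so each classifier sees the whole dataset with its own 0/1 outcome vector.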
4. Conclusion
This method can be extended to the whole alphabet by training 26 more classifiers, and providing
more examples to train those classifiers. This can be done by simply changing a few constants in the
code.
Although it is not in the scope of this document, a necessary next step in pursuing this work would be to
partition the data into training and validation sets in order to measure the effect of over-fitting and
under-fitting.
The set error, which consists of running all the examples of a set through our hypothesis and computing the
error
\[ J_{set}(\theta) = -\frac{1}{m_{set}} \sum_{i=1}^{m_{set}} \left[ y_{set}^{(i)} \log h_\theta(x_{set}^{(i)}) + (1 - y_{set}^{(i)}) \log\left(1 - h_\theta(x_{set}^{(i)})\right) \right] \]
for each set, indicates whether our model is over-fitting our training examples or is performing equally well
on new data.
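A possible way to compute the set error on a training set and a validation set is sketched below in Octave. Everything in this fragment is an assumption made for illustration: the data, the split, the helper name setError, and the use of a plain (unregularized) gradient descent to fit θ on the training set only.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical helper: unregularized logistic cost evaluated on one set.
setError = @(theta, X, y) ...
    -mean(y .* log(sigmoid(X * theta)) + (1 - y) .* log(1 - sigmoid(X * theta)));

% Made-up data, one example per row with x_0 = 1.
X = [1 0.5; 1 1.5; 1 2.0; 1 3.0; 1 3.5; 1 4.5];
y = [0; 0; 1; 0; 1; 1];

train_idx = [1 3 4 6];   % examples used to fit theta
val_idx   = [2 5];       % examples held out for validation
X_train = X(train_idx, :);  y_train = y(train_idx);
X_val   = X(val_idx, :);    y_val   = y(val_idx);

% Fit theta on the training set only.
theta = zeros(2, 1);
for iter = 1:500
  grad = X_train' * (sigmoid(X_train * theta) - y_train) / length(y_train);
  theta = theta - 0.1 * grad;
end

J_train = setError(theta, X_train, y_train);
J_val   = setError(theta, X_val, y_val);
fprintf('J_train = %f, J_val = %f\n', J_train, J_val);
```

A validation error much larger than the training error would suggest over-fitting, while two similar errors suggest that the model behaves the same way on new data.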
Training and validation errors reflect the overall performance of our model and give us a comparison
base in the model selection process by allowing us to fine-tune our model by trying different options and
comparing the resulting performance. In this case, we could have investigated the effect on the
performance of the system by either choosing a lower/higher degree of polynomial for our boundary
function, picking a higher/lower value for the regularization parameter λ , or increasing the number of
samples.
5. Works Cited
Curry, H. D. (1944). The method of steepest descent for non-linear minimization problems. Qu. App.
Math, 2, 258.
Galit Shmueli, N. R. (2010). Data Mining for Business Intelligence. Hoboken, New Jersey: Wiley.
Kohl, N. (n.d.). Role of Bias in Neural Networks. Stack Overflow. Retrieved from
http://natekohl.net/media/sigmoid-scale.png
Ng, A. (2004). Supervised learning - CS229 Lecture notes. Retrieved 4 5, 2016, from Stanford.edu:
http://cs229.stanford.edu/notes/cs229-notes1.pdf
Shanno, D. F. (1970). Conditioning of Quasi-Newton Methods for Function Minimization. Mathematics
of Computation, 24(111).
Stanford University. (2015). Machine Learning. Retrieved 04 03, 2016, from http://www.coursera.org
6. Table of Figures
Figure 1 Students are Admitted or not based on the score of 2 exams (Stanford University, 2015)
Figure 2 Sigmoid function using different values of θ (Kohl)
Figure 3 Acceptance/Rejection of microchips and test results (Stanford University, 2015)
Figure 4 Cost function visualization using a 3D surface plot (Stanford University, 2015)
Figure 5 Cost function visualization using a contour plot (Stanford University, 2015)
Figure 6 Underfitting
Figure 7 Good fit
Figure 8 Over fitting (this is just an example. The curve doesn't render the actual function)
7. Annex 1 – Code
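File ex3.m
%% One-vs-all
%% Initialization
clear ; close all; clc
%% Setup the parameters you will use for this part of the exercise
input_layer_size = 400; % 20x20 Input Images of Digits
num_labels = 10; % 10 labels, from 1 to 10
% (note that we have mapped "0" to label 10)
%% Load the data and display 100 random examples
% NOTE: the loading lines are missing from this listing; the file name below is an assumption.
load('ex3data1.mat'); % expected to define X (5000x400) and y (5000x1)
m = size(X, 1);
rand_indices = randperm(m);
sel = X(rand_indices(1:100), :);
displayData(sel);
%% Train one classifier per label, then measure the accuracy on the examples
lambda = 0.1;
[all_theta] = oneVsAll(X, y, num_labels, lambda);
pred = predictOneVsAll(all_theta, X);
fprintf('Training Set Accuracy: %f\n', mean(double(pred == y)) * 100);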
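File displayData.m
function [h, display_array] = displayData(X, example_width)
%DISPLAYDATA Display 2D data in a nice grid
% [h, display_array] = DISPLAYDATA(X, example_width) displays 2D data
% stored in X in a nice grid. It returns the figure handle h and the
% displayed array if requested.
if nargin < 2, example_width = round(sqrt(size(X, 2))); end % default: square examples
% Gray Image
colormap(gray);
% Grid layout (reconstructed: these lines are missing from the original listing)
[m, n] = size(X);
example_height = n / example_width;
display_rows = floor(sqrt(m)); display_cols = ceil(m / display_rows);
display_array = -ones(display_rows * example_height, display_cols * example_width);
for k = 1:m
    r = floor((k - 1) / display_cols) * example_height; % top offset of tile k
    c = mod(k - 1, display_cols) * example_width;       % left offset of tile k
    tile = reshape(X(k, :), example_height, example_width);
    display_array(r + (1:example_height), c + (1:example_width)) = tile / max(max(abs(tile(:))), eps);
end
% Display Image
h = imagesc(display_array, [-1 1]);
drawnow;
end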
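File oneVsAll.m
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta
%corresponds to the classifier for label i
% [all_theta] = ONEVSALL(X, y, num_labels, lambda) trains num_labels
% logistic regression classifiers and returns each of these classifiers
% in a matrix all_theta, where the i-th row of all_theta corresponds
% to the classifier for label i
% Body reconstructed from the description in section 3 (missing from the original listing)
[m, n] = size(X);
all_theta = zeros(num_labels, n + 1);
X = [ones(m, 1) X]; % add the intercept feature x_0 = 1
options = optimset('GradObj', 'on', 'MaxIter', 50); % MaxIter value is an assumption
for c = 1:num_labels
    theta = fminunc(@(t) lrCostFunction(t, X, double(y == c), lambda), zeros(n + 1, 1), options);
    all_theta(c, :) = theta';
end
end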
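File lrCostFunction.m
function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Regularized logistic regression cost and gradient
m = length(y);          % number of training examples
n = length(theta);      % number of coefficients, including theta_0
h = sigmoid(X * theta); % hypothesis for every example
% Cost (2.13): cross-entropy term plus regularization term (theta_0 excluded)
J1 = sum(-y .* log(h) - (1-y) .* log(1-h)) / m;
J2 = sum(theta(2:n,:).^2) * lambda / (2*m);
J = J1 + J2;
% Gradient: unregularized part, then (lambda / m) * theta_j added for j >= 1
grad = (X' * (h - y)) / m;
grad(2:n) = grad(2:n) + (lambda / m) * theta(2:n);
% =============================================================
grad = grad(:);
end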
File sigmoid.m
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
% g = SIGMOID(z) computes the sigmoid of z, element by element.
g = 1 ./ (1 + exp(-z));
end
File predictOneVsAll.m
function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels
%are in the range 1..K, where K = size(all_theta, 1).
% p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
% for each example in the matrix X. Note that X contains the examples in
% rows. all_theta is a matrix where the i-th row is a trained logistic
% regression theta vector for the i-th class. You should set p to a vector
% of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
% for 4 examples)
m = size(X, 1);
num_labels = size(all_theta, 1);
X = [ones(m, 1) X]; % add the intercept feature x_0 = 1
% For each example, keep the index of the classifier with the largest h_theta(x)
[~, p] = max(sigmoid(X * all_theta'), [], 2);
end