Math Honours: Machine Learning & Deep Learning (Still Under Development)
Lecturer: M Atemkeng
June 2019
• Data preparation and choice of a loss function to quantify the responses of a model.
These steps will be studied in more detail in the rest of this chapter.
Figure 1: Set of two-dimensional observations in a two-class classification problem: X ⊂ R2, Y = {c1, c2}. Here, c1 is represented by the green color and c2 by the red color.
Let us look at two concrete examples. The first shows a classification problem with two classes (see Figure 1). In this case X ⊂ R2 and Y = {c1, c2}, where c1 and c2 are the labels or class tags. The second example shows a regression problem between one-dimensional variables (see Figure 2). In this case, X ⊂ R and Y ⊂ R.
The objective of decisional modeling is to find a function (a model)

f ∈ F (2)
f : X → Y (3)

that minimizes the expected risk

R(f) = EP [L(x, y, f)] (4)

that is, the model with the lowest generalization error. Here L(.) denotes the loss (or error) function, which quantifies the extent to which f(x) corresponds to the expected value of y, and EP stands for the expectation with respect to the unknown distribution P.
The choice of a loss function depends on the nature of the modeling problem (classification, regression or structured prediction), on the choice of the family of functions F in which the model f is sought, and on the optimization procedure used to find f in F. In the following we examine loss functions widely used for classification and regression problems.
1.2 Choice of loss function
For classification problems, an example of a loss function is the 0-1 (top-hat) loss, defined as follows. If the values of f(x) belong, like y, to the domain of the explained variable, i.e., f(x) ∈ Y and y ∈ Y, where Y must be a finite set (e.g., Y = {−1, 1}), then

L01(x, y, f) = 1f(x)≠y (5)

where

1f(x)≠y = 1 if f(x) ≠ y, and 0 otherwise. (6)
If f(x) ∈ R and Y = {−1, 1}, then L01(x, y, f) = 1H(f(x))≠y, where H is the step function defined as

H(z) = +1 if z ≥ 0, and −1 otherwise. (7)
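As an illustrative sketch (the function names below are ours, not from the course), the 0-1 loss for a real-valued prediction can be written as:

```python
def H(z):
    # Step function of eq. (7): +1 if z >= 0, -1 otherwise.
    return 1 if z >= 0 else -1

def zero_one_loss(fx, y):
    # 0-1 loss for a real-valued prediction fx and a label y in {-1, +1}:
    # 1 if the thresholded prediction H(fx) differs from y, 0 otherwise.
    return 1 if H(fx) != y else 0
```

For a finite label set the same idea applies directly: the loss is 1 whenever f(x) ≠ y.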
The loss (the error) is therefore zero if the prediction is correct (or has the right sign) and
equal to 1 if the prediction is incorrect (or does not have the right sign). This definition has
the following implications:
• Any misclassification brings the same penalty (equal to 1), regardless of the type of
error. For example, the misclassification of a healthy patient in the class of unhealthy
patients brings the same penalty as the misclassification of an unhealthy patient in
the class of healthy patients. We can see that this symmetric loss (error, cost) is not
always a satisfactory choice. Figure 3 illustrates this case.
• For f(x) ∈ R, only the sign of f(x) matters. Two different predictions f(x) = 0.001 and f(x) = 1000 are perfectly equivalent for this loss function, even though the two predictions may not have the same meaning for the modeled problem.
Another loss function widely used in two-class margin maximization problems (see Chapter 3) is the hinge loss, defined for f(x) ∈ R and y ∈ {−1, 1} as

Lh(x, y, f) = max(0, 1 − y f(x)). (8)

Figure 4 shows a graphical representation of the hinge loss for y = −1 (red) and y = 1 (blue). We
can notice that Lh is not differentiable with respect to f(x) but has a sub-gradient, which has an impact on the optimization algorithm for the model parameters. Variants of the hinge loss have been defined for multi-class classification and structured prediction; these versions will be studied later in this course.
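A minimal sketch of the hinge loss max(0, 1 − y f(x)) and of one valid subgradient (names are ours):

```python
def hinge_loss(fx, y):
    # Hinge loss for y in {-1, +1} and a real-valued prediction fx.
    return max(0.0, 1.0 - y * fx)

def hinge_subgradient(fx, y):
    # A subgradient of the hinge loss with respect to fx. At the kink
    # (y * fx == 1) the subdifferential is the interval between -y and 0;
    # we pick 0, which is a valid choice.
    return -y if y * fx < 1 else 0.0
```

Note that, unlike the 0-1 loss, a prediction with the correct sign but inside the margin (0 < y f(x) < 1) is still penalized.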
For regression problems the most frequently encountered loss function is the quadratic
loss (or error), defined as:
Figure 3: Classification with a linear boundary. The blue arrows indicate some data poorly classified by the model.
Lq(x, y, f) = (f(x) − y)², (9)
where f (x) is the model prediction for variable x and y is the supervision information for
the entry x. This function is differentiable with respect to f(x); if f is also differentiable with respect to the model parameters, then gradient-based optimization can be applied directly, as we will see later.
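A sketch of the quadratic loss (9) and of its derivative with respect to the prediction, which is what a gradient-based optimizer propagates (function names are ours):

```python
def quadratic_loss(fx, y):
    # Quadratic loss of eq. (9): (f(x) - y)**2.
    return (fx - y) ** 2

def quadratic_loss_grad(fx, y):
    # Derivative with respect to the prediction fx: 2 * (fx - y).
    return 2.0 * (fx - y)
```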
Figure 5: A linear model for classification with two classes of two-dimensional observations.
• Polynomial models of degree n (n > 1, to go beyond linear models). In this case, the dependence between the input (the explanatory variables) and the prediction provided by the model is polynomial. Each degree n defines a parametric family. The approximation capacity of polynomials increases with their degree.
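As a sketch, a degree-n polynomial model of one explanatory variable is simply a parametric function of its n + 1 coefficients (the helper below is illustrative, not part of the course material):

```python
def poly_predict(w, x):
    # Prediction of the polynomial model with parameters w = [w0, ..., wn]:
    # f(x) = w0 + w1*x + ... + wn*x**n. The family's capacity grows with n.
    return sum(wk * x ** k for k, wk in enumerate(w))
```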
Since the ”simplest” families (e.g., linear models) may be insufficient, one may wonder
whether they should be considered, or whether we should be interested directly in a family
whose approximation capacity is as large as possible. Let us return to the example seen in Chapter 1 of a classification problem treated with three families of models: linear models, an MLP with α = 10−5 and an MLP with α = 1. Of these three families, the linear models have
the lowest approximation capability and the MLP with α = 10−5 the highest approximation
capability. Figure 7 shows the solutions found in each family, with the corresponding learning
and generalization errors (the generalization error is estimated from the error on the test
data). It is easy to see that, although the learning error is lowest for the model from the family with the highest approximation capacity (the MLP with α = 10−5), the generalization error is lowest for a model from another family (the MLP with α = 1). A complex
approximation can lead to overfitting: the learning error is very small but the error on the
test data is comparatively high; the model has learned characteristics specific to the training data (e.g., its noise).
It is therefore not necessarily in the family that has the greatest capacity that we can
find the model that generalizes the best. Examining families of lower capacity is not without
interest. The link between the capacity of the family of models and the generalization
obtained will be explored in the following.
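A small self-contained numerical illustration of this point, using polynomial families instead of the MLPs of Figure 7 (all names and data here are ours): the lowest-capacity model (a constant) cannot drive the training error down, while the highest-capacity one (a polynomial interpolating every training point) reaches a training error of zero, typically at the cost of a much worse test error.

```python
import math, random

random.seed(0)

def target(x):
    return math.sin(2 * math.pi * x)

# Small noisy training set and a larger test set from the same distribution.
x_train = [i / 7 for i in range(8)]
y_train = [target(x) + random.gauss(0, 0.2) for x in x_train]
x_test = [random.random() for _ in range(200)]
y_test = [target(x) + random.gauss(0, 0.2) for x in x_test]

def lagrange_predict(xs, ys, x):
    # Degree-(N-1) polynomial interpolating all N training points:
    # the highest-capacity polynomial model for this sample.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mse(predict, xs, ys):
    # Empirical quadratic risk of a predictor on a sample.
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

mean_y = sum(y_train) / len(y_train)
const = lambda x: mean_y                                  # lowest capacity
interp = lambda x: lagrange_predict(x_train, y_train, x)  # highest capacity

print("train MSE:", mse(const, x_train, y_train), mse(interp, x_train, y_train))
print("test  MSE:", mse(const, x_test, y_test), mse(interp, x_test, y_test))
```

The interpolant's training error is exactly zero by construction, so comparing training errors alone would always favor the highest-capacity family; only the test errors reveal the trade-off.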
Figure 7: Approximation capacity, learning error and generalization error for three models
from the linear family (left), MLP with α = 10−5 (center) and MLP with α = 1 (right).
On the training sample DN, the expected risk can only be estimated by the empirical risk

RDN (f) = (1/N) Σi=1..N L(xi, yi, f), (10)

and the search for the best model in the family F cannot rely directly on this empirical risk alone. We can mention three approaches that will be detailed in later subsections:
• Empirical Risk Minimization (ERM): search for the model that minimizes the training error, i.e., find

f*DN = argmin_{f∈F} RDN (f) (11)
• Regularized Empirical Risk Minimization (RERM): search for the model that minimizes the sum of the training error and a weighted regularization term G(f):

f*DN = argmin_{f∈F} [RDN (f) + αG(f)] (12)
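For a one-dimensional linear model f(x) = w·x with G(f) = w², the RERM problem (12) admits a closed form, sketched below (the function name and data are ours):

```python
def ridge_fit(xs, ys, alpha):
    # Regularized ERM for f(x) = w*x with G(f) = w**2:
    # minimize (1/N) * sum((w*x - y)**2) + alpha * w**2.
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + N*alpha).
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * alpha)
```

With α = 0 this reduces to plain ERM; increasing α shrinks w toward 0, i.e., it penalizes the capacity of the retained model.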
Understanding the relationship between the approximation capacity of a family of models and the generalization performance it can achieve is facilitated by an analysis of the components of the expected risk. Consider the following notations:
• f*DN , a function of F that minimizes the empirical risk RDN
• f*, a function of F that minimizes the expected risk R
There exists a function fmin, not necessarily in F, that minimizes the expected risk R over all possible models, and

R(fmin ) ≤ R(f*) (13)
Here, R(fmin) is a lower bound for the expected risk of any model, regardless of the family from which it comes. This risk is strictly non-zero in the presence of noise, because different values of y can correspond to the same observation x when its components (the values of the explanatory variables) are noisy or measured with uncertainty.
Now, we can represent the expected risk of the function f*DN as a sum of three positive terms:

R(f*DN ) = [R(f*DN ) − R(f*)] + [R(f*) − R(fmin )] + R(fmin ), (14)

where the first term is the estimation error, the second is the approximation error, and the third is the irreducible risk.
Figure 8: Links between capacity, approximation error and estimation error.
• For the linear family, the average of the training errors is high; the capacity can be judged insufficient for this problem, where the classes are not linearly separable. Linear models cannot come close enough to the best model over all families: the approximation error is high (there is a strong bias).
• For the family defined by the MLP with α = 10−5 , the average of the training errors
is very small, so the approximation error is potentially small, the capacity can be
considered sufficient. On the other hand, the average of the test errors is much higher than the average of the training errors, and the variance of the test errors is high (higher than for the other MLP family, with α = 1), which indicates that the estimation error is high.
• For the family defined by the MLP with α = 1, the average of the test errors is
comparatively low, which indicates a rather small sum between approximation error
and estimation error. The generalization is better than that obtained with the other
two families. Finally, the average of the test errors is quite low and close to the average
of the training errors.
When the capacity of the family is too high, the minimization of the empirical risk results in overfitting the data. In order to avoid this problem it is necessary to control the capacity (or the complexity) of the models in F.
An implicit approach to capacity control is regularization: the optimal model f*DN within the considered family is obtained by minimizing not just the empirical risk RDN (f), but rather the sum of the empirical risk and a term G(f) which (indirectly) penalizes the capacity:
f*DN = argmin_{f∈F} [RDN (f) + αG(f)] (15)
Here α is a hyper-parameter that weights the regularization term. The higher the value
of α, the greater the capacity penalty. Here are two examples of regularization methods
commonly used with neural networks:
• Weight decay: G(f) = ||w||₂², where w is the vector of the model parameters (of the neural network, e.g., an MLP).
• Early stopping: no term G(f ) is present but the optimization algorithm (nonlinear,
for MLP) stops before reaching the minimum of the empirical risk.
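The two methods can be sketched together in a gradient-descent loop for a one-dimensional linear model f(x) = w·x (the function name and hyper-parameter values below are ours, for illustration only):

```python
def train(xs_tr, ys_tr, xs_val, ys_val, alpha=0.1, lr=0.05,
          max_epochs=500, patience=10):
    # Gradient descent on the regularized empirical risk
    # (1/N) * sum((w*x - y)**2) + alpha * w**2   (weight decay),
    # with early stopping when the validation error stops improving.
    w, best_w, best_val, bad = 0.0, 0.0, float("inf"), 0
    n = len(xs_tr)
    for _ in range(max_epochs):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs_tr, ys_tr)) / n
        w -= lr * (grad + 2 * alpha * w)       # weight-decay gradient term
        val = sum((w * x - y) ** 2
                  for x, y in zip(xs_val, ys_val)) / len(xs_val)
        if val < best_val:
            best_w, best_val, bad = w, val, 0
        else:
            bad += 1
            if bad >= patience:                # early stopping
                break
    return best_w
```

In practice the two methods are often combined, as here: the weight-decay term shapes the objective, while early stopping halts the optimizer before it fully fits the training sample.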
1.6 Conclusion
• To build a supervised decision model from data, supervisory information is essential.
• The goal is to find the model with the lowest generalization error, not the lowest training error.
• The training error (the empirical risk) is an overly optimistic estimator of the generalization error (the expected risk).
• To minimize the expected risk, one must find the right trade-off between limiting the capacity of the model family and minimizing the training error.