
Math Honours: Machine learning & Deep learning (still under development)
Lecturer: M Atemkeng
June 2019

1 Modeling from data: a more precise framework


The first presentation (Chapter 1) highlighted some of the questions raised by decision
modeling from data. We now need to define a more precise framework in which to answer
them. To begin, here are the general steps of decision modeling from data:

• Data preparation and choice of a loss function to quantify the responses of a model.

• Choice of the parametric families in which the models will be searched.

• In each parametric family, estimation of the "best" intra-family model.

• Choice of the best model between parametric families.

• Evaluation of the generalization performance of the chosen model.

These steps are studied in more detail in the rest of this chapter.

1.1 Notations and definitions


The domain in which the explanatory variables are defined (the input space) is denoted
$\mathcal{X}$. For example, for observations described by $p$ real-valued variables,
$\mathcal{X} \subseteq \mathbb{R}^p$. The domain in which the explained variable is defined
(the output space) is denoted $\mathcal{Y}$. For example, for a two-class classification
problem we can have $\mathcal{Y} = \{-1, 1\}$ and for a regression problem
$\mathcal{Y} \subseteq \mathbb{R}$.
Each observation (or data point) is defined by the values of the explanatory variables and a
value of the explained variable. For a given problem, the generic observation is described
by the random variables $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, which follow an unknown
distribution $P$. The $i$-th observation is then written $(x_i, y_i)$, with
$x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The supervision information is the value of
the explained variable, $y_i \in \mathcal{Y}$. A set of $N$ observations with supervision
information is denoted:

$$D_N = \{(x_i, y_i)\}_{1 \le i \le N} \tag{1}$$

Figure 1: Set of two-dimensional observations in a two-class classification problem:
$\mathcal{X} \subset \mathbb{R}^2$, $\mathcal{Y} = \{c_1, c_2\}$. Here, $c_1$ is represented
in green and $c_2$ in red.

Figure 2: Set of observations in a regression problem between two one-dimensional
variables, $\mathcal{X} \subset \mathbb{R}$, $\mathcal{Y} \subset \mathbb{R}$.

Let us look at two concrete examples. The first is a classification problem with two classes
(see Figure 1). In this case $\mathcal{X} \subset \mathbb{R}^2$ and
$\mathcal{Y} = \{c_1, c_2\}$, where $c_1$ and $c_2$ are the class labels. The second example
is a regression problem between one-dimensional variables (see Figure 2). In this case,
$\mathcal{X} \subset \mathbb{R}$ and $\mathcal{Y} \subset \mathbb{R}$.
The objective of decisional modeling is to find a function (a model)

$$f \in \mathcal{F}, \tag{2}$$
$$f : \mathcal{X} \to \mathcal{Y}, \tag{3}$$

which predicts $y \in \mathcal{Y}$ from $x \in \mathcal{X}$ and has the lowest possible
expected risk,

$$R(f) = E_P[L(X, Y, f)], \tag{4}$$

that is, the lowest generalization error. Here $L(\cdot)$ is the loss (or error) function,
which quantifies how much the prediction $f(x)$ differs from the expected value $y$.
$E_P$ stands for the expectation with respect to the unknown distribution $P$.
The choice of a loss function depends on the nature of the modeling problem (classification,
regression or structured prediction), on the choice of the family of functions $\mathcal{F}$
in which the model $f$ is sought, and on the optimization procedure used to find $f$ in
$\mathcal{F}$. In the following we examine the loss functions most widely used for
classification and regression problems.

1.2 Choice of loss function
For classification problems, a commonly used loss function is the 0-1 loss, defined as
follows. If the values of $f(x)$ belong, like $y$, to the domain of the explained variable,
i.e., $f(x) \in \mathcal{Y}$ and $y \in \mathcal{Y}$, where $\mathcal{Y}$ must be a finite
domain (e.g. $\mathcal{Y} = \{-1, 1\}$), then

$$L_{01}(x, y, f) = \mathbb{1}_{f(x) \neq y}, \tag{5}$$

where

$$\mathbb{1}_{f(x) \neq y} = \begin{cases} 1, & \text{if } f(x) \neq y \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

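If $f(x) \in \mathbb{R}$ and $\mathcal{Y} = \{-1, 1\}$, then
$L_{01}(x, y, f) = \mathbb{1}_{H(f(x)) \neq y}$, where $H$ is the step function defined as:

$$H(z) = \begin{cases} +1, & \text{if } z \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{7}$$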

The loss (the error) is therefore zero if the prediction is correct (or has the right sign) and
equal to 1 if the prediction is incorrect (or does not have the right sign). This definition has
the following implications:

• Any misclassification incurs the same penalty (equal to 1), regardless of the type of
error. For example, classifying a healthy patient as unhealthy incurs the same penalty
as classifying an unhealthy patient as healthy. This symmetric loss (error, cost) is
therefore not always a satisfactory choice. Figure 3 illustrates this case.

• For $f(x) \in \mathbb{R}$, only the sign of $f(x)$ matters. Two different predictions
$f(x) = 0.001$ and $f(x) = 1000$ are perfectly equivalent for this loss function, although
the two predictions may not have the same meaning for the modeled problem.
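The two implications above can be checked numerically. The following is a minimal sketch of the 0-1 loss for real-valued predictions, assuming NumPy is available; the function names `step` and `zero_one_loss` and the sample values are illustrative choices, not part of the course material.

```python
import numpy as np

def step(z):
    """Step function H of equation (7): +1 where z >= 0, -1 otherwise."""
    return np.where(np.asarray(z) >= 0, 1, -1)

def zero_one_loss(scores, y):
    """0-1 loss for real-valued predictions f(x) and labels y in {-1, +1}."""
    return (step(scores) != np.asarray(y)).astype(int)

# Hypothetical predictions f(x) and true labels y.
scores = np.array([0.001, 1000.0, -0.5, 2.0])
labels = np.array([1, 1, 1, -1])

losses = zero_one_loss(scores, labels)
print(losses)  # [0 0 1 1]
```

The predictions 0.001 and 1000 receive the same loss because only their sign is used, and both kinds of error (a missed +1 and a missed −1) cost exactly 1.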

Another loss function widely used in two-class margin maximization problems (see Chapter 3)
is the hinge loss, defined for $f(x) \in \mathbb{R}$ and $y \in \{-1, 1\}$ as follows:

$$L_h(x, y, f) = \max\{0, 1 - y f(x)\} \tag{8}$$

Figure 4 shows a graphical representation of the hinge loss for $y = -1$ (red) and $y = 1$
(blue). Note that $L_h$ is not differentiable with respect to $f(x)$ at the point where
$y f(x) = 1$, but it admits a sub-gradient there, which has an impact on the optimization
algorithm used for the model parameters. Versions of the hinge loss have also been defined
for multi-class classification and structured prediction; these versions will be studied
later in this course.
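As an illustration of the remark on sub-gradients, here is a minimal sketch of the hinge loss and of one valid sub-gradient with respect to $f(x)$, assuming NumPy. The function names are hypothetical, and the value returned at the kink $y f(x) = 1$ (here 0) is one admissible choice among all values between $-y$ and 0.

```python
import numpy as np

def hinge_loss(score, y):
    """Hinge loss max(0, 1 - y * f(x)) for a real-valued score f(x)."""
    return np.maximum(0.0, 1.0 - y * score)

def hinge_subgradient(score, y):
    """One sub-gradient of the hinge loss with respect to the score f(x).

    Where y * f(x) < 1 the loss equals 1 - y * f(x), so the derivative is -y.
    Where y * f(x) > 1 the loss is 0, so the derivative is 0.
    At y * f(x) == 1 the loss is not differentiable; we return 0 there.
    """
    return np.where(y * score < 1.0, -y, 0.0)

# Hypothetical scores for observations whose true label is y = +1.
scores = np.array([-2.0, 0.5, 1.0, 3.0])
print(hinge_loss(scores, 1))         # [3.  0.5 0.  0. ]
print(hinge_subgradient(scores, 1))  # [-1. -1.  0.  0.]
```

Unlike the 0-1 loss, a correctly classified point with a small margin (score 0.5) still receives a positive loss.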
Figure 3: Classification with a linear boundary. The blue arrows indicate some data points
misclassified by the model.

Figure 4: Hinge loss for $y = -1$ (red) and $y = 1$ (blue).

For regression problems the most frequently encountered loss function is the quadratic
loss (or error), defined as:

$$L_q(x, y, f) = \big(f(x) - y\big)^2, \tag{9}$$

where $f(x)$ is the model prediction for the input $x$ and $y$ is the supervision information
for that input. This function is differentiable with respect to $f(x)$; if $f$ is also
differentiable with respect to the model parameters, then gradient-based optimization can be
applied directly, as we will see later.
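To make the differentiability argument concrete, the sketch below applies the chain rule to the quadratic loss for a one-dimensional linear model $f(x) = w_1 x + w_0$ and compares the result with a finite-difference estimate. The model form, parameter values and function names are assumptions chosen for illustration; NumPy is assumed.

```python
import numpy as np

def quadratic_loss(pred, y):
    """Quadratic loss (f(x) - y)^2 of equation (9)."""
    return (pred - y) ** 2

def linear_model(params, x):
    """Hypothetical one-dimensional model f(x) = w1 * x + w0."""
    w1, w0 = params
    return w1 * x + w0

def loss_gradient(params, x, y):
    """Chain rule: dL/dparams = dL/df * df/dparams.

    dL/df = 2 * (f(x) - y), df/dw1 = x and df/dw0 = 1.
    """
    pred = linear_model(params, x)
    return 2.0 * (pred - y) * np.array([x, 1.0])

params = np.array([0.5, -1.0])  # illustrative values for (w1, w0)
x, y = 2.0, 3.0                 # one illustrative observation

analytic = loss_gradient(params, x, y)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
numeric = np.zeros(2)
for k in range(2):
    shift = np.zeros(2)
    shift[k] = eps
    plus = quadratic_loss(linear_model(params + shift, x), y)
    minus = quadratic_loss(linear_model(params - shift, x), y)
    numeric[k] = (plus - minus) / (2 * eps)

print(analytic)  # [-12.  -6.]
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```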

1.3 Choice of parametric families


The second step in decisional modeling from data is the choice of parametric families in
which the models will be searched. These parametric families can be:
• Linear models, for which the prediction is obtained from a linear combination of the
explanatory variables. For example, for a classification problem, the prediction for
input $x$ is obtained by applying the step function $H(\cdot) \in \{-1, 1\}$ to the output
of the model $f(x) = w^T x + w_0$, which is a linear combination of the components of
the vector $x$ with the weight vector $w$ and the free term $w_0$ (see Figure 5). For a
regression problem between one-dimensional variables, the prediction for the input $x$
is $f(x) = w_1 x + w_0$, which depends linearly on $x$ (see Figure 6).
Linear models may be insufficient to explain the output (or "explained") variable
from a linear combination of the explanatory variables. Their ability to approximate
a discrimination boundary in the case of classification, or a dependency in the case of
regression, may be too limited. In the classification example of Figure 5, the classes
are not linearly separable: the best linear separation has a relatively high empirical
error (here about 12%). Starting with linear models is nevertheless good practice.

Figure 5: A linear model for classification with two classes of two-dimensional observations.

Figure 6: A linear model for regression between one-dimensional variables.

• Polynomial models of degree $n$ ($n > 1$ to go beyond linear models). In this case, the
dependence between the input (the explanatory variables) and the prediction provided
by the model is polynomial. Each degree $n$ defines a parametric family. The
approximation capacity of polynomial models increases with their degree.

• Various families of nonlinear models, for example multilayer perceptrons (MLP).
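As a minimal sketch of the linear family described in the first item above, the code below computes $f(x) = w^T x + w_0$ for a few two-dimensional observations and applies the step function to obtain class predictions. The weights and observations are hypothetical values chosen for illustration, and NumPy is assumed.

```python
import numpy as np

def linear_score(X, w, w0):
    """Linear model f(x) = w^T x + w0, evaluated for each row x of X."""
    return X @ w + w0

def predict_class(X, w, w0):
    """Apply the step function H to the score: +1 if f(x) >= 0, else -1."""
    return np.where(linear_score(X, w, w0) >= 0, 1, -1)

# Hypothetical weight vector and free term of a two-dimensional linear model.
w = np.array([1.0, -1.0])
w0 = 0.5

# Three hypothetical observations, one per row.
X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.5]])

scores = linear_score(X, w, w0)
predictions = predict_class(X, w, w0)
print(scores)       # [ 2.5 -1.5  0. ]
print(predictions)  # [ 1 -1  1]
```

The third observation lies exactly on the boundary $w^T x + w_0 = 0$ and is assigned to class $+1$, following the convention $H(0) = +1$.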

Since the "simplest" families (e.g., linear models) may be insufficient, one may wonder
whether they should be considered at all, or whether we should go directly to a family
whose approximation capacity is as large as possible. Let us return to the example seen
in Chapter 1 of a classification problem treated with three families of models: linear models,
MLP with $\alpha = 10^{-5}$ and MLP with $\alpha = 1$. Of these three families, the linear
models have the lowest approximation capacity and the MLP with $\alpha = 10^{-5}$ the highest.
Figure 7 shows the solutions found in each family, with the corresponding learning
and generalization errors (the generalization error is estimated from the error on the test
data). Although the learning error is lowest for the model from the family with the highest
approximation capacity (MLP with $\alpha = 10^{-5}$), the generalization error is lowest for
a model from another family (MLP with $\alpha = 1$). A family of high capacity can lead to
overfitting: the learning error is very small but the error on the test data is comparatively
high; the model has learned some characteristics (e.g. noise) that are specific to the
training data.
It is therefore not necessarily in the family that has the greatest capacity that we can
find the model that generalizes the best. Examining families of lower capacity is not without
interest. The link between the capacity of the family of models and the generalization
obtained will be explored in the following.
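The same phenomenon can be reproduced on a small synthetic regression problem. The sketch below does not use the MLP families of Figure 7; as a stand-in, it uses polynomial families of increasing degree fitted by least squares with NumPy. The data-generating function, noise level, sample sizes and degrees are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy data: y = sin(3x) + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mean_squared_error(coeffs, x, y):
    """Average quadratic loss of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(20)    # small training sample
x_test, y_test = make_data(1000)    # large sample to estimate generalization

results = {}
for degree in (1, 3, 9):
    # Least-squares fit within the family of polynomials of this degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mean_squared_error(coeffs, x_train, y_train),
        mean_squared_error(coeffs, x_test, y_test),
    )
    print(f"degree {degree}: train {results[degree][0]:.3f}, "
          f"test {results[degree][1]:.3f}")
```

Because these families are nested, the training error cannot increase with the degree. The test error carries no such guarantee, which is the point made above: the family with the lowest training error is not necessarily the one that generalizes best.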

Figure 7: Approximation capacity, learning error and generalization error for three models
from the linear family (left), MLP with $\alpha = 10^{-5}$ (center) and MLP with $\alpha = 1$ (right).

1.4 Model estimation


It is important now to take a closer look at how a model is estimated within a given
parametric family. Recall that our goal is to find, in a chosen parametric family
$\mathcal{F}$, a function (a model) $f \in \mathcal{F}$, $f : \mathcal{X} \to \mathcal{Y}$,
which predicts $y$ from $x$ and has the lowest expected (or theoretical) risk
$R(f) = E_P[L(X, Y, f)]$, that is, the lowest generalization error. This expected risk
$R(f)$ cannot be evaluated because $P$ is unknown. We can, however, measure the empirical
risk on $D_N$:

$$R_{D_N}(f) = \frac{1}{N} \sum_{i=1}^{N} L(x_i, y_i, f), \tag{10}$$

so the search for the best model in the family $\mathcal{F}$ can rely directly only on this
empirical risk. We can mention three approaches that will be detailed in later
subsections:

• Empirical Risk Minimization (ERM): a search for the model that minimizes the training
error, i.e., find:

$$f^{\star}_{D_N} = \operatorname*{argmin}_{f \in \mathcal{F}} R_{D_N}(f) \tag{11}$$

• Regularized Empirical Risk Minimization (RERM): a search for the model that minimizes
the sum of the training error and a weighted regularization term $G(f)$:

$$f^{\star}_{D_N} = \operatorname*{argmin}_{f \in \mathcal{F}} \big[ R_{D_N}(f) + \alpha G(f) \big] \tag{12}$$

• Structural Risk Minimization (SRM): a sequence of families of increasing capacity is
considered and an ERM estimate is made in each family. The final choice of the model
takes into account both the empirical risk $R_{D_N}$ of the model and the approximation
capacity of the family from which the model is derived.
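The empirical risk of equation (10) and the ERM principle, the first of the three approaches above, can be sketched on a toy problem. The code below assumes a deliberately tiny family: one-dimensional threshold classifiers $x \mapsto H(x - t)$ for a handful of candidate thresholds $t$, with the 0-1 loss. The data set, the thresholds and all function names are hypothetical.

```python
def step(z):
    """Step function H: +1 if z >= 0, -1 otherwise."""
    return 1 if z >= 0 else -1

def make_threshold_model(t):
    """Member of the hypothetical family F: x -> H(x - t)."""
    return lambda x: step(x - t)

def empirical_risk(model, data):
    """Equation (10) with the 0-1 loss: fraction of misclassified observations."""
    return sum(model(x) != y for x, y in data) / len(data)

# Hypothetical training set D_N of (x_i, y_i) pairs with labels in {-1, +1}.
data = [(-1.2, -1), (-0.4, -1), (0.1, -1), (0.6, 1), (0.9, 1), (1.7, 1)]

# Finite family of candidate thresholds.
thresholds = [-1.0, -0.5, 0.0, 0.5, 1.0]

# ERM: pick the member of the family with the lowest empirical risk.
best_t = min(thresholds,
             key=lambda t: empirical_risk(make_threshold_model(t), data))

print(best_t)                                              # 0.5
print(empirical_risk(make_threshold_model(best_t), data))  # 0.0
```

With a finite family, ERM reduces to an exhaustive search. For parametric families such as linear models or MLPs, the minimization is carried out by an optimization algorithm instead.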

The relationship between the approximation capacity of a model family and the generalization
performance it can achieve is easier to understand through an analysis of the components of
the expected risk. Consider the following notation:
• $f^{\star}_{D_N}$: a function of $\mathcal{F}$ that minimizes the empirical risk $R_{D_N}$
• $f^{\star}$: a function of $\mathcal{F}$ that minimizes the expected risk $R$
There also exists a function $f_{\min}$, not necessarily in $\mathcal{F}$, that minimizes the
expected risk $R$ over all families, so that
$$R(f_{\min}) \le R(f^{\star}) \tag{13}$$
Here, $R(f_{\min})$ is a lower bound on the expected risk of any model, regardless of the
family from which it comes. This risk is strictly positive in the presence of noise, because
when the components of an observation $x$ (the values of the explanatory variables) are noisy
or subject to measurement uncertainty, different values of $y$ can correspond to the same $x$.
Now, we can write the expected risk of the function $f^{\star}_{D_N}$ as a sum of three
non-negative terms:

$$R(f^{\star}_{D_N}) = R(f_{\min}) + [R(f^{\star}) - R(f_{\min})] + [R(f^{\star}_{D_N}) - R(f^{\star})] \tag{14}$$


$[R(f^{\star}) - R(f_{\min})]$ is the approximation error. It is zero for the family (or
families) that contains the best model over all families. For families that do not contain
this model, the approximation error is strictly positive. If a family $\mathcal{F}_1$ has a
higher capacity than another family $\mathcal{F}_2$, then the best model of $\mathcal{F}_1$
generally has a lower approximation error than the best model of $\mathcal{F}_2$ (never a
higher one when $\mathcal{F}_2 \subset \mathcal{F}_1$).
$[R(f^{\star}_{D_N}) - R(f^{\star})]$ is the estimation error. It reflects the fact that the
search for the best model in a given family cannot minimize the expected risk directly and
has to rely on the empirical risk instead. It is equal to zero if the function in
$\mathcal{F}$ that minimizes the empirical risk is also the one that minimizes the expected
risk. The higher the capacity of a family $\mathcal{F}$, the more likely it is that the model
minimizing the empirical risk (on training data that correspond to a sample of size $N$)
moves away from the model minimizing the expected risk; thus, the estimation error increases.
For families of models of increasing capacity, the approximation error generally decreases
(it becomes more likely that the family considered contains the best model over all families)
but the estimation error increases. Since the lower bound $R(f_{\min})$ remains the same, the
best family is the one for which the sum of the approximation error and the estimation error
is the smallest.
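The trade-off expressed by equation (14) can be illustrated with a few made-up numbers. In the sketch below, every risk value is hypothetical and chosen only to show the arithmetic: the approximation error falls and the estimation error rises as capacity grows, and the preferred family is the one with the smallest sum.

```python
# Hypothetical lower bound R(f_min), identical for every family.
risk_f_min = 0.05

# Hypothetical approximation and estimation errors for three families.
families = {
    "low capacity": {"approximation": 0.15, "estimation": 0.02},
    "medium capacity": {"approximation": 0.06, "estimation": 0.05},
    "high capacity": {"approximation": 0.01, "estimation": 0.20},
}

def expected_risk(errors):
    """Equation (14): R(f_min) + approximation error + estimation error."""
    return risk_f_min + errors["approximation"] + errors["estimation"]

for name, errors in families.items():
    print(f"{name}: expected risk {expected_risk(errors):.2f}")

# The best family minimizes the sum of the two error terms.
best_family = min(families, key=lambda name: expected_risk(families[name]))
print(best_family)  # medium capacity
```

In practice none of these quantities is directly observable, since they all depend on the unknown distribution $P$; test errors are used as estimates.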
Let us examine a useful example to better appreciate the connection between the
approximation capacity, the approximation error and the estimation error. Consider the same
problem and the same three families of models as in Figure 7.
Figure 8 shows the models obtained within each family for three different training samples
(all of the same number of observations), the averages of their training errors, as well as the
average and the standard deviations of their test errors.
The approximation capacity of the linear family (left) is lower than that of the MLP family
with $\alpha = 1$ (right) which is, in turn, lower than the capacity of the MLP family with
$\alpha = 10^{-5}$ (center). We can make the following observations:

Figure 8: Links between capacity, approximation error and estimation error.

• For the linear family, the average of the training errors is high: the capacity can be
judged insufficient for this problem, where the classes are not linearly separable.
Linear models cannot come close enough to the best model over all families, so the
approximation error is high (there is a strong bias).

• For the family defined by the MLP with $\alpha = 10^{-5}$, the average of the training
errors is very small, so the approximation error is potentially small and the capacity can
be considered sufficient. On the other hand, the average of the test errors is much higher
than the average of the training errors, and the variance of the test errors is high (higher
than that of the other MLP family, with $\alpha = 1$), which indicates that the estimation
error is high.

• For the family defined by the MLP with $\alpha = 1$, the average of the test errors is
comparatively low, which indicates a rather small sum of approximation error and
estimation error. The generalization is better than that obtained with the other
two families. In addition, the average of the test errors is close to the average
of the training errors.
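An experiment in the spirit of Figure 8 can be sketched with synthetic data: several training samples of the same size are drawn, a model is fitted on each within every family, and the training and test errors are summarized. Polynomial families are used here as a stand-in for the linear and MLP families of the figure; the data-generating function, noise level, sample sizes and degrees are assumptions, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Hypothetical noisy data: y = sin(3x) + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mean_squared_error(coeffs, x, y):
    """Average quadratic loss of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_test, y_test = sample(2000)                   # shared test sample
training_sets = [sample(15) for _ in range(3)]  # three samples of equal size

summary = {}
for degree in (1, 3, 9):
    train_errors, test_errors = [], []
    for x_train, y_train in training_sets:
        coeffs = np.polyfit(x_train, y_train, degree)
        train_errors.append(mean_squared_error(coeffs, x_train, y_train))
        test_errors.append(mean_squared_error(coeffs, x_test, y_test))
    summary[degree] = {
        "mean_train": float(np.mean(train_errors)),
        "mean_test": float(np.mean(test_errors)),
        "std_test": float(np.std(test_errors)),
    }
    print(degree, summary[degree])
```

A high mean training error points to a large approximation error, while a large gap between mean test and mean training errors, together with a large spread of the test errors across samples, points to a large estimation error.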

1.5 Empirical Risk Minimization (ERM)


We have seen so far that minimizing the empirical risk is not sufficient to ensure good
generalization for the model obtained. If the model family has too high a capacity,
minimizing the empirical risk results in overfitting the data. To avoid this problem it is
necessary to control the capacity (or the complexity) of the models in $\mathcal{F}$.
One way to control the capacity implicitly is regularization: the optimal model
$f^{\star}_{D_N}$ within the considered family is obtained by minimizing not just the
empirical risk, but the sum of the empirical risk $R_{D_N}(f)$ and a term $G(f)$ which
penalizes (indirectly) the capacity:

$$f^{\star}_{D_N} = \operatorname*{argmin}_{f \in \mathcal{F}} \big[ R_{D_N}(f) + \alpha G(f) \big] \tag{15}$$

Here α is a hyper-parameter that weights the regularization term. The higher the value
of α, the greater the capacity penalty. Here are two examples of regularization methods
commonly used with neural networks:

• Weight decay: $G(f) = \|w\|_2^2$, where $w$ is the vector of the model parameters (neural
network, e.g. MLP).

• Early stopping: no term G(f ) is present but the optimization algorithm (nonlinear,
for MLP) stops before reaching the minimum of the empirical risk.
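The effect of the weight-decay term can be seen on a model simple enough to be solved in closed form. The sketch below is not a neural network: it assumes a linear model $f(x) = w^T x$ without free term and the quadratic loss, for which the minimizer of equation (15) with $G(f) = \|w\|_2^2$ solves a linear system. The data and the values of $\alpha$ are hypothetical; NumPy is assumed.

```python
import numpy as np

def fit_weight_decay(X, y, alpha):
    """Minimize (1/N) * ||X w - y||^2 + alpha * ||w||^2 for a linear model.

    Setting the gradient (2/N) X^T (X w - y) + 2 alpha w to zero gives
    (X^T X / N + alpha I) w = X^T y / N.
    """
    n, p = X.shape
    A = X.T @ X / n + alpha * np.eye(p)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

# Hypothetical regression data generated from a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=30)

norms = {}
for alpha in (0.0, 0.1, 1.0, 10.0):
    w = fit_weight_decay(X, y, alpha)
    norms[alpha] = float(np.linalg.norm(w))
    print(f"alpha = {alpha}: ||w|| = {norms[alpha]:.3f}")
```

The norm of the estimated weights shrinks as $\alpha$ grows, which is how the penalty restricts the effective capacity of the family. With $\alpha = 0$ the procedure reduces to plain ERM.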

1.6 Conclusion
• To build a supervised decision model from data, supervisory information is essential.

• The goal is to find the model that has the lowest generalization error, not the lowest
training error.

• The training error (the empirical risk) is excessively optimistic as an estimator of the
generalization error (expected risk).

• To minimize the expected risk, one must look for the right trade-off between minimizing
the capacity of the model family and minimizing the training error.
