
Math Honours: Machine Learning & Deep Learning (still under development)
Lecturer: M Atemkeng
June 2019

1 Decision Modeling
In order to construct a decision model from data, it is necessary to have a certain number of
observations concerning all the variables considered. Each problem is described by a set of
"explanatory" variables and a variable to explain. For example: "What will be the quantity
or volume of rainwater to be collected by the Grahamstown municipality to fill the
Grahamstown basin during the month of December?"
For a simple random variable, a single observation corresponds to the value of the variable
at a given time. In the case of a modeling problem characterized by several variables, an
observation generally includes the values taken at the same time by all the variables. For
example, to quantify the volume of rainwater collected in the Grahamstown basin during the
month of December, we have to collect samples for past years that indicate the soil humidity
of Grahamstown, the intensity of rainfall in Grahamstown, the quantity of water evaporation
during the month of December, the total volume of water in the basin before December, the
volume of water in the basin for past Decembers, etc. The data used for the construction
of models are sets of observations that must include, for at least a part of them, both the
values of the explanatory variables and those of the corresponding explained variable. We may
sometimes find ourselves in the presence of incomplete observations, or missing data.
In our example, the total volume of water in the basin before December, the intensity of
rainfall, the quantity of water evaporation during the month of December, and the soil humidity
of Grahamstown are explanatory variables, while the volume of water in the basin during
December is the explained variable.
A decision model will be validated on observations of the same type, including the values
of the explanatory variables and those associated with the explained variable. Once validated,
however, this model will be used on observations that are limited to the values of the
explanatory variables, in order to obtain an estimate of the corresponding (unknown) value
of the explained variable.

1.1 Types of decision problems


The main differences between decision problems arise from the nature of the explanatory
variables. To better understand these differences, it is helpful to consider the following
examples:

(a) Which digit does the image represent?

(b) Does a region of an image represent a face or not?

(c) Do the symptoms correspond to disease A, B or C, or to none of them?

(d) What will be the quantity or volume of rainwater to be collected by the Grahamstown
municipality to fill the Grahamstown basin during the month of December?

(e) What will the rain flow rate be in 48 hours in our current region?

(f) How old is a plant of species A whose leaf has a circumference of 346 cm?

(g) Who is the person mentioned in the following phrase: "Marcel Atemkeng is always sick
during winter because he is not used to the cold in South Africa."

(h) What is the image region corresponding to the pants?

The previous examples allow us to highlight three types of decision problems:

1. Classification (or discrimination): the explained variable is a nominal variable
(a variable with modalities); each observation is associated with one and only one
modality (generally called a class). Examples (a), (b) and (c) above fall under this
category. For the first example, the explained variable has ten modalities that
correspond to the digits from 0 to 9 (possibly eleven modalities if we consider an
additional "other" class). For the second, the explained variable has two modalities,
"face" and "non-face". In the third example, the explained variable has four modalities:
A, B, C and "other".

2. Regression: the explained variable is a quantitative variable that takes values in
a subdomain of the set ℝ of real numbers. Examples (d), (e) and (f) belong to
this category, the explained variables being the quantity, the flow rate and the age,
respectively.

3. Structured prediction: the explained variable takes values in a structured set of
data. Examples (g) and (h) above fall into this category. In Example (g), the values
that the explained variable can take are the parts (the sub-sequences of words) of the
sentence. In Example (h), the possible values of the explained variable are the subsets
of pixels of the image.
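
To make these three cases concrete, here is a minimal Python sketch; the arrays and toy
values below are hypothetical illustrations, not data from the course:

    import numpy as np

    # Classification: the explained variable is a modality (a class label).
    # Hypothetical labels for three images of handwritten digits.
    y_classification = np.array([3, 7, 0])

    # Regression: the explained variable is a real number.
    # Hypothetical December water volumes for the basin, in megalitres.
    y_regression = np.array([812.4, 1034.9, 620.1])

    # Structured prediction: the explained variable is a structured object.
    # Here, a named entity represented as a (start, end) span of word indices
    # in "Marcel Atemkeng is always sick ...": words 0-1 are "Marcel Atemkeng".
    y_structured = [(0, 2)]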

1.2 Decision models


In the context of decision modeling, by "model" we usually mean a decision rule, or the set
of rules we follow in order to make a decision.
If we take the case of a problem of classification or discrimination between two classes, the
decision rule will consist of a discrimination boundary. Figure 1 presents a set of
two-dimensional observations (the red dots and the green dots), the explanatory variables
being, in this case, the x and y coordinates of each point. For each observation, the value of
the explained variable is "red" or "green". In other words, each observation belongs
to either the "red" class or the "green" class. The dark blue curve is a discrimination
boundary that separates (imperfectly) the red dots from the green dots. A new observation
that is located on one side of this boundary is "assigned" to the corresponding class (the
corresponding value of the explained variable is predicted by the model).

Figure 1: Discrimination boundary example for a 2-class classification problem.

The model may possibly be supplemented by rejection criteria (or refusal of assignment), which
give the possibility of refusing to decide for certain observations. It should be noted that,
from a practical point of view, the rejection may not be allowed, or may be associated with a
specific value of the explained variable (e.g. the class "other" for a classification problem).
On the other hand, obtaining good rejection criteria for a particular problem raises specific
difficulties depending on the nature of the explained variable (nominal, quantitative or
structured). It is first conceivable to refuse to make a classification decision for
observations that are "too" close to the discrimination boundary, as shown in Figure 2. This
is called "rejection of ambiguity", because uncertainty can be high as to the exact position
of the boundary, or the classes may partially overlap in the vicinity of this boundary.

Figure 2: Example of an ambiguity rejection zone (in yellow).

Finally, it is conceivable to refuse to classify observations that are "too far" from the known
observations, as in Figure 3. In this case, we speak of rejection of non-representativity, the
model being considered unreliable that far from the majority of the observations.
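
The two rejection mechanisms can be sketched for a linear boundary as follows. This is a
minimal Python illustration: the weight vector w, the bias b and the two thresholds are
hypothetical choices for the sake of the example, not values prescribed by the course:

    import numpy as np

    def decide(x, w, b, X_train, t_ambiguity=0.5, t_distance=3.0):
        """Classify a 2-D observation x with the linear boundary w.x + b = 0,
        allowing ambiguity and non-representativity rejection."""
        score = np.dot(w, x) + b
        # Ambiguity rejection: x is too close to the discrimination boundary
        # (the distance from x to the boundary is |score| / ||w||).
        if abs(score) / np.linalg.norm(w) < t_ambiguity:
            return "reject: ambiguity"
        # Non-representativity rejection: x is too far from all known observations.
        if np.min(np.linalg.norm(X_train - x, axis=1)) > t_distance:
            return "reject: non-representativity"
        return "green" if score > 0 else "red"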
Figure 3: Example of a non-representativity rejection zone (in yellow).

On the same data, different models can make different predictions. For a two-class
classification problem, Figure 4 presents three decision models (reduced here to the
discrimination boundary, without rejection) applied to the same set of two-dimensional
observations.

Figure 4: Example of three discrimination boundaries for the same data.

The first model corresponds to a simple, linear boundary and misclassifies (puts on the wrong
side) a relatively large number of observations. The second is a more complex, nonlinear (but
rather smooth) boundary and makes fewer misclassifications than the first model. The third
model corresponds to an even more complex nonlinear boundary (it presents parts with stronger
curvature). We will learn how to choose the "right" model and what the links are between
the complexity of models and their errors.
Consider now a regression problem that consists in estimating the values of an explained
variable Y (ordinate) from the values of an explanatory variable X (abscissa). Figure 5 shows
three decision models (in this case, the prediction rules are represented by the black lines)
applied to the same set of observations. The first model corresponds to a linear prediction of
the type y = ax + b, where a and b are constants. The other two models correspond to nonlinear
prediction rules.
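
For instance, the linear rule y = ax + b can be fitted by least squares. The sketch below
uses NumPy with made-up observations, and the degree-3 polynomial stands in for one possible
nonlinear prediction rule:

    import numpy as np

    # Hypothetical observations: explanatory variable X, explained variable Y.
    X = np.array([0.5, 1.1, 1.9, 3.0, 4.2, 5.1])
    Y = np.array([1.4, 2.3, 3.9, 6.2, 8.1, 10.3])

    # Linear prediction rule y = a*x + b, fitted by least squares.
    a, b = np.polyfit(X, Y, deg=1)
    y_linear = a * X + b

    # A nonlinear prediction rule, here a degree-3 polynomial fitted the same way.
    coeffs = np.polyfit(X, Y, deg=3)
    y_nonlinear = np.polyval(coeffs, X)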
Figure 5: Example of three prediction rules for the same data.

In the case of structured prediction, a model also corresponds to a prediction rule.
Illustrating such a rule for a set of observations is more difficult, so we will explain this
using two different and simple examples.


The first example concerns the detection of named entities in sentences. For the observation
(the sentence) "Marcel Atemkeng is always sick during winter because he is not used to
the cold in South Africa", the result (the prediction) of the model can be, for example,
"Marcel Atemkeng". At first glance, it seems possible to address such a problem as a
classification problem that consists of classifying the individual words according to whether
or not they are part of the named entity. But this approach would ignore the strong
dependencies between the different words composing the same named entity (the inclusion of
"Marcel" in the named entity depends on the inclusion of "Atemkeng"). In addition, the same
sentence may contain several named entities (e.g. "Marcel Atemkeng and Ulrich Sob are always
sick during winter because they are not used to the cold in South Africa") and their number is
not known in advance.
Figure 6: Detection and segmentation of clothing.

As a second example of structured prediction, we consider the detection and segmentation
of clothing in images. For the observation (image) on the left of Figure 6, the result of the
model can be, for example, the one shown on the right of the same figure (see [?]). The
trousers on the left of Figure 6, for example, correspond to the light blue region on the
right of the same figure. Here again, we may consider tackling this problem as a problem of
classifying individual pixels into as many classes as there are categories of clothing (plus
an "other" class for everything else). Such an approach would ignore the dependencies between
the assignments of the different pixels composing the region representing an item of clothing,
as well as the dependencies between the decisions concerning different items of clothing.

1.3 Modeling from data


We mentioned earlier that constructing a decision model analytically starts from a perfect
understanding of the observed phenomenon. A simple example is the calculation of the flight
duration of an airplane from the distance to travel and the speed of the plane. The decision
model is constructed, in such a case, from the knowledge of the laws of physics. However, this
model neglects the impact of a set of uncontrollable phenomena or variables. For example,
the movement of air masses has consequences on the speed (and on the exact trajectory)
of an aircraft, and therefore on the duration of the flight. The construction of models from
data, by statistical means, on the basis of a set of available observations, makes it possible
to approach situations in which the understanding of the phenomena involved (physical,
chemical, economic, sociological, etc.) is insufficient and an analytical model cannot provide
reasonably accurate predictions. On the other hand, modeling from data can sometimes

help to correct the predictions of analytical models that cover the problem incompletely, such
as the flight duration of an airplane from a point A to a point B when only the speed of
the airplane and the total distance from A to B are known. To model the problem from data,
a first approach is supervised learning, which exploits exclusively the data (observations) for
which the values of the explanatory variables and the corresponding explained variable are
known. The values of the explained variable are used as the "supervision information".
The illustration in Figure 7 shows observations for the two-class classification problem we
previously considered.

Figure 7: Supervised learning: use of only those observations for which the values of the
explained variable are known.
The supervision information (the values of the explained variable) is often difficult to
obtain. For example, for the problem of classification of handwritten digits, it is necessary
for a human to perform this classification. For the detection and segmentation of clothing, a
human must delimit the corresponding regions of each image and assign them to a category
or class. These operations are time-consuming and sometimes financially expensive. For
supervised learning, it is therefore often necessary to build a model from a relatively small
number of observations (observations that contain the supervision information, also called
labeled observations). On the other hand, obtaining a large number of observations with
values for the explanatory variables alone is much easier. Is it feasible to use these many
unsupervised observations to improve the quality of a model, compared to the exclusive use of
supervised observations? This is possible, and the approach is called semi-supervised learning
(see e.g. [?]). Semi-supervised learning considers that the supervision information is
available in a partial form, e.g. for a (small) part of the data (observations). Also,
knowledge about the problem we want to deal with allows us to formulate hypotheses about the
relationship between the distribution of the data and the explained variable. One such
assumption is, for example, that the boundary of a two-class classification problem lies in a
low-density region, as shown in Figure 8. It is these types of assumptions that make it
possible to use unsupervised data to improve decision modeling.
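
To illustrate, scikit-learn offers semi-supervised methods in which unlabeled observations
are marked with the label -1. In the sketch below the data are synthetic, and the choice of
LabelSpreading is one possible method among others, not the one used in the figures:

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.semi_supervised import LabelSpreading

    # Synthetic two-class data with a low-density region between the classes.
    X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

    # Hide the supervision information for about 90% of the observations;
    # the label -1 marks an unsupervised observation.
    rng = np.random.RandomState(0)
    hidden = rng.rand(len(y)) < 0.9
    y_partial = np.where(hidden, -1, y)

    # LabelSpreading propagates the few known labels through dense regions.
    model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y_partial)
    print("accuracy on hidden labels:",
          np.mean(model.transduction_[hidden] == y[hidden]))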

1.4 Learning and generalisation


The construction of decision models from data exploits a set of observations for which the
supervision information is present (the values of the explained variable are known). Once
constructed, these models are used to make predictions, i.e., to estimate the values of the
explained variable, on new observations for which the supervision information is absent. The
ability of a model to generalize is its ability to make good predictions on new observations
that it has not previously seen.

Figure 8: Semi-supervised learning: takes into account not only observations with supervision
(solid points) but also those without supervision (hollow points).
In order to build a decision model, it is usually necessary to

1. choose a parametric family in which the model will be searched for, and then

2. optimize the parameters to find the "best" model in this family.

To illustrate this, let us return to the two-class classification problem considered above.

Figure 9: Learning data (observations).

Figure 9 shows the set of two-dimensional observations (red dots and green dots) for which
the supervision information ("red" or "green" class) is known. With these observations, we
seek to determine a model that is a boundary of discrimination between the two classes.
We can first examine the parametric family of linear boundaries, represented by straight
lines in the plane of the two-dimensional observations. Once the family is chosen, the search
for a model is done by a method that optimizes a particular criterion. If the method used is
discriminant factor analysis (DFA), we obtain in this example the linear boundary shown in
Figure 10. On the set of observations (learning data) used to obtain the model, the quality
of this model is generally evaluated through its learning error (or empirical risk). In this
example, it is measured by the percentage of misclassified observations.
For the linear model obtained by DFA, the empirical risk is 15%.
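
In symbols, for a model f and n learning observations (x_i, y_i), the empirical risk used
here is the standard misclassification rate, which in LaTeX notation reads:

    \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ f(x_i) \neq y_i \right]

where the indicator equals 1 when the prediction f(x_i) differs from the true class y_i and
0 otherwise; for the DFA model this rate is 0.15.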
Let us now consider the family defined by a multi-layer perceptron (MLP) with one hidden
layer of 10 neurons and the tanh (hyperbolic tangent) activation function. The parameters
to be optimized are, in this case, the weights of the connections and the thresholds (biases)
of the neurons. The search for the optimal parameters, aiming here to minimize the sum of
the learning error and a weight decay (regularization) term with coefficient α = 10⁻⁵, gives,
as a result, the nonlinear boundary shown in Figure 11. For this nonlinear model, the
empirical risk is only 2.3%, much lower than for the linear model obtained by DFA.

Figure 10: Linear boundary obtained by discriminant factor analysis (DFA). Learning error
= 15%.

Figure 11: Non-linear boundary obtained by MLP with α = 10⁻⁵. Learning error = 2.3%.
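
The two families can be instantiated, for example, with scikit-learn. DFA is closely related
to what scikit-learn calls LinearDiscriminantAnalysis, and the synthetic data below are our
own, so the printed errors will not match the 15% and 2.3% of the figures:

    from sklearn.datasets import make_moons
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neural_network import MLPClassifier

    X, y = make_moons(n_samples=400, noise=0.25, random_state=1)

    # Linear family: discriminant analysis.
    lda = LinearDiscriminantAnalysis().fit(X, y)

    # Nonlinear family: MLP with one hidden layer of 10 tanh neurons and
    # weight-decay (L2) coefficient alpha = 1e-5.
    mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='tanh',
                        alpha=1e-5, max_iter=2000, random_state=1).fit(X, y)

    # Learning (empirical) error = fraction of misclassified training points.
    print("LDA training error:", 1 - lda.score(X, y))
    print("MLP training error:", 1 - mlp.score(X, y))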
The obtained models must then be used to estimate the values of the explained variable on
new observations for which the supervision information is absent. The error we are interested
in when using a model is therefore not the learning error but rather the generalization error
(or expected risk).
For the same classification problem, consider another set of observations that carries the
supervision information, drawn from the same distribution as the training data but having an
empty intersection with the training set. Figure 12 shows such a set of data points.
The observations are represented by hollow points, and the color of each point corresponds to
the class to which it truly belongs. A misclassification error is counted each time an
observation falls on the wrong side of the discrimination boundary.

Figure 12: Test data (not used for learning).
We can now test each model obtained above on this new set of observations, which we
will call test data. If these observations are sufficiently numerous and drawn from the same
distribution as any future observations, then the error of a model on these test data is a
good indication of its generalization error (or expected risk). Figure 13 visualizes the
decisions of the two models above (the linear model obtained by DFA on the left and the
nonlinear MLP model on the right). Remember that the color of each point corresponds to the
class to which it truly belongs.

Figure 13: Classification with the linear boundary obtained by DFA (test error = 14%) and
by the MLP with α = 10⁻⁵ (test error = 5%).

In this example, we see that the test error is much lower for the nonlinear model than
for the linear model. If the test error is a good indication of the expected risk, then between
these two models we will prefer to use the nonlinear model. But how do we know if we have

identified the model that presents the best generalization within a category of "accessible"
models (those we can determine with the data, the possible prior knowledge, and the available
tools)? Within a family of models, how do we choose the criterion to optimize? What is the
link between the expected risk (the generalization error) and the empirical risk (the learning
error)? We will examine these questions in what follows.
We have seen that, to build a decision model, it is necessary to choose one or more
parametric families and, in each family, to determine the parameters that define the best
model by optimizing a criterion. With only the learning data, the only accessible criterion
is the learning error. However, it is the generalization error that interests us, and it
cannot be directly measured because the future data (or, at least, the supervision information
concerning them) are unknown when the model is developed. How can we get the model with
the lowest generalization error when we have access only to the learning error? Let us first
examine whether reducing the learning error reduces the generalization error. To do so, let
us consider a test set not used for training but provided with supervision information, and
compare three different models, each belonging to a distinct parametric family:

1. A linear model obtained by discriminant factor analysis (DFA).

2. A multilayer perceptron (MLP) with weight decay α = 10⁻⁵.

3. A multilayer perceptron (MLP) with weight decay α = 1.

The first model is obtained by optimizing the parameters in order to maximize a specific
discrimination criterion (which, under certain assumptions, corresponds to the minimization
of the empirical risk). The last two models are obtained by optimizing the parameters in
order to minimize the learning/training error (the empirical risk). Figures 14 and 15 show,
for each of these models, the training error and the error on the test data (seen as an
estimate of the generalization error), respectively.

Figure 14: Comparison of discrimination boundaries and learning errors.

Figure 15: Comparison of discrimination boundaries and test errors.

By comparing the errors obtained by the different models on the training and on the test
data, we notice the following:

1. The model with the lowest training error is not the one with the lowest test error.
Reducing the training error does not necessarily result in a decrease in the generalization
error. This remains valid even when comparing models from the same parametric family.

2. The training error is an optimistic estimate of the test error. This observation
generalizes to all machine learning problems.

3. The difference between the training error and the test error of a model depends on the
parametric family of which it is part. In our example, the difference is highest for
the most "complex" model, the MLP with α = 10⁻⁵. We will see in Chapter 2 that
this statement can also be generalized.

Since it is not possible to measure the generalization error directly, can we at least
estimate it? Yes, but each approach has a significant disadvantage:

1. The expected risk can be estimated by the error on test data for which the supervision
information is available but which has not been used for training. The available
observations with supervision information are therefore split into training data (to
obtain the model) and test data (to estimate its generalization capacity). This results in
a significant reduction in the amount of data available for training (a minimal sketch of
such a split is given after this list).

2. Under certain conditions, it is possible to determine a finite upper bound on the
difference between the training error and the generalization error, so that the
generalization error is at most the training error plus the bound. We will see later
(Chapter 2) such a bound, which depends on the characteristics of the parametric family
considered. If the training error is known (e.g. 4%) and the estimated bound is
small (e.g. 5%), then we get a potentially interesting upper bound (e.g. 4% + 5% = 9%)
on the generalization error of the model. Unfortunately, there are very few practical
situations in which bounds narrow enough to be useful can be obtained.
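
As announced in point 1 above, splitting labeled observations into training and test data is
straightforward in scikit-learn. This minimal sketch uses synthetic data, and the 70/30 split
ratio is an arbitrary illustrative choice:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_moons(n_samples=400, noise=0.25, random_state=2)

    # Reserve 30% of the labeled observations to estimate the expected risk.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=2)

    mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='tanh',
                        alpha=1e-5, max_iter=2000,
                        random_state=2).fit(X_train, y_train)

    # The test error is an estimate of the generalization error (expected risk),
    # at the cost of fewer observations available for training.
    print("training error:", 1 - mlp.score(X_train, y_train))
    print("test error:    ", 1 - mlp.score(X_test, y_test))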

1.5 Conclusion
We will return to this topic in Chapter 2 and learn how to accurately evaluate a model.
