02 Mlsec Learning Exercise
Exercise Sheet 1
Summer term 2024
2. How is the best possible learning model θ∗ ∈ Θ defined? Why is it generally not
possible to compute θ∗ in practice?
4. Assume you have a large pool of labeled data. You are asked to train two classifiers,
a support vector machine and a neural network. How do you use the available
data to train the classifiers and compare their performance?
Exercise 2 Learning theory
Next, we explore how a linear model can be used for a binary classification task. To this
end, we consider the following dataset
with X = {0, 1}2 and Y = {0, 1}. The model that we want to use for our classification is
further defined as
fθ (x) = υ(x1 · w1 + x2 · w2 + b)
with parameters θ = (w1 , w2 , b) and step function
υ(z) = 0 if z < 0, and υ(z) = 1 otherwise.
2. Indicate the decision boundary in your plot for parameters θ = (2, 2, −1).
Hint: Recall the input domain.
4. Find parameters θ such that the resulting model correctly predicts all values of an
AND function
AND(x1, x2) = 1 if x1 = x2 = 1, and AND(x1, x2) = 0 otherwise.
5. Explain why no parameters θ can exist such that the model implements an XOR function.
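To make the step-function model concrete, the following sketch (not part of the sheet) implements fθ and checks one candidate parameterization on the AND task. The choice θ = (1, 1, −1.5) is one of infinitely many valid parameterizations and is an assumption for illustration, not taken from the exercise.

```python
def step(z):
    # υ(z): 0 if z < 0, and 1 otherwise
    return 0 if z < 0 else 1

def f(theta, x1, x2):
    # fθ(x) = υ(x1·w1 + x2·w2 + b) with θ = (w1, w2, b)
    w1, w2, b = theta
    return step(x1 * w1 + x2 * w2 + b)

theta_and = (1, 1, -1.5)  # hypothetical example parameters, not from the sheet
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", f(theta_and, x1, x2))
# the model outputs 1 only for the input (1, 1), i.e., it realizes AND
```

Enumerating all four inputs of X = {0, 1}² this way is also a quick sanity check for any parameters you derive by hand.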
Exercise 3 Over- and underfitting
In machine learning, overfitting occurs when a model learns the training data too well,
capturing noise and outliers, which impairs its performance on new data. Underfitting
happens when a model is too simplistic, unable to capture the underlying patterns of the
data, resulting in poor performance on both training and unseen data.
1. Let us consider again the linear model from the previous exercise and assume the
following hypothesis space
If we consider weight decay defined as C(θ) = ∥fθ ∥22 , what is/are the best model(s)
for λ ∈ {0, 0.5, 1}?
2. A large data set is used to evaluate two different learning methods with different
parameterizations. In the best case, one obtains a training error of 0%, the other
a training error of 7.5%.
a) Make a conjecture about the test error.
b) Reason about the complexity of the learned models.
Exercise 4 Gradient descent
The simplest strategy for risk minimization is to enumerate the entire hypothesis space.
Unfortunately, in practice this is usually not feasible. Therefore, a common strategy is to
use gradient descent to find a good solution. The basic idea behind gradient descent is
to iteratively minimize the empirical risk by updating the model parameters step by step
until the loss no longer decreases:
θ̂ = θ − η · (1/N) · Σ_{i=1}^{N} ∇θ L(fθ(x_i), y_i) ,
where η is the learning rate. In the following, we apply this approach to two
commonly used models: linear and logistic regression.
1. Consider a linear regression model that is given by

fθ(x) = Σ_{j=1}^{m} w_j · x_j + b .
To find the optimal model parameters θ∗ , we want to minimize the quadratic loss
function, which is defined as
L(y, fθ(x)) = ½ · (y − fθ(x))² .
What is the partial derivative of L with respect to parameter w1 , i.e., ∂L/∂w1 ?
You can assume that m = 2.
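A hand-derived ∂L/∂w1 can be checked without giving the result away by comparing it to a finite-difference approximation of the gradient and running the update rule from above. The sketch below does exactly that on toy data; the data, learning rate, and iteration count are hypothetical choices, not taken from the sheet.

```python
import numpy as np

def f(theta, x):
    # fθ(x) = w1·x1 + w2·x2 + b, with θ = (w1, w2, b) and m = 2
    w1, w2, b = theta
    return w1 * x[0] + w2 * x[1] + b

def loss(theta, x, y):
    # quadratic loss L(y, fθ(x)) = 1/2 (y − fθ(x))²
    return 0.5 * (y - f(theta, x)) ** 2

def num_grad(theta, x, y, eps=1e-6):
    # central finite-difference approximation of ∇θ L; compare your
    # hand-derived ∂L/∂w1 against g[0]
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[j] += eps
        t_minus[j] -= eps
        g[j] = (loss(t_plus, x, y) - loss(t_minus, x, y)) / (2 * eps)
    return g

# toy data generated from y = 2·x1 − x2 + 0.5 (hypothetical)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([0.5, 2.5, -0.5, 1.5])

theta = np.zeros(3)
eta = 0.1
for _ in range(2000):
    grad = np.mean([num_grad(theta, x, y) for x, y in zip(X, Y)], axis=0)
    theta = theta - eta * grad  # the gradient descent update rule
print(np.round(theta, 2))  # close to the true parameters (2, -1, 0.5)
```

For this noise-free toy problem, gradient descent recovers the generating parameters almost exactly.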
2. Consider a logistic regression model that is given by
Like in the previous task, assume that m = 2. What is the partial derivative of L
with respect to the parameter w1 , i.e., ∂L/∂w1 ?
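The logistic model's formula did not survive the extraction above, so assume the standard form fθ(x) = σ(w1·x1 + w2·x2 + b) with the sigmoid σ(z) = 1/(1 + e^(−z)); this form is an assumption, not taken from the sheet. The sketch below numerically verifies the identity σ′(z) = σ(z)·(1 − σ(z)), which is the key chain-rule step when deriving ∂L/∂w1 for this model.

```python
import math

def sigmoid(z):
    # standard logistic function (assumed form, see note above)
    return 1.0 / (1.0 + math.exp(-z))

# numerically check σ'(z) = σ(z)·(1 − σ(z)) via central differences
eps = 1e-6
for z in (-2.0, 0.0, 1.5):
    num = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    ana = sigmoid(z) * (1 - sigmoid(z))
    print(z, abs(num - ana) < 1e-8)  # each comparison prints True
```

With this identity in hand, the derivative of the quadratic loss follows from the chain rule just as in the linear case, with one extra factor from σ.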