Machine Learning for Computer Security

Exercise Sheet 1
Summer term 2024

Exercise 1 Basic concepts

We start by reviewing basic concepts for defining and optimizing machine learning models.

1. Formally define and explain the following three functions:
a) learning function
b) prediction function
c) loss function
Use the following symbols in your definitions: input domain X, output domain Y,
model space Θ, and θ ∈ Θ.

2. How is the best possible learning model θ∗ ∈ Θ defined? Why is it generally not
possible to compute θ∗ in practice?

3. In supervised learning, the risk of misclassifying a particular sample is minimized
with the aid of a loss function applied to a finite data set.
a) How is the risk for an infinite set of samples calculated?
b) How is this relevant in practice?

4. Assume you have a large pool of labeled data. You are asked to train two
classifiers: a support vector machine and a neural network. How do you use the
available data to train and compare the performance of each classifier? (See the
sketch below.)
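
One common protocol is to split the labeled pool into training and test data, fit
both classifiers on the same training set, and compare them on the held-out test
set. The following is our own minimal sketch, assuming scikit-learn is available;
the synthetic data only stands in for the labeled pool.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the labeled pool (our own assumption).
X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% of the pool as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit both classifiers on the same training set, compare on the test set.
for name, clf in [("SVM", SVC()), ("Neural network", MLPClassifier(max_iter=500))]:
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))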

Exercise 2 Learning theory
Next, we explore how we can use a linear model for a binary classification task. To
this end, we consider the dataset

D := {((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)} ⊂ X × Y

with X = {0, 1}² and Y = {0, 1}. The model that we want to use for our classification
is further defined as

fθ(x) = υ(x1 · w1 + x2 · w2 + b)

with parameters θ = (w1, w2, b) and the step function

υ(z) = 0 if z < 0, and υ(z) = 1 otherwise.
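
As a sanity check, the model can be written out in a few lines of Python. This is
our own minimal sketch; the names predict and theta are not part of the sheet.

# Linear threshold unit f_theta from above (our own naming).
def predict(x, theta):
    """Return 1 if x1*w1 + x2*w2 + b >= 0, else 0."""
    w1, w2, b = theta
    z = x[0] * w1 + x[1] * w2 + b
    return 0 if z < 0 else 1

# Example: one point from D with the parameters used in task 2 below.
print(predict((0, 1), (2, 2, -1)))  # -> 1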

1. Plot the data points in D.

2. Indicate the decision boundary in your plot for parameters θ = (2, 2, −1).
Hint: Recall the input domain.

3. Calculate the empirical risk Rn(fθ) over D with loss

L(fθ(x), y) = |fθ(x) − y| .

4. Find parameters θ such that the resulting model correctly predicts all values of the
AND function

AND(x1, x2) = 1 if x1 = x2 = 1, and 0 otherwise.

5. Explain why no parameters θ exist such that the model implements the XOR function.

Exercise 3 Over- and underfitting
In machine learning, overfitting occurs when a model learns the training data too well,
capturing noise and outliers, which impairs its performance on new data. Underfitting
happens when a model is too simplistic, unable to capture the underlying patterns of the
data, resulting in poor performance on both training and unseen data.
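
Both regimes are easy to reproduce numerically. The following is our own sketch
(not part of the sheet, assuming NumPy is available): we fit polynomials of
increasing degree to noisy samples of a sine curve and compare the training error
with the error on held-out points.

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")

# Degree 1 underfits (both errors high); degree 12 tracks the noise, so the
# training error collapses while the test error grows again.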

1. Let us consider again the linear model from the previous exercise and assume the
following hypothesis space

Θ = {(0, 0, 0), (0, 1, 0), (1, 2, 1)}.

For the individual hypotheses, we obtain the empirical risks Rn(fθ1) = 4,
Rn(fθ2) = 3, and Rn(fθ3) = 2.
To constrain the model complexity, we want to introduce a regularization term C
in the optimization problem:

g ↦ arg min_{θ ∈ Θ} Rn(fθ) + λ C(fθ) .

If we consider weight decay, defined as C(θ) = ∥θ∥₂², what is/are the best
model(s) for λ ∈ {0, 0.5, 1}? (A sketch after this exercise tabulates the
objective.)

2. A large data set is used to evaluate two different learning methods with different
parameterizations. One obtains a training error of 0%, the other a training error of
7.5% in the best case.
a) Make a conjecture about the test error.
b) Reason about the complexity of the learned models.
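
For part 1, the regularized objective can simply be tabulated over the finite
hypothesis space. The following is our own sketch, using the risks stated above;
reading off the minimum per λ answers the question.

# Hypotheses and empirical risks R_n(f_theta) as given in the exercise.
hypotheses = [(0, 0, 0), (0, 1, 0), (1, 2, 1)]
empirical_risks = [4, 3, 2]

for lam in (0, 0.5, 1):
    for theta, risk in zip(hypotheses, empirical_risks):
        weight_decay = sum(w ** 2 for w in theta)  # squared L2 norm of theta
        objective = risk + lam * weight_decay      # R_n(f_theta) + lambda * C(theta)
        print(f"lambda={lam}: theta={theta}, objective={objective}")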

Exercise 4 Gradient descent
The simplest strategy for risk minimization is to enumerate the entire hypothesis space.
Unfortunately, in practice this is usually not feasible. Therefore, a common strategy is to
use gradient descent to find a good solution. The basic idea behind gradient descent is
to iteratively minimize the empirical risk by step-wise updating the model parameters
until the loss does not decrease further:
θ̂ = θ − η/N · Σ_{i=1}^{N} ∇θ L(fθ(xi), yi) ,
where η is the learning rate. In the following, we want to follow this approach for two
commonly used models: linear regression and logistic regression.
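
Before turning to the concrete models, the update rule can be sketched as a generic
loop. This is our own sketch; grad_loss is a hypothetical placeholder for the
per-sample gradient ∇θ L(fθ(xi), yi) that the tasks below derive, and a fixed step
count stands in for a proper convergence check.

import numpy as np

def gradient_descent(theta, X, Y, grad_loss, eta=0.1, steps=1000):
    """Apply the averaged-gradient update from above for a fixed number of
    steps; theta is a NumPy array, eta the learning rate, and
    grad_loss(theta, x, y) must return an array of the same shape as theta."""
    for _ in range(steps):
        grad = np.mean([grad_loss(theta, x, y) for x, y in zip(X, Y)], axis=0)
        theta = theta - eta * grad
    return theta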

1. Assume that we want to optimize a linear regression model that is given by

fθ(x) = Σ_{j=1}^{m} wj · xj + b .

To find the optimal model parameters θ∗ , we want to minimize the quadratic loss
function, which is defined as
L(y, fθ(x)) = 1/2 · (y − fθ(x))² .
What is the partial derivative of L with respect to parameter w1 , i.e., ∂L/∂w1 ?
You can assume that m = 2.
2. Consider a logistic regression model that is given by

hθ (x) = σ(fθ (x)) ,

where σ(z) denotes the sigmoid function


σ(z) = 1 / (1 + exp(−z)) .
To find the optimal model parameters θ∗ in this case, we want to minimize the
logistic loss:

L(y, hθ (x)) = −y log(hθ (x)) − (1 − y) log(1 − hθ (x)) .

As in the previous task, assume that m = 2. What is the partial derivative of L
with respect to the parameter w1, i.e., ∂L/∂w1?

Hint: The derivative of σ(z) is σ′(z) = σ(z)(1 − σ(z)). A finite-difference check
of your result is sketched after this exercise.


3. What requirement needs to be fulfilled by the prediction function and the loss
function in order to apply gradient descent?
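
As a closing aside (our own sketch, not part of the sheet), a central finite
difference can be used to check the hand-derived partial derivatives from tasks 1
and 2 for m = 2; all helper names below are our own.

import math

def f_linear(theta, x):
    """f_theta(x) = w1*x1 + w2*x2 + b for m = 2."""
    w1, w2, b = theta
    return w1 * x[0] + w2 * x[1] + b

def quadratic_loss(theta, x, y):
    return 0.5 * (y - f_linear(theta, x)) ** 2

def logistic_loss(theta, x, y):
    h = 1.0 / (1.0 + math.exp(-f_linear(theta, x)))  # h_theta(x)
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

def numeric_dw1(loss, theta, x, y, eps=1e-6):
    """Central-difference approximation of dL/dw1."""
    w1, w2, b = theta
    return (loss((w1 + eps, w2, b), x, y)
            - loss((w1 - eps, w2, b), x, y)) / (2 * eps)

theta, x, y = (0.5, -0.3, 0.1), (1.0, 2.0), 1.0
print(numeric_dw1(quadratic_loss, theta, x, y))  # compare with your formula
print(numeric_dw1(logistic_loss, theta, x, y))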
