02450ex Spring2019 Sol
Please hand in your answers using the electronic file. Only use this page in the case where digital hand-in is unavailable. In case you have to hand in the answers using the form on this sheet, please follow these instructions:
Print name and study number clearly. The exam is multiple choice. All questions have four possible answers marked by the letters A, B, C, and D, as well as the answer "Don't know" marked by the letter E. A correct answer gives 3 points, a wrong answer gives -1 point, and "Don't know" (E) gives 0 points.
The individual questions are answered by filling in the answer fields with one of the letters A, B, C, D, or E.
Answers:
 1-10:  D C A D C B B B A D
11-20:  A C A A B B B D B A
21-27:  A A B A A A A
Name:
Student number:
No.  Attribute description                        Abbrev.
x1   Average rating of art galleries              art galleries
x2   Average rating of dance clubs                dance clubs
x3   Average rating of juice bars                 juice bars
x4   Average rating of restaurants                restaurants
x5   Average rating of museums                    museums
x6   Average rating of parks/picnic spots         parks
x7   Average rating of beaches                    beaches
x8   Average rating of theaters                   theaters
x9   Average rating of religious institutions    religious
y    Rating of resort (poor, average, high)       Resort's rating

Table 1: Attributes of the travel review dataset.
Question 1.¹ Which one of the following statements is true?

¹ Dataset obtained from https://archive.ics.uci.edu/ml/datasets/Travel+Reviews
Solution 1. The correct answer is D. To see this, notice the red line in the boxplot agrees with the median of the attribute, and the median of the two attributes in Figure 1 can be derived by projecting onto either of the two axes and (visually) estimating the point such that half the mass of the data is above and below. For x2 this is 1.3 and for x9 this is 2.8, which rules out all but option D.

Question 2. A Principal Component Analysis (PCA) is carried out on the travel review dataset in Table 1 based on the attributes x5, x6, x7, x8, x9. The data is standardized by (i) subtracting the mean and (ii) dividing each column by its standard deviation to obtain the standardized data matrix X̃. A singular value decomposition is then carried out on the standardized data matrix to obtain the decomposition U S V^T = X̃, where

    V = [  0.94  -0.12   0.32  -0.0    0.0  ]
        [  0.01   0.0   -0.02   0.0   -1.0  ]
        [ -0.01   0.07   0.07   0.99  -0.0  ]      (1)
        [  0.11   0.99   0.06  -0.08   0.0  ]
        [ -0.33  -0.02   0.94  -0.07  -0.02 ]

    S = [ 14.14   0.0    0.0    0.0    0.0  ]
        [  0.0   11.41   0.0    0.0    0.0  ]
        [  0.0    0.0    9.46   0.0    0.0  ]
        [  0.0    0.0    0.0    4.19   0.0  ]
        [  0.0    0.0    0.0    0.0    0.17 ]
Which one of the following statements is true?
E. Don’t know.
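Questions of this type hinge on the variance explained by each principal component, which follows directly from the diagonal of S. A minimal numpy sketch (the variable names are ours, not part of the exam):

```python
import numpy as np

# Singular values from the diagonal of S in Equation (1).
s = np.array([14.14, 11.41, 9.46, 4.19, 0.17])

# Variance explained by component k is sigma_k^2 / sum of all sigma^2.
var_explained = s**2 / np.sum(s**2)
print(np.round(var_explained, 3))             # fraction per component
print(np.round(np.cumsum(var_explained), 3))  # cumulative fraction
```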
Question 3. Consider again the PCA analysis for the travel review dataset, in particular the SVD decomposition of X̃ in Equation (1). Which one of the following statements is true?

A. An observation with a low value of museums, and a high value of religious, will typically have a negative value of the projection onto principal component number 1.

B. An observation with a low value of museums, and a low value of religious, will typically have a positive value of the projection onto principal component number 3.

C. An observation with a low value of museums, and a high value of religious, will typically have a positive value of the projection onto principal component number 1.

D. An observation with a high value of parks will typically have a positive value of the projection onto principal component number 5.

E. Don't know.

      o1   o2   o3   o4   o5   o6   o7   o8   o9   o10
o1   0.0  2.0  5.7  0.9  2.9  1.8  2.7  3.7  5.3  5.1
o2   2.0  0.0  5.6  2.4  2.5  3.0  3.5  4.3  6.0  6.2
o3   5.7  5.6  0.0  5.0  5.1  4.0  3.3  5.4  1.2  1.8
o4   0.9  2.4  5.0  0.0  2.7  2.1  2.2  3.5  4.6  4.4
o5   2.9  2.5  5.1  2.7  0.0  3.5  3.7  4.0  5.8  5.7
o6   1.8  3.0  4.0  2.1  3.5  0.0  1.7  5.3  3.8  3.7
o7   2.7  3.5  3.3  2.2  3.7  1.7  0.0  4.2  3.1  3.2
o8   3.7  4.3  5.4  3.5  4.0  5.3  4.2  0.0  5.5  6.0
o9   5.3  6.0  1.2  4.6  5.8  3.8  3.1  5.5  0.0  2.1
o10  5.1  6.2  1.8  4.4  5.7  3.7  3.2  6.0  2.1  0.0

Table 2: The pairwise cityblock distances, d(o_i, o_j) = ||x_i - x_j||_{p=1} = sum_{k=1}^{M} |x_{ik} - x_{jk}|, between 10 observations from the travel review dataset (recall M = 9). Each observation o_i corresponds to a row of the data matrix X of Table 1. The colors indicate classes such that the black observations {o1, o2} belong to class C1 (corresponding to a poor rating), the red observations {o3, o4, o5} belong to class C2 (corresponding to an average rating), and the blue observations {o6, o7, o8, o9, o10} belong to class C3 (corresponding to a high rating).
Solution 3. The correct answer is A. Focusing on the correct answer, note the projection onto principal component v_1 (i.e. column one of V) is

b1 = x^T v_1 = [x5 x6 x7 x8 x9] [0.94  0.01  -0.01  0.11  -0.33]^T.

A low (standardized) value of museums (x5) makes the first and largest term negative, and a high value of religious (x9) multiplied by the negative coefficient -0.33 also contributes negatively, so the projection onto principal component 1 will typically be negative.

Question 4. To examine if observation o7 may be an outlier, we will calculate the average relative density using the cityblock distance and the observations given in Table 2 only. We recall that the KNN density and average relative density (ard) for the observation x_i are given by:

density_{X\i}(x_i, K) = 1 / ( (1/K) sum_{x' in N_{X\i}(x_i, K)} d(x_i, x') ),

ard_X(x_i, K) = density_{X\i}(x_i, K) / ( (1/K) sum_{x_j in N_{X\i}(x_i, K)} density_{X\j}(x_j, K) ),

where N_{X\i}(x_i, K) denotes the K nearest neighbors of x_i excluding the i'th observation. What is the average relative density of observation o7 for K = 2 nearest neighbors?

A. 0.41
B. 1.0
C. 0.51
D. 0.83
E. Don't know.
Solution 4. To solve the problem, first observe the K = 2 neighborhood of o7 and the corresponding density:

N_{X\7}(x7) = {o6, o4},  density_{X\7}(x7) = 1/((1.7 + 2.2)/2) = 0.513.

For each element in the above neighborhood we can then compute their K = 2 neighborhoods and densities:

N_{X\6}(x6) = {o7, o1},  N_{X\4}(x4) = {o1, o6}

and

density_{X\6}(x6) = 0.571,  density_{X\4}(x4) = 0.667.

From these, the ARD can be computed by plugging the values into the formula given in the problem:

ard_X(x7, K = 2) = 0.513 / ((0.571 + 0.667)/2) ≈ 0.83,

and therefore option D is correct.
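The computation above can be verified mechanically from the distance matrix of Table 2. A small Python sketch (the helper names knn, density and ard are ours):

```python
import numpy as np

# Pairwise cityblock distances from Table 2 (symmetric, zero diagonal).
D = np.array([
    [0.0, 2.0, 5.7, 0.9, 2.9, 1.8, 2.7, 3.7, 5.3, 5.1],
    [2.0, 0.0, 5.6, 2.4, 2.5, 3.0, 3.5, 4.3, 6.0, 6.2],
    [5.7, 5.6, 0.0, 5.0, 5.1, 4.0, 3.3, 5.4, 1.2, 1.8],
    [0.9, 2.4, 5.0, 0.0, 2.7, 2.1, 2.2, 3.5, 4.6, 4.4],
    [2.9, 2.5, 5.1, 2.7, 0.0, 3.5, 3.7, 4.0, 5.8, 5.7],
    [1.8, 3.0, 4.0, 2.1, 3.5, 0.0, 1.7, 5.3, 3.8, 3.7],
    [2.7, 3.5, 3.3, 2.2, 3.7, 1.7, 0.0, 4.2, 3.1, 3.2],
    [3.7, 4.3, 5.4, 3.5, 4.0, 5.3, 4.2, 0.0, 5.5, 6.0],
    [5.3, 6.0, 1.2, 4.6, 5.8, 3.8, 3.1, 5.5, 0.0, 2.1],
    [5.1, 6.2, 1.8, 4.4, 5.7, 3.7, 3.2, 6.0, 2.1, 0.0],
])

def knn(i, K):
    """Indices of the K nearest neighbours of observation i (excluding i)."""
    order = np.argsort(D[i])
    return [j for j in order if j != i][:K]

def density(i, K):
    """KNN density: inverse of the mean distance to the K nearest neighbours."""
    return 1.0 / np.mean([D[i, j] for j in knn(i, K)])

def ard(i, K):
    """Average relative density of observation i."""
    return density(i, K) / np.mean([density(j, K) for j in knn(i, K)])

print(round(ard(6, K=2), 2))  # o7 is index 6; prints 0.83 -> option D
```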
Question 5. Consider the distances in Table 2 based on 10 observations from the travel review dataset. The class labels C1, C2, C3 (see the table caption for details) will be predicted using a k-nearest neighbour classifier based on the distances given in Table 2 (ties are broken in the usual manner by considering the nearest observation from the tied classes). Suppose we use leave-one-out cross-validation (i.e. the observation that is being predicted is left out) and a 3-nearest neighbour classifier (i.e. k = 3). What is the error rate computed for all N = 10 observations?

A. error rate = 3/10
B. error rate = 5/10
C. error rate = 6/10
D. error rate = 7/10
E. Don't know.
Solution 5. The correct answer is C. To see this, recall that leave-one-out cross-validation means we train a total of N = 10 models, each model being tested on a single observation and trained on the remaining ones, such that each observation is used for testing exactly once.

The model considered is a KNN classifier with k = 3. To figure out the error for a particular observation i (i.e. the test set for this fold), we train a model on the other observations and predict on observation i: find the k = 3 observations different from i closest to i according to Table 2 and predict i as belonging to the majority class among them (ties broken by the nearest tied class). Concretely, we find: N(o1, k) = {o4, o6, o2}, N(o2, k) = {o1, o4, o5}, N(o3, k) = {o9, o10, o7}, N(o4, k) = {o1, o6, o7}, N(o5, k) = {o2, o4, o1}, N(o6, k) = {o7, o1, o4}, N(o7, k) = {o6, o4, o1}, N(o8, k) = {o4, o1, o5}, N(o9, k) = {o3, o10, o7}, and N(o10, k) = {o3, o9, o7}.

The error rate is then found by observing how often the predicted class label agrees with the true class label. This happens only for observations

{o6, o7, o9, o10},

i.e. 4 of the 10 predictions are correct, giving an error rate of 6/10, and therefore option C is correct.
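The leave-one-out loop can be reproduced with the distance matrix D of Table 2 (entered as in the earlier sketch). A short sketch, with ties broken by the nearest tied class as the question prescribes:

```python
import numpy as np
from collections import Counter

D = np.array([
    [0.0, 2.0, 5.7, 0.9, 2.9, 1.8, 2.7, 3.7, 5.3, 5.1],
    [2.0, 0.0, 5.6, 2.4, 2.5, 3.0, 3.5, 4.3, 6.0, 6.2],
    [5.7, 5.6, 0.0, 5.0, 5.1, 4.0, 3.3, 5.4, 1.2, 1.8],
    [0.9, 2.4, 5.0, 0.0, 2.7, 2.1, 2.2, 3.5, 4.6, 4.4],
    [2.9, 2.5, 5.1, 2.7, 0.0, 3.5, 3.7, 4.0, 5.8, 5.7],
    [1.8, 3.0, 4.0, 2.1, 3.5, 0.0, 1.7, 5.3, 3.8, 3.7],
    [2.7, 3.5, 3.3, 2.2, 3.7, 1.7, 0.0, 4.2, 3.1, 3.2],
    [3.7, 4.3, 5.4, 3.5, 4.0, 5.3, 4.2, 0.0, 5.5, 6.0],
    [5.3, 6.0, 1.2, 4.6, 5.8, 3.8, 3.1, 5.5, 0.0, 2.1],
    [5.1, 6.2, 1.8, 4.4, 5.7, 3.7, 3.2, 6.0, 2.1, 0.0],
])
# Class labels per the caption of Table 2: C1={o1,o2}, C2={o3,o4,o5}, C3={o6..o10}.
y = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3, 3])

def predict_loo(i, k=3):
    """Leave-one-out k-NN prediction; ties broken by the nearest tied class."""
    neigh = [j for j in np.argsort(D[i]) if j != i][:k]
    votes = Counter(y[j] for j in neigh)
    best = max(votes.values())
    tied = {c for c, v in votes.items() if v == best}
    for j in neigh:           # neighbours are sorted by distance,
        if y[j] in tied:      # so the first hit is the nearest tied class
            return y[j]

errors = sum(predict_loo(i) != y[i] for i in range(len(y)))
print(errors / len(y))  # prints 0.6 -> option C
```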
A. Dendrogram 1
B. Dendrogram 2
C. Dendrogram 3
D. Dendrogram 4
E. Don't know.

Solution 6. The correct answer is B. For instance, in dendrogram 3, merge operation number 1 should have been between the sets {f1} and {f4} at a height of 0.94; however, in dendrogram 3, merge number 1 is between the sets {f6} and {f4}.
A. J[Z, Q] ≈ 0.104
B. J[Z, Q] ≈ 0.143
C. J[Z, Q] ≈ 0.174
D. J[Z, Q] ≈ 0.153
E. Don't know.

Solution 7. From this information we can define the counting matrix n as

    n = [ 0  2  0 ]
        [ 0  2  1 ]
        [ 1  2  2 ]

from which we obtain S = 4 and D = 17. With N = 10 observations, the Jaccard similarity of the two partitions is

J[Z, Q] = S / ( (1/2) N (N - 1) - D ) = 4 / (45 - 17) ≈ 0.143,

and therefore option B is correct.
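Assuming the counting matrix n above, the pair counts S and D follow from standard pair-counting identities. A short sketch:

```python
import numpy as np
from math import comb

# n[k, m] = number of observations in cluster k of Z and cluster m of Q.
n = np.array([[0, 2, 0],
              [0, 2, 1],
              [1, 2, 2]])
N = int(n.sum())  # 10 observations

# S: pairs placed in the same cluster by both partitions.
S = sum(comb(int(v), 2) for v in n.flatten())
# Pairs sharing a cluster in Z resp. in Q.
same_Z = sum(comb(int(v), 2) for v in n.sum(axis=1))
same_Q = sum(comb(int(v), 2) for v in n.sum(axis=0))
# D: pairs placed in different clusters by both partitions.
D = comb(N, 2) - same_Z - same_Q + S

J = S / (N * (N - 1) / 2 - D)
print(S, D, round(J, 3))  # 4 17 0.143 -> option B
```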
        x4 ≤ 0.43   x4 ≤ 0.55
y = 1     143          223
y = 2     137          251
y = 3      54          197

Table 3: Proposed splits of the travel review dataset based on the attribute x4. We consider a two-way split where, for each interval, we count how many observations belonging to that interval have the given class label.

Question 8. Suppose we wish to build a classification tree based on Hunt's algorithm where the goal is to predict the Resort's rating, which can belong to three classes, y = 1, y = 2, y = 3. The number of observations in each of the classes are:

n_{y=1} = 263,  n_{y=2} = 359,  n_{y=3} = 358.

We consider binary splits based on the value of x4 of the form x4 ≤ z for two different values of z. In Table 3 we have indicated the number of observations in each of the three classes for the different values of z. Suppose we use the classification error as impurity measure; which one of the following statements is true?

A. The impurity gain of the split x4 ≤ 0.43 is Δ ≈ 0.1045
B. The impurity gain of the split x4 ≤ 0.43 is Δ ≈ 0.0898
E. Don't know.

Solution 8. The correct answer is B. For the split x4 ≤ 0.43, the left branch v1 contains R_1 = (143, 137, 54) observations of the three classes, and the right branch v2 the remaining R_2 = (120, 222, 304). We obtain N(r) = sum_{k,i} R_{ki} = 980 as the total number of observations, and the number of observations in each branch is simply

N(v_k) = sum_i R_{ki},

i.e. N(v1) = 334 and N(v2) = 646. Next, the impurities I(v_k) are computed from the probabilities

p_i = R_{ki} / N(v_k)

and the impurity I_0 of the root from

p_i = sum_k R_{ki} / N(r).

Using the classification error, I = 1 - max_i p_i, we obtain:

I_0 = 0.634,  I(v1) = 0.572,  I(v2) = 0.529.

Combining these,

Δ = I_0 - (N(v1)/N(r)) I(v1) - (N(v2)/N(r)) I(v2) ≈ 0.09,

and therefore option B is correct.

Question 9. Consider the splits in Table 3. Suppose we build a classification tree considering only the split x4 ≤ 0.55 and evaluate it on the same data it was trained upon. What is the accuracy?

A. The accuracy is: 0.42
C. The accuracy is: 0.338
D. The accuracy is: 0.097
E. Don't know.

Solution 9. The correct answer is A. The branch x4 ≤ 0.55 contains (223, 251, 197) observations and is therefore predicted as the majority class y = 2 (251 correct), while the branch x4 > 0.55 contains (40, 108, 161) observations and is predicted as y = 3 (161 correct). The accuracy is (251 + 161)/980 ≈ 0.42.
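A small numpy sketch (the helper name class_error is ours) reproducing both the impurity gain of Solution 8 and the accuracy of Solution 9:

```python
import numpy as np

def class_error(counts):
    """Classification-error impurity 1 - max_i p_i."""
    counts = np.asarray(counts, dtype=float)
    return 1.0 - counts.max() / counts.sum()

root = np.array([263, 359, 358])   # class counts at the root
left = np.array([143, 137, 54])    # x4 <= 0.43
right = root - left                # x4 > 0.43
N = root.sum()

gain = (class_error(root)
        - left.sum() / N * class_error(left)
        - right.sum() / N * class_error(right))
print(round(gain, 4))  # 0.0898 -> option B

# Question 9: accuracy of the tree using only the split x4 <= 0.55,
# predicting the majority class in each branch.
left9 = np.array([223, 251, 197])
right9 = root - left9
print(round((left9.max() + right9.max()) / N, 2))  # 0.42 -> option A
```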
Solution 10. It suffices to compute the activation of the neural network at [x1 x2] = [3 3]. The activation of each of the two hidden neurons is:

n1 = h^(1)([1 3 3] w1^(1)) = 0.036,
n2 = h^(1)([1 3 3] w2^(1)) = 0.846.
Question 11. Consider again the travel review dataset. We consider a multinomial regression model applied to the dataset projected onto the first two principal directions, giving the two coordinates b1 and b2 for each observation. Multinomial regression then computes the per-class probability by first computing the numbers

ŷ_k = w_k^T [1 b1 b2]^T,  k = 1, 2,

and then

P(y = k | x) = e^{ŷ_k} / (1 + sum_{k'=1}^{2} e^{ŷ_{k'}})  if k ≤ 2,
P(y = 3 | x) = 1 / (1 + sum_{k'=1}^{2} e^{ŷ_{k'}}).

Suppose the resulting decision boundary is as shown in Figure 6; what are the weights?

A. w1 = [-0.77, -5.54, 0.01]^T,  w2 = [0.26, -2.09, -0.03]^T
B. w1 = [0.51, 1.65, 0.01]^T,  w2 = [0.1, 3.8, 0.04]^T
C. w1 = [-0.9, -4.39, -0.0]^T,  w2 = [-0.09, -2.45, -0.04]^T
D. w1 = [-1.22, -9.88, -0.01]^T,  w2 = [-0.28, -2.9, -0.01]^T
E. Don't know.

Solution 11. The projections corresponding to the options are, in order,

[ŷ1 ŷ2 ŷ3] = [-0.78  0.29  0.0],
[ŷ1 ŷ2 ŷ3] = [0.5  0.06  0.0],
[ŷ1 ŷ2 ŷ3] = [-0.9  -0.05  -0.0],
…

Comparing the resulting class probabilities with the decision boundary in Figure 6, option A is correct.

Question 12. Consider the one-dimensional dataset

x = [1.0  1.2  1.8  2.3  2.6  3.4  4.0  4.1  4.2  4.6].

Suppose a k-means algorithm is applied to the dataset with K = 3, using Euclidean distances. The algorithm is initialized with K cluster centers located at

μ1 = 1.8,  μ2 = 3.3,  μ3 = 3.6.
What will the location of the cluster centers be after the k-means algorithm has converged?

A. μ1 = 2.05, μ2 = 4, μ3 = 4.3
B. μ1 = 1.58, μ2 = 3.33, μ3 = 4.3
C. μ1 = 1.33, μ2 = 2.77, μ3 = 4.22
D. μ1 = 1.58, μ2 = 3.53, μ3 = 4.4
E. Don't know.

Solution 12. Recall the K-means algorithm iterates between assigning the observations to their nearest centroids, and then updating the centroids to be equal to the average of the observations assigned to them. Given the initial centroids, the K-means algorithm assigns observations to the nearest centroid, resulting in the partition:

{1, 1.2, 1.8, 2.3}, {2.6, 3.4}, {4, 4.1, 4.2, 4.6}.

Updating the centroids to the cluster means (1.58, 3.0, 4.22) and re-assigning moves 2.3 to the second cluster, giving the partition:

{1, 1.2, 1.8}, {2.3, 2.6, 3.4}, {4, 4.1, 4.2, 4.6}

with centroids (1.33, 2.77, 4.22). At this point, the centroids are no longer changing and the algorithm terminates. Hence, C is correct.
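The iteration can be reproduced in a few lines of numpy. A sketch (it assumes no cluster becomes empty, which holds here):

```python
import numpy as np

x = np.array([1.0, 1.2, 1.8, 2.3, 2.6, 3.4, 4.0, 4.1, 4.2, 4.6])
mu = np.array([1.8, 3.3, 3.6])  # initial centroids

while True:
    # Assign each observation to its nearest centroid.
    assign = np.argmin(np.abs(x[:, None] - mu[None, :]), axis=1)
    # Update each centroid to the mean of its assigned observations.
    new_mu = np.array([x[assign == k].mean() for k in range(len(mu))])
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

print(mu)  # approximately [1.33, 2.77, 4.22] -> option C
```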
     f1  f2  f3  f4  f5  f6  f7  f8  f9
o1    0   0   0   1   0   0   0   0   0
o2    0   0   0   0   0   0   0   0   1
o3    0   1   1   1   1   1   0   0   0
o4    1   0   0   0   0   0   0   0   0
o5    1   0   0   1   0   0   0   0   0
o6    0   0   1   1   0   0   0   1   0
o7    0   0   1   1   1   0   0   0   0
o8    0   0   0   0   1   0   0   0   0
o9    0   1   1   0   1   0   0   0   0
o10   0   0   1   1   1   0   1   0   0

Table 4: Binarized version of the travel review dataset. Each of the features f_i is obtained by taking a feature x_i and letting f_i = 1 correspond to a value of x_i greater than the median (otherwise f_i = 0).² The colors indicate classes such that the black observations {o1, o2} belong to class C1 (corresponding to a poor rating), the red observations {o3, o4, o5} belong to class C2 (corresponding to an average rating), and the blue observations {o6, o7, o8, o9, o10} belong to class C3 (corresponding to a high rating).

² Note that in association mining, we would normally also include features f̄_i such that f̄_i = 1 if the corresponding feature is less than the median; for brevity we will not consider features of this kind in this problem.

Question 13. Suppose we use a naïve-Bayes classifier based on the features f2, f4, f5 of Table 4 to predict the rating y. What is the probability pNB(y = 2 | f2 = 0, f4 = 1, f5 = 0)?

A. pNB(y = 2 | f2 = 0, f4 = 1, f5 = 0) = 200/533
B. pNB(y = 2 | f2 = 0, f4 = 1, f5 = 0) = 25/79
C. pNB(y = 2 | f2 = 0, f4 = 1, f5 = 0) = 2000/6023
D. pNB(y = 2 | f2 = 0, f4 = 1, f5 = 0) = 125/287
E. Don't know.
Solution 13. To solve this problem, we simply use the general form of the naïve-Bayes approximation and plug in the relevant numbers. We get:

pNB(y = 2 | f2 = 0, f4 = 1, f5 = 0)
= p(f2=0|y=2) p(f4=1|y=2) p(f5=0|y=2) p(y=2) / sum_{j=1}^{3} p(f2=0|y=j) p(f4=1|y=j) p(f5=0|y=j) p(y=j)
= ( (2/3)(2/3)(2/3)(3/10) ) / ( 1 · (1/2) · 1 · (2/10) + (2/3)(2/3)(2/3)(3/10) + (4/5)(3/5)(1/5)(5/10) )
= 200/533.

Therefore, answer A is correct.

Question 14. Consider the binarized version of the travel review dataset shown in Table 4. The matrix can be considered as representing N = 10 transactions o1, o2, ..., o10 and M = 9 items f1, f2, ..., f9. Which of the following options represents all (non-empty) itemsets with support greater than 0.15 (and only itemsets with support greater than 0.15)?

A. {f1}, {f2}, {f3}, {f4}, {f5}, {f2, f3}, {f2, f5}, {f3, f4}, {f3, f5}, {f4, f5}, {f2, f3, f5}, {f3, f4, f5}

B. {f3}, {f4}, {f5}, {f3, f4}, {f3, f5}

E. Don't know.
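The frequent itemsets of Question 14 can be enumerated by brute force from Table 4 (support greater than 0.15 means at least 2 of the 10 transactions). A short sketch; this is plain enumeration, not the a-priori algorithm itself:

```python
import numpy as np
from itertools import combinations

# Binary matrix of Table 4: rows o1..o10, columns f1..f9.
X = np.array([
    [0,0,0,1,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,1],
    [0,1,1,1,1,1,0,0,0],
    [1,0,0,0,0,0,0,0,0],
    [1,0,0,1,0,0,0,0,0],
    [0,0,1,1,0,0,0,1,0],
    [0,0,1,1,1,0,0,0,0],
    [0,0,0,0,1,0,0,0,0],
    [0,1,1,0,1,0,0,0,0],
    [0,0,1,1,1,0,1,0,0],
])
N, M = X.shape

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return X[:, list(itemset)].all(axis=1).mean()

frequent = [s for k in range(1, M + 1)
            for s in combinations(range(M), k) if support(s) > 0.15]
print([{f'f{i+1}' for i in s} for s in frequent])  # the 12 itemsets of option A
```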
Question 15.

C. The confidence is 1
D. The confidence is 1/10
E. Don't know.
Question 16. We will again consider the binarized version of the travel review dataset already encountered in Table 4; however, we will now only consider the first M = 4 features f1, f2, f3, f4. We wish to apply the a-priori algorithm (the specific variant encountered in chapter 19 of the lecture notes) to find all itemsets with support greater than ε = 0.35. Suppose at iteration k = 2 we know that:

L1 = [ 0 0 1 0 ]
     [ 0 0 0 1 ]

What, in the notation of the lecture notes, is C2?

A. C2 = [ 0 0 1 1 ]
        [ 1 0 1 0 ]
        [ 0 1 1 0 ]
        [ 0 1 0 1 ]
        [ 1 0 0 1 ]

B. C2 = [ 0 0 1 1 ]

C. C2 = [ 0 1 1 0 ]

D. C2 = [ 1 0 1 0 ]
        [ 1 0 0 1 ]
        [ 0 0 1 1 ]

E. Don't know.

Solution 16. At t = 1, L1 contains all singleton itemsets with a support of at least ε = 0.35:

L1 = [ 0 0 1 0 ]
     [ 0 0 0 1 ]

At t = 2, the candidate itemsets C2' are generated by joining pairs of itemsets from L1, giving the single candidate {f3, f4}. Then, for each itemset in C2', check that all subsets of size k - 1 are in L1. If so, keep them as C2:

C2 = [ 0 0 1 1 ]

Finally, for each itemset in C2, check that it has support of at least ε = 0.35, and if so keep it as L2:

L2 = [ 0 0 1 1 ]

Therefore, answer B is correct.

Question 17. Consider the observations in Table 4. We consider these as 9-dimensional binary vectors and wish to compute the pairwise similarity. Which of the following statements is true?

A. Cos(o1, o3) ≈ 0.132
B. J(o2, o3) ≈ 0.0
C. SMC(o1, o3) ≈ 0.268
D. SMC(o2, o4) ≈ 0.701
E. Don't know.

Solution 17. The problem is solved by simply using the definitions of SMC, Jaccard similarity and cosine similarity as found in the lecture notes. The true values are Cos(o1, o3) = 1/√5 ≈ 0.447, J(o2, o3) = 0/6 = 0.0, SMC(o1, o3) = 5/9 ≈ 0.556 and SMC(o2, o4) = 7/9 ≈ 0.778, and therefore option B is correct.
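The three similarity measures have one-line definitions for binary vectors; a self-contained sketch using the relevant rows of Table 4 (the function names are ours):

```python
import numpy as np

def smc(a, b):
    """Simple matching coefficient: fraction of agreeing coordinates."""
    return np.mean(a == b)

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of the 1-coordinates."""
    union = np.sum((a + b) > 0)
    return np.sum(a * b) / union if union else 0.0

def cos(a, b):
    """Cosine similarity."""
    return np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))

o1 = np.array([0,0,0,1,0,0,0,0,0])
o2 = np.array([0,0,0,0,0,0,0,0,1])
o3 = np.array([0,1,1,1,1,1,0,0,0])
o4 = np.array([1,0,0,0,0,0,0,0,0])

print(round(cos(o1, o3), 3))      # 0.447, not 0.132 -> A is false
print(round(jaccard(o2, o3), 3))  # 0.0             -> B is true
print(round(smc(o1, o3), 3))      # 0.556, not 0.268 -> C is false
print(round(smc(o2, o4), 3))      # 0.778, not 0.701 -> D is false
```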
Figure 8: TPR, FPR curves for the classifier.

Question 18. Which of the following ROC curves corresponds to the classifier whose TPR and FPR curves are shown in Figure 8?

A. ROC curve 1
B. ROC curve 2
C. ROC curve 3
D. ROC curve 4
E. Don't know.
Question 19. Consider again the travel review dataset in Table 1. We would like to predict a resort's rating using a linear regression, and since we would like the model to be as interpretable as possible, we will use variable selection to obtain a parsimonious model. We limit ourselves to the five features x1, x6, x7, x8, x9, and in Table 6 we have pre-computed the estimated training and test error for different variable combinations of the dataset. Which of the following statements is correct?

A. Forward selection will select attributes x6
B. Forward selection will select attributes x1, x6, x7, x8
C. Backward selection will select attributes x1, x6
D. Forward selection will select attributes x1, x6
E. Don't know.

Feature(s)             Training RMSE   Test RMSE
none                   5.25            5.528
x1                     4.794           5.566
x6                     4.563           4.57
x7                     5.246           5.52
x8                     5.245           5.475
x9                     4.683           5.185
x1, x6                 3.344           4.213
x1, x7                 4.794           5.565
x6, x7                 4.561           4.591
x1, x8                 4.742           5.481
x6, x8                 4.559           4.614
x7, x8                 5.242           5.473
x1, x9                 3.945           4.967
x6, x9                 4.552           4.643
x7, x9                 4.679           5.223
x8, x9                 4.674           5.284
x1, x6, x7             3.338           4.165
x1, x6, x8             3.325           4.161
x1, x7, x8             4.741           5.494
x6, x7, x8             4.557           4.648
x1, x6, x9             3.314           4.258
x1, x7, x9             3.945           4.958
x6, x7, x9             4.55            4.67
x1, x8, x9             3.942           4.93
x6, x8, x9             4.546           4.717
x7, x8, x9             4.667           5.354
x1, x6, x7, x8         3.315           4.098
x1, x6, x7, x9         3.307           4.218
x1, x6, x8, x9         3.282           4.234
x1, x7, x8, x9         3.942           4.911
x6, x7, x8, x9         4.542           4.767
x1, x6, x7, x8, x9     3.266           4.195

Table 6: Root-mean-square error (RMSE) for the training and test set when using least squares regression to predict y in the travel review dataset using different combinations of the features x1, x6, x7, x8, x9.

Solution 19. The correct answer is B. To solve this problem, it suffices to show which variables will be selected by forward/backward selection. First note that in variable selection, we need only concern ourselves with the test error, as the training error will as a rule drop trivially when more variables are introduced, and is furthermore not what we ultimately care about.

Forward selection: The method is initialized with the empty set {}, having an error of 5.528.

Step i = 1: The available variable sets are obtained by taking the current variable set {} and adding each of the left-out variables, resulting in the sets {x1}, {x6}, {x7}, {x8}, {x9}. Since the lowest error of the available sets is 4.57 ({x6}), which is lower than 5.528, we update the current selected variables to {x6}.

Step i = 2: Adding each of the left-out variables to {x6} results in the sets {x1, x6}, {x6, x7}, {x6, x8}, {x6, x9}. Since the lowest error of the available sets is 4.213 ({x1, x6}), which is lower than 4.57, we update the current selected variables to {x1, x6}.
Step i = 3: Adding each of the left-out variables to {x1, x6} results in the sets {x1, x6, x7}, {x1, x6, x8}, {x1, x6, x9}. Since the lowest error of the available sets is 4.161 ({x1, x6, x8}), which is lower than 4.213, we update the current selected variables to {x1, x6, x8}.

Step i = 4: Adding each of the left-out variables to {x1, x6, x8} results in the sets {x1, x6, x7, x8} and {x1, x6, x8, x9}. Since the lowest error of the available sets is 4.098 ({x1, x6, x7, x8}), which is lower than 4.161, we update the current selected variables to {x1, x6, x7, x8}.

Step i = 5: The only available set is {x1, x6, x7, x8, x9}. Since its error, 4.195, is not lower than the current error 4.098, the algorithm terminates with {x1, x6, x7, x8}.

Backward selection: The method is initialized with the set {x1, x6, x7, x8, x9}, having an error of 4.195.

Step i = 1: Removing each of the included variables results in the sets {x1, x6, x7, x8}, {x1, x6, x7, x9}, {x1, x6, x8, x9}, {x1, x7, x8, x9}, {x6, x7, x8, x9}. Since the lowest error of the available sets is 4.098 ({x1, x6, x7, x8}), which is lower than 4.195, we update the current selected variables to {x1, x6, x7, x8}.

Step i = 2: Removing each of the included variables results in the sets {x1, x6, x7}, {x1, x6, x8}, {x1, x7, x8}, {x6, x7, x8}. Since the lowest error of the newly constructed sets, 4.161, is not lower than the current error 4.098, the algorithm terminates.

Both procedures thus select {x1, x6, x7, x8}; in particular, forward selection selects x1, x6, x7, x8, so option B is correct (and option C is false, since backward selection does not select {x1, x6}).
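The greedy forward-selection walk through Table 6 can be mechanized. A sketch over the test RMSEs transcribed from the table (the dictionary layout is our own):

```python
# Test RMSEs from Table 6, keyed by sorted feature tuples.
test_rmse = {
    (): 5.528,
    ('x1',): 5.566, ('x6',): 4.57, ('x7',): 5.52, ('x8',): 5.475, ('x9',): 5.185,
    ('x1','x6'): 4.213, ('x1','x7'): 5.565, ('x6','x7'): 4.591, ('x1','x8'): 5.481,
    ('x6','x8'): 4.614, ('x7','x8'): 5.473, ('x1','x9'): 4.967, ('x6','x9'): 4.643,
    ('x7','x9'): 5.223, ('x8','x9'): 5.284,
    ('x1','x6','x7'): 4.165, ('x1','x6','x8'): 4.161, ('x1','x7','x8'): 5.494,
    ('x6','x7','x8'): 4.648, ('x1','x6','x9'): 4.258, ('x1','x7','x9'): 4.958,
    ('x6','x7','x9'): 4.67, ('x1','x8','x9'): 4.93, ('x6','x8','x9'): 4.717,
    ('x7','x8','x9'): 5.354,
    ('x1','x6','x7','x8'): 4.098, ('x1','x6','x7','x9'): 4.218,
    ('x1','x6','x8','x9'): 4.234, ('x1','x7','x8','x9'): 4.911,
    ('x6','x7','x8','x9'): 4.767,
    ('x1','x6','x7','x8','x9'): 4.195,
}
features = ['x1', 'x6', 'x7', 'x8', 'x9']

selected, best = (), test_rmse[()]
while True:
    # Try adding each remaining feature; keep the best candidate set.
    candidates = [tuple(sorted(selected + (f,))) for f in features if f not in selected]
    cand = min(candidates, key=lambda c: test_rmse[c])
    if test_rmse[cand] >= best:
        break  # no candidate improves the test error: stop
    selected, best = cand, test_rmse[cand]

print(selected, best)  # ('x1', 'x6', 'x7', 'x8') 4.098 -> option B
```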
C. p(y = 1|x̂2 = 0, x̂3 = 1) = 0.218
Step i = 1 The available variable sets to choose be-
tween is obtained by taking the current variable D. p(y = 1|x̂2 = 0, x̂3 = 1) = 0.046
set {x1 , x6 , x7 , x8 , x9 } and removing each of the
E. Don’t know.
left-out variables thereby resulting in the sets
{x1 , x6 , x7 , x8 }, {x1 , x6 , x7 , x9 }, {x1 , x6 , x8 , x9 },
Solution 20. The problem is solved by a simple
{x1 , x7 , x8 , x9 }, {x6 , x7 , x8 , x9 }. Since the lowest
application of Bayes’ theorem:
error of the available sets is 4.098, which is lower
than 4.195, we update the current selected vari- p(y = 1|x̃2 = 0, x̃3 = 1)
ables to {x1 , x6 , x7 , x8 }
p(x̃2 = 0, x̃3 = 1|y = 1)p(y = 1)
= P3
Step i = 2 The available variable sets to choose be- k=1 p(x̃2 = 0, x̃3 = 1|y = k)p(y = k)
tween is obtained by taking the current vari-
able set {x1 , x6 , x7 , x8 } and removing each of the The values of p(y) are given in the problem text and
left-out variables thereby resulting in the sets the values of p(x̃2 = 0, x̃3 = 1|y) in Table 7. Inserting
{x1 , x6 , x7 }, {x1 , x6 , x8 }, {x1 , x7 , x8 }, {x6 , x7 , x8 }, the values we see option A is correct.
{x1 , x6 , x9 }, {x1 , x7 , x9 }, {x6 , x7 , x9 }, {x1 , x8 , x9 },
{x6 , x8 , x9 }, {x7 , x8 , x9 }. Since the lowest error of
the newly constructed sets is not lower than the
current error the algorithm terminates.
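The Bayes computation is a one-liner in numpy; a small sketch with the numbers from Table 7 and the priors of the question:

```python
import numpy as np

# p(x2h = 0, x3h = 1 | y) for y = 1, 2, 3 (row two of Table 7) and priors p(y).
p_cond = np.array([0.17, 0.28, 0.33])
prior = np.array([0.268, 0.366, 0.365])

posterior = p_cond * prior / np.sum(p_cond * prior)
print(round(posterior[0], 2))  # 0.17 -> option A
```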
Question 21. Consider the classification tree shown in Figure 9, with decision nodes A, B and C. Which one of the following options corresponds to a correct rule assignment to the nodes in the decision tree?

Figure 9: Example classification tree.

A. A: ||[x1, x2]^T - [2, 4]^T||_1 < 2,  B: ||[x1, x2]^T - [6, 0]^T||_2 < 3,  C: ||[x1, x2]^T - [4, 2]^T||_2 < 2

B. A: ||[x1, x2]^T - [2, 4]^T||_1 < 2,  B: ||[x1, x2]^T - [4, 2]^T||_2 < 2,  C: ||[x1, x2]^T - [6, 0]^T||_2 < 3

C. A: ||[x1, x2]^T - [4, 2]^T||_2 < 2,  B: ||[x1, x2]^T - [6, 0]^T||_2 < 3,  C: ||[x1, x2]^T - [2, 4]^T||_1 < 2

D. A: ||[x1, x2]^T - [4, 2]^T||_2 < 2,  B: ||[x1, x2]^T - [2, 4]^T||_1 < 2,  C: ||[x1, x2]^T - [6, 0]^T||_2 < 3

E. Don't know.
Solution 21. This problem is solved by using the definition of a decision tree and observing what classification rule each assignment of rules to node names in the decision tree will result in. That is, beginning at the top of the tree, check whether the condition assigned to the node is met, and proceed along the true or false leg of the tree.

The resulting decision boundaries for each of the options are shown in Figure 11, and it follows that answer A is correct.

Figure 11: Classification boundaries induced by each of the options (top row: options A and B; bottom row: C and D).
              ANN                Log.reg.
             n_h*   E1_test     λ*    E2_test
Outer fold 1   1     0.561      0.1    0.439
Outer fold 2   1     0.513      0.1    0.487
Outer fold 3   1     0.564      0.1    0.436
Outer fold 4   1     0.671      0.1    0.329

Question 22. Suppose we wish to compare a neural network model and a regularized logistic regression model on the travel review dataset. For the neural network, we wish to find the optimal number of hidden neurons n_h, and for the regression model the optimal value of λ. We therefore opt for a two-level cross-validation approach where, for each outer fold, we train the model on the training split and use the test split to find the optimal number of hidden units (or regularization strength) using cross-validation with K2 = 5 folds. There are K1 = 4 outer folds, and S = 5 values of the parameter are tested for each model. How many models are trained in total?

A. 208 models
B. 100 models
C. 200 models
D. 104 models
E. Don't know.

Solution 22. Going over the two-level cross-validation algorithm, we see the total number of models to be trained for a single method is

K1 (K2 S + 1) = 4 (5 · 5 + 1) = 104.

Since we have to do this for each of the two models, and S = 5 in both cases, we need to train twice this number of models, i.e. 208, and therefore A is correct.
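The count follows the structure of the nested loops; a sketch of the bookkeeping (the loop skeleton is ours, no actual models are fitted):

```python
K1, K2, S = 4, 5, 5  # outer folds, inner folds, parameter values per model

fits = 0
for outer in range(K1):
    for inner in range(K2):
        fits += S   # one fit per tested parameter value in each inner fold
    fits += 1       # refit the selected model on the outer training split

print(fits, 2 * fits)  # 104 per method, 208 for both -> option A
```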
Question 23. We fit a GMM to a single feature x6 from the travel review dataset. Recall the density of a 1D GMM is

p(x) = sum_{k=1}^{K} w_k N(x | μ_k, σ_k²),

and suppose that the identified values of the mixture weights are …

Solution 23. Recall γ_{i2} is the posterior probability that observation i is assigned to mixture component 2, which can easily be obtained using Bayes' theorem.

The table below shows, for N = 7 observations, the true class label and the prediction of the AdaBoost classifier after round t = 1:

Variable   True class y   Prediction at t = 1
y1              1                 1
y2              2                 1
y3              2                 1
y4              1                 2
y5              1                 1
y6              1                 2
y7              2                 1
Given this information, how will AdaBoost update the weights w?

A. w = [0.25, 0.1, 0.1, 0.1, 0.25, 0.1, 0.1]
B. w = [0.388, 0.045, 0.045, 0.045, 0.388, 0.045, 0.045]
C. w = [0.126, 0.15, 0.15, 0.15, 0.126, 0.15, 0.15]
D. w = [0.066, 0.173, 0.173, 0.173, 0.066, 0.173, 0.173]
E. Don't know.

Solution 24. We first observe the AdaBoost classifier at t = 1 mis-classifies observations

{y2, y3, y4, y6, y7}.

Since the initial weights are just w_i = 1/N, we therefore get:

ε_t = sum_{i=1}^{N} w_i(t) (1 - δ_{f_t(x_i), y_i}) = 5/7 ≈ 0.714.

From this, we compute α_t as

α_t = (1/2) log( (1 - ε_t) / ε_t ) = -0.458.

Scaling the weights of the mis-classified observations as w_i e^{α_t} and those of the correctly classified observations as w_i e^{-α_t}, and normalizing the new weights to sum to one, then gives answer A.
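The standard AdaBoost weight update can be checked numerically with the data from the table above; a short sketch:

```python
import numpy as np

y_true = np.array([1, 2, 2, 1, 1, 1, 2])
y_pred = np.array([1, 1, 1, 2, 1, 2, 1])  # round t = 1 predictions

N = len(y_true)
w = np.full(N, 1 / N)                  # initial uniform weights
miss = y_pred != y_true

eps = np.sum(w[miss])                  # weighted error rate, 5/7
alpha = 0.5 * np.log((1 - eps) / eps)  # approx -0.458
w = w * np.exp(np.where(miss, alpha, -alpha))
w /= w.sum()                           # normalize to sum to one
print(np.round(w, 3))  # [0.25 0.1 0.1 0.1 0.25 0.1 0.1] -> option A
```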
4
−1 X
E(σ) = log pσ (xi ).
4
i=1
A. LOO curve 1
B. LOO curve 2
C. LOO curve 3
D. LOO curve 4
E. Don’t know.
Solution 25. The density assigned to observation i, when the KDE is fitted on the other N - 1 observations, is:

p_σ(x_i) = (1/(N - 1)) sum_{j ≠ i} N(x_i | x_j, σ²).
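Evaluating E(σ) over a grid of widths and comparing against the curves in Figure 13 settles the question. A sketch (the grid of σ values is our own choice):

```python
import numpy as np

x = np.array([3.918, -6.35, -2.677, -3.003])
N = len(x)

def gauss(u, sigma):
    """Gaussian kernel with standard deviation sigma, evaluated at u."""
    return np.exp(-u**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def E(sigma):
    """Average LOO negative log-likelihood of the KDE at width sigma."""
    loglik = [np.log(np.mean(gauss(x[i] - np.delete(x, i), sigma)))
              for i in range(N)]
    return -np.mean(loglik)

for sigma in [0.25, 0.5, 1.0, 2.0, 5.0]:
    print(sigma, round(E(sigma), 3))
```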
Solution 26. Recall that the correlation between two attributes x_i and x_j is positive if Σ_ij > 0, negative if Σ_ij < 0, and zero if Σ_ij ≈ 0. Furthermore, recall that positive correlation in a scatter plot means the points (x_i, x_j) tend to lie on a line sloping upwards, negative correlation means the line slopes downwards, and zero correlation means the data is axis-aligned.

We can therefore use the scatter plots of variables x_i, x_j to read off the sign of Σ_ij (or whether it is zero). We thereby find that Σ = Σ1 generated the data. We can now read off the covariance as Cov[x1, x2] = Σ_12 and the variance of each variable as Var[x1] = Σ_11 and Var[x2] = Σ_22, giving

Corr[x1, x2] = Cov[x1, x2] / sqrt( Var[x1] Var[x2] ) = 0.647,

and therefore answer A is correct.

Figure 15: 1000 observations drawn from a Gaussian Mixture Model (GMM) with three clusters.
Question 27. Consider the 1000 observations shown in Figure 15. Which one of the following Gaussian Mixture Models was used to generate the data?

A. p(x) = (1/4) N(x | [-7.2, 10.0]^T, [[2.4, -0.4], [-0.4, 1.7]])
        + (1/4) N(x | [-13.8, -0.8]^T, [[1.7, -0.3], [-0.3, 2.3]])
        + (1/2) N(x | [-6.8, 6.4]^T, [[1.6, 0.9], [0.9, 1.5]])

B. p(x) = (1/2) N(x | [-7.2, 10.0]^T, [[1.6, 0.9], [0.9, 1.5]])
        + (1/4) N(x | [-13.8, -0.8]^T, [[1.7, -0.3], [-0.3, 2.3]])
        + (1/4) N(x | [-6.8, 6.4]^T, [[2.4, -0.4], [-0.4, 1.7]])

C. p(x) = (1/4) N(x | [-7.2, 10.0]^T, [[1.6, 0.9], [0.9, 1.5]])
        + (1/2) N(x | [-13.8, -0.8]^T, [[2.4, -0.4], [-0.4, 1.7]])
        + (1/4) N(x | [-6.8, 6.4]^T, [[1.7, -0.3], [-0.3, 2.3]])

D. p(x) = (1/4) N(x | [-7.2, 10.0]^T, [[2.4, -0.4], [-0.4, 1.7]])
        + (1/4) N(x | [-13.8, -0.8]^T, [[1.6, 0.9], [0.9, 1.5]])
        + (1/2) N(x | [-6.8, 6.4]^T, [[1.7, -0.3], [-0.3, 2.3]])

E. Don't know.

Figure 16: GMM densities corresponding to the alternative options.
Solution 27. The three components in the candidate GMM densities can be matched to the colored observations by their mean values. Then, by considering the basic properties of the covariance matrices, we can easily rule out all options except A. Alternatively, Figure 16 shows the densities corresponding to option B (upper left), C (upper right) and D (bottom center).
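As a sanity check, one can sample from the mixture in option A and compare the resulting scatter with Figure 15; a sketch (the seed is our choice; the sample size 1000 matches the figure caption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture from option A: weights, means and covariances.
weights = [0.25, 0.25, 0.5]
means = [np.array([-7.2, 10.0]), np.array([-13.8, -0.8]), np.array([-6.8, 6.4])]
covs = [np.array([[2.4, -0.4], [-0.4, 1.7]]),
        np.array([[1.7, -0.3], [-0.3, 2.3]]),
        np.array([[1.6, 0.9], [0.9, 1.5]])]

# Draw 1000 observations: pick a component, then sample from its Gaussian.
comp = rng.choice(3, size=1000, p=weights)
samples = np.array([rng.multivariate_normal(means[k], covs[k]) for k in comp])

# A scatter plot of `samples` should resemble Figure 15; e.g. the third
# component slopes upwards since its off-diagonal covariance is positive.
print(samples.mean(axis=0))
```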