
Faculty of Arts and Sciences

Department of Computer Science


CMPS 287 – Artificial Intelligence
Spring 2019-2020 – Assignment 2

1. Consider a sample of 10 marbles drawn from a bin containing red and green marbles. The probability that any marble we draw is red is µ = 0.55 (independently, with replacement). We address the probability of getting no red marbles (ν = 0) in the following cases:
(a) We draw only one such sample. Compute the probability that ν = 0.
(b) We draw 1000 independent samples. Compute the probability that (at least) one of the
samples has ν = 0.
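A minimal numerical sketch for problem 1, assuming Python; the variable names are illustrative only and not part of the assignment:

# Probability of drawing no red marbles (nu = 0) in a sample of 10,
# when each marble is red with probability mu = 0.55.
mu = 0.55           # probability a single marble is red
n = 10              # marbles per sample
num_samples = 1000  # independent samples in part (b)

# (a) one sample: all 10 marbles must be green
p_single = (1 - mu) ** n

# (b) 1000 samples: complement of "no sample has nu = 0"
p_at_least_one = 1 - (1 - p_single) ** num_samples

print(f"(a) P(nu = 0 in one sample) = {p_single:.6e}")
print(f"(b) P(nu = 0 in at least one of {num_samples} samples) = {p_at_least_one:.6e}")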
2. Consider the "2-intervals" learning model, where h : R → {−1, +1} and h(x) = +1 if the point is within either of two arbitrarily chosen intervals and −1 otherwise.
(a) What is the growth function m_H(N) for this hypothesis set?
(b) What is the VC dimension of this hypothesis set? Recall that the VC dimension is equal to the number of free parameters (degrees of freedom) of the model.
(c) The VC dimension is also equal to the maximum number of points the model can shatter (i.e., realize all possible dichotomies on). Show that the VC dimension you calculated above is indeed correct by illustrating that the 2-intervals model has a break point at (VC dimension + 1). That is, the 2-intervals model cannot shatter, i.e., realize all possible dichotomies on, any set of (VC dimension + 1) points.
3. Now, consider the general case: the "M-intervals" learning model. Again h : R → {−1, +1}, where
h(x) = +1 if the point falls inside any of M arbitrarily chosen intervals, otherwise h(x) = −1.
What is the (smallest) break point of this hypothesis set?
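For problems 2 and 3, a brute-force check can help confirm the growth function and the break point. The sketch below is my own illustration, not required by the assignment; it assumes the N points lie sorted on the real line, so a dichotomy is realizable by at most M intervals exactly when its +1 labels form at most M contiguous runs.

from itertools import product

def num_plus_runs(labels):
    """Count maximal contiguous runs of +1 in a label sequence."""
    runs, prev = 0, -1
    for y in labels:
        if y == 1 and prev != 1:
            runs += 1
        prev = y
    return runs

def realizable_dichotomies(N, M):
    """Dichotomies on N sorted points realizable by at most M intervals."""
    return sum(1 for labels in product([-1, 1], repeat=N)
               if num_plus_runs(labels) <= M)

# 2-intervals model: N = 4 is shattered (16 = 2^4 dichotomies), N = 5 is not.
for N in range(1, 7):
    print(N, realizable_dichotomies(N, M=2), 2 ** N)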
4. Compute the growth function m_H(N) for the learning model made up of two concentric circles in R^2. Specifically, H contains the functions that return +1 for a^2 ≤ x_1^2 + x_2^2 ≤ b^2 and −1 otherwise.
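A possible sanity check for problem 4, under my own assumptions (random points, Python/NumPy): since h only depends on r^2 = x_1^2 + x_2^2, each hypothesis acts like a single interval [a^2, b^2] on the squared radii, so the dichotomies can be enumerated by trying one threshold per gap.

from math import comb
import numpy as np

rng = np.random.default_rng(0)
N = 6
points = rng.normal(size=(N, 2))
r2 = points[:, 0] ** 2 + points[:, 1] ** 2  # squared radius of each point
r2_sorted = np.sort(r2)

# Candidate thresholds: one below the smallest squared radius, one in each
# gap between consecutive sorted squared radii, and one above the largest.
cuts = np.concatenate(([r2_sorted[0] - 1],
                       (r2_sorted[:-1] + r2_sorted[1:]) / 2,
                       [r2_sorted[-1] + 1]))

dichotomies = set()
for a2 in cuts:
    for b2 in cuts:
        if a2 <= b2:
            dichotomies.add(tuple(1 if a2 <= r <= b2 else -1 for r in r2))

# Compare with the single-interval count C(N+1, 2) + 1
print(len(dichotomies), comb(N + 1, 2) + 1)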
5. For hypothesis sets H_1, H_2, ..., H_K with finite, positive VC dimensions d_vc(H_k), some of the following bounds on the VC dimension of the intersection of the sets, i.e., d_vc(∩_{k=1}^{K} H_k), are correct and some are not. State which ones are correct and which are not, and then show which one among the correct ones is the tightest bound.
(a) 0 ≤ d_vc(∩_{k=1}^{K} H_k) ≤ Σ_{k=1}^{K} d_vc(H_k)
(b) 0 ≤ d_vc(∩_{k=1}^{K} H_k) ≤ min_{1≤k≤K} d_vc(H_k)
(c) 0 ≤ d_vc(∩_{k=1}^{K} H_k) ≤ max_{1≤k≤K} d_vc(H_k)
(d) min_{1≤k≤K} d_vc(H_k) ≤ d_vc(∩_{k=1}^{K} H_k) ≤ max_{1≤k≤K} d_vc(H_k)
(e) min_{1≤k≤K} d_vc(H_k) ≤ d_vc(∩_{k=1}^{K} H_k) ≤ Σ_{k=1}^{K} d_vc(H_k)
6. For hypothesis sets H_1, H_2, ..., H_K with finite, positive VC dimensions d_vc(H_k), some of the following bounds on the VC dimension of the union of the sets, i.e., d_vc(∪_{k=1}^{K} H_k), are correct and some are not. State which ones are correct and which are not, and then show which one among the correct ones is the tightest bound.
(a) 0 ≤ d_vc(∪_{k=1}^{K} H_k) ≤ Σ_{k=1}^{K} d_vc(H_k)
(b) 0 ≤ d_vc(∪_{k=1}^{K} H_k) ≤ K − 1 + Σ_{k=1}^{K} d_vc(H_k)
(c) min_{1≤k≤K} d_vc(H_k) ≤ d_vc(∪_{k=1}^{K} H_k) ≤ Σ_{k=1}^{K} d_vc(H_k)
(d) max_{1≤k≤K} d_vc(H_k) ≤ d_vc(∪_{k=1}^{K} H_k) ≤ Σ_{k=1}^{K} d_vc(H_k)
(e) max_{1≤k≤K} d_vc(H_k) ≤ d_vc(∪_{k=1}^{K} H_k) ≤ K − 1 + Σ_{k=1}^{K} d_vc(H_k)
7. The Hoeffding inequality provides a way to characterize the generalization error with the probabilistic bound P[|E_in(g) − E_out(g)| > ε] ≤ 2M e^(−2Nε^2) for any ε > 0. If we set ε = 0.05 and want the probability bound 2M e^(−2Nε^2) to be at most 0.03, what is the least number of examples N needed when M is equal to
(a) 1
(b) 10
(c) 100
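One way to approach problem 7 is to solve 2M e^(−2Nε^2) ≤ 0.03 for N, which gives N ≥ ln(2M / 0.03) / (2ε^2). A small sketch of that calculation, assuming Python (the helper name min_examples is my own):

from math import ceil, log

def min_examples(M, eps=0.05, delta=0.03):
    """Smallest integer N with 2*M*exp(-2*N*eps**2) <= delta."""
    return ceil(log(2 * M / delta) / (2 * eps ** 2))

for M in (1, 10, 100):
    print(M, min_examples(M))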
8. In this problem, you will again explore the tradeoff between the number of training examples and the VC dimension of a learning model when it comes to generalization. In class, we defined the generalization bound as y = √( (8/N) ln( 4 m_H(2N) / δ ) ), where N is the number of training examples and m_H(N) is the growth function of the model. We also learnt that if a model has a "finite" VC dimension d, then m_H(N) = O(N^d). Fix the confidence level by setting δ = 0.0005 and plot the generalization bound versus the number of training examples for various VC dimensions d in the range (5, 10, 15, ..., 100). That is, for each value of d, plot a curve with the y-axis as y = √( (8/N) ln( 4 (2N)^d / δ ) ) and the x-axis as the number of training examples N, varying N over the range (0, 10, 20, ..., 100, 1000, 10000).
What can you observe from such a plot?
Hint: you might need to rewrite the equation above for the generalization bound to avoid an overflow error.
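A minimal plotting sketch, assuming Python with NumPy and matplotlib (not mandated by the assignment). It applies the hint by expanding ln(4(2N)^d/δ) into ln 4 + d·ln(2N) − ln δ, so (2N)^d is never formed explicitly; N = 0 is skipped because the bound is undefined there.

import numpy as np
import matplotlib.pyplot as plt

delta = 0.0005
# N values from the assignment; N = 0 is omitted since the bound is undefined there
Ns = np.array(list(range(10, 101, 10)) + [1000, 10000], dtype=float)

for d in range(5, 101, 5):
    # ln(4 * (2N)^d / delta) expanded to avoid overflow from (2N)^d
    log_term = np.log(4) + d * np.log(2 * Ns) - np.log(delta)
    y = np.sqrt((8.0 / Ns) * log_term)
    plt.plot(Ns, y, label=f"d = {d}")

plt.xscale("log")
plt.xlabel("number of training examples N")
plt.ylabel("generalization bound")
plt.legend(fontsize=6, ncol=2)
plt.show()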
9. Consider an unknown boolean target function f over a 3-dimensional boolean input space. We
are given a data set D of five examples represented in the table below, where y_n = f(x_n) for
n = 1, 2, 3, 4, 5.
x_n      y_n
0 0 0    0
0 0 1    1
0 1 0    1
0 1 1    0
1 0 0    1
Note that in this simple boolean case, we can enumerate the entire input space (since there are only 2^3 = 8 distinct input vectors), and we can enumerate the set of all possible target functions (there are only 2^8 = 256 distinct boolean functions on 3 boolean inputs).
Let us look at the problem of learning f. Since f is unknown except inside D, any function that agrees with D could conceivably be f. Since there are only 3 points in X outside D, there are only 2^3 = 8 such functions. The points in X which are not in D are: 101, 110, and 111.
Now, consider the following two possible hypotheses:
(i) g returns 1 for all points.
(ii) g is the XOR function applied to x, i.e., if the number of 1s in x is odd, g returns 1; if it is even, g returns 0.
(a) Compute the in-sample error for the two above hypotheses, which is the percentage of points in D for which g returns a wrong value (i.e., g(x_n) ≠ y_n).
(b) Compute an estimate for the out-of-sample error for these two hypotheses. Recall that there
are only 3 points outside of D, namely 101, 110 and 111. Since we do not know the actual
labels for these 3 points, you need to consider all 8 possible cases for labels for these 3 points
(i.e., 000, 001, 010, 011, 100, 101, 110 and 111). The out-of-sample error for a hypothesis g
would then be 1/8 times the percentage of points for which g returns a wrong value, summed
over each of the 8 possible cases.
(c) Now, assume you know the actual labels for the three missing points, and assume they are 1 for all three points. What would be the out-of-sample error for each of the above hypotheses?
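A small enumeration sketch for problem 9, assuming Python; the names g_all_ones and g_xor are my own labels for hypotheses (i) and (ii).

from itertools import product

# Training data D: 3-bit inputs and their labels y_n = f(x_n)
D = [((0, 0, 0), 0), ((0, 0, 1), 1), ((0, 1, 0), 1), ((0, 1, 1), 0), ((1, 0, 0), 1)]
outside = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]  # the three points of X not in D

g_all_ones = lambda x: 1            # hypothesis (i)
g_xor = lambda x: sum(x) % 2        # hypothesis (ii)

for name, g in (("all ones", g_all_ones), ("XOR", g_xor)):
    # (a) in-sample error: fraction of D on which g disagrees with y_n
    e_in = sum(g(x) != y for x, y in D) / len(D)

    # (b) average out-of-sample error over the 8 possible labelings of the
    #     three points outside D, each labeling treated as equally likely
    e_out_avg = sum(sum(g(x) != y for x, y in zip(outside, labels)) / 3
                    for labels in product([0, 1], repeat=3)) / 8

    # (c) out-of-sample error if all three missing labels are actually 1
    e_out_all_ones = sum(g(x) != 1 for x in outside) / 3

    print(f"{name}: E_in = {e_in:.2f}, avg E_out = {e_out_avg:.3f}, "
          f"E_out(all 1s) = {e_out_all_ones:.2f}")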