\documentclass{article}

\usepackage{amsmath}
\begin{document}

\begin{center}
\textbf{CS 475 Machine Learning (Fall 2023): Assignment 3}
\end{center}

\begin{center}
\textbf{Student details}
\end{center}

\begin{center}
\textbf{Institutional affiliations}
\end{center}

\begin{center}
\textbf{Problem 1}
\end{center}

Yes, a decision tree can correctly classify a set of \( N \) two-dimensional data points that are linearly separable.

Here's how such a decision tree could work:

- A decision tree can make "axis-parallel" decisions, meaning that at each node it
can make a decision based on a threshold along one of the axes (either the x-axis
or the y-axis in the two-dimensional case).
- For linearly separable data, there exists a line (or hyperplane in higher
dimensions) that can separate the data points into two classes. This line can be
described by the equation \( w^T x + w_0 = 0 \), where \( w \) is the weight vector
and \( w_0 \) is the bias term.
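For example, in two dimensions the line \( x_1 + x_2 - 1 = 0 \) corresponds to \( w = (1, 1)^\top \) and \( w_0 = -1 \).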

The decision tree would recursively partition the space such that at each node, it
chooses the best threshold along one of the axes to separate the data points based
on their labels. The depth of the tree would depend on how these partitions are
made.

In the ideal case, the decision tree could have a depth of \( O(\log N) \) if the
data points can be separated in a balanced manner at each step. In the worst
case, however, the tree could degenerate to a depth of \( O(N) \), where each
decision node splits off only a single point.

Since the data is linearly separable, the decision boundary created by the decision
tree will be a series of axis-parallel lines that, taken together, approximate the
separating hyperplane described by \( w \) and \( w_0 \). This means that while the
decision boundary of the tree may not be a single straight line like the optimal
linear separator, it will still partition the space in such a way that all points
are correctly classified.

The specific shape of the tree and its depth will be determined by the specific
distribution and arrangement of the points in the two-dimensional space. If the
data points are nicely distributed such that at each split, the set of points is
divided roughly in half, the tree will be more balanced and shallower. If the
points are distributed such that many splits only separate a small number of
points, the tree will be deeper and less balanced.
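To make this concrete, here is a minimal sketch (assuming scikit-learn's \texttt{DecisionTreeClassifier}, which makes exactly these axis-parallel threshold splits); an unrestricted tree reaches zero training mistakes on linearly separable data:

\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Linearly separable 2-D data: label is the sign of x1 + x2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# No depth limit: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.score(X, y))   # 1.0 -- zero training mistakes
print(tree.get_depth())   # depth depends on the point arrangement
\end{verbatim}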
\begin{center}
\textbf{Problem 2}
\end{center}

Yes, a decision tree can still correctly classify \( N \) points that are not
linearly separable with zero mistakes, under the assumption that there are no
duplicate points with different labels (i.e., there are no two points \( x_i \) and
\( x_j \) such that \( x_i = x_j \) but \( y_i \neq y_j \)). This is because
decision trees do not require the data to be linearly separable; they can create
complex, non-linear decision boundaries by splitting the data at each node based on
the feature values.

Here is what such a tree could look like:

- The decision tree would essentially "memorize" the training data by creating a
path for each point to a leaf node that corresponds to its label.
- At each node, the decision tree would make a decision that isolates a subset of
the points from the rest. This process would continue until each point is isolated
in its own leaf node with the correct label.

The depth of the tree would depend on how these splits are made:

- In the best case, the tree might still be somewhat balanced if it can isolate
multiple points with the same label at each split. However, without linear
separability, there is no guarantee that the tree can be balanced, and it's likely
that some splits will only separate a small number of points.
- In the worst case, especially if the data points are "intertwined" in a complex
way, the tree could have a depth of \( N \), where each leaf node corresponds to a
single data point.

This kind of decision tree would likely be highly overfitted to the training data.
It would have a perfect classification accuracy on the training set but would
probably not generalize well to unseen data, as it would have learned the noise and
specific details of the training set rather than any underlying pattern that might
predict the labels of new data points.
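As a minimal sketch of this memorization behavior (again assuming scikit-learn), the XOR configuration below is the classic non-linearly-separable case, yet an unrestricted tree still fits it perfectly:

\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# XOR: not linearly separable, but no duplicate points with
# conflicting labels, so a tree can isolate each region.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

tree = DecisionTreeClassifier().fit(X, y)
print(tree.score(X, y))  # 1.0 -- perfect training accuracy
\end{verbatim}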

\begin{center}
\textbf{Problem 3}
\end{center}

In AdaBoost, the idea is to build a strong classifier \( H(x) \) as a weighted
combination of weak classifiers \( h_t(x) \), with \( t = 1, 2, \ldots, T \). After
\( T \) rounds of AdaBoost, the ensemble classifier is:

\[
H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)
\]

with the predicted label given by \( \operatorname{sign}(H(x)) \).

Each weak classifier \( h_t \) is trained to minimize the weighted error on the
training data, with the weights updated at each iteration to focus more on the
examples that were misclassified by the ensemble up to that point. The weight of
each training example \( i \) at iteration \( T \) is denoted as \( W^{(T)}_i \),
and we normalize the weights so that their sum is 1.

Now, when we add the classifier \( h_{T+1} \), it is chosen to minimize the
weighted error:

\[
\epsilon_{T+1} = \sum_{i=1}^{N} W^{(T)}_i \cdot I(y_i \neq h_{T+1}(x_i))
\]

where \( I \) is the indicator function, which is 1 when the condition is true and
0 otherwise. The weight update rule in AdaBoost is:

\[
W^{(T+1)}_i = \frac{W^{(T)}_i \cdot \exp(-\alpha_{T+1} \cdot y_i \cdot h_{T+1}(x_i))}{Z_{T+1}}
\]

where \( Z_{T+1} \) is a normalization factor that makes the sum of weights equal
to 1.

The coefficient \( \alpha_{T+1} \) is chosen as:

\[
\alpha_{T+1} = \frac{1}{2} \ln\left(\frac{1 - \epsilon_{T+1}}{\epsilon_{T+1}}\right)
\]
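For instance, if \( h_{T+1} \) has weighted error \( \epsilon_{T+1} = 0.2 \), then \( \alpha_{T+1} = \frac{1}{2}\ln 4 \approx 0.69 \), so each misclassified example has its weight multiplied by \( e^{0.69} \approx 2 \) and each correctly classified example by \( e^{-0.69} \approx 0.5 \), before normalization.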

This means that after adding \( h_{T+1} \), the weight of a misclassified example
increases, and the weight of a correctly classified example decreases:

\[
W^{(T+1)}_i = \frac{W^{(T)}_i \exp(\alpha_{T+1})}{Z_{T+1}} \quad \text{if } y_i \neq h_{T+1}(x_i)
\]

and

\[
W^{(T+1)}_i = \frac{W^{(T)}_i \exp(-\alpha_{T+1})}{Z_{T+1}} \quad \text{if } y_i = h_{T+1}(x_i)
\]

Given that the sum of weights is normalized to 1, the weighted error of
\( h_{T+1} \) under the updated weights \( W^{(T+1)}_i \) is:

\[
\epsilon' = \sum_{i=1}^{N} W^{(T+1)}_i \cdot I(y_i \neq h_{T+1}(x_i))
\]

Since the new weights \( W^{(T+1)}_i \) have been adjusted precisely to account for
the errors made by \( h_{T+1} \), half of the total weight will be on examples that
\( h_{T+1} \) gets right, and the other half will be on examples it gets wrong.
Thus, the new weighted error \( \epsilon' \) will be exactly \( 1/2 \).
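This can be verified directly. Substituting the optimal \( \alpha_{T+1} \) into the normalization factor gives

\[
Z_{T+1} = \epsilon_{T+1} e^{\alpha_{T+1}} + (1 - \epsilon_{T+1}) e^{-\alpha_{T+1}} = 2\sqrt{\epsilon_{T+1}(1 - \epsilon_{T+1})},
\]

so that

\[
\epsilon' = \frac{\epsilon_{T+1} e^{\alpha_{T+1}}}{Z_{T+1}} = \frac{\sqrt{\epsilon_{T+1}(1 - \epsilon_{T+1})}}{2\sqrt{\epsilon_{T+1}(1 - \epsilon_{T+1})}} = \frac{1}{2}.
\]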

Regarding whether we could select the same classifier again in the following round:
theoretically it is possible, but not useful. Since the weights are updated so that
the new weighted error of \( h_{T+1} \) is exactly \( 1/2 \), choosing
\( h_{T+2} = h_{T+1} \) would not decrease the weighted error further; it would
remain \( 1/2 \). AdaBoost aims to add new classifiers that improve the ensemble,
which means reducing the weighted error. Since \( h_{T+1} \) no longer provides an
advantage after the weights have been updated, it would not be chosen again
immediately in the next round. The algorithm would instead look for another
classifier that could reduce the weighted error below \( 1/2 \).
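Concretely, even if \( h_{T+2} = h_{T+1} \) were selected, its vote would be \( \alpha_{T+2} = \frac{1}{2} \ln\left(\frac{1 - 1/2}{1/2}\right) = 0 \), so it would contribute nothing to the ensemble.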

\begin{center}
\textbf{Problem 4}
\end{center}

The AdaBoost algorithm aims to minimize the empirical exponential loss on the
training data. The empirical exponential loss for a given classifier \( h_t \) at
iteration \( t \) with respect to the training data \( (x_i, y_i) \) for \( i =
1, \ldots, N \) and weights \( W_i^{(t)} \) is given by:

\[
L_t = \sum_{i=1}^{N} W_i^{(t)} \exp(-y_i h_t(x_i) \alpha_t)
\]

where \( \alpha_t \) is the weight (vote strength) of the classifier \( h_t \), and
\( y_i \in \{-1, +1\} \) are the true labels for the training examples.

To minimize the exponential loss, we take the derivative of \( L_t \) with respect
to \( \alpha_t \), set it to zero, and solve for \( \alpha_t \). This will give us
the optimal \( \alpha_t \) that minimizes the loss.

Let's compute the derivative of \( L_t \) with respect to \( \alpha_t \):

\[
\frac{\partial L_t}{\partial \alpha_t} = \sum_{i=1}^{N} W_i^{(t)} \exp(-y_i
h_t(x_i) \alpha_t) (-y_i h_t(x_i))
\]

Setting this derivative equal to zero gives:

\[
0 = \sum_{i=1}^{N} W_i^{(t)} \exp(-y_i h_t(x_i) \alpha_t) (-y_i h_t(x_i))
\]

Let us denote the subset of indices of examples correctly classified by \( h_t \) as
\( I_{\text{correct}} \) and the subset of misclassified examples as
\( I_{\text{wrong}} \). Since \( y_i h_t(x_i) = 1 \) if \( i \in I_{\text{correct}} \)
and \( y_i h_t(x_i) = -1 \) if \( i \in I_{\text{wrong}} \), we can rewrite the
equation above as:

\[
\sum_{i \in I_{\text{correct}}} W_i^{(t)} \exp(-\alpha_t) - \sum_{i \in I_{\text{wrong}}} W_i^{(t)} \exp(\alpha_t) = 0
\]

Dividing both sides by \( \exp(-\alpha_t) \), we get:

\[
\sum_{i \in I_{\text{correct}}} W_i^{(t)} - \sum_{i \in I_{\text{wrong}}} W_i^{(t)}
\exp(2\alpha_t) = 0
\]

The sums of the weights of the correctly classified and misclassified examples are
related to the weighted error \( \epsilon_t \) of the classifier \( h_t \):

\[
\epsilon_t = \sum_{i \in I_{\text{wrong}}} W_i^{(t)}
\]
\[
1 - \epsilon_t = \sum_{i \in I_{\text{correct}}} W_i^{(t)}
\]

Substituting these into our equation gives us:

\[
(1 - \epsilon_t) - \epsilon_t \exp(2\alpha_t) = 0
\]

Solving for \( \exp(2\alpha_t) \), we have:

\[
\exp(2\alpha_t) = \frac{1 - \epsilon_t}{\epsilon_t}
\]

Taking the natural logarithm of both sides yields:

\[
2\alpha_t = \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)
\]

And solving for \( \alpha_t \) gives us the expression we are looking for:

\[
\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)
\]

This expression for \( \alpha_t \) is the one that minimizes the empirical
exponential loss for the classifier \( h_t \) at iteration \( t \), given the
selection of \( h_t \) and the weights \( W_i^{(t)} \).
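As a check that this stationary point is indeed a minimum, note that the second derivative is strictly positive:

\[
\frac{\partial^2 L_t}{\partial \alpha_t^2} = \sum_{i=1}^{N} W_i^{(t)} (y_i h_t(x_i))^2 \exp(-y_i h_t(x_i) \alpha_t) > 0,
\]

since \( (y_i h_t(x_i))^2 = 1 \), so \( L_t \) is convex in \( \alpha_t \). Note also that \( \alpha_t > 0 \) exactly when \( \epsilon_t < 1/2 \): a weak classifier receives a positive vote only if it performs better than random guessing.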

\begin{center}
\textbf{Problem 5}
\end{center}

To set up the dual optimization problem for a kernel SVM, we need to express the
primal problem in terms of the dual variables and then formulate it in the standard
form required by the quadratic program solvers. The primal problem of the SVM with
a regularization parameter \( C \) is given by:

\[
\min_{w} \left( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \max\left(0, 1 - y_i (w^\top
\phi(x_i) - w_0)\right) \right)
\]

where \( w \) is the weight vector in the feature space \( \phi \), \( w_0 \) is
the bias term, \( C \) is the penalty parameter for the error term, \( x_i \) is
the \( i \)-th training example, \( y_i \) is the corresponding label, and \( N \)
is the number of training examples.

The dual formulation is derived using Lagrange multipliers \( \alpha_i \) for the
constraints \( y_i (w^\top \phi(x_i) - w_0) \geq 1 \). The dual problem is:

\[
\max_{\alpha} \left( \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right)
\]

subject to:

\[
0 \leq \alpha_i \leq C, \quad \text{for all } i
\]
\[
\sum_{i=1}^{N} \alpha_i y_i = 0
\]

Here, \( k(x_i, x_j) \) is the kernel function, which computes the dot product in
the feature space, i.e., \( k(x_i, x_j) = \phi(x_i)^\top \phi(x_j) \).
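For example, the linear kernel \( k(x_i, x_j) = x_i^\top x_j \) corresponds to \( \phi \) being the identity map, while richer kernels implicitly map into higher-dimensional feature spaces.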

Now, let's express this dual problem in the standard form for quadratic programming
solvers:

Objective Function \( H \) and \( f \):

For the dual problem, the matrix \( H \) in the objective function
\( \frac{1}{2} \alpha^\top H \alpha \) is computed from the kernel function applied
to the training examples:

\[
H_{ij} = y_i y_j k(x_i, x_j)
\]

The vector \( f \) is simply \( -1 \) times a vector of ones of size \( N \):

\[
f = - \mathbf{1}
\]

where \( \mathbf{1} \) is a vector with all elements equal to 1.

Inequality Constraints \( A \) and \( a \):

For the inequality constraints \( 0 \leq \alpha_i \leq C \), we can set up the
matrix \( A \) and vector \( a \) as follows:

\[
A = \begin{pmatrix} -I_N \\ I_N \end{pmatrix}
\]
\[
a = \begin{pmatrix} \mathbf{0} \\ C \mathbf{1} \end{pmatrix}
\]

where \( I_N \) is the \( N \times N \) identity matrix, \( \mathbf{0} \) is a
vector of zeros, and \( \mathbf{1} \) is a vector of ones (both of size \( N \)),
so that \( A \alpha \leq a \) stacks the constraints \( -\alpha \leq \mathbf{0} \)
and \( \alpha \leq C \mathbf{1} \).

Equality Constraints \( B \) and \( b \):

The equality constraint \( \sum_{i=1}^{N} \alpha_i y_i = 0 \) can be expressed
using matrix \( B \) and vector \( b \) as:

\[
B = y^\top
\]
\[
b = 0
\]

where \( y \) is the vector of labels \( y_i \).

Putting all of this together (and negating the maximization objective so that it
becomes a minimization, as QP solvers expect), the quadratic programming problem
for the dual SVM is:

Minimize:
\[
\frac{1}{2} \alpha^\top H \alpha + f^\top \alpha
\]

Subject to:
\[
A \cdot \alpha \leq a
\]
\[
B \cdot \alpha = b
\]

where \( H \), \( f \), \( A \), \( a \), \( B \), and \( b \) are defined as
above. This is the standard form that can be fed into a quadratic program solver
to find the optimal Lagrange multipliers \( \alpha \). The solution to this
optimization problem gives the support vectors and their corresponding weights,
which can then be used to construct the final SVM classifier.
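As an illustrative sketch only, the construction above can be assembled numerically; this assumes the \texttt{cvxopt} package, whose \texttt{solvers.qp(P, q, G, h, A, b)} routine minimizes \( \frac{1}{2} x^\top P x + q^\top x \) subject to \( Gx \leq h \) and \( Ax = b \):

\begin{verbatim}
import numpy as np
from cvxopt import matrix, solvers

def solve_dual_svm(K, y, C):
    """Solve the kernel-SVM dual QP with H, f, A, a, B, b as above.

    K: N x N kernel matrix with K[i, j] = k(x_i, x_j)
    y: length-N array of +/-1 labels
    C: regularization parameter
    """
    N = len(y)
    H = matrix(np.outer(y, y) * K)                  # H_ij = y_i y_j k(x_i, x_j)
    f = matrix(-np.ones(N))                         # f = -1
    A = matrix(np.vstack([-np.eye(N), np.eye(N)]))  # stack -I_N over I_N
    a = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    B = matrix(y.reshape(1, -1).astype(float))      # B = y^T (a row vector)
    b = matrix(0.0)
    sol = solvers.qp(H, f, A, a, B, b)              # minimize (1/2)a'Ha + f'a
    return np.ravel(sol['x'])                       # optimal alpha
\end{verbatim}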

\end{document}
