DSCI 303: Machine Learning For Data Science Fall 2020
Problem Set 4
Issued: Wednesday, Nov. 2, 2020 Due: Monday, Dec. 4, 2020
Problem 1 (Gaussian Mixture Model, 15pts). Consider the labeled training points in Figure 1,
where '+' and 'o' denote positive and negative labels, respectively. Some students were asked to
fit Gaussian Mixture Models to this dataset.
1. Two students decided to use one Gaussian distribution for the positive data and another
distribution for the negative data. The pink ellipse indicates the contour of the positive
Gaussian distribution and the light blue ellipse indicates the contour of the negative Gaussian
distribution. Explain which model fits this dataset better, and why the two models differ. (6
points)
2. The third student decided to use two Gaussian distributions for the positive data and two
Gaussian distributions for the negative data. This student used the EM algorithm to iteratively
update the parameters and also tried different initializations. The left column of Figure 2
shows Gaussian models with 3 different initializations, and the right column shows 3 possible
models after the first iteration. Using arrows, show which graph on the left will be followed
by which graph on the right after the first EM iteration, and explain why. Your answer will
be 3 arrows in total, one for each initialization. (9 points)
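As a purely illustrative reminder of how EM behaves under different initializations, here is a minimal one-dimensional, two-component GMM fit in NumPy on synthetic data (not the dataset in Figure 1); the function `em_gmm_1d`, the fixed variance, and the cluster locations are all simplifying assumptions made for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data: two well-separated Gaussian clusters
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])

def em_gmm_1d(x, mu, sigma=1.0, pi=0.5, n_iter=20):
    """Two-component 1-D GMM fit by EM (fixed variance, for brevity)."""
    mu1, mu2 = mu
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point
        p1 = pi * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)
        p2 = (1 - pi) * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)
        r = p1 / (p1 + p2)
        # M-step: re-estimate the means and the mixing weight
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1 - r) * x).sum() / (1 - r).sum()
        pi = r.mean()
    return mu1, mu2, pi

# Different initializations converge to different (here, label-swapped) optima
print(em_gmm_1d(x, mu=(-1.0, 1.0)))
print(em_gmm_1d(x, mu=(1.0, -1.0)))
```

Swapping the initial means swaps which component captures which cluster; tracking which initialization flows to which fitted model is the same exercise the arrows in Figure 2 are meant to trace.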
Problem 2 (Neural Network, 35pts). In this problem, the goal is to implement a simple fully
connected neural network to classify grayscale images of handwritten digits (0 - 9) from the MNIST
dataset. This dataset contains 60,000 training images and 10,000 testing images. We will remove
10,000 images from the training set and call this our validation set. Each image is 28 × 28 pixels in
size, and is generally represented as a flat vector of 784 numbers. The dataset also includes labels
for each image.
To start, implement a neural network with a single hidden layer and cross entropy loss. Use the
sigmoid function as an activation for the hidden layer, and softmax function for the output layer.
Recall that for a single sample, (x, y), the cross entropy loss is:
CE(y, ŷ) = − Σ_{k=1}^{K} y_k log ŷ_k ,
where ŷ(x) ∈ R^K is the vector of softmax outputs from the model for the training example x,
and y ∈ R^K , K = 10, is the one-hot ground-truth vector. Hence, if image x is the digit 1, we
will have y = [0, 1, 0, 0, . . . , 0]^T.

Figure 1: Dataset for Gaussian Mixture Model.

For n training samples, we consider the empirical risk (a function that
results from averaging the loss function over the data)
J(D, θ) = − (1/|D|) Σ_{(x,y)∈D} Σ_{k=1}^{K} y_k log ŷ_k (x),    (1)
where D is the data set and θ are the parameters of the network.
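The empirical risk in (1) can be computed directly. The following sketch (NumPy, with hypothetical toy logits and labels, not MNIST) shows the softmax outputs ŷ and the cross entropy averaged over a small batch:

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, y_hat):
    # Empirical risk: average of -sum_k y_k log yhat_k over the batch
    return -(y_onehot * np.log(y_hat + 1e-12)).sum(axis=1).mean()

# Toy example: batch of 2 samples, K = 3 classes
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
y = np.array([[1, 0, 0], [0, 1, 0]])  # one-hot labels
print(cross_entropy(y, softmax(logits)))
```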
Instead of using gradient descent, it is common practice to use a variant of the stochastic
gradient method. In the following we consider a version called the mini-batch gradient method,
which first shuffles the data and then processes it in batches of size B. The loss corresponding
to the mini-batch B is J(B, θ).
For your implementation, you need to add a regularization term to the cross entropy loss. The
regularized loss function corresponding to the mini-batch becomes

J(B, θ) + λ (||W1||^2 + ||W2||^2),    (2)

where λ is a hyperparameter. Here, W1 and W2 are the weight matrices and do not include the
bias terms! Do not penalize the biases. Set λ = 0.0001.
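The penalty in (2) touches only the weight matrices, never the biases. A minimal sketch with the stated λ; the layer shapes 784 × 300 and 300 × 10 follow the network described in this problem, but the random weights here are only a stand-in:

```python
import numpy as np

def l2_penalty(W1, W2, lam=1e-4):
    # Regularize only the weight matrices, never the bias vectors
    return lam * ((W1 ** 2).sum() + (W2 ** 2).sum())

rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, 300))  # input -> hidden weights
W2 = rng.standard_normal((300, 10))   # hidden -> output weights
print(l2_penalty(W1, W2))
```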
1. Do the following:
• Initialize the weights of the network by sampling from a standard normal distribution.
Initialize the bias terms to zero. Use 300 hidden units.
• Implement forward-propagation and backward-propagation for the mini-batch loss func-
tion J (B, θ).
• Shuffle the training set and randomly remove 10,000 images to form our validation set.
Set the batch size to B = 1000. This means that we will pass through the entire training
set in 50 iterations, which is called one epoch.
• Train the network for 30 epochs. Calculate the value of the loss function over all of
the training set (50,000 samples) after each epoch. Make a plot of total training loss
versus epoch. In this same plot, add the values of the loss function over the validation
set (10,000 samples).
• Similarly, add a plot of the prediction accuracy over the training set versus epoch. Just
like before, add the accuracy on the validation set.
Once you train your network, it might be easier to save your parameters so that you do not
need to retrain it every time that you want to try something new. This is not a requirement
for the problem, but it might make things easier. (24 points)
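The steps above can be sketched end to end. This is a minimal NumPy illustration on random stand-in data (not MNIST); the sample count, input scaling, learning rate, and epoch count are assumptions chosen so the toy run converges quickly, not part of the assignment:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy stand-in for MNIST: 200 samples, 784 features, 10 classes
X = 0.1 * rng.standard_normal((200, 784))
Y = np.eye(10)[rng.integers(0, 10, 200)]  # one-hot labels

H, K, lam, lr, B = 300, 10, 1e-4, 0.1, 50
W1 = rng.standard_normal((784, H)); b1 = np.zeros(H)  # normal init, zero bias
W2 = rng.standard_normal((H, K)); b2 = np.zeros(K)

def total_loss(X, Y):
    # Regularized loss of eq. (2) evaluated on a full dataset
    A = sigmoid(X @ W1 + b1)
    P = softmax(A @ W2 + b2)
    ce = -(Y * np.log(P + 1e-12)).sum(axis=1).mean()
    return ce + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())

losses = [total_loss(X, Y)]
for epoch in range(10):
    perm = rng.permutation(len(X))  # shuffle each epoch
    for i in range(0, len(X), B):   # one pass over the data = one epoch
        Xb, Yb = X[perm[i:i + B]], Y[perm[i:i + B]]
        # Forward pass
        A = sigmoid(Xb @ W1 + b1)
        P = softmax(A @ W2 + b2)
        # Backward pass: gradients of the mini-batch loss in eq. (2)
        dZ2 = (P - Yb) / len(Xb)
        gW2 = A.T @ dZ2 + 2 * lam * W2
        gb2 = dZ2.sum(axis=0)             # biases are not regularized
        dZ1 = (dZ2 @ W2.T) * A * (1 - A)  # sigmoid derivative
        gW1 = Xb.T @ dZ1 + 2 * lam * W1
        gb1 = dZ1.sum(axis=0)
        # Gradient step
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    losses.append(total_loss(X, Y))
print(losses[0], losses[-1])
```

The recorded `losses` list is exactly what the requested loss-versus-epoch plot would display for the training set; the validation curve is computed the same way on the held-out 10,000 images.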
2. All this while you should have stayed away from the test data completely. Now that you have
convinced yourself that the model is working as expected, it is finally time to measure the
model performance on the test set. Once we measure the test set performance, we report it
whatever the value may be, and do not go back and refine the model any further. (11 points)
Figure 2: Three different initializations and models after the first iteration.
Problem 3 (Naive Bayes, 25pts). In this problem you will apply the Naïve Bayes classifier to the
problem of spam detection, using a benchmark database assembled by some researchers. Download
the file spambase.data. The file is in csv format and contains a matrix; the last column contains
the labels and the other columns are the corresponding features.
Here X is n × d, where n = 4601 and d = 57. The different features correspond to different
properties of an email, such as the frequency with which certain characters appear. y is a vector
of labels indicating spam or not spam. For a detailed description of the dataset, visit the UCI
Machine Learning Repository, or Google ‘spambase’.
To evaluate the method, first shuffle the data, then treat the first 2000 examples as training
data, and the rest as test data.
1. Quantize each feature variable to one of two values, say 0 and 1, so that values below the
median map to 0, and those above map to 1. Fit the Naïve Bayes model using the training
data (i.e., estimate the class-conditional marginals), and compute the misclassification rate
(i.e., the test error) on the test data. The medians for quantization should be calculated using
only the training data. Report the test error. (16 points)
Note: On the spam detection problem, you may get a different test error depending on how
you quantize values that are equal to the median: it makes a difference whether such values
map to 0 or to 1. Quantize all features the same way; do not try all 2^d combinations. Make
sure you try both options, and report the one that works better.
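The quantize-then-fit pipeline can be sketched as follows. This NumPy illustration runs on synthetic stand-in data, not spambase; `fit_bernoulli_nb`, the feature count, and the class-dependent shift of the features are all illustrative assumptions:

```python
import numpy as np

def fit_bernoulli_nb(Xb, y):
    """Naive Bayes with binary features: estimate P(y) and P(x_j = 1 | y)."""
    priors, cond = {}, {}
    for c in (0, 1):
        Xc = Xb[y == c]
        priors[c] = len(Xc) / len(Xb)
        # Laplace smoothing avoids zero class-conditional probabilities
        cond[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
    return priors, cond

def predict(Xb, priors, cond):
    scores = []
    for c in (0, 1):
        p = cond[c]
        # Log-likelihood of each row under class c, plus the log prior
        ll = (Xb * np.log(p) + (1 - Xb) * np.log(1 - p)).sum(axis=1)
        scores.append(ll + np.log(priors[c]))
    return (scores[1] > scores[0]).astype(int)

# Synthetic stand-in for spambase: features weakly correlated with the label
rng = np.random.default_rng(0)
n, d = 400, 10
y = rng.integers(0, 2, n)
X = rng.normal(loc=y[:, None] * 0.8, scale=1.0, size=(n, d))

Xtr, ytr, Xte, yte = X[:200], y[:200], X[200:], y[200:]
med = np.median(Xtr, axis=0)              # medians from training data only
to_bin = lambda A: (A > med).astype(int)  # values == median map to 0 here

priors, cond = fit_bernoulli_nb(to_bin(Xtr), ytr)
err = (predict(to_bin(Xte), priors, cond) != yte).mean()
print(err)
```

Note how `to_bin` uses `>`: flipping it to `>=` is exactly the median-tie choice discussed above, and both variants should be tried on the real data.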
2. As a sanity check, what would be the test error if you always predicted the same class, namely,
the majority class from the training data? (9 points)
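The majority-class baseline is a one-liner; here is a sketch with random stand-in labels (the 2000/2601 split sizes come from the problem, but the labels themselves are synthetic, so the resulting error is not the spambase answer):

```python
import numpy as np

# Synthetic stand-in labels for the spambase train/test split
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, 2000)
y_test = rng.integers(0, 2, 2601)

# Majority-class baseline: always predict the most common training label
majority = np.bincount(y_train).argmax()
baseline_err = (y_test != majority).mean()
print(baseline_err)
```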
Problem 4 (Long Short-Term Memory Neural Network & Summary [1], 50 pts + 25 bonus pts).
Submit your Jupyter notebook (.ipynb) to deliver your answers.
References
[1] W. Nick Street, W. H. Wolberg, and O. L. Mangasarian, "Nuclear feature extraction for breast
tumor diagnosis," Proc. SPIE 1905, Biomedical Image Processing and Biomedical Visualization,
29 July 1993. https://doi.org/10.1117/12.148698