
DSCI 303: Machine Learning for Data Science

Fall 2020

Problem Set 4
Issued: Wednesday, Nov. 2, 2020 Due: Monday, Dec. 4, 2020

Please include the following statement at the beginning of your homework:


“I certify that all solutions are entirely in my words and I have credited all sources in this
writeup.”
You will be working on this homework by yourself. Discussion or collaboration
with anyone is NOT allowed. The Rice Honor Code applies. You are still allowed to ask
questions on Piazza.

Problem 1 (Gaussian Mixture Model, 15pts). Consider the labeled training points in Figure 1,
where '+' and 'o' denote positive and negative labels, respectively. Some students were asked to
fit Gaussian Mixture Models to this dataset.

1. Two students decided to use one Gaussian distribution for the positive data and another
distribution for the negative data. The pink ellipse indicates the contour of the positive Gaussian
distribution and the light blue ellipse indicates the contour of the negative Gaussian distribution.
Please explain which model fits this dataset better, and why the two models show some
differences. (6 points)

2. The third student decided to use two Gaussian distributions for the positive data and two
Gaussian distributions for the negative data. This student used the EM algorithm to iteratively
update the parameters and also tried different initializations. The left column of Figure 2 shows
Gaussian models with 3 different initializations, and the right column shows 3 possible models
after the first iteration. Using arrows, show which graph on the left will be followed by which
graph on the right after the first EM iteration, and explain why. Your answer will be 3 arrows
in total, one for each initialization. (9 points)

Figure 1: Dataset for Gaussian Mixture Model.

Figure 2: Three different initializations and models after the first iteration.

Problem 2 (Neural Network, 35pts). In this problem, the goal is to implement a simple fully
connected neural network to classify grayscale images of handwritten digits (0 - 9) from the MNIST
dataset. This dataset contains 60,000 training images and 10,000 testing images. We will remove
10,000 images from the training set and call this our validation set. Each image is 28 × 28 pixels in
size, and is generally represented as a flat vector of 784 numbers. The dataset also includes labels
for each image.
To start, implement a neural network with a single hidden layer and cross entropy loss. Use the
sigmoid function as the activation for the hidden layer, and the softmax function for the output layer.
Recall that for a single sample, (x, y), the cross entropy loss is:
$$\mathrm{CE}(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k,$$

where $\hat{y}(x) \in \mathbb{R}^K$ is the vector of softmax outputs from the model for the training example $x$,
and $y \in \mathbb{R}^K$, $K = 10$, is the one-hot ground-truth vector. Hence, if image $x$ is the digit 1, we will
have $y = [0, 1, 0, 0, \ldots, 0]^T$. For $n$ training samples, we consider the empirical risk (the function that
results from averaging the loss over the data)

$$J(\mathcal{D}, \theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x,y)\in\mathcal{D}} \sum_{k=1}^{K} y_k \log \hat{y}_k(x), \tag{1}$$

where $\mathcal{D}$ is the data set and $\theta$ denotes the parameters of the network.
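As an illustration, the empirical risk in Eq. (1) can be computed in a vectorized way. The following is a minimal NumPy sketch; the names Y_onehot and Y_hat are assumptions for the one-hot labels and the softmax outputs, not names prescribed by the problem:

```python
import numpy as np

def empirical_risk(Y_onehot, Y_hat, eps=1e-12):
    """Average cross entropy over the data set, as in Eq. (1).

    Y_onehot : (n, K) one-hot ground-truth labels
    Y_hat    : (n, K) softmax outputs of the model
    eps guards against taking log(0).
    """
    return -np.mean(np.sum(Y_onehot * np.log(Y_hat + eps), axis=1))
```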
Instead of using gradient descent, it is common practice to use a variant of the stochastic
gradient method. In the following, we consider a version called the mini-batch gradient method,
which first shuffles the data and then processes it in batches of size $B$. The loss corresponding to
the mini-batch $\mathcal{B}$ is
$$J(\mathcal{B}, \theta).$$
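For concreteness, here is a minimal sketch of such a mini-batch loop in NumPy; update_fn is a placeholder for a single gradient step on the mini-batch loss $J(\mathcal{B}, \theta)$, which you will implement:

```python
import numpy as np

def run_epoch(X, y, B, update_fn, rng):
    """One pass over the data: shuffle, then process batches of size B.

    update_fn(X_batch, y_batch) is a placeholder for one gradient step
    on the mini-batch loss J(B, theta).
    """
    n = X.shape[0]
    perm = rng.permutation(n)           # reshuffle the data each epoch
    for start in range(0, n, B):
        idx = perm[start:start + B]     # indices of the current mini-batch
        update_fn(X[idx], y[idx])

# Assumed usage: 50,000 training images with B = 1000 gives 50 iterations
# per epoch, i.e. one epoch per call.
# rng = np.random.default_rng(0)
# for epoch in range(30):
#     run_epoch(X_train, Y_train, B=1000, update_fn=sgd_step, rng=rng)
```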
For your implementation, you need to add a regularization term to the cross entropy loss. The
regularized loss function corresponding to the mini-batch becomes

$$J(\mathcal{B}, \theta) + \lambda \left( \|W_1\|^2 + \|W_2\|^2 \right), \tag{2}$$
where $\lambda$ is now a hyperparameter. Here, $W_1$ and $W_2$ are the weight matrices and do not include the
bias terms! Do not penalize the bias. Set $\lambda = 0.0001$.
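As an illustration, the regularized loss of Eq. (2) might be computed as follows; ce_batch is an assumed name for the average cross entropy over the batch:

```python
import numpy as np

def regularized_loss(ce_batch, W1, W2, lam=1e-4):
    """Mini-batch cross entropy plus the L2 penalty of Eq. (2).

    ce_batch : scalar, average cross entropy J(B, theta) over the batch
    W1, W2   : weight matrices only -- the bias terms are NOT penalized
    lam      : the hyperparameter lambda, set to 0.0001 in this problem
    """
    return ce_batch + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
```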
1. Do the following:
• Initialize the weights of the network by sampling from a standard normal distribution.
Initialize the bias terms to zero. Set the number of hidden units to 300.
• Implement forward propagation and backward propagation for the mini-batch loss function
$J(\mathcal{B}, \theta)$; a minimal sketch of the initialization and forward pass appears after this list.
• Shuffle the training set and randomly remove 10,000 images to form our validation set.
Set the batch size to $B = 1000$. This means that we will pass through the entire training
data in 50 iterations, which is called one epoch.
• Train the network for 30 epochs. Calculate the value of the loss function over all of
the training set (50,000 samples) after each epoch. Make a plot of total training loss
versus epoch. In this same plot, add the values of the loss function over the validation
set (10,000 samples).
• Similarly, add a plot of the prediction accuracy over the training set versus epoch. Just
like before, add the accuracy of the validation set.
Once you train your network, it may be convenient to save your parameters so that you do not
need to retrain it every time you want to try something new. This is not a requirement
for the problem, but it may save you time. (24 points)
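As referenced above, here is a minimal sketch of the initialization and forward pass under the stated assumptions (784 inputs, 300 sigmoid hidden units, 10 softmax outputs, standard-normal weights, zero biases). All function names are illustrative, and the backward pass is left to you:

```python
import numpy as np

def init_params(n_in=784, n_hidden=300, n_out=10, rng=np.random.default_rng(0)):
    """Weights sampled from a standard normal; biases initialized to zero."""
    return {
        "W1": rng.standard_normal((n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.standard_normal((n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax; subtracting the row max improves numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, p):
    """Forward pass for a mini-batch X of shape (B, 784).

    Returns the hidden activations (needed later for backprop) and the
    softmax outputs of shape (B, 10).
    """
    h = sigmoid(X @ p["W1"] + p["b1"])       # hidden layer, shape (B, 300)
    y_hat = softmax(h @ p["W2"] + p["b2"])   # output layer, shape (B, 10)
    return h, y_hat
```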
2. Up to this point, you should have stayed away from the test data completely. Now that you have
convinced yourself that the model is working as expected, it is finally time to measure the
model's performance on the test set. Once we measure the test-set performance, we report it,
whatever the value may be, and do not go back and refine the model any further. (11 points)


Problem 3 (Naive Bayes, 25pts). In this problem you will apply the Naïve Bayes classifier to the
problem of spam detection, using a benchmark database assembled by some researchers. Download
the file spambase.data. The file is in CSV format and contains a matrix; the last column contains
the labels and the other columns contain the corresponding features.
Here the data matrix X is n × d, where n = 4601 and d = 57. The different features correspond to different
properties of an email, such as the frequency with which certain characters appear. y is a vector
of labels indicating spam or not spam. For a detailed description of the dataset, visit the UCI
Machine Learning Repository, or Google ‘spambase’.
To evaluate the method, first shuffle the data, then treat the first 2000 examples as training
data, and the rest as test data.

1. Quantize each feature variable to one of two values, say 0 and 1, so that values below the
median map to 0, and those above map to 1. Fit the Naïve Bayes model using the training
data (i.e., estimate the class-conditional marginals), and compute the misclassification rate
(i.e., the test error) on the test data. The medians for quantization should be calculated using
only the training data. Report the test error. (16 points)

Note: On the spam detection problem, you get a different test error depending
on how you quantize values that are equal to the median. It makes a difference whether you
quantize values equal to the median to 0 or to 1. Quantize ties with the median the same way
for every feature; do not try all $2^d$ combinations. Make sure you try both options, and report
the one that works better. A sketch of the quantization and model fit follows this note.
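A minimal sketch of the median quantization and a Bernoulli-style Naïve Bayes fit, assuming X_train, y_train, X_test, y_test are already split as described; the function names are illustrative, and the Laplace smoothing parameter alpha is an assumption of the sketch, not a requirement of the problem:

```python
import numpy as np

def quantize(X, medians, ties_to=1):
    """Binarize features: above the (training) median -> 1, below -> 0.

    ties_to controls how values exactly equal to the median are mapped;
    the problem asks you to try both 0 and 1 and report the better one.
    """
    Q = (X > medians).astype(int)
    Q[X == medians] = ties_to
    return Q

def fit_naive_bayes(Xq, y, alpha=1.0):
    """Estimate class priors and class-conditional marginals P(x_j = 1 | y = c)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    theta = np.array([(Xq[y == c].sum(axis=0) + alpha) /
                      ((y == c).sum() + 2.0 * alpha) for c in classes])
    return classes, priors, theta

def predict(Xq, classes, priors, theta):
    """Return the class that maximizes the log posterior for each row."""
    log_post = (np.log(priors)
                + Xq @ np.log(theta).T
                + (1 - Xq) @ np.log(1.0 - theta).T)
    return classes[np.argmax(log_post, axis=1)]

# Assumed usage, with medians computed from the training data only:
# medians = np.median(X_train, axis=0)
# classes, priors, theta = fit_naive_bayes(quantize(X_train, medians), y_train)
# y_pred = predict(quantize(X_test, medians), classes, priors, theta)
# test_error = np.mean(y_pred != y_test)
```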

2. As a sanity check, what would be the test error if you always predicted the same class, namely,
the majority class from the training data? (9 points)
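For this sanity check, a short sketch using the same assumed variable names as above:

```python
import numpy as np

def majority_baseline_error(y_train, y_test):
    """Test error when always predicting the majority training class."""
    majority = np.bincount(y_train.astype(int)).argmax()
    return np.mean(y_test != majority)
```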

Problem 4 (Long Short-Term Memory Neural Network & Summary 1, 50pts + bonus 25pts).
You will submit your Jupyter notebook (.ipynb) to deliver your answers.

Problem 5 (Summary 2, oral, 50pts).


This will be an oral problem. You will be scheduled to meet the instructor for
about 5 minutes on Zoom to answer oral questions about what we learned in the
class.
Please sign up for your oral problem, which will be held on Nov 20, Dec 3, or Dec 4, via the
sign-up link. The signup will close on Nov 13. If you need to schedule this meeting at another time,
please email your available times to akane.sano@rice.edu asap.
The instructor will see you on Zoom at your scheduled time. First, you will join a waiting room,
and the instructor will let you in at your scheduled time. Rescheduling the oral problem less than
24 hours before the scheduled meeting is not allowed.
