Pattern Recognition and Machine Learning - 2022 Winter Semester
Question 1
Minor tasks completed relevant to the problem of classification:
● Converted the target variable into a classification target by binning it into three bins: 0-8, 9-10, and 11-27. This split was chosen because it makes the resulting classes quite balanced, which is ideal for classification purposes.
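The binning step can be sketched as follows; the function name and sample values are illustrative, not taken from the assignment's code:

```python
import numpy as np

# Hedged sketch: bin a continuous target into three classes,
# 0-8 -> class 0, 9-10 -> class 1, 11-27 -> class 2.
def bin_target(y):
    """Map raw target values to one of three class labels."""
    bins = [8, 10]  # right edges of the first two bins
    return np.digitize(y, bins, right=True)

y_raw = np.array([3, 8, 9, 10, 11, 27])
y_cls = bin_target(y_raw)  # -> [0 0 1 1 2 2]
```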
MODEL ARCHITECTURE
● Optimizer: Stochastic Gradient Descent, which is the optimizer that has been taught to us so far, so I used it for this classification problem.
● Experimented with different layer sizes as shown below, and
concluded that the model performance plateaus at 64 neurons in
the hidden layer.
The final model architecture I decided to use was this:
Question 2
samples, it would be very hard for the model to overfit at any capacity, although I am not sure whether this assumption is correct.
● One-hot encoded the class labels, as required by an MLP for a multi-class
classification problem.
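One-hot encoding can be sketched like this (the helper name is illustrative, not from the submitted code):

```python
import numpy as np

# Hedged sketch of one-hot encoding integer class labels for an MLP.
def one_hot(labels, n_classes):
    out = np.zeros((labels.shape[0], n_classes))
    out[np.arange(labels.shape[0]), labels] = 1.0  # one 1 per row
    return out

Y = one_hot(np.array([0, 2, 1]), 3)
# Each row has exactly one 1, in the column of its class.
```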
○ Proportionate Stratified split works by taking class-wise samples
proportionate to the total class population, for each split. This ensures
that the splits have the same class distributions as the original data.
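A proportionate stratified split might be sketched as below; the function name, split fraction, and seed are illustrative choices, not taken from the report:

```python
import numpy as np

# Hedged sketch of a proportionate stratified split: sample class-wise,
# taking the same fraction of every class for the training split.
def stratified_split(X, y, train_frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])   # shuffle within class
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# With 50 samples of each class, both splits keep the 50/50 class ratio.
X = np.arange(100).reshape(100, 1)
y = np.repeat([0, 1], 50)
X_tr, y_tr, X_te, y_te = stratified_split(X, y)
```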
Wrote the model class and other functions for the MLP
● class mlp_model:
○ def __init__(self, layer_sizes={}, layer_activations={},
learning_rate=None, n_epochs=None): initializes the model with the
architecture passed to it
○ def forward(self, X): Executes a forward pass on the passed input data and
stores the values necessary for gradient calculation in the cache
○ def predict(self, X): Outputs the predicted probabilities for the given data
○ def backward(self, m, Y_, Y, X): Calculates the gradient using the
stored cache and backpropagation
○ def update_params(self, lr): Updates the parameter values using the
gradients stored during the backward pass
○ def CE_loss(self, Y_, Y): Measures the cross entropy loss of a prediction
○ def train(self, X, Y, n_epochs=None, learning_rate = None):
Trains the model using the other class functions and the train data passed to it, with
Stochastic Gradient Descent
○ def train_BGD(self, X, Y, n_epochs=None, learning_rate =
None): Trains the model using the other class functions and the train data passed to
it, with Batch Gradient Descent
○ def save_state(self, path): Stores the current model parameters in a .npy
file
○ def load_state(self, path): Overwrites model parameters by loading
parameters from a previously generated saved state (.npy file)
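A simplified sketch of this interface (not the author's exact code) is shown below, covering __init__, forward, predict, and CE_loss, with sigmoid hidden layers and a softmax output; the weight scale and shape conventions are assumptions:

```python
import numpy as np

# Hedged sketch of the mlp_model interface. Shapes follow the convention
# X: (n_samples, n_in); the backward/update/train methods are omitted.
class mlp_model:
    def __init__(self, layer_sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = [rng.normal(0, 0.1, (a, b))
                  for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
        self.b = [np.zeros(b) for b in layer_sizes[1:]]
        self.cache = []  # activations kept for gradient calculation

    def forward(self, X):
        A = X
        self.cache = [A]
        for W, b in zip(self.W[:-1], self.b[:-1]):
            A = 1.0 / (1.0 + np.exp(-(A @ W + b)))  # sigmoid hidden layer
            self.cache.append(A)
        Z = A @ self.W[-1] + self.b[-1]
        Z -= Z.max(axis=1, keepdims=True)           # numerically stable softmax
        E = np.exp(Z)
        out = E / E.sum(axis=1, keepdims=True)
        self.cache.append(out)
        return out

    def predict(self, X):
        return self.forward(X)

    def CE_loss(self, Y_, Y):
        # Y_: predicted probabilities, Y: one-hot labels
        return -np.mean(np.sum(Y * np.log(Y_ + 1e-12), axis=1))
```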
Then, I trained the model using an architecture of two hidden layers, each containing
50 neurons, and measured the accuracy on the test set:
Task c: Activation functions
● Sigmoid performed the best accuracy-wise out of all three, by a small margin.
However, it is visibly slower to converge than tanh; this can be inferred from
the shallower slope of the sigmoid curve compared to that of tanh.
● Tanh performed the worst out of all three, which, according to the literature,
is usually not the case. However, tanh was the fastest to converge, due to its
steeper slope compared to sigmoid.
● ReLU was the slowest to converge but performed decently. I do not have a
reasonable interpretation of this.
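The three activations and their derivatives (as used in backpropagation) can be sketched as follows; the derivative values at 0 quantify the slope comparison above, since tanh'(0) = 1 is steeper than sigmoid'(0) = 0.25:

```python
import numpy as np

# Sketch of the three activations compared above, with their derivatives.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum slope 0.25 at x = 0

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # maximum slope 1 at x = 0

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (np.asarray(x) > 0).astype(float)  # 0 for x <= 0, 1 for x > 0
```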
Task d: Weight Initialization
Let us consider a model in which every weight has been initialized to the same
constant value.
While training, during the first epoch, each neuron would be virtually
indistinguishable from another neuron due to the weights being the same. This is
analogous to saying that the set of weights corresponding to a neuron would be
identical to the set of weights for another neuron, for every layer. This implies that
the gradients accumulated would also be the same for every weight, and hence the
symmetry within every layer would still remain after updating the parameters with
their corresponding gradients.
Hence, the above scenario will again repeat in epoch two, three, four, … and so on.
The implication of this is that initializing weights and biases of the model with same
values (1s initialization, 0s initialization) will lead to a symmetric model which will be
useless for prediction purposes (a symmetric model will give symmetric outputs too).
We can break the symmetry between the model weights by initializing them
randomly. However, it is important not to initialize them with values that are too
large or too small, as the parameters may diverge on training.
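The symmetry argument can be checked numerically on a toy one-hidden-layer network (the shapes, squared-error loss, and learning rate below are illustrative choices, not the assignment's model):

```python
import numpy as np

# Toy demonstration of the symmetry argument: with constant-initialized
# weights, every hidden neuron receives the same gradient, so the columns
# of W1 (each hidden unit's incoming weights) stay identical after an
# SGD step.
X = np.array([[1.0, 2.0]])
y = np.array([[1.0]])

W1 = np.full((2, 3), 0.5)          # constant init -> identical columns
W2 = np.full((3, 1), 0.5)

h = np.tanh(X @ W1)                # forward pass
out = h @ W2
err = out - y                      # dLoss/dout for squared error

gW1 = X.T @ (err @ W2.T * (1 - h ** 2))  # backprop through tanh

W1 -= 0.1 * gW1                    # SGD update: symmetry survives
```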
I have plotted the accuracy versus epoch curves for models trained with all three
types of initialization:
1) Random
2) Zero
3) Constant
Weights initialized with random values (Xavier initialization, for better convergence):
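The Xavier (Glorot) uniform scheme can be sketched as follows; the function name and seed are illustrative:

```python
import numpy as np

# Hedged sketch of Xavier (Glorot) uniform initialization for one weight
# matrix. The bound sqrt(6 / (n_in + n_out)) keeps activation variance
# roughly stable across layers, which is why it converges better than
# naive random initialization.
def xavier_uniform(n_in, n_out, seed=0):
    limit = np.sqrt(6.0 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(64, 32)
```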
Weights initialized with a constant value:
Task e: Architecture selection
The three-hidden-layer models performed the worst by far. I suspect this is because,
as the model complexity increases, the parameters and gradients become smaller and
smaller, leading to numerical errors that can even be seen in the output at train time.
The single-hidden-layer models are decent, but are not able to fit the data as well as
the two-hidden-layer models, which are just slightly better.
Overall, I found that increasing the complexity of the model too much invariably
causes errors in the training output, likely for reasons similar to the one already
stated above for three layers.
Below I have attached screenshots of some of the architectures I tested out (I
tried more than these):
Task f: Provision to save and load weights
I have added methods to the model class that allow it to save and load the weights
of previously trained models. I have demonstrated saving and loading on an empty model.
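Saving and loading via .npy files, as described for save_state/load_state above, might be sketched like this; the dict layout and class below are my assumptions, not necessarily the author's format:

```python
import os
import tempfile
import numpy as np

# Hedged sketch of save_state/load_state backed by a single .npy file
# holding a dict of parameter arrays.
class TinyModel:
    def __init__(self):
        self.W = np.random.default_rng(0).normal(size=(4, 3))
        self.b = np.zeros(3)

    def save_state(self, path):
        np.save(path, {"W": self.W, "b": self.b})

    def load_state(self, path):
        # np.save stores the dict as a 0-d object array, hence .item()
        state = np.load(path, allow_pickle=True).item()
        self.W, self.b = state["W"], state["b"]

path = os.path.join(tempfile.gettempdir(), "mlp_state.npy")
m1 = TinyModel()
m1.save_state(path)
m2 = TinyModel()
m2.W[:] = 0.0                      # wipe parameters, then restore from disk
m2.load_state(path)
```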
FINAL NOTES:
I used the validation set and the training plots to determine the best values of the
hyperparameters for pretty much everything. This was time-consuming, and hence I
did not include the relevant code in the submission folder. One interesting
observation I noted was that ReLU gradients tend to explode very easily if you keep
the learning rate too high.