
CSL2050 - Lab 7

Pattern Recognition and Machine Learning - 2022


Winter Semester
Ishaan Shrivastava [ B20AI013 ]

Question 1
Minor tasks completed relevant to the problem of classification:

1) Loaded the dataset from the GitHub repo.


2) EDA: printed and examined summary statistics of the dataset.
3) EDA: plotted the histogram of the target variable and found that it is
highly imbalanced, which we address later when binning the target.

4) Preprocessing: Encoded the categorical features and standardized all
numerical features, since this usually improves model performance.

5) Preprocessing: Binned the target variable into three bins (0-8, 9-10, 11-27)
and then one-hot encoded it, because a standard classification NN requires one
output node per class label. This split was chosen because the resulting binned
target is reasonably balanced, which is ideal for classification purposes.

6) Preprocessing: Train-Test split in the ratio [3:1]


7) Hyperparameters: I chose a mini-batch size of 16 for the training set (powers
of two are good choices because they can exploit the matrix-matrix product
speedup offered by the hardware, although I trained on a CPU rather than a GPU
and am not sure whether this speedup applies to CPUs as well). I also chose
learning_rate = 0.01 and n_epochs = 200 after testing the effect of different
learning rates on the model's training accuracy curve.
8) Dataset wrappers: I wrote dataset wrapper classes to make it easy to load
batches during model training, and instantiated the training data loader from the
train dataset (a hedged sketch of these steps follows this list).
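
As a hedged illustration of steps 4-8 (not the exact submitted code), the pipeline could look like the sketch below. It assumes a PyTorch-style setup; the DataFrame df, the column name "target", and the wrapper class name TabularDataset are illustrative placeholders.

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 5) Bin the (integer) target into three classes (0-8, 9-10, 11-27) and one-hot encode it
y_binned = pd.cut(df["target"], bins=[-1, 8, 10, 27], labels=[0, 1, 2]).astype(int)
y_onehot = np.eye(3)[y_binned.to_numpy()]

# 4) Standardize the numerical features
X = StandardScaler().fit_transform(df.drop(columns=["target"]))

# 6) Train-test split in the ratio 3:1
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.25)

# 8) Dataset wrapper plus a data loader that serves mini-batches of 16 (step 7)
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_loader = DataLoader(TabularDataset(X_train, y_train), batch_size=16, shuffle=True)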

MODEL ARCHITECTURE

● Loss function: cross-entropy loss is the standard choice for multi-class
classification problems of this sort, which is why I used it. It generally
reduces the error rate better than MSE loss would.

● Optimizer: stochastic gradient descent is the optimizer that has been
taught to us so far, which is why I used it for this classification
problem.
● Experimented with different hidden-layer sizes as shown below, and
concluded that model performance plateaus at 64 neurons in
the hidden layer (a hedged sketch of such a setup follows this list).
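
As a hedged sketch only (not the author's exact code), the setup described above could be written in a PyTorch style as follows; the single 64-neuron hidden layer, the ReLU hidden activation, and the use of PyTorch itself are assumptions, while the loss, optimizer, learning rate, epoch count and batch loader follow the description in this report.

import torch.nn as nn
import torch.optim as optim

# Hedged sketch: one 64-neuron hidden layer feeding 3 output classes,
# cross-entropy loss and plain SGD with lr = 0.01, trained for 200 epochs.
n_features = X_train.shape[1]            # from the preprocessing sketch above
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),                           # hidden activation is an assumption
    nn.Linear(64, 3),                    # raw logits for the 3 target bins
)
criterion = nn.CrossEntropyLoss()        # applies log-softmax internally
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(200):                 # n_epochs = 200
    for xb, yb in train_loader:          # mini-batches of 16 from the DataLoader
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # one-hot targets need PyTorch >= 1.10
        loss.backward()
        optimizer.step()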

The final model architecture I decided to use was this:

Final accuracy on train and test set:

Question 2

Task a: Preprocessing/Visualization, Stratified TTVS

● Loaded the dataset from the GitHub repo.


● Standardized the data for better model performance
● Plotted the histogram for class distribution. Demonstrated the class
imbalance which we take care of later using stratified train-test splitting.

● For visualization purposes, I reduced the dimensionality of the data using LDA
and plotted the pairplots of the best 4 features. I observed that the per-class
distributions are overall quite close to Gaussian and appear quite distinct from
one another. Given the simplicity of the data and the large number of samples, I
hypothesized that it would be very hard for the model to overfit at any level,
although I am not sure whether this is a correct assumption to make.

● One-hot encoded the class labels, as required by an MLP for a multi-class
classification problem.

● STRATIFIED TRAIN_VALIDATION_TEST SPLIT in the ratio [10611 : 2000 : 1000]
(see the sketch after this list).

○ A proportionate stratified split takes class-wise samples in proportion to
each class's share of the total population, for each split. This ensures
that the splits have the same class distribution as the original data.
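
A minimal sketch of this stratified three-way split, assuming X holds the standardized features and y the integer class labels; scikit-learn's train_test_split is applied twice with the stratify argument to obtain the [10611 : 2000 : 1000] sizes.

from sklearn.model_selection import train_test_split

# Carve out the 1000-sample test set first, stratified on the class labels,
# then carve the 2000-sample validation set out of the remaining data.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=1000, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2000, stratify=y_rest, random_state=0)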

Task b: MLP from scratch

Wrote the model class and other functions for the MLP

● def sigmoid(a): return 1/(1+np.exp(-a))


● def tanh(a): return 2*sigmoid(2*a) - 1
● def relu(a): return np.maximum(a, 0)
● def layer_sigmoid(layer_size_prev, layer_size_this):
● def layer_tanh(layer_size_prev, layer_size_this):
● def layer_relu(layer_size_prev, layer_size_this):

● class mlp_model:
○ def __init__(self, layer_sizes={}, layer_activations={},
learning_rate=None, n_epochs=None): initializes the model with the
architecture passed to it
○ def forward(self, X): Executes a forward pass on the passed input data and
stores the values necessary for gradient calculation in the cache
○ def predict(self, X): Outputs the predicted probabilities for the given data
○ def backward(self, m, Y_, Y, X): Calculates the gradient using the
stored cache and backpropagation
○ def update_params(self, lr): Updates the model parameters using the
gradients stored during the backward pass
○ def CE_loss(self, Y_, Y): Measures the cross entropy loss of a prediction
○ def train(self, X, Y, n_epochs=None, learning_rate = None):
Trains the model using the other class functions and the train data passed to it, with
Stochastic Gradient Descent
○ def train_BGD(self, X, Y, n_epochs=None, learning_rate =
None): Trains the model using the other class functions and the train data passed to
it, with Batch Gradient Descent
○ def save_state(self, path): Stores the current model parameters in a .npy
file

○ def load_state(self, path): Overwrites model parameters by loading
parameters from a previously generated saved state (.npy file)

Then, I trained the model using an architecture of two hidden layers, each containing
50 neurons, and measured the accuracy on the test set (a hypothetical usage sketch of
the class follows):
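
A hypothetical usage of the class interface listed above; the layout of the layer_sizes and layer_activations dictionaries, the variable names, and the hyperparameter values shown here are assumptions, since only the method signatures appear in this report.

n_classes = Y_train.shape[1]                       # one column per one-hot class
model = mlp_model(
    layer_sizes={1: 50, 2: 50, 3: n_classes},      # two hidden layers of 50 (assumed format)
    layer_activations={1: "tanh", 2: "tanh", 3: "softmax"},
    learning_rate=0.01,
    n_epochs=200,
)
model.train(X_train, Y_train)                      # stochastic gradient descent
Y_prob = model.predict(X_test)                     # predicted class probabilities
test_acc = (Y_prob.argmax(axis=1) == Y_test.argmax(axis=1)).mean()
model.save_state("mlp_2x50.npy")                   # persist the learned parameters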

Task c: Activation functions

● On an architecture of two hidden layers, each with 10 neurons, I tried three
different hidden-layer activations and compared the convergence speed and the
accuracy on the validation and test sets for all three.

● Sigmoid performed the best accuracy-wise of the three, by a small margin.
However, it converges visibly more slowly than the tanh activation; this can be
inferred from the shallower slope of the sigmoid curve compared to that of tanh.

● Tanh performed the worst of the three, which according to the literature is
usually not the case. It was, however, the fastest to converge, owing to its
steeper slope compared to sigmoid (see the sketch after this list).
● ReLU was the slowest to converge but performed decently. I do not have a
reasonable interpretation of this.
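
The slope argument above can be checked numerically: the derivative of the sigmoid peaks at 0.25, whereas the derivative of tanh peaks at 1, so tanh passes back larger gradients around zero. A small sketch:

import numpy as np

# Maximum derivative of sigmoid vs tanh: the steeper slope of tanh is what
# makes it converge faster in these experiments.
a = np.linspace(-3, 3, 7)                # includes a = 0, where both peaks occur
sig = 1 / (1 + np.exp(-a))
d_sigmoid = sig * (1 - sig)              # peaks at 0.25
d_tanh = 1 - np.tanh(a) ** 2             # peaks at 1.0
print(d_sigmoid.max(), d_tanh.max())     # 0.25 vs 1.0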

Task d: Weight Initialization

Consider a model in which every weight is initialized with the same constant value.

During the first epoch of training, every neuron in a layer is virtually
indistinguishable from every other, because the set of incoming weights of one
neuron is identical to that of any other neuron in the same layer. The gradients
accumulated are therefore also identical for every such weight, so the symmetry
within each layer persists even after the parameters are updated with their
corresponding gradients.

Hence, the above scenario will again repeat in epoch two, three, four, … and so on.

The implication is that initializing the model's weights and biases with the same
values (all-ones or all-zeros initialization) leads to a symmetric model that is
useless for prediction purposes (a symmetric model gives symmetric outputs too).

We can break the symmetry between the model weights by initializing them
randomly. However, it is important not to initialize them with values that are too
large or too small, as the parameters may diverge during training (a sketch of
constant versus Xavier initialization follows).
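
A minimal NumPy sketch of the two schemes being contrasted here: constant initialization, which leaves every neuron in a layer identical, and Xavier (Glorot) initialization, which draws random weights whose scale depends on the layer widths; the function names are illustrative.

import numpy as np

def init_constant(n_in, n_out, c=1.0):
    # Every neuron gets the same incoming weights, so all neurons in the layer
    # compute the same output and receive the same gradient: symmetry is never broken.
    return np.full((n_out, n_in), c), np.zeros((n_out, 1))

def init_xavier(n_in, n_out, seed=0):
    # Xavier/Glorot uniform initialization: weights drawn from
    # U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)), which keeps
    # activations and gradients at a reasonable scale and breaks symmetry.
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in)), np.zeros((n_out, 1))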

I have plotted the accuracy versus epoch curves for models trained with all three
types of initialization:

1) Random
2) Zero
3) Constant

<<graphs plotted below>>

Weights initialized with random values: [ XAVIER INITIALIZATION for better
convergence ]

Weights initialized with constant:

Weights initialized with zero:

Task e: Architecture selection

I tried out a bunch of different architectures, listed below:

1) Single hidden layer,

2) Two hidden layers,
3) Three hidden layers,

with the exact configurations shown in the screenshots attached further below.

The three-hidden-layer models performed the worst by far. I suspect this is
because, as the model complexity increases, the parameters and gradients become
smaller and smaller, which leads to numerical errors that are even visible in the
output at train time.

The single-hidden-layer models are decent but do not fit the data as well as the
two-hidden-layer models, which are only slightly better.

Overall, I found that increasing the model's complexity too much invariably
causes errors in the training output, likely for reasons similar to the one stated
above for the three-hidden-layer case.
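
The comparison itself amounts to a small sweep over candidate layer configurations scored on the validation split; the sketch below is hypothetical and reuses the assumed dictionary format from the usage example in Task b.

candidates = [
    {1: 50, 2: n_classes},                 # one hidden layer
    {1: 50, 2: 50, 3: n_classes},          # two hidden layers
    {1: 50, 2: 50, 3: 50, 4: n_classes},   # three hidden layers
]
for sizes in candidates:
    acts = {k: "tanh" for k in sizes}
    acts[max(sizes)] = "softmax"           # softmax on the output layer
    m = mlp_model(layer_sizes=sizes, layer_activations=acts,
                  learning_rate=0.01, n_epochs=200)
    m.train(X_train, Y_train)
    val_acc = (m.predict(X_val).argmax(axis=1) == Y_val.argmax(axis=1)).mean()
    print(sizes, val_acc)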

Ultimately, I decided NOT to use the best-performing architecture, because its
results were for some reason not reproducible. Instead, I used the most
consistently well-performing architecture, which was the two-hidden-layer model
with 50 tanh neurons in each layer.

Below I have attached screenshots of some of the architectures I tested (I tried
more than these).

Task f: Provision to save and load weights

I have added methods to the model class that allow it to save and load previously
trained parameters. I have demonstrated saving and loading on an empty model (a
hedged sketch of the underlying NumPy calls follows).
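
A hedged sketch of what these two methods can look like with NumPy .npy files, assuming the weights and biases live in a dictionary attribute self.params (the attribute name is an assumption):

import numpy as np

def save_state(self, path):
    # Serialize the dict of weight/bias arrays into a single .npy file.
    np.save(path, self.params, allow_pickle=True)

def load_state(self, path):
    # Overwrite the current parameters with the previously saved ones.
    self.params = np.load(path, allow_pickle=True).item()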

FINAL NOTES:

I used the validation set and the training plots to determine the best
hyperparameter values for pretty much everything. This was time-consuming, so I
did not include the relevant code in the submission folder. One interesting
observation was that ReLU gradients tend to explode very easily if the learning
rate is kept too high.

