
CSL2050 - Lab 7

Pattern Recognition and Machine Learning - 2022


Winter Semester
Ishaan Shrivastava [ B20AI013 ]

Question 1
Minor tasks completed relevant to the problem of classification:

1) Loaded the dataset from the GitHub repo.


2) EDA: printed and examined summary statistics of the dataset.
3) EDA: plotted the histogram of the target variable and found that it is
highly imbalanced, which we address later when binning the target.

4) Preprocessing: Encoded the categorical features and standardized all
numerical features, since this usually improves model performance.

5) Preprocessing: Binned the target variable into three bins (0-8, 9-10, 11-27)
and then one-hot encoded it, because a standard classification NN requires one
output node per class label. This split was chosen because the resulting binned
target is reasonably balanced, which is ideal for classification purposes.

6) Preprocessing: Train-Test split in the ratio [3:1]


7) Hyperparameters: I chose a mini-batch size of 16 for the training set (powers
of two are good choices because they can exploit the matrix-matrix product
speedup offered by the hardware, although I trained on a CPU rather than a GPU
and am not sure whether this speedup applies to CPUs as well). I also chose
learning_rate = 0.01 and n_epochs = 200 after testing the effect of different
learning rates on the model's training accuracy curve.
8) Dataset wrappers: I wrote dataset wrapper classes to make it easy to load
batches during model training, and instantiated the training data loader from the
train dataset (a hedged sketch of these steps follows this list).
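
As a hedged illustration of steps 4-8 (not the exact submitted code), the pipeline could look like the sketch below. It assumes a PyTorch-style setup; the DataFrame df, the column name "target", and the wrapper class name TabularDataset are illustrative placeholders.

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 5) Bin the (integer) target into three classes (0-8, 9-10, 11-27) and one-hot encode it
y_binned = pd.cut(df["target"], bins=[-1, 8, 10, 27], labels=[0, 1, 2]).astype(int)
y_onehot = np.eye(3)[y_binned.to_numpy()]

# 4) Standardize the numerical features
X = StandardScaler().fit_transform(df.drop(columns=["target"]))

# 6) Train-test split in the ratio 3:1
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.25)

# 8) Dataset wrapper plus a data loader that serves mini-batches of 16 (step 7)
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_loader = DataLoader(TabularDataset(X_train, y_train), batch_size=16, shuffle=True)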

MODEL ARCHITECTURE

● Loss function: cross-entropy loss is the standard choice for multi-class
classification problems of this sort, which is why I used it. It generally
reduces the error rate better than MSE loss would.

● Optimizer: stochastic gradient descent is the optimizer that has been
taught to us so far, which is why I used it for this classification
problem.
● Experimented with different hidden-layer sizes as shown below, and
concluded that model performance plateaus at 64 neurons in
the hidden layer (a hedged sketch of such a setup follows this list).
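
As a hedged sketch only (not the author's exact code), the setup described above could be written in a PyTorch style as follows; the single 64-neuron hidden layer, the ReLU hidden activation, and the use of PyTorch itself are assumptions, while the loss, optimizer, learning rate, epoch count and batch loader follow the description in this report.

import torch.nn as nn
import torch.optim as optim

# Hedged sketch: one 64-neuron hidden layer feeding 3 output classes,
# cross-entropy loss and plain SGD with lr = 0.01, trained for 200 epochs.
n_features = X_train.shape[1]            # from the preprocessing sketch above
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),                           # hidden activation is an assumption
    nn.Linear(64, 3),                    # raw logits for the 3 target bins
)
criterion = nn.CrossEntropyLoss()        # applies log-softmax internally
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(200):                 # n_epochs = 200
    for xb, yb in train_loader:          # mini-batches of 16 from the DataLoader
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # one-hot targets need PyTorch >= 1.10
        loss.backward()
        optimizer.step()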

The final model architecture I decided to use was this:

Final accuracy on train and test set:

Question 2

Task a: Preprocessing/Visualization, Stratified TTVS

● Loaded the dataset from the GitHub repo.


● Standardized the data for better model performance
● Plotted the histogram for class distribution. Demonstrated the class
imbalance which we take care of later using stratified train-test splitting.

● For visualization purposes, I reduced the dimensionality of the data using LDA
and plotted the pairplots of the best 4 features. I observed that the per-class
distributions are overall quite close to Gaussian and appear quite distinct from
one another. Given the simplicity of the data and the large number of samples, I
hypothesized that it would be very hard for the model to overfit at any level,
although I am not sure whether this is a correct assumption to make.

● One-hot encoded the class labels, as required by an MLP for a multi-class
classification problem.

● STRATIFIED TRAIN_VALIDATION_TEST SPLIT in the ratio [10611 : 2000 : 1000]
(see the sketch after this list).

○ A proportionate stratified split takes class-wise samples in proportion to
each class's share of the total population, for each split. This ensures
that the splits have the same class distribution as the original data.
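
A minimal sketch of this stratified three-way split, assuming X holds the standardized features and y the integer class labels; scikit-learn's train_test_split is applied twice with the stratify argument to obtain the [10611 : 2000 : 1000] sizes.

from sklearn.model_selection import train_test_split

# Carve out the 1000-sample test set first, stratified on the class labels,
# then carve the 2000-sample validation set out of the remaining data.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=1000, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2000, stratify=y_rest, random_state=0)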

Task b: MLP from scratch

Wrote the model class and other functions for the MLP

● def sigmoid(a): return 1/(1+np.exp(-a))


● def tanh(a): return 2*sigmoid(2*a) - 1
● def relu(a): return np.maximum(a, 0)
● def layer_sigmoid(layer_size_prev, layer_size_this):
● def layer_tanh(layer_size_prev, layer_size_this):
● def layer_relu(layer_size_prev, layer_size_this):

● class mlp_model:
○ def __init__(self, layer_sizes={}, layer_activations={},
learning_rate=None, n_epochs=None): initializes the model with the
architecture passed to it
○ def forward(self, X): Executes a forward pass on the passed input data and
stores the values necessary for gradient calculation in the cache
○ def predict(self, X): Outputs the predicted probabilities for the given data
○ def backward(self, m, Y_, Y, X): Calculates the gradient using the
stored cache and backpropagation
○ def update_params(self, lr): Updates the model parameters using the
gradients stored during the backward pass
○ def CE_loss(self, Y_, Y): Measures the cross entropy loss of a prediction
○ def train(self, X, Y, n_epochs=None, learning_rate = None):
Trains the model using the other class functions and the train data passed to it, with
Stochastic Gradient Descent
○ def train_BGD(self, X, Y, n_epochs=None, learning_rate =
None): Trains the model using the other class functions and the train data passed to
it, with Batch Gradient Descent
○ def save_state(self, path): Stores the current model parameters in a .npy
file

○ def load_state(self, path): Overwrites model parameters by loading
parameters from a previously generated saved state (.npy file)

Then, I trained the model using an architecture of two hidden layers, each containing
50 neurons, and measured the accuracy on the test set (a hypothetical usage sketch of
the class follows):
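
A hypothetical usage of the class interface listed above; the layout of the layer_sizes and layer_activations dictionaries, the variable names, and the hyperparameter values shown here are assumptions, since only the method signatures appear in this report.

n_classes = Y_train.shape[1]                       # one column per one-hot class
model = mlp_model(
    layer_sizes={1: 50, 2: 50, 3: n_classes},      # two hidden layers of 50 (assumed format)
    layer_activations={1: "tanh", 2: "tanh", 3: "softmax"},
    learning_rate=0.01,
    n_epochs=200,
)
model.train(X_train, Y_train)                      # stochastic gradient descent
Y_prob = model.predict(X_test)                     # predicted class probabilities
test_acc = (Y_prob.argmax(axis=1) == Y_test.argmax(axis=1)).mean()
model.save_state("mlp_2x50.npy")                   # persist the learned parameters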

Task c: Activation functions

● On an architecture of two hidden layers, each with 10 neurons, I tried three
different hidden-layer activations and compared the convergence speed and the
accuracy on the validation and test sets for all three.

● Sigmoid performed the best accuracy-wise of the three, by a small margin.
However, it converges visibly more slowly than the tanh activation; this can be
inferred from the shallower slope of the sigmoid curve compared to that of tanh.

● Tanh performed the worst of the three, which according to the literature is
usually not the case. It was, however, the fastest to converge, owing to its
steeper slope compared to sigmoid (see the sketch after this list).
● ReLU was the slowest to converge but performed decently. I do not have a
reasonable interpretation of this.
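
The slope argument above can be checked numerically: the derivative of the sigmoid peaks at 0.25, whereas the derivative of tanh peaks at 1, so tanh passes back larger gradients around zero. A small sketch:

import numpy as np

# Maximum derivative of sigmoid vs tanh: the steeper slope of tanh is what
# makes it converge faster in these experiments.
a = np.linspace(-3, 3, 7)                # includes a = 0, where both peaks occur
sig = 1 / (1 + np.exp(-a))
d_sigmoid = sig * (1 - sig)              # peaks at 0.25
d_tanh = 1 - np.tanh(a) ** 2             # peaks at 1.0
print(d_sigmoid.max(), d_tanh.max())     # 0.25 vs 1.0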

Task d: Weight Initialization

Consider a model in which every weight is initialized with the same constant value.

During the first epoch of training, every neuron in a layer is virtually
indistinguishable from every other, because the set of incoming weights of one
neuron is identical to that of any other neuron in the same layer. The gradients
accumulated are therefore also identical for every such weight, so the symmetry
within each layer persists even after the parameters are updated with their
corresponding gradients.

Hence, the above scenario will again repeat in epoch two, three, four, … and so on.

The implication is that initializing the model's weights and biases with the same
values (all-ones or all-zeros initialization) leads to a symmetric model that is
useless for prediction purposes (a symmetric model gives symmetric outputs too).

We can break the symmetry between the model weights by initializing them
randomly. However, it is important not to initialize them with values that are too
large or too small, as the parameters may diverge during training (a sketch of
constant versus Xavier initialization follows).
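
A minimal NumPy sketch of the two schemes being contrasted here: constant initialization, which leaves every neuron in a layer identical, and Xavier (Glorot) initialization, which draws random weights whose scale depends on the layer widths; the function names are illustrative.

import numpy as np

def init_constant(n_in, n_out, c=1.0):
    # Every neuron gets the same incoming weights, so all neurons in the layer
    # compute the same output and receive the same gradient: symmetry is never broken.
    return np.full((n_out, n_in), c), np.zeros((n_out, 1))

def init_xavier(n_in, n_out, seed=0):
    # Xavier/Glorot uniform initialization: weights drawn from
    # U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)), which keeps
    # activations and gradients at a reasonable scale and breaks symmetry.
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in)), np.zeros((n_out, 1))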

I have plotted the accuracy versus epoch curves for models trained with all three
types of initialization:

1) Random
2) Zero
3) Constant

<<graphs plotted below>>

Weights initialized with random values: [ XAVIER INITIALIZATION for better
convergence ]

Weights initialized with constant:

Weights initialized with zero:

Task e: Architecture selection

I tried out a bunch of different architectures, listed below:

1) Single hidden layer,

2) Two hidden layers,
3) Three hidden layers,

with the exact configurations shown in the screenshots attached further below.

The three-hidden-layer models performed the worst by far. I suspect this is
because, as the model complexity increases, the parameters and gradients become
smaller and smaller, which leads to numerical errors that are even visible in the
output at train time.

The single-hidden-layer models are decent but do not fit the data as well as the
two-hidden-layer models, which are only slightly better.

Overall, I found that increasing the model's complexity too much invariably
causes errors in the training output, likely for reasons similar to the one stated
above for the three-hidden-layer case.
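
The comparison itself amounts to a small sweep over candidate layer configurations scored on the validation split; the sketch below is hypothetical and reuses the assumed dictionary format from the usage example in Task b.

candidates = [
    {1: 50, 2: n_classes},                 # one hidden layer
    {1: 50, 2: 50, 3: n_classes},          # two hidden layers
    {1: 50, 2: 50, 3: 50, 4: n_classes},   # three hidden layers
]
for sizes in candidates:
    acts = {k: "tanh" for k in sizes}
    acts[max(sizes)] = "softmax"           # softmax on the output layer
    m = mlp_model(layer_sizes=sizes, layer_activations=acts,
                  learning_rate=0.01, n_epochs=200)
    m.train(X_train, Y_train)
    val_acc = (m.predict(X_val).argmax(axis=1) == Y_val.argmax(axis=1)).mean()
    print(sizes, val_acc)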

Ultimately, I decided NOT to use the best-performing architecture, because its
results were for some reason not reproducible. Instead, I used the most
consistently well-performing architecture, which was the two-hidden-layer model
with 50 tanh neurons in each layer.

Below I have attached screenshots of some of the architectures I tested (I tried
more than these).

Task f: Provision to save and load weights

I have added methods to the model class that allow it to save and load previously
trained parameters. I have demonstrated saving and loading on an empty model (a
hedged sketch of the underlying NumPy calls follows).
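
A hedged sketch of what these two methods can look like with NumPy .npy files, assuming the weights and biases live in a dictionary attribute self.params (the attribute name is an assumption):

import numpy as np

def save_state(self, path):
    # Serialize the dict of weight/bias arrays into a single .npy file.
    np.save(path, self.params, allow_pickle=True)

def load_state(self, path):
    # Overwrite the current parameters with the previously saved ones.
    self.params = np.load(path, allow_pickle=True).item()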

FINAL NOTES:

I used the validation set and the training plots to determine the best
hyperparameter values for pretty much everything. This was time-consuming, so I
did not include the relevant code in the submission folder. One interesting
observation was that ReLU gradients tend to explode very easily if the learning
rate is kept too high.

