Pattern Recognition and Machine Learning - 2022 Winter Semester

CSL2050 - Lab 4

Pattern Recognition and Machine Learning - 2022 Winter Semester
Ishaan Shrivastava [ B20AI013 ]

Question 1

Task 1: Preprocessing/Visualization

● Loaded dataset from GitHub repo;

● Dropped nominal columns ['Ticket', 'Name', 'PassengerId', 'Cabin'];
● Used mean imputation to fill in NaN values in the 'Age' column, and replaced
the two NaN values in 'Embarked' with the most common one, 'S';
● One-hot encoded categorical features ['Pclass', 'Embarked'];
● Visualization:

Right away, we can tell that 'Sex' is the most important feature, followed perhaps by
'Pclass__1' and 'Pclass__3'. 'Fare' is a decent feature too: fares greater than 40
have a decent survival chance and fares less than 40 do not. We can use this threshold
later to binarize 'Fare' for the Bernoulli Naive Bayes classifier.

Also, we can see that passengers with 'Age' < 10 tend to have a really good survival
rate, which we can again use in Bernoulli Naive Bayes later on to transform the
continuous feature 'Age' into a binary one.
● Numerical feature analysis:

○ Calculated correlation of features with target variable to get an idea of
the best ones, in order: ['Sex', 'Pclass__1', 'Pclass__3', 'Fare']
○ Plotted correlation matrix, and observed that most of the features have
very little correlation with each other (an indication that the independence
assumption of Naive Bayes is going to hold?).

● On this basis, I dropped all features except the best ones: ['Age', 'Sex',
'Pclass__3', 'Pclass__1', 'Fare']
● Train-Test split in ratio of 70:30
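
The preprocessing, feature ranking, and split described in Task 1 can be sketched roughly as follows. This is a minimal sketch, not the lab's actual code: the tiny inline table stands in for the real dataset, the 0/1 encoding of 'Sex' and the use of `DataFrame.sample` for the 70:30 split are assumptions, and `prefix_sep="__"` is chosen only so the one-hot column names match 'Pclass__1' and 'Pclass__3' from the report.

```python
import pandas as pd

# Tiny stand-in for the Titanic table; the values are illustrative only.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived":    [0, 1, 1, 1, 0, 0],
    "Pclass":      [3, 1, 3, 1, 3, 2],
    "Name":        ["a", "b", "c", "d", "e", "f"],
    "Sex":         ["male", "female", "female", "female", "male", "male"],
    "Age":         [22.0, 38.0, None, 35.0, None, 54.0],
    "Ticket":      ["t1", "t2", "t3", "t4", "t5", "t6"],
    "Fare":        [7.25, 71.28, 7.92, 53.10, 8.05, 13.00],
    "Cabin":       [None, "C85", None, "C123", None, None],
    "Embarked":    ["S", "C", "S", None, "Q", "S"],
})

# Drop the nominal columns listed in the report.
df = df.drop(columns=["Ticket", "Name", "PassengerId", "Cabin"])

# Mean-impute 'Age'; fill missing 'Embarked' with the most common value.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Assumption: 'Sex' is encoded as 0/1 so it can enter the correlation.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encode the categorical features.
df = pd.get_dummies(df, columns=["Pclass", "Embarked"], prefix_sep="__", dtype=int)

# Rank features by absolute correlation with the target.
corr = df.corr()["Survived"].drop("Survived").abs().sort_values(ascending=False)

# Keep the features the report selected, then split 70:30.
selected = ["Age", "Sex", "Pclass__3", "Pclass__1", "Fare"]
X, y = df[selected], df["Survived"]
train_idx = X.sample(frac=0.7, random_state=0).index
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.drop(train_idx), y.drop(train_idx)
```

On the real data, `corr` gives the kind of ordering the report uses to pick its features; on this toy table the numbers themselves carry no meaning.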

Task 2: Choosing Naive Bayes variant

● Out of Gaussian, Bernoulli, and multinomial, chose Gaussian Naive Bayes as it
will be able to gain more information from the continuous features in our dataset.
Task 3: Gaussian Naive Bayes from Scratch

● Stores the prior, mean and standard deviation used to model the Gaussian
distribution when fitted with train data.
● Model.predict(x_test) returns the predictions on x_test,
Model.predict_prob(x_test) returns the confidence for those predictions, and
the fitting method, which takes the training features and y_train, is used to
learn the model with training data.
● Only works if the target variable is binary and holds only 0's and 1's.
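
A class with this shape could look roughly like the sketch below. It is an illustration, not the lab's implementation: the class name `GaussianNBSketch`, the method name `fit`, the use of the population standard deviation, and the `1e-9` floor on it are all assumptions. Only `predict` and `predict_prob` are names taken from the report. It also assumes both classes appear in the training data.

```python
import math


class GaussianNBSketch:
    """Binary (0/1) Gaussian Naive Bayes; a hypothetical stand-in for Model."""

    def fit(self, X, y):
        self.prior, self.mean, self.std = {}, {}, {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            self.prior[c] = len(rows) / len(X)
            cols = list(zip(*rows))  # one tuple of values per feature
            self.mean[c] = [sum(col) / len(col) for col in cols]
            # Population standard deviation, floored to avoid division by zero.
            self.std[c] = [
                max(math.sqrt(sum((v - m) ** 2 for v in col) / len(col)), 1e-9)
                for col, m in zip(cols, self.mean[c])
            ]
        return self

    def _log_joint(self, x, c):
        # log P(c) + sum over features of log N(x_j | mean_cj, std_cj)
        total = math.log(self.prior[c])
        for v, m, s in zip(x, self.mean[c], self.std[c]):
            total += -math.log(s * math.sqrt(2 * math.pi)) - (v - m) ** 2 / (2 * s * s)
        return total

    def predict(self, X):
        return [int(self._log_joint(x, 1) > self._log_joint(x, 0)) for x in X]

    def predict_prob(self, X):
        # Posterior probability of the predicted class for each sample.
        probs = []
        for x in X:
            l0, l1 = self._log_joint(x, 0), self._log_joint(x, 1)
            top = max(l0, l1)  # subtract the max before exponentiating
            e0, e1 = math.exp(l0 - top), math.exp(l1 - top)
            probs.append(max(e0, e1) / (e0 + e1))
        return probs


# Illustrative data: two well-separated clusters.
X_train = [[1.0, 20.0], [1.2, 22.0], [0.9, 19.0], [3.0, 40.0], [3.2, 41.0], [2.9, 43.0]]
y_train = [0, 0, 0, 1, 1, 1]
model = GaussianNBSketch().fit(X_train, y_train)
preds = model.predict([[1.1, 21.0], [3.1, 42.0]])
confidences = model.predict_prob([[1.1, 21.0], [3.1, 42.0]])
```

Working in log space keeps the product of many small densities from underflowing, which is why the sketch sums log terms instead of multiplying probabilities.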

Task 4: 5-fold-Cross-Validation

● Defined function cvfoldNB(data, target, n_splits, modelclass), which returns a
list of accuracy scores across all n_splits folds.
● Reported average 5-fold CV scores for sklearn GaussianNB vs scratch
GaussianNB, and found they were comparable on average, despite some
differences across folds. I could not determine where the differences arise or
how to fix them.
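
A function with the signature above might be sketched as follows. This is an assumption-laden illustration rather than the lab's code: the `seed` parameter, the shuffled round-robin fold assignment, and the `MajorityClass` model used to exercise it are all hypothetical additions.

```python
import random


def cvfoldNB(data, target, n_splits, modelclass, seed=0):
    """Return one accuracy score per fold (seed is a hypothetical extra argument)."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, round-robin.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    scores = []
    for k in range(n_splits):
        held_out = set(folds[k])
        train_idx = [i for i in indices if i not in held_out]
        model = modelclass()  # a fresh model for every fold
        model.fit([data[i] for i in train_idx], [target[i] for i in train_idx])
        preds = model.predict([data[i] for i in folds[k]])
        correct = sum(p == target[i] for p, i in zip(preds, folds[k]))
        scores.append(correct / len(folds[k]))
    return scores


class MajorityClass:
    """Trivial placeholder model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


data = [[float(i)] for i in range(20)]
target = [0] * 12 + [1] * 8
scores = cvfoldNB(data, target, n_splits=5, modelclass=MajorityClass)
average = sum(scores) / len(scores)
```

When per-fold scores differ between two implementations, two things may be worth checking: whether both models are evaluated on exactly the same folds, and how each handles variances. sklearn's GaussianNB, for instance, adds a small fraction of the largest feature variance to every variance through its `var_smoothing` parameter, which a from-scratch version may not replicate.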

Task 5: Performance across CV folds, computing top



Task 6: Comparison with sklearn GaussianNB

● Evaluated accuracy for 5-fold CV and on the test set for both models. Both
perform quite similarly.

Task 7: Comparison with sklearn BernoulliNB,


● Evaluated accuracy for 5-fold CV and on the test set for all three models.
DecisionTreeClassifier outperformed the other two models by quite a
significant margin. This could be because it is able to capture the correlations
between features, and also because most of the features are binary in the first
place.
Question 2

Task a: Plotting distribution of samples

● Loaded dataset from github repo, determined priors for every class
● Plotted class-wise histogram showing distribution of each feature

● Plotted class-wise histogram plots for each feature to determine a good value
for number of bins.

(Full output can be seen on running the related section in colab. Larger size graphs
excluded from the report as there are too many.)

● Analysis of the various bincounts and how good they are at binning:
○ 3 - is unable to capture the variations in density across
○ 5, 7 - capture the variation and difference in density between
classes quite nicely
○ 9 - starts to capture way too many minute variations in the
data (the small bump in the graph for bincount=9, L_groove=6.0)
● On this basis I chose bincount=7 for binning.

Task b: Determine prior probability for all classes

● Already done in Task a
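
For reference, the prior of each class is simply its share of the samples. A minimal sketch, where the helper name `class_priors` is hypothetical and the label list mirrors the 70-samples-per-class layout the report describes later:

```python
from collections import Counter


def class_priors(labels):
    """Prior P(c) = (number of samples with label c) / (total samples)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


# 70 samples per class, as observed in the report's posterior plots.
labels = [1] * 70 + [2] * 70 + [3] * 70
priors = class_priors(labels)
```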

Task c: Binning

● Defined function binner(x, feature, bincount) to bin continuous features
on the basis of equal-width binning, with n_bins = 7.

NOTE: for ease of evaluation, you can simply change the number of bins to 5 or 3
or anything as needed, in this section of code. The rest of the code will run
accordingly afterwards.

● Pandas has been used only for the purpose of indexing the data. No inbuilt
pandas functions have been used other than indexing and slicing in this
section. I believe that the intention of the question was to disallow inbuilt
binning methods, and none have been used in this section.
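
Equal-width binning without any inbuilt binning helper can be sketched as below. The signature follows the report's binner(x, feature, bincount), but the body is an assumption: here x is any mapping from feature name to a sequence of values (a plain dict in the example), and the sample values are illustrative.

```python
def binner(x, feature, bincount):
    """Map each value of x[feature] to an equal-width bin index in 0..bincount-1."""
    values = x[feature]
    lo, hi = min(values), max(values)
    width = (hi - lo) / bincount
    if width == 0:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    # The maximum value would land at index `bincount`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bincount - 1) for v in values]


x = {"L_groove": [4.5, 4.8, 5.1, 5.4, 5.7, 6.0, 6.3, 6.6]}
bins = binner(x, "L_groove", bincount=7)
```

Changing `bincount` to 3 or 5 is the only edit needed to try a different resolution, which matches the note above about changing the number of bins in one place.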

Task d: Determining likelihood/class conditional probabilities

Task e: Comparing counts with plot of distribution

Task f: Plotting posteriors

● Plotting feature-wise posteriors:

This is done to calculate all the feature-wise posteriors beforehand and also to
check if the formula is working (posteriors across a vertical on a graph should
sum up to 1).

● Plotting posteriors of all data samples:

● From the above two graphs, it is evident that the data samples are
arranged class-wise, in order, and 70 at a time [70 of class 1, 70 of class 2,
70 of class 3, in order]. The posterior values generally agree with this
result, as can be inferred from graph 1, however since no model is
perfect there are a few outliers/wrong predictions here and there.
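
The feature-wise posterior for one binned feature, together with the sum-to-1 check mentioned above, can be sketched as follows. The function name `featurewise_posteriors` and the toy bins and labels are hypothetical, and the sketch applies no smoothing, so empty bins are simply skipped. For a bin b and class c it computes P(c | b) = P(b | c) P(c) / sum over c' of P(b | c') P(c').

```python
from collections import Counter


def featurewise_posteriors(bins, labels, bincount):
    """Return {bin: {class: P(class | bin)}} for a single binned feature."""
    classes = sorted(set(labels))
    n = len(labels)
    class_counts = Counter(labels)
    joint = Counter(zip(labels, bins))  # count of (class, bin) pairs
    posteriors = {}
    for b in range(bincount):
        # likelihood P(b | c) times prior P(c), per class
        scores = {
            c: (joint[(c, b)] / class_counts[c]) * (class_counts[c] / n)
            for c in classes
        }
        evidence = sum(scores.values())
        if evidence > 0:  # skip bins that contain no samples
            posteriors[b] = {c: s / evidence for c, s in scores.items()}
    return posteriors


# Toy example: bin indices for one feature and the matching class labels.
bins = [0, 0, 1, 1, 2, 2, 2, 1]
labels = [1, 1, 1, 2, 2, 3, 3, 3]
post = featurewise_posteriors(bins, labels, bincount=3)

# Sanity check from the report: posteriors for a given bin sum to 1.
sums = {b: sum(p.values()) for b, p in post.items()}
```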

