CSL2050 - Lab 4

Pattern Recognition and Machine Learning - 2022 Winter Semester
Ishaan Shrivastava [ B20AI013 ]

Question 1

Task 1: Preprocessing/Visualization

● Loaded the dataset from the GitHub repo;
● Dropped nominal columns ['Ticket', 'Name', 'PassengerId', 'Cabin'];
● Used mean imputation to fill the NaN values in the 'Age' column, and replaced
the two NaN values in 'Embarked' with the most common value, 'S';
● One-hot encoded the categorical features ['Pclass', 'Embarked'];
● Visualization: plotted feature-wise distributions against survival.

Right away, we can tell that 'Sex' is the most important feature; 'Pclass__1' and
'Pclass__3' come next, and 'Fare' is fairly informative too (fares greater than 40
have a decent survival chance while fares below 40 do not, a fact we can use later
when binarizing features for a Bernoulli naive Bayes classifier).

Also, we can see that passengers with 'Age' < 10 tend to have a really good survival
rate, which we can again use later to transform the continuous feature 'Age' into a
binary one for Bernoulli naive Bayes.
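
As a minimal sketch of this binarization (the thresholds 10 and 40 are the ones
eyeballed from the plots above, and the helper name is hypothetical, not from the
notebook):

```python
import pandas as pd

def binarize_for_bernoulli(df: pd.DataFrame) -> pd.DataFrame:
    """Turn the continuous 'Age' and 'Fare' columns into 0/1 indicator
    features, using the thresholds suggested by the survival plots."""
    out = df.copy()
    out['Age'] = (out['Age'] < 10).astype(int)    # child indicator
    out['Fare'] = (out['Fare'] > 40).astype(int)  # high-fare indicator
    return out
```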

● Numerical feature analysis:
○ Calculated the correlation of each feature with the target variable to get an
idea of the best ones, in order: ['Sex', 'Pclass__1', 'Pclass__3', 'Fare'];
○ Plotted the correlation matrix and observed that most features have very
little correlation with one another (an indication that the independence
assumption of naive Bayes is likely to hold).
● On this basis, I dropped all features except the best ones: ['Age', 'Sex',
'Pclass__3', 'Pclass__1', 'Fare'];
● Train-test split in a ratio of 70:30 (a preprocessing sketch follows this
list).
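
A minimal sketch of the preprocessing pipeline described above. The file path, the
encoding of 'Sex', and the standard Titanic column name 'Survived' are assumptions,
not the exact code from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder path: in the notebook the CSV is loaded from the GitHub repo.
df = pd.read_csv('titanic.csv')

# Drop the nominal columns.
df = df.drop(columns=['Ticket', 'Name', 'PassengerId', 'Cabin'])

# Mean-impute 'Age'; fill the two missing 'Embarked' values with the mode 'S'.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna('S')

# One-hot encode the categorical features (gives 'Pclass__1', 'Pclass__3', ...).
df = pd.get_dummies(df, columns=['Pclass', 'Embarked'], prefix_sep='__')

# Encode 'Sex' as 0/1 (assumed encoding) and keep only the selected features.
df['Sex'] = (df['Sex'] == 'female').astype(int)
features = ['Age', 'Sex', 'Pclass__3', 'Pclass__1', 'Fare']
X, y = df[features], df['Survived']

# 70:30 train-test split.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
```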

Task 2: Choosing Naive Bayes variant

● Out of Gaussian, Bernoulli, and multinomial naive Bayes, I chose Gaussian naive
Bayes, as it is able to extract more information from the continuous features
('Age', 'Fare') in our data.

Task 3: Gaussian Naive Bayes from Scratch

● The model stores the class priors and the per-class, per-feature mean and
standard deviation used to model the Gaussian distributions when fitted on the
training data.
● Model.fit(x_train, y_train) learns these parameters from the training data,
Model.predict(x_test) returns the predictions on x_test, and
Model.predict_prob(x_test) returns the confidence of those predictions.
● It only works if the target variable is binary and takes values 0 or 1.
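
A sketch of such a class, matching the interface described above (fit, predict,
predict_prob). The internal details, e.g. the small constant added to the standard
deviations for numerical stability, are my own assumptions rather than the exact
notebook code:

```python
import numpy as np

class GaussianNBScratch:
    """Gaussian naive Bayes for a binary 0/1 target."""

    def fit(self, x_train, y_train):
        x, y = np.asarray(x_train, dtype=float), np.asarray(y_train)
        self.classes_ = np.array([0, 1])
        # Per-class prior, and per-class / per-feature mean and standard deviation.
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        self.means_ = np.array([x[y == c].mean(axis=0) for c in self.classes_])
        # Small constant avoids division by zero for constant features (assumption).
        self.stds_ = np.array([x[y == c].std(axis=0) + 1e-9 for c in self.classes_])
        return self

    def _joint_log_likelihood(self, x):
        x = np.asarray(x, dtype=float)
        jll = []
        for c in range(len(self.classes_)):
            # log N(x | mean, std) summed over features, plus the log prior.
            log_gauss = (-0.5 * np.log(2 * np.pi * self.stds_[c] ** 2)
                         - (x - self.means_[c]) ** 2 / (2 * self.stds_[c] ** 2))
            jll.append(np.log(self.priors_[c]) + log_gauss.sum(axis=1))
        return np.stack(jll, axis=1)            # shape (n_samples, 2)

    def predict(self, x_test):
        return self.classes_[self._joint_log_likelihood(x_test).argmax(axis=1)]

    def predict_prob(self, x_test):
        jll = self._joint_log_likelihood(x_test)
        jll -= jll.max(axis=1, keepdims=True)   # stabilise before exponentiating
        probs = np.exp(jll)
        probs /= probs.sum(axis=1, keepdims=True)
        return probs.max(axis=1)                # confidence of the predicted class
```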

Task 4: 5-Fold Cross-Validation

● Defined the function cvfoldNB(data, target, n_splits, modelclass), which
returns the list of accuracy scores across all n_splits folds.
● Reported the average 5-fold CV scores for sklearn's GaussianNB vs. the scratch
GaussianNB, and found they were comparable on average, despite some differences
across individual folds. I could not pin down exactly where the differences arise;
one plausible source is sklearn's variance smoothing (its var_smoothing parameter
adds a small constant to the variances), which the scratch implementation does not
replicate exactly.
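
A sketch of the helper with the signature given above; the use of sklearn's KFold
with shuffling and a fixed random_state inside the body is an assumption:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def cvfoldNB(data, target, n_splits, modelclass):
    """Return the list of accuracy scores of `modelclass` across `n_splits` folds."""
    data, target = np.asarray(data, dtype=float), np.asarray(target)
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(data):
        model = modelclass()                     # e.g. GaussianNBScratch or sklearn's GaussianNB
        model.fit(data[train_idx], target[train_idx])
        preds = model.predict(data[val_idx])
        scores.append(accuracy_score(target[val_idx], preds))
    return scores

# Example usage (names assumed from the earlier sketches):
# from sklearn.naive_bayes import GaussianNB
# print(np.mean(cvfoldNB(x_train, y_train, 5, GaussianNB)))
# print(np.mean(cvfoldNB(x_train, y_train, 5, GaussianNBScratch)))
```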

Task 5: Performance across CV folds, computing top probabilities


Task 6: Comparison with sklearn GaussianNB

● Evaluated accuracy over the 5-fold CV and on the test set for both models. Both
perform quite similarly.

Task 7: Comparison with sklearn BernoulliNB, DecisionTreeClassifier

● Evaluated accuracy over the 5-fold CV and on the test set for all three models.
DecisionTreeClassifier outperformed the other two by quite a significant margin.
This could be because it can capture interactions between features (which naive
Bayes ignores by assumption), and also because most of the features are binary in
the first place.
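
A sketch of the comparison described above, reusing cvfoldNB from Task 4 and the
train/test split from Task 1 (variable names follow the earlier sketches, so they
are assumptions rather than the exact notebook code):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

models = {
    'GaussianNB': GaussianNB,
    'BernoulliNB': BernoulliNB,
    'DecisionTreeClassifier': DecisionTreeClassifier,
}

for name, cls in models.items():
    # 5-fold CV accuracy on the training split, plus held-out test accuracy.
    cv_scores = cvfoldNB(x_train, y_train, 5, cls)
    test_acc = accuracy_score(y_test, cls().fit(x_train, y_train).predict(x_test))
    print(f'{name}: mean 5-fold CV accuracy = {np.mean(cv_scores):.3f}, '
          f'test accuracy = {test_acc:.3f}')
```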

Question 2

Task a: Plotting distribution of samples

● Loaded the dataset from the GitHub repo and determined the prior for every class;
● Plotted class-wise histograms showing the distribution of each feature;

● Plotted class-wise histograms of each feature at several bin counts to determine
a good value for the number of bins (a plotting sketch follows the bin-count
analysis below).

(The full output can be seen by running the related section in Colab. Larger graphs
are excluded from the report as there are too many.)

● Analysis of the various bin counts and how well they bin the data:
○ 3 - unable to capture the variations in density across classes;
○ 5, 7 - capture the variation and the differences in density between classes
quite nicely;
○ 9 - starts to capture too many minute variations in the data (e.g. the small
bump in the graph for bincount=9 around L_groove=6.0).
● On this basis I chose bincount=7 for binning.
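
A sketch of the prior computation and the class-wise histogram plots described in
Task a. The file path, the DataFrame name, and the target-column name 'target' are
placeholders, not necessarily the names used in the notebook:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder path: the notebook loads the data from the GitHub repo.
df = pd.read_csv('seeds.csv')
target_col = 'target'
classes = sorted(df[target_col].unique())
features = [c for c in df.columns if c != target_col]

# Task a / Task b: prior probability of every class.
priors = {c: (df[target_col] == c).mean() for c in classes}
print(priors)

def plot_classwise_histograms(bincount):
    """Class-wise histogram of every feature for a given number of bins."""
    fig, axes = plt.subplots(1, len(features), figsize=(4 * len(features), 3))
    for ax, feature in zip(np.atleast_1d(axes), features):
        for c in classes:
            ax.hist(df.loc[df[target_col] == c, feature],
                    bins=bincount, alpha=0.5, label=f'class {c}')
        ax.set_title(f'{feature} (bins={bincount})')
        ax.legend()
    plt.show()

# Compare several candidate bin counts, as in the analysis above.
for bincount in (3, 5, 7, 9):
    plot_classwise_histograms(bincount)
```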

Task b: Determine prior probability for all classes

● Already done in Task a

Task c: Binning

● Defined the function binner(x, feature, bincount) to bin continuous features
using equal-width binning, with n_bins = 7. NOTE: for ease of evaluation, the
number of bins can simply be changed to 5, 3, or anything else as needed in this
section of the code; the rest of the code will run accordingly.
● Pandas has been used only for indexing the data; no inbuilt pandas functions
other than indexing and slicing have been used in this section. I believe the
intention of the question was to disallow the use of inbuilt binning methods,
which have not been used here.
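
A sketch of equal-width binning with the signature described above. The bin-index
convention (0 to bincount-1) and the use of the column-wise min/max of the full
DataFrame are assumptions; names follow the sketch from Task a:

```python
import numpy as np

# Per-feature minimum and maximum, taken from the DataFrame `df` of Task a.
feature_min = df[features].min()
feature_max = df[features].max()

def binner(x, feature, bincount=7):
    """Map a single value `x` of `feature` to an equal-width bin index in 0..bincount-1."""
    lo, hi = feature_min[feature], feature_max[feature]
    width = (hi - lo) / bincount
    idx = int((x - lo) // width)
    return min(max(idx, 0), bincount - 1)   # clamp so x == hi falls in the last bin

# Build a binned copy of the data.
binned = df.copy()
for feature in features:
    binned[feature] = [binner(v, feature, 7) for v in df[feature]]
```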

Task d: Determining likelihood/class conditional probabilities

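Task d has no accompanying prose above, so the following is only a minimal sketch of
how the class-conditional probabilities P(bin | class) could be estimated from the
binned data, using relative bin counts within each class (names follow the earlier
sketches and are assumptions):

```python
import numpy as np

# likelihoods[feature][c] is an array of length `bincount` with P(bin | class c).
bincount = 7
likelihoods = {}
for feature in features:
    likelihoods[feature] = {}
    for c in classes:
        vals = binned.loc[binned[target_col] == c, feature]
        counts = np.array([(vals == b).sum() for b in range(bincount)], dtype=float)
        likelihoods[feature][c] = counts / counts.sum()
```
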
Task e: Comparing counts with plot of distribution

Task f: Plotting posteriors

● Plotting feature-wise posteriors:

This is done to calculate all the feature-wise posteriors beforehand and also to
check that the formula is working (the posteriors across any vertical line on a
graph should sum to 1).

● Plotting posteriors of all data samples:

● From the above two graphs, it is evident that the data samples are arranged
class-wise, in order, 70 at a time [70 of class 1, then 70 of class 2, then 70 of
class 3]. The posterior values generally agree with this arrangement, as can be
inferred from graph 1; however, since no model is perfect, there are a few
outliers/wrong predictions here and there. (A posterior-computation sketch is
given below.)
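
A sketch of the posterior computation described above, applying Bayes' rule with
the priors from Task b and the binned likelihoods from Task d (names follow the
earlier sketches and are assumptions, not the exact notebook code). The assertion
at the end corresponds to the check that the posteriors across a vertical line on
the plot sum to 1:

```python
import numpy as np

def featurewise_posterior(x, feature):
    """P(class | feature = x) for a single feature value, via Bayes' rule."""
    b = binner(x, feature, bincount)
    unnorm = np.array([priors[c] * likelihoods[feature][c][b] for c in classes])
    return unnorm / unnorm.sum()

def sample_posterior(row):
    """P(class | all features) under the naive (independence) assumption."""
    unnorm = np.array([
        priors[c] * np.prod([likelihoods[f][c][binner(row[f], f, bincount)]
                             for f in features])
        for c in classes
    ])
    return unnorm / unnorm.sum()

# Posteriors of all data samples, e.g. for plotting sample index vs. P(class | x).
posteriors = np.array([sample_posterior(row) for _, row in df.iterrows()])
assert np.allclose(posteriors.sum(axis=1), 1.0)   # each row sums to 1
```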

