Pattern Recognition and Machine Learning - 2022 Winter Semester
Question 1
Task 1: Preprocessing/Visualization
Right away, we can tell that 'Sex' is the most important feature, followed perhaps by
'Pclass__1' and 'Pclass__3'. 'Fare' is fairly informative too: fares greater than 40
have a decent survival chance and fares less than 40 do not. We can use this fact in
the binomial Gaussian classifier later.
Also, we can see that passengers with 'Age' < 10 tend to have a really good survival
rate, which we can again use in binomial naive Bayes later on to transform the
continuous feature 'Age' into a binary one.
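The thresholding described above can be sketched as follows. This is only an illustration: the helper name `binarize` and the sample values are made up, and whether a fare of exactly 40 counts as "high" is not stated in the report, so the strict comparison here is an assumption.

```python
def binarize(values, threshold, below=True):
    """Map each value to 1 if it is strictly below (or above) the threshold, else 0."""
    if below:
        return [1 if v < threshold else 0 for v in values]
    return [1 if v > threshold else 0 for v in values]


# Made-up values, for illustration only.
ages = [4, 22, 9, 35]
fares = [7.25, 71.28, 40.0, 53.1]

age_is_child = binarize(ages, 10, below=True)     # 'Age' < 10
fare_is_high = binarize(fares, 40, below=False)   # 'Fare' > 40
```

The resulting 0/1 columns can then be fed to a classifier that expects binary features.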
independent is going to hold for naive bayes?).
● On this basis, I dropped all features except the best ones: ['Age', 'Sex',
'Pclass__3', 'Pclass__1', 'Fare']
● Train-test split in a 70:30 ratio
● The model stores the prior, mean and standard deviation used to model the
Gaussian distribution when it is fitted on the training data.
● Model.fit(x_train, y_train) learns the model from the training data,
Model.predict(x_test) returns the predictions on x_test, and
Model.predict_prob(x_test) returns the confidence of the predictions on x_test.
● Only works if the target variable is binary and holds only 0s and 1s
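A minimal sketch of a model with this interface is given below. It is not the report's actual implementation: the class name `GaussianNaiveBayes`, the small floor on the standard deviation, and the choice to have `predict_prob` return the posterior of class 1 are all assumptions made for illustration. It also assumes both classes appear in the training data.

```python
import math


class GaussianNaiveBayes:
    """Binary Gaussian naive Bayes sketch; labels must be 0 or 1."""

    def fit(self, X, y):
        if set(y) - {0, 1}:
            raise ValueError("target must contain only 0s and 1s")
        self.priors, self.means, self.stds = {}, {}, {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            self.priors[c] = len(rows) / len(X)
            cols = list(zip(*rows))  # one tuple per feature
            self.means[c] = [sum(col) / len(col) for col in cols]
            # Floor the std (assumed value) to avoid division by zero
            # when a feature is constant within a class.
            self.stds[c] = [
                max(math.sqrt(sum((v - m) ** 2 for v in col) / len(col)), 1e-9)
                for col, m in zip(cols, self.means[c])
            ]
        return self

    def _log_joint(self, x, c):
        # log prior + sum of per-feature Gaussian log densities
        total = math.log(self.priors[c])
        for v, m, s in zip(x, self.means[c], self.stds[c]):
            total += -0.5 * math.log(2 * math.pi * s * s) - (v - m) ** 2 / (2 * s * s)
        return total

    def predict_prob(self, X):
        """Posterior probability of class 1 for each row."""
        probs = []
        for x in X:
            l0, l1 = self._log_joint(x, 0), self._log_joint(x, 1)
            top = max(l0, l1)  # subtract the max for numerical stability
            e0, e1 = math.exp(l0 - top), math.exp(l1 - top)
            probs.append(e1 / (e0 + e1))
        return probs

    def predict(self, X):
        return [1 if p >= 0.5 else 0 for p in self.predict_prob(X)]


# Toy data, for illustration only.
X_train = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [3.2, 3.9]]
y_train = [0, 0, 1, 1]
model = GaussianNaiveBayes().fit(X_train, y_train)
predictions = model.predict([[1.1, 2.0], [3.1, 4.0]])
```

Working in log space keeps the product of many small densities from underflowing, which is why the joint is accumulated as a sum of logs.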
Task 4: 5-fold-Cross-Validation
probabilities
● Evaluated accuracy with 5-fold CV and on the test set for both models. Both
perform quite similarly.
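For reference, 5-fold cross-validation can be sketched as below. The function names, the fixed shuffle seed, and the `MajorityClassifier` stand-in are hypothetical; any model exposing `fit` and `predict` could be passed in instead.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(make_model, X, y, k=5, seed=0):
    """Train on k-1 folds, score accuracy on the held-out fold, repeat k times."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = make_model()
        model.fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        preds = model.predict([X[j] for j in test_idx])
        correct = sum(p == y[j] for p, j in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return scores


class MajorityClassifier:
    """Stand-in model (always predicts the most common training label)."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label] * len(X)


# Toy data, for illustration only.
X_demo = [[i] for i in range(10)]
y_demo = [0] * 8 + [1] * 2
fold_scores = cross_val_accuracy(MajorityClassifier, X_demo, y_demo, k=5)
mean_accuracy = sum(fold_scores) / len(fold_scores)
```

Averaging the per-fold scores gives a single cross-validated accuracy that can be compared against the held-out test accuracy.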
Task 7: Comparison with sklearn BernoulliNB, DecisionTreeClassifier
● Evaluated accuracy with 5-fold CV and on the test set for all three models.
DecisionTreeClassifier outperformed the other two models by a significant
margin. This could be because it is able to capture the correlations between
features, and also because most of the features are binary in the first place.
Question 2
● Loaded dataset from github repo, determined priors for every class
● Plotted class-wise histogram showing distribution of each feature
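Determining the class priors amounts to counting labels. The sketch below assumes integer labels 1, 2 and 3 (an assumed encoding) and uses the 70-samples-per-class layout mentioned later in this report; the function name is hypothetical.

```python
from collections import Counter


def class_priors(labels):
    """Prior P(class) estimated as the fraction of samples with that label."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in sorted(counts)}


# 70 samples per class, as described later in the report;
# the label values 1, 2, 3 are an assumed encoding.
labels = [1] * 70 + [2] * 70 + [3] * 70
priors = class_priors(labels)
```

With balanced classes every prior comes out to one third, so the priors do not favour any class.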
● Plotted class-wise histogram plots for each feature to determine a good value
for number of bins.
(Full output can be seen on running the related section in colab. Larger size graphs
excluded from the report as there are too many.)
● Analysis of the various bin counts and how well they bin the data:
○ 3 - is unable to capture the variations in density across
classes
○ 5, 7 - capture the variation and difference in density between
classes quite nicely
○ 9 - starts to capture far too many minute variations in the
data (e.g. the small bump in the graph for bincount=9, L_groove=6.0)
● On this basis I chose bincount=7 for binning.
Task c: Binning
You can simply change the number of bins to 5, 3 or any other value in this section of the code. The rest
of the code will then run accordingly.
● Pandas has been used only for indexing the data. No inbuilt pandas functions
other than indexing and slicing have been used in this section. I believe the
intention of the question was to disallow inbuilt binning methods, and none
are used here.
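One way to bin a feature without any inbuilt binning method is equal-width binning, sketched below. Equal-width bins are an assumption here (the report does not state the binning scheme), and the function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins=7):
    """Assign each value a bin index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0] * len(values)
    # The maximum value would map to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up feature values, for illustration only.
feature = [0.0, 1.0, 3.5, 6.9, 7.0]
bin_ids = equal_width_bins(feature, n_bins=7)
```

Changing `n_bins` to 5 or 3 re-bins the same feature without touching any other code.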
Task d: Determining likelihood/class conditional probabilities
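As a rough sketch of this step, the class-conditional probability of a bin can be estimated as the fraction of a class's samples that fall into it. The function name and toy data are hypothetical, and no smoothing is applied, which is an assumption since the report does not say whether empty bins were smoothed.

```python
from collections import Counter


def class_conditional(bin_ids, labels, n_bins):
    """Return table[c][b] = P(bin b | class c), estimated from counts."""
    table = {}
    for c in sorted(set(labels)):
        in_class = [b for b, label in zip(bin_ids, labels) if label == c]
        counts = Counter(in_class)
        # Unsmoothed estimate: bins never seen for a class get probability 0.
        table[c] = [counts[b] / len(in_class) for b in range(n_bins)]
    return table


# Toy bin indices and labels, for illustration only.
toy_bins = [0, 0, 1, 2, 2, 2]
toy_labels = [1, 1, 1, 2, 2, 2]
likelihoods = class_conditional(toy_bins, toy_labels, n_bins=3)
```

Each row of the table sums to 1, since every sample of a class lands in exactly one bin.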
Task e: Comparing counts with plot of distribution
Task f: Plotting posteriors
This is done to calculate all the feature-wise posteriors beforehand and also to
check that the formula is working (for any bin, i.e. along any vertical line on
a graph, the posteriors of all classes should sum to 1).
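The sanity check mentioned above follows directly from Bayes' rule, P(class | bin) = P(bin | class) P(class) / P(bin). The sketch below uses hypothetical names and toy numbers; leaving the posterior at 0 for a bin that no class ever hits is an assumption.

```python
def posteriors(likelihoods, priors):
    """Return post[c][b] = P(class c | bin b) from P(bin | class) and P(class)."""
    classes = list(priors)
    n_bins = len(likelihoods[classes[0]])
    post = {c: [0.0] * n_bins for c in classes}
    for b in range(n_bins):
        # Evidence P(bin b): total probability over all classes.
        evidence = sum(likelihoods[c][b] * priors[c] for c in classes)
        if evidence == 0:
            continue  # bin never observed: posterior left at 0 (assumed convention)
        for c in classes:
            post[c][b] = likelihoods[c][b] * priors[c] / evidence
    return post


# Toy numbers, for illustration only.
toy_likelihoods = {1: [0.5, 0.5], 2: [0.25, 0.75]}
toy_priors = {1: 0.5, 2: 0.5}
post = posteriors(toy_likelihoods, toy_priors)

# Sanity check: for every bin, the posteriors over classes sum to 1.
column_sums = [sum(post[c][b] for c in post) for b in range(2)]
```

If any column sum differs noticeably from 1 for an observed bin, the likelihoods or priors were computed incorrectly.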
● Plotting posteriors of all data samples:
● From the above two graphs, it is evident that the data samples are
arranged class-wise, in order, 70 at a time [70 of class 1, 70 of class 2,
70 of class 3]. The posterior values generally agree with this ordering, as
can be inferred from graph 1. However, since no model is perfect, there are
a few outliers/wrong predictions here and there.