Lab5 Example Fall 23
[13]: # The 20 newsgroups dataset comprises around 18,000 newsgroup posts on
      # 20 topics, split into two subsets: one for training (or development)
      # and the other for testing (or performance evaluation). The split
      # between the train and test sets is based upon messages posted
      # before and after a specific date.
      from sklearn.datasets import fetch_20newsgroups
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline
      from sklearn.metrics import confusion_matrix
      import seaborn as sns
      import matplotlib.pyplot as plt

      data = fetch_20newsgroups()
      data.target_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
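Before the pipeline below applies TfidfVectorizer to these posts, it may help to see the weighting it is built on. The following is a minimal pure-Python sketch of textbook TF-IDF, tf(t, d) * log(N / df(t)); note that sklearn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row, so its exact numbers differ. The three toy documents are made up for illustration.

```python
import math

def tf_idf(docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # df(t): number of documents containing term t
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    weights = []
    for doc in tokenized:
        row = {}
        for t in set(doc):
            tf = doc.count(t) / len(doc)  # relative term frequency
            row[t] = tf * math.log(n / df[t])
        weights.append(row)
    return weights

w = tf_idf(["the cat sat", "the dog sat", "the cat ran"])
# "the" appears in every document, so its idf (and hence its weight) is 0:
# common words carry little information, which is the point of TF-IDF.
print(round(w[0]["the"], 3), round(w[0]["cat"], 3))
```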
[15]: # Choosing some categories instead of all 20 classes.
cat_class = ['talk.politics.guns','comp.graphics','comp.sys.ibm.pc.hardware']
train_data = fetch_20newsgroups(subset='train', categories=cat_class)
test_data = fetch_20newsgroups(subset='test', categories=cat_class)
# train_data
# make_pipeline lets us create a pipeline flow of data: we pass data to a
# function, its result is passed to the next function, and so on. This
# abstracts the details behind a single function.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
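The data-flow idea described in the comment above can be sketched in a few lines of plain Python: each stage's output becomes the next stage's input, hidden behind a single callable. (Illustrative only; sklearn's make_pipeline also manages fit/transform semantics, which this toy version omits. The stage functions here are hypothetical stand-ins, not sklearn components.)

```python
def make_simple_pipeline(*steps):
    """Compose functions left to right: the output of one feeds the next."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run

# Hypothetical stages playing the roles of a vectorizer and a model.
lowercase = lambda docs: [d.lower() for d in docs]
tokenize = lambda docs: [d.split() for d in docs]
count_tokens = lambda docs: [len(tokens) for tokens in docs]

pipeline = make_simple_pipeline(lowercase, tokenize, count_tokens)
print(pipeline(["Hello World", "A B C"]))  # [2, 3]
```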
[19]: model.fit(train_data.data, train_data.target)
# This fits our model with the data we have passed and the corresponding
# targets.
labels = model.predict(test_data.data)
# This predicts the classes of the test data we have passed,
# using the model we trained in the last step.
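To see what fit and predict are doing under the hood, here is a hedged pure-Python sketch of multinomial Naive Bayes with Laplace smoothing, which is the classifier MultinomialNB implements (minus the TF-IDF weighting the pipeline adds). The toy corpus and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a tiny multinomial Naive Bayes with Laplace (+1) smoothing."""
    class_words = defaultdict(list)
    for doc, label in zip(docs, labels):
        class_words[label].extend(doc.lower().split())
    vocab = {w for words in class_words.values() for w in words}
    model = {}
    for label, words in class_words.items():
        counts = Counter(words)
        total = len(words)
        log_prior = math.log(labels.count(label) / len(docs))
        log_like = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                    for w in vocab}
        # Smoothed log-probability for words never seen in this class.
        unseen = math.log(1 / (total + len(vocab)))
        model[label] = (log_prior, log_like, unseen)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log-probability."""
    scores = {}
    for label, (log_prior, log_like, unseen) in model.items():
        scores[label] = log_prior + sum(log_like.get(w, unseen)
                                        for w in doc.lower().split())
    return max(scores, key=scores.get)

model_toy = train_nb(
    ["cheap pills now", "win money now", "meeting at noon", "project update"],
    ["spam", "spam", "ham", "ham"])
print(predict_nb(model_toy, "win cheap money"))  # spam
```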
[20]: mat = confusion_matrix(test_data.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()
# This helps us see whether our model is actually predicting correctly.
# It can be understood graphically: one axis has the number of documents
# with each predicted label, the other has the number of documents with
# each actual label, so each cell on the diagonal tells us the number of
# documents correctly predicted.
As we can see, in the case of comp.graphics our model correctly predicted 327 documents. In the case of comp.sys.ibm.pc.hardware, 375 documents were correctly predicted, and for talk.politics.guns, 361 were correctly predicted.
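Since correct predictions sit on the confusion matrix's diagonal, the per-class counts reported above can be combined into a total with a one-line sum. (Only the diagonal values come from the run above; the overall accuracy would additionally require the full test-set size, which is not reproduced here.)

```python
# Diagonal of the confusion matrix = correctly classified documents per class.
diag = {
    "comp.graphics": 327,
    "comp.sys.ibm.pc.hardware": 375,
    "talk.politics.guns": 361,
}
total_correct = sum(diag.values())
print(total_correct)  # 1063
```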
Resource link:
https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.05-Naive-Bayes.ipynb?pli=1#scrollTo=RGaQV5fiY6WD