
Take any large text corpus, do the necessary preprocessing, build a Naïve Bayes classifier, and draw inferences.

[12]: import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

[13]: # The 20 newsgroups dataset comprises around 18000 newsgroup posts on
# 20 topics, split in two subsets: one for training (or development) and
# the other one for testing (or for performance evaluation). The split
# between the train and test set is based upon messages posted before
# and after a specific date.
data = fetch_20newsgroups()
# data

Downloading 20news dataset. This may take a few minutes.


Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
So the data returned by fetch_20newsgroups contains many text documents (as raw text strings) together with target names, i.e. the classes into which these documents can be classified, such as the 20 classes given below. Our job is to extract the individual word frequencies from these documents and then use Naive Bayes classification to classify them. After this we can test the fitted predictions against the actual classes.
[14]: data.target_names
[14]: ['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',

'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
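Before narrowing the problem down, a quick sketch to inspect what the fetched object actually holds (data.data is a list of raw posts and data.target the integer class index of each post; the exact text of the first post will vary):

print(len(data.data))                     # number of posts in the default (train) subset
print(data.data[0][:200])                 # first 200 characters of the first post
print(data.target_names[data.target[0]])  # the class that post belongs to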
[15]: # Choosing some categories instead of all 20 classes.
cat_class = ['talk.politics.guns','comp.graphics','comp.sys.ibm.pc.hardware']

[16]: # Splitting this into training and testing data.

train_data = fetch_20newsgroups(subset='train', categories=cat_class)
# This gives us the training subset of the data, restricted to the categories
# we have specified.

# train_data

As we can see, we now have only 3 classes to classify into.
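A small check confirms this (assuming the fetch above succeeded):

print(train_data.target_names)   # ['comp.graphics', 'comp.sys.ibm.pc.hardware', 'talk.politics.guns']
print(len(train_data.data))      # number of training documents across the three categories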


[17]: # Splitting into testing data
test_data = fetch_20newsgroups(subset='test', categories=cat_class)
So we have the data and we also know the actual class that each document belongs to. Our job now is to vectorize the documents' words and use Naive Bayes classification to train a model. This model will later be used against the testing data to check whether Naive Bayes classification is actually able to predict with reasonable accuracy.
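To make the vectorization step concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences are invented for illustration; get_feature_names_out needs scikit-learn 1.0 or newer, older versions use get_feature_names):

toy_corpus = ["the gpu renders graphics",
              "the pc hardware needs a new gpu",
              "gun laws are debated in politics"]
toy_vec = TfidfVectorizer()
X_toy = toy_vec.fit_transform(toy_corpus)   # sparse tf-idf document-term matrix
print(toy_vec.get_feature_names_out())      # the learned vocabulary
print(X_toy.shape)                          # (3 documents, vocabulary size)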
[18]: # TfidfVectorizer allows us to use various functions to work on a corpus.
# We can create a term-document matrix, compute inverse document frequencies, etc.:
# decode(doc)                        Decode the input into a string of unicode symbols.
# fit(raw_documents[, y])            Learn vocabulary and idf from the training set.
# fit_transform(raw_documents[, y])  Learn vocabulary and idf, return the document-term matrix.

# The multinomial Naive Bayes classifier is suitable for classification with
# discrete features (e.g., word counts for text classification). The multinomial
# distribution normally requires integer feature counts. However, in practice,
# fractional counts such as tf-idf may also work.

# make_pipeline allows us to create a pipeline flow of data: we pass data to a
# function, the result is passed to the next function, and so on. This helps to
# abstract the details into a single function.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
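For clarity, here is roughly what that pipeline does under the hood, written out by hand (a sketch only; the single make_pipeline call above is all we actually need):

vec = TfidfVectorizer()
nb = MultinomialNB()
X_train = vec.fit_transform(train_data.data)  # learn vocabulary and idf, build the document-term matrix
nb.fit(X_train, train_data.target)            # train Naive Bayes on the tf-idf features
X_test = vec.transform(test_data.data)        # reuse the same vocabulary on the test documents
manual_labels = nb.predict(X_test)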
[19]: model.fit(train_data.data, train_data.target)
# This fits our model with the data we have passed and the corresponding
# target, which in our case is the class of each text document.

labels = model.predict(test_data.data)

# This will predict the classes of the testing data we have passed,
# by working on the model we trained in the last step.
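Once fitted, the same pipeline can classify any new piece of text. A small helper in the spirit of the linked handbook notebook (the example strings are made up for illustration):

def predict_category(s, model=model, train=train_data):
    pred = model.predict([s])           # vectorize the string and run Naive Bayes
    return train.target_names[pred[0]]  # map the class index back to its name

print(predict_category("how do I render a 3D texture on screen"))  # likely comp.graphics
print(predict_category("the senate debated new gun legislation"))  # likely talk.politics.guns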
[20]: mat = confusion_matrix(test_data.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names)

plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

# This helps us see whether our model is actually predicting correctly.
# It can be understood graphically: one axis has the number of documents with
# the predicted label, the other has the number of documents with the actual
# label. So each intersection tells us how many documents of one actual class
# received a given predicted class, and the correctly predicted documents lie
# on the diagonal of our confusion matrix.

As we can see, in the case of comp.graphics our model correctly predicted 327 documents, in the case of comp.sys.ibm.pc.hardware 375 documents were correctly predicted, and in the case of talk.politics.guns 361 were correctly predicted.
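The diagonal can also be summarised as a single accuracy number; a short sketch (accuracy_score is an extra import not used above):

accuracy = mat.diagonal().sum() / mat.sum()  # correctly classified documents over all documents
print(round(accuracy, 3))

from sklearn.metrics import accuracy_score   # equivalent, computed directly from the labels
print(accuracy_score(test_data.target, labels))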

Resource link

https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.05-Naive-Bayes.ipynb?pli=1#scrollTo=RGaQV5fiY6WD
