Text Processing Assignment Report 210115882
A. Introduction:
In this assignment, we use two preprocessing methods, stop-word removal and stemming. There are also three term weighting
schemes that we have to use: Binary, Term Frequency (TF), and TFIDF.
B. Methodology:
1) Binary:
This was accomplished by computing the set intersection between the set of words in each article and the set of words
in the query, and then assigning a weight of 1 to each term found in that intersection (and 0 to every other term).
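As a sketch, the binary scoring described above might look like the following (the function name and tokenized inputs are our own illustration, not the actual assignment code):

```python
# Minimal sketch of binary term weighting via set intersection
# (hypothetical helper; assumes the query and document are already tokenized).
def binary_score(query_terms, doc_terms):
    """Score a document by counting the query terms it shares (weight 1 each)."""
    query_set = set(query_terms)
    doc_set = set(doc_terms)
    # Each term in the intersection contributes a weight of 1; absent terms add 0.
    return len(query_set & doc_set)

# Documents are then ranked by this overlap count.
print(binary_score(["text", "processing", "report"],
                   ["this", "report", "covers", "text", "analysis"]))  # 2
```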
TABLE 1 (surviving fragment):
Weighting  Config  Relevant retrieved  Precision  Recall  F-measure
Binary     S       98                  0.15       0.12    0.14
In TABLE 1, our detailed evaluation is given. In the configuration column, P means the stemming method, S means the
stop-word removal method, and N/A means no preprocessing techniques. In all the configurations, you will see that the
performance increased dramatically when we removed stop words or applied stemming. Stop words such as
“the, is, a, am, I, we, you, etc.” are uninformative, or we can say garbage data, so when we remove them, only
the informative part of every sentence remains. In stemming, all words are converted into their base form; for example,
“likes, liked, liking” all become the single word “like.” By doing this, we merge terms that mean the same thing. The
best performance in every configuration comes from applying both stemming and stop-word removal, so from this
analysis we can see how important preprocessing is.
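A toy sketch of these two preprocessing steps follows. The abbreviated stop list and the crude suffix-stripping rules are our own illustration; a real run would use a full stop list and a proper stemmer such as Porter's.

```python
# Illustrative stop list (a real one is much longer).
STOP_WORDS = {"the", "is", "a", "am", "i", "we", "you", "and", "of"}

def crude_stem(word):
    """Strip a few common suffixes so variants like 'likes', 'liked',
    'liking' collapse to the same stem (here 'lik')."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    """Lowercase, drop stop words, then stem what remains."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue  # uninformative word: discard
        out.append(crude_stem(tok))
    return out

print(preprocess(["I", "liked", "the", "liking", "likes"]))  # ['lik', 'lik', 'lik']
```

Note how the three inflected forms merge into one term, which is exactly what shrinks the vocabulary and boosts matching.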
Another interesting point, not shown in this table, is the runtime. The runtime is dramatically reduced when you
work with preprocessed data: removing stop words and merging terms with the same base form shrinks the total number
of terms, so we iterate over the smaller preprocessed vocabulary rather than every term in the whole dataset.
For the TF method without any preprocessing, the precision, recall, and F-measure are 0.08, 0.06, and 0.07,
respectively, the lowest performance of all the methods. When we added the stemming method, the scores
increased to 0.11, 0.09, and 0.10. Adding the stop-word removal method gave a larger improvement, to
0.17, 0.13, and 0.15. The highest scores, 0.19, 0.15, and 0.17, came from using both preprocessing methods together.
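The TF scoring itself can be sketched as follows (hypothetical helper names; the idea is simply that each query term contributes its raw count in the document):

```python
from collections import Counter

def tf_score(query_terms, doc_terms):
    """Term-frequency scoring: sum, over distinct query terms,
    of how often each one appears in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))

doc = ["text", "processing", "of", "text", "data"]
print(tf_score(["text", "data"], doc))  # 3: "text" appears twice, "data" once
```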
The binary method follows the same trend as the TF method. When we apply stemming, the accuracy
increases, and it increases more when we remove stop words or apply both techniques. The best performance for the
binary method comes from applying both preprocessing methods, with scores of 0.20, 0.16, and 0.18. In
total, 127 relevant documents were retrieved in this best binary configuration.
Finally, the TFIDF method gives the best performance in every configuration; of the options explored in this
assignment, TFIDF is the best. In all settings, it provides higher accuracy and retrieves more relevant documents.
Even its performance without any preprocessing is higher than the best configurations of TF and
Binary: it retrieved 132 relevant documents with no preprocessing, which increased to 166 after applying
stemming and 140 after removing stop words. The best performance in this assignment is TFIDF with both
stemming and stop-word removal, retrieving 172 relevant documents in total. The precision, recall, and F-measure
are 0.27, 0.22, and 0.24, respectively, which is even higher than the accuracy expected in the assignment
video instructions.
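The reported F-measure is simply the harmonic mean of precision and recall, which can be checked directly from the two values above:

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The best TFIDF configuration: P = 0.27, R = 0.22.
print(round(f_measure(0.27, 0.22), 2))  # 0.24
```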
This higher accuracy is due to the fact that TFIDF considers the relevance of a word across the whole corpus; neither of
the other methods does. Binary term weighting is the simplest method, as it assigns 0 or 1 to each term depending on
whether or not it exists in both the query and the document. TF assigns each term the value of its frequency in the
document, so terms that appear more often in a document score higher, but it still does not consider how important
that word is across the whole dataset.
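This corpus-level weighting can be sketched with a minimal TFIDF over a toy corpus. The formula used here, tf × log(N / df), is one common variant (real systems differ in smoothing and normalization), and all names are our own illustration:

```python
import math

def tfidf(term, doc, corpus):
    """tf * idf with idf = log(N / df): frequent-in-document terms score
    higher, but terms appearing in every document get weight 0."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)  # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]
# "the" occurs in every document, so idf = log(3/3) = 0 and its weight
# vanishes, while a rarer term like "sat" keeps a positive weight.
print(tfidf("the", corpus[0], corpus))      # 0.0
print(tfidf("sat", corpus[0], corpus) > 0)  # True
```

This is exactly why TFIDF beats Binary and TF here: ubiquitous, uninformative terms are automatically discounted even without stop-word removal.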