
Text Processing Assignment: Document Retrieval

Registration Number: 210115882

A. Introduction:
In this assignment, we use two preprocessing methods: stop word removal and stemming. There are also three term weighting schemes that we have to use: Binary, Term Frequency (TF), and TFIDF.

B. Methodology:
1) Binary:
This was accomplished by computing the set intersection between the set of words in each article and the set of words in the query, and then giving a weight of 1 to each word found in that intersection (and 0 to every other word).
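As a rough illustration, the scoring can be sketched in Python as follows (a minimal sketch assuming the article and query are already tokenised into word lists; the names are illustrative, not the assignment's actual code):

    def binary_score(article_words, query_words):
        # Words shared by the article and the query each contribute a weight of 1;
        # every other word contributes 0, so the score is the size of the intersection.
        shared = set(article_words) & set(query_words)
        return len(shared)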

2) Term Frequency (TF):


This is similar to binary term weighting, except that instead of assigning a weight of 1 to words that appear in both the article and the query, the weight allocated to each word is (frequency of the word in the article * frequency of the word in the query).
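Under the same assumptions as the sketch above, the TF score could look like this:

    from collections import Counter

    def tf_score(article_words, query_words):
        article_tf = Counter(article_words)   # term frequencies in the article
        query_tf = Counter(query_words)       # term frequencies in the query
        shared = set(article_tf) & set(query_tf)
        # Each shared word contributes tf(article) * tf(query) to the score.
        return sum(article_tf[w] * query_tf[w] for w in shared)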

3) TF, Inverse Document Frequency (TFIDF):


The TFIDF term weighting is identical to the TF term weighting, but it scales each word by an inverse document frequency (IDF) component. The IDF for each word is calculated as the inverse of (number of articles in which the word appears / number of articles). This guarantees that uncommon words count for more than frequent words, since the scores of the words are adjusted accordingly.
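Putting the pieces together, a sketch of the TFIDF score under this definition of IDF (total articles divided by articles containing the word; many implementations additionally take the log of this ratio) might be:

    from collections import Counter

    def idf(word, articles):
        # articles: a list of word sets, one per article in the collection
        df = sum(1 for article in articles if word in article)
        return len(articles) / df if df else 0.0

    def tfidf_score(article_words, query_words, articles):
        article_tf = Counter(article_words)
        query_tf = Counter(query_words)
        shared = set(article_tf) & set(query_tf)
        # The TF contribution of each shared word is scaled by its IDF,
        # so rare words dominate the score over common ones.
        return sum(article_tf[w] * query_tf[w] * idf(w, articles) for w in shared)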

C. Results and Analysis:

To evaluate the system, we tested different configurations over the CACM_gold_std collection using three evaluation metrics: Precision, Recall, and F-measure.
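For reference, the three metrics can be computed as below (a generic sketch over sets of retrieved and gold-standard relevant document IDs, not the actual marking script):

    def evaluate(retrieved, relevant):
        # retrieved / relevant: sets of document IDs
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure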

TABLE 1: Evaluation of the system on different configurations

Term Weighting   Configuration   Relevant Documents Retrieved   Precision   Recall   F-Measure
TF               N/A              49                            0.08        0.06     0.07
TF               P                73                            0.11        0.09     0.10
TF               S               107                            0.17        0.13     0.15
TF               P and S         122                            0.19        0.15     0.17
TFIDF            N/A             132                            0.21        0.17     0.18
TFIDF            P               166                            0.26        0.21     0.23
TFIDF            S               140                            0.22        0.18     0.19
TFIDF            P and S         172                            0.27        0.22     0.24
Binary           N/A              76                            0.12        0.10     0.11
Binary           P                91                            0.14        0.11     0.13
Binary           S                98                            0.15        0.12     0.14
Binary           P and S         127                            0.20        0.16     0.18

In TABLE 1, our detailed evaluation is given. In the configuration column, P means the stemming method, S means the stop word removal method, and N/A means no preprocessing technique. In all configurations, the performance increased dramatically when we removed stop words or applied stemming. Stop words such as "the, is, a, am, I, we, you, etc." are uninformative, essentially noise, so removing them leaves only the informative part of every sentence. In stemming, all words are converted into their base form; for example, "likes, liked, liking" all become the single word "like." By doing this, we merge terms that mean the same thing. For every term weighting scheme, the best performance is obtained when applying both the stemming and stop word preprocessing methods, as illustrated in the sketch below. From this analysis, we can see how important preprocessing is.
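Both steps could be implemented roughly as follows (a sketch using NLTK's stop word list and Porter stemmer, which is an assumption; the assignment may use its own stop list and stemmer, and the NLTK stopwords data must already be downloaded):

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP_WORDS = set(stopwords.words('english'))  # assumes the nltk stopwords data is installed
    stemmer = PorterStemmer()

    def preprocess(words, remove_stops=True, stem=True):
        if remove_stops:
            words = [w for w in words if w.lower() not in STOP_WORDS]
        if stem:
            # e.g. "likes", "liked", "liking" all stem to "like"
            words = [stemmer.stem(w) for w in words]
        return words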

Another interesting thing that is not given in this table is the runtime. The runtime is dramatically reduced when working with preprocessed data: preprocessing removes the stop words and merges all terms that share a base form, so the total number of terms shrinks, and we iterate only over the preprocessed terms rather than every term in the dataset.

For the TF method without any configuration, the precision, recall, and F-measure are 0.08, 0.06, and 0.07, respectively, which is the lowest performance of all the methods. When we added the stemming method, the scores increased to 0.11, 0.09, and 0.10. Adding the stop word removal method instead gives much higher scores: 0.17, 0.13, and 0.15. The highest scores, 0.19, 0.15, and 0.17, come from using both preprocessing methods together.

The binary method follows the same trend as the TF method. When we apply stemming, performance increases, and it increases further when we remove stop words or apply both. The binary method performs best when both preprocessing methods are applied, with scores of 0.20, 0.16, and 0.18. In total, 127 relevant documents were retrieved in this best binary configuration.

Finally, the TFIDF method gives better performance in all configurations. Of the options explored in this assignment, TFIDF is the best: in every setting it provides higher scores and retrieves more relevant documents. Even its performance without any preprocessing is higher than the best preprocessed performance of TF and Binary. It retrieved 132 relevant documents with no preprocessing, which increased to 166 after applying stemming and to 140 after removing stop words. The best performance in this assignment came from TFIDF with both the stemming and stop word methods: 172 relevant documents were retrieved in total, with precision, recall, and F-measure of 0.27, 0.22, and 0.24, respectively, which is even higher than the accuracy expected in the assignment video instructions.

This higher accuracy is due to the fact that TFIDF considers the relevance of a word across the corpus; neither of the other methods considers the importance of words in relation to the corpus as a whole. For example, binary term weighting is the simplest method, assigning 0 or 1 to each term depending on whether or not it exists in both the query and the document, while TF assigns each term the value of its frequency in the document. Terms that appear more frequently in a document therefore score higher, but TF never considers how important each word is across the whole dataset.
