
Research methods for the first algorithm:

The first problem I focused on was the speed of the provided algorithm. The goal of the algorithm was
to determine whether a given phrase was relevant to a document. After researching similar existing
algorithms, I settled on TF-IDF. TF-IDF, short for term frequency–inverse document frequency, is a
numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
Because of the way I implemented the algorithm, it also let me handle the phrase-to-search differing in
arbitrary ways from the corresponding sentence in a file.
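
To make the idea concrete, here is a minimal sketch of the TF-IDF computation and of ranking
documents against a phrase. This is not the project's actual code; the tokenization, the use of raw
counts, and the ranking function are my assumptions.

    import math
    from collections import Counter

    def tf_idf_scores(documents):
        # `documents` is a list of token lists, e.g. [["some", "words"], ...]
        doc_freq = Counter()  # number of documents that contain each word
        for doc in documents:
            doc_freq.update(set(doc))
        n_docs = len(documents)
        scores = []
        for doc in documents:
            counts, total = Counter(doc), len(doc)
            # term frequency * inverse document frequency, per word
            scores.append({word: (count / total) * math.log(n_docs / doc_freq[word])
                           for word, count in counts.items()})
        return scores

    def best_document(phrase_words, scores):
        # rank documents by the summed TF-IDF weights of the phrase's words;
        # missing or altered words simply contribute nothing, which is what
        # makes the method tolerant of an imperfect phrase
        return max(range(len(scores)),
                   key=lambda i: sum(scores[i].get(w, 0.0) for w in phrase_words))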

While the TF-IDF algorithm, in theory, would find the correct file containing the phrase-to-search,
locating the exact paragraph was done using approximate string matching (fuzzy string search). This
let me handle any word in the phrase-to-search having a different ending, being replaced by a
synonym, or being removed altogether.
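
A minimal sketch of that step, using Python's standard difflib as the approximate matcher (the actual
project may have used a different edit-distance library; the half-phrase window stride is an
assumption). The phrase is slid across each paragraph, and the best window similarity wins:

    import difflib

    def match_score(phrase, paragraph):
        # best similarity ratio between the phrase and any window of the paragraph
        n = len(phrase)
        if len(paragraph) <= n:
            return difflib.SequenceMatcher(None, phrase, paragraph).ratio()
        step = max(1, n // 2)  # overlapping windows, half-phrase stride
        return max(difflib.SequenceMatcher(None, phrase, paragraph[i:i + n]).ratio()
                   for i in range(0, len(paragraph) - n + 1, step))

    def best_paragraph(phrase, paragraphs):
        # return the paragraph that most closely matches the phrase
        return max(paragraphs, key=lambda p: match_score(phrase, p))

The similarity ratio tolerates changed word endings, substituted words, and dropped words to a
degree, which matches the failure modes listed above.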

Testing First Algorithm Performance:

The preprocessing step of calculating the TF-IDF score of each word for each document takes
approximately 17 hours (on average 6000 files per hour). This step needs to be done once for the whole
dataset.

Testing was done using a set of random substrings from random files in the dataset.
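
The report does not spell out the harness, but generating such a test case could look like the
following sketch (the substring lengths are assumptions, and each document is assumed to be at least
min_len characters long):

    import random

    def make_test_case(documents, min_len=40, max_len=120):
        # documents: list of raw text strings; returns (query, expected doc id)
        doc_id = random.randrange(len(documents))
        text = documents[doc_id]
        length = random.randint(min_len, min(max_len, len(text)))
        start = random.randrange(len(text) - length + 1)
        return text[start:start + length], doc_id

Running many such cases and checking whether the search returns doc_id gives the accuracy figures
below.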

On a small dataset (up to 1000 files), the algorithm performed well, getting the correct answer about
85% of the time.

Each phrase took approximately 1 second to find when the TF-IDF score table had been saved earlier
and was read back from a file, and 0.01 seconds when the preprocessing and the phrase lookup
happened within the same execution of the code.
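
The gap between the two timings is essentially the cost of loading the saved score table from disk. A
sketch of the save-once, load-later pattern (the pickle format and file name are assumptions):

    import pickle

    def save_scores(scores, path="tfidf_scores.pkl"):
        # persist the score table once, after the long preprocessing run
        with open(path, "wb") as f:
            pickle.dump(scores, f)

    def load_scores(path="tfidf_scores.pkl"):
        # later executions reload the table instead of recomputing it
        with open(path, "rb") as f:
            return pickle.load(f)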

Testing on a larger dataset began to reveal problems: files with a similar theme could not be
distinguished by the algorithm, and the error rate grew. The algorithm got the correct answer only
about 40% of the time.

Results:

The results of this algorithm are average: the search itself is fast, but accuracy drops sharply as the dataset grows.

Advantages

- The search itself is done very quickly
- Preprocessing needs to be done only once

Disadvantages

- The preprocessing takes a long time
- The larger the dataset, the higher the error rate

Research methods for the second algorithm:

After encountering the problems of the first algorithm, I decided to focus on improving the code
provided in section A, using the fuzzy search algorithm directly. This would potentially solve the issue
of the phrase-to-search not matching any string in any file exactly, but, as expected, it came at a heavy
cost in speed.
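
A sketch of this brute-force approach, reusing the windowed match_score matcher sketched earlier
(the flat directory layout and the encoding handling are assumptions). Every query reads and rescans
every file, which is exactly where the slowdown comes from:

    import os

    def fuzzy_search(phrase, directory):
        # scan every file and return the best-matching path and its score;
        # accurate but slow, since nothing is precomputed
        best_path, best_score = None, 0.0
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            score = match_score(phrase, text)  # windowed difflib matcher from above
            if score > best_score:
                best_path, best_score = path, score
        return best_path, best_score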

Testing Second Algorithm Performance:

Due to the lack of resources, this algorithm was not thoroughly tested.

Testing was done using a set of random substrings from random files in the dataset.

On a small dataset (up to 1000 files), the algorithm performed very well, getting the correct answer
about 97% of the time.

Preparing the documents for searching takes a lot of time, because the algorithm iterates over all the
files.

Testing on a large dataset was not done due to the lack of resources.

Results:

The results of this algorithm are average: the error rate is very low, but the search itself is far too slow.

Advantages

- The error rate is very low

Disadvantages

- Takes a very long time to search
