
Research methods for the first algorithm:

The first problem I focused on was the speed of the provided algorithm. The goal of the algorithm was
to determine whether a given phrase was relevant to a document. After researching similar existing
algorithms, I settled on TF-IDF. TF-IDF, short for term frequency–inverse document frequency, is a
numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
Because of the way I implemented the algorithm, it also let me handle the phrase-to-search differing in
arbitrary ways from the corresponding sentence in a file.
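
To make the idea concrete, here is a minimal sketch of the TF-IDF computation and of ranking
documents against a phrase. This is not the project's actual code; the tokenization, the use of raw
counts, and the ranking function are my assumptions.

    import math
    from collections import Counter

    def tf_idf_scores(documents):
        # `documents` is a list of token lists, e.g. [["some", "words"], ...]
        doc_freq = Counter()  # number of documents that contain each word
        for doc in documents:
            doc_freq.update(set(doc))
        n_docs = len(documents)
        scores = []
        for doc in documents:
            counts, total = Counter(doc), len(doc)
            # term frequency * inverse document frequency, per word
            scores.append({word: (count / total) * math.log(n_docs / doc_freq[word])
                           for word, count in counts.items()})
        return scores

    def best_document(phrase_words, scores):
        # rank documents by the summed TF-IDF weights of the phrase's words;
        # missing or altered words simply contribute nothing, which is what
        # makes the method tolerant of an imperfect phrase
        return max(range(len(scores)),
                   key=lambda i: sum(scores[i].get(w, 0.0) for w in phrase_words))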

While the TF-IDF algorithm, in theory, would find the correct file containing the phrase-to-search,
locating the exact paragraph was done using approximate string matching (fuzzy string search). This
let me handle any word in the phrase-to-search having a different ending, being replaced by a
synonym, or being removed altogether.
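
A minimal sketch of that step, using Python's standard difflib as the approximate matcher (the actual
project may have used a different edit-distance library; the half-phrase window stride is an
assumption). The phrase is slid across each paragraph, and the best window similarity wins:

    import difflib

    def match_score(phrase, paragraph):
        # best similarity ratio between the phrase and any window of the paragraph
        n = len(phrase)
        if len(paragraph) <= n:
            return difflib.SequenceMatcher(None, phrase, paragraph).ratio()
        step = max(1, n // 2)  # overlapping windows, half-phrase stride
        return max(difflib.SequenceMatcher(None, phrase, paragraph[i:i + n]).ratio()
                   for i in range(0, len(paragraph) - n + 1, step))

    def best_paragraph(phrase, paragraphs):
        # return the paragraph that most closely matches the phrase
        return max(paragraphs, key=lambda p: match_score(phrase, p))

The similarity ratio tolerates changed word endings, substituted words, and dropped words to a
degree, which matches the failure modes listed above.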

Testing First Algorithm Performance:

The preprocessing step of calculating the TF-IDF score of each word for each document takes
approximately 17 hours (on average 6000 files per hour). This step needs to be done once for the whole
dataset.

Testing was done using a set of random substrings from random files in the dataset.
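
The report does not spell out the harness, but generating such a test case could look like the
following sketch (the substring lengths are assumptions, and each document is assumed to be at least
min_len characters long):

    import random

    def make_test_case(documents, min_len=40, max_len=120):
        # documents: list of raw text strings; returns (query, expected doc id)
        doc_id = random.randrange(len(documents))
        text = documents[doc_id]
        length = random.randint(min_len, min(max_len, len(text)))
        start = random.randrange(len(text) - length + 1)
        return text[start:start + length], doc_id

Running many such cases and checking whether the search returns doc_id gives the accuracy figures
below.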

On a small dataset (up to 1000 files), the algorithm performed well, getting the correct answer about
85% of the time.

Each phrase took approximately 1 second to find when the TF-IDF score table had been saved earlier
and was read back from a file, and 0.01 seconds when the preprocessing and the phrase lookup
happened within the same execution of the code.
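
The gap between the two timings is essentially the cost of loading the saved score table from disk. A
sketch of the save-once, load-later pattern (the pickle format and file name are assumptions):

    import pickle

    def save_scores(scores, path="tfidf_scores.pkl"):
        # persist the score table once, after the long preprocessing run
        with open(path, "wb") as f:
            pickle.dump(scores, f)

    def load_scores(path="tfidf_scores.pkl"):
        # later executions reload the table instead of recomputing it
        with open(path, "rb") as f:
            return pickle.load(f)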

Testing on a larger dataset began to reveal problems: files with a similar theme could not be
distinguished by the algorithm, and the error rate grew. The algorithm got the correct answer only
about 40% of the time.

Results:

The results of this algorithm are average: the search itself is fast, but accuracy drops sharply as the dataset grows.

Advantages

- The search itself is done very quickly
- Preprocessing needs to be done only once

Disadvantages

- The preprocessing takes a long time
- The larger the dataset, the higher the error rate

Research methods for the second algorithm:

After encountering the problems of the first algorithm, I decided to focus on improving the code
provided in section A, using the fuzzy search algorithm directly. This would potentially solve the issue
of the phrase-to-search not matching any string in any file exactly, but, as expected, it came at a heavy
cost in speed.
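
A sketch of this brute-force approach, reusing the windowed match_score matcher sketched earlier
(the flat directory layout and the encoding handling are assumptions). Every query reads and rescans
every file, which is exactly where the slowdown comes from:

    import os

    def fuzzy_search(phrase, directory):
        # scan every file and return the best-matching path and its score;
        # accurate but slow, since nothing is precomputed
        best_path, best_score = None, 0.0
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            score = match_score(phrase, text)  # windowed difflib matcher from above
            if score > best_score:
                best_path, best_score = path, score
        return best_path, best_score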

Testing Second Algorithm Performance:

Due to the lack of resources, this algorithm was not thoroughly tested.

Testing was done using a set of random substrings from random files in the dataset.

On a small dataset (up to 1000 files), the algorithm performed very well, getting the correct answer
about 97% of the time.

Preparing the documents for searching takes a lot of time, because the algorithm iterates over all the
files.

Testing on a large dataset was not done due to the lack of resources.

Results:

The results of this algorithm are average: the error rate is very low, but the search itself is far too slow.

Advantages

- The error rate is very low

Disadvantages

- Takes a very long time to search
