
PRODUCTS OF COMPUTATIONAL LINGUISTICS
INFORMATION RETRIEVAL SYSTEMS
Definition of information retrieval systems (IRS):
“Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching, and
retrieval of information.” It is the activity of obtaining
information resources relevant to an information need
from a collection of information resources (Calvin Mooers,
1950)

This information can be of various kinds, with the queries
ranging from “Find all the documents containing the
word cougar” to “Find information on the conjugation of
Spanish verbs.”
Main objective of IR
• Provide the users with effective access to and
interaction with information resources.
Goal of IR
• The goal is to search large document
collections to retrieve small subsets relevant
to the user’s information need.
Databases and IR Systems: Comparison
IR systems are easily confused with Database Management Systems
(DBMS). The important difference is that a DBMS cannot
provide the functions needed to process “information.”
History
• The earliest IRSs were developed to search for
scientific articles on a specific topic.

• Usually, the scientists supply their papers with
a set of keywords, i.e., the terms they consider
most important and relevant for the topic of
the paper.
• These sets of keywords are attached to the document in
the bibliographic database of the IRS, being physically
kept together with the corresponding documents or
separately from them. In the simplest case, the query
should explicitly contain one or more of such keywords as
the condition under which the article can be found and
retrieved from the database.
• In a more elaborate system, a query can be a longer
logical expression with the operators and, or, not, e.g.:
“Find the documents on (adjectives) and (not American
English)”.
• Nowadays, a simple but powerful approach to the format
of the query is becoming popular in IRSs for
non-professional users: the query is still a set of words; the
system first tries to find the documents containing all of
these words, then all but one, etc., and finally those
containing only one of the words.
• The documents containing more keywords are presented
to the user first.
• In some systems the user can manually set a threshold for
the number of the keywords present in the documents,
i.e., to search for “at least m of n” keywords.
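The ranking described above (documents with all query words first, then all but one, and so on, with an optional "at least m of n" threshold) can be sketched as follows. This is a minimal illustration, not the method of any particular IRS: the function name rank_by_keyword_overlap, the min_matches parameter, and the sample documents are assumptions made for the example, and words are matched by simple whitespace splitting.

```python
def rank_by_keyword_overlap(query_words, documents, min_matches=1):
    """Rank documents by how many distinct query words they contain.

    documents: dict mapping a document id to its text (hypothetical format).
    min_matches: the "at least m of n" threshold, i.e. the value of m.
    """
    query = {w.lower() for w in query_words}
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        hits = len(query & words)  # number of query words present in the document
        if hits >= min_matches:
            scored.append((hits, doc_id))
    # Documents containing more keywords come first; ties are ordered by id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, hits) for hits, doc_id in scored]


# Made-up sample collection for illustration only.
docs = {
    "d1": "spanish verbs and their conjugation",
    "d2": "conjugation of french verbs",
    "d3": "the cougar is a large cat",
}
query = ["spanish", "verbs", "conjugation"]

print(rank_by_keyword_overlap(query, docs))
# [('d1', 3), ('d2', 2)]
print(rank_by_keyword_overlap(query, docs, min_matches=3))
# [('d1', 3)]
```

With min_matches=1 every document sharing at least one word with the query is returned; raising it narrows the result to closer matches.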
Main characteristics of IRSs

• Recall is the ratio of the number of relevant
documents found to the total number
of relevant documents in the database.
• Precision is the ratio of the number of relevant
documents found to the total number of
documents found.
• Recently, systems have been created that can
automatically build sets of keywords given just
the full text of the document. Such systems do
not require the authors of the documents to
specifically provide the keywords. Some of the
modern Internet search engines are
essentially based on this idea.
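The definitions of recall and precision given above can be checked with a small calculation. The sketch below assumes that the set of relevant documents is known in advance (in practice it usually comes from human judgments); the function name and the document ids are made up for the example.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: ids of the documents the system returned.
    relevant:  ids of all documents in the database relevant to the query.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    relevant_found = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    recall = relevant_found / len(relevant) if relevant else 0.0
    precision = relevant_found / len(retrieved) if retrieved else 0.0
    return recall, precision


# Made-up judgments: 4 relevant documents exist, the system returned 5.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d7", "d8", "d9"}

recall, precision = recall_and_precision(retrieved, relevant)
print(recall, precision)  # 0.5 0.4
```

Here the system found 2 of the 4 relevant documents (recall 0.5), and 2 of the 5 documents it returned were relevant (precision 0.4). The three irrelevant documents are the "information noise" and the two missed ones lower the recall.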
Limiting factors for more sophisticated techniques
• The absence of complete grammatical and
semantic analysis of the text of documents.
• The methods now used even in the most
sophisticated Internet search engines are not
efficient for accurate information retrieval. This
leads to a high level of information noise, i.e.,
the delivery of irrelevant documents, as well as to
relevant documents frequently being missed.
Kinds of information retrieval systems

• In-house: In-house information retrieval
systems are set up by a particular library or
information center to serve mainly the users
within the organization. One particular type of
in-house database is the library catalogue.
• Online: Online IR retrieves data
from web sites, web pages, and servers, which
may include databases, images, text, tables,
and other types of content.
Search Capabilities
• Boolean logic

• Proximity is used to restrict the distance allowed
within an item between two search terms.
• Contiguous Word Phrases
• Fuzzy Searches provide the capability to locate
spellings of words that are similar to the
entered search term (e.g., a search for “computer”
also finds “compiter,” “conputer,” “computter,”
and “compute”).
• Term Masking (does not work for finding ranges)
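Three of the capabilities listed above (proximity, fuzzy search, and term masking) can be sketched with the Python standard library. This is an illustration under stated assumptions, not how any specific IRS implements them: the function names are hypothetical, distance is counted in whitespace-separated words, fuzzy similarity uses the difflib similarity ratio with an arbitrary cutoff of 0.8, and masking borrows the "*" and "?" wildcards of fnmatch.

```python
import difflib
import fnmatch


def within_proximity(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur at most max_distance words apart."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)


def fuzzy_matches(term, vocabulary, cutoff=0.8):
    """Vocabulary words whose spelling is similar to term."""
    # n must be positive, so fall back to 1 for an empty vocabulary.
    n = max(1, len(vocabulary))
    return difflib.get_close_matches(term, vocabulary, n=n, cutoff=cutoff)


def masked_matches(pattern, vocabulary):
    """Vocabulary words matching a masked term such as 'comput*'."""
    return fnmatch.filter(vocabulary, pattern)


# Made-up text and vocabulary for illustration only.
text = "information retrieval systems search large document collections"
vocabulary = ["compiter", "conputer", "computter", "compute",
              "computing", "cougar"]

print(within_proximity(text, "information", "systems", 2))      # True
print(within_proximity(text, "information", "collections", 2))  # False
print(sorted(fuzzy_matches("computer", vocabulary)))
# ['compiter', 'compute', 'computter', 'conputer']
print(masked_matches("comput*", vocabulary))
# ['computter', 'compute', 'computing']
```

Note that the masked term "comput*" matches on spelling only, which is consistent with the remark above that term masking does not work for finding ranges.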
Boolean Queries
• AND: both terms must be found
• OR: either term found
• NOT: record containing keyword omitted
• ( ): used for nesting
• +: equivalent to AND
• – (minus): equivalent to AND NOT
A document is retrieved if the query is logically true
as an exact match in the document.
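The Boolean operators above map naturally onto set operations over an inverted index (a table from each word to the set of documents containing it). The sketch below is a minimal illustration under that assumption; the helper names build_index and term and the sample documents are made up, and queries are written directly as Python set expressions instead of being parsed from a query string.

```python
def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


# Made-up sample collection for illustration only.
docs = {
    "d1": "english adjectives and adverbs",
    "d2": "spanish verbs conjugation",
    "d3": "french verbs and adjectives",
    "d4": "cougar habitat",
}
index = build_index(docs)
all_ids = set(docs)


def term(word):
    """Documents containing the word exactly (empty set if none)."""
    return index.get(word.lower(), set())


# AND is set intersection, OR is union, NOT is the complement
# with respect to the whole collection.
both = term("verbs") & term("adjectives")
either = term("verbs") | term("adjectives")
# (adjectives OR verbs) AND NOT spanish
nested = (term("adjectives") | term("verbs")) & (all_ids - term("spanish"))

print(sorted(both))    # ['d3']
print(sorted(either))  # ['d1', 'd2', 'd3']
print(sorted(nested))  # ['d1', 'd3']
```

A document is either in the result set or not, which is why pure Boolean retrieval returns exact matches without any ranking.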
Retrieval techniques by Nicholas Belkin and
Bruce Croft
Applications

• Digital libraries
• Media search
- Blog search
- Image retrieval
- Music retrieval
- News search
- Video retrieval
• Search engines
- Site search
- Desktop search
- Enterprise search
- Federated search
- Mobile search
- Web search
• Question answering
Tasks
• Task 1. Find information on features that make an information retrieval system effective.

• Task 2. Try the open source IRSs Sphinx (http://sphinxsearch.com/) and Carrot2
(https://search.carrot2.org/#/search/web).

• Task 3. Choose 3 or more search engines, find some information, and compare the output:
1. DuckDuckGo
2. Shodan
3. TinEye
4. Ecosia
5. The Wayback Machine
6. FindSounds
7. Dogpile
8. Million Short
9. elgooG
Thank you!
