Infs 422 Combine
SESSION 1
An Overview of Information Storage and Retrieval
• If the books in the library, for example, are neatly arranged on the
shelves without being organized by cataloguing and classification,
searching for just a single book will be a herculean task.
• “The technique & process of searching, recovering, & interpreting information from
large amounts of stored data” (Science and Technology Dictionary, cited in Singh,
2016)
• It relates to “the organization of, processing of, and access to information of all
forms and formats” (Chowdhury, 2004, p. 1)
• To make the right information accessible to the right user upon request to meet
his or her information need
• Any information storage and retrieval system will have a complex series of
operations before documentary information can be used:
(CHAUHAN, 2023)
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York:
Neal-Schuman Publishers, Inc.
• Columbia Electronic Encyclopedia (2013). Columbia University Press. Licensed from Columbia
University Press.
•Harns, S. (2013). Information retrieval. Retrieved from
https://www.scribd.com/document/142955883/SCOPE-OF-IR
• Hiemstra, D. (2009). Information retrieval models. In A Goker and J. Davies (Eds.), Searching in the
21st century, (pp. 1-19). Chichester, U.K. : Wiley. Retrieved from
http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf
•Information storage and retrieval (n.d.). Retrieved from
https://www.tititudorancea.com/z/information_storage_and_retrieval.htm
• Singh, K. (2013). Function of information retrieval. (n.d.). Retrieved from
https://www.scribd.com/document/312452773/Function-of-Information-Retrieval
• Robbins, D. (2000). Interactive Information Retrieval: Context and Basic Notions. Informing Science,
3(2), 57-61. Retrieved from http://inform.nu/Articles/Vol3/v3n2p57-62.pd
Session 2
Nature of Data and Documents in Information Retrieval
A data retrieval system and an information retrieval system differ in the nature of
the data stored, organized and retrieved. Data in the former is structured, whilst
data in the latter is unstructured.
Again a document is often explained as a written or printed paper or papers serving
as a proof or evidence for something. In information retrieval, a document is simply
any form of data stored in the IR system.
• In the whole operation of information retrieval one can recognize four phases:
1. Word retrieval: identifies the words that will adequately describe the
information.
(CHAUHAN, 2023)
• The idea of information retrieval assumes that there exist several documents or
records comprising data that have been arranged in a suitable order for easy
retrieval. This means the retrieved information can be represented in different
forms.
• While the primary content being conveyed does not possess a defined
structure, it generally comes packaged in objects, for example in files or
folders or documents, that themselves have some metadata, and are thus
a combination of structured and unstructured data, but normally it is
referred to as "unstructured data"
Definition
• In an IR system a document is a much broader term than a printed or written
paper or papers. It is defined as:
➢ “a stored data record in any form” (Korfhage, 1997, p. 17).
• The word document as a general term could also include non-textual
information, such as multimedia objects. It includes:
• Books
• Informal writings such as
➢ Letters
➢ Messages
• Parts of a book, such as
➢ A chapter
➢ A section
➢ A paragraph
• Graphics
• Sound/voice recordings
• Images
• Computer programs
• Data files
• Email messages etc.
• Check the library and the Internet for all the document surrogates in this lesson,
examine and discuss your discoveries in the chat room on the Sakai course site
Session 3:
Models of Information Retrieval (Part 1)
• The information retrieval model needs to provide the framework for the system to
work and define the many aspects of the retrieval procedure of the retrieval
engines
• The IR model has to provide a system for how the documents in the collection and
user’s queries are transformed.
• The IR model also needs to build in the functionality by which the system identifies
the relevance of documents to the query provided by the user.
• The system in the information retrieval model also needs to incorporate the logic
for ranking the retrieved documents by relevance.
Cognitive/User-centered models
They adopt a holistic approach to IR, covering information
organization, document representation, query formulation
and their algorithms, as well as the information behaviour of
users. Thus they include:
➢User information needs & query formulation methods
➢Human-computer interactions during the search process
➢The socio-cognitive environment of the search process
➢Mode of information use in satisfying a need, plus
➢Matching the query to information stored in the IR system
(Chowdhury, 2010)
System-centered models
➢Includes search & retrieval of relevant documents from the IR
system
➢The hardware & software needed for the representation of
documents and their retrieval and the problems associated with
them
➢Computer programs needed for matching queries with stored
documents to produce output (i.e., the retrieved documents)
➢Algorithms (i.e., procedures, rules or formulae for solving problems,
often in the fields of computer science and mathematics) needed for
improved ranking of documents (Robbins, 2000)
➢(This course will focus on system centered models)
•There are many system centered models but the three most
important ones are the:
➢Boolean model
➢Vector model
➢ Probabilistic model
•These models are important because they are designed to
manage large collections such as web pages of the Internet
(Inkpen, n.d.)
• Callan, J. (2003). Retrieval models: Boolean and Vector Space. Retrieved from
https://www.scribd.com/presentation/351797748/03
• Fuhr, N. (2001). Models in Information Retrieval. In: Agosti M., Crestani F., Pasi G. (eds) Lectures
on Information Retrieval. Heidelberg, Berlin: Springe
• Greengrass, E. (2000). Information retrieval: A survey. Available at
http://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Inkpen, D. (n.d.). Information retrieval on the Internet. Retrieved from
http://site.uottawa.ca/~diana/csi4107/IR_draft.pdf
• Korfhage, R. R. (1997). Information storage and retrieval. New York; John Wiley and Sons Inc.
• Manning, C. D., Raghavan, P., & Schutze, H. (2009). Introduction to Information Retrieval.
Retrieved from https://nlp.stanford.edu/IRbook/pdf/irbooko
• Robbins, D. (2000). Interactive Information Retrieval: Context and Basic Notions. Informing
Science, 3(2), 57-61. Retrieved from http://inform.nu/Articles/Vol3/v3n2p57-62.pdf
Session 04:
Models of Information Retrieval (Part 2)
This session will discuss the Statistical models, that is, the Vector
Space and the Probabilistic models. They are referred to as statistical
models because they use statistical information to determine relevance
of documents to queries. They are also described as the best match
models because they are able to predict the degree of relevance of a
document to a query, i.e., retrieved documents are ranked with the
most relevant documents listed first.
At the end of the session, the student will be able to
• Identify the characteristics, advantages and limitations of the
statistical models.
• Explain the underlying principles and structure of the Vector Space
and the Probabilistic models.
Slide 15
Dr. (Mrs.) Florence O. Entsua-Mensah
Assumptions of the Vector Space Model
Computing the term frequency-inverse document frequency (tf-idf)
of a term
• It is the product of its tf weight and its idf weight, that is:
$$w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)$$
• $\mathrm{tf}_{t,d}$ is the number of occurrences of a term $t$ in a document $d$
• $\mathrm{df}_t$ is the document frequency, the number of documents in
the collection that contain the term $t$
• N is the total number of documents in the collection
(Hiemstra, Greengrass)
• NB
➢Tf is greater when the term is frequent in a
document.
➢A document with 10 occurrences of a term t
(tf = 10) is treated as more relevant than a document
with one occurrence of the term (tf = 1)
➢Idf is greater when the term is rare in the
collection, i.e., it occurs in only a few
documents
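The weighting formula can be tried out on a small collection. The following is a minimal sketch, not part of the original slides: the three toy documents are hypothetical, the helper names `tf_idf_weight` and `build_weights` are illustrative, and base 10 is assumed for both logarithms because the slide gives no base for the first one.

```python
import math


def tf_idf_weight(tf, df, n_docs):
    """w_{t,d} = log10(1 + tf) * log10(N / df), following the slide's formula.

    Assumption: base 10 is used for both logarithms.
    """
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(n_docs / df)


def build_weights(docs):
    """Return {doc_id: {term: weight}} for a dict of raw text documents."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = {}
    for tokens in tokenized.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    n_docs = len(docs)
    return {
        doc_id: {t: tf_idf_weight(tokens.count(t), df[t], n_docs) for t in set(tokens)}
        for doc_id, tokens in tokenized.items()
    }


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "information retrieval stores and retrieves information",
    "d2": "library cataloguing and classification",
    "d3": "retrieval of library information",
}
weights = build_weights(docs)
```

In this sketch a term that occurs in every document gets a weight of zero (its idf is log 1 = 0), and within one document a more frequent term outweighs a rarer one with the same document frequency.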
Computing the similarity between query and document vectors
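$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{V} q_i d_i}{\sqrt{\sum_{i=1}^{V} q_i^2}\,\sqrt{\sum_{i=1}^{V} d_i^2}}$$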
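The cosine measure can be illustrated with sparse vectors stored as dictionaries of term weights. This is a minimal sketch under stated assumptions: the query and document vectors are hypothetical, and the function names `cosine_similarity` and `rank` are illustrative rather than taken from the slides.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # An empty vector has no direction; treat as no similarity.
    return dot / (norm_q * norm_d)


def rank(query, doc_vectors):
    """Return (score, doc_id) pairs, most similar document first."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, reverse=True)


# Hypothetical weights, for illustration only.
query = {"retrieval": 1.0, "model": 1.0}
doc_vectors = {
    "d1": {"retrieval": 2.0, "model": 2.0},
    "d2": {"retrieval": 1.0, "library": 1.0},
    "d3": {"library": 3.0},
}
ranking = rank(query, doc_vectors)
```

Because the measure divides by the vector lengths, d1 scores as a perfect match even though its weights are twice those of the query: only the direction of the vectors matters, not their length.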
Steps in ranking of documents in the VSM
Maron and Kuhns’ Probabilistic Retrieval Model
Robertson and Sparck Jones Probabilistic Retrieval
Model
• Propositions of the model
• Given a user’s query, there is a set of documents which contain
exactly the relevant documents (called the ideal set).
• The purpose of the query is to specify the properties of the
answer set
• These properties are not known but the system makes a guess
and presents an initial set of documents
• The user inspects the top retrieved documents to identify the
relevant ones
• The system uses this information to refine the description of
the ideal answer set
• Repetition of this process eventually improves the description of
the ideal answer set
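The guess-inspect-refine cycle described above can be sketched in a few lines. The slides do not give a weighting formula, so this sketch assumes the standard Robertson/Sparck Jones relevance weight with the usual 0.5 correction; the toy collection, the query and the function names are hypothetical.

```python
import math


def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight (0.5 correction assumed).

    r: judged-relevant documents containing the term
    n: documents in the collection containing the term
    R: documents judged relevant so far
    N: documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def term_weights(docs, query_terms, relevant_ids=()):
    """Weight each query term from collection statistics and user judgements."""
    N, R = len(docs), len(relevant_ids)
    weights = {}
    for term in query_terms:
        n = sum(1 for terms in docs.values() if term in terms)
        r = sum(1 for doc_id in relevant_ids if term in docs[doc_id])
        weights[term] = rsj_weight(r, n, R, N)
    return weights


def rank(docs, query_terms, relevant_ids=()):
    """Rank documents by the summed weights of the query terms they contain."""
    weights = term_weights(docs, query_terms, relevant_ids)
    score = lambda doc_id: sum(w for t, w in weights.items() if t in docs[doc_id])
    return sorted(docs, key=score, reverse=True)


# Hypothetical collection: each document is a set of index terms.
docs = {
    "d1": {"retrieval", "model", "vector"},
    "d2": {"fashion", "model"},
    "d3": {"retrieval", "index"},
    "d4": {"fashion", "model", "show"},
    "d5": {"library", "catalogue"},
    "d6": {"model", "aircraft"},
}
query = ["retrieval", "model"]

initial = rank(docs, query)          # The system's first guess, no feedback yet.
refined = rank(docs, query, ["d1"])  # The user has marked d1 as relevant.
```

With no judgements the weight reduces to an idf-like value, so the rarer term "retrieval" dominates and d3 is ranked first. Once d1 is marked relevant, both query terms gain weight and d1 moves to the top, which mirrors the refinement step in the propositions.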
Most IR systems do not include the full text of documents. This is because use of
the raw documents only hinders retrieval, since the only way to access the required
information is by reading linearly. To ensure efficient retrieval of information,
document surrogates are generated and used in place of, or together with, the
full-text documents (Korfhage, 1997).
• Indexing is the analysis of the content of a given document and the representation of the
analysis by appropriate descriptors or keywords (Chowdhury, 2010).
• An index is the selected terms and their locations in an individual document or group
of documents (Korfhage, 1997)
5. Filing of entries
➢ One entry is prepared for each key word selected to represent a given document
➢ User’s query terms are identified and matched against the file of index terms
Pre-coordinate systems
➢ The chain of words represent or define the full or subject content of the
document
Advantages
➢ Use of uncontrolled language giving the indexer flexibility in choosing document
descriptors
Disadvantages
• Lack of consistency – indexers may not assign the same index term to a given
document.
• Varying levels of specificity and exhaustivity are attained based on the different
perspectives of indexers
• Use of controlled vocabulary may hinder accuracy; i.e., indexers may not represent the
document accurately, especially where new words are introduced to the documents.
• Indexer-user mismatch – same concept may be represented differently by indexer and
user e.g., meats and poultry
• Pre-coordination – subsets of terms in manual indexing are often represented by a single
term; e.g., gas, oil and coal are represented by fuel. This may hinder recall
• When the assignment of the content identifier is carried out with the aid of modern
computing equipment the operation becomes automatic indexing.
1. It is faster and easier to produce, because most of the processes are performed
by a machine. E.g., Microsoft Word and Adobe FrameMaker can do automatic
indexing.
• An automatic index is free from the bias of a human indexer, but may contain bias
• Professional indexers (who are often librarians) lack the technological skills
• Some librarians cast doubt on the quality of automated indexing. They think the
software cannot match the intellect of the human indexer (Obaseki, 2010)
Indexing exhaustivity
• The extent to which all index terms and concepts in a document are covered
Term specificity
include all kinds of birds is described as more specific but less exhaustive
• The more specific the terms, the better the representation of the subject
Exhaustivity
• However, high levels of exhaustivity also decrease the level of precision (i.e., the
proportion of retrieved documents that are relevant), so a large number of
non-relevant materials are retrieved.
➢ The low precision may be ascribed to the less-detailed discussion of some of
the related terms in the retrieved documents
Specificity
➢ For example, looking for information using a broad term like “sports” may retrieve
a lot of documents, most of which might not discuss the desired topic, but using a
specific term like “soccer” or “tennis” will yield fewer documents, some of which
may still be irrelevant.
• Thus higher levels of term specificity ensure high precision but low recall
• It is not possible for a given IR system to achieve optimal levels of precision and
recall. Optimal levels may increase cost of the IR system
• Choose two search terms, a broad search term and a specific term. Search for the
terms on the net. Which of them gave high precision and which gave high recall? Discuss
your findings in the Chatroom
• Korfhage, R. R. (1997). Information storage and retrieval. New York; John Wiley
and Sons Inc.
• Thus an inverted file has the keyword entry, and a reference list
specifying the actual position where the keyword/term is located in
the database
• Flexibility: Inverted indexes can be customized to suit the needs of different types of
information retrieval systems. For example, they can be configured to handle different
types of queries, such as Boolean queries or proximity queries.
• The reversed file may not be the database file's index, but
the database file itself.
• The inverted index is the most popular file structure used in document
retrieval systems such as commercial databases to support full text
search (Moura & Cristo, 2009)
➢Proximity searches
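An inverted file of the kind described above, with a keyword entry and a list of the positions where the keyword occurs, can be sketched as follows. This is a minimal illustration, not a production design: the three documents are hypothetical, tokenization is a plain whitespace split, and the function names are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def boolean_and(index, term1, term2):
    """Documents that contain both terms (a Boolean AND query)."""
    return set(index.get(term1, {})) & set(index.get(term2, {}))


def proximity(index, term1, term2, k):
    """Documents where the two terms occur within k word positions of each other."""
    hits = set()
    for doc_id in boolean_and(index, term1, term2):
        positions1, positions2 = index[term1][doc_id], index[term2][doc_id]
        if any(abs(a - b) <= k for a in positions1 for b in positions2):
            hits.add(doc_id)
    return hits


# Hypothetical collection, for illustration only.
docs = {
    1: "information retrieval systems use an inverted index",
    2: "the index of a book lists terms and pages",
    3: "retrieval of information from a large index",
}
index = build_inverted_index(docs)
```

Storing positions rather than only document identifiers is what makes the proximity query possible; a Boolean query needs only the document identifiers.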
Manning, C. D., Raghavan, P., & Schutze, H. (2009). A first take at building an index. Retrieved from
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html#fig:indexstart
Practice assignment
• “The term ‘vocabulary control’ refers to a limited set of terms that must
be used to index documents, and to search for these documents, in a
particular system” (Librarianship Studies and Information Technology,
n.d.)
• The systematic selection of preferred terms to represent the subject
matter of documents (Chowdhury, 2010)
• It is “an organized arrangement of words and phrases used to index
content and /or to retrieve content through browsing or searching” (What
are Controlled Vocabularies, n.d.).
(Most databases have controlled vocabularies, e.g., CINAHL, Academic
Search Premier)
In an IR environment:
Definition
• Thesaurus
• Ontologies
• Form heading – These are the same as topical subject headings but are used for the
organization, arrangement or classification of literary and artistic forms of
material, e.g., drama, essay, poetry, fiction, etc.
• Cross reference – These are used to direct users from broader and
related topics to the subject headings or terms used to represent a
particular subject, e.g.,
➢ See (or USE) reference – direct users to authorized headings
➢ See also references – direct users to related, broad and narrower
terms (RT, BT, NT) to help in searching specific aspects of a subject
➢ General references – direct users to category or group of headings
instead of individual headings to save space
• Examples of subject heading lists are the Library of Congress Subject
Headings (LCSH) and the Sears List of Subject Headings (Librarianship Studies
and Information Technology, n.d.).
➢Business writing
➢Economic indicators
➢Flipped classrooms
➢Systemic risk
➢Target marketing
• Colby, M. (2017). Library of Congress Subject Headings: Online Training: developed by Janis
Young and Daniel Joudrey. Washington, DC: Library of Congress, Program for Cooperative
Cataloging, Cataloger's Learning Workshop, 2017. https://www.loc.gov/catworkshop/lcsh.
Session 8 – Thesaurus
• Preferred terms
➢These are valid index terms that can be used for indexing and
searching
➢They are the terms that best represent the concepts in a
thesaurus
• Non-preferred terms
➢ They are the synonyms
➢They are not used for indexing.
➢Appropriate references are linked from non-preferred terms to
preferred terms to guide searchers and indexers
• Hierarchical relationships
• Equivalence relationships
• Associative relationships
Generic relationship
It identifies a class and its member species. The codes or notations
for the generic relationship in a thesaurus are BTG (broader term
generic) and NTG (narrower term generic), e.g.,
• Rat
• BTG rodents
• Rodents
• NTG rats
context such that all the terms can BTP Nervous system
Polyhierarchical relationships
It describes a situation where a concept belongs to more than
one category. It occurs in a particular instance, which links
proper name with a common noun. e.g.,
COMPUTER PRINTERS
BT Computer peripheral equipment
BT Printing equipment
Several examples:
• A process and its instrument e.g., illumination and lamp
• Concepts and their properties e.g., poison and toxicity
• Concepts and mechanism/units for measurement e.g.,
temperature and thermometer
• Concepts and their origins e.g., India and Indians
• Action and product e.g., weaving and cloth
• Cause and effect e.g., pathogens and infections
• Raw materials and products e.g., cocoa and chocolate
• Discipline or field of study and the object being studied e.g.,
Botany and plants
1. Alphabetic displays
3. Graphic displays
• Early Adolescence
U Early Adolescence
Early Adolescence
UF Early Adolescence; Young Adolescence
• Early Childhood (1966 1980)
U Young Children
Early Childhood Education
Early Detection
U Identification
Early Diagnosis
U Identification
Early Experience
UF Preschool Experience
Source: Thesaurus of ERIC Descriptors
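The references in an alphabetic display can be held in simple lookup tables. The sketch below is illustrative only: it reads the "U" code in the display as a USE reference (an assumption), takes its entries from the ERIC excerpt and the rat/rodents example in this session, and uses hypothetical function names.

```python
# USE references: non-preferred term -> preferred term.
# Assumption: "U" in the display stands for USE, and "UF" (used for) is its inverse.
USE = {
    "early detection": "identification",
    "early diagnosis": "identification",
    "preschool experience": "early experience",
}

# Generic hierarchy: narrower term -> broader terms (BTG).
BROADER = {"rats": ["rodents"]}


def preferred(term):
    """Follow a USE reference; a preferred term maps to itself."""
    key = term.lower()
    return USE.get(key, key)


def used_for(term):
    """List the non-preferred terms that point to this preferred term (UF)."""
    key = term.lower()
    return sorted(t for t, target in USE.items() if target == key)


def narrower(term):
    """List the narrower terms (NTG) recorded under a broader term."""
    key = term.lower()
    return sorted(t for t, broader_terms in BROADER.items() if key in broader_terms)
```

An indexer or searcher who enters "Early Detection" is led to the preferred term "identification", while `used_for("Identification")` rebuilds the UF line that would appear under that descriptor.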
• Graphic displays show the terms and their relationships in the form
of two dimensional figures
• They communicate relationships among concepts more effectively
than the linear forms of display
• They are more effective in an interactive computer environment
where terms can be connected to their details via hyperlinks, e.g.,
(click https://www.freethesaurus.com/for+certain)
• Some commercial products are available for use to generate
concept maps using terms in a controlled vocabulary (Chowdhury,
2010; Zeng, 2005)
– Concept maps in printed thesaurus are static but real time
graphic displays can be generated in an electronic system
Structured vs Semi-structured Text
[Figure: documents as unstructured (semi-structured) text and as structured databases (or knowledge bases), linked by entity recognition, relation extraction and event extraction]
At the heart of the Giuliani-led critique of the president’s patriotism is the suggestion
that Barack Obama has never expressed love for the United States. Rudolph W.
Giuliani, the former mayor of New York City, has even challenged the media to find
examples of Mr. Obama expressing such affection.
A review of his public remarks provides multiple examples. In 2008, when he was still
a presidential candidate, Mr. Obama uttered the magic words in Berlin, during a
speech to thousands. Mr. Obama used a similar construction, as president, in 2011,
during a town hall meeting in Illinois, when he recalled “why I love this country so
much.”
Mr. Giuliani told Fox News that “I don’t hear from him what I heard from Harry
Truman, what I heard from Bill Clinton, what I heard from Jimmy Carter, which is these
wonderful words about what a great country we are, what an exceptional country we
are.”
Located-in(Person, Place)
He was in Tennessee
Subsidiary(Organization, Organization)
XYZ, the parent company of ABC
Related-to(Person, Person)
John’s wife Yoko
Founder(Person, Organization)
Steve Jobs, co-founder of Apple...
• For example:
Number of car accidents per vehicle type and number
of casualties in the accidents.
• Supervised Learning
• Features
• Types of two named entities
• Bag of words
• POS of words in between
• Example:
• A rented SUV went out of control on Sunday, causing the death of seven
people in Brooklyn
• Relation: Type = Accident, Vehicle Type = SUV, casualty = 7, weather = ?
• Manually?
• Patterns:
• “[CAR_TYPE] went out of control on [TIMEX], causing the
death of [NUM] people”
• “[PERSON] was born in [GPE]”
• “[PERSON] was graduated from [FAC]”
• “[PERSON] was killed by <X>”
• Matching Techniques
• Exact matching
• Pros and Cons?
• Flexible matching (e.g., [X] was .* killed .* by [Y])
• Pros and Cons?
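The difference between exact and flexible pattern matching can be explored with regular expressions. This is a rough sketch under stated assumptions: real systems would first tag entities such as [PERSON] or [CAR_TYPE], whereas here a single word stands in for each slot, and the "killed by" test sentences are hypothetical. The accident sentence is the example given above.

```python
import re

# Exact pattern for the accident template; each slot is approximated by one word.
ACCIDENT = re.compile(
    r"(?P<car_type>\w+) went out of control on (?P<timex>\w+), "
    r"causing the death of (?P<num>\w+) people"
)

# Exact and flexible versions of "[X] was killed by [Y]".
# Assumption: a capitalized word stands in for a tagged PERSON entity.
KILLED_EXACT = re.compile(r"(?P<x>[A-Z]\w*) was killed by (?P<y>[A-Z]\w*)")
KILLED_FLEXIBLE = re.compile(r"(?P<x>[A-Z]\w*) was\b.*\bkilled\b.*\bby (?P<y>[A-Z]\w*)")


def extract(pattern, sentence):
    """Return the filled slots as a dict, or None when the pattern does not match."""
    match = pattern.search(sentence)
    return match.groupdict() if match else None


accident = extract(
    ACCIDENT,
    "A rented SUV went out of control on Sunday, causing the death of "
    "seven people in Brooklyn",
)
```

Trying the two "killed by" patterns on a sentence such as "Smith was reportedly killed last night by Jones" shows the trade-off: the exact pattern misses it, while the flexible pattern finds it but would also accept "Smith was not killed by Jones".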
1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members
13. Native language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics
16. Passport number and country of issue:
17. Professional positions:
18. Education
19. Party or other organization affiliations:
20. Publications (titles and dates):
AMBIGUITY
• Fred’s appointment as professor vs. Fred’s 3 PM
appointment with the dean
• outbreak of typhoid vs. outbreak of violence
COMPLEX STRUCTURES
• For the Federal Election Commission, Bush picked Justice
Department employee and former Fulton County, Ga.,
Republican chairman Hans von Spakovsky for one of three
openings.
(Grishman, 2012)
• The term "user" can refer to any person who interacts with
an information system to search for and select resources
he/she needs (University of North Texas, 2017).
Actual users:
• those who are using the information service at a given time.
Potential Users:
• those who are not yet served by the information services.
Expected Users:
• those who not only have the privilege of using the information
service, but also have the intention of doing so.
Beneficiary users:
• Users who have derived some benefits from the information service.
(Chowdhury, 2010)
TOPIC
Information Needs (IN)
3
(Chowdhury, 2010)
• This includes determining the scope, depth, and specificity of the information needed.
• For example, a student conducting research for a term paper may have information needs
related to specific aspects of a topic, such as historical background, current research, or
statistical data.
• This involves searching for, accessing, and evaluating information from various
sources, such as books, articles, databases, websites, experts, or other individuals.
• The information sought should be relevant, reliable, and credible to ensure its usefulness in
meeting the identified needs.
Visceral Need: The unconscious need.
Conscious Need: Conscious but undefined need.
Formalised Need: Formally expressed need.
Compromised Need: Expressed need, influenced by internal and external constraints.
• Recall and precision are the most important evaluative criteria for
assessing the performance of an IRS.
• Recall – the extent to which the wanted or relevant items in the
collection are retrieved
• An IR system is expected to retrieve relevant documents in
response to a query.
• However, in a large collection, only a proportion of the total
relevant documents are retrieved.
• The system’s performance is measured by the recall ratio.
$$\text{Recall} = \frac{\text{No. of relevant items retrieved}}{\text{Total number of relevant items in the collection}} \times 100$$
• Recall assumes that all relevant items have the same value,
but the value may be relative and vary from user to user.
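The recall ratio, together with precision (the proportion of retrieved documents that are relevant), can be computed directly from sets of document identifiers. This is a minimal sketch: the identifiers and the two searches are hypothetical, and the function name is illustrative.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) as percentages for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # Relevant items that were retrieved.
    recall = 100 * hits / len(relevant) if relevant else 0.0
    precision = 100 * hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical collection with ten relevant documents (ids 1 to 10).
relevant = set(range(1, 11))

# A broad search: 20 documents retrieved, 8 of them relevant.
broad = set(range(1, 9)) | set(range(101, 113))

# A specific search: 5 documents retrieved, 4 of them relevant.
specific = {1, 2, 3, 4, 201}

broad_result = recall_precision(broad, relevant)
specific_result = recall_precision(specific, relevant)
```

In this toy case the broad search gives higher recall with lower precision, and the specific search gives the reverse, which is the trade-off between term specificity and recall noted earlier.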
query but could not be retrieved] [d = documents that are not relevant to the query]
• Rego, A., Garcia, L., Llopis, M., & Lloret, J. (2016). A New
Z39.50 Protocol Client for Searching in Libraries and
Research Collaboration. Network Protocols and
Algorithms, 8(3), 29.
https://doi.org/10.5296/npa.v8i3.10147
• The Apache Software Foundation: The Free and Open
Productivity Suite. Retrieved from:
http://www.openoffice.org/bibliographic/srw.html.
Accessed on August 7, 2018.
• Ayres, F. H., Cullen, J., Gierl, C., Huggill, J. A. W., Ridley, M. J., &
Torsun, I. S. (1994). QUALCAT: automation of quality control
in cataloguing. BLRD REPORTS, 6068.
• Chen, C.-C. (2001). Global Digital Library Development in the New
Millennium: Fertile Ground for Distributed Cross-Disciplinary
Collaboration. Tsinghua University Press.
• De Silva, S. M. (1997). A review of expert systems in library and
information science. Malaysian Journal of Library &
Information Science, 2(2), 57–92.
• Endres-Niggemeyer, B., & Knorz, G. (1987). AUTOCAT:
knowledge-based descriptive cataloguing of articles
published in scientific journals. In Second International
GI Congress 1987. Knowledge Based Systems (pp. 20–21).
(Mansourian, 2004)
• Then these web IRs mainly look for the searched keywords
in the maintained list or the index file to retrieve the original
document present on the web.
(Mansourian, 2004)
Web IR
• The user interface of web IR systems is much more user-friendly than
that of traditional IR systems, because it is designed with the
growing group of inexperienced web IR users in mind.
(Mansourian, 2004)
(Zhang, 2004)