Infs 422 Combine

INFS 422: INFORMATION STORAGE AND RETRIEVAL
SESSION 1
An Overview of Information Storage and Retrieval
Dr. FLORENCE O. ENTSUA-MENSAH

(fentsua-menah@ug.edu.gh)
2022/2023 Academic Year

Session Overview
• One of the attributes of the information age is information overload,
that is, the availability of vast amounts of information in different formats.
We often go to the library to search for both print and e-resources.
• If the books in the library, for example, are neatly arranged on the
shelves without being organized by cataloguing and classification,
searching for just a single book will be a herculean task.
• Information storage and retrieval, or information retrieval, is about
the deliberate organization of information to make it easily
retrievable to meet the information needs of individuals.
• The advent of computers and technological advances have
changed how information is organized, stored and retrieved in
many ways.
Session Outline
The key topics to be covered in the session are:
• Topic One: Definition, purposes and characteristics of information retrieval
systems
• Topic Two: Components of an information retrieval system, and the information
retrieval process
• Topic Three: Types and everyday uses of an information retrieval system

Learning Outcomes
At the end of the session, the student will be able to:
• Explain information storage and retrieval and how technological
advances have changed it.
• Identify its main purposes, characteristics, and
everyday uses.

Reading List
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.).
New York: Neal-Schuman Publishers, Inc.
• Hiemstra, D. (2000). Information retrieval models. In A. Goker and J. Davies
(Eds.), Searching in the 21st century (pp. 1-19). Chichester, U.K.: Wiley. Retrieved
from http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf

TOPIC 1: Definitions, Purpose and Characteristics of Information Retrieval Systems

What is Information Storage and Retrieval
Information storage and retrieval
• “Information retrieval (IR), also called information storage and retrieval

(ISR or ISAR) or information organization and retrieval, is the art and
science of retrieving from a collection of items a subset that serves the
user’s purpose” (Harns, 2013, para. 1).
• “A branch of computer or library science relating to the storage, locating,

searching, and selecting upon demand relevant data on a given subject”
(NCI Thesaurus).
• The systematic process of collecting and cataloguing data so that they

can be located and displayed on request (The Columbia Electronic
Encyclopedia).

What is Information Storage and Retrieval
• “The technique & process of searching, recovering, & interpreting information from
large amounts of stored data” (Science and Technology Dictionary, cited in Singh,
2016)
• It relates to “the organization of, processing of, and access to information of all
forms and formats” (Chowdhury, 2004, p. 1)
• Information retrieval (IR) is a discipline concerned with the processes by which

queries presented to information systems are matched against a "store" of texts
(the term text may be substituted with still images, sounds, video clips, paintings,
or any other artifact of intellectual activity) (Robbins, 2000, p. 57)

Purposes of an information retrieval system (IRS)
An IRS is designed to fulfil the following purposes:
• To collect and organize documents and information of different formats in different
subject areas and make them available to users upon request
• To make the right information accessible to the right user upon request to meet
his or her information need
• To serve as a bridge between content creators or generators and users of that
content or information
• To retrieve bibliographic items or exact matches of the text of queries from different
information retrieval systems such as full-text databases or multimedia
information.

Characteristics of an effective information retrieval system (IRS)
To fulfil its purposes, an IRS must be flexible and equipped for:
• Prompt information dissemination

• Information filtering (i.e., unwanted information must be excluded)
• Active switching of information (such as switching from web search to email access)
• Receiving information in a desired format
• Browsing or surfing
• Getting information in an economical way (e.g., expensive academic databases
may not be patronized by academic institutions)
• Current literature (or up-to-date information)
• Accessing other information systems
• Interpersonal communication (e.g., live chats and sharing of information via social
media)
• Personalized help (e.g., provision of anticipated help topics to facilitate ease of use),
and must be
• User friendly, i.e., must consider the convenience of the user (Liston and Schoene, as
cited in Chowdhury, 2010, p. 16).

Information Retrieval Process
• Any information storage and retrieval system will have a complex series of
operations before documentary information can be used:
1. The information must be recorded in documents;

2. Each document must be stored with others in some accessible place and its
location known;
3. Characteristic aspects of each document must be identified to form a document
profile, and this must be recorded with others in the same file;
4. The potential user must formulate some query or express some interest in
terms of characteristics recorded about documents;
5. This user profile must be compared with document profiles and the locations of
the matching documents identified;
6. The documents must be located and presented to the user.
(CHAUHAN, 2023)
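
The six operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any real system: the shelf locations, the sample documents, and the names `store`, `make_profile` and `retrieve` are all invented for the example, and a "profile" is taken to be nothing more than the set of words in a text.

```python
# Steps 1-2: recorded documents, each stored at a known location
# (the shelf marks and texts are hypothetical).
store = {
    "Shelf A1": "Cataloguing and classification of library books",
    "Shelf B4": "Indexing methods for online databases",
    "Shelf C2": "History of public libraries in Ghana",
}


def make_profile(text):
    """Step 3: record characteristic aspects (here, simply the lowercased words)."""
    return set(text.lower().split())


# Step 3 (continued): all document profiles are kept together in one file.
profiles = {location: make_profile(text) for location, text in store.items()}


def retrieve(query):
    """Steps 4-6: build the user profile, compare it, and present the matches."""
    user_profile = make_profile(query)                 # step 4: the user's query
    matches = [loc for loc, profile in profiles.items()
               if user_profile <= profile]             # step 5: every query word present
    return [(loc, store[loc]) for loc in matches]      # step 6: locate and present


print(retrieve("library books"))
```

Note that the query "library books" does not match the document about "libraries": exact word matching is the simplest possible comparison, and real systems refine it considerably.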

TOPIC 2: Components of an Information Retrieval System and their Process

Components of an information retrieval system
Lancaster (cited in Chowdhury, 2010) indicates that an information retrieval system

is made up of six main components:
1. The document selection system
2. The indexing subsystems
3. The vocabulary sub-system
4. The searching sub-system
5. The user-system interface (a platform for interaction between the user and the
system)
6. The matching sub-system (the system that matches the user’s queries to the
document representations)

• The Document Selection Subsystem: it involves the location, selection, ordering and
receipt of source materials for the collection. This process emphasises two aspects,
currency of information and completeness of information, and consists of the
following tasks:
➢ Determination of current and probable future requirements of potential
users of the IRS.
➢ Formulation of a policy for acceptance of source material as defined by
subject coverage, publication type, or other criteria.
➢ Comparison of available or incoming source materials with the policy to
determine which shall be included in the IRS.
• Indexing Subsystem and Vocabulary Subsystem: The system used for naming
subjects is called an indexing language. Like any other language, it consists of
two parts: (a) vocabulary and (b) syntax. If we use terms as they appear in
documents without modification, we are using natural language. In this process,
we face problems arising out of:
➢ The choice of words/terms; or
➢ The word order (syntax).
➢ For example, “child psychology” may be expressed as “psychology for children”.
Therefore, a controlled vocabulary is used. Vocabulary control involves the
establishment of relationships among terms, often on an arbitrary basis, but
mostly based on the prediction of those relationships that may facilitate
identification of all source materials that have been indexed.
➢ For example: instead of “Children’s libraries”, we use “libraries and
children”.
➢ A controlled vocabulary is a part of an artificial indexing language. The
notation of a classification scheme is an example of this artificial language.
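
One way to picture vocabulary control is as a lookup table that sends every natural-language variant of a subject to a single preferred index term. The sketch below assumes a tiny hand-made table; the name `PREFERRED_TERM`, the function `normalise` and the entries themselves are illustrative only and are not taken from any published thesaurus.

```python
# Hypothetical mapping from natural-language variants to one preferred index term.
PREFERRED_TERM = {
    "child psychology": "child psychology",
    "psychology for children": "child psychology",
    "psychology of children": "child psychology",
}


def normalise(phrase):
    """Return the controlled-vocabulary term for a phrase, or the cleaned phrase if unknown."""
    key = " ".join(phrase.lower().split())   # ignore case and extra spaces
    return PREFERRED_TERM.get(key, key)


print(normalise("Psychology for Children"))
```

Because the indexer and the searcher both pass their wording through the same table, documents indexed under either phrasing are found by either query.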
• Searching Subsystem: The searching subsystem is one of the major
subsystems of an information retrieval system. In this subsystem,
users’ queries are first received and interpreted by the search
system, then appropriate search statements are formulated, and the actual
search (i.e., matching queries with the surrogates of the information resources
file) is conducted with a view to retrieving the required information.

• User-System Interface: The receiver of information-bearing documents
becomes a source, encoding a message in the form of an inquiry. When we
discover any information in our store that appears to match the inquiry,
we can pass it to the enquirer, who can decide whether it
matches his or her requirements.
• The Matching Subsystem: It matches the document representation
against the request representation; that is, when documents relevant to a query
have been located, a match has been achieved. A search engine acts as a giant
matching device. The matching subsystem has no direct influence on the
effectiveness of the complete system, but it plays a great role in overall
system efficiency.

The Information retrieval process
An IR system supports 3 basic processes:
1. The representation of the contents of the documents
2. The representation of the user’s information need
3. Comparison of the 2 representations
These processes involve the elements shown in Figure 1:
➢ Round boxes represent processes
➢ Square boxes represent data

The basis for IR is that documents or information have been organized and
stored in a way that makes them searchable by users to meet their information needs:
• Documents in an IRS are processed into an index (Indexing process)
• A user formulates his problem/information need in a query (Query formulation)
• The IR software compares/matches the query to the index (Matching process)
• The user is presented with a set of retrieved documents which he judges
for relevance or appropriateness in meeting his need (Feedback)
• The query can be modified if the retrieved documents are irrelevant
(Hiemstra, 2009).

Elements of an IRS
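
The indexing, query formulation, matching and feedback processes can be traced in a small sketch. It assumes an inverted index (a table from each word to the documents containing it), which is one common way to build the index; the sample documents and the names `build_index` and `match` are made up for illustration.

```python
from collections import defaultdict

# Three hypothetical documents, keyed by a document id.
documents = {
    1: "information retrieval systems match queries to documents",
    2: "library catalogues organise books for retrieval",
    3: "search engines index web pages",
}


def build_index(docs):
    """Indexing process: map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def match(index, query):
    """Matching process: return the ids of documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))   # copy, so the index is not modified
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(documents)
first_try = match(index, "retrieval books web")   # query formulation: too narrow
second_try = match(index, "retrieval")            # feedback: the user modifies the query
print(first_try, second_try)
```

The first query retrieves nothing because no single document contains all three words; after judging that result, the user shortens the query and the second attempt retrieves documents 1 and 2.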

TOPIC 3: Types and Uses of Information Retrieval Systems

Types of IR systems
IRS can be categorized as in-house and online

• In-house information retrieval systems
In-house IRS are developed by libraries or information centers to serve their clientele.
For example, a library catalogue for searching information items or checking their
availability.
• Online information retrieval systems
An online information retrieval system is one designed to provide a variety of users
with access to remote database(s): via a computer terminal, the user directly
interrogates a machine-readable database. The main features or characteristics of online
IRS are:
➢ The terminal can be remote.
➢ Time sharing, so several users can be online at one time.
➢ Information is communicated instantaneously.
Online IRS provide users with remote electronic access to online public access catalogues
(OPACs), which provide facilities for library users to carry out online catalogue searches
and then check the availability of the required information source, and to commercial
databases (such as CD-ROM databases) and academic databases.

Types of IR systems
Categorization of IR systems based on purpose, functions and contents

• Online database
➢ Provide access to peer reviewed scholarly information resources
➢ Are subscription or fee-based services
• Digital libraries and web information service
➢ Information is stored in digital formats
➢ Often free and accessed via the web
• Web search engines
➢ Free tools designed to search vast amounts of information resources on the
world wide web
• OPAC
➢ Searching library catalogues online by bibliographic details of documents such
as author name, title of document, keywords, call number, etc.

Everyday uses of IR systems
• IR systems were originally designed to retrieve information from text-based and
bibliographic databases. With the advancements in ICT, IR systems are used in
different aspects of our everyday lives, including:
• Search for information from library OPACs (Chowdhury, 2010)
• Access to e-books & e-journals (World public library at http://worldlibrary.net/ ,
Emerald at www.emeraldinsight.com)
• Access to information from email services and mobile phones
• Search for information on company or institutional intranets
• Access to web information via URLs, search engines, and subject gateways
(which provide links to more academic, reliable information)
• Access to information from social networking sites
• Access to information from bibliographic or full text databases, e.g.,
Web of Science, LISA

Activity
• Access this link, listen to the video for further understanding

of the IR process https://www.youtube.com/watch?v=Y0CZmsel5Rs.
• Read Chowdhury, p. 6-7 to identify the functions of an IR system and write down
your notes

References
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York:
Neal-Schuman Publishers, Inc.
• Columbia Electronic Encyclopedia (2013). Columbia University Press. Licensed from Columbia
University Press.
• Harns, S. (2013). Information retrieval. Retrieved from
https://www.scribd.com/document/142955883/SCOPE-OF-IR
• Hiemstra, D. (2009). Information retrieval models. In A. Goker and J. Davies (Eds.), Searching in the
21st century (pp. 1-19). Chichester, U.K.: Wiley. Retrieved from
http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf
• Information storage and retrieval (n.d.). Retrieved from
https://www.tititudorancea.com/z/information_storage_and_retrieval.htm
• Singh, K. (2013). Function of information retrieval. (n.d.). Retrieved from
https://www.scribd.com/document/312452773/Function-of-Information-Retrieval
• Robbins, D. (2000). Interactive Information Retrieval: Context and Basic Notions. Informing Science,
3(2), 57-61. Retrieved from http://inform.nu/Articles/Vol3/v3n2p57-62.pd

Session 2
Nature of Data and Documents in Information Retrieval

Dr (Mrs) Florence Entsua-Mensah 1

Session Overview
A data retrieval system and an information retrieval system differ in the nature of
the data stored, organized and retrieved. Data in the former is structured, whilst
data in the latter is unstructured.
Again, a document is often explained as a written or printed paper or papers serving
as proof or evidence of something. In information retrieval, a document is simply
any form of data stored in the IR system.
At the end of the session, the student will be able to

• Explain the nature of data in an IR system and how it differs from a database
management system (DBMS)
• Understand the concept of a document in an IR system

Session Outline
• Topic One: The nature of data in an information retrieval system
• Topic Two: The concept of document in an information retrieval system

Reading List
• Greengrass, E. (2000). Information retrieval: a survey. Retrieved from
https://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Korfhage, R. R. (1997). Information storage and retrieval. New York: John Wiley
and Sons Inc.

TOPIC 1: The Nature of Data in an Information Retrieval System

What do Information Retrieval Systems aim to retrieve?
• In the whole operation of information retrieval one can recognize four phases:
1. Word retrieval: identifies the words that will adequately describe the
information.
2. Reference retrieval: identifies references that are probably pertinent to the
enquiry. A reference retrieval system is typified by the library card catalogue or
other indexes, which yield a complete reference to a document in response to a
general search request. Many mechanized retrieval systems provide
reference retrieval only.
3. Document retrieval: retrieves a complete copy of the document instead of just

a citation or reference provided.
4. Data retrieval: the sought information is extracted from the documents.
(CHAUHAN, 2023)

The nature of data in an information system
• The idea of information retrieval assumes that there exist several documents or
records comprising data that have been arranged in a suitable order for easy
retrieval. This means the retrieved information can be represented in different
forms.
• The storehouse contains much bibliographic information, which is quite different
from other kinds of information or data.
• The database of the storehouse includes abstracts of some bibliographic
resources or full texts of documents, such as journal articles, conference
proceedings, newspaper articles, textbooks, encyclopedias, legal documents, and
statistical records, along with audio, graphics, images and video.

• No matter what the database may contain, be it bibliographic resources,
full-text documents or multimedia information, the system assumes that
there exists a target group of users for whom the system is designed and
whose requirements it must fulfil.
• While the primary content being conveyed does not possess a defined
structure, it generally comes packaged in objects (for example, in files,
folders or documents) that themselves have some metadata. Such content is thus
a combination of structured and unstructured data, but it is normally
referred to as "unstructured data".

• An information retrieval system is designed to deal with unstructured data. The

major objective of an information retrieval system is to retrieve the information;
either the actual information or the documents containing the information that fully or
partially match the user’s query.
• Information retrieval deals with unstructured data in response to a query.
Unstructured data is explained as:
➢ Natural language text and also multimedia information such as photographic images,
audio, video, etc.
• Examples of "unstructured data" may include books, journals, documents,

metadata, health records, audio, video, analog data, images, files, and
unstructured text such as the body of an e-mail message, Web page, or word-
processor document.

• Users are considered to have certain queries or information needs, and

when they put forward their requirement to the system, the system should
be able to provide the necessary bibliographic references of those
documents containing the required information; some systems also
retrieve the actual text, image, table or chart relevant to the information
needs of the user.
• However, data in a data retrieval system is structured. Structured data:
➢ “consists of named components, organized according to some well-
defined syntax” or a systematically orderly arrangement (Greengrass,
2000, p. 6)
• Structured data can be handled easily as they can be easily entered,
stored, queried and analyzed.

Structured versus unstructured data
• Structured data (e.g., a DBMS of students or employees):

✓It is information that is already structured in fields, such as “name”,
“age”, “gender”, “hobby”, “address”, “profession”, “salary”.
✓This is the typical example of what we find in a record of a relational
database table.
✓When information is organized in a structured form, it is usually
relatively easy to search it, since one can directly query the database
✓ Data elements are related i.e., records have similar syntax and
meaning
✓Designed to retrieve specific facts or records, based on a common
attribute of the data elements (e.g., student ID or gender, employee age
etc.).
✓It is easy for DBMS to retrieve any record and its contents

Structured versus unstructured data
IR deals with unstructured data:

✓No clearly defined data elements. Random collection of documents will
hardly discuss the same topic
✓Retrieves all types of documents; e.g., abstracts or full text documents
such as newspapers, dictionaries, encyclopedias, handbooks, audio,
video, images etc.
✓Retrieval from unstructured data is difficult and requires specialized
skills and advanced technology.
✓This information either does not have a pre-defined data model or is
not organized in a pre-defined order.
✓Unstructured information is typically text-heavy, but may contain
datasets such as dates, numbers, and facts as well.
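
The contrast can be made concrete with a short sketch. The first half queries structured records by a named field using Python's built-in `sqlite3` module; the second half has only free text to work with and falls back on looking for a word. The table name, the student records and the sample sentences are all invented for the example.

```python
import sqlite3

# Structured data: every record has the same named fields (hypothetical students).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id TEXT, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("S001", "Ama", 21), ("S002", "Kofi", 23), ("S003", "Esi", 21)],
)
# A specific fact is retrieved directly through a common attribute (age).
rows = conn.execute(
    "SELECT name FROM students WHERE age = ? ORDER BY student_id", (21,)
).fetchall()
names = [row[0] for row in rows]

# Unstructured data: free text with no fields to query.
texts = [
    "Ama, who is twenty-one, joined the debating club last year.",
    "The library extended its opening hours during examinations.",
]
# The best a naive system can do is look for a word somewhere in the text.
hits = [t for t in texts if "library" in t.lower()]

print(names, hits)
```

Notice that the first sentence also states Ama's age, but a field-based query cannot see it; finding such facts in free text is what makes retrieval from unstructured data harder.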

TOPIC 2: The Concept of Document in an Information Retrieval System

Definition and examples of a document in an IR system
Definition
• In an IR system, a document is a much broader term than a printed or written
paper or papers. It is defined as:
➢ “a stored data record in any form” (Korfhage, 1997, p. 17).
• The word document as a general term could also include non-textual
information, such as multimedia objects. It includes:
• Books
• Informal writings such as
➢ Letters
➢ Messages
• Parts of a book, such as
➢ A chapter
➢ A section
➢ A paragraph
• Graphics
• Sound/voice recordings
• Images
• Computer programs
• Data files
• Email messages, etc.

Document surrogates
•An IR system stores the full text of documents or surrogates
of the document. Document surrogates are:
• “limited representations of full documents” (Korfhage, 1997, p. 21).
•Document surrogates can be explained as concise displays

representing actual objects using some of their metadata.
•Surrogates can also include elements from the document’s

actual content.

Types of Document surrogates
• Document identifier – a number/code, e.g., an accession or classification number,
for the purpose of inventory control or document location. It does not answer a
query or satisfy an information need.
• Bibliographic data/record – all the data elements used to identify, describe, or
retrieve a document/publication of information content, OR
➢ A collection of data elements organized in a logical way to represent a
bibliographic item or document, publication or any record of human
communication.
➢ Examples: author, title, publication date, publisher, ISBN, etc.
➢ These are useful to the information seeker. E.g., the date shows the timeliness
and appropriateness of the document.
• Keyword – one or a set of individual words chosen by the author/editor, or
sometimes dictated by the database, to represent the contents of the document

Types of Document surrogates
• Abstract – a brief one or two paragraph description of the contents of a paper

often written by the author.
➢ Its purpose is to help a reader to determine whether the entire document
should be retrieved.
• Extract – “Artificially constructed surrogates created by someone other than the
author of a paper” (Korfhage, 1997, p. 23).
➢ May comprise the first sentence of each paragraph or significant words and
phrases in the document.
•Review – a critical article on a book, play, recital etc., written by someone other
than the author.
➢Its purpose is to indicate the value of the document with respect to other works
in the same field.
➢It can be retrieved separately to suit the purposes of a reader.
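
A surrogate can be pictured as a small record that stands in for the full document. The sketch below groups several of the surrogate types into one hypothetical `Surrogate` record and builds an extract from the first sentence of each paragraph, as described above. The field names, the sample values, and the very naive sentence splitting (cutting at the first full stop followed by a space) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Surrogate:
    identifier: str                 # document identifier, e.g. an accession number
    author: str                     # bibliographic data
    title: str
    year: int
    keywords: list = field(default_factory=list)
    abstract: str = ""
    extract: str = ""


def make_extract(full_text):
    """Build an extract from the first sentence of each paragraph."""
    first_sentences = []
    for paragraph in full_text.split("\n\n"):      # paragraphs separated by blank lines
        paragraph = " ".join(paragraph.split())
        if not paragraph:
            continue
        sentence = paragraph.split(". ")[0]        # naive: cut at the first ". "
        if not sentence.endswith("."):
            sentence += "."
        first_sentences.append(sentence)
    return " ".join(first_sentences)


text = ("Indexing assigns terms to documents. It can be manual or automatic.\n\n"
        "Searching matches queries to index terms. Results are then shown.")
record = Surrogate(identifier="ACC-0001", author="A. Author", title="A sample paper",
                   year=2020, keywords=["indexing", "searching"],
                   extract=make_extract(text))
print(record.extract)
```

An IR system can search and display such records quickly and fetch the full document only when the user asks for it.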

Activity
• Check the library and the Internet for all the document surrogates in this lesson,
examine and discuss your discoveries in the chat room on the Sakai course site

References
• Greengrass, E. (2000). Information retrieval: a survey, pp. 5-7. Retrieved from
https://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Korfhage, R. R. (1997). Information storage and retrieval, pp. 17-24. New York: John Wiley
and Sons Inc.
Dr. Evelyn Markwei, DIS

Session 3:
Models of Information Retrieval (Part 1)

Dr. (Mrs.) Florence O. Entsua-Mensah 1

Session Overview
• Information retrieval technology is constantly evolving and developing
in response to increasing volumes of information (Google has indexed over
four billion web pages). These developments bring new IR
challenges, such as handling email spammers. Designing
effective IR systems to match the rapid changes and mitigate these new
challenges requires rigorous research and the application of new theories or
models. These models are very important. For example, they are the
blueprint for determining query formulation and the ranking of retrieved
documents.
At the end of the session, the student will be able to:
• Explain IR models, their purposes and importance in information retrieval
• Identify the two major classes of IR models and their features, and
• Explain the structure, advantages and disadvantages of the Boolean
model

Session Outline

•Topic One: Definition, importance and the two classes of IR
models
•Topic Two: Features of the Boolean Model

Reading List
• Chowdhury, G. G. (2010). Introduction to modern information retrieval
(3rd ed.). New York: Neal-Schuman Publishers, Inc.
• Hiemstra, D. (2000). Information retrieval models. In A. Goker and J.
Davies (Eds.), Searching in the 21st century (pp. 1-19). Chichester, U.K.:
Wiley. Retrieved from
http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf, pp. 3-12.

TOPIC 1: Definition, Importance and the Two Classes of IR Models

What are IR models?
• The IR models can be differentiated by the way they represent the

documents and query statements, how the system matches the query
with the documents in the corpus to find out the related one and how the
system ranks these documents.
• Information retrieval aims to help users find information, usually in
documents of an unstructured nature, that satisfies an information need
from within a large collection, usually stored on the web or on computers.
• The IR system needs to have components for the model representing the
documents and query statements, and the characteristic matching
function which evaluates the relevancy of the documents with respect to
the provided user query.
• A model is an abstract representation of a process or an object to
facilitate making of predictions and drawing of conclusions.

What are IR models?
• IR models describe the human and computer interaction process involved in

information retrieval (Callan, 2003). They are:
➢The theoretical basis for computing the answer to a query. They guide the expression
of queries & the representation of documents in an IR system (Fuhr, 2001).
➢The basis for prediction & explanation of the relevant documents retrieved by a user’s
queries.
• The information retrieval model needs to provide the framework for the system to
work and define the many aspects of the retrieval procedure of the retrieval
engines
• The IR model has to provide a system for how the documents in the collection and
user’s queries are transformed.
• The IR model also needs to ingrain the functionality for how the system identifies
the relevancy of the documents based on the query provided by the user.
• The system in the information retrieval model also needs to incorporate the logic
for ranking the retrieved documents based on the relevancy.
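
The three things an IR model has to supply (a representation of documents and queries, a matching function that judges relevancy, and a ranking logic) can be shown side by side in a toy sketch. The scoring rule used here, counting how many query words appear in a document, is an assumption chosen for brevity; it is not the Boolean, vector or probabilistic model, and the names `represent`, `relevance` and `rank` are hypothetical.

```python
def represent(text):
    """Representation: turn a document or a query into a set of lowercased words."""
    return set(text.lower().split())


def relevance(query_repr, doc_repr):
    """Matching function: number of query words found in the document (a toy measure)."""
    return len(query_repr & doc_repr)


def rank(query, docs):
    """Ranking: list documents with a non-zero score, most relevant first."""
    q = represent(query)
    scored = [(relevance(q, represent(text)), name) for name, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ordered if score > 0]


# Hypothetical collection of three short documents.
docs = {
    "d1": "boolean retrieval model",
    "d2": "vector space retrieval model ranks documents",
    "d3": "library opening hours",
}
print(rank("vector retrieval model", docs))
```

Swapping in a different `represent` or `relevance` function changes the model while the overall framework stays the same, which is why the models studied in this course can be compared on these three points.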

Importance of IR models
1. A master plan for implementing IR systems.
2. Guides academic research & discussion.
3. Facilitates efficient retrieval of relevant documents.
• It guides experimentation necessary for development of IR tools:
➢ Aids avoidance of trial and error in the development of IR technology,
➢ Aids management of information overload, esp. on the Internet, &
➢ Provides solutions to constantly emerging IR challenges & problems,
such as email spammers (Hiemstra, 2000).

The 2 major classes of IR Models
Cognitive/User-centered models
They adopt a holistic approach to IR, that is, they cover information
organization, document representation, query formulation
and their algorithms, as well as the information behaviour of
users. Thus they include:
➢User information needs & query formulation methods
➢Human-computer interactions during the search process
➢The socio-cognitive environment of the search process
➢Mode of information use in satisfying a need, plus
➢Matching the query to information stored in the IR system
(Chowdhury, 2010)

The 2 major classes of IR Models
System-centered models
➢Includes search & retrieval of relevant documents from the IR
system
➢The hardware & software needed for the representation of
documents and their retrieval and the problems associated with
them
➢Computer programs needed for matching queries with stored
documents to produce output (i.e., the retrieved documents)
➢Algorithms (i.e. procedures, rules or formula for solving problems
often in the field of computer science and mathematics) needed for
improved ranking of documents (Robbins, 2000)
➢(This course will focus on system centered models)

Types of System-centered IR models
•There are many system centered models but the three most
important ones are the:
➢Boolean model
➢Vector model
➢ Probabilistic model
•These models are important because they are designed to
manage large collections such as web pages of the Internet
(Inkpen, n.d.)

TOPIC 2: The Boolean Model

The Boolean Model
• The Boolean model is a type of IR model that allows users to logically relate
multiple concepts together to define what information is needed.
• The Boolean model is the first model of information retrieval and
the simplest retrieval model; it retrieves information on
the basis of a query given as a Boolean expression. It is based on a
system of symbolic logic formulated by George Boole. It has three
operators:
1) OR – logical sum (+)
2) AND – logical product (x)
3) NOT – logical difference (-)
• It is described as the exact match model, i.e., documents are
either retrieved or not, but are not ranked.

Explaining the Boolean model
• In the Boolean model, documents are represented by a set of index terms

or keywords.
• Using the Boolean operators, the terms in the query and the concerned
documents can be combined to form a whole new set of documents.
• The Boolean AND of two logical statements x and y means that both x
AND y must be satisfied; it yields a set of documents that is
smaller than or equal to the set satisfying either statement alone.
• The Boolean OR of these same two statements means that at least
one of the statements must be satisfied; it yields a set of
documents that is greater than or equal to the set satisfying either statement alone.
• Any number of logical statements can be combined using the three
Boolean operators.
• Users’ queries are Boolean expressions of keywords connected by
the Boolean operators. For example:

Explaining the Boolean model
• A query term corruption simply defines all documents in the IR system
indexed with the term corruption
• A query corruption AND politics will retrieve a document D1 only if D1
contains both terms. Thus AND means all.
• A query corruption OR politics will retrieve a document D1 if D1
contains either or both terms. OR means any.
• A query corruption NOT politics will retrieve a document D1 only if D1
contains corruption and not politics (Greengrass, 2000)
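
Because each index term defines a set of documents, the three Boolean operators correspond to ordinary set operations: AND is intersection, OR is union and NOT is difference. A minimal sketch using Python sets follows; the four document ids and the contents of `index` are assumed for the example.

```python
# Hypothetical index: term -> ids of the documents indexed with that term.
index = {
    "corruption": {"D1", "D2", "D3"},
    "politics": {"D2", "D3", "D4"},
}

corruption = index["corruption"]
politics = index["politics"]

both = corruption & politics              # corruption AND politics
either = corruption | politics            # corruption OR politics
only_corruption = corruption - politics   # corruption NOT politics

print(sorted(both), sorted(either), sorted(only_corruption))
```

The AND result is never larger than either term's set, and the OR result is never smaller, which is why AND narrows a search and OR broadens it. Every document in a result set is simply "retrieved"; nothing in these operations ranks one above another.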

Boolean searching using Venn diagrams

Advantages of Boolean Operators
•Easy to implement, therefore extensively used in the design

of IR systems; e.g., OPACs, CD-ROM, Online databases,
web search engines, etc.
•Gives expert users a sense of control and transparency over
documents that are retrieved (i.e., it is easy to understand
why a document was retrieved or not).
•Has great expressive power; i.e., offers users multiple ways
to express their queries for desired results.
•Offers multiple techniques for broadening/narrowing a
search (Manning, Raghavan, & Schutze, 2009)

Disadvantages of Boolean Operators
•No ranking of retrieved documents. All retrieved documents
are accorded the same weight/importance, without ranking by
relevance.
•Novice users are often unable to use the right combinations of the
operators in searches involving multiple queries.
•Output is difficult to control, especially when the OR operator
is used (information overload), & users are often not able to
modify the queries to limit the number of retrieved documents
(Manning, Raghavan, & Schutze, 2009).
•This model requires a Boolean query instead of free text.

Disadvantages of Boolean Operators contd.
• Users formulate inefficient Boolean queries because of their knowledge of
the English language. For example, the meaning of AND or OR in natural
language differs from Boolean query usage. Thus the AND operator is
often substituted for the OR operator.
• E.g., someone with a desire for a specific entertainment may state it as:
➢ dinner AND comedy show AND movie instead of
➢ dinner OR comedy show OR movie or better still
➢ dinner AND (comedy show OR movie)
• Users are unfamiliar with the rules of precedence for logical connectives
➢ There are two standards, both of which rely on parentheses to group terms together.
➢ Combinations within parentheses are evaluated first, before being combined with
terms outside the parentheses

Boolean Rules of Precedence
•In a type 1 system, NOT is applied first within the ( ),
then AND, and lastly OR, from left to right. E.g.,
A OR B AND C = A OR (B AND C)
•A type 2 system applies a left-to-right order of
precedence, irrespective of operators. E.g.,
A OR B AND C = (A OR B) AND C (Korfhage, 1997)
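The two precedence standards can give different answers for the same query. A minimal sketch, assuming A, B and C are the sets of documents matching each term (the document IDs and the function names `type1` and `type2` are illustrative only):

```python
def type1(a, b, c):
    """Type 1 system: AND binds tighter than OR, so A OR B AND C = A OR (B AND C)."""
    return a | (b & c)

def type2(a, b, c):
    """Type 2 system: strict left-to-right, so A OR B AND C = (A OR B) AND C."""
    return (a | b) & c

# Made-up sets of document IDs matching the terms A, B and C.
A = {"D1", "D2"}
B = {"D2", "D3"}
C = {"D3", "D4"}

type1_result = type1(A, B, C)
type2_result = type2(A, B, C)

print(sorted(type1_result), sorted(type2_result))
```

Because the same query string retrieves three documents under one standard and one document under the other, writing the parentheses explicitly is the safer habit.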

Activity
•Create your own Boolean search queries using all the 3

operators NOT, OR, and AND.
•Search them on the Internet and discuss the differences in

volume of retrieved documents
•Which operators narrow or broaden the search?

References
• Callan, J. (2003). Retrieval models: Boolean and Vector Space. Retrieved from
https://www.scribd.com/presentation/351797748/03
• Fuhr, N. (2001). Models in Information Retrieval. In M. Agosti, F. Crestani, & G. Pasi (Eds.), Lectures
on Information Retrieval. Heidelberg, Berlin: Springer.
• Greengrass, E. (2000). Information retrieval: A survey. Available at
http://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Inkpen, D. (n.d.). Information retrieval on the Internet. Retrieved from
http://site.uottawa.ca/~diana/csi4107/IR_draft.pdf
• Korfhage, R. R. (1997). Information storage and retrieval. New York: John Wiley and Sons Inc.
• Manning, C. D., Raghavan, P., & Schutze, H. (2009). Introduction to Information Retrieval.
Retrieved from https://nlp.stanford.edu/IRbook/pdf/irbooko
• Robbins, D. (2000). Interactive Information Retrieval: Context and Basic Notions. Informing
Science, 3(2), 57-61. Retrieved from http://inform.nu/Articles/Vol3/v3n2p57-62.pdf

Session 04:
Models of Information Retrieval (Part 2)


Session Overview
This session will discuss the Statistical models, that is, the Vector
Space and the Probabilistic models. They are referred to as statistical
models because they use statistical information to determine relevance
of documents to queries. They are also described as the best match
models because they are able to predict the degree of relevance of a
document to a query, i.e., retrieved documents are ranked with the
most relevant documents listed first.
At the end of the session, the student will be able to
• Identify the characteristics, advantages and limitations of the
statistical models.
• Explain the underlying principles and structure of the Vector Space
and the Probabilistic models.

Session Outline

• Topic One: The characteristics, advantages and limitations
of the statistical models.
• Topic Two: The Vector space model: Principle and structure
• Topic Three: The Probabilistic model: Principle and structure

Reading List
• Chowdhury, G. G. (2010). Introduction to modern information

retrieval (3rd ed. ). New York: Neal-Schuman Publishers, Inc.
• Hiemstra, D. (2000). Information retrieval models. In A Goker
and J. Davies (Eds.), Searching in the 21st century, (pp. 1-19).
Chichester, U.K.: Wiley. Retrieved from
http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf. 3-12
• Information retrieval models (n.d.). Retrieved from
http://aspoerri.comminfo.rutgers.edu/InfoCrystal/Ch_2.html
• Kuyoro, S. O., & Oludele, A. (2012). Information retrieval: An
Overview. International Journal of advanced Research in
Computer Science, 3(5), 175-178. Retrieved from
https://www.academia.edu/21988230/Information_Retrieval_An_Overview
TOPIC 1
The Characteristics, Advantages and Limitations of the Statistical Models

The characteristics of the Statistical models
• The statistical models use statistical information, such as
term frequencies, to determine the relevance of documents to
queries.
• They are associated with free text queries, i.e., queries are
expressed in one or two words in normal human language
rather than a specified query expression or operators such
as the Boolean operators.
• The output or documents retrieved are ranked/ordered by
their degree of relevance with the most relevant to the query
listed first

The characteristics of the Statistical models contd.
•This requires assigning scores/weights
to the query terms and to the representations of
the documents in a collection.
•The Vector Space and Probabilistic models
are two examples of the statistical models
(Information Retrieval, n.d.).

The advantages of the Statistical models
• They solve some of the problems of Boolean retrieval, e.g.,
➢They provide relevance ranking of retrieved documents.
• Users have control over output and are able to set the number of
documents to be displayed per search.
• Easy query formulation: no special skills are required, and users can use
natural language, followed by automatic extraction of keywords from the query.
• Uncertainty about what concepts to use in query formulation can be
accommodated because of the use of natural language.

The disadvantages of the Statistical models
• They lack the expressive power of the Boolean approach.
➢E.g., the NOT operation cannot be
expressed because only positive scores are used
➢ Also, (A AND B) OR (C AND D) cannot be
represented
• They do not support phrase and proximity searches.
• For optimal performance, queries have to contain a large
number of words.

The disadvantages of the Statistical models contd.
• Calculation of relevance scores involves a lot of computing,
which makes it very expensive.
• The ranked list provides users with a limited view of available
information
➢E.g., the retrieved documents do not suggest how to
refine/modify a query where necessary, to retrieve more or less
information
• Users find it difficult to determine the appropriate words used
in the representation of the relevant documents (Information
Retrieval, n.d.).

TOPIC 2
The Vector Space Model: Principle and Structure

The Vector Space Model
Principle
• It is based on Luhn's (1957) similarity criterion, which states:
• “The more two representations agreed in given elements and their
distribution, the higher would be the probability of representing
similar information” (as cited in Hiemstra, 2009, p. 7)
• That is, relevance is determined by the degree of similarities
between the properties of the document, (i.e., the index terms
used to represent the documents) and the query terms.
• In simple terms;
➢The more similar a document vector is to a query vector, the more
likely it is that the document is relevant to that query.
➢ The words used to define the dimensions of the space are orthogonal
or independent.

The Vector Space Model
• The vector space model represents the documents and

queries as vectors in a multidimensional space, whose
dimensions are the terms used to build an index to represent
the documents [Salton 1983]
• The vector space model can assign a high ranking score to a
document that contains only a few of the query terms if these
terms occur infrequently in the collection but frequently in the
document.

Features of the Vector Space Model
Document and query terms are represented as vectors in a
multi-dimensional space and modeled on algebraic rules.
(A vector is a point in a vector space. It has two attributes,
direction and length, i.e., it has a dimension and value)
Features of the Vector Space Model (VSM)
• The dimensions of the vectors are index terms

representing the documents
• The terms can be single words, keywords or phrases
• The attributes or properties of both document and
query terms are computed or weighted using specific
measures and represented as vectors
• Relevance of a document to a query is determined by
the similarities of their vectors (the implication is that
queries are also weighted or have specific values)
Assumptions of the Vector Space Model
• The more similar a document vector is to a query vector, the

more likely it is that the document is relevant to that query.
• The words used to define the dimensions of the space are
orthogonal or independent.
• The similarity assumption is an approximation and realistic
whereas the assumption that words are pairwise
independent doesn't hold true in realistic scenarios.

Computing values or dimensions of vectors
• Assigning appropriate values to vectors is known as term
weighting
➢Term weighting is the process of assigning numerical values to
terms based on their statistical distribution, i.e., frequencies of
occurrence of terms in documents, document collections, or
subset of documents such as relevant documents to a query
(Frakes, n.d.)
• There are several term weighting schemes. The most
common and the best weighting scheme in IR was proposed
by Salton and Yang (1973).
• It is called tf.idf weights, a combination of the term frequency
(tf) and the inverse document frequency (idf)
➢Tf = the number of occurrences of a term in a document
➢Idf = a value inversely related to the document frequency (df)
Computing The term frequency-inverse document frequency
of a term
• It is the product of its tf weight and its idf weight, that is:
$$ w_{t,d} = \log(1 + \mathrm{tf}_{t,d}) \times \log_{10}\left(\frac{N}{\mathrm{df}_t}\right) $$
• $\mathrm{tf}_{t,d}$ is the number of occurrences of a term $t$ in a document $d$
• $\mathrm{df}_t$ is the document frequency, the number of documents in
the collection that contain the term $t$
Computing The term frequency-inverse document frequency
of a term
• N is the total number of documents in the collection
(Hiemstra, Greengrass)
• NB
➢Tf is greater when the term is frequent in a
document.
➢A document with 10 occurrences of a term t
(tf=10) is treated as more relevant than a document with one
occurrence of the term (tf=1)
➢Idf is greater when the term is rare in the
collection, i.e., it occurs in only a few
documents
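The weight formula can be tried out with a few lines of Python. This is a sketch only: the source writes the tf part as "log" without a base, so base 10 is assumed here for both logarithms, and the collection size and frequencies are made-up numbers.

```python
import math

def tf_idf(tf, df, n_docs):
    """w = log10(1 + tf) * log10(N / df). Base 10 for the tf part is an assumption."""
    if tf == 0 or df == 0:
        return 0.0  # a term that is absent contributes no weight
    return math.log10(1 + tf) * math.log10(n_docs / df)

N = 1000  # hypothetical collection size

rare_frequent = tf_idf(tf=9, df=10, n_docs=N)       # rare in collection, frequent in document
rare_once = tf_idf(tf=1, df=10, n_docs=N)           # rare in collection, occurs once
common_frequent = tf_idf(tf=9, df=1000, n_docs=N)   # occurs in every document

print(rare_frequent, rare_once, common_frequent)
```

A term that appears in every document gets a weight of zero however often it occurs, because N / df = 1 and log10(1) = 0; such a term cannot help to tell documents apart.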
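Computing the similarity between query and document vectors
$$ \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}} $$
where $|V|$ is the number of terms (dimensions) in the vocabulary.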
•qi = tf-idf weight of term i in the query
•di = tf-idf weight of term i in the document
•cos(q,d) is the cosine similarity of q and d, that is, the cosine of the
angle between q and d.
•With non-negative tf-idf weights, the cosine similarity of a query and a document ranges from 0 to 1
•A value of 1 means q and d point in the same direction (100% similar); a value of 0 means they share no terms
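A short sketch of the cosine computation, assuming the query and the document have already been turned into equal-length lists of weights (the three example vectors are made up):

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector has no direction, so treat it as no match
    return dot / (norm_q * norm_d)

query = [1.0, 1.0, 0.0]
same_direction = [2.0, 2.0, 0.0]   # same terms, only longer
no_overlap = [0.0, 0.0, 3.0]       # shares no term with the query

sim_same = cosine(query, same_direction)
sim_none = cosine(query, no_overlap)
print(sim_same, sim_none)
```

Dividing by the vector lengths means a long document is not favoured simply for being long: only the direction of the vector matters.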
Steps in ranking of documents in the VSM
1. Calculate the tf.idf term weights of both query and
document
2. Represent query and each document as a weighted tf.idf
vector
3. Compute the cosine similarity between query vector and
each document vector
4. Rank documents according to cosine (query, document) in
decreasing order, so that the most similar document comes first
5. Return the top ten (10) retrieved documents to the user
(Teufel, n.d.).
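The five steps can be put together in one small sketch. Everything here is an assumption for illustration: the three toy documents, whitespace tokenisation, base-10 logarithms, and the function names `build_vector`, `cosine` and `rank`.

```python
import math
from collections import Counter

docs = {  # made-up toy collection
    "d1": "information retrieval systems rank documents",
    "d2": "boolean retrieval uses operators",
    "d3": "cooking recipes for dinner",
}

def build_vector(tokens, vocabulary, df, n_docs):
    """Steps 1-2: one tf.idf weight per vocabulary term."""
    counts = Counter(tokens)
    return [math.log10(1 + counts[t]) * math.log10(n_docs / df[t]) for t in vocabulary]

def cosine(q, d):
    """Step 3: cosine similarity of two weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norms = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norms if norms else 0.0

def rank(query, docs, k=10):
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    vocabulary = sorted({t for tokens in tokenized.values() for t in tokens})
    df = {t: sum(1 for tokens in tokenized.values() if t in tokens) for t in vocabulary}
    q_vec = build_vector(query.lower().split(), vocabulary, df, len(docs))
    scores = {
        doc_id: cosine(q_vec, build_vector(tokens, vocabulary, df, len(docs)))
        for doc_id, tokens in tokenized.items()
    }
    # Step 4: highest cosine first. Step 5: keep only the top k.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

ranking = rank("information retrieval", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Note that d2 is still retrieved even though it contains only one of the two query terms; it is simply ranked lower than d1, which is the "best match" behaviour that the Boolean model cannot offer.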

Ranking of documents in the VSM
In the VSM documents are
ranked according to their
proximity to the query. In this
diagram, document (d1) will be
more relevant to the query (q)
than document (d2) based on
the proximity. In other words,
the smaller the angle between
the query and the document
vector, the more relevant the
document.
TOPIC 3
The Probabilistic Model: Principle and Structure

Setting the stage for the Probabilistic model
• A user approaches an IR system with an information need
• The information need is translated into a query representation
• Similarly, there are documents that have been represented with index
terms
• The system has to match the query with documents that are relevant to
the query
• This matching is done without clear understanding of the information
needs of the user introducing an element of uncertainty whether the
documents retrieved will be relevant to the user’s query
• Probability theory provides an antidote to the uncertainty about the
relevance of a retrieved document to a query by providing the principles for
estimating the relevance of a document in satisfying the information
needs of a user (Manning, Raghavan, & Schutze, 2009).

The Probabilistic Model
• It is based on the Probability Ranking Principle (PRP), which
states that
• “the function of an information retrieval system is to
rank the documents based on their probability of
relevance to the query, given all the evidence available”
[Belkin and Croft 1992]
• Documents are ranked in order of decreasing probability of
usefulness or relevance to a user.
• It is a manual process which involves the calculation of the
probability that a document will be relevant to a user.
Maron and Kuhns’ Probabilistic Retrieval Model
• They advocate the calculation of probability for each

document in the collection, i.e., will a user submitting a query
judge that document relevant?
• This probability for a particular document (Dm) based on a
query term B =
(number of users who consider the document relevant to the query term B)
/ (total number of users who submitted the query B)
• In the absence of statistical evidence, the model assumes

that users submitting a query (B) will judge the document (Dm)
relevant (Chowdhury, 2010).
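The ratio above is simple arithmetic, as the sketch below shows. The user counts are invented, and the fallback value of 1.0 when nobody has yet submitted the query is only one reading of the last bullet, not something the source spells out.

```python
def relevance_probability(judged_relevant, total_submitted):
    """Share of users submitting query term B who judged document Dm relevant."""
    if total_submitted == 0:
        # Assumption: with no statistical evidence, treat the document as relevant.
        return 1.0
    return judged_relevant / total_submitted

# Hypothetical figures: 40 users submitted the query, 30 judged Dm relevant.
p_dm = relevance_probability(judged_relevant=30, total_submitted=40)
p_unseen = relevance_probability(judged_relevant=0, total_submitted=0)
print(p_dm, p_unseen)
```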
Robertson and Sparck Jones Probabilistic Retrieval model
• Robertson and Sparck Jones adopted a different probabilistic
principle proposed by Robertson (1977), as follows:
• If a reference retrieval system’s response to each request is a ranking of
the documents in the collections in order of decreasing probability of
usefulness to the user who submitted the request, where the
probabilities are estimated as accurately as possible on the basis of
whatever data has been made available to the system for this purpose,
then the overall effectiveness of the system to its users will be the best
that is obtainable on the basis of that data (Robertson, cited in Hiemstra,
p. 11).
Robertson and Sparck Jones Probabilistic Retrieval model
• Propositions of the model
• Given a user’s query, there is a set of documents which contains
exactly the relevant documents (called the ideal answer set).
• The purpose of the query is to specify the properties of the
answer set
• These properties are not known but the system makes a guess
and presents an initial set of documents
• The user inspects the top retrieved documents to identify the
relevant ones
• The system uses this information to refine the description of
subsequent ideal answer set
• Repetition of this process eventually improves the description of
the ideal answer set

Robertson and Sparck Jones Probabilistic Retrieval model contd.
•Modelling the ideal answer in

probabilistic terms
•Given a query q and a document dj, the
probabilistic model tries to estimate the
probability that the user will find the
document dj relevant based solely on the
query and document representation.
•The ideal answer set is referred to as R,
and its documents are predicted to be relevant

Robertson and Sparck Jones Probabilistic Retrieval model contd.
• The probabilistic ranking is computed as:
➢sim(dj, q) = P(R | dj) / P(¬R | dj), i.e.,
➢ the ratio of the probability that the document dj
is relevant to the probability that it is not
relevant
➢P(R) stands for the probability that a document
randomly selected from the document collection is
relevant (Inkpen, n.d.)
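The ratio can be illustrated with invented probabilities. In this sketch the values of P(R | dj) are simply assumed; a real system would have to estimate them from the query and document representations. The function name `odds_of_relevance` is hypothetical.

```python
def odds_of_relevance(p_relevant):
    """sim(dj, q) = P(R | dj) / P(not R | dj), where P(not R | dj) = 1 - P(R | dj)."""
    if p_relevant >= 1.0:
        return float("inf")  # certain relevance: the ratio is unbounded
    return p_relevant / (1.0 - p_relevant)

# Hypothetical estimates of P(R | dj) for three documents.
estimates = {"d1": 0.8, "d2": 0.5, "d3": 0.2}

# Rank in order of decreasing odds, as the Probability Ranking Principle requires.
ranked = sorted(estimates, key=lambda d: odds_of_relevance(estimates[d]), reverse=True)
print(ranked)
```

A ratio above 1 means the document is more likely relevant than not; a ratio below 1 means the opposite.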
The Probabilistic model is not as popular as the Boolean and Vector
Space models, although many experiments have shown that it can
yield good results. Currently it is applied in spam filtering (Chowdhury,
2010)

Activity
•To see how the VSM works, create a series of queries

and search on the Internet for documents with similar
terms in your queries. Note the number of documents
with similar terms to the query. Do you agree with the
ranking of the retrieved documents with regard to your
queries? Discuss your findings in the Chatroom on the
Sakai course site.

References
• Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the coin.
Communications of the ACM, 35(12), 29-38. Retrieved from
https://s3.amazonaws.com/academia.edu.documents/30740540/InformationFilteringAndInformatio
nRetrieval.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1515638296&Signature
=91MO6mpxaaWiScOsHVuMefzAVOg%3D&response-content-
disposition=inline%3B%20filename%3DInformation_filtering_and_information_re.pdf
• Frakes, W. B. (n.d.). Introduction to information systems. Retrieved from
http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap01.htm
• Greengrass, E. (2000). Information retrieval: A survey. Available at
http://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Manning, C. D., Raghavan, P., & Schutze, H. (2009). Introduction to Information Retrieval. Retrieved
from https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
• Teufel, S. (n.d.). Term weighting and Vector Space model. Retrieved from
https://www.cl.cam.ac.uk/teaching/1415/InfoRtrv/lecture4.pdf

Session 5 – Subject Analysis and Representation


Session Overview
The major function or purpose of an IR system is to match a user’s query to the
contents of documents for the retrieval of relevant documents. This can be
achieved by analyzing and preparing a surrogate for each document and
organizing it in an orderly manner. The process of preparing document surrogates
and assigning specific descriptors to documents is known as indexing. Indexing
can be done manually by experts or automatically by the use of computer software.
At the end of this session, the student will be able to
• Explain the meaning, purpose, composition and steps of the indexing process.
• Explain the advantages and disadvantages of both manual and automatic
indexing, and
• Understand the parameters for determining the effectiveness of an index

Session Outline
• Topic One: Meaning, purpose, composition and steps in indexing.
• Topic Two: Definition, advantages and disadvantages of manual and automatic

indexing
• Topic Three: Parameters for controlling index effectiveness

Reading List

• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.).
New York: Neal-Schuman Publishers, Inc.

TOPIC 1
Meaning, Purpose, Composition and Steps in Indexing

Meaning of an index
Most IR systems do not include the full text of documents. This is because use of
the raw documents only hinders retrieval, since the only way to access the required
information is by reading linearly. To ensure efficient retrieval of information,
document surrogates are generated and used in place of or together with the full-text
documents (Korfhage, 1997).
• Indexing is the analysis of a given document's content and the representation of the
analysis by appropriate descriptors or keys (Chowdhury, 2010).
• Thus, it is the process of constructing document surrogates by assigning identifiers to
text items.
• An index is the set of selected terms and their locations in an individual document or group
of documents (Korfhage, 1997)

Composition of an index
•Index terms may be single words or longer phrases or both
• An index is constructed using an indexing language or
vocabulary
• An indexing language may be controlled, i.e., use of a
predefined set of index terms, or uncontrolled, i.e., use of any
term based on broad criteria

Purposes of an index
1. Permit location of documents by topic:- Indexes act as selection
guides to material contents
2. Define topic areas:- Indexes serve as a tool for document analysis
3. Predict the relevance of a document to a specified information need
(Korfhage, 1997):- Indexes allow users to familiarize themselves with
a document and decide if they need to explore it further.
4. Decide on the optimum number of subject entries, and thus
economize the bulk and cost of cataloguing and indexing.

TOPIC 2
Definition, Process, Advantages and Disadvantages of Manual and Automatic Indexing

Manual Indexing
• Manual indexing – identification & description of terms in documents by

experts/specialists/trained indexers.
• Manual indexing makes use of an uncontrolled indexing language. This

indicates intellectual efforts being taken by the author to identify and
describe the content of a document.
• Traditional, commonly used manual systems for compiling indexes of
documents make use of cards, such as library catalogue cards, but
nowadays a good computerised Personal Reference System is
preferred.

Steps in manual indexing
1. Analyze the subject content :- It is important to determine exactly what the

given document is about. This is termed as the “aboutness” of the document.
2. Identify keywords:- After collecting information about the “aboutness” of a
document, the indexer needs to represent it in a way suitable for matching
users’ queries.
3. Standardize keywords:- Here, the indexer chooses the appropriate keywords

by extracting words directly from the document or through the guide of
vocabulary control devices.
4. Choose an indexing system:- The indexer decides to choose a post-

coordinate system or a pre-coordinate system.
5. Filing of entries

The two types of indexing systems
Post- coordinate systems
➢ One entry is prepared for each key word selected to represent a given document
➢ All entries are organized in a file
➢ User’s query terms are identified and matched against the file of index terms
for the retrieval of relevant documents (Chowdhury, 2010).

The two types of indexing systems
Pre-coordinate systems
➢ A document is represented by a heading made up of a chain or string of terms
➢ The chain of words represents or defines the full subject content of the
document
➢ The component words are synthesized based on an indexing language
➢ Pre-coordinated indexes arranged alphabetically are called alphabetical
subject indexes or alphabetical subject catalogues
➢Those arranged according to a classification scheme are known as classified
indexes or classified catalogues (Librarianship Studies & Information
Technology, 2017)

Advantages and Disadvantages of Manual Indexing
Advantages
➢ Use of uncontrolled language giving the indexer flexibility in choosing document
descriptors
Disadvantages
• Lack of consistency – indexers may not assign the same index term to a given
document.
• Varying levels of specificity and exhaustivity are attained based on the different
perspectives of indexers
• Use of controlled vocabulary may hinder accuracy; i.e., indexers may not represent the
document accurately, especially where new words are introduced to the documents.
• Indexer-user mismatch – same concept may be represented differently by indexer and
user e.g., meats and poultry
• Pre-coordination: subsets of terms in manual indexing are often represented by a single
term; e.g., gas, oil and coal are represented by fuel. This may hinder recall

Automatic Indexing
• Automatic indexing is defined as “the process of assigning and arranging index
terms for natural language without human intervention” (Tulic 2005, cited in
Obaseki, 2010).
• It is based on algorithms or well defined rules.
• When the assignment of the content identifier is carried out with the aid of modern
computing equipment the operation becomes automatic indexing.
• All automated indexes derive from frequency of occurrences of words within a

document

Advantages of Automatic Indexing
1. It is faster & easier to produce, because most of the processes are performed
by a machine. E.g., Microsoft Word and Adobe FrameMaker can do automatic
indexing.
2. Easily retrievable & modified in case of an error
3. Easily transferred through ICTs (Obaseki, 2010)
4. Consistency in indexing is assured
5. Cost of producing index entries tends to be lower in the long run
6. There is better retrieval effectiveness (Chowdhury, 2010)

Problems of Automatic Indexing
• An automatic index is free from the bias of a human indexer, but may contain bias
introduced into the algorithms by the system programmer.
• Professional indexers (who are often librarians) lack the technological skills
required for automatic indexing
• Indexing process is intellectually demanding, rigorous, therefore some LIS
professionals would rather not do it
• Some librarians cast doubt on the quality of automated indexing. They think the
software cannot match the intellect of the human indexer (Obaseki, 2010)

TOPIC 3
Parameters for Controlling Index Effectiveness

Parameters for Controlling index effectiveness
Index effectiveness is controlled by two parameters:

➢indexing exhaustivity and term specificity
Indexing exhaustivity
• It is the breadth of coverage of index terms i.e.,
• The extent to which all index terms and concepts in a document are covered
➢ Achieved by selecting many keywords to represent all ideas, concepts and

topics discussed in the document
➢ Non-exhaustive indexing system will use few keywords to give a broad

representation of the subject

Parameters for controlling index effectiveness
Term specificity
• It refers to the depth of coverage, i.e.,
• The extent to which all topics/concepts are indexed in detail.

➢An indexing language on the topic birds, that excludes types or categories of
birds is less specific.
➢An indexing language that includes various kinds of birds but does not
include all kinds of birds is described as more specific but less exhaustive
• The more specific the terms, the better the representation of the subject
(Korfhage, 1997; Chowdhury, 2010)

The impact of exhaustivity and specificity on an IR system
Exhaustivity
• If a search term is more exhaustive, recall (i.e., the proportion of relevant
materials in the collection that are retrieved by a system) is higher.
➢ E.g., searching for information about the World Wide Web using related
terms like the “net”, “Internet” and “superhighway” retrieves more information,
most of which is likely to be relevant
• Thus high exhaustivity of indexing ensures high recall
• However, high levels of exhaustivity also decrease the level of precision (i.e., the
proportion of retrieved documents that are relevant), because a large number of
non-relevant materials are retrieved.
➢ The low precision may be ascribed to the less-detailed discussion of some of
the related terms in the retrieved documents
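Precision and recall, as defined above, are easy to compute once the retrieved set and the relevant set are known. A minimal sketch with made-up document IDs, in which four documents in the collection are assumed to be relevant:

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}  # assumed relevant documents in the collection

broad_search = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}  # many terms, large output
narrow_search = {"d1", "d2"}                                      # one specific term

broad = precision_recall(broad_search, relevant)
narrow = precision_recall(narrow_search, relevant)
print(broad, narrow)
```

The broad search finds every relevant document but half of its output is noise; the narrow search returns only relevant documents but misses half of them. This is the trade-off between recall and precision.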

The impact of exhaustivity and specificity on an IR system
Specificity
• The more specific a search term, the higher the precision
➢ E.g., looking for information using a broad term like “sports” may retrieve
a lot of documents, most of which might not discuss the desired topic, but using
a specific term like “soccer” or “tennis” will yield fewer documents, only some of which
may be irrelevant.
• Thus higher levels of term specificity ensure high precision but low recall
• It is not possible for a given IR system to achieve optimal levels of both precision and
recall at the same time. Optimal levels may also increase the cost of the IR system
• Moderate or intermediate levels of term specificity and indexing exhaustivity have
been advocated to make the IR system economical (Chowdhury, 2010)

Activity
• Choose two search terms, a broad search term and a specific term . Search the
terms on the net. Which of them gave high precision and high recall? Discuss
your findings in the Chatroom

References

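• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.).
New York: Neal-Schuman Publishers, Inc.
• Korfhage, R. R. (1997). Information storage and retrieval. New York: John Wiley
and Sons Inc.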
• Librarianship Studies and Information Technology (2017). Retrieved from

https://librarianshipstudies.blogspot.ca/2017/04/pre-coordinate
indexingsystems.html
• Obaseki, T. L. (2010). Automated indexing: The Key to information retrieval in the

21st century. Library Philosophy and Practice. Retrieved from
http://www.webpages.uidaho.edu/~mbolin/obaseki.htm

Session 6 – Index File Organization


Session Overview
An information retrieval system is designed to provide fast and easy

access to the documents stored in it. This is made possible by
organizing the records of each document including index terms or
keywords into fields and subfields, and arranging it alphabetically with
pointers to the actual documents. Such an index file is called an
Inverted index.
At the end of this session, students will be able to:

• Explain an inverted index and its components
• Learn how to construct an inverted index
• Construct a sample inverted index

Session Outline
• Topic One: Meaning, components and features of an

inverted index
• Topic Two: Purposes, importance and weakness of inverted

index
• Topic Three: How to construct an index

Reading List
• Chowdhury, G. G. (2010). Introduction to modern
information retrieval (3rd ed.). New York: Neal-Schuman
Publishers, Inc.
• Manning, C. D., Raghavan, P., & Schutze, H. (2009). A
first take at building an index. Retrieved from
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index1.html#fig:indexstart

TOPIC 1
Meaning, Components and Features of an Inverted Index

Meaning of an inverted index
• It is an index data structure that maps content to its

location within a database file, in a document or in a set of
documents (Moura E.S., & Cristo M.A. 2009).
• It is normally composed of:

➢ a vocabulary that contains all the distinct words found in a
text and
➢ for each word of the vocabulary, a list that contains
statistics about the occurrences in the text.

Meaning of an inverted index
• An indexing system in which the terms point to documents to

which the terms belong.
• An index structure where every key value (term) is

associated with a list of objects identifiers (representing
documents) (IGI Global disseminator of Knowledge, n.d.)
• It is described as inverted because it maps each term to the
documents that contain it, which reverses the natural (forward)
arrangement in which each document lists its own terms.

Components of an inverted index
There are 2 main files in an inverted file system of any text

retrieval system
1. The text file containing the document records
➢ Each document record is made up of fields and subfields
with a specific unit of information.
➢ The units of information include document surrogates or
bibliographic information such as: author’s name, publisher,
title, ISBN, date of publication, etc., and sometimes the
abstract or full text of the document
2. The inverted file containing all the terms and pointers to
the record numbers where the terms occur (Chowdhury,
2010)

Components of an inverted index
• Thus an inverted file has the keyword entry, and a reference list
specifying the actual position where the keyword/term is located in
the database
• Each entry also includes:

➢The number of occurrences of a term in a given record
➢Position information e.g., the field in which the term/phrase occurs
➢The position of term/phrase in a given sentence/paragraph

➢There are also field tags used to denote the fields where the
term/phrase are located (Chowdhury, 2010)

Sample document records and sample index

Sample document records

Doc 1
Author: Cunningham, M
Title: File structure and design
Publisher: Chartwell Bratt
Year 1995
Keywords: File structure, file organization

Doc 2
Author: Tharp, A.
Title: File organization and processing
Publisher: John Wiley
Year 1988
Keywords: File structure, file organization

Doc 3
Author: Ford, N
Title: Expert systems and artificial intelligence
Publisher: Library Association
Year 1991
Keywords: Expert systems, artificial intelligence, Knowledge-based systems

Sample Index

360 1 2  ARTIFICIAL INTELLIGENCE
140 1 1  CHARTWELL BRATT
120 1 1  CUNNINGHAM, M.
360 1 1  EXPERT SYSTEMS
330 1 1  EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE
160 1 2  FILE ORGANIZATION
260 1 2  FILE ORGANIZATION
230 1 1  FILE ORGANIZATION AND PROCESSING
160 1 1  FILE STRUCTURE
260 1 1  FILE STRUCTURE
130 1 1  FILE STRUCTURE AND DESIGN
360 1 3  KNOWLEDGE-BASED SYSTEMS
340 1 1  LIBRARY ASSOCIATION

NB. The index is arranged in alphabetical order and each term may occur in multiple documents
(Adapted from Chowdhury, 2010, p. 128)
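To show how such an index might be produced, here is a minimal sketch that inverts only the Keywords field of the three sample records. The record layout, the `(document number, position)` postings and the function name `build_inverted_index` are assumptions for the example; the field tags and occurrence counts of the sample index are left out.

```python
from collections import defaultdict

# Keywords field of the three sample records, in the order they are listed.
records = {
    1: ["File structure", "File organization"],
    2: ["File structure", "File organization"],
    3: ["Expert systems", "Artificial intelligence", "Knowledge-based systems"],
}

def build_inverted_index(records):
    """Map each keyword to a list of (document number, position in field) postings."""
    index = defaultdict(list)
    for doc_id in sorted(records):
        for position, keyword in enumerate(records[doc_id], start=1):
            index[keyword.upper()].append((doc_id, position))
    return dict(sorted(index.items()))  # alphabetical order, like the sample index

index = build_inverted_index(records)
for term, postings in index.items():
    print(term, postings)
```

Looking up FILE ORGANIZATION now goes straight to documents 1 and 2 without reading any record, which is the whole point of inverting the file.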
Features of an Inverted Index
• Efficient search: Inverted indexes allow for efficient searching of large

volumes of text-based data. By indexing every term in every document,
the index can quickly identify all documents that contain a given search
term or phrase, significantly reducing search time.
• Fast updates: Inverted indexes can be updated quickly and efficiently as

new content is added to the system. This allows for near-real-time
indexing and searching for new content.
• Support for multiple languages: Inverted indexes can support multiple

languages, allowing users to search for content in different languages
using the same system.

Features of an Inverted Index
• Flexibility: Inverted indexes can be customized to suit the needs of different types of
information retrieval systems. For example, they can be configured to handle different
types of queries, such as Boolean queries or proximity queries.
• Compression: Inverted indexes can be compressed to reduce storage requirements.

Various techniques such as delta encoding, gamma encoding, variable byte encoding,
etc. can be used to compress the posting list efficiently.
• Support for stemming and synonym expansion: Inverted indexes can be

configured to support stemming and synonym expansion, which can improve the
accuracy and relevance of search results.
➢ Stemming is the process of reducing words to their base or root form.
➢ Synonym expansion involves mapping different words that have similar meanings to a
common term.
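
The compression feature listed above can be illustrated with a short sketch. It assumes a postings list of ascending document IDs, stores the gaps between neighbouring IDs (delta encoding), and then packs each gap with a textbook variable byte code (7 data bits per byte, with the high bit set on the last byte of each number). The function names and the sample IDs are assumptions for illustration, not part of any particular system.

```python
def to_gaps(doc_ids):
    """Delta-encode a sorted postings list: first ID, then differences."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Rebuild the document IDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vb_encode_number(n):
    """Variable byte code: 7 payload bits per byte, high bit marks the last byte."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128
    return bytes(chunks)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte          # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))  # final byte of a number
            n = 0
    return numbers

postings = [824, 829, 215406]           # hypothetical document IDs
encoded = vb_encode(to_gaps(postings))
print(len(encoded), "bytes")
print(from_gaps(vb_decode(encoded)))
```

Small gaps fit in a single byte, which is why gap encoding and variable byte encoding are usually combined: frequent terms have long postings lists but small gaps.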

TOPIC 3: Purpose, Importance and Weaknesses of an Inverted Index

Purpose of an Inverted index
• The purpose of an inverted index is to enable fast full-text
searches at the expense of increased processing when
adding documents to the database.
• The inverted file may be the database file itself, rather
than its index.
• This is the most common data structure used in document
retrieval systems, and is used extensively by search engines
and the like.

Importance of the Inverted index
• The inverted index is the most popular file structure used in document
retrieval systems such as commercial databases to support full text
search (Moura & Cristo, 2009)
• It is used in search engines, cell phones, book indexes, concordances,
etc.
• Document records are stored in computer memory one after the
other. Without an index, finding a term requires searching through every
document one at a time. The index file provides faster access to the
relevant documents by identifying the terms and jumping to their locations.

Importance of the Inverted index
• Its structure, such as position information and the use of field tags,
permits different search strategies such as:
➢Field specific searches
➢Proximity searches
➢Boolean searches (Chowdhury, 2010)
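
A proximity search depends on the position information mentioned above. The sketch below assumes a tiny made-up collection and a positional index of the form term -> {document ID: [word positions]}; the names `build_positional_index` and `within` are hypothetical. The set intersection inside `within` is the Boolean AND step, and the position check is the proximity step.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}, with positions counted from 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def within(index, term_a, term_b, k):
    """Doc IDs where term_a and term_b occur at most k word positions apart."""
    a_docs = index.get(term_a, {})
    b_docs = index.get(term_b, {})
    hits = []
    # Boolean AND: only documents that contain both terms are candidates.
    for doc_id in sorted(a_docs.keys() & b_docs.keys()):
        if any(abs(pa - pb) <= k
               for pa in a_docs[doc_id] for pb in b_docs[doc_id]):
            hits.append(doc_id)
    return hits

docs = {  # hypothetical mini-collection
    1: "file structure and design",
    2: "file organization and processing",
    3: "design of a file structure",
}
index = build_positional_index(docs)
print(within(index, "file", "structure", 1))
print(within(index, "file", "design", 1))
```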

Weaknesses of Inverted Index
• Large storage overhead and high maintenance costs on

updating, deleting, and inserting.
• Instead of retrieving the data in decreasing order of expected

usefulness, the records are retrieved in the order in which
they occur in the inverted lists.

TOPIC 3: How to Construct an Index

Steps in the construction of an inverted index
• Pre-processing of documents and texts
➢Collect the documents to be indexed. Each document must
have a document ID (doc ID)
➢Tokenize the text – separate it into single words
➢E.g., “George Weah is the new elected President of Liberia”
becomes the tokens:
George | Weah | is | the | new | elected | President | of | Liberia

• Stop words – Omit high frequency or common words, such
as: to, of, the, a etc. (These are words with little value in
helping to select documents that will match a user’s query)
➢A stop list helps to reduce the number of postings that a system
will store
➢Some stop words in songs and some titles are left intact e.g.,
“Let it be” to facilitate search by title and phrase searches
➢However, web search engines generally do not use stop lists

• Normalization – Map words in documents and queries to a common
form so that they match, e.g., U.S.A. to USA, so that a search for
one form of the term will retrieve the other
➢An alternative method is to construct a synonym list, e.g., car and
automobile
➢Removal of diacritics and accents – e.g., cliché and cliche are made
to match
➢Capitalization – all letters are reduced to lower case
➢Equate American and British spellings, e.g., labour to labor
➢Dates – e.g., 2/1/18 (day/month/year) to Jan. 2, 2018
• Keywords can be used as they are or be transformed into their
base form e.g.,
• Nouns in the singular form – shoes becomes shoe
• Verbs in the infinitive form – learned becomes learn

• Stemming – words are reduced to their root. E.g., authorize
and authorization reduce to “authoriz”. Stemming of words:
➢Reduces the size of the index
➢It also broadens the search, i.e., allows retrieval of various
forms of the word
• Linguistic modules – modify tokens; e.g., friends to friend, July
to july etc. The modified tokens are the actual terms used by the indexer
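
The pre-processing steps above can be chained into one small pipeline. The sketch below is illustrative only: the stop list is a tiny sample (with "is" added as an assumption), and the suffix rules are toy rules chosen so that the slide examples work, not a real stemming algorithm such as Porter's. All function names are hypothetical.

```python
import unicodedata

STOP_WORDS = {"to", "of", "the", "a", "is"}   # tiny illustrative stop list
SUFFIXES = ("ation", "ing", "s", "e")          # toy rules, not a real stemmer

def tokenize(text):
    """Split on whitespace and trim surrounding punctuation (keeps U.S.A. whole)."""
    tokens = (tok.strip(",;:!?\"'()") for tok in text.split())
    return [tok for tok in tokens if tok]

def normalize(token):
    """Lower-case, drop full stops (U.S.A. -> usa), strip accents (cliché -> cliche)."""
    token = token.lower().replace(".", "")
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, normalize, remove stop words, then stem."""
    terms = []
    for token in tokenize(text):
        term = normalize(token)
        if term and term not in STOP_WORDS:
            terms.append(stem(term))
    return terms

print(preprocess("George Weah is the new elected President of Liberia"))
print(stem("authorize"), stem("authorization"))
```

The same `preprocess` function would be applied to queries as well as documents, otherwise query terms and index terms would not match.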

Building the index
• Take the normalized tokens for each document, each tagged with
its document ID
• Sort the terms alphabetically and group them by their
document IDs, to produce a dictionary & postings.
• Postings are sorted by document ID, which is the basis for
efficient query processing
• The dictionary stores other statistics such as:
➢Number of documents that contain each term, which is also
the length of each postings list (this facilitates ranked retrieval and
improves efficiency of search engines during query processing)
(Manning, Raghavan, & Schutze, 2009; Inkpen, n.d.)
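
The sort-and-group step can be sketched as follows. The input is assumed to be already normalized terms per document (the three-document collection here is made up), and the names `build_inverted_index` and `intersect` are hypothetical. The dictionary keeps the document frequency next to each postings list, and `intersect` shows why postings sorted by document ID make a Boolean AND query cheap: the two lists are merged in a single pass.

```python
from itertools import groupby

def build_inverted_index(docs):
    """Return {term: (document_frequency, postings)} from {doc_id: [terms]}."""
    # 1. Collect unique (term, doc_id) pairs and sort by term, then doc ID.
    pairs = sorted({(term, doc_id)
                    for doc_id, terms in docs.items() for term in terms})
    # 2. Group the pairs by term to form the dictionary and postings.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index

def intersect(p1, p2):
    """Merge two sorted postings lists (Boolean AND) in one linear pass."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = {  # hypothetical, already pre-processed
    1: ["file", "structure", "design"],
    2: ["file", "organization", "processing"],
    3: ["expert", "systems", "file"],
}
index = build_inverted_index(docs)
for term, (df, postings) in index.items():
    print(term, df, postings)
print(intersect(index["file"][1], index["organization"][1]))
```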

An inverted index
Manning, C. D., Raghavan, P., & Schutze, H. (2009). A first take at building an index. Retrieved from
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html#fig:indexstart
Practice assignment
• Study carefully the inverted index in the previous

slide
• Read the notes below the index
• Use the documents below to practice how to
construct an inverted index
A document collection
Doc 1 new home sales top forecasts
Doc 2 home sales rise in July
Doc 3 increase in home sales in July
Doc 4 July new home sales rise

Activity
•Access this link for more information on inverted index

construction
•https://www.youtube.com/watch?v=pevQ2T9Gm0w

References
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.).
New York: Neal-Schuman Publishers, Inc.
• IGI Global (n.d.). What is an inverted index. Retrieved from
https://www.igi-global.com/dictionary/inverted-index/15654
• Moura, E. S., & Cristo, M. A. (2009). Inverted files. In L. Liu & M. T. Özsu (Eds.),
Encyclopedia of Database Systems. Boston, MA: Springer. Retrieved from
https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_1136
• Manning, C. D., Raghavan, P., & Schutze, H. (2009). A first take at building an
index. Retrieved from
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html#fig:indexstart

INFS 422:INFORMATION STORAGE AND RETRIEVAL
Session 7 – Vocabulary Control


Session Overview
The basic function of an IR system is to meet the information needs
of users. This is done by retrieving documents that have been
indexed with terms that match the query terms used by the users.
One way of ensuring a “perfect” match between query and index
terms is by use of standardized terms for both indexing and
searching. This is known as vocabulary control.

Learning Outcomes
At the end of the session, the student will be able to:
• Explain vocabulary control and its objectives and advantages
• Identify the differences between natural language indexing and
vocabulary control
• Describe the characteristics and features of vocabulary control
tools

Session Outline
• Topic One: Definition, objectives and advantages of

vocabulary control
• Topic Two: Differences between natural language indexing

and vocabulary control
• Topic three: Features and characteristics of vocabulary

control tools

Reading List

• Chowdhury, G. G. (2010). Introduction to modern information
retrieval (3rd ed.). New York: Neal-Schuman Publishers,
Inc.

TOPIC 1: Definition, Objectives and Advantages of Vocabulary Control

What is Vocabulary Control?
• “The term ‘vocabulary control’ refers to a limited set of terms that must
be used to index documents, and to search for these documents, in a
particular system” (Librarianship Studies and Information Technology,
n.d.)
• The systematic selection of preferred terms to represent the subject
matter of documents (Chowdhury, 2010)
• It is “an organized arrangement of words and phrases used to index
content and /or to retrieve content through browsing or searching” (What
are Controlled Vocabularies, n.d.).
(Most databases have controlled vocabularies, e.g., CINAHL, Academic
Search Premier)

Why Vocabulary Control?
Vocabulary control is necessitated by two basic features of
natural language:
1. Two or more words can be used to represent one
concept (synonyms). E.g., salinity/saltiness
2. Two or more words with the same spelling can represent
different concepts (homographs)
➢ Mercury can mean:
➢ Planet
➢ Metal
➢ Automobile
➢ Mythical being (Zeng, 2005)

Objectives of Vocabulary Control
In an IR environment:
• Consistency - Vocabulary control facilitates the consistent
representation of the subject matter of documents by indexers
to avoid dispersion of related materials
• It facilitates comprehensive searching on topics by
systematically linking all related terms together (Lancaster,
cited in Chowdhury, 2010).
• It matches the language of searchers and indexers

Objectives of Vocabulary Control
• Translation – it converts natural language of authors,

indexers, & users into a suitable vocabulary for indexing &
retrieval
• Indication of relationships – It indicates semantic

relationships among terms
• Label and browse – provide consistent word hierarchies that

help users to locate desired information
• Retrieval – it is a searching aid for locating contents of

documents

Advantages of Controlled Vocabulary
• Saves time: a searcher does not need to come up with all
synonyms/alternative spellings of a term
• Provides comprehensiveness of search results, e.g., searching
for the term clothing as a subject heading of a database will also
retrieve raiment, apparel, costume, attire, dress, garment etc.
• It can be used to clarify words with several meanings e.g.,
bark, pool, bolt, season etc., thus aiding precision
• It facilitates the learning of an unfamiliar subject by
providing an authoritative subject list for browsing (Bell, 2012)

TOPIC 2: Differences between Controlled Vocabulary and Natural Language Indexing

Controlled vocabulary
• Terms used to represent subjects and the process of assigning these
terms to particular documents are performed by a person
• Index terms are identified from an authority list such as subject
headings or thesaurus
• A searcher must use the controlled list to formulate a search strategy

Natural language indexing
• Any term appearing in the title, document or text of a document
may be used as an index term. No persons control the use of terms
• No authority list is used. Index terms are obtained from the
document
• No controlled list is used. A searcher can use any desired terms
to formulate a search strategy

Controlled vocabulary
1. Lacks specificity no matter how detailed the system
2. Lack of exhaustivity
3. Not always up to date till new terms are added to thesaurus
4. Words of authors may be misinterpreted
5. The user has to learn an artificial language
6. High input costs

Natural language indexing
1. High specificity and high precision especially for names
of organizations & persons
2. Has exhaustivity which may lead to high recall
3. Up to date. New terms are always available
4. Authors' words are used. No misinterpretation
5. Both indexer and user use natural language words
6. Low input costs

Controlled vocabulary
7. Exchange of materials between databases is difficult because of
incompatibility between standardized terms
8. Loss of precision through over-exhaustivity is avoided
9. Searching is not burdensome to the user because:
➢ Searches are broadened by availability of preferred terms
due to the use of synonyms and near synonyms
➢ Homographs (words with same spellings but different
meanings such as fair) are qualified
➢ Broad, narrow & related terms are displayed
➢ Problematic concepts that may not be present in free text
are expressed

Natural language indexing
7. No language incompatibility. Easy exchange of material
between databases
8. Exhaustivity may lead to loss of precision
9. Searching is burdensome. The user has to make all the
intellectual effort such as use of appropriate words to retrieve
relevant documents
(Chowdhury, 2010, pp. 156-158)

TOPIC 3: Vocabulary Control Tools: Principles and Characteristics

Definition and Types of Vocabulary Control Tools
Definition
• Vocabulary control tools are tools used for controlling the
vocabulary of indexing and retrieval (Chowdhury, 2010). Examples:
• Subject heading List
• Thesaurus
• Alphanumeric Classification Schemes (e.g., LCC, DDC, UDC)
• Ontologies

Subject Heading List
• A subject heading has been defined as a word or group of words
indicating a subject under which all materials dealing with the
same theme are entered in a catalogue or bibliography, or are
arranged in a file.
• A subject heading list is a master list of terms and phrases with
appropriate cross-references and notes used as a source of headings
to represent the subject content of an information resource
(Chowdhury, 2010)
• Subject headings are standardized, officially approved words or
groups of words used to describe the subject content of books or any
information resource.

Subject Heading List
• Subject headings, like access points based on author names

and titles, serve the dual function of location and collocation.
• Subject heading lists are used by library catalogers to aid

them in their choice of appropriate subject headings and to
achieve uniformity.

Features of Subject Headings
• It is alphabetically arranged by term/phrase
• It can be one word, two or more words, a phrase, a city, a country,
a geographic region, or a person
• There is a list of semantically related terms/phrases under each
main term
• They use formal language e.g., ART, ASIA, MEDIEVAL
• They are usually given in plural form e.g., PASTRIES, SHARKS
• Slang, jargon and highly specialized terminology are avoided
• It has subdivisions to describe specific aspects of the subject
➢ Subdivisions are separated from the main heading by a dash (--)
➢ Subdivisions can be topical, geographical, form (type/form of
publication), or chronological
• It is used to produce a pre-coordinated index of a collection
(College of San Mateo Library, n.d.; Chowdhury, 2010)

List of Subject Headings -General Principles
The following general principles guide the indexer in the choice of

subject headings
• Specific and direct entry – This principle requires the
assignment of a document under the most specific subject that
precisely represents its subject content.
• Common usage – A subject must be represented by a commonly
used word.
• Uniformity - The principle of uniform heading states that the term
chosen from a group of synonyms must be used to represent all
documents on a given topic.
• Consistency and current terminology – If a term based on
common usage becomes obsolete, the current terminology must
be used (all records bearing the obsolete term must be changed
to the current term).

List of Subject Headings -General Principles
• Form headings – They look the same as topical subject headings but are used
for the organization, arrangement or classification of literary and artistic
forms of material e.g., drama, essay, poetry, fiction etc.
• Cross references – They are used to direct users from broader and
related topics to the subject headings or terms used to represent a
particular subject, e.g.,
➢ See (or USE) references – direct users to authorized headings
➢ See also references – direct users to related, broader and narrower
terms (RT, BT, NT) to help in searching specific aspects of a subject
➢ General references – direct users to a category or group of headings
instead of individual headings to save space
• Examples of subject heading lists are Library of Congress Subject
Headings (LCSH) and Sears List of Subject Headings (Librarianship Studies
& Information Technology, n.d.).

Library of Congress Subject Headings (LCSH)
• LCSH is a controlled vocabulary tool widely used as a subject

heading list for catalogues and bibliographies
• It was developed in 1898, published in 1914 and called Subject
Headings Used in the Dictionary Catalogs of the Library of
Congress
• The title changed to LCSH in the 8th edition
• It was originally designed as a vocabulary control tool to represent
the collection of the Library of Congress
• It is now widely used throughout the world and has been
translated into other languages

Library of Congress Subject Headings (LCSH)
• It is used for assigning subject headings to both manual and

machine readable catalogues
• As of April 2017, there were 342,107 subject headings and
references
• About 5000 new headings including subdivisions are added
every year (The Library of Congress, n.d.).
• It is now an online-only publication. The 35th edition, printed
in 2013, was the last print edition

Structure of LCSH
• Computer Software [the approved subject heading is in bold face. Synonyms are not bold face]
    (Class no.)
    General works on computer programs along with documentation such as manuals, diagrams,
    and operating instructions
    UF Software, Computer
    RT Computer software industry
       Computers
    SA subdivisions Software and Juvenile Software under subjects for actual software items
    NT Application software
       System software
  Accounting
    (Class no.)
  Law and legislation
    (May subd Geog)
  Catalogs
    UF Computer programs – Catalogs
  Development
    (Class no.)
(Adapted from Chowdhury, 2010, p. 160)

Sears List of Subject Headings
• It is smaller in scope than the LCSH and used for assigning

standardized subject headings to all types of documents in
smaller libraries
• It uses the Dewey Decimal Classification system to ensure
that
• It was first designed in 1923 by Minnie Earl Sears to meet
the demands of small libraries for the following reasons:
➢ Small libraries needed simpler but broader subject
headings for their catalogues
➢ The LCSH list was complicated, too detailed and also
expensive

• The 13th edition (1986) was designed as an online database
with changes suitable for online databases and OPAC
searches, e.g.,
➢Subject headings were changed to the natural forms used by users,
such as “Chemistry, Organic” to “Organic chemistry”
➢There are frequent updates with current terminology and new topics
• Its popularity has been credited to a policy of continuous
adaptation to changing trends in user information-seeking
behaviour

• The 15th edition adopted a thesaurus format with use of thesaural
abbreviations such as BT, NT, RT, USE and SA
• Changes in the 19th ed. (2007):
➢ It has 440 new subject headings
➢ Major additions include Islam, graphic novel, Reality shows,
Suicide bombers, Stem cell research, Body piercing, Biodiversity,
Indigenous peoples etc.
➢ Some headings have been fine-tuned e.g.,
Fictitious character has become Fictional character

• The 22nd edition (2018) is the current one. New headings
include:
➢Business writing
➢Economic indicators
➢Flipped classrooms
➢Systemic risk
➢Massive open online courses
➢Target marketing

Activity
• Access this link and familiarize yourself with the structure of
the Sears List:
https://www.hwwilsoninprint.com/pdf/sears_pgs.pdf
• What are the similarities between the structures of the Sears
List and the LCSH? Discuss your findings on the Sakai Chats

References
• Bell, S. S. (2012). Librarian’s guide to online searching (3rd ed. ). Santa Barbara, California:
Libraries Unlimited.
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed. ). New York:
Neal-Schuman Publishers, Inc.
• Librarianship Studies & Information Technology. (n.d.). Vocabulary control. Retrieved from
https://librarianshipstudies.blogspot.ca/p/librarians-reference-directory.html
• Library of Congress. (n.d.). Retrieved from
https://www.loc.gov/aba/publications/FreeLCSH/freelcsh.html
• Colby, M. (2017). Library of Congress Subject Headings: Online Training: developed by Janis
Young and Daniel Joudrey. Washington, DC: Library of Congress, Program for Cooperative
Cataloging, Cataloger's Learning Workshop, 2017. https://www.loc.gov/catworkshop/lcsh
• What are Controlled Vocabularies? (n.d.). Retrieved from
http://www.getty.edu/research/publications/electronic_publications/intro_controlled_vocab/what.pdf
• Zeng, M. (2005). Why vocabulary control. Retrieved from
http://marciazeng.slis.kent.edu/Z3919/1need.htm

Session 8 – Thesaurus


Session Overview
A thesaurus is a vocabulary control tool comprising a controlled
set of terms from natural language or various fields of study
linked by hierarchical or associative relationships. It appeared
in the 1950s and is useful for the selection of indexing terms and
for searching.

Learning Outcomes
At the end of the session, the student will be able to:
•Explain a thesaurus, its purposes and features.
•Describe thesaural relationships and their features
•Describe the various displays used in a thesaurus

Session Outline
• Topic One: Definition, purpose and features of a thesaurus
• Topic Two: Relationships between terms in a thesaurus
• Topic three: Display of terms in a thesaurus

Reading List

• Chowdhury, G. G. (2010). Introduction to modern information
retrieval (3rd ed.). New York: Neal-Schuman Publishers,
Inc. pp. 162-168
• Zeng, M. (2005). Guidelines for the construction, format, and
management of monolingual controlled vocabularies.
Retrieved from
http://marciazeng.slis.kent.edu/Z3919/53adisplay.htm

TOPIC 1: Definition, Purpose and Features of a Thesaurus

What is a Thesaurus?
• “A book of words or of information about a particular field or set of
concepts; especially : a book of words and their synonyms”
• “A list of subject headings or descriptors usually with a cross-reference
system for use in the organization of a collection of documents for
reference and retrieval” (Merriam-Webster, 2018).
• “A compilation of words and phrases showing synonyms and hierarchical
and other relationships and dependencies” (Chowdhury, 2010, p. 162).
• In indexing and information retrieval, a thesaurus is defined as: “a list
of every important term (single-word or multi-word) in a given domain of
knowledge; and a set of related terms for each term in the list” (New
World Encyclopedia, 2018).

Purposes of a Thesaurus
The main purpose of a thesaurus is to provide standard

vocabulary for information storage and retrieval systems
• It informs searchers about index terms that have been used

in retrieval systems
• It guides indexers and searchers to choose the same terms

in describing or searching for a concept or a particular word

Features of a Thesaurus
• A thesaurus has 3 main features:

➢Vocabulary control
➢Thesaural relationships
➢Thesaural display
Vocabulary control
• This is the list of concepts of the thesaurus,
comprising words and phrases.
• The words are mostly single-word nouns
• Ambiguous words have scope notes to convey meaning in a
given knowledge area

Features of a Thesaurus
• Preferred terms
➢These are valid index terms that can be used for indexing and
searching
➢They are the terms that best represent the concepts in a
thesaurus
• Non-preferred terms
➢ They are the synonyms
➢They are not used for indexing.
➢Appropriate references are linked from non-preferred terms to
preferred terms to guide searchers and indexers

TOPIC 2: Relationships between Terms in a Thesaurus

Thesaural relationships
There are three types/classes of thesaural /term relationships:
• Hierarchical relationships
• Equivalence relationships
• Associative relationships

Hierarchical relationships
• They are used to indicate super-ordinate or broader term and

subordinate or narrower term.
• A Broader Term (BT) is a more general term and a Narrower
Term (NT) is a more specific term, e.g., BT Financial markets,
NT Capital markets.
• There are 3 types of hierarchical relationships:
1) Generic
2) Whole-part and
3) Polyhierarchical relationships

Types of hierarchical relationships
Generic relationship
It identifies a class and its member species. The codes or notations
for the generic relationship in a thesaurus are BTG (Broader term
generic) and NTG (Narrower term generic) e.g.,
• Rats
BTG Rodents
• Rodents
NTG Rats

Whole-part relationship
• It identifies a situation in which the name of the part is an integral
part of the whole (the BT); or
• When one concept is a constituent of another irrespective of the
context such that all the terms can be organized into a logical
hierarchy (Chowdhury, 2010).
• The notation or code for the whole-part relationship is BTP (Broader
term partitive) and NTP (Narrower term partitive)
There are many scenarios including:
1) Systems and organs of the body
CENTRAL NERVOUS SYSTEM
BTP Nervous system
NTP Brain
NTP Spinal cord

Whole-part relationship scenarios
2) Geographic locations
GHANA
BTP Africa
NTP Accra
NTP Osu
3) Disciplines or fields of study
CHEMISTRY
BTP Science
NTP Physical chemistry
NTP Thermodynamics
4) Organizational, social or political structures
ARMIES
NTP Divisions
NTP Battalions
NTP Regiments

Polyhierarchical relationships
It describes a situation where a concept belongs to more than
one category. It occurs in a particular instance, which links
a proper name with a common noun. e.g.,
PRINTING EQUIPMENT
NT Computer printers
COMPUTER PERIPHERAL EQUIPMENT
NT Computer printers
COMPUTER PRINTERS
BT Computer peripheral equipment
BT Printing equipment
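
Hierarchical links are reciprocal: every BT entry implies a matching NT entry in the other direction, and a term with more than one BT is what makes a structure polyhierarchical. The sketch below stores only the BT links (taken from the slide examples, with Accra placed under Ghana as in the geographic example) and derives the NT side. The names `BROADER`, `invert` and `all_narrower` are hypothetical.

```python
from collections import defaultdict

# Broader-term (BT) links; a term may list more than one broader term.
BROADER = {
    "Computer printers": ["Computer peripheral equipment", "Printing equipment"],
    "Rats": ["Rodents"],
    "Capital markets": ["Financial markets"],
    "Ghana": ["Africa"],
    "Accra": ["Ghana"],
}

def invert(broader):
    """Derive the reciprocal NT links so every BT has a matching NT."""
    narrower = defaultdict(list)
    for term, parents in broader.items():
        for parent in parents:
            narrower[parent].append(term)
    return {term: sorted(children) for term, children in narrower.items()}

def all_narrower(term, narrower):
    """Collect every term below `term`, following NT links transitively."""
    found = []
    stack = list(narrower.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(narrower.get(current, []))
    return sorted(found)

NARROWER = invert(BROADER)
print(NARROWER["Printing equipment"])
print(NARROWER["Computer peripheral equipment"])
print(all_narrower("Africa", NARROWER))
```

Following NT links transitively is one way a retrieval system can broaden a search on a general term to include all of its more specific terms.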

Equivalence relationships
• Equivalence relationships describe a situation where the
same concept is expressed by two or more terms
• Thus the relationship between preferred and non-preferred
terms is an equivalence relationship
• Equivalence relationships are denoted by:
• U or USE leading from a non-preferred to a preferred term
• UF or USED FOR from a preferred to a non-preferred term.
Example
➢ Water fowl USE Water bird
➢ Water bird UF Water fowl
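
The USE and UF references mirror each other, so only one direction needs to be stored. The sketch below keeps a table of USE references and derives the UF lists from it. Only the "water fowl" entry comes from the example above; the other entries, the choice of which term is preferred, and the names `USE`, `used_for` and `to_preferred` are assumptions for illustration.

```python
from collections import defaultdict

# USE references: non-preferred term -> preferred term.
USE = {
    "water fowl": "water bird",
    "aquatic birds": "water bird",   # made-up entry
    "saltiness": "salinity",         # preferred term chosen for illustration
}

def used_for(use):
    """Derive the reciprocal UF list for every preferred term."""
    uf = defaultdict(list)
    for non_preferred, preferred in use.items():
        uf[preferred].append(non_preferred)
    return {term: sorted(variants) for term, variants in uf.items()}

def to_preferred(term, use=USE):
    """Map an indexer's or searcher's term to the preferred term."""
    key = term.strip().lower()
    return use.get(key, key)

print(to_preferred("Water Fowl"))
print(used_for(USE)["water bird"])
```

Applying `to_preferred` to both index terms and query terms is what lets an indexer and a searcher who chose different synonyms still meet on the same term.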

Types of equivalent relationships
There are five basic types

• Synonyms – a word or phrase with the same meaning as another
word or phrase in the same language (Oxford Dictionary). E.gs.
➢ Synonyms from two linguistic origins e.g., freedom/ liberty
➢ A popular and a scientific name e.g., salt/ sodium chloride
➢ Trade name and generic name synonym e.g., Vaseline / petroleum jelly
➢ New /favoured terms replacing outdated ones e.g., developing countries/
underdeveloped countries
➢ Terms from different cultures e.g., flats/ apartments
➢ Jargon or slang synonyms e.g., whirlybirds/ helicopters
• Lexical variants – different word forms expressing the same
concept e.g., ground water/ ground-water/ groundwater; online/
on-line

Types of equivalent relationships
• Near-synonyms – Words with different meanings but regarded as
equivalents for the purposes of a controlled vocabulary e.g., sea
water, salt water
• Generic posting – When the name of a class and the names of its
members are treated as quasi-synonyms, e.g.,
➢furniture UF beds; beds USE furniture
➢furniture UF chairs; chairs USE furniture
• Cross-reference to elements of compound terms
➢Cross reference to compound term elements:
coal mining USE coal AND mining
➢Cross reference from compound term elements:
coal UF+ coal mining
mining UF+ coal mining

Associative relationships
• Associative relationships link terms that are neither hierarchical
nor equivalent, yet are conceptually associated to the extent that the
link between them must be made clear in the thesaurus.
• In other words, two terms are conceptually associated on a
number of different bases while satisfying the requirement that one
of the terms should function as a component in any explanation or
definition of the other.
• They are indicated as RT (related term)
• Examples are
➢Cells
➢RT cytology
➢Cytology
➢RT cells

Types of Associative Relationships
Two broad categories

1. Relationships between terms belonging to the
same hierarchies or categories – are referred to
as siblings with overlapping meaning e.g., ships
and boats
2. Relationships between terms belonging to
different hierarchies or categories- involves two
terms such that when one term is used in indexing
the other is implied.

Types of Associative Relationships
Several examples:
• A process and its instrument e.g., illumination and lamp
• Concepts and their properties e.g., poison and toxicity
• Concepts and mechanism/units for measurement e.g.,
temperature and thermometer
• Concepts and their origins e.g., India and Indians
• Action and product e.g., weaving and cloth
• Cause and effect e.g., pathogens and infections
• Raw materials and products e.g., cocoa and chocolate
• Discipline or field of study and the object being studied e.g.,
Botany and plants

TOPIC 3: Display of Terms in a Thesaurus

Display of terms in a thesaurus
• Terms and their relationships in a thesaurus can be
displayed in several different ways, e.g.:
1. Alphabetic displays
2. Flat format displays
3. Graphic displays

Summary of Symbols and abbreviations for use in a thesaurus
• SN: scope note

• DEF: definition
• HN: history note
• USE: shows that the term following USE is the preferred term
• UF: shows that the term following UF is the non-preferred
term
• USE+: the preferred terms following USE+ should be
used together to represent the concept
• UF+: the non-preferred term after the UF+ should be
represented by a combination of preferred terms including
the one preceding the UF+

Symbols and abbreviations for use in a thesaurus
• TT: top term
• BT: broader term
• BTG: broader term (generic)
• BTI: broader term (instantial), showing an instance, e.g.,
capital cities and Accra
• BTP: broader term (partitive)
• NT: narrower term
• NTG: narrower term (generic)
• NTI: narrower term (instantial)
• NTP: narrower term (partitive)
• RT: related term (Chowdhury, 2010)

Features of an alphabetic display
• It is the most basic type of thesaurus/vocabulary display and

easy to organize
• All indexing terms, whether preferred or non-preferred, are
organized in alphabetical sequence
• It contains both terms and entry terms with use references
• One disadvantage from a user’s point of view is that, all the

terms in a hierarchy cannot be seen at a single location
(Chowdhury, 2005)

Example of an alphabetical listing display for the term “Early”
• Early Adolescence
U Early Adolescence
Early Adolescence
UF Early Adolescence; Young Adolescence
• Early Childhood (1966 1980)
U Young Children
Early Childhood Education
Early Detection
U Identification
Early Diagnosis
U Identification
Early Experience
UF Preschool Experience
Source: Thesaurus of ERIC Descriptors

Features of Flat Format display
• It is the most commonly used display format
• It comprises terms arranged in alphabetical order together

with their details and one level of BT or NT hierarchy
• In some systems the terms in the hierarchy are also

assigned a line number that a user can reference to expand
a search (Zeng, 2005)

Example of a flat format display
Whale watching
RT Whales
Whalers (Persons)
RT Whales
Whales
SN: Aquatic mammals of the order Cetacea
UF Cetaceans (NPT)
BT Marine mammals
NT Baleen whales
NT Fossil whales
NT Toothed whales
RT Whale oil
RT Whale watching
RT Whalers
RT Whaling
Source: Thomson Gale Master Thesaurus

Features of graphic display
• Graphic displays show the terms and their relationships in the form
of two dimensional figures
• They communicate relationships among concepts more effectively
than the linear forms of display
• They are more effective in an interactive computer environment
where terms can be connected to their details via hyperlinks, e.g.,
https://www.freethesaurus.com/for+certain
• Some commercial products are available for use to generate
concept maps using terms in a controlled vocabulary (Chowdhury,
2010; Zeng, 2005)
• Concept maps in a printed thesaurus are static but real time
graphic displays can be generated in an electronic system

Features of a graphic display
• Disadvantages of graphic display
➢It does not show equivalent terms or scope notes
➢It does not distinguish between associative and hierarchical

relationships
➢Graphic displays in print versions are bulky and difficult to

navigate (Chowdhury, 2010)

Examples of graphic display

Graphic display of synonyms for certain

Activity
• Check the Internet and familiarize yourself with the structure

of a natural language thesaurus and a thesaurus of any field
of study. Note the differences and similarities

References
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York:
Neal-Schuman Publishers, Inc. p. 162
• Colby, M. (2017). Library of Congress Subject Headings: Online Training: developed by Janis Young and
Daniel Joudrey. Washington, DC: Library of Congress, Program for Cooperative Cataloging,
Cataloger's Learning Workshop, 2017. https://www.loc.gov/catworkshop/lcsh
• Synonym (n.d.). Oxford Dictionary. Retrieved from
https://ca.search.yahoo.com/search?fr=mcafee&type=C211CA662D20170127&p=define+synonym
• Synonym for certain (n.d.). Retrieved from https://www.freethesaurus.com/for+certain
• Thesaurus (2018). Merriam-webster.com. Retrieved from
https://www.merriam-webster.com/dictionary/thesaurus
• Thesaurus. (2018). New World Encyclopedia. Retrieved April 11, 2018 from
http://www.newworldencyclopedia.org/p/index.php?title=Thesaurus&oldid=986455
• Thesaurus of ERIC descriptors (n.d.). Retrieved from
http://marciazeng.slis.kent.edu/Z3919/53adisplay.htm
• Thomson Gale Master Thesaurus (n.d.). Retrieved from
http://marciazeng.slis.kent.edu/Z3919/53adisplay.htm
• Zeng, M. (2005). Static concept map. Retrieved from
http://marciazeng.slis.kent.edu/Z3919/56displaygraph.htm

INFS 422: INFORMATION STORAGE & RETRIEVAL
Session 9 – Information Extraction


Session Overview
• In this session, learners will be introduced to the concept of information extraction (IE).
• IE techniques extract structured data from unstructured text.
• This session presents techniques for extracting limited kinds of semantic content from text.
• This process of information extraction (IE) turns the unstructured information embedded in texts into structured data.

Session Outline
• Topic One: Understanding Information Extraction

• Topic Two: Basic Tasks of Information Extraction
• Topic Three: Approaches to Information Extraction
• Topic Four: Challenges to Information Extraction

Recommended Reading
Izquierdo, R. (2015). Information Extraction. Available at: https://www.slideshare.net/rubenizquierdobevia/information-extraction-45392844
Jurafsky, D., Martin, J.H., 2017. Information Extraction. In: Speech and Language Processing. pp. 397–430.

TOPIC
What is Information Extraction?
1

Information Extraction defined
• Information extraction (IE) involves the extraction of

structured information from unstructured and/or semi-
structured machine-readable documents.
• Information extraction is the process of extracting specific

(pre-specified) information from textual sources.
• A simple everyday example is when your email client extracts only the event details (such as the date) from a message so that you can add them to your calendar.
• It is largely concerned with the processing of human language texts by means of natural language processing (NLP).
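The email-to-calendar example can be sketched in a few lines of Python. This is only an illustration, not how any particular email client works: the message text, the pattern and the field names are all assumptions made for the example.

```python
import re

# Hypothetical message text, invented for this illustration.
MESSAGE = "Hi all, the project review is on 14 March 2023 at 10:30 in Room 5."

# Assumed pattern: day, month name, four-digit year, then "at HH:MM".
EVENT_PATTERN = re.compile(
    r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{4}) "
    r"at (?P<time>\d{1,2}:\d{2})"
)

def extract_event(text):
    """Return a structured record for the first date/time found, or None."""
    match = EVENT_PATTERN.search(text)
    if match is None:
        return None
    return match.groupdict()

record = extract_event(MESSAGE)
print(record)
```

The unstructured sentence goes in and a small structured record (day, month, year, time) comes out, which is the essence of IE.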

Information Extraction

What is Information Extraction?
[Figure: information extraction takes unstructured (or semi-structured) documents as input and produces structured databases (or knowledge bases). Panel title: Structured vs Semi-structured Text. Source: Izquierdo, 2015]

Why is it useful?
Information extraction enables:
• The automation of tasks such as smart content classification, integrated
search, management and delivery
• Data-driven activities such as mining for patterns and trends, uncovering
hidden relationships, etc.
• The provision of clear factual information, which helps in answering questions and in analytics.
• The organization and presentation of information, for example in Wikipedia infoboxes.
• The derivation of new knowledge via inference.

• Works-for(x, y) AND located-in(y, z) → lives-in(x, z)
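The inference rule above can be applied mechanically once the facts are stored in structured form. The sketch below assumes the extracted facts are held as sets of pairs; the people, organizations and places are invented for illustration.

```python
def infer_lives_in(works_for, located_in):
    """Apply works-for(x, y) AND located-in(y, z) -> lives-in(x, z)."""
    lives_in = set()
    for person, employer in works_for:
        for organization, place in located_in:
            if employer == organization:
                lives_in.add((person, place))
    return lives_in

# Hypothetical extracted facts.
works_for = {("Ama", "Acme Ltd"), ("Kofi", "Beta Inc")}
located_in = {("Acme Ltd", "Accra"), ("Beta Inc", "Kumasi")}

print(sorted(infer_lives_in(works_for, located_in)))
```

Note that the rule is only a plausible default: a person may work for an organization without living where it is located, so inferred facts are best treated as tentative.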

Main Goals of IE
• Fill a predefined “template” from raw text.
• Extract who did what to whom and when?

• Event extraction
• Organize information so that it is useful to people
• Put information in a form that allows further inferences by

computers.
• Big data analytics and data mining

TOPIC
Basic tasks of Information extraction
2

Information Extraction (IE) - Task
• Generally, the idea is to ‘extract’ or tag particular types of

information from arbitrary text or transcribed speech.
• The broad activities are:
  – Entity recognition
  – Relation extraction
  – Event extraction

Named Entity Recognition
At the heart of the Giuliani-led critique of the president’s patriotism is the suggestion
that Barack Obama has never expressed love for the United States. Rudolph W.
Giuliani, the former mayor of New York City, has even challenged the media to find
examples of Mr. Obama expressing such affection.
Has the president done so? Yes, he has.
A review of his public remarks provides multiple examples. In 2008, when he was still
a presidential candidate, Mr. Obama uttered the magic words in Berlin, during a
speech to thousands. Mr. Obama used a similar construction, as president, in 2011,
during a town hall meeting in Illinois, when he recalled “why I love this country so
much.”
Mr. Giuliani told Fox News that “I don’t hear from him what I heard from Harry
Truman, what I heard from Bill Clinton, what I heard from Jimmy Carter, which is these
wonderful words about what a great country we are, what an exceptional country we
are.”

Relation Extraction
Located-in(Person, Place)
He was in Tennessee
Subsidiary(Organization, Organization)
XYZ, the parent company of ABC
Related-to(Person, Person)
John’s wife Yoko
Founder(Person, Organization)
Steve Jobs, co-founder of Apple...

Event Extraction

Named Entity Tagger
• Identify the types and boundaries of named entities
• For example:
• Alexander Mackenzie , (January 28, 1822 ‐ April 17, 1892), a building contractor and writer, was the second Prime Minister of Canada from ….
-> <PERSON>Alexander Mackenzie</PERSON> , (<TIMEX>January 28, 1822</TIMEX> ‐ <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
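A toy tagger that produces this kind of output can be sketched with a name list and one date pattern. This is a deliberately naive, rule-based illustration under stated assumptions (a two-entry gazetteer and a single "Month D, YYYY" date shape); real taggers are usually trained statistically.

```python
import re

# Tiny hand-made name lists (gazetteers); real systems use far larger resources.
GAZETTEER = {"PERSON": ["Alexander Mackenzie"], "GPE": ["Canada"]}

# Assumed date shape: "Month D, YYYY".
DATE_PATTERN = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")

def tag_entities(text):
    """Wrap gazetteer names and date-like strings in entity tags."""
    for label, names in GAZETTEER.items():
        for name in names:
            text = text.replace(name, f"<{label}>{name}</{label}>")
    return DATE_PATTERN.sub(lambda m: f"<TIMEX>{m.group(0)}</TIMEX>", text)

sentence = ("Alexander Mackenzie, (January 28, 1822 - April 17, 1892), "
            "was the second Prime Minister of Canada.")
print(tag_entities(sentence))
```

The tagger marks both the boundaries (where the tag opens and closes) and the type (the tag name) of each entity, but it will miss any name that is not in its list.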

IE for Template Filling Relation Detection
Given a set of documents and a domain of interest, fill a table

of required fields.
• For example:
Number of car accidents per vehicle type and number
of casualties in the accidents.

IE for Question Answering
Q: When was Gandhi born?

A: October 2, 1869
Q: Where was Bill Clinton educated?

A: Georgetown University in Washington, D.C.
Q: What was the education of Yassir Arafat?

A: Civil Engineering
Q: What is the religion of Noam Chomsky?

A: Jewish

TOPIC
Approaches to Information Extraction
3

Approaches
• Statistical sequence labeling
• Supervised Learning
• Semi-supervised Learning and bootstrapping

Approach for NER
• <PERSON>Alexander Mackenzie</PERSON> , (<TIMEX>January 28, 1822</TIMEX> ‐ <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
• Statistical sequence labeling techniques can be used –

similar to POS tagging
• Word-by-word sequence labeling
• Example of features
• POS tags
• Syntactic constituents
• Shape features
• Presence in a named entity list
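To make word-by-word labeling concrete, the sketch below computes two of the listed feature types (shape and presence in a named entity list) for each token. The BIO label scheme (B = beginning, I = inside, O = outside an entity) is a common convention assumed here, not something the slides specify; POS tags and syntactic constituents are omitted because they need an external tagger or parser.

```python
import re

PERSON_NAMES = {"alexander", "mackenzie"}  # toy named entity list (assumption)

def word_shape(token):
    """Map upper-case letters to X, lower-case to x and digits to d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def token_features(tokens, i):
    """Features for token i; a real system would add POS tags and syntax."""
    return {
        "word": tokens[i].lower(),
        "shape": word_shape(tokens[i]),
        "in_person_list": tokens[i].lower() in PERSON_NAMES,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
    }

tokens = ["Alexander", "Mackenzie", "was", "born", "in", "1822"]
labels = ["B-PER", "I-PER", "O", "O", "O", "B-TIMEX"]  # one label per word
for i, label in enumerate(labels):
    print(label, token_features(tokens, i))
```

A sequence labeler is then trained on many such (features, label) pairs so that it can predict a label for every word of unseen text.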

Supervised Approach for relation detection
• Given a corpus of annotated relations between entities, train two
classifiers:
• A binary classifier
• Given a span of text and two entities -> decide if there is a relationship between these
two entities
• Features
• Types of two named entities
• Bag of words
• POS of words in between
• Example:
• A rented SUV went out of control on Sunday, causing the death of seven
people in Brooklyn
• Relation: Type = Accident, Vehicle Type = SUV, casualty = 7, weather = ?
• Pros and Cons?
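The features listed above can be computed for a candidate entity pair as in the sketch below. It assumes the entities have already been found and are given as (start, end, type) token spans; POS features are left out because they need a tagger. The resulting dictionary would be passed to whatever classifier is trained on the annotated corpus.

```python
def relation_features(tokens, entity1, entity2):
    """Features for a candidate pair; each entity is (start, end, type)."""
    start1, end1, type1 = entity1
    start2, end2, type2 = entity2
    between = tokens[end1:start2]  # words lying between the two entities
    return {
        "type_pair": f"{type1}-{type2}",
        "words_between": sorted({word.lower() for word in between}),
        "distance": len(between),
    }

tokens = ["Steve", "Jobs", ",", "co-founder", "of", "Apple"]
person = (0, 2, "PERSON")
organization = (5, 6, "ORGANIZATION")
features = relation_features(tokens, person, organization)
print(features)
```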

Pattern Matching
• How can we come up with these patterns?
• Manually?
• Task and domain-specific
• Tedious, time consuming, not scalable

Pattern Matching for Relation Detection
• Patterns:
• “[CAR_TYPE] went out of control on [TIMEX], causing the
death of [NUM] people”
• “[PERSON] was born in [GPE]”
• “[PERSON] was graduated from [FAC]”
• “[PERSON] was killed by [X]”
• Matching Techniques
• Exact matching
• Pros and Cons?
• Flexible matching (e.g., [X] was .* killed .* by [Y])
• Pros and Cons?
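The difference between exact and flexible matching can be tried out with regular expressions. In this sketch a capitalized word stands in for the [PERSON] and [X] slots, which is a simplifying assumption (a real system would use the output of a named entity tagger), and the sentences are invented.

```python
import re

# Exact pattern: the words between the slots must appear literally.
EXACT = re.compile(r"(?P<victim>[A-Z]\w+) was killed by (?P<agent>[A-Z]\w+)")

# Flexible pattern: optional extra words are allowed around "killed".
FLEXIBLE = re.compile(
    r"(?P<victim>[A-Z]\w+) was (?:.* )?killed (?:.* )?by (?P<agent>[A-Z]\w+)"
)

def match_relation(pattern, sentence):
    """Return (victim, agent) if the pattern matches, otherwise None."""
    match = pattern.search(sentence)
    return (match.group("victim"), match.group("agent")) if match else None

plain = "Caesar was killed by Brutus"
wordy = "Caesar was reportedly killed in Rome by Brutus"

print(match_relation(EXACT, plain), match_relation(EXACT, wordy))
print(match_relation(FLEXIBLE, plain), match_relation(FLEXIBLE, wordy))
```

Exact matching is precise but misses the reworded sentence; flexible matching finds both, at the risk of matching text where the intervening words change the meaning.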

Semi-supervised approach-AutoSlog-TS (Riloff 1996)
• MUC-4 task: extract information about terrorist events in

Latin America
• Two corpora:
• Domain-dependent corpus that contains relevant information
• A set of irrelevant documents
• Algorithm:
1. Using heuristics, all patterns are extracted from both corpora. For
example:
• Rule: <Subj> passive-verb
• <Subj> was murdered
• <Subj> was called
2. Pattern Ranking: The output patterns are then ranked by the
frequency of their occurrences in corpus1/corpus2
3. Filter out the patterns by hand
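The ranking step can be illustrated with a simplified score: the share of a pattern's occurrences that come from the relevant corpus. This is a sketch of the idea only; the scoring function in the published system differs in detail, and the pattern counts below are invented.

```python
from collections import Counter

def rank_patterns(relevant_patterns, irrelevant_patterns):
    """Rank patterns by the share of their occurrences found in relevant texts."""
    relevant_counts = Counter(relevant_patterns)
    irrelevant_counts = Counter(irrelevant_patterns)
    scores = {}
    for pattern in set(relevant_counts) | set(irrelevant_counts):
        total = relevant_counts[pattern] + irrelevant_counts[pattern]
        scores[pattern] = relevant_counts[pattern] / total
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical pattern occurrences extracted from the two corpora.
relevant = ["<Subj> was murdered"] * 8 + ["<Subj> was called"] * 2
irrelevant = ["<Subj> was murdered"] * 2 + ["<Subj> was called"] * 8

ranked = rank_patterns(relevant, irrelevant)
for pattern, score in ranked:
    print(f"{score:.2f}  {pattern}")
```

Patterns such as "<Subj> was murdered" rise to the top because they occur mostly in the domain corpus, which leaves a much shorter list for the final manual filtering step.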

Bootstrapping

Task 12: (DARPA – GALE year2) Produce a biography of [person]
1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics
16. Passport number and country of issue:
17. Professional positions:
18. Education
19. Party or other organization affiliations:
20. Publications (titles and dates):

Biography – two approaches
• To obtain high precision, we handle each slot independently

using bootstrapping to learn IE patterns.
• To improve the recall, we utilize a biographical sentence

classifier

TOPIC
Challenges to Information Extraction
4

IE Challenges
AMBIGUITY
• Fred’s appointment as professor vs. Fred’s 3 PM
appointment with the dean
• outbreak of typhoid vs. outbreak of violence
COMPLEX STRUCTURES
• For the Federal Election Commission, Bush picked Justice
Department employee and former Fulton County, Ga.,
Republican chairman Hans von Spakovsky for one of three
openings.
(Grishman, 2012)

IE Challenges Cont’d.
• LOTS OF DIFFERENT PATTERNS
• different words:
named, appointed, selected, chosen, promoted, …
• Different constructions:
IBM named Fred president
IBM announced the appointment of Fred as
president
Fred, who was named president by IBM
• Different names:
George H. W. Bush, former President Bush, 41
IE Challenges Cont’d.
REFERENCE
• George Garrick has served as president of Sony USA for 13
years. The company announced his retirement effective next
May.
• IBM announced several new appointments yesterday. Fred
Smith was named head of research.
(Grishman, 2012)

Summary
• In this session, we learned that information extraction is broadly defined as establishing relationships among a set of entities.
• The aim therefore is to identify relations between entities, which are used primarily to construct knowledge bases.

Activity 9.1
1. Distinguish between structured and unstructured

data.
2. Why is it important to transform unstructured data into structured data?
3. In your own words, define information extraction;

and discuss any one of the approaches to
information extraction known to you.

Activity 9.2
• Collect a list of names of works of art from a particular

category from a Web-based source (e.g., gutenberg.org,
amazon.com, imdb.com, etc.).
• Analyze your list and give examples of ways that the names
in it are likely to be problematic for the techniques described
in this session.

References
Izquierdo, R. (2015). Information Extraction. Available at: https://www.slideshare.net/rubenizquierdobevia/information-extraction-45392844
Grishman, R. (2012). Information Extraction: Capabilities and Challenges. International Winter School in Language and Speech Technologies. Tarragona, Spain: International Winter School in Language and Speech Technologies. https://doi.org/10.1561/1500000003

INFS 422: INFORMATION STORAGE & RETRIEVAL


Session Overview
The user of an information retrieval system (IRS) is central to the success of the IRS. This session discusses the information user. It provides insight into the nature of information users, their characteristics and types. It also discusses the nature of their information needs.

Session Outline
• Topic One: Users and their nature
• Topic Two: Characteristics of users
• Topic Three: Information Needs

Recommended Reading Text
Chowdhury, G. G. (2010). Introduction to modern

information retrieval. Facet publishing.
Singh, S., 2015. Users and information use in academic

libraries. Acad. Libr. Syst.

Introduction
• The user is the focal point of all information retrieval systems, because the sole objective of any information storage and retrieval (ISR) system is to transfer information from the source to the user (Chowdhury, 2010).
• The characteristics and the specific needs of users determine the nature of the information to be collected by the IRS, and the nature of the user interface to design.
• Hence, an understanding of the nature and number of users, their activities in relation to their information requirements, and so forth is crucial to the design and development of an appropriate ISR system.

TOPIC
Users and their Nature
1

Who is a User? - 1
• The concept of the ‘user’ is by no means clear (i.e. it is ambiguous) (Chowdhury, 2010).
• A user is an individual who makes use of information in any way to complete a task (IGI Global, 2018).
• According to Kenneth Whittaker, a user may be defined as a person who uses one or more of a library’s services at least once a year (Singh, 2015).
• A user can also be described as a person who needs information which can be provided by specific library services, or someone who is known to have the intention of using certain information services from the library (Singh, 2015).

Who is a User? - 2
• The term "user" can refer to any person who interacts with
an information system to search for and select resources
he/she needs (University of North Texas, 2017).
• Alternative names for the information user include end users, patrons, clients, searchers, consumers, readers, etc.
• The term user often refers to a person visiting a library or information centre, be it physical or virtual.
• The type of information user is in fact dependent on the nature of the information.

TOPIC
Characteristics of Users
2

User Characteristics
• Users may be limited by:
  – The nature of their work/profession
  – Sex
  – Age
  – Other forms of social groups

Characteristics of users (Cont’d.)
• Several criteria may be used to identify and categorise users.
• For example, within the context of an organisation, user categories may be identified as:
• Actual Users
• Potential Users
• Expected Users
• Beneficiary Users

Characteristics of users (Cont’d.)
Actual users:
• those who are using the information service at a given time.
Potential Users:
• those who are not yet served by the information services.
Expected Users:
• those who not only have the privilege of using the information
service, but also have the intention of doing so.
Beneficiary users:
• Users who have derived some benefits from the information service.
(Chowdhury, 2010)
TOPIC
Information Needs (IN)
3

What is an Information Need?
• Arises when an individual recognises that their current state of knowledge

is insufficient to cope with the task in hand, or in order to resolve conflicts
in a subject area, or to fill a void in some area of knowledge.
• Information needs refer to the specific requirements or desires that individuals or organizations have for obtaining information to fulfill a particular purpose or achieve a specific goal.
• It is the recognition that there is a gap in knowledge or understanding that

needs to be filled through the acquisition of relevant information.
(Chowdhury, 2010)

What is an Information Need?
• Identifying information needs involves understanding what information is required to

address a particular situation or objective.
• This includes determining the scope, depth, and specificity of the information needed.
• For example, a student conducting research for a term paper may have information needs
related to specific aspects of a topic, such as historical background, current research, or
statistical data.
• To address information needs, individuals or organizations typically engage in

information-seeking activities.
• This involves searching for, accessing, and evaluating information from various
sources, such as books, articles, databases, websites, experts, or other individuals.
• The information sought should be relevant, reliable, and credible to ensure its usefulness in
meeting the identified needs.

Things to note about Information Need (IN):
• IN is a relative concept; it depends on several factors. It varies from person to person, job to job, subject to subject, and so on.
• IN changes over time (it does not remain constant).
• IN often changes upon receipt of some information.
• IN often remains unexpressed or poorly expressed.

Types of Information Needs
• In the context of library search, Taylor (as cited in Chowdhury,

2010) identifies four (4) major types of information need that
lead the user from the state of a purely conceptual need to

one that is formally expressed and constrained (by the
environment):
• Visceral need → Conscious need → Formalised need → Compromised

need

Visceral Needs
• Visceral needs refer to individuals’ actual but unexpressed

information needs.
• These needs may arise from personal experiences,

emotions, or intuitions.
• Individuals may have a sense that they lack information or

understanding in a particular area, even if they cannot
precisely articulate or define their needs.
• Visceral needs often serve as the underlying motivation that

triggers the search for information.

Conscious Needs
• Conscious needs represent an ill-defined area of decision or

uncertainty where individuals recognize the need for
information but are unable to clearly articulate it.
• At this level, individuals may feel a sense of unease or a

desire for more knowledge to address a specific problem,
make a decision, or overcome a challenge.
• Conscious needs often arise from gaps in understanding or

awareness and drive individuals to seek clarification or
explore possible solutions.

Formal needs
• Formal needs emerge when individuals are able to express

their information needs in concrete terms.
• At this level, individuals can define the specific information

required to address a particular problem or decision.
• Formal needs are characterized by a clearer articulation of

the gap in knowledge or information, enabling individuals to
seek relevant and targeted information to fill that gap.
• These needs are often expressed through specific questions

or statements

Compromised Needs
• Compromised needs occur when the original information need is translated into what the available resources and information systems can deliver.
• It takes into account the limitations and constraints individuals face
in accessing and obtaining the desired information.
• Compromised needs often arise due to practical considerations,
such as time constraints, resource availability, or technological
limitations.
• Individuals may adjust their information requirements based on
what is realistically achievable within the given constraints.

Types of Information Needs - 2
Visceral Need: The unconscious need.
Conscious Need: Conscious but undefined need.
Formalised Need: Formally expressed need.
Compromised Need: Expressed need, influenced by internal and external constraints.

Information-Seeking Behaviour (ISB) of Users
Items that can affect the ISB of Users:

• Their awareness of, and their ability to access sources of
information
• Educational and professional background.
• Interpersonal relationships (how easily they get on with people)
• The amount of competition that exists in their field of activity
• Their past experiences (or the environment in which the user grew up)
• How the users formulate their queries

Summary
• In this session, we have learnt about the characteristics of users of an information retrieval system.
• The session also reviewed the nature of the information needs of users.

Activity 10.1
• Aside from the user categorization discussed in this session, which other categorizations do you know of?
• You are to undertake this exercise using relevant academic literature.

References
Chowdhury, G.G., 2010. Introduction to modern information retrieval. Facet publishing.
IGI Global, 2018. What is Information-Users [WWW Document]. URL https://www.igi-global.com/dictionary/information-users/14604 (accessed 8.1.18).
Singh, S., 2015. Users and information use in academic libraries. Acad. Libr. Syst.
University of North Texas, 2017. Users and their information needs.

INFS 422: INFORMATION STORAGE & RETRIEVAL
Session 11 – Evaluation of ISR System


Session Overview
• In recent years, the evaluation of Information Retrieval Systems and of techniques for indexing, sorting, searching and retrieving information has become increasingly important (Saracevic, as cited in Kowalski, 2007).
• Consequently, this session discusses the various ways in which an IRS may be evaluated.

Session Outline
• Topic One: Understanding Evaluation
• Topic Two: Criteria for Evaluation
• Topic Three : Lancaster’s Steps in Evaluation

Reading List

• Chowdhury, G. G. (2010). Introduction to modern information retrieval. Facet publishing
• Kowalski, G. J. (2007). Information retrieval systems: theory

and implementation (Vol. 1). Springer

TOPIC
Understanding Evaluation
1

What is Evaluation?
• Evaluation is a systematic determination of a subject's merit, worth and

significance, using criteria governed by a set of standards.
• Ascertaining the value or worth of something.
• Assigning a rating
• Assessing/judging/appraising the quality, ability or extent of significance of
something
• IR evaluation can be conducted from two main viewpoints
- Managerial view: when evaluation is conducted from the managerial point of view, it is called managerial-oriented evaluation.
- User view: when evaluation is conducted from the user's point of view, it is called user-oriented evaluation.

Reasons for Evaluating IRS
• To monitor system effectiveness

• To aid in the selection of a system to procure
• To provide input to cost benefit analysis of an information system
• To assess the query generation process for improvements
• To determine the effects of changes made to an existing information system.
• To establish a foundation for further research on the reasons for the relative success of alternative techniques
• To improve the means employed for attaining objectives or to redefine
sub goals or goals in view of research findings.
(Kowalski, 2007)

Metrics for Evaluation of IR Systems
• The purpose of evaluation of an IR system is to measure its

performance based on a given scale. Performance is measured by
2 basic parameters
• 1) Efficiency – How economically the system is achieving its

objectives. The cost factors involved in meeting the stated
objectives can be determined by:
➢Response time – average time taken by a system to respond to a
query.
➢User effort – time taken by a user to obtain the right information
➢Financial expenditure - cost per search

Metrics for Evaluation of IR Systems
• 2) Effectiveness – the extent to which relevant information

is retrieved and non-relevant information withheld
• It means the level up to which the given system attained its

objectives.

TOPIC
Criteria for Evaluation
2

Proponents of Evaluation Methods
• A number of researchers in the field of information storage

and retrieval have suggested ways in which an IRS may be
evaluated.
• Some of the prominent ones are from:

− Cleverdon
− Lancaster

Cleverdon’s Criteria for Evaluating IRS
• Cleverdon (1966) proposed six criteria:
1. Recall – ability of the system to present all relevant items
2. Precision – ability of the system to present only those items that are relevant
3. Time lag – average interval between query submission and response to query
4. Effort – the physical and intellectual effort expended by the user to obtain an answer to his query.
5. Form of presentation of search output, which affects the ability of the user to make use of the retrieved items.
6. Coverage of the collection – the extent to which the system includes relevant matter.

Lancaster’s Criteria for Evaluating IRS
1. Coverage of the system
2. Ability of the system to retrieve wanted items (i.e., recall)
3. Ability of the system to avoid retrieval of unwanted items (i.e., precision)
4. The response time of the system, and
5. The amount of effort required by the user

Recall and Precision
• Recall and precision are the most important evaluative criteria for
assessing the performance of an IRS.
• Recall – the extent to which the relevant items in the collection are retrieved
• An IR system is expected to retrieve relevant documents in response to a query.
• However, in a large collection, only a proportion of the total relevant documents are retrieved.
• The system’s performance is measured by the recall ratio.
\[ \text{Recall} = \frac{\text{No. of relevant items retrieved}}{\text{Total number of relevant items in the collection}} \times 100 \]

Recall and Precision
• Precision – the extent to which the system retrieves relevant items and withholds non-relevant items
• It is defined as the proportion of the retrieved items that is relevant.
\[ \text{Precision} = \frac{\text{No. of relevant items retrieved}}{\text{Total number of items retrieved}} \times 100 \]
• An ideal system would achieve 100% recall and 100% precision, which is not attainable in practice.
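A small worked example may help fix the two ratios. The sketch below treats the retrieved set and the relevant set as sets of document identifiers; the identifiers are invented. Precision is computed as the proportion of retrieved items that are relevant, as defined above.

```python
def recall_precision(retrieved, relevant):
    """Return (recall %, precision %) for two sets of document identifiers."""
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    recall = 100 * hits / len(relevant) if relevant else 0.0
    precision = 100 * hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical search: 4 documents retrieved, 5 relevant in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5", "d6", "d7"}

recall, precision = recall_precision(retrieved, relevant)
print(f"Recall = {recall}%, Precision = {precision}%")
```

Here two of the five relevant documents were found (recall 40%) and two of the four retrieved documents were relevant (precision 50%).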

Limitations of recall and precision
• Different users may want different levels of recall – a person preparing a report on a given topic may prefer high recall. Conversely, one who needs to know just something about a given topic may prefer low recall.
• Recall assumes that all relevant items have the same value, but the value may be relative and vary from user to user.
• Both recall and precision rely on the relevance judgement of the user, and this judgement may be subjective.

Limitations contd..
• A subjective view of relevance may also depend upon the user's knowledge of the content at the time of search.
• Therefore all pertinent items may be relevant but not all relevant items may be pertinent.
• Some researchers have observed that the larger the collection, the larger the number of non-relevant items for a given query; thus precision and recall have an inverse relationship.

Limitations contd..
• An increase in recall tends to cause a decrease in precision
• The evaluative criteria are document based and therefore

measure only the performance of the system retrieving items
that have been predetermined to be relevant to the
information need.
• They do not consider how the information will be used or

whether the documents fulfil the information need of the user

Other measures of evaluation
• Efficient IR systems must be designed to maximize recall and precision. The limitations of the precision and recall criteria call for additional measures of evaluation, such as:
• Fallout ratio = the proportion of non-relevant items retrieved in a given search
• Generality ratio = the proportion of relevant documents in the collection for a given query.

Retrieval measures
Symbol   Evaluation Measure   Formula            Explanation
R        Recall               a/(a+c)            Proportion of relevant items retrieved.
P        Precision            a/(a+b)            Proportion of retrieved items that are relevant.
F        Fallout              b/(b+d)            Proportion of non-relevant items retrieved.
G        Generality           (a+c)/(a+b+c+d)    Proportion of relevant items per query.
[a = relevant documents retrieved] [b = non-relevant documents retrieved] [c = relevant documents not retrieved] [d = non-relevant documents not retrieved]
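The four measures can be computed together from the four counts. The sketch below takes a as relevant documents retrieved, b as non-relevant documents retrieved, c as relevant documents not retrieved and d as non-relevant documents not retrieved, so that precision is a/(a+b) and generality is (a+c)/(a+b+c+d); the counts used are invented.

```python
def retrieval_measures(a, b, c, d):
    """Compute recall, precision, fallout and generality from four counts.

    a: relevant retrieved        b: non-relevant retrieved
    c: relevant not retrieved    d: non-relevant not retrieved
    """
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "fallout": b / (b + d),
        "generality": (a + c) / (a + b + c + d),
    }

# Hypothetical collection of 1,000 documents, 25 of them relevant to the query.
measures = retrieval_measures(a=20, b=30, c=5, d=945)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```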

Other measures contd..
• Usability – a measure that considers the interface,

expectations, experiences and skills of the user
• Cost – Users may experience costs in terms of any payment that they need to make for system or document access. It includes time used for searching the system, search algorithms, options for display of search results etc.

TOPIC
Lancaster’s Steps in Evaluation
3

Lancaster’s Steps in Evaluation
• Lancaster (1971) proposes five (5) main steps for evaluation:
1. Designing the scope of evaluation (detailed plan)
➢ Purpose and Objectives
➢Type, i.e., laboratory set-up, real life situation, macro- or micro-evaluation
➢Cost and Staff time
2. Designing the evaluation programme
➢Methodology or action plan
➢Need for designers to control some parameters of the system.
➢Design must be clear and mark major caution points where more care is needed.

Lancaster’s Steps in Evaluation (Cont’d.)
3. Execution of the evaluation

- Meticulous implementation of the methodology by the
evaluator to avoid bias or error
- Constant communication between designer & evaluator
about observations & possible re-design of programme
4. Analysis and interpretation of results
- Interpretation of results based on objectives
- Suggestions for improvement based on findings
5. Modifying the system based on the results.
➢ Finally, the retrieval system is modified if necessary based on the
feedback obtained from the results and interpretation of
evaluation.

Summary
• In this session the importance of evaluating information retrieval systems was discussed.
• The session also presented some of the criteria for evaluation of IRS, with reference to:
• Lancaster’s steps for Evaluation; and
• Cleverdon’s Evaluation Criteria

Activity 10.1
• Compare and contrast Lancaster’s evaluation steps with Cleverdon’s criteria.

References

• Chowdhury, G. G. (2010). Introduction to modern information retrieval. Facet publishing
• Kowalski, G. J. (2007). Information retrieval systems: theory
and implementation (Vol. 1). Springer.

INFS 422: INFORMATION STORAGE & RETRIEVAL
Session 12 – Evolutions in Information Retrieval


Session Overview
The aim of this session is to:
• Provide an understanding of how the information retrieval

field progressed/evolved to its current state.
• Each evolutionary milestone is illustrated by describing the standards and protocols and by discussing the global initiatives and the research that shaped it.

Session Outline

• Topic 1: Information Retrieval Standards & Protocols
• Topic 2: Global Digital Library
• Topic 3: Intelligent Information Retrieval
• Topic 4: Hypertext and Hypermedia Systems

Reading List
• Rego, A., Garcia, L., Llopis, M., & Lloret, J. (2016). A New Z39.50 Protocol Client for Searching in Libraries and Research Collaboration. Network Protocols and Algorithms, 8(3), 29. https://doi.org/10.5296/npa.v8i3.10147
• The Apache Software Foundation: The Free and Open
Productivity Suite. Retrieved from:
http://www.openoffice.org/bibliographic/srw.html.
Accessed on August 7, 2018.

Introduction
• The growth of Information and Communication

Technology (ICT) has refashioned information search
and retrieval.
• Several advancements have taken place in this area (i.e. ISR) over time.

TOPIC
Information Retrieval Standards & Protocols
1

What is a Standard ?
• A standard is an agreed way of performing a task or carrying out an activity so as to obtain a predictable result.
• Various standards and protocols exist for IR systems.
• Some of the popular search and retrieval standards and
protocols include:
– Z39.50
– SRW
– SRU
– CQL

Z39.50
• Z39.50 is a communication protocol between a client
and a server.
• The increasing amount of information available at libraries, and the need for a mechanism to search for information at several libraries at the same time, prompted the creation of the Z39.50 protocol.

Z39.50 (Cont’d)
• Sessions inside one connection between both nodes
are known as Z39.50-association or Z-association
(Rego et al., 2016).
• These sessions are initiated by the client. Since Z-

association is open, both server and client can start
any operation defined in Z39.50 protocol. In the
same way, Z-association can be closed by either
client or server, or implicitly terminated by loss of
connection (Rego et al., 2016).

Z39.50 (Cont’d)
• The main goal of Z39.50 is to provide a standard way to search for information in an external database, whatever its data organization.
• Z39.50 is therefore widely used in some of the biggest libraries. The goal is achieved because the communication between the client and the server is standardized and independent of the database (Rego et al., 2016).

Z39.50 (Cont’d)
• Z39.50 is used at both the national and international level as a standard protocol that defines a computer-to-computer information retrieval technique. It is non-proprietary and vendor-independent.
• Z39.50 was originally approved by the National

Information Standards Organization (NISO) in 1988.
In 1998, International Organization for
Standardization (ISO) adopted Z39.50 and issued ISO
23950 Information and documentation - Information
retrieval (Z39.50).

Z39.50 (Cont’d)
• Using Z39.50, a user can search and retrieve
information from other Z39.50-compliant computer
systems through his/her own system, without prior
knowledge of the search syntax used by the other
systems.
• The primary goal of Z39.50 is to reduce the
complexity and difficulty involved in searching and
retrieving electronic information.

SRW
• SRW stands for Search/Retrieve Web Service
protocol (The Apache Software Foundation, 2018).
Its aim is to minimize cross-language problems.
• The goal is to allow access to several networked
resources and support interoperability among
distributed databases, using a common utilization
framework (The Apache Software Foundation, 2018).
• It was developed collectively by implementers who
combined more than 20 years of experience with the
Z39.50 Information Retrieval protocol with newer
developments in web technology.

SRU
• SRU stands for Search/Retrieve via URL. It is a
standard XML-based search protocol that uses
CQL (http://www.loc.gov/cql/), a standard syntax for
query representation (The Apache Software
Foundation, 2018).
• The main difference between SRU and SRW is that
the former sends the query as parameters in an HTTP
URL, while the latter is based on the SOAP protocol and
uses XML streams for both the query and the results.
• This means that, in SRU, the query is communicated as
a URL and the XML is received as if it were a web page.
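
To make the idea of "a query communicated as a URL" concrete, here is a minimal sketch that builds an SRU request URL. The endpoint `http://sru.example.org/catalogue` is a made-up placeholder, and the CQL query is only an example. The parameter names (`operation`, `version`, `query`, `maximumRecords`) are those of the SRU 1.2 searchRetrieve request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical SRU endpoint, used only for illustration.
BASE_URL = "http://sru.example.org/catalogue"


def build_sru_url(cql_query, max_records=10):
    """Return a searchRetrieve URL that carries a CQL query as a parameter."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": max_records,
    }
    # urlencode percent-encodes spaces, quotes and '=' inside the query.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_sru_url('dc.title = "information retrieval"')
print(url)
```

A server that supports SRU would answer such a request with an XML document of records; fetching and parsing that response is left out of this sketch.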

CQL
• CQL stands for Contextual Query Language (formerly
known as Common Query Language).
• It is designed for use with SRW, the search protocol
that succeeds Z39.50 (as discussed in the previous section).
• CQL is an abstract and extensible query language for
maximum interoperability amongst the connected
systems. The goal is to keep it easy to learn and use
while retaining the capability to express complex
searches.
• Primarily CQL is used in the bibliographic domain, however
it is not restricted to this context alone.
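
As an illustration of what a CQL query looks like, the sketch below assembles a two-part bibliographic search. The helper names `cql_clause` and `cql_and` are hypothetical. The index names `dc.title` and `dc.creator` come from the Dublin Core context set, and whether a given server supports them is an assumption.

```python
def cql_clause(index, term, relation="="):
    """Build one CQL search clause, quoting the search term."""
    # Escape backslashes and double quotes inside the quoted term.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'{index} {relation} "{escaped}"'


def cql_and(*clauses):
    """Combine clauses with the CQL boolean operator 'and'."""
    return " and ".join(f"({clause})" for clause in clauses)


query = cql_and(
    cql_clause("dc.title", "information retrieval"),
    cql_clause("dc.creator", "Chowdhury"),
)
print(query)
# (dc.title = "information retrieval") and (dc.creator = "Chowdhury")
```

A simple search can be as short as a single quoted term, while the boolean operators allow the more complex searches mentioned above.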

TOPIC
Global Digital Library
2

Global Digital Library
• This is more or less a virtual library that consolidates

the collections of individual libraries as one
collection.
– The WWW and the internet laid the foundation for the
virtual/digital libraries.
• Global Digital Library (GDL) is a prototype which aims

to connect several national libraries and some major
libraries, museums, archives, and information
organizations with each other (Chen, 2001).

Challenges/Issues with Digital Information Sharing
• Several legal issues may arise related to intellectual property,
copyright, confidentiality and privacy, security, personal, business
equity, etc.;
• Differences in culture may influence the way information is
communicated;
• The presence of generational gaps;
• The sheer complexity of information architecture at both the
global and national level;
• The need for an effective and adequate inventory of the
available information resources;
• The ability to locate, identify and retrieve relevant and quality
information;
• The difficulty, given the huge amount of information, of dealing
with "undesirable" or "indecent" information.

TOPIC
Intelligent Information Retrieval
3

Intelligent IR Defined
• Intelligent IR is a computer system that can infer
knowledge from what it already knows in order to
establish a link between the requirement of its user and
a set of candidate documents (Jones et al., 2000).
• This is a system which can perform intelligent retrieval.
Researchers' realization that knowledge can be used in
information retrieval systems led them to consider
artificial intelligence systems, which have a similar
purpose. One class of such systems is the expert
system.

Expert System Defined
• An expert system is “a computer system which emulates

the decision-making ability of human experts” (Jackson,
1998).
• The expert systems are designed to solve complex
problems by reasoning over knowledge stored in a
knowledge base.
• The knowledge in the knowledge base is primarily
represented as IF-THEN rules rather than conventional
procedural code.
• The first expert systems were invented in the 1970s and
then proliferated in the 1980s.
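
The idea of reasoning over IF-THEN rules can be sketched in a few lines. The example below uses forward chaining, one common inference strategy. The rules and fact names are invented for illustration and are not taken from any of the systems discussed in this session.

```python
# Each rule: (set of conditions that must all hold, fact to conclude).
# The rules are made up for this example.
RULES = [
    ({"has_issn", "published_periodically"}, "is_serial"),
    ({"is_serial", "subject_physics"}, "classify_under_physics_periodicals"),
]


def forward_chain(facts, rules):
    """Repeatedly fire IF-THEN rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # IF all conditions are known THEN add the conclusion.
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain(
    {"has_issn", "published_periodically", "subject_physics"}, RULES
)
print(sorted(derived))
```

Note that the knowledge sits in the `RULES` data rather than in the control flow, which is what distinguishes this style from conventional procedural code.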

Developments in Expert Systems
• As expert systems evolved, several new techniques
were adopted into various types of inference
engines. Some of the most important ones include:
– Truth Maintenance
– Hypothetical Reasoning
– Fuzzy Logic
– Ontology Classification

Expert System for LIS Profession
• AUTOCAT was produced in Germany. The system was
designed to generate bibliographic records of physical
sciences periodicals available in machine-readable form
(Endres-Niggemeyer and Knorz, 1987).
• Qualcat (Quality Control in Cataloguing) was undertaken
at the University of Bradford. The goals of the project
were to develop expert systems to select the best
records, to link the databases and centralized authority
control, to build a fully automated control package for
day to day running, and to investigate interface problems
for cataloguing (Ayres et al., 1994).

Expert System for LIS Profession (Cont’d)
• OCLC developed an expert system called Cataloguer’s
Assistant. The system was tested at Carnegie Mellon
University to reclassify the mathematics and computer
science collection (De Silva, 1997).
• FRUMP (developed by DeJong) analyses articles from
newspapers using frame-based techniques. The articles
were first scanned and then data were automatically fed
into the different slots within frames.
• SCISOR (developed by Rau, Jacobs and Zernik, 1989) is a
system that generates reports on corporate acquisitions and
mergers.

TOPIC
Hypertext & Hypermedia Systems
4

Hypertext
• Hypertext refers to the use of hyperlinks (or simply “links”) to
present text and static graphics. Many websites are entirely
or largely hypertexts (Farkas, 2004).

Hypermedia
• Hypermedia refers to the presentation of video, animation,

and audio, which are often referred to as “dynamic” or “time
based” content or as “multimedia” (Farkas, 2004).
• Hypermedia, a logical extension of hypertext, is a non-linear
medium of information space which includes plain text,
audio, video, graphics and hyperlinks.

Hypertext and Hypermedia (Cont’d)
• Forms of hypertext and hypermedia include CD-ROM and

DVD encyclopaedias (such as Microsoft's Encarta), eBooks,
and the online help systems we find in software products.
• It is common for people to use "hypertext" as a general term

that includes hypermedia (Farkas, 2004). For example, when
researchers talk about “hypertext theory,” they refer to
theoretical concepts that pertain to both static and
multimedia content.
(Farkas, 2004)

Summary
• In this session we have discussed some of the IR techniques

and technologies that evolved in the recent past.
• We have discussed some of the significant IR standards and

protocols.
• We have also reported on state-of-the-art research in the IR
field, for instance the global digital library initiative, the
application of intelligent systems such as expert systems in
library cataloguing, classification and abstracting, and the
application and issues of intelligent hypertext and hypermedia
systems.

Activity 9.1
• Discuss the role of protocols and standards in the

development of modern IR systems.

References – 1
• Ayres, F. H., Cullen, J., Gierl, C., Huggill, J. A. W., Ridley, M. J., &
Torsun, I. S. (1994). QUALCAT: automation of quality control
in cataloguing. BLRD REPORTS, 6068.
• Chen, C.-C. (2001). Global Digital Library Development in the New
Millennium: Fertile Ground for Distributed Cross-Disciplinary
Collaboration. Tsinghua University Press.
• De Silva, S. M. (1997). A review of expert systems in library and
information science. Malaysian Journal of Library &
Information Science, 2(2), 57–92.
• Endres-Niggemeyer, B., & Knorz, G. (1987). AUTOCAT:
knowledge-based descriptive cataloguing of articles
published in scientific journals. In Second International
GI Congress 1987. Knowledge Based Systems (pp. 20–21).

References – 2
• Farkas, D. K. (2004). Hypertext and hypermedia. In Berkshire
Encyclopedia of Human-Computer Interaction (Vol. 16, pp. 332–
336). https://doi.org/10.1016/0360-1315(91)90062-V
• Jackson, P. (1998). Introduction to expert systems. Addison-Wesley
Longman Publishing Co., Inc.
• Jones, K. S., Walker, S., & Robertson, S. E. (2000). A probabilistic model
of information retrieval: development and comparative experiments:
Part 2. Information Processing & Management, 36(6), 809–840.
• Rego, A., Garcia, L., Llopis, M., & Lloret, J. (2016). A New Z39.50
Protocol Client for Searching in Libraries and Research
Collaboration. Network Protocols and Algorithms, 8(3), 29.
https://doi.org/10.5296/npa.v8i3.10147
• The Apache Software Foundation: The Free and Open Productivity Suite.
Retrieved from: http://www.openoffice.org/bibliographic/srw.html.
Accessed on August 7, 2018

INFS 422: INFORMATION STORAGE &
RETRIEVAL


Session Overview
• The session recounts the role of technology in information

storage and retrieval.
• The session specifically examines landmark technological

innovations such as the WWW and cloud storage
services.

Session Outline
• Topic One: Web Information Retrieval
• Topic Two: Search Engines
• Topic Three: Cloud Storage Services

Reading List
Mansourian, Y. (2004). Similarities and differences
between Web search procedure and searching in the
pre-web information retrieval systems. Retrieved
July 9, 2008, from
http://www.webology.org/2004/v1n1/a3.html
Zhang, Y. (2004). A Comparison of Search Engines For
Finding Resources. Retrieved July 9, 2008, from
http://www.yuanlei.com/studies/articles/is567-
searchengine/page2.htm

TOPIC
Web Information Retrieval
1

The World Wide Web
• The world wide web (WWW) otherwise known as the Web is

a collection of related and connected web pages.
• A web page is basically an HTML file.

The WWW Naming Scheme
• The WWW uses a naming scheme to identify documents,
known as Uniform Resource Identifiers or simply URIs
(Berners-Lee et al., 1998).
• A URI may be a Uniform Resource Locator (URL), which
identifies a document by including information on the location
of the document, or a Uniform Resource Name (URN), which
acts as a true identifier by naming the document rather than
its location.
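
The difference between a locator and a name can be seen by splitting each kind of URI into its parts. The sketch below uses Python's standard `urllib.parse` module. The URL is taken from the reading list, while the URN (an ISBN-based name) is simply an illustrative example.

```python
from urllib.parse import urlsplit

# A URL says where the document is: scheme, host and path.
url = urlsplit("http://www.webology.org/2004/v1n1/a3.html")
print(url.scheme, url.netloc, url.path)

# A URN names the document without saying where it is stored,
# so it has no host part at all.
urn = urlsplit("urn:isbn:0451450523")
print(urn.scheme, repr(urn.netloc), urn.path)
```

The URL exposes a host and a path to a file, whereas the URN carries only a namespace (`isbn`) and an identifier within it.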

Classic vs Web Information Retrieval
• In classic IR systems, both the resources and the users
were more or less predictable and homogeneous.
• Online and offline databases and Online Public Access
Catalogues (OPACs) mainly contain data stored in a
structured manner. Because of this, the search and retrieval
process was much easier and more predictable.
(Mansourian, 2004)

Classic vs Web IR (Cont’d.)
• Accordingly, the user group was mainly made up of
academics, researchers, subject experts and librarians.
• They were well aware of the search keywords to use for
finding a particular document. With the emergence of the
Web, however, came a flood of digital information that is
unstructured, uneven and heterogeneous. Web-based IR
systems evolved to cope with this situation.
(Mansourian, 2004)

Classic vs Web IR (Cont’d.)
• These systems provide access to web-based digital
content.
• They use programs to maintain and update a list of web
documents added to the Web, either at regular intervals or
as the need arises.
• Web IR systems then look for the searched keywords in this
maintained list, or index file, in order to retrieve the original
document on the Web.
(Mansourian, 2004)
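
The keyword lookup described above is commonly implemented with an inverted index, which maps each word to the documents that contain it. The following is a minimal sketch under that assumption. The pages in `PAGES` are hypothetical stand-ins for documents gathered from the Web, and real systems add ranking, stemming and much more.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for documents collected from the Web.
PAGES = {
    "http://example.org/a": "Cloud storage and information retrieval",
    "http://example.org/b": "Search engines index web pages",
    "http://example.org/c": "Information retrieval on the web",
}


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every keyword in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


INDEX = build_index(PAGES)
print(sorted(search(INDEX, "information retrieval")))
# ['http://example.org/a', 'http://example.org/c']
```

The search consults only the index, not the pages themselves, which is why the index has to be updated whenever documents are added to the Web.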
Web IR
• The user interface of a Web IR system is much more
user-friendly than that of a traditional IR system, because it
has to cater for the growing number of inexperienced users.
• Web IR interface designers therefore have to pay more
attention to the information seeking behaviour of such users
than designers of classical IR systems did, since the users of
those systems were mainly experts in a particular domain.
(Mansourian, 2004)

TOPIC
Search Engines
2

What are Search Engines?
• A search engine is a tool for locating information in a
collection. Search engines use information about the
information (such as metadata or a catalogue) stored in the
database to locate information.
• Sometimes they perform a full-text search within the
document, from the first character to the last.
• Search engines are computer programs that search for
particular ‘keywords’ entered by users and return a list of
documents with which those keywords are associated.
• A search engine is simply a service that searches for content
on the web.
• Some, however, search for more than keywords, and these
are not “engines” in the classical sense.

Types of Search Engines
• Crawler-based search engines are useful if we have a
specific search keyword in mind. If our search topic is a
general one, this type of search engine may return
several irrelevant documents in response to a search
request, e.g. AltaVista.
• Human-powered directories are good if our search is on a
general topic. This type of search engine, powered by
human-crafted directories, guides us and helps to
narrow our search and fetch refined responses, e.g.
DMOZ.
(Zhang, 2004)

Types of Search Engines (Cont’d.)
• Hybrid search engines use a combination of both crawler-based results
and directory results, e.g. Google.
• Meta-search engines are online tools (search engines) which
perform a simultaneous search on more than one search engine
at a time.
• These search engines aggregate the results into a single list and
display them according to their source.
• They are good for saving time by gathering results from different search
engines at a single interface.
• They are excellent if we wish to know whether anything is available on the
web about a particular topic, e.g. Dogpile.
(Zhang, 2004)
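
The way a meta-search engine works can be sketched as follows. The two "engines" here are stub functions that return canned results, since calling real search engines would require their own APIs. All names in the example are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub engines returning canned results (hypothetical stand-ins).
def engine_a(query):
    return [f"a.example/{query}/1", f"a.example/{query}/2"]


def engine_b(query):
    return [f"b.example/{query}/1"]


ENGINES = {"Engine A": engine_a, "Engine B": engine_b}


def meta_search(query, engines=ENGINES):
    """Query every engine at once; return one list of (source, result) pairs."""
    with ThreadPoolExecutor() as pool:
        # Submit all searches first so that they run simultaneously.
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        merged = []
        for name, future in futures.items():
            # Keep the source with each hit so results can be shown by source.
            merged.extend((name, hit) for hit in future.result())
    return merged


for source, hit in meta_search("opac"):
    print(source, hit)
```

The user issues one query and receives one list, while the meta-search engine does the work of contacting each underlying engine.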

TOPIC
Cloud Storage Services
3

What is Cloud Computing? (1)
• Cloud computing is described as a computing environment
where software applications, computing hardware and other
resources are provisioned to authorized users via a network,
instead of residing on the users' local devices or at their
location.
• With a network connection on a computer or a smart device,
users can access resources on their cloud platforms without
any other connection to the hardware that holds the
resources (Gruman, 2008).

The underlying rationale for CC is ubiquity of access:
access beyond location and device.

Cloud Computing (1)
• Cloud computing has opened a new phase in the
technological sphere, where technological products
are now rendered as services.
• This allows technological products to be
repackaged as services to meet the needs of
customers.
• Hardware products such as storage devices are now
repackaged as services (e.g. OneDrive, Google Drive
and Dropbox).

Cloud Computing (2)
• The advent of cloud computing has generated a lot of
interest among individuals and corporate organizations.
• Knowingly or unknowingly, most people use cloud
services daily (Rawal, 2011).
• From your understanding so far, can you name some
cloud services that you know of, apart from the ones that
have been mentioned in this class?

Computing service models
• The types of service models that have emerged under
cloud computing technology are:
– Software-as-a-Service (SaaS)
– Platform-as-a-Service (PaaS)
– Infrastructure-as-a-Service (IaaS)
• Cloud computing services can be private, public
or hybrid.

Software-as-a-Service (SaaS)
• Under this service model, cloud computing applications
residing on the cloud infrastructure of the service providers
are delivered to the user through web interfaces and
programs (Mell & Grance, 2011).
• E.g. Microsoft Word Online and Google Docs.
• These are word processing applications
which are offered as a service on the internet.
• The software therefore does not reside on the individual
device of the user.

[Screenshots: Microsoft Word Online; Google Slides]

Platform-as-a-Service (PaaS) - 1
• PaaS is a type of cloud service model which is described as
an upgrade of the SaaS model. It offers users a platform
to build and run applications through a programming interface
provided and supported by the cloud service provider (Birk,
2011).
• PaaS is analogous to SaaS except that, rather than being
software delivered over the web, it is a platform for the
creation of software, delivered over the web.

• PaaS provides a set of tools and services designed to make
coding and deploying applications quick and efficient.
• It allows users to create their own applications using tools
and languages from the platform service provider.

• For instance, a programmer who needs to build an
application that requires a high-speed processing server can
sign up for a Platform-as-a-Service model rather than
spending huge amounts on purchasing servers and operating
systems for the application.
• Examples of Platform-as-a-Service offerings are Google App
Engine and Microsoft Azure (formerly Windows Azure).

Infrastructure-as-a-Service (IaaS)
• In the IaaS model, cloud service providers
supply a range of virtual infrastructure, such
as virtual servers, networks, storage and other
fundamental computing resources, to
customers. This enables them to deploy and run
their own operating systems and applications, and
to upload or download software or files to and from
the cloud. E.g. Dropbox and OneDrive
(Mouratidis et al., 2013)

Infrastructure-as-a-Service (IaaS) (Cont’d)
• Allows users to run any applications they
please on the cloud hardware of their own
choice.
• IaaS is the hardware and software that
powers it all:
– servers
– storage
– networks
– operating systems

ISR & Cloud Computing (CC)
• The introduction of this technology has affected the way
information professionals and even non-information
professionals handle information.
• As this is an introductory class, the focus
will be on how CC has affected the following areas of IM:
– Information capturing and processing
– Information storage, access and backup
– Information sharing and retrieval

CC and Information Capturing & Processing
• Information processing is about data manipulation to get
desired results.
• In the digital era this is often aided by computer
software, such as word processors, spreadsheets and
software for statistical analysis (e.g. SPSS).
• CC creates the opportunity for such data processing
software to be accessed and used on multiple devices
without necessarily installing it on each device (i.e.
SaaS).
• E.g. MS Excel Online and Google Sheets for data
capturing and processing

CC and Information Storage & Access
• A very crucial aspect of IM is information storage and
retrieval.
• Cloud-based storage services such as Google Drive™
and Dropbox™ have brought about a new phase in
information storage and retrieval (i.e. IaaS).
• Information is now stored and accessed from multiple
locations, as opposed to the traditional storage and
retrieval systems, which were confined to specific
locations or computer hard drives.

CC & Information Storage and Backup
• Large cloud providers with geographically dispersed sites

worldwide are able to achieve reliability rates that are hard
for private systems to achieve.
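
A rough calculation shows why keeping copies at several sites raises reliability. The sketch assumes that sites fail independently of one another and that each site is reachable 99% of the time. Both the independence assumption and the 99% figure are illustrative assumptions, not measurements of any real provider.

```python
def combined_availability(site_availability, copies):
    """Probability that at least one of `copies` independent sites is reachable."""
    # All copies are unavailable together with probability (1 - a) ** copies.
    return 1 - (1 - site_availability) ** copies


for n in (1, 2, 3):
    print(n, combined_availability(0.99, n))
```

Under these assumptions, three copies are all unavailable at the same time only about once in a million, compared with once in a hundred for a single site. A private system with one location cannot easily match this.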

CC and Information Sharing
• Collaboration and teamwork have become inevitable in
today's corporate world.
• The complexity of today’s working environment requires
teamwork, and the members of a team or project may be in
different geographical areas.
• CC provides an avenue for data sharing and collaboration
which defies geographical boundaries.
• E.g. e-mail (SaaS), OneDrive (IaaS), GitHub (PaaS)

Summary
• In this session, we learnt some of the ways in which
technological advancements have affected and continue to
affect information storage and retrieval.
• The session specifically examined:
– The WWW
– Search engines
– Cloud storage services

Activity 13.1
• List any ten (10) search engines known to you, and
determine which type of search engine each one is, based on
the types of search engines discussed in this session.
• You may present your answer in the form of a table.

Activity 13.2
• Evaluate the role of cloud storage services in modern

information retrieval systems.

References
Mansourian, Y. (2004). Similarities and differences
between Web search procedure and searching in the
pre-web information retrieval systems. Retrieved
July 9, 2004, from
http://www.webology.org/2004/v1n1/a3.html
Zhang, Y. (2004). A Comparison of Search Engines For
Finding Resources. Retrieved July 9, 2004, from
http://www.yuanlei.com/studies/articles/is567-
searchengine/page2.htm
