
INFS 422: INFORMATION STORAGE AND RETRIEVAL

SESSION 1
An Overview of Information Storage and Retrieval

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year

Dr. (Mrs) F. O. Entsua-Mensah 1


Session Overview

• One of the attributes of the information age is information overload,
that is, the availability of tons of information in different formats.
We often go to the library to search for both print and e-resources.

• If the books in the library, for example, are neatly arranged on the
shelves without being organized by cataloguing and classification,
searching for just a single book will be a herculean task.

• Information storage and retrieval, or information retrieval, is about
the deliberate organization of information to make it easily
retrievable to meet the information needs of individuals.

• The advent of computers and technological advances have
changed how information is organized, stored and retrieved in
many ways.
Session Outline

The key topics to be covered in the session are:

• Topic One: Definition, purposes and characteristics of information retrieval
systems

• Topic Two: Components of an information retrieval system, and the information
retrieval process

• Topic Three: Types and everyday uses of an information retrieval system



Learning Outcomes

At the end of the session, the student will be able to:

• Explain information storage and retrieval and how technological
advances have changed it.

• Identify its main purposes, characteristics, and
everyday uses.



Reading List

• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.).
New York: Neal-Schuman Publishers, Inc.

• Hiemstra, D. (2009). Information retrieval models. In A. Goker and J. Davies
(Eds.), Searching in the 21st century (pp. 1-19). Chichester, U.K.: Wiley. Retrieved
from http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf.



TOPIC 1
Definitions, Purpose and Characteristics of Information Retrieval Systems



What is Information Storage and Retrieval

Information storage and retrieval

• “Information retrieval (IR), also called information storage and retrieval
(ISR or ISAR) or information organization and retrieval, is the art and
science of retrieving from a collection of items a subset that serves the
user’s purpose” (Harns, 2013, para. 1).

• “A branch of computer or library science relating to the storage, locating,
searching, and selecting upon demand relevant data on a given subject”
(NCI Thesaurus).

• The systematic process of collecting and cataloguing data so that they
can be located and displayed on request (The Columbia Electronic
Encyclopedia).



What is Information Storage and Retrieval

• “The technique & process of searching, recovering, & interpreting information from
large amounts of stored data” (Science and Technology Dictionary, cited in Singh,
2016).

• It relates to “the organization of, processing of, and access to information of all
forms and formats” (Chowdhury, 2004, p. 1).

• Information retrieval (IR) is a discipline concerned with the processes by which
queries presented to information systems are matched against a "store" of texts
(the term text may be substituted with still images, sounds, video clips, paintings,
or any other artifact of intellectual activity) (Robbins, 2000, p. 57).



Purposes of an information retrieval system (IRS)

An IRS is designed to fulfil the following purposes:

• To collect and organize documents and information of different formats in different
subject areas and make them available to users upon request

• To make the right information accessible to the right user upon request to meet
his or her information need

• To serve as a bridge between content creators or generators and users of those
contents or information

• To retrieve bibliographic items or exact matches of texts of queries from different
information retrieval systems such as full-text databases or multimedia
information.



Characteristics of an effective information retrieval system (IRS)

To fulfil its purposes, an IRS must be flexible and equipped for:

• Prompt information dissemination
• Information filtering (i.e., unwanted information must be excluded)
• Active switching of information (such as switching from web search to email access)
• Receiving information in a desired format
• Browsing or surfing
• Getting information in an economical way (e.g., expensive academic databases
may not be patronized by academic institutions)
• Current literature (or up-to-date information)
• Accessing other information systems
• Interpersonal communication (e.g., live chats and sharing of information via social
media)
• Personalized help (e.g., provision of anticipated help topics to facilitate ease of use),
and must be
• User friendly, i.e., must consider the convenience of the user (Liston and Schoene, as
cited in Chowdhury, 2010, p. 16).



Information Retrieval Process

• Any information storage and retrieval system will have a complex series of
operations before documentary information can be used:

1. The information must be recorded in documents;
2. Each document must be stored with others in some accessible place and its
location known;
3. Characteristic aspects of each document must be identified to form a document
profile, and this must be recorded with others in the same file;
4. The potential user must formulate some query or express some interest in
terms of characteristics recorded about documents;
5. This user profile must be compared with document profiles and the locations of
the matching documents identified;
6. The documents must be located and presented to the user.

(CHAUHAN, 2023)
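
The six operations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular system: the documents, the shelf locations, and the `build_profile` and `retrieve` helpers are hypothetical, and a profile is reduced to a set of lowercase words.

```python
# Hypothetical sketch of the six operations; all names and data are illustrative.

# Steps 1-2: information is recorded in documents, each stored at a known location.
store = {
    "shelf-A1": "Cataloguing and classification of library books",
    "shelf-B7": "Indexing languages and controlled vocabulary",
    "shelf-C3": "Computers and the storage of digital information",
}


def build_profile(text):
    """Reduce a text to a profile: the set of its lowercase words."""
    return set(text.lower().split())


# Step 3: record a profile for every document in one file (here, a dict).
profiles = {location: build_profile(text) for location, text in store.items()}


def retrieve(query):
    """Steps 4-6: profile the query, compare it with the document profiles,
    and return the locations and texts of the matching documents."""
    user_profile = build_profile(query)
    matches = []
    for location, doc_profile in profiles.items():
        if user_profile & doc_profile:  # at least one shared characteristic
            matches.append((location, store[location]))
    return matches


results = retrieve("controlled vocabulary")
for location, text in results:
    print(location, "->", text)
```

Matching on any shared word is a deliberately crude comparison; real systems use the indexing and vocabulary techniques covered later in the session.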



TOPIC 2
Components of an Information Retrieval System and their Process



Components of an information retrieval system

Lancaster (cited in Chowdhury, 2010) indicates that an information retrieval system
is made up of six main components:
1. The document selection subsystem
2. The indexing subsystem
3. The vocabulary subsystem
4. The searching subsystem
5. The user-system interface (a platform for interaction between the user and the
system)
6. The matching subsystem (the system that matches the user’s queries to the
document representations)



Components of an information retrieval system

• The Document Subsystem: it involves the location, selection, ordering and
receipt of source materials for the collection. This process emphasizes two aspects,
currency of information and completeness of information, and consists of the
following tasks:
➢Determination of the current and probable future requirements of potential
users of the IRS.
➢Formulation of a policy for the acceptance of source material, as defined by
subject coverage, publication type, or other criteria.
➢Comparison of available or incoming source materials with the policy to
determine which shall be included in the IRS.
• Indexing Subsystem and Vocabulary Subsystem: A system for naming
subjects is called an indexing language. Like any other language, it consists of
two parts: (a) vocabulary and (b) syntax. If we use terms as they appear in
documents without modification, we are using natural language. In this process,
we face problems arising out of:
➢The use of words/terms.
➢The use of word order (syntax).



Components of an information retrieval system

➢For example, “child psychology” may be expressed as “psychology for children”.
Therefore, a controlled vocabulary is used. Vocabulary control involves the
establishment of relationships among terms, often on an arbitrary basis, but
mostly based on the prediction of those relationships that may facilitate
identification of all source materials that have been indexed.
➢For example: instead of “Children’s libraries”, we use “libraries and
children”.
➢A controlled vocabulary is a part of an artificial indexing language. The
notation of a classification scheme is an example of this artificial language.

• Searching Subsystem: The searching subsystem is one of the major
subsystems of an information retrieval system. In this subsystem, users’
queries are first received and interpreted by the search system, then
appropriate search statements are formulated, and the actual search (i.e.,
matching queries with the surrogates in the information resources file) is
conducted with a view to retrieving the required information.
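
The vocabulary and searching subsystems described above can be sketched together. The example below is a hedged illustration only: the `PREFERRED_TERMS` table, the `normalize` and `search` helpers, and the three surrogates are hypothetical. It shows a query phrased in natural language being mapped to a preferred term before it is matched against document surrogates.

```python
# Hypothetical controlled vocabulary: variant phrasings map to one preferred term.
PREFERRED_TERMS = {
    "psychology for children": "child psychology",
    "children's libraries": "libraries and children",
}


def normalize(term):
    """Return the preferred term for a phrase, or the phrase itself if no rule applies."""
    key = term.strip().lower()
    return PREFERRED_TERMS.get(key, key)


# Document surrogates: each document is represented by its assigned index terms.
surrogates = {
    "doc1": {"child psychology"},
    "doc2": {"libraries and children"},
    "doc3": {"child psychology", "libraries and children"},
}


def search(query):
    """Interpret the query, form a search statement, and match it against surrogates."""
    statement = normalize(query)
    return sorted(doc for doc, terms in surrogates.items() if statement in terms)


print(search("Psychology for children"))
print(search("child psychology"))
```

Both queries reach the same documents because the variant wording is resolved to one preferred term before matching, which is the purpose of vocabulary control.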



Components of an information retrieval system

• User-System Interface: The receiver of information-bearing documents
becomes a source, encoding a message in the form of an inquiry. When we
find any information in our store that appears to match the inquiry, we can
pass it to the enquirer, who can decide whether it matches his or her
requirements.

• The Matching Subsystem: It matches document representations against
the request representation; that is, when documents relevant to a query
have been located, a match has been achieved. A search engine acts as a
giant matching device. The matching subsystem has no direct influence on
the effectiveness of the complete system, but it plays a great role in overall
system efficiency.



The Information retrieval process

An IR system supports 3 basic processes:
1. The representation of the contents of the documents
2. The representation of the user's information need
3. Comparison of the 2 representations
These processes involve the elements shown in Figure 1:
➢ Round boxes represent processes
➢ Square boxes represent data

The basis for IR is that documents or information have been organized and
stored in a way that makes them searchable by users to meet their information needs:
• Documents in an IRS are processed into an index (Indexing process)
• A user formulates his or her problem/information need in a query (Query formulation)
• The IR software compares/matches the query to the index (Matching process)
• The user is presented with a set of retrieved documents, which he or she judges
for relevance or appropriateness in meeting the need (Feedback)
• The query can be modified if the retrieved documents are irrelevant
(Hiemstra, 2009).
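
The indexing, query formulation, matching and feedback steps can be illustrated with a small inverted index. This is a sketch under simplifying assumptions: the three documents are made up, `build_index` and `match` are hypothetical helper names, and matching requires every query term to be present.

```python
from collections import defaultdict

# Illustrative documents; identifiers and texts are made up for this sketch.
documents = {
    1: "library catalogue search",
    2: "web search engines",
    3: "library classification schemes",
}


def build_index(docs):
    """Indexing process: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def match(index, query):
    """Matching process: return the documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(documents)

first_try = match(index, "library search")  # query formulation
# Feedback: if the user judges the result too narrow, the query is modified.
second_try = match(index, "library")

print(sorted(first_try), sorted(second_try))
```

Dropping a term from the query broadens the result, which mirrors the feedback loop in which the user reformulates the query after judging the retrieved documents.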



Figure 1: Elements of an IRS



TOPIC 3
Types and Uses of Information Retrieval Systems



Types of IR systems

IRS can be categorized as in-house and online.

• In-house information retrieval systems
In-house IRS are developed by libraries or information centers to serve their clientele.
For example, a library catalogue for searching information items or checking their
availability.
• Online information retrieval systems
An online information retrieval system is one that has been designed to provide
a variety of users with access to remote database(s); via a computer terminal, users
directly interrogate a machine-readable database. The main features or characteristics
of online IRS are:
➢The terminal can be remote.
➢Time sharing, so several users can be online at one time.
➢Information is communicated instantaneously.
Online IRS provide users with remote, electronic access to online public access
catalogues (OPACs), which provide facilities for library users to carry out online
catalogue searches and then check the availability of the required information source,
and to commercial databases such as CD-ROM and academic databases.



Types of IR systems

Categorization of IR systems based on purpose, functions and contents


• Online database
➢ Provide access to peer reviewed scholarly information resources
➢ Are subscription or fee-based services
• Digital libraries and web information service
➢ Information is stored in digital formats
➢ Often free and accessed via the web
• Web search engines
➢ Free tools designed to search vast amounts of information resources on the
world wide web
• OPAC
➢ Searching library catalogues online by bibliographic details of documents such
as author name, title of document, keywords, call number, etc.



Everyday uses of IR systems

• IR systems were originally designed to retrieve information from text-based and
bibliographic databases. With the advancements in ICT, IR systems are used in
different aspects of our everyday lives, including:
• Search for information from library OPACs (Chowdhury, 2010)
• Access to e-books & e-journals (World public library at http://worldlibrary.net/,
Emerald at www.emeraldinsight.com)
• Access to information from email services and mobile phones
• Search for information on company or institutional intranets
• Access to web information via URLs, search engines, and subject gateways
(which provide links to more academic, reliable information)
• Access to information from social networking sites
• Access to information from bibliographic or full text databases, e.g.,
Web of Science, LISA



Activity

• Follow this link and watch the video for further understanding
of the IR process: https://www.youtube.com/watch?v=Y0CZmsel5Rs
• Read Chowdhury, pp. 6-7, to identify the functions of an IR system and write down
your notes



References

• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York:
Neal-Schuman Publishers, Inc.
• Columbia Electronic Encyclopedia (2013). Columbia University Press. Licensed from Columbia
University Press.
• Harns, S. (2013). Information retrieval. Retrieved from
https://www.scribd.com/document/142955883/SCOPE-OF-IR
• Hiemstra, D. (2009). Information retrieval models. In A. Goker and J. Davies (Eds.), Searching in the
21st century (pp. 1-19). Chichester, U.K.: Wiley. Retrieved from
http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf
• Information storage and retrieval (n.d.). Retrieved from
https://www.tititudorancea.com/z/information_storage_and_retrieval.htm
• Singh, K. (2013). Function of information retrieval. Retrieved from
https://www.scribd.com/document/312452773/Function-of-Information-Retrieval
• Robbins, D. (2000). Interactive Information Retrieval: Context and Basic Notions. Informing Science,
3(2), 57-61. Retrieved from http://inform.nu/Articles/Vol3/v3n2p57-62.pd



INFS 422: INFORMATION STORAGE AND RETRIEVAL

Session 2
Nature of Data and Documents in Information Retrieval

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year

Dr (Mrs) Florence Entsua-Mensah 1


Session Overview

A data retrieval system and an information retrieval system differ in the nature of
the data stored, organized and retrieved. Data in the former is structured, whilst data
in the latter is unstructured.
Again, a document is often explained as a written or printed paper or papers serving
as proof or evidence of something. In information retrieval, a document is simply
any form of data stored in the IR system.

At the end of the session, the student will be able to


• Explain the nature of data in an IR system and how it differs from a database
management system (DBMS)
• Understand the concept of a document in an IR system



Session Outline

The key topics to be covered in the session are:

• Topic One: The nature of data in an information retrieval system

• Topic Two: The concept of document in an information retrieval system



Reading List

• Greengrass, E. (2000). Information retrieval: a survey. Retrieved from
https://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Korfhage, R. R. (1997). Information storage and retrieval. New York: John Wiley
and Sons Inc.



TOPIC 1
The Nature of Data in an Information Retrieval System



What do Information Retrieval Systems aim to retrieve?

• In the whole operation of information retrieval one can recognize four phases:

1. Word retrieval: identifies the words that will adequately describe the
information.

2. Reference retrieval: identifies references that are probably pertinent to the
enquiry. A reference retrieval system is typified by the library card catalogue or
other indexes, which yield a complete reference to a document in response to a
general search request. Many mechanized retrieval systems provide
reference retrieval only.

3. Document retrieval: retrieves a complete copy of the document instead of just
a citation or reference.

4. Data retrieval: the sought information is extracted from the documents.

(CHAUHAN, 2023)



The nature of data in an information system

• The idea of information retrieval assumes that, there exist several documents or
records comprising data that have been arranged in a suitable order for easy
retrieval. This means the retrieved information can be represented in different
forms.

• The storehouse contains much bibliographic information, which is quite different
from other kinds of information or data.

• The database of the storehouse includes abstracts of some bibliographic
resources or full texts of documents, such as journal articles, conference
proceedings, newspaper articles, textbooks, encyclopedias, legal documents, and
statistical records, along with audio, graphics, images and video information.



The nature of data in an information system

• No matter what the database may contain, be it bibliographic resources,
full-text documents or multimedia information, the system assumes that
there exists a target group of users for whom the system is designed and
whose requirements it fulfils.

• While the primary content being conveyed does not possess a defined
structure, it generally comes packaged in objects, for example in files or
folders or documents, that themselves have some metadata, and are thus
a combination of structured and unstructured data, but normally it is
referred to as "unstructured data".



The nature of data in an information system

• An information retrieval system is designed to deal with unstructured data. The
major objective of an information retrieval system is to retrieve information:
either the actual information or the documents containing the information that fully or
partially match the user’s query.

• Information retrieval deals with unstructured data in response to a query.
Unstructured data is explained as:
➢ Natural language text and multimedia information such as photographic images,
audio, video, etc.

• Examples of "unstructured data" may include books, journals, documents,
metadata, health records, audio, video, analog data, images, files, and
unstructured text such as the body of an e-mail message, Web page, or
word-processor document.



The nature of data in an information system

• Users are considered to have certain queries or information needs, and
when they put forward their requirements to the system, the system should
be able to provide the necessary bibliographic references of those
documents containing the required information; some systems also
retrieve the actual text, image, table or chart relevant to the information
needs of the user.
• However, data in a data retrieval system is structured. Structured data:
➢ “consists of named components, organized according to some
well-defined syntax” or a systematic, orderly arrangement (Greengrass,
2000, p. 6)
• Structured data can be handled easily, as it can be easily entered,
stored, queried and analyzed.



Structured versus unstructured data

• Structured data (e.g., a DBMS of students or employees):
✓It is information that is already structured in fields, such as “name”,
“age”, “gender”, “hobby”, “address”, “profession”, “salary”.
✓This is the typical example of what we find in a record of a relational
database table.
✓When information is organized in a structured form, it is usually
relatively easy to search, since one can directly query the database.
✓Data elements are related, i.e., records have similar syntax and
meaning.
✓It is designed to retrieve specific facts or records, based on a common
attribute of the data elements (e.g., student ID or gender, employee age,
etc.).
✓It is easy for a DBMS to retrieve any record and its contents.
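
To make the points about structured data concrete, the sketch below uses Python's built-in `sqlite3` module. The `students` table, its columns and its rows are invented for illustration; the point is only that named fields let the database be queried directly by attribute.

```python
import sqlite3

# In-memory database with a hypothetical "students" table; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, gender TEXT)"
)
conn.executemany(
    "INSERT INTO students (student_id, name, age, gender) VALUES (?, ?, ?, ?)",
    [
        (101, "Ama", 21, "F"),
        (102, "Kofi", 23, "M"),
        (103, "Esi", 22, "F"),
    ],
)

# Retrieve one specific record by a named attribute (the student ID).
record = conn.execute(
    "SELECT name, age FROM students WHERE student_id = ?", (102,)
).fetchone()

# Retrieve all records sharing a common attribute value (gender).
female_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM students WHERE gender = ? ORDER BY student_id", ("F",)
    )
]

print(record, female_names)
conn.close()
```

Each query names the field it tests, so the DBMS can return an exact answer without interpreting the content of the record.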



Structured versus unstructured data

IR deals with unstructured data:
✓No clearly defined data elements. A random collection of documents will
hardly discuss the same topic.
✓Retrieves all types of documents; e.g., abstracts or full text documents
such as newspapers, dictionaries, encyclopedias, handbooks, audio,
video, images etc.
✓Retrieval from unstructured data is difficult and requires specialized
skills and advanced technology.
✓This information either does not have a pre-defined data model or is
not organized in a pre-defined order.
✓Unstructured information is typically text-heavy, but may contain
datasets such as dates, numbers, and facts as well.



TOPIC 2
The Concept of Document in an Information Retrieval System



Definition and examples of a document in an IR system

Definition
• In an IR system, a document is a much broader term than a printed or written
paper or papers. It is defined as:
➢ “a stored data record in any form” (Korfhage, 1997, p. 17).
• The word document as a general term could also include non-textual
information, such as multimedia objects. It includes:
• Books
• Informal writings such as
➢ Letters
➢ Messages
• Parts of a book, i.e.,
➢ A chapter
➢ A section
➢ A paragraph
• Graphics
• Sound/voice recordings
• Images
• Computer programs
• Data files
• Email messages, etc.



Document surrogates
• An IR system stores the full text of documents or surrogates
of the documents. Document surrogates are:
➢ “limited representations of full documents” (Korfhage, 1997, p. 21).

• Document surrogates can be explained as concise displays
representing actual objects using some of their metadata.

• Surrogates can also include elements from the document’s
actual content.



Types of Document surrogates

• Document Identifier – a number/code, e.g., an accession or classification number,
used for the purpose of inventory control or document location. It does not answer a
query or satisfy an information need.

• Bibliographic data/record – all the data elements used to identify, describe, or
retrieve a document/publication of information content, OR
➢A collection of data elements organized in a logical way to represent a
bibliographic item or document, publication or any record of human
communication.
➢Examples: author, title, publication date, publisher, ISBN, etc.
➢These are useful to the information seeker. For example, the date shows the
timeliness and appropriateness of the document.

• Keyword – one or a set of individual words chosen by the author/editor, or
sometimes dictated by the database, to represent the contents of the document



Types of Document surrogates

• Abstract – a brief one or two paragraph description of the contents of a paper,
often written by the author.
➢ Its purpose is to help a reader to determine whether the entire document
should be retrieved.
• Extract – “Artificially constructed surrogates created by someone other than the
author of a paper” (Korfhage, 1997, p. 23).
➢ May comprise the first sentence of each paragraph or significant words and
phrases in the document.
•Review – a critical article on a book, play, recital etc., written by someone other
than the author.
➢Its purpose is to indicate the value of the document with respect to other works
in the same field.
➢It can be retrieved separately to suit the purposes of a reader.
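
The surrogate types above can be pictured as fields of a single record. The sketch below is illustrative only: the `Surrogate` class, its field names and the sample values are hypothetical, and `make_extract` uses a naive rule (split paragraphs on blank lines, take the text up to the first full stop) to imitate an extract built from the first sentence of each paragraph.

```python
from dataclasses import dataclass


@dataclass
class Surrogate:
    """A limited representation of a full document (fields are illustrative)."""
    identifier: str   # e.g. an accession number
    author: str       # bibliographic data
    title: str
    year: int
    keywords: list
    extract: str


def make_extract(full_text):
    """Build an extract from the first sentence of each paragraph."""
    sentences = []
    for paragraph in full_text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive sentence boundary: the text before the first ". ".
        first = paragraph.split(". ")[0].rstrip(".")
        sentences.append(first + ".")
    return " ".join(sentences)


full_text = (
    "Indexes speed up searching. They map terms to documents.\n\n"
    "Surrogates stand in for documents. They save storage and reading time."
)

surrogate = Surrogate(
    identifier="ACC-0001",
    author="A. Author",
    title="A made-up paper",
    year=2023,
    keywords=["indexing", "surrogates"],
    extract=make_extract(full_text),
)
print(surrogate.extract)
```

A searcher can judge from the identifier, bibliographic fields, keywords and extract whether the full document is worth retrieving, without the system having to display the document itself.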



Activity

• Search the library and the Internet for examples of all the document surrogates in
this lesson, then examine and discuss your discoveries in the chat room on the Sakai
course site



References

• Greengrass, E. (2000). Information retrieval: a survey. Retrieved from
https://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf,
pp. 5-7
• Korfhage, R. R. (1997). Information storage and retrieval. New York: John Wiley
and Sons Inc., pp. 17-24



INFS 422: INFORMATION STORAGE AND RETRIEVAL

Session 3:
Models of Information Retrieval (Part 1)

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year

Dr. (Mrs.) Florence O. Entsua-Mensah 1


Session Overview

• Information retrieval technology is constantly evolving and developing
in response to increasing volumes of information (Google has indexed over
four billion web pages). These developments are associated with new IR
challenges, such as handling email spammers. Designing effective IR
systems to match the rapid changes and mitigate these new challenges
requires rigorous research and the application of new theories or
models. These models are very important. For example, they are the
blueprint for determining query formulation and the ranking of retrieved
documents.

At the end of the session, the student will be able to:
• Explain IR models, their purposes and importance in information retrieval
• Identify the two major classes of IR models and their features, and
• Explain the structure, advantages and disadvantages of the Boolean
model



Session Outline

The key topics to be covered in the session are:


•Topic One: Definition, importance and the two classes of IR
models
•Topic Two: Features of the Boolean Model



Reading List

• Chowdhury, G. G. (2010). Introduction to modern information retrieval
(3rd ed.). New York: Neal-Schuman Publishers, Inc.

• Hiemstra, D. (2000). Information retrieval models. In A. Goker and J.
Davies (Eds.), Searching in the 21st century (pp. 1-19). Chichester, U.K.:
Wiley. Retrieved from
http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf, pp. 3-12.



TOPIC 1
Definition, Importance and the Two Classes of IR Models



What are IR models?

• IR models can be differentiated by the way they represent the
documents and query statements, how the system matches the query
with the documents in the corpus to find the relevant ones, and how the
system ranks these documents.
• Information retrieval aims to help users find information, usually in
documents of an unstructured nature, that satisfies an information need
from within a large collection, usually stored on the web or on computers.
• The IR system needs to have components for the model representing the
documents and query statements, and the characteristic matching
function which evaluates the relevancy of the documents with respect to
the provided user query.
• A model is an abstract representation of a process or an object to
facilitate making of predictions and drawing of conclusions.



What are IR models?

• IR models describe the human and computer interaction process involved in
information retrieval (Callan, 2003). They are:
➢The theoretical basis for computing the answer to a query. They guide the expression
of queries & the representation of documents in an IR system (Fuhr, 2001).
➢The basis for prediction & explanation of the relevant documents retrieved by a user’s
queries.

• The information retrieval model needs to provide the framework for the system to
work and define the many aspects of the retrieval procedure of the retrieval
engines.
• The IR model has to specify how the documents in the collection and
the user’s queries are transformed.
• The IR model also needs to define how the system identifies
the relevancy of the documents based on the query provided by the user.
• The system in the information retrieval model also needs to incorporate the logic
for ranking the retrieved documents based on their relevancy.
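
The three responsibilities listed above (representation, a matching function, and ranking) can be sketched as three small functions. This is an assumption-laden toy, not one of the models named in this course: `represent`, `score` and `rank` are hypothetical names, and relevancy is approximated by counting the terms a document shares with the query.

```python
def represent(text):
    """Representation: turn a document or a query into a set of lowercase terms."""
    return set(text.lower().split())


def score(query_terms, doc_terms):
    """Matching function: count the terms shared by query and document."""
    return len(query_terms & doc_terms)


def rank(query, collection):
    """Ranking: order documents with a non-zero score from best to worst."""
    query_terms = represent(query)
    scored = [
        (score(query_terms, represent(text)), doc_id)
        for doc_id, text in collection.items()
    ]
    relevant = [(s, doc_id) for s, doc_id in scored if s > 0]
    # Sort by descending score, breaking ties by document identifier.
    relevant.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in relevant]


collection = {
    "d1": "information retrieval models",
    "d2": "retrieval of information from the web",
    "d3": "cooking with spices",
}
ranking = rank("information retrieval models", collection)
print(ranking)
```

Swapping in a different `represent` or `score` function changes the model while the overall framework stays the same, which is the sense in which a model is a blueprint for the system.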



Importance of IR models

1. A master plan for implementing IR systems.
2. Guides academic research & discussion.
3. Facilitates efficient retrieval of relevant documents.

• It guides the experimentation necessary for the development of IR tools:

➢Aids avoidance of trial and error in the development of IR technology,

➢Aids management of information overload, esp. on the Internet, &

➢Provides solutions to constantly emerging IR challenges & problems,
such as email spammers (Hiemstra, 2000).



The 2 major classes of IR Models

Cognitive/User-centered models
They adopt a holistic approach to IR, covering information
organization, document representation, query formulation
and their algorithms, as well as the information behaviour of
users. Thus they include:
➢User information needs & query formulation methods
➢Human-computer interactions during the search process
➢The socio-cognitive environment of the search process
➢Mode of information use in satisfying a need, plus
➢Matching the query to information stored in the IR system
(Chowdhury, 2010)



The 2 major classes of IR Models

System-centered models
➢Includes search & retrieval of relevant documents from the IR
system
➢The hardware & software needed for the representation of
documents and their retrieval and the problems associated with
them
➢Computer programs needed for matching queries with stored
documents to produce output (i.e., the retrieved documents)
➢Algorithms (i.e., procedures, rules or formulas for solving problems,
often in the field of computer science and mathematics) needed for
improved ranking of documents (Robbins, 2000)
➢(This course will focus on system-centered models)



Types of System-centered IR models

•There are many system centered models but the three most
important ones are the:
➢Boolean model
➢Vector model
➢ Probabilistic model
•These models are important because they are designed to
manage large collections such as web pages of the Internet
(Inkpen, n.d.)



TOPIC 2
The Boolean Model



The Boolean Model

•The Boolean model is a type of IR model that allows users to logically relate
multiple concepts together to define what information is needed.
•The Boolean model is the first model of information retrieval and
the simplest retrieval model; it retrieves information on
the basis of a query given as a Boolean expression. It is based on a
system of symbolic logic formulated by George Boole and has three
operators:
1) OR – logical sum (+)
2) AND – logical product (×)
3) NOT – logical difference (-)
•It is described as the exact match model, i.e., documents are
either retrieved or not, but are not ranked.
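
The three operators correspond directly to set operations, which a short Python example can show. The document identifiers in `x_docs` and `y_docs` are made up for illustration; each set stands for the documents indexed with one term.

```python
# Each set holds the identifiers of documents indexed with a term (made-up data).
x_docs = {1, 2, 4}
y_docs = {2, 3, 4, 5}

either = x_docs | y_docs      # OR: logical sum (union)
both = x_docs & y_docs        # AND: logical product (intersection)
only_x = x_docs - y_docs      # NOT: logical difference (x NOT y)

print(sorted(either), sorted(both), sorted(only_x))
```

Every document is either in a result set or not; nothing in these operations assigns a score, which is why the model produces no ranking.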



Explaining the Boolean model

• In the Boolean model, documents are represented by a set of index terms
or keywords.
• Using the Boolean operators, the sets of documents matching the terms in a
query can be combined to form a whole new set of documents.
• The Boolean AND of two logical statements x and y means that both x
AND y must be satisfied; it yields a set of documents that is smaller than
or equal to the set retrieved by x or y alone.
• The Boolean OR of these same two statements means that at least
one of the statements must be satisfied; it fetches a set of documents
that is larger than or equal to the set retrieved by x or y alone.
• Any number of logical statements can be combined using the three
Boolean operators.
• Users’ queries are Boolean expressions of keywords connected by
the Boolean operators. For example:



Explaining the Boolean model

• The query term corruption simply defines all documents in the IR system indexed with the term corruption.
• The query corruption AND politics will retrieve a document D1 only if D1 contains both terms. Thus AND means all.
• The query corruption OR politics will retrieve a document D1 if D1 contains either or both terms. Thus OR means any.
• The query corruption NOT politics will retrieve a document D1 only if D1 contains corruption and not politics (Greengrass, 2000).
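The three query types above can be sketched with Python sets. This is a minimal illustration rather than part of the original slides: the document IDs, their index terms and the helper name `postings` are all hypothetical.

```python
# Hypothetical toy collection: document ID -> set of index terms.
docs = {
    "D1": {"corruption", "politics"},
    "D2": {"corruption", "economy"},
    "D3": {"politics", "election"},
    "D4": {"health"},
}

def postings(term):
    """Return the set of document IDs indexed with the given term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

corruption = postings("corruption")
politics = postings("politics")

and_result = corruption & politics   # corruption AND politics
or_result = corruption | politics    # corruption OR politics
not_result = corruption - politics   # corruption NOT politics

print(sorted(and_result))  # ['D1']
print(sorted(or_result))   # ['D1', 'D2', 'D3']
print(sorted(not_result))  # ['D2']
```

Each result is an unranked set: a document is either in it or not, which is why the Boolean model is called an exact match model.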



Boolean searching using Venn diagrams



Advantages of Boolean Operators

•Easy to implement, therefore extensively used in the design of IR systems, e.g., OPACs, CD-ROMs, online databases, web search engines, etc.
•Gives expert users a sense of control and transparency over documents that are retrieved (i.e., it is easy to understand why a document was retrieved or not).
•Has great expressive power, i.e., offers users multiple ways to express their queries for desired results.
•Offers multiple techniques for broadening/narrowing a search (Manning, Raghavan, & Schütze, 2009)



Disadvantages of Boolean Operators

•No ranking of retrieved documents. All retrieved documents are accorded the same weight/importance, without being ranked by relevance.
•Novice users are often unable to use the right combinations of the operators in searches involving multiple query terms.
•Output is difficult to control, especially when the OR operator is used (information overload), and users are often not able to modify their queries to limit the number of retrieved documents (Manning, Raghavan, & Schütze, 2009)
•This model requires a Boolean query instead of free text.



Disadvantages of Boolean Operators contd.

• Users formulate inefficient Boolean queries because of their knowledge of the English language. For example, the meaning of AND or OR in natural language differs from its meaning in a Boolean query. Thus the AND operator is often substituted for the OR operator.
• E.g., someone with a desire for a specific entertainment may state it as:
➢ dinner AND comedy show AND movie instead of
➢ dinner OR comedy show OR movie or better still
➢ dinner AND (comedy show OR movie)
• Users are unfamiliar with the rules of precedence for logical connectives
➢ There are two standards, both of which rely on parentheses to group terms together.
➢ Combinations within parentheses are evaluated first, before being combined with terms outside the parentheses



Boolean Rules of Precedence

•In a type 1 system – within the ( ), NOT is applied first, then AND, and lastly OR, from left to right. E.g., A OR B AND C = A OR (B AND C)

•A type 2 system – applies a left-to-right order of precedence, irrespective of operators. E.g., A OR B AND C = (A OR B) AND C (Korfhage, 1997)
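The practical effect of the two precedence rules can be seen by evaluating the same query both ways. The sketch below uses Python sets with hypothetical posting sets A, B and C; explicit parentheses stand in for each system's grouping.

```python
# Hypothetical sets of document IDs matching the terms A, B and C.
A = {"D1", "D2"}
B = {"D2", "D3"}
C = {"D3", "D4"}

# Type 1 system: AND binds more tightly than OR.
type1 = A | (B & C)

# Type 2 system: strict left-to-right evaluation.
type2 = (A | B) & C

print(sorted(type1))  # ['D1', 'D2', 'D3']
print(sorted(type2))  # ['D3']
```

The same query text retrieves different documents under the two rules, so writing the parentheses explicitly is the safer habit.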



Activity

•Create your own Boolean search queries using all three operators: NOT, OR, and AND.

•Search them on the Internet and discuss the differences in the volume of retrieved documents.

•Which operators narrow or broaden the search?



References

• Callan, J. (2003). Retrieval models: Boolean and Vector Space. Retrieved from
https://www.scribd.com/presentation/351797748/03
• Fuhr, N. (2001). Models in Information Retrieval. In: Agosti M., Crestani F., Pasi G. (eds) Lectures
on Information Retrieval. Heidelberg, Berlin: Springe
• Greengrass, E. (2000). Information retrieval: A survey. Available at
http://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Inkpen, D. (n.d.). Information retrieval on the Internet. Retrieved from
http://site.uottawa.ca/~diana/csi4107/IR_draft.pdf
• Korfhage, R. R. (1997). Information storage and retrieval. New York: John Wiley and Sons Inc.
• Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval.
Retrieved from https://nlp.stanford.edu/IRbook/pdf/irbooko
• Robbins, D. (2000). Interactive Information Retrieval: Context and Basic Notions. Informing
Science, 3(2), 57-61. Retrieved from http://inform.nu/Articles/Vol3/v3n2p57-62.pdf



INFS 422: INFORMATION STORAGE AND RETRIEVAL

Session 04:
Models of Information Retrieval (Part 2)



Session Overview

This session will discuss the Statistical models, that is, the Vector
Space and the Probabilistic models. They are referred to as statistical
models because they use statistical information to determine relevance
of documents to queries. They are also described as the best match
models because they are able to predict the degree of relevance of a
document to a query, i.e., retrieved documents are ranked with the
most relevant documents listed first.
At the end of the session, the student will be able to
• Identify the characteristics, advantages and limitations of the
statistical models.
• Explain the underlying principles and structure of the Vector Space
and the Probabilistic models.



Session Outline

The key topics to be covered in the session are:


• Topic One: The characteristics, advantages and limitations
of the statistical models.
• Topic Two: The Vector space model: Principle and structure
• Topic Three: The Probabilistic model: Principle and structure



Reading List

• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York: Neal-Schuman Publishers, Inc.
• Hiemstra, D. (2000). Information retrieval models. In A. Goker and J. Davies (Eds.), Searching in the 21st century (pp. 1-19). Chichester, U.K.: Wiley. Retrieved from
http://wwwhome.cs.utwente.nl/~hiemstra/papers/IRModelsTutorial-draft.pdf. 3-12
• Information retrieval models (n.d.). Retrieved from
http://aspoerri.comminfo.rutgers.edu/InfoCrystal/Ch_2.html
• Kuyoro, S. O., & Oludele, A. (2012). Information retrieval: An Overview. International Journal of Advanced Research in Computer Science, 3(5), 175-178. Retrieved from
https://www.academia.edu/21988230/Information_Retrieval_An_Overview

TOPIC 1: The Characteristics, Advantages and Limitations of the Statistical Models


The characteristics of the Statistical models

• The statistical models use statistical information, such as term frequencies, to determine the relevance of documents to queries.
• They are associated with free-text queries, i.e., queries are expressed in one or two words of normal human language rather than in a specified query expression with operators such as the Boolean operators.
• The output, i.e., the documents retrieved, is ranked/ordered by degree of relevance, with the documents most relevant to the query listed first



The characteristics of the Statistical models

•This requires assigning scores/weights to the query terms and to the representations of the documents in a collection.
•The Vector Space and Probabilistic models are two examples of the statistical models (Information Retrieval, n.d.).



The advantages of the Statistical models
• They solve some of the problems of Boolean retrieval, e.g.,
➢They provide relevance ranking of retrieved documents.
• Users have control over output and are able to set the number of documents to be displayed per search
• Easy query formulation, no special skills required: users can use natural language, followed by automatic extraction of keywords from the query
• Uncertainty about what concepts to use in query formulation can be accommodated because of the use of natural language



The disadvantages of the Statistical models

• They lack the expressive power of the Boolean approach.
➢E.g., the NOT operation cannot be expressed because only positive scores are used
➢Also, (A AND B) OR (C AND D) cannot be represented
• They do not support phrase and proximity searches.
• For optimal performance, queries have to contain a large number of words.



The disadvantages of the Statistical models contd.

• Calculation of relevance scores involves a lot of computing, which makes it very expensive.
• The ranked list provides users with a limited view of the available information
➢E.g., the retrieved documents do not suggest how to refine/modify a query, where necessary, to retrieve more or less information
• Users find it difficult to determine the appropriate words used in the representation of the relevant documents (Information Retrieval, n.d.).



TOPIC 2: The Vector Space Model: Principle and Structure


The Vector Space Model
Principle
• It is based on Luhn's (1957) similarity criterion, which states:
• “The more two representations agreed in given elements and their distribution, the higher would be the probability of representing similar information” (as cited in Hiemstra, 2009, p. 7)
• That is, relevance is determined by the degree of similarity between the properties of the document (i.e., the index terms used to represent the document) and the query terms.
• In simple terms:
➢The more similar a document vector is to a query vector, the more likely it is that the document is relevant to that query.
➢The words used to define the dimensions of the space are orthogonal or independent.



The Vector Space Model

• The vector space model represents the documents and queries as vectors in a multidimensional space, whose dimensions are the terms used to build an index to represent the documents [Salton 1983]
• The vector space model can assign a high ranking score to a document that contains only a few of the query terms if these terms occur infrequently in the collection but frequently in the document.



Features of the Vector Space Model

Document and query terms are represented as vectors in a multidimensional space and modelled on algebraic rules.
(A vector is a point in a vector space. It has two attributes, direction and length, i.e., it has a dimension and a value.)

Features of the Vector Space Model (VSM)

• The dimensions of the vectors are index terms representing the documents
• The terms can be single words, keywords or phrases
• The attributes or properties of both document and query terms are computed or weighted using specific measures and represented as vectors
• Relevance of a document to a query is determined by the similarity of their vectors (the implication is that queries are also weighted or have specific values)

Assumptions of the Vector Space Model

• The more similar a document vector is to a query vector, the more likely it is that the document is relevant to that query.
• The words used to define the dimensions of the space are orthogonal or independent.
• The similarity assumption is a realistic approximation, whereas the assumption that words are pairwise independent does not hold in realistic scenarios.



Computing values or dimensions of vectors
• Assigning appropriate values to vectors is known as term
weighting
➢Term weighting is the process of assigning numerical values to
terms based on their statistical distribution, i.e., frequencies of
occurrence of terms in documents, document collections, or
subset of documents such as relevant documents to a query
(Frakes, n.d.)
• There are several term weighting schemes. The most common and the best weighting scheme in IR was proposed by Salton and Yang (1973).
• It is called tf.idf weighting, a combination of the term frequency (tf) and the inverse document frequency (idf)
➢tf = the number of occurrences of a term in a document
➢idf = a value inversely related to the document frequency (df)

Computing the term frequency-inverse document frequency of a term
• It is the product of its tf weight and its idf weight, that is:

$$w_{t,d} = \log(1 + tf_{t,d}) \times \log_{10}(N / df_t)$$

• $tf_{t,d}$ is the number of occurrences of a term t in a document d
• $df_t$ is the document frequency, the number of documents in the collection that contain the term t
• $N$ is the total number of documents in the collection (Hiemstra, Greengrass)
• NB
➢tf is greater when the term is frequent in a document.
➢A document with 10 occurrences of a term t (tf = 10) is considered more relevant than a document with one occurrence of the term (tf = 1)
➢idf is greater when the term is rare in the collection, i.e., it occurs in only a few documents
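The tf.idf weight can be computed in a few lines. The sketch below follows the formula w = log(1 + tf) × log10(N / df); the slide does not state the base of the first logarithm, so base 10 is assumed here, and the function name `tf_idf` and the numbers are hypothetical.

```python
import math

def tf_idf(tf, df, n_docs):
    """Weight of a term in a document: log(1 + tf) * log10(N / df).

    Assumption: the first logarithm is taken to base 10, since the
    source formula leaves its base unspecified.
    """
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(n_docs / df)

# A term occurring 9 times in a document, found in 10 of 1000 documents.
rare = tf_idf(tf=9, df=10, n_docs=1000)

# The same term frequency, but the term occurs in every document.
common = tf_idf(tf=9, df=1000, n_docs=1000)

print(rare > common)  # True
```

A term that appears in every document gets an idf of zero, so it contributes nothing to the ranking however often it occurs in one document.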
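Computing the similarity between query and document vectors

$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{V} q_i d_i}{\sqrt{\sum_{i=1}^{V} q_i^2}\,\sqrt{\sum_{i=1}^{V} d_i^2}}$$

•$q_i$ = tf-idf weight of term i in the query
•$d_i$ = tf-idf weight of term i in the document
•$V$ = the number of terms (dimensions) in the vector space
•cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.
•With non-negative tf-idf weights, the cosine similarity of two vectors ranges from 0 to 1
•A value of 1 means q and d point in the same direction (maximally similar); a value of 0 means they have no terms in common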
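A small worked example may make the formula concrete. The two-term vectors are hypothetical. For a query $\vec{q} = (1, 1)$ and a document $\vec{d} = (1, 0)$ that contains only the first term:

$$\cos(\vec{q}, \vec{d}) = \frac{1 \cdot 1 + 1 \cdot 0}{\sqrt{1^2 + 1^2}\,\sqrt{1^2 + 0^2}} = \frac{1}{\sqrt{2}} \approx 0.71$$

For a second document $\vec{d'} = (2, 2)$, which points in the same direction as the query:

$$\cos(\vec{q}, \vec{d'}) = \frac{1 \cdot 2 + 1 \cdot 2}{\sqrt{2}\,\sqrt{8}} = \frac{4}{4} = 1$$

The second document scores the maximum of 1 even though its vector is longer, because the cosine measures direction only and ignores vector length.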

Steps in ranking of documents in the VSM

1. Calculate the tf.idf term weights for both the query and each document
2. Represent the query and each document as a weighted tf.idf vector
3. Compute the cosine similarity between the query vector and each document vector
4. Rank documents according to cosine(query, document) in decreasing order
5. Return the top ten (10) retrieved documents to the user (Teufel, n.d.).
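The five steps can be sketched end to end on a toy collection. Everything below is illustrative: the three documents, the query, the whitespace tokenizer and the helper names are hypothetical, and base-10 logarithms are assumed throughout.

```python
import math
from collections import Counter

# Hypothetical toy collection and query.
docs = {
    "d1": "information retrieval models rank documents",
    "d2": "boolean retrieval uses operators",
    "d3": "cooking recipes for dinner",
}
query = "retrieval models"

tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({t for toks in tokenized.values() for t in toks})
n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = {t: sum(t in toks for toks in tokenized.values()) for t in vocab}

def weight(tf, term):
    """Step 1: tf.idf weight, log10(1 + tf) * log10(N / df)."""
    if tf == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(n_docs / df[term])

def vectorize(tokens):
    """Step 2: one weight per vocabulary term."""
    counts = Counter(tokens)
    return [weight(counts[t], t) for t in vocab]

def cosine(q, d):
    """Step 3: cosine of the angle between two vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q_vec = vectorize(query.split())
scores = {name: cosine(q_vec, vectorize(toks)) for name, toks in tokenized.items()}

# Steps 4 and 5: highest cosine first, keep at most the top ten.
ranking = sorted(scores, key=scores.get, reverse=True)[:10]
print(ranking)  # ['d1', 'd2', 'd3']
```

Document d1 contains both query terms and ranks first; d3 shares no terms with the query and scores zero.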



Ranking of documents in the VSM

In the VSM, documents are ranked according to their proximity to the query. In this diagram, document (d1) will be more relevant to the query (q) than document (d2) based on the proximity. In other words, the smaller the angle between the query and the document vector, the more relevant the document.

TOPIC 3: The Probabilistic Model: Principle and Structure


Setting the stage for the Probabilistic model
• A user approaches an IR system with an information need
• The information need is translated into a query representation
• Similarly, there are documents that have been represented with index terms
• The system has to match the query with documents that are relevant to the query
• This matching is done without a clear understanding of the information needs of the user, introducing an element of uncertainty as to whether the documents retrieved will be relevant to the user’s query
• Probability theory provides an antidote to the uncertainty of relevance of a retrieved document to a query by providing the principles for estimating the relevance of a document to satisfying the information needs of a user (Manning, Raghavan, & Schütze, 2009).



The Probabilistic Model

• It is based on the Probability Ranking Principle (PRP), which states that
• “the function of an information retrieval system is to rank the documents based on their probability of relevance to the query, given all the evidence available” [Belkin and Croft 1992]
• Documents are ranked in order of decreasing probability of usefulness or relevance to a user.
• It is a manual process which involves the calculation of the probability that a document will be relevant to a user.

Maron and Kuhns’ Probabilistic Retrieval Model

• They advocate the calculation of a probability for each document in the collection, i.e., will a user submitting a query judge that document relevant?
• This probability for a particular document (Dm), based on a query term B, is:

$$P = \frac{\text{number of users who consider the document relevant to the query term } B}{\text{total number of users who submitted the query } B}$$

• In the absence of statistical evidence, the model assumes that users submitting a query (B) will judge the document (Dm) relevant (Chowdhury, 2010).
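The ratio above is simple to compute once relevance judgements have been collected. The sketch below is illustrative only: the function name and the counts are hypothetical, and returning `None` when nobody has submitted the query is a design choice made here to signal that no statistical evidence exists yet.

```python
def relevance_probability(relevant_judgements, total_submissions):
    """Fraction of users submitting query term B who judged document Dm relevant."""
    if total_submissions == 0:
        # No statistical evidence yet; the caller must fall back on an assumption.
        return None
    return relevant_judgements / total_submissions

# Hypothetical counts: 40 users submitted the query, 30 judged Dm relevant.
p = relevance_probability(30, 40)
print(p)  # 0.75
```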
Robertson and Sparck Jones Probabilistic Retrieval model

• Robertson and Sparck Jones adopted a different probabilistic principle, proposed by Robertson (1977), as follows:
• “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data” (Robertson, cited in Hiemstra, p. 11).

Robertson and Sparck Jones Probabilistic Retrieval model
• Propositions of the model
• Given a user’s query, there is a set of documents which contains exactly the relevant documents (called the ideal answer set).
• The purpose of the query is to specify the properties of the ideal answer set
• These properties are not known, but the system makes a guess and presents an initial set of documents
• The user inspects the top retrieved documents to identify the relevant ones
• The system uses this information to refine its description of the ideal answer set
• Repetition of this process eventually improves the description of the ideal answer set



Robertson and Sparck Jones Probabilistic Retrieval model

•Modelling the ideal answer set in probabilistic terms
•Given a query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj relevant, based solely on the query and document representations.
•The ideal answer set is referred to as R; the documents in R are predicted to be relevant



Robertson and Sparck Jones Probabilistic Retrieval model
• The probabilistic ranking is computed as:
➢sim(dj, q) = P(R | dj) / P(¬R | dj), i.e., the ratio of the probability that the document dj is relevant to the probability that it is not relevant
➢P(R) stands for the probability that a document randomly selected from the document collection is relevant (Inkpen, n.d.)
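The ratio sim(dj, q) can be illustrated with made-up numbers. In the sketch below the probability estimates are hypothetical (in practice they must be estimated from the query and document representations), and P(¬R | dj) is taken as 1 − P(R | dj).

```python
# Hypothetical estimates of P(R | dj) for three documents.
p_relevant = {"d1": 0.75, "d2": 0.5, "d3": 0.25}

def odds(p):
    """Return P(R | dj) / P(not R | dj), where P(not R | dj) = 1 - P(R | dj).

    Assumes p < 1; a probability of exactly 1 would divide by zero.
    """
    return p / (1.0 - p)

sim = {doc: odds(p) for doc, p in p_relevant.items()}

# Rank in order of decreasing odds of relevance.
ranking = sorted(sim, key=sim.get, reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

A ratio above 1 means the document is judged more likely relevant than not; ranking by the ratio gives the same order as ranking by P(R | dj) itself.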

The Probabilistic model is not as popular as the Boolean and Vector Space models, although many experiments have shown that it can yield good results. Currently it is applied in spam filtering (Chowdhury, 2010)



Activity

•To see how the VSM works, create a series of queries and search on the Internet for documents with similar terms in your queries. Note the number of documents with similar terms to the query. Do you agree with the ranking of the retrieved documents with regard to your queries? Discuss your findings in the Chatroom on the Sakai course site.



References
• Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12), 29-38. Retrieved from
https://s3.amazonaws.com/academia.edu.documents/30740540/InformationFilteringAndInformationRetrieval.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1515638296&Signature=91MO6mpxaaWiScOsHVuMefzAVOg%3D&response-content-disposition=inline%3B%20filename%3DInformation_filtering_and_information_re.pdf
• Frakes, W. B. (n.d.). Introduction to information systems. Retrieved from
http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap01.htm
• Greengrass, E. (2000). Information retrieval: A survey. Available at
http://www.csee.umbc.edu/csee/research/cadip/readings/IR.report.120600.book.pdf
• Inkpen, D. (n.d.). Information retrieval on the Internet. Retrieved from
http://site.uottawa.ca/~diana/csi4107/IR_draft.pdf
• Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to Information Retrieval. Retrieved
from https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
• Teufel, S. (n.d.). Term weighting and Vector Space model. Retrieved from
https://www.cl.cam.ac.uk/teaching/1415/InfoRtrv/lecture4.pdf



INFS 422: INFORMATION STORAGE AND RETRIEVAL

Session 5 – Subject Analysis and Representation



Session Overview

The major function or purpose of an IR system is to match a user’s query to the contents of documents for the retrieval of relevant documents. This can be achieved by analyzing and preparing a surrogate for each document and organizing it in an orderly manner. The process of preparing document surrogates and assigning specific descriptors to documents is known as indexing. Indexing can be done manually by experts or automatically by the use of computer software.
At the end of this session, the student will be able to
• Explain the meaning, purpose, composition and steps of the indexing process.
• Explain the advantages and disadvantages of both manual and automatic
indexing, and
• Understand the parameters for determining the effectiveness of an index



Session Outline

The key topics to be covered in the session are:

• Topic One: Meaning, purpose, composition and steps in indexing.

• Topic Two: Definition, advantages and disadvantages of manual and automatic indexing

• Topic Three: Parameters for controlling index effectiveness



Reading List

• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York: Neal-Schuman Publishers, Inc.



TOPIC 1: Meaning, Purpose, Composition and Steps in Indexing


Meaning of an index

Most IR systems do not include the full text of documents. This is because using the raw documents alone hinders retrieval, since the only way to access the required information is by reading linearly. To ensure efficient retrieval of information, document surrogates are generated and used in place of, or together with, the full-text documents (Korfhage, 1997).

• Indexing is the analysis of a given document's content and the representation of the analysis by appropriate descriptors or keys (Chowdhury, 2010).

• Thus, it is the process of constructing document surrogates by assigning identifiers to text items.

• An index is the set of selected terms and their locations in an individual document or group of documents (Korfhage, 1997)



Composition of an index

• Index terms may be single words, longer phrases, or both

• An index is constructed using an indexing language or vocabulary

• An indexing language may be controlled, i.e., it uses a predefined set of index terms, or uncontrolled, i.e., it allows any term based on broad criteria



Purposes of an index

1. Permit location of documents by topic: indexes act as selection guides to the contents of materials

2. Define topic areas: indexes serve as a tool for document analysis

3. Predict the relevance of a document to a specified information need (Korfhage, 1997): indexes allow users to familiarize themselves with a document and decide if they need to explore it further.

4. Help decide on the optimum number of subject entries, and thus economize on the bulk and cost of cataloguing and indexing.



TOPIC 2: Definition, Process, Advantages and Disadvantages of Manual and Automatic Indexing


Manual Indexing

• Manual indexing – identification and description of terms in documents by experts/specialists/trained indexers.

• Manual indexing makes use of an uncontrolled indexing language. This indicates the intellectual effort taken by the author to identify and describe the content of a document.

• Traditional, commonly used manual systems for compiling indexes of documents make use of cards, such as library catalogue cards, but nowadays a good computerised personal reference system is to be preferred.



Steps in manual indexing

1. Analyze the subject content: It is important to determine exactly what the given document is about. This is termed the “aboutness” of the document.

2. Identify keywords: After collecting information about the “aboutness” of a document, the indexer needs to represent it in a way suitable for matching users’ queries.

3. Standardize keywords: Here, the indexer chooses the appropriate keywords by extracting words directly from the document or with the guidance of vocabulary control devices.

4. Choose an indexing system: The indexer decides to choose a post-coordinate system or a pre-coordinate system.

5. File the entries



The two types of indexing systems

Post-coordinate systems

➢ One entry is prepared for each keyword selected to represent a given document

➢ All entries are organized in a file

➢ User’s query terms are identified and matched against the file of index terms for the retrieval of relevant documents (Chowdhury, 2010).
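A post-coordinate file can be pictured as one entry per keyword, with the coordination of terms postponed until search time. The sketch below is a simplified illustration: the documents, keywords and the `search` helper are hypothetical, and matching is reduced to an intersection of entries.

```python
from collections import defaultdict

# Hypothetical documents and the keywords an indexer selected for each.
assigned = {
    "doc1": ["libraries", "automation"],
    "doc2": ["libraries", "cataloguing"],
    "doc3": ["automation", "indexing"],
}

# One entry per keyword: keyword -> set of documents it represents.
index = defaultdict(set)
for doc, keywords in assigned.items():
    for kw in keywords:
        index[kw].add(doc)

def search(*terms):
    """Coordinate the query terms at search time by intersecting their entries."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("libraries", "automation"))  # {'doc1'}
```

The indexer never had to anticipate the combination "libraries + automation"; it is formed only when a user asks for it.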



The two types of indexing systems

Pre-coordinate systems

➢ A document is represented by a heading made up of a chain or string of terms

➢ The chain of words represents or defines the full subject content of the document

➢ The component words are synthesized based on an indexing language

➢ Pre-coordinated indexes arranged alphabetically are called alphabetical subject indexes or alphabetical subject catalogues

➢ Those arranged according to a classification scheme are known as classified indexes or classified catalogues (Librarianship Studies & Information Technology, 2017)



Advantages and Disadvantages of Manual Indexing

Advantages
➢ Use of uncontrolled language gives the indexer flexibility in choosing document descriptors

Disadvantages
• Lack of consistency – indexers may not assign the same index term to a given document.
• Varying levels of specificity and exhaustivity are attained, based on the different perspectives of indexers
• Use of controlled vocabulary may hinder accuracy, i.e., indexers may not represent the document accurately, especially where new words are introduced in the documents.
• Indexer-user mismatch – the same concept may be represented differently by indexer and user, e.g., meats and poultry
• Pre-coordination – subsets of terms in manual indexing are often represented by a single term, e.g., gas, oil and coal are represented by fuel. This may hinder recall



Automatic Indexing

• Automatic indexing is defined as “the process of assigning and arranging index terms for natural language without human intervention” (Tulic 2005, cited in Obaseki, 2010).

• It is based on algorithms or well-defined rules.

• When the assignment of the content identifier is carried out with the aid of modern computing equipment, the operation becomes automatic indexing.

• All automated indexes derive from the frequency of occurrence of words within a document
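Frequency-based automatic indexing can be sketched in a few lines. This is a deliberately simplified illustration, not a description of any particular indexing product: the sample text, the small stop-word list and the function name `extract_index_terms` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real systems use much longer ones.
STOP_WORDS = {"to", "and", "on", "the", "of", "a"}

def extract_index_terms(text, k=3):
    """Return the k most frequent non-stop-words as candidate index terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(k)]

text = ("Information retrieval systems store documents. "
        "Retrieval systems match queries to documents, "
        "and retrieval depends on indexing.")

terms = extract_index_terms(text)
print(terms[0])  # retrieval
```

The same rule applied to the same text always yields the same terms, which is the consistency that human indexers cannot guarantee.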



Advantages of Automatic Indexing

1. It is faster and easier to produce, because most of the processes are performed by a machine. E.g., Microsoft Word and Adobe FrameMaker can do automatic indexing.

2. Easily retrieved and modified in case of an error

3. Easily transferred through ICTs (Obaseki, 2010)

4. Consistency in indexing is assured

5. Cost of producing index entries tends to be lower in the long run

6. There is better retrieval effectiveness (Chowdhury, 2010)



Problems of Automatic Indexing

• An automatic index is free from the bias of a human indexer, but may contain bias introduced into the algorithms by the system programmer.

• Professional indexers (who are often librarians) lack the technological skills required for automatic indexing

• The indexing process is intellectually demanding and rigorous, therefore some LIS professionals would rather not do it

• Some librarians cast doubt on the quality of automated indexing. They think the software cannot match the intellect of the human indexer (Obaseki, 2010)



TOPIC 3: Parameters for Controlling Index Effectiveness


Parameters for Controlling index effectiveness

Index effectiveness is controlled by two parameters:
➢ indexing exhaustivity and term specificity

Indexing exhaustivity

• It is the breadth of coverage of index terms, i.e.,

• The extent to which all index terms and concepts in a document are covered

➢ Achieved by selecting many keywords to represent all ideas, concepts and topics discussed in the document

➢ A non-exhaustive indexing system will use few keywords to give a broad representation of the subject



Parameters for controlling index effectiveness

Term specificity

• It refers to the depth of coverage, i.e.,

• The extent to which all topics/concepts are indexed in detail.

➢ An indexing language on the topic birds that excludes types or categories of birds is less specific.
➢ An indexing language that includes various kinds of birds but does not include all kinds of birds is described as more specific but less exhaustive

• The more specific the terms, the better the representation of the subject (Korfhage, 1997; Chowdhury, 2010)



The impact of exhaustivity and specificity on an IR system

Exhaustivity

• If a search term is more exhaustive, recall (i.e., the proportion of relevant materials retrieved by a system) is higher.
➢ E.g., searching for information about the World Wide Web using related terms like “net”, “Internet” and “superhighway” retrieves more information, most of which is likely to be relevant

• Thus high exhaustivity of indexing ensures high recall

• However, high levels of exhaustivity also decrease the level of precision (i.e., the proportion of retrieved documents that are relevant), and a large number of non-relevant materials are retrieved.
➢ The low precision may be ascribed to the less-detailed discussion of some of the related terms in the retrieved documents



The impact of exhaustivity and specificity on an IR system

Specificity

• The more specific a search term, the higher the precision

➢ For example, searching with a broad term like “sports” may retrieve a lot of
documents, most of which might not discuss the desired topic, whereas a specific
term like “soccer” or “tennis” will yield fewer documents, only some of which
may be irrelevant.

• Thus higher levels of term specificity ensure high precision but low recall

• It is not possible for a given IR system to achieve optimal levels of both
precision and recall at the same time, and pursuing optimal levels may increase
the cost of the IR system

• Moderate or intermediate levels of term specificity and indexing exhaustivity
have been advocated to make the IR system economical (Chowdhury, 2010)
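
The trade-off between a broad and a specific term can be sketched as follows. The toy index and the relevance judgements are assumptions made up for illustration; they are not taken from any real system.

```python
# Hypothetical index: each term maps to the set of documents indexed under it.
index = {
    "sports": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},  # broad term, many documents
    "tennis": {2, 5},                            # specific term, few documents
}

# Assumed relevance judgements for a user who wants material on tennis.
relevant = {2, 5, 7}

def evaluate(term):
    """Return (precision, recall) when searching with a single index term."""
    retrieved = index[term]
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

for term in ("sports", "tennis"):
    precision, recall = evaluate(term)
    print(f"{term}: precision={precision:.2f}, recall={recall:.2f}")
# sports: precision=0.30, recall=1.00
# tennis: precision=1.00, recall=0.67
```

The broad term finds every relevant document but buries it among non-relevant ones, while the specific term returns only relevant documents but misses one.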

Dr. (Mrs.) Florence O. Entsua-Mensah 22


Activity

• Choose two search terms, a broad search term and a specific term. Search for
the terms on the net. Which of them gave high precision, and which gave high
recall? Discuss your findings in the Chatroom

Dr. (Mrs.) Florence O. Entsua-Mensah 23


References

• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.).
New York: Neal-Schuman Publishers, Inc.

• Korfhage, R. R. (1997). Information storage and retrieval. New York: John Wiley
and Sons Inc.

• Librarianship Studies and Information Technology (2017). Retrieved from
https://librarianshipstudies.blogspot.ca/2017/04/pre-coordinate
indexingsystems.html

• Obaseki, T. L. (2010). Automated indexing: The key to information retrieval in
the 21st century. Library Philosophy and Practice. Retrieved from
http://www.webpages.uidaho.edu/~mbolin/obaseki.htm

Dr. (Mrs.) Florence O. Entsua-Mensah 24


INFS 422: INFORMATION STORAGE AND RETRIEVAL

Session 6 – Index File Organization

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

An information retrieval system is designed to provide fast and easy
access to the documents stored in it. This is made possible by
organizing the record of each document, including its index terms or
keywords, into fields and subfields, and arranging the terms alphabetically
with pointers to the actual documents. Such an index file is called an
inverted index.

At the end of this session, students will be able to:

• Explain an inverted index and its components
• Describe how to construct an inverted index
• Construct a sample inverted index

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 2


Session Outline

The key topics to be covered in the session are:

• Topic One: Meaning, components and features of an inverted index

• Topic Two: Purposes, importance and weaknesses of an inverted index

• Topic Three: How to construct an index

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 3


Reading List

• Chowdhury, G. G. (2010). Introduction to modern information retrieval
(3rd ed.). New York: Neal-Schuman Publishers, Inc.
• Manning, C. D., Raghavan, P., & Schütze, H. (2009). A first take at
building an inverted index. Retrieved from
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html#fig:indexstart

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 4


TOPIC 1: Meaning, Components and Features of an Inverted Index

Dr. (Mrs.) Florence O. Entsua-Mensah 2022/2023 Academic Year Slide 5


Meaning of an inverted index

• It is an index data structure that maps content to its location within a
database file, in a document or in a set of documents (Moura & Cristo, 2009).

• It is normally composed of:

➢ a vocabulary that contains all the distinct words found in a text, and
➢ for each word of the vocabulary, a list that contains statistics about its
occurrences in the text.

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 6


Meaning of an inverted index

• An indexing system in which the terms point to the documents in which
they occur.

• An index structure where every key value (term) is associated with a list of
object identifiers (representing documents) (IGI Global, n.d.)

• It is described as inverted because it maps each word to the documents
that contain it, rather than the other way round (each document to the
words it contains).
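
A minimal sketch of the inversion idea, assuming a tiny made-up collection: the forward view lists the terms of each document, and the inverted view lists the documents of each term.

```python
from collections import defaultdict

# Hypothetical forward view: each document ID lists the terms it contains.
forward = {
    1: ["file", "structure", "design"],
    2: ["file", "organization", "processing"],
    3: ["expert", "systems"],
}

# Inverting it: each term now points to the documents that contain it.
inverted = defaultdict(list)
for doc_id, terms in sorted(forward.items()):
    for term in terms:
        if doc_id not in inverted[term]:
            inverted[term].append(doc_id)

print(inverted["file"])    # [1, 2]
print(inverted["expert"])  # [3]
```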

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 7


Components of an inverted index

There are 2 main files in the inverted file system of any text retrieval
system:
1. The text file containing the document records
➢ Each document record is made up of fields and subfields, each holding a
specific unit of information.
➢ The units of information include document surrogates or bibliographic
information such as the author’s name, publisher, title, ISBN and date of
publication, and sometimes the abstract or full text of the document
2. The inverted file containing all the terms and pointers to the record
numbers where the terms occur (Chowdhury, 2010)

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 8


Components of an inverted index

• Thus an inverted file has the keyword entry and a reference list
specifying the actual positions where the keyword/term is located in
the database

• Each entry also includes:

➢ The number of occurrences of a term in a given record
➢ Position information, e.g., the field in which the term/phrase occurs
➢ The position of the term/phrase in a given sentence/paragraph
➢ Field tags used to denote the fields where the term/phrase is located
(Chowdhury, 2010)
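
The entry layout described above can be sketched as follows. The records, field names and tuple layout are assumptions chosen for illustration; a real inverted file would use its own record numbers and field tags.

```python
from collections import defaultdict

# Hypothetical document records, each split into tagged fields.
records = {
    1: {"title": "file structure and design",
        "keywords": "file structure file organization"},
    2: {"title": "file organization and processing",
        "keywords": "file structure file organization"},
}

# term -> list of entries: (record number, field tag, occurrences, positions in field)
inverted_file = defaultdict(list)
for rec_no, fields in records.items():
    for tag, text in fields.items():
        positions = defaultdict(list)
        for pos, word in enumerate(text.split(), start=1):
            positions[word].append(pos)
        for word, where in positions.items():
            inverted_file[word].append((rec_no, tag, len(where), where))

for entry in inverted_file["organization"]:
    print(entry)
# (1, 'keywords', 1, [4])
# (2, 'title', 1, [2])
# (2, 'keywords', 1, [4])
```

Each entry records where the term occurs (record and field), how often, and at which word positions, which is the information later used for field-specific and proximity searches.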

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 9


Sample document records and sample index

Sample document records

Doc 1
Author: Cunningham, M
Title: File structure and design
Publisher: Chartwell Bratt
Year 1995
Keywords: File structure, file organization

Doc 2
Author: Tharp, A.
Title: File organization and processing
Publisher: John Wiley
Year 1988
Keywords: File structure, file organization

Doc 3
Author: Ford, N
Title: Expert systems and artificial intelligence
Publisher: Library Association
Year 1991
Keywords: Expert systems, artificial intelligence, Knowledge-based systems

Sample Index

360 1 2 ARTIFICIAL INTELLIGENCE
140 1 1 CHARTWELL BRATT
120 1 1 CUNNINGHAM, M.
360 1 1 EXPERT SYSTEMS
330 1 1 EXPERT SYSTEMS AND ARTIFICIAL INTELLIGENCE
160 1 2 FILE ORGANIZATION
260 1 2 FILE ORGANIZATION
230 1 1 FILE ORGANIZATION AND PROCESSING
160 1 1 FILE STRUCTURE
260 1 1 FILE STRUCTURE
130 1 1 FILE STRUCTURE AND DESIGN
360 1 3 KNOWLEDGE-BASED SYSTEMS
340 1 1 LIBRARY ASSOCIATION

NB. The index is arranged in alphabetical order and each term may occur in
multiple documents
(Adapted from Chowdhury, 2010, p. 128)
Dr. (Mrs.) Florence O. Entsua-Mensah Slide 10
Features of an Inverted Index

• Efficient search: Inverted indexes allow for efficient searching of large
volumes of text-based data. By indexing every term in every document,
the index can quickly identify all documents that contain a given search
term or phrase, significantly reducing search time.

• Fast updates: Inverted indexes can be updated quickly and efficiently as
new content is added to the system. This allows for near-real-time
indexing and searching for new content.

• Support for multiple languages: Inverted indexes can support multiple
languages, allowing users to search for content in different languages
using the same system.

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 11


Features of an Inverted Index

• Flexibility: Inverted indexes can be customized to suit the needs of different types of
information retrieval systems. For example, they can be configured to handle different
types of queries, such as Boolean queries or proximity queries.

• Compression: Inverted indexes can be compressed to reduce storage requirements.
Various techniques such as delta encoding, gamma encoding, variable byte encoding,
etc. can be used to compress the posting list efficiently.

• Support for stemming and synonym expansion: Inverted indexes can be
configured to support stemming and synonym expansion, which can improve the
accuracy and relevance of search results.
➢ Stemming is the process of reducing words to their base or root form.
➢ Synonym expansion involves mapping different words that have similar meanings to a
common term.
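
The compression feature listed above can be illustrated with a short sketch that combines delta (gap) encoding with variable byte encoding. The function names and the sample postings list are illustrative assumptions, and the functions expect a non-empty, sorted list of document IDs.

```python
def to_gaps(doc_ids):
    """Delta-encode a sorted postings list: keep the first ID, then the differences."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Rebuild the document IDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vb_encode_number(n):
    """Variable byte code: 7 payload bits per byte; the high bit marks the last byte."""
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128
    return bytes(out)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

postings = [824, 829, 215406]
gaps = to_gaps(postings)              # [824, 5, 214577]
encoded = vb_encode(gaps)
print(len(encoded))                   # 6 bytes for three postings
print(from_gaps(vb_decode(encoded)))  # [824, 829, 215406]
```

Small gaps need fewer bytes than full document IDs, which is why the postings are stored as gaps before the byte code is applied.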

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 12


TOPIC 2: Purpose, Importance and Weaknesses of an Inverted Index

Dr. (Mrs.) Florence O. Entsua-Mensah 2022/2023 Academic Year Slide 13


Purpose of an Inverted index

• The purpose of an inverted index is to enable fast full-text
searches at the expense of increased processing when
adding documents to the database.

• The inverted file may be the database file itself, rather than
its index.

• This is the most common data structure used in document
retrieval systems, and is used extensively by search engines
and the like.

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 14


Importance of the Inverted index

• The inverted index is the most popular file structure used in document
retrieval systems such as commercial databases to support full text
search (Moura & Cristo, 2009)

• It is used by search engines, cell phones, book indexes, concordances,
etc.

• Document records are stored in computer memory one after the
other. Without an index, a search would have to go through every
document one at a time for any term. The index file provides faster
access to the relevant documents by identifying the terms and jumping
to their locations.

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 15


Importance of the Inverted index

• Its structure, such as position information and the use of field tags,
permits the use of different search strategies such as:

➢ Field-specific searches

➢ Proximity searches

➢ Boolean searches (Chowdhury, 2010)
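
Two of these strategies can be sketched over a small positional index. The index contents and the function names `boolean_and`, `boolean_or` and `near` are hypothetical, chosen only to show how document lists and word positions are used.

```python
# Hypothetical positional index: term -> {doc ID: [word positions]}.
index = {
    "file":         {1: [1], 2: [1], 4: [3]},
    "organization": {2: [2], 3: [5], 4: [9]},
}

def boolean_and(t1, t2):
    """Documents that contain both terms."""
    return sorted(index[t1].keys() & index[t2].keys())

def boolean_or(t1, t2):
    """Documents that contain at least one of the terms."""
    return sorted(index[t1].keys() | index[t2].keys())

def near(t1, t2, k):
    """Proximity search: documents where the terms occur within k words of each other."""
    hits = []
    for doc in boolean_and(t1, t2):
        if any(abs(p - q) <= k for p in index[t1][doc] for q in index[t2][doc]):
            hits.append(doc)
    return hits

print(boolean_and("file", "organization"))  # [2, 4]
print(boolean_or("file", "organization"))   # [1, 2, 3, 4]
print(near("file", "organization", 1))      # [2]
```

A field-specific search would work the same way if each posting also carried a field tag that the query could filter on.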

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 16


Weaknesses of Inverted Index

• Large storage overhead and high maintenance costs on
updating, deleting, and inserting.

• Instead of retrieving the data in decreasing order of expected
usefulness, the records are retrieved in the order in which
they occur in the inverted lists.

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 17


TOPIC 3: How to Construct an Index

Dr. (Mrs.) Florence O. Entsua-Mensah 2022/2023 Academic Year Slide 18


Steps in the construction of an inverted index

• Pre-processing of documents and texts

➢ Collect the documents to be indexed. Each document must
have a document ID (doc ID)
➢ Tokenize the text, i.e., separate it into single words (tokens)
➢ E.g., “George Weah is the new elected President of Liberia”
becomes:

George | Weah | is | the | new | elected | President | of | Liberia
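
A minimal tokenization sketch, assuming that tokens are simply runs of letters and digits. Real tokenizers handle hyphens, apostrophes and other cases with more care, and the document ID here is made up.

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)

doc_id = 1  # hypothetical document ID
tokens = tokenize("George Weah is the new elected President of Liberia")
print(doc_id, tokens)
# 1 ['George', 'Weah', 'is', 'the', 'new', 'elected', 'President', 'of', 'Liberia']
```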

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 19


Steps in the construction of an inverted index

• Stop words – Omit high-frequency or common words, such
as: to, of, the, a, etc. (These are words with little value in
helping to select documents that will match a user’s query)
➢ A stop list helps to reduce the number of postings that a system
will store
➢ Some stop words in songs and some titles are left intact, e.g.,
“Let it be”, to facilitate title and phrase searches
➢ However, web search engines generally do not use a stop list
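
Stop-word removal can be sketched as a simple filter. The stop list below is a small illustrative assumption (the word "is" is added to the examples given above); production stop lists are longer and language-specific.

```python
STOP_WORDS = {"a", "is", "of", "the", "to"}  # small illustrative stop list

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (compared in lower case)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["George", "Weah", "is", "the", "new", "elected", "President", "of", "Liberia"]
print(remove_stop_words(tokens))
# ['George', 'Weah', 'new', 'elected', 'President', 'Liberia']
```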

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 20


Steps in the construction of an inverted index
• Normalization – Map document words and query words to a common form so
that they match. For example, map U.S.A. to USA so that a search for one
form of the term will retrieve the other
➢ An alternative method is to construct a synonym list, e.g., car and
automobile
➢ Removal of diacritics and accents, e.g., so that cliché and cliche
match
➢ Capitalization – all letters are reduced to lower case
➢ Equate American and British spellings, e.g., labour to labor
➢ Date – 2/1/18 to Jan. 1, 2018

• Keywords can be used as they are or be transformed into their
base form, e.g.,
• Nouns in the singular form – shoes becomes shoe
• Verbs in the infinitive form – learned becomes learn
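
Several of the normalization steps above can be sketched in a single function. The equivalence table is a hypothetical stand-in for a synonym or spelling-variant list, and removing every full stop is a simplification that suits abbreviations such as U.S.A. but not all text.

```python
import unicodedata

# Hypothetical equivalence table for spelling variants and synonyms.
EQUIVALENTS = {"labour": "labor", "automobile": "car"}

def normalize(token):
    token = token.lower()                              # case folding
    token = token.replace(".", "")                     # U.S.A. -> usa
    decomposed = unicodedata.normalize("NFKD", token)  # separate letters from accents
    token = "".join(c for c in decomposed if not unicodedata.combining(c))
    return EQUIVALENTS.get(token, token)               # map variants to one form

print([normalize(t) for t in ["U.S.A.", "USA", "cliché", "Labour", "Automobile"]])
# ['usa', 'usa', 'cliche', 'labor', 'car']
```

The same function must be applied to query terms as well as document terms, otherwise the two would no longer match.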

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 21


Steps in the construction of an inverted index

• Stemming – words are reduced to their root. E.g., authorize
and authorization reduce to “authoriz”. Stemming of words:
➢ Reduces the size of the index
➢ Broadens the search, i.e., allows retrieval of various
forms of the word

• Linguistic modules – modify tokens, e.g., friends to friend, July
to july, etc. The resulting normalized tokens are the actual terms
used by the indexer
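
The effect of stemming can be shown with a deliberately naive suffix stripper. This is a toy sketch with an assumed suffix list, not an established algorithm such as the Porter stemmer, and it will mis-stem many English words.

```python
SUFFIXES = ("ation", "ing", "ed", "e", "s")  # illustrative only, tried in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["authorize", "authorization", "friends", "July"]])
# ['authoriz', 'authoriz', 'friend', 'july']
```

Because both forms of "authorize" collapse to one stem, they share a single index entry and a search on either form retrieves documents containing the other.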

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 22


Building the index

• The input is the list of normalized tokens for each document, i.e., a list
of (term, document ID) pairs

• Sort the terms, each tagged with its document ID, alphabetically, and
group repeated terms to produce a dictionary & postings.

• Postings are sorted by document ID, which is the basis for
efficient query processing

• The dictionary stores other statistics such as:

➢ The number of documents that contain each term, which is also the
length of each postings list (this facilitates ranked retrieval and
improves the efficiency of search engines during query processing)
(Manning, Raghavan, & Schütze, 2009; Inkpen, n.d.)
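
The sort-and-group procedure can be sketched as follows. The three documents and their tokens are made up for illustration, and the dictionary layout (`doc_freq`, `postings`) is an assumed representation rather than a prescribed format.

```python
from itertools import groupby

# Hypothetical normalized tokens for each document ID.
docs = {
    1: ["library", "catalogue", "search"],
    2: ["search", "engine", "index"],
    3: ["library", "index", "search", "index"],
}

# Step 1: list every (term, doc ID) pair.
pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]

# Step 2: sort alphabetically by term, then by doc ID.
pairs.sort()

# Step 3: group by term; repeated occurrences in one document become one posting.
index = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    postings = sorted({doc_id for _, doc_id in group})
    index[term] = {"doc_freq": len(postings), "postings": postings}

for term, entry in index.items():
    print(term, entry["doc_freq"], entry["postings"])
# catalogue 1 [1]
# engine 1 [2]
# index 2 [2, 3]
# library 2 [1, 3]
# search 3 [1, 2, 3]
```

The `doc_freq` value equals the length of the postings list, which is the statistic the dictionary keeps for each term.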

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 23


An inverted index

Source: Manning, C. D., Raghavan, P., & Schütze, H. (2009). A first take at building an inverted index. Retrieved from
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html#fig:indexstart

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 24

Practice assignment

• Study carefully the inverted index in the previous slide
• Read the notes below the index
• Use the documents below to practice how to
construct an inverted index
A document collection
Doc 1 new home sales top forecasts
Doc 2 home sales rise in July
Doc 3 increase in home sales in July
Doc 4 July new home sales rise

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 25


Activity

• Access this link for more information on inverted index construction:
https://www.youtube.com/watch?v=pevQ2T9Gm0w

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 26


References
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.).
New York: Neal-Schuman Publishers, Inc.
• IGI Global. (n.d.). What is an inverted index. Retrieved from
https://www.igi-global.com/dictionary/inverted-index/15654
• Inkpen, D. (n.d.). Information retrieval on the Internet. Retrieved from
http://site.uottawa.ca/~diana/csi4107/IR_draft.pdf
• Manning, C. D., Raghavan, P., & Schütze, H. (2009). A first take at building an
inverted index. Retrieved from
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html#fig:indexstart
• Moura, E. S., & Cristo, M. A. (2009). Inverted files. In L. Liu & M. T. Özsu (Eds.),
Encyclopedia of database systems. Boston, MA: Springer. Retrieved from
https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_1136

Dr. (Mrs.) Florence O. Entsua-Mensah Slide 27


INFS 422: INFORMATION STORAGE AND RETRIEVAL

Session 7 – Vocabulary Control

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

The basic function of an IR system is to meet the information needs
of users. This is done by retrieving documents that have been
indexed with terms that match the query terms used by the users.
One way of ensuring a “perfect” match between query and index
terms is by use of standardized terms for both indexing and
searching. This is known as vocabulary control.

At the end of this session, students will be able to:

• Explain vocabulary control and its objectives and advantages
• Identify the differences between natural language indexing and
vocabulary control
• Describe the characteristics and features of vocabulary control
tools

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 2


Session Outline

The key topics to be covered in the session are:

• Topic One: Definition, objectives and advantages of vocabulary control

• Topic Two: Differences between natural language indexing and
vocabulary control

• Topic Three: Features and characteristics of vocabulary control tools

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 3


Reading List

• Chowdhury, G. G. (2010). Introduction to modern information
retrieval (3rd ed.). New York: Neal-Schuman Publishers, Inc.

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 4


TOPIC 1: Definition, Objectives and Advantages of Vocabulary Control

Lecturer: Dr (Mrs) F. O. Entsua-Mensah 2022/2023 Academic Year 5


What is Vocabulary Control?

• “The term ‘vocabulary control’ refers to a limited set of terms that must
be used to index documents, and to search for these documents, in a
particular system” (Librarianship Studies and Information Technology,
n.d.)
• The systematic selection of preferred terms to represent the subject
matter of documents (Chowdhury, 2010)
• It is “an organized arrangement of words and phrases used to index
content and /or to retrieve content through browsing or searching” (What
are Controlled Vocabularies, n.d.).
(Most databases have controlled vocabularies, e.g., CINAHL, Academic
Search Premier)

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 6


Why Vocabulary Control?

Vocabulary control is necessitated by two basic features of
natural language:
1. Two or more words can be used to represent one
concept (synonyms). E.g., salinity/saltiness
2. Two or more words can have the same spelling but represent
different concepts (homographs). E.g.,
➢ Mercury can mean:
➢ Planet
➢ Metal
➢ Automobile
➢ Mythical being (Zeng, 2005)

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 7


Objectives of Vocabulary Control

In an IR environment:

• Consistency – Vocabulary control facilitates the consistent
representation of the subject matter of documents by indexers
to avoid dispersion of related materials

• It facilitates comprehensive searches on topics by
systematically linking all related terms together (Lancaster,
cited in Chowdhury, 2010).

• It matches the language of searchers and indexers

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 8


Objectives of Vocabulary Control

• Translation – it converts the natural language of authors,
indexers, & users into a suitable vocabulary for indexing &
retrieval

• Indication of relationships – it indicates semantic
relationships among terms

• Label and browse – it provides consistent word hierarchies that
help users to locate desired information

• Retrieval – it is a searching aid for locating the contents of
documents

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 9


Advantages of Controlled Vocabulary

• Saves time: A searcher does not need to come up with all the
synonyms/alternative spellings of a term

• Provides comprehensiveness of search results. E.g., searching
for the term clothing as a subject heading of a database will also
retrieve raiment, apparel, costume, attire, dress, garment, etc.

• It can be used to clarify words with several meanings, e.g.,
bark, pool, bolt, season, etc., thus aiding precision

• It facilitates the learning of an unfamiliar subject by
providing an authoritative subject list for browsing (Bell, 2012)
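
The first two advantages can be sketched with a small authority list. The term mappings and the subject index below are hypothetical; they only illustrate how a controlled vocabulary lets any synonym lead to the same set of documents.

```python
# Hypothetical authority list: non-preferred term -> preferred term.
USE = {
    "raiment": "clothing",
    "apparel": "clothing",
    "attire": "clothing",
    "garment": "clothing",
    "saltiness": "salinity",
}

# Hypothetical subject index in which documents are indexed under preferred terms only.
subject_index = {"clothing": [3, 8, 15], "salinity": [4]}

def preferred(term):
    """Map a searcher's term to the preferred term of the controlled vocabulary."""
    term = term.lower()
    return USE.get(term, term)

def search(term):
    return subject_index.get(preferred(term), [])

print(search("Apparel"))    # [3, 8, 15]
print(search("clothing"))   # [3, 8, 15]
print(search("saltiness"))  # [4]
```

The searcher does not have to think of every synonym, because each one resolves to the single heading under which the indexer placed the documents.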

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 10


TOPIC 2: Differences between Controlled Vocabulary and Natural Language Indexing

Lecturer: Dr (Mrs) F. O. Entsua-Mensah 2022/2023 Academic Year 11


Differences between Controlled Vocabulary and Natural Indexing

Controlled vocabulary

• Terms used to represent subjects, and the process of assigning these
terms to particular documents, are controlled by a person
• Index terms are identified from an authority list such as subject
headings or a thesaurus
• A searcher must use the controlled list to formulate a search strategy

Natural language indexing

• Any term appearing in the title, document or text of a document
may be used as an index term. No person controls the use of terms
• No authority list is used. Index terms are obtained from the
document
• No controlled list is used. A searcher can use any desired terms
to formulate a search strategy

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 12


Differences between Controlled Vocabulary and Natural Indexing

Controlled vocabulary

1. Lacks specificity no matter how detailed the system
2. Lack of exhaustivity
3. Not always up to date till new terms are added to the thesaurus
4. Words of authors may be misinterpreted
5. The user has to learn an artificial language
6. High input costs

Natural language indexing

1. High specificity and high precision, especially for names of
organizations & persons
2. Has exhaustivity, which may lead to high recall
3. Up to date. New terms are always available
4. Authors’ words are used. No misinterpretation
5. Both indexer and user use natural language words
6. Low input costs

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 13


Differences between Controlled Vocabulary and Natural Indexing

Controlled vocabulary

7. Exchange of materials between databases is difficult because of
incompatibility between standardized terms
8. Loss of precision through over-exhaustivity is avoided
9. Searching is not burdensome to the user because:
➢ Searches are broadened by the availability of preferred terms, due to the
use of synonyms and near synonyms
➢ Homographs (words with the same spelling but different meanings, such as
fair) are qualified
➢ Broad, narrow & related terms are displayed
➢ Problematic concepts that may not be present in free text are expressed

Natural language indexing

7. No language incompatibility. Easy exchange of material between databases
8. Exhaustivity may lead to loss of precision
9. Searching is burdensome. The user has to make all the intellectual effort,
such as the use of appropriate words to retrieve relevant documents

(Chowdhury, 2010, pp. 156-158)

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 14


TOPIC 3: Vocabulary Control Tools: Principles and Characteristics

Lecturer: Dr (Mrs) F. O. Entsua-Mensah 2022/2023 Academic Year 15


Definition and Types of Vocabulary Control Tools

Definition

• Vocabulary control tools are tools used for controlling the
vocabulary of indexing and retrieval (Chowdhury, 2010). Examples include:

• Subject heading lists

• Thesauri

• Alphanumeric classification schemes (e.g., LCC, DDC, UDC)

• Ontologies

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 16


Subject Heading List

• A subject heading has been defined as a word or group of words
indicating a subject under which all materials dealing with the
same theme are entered in a catalogue or bibliography, or are
arranged in a file.

• A subject heading list is a master list of terms and phrases with
appropriate cross-references and notes, used as a source of headings to
represent the subject content of an information resource (Chowdhury, 2010)

• It consists of standardized, officially approved words or groups of words
used to describe the subject content of books or any information
resource.

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 17


Subject Heading List

• Subject headings, like access points based on author names
and titles, serve the dual function of location and collocation.

• Subject heading lists are used by library catalogers to aid
them in their choice of appropriate subject headings and to
achieve uniformity.

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 18


Features of Subject Headings
• It is alphabetically arranged by term/phrase
• It can be one word, two or more words, a phrase, a city, a country,
a geographic region, or a person
• There is a list of semantically related terms/phrases under each
main term
• They use formal language, e.g., ART, ASIA, MEDIEVAL
• They are usually given in plural form, e.g., PASTRIES, SHARKS
• Slang, jargon and highly specialized terminology are avoided
• It has subdivisions to describe specific aspects of the subject
➢ Subdivisions are separated from the main heading by a dash (--)
➢ Subdivisions can be topical, geographical, form (type/form of
publication), or chronological
• It is used to produce a pre-coordinated index of a collection
(College of San Mateo Library, n.d.; Chowdhury, 2010)

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 19


List of Subject Headings -General Principles

The following general principles guide the indexer in the choice of
subject headings
• Specific and direct entry – This principle requires the
assignment of a document under the most specific subject that
precisely represents its subject content.
• Common usage – A subject must be represented by a commonly
used word.
• Uniformity – The principle of uniform heading states that the one term
chosen from a group of synonyms must be used to represent all
documents on a given topic.
• Consistency and current terminology – If a term based on
common usage becomes obsolete, the current terminology must
be used (all records bearing the obsolete term must be changed
to the current term).

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 20


List of Subject Headings -General Principles

• Form headings – They look the same as topical subject headings but are used
for the organization, arrangement or classification of literary and artistic
forms of material, e.g., drama, essay, poetry, fiction, etc.
• Cross references – They are used to direct users from broader and
related topics to the subject headings or terms used to represent a
particular subject, e.g.,
➢ See (or USE) references – direct users to authorized headings
➢ See also references – direct users to related, broader and narrower
terms (RT, BT, NT) to help in searching specific aspects of a subject
➢ General references – direct users to a category or group of headings
instead of individual headings to save space
• Examples of subject heading lists are the Library of Congress Subject
Headings (LCSH) and the Sears List of Subject Headings (Librarianship Studies
and Information Technology, n.d.).
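
The See (USE) and See also references described above can be modelled with a small data structure. The `Heading` class, its field names and the sample entries are illustrative assumptions loosely styled on a subject heading entry, not an official record format.

```python
from dataclasses import dataclass, field

@dataclass
class Heading:
    term: str
    use: str = ""                            # See/USE: the authorized heading to use instead
    bt: list = field(default_factory=list)   # broader terms
    nt: list = field(default_factory=list)   # narrower terms
    rt: list = field(default_factory=list)   # related terms

# Hypothetical entries in a subject heading list.
headings = {
    "Software, Computer": Heading("Software, Computer", use="Computer software"),
    "Computer software": Heading(
        "Computer software",
        rt=["Computer software industry", "Computers"],
        nt=["Application software", "System software"],
    ),
}

def resolve(term):
    """Follow a See (USE) reference to the authorized heading."""
    entry = headings[term]
    return headings[entry.use] if entry.use else entry

def see_also(term):
    """Collect the See also references: related, broader and narrower terms."""
    entry = resolve(term)
    return entry.rt + entry.bt + entry.nt

print(resolve("Software, Computer").term)
# Computer software
print(see_also("Software, Computer"))
# ['Computer software industry', 'Computers', 'Application software', 'System software']
```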

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 21


Library of Congress Subject Headings (LCSH)

• LCSH is a controlled vocabulary tool widely used as a subject
heading list for catalogues and bibliographies
• It was developed in 1898, published in 1914 and called Subject
Headings Used in the Dictionary Catalogs of the Library of
Congress
• It is in its 29th edition. The title changed to LCSH in the 8th edition
• It was originally designed as a vocabulary control tool to represent
the collection of the Library of Congress
• It is now widely used throughout the world and has been
translated into other languages

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 22


Library of Congress Subject Headings (LCSH)

• It is used for assigning subject headings to both manual and
machine-readable catalogues

• As of April 2017, there were 342,107 subject headings and
references

• About 5000 new headings, including subdivisions, are added
every year (The Library of Congress, n.d.).

• It is now an online-only publication. The 35th edition, printed
in 2013, was the last print edition

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 23


Structure of LCSH
• Computer Software [the approved subject heading is in bold face. Synonyms are not bold face]
(Class no.)
General works on computer programs along with documentation such as manuals, diagrams,
and operating instructions
UF Software, Computer
RT Computer software industry
Computers
SA subdivisions Software and Juvenile Software under subjects for actual software items
NT Application software
System software
Accounting
(Class no.)
Law and legislation
(May subd Geog)
Catalogs
UF Computer programs – Catalogs
Development
(Class no.)
(Adapted from Chowdhury, 2010, p. 160)

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 24


Sears List of Subject Headings

• It is smaller in scope than the LCSH and is used for assigning
standardized subject headings to all types of documents in
smaller libraries
• It uses the Dewey Decimal Classification system to ensure
that
• It was first designed in 1923 by Minnie Earl Sears to meet
the demands of small libraries for the following reasons:
➢ Small libraries needed simpler but broader subject
headings for their catalogues
➢ The LCSH list was complicated, too detailed and also
expensive

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 25


Sears List of Subject Headings

• The 13th edition (1986) was designed as an online database
with changes suitable for online databases and OPAC
searches. For example:

➢ Subject headings were changed to the natural forms used by users,
such as “Chemistry, Organic” to “Organic chemistry”

➢ There are frequent updates with current terminology and new topics

• Its popularity has been credited to a policy of continuous
adaptation to changing trends in user information-seeking
behaviour

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 26


Sears List of Subject Headings

• The 15th edition adopted a thesaurus format with the use of thesaural
abbreviations such as BT, NT, RT, USE and SA

• Changes in the 19th ed. (2007):

• It has 440 new subject headings

• Major additions include Islam, Graphic novels, Reality shows,
Suicide bombers, Stem cell research, Body piercing, Biodiversity,
Indigenous peoples, etc.

• Some headings have been fine-tuned, e.g.,

• Fictitious character has become Fictional character

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 27


Sears List of Subject Headings

• The 22nd edition (2018) is the current one. New headings
include:

➢ Business writing

➢ Economic indicators

➢ Flipped classrooms

➢ Systemic risk

➢ Massive open online courses

➢ Target marketing

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 28


Activity

• Access this link and familiarize yourself with the structure of the
Sears List of Subject Headings:
https://www.hwwilsoninprint.com/pdf/sears_pgs.pdf

• What are the similarities between the structures of the Sears
List and the LCSH? Discuss your findings on the Sakai Chats

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 29


References
• Bell, S. S. (2012). Librarian’s guide to online searching (3rd ed.). Santa Barbara, California:
Libraries Unlimited.
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York:
Neal-Schuman Publishers, Inc.
• Colby, M. (2017). Library of Congress Subject Headings: Online Training: developed by Janis
Young and Daniel Joudrey. Washington, DC: Library of Congress, Program for Cooperative
Cataloging, Cataloger's Learning Workshop, 2017. https://www.loc.gov/catworkshop/lcsh
• Librarianship Studies & Information Technology. (n.d.). Vocabulary control. Retrieved from
https://librarianshipstudies.blogspot.ca/p/librarians-reference-directory.html
• Library of Congress. (n.d.). Retrieved from
https://www.loc.gov/aba/publications/FreeLCSH/freelcsh.html
• What are Controlled Vocabularies? (n.d.). Retrieved from
http://www.getty.edu/research/publications/electronic_publications/intro_controlled_vocab/what.pdf
• Zeng, M. (2005). Why vocabulary control. Retrieved from
http://marciazeng.slis.kent.edu/Z3919/1need.htm

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 30


INFS 422: INFORMATION STORAGE AND RETRIEVAL

Session 8 – Thesaurus

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

A thesaurus is a vocabulary control tool comprising a controlled
set of terms from natural language or various fields of study,
linked by hierarchical or associative relationships. It appeared
in the 1950s and is useful for the selection of indexing terms and
for searching.

At the end of this session, students will be able to:

• Explain a thesaurus, its purposes and features

• Describe thesaural relationships and their features

• Describe the various displays used in a thesaurus

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 2


Session Outline

The key topics to be covered in the session are:

• Topic One: Definition, purpose and features of a thesaurus

• Topic Two: Relationships between terms in a thesaurus

• Topic three: Display of terms in a thesaurus

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 3


Reading List

• Chowdhury, G. G. (2010). Introduction to modern information
retrieval (3rd ed.). New York: Neal-Schuman Publishers,
Inc., pp. 162-168

• Zeng, M. (2005). Guidelines for the construction, format, and
management of monolingual controlled vocabularies.
Retrieved from
http://marciazeng.slis.kent.edu/Z3919/53adisplay.htm

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 4


TOPIC 1: Definition, Purpose and Features of a Thesaurus

Lecturer: Dr (Mrs) F. O. Entsua-Mensah 2022/2023 Academic Year 5


What is a Thesaurus?

• “A book of words or of information about a particular field or set of
concepts; especially : a book of words and their synonyms”
• A list of subject headings or descriptors, usually with a cross-reference
system, for use in the organization of a collection of documents for
reference and retrieval (Merriam-Webster, 2018).
• “A compilation of words and phrases showing synonyms and hierarchical
and other relationships and dependencies” (Chowdhury, 2010, p. 162).
• In indexing and information retrieval, a thesaurus is defined as: “a list
of every important term (single-word or multi-word) in a given domain of
knowledge; and a set of related terms for each term in the list” (New
World Encyclopedia, 2018).



Purposes of a Thesaurus

The main purpose of a thesaurus is to provide a standard
vocabulary for information storage and retrieval systems.

• It informs searchers about the index terms that have been used
in a retrieval system.

• It guides indexers and searchers to choose the same terms
when describing or searching for a concept or a particular word.



Features of a Thesaurus

• A thesaurus has 3 main features:
➢Vocabulary control
➢Thesaural relationships
➢Thesaural display

Vocabulary control
• This is the list of concepts covered by the thesaurus,
expressed as words and phrases.
• The words are mostly single-word nouns.
• Ambiguous words have scope notes to convey their meaning in a
given knowledge area.



Features of a Thesaurus

• Preferred terms
➢These are valid index terms that can be used for indexing and
searching.
➢They are the terms that best represent the concepts in a
thesaurus.
• Non-preferred terms
➢They are synonyms of the preferred terms.
➢They are not used for indexing.
➢References lead from non-preferred terms to
preferred terms to guide searchers and indexers.



Topic Two: Relationships between Terms in a Thesaurus


Thesaural relationships

There are three types (classes) of thesaural (term) relationships:

• Hierarchical relationships

• Equivalence relationships

• Associative relationships



Hierarchical relationships

• They indicate a superordinate (broader) term and a
subordinate (narrower) term.

• A Broader Term (BT) is a more general term and a Narrower
Term (NT) is a more specific term, e.g.,
Capital markets
BT Financial markets
Financial markets
NT Capital markets
• There are 3 types of hierarchical relationships:
1) Generic
2) Whole-part and
3) Polyhierarchical relationships



Types of hierarchical relationships

Generic relationship
It identifies a class and its member species. The codes (notations)
for the generic relationship in a thesaurus are BTG (Broader term
generic) and NTG (Narrower term generic), e.g.,
• Rats
• BTG Rodents
• Rodents
• NTG Rats
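
The reciprocal nature of broader and narrower entries can be sketched in code. The following is a minimal Python illustration, assuming a hypothetical in-memory `Thesaurus` class (it is not taken from any thesaurus software): a hierarchical link is recorded once and both directions are kept in step.

```python
from collections import defaultdict


class Thesaurus:
    """Hypothetical in-memory store for hierarchical links (illustration only)."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms (BT)
        self.narrower = defaultdict(set)  # term -> its narrower terms (NT)

    def add_hierarchy(self, narrow, broad):
        # Record both directions so BT and NT entries always mirror each other.
        self.broader[narrow].add(broad)
        self.narrower[broad].add(narrow)


thesaurus = Thesaurus()
thesaurus.add_hierarchy("Rats", "Rodents")
print(sorted(thesaurus.broader["Rats"]))      # the BTG entry under Rats
print(sorted(thesaurus.narrower["Rodents"]))  # the NTG entry under Rodents
```

Because each term maps to a set of broader terms, the same sketch would also allow a term to have more than one broader term.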



Types of hierarchical relationships

Whole-part relationship
• It identifies a situation in which the name of the part is an
integral part of the whole (the BT); or
• When one concept is a constituent of another irrespective of the
context, such that all the terms can be organized into a logical
hierarchy (Chowdhury, 2010).
• The notation or code for the whole-part relationship is BTP
(Broader term partitive) and NTP (Narrower term partitive).

There are many scenarios, including:
1) Systems and organs of the body
CENTRAL NERVOUS SYSTEM
BTP Nervous system
NTP Brain
NTP Spinal cord



Types of hierarchical relationships
Whole-part relationship scenarios
2) Geographic locations
GHANA
BTP Africa
NTP Accra
NTP Osu

3) Disciplines or fields of study
CHEMISTRY
BTP Science
NTP Physical chemistry
NTP Thermodynamics

4) Organizational, social or political structures
ARMIES
NTP Divisions
NTP Battalions
NTP Regiments



Types of hierarchical relationships

Polyhierarchical relationships
It describes a situation where a concept belongs to more than
one category. It occurs in a particular instance, which links a
proper name with a common noun, e.g.,

PRINTING EQUIPMENT
NT Computer printers

COMPUTER PERIPHERAL EQUIPMENT
NT Computer printers

COMPUTER PRINTERS
BT Computer peripheral equipment
BT Printing equipment



Equivalence relationships

• Equivalence relationships describe a situation where the
same concept is expressed by two or more terms.
• Thus the relationship between preferred and non-preferred
terms is an equivalence relationship.
• An equivalence relationship is denoted by:
• U or USE, leading from a non-preferred to a preferred term
• UF or USED FOR, leading from a preferred to a non-preferred term
Example
➢ Water fowl USE Water bird
➢ Water bird UF Water fowl
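
As a rough illustration of how USE and UF references work together, the sketch below stores only the USE references and derives the UF list from them. The dictionary `USE` and both function names are assumptions made for this example.

```python
# Hypothetical USE references: non-preferred term -> preferred term.
USE = {"water fowl": "water bird"}


def preferred(term):
    """Return the preferred term to use for indexing or searching."""
    key = term.lower()
    return USE.get(key, key)


def used_for(preferred_term):
    """Derive the reciprocal UF list from the USE references."""
    return sorted(t for t, p in USE.items() if p == preferred_term.lower())


print(preferred("Water fowl"))  # the searcher is led to the preferred term
print(used_for("water bird"))   # the UF entry shown under the preferred term
```

Storing one direction and deriving the other is a design choice that keeps the two kinds of reference from drifting apart.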



Types of equivalence relationships

There are five basic types:

• Synonyms – a word or phrase with the same meaning as another
word or phrase in the same language (Oxford Dictionary), e.g.,
➢ Synonyms from two linguistic origins, e.g., freedom / liberty
➢ A popular and a scientific name, e.g., salt / sodium chloride
➢ A trade name and a generic name, e.g., Vaseline / petroleum jelly
➢ New or favoured terms replacing outdated ones, e.g., developing countries /
underdeveloped countries
➢ Terms from different cultures, e.g., flats / apartments
➢ Jargon or slang synonyms, e.g., whirlybirds / helicopters
• Lexical variants – different word forms expressing the same
concept, e.g., ground water / ground-water / groundwater; online /
on-line



Types of equivalence relationships (Cont’d.)
• Near-synonyms – words with different meanings that are regarded as
equivalents for the purposes of a controlled vocabulary, e.g., sea
water / salt water
• Generic posting – the name of a class and the names of its
members are treated as quasi-synonyms, e.g.,
furniture
➢UF beds
➢UF chairs
beds USE furniture
chairs USE furniture
• Cross-references to elements of compound terms
➢Cross-reference to compound term elements
coal mining USE coal AND mining
➢Cross-references from compound term elements
coal UF+ coal mining
mining UF+ coal mining



Associative relationships

• Associative relationships link terms that are neither hierarchical
nor equivalent, yet are conceptually associated to the extent that the link
between them must be made clear in the thesaurus.
• In other words, two terms are conceptually associated on a
number of different bases, while satisfying the requirement that one
of the terms should function as a component in any explanation or
definition of the other.
• They are indicated as RT (related term).
• Examples are
➢Cells
➢RT Cytology
➢Cytology
➢RT Cells



Types of Associative Relationships

Two broad categories:

1. Relationships between terms belonging to the
same hierarchy or category: these terms are referred to
as siblings with overlapping meanings, e.g., ships
and boats.
2. Relationships between terms belonging to
different hierarchies or categories: these involve two
terms such that when one term is used in indexing,
the other is implied.



Types of Associative Relationships (Cont’d.)

Several examples:
• A process and its instrument, e.g., illumination and lamp
• Concepts and their properties, e.g., poison and toxicity
• Concepts and their mechanisms or units of measurement, e.g.,
temperature and thermometer
• Concepts and their origins, e.g., India and Indians
• Action and product, e.g., weaving and cloth
• Cause and effect, e.g., pathogens and infections
• Raw materials and products, e.g., cocoa and chocolate
• Discipline or field of study and the object being studied, e.g.,
botany and plants



Topic Three: Display of Terms in a Thesaurus


Display of terms in a thesaurus

• Terms and their relationships in a thesaurus can be
displayed in several different ways, e.g.:

1. Alphabetic displays

2. Flat format displays

3. Graphic displays



Summary of symbols and abbreviations for use in a thesaurus

• SN: scope note
• DEF: definition
• HN: history note
• USE: the term following USE is the preferred term
• UF: the term following UF is the non-preferred
term
• USE+: the two or more preferred terms following USE+ should be
used together to represent the concept
• UF+: the non-preferred term after UF+ should be
represented by a combination of preferred terms, including
the one preceding the UF+



Symbols and abbreviations for use in a thesaurus

• TT: top term
• BT: broader term
• BTG: broader term (generic)
• BTI: broader term (instantial), showing an instance, e.g.,
capital cities and Accra
• BTP: broader term (partitive)
• NT: narrower term
• NTG: narrower term (generic)
• NTI: narrower term (instantial)
• NTP: narrower term (partitive)
• RT: related term (Chowdhury, 2010)



Features of an alphabetic display

• It is the most basic type of thesaurus/vocabulary display and
is easy to organize.

• All indexing terms, both preferred and non-preferred, are
organized in a single alphabetical sequence.

• It contains both terms and entry terms with USE references.

• One disadvantage from a user’s point of view is that all the
terms in a hierarchy cannot be seen at a single location
(Chowdhury, 2005).



Example of an alphabetical listing display for the term “Early”

Early Adolescence
U Early Adolescents
Early Adolescents
UF Early Adolescence; Young Adolescents
Early Childhood (1966 1980)
U Young Children
Early Childhood Education
Early Detection
U Identification
Early Diagnosis
U Identification
Early Experience
UF Preschool Experience
Source: Thesaurus of ERIC Descriptors



Features of Flat Format display

• It is the most commonly used display format.

• It comprises terms arranged in alphabetical order together
with their details and one level of BT or NT hierarchy.

• In some systems, the terms in the hierarchy are also
assigned a line number that a user can reference to expand
a search (Zeng, 2005).



Example of a flat format display
Whale watching
RT Whales

Whalers (Persons)
RT Whales

Whales
SN: Aquatic mammals of the order Cetacea
UF Cetaceans (NPT)
BT Marine mammals
NT Baleen whales
NT Fossil whales
NT Toothed whales
RT Whale oil
RT Whale watching
RT Whalers
RT Whaling

Source: Thomson Gale Master Thesaurus
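
A flat format display of this kind can be generated from stored term records. The sketch below is only an illustration: the `records` dictionary layout and the `flat_display` function are assumptions, and the sample entries are a subset of the example above.

```python
# Sample records based on the whale example; the dict layout is an assumption.
records = {
    "Whales": {
        "UF": ["Cetaceans"],
        "BT": ["Marine mammals"],
        "NT": ["Baleen whales", "Fossil whales", "Toothed whales"],
        "RT": ["Whale oil", "Whale watching", "Whalers", "Whaling"],
    },
    "Whale watching": {"RT": ["Whales"]},
    "Whalers (Persons)": {"RT": ["Whales"]},
}


def flat_display(records):
    """Return display lines: terms in A-Z order, each with one level of details."""
    lines = []
    for term in sorted(records, key=str.lower):
        lines.append(term)
        for code, targets in records[term].items():
            lines.extend(f"  {code} {target}" for target in targets)
    return lines


print("\n".join(flat_display(records)))
```

Only one level of BT and NT is listed under each term, which is what distinguishes the flat format from a full hierarchical display.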



Features of graphic display

• Graphic displays show the terms and their relationships in the form
of two-dimensional figures.
• They communicate relationships among concepts more effectively
than the linear forms of display.
• They are more effective in an interactive computer environment
where terms can be connected to their details via hyperlinks, e.g.,
https://www.freethesaurus.com/for+certain
• Some commercial products are available for generating
concept maps using terms in a controlled vocabulary (Chowdhury,
2010; Zeng, 2005).
– Concept maps in a printed thesaurus are static, but real-time
graphic displays can be generated in an electronic system.



Features of a graphic display

• Disadvantages of graphic display

➢It does not show equivalent terms or scope notes

➢It does not distinguish between associative and hierarchical
relationships.

➢Graphic displays in print versions are bulky and difficult to
navigate (Chowdhury, 2010).



Examples of graphic display



Graphic display of synonyms for certain



Activity

• Check the Internet and familiarize yourself with the structure
of a natural language thesaurus and a thesaurus of any field
of study. Note the differences and similarities.



References
• Chowdhury, G. G. (2010). Introduction to modern information retrieval (3rd ed.). New York:
Neal-Schuman Publishers, Inc., p. 162.
• Colby, M. (2017). Library of Congress Subject Headings: Online Training: developed by Janis Young and
Daniel Joudrey. Washington, DC: Library of Congress, Program for Cooperative Cataloging,
Cataloger's Learning Workshop, 2017. https://www.loc.gov/catworkshop/lcsh.
• Synonym (n.d.). Oxford Dictionary. Retrieved from
https://ca.search.yahoo.com/search?fr=mcafee&type=C211CA662D20170127&p=define+synonym
• Synonym for certain (n.d.). Retrieved from https://www.freethesaurus.com/for+certain
• Thesaurus (2018). Merriam-webster.com. Retrieved from
https://www.merriam-webster.com/dictionary/thesaurus
• Thesaurus (2018). New World Encyclopedia. Retrieved April 11, 2018 from
http://www.newworldencyclopedia.org/p/index.php?title=Thesaurus&oldid=986455.
• Thesaurus of ERIC descriptors (n.d.). Retrieved from
http://marciazeng.slis.kent.edu/Z3919/53adisplay.htm
• Thomson Gale Master Thesaurus (n.d.). Retrieved from
http://marciazeng.slis.kent.edu/Z3919/53adisplay.htm
• Zeng, M. (2005). Static concept map. Retrieved from
http://marciazeng.slis.kent.edu/Z3919/56displaygraph.htm



INFS 422: INFORMATION STORAGE & RETRIEVAL

Session 9 – Information Extraction

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

• In this session, learners will be introduced to the concept
of information extraction (IE).
• IE techniques extract structured data from unstructured
text.
• This session presents techniques for extracting limited
kinds of semantic content from text.
• The process of information extraction (IE) turns the
unstructured information embedded in texts into
structured data.



Session Outline

The key topics to be covered in the session are:

• Topic One: Understanding Information Extraction


• Topic Two: Basic Tasks of Information Extraction
• Topic Three: Approaches to Information Extraction
• Topic Four: Challenges to Information Extraction



Recommended Reading

Izquierdo, R. (2015). Information Extraction.
Available at:
https://www.slideshare.net/rubenizquierdobevia/information-extraction-45392844

Jurafsky, D., & Martin, J. H. (2017). Information Extraction. In
Speech and Language Processing (pp. 397–430).



Topic One: What is Information Extraction?


Information Extraction defined

• Information extraction (IE) involves the extraction of
structured information from unstructured and/or
semi-structured machine-readable documents.

• Information extraction is the process of extracting specific
(pre-specified) information from textual sources.
• A simple everyday example is an email client that extracts
only the relevant details (such as the date) from a message so
that you can add the event to your calendar.

• It is largely concerned with processing human language
texts by means of natural language processing (NLP).
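
The calendar example can be sketched with a regular expression. This is a deliberately narrow illustration: the message text is hypothetical, and the pattern handles only dates written as "D Month YYYY at H:MM", whereas real email clients rely on far richer NLP.

```python
import re
from datetime import datetime

# Hypothetical message; the pattern covers only "D Month YYYY at H:MM" dates.
message = "Hi all, the INFS 422 revision class is on 14 March 2023 at 10:30 in Room 5."

PATTERN = re.compile(
    r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{4}) at (?P<time>\d{1,2}:\d{2})"
)


def extract_event_time(text):
    """Return the first date-time mentioned in the text, or None if there is none."""
    match = PATTERN.search(text)
    if match is None:
        return None
    stamp = "{day} {month} {year} {time}".format(**match.groupdict())
    # %B expects an English month name under the default locale.
    return datetime.strptime(stamp, "%d %B %Y %H:%M")


print(extract_event_time(message))
```

The unstructured sentence goes in and a structured `datetime` value comes out, which is the essence of information extraction on a very small scale.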



Information Extraction



What is Information Extraction?

Figure: documents (unstructured or semi-structured) are turned into structured databases (or knowledge bases).


Structured vs Semi-structured Text

(Source: Izquierdo, 2015)


Why is it useful?
Information extraction enables:
• The automation of tasks such as smart content classification, integrated
search, management and delivery.
• Data-driven activities such as mining for patterns and trends, uncovering
hidden relationships, etc.
• The capture of clear factual information that helps with question answering
and analytics.
• The organization and presentation of information, as in Wikipedia infoboxes.
• The discovery of new knowledge via inference, e.g.,
works-for(x, y) AND located-in(y, z) → lives-in(x, z)
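
The inference rule can be applied mechanically once facts are stored in a structured form. Below is a minimal sketch, assuming facts are held as (relation, argument 1, argument 2) triples; the names "Ama" and "Acme Ltd" are invented for illustration.

```python
# Hypothetical extracted facts, stored as (relation, arg1, arg2) triples.
facts = {
    ("works-for", "Ama", "Acme Ltd"),
    ("located-in", "Acme Ltd", "Accra"),
}


def infer_lives_in(facts):
    """Apply works-for(x, y) AND located-in(y, z) -> lives-in(x, z)."""
    inferred = set()
    for rel1, x, y in facts:
        if rel1 != "works-for":
            continue
        for rel2, y2, z in facts:
            if rel2 == "located-in" and y2 == y:
                inferred.add(("lives-in", x, z))
    return inferred


print(infer_lives_in(facts))
```

Note that the rule yields a plausible conclusion rather than a guaranteed one, since a person may work in one place and live in another.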



Main Goals of IE

• Fill a predefined “template” from raw text.

• Extract who did what to whom and when (event extraction).

• Organize information so that it is useful to people.

• Put information in a form that allows further inferences by
computers.
• Support big data analytics and data mining.



Topic Two: Basic Tasks of Information Extraction


Information Extraction (IE) - Task

• Generally, the idea is to ‘extract’ or tag particular types of
information from arbitrary text or transcribed speech.
• The broad activities are:
➢Entity recognition
➢Relation extraction
➢Event extraction



Named Entity Recognition

At the heart of the Giuliani-led critique of the president’s patriotism is the suggestion
that Barack Obama has never expressed love for the United States. Rudolph W.
Giuliani, the former mayor of New York City, has even challenged the media to find
examples of Mr. Obama expressing such affection.

Has the president done so? Yes, he has.

A review of his public remarks provides multiple examples. In 2008, when he was still
a presidential candidate, Mr. Obama uttered the magic words in Berlin, during a
speech to thousands. Mr. Obama used a similar construction, as president, in 2011,
during a town hall meeting in Illinois, when he recalled “why I love this country so
much.”

Mr. Giuliani told Fox News that “I don’t hear from him what I heard from Harry
Truman, what I heard from Bill Clinton, what I heard from Jimmy Carter, which is these
wonderful words about what a great country we are, what an exceptional country we
are.”



Relation Extraction

Located-in(Person, Place)
He was in Tennessee

Subsidiary(Organization, Organization)
XYZ, the parent company of ABC

Related-to(Person, Person)
John’s wife Yoko

Founder(Person, Organization)
Steve Jobs, co-founder of Apple...



Event Extraction



Named Entity Tagger

• Identify the types and boundaries of named entities.

• For example:

• Alexander Mackenzie, (January 28, 1822 ‐ April 17, 1892), a
building contractor and writer, was the second Prime Minister of
Canada from ….

-> <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> ‐
<TIMEX>April 17, 1892</TIMEX>),
a building contractor and writer, was the second Prime Minister
of <GPE>Canada</GPE> from ….
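
The tagging shown above can be imitated with a toy rule-based tagger. This sketch is not how statistical taggers work; it assumes a small hand-made gazetteer (`GAZETTEER`) and a single date pattern (`DATE`), both invented for the example.

```python
import re

# Assumed toy resources: a month-name date pattern and a tiny gazetteer.
DATE = re.compile(
    r"(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}"
)
GAZETTEER = {"Alexander Mackenzie": "PERSON", "Canada": "GPE"}


def tag(text):
    """Wrap dates and gazetteer names in inline entity tags."""
    text = DATE.sub(lambda m: f"<TIMEX>{m.group(0)}</TIMEX>", text)
    for name, label in GAZETTEER.items():
        text = text.replace(name, f"<{label}>{name}</{label}>")
    return text


sentence = (
    "Alexander Mackenzie, (January 28, 1822 - April 17, 1892), "
    "was the second Prime Minister of Canada."
)
print(tag(sentence))
```

A lookup list like this cannot recognise names it has never seen, which is one reason statistical approaches are used in practice.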


IE for Template Filling and Relation Detection

Given a set of documents and a domain of interest, fill a table
of required fields.

• For example:
Number of car accidents per vehicle type and number
of casualties in the accidents.



IE for Question Answering

Q: When was Gandhi born?


A: October 2, 1869

Q: Where was Bill Clinton educated?


A: Georgetown University in Washington, D.C.

Q: What was the education of Yassir Arafat?


A: Civil Engineering

Q: What is the religion of Noam Chomsky?


A: Jewish



Topic Three: Approaches to Information Extraction


Approaches

• Statistical sequence labeling

• Supervised Learning

• Semi-supervised Learning and bootstrapping



Approach for NER
• <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28,
1822</TIMEX> ‐ <TIMEX>April 17, 1892</TIMEX>), a building
contractor and writer, was the second Prime Minister of
<GPE>Canada</GPE> from ….

• Statistical sequence labeling techniques can be used,
similar to POS tagging
• Word-by-word sequence labeling
• Examples of features
• POS tags
• Syntactic constituents
• Shape features
• Presence in a named entity list
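
Two of the features listed above, word shape and presence in a name list, are simple enough to sketch directly. The `NAME_LIST` set and both function names are assumptions for illustration; POS tags and syntactic constituents are left out because they would require a tagger and a parser.

```python
import re

# Assumed toy name list standing in for a real named-entity list.
NAME_LIST = {"alexander", "mackenzie", "canada"}


def word_shape(token):
    """Map characters to X (upper), x (lower), d (digit), then collapse repeats."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, i):
    """Features for token i, including its immediate neighbours."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "shape": word_shape(token),
        "in_name_list": token.lower() in NAME_LIST,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }


tokens = ["Alexander", "Mackenzie", "was", "born", "in", "1822"]
print(token_features(tokens, 1))
```

A sequence labeler would compute such a feature dictionary for every token and learn which combinations signal the start, inside or outside of an entity.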



Supervised Approach for relation detection
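• Given a corpus of annotated relations between entities, train two
classifiers:
• A binary classifier
• Given a span of text and two entities -> decide whether there is a relationship between these
two entities
• A second classifier that labels the type of relation for the pairs judged
to be related (Jurafsky & Martin, 2017)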

• Features
• Types of two named entities
• Bag of words
• POS of words in between

• Example:
• A rented SUV went out of control on Sunday, causing the death of seven
people in Brooklyn
• Relation: Type = Accident, Vehicle Type = SUV, casualty = 7, weather = ?

• Pros and Cons?
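
The feature types listed above can be sketched for a single candidate entity pair. This is an illustrative assumption about how features might be encoded (the function name and the (start, end, type) span format are invented); POS features are omitted because they need a tagger.

```python
def relation_features(tokens, entity1, entity2):
    """Build features for one candidate entity pair.

    Each entity is a (start, end, type) triple with `end` exclusive.
    """
    start1, end1, type1 = entity1
    start2, end2, type2 = entity2
    between = [t.lower() for t in tokens[end1:start2]]
    features = {"type_pair": f"{type1}-{type2}", "distance": len(between)}
    for word in between:
        features[f"between={word}"] = 1  # bag of words between the entities
    return features


tokens = "Steve Jobs , co-founder of Apple".split()
person = (0, 2, "PERSON")
org = (5, 6, "ORG")
print(relation_features(tokens, person, org))
```

A feature dictionary like this would be passed to the binary classifier, which decides whether the pair is related at all.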



Pattern Matching

• How can we come up with these patterns?

• Manually?

• Task and domain-specific

• Tedious, time consuming, not scalable



Pattern Matching for Relation Detection

• Patterns:
• “[CAR_TYPE] went out of control on [TIMEX], causing the
death of [NUM] people”
• “[PERSON] was born in [GPE]”
• “[PERSON] graduated from [FAC]”
• “[PERSON] was killed by [X]”
• Matching Techniques
• Exact matching
• Pros and Cons?
• Flexible matching (e.g., [X] was .* killed .* by [Y])
• Pros and Cons?
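
The difference between exact and flexible matching can be seen with two regular expressions. In this sketch the placeholders are approximated by a capitalised word, which is an assumption made for brevity; a real system would match tagged entities instead.

```python
import re

# Placeholders become named groups; ".*?" in FLEXIBLE allows intervening words.
EXACT = re.compile(r"(?P<victim>[A-Z]\w+) was killed by (?P<agent>[A-Z]\w+)")
FLEXIBLE = re.compile(r"(?P<victim>[A-Z]\w+) was .*?killed .*?by (?P<agent>[A-Z]\w+)")

sentences = [
    "Caesar was killed by Brutus",
    "Caesar was reportedly killed in Rome by Brutus",
]

for sentence in sentences:
    exact = EXACT.search(sentence)
    flexible = FLEXIBLE.search(sentence)
    print(sentence, "| exact:", bool(exact), "| flexible:", bool(flexible))
```

Exact matching is precise but misses the second sentence; flexible matching catches it, at the risk of also matching text in which the intervening words change the meaning.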



Semi-supervised approach: AutoSlog-TS (Riloff, 1996)

• MUC-4 task: extract information about terrorist events in
Latin America
• Two corpora:
• A domain-dependent corpus that contains relevant information
• A set of irrelevant documents
• Algorithm:
1. Using heuristics, all patterns are extracted from both corpora. For
example:
• Rule: <Subj> passive-verb
• <Subj> was murdered
• <Subj> was called
2. Pattern ranking: the output patterns are then ranked by how
frequently they occur in the relevant corpus compared with the irrelevant corpus
3. Filter the ranked patterns by hand
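
The ranking step can be sketched as follows. The counts are invented, and the score used here (the share of a pattern's occurrences that fall in the relevant corpus) is a simplifying assumption; the slides do not give the exact scoring function of the published system.

```python
from collections import Counter

# Hypothetical pattern counts from the relevant and irrelevant corpora.
relevant = Counter(
    {"<Subj> was murdered": 40, "<Subj> was called": 30, "<Subj> was bombed": 12}
)
irrelevant = Counter({"<Subj> was murdered": 2, "<Subj> was called": 45})


def rank_patterns(relevant, irrelevant):
    """Rank patterns by the share of their occurrences found in the relevant corpus."""
    scores = {}
    for pattern in relevant:
        total = relevant[pattern] + irrelevant[pattern]
        scores[pattern] = relevant[pattern] / total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for pattern, score in rank_patterns(relevant, irrelevant):
    print(f"{score:.2f}  {pattern}")
```

A general-purpose pattern such as "<Subj> was called" sinks to the bottom because it is just as common in irrelevant documents, which makes the manual filtering in step 3 much quicker.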



Bootstrapping



Task 12: (DARPA – GALE year2) Produce a biography of [person]

1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members:
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics:
16. Passport number and country of issue:
17. Professional positions:
18. Education:
19. Party or other organization affiliations:
20. Publications (titles and dates):



Biography – two approaches

• To obtain high precision, we handle each slot independently,
using bootstrapping to learn IE patterns.

• To improve recall, we utilize a biographical sentence
classifier.



Topic Four: Challenges to Information Extraction


IE Challenges

AMBIGUITY
• Fred’s appointment as professor vs. Fred’s 3 PM
appointment with the dean
• outbreak of typhoid vs. outbreak of violence

COMPLEX STRUCTURES
• For the Federal Election Commission, Bush picked Justice
Department employee and former Fulton County, Ga.,
Republican chairman Hans von Spakovsky for one of three
openings.
(Grishman, 2012)



IE Challenges Cont’d.
• LOTS OF DIFFERENT PATTERNS
• Different words:
named, appointed, selected, chosen, promoted, …
• Different constructions:
IBM named Fred president
IBM announced the appointment of Fred as
president
Fred, who was named president by IBM
• Different names:
George H. W. Bush, former President Bush, 41


IE Challenges Cont’d.
REFERENCE
• George Garrick has served as president of Sony USA for 13
years. The company announced his retirement effective next
May.
• IBM announced several new appointments yesterday. Fred
Smith was named head of research.

(Grishman, 2012)



Summary

• In this session, we learned that information
extraction can be broadly defined as establishing
relationships among a set of entities.
• The aim, therefore, is to identify relations between
entities, which are used primarily to construct
knowledge bases.



Activity 9.1

1. Distinguish between structured and unstructured
data.

2. Why is it important to transform unstructured data
into structured data?

3. In your own words, define information extraction,
and discuss any one of the approaches to
information extraction known to you.



Activity 9.2

• Collect a list of names of works of art from a particular
category from a Web-based source (e.g., gutenberg.org,
amazon.com, imdb.com, etc.).
• Analyze your list and give examples of ways that the names
in it are likely to be problematic for the techniques described
in this session.



References

Izquierdo, R. (2015). Information Extraction. Available at:
https://www.slideshare.net/rubenizquierdobevia/information-extraction-45392844

Grishman, R. (2012). Information Extraction: Capabilities
and Challenges. International Winter School in
Language and Speech Technologies. Tarragona,
Spain: International Winter School in Language and
Speech Technologies.
https://doi.org/10.1561/1500000003



INFS 422: INFORMATION STORAGE &
RETRIEVAL

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

The user of an IRS is at the epicenter of the success of the
IRS. This session discusses the information user. It
provides insight into the nature of information users,
their characteristics and types. It also discusses the nature
of their information needs.



Session Outline

The key topics to be covered in the session are:

• Topic One: Users and their nature

• Topic Two: Characteristics of users

• Topic Three: Information Needs



Recommended Reading Text

Chowdhury, G. G. (2010). Introduction to modern
information retrieval. Facet Publishing.

Singh, S. (2015). Users and information use in academic
libraries. Acad. Libr. Syst.



Introduction
• The user is the focal point of all information retrieval
systems, because the sole objective of any ISR system is to
transfer information from the source to the user
(Chowdhury, 2010).

• The characteristics and the specific needs of users
determine the nature of the information to be collected by the
IRS, and the nature of the user interface to design.

• Hence, an understanding of the nature and number of users,
their activities in relation to their information requirements,
and so forth is crucial to the design and development of an
appropriate ISR system.



Topic One: Users and their Nature


Who is a User? - 1

• The concept of the ‘user’ is by no means clear (i.e., it is
ambiguous) (Chowdhury, 2010).
• An individual who makes use of information in any way to
complete a task (IGI Global, 2018).
• According to Kenneth Whittaker, a user may be defined as a
person who uses one or more of a library’s services at least once
a year (Singh, 2015).
• A user can be described as a person who needs information that
can be provided by specific library services, or someone who
is known to have the intention of using certain information
services from the library (Singh, 2015).



Who is a User? - 2

• The term "user" can refer to any person who interacts with
an information system to search for and select resources
he/she needs (University of North Texas, 2017).

• Alternative names for the information user: end users,
patrons, clients, searchers, consumers, readers, etc.

• The term ‘user’ often refers to a person visiting a library or
information centre, be it physical or virtual.

• The type of information user is in fact dependent on the
nature of the information.



Topic Two: Characteristics of Users


User Characteristics

• Users may be limited by:
➢The nature of their work/profession
➢Age
➢Sex
➢Other forms of social groups



Characteristics of users (Cont’d.)

• Several criteria may be used to identify and categorise
users.

• For example, within the context of an organisation, user
categories may be identified as:
• Actual Users
• Potential Users
• Expected Users
• Beneficiary Users



Characteristics of users (Cont’d.)

Actual users:
• those who are using the information service at a given time.

Potential users:
• those who are not yet served by the information service.

Expected users:
• those who not only have the privilege of using the information
service, but also have the intention of doing so.

Beneficiary users:
• those who have derived some benefit from the information service.

(Chowdhury, 2010)


Topic Three: Information Needs (IN)


What is an Information Need?

• An information need arises when an individual recognises that their current state of knowledge
is insufficient to cope with the task in hand, to resolve conflicts
in a subject area, or to fill a void in some area of knowledge.

• Information needs refer to the specific requirements or desires that individuals
or organizations have for obtaining information to fulfill a particular
purpose or achieve a specific goal.

• It is the recognition that there is a gap in knowledge or understanding that
needs to be filled through the acquisition of relevant information.

(Chowdhury, 2010)



What is an Information Need?

• Identifying information needs involves understanding what information is required to
address a particular situation or objective.

• This includes determining the scope, depth, and specificity of the information needed.
• For example, a student conducting research for a term paper may have information needs
related to specific aspects of a topic, such as historical background, current research, or
statistical data.

• To address information needs, individuals or organizations typically engage in
information-seeking activities.

• This involves searching for, accessing, and evaluating information from various
sources, such as books, articles, databases, websites, experts, or other individuals.
• The information sought should be relevant, reliable, and credible to ensure its usefulness in
meeting the identified needs.



Things to note about Information Need (IN):

• IN is a relative concept; it depends on several factors. It
varies from person to person, job to job, subject to subject,
and so on.

• IN changes over time (it does not remain constant).

• IN often changes upon receipt of some information.

• IN often remains unexpressed or poorly expressed.



Types of Information Needs

• In the context of library search, Taylor (as cited in Chowdhury,
2010) identifies four (4) major types of information need that
lead the user from the state of a purely conceptual need to
one that is formally expressed and constrained (by the
environment):

• Visceral need → Conscious need → Formalised need → Compromised
need



Visceral Needs

• Visceral needs refer to individuals’ actual but unexpressed
information needs.

• These needs may arise from personal experiences,
emotions, or intuitions.

• Individuals may have a sense that they lack information or
understanding in a particular area, even if they cannot
precisely articulate or define their needs.

• Visceral needs often serve as the underlying motivation that
triggers the search for information.

Dr (Mrs) Florence Entsua-Mensah 18


Conscious Needs

• Conscious needs represent an ill-defined area of decision or
uncertainty where individuals recognize the need for
information but are unable to clearly articulate it.

• At this level, individuals may feel a sense of unease or a
desire for more knowledge to address a specific problem,
make a decision, or overcome a challenge.

• Conscious needs often arise from gaps in understanding or
awareness and drive individuals to seek clarification or
explore possible solutions.


Formalised Needs

• Formalised needs emerge when individuals are able to express
their information needs in concrete terms.

• At this level, individuals can define the specific information
required to address a particular problem or decision.

• Formalised needs are characterized by a clearer articulation of
the gap in knowledge or information, enabling individuals to
seek relevant and targeted information to fill that gap.

• These needs are often expressed through specific questions
or statements.


Compromised Needs

• Compromised needs occur when the original information need
is translated into what the available resources and information
systems can deliver.
• This takes into account the limitations and constraints individuals face
in accessing and obtaining the desired information.
• Compromised needs often arise due to practical considerations,
such as time constraints, resource availability, or technological
limitations.
• Individuals may adjust their information requirements based on
what is realistically achievable within the given constraints.


Types of Information Needs - 2

Visceral Need:      The unconscious need.
Conscious Need:     Conscious but undefined need.
Formalised Need:    Formally expressed need.
Compromised Need:   Expressed need, influenced by internal and
                    external constraints.


Information-Seeking Behaviour (ISB) of Users

Factors that can affect the ISB of users:

• Their awareness of, and ability to access, sources of
information
• Their educational and professional background
• Their interpersonal relationships (how easily they get on with people)
• The amount of competition that exists in their field of activity
• Their past experiences (or the environment in which the user
grew up)
• How the users formulate their queries


Summary

• In this session, we have learnt the characteristics of users of an
information retrieval system.

• The session also reviewed the nature of the information needs of
users.


Activity 10.1

• Aside from the user categorization discussed in this
session, which other categorizations do you know of?
• Undertake this exercise using relevant academic
literature.


References

Chowdhury, G.G., 2010. Introduction to modern information
retrieval. Facet Publishing.
IGI Global, 2018. What is Information-Users [WWW
Document]. URL https://www.igi-global.com/dictionary/information-users/14604
(accessed 8.1.18).
Singh, S., 2015. Users and information use in academic
libraries. Acad. Libr. Syst.
University of North Texas, 2017. Users and their
information needs.


INFS 422: INFORMATION STORAGE & RETRIEVAL

Session 11 – Evaluation of ISR System

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

• In recent years, the evaluation of Information Retrieval
Systems (IRS) and of techniques for indexing, sorting, searching and
retrieving information has become increasingly important
(Saracevic, as cited in Kowalski, 2007).

• Consequently, this session discusses the various ways in
which an IRS may be evaluated.

Lecturer: Dr (Mrs) F. O. Entsua-Mensah SLIDE 2


Session Outline

The key topics to be covered in the session are:

• Topic One: Understanding Evaluation

• Topic Two: Criteria for Evaluation

• Topic Three: Lancaster’s Steps in Evaluation


Reading List

• Chowdhury, G. G. (2010). Introduction to modern information
retrieval. Facet Publishing.

• Kowalski, G. J. (2007). Information retrieval systems: theory
and implementation (Vol. 1). Springer.


TOPIC 1: Understanding Evaluation


What is Evaluation?

• Evaluation is a systematic determination of a subject's merit, worth and
significance, using criteria governed by a set of standards.
• Ascertaining the value or worth of something
• Assigning a rating
• Assessing, judging or appraising the quality, ability or extent of significance of
something
• IR evaluation can be conducted from two main viewpoints:
- Managerial view: when evaluation is conducted from the managerial point of
view, it is called managerial-oriented evaluation.
- User view: when evaluation is conducted from the user's point of view, it is
called user-oriented evaluation.


Reasons for Evaluating IRS

• To monitor system effectiveness
• To aid in the selection of a system to procure
• To provide input to the cost-benefit analysis of an information system
• To assess the query generation process for improvements
• To determine the effects of changes made to an existing information system
• To establish a foundation for further research on the reasons for the
relative success of alternative techniques
• To improve the means employed for attaining objectives or to redefine
sub-goals or goals in view of research findings
(Kowalski, 2007)


Metrics for Evaluation of IR Systems

• The purpose of evaluating an IR system is to measure its
performance on a given scale. Performance is measured by
two basic parameters: efficiency and effectiveness.

• 1) Efficiency – how economically the system achieves its
objectives. The cost factors involved in meeting the stated
objectives can be determined by:
➢ Response time – average time taken by the system to respond to a
query
➢ User effort – time taken by a user to obtain the right information
➢ Financial expenditure – cost per search
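
The three cost factors above can be worked out from a simple search log. The following is a minimal sketch only: the log entries, field names and total cost are invented for illustration and do not come from any real system.

```python
from statistics import mean

# Hypothetical log of three searches; every number is invented for illustration.
search_log = [
    {"response_seconds": 0.8, "user_minutes": 4.0},
    {"response_seconds": 1.4, "user_minutes": 9.5},
    {"response_seconds": 0.5, "user_minutes": 3.0},
]
total_cost = 12.0  # assumed total spend on these searches, in any currency


def efficiency_summary(log, total_cost):
    """Summarise the three cost factors for a batch of searches."""
    return {
        "avg_response_seconds": mean(entry["response_seconds"] for entry in log),
        "avg_user_minutes": mean(entry["user_minutes"] for entry in log),
        "cost_per_search": total_cost / len(log),
    }


summary = efficiency_summary(search_log, total_cost)
print(summary)
```

Lower values on all three figures suggest a more efficient system, but they say nothing about whether the retrieved items were relevant; that is the job of the effectiveness measures.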



Metrics for Evaluation of IR Systems

• 2) Effectiveness – the extent to which relevant information
is retrieved and non-relevant information is withheld.

• It is the degree to which the system attains its
objectives.


TOPIC 2: Criteria for Evaluation


Proponents of Evaluation Methods

• A number of researchers in the field of information storage
and retrieval have suggested ways in which an IRS may be
evaluated.

• Prominent proposals come from:
− Cleverdon
− Lancaster


Cleverdon’s Criteria for Evaluating IRS
• Cleverdon (1966) proposed six criteria:
1. Recall – ability of the system to present all relevant items
2. Precision – ability of the system to present only those items that are relevant
3. Time lag – average interval between query submission and response to query
4. Effort – the physical and intellectual effort expended by the user to obtain an
answer to his query
5. Form of presentation of search output, which affects the ability of the user to
make use of the retrieved items
6. Coverage of the collection – the extent to which the system includes relevant
matter


Lancaster’s Criteria for Evaluating IRS

1. Coverage of the system

2. Ability of the system to retrieve wanted items (i.e., recall)

3. Ability of the system to avoid retrieval of unwanted items (i.e., precision)

4. The response time of the system, and

5. The amount of effort required by the user


Recall and Precision

• Recall and precision are the most important evaluative criteria for
assessing the performance of an IRS.
• Recall – the extent to which the relevant items in the collection are
retrieved.
• An IR system is expected to retrieve relevant documents in
response to a query.
• However, in a large collection, only a proportion of the total
relevant documents is retrieved.
• The system’s performance is measured by the recall ratio:

\[ \text{Recall} = \frac{\text{No. of relevant items retrieved}}{\text{Total number of relevant items in the collection}} \times 100 \]
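
As a worked illustration of the recall ratio, the sketch below computes it for one query. The document identifiers and relevance judgements are hypothetical and exist only to show the arithmetic.

```python
def recall_percent(retrieved, relevant):
    """Percentage of the relevant items in the collection that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when the collection has no relevant items")
    hits = len(set(retrieved) & relevant)
    return 100 * hits / len(relevant)


# Hypothetical document identifiers, invented for this example.
relevant_in_collection = {"d1", "d2", "d3", "d4", "d5"}
retrieved_by_system = ["d1", "d2", "d9", "d7"]

# 2 of the 5 relevant items were retrieved.
recall = recall_percent(retrieved_by_system, relevant_in_collection)
print(recall)  # 40.0
```

Note that the two non-relevant items retrieved (d9 and d7) do not affect recall at all; they only lower precision.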



Recall and Precision

• Precision – the extent to which the system retrieves relevant
items and withholds non-relevant items.
• It is defined as the proportion of the retrieved items that is
relevant.

\[ \text{Precision} = \frac{\text{No. of relevant items retrieved}}{\text{Total number of items retrieved}} \times 100 \]

• An ideal system would achieve 100% recall and 100% precision,
which is rarely possible in practice.
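
Precision, taken as the proportion of retrieved items that is relevant, can be computed in the same way. This is a small sketch with hypothetical document identifiers, not data from any real search.

```python
def precision_percent(retrieved, relevant):
    """Percentage of the retrieved items that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        raise ValueError("precision is undefined when nothing was retrieved")
    hits = len(retrieved & set(relevant))
    return 100 * hits / len(retrieved)


# Hypothetical document identifiers, invented for this example.
relevant_in_collection = {"d1", "d2", "d3", "d4", "d5"}
retrieved_by_system = ["d1", "d2", "d9", "d7"]

# 2 of the 4 retrieved items are relevant.
precision = precision_percent(retrieved_by_system, relevant_in_collection)
print(precision)  # 50.0
```

The denominator here is the number of items retrieved, not the size of the collection, so precision can be judged from the search output alone.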



Limitations of recall and precision

• Different users may want different levels of recall – a person
preparing a report on a given topic may prefer high recall.
Conversely, someone who needs to know just something about
a given topic may prefer low recall.

• Recall assumes that all relevant items have the same value,
but the value may be relative and vary from user to user.

• Both recall and precision rely on the relevance judgement
of the user, and this judgement may be subjective.


Limitations contd..

• A subjective view of relevance may also depend upon
the user's knowledge of the content at the time of the search.

• Therefore, all pertinent items may be relevant, but not all
relevant items may be pertinent.

• Some researchers have observed that the larger the
collection, the larger the number of non-relevant items for a
given query; thus precision and recall tend to have an inverse
relationship.
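
The inverse relationship can be seen by looking further and further down a ranked result list. The sketch below uses an invented ranking and invented relevance judgements; it only illustrates the tendency and is not a claim about any particular system.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall (as fractions) over the top-k items of a ranked list."""
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(top_k), hits / len(relevant)


# Hypothetical ranked output and relevance judgements, invented for illustration.
ranked_results = ["d1", "d2", "d8", "d3", "d9", "d7", "d4", "d6"]
relevant_docs = {"d1", "d2", "d3", "d4"}

for k in (2, 4, 8):
    p, r = precision_recall_at_k(ranked_results, relevant_docs, k)
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
# top 2: precision=1.00 recall=0.50
# top 4: precision=0.75 recall=0.75
# top 8: precision=0.50 recall=1.00
```

Retrieving more items picks up the remaining relevant documents (recall rises) but also admits more non-relevant ones (precision falls).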



Limitations contd..

• An increase in recall tends to cause a decrease in
precision.

• The evaluative criteria are document-based and therefore
measure only the performance of the system in retrieving items
that have been predetermined to be relevant to the
information need.

• They do not consider how the information will be used or
whether the documents fulfil the information need of the user.


Other measures of evaluation

• Efficient IR systems must be designed to maximize recall
and precision. The limitations of the precision and recall
criteria call for additional measures of evaluation, such as:

• Fallout ratio = the proportion of non-relevant items retrieved in a
given search

• Generality ratio = the proportion of relevant documents in
the collection for a given query.


Retrieval measures

Symbol  Evaluation Measure  Formula            Explanation
R       Recall              a/(a+c)            Proportion of relevant items that are retrieved.
P       Precision           a/(a+b)            Proportion of retrieved items that are relevant.
F       Fallout             b/(b+d)            Proportion of non-relevant items that are retrieved.
G       Generality          (a+c)/(a+b+c+d)    Proportion of items in the collection that are relevant to the query.

[a = relevant documents retrieved] [b = non-relevant documents retrieved]
[c = relevant documents not retrieved] [d = non-relevant documents not retrieved]
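
The four measures can be computed together from the four counts. The sketch below takes a as relevant items retrieved, b as non-relevant items retrieved, c as relevant items not retrieved and d as non-relevant items not retrieved; the counts themselves are hypothetical.

```python
def retrieval_measures(a, b, c, d):
    """Recall, precision, fallout and generality from the four contingency counts.

    a: relevant and retrieved       b: non-relevant but retrieved
    c: relevant but not retrieved   d: non-relevant and not retrieved
    """
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "fallout": b / (b + d),
        "generality": (a + c) / (a + b + c + d),
    }


# Hypothetical counts for one query against a 1,000-document collection.
measures = retrieval_measures(a=30, b=20, c=10, d=940)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts, recall is 0.75 and precision is 0.60, while fallout stays small because the collection contains many non-relevant items that were correctly left out.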



Other measures contd..

• Usability – a measure that considers the interface,
expectations, experiences and skills of the user.

• Cost – users may incur costs in the form of any payment
that they need to make for system or document access. Cost also
covers the time used for searching the system, the search
algorithms, options for display of search results, etc.


TOPIC 3: Lancaster’s Steps in Evaluation


Lancaster’s Steps in Evaluation
• Lancaster (1971) proposes five (5) main steps for evaluation:
1. Designing the scope of evaluation (detailed plan)
➢ Purpose and objectives
➢ Type, i.e., laboratory set-up, real-life situation, macro- or
micro-evaluation
➢ Cost and staff time
2. Designing the evaluation programme
➢ Methodology or action plan
➢ Need for designers to control some parameters of the system
➢ Design must be clear and mark major caution points where more
care is needed


Lancaster’s Steps in Evaluation (Cont’d.)

3. Execution of the evaluation
- Meticulous implementation of the methodology by the
evaluator to avoid bias or error
- Constant communication between designer and evaluator
about observations and possible re-design of the programme
4. Analysis and interpretation of results
- Interpretation of results based on the objectives
- Suggestions for improvement based on findings
5. Modifying the system based on the results
➢ Finally, the retrieval system is modified, if necessary, based on the
feedback obtained from the results and interpretation of the
evaluation.


Summary

• In this session, the importance of evaluating information
retrieval systems was discussed.

• The session also presented some of the criteria for
evaluation of IRS, with reference to:
• Lancaster’s Steps in Evaluation; and
• Cleverdon’s Evaluation Criteria


Activity 10.1

• Compare and contrast Lancaster’s evaluation steps with
Cleverdon’s criteria.


References

• Chowdhury, G. G. (2010). Introduction to modern information
retrieval. Facet Publishing.
• Kowalski, G. J. (2007). Information retrieval systems: theory
and implementation (Vol. 1). Springer.


INFS 422: INFORMATION STORAGE & RETRIEVAL

Session 12 – Evolutions in Information Retrieval

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

The aim of this session is to:

• Provide an understanding of how the information retrieval
field progressed/evolved to its current state.
• Each evolutionary milestone is illustrated by describing the
standards and protocols and by discussing the global initiatives and
the research that shaped it.


Session Outline

The key topics to be covered in the session are:


• Topic 1: Information Retrieval Standards & Protocols
• Topic 2: Global Digital Library
• Topic 3: Intelligent Information Retrieval
• Topic 4: Hypertext and Hypermedia Systems



Reading List

• Rego, A., Garcia, L., Llopis, M., & Lloret, J. (2016). A New
Z39.50 Protocol Client for Searching in Libraries and
Research Collaboration. Network Protocols and
Algorithms, 8(3), 29.
https://doi.org/10.5296/npa.v8i3.10147
• The Apache Software Foundation: The Free and Open
Productivity Suite. Retrieved from:
http://www.openoffice.org/bibliographic/srw.html.
Accessed on August 7, 2018.


Introduction

• The growth of Information and Communication
Technology (ICT) has refashioned information search
and retrieval.

• Several advancements have taken
place in this area (i.e. ISR) over time.


TOPIC 1: Information Retrieval Standards & Protocols


What is a Standard?

• A standard is an agreement on how to perform
a task or carry out some activity to obtain a predictable
result.
• Various standards and protocols exist for IR systems.
• Some of the popular search and retrieval standards and
protocols include:
– Z39.50
– SRW
– SRU
– CQL


Z39.50
• Z39.50 is a communication protocol between a client
and a server.

• The growing amount of information available at
libraries, and the need for a mechanism to
search for information at several libraries at the same
time, prompted the creation of the Z39.50
protocol.


Z39.50 (Cont’d)
• Sessions inside one connection between the two nodes
are known as Z39.50-associations or Z-associations
(Rego et al., 2016).

• These sessions are initiated by the client. Once a Z-association
is open, both server and client can start
any operation defined in the Z39.50 protocol. In the
same way, a Z-association can be closed by either
client or server, or implicitly terminated by loss of
connection (Rego et al., 2016).
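
The lifecycle described above (opened by the client, closed by either party, or ended by loss of connection) can be pictured as a small state model. The class below is a toy illustration only: its name, methods and states are invented for this sketch and it does not implement or reproduce any part of the actual Z39.50 protocol.

```python
class ZAssociation:
    """Toy model of a Z-association's lifecycle; not a Z39.50 implementation."""

    def __init__(self):
        self.state = "closed"
        self.closed_by = None

    def open(self, initiator):
        # Per the slide, only the client initiates a session.
        if initiator != "client":
            raise PermissionError("only the client may initiate a Z-association")
        self.state = "open"
        self.closed_by = None

    def close(self, party):
        # Either side may close an open association explicitly.
        if self.state != "open":
            raise RuntimeError("association is not open")
        if party not in ("client", "server"):
            raise ValueError("party must be 'client' or 'server'")
        self.state = "closed"
        self.closed_by = party

    def connection_lost(self):
        # Loss of the underlying connection ends the association implicitly.
        if self.state == "open":
            self.state = "closed"
            self.closed_by = "connection loss"


association = ZAssociation()
association.open("client")
association.close("server")
print(association.state, association.closed_by)  # closed server
```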



Z39.50 (Cont’d)

• The main goal of Z39.50 is to provide a standard way to
search for information in an external database, whatever
its data organization.

• This goal is achieved because the
communication between the client and the server is
standardized and independent of the database (Rego et al., 2016).
Thus, Z39.50 is widely used in some of the biggest libraries.


Z39.50 (Cont’d)

• Z39.50 is used at both the national and international
level as a standard protocol that defines a computer-to-computer
information retrieval technique. It is
non-proprietary and vendor-independent.

• Z39.50 was originally approved by the National
Information Standards Organization (NISO) in 1988.
In 1998, the International Organization for
Standardization (ISO) adopted Z39.50 and issued ISO
23950, Information and documentation - Information
retrieval (Z39.50).


Z39.50 (Cont’d)

• Using Z39.50, a user can, through his/her own system,
search and retrieve information from other Z39.50-compliant
computer systems without prior knowledge of the search
syntax used by those systems.

• The primary goal of Z39.50 is to reduce the
complexity and difficulties involved in searching and
retrieving electronic information.


SRW

• SRW stands for Search/Retrieve Web Service
protocol (The Apache Software Foundation, 2018).
Its aim is to minimize cross-language problems.
• The goal is to allow access to several networked
resources and support interoperability among
distributed databases, using a common utilization
framework (The Apache Software Foundation, 2018).

• It was developed collectively by implementers, building on more
than 20 years of experience with the Z39.50
Information Retrieval protocol together with more recent
developments in web technologies.


SRU

• SRU stands for Search/Retrieve via URL. It is a
standard XML-based search protocol that uses
CQL (http://www.loc.gov/cql/), a standard syntax for
query representation (The Apache Software
Foundation, 2018).
• The prime difference between SRU and SRW is that
the former uses HTTP as the transport mechanism, whereas
the latter is based on the SOAP protocol and uses
XML streams for both the query and the results.
• This means that, with SRU, the query is communicated as a URL
and the XML response is received as if it were a web page.
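
To make "the query is communicated as a URL" concrete, the sketch below assembles an SRU searchRetrieve request with Python's standard library. The parameter names (operation, version, query, maximumRecords) are those of the SRU standard; the endpoint address is a placeholder, not a real server, and the helper name is an assumption of this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_url(base_url, cql_query, maximum_records=10, version="1.1"):
    """Compose an SRU searchRetrieve request as a plain URL."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,  # the CQL query travels as a URL parameter
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# The endpoint below is a placeholder, not a real SRU server.
url = build_sru_url("http://sru.example.org/catalogue", 'dc.title = "information retrieval"')
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

A server that supports SRU would answer such a request with an XML document containing the matching records, which is why the exchange resembles fetching an ordinary web page.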



CQL
• CQL stands for Contextual Query Language (formerly
known as Common Query Language).
• It is designed for use with SRW, a search protocol that is the
successor to Z39.50 (as discussed in the previous section).
• CQL is an abstract and extensible query language designed for
maximum interoperability amongst the connected
systems. The goal is to make it easy to learn and
use while retaining the capability to express complex
searches.
• CQL is used primarily in the bibliographic domain; however,
it is not restricted to this context alone.
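
A CQL query is built from search clauses of the form index, relation, term, which can be combined with boolean operators such as and. The helpers below are a hypothetical sketch for composing such strings; the function names are invented here, and the dc.title and dc.creator indexes are used only as familiar bibliographic examples.

```python
def cql_clause(index, relation, term):
    """Build one CQL search clause, quoting the term and escaping embedded quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'{index} {relation} "{escaped}"'


def cql_and(*clauses):
    """Join clauses with the CQL boolean operator 'and', parenthesising each."""
    return " and ".join(f"({clause})" for clause in clauses)


query = cql_and(
    cql_clause("dc.title", "=", "information retrieval"),
    cql_clause("dc.creator", "=", "Chowdhury"),
)
print(query)
# (dc.title = "information retrieval") and (dc.creator = "Chowdhury")
```

A simple search can be a single clause, while the same building blocks nest to express the complex searches mentioned above.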



TOPIC 2: Global Digital Library


Global Digital Library

• This is more or less a virtual library that consolidates
the collections of individual libraries into one
collection.
– The WWW and the internet laid the foundation for
virtual/digital libraries.

• The Global Digital Library (GDL) is a prototype which aims
to connect several national libraries and some major
libraries, museums, archives, and information
organizations with each other (Chen, 2001).


Challenges/Issues with Digital Information Sharing
• Legal issues may arise relating to intellectual property,
copyright, confidentiality and privacy, security, personal and business
equity, etc.;
• Differences in culture may influence the way information is
communicated;
• The presence of generational gaps;
• The sheer complexity of information architecture at both the
global and national level;
• The need for an effective and adequate inventory of available
resources comprising the knowledge of information;
• The ability to locate, identify and retrieve relevant and quality
information;
• The huge amount of information makes it harder to deal with
"undesirable" or "indecent" information.


TOPIC 3: Intelligent Information Retrieval


Intelligent IR Defined

• An intelligent IR system is a computer system that can
infer knowledge, with the help of its existing
knowledge, to establish a link between the
requirement of its user and a set of candidate documents
(Jones et al., 2000).

• In other words, it is a system which can perform intelligent retrieval.
Researchers' realization that knowledge can be used in
information retrieval systems led them to consider
artificial intelligence systems, which have a similar
purpose; one class of these is the expert
system.


Expert System Defined

• An expert system is “a computer system which emulates
the decision-making ability of human experts” (Jackson,
1998).
• Expert systems are designed to solve complex
problems by reasoning over knowledge stored in a
knowledge base.
• The knowledge in the knowledge base is primarily
represented as IF-THEN rules rather than conventional
procedural code.
• The first expert systems were invented in the 1970s and
then proliferated in the 1980s.
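
The idea of reasoning over IF-THEN rules rather than procedural code can be sketched with a very small forward-chaining loop. The rules and facts below are invented cataloguing-style examples and the function name is an assumption; real expert systems use far richer inference engines.

```python
# Hypothetical cataloguing rules, invented for illustration only.
# Each rule reads: IF all the conditions are known THEN add the conclusion.
RULES = [
    ({"has_issn"}, "is_serial"),
    ({"is_serial", "published_weekly"}, "is_periodical"),
    ({"is_periodical", "subject_physics"}, "shelve_in_physics_periodicals"),
]


def forward_chain(facts, rules):
    """Apply IF-THEN rules repeatedly until no rule adds a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain({"has_issn", "published_weekly", "subject_physics"}, RULES)
print(sorted(derived))
```

Because the knowledge lives in the rule list, the system's behaviour can be changed by editing rules, without touching the reasoning loop itself.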



Developments in Expert Systems

• As expert systems evolved, several new techniques
were adopted into various types of inference
engines. Some of the most important ones include:
– Truth Maintenance
– Hypothetical Reasoning
– Fuzzy Logic
– Ontology Classification


Expert System for LIS Profession

• AUTOCAT was produced in Germany. The system was
designed to generate bibliographic records of physical
sciences periodicals available in machine-readable form
(Endres-Niggemeyer and Knorz, 1987).
• Qualcat (Quality Control in Cataloguing) was undertaken
at the University of Bradford. The goals of the project
were to develop expert systems to select the best
records, to link the databases and centralized authority
control, to build a fully automated control package for
day-to-day running, and to investigate interface problems
for cataloguing (Ayres et al., 1994).


Expert System for LIS Profession

• OCLC developed an expert system called Cataloguer’s
Assistant. The system was tested at Carnegie-Mellon
University to reclassify the mathematics and computer
science collection (De Silva, 1997).
• FRUMP (developed by DeJong) analyses articles from
newspapers using frame-based techniques. The articles
were first scanned and then data were automatically fed
into the different slots within frames.
• SCISOR (developed by Rau, Jacobs and Zernik, 1989) is a
system that generates reports on corporate acquisitions and
mergers.


TOPIC 4: Hypertext & Hypermedia Systems


Hypertext

• Hypertext refers to the use of hyperlinks (or simply “links”) to
present text and static graphics. Many websites are entirely
or largely hypertexts (Farkas, 2004).


Hypermedia

• Hypermedia refers to the presentation of video, animation,
and audio, which are often referred to as “dynamic” or “time-based”
content or as “multimedia” (Farkas, 2004).

• Hypermedia, a logical extension of hypertext, is a non-linear
medium of information which includes plain text,
audio, video, graphics and hyperlinks.
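
The non-linear nature of hypermedia can be pictured as a graph of nodes joined by links, where a reader may take many different paths. The node names, media types and helper functions below are hypothetical and serve only to illustrate the idea.

```python
# A tiny hypothetical hypermedia space: each node has a media type and outgoing links.
NODES = {
    "home": {"media": "text", "links": ["history", "demo_video"]},
    "history": {"media": "text", "links": ["home", "timeline"]},
    "demo_video": {"media": "video", "links": ["home"]},
    "timeline": {"media": "graphics", "links": ["history", "demo_video"]},
}


def reachable_from(start, nodes):
    """Return every node a reader can reach from `start` by following links."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(nodes[current]["links"])
    return seen


def follow(path, nodes):
    """Check that each step in `path` follows an actual link."""
    return all(nxt in nodes[cur]["links"] for cur, nxt in zip(path, path[1:]))


print(sorted(reachable_from("home", NODES)))
print(follow(["home", "history", "timeline", "demo_video"], NODES))  # True
print(follow(["home", "timeline"], NODES))  # False
```

Unlike a printed book, there is no single reading order: any sequence of nodes is valid as long as each step follows a link.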



Hypertext and Hypermedia (Cont’d)

• Forms of hypertext and hypermedia include CD-ROM and
DVD encyclopaedias (such as Microsoft's Encarta), eBooks,
and the online help systems we find in software products.

• It is common for people to use "hypertext" as a general term
that includes hypermedia. For example, when
researchers talk about “hypertext theory,” they refer to
theoretical concepts that pertain to both static and
multimedia content (Farkas, 2004).


Summary

• In this session we have discussed some of the IR techniques
and technologies that evolved in the recent past.

• We have discussed some of the significant IR standards and
protocols.

• We have also reported on state-of-the-art research in the IR
field, for instance, the global digital library initiative, the
application of intelligent systems such as expert systems in library
cataloguing, classification and abstracting, and the application
and issues of intelligent hypertext and hypermedia systems.


Activity 9.1

• Discuss the role of protocols and standards in the
development of modern IR systems.


References – 1

• Ayres, F. H., Cullen, J., Gierl, C., Huggill, J. A. W., Ridley, M. J., &
Torsun, I. S. (1994). QUALCAT: automation of quality control
in cataloguing. BLRD REPORTS, 6068.
• Chen, C.-C. (2001). Global Digital Library Development in the New
Millennium: Fertile Ground for Distributed Cross-Disciplinary
Collaboration. Tsinghua University Press.
• De Silva, S. M. (1997). A review of expert systems in library and
information science. Malaysian Journal of Library &
Information Science, 2(2), 57–92.
• Endres-Niggemeyer, B., & Knorz, G. (1987). AUTOCAT:
knowledge-based descriptive cataloguing of articles
published in scientific journals. In Second International
GI Congress 1987. Knowledge Based Systems (pp. 20–21).


References – 2
• Farkas, D. K. (2004). Hypertext and hypermedia. In Berkshire
Encyclopedia of Human-Computer Interaction (Vol. 16, pp. 332–336).
https://doi.org/10.1016/0360-1315(91)90062-V
• Jackson, P. (1998). Introduction to expert systems. Addison-Wesley
Longman Publishing Co., Inc.
• Jones, K. S., Walker, S., & Robertson, S. E. (2000). A probabilistic model
of information retrieval: development and comparative experiments:
Part 2. Information Processing & Management, 36(6), 809–840.
• Rego, A., Garcia, L., Llopis, M., & Lloret, J. (2016). A New Z39.50
Protocol Client for Searching in Libraries and Research
Collaboration. Network Protocols and Algorithms, 8(3), 29.
https://doi.org/10.5296/npa.v8i3.10147
• The Apache Software Foundation: The Free and Open Productivity Suite.
Retrieved from: http://www.openoffice.org/bibliographic/srw.html.
Accessed on August 7, 2018.


INFS 422: INFORMATION STORAGE & RETRIEVAL

Dr. FLORENCE O. ENTSUA-MENSAH


(fentsua-menah@ug.edu.gh)

2022/2023 Academic Year


Session Overview

• The session recounts the role of technology in information
storage and retrieval.

• The session specifically examines landmark technological
innovations such as the WWW and cloud storage
services.

Dr. (Mrs) Florence O. Entsua-Mensah Slide 2


Session Outline

The key topics to be covered in the session are:

• Topic One: Web Information Retrieval
• Topic Two: Search Engines
• Topic Three: Cloud Storage Services


Reading List

Mansourian, Y. (2004). Similarities and differences
between Web search procedure and searching in the
pre-web information retrieval systems. Retrieved
July 9, 2008, from:
http://www.webology.org/2004/v1n1/a3.html.

Zhang, Y. (2004). A Comparison of Search Engines For
Finding Resources. Retrieved July 9, 2008, from
http://www.yuanlei.com/studies/articles/is567-searchengine/page2.htm


TOPIC 1: Web Information Retrieval


The World Wide Web

• The World Wide Web (WWW), otherwise known as the Web, is
a collection of related and connected web pages.

• A web page is basically an HTML file.


The WWW Naming Scheme

• The WWW uses a naming scheme to identify documents,
known as Uniform Resource Identifiers or simply URIs
(Berners-Lee et al., 1998).
• A URI may take the form of a Uniform Resource Locator (URL), which
identifies a document by including information on the location of
the document, or a Uniform Resource Name (URN), which
acts as a true identifier by providing a reference to a document
that does not depend on its location.
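
The difference between a locator and a name shows up when a URI is split into its parts. The sketch below uses Python's standard urllib.parse module; the helper name is an assumption of this example and the ISBN value in the URN is a placeholder, not a real book number.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Classify a URI as a URN or a URL by its scheme and split out its parts."""
    parts = urlsplit(uri)
    if parts.scheme == "urn":
        # A URN has the form urn:<namespace>:<name> and carries no location.
        namespace, _, name = parts.path.partition(":")
        return {"kind": "URN", "namespace": namespace, "name": name}
    return {"kind": "URL", "scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


url_info = describe_uri("http://www.webology.org/2004/v1n1/a3.html")
urn_info = describe_uri("urn:isbn:0000000000")  # placeholder ISBN, not a real one
print(url_info)
print(urn_info)
```

The URL says where to fetch the document (a host and a path), whereas the URN only names the item and leaves it to some other service to find a copy.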



Classic vs Web Information Retrieval

• In classic IR systems, both the resources and the users
were more or less predictable and homogeneous.
• The digital content of online and offline databases and of
Online Public Access Catalogues (OPACs) mainly consists of
data stored in a structured manner. Because of this,
the search and retrieval process was much
easier and more predictable.

(Mansourian, 2004)


Classic vs Web IR (Cont’d.)

• Accordingly, the user group mainly comprised
academics, researchers, subject experts and librarians.

• They were well aware of the search keywords to use for
finding a particular document. With the emergence of the
Web, however, came a flood of digital information that is
unstructured, uneven and heterogeneous; Web-based IR systems
evolved to cope with this situation.

(Mansourian, 2004)


Classic vs Web IR (Cont’d.)

• These systems provide access to web-based digital
content.

• They use programs to maintain and update a list of web
documents added to the web, at regular intervals or as
needed.

• Web IR systems then look for the searched keywords
in the maintained list or index file to retrieve the original
document from the web.

(Mansourian, 2004)
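
The "maintained list or index file" can be illustrated with a small inverted index, which maps each word to the pages that contain it. This is a minimal sketch under stated assumptions: the pages and addresses are invented, and real web IR systems add crawling, ranking and much more.

```python
import re
from collections import defaultdict

# Hypothetical pages, standing in for documents a crawler has collected.
PAGES = {
    "http://example.org/a": "Information retrieval on the web",
    "http://example.org/b": "Cloud storage services for libraries",
    "http://example.org/c": "Web search engines and information needs",
}


def build_index(pages):
    """Map each lower-cased word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every keyword in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


index = build_index(PAGES)
hits = search(index, "web information")
print(sorted(hits))
```

The search never scans the pages themselves; it consults only the index, which is why the index must be refreshed as documents are added to the web.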


Web IR

• The user interface of a Web IR system is much more user-friendly than that of
traditional IR systems, to accommodate the
growing number of inexperienced users of Web IR systems.

• Web IR interface designers therefore have to pay more attention to
the information-seeking behaviour of such users than was necessary
in classical IR systems, whose users were mainly experts
in the domain under consideration.

(Mansourian, 2004)


TOPIC 2: Search Engines


What are Search Engines?
• A search engine is a tool for locating information in a
collection. Search engines use information about the
information (such as metadata or a catalogue) stored in the
database to locate it.
• Sometimes they perform a full-text search within the
document, from the first character to the last.
• Search engines are computer programs that search for
particular ‘keywords’ entered by users and return a list of
the documents associated with them.
• Put simply, a search engine is a service that searches for
content on the web.
• However, some do not search for keywords only; they also
search for other things, and these are not “engines” in the
classical sense.
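
The difference between searching the information about a document (its metadata) and searching its full text can be sketched as follows. This is an illustration under stated assumptions, not a description of any real search engine: the two records, their field names (`title`, `keywords`, `text`) and both function names are hypothetical.

```python
# Hypothetical records: each has metadata (title, keywords) and a full text.
records = [
    {
        "title": "Library Handbook",
        "keywords": ["cataloguing", "classification"],
        "text": "Books are arranged on shelves and indexed for retrieval.",
    },
    {
        "title": "Cloud Basics",
        "keywords": ["storage", "backup"],
        "text": "Files kept in the cloud can be retrieved from any device; "
                "cataloguing them still helps.",
    },
]

def metadata_search(records, keyword):
    """Match the keyword against the title and keyword fields only."""
    keyword = keyword.lower()
    return [
        r["title"] for r in records
        if keyword in r["title"].lower()
        or any(keyword in k.lower() for k in r["keywords"])
    ]

def full_text_search(records, keyword):
    """Scan the whole text of every record, from first character to last."""
    keyword = keyword.lower()
    return [r["title"] for r in records if keyword in r["text"].lower()]

by_metadata = metadata_search(records, "cataloguing")
by_full_text = full_text_search(records, "cataloguing")
print(by_metadata, by_full_text)
```

The same keyword returns different documents under the two approaches, which is why the choice between metadata search and full-text search matters to the user.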



Types of Search Engines

• Crawler-based search engines are useful if we have a
specific search keyword in mind. If our search topic is a
general one, this type of search engine may return several
irrelevant documents, e.g. AltaVista.
• Human-powered directories are good if our search is on a
general topic. Search engines powered by human-crafted
directories guide us, help us narrow our search and fetch
refined responses, e.g. DMOZ.

(Zhang, 2004)



Types of Search Engines (Cont’d.)

• Hybrid search engines use a combination of both crawler-based results
and directory results, e.g. Google.
• Meta-search engines are online tools (search engines) that search
more than one search engine simultaneously.
• These search engines aggregate the results into a single list and
display them according to their source.
• They save time by gathering results from different search engines in
a single interface.
• They are excellent if we wish to know whether anything is available
on the web about a particular topic, e.g. Dogpile.
(Zhang, 2004)
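
The meta-search idea above (one query, several engines, one combined list labelled by source) can be sketched as follows. This is only an illustration: `engine_a` and `engine_b` are hypothetical stand-ins that return made-up URLs, not real search services, and a thread pool is simply one assumed way of querying them at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

def engine_a(query):
    """Hypothetical search engine returning two made-up result URLs."""
    return [f"http://a.example/{query}/1", f"http://a.example/{query}/2"]

def engine_b(query):
    """Hypothetical search engine returning one made-up result URL."""
    return [f"http://b.example/{query}/1"]

def meta_search(query, engines):
    """Query every engine concurrently and merge the results into one list.

    Each result is paired with the name of the engine it came from, so the
    combined list can be displayed according to source.
    """
    names = list(engines)
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda name: engines[name](query), names))
    merged = []
    for name, urls in zip(names, result_lists):
        for url in urls:
            merged.append((name, url))
    return merged

results = meta_search("libraries", {"engine_a": engine_a, "engine_b": engine_b})
for source, url in results:
    print(source, url)
```

Keeping the source name next to each result is what lets the single interface show where every hit came from.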



TOPIC 3
Cloud Storage Services



What is Cloud Computing? (1)

• Cloud computing is described as a computing environment
where software applications, computing hardware and other
resources are provisioned to authorized users via a network,
instead of residing on the users' local devices or at their
location.

• With a network connection on a computer or a smart device,
users can access resources on their cloud platforms without
any other connection to the hardware that holds the
resources (Gruman, 2008).



The underlying rationale for cloud computing (CC) is ubiquity of access:
access beyond location and device.



Cloud Computing (1)
• Cloud computing has opened a new phase in the
technological sphere where technological products
are now rendered as services.

• This allows technological products to be repackaged
as services to meet the needs of customers.
• Hardware products such as storage devices are now
repackaged as services (e.g. OneDrive, Google Drive
and Dropbox).



Cloud Computing (2)

• The advent of cloud computing has generated a lot of
interest among individuals and corporate organizations.

• Knowingly or unknowingly, most people use cloud
services daily (Rawal, 2011).

• From your understanding so far, can you name some
cloud services that you know of, apart from the ones that
have been mentioned in this class?

Computing service models

• The types of service models that have emerged under
cloud computing technology are:
• Software-as-a-Service (SaaS)
• Platform-as-a-Service (PaaS)
• Infrastructure-as-a-Service (IaaS)

• Cloud computing services can be Private, Public or Hybrid.



Software-as-a-Service (SaaS)
• Under this service model, cloud computing applications
residing on the cloud infrastructure of the service providers
are delivered to the user through web interfaces and
programs (Mell & Grance, 2011).
• E.g. Microsoft Word Online and Google Docs.
• These are word processing applications
which are offered as a service on the internet.
• The software therefore does not reside on the user's
individual device.



[Screenshots: Microsoft Word Online; Google Slides]


Platform-as-a-Service (PaaS) - 1

• PaaS is a cloud service model, described as an upgrade of
the SaaS model, which offers users a platform to build and
run applications through a programming interface provided
and supported by the cloud service provider (Birk, 2011).

• PaaS is analogous to SaaS except that, rather than being
software delivered over the web, it is a platform for the
creation of software, delivered over the web.



Platform-as-a-Service (PaaS) - 2
• A set of tools and services designed to make coding and
deploying applications quick and efficient.
• Allows users to create their own applications using tools and
languages from the platform service provider.



Platform-as-a-Service (PaaS) - 3
• For instance, a programmer who needs to build an
application that requires a high-speed processing server can
sign up for a PaaS offering rather than spending huge
amounts on purchasing servers and operating systems for
the application.

• Examples of PaaS offerings are Google App Engine and
Microsoft Azure (formerly Windows Azure).



Infrastructure-as-a-Service (IaaS)

• In the IaaS model, cloud service providers supply a
range of virtual infrastructure, such as virtual servers,
networks, storage and other fundamental computing
resources, to customers. This enables them to deploy
and run their own operating systems and applications,
and to upload or download software or files to and
from the cloud. E.g. Dropbox and OneDrive

(Mouratidisa et al., 2013)



Infrastructure-as-a-Service (IaaS) (Cont’d.)
• Allows users to run any applications they
please on the cloud hardware of their own
choice.

• IaaS is the hardware and software that
powers it all:
• servers
• storage
• networks
• operating systems



ISR & Cloud Computing (CC)

• The introduction of this technology has affected the way
information professionals, and even non-information
professionals, handle information.

• For the purposes of this introductory class, the focus
will be on how CC has affected the following areas of IM:
• Information capturing and processing
• Information storage, access and backup
• Information sharing and retrieval



CC and Information Capturing & Processing

• Information processing is about manipulating data to get
desired results.
• In the digital era this is often aided by computer
software, such as word processors, spreadsheets and
statistical analysis software (e.g. SPSS).
• CC creates the opportunity for such data processing
software to be accessed and used on multiple devices
without necessarily installing it on each device (i.e.
SaaS).
• E.g. MS Excel Online and Google Sheets for data
capturing and processing



CC and Information Storage & Access

• A very crucial aspect of IM is information storage and
retrieval.
• Cloud-based storage services such as Google Drive™
and Dropbox™ have brought about a new phase in
information storage and retrieval (i.e. IaaS).
• That is, information is now stored and accessed from
multiple locations, as opposed to the traditional storage
and retrieval systems, which were confined to specific
locations or computer hard drives.



CC & Information Storage and Backup

• Large cloud providers with geographically dispersed sites
worldwide are able to achieve reliability rates that are hard
for private systems to achieve.



CC and Information Sharing

• Collaboration and teamwork have become inevitable in
today's corporate world.
• The complexity of today’s working environment requires
teamwork, and the members of a team or project may be in
different geographical areas.
• CC provides an avenue for data sharing and collaboration
that defies geographical boundaries.
• E.g. e-mail (SaaS), OneDrive (IaaS), GitHub (PaaS)



Summary

• In this session, we learnt some of the ways in which
technological advancements have affected, and continue to
affect, information storage and retrieval.

• The session specifically examined:
• The WWW
• Search engines
• Cloud storage services



Activity 13.1

• List any ten (10) search engines known to you, and
determine which type each one is, based on the types
of search engines discussed in this session.
• You may present your answer in the form of a table.



Activity 13.2

• Evaluate the role of cloud storage services in modern
information retrieval systems.



References

Mansourian, Y. (2004). Similarities and differences
between Web search procedure and searching in the
pre-web information retrieval systems. Retrieved
July 9, 2004, from
http://www.webology.org/2004/v1n1/a3.html

Zhang, Y. (2004). A comparison of search engines for
finding resources. Retrieved July 9, 2004, from
http://www.yuanlei.com/studies/articles/is567-searchengine/page2.htm
