Internet Technology

Dr. Noaman Muhammad Ali

Ph.D. in Informatics and Information Processes
Information Systems & Technology Department
Spring 2023-2024
Chapter 2

 Introduction
 Information Retrieval
o Definition
o Goal
o Classification of Information Retrieval Systems
 Information Search Methods on the Internet
o Direct Search Using Hypertext Links
o Use of Search Engines
o Search Using Special Tools
 Search Engine Basic Components
o Indexing Module
o Database
o Search Server
 Web Browser

Introduction
 Nowadays, users rely on the web for information,
but the amount of data on the web is growing in an
uncontrolled way.
 Finding relevant and required information is a
hard task; this problem is referred to as
information overload.
 In this chapter, we will discuss the information
retrieval issues related to the search for
information through the Internet.

Information Retrieval
 Definition
o Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on computers).
o An information retrieval system is an
applied computer environment for processing,
storing, sorting, filtering, and searching for
large arrays of structured information.
Information Retrieval (Cont.)
 Goal
o Retrieve documents with information that is
relevant to the user’s information need and help
the user complete a task.
o Should easily retrieve the interested

Information Retrieval (Cont.)
 Examples
o Web Search (Search Engine)
o E-mail Search
o Searching your Laptop
o Searching authors, titles, and subjects in library
card catalogs or computers
o Document classification and categorization,
user interfaces, data visualization, filtering

Information Retrieval (Cont.)
 Notes
o IR can be inaccurate as long as the error is
o Data is usually natural language text, which is
not always well structured and could be
semantically ambiguous

Classification of IR Systems
 Directories

Classification of IR Systems (Cont.)
 Directories “Local”

Classification of IR Systems (Cont.)
 Directories “Web”

Classification of IR Systems (Cont.)
 Directories “Web” (Cont.)
 Also called: catalogs, yellow pages, subject directories
 Hierarchical taxonomies that classify human knowledge
 The first level of taxonomies ranges from 12 to 26
 Popular examples: Yahoo!, eBLAST, LookSmart, Magellan, and
 Most allow keyword searches

Classification of IR Systems (Cont.)
 Databases

Classification of IR Systems (Cont.)
 Search Engines

Information Search Methods on the
Internet
► Direct Search Using Hypertext Links

Information Search Methods on the
Internet (Cont.)
► Use of Search Engines

Information Search Methods on the
Internet (Cont.)
► Search Using Special Tools

Search Engine
 Search engines are the means by which most
people search the Web
 Common examples are Google, AltaVista, and Bing
 Yet a search engine does not actually search the
Web during your search
 A search engine searches itself.

Difficulties of Building a Search Engine
 Built by companies that hide the technical details
 Distributed data
 High percentage of volatile data
 Large volume
 Unstructured and redundant data
 Quality of data
 Heterogeneous data

Difficulties of Building a Search Engine (Cont.)
 Dynamic data
 How to specify a query from the user
 How to interpret the answer provided by the

User Problems
 Do not exactly understand how to provide a
sequence of words for the search
 Not aware of the input requirement of the search
 Problems understanding Boolean logic, so the
users cannot use advanced search
 Novice users do not know how to start using a
search engine

User Problems (Cont.)
 Do not care about advertisements? No funding
 Around 85% of users only look at the first page of
the results, so relevant answers might be skipped

Searching Guidelines
 Specify the words clearly (+, -)
 Use Advanced Search when necessary
 Provide as many particular terms as possible
 If looking for a company, institution, or
organization, try: [.com | .edu | .org | .gov | country code]
 Some search engines are specialized in some areas
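
The (+, -) guideline above can be illustrated with a minimal sketch. It assumes the classic convention in which "+" marks a word that must appear and "-" marks a word that must not appear; the function names `parse_query` and `matches` and the sample documents are hypothetical, and real search engines differ in their exact syntax.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def matches(text, query):
    """Return True if the text satisfies the +/- constraints of the query."""
    words = set(text.lower().split())
    required, excluded, optional = parse_query(query)
    if not required <= words:   # every "+" word must be present
        return False
    if excluded & words:        # no "-" word may be present
        return False
    # Without "+" words, at least one plain word has to match (if any were given).
    return bool(required) or bool(optional & words) or not optional


# Hypothetical sample documents, for illustration only.
docs = ["jaguar car dealer", "jaguar cat habitat", "python snake"]
hits = [d for d in docs if matches(d, "+jaguar -car")]
print(hits)  # ['jaguar cat habitat']
```

Here "+jaguar -car" keeps only the page about the animal, which is the kind of narrowing the guideline recommends.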

Searching Guidelines (Cont.)
 If the user uses broad queries, try to use Web
directories as starting points
 The user should notice that anyone can publish
data on the Web, so information that they get from
search engines might not be accurate.

Types of Search Engines
 Search by Keywords
► Yandex, Google, and Bing
 Search by categories
► Yahoo!
 Specialize in other languages
► Chinese Yahoo and Yahoo Japan
 Interview simulation
► Ask Jeeves!
Search Engine Basic Components
 Indexing Module
o Spider
o Crawler ("traveling" spider)
o Indexer
 Database
 Search Server
 An Interface
o Enables users to submit queries
o Displays results
Search Engine Basic Components (Cont.)
 Crawling
o Search engines continually send out hundreds of “robots” or “bots”
(also called “spiders” or “crawlers”)
o A robot is a program that follows links
o Bots visit websites, read them word by word, and then index those
words, metadata, and ALT attributes in IMG tags
o Robots Exclusion Protocol (REP)
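
A short sketch of how a polite bot might honor the exclusion protocol, using Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `ExampleBot`, and the example.com URLs are made up for illustration; a real crawler would download the file from the site instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

# Ask whether a (hypothetical) bot may fetch each URL.
public_ok = parser.can_fetch("ExampleBot", "https://example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")

print(public_ok)   # True
print(private_ok)  # False
```

The protocol is advisory: it tells well-behaved bots which paths to skip, but it does not technically prevent access.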
Search Engine Basic Components (Cont.)
 Crawling (Cont.)
o Starting point?
o Popular pages

Search Engine Basic Components (Cont.)
 Crawling (Cont.)
o At its peak:
o Use multiple spiders
o Each spider can keep ~300 connections to
pages at a time
o Generates 600K/s
o Starting points:
o Dedicated server that feeds URLs to spiders
o Instead of relying on ISPs for domain names
they have their own DNS server
Search Engine Basic Components (Cont.)
 Crawling (Cont.)
o Google spider looks at two things:
o Significant words within the page
o Location of the words -- Why is location

Search Engine Basic Components (Cont.)
 Indexing
o Spiders get the data
o Now what?
o Content analysis
o Method by which information is sorted and
o One way: Storing the word and associated URL
o No way to tell if the word is important or
o How many times was the word used?
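
One possible way to go beyond storing only the word and its URL is to also record how many times the word occurs on each page. The sketch below assumes simple whitespace tokenization; the page texts and the name `build_index` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each word to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        counts = Counter(text.lower().split())
        for word, count in counts.items():
            index[word][url] = count
    return index


# Hypothetical page contents, for illustration only.
PAGES = {
    "a.example": "web search engines index the web",
    "b.example": "a spider crawls the web",
}

index = build_index(PAGES)
print(index["web"])     # {'a.example': 2, 'b.example': 1}
print(index["spider"])  # {'b.example': 1}
```

With the counts stored, the index can at least distinguish a page that mentions a word once from a page that uses it repeatedly.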
Search Engine Basic Components (Cont.)
 Database
o Where the user's query is matched
o A huge database of Websites that is gathered and
indexed by word
o These databases can be huge, with millions of
o Contains only essential parts of
o Only includes pages that were
o Search engines are always out of date.
Search Engine Basic Components (Cont.)
 Interface
o Using the keywords you give it, a search engine then searches its
own current

Search Engine Basic Components (Cont.)
 Interface (Cont.)
o Query Interface
✓ A box in which the user enters a sequence of words
(AltaVista uses union, HotBot uses intersection)
✓ Complex query interfaces (e.g., Boolean logic,
phrase search, title search, URL search, date
range search, data type search)
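
The difference between union and intersection semantics for a multi-word query can be sketched as follows. The tiny index and the function names are hypothetical; the sketch only shows the set operations, not how any particular engine implemented them.

```python
# Hypothetical index: word -> set of pages containing it.
INDEX = {
    "search": {"a.example", "b.example"},
    "engine": {"b.example", "c.example"},
}


def search_union(index, query):
    """Return pages containing ANY of the query words."""
    results = set()
    for word in query.lower().split():
        results |= index.get(word, set())
    return results


def search_intersection(index, query):
    """Return pages containing ALL of the query words."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


any_word = search_union(INDEX, "search engine")
all_words = search_intersection(INDEX, "search engine")
print(sorted(any_word))   # ['a.example', 'b.example', 'c.example']
print(sorted(all_words))  # ['b.example']
```

Union favors recall (more pages returned), while intersection favors precision (fewer, more specific pages).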

Search Engine Basic Components (Cont.)
 Interface (Cont.)
o Answer Interface
✓ Relevant pages appear at the top of the list
✓ Each entry in the list includes a title of the
page, a URL, a brief summary, a size, a date,
and a written language

Search Engine Basic Components (Cont.)
 Search Server
o Interfaces are based on rankings
o Search engines return results based on a ranking
o Ranking is the order in which files are listed when
they are retrieved.
o Google is different:

✓ PageRank™ method based on popularity
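
The slide only says that PageRank is based on popularity. As a rough illustration, the sketch below uses the simplified textbook formulation, in which a page is popular if popular pages link to it. The damping value 0.85, the iteration count, the toy link graph, and the function name `pagerank` are assumptions made for this example, not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified power-iteration PageRank.

    `links` maps each page to the list of pages it links to;
    every link target is assumed to also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" receives links from a, b, and d.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
best = max(ranks, key=ranks.get)
print(best)  # c
```

In this toy graph page "c" ends up with the highest rank because it collects links from the most pages, while "d", which nobody links to, ends up with the lowest.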

Search Engine Basic Components (Cont.)
 Search Server (Cont.)
o Ranking is a relationship between items about
their ordering
o For more useful information:
✓ Number of times word appears on page
✓ Assign a weight to each word
o Each search engine has a different formula for
assigning weight to words in its index
o Popular way of indexing: Hashing
✓ Numerical value assigned to each word that can be
retrieved using a formula
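
The hashing idea above can be sketched as a small hash table: a formula turns each word into a bucket number, so a lookup only has to inspect one bucket. The formula, the bucket count, the sample pages, and all function names here are illustrative assumptions; the weight used for ranking is simply the number of times the word appears on the page, and each engine would use its own formula.

```python
NUM_BUCKETS = 8  # assumed small table size, for illustration


def word_hash(word):
    """Toy hash formula: polynomial hash of the characters, modulo the bucket count.

    A hand-written formula is used instead of Python's built-in hash(),
    whose values for strings change between program runs.
    """
    value = 0
    for ch in word:
        value = (value * 31 + ord(ch)) % NUM_BUCKETS
    return value


def build_hashed_index(pages):
    """Store, per bucket, word -> {url: occurrence count}."""
    buckets = [dict() for _ in range(NUM_BUCKETS)]
    for url, text in pages.items():
        for word in text.lower().split():
            entry = buckets[word_hash(word)].setdefault(word, {})
            entry[url] = entry.get(url, 0) + 1
    return buckets


def rank_pages(buckets, word):
    """Return URLs containing the word, highest occurrence count first."""
    postings = buckets[word_hash(word)].get(word, {})
    return sorted(postings, key=lambda url: (-postings[url], url))


# Hypothetical page contents.
PAGES = {
    "a.example": "web web search",
    "b.example": "web crawler",
}
buckets = build_hashed_index(PAGES)
ranking = rank_pages(buckets, "web")
print(ranking)  # ['a.example', 'b.example']
```

Python's own `dict` is already a hash table; the explicit buckets are shown only to make the mechanism visible.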
Web Browser

 Finding relevant and required information is a hard
task; this problem is referred to as information overload.
 The primary goal of Information Retrieval systems is to
retrieve documents with information that is relevant to
the user’s information need and to help the user
complete a task.
 There are three main types of online IR systems: the
directory, the database and the search engine.
 There are three main methods of internet search: Direct
Search Using Hypertext Links, Use of Search Engines,
and Search Using Special Tools.

 Almost all major search engines have their own
structure, different from others. However, it is possible
to single out the main components common to all
search engines.
 The differences lie only in how each engine implements
the interaction between these components.


