Information Search and Retrieval 6


4th Class
Web Information Retrieval

University of Mosul
College of Computer Science & Mathematics
Department of Computer Science
Lecture 6
By Dr. Ielaf Osamah

Web Search Overview & Web Crawling

 A search engine is the practical application of information retrieval techniques to large-scale text collections
 Web search engines are the best-known examples, but there are many others
 Open-source search engines are important for research and development, e.g., Lucene, Lemur/Indri, Galago

IR and Search Engines

Information Retrieval:
- Relevance: effective ranking
- Evaluation: testing and measuring
- Information needs: user interaction

Search Engines:
- Performance: efficient search and indexing
- Incorporating new data: coverage and freshness
- Scalability: growing with data and users
- Adaptability: tuning for applications
- Specific problems: e.g., spam

Search Engine Issues


 Performance
1. Measuring and improving the efficiency of search, e.g., reducing response time, increasing query throughput, increasing indexing speed


2. Indexes are data structures designed to improve search efficiency. Designing and implementing them are major issues for search engines.
 Dynamic data (Incorporating new data)
1. The “collection” for most real applications is constantly changing in terms of updates, additions, and deletions, e.g., web pages
2. Acquiring or “crawling” the documents is a major task; typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed)
3. Updating the indexes while processing queries is also a design
issue
 Scalability
1. Making everything work with millions of users every day and many terabytes of documents
2. Distributed processing is essential
 Adaptability
1. Changing and tuning search engine components, such as the ranking algorithm, indexing strategy, and interface, for different applications
 Spam
1. For web search, spam in all its forms is one of the major issues
2. It affects the efficiency of search engines and, more seriously, the effectiveness of the results
3. There are many types of spam, e.g., spamdexing (term spam), link spam, and “optimization”
4. A new subfield called adversarial IR has emerged, since spammers are “adversaries” with different goals


Search Engine Architecture

Indexing Process

1. Text acquisition: identifies and stores documents for indexing


 Crawler

 Identifies and acquires documents for the search engine


 Many types – web, enterprise, desktop
 Web crawlers follow links to find documents
 Must efficiently find huge numbers of web pages
(coverage) and keep them up-to-date (freshness)
 Single site crawlers for site search
 Topical or focused crawlers for vertical search
 Document crawlers for enterprise and desktop search
 Follow links and scan directories
 Web Crawler (a minimal sketch follows this list)
a. Starts with a set of seeds, which are URLs (Uniform Resource Locators) given to it as parameters
b. Seeds are added to a URL request queue
c. The crawler starts fetching pages from the request queue
d. Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
e. New URLs are added to the crawler’s request queue, or frontier
f. Crawling continues until there are no more new URLs or the disk is full
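A minimal sketch of this crawling loop in Python, using only the standard library; the names crawl and LinkExtractor are illustrative, not part of any particular search engine, and a real crawler would also respect robots.txt and politeness delays.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags while a page is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=100):
    """Basic crawl loop: seeds -> frontier -> fetch -> parse -> new URLs."""
    frontier = deque(seeds)      # URL request queue, initialised with the seeds
    seen = set(seeds)            # avoid requesting the same URL twice
    pages = {}                   # downloaded pages keyed by URL

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue             # skip pages that fail to download
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)      # resolve relative links
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)      # add new URLs to the frontier
    return pages
```

The deque acts as the frontier and the seen set keeps coverage growing instead of revisiting known pages; freshness would require re-queueing pages after some interval.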
 Document data store (a small sketch follows this list)
 Stores text, metadata, and other related content for documents. Metadata is information about a document, such as its type and creation date
 Provides fast access to document contents for search engine components, e.g., result list generation
 Could use a relational database system; more typically, a simpler, more efficient storage system is used because of the huge number of documents
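A toy in-memory document store, assuming compressed text plus a metadata dictionary per document; the class name DocumentStore and its methods are hypothetical, and production engines use purpose-built distributed stores rather than a Python dict.

```python
import time
import zlib


class DocumentStore:
    """Toy in-memory document store: compressed text plus metadata per document."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, text, doc_type="html"):
        # Metadata records information *about* the document, e.g. type and creation date.
        self._docs[doc_id] = {
            "text": zlib.compress(text.encode("utf-8")),
            "metadata": {"type": doc_type, "created": time.time()},
        }

    def get_text(self, doc_id):
        return zlib.decompress(self._docs[doc_id]["text"]).decode("utf-8")

    def get_metadata(self, doc_id):
        return self._docs[doc_id]["metadata"]
```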

2. Text transformation: transforms documents into index terms or features


 Parser
 Processes the sequence of text tokens in the document to recognize structural elements, e.g., titles, links, headings, etc.
 The tokenizer recognizes “words” in the text; it must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, and separators.


 Markup languages such as HTML and XML are often used to specify structure
 Stopping: removes common words, e.g., “and”, “or”, “the”, “in”. It has some impact on efficiency and effectiveness, and can be a problem for some queries
 Stemming: groups words derived from a common stem, e.g., “computer”, “computers”, “computing”, “compute”. It is usually effective, but not for all queries, and its benefits vary for different languages (a tokenize/stop/stem sketch follows this list)
 Information extraction: identifies classes of index terms that are important for some applications, e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
 Classifier: identifies class-related metadata for documents, e.g., topics, reading levels, sentiment
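A minimal Python sketch of the tokenize/stop/stem steps, assuming a tiny illustrative stopword list and crude suffix stripping in place of a real stemmer such as Porter’s algorithm; all names here are illustrative.

```python
import re

# Illustrative stopword list and suffix rules; real systems use fuller lists
# and proper stemming algorithms.
STOPWORDS = {"and", "or", "the", "in", "of", "a", "to", "is"}
SUFFIXES = ("ing", "ers", "er", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop very common words that carry little content."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Crude suffix stripping so related forms share one index term."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Full pipeline: tokenize, then stop, then stem."""
    return [stem(t) for t in remove_stopwords(tokenize(text))]


# Example: index_terms("Computers and computing") -> ['comput', 'comput']
```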

3. Index creation: takes index terms and creates data structures (indexes) to
support fast searching
 Document statistics: gathers counts and positions of words and other features; used in the ranking algorithm
 Weighting: computes weights for index terms; used in the ranking algorithm
 Inversion (an inverted-index sketch follows this list)
 The core of the indexing process
 Converts document-term information to term-document information for indexing; difficult for very large numbers of documents
 The format of the inverted file is designed for fast query processing; it must also handle updates, and compression is used for efficiency
 Index distribution
 Distributes indexes across multiple computers and/or multiple sites
 Essential for fast query processing with large numbers of documents
 Many variations: document distribution, term distribution, replication
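A short Python sketch of the inversion step, under the assumption that each document has already been reduced to a list of index terms (for example by the index_terms pipeline above); build_inverted_index is an illustrative name, not a standard API, and it omits the compression and update handling mentioned above.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Inversion: turn document -> terms information into
    term -> (document, positions) postings for fast lookup at query time.

    `documents` maps a document id to its list of index terms."""
    index = defaultdict(list)     # term -> list of (doc_id, positions)
    doc_lengths = {}              # document statistics, used later by ranking
    for doc_id, terms in documents.items():
        doc_lengths[doc_id] = len(terms)
        positions = defaultdict(list)
        for pos, term in enumerate(terms):
            positions[term].append(pos)   # record word counts and positions
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index, doc_lengths


# Example:
#   docs = {"d1": ["web", "search", "engine"], "d2": ["web", "crawler"]}
#   index, lengths = build_inverted_index(docs)
#   index["web"] -> [("d1", [0]), ("d2", [0])]
```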


Query Process

1. User interaction: supports creation and refinement of the query and display of results
 Query input
 Provides the interface and parser for the query language
 Most web queries are very simple; other applications may use forms
 A query language is used to describe more complex queries and the results of query transformation, e.g., Boolean queries
 Query transformation: improves the initial query, both before and after the initial search

2. Ranking: uses the query and indexes to generate a ranked list of documents (a scoring sketch follows this list)

3. Evaluation: monitors and measures effectiveness and efficiency (primarily offline)
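A minimal ranking sketch in Python, assuming the inverted index and document lengths produced by build_inverted_index above; it uses a plain TF-IDF score as a stand-in for the more sophisticated models (e.g., BM25) used in practice, and the function name rank is illustrative.

```python
import math


def rank(query_terms, index, doc_lengths):
    """Score documents for a query with a simple TF-IDF sum over the
    postings of each query term, then return them best-first."""
    n_docs = len(doc_lengths)
    scores = {}
    for term in query_terms:
        postings = index.get(term, [])
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))          # rarer terms weigh more
        for doc_id, positions in postings:
            tf = len(positions) / doc_lengths[doc_id]   # term frequency in the document
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```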
