Unit 5 6 Pages Notes


MC4202 ADVANCED DATABASE TECHNOLOGY


UNIT V INFORMATION RETRIEVAL AND WEB SEARCH

Information Retrieval (IR) deals with the organization, storage, retrieval, and evaluation of
information from document repositories, particularly textual information. It is the activity of
obtaining material of an unstructured nature, usually text, that satisfies an information need
from within large collections stored on computers. For example, information retrieval takes
place when a user enters a query into the system.

What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has asked
for in the form of a query. The documents and the queries are represented in a similar manner,
so that document selection and ranking can be formalized by a matching function that returns a
retrieval status value (RSV) for each document in the collection. Many IR systems represent
document contents by a set of descriptors, called terms, belonging to a vocabulary V. An IR
model determines the query-document matching function according to one of four main
approaches, listed under Retrieval Models.

Retrieval Models
The four approaches are the three main statistical models (Boolean, vector space, and
probabilistic) and the semantic model. Boolean, Vector and Probabilistic are the three classical
IR models. The classical IR model is the simplest and easiest to implement. It is based on
mathematical knowledge that is easily recognized and understood.

Types of retrieval model:
• Classical IR Model. It is the simplest and easiest IR model to implement. ...
• Non-Classical IR Model. It is completely opposite to the classical IR model. ...
• Alternative IR Model. ...
• Inverted Index. ...
• Stop Word Elimination. ...
• Stemming. ...
• Term Weighting. ...
• Term Frequency (tf_ij)

TYPES OF QUERIES IN IR SYSTEMS:

During the process of indexing, many keywords are associated with the document set, including
words, phrases, date created, author names, and type of document. They are used by an IR
system to build an inverted index, which is then consulted during the search. The queries
formulated by users are compared to the set of index keywords. Most IR systems also allow the
use of Boolean and other operators to build a complex query. The query language with these
operators enriches the expressiveness of a user's information need.
1. Keyword Queries:
• Simplest and most common queries.
• The user enters just keyword combinations to retrieve documents.
• These keywords are implicitly connected by a logical AND operator.
• All retrieval models provide support for keyword queries.
2. Boolean Queries:
• Some IR systems allow using the +, -, AND, OR, NOT, ( ) Boolean operators in combination
with keyword formulations.
• No ranking is involved because a document either satisfies such a query or does not
satisfy it.
• A document is retrieved for a Boolean query if the query is logically true as an exact match
in the document.
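
As a minimal sketch of how keyword and Boolean queries can be answered from an inverted index, the Python below builds a term-to-documents map and combines posting sets with set operations. The toy collection `DOCS` and the helper names (`build_inverted_index`, `boolean_and`, `boolean_or`, `boolean_not`) are assumptions made for illustration; they do not come from the notes or from any particular IR system.

```python
from collections import defaultdict

# Hypothetical toy collection; the document IDs and texts are illustrative only.
DOCS = {
    1: "database systems store structured data",
    2: "information retrieval systems search text",
    3: "web search engines rank text documents",
}


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents containing every term (how a keyword query is treated)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Documents containing at least one of the terms."""
    return set().union(*(index.get(t, set()) for t in terms))


def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())


index = build_inverted_index(DOCS)
# text AND search
and_hits = boolean_and(index, "text", "search")
# (database OR web) AND NOT text
mixed_hits = boolean_or(index, "database", "web") & boolean_not(index, DOCS, "text")
print(sorted(and_hits), sorted(mixed_hits))  # [2, 3] [1]
```

Each document is either in the result set or not, which matches the point that no ranking is involved in Boolean retrieval.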
3. Phrase Queries:
• When documents are represented using an inverted keyword index for searching, the
relative order of terms in the document is lost.
• To perform exact phrase retrieval, phrases are encoded in the inverted index or
implemented differently.
• This query consists of a sequence of words that make up a phrase.
• It is generally enclosed within double quotes.
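
One common way to keep word order available for phrase retrieval is a positional index, which stores the positions of each term in each document. The sketch below is an illustration under that assumption, not necessarily the encoding the notes have in mind; the collection `DOCS` and the functions `build_positional_index` and `phrase_query` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical toy collection, used only for illustration.
DOCS = {
    1: "new york city travel guide",
    2: "the city of york is not new",
    3: "guide to new york",
}


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} so word order is preserved."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return IDs of documents containing the words of `phrase` consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        # A start position p matches if term k occurs at position p + k.
        for p in index[terms[0]][doc_id]:
            if all(p + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


pos_index = build_positional_index(DOCS)
phrase_hits = phrase_query(pos_index, "new york")
print(sorted(phrase_hits))  # [1, 3]
```

Document 2 contains both words but not adjacent and in order, so it is excluded. A plain keyword (AND) query would have returned it.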
4. Proximity Queries:
• Proximity search accounts for how close within a record multiple terms should be to
each other.
• The most commonly used proximity search option is a phrase search, which requires terms
to be in exact order.


• Other proximity operators can specify how close terms should be to each other. Some also
specify the order of the search terms.
• Search engines use various operator names such as NEAR, ADJ (adjacent), or AFTER.
• However, supporting complex proximity operators is expensive, as it requires
time-consuming pre-processing of documents, so it is suitable for smaller document
collections rather than for the web.
5. Wildcard Queries:
• Supports regular expressions and pattern matching-based searching in text.
• Retrieval models do not directly support this query type.
• In IR systems, certain kinds of wildcard search support may be implemented.
Example: usually words with any trailing characters (for instance, data* would match data
and database).
6. Natural Language Queries:
• Only a few natural language search engines aim to understand the structure and meaning
of queries written in natural language text, generally as a question or narrative.
• The system tries to formulate answers for these queries from retrieved results.
• Semantic models can provide support for this query type.

TEXT PREPROCESSING: Text preprocessing is an initial phase in text mining. There are
various preprocessing techniques to categorize text documents. These are filtering, splitting
of sentences, stemming, stop words removal and token frequency count. Filtering has
a set of rules for removing duplicate strings and irrelevant text.
The various text preprocessing steps are:
1. Tokenization.
2. Lower casing.
3. Stop words removal.
4. Stemming.
5. Lemmatization.

In text preprocessing, tokenization means splitting text into smaller units (tokens), such as
words. This should not be confused with tokenization in data security, whose purpose is to
protect sensitive data while preserving its business utility. That kind of tokenization differs
from encryption, where sensitive data is modified and stored with methods that do not allow its
continued use for business purposes: if tokenization is like a poker chip, encryption is like a
lockbox.

Stemming and Lemmatization are text normalization (sometimes called word normalization)
techniques in the field of Natural Language Processing that are used to prepare text, words,
and documents for further processing.

Stop words removal: Stop word removal is one of the most commonly used preprocessing steps
across different NLP applications. The idea is simply to remove the words that occur commonly
across all the documents in the corpus. Typically, articles and pronouns are classified as
stop words.

Preprocessing the text data is an essential step, as it prepares the text for mining. If it is
not applied, the data would be very inconsistent and would not generate good analytics results.

Text pre-processing is used to clean up text data: convert words to their roots (in other
words, lemmatize) and filter out unwanted digits, punctuation, and stop words.

Some of the common text preprocessing / cleaning steps are:
• Lower casing.
• Removal of punctuation.
• Removal of stop words.
• Removal of frequent words.
• Removal of rare words.
• Stemming.
• Lemmatization.
• Removal of emojis.
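
Several of the cleaning steps listed above can be chained into a small pipeline. The following Python is a minimal sketch using only the standard library. The stop word list `STOP_WORDS`, the suffix list `SUFFIXES`, and the function names are assumptions for illustration, and `naive_stem` is a crude suffix stripper, not the Porter algorithm or a real lemmatizer.

```python
import re
import string

# A tiny, hypothetical stop word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "it", "they", "of", "and", "to", "in"}

# Suffixes stripped by the naive stemmer below (an illustrative assumption,
# not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into tokens on whitespace."""
    return text.split()


def strip_punctuation(token):
    """Remove leading and trailing punctuation from a token."""
    return token.strip(string.punctuation)


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, lower-case, drop punctuation/digits/stop words, then stem."""
    tokens = [strip_punctuation(t.lower()) for t in tokenize(text)]
    tokens = [t for t in tokens if t and not re.fullmatch(r"\d+", t)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The engines are indexing 3 documents, and ranking them."))
# ['engin', 'index', 'document', 'rank', 'them']
```

The stem "engin" is not a dictionary word, which illustrates the usual difference between stemming (cutting suffixes) and lemmatization (mapping to a root word).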

Evaluation measure


Evaluation measures for an information retrieval system are used to assess how well the
search results satisfied the user's query intent. The field of information retrieval has used
various types of quantitative metrics for this purpose, based on either observed user behavior
or on scores from prepared benchmark test sets. Besides benchmarking by using this type of
measure, an evaluation for an information retrieval system should also include a validation of
the measures used, i.e. an assessment of how well the measures capture what they are intended
to measure and how well the system fits its intended use case. [1] Metrics are often split into
two types: online metrics look at users' interactions with the search system, while offline
metrics measure theoretical relevance, in other words how likely each result, or the search
engine results page (SERP) as a whole, is to meet the information needs of the user.

Online metrics

Online metrics are generally created from search logs. The metrics are often used to determine
the success of an A/B test.

Session abandonment rate
Session abandonment rate is the ratio of search sessions which do not result in a click.

Click-through rate
Click-through rate (CTR) is the ratio of users who click on a specific link to the number of
total users who view a page, email, or advertisement. It is commonly used to measure the
success of an online advertising campaign for a particular website as well as the effectiveness
of email campaigns.[2]

Session success rate
Session success rate measures the ratio of user sessions that lead to a success. Defining
"success" is often dependent on context, but for search a successful result is often measured
using dwell time as a primary factor along with secondary user interaction. For instance, the
user copying the result URL is considered a successful result, as is copy/pasting from the
snippet.

Zero result rate
Zero result rate (ZRR) is the ratio of Search Engine Results Pages (SERPs) which returned
zero results. The metric indicates either a recall issue or that the information being searched
for is not in the index.

Offline metrics

Offline metrics are generally created from relevance judgment sessions where the judges score
the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g.,
relevance from 0 to 5) scales can be used to score each document returned in response to a
query. In practice, queries may be ill-posed, and there may be different shades of relevance.

WEB SEARCH
A web search engine is a specialized computer server that searches for data on the Web. The
search results of a user query are returned as a list (known as hits). The hits can include
web pages, images, and different types of files.
Some search engines also search and return data available in public databases or open
directories. Search engines differ from web directories in that web directories are maintained
by human editors, whereas search engines work algorithmically or by a combination of
algorithmic and human input.
Web search engines are large data mining applications. Several data mining techniques are
used in all elements of search engines, ranging from crawling (e.g., deciding which pages must
be crawled and the crawling frequencies), indexing (e.g., selecting pages to be indexed and
determining to which extent the index must be constructed), and searching (e.g., determining
how pages must be ranked, which advertisements must be added, and how the search results can
be customized or made "context aware").

ANALYTICS

Analytics is the systematic computational analysis of data or statistics. [1] It is used for the
discovery, interpretation, and communication of meaningful patterns in data. It also entails
applying data patterns toward effective decision-making. It can be valuable in areas rich with
recorded information; analytics relies on the simultaneous application of statistics, computer
programming, and operations research to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include descriptive analytics, diagnostic
analytics, predictive analytics, prescriptive analytics, and cognitive analytics.[2] Analytics
may apply to a variety of fields such as marketing, management, finance, online systems,
information security, and software services. Since analytics can require extensive computation
(see big data), the algorithms and software used for analytics harness the most current
methods in computer science, statistics, and mathematics.

CURRENT TRENDS IN WEB SEARCH

1. Voice search will become even more relevant
Voice search is already an integral part of our daily lives: we ask Siri where the closest gas
station is or say "Hey Google, which Thai restaurant is the highest rated in my town?" At the
moment, optimizing for these kinds of voice searches is recommended especially for
ecommerce or websites whose users are likely to have their hands full. For example, if you
run a recipe blog, you want your users to find the answer on how long to let the dough rest
without having to type with their potentially dirty hands on the phone.
2. Your site search can no longer offer zero results pages
A zero result page for your user means a lost client for you. But what seems like a problem
can be a great opportunity to increase your revenue. Let's go back to our example. In this case,
you cannot offer your user Ralph Lauren winter shoes. But you can show them results for
other relevant products such as summer shoes by Ralph Lauren or winter shoes by other
brands.
3. Search will become more personalized than ever
With personalization, you can offer relevant results for each user based on their preferences
and prior search behavior. Going back to our example, an HR person might have already
downloaded a pdf targeted towards HR managers on the website. Based on their behavior,
they would get assessed as a B2B user and can get more B2B oriented results in their search.
4. Site search will feel less like search and more intuitive
A good site search is the one you do not even think about as a user. You use it so intuitively
that you don't need to assess what you are doing; you just do it. In 2022, site search will
look even less like classical search.
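
The online metrics defined under Evaluation measure (session abandonment rate, click-through rate, and zero result rate) are simple ratios, so they can be computed directly from a search log. The Python below is a rough sketch under assumed data: the log records in `LOG`, their field names, and the click and view counts passed to `click_through_rate` are all hypothetical.

```python
def session_abandonment_rate(sessions):
    """Share of search sessions that did not result in a click."""
    return sum(1 for s in sessions if s["clicks"] == 0) / len(sessions)


def zero_result_rate(sessions):
    """Share of results pages that returned zero results."""
    return sum(1 for s in sessions if s["results_returned"] == 0) / len(sessions)


def click_through_rate(clicks, views):
    """Users who clicked a specific link divided by users who viewed it."""
    return clicks / views if views else 0.0


# Hypothetical, simplified search log: one record per search session.
LOG = [
    {"query": "thai restaurant", "results_returned": 10, "clicks": 2},
    {"query": "winter shoes", "results_returned": 0, "clicks": 0},
    {"query": "dough rest time", "results_returned": 8, "clicks": 0},
    {"query": "b2b pricing pdf", "results_returned": 5, "clicks": 1},
]

abandonment = session_abandonment_rate(LOG)      # 2 of 4 sessions had no click
zrr = zero_result_rate(LOG)                      # 1 of 4 pages had zero results
ctr = click_through_rate(clicks=30, views=1000)  # 30 clicks out of 1000 views
print(abandonment, zrr, ctr)  # 0.5 0.25 0.03
```

In an A/B test, the same functions would be applied separately to the log of each variant and the resulting rates compared.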
