Sridevi Paper


Information Retrieval through Query Clustering and Query Classification – A Review

Mrs. Sridevi K N [1RN17PEA02]
Research Scholar, VTU-RC, RNSIT, Bengaluru
Sridevi.kn23@gmail.com

Dr. Prakasha S
Research Guide, VTU-RC, RNSIT, Bengaluru
prakashasphd@gmail.com

Abstract: Information retrieval (IR) is the extraction of important patterns, features and knowledge from data. An information retrieval system facilitates the search of data and documents as requested by users. The process includes identifying specific pieces of information in semi-structured and unstructured text documents and converting them to a structured database. IR techniques can be applied to newspaper articles, web pages, scientific articles, medical notes, etc. Nowadays, IR systems are used daily by various types of users. With the enormous growth in information, IR systems are required to access information effectively. In the field of information access, information retrieval is an emerging concept and is overtaking other traditional methods of searching. Various techniques are in use for building IR systems, and clustering and classification are among them. These two techniques are the main divisions of data mining processes and are essential for managing algorithms in the world of data analysis. Both divide data into sets. This task is relevant in the current information age, as there is a need to couple data with development. In this work, an overview of developments in the information retrieval field is presented, with a special focus on classification and clustering techniques.

Keywords: Information Retrieval [IR], classification, clustering, algorithm

1. Introduction

Clustering and classification are widely used in the field of information retrieval. Clustering is unsupervised learning, whereas classification is supervised learning. The performance of an IR system is linked to the accessibility of data and to the techniques for selecting and managing the huge amounts of information on the web, which is usually expressed as textual data.

A basic model of IR systems is shown in Figure 1. A huge amount of information is queried on a specific criterion, and the information that matches the selection criterion is fetched.

Search relevance can be improved by classifying web queries into predefined categories, a process also known as web query classification. Classifying users’ general queries into topical categories can improve the effectiveness and efficiency of web search. Clustering, an unsupervised technique, deals with partitioning data whose structure is unknown, and it is the basis for further learning. The process of clustering consists of various steps, such as feature extraction and selection, which retrieves and selects the most representative features from the original data set. The algorithm design step includes a series of steps chosen according to the characteristics of the problem. The algorithm is validated at the clustering result evaluation step, which also provides a practical explanation for the clustering result. This paper reviews some of the classification and clustering techniques for information retrieval and identifies the research gap in the area.

2. Information Retrieval Techniques

Minjuan Zhong et al. [1] proposed a new information retrieval algorithm that reduces time complexity and improves accuracy. The algorithm uses classification and key phrase extraction. Traditional methods of performance evaluation cannot effectively evaluate the ranking results of the retrieved documents.

Anastasios Tombros et al. [2] state that if hierarchical clustering is applied in a query-specific manner, retrieval effectiveness is better than that of static clustering and of conventional inverted file search (IFS).

S. K. Bhatia et al. [3] developed clusters and cluster characterizations by employing the viewpoint of the user. The user viewpoint is elicited through a structured interview based on a knowledge acquisition technique known as personal construct theory. The outcome of personal construct theory is shown to be a cluster representation that can be used for querying and also for assigning new documents to suitable clusters.

Charikar et al. [4] proposed incremental clustering, which is based on a careful analysis of the requirements of the IR application. The main objective is to maintain clusters of small diameter efficiently as new points are added. The researchers analyzed natural greedy algorithms and proved that they perform poorly. The proposed incremental algorithms, both deterministic and randomized, are shown to perform well.

Anton Leuski [5] experimentally tested a set of clustering algorithms and showed that clustering the retrieved documents can be significantly more effective than the ranked list approach. The study also showed that the clustering approach is as effective as interactive relevance feedback based on query expansion.

Furnas et al. [6] worked on automatic indexing and retrieval methods. The modeling improves the term-document association, leading to the detection of relevant documents on the basis of the terms used in the queries. Queries are represented as pseudo-document vectors formed from weighted combinations of terms.

Michael Steinbach et al. [7] conducted an experimental study and presented results for agglomerative hierarchical clustering and K-means clustering techniques. The researchers used both a ‘standard’ K-means algorithm and a ‘bisecting’ K-means algorithm. The results showed that the bisecting K-means method performed better than the hierarchical approaches tested.
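To make the difference between the 'standard' and 'bisecting' K-means variants concrete, the following is a minimal sketch and not the implementation evaluated in [7]. It assumes documents have already been turned into numeric vectors (here, toy 2-D points named docs), uses squared Euclidean distance, and splits the cluster with the largest sum of squared errors at each step. The helper names and the number of trial bisections are illustrative assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    if not cluster:
        return 0.0
    center = mean(cluster)
    return sum(sq_dist(p, center) for p in cluster)

def kmeans(points, k, iters=20, seed=0):
    """Standard K-means: returns k clusters (lists of points)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def bisecting_kmeans(points, k, trials=5):
    """Bisecting K-means: repeatedly split one cluster in two with K-means.

    Assumption for this sketch: the cluster with the largest SSE is split,
    and the best of `trials` random bisections is kept.
    """
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=sse)
        if len(target) < 2:
            break  # nothing left that can be split
        clusters.remove(target)
        candidates = [kmeans(target, 2, seed=t) for t in range(trials)]
        best = min(candidates, key=lambda pair: sum(sse(c) for c in pair))
        clusters.extend(best)
    return clusters

# Toy "document vectors": three well-separated groups of three points each.
docs = [(0, 0), (0, 1), (1, 0),
        (10, 10), (10, 11), (11, 10),
        (20, 0), (20, 1), (21, 0)]
clusters = bisecting_kmeans(docs, 3)
```

In this sketch the standard algorithm partitions all points into k clusters at once, while the bisecting variant reaches k clusters through a sequence of two-way splits, which also yields a hierarchy as a by-product.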
Yih-Jen Horng et al. [8] extended the work of previous researchers and presented a new approach for fuzzy information retrieval based on fuzzy inference techniques and fuzzy hierarchical clustering. In the first step, fuzzy agglomerative hierarchical clustering is used to form document clusters and obtain the cluster centers. In the next step, fuzzy logic rules are constructed from the document clusters and the cluster centers. Finally, the constructed fuzzy logic rules are applied to modify the user’s query so that documents are retrieved as per the user’s request. The proposed method proved more flexible and more intelligent than the existing techniques, and it can expand users’ queries for fuzzy information retrieval in an effective way.

3. Architecture used in Information Retrieval

Information Retrieval Agents: Craig A. Knoblock et al. [9] presented an architecture built on information retrieval agents. Such agents address the issues of representation, communication, problem solving and learning. Each information agent is specialized to a particular area of expertise. This makes it possible to organize an enormous number of information sources and provides clarity about the queries that each agent handles.

Peer-to-Peer Information Retrieval: Karl Aberer et al. [10] proposed an architecture for peer-to-peer information retrieval. These systems are decentralized: instead of following a fixed client-server architecture, each peer can act as both client and server. They are large-scale computer networks where peers can join and leave the system at any time.

Table 1. Comparison of works

Figure 2. Architecture of peer-to-peer IR systems. Source [24]

TIRA: Yunlu Ai et al. [11] proposed a model called TIRA (Text based Information Retrieval Architecture). This model contains four major components, viz. client, server, data modules and processor modules. On the client, a customized IR process is defined as a UML activity diagram. The defined IR processes are executed by the server, which also provides information about the available processor and data modules. TIRA offers a flexible framework for text-based information retrieval. It also provides technologies to define IR components visually, to connect different IR components and to execute them.

Models used in Information Retrieval

Vector Space Model: This is an algebraic model that has two steps. In the first step, text documents are represented as vectors of words. In the second step, these vectors are transformed into a numerical format, enabling the user to apply text mining techniques such as extraction, filtering and information retrieval.

Boolean: This is the most widely used exact-match model. In this model, queries are represented as logic expressions whose operands are document features. The Boolean model uses Boolean queries, and the retrieved documents are not ranked.

Inference Network (Probabilistic): Probabilistic models estimate the probability that the user finds a document relevant. Retrieved documents are ranked by their likelihood of relevance, that is, the ratio of the probability that the document is relevant to the probability that it is not relevant.

N-Gram Model: The N-gram model uses conditional probability to model a document. The probability of a term occurring in a document is conditioned on the preceding N-1 words. For example, the probability of seeing the tri-gram “american presidential election” is:

P(“american”) × P(“presidential” | “american”) × P(“election” | “american presidential”)

For example, in a document the words “of” and “times” may each be counted as occurring twice; yet, taken together as a bi-gram, they convey a different meaning. N-grams such as “stock exchange”, “labour party” and “french open championship” reveal a meaning different from that of each word on its own.

Hidden Markov Model: A hidden Markov model (HMM) is a stochastic finite-state automaton. HMMs have strong statistical foundations and are well suited to natural language domains. These models handle new data robustly and are computationally efficient to develop.

4. Challenges in Information Retrieval

Requisites needed for IR system: The basic requirements of an information retrieval system include a group of potential users and a vast collection of documents. These documents should satisfy some or all of the information requirements of the user.

Essential Operations to be carried out: The essential operations include concept indexing of the documents, converting the concept indexing into a descriptor language, concept analysis of the query and extraction of the information.

Performance of the system: The performance of information retrieval systems can be measured from various aspects. Efficiency is measured with regard to recall and relevance. Performance on these parameters is affected by indexing, the formulation of search criteria and the simplicity of the request.
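As an illustration of measuring performance with regard to recall and relevance, the sketch below computes recall together with precision for a single query. Precision is not named in the text above and is included here as an assumption, since it is the usual companion measure for the relevance of what was retrieved. The document identifiers and the function name are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so the two measures capture different aspects of the same result list.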
Other issues: Unlike traditional search engines, IR systems have to function without an explicit query. Users’ needs must be inferred from personalization and context. Compared to traditional recommender systems, IR systems must be open domain and should be capable of making suggestions and collecting information from various sources that involve multiple people, actions and objects. Retrieval methods also face major problems related to the indexing system and to the lack of knowledge about users’ needs and behavior. The main problem is uncertainty. User consent is another major problem: working within the user’s consent leads to search behavior analysis, and this in turn leads to uncertainty.

Query classification poses a challenging problem because web queries are short and provide few features. This lack of features, combined with the dynamic nature of the query stream and the changing vocabulary of the average user, hinders traditional text classification methods. Understanding the current sense of a user’s query is a further problem. To make web search efficient, it is necessary to map users’ queries to topical categories for which the search engine has domain-specific knowledge. Today’s search engines, both for enterprise and web, often incorporate topic-wise databases, and this provides an opportunity for improvement.

5. Open Issues in IR Systems

Many IR problems provide scope for improvement, as listed in Table 2.

Table 2. Open Issues in Information Retrieval

6. Benefits of clustering and classification

Classification and clustering appear to be similar but differ in the context of data mining. Classification is a supervised learning technique in which pre-defined labels are assigned to instances according to their properties. Clustering is an unsupervised learning technique that groups similar instances based on their features or properties.

The benefits of classification include browsing, specifically for inexperienced users or users who are not familiar with certain content. Classification also makes it possible to broaden and narrow searches through its hierarchical structure and search terms, and to provide multilingual access to resources. Clustering helps in information retrieval by organizing documents into topic-specific groups. Clustering is also used to organize retrieved documents, helping users better understand the documents retrieved and therefore better focus their search.

7. Conclusion

From the literature, it can be observed that IR-related works suffer from the difficulty of comparing retrieval results. Huge document collections have been used, and in most of the research works these collections are different. Hence, comparing the works has become difficult. The existing research in this area is diverse; some of the works use classification and key phrase extraction, which reduces complexity and improves accuracy. With the explosion of information in recent years, there is a need for better performing information retrieval techniques in terms of the speed of extraction and the relevance of the documents extracted. This is a promising research area that provides plenty of opportunities for researchers to contribute.

References

[1] Minjuan Zhong, Zhiping Chen, Yaping Lin and Jintao Yao, "Using classification and key phrase extraction for information retrieval," Fifth World Congress on Intelligent Control and Automation (IEEE Cat. No.04EX788), Hangzhou, China, 2004, pp. 3037-3041, vol. 4.
[2] Anastasios Tombros, Robert Villa and C. J. Van Rijsbergen, "The effectiveness of query-specific hierarchic clustering in information retrieval," Information Processing and Management, vol. 38, pp. 559-582, 2002.
[3] S. K. Bhatia and J. S. Deogun, "Conceptual clustering in information retrieval," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 28, no. 3, pp. 427-436, June 1998. doi: 10.1109/3477.678640
[4] Moses Charikar, Chandra Chekuri, Tomás Feder and Rajeev Motwani, "Incremental Clustering and Dynamic Information Retrieval," SIAM Journal on Computing, vol. 33, pp. 1417-1440, 2004. doi: 10.1137/S0097539702418498
[5] Anton Leuski, "Evaluating Document Clustering for Interactive Information Retrieval," in Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM), pp. 33-40, 2001.
[6] George Furnas, Scott Deerwester, Susan T. Dumais, Thomas Landauer, Richard A. Harshman, Lynn A. Streeter and Karen E. Lochbaum, "Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure," ACM SIGIR Forum, vol. 51, pp. 90-105, 2017. doi: 10.1145/3130348.3130358
[7] Michael Steinbach, George Karypis and Vipin Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[8] Yih-Jen Horng and Shyi-Ming Chen, "A New Method for Fuzzy Information Retrieval Based on Fuzzy Hierarchical Clustering and Fuzzy Inference Techniques," IEEE Transactions on Fuzzy Systems, vol. 13, no. 2, April 2005.
[9] Craig A. Knoblock and Yigal Arens, "An Architecture for Information Retrieval Agents," AAAI Technical Report SS-94-03.
[10] Karl Aberer, Fabius Klemm, Martin Rajman and Jie Wu, "An Architecture for Peer-to-Peer Information Retrieval," July 2, 2004.
[11] Yunlu Ai, Robert Gerling, Marco Neumann, Christian Nitschke and Patrick Riehmann, "TIRA: Text based Information Retrieval Architecture."
