
Concept search

en.wikipedia.org/wiki/Concept_search

A concept search (or conceptual search) is an automated information retrieval method that is used to search
electronically stored unstructured text (for example, digital archives, email, scientific literature, etc.) for
information that is conceptually similar to the information provided in a search query. In other words, the ideas
expressed in the information retrieved in response to a concept search query are relevant to the ideas contained
in the text of the query.


Why Concept Search?


Concept search techniques were developed because of limitations imposed by classical Boolean keyword
search technologies when dealing with large, unstructured digital collections of text. Keyword searches often
return results that include many non-relevant items (false positives) or that exclude too many relevant items
(false negatives) because of the effects of synonymy and polysemy. Synonymy means that two or more different words in the same language have the same meaning, and polysemy means that a single word can have more than one meaning.

Polysemy is a major obstacle for all computer systems that attempt to deal with human language. In English,
most frequently used terms have several common meanings. For example, the word fire can mean: a combustion activity; to terminate employment; to launch; or to excite (as in fire up). For the 200 most-polysemous
terms in English, the typical verb has more than twelve common meanings, or senses. The typical noun from this
set has more than eight common senses. For the 2000 most-polysemous terms in English, the typical verb has
more than eight common senses and the typical noun has more than five.[1]

In addition to the problems of polysemy and synonymy, keyword searches can exclude inadvertently misspelled words as well as variations on the stems (or roots) of words (for example, strike vs. striking). Keyword searches are also susceptible to errors introduced by optical character recognition (OCR), which can insert random errors into the text of documents (often referred to as noisy text) during scanning.

A concept search can overcome these challenges by employing word sense disambiguation (WSD)[2] and other techniques to derive the actual meanings of words and their underlying concepts, rather than simply matching character strings as keyword search technologies do.
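
The following is a minimal, illustrative sketch of one well-known family of WSD techniques, a simplified Lesk-style gloss overlap. The tiny sense inventory is invented for the example; real systems rely on much richer lexical resources and statistical models.

```python
# A minimal sketch of dictionary-based word sense disambiguation in the spirit
# of the Lesk algorithm: pick the sense whose gloss shares the most words with
# the query context. The tiny sense inventory below is invented for illustration.

SENSES = {
    "fire": {
        "combustion": "flames burning heat smoke combustion wood forest",
        "dismiss": "terminate employment dismiss job employee company",
        "launch": "launch shoot weapon missile gun rocket",
    }
}

STOPWORDS = {"the", "a", "an", "of", "to", "and", "from"}

def disambiguate(word: str, context: str) -> str:
    """Return the sense label whose gloss overlaps most with the context."""
    context_terms = {t for t in context.lower().split() if t not in STOPWORDS}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(context_terms & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("fire", "the company decided to fire the employee"))  # dismiss
print(disambiguate("fire", "smoke and flames from the forest fire"))     # combustion
```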

Approaches to Concept Search


In general, information retrieval research and technology can be divided into two broad categories: semantic and
statistical. Information retrieval systems that fall into the semantic category will attempt to implement some
degree of syntactic and semantic analysis of the natural language text that a human user would provide (also
see computational linguistics). Systems that fall into the statistical category will find results based on statistical
measures of how closely they match the query. However, systems in the semantic category also often rely on
statistical methods to help them find and retrieve information.[3]

Efforts to provide information retrieval systems with semantic processing capabilities have basically used three
different approaches:

Auxiliary structures
Local co-occurrence statistics
Transform techniques (particularly matrix decompositions)

Auxiliary Structures

A variety of techniques based on Artificial Intelligence (AI) and Natural Language Processing (NLP) have been
applied to semantic processing, and most of them have relied on the use of auxiliary structures such as
controlled vocabularies and ontologies. Controlled vocabularies (dictionaries and thesauri), and ontologies allow
broader terms, narrower terms, and related terms to be incorporated into queries.[4] Controlled vocabularies are
one way to overcome some of the most severe constraints of Boolean keyword queries. Over the years,
additional auxiliary structures of general interest, such as the large synonym sets of WordNet, have been
constructed.[5] It was shown that concept search that is based on auxiliary structures, such as WordNet, can be
efficiently implemented by reusing retrieval models and data structures of classical Information Retrieval.[6] Later
approaches have implemented grammars to expand the range of semantic constructs. Data models that represent sets of concepts within a specific domain (domain ontologies), and that can incorporate the relationships among terms, have also been developed in recent years.
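
As a rough illustration of how a controlled vocabulary can broaden a query, the sketch below expands query terms with broader, narrower, and related terms from a tiny hand-built thesaurus. The vocabulary and terms are invented for the example; production systems would draw on curated thesauri or ontologies.

```python
# A minimal sketch of query expansion with a controlled vocabulary. The tiny
# thesaurus below (broader/narrower/related terms) is invented for illustration.

THESAURUS = {
    "boat": {
        "broader": ["watercraft"],
        "narrower": ["canoe", "kayak", "dinghy"],
        "related": ["sailing", "river"],
    }
}

def expand_query(terms, relations=("broader", "narrower", "related")):
    """Add vocabulary terms linked to each query term by the chosen relations."""
    expanded = list(terms)
    for term in terms:
        entry = THESAURUS.get(term, {})
        for relation in relations:
            expanded.extend(entry.get(relation, []))
    return expanded

print(expand_query(["boat", "mississippi"]))
# ['boat', 'mississippi', 'watercraft', 'canoe', 'kayak', 'dinghy', 'sailing', 'river']
```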

Handcrafted controlled vocabularies contribute to the efficiency and comprehensiveness of information retrieval
and related text analysis operations, but they work best when topics are narrowly defined and the terminology is
standardized. Controlled vocabularies require extensive human input and oversight to keep up with the rapid
evolution of language. They also are not well suited to the growing volumes of unstructured text covering an
unlimited number of topics and containing thousands of unique terms because new terms and topics need to be
constantly introduced. Controlled vocabularies are also prone to capturing a particular world view at a specific
point in time, which makes them difficult to modify if concepts in a certain topic area change.[7]

Local Co-occurrence Statistics

Information retrieval systems incorporating this approach count the number of times that groups of terms appear
together (co-occur) within a sliding window of terms or sentences (for example, ± 5 sentences or ± 50 words)
within a document. It is based on the idea that words that occur together in similar contexts have similar
meanings. It is local in the sense that the sliding window of terms and sentences used to determine the co-
occurrence of terms is relatively small.
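
A minimal sketch of this idea, counting term pairs that fall within a fixed sliding window, might look like the following; the window size and sample text are arbitrary.

```python
# A minimal sketch of local co-occurrence counting: every pair of terms that
# appears within a sliding window of N words is counted, following the
# intuition that words occurring in similar contexts have similar meanings.

from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered term pairs that co-occur within `window` words of each other."""
    counts = Counter()
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != term:
                counts[tuple(sorted((term, other)))] += 1
    return counts

text = "the fire crew put out the fire before the fire spread to the forest"
counts = cooccurrence_counts(text.split(), window=5)
print(counts.most_common(3))
```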

This approach is simple, but it captures only a small portion of the semantic information contained in a collection
of text. At the most basic level, numerous experiments have shown that only approximately ¼ of the information
contained in text is local in nature.[8] In addition, to be most effective, this method requires prior knowledge
about the content of the text, which can be difficult with large, unstructured document collections.[7]

Transform Techniques

Some of the most powerful approaches to semantic processing are based on the use of mathematical transform techniques. Matrix decomposition techniques have been the most successful. Some widely used matrix decomposition techniques include the following:[9]

Independent component analysis
Semi-discrete decomposition
Non-negative matrix factorization
Singular value decomposition

Matrix decomposition techniques are data-driven, which avoids many of the drawbacks associated with auxiliary
structures. They are also global in nature, which means they are capable of much more robust information
extraction and representation of semantic information than techniques based on local co-occurrence statistics.[7]

Independent component analysis is a technique that creates sparse representations in an automated fashion,[10]
and the semi-discrete and non-negative matrix approaches sacrifice accuracy of representation in order to
reduce computational complexity.[7]

Singular value decomposition (SVD) was first applied to text at Bell Labs in the late 1980s. It was used as the
foundation for a technique called Latent Semantic Indexing (LSI) because of its ability to find the semantic
meaning that is latent in a collection of text. At first, the SVD was slow to be adopted because of the resource
requirements needed to work with large datasets. However, the use of LSI has significantly expanded in recent
years as earlier challenges in scalability and performance have been overcome. LSI is being used in a variety of
information retrieval and text processing applications, although its primary application has been for concept
searching and automated document categorization.[11]
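
A minimal sketch of an LSI-style concept search, assuming scikit-learn is available and using an invented toy corpus, might look like the following: documents are represented as TF-IDF vectors, a truncated SVD projects them into a low-dimensional concept space, and documents are ranked by cosine similarity to the query in that space.

```python
# A minimal sketch of concept search with Latent Semantic Indexing (LSI).
# Assumes scikit-learn is installed; the toy corpus is invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The crew fought the forest fire through the night.",
    "The company decided to fire the employee for misconduct.",
    "Wildfire smoke drifted over the river valley.",
    "The manager terminated the contractor's employment.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Project the term-document matrix into a small number of latent concepts.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = svd.fit_transform(tfidf)

def concept_search(query, top_k=2):
    """Rank documents by cosine similarity to the query in concept space."""
    query_concepts = svd.transform(vectorizer.transform([query]))
    scores = cosine_similarity(query_concepts, doc_concepts)[0]
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return [(documents[i], round(score, 3)) for i, score in ranked[:top_k]]

print(concept_search("employment terminated"))
print(concept_search("smoke from the forest"))
```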

Uses of Concept Search


eDiscovery - Concept-based search technologies are increasingly being used for Electronic Document
Discovery (EDD or eDiscovery) to help enterprises prepare for litigation. In eDiscovery, the ability to
cluster, categorize, and search large collections of unstructured text on a conceptual basis is much more
efficient than traditional linear review techniques. Concept-based searching is becoming accepted as a
reliable and efficient search method that is more likely to produce relevant results than keyword or
Boolean searches.[12]

Enterprise Search and Enterprise Content Management (ECM) - Concept search technologies are
being widely used in enterprise search. As the volume of information within the enterprise grows, the
ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis has
become essential. In 2004 the Gartner Group estimated that professionals spend 30 percent of their time
searching, retrieving, and managing information.[13] The research company IDC found that a 2,000-
employee corporation can save up to $30 million per year by reducing the time employees spend trying to
find information and duplicating existing documents.[13]
Content-Based Image Retrieval (CBIR) - Content-based approaches are being used for the semantic
retrieval of digitized images and video from large visual corpora. One of the earliest content-based image
retrieval systems to address the semantic problem was the ImageScape search engine. In this system,
the user could make direct queries for multiple visual objects such as sky, trees, water, etc. using spatially
positioned icons in a WWW index containing more than ten million images and videos represented by keyframes.
The system used information theory to determine the best features for minimizing uncertainty in the
classification.[14] The semantic gap is often mentioned in regard to CBIR. The semantic gap refers to the
gap between the information that can be extracted from visual data and the interpretation that the same
data have for a user in a given situation.[15] The ACM SIGMM Workshop on Multimedia Information
Retrieval is dedicated to studies of CBIR.
Multimedia and Publishing - Concept search is used by the multimedia and publishing industries to
provide users with access to news, technical information, and subject matter expertise coming from a
variety of unstructured sources. Content-based methods for multimedia information retrieval (MIR) have
become especially important when text annotations are missing or incomplete.[14]
Digital Libraries and Archives - Images, videos, music, and text items in digital libraries and digital
archives are being made accessible to large groups of users (especially on the Web) through the use of
concept search techniques. For example, the Executive Daily Brief (EDB), a business information
monitoring and alerting product developed by EBSCO Publishing, uses concept search technology to
provide corporate end users with access to a digital library containing a wide array of business content. In
a similar manner, the Music Genome Project spawned Pandora, which employs concept searching to
spontaneously create individual music libraries or virtual radio stations.
Genomic Information Retrieval (GIR) - Genomic Information Retrieval (GIR) uses concept search
techniques applied to genomic literature databases to overcome the ambiguities of scientific literature.
Human Resources Staffing and Recruiting - Many human resources staffing and recruiting
organizations have adopted concept search technologies to produce highly relevant resume search
results that provide more accurate and relevant candidate resumes than loosely related keyword results.

Effective Concept Searching


The effectiveness of a concept search can depend on a variety of elements including the dataset being searched
and the search engine that is used to process queries and display results. However, most concept search
engines work best for certain kinds of queries:

Effective queries are composed of enough text to adequately convey the intended concepts. Effective
queries may include full sentences, paragraphs, or even entire documents. Queries composed of just a
few words are not as likely to return the most relevant results.
Effective queries do not include concepts in a query that are not the object of the search. Including too
many unrelated concepts in a query can negatively affect the relevancy of the result items. For example,
searching for information about boating on the Mississippi River would be more likely to return relevant
results than a search for boating on the Mississippi River on a rainy day in the middle of the summer in
1967.
Effective queries are expressed in a full-text, natural language style similar in style to the documents
being searched. For example, using queries composed of excerpts from an introductory science textbook
would not be as effective for concept searching if the dataset being searched is made up of advanced,
college-level science texts. Substantial queries that better represent the overall concepts, styles, and
language of the items for which the query is being conducted are generally more effective.

As with all search strategies, experienced searchers generally refine their queries through multiple searches, starting with an initial seed query to obtain conceptually relevant results that can then be used to compose and/or refine additional queries for increasingly relevant results. Depending on the search engine, using query concepts found in result documents can be as easy as selecting a document and performing a find similar function. Changing a query by adding terms and concepts to improve result relevance is called query expansion.[16] The use of ontologies such as WordNet has been studied to expand queries with conceptually related words.[17]
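
As a rough illustration of WordNet-based query expansion, the sketch below appends a few synonyms from WordNet synsets to each query term. It assumes NLTK and its WordNet corpus are installed; the number of synonyms to add is an arbitrary choice.

```python
# A minimal sketch of ontology-based query expansion using WordNet synonym sets.
# Assumes NLTK is installed and the WordNet corpus has been downloaded, e.g.
# with nltk.download("wordnet").

from nltk.corpus import wordnet as wn

def expand_with_wordnet(query, max_synonyms=3):
    """Append a few WordNet synonyms for each query term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        synonyms = []
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                candidate = lemma.replace("_", " ")
                if candidate != term and candidate not in synonyms:
                    synonyms.append(candidate)
        expanded.extend(synonyms[:max_synonyms])
    return " ".join(expanded)

print(expand_with_wordnet("boat river"))
# e.g. "boat gravy boat gravy holder sauceboat river"
```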

Relevance Feedback
Relevance feedback is a feature that helps users determine if the results returned for their queries meet their
information needs. In other words, relevance is assessed relative to an information need, not a query. A
document is relevant if it addresses the stated information need, not because it just happens to contain all the
words in the query.[18] It is a way to involve users in the retrieval process in order to improve the final result set.[18] Users can refine their queries based on their initial results to improve the quality of their final results.
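
One classic way relevance feedback is implemented is the Rocchio update, which moves the query vector toward the centroid of documents the user judged relevant and away from the centroid of those judged non-relevant. The sketch below uses illustrative weights and vectors; the parameter values are conventional defaults, not values prescribed by any particular system.

```python
# A minimal sketch of the classic Rocchio relevance-feedback update. Vectors are
# assumed to come from the same term-weighting scheme (e.g. TF-IDF); the example
# vectors and default parameters are illustrative.

import numpy as np

def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return q' = alpha*q + beta*mean(relevant) - gamma*mean(non_relevant)."""
    updated = alpha * query_vec
    if len(relevant):
        updated = updated + beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        updated = updated - gamma * np.mean(non_relevant, axis=0)
    return np.clip(updated, 0.0, None)  # negative term weights are usually dropped

query = np.array([1.0, 0.0, 0.0])           # weights for ["fire", "forest", "employee"]
relevant = np.array([[0.8, 0.9, 0.0]])      # a forest-fire document marked relevant
non_relevant = np.array([[0.7, 0.0, 0.9]])  # a dismissal document marked non-relevant

print(rocchio(query, relevant, non_relevant))
```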

In general, concept search relevance refers to the degree of similarity between the concepts expressed in the
query and the concepts contained in the results returned for the query. The more similar the concepts in the
results are to the concepts contained in the query, the more relevant the results are considered to be. Results
are usually ranked and sorted by relevance so that the most relevant results are at the top of the list of results
and the least relevant results are at the bottom of the list.

Relevance feedback has been shown to be very effective at improving the relevance of results.[18] A concept
search decreases the risk of missing important result items because all of the items that are related to the
concepts in the query will be returned whether or not they contain the same words used in the query.[13]

Ranking will continue to be a part of any modern information retrieval system. However, the problems of
heterogeneous data, scale, and non-traditional discourse types reflected in the text, along with the fact that
search engines will increasingly be integrated components of complex information management processes, not
just stand-alone systems, will require new kinds of system responses to a query. For example, one of the
problems with ranked lists is that they might not reveal relations that exist among some of the result items.[19]

Guidelines for Evaluating a Concept Search Engine

1. Result items should be relevant to the information need expressed by the concepts contained in the query
statements, even if the terminology used by the result items is different from the terminology used in the
query.
2. Result items should be sorted and ranked by relevance.
3. Relevant result items should be quickly located and displayed. Even complex queries should return
relevant results fairly quickly.
4. Query length should be non-fixed, i.e., a query can be as long as deemed necessary. A sentence, a
paragraph, or even an entire document can be submitted as a query.
5. A concept query should not require any special or complex syntax. The concepts contained in the query
can be clearly and prominently expressed without using any special rules.
6. Combined queries using concepts, keywords, and metadata should be allowed.
7. Relevant portions of result items should be usable as query text simply by selecting the item and telling
the search engine to find similar items.
8. Query-ready indexes should be created relatively quickly.
9. The search engine should be capable of performing Federated searches. Federated searching enables
concept queries to be used for simultaneously searching multiple datasources for information, which are
then merged, sorted, and displayed in the results.
10. A concept search should not be affected by misspelled words, typographical errors, or OCR scanning
errors in either the query text or in the text of the dataset being searched.

Search Engine Conferences and Forums


Formalized search engine evaluation has been ongoing for many years. For example, the Text REtrieval
Conference (TREC) was started in 1992 to support research within the information retrieval community by
providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. Most of today's
commercial search engines include technology first developed in TREC.[20]

In 1997, a Japanese counterpart of TREC was launched, called the National Institute of Informatics Test Collection
for IR Systems (NTCIR). NTCIR conducts a series of evaluation workshops for research in information retrieval,
question answering, text summarization, etc. A European series of workshops called the Cross Language
Evaluation Forum (CLEF) was started in 2001 to aid research in multilingual information access. In 2002, the
Initiative for the Evaluation of XML Retrieval (INEX) was established for the evaluation of content-oriented XML
retrieval systems.

Precision and recall have been two of the traditional performance measures for evaluating information retrieval
systems. Precision is the fraction of the retrieved result documents that are relevant to the user's information
need. Recall is defined as the fraction of relevant documents in the entire collection that are returned as result
documents.[18]
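
As a small worked example of these two measures, the sketch below computes precision and recall from sets of retrieved and relevant document identifiers; the identifiers are invented.

```python
# Precision = |retrieved & relevant| / |retrieved|
# Recall    = |retrieved & relevant| / |relevant|

def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query from document identifier lists."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]   # documents returned by the system
relevant = ["d2", "d4", "d7"]          # documents judged relevant to the information need

print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```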

Although the workshops and publicly available test collections used for search engine testing and evaluation have provided substantial insights into how information is managed and retrieved, the field has only scratched the surface of the challenges people and organizations face in finding, managing, and using information now that so much information is available.[19] Scientific data about how people use the information tools available to them today is still incomplete because experimental research methodologies haven't been able to keep up with the rapid pace of change. Many challenges, such as contextualized search, personal information management, information integration, and task support, still need to be addressed.[19]

See also

References
1. Bradford, R. B., Word Sense Disambiguation, Content Analyst Company, LLC, U.S. Patent 7415462, 2008.
2. Navigli, R., Word Sense Disambiguation: A Survey, ACM Computing Surveys, 41(2), 2009.
3. Greengrass, E., Information Retrieval: A Survey, 2000.
4. Dubois, C., The Use of Thesauri in Online Retrieval, Journal of Information Science, 8(2), March 1984, pp. 63-66.
5. Miller, G., Special Issue, WordNet: An On-line Lexical Database, International Journal of Lexicography, 3(4), 1990.
6. Giunchiglia, F., Kharkevich, U., and Zaihrayeu, I., Concept Search, in Proceedings of the European Semantic Web Conference, 2009.
7. Bradford, R. B., Why LSI? Latent Semantic Indexing and Information Retrieval, White Paper, Content Analyst Company, LLC, 2008.
8. Landauer, T., and Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, Psychological Review, 104(2), 1997, pp. 211-240.
9. Skillicorn, D., Understanding Complex Datasets: Data Mining with Matrix Decompositions, CRC Publishing, 2007.
10. Honkela, T., Hyvarinen, A., and Vayrynen, J., WordICA - Emergence of Linguistic Representations for Words by Independent Component Analysis, Natural Language Engineering, 16(3), 2010, pp. 277-308.
11. Dumais, S., Latent Semantic Analysis, ARIST Review of Information Science and Technology, vol. 38, Chapter 4, 2004.
12. Magistrate Judge John M. Facciola of the U.S. District Court for the District of Columbia, Disability Rights Council v. Washington Metropolitan Transit Authority, 242 FRD 139 (D.D.C. 2007), citing George L. Paul & Jason R. Baron, "Information Inflation: Can the Legal System Adapt?", 13 Rich. J.L. & Tech. 10 (2007).
13. Laplanche, R., Delgado, J., Turck, M., Concept Search Technology Goes Beyond Keywords, Information Outlook, July 2004.
14. Lew, M. S., Sebe, N., Djeraba, C., Jain, R., Content-based Multimedia Information Retrieval: State of the Art and Challenges, ACM Transactions on Multimedia Computing, Communications, and Applications, February 2006.
15. Datta, R., Joshi, D., Li, J., Wang, J. Z., Image Retrieval: Ideas, Influences, and Trends of the New Age, ACM Computing Surveys, Vol. 40, No. 2, April 2008.
16. Robertson, S. E., Spärck Jones, K., Simple, Proven Approaches to Text Retrieval, Technical Report, University of Cambridge Computer Laboratory, December 1994.
17. Navigli, R., Velardi, P., An Analysis of Ontology-based Query Expansion Strategies, Proc. of the Workshop on Adaptive Text Extraction and Mining (ATEM 2003), at the 14th European Conference on Machine Learning (ECML 2003), Cavtat-Dubrovnik, Croatia, September 22-26, 2003, pp. 42-49.
18. Manning, C. D., Raghavan, P., Schütze, H., Introduction to Information Retrieval, Cambridge University Press, 2008.
19. Callan, J., Allan, J., Clarke, C. L. A., Dumais, S., Evans, D. A., Sanderson, M., Zhai, C., Meeting of the MINDS: An Information Retrieval Research Agenda, ACM SIGIR Forum, Vol. 41, No. 2, December 2007.
20. Croft, B., Metzler, D., Strohman, T., Search Engines: Information Retrieval in Practice, Addison Wesley, 2009.

External links

