A Model Driven Approach To Building Domain Specific Search Engines


2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C)

978-1-6654-2484-4/21/$31.00 ©2021 IEEE | DOI: 10.1109/MODELS-C53483.2021.00029

A Model Driven Approach to Building Domain Specific Search Engines
Vishnudas Raveendran, Sapan Shah, Sreedhar Reddy
TCS Research, Tata Consultancy Services Ltd.,
Pune, India
Email: vishnudas.raveendran@tcs.com, sapan.hs@tcs.com, sreedhar.reddy@tcs.com

Abstract: In many application domains, general purpose search engines are not very effective as they are not designed for deep processing of semantic relations present in these domains. Such domains need a search engine that understands domain concepts and relations. However, building a domain specific search engine is a non-trivial task. To build an effective search engine, it is not only the text processing technologies that one has to master, one also has to understand the nuances of domain entities and their relationships. Domain knowledge such as generalization hierarchies, relation types, cardinalities, property value types, units, ranges, etc., plays a key role in understanding the text. To achieve the right accuracy levels, these domain specific nuances have to be encoded into text processing algorithms. Developing a search engine that takes all this into account requires substantial time and effort. To address this, we present a model driven approach to realize domain specific search engines. We propose a metamodel to specify a domain in terms of concepts, relations and their extraction models. An information extraction component interprets this metamodel to access the domain model and the information contained in it. It uses this information to determine what entities, relations and properties need to be extracted, what extraction models to use for the same, and how to resolve the ambiguities that arise from the text processing stage. We also show how a domain specific query language interface can be generated from the domain model. We discuss the results of applying the proposed approach on two domains: materials science and urban mining. The model driven approach results in substantial savings in development efforts.

I. INTRODUCTION

General purpose search engines are not very effective when searching for information in knowledge intensive domains such as materials science, medicine and genetics. This is because, to be effective, a search engine has to understand the concepts and relations that underlie the domain. Suppose a user queries for “composition of steel where hardness is greater than 50 Rc”. A keyword based search engine will find documents where the terms composition, steel, hardness, 50 and Rc occur, but these occurrences need not have the intended relations. We are looking for a composition of steel, not any composition, and we are looking for hardness of the steel with the said composition, and it is this hardness that should be >50 Rc. Keyword based general purpose search engines are weak on this kind of relation processing. To address this, efforts have been made in building domain specific search engines that cater to specific domains such as biomedical and genomic [1], materials science [2], patent search [3], etc. However, building a domain specific search engine is a non-trivial task. To build an effective search engine, it is not only the text processing technologies such as natural language processing (NLP), machine learning etc., that one has to master, one also has to understand the nuances of domain entities and their relationships. Domain knowledge such as generalization hierarchies, relation types, cardinalities, property value types, value-ranges, units, etc. plays a key role in understanding the text. To achieve the right accuracy levels this knowledge has to be acquired and then embedded in the text processing algorithms. Implementing a domain specific search engine is therefore a highly effort and knowledge intensive activity.

In this work, we propose a model driven approach that can significantly reduce the search engine development efforts. The approach is based on a metamodel for specifying relevant aspects of an application domain, and a search engine framework built around this metamodel. The framework has three parts: 1) domain agnostic core, 2) domain specification, and 3) query module, as shown in Fig. 1. Domain specification captures relevant aspects of the domain such as domain concepts, relations and their extraction models as an instantiation of the metamodel. The domain agnostic core simply interprets the domain model in terms of the metamodel. Since the interpretation is in terms of the metamodel, it can work with any domain that is specified as an instance of the metamodel. The query module provides a search interface in terms of a domain specific query language. The query language is essentially a template specified in terms of the metamodel. This template is instantiated with domain specific concepts and relations to obtain a domain specific query language.

The development life cycle of a domain specific search engine consists of: 1) acquisition of domain knowledge, i.e. entities, properties and relations that make up the domain, and 2) implementation of search engine modules, i.e. extraction, indexing, search, etc., taking the relevant domain knowledge into account. These steps are often performed in an iterative manner until acceptable search accuracy is achieved. Though the domain knowledge acquisition cost is fixed and common to both the traditional development approach and our model driven approach, the implementation cost for the traditional approach is much higher since the complete search engine is generally built from scratch. The iterative nature of the problem adds to the complexity since small and frequent updates generally lead to the introduction of bugs. The model driven approach proposed in this work significantly reduces this cost by first explicitly


Authorized licensed use limited to: TH Brandenburg. Downloaded on October 22,2022 at 17:23:32 UTC from IEEE Xplore. Restrictions apply.
capturing domain specifications and then automatically creating a search system by leveraging the domain agnostic core, thereby completely avoiding the manual coding effort. We evaluated our approach on two domains: materials science and urban mining. We observed that the model driven approach substantially reduces development efforts compared to the traditional development approach. The rest of this paper is organized as follows. Section II discusses domain specific search and the issues pertaining to it. Section III discusses the architecture of our model driven search engine. Section IV discusses the metamodel and how it is used for domain specification. Section V discusses how information is extracted and indexed by interpreting the metamodel. Section VI discusses the query module. Section VII discusses evaluation and results. Section VIII discusses the relevant related work. Finally, we summarize and provide future work directions in section IX.

Fig. 1. Schematic of the search engine framework

II. DOMAIN SPECIFIC SEARCH - ISSUES

A domain specific search engine is expected to provide significantly more accurate results than a general purpose search engine when querying information pertaining to the domain. To attain this, there are several challenges that need to be addressed while building a domain specific search engine.

A. Relation Aware Search

Domain specific search involves queries such as “get aircraft whose wingspan is greater than 30 m” or “composition of steel where hardness is greater than 50 Rc”. A keyword based search engine does not do a good job on such queries since it would only check for the occurrence of the words in the query. It would give us search results containing occurrences of the words ‘wingspan’, ‘30’ and ‘hardness’ but these need not have any semantic relationships with each other. This significantly lowers the precision. Value matching is another problem. Suppose a document contains “a hardness value of 60 Rc”. Even though this satisfies the constraint (greater than 50 Rc), it will not be considered a valid match by keyword search engines as they do not understand the semantics of greater than. To give another example, suppose we have a query, “heating time greater than 20 min”. This will not match a document containing heating time of 2 hours. Such issues adversely affect the recall of the search system. To address these, we need an extraction system that is domain aware. It should recognize and extract not only domain entity occurrences but also property values and relations.

B. Domain Knowledge Guided Extraction

In NLP, no extraction model is perfect. There are always ambiguities. Some of these can be resolved by using domain knowledge. Suppose we have the following sentence in a paper:

“The specimen is heated to 1800 C and held for 5 mins prior to water quenching at 100 C per min to room temperature.”

When reading this sentence, a person with domain knowledge knows that 100 C per min refers to cooling rate of a quenching process. However, an information extraction model cannot figure this out as there is no explicit mention of ‘cooling rate’ in the sentence, unless we enable it with domain knowledge. To extend the above example further, suppose we have the following sentence:

“Tempering followed by quenching were performed at the rates of 120 C per min and 200 C per min respectively.”

While a human reader can easily figure out that 120 C per min refers to heating rate of tempering and 200 C per min refers to cooling rate of quenching, an NLP algorithm might get confused. It might think that both 120 C per min and 200 C per min refer to cooling rate of quenching. A human reader knows this cannot be, since it is a cardinality violation: a quenching process can only have one cooling rate. An automated extraction model cannot figure this out unless we enable it with the necessary domain knowledge.

Dealing with abstraction is another issue. A domain can have a large number of concepts (many thousands in some cases). It is not necessary (or practical) to specify a separate extraction algorithm for each concept. It should be possible to reuse algorithms across concepts that are semantically similar. For example, in manufacturing domain one may have entities ‘Part’, ‘MechanicalPart’ and ‘ElectricalPart’, the latter two being specializations of the former. An extraction model specified for ‘Part’ might work for the other two. Similarly, we may have a relation called ‘interaction’ between parts; this relation may have specializations such as ‘energy exchange’ and ‘material exchange’. A model that recognizes ‘interaction’ might work for the specializations as well. To be able to do this, a search engine has to be aware of the concept hierarchies in a domain.

C. Domain Specific Query Interface

Search engines usually provide natural language as the query interface. However, natural language suffers from inherent ambiguities. While this is acceptable for general-purpose search, it may not be acceptable for domain specific search where one is looking for a precise interpretation of the query. One might need a structured query language that is customized for the needs of the domain, with unambiguous semantics. Domain concepts and relations should be part of the vocabulary in such a query language. For example, in the materials science domain, we might have queries such as “steel subjected to tempering at a temperature greater than 1500 C and cooled to 500 C” or “composition of carbon in DS100”. For effective retrieval of documents, the system must understand that DS100 is a material name; tempering is a process; and 1500 C and 500 C refer to tempering temperature and cooling temperature respectively. This brings out the need

for a typed query language where the terms in the query are typed with domain entities and relations.

III. SEARCH SYSTEM ARCHITECTURE

We propose a model driven approach for building domain specific search systems to address the issues described above. Fig. 2 shows the components of the proposed search system and their interactions at a high level.

1) Domain Specification Interface: It is a graphical user interface to allow domain experts to capture domain concepts, properties and relations using the metamodel. The domain expert may also specify extraction models (rules or machine-learning models) for these concepts and relations. The metamodel is described in section IV.
2) Metamodel Instances: It is a repository that stores domain specifications as instances of the metamodel.
3) Extraction Module: The extraction module uses the instances of metamodel to know which extraction model to use for which domain element. It processes the document repository to extract mentions of domain entities and relationships along with their location details in documents. It uses the knowledge captured in domain models to resolve ambiguities as explained in section V.
4) Index: The extracted entity, property value and relationship mentions are populated into an index. This is stored in a graph database for efficient processing of queries.
5) Domain Search Interface: This provides the search interface to the user in terms of a domain specific query language. The query language is discussed in section VI. Query results list a set of matching documents with relevant text fragments highlighted at a sentence level.
6) Query Processor Module: The query processor module parses the domain specific search query and translates it into equivalent sub-queries on the index database.
7) Search Module: This module fires the translated sub-queries and collates the search results.

Fig. 2. Architecture of our model driven approach for building domain specific search engines

IV. DOMAIN SPECIFICATION

We propose a model driven approach where we externalize the specification of domain knowledge in the form of a domain model. The domain model specifies the entities and relations of the domain, their generalization hierarchies, relation cardinalities, property value types, units, value-ranges and so on. The domain model also specifies the extraction models to be used for different entities and relations. As mentioned earlier, the information extraction module itself is completely domain agnostic. It imparts domain specificity to the search function by interpreting the domain models and the information captured in them. To be able to do this, we propose a metamodel (discussed below) which provides the structures necessary to describe domain models. An application domain model is specified as an instance of this metamodel. The information extraction module accesses all domain specific information via this metamodel. The metamodel consists of two parts: 1) M2-layer metamodel: This contains domain metamodel to specify the domain in terms of core concepts, relations and properties, and extraction metamodel to specify extraction models (algorithms) for domain concepts and relations; 2) M1-layer metamodel: This contains domain instance metamodel to specify the instances of the core domain concepts, relations and properties that we are interested in extracting information about, and mention model to specify the structure of entity and relation mentions extracted from text.

A. M2-layer Metamodel

The M2-layer metamodel for domain model specification is shown in Fig. 3-a. It has three meta-level classes, namely, OntMetaClass, OntMetaRelation and OntMetaProperty for specifying the domain entities, their relations and properties respectively. Let us understand the metamodel by taking materials science domain as an example (Fig. 4-a). Materials science has concepts such as Material and Process. A material has a set of material properties and a process has a set of process parameters. A material has a set of associated processes that are performed on it (e.g. steel is subjected to processes such as tempering and quenching). Fig. 4-c shows how this domain is specified in terms of the M2-layer metamodel. The entities Material and Process are created as instances of OntMetaClass; the relation material process is created as an instance of OntMetaRelation; and MaterialProperty and ProcessParameter are created as instances of OntMetaProperty and associated with the entities Material and Process respectively. The figure shows an entity with the notation “<entity-name>:<metaentity-name>”. The figure also shows the cardinality of the relation material process as 1..* : 1..*, signifying that one material may be related with one or more processes, and vice versa. In Fig. 3-a, note that we can also specify generalization and specialization hierarchies among entities and relations using ‘superClass/subClass’ and ‘superRelation/subRelation’ associations respectively.

Fig. 3-a also shows the metamodel for specifying extraction models (algorithms) for entities and relations. Different kinds of extraction models can be specified such as dictionary based,

Fig. 3. Metamodel for domain specification: (a) M2-layer metamodel to specify domain concepts, properties, relations and their extraction algorithms; (b)
M1-layer metamodel to specify instances of domain elements and their mentions

pattern based and machine learning based. For example, the entity Material may have a pattern based extraction model, MaterialProperty may have a dictionary based model, and the relation material process may have a machine learning model.

B. M1-layer Metamodel

Fig. 3-b shows the M1-layer metamodel used for specifying the instance level of the domain model. For the materials domain considered above, Steel and Aluminium are instances of the entity Material; similarly Tempering and Quenching are instances of the domain entity Process. Fig. 4-b shows a fragment of instances in this domain. These instance level entities, relations and properties are specified using the classes OntClass, OntRelation and OntProperty respectively. Fig. 4-d shows how these instances are specified in terms of the M1-layer metamodel. It shows two process instances Tempering and Quenching and their parameters, and how they are created as instances of OntClass and OntProperty. The created instances are then linked with the domain entity Process using the instance association. We can also specify value ranges the properties can take or the units to be used. For example, CoolingRate has the unit ‘C per min’ and a value range of 1 to 1000.

As mentioned earlier, the extraction module uses the domain model to find occurrences of domain elements in the text. These occurrences are called mentions and are stored in the index as an instance of the mention model. Consider an example where ‘tempering process with heating rate of 120 C per min’ is present in document X and ‘tempering with heating rate of 200 C per min’ is present in document Y. These are two separate mentions of the process Tempering with separate process parameter values. We need to extract all such mentions and their locations (i.e. file, section, page where they occur in text) and link them with the process Tempering. Fig. 3-b shows the classes of the mention model (e.g. OntClassMention, OntPropertyMention, etc.) and how they relate

Fig. 4. An example fragment from materials science showing how domain model and instances can be specified in terms of M2-layer and M1-layer metamodels
((c) and (d) are shown using UML object diagram notations)

to the domain instance model. The mention model stores the occurrences of entities, property values and relationships along with the location details of the document they occur in.

V. MODEL DRIVEN INFORMATION EXTRACTION AND INDEXING

The information extraction module interprets the above metamodels to identify the domain entities, properties and relations about which information needs to be extracted from the text files and the extraction models (i.e. algorithms) to be used for the same. It also exploits the domain knowledge captured in these models to enhance extraction performance in the following ways.

A. Ambiguity Resolution

In NLP, no extraction model is perfect. There are always ambiguities. As discussed in section II, some of these ambiguities can be resolved by using domain knowledge. This section describes a set of rules our system generates for ambiguity resolution.

1) Value Centric Rules: The value centric rules are generated for the properties (instances of OntProperty) of domain entities. Referring to the following example:

“The specimen is heated to 1800 C and held for 5 mins prior to water quenching at 100 C per min to room temperature.”

Here, we want to help the extraction model figure out that 100 C per min refers to cooling rate (which is missing in text) of a quenching process. We do this by generating a set of rules using the knowledge captured in the domain model. For example if the domain model says that the entity quenching has a property cooling rate and the units of the property value are C per min, we generate the following rule:

“If v is a value unattached to any property and the sentence has an entity mention whose property unit matches v’s unit, then attach v to that property”

Such rules are generated not only for units, but also for value types and value-range constraints. The value centric rules help in associating values with the right properties.

2) Relation Centric Rules: The relation centric rules are generated for the relations (instances of OntRelation) present between domain entities. Referring to the same example in section II:

“Tempering followed by quenching were performed at the rates of 120 C per min and 200 C per min respectively.”

Here, we want to help the extraction model figure out that 120 C per min refers to heating rate of tempering and 200 C per min refers to cooling rate of quenching. We can do this by generating rules from the cardinality information in the domain model. For instance if the domain model says the cardinality between ‘quenching’ and ‘cooling rate’ is 1 : 1, we generate the following rule:

“If an entity mention has two associated parameter values and the cardinality only permits one parameter value then attach only the nearest parameter value and look for the next nearest entity mention for the unattached value”

The relation centric rules use the cardinality constraints of domain relations to avoid inconsistent entity associations.
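
The two rule families can be illustrated with a minimal Python sketch. It is not the system's implementation: the class and function names (PropertySpec, EntityMention, ValueMention, value_centric_rule, relation_centric_rule) are hypothetical, "nearest" is assumed to mean token distance within the sentence, and the heating rate range is an assumed value. The second example uses a simplified hypothetical sentence ("Tempering at 120 C per min followed by quenching at 200 C per min"); constructions such as "respectively" would need more than token distance.

```python
from dataclasses import dataclass, field


@dataclass
class PropertySpec:
    """Hypothetical stand-in for an OntProperty: unit, value range, cardinality."""
    name: str
    unit: str
    low: float
    high: float
    max_values: int = 1  # upper bound of the entity-property cardinality


@dataclass
class ValueMention:
    value: float
    unit: str
    pos: int  # token offset in the sentence


@dataclass
class EntityMention:
    name: str
    pos: int
    props: list  # PropertySpec objects taken from the domain model
    attached: dict = field(default_factory=dict)  # property name -> [ValueMention]

    def matching_prop(self, v):
        """Return the property whose unit and value range accept v, if any."""
        for p in self.props:
            if p.unit == v.unit and p.low <= v.value <= p.high:
                return p
        return None


def value_centric_rule(values, entities):
    """Attach each unattached value to the nearest entity with a matching property."""
    unattached = []
    for v in values:
        candidates = [e for e in entities if e.matching_prop(v)]
        if not candidates:
            unattached.append(v)
            continue
        nearest = min(candidates, key=lambda e: abs(e.pos - v.pos))
        nearest.attached.setdefault(nearest.matching_prop(v).name, []).append(v)
    return unattached


def relation_centric_rule(entities):
    """Enforce cardinality: keep the nearest values, offer the rest to other entities."""
    for e in entities:
        for p in e.props:
            vals = sorted(e.attached.get(p.name, []), key=lambda v: abs(e.pos - v.pos))
            if len(vals) <= p.max_values:
                continue
            e.attached[p.name] = vals[:p.max_values]
            for v in vals[p.max_values:]:
                others = sorted((o for o in entities if o is not e),
                                key=lambda o: abs(o.pos - v.pos))
                for o in others:
                    q = o.matching_prop(v)
                    if q and len(o.attached.get(q.name, [])) < q.max_values:
                        o.attached.setdefault(q.name, []).append(v)
                        break  # a value that finds no entity is simply dropped


RATE = "C per min"

# Value centric: "... water quenching at 100 C per min ..." (no 'cooling rate' in text).
quenching = EntityMention("quenching", 15, [PropertySpec("cooling rate", RATE, 1, 1000)])
values = [ValueMention(1800, "C", 5), ValueMention(5, "min", 10), ValueMention(100, RATE, 17)]
left_over = value_centric_rule(values, [quenching])

# Relation centric: a naive extractor attached both rates to quenching.
tempering = EntityMention("tempering", 0, [PropertySpec("heating rate", RATE, 1, 1000)])
quench2 = EntityMention("quenching", 8, [PropertySpec("cooling rate", RATE, 1, 1000)])
quench2.attached["cooling rate"] = [ValueMention(120, RATE, 2), ValueMention(200, RATE, 10)]
relation_centric_rule([tempering, quench2])
```

In the first example only the value whose unit matches the cooling rate property is attached; the temperature and the holding time stay unattached. In the second, the cardinality bound leaves quenching with its nearest rate and moves the other rate to tempering.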

B. Extraction Model Inheritance

This allows us to specify an extraction model at a generic level and reuse it for multiple entities. In other words, an extraction model specified at a super class/relation level is inherited down the sub class/relation hierarchy, until explicitly overridden at a specific entity/relation level. For instance, an extraction model specified for entity ‘Person’ may be used for extracting ‘Employee’, ‘Manager’ or ‘President’, unless explicitly overridden with a more specific algorithm. This is useful for scaling up information extraction for a large domain. To take an example, medical domain contains many thousands of entities and relations. One should not have to specify extraction models for each of these entities and relations separately. It should be possible to specify them at a generic level and inherit. Putting it all together, Algorithm 1 describes the high level approach for extraction of domain elements. The algorithm uses the type hierarchies specified for domain elements to select the right extraction models. These models are then applied to each document in the repository to find instances of various domain elements. The algorithm also uses the disambiguation rules to improve the accuracy of extraction.

C. Indexing

The indexing module takes the object graph (of entity and relation mentions) generated by the extraction module for each document in the repository and indexes it to a graph database. In addition to storing the entity, property and relation mentions, we also index the document text content. This enables our system to combine object based queries over mentions with the standard keyword based queries.

Algorithm 1 Algorithm for extracting mentions from text
for each entity E of OntMetaClass do
    check if E has an extraction model;
    if not, check if one of its superclass entities has an extraction model;
    if not, continue to the next entity;
    use the extraction model (algorithm) to extract mentions of E from the text;
    for each property P of (E union E.superclass*).ontMetaProperty do
        ▷ E.superclass* refers to the super-class closure (all ancestors of E), i.e. we get the list of all own and inherited properties of E
        check if P has an extraction model;
        if not, continue to the next property;
        use the extraction model (algorithm) to extract mentions of P from the text;
    end for
end for
for each relation R of OntMetaRelation do
    check if R has an extraction model;
    if not, check if one of its super-relations has an extraction model;
    if not, continue to the next relation;
    use the extraction model (algorithm) to extract mentions of R from the text;
end for
Apply the disambiguation rules generated (as described in section V-A) to resolve ambiguous mentions;
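
The model selection step of Algorithm 1 (use an element's own extraction model, otherwise inherit one from the nearest ancestor, otherwise skip the element) can be sketched in Python as follows. This is an illustrative sketch only: the OntMetaClass class here is a simplified, hypothetical stand-in for the metamodel class of the same name, a single superclass link is assumed, and the two extractors (a regular expression and a one-entry dictionary lookup) are toy examples.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OntMetaClass:
    """Simplified, hypothetical stand-in for the metamodel class of the same name."""
    name: str
    super_class: Optional["OntMetaClass"] = None
    extraction_model: Optional[Callable[[str], list]] = None


def resolve_extractor(entity):
    """Walk up the superclass chain and return the nearest extraction model, or None."""
    node = entity
    while node is not None:
        if node.extraction_model is not None:
            return node.extraction_model
        node = node.super_class
    return None


def extract_mentions(entities, text):
    """Apply the resolved extraction model of each entity to the text."""
    mentions = {}
    for e in entities:
        model = resolve_extractor(e)
        if model is None:
            continue  # no model on the entity or any ancestor: skip it
        mentions[e.name] = model(text)
    return mentions


# Toy pattern based model: a title followed by a capitalised surname.
TITLE_NAME = re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+")
person = OntMetaClass("Person", extraction_model=TITLE_NAME.findall)

# 'Employee' has no model of its own and inherits the one from 'Person'.
employee = OntMetaClass("Employee", super_class=person)

# 'President' overrides the inherited model with a toy dictionary based one.
president = OntMetaClass(
    "President",
    super_class=person,
    extraction_model=lambda text: [t for t in ("President Lee",) if t in text],
)

# 'Location' has no model anywhere in its hierarchy and is skipped.
location = OntMetaClass("Location")

mentions = extract_mentions(
    [person, employee, president, location],
    "Dr. Rao met Ms. Iyer and President Lee.",
)
```

Properties and relations would be resolved the same way along their own hierarchies; the disambiguation rules of section V-A would then run over the collected mentions.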

VI. QUERY PROCESSING

We have developed a declarative query language template for users to input search queries. Fig. 5 shows a subset of the grammar of this language, highlighting its key features. The basic unit of a query is a value constraint/condition on the domain instances and their properties. The language provides boolean operators to build complex queries over such constraints. A constraint may be specified either as a point-value or a range-value constraint. The range constraint may be specified either with lower and upper bounds (e.g. [1, 10]) or with relational operators such as >, <, >= or <=. The language provides a syntax similar to path expressions [4] to traverse relations between domain entities and their properties. The language also provides a way to combine value constraint based queries with keyword based queries.

Fig. 5. Query language template

In the language, the non-terminals that start with $ (e.g. $OntClass and $OntMetaClass) provide a means to refer to domain model entities, properties and relations. These are specified in terms of metamodel elements which when instantiated stamp out corresponding domain specific elements. Thus, the language provides a generic template from which one can generate a domain specific query language. We used it to generate domain specific query languages for the materials science and urban mining domains discussed in the evaluation section. Table I shows example queries conforming to the query languages generated for the two domains.

VII. EVALUATION AND RESULTS

We evaluated our model driven approach for building domain specific search systems on two domains: materials science and urban mining. The materials domain focuses on searching steel related publications for their behaviour of mechanical properties in the context of heat treatment processes. The urban mining domain focuses on recovery of metals from e-wastes through mechanical and chemical processing. Table II reports the total number of concepts, properties and their instances created by domain experts for both domains. For extraction models, we used only the

TABLE I
EXAMPLE QUERIES FOR MATERIALS SCIENCE AND URBAN MINING DOMAINS ALONG WITH THEIR SEARCH ACCURACY IN TERMS OF PRECISION, RECALL AND F1-SCORE

Query                                                                         Prec%    Rec%     F1%
Domain: Materials Science
carbon[composition=(0.05, 0.5) wt%] & silicon[composition=(0.02, 0.1) wt%]    91.67    57.89    70.97
cooling[rate >20 C/min] AND silicon[composition=(0.05, 0.5) wt%]              72.72    61.53    66.67
annealing[temperature >200 C] AND steel[tensile strength >200 MPa]            93.33    66.67    77.78
Domain: Urban Mining
fe[particle size]                                                             63.64    100      77.78
HNO3[pH <5]                                                                   50       100      66.67
Fe[particle size <500 micrometre] & leaching[rotation speed >30 rpm]          60       60       60
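
The F1 column of Table I is the harmonic mean of the precision and recall columns. The short Python check below recomputes it from the reported percentages; the function name f1_score and the tolerance of 0.02 (to absorb rounding of the published inputs) are choices made for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) for the six rows of Table I
table_rows = [
    (91.67, 57.89, 70.97),
    (72.72, 61.53, 66.67),
    (93.33, 66.67, 77.78),
    (63.64, 100.0, 77.78),
    (50.0, 100.0, 66.67),
    (60.0, 60.0, 60.0),
]

# Largest absolute gap between the recomputed and the reported F1 values.
max_gap = max(abs(f1_score(p, r) - reported) for p, r, reported in table_rows)
```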

TABLE II
DOMAIN DATA STATISTICS

Fig. 6. Screenshot of our system: (a) matches 2 publications for the given input query; (b) snippets of the first matched publication ([5]) highlighting the text content that matches constraints specified in the query
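
The value constraints used in the queries of Table I compare extracted property values against bounds that may be expressed in a different unit (for example, a mention of 2 h against a bound of 20 min). A minimal Python sketch of such a unit-aware comparison is given below. The unit table, the function names (normalise, satisfies, in_range) and the treatment of range bounds as inclusive are assumptions made for illustration, not details taken from the system.

```python
import operator

# Relational operators allowed in a constraint.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "=": operator.eq}

# Hypothetical unit table: unit -> (dimension, factor to the base unit).
UNITS = {
    "s": ("time", 1 / 60),
    "min": ("time", 1.0),
    "h": ("time", 60.0),
    "Rc": ("hardness", 1.0),
    "wt%": ("composition", 1.0),
}


def normalise(value, unit):
    """Convert a value to the base unit of its dimension."""
    dimension, factor = UNITS[unit]
    return dimension, value * factor


def satisfies(value, unit, op, bound, bound_unit):
    """True if the extracted value meets the relational constraint."""
    dim_v, v = normalise(value, unit)
    dim_b, b = normalise(bound, bound_unit)
    # Values of different dimensions never match.
    return dim_v == dim_b and OPS[op](v, b)


def in_range(value, unit, low, high, range_unit):
    """True if the value lies within [low, high]; inclusive bounds are assumed."""
    return (satisfies(value, unit, ">=", low, range_unit)
            and satisfies(value, unit, "<=", high, range_unit))


heating_time_ok = satisfies(2, "h", ">", 20, "min")    # 2 h against "> 20 min"
hardness_ok = satisfies(60, "Rc", ">", 50, "Rc")       # 60 Rc against "> 50 Rc"
carbon_ok = in_range(0.3, "wt%", 0.05, 0.5, "wt%")     # composition range constraint
mismatch = satisfies(60, "Rc", ">", 20, "min")         # different dimensions
```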

dictionary based and pattern based models. The pattern based extraction models use linguistic features such as tokens, part-of-speech tags and dependency relations. We did not include any machine learning model since we do not have tagged datasets available for the domains. From the domain models and extraction specifications, we created the domain specific search engines. The domain agnostic part of the search engine first pre-processes the documents in the repository using a standard NLP pipeline consisting of tokenization, sentence splitting, part-of-speech tagging, dependency parsing and co-reference resolution. It then applies the dictionary based and pattern based extraction models as described in Algorithm 1 to extract and populate the mentions into the mention model. Table I shows sample search queries along with their search accuracy for both domains. Fig. 6 shows the screenshot of our system for an example query in materials science domain. In the example, note that the system has identified 2h as the tempering time even though time is not explicitly mentioned in the sentence. It has also carried out unit conversion to identify that 2h > 20min.

To assess the benefits of the model driven approach, we compared the materials search engine with an older version we had developed using the traditional approach [2]. The older version had taken approximately 20 person-months to develop end-to-end. Of this, roughly 2 person-months were spent on understanding domain concepts and identifying extraction models. With the model driven approach, it took us just 1 person-month to develop the new engine since we already had all the required domain related information, or 3 person-months if we add the 2 person-months of initial domain analysis effort. This translates to about 7x reduction in effort (20 versus 3 person-months). While this figure will vary with the size and complexity of the domain model, it is clearly indicative of the kind of effort savings possible with the model driven approach. Moreover, we found search accuracy to be the same in both cases since the extraction models were the same. We observed a similar trend for the urban mining domain as well. To report the development efforts in a standardized manner and avoid reporting bias, we applied the COCOMO II model [6] for calculating person-month estimates for both the traditional and our model driven approach (see footnote 1). We computed the total number of source lines of code (SLOC) using a code-management plugin of Eclipse which considers only statements, avoiding blank lines and brackets. The SLOC for the traditional approach is 21,320, whereas the SLOC for capturing domain specifications in the model driven approach is 440. The SLOC values along with the other COCOMO II parameters are then used to measure effort estimates using the COCOMO II calculator (see footnote 2). For both the approaches, we considered a fixed cost for domain knowledge acquisition at 2 person-months. We observed a trend similar to the one reported earlier: the effort estimate for the traditional approach is 25.6 person-months as opposed to just 3.6 person-months for the model driven approach.

The approach presented in this work not only reduces the initial development efforts but also speeds up the integration of frequent model updates. The development of a domain specific search engine often includes iterative updates in the domain model, extraction, indexing, etc. in order to achieve acceptable search accuracy. These small and frequent updates generally lead to introduction of bugs in the traditional approach. Since our approach integrates such changes primarily through explicit model level specifications, it does not require code-level modifications, resulting in a robust development life cycle.

Footnote 1: The parameters chosen for the COCOMO II estimates for the traditional approach are kept ‘High’ for all ‘Software scale drivers’, ‘Very high’ for all ‘Personnel’ (emulating a smooth development life-cycle), and in the ‘Software cost drivers’ section all parameters are at a ‘Nominal’ value except the ‘Database-size’ and ‘Complexity’ attributes, which are kept ‘High’. For the model-driven approach, since there is no active code-development required, all parameters are ‘Nominal’ except ‘Database-size’ and ‘Complexity’, which are kept ‘High’.
Footnote 2: available at http://softwarecost.org/tools/COCOMO/

VIII. RELATED WORK

IX. CONCLUSION AND FUTURE WORK

In this work, we presented a model driven approach for building domain specific search engines. It is built around a metamodel that helps in explicitly capturing domain specifications such as domain concepts, properties, relations and their
General purpose search engines primarily rely on keyword extraction models. A search engine framework built around
based techniques. As such, they are not very effective in the metamodel then interprets these specifications to generate
many application domains since they are not designed for domain specific search engines. We evaluated our approach
deep processing of semantic relations that are generally present on materials science and urban mining domains and observed
in these domains. Tang et al. [7], in a set of experiments in substantial savings in development efforts.
mental health domain, reported this drawback and proposed an Our future work includes identifying a suitable mechanism
approach based on query expansion to enrich user query terms to specify domain specific ranking criteria to rank search
with domain specific elements. In addition to query expansion, results. We are also working on algorithms to automatically
White et al. [8] first applied a technique for user profiling to learn extraction patterns from tagged examples.
classify search users into categories such as domain expert and
non-expert user using features based on query terms, search R EFERENCES
result clicks, and so on. The authors then incorporated user [1] (2021) National center for biotechnology information. [Online].
category specific query expansion and re-ranking criteria to Available: https://www.ncbi.nlm.nih.gov/
better serve user query intents. These approaches, however, [2] S. Shah, D. Vora, B. P. Gautham, and S. Reddy, “A relation aware search
engine for materials science,” Integrating Materials and Manufacturing
serve just as an extension to general purpose search engines Innovation, vol. 7, pp. 1–11, 2018.
and do not provide the kind of domain processing warranted [3] (2021) Google patents. [Online]. Available: https://patents.google.com/
by domain specific search engines. [4] M. Kifer, W. Kim, and Y. Sagiv, “Querying object-oriented databases,”
SIGMOD Rec., vol. 21, no. 2, p. 393402, Jun. 1992. [Online].
One of the earlier efforts in leveraging domain information Available: https://doi.org/10.1145/141484.130342
for search is by McCallumzy et al. [9]. Their system applied [5] W.-S. Lee and T.-T. Su, “Mechanical properties and microstructural fea-
tures of aisi 4340 high-strength alloy steel under quenched and tempered
naive-Bayes classifier in a reinforcement learning setting for conditions,” Journal of materials processing technology, vol. 87, no. 1-3,
spidering. It then used hidden markov model based approach pp. 198–206, 1999.
for domain specific information extraction. Similar efforts to [6] B. W. Boehm, C. Abts, A. W. Brown, S. Chulani, B. K. Clark,
E. Horowitz, R. Madachy, D. J. Reifer, and B. Steece, Software cost
organize information for agriculture domain has also been re- estimation with COCOMO II. Prentice Hall Upper Saddle River, NJ,
ported in [10], where semantic web technologies and ontology 2000, vol. 1.
are used to represent domain information. There has been a [7] T. T. Tang, N. Craswell, D. Hawking, K. Griffiths, and H. Christensen,
“Quality and relevance of domain-specific search: A case study in mental
plethora of work on development of search engines for specific health,” Information Retrieval, vol. 9, no. 2, pp. 207–225, 2006.
domains such as biomedical, gene research and medicine. A [8] R. W. White, S. T. Dumais, and J. Teevan, “Characterizing the influence
few notable examples are DrugBank [11], GoPubMed [12] of domain expertise on web search behavior,” in Proceedings of the
second ACM international conference on web search and data mining,
and Gene. Since building a domain specific search engine is 2009, pp. 132–141.
knowledge, labour and time intensive, there have been attempts [9] A. McCallumzy, K. Nigamy, J. Renniey, and K. Seymorey, “Building
to bootstrap domain specific search engines. For instance, domain-specific search engines with machine learning techniques,” in
Proceedings of the AAAI Spring Symposium on Intelligent Agents in
Jacome et al. [13] have developed a search engine framework Cyberspace. Citeseer. Citeseer, 1999, pp. 28–39.
for biomedical domain wherein named entity taggers for [10] D. Mukhopadhyay, A. Banik, S. Mukherjee, J. Bhattacharya, and Y.-C.
various entities such as drug, genes, protein, organisms, etc. Kim, “A domain specific ontology based semantic web search engine,”
arXiv preprint arXiv:1102.0695, 2011.
are pre-trained and integrated into the framework. These pre- [11] D. S. Wishart, C. Knox, A. C. Guo, D. Cheng, S. Shrivastava, D. Tzur,
trained taggers provide domain specific vocabulary to improve B. Gautam, and M. Hassanali, “Drugbank: a knowledgebase for drugs,
extraction accuracy resulting in better search. In addition to drug actions and drug targets,” Nucleic acids research, vol. 36, no.
suppl 1, pp. D901–D906, 2008.
building search engines that understand domain entities and [12] A. Doms and M. Schroeder, “Gopubmed: exploring pubmed with the
relations, there have also been efforts in providing features gene ontology,” Nucleic acids research, vol. 33, no. suppl 2, pp. W783–
like domain concept based document categorization, search W786, 2005.
[13] A. G. Jácome, F. Fdez-Riverola, and A. Lourenço, “Biomedical search
filtering and query refinement. In biomedical domain, Joseph engine framework: lightweight and customized implementation of
et al. [14] have developed a system that provides concept domain-specific biomedical search engines,” computer methods and
based article categorization and search filtering. In animal programs in biomedicine, vol. 131, pp. 63–77, 2016.
[14] T. Joseph, V. G. Saipradeep, G. S. V. Raghavan, R. Srinivasan, A. Rao,
experiments domain, Sauer et al. [15] have proposed a search S. Kotte, and N. Sivadasan, “Tpx: Biomedical literature search made
system, Go3R, with ontology based context filtering. easy,” Bioinformation, vol. 8, no. 12, pp. 578–580, 2012. [Online].
While a number of different approaches and enhancements Available: https://app.dimensions.ai/details/publication/pub.1033247469
and http://www.bioinformation.net/008/97320630008578.pdf
have been proposed in the literature, we have not come across [15] U. G. Sauer, T. Wächter, B. Grune, A. Doms, M. R. Alvers, H. Spiel-
a model driven approach such as the one proposed in this paper mann, and M. Schroeder, “Go3r–semantic internet search engine for
that separates the domain agnostic core of a search engine alternative methods to animal testing,” ALTEX-Alternatives to animal
experimentation, vol. 26, no. 1, pp. 17–31, 2009.
from domain specifications and where the former interprets
the latter to render domain specific search functionality.
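
As a rough cross-check of the COCOMO II effort estimates reported in the evaluation above, the following sketch evaluates the post-architecture effort equation PM = A × Size^E × ΠEM, with E = B + 0.01 × ΣSF. The paper gives only the SLOC counts, the rating levels and the fixed 2 person-month domain analysis cost. The calibration constants (A = 2.94, B = 0.91) and the numeric values attached to each rating are assumptions taken from the published COCOMO II.2000 tables, and the function and variable names are illustrative. The online calculator used by the authors may differ in detail.

```python
from math import prod

# Assumed COCOMO II.2000 calibration constants (not stated in the paper).
A, B = 2.94, 0.91

# Assumed scale factor values (PREC, FLEX, RESL, TEAM, PMAT) per rating.
SCALE_FACTORS = {
    "Nominal": [3.72, 3.04, 4.24, 3.29, 4.68],
    "High": [2.48, 2.03, 2.83, 2.19, 3.12],
}

# Assumed effort multipliers for the drivers that deviate from Nominal (1.00).
DATA_HIGH, CPLX_HIGH = 1.14, 1.17
# Personnel drivers at 'Very high': ACAP, PCAP, PCON, APEX, PLEX, LTEX.
PERSONNEL_VERY_HIGH = [0.71, 0.76, 0.81, 0.81, 0.85, 0.84]

# Fixed domain knowledge acquisition cost, as stated in the paper.
DOMAIN_ANALYSIS_PM = 2.0


def effort_pm(sloc, scale_rating, multipliers):
    """Return estimated effort in person-months for a given SLOC count."""
    exponent = B + 0.01 * sum(SCALE_FACTORS[scale_rating])
    return A * (sloc / 1000.0) ** exponent * prod(multipliers)


# Traditional approach: scale drivers High, personnel Very high,
# database size and complexity High, everything else Nominal.
traditional = DOMAIN_ANALYSIS_PM + effort_pm(
    21_320, "High", [DATA_HIGH, CPLX_HIGH, *PERSONNEL_VERY_HIGH]
)

# Model driven approach: all Nominal except database size and complexity.
model_driven = DOMAIN_ANALYSIS_PM + effort_pm(
    440, "Nominal", [DATA_HIGH, CPLX_HIGH]
)

print(f"traditional:  {traditional:.1f} person-months")
print(f"model driven: {model_driven:.1f} person-months")
```

Under these assumed table values the two estimates come out close to the reported 25.6 and 3.6 person-months. The sketch also shows that the gap is driven almost entirely by the difference in SLOC (21,320 versus 440), since the personnel ratings chosen for the traditional approach actually lower its estimate.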
