Paper 43
Abstract
Clinical decisions are crucial and often lifesaving. Healthcare workers are at times overworked, and lapses in judgement or decision-making can lead to tragic consequences. Clinical decision support systems are therefore very important in assisting health workers. But in spite of much effort toward building a perfect system for clinical decision support, such a system is yet to see the light of day. Any clinical decision support system is only as good as its knowledgebase, so the knowledgebase should be consistently maintained with the updated knowledge available in the medical literature. The challenge in doing so lies in the fact that there is a huge amount of data on the web in varied formats. This article proposes a method of knowledgebase curation using RDF Knowledge Graphs and SPARQL queries.
Keywords
Clinical Decision Support System, RDF Knowledge Graph, Knowledgebase Curation.
CDSS has evolved gradually with the technological developments that have happened in the past few decades. The system is essentially centered on the high adoption and effective use of constantly updated knowledge. With the evolution of CDSS over the years, there has also been a consistent evolution in its definition, from the mere use of information technology for data entry to a hi-tech complex system that provides individual-specific, intelligently filtered and efficiently presented knowledge for clinicians, staff, patients, or other individuals [1, 2].

CDSS can be broadly classified into knowledge-based CDSS and non-knowledge-based CDSS. Knowledge-based CDSS are designed to mimic the knowledge processed by a human expert; such systems were earlier termed expert systems. Non-knowledge-based systems, on the other hand, rely on available statistical data to help in decision making. These systems make full use of machine learning and neural network algorithms to predict possible outcomes [3, 4, 5, 6].

2.1. Knowledge-based CDSS

Knowledge-based systems evolved from expert systems. While expert systems were built on the knowledge of human experts, knowledge-based systems have the capability to acquire knowledge from different sources and build upon it. So, while an expert system could be ranked according to the knowledge of its expert, knowledge-based systems have the capacity for greater knowledge [7, 8].

The knowledgebase of a knowledge-based CDSS ultimately determines the effectiveness of the CDSS [9]. The acquisition and representation of knowledge, and the integration of the knowledge base into the workflow, are vital for the success of a CDSS [10]. The process of selecting, organizing, and looking after the knowledge in the knowledge base makes the knowledgebase efficient and the CDSS successful. The two important facets of the curation process are the method of knowledge acquisition and the method of knowledge representation [11].

One way to build a cost-effective knowledge-based CDSS is to use the commercial knowledgebases that are available. This could reduce the development time of the CDSS and, because of the ample availability of such knowledgebases, it could also be cost effective in terms of price [12]. But such knowledge acquisition might lead to the knowledge-acquisition bottleneck, as certain standards and formalization will have to be maintained for knowledge portability. Such a knowledge-acquisition bottleneck could harm the effectiveness of the CDSS, and freeing the CDSS of the bottleneck makes the knowledge acquisition process complex and difficult [12], [13]. Another bottleneck in the curation of a knowledgebase lies in its maintenance [14]. Together with the creation of the knowledgebase, its verification and constant updating are equally important. The verification and validation of the knowledgebase involve transparency, updatability, adaptability, and learnability [15].

A knowledgebase is judged by its accuracy, its completeness and the quality of its data, so the knowledgebase is constructed in such a manner that these three factors are maximized. The methods of constructing knowledgebases can be classified into four main groups: closed methods, in which the knowledgebase is manually fixed by experts; open methods, in which knowledgebases are curated by volunteers; automated semi-structured methods, in which knowledgebases are procured automatically from semi-structured texts using rules programmed into the system; and automated unstructured methods, which use artificial intelligence algorithms to extract knowledge from unstructured texts [16].

2.2. Knowledge Graphs for knowledge base

Knowledge graphs are knowledgebases in which knowledge is expressed in a graph structure, with nodes representing the concepts or entities and edges between the nodes representing the relationships between those entities or concepts [16, 17, 18, 19, 20, 21, 22, 23]. There are diverse definitions of knowledge graphs, varying according to the purpose for which the knowledge graph is created or according to the knowledge graph model [24]. Although Google is credited with popularizing knowledge graphs from 2012 [24, 25, 26], the term knowledge graph was used in a report in 1973 with a very similar meaning [27], and later in 1982 the term was used to represent textual
concepts using graphs. There have been decades of study in representing knowledge using graphs [28].

The maintenance of knowledge graphs comprises the processes of creation, hosting, curation and deployment. The process of creation can be manual, as in the case of expert systems, or semi-automatic or automatic. Apart from these, there is also a method of annotation that maps the knowledgebase entities to the source without actually keeping the entities in the knowledgebase. Hosting or storage processes use various methods of keeping knowledge in the knowledgebase. The curation processes involve three steps, namely, the assessment of new knowledge, its cleaning and its enrichment, by detecting the source of the knowledge, integrating it with the existing knowledgebase, detecting duplication and correcting entity relations. Once the knowledgebase is ready, it is deployed in an appropriate application [29].

There are various sources of knowledge that can be utilized for the creation of a knowledge base. Textual knowledge, which can be in the form of newspapers, books, scientific articles, social media, emails, web crawls, and so on, is a very rich source of knowledge for building a knowledge graph for CDSS. However, the process of extracting knowledge from text is complex and involves the application of Natural Language Processing (NLP) and Information Extraction (IE) techniques. Curation of the knowledgebase using these techniques may follow a combination of five stages. In the pre-processing stage, the text is analyzed for atomic terms and symbols. Some of the techniques used in the pre-processing stage are Tokenization, Part-of-Speech (POS) tagging, Dependency Parsing and Word Sense Disambiguation (WSD). After the pre-processing stage comes the Named Entity Recognition (NER) stage, in which the various concepts or entities that form the nodes of the graph are identified. The NER stage is followed by the Entity Linking (EL) stage, in which an association is made between the entities identified in the text and the entities in existing knowledge graphs, so that similar entities can be placed side by side. During the Relation Extraction (RE) stage, the relations between the various entities taken from the text are determined using various RE techniques. Finally, the extracted relation is joined to the entities in the last stage of the text processing [22].

2.3. RDF Knowledge Graphs

Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) specification to represent knowledge in the form of triples (subject, predicate, object) containing references, literals or blanks [30], [31]. RDF can be modelled as a directed labelled graph in which the subject and object are represented by the vertices or nodes and the corresponding predicate is represented by a labelled edge [32, 33, 34, 35]. RDF graphs are widely used to represent knowledge graphs, as in the cases of Freebase, Yago, and Linked Data, since the billions of triples scattered across the web can be captured and integrated with the existing knowledge using a powerful abstraction for representing heterogeneous, partial, scant, and potentially noisy knowledge graphs [36], [37]. Unlike property graphs, which are also quite a popular representation of knowledge graphs due to the property and value representation of their edges, the use of metadata in RDF knowledge graphs allows the convenient distributed storage of knowledge. That also makes RDF graphs more flexible than property graphs [37, 38].

RDF knowledge graphs are stored as triples in a Triple store or RDF store. The flexibility of RDF stores is their greatest advantage. Since an RDF knowledge graph has the ability to link any number of entities with their relations, RDF stores are also flexible enough to store them without restriction on size. Moreover, any kind of knowledge can be expressed and stored using RDF knowledge graphs, which allows its extraction and reuse by different applications [39].

In the case of textual knowledge, RDF knowledge graphs are helpful in finding the Thematic Scope or the Topic Model of a text. The topic category or the semantic entities in a set of documents is abstracted using the Thematic Scope or the Topic Model [38, 40]. Since the RDF knowledge graph of a document is represented as a set of triples in which each triple is considered a word or an entity, there is the possibility of detecting word and phrase patterns automatically by clustering the word groups that best characterize a document. Some of the methods of Topic Modelling of RDF knowledge graphs are Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet
Allocation (LDA). The challenges faced in Topic Modelling include the sparseness of the entities; a lack of context, especially when the words used have multiple meanings and the entities are sparse; and the use of unnatural language, such as special characters or unusual casing of letters, which are normally removed while pre-processing the text. Normally, these challenges are overcome by supplementing the text or by modifying the method of Topic Modelling [40, 41, 42]. Entity summarization, which is the task of best summarizing an entity by identifying a limited number of ordered RDF triples, is one of the problems solved using Topic Models of RDF knowledge graphs [41, 42]. Entity summarization has many applications, such as search engines, and is useful for research activities. The existing entity summarization techniques can be classified into generic techniques, which apply to a wide range of domains, applications and users, and specific techniques, which make use of external resources or factors that are effective only in specific domains or applications. While the generic techniques make use of universal features like frequency and centrality, informativeness, and diversity and coverage, the specific techniques make use of domain knowledge, context awareness and personalization [43].

3. Related works

Extracting knowledge from unstructured textual sources has been a challenge. Several studies have been done in order to solve the problem of retrieving meaningful and relevant knowledge from literature, since it is crucial for decision support in systems like the CDSS. Some relevant techniques have been dealt with in the earlier sections on knowledge graphs and RDF knowledge graphs. As part of the proposed method, certain other related techniques have to be explained in order to have a complete picture of the complexity of the problem of curating the knowledgebase for CDSS.

3.1. Question Answering using SPARQL

Question answering (QA) is a process of retrieving knowledge from different sources using a part or the whole expression of a question in natural language. The question in natural language can be interrogative, in which case it will be a factoid type of question and its answer will be a fact from the knowledge source, or the question could be a statement, in which case the answer will be in the form of a list, a definition, a hypothetical statement, a causal remark, a relationship description, a procedural explanation or just a confirmation. The sources of knowledge are normally unstructured and consist of a set of documents, video clips, audio clips, or text files that are given as input to the systems [44]. QA systems make use of Information Retrieval, Information Extraction and Natural Language Processing (NLP) techniques [45].

QA systems have a long history of development, starting with the earliest popular system BASEBALL in 1961 and LUNAR made in 1972. The Text REtrieval Conference (TREC), which began in 1992 for large-scale information retrieval, accelerated interest and growth in QA systems. Other forums and campaigns, such as the Cross Language Evaluation Forum (CLEF) and the NII Test Collections for IR systems (NTCIR) campaigns, also enhanced QA systems. Noteworthy QA systems include IBM Watson and the commercial personal assistant software like Apple's Siri, Amazon's Alexa, Google's Assistant and Microsoft's Cortana [44, 45, 46, 47, 48, 49].

SPARQL is a query language in which a pattern in a query is matched with that in a graph from different sources. The matching is done in three stages. It begins with the pattern matching, involving features like optional parts, union of patterns, nesting, filtering or constraining the values of matches, and selecting the data source to be matched by a pattern. Once these features are applied, the output is computed using standard operators like projection, distinct, order, limit, and offset to modify the values got from pattern matching. Finally, the result of the SPARQL query is given in one of many forms, such as Boolean answers, the pattern matching values, new triples from the values, and resource descriptions. Working with RDF knowledge graphs requires the use of SPARQL [50, 51]. In order to apply the NLP techniques used in QA systems over SPARQL queries, query builders such as QUaTRO2, OptiqueVQS, NITELIGHT, QueryVOWL, Smeagol,
application programming interface (API) for knowledge retrieval and query processing [60].

4.2. Proposed Approach

The proposed approach to knowledgebase curation for CDSS has three stages. In the first stage, new knowledge is extracted from medical literature using an OIE application. The OIE application is used for knowledge extraction since it results in triples that can be integrated into the existing knowledge graph of the CDSS. The triples got from the OIE application on the medical literature are queried for relevant knowledge, using keywords from the CDSS interface, through a QA system. If the query results in new knowledge being found, then those triples in RDF form are added to the existing knowledge graph of the CDSS. A table is maintained with the list of medical literature already checked for knowledge, so that it need not be searched for new knowledge again. The algorithm for the proposed system is shown in Algorithm 1.

For the evaluation of the proposed algorithm, the precision and recall method is used, as it is the typical form of evaluation metric used in information extraction. During the process of information extraction or retrieval, two types of knowledge can be obtained from the knowledge source: knowledge that can be considered important to the application, and knowledge relevant to the query. In the proposed approach, since the query is based on keywords from the QA system, only knowledge relevant to the query is selected rather than all knowledge deemed important from the knowledge source. Therefore, the results of the OIE are restricted to just the knowledge relevant to the query. The application of the evaluation metrics is also bound by the consideration that only the sum total of the relevant knowledge found by the system, in proportion to all the relevant knowledge that can be manually counted in the same medical literature, is calculated, rather than taking into consideration all the important knowledge present in the literature used in the test. The precision evaluation metric is given by the formula in equation (1):

Precision = |{relevant RDF triples} ∩ {retrieved RDF triples}| / |{retrieved RDF triples}|    (1)
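Equation (1) is a set computation over triples, and the analogous recall divides the same intersection by the number of relevant triples instead. A minimal Python sketch, using tuples as stand-in RDF triples (the example triples are hypothetical, not from the paper's test data):

```python
def precision_recall(retrieved, relevant):
    """Precision = |relevant ∩ retrieved| / |retrieved|,
    Recall    = |relevant ∩ retrieved| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # triples both retrieved and relevant
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical (subject, predicate, object) triples for illustration.
retrieved = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "causes", "nausea"),           # not relevant: false positive
}
relevant = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "interactsWith", "warfarin"),  # missed: false negative
}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # → 0.5 0.5
```

Because the denominators differ, a system that retrieves everything scores perfect recall but poor precision, which is why the approach reports both.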
Algorithm 1
Curating the CDSS knowledge base using RDF Knowledge Graph

URI (Uniform Resource Identifier) table U = {u1, u2, …, un}
RDF Knowledge Graph G = {V, E} where V ∈ {v1, v2, …, vn} and E ∈ {e1, e2, …, en}
RDF triple S = {s, p, o} (subject (s), predicate (p), object (o))
Keyword K = {k1, k2, …, kn} taken from the QA system of CDSS

1  Read ui                // URI of a new document
2  If ui ∈ U GOTO Step 12
3  Else use OpenIE to create G
4    For every K
5      use QA system with SPARQL query to find ki in G
6      For every S found
7        If S not in CDSS knowledge base
8          Append S to CDSS knowledge base
9      END For loop
10   END For loop
11 END Else
12 If another document exists GOTO Step 1
13 Else STOP

A contingency matrix can be formed using the relevance of the RDF triples, as shown in Table 1. If a triple retrieved by the system is relevant to the query, then it is a true positive; otherwise it is a false positive. So also, if a relevant triple in the knowledge source is not retrieved by the system, then it is a false negative, and if a triple that is not relevant is ignored by the system, it is a true negative.

Table 1
Contingency matrix according to the relevance of RDF triples

                           Relevant         Not Relevant
RDF triple retrieved       True positive    False positive
RDF triple not retrieved   False negative   True negative

The formula to calculate the precision of the system using the contingency matrix can be given as in equation (2)
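As an illustration of Algorithm 1, the loop below sketches the curation cycle in plain Python. The helper names `extract_triples` and `matches_keyword` are hypothetical stubs standing in for the paper's external components (the OpenIE application of step 3 and the SPARQL-backed QA lookup of step 5); the document URIs and triples are likewise invented for the example.

```python
def extract_triples(uri):
    """Stub for step 3: the OIE application would return the RDF
    triples extracted from the document at this URI."""
    return {("aspirin", "treats", "headache"),
            ("aspirin", "interactsWith", "warfarin")}

def matches_keyword(triple, keyword):
    """Stub for step 5: the QA system would issue a SPARQL query;
    here we simply match the keyword against the triple's terms."""
    return any(keyword.lower() in term.lower() for term in triple)

def curate(documents, keywords, knowledge_base, seen_uris):
    """Steps 1-13: visit each unseen document, extract triples, and
    append the keyword-relevant new ones to the knowledge base."""
    for uri in documents:                      # steps 1 and 12
        if uri in seen_uris:                   # step 2: already checked
            continue
        graph = extract_triples(uri)           # step 3: OpenIE creates G
        for keyword in keywords:               # step 4
            for triple in graph:               # steps 5-6
                if (matches_keyword(triple, keyword)
                        and triple not in knowledge_base):  # step 7
                    knowledge_base.add(triple)              # step 8
        seen_uris.add(uri)  # record the document in the URI table U
    return knowledge_base

kb = curate(["doc://example1"], ["aspirin"], set(), set())
print(len(kb))  # → 2
```

In the paper's setting, G would live in an RDF store and step 5 would issue a real SPARQL query (for example, a `SELECT` over `?s ?p ?o` with a `FILTER` containing the keyword); the stub keeps the sketch self-contained and runnable.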
H. Sack, A Survey on Knowledge Graph Embeddings with Literals: Which model links better Literal-ly?, ArXiv abs/1910.1 (2019).
[60] P. Bonatti, S. Decker, A. Polleres, V. Presutti, Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web (Dagstuhl Seminar 18371), Dagstuhl Reports 8 (2018) 106–110. URL: https://drops.dagstuhl.de/opus/volltexte/2019/10328/. doi:10.4230/DagRep.8.9.29.
[61] K. M. Endris, M.-E. Vidal, D. Graux, Federated Query Processing, Springer International Publishing, Cham, 2020, pp. 73–86. URL: https://doi.org/10.1007/978-3-030-53199-7_5. doi:10.1007/978-3-030-53199-7_5.
[62] M. Saleem, Y. Khan, A. Hasnain, I. Ermilov, A.-C. N. Ngomo, A Fine-grained Evaluation of SPARQL Endpoint Federation Systems, Semantic Web 7 (2016) 493–518.
[63] J. Tang, M. Hong, D. L. Zhang, J. Li, Information Extraction: Methodologies and Applications, in: H. A. do Prad, E. Ferneda (Eds.), Emerging Technologies of Text Mining, IGI Global, 2008, pp. 1–33. URL: http://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/978-1-59904-373-9.ch001. doi:10.4018/978-1-59904-373-9.ch001.
[64] O. Etzioni, M. Banko, S. Soderland, D. S. Weld, Open Information Extraction from the Web, Commun. ACM 51 (2008) 68–74. URL: https://doi.org/10.1145/1409360.1409378. doi:10.1145/1409360.1409378.
[65] S. Ali, H. Mousa, M. Hussien, A Review of Open Information Extraction Techniques, IJCI. International Journal of Computers and Information 6 (2019) 20–28. URL: https://ijci.journals.ekb.eg/article_35099.html. doi:10.21608/ijci.2019.35099.
[66] O. Etzioni, A. Fader, J. Christensen, S. Soderland, M. Mausam, Open Information Extraction: The Second Generation, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume One, IJCAI'11, AAAI Press, 2011, pp. 3–10.
[67] Mausam, Open Information Extraction Systems and Downstream Applications, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, AAAI Press, 2016, pp. 4074–4077.
[68] A. Zhila, E. Yagunova, O. Makarova, Bringing The Output of Open Information Extraction to The RDF/XML Format: A Case Study, 2015.
[69] G. Angeli, M. J. J. Premkumar, C. D. Manning, Leveraging Linguistic Structure For Open Domain Information Extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, The Association for Computer Linguistics, 2015, pp. 344–354. URL: https://doi.org/10.3115/v1/p15-1034. doi:10.3115/v1/p15-1034.
[70] I. Sim, P. Gorman, R. A. Greenes, R. B. Haynes, B. Kaplan, H. Lehmann, P. C. Tang, Clinical decision support systems for the practice of evidence-based medicine, Journal of the American Medical Informatics Association: JAMIA 8 (2001) 527–534. URL: https://pubmed.ncbi.nlm.nih.gov/11687560; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC130063/. doi:10.1136/jamia.2001.0080527.