BmQGen Biomedical Query Generator For Knowledge Discovery

2015 IEEE International Conference on Bioinformatics and Biomedicine (BTBM)
BmQGen: Biomedical Query Generator for

Knowledge Discovery
Feichen Shen, MS \ Hongfang Liu, PhD2, Sunghwan Sohn, PhD2, David W Larson, MD3, Yugyung Lee, PhD!
'CSEE Department, University of Missouri - Kansas City, USA
2Department of Health Sciences Research, 3Department of Surgery,
Mayo Clinic College of Medicine, USA
fsm89@mail.umkc.edu, leeyu@umkc.edu
{Liu.Hongfang, Sohn.Sunghwan, Larson.david2} @mayo.edu
Abstract- A large number of structured and unstructured data (e.g., The Semantic Web provides an important platform for this
EHRs, ontologies, reports) have been introduced by the biomedical activity of biomedical information exchange to take place.
community. Cross-domain data integration is identified as an Increasingly, we are also seeing the emergence of biomedical
important research problem for translational research. From an and scientific collaborations. The first notable effort toward
application perspective, identifying related concepts among medical connecting scattered medical data is to materialize a data
ontologies is an important goal of life science research. It is essential movement by the biomedical community (i.e., the Linked Data,
to analyze how relations are specified to connect concepts in a single bi02RDF, OBO, LinkedCT) [1]. In addition, the Semantic Web
ontology or across multiple ontologies. With the explosion of cross
Health Care and Life Sciences Interest Group (HCLSIG) [2]
domain datasets, it is extremely hard for researchers to discover
aimed for utilizing Semantic Web technologies for innovative
knowledge from current infrastructures of ontologies. It is mainly a
research and collaboration in the health care and life science
lack of the connectivity between the ontologies' cross domains and
domains. In this drive, large amounts of medical data have been
ontologies to unstructured data; even if they have specific biomedical
specified and shared via machine-readable formats, such as the
knowledge in a more general and comprehensive level. Therefore,
Resource Description Framework (RDF) and Ontology Web
there is a need for a mechanism to do semantic partition and query
generation for cross domain biomedical knowledge discovery. In this
Language (OWL). The ontologies were developed to easily
paper, we present such a model that clusters integrated data based on
extend the work of others and share across different domains.
semantic closeness of predicates into different groups and produces These Semantic Web technologies make it easier and more
meaningful queries to fully discover knowledge over a set of practical to integrate, query, and analyze the full scale of
interlinked data sources. We have implemented a prototype of the relevant biomedical and healthcare data, as well as EHRs for
BmQGen system and evaluated the proposed query model based on the cost effective health care systems [3].
predicate oriented clustering with colorectal surgical cohort from the
However, there are significant difficulties to be resolved
Mayo Clinic.
before seamless interoperability and interchange can occur.
Keywords- biomedical cross domain, knowledge discovery, query Existing semantic approaches for linking are promising;
generation, semantic however, they are computationally expensive and impractical
for large scale ontologies since these works may still require
I. INTRODUCTION
human intervention. Furthermore, as the size of data increases
Researchers and health care practitioners prefer to conduct drastically, it is difficult to discover information from
research in an evidence-based practice by using available structured/unstructured data in a single domain or cross
research results when making decisions in health care. The domains, especially for those researchers with expertise in a
main challenge we are facing to support evidence-based specific domain. Thus, we need to reduce human intervention
research is the big data problem associated with large, complex, with the help of process automation in extraction and
and dynamic medical data (e.g., clinical research data, EHRs, integration of semantics from structured or unstructured data.
ontologies). A lot of medical ontologies and tools have been
For extraction of a cohesive structure and semantics from
developed for biomedical research and applications. However,
structured or unstructured data, identification of meaningful
these are not sufficient for integrating or mapping unstructured
linking either together within or across a large number of
data to structured data such as ontologies that will be significant
biomedical ontologies is necessary. vSparQL was introduced to
for evidence-based research. It is mainly due to the lack of the
enable application ontologies to be derived from these large,
ability to integrate data from such a variety of sources and
fragmented sources such as the FMA[4]. A series of queries
extract both a cohesive structure and semantics from structured
might be generated using large ontologies like the NCI
or unstructured data to support evidence-based research.
thesaurus by extracting relevant information that is desired for
In order to extract a cohesive structure and semantics, it is applications [5]. The GLEEN project aims to develop a useful
essential to know what information exists and what meaningful service for simplified, materialized views of complex
relationships among the related domains (e.g., identification of ontologies [6]. However, these works are limited due to the lack
genes responsible for a disease, development of drugs for their of the comprehensive semantic analysis of large sources and the
treatment or prediction of a pathogen's susceptibility to a drug). usage of the knowledge for query processing. We need to
978-1-4673-6799-8/15/$31.00 ©2015 IEEE 1092
Authorized licensed use limited to: Sardar Patel Institute of Technology. Downloaded on April 18,2024 at 08:36:05 UTC from IEEE Xplore. Restrictions apply.
connect related information through a reference ontology that RDF/OWL data model [9] specifies resources (information on
becomes a platform to link together multiple ontologies that the entities and their relationship in the given document) in the
cover a broad range of related information. Advanced form of triples <subject (S)-predicate (P)-object (0», where S
techniques are needed to analyze these larger reference denotes the resource, and P denotes aspects of the resource and
ontologies, rather than simply getting a slice of a reference expresses a relationship between S and O.
ontology and applying it for a query process or decision support
We first explain how to convert unstructured data to a linked
[5]. There is also some related work on using k means and fuzzy
data structure (RDF/OWL). We then present a fuzzy clustering
cmeans for clustering microarray data [7, 8], but neither of them
algorithm to cluster the RDF/OWL graph. Finally, we evaluate
are concerned about the semantics of data nor hierarchical
the proposed model through the queries automatically generated
clustering.
from the clusters.
In this paper, we combined the advantages of Machine
The basic steps of the BmQGen framework are described in
Learning with the added rigor of machine-readable semantics
the remainder of this section.
in extracting information and generating queries applicable for
decision support in clinics. The approach to converting free text Step 1: Feature Selection and Concept Annotation.
to ontologies is based on text classification methods by
identifying related information connected through relationships We extracted various free text reports and then performed
and classifying them according to their relevance. In addition, preprocessing to maintain terms consistently and exclude
queries are evaluated by measuring the content of information, irrelevant terms, using filtering. We then extracted co-occurring
identifying possible extensions or compositions of queries and terms from free text clinical notes, and calculated the inequality
making a comparison with an existing query benchmark. We score to cluster them into a different category (domain). In this
have implemented a prototype of the BmQGen system and study, we used MedTagger [10], which is an open source
evaluated the proposed query model based on the predicate concept detection and normalization tool through open health
oriented clustering with colorectal surgical cohort from the natural language processing. Specifically, this tool identifies
Mayo Clinic. phrases present in MedLex, a general semantic lexicon created
for the clinical domain [11].
II. METHODS
The point-wise mutual information (MI) was used to assess
Graph Analytics
the inequality of the likelihood for given terms [12].
RDF Graph Generator

Inequality(c.o) = logz(N(c.o) * logz N(c:�:)O.Q1_logz N�)
Unstructured Data N is the number of observations, N (c) is the number of cases
Text Documents
having concept c, N(c,o) is the number of cases with a specific
I=l
LJ observation 0 and concept c, and N(o) is the number of cases
with a specific observation o.
As an example, a given input text, "the patient was UCI'd

with plans to have the catherter indwelling," MedTagger
recognized concepts uci and catheter. Then we used the point
wise mutual information measurement to calculate inequality
likelihood between these concepts (uci and catheter) and
ILEUS. We ranked concepts by their inequality score and only
chose the top 60 of them. Then we used OpenIE [13] to get the
triple <S-P-O> from the free text annotated with the concepts
Semi-Structured Data Semantic Clustering
(Ontologies/RDF) recognized by MedTagger. For this purpose, we first got a triple
{the patient, UCI'd, catherter indwelling}. To make the triple
Fig 1: The BmQGen Framework
normalized, we looked up the MedTagger concept in the
The proposed framework BmQGen is based on the dictionary again and converted the triple to {patient, uci,
following steps: (i) convert unstructured (free text) data to catherter}. For each complication case, we did the same work
semantically structured (RDF/OWL) data, (ii) arrange them above in order to generate six ontologies respectively.
into groups in a semantically meaningful manner, and (iii)
generate queries for evidence-based practice. Step 2: RDF Graph Construction.
The main contribution of this paper is knowledge discovery First, the top K concepts will be selected and each sentence
from multiple domains by defining a predicate-oriented model with these concepts in the datasets will be annotated with the
for pattern-based fuzzy clustering, and processing cross-domain selected terms. We then extracted assertions (RDF/OWL triples)
queries automatically generated from clustered patterns. Figure from a given free test dataset considering the top K concepts of
1 summarizes the proposed system architecture. As seen in this each domain and generated RDF/OWL triples respectively. We
figure, N different ontologies can be clustered into M semantic used OpenIE to extract the triples from the free text. The
groups. Semantically related queries will be generated for each OpenIE that is based on TextRunner [14], ReVerb [15] using
of the M clusters crawling RDF/OWL (predicate) graphs. The
1093
(PoS) patterns, extracts a significant relation without any or Vo between predicates Vpi and Vpj' r(Vpi• Vpj) determines the
relation-specific input. OpenIE uses a conditional random field reachability between Vpi and Vpj . I(Vpi' Vpj) indicates the
(CRF) classifier to automatically extract triples representing
closest level of neighbors Vpi and Vpj.
binary relations (Argl, Relation, Arg2) from sentences. The
triples generated from OpenIE will be connected to generate I(Vpi,vpj)
RDF graphs.
1.
Step 3: Assertion Clustering and Query Generation
Our assumption for the predicate neighboring patterns
= lf L, + L" L1
=
d(Vpi' Vpk). L2 d(VPk' Vpj)
=
r(Vpi• Vpk) true. r(Vpk• Vpj) true.

= =
(PNP) is that a predicate plays an important role in sharing

information and connecting entities among heterogeneous data. r(Vpi' Vpj) true
=
For any given domain, the number of unique relations is much

less than the number of concepts. Thus, this is a scalable
approach for mapping domains than the concept-centric Definition 2: Given Vpi and Vpj in a directed RDF schema
approach. A group of terms (subjects) can be connected to a Graph G (V, E). Let E(Vp;) and E(Vpj) denote the number of
group of terms (objects) through a single predicate unlike the
entities (subjects or objects) that are directed connected to Vpi
{
concept-driven approach. From the unit of <subjects-predicate
objects>, a specific context can be discovered from the and Vpj regardless of the direction respectively. ps_l ( Vpi' Vpj)
associated concepts (subjects, objects). From the neighbors of indicates the probability-based similarity between Vpi and Vp{
the predicates, a specific context can be discovered from the
association of predicates and their subjects and objects. We can
o. d(Vp;. Vpj) 0 (Vp; Vpj)
= =
infer/predict missing predicates or missing concepts based on 1. d(Vp;. Vpj) � 00

existing contexts. Therefore, we gave a hypothesis that when a ps- 1 (Vpl. ' V· ) =
(IE(Vp;)1 n IE(V pj)I)2

Pi
graph can be clustered based on PNP patterns, data in the same
---'------'-'- .
.
0 therWlse
Vp ; Vp ;
-
clustered group have a closer relationship than when in different IE( )I * IE( )1
ones.
III. QUERY GENERATION FOR KNOWLEDGE DISCOVERY Definition 3: For any relationship being higher than a level-l
A. Predicate Neighboring Measurements relationship, the probability based similarity pS_h (Vpi' VpJ
defines the probability-based similarity between Vpi and Vp{
First, we need to define the measurement for different
ps_h(Vp;. Vpj)
{
predicate neighboring patterns in tenus of sets of concepts and
relations over the datasets. For this purpose, we propose a
ps_1(Vp;. Vpk) * ps_1(Vpb Vpj). I(Vp;. Vpj) =
2
predicate-oriented algorithm [16] to detenuine the closeness of =
two different predicates. A level of different relationships Max (ps_1(Vpi,vPk)' ps_1(Vpk,vpJ). I(Vpi,vpJ > 2
between predicates piand through a concept W is defined
pj
as follows: Level 1 has four different combinations that are
based on a predicate sharing pattern (subjects (objects) are same The similarity matrix for clustering is generated after the
for both triples) as well as a connection pattern (subject (object) similarity score for all pairs of predicates have been generated.
of one triple is the object (subject) of the other triple). Levels 2 B. Hierarchical Fuzzy C-Means Clustering
and 3 have two various paths, respectively, that are based only
on a predicate connection pattern. The formal definition is We posited that predicate clustering is a required step for
shown in Definition 1. efficient query processing involving the alignment and
integration of ontologies. Here we clarify our approach to
The clustering approach we propose here is based on a efficient query processing and query generation within the
similarity measurement for discovering the predicate above theoretical framework. A query processing consists of a
neighboring patterns inherent in the ontologies. The similarity collection of several relationships between multiple properties.
measurement for the clustering algorithm varies based on Given that properties are more closely related to some
different neighboring levels for each pair of predicates. properties than others, property clustering and partition can be
Basically, we gave lower weights to closer predicates and utilized for efficient query processing - the task of classifying a
higher weights to farther predicates. That is, for any two collection of properties into clusters. The guiding principle is to
predicates in a level-l relationship, a probability based minimize inter-cluster similarity and maximize intra-cluster
similarity score is defined in Definition 2 and
ps_l(Vpi' Vpj) similarity, based on the notion of semantic distance.
any two predicates in a higher-level relationship in Definition
3, respectively. An example of predicate PI's neighboring level To discover the neighboring relation between predicates, we
with others and weighted similarity is also given in Figure 2. used an innovative Hierarchical Fuzzy C-Means (HFCM)
clustering algorithm. We extended a Fuzzy C-Means clustering
Definition 1: Given a directed graph G (V, E), vertices algorithm [17] with a hierarchical approach applying a heuristic
Vs. Vp• Va denote subject, predicate and object in an RDF schema function. In general, we defined a fuzzy hierarchy for clustering
graph, respectively. Let d(Vpi• Vpj) represent the number of V, setting a machine capacity threshold a to denote a certain
1094
number of triplets that each cluster for each level of the datasets and retrieve triple information. We used R computing
hierarchy can hold. The hierarchy can be constructed by environment [20] for our experimental validation. We
applying the Fuzzy C-Means algorithm on each cluster until the implemented a software plugin for query and schema graph
number of triplets for each cluster is less than or equal to the visualization using CytoScape 3.0.2 [21]. In addition, we set the
threshold a or no further change of numbers of elements for machine capacity threshold a as 113 of the total size of the
each cluster can be made. To get the optimal number of clusters, triples.
we used Silhouette Width to evaluate results and chose the one
B. Case Study
with the biggest score.
As a case study, the Mayo Clinic's colorectal surgical
C. Query Generation
reports were used to generate queries by categorizing
From the HFCM clustering step, predicates are grouped into relationships among 6 colorectal postsurgical complications
a number of clusters according to the similarity measurement of (deep vein thrombosis/pulmonary embolism, bleeding, wound
the predicate neighboring patterns (PNP). Each cluster will infections, myocardial infraction, ileus and abscess/leak).
have a graph, called the cluster graph that is composed of the Postsurgical complications are related to general or certain type
predicates either from a single domain or different domains. of surgeries. Clinical data of six complications after colorectal
The Query Generation model will take over and start crawling surgery was attempted, was analyzed to find interesting
the cluster graph to generate the query graph with one domain patterns/associations in a single or multiple complications and
or across domains. Figure 3 shows how the query generation generated comprehensive cross-domain queries that might be
model generates queries. The m clusters generated after running useful in conducting evidence-based practice by using available
HFCM and one of them has three predicates, namely uci from research results. We assume 6 postsurgical complications
the ILEUS domain, transferred from the ABSCESS domain and represent 6 domains. Predicate profiles and association patterns
held from the BLEED domain. As seen in Figure 3, we started are important to link data together with a variety of domains.
to generate a query, first visiting the predicate that had the Our previous paper [12] listed a brief description of each
highest in-degree and out-degree and expanded it with its complication type.
neighbors. In this example, we started with uci and then visited
C. Convert colorectal surgical cohort to ontology
all its neighbors in a descending order of their similarity scores.
As the similarity score between uci and held is 0.6, uci visits SElECT ?P, ?C, ?F Where{
held first. And then uci visits transferred with the similarity 1P, ucl,?C Sub Graph
score 0.5. After all uci's neighbors were visited, we started to

?P, transferred,?F
?F, held, ?C}
- - - - - � .... : :, : - - - - - - - - - - - -
'. .; --:-
select its neighbor that had the highest similarity score. In this
example, held takes turns to reach its neighbor and find
transferred. The algorithm runs in an iterative way until there
is no more neighbors can be visited within the cluster. For
example, if a query is generated with a predicate's similarity
score t �0.6, then the generated graph will include only uci and
held. Finally, a SPARQL query will be generated by replacing
the names of subjects and objects with variables.
".J. HELD
, .
'
1- __ �_____ -. ", _____________
Clustering Predicate Graph

m Clusters
Fig3: Clustering Predicate Graph for Query Generation
levell, (Pl, P2), (Pl, P3) Our case study has 1,980 colorectal surgical cases for 1,416
PS(pl,p2) = i*i= � patients between 2005 and 2013 enrolled at the Mayo Clinic in
PS(pl,p3) = i*i= �
Level2: (P1, P4)
Rochester, MN. We used our previous work, MedTagger to
PS(pl, p4) = PS(pl,p3) PS(p3,p4) • extract concepts from clinical notes written about any
= .!.*!.=..!..
4 4 16 complications within 30 days after surgery in cohort. The top
level3: (Pl, PSI
PS(pl.pS) = Max(PS(pl,p4)' 60 concepts by their inequality score were used for information
PS(p4,pS),PS(pl,p3) •
extraction. The summary of 6 domains is shown in Table I.
3
:_�� :!_S!�_=: � _________________
D. Validation of Clustering Results with Golden Standard
Fig 2: Predicate Neighboring Level and Weighted Similarity We applied a hierarchical Fuzzy C-means (HFCM)
algorithm to the dataset to hierarchically partition each cluster
IV. RESULTS AND DISCUSSION into smaller and more compact groups, which results in 8
clusters. As Table II shows, cluster 1 and 2 contain 24 and 16
A.Specijication
predicates respectively across ABSCESS, BLEED and
The BmQGen prototype system was implemented using INFECTION. Cluster 3 contains 13 predicates respectively
Java in an Eclipse Juno Integrated Development Environment across ABSCESS, BLEED, DVTPE and MI. Cluster 4 has 30
[18]. Apache Jena API [19] was used to parse OWLIRDF predicates related to BLEED, DVTPE and ILEUS. Cluster 5 has
1095
23 predicates in 5 complications except BLEED. Cluster 6 valid. ILEUS has a strong relationship with ABSCESS and
includes 29 predicates with ABSCESS, BLEED, DVTPE and BLEED, verifying that cluster 6 is valid. LEAK and ILEUS are
ILEUS. Cluster 7 has predicates in INFECTION and ILEUS. also strongly associated, verifying that cluster 7 is valid. MI is
Cluster 8 has predicates in ABSCESS, MI and BLEED. Due to strongly related to ABSCESS and BLEED, and we can also
our assumption of the predicate fuzziness, a predicate may verify that cluster 8 is valid. DVTPE does not have a very strong
appear in more than one cluster. relationship with other complications, but this weak correlation
with ILEUS, BLEED and ABSCESS is captured by cluster 3,
TABLE I: colorectal surgical cohort
4, 5 and 6. Therefore, the clusters we generated by the HFCM
follow the same correlation provided by the golden standard
Ontology # of Unique # of Unique
benchmark.
Triple Schema Predicates
ABSCESS BLEED DVTPE ILEUS INFECTI LEAK MI
ABSCESS 286 13
ON
BLEED 177 13 ABSCESS "" :. . . � .. ':. II
BLEED :. 1.0000 0.1039 0.1375 0.0966 0.1456 0.1460

DVTPE 33 10 DVTPE . " .. 0.1039 1.0000 0.0998 0.0439 0.0170 0.0069
ILEUS 811 21 ILEUS 0.1375 0.0998 1.0000 0.1032 0.1881 0.0144
INFECTI 0.0966 0.0439 0.1032 1.0000 0.1246 0.0355

MI 156 12 ON
LEAK . :. 0.1456 0.0170 0.1881 0.1246 1.0000 0.1379
INFECTION 66 14
MI I I : 0.1460 0.0069 0.0144 0.0355 0.1379 1.0000
Total 1529 83
Fig4: Correlation Matrices for Golden Standard
E. Query Generation and Visualization

TABLE II: Clustering Results
The SPARQL queries we generated for each cluster are
Cluster #Predicates Complications Coverage shown in Figure 5. For the predicate graphs across six domains
and their query boundaries correspondingly, we used different
1 24 ABSCESS, BLEED, INFECTION colors like green, black, blue, pink, orange and red to present
2 16 ABSCESS, BLEED, INFECTION ABSCESS, BLEED, DVTPE, ILEUS, INFECTION and MI,
respectively. In addition, we used a rectangle to identify the
3 13 ABSCESS, BLEED, DVTPE, MI query boundary out of the whole predicate graph. Queries 1 to
6 are cross domain queries that are automatically generated
4 30 BLEED, DVTPE, ILEUS
from each cluster. These queries identify the relationships
5 23 ABSCESS, DVTPE, INFECTION, among different post-surgical complications. For example,
ILEUS, MI INFECTION, ABSCESS and BLEED are closely related to
each other through the predicates of wound, bleeding or fever.
6 29 ABSCESS, BLEED, DVTPE, DVTPE and BLEED are usually related through the predicate
ILEUS clot. ABSCESS and ILEUS are related to each other through
7 11 ABSCESS, ILEUS the predicates, abdominal collections and distention. MI and
BLEED are related to each other through anemia and coronary.
8 16 ABSCESS, MI, BLEED
F. Discussion
The dataset we used in this paper is quite noisy and there

We used a golden standard file provided by a medical expert exists some false positive and true negative cases in the
Dr. David W. Larson in the Colon and Rectal Surgery mapping between words/phrases and complications. Therefore,
department at the Mayo Clinics to validate our clustering the inequality metrics generated by the previous work may not
output. This file lists indications of 7 types of complications for represent the true inequality of the likelihood. Although data
1505 patients after colorectal surgery from 2005 to 2013. In our normalization is not the main concern of this research,
study, we considered 6 types of complications by treating clustering results and query outputs will be refined if a more
ABSCESS and LEAK as the same complication (unlike the accurately normalized input is given. To provide a better
golden standard) for the sake of simplicity. A patient may have solution, words/phrases identified with a high inequality of
had no complication or up to 7 complications as the golden likelihood needs to be selected to get better normalized data.
standard specified. We built correlation metrics based on this We also need to try different ontologies and dictionary tools
benchmark to find out which complications showed a strong (e.g., MetaMap, WordNet) to acquire a better annotated dataset
positive correlation. Figure 4 represents the matrices with for query generation.
visualization. The number in red represents the top 3
correlations for each complication. It is obvious to see that the V. CONCLUSION
complications ABCESS, BLEED and INFECTION have a
In this study, we investigated the use of ontology and an
strong correlation, which verifies that cluster 1 and cluster 2 are
unsupervised machine learning approach to integrate
1096
heterogeneous unstructured resource to a semi-structured annotation quality of the datasets. In addition, to improve
knowledge base and achieved a cross domain query generation system performance and scalability, we will utilize a distributed
for knowledge discovery. We proposed the BmQGen and parallel platform to fulfill efficient clustering and query
framework to handle any RDF/OWL datasets from multiple generation in order to handle big data in biomedical research.
domains. To evaluate our framework, we presented a case study
with colorectal postsurgical complications and demonstrated REFERENCES
that the BmQGen framework is capable to extract a cohesive [I] Linking Open Data cloud diagram 2014. by Max Schmachtenberg.
structure and semantics, as well as interesting patterns from Christian Bizer. Anja Jentzsch and Richard Cyganiak. http://lod
cloud.net!
structured/unstructured complication datasets. By using the
colorectal surgical reports from the Mayo Clinic and golden [2] Semantic Web Health Care and Life Sciences Interest Group
[http://www.w3.org/200 llsw/hc\s/]
standard, we successfully validated our clustering results
[3] Luciano, Joanne S., et al. "The Translational Medicine Ontology and
thereby providing solid evidence for automatic query
Knowledge Base: driving personalized medicine by bridging the gap
generation. between bench and bedside." J lJiomed Semanlics 2.Suppl 2 (2011): SI.
[4] Shaw, Marianne, et al. "Generating application ontologies from reference
Infection, Abscess
Ql Infection, Abscess, Bleed Q2 ontologies." AMIAAnnual Symposium Proceedings. Vol. 200S. American
Cluster: 2
Cluster: 1 Medical Informatics Association,200S.
Patient got a Patient
abdominal ". developed fluid [5] Nekrutenko, A., and Taylor, J., 2012. "Next-generation sequencing data
wound results throughout interpretation: enhancing reproducibility and accessibility." Nature Rev.
in bleeding hislber abdomen Genetics 13.9:667-672.
within 30 days within 30 days
after surgery. after surgery. [6] Detwiler. Landon T., Dan Suciu. and James F. Brinkley. "Regular paths
Find the type What other in sparq\: Querying the nci thesaurus." AMIA Annual Symposium
Proceedings. American Medical Informatics Association,200S.
Select?wo�d
of this wound symptom could
\Vbere{ Select ?fevers
this patient get
Patient infection ?wounci
by purulent
\Vbere{ [7] Sturn. Alexander. John Quackenbush, and Zlatko Trajanoski. "Genesis:
Patient drainage ?fevers.
Patient developed ?wound.
drainage ? cluster analysis of microarray data." Bioinformatics IS.I (2002): 207-
Physician explained bleeding. Patient developed fluid.
Patient time ?time.
20S.
Bleeding from ?wound.
Patient time ?time. }Filter (?time<=30) [S] Dembele. Doulaye. and Philippe Kastner. "Fuzzy C-means method for
} Filter (?time<=30) clustering microarray data." Bioinformatics 19.5 (2003): 973-9S0.
[9] Klyne, Graham, and Jeremy J. Carroll. "Resource description framework
Q3 Infection, Abscess Abscess, Ileus (RDF): Concepts and abstract syntax." (2006).
Patient got Cluster: 5 Cluster: 7 [to] Found at: http://ohnlp.org/index.php/MedTagger
infected by a Within 30
[II] Found at: http://www.medlex.com/
wound within 30 days after
days after surgery. surgery, [12] Liu. Hongfang, et al. "Facilitating post-surgical complication detection
Such wound is patient's Ct through sublanguage analysis." AMIA Summits on Translational Science
packed by damp. scan showed
Proceedings 2014 (2014): 77.
CT-scan also abdominal
this
revealed conections. [13] Etzioni, Oren, et al. "Open Information Extraction: The Second
WOWld. ,,,bat Select?wound \Vhat possible Select ?distention Generation."IJCAI. Vol. II. 2011.
wound is it
? \Vbere( symptom \Vbere(
?woWld packed damp this Patient has ct
[14] Yates. Alexander. et al. "Textrunner: open information extraction on the
Patient infection ?wound patient ct revealed collectio ns web." Proceedings of Human Language Technologies: The Annual
has? Patient becomes?distentio n Conference of the North American Chapter of the Association for
Patient time ?time
Computational Linguistics: Demonstrations. Association for
? <=
Computational Linguistics. 2007.
Qs Dvtpe, Bleed Q6 Bleed, Mi [IS] Fader, Anthony, Stephen Soderland, and Oren Etzioni. "Identifying
Cluster: 4 Cluster: 8 relations for open information extraction." Proceedings of the Conference
Within 30 days on Empirical Methods in Natural Language Processing. Association for
after surgery,
... Within 30days Computational Linguistics. 20II.
patient clotting
by some device.
after surgery, [16] Shen, Feichen. "A pervasive framework for real-time activity patterns of
patient held
Such device mobile users." Pervasive Computing and Communication Workshops
anemia. What
placed since
other possible (PerCom Workshops), 2015 IEEE International Conference on. IEEE,
anticoagu lation.
diseas e he/she 2015.
Anticoagu lation Select ?IVC
[17] Bezdek,James c., Robert Ehrlich,and William Full. "FCM: The fuzzy c
could have?
disconti nued to \Vbere( Select ?coronary
GJ bleed. Find Patient clot?rvc Where( means clustering algorithm." Computers & Geosciences to, no. 2: 191-
this device ?IVe placed anticg Patient held anemia
203,19S4.
ang;.cg discontinue bleed Patient bas?coronary
Patient time ?time } Filter Patient time ?time [IS] Eclipse Juno Integrated Development Environment
(?time<=30) } Filter(?t im e<=30) https://www.eclipse.org/juno/
[19] McBride, B. "Jena: Implementing the RDF Model and Syntax
FigS: Cross Complications Queries Specification." SemWeb. 2001.
[20] Found at: The R Project for Statistic http://www.r-projecLorg/
In our future work, we will improve the accuracy of [21] Shannon, Paul, et al. "Cytoscape: a software environment for integrated
information extraction by refining the data normalization models of biomolecular interaction networks." Genome research 13.11
workflow with different ontologies and (2003): 249S-2504.
extraction/normalization tools (e.g., MetaMap [22]). We will [22] Found at: http://metamap.nlm.nih.gov
also get more support from domain experts to improve the
1097

BmQGen Biomedical Query Generator For Knowledge Discovery

Uploaded by

Copyright:

Available Formats

You might also like

BmQGen Biomedical Query Generator For Knowledge Discovery

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

BmQGen Biomedical Query Generator For Knowledge Discovery

Uploaded by

Copyright:

Available Formats

2015 IEEE International Conference on Bioinformatics and Biomedicine (BTBM)

BmQGen: Biomedical Query Generator for

978-1-4673-6799-8/15/$31.00 ©2015 IEEE 1092

RDF Graph Generator

As an example, a given input text, "the patient was UCI'd

r(Vpi• Vpk) true. r(Vpk• Vpj) true.

(PNP) is that a predicate plays an important role in sharing

For any given domain, the number of unique relations is much

infer/predict missing predicates or missing concepts based on 1. d(Vp;. Vpj) � 00

(IE(Vp;)1 n IE(V pj)I)2

score 0.5. After all uci's neighbors were visited, we started to

Clustering Predicate Graph

D. Validation of Clustering Results with Golden Standard

BLEED 177 13 ABSCESS "" :. . . � .. ':. II

BLEED :. 1.0000 0.1039 0.1375 0.0966 0.1456 0.1460

ILEUS 811 21 ILEUS 0.1375 0.0998 1.0000 0.1032 0.1881 0.0144

INFECTI 0.0966 0.0439 0.1032 1.0000 0.1246 0.0355

E. Query Generation and Visualization

The dataset we used in this paper is quite noisy and there

also get more support from domain experts to improve the

You might also like