OverlapLDA: A Generative Approach for Literature-Based Discovery

Juncheng Ding
Department of Computer Science and Engineering
University of North Texas
Denton, TX, USA
junchengding@my.unt.edu

Wei Jin
Department of Computer Science and Engineering
University of North Texas
Denton, TX, USA
wei.jin@unt.edu

Abstract—Literature-based discovery, the process of choosing bridge terms that build plausible connections between "unrelated" terms, aids biomedical research by generating potentially useful hypotheses. The basic idea of previous work is to conduct second- or higher-order association-rule searching and ranking, based either on semantics or on statistical measurements. However, the high-order searching makes current approaches computationally complex. We address this problem by proposing a generative model that treats all the connecting words as one ranked list of words, and we present an inference method for the generative model to learn this ranking. Our model avoids the process of high-order searching altogether. Experiments show that our approach addresses the problem both effectively and efficiently.

Index Terms—text mining, literature-based discovery, generative model, topic model

I. INTRODUCTION

The enormous amount of biomedical literature, as well as the explosive growth in its volume, provides much more biomedical knowledge than before. PubMed¹, the most popular biomedical literature search tool, now holds more than 29 million records, and more than 500,000 records are added to it every year. This explosive quantity of knowledge greatly benefits biomedical researchers. However, the sheer volume also makes it hard for one researcher, or even a group of researchers, to read all related literature before conducting experiments. This conflict gives rise to biomedical text mining [1], which helps extract useful information automatically from biomedical literature.

¹https://www.ncbi.nlm.nih.gov/pubmed/

An essential step of any research is to generate plausible hypotheses. In biomedical research, for example, one needs to know that a particular medicine may cure a specific disease before experimental validation. Fewer but more accurate hypotheses are preferred because the cost of this kind of biomedical experimentation is high. Researchers therefore spend much time on rigorous planning and analysis to generate better hypotheses, which involves intensive human labor. The large quantity of biomedical literature provides potentially better options for automatic hypothesis generation while aggravating the labor-intensive situation. Advancements in text mining and information retrieval alleviate this conflict [2]. Researchers in this domain formulate the problem as Literature-Based Discovery (LBD) [3], which has become one of the foremost tasks in biomedical text mining [4].

LBD aims to find the intermediate terms explaining the potential relations between two terms according to the literature. Researchers define LBD in an Information Retrieval (IR) style, where the input is generally one or two query terms, the corpus is the set of literature, and the output is the set of possible connecting terms indicating particular relationships between the input query terms. LBD varies in the number of input terms: two-term input LBD and one-term input LBD [5]. Two-term input LBD is closed discovery, which aims to find relations between two given terms, while one-term input LBD is open discovery, which targets finding all possible second terms (in the sense of closed discovery) as well as their respective intermediate terms. We focus on closed discovery in this paper because open discovery is an extension of closed discovery.

Generally, current work addresses LBD as second- or higher-order association-rule mining. In the first step, it finds all the terms associated with the first input term, namely the first-order association terms. Second, for each first-order association term, it discovers all the second-order association terms related to that term. Finally, it picks out all the second-order associations ending with the second input term and ranks the associations according to particular criteria. Even though some solutions avoid a second-order or higher-order search (i.e., some approaches only assess the similarities between intermediate terms and the second input term instead of conducting a complete second-order search), the second-order search is essential if we need to rank the intermediate terms properly: one needs to know all second-order associations of an intermediate term before deciding how strong its association with the second input term is among its counterparts.

The high-order searching process incurs a large computational cost, and the cost grows exponentially with the number of intermediate terms. Researchers have proposed several techniques to mitigate the computational burden by eliminating some first-order association terms, such as learning a classifier that predicts whether one term can relate the two input terms [6], or filtering the first-order association terms via an external semantics database [7]. However, these approaches still suffer from the substantial size of the high-order search space.
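To make the size of that search space concrete, the following minimal sketch (our illustration, not code from the paper) enumerates A-B-C paths with a first-order search followed by a second-order search; the toy corpus of term sets and the helper names are hypothetical:

def cooccurring_terms(corpus, term):
    """All terms appearing in at least one document together with `term`."""
    related = set()
    for doc in corpus:                  # each document is a set of terms
        if term in doc:
            related |= doc - {term}
    return related

def closed_discovery_paths(corpus, a, c):
    """Enumerate every A-B-C path: one first-order search for A, then one
    second-order search per candidate B, which is what makes this costly."""
    paths = []
    for b in cooccurring_terms(corpus, a):        # first-order search
        if c in cooccurring_terms(corpus, b):     # second-order search
            paths.append((a, b, c))
    return paths

corpus = [{"fish oil", "blood viscosity"},
          {"blood viscosity", "raynaud disease"},
          {"fish oil", "platelet function"},
          {"platelet function", "raynaud disease"}]
print(closed_discovery_paths(corpus, "fish oil", "raynaud disease"))
# Two paths, via "blood viscosity" and "platelet function" (order may vary).

Each candidate B triggers a fresh pass over the corpus, so the work grows with the first-order fan-out times the second-order fan-out, which is the cost Section V quantifies.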
We propose a generative approach, Overlap Latent Dirichlet Allocation (OverlapLDA), to find the intermediate terms while avoiding high-order searching, thus reducing the computational cost of the LBD process. The assumption behind OverlapLDA is that a particular relationship between two query terms can be represented by a ranked list of the intermediate terms, which is similar to a topic representation in topic models [8]. The relation, represented as a topic, is therefore an overlapping topic between the two query terms (or between two corpora built according to the respective query terms).

We depict the basic idea of OverlapLDA in Figure 1, in which each circle refers to a corpus and each rectangle to one topic in the respective corpus. Each corpus (in orange or in green), containing all the literature related to one input term, has several corpus-specific topics as well as one overlapping topic (in blue) shared with the other corpus.

Fig. 1. The Basic Idea of OverlapLDA

Before learning the overlapping topic, we create one corpus for each of the query terms according to the assumption of OverlapLDA, instead of the single corpus of traditional topic models. Each corpus consists of all the literature containing the respective query term. We also propose an inference method for OverlapLDA based on Gibbs sampling. The model inference can learn the shared topic across the two corpora as well as their corpus-specific topics. The shared topic is thus the ranking of terms bridging the two query terms. Moreover, we skip the learning of the corpus-specific topics, which are useless in LBD, to save computational cost. It is worthwhile to note that we only need to learn the one overlapping topic rather than all the topics, which is quite efficient.

We report our analysis of time complexity and our experiments in this paper. The complexity analysis shows that our model saves computational cost, and the saving becomes more significant as the corpus size grows. We experiment on the gold-standard query pair "Fish Oil" and "Raynaud Disease" because its outcome is easy to verify. The experimental results demonstrate that our model can achieve even better performance. We also test our model with different hyperparameters, showing that it is stable. The analysis and experiments reveal the utility of our generative approach, which reduces the computational cost and performs well in LBD.

Overall, our research addresses the problem of computational time complexity in LBD and finds better bridging terms as well. We summarize the contributions of our work as follows:
• We identify a drawback of current LBD solutions: their high computational cost.
• We propose a generative approach, together with its inference method, that bypasses this drawback.
• We analyze the computational cost and conduct experiments to show the utility of our generative approach.

The remainder of the paper has the following structure. Section 2 presents related work. Section 3 defines several terminologies used in our work. In Section 4, we introduce our model and inference method in detail. Section 5 presents our analysis and experiments. Finally, we conclude our work and give future directions in Section 6.

II. RELATED WORK

In this section, we introduce the current progress of LBD research and the motivation of our work. We also present an introduction to topic models as the basis of our work.

A. Literature-Based Discovery

Literature-Based Discovery (LBD), also known as hypothesis generation, aims to find the possible relationships between two terms as well as the evidence (also in the form of terms). Swanson first introduced LBD in 1986 [3] in an ABC manner, where A and C are the two query terms and B is the desired connecting term. In 1986, Swanson manually discovered the relationship between "Fish Oil" and "Raynaud Disease" through bridging terms such as "Blood Viscosity", "Platelet Function", and "Vascular Reactivity". The discovery was later verified by clinical trials [2]. Subsequent work focused on automatic LBD, since Swanson's process required human reading and expert background knowledge.

As noted in our introduction, current solutions to LBD discover novel relationships via high-order association-rule mining. The association mining process needs to decide whether an association rule (or its respective ending term) is a valid one and how strong the association is (to rank all the associations). According to the evidence used for these decisions, we can categorize current LBD solutions as semantic-based, statistics-based, and mixed. The semantic-based approaches employ semantic information, which usually comes from an independent database, to determine the validity and strength of an association rule. For example, "Fish Oil" bears the semantics of a medicine, and a medicine has a "may affect" semantic relation to physiological indices; therefore, there may exist an association between "Fish Oil" and the term "Blood Viscosity", which bears the semantics of a physiological index. A greater number of semantic relationships indicates a stronger association. The other line of research is statistics-based, which adopts term frequency statistics to make the judgment: the frequent co-occurrence of "Fish Oil" and "Blood Viscosity" indicates their strong mutual relationship. More recent work employs both types of information in the association mining process.

There are also differences in how distinct LBD solutions represent the literature. One line of research considers literature as a number of documents and finds the association
terms via Information Retrieval (IR) [7], [9]; the output of LBD is then a set of terms instead of the documents of classical IR. The other line regards each piece of literature as a graph in which each node is a term and each edge is either a particular statistic or a specific semantic relationship [10]–[12]. The whole collection of literature is then one large graph, and LBD becomes a subgraph generation process.

More recent work takes advantage of machine learning to learn representations of literature or terms, according to either semantics or statistics, and in either the IR or the graph setting. However, for the same reason given in the introduction, these techniques cannot avoid high-order searching if we want to rank all the possible relationships properly.

B. Latent Dirichlet Allocation

Topic models are a family of models built on text documents. In topic models, a corpus contains several documents, and each document is a bag of unordered words. A topic model assumes a topic layer between documents and words: a document is a mixture of topics, and each topic is a mixture of words. After model inference, one obtains the topic representations as mixtures of words and the document representations as mixtures of topics.

Latent Dirichlet Allocation (LDA) is the most representative topic model; it is the first generative one and has bred many advanced topic models that fit many different occasions. The generative characteristic makes LDA, and models based on it, suitable for smaller datasets. Generative models assume particular structures in the data; thus, they converge faster and need less training data before convergence than discriminative models [13]. Therefore, we can theoretically achieve the same or even better performance with the same amount of training data. Moreover, we can exploit sampling-based inference methods for generative models, such as Gibbs sampling for LDA, which are time-efficient (just going through the data several times). These features make LDA an efficient way to extract useful information from smaller datasets.

LDA can be extended to fit different purposes, of which the Author Topic Model (ATM) [14] and Link-LDA [15] are the most typical. ATM adds an author layer between documents and topics in LDA: each document has several authors, and each author has his or her own mixture of topics. ATM thus captures more abundant information than LDA by adding a layer. Link-LDA assumes that each document has two parts, words and references, and a topic determines a distribution of words as well as a distribution of references. Link-LDA extracts more accurate topic information by considering references. There are many further derivatives of the above two models [16], [17]. However, all these models provide only corpus-specific topic descriptions. In LBD, we need a shared topic describing the relationship between two particular terms, so they are not able to address the LBD problem.

III. TERMINOLOGIES

In this section, we explain the basic terminologies of our work. The first part introduces the terminologies related to the LBD data source. The latter part clarifies our model's terminologies.

A. Literature and Medical Subject Headings

Literature: In the context of LBD, the literature refers to a group of papers related to the input terms. Most papers in the biomedical domain are indexed by MEDLINE, which is the largest bibliographic database in the field. The database now contains more than 29 million records (papers or conference abstracts) from different publishing venues and is freely accessible via PubMed. Each record in the database consists of information such as the title, the publication date, the publishing venue, the abstract, the Medical Subject Headings (MeSH), and more. We select records from MEDLINE as our literature, and we introduce how we create the corpora in the methodology section.

Medical Subject Headings (MeSH): MeSH, created and updated by the United States National Library of Medicine (NLM) for indexing journal articles and books in the life sciences, is a human-controlled vocabulary with approximately 26 thousand terms and is updated annually. PubMed indexes each journal article with 10-15 MeSH headings indicating the article's contents. Therefore, we can safely use only the MeSH terms, rather than the full text of an article, to represent it [18], [19]. In the following, we mean the set of MeSH terms attached to an article whenever we mention an article.

B. Corpus, Document and Topic

Corpus: A corpus is a collection of documents. In LBD, a corpus corresponds to a collection of literature. Researchers build corpora in different ways to fit different purposes. For example, one may build a corpus by aggregating all the journal articles discussing "lung cancer" to analyze this disease in detail, and may further split the "lung cancer" corpus into several corpora according to publication date to analyze the temporal change in "lung cancer" research. It is therefore imperative to build corpora properly before conducting research. We introduce our way of building corpora in the next section.

Document: A document is one article of the literature, or one component of a corpus in topic models. Although an article in MEDLINE has many fields, we use the MeSH terms associated with a journal article to represent that article. Therefore, one document in our setting is a set of unordered MeSH terms. For the reasons presented in the introduction of MeSH, this document representation can safely summarize the content of the respective document.

Topic: A topic is a summarization of the content about a subject. We use the term "topic" in the sense of topic models: in this paper, a topic is a distribution over (MeSH) terms summarizing a piece of content, i.e., the probabilities of different terms appearing under that topic. Note that a topic can inherently describe the relationship between two terms because both are expressed in the same format, a ranked list of words. Our proposed model aims to extract a topic describing the relationship between two terms.
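Concretely, these terminologies map onto very simple data structures. The sketch below (ours, with a made-up `medline` list standing in for real MEDLINE records) represents each document as a set of MeSH terms and builds a query-term corpus with an inverted index, anticipating the IR-style corpus construction of Section IV:

from collections import defaultdict

medline = [
    {"Fish Oils", "Blood Viscosity", "Humans"},        # document 0: its MeSH terms only
    {"Raynaud Disease", "Blood Viscosity", "Humans"},  # document 1
    {"Fish Oils", "Platelet Aggregation"},             # document 2
]

# Inverted index: MeSH term -> ids of the documents containing it.
index = defaultdict(set)
for doc_id, mesh_terms in enumerate(medline):
    for term in mesh_terms:
        index[term].add(doc_id)

def build_corpus(query_term):
    """A corpus is every document whose MeSH terms include the query term."""
    return [medline[i] for i in sorted(index[query_term])]

print(build_corpus("Fish Oils"))        # documents 0 and 2
print(build_corpus("Raynaud Disease"))  # document 1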
IV. METHODOLOGY

In this section, we introduce our proposed model as well as its inference. Before that, we describe how we build the corpora and give the symbolic notation of our model.

A. Corpora Construction

As noted in our introduction of corpora, how to build a corpus depends on the purpose of the analysis. We propose a corpora-building method that facilitates topic discovery in LBD. We take the complete MEDLINE (2018) as our data source; the built corpora are subsets of this dataset selected according to the input query terms. As in Figure 1, our model learns the overlap topic between two corpora. Therefore, we build two separate corpora, one per input query term, instead of the single corpus of traditional topic models. Each corpus consists of all the documents in MEDLINE containing the respective query term. The corpus building is therefore a classic IR process: the document set is all the documents in MEDLINE, the input is one query term of the LBD task, and the output is all the documents containing that term. We build an inverted index to expedite the process.

B. Notations

The notations of OverlapLDA are in Table I. We consider two corpora D1 and D2, where dim is the mth document in the ith corpus. There are V unique tokens in the vocabulary V, which contains all the unique terms of D1 and D2; wn is the nth token in V. We assume K1 topics for corpus D1 and K2 topics for corpus D2, where z1k1 is the k1th topic of D1 and z2k2 is the k2th topic of D2. c is the corpus label, and o is a binary value indicating whether the current topic is the overlap topic (1) or not (0).

TABLE I
OVERLAPLDA NOTATIONS

Notation  Meaning
D1        the first corpus (all of its documents)
D1        the number of documents in the first corpus
D2        the second corpus (all of its documents)
D2        the number of documents in the second corpus
dim       the mth document in the ith corpus
c         the corpus label
V         the set of unique tokens in D1 and D2
V         the number of tokens in V
wm,n      the nth token in dim
T1        the topic set of the first corpus
K1        the number of topics in T1
z1k1      the k1th topic in T1
T2        the topic set of the second corpus
K2        the number of topics in T2
z2k2      the k2th topic in T2
o         the overlap label
(In the original typesetting, bold D and V denote the sets, while italic D and V denote their sizes.)

C. OverlapLDA

OverlapLDA is a model of two corpora with a hidden topic structure that has one shared topic as well as two sets of corpus-specific topics. OverlapLDA expresses every topic as a distribution over the vocabulary. In OverlapLDA, each document is a mixture of latent topics, while each topic (either the overlapping topic or a corpus-specific topic) is a distribution over words. The graphical representation of OverlapLDA is shown in Figure 2, and the generative process for all documents d in corpora D1 and D2 is given in Algorithm 1. In these descriptions, α1 and α2 are the K1-dimensional and K2-dimensional Dirichlet document priors, respectively; β is the V-dimensional Dirichlet prior for topics; and γ is the Beta prior of the overlap label. α1, α2, β, and γ are all hyperparameters.

We assume K1 and K2 topics for the first and second corpus, respectively, where K1 and K2 are hyperparameters, and there is exactly one overlap topic. Each topic is a V-dimensional multinomial distribution φ over the V unique words, sampled from the given Dirichlet prior β. For each word in a document, one samples from the overlap topic or a corpus-specific topic according to the overlap label o, which is drawn from the overlap binomial prior σi of the ith corpus. Each set of corpus-specific topics is the same as the topic set of the original LDA.

Fig. 2. Graphical Representation of OverlapLDA

Algorithm 1 The generative process of OverlapLDA
1: procedure TOPICSGENERATION(β, K1, K2)
2:   for all topics z1 ∈ [1, K1] do
3:     sample a word mixture for this topic: φz1 ∼ Dir(β)
4:   for all topics z2 ∈ [1, K2] do
5:     sample a word mixture for this topic: φz2 ∼ Dir(β)
6:   sample a word mixture for the overlap topic: φo ∼ Dir(β)
7: procedure OVERLAPGENERATION(γ)
8:   for all corpora i do
9:     sample an overlap distribution: σi ∼ Beta(γ)
10: procedure DOCUMENTSGENERATION(α1, α2, ξ, Φ, M)
11:   for all documents in corpus c do
12:     sample the document length: N ∼ Poisson(ξ)
13:     sample the topic mixture: θc ∼ Dir(αc)
14:     for all words wn do
15:       sample the overlap label: o ∼ Binomial(σc)
16:       if o == 1 then
17:         sample a word: w ∼ Multinomial(φo)
18:       else
19:         sample a topic: z ∼ Multinomial(θc)
20:         sample a word: w ∼ Multinomial(φzc)
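For readers who prefer runnable code to pseudocode, the following compact simulation is our reading of Algorithm 1, not the authors' implementation; the sizes, the prior values beyond those quoted later in Section IV-D, and the helper name are all illustrative:

import numpy as np

rng = np.random.default_rng(0)
V, K1, K2, xi = 50, 5, 5, 12           # vocabulary size, topic counts, Poisson length mean
beta, gamma = 0.01, (10.0, 1.0)        # smoothing priors quoted in Sec. IV-D
alpha = {1: 0.1, 2: 0.1}               # document Dirichlet priors (illustrative)

phi = {1: rng.dirichlet([beta] * V, K1),   # corpus-1 topics: K1 x V
       2: rng.dirichlet([beta] * V, K2)}   # corpus-2 topics: K2 x V
phi_o = rng.dirichlet([beta] * V)          # the single shared overlap topic
sigma = {1: rng.beta(*gamma), 2: rng.beta(*gamma)}   # per-corpus overlap proportion

def generate_document(c):
    n_words = rng.poisson(xi)                  # document length N ~ Poisson(xi)
    k = K1 if c == 1 else K2
    theta = rng.dirichlet([alpha[c]] * k)      # topic mixture theta_c ~ Dir(alpha_c)
    words = []
    for _ in range(n_words):
        if rng.random() < sigma[c]:            # overlap label o = 1
            words.append(int(rng.choice(V, p=phi_o)))
        else:                                  # o = 0: pick a corpus-specific topic
            z = int(rng.choice(k, p=theta))
            words.append(int(rng.choice(V, p=phi[c][z])))
    return words

corpus1 = [generate_document(1) for _ in range(3)]
corpus2 = [generate_document(2) for _ in range(3)]

Sampling the overlap label before the topic is exactly what lets the inference below learn φo alone and skip the corpus-specific topics.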
Generating Topics: To generate a topic is to generate the word distribution of that topic. OverlapLDA models the two corpora in one generative process: each corpus has its corpus-specific topics, and there is one overlap topic shared by the two corpora. We assume that all the topics share the same Dirichlet prior β, as in Equation 1:

p(w | z) = p(φzc | β) = Dir(φzc | β)    (1)

Generating the Overlap Distribution: Before generating each word in a document, we assume a selection mechanism deciding whether the word is sampled from the overlap topic or from a corpus-specific topic. This selection mechanism is the overlap distribution, a binomial one. For each corpus, one samples the overlap distribution independently from a Beta distribution, as in Equation 2:

p(o | c) = p(σc | γ) = Beta(σc | γ)    (2)

Generating Documents: The first step in generating a document is to sample its length, i.e., the number of tokens appearing in the document. Meanwhile, one samples a topic mixture θc for that document. Before generating each word, one samples an overlap label o indicating whether the topic is the overlap topic, according to p(o | c) in Equation 2. The overlap label o = 1 corresponds to sampling the word from the overlap topic φo = p(w | zo), while o = 0 corresponds to first sampling a corpus-specific topic. Finally, one samples the word according to the selected topic.

Unlike LDA, OverlapLDA addresses two corpora and can extract their commonality. As in Figure 2, OverlapLDA assumes two kinds of topics in a document; in the generation of a word, one first decides which kind of topic to use and then samples the word according to the selected topic. OverlapLDA infers the overlap topic of the two corpora as well as the corpus-specific topics.

The most important contribution of OverlapLDA is that it assumes two corpora with an overlap and provides an explicit expression of the overlap topic. In the next part, we describe the inference of OverlapLDA based on Gibbs sampling.

D. Parameter Inference

We propose to use Gibbs sampling for inference, which is efficient (just going through the corpus several times) [20]. There are three sets of parameters to be estimated: the D1 + D2 documents' topic distributions θ, the K1 + K2 + 1 topics' respective word distributions, and the overlap distribution σc of each corpus. When doing Gibbs sampling, we build a Markov chain that converges to the posterior distribution over z1 and z2 and then use the samples to estimate the parameters. The transitional probabilities between two successive states are the probabilities of z1 and z2 conditioned on all other variables. After integrating out θ1, θ2, and φ, the transitional probabilities for a document are given in Equations 3 and 4:

p(on | o¬n, w¬n) ∝ (n^d_{on} + γ_{on}) / (n^d_· + γ0 + γ1) × (n^{on}_{wn} + β) / (n^{on}_· + V β)    (3)

p(zin | c = i, o = 0, z_{i,¬n}, w_{i,¬n}) ∝ (n^{d,c}_{zin} + αi) × (n^{zin}_{win} + β) / (n^{zin}_· + V β)    (4)

where zin is the topic assignment of the nth word in a document from corpus i, n^k_m is the count of m within the range of k, and x¬i denotes all x except i.

Before sampling, we randomly and independently assign a topic number and an overlap label to every word. In each sampling pass, for every word in every document, we sample whether it belongs to the overlapping topic or to a corpus-specific topic according to Equation 3. If the word belongs to a corpus-specific topic, we then sample the corpus-specific topic number using Equation 4. We go through every word in the two corpora during each iteration, and the assignments converge after several iterations. We can then estimate the overlap topic representation (its distribution over words or concepts) using Equation 5:

φo = (n^{o=1}_{w zo} + β) / (n^{o=1}_· + V β)    (5)

where φo gives the words' probabilities of appearing in the overlap topic. We can also infer the other two sets of parameters, but they are not useful in the LBD application, so we skip their inference. For the hyperparameters, we use smoothing values following previous literature [14]: βi = 0.01 for i ∈ [0, V − 1], and γ = [10, 1] by default. The number of iterations is 30 in the following sampling processes, which is enough for convergence.

In LBD, what we need is the overlapping topic representation. It is therefore worth noting that there is no need to sample the corpus-specific topics, which makes OverlapLDA even more efficient. In this regard, the hyperparameters K1, K2, α1, and α2 make no difference in the inference process.
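The entire inference loop fits in a few dozen lines. Below is a minimal collapsed-Gibbs sketch of Equations 3-5 under our reading of the count notation; the data are synthetic, and the simplifications (for example, resampling a word's joint overlap-label/topic state in one step) are ours rather than the paper's:

import numpy as np

rng = np.random.default_rng(1)

def infer_overlap_topic(corpora, V, K=(5, 5), alpha=0.1, beta=0.01,
                        gamma=(10.0, 1.0), iters=30):
    """corpora[i]: list of documents; each document: list of word ids in [0, V)."""
    no_w = np.zeros(V)                                   # overlap-topic word counts
    nz_w = [np.zeros((K[0], V)), np.zeros((K[1], V))]    # corpus-specific topic-word counts
    nd_z = [[np.zeros(K[i]) for _ in corpora[i]] for i in (0, 1)]
    nd_o = [[np.zeros(2) for _ in corpora[i]] for i in (0, 1)]
    state = {}

    for i in (0, 1):                                     # random initialisation
        for d, doc in enumerate(corpora[i]):
            for n, w in enumerate(doc):
                o, z = int(rng.integers(2)), int(rng.integers(K[i]))
                state[(i, d, n)] = (o, z)
                nd_o[i][d][o] += 1
                if o == 1:
                    no_w[w] += 1
                else:
                    nz_w[i][z, w] += 1
                    nd_z[i][d][z] += 1

    for _ in range(iters):
        for (i, d, n), (o, z) in state.items():
            w = corpora[i][d][n]
            nd_o[i][d][o] -= 1                           # remove the current assignment
            if o == 1:
                no_w[w] -= 1
            else:
                nz_w[i][z, w] -= 1
                nd_z[i][d][z] -= 1
            # overlap branch of Eq. 3 ...
            p_over = (nd_o[i][d][1] + gamma[0]) \
                * (no_w[w] + beta) / (no_w.sum() + V * beta)
            # ... and the K_i corpus-specific branches (Eq. 3 times Eq. 4)
            p_topic = (nd_o[i][d][0] + gamma[1]) \
                * (nd_z[i][d] + alpha) / (nd_z[i][d].sum() + K[i] * alpha) \
                * (nz_w[i][:, w] + beta) / (nz_w[i].sum(axis=1) + V * beta)
            probs = np.append(p_topic, p_over)
            pick = int(rng.choice(len(probs), p=probs / probs.sum()))
            o, z = (1, 0) if pick == K[i] else (0, pick)
            nd_o[i][d][o] += 1                           # add the new assignment back
            if o == 1:
                no_w[w] += 1
            else:
                nz_w[i][z, w] += 1
                nd_z[i][d][z] += 1
            state[(i, d, n)] = (o, z)

    return (no_w + beta) / (no_w.sum() + V * beta)       # Eq. 5: the overlap topic phi_o

corpus1 = [[0, 1, 2], [1, 2, 0], [0, 2, 1]]
corpus2 = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
phi_o = infer_overlap_topic((corpus1, corpus2), V=4)
print(np.argsort(phi_o)[::-1])   # ids ranked by overlap probability; the shared
                                 # ids 1 and 2 should come out on top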
V. ANALYSIS AND EXPERIMENTS

In this section, we present our analysis and experiments to demonstrate the effectiveness and efficiency of OverlapLDA in LBD. The query in the following experiments is the "Fish Oil" and "Raynaud Disease" pair proposed in Swanson's pioneering work [3]. We first present our way of preprocessing the literature (all documents in MEDLINE 2018) as well as the corpora construction. We then compare the computational cost of our method with that of others. Next, we present the top intermediate terms discovered by OverlapLDA and explore whether these terms are meaningful in expressing the connection between "Fish Oil" and "Raynaud Disease". We also quantitatively compare our results with others' work. In the last part of this section, we show our model's sensitivity to its hyperparameters, since several hyperparameters must be set before model inference.

A. Preprocessing

We consider words that appear too frequently or express too broad information as useless in LBD. Before building any task-specific corpora, we maintain a filtering list containing such words, and the final list of connection terms excludes all the words on this list. However, we keep these words in OverlapLDA itself because they may still carry useful information.

We define the too-frequent and too-broad terms following previous practice [7]. Since the occurrences of the terms do not follow a normal distribution [21], we compute an upper bound (19,161 in our data) from the median; any term appearing more frequently than the upper bound is considered too frequent. We also take advantage of the MeSH tree, which reflects the hierarchical information between MeSH terms, and consider the terms in its first three layers too broad.

B. Corpora Building

We choose "Fish Oil" and "Raynaud Disease" as the input query terms in our experiments. Swanson discovered in 1986, via manual reading, that there might exist hidden associations between these two terms, even though there was no relationship between the two according to the literature of the time. The reason was that "Fish Oil" could affect "Blood Viscosity", "Platelet Function", and "Vascular Reactivity", while "Raynaud Disease" also produces effects on these blood parameters [3].

In the context of LBD, the input query terms are "Fish Oil" and "Raynaud Disease", and the output connection terms can be "Blood Viscosity", "Platelet Function", and "Vascular Reactivity". It is worthwhile to note that LBD uses only the literature in which the two input terms are not yet connected, because LBD aims to find "hidden" or "implicit" knowledge. In this regard, we only use the literature published before 1986.

The corpus construction is thus straightforward. As in the last section, each piece of literature is a set of unordered MeSH terms. We collect all the literature in MEDLINE published before 1986 containing either "Fish Oil" (FO) or "Raynaud Disease" (RD). The collected documents containing FO make up the first corpus, while the second corpus consists of the collected documents containing RD. Table II presents the detailed statistics of the two corpora.

TABLE II
"FISH OIL" AND "RAYNAUD DISEASE" CORPORA

Item                           Number
FO Corpus Documents            531
FO Corpus Unique Terms         1202
FO Corpus Avg. Doc. Length     10.63
RD Corpus Documents            2625
RD Corpus Unique Terms         2509
RD Corpus Avg. Doc. Length     10.37
Unique Terms (both corpora)    3204
Unique Terms After Filtering   1314

C. Computational Cost

LBD aims to discover one intermediate term associated with both input terms or, in more complex settings, several intermediate terms serially associated in a chain. Although the number of possible paths between two query terms is bounded (by the number of terms in the vocabulary), current approaches need to evaluate all the possible paths to rank them appropriately. In this regard, current LBD approaches, whether for open or closed discovery, solve the problem in an open-discovery manner: they first find all the possible "first-order" intermediate terms with respect to one query term, then discover all the "second-order" intermediate terms for each "first-order" term, next assign a strength score to each possible second-order association, and finally pick out the paths ending with the second query term, together with their scores, and rank them. These time-costing approaches search an exponential solution space.

We define the search between two directly related terms as one operation; the number of operations is then given by Equation 6:

Operations = Σ_{Nf} Nfs = Nf × N̄s    (6)

where Nf is the number of "first-order" terms related to one query term and Nfs is the number of "second-order" terms related to the respective "first-order" term. We use the number of terms related to the other query term to approximate the averaged number of "second-order" terms N̄s.

In our proposed method, we instead assume a generative model over the two corpora and perform model inference, which outputs the ranked intermediate terms as an overlapping "topic" between the two corpora. The inference process samples a label for each term occurrence in the corpora and repeats for several iterations until convergence. The number of sampling operations is given by Equation 7:

Operations = (Nc1 + Nc2) × Niter    (7)

where Nc1 is the number of terms in the first corpus, Nc2 is the number of terms in the second corpus, and Niter is the number of iterations of OverlapLDA inference (30 by default).

Based on the above definitions, we estimate in Table III the number of operations needed to find the intermediate terms for the FO and RD pair.

TABLE III
NUMBER OF OPERATIONS IN DIFFERENT APPROACHES

Approaches   Number of Operations
Current      3,015,818
OverlapLDA   245,427

We can observe from Table III that OverlapLDA saves 91.8% of the computation compared with current approaches. We should also note that each operation in current approaches, whether statistics-based or semantics-based, involves at least one database query, which is much more time-consuming than one sampling operation in OverlapLDA.

More generally, we can observe from MEDLINE that typical values for Nf and N̄s are both around 5,000, while Nc1 and Nc2 are usually around 50,000. Under these values, OverlapLDA still decreases the number of operations by around 88 percent. The above analysis shows the efficiency of our proposed OverlapLDA approach: we can roughly save 90 percent of the operations for LBD.
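As a quick check of these magnitudes, the arithmetic of Equations 6 and 7 under the "typical" MEDLINE numbers above is easy to reproduce (a worked example of ours, not an additional experiment):

n_f = n_s_bar = 5_000        # first-order terms and average second-order fan-out
n_c1 = n_c2 = 50_000         # token counts of the two corpora
n_iter = 30                  # default number of Gibbs iterations (Sec. IV-D)

ops_search = n_f * n_s_bar             # Eq. 6: exhaustive two-step search
ops_overlap = (n_c1 + n_c2) * n_iter   # Eq. 7: OverlapLDA sampling
print(ops_search, ops_overlap)         # 25000000 3000000
print(f"saving: {1 - ops_overlap / ops_search:.0%}")   # saving: 88%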
D. Qualitative Analysis

In this part, we list the top intermediate terms for "Fish Oil" and "Raynaud Disease" and check whether they are valid. After the inference, we obtain the overlap topic as a ranked list of terms. Table IV lists the top 20 terms after removing the too-frequent and too-broad terms. The second column gives PMID codes of articles in PubMed, which are pieces of evidence that the intermediate terms can relate the two input query terms. Most of the top retrieved terms have direct evidence for specific connections. For instance, FO helps lower "Blood Viscosity", and high "Blood Viscosity" exacerbates RD, as in the paper indexed by PMID 3797213 [3].

TABLE IV
TOP-RANKING INTERMEDIATE TERMS BY OVERLAPLDA

Terms                        Evidence (PMID)
Sympathectomy                8095453
Thromboangiitis Obliterans   3797213
Arteriosclerosis             2285650
Arteriosclerosis Obliterans  2285650
Plethysmography              2536517
Cod Liver Oil                3797213
Antibodies, Antinuclear      17352512
Cryoglobulins                21117349
Thrombosis                   3797213
Arteritis                    10096117
Platelet Aggregation         3797213
Sjogren's Syndrome           3797213
Gangrene                     21117349
Blood Viscosity              3797213
Vasomotor System             10700429
Vasoconstriction             3797213
Purpura                      7842531
Subclavian Artery            2536517
Pulse                        11416699
Nifedipine                   3797213

However, we find that two terms in the top 20, "Sympathectomy" and "Sjogren's Syndrome", lack direct evidence, even though they appear together with FO and RD in the same articles. After a careful literature review, they indicate particular connections as well. "Sympathectomy" and FO both relieve hypertension, and "Sympathectomy" is one cure for RD; in this regard, "Sympathectomy" suggests that FO may also cure or relieve RD. As for "Sjogren's Syndrome", FO is one cure for this syndrome [22], while the syndrome has a tight relationship with RD [23]; "Sjogren's Syndrome" thus also suggests that FO may cure RD.

E. Quantitative Comparison

Many approaches rank the retrieved intermediate terms, and these ranks can serve as one measurement of LBD performance. We choose the more representative ones to compare with ours, including a semantic-based approach, a statistics-based approach, and mixed approaches [7], [9]. We quantitatively compare the ranks of several widely accepted intermediate terms.

The statistics-based approaches devise statistical measurements based on term frequencies and term co-occurrence statistics in the corpus, ranging from "support" and "confidence" in association mining to more advanced and sophisticated statistics [1]. The Association Rule based algorithm (AR) [9] and the Null-invariant Correlation Measures (NCM) [7], two of the comparative approaches in this paper, fall into the category of statistics-based approaches. The semantic-based approaches take advantage of an external semantic database² to see whether there is a connection between two terms and accumulate all possible connections for each term pair. However, it is hard for semantic-based approaches to rank all the plausible associations; their output is usually a list of associations with evidence [10]–[12]. We therefore take the mixed approaches, which involve semantic information in statistics-based approaches, into our comparison: the Semantic Association Rule based algorithm (SAR) [9] improves AR by considering semantics, while the Null-invariant Correlation Measures with Semantic Support (NCMwSP) [7] adds semantic support to upgrade NCM.

²https://semanticnetwork.nlm.nih.gov/

Table V lists the five groups of ranks. "N/A" means that the respective paper provides no rank for the specific term. The numbers marked with an asterisk are the best ranks for the particular terms. By replacing each "N/A" with the averaged score of the same row, we obtain the averaged ranks in the last row.

We can observe from Table V that OverlapLDA achieves the best ranks most of the time, which demonstrates the effectiveness of OverlapLDA in LBD. For the term "Platelet Aggregation", there is even a difference of more than 200 between the rank by OverlapLDA and those by NCM and NCMwSP. OverlapLDA also achieves a better averaged rank than the others. These results indicate that OverlapLDA retrieves better intermediate terms than the other approaches. Moreover, the exponential searching nature of the other approaches makes them computationally complex compared with OverlapLDA.

Besides, it is worth noting that involving semantics always improves the ranking, for both AR and NCM. Even though OverlapLDA shows better performance than the approaches with semantic information, OverlapLDA has the potential to improve further if semantic information is added.

F. Hyperparameters Sensitivity

Since there are several pre-defined hyperparameters in OverlapLDA, we experiment to assess OverlapLDA's sensitivity to them. The topic prior β has little impact on LDA [24]. Therefore, we vary γ and measure OverlapLDA's performance in LBD.

γ, a 2 × 1 vector, is the Beta prior indicating the proportion of the overlapping topic among all the topics in one corpus. We fix the second entry of γ at 1 and set the first entry to one of [0.05, 0.1, 0.5, 1, 5, 10, 15, 20]. We run OverlapLDA 20 times for each γ to eliminate the variation caused by random initialization. The measurement of OverlapLDA's performance is the averaged rank of the terms in Table V.

Figure 3 shows the averaged ranks, with one standard deviation, of the terms in Table V. We can observe that the ranks rarely change given different γ. This experiment shows the stability of OverlapLDA with respect to its hyperparameters.
TABLE V
THE RANKS OF SEVERAL TERMS CONNECTING "FISH OIL" AND "RAYNAUD DISEASE" IN DIFFERENT APPROACHES (* = BEST RANK, BOLD IN THE ORIGINAL)

Terms                 AR    SAR   NCM    NCMwSP  OverlapLDA
Blood Viscosity       27    19    19     15.7    13*
Vasoconstriction      N/A   N/A   21     18.7    15*
Epoprostenol          N/A   N/A   35.7   22.3*   28
Thrombosis            N/A   N/A   26.3   33      8*
Platelet Aggregation  33    24    223.3  205     10*
Arteriosclerosis      N/A   N/A   26.7   12.3    2*
Blood Platelets       32    23*   N/A    N/A     23*
Prostaglandins E      44    37    N/A    N/A     34*
Average               27.0  22.8  52.0   46.4    16.6*
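The averaging rule, as we read it, can be checked mechanically: fill each "N/A" with the mean of the available ranks in its row, then average each column. The sketch below is ours; it reproduces the printed NCM, NCMwSP, and OverlapLDA averages exactly, while AR and SAR come out slightly higher (27.4 and 23.2), suggesting the cited papers may have used a slightly different fill for their own missing entries:

ranks = {                       # rows of Table V: AR, SAR, NCM, NCMwSP, OverlapLDA
    "Blood Viscosity":      [27,   19,   19,    15.7, 13],
    "Vasoconstriction":     [None, None, 21,    18.7, 15],
    "Epoprostenol":         [None, None, 35.7,  22.3, 28],
    "Thrombosis":           [None, None, 26.3,  33,   8],
    "Platelet Aggregation": [33,   24,   223.3, 205,  10],
    "Arteriosclerosis":     [None, None, 26.7,  12.3, 2],
    "Blood Platelets":      [32,   23,   None,  None, 23],
    "Prostaglandins E":     [44,   37,   None,  None, 34],
}

filled = []
for row in ranks.values():
    present = [r for r in row if r is not None]
    row_mean = sum(present) / len(present)        # the row's average score
    filled.append([row_mean if r is None else r for r in row])

for j, method in enumerate(["AR", "SAR", "NCM", "NCMwSP", "OverlapLDA"]):
    column = [row[j] for row in filled]
    print(method, round(sum(column) / len(column), 1))
# NCM 52.0, NCMwSP 46.4, and OverlapLDA 16.6 match the table's Average row.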

Fig. 3. Averaged Ranks Given Different γ in OverlapLDA (x-axis: first entry of γ, from 0.0 to 20.0; y-axis: averaged ranks, from 10.0 to 25.0)

VI. CONCLUSION AND FUTURE WORK

This paper identifies one drawback of current LBD solutions: they search an exponential space to find and rank the plausible connections between two terms across a large number of biomedical articles. We propose OverlapLDA to address this problem in a generative way, by assuming an overlapping topic between two corpora that represent the two query terms. OverlapLDA inference learns this topic in the form of a ranked list of connecting terms. Experiments and analysis show the efficiency and effectiveness of OverlapLDA. Future directions include introducing semantics into OverlapLDA, which has great potential to improve the model's performance.

REFERENCES

[1] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016.
[2] A. M. Cohen and W. R. Hersh, "A survey of current work in biomedical text mining," Briefings in Bioinformatics, vol. 6, no. 1, pp. 57–71, 2005.
[3] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.
[4] V. Gopalakrishnan, K. Jha, W. Jin, and A. Zhang, "A survey on literature based discovery approaches in biomedical domain," Journal of Biomedical Informatics, p. 103141, 2019.
[5] M. Weeber, H. Klein, A. R. Aronson, J. G. Mork, L. De Jong-van Den Berg, and R. Vos, "Text-based discovery in biomedicine: the architecture of the DAD-system," in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2000, p. 903.
[6] V. Gopalakrishnan, K. Jha, G. Xun, H. Q. Ngo, and A. Zhang, "Towards self-learning based hypotheses generation in biomedical text domain," Bioinformatics, vol. 34, no. 12, pp. 2103–2115, 2017.
[7] K. Jha and W. Jin, "Mining novel knowledge from biomedical literature using statistical measures and domain knowledge," in Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM, 2016, pp. 317–326.
[8] D. M. Blei, "Probabilistic topic models," Communications of the ACM, vol. 55, no. 4, pp. 77–84, 2012.
[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207–223, 2010.
[10] D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, and T. C. Rindflesch, "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238–251, 2013.
[11] D. Cameron, R. Kavuluru, T. C. Rindflesch, A. P. Sheth, K. Thirunarayan, and O. Bodenreider, "Context-driven automatic subgraph creation for literature-based discovery," Journal of Biomedical Informatics, vol. 54, pp. 141–157, 2015.
[12] V. Gopalakrishnan, K. Jha, A. Zhang, and W. Jin, "Generating hypothesis: Using global and local features in graph to discover new knowledge from medical literature," in Proceedings of the 8th International Conference on Bioinformatics and Computational Biology, BICOB, 2016, pp. 23–30.
[13] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," in Advances in Neural Information Processing Systems, 2002, pp. 841–848.
[14] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, "The author-topic model for authors and documents," in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2004, pp. 487–494.
[15] E. Erosheva, S. Fienberg, and J. Lafferty, "Mixed-membership models of scientific publications," Proceedings of the National Academy of Sciences, vol. 101, no. suppl 1, pp. 5220–5227, 2004.
[16] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "ArnetMiner: Extraction and mining of academic social networks," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '08. New York, NY, USA: ACM, 2008, pp. 990–998. [Online]. Available: http://doi.acm.org/10.1145/1401890.1402008
[17] H. Deng, B. Zhao, and J. Han, "Collective topic modeling for heterogeneous networks," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '11. New York, NY, USA: ACM, 2011, pp. 1109–1110. [Online]. Available: http://doi.acm.org/10.1145/2009916.2010073
[18] D. R. Swanson, N. R. Smalheiser, and V. I. Torvik, "Ranking indirect connections in literature-based discovery: The role of medical subject headings," Journal of the American Society for Information Science and Technology, vol. 57, no. 11, pp. 1427–1439, 2006.
[19] S. Bhattacharya, V. Ha-Thuc, and P. Srinivasan, "MeSH: a window into full text for document summarization," Bioinformatics, vol. 27, no. 13, pp. i120–i128, 2011.
[20] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, no. suppl 1, pp. 5228–5235, 2004.
[21] A. Kastrin, T. C. Rindflesch, and D. Hristovski, "Large-scale structure of a network of co-occurring MeSH terms: statistical analysis of macroscopic properties," PLoS ONE, vol. 9, no. 7, p. e102188, 2014.
[22] D. F. Horrobin, "Essential fatty acid metabolism in diseases of connective tissue with special reference to scleroderma and to Sjogren's syndrome," Medical Hypotheses, vol. 14, no. 3, pp. 233–247, 1984.
[23] M. García-Carrasco, A. Sisó, M. Ramos-Casals, J. Rosas, G. De la Red, V. Gil, S. Lasterra, R. Cervera, J. Font, and M. Ingelmo, "Raynaud's phenomenon in primary Sjögren's syndrome: prevalence and clinical characteristics in a series of 320 patients," The Journal of Rheumatology, vol. 29, no. 4, pp. 726–730, 2002.
[24] H. M. Wallach, D. M. Mimno, and A. McCallum, "Rethinking LDA: Why priors matter," in Advances in Neural Information Processing Systems, 2009, pp. 1973–1981.