
A Grid Infrastructure for Mixed Bioinformatics Data and Text Mining

Moustafa Ghanem, Alexandros Chortaras, Yike Guo
Imperial College London
{mmg, yg, ac901}@doc.ic.ac.uk

Anthony Rowe, Jon Ratcliffe
InforSense Ltd
{asrowe, j.ratcliffe}@inforsense.com

Abstract

We present a distributed infrastructure for mixed data and text mining. Our approach is based on extending the Discovery Net infrastructure, a grid-computing environment for knowledge discovery, to allow end users to construct complex distributed text and data mining workflows. We describe our architecture, data model and visual programming approach, and present a number of mixed data and text mining examples over biological data.

1. Introduction & Motivation

The automated analysis of scientific text documents can play an important role in streamlining scientific research by allowing users to analyse quickly the content of large document collections and to extract automatically from them important, and possibly unknown, facts and information. A large number of successful text mining studies have been conducted in the past few years in the area of Bioinformatics to investigate the utility of such approaches. For example, various studies have been conducted on the design and use of systems that identify and extract biological entities (genes, proteins, chemical compounds, diseases, etc.) mentioned in the literature, e.g. [3, 12, 15]. Other studies have focused on extracting the relationships between such biological entities (e.g. protein-protein interactions or gene-disease correlations) mentioned in the literature, e.g. [1, 9, 15]. Studies such as those reported in [11] aimed to investigate how text mining approaches can be used to validate the results of analytical gene expression methods in identifying significant gene groupings.

The aim of the work presented here is to design a flexible infrastructure that allows end users (e.g. biologists) to construct their own text mining applications easily. To motivate our approach, we highlight two complementary forms of interaction between data and text mining that are commonly used in bioinformatics.

The first form is where the automated analysis of text documents makes use of traditional data mining techniques. In this form, text mining proceeds by using a generic pipeline that takes in text documents, performs any of a number of text pre-processing operations (cleaning, NLP parsing, regular expression operations, etc.), followed by coding the features of the documents in vector form, where counts are recorded for user-defined features such as keywords, patterns, gene names, disease names, etc. This vector form is then amenable to the application of traditional data mining techniques such as classification, clustering, PCA, association analysis, etc.
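To make this coding step concrete, the following minimal Python sketch (illustrative only, not part of Discovery Net) reduces each document to a vector of counts over a user-defined feature list; the toy feature list and the simple tokenisation are assumptions made for the example.

from collections import Counter
import re

# Hypothetical, simplified illustration of the coding step described above:
# each document is reduced to a vector of counts over user-defined features.
FEATURES = ["insulin", "diabetes", "receptor", "kinase"]  # e.g. keywords/gene names

def to_feature_vector(text, features=FEATURES):
    """Count how often each user-defined feature occurs in a document."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [tokens[f] for f in features]

docs = ["Insulin receptor signalling in diabetes ...",
        "A kinase cascade independent of insulin ..."]
vectors = [to_feature_vector(d) for d in docs]
# 'vectors' can now be fed to clustering, classification, PCA, etc.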
The second form of interaction is where text mining is used to validate and interpret the results of bioinformatics data mining procedures. For example, consider a scientist engaged in the analysis of numerical experimental data using traditional data mining techniques, e.g. the analysis of microarray gene expression data using data clustering. The result of this clustering analysis may be a group of co-regulated genes (i.e. genes that exhibit similar experimental behaviour). At this stage, the user may wish to validate and interpret the significance of these findings by using text mining techniques to analyse automatically the contents of scientific documents about the genes in the identified grouping. The aim is to extract automatically from the text facts, whether previously known or not, that may help explain the experimental similarity between such genes.

In the remainder of this paper, we describe how both types of mixed data and text mining are supported within the Discovery Net infrastructure to provide a flexible framework whereby end users can rapidly develop complex text/data mining applications.

2. The Discovery Net System

The Discovery Net system [2, 7, 13] is a grid-based knowledge discovery environment. Discovery Net is designed primarily to support the distributed analysis of scientific data based on a workflow or pipeline methodology.

Within Discovery Net, analysis components (whether traditional data mining components or text mining components) are treated as remote services or black boxes with known input and output interfaces described using web service protocols. The secure execution of the individual components themselves over high performance computing resources is achieved using Grid computing protocols [5]. The coordination of the execution of such services is described as workflows expressed in DPML [14], an XML-based workflow language. The execution management of the distributed workflows is handled by Discovery Net's workflow execution engine [13].

Figure 1: Example Workflow in Discovery Net

Figure 1 shows a visual representation of an executable Discovery Net workflow which computes a co-occurrence matrix between genes and diseases that appear in a set of PubMed bibliographic records. The icons represent processing components, and the arrows represent the data flow between them. The first component in the shown workflow is a data service that stores the imported PubMed documents. These documents are passed to a processing component that extracts the texts of the abstracts in the records. The next two components use a gene and a disease dictionary to identify and mark biological entities appearing in the abstracts. The outputs are passed to a component that generates a text index over the results, which are then passed to a computational component that calculates the required co-occurrence matrix.

The Discovery Net system allows users to create such workflows visually. Each of the remote web/grid services is first registered within the system, and the user then uses a workflow editor to connect the icons representing the data flow between the processing components used.
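The computation that this workflow carries out can be pictured with the short plain-Python sketch below; the toy dictionaries, the substring matching and the function names are illustrative assumptions and not the actual Discovery Net components.

from collections import defaultdict

# Illustrative stand-in for the Figure 1 workflow: tag abstracts with dictionary
# entries and count gene/disease co-occurrence per abstract. The dictionaries and
# the abstract are toy placeholders.
GENES = {"insr", "tp53"}
DISEASES = {"diabetes", "insulin resistance"}

def tag(abstract, dictionary):
    text = abstract.lower()
    return {term for term in dictionary if term in text}

def cooccurrence(abstracts):
    matrix = defaultdict(int)            # (gene, disease) -> number of abstracts
    for abstract in abstracts:
        for g in tag(abstract, GENES):
            for d in tag(abstract, DISEASES):
                matrix[(g, d)] += 1
    return matrix

print(cooccurrence(["INSR mutations are linked to insulin resistance and diabetes."]))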
3. Text Mining within Discovery Net

Discovery Net's text mining implementation is based on introducing three re-usable data types that simplify the co-ordination of the execution of remote services and that allow higher-level information to be exchanged easily between such co-operating services. These types are pure text documents, annotated text documents and feature vectors.

3.1 The Text Annotation Model

At the core of the Discovery Net text mining is an extensible document representation model. The model is based on the Tipster Document Architecture [8], which uses the notion of a document annotation. Following this model, a single document is represented by two entities: the document text, which corresponds to the plain document text, and the annotation set structure. The annotation set structure provides a flexible mechanism for associating extra-textual information with certain text segments. Each such text segment is called an annotation, and an annotation set consists of the full set of annotations that make up a document. A single annotation is uniquely defined by its span, i.e. by its starting and ending position in the document, and has associated with it a set of attributes. Table 1 shows a simple example of an annotated text.

The role of the attributes is to hold additional information, e.g. about the function, the semantics, or other types of user-defined information related to the corresponding text segment. Each annotation also has a type which, unlike in the original Tipster Architecture, is in our system a low-level notion limited to defining the role of the annotation as a constituent part of the document, i.e. as a single word, as a sequence of words, etc. Depending on the particular user application, different attributes will be assigned to the annotations. Typical examples include attributes that represent the results of a natural language processing operation such as part-of-speech tagging, stemming and morphological analysis, the results of dictionary lookups or database queries for certain annotations, or the results of a named entity or terminology extraction process.

Text                  Annot. Type      Attributes
Insulin               token            pos:noun, stem:insulin
Resistance            token            pos:noun, stem:resist
Insulin resistance    compound token   disease:insulin resistance
plays                 token            pos:verb, stem:plai
a                     token            pos:det, stopword

Table 1: Annotated Text Example
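A minimal sketch of this annotation model in Python is given below, reproducing the first rows of Table 1; the class and field names are ours, chosen for illustration, and do not correspond to the actual Discovery Net API.

from dataclasses import dataclass, field

# A minimal sketch of the annotation model (spans plus attributes) described above.
@dataclass
class Annotation:
    start: int                  # span start offset in the document text
    end: int                    # span end offset
    type: str                   # structural role, e.g. "token", "compound token"
    attributes: dict = field(default_factory=dict)

@dataclass
class AnnotatedDocument:
    text: str
    annotations: list

doc = AnnotatedDocument(
    text="Insulin resistance plays a ...",
    annotations=[
        Annotation(0, 7, "token", {"pos": "noun", "stem": "insulin"}),
        Annotation(8, 18, "token", {"pos": "noun", "stem": "resist"}),
        Annotation(0, 18, "compound token", {"disease": "insulin resistance"}),
    ],
)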
The Discovery Net system provides a large number of components for the creation of basic annotation sets, as well as for enriching a set by adding new attributes (e.g. part-of-speech, stem), and components for identifying and annotating frequently occurring phrases using statistical methods.

Interfaces also exist for translating the outputs of third party tools and remote text mining services, e.g. the remote bioinformatics terminology server [4], which provides an XML-RPC interface allowing its services to be invoked remotely through a programmatic interface. The server provides two services: the first returns abbreviations found in a document and the second annotates the document with the gene and protein names found. Each service receives its input as a string containing the document text, and returns an array containing the found entities, their locations within the document and a score for the confidence of each prediction. When executed from Discovery Net, these outputs are translated to the annotation model format, allowing them to be passed in a standardised format to other components within the same workflow.
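As an illustration of how such an XML-RPC service might be invoked and its results folded back into the annotation model, consider the hedged sketch below; the endpoint URL and the method name are placeholders, since the actual interface of the terminology server [4] is defined by that service.

import xmlrpc.client

# Hedged sketch of calling an XML-RPC annotation service and translating its output
# into span/attribute annotations as in the model above. The URL and method name are
# placeholders, not the real interface of the server described in the text.
server = xmlrpc.client.ServerProxy("http://example.org/terminology-rpc")  # placeholder

def annotate_gene_names(text):
    results = server.findGeneNames(text)   # hypothetical method name
    # assume each result is (entity, start, end, confidence), as described in the text
    return [{"span": (start, end), "type": "token",
             "attributes": {"gene": entity, "confidence": score}}
            for entity, start, end, score in results]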
3.2 Data Mining over Feature Vectors
In many cases text mining analysis results in the reduction of each document of a document collection to a single feature vector, whose dimensions reflect the significant informational components that the analysis identified. These features can then be passed to other typical data mining operations such as clustering and categorization for further processing.

Discovery Net supports the creation, storage and management of such feature vectors for user-specified annotations. Also provided are components for the computation of standard statistical and data mining operations over the feature vectors. These include the computation of co-occurrence matrices between the annotations of the documents, the similarity (e.g. cosine similarity) between the extracted feature vectors, statistical information for the features of a single document collection (e.g. inverse document frequency, entropy), as well as inter-collection feature statistics (e.g. chi-square, mutual information, information gain). Furthermore, the system also provides access to a large number of statistical and data mining components provided by the InforSense KDE data mining system that can be used for classifying and clustering tabular data.
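Two of these feature-vector operations, cosine similarity and inverse document frequency, are simple enough to illustrate directly in plain Python; the sketch below is ours and does not reproduce the Discovery Net or InforSense KDE components.

import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def inverse_document_frequency(vectors):
    """IDF per feature over a collection of count vectors."""
    n_docs = len(vectors)
    return [math.log(n_docs / (1 + sum(1 for v in vectors if v[i] > 0)))
            for i in range(len(vectors[0]))]

print(cosine_similarity([1, 0, 2], [2, 1, 2]))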

4. Discovery Net Text Mining Workflows

In this section, we describe two examples of Discovery Net mixed data mining and text mining.

4.1 Scientific Document Categorization Workflow

The first example demonstrates how to construct a text categorization system based on the KDD CUP 2002 competition for document categorization [6]. This task deals with building automatic methods for detecting which scientific papers, in a set of full-text genetics papers from FlyBase, contain experimental evidence about gene products (transcripts and proteins). A manually pre-labelled document collection of 800 documents is provided by the task organizers. Figure 2 shows one possible workflow to develop and train an automatic document classification system for the FlyBase task. This workflow is a variation on our earlier work presented in [6].

As with traditional document classification methods, each document is represented as a feature vector that records how many times a particular feature occurs in a document. Our approach here is based on using frequent regular expression patterns as features. To generate these features, we first replace all gene names and their aliases with unique tags identifying a gene name ('genexx'). Each document is then passed through a series of text pre-processing components ("Clean Text") that perform traditional text cleaning operations (stemming and stop-word removal). The regular expression patterns are then generated automatically ("Extract RegEx" component) by considering combinations of the unique gene name tag and up to 4 other words.

Figure 2: Document Classification Training Workflow

Based on an Information Gain Filter that keeps only the most discriminating patterns, only a few regular expressions are retained. Finally, the simplified feature vectors are used to train a traditional classification algorithm based on a support vector machine (SVM). Our document classifier, based on this method, proved to produce more accurate results than approaches relying solely on keyword-based features, and secured our team an honourable mention in the competition [7].
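The sketch below is a hedged re-creation of these training steps in Python (pattern counting, an information-gain filter, and an SVM via scikit-learn's LinearSVC); it is not the InforSense/Discovery Net implementation, and the patterns, documents, labels and the 0.1 gain threshold are toy assumptions. Gene names are assumed to have already been replaced by the 'genexx' tag during the "Clean Text" step.

import math
import re
from sklearn.svm import LinearSVC

def pattern_features(docs, patterns):
    """Count occurrences of each regular-expression pattern in each document."""
    return [[len(re.findall(p, d)) for p in patterns] for d in docs]

def information_gain(column, labels):
    """Information gain of a binary presence feature with respect to the labels."""
    def entropy(ys):
        total = len(ys)
        return -sum((ys.count(c) / total) * math.log2(ys.count(c) / total)
                    for c in set(ys))
    present = [y for x, y in zip(column, labels) if x > 0]
    absent = [y for x, y in zip(column, labels) if x == 0]
    split = sum(len(part) / len(labels) * entropy(part)
                for part in (present, absent) if part)
    return entropy(list(labels)) - split

patterns = [r"genexx \w+ protein", r"evidence for genexx"]        # toy patterns
docs = ["evidence for genexx expression", "genexx binding protein assay",
        "no evidence for genexx", "unrelated abstract text"]
labels = [1, 0, 1, 0]                                             # toy labels

X = pattern_features(docs, patterns)
gains = [information_gain([row[j] for row in X], labels) for j in range(len(patterns))]
keep = [j for j, g in enumerate(gains) if g > 0.1]                # the filtering step
X_reduced = [[row[j] for j in keep] for row in X]
model = LinearSVC().fit(X_reduced, labels)                        # SVM training step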
4.2 Interpreting Gene Expression Data Sets

The second example demonstrates how text mining can be used to help end users interpret and validate the results of a traditional data mining task over numerical gene expression data. The workflow presented in Figure 3 is divided into three logical phases.

In the first phase of the workflow ("Gene Expression Analysis"), a biologist conducts analysis over gene expression data using traditional data mining methods (mainly statistical analysis and data clustering in this example). The output of this stage is a set of "interesting genes" that the data mining methods isolate as candidates for further analysis. In the example, these are differentially expressed genes for which the user is interested in finding a biological significance of why they are differentially expressed, i.e. to identify a) an interpretation of why such genes are differentially expressed, and b) the diseases that are associated with these genes.

Figure 3: Gene Expression Analysis Workflow

In the second phase of the workflow ("Find Relevant Genes from Online Databases") the user uses Discovery Net's InfoGrid [7] integration framework to obtain further information about the genes. This part of the workflow starts by obtaining the nucleotide sequence for each gene by issuing a query to the NCBI database based on the gene accession number. The retrieved sequence is then used to execute a BLAST query to retrieve a set of homologous sequences; these sequences in turn are used to issue a query to SwissProt to retrieve the PubMed IDs identifying articles relating to the homologous sequences. Finally, the PubMed IDs are used to issue a query against PubMed to retrieve the abstracts associated with these articles, and the abstracts are passed through a frequent phrase identification algorithm to extract summaries for the retrieved documents.
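Structurally, this phase chains four remote queries. The Python sketch below captures that chain with stub functions standing in for the NCBI, BLAST, SwissProt and PubMed components and for the frequent-phrase summariser; the stubs, their return values and the example accession number are placeholders, not real service calls.

# Structural sketch of the second phase of the workflow; every helper is a stub.
def fetch_ncbi_sequence(accession):       return "ATGGCC..."            # stub
def run_blast(sequence):                  return ["P06213"]             # stub: homologue IDs
def query_swissprot(homologue_id):        return ["12345678"]           # stub: PubMed IDs
def fetch_pubmed_abstract(pubmed_id):     return "Insulin receptor ..." # stub
def extract_frequent_phrases(abstracts):  return ["insulin receptor"]   # stub

def phase_two(gene_accessions):
    """For each candidate gene, gather and summarise abstracts about its homologues."""
    summaries = {}
    for acc in gene_accessions:
        sequence = fetch_ncbi_sequence(acc)                  # NCBI query by accession number
        pubmed_ids = {pid for hom in run_blast(sequence)     # BLAST -> homologous sequences
                      for pid in query_swissprot(hom)}       # SwissProt -> PubMed IDs
        abstracts = [fetch_pubmed_abstract(pid) for pid in pubmed_ids]
        summaries[acc] = extract_frequent_phrases(abstracts) # summarise the documents
    return summaries

print(phase_two(["NM_000208"]))   # placeholder accession number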
In the third phase of the workflow ("Find Association between Frequent Terms") the user uses a dictionary of disease terms obtained from the MeSH (Medical Subject Headings) vocabulary to isolate the key disease terms appearing in the articles. The identified disease terms are then analysed using a standard association analysis based on an a priori style algorithm to find frequently co-occurring disease terms in the retrieved article sets that are associated with both the identified genes and their homologues.
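A minimal a-priori-style sketch of this association step is shown below: it keeps only disease-term pairs that co-occur in at least a minimum number of the retrieved article sets. The per-article term sets are toy data; in the workflow they would come from the MeSH dictionary matching step.

from itertools import combinations

def frequent_pairs(article_terms, min_support=2):
    # a priori pruning: only terms frequent on their own can form frequent pairs
    counts = {}
    for terms in article_terms:
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
    frequent = {t for t, c in counts.items() if c >= min_support}

    pair_counts = {}
    for terms in article_terms:
        for pair in combinations(sorted(set(terms) & frequent), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

articles = [{"diabetes", "insulin resistance"},
            {"diabetes", "insulin resistance", "obesity"},
            {"obesity"}]
print(frequent_pairs(articles))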
5. Summary

In this paper we have presented the Discovery Net system for mixed data and text mining. The system is based on co-ordinating the execution of distributed processing components using a workflow model. We have presented the main data structures used to support distributed text mining in the system and discussed their advantages. We have also described two types of text/data mining interaction paradigms supported by the system, and presented an example workflow for each paradigm.

Acknowledgments

Parts of this work have been supported under the EPSRC funded UK e-Science project Discovery Net. We thank and acknowledge support from InforSense Ltd. for the use of their software and visual tools in our development activities.

References

[1] L. A. Adamic, D. Wilkinson, B. A. Huberman, E. Adar. A Literature Based Method for Identifying Gene-Disease Connections. IEEE Computer Society Bioinformatics Conference, August 2002.

[2] V. Curcin, M. Ghanem, Y. Guo, M. Kohler, A. Rowe, P. Wendel. Discovery Net: Towards a Grid of Knowledge Discovery. ACM KDD-2002, July 2002, Edmonton, Canada.

[3] J.T. Chang, H. Schutze, R. B. Altman. Creating an online dictionary of abbreviations from Medline. J Am Med Inform Assoc. 2002;9(6):612-20.

[4] J.T. Chang, H. Schutze, R. B. Altman. Biomedical Abbreviation Server. http://bionlp.stanford.edu/abbreviation

[5] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. Journal of Supercomputer Applications, 11(2):115-128.

[6] M. M. Ghanem, Y. Guo, H. Lodhi, Y. Zhang. Automatic Scientific Text Classification Using Local Patterns: KDD CUP 2002 (Task 1). SIGKDD Explorations, 2002, 4(2).

[7] N. Giannadakis, A. Rowe, M. Ghanem, and Y. Guo. InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences, 2003: 3:199-226.

[8] R. Grishman. TIPSTER Text Architecture Design. http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/docs/arch31.doc

[9] E. M. Marcotte, I. Xenarios, D. Eisenberg. Mining literature for protein-protein interactions. Bioinformatics. 2001;17(4):359-63.

[10] T. Ohta, Y. Tateisi, H. Mima, J. Tsujii. GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain. HLT 2002.

[11] S. Raychaudhuri, H. Schutze, R. B. Altman. Using Text Analysis to Identify Functionally Coherent Gene Groups. Genome Research. 2002:1582-1590.

[12] T.C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter. EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature. Pacific Symposium on Biocomputing 2000: 514-525.

[13] A. Rowe, D. Kalaitzopoulos, M. Osmond, M. Ghanem, Y. Guo. The Discovery Net System for High Throughput Bioinformatics. Bioinformatics 2003: 225-231.

[14] J. Syed, Y. Guo, M. Ghanem. Discovery Processes: Representation and Re-Use. UK e-Science All Hands Meeting, Sheffield, UK, September 2002.

[15] L. Tanabe, W. J. Wilbur. Tagging Gene and Protein Names in Biomedical Texts. Bioinformatics. 2002; 18(8).
