Moustafa Ghanem, Alexandros Chortaras, Yike Guo          Anthony Rowe, Jon Ratcliffe
Imperial College London                                  InforSense Ltd
{mmg, yg, ac901}@doc.ic.ac.uk                            {asrowe, j.ratcliffe}@inforsense.com
We present a distributed infrastructure for mixed data and text mining. Our approach is based on extending the Discovery Net environment for knowledge discovery, to allow end users to construct complex distributed text and data mining workflows. We describe our architecture, data model and visual programming approach, and present a number of mixed data and text mining examples over biological data.

1. Introduction & Motivation

The automated analysis of scientific text documents can play an important role in streamlining scientific research by allowing users to analyse quickly the content of large document collections and to extract automatically from them important, and possibly unknown, facts and information. A large number of successful text mining studies have been conducted in the past few years in the area of bioinformatics to investigate the utility of such an approach. For example, various studies have been conducted on the design and use of systems that identify and extract biological entities (genes, proteins, chemical compounds, diseases, etc.) mentioned in the literature, e.g. [3, 12, 15]. Other studies have focused on extracting the relationships between such biological entities (e.g. protein-protein interactions or gene-disease correlations) mentioned in the literature, e.g. [1, 9, 15]. Studies such as those reported in [11] aimed to investigate how text mining approaches can be used to validate the results of analytical gene expression methods in identifying significant gene groupings.

The aim of the work presented here is to design a flexible infrastructure that allows end users (e.g. biologists) to construct easily their own text mining applications. To motivate our approach, we highlight two complementary forms of interaction between data and text mining that are commonly used in bioinformatics.

The first form is where the automated analysis of text documents makes use of traditional data mining techniques. In this form, text mining proceeds by using a generic pipeline that takes in text documents, performs any of a number of text pre-processing operations (cleaning, NLP parsing, regular expression operations, etc.), followed by coding the features of the documents in vector form, where counts are recorded for user-defined features, making the documents amenable to the application of traditional data mining techniques such as classification, clustering and PCA.

The second form of interaction is whereby text mining is used to validate and interpret the results of bioinformatics data mining procedures. For example, consider a scientist engaged in the analysis of numerical experimental data using traditional data mining techniques, e.g. the analysis of microarray gene expression data using data clustering. The result of this clustering analysis may be a group of co-regulated genes (i.e. genes that exhibit similar experimental behaviour). At this stage, the user may wish to validate and interpret the significance of these findings by using text mining techniques to analyse automatically the contents of scientific documents about the genes in the identified grouping. The aim is to extract automatically from the text facts, whether previously known or not, that may help explain the experimental similarity between such genes.

In the remainder of this paper, we describe how both types of mixed data and text mining are supported within the Discovery Net infrastructure to provide a flexible framework whereby end users can rapidly develop complex text/data mining applications.

2. The Discovery Net System

The Discovery Net system [2, 7, 13] is a grid-based knowledge discovery environment. Discovery Net is designed primarily to support the distributed analysis of scientific data based on a workflow or pipeline paradigm. Within Discovery Net, analysis components (whether traditional data mining components or text mining components) are treated as remote services or black boxes with known input and output interfaces described using web service protocols. The secure execution of the individual components themselves over high performance computing resources is achieved using Grid computing protocols [5]. The coordination of the execution of such services is described as workflows expressed in DPML [14], an XML-based workflow language. The execution management of the distributed workflows is handled by Discovery Net's workflow execution engine [13].
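The generic text mining pipeline outlined in the introduction (pre-processing followed by coding documents as count vectors over user-defined features) can be illustrated with a small stand-alone sketch. This is our own illustration, not Discovery Net code; the function names `tokenize` and `to_count_vector` and the stopword list are assumptions for the example:

```python
import re
from collections import Counter

# A crude cleaning step: the stopword list here is purely illustrative.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is"}

def tokenize(text):
    """Lower-case, keep alphabetic runs only, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def to_count_vector(text, features):
    """Code a document as counts over a user-defined feature list."""
    counts = Counter(tokenize(text))
    return [counts[f] for f in features]

features = ["insulin", "resistance", "gene"]
doc = "Insulin resistance plays a role in the insulin signalling gene network."
print(to_count_vector(doc, features))  # -> [2, 1, 1]
```

Vectors produced this way can then be fed directly to standard classification or clustering algorithms, which is the point of the first form of interaction.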
0-7803-8735-X/05/$20.00 ©2005 IEEE
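The coordination model described in Section 2 (components as black boxes with known inputs and outputs, executed in data-flow order) can be sketched as a toy engine. This is not DPML or the actual Discovery Net engine; all class and function names below are hypothetical:

```python
# Illustrative only: components as black boxes with named upstream inputs,
# run by a toy engine in data-flow (dependency) order.
class Component:
    def __init__(self, name, func, inputs):
        self.name = name
        self.func = func        # the black-box computation
        self.inputs = inputs    # names of upstream components

def run_workflow(components):
    """Execute each component once all of its upstream outputs are available."""
    results, pending = {}, list(components)
    while pending:
        ready = [c for c in pending if all(i in results for i in c.inputs)]
        if not ready:
            raise ValueError("cycle or missing upstream component")
        for c in ready:
            results[c.name] = c.func(*[results[i] for i in c.inputs])
            pending.remove(c)
    return results

wf = [
    Component("load", lambda: ["insulin resistance", "gene expression"], []),
    Component("upper", lambda docs: [d.upper() for d in docs], ["load"]),
    Component("count", lambda docs: len(docs), ["upper"]),
]
print(run_workflow(wf)["count"])  # -> 2
```

In the real system the `func` of each component would be a remote web/grid service invocation rather than a local lambda, but the coordination logic is the same idea.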
Figure 1: Example Workflow in Discovery Net

Figure 1 shows a visual representation of an executable Discovery Net workflow which computes a co-occurrence matrix between genes and diseases that appear in a set of PubMed bibliographic records. The icons represent processing components, and the arrows represent data flow between them. The first component in the shown workflow is a data service that stores the imported PubMed documents. These documents are passed to a processing component that extracts the texts of the abstracts in the records. The next two components use a gene and a disease dictionary to identify and mark biological entities appearing in the abstracts. The outputs are passed to a component that generates a text index over the results, which are then passed to a computational component that calculates the required co-occurrence matrix.

The Discovery Net system allows users to create such workflows visually. Each of the remote web/grid services is first registered within the system, and then the user uses the visual editor to compose the registered components into workflows.

3. Text Mining within Discovery Net

Discovery Net's text mining implementation is based on introducing three re-usable data types that simplify the co-ordination of the execution of remote services and that allow higher-level information to be exchanged easily between such co-operating services. These types are pure text documents, annotated text documents and feature vectors.

Each annotation is uniquely defined by its span, i.e. by its starting and ending position in the document, and has associated with it a set of attributes. Table 1 shows a simple example of an annotated text.

The role of the attributes is to hold additional information, e.g. about the function, the semantics, or other types of user-defined information related to the corresponding text segment. Each annotation also has a type which, unlike in the original Tipster Architecture, in our system is a low-level notion limited to defining the role of the annotation as a constituent part of the document, i.e. as a single word, as a sequence of words, etc. Depending on the particular user application, different attributes will be assigned to the annotations. Typical examples include attributes that represent the results of a natural language processing operation such as part-of-speech tagging, stemming and morphological analysis, results of dictionary lookups or database queries for certain annotations, or results of a named entity or terminology extraction process.

Table 1: A simple example of an annotated text

Text                 Annot. Type      Attributes
Insulin              token            pos:noun, stem:insulin
Resistance           token            pos:noun, stem:resist
Insulin resistance   compound token   disease:insulin resistance
plays                token            pos:verb, stem:plai
a                    token            pos:det, stopword

The system provides components for the creation of a basic annotation set, as well as for enriching the set by adding new attributes (e.g. part-of-speech, stem), as well as components for identifying and annotating frequently occurring phrases using statistical methods.

Interfaces also exist for translating the outputs of third-party tools and remote text mining services, e.g. the remote bioinformatics terminology server [4].
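The annotation model described above (a span into the document text, a low-level type, and a set of attributes) might be represented as follows. This is our own minimal sketch, not the Discovery Net data model; the class and field names are assumptions:

```python
# Minimal sketch of the annotation model: each annotation is identified by its
# span (start, end) in the document text, plus a type and attribute dictionary.
class Annotation:
    def __init__(self, start, end, ann_type, **attributes):
        self.start, self.end = start, end  # span in the document text
        self.ann_type = ann_type           # e.g. "token", "compound token"
        self.attributes = attributes       # e.g. pos, stem, disease

    def text(self, document):
        """Recover the annotated segment from the document via the span."""
        return document[self.start:self.end]

doc = "Insulin resistance plays a role"
annotations = [
    Annotation(0, 7, "token", pos="noun", stem="insulin"),
    Annotation(8, 18, "token", pos="noun", stem="resist"),
    Annotation(0, 18, "compound token", disease="insulin resistance"),
    Annotation(19, 24, "token", pos="verb", stem="plai"),
    Annotation(25, 26, "token", pos="det", stopword=True),
]
for a in annotations:
    print(a.text(doc), "|", a.ann_type, "|", a.attributes)
```

Note how the compound annotation for "Insulin resistance" overlaps the two token annotations, which is exactly what span-based (rather than in-line markup) annotation makes easy.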
The system provides a number of operations over the feature vectors. These include the computation of co-occurrence matrices between the annotations of the documents, and the similarity (e.g. cosine similarity) between the extracted feature vectors, computing statistical information for the features of a single document collection (e.g. inverse document frequency, entropy), as well as inter-collection feature statistics (e.g. chi-square, mutual information, information gain). Furthermore, the system also provides access to a large number of statistical and data mining components provided by the InforSense KDE data mining system that can be used for classifying and clustering tabular data.

Figure 2: Document Classification Training Workflow

Based on an Information Gain Filter that keeps only the most discriminating patterns, only a few regular expressions are retained. Finally, the simplified feature vectors are then used to train a traditional classification algorithm based on support vector machines (SVM). Our document classifier, based on this method, proved to produce more accurate results than approaches relying solely on using keyword-based features and secured our team an honourable mention in the competition [7].
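An information-gain filter of the kind mentioned above can be sketched for binary (pattern present/absent) features. This is an illustrative stand-alone computation, not the system's implementation; the data and function names are hypothetical, and a real filter would simply keep the patterns with the highest gain:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def information_gain(feature_present, labels):
    """IG = H(labels) - weighted entropy of labels split on the binary feature."""
    n = len(labels)
    split = {True: [], False: []}
    for present, lab in zip(feature_present, labels):
        split[present].append(lab)
    cond = sum(len(part) / n * entropy(part) for part in split.values() if part)
    return entropy(labels) - cond

# Hypothetical data: does a pattern occur in each document, and the doc's class?
occurs = [True, True, False, False]
labels = ["relevant", "relevant", "other", "other"]
print(information_gain(occurs, labels))  # -> 1.0 (perfectly discriminating)
```

A pattern whose presence is independent of the class gives a gain of 0, so ranking patterns by this score and truncating the list is what "keeps only the most discriminating patterns".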
In the second phase of the workflow ("Find Relevant Genes from Online Databases") the user uses Discovery Net's InfoGrid [7] integration framework to obtain further information about the genes. This part of the workflow starts by obtaining the nucleotide sequence for each gene by issuing a query to the NCBI database based on the gene accession number. The retrieved sequence is then used to execute a BLAST query to retrieve a set of homologous sequences; these sequences in turn are used to issue a query to SwissProt to retrieve the PubMed IDs identifying articles relating to the homologous sequences. Finally, the PubMed IDs are used to issue a query against PubMed to retrieve the abstracts associated with these articles, and the abstracts are passed through a frequent phrase identification algorithm to extract summaries for the retrieved documents.

In the third phase of the workflow ("Find Association between Frequent Terms") the user uses a dictionary of disease terms obtained from the MeSH (Medical Subject Headings) dictionary to isolate the key disease terms appearing in the articles. The identified disease terms are then analysed using a standard association analysis using an apriori-style algorithm to find frequently co-occurring disease terms in the retrieved article sets that are associated with both the identified genes as well as their homologues.

5. Summary

Acknowledgements

Parts of this work have been supported under the EPSRC-funded UK e-Science project Discovery Net. We thank and acknowledge support from InforSense Ltd. for using their software and visual tools in our development activities.

References

[3] J. T. Chang, H. Schütze, R. B. Altman. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc. 2002;9(6):612-620.

[4] J. T. Chang, H. Schütze, R. B. Altman. Biomedical Abbreviation Server. http://bionlp.stanford.edu/abbreviation

[5] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int. Journal of Supercomputer Applications, 11(2):115-128, 1997.

[6] M. M. Ghanem, Y. Guo, H. Lodhi, Y. Zhang. Automatic Scientific Text Classification Using Local Patterns: KDD CUP 2002 (Task 1). SIGKDD Explorations, 2002, 4(2).

[7] N. Giannadakis, A. Rowe, M. Ghanem, and Y. Guo. InfoGrid: Providing Information Integration for Knowledge Discovery. Information Sciences, 2003; 3:199-226.

[8] R. Grishman. TIPSTER Text Architecture Design. http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/docs/arch31.doc

[9] E. M. Marcotte, I. Xenarios, D. Eisenberg. Mining literature for protein-protein interactions. Bioinformatics. 2001;17(4):359-363.

[10] T. Ohta, Y. Tateisi, H. Mima, J. Tsujii. GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain. HLT 2002.

[11] S. Raychaudhuri, H. Schütze, R. B. Altman. Using Text Analysis to Identify Functionally Coherent Gene Groups. Genome Research. 2002:1582-1590.

[12] T. C. Rindflesch, L. Tanabe, J. N. Weinstein, and L. Hunter. EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature. Pacific Symposium on Biocomputing 2000:514-525.