
Expert Systems with Applications 30 (2006) 2–10
www.elsevier.com/locate/eswa

Applications of artificial intelligence in bioinformatics: A review


Zoheir Ezziane
College of Information Technology, P.O. Box 14143, Dubai, United Arab Emirates

Abstract

Artificial intelligence (AI) has increasingly gained attention in bioinformatics research and computational molecular biology. With the availability of different types of AI algorithms, it has become common for researchers to apply off-the-shelf systems to classify and mine their databases. At present, with various intelligent methods available in the literature, researchers face difficulties in choosing the best method to apply to a specific data set. Researchers need tools that present the data in a comprehensible fashion, annotated with context, estimates of accuracy, and explanation. This article reviews the use of AI in bioinformatics and computational molecular biology (DNA sequencing). These areas have arisen from the needs of biologists to utilize and help interpret the vast amounts of data that are constantly being gathered in genomic research. The underlying motivation for many of the bioinformatics and DNA sequencing approaches is the evolution of organisms and the complexity of working with erroneous data. This article also describes the kinds of software programs developed by the research community in order to (1) search, classify, and mine the available biological databases, and (2) simulate biological experiments with and without errors.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Intelligent bioinformatics systems; Intelligent DNA sequencing; AI tools in bioinformatics

1. Introduction

The post-genomic era has been characterized by two different scenarios: on the one hand, the huge amount of biological data available all over the world requires suitable tools and methods both for modeling biological processes and for analyzing biological sequences; on the other, many new computational models and paradigms inspired by and developed as metaphors of biological systems are ready to be applied in the context of computer science. Hence, the need to either develop new models or exploit and analyze the available genomes is considered a priority task for the bioinformatics research community. There are at least 26 billion base pairs (bp) representing the various genomes available on the server of the National Center for Biotechnology Information (NCBI). Besides the human genome with about 3 billion bp, many other species have their complete genome available there. Cohen (2004) explained the needs of biologists to utilize and help interpret the vast amounts of data that are constantly being gathered in genomic research. He also pointed out the basic concepts in molecular cell biology, outlined the nature of the existing
E-mail address: zezziane@duc.ac.ae
0957-4174/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2005.09.042

data, and illustrated the algorithms needed to understand cell behavior. This review article focuses on the use of AI techniques and algorithms in bioinformatics. The largest known gene in the NCBI server has about 20 million base pairs and the largest protein consists of about 34,000 amino acids. In contrast, the Protein Data Bank (PDB) has a catalogue of only 45,000 proteins specified by their 3D structure.

Bioinformatics and computational biology are concerned with the use of computation to understand biological phenomena and to acquire and exploit biological data, increasingly large-scale data (Gusfield, 2004). Methods from bioinformatics and computational biology are increasingly used to augment or leverage traditional laboratory and observation-based biology. These methods have become critical in biology due to recent changes in our ability and determination to acquire massive biological data sets, and due to the ubiquitous, successful biological insights that have come from the exploitation of those data. This transformation from a data-poor to a data-rich field began with DNA sequence data, but is now occurring in many other areas of biology.

DNA sequence analysis is attractive to computer scientists because of the availability of digital information. However, there are many challenges related to this area, such as: (1) Parsing a genome in order to find the segments of DNA sequence with various biological roles, for example, encoding proteins and RNA, and controlling when and where those

Z. Ezziane / Expert Systems with Applications 30 (2006) 2–10

molecules are expressed. (2) Aligning DNA sequences in order to check for similarities or differences. The alignment procedure can be performed locally (DNA fragment level) or globally (genome level).

DNA arrays or DNA chips were proposed in the late 1980s by several researchers independently for the purpose of DNA sequencing (Drmanac, Hood, & Crkvenjakov, 1993; Drmanac, Labat, Brunker, & Crkvenjakov, 1989; Pevzner & Lipshutz, 1994), and the technology was named DNA Sequencing by Hybridization (SBH). This method may also be referred to as sequencing by k-tuple composition. The idea is to build a 2D grid (or matrix) of all possible k-tuples (or k-mers) for a given k. At each (i,j) entry a distinct k-tuple, or probe, is attached. The matrix of probes will be referred to as the k-chip, C(k), or the sequencing chip. The DNA probes are referred to as oligonucleotides, or oligos for short. A sample of the single-stranded DNA to be sequenced is then presented to the matrix. This DNA is labeled with a radioactive or fluorescent material. Each k-tuple present in the sample hybridizes with its reverse complement in the matrix. After washing unhybridized DNA from the chip, the hybridized k-tuples can be determined by a device that detects the labeled DNA.

Current implementations of SBH use classical probing schemes, i.e., chips accommodating all 4^k k-mer oligos, the symbols being the well-known DNA bases {A,C,G,T}, where A, C, G, and T represent the four nucleotides adenine, cytosine, guanine, and thymine, and k represents a technology-dependent integer parameter usually between 8 and 10. Combinatorial constraints add another hurdle to the SBH problem. The classical sequencing chip C(k) contains all 4^k single-stranded oligos of length k. For example, in C(8) all 4^8 = 65,536 octamers are used.
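The chip construction and the error-free hybridization step described above can be illustrated with a minimal Python sketch. The function names (`build_chip`, `reverse_complement`, `hybridize`) are hypothetical and chosen only for this illustration; the sketch assumes an idealized experiment with no hybridization errors.

```python
from itertools import product

BASES = "ACGT"
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def build_chip(k):
    """Return all 4**k probes of length k, i.e. the classical chip C(k)."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def hybridize(sample, k):
    """Probes of C(k) that bind the sample in an idealized, error-free run.

    Each k-mer of the sample binds the probe equal to its reverse complement.
    """
    kmers = {sample[i:i + k] for i in range(len(sample) - k + 1)}
    return {reverse_complement(kmer) for kmer in kmers}


if __name__ == "__main__":
    print(len(build_chip(3)))                # 64
    print(len(build_chip(8)))                # 65536
    print(sorted(hybridize("ACTGGTC", 3)))   # probes that would light up
```

Reading the labeled probes back through `reverse_complement` recovers the k-mers of the sample, but not their order or positions.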
The classical chip C(8) suffices to reconstruct 200-nucleotide-long sequences in only 94 of 100 cases (Pevzner, Lysov, Khrapko, Belyavsky, Florentiev and Mirzabekov, 1991), even in error-free experiments. Unfortunately, the length of unambiguously reconstructible sequence grows more slowly than the size of the chip. Thus the exponential growth of the chip inherently limits the length of the longest sequence reconstructible by traditional SBH. Extensive simulations (Dyer, Frieze, & Suen, 1994; Pevzner & Lipshutz, 1994; Waterman, 1995) showed that the expected length of unambiguously reconstructible sequences with solid length-k probes is O(2^k). Recent array designs based on gapped probes proved much more efficient, approaching the information-theoretic limit for array designs (Preparata & Upfal, 2000).

If the hybridization experiments were executed without errors, then the spectrum would be ideal. In this case it would contain exactly the subsequences of length k of the original sequence of known length n. Thus, the spectrum consists of n - k + 1 elements, and to reconstruct the original sequence one must find an order of the spectrum elements such that neighboring elements always overlap on k - 1 nucleotides. For example, suppose the original sequence to be found is ACTGGTC, n = 7. In this hybridization experiment one can use the complete library of oligos of length k = 3, composed of the following 4^3 = 64 oligos: {AAA, ..., TTT}. As a result of the experiment

performed without errors one obtains the ideal spectrum for this sequence, containing all three-symbol substrings of the original sequence: {ACT,CTG,TGG,GGT,GTC}. The reconstruction of the sequence consists of finding an order of the spectrum elements in which each pair of neighboring elements overlaps on k - 1 = 2 symbols. There are several exact methods for solving the DNA sequencing problem with an ideal spectrum (Bains & Smith, 1988; Drmanac et al., 1989; Pevzner, 1989).

However, the hybridization experiment usually produces two possible types of error in the spectrum. A probe p is a false positive if an experiment reports the presence of p even though p is not actually a substring of the target DNA sequence SEQ. A probe p is a false negative if an experiment reports the absence of p even though p is a substring of SEQ. Currently, the ratio of errors to proper oligos in the problem data is rather large. In experimental data from real applications one could expect a smaller number of errors, in which case the algorithms would provide even better results (Blazewicz & Kasprzak, 2003). The presence of errors in the spectrum results in strongly NP-hard combinatorial problems (Blazewicz & Kasprzak, 2003). There exist exact and heuristic methods that assume errors in the spectrum, but almost all of them consider a reduced model of errors (Halperin, Halperin, Hartman, & Shamir, 2002; Pevzner, 1989).

2. DNA sequence reconstruction

2.1. Sequencing scheme

An SBH chip consists of a fixed number of features. Each feature can accommodate one probe. A probe is a string of symbols from the alphabet Σ = {A,C,G,T,-}, where - denotes the blank symbol. SBH provides information about the k-mers present in the DNA string, but does not provide information about the positions of the k-mers. Moreover, SP is said to be the spectrum of sequence SEQ if SP is a multi-set of all k-long substrings of SEQ, assuming that the number of occurrences of each k-mer is also known.
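As a rough illustration of these definitions, the following Python sketch computes the spectrum as a multi-set and classifies the errors in an observed spectrum. The helper names (`spectrum`, `classify_errors`) and the observed probe set are hypothetical, introduced only for this example; error classification here ignores multiplicities and looks only at presence or absence.

```python
from collections import Counter


def spectrum(seq, k):
    """Multi-set of all k-long substrings of seq (k-mer -> number of occurrences)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def classify_errors(observed, seq, k):
    """Split the difference between an observed probe set and the ideal spectrum.

    False positives: reported probes that are not substrings of seq.
    False negatives: substrings of seq that the experiment failed to report.
    """
    ideal = set(spectrum(seq, k))
    observed = set(observed)
    return {
        "false_positives": observed - ideal,
        "false_negatives": ideal - observed,
    }


if __name__ == "__main__":
    print(sorted(spectrum("ACTGGTC", 3)))  # ['ACT', 'CTG', 'GGT', 'GTC', 'TGG']
    # Hypothetical noisy readout: TGG was missed and TTT was reported spuriously.
    noisy = {"ACT", "CTG", "GGT", "GTC", "TTT"}
    print(classify_errors(noisy, "ACTGGTC", 3))
```

Using a `Counter` rather than a plain set keeps the number of occurrences of each k-mer, which matches the multi-set definition of the spectrum given above.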
For example, SEQ = ATGCAGGTCC and SP = {ATG,AGG,CAG,GCA,GGT,GTC,TCC,TGC}. A sequencing algorithm is an algorithm that, given a multiset of k-mers SP = {SP_1, ..., SP_(n-k+1)}, decides if the spectrum defines a unique DNA sequence SEQ and, if yes, reconstructs the sequence SEQ from its spectrum.

2.2. Traditional solutions for the SBH problem

The fundamental computational problem in SBH is the reconstruction of a sequence from its spectrum, the list of all k-mers that are included in the sequence along with their multiplicities. The traditional solutions for the SBH problem, which are briefly discussed in this paper, are the Hamiltonian path formulation, treated as a Traveling Salesman Problem (TSP), the Eulerian path problem (EPP), and the positional SBH (PSBH).
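The decision task performed by a sequencing algorithm (does an error-free spectrum define a unique sequence?) can be sketched by brute force. The code below is only an illustrative backtracking search, not one of the methods cited in this paper; it assumes k >= 2, and its running time grows exponentially in the worst case. The names `reconstructions` and `is_unique` are hypothetical.

```python
from collections import Counter


def reconstructions(kmers):
    """Enumerate every sequence whose k-mer multi-set equals the given spectrum."""
    remaining = Counter(kmers)          # k-mers not yet placed, with multiplicity
    total = sum(remaining.values())
    k = len(kmers[0])                   # assumption: all k-mers share length k >= 2
    found = set()

    def extend(seq, used):
        if used == total:
            found.add(seq)
            return
        suffix = seq[-(k - 1):]
        for base in "ACGT":
            kmer = suffix + base        # next k-mer must overlap on k - 1 symbols
            if remaining[kmer] > 0:
                remaining[kmer] -= 1
                extend(seq + base, used + 1)
                remaining[kmer] += 1    # backtrack

    for start in list(remaining):
        remaining[start] -= 1
        extend(start, 1)
        remaining[start] += 1
    return sorted(found)


def is_unique(kmers):
    """True if exactly one sequence is consistent with the spectrum."""
    return len(reconstructions(kmers)) == 1


if __name__ == "__main__":
    print(reconstructions(["ACT", "CTG", "TGG", "GGT", "GTC"]))
    # ['ACTGGTC']
    print(reconstructions(["ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"]))
    # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

The second spectrum shows how repeated (k-1)-mers make the reconstruction ambiguous even when the experiment is error-free.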

2.2.1. Hamiltonian path/TSP

The SBH problem can be approached as a TSP (Blazewicz, Formanowicz, Kasprzak, Markiewicz, & Weglarz, 1999) by defining a directed graph G1 such that every occurrence of a k-mer in the spectrum is represented by a vertex in the graph, i.e. a k-mer that appears more than once is represented by multiple vertices. Every pair of vertices x, y ∈ V is connected by a directed edge e from x to y if and only if the (k - 1)-suffix of x is identical to the (k - 1)-prefix of y. For example, {GTC,TCC} are 3-mers that are connected by a directed edge, since the 2-mer suffix of GTC equals the 2-mer prefix of TCC. Joining two k-mers into a sequence carries a cost. The cost of joining two k-mers is equal to k minus the number of nucleotides that overlap in these k-mers. For example, the two k-mers CCATC and TCTAG may overlap on two nucleotides and create the longer sequence CCATCTAG. Consequently, the cost of joining them is equal to 3. The goal is to visit every vertex in G1 exactly once and return to the starting point in such a way that the sum of the costs of the traversed edges in the G1 cycle is at its minimum. Hence, this problem is the same as finding a DNA sequence SEQ with the spectrum SP. This problem is known to be NP-hard, and thus unlikely to admit a polynomial-time algorithm.

2.2.2. EPP

Pevzner (1989) proposed a different approach, which reduces the SBH problem to the EPP, leading to a simple linear-time algorithm for sequence reconstruction. The idea is to construct a graph G2 (Pevzner's graph), whose edges correspond to k-mers, and to find a path in the graph that visits every edge exactly once. Here the vertices are the full set of (k - 1)-mers appearing in the spectrum. Based on the defined graph G2, the problem is translated into finding a path that visits all edges of G2. The solution is not necessarily unique because it is possible to detect an Eulerian cycle, which creates multiple ambiguous solutions.
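The two graph formulations above can be sketched in Python as follows. `join_cost` computes the TSP edge cost for a pair of k-mers, and `eulerian_reconstruct` builds Pevzner's graph and walks it with Hierholzer's algorithm, a standard Eulerian-path method that is swapped in here for illustration and is not necessarily the procedure used in the cited work. Both function names are hypothetical, and an error-free spectrum is assumed.

```python
from collections import Counter, defaultdict


def join_cost(x, y):
    """TSP edge cost: k minus the longest proper suffix of x that prefixes y."""
    k = len(x)
    for overlap in range(k - 1, 0, -1):
        if x[-overlap:] == y[:overlap]:
            return k - overlap
    return k  # no overlap at all


def eulerian_reconstruct(kmers):
    """Return one sequence spelled by an Eulerian path, or None if none exists."""
    k = len(kmers[0])
    graph = defaultdict(list)            # (k-1)-mer -> list of successor (k-1)-mers
    balance = defaultdict(int)           # out-degree minus in-degree per vertex
    for kmer in kmers:                   # each k-mer is one edge: prefix -> suffix
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else kmers[0][:-1]

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    seq = path[0] + "".join(node[-1] for node in path[1:])
    # Validate: the walk must reproduce exactly the input multi-set of k-mers.
    spelled = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return seq if spelled == Counter(kmers) else None


if __name__ == "__main__":
    print(join_cost("CCATC", "TCTAG"))  # 3
    print(eulerian_reconstruct(["GTC", "ACT", "GGT", "CTG", "TGG"]))  # ACTGGTC
```

When several Eulerian paths exist, the sketch returns only one of them, which is precisely the ambiguity discussed in the text.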
This ambiguity of an SBH solution occurs if it is impossible to reconstruct the original sequence SEQ from Pevzner's graph. Multiple (alternative) solutions are manifested as branches in the graph, and unless the number of branches is very small, there is no good way to determine the correct sequence (Ben-Dor, Pe'er, Shamir, & Sharan, 2001).

2.2.3. Positional SBH

Recently, several authors have suggested enhancements of SBH based on adding location information to the spectrum (Adleman, 1998; Broude, Sano, Smith, & Cantor, 1994; Gusfield, Karp, Wang, & Stelling, 1998; Hannenhalli, Pevzner, Lewis, & Skiena, 1996; Shamir & Tsur, 2001). The approximate position of every k-mer in the target sequence is measured and registered. This additional information makes the reconstruction less ambiguous, because the allowed positions in the target sequence are known for each k-mer in the spectrum. This method is called positional SBH (PSBH). It is possible to reduce PSBH to the Eulerian path problem by restricting the edge positions in the Eulerian graph. This transformation leads to the positional Eulerian path

problem (PEP): Given a Pevzner's graph with a list of allowed positions on each edge, decide if there exists an Eulerian path in which each edge appears in one of its allowed positions. PEP is NP-complete (Hannenhalli et al., 1996), even if all the lists of allowed positions are intervals of equal length.

Due to the centrality of the sequencing problem in biotechnology and in the Human Genome Project, and due to its mathematical elegance, SBH continues to draw a lot of attention. Many authors have suggested ways to improve the basic method. Alternative chip designs (Bains & Smith, 1988; Khrapko, Lysov, Khorlyn, Shick, Florentiev and Mirzabekov, 1989; Pevzner et al., 1991; Preparata, Frieze, & Upfal, 1999), usage of prior sequence information (Pe'er & Shamir, 2000), as well as interactive protocols (Frieze & Halldorsson, 2001; Skiena & Sundaram, 1995) were suggested. Nonetheless, an effective and competitive sequencing solution using SBH has yet to be demonstrated.

3. DNA sequencing with artificial intelligence

The field of molecular biology has been described as tailor-made for AI and machine learning approaches (Shavlik, Hunter, & Searls, 1995). This is because AI approaches perform well in domains where there is an immense amount of data but little theory. Since the introduction of AI to this field, numerous algorithms have been designed and applied to study different data sets. Most of these studies compare a new method with the traditional ones and affirm the effectiveness and efficiency of their methods on particular data sets.

Sequencing of DNA is among the most important tasks in molecular biology. DNA chips are considered to be a more rapid alternative to the more common gel-based methods of sequencing. DNA chips are commonly made with the set of all possible probes eight nucleotides in length (octamers), generating 65,536 unique probes spaced on a 1.6 cm² array (Fodor, Read, Pirrung, Stryer, Lu and Solas, 1991).
For example, consider the DNA target sequence ATTGATTCG, with length N = 9, and a DNA chip with all possible probes of length n = 4. A DNA chip with probe length n will have 4^n positions in the grid on the DNA chip. Thus, for a probe length of 4 there exist 256 grid positions, each associated with a unique probe sequence. All possible 4-nucleotide probes would exist in the set {AAAA, AAAT, ..., TTTT}. In SBH, a probe of appropriate length must be used to unambiguously determine a target of length N. When N is large (>40 nucleotides), a probe of length 4 cannot be used to reconstruct the target with a high probability of success (Fogel, Chellapilla, & Fogel, 1998). As N increases, the probability of redundancy in the target increases, making unambiguous reconstruction difficult (Noble, 1995). Hence AI methods are well suited to resolving this ambiguity in the DNA sequencing problem and obtaining a near-optimal solution.

A hidden Markov model (HMM) is a statistical model that is very well suited to many tasks in molecular biology (Krogh, 1998). The most popular use of the HMM in molecular biology is as a probabilistic profile of a protein family, which is called a profile HMM. From a family of proteins (or DNA) a

profile HMM can be made for searching a database for other members of the family. Boufounos, El-Difrawy, and Ehrlich (2004) used HMMs in DNA sequencing, where they developed an approach to the DNA basecalling problem. In addition, they modeled the state emission densities using artificial neural networks and provided a modified Baum-Welch re-estimation procedure to perform training.

Fuzzy logic is a mathematical framework that is compatible with poorly quantitative yet qualitatively significant data. Fuzzy logic is a natural language for linguistic modeling, and is thus consistent with the qualitative linguistic and graphical methods conventionally used to describe biological systems (Woolf & Wang, 2000). Fuzzy if-then rules were also developed to describe the basic molecular properties and behaviors of DNA inside the living cell (Ji, 2004).

The neural network model has also emerged as a promising AI technique in DNA sequence analysis because this approach might well embody important aspects of intelligence not captured by symbolic and statistical methods (Hatzigeorgiou, Mache, & Reczko, 1996; Hatzigeorgiou, Papanikolaou, & Reczko, 1999). An important direction in this work involved integrating multiple, fundamentally different AI approaches into single hybrid intelligent systems, which let each component perform the tasks for which it is best suited. Integrating symbolic knowledge into a neural network to create a knowledge-based neural network has quickly become an important hybrid-intelligence research area. Empirical observations indicate that such systems can outperform both neural-network and symbolic approaches (Fu, 1999).

3.1. Applications of artificial intelligence in sequencing by hybridization

Multi-agent systems, in which several agents coordinate their knowledge and activities, offer a natural way to view and characterize intelligent systems. Intelligence and interaction are deeply and inevitably coupled, and multi-agent systems reflect this insight.
In this type of environment, an application is usually distributed across multiple agents capable of intelligent coordination. Recently a distributed multi-castes ant system, called DIMANTS, was designed by Bertelle, Dutot, Guinand, and Olivier (2002) for the SBH problem. It is a heuristic approach based on the organization of social insects. The proposed model consists of reactive agents that try to solve the problem by moving over the vertices and edges of a new graph model called the SBH-graph. The associated computational problem consists in rebuilding the original sequence from the SBH-graph. The agents in DIMANTS interact through chemical messages, which are pieces of paths deposited in the vertices.

3.1.1. Sequencing by hybridization using evolutionary programming

Because the DNA sequencing problem is highly complex, solving it is not normally feasible using exact, exponential-time algorithms. Thus a method that runs in polynomial time, and that often returns optimal solutions, is very valuable.
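A polynomial-time heuristic of this kind can be illustrated with a small mutation-only evolutionary search. The sketch below is a generic illustration and not the algorithm of any study cited in this section: the fitness function, the single-base mutation operator, the truncation selection, and all parameter values are assumptions made for the example, and the function names are hypothetical.

```python
import random
from collections import Counter


def spectrum(seq, k):
    """Multi-set of k-mers of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fitness(candidate, target_spectrum, k):
    """Number of spectrum k-mers (with multiplicity) explained by the candidate."""
    shared = spectrum(candidate, k) & target_spectrum  # multi-set intersection
    return sum(shared.values())


def mutate(seq, rng):
    """Replace one randomly chosen position with a random base."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice("ACGT") + seq[pos + 1:]


def evolve(target_spectrum, n, k, pop_size=50, generations=200, seed=0):
    """Mutation-only search: parents and offspring compete, best pop_size survive."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("ACGT") for _ in range(n))
                  for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(parent, rng) for parent in population]
        pool = population + offspring
        pool.sort(key=lambda s: fitness(s, target_spectrum, k), reverse=True)
        population = pool[:pop_size]
    best = population[0]
    return best, fitness(best, target_spectrum, k)


if __name__ == "__main__":
    target = spectrum("ACTGGTC", 3)
    best, score = evolve(target, n=7, k=3)
    print(best, score, "of", sum(target.values()))
```

Each generation costs a number of fitness evaluations proportional to the population size, and each evaluation is linear in the target length, so the total work is polynomial, although an optimal sequence is not guaranteed.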

Consequently, evolutionary programming (EP) has been applied to DNA sequencing (Fogel et al., 1998), and very good results were obtained through numerous simulations. Fogel et al. (1998) developed a simulation of the sequencing by hybridization process and used EP to determine the most suitable target lengths for the set of all tetramer probes. The analysis suggested that probe lengths of 4 nucleotides were best suited to target lengths of 25–35 nucleotides, in agreement with data used by Bains (1991). Fogel et al. (1998) conducted experiments limited to a probe length of 4, with target lengths varying from 10 to 40. The simulations using EP for sequence reconstruction utilized a population size of 500 in trials of 1000 generations, and a population size of 100 in trials of 2000 generations. The parameters of the algorithm were the population size, probe length, and target length. The complexity in those simulations increased linearly with the population size and target length, while it increased exponentially with increasing probe length.

A more advanced analysis was carried out later by Fogel and Chellapilla (1999), where the lengths of the probe n and the DNA target sequence N reached 8 and 100, respectively. They reported that probes of length 5 can be used to successfully reconstruct sequences of up to length 45–50 with a 90% probability of success. Above these target lengths, error scores increase and the required number of trials may become unreasonable. The experiments also showed that probes of length 6 can successfully reconstruct target sequences of length 50–55 with a 90% probability of success. Their results suggested that probes of length 8 could be used to unambiguously determine targets of length 80.

3.1.2.
Sequencing by hybridization using genetic algorithms

Since sequencing the DNA molecule involves massive computation and reconstructing the original DNA sequence is an optimization search, the application of genetic algorithms (GAs) is well justified in the DNA sequencing problem (Blazewicz & Kasprzak, 2002; Douzono, Hara, & Noguchi, 1998; Majee & Sahoo, 2004). Majee and Sahoo (2004) examined the performance of a GA in DNA sequencing through the oligonucleotide hybridization method. They constructed special crossover and mutation operators, and also introduced a special scoring scheme for detecting repetitive units. The experiments showed the construction of sub-sequences of length N + 1 from hybridized probes of length N actually available from the biochemical experiment. For short sequences, the simulation produced near-exact results. For example, when the length of the original DNA sequence is 12 and the number of probes (length N = 4) is 8, the number of subsequences of length N + 1 (i.e. 5) constructed is 7. Here the GA population size was equal to 20 and the number of generations needed was 8. The final sequence after the eighth generation was almost identical to the original sequence except that its length was 11. The GA solution suffered from early convergence of the population, especially when the length of the original DNA sequence reached 22.

A hybrid genetic algorithm (HGA) (Blazewicz & Kasprzak, 2002) solving the DNA sequencing problem with negative

and positive errors was proposed. Their method supplements a standard genetic approach with greedy improvement. The lengths of the original sequences were between 109 and 509, and the length of the oligos was set to 10. The experiments showed that the HGA generated near-optimal solutions, and similarities with the original sequences were very high. For instances of cardinality 100, the HGA returned only original sequences. Similarity of less than 100% was caused by missing information about the last nucleotides in sequences (negative errors). Even for large spectra of lengths reaching 500 with many errors of both types, the HGA sometimes composed optimal sequences. On average, the quality of the obtained solutions ranged from 98.3% to 100% of the optimal values.

The applications of AI in DNA sequencing have been very fruitful in generating effective and competitive solutions. However, there is still a constant need for more advanced and sophisticated tools in computational molecular biology.

4. Intelligent systems in bioinformatics

In the post-genome era, research in bioinformatics has been overwhelmed by experimental data (Tan & Gilbert, 2003). The complexity of biological data ranges from simple strings (nucleotide and amino acid sequences) to complex graphs (biochemical networks), and from 1D (sequence data) to 3D (protein and RNA structures). Considering the amount and complexity of the data, it is becoming impossible for an expert to compute and compare the entries within the current databases. Thus AI and machine learning techniques have been used to analyze biological data sets in order to discover and mine the patterns and similarities existing in various databases.
Tan and Gilbert (2003) performed an empirical comparison of rule-based learning systems (decision trees, one rule, decision rules), statistical learning systems (naïve Bayes, instance based, SVM and artificial neural networks) and ensemble methods (stacking, bagging and boosting) on some available data of E. coli, Yeast, Promoters, and HIV. They reported a comparison of different supervised machine learning techniques in classifying biological data. They also confirmed that no single method could consistently perform well over all the data sets. Their work also showed that combined methods perform better than the individual ones. Kasturi and Acharya (2004) proposed an unsupervised machine-learning algorithm that identifies clusters of genes using combined data (promoter sequences of genes/DNA binding motifs, gene ontologies, and location data). The outcome of their experiments showed that the combined learning approach identified correlated genes effectively.

In data mining one hopes to detect certain patterns in huge amounts of data through unsupervised learning algorithms. Artificial neural networks were used as a data-mining tool to predict and diagnose the occurrence of breast cancer (Chou, Lee, Shao, & Chen, 2004; Pendharkar, Rodger, Yaverbaum, Herman, & Benner, 1999). Page and Craven (2003) have developed several applications of multi-relational data mining to biological data, taking care to cover a broad range of

multi-relational data mining techniques. Rocco and Critchlow (2003) presented an approach for finding classes of bioinformatics data sources and integrating them behind a unified interface. The outcome of their work is intended to eliminate the human effort required to maintain a current repository of sources. Tang and Zhang (2003) proposed a model for simultaneously mining both empirical and hidden phenotype structures from gene expression data.

4.1. Data management in bioinformatics

Since the advent of modern molecular biology, scientists have been building databases analyzing and documenting every conceivable aspect of the data (Conte et al., 2000; Laskowski, 2001). Most of these databases are at least partially maintained; human intelligence still plays a very important role in analysis. Recent developments in molecular biology have resulted in automated methods that are capable of generating vast volumes of raw experimental data. Perhaps the best-known example of this phenomenon is DNA sequencing; many bioinformatics projects are dedicated to the annotation and analysis of sequence data (Kulikova, Aldebert, Althorpe, Baker, Bates and Browne, 2004).

Kumar, Palakal, Mukhopadhyay, Stephens, and Li (2004) developed a knowledge base called BioMap using the MEDLINE collection, which contains over 12 million citations and author abstracts from over 4600 biomedical journals. They also presented the organization of a distributed database system to maintain the knowledge base of BioMap. Bresciani, Fontana, and Busetta (2002) introduced a novel paradigm for providing knowledge-based flexible query interfaces to object-oriented biological databases. The prototype they developed has the advantage of adopting a semantic approach with the capability of reasoning on the semantics of the queries. These features are aimed at improving the interaction between non-expert users and complex databases.
Wang, Kuo, Chen, Hsiao, and Tsai (2005) described the framework of KSPF (Knowledge Sharing for Protein Families), which is applicable to all types of protein families. Palakal, Mukhopadhyay, and Mostafa (2002) designed and implemented an information management system prototype, called BioSifter, applied in bioinformatics. Their tool was able to automatically retrieve relevant text documents from the biological literature based on a user's interest profile. The BLAST (Basic Local Alignment Search Tool) family of applications allows biologists to find homologs of an input sequence in DNA and protein sequence libraries (Altschul, Gish, Miller, Myers, & Lipman, 1990). BLAST is an example of an application that has been enhanced as a web source, which provides dynamic access to large data sets.

4.2. Data mining tools and challenges

Molecular biology is fortunate to have a plethora of distributed databases. The task of mining these databases is particularly challenging because it can require the integration of generic data mining approaches with relatively deep

biological models (Altman, 2001). The Internet has allowed easy publication of data mining tools, most of which are licensed free of charge for academic use (Bergeron, 2003). Typically, researchers use a number of tools such as these on a day-to-day basis. Very often, pipelines are constructed by hand or cobbled together using hastily written Perl and Python scripts (Barker and Thornton, 2004). Various academic bioinformatics-specific data mining tools are available, including MEME (Bailey & Elkan, 1994), Pratt (Jonassen, 1997; Jonassen, Collins, & Higgins, 1995), PIMA (Smith & Smith, 1992), and SPEXS (Vilo, 1998). MEME (Multiple EM for Motif Elicitation) is a motif discovery tool. Pratt, a stand-alone pattern discovery tool, is designed to uncover patterns conserved in sets of unaligned protein sequences. PIMA (Pattern-Induced Multi-sequence Alignment program) can be used to perform a multi-sequence alignment of a set of sequences. SPEXS (Sequence Pattern Exhaustive Search) is a sequence pattern discovery tool.

The integration of biological and genomic sources represents an important challenge to biological systems. Hernandez and Kambhampati (2004) surveyed the challenges that an integration system for biological sources must face due to several factors, such as the variety and amount of data available, the representational heterogeneity of the data in the different sources, and the autonomy and differing capabilities of the sources. Bajic, Brusic, Li, Ng, and Wong (2003) described advances made in data integration and data mining technologies that are relevant to molecular biology and biomedical sciences. Web portals, such as Entrez and, to a lesser extent, ExPASy, represent the first level of integration of bioinformatics data, methodologies, and tools (Bergeron, 2003). Very large databases replete with multiple potential relationships present scalability issues that may require significant computational time on powerful computer systems.
Moreover, many of the traditional data mining methods were developed for homogeneous numerical data. However, bioinformatics databases increasingly hold text sequences, protein structures, and other data sets that are anything but homogeneous. Hence the need for more sophisticated data mining techniques for bioinformatics is fundamental for the success of intelligent bioinformatics systems.

4.3. Multi-agent systems in bioinformatics

Because of the challenges posed by the massive amount of raw biological data currently available on the Internet, some researchers proposed the use of a multi-agent system (MAS) to cope with those challenges. Decker, Zheng, and Schmidt (2001) designed and implemented a MAS as an information-gathering tool, which was used to analyze data and provide important knowledge to researchers. They used DECAF (Graham & Decker, 1999), a multi-agent system toolkit, to construct a prototype multi-agent system for automated annotation and database storage of sequencing data for herpesviruses. Bryson, Luck, Joy, and Jones (2000)

proposed and developed GeneWeaver, a multi-agent system that is being applied to the very real and demanding problems of genome analysis and protein structure prediction. Karasawas, Baldock, and Burger (2002) designed a multi-agent bioinformatics integration system that aims at helping users make the most of the available integration facilities while also increasing their trust in the system. Their system provided an explanation facility that helps the users to better understand how the system has arrived at a rez-particular answer. Delgado, Fajardo, Gibaja, and Pe rez (2005) presented the development of BioMen Pe (Biological Management Executed over Network) managed by means of a Multi-Agent System MAS has also been applied to the task of biological network simulation. Khan, Makkena, McGeary, Decker, Gillis and Schmidt (2003) described the simulation of signal transduction (ST) networks using the DECAF (Graham & Decker, 1999) MAS architecture. In their system, a molecular species is modeled as an individual agent with hierarchical task network structures to represent self- and externally initiated reactions. An agents identity was determined by a rule le (one for every participating molecular species) that species the reactions it may participate in, as well as its initial concentration. Reactions within the system were actuated by inter-agent communication. Srinivasan, Mitchell, Bodenreider, Pant, and Menczer (2002) presented preliminary research on different retrieval agents and tested their ability to retrieve biomedical information, whose relevance is assessed using both genetic and ontological expertise. Moreau et al. (Moreau, Miles, Goble, Greenwood, Dialani and Addis, 2003) developed the myGrid architecture. This grid aims to provide a personalized environment for bioscientists, which helps them to automate, repeat and therefore better achieve their biological experiments. 5. 
Concluding remarks Bioinformatics grew out of molecular biology as it became clear that specialist skills were needed to organize and analyze the data being generated. Now molecular biology has reached a stage of development where continued progress depends on (1) combining the intelligent systems with the sheer volume of biomedical research and (2) combining the knowledge of the tiny molecular mechanisms with knowledge of the biological systems. Biological literature databases continue to grow rapidly with vital information that is important for conducting sound biomedical research. As data and information space continue to grow exponentially, the need for rapidly surveying the published literature, synthesizing, and discovering the embedded knowledge is becoming critical to allow the researchers to conduct informed work, avoid repetition, and generate new hypotheses. Knowledge, in this case, is dened as one-to-many and many-to-many relationships among biological entities such as gene, protein, drug, disease, etc. The knowledge discovery process

Z. Ezziane / Expert Systems with Applications 30 (2006) 210 Barker, J., & Thornton, J. (2004). Software engineering challenges in bioinformatics. Proceedings of the 26th international conference on software engineering (ICSE04), Edinburgh, Scotland, UK (pp. 1215). Bendor, A., Peer, I., Shamir, R., & Sharan, R. (2001). On the complexity of positional sequencing by hybridization. Journal of Computational Biology, 8(4), 361371. Bergeron, B. (2003). Bioinformatics computing. Upper Saddle River, NJ: Prentice-Hall. Bertelle, C., Dutot, A., Guinand, F., & Olivier, D. (2002). DNA sequencing hybridization based on multi-castes ant system. Proceedings of agents in bioinformatics (NETTAB02), Bologna, Italy (pp. 17). Blazewicz, J., Formanowicz, P., Kasprzak, M., Markiewicz, J., & Welglarz, J. (1999). DNA sequencing with positive and negative errors. Journal of Computational Biology, 6, 113123. Blazewicz, J., & Kasprzak, M. (2002). Hybrid genetic algorithm for DNA sequencing with errors. Journal of Heuristics, 8, 495502. Blazewicz, J., & Kasprzak, M. (2003). Complexity of DNA sequencing by hybridization. Theoretical Computer Science, 290, 14591473. Boufounos, P., El-Difrawy, S., & Ehrlich, D. (2004). Basecalling using hidden Markov models. Journal of the Franklin Institute, 341(12), 23 36. Bresciani, P., Fontana, P., & Busetta, P. (2002). A knowledge based interface for distributed biological databases. Proceedings of the international workshop on agents in bioinformatics (NETTAB-02), Bologna, Italy ,http// sra.itc.it/people/bresciani/publications/nettab-2002.pdf. Broude, S. D., Sano, T., Smith, C. S., & Cantor, C. R. (1994). Enhanced DNA sequencing by hybridization. Proceedings of the National Academy of Sciences USA, 91, 30723076. Bryson, K., Luck, M., Joy, M., & Jones, D. T. (2000). Applying agents to bioinformatics in GeneWeaver Lecture notes in articial intelligence, (pp. 6071) Vol. 1860. Chou, S. M., Lee, T. S., Shao, Y. E., & Chen, I. F. (2004). 
Mining the breast cancer pattern using articial neural networks and multivariate adaptive regression splines. Expert Systems with Applications, 27(1), 133142. Cohen, J. (2004). BioinformaticsAn introduction for computer scientist. ACM Computing Surveys, 36(2), 122158. Decker, K., Zheng, S., & Schmidt, C. (2001). A multi-agent system for automated genomic annotation. Proceedings of the fth international conference on autonomous agents, Montreal, Canada (pp. 433440). rez-Pe rez, R. (2005). BioMen: An Delgado, M., Fajardo, W., Gibaja, E., & Pe information system to herbarium. Expert Systems with Applications, 28(3), 507518. Douzono, H., Hara, S., & Noguchi, Y. (1998). An application of genetic algorithm to DNA sequencing by oligonucleotide hybridization. Proceedings of the IEEE international joint symposia on intelligence and systems, Rockville, Maryland, USA (pp. 9298). Drmanac, R., Hood, L., & Crkvenjakov, R. (1993). DNA sequence determination by hybridization: A strategy for efcient large scale sequencing. Science, 260, 16491652. Drmanac, R., Labat, I., Brunker, I., & Crkvenjakov, R. (1989). Sequencing of megabase plus DNA by hybridization: Theory of the method. Genomics, 4, 114128. Dyer, M. E., Frieze, A. M., & Suen, S. (1994). The probability of unique solutions of sequencing by hybridization. Journal of Computational Biology, 1, 105110. Fodor, S. P. A., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T., & Solas, D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science, 251, 767773. Fogel, G. B., & Chellapilla, K. (1999). Simulated sequencing by hybridization using evolutionary programming. Proceedings of the congress on evolutionary computation, Washington DC, USA (pp. 463469). Fogel, G. B., Chellapilla, K., & Fogel, D. B. (1998). Reconstruction of DNA sequence information from simulated DNA chip using evolutionary programming. Proceedings of the seventh annual conference on evolutionary programming, San Diego, CA, USA (pp. 429436).

basically involves identication of biological object names, reference resolution, ontology and synonym discovery, and nally extracting objectobject relationships. The process of biological knowledge discovery is also evolving in terms of data and information being created continuously. Knowledge discovery process is difcult and requires profound understanding of both knowledge discovery and computational biology to identify problems and optimization criteria which, when maximized by knowledge discovery algorithms, actually contribute to a better understanding of biological systems. Identication of appropriate knowledge discovery problems and development of evaluation methods for knowledge discovery results are ongoing efforts. In order to build knowledge discovery systems that contribute to our understanding of biological systems, solutions to the above problems have to be assembled into efcient and scalable systems. The eld of molecular biology is described as tailor-made for AI methods. This is due to the nature of AI approaches that performs well in domains where there is an immense amount of data but little theory. Since the introduction of AI to this eld, numerous algorithms have been designed and applied to study different data sets. The intellectual challenges to knowledge processing in bioinformatics and computational molecular biology are exciting and promise to provide problems that will continue to drive the development of improved tools for intelligent systems. The topics covered in this article are applications of AI in bioinformatics and DNA sequencing. The outcome of this research demonstrates the need for improving the existing tools, which are already being applied to extract important knowledge and to nd out useful patterns from the massive amount of raw biological data. 
A strong but not easy interaction between computer scientists and biologists will be very useful in order to infer which tools are best suited to help biologists tackle unsolved problems.

References
Adleman, L. M. (1998). Location sensitive sequencing of DNA. Technical report, University of Southern California. Altman, R. B. (2001). Challenges for intelligent systems in biology. IEEE Intelligent Systems, 16(6), 1418. Altschul, S. F., Gish, W., Miller, W., Meyers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403410. Bailey, T. L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the second international conference on intelligent systems for molecular biology (pp. 2836). Bains, W. (1991). Hybridization methods for DNA sequencing. Genomics, 11, 294301. Bains, W., & Smith, G. C. (1988). A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, 135, 303307. Bajic, V. B., Brusic, V., Li, J., Ng, S. K., & Wong, L. (2003). From informatics to bioinformatics. Proceedings of the rst AsiaPacic bioinformatics conference on bioinformatics, Adelaide, Australia (pp. 312).

Frieze, A., & Hallorsson, B. V. (2001). Optimal sequencing by hybridization. Proceedings of the fifth annual international conference on computational molecular biology (RECOMB01), Montreal, Canada (pp. 141–148).
Fu, L. (1999). An expert network for DNA sequence analysis. IEEE Intelligent Systems, 14(1), 65–71.
Graham, J. R., & Decker, K. (1999). Towards a distributed, environment-centered agent framework. Lecture notes in computer science, Vol. 1757 (pp. 290–304).
Gusfield, D. (2004). Introduction to the IEEE/ACM transactions on computational biology and bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 2–3.
Gusfield, D., Karp, R., Wang, L., & Stelling, P. (1998). Graph traversals, genes and matroids: An efficient case of the traveling salesman problem. Discrete Applied Mathematics, 88, 167–180.
Halperin, E., Halperin, S., Hartman, T., & Shamir, R. (2002). Handling long targets and errors in sequencing by hybridization. Proceedings of the sixth annual international conference on computational molecular biology (RECOMB02), Washington, DC, USA (pp. 176–185).
Hannenhalli, S., Pevzner, P., Lewis, H., & Skiena, S. (1996). Positional sequencing by hybridization. Computer Applications in the Biosciences, 12, 19–24.
Hatzigeorgiou, A. G., Mache, N., & Reczko, M. (1996). Functional site prediction on the DNA sequence by artificial neural networks. Proceedings of the IEEE international joint symposia in intelligence and systems (IJSIS96), Rockville, MD, USA (pp. 12–17).
Hatzigeorgiou, A. G., Papanikolaou, H., & Reczko, M. (1999). Finding the reading frame in protein coding regions on DNA sequences: A combination of statistical and neural network methods. In M. Mohammadian (Ed.), Computational intelligence for modeling, control and automation (pp. 148–153). Vienna, OH: IOS Press.
Hernandez, T., & Kambhampati, S. (2004). Integration of biological sources: Current systems and challenges ahead. ACM SIGMOD Record, 33(3), 51–60.
Ji, S. (2004). Molecular information theory: Solving the mysteries of DNA. In G. Ciobanu, & G. Rozenberg (Eds.), Modeling in molecular biology. Natural computing series (pp. 141–150). Berlin: Springer.
Jonassen, I. (1997). Efficient discovery of conserved patterns using a pattern graph. Computer Applications in the Biosciences, 13, 509–522.
Jonassen, I., Collins, J. F., & Higgins, D. G. (1995). Finding flexible patterns in unaligned protein sequences. Protein Science, 4, 1587–1595.
Karasawas, K., Baldock, R., & Burger, A. (2002). A multi-agent bioinformatics integration system with adjustable autonomy: An overview. Proceedings of the first international joint conference on autonomous agents and multiagent systems: Part 1, Bologna, Italy (pp. 302–303).
Kasturi, J., & Acharya, R. (2004). Clustering of diverse genomic data using information fusion. Proceedings of the 2004 ACM symposium on applied computing, Nicosia, Cyprus (pp. 116–120).
Khan, S., Makkena, R., McGeary, F., Decker, K., Gillis, W., & Schmidt, C. (2003). A multi-agent system for the quantitative simulation of biological networks. Proceedings of the second international joint conference on autonomous agents and multiagent systems, Melbourne, Australia (pp. 385–392).
Khrapko, K. R., Lysov, Y. P., Khorlyn, A. A., Shick, V. V., Florentieve, V. L., & Mirzabekov, A. D. (1989). An oligonucleotide hybridization approach to DNA sequencing. FEBS Letters, 256, 118–122.
Krogh, A. (1998). An introduction to hidden Markov models for biological sequences. In S. L. Salzberg, D. B. Searls, & S. Kasif (Eds.), Computational methods in molecular biology (pp. 45–63). Amsterdam: Elsevier.
Kulikova, T., Aldebert, P., Althorpe, N., Baker, W., Bates, K., Browne, P., et al. (2004). The EMBL nucleotide sequence database. Nucleic Acids Research, 32, 27–30.
Kumar, K., Palakal, M., Mukhopashyay, S., Stephens, M., & Li, H. (2004). BioMap: Toward the development of a knowledge base of biomedical literature. Proceedings of the 2004 ACM symposium on applied computing, Nicosia, Cyprus (pp. 121–127).

Laskowski, R. (2001). PDBsum: Summaries and analyses of PDB structures. Nucleic Acids Research, 29, 221–222.
Lo Conte, L., Ailey, B., Hubbard, T., Brenner, S., Murzin, A., & Chothia, C. (2000). SCOP: A structural classification of proteins database. Nucleic Acids Research, 28(1), 257–259.
Majje, C., & Sahoo, G. (2004). DNA sequencing by oligonucleotide hybridization: A genetic algorithm approach. Proceedings of the international conference on genetics and evolutionary computation (GECCO), Seattle, Washington, USA. http://www.cs.bham.ac.uk/~wbl/biblio/gecco2004/WSOE001.pdf
Moreau, L., Miles, S., Goble, C., Greenwood, M., Dialani, V., Addis, M., et al. (2003). On the use of agents in a bioinformatics grid. Proceedings of the third international symposium on cluster computing and the grid, Tokyo, Japan (pp. 653–660).
Noble, D. (1995). DNA sequencing on a chip. Analytical Chemistry, 67(5), 201A–204A.
Page, D., & Craven, M. (2003). Biological applications of multi-relational data mining. ACM SIGKDD Explorations Newsletter, 5(1), 69–79.
Palakal, M., Mukhopashyay, S., & Mostafa, J. (2002). An intelligent biological information management system. Proceedings of the 2002 ACM symposium on applied computing, Madrid, Spain (pp. 159–163).
Peer, I., & Shamir, R. (2000). Spectrum alignment: Efficient resequencing by hybridization. Proceedings of the eighth international conference on intelligent systems in molecular biology (ISMB00), La Jolla, California, USA (pp. 260–268).
Pendharkar, P. C., Rodger, J. A., Yaverbaum, G. J., Herman, N., & Benner, M. (1999). Association, statistical, mathematical and neural approaches for mining breast cancer patterns. Expert Systems with Applications, 17(3), 223–232.
Pevzner, P. A. (1989). l-Tuple DNA sequencing. Journal of Biomolecular Structure and Dynamics, 7, 63–73.
Pevzner, P. A., & Lipshutz, R. (1994). Towards DNA sequencing chips. Proceedings of the international conference on mathematical foundations of computer science. Lecture notes in computer science, Vol. 841 (pp. 143–158).
Pevzner, P. A., Lysov, Y. P., Khrapko, K. R., Belyavsky, A. V., Florentiev, V. L., & Mirzabekov, A. D. (1991). Improved chips for sequencing by hybridization. Journal of Biomolecular Structure and Dynamics, 9, 399–410.
Preparata, F., Frieze, A., & Upfal, E. (1999). On the power of universal bases in sequencing by hybridization. Proceedings of the third annual international conference on computational molecular biology (RECOMB99), Lyon, France (pp. 295–301).
Preparata, F., & Upfal, E. (2000). Sequencing-by-hybridization at the information-theory bound: An optimal algorithm. Proceedings of the fourth annual international conference on computational molecular biology (RECOMB00), Tokyo, Japan (pp. 245–253).
Rocco, D., & Critchlow, T. (2003). Automatic discovery and classification of bioinformatics web sources. Bioinformatics, 19(15), 1927–1933.
Shamir, R., & Tsur, D. (2001). Large scale sequencing by hybridization. Proceedings of the fifth annual international conference on computational molecular biology (RECOMB01), Montreal, Canada (pp. 269–277).
Shavlik, J., Hunter, L., & Searls, D. (1995). Introduction. Machine Learning, 21, 5–10.
Skiena, S. S., & Sunaram, G. (1995). Reconstructing strings from substrings. Journal of Computational Biology, 2, 333–353.
Smith, R. D., & Smith, T. F. (1992). Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative modelling. Protein Engineering, 5(1), 35–41.
Srinivasan, P., Mitchell, J., Bodenreider, O., Pant, G., & Menczer, F. (2002). Web crawling agents for retrieving biomedical information. Proceedings of the international workshop on agents in bioinformatics (NETTAB-02), Bologna, Italy.
Tan, A. C., & Gilbert, D. (2003). An empirical comparison of supervised machine learning techniques in bioinformatics. Proceedings of the first Asia-Pacific bioinformatics conference (APBC2003), Adelaide, Australia (pp. 219–222).


Tang, C., & Zhang, A. (2003). Mining multiple phenotype structures underlying gene expression profiles. Proceedings of the 12th international conference on information and knowledge management, New Orleans, LA, USA (pp. 418–425).
Vilo, J. (1998). Discovering frequent patterns from strings. Technical report C-1998-9. Department of Computer Science, University of Helsinki.
Wang, H. C., Kuo, H. C., Chen, H. H., Hsiao, Y. Y., & Tsai, W. C. (2005). KSPF: Using gene sequence patterns and data mining for biological knowledge management. Expert Systems with Applications, 28(3), 537–545.
Waterman, M. S. (1995). Introduction to computational biology. London: Chapman & Hall.
Woolf, P. J., & Wang, Y. (2000). A fuzzy logic approach to analyzing gene expression data. Physiological Genomics, 3, 9–15.
