VIDYAA VIKAS COLLEGE OF ENGINEERING & TECHNOLOGY, Tiruchengode, Namakkal
Prepared by, Srimathi.K Pavithra.B
B.Tech(IT), Pre-final yr.


Biology and computers share a natural affinity. Physicist Schrödinger envisioned life as an aperiodic crystal, observing that the organizing structure of life is neither completely regular, like a pure crystal, nor completely chaotic and without structure, like dust in the wind. This is why biological information has never satisfactorily yielded to classical mathematical analysis. Machine computation combines elegant algorithms with brute-force calculations, which seems a reasonable approach to this aperiodic structure. The solutions to various problems lie in the domain of organic matter. Thus, examining how organisms solve problems can lead to new computation- and algorithm-development approaches that devour the problems that are difficult to tackle in the laboratory but easy to approach using a computer.

Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. In the first part of this paper, we give a brief introduction to bioinformatics and data mining and their relationship. In the later part, we deal with data mining approaches in bioinformatics and their applications, particularly in biomedical and DNA data analysis.

THE ITINERARY
1. Bioinformatics - Where Biology meets Computer science
2. Data mining - an introduction
3. What is a biological database?
4. Why need data mining in bioinformatics?
5. Challenges in bio-industry
6. Approaches of data mining in Bioinformatics
   - Influence based mining
   - Affinity-based mining
   - Time delay data mining
   - Trend-based mining
   - Predictive data mining
   - Comparative data mining
7. Data mining for Biomedical and DNA data analysis
8. Conclusion

Defining Bioinformatics
Bioinformatics is the computer-assisted data management discipline that helps us gather, analyze, and represent biological information in order to understand life's processes. Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying "informatics" techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale.

Bioinformatics: where biology meets computer science
Biology is the youngest of the natural sciences. When its collected information reaches a critical density, a natural science progresses from information gathering to information processing. Combining cold silicon and hot protoplasm may constitute a marriage of opposites, but this union could produce genetic research prodigies.

These days biologists use computers routinely to assist with many activities, including biomolecular sequence alignment, assembly of DNA pieces, multivariate analysis of large-scale gene expression, and metabolic pathway analysis.

Currently, the most successful uses of computers in biology are comparative sequence analysis and in silico cloning, the process of using a computer search of existing databases to clone a gene.


Data mining - a synonym for KDD
Data mining can be defined as the process of extracting hidden predictive information from large databases. Data mining, by its simplest definition, automates the detection of relevant patterns in a database. For example, a pattern might indicate that married males with children are more likely to drive a particular sports car than married males with no children.

Data mining uses well-established statistical and machine learning techniques to build models that predict customer behavior. Today, technology automates the mining process, integrates it with commercial data warehouses, and presents it in a relevant way for business users.
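
The married-males pattern mentioned above can be made concrete by simple counting. The sketch below uses a small invented table (the records and the function name `sports_car_rate` are assumptions for illustration, not data from any study) and compares how often the sports car occurs in each group.

```python
# Hypothetical toy records for married males: (has_children, drives_sports_car).
# The values are invented for illustration only.
records = [
    (True, True), (True, True), (True, False), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

def sports_car_rate(rows, has_children):
    """Fraction of rows in the given group that drive the sports car."""
    group = [drives for children, drives in rows if children == has_children]
    return sum(group) / len(group)

rate_with = sports_car_rate(records, True)       # 2 of 4 records
rate_without = sports_car_rate(records, False)   # 1 of 4 records
ratio = rate_with / rate_without                 # how much more likely
print(f"with children: {rate_with:.2f}, without: {rate_without:.2f}, ratio: {ratio:.1f}")
```

A real data mining tool would search over many attribute combinations and report only those whose ratio is both large and statistically supported; this sketch checks a single, pre-chosen pattern.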

What is a Biological Database?
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence. For researchers to benefit from the data stored in a database, two additional requirements must be met:
- Easy access to the information; and
- A method for extracting only that information needed to answer a specific biological question.

Need for data mining in bioinformatics

The growth of biological information databases follows an exponential curve that closely mimics Moore's law, doubling every 18 months or so. By helping researchers process this vast collection of data, computer science can assist in dispersing this information storm. More than )7,(((,((( biological abstracts are awaiting information extraction, and the number is still growing.
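
The doubling claim above corresponds to the formula size(t) = size(0) × 2^(t/18), with t in months. A minimal sketch, assuming an idealized and perfectly regular 18-month doubling (real databases will not grow this smoothly; the function name `growth_factor` is ours):

```python
def growth_factor(months, doubling_period=18.0):
    """Factor by which a quantity grows after `months`,
    if it doubles every `doubling_period` months."""
    return 2.0 ** (months / doubling_period)

# Growth factor after a few multiples of the doubling period.
for years in (1.5, 3, 6, 9):
    print(f"after {years} years: x{growth_factor(years * 12):.0f}")
```

Under this assumption a database is 64 times larger after nine years, which is why manual curation cannot keep pace.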

The biopharmaceutical industry is generating more chemical and biological screening data than it knows what to do with or how best to handle. As a result, deciding which targets and lead compounds to develop further is often a long and arduous task.
- Medical data has increased dramatically.
- Manual analysis is not adequate.
- The traditional data analysis methods are not adequate to deal with the enormous data flow.
- Data mining is necessary.
- Comprehensive pre-processing facilities are included.
- The generated rules were simple to understand.
- In the medical domain, the primary objective was explanation rather than prediction.
- Medical databases typically have a high proportion of missing values.
- The data mining software can efficiently handle the missing values.

Challenges in Bio-industry

Explaining the scale of data that needs to be handled in biotechnology, Oracle General Manager S. Grover says, "There are %7,((( genomes with ).& million proteins in them. Each genome requires approximately %(( terabytes of trace files. So %7,((( times %((TB is massive. Medical imaging generates $(( million GB of data annually. Each mass spectrometer generates 7(( GB of data daily. Multiply this by 1000's of mass spectrometers in use in the world today and you get the picture." This sheer volume of data calls for intelligent databases.

Biologists sometimes can't agree on the very definitions and concepts the databases are supposed to manage. In genomics, the data entered is not accurate and precise. Even if it is standardized, searching a colossal database is no mean task. Another problem is that databases created by different organizations store information idiosyncratically, creating different file formats that cannot talk to each other.

To begin with, biological data is itself complex and interlinked. A spot on a DNA array, for instance, is connected not only to immediate information about its intensity, but to layers of information about genomic location, DNA sequence, structure, function, and much more. Creating information systems that allow biologists to seamlessly follow these links without getting lost in a sea of information is a challenge for computer scientists. Data mining with elegant algorithms seems to be a better solution.


Approaches of Data mining in Bioinformatics

Influence-based mining:
Complex and granular (as opposed to linear) data in large databases are scanned for influences between specific data sets, and this is done along many dimensions and in multi-table formats. These systems find applications wherever there are significant cause-and-effect relationships between data sets, as occurs, for example, in large and multivariant gene expression studies, which are behind areas such as pharmacogenomics.


Affinity-based mining:
Large and complex data sets are analyzed across multiple dimensions, and the data-mining system identifies data points or sets that tend to be grouped together. These systems differentiate themselves by providing hierarchies of associations and showing any underlying logical conditions or rules that account for the specific groupings of data. This approach is particularly useful in biological motif analysis, whereby it is important to distinguish "accidental" or incidental motifs from ones with biological significance.

Time delay data mining:

The data set is not available immediately and in complete form, but is collected over time. The systems designed to handle such data look for patterns that are confirmed or rejected as the data set increases and becomes more robust. This approach is geared toward long-term clinical trial analysis and multicomponent mode of action studies.

Trend-based mining:

The software analyzes large and complex data sets in terms of any changes that occur in specific data sets over time. The data sets can be user-defined, or the system can uncover them itself. Essentially, the system reports on anything that is changing over time.
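
As a rough illustration of reporting "anything that is changing over time", the sketch below fits a least-squares slope to each series and flags those whose slope exceeds a threshold. The series names, values, and threshold are invented, and a plain slope is our stand-in for whatever statistics a real trend-mining product uses.

```python
def slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def changing_series(series_by_name, threshold=0.5):
    """Return {name: slope} for series whose absolute slope exceeds the threshold."""
    trends = {name: slope(vals) for name, vals in series_by_name.items()}
    return {name: s for name, s in trends.items() if abs(s) > threshold}

# Hypothetical measurements taken at equally spaced time points.
measurements = {
    "marker_a": [1.0, 2.0, 3.0, 4.0, 5.0],   # rising steadily
    "marker_b": [3.0, 3.1, 2.9, 3.0, 3.0],   # essentially flat
    "marker_c": [9.0, 7.0, 5.0, 3.0, 1.0],   # falling steadily
}
print(changing_series(measurements))
```

Here only `marker_a` and `marker_c` are reported; the flat series is filtered out.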

Comparative data mining:
It focuses on overlaying large and complex data sets that are similar to each other and comparing them. This is particularly useful in all forms of clinical trial meta-analyses, where data collected at different sites over different time periods, and perhaps under similar but not always identical conditions, need to be compared. Here, the emphasis is on finding dissimilarities, not similarities.

Predictive data mining:

Data mining alone is lacking somewhat if it is unable to also offer a framework for making simulations, predictions, and forecasts based on the data sets it has analyzed. It combines pattern matching, influence relationships, time set correlations, and dissimilarity analysis to offer simulations of future data sets.



Data mining for biomedical and DNA data analysis:
The past decade has seen an explosive growth in biomedical research, ranging from the development of new pharmaceuticals and advances in cancer therapies to the identification and study of the human genome by discovering large-scale sequencing patterns and gene functions. Since a great deal of biomedical research has focused on DNA data analysis, we study this application here. Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as the discovery of new medicines and approaches for disease diagnosis, prevention, and treatment.

An important focus in genome research is the study of DNA sequences, since such sequences form the foundation of the genetic codes of all living organisms. All DNA sequences are comprised of four building blocks (called nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). These four nucleotides are combined to form long sequences or chains that resemble a twisted ladder.
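
A DNA sequence can be handled as a plain string over the four letters A, C, G, and T. The sketch below validates a sequence, counts its nucleotides, and builds the reverse complement, which relies on the standard base pairing (A with T, C with G) that forms the rungs of the ladder. The example sequence is arbitrary and the function names are ours.

```python
from collections import Counter

# Standard base pairing between the two strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def validate(seq):
    """Upper-case the sequence and reject anything other than A, C, G, T."""
    seq = seq.upper()
    bad = set(seq) - set(COMPLEMENT)
    if bad:
        raise ValueError(f"unexpected symbols: {sorted(bad)}")
    return seq

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(validate(seq)))

seq = "ATGCGTTA"  # arbitrary example sequence
print(Counter(validate(seq)))
print(reverse_complement(seq))
```

Because the data are symbolic rather than numeric, most later analysis steps reduce to string operations of this kind.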


Human beings have around 100,000 genes. A gene is usually comprised of hundreds of individual nucleotides arranged in a particular order. There are almost an unlimited number of ways that the nucleotides can be ordered and sequenced to form distinct genes. It is challenging to identify particular gene sequence patterns that play roles in various diseases. Since many interesting sequential pattern analysis and similarity search techniques have been developed in data mining, data mining has become a powerful tool and contributes substantially to DNA analysis in the following ways.

Semantic integration of heterogeneous, distributed genome databases:
Due to the highly distributed, uncontrolled generation and use of a wide variety of DNA data, the semantic integration of such heterogeneous and widely distributed genome databases becomes an important task for systematic and coordinated analysis of DNA databases. This has promoted the development of integrated data warehouses and distributed federated databases to store and manage the primary and derived genetic data. Data cleaning and data integration methods developed in data mining will help the integration of genetic data and the construction of data warehouses for genetic data analysis.
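
As a small illustration of the cleaning and integration step, the sketch below merges gene records from two sources that use different layouts and identifier conventions. Both sources, their field names, the gene symbols, and the annotations are invented; real genome databases have far richer schemas.

```python
def normalize_id(raw):
    """Canonical gene symbol: surrounding whitespace removed, upper-cased."""
    return raw.strip().upper()

# Hypothetical source A: dictionaries with 'gene' and 'chrom' fields.
source_a = [
    {"gene": "abc1 ", "chrom": "1"},
    {"gene": "XYZ2", "chrom": "7"},
]
# Hypothetical source B: (symbol, function note) tuples.
source_b = [
    ("Abc1", "hypothetical kinase"),
    ("def3", "hypothetical transporter"),
]

def integrate(a_rows, b_rows):
    """Merge both sources into one dict keyed by normalized gene symbol."""
    merged = {}
    for row in a_rows:
        key = normalize_id(row["gene"])
        merged.setdefault(key, {})["chromosome"] = row["chrom"]
    for symbol, note in b_rows:
        key = normalize_id(symbol)
        merged.setdefault(key, {})["function"] = note
    return merged

warehouse = integrate(source_a, source_b)
for symbol in sorted(warehouse):
    print(symbol, warehouse[symbol])
```

The two spellings of the first gene end up in a single record, while genes known to only one source keep whatever fields that source provides.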

Similarity search and comparison among DNA sequences:
One of the most important search problems in genetic analysis is similarity search and comparison among DNA sequences. Gene sequences isolated from diseased and healthy tissues can be compared to identify critical differences between the two classes of genes. This can be done by first retrieving the gene sequences from the two tissue classes, and then finding and comparing the frequently occurring patterns of each class. Usually, sequences occurring more frequently in the diseased samples than in the healthy samples might indicate the genetic factors of the disease; on the other hand, those occurring only more frequently in the healthy samples might indicate mechanisms that protect the body from the disease.

Although genetic analysis requires similarity search, the technique needed here is quite different. For example, some of the data transformation methods, which are popularly used in the analysis of time-series data, are ineffective for genetic data, since such data are nonnumeric and the precise interconnections between different kinds of nucleotides play an important role in their function. On the other hand, the analysis of frequent sequential patterns is important in the analysis of similarity and dissimilarity in genetic sequences.

Association analysis: identification of co-occurring gene sequences:
Currently, many studies have focused on the comparison of one gene to another. However, most diseases are not triggered by a single gene but by a combination of genes acting together. Association analysis methods can be used to help determine the kinds of genes that are likely to co-occur in target samples. Such analysis would facilitate the discovery of groups of genes and the study of interactions and relationships between them.
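
The co-occurrence idea can be sketched by counting how often pairs of genes appear in the same sample and keeping the pairs whose support (the fraction of samples containing both) reaches a threshold. This is only the pair-counting core of association analysis, not a full algorithm such as Apriori; the sample contents, gene names, and threshold are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(samples, min_support=0.5):
    """Return {(gene1, gene2): support} for pairs found in at least
    min_support of the samples."""
    counts = Counter()
    for genes in samples:
        # sorted() gives each unordered pair a single canonical key
        counts.update(combinations(sorted(set(genes)), 2))
    n = len(samples)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical target samples: each is the set of genes found active in it.
samples = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g2", "g4"},
]
for pair, support in sorted(frequent_pairs(samples).items()):
    print(pair, f"{support:.2f}")
```

With these samples the pairs (g1, g2) and (g2, g3) pass the threshold, suggesting groups of genes worth studying together.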


Path analysis: linking genes to different stages of disease development:
While a group of genes may contribute to a disease process, different genes may become active at different stages of the disease. If the sequence of genetic activities across the different stages of disease development can be identified, it may be possible to develop pharmaceutical interventions that target the different stages separately, therefore achieving more effective treatment of the disease. Such path analysis is expected to play an important role in genetic studies.

Visualization tools and genetic data analysis:
Complex structures and sequencing patterns of genes are most effectively presented in graphs, trees, cuboids, and chains by various kinds of visualization tools. Such visually appealing structures and patterns facilitate pattern understanding, knowledge discovery, and interactive data exploration. Visualization therefore plays an important role in biomedical data mining.

An Industrial look
After the 'dotcom' downfall, many leading companies like TCS, Wipro, and IBM are now looking at computerizing the medical field. IT professionals feel that database management and data mining solutions and services play an important role in this. Dr. Manoj Kumar, director, IBM research labs, says that competence in areas like data and storage management and data mining would aid in the pursuit of bioinformatics.




Conclusion
Bioinformatics systems benefit from the use of data mining strategies to locate interesting and pertinent relationships within massive information. For example, data mining methods can ascertain and summarize the set of genes responding to a certain level of stress in an organism. Researchers can use graphical models and relational algorithms to mine such gene sets and model a gene expression network. This paper, for its part, reveals the persistent role of data mining in experimental biology. Thus, biology combined with computer science is an emerging field that has come to stay and serve humanity for a better cause.

