12_ICIEV_Dhaka
S. Barman (Mandal), A. Mandal, S. Saha
Institute of Radiophysics and Electronics, University of Calcutta
92, A.P.C. Road, Kolkata-700009, India
E-mail: barman_s@email.com

M. Roy
The Calcutta Technical School, Govt. of West Bengal, Kolkata, India
Abstract- Accurate exon prediction in a genome is extremely important for understanding life processes. Researchers use various techniques to detect the exact location of exons. This paper proposes a digital filter model for the prediction of exons in a DNA sequence. The technique converts the DNA character string into a numerical sequence using the weak-strong bonding of nucleotides and filters the transformed sequence with a narrow band pass FIR filter whose passband is centered at 2π/3. The power of the filtered signal is then used as the measure parameter for both exons and introns. A plot of signal power versus nucleotide location is used to distinguish exons from introns in a DNA sequence. The simulation plots show very distinct peaks in exon regions, indicating their presence, whereas such peaks are absent in intron regions. The design model is tested on several databases of the Homo sapiens beta-globin chromosome downloaded from the National Center for Biotechnology Information (NCBI) homepage.

Index Terms- Genomics, Amino acids, Deoxyribonucleic Acid (DNA), Power Spectral Density, Protein Coding Region, Digital Filter.

I. INTRODUCTION

The DNA sequence is always written from the 5' end to the 3' end, which is called the polarity of the DNA chain. Figure 1 illustrates the organization of DNA in eukaryotic genes. Two types of nitrogenous bases, purine and pyrimidine, are present in DNA. ‘A’ and ‘G’ are purine bases, whereas ‘C’ and ‘T’ are pyrimidine bases. The DNA strands are held together mainly by hydrogen bonds between bases. There are two hydrogen bonds between ‘A’ and ‘T’, while three hydrogen bonds exist between ‘C’ and ‘G’. Hence ‘C’-‘G’ bonds are stronger than ‘A’-‘T’ bonds, as depicted in Fig. 2 and Fig. 3. DNA is usually divided into coding (exon) and non-coding (intron) regions. Only exons are responsible for protein synthesis; introns are spliced off before the manufacturing of protein.

Gene prediction refers to detecting the locations of protein-coding regions (exons) of genes in a long DNA sequence, which is one of the most important steps for understanding life processes. A challenging problem for researchers is to identify the exon locations in a DNA sequence, which helps in determining protein functions.
Digital Signal Processing (DSP) techniques offer great promise in analyzing genomic data because of its digital nature. The following techniques have been tested on several databases of Homo sapiens chromosomes.

Fig. 2. Hydrogen bonds between nucleotides [panel: adenine (A), purine, 2 rings; thymine (T), pyrimidine, 1 ring]

A number of authors have devised algorithms for the detection of protein coding regions in genomic sequences by finding regions exhibiting the period-3 characteristic [1]. Vaidyanathan and Yoon [2] applied an anti-notch Infinite Impulse Response (IIR) digital filter to the indicator sequences to detect the period-3 component. The Discrete Fourier Transform (DFT) approach to finding period-3 regions in genomic sequences has been used by various authors [3, 4, 5]. Tiwari et al. [6] presented an early study of the application of the DFT to gene prediction. Epps et al. [7] developed an integer period DFT for biological sequence processing that had certain advantages in detecting DNA periodicity. Roy et al. [8] introduced an algorithm called Positional Frequency Distribution of Nucleotides (PFDN), which predicted exon regions based on the distribution of nucleotides in exon and intron regions. Ramachandran et al. [9, 10] used Electron-ion interaction potential (EIIP) values for locating hot spots in proteins. In this technique, a Narrow Band Pass (NBP) digital filter is used to select the characteristic frequency of interest.

Fig. 3. Triple hydrogen bond C≡G signifies strong bond [panel: guanine (G), purine, 2 rings; cytosine (C), pyrimidine, 1 ring]

In this paper, a cascade of a bandpass FIR filter and a lowpass IIR filter is used to measure the signal power in both exon and intron regions of a DNA sequence. The filtering technique gives prominent and satisfactory results for the prediction of exon regions.

The paper is organized as follows: Section 2 presents the binary mapping technique, Section 3 elaborates the details of the filtering technique, Section 4 presents the results obtained and Section 5 gives the conclusion.

II. DNA REPRESENTATION USING MAPPING

Signal processing analysis of genomic sequences is hindered by their representation as strings of the four characters A, T, C and G. Hence a mapping technique is required to convert the sequence into numerals before applying DSP techniques. Different researchers have adopted different mapping methods for this purpose. Here the authors have attempted a mapping rule based on weak-strong hydrogen bonding for digitization. As nucleotides ‘A’ and ‘T’ have two hydrogen bonds in their molecular structure, they are treated as weak bonds and assigned the value ‘-1’. Nucleotides ‘C’ and ‘G’ have three hydrogen bonds, so they are treated as strong and assigned the value ‘1’ in the nucleotide database.

For example, a DNA sequence of length N:
x[n] = [A T G C C T T A G G A T] (1)

After mapping:
xsw[n] = [-1 -1 1 1 1 -1 -1 -1 1 1 -1 -1] (2)

The mapping technique based on weak and strong bases is a better choice for numerically representing a DNA sequence than the binary indicator method proposed by Voss [11], because it not only generates a single sequence but is also biologically more meaningful, as it represents a physical property of nucleotides. After this conversion, DSP methods can be applied effectively to extract the hidden features of DNA sequences.

III. DIGITAL FILTERING TECHNIQUE

From a DSP perspective, the prediction of exon regions using digital filtering is achieved through spectral analysis of the DNA sequence, by comparing the strength of the signal in coding and non-coding regions along the length of the DNA sequence.

In our technique, the mapped discrete signal is transformed into another signal in the frequency domain using the Discrete Fourier Transform. It is then passed through a narrowband Finite Impulse Response (FIR) bandpass filter with its passband centered at the period-3 frequency of 2π/3, and the filtered output is squared. The squared signal is passed through an IIR lowpass filter, and the strength of the resulting signal is used as the measure parameter of the period-3 property of coding regions. A very sharp peak is observed in exon regions, which is absent in intron regions.

The DFT of the indicator sequence obtained by the mapping technique employed is given by

$$X_s[k] = \sum_{n=0}^{N-1} x_s[n]\, e^{-j 2\pi n k / N}$$

where $x_s[n] = [-1\ 1\ 1\ -1\ 1\ -1\ -1\ 1\ \ldots]$ is the mapped sequence and $X_s[k]$ is the DFT of $x_s[n]$.
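As an illustration only, the weak-strong mapping of Section II, the DFT coefficient defined in Section III, and the bandpass, squaring and lowpass cascade can be sketched in Python. This is a minimal sketch under stated assumptions, not the filters used in this paper: the FIR filter is assumed to act directly on the mapped sequence in the time domain, and the function names, the 81-tap cosine-modulated Hamming-window bandpass design, and the single-pole lowpass coefficient 0.95 are all hypothetical choices made for the example.

```python
import cmath
import math

# Weak-strong mapping from Section II: A/T (two H-bonds) -> -1, C/G (three) -> +1.
WEAK_STRONG = {"A": -1, "T": -1, "C": 1, "G": 1}


def map_weak_strong(dna):
    """Convert a DNA string into the +/-1 weak-strong sequence."""
    return [WEAK_STRONG[base] for base in dna.upper()]


def dft_bin(x, k):
    """Single DFT coefficient X[k] = sum_n x[n] * exp(-j*2*pi*n*k/N)."""
    n_total = len(x)
    return sum(v * cmath.exp(-2j * math.pi * n * k / n_total)
               for n, v in enumerate(x))


def bandpass_fir_taps(num_taps=81, center=2 * math.pi / 3):
    """Assumed narrowband FIR design: Hamming window modulated to `center`."""
    mid = (num_taps - 1) / 2
    taps = [
        (0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1)))
        * math.cos(center * (n - mid))
        for n in range(num_taps)
    ]
    # Normalise so that the gain at the centre frequency is 1.
    gain = abs(sum(h * cmath.exp(-1j * center * n) for n, h in enumerate(taps)))
    return [h / gain for h in taps]


def fir_filter(x, taps):
    """Direct-form causal FIR convolution with zero initial state."""
    return [
        sum(h * x[n - k] for k, h in enumerate(taps) if n - k >= 0)
        for n in range(len(x))
    ]


def lowpass_iir(x, alpha=0.95):
    """Assumed single-pole IIR smoother: y[n] = alpha*y[n-1] + (1-alpha)*x[n]."""
    y, prev = [], 0.0
    for v in x:
        prev = alpha * prev + (1 - alpha) * v
        y.append(prev)
    return y


def period3_power(dna, num_taps=81, alpha=0.95):
    """Map, bandpass at 2*pi/3, square, then lowpass; one power value per base."""
    band = fir_filter(map_weak_strong(dna), bandpass_fir_taps(num_taps))
    return lowpass_iir([v * v for v in band], alpha)


if __name__ == "__main__":
    # Synthetic string (not real genomic data): a period-3 block in the middle.
    seq = "GA" * 150 + "GAT" * 100 + "GA" * 150
    power = period3_power(seq)
    print(round(power[250], 4), round(power[450], 4), round(power[850], 4))
```

On this synthetic string the smoothed power is large inside the repeated "GAT" block and close to zero in the flanking regions, which mirrors the peak-versus-no-peak behaviour described above. For real sequences, the filter length and the smoothing coefficient would need to be tuned, and the values used here should not be read as the authors' settings.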
13 ICIEV 2012
IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision
[Figure panels: INTRON (relative position 119-248, length 130); INTRON (relative position 435-564, length 130)]
[3] … IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 653-56, 2005.
[4] S. Datta and A. Asif, “DFT based DNA splicing algorithms for prediction of protein coding regions”, Proceedings of the IEEE Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, pp. 45-49, 2004.
[5] S. Datta, A. Asif, and H. Wang, “Prediction of protein coding regions in DNA sequences using Fourier spectral characteristics”, Proceedings of the IEEE Sixth International Symposium on Multimedia Software, pp. 160-63, 2004.
[6] S. Tiwari, S. Ramachandran, A. Bhattachary, S. Bhattacharya, and R. Ramaswamy, “Prediction of probable genes by Fourier analysis of genomic sequences”, CABIOS, vol. 3, no. 3, pp. 263-270, 1997.
[7] J. Epps, E. Ambikairajah, and M. Akhtar, “An integer period DFT for biological sequence processing”, Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), pp. 1-4, 2008.
[8] M. Roy, S. Biswas, and S. Barman (Mandal), “Identification and analysis of coding and non-coding regions of a DNA sequence by Positional Frequency Distribution of Nucleotides (PFDN) algorithm”, IEEE 978-1-4244-5073-2, 2009.
[9] P. Ramachandran, W.-S. Lu, and A. Antoniou, “Improved Hot-Spot location Technique for Proteins using a Bandpass Notch Digital Filter”, IEEE, 978-1-4244-1684-4, 2008.
[10] P. Ramachandran, W.-S. Lu, and A. Antoniou, “Location of exons in DNA sequences using Digital Filters”, IEEE, 978-1-4244-3828-0/09, 2008.
[11] R. F. Voss, “Evolution of long-range fractal correlations and 1/f noise in DNA base sequences”, Phys. Rev. Lett., vol. 68, no. 25, pp. 3805-3808, 1992.
[12] National Center for Biotechnology Information (NCBI). [Online]. Available: http://www.ncbi.nlm.nih.gov/