
IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision

Prediction of Protein Coding Regions of a DNA Sequence through Spectral Analysis

S. Barman (Mandal)
Institute of Radiophysics and Electronics, University of Calcutta
92, A.P.C Road, Kolkata-700009, India
E-mail: barman_s@email.com

A. Mandal
Institute of Radiophysics and Electronics, University of Calcutta
92, A.P.C Road, Kolkata-700009, India

S. Saha
Institute of Radiophysics and Electronics, University of Calcutta
92, A.P.C Road, Kolkata-700009, India

M. Roy
The Calcutta Technical School, Govt. of West Bengal
Kolkata, India

Abstract- Accurate exon prediction in a genome is extremely important for understanding life processes. Researchers use various techniques for accurately locating exons. A digital filter model is proposed in this paper for the prediction of exons in a DNA sequence. The technique involves converting the DNA character string into a numerical sequence using the weak-strong bonding of nucleotides and filtering the transformed sequence with a narrow bandpass FIR filter whose passband is centered at 2π/3. The power of the filtered signal is then used as the measure parameter for both exons and introns. A plot of signal power versus nucleotide location is used to distinguish exons from introns of a DNA sequence. The simulation plots show very distinct peaks in exon regions, whereas such peaks are absent in intron regions. The model is tested on several Homo sapiens beta-globin gene sequences downloaded from the National Center for Biotechnology Information (NCBI) homepage.

Index Terms- Genomics, Amino acids, Deoxyribonucleic Acid (DNA), Power Spectral Density, Protein Coding Region, Digital Filter.

I. INTRODUCTION

The hereditary information needed to build and maintain a living organism resides in its genome. The genome contains genes, which are made of DNA. DNA is a very large biomolecule made up of smaller units called nucleotides. Genetic information is believed to be stored in the particular order of four kinds of nucleotide bases: Adenine (A), Thymine (T), Cytosine (C) and Guanine (G). Thus DNA can be represented as a string of characters. There are two complementary DNA chains twisted around one another in a right-handed double helix structure. The two strands of DNA are always complementary to one another. Adenine (A) of one strand always pairs with Thymine (T) of the opposite strand, while Guanine (G) always pairs with Cytosine (C). A DNA base sequence is always written from the 5' end to the 3' end, which is called the polarity of the DNA chain. Figure 1 illustrates the organization of DNA in eukaryotic genes. Two types of nitrogenous bases, purine and pyrimidine, are present in DNA. 'A' and 'G' are purine bases whereas 'C' and 'T' are pyrimidine bases. The DNA strands are held together mainly by hydrogen bonds between bases. There are two hydrogen bonds between 'A' and 'T', while three hydrogen bonds exist between 'C' and 'G'. Hence 'C' and 'G' bonds are stronger than 'A' and 'T' bonds, as depicted in Fig. 2 and Fig. 3. DNA is usually divided into coding (exon) and non-coding (intron) regions. Only exons are responsible for protein synthesis; introns are spliced off before the protein is manufactured.

Fig 1. Organization of Genes in eukaryotes

Gene prediction refers to detecting the locations of protein-coding regions (exons) of genes in a long DNA sequence, which is one of the most important steps for understanding life processes. A challenging problem for researchers is to identify the exon locations of a DNA sequence, which helps in determining protein functions.

978-1-4673-1154-0/12/$31.00 ©2012 IEEE ICIEV 2012



Digital Signal Processing (DSP) techniques offer great promise in analyzing genomic data because of its digital nature. The following techniques have been tested on several databases of Homo sapiens chromosomes.

[Figure: Adenine (A), a purine with 2 rings, paired with Thymine (T), a pyrimidine with 1 ring]
Fig 2. Hydrogen bonds between nucleotides

A number of authors have devised algorithms for detection of protein coding regions in genomic sequences by finding regions exhibiting the period-3 characteristic [1]. Vaidyanathan and Yoon [2] applied an anti-notch Infinite Impulse Response (IIR) digital filter to the indicator sequences to detect the period-3 component. The Discrete Fourier Transform (DFT) approach to finding 3-periodicity regions in genomic sequences has been used by various authors [3, 4, 5]. Tiwari et al. [6] presented an early study of the application of the DFT to gene prediction. Epps et al. [7] developed an integer period DFT for biological sequence processing that had certain advantages in detecting DNA periodicity. Roy et al. [8] introduced an algorithm called Positional Frequency Distribution of Nucleotides (PFDN), which predicted exon regions based on the distribution of nucleotides in exon and intron regions. Ramachandran et al. [9, 10] used Electron-ion interaction potential (EIIP) values for locating hot spots in proteins. In this technique, a Narrow Band Pass (NBP) digital filter is used to select the characteristic frequency of interest.

[Figure: Guanine (G), a purine with 2 rings, paired with Cytosine (C), a pyrimidine with 1 ring]
Fig 3. Triple Hydrogen Bond C ≡ G Signifies strong bond

In this paper, a cascade of a bandpass FIR filter and a lowpass IIR filter is used to measure the signal power in both exon and intron regions of a DNA sequence. The filtering technique gives prominent and satisfactory results for the prediction of exon regions.
The paper is organized as follows: Section 2 presents the binary mapping technique, Section 3 elaborates the details of the filtering technique, Section 4 presents the results obtained and Section 5 gives the conclusion.

II. DNA REPRESENTATION USING MAPPING

Signal processing analysis of genomic sequences is hindered by their representation as strings of the four letter characters A, T, C, G. Hence a mapping technique is required to convert the sequence into numerals before applying DSP techniques. Different researchers have adopted different mapping methods for this purpose. Here the authors have attempted a mapping rule based on weak-strong hydrogen bonding for digitization. As nucleotides 'A' and 'T' pair through two hydrogen bonds, they are treated as weak and assigned the value '-1'. Nucleotides 'C' and 'G' pair through three hydrogen bonds, so they are treated as strong and assigned the value '1' in the nucleotide database.
For example, a DNA sequence of length N:
x[n] = [A T G C C T T A G G A T] (1)

After mapping:
x_sw[n] = [-1 -1 1 1 1 -1 -1 -1 1 1 -1 -1] (2)
The mapping technique based on weak and strong bases is a better choice for numerically representing a DNA sequence than the binary indicator method proposed by Voss [11], because it not only generates a single sequence but is also biologically more meaningful, as it represents a physical property of the nucleotides. After this conversion, DSP methods can be applied effectively to extract the hidden features of DNA sequences.

III. DIGITAL FILTERING TECHNIQUE

From a DSP perspective, the prediction of exon regions using digital filtering is achieved through spectral analysis of the DNA sequence, by comparing the strength of the signal in coding and non-coding regions along the length of the sequence.

In our technique the mapped discrete signal is transformed into another signal in the frequency domain using the Discrete Fourier Transform. It is then passed through a narrowband Finite Impulse Response (FIR) bandpass filter whose passband is centered at the period-3 frequency of 2π/3, and the filter output is squared. The squared signal is passed through an IIR lowpass filter, and the strength of the resulting signal is used as the measure parameter of the period-3 property of the coding region. A very sharp peak is observed in exon regions which is absent in intron regions.

The DFT of the indicator sequence obtained by the mapping technique employed is given by

$$X_s[k] = \sum_{n=0}^{N-1} x_s[n]\, e^{-j 2\pi n k / N}$$

where $x_s[n] = [-1\ 1\ 1\ -1\ 1\ -1\ -1\ 1\ \ldots]$ and $X_s[k]$ is the DFT of $x_s[n]$.
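As a minimal sketch of the weak-strong mapping of Eq. (2) and of the DFT defined above, the following Python fragment maps the example sequence of Eq. (1) and evaluates the single bin k = N/3, which corresponds to the period-3 frequency 2π/3. The function names `map_weak_strong` and `dft_bin` are illustrative assumptions and do not come from the paper, whose model was built in MATLAB Simulink.

```python
import cmath

# Weak-strong rule from Section II: two hydrogen bonds (A, T) -> -1,
# three hydrogen bonds (C, G) -> +1.
WEAK_STRONG = {"A": -1, "T": -1, "C": 1, "G": 1}


def map_weak_strong(sequence):
    """Convert a DNA string into the numerical sequence x_sw[n] (hypothetical helper)."""
    return [WEAK_STRONG[base] for base in sequence.upper()]


def dft_bin(x, k):
    """Evaluate X[k] = sum_n x[n] * exp(-j*2*pi*n*k/N) for one bin k (hypothetical helper)."""
    total = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / total) for n in range(total))


x_sw = map_weak_strong("ATGCCTTAGGAT")  # the example sequence of Eq. (1)
period3_bin = len(x_sw) // 3            # bin k = N/3 sits at 2*pi/3 rad/sample
print(x_sw)
print(abs(dft_bin(x_sw, period3_bin)) ** 2)  # spectral power at the period-3 bin
```

Evaluating only the bin at N/3 keeps the sketch short; a full spectrum would loop `dft_bin` over k = 0, 1, ..., N-1.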


In the DFT expression above, the indices are $k = 0, 1, 2, \ldots, N-1$ and $n = 0, 1, 2, \ldots, N-1$, with $s$ = A, C, G, T.

A MATLAB Simulink environment has been created to filter the DNA signal. The proposed exon prediction model and the amplitude responses of the FIR bandpass and IIR lowpass filters are shown in Fig. 4, Fig. 5 and Fig. 6 respectively.

Fig 4. Proposed model for exon prediction

Fig 5. FIR Band pass Filter response

Fig 6. IIR Low pass Filter response

IV. RESULTS AND DISCUSSION

The proposed model is tested on a set of Homo sapiens beta-globin (HBB) genes downloaded from the NCBI homepage. The accession numbers AF007546, AF083883, AF059180, EF450778, AF527577, DQ074763, AY163866, DQ126271, EU761578 and DQ157442 have been used to evaluate the performance of the proposed model.
In this model, an FIR bandpass (BP) filter with a Blackman window and an IIR lowpass (LP) filter with Butterworth approximation have been used, with parameters selected by trial and error.
For the FIR BP filter design the following normalized passband edges have been chosen:
wc1 = 0.6665 and wc2 = 0.6667
The minimum filter order = 100.
The attenuation at the cutoff frequencies is fixed at 6 dB.
For the IIR LP filter design the normalized cutoff frequency and the filter order have been chosen to be 0.24 and 10 respectively, and the attenuation at the cutoff frequency is fixed at 3 dB.
Figure 7 shows the simulated output of the proposed model for both exon and intron regions of the DNA sequences. The simulation results show very sharp peaks in exon regions, whereas such peaks are absent in intron regions. The signal strength in coding regions is found to be 10 times the signal strength in non-coding regions. The sharp peaks in exon regions show the period-3 property present in protein coding regions.

[Fig. 7 panel: Accession No. AF007546: EXON (Relative Position: 402-624, Length: 223); INTRON (Relative Position: 625-1474, Length: 850)]
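The filter cascade described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' Simulink model: the band edges and the cutoff are taken as normalized to the Nyquist frequency (1.0 = π rad/sample), the FIR taps are scaled to unit gain at the passband centre (the paper does not state its scaling), the Butterworth filter is realized as second-order sections via the bilinear transform, and all function names are hypothetical. The input is a synthetic sequence, not one of the NCBI records.

```python
import cmath
import math

# Weak-strong mapping from Section II: A/T -> -1, C/G -> +1.
WEAK_STRONG = {"A": -1.0, "T": -1.0, "C": 1.0, "G": 1.0}


def fir_bandpass_blackman(order, edge1, edge2):
    """Windowed-sinc bandpass FIR with a Blackman window; edges normalized so 1.0 = pi."""
    w1, w2 = edge1 * math.pi, edge2 * math.pi
    centre = order / 2.0
    taps = []
    for n in range(order + 1):
        t = n - centre
        if t == 0:
            ideal = (w2 - w1) / math.pi
        else:
            ideal = (math.sin(w2 * t) - math.sin(w1 * t)) / (math.pi * t)
        window = (0.42 - 0.5 * math.cos(2 * math.pi * n / order)
                  + 0.08 * math.cos(4 * math.pi * n / order))
        taps.append(ideal * window)
    # Assumption: scale the taps for unit gain at the passband centre.
    wc = 0.5 * (w1 + w2)
    gain = abs(sum(h * cmath.exp(-1j * wc * k) for k, h in enumerate(taps)))
    return [h / gain for h in taps]


def fir_filter(taps, x):
    """Causal FIR convolution with zero initial conditions."""
    return [sum(taps[k] * x[n - k] for k in range(min(n + 1, len(taps))))
            for n in range(len(x))]


def butterworth_lowpass_sections(order, cutoff):
    """Even-order Butterworth lowpass as biquads (bilinear transform); cutoff normalized so 1.0 = pi."""
    warped = math.tan(math.pi * cutoff / 2.0)  # pre-warped analog cutoff
    sections = []
    for k in range(order // 2):
        q = 2.0 * math.sin(math.pi * (2 * k + 1) / (2 * order))
        a0 = 1.0 + q * warped + warped ** 2
        b = [warped ** 2 / a0, 2.0 * warped ** 2 / a0, warped ** 2 / a0]
        a = [1.0, (2.0 * warped ** 2 - 2.0) / a0, (1.0 - q * warped + warped ** 2) / a0]
        sections.append((b, a))
    return sections


def iir_filter(sections, x):
    """Run x through the cascaded second-order sections (direct form I)."""
    y = list(x)
    for b, a in sections:
        x1 = x2 = y1 = y2 = 0.0
        out = []
        for v in y:
            w = b[0] * v + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
            x2, x1 = x1, v
            y2, y1 = y1, w
            out.append(w)
        y = out
    return y


def period3_power(sequence):
    """Map, bandpass at 2*pi/3, square, then smooth with the lowpass filter."""
    x = [WEAK_STRONG[base] for base in sequence.upper()]
    band = fir_filter(fir_bandpass_blackman(100, 0.6665, 0.6667), x)
    squared = [v * v for v in band]
    return iir_filter(butterworth_lowpass_sections(10, 0.24), squared)


# Synthetic check (not real gene data): a period-2 stretch followed by a period-3 stretch.
power = period3_power("GA" * 150 + "GCA" * 100)
print(round(power[200], 6), round(power[550], 3))
```

On this synthetic input the smoothed power stays close to zero inside the period-2 stretch and settles at a clearly positive level inside the period-3 stretch, which is the kind of contrast between regions that the model relies on.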


[Fig. 7 panels:
Accession No. AF059180: EXON (Relative Position: 27-118, Length: 92); INTRON (Relative Position: 119-248, Length: 130)
Accession No. EF450778: EXON (Relative Position: 565-787, Length: 223); INTRON (Relative Position: 435-564, Length: 130)
Accession No. AF083883: EXON (Relative Position: 27-118, Length: 92); INTRON (Relative Position: 475-1321, Length: 850)]

Fig 7. Simulation results of proposed model

V. CONCLUSION

In this paper, a digital filter based model has been proposed to predict exon regions of a DNA sequence. The performance of the model has been tested successfully on several Homo sapiens HBB gene sequences taken from public genome databases. The exon regions are distinguished from the intron regions of a DNA sequence through spectral analysis, based on well-defined peaks in the waveform. Simulation results show that the proposed model is efficient for the prediction of exon regions. For efficient digital processing of the DNA signal, weak-strong bond based mapping is used in the present paper. Application of EIIP based mapping may further enhance the accuracy of the proposed model in predicting exon regions of a DNA sequence. Alternative types of filters, such as notch filters and multiband filters, may also be investigated to improve the accuracy of the prediction. Phase distortion and the phase response of the filters are not considered, for simplicity of analysis.
REFERENCES

[1] J. Tuqan and A. Rushdi, “A DSP perspective to the period-3 detection problem”, Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), pp. 53-54, 2006.
[2] P. P. Vaidyanathan and B. J. Yoon, “The role of signal-processing concepts in genomics and proteomics”, Journal of the Franklin Institute, special issue on Genomics, 2004.
[3] S. Datta and A. Asif, “A fast DFT based gene prediction algorithm for identification of protein coding regions”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 653-656, 2005.
[4] S. Datta and A. Asif, “DFT based DNA splicing algorithms for prediction of protein coding regions”, Proceedings of the IEEE Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, pp. 45-49, 2004.
[5] S. Datta, A. Asif, and H. Wang, “Prediction of protein coding regions in DNA sequences using Fourier spectral characteristics”, Proceedings of the IEEE Sixth International Symposium on Multimedia Software, pp. 160-163, 2004.
[6] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, and R. Ramaswamy, “Prediction of probable genes by Fourier analysis of genomic sequences”, CABIOS, vol. 13, no. 3, pp. 263-270, 1997.
[7] J. Epps, E. Ambikairajah, and M. Akhtar, “An integer period DFT for biological sequence processing”, Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), pp. 1-4, 2008.
[8] M. Roy, S. Biswas, and S. Barman (Mandal), “Identification and analysis of coding and non-coding regions of a DNA sequence by Positional Frequency Distribution of Nucleotides (PFDN) algorithm”, IEEE, 978-1-4244-5073-2, 2009.
[9] P. Ramachandran, W.-S. Lu, and A. Antoniou, “Improved Hot-Spot location Technique for Proteins using a Bandpass Notch Digital Filter”, IEEE, 978-1-4244-1684-4, 2008.
[10] P. Ramachandran, W.-S. Lu, and A. Antoniou, “Location of exons in DNA sequences using Digital Filters”, IEEE, 978-1-4244-3828-0/09, 2008.
[11] R. F. Voss, “Evolution of long-range fractal correlations and 1/f noise in DNA base sequences”, Phys. Rev. Lett., vol. 68, no. 25, pp. 3805-3808, 1992.
[12] National Centre for Biotechnology Information (NCBI). [Online]. Available: http://www.ncbi.nlm.nih.gov/
