
Fractals, Vol. 11, No. 1 (2003) 19–25
© World Scientific Publishing Company

SELF-SIMILARITY LIMITS OF
GENOMIC SIGNATURES

ZUO-BING WU
State Key Laboratory of Nonlinear Mechanics, Institute of Mechanics
Academia Sinica, Beijing 100080, China
wuzb@lnm.imech.ac.cn

Received October 26, 2001; Accepted June 25, 2002

Abstract
It is shown that the metric representation of DNA sequences is one-to-one. Using the metric
representation method, the suppression of nucleotide strings in DNA sequences is determined.
For a DNA sequence, an optimal string length for displaying the genomic signature in the chaos
game representation is obtained by eliminating the effects of finite sequence length. The optimal
string length is further shown to be a self-similarity limit in computing the information dimension.
Using this method, self-similarity limits of the genomic signatures of complete bacterial genomes
are determined.

Keywords: Genomic Signature; Chaos Game Representation; Metric Representation.

1. INTRODUCTION

With the growing amount of DNA sequence data obtained from experiments, it is important to develop methods for extracting meaningful information from one-dimensional symbolic sequences composed of the four letters “A”, “C”, “G” and “T” (or “U”). To detect similarity in DNA sequences, scatter plots [1] were introduced to classify cytochromes and to illustrate a dendrogram. From a comparison of pairs of duplicated genes by a distance matrix, the evolutionary relationship of the three primary kingdoms of life was inferred [2]. By investigating relative abundances of short oligonucleotides in subsequences, the genomic signature phenomenon and a derivation of partial-ordering relationships among bacterial genomes were proposed [3]. The genomic signature states that the differences of dinucleotide relative abundance values within a single genome are smaller than those between distinct genomes. Chaos game representation (CGR) [4], which generates a two-dimensional
square from a one-dimensional sequence, provides a technique to visualize the composition of DNA sequences. By combining the CGR and short-sequence representation methods, the evolution of species-type specificity in mitochondrial genomes was analyzed [5]. Using the CGR method, it was shown that the main characteristics of the whole genome are exhibited by its subsequences [6]. The genomic signature was thereby extended to describe characteristics of CGR images. By defining a Euclidean metric between two CGR images, classification of species in the three primary kingdoms was discussed.

Recently, metric representation (MR) [7], which is borrowed from symbolic dynamics, was introduced to order subsequences in a plane. The MR method is an extension of CGR. Suppression of certain nucleotide strings in DNA sequences leads to a self-similar pattern in the MR of DNA sequences. In this paper, we first show that the MR is one-to-one. Using the MR method, we determine the suppression of nucleotide strings in DNA sequences. Then, by eliminating the effects of finite sequence length on the suppression of nucleotide strings, we give an optimal string length for displaying the genomic signature. Moreover, we plot the information function versus string length to determine self-similarity limits in MR images. Using this method, we present self-similarity limits of the genomic signatures of complete bacterial genomes.

2. SUPPRESSION OF NUCLEOTIDE STRINGS

For a given DNA sequence, we have a one-dimensional symbolic sequence $s_1 s_2 \cdots s_i \cdots s_N$ ($s_i \in \{A, C, G, T\}$). In a two-dimensional MR, we map each symbol $s_i$ to the numbers $\mu_i, \nu_i \in \{0, 1\}$ and calculate the values $(\alpha, \beta)$ of all subsequences $\Sigma_m = s_1 s_2 \cdots s_m$ ($1 \le m \le N$). The number $\alpha$, represented in base 3 and lying between 0 and 1, is defined as

$$\alpha = 2\sum_{j=1}^{m} \mu_{m-j+1}\, 3^{-j} + 3^{-m} = 2\sum_{i=1}^{m} \mu_i\, 3^{-(m-i+1)} + 3^{-m}, \qquad (1)$$

where $\mu_i$ is 0 if $s_i \in \{A, C\}$ or 1 if $s_i \in \{G, T\}$. Similarly, the number $\beta$ is defined as

$$\beta = 2\sum_{j=1}^{m} \nu_{m-j+1}\, 3^{-j} + 3^{-m} = 2\sum_{i=1}^{m} \nu_i\, 3^{-(m-i+1)} + 3^{-m}, \qquad (2)$$

where $\nu_i$ is 0 if $s_i \in \{A, T\}$ or 1 if $s_i \in \{C, G\}$. According to (1) and (2), the subsequences of the one-dimensional symbolic sequence $s_1 s_2 \cdots s_N$ are partitioned into four kinds, which correspond to points in the four fundamental zones A, C, G and T of Fig. 1. Under left or right shift operators, each zone can be further shrunk into smaller zones with a factor of $1/3^2$. For an infinite sequence, this procedure defines a fractal [8], which is self-similar. The subsequences with the same ending $k$-nucleotide string are labeled by $\Sigma^{(k)}$. All subsequences $\Sigma^{(k)}$ correspond to points in the zone encoded by that $k$-nucleotide string.

Lemma 1. $(\alpha, \beta)\{S(\Sigma_m)\} = 2(\mu_{m+1}, \nu_{m+1})/3 + (\alpha, \beta)\{\Sigma_m\}/3$, where $S$ is a left shift operator.

Proof. Note that for the left shift operator, $S(\Sigma_m) = \Sigma_m s_{m+1}$. The result follows immediately from Definitions (1) and (2).

Lemma 2. $(\alpha, \beta)\{\Sigma_m\} = (\alpha, \beta)\{G^\infty \Sigma_m\}$.

Proof. When $m = 1$, $\Sigma_1 = s_1$ and $G^\infty \Sigma_1 = S(G^\infty)$. By Lemma 1, we obtain $(\alpha, \beta)\{G^\infty \Sigma_1\} = 2(\mu_1, \nu_1)/3 + (\alpha, \beta)\{G^\infty\}/3 = 2(\mu_1, \nu_1)/3 + (1, 1)/3 = (\alpha, \beta)\{\Sigma_1\}$. Suppose that for $m = i$ we have $(\alpha, \beta)\{\Sigma_i\} = (\alpha, \beta)\{G^\infty \Sigma_i\}$. For $m = i + 1$, we have $\Sigma_{i+1} = \Sigma_i s_{i+1} = S(\Sigma_i)$ and $G^\infty \Sigma_{i+1} = S(G^\infty \Sigma_i)$. By Lemma 1, we obtain $(\alpha, \beta)\{\Sigma_{i+1}\} = 2(\mu_{i+1}, \nu_{i+1})/3 + (\alpha, \beta)\{\Sigma_i\}/3$ and $(\alpha, \beta)\{G^\infty \Sigma_{i+1}\} = 2(\mu_{i+1}, \nu_{i+1})/3 + (\alpha, \beta)\{G^\infty \Sigma_i\}/3$. So, using the supposition $(\alpha, \beta)\{\Sigma_i\} = (\alpha, \beta)\{G^\infty \Sigma_i\}$, we arrive at $(\alpha, \beta)\{\Sigma_{i+1}\} = (\alpha, \beta)\{G^\infty \Sigma_{i+1}\}$. ∎

By Lemma 2, each finite subsequence $\Sigma_m$ has a corresponding infinite sequence $G^\infty \Sigma_m$. Here, we define the set of these infinite sequences as $\Sigma$.

Theorem 1. $(\alpha, \beta) : \Sigma \to \Lambda$ is one-to-one, where $\Lambda$ is a set of points in the $(\alpha, \beta)$ plane.

This means that, given $\Sigma^1, \Sigma^2 \in \Sigma$, if $\Sigma^1 \ne \Sigma^2$, then $(\alpha, \beta)\{\Sigma^1\} \ne (\alpha, \beta)\{\Sigma^2\}$. We give a proof by
contradiction. Suppose $(\alpha, \beta)\{\Sigma^1\} = (\alpha, \beta)\{\Sigma^2\}$, and mark this point as $P$ in the $(\alpha, \beta)$ plane. For the zone including the point $P$, we encode it as two subsequences $\Sigma^1_1$ and $\Sigma^2_1$ with the same mononucleotide. Then, enlarging the zone by an area factor of $3^2$, we obtain two encoding subsequences $\Sigma^1_2$ and $\Sigma^2_2$ with the same dinucleotide. Each enlarging process provides a right shift to the two subsequences. At the same time, the point $P$ is included in only one of the four enlarged zones, so the two shifted subsequences are the same. Repeating the enlarging process infinitely many times, we obtain $\Sigma^1 = \Sigma^2$, contradicting our original assumption. This contradiction arises from the assumption $(\alpha, \beta)\{\Sigma^1\} = (\alpha, \beta)\{\Sigma^2\}$; thus, if $\Sigma^1 \ne \Sigma^2$, then $(\alpha, \beta)\{\Sigma^1\} \ne (\alpha, \beta)\{\Sigma^2\}$.

Fig. 1 Metric representation of HUMHBB. Its boundary and partition lines are shown by solid lines and dashed lines, respectively.

For a DNA sequence, some zones in CGR are filled with points, so that a pattern appears. In CGR, several subsequences with different ending $k$-nucleotide strings can correspond to the same point on the boundaries of zones. For example, the subsequences $G^\infty A$ in the zone A, $T^\infty C$ in the zone C, $A^\infty G$ in the zone G and $C^\infty T$ in the zone T have the same point $(1/2, 1/2)$ in CGR. Under left shift operators, this correspondence between points and subsequences is preserved in zones of small enough length. For example, the subsequences $G^\infty AC$ in the zone AC, $T^\infty C^2$ in the zone $C^2$, $A^\infty GC$ in the zone GC and $C^\infty TC$ in the zone TC have the same point $(1/4, 3/4)$ in CGR. In the MR of DNA sequences, each zone in CGR is shrunk and clearly divided by four bands. There exists a one-to-one correspondence between zones and ending $k$-nucleotide strings of subsequences. The frequency of points in a zone can be determined by using the MR method as follows. In order to compute frequencies in zones encoded by $k$-nucleotide strings, we need to determine the partition lines of the MR in Fig. 1. For mononucleotides, there exist $2 \times 2$ zones in the MR. We have $n_1$ ($= 3$) partition lines $b^1_0 = 0$, $b^1_1 = 1/2$ and $b^1_2 = 1$ along the $\alpha$ axis. For dinucleotides, there exist $4 \times 4$ zones in the MR. We have $n_2$ ($= 5$) partition lines $b^2_0 = b^1_0 = 0$, $b^2_1 = b^1_1/3 = 1/6$, $b^2_2 = b^1_1 = 1/2$, $b^2_3 = 1 - b^2_1 = 5/6$ and $b^2_4 = 1 - b^2_0 = 1$ along the $\alpha$ axis. In general, given the $n_{k-1}$ ($= 2^{k-1} + 1$) partition lines $b^{k-1}_i$ ($i = 0, 1, \ldots, n_{k-1} - 1$) along the $\alpha$ axis for $(k-1)$-nucleotide strings, we can obtain the $n_k$ ($= 2^k + 1 = 2n_{k-1} - 1$) partition lines $b^k_i$ ($i = 0, 1, \ldots, n_k - 1$) for $k$-nucleotide strings as follows. For the $k$-nucleotide strings,
there exist $2^k \times 2^k$ zones in the MR. The left half ($0 \le i \le n_{k-1} - 1$) of the partition lines along the $\alpha$ axis is given by

$$b^k_i = \begin{cases} b^{k-1}_{i/2}, & \text{if } i\,\%\,2 = 0, \\ b^{k-1}_i/3, & \text{if } i\,\%\,2 = 1. \end{cases} \qquad (3)$$

From (3), the right half ($n_{k-1} \le i \le n_k - 1$) of the partition lines along the $\alpha$ axis can be determined immediately:

$$b^k_i = 1 - b^k_{n_k - 1 - i}. \qquad (4)$$

For example, for trinucleotides, the nine partition lines along the $\alpha$ axis are 0, 1/18, 1/6, 5/18, 1/2, 13/18, 5/6, 17/18 and 1. For tetranucleotides, we obtain the 17 partition lines 0, 1/54, 1/18, 5/54, 1/6, 13/54, 5/18, 17/54, 1/2, 37/54, 13/18, 41/54, 5/6, 49/54, 17/18, 53/54 and 1 along the $\alpha$ axis. The partition lines along the $\beta$ axis are the same as those along the $\alpha$ axis. Each zone in the MR is thus bounded by the combined partition lines along the $\alpha$ and $\beta$ axes.

Using the MR method, we determine the suppression of $k$-nucleotide strings in HUMHBB (human β-region, chromosome 11) with 73,308 bases and YEAST1 (yeast chromosome 1) with 230,209 bases; the results are given in Table 1. To check the method, we also directly counted the strings that are absent among all strings of a given length in HUMHBB and YEAST1. The results are identical to those in Table 1. So, the MR method is effective in determining the suppression of nucleotide strings in DNA sequences.

Table 1 Suppression of $k$-nucleotide strings in HUMHBB, YEAST1 and random sequences. The total number of nucleotide strings of length $k$ and the number of suppressed $k$-nucleotide strings are labeled by $\Pi_k$ and $\Lambda_k$, respectively. Sequence lengths are given in parentheses.

k                                                    5      6      7         8            9              10
$\Pi_k$                                              1024   4096   16384     65536        262144         1048576
$\Lambda_k^{HUMHBB}/\Lambda_k^{Random}$ (73308)      4/0    244/0  3667/208  32909/21402  209280/198219  985222/977852
$\Lambda_k^{YEAST1}/\Lambda_k^{Random}$ (230209)     0/0    0/0    110/0     8897/2021    134302/109290  863555/842246

In CGR of DNA sequences, self-similarity patterns become more obscure as sequence length increases. A grey-scale plot describes frequency values in small zones, whose sizes ($2^{-k} \times 2^{-k}$) are set by the length $k$ of the strings encoding the zones. As string length increases, the self-similarity patterns in CGR become clearer: zones of high and low frequency are subdivided into smaller zones and shown on a grey scale. Some empty zones may appear in the patterns of CGR, i.e. some nucleotide strings are suppressed in the sequences. As zone sizes decrease, more and more empty zones emerge in the patterns of CGR. For example, the evolution of a self-similarity pattern in CGR of the archaebacterium Archaeoglobus fulgidus is shown in Fig. 1 of Deschavanne et al. [6]. If DNA sequences were infinite, the compositional structure could be displayed in arbitrarily small zones, and empty zones would be part of the global feature in CGR. However, DNA sequences are finite. A finite sequence, even a random one, may also lead to suppression of strings. As string length increases, more and more strings are suppressed in finite sequences.

In Table 1, we compare the suppression of nucleotide strings between DNA and random sequences of the same length. Suppression of nucleotide strings for HUMHBB starts at $k = 5$. For a random sequence of the same length, generated with a random number generator [9], suppression of nucleotide strings is delayed to start at $k = 7$. There, the number of suppressed nucleotide strings for the random sequence is 5.67% of that for HUMHBB. The finite length of HUMHBB thus affects the suppression of 7-nucleotide strings. As $k$ increases, the numbers of suppressed nucleotide strings for the random sequence increase further and approach those for HUMHBB. At $k = 10$, the number of suppressed nucleotide strings for the random sequence is 99.3% of that for HUMHBB. In this case, suppression of nucleotide strings in HUMHBB is mainly caused by the finite length of the sequence. Moreover, suppression of nucleotide strings for YEAST1 starts at $k = 7$. For a random sequence of the same length, generated with a random number generator [9], suppression
of nucleotide strings is delayed to start at $k = 8$. There, the number of suppressed nucleotide strings for the random sequence is 22.7% of that for YEAST1. The finite length of YEAST1 thus affects the suppression of 8-nucleotide strings. At $k = 10$, the number of suppressed nucleotide strings for the random sequence is 97.5% of that for YEAST1. From this comparison, we conclude that suppression of nucleotide strings in HUMHBB and YEAST1 starts at shorter string lengths than in random sequences of the same lengths. As string length increases, the finite sequence length has a stronger effect on the suppression of nucleotide strings.

In order to display the genomic signature, we must eliminate the effects of finite sequence length on the suppression of nucleotide strings. For a DNA sequence, we take as the optimal string length the longest string length at which no nucleotide strings are yet suppressed in a random sequence of the same length. According to this definition, string lengths 6 and 7 can be chosen as optimal for the genomic signatures of HUMHBB and YEAST1, respectively.

3. LIMITS OF SELF-SIMILARITY SCALES

Suppression of certain nucleotide strings in DNA sequences leads to a fractal pattern in the MR of DNA sequences. To quantify this fractal feature, we introduce the information dimension. For a given length $k$ of nucleotide strings, we have $M$ ($= N - k + 1$) subsequences $\Sigma_i$ ($i = k, k + 1, \ldots, N$), which end with $M$ $k$-nucleotide strings. These subsequences correspond to $M$ points in an MR. In the MR, the length of a zone and the total number of zones are $\epsilon = 3^{-k}$ and $Z = 4^k$, respectively. The number of points falling in the $i$th zone and the number of non-empty zones are labeled by $m_i(\epsilon)$ and $Z(\epsilon)$, respectively. Dividing the number $m_i(\epsilon)$ by the total point number $M$ yields a probability $p_i(\epsilon)$ for the $i$th zone. The information function and the information dimension for the points in the MR are respectively defined [10] as

$$I(\epsilon) = -\sum_{i=1}^{Z(\epsilon)} p_i \log p_i \qquad (5)$$

and

$$D_1 = \lim_{\epsilon \to 0} \frac{I(\epsilon)}{\log(1/\epsilon)}. \qquad (6)$$

Fig. 2 A plot of the information function $I(\epsilon)$ versus $\log(1/\epsilon)$, labeled by dots, and its fitting line for HUMHBB.

Fig. 3 A plot of the information function $I(\epsilon)$ versus $\log(1/\epsilon)$, labeled by dots, and its fitting line for YEAST1.

Over a range of $\log(1/\epsilon)$, the information function $I(\epsilon)$ has a scaling region. The scaling region reflects the self-similarity pattern in the MR. The information dimension $D_1$ can be found from the slope of $I(\epsilon)$ versus $\log(1/\epsilon)$ in the scaling region. When the length $\epsilon$ of a zone in the MR increases from $3^{-k}$ to $2^{-k}$, the MR of DNA sequences changes to CGR.
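
As a rough illustration of the quantities defined above, the following Python sketch computes MR points from Lemma 1, evaluates the information function of Eq. (5) for zones of length $\epsilon = 3^{-k}$, and estimates $D_1$ as a least-squares slope. It is only a minimal sketch under stated assumptions, not the program used for the figures: the names `mr_points`, `information_function`, `fit_slope` and `information_dimension` are hypothetical, the sequence is assumed to contain only the letters A, C, G and T, and zone occupancies are obtained by counting ending $k$-nucleotide strings, which relies on the one-to-one correspondence between zones and $k$-nucleotide strings stated in Section 2.

```python
import math
from collections import Counter

# Digits of Eqs. (1) and (2): mu enters alpha, nu enters beta.
MU = {"A": 0, "C": 0, "G": 1, "T": 1}
NU = {"A": 0, "T": 0, "C": 1, "G": 1}


def mr_points(seq):
    """Return (alpha, beta) for every prefix s1..sm, using Lemma 1."""
    alpha = beta = 1.0  # empty prefix: the sum is empty and 3**0 = 1
    points = []
    for s in seq:
        alpha = (2 * MU[s] + alpha) / 3
        beta = (2 * NU[s] + beta) / 3
        points.append((alpha, beta))
    return points


def information_function(seq, k):
    """I(eps) of Eq. (5) for eps = 3**-k, with natural logarithms.

    Assumption: each non-empty zone corresponds to one ending
    k-nucleotide string, so occupancies are counted as k-mer frequencies.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())  # M = N - k + 1 points
    return -sum(c / total * math.log(c / total) for c in counts.values())


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def information_dimension(seq, k_values):
    """Slope of I(eps) versus log(1/eps) = k*log(3) over the chosen k."""
    xs = [k * math.log(3) for k in k_values]
    ys = [information_function(seq, k) for k in k_values]
    return fit_slope(xs, ys)
```

For a sequence `seq`, one would evaluate `information_function(seq, k)` for $k = 1, 2, \ldots$ and pass only the values of $k$ inside the apparent scaling region to `information_dimension`. If all $k$-nucleotide strings were equally probable, $I = k \log 4$ and the slope would be $\log 4/\log 3 \approx 1.26$.
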
Information dimension in CGR can thus be determined as $(\log_2 3) D_1$. We compute the information function $I(\epsilon)$ with different sizes $\epsilon$ for HUMHBB (as drawn in Fig. 2). A linear part of the curve $I(\epsilon)$ versus $\log(1/\epsilon)$ emerges between $\log(1/\epsilon) = \log 3 = 1.10$ and $\log(1/\epsilon) = 6 \log 3 = 6.59$. A fitting line is also drawn in Fig. 2. The point for $\log(1/\epsilon) = 7 \log 3 = 7.69$ departs from the line, and as $\log(1/\epsilon)$ increases further the points depart farther and farther from it. Since points in the zones correspond to $k$-nucleotide strings, we conclude that the self-similarity pattern in the MR is approximately preserved from mononucleotides to 6-nucleotide strings, and that the suppression of many nucleotide strings emerges at 7-nucleotide strings. Using the least-squares fit method [9] for the linear part, we determine its slope, i.e. the information dimension $D_1$, to be 1.20. It is less than the information dimension 1.26 for a random sequence of the same length. Moreover, in Fig. 3, we draw the information function $I(\epsilon)$ versus $\log(1/\epsilon)$ for YEAST1. A linear part of the curve $I(\epsilon)$ versus $\log(1/\epsilon)$ exists between $\log(1/\epsilon) = \log 3 = 1.10$ and $\log(1/\epsilon) = 7 \log 3 = 7.69$. We conclude that the suppression of many nucleotide strings in YEAST1 emerges from 8-nucleotide strings. Using the least-squares fit method [9] for the linear part, we also plot a fitting line in Fig. 3 and determine its slope, i.e. the information dimension $D_1$, to be 1.22. It is less than the information dimension 1.26 for a random sequence of the same length. The limits of self-similarity in the MR of HUMHBB and YEAST1 coincide with the respective optimal string lengths for genomic signatures. Thus, for presenting the genomic signature, a self-similarity limit can be determined as the optimal string length when computing the information dimension.

Using the MR method, we determine the suppression of $k$-nucleotide strings in complete bacterial genomes in Table 2, where the genomes are listed in order of decreasing suppression. For each of the complete bacterial genomes, a linear part exists in the plot of the information function $I(\epsilon)$ versus $\log(1/\epsilon)$. From the linear parts, we determine the self-similarity limits of genomic signatures given in Table 2. Keeping this order, we find that the suppression in complete bacterial genomes does not necessarily depend on sequence length. The common optimal string length for the genomic signatures of complete bacterial genomes can be chosen as 7.

Table 2 Suppression of $k$-nucleotide strings and self-similarity limits of bacteria complete genomes, labeled by $\Lambda_k$ and $k_l$, respectively. Sequence lengths are given in parentheses.

k                                     6    7    8      9       10      $k_l$
$\Lambda_k^{mgen}$ (580074)           14   851  14189  126690  776767  7
$\Lambda_k^{mjan}$ (1664970)          3    318  7656   84937   612138  8
$\Lambda_k^{hpyl}$ (1667867)          2    192  4290   58661   538051  8
$\Lambda_k^{hpyl99}$ (1643831)        1    130  3977   58033   538512  8
$\Lambda_k^{bbur}$ (910724)           0    232  8139   101444  712552  8
$\Lambda_k^{rpxx}$ (1111523)          0    71   4778   79792   643520  8
$\Lambda_k^{hinf}$ (1830138)          0    12   1077   33859   442423  8
$\Lambda_k^{pNGR234}$ (536165)        0    10   2881   76649   699974  7
$\Lambda_k^{mpneu}$ (816394)          0    7    2329   66513   638786  8
$\Lambda_k^{mthe}$ (1751377)          0    5    665    26669   408030  8
$\Lambda_k^{aquae}$ (1551335)         0    4    840    33972   468735  8
$\Lambda_k^{pyro}$ (1738505)          0    4    708    26863   403468  8
$\Lambda_k^{aful}$ (2178400)          0    4    365    16382   330488  8
$\Lambda_k^{mtub}$ (4411529)          0    3    595    20793   306071  9
$\Lambda_k^{pabyssi}$ (1765118)       0    3    291    18803   367742  8
$\Lambda_k^{tmar}$ (1860725)          0    2    594    24329   399932  8
$\Lambda_k^{cpneu}$ (1230230)         0    2    452    28569   468992  8
$\Lambda_k^{ecoli}$ (4639221)         0    1    173    5595    150409  9
$\Lambda_k^{synecho}$ (3573470)       0    1    149    8058    214433  9
$\Lambda_k^{ctra}$ (1042519)          0    0    562    34004   510293  8
$\Lambda_k^{aero}$ (1669695)          0    0    137    20084   401256  8
$\Lambda_k^{tpal}$ (1138011)          0    0    118    20912   453066  8
$\Lambda_k^{bsub}$ (4214814)          0    0    4      2919    156165  9

4. CONCLUSION

In summary, we have shown that the MR of DNA sequences is one-to-one. Using the MR method, the suppression of nucleotide strings in DNA sequences is determined. For a DNA sequence, an optimal string length for displaying the genomic signature is obtained by eliminating the effects of finite sequence length. The optimal string length is further shown to be a self-similarity limit in computing the information dimension. Using this method, self-similarity limits of the genomic signatures of complete bacterial genomes are determined.

ACKNOWLEDGMENTS

This work was supported in part by the National Key Program for Developing Basic Science G1999032801-11.

REFERENCES

1. A. J. Gibbs and G. A. McIntyre, “The Diagram, a Method for Comparing Sequences,” Eur. J. Biochem. 16, 1–11 (1970).
2. N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa and T. Miyata, “Evolutionary Relationship of Archaebacteria, Eubacteria, and Eukaryotes Inferred from Phylogenetic Trees of Duplicated Genes,” Proc. Natl. Acad. Sci. USA 86, 9355–9359 (1989).
3. S. Karlin, J. Mrazek and A. M. Campbell, “Compositional Biases of Bacterial Genomes and Evolutionary Implications,” J. Bacteriol. 179, 3899–3913 (1997).
4. H. J. Jeffrey, “Chaos Game Representation of Gene Structure,” Nucleic Acids Res. 18, 2163–2170 (1990).
5. K. A. Hill and S. M. Singh, “The Evolution of Species-Type Specificity in the Global DNA Sequence Organization of Mitochondrial Genomes,” Genome 40, 342–356 (1997).
6. P. J. Deschavanne, A. Giron, J. Vilain, G. Fagot and B. Fertil, “Genomic Signature: Characterization and Classification of Species Assessed by Chaos Game Representation of Sequences,” Mol. Biol. Evol. 16, 1391–1399 (1999).
7. Z.-B. Wu, “Metric Representation of DNA Sequences,” Electrophoresis 21, 2321–2326 (2000).
8. B. B. Mandelbrot, The Fractal Geometry of Nature (Freeman and Company, New York, 1983).
9. W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C, 2nd ed. (Cambridge University Press, 1992).
10. J. D. Farmer, “Chaotic Attractors of an Infinite-Dimensional Dynamical System,” Physica D 4, 366–393 (1982).
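
As a supplementary sketch of the Section 2 procedure, the Python code below generates the partition lines of Eqs. (3) and (4) with exact fractions and counts suppressed $k$-nucleotide strings ($\Lambda_k$) by direct enumeration. It is an illustration under stated assumptions rather than the original implementation: the names `partition_lines`, `suppressed_count` and `optimal_length` are hypothetical, sequences are assumed to contain only A, C, G and T, and Python's `random` module stands in for the random number generator of Ref. 9, so counts for random sequences will not reproduce the tabulated values exactly.

```python
import random
from fractions import Fraction


def partition_lines(k):
    """Partition lines b^k_i along the alpha axis, from Eqs. (3) and (4)."""
    lines = [Fraction(0), Fraction(1, 2), Fraction(1)]  # k = 1
    for _ in range(k - 1):
        n_prev = len(lines)
        n = 2 * n_prev - 1
        new = [None] * n
        for i in range(n_prev):  # left half, Eq. (3)
            new[i] = lines[i // 2] if i % 2 == 0 else lines[i] / 3
        for i in range(n_prev, n):  # right half, Eq. (4)
            new[i] = 1 - new[n - 1 - i]
        lines = new
    return lines


def suppressed_count(seq, k):
    """Number of k-nucleotide strings (out of 4**k) absent from seq."""
    observed = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return 4 ** k - len(observed)


def optimal_length(n, k_max=12, seed=0):
    """Longest k with no suppressed strings in a random sequence of length n.

    Assumption: a uniform random A/C/G/T sequence from Python's generator
    is used as the reference sequence.
    """
    rng = random.Random(seed)
    rand_seq = "".join(rng.choice("ACGT") for _ in range(n))
    k = 0
    while k < k_max and suppressed_count(rand_seq, k + 1) == 0:
        k += 1
    return k
```

Calling `partition_lines(3)` returns the nine trinucleotide lines listed in Section 2 as exact fractions, which gives a simple consistency check on the recursion.
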
