1 (2003) 19–25
© World Scientific Publishing Company
SELF-SIMILARITY LIMITS OF
GENOMIC SIGNATURES
ZUO-BING WU
State Key Laboratory of Nonlinear Mechanics, Institute of Mechanics
Academia Sinica, Beijing 100080, China
wuzb@lnm.imech.ac.cn
Abstract
It is shown that metric representation of DNA sequences is one-to-one. By using the metric
representation method, suppression of nucleotide strings in the DNA sequences is determined.
For a DNA sequence, an optimal string length to display genomic signature in chaos game
representation is obtained by eliminating effects of the finite sequence. The optimal string
length is further shown as a self-similarity limit in computing information dimension. By
using the method, self-similarity limits of bacteria complete genomic signatures are further
determined.
$$
\alpha = 2\sum_{j=1}^{m} \mu_{m-j+1}\, 3^{-j} + 3^{-m}
       = 2\sum_{i=1}^{m} \mu_i\, 3^{-(m-i+1)} + 3^{-m}. \tag{1}
$$

By Lemma 2, each finite subsequence $\Sigma_m$ has a correspondent infinite sequence $G^\infty \Sigma_m$. Here, we define the set of the infinite sequences as $\Sigma$.

Theorem 1. $(\alpha, \beta) : \Sigma \to \Lambda$ is one-to-one. $\Lambda$ is a set of points in the $(\alpha, \beta)$ plane.
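As a brief worked illustration of Eq. (1), assume that each $\mu_i$ is a binary digit (0 or 1) attached to the $i$-th nucleotide; this coding is an assumption here, since its definition does not appear in the surviving text. Under that assumption, two small cases and a one-step recursion follow directly from Eq. (1):

$$
m = 1:\quad \alpha = \tfrac{2\mu_1}{3} + \tfrac{1}{3} \in \left\{\tfrac{1}{3},\, 1\right\};
\qquad
m = 2,\ (\mu_1, \mu_2) = (0, 1):\quad \alpha = 2\left(1 \cdot 3^{-1} + 0 \cdot 3^{-2}\right) + 3^{-2} = \tfrac{7}{9}.
$$

$$
\alpha_m = \frac{2\mu_m}{3} + \frac{1}{3}\left[\, 2\sum_{j=1}^{m-1} \mu_{(m-1)-j+1}\, 3^{-j} + 3^{-(m-1)} \right]
         = \frac{\alpha_{m-1} + 2\mu_m}{3}, \qquad \alpha_0 = 1 .
$$

In this reading, appending a nucleotide contracts the current coordinate by a factor of 3 toward 0 (if $\mu_m = 0$) or toward 1 (if $\mu_m = 1$). The second case above is recovered as $\alpha_1 = 1/3$ followed by $\alpha_2 = (1/3 + 2)/3 = 7/9$.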
Fig. 1 Metric representation of HUMHBB. Its boundary and partition lines are labeled by solid lines and dashed lines, respectively.
contradiction. Suppose $(\alpha, \beta)\{\Sigma_1\} = (\alpha, \beta)\{\Sigma_2\}$, and mark this point as $P$ in the $(\alpha, \beta)$ plane. For the zone including the point $P$, we encode it as two subsequences $\Sigma_1^1$ and $\Sigma_2^1$ with the same mononucleotide. Then, enlarging the zone by an area factor of $3^2$, we obtain two encoding subsequences $\Sigma_1^2$ and $\Sigma_2^2$ with the same dinucleotide. Each enlarging process provides a right shift to the two subsequences. At the same time, the point $P$ is included in only one of the four enlarged zones, so the two shifted subsequences are the same. Following the enlarging process for infinitely many steps, we obtain $\Sigma_1 = \Sigma_2$, contradicting our original assumption. This contradiction is due to the fact that we have assumed $(\alpha, \beta)\{\Sigma_1\} = (\alpha, \beta)\{\Sigma_2\}$; thus, if $\Sigma_1 \neq \Sigma_2$, then $(\alpha, \beta)\{\Sigma_1\} \neq (\alpha, \beta)\{\Sigma_2\}$.

For the DNA sequence, some zones in CGR are replenished by points, so that a pattern appears. In CGR, several subsequences with different ending k-nucleotide strings correspond to the same points on the bounds of zones. For example, the subsequences $G^\infty A$ in the zone A, $T^\infty C$ in the zone C, $A^\infty G$ in the zone G and $C^\infty T$ in the zone T have the same point in CGR, (1/2, 1/2). Under left shift operators, the corresponding relation between points and subsequences is preserved in zones with small enough lengths. For example, $G^\infty AC$ in the zone AC, $T^\infty C^2$ in the zone $C^2$, $A^\infty GC$ in the zone GC and $C^\infty TC$ in the zone TC have the same point in CGR, (1/4, 3/4).

In MR of DNA sequences, each zone in CGR is shrunk and clearly divided by four bands. There exists a one-to-one correspondence between zones and ending k-nucleotide strings of subsequences. The frequency of points in a zone can be determined by using the MR method as follows. In order to compute frequencies in zones encoded by k-nucleotide strings, we need to determine the partition lines of the MR in Fig. 1. For mononucleotides, there exist $2 \times 2$ zones in the MR. We have $n_1\ (= 3)$ partition lines $b^1_0 = 0$, $b^1_1 = 1/2$ and $b^1_2 = 1$ along the α axis. For dinucleotides, there exist $4 \times 4$ zones in the MR. We have $n_2\ (= 5)$ partition lines $b^2_0 = b^1_0 = 0$, $b^2_1 = b^1_1/3 = 1/6$, $b^2_2 = b^1_1 = 1/2$, $b^2_3 = 1 - b^2_1 = 5/6$ and $b^2_4 = 1 - b^2_0 = 1$ along the α axis. In general, for (k−1)-nucleotide strings, if we know the $n_{k-1}\ (= 2^{k-1} + 1)$ partition lines $b^{k-1}_i$ $(i = 0, 1, \ldots, n_{k-1} - 1)$ along the α axis, we can obtain the $n_k\ (= 2^k + 1 = 2n_{k-1} - 1)$ partition lines $b^k_i$ $(i = 0, 1, \ldots, n_k - 1)$ for k-nucleotide strings as follows. For the k-nucleotide strings,
there exist $2^k \times 2^k$ zones in the MR. The left half ($0 \le i \le n_{k-1} - 1$) of the partition lines along the α axis is described as follows:

$$
b^k_i =
\begin{cases}
b^{k-1}_{i/2}, & \text{if } i \bmod 2 = 0, \\
b^{k-1}_{i}/3, & \text{if } i \bmod 2 = 1.
\end{cases} \tag{3}
$$

From (3), the right half ($n_{k-1} \le i \le n_k - 1$) of the partition lines along the α axis can be determined immediately:

$$
b^k_i = 1 - b^k_{n_k - 1 - i}. \tag{4}
$$

For example, for trinucleotides, the nine partition lines along the α axis are 0, 1/18, 1/6, 5/18, 1/2, 13/18, 5/6, 17/18 and 1. For tetranucleotides, we obtain the 17 partition lines 0, 1/54, 1/18, 5/54, 1/6, 13/54, 5/18, 17/54, 1/2, 37/54, 13/18, 41/54, 5/6, 49/54, 17/18, 53/54 and 1 along the α axis. Partition lines along the β axis are the same as those along the α axis. Each zone in the MR can thus be surrounded by the combined partition lines along the α and β axes.

Using the MR method, we determine suppression of k-nucleotide strings in HUMHBB (human β-region, chromosome 11) with 73,308 bases and YEAST1 (yeast chromosome 1) with 230,209 bases in Table 1, respectively. In order to check the efficiency of the method, we also determine the number of disappearing strings among all strings for a given string length in HUMHBB and YEAST1, respectively. The results are identical with those in Table 1. So, the MR method is effective in determining suppression of nucleotide strings in DNA sequences.

Table 1 Suppression of k-nucleotide strings in HUMHBB, YEAST1 and random sequences. The total numbers of nucleotide strings for a length k and of suppressed k-nucleotide strings are labeled by $\Pi_k$ and $\Lambda_k$, respectively.

k                                  5     6      7         8            9              10
Λ_k^HUMHBB/Λ_k^Random (73308)      4/0   244/0  3667/208  32909/21402  209280/198219  985222/977852
Λ_k^YEAST1/Λ_k^Random (230209)     0/0   0/0    110/0     8897/2021    134302/109290  863555/842246

In CGR of DNA sequences, self-similarity patterns change more obscurely as the lengths of sequences increase. A grey plot describes frequency values in small zones, whose sizes ($2^{-k} \times 2^{-k}$) are given by the lengths k of the strings encoding the zones. As the string length increases, the self-similarity patterns in CGR become clearer. High- and low-frequency zones are redivided into smaller ones and described by a grey scale. Some empty zones may appear in the patterns of CGR, i.e. some nucleotide strings are suppressed in the sequences. In the procedure of decreasing zone sizes, more and more empty zones emerge in the patterns of CGR. For example, the evolution of a self-similarity pattern in CGR of the archaebacterium Archaeoglobus fulgidus is shown in Fig. 1 of Deschavanne et al. [6] If DNA sequences are infinite, the compositional structure can be displayed in small enough zones. Empty zones are a part of the global feature in CGR. However, the DNA sequences are finite. A finite sequence, even a random sequence, may also lead to suppression of strings. As the string length increases, more and more strings are suppressed in the finite sequences.

In Table 1, we compare the suppression of nucleotide strings between DNA and random sequences with the same length. Suppression of nucleotide strings for HUMHBB starts at k = 5. For a random sequence with the same length, which is given by using a random number generator [9], suppression of nucleotide strings is delayed to start at k = 7. At k = 7, the number of suppressed nucleotide strings for the random sequence is 5.67% of that for HUMHBB. The finite length of the HUMHBB sequence affects the suppression of 7-nucleotide strings. As k increases, the numbers of suppressed nucleotide strings for the random sequence increase further and approach those for HUMHBB. At k = 10, the number of suppressed nucleotide strings for the random sequence is 99.3% of that for HUMHBB. In this case, suppression of nucleotide strings in HUMHBB is mainly caused by the finite length of the sequence. Moreover, suppression of nucleotide strings for YEAST1 starts at k = 7. For a random sequence with the same length, which is given by using a random number generator [9], suppression of nucleotide strings is delayed to start at k = 8 (Table 1). The finite length of the YEAST1 sequence affects the suppression of 8-nucleotide strings. At k = 10, the number of suppressed nucleotide strings for the random sequence is 97.5% of that for YEAST1. From this comparison of suppression of nucleotide strings, we find that HUMHBB and YEAST1 have shorter suppressed nucleotide strings than random sequences with the same lengths, respectively. As the string length increases, the finite length of the sequences has a stronger effect on the suppression of nucleotide strings.
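The partition-line recursion of Eqs. (3) and (4), and the consistency check between the MR method and direct string counting described for Table 1, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's implementation: the bit coding in `MU` and `NU` is inferred from the zone labels of Fig. 1, the coordinates are updated with the recurrence $\alpha_m = (\alpha_{m-1} + 2\mu_m)/3$ obtained by splitting off the first term of Eq. (1), and the function names are hypothetical. Floating-point coordinates are used, which should be adequate only for moderate k.

```python
from bisect import bisect_left
from fractions import Fraction
import random

# Assumed bit coding, inferred from the zone labels of Fig. 1:
# alpha axis: A, C map to 0 and G, T map to 1; beta axis: A, T map to 0 and C, G map to 1.
MU = {"A": 0, "C": 0, "G": 1, "T": 1}
NU = {"A": 0, "T": 0, "C": 1, "G": 1}


def partition_lines(k):
    """Return the 2**k + 1 partition lines b^k_i along one axis, via Eqs. (3)-(4)."""
    lines = [Fraction(0), Fraction(1, 2), Fraction(1)]  # mononucleotides (k = 1)
    for _ in range(k - 1):
        n_prev = len(lines)
        n_new = 2 * n_prev - 1
        new = [Fraction(0)] * n_new
        for i in range(n_prev):                 # left half, Eq. (3)
            new[i] = lines[i // 2] if i % 2 == 0 else lines[i] / 3
        for i in range(n_prev, n_new):          # right half, Eq. (4)
            new[i] = 1 - new[n_new - 1 - i]
        lines = new
    return lines


def zone_index(lines, x):
    """Index of the zone (between consecutive partition lines) that contains x."""
    return bisect_left(lines, x, 1) - 1


def suppressed_by_mr(seq, k):
    """Count empty MR zones for k-nucleotide strings (floating-point sketch)."""
    lines = [float(b) for b in partition_lines(k)]
    occupied = set()
    alpha = beta = 1.0  # value of Eq. (1) for the empty sequence, since 3**0 = 1
    for n, base in enumerate(seq):
        # Appending one nucleotide contracts each coordinate by a factor of 3.
        alpha = (alpha + 2 * MU[base]) / 3
        beta = (beta + 2 * NU[base]) / 3
        if n + 1 >= k:  # only prefixes that end with a full k-nucleotide string
            occupied.add((zone_index(lines, alpha), zone_index(lines, beta)))
    return 4 ** k - len(occupied)


def suppressed_by_counting(seq, k):
    """Count k-nucleotide strings that never occur in seq, by direct enumeration."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return 4 ** k - len(present)


if __name__ == "__main__":
    rng = random.Random(0)
    demo = "".join(rng.choice("ACGT") for _ in range(2000))
    print([str(b) for b in partition_lines(3)])
    print([(k, suppressed_by_mr(demo, k), suppressed_by_counting(demo, k))
           for k in range(1, 7)])
```

Exact `Fraction` arithmetic is used for the partition lines so that they can be compared directly with the values listed in the text; the sequence is assumed to contain only the uppercase letters A, C, G and T.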
Fig. 2 A plot of the information function $I(\epsilon)$ versus $\log(1/\epsilon)$, labeled by dots, and its fitting line for HUMHBB.

In order to display the genomic signature, we must eliminate the effects of finite sequences on suppression of nucleotide strings. For a DNA sequence, we take the longest string length before suppression of nucleotide strings in a random sequence with the same length as an optimal option of string lengths. According to the definition, string lengths 6 and 7 can be chosen as optimal options for genomic signatures of HUMHBB and YEAST1, respectively.
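The definition of the optimal string length can be sketched as follows. This is an illustrative sketch only: the random model (independent, equiprobable A, C, G, T drawn with Python's `random` module) is an assumption, since the generator of Ref. 9 is not specified here, and the function names are hypothetical.

```python
import random


def suppressed_count(seq, k):
    """Number of the 4**k possible k-nucleotide strings that are absent from seq."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return 4 ** k - len(present)


def random_sequence(length, seed=0):
    """Independent, equiprobable bases (an assumed model of the random sequence)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


def optimal_string_length(reference_seq, k_max=12):
    """Longest k for which no k-nucleotide string is suppressed in reference_seq."""
    for k in range(1, k_max + 1):
        if suppressed_count(reference_seq, k) != 0:
            return k - 1
    return k_max


if __name__ == "__main__":
    # Lengths of HUMHBB and YEAST1 as quoted in the text.
    for length in (73308, 230209):
        print(length, optimal_string_length(random_sequence(length)))
```

For random sequences of 73,308 and 230,209 bases this would be expected to return 6 and 7 in most runs, in line with Table 1, although the result for a particular seed is not guaranteed.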
3. LIMITS OF SELF-SIMILARITY SCALES

Suppression of certain nucleotide strings in the DNA sequences leads to a fractal pattern seen in the MR of DNA sequences. To quantify the fractal
Information dimension in CGR can thus be determined as $(\log_2 3) D_1$. We compute the information function $I(\epsilon)$ with different sizes $\epsilon$ for HUMHBB (as drawn in Fig. 2). A linear part of the curve $I(\epsilon)$ versus $\log(1/\epsilon)$ emerges between $\log(1/\epsilon) = \log 3 = 1.10$ and $\log(1/\epsilon) = 6 \log 3 = 6.59$. A fitting line is also drawn in Fig. 2. The point for $\log(1/\epsilon) = 7 \log 3 = 7.69$ departs from the line. As $\log(1/\epsilon)$ increases further, the points depart farther and farther from the line. Since points in the zones correspond to k-nucleotide strings, we can derive that the self-similarity pattern in the MR is approximately preserved from mononucleotides to 6-nucleotide strings, and that the suppression of many nucleotide strings emerges at 7-nucleotide strings. Using the least-squares fit method [9] for the linear part, we determine its slope, i.e. the information dimension $D_1$, to be 1.20. It is less than the information dimension 1.26 for a random sequence of the same length.

Moreover, in Fig. 3, we draw the information function $I(\epsilon)$ versus $\log(1/\epsilon)$ for YEAST1. A linear part of the curve $I(\epsilon)$ versus $\log(1/\epsilon)$ exists between $\log(1/\epsilon) = \log 3 = 1.10$ and $\log(1/\epsilon) = 7 \log 3 = 7.69$. We find that the suppression of many nucleotide strings in YEAST1 emerges from 8-nucleotide strings. Using the least-squares fit method [9] for the linear part, we also plot a fitting line in Fig. 3 and determine its slope, i.e. the information dimension $D_1$, to be 1.22. It is less than the information dimension 1.26 for a random sequence of the same length.
k                           6     7      8       9      10   k_l
Λ_k^mgen (580074)          14   851  14189  126690  776767     7
Λ_k^mjan (1664970)          3   318   7656   84937  612138     8
Λ_k^hpyl (1667867)          2   192   4290   58661  538051     8
Λ_k^hpyl99 (1643831)        1   130   3977   58033  538512     8
Λ_k^bbur (910724)           0   232   8139  101444  712552     8
Λ_k^rpxx (1111523)          0    71   4778   79792  643520     8
Λ_k^hinf (1830138)          0    12   1077   33859  442423     8
Λ_k^pNGR234 (536165)        0    10   2881   76649  699974     7
Λ_k^mpneu (816394)          0     7   2329   66513  638786     8
Λ_k^mthe (1751377)          0     5    665   26669  408030     8
Λ_k^aquae (1551335)         0     4    840   33972  468735     8
Λ_k^pyro (1738505)          0     4    708   26863  403468     8
Λ_k^aful (2178400)          0     4    365   16382  330488     8
Λ_k^mtub (4411529)          0     3    595   20793  306071     9
Λ_k^pabyssi (1765118)       0     3    291   18803  367742     8
Λ_k^tmar (1860725)          0     2    594   24329  399932     8
Λ_k^cpneu (1230230)         0     2    452   28569  468992     8
Λ_k^ecoli (4639221)         0     1    173    5595  150409     9
Λ_k^synecho (3573470)       0     1    149    8058  214433     9
Λ_k^ctra (1042519)          0     0    562   34004  510293     8
Λ_k^aero (1669695)          0     0    137   20084  401256     8
Λ_k^tpal (1138011)          0     0    118   20912  453066     8
Λ_k^bsub (4214814)          0     0      4    2919  156165     9
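The slope fitting used in Sec. 3 to obtain the information dimension $D_1$ can be sketched as follows. Several points are assumptions rather than statements from the text: the information function is taken in the standard Shannon form $I(\epsilon) = -\sum_i p_i \ln p_i$ (the paper's own definition does not survive in this section), the zone probabilities $p_i$ are estimated from k-nucleotide string frequencies using the one-to-one correspondence between MR zones and strings, the zone size is $\epsilon = 3^{-k}$ so that $\log(1/\epsilon) = k \log 3$ as in the quoted values, and the function names are hypothetical.

```python
import math
from collections import Counter


def information_function(seq, k):
    """Shannon information -sum p_i ln p_i over the occupied k-nucleotide zones."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def information_dimension(seq, k_values):
    """Estimate D1 as the slope of I(eps) versus log(1/eps), with eps = 3**-k."""
    xs = [k * math.log(3) for k in k_values]
    ys = [information_function(seq, k) for k in k_values]
    return least_squares_slope(xs, ys)


if __name__ == "__main__":
    # Ideal uniform case: all 4**k zones equally occupied, so I = k ln 4.
    ks = range(1, 7)
    slope = least_squares_slope([k * math.log(3) for k in ks],
                                [k * math.log(4) for k in ks])
    print(slope, math.log(3, 2) * slope)
```

In the ideal uniform case the slope is $\ln 4 / \ln 3 \approx 1.26$, which agrees with the random-sequence value quoted in Sec. 3, and multiplying by $\log_2 3$ gives 2, the dimension of a uniformly filled CGR square. For a real sequence, `k_values` would be restricted to the linear range, for example 1 to 6 for HUMHBB.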
ACKNOWLEDGMENTS
This work was supported in part by the National Key Program for Developing Basic Science G1999032801-11.