Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Journal of Theoretical Biology 267 (2010) 95–105

Contents lists available at ScienceDirect

Journal of Theoretical Biology


journal homepage: www.elsevier.com/locate/yjtbi

A study of entropy/clarity of genetic sequences using


metric spaces and fuzzy sets
D.N. Georgiou a, T.E. Karakasidis b,, Juan J. Nieto c, A. Torres d
a
University of Patras, Department of Mathematics, 265 00 Patras, Greece
b
University of Thessaly, Department of Civil Engineering, 383 34 Volos, Greece
c
Departamento de Análisis Matemático, Facultad de Matemáticas, Universidad de Santiago de Compostela, 15782, Spain
d
Departamento de Psiquiatrı́a Radiologı́a y Salud Pública, Facultad de Medicina, Universidad de Santiago de Compostela, 15782, Spain

a r t i c l e in fo abstract

Article history: The study of genetic sequences is of great importance in biology and medicine. Sequence analysis and
Received 21 January 2010 taxonomy are two major fields of application of bioinformatics. In the present paper we extend the
Received in revised form notion of entropy and clarity to the use of different metrics and apply them in the case of the Fuzzy
22 July 2010
Polynuclotide Space (FPS). Applications of these notions on selected polynucleotides and complete
Accepted 6 August 2010
Available online 11 August 2010
genomes both in the I12  k space, but also using their representation in FPS are presented. Our results
show that the values of fuzzy entropy/clarity are indicative of the degree of complexity necessary for
Keywords: the description of the polynucleotides in the FPS, although in the latter case the interpretation is
Polynucleotides slightly different than in the case of the I12  k hypercube. Fuzzy entropy/clarity along with the use of
Fuzzy sets
appropriate metrics can contribute to sequence analysis and taxonomy.
Metric spaces
& 2010 Elsevier Ltd. All rights reserved.
Entropy
Clarity

1. Introduction Fuzzy Polynucleotide Space (FPS) based on the principle of the


fuzzy hypercube (Kosko, 1992). In this notation a polynucleotide
Bioinformatics is a relatively new discipline (see Jamshidi consisting of a sequence of k triplets XYZ is a point in a I12  k
et al., 2001; Morgenstern, 2002; Paun et al., 1998; Percus, 2002; space. However, Torres and Nieto (see Torres and Nieto, 2003)
Tang, 2000) where Mathematics play an important role in the mapped a polynucleotide on a I12 space by considering the
analysis of genetic sequences. The genetic material of living frequencies of the nucleotides at the three base sites of a codon in
organisms consist of nucleic acids DNA and RNA. The analysis of the coding sequence. In that work using a metric motivated by
the genetic material is of great importance for diagnosis and publications of Lin (1997) and Sadegh-Zadeh (see Sadegh-Zadeh,
taxonomy reasons. In this course there are two basic strategies 2000), they calculated distances between nucleotides. They also
that are commonly used: (a) sequence analysis, i.e. determination applied their algorithm for the comparison of complete genomes
of the building blocks of a nucleic acid (nucleotides) and their (for example M. tuberculosis and E. coli). Further work has been
order in the molecular chain, and (b) sequence comparison used recently performed using the idea of Nieto et al. (see Nieto et al.,
to identify the degree of difference/similarity between polynuclo- 2006) in which the influence of several metrics have been
tides, e.g in order to identify similarity with known viruses. examined. The advantages of this methodology are:
DNA and RNA are made of triplets XYZ of codons each of them
having the possibility to be one of four nucleotides {U, C, A, G} in (a) one can compare polynucleotides of very big length in a very
the case of DNA and {T, C, A, G} in the case of RNA (A ¼Adenine, efficient computationally way and
C¼Cytosine, G ¼Guanine, T¼Thymine, U¼Uracil) (Freland and (b) one can apply the algorithm in order to compare polynucleo-
Hurst, 1998). Sadegh-Zadeh (see Sadegh-Zadeh, 2000) showed tides of different length as it is the case for genomes of
that the genetic code can be represented in a 12-dimensional different organisms.
space because a triplet codon XYZ has a 3  4 ¼ 12 dimensional
fuzzy code (a1,y,a12) and it is a point in the 12-dimensional fuzzy We point that metrics play an important role on computational
polynucleotide space ½0,112 as a subspace of the real space biology. Different metrics have been used to study secondary
½0,112 . Sadegh-Zadeh (see Sadegh-Zadeh, 2000) introduced the structures (see Moulton et al., 2000) or biopolyment contact
structures (see Liabres and Rossello, 2004).
It is very important to be in a position to determine how close
 Corresponding author. two genetic sequences are since there are many important biological
E-mail address: thkarak@uth.gr (T.E. Karakasidis). and medical implications (see Stephen and Freeland, 2008;

0022-5193/$ - see front matter & 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.jtbi.2010.08.010
96 D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105

Kedarisetti et al., 2006; DasGupta et al., 1998; Chechetkin, 2003; we focus on the use of different metrics in the calculation of the
Chou, 1995, 2000b; Foster et al., 1999; Gusev et al., 1999; Jiang et al., entropy and clarity of a polynucleotide in conjunction with the
2002; Liben-Nowell, 2001; Li et al., 2001; Mocz, 1995). The use of FPS which can be used in order to reduce the information
biological distance among the 20 amino acids can be calculated necessary for the representation of large polynucloetides.
according to their classification results. Since the concept of pseudo The structure of the paper is as follows. In Section 2 we present
amino acid composition was proposed by Chou (2001), many efforts the notion of the Fuzzy Polynucleotide Space (FPS) and the
have been made trying to use various quantities to represent the 20 entropy concept and give some applications on polynucleotides
native amino acids in order to better reflect the sequence-order and selected genomes. We compare some of the results using our
effects through the vehicle of pseudo amino acid composition entropy definitions with results obtained in Menconi (2005)
(PseAA), along with work in order to choose effective properties for where the notion of computable complexity of several
such procedures (Kawashima et al., 1999; Kawashima and Kanehisa, complete genomes is analyzed and compared with the classical
2000; Trinquier and Sanejouand, 1998). In an earlier paper (Chou, entropy results. In Section 3, clarity of a polynucleotide is
2000a), the physicochemical distance among the 20 amino acids considered and results on several polynucleotides are presented.
(Schneider and Wrede, 1994) was adopted to define PseAA. Finally in Section 4 the conclusions of the present work are
Subsequently, some investigators used complexity measure factor summarized.
(Xiao et al., 2005a), some used the values derived from the cellular
automata (Xiao et al., 2005b,c, 2006a,b), some used hydrophobic
and/or hydrophilic values (Wolfenden, 2007; Zhang et al., 2001; 2. Entropy and fuzzy polynucleotide space
Chou, 2005b; Feng, 2002; Wang et al., 2006, 2004; Gao et al., 2005;
Chen et al., 2006a; Kurgan et al., 2007; Kurgan and Chen, 2007) and 2.1. Fuzzy sets and fuzzy hypercube
some were through Fourier transform (Liu et al., 2005; Perez-
Montoto et al., 2009; Guo et al., 2006), as well as trough cellular Let X be a set. A is a fuzzy subset of X if there is a function mA
automaton approach (Xiao et al., 2009b) The pseudo amino acid such that (for details see Bardossy and Duckstein, 1995; Bezdek,
composition was originally introduced to improve the prediction 1981; Klir and Yuan, 1995; Hashimoto, 1983; Terrano et al., 1992;
quality for protein subcellular localization and membrane protein Zimmermann, 1991)
type (Chou and Cai, 2005, 2006; Chou, 2001; Chou and Elrod, 1999a,
2002, 2003; Shen and Chou, 2009a, b), as well as for enzyme (1) mA : X-½0,1.
functional class (Chou, 2005b; Chou and Elrod, 1999b). Work using (2) A ¼ fðx, mA ðxÞÞ : x A Xg, that is A is the set of all pairs ðx, mA ðxÞÞ
pseudo amino acid composition has also been performed (Xiao et al., such that x A X and mA ðxÞ is the degree of its membership in A.
2008a,b, 2009a; Zhang et al., 2008; Lin and Pan, 2001). The pseudo
amino acid composition can be used to represent a protein sequence In what follows if X¼{x1,x2, y,xn} and
with a discrete model yet without completely losing its sequence- A ¼ fðx1 , mA ðx1 ÞÞ, . . . ,ðxn , mA ðxn ÞÞg,
order information (Chou and Shen, 2007a,b,d), and hence is
particularly useful for analyzing a large amount of complicated then we write
protein sequences by means of the taxonomic approach. Actually, it A ¼ ðmA ðx1 Þ, . . . , mA ðxn ÞÞ:
has been widely used to study various protein attributes, such as
protein structural class (Chen et al., 2006a,b; Lin and Li, 2007a; Ding Let A and B two fuzzy sets of a set X.
et al., 2007; Gu and Chen, 2009; Homaeian et al., 2007), protein Then by A4B we denote the fuzzy set for which the member-
subcellular localization (Chou and Shen, 2008, 2007a,b,Shen and ship function mA4B : X-½0,1 is defined as following
Chou, 2007a), protein subnuclear localization (Shen and Chou,
2005a,b; Mundra et al., 2007) protein submitochondria localization mA4B ðxÞ ¼ minfmA ðxÞ, mB ðxÞg,
(Du and Li, 2006), protein oligomer type (Chou and Cai, 2003), for every x A X.
conotoxin superfamily classification (Mondal et al., 2006; Lin and Li, Also by A3B we denote the fuzzy set for which the member-
2007b) membrane protein type (Liu et al., 2005; Shen and Chou, ship function mA3B : X-½0,1 is defined as following
2005a; Wang et al., 2006; Shen et al., 2006; Chou and Shen, 2007c)
mA3B ðxÞ ¼ maxfmA ðxÞ, mB ðxÞg,
apoptosis protein subcellular localization (Chen and Li, 2007a,b)
enzyme functional classification (Chou, 2005a; Chou and Cai, 2004; for every x A X.
Zhou et al., 2007; Shen and Chou, 2007b) protein fold pattern (Shen For A a fuzzy set, the fuzzy complement Ac is defined by
and Chou, 2006), and signal peptide (Chou et al., 2006; Chou and Ac(x)¼ 1 A(x), x A X.
Shen, 2007e; Shen and Chou, 2007c). Recent research works on the Kosko (1992) introduced a geometrical interpretation of fuzzy
extension of these kind of parameters in the form of Markov Chain sets as points in a hypercube. Indeed, for a given set X ¼{x1,x2,
invariants of 2D graph or networks representation of aminoacid, y,xn}, the set of all fuzzy subsets (of X) is precisely the unit
DNA, and RNA sequences to codify pseudo-aminoacid and pseudo- hypercube
nucleotide bases composition (Aguero-Chapin et al., 2008; Gonza- In ¼ ½0,1n ,
lez-Diaz et al., 2007c, 2007b, 2007a; Aguero-Chapin et al., 2006;
Vilar et al., 2009; Georgiou et al., 2009) as well as more complex since any fuzzy subset A determines a point P A In given by
work such as Xiao and Lin, 2009, Xiao et al., 2010. The reader can P ¼ ðmA ðx1 Þ, . . . , mA ðxn ÞÞ:
also consult some recent reviews which made a discussion of many Reciprocally, any point P ¼ ða1 , . . . ,an Þ A In generates a
of these previous results (Gonzalez-Diaz et al., 2008; Chou, 2009; Lin fuzzy subset A of X defined by the map mA : X-½0,1 such that
et al., 2009). mA ðxi Þ ¼ ai , i ¼1,2,y,n.
In the present paper we present some new results concerning Nonfuzzy or crisp subsets of X¼{x1,y,xn} are given by
the notions of entropy and clarity of a nucleotide that can be used mappings
in order to estimate the fuzziness of a polynucleotide. We
compare the results with that obtained in Sadegh-Zadeh (2000).
m : X-f0,1g
We note that it is possible to compare sequences using a from the set X into the set {0,1} and they are located at the 2n
minimum entropy principle (Sadovsky, 2003). More precisely corners of the n-dimensional unit hypercube In. So, the ground set
D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105 97

X¼{x1,y,xn} is itself the fuzzy set ð1,1, . . . ,1Þ A In . Also, the empty Table 1
fuzzy set is the fuzzy set ð0,0, . . . ,0Þ A In , denoted by |. (a) Number of nucleotides at the three base sites of a codon in the s1 sequence. and
(b) Fractions of nucleotides at the three base sites of a codon in the coding
Hypercubical calculus is developed in Zaus (1999), and some
sequence of s1
applications of the fuzzy unit hypercube are given in Nieto and
Torres (2003), Sadegh-Zadeh (1999) and Hegalson and Jobe U C A G Total
(1998). In this context a codon corresponds to a corner of the
12-dimensional unit hypercube I12. Any element of I12 may be (a)
First base 2 0 0 0 2
viewed as a fuzzy codon. Second base 0 0 1 1 2
DNA and RNA can be treated as a language written using an Third base 1 1 0 0 2
alphabet of strings. The role of strings is played by several
U C A G
chemical compounds. In fact the alphabet for DNA is {T,C,A,G}
while for RNA {U,C,A,G} where A,C,G,T and U stand for Adenine, (b)
Cytosine, Guanine, Thymine and Uracil, respectively. In this First base 1 0 0 0
context in the case of RNA alphabet if U is the first letter of this Second base 0 0 0.5 0.5
Third base 0.5 0.5 0 0
alphabet one codes it as ð1,0,0,0Þ : 1 because the first letter U is
present, 0 since the second letter does not appear, 0 since the example, Sadegh-Zadeh, 2000). Since the notion of entropy is
third letter is not present and 0 since the fourth letter G does not related also to the calculation of distances between points in
appear. In a similar way C is represented as (0,1,0,0), A as (0,0,1,0) the FPS, we describe briefly in the next section the notion of
and G as (0,0,0,1). So if we have a nucleotide described by the distance and then pass to the concepts of entropy and clarity.
codon UCG (serine) this would be written in the I12 hypercube as
ð1,0,0,0,0,1,0,0,0,0,0,1Þ:

2.2. Metrics—distances
There are cases where the exact chemical structure of the
sequence is not known for the complete sequence. In this case
Consider the n-dimensional unit hypercube In.
some components of its fuzzy code being neither 0 or 1 but a
If p ¼ ðp1 , . . . ,pn Þ, q ¼ ðq1 , . . . ,qn Þ A In are two different fuzzy
value in the interval (0,1) and are sequences not necessarily at a
polynucleotides, then we consider the following distances
corner of the hypercube. If for example we have a codon
between the elements p and q:
ð0:3,0:4,0:2,0:1,0,0,1,0,1,0,0,0Þ
This stands for XAU. The first letter X is unknown and corresponds
to: U to extent 0.3, C to extent 0.4, A to extent 0.2 and G to (1) Euclidean distance
extend 0.1 vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
When we have a polynuclotide which is a sequence of k u n
uX
triplets, one would need a k  12 hyperspace. For example if we l ðp,qÞ ¼ t
2
ðpi qi Þ2 :
i¼1
have the polynucleotide described by the sequence UACUGU
(tyrosin/cysteine), it is a point in I212 ¼ I24 and represented by
(2) Hamming distance
s1 ¼ ð1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0Þ
X
n
l1 ðp,qÞ ¼ jpi qi j:
However, if one considers the frequencies of the nucleotides of i¼1
the alphabet at the three base sites of a codon in the coding
sequence it may be viewed as a point in the hypercube I12. (3) Nieto–Torres Vazquez–Trasande (NTV) metric
Taking into account the number of nucleotides at the three Pn
base sites of a codon in the s1 sequence (see Table 1a) as well as jp q j
lðp,qÞ ¼ Pn i ¼ 1 i i :
the fractions of nucleotides at the three base sites of a codon in i¼1 maxfp i ,qi g
the coding sequence of s1 (See Table 1b) sequence s1 can be
written in the I12 space as
f ðs1 Þ ¼ ð1,0,0,0,0,0,0:5,0:5,0:5,0:5,0,0Þ: Also, if p ¼ q ¼ | ¼ ð0, . . . ,0Þ, then lð|,|Þ ¼ 0 (see Nieto et al.,
2003).
The distance l is motivated by publications Lin (1997) and
In the case of complete genome the frequencies of the Sadegh-Zadeh (2000). We know that l is a metric Nieto et al.
nucleotides at the three base sites of a codon in the codon (2003) and has already been employed in Torres and Nieto (2003),
sequence are considered. This can be viewed as point in the Torres and Nieto (2006) and Nieto et al. (2006). In Dress and Lokot
hypercube I12. (2003) (see, also Dress et al., 2004) it is proposed to call this
This idea has been applied to the genomes of M. tuberculosis, metric as the NTV metric. Applications of metrics are also
E. coli, and A. aeolicus to obtain their fuzzy set of frequencies presented in (Karakasidis and Georgiou, 2004; Samaras et al.,
and calculate their corresponding distance in the Fuzzy 2001).
Polynucleotide Space (FPS) (see Torres and Nieto, 2003; Nieto
et al., 2006).
When dealing with genetic sequences it is of interest: 2.3. The entropy of a polynucleotide

(i) to be able to describe how different two sequences are. For Let X¼{x1,y,xn} be a set and
this reason the notion of distance is used (see Nieto et al.,
A ¼ fðx1 , mA ðx1 Þ ¼ a1 Þ, . . . ,ðxn , mA ðxn Þ ¼ an Þg  ða1 , . . . ,an Þ,
2003, 2006; Nieto and Torres, 2003), and
(ii) to know how much ordered the sequence is. In this direction where ai A ½0,1, a fuzzy set of X. Then by c(A) (see, for example,
the notion of entropy and clarity is employed (see, for Sadegh-Zadeh, 2000) we denote the number
98 D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105

X
n X
n entropyl ðAÞ ¼ 1lðA,CÞ,
mA ðxi Þ ¼ ai :
i¼1 i¼1 where C is the fuzzy set (0.5,y,0.5) of In.

Proof. It is known that (see Sadegh-Zadeh, 2000) cðCÞ ¼ l1 ðC,|Þ.


In the crisp case, c(A) is the cardinality of A. Thus
In the following we propose a new definition of entropy
l1 ðC,AÞ
Definition 1. Let (In,d) be a metric space (see, for example, entropyl1 ðAÞ ¼ 1
l1 ðC,|Þ
Engelking, 1977), X ¼{x1,y,xn} a set, C ¼ ð0:5, . . . ,0:5Þ A In , l1 ðC,AÞ
| ¼ ð0, . . . ,0Þ A In , and F(2X) the fuzzy power set of X. The map ¼ 1
cðCÞ
entropyd : Fð2X Þ-½0,1, cðCÞl1 ðA,CÞ
¼ :
cðCÞ
dðA,CÞ
entropyd ðAÞ ¼ 1 ,
dðC,|Þ
Obviously,
for every A ¼ ða1 , . . . ,an Þ A In , is called fuzzy entropy map with
n
respect to the metric d. cðCÞ ¼ 0:5 þ 0:5 þ    þ0:5 ¼ n  0:5 ¼ ,
2
Remark. De Luca and Termini (1972) first axiomatized nonprob- and we have
abilistic entropy in the setting of fuzzy sets theory (see also Fan
and Ma, 2002). We adopt them here. Let X be a set and let E be a cðCÞl1 ðA,CÞ
entropyl1 ðAÞ ¼
set-to-point map cðCÞ
n 1
E : Fð2X Þ-½0,1, l ðA,CÞ
¼2 n
where F(2X) is the fuzzy power set of X. Hence E is a fuzzy set
2
defined on fuzzy sets of X. E is an entropy measure if it satisfies n2l1 ðA,CÞ
the four De Luca–Termini axioms: ¼ :
n

(DT1) E(A) ¼0 if A A 2X (A nonfuzzy), where 2X is the power


set of X. Also
(DT2) E(A) ¼1 if A(x)¼0.5, for every x A X. 1 pffiffiffi
l2 ðC,|Þ ¼  n,
(DT3) EðAÞ rEðBÞ if AðxÞ rBðxÞ when BðxÞ r0:5 and BðxÞ rAðxÞ 2
when BðxÞ Z 0:5.
and
(DT4) E(A) ¼e(Ac).
l2 ðC,AÞ
entropyl2 ðAÞ ¼ 1
l2 ðC,|Þ
It is possible that for some metric d the fuzzy entropy map
l2 ðC,AÞ
entropyd with respect to the metric d to satisfy the De Luca and ¼ 1
1 pffiffiffi
Termini axioms. So, for example for the metrics l1 and l2 it is clear  n
that the maps pffiffiffi 2
n2lðA,CÞ
¼ pffiffiffi :
l1 ðA,CÞ n
E1 ðAÞ  entropyl1 ðAÞ ¼ 1
l1 ðC,|Þ
and Now, for the NTV metric we have
2
l ðA,CÞ n  0:5
E2 ðAÞ  entropyl2 ðAÞ ¼ 1 lðC,|Þ ¼ ¼ 1:
l2 ðC,|Þ n  0:5
satisfy the above four De Luca–Termini axioms.
Also, it is possible that for some metric d the map entropyd does
not satisfy the De Luca and Termini axioms. So, for example for Thus
the NTV metric l the map
lðA,CÞ
entropyl ðAÞ ¼ 1 ¼ 1lðA,CÞ:
lðA,CÞ lðC,|Þ
E0 ðAÞ  entropyl ðAÞ ¼ 1
lðC,|Þ
does not satisfy the above four De Luca–Termini axioms (see This complete the proof. &
Example 2 below).
In the following we present some theorems concerning the
specification of the formulas used to calculate entropies in the FPS Example 1. Let X¼{x1,x2} be a set and A ¼(0.4,0.8) a fuzzy set of X.
space for specific metrics. We consider the metric space (I2, l1).

Theorem 1. Let X ¼{x1,y,xn} and A ¼(a1,y,an) a fuzzy set of X. Using the definition of entropy given in Sadegh-Zadeh (2000)
Then, the following statements are true: we have (see page 23 of Sadegh-Zadeh, 2000):
3
cðCÞl1 ðA,CÞ n2l1 ðA,CÞ entðAÞ ¼ ¼ 0:4286:
entropyl1 ðAÞ ¼ ¼ , 7
cðCÞ n
pffiffiffi
n2l2 ðA,CÞ Using the above definition we have
entropyl2 ðAÞ ¼ pffiffiffi ,
n 22l1 ðA,CÞ
entropyl1 ðAÞ ¼
and 2
D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105 99

22ðj0:40:5j þj0:80:5jÞ Now cðCÞ ¼ n=2. Thus,


¼
2
cðA4Ac Þ 2  cðA4Ac Þ
20:8 1:2 entropyl1 ðAÞ ¼ ¼ :
¼ ¼ ¼ 0:6: cðCÞ n
2 2
We thus observe that the entropy of Sadegh-Zadeh does not
coincide with the Hamming entropy Finally, by the fact that
entðAÞ ¼ 0:4286a entropyl1 ðAÞ ¼ 0:6: cðA4Ac Þ ¼ l1 ðA4Ac ,|Þ:
it follows that
Remark. According to the Definition 1 and Theorem 1 we have a 2  l1 ðA4Ac ,|Þ
entropyl1 ðAÞ ¼ :
geometrical intepretation of entropy is illustrated in Fig. 1. The n
entropyl1 of a fuzzy set A is 1 minus the Euclidean distance
a ¼ l2 ðA,CÞ divided by the Euclidean distance b ¼ l2 ðC,|Þ, that is When some of the ai are less than or equal 0.5 and others
a greater than 0.5, the proof is analogous.
entropyl2 ðAÞ ¼ 1 :
b In the following we present some examples of applications of
the use of the entropy definitions in various cases of polynucleo-
tides from relatively small ones up to large ones.
Theorem 2. Let X ¼{x1,y,xn} be a set and A¼ (a1,y,an), where
ai A ½0,1, a fuzzy set of X. Let Ac ¼ ð1a1 , . . . ,1an Þ A In . Then Example 2. Let (I2,d) be a metric space, X¼{x1,x2} a set, and
A¼C ¼(0.5,0.5) a fuzzy set of X. Then, we have
cðA4Ac Þ 2  cðA4Ac Þ 2  l1 ðA4Ac ,|Þ
entropyl1 ðAÞ ¼ ¼ ¼ , dðC,CÞ
cðCÞ n n entropyd ðAÞ ¼ 1 ¼ 10 ¼ 1:
dðC,|Þ
where C is the fuzzy set C ¼ ð0:5, . . . ,0:5Þ A In .

Proof. Suppose that ai r 0:5 for every i¼ 1,2,y,n  1, and an 4 0:5.


Then, we have Example 3. Let (I2,d) be a metric space, X ¼{x1,x2} a set and
A¼(0,1), B ¼(1,0) and D ¼(1,1) three fuzzy sets of X. Then, we have
A4Ac ¼ ðminfa1 ,1a1 g, . . . ,minfan1 1,1an1 g,minfan ,1an gÞ
¼ ða1 , . . . ,an1 ,1an Þ: & dðC,AÞ
entropyd ðAÞ ¼ 1 ¼ 11 ¼ 0,
dðC,|Þ
where d¼ l1 or d ¼l2.

By the above we have Similarly


1
l ðA,CÞ entropyd ðBÞ ¼ entropyd ðDÞ ¼ 0,
entropyl1 ðAÞ ¼ 1
l1 ðC,|Þ where d¼ l1 or d ¼l2.
Pn
j0:5ai j For the NTV metric l we have
¼ 1 i ¼ 1
n  0:5 lðC,AÞ 2 1
Pn1
j0:5ai j þ j0:5an j entropyl ðAÞ ¼ 1 ¼ 1 ¼ ,
¼ 1 i ¼ 1 lðC,|Þ 3 3
n  0:5
Pn1
i ¼ 1 0:5ai þ an 0:5 1
¼ 1 entropyl ðBÞ ¼ ,
n  0:5 3
n  0:5ða1 þa2 þ    þ an1 Þ þ an 1
¼ 1 and
n  0:5
n  0:5n  0:5 þ ða1 þ a2 þ    þ an1 Þan þ 1 lðC,DÞ 1 1
¼ entropyl ðDÞ ¼ 1 ¼ 1 ¼ :
n  0:5 lðC,|Þ 2 2
a1 þa2 þ    þ an1 þ1an
¼
n  0:5 We see that for points at the corners of I2, NTV metric does not
cðA4Ac Þ result zero values as is the case for metrics l1 or l2.
¼ :
cðCÞ
Example 4. Consider the sequences employed also by Sadegh-
Zadeh (2000):
(0,0,1) (0,1,1)
s1 ¼ UACUGU tyrosine=cysteine

This point belongs to the 24-dimensional unit cube and it


(1,0,1) corresponds to a corner in I24. Following the methodology of
(1,1,1)
Torres and Nieto (2003) we calculate the frequencies (fractions) of
a
the nucleotide at the three base sites in order to obtain their fuzzy
A
representation in the I12 hyperspace. The corresponding results
C appear in Tables 1a and 1b. Note that the entropy in I24 is
b
entropyl1 ðs1 Þ ¼ entropyl2 ðs1 Þ ¼ 0
O (0,1,0)
and
entropyl ðs1 Þ ¼ 0:2:

(1,1,0)
(1,0,0) In the I12 space the frequencies give a point in the I12 space :
Fig. 1 f ðs1 Þ ¼ ð1,0,0,0,0,0,0:5,0:5,0:5,0:5,0,0Þ:
100 D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105

Note that now we identify s1 in I24 and f(s1) in I12. Table 3


If C ¼ ð0:5, . . . ,0:5Þ A I12 , then, we have the following entropies Calculated entropy values for sequences s7, s8 and s9 using the metrics l, l1 and l2.
for the Euclidean metric
pffiffiffi Metric s7 s8 s9
l2 ðs1 ,CÞ 2
entropyl2 ðs1 Þ ¼ 1 2 ¼ 1 pffiffiffi  0:183503, l2 0 0.087129 0.183503
l ðC,|Þ 3
l1 0 0.166667 0.333333
Hamming metric, l 0.2 0.285714 0.384615

l1 ðs1 ,CÞ 1
entropyl1 ðs1 Þ ¼ 1 ¼  0:333333,
l1 ðC,|Þ 3 Example 6. Now consider the following sequences:
and NTV metric
s7 ¼UACUAC
lðs1 ,CÞ 5
entropyl ðs1 Þ ¼ 1 ¼  0:384615: s8 ¼UAGUAU
lðC,|Þ 13
s9 ¼UACUCG

We can see that there is a difference in the results when which correspond in the FPS, respectively, to
dealing with FPS. This subtlety will be analyzed with further
s7 ¼ ð1,0,0,0,0,1,0,0,0,0,1,0Þ
results.

Example 5. Consider the sequences: s8 ¼ ð1,0,0,0,0,1,0,0,0:5,0,0,0:5Þ,

s2 ¼CACUGU histidine/cysteine s9 ¼ ð1,0,0,0,0,0:5,0:5,0,0,0,0:5,0:5Þ:


s3 ¼CUCUGU leucine/cysteine The corresponding entropy values appear in Table 3. Note in the
s4 ¼CAUUGU histidine/cysteine first one, s7, the same triplet UAC is repeated all along the
s5 ¼CAGUGU glutamine/cysteine sequence. In the second one, s8, the dinucleotide UA is repeated at
s6 ¼CAAUGU glutamine/cysteine the same base positions.

The program to compute the entropy using these three metrics


These are points in a 24-dimensional unit cube since they are is available on request from the authors.
made of 2 triplets. Following the methodology of Torres and Nieto As we know from physics, entropy is a measure of the order/
(2003) we calculated the frequencies (fractions) of the nucleotides disorder of the system, which represents the degree of complexity
at the three base sites in order to obtain their fuzzy representation in order to describe the system. In the case where we have
in the I12 hyperspace. The entropy in the I24 is again zero as there repetition of the same triplet (as it is the case for UAC in
is no uncertainty concerning the chemical composition. However, sequence s7) we have maximum order resulting in zero entropy.
when dealing with FPS results will be different. In the case of FPS In the case where we have repetition of the same dyad like UA in
zero entropy means maximum order we have the same triplet all sequence s8 we have a slightly higher entropy and in other cases
along the genetic sequence. like sequence s9 entropy increases further.
The above sequences are represented in the I12 space as (see We remind here that there is a probabilistic definition of
Nieto et al., 2006): entropy for signals related to thermodynamics see Shannon,
s2 ¼ ð0:5,0:5,0,0,0,0,0:5,0:5,0:5,0:5,0,0Þ, 1948). This classical measure of entropy is defined as
X
H¼ pj log2 ðpj Þ
s3 ¼ ð0:5,0:5,0,0,0:5,0,0,0:5,0:5,0:5,0,0Þ,
where pj are the non-zero probabilities of a signal to have a given
s4 ¼ ð0:5,0:5,0,0,0,0,0:5,0:5,1,0,0,0Þ,
value.
s5 ¼ ð0:5,0:5,0,0,0,0,0:5,0:5,0:5,0,0,0:5Þ Applying this definition in the case of selected polynucleotides
as represented in the FPS the role of pj is played by the non-zero
and coordinates of the polynucleotide. In this case for the entropy of
s6 ¼ ð0:5,0:5,0,0,0,0,0:5,0:5,0:5,0,0:5,0Þ: s7, s8 and s9 we have
Hðs7 Þ ¼ 0,
The results of entropy using the various metrics are summar-
ized in Table 2. Hðs8 Þ ¼ 1
s2, s3, s4, s5 and s6 present the same entropy results although and
the exact value changes depending on the metric used and only s4
presents different entropy which is lower. In fact s2, s3, s5 and s6, Hðs9 Þ ¼ 2:
have the same number of coordinates being equal to 0.5 and all
the others 0, while s4 has only four coordinates equal to 0.5, one What is of interest is that sequence s7 which corresponds to
equal to 1 and all the others equal to 0. the most ordered sequence results in zero entropy. In the other
two sequences we have increasing entropy as in the case of
Table 2 definitions based on metrics given above. It is also remarkable
Entropy values for sequences s2, s3, s4, s5, s6 using the metrics l, l1 and l2 calculated that the ratio of entropies of sequences s8 and s9 equals 2 in both
in FPS.
cases: probabilistic entropy and entropy based on metrics.
Metric s2 s3 s4 s5 s6
Example 7. Consider the following sequences with three triplets
2
l 0.292893 0.292893 0.183503 0.292893 0.292893 r1 ¼ UACUACUAC
l1 0.5 0.5 0.333333 0.5 0.5
l 0.5 0.5 0.384615 0.5 0.5
r2 ¼ UACUACUAG
D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105 101

r3 ¼ UACCAAUAG In the probabilistic definition of entropy, for the point E we have


312 36
represented in the I ¼I space as HðEÞ ¼ 6
r1 ¼ ð1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0Þ
which is the maximum possible value in I12.
(2) If A ¼ ða1 , . . . ,a12 Þ A I12 , C ¼ ð0:5, . . . ,0:5Þ A I12 , and | ¼
r2 ¼ ð1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1Þ
12
ð0, . . . ,0Þ A I , then
r3 ¼ ð1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1Þ:
l2 ðC,AÞ
entropyl2 ðAÞ ¼ 1
l2 ðC,|Þ
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
The entropy in the I36 space is ða1 0:5Þ2 þ    þða12 0:5Þ2
¼ 1 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
entropyd ðr1 Þ ¼ entropyd ðr2 Þ ¼ entropyd ðr3 Þ ¼ 0, ð0:50Þ2 þ    þð0:50Þ2
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
where d ¼l1 or d ¼l2, and
ða1 0:5Þ2 þ    þða12 0:5Þ2
entropyl ðr1 Þ ¼ entropyl ðr2 Þ ¼ entropyl ðr3 Þ ¼ 0:2: ¼ 1 pffiffiffi :
3

However, in the FPS representation the entropy of the


sequences would not result zero values. In fact zero values would In FPS we have that
be reproduced only for the sequences where the same triplet is a1 þ    þ a12 ¼ 3:
repeated all along the sequence, for the second where we have
repetition of the dyad UA at the same base positions entropy This comes out from the fact that each ai corresponds to the
increases and is higher in the third case. Following the frequency of appearance of a letter of the DNA (or RNA) alphabet
methodology of Torres and Nieto (2003) we calculated the at each triplet base (first, second, third). For each base these
frequencies (fractions) of the nucleotides at the three base sites probabilities which correspond to a quadruplet of ai sums to 1, so
in order to obtain their fuzzy representation in the I12 hyperspace: for the three bases this results in 3.
Using a simple program of Mathematica (see Appendix
r1 ¼ ð1,0,0,0,0,1,0,0,0,0,1,0Þ
Appendix A) we see that the map
r2 ¼ ð1,0,0,0,0,1,0,0,0,0,2=3,1=3Þ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ða1 0:5Þ2 þ    þ ða12 0:5Þ2
entropyl2 ðAÞ ¼ 1 pffiffiffi
r3 ¼ ð2=3,1=3,0,0,0,1,0,0,0,1=3,1=3,1=3Þ 3
with the restriction
The corresponding entropy values are summarized in Table 4.
a1 þ    þ a12 ¼ 3:
Remarks. (1) In the case of large sequences the maximum
we have a maximum of the entropy at the point E.
entropy would correspond to the characteristic case in which
As above, we can see that
we have equiprobable distribution of all alphabet letters at all
three bases. This would correspond to the point. l1 ðC,AÞ ja1 0:5jþ    þ ja12 0:5j
entropyl1 ðAÞ ¼ 1 ¼ 1
E ¼ ð0:25,0:25,0:25,0:25,0:25,0:25,0:25,0:25,0:25,0:25,0:25,0:25Þ l1 ðC,|Þ 6

in the FPS representation. Its entropy in the metric l2 is with the restriction
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
l2 ðE,CÞ 12  0:252 a1 þ    þ a12 ¼ 3:
entropyl2 ðEÞ ¼ 1 2 ¼ 1 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:5:
l ðC,|Þ 12  0:52 have a maximum (of entropy) at the point E.
The entropies obtained by the use of metrics l1 and l are
Example 8. Following the methodology of Torres and Nieto
l1 ðE,CÞ
entropyl1 ðEÞ ¼ 1 ¼ 0:5 (2003), we consider the point
l1 ðC,|Þ
ð0:1632,0:3089,0:1724,0:3556,0:2036,0:3145,0:1763,0:3056,
and
0:1645,0:3461,0:1593,0:3302Þ A I12 :
lðE,CÞ
entropyl ðEÞ ¼ 1 ¼ 0:5: which corresponds to the fuzzy set of frequencies of the genome
lðC,|Þ
of M. tuberculosis (see Torres and Nieto, 2003), the point
ð0:1605,0:2420,0:2600,0:3374,0:3116,0:2286,0:2846,0:1752,
A point in the FPS corresponds to a corner of the hypercube
0:2619,0:2568,0:1831,0:2981Þ A I12 :
when we have the same triplet all along the sequence. If we have
maximum order, the point occupies a corner of the hypercube. which corresponds to the fuzzy set of frequencies of the genome
The bigger the distance from the corners, the bigger the entropy, of E. coli, and the point
and thus the bigger the complexity to describe the sequence. ð0:1706,0:1605,0:3241,0:3446,0:3282,0:1735,0:3478,0:1504,
0:2139,0:2455,0:3052,0:2352Þ A I12
Table 4
Calculated entropy values for sequences r1, r2 and r3 using the metrics l, l1 and l2.
which corresponds to the fuzzy set of frequencies of the genome
of A. aeolicus (see Nieto et al., 2006).
Metric r1 r2 r3
We also compare the entropies of the Mycoplasma Pneumoniae
2
l 0 0.077042 0.206508 using our definitions of entropy. In Tables 5a and 5b we
l1 0 0.111111 0.277778
l 0.2 0.255814 0.35
present the results concerning the representation of Mycoplasma
Pneumoniae in FPS.
102 D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105

Thus, the genome of M. pneumoniae is represented in the I12 Table 6


hypercube by the point Entropy values for sequences M. tuberculosis, E. coli, A. aeolicus and M. pneumoniae
using the metrics l, l1 and l2 calculated in FPS.
ð0:2038,0:1769,0:327,0:2923,0:3054,0:1936,0:3601,0:1408,
0:3212,0:2285,0:26,0:1902Þ A I12 : Metric M. tuberculosis E. coli A. aeolicus M. pneumoniae

l2 0.475876 0.488814 0.478769 0.482055


The corresponding entropies of all long polynucleotides appear l1 0.500033 0.499967 0.499917 0.499967
l 0.500033 0.499967 0.499917 0.499967
in Table 6. We observe that metrics give the same numerical
results while the NTV metric gives different results and can
differentiate in a more clear way the complexity of polynucleo-
tides in their FPS representation. Table 7
Comparison of results obtained using entropyl , entropyl1 , entropyl2 with the results
Remark. In Table 7 we compare the results obtained using the of complexity K and probabilistic entropy H1 of Menconi (2005) in the case of A.
three above entropy definitions for the A. aeolicus and E. coli with aeolicus and E. coli.
the results of Menconi (2005) where they computed the complex-
Genome K H1 entropyl2 entropyl1 entropyl
ity K of a genome and the probabilistic entropy H1 (for more
details on the notions of K and H1 see Menconi, 2005). What is of A. aeolicus 1.883 1.976 0.478 0.499 0.499
interest is that in the case of entropyl1 and entropyl results are E. Coli 1.893 1.987 0.489 0.499 0.499
practically identical, while entropyl2 results in a small but
identifiable difference between the two genomes in the same
sense like K and H1.
l1 ðA,CÞ
(3) clarityl1 ðAÞ ¼ :
3. Clarity and fuzzy polynucleotide space cðCÞ

2l1 ðA,CÞ
(4) clarityl1 ðAÞ ¼ :
n
Definition 2. Let d be a metric in I , X ¼{x1,y, xn} a set and n
A¼(a1,y,an), where ai A ½0,1, a fuzzy set of X. The clarity of A, with cðCÞcðA4Ac Þ
(5) clarityl1 ðAÞ ¼ :
respect to the metric d, denoted by clarityd(A) is defined to be the cðCÞ
number
2lðA,CÞ
1entropyd ðAÞ, (6) clarityl ðAÞ ¼ pffiffiffi :
n
that is (7) clarityl ðAÞ ¼ lðA,CÞ:

clarityd ðAÞ ¼ 1entropyd ðAÞ:


Proof. Follows by Theorems 1 and 2 and by fact that clarityd(A)¼
Example 9. Let X¼{x1,x2} be a set, A¼(0.4,0.8) a fuzzy set of X and
1 entropyd(A). &
consider the metric space (I2, l1). Then, we have

clarityl1 ðAÞ ¼ 1entropyl1 ðAÞ ¼ 10:6 ¼ 0:4: Remark. According the Definition 2 and Theorem 3 we have a
geometrical interpretation of clarity as illustrated in Fig. 1. The
Also, using the definition of clarity given in Sadegh-Zadeh
clarity of a fuzzy set A is the Euclidean distance a ¼l2(A,C) divided
(2000) we have
by the distance b ¼ l2 ðC,|Þ, that is
3 4 a
clarðAÞ ¼ 1entðAÞ ¼ 1 ¼ ¼ 0:5714: clarityl1 ðA,BÞ ¼ :
7 7 b
We observe that
clarityl1 ðAÞ ¼ 0:4 aclarðAÞ ¼ 0:5714: Example 10.

Theorem 3. Let X ¼{x1,y,xn} and A ¼(a1,y,an) a fuzzy set of X.


Then, the following statements are true: (1) Let X¼{x1,x2} be a set and A¼(0,1), B ¼(1,0) two fuzzy sets of
X. Then
(1) entropyd ðAÞ,clarityd ðAÞ A ½0,1.
entropyd ðAÞ ¼ entropyd ðBÞ ¼ 0
(2) entropyd(A)¼ 1  clarityd(A). and

Table 5 clarityd ðAÞ ¼ clarityd ðBÞ ¼ 1,


(a) The number of nucleotides at the three base sites of a codon in the codon where d ¼l1 or d ¼l2. Also,
sequence of Mycoplasma Pneumoniae and (b) Fractions of nucleotides at the three
base sites of a codon in the coding sequences of Mycoplasma Pneumoniae. entropyl ðAÞ ¼ entropyl ðBÞ ¼ 13
T C A G
and

(a) clarityl ðAÞ ¼ clarityl ðBÞ ¼ 113 ¼ 23:


First base 48995 42525 78622 70293
Second base 73438 46554 86585 33858 (2) Let X¼{x1,y,xn} be a set and A a fuzzy set of X such that A¼C.
Third base 77233 54942 62523 45737
Then entropyd(A)¼ 1 and clarityd(A)¼0.
(b) (3) We consider the following polynucleotide sequences:
First base 0.2038 0.1769 0.327 0.2923 s1 ¼UACUGU (tyrosine/cysteine)
Second base 0.3054 0.1936 0.3601 0.1408
Third base 0.3212 0.2285 0.26 0.1902
s2 ¼CACUGU (histidine/cysteine)
s3 ¼CUCUGU (leucine/cysteine)
D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105 103

Table 8 sequence of the polynucleotide. However, results in both cases


Calculated clarity values for sequences s1, s2 and s3 using the metrics l, l1 and l2. show, as expected, that entropy is related to the complexity of
description of the sequence. The value of entropy/clarity is
Metric s1 s2 s3
representative of the complexity of description of the polynucleo-
l2 1 1 1 tide. Similar entropy means similar degree of complexity. We also
l1 1 1 1 apply the definition of probabilistic entropy in the case of selected
l 0.8 0.8 0.8 polynucleotides and we observe that the entropy based on
metrics presents similar behavior with that of probabilistic
nature.
Table 9 Further studies are in progress in order to investigate in more
Calculated clarity values for sequences s1, s2 and s3 using the metrics l, l1 and l2. detail the properties of these notions and their biological implica-
tions since it seems that the use of FPS space can lead to a
Metric s1 s2 s3
reduction of the necessary information and the use of appropriate
2
l 0.816497 0.707107 0.707107 metrics can be used to differentiate the degree of their complexity.
l1 0.66667 0.5 0.5
l 0.67385 0.5 0.5

Acknowledgements

The research of J.J. Nieto has been partially supported by


Table 10 Ministerio de Educacion y Ciencia and FEDER, project MTM2007-
Computed clarity for sequences M. tuberculosis, E. coli, A. aeolicus and 61724, and by Xunta de Galicia and FEDER, project PGIDIT05P-
M. pneumoniae.
XIC20702PN.
Metric M. tuberculosis E. coli A. aeolicus M. pneumoniae

l2 0.524124 0.511186 0.521231 0.57945 Appendix A


l1 0.499967 0.500033 0.500083 0.500033
l 0.499967 0.500033 0.500083 0.50
"( P12 ) #
2 X
12
i ¼ 1 ða½i0:5Þ
ln½1 :¼ NMaximize 1 pffiffiffi , a½i ¼ ¼ 3 ,Array½a,f12g
3 i¼1

We have the following representations in the I24 space: Out½1 ¼ f0:5,fa½1-0:25,a½2-0:25,a½3-0:25,a½4-0:25,a½5


-0:25,a½6-0:25,a½7-0:25,a½8-0:25,a½9
s1 ¼(1, 0, 0, 0, 0, 1, 0, 0, 0, 0,1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0) -0:25,a½10-0:25,a½11-0:25,a½12-0:25gg
s2 ¼(0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0)
s3 ¼(0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0)

References
Using the distances l2, l1 and l for the clarity of s1 ,s2 ,s3 A I24 we
obtain the results for entropy values that are summarized in Table 8.
Aguero-Chapin, G., Gonzalez-Diaz, H., Molina, R., Varona-Santos, J., Uriarte, E.,
The results are consistent with the fact that in the case of the Gonzalez-Diaz, Y., 2006. Novel 2D maps and coupling numbers for protein
I12  k space (in this case of I24) all above sequences are known sequences. The first, QSAR study of polygalacturonases; isolation and
with precision. However, if we use their representation in the I12 prediction of a novel sequence from, Psidium guajava. FEBS, Letter 580 (3),
723–730.
dimensional FPS results the results present some differences. Aguero-Chapin, G., Gonzalez-Diaz, H., Riva, G.D., Rodriguez, E., Sanchez-Rodriguez, A.,
Using the distances l2, l1 and l for the clarity of s1 ,s2 ,s3 A I12 we Podda, G., Vazquez-Padron, R.I., 2008. MMM-QSAR, recognition of ribonucleases
have the results that appear in Table 9: without alignment: comparison with an HMM model and isolation
from Schizosaccharomyces pombe, prediction, and experimental assay of
Again, results indicate the degree of complexity necessary for
a new sequence. Journal of Chemical Information and Modelling 48 (2),
the description of the polynucleotides. 434–448.
Bardossy, A., Duckstein, L., 1995. Fuzzy Rule-Based Modeling with Applications to
Geophysical, Biological and Engineering Systems. CRC Press, Boca Raton.
(4) We consider the following fuzzy set of frequencies of the Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms.
genomes of M. tuberculosis, E. coli, A. aeolicus. and M. Plenum Press, New York.
pneumoniae. Using the distances l2, l1 and l for the clarity Chen, C., Zhou, X., Tian, Y., Zou, X., Cai, P., 2006a. Predicting protein structural class
with pseudo-amino acid composition and support vector machine fusion
the corresponding results are summarized in Table 10.
network. Annals Biochemistry 357, 116–121.
Chen, C., Tian, Y.X., Zou, X.Y., Cai, P.X., Mo, J.Y., 2006b. Using pseudo-amino acid
composition and support vector machine to predict protein structural class.
4. Conclusions Journal of Theoretical Biology 243, 444–448.
Chen, Y.L., Li, Q.Z., 2007a. Prediction of apoptosis protein subcellular location using
improved hybrid approach and pseudo amino acid composition. Journal of
We present results concerning the notions of fuzzy entropy Theoretical Biology 248, 377–381.
and clarity of a polynucleotide. We propose a new definition of Chen, Y.L., Li, Q.Z., 2007b. Prediction of the subcellular location of apoptosis
proteins. Journal of Theoretical Biology 245, 775–783.
entropy and we consider several applications using different Chou, K.C., 1995. A novel approach to predicting protein structural classes in a
metrics in the case of I12  k space, where k is the number of (20-1)-D amino acid composition space. Proteins: Structure, Function &
codons of the polynucleotide. We also examine the behavior of Genetics 21, 319–344.
Chou, K.C., 2000b. Review: prediction of protein structural classes and subcellular
these notions when those polynucleotides are projected in the I12
locations. Current Protein and Peptide Science 1, 171–208.
Fuzzy Polynucleotide Space. We observe that in both cases we Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid
have a different interpretation of the obtained results for entropy. composition. Proteins: Structure, Function, and Genetics (Erratum: ibid., 2001,
While in the former case low entropy means that we are close to a Vol.44, 60) 43, 246–255.
Chou, K.C., 2000a. Prediction of protein subcellular locations by incorporating
corner of the 12  k space in the latter case it means that we have quasi-sequence-order effect. Biochemical & Biophysical Research Communica-
repetition of the same triplet or part of a triplet all along the tions 278, 477–483.
104 D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105

Chou, K.C., 2005a. Prediction of G-protein-coupled receptor classes. Journal of Guo, Y.Z., Li, M., Lu, M., Wen, Z., Wang, K., Li, G., Wu, J., 2006. Classifying G protein-
Proteome Research 4, 1413–1418. coupled receptors and nuclear receptors based on protein power spectrum
Chou, K.-C., 2009. Pseudo amino acid composition and its applications in from fast Fourier transform. Amino Acids 30, 397–402.
bioinformatics, proteomics and system biology. Current Proteomics 6, 262–274. Gao, Y., Shao, S.H., Xiao, X., Ding, Y.S., Huang, Y.S., Huang, Z.D., Chou, K.C., 2005.
Chou, K.C., Cai, Y.D., 2003. Predicting protein quaternary structure by pseudo Using pseudo amino acid composition to predict protein subcellular location:
amino acid composition. Proteins: Structure, Function, and Genetics 53, approached with Lyapunov index, Bessel function, and Chebyshev filter. Amino
282–289. Acids 28, 373–376.
Chou, K.C., Cai, Y.D., 2004. Predicting enzyme family class in a hybridization space. Gu, F., Chen, H., 2009. Evaluating long-term relationship of protein sequence by
Protein Science 13, 2857–2863. use of D-interval conditional probability and its impact on protein structural
Chou, K.C., Cai, Y.D., 2005. Prediction of membrane protein types by incorporating class prediction. Protein and Peptide Letters 16 (10), 1267–1276.
amphipathic effects. Journal of Chemical Information and Modeling 45, Hegalson, C.M., Jobe, T.H., 1998. The fuzzy cube and causal efficacy: representation
407–413. of concomitant mechanisms in stroke. Neural Networks 11, 549–555.
Chou, K.C., Cai, Y.D., 2006. Predicting protein–protein interactions from sequences Hashimoto, H., 1983. Szpilrajn’s theorem on fuzzy orderings. Fuzzy Sets and
in a hybridization space. Journal of Proteome Research 5, 316–322. Systems 10, 101–108.
Chou, K.C., Cai, Y.D., Zhong, W.Z., 2006. Predicting networking couples for Homaeian, L., Kurgan, L.A., Cios, K.J., Ruan, J., Chen, K., 2007. Prediction of protein
metabolic pathways of Arabidopsis. EXCLI Journal 5, 55–65. secondary structure content for the twilight zone sequences. Proteins 69 (3),
Chou, K.C., Elrod, D.W., 2002. Bioinformatical analysis of G-protein-coupled 486–498.
receptors. Journal of Proteome Research 1, 429–433. Jiang, T., Lin, G., Ma, B., Zhang, K., 2002. A general edit distance between RNA
Chou, K.C., Elrod, D.W., 1999a. Protein subcellular location prediction. Protein structures. Journal of Computational Biology 9, 371–388.
Engineering 12, 107–118. Jamshidi, N., Edwards, J.S., Fahland, T., Church, G.M., Palsson, B.O., 2001. Dynamic
Chou, K.C., Elrod, D.W., 1999b. Prediction of membrane protein types and simulation of the human red blood cell matabolic network. Bioinformatics 17,
subcellular locations. Proteins: Structure, Function, and Genetics 34, 286–287.
137–153. Karakasidis, T.E., Georgiou, D.N., 2004. Partitioning elements of the periodic table
Chou, K.C., Elrod, D.W., 2003. Prediction of enzyme family classes. Journal of via fuzzy clustering technique. Soft Computing (Springer-Verlag) 8, 231–236.
Proteome Research 2, 183–190. Kawashima, S., Kanehisa, M., 2000. AAindex: amino acid index database. Nucleic
Chou, K.C., Shen, H.B., 2007a. Euk-mPLoc: a fusion classifier for large-scale Acids Research 28, 374.
eukaryotic protein subcellular location prediction by incorporating multiple Kawashima, S., Ogata, H., Kanehisa, M., 1999. AAindex: amino acid index database.
sites. Journal of Proteome Research 6, 1728–1734. Nucleic Acids Research 27, 368–369.
Chou, K.C., Shen, H.B., 2007b. Review: recent progresses in protein subcellular Kedarisetti, K., Kurgan, L., Dick, S., 2006. Classifier ensembles for protein structural
location prediction. Analytical Biochemistry 370, 1–16. class prediction with varying homology. Biochemical and Biophysical Research
Chou, K.C., Shen, H.B., 2007c. Large-scale plant protein subcellular location Communications 348 (3), 981–988.
prediction. Journal of Cellular Biochemistry 100, 665–678. Klir, G.J., Yuan, B., 1995. Fuzzy Sets and Fuzzy Logic (Theory and Applications).
Chou, K.C., Shen, H.B., 2007d. MemType-2L: a web server for predicting membrane Prentice Hall PRT, New Jersey.
proteins and their types by incorporating evolution information through Pse- Kosko, B., 1992. Neural Networks and Fuzzy Systems. Prentice-Hall, Englewood
PSSM. Biochemistry and Biophysics Research Communication 360, 339–345. Cliffs, NJ.
Chou, K.C., Shen, H.B., 2007e. Signal-CF: a subsite-coupled and window-fusing Kurgan, L.A., Stach, W., Ruan, J., 2007. Novel scales based on hydrophobicity
approach for predicting signal peptides. Biochemistry and Biophysics Research indices for secondary protein structure. Journal of Theoretical Biology 248,
Communication 357, 633–640. 354–366.
Chou, K.C., Shen, H.B., 2008. Cell-PLoc: a package of web-servers for predicting Kurgan, L., Chen, K., 2007. Prediction of protein structural class for the twilight
subcellular localization of proteins in various organisms. Nature Protocols 3, zone sequences. Biochemical and Biophysical Research Communications
153–162. 357 (2), 453–460.
Chou, K.C., 2005b. Using amphiphilic pseudo amino acid composition to predict Lin, C.T., 1997. Adaptive subsethood for radial basis fuzzy systems. In: Kosko, B.
enzyme subfamily classes. Bioinformatics 21, 10–19. (Ed.), Fuzzy Engineering. Prentice-Hall, Upper Saddle River, NJ, pp. 429–464.
Chechetkin, V.R., 2003. Block structure and stability of the genetic code. Journal of Liabres, M., Rossello, F., 2004. A new family of metrics for biopolymer contact
Theoretical Biology 222, 177–188. structures. Computational Biology and Chemistry 28, 21–37.
DasGupta, B., Jiang, T., Kannan, S., Sweedyk, E., 1998. On the complexity and Liben-Nowell, D., 2001. On the structure of syntenic distance. Journal of
approximation of syntenic distance. Discrete Applied Mathematics 88, 59–82. Computational Biology 8, 53–67.
Ding, Y.S., Zhang, T.L., Chou, K.C., 2007. Prediction of protein structure classes with Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H., 2001. An
pseudo amino acid composition and fuzzy support vector machine network. information-based sequence distance and its application to whole mitochon-
Protein & Peptide Letters 14, 811–815. drian phylogeny. Bioinformatics 17, 149–154.
Du, P., Li, Y., 2006. Prediction of protein submitochondria locations by hybridizing Lin, Z., Pan, X., 2001. Accurate prediction of protein secondary structural content.
pseudo-amino acid composition with various physicochemical features of Journal of Protein Chemistry 20, 217–220.
segmented sequence. BMC Bioinformatics 7, 518. Lin, H., Li, Q.Z., 2007a. Using pseudo amino acid composition to predict protein
Dress, A., Lokot, T., 2003. A simple proof of the triangle inequality for the NTV structural class: approached by incorporating 400 dipeptide components.
metric. Applied Mathematics Letters 16, 809–813. Journal of Computational Chemistry 28, 1463–1466.
Dress, A., Lokot, T., Pustyl’nikov, L.D., 2004. A new scale-invariant geometry of L1 Lin, H., Li, Q.Z., 2007b. Predicting conotoxin superfamily and family by using
space. Applied Mathematics Letters 17, 815–820. pseudo amino acid composition and modified Mahalanobis discriminant.
De Luca, A., Termini, S., 1972. A definition of a nonprobabilistic entropy in the Biochemistry and Biophysics Research Communication 354, 548–551.
setting of fuzzy sets theory. Information and Control 20, 301–312. Lin, W.-Z., Xiao, X., Chou, K.-C., 2009. GPCR-GIA: a web-server for identifying
Engelking, R., 1977. General Topology, Warszawa. G-protein coupled receptors and their families with grey incidence analysis.
Fan, J.-L., Ma, Y.-L., 2002. Some new fuzzy entropy formulas. Fuzzy Sets and Protein Engineering, Design and Selection 22, 699–705.
Systems 128, 277–284. Liu, H., Wang, M., Chou, K.C., 2005. Low-frequency Fourier spectrum for predicting
Foster, M., Heath, A., Afzal, M., 1999. Application of distance geometry to 3D membrane protein types. Biochemistry and Biophysics Research Communica-
visualization of sequence relation-ships. Bionformatics 15, 89–90. tion 336, 737–739.
Feng, Z.P., 2002. An overview on predicting the subcellular location of a protein. In Menconi, G., 2005. Sublinear growth of information in DNA sequences. Bulletin of
Silico Biology 2, 291–303. Mathematical Biology 67, 737–759.
Freeland, S.J., Hurst, L.D., 1998. The genetic code is one in a million. Journal of Morgenstern, B., 2002. A simple and space-efficient fragment-chaining algorithm
Molecular Evolution 47, 238–248. for alignment of DNA and protein sequences. Applied Mathematics Letter 15
Gusev, V.D., Nemytikova, L.A., Chuzhanova, N.A., 1999. On the complexity (1), 11–16.
measures of genetic sequences. Bioinformatics 15, 994–999. Moulton, V., Zuker, M., Steel, M., Pointon, R., Penny, D., 2000. Metrics on RNA
Georgiou, D.N., Karakasidis, T.E., Nieto, J.J., Torres, A., 2009. Use of fuzzy clustering secontary structures. Journal of Computational Biology 7 (1/2), 277–292.
technique and matrices to classify amino acids and its impact to Chou’s pseudo Mondal, S., Bhavna, R., Mohan Babu, R., Ramakumar, S., 2006. Pseudo amino acid
amino acid composition. Journal of Theoretical Biology 257, 17–26. composition and multi-class support vector machines approach for conotoxin
Gonzalez-Diaz, H., Aguero-Chapin, G., Varona, J., Molina, R., Delogu, G., Santana, L., superfamily classification. Journal of Theoretical Biology 243, 252–260.
Uriarte, E., Podda, G., 2007b. 2D-RNA-coupling numbers: a new computational Mocz, G., 1995. Fuzzy cluster analysis of simple physicochemical properties of
chemistry approach to link secondary structure topology with biological amino acids for recognizing secondary structure in proteins. Protein Science 4,
function. Journal of Computational Chemistry 28 (6), 1049–1056. 1178–1187.
Gonzalez-Diaz, H., Gonzalez-Diaz, Y., Santana, L., Ubeira, F.M., Uriarte, E., 2008. Mundra, P., Kumar, M., Kumar, K.K., Jayaraman, V.K., Kulkarni, B.D., 2007. Using
Proteomics, networks and connectivity indices. Proteomics 8 (4), 750–778. pseudo amino acid composition to predict protein subnuclear localization:
Gonzalez-Diaz, H., Perez-Castillo, Y., Podda, G., 2007a. Uriarte E. Computational approached with PSSM. Pattern Recognition Letters 28, 1610–1615.
chemistry comparison of stable/nonstable protein mutants classification Nieto, J.J., Torres, A., Vazquez-Trasande, M.M., 2003. A metric space to study
models based on 3D and topological indices. Journal of Computational differences between polynucleotides. Applied Mathematics Letter 16, 1289–1294.
Chemistry 28 (12), 1990–1995. Nieto, J.J., Torres, A., 2003. Midpoints for fuzzy sets and their application in
Gonzalez-Diaz, H., Vilar, S., Santana, L., Uriarte, E., 2007c. Medicinal chemistry and medicine. Artificial Intelligence in Medicine 17, 81–101.
bioinformatics-current trends in drugs discovery with networks topological Nieto, J.J., Torres, A., Georgiou, D.N., Karakasidis, T.E., 2006. Fuzzy polynucleotide
indices. Current Topics in Medicinal Chemistry 7 (10), 1015–1029. spaces and metrics. Bulletin of Mathematical Biology 68, 703–725.
D.N. Georgiou et al. / Journal of Theoretical Biology 267 (2010) 95–105 105

Paun, Gh., Rozenberg, G., Saloma, A., 1998. DNA Computing: New Computing Torres A., Nieto J.J., 2006, Fuzzy logic in medicine and bioinformatics. Journal of
Paradigms. Springer, Berlin. Biomedicine and Biotechnology, Article ID 91908.
Percus, J., 2002. Mathematics of Genome Analysis. Cambridge University Press, Trinquier, G., Sanejouand, Y.-H.I., 1998. Which effective property of amino acids is
Cambridge. best preserved by the genetic code? Protein Engineering 11 153–169.
Perez-Montoto, L.G., Santana, L., Gonzalez-Diaz, H., 2009. Scoring function for Vilar, S., Gonzalez-Diaz, H., Santana, L., Uriarte, E., 2009. A network-QSAR model
DNA-drug docking of anticancer and antiparasitic compounds based on for prediction of genetic-component biomarkers in human colorectal cancer.
spectral moments of 2D lattice graphs for molecular dynamics trajectories. Journal of Theoretical Biology 261 (3), 449–458.
European Journal of Medicinal Chemistry 44 (11), 4461–4469. Wolfenden, R., 2007. Experimental measures of amino acid hydrophobicity and the
Schneider, G., Wrede, P., 1994. The rational design of amino acid sequences by topology of transmembrane and globular proteins. Journal of Cell Biology 177
artificial neural networks and simulated molecular evolution: de novo design i10–i10.
of an idealized leader peptidase cleavage site. Biophysical Journal 66, 335–344. Wang, M., Yang, J., Liu, G.P., Xu, Z.J., Chou, K.C., 2004. Weighted-support vector
Shen, H.B., Chou, K.C., 2005a. Using optimized evidence-theoretic K-nearest neighbor machines for predicting membrane protein types based on pseudo amino acid
classifier and pseudo amino acid composition to predict membrane protein types. composition. Protein Engineering, Design, and Selection 17, 509–516.
Biochemical and Biophysical Research Communications 334, 288–292. Wang, S.Q., Yang, J., Chou, K.C., 2006. Using stacked generalization to predict
Shen, H.B., Chou, K.C., 2006. Ensemble classifier for protein fold pattern membrane protein types based on pseudo amino acid composition. Journal of
recognition. Bioinformatics 22, 1717–1722. Theoretical Biology 242, 941–946.
Shen, H.B., Chou, K.C., 2007a. Hum-mPLoc: an ensemble classifier for large-scale Xiao, X., Shao, S.H., Huang, Z.D., Chou, K.C., 2006a. Using pseudo amino acid
human protein subcellular location prediction by incorporating samples with composition to predict protein structural classes: approached with complexity
multiple sites. Biochemical and Biophysical Research Communications 355, measure factor. Journal of Computational Chemistry 27, 478–482.
1006–1011. Xiao, X., Shao, S., Ding, Y., Huang, Z., Chen, X., Chou, K.C., 2005b. Using cellular
Shen, H.B., Chou, K.C., 2007b. EzyPred: a top–down approach for predicting automata to generate image representation for biological sequences. Amino
enzyme functional classes and subclasses. Biochemical and Biophysical Acids 28, 29–35.
Research Communications 364, 53–59. Xiao, X., Shao, S., Ding, Y., Huang, Z., Chen, X., Chou, K.C., 2005c. An application of
Shen, H.B., Chou, K.C., 2007c. Signal-3L: a 3-layer approach for predicting signal gene comparative image for predicting the effect on replication ratio by HBV
peptide. Biochemical and Biophysical Research Communications 363, virus gene missense mutation. Journal of Theoretical Biology 235, 555–565.
297–303. Xiao, X., Shao, S., Ding, Y., Huang, Z., Huang, Y., Chou, K.C., 2005a. Using complexity
Shen, H.B., Yang, J., Chou, K.C., 2006. Fuzzy KNN for predicting membrane protein measure factor to predict protein subcellular location. Amino Acids 28, 57–61.
types from pseudo amino acid composition. Journal of Theoretical Biology 240, Xiao, X., Shao, S.H., Ding, Y.S., Huang, Z.D., Chou, K.C., 2006b. Using cellular
9–13. automata images and pseudo amino acid composition to predict protein sub-
Shen, H.-B., Chou, K.-C., 2009a. Gpos-mploc: a top–down approach to improve the cellular location. Amino Acids 30, 49–54.
quality of predicting subcellular localization of gram-positive bacterial Xiao, X., Lin, W.Z., Chou, K.C., 2008a. Using grey dynamic modeling and pseudo
proteins. Protein and Peptide Letters 16, 1478–1484. amino acid composition to predict protein structural classes. Journal of
Shen, H.-B., Chou, K.-C., 2009b. A top–down approach to enhance the power of Computational Chemistry 29, 2018–2024.
predicting human protein subcellular localization: Hum-mPLoc 2.0. Analytical Xiao, X., Wang, P., Chou, K.C., 2008b. Predicting protein structural classes with
Biochemistry 394, 269–274. pseudo amino acid composition: an approach using geometric moments of
Shen, H.B., Chou, K.C., 2005b. Predicting protein subnuclear location with cellular automaton image. Journal of Theoretical Biology 254, 691–696.
optimized evidence-theoretic K-nearest classifier and pseudo amino acid Xiao, X., Wang, P., Chou, K.C., 2009a. Predicting protein quaternary structural
composition. Biochemical and Biophysical Research Communications 337, attribute by hybridizing functional domain composition and pseudo amino
752–756. acid composition. Journal of Applied Crystallography 42, 169–173.
Sadegh-Zadeh, K., 2000. Fuzzy genomes. Artificial Intelligence in Medicine 18, 1–28. Xiao, X., Wang, P., Chou, K.C., 2009b. GPCR-CA: a cellular automaton image
Sadovsky, M.G., 2003. The method to compare nucleotide sequences based on the approach for predicting G-protein-coupled receptor functional classes. Journal
minimum entropy principle. Bulletin of Mathematical Biology 65, 309–322. of Computational Chemistry 30, 1414–1423.
Shannon, C.E., 1948. A mathematical theory of communication. The Bell Systems Xiao, X., Lin, W.Z., 2009. Application of protein grey incidence degree measure to
Technical Journal 27, 379–423. predict protein quaternary structural types. Amino Acids 37, 741–749.
Sadegh-Zadeh, K., 1999. Fundamentals of clinical methodology: 3. Nosology. Xiao, X., Wang, P., Chou, K.C., 2010. Quat-2L: a web-server for predicting protein
Artificial Intelligence in Medicine 17, 87–108. quaternary structural attributes. Molecular Diversity, DOI:10.1007/
Samaras, P., Kungolos, A., Karakasidis, T., Georgiou, D., Perakis, K., 2001. Statistical s11030-010-9227-8.
evaluation of PCDD/F emission data during solid waste combustion by fuzzy Zhang, T.L., Ding, Y.S., Chou, K.C., 2008. Prediction protein structural classes with
clustering techniques. Journal of Environmental Science and Health, Marcel pseudo-amino acid composition: approximate entropy and hydrophobicity
Dekker, Inc. (part A) 36, 153–161. pattern. Journal of Theoretical Biology 250, 186–193.
Stephen, Y.L., Freeland, J., 2008. A quantitative investigation of the chemical space Zhang, Z.D., Sun, Z.R., Zhang, C.T., 2001. A new approach to predict the helix/strand
surrounding amino acid alphabet formation. Journal of Theoretical Biology content of globular proteins. Journal of Theoretical Biology 208, 65–78.
250, 349–361. Zhou, X.B., Chen, C., Li, Z.C., Zou, X.Y., 2007. Using Chou’s amphiphilic pseudo-
Tang, B., 2000. Evaluation of some DNA cloning strategies. Computers and amino acid composition and support vector machine for prediction of enzyme
Mathematics with Application 39 (11), 43–48. subfamily classes. Journal of Theoretical Biology 248, 546–551.
Torres, A., Nieto, J.J., 2003. The fuzzy polynucleotide space: basic properties. Zimmermann, H.J., 1991. Fuzzy Theory and its Applications. Kluwer Academic
Bioinformatics 19 (5), 587–592. Publishers, New York.
Terano, T., Asai, K., Sugeno, M., 1992. Fuzzy Systems Theory and its Applications. Zaus, M., 1999. Crisp and Soft Computing with Hypercubical Calculus.
Academic Press, Harcount Brace Jovanovich Publishers, San Diego, California. Physica-Verlag, Heideberg.

You might also like