
J Mol Evol (1995) 41:98-103

Journal of Molecular Evolution

© Springer-Verlag New York Inc. 1995

Distributions of the Use Frequencies of Amino Acids in the Hypervariable Regions of Immunoglobulins

F. Lara-Ochoa,1 E. Vargas-Madrazo,2 J.C. Almagro1

1 Instituto de Quimica, UNAM, Apdo. Postal 70-213, Coyoacan 04510, Mexico D.F.
2 Instituto de Investigaciones Biologicas, Universidad Veracruzana, Xalapa, Ver., Mexico

Received: 6 February 1994 / Accepted: 10 August 1994

Abstract. Two types of distributions for the frequencies of occurrence of amino acids in each position of hypervariable regions CDR-1 and CDR-2 were obtained for 2,000 immunoglobulins. The results show that some positions fit an inverse power-law distribution, while others fit an exponential-type distribution. As a result of comparison with structural data in the literature it is proposed that sites in which the frequency distribution fits the inverse power law are critical to maintaining canonical shapes of the recognition regions or are involved in modulating these canonical conformations, while those sites where the distribution fits the exponential law are those which should be exclusively involved in the recognition mechanism.

Key words: Immunoglobulins - Canonical conformations - CDR - Zipf law - Inverse power law

Correspondence to: F. Lara-Ochoa

Introduction

Immunoglobulins (Igs) are key proteins that mediate the immune response by binding foreign epitopes, hindering their ability to either bind to receptors on target cells, or by marking invading microorganisms for destruction. An important challenge in immunology deals with the unresolved question of how the immune system recognizes foreign epitopes.

The antibody binding site is formed by six hypervariable or complementarity-determining regions (CDRs): three in the variable domain of the light chain, VL, and three in the variable domain of the heavy chain, VH (Padlan 1994; Wu and Kabat 1970; Kabat and Wu 1971). It has been proposed that the specificity of the Igs is determined by the sequence and size of the CDRs (Kabat 1978).

CDR-1 and CDR-2 of both VL and VH are codified by a continuous gene fragment, while the CDR-3 is codified by recombination events (Tonegawa 1983; Alt et al. 1987). It has been found that except for the third region of VH, binding site loops have one of a small number of main-chain conformations termed canonical structures (Chothia and Lesk 1987). These specific backbone conformations are believed to reflect the existence of a few key conserved residues in the loop and framework of the antibody molecule (Chothia and Lesk 1987). This suggests that some of the amino acids in the CDRs may have a structural role, while the rest are either exclusively involved in the recognition function or are irrelevant. Likewise, Kabat et al. (1977) postulated the existence of evolutionarily conserved residues within CDRs, and that the antigen specificity is determined by a few hypervariable positions in the immediate neighborhood of such conserved positions. Along similar lines, Ohno et al. (1985) proposed that if some sites on the CDR-1 and CDR-2 are conserved for the shaping of the primordial antigen binding cavity, then the remaining sites must be properly involved in the recognition process.

In a previous study (Vargas-Madrazo et al. 1994), it was found that the frequency of use of amino acids in the CDRs was skewed. Lara-Ochoa et al. (1994) performed a statistical analysis of the first two regions of the antibody-combining sites in order to determine whether or not the frequency of use of individual amino acids is constrained by the role played by each position. The

Table 1. Relative frequencies of amino acids in percent with the highest constancy on the CDR-1 and CDR-2 of the heavy and light domains^a

      Predominant  Relative   Secondary                   Total      Structural (Chothia),
Site  amino acid   frequency  amino acids                 frequency  Conserved (Kabat)

Heavy chain
26    G            98%
27    F            48%        Y (42%)                     90%
28    T            62%        S (22%)                     84%
29    F            76%        L (11%)                     87%
30    T            50%        S (41%)                     91%
31    S            48%        T (5%)                      53%
32    Y            65%        F (11%)                     76%
34    M            61%        I (14%), V (8%)             83%        + +
51    I            88%                                               +
52a   P            56%        S (10%)                     66%        +
52b   K            87%
54    N            27%        S (28%), G (27%)            82%        + +
55    G            61%        S (14%)                     75%        +
57    T            76%
59    Y            94%        F (2%)                      96%
60    N            48%        A (21%), S (12%)            81%*
63    F            46%        V (28%), L (19%)            93%
64    K            77%
65    G            61%        S (23%)                     83%
Light chain
24    R            48%        K (16%)                     64%
25    A            55%        S (30%), G (10%)            95%
26    S            87%        T (2%)                      89%
27    Q            51%        N (1%)                      52%
27a   S            81%        G (11%)                     92%
27b   L            47%        V (19%), I (12%), A (11%)   89%
27c   V            42%        L (32%)                     74%
32    Y            72%        F (5%)                      77%
33    L            58%        M (17%), V (10%)            85%
52    S            79%        T (7%)                      86%
54    L            48%        R (48%)                     96%*
56    S            72%        W (11%)                     83%

^a The frequencies of amino acids for all positions listed fit an inverse power distribution
* In these special cases the secondary amino acids do not share the same properties of hydrophobicity and volume
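The quantities tabulated in Table 1 (predominant residue, secondary residues, and their combined frequency) can be illustrated with a minimal Python sketch. The function name `site_summary`, the `n_secondary` parameter, and the toy column are assumptions made for illustration only; the paper does not state the rule it used to decide which secondary residues to list.

```python
from collections import Counter

def site_summary(residues, n_secondary=1):
    """Rank the amino acids observed at one site by relative frequency.

    residues    -- iterable of one-letter codes, one per sequence (toy input)
    n_secondary -- how many runner-up residues to report (hypothetical
                   parameter; the paper does not give its selection rule)
    """
    counts = Counter(residues)
    total = sum(counts.values())
    # Sort by decreasing count; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    freqs = [(aa, n / total) for aa, n in ranked]
    predominant = freqs[0]
    secondary = freqs[1:1 + n_secondary]
    combined = predominant[1] + sum(f for _, f in secondary)
    return {"ranked": freqs, "predominant": predominant,
            "secondary": secondary, "total_frequency": combined}

# Toy column with the proportions listed for site 27 of VH in Table 1
# (F 48%, Y 42%); the remaining 10% is filled with arbitrary residues.
column = ["F"] * 48 + ["Y"] * 42 + ["S"] * 6 + ["G"] * 4
summary = site_summary(column)
print(summary["predominant"], summary["secondary"], summary["total_frequency"])
```

The `ranked` list is the rank-ordered frequency series on which a distribution fit would operate; rank 1 is the predominant residue.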

distribution of the frequency of use of these sites was fit to two different types of distributions (Lara-Ochoa et al. 1994). The thrust of the present article is concerned with two aspects: first, to perform statistical analysis of the frequency of use of amino acids for each position of the CDR-1 and CDR-2 of the VH and VL chains of 2,000 Igs; and second, the comparison of this analysis with the observed use frequencies of amino acids in the reported sequences for the Ig's germ-line genes (Gene-bank, release 79).

Methodology and Results

The amino acid distribution table reported by Kabat et al. (1991) is used to establish the percent frequencies of the most-used amino acids for each site. From these data, we observe in Table 1 that amino acid uses of some sites have frequencies that are of the same order or higher than those of sites buried in the core of a protein family (Lim and Sauer 1989). As an illustration, and following the site numeration suggested by Kabat et al. (1991), we observe in Table 1 that Gly-26 and Ile-51 on VH have a probability of use of 0.98 and 0.88, respectively, which should skew their distribution frequencies significantly. For some positions, the use frequencies may be as low as 0.48 (for instance, Phe-27 on VH); in these cases, alternative amino acids which are second or third in abundance, sharing similar properties of hydrophobicity and volume, can be found in the same site (Table 1). Hence, all these cases should correspond to a nonrandom frequency distribution. It is therefore of importance to determine for each site the type of distribution for the amino acid use frequency and to attempt to infer from these distributions the role of each site relative to the CDR-1 and CDR-2.

The inverse power law gives a model of a skewed distribution that seems appropriate for modeling the use frequency data. This distribution has been used in different contexts, including biological systems (West 1985; Nicolis 1987), linguistics (Nicolis 1986), and social sciences (Montroll and Badger 1974) and to model diverse physical phenomena (Montroll and Shlesinger 1983; Meijer et al. 1981). A version of this distribution was proposed by Zipf (1949) for the first time. Here we will use a modified version of Zipf's relation that was proposed by Mandelbrot (1977), namely

$$P(r) = K\,(r + p)^{-\beta} \qquad (1)$$

where K, p, and β are parameters (β > 0) intrinsic to each system, while P(r) denotes the probability of use of each amino acid with rank order r. In linguistics, the physical meaning of the constants K, p, and β has been widely studied (Schroeder 1991, and references therein cited), K
Fig. 1. Plots of the logarithm of the rank vs the logarithm of their relative frequencies for those sites identified as structural (horizontal axis: log (r + Ro)). The correlation coefficients for all the positions were higher than 0.95. The illustrated positions are for VH: site 29: +, and site 55: *, and for VL: site 25: +, and site 52: *

Fig. 2. Plots of the rank vs the logarithm of their relative frequencies for those sites identified as exclusively involved in the recognition mechanism. The correlation coefficients for all the positions were higher than 0.95. The illustrated positions are for VH: site 53: +, and site 58: *, and for VL: site 29: +, and site 34: *
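The straight-line fits summarized in the captions of Figs. 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the helper names `linear_fit` and `compare_fits` are hypothetical, the rank offset `p` is treated as a fixed value supplied by the user (the paper does not say how it was estimated), and the data are synthetic.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares line y = a + b*x plus the Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / math.sqrt(sxx * syy)

def compare_fits(freqs, p=0.0):
    """Fit ranked frequencies to both linearized forms.

    freqs -- strictly positive relative frequencies in decreasing order
    p     -- rank offset in the inverse power law (assumed fixed here)
    """
    ranks = list(range(1, len(freqs) + 1))
    log_freq = [math.log(f) for f in freqs]
    # Inverse power law: ln P = ln K - beta * ln(r + p)
    _, slope_pow, r_pow = linear_fit([math.log(r + p) for r in ranks], log_freq)
    # Exponential law: ln P = ln k - X * r
    _, slope_exp, r_exp = linear_fit(ranks, log_freq)
    return {"beta": -slope_pow, "r_power": r_pow,
            "X": -slope_exp, "r_exp": r_exp}

# Synthetic check: data generated from each law should be recovered.
power = [0.5 * r ** -2.0 for r in range(1, 11)]
expo = [0.3 * math.exp(-0.4 * r) for r in range(1, 11)]
res_power = compare_fits(power)
res_expo = compare_fits(expo)
print(res_power)
print(res_expo)
```

Because the frequencies decrease with rank, both correlation coefficients come out negative; a threshold such as 0.95 would be applied to their absolute values. Residues with zero observed frequency have to be dropped before taking logarithms.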
is interpreted to be a measure of the extension of the text, that is, the total number of words in the text; β gives a measure of the diversity or richness of words employed in the text; and p gives the probability of first-rank words (small or large). The meaning of these parameters in our application needs to be modified to take into account different types of constraints. For instance, for some sites on the CDRs the use frequency of amino acids is nonrandom (Vargas-Madrazo et al. 1994), which reduces the available number of amino acids for the site, limiting the total number of elements (and richness?) for the site. This and other constraints should be considered before attempting to extrapolate directly the meanings used in linguistics.

The linearized form of relation (1) was plotted for each of 27 sites on the heavy chain and for each of 21 sites on the light chain. The results of plotting the logarithm of the probability of use of each amino acid vs ln (r + p) are illustrated in Fig. 1 for some of the analyzed positions. The results were statistically tested and for 19 sites of VH and 12 sites of VL (see Table 1) the fit with a straight line was reasonably good, with correlation coefficients higher than 0.95.

The rest of the positions, that is, 8 on VH and 8 on VL, could not be fit to the inverse power law. Instead, they fitted well an exponential distribution of the form

$$P(r) = k \exp(-X r), \quad X \text{ a constant parameter} \qquad (2)$$

with correlation coefficients greater than 0.95. Plots of data fitting the exponential distribution are illustrated in Fig. 2.

Discussion

A set of sites proposed to be conserved on the CDR-1 and CDR-2 have been reported in the literature (Kabat et al. 1977). These sites are listed in the seventh column of Table 1. Chothia and Lesk (1987) reported a subset of sites on the CDRs which are key for the maintenance of the canonical conformations (structural set). These sites are listed in the sixth column of Table 1. In this table we observe that 14 of the 17 cases reported as structural sites by Kabat et al. (1977), and all the sites considered by Chothia and Lesk (1987) as critical for maintaining the canonical conformations, within the analyzed regions, correspond to those sites whose frequency distributions fit the inverse power law. This correspondence with sites previously proposed as conserved, and the fact that the more abundant amino acids in each site share similar properties of hydrophobicity and volume, physicochemical properties typical of structural amino acids (Miyata et al. 1979; Lim and Sauer 1989; Sneath 1966; Grantham
1974), strongly suggests that all sites whose frequency distributions fit the inverse power law have a structural role in the CDRs.

What is interesting then is to have found in this work that the amino acids responsible for keeping the canonical conformations fit well a skewed distribution such as the inverse power law. Distributions of this type naturally arise in probabilistic processes whose primary achievements require the unfailing completion of numerous independent secondary subtasks (Montroll and Shlesinger 1982). In fact, the failure of any of these tasks is sufficient to cause the failure of the main task. To illustrate this point let P_i denote the probability of achieving the ith step, in a sequence of events oriented toward achieving a general goal. If P denotes the probability of achieving this general goal at a given time (Montroll and Shlesinger 1982), then we have

$$P = P_1 P_2 \cdots P_n \qquad (3)$$

where n represents the total number of successful steps required to complete the general task.

The relation of the above process to some of the Darwinian ideas of evolution is suggestive. In our context these ideas would suggest that the actual shapes of the canonical conformations of the CDRs are achieved as the result of an evolutionary process, in which the adopted conformations are privileged (optimal?) shapes that interact more efficiently with the antigens. The adoption of optimal shapes, that is, shapes that interact more efficiently with antigens, may be understood considering that protein interactions are increased in surfaces that facilitate the formation of clefts or holes (Lewis 1991). Chothia et al. (1992) indicate that indeed canonical conformations generate a corresponding number of surfaces involved in the antigen-antibody interactions. In fact, it has been reported that specific antibodies for small haptens often have a concave combining site, frequently found as a deep pocket or groove on the surface. On the other hand, antibodies that bind large molecules, such as proteins, tend to have flat combining sites so that the surface of antibody-antigen is larger (Bolger and Sherman 1991). More general prototypes of surfaces have been reported by Lehn (1990), who has also categorized receptor-substrate interactions as being either endo or exo in character. Endo surfaces correspond to cyclic or severely bent molecules that contain holes, clefts, or cavities having interior interactive sites that converge upon a central locus while exo surfaces contain exterior interactive sites (Gutsche 1992).

In Table 1 we may observe that the number of conserved residues is greater than the known critical sites responsible for maintaining the canonical conformations (Chothia and Lesk 1987). Nonetheless, these other conserved sites, particularly those placed in the center of the loop, must be important in modulating the shapes of the surface that the canonical structures present to the antigens (Chothia et al. 1992). If this is the case, then it does increase substantially the number of available surfaces (prototypes) to contend with a larger number of epitopes. Thus, canonical conformations are determined by a few conserved residues which, as supported by our results, may be the result of an evolutionary process.

Table 2. Relative frequencies of amino acids in percent with the highest constancy as obtained from the germ-line gene data base

      Predominant  Relative   Secondary                   Total
Site  amino acid   frequency  amino acids                 frequency

Heavy chain
26    G            81%
27    F            34%        Y (37%)                     70%
28    T            52%        S (17%), N (11%)            70%
29    F            60%        I (18%), V (3%), L (2%)     83%
30    T            51%        S (18%), L (11%)            80%
31    S            41%        D (31%)                     72%
32    Y            61%        W (9%), F (3%)              73%
34    M            55%        I (13%)                     68%
51    I            70%
54    N            37%        S (16%), D (13%)            66%
55    G            55%        S (8%)                      64%
57    T            65%
59    Y            71%
60    N            35%        S (14%), D (10%)            59%
63    F            40%        V (17%), L (10%)            67%
64    K            56%        Q (10%)                     66%
65    G            47%        S (19%)                     66%
Light chain
24    R            69%        K (12%)                     81%
25    A            76%        S (7%)                      83%
26    S            74%        G (7%)                      81%
27    Q            47%        K (20%), E (9%)             76%
32    Y            46%        N (11%), F (4%)             61%
33    L            59%        M (8%), Y (6%), V (3%)      76%
52    S            40%        T (21%)                     61%
54    L            35%        R (30%)                     65%
56    S            40%        T (21%)                     61%

That the aforementioned sites mainly play a structural role is striking, since it has usually been assumed that sites 26-30 and 53-55 of VH and 26-32 and 50-56 of VL, by being located in the loop region, were recognition positions. Thus, our results suggest that an enhanced concept of a recognition site must be based not only on whether or not the site is on a loop but also on the type
Fig. 3. Relative use frequencies (%) of each amino acid in the total sample (total), and in the two subsamples with one (s-1), and five (s-3) randomly chosen sequences from each one of the specificities present in the Kabat et al. (1991) compilation. A Position 52 of VH. B Position 50 of VL.
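The subsampling described in the caption of Fig. 3 (one or five randomly chosen sequences per specificity) can be sketched as follows. The record layout of `(specificity, sequence)` pairs, the function name `subsample_by_specificity`, the fixed seed, and the toy data are all assumptions for illustration; the restriction of five-sequence draws to specificities with at least five sequences follows the description in the text.

```python
import random
from collections import defaultdict

def subsample_by_specificity(records, k, seed=0):
    """Draw k sequences at random from every specificity that has at least k.

    records -- list of (specificity, sequence) pairs (hypothetical layout)
    k       -- number of sequences drawn per specificity
    seed    -- seed for the random generator, fixed for reproducibility
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for specificity, seq in records:
        groups[specificity].append(seq)
    sample = []
    for specificity in sorted(groups):  # sorted so runs are repeatable
        seqs = groups[specificity]
        if len(seqs) >= k:
            # rng.sample draws without replacement
            sample.extend(rng.sample(seqs, k))
    return sample

# Toy data: three specificities with 8, 5, and 2 sequences.
records = ([("anti-hapten", f"seqH{i}") for i in range(8)]
           + [("anti-protein", f"seqP{i}") for i in range(5)]
           + [("anti-DNA", f"seqD{i}") for i in range(2)])
s1 = subsample_by_specificity(records, k=1)
s5 = subsample_by_specificity(records, k=5)
print(len(s1), len(s5))  # prints "3 10"
```

The amino acid use frequencies at a given position would then be computed separately on each subsample and on the full set, and the resulting profiles compared.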

of distribution of use followed by the amino acids on each site.

To eliminate the possibility that the described behavior may be due to some artifact generated as a consequence of somatic hypermutations, or may be due to a bias in the analyzed sample (2,000 Igs) that is a consequence of a predominance of certain specificities, e.g., a greater number of Ig antihaptens, two control tests were carried out.

The reported sequences for the Ig's germ-line genes were aligned (Gene-bank, release 79), and the amino acid use frequencies for each site were analyzed to determine the type of distribution. These use frequencies showed a behavior similar to that of the protein sample. The results are shown in Table 2. The fact that two types of sites persist at the gene level gives additional support to the suggestion that the observed behavior of the protein sample should indeed be a consequence of an evolutionary process, and not be generated during the recognition process.

To eliminate the possibility of an existing bias in the sample, the total sample was split into two subsamples, each one with a different specificity, and their amino acid use frequencies were compared with that of

the total sample. The two subsamples were built as follows: (1) one sequence was randomly chosen from each specificity in the sample; (2) a random sample of five sequences was selected from specificities that have five or more sequences in the sample. The results of this analysis for one position in VL and one in VH are shown in Fig. 3, which shows that the pattern of amino acid usage found in the two subsamples is the same as that in the total sample. These measures rule out the possibility of any bias in the analyzed sample.

If the positions discussed are critical in maintaining the structures of the family, then our reported findings imply that the rest of the positions of the analyzed CDRs must be properly involved in the recognition process, that is, those positions whose frequency distributions are exponential are those which must be exclusively involved in the recognition process. This experimental frequency distribution is well known, and is typical of processes that contain no memory (Gilman 1991), confirming the traditional view (Ohno et al. 1985) of a random substitution for these sites. (See however, Mian et al. 1991, and Vargas-Madrazo et al. 1994.)

In summary, our results suggest that the existence of two different types of sites is the key to success in molecular recognition, in which the principle involved is that an effective identification requires surfaces of complementarity size, shape and functionality.

Acknowledgments. We gratefully acknowledge the valuable advice received from M. Jimenez-Montano. Furthermore, we wish to thank Prof. E. Zuckerkandl, E. Lara, C. Castillo-Chavez, and the referee for their critical review of the manuscript. We also are indebted to Prof. Zuckerkandl for his generous help with our previous article published in J. Mol. Evol. J.C.A. was supported by DGAPA grant IN-206093; E.V.M. was supported by CONACyT.

References

Alt FW, Blackwell TK, Yancopoulos GD (1987) Development of the primary antibody repertoire. Science 238:1079-1087
Bolger MB, Sherman MA (1991) Computer modeling of combining site structure of anti-hapten monoclonal antibodies. Methods Enzymol 203:21-45
Chothia C, Lesk AM (1987) Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol 186:651-663
Chothia C, Lesk AM, Gherardi E, Tomlinson IM, Walter G, Marks JD, Llewelyn MB, Winter G (1992) Structural repertoire of the human VH segments. J Mol Biol 227:799-817
Gilman JJ (1991) Research management today. Phys Today March:42-49
Gutsche CD (1992) Supramolecular chemistry. In: Parker SP (eds) Encyclopedia of chemistry, 2nd ed. McGraw Hill, Inc., NY
Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185:862-864
Kabat EA (1978) The structural basis of antibody complementarity. Adv Protein Chem 32:1-75
Kabat EA, Wu TT (1971) Attempts to locate complementarity determining residues in the variable portions of light and heavy chains. Ann NY Acad Sci 190:382-393
Kabat EA, Wu TT, Bilofsky H (1977) Unusual distributions of amino acids in complementarity-determining (hypervariable) segments of heavy and light chains of immunoglobulins and their possible roles in specificity of antibody-combining sites. J Biol Chem 252:6609-6616
Kabat EA, Wu TT, Perry HM, Gottesman KS, Foeller C (1991) In: Sequences of proteins of immunological interest, 5th ed. National Institutes of Health, Bethesda, MD
Lara-Ochoa F, Vargas E, Jimenez-Montano MA, Almagro JC (1994) Patterns in the complementarity determining regions of immunoglobulins (CDRs). Biosystems 32:1-9
Lehn JM (1990) Perspectives in supramolecular chemistry, from molecular recognition towards molecular information processing and self-organization. Angew Chem Int Ed Engl 29:1304-1319
Lewis AR (1991) Clefts and binding sites in protein receptors. Methods Enzymol 202:126-156
Lim WA, Sauer RT (1989) Alternative packing arrangement in the hydrophobic core of proteins. Nature 339:31-36
Mandelbrot BB (1977) Fractals: form, chance, and dimension. Freeman, San Francisco
Meijer PHE, Mountain RD, Souler Jr RJ, eds (1981) Sixth International Conference on Noise in Physical Systems. National Bureau Standards, Washington, DC, Special Publication No. 614
Mian IS, Bradwell AR, Olson AJ (1991) Structure, function and properties of antibody binding sites. J Mol Biol 217:133-151
Miyata T, Miyasawa S, Yasunaga T (1979) Two types of amino acid substitution in protein evolution. J Mol Evol 12:219-223
Montroll E, Badger WW (1974) Introduction to quantitative aspects of social phenomena. Gordon and Breach Science Pub
Montroll EW, Shlesinger MF (1982) On 1/f noise and other distributions with long tails. Proc Natl Acad Sci USA 79:3380-3383
Montroll E, Shlesinger MF (1983) Maximum entropy formalism, fractals, scaling phenomena, and 1/f noise: a tale of tails. J Statist Phys 32:209-230
Nicolis JS (1986) Chaotic dynamics as applied to information processing. Rep Prog Phys 49:1109-1187
Nicolis JS (1987) Chaotic dynamics of logical paradoxes. In: Bothe, Ebeling, Kurzhanski, Peschel (eds) Dynamical systems and environmental models. Academic-Verlag, pp 105-113
Ohno S, Mori N, Matsunaga T (1985) Antigen-binding specificities of antibodies are primarily determined by seven residues of VH. Proc Natl Acad Sci USA 82:2945-2949
Padlan EA (1994) Anatomy of the antibody molecule. Mol Immunol (in press)
Schroeder M (1991) Fractals, chaos, power laws. WH Freeman and Company, New York, p 35
Sneath PHA (1966) Relation between chemical structure and biological activity in peptides. J Theor Biol 12:157-195
Tonegawa S (1983) Somatic generation of antibody diversity. Nature 302:575-581
Vargas-Madrazo E, Lara-Ochoa F, Jimenez-Montano M (1994) A skewed distribution of amino acids at recognition sites of the hypervariable region of immunoglobulins. J Mol Evol 38:100-104
West BM (1985) An essay on the importance of being non-linear. Lecture notes in biomathematics, 62, Springer-Verlag, Berlin
Wu TT, Kabat EA (1970) An analysis of the sequences of the variable regions of Bence Jones proteins and myeloma light chains and their implications for antibody complementarity. J Exp Med 132:211-250
Zipf GK (1949) Human behavior and the principle of least effort. Addison-Wesley, Cambridge
