Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

Vol. 21 no.

17 2005, pages 3535–3540


BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bti569

Genetics and population analysis

A statistical model for HIV-1 sequence classification


using the subtype analyser (STAR)
R. E. Myers1 , C. V. Gale2 , A. Harrison3 , Y. Takeuchi1 and P. Kellam2,∗
1 Department of Immunology and Molecular Pathology, 2 Department of Infection and
3 Department of Biochemistry, University College, London

Downloaded from http://bioinformatics.oxfordjournals.org/ at West Virginia University, Health Science Libraryy on June 26, 2015
Received on March 23, 2005; revised on June 23, 2005; accepted on June 30, 2005
Advance Access publication July 26, 2005

ABSTRACT HIV-1 phylogenies were inferred from viral envelope sequences


Motivation: HIV-1 antiretroviral drug resistance testing produces large (env) [subtypes A–F (Myers et al., 1992)], which were consistent
amounts of HIV-1 protease and reverse transcriptase sequences. when extended to group antigen sequences (gag) [except subtype E
These provide an excellent resource to study the incidence, spread (Louwagie et al., 1993)]. Improvements in molecular biology, com-
and clinical significance of HIV-1 subtypes. We have produced a pro- putational power and phylogenetic algorithms have subsequently
gram, Subtype Analyser (STAR) that rapidly and accurately subtypes allowed subtyping of the full length HIV-1 genome (Salminen et al.,
HIV-1. Here we have determined a robust and statistically validated 1995, 1996). Currently, there are 11 single subtypes (A–H, J, N, O)
model for subtype assignment. and a variety of circulating recombinant forms (CRF) of HIV-1 that
Results: We have significantly extended our HIV-1 subtyping tool represent recombination events between two or more single subtypes
(STAR), such that each query sequence when evaluated against sub- (Robertson et al., 1995, 2000). The Los Alamos HIV-1 Sequence
type profile alignments, returns a discriminating score based on the Database stores updated reference subtype sequences and provides
ratio of subtype positive to negative amino acid positions. These scores phylogenetic tools, such as SUDI, that allow novel subtypes to be
were transformed into a Z -score distribution and evaluated. Of the 141 evaluated (http://hiv-web.lanl.gov/content/hiv-db/SUDI/sudi.html).
sequences used to define the subtype alignments, 98% were correctly Although HIV-1 sequences and consequently subtypes evolve,
reclassified. InclusionofadditionalrecombinationdetectionwithinSTAR HIV-1 subtyping allows clinical aspects of HIV-1 infection and man-
increased the detection of known recombinant sequences to 95%. agement to be studied in the context of similar sequences. Web-based
Availability: STAR is available as compiled (Linux Fedora 3) or source HIV-1 subtyping tools are available at both the NCBI (Rozanov et al.,
code from http://pgv19.virol.ucl.ac.uk/download/star_linux.tar 2004) and at the Stanford HIV Drug Resistance Database http://
Contact: p.kellam@ucl.ac.uk hivdb2.stanford.edu/asi/deployed/hiv_central.pl?program=hivseq),
Supplementary information: Available at http://pgv19.virol.ucl.ac.uk/ both of which use reference datasets from Los Alamos. The NCBI
download/star_supplement subtyping tool uses the BLAST algorithm along a sliding win-
dow to compare HIV nucleotide sequences with reference subtype
1 INTRODUCTION sequences, whereas the Stanford tool calculates a genetic distance
Human immunodeficiency virus type-1 (HIV-1) is the retroviral between the test sequence and reference sequences. Neither method
agent that causes acquired immune deficiency syndrome (AIDS). attempts to assign any confidence to the subtype prediction; both
As of December 2004 there were an estimated 40 million people require nucleic acid as input and neither allow batch submission of
living with HIV/AIDS, with 5 million new infections and 3 million sequences.
deaths in 2004. One defining component of the retroviral life cycle We have described previously the program STAR (Gale et al.,
is the reverse transcription of the viral RNA genome into DNA. The 2004), which uses a region of HIV-1 polymerase gene (Pol) [99 amino
enzyme responsible for this, reverse transcriptase (RT), has a low acids of protease (PR), 340 amino acids of RT] to subtype sequences
level of fidelity resulting in an error rate of between 1/1000 and relative to a large set of accurately defined reference sequences. This
1/10000 nucleotide misincorporations (Roberts et al., 1988; Takeuchi region of HIV-1 is routinely used for drug resistance profiling thereby
et al., 1988). Given a 9000–10 000 base pair genome, this indicates allowing subtyping of clinical sequences evaluated for drug resist-
that progeny viruses contain between 1 and 10 polymorphisms per ance. STAR was designed to use amino acid sequences rather than
genome per round of replication. HIV-1 replicates to high levels nucleic acid, to allow retrospective subtyping of sequences in data-
in individual untreated patients (Ho et al., 1989), which coupled bases where information is stored as amino acid changes relative
with the size of the HIV-1 pandemic results in viral quasi-species of to a known reference strain. STAR utilizes position-specific scoring
HIV-1 within an infected individual and a global distribution of matrices (PSSMs) of each HIV subtype to perform profile subtype
viruses that are genetically divergent (Wain-Hobson, 1993). alignments. The frequency of amino acids at each position in the Pol
HIV-1 genetic diversity has been stratified and a subtype nomen- fragment for a given subtype were compared against a sequence of
clature based on sequence variation has been described. Initial unknown subtype with the maximum sum of the frequencies identi-
fying the subtype of the unknown sequence. Any sequence with a
∗ To whom correspondence should be addressed. score within 1.5 standard deviations from the mean of the reference

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org 3535
R.E.Myers et al.

10
Odds-ratio

PD
1

Downloaded from http://bioinformatics.oxfordjournals.org/ at West Virginia University, Health Science Libraryy on June 26, 2015
ND

0.1
0 50 100 150 200 250 300 350 400 450
Amino acid

Fig. 1. Odds ratio scores of a HIV-1 Pol query sequence against subtype B (black) and D (grey) PSSMs, plotted as log odds ratio for visualization. Positive
discriminant thresholds (PD) and negative discriminant (ND) thresholds were used to prevent incorporation of noise (scores around the neutral odds ratio of 1)
into the final score. The values of PD and ND were varied within the range 1.1–1.9 (PD) and 0.9–0.1 (ND). The ratio of the number of amino acid positions
with an odds-ratio score greater than the PD and less than the ND was calculated for each of the 11 subtype specific PSSMs. Subtypes B and D were the first
and second highest scoring PSSMs for this query sequence (DOR 12 and 0.71, respectively).

sequence subtype score was used to highlight sequences that were dif- frequency profiles and compared against a similarly normalized frequency
ficult to classify or were putative recombinant subtypes. This scoring profile of the complete dataset. Alignment of a query sequence of unknown
system was derived empirically and as such STAR required a more subtype was performed using dynamic programming against a template
accurate means of assessing how well a query sequence fitted the data sequence derived from a complete subtype consensus sequence. This, there-
model. Here we describe an improved scoring mechanism for STAR fore, represented the alignment structure of the reference datasets. In this
way the start and stop of the Pol fragment were determined, deletions in
allowing the statistical assignment of single subtype, recombinant
the query sequence (uncommon) could be gapped and insertions (usually
subtype and unclassified subtypes. owing to resistance mutations and therefore not subtype specific) could be
excised.
2 SYSTEMS AND METHODS Each amino acid in the test sequence was compared with the odds ratio
(subtype/complete dataset) of that amino acid at the corresponding position
A dataset of 174 sequences was obtained from GenBank and The Los Alamos in each of the 11 subtype profile alignments. Odds ratio scores >1 indicated
Database. A dataset of 174 sequences comprising full-length HIV-1 genomes a positive discriminant at that position for the given subtype alignment, <1
with annotated Gag, Pol and Env proteins was obtained from GenBank. indicated a negative discriminant. The sums of the number of positive and
Although the number of fully annotated HIV-1 genomes continues to increase, negative discriminants along the length of the sequence, expressed as a ratio
this dataset encompassed all single subtypes and the majority of CRFs. Of (positive discriminant/negative discriminant) yielded an overall discriminant
these 174 sequences, 65 were identified as Los Alamos HIV-1 subtype refer- odds ratio (DOR) score for the test sequence against each subtype alignment.
ence sequences (Supplementary Table 1). A dataset of 141 sequences were The subtype alignment with the largest DOR score indicated the subtype that
partitioned into 11 subtype alignments by assessing phylogenies derived from the query sequence most closely resembled. The odds ratio scores at each
sequence identity/linkage, neighbour joining and maximum parsimony meth- amino acid position were simply counted as positive and negative discrimin-
ods. The 33 sequences obtained with the original dataset were defined as ants (rather than summed or multiplied) to prevent overprediction of subtype
recombinant or unclassified. This dataset was used to seed STAR, providing by a few high scoring positions. To remove noise around a neutral odds ratio
sufficient subtype coverage and allowing complex repetitive permutations of score of 1, positive and negative discriminant assigning thresholds were estab-
STAR program parameters. An additional 813 clinically derived sequences lished (equations are detailed in Supplementary information). These meant
were obtained from the Health Protection Agency (HPA) Laboratories in the that only positions with an odds ratio score above a positive or below a negat-
United Kingdom. ive discriminant threshold were counted and included the overall discriminant
ratio (Fig. 1).
2.1 Scoring function To investigate the relationship between the level of positive and negative
A multiple alignment of 141 HIV-1 Pol fragment sequences was performed discriminant thresholds and their effect on the DOR score, the threshold levels
using ClustalW (version 1.8) (Higgins and Sharp, 1988; Thompson et al., were varied within ranges of 1.1–1.9/0.9–0.1 (positive/negative discriminant
1994). Owing to the high level of sequence conservation within HIV-1 Pol, threshold) (Fig. 1). The 141 sequences used to define the 11-subtype align-
it was possible to generate an alignment with only two gapped positions. In ments were reclassified using these varying discriminant thresholds, with the
each case gaps were introduced as a result of single sequences. The posi- query sequence being removed from the subtype alignments prior to calcu-
tions that introduced gaps were excised from the overall alignment as they lating the subtype odds ratio scores. Assessment of the DOR score for each
provided no phylogenetic or subtype determining information. The 11 sub- of the 141 sequences showed the average highest score from the 141 member
type defined alignments were extracted, such that each subtype alignment sequence set varied predictably; high negative discriminant (ND) and low
corresponded with the structure of the complete 141 sequence alignment. positive discriminant (PD) thresholds generated higher summed scores and
Subtype alignments were converted into normalized positional amino acid the converse producing lower scores, respectively (Fig. 2a).

3536
HIV-1 subtype analyser (STAR)

(a) (b)
0.1 0 .1 Ave. 0.1
0.1
Score
0.2 0 .2 % Mis-
0.2
24 - 27 Classified
0.3 0 .3
21 - 24 0.3
Negative Discriminant

4 -5
0.4 0 .4 18 - 21 0.4 3 - 4
0.4
15 - 18
0.5 0 .5 0.5 2 - 3
12 - 15 0.5

0.6 0 .6 1 -2
9- 1 2 0.6 0.6
6- 9 0 -1
0 .7

Downloaded from http://bioinformatics.oxfordjournals.org/ at West Virginia University, Health Science Libraryy on June 26, 2015
0.7 0.7 0.7
3- 6
0.8 0 .8 0.8
0- 3 0.8

0.9 0 .9 0.9
0.9
1 .1 1.2 1 .3 1.4 1 .5 1.6 1 .7 1 .8 1.9 1 .1 1.2 1 .3 1.4 1 .5 1.6 1 .7 1 .8 1.9
Positive Discriminant
(c)
Z-score 2.75 Z-score 2.50 Z-score 2.25 % Above
0.1 0 .1 0.1 0 .1 0.1 Z-Score
0.2 0 .2 0.2 0 .2 0.2 0.2 9 8- 1 00
Negative Discriminant

9 6- 9 8
0.3 0 .3 0.3 0 .3 0.3 0.3
9 4- 9 6
0.4 0 .4 0.4 0 .4 0.4 0.4 9 2- 9 4

0 .5 0 .5 9 0- 9 2
0.5 0.5 0.5 0.5
8 8- 9 0
0.6 0 .6 0.6 0 .6 0.6 0.6
8 6- 8 8
0.7 0 .7 0.7 0 .7 0.7 0.7 8 4- 8 6
8 2- 8 4
0.8 0 .8 0.8 0 .8 0.8 0.8
8 0- 8 2
0.9 0 .9 0.9 0 .9 0.9 0.9
1 .1 1.2 1 .3 1.4 1 .5 1.6 1 .7 1 .8 1.9 1 .1 1.2 1 .3 1.4 1 .5 1.6 1 .7 1 .8 1.9 1 .1 1.2 1 .3 1.4 1 .5 1.6 1 .7 1 .8 1.9
Positive Discriminant

Fig. 2. (a) Mean DOR scores obtained from reclassifying the 141 sequences over a range of discriminant thresholds (1.1–1.9/0.9–0.1). The score magnitude
is colour coded from a mean high score of 24–27 (brown) to a mean low score of 0–3 (purple). Each sequence of the 141-sequence dataset was reclassified
for subtype at the given positive and negative discriminant thresholds and the average of the 141 DOR scores calculated. (b) The percentage of misclassified
sequences based on the known subtype of each of the 141-sequence dataset. 0–1% misclassification represents two misclassified sequences out of 141. (c) The
effect of positive and negative discriminant thresholds on Z-score. The percentage of sequences (n = 141), for which the highest value of an 11 point Z-score
distribution, is greater than the specified Z-score cut-offs of 2.75, 2.50 and 2.25 at varying positive and negative determinant thresholds. A standardized scale
(right) is used to indicate percentage ranging from 98–100% (brown) of maximum Z-scores being above the specified threshold to 80–82% (blue) scoring
above the threshold.

As the subtype of the 141 sequences was known, the effect of varying transformation resulting in distribution with a mean of 0 and a standard devi-
discriminant thresholds could also be evaluated in terms of the number ation of 1 for the 11-point distribution of subtype scores (per query sequence).
of sequences that were erroneously classified (Fig. 2b). Classification was The same process of reclassification of the 141 sequences following prior
adjudged incorrect if the highest scoring subtype alignment did not match the removal from the subtype alignment was used to investigate Z-score trans-
known subtype of the sequence. The analysis showed little overall misclassi- formation. The maximum score per sequence from the 11 possible scores
fication with the highest level being 4–5% misclassification (6–8 sequences). was always >2. Analysis of Z-score frequency in terms of the percentage
More importantly, this analysis identified an island of discriminant threshold of sequences above a given Z-score at varying discriminant thresholds (ana-
values between 1.5 and 1.6 (PD) and 0.4 and 0.5 (ND) that were optimal logous to Fig. 2a) was used to confirm the optimum discriminant threshold
for classification. Using these discriminant thresholds resulted in correct levels. This established a Z-score value that maximized correct classification
classification of all except two sequences (98.6%). of query sequence subtype (Fig. 2c).
As expected when the maximum Z-score threshold was reduced (Z-score
2.25, Fig. 2c), the overall percentage of sequences above the threshold was
2.2 Assessment of score increased. At the lowest Z-score (2.25) a broad range of discriminant
Despite the identification of the optimal discriminant thresholds there was thresholds (PD range 1.4–1.9/ND range 0.5–0.3) resulted in >98% of
no clear relationship between the magnitude of the DOR score from the sequences being above the Z-score threshold. At a Z-score of 2.5, only
maximal subtype alignment and likelihood of a correct subtype reclassific- one pair of discriminant thresholds (PD 1.6/ND 0.5) resulted in >98% of
ation (data not shown). Rather than considering the maximal subtype ratio sequences with Z-scores higher than the threshold. This threshold com-
alone, the highest scoring subtype alignment (per query sequence) can be bination also correlated with the group of discriminant thresholds identified
considered as the extremity of the distribution of the scores generated for previously as resulting in only two erroneous subtype predictions (Fig. 2b).
the set of 11-subtype alignments. This was characterized using a Z-score Again, as expected, a high Z-score threshold of 2.75 reduced the maximum

3537
R.E.Myers et al.

70 3.5

60 3

50 2.5
% Frequency

40 2

Z-Score
30
1.5
20
1
10
0.5

Downloaded from http://bioinformatics.oxfordjournals.org/ at West Virginia University, Health Science Libraryy on June 26, 2015
0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 3 3.5 4 0
Z-Score
-0.5

Fig. 3. The frequency distribution of Z-score from the maximal subtype -1


alignments, (solid line) n = 141 (141 × 1) and non-maximal subtype A B C
alignments (dashed line) n = 1410 (141 × 10). Data were generated using
STAR-s to reclassify the 141 sequences used to define the 11 subtype PSSMs.
Fig. 4. Distributions of subtype B Z-scores. Three groups of sequences are
displayed; sequences used to define the subtype B alignment and PSSM
(A, closed circle), known recombinant sequences containing subtype B (B,
percentage of sequences scoring above the threshold to 90–92%. In effect,
open square) and clinical subtype B predicted sequences, i.e. those sequences
a high Z-score threshold should have low coverage but high accuracy of
where the maximal Z-score corresponded with the subtype B PSSM (C, ×).
subtype assignment with the converse being true of low Z-score thresholds.
Two recombinant sequences (open square) circulating recombinant form 12
Through this analysis we have identified discriminant thresholds of 1.6
(CRF 12, subtypes B and F) have Z-scores >2.5. Only 8.4% of clinical
(positive) and 0.5 (negative) that resulted in 139/141 sequences, matching
sequences predicted to be subtype B have Z-scores <2.5.
the subtype alignment to which they had been assigned, with a maximal
Z-score >2.5.
We investigated the distribution of the entire set of Z-scores for all query
sequences (n = 1551) partitioning them into two sets, the top Z-score set Nevertheless, only 8.4% (n = 56) of clinical sequences predicted to be
(n = 141) and the remaining Z-score set (n = 10) (Fig. 3). subtype B fall below the 2.5 Z-score threshold. This 8.4% of clinical
This analysis showed that there was excellent separation between the sequences may contain polymorphisms that are not present within our initial
maximal Z-score (subtype predictive alignment) and the 10 other Z-scores subtype specific alignments or may suggest the presence of recombinant
(non-subtype predictive alignments). The two sequences that were not sub- genomes.
typed correctly produced maximal Z-scores <2.5. In these cases, the second To identify the recombinant genomes, all query sequences were reanalysed
highest scoring subtype alignment Z-scores were also relatively large (1.11, along the length of the sequence. Each query sequence was compared with the
1.51). This contrasts with the remaining 139 correctly assigned sequences 11 HIV-1 subtype specific PSSMs; however the PSSMs were recalculated to
with Z-scores >2.5, where the second highest scoring subtype PSSM scored give normalized percentage occurrence of each amino acid per position within
<1. The Z-score based implementation of STAR (STAR-s) performed better the individual subtype alignments. The percentage occurrence of the amino
than the original scoring threshold function of STAR (Gale et al., 2004). The acids present in the query sequence relative to a subtype PSSM were averaged
number of sequences scoring above the respective thresholds increased from within a sliding window of 50 and a step value of one amino acid. The subtype
93% with STAR (Gale et al., 2004) to 98% with STAR-s described here. initially predicted for the query sequence provided a baseline score of 0 and
the scores derived from the other 10 subtype PSSMs were compared with
2.3 Assessment of recombinant or unclassified that baseline. The output of this analysis showed positions within the query
sequence that were more similar to a different subtype rather than the predicted
sequences subtype. Summing the regions of each subtype trace from the first instance
In cases where a query sequence returned a maximal Z-score below the of a positive score to the first instance of a negative score (i.e. where the
threshold of 2.5, the subtype of that sequence could not be rigorously assigned. similarity to the alternative PSSM finishes), highlighted regions of the query
Unassigned query sequences were likely to be either recombinant sequences sequence that were more similar to a subtype other than the predicted subtype
that contained regions of sequence similarity to more than one subtype (Fig. 5).
alignment or sequences containing polymorphisms not defined within the The recombinant regions within a query sequence were formally assessed
subtype alignments. In cases where sequences previously defined within the by analysing their predicted length (if >50 amino acids) and the cumulative
Los Alamos database as recombinant [http://hiv-web.lanl.gov/content/index, total percentage sequence identity difference relative to the initial predicted
(Robertson et al., 2000)] were analysed relative to single subtype sequences subtype. A ratio of summed percentage identity to sequence length was
it was clear that, recombinant sequences generally produced low Z-scores required to be >1 for the region to be considered recombinant. Assessment
(Fig. 4). of the validity of mosaic recombinant architectures proved to be effective for
Two (of three) circulating recombinant forms with subtype B and subtype F simple dual subtype mosaic architectures (Fig. 5).
recombinant regions (CRF12 BF) produced Z-scores >2.5; however, these The main aim of recombination characterization was to separate reliable
sequences predominantly contain subtype B sequence with minimal subtype F single subtype predictions from sequences which were less reliably subtyped.
recombinant regions (Carr et al., 2001). Consistent with the score distribution, Therefore, query sequences were initially subtyped using the Z-score sub-
when 814 clinical sequences were subtyped using the Z-score function (data typing program and then reanalysed to detect the presence of recombination
not shown), the distribution of subtype B predicted sequences (maximum events. In the event of this second analysis phase detecting recombination,
Z-score, subtype B PSSM) showed greater dispersion than the reference the initial Z-score was assigned a null value and the query sequence was
sequences used to define the subtype B alignment (Fig. 4). reported as a recombinant with no statistical score.

3538
HIV-1 subtype analyser (STAR)

120 (a) 100


Subtype FK
Accumulative % Identity

100 Subtype A 90
80

% Sequence Classification
80
70
60
60
40
50
20 40
0 30
0 50 100 150 200 250 300 350 400 450 20
Amino Acid Position

Downloaded from http://bioinformatics.oxfordjournals.org/ at West Virginia University, Health Science Libraryy on June 26, 2015
10
0
Fig. 5. Recombination analysis of CRF 12 (subtypes B and F). HIV–1 poly- STAR STAR-s STAR-rec
merase amino acids 1-99 PR, 1-340 RT (labelled 100–439) are indicated. The
(b) 1.00
initial subtype prediction for CRF12 was subtype B; therefore, amino acid
positions with a y-axis score of 0 are indicative of subtype B. Regions scoring
>0, amino acids 1–100, 150–195, show more similarity to the subtype FK 0.80
than B PSSMs. This analysis illustrates that the query sequence should be

Sensitivity
classified as a recombinant sequence rather than single subtype B sequence 0.60
as initially predicted. ST AR
STAR
0.40 ST AR-s
STAR-s
A comparison of three implementations of STAR, the original STAR ST AR-rec
STAR-rec
(STAR), the Z-score STAR (STAR-s), and the Z-score and recombination 0.20
analysis STAR (STAR-rec) (Fig. 6a) showed that STAR-rec performed bet-
ter than the other methods. STAR and STAR-rec correctly reclassify similar 0.00
numbers of single subtype sequences when the method dependent confidence 0.00 0.20 0.40 0.60 0.80 1.00
thresholds were applied (93 and 92% of 141 sequences, respectively). As 1-Specificity
described, STAR-s performs better during this reclassification (98%); how-
ever, both STAR-s and STAR potentially classify known recombinants as a
single subtype. From 20 known recombinant sequences STAR and STAR- Fig. 6. (a) Comparison between STAR HIV-1 Pol (PR 1–99, RT 1–340)
s assigned 13 and 8 sequences, respectively to a single subtype, scoring subtyping methodologies. Methods were evaluated by comparing the num-
above the relevant threshold. However, STAR-rec identifies the majority ber of sequences used to define subtype alignments correctly reclassified and
of the known recombinant sequences (19 out of 20), although not signific- scoring above a threshold by the leave-one-out analysis (% Reclassification
antly affecting the percentage reclassified known single subtype sequences above threshold—white bars) (n = 141). STAR methods were also evaluated
(Fig. 6a). The receiver operated characteristic (ROC) (Fig. 6b) formally on the basis of correct identification of recombinant sequences as not belong-
showed that STAR-rec had higher sensitivity and specificity that either ing to a single subtype (% Recombinant detection—grey bars) (n = 20). (b)
STAR-s or the original STAR methods. ROC analysis of three implementations of STAR. True positives (TP) were
In summary, STAR-rec, combining Z-scoring and recombination analysis, calculated as correct subtype predictions above a Z-score threshold, False
provides a validated statistical framework for the accurate assignment of negatives (FN), as below Z-score threshold for 141 single subtype sequences.
single HIV-1 subtypes while identifying potential recombinant sequences. False positives (FP) and True negatives (TN) were assigned on the basis
of single subtype prediction for the set of known recombinant sequences
(n = 20) scoring above and below a Z-score threshold, respectively.
3 DISCUSSION Sensitivity was calculated as TP/(TP+FN), Specificity was calculated as
We have described here a validated scoring system for assigning TN/(TN+FP).
HIV-1 sequence subtype to PR and RT sequences using the STAR
program. This scoring system works effectively on HIV-1 sequences
generated as a part of routine antiretroviral drug therapy monitoring. aim that clinical databases used to manage HIV-1 infected patients
The scoring algorithm has been parameterized, such that each posi- could incorporate subtype information produced from the sequences
tion in the test sequence was evaluated as being positively, negatively already used to monitor HIV-1 antiretroviral therapy resistance. In
or neutrally associated with a given subtype alignment expressed as this context, STAR has advantages over currently existing subtyp-
a PSSM. Optimal DOR score used to establish the association were ing methodologies. STAR does not require each query sequence to
determined, which coupled with an additional recombination ana- be submitted to a web-based subtyping tool and does not require
lysis gave optimal performance in terms of predictive classification the building or interpretation of phylogenetic trees. The use of
power, while minimizing false predictions. Hidden Markov models amino acid based subtyping will allow retrospective subtyping of
(HMM) were considered as an alternative to PSSMs; however, the sequences already stored in databases as mutations from a subtype
hidden layers, use of a matched versus random score model and diffi- B consensus, which in some cases is the only sequence information
culty of detecting recombinant sequences within our implementation, available. Importantly, STAR also gives a statistical assessment of
rendered them less attractive than our PSSM model (addressed within the accuracy of its subtype prediction.
Supplementary data, HMM). The design of STAR as a clinical tool results in subtyping a relat-
STAR has been designed to allow rapid, accurate and user-friendly ively small fragment of the HIV-1 genome (1317 of the 9000 base
subtyping of HIV-1 sequences generated in a clinical context. It is our pairs). Recombination events outside the Pol protein will not be

3539
R.E.Myers et al.

detected. Recombination events at the 5 or 3 ends of the HIV-1 ACKNOWLEDGEMENTS


PR and RT fragment, resulting in a short region of one subtype and a This work was funded by the Medical Research Council (MRC)
longer region of a different subtype, also remain problematic. One of (R.E.M.) and the Special Hospital Trustees of the Middlesex Hospital
the three previously identified recombinant sequences CRF12 (B/F) London (C.V.G.).
was classified as subtype B with a Z-score >2.5, owing to the short
nature of the F sequence in the region of HIV-1 examined. The Conflict of Interest: none declared.
remaining subtype recombinant sequences included in this analysis
all generated a highest scoring subtype alignment below Z-score
REFERENCES
2.5. It is envisaged that relative overprediction of sequences that
appear recombinant is preferable to inaccurate prediction of genu- Bartosch,B. et al. (2004) Evidence and consequence of porcine endogenous retrovirus
recombination. J. Virol., 78, 13880–13890.
inely recombinant sequences as confidently belonging to a single

Downloaded from http://bioinformatics.oxfordjournals.org/ at West Virginia University, Health Science Libraryy on June 26, 2015
Carr,J.K. et al. (2001) Diverse BF recombinants have spread widely since the introduc-
HIV-1 subtype. This will provide clean datasets of accurately sub- tion of HIV-1 into South America. AIDS, 15, F41–F47.
typed sequences and a set of putative recombinant query sequences Gale,C.V. et al. (2004) Development of a novel human immunodeficiency virus type 1
that will require subsequent downstream analysis by more sophist- subtyping tool, Subtype Analyzer (STAR): analysis of subtype distribution in
London. AIDS Res. Hum. Retroviruses, 20, 457–464.
icated recombination analysis tools, such as LOHA (Bartosch et al.,
Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple
2004) or RDP (Martin et al., 2005). sequence alignment on a microcomputer. Gene, 73, 237–44.
STAR was designed to partition HIV-1 polymerase amino acid Ho,D.D. et al. (1989) Quantitation of human immunodeficiency virus type 1 in the blood
sequences into subtypes defined by sequence variation, which com- of infected persons. N. Engl. J. Med., 321, 1621–1625.
plements, but does not replace HIV-1 phylogenetic analyses. This Louwagie,J. et al. (1993) Phylogenetic analysis of gag genes from 70 international
HIV-1 isolates provides evidence for multiple genotypes. AIDS, 7, 769–780.
methodology is applicable to other datasets, which have been sub- Martin,D.P. et al. (2005) RDP2: recombination detection and analysis from sequence
grouped by sequence variation. Full-length Hepatitis B virus (HBV) alignments. Bioinformatics, 21, 260–262.
and HIV-1 nucleotide genomes have both been subtyped using Myers,G. et al. (1992) The emergence of simian/human immunodeficiency viruses. AIDS
STAR (unpublished data). Res. Hum. Retroviruses, 8, 373–386.
Parry,J.V. et al. (2001) National surveillance of HIV-1 subtypes for England and Wales:
STAR is a tool that allows sutyping of a specific HIV-1 sequence
design, methods, and initial findings. J. Acquir. Immune. Defic. Syndr., 26, 381–388.
resource (drug resistance monitoring) and as an aid to investigating Roberts,J.D. et al. (1988) The accuracy of reverse transcriptase from HIV-1. Science
the clinical implications of HIV-1 subtype. The majority of antiret- 242, 1171–1173.
roviral drug design and evaluation is undertaken using the HIV-1 Robertson,D.L. et al. (2000) HIV-1 nomenclature proposal. Science, 288, 55–56.
subtype B virus, which predominates in both the US and western Robertson,D.L. et al. (1995) Recombination in AIDS viruses. J. Mol. Evol., 40, 249–259.
Rozanov,M. et al. (2004) A web-based genotyping resource for viral sequences. Nucleic
European epidemics. The influx of non-subtype B HIV-1 infected Acids Res., 32(Web Server issue), W654–W659.
individuals into western Europe (Parry et al., 2001; Gale et al., Salminen,M.O. et al. (1996) Full-length sequence of an ethiopian human immuno-
2004) and the increasing availability of therapy in Africa and Asia deficiency virus type 1 (HIV-1) isolate of genetic subtype C. AIDS Res. Hum.
will generate large amounts of non-subtype B sequence data as a Retroviruses, 12, 1329–1339.
Salminen,M.O. et al. (1995) Recovery of virtually full-length HIV-1 provirus of diverse
part of routine drug resistance screening. Accurate subtype predic-
subtypes from primary virus cultures using the polymerase chain reaction. Virology,
tion of this resource should, therefore, provide insights into the role 213, 80–86.
of HIV-1 subtype and sequence variation, with respect to the clin- Takeuchi,Y. et al. (1988) Low fidelity of cell-free DNA synthesis by reverse transcriptase
ical management and disease progression of non-subtype B infected of human immunodeficiency virus. J. Virol., 62, 3900–3902.
individuals. Incorporating subtype information into HIV-1 patient Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap
management strategies will allow analysis of any potential subtype penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
related differences in response to drug therapy, transmission and Wain-Hobson,S. (1993) The fastest genome evolution ever described: HIV variation in
disease progression. situ. Curr. Opin. Genet. Dev., 3, 878–883.

3540

You might also like