Multi-stage Redundancy Reduction: Effective Utilisation of Small Protein Datasets

John Hawkins and Mikael Bodén

Abstract
In many important bioinformatics problems the data sets contain considerable redundancy, due both to the evolutionary processes that generate the data and to biases in the data collection procedures. The redundancy can affect the development of data-driven models in two ways. Firstly, it encourages the model to replicate the bias, impeding its ability to make reliable predictions about the general domain. Secondly, it can inflate the metrics used to evaluate the model, lulling the community into a false sense of security regarding the reliability of the model. The standard practice in bioinformatics is to remove redundancy so that no two sequences in a data set share more than forty percent similarity. For small data sets this can dilute the already impoverished data beyond the boundary of practicality. One can choose to include all available data by ensuring only that the training and test samples respect the required redundancy gap. However, this encourages overfitting of the model through exposure to a highly redundant training set. We outline a process of multi-stage redundancy reduction whereby scarce data can be effectively utilised without compromising the integrity of the model or the testing procedure.

1 Introduction
The effective application of machine learning techniques to problems of classification is not simply
a matter of training a model on all available data. Although a model produced in this fashion
will perform very well on data similar to the training data, it will tend to fail on genuinely
novel data. Hence, the performance statistics produced by cross-validation on data sets containing
redundancy are not a reliable guide to a model's ability to generalise.
Considerable effort has gone into developing algorithms that minimise the redundancy within the
data yet maximise the amount left with which to build models. The founding work of Hobohm et
al. has become a standard part of bioinformatics practice [5]. Nevertheless, some of the issues
addressed therein still remain.
Genuine generalisation by a model requires that it has extracted the essence of the real-world
process that generates the distinctions it wishes to predict. In the case of bioinformatics
problems this involves some focus on the components of the molecules that serve the
functional role, rather than merely coincidental correlations brought about by sequence homology
through a shared evolutionary trajectory. A model that relies on sequence homology is employing a
different strategy: it exploits the rich evolutionary history of sequences and gambles that no
unknown protein sequence will be far in sequence space from known ones. When a model is
trained and tested on redundant data, "the apparent predictive performance may be overestimated,
reflecting the method's ability to reproduce its own particular input rather than its generalization
power" [2] (page 6).
A de facto standard has emerged in bioinformatics of performing a redundancy reduction on data
sets using sequence homology scores, such that no two sequences have greater than 25%-40%
identical residues across a specified length. This threshold range corresponds to the so-called
'twilight zone' in which sequence alignments become a poor indicator of homology [8].
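
To make the standard single-stage reduction concrete, the sketch below applies a greedy, Hobohm-style selection [5]: a sequence is kept only if it falls below the identity threshold against every sequence already accepted. The percent_identity helper is a deliberately crude stand-in for a proper alignment-based identity score, so the function names and the scoring choice should be read as illustrative assumptions rather than the exact procedure used in practice.

def percent_identity(a: str, b: str) -> float:
    # Crude ungapped identity: position-wise matches divided by the length of
    # the longer sequence; a stand-in for an alignment-based identity score.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / max(len(a), len(b))

def reduce_redundancy(sequences, threshold):
    # Greedy Hobohm-style selection: keep a sequence only if it shares less
    # than `threshold` percent identity with every sequence kept so far.
    representatives = []
    for seq in sequences:
        if all(percent_identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives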

In previous years, when the data sets were much smaller, the data curation process was not as
rigorous as it can be today, simply due to the infeasibility of reducing an already minimal data set.
However, there are a number of biological problems where the data sets remain comparatively small
simply because they relate to subtle aspects of cellular life. In these instances we are
faced with a problem: how do we best train and test our models so that we utilise the available data
and do not bias our tests?
We present a solution to this problem in the form of a method for creating the data set and a
regime for training and testing. The technique involves performing redundancy reduction in
two stages, allowing small amounts of redundancy within training sets but rigorously excluding
it between training and test sets. Furthermore, we provide a clear indication of the generalisation
improvement offered by redundancy reduction by showing that performance is optimal not when
using all available data, but when allowing only small amounts of redundancy within the training
sets.

2 The Method
The essence of the technique is as follows: we perform an initial redundancy reduction of the
data in order to remove the sequences that are most similar. This is done with a much weaker
threshold, between 40% and 90% similarity. The remaining sequences comprise the data set for
the purposes of training the model.
In order to ensure that the testing procedure is not compromised by the existence of some
redundancy in the data, we perform a second clustering at the threshold determined to be
required for rigorous testing. These clusters are then used to generate the subsets of data for the
cross-validation, such that all redundancy exists within the subsets. In other words, there is no
redundancy between the subsets, so the cross-validation is assured to be an adequate test
of generalisation.
The cross-validation is then performed such that the partially redundant data is used in the
training of the models. However, in each iteration, the set that has been allocated for testing
undergoes a further reduction to remove the remaining redundancy. In this way we allow the
models to utilise some redundancy in the training data, but remove all redundancy from the
testing procedure.
The procedure is demonstrated schematically in Figure 1.
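
A minimal sketch of this regime, building on the reduce_redundancy and percent_identity helpers above, is given below. The single-linkage clustering and the round-robin assignment of clusters to folds are illustrative simplifications rather than the exact implementation used here, and the train_and_test argument stands for whatever model-fitting and scoring routine is applied to each fold.

def cluster_at_threshold(sequences, threshold):
    # Single-linkage clustering: a sequence joins (and merges) every existing
    # cluster containing a member at or above the threshold identity.
    clusters = []
    for seq in sequences:
        hits = [c for c in clusters
                if any(percent_identity(seq, s) >= threshold for s in c)]
        merged = [seq] + [s for c in hits for s in c]
        clusters = [c for c in clusters if c not in hits] + [merged]
    return clusters

def make_folds(sequences, final_threshold, n_folds=10):
    # Assign whole clusters to folds so that no redundancy at the final
    # threshold crosses a fold boundary.
    folds = [[] for _ in range(n_folds)]
    for i, cluster in enumerate(cluster_at_threshold(sequences, final_threshold)):
        folds[i % n_folds].extend(cluster)
    return folds

def multi_stage_cross_validation(sequences, initial_threshold, final_threshold,
                                 train_and_test, n_folds=10):
    # Stage 1: weak initial reduction (e.g. 90%) of the complete data set.
    data = reduce_redundancy(sequences, initial_threshold)
    # Stage 2: folds with no between-fold redundancy at the final threshold.
    folds = make_folds(data, final_threshold, n_folds)
    scores = []
    for i, test_fold in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        # The held-out fold is further reduced so the test itself is free of
        # redundancy at the final threshold (e.g. 30%).
        test = reduce_redundancy(test_fold, final_threshold)
        scores.append(train_and_test(train, test))
    return scores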

3 Datasets
For the purposes of illustrating the utility of the technique we apply it to two different data sets
of proteins. Both data sets revolve around the importation of proteins into the peroxisome, which
is a small but important organelle in eukaryotic cells. The two mechanisms are named after the
sequence motif which the proteins rely on for recognition and import. Peroxisomal Targeting Signal
One (PTS1) and Peroxisomal Targeting Signal Two (PTS2) are both subtle, distinct sequences
within the protein that allow import into the organelle through distinct protein pathways [1, 7].
The crucial aspect of these data sets for the current paper is that the peroxisome has a small
protein complement, hence the data sets are typically small.
The PTS1 pathway relies on a C-terminal tripeptide and a sequence of nine preceding residues
that support its recognition. We have outlined the curation of this data set in previous studies
[9, 4]; for this study we use the data set prior to the redundancy reduction stage, i.e. after the
manual curation step only.
The PTS2 pathway relies on a 9-mer at an unspecified position, although most instances
occur in the N-terminal region. There are far fewer known instances of PTS2 proteins, and to make
matters worse the general motif is less specific, meaning that the training sets for differentiating
between real and spurious PTS2 instances are highly unbalanced. We have outlined the curation
of this data set in a previous study [3]. Due to the size of the negative set we maintain the heavily
reduced version (at a 10% threshold) and instead focus the multi-stage redundancy reduction on
the positive set only.
The numbers of proteins that result from the varying levels of redundancy reduction are shown
in Table 1.

[Figure 1: Schematic representation of the multi-stage redundancy reduction training and testing procedure. From the complete data set: 1) reduce using the initial reduction rate (e.g. 90%); 2) create cross-validation sets with between-set redundancy less than the final reduction rate (e.g. 30%); 3) train using all data in the training set; 4) test using a version of the test set reduced at the final reduction rate.]

                    Redundancy Reduction Threshold
Dataset             100   95   90   85   80   70   60   50   40   30
PTS1  Positives     139  131  114   97   83   66   62   56   50   47
      Negatives     291  256  230  208  193  180  171  165  164  163
PTS2  Positives      97   91   81   69   62   56   51   41   37   35

Table 1: Numbers of proteins in each data set as the initial redundancy reduction threshold is
varied. At the upper limit, a threshold of 100 means no reduction is performed. At the lower
end of 30% the threshold is identical to that used to distinguish the test sets, hence the process is
equivalent to the standard process of a single redundancy reduction prior to training and testing.

4 Simulations
In order to demonstrate the effectiveness of the technique and identify a threshold for the initial
redundancy reduction, we perform a range of simulations over each of the data sets. For each data set
we produce a set of versions, each of which has been through an initial redundancy reduction. For
these initial reductions we use a set of thresholds varying from 100 (for no reduction) down to 30,
which lies within the de facto standard range. The number of proteins in the data sets for each of these
thresholds is shown in Table 1.
As a model to train on these problems we have chosen a Support Vector Machine with a
spectrum kernel. The spectrum kernel is a general purpose sequence kernel that is efficient to run
and has proven effective on a wide range of bioinformatics problems. The spectrum kernel takes
one parameter, k, which defines the length of the sequence segments considered in the spectrum.
For a given sequence, the spectrum of the sequence is the set of all k-mers it contains. The
spectrum kernel compares any two sequences by considering the number of these k-mers that the two
sequences share [6]. More specifically, the kernel calculates the dot product between the vectors
holding the k-mer counts for each pair of sequences. If two sequences share a large number of k-mers,
the kernel value is large.
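
As a sketch of the computation described above, the k-mer counting and the dot product can be written as follows; the function names are ours, and plain dictionaries stand in for whatever sparse representation an optimised implementation would use.

from collections import Counter

def spectrum(sequence: str, k: int) -> Counter:
    # Count every k-mer (length-k substring) occurring in the sequence.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a: str, seq_b: str, k: int) -> int:
    # Dot product of the two k-mer count vectors: large when the sequences
    # share many k-mers.
    spec_a, spec_b = spectrum(seq_a, k), spectrum(seq_b, k)
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())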
For both data sets we explore k values ranging between 1 and 4. We run each k value on each of
the data sets, with the initial reduction threshold ranging from 100% (no reduction) to 30%
(single-stage reduction). We run each configuration as a ten-fold cross-validation, and run each
ten times using different seeds to calculate means and standard deviations.
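
A single train-and-test step of this experiment can be sketched as below, reusing spectrum_kernel from above. The sketch assumes scikit-learn's SVC with a precomputed Gram matrix and its matthews_corrcoef scorer; the redundancy-aware fold construction and the repetition over ten random seeds would wrap around this function, and labels are assumed to accompany the sequences in each fold. It is an illustrative sketch rather than the authors' exact code.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import matthews_corrcoef

def gram_matrix(rows, cols, k):
    # Spectrum-kernel Gram matrix between two lists of sequences.
    return np.array([[spectrum_kernel(a, b, k) for b in cols] for a in rows])

def run_fold(train_seqs, train_labels, test_seqs, test_labels, k):
    # Train on the (partially redundant) training fold and score on the
    # redundancy-reduced test fold using the Matthews Correlation Coefficient.
    clf = SVC(kernel="precomputed")
    clf.fit(gram_matrix(train_seqs, train_seqs, k), train_labels)
    predictions = clf.predict(gram_matrix(test_seqs, train_seqs, k))
    return matthews_corrcoef(test_labels, predictions)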

5 Results
The results for each of the spectrum kernels on the PTS1 problem are shown in Table 2. In each
case we show the mean Matthews Correlation Coefficient (MCC) over the ten independent runs;
the corresponding standard deviations are depicted in Figure 2.
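
For reference, the MCC reported here follows the standard definition over the confusion-matrix counts, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives respectively:

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}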
For three of the four k values tried in the spectrum kernel, the multi-stage redundancy reduction
technique produced the best model. The best overall model used an initial redundancy
reduction threshold of 90% and a k value of 2. However, as we can see in Figure 2, the spread
across the simulations was quite large, making the distinction somewhat weak in the case of k
values 1 and 2. Nevertheless, for three of the four kernels, making some use of the redundant data
allowed the development of a much better model than simply discarding it.

PTS1 Simulation Results
          Initial Redundancy Reduction Threshold
k-value   100    95    90    85    80    70    60    50    40    30
   1     0.31  0.29  0.32  0.30  0.32  0.28  0.31  0.30  0.29  0.27
   2     0.39  0.42  0.39  0.43  0.37  0.34  0.35  0.34  0.28  0.32
   3     0.29  0.27  0.30  0.34  0.34  0.34  0.36  0.34  0.34  0.29
   4     0.06  0.14  0.13  0.13  0.12  0.14  0.16  0.17  0.20  0.20

Table 2: Average MCC values for the spectrum kernel run with different k-values and different
thresholds for the initial redundancy reduction. The first column, with an initial threshold of 100%,
involves using all available data to train the models. The final column, at 30%, involves a single
redundancy reduction prior to training and testing. Intermediate values are the result of the two-stage
redundancy reduction procedure.

[Figure 2: The average MCC for the spectrum kernel (k = 1 to 4) on the PTS1 problem, plotted against the initial redundancy reduction threshold. The error bars depict one standard deviation on either side of the mean. Data generated from ten runs of ten-fold cross-validation.]

PTS2 Simulation Results
          Initial Redundancy Reduction Threshold
k-value   100    95    90    85    80    70    60    50    40    30
   1     0.31  0.29  0.32  0.30  0.32  0.28  0.31  0.30  0.29  0.27
   2     0.39  0.42  0.39  0.43  0.37  0.34  0.35  0.34  0.28  0.32
   3     0.29  0.27  0.30  0.34  0.34  0.34  0.36  0.34  0.34  0.29
   4     0.06  0.14  0.13  0.13  0.12  0.14  0.16  0.17  0.20  0.20

Table 3: Average MCC values for the spectrum kernel run with different k-values and different
thresholds for the initial redundancy reduction. The first column, with an initial threshold of 100%,
involves using all available data to train the models. The final column, at 30%, involves a single
redundancy reduction prior to training and testing. Intermediate values are the result of the two-stage
redundancy reduction procedure.

Similarly, the results for each of the spectrum kernels on the PTS2 problem are shown in Table 3.
We have taken the best kernel configuration for each problem and plotted the mean MCC
as a function of the initial reduction threshold in Figure ??. The error bars represent one standard
deviation on either side of the mean. What we see clearly in this figure is that in both cases the
best model was created using the multi-stage redundancy reduction procedure: for the PTS1
problem with an initial reduction of 90%, and for PTS2 with an initial reduction of 80%.

6 Conclusion
We have discussed the problem of redundant data in instances where there is an overall paucity
of data. We have described a technique for utilising as much of the available data as possible
without compromising the integrity of the generalisation testing procedure. We demonstrated
the utility of the technique on two data sets related to the subcellular localisation of peroxisomal
proteins. In each case the best model generated using our chosen architecture resulted from using
the multi-stage redundancy reduction approach.

References
[1] A. Baker and I. A. Sparkes. Peroxisome protein import: some answers, more questions. Current
Opinion in Plant Biology, 8(6):640–647, 2005.
[2] P. Baldi and S. Brunak. Bioinformatics: the machine learning approach, Second Edition. MIT
Press, Cambridge, Mass, 2001.
[3] M. Bodén and J. Hawkins. Evolving discriminative motifs for recognizing proteins imported
to the peroxisome via the pts2 pathway. In Proceedings of the IEEE Congress on Evolutionary
Computation, pages 9300–9305, Canada, July 2006. IEEE.
[4] J. Hawkins and M. Bodén. Predicting peroxisomal proteins. In Proceedings of the IEEE
Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages
469–474, Piscataway, November 2005. IEEE.
[5] U. Hobohm, M. Scharf, R. Schneider, and C. Sander. Selection of representative protein data
sets. Protein Science, 1(3):409–417, 1992.
[6] C. Leslie, E. Eskin, and W. S. Grundy. The spectrum kernel: A string kernel for svm protein
classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein, editors,
Proceedings of the Pacific Symposium on Biocomputing, pages 564–575. World Scientific, 2002.

[7] P. Michels, J. Moyersoen, H. Krazy, N. Galland, M. Herman, and V. Hannaert. Peroxisomes,
glyoxysomes and glycosomes. Molecular Membrane Biology, 22(1-2):133–145, 2005.
[8] B. Rost. Twilight zone of protein sequence alignments. Protein Eng., 12(2):85–94, 1999.
[9] M. Wakabayashi, J. Hawkins, S. Maetschke, and M. Bodén. Exploiting targeting signal
dependencies in the prediction of pts1 peroxisomal proteins. In M. Gallagher, J. Hogan, and
F. Maire, editors, Intelligent Data Engineering and Automated Learning - IDEAL 2005: 6th
International Conference, volume 3578 of Lecture Notes in Computer Science, pages 454–461.
Springer, 2005.
