An Improved Topology Prediction of Alpha-Helical Transmembrane Protein Based On Deep Multi-Scale Convolutional Neural Network
Abstract—Alpha-helical transmembrane proteins (aTMPs) are essential in various biological processes. Although their tertiary structures are crucial for revealing complex functions, experimental structure determination remains challenging and costly. Over the past decades, various sequence-based topology prediction methods have been developed to bridge the gap between sequences and structures by characterizing structural features, but significant improvements are still required. Deep learning brings a great opportunity because of its powerful ability to learn representations from limited original data. In this work, we improved aTMP topology prediction with our method DMCTOP, which is composed of two deep convolutional blocks that simultaneously extract local and global contextual features. Consequently, the inputs were simplified to reflect the original features of the sequence: a protein sequence feature and an evolutionary conservation feature. DMCTOP can efficiently and accurately identify all topological types and the N-terminal orientation of an aTMP sequence. To validate the effectiveness of our method, we benchmarked DMCTOP against 13 peer methods at the whole-sequence level, the transmembrane-segment level, and the traditional criterion level. All the results show that our method achieves the highest prediction accuracy and outperforms all previous methods. The method is available at https://icdtools.nenu.edu.cn/dmctop.
Index Terms—Deep learning, deep multi-scale convolutional neural network, topology prediction, transmembrane proteins
Fig. 2. Our method, DMCTOP, uses a deep multi-scale convolutional neural network (DMCNN) for aTMP topology prediction. The input consists of protein sequence and evolutionary conservation features. After feature integration, the preprocessed feature vectors are passed to deep convolutional blocks composed of several modules formed by multi-scale CNN layers, which extract both local and global contextual features. On top of the second deep convolutional block, two fully-connected layers with softmax perform multi-label classification. A fine-tuning operation ensures that the prediction results are biologically meaningful.
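As a concrete illustration of the multi-scale convolution at the heart of the architecture in the caption above, the following NumPy sketch applies the Eq.-(2)-style convolution with kernel sizes 3, 7, and 11 (the sizes used in Section 2) to a 700-residue, 42-feature input. The channel count here is reduced to two per scale for brevity (the paper uses 64), and the random weights are purely illustrative:

```python
import numpy as np

def conv1d_relu(x, w, b):
    """Eq.-(2)-style 1-D convolution: y_i = ReLU(w . x[i:i+f] + b),
    computed over a zero-padded sequence so the output length
    matches the input length."""
    f = w.shape[0]                       # kernel extent along the sequence
    pad = f // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.array([np.sum(w * xp[i:i + f]) + b for i in range(len(x))])
    return np.maximum(out, 0.0)          # ReLU activation

def multi_scale(x, kernel_sizes=(3, 7, 11), channels=2, seed=0):
    """Parallel convolutions with kernel sizes 3, 7 and 11 (one turn of
    an alpha helix spans about 3.5 residues); the per-scale feature
    maps are concatenated, as in the multi-scale CNN layer."""
    rng = np.random.default_rng(seed)
    maps = []
    for f in kernel_sizes:
        for _ in range(channels):        # 2 channels per scale here; 64 in the paper
            w = rng.normal(size=(f, x.shape[1]))
            maps.append(conv1d_relu(x, w, rng.normal()))
    return np.stack(maps, axis=1)        # shape: (seq_len, n_scales * channels)

x = np.random.default_rng(1).normal(size=(700, 42))  # 700 residues x 42 features
y = multi_scale(x)
print(y.shape)  # (700, 6)
```

Stacking such modules, as the deep convolutional blocks do, widens the effective receptive field so that increasingly non-local residue interactions are captured.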
Evolutionary Conservation: Evolutionary conservation information in the form of a PSSM (Position-Specific Scoring Matrix) profile is a matrix used for protein sequence pattern representation [35], [36]. The PSSM profiles are calculated by running PSI-BLAST [37] against UniRef90 with an e-value threshold of 0.001 and 3 iterations. The PSSM matrix obtained in this paper consists of 700 x 21 entries, where 700 is the length of an input protein sequence and 21 covers the 20 types of natural amino acids plus a 'None' type. The original PSSM values are then passed through a sigmoid function so that each value lies in the range (0, 1).

2.3 Deep Network Architecture

The Deep Multi-Scale Convolutional Neural Network (DMCNN), shown in Fig. 2, consists of three parts: an input feature-integration layer for feature preprocessing; two deep convolutional blocks composed of multiple modules formed by multi-scale CNN layers; and two fully-connected layers, followed by fine-tuning. The input to the DMCTOP method consists of protein sequence features and evolutionary conservation features. After feature integration, the preprocessed feature vectors are passed to the deep convolutional blocks, whose modules of multi-scale CNN layers extract both local and global contextual features. Multiple hierarchical convolutional operations with different kernel sizes cover a wide range of the protein sequence at various granularities. On top of the second deep convolutional block, two fully-connected layers with softmax perform multi-label classification. Finally, a fine-tuning operation ensures that the prediction results are biologically meaningful.

Multi-Scale CNN Layers: The amino-acid sequence with concatenated features is the input to the multi-scale CNN layers. To capture the local context among adjacent amino acids, we used CNNs to extract local contextual characteristics, with the Rectified Linear Unit (ReLU) [38] as the activation function. Since one turn of an alpha helix consists of 3.5 amino acids on average, we use convolution kernels of sizes 3, 7, and 11 to enrich the feature information. The longer kernel sizes are chosen because amino acids are sometimes affected by residues at a relatively long distance, and using different kernel sizes is also considered biologically relevant [39]. We then combined the convolutional layers of different sizes into a uniform network module, as shown in Fig. 3.

Before the vector-matrix convolution calculation, the input is processed with a one-dimensional convolution kernel of size 1. The purpose is to keep the original feature-map scale unchanged while significantly increasing the non-linearity of the input features and the depth of the training network. In addition, this operation densifies the feature matrix to avoid the problem of unevenly distributed data. The calculation procedure is given by the following formula:

y_i = F(x~_(i:i+f-1)) = ReLU(w . x~_(i:i+f-1) + b),  (2)

where F in R^(f x 42) is a convolutional kernel; f is the extent of the kernel along the protein sequence; 42 is the feature dimensionality at each amino acid; and b is the bias term. The kernel goes through the complete input sequence like a sliding window and generates a corresponding output feature map Y~ = [y_1, y_2, ..., y_700], where each y_i has 64 channels.

In this paper, we use kernels of different sizes at the same time (f = 3, 7, 11) to extract multiple local contextual feature maps Y~_1, Y~_2, Y~_3. These multi-scale features are concatenated together as the local relevancy Y = [Y~_1; Y~_2; Y~_3].

Deep Convolutional Block: The function of the module consisting of multi-scale CNN layers is to explore more abstract and specific local correlation features at different extents. We then built the modules into a deep convolutional block, which increases the depth and complexity of the neural network training. By stacking convolution operations together, the network gains a stronger ability to capture non-local residue interactions, that is, to extract global contextual features among amino acids.

Implementation Details: To develop a high-quality model, we used 10-fold cross-validation and the average of the independent test set results as our final prediction performance. All the architectures described in this paper were implemented using the open-source software TensorFlow with the Keras library. We included batch normalization and dropout to improve the generalization ability of the model. An early-stopping rule and a learning-rate scheduler were adopted to control overfitting and learning efficiency. The entire deep network was trained on an NVIDIA GeForce GTX 1080Ti with 11 GB of memory.

3 RESULTS AND DISCUSSION

When comparing DMCTOP with other prediction tools, we used a uniform high-resolution membrane protein test set of 113 protein sequences to ensure the reliability of the comparison results. The prediction performance of DMCTOP and the other tools was verified from three perspectives: prediction performance at the whole sequence level, at the transmembrane segment level, and at the traditional criterion level.

3.1 Prediction Performance Analysis at Whole Sequence Level

We evaluate all types of topological regions at the whole sequence level, namely 'I' (Intracellular), 'M' (Transmembrane), and 'O' (Extracellular). Six measurements, including accuracy, recall, precision, specificity, the Matthews correlation coefficient (MCC), and the F1-measure, were used to assess the prediction performance of DMCTOP and the other methods [40].

The performance of the DMCTOP method at the whole sequence level, along with that of the 13 other methods, is listed in Table 1. For the three-category problem, if one of the categories is defined as the positive sample, the other two categories are considered negative samples. First, in terms of prediction accuracy, our method reaches 86.65 percent, while the highest accuracy among the 13 other prediction tools is achieved by the SCAMPI-msa method at 84.32 percent. Second, the MCC value reflects the classification performance and prediction reliability of a model. The MCC of DMCTOP in the intracellular, transmembrane, and extracellular regions is 80.03, 82.33, and 77.44 percent, respectively, and the value for each class is higher than that of any other method. Third, we used specificity to measure the ability of the different prediction methods to identify negative samples. The results show that our method is superior to the others, reaching 93.53, 92.78, and 95.66 percent for the three types of membrane regions, respectively. In addition, for the two indicators of recall and precision, our method reaches the highest values in the transmembrane regions of most concern, 90.97 and 86.58 percent, respectively. For the intracellular region, DMCTOP has the highest precision value of 87.09 percent, but its recall value of 86.24 percent is lower than the 87.82 percent of SCAMPI-msa. For the extracellular region, the recall value of our method reaches the highest value of 83.16 percent, while its precision value of 86.17 percent is less than the 88.68 percent of SCAMPI-msa.

Since precision and recall often constrain each other, we combine their results with the F1-measure to evaluate the classification performance of the model. The F1-scores of the intracellular, transmembrane, and extracellular regions are 86.74, 88.49, and 84.46 percent, respectively, which are higher than those of the SCAMPI-msa method (85.84, 84.76, and 82.16 percent) and are also superior to the other methods. Fig. 4 plots the F1-measure to visualize the comparison of the different methods at the whole sequence level. Meanwhile, the average AUC values (area under the ROC curves) and the mean precision (area under the precision-recall curves) of 'I', 'M', and 'O' in the training process are shown in Fig. 5.

Fig. 4. Comparison of the F1-measures of different methods.

Fig. 5. The AUC value and mean precision of the three classes based on 10 runs in the training process.

YANG ET AL.: IMPROVED TOPOLOGY PREDICTION OF ALPHA-HELICAL TRANSMEMBRANE PROTEIN BASED ON DEEP MULTI-SCALE... 299

TABLE 1
Prediction Performance at Whole Sequence Level

3.2 Prediction Performance Analysis at Transmembrane Segment Level and Traditional Criterion Level

To assess the prediction performance on alpha transmembrane regions (i.e., alpha transmembrane helices) without considering the orientation, we used the unified evaluation criterion developed by Jayasinghe [41] to verify the prediction effect of all methods. They defined a successful prediction as an overlap of at least 9 AAs between a predicted aTMH segment and a known one. Similarly, Moller [42] also adopted a nine-residue segment length in his method. The total numbers of predicted and real known alpha transmembrane regions in the testing set are denoted N_pred and N_known, respectively. The total number of overlapping predicted and real known alpha transmembrane regions (i.e., the number of correctly predicted aTMHs in the testing set) is denoted N_correct. The efficiency of alpha transmembrane region prediction is measured by M = N_correct / N_known and C = N_correct / N_pred. The overall aTMH prediction accuracy Q is calculated using the following equation:

Q = sqrt(M x C) x 100%.  (3)

The predicted results of DMCTOP at the transmembrane segment level, along with those of the other methods, are detailed in Table 2.

According to the overall aTMH prediction power defined in [41], [42], the DMCTOP method achieves the highest Q value of 98.90 percent. Measurement C reflects the percentage of true transmembrane regions among the transmembrane regions predicted by the model. Our method is superior to the above-mentioned methods with a C value of 99.12 percent. Although the performance of DMCTOP and PRODIV differs by 0.22 percent in the measurement index M, it can be seen from Table 2 that the parameter N_correct has the most important influence on the M value, and the results of the two prediction methods differ by only one prediction sample. Meanwhile, the PRODIV method predicts 473 aTMH regions, 17 more than the 456 predicted by the DMCTOP method. With 458 known aTMH segments, it is clear that our method has better prediction stability and reliability.

In general, the accuracy of traditional topology prediction is measured from the following three perspectives: 1) the number of predicted aTMHs, 2) the locations of those predicted aTMHs, and 3) the orientation of the alpha transmembrane protein sequence (i.e., the N-terminal direction).
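The segment-level measures M = N_correct / N_known, C = N_correct / N_pred, and the Q value of Eq. (3) can be sketched as a small helper; the counts below are the DMCTOP row of Table 2:

```python
import math

def segment_metrics(n_known, n_pred, n_correct):
    """Segment-level prediction power: M is recall over the known
    aTMHs, C is precision over the predicted aTMHs, and
    Q = sqrt(M * C) * 100% is the overall accuracy of Eq. (3)."""
    m = n_correct / n_known
    c = n_correct / n_pred
    return 100 * m, 100 * c, 100 * math.sqrt(m * c)

# DMCTOP row of Table 2: 458 known, 456 predicted, 452 correctly predicted aTMHs
m, c, q = segment_metrics(458, 456, 452)
print(f"M={m:.2f}% C={c:.2f}% Q={q:.2f}%")  # M=98.69% C=99.12% Q=98.91%
```

Computed from the raw counts, Q comes out as 98.91 percent; the 98.90 percent reported in Table 2 follows from applying Eq. (3) to the already-rounded M and C values.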
300 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 19, NO. 1, JANUARY/FEBRUARY 2022

Using this evaluation criterion to measure the existing tools, if all aTMHs and the N-terminal direction of a transmembrane protein sequence have been predicted correctly, then the topology is judged to be correctly predicted. The comparison results of the various methods are shown in Table 2. The experimental results show that, compared with the other prediction methods, DMCTOP also has the best prediction performance at the traditional criterion level, with a topology prediction accuracy of 91.7 percent.

TABLE 2
Prediction Performance at Transmembrane Segment Level and Traditional Criterion Level

Algorithm    Nknown  Npred  Ncorrect  M(%)   C(%)   Q(%)   Top(%)
HMMTOP2.0    458     473    440       96.07  93.02  94.53  67.3
MEMSAT3.0    458     467    444       96.94  95.07  96.00  66.4
OCTOPUS      458     460    451       98.47  98.04  98.25  89.4
Philius      458     463    439       95.85  94.82  95.33  70.8
Phobius      458     461    443       96.72  96.10  96.41  65.5
PRO          458     450    438       95.63  97.33  96.48  79.6
PRODIV       458     473    453       98.91  95.77  97.33  85.8
SCAMPI-seq   458     458    444       96.94  96.94  96.94  79.6
SCAMPI-msa   458     462    450       98.25  97.40  97.83  88.5
SPOCTOPUS    458     458    443       96.72  96.72  96.72  84.1
TMHMM2.0     458     449    431       94.10  95.99  95.04  58.4
TOPCONS      458     446    436       95.20  97.76  96.47  77.9
TopPred2.0   458     458    431       94.10  94.10  94.10  63.7
DMCTOP*      458     456    452       98.69  99.12  98.90  91.7

* Experiment results we calculated.
Bold fonts represent the best experimental results.

TABLE 4
Physicochemical Properties of Amino Acids

AAs  Hydrophobicity (Kyte-Doolittle)  Hydrophobicity (Eisenberg)  Charge    Polarity
Ala   1.8   0.25  neutral   nonpolar
Arg  -4.5  -1.8   positive  polar
Asn  -3.5  -0.64  neutral   polar
Asp  -3.5  -0.72  negative  polar
Cys   2.5   0.04  neutral   polar
Glu  -3.5  -0.62  negative  polar
Gln  -3.5  -0.69  neutral   polar
Gly  -0.4   0.16  neutral   nonpolar
His  -3.2  -0.4   neutral   polar
Ile   4.5   0.73  neutral   nonpolar
Leu   3.8   0.53  neutral   nonpolar
Lys  -3.9  -1.1   positive  polar
Met   1.9   0.26  neutral   nonpolar
Phe   2.8   0.61  neutral   nonpolar
Pro  -1.6  -0.07  neutral   nonpolar
Ser  -0.8  -0.26  neutral   polar
Thr  -0.7  -0.81  neutral   polar
Trp  -0.9   0.37  neutral   polar
Tyr  -1.3   0.02  neutral   polar
Val   4.2   0.54  neutral   nonpolar

3.3 Prediction Performance Analysis With Strict Evaluation Measures

Although our method achieved better results than previous methods, stricter evaluation criteria have been proposed in recent years to verify the effectiveness of a method [43]. On the one hand, for the definition of an aTMH, the new standard indicates that an aTMH must satisfy the following two conditions simultaneously before it is considered correct: 1) the error between the predicted aTMH terminal position and the real aTMH terminal position is less than 5 AAs; 2) the overlapping part between a known and a predicted aTMH segment must be at least half of the longer one. Therefore, the values of M and C are converted to M' = N'_correct / N_known and C' = N'_correct / N_pred, respectively. On the other hand, based on the new evaluation criterion for aTMHs, the topology prediction at the traditional criterion level is also refined with the following conditions: 1) the numbers of predicted aTMHs and known aTMHs are equal; 2) the locations of the predicted aTMHs correspond to those of the known aTMHs; 3) all non-TMH residues at inside or outside positions are correctly predicted. We reassessed all methods with the new standards, and the comparison results are listed in Table 3.

TABLE 3
Prediction Performance With Strict Evaluation Measures

Under the more stringent standards, DMCTOP is superior to the other methods at both the transmembrane segment and traditional criterion levels. This benefits from what we mentioned earlier: our DMCNN model has the same excellent identification ability for non-transmembrane regions on top of its accurate prediction of transmembrane regions. Therefore, we conclude that our proposed method, DMCTOP, has achieved significant results at all three levels, thus demonstrating the effectiveness of the method.
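The strict segment criterion just described can be sketched as a small checker; the (start, end) residue indices and the example segments below are illustrative, not drawn from the paper's data:

```python
def strict_match(pred, known, max_end_err=5):
    """Strict aTMH criterion of [43]: each terminal of the predicted
    helix must lie within max_end_err residues of the known terminal
    (error < 5 AAs), and the overlap must cover at least half of the
    longer segment. Segments are (start, end) residue indices, inclusive."""
    (ps, pe), (ks, ke) = pred, known
    ends_ok = abs(ps - ks) < max_end_err and abs(pe - ke) < max_end_err
    overlap = max(0, min(pe, ke) - max(ps, ks) + 1)
    longer = max(pe - ps, ke - ks) + 1
    return ends_ok and overlap * 2 >= longer

print(strict_match((12, 33), (10, 30)))  # True: terminals off by 2 and 3, ample overlap
print(strict_match((12, 40), (10, 30)))  # False: the C-terminal end is off by 10 residues
```

Counting the predictions that pass this check gives N'_correct, from which M', C', and Q' follow in the same way as in Eq. (3).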
Algorithm    N'correct  M'(%)  C'(%)  Q'(%)  Topology
HMMTOP2.0    342        74.67  72.30  73.48  37
MEMSAT3.0    360        78.60  77.09  77.84  38
OCTOPUS      401        87.55  87.17  87.36  68
Philius      374        81.66  80.78  81.22  45
Phobius      346        75.55  75.05  75.30  39
PRO          356        77.73  79.11  78.42  51
PRODIV       368        80.35  77.80  79.06  51
SCAMPI-seq   369        80.57  80.57  80.57  47
SCAMPI-msa   388        84.72  83.98  84.33  58
SPOCTOPUS    399        87.12  87.12  87.12  68
TMHMM2.0     350        76.42  77.95  77.18  38
TOPCONS      387        84.50  86.77  85.63  62
TopPred2.0   351        76.64  76.64  76.64  41
DMCTOP*      413        90.17  90.57  90.37  73

* Experiment results we calculated.
Bold fonts represent the best experimental results.

3.4 Prediction Performance Analysis of Physicochemical Properties

As mentioned in the introduction, the various physical and chemical properties of aTMPs played an important role in topology prediction at the beginning of this field [7], [8]. Each specific property of an amino acid is largely based on the propensity of its side chain [44]. Therefore, we selected three important physicochemical properties of amino acids from [45], shown in Table 4: hydrophobicity, polarity, and charge.

We structure the physicochemical matrix with dimension 700 x 5, where the first four columns represent the different properties of each amino acid and the last column represents the 'None' label. In addition, we normalize the feature matrix values to the range (-1, 1). The encoded matrix is fed to the DMCNN under 10-fold cross-validation, and the experimental results on the testing set are listed in Table 5.

TABLE 5
Prediction Performance of Physicochemical Features

Result           Class  F1(%)  S(%)   MCC(%)  ACC(%)
Physicochemical  I      70.23  88.95  57.69
Physicochemical  M      82.47  84.71  73.11   75.1
Physicochemical  O      68.27  85.98  54.06

TABLE 6
Comparison of Results Before and After Fine-Tuning

Result       Class  F1(%)  S(%)   MCC(%)  ACC(%)
Raw Result   I      86.93  93.24  80.28
Raw Result   M      87.68  92.64  82.31   86.73
Raw Result   O      84.74  94.20  77.78
Fine-tuning  I      86.74  93.53  80.03
Fine-tuning  M      88.49  92.78  82.33   86.65
Fine-tuning  O      84.46  95.66  77.44
The measurements include the F1-measure, specificity, MCC, and accuracy, which reflect the prediction performance. As shown in Table 5, the classification ability of the physicochemical feature attributes for the three classes is lower than that of the combination of sequence features and evolutionary features. Moreover, we find a common problem: the methods that use physical and chemical properties to predict topological structure are generally lacking in this ability [7], [9], [10]. These properties are extracted artificially based on prior knowledge, so there is inevitably noise in the data. The unnecessary features decrease training speed and model interpretability and, most importantly, reduce generalization performance on the testing set.

3.5 Comparison of Results Before and After Fine-Tuning

There are two orderings of transmembrane regions that are biologically meaningful: 1) Extracellular - Transmembrane - Intracellular; 2) Intracellular - Transmembrane - Extracellular. In addition, at the amino acid level, the same type of label should appear continuously within each region. However, the prediction results for a few sequences do not conform to these rules. Fig. 6a shows examples of incorrect prediction of single amino acids, i.e., specific residues are classified into other classes. Fig. 6b shows the situation in which the length of some transmembrane regions is shorter than 3 AAs, which is hard to identify, leading to the wrong assignment of non-transmembrane segments. Therefore, before outputting the final prediction result, we used a fine-tuning process to check whether any cases fail to satisfy these biological rules, and corrected the problems accordingly.

The changes in prediction results before and after fine-tuning are listed in Table 6. The predicted results before and after fine-tuning are compared with four important indicators, including the F1-measure, specificity, MCC, and accuracy, which reflect the experimental performance. The comparison shows that the prediction results of the three topological types change only within a small range before and after fine-tuning, which does not affect the good prediction performance of DMCTOP. Meanwhile, it also shows that the initial prediction results of the deep neural network model constructed in this paper already achieve good performance.

3.6 The Impact of Training Sample Scale on Prediction Results

Although the number of aTMPs with known topological structures has increased in recent years, the data scale of the training set is still relatively small, especially after the nonredundancy treatment. This may result in inadequate learning of local and global contextual information for aTMPs when training the model. To illustrate this point clearly, we reduced the amount of training data in 10 percent intervals while ensuring that the redundancy between the training set and the test set is 30, 20, 10, and 0 percent, respectively. The comparison results are shown in Fig. 7.

As can be seen from Fig. 7, as the training data size decreased, the prediction results at the whole sequence level tend to decline from 86.65 to 83.63 percent. At the traditional criterion level, the downward trend is moderate compared to the whole sequence level, with fluctuations ranging from 91.65 to 89.38 percent.
Fig. 6. The fine-tuning is applied to manage results that do not conform to the transmembrane rules. (a) Incorrect predictions of single residues whose specific types are predicted as other classes. (b) An infeasible structure caused by transmembrane deletion: regions shorter than 3 AAs are hard to identify, leading to wrong predictions of non-transmembrane regions.
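A minimal sketch of such a rule-based correction (our reading of the fine-tuning step, not the authors' exact procedure): label runs shorter than 3 residues, like the isolated mislabels and sub-3-AA fragments in Fig. 6, are absorbed into the preceding region so that each topological region stays contiguous:

```python
from itertools import groupby

def fine_tune(labels, min_len=3):
    """Absorb label runs shorter than min_len into the preceding
    region, so each topological region is contiguous and no spurious
    short segment survives (a hypothetical reading of the rule check)."""
    runs = [(k, len(list(g))) for k, g in groupby(labels)]
    out = []
    for k, n in runs:
        if n < min_len and out:
            out[-1] = (out[-1][0], out[-1][1] + n)   # absorb the short run
        else:
            out.append((k, n))
    merged = []                                      # re-merge equal neighbours
    for k, n in out:
        if merged and merged[-1][0] == k:
            merged[-1] = (k, merged[-1][1] + n)
        else:
            merged.append((k, n))
    return "".join(k * n for k, n in merged)

# an isolated 'M' inside an intracellular stretch is corrected
print(fine_tune("IIIIMIIIMMMMMMMOOOOO"))  # IIIIIIIIMMMMMMMOOOOO
```

A full implementation would additionally enforce the two biologically meaningful orderings (I-M-O and O-M-I) across region boundaries rather than only removing short runs.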
ACKNOWLEDGMENTS
This work was supported by the National Natural Science
Funds of China (No. 81671328, 61802057), the Jilin Sci-
entific and Technological Development Program (No.
Fig. 7. Comparison results of different scale training sets. 20180414006GH, 20180520028JH, 20170520058JH), The
Science and Technology Research Project of the Educa-
compared to the whole sequence, with fluctuations ranging tion Department of Jilin Province (No. JJKH20190290KJ,
from 91.65 to 89.38 percent. However, at transmembrane seg- JJKH20191309KJ), and Fundamental Research Funds for the
ments level, the results fluctuated in the range of 98.46 to Central Universities (No. 2412019FZ052, 2412019FZ048).
99.12 percent, almost unaffected by the scale of training. In
other words, the reduction in the size of training data has a REFERENCES
significant impact at whole sequence level, and has a slight
[1] M. S. Almen, K. J. Nordstr€ om, R. Fredriksson, and H. B. Schi€ oth,
influence on the prediction effect of the traditional criterion. “Mapping the human membrane proteome: A majority of the
From these two aspects, we can see that the main reason for human membrane proteins can be classified according to function
the decline is the misprediction of the topology types ‘O’ and and evolutionary origin,” BMC Biol., vol. 7, no. 1, 2009, Art. no. 50.
[2] R. Shamima and S. Suresh, “Prediction of membrane protein
‘I’, which indirectly leads to prediction errors in the trans- structures using a projection based meta-cognitive radial basis
membrane direction, and does not cause impact on the label function network,” in Proc. Int. Joint Conf. Neural Netw., 2016,
‘M’. Because the biological characteristics of aTMH regions pp. 1229–1235.
are more remarkable than those of non-transmembrane [3] J. P. Overington, A. L. Bissan, and A. L. Hopkins, “How many
drug targets are there?” Nat. Rev. Drug Discov., vol. 5, no. 12,
regions. On the other hand, the decrease of performance in pp. 993–6, 2006.
training size and nonredundant sequence identity threshold [4] H. M. Berman et al., “The protein data bank,” Nucleic Acids Res.,
is not very large, indicating that our model has a very good vol. 28, no. 1, pp. 235–242, 2000.
generalization. [5] J. G. Almeida, A. J. Preto, P. I. Koukos, B. Amjj, and I. S. Moreira,
“Membrane proteins structures: A review on computational
Although our deep learning architecture has significantly modeling tools,” Biochimica Et Biophysica Acta, vol. 1859, no. 10,
enhanced the performance of aTMP topological prediction, 2017, Art. no. 2021.
there is still room for improvement. In the future, we would [6] P. Du, S. Gu, and Y. Jiao, “Pseaac-general: Fast building various
modes of general form of chou’s pseudo-amino acid composition
like to construct a more efficient network structure to fur- for large-scale protein datasets,” Int. J. Mol. Sci., vol. 15, no. 3,
ther improve the predictive and generalization capabilities pp. 3495–3506, 2014.
of the model under limited data conditions. We also need to [7] G. von Heijne, “Membrane protein structure prediction. hydropho-
add more visualizations [46], [47] to reflect the interpretabil- bicity analysis and the positive-inside rule,” J. Mol. Biol., vol. 225,
no. 2, pp. 487–94, 1992.
ity of the deep learning algorithm in the application process, [8] D. T. Jones, W. R. Taylor, and J. M. Thornton, “A model recognition
rather than the so-called black-box operation. approach to the prediction of all-helical membrane protein structure
and topology,” Biochemistry, vol. 33, no. 10, pp. 3038–3049, 1994.
[9] A. Krogh, B. H. G. Larsson, and S. Ell, “Predicting transmembrane
4 CONCLUSION protein topology with a hidden markov model:application to
complete genomes,” J. Mol. Biol., vol. 305, no. 3, pp. 567–580, 2001.
In this paper, we propose a novel method, DMCTOP, using [10] G. Tusnady and I. Simon, “The hmmtop transmembrane topology
a deep multi-scale convolutional neural network (DMCNN) prediction server,” Bioinformatics, vol. 17, no. 9, pp. 849–850, 2001.
[11] L. K€all, A. Krogh, and E. L. L. Sonnhammer, “A combined trans-
for aTMP topology prediction. The distribution of different membrane topology and signal peptide prediction method,” J.
regions of the transmembrane protein topology is due to the Mol. Biol., vol. 338, no. 5, pp. 1027–1036, 2004.
interaction of various residues through a group of amino [12] H. Viklund and A. Elofsson, “Best alpha-helical transmembrane
acids or domains. Our network architecture has powerful protein topology predictions are achieved using hidden markov
models and evolutionary information,” Protein Sci., vol. 13, no. 7,
generalization ability to learn the hidden rules within the pp. 1908–1917, 2004.
sequence information effectively and discover abstract local [13] C. Peters, K. D. Tsirigos, N. Shu, and A. Elofsson, “Improved
or global contextual features at different levels automati- topology prediction using the terminal hydrophobic helices rule,”
Bioinformatics, vol. 32, no. 8, 2016, Art. no. 1158.
cally. By integrating local and global contextual features, we [14] D. Jones, “Improving the accuracy of transmembrane protein topol-
improved the state-of-art in protein topology prediction. ogy prediction using evolutionary information,” Bioinformatics,
The comparison results at three levels demonstrate the vol. 23, no. 5, pp. 538–544, 2007.
YANG ET AL.: IMPROVED TOPOLOGY PREDICTION OF ALPHA-HELICAL TRANSMEMBRANE PROTEIN BASED ON DEEP MULTI-SCALE... 303
[15] S. M. Reynolds, L. K€ all, M. E. Riffle, J. A. Bilmes, and W. S. Noble, [38] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
“Transmembrane topology and signal peptide prediction using boltzmann machines,” in Proc. Int. Conf. Int. Conf. Mach. Learn., 2010,
dynamic bayesian networks,” PLOS Comput. Biol., vol. 4, no. 11, pp. 807–814.
2008, Art. no. e1000213. [39] E. Asgari and M. R. K. Mofrad, “Protvec: A continuous distributed
[16] T. Nugent and D. T. Jones, “Transmembrane protein topology representation of biological sequences,” Comput. Sci., vol. 10, no. 11,
prediction using support vector machines,” BMC Bioinf., vol. 10, 2015, Art. no. e0141287.
no. 1, pp. 159–159, 2009. [40] J. Yasen and P. Du, “Performance measures in evaluating machine
[17] V. HaKan and E. Arne, “Octopus: Improving topology prediction learning based bioinformatics predictors for classifications,”
by two-track ann-based preference scores and an extended topo- Quantitative Biol., vol. 4, pp. 320–330, 2016.
logical grammar,” Bioinformatics, vol. 24, no. 15, pp. 1662–1668, [41] S. Jayasinghe, K. Hristova, and S. H. White, “Energetics, stability,
2008. and prediction of transmembrane helices 1,” J. Mol. Biol., vol. 312,
Yuning Yang is currently working toward the PhD degree in the School of Information Science and Technology, Northeast Normal University. His research interests include computational biology and bioinformatics.

Jiawen Yu is currently working toward the master's degree in the School of Information Science and Technology, Northeast Normal University. Her research interests include computational biology and bioinformatics.

Zhe Liu is currently working toward the graduate degree in the School of Information Science and Technology, Northeast Normal University. She joined the Institution of Computational Biology, Northeast Normal University, China, in 2019, where she has participated in research on transmembrane protein structure prediction using deep learning methods.
304 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 19, NO. 1, JANUARY/FEBRUARY 2022
Xi Wang is currently working toward the master's degree in the School of Information Science and Technology, Northeast Normal University. Her research interests include computational biology and bioinformatics.

Zhiqiang Ma received the PhD degree from the School of Computer Science, Jilin University, in 2009. He is currently a professor with the School of Information Science and Technology, Northeast Normal University. He is the vice president of the Research Association of Computer Education in the Normal Universities of China and the executive director of the Jilin Computer Federation. His research interests include bioinformatics, software engineering, molecular biology, and data mining.