An Improved Topology Prediction of Alpha-Helical Transmembrane Protein Based on Deep Multi-Scale Convolutional Neural Network

Yuning Yang, Jiawen Yu, Zhe Liu, Xi Wang, Han Wang, Zhiqiang Ma, and Dong Xu

Yuning Yang, Jiawen Yu, Zhe Liu, Xi Wang, Han Wang, and Zhiqiang Ma are with the Department of Information Science and Technology, Northeast Normal University, Changchun, Jilin 130117, China. E-mail: {yangyn533, yujw077, liu0940, wangq845, wangh101, mazq}@nenu.edu.cn. Dong Xu is with the Department of Electrical Engineering and Computer Science and Bond Life Sciences Center, University of Missouri, Columbia, MO 65211 USA. E-mail: xudong@missouri.edu.
Manuscript received 2 Apr. 2020; revised 10 June 2020; accepted 23 June 2020. Date of publication 29 June 2020; date of current version 3 Feb. 2022. (Corresponding author: Han Wang.) Digital Object Identifier no. 10.1109/TCBB.2020.3005813. This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Abstract—Alpha-helical transmembrane proteins (aTMPs) are essential in various biological processes. Although their tertiary structures are crucial for revealing complex functions, experimental structure determination remains challenging and costly. Over the past decades, various sequence-based topology prediction methods have been developed to bridge the gap between sequences and structures by characterizing structural features, but significant improvements are still required. Deep learning brings a great opportunity because of its powerful capability to learn representations from limited original data. In this work, we improved our aTMP topology prediction method, DMCTOP, using deep learning; it is composed of two deep convolutional blocks that simultaneously extract local and global contextual features. Consequently, the inputs were simplified to reflect the original features of the sequence, namely a protein sequence feature and an evolutionary conservation feature. DMCTOP can efficiently and accurately identify all topological types and the N-terminal orientation of an aTMP sequence. To validate the effectiveness of our method, we benchmarked DMCTOP against 13 peer methods at the whole-sequence, transmembrane-segment and traditional-criterion levels in testing experiments. All the results reveal that our method achieves the highest prediction accuracy and outperforms the previous methods. The method is available at https://icdtools.nenu.edu.cn/dmctop.

Index Terms—Deep learning, deep multi-scale convolutional neural network, topology prediction, transmembrane proteins

1 INTRODUCTION

ALPHA-HELICAL transmembrane proteins (aTMP) are the major class of transmembrane proteins and have great biological and medical importance. About 27 percent of all human proteins are estimated to be aTMP [1], and they are mostly found in the plasma membrane. They cross the phospholipid bilayer of the cytomembrane with either a single pass or multiple passes, carrying out a variety of important functions for cells, such as cell-to-cell signaling, ion conductivity, cell cohesion and the regulation of network signal transmission [2]. Thus, aTMP are also important drug targets, representing about 60 percent of the known drug targets on the current market [3]. Despite their immense importance, as of January 2020 the solved three-dimensional structures of aTMP account for only about 1.8 percent of all structures in the Protein Data Bank (PDB) [4]. Since aTMP are difficult to solubilize, purify and crystallize, the methods commonly used to determine globular protein structures, such as Nuclear Magnetic Resonance (NMR) and X-ray crystallography, are difficult to apply [5]. Therefore, computational methods for predicting their structures or structural properties are in high demand, and sequence-based topology prediction for aTMP is an important component of this effort.

With the rapid development of high-throughput sequencing technologies [6], a series of efforts have been made over the decades to predict the topology of aTMP. At first, algorithms were based solely on hydrophobicity scales, such as TopPred [7], or on statistical analyses [8], as residues at different positions (i.e., embedded in the membrane or exposed to solvent) exhibit different hydrophobicities. Later, transmembrane topology prediction methods applied multiple data-driven statistical models, including hidden Markov models (HMMs) [9], [10], [11], [12], [13], artificial neural networks (ANNs) [14], dynamic Bayesian networks (DBNs) [15], support vector machines (SVMs) [16] and hybrid methods such as OCTOPUS [17] and SPOCTOPUS [18]. Meanwhile, the ensemble framework of TOPCONS [19] first combined the results of various predictors and produced a consensus prediction using a Viterbi-like algorithm and dynamic programming. Although HMMs are widely applied in topology prediction [12], their performance is limited by their inability to extract long-range or global correlations [20]. On the one hand, since the number of known aTMP structures in the early membrane protein databases was relatively small, the previous methods generalize poorly to new protein sequences [7], [9], [14]. On the other hand, the previous evaluation criteria for topology prediction only focused on the transmembrane region, ignoring the comparison of prediction results in the intracellular and extracellular regions [13], [20].

More importantly, the above methods were not evaluated in a strict and uniform way, which might have led to overestimation of prediction accuracy.

Recent advances in many fields of bioinformatics have benefited from deep-learning algorithms [21], [22], [23], [24], [25]. Compared to traditional machine learning methods, deep learning has a powerful representation learning ability through parameterized, nonlinear, differentiable layer-by-layer architectures trained with forward and backward propagation, which can automatically learn more useful and more abstract features from large-scale complex data [26]. With the increasing number of aTMP structures, deep-learning methods promise to significantly improve the topology prediction of transmembrane proteins, which will not only provide theoretical guidance for biological experiments to speed up the research process but also lay a foundation for the characterization and prediction of tertiary structures.

In this paper, we propose a novel method, DMCTOP, using the Deep Multi-Scale Convolutional Neural Network (DMCNN) for aTMP topology prediction. For a given aTMP sequence, as shown in Fig. 1, our goal is to determine the class of each residue among the three classes 'I' (Intracellular), 'M' (Transmembrane) and 'O' (Extracellular). Convolution modules with various kernel sizes are adopted to extract multi-scale local contextual features from a transmembrane protein sequence. Meanwhile, the modules are stacked to form deep convolutional blocks, which also integrate the high-dimensional nonlinear feature information between different network modules and build an advanced global contextual feature. We built our model architecture and tuned its parameters with 10-fold cross-validation on the dataset, and then compared the previous methods with uniform evaluation criteria on the testing set. The DMCTOP method achieves 86.65 percent prediction accuracy at the whole-sequence level, 98.90 percent at the transmembrane-segment level and 91.7 percent at the traditional-criterion level, which surpasses all the comparison methods. Finally, we employed stricter criteria at the transmembrane-segment and traditional-criterion levels to further demonstrate the effectiveness of DMCTOP, which is also superior to the other methods. The experimental results show that DMCTOP can correctly identify membrane regions and correctly predict the N-terminal direction of aTMP simultaneously, regardless of the types and number of alpha transmembrane helices (aTMHs).

Fig. 1. Topology prediction of alpha-helical transmembrane protein.

2 MATERIAL AND METHODS

2.1 Datasets
The Topology Data Bank of Transmembrane Proteins (TOPDB) [27] database was selected to construct our training set; it is the most complete and comprehensive collection of transmembrane protein datasets, currently containing experimentally validated topologies for 4,067 sequences, including 1,712 bitopic entries and 2,355 polytopic entries. The testing set is a high-resolution membrane protein dataset selected from the Membrane Protein Topology (MPtopo) database [28], from which 116 aTMP sequences of known 3D structure were obtained. Signal peptide sequences, if any, were removed from the entries in the dataset to avoid the possibility that they are predicted as the first aTMH [29]. Since homologous sequences would affect the prediction performance [30], [31], [32], CD-HIT [33] was used to filter the redundancy of the datasets with a sequence identity threshold of 40 percent [17], [34]. CD-HIT-2D was also utilized to restrict the training set to at most 40 percent similarity to the testing set. After the screening, the training and testing sets contain 2,272 and 113 non-redundant aTMP sequences, respectively. Furthermore, in order to preserve the original information distributed over the chains as completely as possible, we used the entire sequence of a protein as the input to the model, although some residues may not have 3D coordinates. As the majority of protein chains are shorter than 700 amino acids (AAs), to facilitate subsequent processing and implementation we chose a cutoff length of 700 for both the training and testing sets: chains shorter than 700 were padded with zeros and those longer than 700 were truncated into smaller segments.

2.2 Sequence Features and Encoding
To maximize the underlying information of the protein sequence, all the aTMP sequences were encoded into two types of quantized biological features.

• Protein Sequence: This feature is represented by a one-hot coding vector, a sparse coding that represents the specific amino-acid type at each position of a protein sequence. The original chains are encoded as m × n binary matrices, where m is the protein sequence length and n is the number of amino-acid types. There are 20 types of natural amino acids. In this work, because we fix the length of the input protein sequence to 700 AAs, the value of m equals 700. If a sequence is shorter than 700 AAs, the remaining positions are padded with 'None', which means there is no amino acid there; we label this as an artificial amino-acid type, so the corresponding value of n is 21. In summary, an input protein sequence is expressed as a 700 × 21 matrix in the system; each amino acid in the sequence has a value of 1 in the column corresponding to its own type, and the remaining elements are zero.
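To make the encoding concrete, the following is a minimal sketch of the one-hot scheme described above, not the authors' own code: residues map into a 700 × 21 binary matrix, with the 21st column reserved for the artificial 'None' type. Mapping unknown residue letters to the 'None' column is our assumption.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # 20 natural residue types
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN, NONE_IDX = 700, 20                      # padding uses the 21st column

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a protein sequence as a 700 x 21 one-hot matrix."""
    matrix = np.zeros((MAX_LEN, len(AMINO_ACIDS) + 1), dtype=np.float32)
    for pos, residue in enumerate(sequence[:MAX_LEN]):
        # unknown letters fall back to the 'None' column (an assumption)
        matrix[pos, AA_INDEX.get(residue, NONE_IDX)] = 1.0
    matrix[len(sequence):, NONE_IDX] = 1.0        # mark the padded positions
    return matrix

# Example: a short fragment is padded up to 700 positions.
x = one_hot_encode("MKTLLLTLVVVTIVCLDLGYT")
print(x.shape)  # (700, 21)
```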

Fig. 2. Our method, DMCTOP, uses a deep multi-scale convolutional neural network (DMCNN) for aTMP topology prediction. The input consists of protein sequence and evolutionary conservation features. After feature integration, the preprocessed feature vectors are passed to deep convolutional blocks composed of several modules formed by multi-scale CNN layers, which extract both local and global contextual features. On top of the second deep convolutional block, two fully-connected layers with softmax perform the multi-label classification. The fine-tuning operation ensures that the prediction results are biologically meaningful.

• Evolutionary Conservation: The evolutionary conservation information, in the form of a PSSM (Position-Specific Scoring Matrix) profile, is a matrix used for protein sequence pattern representation [35], [36]. The PSSM profiles are calculated by running PSI-BLAST [37] against UniRef90 with an e-value threshold of 0.001 and 3 iterations. The PSSM matrix obtained in this paper is a 700 × 21 matrix, where 700 again represents the length of an input protein sequence and 21 represents the 20 types of natural amino acids plus the 'None' type. The original PSSM values are then converted by a sigmoid function so that the values lie in the range (0, 1).
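A hedged sketch of this evolutionary-conservation feature is given below: log-odds scores from a PSI-BLAST PSSM are squashed into (0, 1) with a sigmoid and padded or truncated to 700 × 21. Parsing of the PSI-BLAST ASCII output and the use of a dedicated 'None' column for padding are our assumptions, not details taken verbatim from the paper.

```python
import numpy as np

def scale_pssm(raw_pssm: np.ndarray, max_len: int = 700) -> np.ndarray:
    """raw_pssm: (L, 20) log-odds scores parsed from a PSI-BLAST .pssm file."""
    squashed = 1.0 / (1.0 + np.exp(-raw_pssm.astype(np.float32)))   # sigmoid -> (0, 1)
    out = np.zeros((max_len, raw_pssm.shape[1] + 1), dtype=np.float32)
    length = min(len(squashed), max_len)
    out[:length, :raw_pssm.shape[1]] = squashed[:length]
    out[length:, -1] = 1.0    # mark padded positions in the extra 'None' column
    return out
```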
2.3 Deep Network Architecture
The Deep Multi-Scale Convolutional Neural Network (DMCNN), shown in Fig. 2, consists of three parts: an input feature-integration layer for feature preprocessing, two deep convolutional blocks composed of multiple modules formed by multi-scale CNN layers, and two fully-connected layers, followed by fine-tuning. The input to the DMCTOP method consists of protein sequence features and evolutionary conservation features. After feature integration, the preprocessed feature vectors are passed to the deep convolutional blocks composed of several modules formed by multi-scale CNN layers, which extract both local and global contextual features. Multiple hierarchical convolutional operations with different kernel sizes cover a wide range of the protein sequence at various granularities. On top of the second deep convolutional block, there are two fully-connected layers with softmax for multi-label classification. Finally, the fine-tuning operation ensures that the prediction results are biologically meaningful.

• Multi-Scale CNN Layers: Given the amino-acid sequence with concatenated features as the input to the multi-scale CNN layers, the feature vectors are

\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{700}],   (1)

where \tilde{x}_i \in \mathbb{R}^{42} is the preprocessed 42-dimensional feature vector combining the two features of the i-th amino acid, and the sequence length is 700. To simulate the local dependence of adjacent amino acids, we used CNNs to extract local contextual characteristics, with the Rectified Linear Unit (ReLU) [38] as the activation function. Since one turn of an alpha helix consists of 3.5 amino acids on average, we use convolution kernels of sizes 3, 7 and 11 to enrich the feature information. The longer kernel sizes are chosen because amino acids are sometimes affected by other residues at a relatively long distance, and the different kernel sizes are also considered to be biologically relevant [39]. After that, we arranged the convolutional layers of different sizes into a uniform network module, as shown in Fig. 3.

Fig. 3. The feature vectors are fed into multi-scale CNN layers with different kernel sizes to extract multiple local contextual features.

Before the multi-scale convolution, the input is processed with a one-dimensional convolution kernel of size 1. The purpose is to keep the scale of the original feature map unchanged while significantly increasing the nonlinearity applied to the input features and the depth of the trained network. In addition, this operation densifies the feature matrix, alleviating the problem of unevenly distributed data. The convolution is given by

y_i = F \ast \tilde{x}_{i:i+f-1} = \mathrm{ReLU}(w \cdot \tilde{x}_{i:i+f-1} + b),   (2)

where F \in \mathbb{R}^{f \times 42} is a convolutional kernel, f is the extent of the kernel along the protein sequence, 42 is the feature dimensionality at individual amino acids, and b is the bias term.
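The following Keras sketch illustrates one multi-scale module as we read it from Fig. 3 and Eqs. (1)-(2): a size-1 convolution densifies the 42-dimensional input, then parallel 1D convolutions with kernel sizes 3, 7 and 11 (64 channels each, ReLU, 'same' padding so the 700-step axis is preserved) are concatenated. The layer names and the channel count of the size-1 convolution are our assumptions; this is not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_scale_module(x, channels=64, kernel_sizes=(3, 7, 11)):
    # size-1 convolution: keeps the sequence length, densifies the features
    x = layers.Conv1D(channels, kernel_size=1, activation="relu", padding="same")(x)
    # parallel multi-scale branches, Eq. (2) applied with f = 3, 7, 11
    branches = [
        layers.Conv1D(channels, kernel_size=k, activation="relu", padding="same")(x)
        for k in kernel_sizes
    ]
    return layers.Concatenate(axis=-1)(branches)   # local relevancy Y = [Y1; Y2; Y3]

inputs = tf.keras.Input(shape=(700, 42))            # concatenated features, Eq. (1)
outputs = multi_scale_module(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 700, 192)
```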

The kernel goes through the complete input sequence like a sliding window and generates a corresponding output feature map \tilde{Y} = [y_1, y_2, \ldots, y_{700}], where each y_i has 64 channels.

In this paper, we use kernels of different sizes at the same time (f = 3, 7, 11) to extract multiple local contextual feature maps \tilde{Y}_1, \tilde{Y}_2, \tilde{Y}_3. These multi-scale features are concatenated together as the local relevancy Y = [\tilde{Y}_1; \tilde{Y}_2; \tilde{Y}_3].

• Deep Convolutional Block: The function of the module consisting of multi-scale CNN layers is to explore more abstract and specific local correlation features at different extents. We then built the modules into a deep convolutional block, which increases the depth and complexity of the neural network training. By stacking convolution operations together, the network gains a stronger ability to extract non-local residue interactions, that is, to extract global contextual features among amino acids.

• Implementation Details: To develop a high-quality model, we used 10-fold cross-validation and report the average of the independent test-set results as our final prediction performance. All the architectures described in this paper were implemented with the open-source software TensorFlow, using the Keras library. We included batch normalization and dropout to improve the generalization ability of the model. An early-stopping rule and a learning-rate scheduler were adopted to control overfitting and learning efficiency. The entire deep network is trained on an NVIDIA GeForce GTX 1080Ti with 11 GB of memory.
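As a sketch of the training controls mentioned above, the snippet below wires up early stopping and a learning-rate scheduler with standard Keras callbacks (batch normalization and dropout would sit inside the model definition itself). The concrete patience values, rates, optimizer and batch size are our assumptions, not values reported in the paper.

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=3, min_lr=1e-5),
]
# Assumed usage with a compiled per-residue classifier (hypothetical names):
# model.compile(optimizer="adam", loss="categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, batch_size=16, callbacks=callbacks)
```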
3 RESULTS AND DISCUSSION

When comparing DMCTOP with other prediction tools, we used the uniform high-resolution membrane protein test set of 113 protein sequences to ensure the reliability of the comparison results. The prediction performance of DMCTOP and the other tools was verified from three perspectives, namely prediction performance analysis at the whole-sequence level, the transmembrane-segment level and the traditional-criterion level.

3.1 Prediction Performance Analysis at Whole Sequence Level
We evaluate all types of topological regions at the whole-sequence level, namely 'I' (Intracellular), 'M' (Transmembrane) and 'O' (Extracellular). Six measurements, including accuracy, recall, precision, specificity, Matthews correlation coefficient (MCC) and F1-measure, were used to assess the prediction performance of DMCTOP and the other methods [40].

The performance of the DMCTOP method at the whole-sequence level, along with the 13 peer methods, is listed in Table 1. For the three-category problem, if one of the categories is defined as the positive sample, the other two categories can be considered negative samples. First, in terms of prediction accuracy, our method reaches 86.65 percent, while the highest accuracy among the 13 comparison tools is achieved by the SCAMPI-msa method with 84.32 percent. Second, the MCC value reflects the classification performance and prediction reliability of a model. The MCC of DMCTOP in the intracellular, transmembrane and extracellular regions is 80.03, 82.33 and 77.44 percent, respectively, and the value for each class is higher than that of any other method. Third, we used specificity to measure the ability of the different prediction methods to identify negative samples. The results show that our method is superior to the others, reaching 93.53, 92.78 and 95.66 percent for the three types of regions, respectively. In addition, for the two indicators of recall and precision, our method reaches the highest values in the most important transmembrane regions, 90.97 and 86.58 percent, respectively. For the intracellular region, DMCTOP has the highest precision of 87.09 percent, but its recall of 86.24 percent is lower than the 87.82 percent of SCAMPI-msa. For the extracellular region, the recall of our method reaches the highest value of 83.16 percent, while its precision of 86.17 percent is less than the 88.68 percent of SCAMPI-msa.

Since precision and recall often constrain each other, we combine them with the F1-measure to evaluate the classification performance of the model. The F1-scores of the intracellular, transmembrane and extracellular regions are 86.74, 88.49 and 84.46 percent, respectively, which are higher than those of the SCAMPI-msa method (85.84, 84.76 and 82.16 percent) and are also superior to the other methods. Fig. 4 plots the F1-measure to visualize the comparison of the different methods at the whole-sequence level. Meanwhile, the average AUC values (area under the ROC curve) and the mean precision (area under the precision-recall curve) of 'I', 'M' and 'O' during training are shown in Fig. 5.

3.2 Prediction Performance Analysis at Transmembrane Segment Level and Traditional Criterion Level
In order to assess the prediction performance for alpha transmembrane regions (i.e., alpha transmembrane helices) without considering their orientation, we used the unified evaluation criterion developed by Jayasinghe et al. [41] to verify the prediction quality of all methods. They defined a successful prediction as an overlap of at least 9 AAs between a predicted aTMH segment and a known one. Similarly, Moller et al. [42] also adopted a nine-residue segment length in their method. The total numbers of predicted and real known alpha transmembrane regions in the testing set are denoted by N_pred and N_known, respectively. The total number of overlapping predicted and real known alpha transmembrane regions (i.e., the number of correctly predicted aTMHs in the testing set) is denoted by N_correct. The efficiency of the alpha transmembrane region prediction is measured by M = N_correct / N_known and C = N_correct / N_pred. The overall aTMH prediction accuracy Q is calculated as

Q = \sqrt{M \times C} \times 100\%.   (3)

The predicted results of DMCTOP at the transmembrane-segment level, along with those of the other methods, are detailed in Table 2.

According to the overall aTMH prediction power defined in [41], [42], the DMCTOP method achieves the highest Q value of 98.90 percent. The measurement C reflects the percentage of true transmembrane regions among the transmembrane regions predicted by the model.
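The bookkeeping behind Eq. (3) can be sketched as follows: a predicted helix counts as correct when it overlaps a known helix by at least 9 residues, and Q is the geometric mean of M and C. Segments are (start, end) residue indices with inclusive ends; this pairing logic is our reading of the criterion, not the authors' exact implementation.

```python
import math

def overlap(a, b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def segment_scores(known, predicted, min_overlap=9):
    n_correct = sum(
        any(overlap(k, p) >= min_overlap for k in known) for p in predicted
    )
    m = n_correct / len(known)      # M = Ncorrect / Nknown
    c = n_correct / len(predicted)  # C = Ncorrect / Npred
    return m, c, math.sqrt(m * c) * 100.0   # Q, Eq. (3)

m, c, q = segment_scores(known=[(10, 32), (60, 81)],
                         predicted=[(11, 30), (61, 83)])
print(round(q, 2))  # 100.0 when every helix is matched
```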

TABLE 1
Prediction Performance at Whole Sequence Level

Algorithm Class R(%) P(%) S(%) MCC(%) F1(%) Acc(%)


I 63.36 74.49 89.02 54.75 68.48
HMMTOP2.0 M 85.14 76.98 86.90 70.44 80.85 72.89
O 69.93 67.00 83.48 52.84 68.43
I 67.59 81.33 92.14 62.89 73.83
MEMSAT3.0 M 86.02 79.54 88.61 73.32 82.65 77.03
O 77.40 71.03 84.86 60.97 74.08
I 81.81 85.83 93.16 75.90 83.77
OCTOPUS M 90.44 83.62 90.88 79.89 86.90 84.12
O 81.81 85.83 93.16 75.90 83.77
I 75.03 73.38 86.22 60.92 74.20
Philius M 88.75 78.00 87.12 73.79 83.03 75.42
O 61.86 74.33 89.75 54.43 67.52
I 68.97 69.12 84.40 53.40 69.04
Phobius M 88.49 78.92 87.83 74.44 83.93 72.48
O 59.33 67.81 86.49 47.58 63.29
I 83.98 75.71 86.36 68.72 79.63
PRO M 84.20 79.69 88.95 72.23 81.88 78.77
O 67.66 81.78 92.77 63.85 74.05
I 83.59 80.36 89.66 72.56 81.94
PRODIV M 88.27 78.80 87.78 74.18 83.27 80.82
O 70.13 84.27 93.72 67.34 76.55
I 81.22 76.87 87.63 67.95 78.99
SCAMPI-seq M 86.23 80.25 89.07 74.08 83.13 79.08
O 69.36 80.36 91.87 63.85 74.46
I 87.82 83.78 91.39 78.34 85.84
SCAMPI-msa M 88.27 81.51 89.70 76.57 84.76 84.32
O 76.54 88.68 95.32 74.92 82.16
I 82.11 85.47 92.93 75.81 83.76
SPOCTOPUS M 90.12 83.94 91.13 79.94 86.92 84.14
O 79.97 83.00 92.14 72.83 81.46
I 65.44 68.19 84.55 50.51 66.79
TMHMM2.0 M 87.53 78.26 87.49 73.20 82.64 71.12
O 59.83 65.19 94.68 45.56 62.40
I 75.90 82.59 91.90 69.32 79.10
TOPCONS M 87.11 83.24 90.97 77.25 85.13 80.31
O 77.76 75.10 87.64 64.82 76.41
I 65.04 67.97 84.48 50.08 66.47
TopPred2.0 M 84.19 78.34 88.02 71.04 81.16 70.71
O 62.46 64.63 83.61 46.49 63.53
I 86.24 87.09 93.53 80.03 86.74
DMCTOP* M 90.97 86.58 92.78 82.33 88.49 86.65
O 83.16 86.17 95.66 77.44 84.46

* Experiment results we calculated.


Bold fonts represent the best experimental results.

Our method is superior to the above-mentioned methods, with a C value of 99.12 percent. Although DMCTOP and PRODIV differ by only 0.22 percent in the measurement M, Table 2 shows that the parameter N_correct has the largest influence on the M value, and the results of the two prediction methods differ by only one prediction sample. Meanwhile, the PRODIV method predicts 473 aTMH regions, 17 more than the 456 predicted by the DMCTOP method. However, with 458 known aTMH segments, it is clear that our method has better prediction stability and reliability.

Fig. 4. Comparison of the F1-measures of the different methods.

Fig. 5. The AUC value and mean precision of the three classes based on 10 runs in the training process.
In general, the accuracy of traditional topology prediction is measured from the following three perspectives: 1) the number of predicted aTMHs, 2) the locations of those predicted aTMHs, and 3) the orientation of the alpha transmembrane protein sequence (i.e., the N-terminal direction). Using this evaluation criterion to measure the existing tools, if all aTMHs and the N-terminal direction of a transmembrane protein sequence are predicted correctly, then the topology is judged to be correctly predicted. The comparison results of the various methods under this criterion are also shown in Table 2. The experimental results show that, compared with the other prediction methods, DMCTOP also has the best prediction performance at the traditional-criterion level, with a topology prediction accuracy of 91.7 percent.

TABLE 2
Prediction Performance at Transmembrane Segment Level and Traditional Criterion Level

Algorithm     Nknown  Npred  Ncorrect  M(%)   C(%)   Q(%)   Top(%)
HMMTOP2.0     458     473    440       96.07  93.02  94.53  67.3
MEMSAT3.0     458     467    444       96.94  95.07  96.00  66.4
OCTOPUS       458     460    451       98.47  98.04  98.25  89.4
Philius       458     463    439       95.85  94.82  95.33  70.8
Phobius       458     461    443       96.72  96.10  96.41  65.5
PRO           458     450    438       95.63  97.33  96.48  79.6
PRODIV        458     473    453       98.91  95.77  97.33  85.8
SCAMPI-seq    458     458    444       96.94  96.94  96.94  79.6
SCAMPI-msa    458     462    450       98.25  97.40  97.83  88.5
SPOCTOPUS     458     458    443       96.72  96.72  96.72  84.1
TMHMM2.0      458     449    431       94.10  95.99  95.04  58.4
TOPCONS       458     446    436       95.20  97.76  96.47  77.9
TopPred2.0    458     458    431       94.10  94.10  94.01  63.7
DMCTOP*       458     456    452       98.69  99.12  98.90  91.7

* Experiment results we calculated. Bold fonts represent the best experimental results.

3.3 Prediction Performance Analysis With Strict Evaluation Measures
Although our method achieved better results than the previous methods, stricter evaluation criteria have been proposed in recent years to verify the effectiveness of such methods [43]. On the one hand, for the definition of an aTMH, the new standard requires that an aTMH satisfy the following two conditions simultaneously before it is considered correct: 1) the error between the predicted and the real aTMH terminal positions is less than 5 AAs; 2) the overlap between a known aTMH segment and a predicted one covers at least half of the longer of the two. Accordingly, the values of M and C are converted to M' = N_correct-new / N_known and C' = N_correct-new / N_pred, respectively. On the other hand, based on the new evaluation criteria for aTMHs, the topology prediction at the traditional-criterion level is also refined with the following conditions: 1) the numbers of predicted and known aTMHs are equal; 2) the locations of the predicted and known aTMHs correspond; 3) all the non-TMH inside and outside positions are correctly predicted. We reassessed all methods with the new standards, and the comparison results are listed in Table 3.

Under the more stringent standards, DMCTOP is superior to the other methods at both the transmembrane-segment and the traditional-criterion levels. This benefits from the property mentioned earlier: our DMCNN model has the same excellent identification ability for non-transmembrane regions on the basis of accurate prediction of the transmembrane regions. Therefore, we conclude that our proposed method, DMCTOP, achieves significant results at all three levels, thus demonstrating its effectiveness.
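The stricter aTMH matching rule described above can be sketched as a simple predicate: both terminal positions must lie within 5 residues of the known helix, and the overlap must cover at least half of the longer of the two segments. How ties between multiple candidate helices are resolved is our assumption.

```python
def strict_match(known, pred, max_shift=5):
    # known and pred are (start, end) residue indices, end inclusive
    ov = max(0, min(known[1], pred[1]) - max(known[0], pred[0]) + 1)
    longer = max(known[1] - known[0] + 1, pred[1] - pred[0] + 1)
    ends_ok = (abs(known[0] - pred[0]) < max_shift and
               abs(known[1] - pred[1]) < max_shift)
    return ends_ok and ov >= longer / 2

print(strict_match((10, 32), (12, 30)))   # True: ends shifted by 2, overlap 19 of 23
print(strict_match((10, 32), (18, 40)))   # False: N-terminal end is off by 8
```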

TABLE 3
Prediction Performance With Strict Evaluation Measures

Algorithm     Ncorrect'  M'(%)  C'(%)  Q'(%)  Topology
HMMTOP2.0     342        74.67  72.30  73.48  37
MEMSAT3.0     360        78.60  77.09  77.84  38
OCTOPUS       401        87.55  87.17  87.36  68
Philius       374        81.66  80.78  81.22  45
Phobius       346        75.55  75.05  75.30  39
PRO           356        77.73  79.11  78.42  51
PRODIV        368        80.35  77.80  79.06  51
SCAMPI-seq    369        80.57  80.57  80.57  47
SCAMPI-msa    388        84.72  83.98  84.33  58
SPOCTOPUS     399        87.12  87.12  87.12  68
TMHMM2.0      350        76.42  77.95  77.18  38
TOPCONS       387        84.50  86.77  85.63  62
TopPred2.0    351        76.64  76.64  76.64  41
DMCTOP*       413        90.17  90.57  90.37  73

* Experiment results we calculated. Bold fonts represent the best experimental results.

3.4 Prediction Performance Analysis of Physicochemical Properties
As mentioned in the introduction, the various physical and chemical properties of aTMP played an important role in the prediction of topological structure at the beginning of this field [7], [8]. Each specific property of an amino acid is largely determined by the propensity of its side chain [44]. Therefore, we selected three important physicochemical properties of amino acids from [45], shown in Table 4: hydrophobicity, polarity and charge.

TABLE 4
Physicochemical Properties of Amino Acids

AAs  Hydrophobicity (Kyte-Doolittle)  Hydrophobicity (Eisenberg)  Charge    Polarity
Ala  1.8                              0.25                        neutral   nonpolar
Arg  -4.5                             -1.8                        positive  polar
Asn  -3.5                             -0.64                       neutral   polar
Asp  -3.5                             -0.72                       negative  polar
Cys  2.5                              0.04                        neutral   polar
Glu  -3.5                             -0.62                       negative  polar
Gln  -3.5                             -0.69                       neutral   polar
Gly  -0.4                             0.16                        neutral   nonpolar
His  -3.2                             -0.4                        neutral   polar
Ile  4.5                              0.73                        neutral   nonpolar
Leu  3.8                              0.53                        neutral   nonpolar
Lys  -3.9                             -1.1                        positive  polar
Met  1.9                              0.26                        neutral   nonpolar
Phe  2.8                              0.61                        neutral   nonpolar
Pro  -1.6                             -0.07                       neutral   nonpolar
Ser  -0.8                             -0.26                       neutral   polar
Thr  -0.7                             -0.81                       neutral   polar
Trp  -0.9                             0.37                        neutral   polar
Tyr  -1.3                             0.02                        neutral   polar
Val  4.2                              0.54                        neutral   polar

We structured the physicochemical matrix with the dimension 700 × 5, where the first 4 columns represent the different properties of each amino acid and the last column represents the 'None' label. In addition, we normalized the feature matrix values to the range (-1, 1). The encoded matrix was fed into the DMCNN under 10-fold cross-validation, and the experimental results on the testing set are listed in Table 5.
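One possible reading of this 700 × 5 physicochemical encoding is sketched below: four property columns (the two hydrophobicity scales, charge, polarity) plus a 'None' padding column, scaled into roughly (-1, 1). Only a few residues are listed for brevity, and the numeric mapping of charge and polarity to -1/0/+1 as well as the exact scaling are our assumptions.

```python
import numpy as np

# residue: (Kyte-Doolittle, Eisenberg, charge, polarity) -- subset of Table 4
PROPS = {"A": (1.8, 0.25, 0.0, -1.0), "R": (-4.5, -1.8, 1.0, 1.0),
         "D": (-3.5, -0.72, -1.0, 1.0), "L": (3.8, 0.53, 0.0, -1.0)}

def encode_physchem(seq, max_len=700):
    mat = np.zeros((max_len, 5), dtype=np.float32)
    for i, aa in enumerate(seq[:max_len]):
        kd, eis, charge, polar = PROPS.get(aa, (0.0, 0.0, 0.0, 0.0))
        mat[i] = (kd / 4.5, eis / 1.8, charge, polar, 0.0)   # scale hydrophobicity
    mat[len(seq):, 4] = 1.0                                  # 'None' padding column
    return mat
```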
TABLE 5
Prediction Performance of Physicochemical Features

Result           Class  F1(%)  S(%)   MCC(%)  Acc(%)
                 I      70.23  88.95  57.69
Physicochemical  M      82.47  84.71  73.11   75.1
                 O      68.27  85.98  54.06

The measurements used to reflect the prediction performance include F1-measure, specificity, MCC and accuracy. As shown in Table 5, the classification ability of the physicochemical feature attributes for the three classes is lower than that of the combination of sequence features and evolutionary features. Moreover, we find a common problem: the ability of the methods that use physical and chemical properties to predict topological structure is generally lacking [7], [9], [10]. These properties are extracted manually based on prior knowledge, so there is inevitably noise in the data. The unnecessary features decrease training speed and model interpretability and, most importantly, reduce generalization performance on the testing set.

3.5 Comparison of Results Before and After Fine-Tuning
There are two orderings of transmembrane regions that are biologically meaningful: 1) Extracellular - Transmembrane - Intracellular; 2) Intracellular - Transmembrane - Extracellular. In addition, at the amino-acid level, the same type of label should appear continuously within each region. However, the prediction results of a few sequences do not conform to these rules. Fig. 6a shows examples of incorrect predictions of single amino acids, i.e., specific residues classified into other classes. Fig. 6b shows the situation in which the length of some transmembrane regions is shorter than 3 AAs, which is hard to identify and thus leads to the wrong assignment of non-transmembrane segments. Therefore, before outputting the final prediction result, we use a fine-tuning process to check whether any cases do not satisfy these biological rules and correct the problems accordingly.

Fig. 6. The fine-tuning is applied to manage the results that do not conform to the transmembrane rules. (a) Incorrect predictions of single residues whose specific types are predicted as other classes. (b) An infeasible structure caused by transmembrane deletion: regions shorter than 3 AAs are hard to identify, leading to wrong predictions of non-transmembrane regions.

The changes in the prediction results before and after fine-tuning are listed in Table 6.

TABLE 6
Comparison of Results Before and After Fine-Tuning

Result       Class  F1(%)  S(%)   MCC(%)  Acc(%)
             I      86.93  93.24  80.28
Raw Result   M      87.68  92.64  82.31   86.73
             O      84.74  94.20  77.78
             I      86.74  93.53  80.03
Fine-tuning  M      88.49  92.78  82.33   86.65
             O      84.46  95.66  77.44

The predicted results before and after fine-tuning are compared with four important indicators that reflect the experimental performance: F1-measure, specificity, MCC and accuracy. The comparison results show that the prediction results for the three topological types change only within a small range before and after fine-tuning, which does not affect the good prediction performance of DMCTOP. Meanwhile, this also reflects that the initial predictions produced by the deep neural network model constructed in this paper already achieve good performance.
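A hedged sketch of the fine-tuning rules of Section 3.5 is given below: isolated single-residue labels are merged into their neighbourhood and membrane runs shorter than 3 residues are relabelled, so the final label string only contains biologically plausible I-M-O / O-M-I transitions. The repair policy of absorbing a too-short run into the run on its left is our assumption, not the authors' exact procedure.

```python
import itertools

def fine_tune(labels: str, min_tm_len: int = 3) -> str:
    # collapse the label string into (label, run_length) segments
    runs = [(lab, len(list(grp))) for lab, grp in itertools.groupby(labels)]
    fixed = []
    for lab, length in runs:
        too_short = length == 1 or (lab == "M" and length < min_tm_len)
        if too_short and fixed:
            fixed[-1] = (fixed[-1][0], fixed[-1][1] + length)  # absorb into left run
        else:
            fixed.append((lab, length))
    return "".join(lab * length for lab, length in fixed)

print(fine_tune("IIIMMMMMMMOOMOO"))
# 'IIIMMMMMMMOOOOO' -- the stray single 'M' inside the outside region is absorbed
```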

3.6 The Impact of Training Sample Scale on Prediction Results
Although the number of aTMP with known topological structures has increased in recent years, the scale of the training set is still relatively small, especially after the non-redundancy treatment. This may result in inadequate learning of the local and global contextual information of aTMP when training the model. To illustrate this point clearly, we reduced the amount of training data, taking 10 percent as the interval, while ensuring that the redundancy between the training set and the test set is 30, 20, 10 and 0 percent, respectively. The comparison results are shown in Fig. 7.

Fig. 7. Comparison results of different scale training sets.

As can be seen from Fig. 7, as the training data size decreases, the prediction results at the whole-sequence level tend to decline, from 86.65 to 83.63 percent. At the traditional-criterion level, the downward trend is more moderate than at the whole-sequence level, with fluctuations ranging from 91.65 to 89.38 percent. However, at the transmembrane-segment level, the results fluctuate only within the range of 98.46 to 99.12 percent, almost unaffected by the scale of training. In other words, the reduction in the size of the training data has a significant impact at the whole-sequence level and only a slight influence on the prediction quality at the traditional-criterion level. From these two aspects, we can see that the main reason for the decline is the misprediction of the topology types 'O' and 'I', which indirectly leads to prediction errors in the transmembrane direction, but does not affect the label 'M', because the biological characteristics of aTMH regions are more remarkable than those of non-transmembrane regions. On the other hand, the decrease in performance with smaller training size and lower non-redundant sequence identity threshold is not very large, indicating that our model generalizes very well.

Although our deep learning architecture has significantly enhanced the performance of aTMP topology prediction, there is still room for improvement. In the future, we would like to construct a more efficient network structure to further improve the predictive and generalization capabilities of the model under limited data conditions. We also need to add more visualizations [46], [47] to reflect the interpretability of the deep learning algorithm in the application process, rather than treating it as a so-called black-box operation.

4 CONCLUSION

In this paper, we propose a novel method, DMCTOP, using a deep multi-scale convolutional neural network (DMCNN) for aTMP topology prediction. The distribution of the different regions of the transmembrane protein topology results from the interaction of various residues through groups of amino acids or domains. Our network architecture has powerful generalization ability to learn the hidden rules within the sequence information effectively and to discover abstract local or global contextual features at different levels automatically. By integrating local and global contextual features, we improved the state of the art in protein topology prediction. The comparison results at three levels demonstrate the effectiveness of our approach and reflect the potential of deep learning algorithms in the field of transmembrane protein topology prediction. Furthermore, our method is also superior to the previous methods under the stricter evaluation criteria. Finally, we hope that the research work done in this paper will be helpful to researchers in related fields.

In addition, we implemented our method as a web server, DMCTOP, for the research community. The web server is specially trained for predicting aTMP topology structures from a protein sequence entered as the input. Users can submit sequences and evaluate our method at https://icdtools.nenu.edu.cn/dmctop.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Funds of China (No. 81671328, 61802057), the Jilin Scientific and Technological Development Program (No. 20180414006GH, 20180520028JH, 20170520058JH), the Science and Technology Research Project of the Education Department of Jilin Province (No. JJKH20190290KJ, JJKH20191309KJ), and the Fundamental Research Funds for the Central Universities (No. 2412019FZ052, 2412019FZ048).

REFERENCES

[1] M. S. Almen, K. J. Nordström, R. Fredriksson, and H. B. Schiöth, "Mapping the human membrane proteome: A majority of the human membrane proteins can be classified according to function and evolutionary origin," BMC Biol., vol. 7, no. 1, 2009, Art. no. 50.
[2] R. Shamima and S. Suresh, "Prediction of membrane protein structures using a projection based meta-cognitive radial basis function network," in Proc. Int. Joint Conf. Neural Netw., 2016, pp. 1229-1235.
[3] J. P. Overington, A. L. Bissan, and A. L. Hopkins, "How many drug targets are there?" Nat. Rev. Drug Discov., vol. 5, no. 12, pp. 993-996, 2006.
[4] H. M. Berman et al., "The protein data bank," Nucleic Acids Res., vol. 28, no. 1, pp. 235-242, 2000.
[5] J. G. Almeida, A. J. Preto, P. I. Koukos, B. Amjj, and I. S. Moreira, "Membrane proteins structures: A review on computational modeling tools," Biochimica Et Biophysica Acta, vol. 1859, no. 10, 2017, Art. no. 2021.
[6] P. Du, S. Gu, and Y. Jiao, "PseAAC-General: Fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets," Int. J. Mol. Sci., vol. 15, no. 3, pp. 3495-3506, 2014.
[7] G. von Heijne, "Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule," J. Mol. Biol., vol. 225, no. 2, pp. 487-494, 1992.
[8] D. T. Jones, W. R. Taylor, and J. M. Thornton, "A model recognition approach to the prediction of all-helical membrane protein structure and topology," Biochemistry, vol. 33, no. 10, pp. 3038-3049, 1994.
[9] A. Krogh, B. H. G. Larsson, and S. Ell, "Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes," J. Mol. Biol., vol. 305, no. 3, pp. 567-580, 2001.
[10] G. Tusnady and I. Simon, "The HMMTOP transmembrane topology prediction server," Bioinformatics, vol. 17, no. 9, pp. 849-850, 2001.
[11] L. Käll, A. Krogh, and E. L. L. Sonnhammer, "A combined transmembrane topology and signal peptide prediction method," J. Mol. Biol., vol. 338, no. 5, pp. 1027-1036, 2004.
[12] H. Viklund and A. Elofsson, "Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information," Protein Sci., vol. 13, no. 7, pp. 1908-1917, 2004.
[13] C. Peters, K. D. Tsirigos, N. Shu, and A. Elofsson, "Improved topology prediction using the terminal hydrophobic helices rule," Bioinformatics, vol. 32, no. 8, 2016, Art. no. 1158.
[14] D. Jones, "Improving the accuracy of transmembrane protein topology prediction using evolutionary information," Bioinformatics, vol. 23, no. 5, pp. 538-544, 2007.
[15] S. M. Reynolds, L. Käll, M. E. Riffle, J. A. Bilmes, and W. S. Noble, "Transmembrane topology and signal peptide prediction using dynamic Bayesian networks," PLOS Comput. Biol., vol. 4, no. 11, 2008, Art. no. e1000213.
[16] T. Nugent and D. T. Jones, "Transmembrane protein topology prediction using support vector machines," BMC Bioinf., vol. 10, no. 1, pp. 159-159, 2009.
[17] V. HaKan and E. Arne, "OCTOPUS: Improving topology prediction by two-track ANN-based preference scores and an extended topological grammar," Bioinformatics, vol. 24, no. 15, pp. 1662-1668, 2008.
[18] V. HaKan, B. Andreas, S. Marcin, and E. Arne, "SPOCTOPUS: A combined predictor of signal peptides and membrane protein topology," Bioinformatics, vol. 24, no. 24, pp. 2928-2929, 2008.
[19] A. Bernsel, H. Viklund, A. Hennerdal, and A. Elofsson, "TOPCONS: Consensus prediction of membrane protein topology," Nucleic Acids Res., vol. 37, pp. 465-468, 2009.
[20] W. Lu, B. Fu, H. Wu, L. Qiang, K. Wang, and J. Min, "CRF-TM: A conditional random field method for predicting transmembrane topology," in Proc. Int. Conf. Int. Sci. Big Data Eng., 2015, pp. 529-537.
[21] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford, "Convolutional neural network architectures for predicting DNA-protein binding," Bioinformatics, vol. 32, no. 12, pp. i121-i127, 2016.
[22] S. Zhang et al., "A deep learning framework for modeling structural features of RNA-binding protein targets," Nucleic Acids Res., vol. 44, no. 4, 2015, Art. no. e32.
[23] C. Fang, Y. Shang, and D. Xu, "MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction," Proteins: Struct. Function Bioinf., vol. 86, no. 5, pp. 592-598, 2018.
[24] Q. Zou, P. Xing, L. Wei, and B. Liu, "Gene2vec: Gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA," RNA, vol. 25, no. 2, pp. 205-218, 2019.
[25] L. Wei, Y. Ding, R. Su, J. Tang, and Q. Zou, "Prediction of human protein subcellular localization using deep learning," J. Parallel Distrib. Comput., vol. 117, pp. 212-217, 2018.
[26] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, 2015, Art. no. 436.
[27] G. E. Tusnády, K. Lajos, and S. Istvan, "TOPDB: Topology data bank of transmembrane proteins," Nucleic Acids Res., vol. 36, pp. D234-D239, 2008.
[28] S. Jayasinghe, K. Hristova, and S. H. White, "MPtopo: A database of membrane protein topology," Protein Sci., vol. 10, no. 2, pp. 455-458, 2001.
[29] I. Masami, A. Masafumi, D. M. Lao, and S. Toshio, "Transmembrane topology prediction methods: A re-assessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topologies," Silico Biol., vol. 2, no. 1, 2002, Art. no. 19.
[30] Q. Zou, G. Lin, X. Jiang, X. Liu, and X. Zeng, "Sequence clustering in bioinformatics: An empirical study," Brief Bioinf., vol. 21, pp. 1-10, 2018.
[31] J. Tan et al., "Identification of hormone binding proteins based on machine learning methods," Math. Biosciences Eng., vol. 16, no. 4, pp. 2466-2480, 2019.
[32] W. Yang, X. Zhu, J. Huang, H. Ding, and H. Lin, "A brief survey of machine learning methods in protein sub-golgi localization," Current Bioinf., vol. 13, no. 3, pp. 234-240, 2019.
[33] H. Ying, N. Beifang, G. Ying, F. Limin, and L. Weizhong, "CD-HIT Suite: A web server for clustering and comparing biological sequences," Bioinformatics, vol. 26, no. 5, pp. 680-682, 2010.
[34] B. Andreas, V. Håkan, F. Jenny, L. Erik, V. H. Gunnar, and E. Arne, "Prediction of membrane-protein topology from first principles," Proc. Nat. Acad. Sci. USA, vol. 105, no. 20, pp. 7177-7181, 2008.
[35] X. Zhu, C. Feng, H. Lai, W. Chen, and L. Hao, "Predicting protein structural classes for low-similarity sequences by evaluating different features," Knowl. Based Syst., vol. 163, pp. 787-793, 2019.
[36] P. Du, X. Wang, C. Xu, and Y. Gao, "PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions," Anal. Biochem., vol. 425, no. 2, pp. 117-119, 2012.
[37] A. A. Schaffer, "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res., vol. 29, no. 14, pp. 2994-3005, 2001.
[38] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn., 2010, pp. 807-814.
[39] E. Asgari and M. R. K. Mofrad, "ProtVec: A continuous distributed representation of biological sequences," Comput. Sci., vol. 10, no. 11, 2015, Art. no. e0141287.
[40] J. Yasen and P. Du, "Performance measures in evaluating machine learning based bioinformatics predictors for classifications," Quantitative Biol., vol. 4, pp. 320-330, 2016.
[41] S. Jayasinghe, K. Hristova, and S. H. White, "Energetics, stability, and prediction of transmembrane helices," J. Mol. Biol., vol. 312, no. 5, pp. 927-934, 2001.
[42] S. Moller, E. V. Kriventseva, and R. Apweiler, "A collection of well characterised integral membrane proteins," Bioinformatics, vol. 16, no. 12, pp. 1159-1160, 2000.
[43] J. Reeb, E. Kloppmann, M. Bernhofer, and B. Rost, "Evaluation of transmembrane helix predictions in 2014," Proteins: Struct. Function Bioinf., vol. 83, no. 3, pp. 473-484, 2015.
[44] M. Hayat and A. Khan, "WRF-TMH: Predicting transmembrane helix by fusing composition index and physicochemical properties of amino acids," Amino Acids, vol. 44, no. 5, pp. 1317-1328, 2013.
[45] M. Bernhofer, E. Kloppmann, J. Reeb, and B. Rost, "TMSEG: Novel prediction of transmembrane helices," Proteins: Struct. Function Bioinf., vol. 84, no. 11, pp. 1706-1716, 2016.
[46] P. Du, W. Zhao, Y. Miao, L. Wei, and L. Wang, "UltraPse: A universal and extensible software platform for representing biological sequences," Int. J. Mol. Sci., vol. 18, no. 11, 2017, Art. no. 2400.
[47] J. Wang et al., "VisFeature: A stand-alone program for visualizing and analyzing statistical features of biological sequences," Bioinformatics, vol. 36, no. 4, pp. 1277-1278, 2019.

Yuning Yang is currently working toward the PhD degree in the School of Information Science and Technology, Northeast Normal University. His research interests include computational biology and bioinformatics.

Jiawen Yu is currently working toward the master's degree in the School of Information Science and Technology, Northeast Normal University. Her research interests include computational biology and bioinformatics.

Zhe Liu is currently working toward a graduate degree in the School of Information Science and Technology, Northeast Normal University. She joined the Institution of Computational Biology, Northeast Normal University, China, in 2019, where she has participated in research on transmembrane protein structure prediction using deep learning methods.
Xi Wang is currently working toward the master's degree in the School of Information Science and Technology, Northeast Normal University. Her research interests include computational biology and bioinformatics.

Han Wang is director of the Institution of Computational Biology, Northeast Normal University, China. His research interests include transmembrane protein structure and function prediction using artificial intelligence methods, which extend to many research fields, including big biological data, protein-protein interaction, new drug target discovery and drug design. He has published many peer-reviewed papers in the field, funded by the National Natural Science Foundation of China and the Jilin Scientific and Technological Development Program of China.

Zhiqiang Ma received the PhD degree from the School of Computer Science, Jilin University, in 2009. Currently, he is a professor with the School of Information Science and Technology, Northeast Normal University. He is the vice president of the research association of Computer Education in the Normal Universities of China and the executive director of the Jilin Computer Federation. His interests include bioinformatics, software engineering, molecular biology, and data mining.

Dong Xu received the doctorate degree in 1995. He is Shumaker Endowed Professor with the Electrical Engineering and Computer Science Department and director of the Information Technology Program, with appointments in the Christopher S. Bond Life Sciences Center and the Informatics Institute at the University of Missouri. He completed two years of postdoctoral work at the U.S. National Cancer Institute and then was a staff scientist at Oak Ridge National Laboratory until joining the university in 2003. His research includes protein structure prediction, high-throughput biological data analyses and in silico studies of plants, microbes and cancers. He has published more than 300 papers. He is a fellow of the American Association for the Advancement of Science (AAAS).

" For more information on this or any other computing topic,


please visit our Digital Library at www.computer.org/csdl.
