Prediction of Protein-Protein Interactions With LSTM Deep Learning Model
Abstract— Protein-protein interactions (PPIs) play a vital role in molecular biology and bioinformatics, since they provide key information about the cell, its structure, and its functions. In recent years, many methods and techniques have been proposed to identify PPIs, yet they suffer from long processing times and high costs as well as low prediction accuracy. In this study, we apply a deep learning approach to address these problems. To that end, we introduce an LSTM architecture that predicts protein-protein interactions using both the ProtVec and protein signature methods. VCP (valosin-containing protein), which is associated with H. pylori, is considered in this work. The performance of the method is assessed by log-loss, ROC, and classification accuracy. The proposed method shows good predictive ability, yet more work is needed to improve the results of PPI prediction studies with deep learning and machine learning approaches.

Keywords—protein-protein interaction, deep learning, bioinformatics, prediction.

I. INTRODUCTION

Proteins are composed of different types of amino acids, which join together and form a three-dimensional structure. A protein can be thought of as a functional part of the cell. Proteins are responsible for the metabolic activities of an organism, yet all of these processes require interactions between proteins [1]. PPIs are widely regarded as responsible for the function and form of all organisms [2]. Protein-protein interaction studies are becoming popular because analyzing proteins gives valuable information about biomedical functions, complexes, and metabolic cycles. Determining protein-protein interactions can help to identify cancer-related cells and their networks, as well as other diseases [3].

There are two types of methods for identifying PPIs: computational and experimental (biomedical) approaches. Biomedical approaches include yeast two-hybrid screening (Y2H) [4], tandem affinity purification (TAP) [5], nuclear magnetic resonance (NMR) [6], and mass spectrometry protein complex identification (MSPCI) [7]. All of these methods produce a large amount of data during the experiment, and they all require substantial lab work and costly laboratory equipment. Besides, the results depend heavily on the experimental conditions, that is, on the chemicals used and their metabolic activities. This makes the accuracy variable compared with genome-based methods. On the other hand, computational methods include machine learning approaches and network theory metrics.

Machine learning algorithms such as SVM (support vector machines), k-NN (k-nearest neighbours), RF (random forest), and DT (decision trees) perform well in data mining studies, yet they cannot find hidden features and lack the ability to extract the key features when the protein data is large. Furthermore, they require considerable time and high-quality equipment to process such data. In recent years, the development of artificial intelligence systems has made deep learning approaches more popular. The main reason is that data sets are now large and technology allows researchers to carry out studies more effectively and rapidly [8]. Deep learning methods are efficient at extracting high-dimensional and non-linear features not only from protein sequences but also in other fields, including real-world applications, image detection, and pattern recognition [9,10]. To apply deep learning and obtain good performance in PPI analysis, different modalities (structural, first- and second-order similarity) from the protein network need to be combined [3]. Also, researchers should determine the protein families in order to specify the protein interactions.

In this work, we apply a deep learning approach to predict protein-protein interactions. First, we obtained VCP data from the BioGRID dataset and determined the protein groups. Then, to convert the protein sequences into numerical values, both the protein signature and ProtVec methods were applied. Finally, an LSTM model was used to classify and predict the protein interactions. The results of the protein signature-based and ProtVec-based features were compared to determine the better approach. The performance of the LSTM was assessed by the log-loss score, ROC curve values, and classification accuracy. It is essential to analyze protein-protein interactions to comprehend the activities and biology of organisms. Predicting the interactions between protein networks is valuable for understanding the functions of the proteins, their environment, and their molecular compositions.

The remainder of the paper is organized as follows. Section II reviews the PPI studies in the literature, together with their methods, prediction scores, and data. Section III gives the materials and methods of the proposed approach: it describes the VCP data and background information, and explains the protein signature and ProtVec methods and the LSTM deep learning model. Section IV presents the prediction results of the proposed method.
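As an illustration of the sequence-to-numbers step mentioned in the introduction, the sketch below splits a sequence into the shifted, non-overlapping 3-gram lists used by ProtVec-style embeddings [21], sums one vector per 3-gram into a single feature vector, and rescales it to the [0,1] range used in the preprocessing phase. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the embedding table is a randomly initialised toy stand-in for a trained ProtVec model.

```python
import random


def shifted_trigrams(seq):
    """Return three lists of non-overlapping 3-grams, one per start offset 0, 1, 2."""
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]


def embed(seq, table, dim):
    """Sum the vectors of all 3-grams of seq; unknown 3-grams contribute zeros."""
    vec = [0.0] * dim
    for grams in shifted_trigrams(seq):
        for gram in grams:
            for k, value in enumerate(table.get(gram, [0.0] * dim)):
                vec[k] += value
    return vec


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy example (assumption): random 4-dimensional vectors instead of trained embeddings.
rng = random.Random(0)
sequence = "MKVLAAG"  # hypothetical short peptide, not taken from the VCP data
dim = 4
toy_table = {
    gram: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for grams in shifted_trigrams(sequence)
    for gram in grams
}
features = min_max_normalize(embed(sequence, toy_table, dim))
```

Summing the 3-gram vectors gives every sequence a vector of the same length regardless of how long the sequence is, which keeps the input shape fixed for a downstream model.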
According to Table 1, ProtVec performs slightly better than the protein signature method on all criteria. The classification accuracy for ProtVec is 92%, while the protein signature method classifies the protein interactions with 86% accuracy. The main reason is that ProtVec is a better tool for encoding protein sequences under the given conditions. The log-loss score for ProtVec is also better than that of the protein signature method. Log-loss is a useful way to measure the training and testing scores in deep learning and machine learning applications. It equals 0 when a perfect classification (100%) is obtained with full confidence. Fig. 5 and Fig. 6 show the approximate graphical representation of the log-loss scores of each method.

TABLE II. SPECIFICITY AND SENSITIVITY VALUES FOR ENCODING METHODS.
Encoding Method      Sensitivity    Specificity
Protein Signature    84%            91%
ProtVec              95%            93%

As can be seen in Table 2, the true positive rate (sensitivity) for ProtVec is greater than that of the protein signature technique. Likewise, ProtVec gives better specificity (true negative rate), at 93%.
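The evaluation measures discussed above can be computed as in the following sketch. It is a plain-Python illustration under stated assumptions, not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and probabilities are all hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    """True negative rate: TN / (TN + FP)."""
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Toy data (assumption): 1 = interacting pair, 0 = non-interacting pair.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.6, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print("sensitivity:", sensitivity(y_true, y_pred))
print("specificity:", specificity(y_true, y_pred))
print("log-loss:", log_loss(y_true, y_prob))
```

Log-loss penalises confident wrong predictions more heavily than hesitant ones, so two models with the same classification accuracy can still differ in log-loss.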
In conclusion, the proposed method is compared with some of the existing works in the literature to show the performance of the deep learning model. Table 3 lists the comparison results.

TABLE III. COMPARISON RESULTS OF THE STUDIES.
Reference    Classification Accuracy    Method
11           80%                        NN
12           80.7%                      SVM
13           93.75%                     SVM
14           95.29%                     DNN
15           98.78%                     LSTM
This work    92%                        LSTM

As can be seen in Table 3, the authors of [11-13] applied machine learning algorithms to determine protein interactions. The average classification accuracy of these three references is 84.8%. On the other hand, the authors of [14,15], like this work, applied deep learning methods. Together with this work, these deep learning techniques determine the protein interactions with an average accuracy of approximately 95.3%. In this comparison, deep learning algorithms produce more successful results than machine learning techniques.

V. CONCLUSION
In this work, a deep learning model for the prediction of protein-protein interactions was proposed. VCP data was used to determine its interaction network with other proteins. In the first stage, protein sequences were transformed into numerical representations with two different methods: protein signature and ProtVec. The numerical values were then normalized to the [0,1] range in the preprocessing phase. To predict the interactions, an LSTM model was evaluated with different parameters. Both the protein signature-based and ProtVec-based representations were used as input to the LSTM model, and the model's performance was assessed with classification accuracy, log-loss error rate, and ROC values. Both methods gave promising results, and ProtVec gave a slightly better prediction with 92% accuracy. The results indicate that deep learning is a powerful tool for PPI studies, yet performance depends heavily on the encoding method and the parameters of the deep learning model.

REFERENCES
[1] M.S. Ahmed, “SIGNET: A neural network architecture for predicting protein-protein interactions,” Electronic Thesis and Dissertation Repository, Western University, 2017.
[2] E.D. Levy, and Jose B. Pereira-Leal, “Evolution and dynamics of protein interactions and networks,” Current Opinion in Structural Biology, vol. 18, no. 3, 2008, pp. 349–357.
[3] D. Zhang, and M.R. Kabuka, “Multimodal deep representation learning for protein-protein interaction networks,” 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 595–602, 2018.
[4] E.A. Creasey, R.M. Delahay, S.J. Daniell, and G. Frankel, “Yeast two-hybrid system survey of interactions between LEE-encoded proteins of enteropathogenic Escherichia coli,” Microbiology, vol. 149, no. 8, 2003, pp. 2093–2106.
[5] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin, “A generic protein purification method for protein complex characterization and proteome exploration,” Nature Biotechnology, vol. 17, no. 10, 1999, pp. 1030–1032.
[6] M. Bhasin, and G.P. Raghava, “Classification of nuclear receptors based on amino acid composition and dipeptide composition,” Journal of Biological Chemistry, vol. 279, 2004, pp. 23262–23266.
[7] H. Yuen, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. Adams, A. Millar et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry,” Nature, vol. 415, no. 6868, 2002, pp. 180.
[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[9] D. Xiuquan, S. Sun, C. Hu, Y. Yao, Y. Yan, and Y. Zhang, “DeepPPI: Boosting prediction of protein-protein interactions with deep neural networks,” Journal of Chemical Information and Modeling, vol. 57, no. 6, 2017, pp. 1499–1510.
[10] S. Tanlin, B. Zhou, L. Lai, and J. Pei, “Sequence-based prediction of protein protein interaction using a deep-learning algorithm,” BMC Bioinformatics, vol. 18, no. 1, 2017, pp. 277.
[11] Y. Chen, J. Xu, B. Yang, Y. Zhao, and W. He, “A novel method for prediction of protein interaction sites based on integrated RBF neural networks,” Computers in Biology and Medicine, vol. 42, 2012, pp. 402–407.
[12] S. Martin, D. Roe, and J. Faulon, “Predicting protein-protein interactions using signature products,” Bioinformatics, vol. 21, no. 2, 2005, pp. 218–226.
[13] Y. Park, and E. M. Marcotte, “A flaw in the typical evaluation scheme for pair-input computational predictions,” Nat Methods, vol. 9, no. 12, 2012.
[14] L. Zhang, G. Yu, D. Xia, and J. Wang, “Protein-protein interactions prediction based on ensemble deep neural networks,” Neurocomputing, vol. 324, 2019, pp. 10–19.
[15] H. Li, X. Gong, H. Yu, and C. Zhou, “Deep neural network based predictions of protein interactions using primary sequences,” Molecules, vol. 23, 2018.
[16] BioGRID dataset, online link: https://thebiogrid.org/.
[17] K. D. Pruitt, T. Tatusova, and D. R. Maglott, “NCBI reference sequences (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins,” Nucleic Acids Research, vol. 35, 2007, pp. 61–65.
[18] Valosin containing protein image, online link: https://en.wikipedia.org/wiki/Valosin-containing_protein#/media/File:5ifw.jpg
[19] D.P. Visco, R. S. Pophale, M.D. Rintoul, and J.L. Faulon, “Developing a methodology for an inverse quantitative structure-activity relationship using the signature molecular descriptor,” Journal of Molecular Graphical Model, vol. 2, pp. 429–439.
[20] J.L. Faulon, C. Churchwell, and D.P. Visco, “The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences,” Journal of Chemical Information Computational Science, vol. 43, pp. 721–734.
[21] E. Asgari, and M.R.K. Mofrad, “Continuous distributed representation of biological sequences for deep proteomics and genomics,” PLoS One, vol. 10, no. 11, 2015.
[22] Y. Li, and L. Ilie, “SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome,” 2017.