
Prediction of Protein-Protein Interactions with LSTM Deep Learning Model

Talha Burak Alakus
Department of Software Engineering
Kirklareli University
Kirklareli, Turkey
burak.alakuss@gmail.com, talhaburakalakus@klu.edu.tr

Ibrahim Turkoglu
Department of Software Engineering
Firat University
Elazig, Turkey
iturkoglu@firat.edu

Abstract— Protein-protein interactions (PPIs) play a vital role in molecular biology and bioinformatics, since they give information about the cell, its structure, and its functions. In recent years, many methods and techniques have been proposed to predict PPIs, yet they suffer from long run times and high costs as well as low prediction accuracy. In this study, we applied a deep learning approach to address these problems. To do so, we introduced an LSTM architecture that predicts protein-protein interactions using both the ProtVec and protein signature methods. VCP (valosin-containing protein), which is associated with H. pylori, is considered in this work. The performance of the method was determined by log-loss, ROC, and classification accuracy. The proposed method showed good predictive ability, yet more work is needed to improve the results of PPI prediction studies with deep learning and machine learning approaches.

Keywords—protein-protein interaction, deep learning, bioinformatics, prediction.

I. INTRODUCTION

Proteins are composed of different types of amino acids, which join together and form a 3-dimensional structure. Proteins can be thought of as the functional parts of the cell. They are responsible for the metabolic activities of an organism. Yet all of these processes require interaction between proteins [1]. It is well known that PPIs are responsible for the function and form of all organisms [2]. Protein-protein interaction studies are becoming popular, since analyzing proteins gives valuable information about biomedical functions, complexes, and metabolic cycles. Determining protein-protein interactions can help to identify cancer-related cells and their networks, as well as other diseases [3].

There are two types of methods for determining PPIs: computational and experimental (biomedical) approaches. Biomedical approaches include yeast two-hybrid screening (Y2H) [4], tandem affinity purification (TAP) [5], nuclear magnetic resonance (NMR) [6], and mass spectrometry protein complex identification (MSPCI) [7]. All of these methods produce a large amount of data during the experiment, so they require substantial lab work and costly laboratory equipment. Besides, their results depend heavily on the experimental setup, that is, on the chemicals and their metabolic activities. This makes the accuracy variable compared with genome-based methods. On the other hand, computational methods include machine learning approaches and network theory metrics.

Machine learning algorithms such as SVM (support vector machines), k-NN (k-nearest neighbours), RF (random forest), and DT (decision trees) perform efficiently in data mining studies, yet they cannot find hidden features and they lack the ability to extract the key features when the protein data is large. Furthermore, they require considerable time and high-quality equipment. In recent years, the development of artificial intelligence systems has made deep learning approaches more popular. The main reason is that data sets are now large and technology allows researchers to carry out studies more effectively and rapidly [8]. Deep learning methods are efficient at extracting high-dimensional and non-linear features not only from protein sequences but also in other fields, including real-world applications, image detection, and pattern recognition [9,10]. To apply deep learning and obtain good performance in PPI analysis, different modalities (structural, 1st and 2nd order similarity) from the protein network need to be combined [3]. Also, researchers should determine the protein families in order to specify the protein interactions.

In this work, we applied a deep learning approach to predict protein-protein interactions. First, we obtained VCP data from the BioGRID dataset and determined the protein groups. Then, to convert the protein sequences into numerical values, both the protein signature and ProtVec methods were applied. In the last part of the work, an LSTM model was used to classify and predict the protein interactions. The results of the protein signature-based and ProtVec-based features were compared in order to determine the best approach. The performance of the LSTM was determined by the log-loss score, ROC curve values, and classification accuracy. It is essential to analyze protein-protein interactions to comprehend the activities and biology of organisms. Predicting the interactions between protein networks is valuable for understanding the functions of the proteins, their environment, and their molecular compositions.

The remainder of the paper is organized as follows. In Section II, we review the PPI studies in the literature, together with their methods, prediction scores, and data. In Section III, the material and methods of the proposed approach are given: we describe the VCP data and background information, and explain the protein signature and ProtVec methods and the LSTM deep learning model. Section IV presents the prediction results of the proposed method.

978-1-7281-3789-6/19/$31.00 ©2019 IEEE


Section IV also reports the performance of the LSTM model, including log-loss scores and ROC curve values. Lastly, the study is concluded in Section V.

II. RELATED WORKS

In this section, computational PPI studies in the literature, including machine learning and deep learning methods, are examined and explained.

In [11], researchers tried to predict protein-protein interaction sites with machine learning algorithms. To classify the sites, they used a Radial Basis Function Neural Network classifier. To convert protein sequences into numerical representations, the frequencies of the amino acids were calculated. Then, in the feature extraction phase, relative entropy was calculated and a total of 1000 features were obtained from the frequencies. In the last part of the study, the radial basis classifier was used and weight parameters were tested to determine the best classification performance. The F1 score and classification accuracy were calculated as 99% and 80%, respectively.

In [12], protein-protein interactions were determined with the protein signatures method. The researchers obtained Helicobacter pylori protein data from human and mouse. After the conversion of sequences, an SVM (Support Vector Machine) was applied to predict the interactions. The proposed method was tested on Helicobacter pylori, Escherichia coli, and Saccharomyces cerevisiae to determine its performance. During testing, 10-fold cross-validation was used and the performance of the classifier was reported in terms of specificity, accuracy, and sensitivity. The best classification accuracy, 80.7%, was obtained on the Saccharomyces cerevisiae dataset.

In [1], the interactions between proteins were predicted with a Siamese Neural Network. Four different datasets were considered, and three different testing procedures (C1, C2, and C3) and two neural networks were applied. In the first network, protein sequences were converted with protein signatures, while in the second, proteins were encoded with the ProtVec method. The testing procedures were designed following [13]. In C1, both proteins of a test pair also appear in the training data; in C2, only one of them does; in C3, neither does. Detailed information on the testing procedures can be found in [13]. The protein signature method gave the best result for both the C1 and C2 testing procedures, with AUC scores of 93.75% and 85.75%, respectively, whereas ProtVec gave the best result on the C3 testing procedure, with an AUC score of 73.25%.

In [14], deep learning was applied to determine protein interactions. First, protein sequences were transformed into numerical representations. Then nine deep learning models were used to predict the interactions. The performance of the method was measured by accuracy, F1 score, and specificity. During the testing phase, 5-fold cross-validation was applied and the results were 95.29%, 95.29%, and 95.48%, respectively.

In [15], protein interactions were predicted from protein primary sequences. The researchers applied both a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory) to specify the interactions. Three kinds of features were obtained from protein sequences: semantic relations, motifs, and long- and short-term relations between proteins. The method's performance was determined with 5-fold cross-validation, and the average classification accuracy was 98.78%. The researchers concluded that deep learning is a powerful tool for inspecting PPIs.

III. MATERIAL AND METHODS

A. VCP and BioGRID

In this study, we obtained VCP protein data from BioGRID (Biological General Repository for Interaction Datasets) [16]. BioGRID is a repository of protein and genetic interactions. It contains nearly two million protein interaction records and can be accessed via its web page. VCP, also known as p97 in mammals and CDC48 in Saccharomyces cerevisiae, is an ATPase enzyme encoded by the VCP gene in humans. VCP is a member of the AAA family of proteins. Fig. 1 shows a typical VCP. It plays a role in protein degradation, DNA repair, DNA replication, and intracellular membrane fusion. It also affects the regulation of the cell cycle. It forms a homohexameric complex that can interact with different kinds of cofactors. Mutations in this gene cause many diseases, including ALS (amyotrophic lateral sclerosis) and CMT (Charcot-Marie-Tooth) [17].

Fig. 1. Valosin-containing protein in human [18].

According to the current BioGRID data, there are 1290 physical interactions between VCP and other proteins. To use VCP to determine PPIs with computational methods, protein sequences were first converted into numerical values. To do that, we applied two different methods: protein signature and ProtVec. Both are detailed in the next subsections.

B. Protein Signature

The protein signature was developed to encode protein sequences as numerical representations. In this way, sequences can be transformed into vectors that carry information for distinguishing binding from non-binding protein pairs.

The protein signature is based on the signature molecular descriptor [19,20]. The signature is calculated as given in Equation 1 [12]:

$$ s(A) = \sum_i s_i d_i \qquad (1) $$

In Equation 1, $A$ represents the amino acid sequence and $s(A)$ its signature, $s_i$ is the number of occurrences of the $i$-th signature in $A$, and $d_i$ is the corresponding basis vector of the signature space. A protein signature consists of an amino acid and its possible neighbours; thus, the signature space consists of all possible signatures. Consider the six-letter amino acid sequence MTLVVL as an example. Signatures are determined from units of three monomers, and the sequence contains four such units: MTL, TLV, LVV, and VVL. Each unit has a root (its middle residue) and two neighbours, which are ordered alphabetically. Thus, the signatures are T(LM), L(TV), V(LV), and V(LV), and the signature of the sequence is $s(\mathrm{MTLVVL}) = \mathrm{T(LM)} + \mathrm{L(TV)} + 2\,\mathrm{V(LV)}$.

The signature method is one of the two methods we applied to convert the letter sequences of the VCP data into numeric values for the deep learning process. It provides a vector for classification and processing by the deep learning approach.

C. ProtVec

Using letters, words, and nominal values in general as inputs in feature vectors can cause problems in machine learning and deep learning, since nominal values have no order. Typically, to convert nominal values to numbers, a word embedding method is applied. The aim of an embedding is to reduce the dimension of the vector space and convert the words or letters into meaningful features. Accordingly, the researchers in [21] developed a word embedding method for protein sequences and called it ProtVec. Proteins are decomposed into trimers, as in the three-gram method, and a ProtVec vector is learned for each word from its surrounding neighbourhood. Fig. 2 shows the protein-splitting process for obtaining the training data. With this method, proteins with similar physicochemical properties are placed close together. Detailed information on ProtVec can be found in [21].

Fig. 2. Protein splitting process for ProtVec [21].

Given the success of ProtVec as an encoding, we also applied this method to transform the VCP data into numbers. In this way, we aimed to compare the performance of the two encoding methods with respect to the selected deep learning model.

D. LSTM Deep Learning Model

LSTM is a deep learning model based on recurrent neural networks, introduced in 1997 [22]. LSTM models are used in many applications, including speech processing, music composition, abstract creation, and word completion. It is a gradient-based learning algorithm developed to solve error back-flow problems. An LSTM keeps information outside the normal flow of the recurrent neural network in a gated cell. Information can be stored in, read from, and written to this cell. Gates in the model determine which information is stored or used. These gates, among them the forget gate, address the vanishing gradient problem of recurrent neural networks. Detailed information about LSTM can be found in [22]. Our proposed method uses an LSTM for the prediction of protein-protein interactions.

IV. RESULTS AND DISCUSSION

We developed an LSTM model for protein interaction prediction in this work. Only the first 400 residues of each protein sequence were used as input to the LSTM classifier. Before the prediction phase, each protein sequence had already been converted to numerical values. Our model includes four 1D convolutional layers, four average pooling layers, one LSTM layer, and one fully connected layer with Softmax and 1024 neurons. The detailed LSTM model can be seen in Fig. 3 below.

Fig. 3. LSTM model used in this work.
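The protein signature encoding of Section III-B can be illustrated with a short sketch. It reproduces the MTLVVL example from the text: every interior residue becomes a root, and its two neighbours are sorted alphabetically. The function names are hypothetical and the code is only an illustration of the counting step, not the implementation used in this study.

```python
from collections import Counter


def signatures(sequence):
    """List the signature of every interior residue of `sequence`.

    A signature is written root(neighbours): the middle residue of a
    three-residue window, followed by its two neighbours sorted alphabetically.
    """
    result = []
    for i in range(1, len(sequence) - 1):
        root = sequence[i]
        neighbours = "".join(sorted(sequence[i - 1] + sequence[i + 1]))
        result.append(f"{root}({neighbours})")
    return result


def signature_counts(sequence):
    """Count how often each signature occurs (the coefficients in Equation 1)."""
    return Counter(signatures(sequence))


counts = signature_counts("MTLVVL")
for name, count in sorted(counts.items()):
    print(name, count)
# L(TV) 1
# T(LM) 1
# V(LV) 2
```

The counts correspond to the coefficients of the worked example, T(LM) + L(TV) + 2V(LV). Turning them into a fixed-length vector additionally requires a fixed ordering of all possible signatures, which is not shown here.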


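The trimer splitting that ProtVec relies on (Section III-C, Fig. 2) can be sketched as follows. A sequence is read in three frames, shifted by 0, 1, and 2 residues, and each frame is cut into non-overlapping three-letter words. The function name, the toy sequence MTLVVLAK, and the choice to truncate to 400 residues before splitting are assumptions made for illustration; the trained embedding vectors of [21] are not reproduced.

```python
MAX_LENGTH = 400  # from the text: only the first 400 residues are used


def split_into_trimers(sequence, k=3, max_length=MAX_LENGTH):
    """Split a sequence into k shifted lists of non-overlapping k-mers.

    Assumption: the sequence is truncated to `max_length` before splitting.
    """
    sequence = sequence[:max_length]
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


for offset, words in enumerate(split_into_trimers("MTLVVLAK")):
    print(offset, words)
# 0 ['MTL', 'VVL']
# 1 ['TLV', 'VLA']
# 2 ['LVV', 'LAK']
```

Each three-letter word would then be looked up in the embedding table to obtain its numeric vector.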
According to Fig. 3, the first two convolutional layers have 16 filters of size 2x1. The first two pooling layers compute the average value of their inputs with the same 2x1 size as the convolutional layers. The filter size is the same throughout the network; only the number of kernels changes, doubling to 32 after the first four layers. Next, an LSTM layer with 512 units is applied. The fully connected layer has 1024 neurons, as does the Softmax layer, which is the classification layer. Protein sequences were encoded as numbers with two different methods: protein signature and ProtVec. We evaluated each method to determine the best encoding technique for the proposed model. The general structure of the proposed method is given in Fig. 4.
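To make the layer description easier to follow, the sketch below traces how the tensor shape could change through the network. The text does not state several details, so these are assumptions: convolution and pooling layers alternate, convolutions use 'same' padding, each pooling layer halves the sequence length, the input has one feature channel, and the LSTM returns only its final state. The filter counts, the 512 LSTM units, and the 1024 neurons come from the text.

```python
SEQ_LEN = 400                    # from the text: first 400 residues per protein
CONV_FILTERS = [16, 16, 32, 32]  # from the text: 16 filters, doubled after the first four layers
POOL_SIZE = 2                    # assumption: pooling window equals the stated 2x1 size
LSTM_UNITS = 512                 # from the text
DENSE_UNITS = 1024               # from the text


def trace_shapes(seq_len=SEQ_LEN, in_channels=1):
    """Return (layer name, output shape) pairs for the assumed layer order."""
    length, channels = seq_len, in_channels
    trace = [("input", (length, channels))]
    for n, filters in enumerate(CONV_FILTERS, start=1):
        channels = filters  # 'same' padding assumed, so the length is unchanged
        trace.append((f"conv{n}", (length, channels)))
        length //= POOL_SIZE  # non-overlapping average pooling halves the length
        trace.append((f"avgpool{n}", (length, channels)))
    trace.append(("lstm", (LSTM_UNITS,)))  # assumption: only the final state is kept
    trace.append(("fully_connected_softmax", (DENSE_UNITS,)))
    return trace


for name, shape in trace_shapes():
    print(f"{name:<24}{shape}")
```

Under these assumptions, the LSTM layer receives a sequence of 25 steps with 32 channels each.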

Fig. 4. General structure of the proposed method.

After the LSTM model was developed, protein sequences of length 400 were fed to the network to predict the protein interactions. The protein signature-based and ProtVec-based sequences produced different results; their scores are given in Table 1, which shows the classification accuracy, log-loss score, and ROC value of each encoding method with the LSTM network.

TABLE I. PERFORMANCE OF BOTH PROTEIN ENCODING METHODS BASED ON VARIOUS CRITERIA.

Encoding Method      Classification Accuracy   Log-Loss Score   ROC Values
Protein Signature    86%                       0.1508           84%
ProtVec              92%                       0.0834           95%

According to Table 1, ProtVec performs slightly better than the protein signature method on all criteria. The classification accuracy for ProtVec is 92%, while the protein signature classifies the protein interactions with 86% accuracy. The main reason is that ProtVec is a better tool for encoding protein sequences under the given circumstances. The log-loss score for ProtVec is also better than that of the protein signature. Log-loss is a useful way to measure the testing and training score in deep learning and machine learning applications. It equals 0 when the classification is perfect (100%). Fig. 5 and Fig. 6 show the approximate graphical representation of the log-loss scores of each method.

Fig. 5. Log-loss score of ProtVec method.

Fig. 6. Log-loss score of protein signature method.

Besides, both sensitivity and specificity can be calculated from the ROC values. In the ROC curve, the y-axis represents the sensitivity while the x-axis shows 1 - specificity. Table 2 shows the specificity and sensitivity values of both methods with respect to the ROC values.

TABLE II. SPECIFICITY AND SENSITIVITY VALUES FOR ENCODING METHODS.

Encoding Method      Sensitivity   Specificity
Protein Signature    84%           91%
ProtVec              95%           93%

As can be seen in Table 2, the true positive rate (sensitivity) of ProtVec is greater than that of the protein signature technique. Likewise, its specificity (true negative rate) is also higher, at 93%.
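The evaluation criteria used above can be made concrete with a small sketch. It computes the log-loss from predicted probabilities, and the sensitivity and specificity from thresholded predictions. The labels and probabilities are toy values invented for illustration, and a binary interacting/non-interacting setting with a 0.5 threshold is assumed; none of this reproduces the scores in Table 1 or Table 2.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 0)
    tn = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 0)
    fp = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 1 = interacting pair, 0 = non-interacting pair.
y_true = [1, 1, 1, 0, 0, 0]
p_pred = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6]
y_pred = [1 if p >= 0.5 else 0 for p in p_pred]

sens, spec = sensitivity_specificity(y_true, y_pred)
print("log-loss:", round(log_loss(y_true, p_pred), 4))
print("sensitivity:", sens, "specificity:", spec, "false positive rate:", 1 - spec)
```

The log-loss approaches 0 only when every predicted probability matches its label, and the x-axis of the ROC curve is the false positive rate, that is, 1 - specificity.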
Finally, the proposed method is compared with some of the existing works in the literature to show the performance of the deep learning model. Table 3 gives the comparison results.

TABLE III. COMPARISON RESULTS OF THE STUDIES.

Reference    Classification Accuracy   Method
[11]         80%                       NN
[12]         80.7%                     SVM
[13]         93.75%                    SVM
[14]         95.29%                    DNN
[15]         98.78%                    LSTM
This work    92%                       LSTM

As can be seen in Table 3, the authors of [11-13] applied machine learning algorithms to determine protein interactions. The average classification accuracy of these three references is 84.8%. On the other hand, the authors of [14,15], like this work, applied deep learning methods. These three deep learning studies determine the protein interactions with an average accuracy of about 95.4%. In this comparison, deep learning algorithms produce more successful results than machine learning techniques.

V. CONCLUSION

In this work, a deep learning model for the prediction of protein-protein interactions was proposed. VCP data was used to determine the interaction network with other proteins. In the first stage, protein sequences were transformed into numerical representations with two different methods: protein signature and ProtVec. Then the numerical values were normalized to the [0,1] range in the preprocessing phase. To predict the interactions, the LSTM model was evaluated with different parameters. Both protein signature-based and ProtVec-based protein sequences were used as input to the LSTM model, and the model's performance was determined with classification accuracy, log-loss error rate, and ROC values. Both methods gave promising results, and ProtVec gave a slightly better prediction with 92% accuracy. The results indicate that deep learning is a powerful tool for PPI studies, yet performance depends strongly on the encoding method and on the parameters of the deep learning model.

REFERENCES

[1] M.S. Ahmed, “SIGNET: A neural network architecture for predicting protein-protein interactions,” Electronic Thesis and Dissertation Repository, Western University, 2017.
[2] E.D. Levy, and Jose B. Pereira-Leal, “Evolution and dynamics of protein interactions and networks,” Current Opinion in Structural Biology, vol. 18, no. 3, 2008, pp. 349 – 357.
[3] D. Zhang, and M.R. Kabuka, “Multimodal deep representation learning for protein-protein interaction networks,” 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 595 – 602, 2018.
[4] E.A. Creasey, R.M. Delahay, S.J. Daniell, and G. Frankel, “Yeast two-hybrid system survey of interactions between LEE-encoded proteins of enteropathogenic Escherichia coli,” Microbiology, vol. 149, no. 8, 2003, pp. 2093 – 2106.
[5] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Seraphin, “A generic protein purification method for protein complex characterization and proteome exploration,” Nature Biotechnology, vol. 17, no. 10, 1999, pp. 1030 – 1032.
[6] M. Bhasin, and G.P. Raghava, “Classification of nuclear receptors based on amino acid composition and dipeptide composition,” Journal of Biological Chemistry, vol. 279, 2004, pp. 23262 – 23266.
[7] H. Yuen, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. Adams, A. Millar et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry,” Nature, vol. 415, no. 6868, 2002, pp. 180.
[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[9] D. Xiuquan, S. Sun, C. Hu, Y. Yao, Y. Yan, and Y. Zhang, “DeepPPI: Boosting prediction of protein protein interactions with deep neural networks,” Journal of Chemical Information and Modeling, vol. 57, no. 6, 2017, pp. 1499 – 1510.
[10] S. Tanlin, B. Zhou, L. Lai, and J. Pei, “Sequence-based prediction of protein protein interaction using a deep-learning algorithm,” BMC Bioinformatics, vol. 18, no. 1, 2017, pp. 277.
[11] Y. Chen, J. Xu, B. Yang, Y. Zhao, and W. He, “A novel method for prediction of protein interaction sites based on integrated RBF neural networks,” Computers in Biology and Medicine, vol. 42, 2012, pp. 402 – 407.
[12] S. Martin, D. Roe, and J. Faulon, “Predicting protein-protein interactions using signature products,” Bioinformatics, vol. 21, no. 2, 2005, pp. 218 – 226.
[13] Y. Park, and E. M. Marcotte, “A flaw in the typical evaluation scheme for pair-input computational predictions,” Nat Methods, vol. 9, no. 12, 2012.
[14] L. Zhang, G. Yu, D. Xia, and J. Wang, “Protein-protein interactions prediction based on ensemble deep neural networks,” Neurocomputing, vol. 324, 2019, pp. 10 – 19.
[15] H. Li, X. Gong, H. Yu, and C. Zhou, “Deep neural network based predictions of protein interactions using primary sequences,” Molecules, vol. 23, 2018.
[16] BioGRID dataset, online link: https://thebiogrid.org/.
[17] K. D. Pruitt, T. Tatusova, and D. R. Maglott, “NCBI reference sequences (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins,” Nucleic Acids Research, vol. 35, 2007, pp. 61 – 65.
[18] Valosin containing protein image, online link: https://en.wikipedia.org/wiki/Valosin-containing_protein#/media/File:5ifw.jpg
[19] D.P. Visco, R. S. Pophale, M.D. Rintoul, and J.L. Faulon, “Developing a methodology for an inverse quantitative structure-activity relationship using the signature molecular descriptor,” Journal of Molecular Graphics and Modelling, vol. 2, pp. 429 – 439.
[20] J.L. Faulon, C. Churchwell, and D.P. Visco, “The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences,” Journal of Chemical Information and Computer Sciences, vol. 43, pp. 721 – 734.
[21] E. Asgari, and M.R.K. Mofrad, “Continuous distributed representation of biological sequences for deep proteomics and genomics,” PLoS One, vol. 10, no. 11, 2015.
[22] Y. Li, and L. Ilie, “SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome,” 2017.
