Analytical Biochemistry 687 (2024) 115426

Contents lists available at ScienceDirect

Analytical Biochemistry
journal homepage: www.elsevier.com/locate/yabio

MVNN-HNHC: A multi-view neural network for identification of human non-histone crotonylation sites
Jun Gao a, Yaomiao Zhao a, Chen Chen b, **, Qiao Ning a, *
a Department of Information Science and Technology, Dalian Maritime University, Dalian, 116026, China
b Naval Architecture and Ocean Engineering College, Dalian Maritime University, Dalian, 116026, China

ARTICLE INFO

Index Terms: Multi-view neural network; Non-histone; Crotonacylation sites; Adaptive encoding features

ABSTRACT

Crotonylation on lysine sites in human non-histone proteins plays a crucial role in biological activities. However, because traditional experimental methods for crotonylation site identification are time-consuming and labor-intensive, computational prediction methods have become increasingly popular in recent years. Despite its significance, crotonylation site prediction has received less attention in non-histone proteins than in histones. In this study, we propose a Multi-View Neural Network for identification of Human Non-Histone Crotonylation sites, named MVNN-HNHC. MVNN-HNHC integrates multi-view encoding features and adaptive encoding features through a multi-channel neural network to learn the attribute differences between crotonylation sites and non-crotonylation sites from various aspects. In MVNN-HNHC, convolutional neural networks obtain local information from these features, and bidirectional long short-term memory networks extract sequence information. Then, an attention mechanism fuses the outputs of the various feature extraction modules. Finally, a fully connected network acts as the classifier to predict whether a lysine site is a crotonylation site or a non-crotonylation site. On the independent test set, the sensitivity, specificity, accuracy, Matthews correlation coefficient, and area under the curve (AUC) reach 80.06 %, 75.77 %, 77.06 %, 0.5203, and 0.7792, respectively. To verify the effectiveness of this method, we carry out a series of experiments, and the results show that MVNN-HNHC is an effective tool for predicting crotonylation sites in non-histone proteins. The data and code are available at https://github.com/xbbxhbc/junjun0612.git.

1. Introduction

Crotonylation on lysine is a new and special post-translational protein modification (PTM) that covalently binds modified small molecules to specific lysine sites of the substrate protein [1,2]. Crotonylation affects the structure-function relationship of proteins, changing various physiological and pathological processes in the organism [3–7]. According to previous studies, there are more than 600 types of protein post-translational modification in eukaryotes. Lysine modifications that occur on non-histone proteins are reported to be closely related to cell signaling, protein activity regulation, and protein transport [8–12]. With the progress of research technology for protein post-translational modification, more and more lysine sites have been identified, and more abundant types of histone lysine modifications have been discovered [13]. Zhao et al. first proposed the model for eight types of new PTM identification, including lysine acylation, propionylation, crotonylation, succinylation, butyrylation, malonylation, glutarylation, dihydroxyisobutyrylation and trihydroxybutyrylation [14]. Although acylation has been intensively studied, few studies have been done on crotonylation sites, especially those in non-histone proteins. Lysine crotonylation plays a key role in many physiological processes, such as development, metabolism and disease [15]. Therefore, we focus on the accurate identification of lysine crotonylation sites on non-histone proteins in human organisms.

To further understand the function of lysine crotonylation sites and their related roles, many scientists have analyzed relevant experiments on lysine in recent years, but accurately predicting the location of lysine crotonylation sites is the first and key step for the following work. In recent years, research on this problem has been carried out mainly through experimental and computational methods

* Corresponding author.
** Corresponding author.
E-mail addresses: Chen_Chen@dlmu.edu.cn (C. Chen), ningq669@dlmu.edu.cn (Q. Ning).

https://doi.org/10.1016/j.ab.2023.115426
Received 28 August 2023; Received in revised form 21 November 2023; Accepted 6 December 2023
Available online 22 December 2023
0003-2697/© 2023 Elsevier Inc. All rights reserved.
J. Gao et al. Analytical Biochemistry 687 (2024) 115426

[9]. With the progress of proteomics technology, various related technologies have been adopted, such as high performance liquid chromatography (HPLC) fractionation, isotopic labeling, affinity enrichment, and high performance liquid chromatography tandem mass spectrometry (MS) [16]. However, experimental methods require a lot of manpower, and their design is complex, slow and costly, so they are difficult to apply widely across large numbers of species [8]. By contrast, computational methods for site identification have the advantages of short run time and high accuracy. Therefore, we shifted our focus to designing computational approaches to identify lysine crotonylation sites.

To date, a number of computational methods have been developed to predict protein lysine crotonylation (Kcr) sites. Huang and Zeng [17] proposed the first predictor of Kcr sites, named CrotPred, based on the hypothesis that the peptides undergoing crotonylation were generated by different hidden Markov models. Qiu et al. [18] proposed a new method that uses position weighted amino acid composition for feature coding and a support vector machine as the classifier to predict crotonylation sites. Malebary et al. [19] developed a new computational predictor called iCrotoK-PseAAC, a model that incorporates relative characteristics of various locations and compositions as well as statistical matrices into pseudo-amino acid composition to identify Kcr sites. None of these methods provides an online server, which is inconvenient for biologists, so there is still a lot of room for improvement. Subsequently, Ju et al. [20] proposed the CKSAAP_CrotSite model, and selected the K-spaced amino acid pair as the feature coding scheme from among amino acid frequency, amino acid factor, bi-profile Bayes, binary encoding and K-spaced amino acid pair. Qiu et al. [21] reported a new predictor, iKcr-PseEns, established by coupling five layers of amino acid pairs to a general pseudo-amino acid composition. In these reports, the researchers used different techniques such as position weighted matrix, support vector machine, K-nearest neighbor and many others. However, the maximum predictive accuracy achieved by these techniques was not very high. To maximize the effect, Liu et al. [22] took into account the sequence-based features, physicochemical properties and evolutionary derivative features of protein sequences, adopted five feature extraction methods, and employed ElasticNet to reduce the dimension of the original feature space. Then, the synthetic minority over-sampling technique was used to address the impact of data imbalance. Finally, the LightGBM classifier was used to predict Kcr sites. Lv et al. [23] developed a method based on deep learning, called Deep-Kcr, which used multiple types of features for fusion and a convolutional neural network for feature extraction. Chen et al. [24] first carried out a comprehensive review of six methods for predicting crotonylation sites and proposed a new method named nh-Kcr. Built on a new deep learning framework called CNNrgb, it uses amino acid index, binary encoding and BLOSUM62 encoding schemes as the matrices of the red, green and blue color channels of a convolutional neural network, respectively, for benchmark testing. Qiao et al. [25] proposed a new predictor, Bert-Kcr, developed using a transfer learning approach and a pretrained bidirectional encoder representation from a transformer model for protein Kcr site prediction. Dou [26] constructed a convolutional neural network framework called iKcr_CNN, and used the focal loss function instead of standard cross entropy to optimize the model for identifying human non-histone Kcr modifications. Li et al. [27] established a new predictor, Adapt-Kcr, a relatively advanced end-to-end deep learning model of recent years. It uses adaptive embedding and captures important information with a convolutional neural network, a bidirectional long short-term memory network and an attention structure. It performs well and is a strong prediction model to date.

Although the models proposed in the past all performed well in predicting lysine crotonylation sites, some areas can still be improved. First, most past models were developed on histones, while research on non-histone proteins is still lacking. It is necessary to design a well-performing model for identifying Kcr sites in non-histone proteins. Second, in the feature coding part, previous researchers relied either on traditional manual coding or on adaptive coding, but neglected to integrate the two to obtain better information from different angles. Computers can help us mine useful information beyond what is already known. Therefore, drawing on the experience of previous researchers, we propose a deep learning framework for identifying non-histone Kcr sites, named MVNN-HNHC. For the various types of feature coding methods, convolutional neural networks and bidirectional long short-term memory networks are applied to extract features and reduce dimensions, so as to obtain more valuable information. Finally, the outputs are integrated, and the attention network is used to identify the key information again. Compared with other existing methods, the proposed model achieves better prediction results. The framework for this work is shown in Fig. 1.

2. Materials and methods

2.1. Construction of the benchmark data set

In this study, we used the same dataset as Chen et al. [24], which contains a large number of experimentally verified human non-histone Kcr sites. First, 19287 Kcr sites were obtained from 4230 non-histone proteins in the Uniprot database. To remove redundant protein sequences, CD-HIT [28] was applied with 30 % sequence identity. To determine the size of the sliding window, the Two-Samples-Logo software [29] was used to analyze the location specificity of positive and negative samples and the distribution of sequences around positive samples. As shown in Fig. 2, residues around the central lysine were mainly concentrated between −10 and 10, and there were obvious sequence differences between crotonylation and non-crotonylation samples. To avoid omitting information, and following previous studies [24], the sliding window for protein sequence interception was set to 29 (−14~K~14). If the central position was crotonylated, the fragment was regarded as a positive sample. It is worth noting that if the amino acid fragment is not long enough, the virtual amino acid “o” is used to fill the missing positions.

2.2. Multi-view feature encoding scheme

The selection of a suitable feature coding scheme has a great impact on the prediction results and is an important step in the whole model framework. Below, we select five coding schemes suitable for this model from manual coding and adaptive coding schemes, and divide them into two categories: (1) Traditional manual coding: CKSAAP, CTDD, AAINDEX, BLOSUM62. (2) Adaptive embedding coding mechanism: ADAPTIVE-EMBEDDING. A detailed description of the different encoding schemes follows.

2.2.1. Traditional manual coding

CKSAAP: By calculating the frequency of amino acid pairs separated by k residues in the protein fragment sequence, the K-spaced amino acid pair encoding method (CKSAAP) extracts a feature vector reflecting the interaction of amino acid pairs at a certain interval, and it is widely used in the field of protein bioinformatics [30–32]. The value of k represents the spacing between the two amino acids of a pair in the protein sequence. If k is set to 0, there are 400 amino acid pairs with zero spacing (i.e., AA, AC, AD, …, YY). The feature vector is calculated as follows:

\[
\left( \frac{N_{AA}}{N_{Total}}, \frac{N_{AC}}{N_{Total}}, \frac{N_{AD}}{N_{Total}}, \cdots, \frac{N_{YY}}{N_{Total}} \right)_{400} \tag{1}
\]

where $N_{Total} = l - k - 1$, $l$ is the length of the window, and $N_{AA}, N_{AC}, N_{AD}, \ldots, N_{YY}$ represent the frequency of amino acid pairs in the fragment.

AAINDEX: Amino acid index (AAINDEX) summarizes a total of 500


Fig. 1. Overall framework of MVNN-HNHC.

Fig. 2. Motif conservation analysis of sequence identification of crotonylation and non-crotonylation on human non-histone proteins.

multi-dimensional physical and chemical property values, which represent different physical and chemical properties and enable an intuitive understanding of biochemical reaction information. The physical and chemical properties are from the AAINDEX database (https://www.genome.jp/aaindex/) [33]. In this paper, twelve kinds of physical and chemical properties are selected according to previous research, including net charge, spiral normalized frequency, helical tendency at 44 in T4 lysozyme, the composition of amino acids (AA) in proteins in cells, AA composition of multi-transprotein membrane, the volume of crystal water, accessibility information value, transfer energy, organic solvent/water, AA composition of membrane protein, formation entropy, conformational preference and relative distributive energy of optimization for all beta chains.

CTDD: The Composition, Transition and Distribution (CTD) protocol was originally proposed by Dubchak et al. [34]. Its features represent amino acid distribution patterns for specific structural or physicochemical properties in protein or peptide sequences. The details of the three descriptors are as follows.

The composition (C) descriptor consists of three values: the overall composition (percentage) of polar, neutral and hydrophobic residues in the protein. The composition descriptor is calculated as follows:

\[
C(r) = \frac{N(r)}{N}, \quad r \in \{\text{polar}, \text{neutral}, \text{hydrophobic}\} \tag{2}
\]

where $N(r)$ is the number of amino acids of type $r$ in the encoded sequence and $N$ is the length of the sequence.

Transition (T) represents the percentage frequency with which an amino acid of one category is followed by an amino acid of another category, and the transition descriptor is calculated as follows:

\[
T(r,s) = \frac{N(r,s) + N(s,r)}{N - 1}, \quad (r,s) \in \{(\text{polar}, \text{neutral}), (\text{neutral}, \text{hydrophobic}), (\text{hydrophobic}, \text{polar})\} \tag{3}
\]

where $N(r,s)$ and $N(s,r)$ are the numbers of dipeptides coded as “rs” and “sr” in the sequence, and $N$ is the length of the sequence.

Distribution (D) encodes each amino acid as 1, 2, or 3 according to its category (polar, neutral, or hydrophobic). Measuring the first, 25 %, 50 %, 75 % and 100 % occurrences among the 20 amino acids, the descriptor $E_i$ is defined as:

\[
E_{i}1D_{x} = \frac{P_{1}}{L}; \quad E_{i}25D_{x} = \frac{P_{25}}{L}; \quad E_{i}50D_{x} = \frac{P_{50}}{L}; \quad E_{i}75D_{x} = \frac{P_{75}}{L}; \quad E_{i}100D_{x} = \frac{P_{100}}{L}; \quad i = 1, 2, \cdots, 7; \; x = 1, 2, 3 \tag{4}
\]

where $P_{1}$, $P_{25}$, $P_{50}$, $P_{75}$ and $P_{100}$ measure the position of the first residue and of the 25 %, 50 %, 75 % and 100 % occurrences of $x$, respectively. CTDD is used as the coding scheme for the data. The 13 physical and chemical properties utilized in this work are listed in Table S1.

BLOSUM62: The similarity of any two amino acid fragments is described by comparing the relative frequencies and probability values of amino acid substitutions. The matrix represents the primary sequence information of proteins as a basic feature set. So far, it has been used by many predictors [35,36] and has achieved relatively ideal performance. A matrix of m × n elements represents each sample in the training data set, where n = 29 is the length of the intercepted amino acid sequence and m = 20 is the number of amino acid types.
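
As an illustration of the CKSAAP encoding in Eq. (1), the following Python sketch extracts a 29-residue window around a candidate lysine (Section 2.1) and computes the 400 pair frequencies for a given spacing k. It is a minimal sketch rather than the released implementation: the function names `extract_window` and `cksaap`, and the choice to leave pairs that touch the padding residue "o" uncounted, are assumptions made here for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# The 400 ordered pairs in the order AA, AC, AD, ..., YY.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def extract_window(protein, center, flank=14, pad="o"):
    """Return the (2 * flank + 1)-residue window centred on index `center`.

    Positions that fall outside the protein are filled with `pad`
    (the virtual amino acid "o" described in Section 2.1).
    """
    window = []
    for pos in range(center - flank, center + flank + 1):
        window.append(protein[pos] if 0 <= pos < len(protein) else pad)
    return "".join(window)


def cksaap(fragment, k=0):
    """k-spaced amino acid pair frequencies of `fragment`, as in Eq. (1)."""
    n_total = len(fragment) - k - 1            # N_Total = l - k - 1
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(n_total):
        pair = fragment[i] + fragment[i + k + 1]
        if pair in counts:                     # assumption: padded pairs are skipped
            counts[pair] += 1
    return [counts[p] / n_total for p in PAIRS]
```

For a window of length 29 and k = 0, N_Total is 28, so each of the 400 entries is a pair count divided by 28.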
2.2.2. Adaptive embedding coding mechanism
By learning the marker vector information and positional information for each amino acid type in a protein fragment sequence, we integrated the two by the position of the up table in the whole sequence and


letters [27]. Finally, the integrated marker information and position information of the twenty amino acids are mapped to a randomly initialized vector; the fused vector participates in the training of the model and is updated adaptively through the gradients computed by backpropagation.

2.3. Depth feature extraction

Classification algorithms based on deep learning have been widely used in bioinformatics, and their effect has been confirmed by relevant research results. In the proposed method, we chose a convolutional neural network (CNN), a bidirectional long short-term memory network (BLSTM) and an attention network for multi-directional and deep feature extraction. First, the ability of CNN to abstract features from short sequences was used to extract deep, high-dimensional features from the encoded input data [37]. Then, BLSTM synthesizes the high-dimensional features of short sequences and processes the temporal data with local correlation. The attention module reduces the influence of non-significant features in the final model [38]. The general flow chart of the paper is shown in Fig. 3.

CNN: In this experiment, we used different convolutional neural network structures for the five different feature coding schemes. The general structure includes five input layers, five 1-dimensional convolutional layers, and output layers. Depending on the differences in feature coding, we chose different filters to better fit our model and to avoid overfitting, in which the model performs well on the training set but not satisfactorily on the test set. After the convolution operation, we set the dropout rate to 0.5 to avoid overfitting. The convolutional layer and the ReLU activation function are expressed by formulas (5) and (6).

\[
Y^{p} = f\left( \sum_{d=1}^{D} W^{p,d} \otimes X^{d} + b^{p} \right) \tag{5}
\]

where $X^{d}$ represents the input feature map, $1 \le d \le D$, and $D$ represents the depth of the input feature mapping group. $Y^{p}$ represents the output feature map, $1 \le p \le P$, and $P$ represents the depth of the output feature mapping group. $W^{p,d}$ represents the parameters of the convolutional kernels, and $b^{p}$ represents a scalar bias.

The nonlinear activation function is an essential part of the neural network. After the convolution layer, we chose the ReLU activation function to process the convolution output, to increase the nonlinear fitting ability of the neural network and improve the training efficiency of the model. It is calculated as follows:

\[
\mathrm{ReLU}(X) = \begin{cases} 0, & \text{if } X < 0 \\ X, & \text{else} \end{cases} \tag{6}
\]

BLSTM: The convolutional neural network does not consider the connections within a sequence, whereas the recurrent neural network (RNN) can model the dependence between earlier and later positions of sequence data. Thus, CNN and RNN are often combined, and the combination is widely used in the bioinformatics field, where its effect has been robustly confirmed. The RNN is susceptible to sequence length constraints and has insufficient ability to learn and preserve information over the long term. LSTM is a particular type of RNN, which increases the storage capacity of the network and avoids the gradient vanishing and gradient explosion problems. However, an LSTM can only transfer information in one direction. To capture the contextual information of the sequence and improve the computational efficiency of the model, we chose the bidirectional network BLSTM to extract the dependence information of the sequence around lysine sites. Its formulas are as follows:

\[
\tilde{c}_{t} = \tanh(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}) \tag{7}
\]

\[
i_{t} = \sigma(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}) \tag{8}
\]

\[
f_{t} = \sigma(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}) \tag{9}
\]

\[
o_{t} = \sigma(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}) \tag{10}
\]

\[
c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} \tag{11}
\]

\[
h_{t} = o_{t} \odot \tanh(c_{t}) \tag{12}
\]

where $h_{t}$ and $h_{t-1}$ represent the external state of the hidden layer at times $t$ and $t-1$, respectively; $c_{t}$ represents the introduced memory unit; $i_{t}$, $f_{t}$ and $o_{t}$ represent the input gate, forget gate and output gate, respectively; $\sigma$ denotes the logistic sigmoid function; and $W_{*}$, $U_{*}$ and $b_{*}$ represent the learnable parameters of the network, where $*$ can take $i$, $o$, $f$, $c$.

Attention: After the deep feature selection of CNN and BLSTM, a large amount of output information is collected. This information is further integrated and passed to the attention layer, where the weight parameters of the attention network emphasize the information to attend to and discard unimportant information, solving the problem of information overload in the model.

\[
X = \left[ X_{ck}, X_{aa}, X_{ct}, X_{bl}, X_{ad} \right] \tag{13}
\]

Fig. 3. Flow chart of the MVNN-HNHC predictor. Model development based on CNN, BLSTM, and Attention.
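
The BLSTM and attention modules named in Fig. 3 rest on two small computations: one LSTM time step (Eqs. (7)-(12)) and scaled dot-product attention (Eq. (17)). The NumPy sketch below shows both under stated assumptions: the gates use the logistic sigmoid, the parameters are passed as dictionaries keyed by i, f, o and c, and the rows of Q, K and V index positions. These names and layouts are hypothetical and are not taken from the released code.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid, assumed here as the gate activation."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following Eqs. (7)-(12).

    W, U and b are dicts keyed by "i", "f", "o", "c" (hypothetical layout).
    """
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate memory
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                          # memory update
    h_t = o_t * np.tanh(c_t)                                    # hidden state
    return h_t, c_t


def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

A bidirectional layer would run `lstm_step` over the sequence once in each direction and concatenate the two hidden states at every position.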


\[
Q = W_{q} X \tag{14}
\]

\[
K = W_{k} X \tag{15}
\]

\[
V = W_{v} X \tag{16}
\]

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_{k}}} \right) V \tag{17}
\]

where $X_{ck}$, $X_{aa}$, $X_{ct}$, $X_{bl}$ and $X_{ad}$ are the outputs of the CNN-BLSTM module for CKSAAP, AAINDEX, CTDD, BLOSUM62 and ADAPTIVE, respectively. $X \in \mathbb{R}^{D_{k} \times N}$ represents the input sequence, and $Q$, $K$ and $V$ represent the query vector, key vector and value vector, respectively.

Full connection layer: After the attention network, the model uses a fully connected layer to predict the label of samples. We apply the sigmoid activation function to the output of the fully connected layer to calculate the final probability score. The sigmoid activation function is expressed as follows:

\[
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{18}
\]

MVNN-HNHC uses the binary cross-entropy as the loss function of the model to measure the difference between the label and the output.

2.4. Model performance evaluation

In this experiment, we used an independent test for evaluation and several indicators for performance measurement: accuracy (ACC), sensitivity (Sn), specificity (Sp) and Matthews correlation coefficient (MCC). The relevant definitions are as follows:

\[
Sn = \frac{TP}{TP + FN} \tag{19}
\]

\[
Sp = \frac{TN}{TN + FP} \tag{20}
\]

\[
ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{21}
\]

\[
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{22}
\]

where TP, FP, FN and TN represent the numbers of true positives, false positives, false negatives and true negatives, respectively. MCC is a more balanced index for measuring classification performance in binary classification problems, because it takes true positives, true negatives, false positives and false negatives into account together. The value of MCC ranges from −1 to 1, and the higher the value, the better the prediction performance.

2.5. Model parameter setting

Throughout the procedure, we combined multiple strategies to train the model. First, the ReLU [39] activation function is used after each convolution operation to speed up convergence, and a normalization operation is used to counter gradient vanishing or gradient explosion. Multiple dropout layers were used to reduce overfitting and improve the generalization ability of the model [40]. The dropout rate was set to 0.5. Furthermore, we used the Adam optimizer [41] for model optimization. Different learning rates were tried, and the final value was set to 0.001; the learning rate controls the step size of parameter updates during neural network training.

3. Results and analysis

3.1. Analysis of multi-view feature coding schemes

Five feature-encoding schemes were used in this paper. To analyze the contribution of each coding scheme to the overall performance, each coding scheme was removed in turn, and the results of the remaining four coding schemes were fused. As shown in Fig. 4, on the independent test set the adaptive coding mechanism clearly contributes the most among the five feature-encoding schemes. Without the adaptive encoding scheme, the sensitivity, accuracy and Matthews correlation coefficient decreased from 80.06 %, 77.06 % and 0.5203 to 63.46 %, 73.34 % and 0.3949, respectively. Therefore, the adaptive encoding method can mine the important feature information around crotonylation sites. In this experiment, the feature codings AAINDEX and CTDD, which are based on physicochemical properties, were also significant for feature extraction. Without these two feature coding methods, the sensitivity and specificity values would be unbalanced. Without AAINDEX, the Sn value was almost unchanged, the Sp value decreased from 75.77 % to 74.3 %, the ACC value decreased from 77.06 % to 76.2 %, and the MCC decreased by about 1 %. Without CTDD, the Sp value decreased from 75.77 % to 71 %, the MCC decreased by 2 %, and the ACC value decreased from 77.06 % to 74.8 %. Therefore, the physicochemical properties play a critical role in crotonylation site identification. To demonstrate the effect of sequence differences on feature extraction, we removed the CKSAAP feature encoding. The results show that Sp, Matthews correlation coefficient and accuracy decreased by about 5.52 %, 2 % and 3 %, respectively. Feature coding based on sequence information therefore clearly has some influence on the experimental results. To further observe the sequence difference between crotonylation and non-crotonylation, we used the Two-Sample-Logo software to analyze the position specificity of the positive and negative samples and the distribution of the sequences around the positive samples. As shown in Fig. 2, residues around K were mainly concentrated between −9 and 9, with clear sequence differences between the positive and negative samples, which confirms that CKSAAP feature coding, being based on this sequence feature, had a positive effect on the outcome. Moreover, in this experiment, BLOSUM62 had little effect on the overall results, but the presence of this encoding still contributed to the overall performance of the model.

Fig. 4. Performance comparison of WITHOUT_CKSAAP, WITHOUT_AAINDEX, WITHOUT_CTDD, WITHOUT_BLOSUM62, WITHOUT_ADAPTIVE and MVNN-HNHC.


In conclusion, the five feature coding schemes, based on traditional manual coding and adaptive coding, all had positive effects on the encoding results, and none of them could be omitted. Integrating multiple encoding methods characterizes protein sequence information more effectively.

3.2. Ablation study of different model structures

According to the previous researches, most of the frameworks were


relatively simple and the performances were limited, and there was still
a lot of room for improvement. In this paper, the framework include the
convolutional neural network, bidirectional long and short term mem­ Fig. 6. Visualization results before and after depth feature extraction. (a) is
ory network and attention mechanism architecture. Bidirectional long before feature extraction, and (b) is after feature extraction.
and short term memory network could make up for the shortcomings of
convolutional neural networks that do not consider the dependence model, and then used the same independent tests set to evaluate the
between sequences, improving the ability of sequence analysis. Atten­ performance of the other state-of-the-art models, including Deep-Kcr,
tion mechanism can screen favorable information for us and improve the Bert-Kcr, Adapt-Kcr and nh-Kcr.
efficiency of the model. In order to verify the effectiveness of the com­ As can be seen from Table 1, Sp, MCC and ACC are all higher than the
bination of the three, we conducted relevant ablation studies based on other models. Compared with other models, nh-Kcr has the highest Sn
the same independent test set. Under the condition of unified feature value and AUC value, but there is a serious imbalance in the prediction
coding modules, model architectures were added one by one. It was of positive and negative samples of this model. There is a 29.35 % dif­
proved that each module of convolutional neural network, bidirectional ference between Sn and Sp. The model proposed in this paper has no
long and short term memory network and attention mechanism has obvious deviation in the prediction of positive and negative samples,
certain influence on the result. The results were shown in Fig. 5. and the prediction results of positive and negative samples are roughly
Compared with convolutional neural network, the specificity did not balanced. Therefore, it is an efficient and reliable prediction tool for
change after the addition of bidirectional long and short term memory solving site identification problems, with high robustness and general­
network, but the sensitivity, accuracy and Matthew correlation coeffi­ ization ability. To more clearly demonstrate the differences between
cient increased by 3 %, 0.3 % and 2 %, respectively. On the basis of the different approaches, we selected a mouse protein for crotonylation site
bidirectional long and short term memory network, we further added prediction and visualized the results, shown in Fig. 7.
the attention mechanism, and the overall experimental effect was As shown in Fig. 7, green was the correctly classified crotonylation
further improved. Therefore, the attention mechanism helps to lock in important information efficiently and quickly in sequence analysis problems. To verify the effectiveness of the architecture in identifying Kcr sites, t-SNE was used to visualize the encoded output and the output after feature extraction, as shown in Fig. 6. Without feature extraction by the deep learning modules, the positive and negative samples overlapped almost completely and were densely distributed over the whole area. After feature extraction by the deep learning modules, however, the positive and negative samples were clearly separated, and the different types of sites could be distinguished. Therefore, the combination of a convolutional neural network, a bidirectional long short-term memory network, and an attention mechanism can effectively extract deep feature information for crotonylation site prediction.

3.3. Comparison with state-of-the-art models

To demonstrate the superiority of our proposed approach, we compared our model with state-of-the-art approaches. For fairness, we chose the same training set to train the

lysine site, and red was the misclassified crotonylation lysine site. Our model predicted a slightly higher proportion of correct sites than the other models. Therefore, our model had a good site classification effect not only for human proteins but also for other organisms.

3.4. Analysis of model on histone proteins and non-histone proteins

In previous studies of site identification, most of the proposed models focused on histone proteins, and non-histone proteins were less well understood. Therefore, the MVNN-HNHC model was designed for non-histone proteins, and it achieved good results. To further assess the robustness of our model, we used the same number of positive and negative histone samples as a test set. Although the performance was not as good as on non-histones, the difference was minimal. Furthermore, we fed a mixture of the histone and non-histone test datasets into the model; the results are shown in Fig. 8. The performance on the mixed set was not as good as on a single type of sample: compared with non-histone proteins, Sn, Sp and ACC decreased by 12 %, 2.3 % and 2 %, respectively, and MCC decreased from 0.52 to 0.44. A possible reason is that different types of proteins have their own unique properties owing to differences in the environments in which they are generated. Our model therefore needs to be further strengthened.
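The Sn, Sp, ACC and MCC values discussed in this comparison all derive from the four confusion-matrix counts. The following is a minimal sketch, assuming the standard definitions of these metrics; the function name `classification_metrics` and the counts in the usage lines are hypothetical illustrations and are not taken from the paper's test sets.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts.

    tp/fn: positive (crotonylation) sites predicted correctly/incorrectly.
    tn/fp: negative sites predicted correctly/incorrectly.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, chosen only to illustrate the calculation.
m = classification_metrics(tp=80, fn=20, tn=150, fp=50)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because MCC uses all four counts, it can fall noticeably even when ACC changes little, which is one reason to report it alongside Sn, Sp and ACC when test sets of different composition are compared.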

Table 1
Performance comparison of different models.

Model       Sn        Sp        ACC       MCC     AUC
MVNN-HNHC   80.06 %   75.77 %   77.06 %   0.5203  0.7792
Adapt-Kcr   81.73 %   71.20 %   74.36 %   0.4879  0.7709
Deep-Kcr    60.13 %   74.02 %   69.86 %   0.3257  0.6841
nh-Kcr      92.86 %   63.51 %   72.32 %   0.5179  0.7819
Bert-Kcr    82.00 %   70.05 %   73.64 %   0.4790  0.7695
Fig. 5. The effect comparison of different model structures.
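As a reading aid for Table 1, the reported Sn, Sp and ACC values can be checked for internal consistency. Assuming the standard definitions, ACC is a weighted average of Sn and Sp, with the weight equal to the fraction p of positive samples in the test set: ACC = p·Sn + (1 − p)·Sp. The sketch below solves this for p in each row; the helper name `implied_positive_fraction` is hypothetical, and the resulting fraction is only what the table values imply under that assumption, not a figure stated in this section.

```python
# Sn, Sp and ACC (in percent) copied from Table 1.
TABLE_1 = {
    "MVNN-HNHC": (80.06, 75.77, 77.06),
    "Adapt-Kcr": (81.73, 71.20, 74.36),
    "Deep-Kcr": (60.13, 74.02, 69.86),
    "nh-Kcr": (92.86, 63.51, 72.32),
    "Bert-Kcr": (82.00, 70.05, 73.64),
}


def implied_positive_fraction(sn, sp, acc):
    """Solve ACC = p * Sn + (1 - p) * Sp for the positive fraction p."""
    return (acc - sp) / (sn - sp)


fractions = {
    name: implied_positive_fraction(sn, sp, acc)
    for name, (sn, sp, acc) in TABLE_1.items()
}

for name, p in fractions.items():
    print(f"{name}: implied positive fraction = {p:.3f}")
```

If all rows give nearly the same fraction, the rows are mutually consistent with evaluation on one shared test set, which matches the fairness argument made for this comparison.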


Fig. 7. Comparison of the predictions of different models: (a) MVNN-HNHC, (b) Deep-Kcr, (c) Bert-Kcr, (d) Adapt-Kcr, and (e) nh-Kcr.

Fig. 8. Comparison of the results of histone, non-histone, and a mixture of both.

4. Conclusion

This paper proposed a site recognition model based on multi-view encoding features and adaptive encoding features for crotonylation prediction. The model integrated sequence-based features, physicochemical features, protein site-specific scoring matrix, adaptive coding and other feature representation methods. Different types of experiments confirmed that combining traditional manual coding with adaptive coding, and using the computer to assist human recognition, could effectively mine discriminant features and identify useful information unknown to humans. The combined features gave better results than manual coding or adaptive coding alone. The convolutional neural network was used to characterize the local information of the sequence, and the long short-term memory network was used to capture the connections in the context information. Finally, the attention network was used to further screen the obtained information. The integration of these modules increased the complexity of the model, but the model showed higher performance than previously proposed models. These results showed that our proposed model is a good site identification tool. Although the prediction results were good, much work remains. For example, the model is not very interpretable, and its performance when multiple types of data sets are combined into one test set needs to be improved. In future work, we will address these shortcomings and provide an improved model.

CRediT authorship contribution statement

Jun Gao: Data curation, Methodology, Writing – original draft. Yaomiao Zhao: Validation, Visualization. Chen Chen: Conceptualization, Investigation. Qiao Ning: Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

We have shared the link to our data and source code on a website; the link is attached in the paper.

Acknowledgement

This work has been supported by the National Natural Science Foundation of China (62302075, 62002039) and the Fundamental Research Funds for the Central Universities (3132023265, 3132023255, 3132023257).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ab.2023.115426.


Jun Gao is a postgraduate student in information science and technology, Dalian Maritime University. Her research interests include disease and noncoding RNAs, protein sites prediction and semi-supervised learning.

Yaomiao Zhao is a postgraduate student in information science and technology, Dalian Maritime University. Her research interests include miRNA-disease association prediction, protein sites prediction and machine learning.

Chen Chen received the BS and Ph.D degree from the college of mechanical engineering, Dalian University of Technology, China, in 2020. He is currently a lecturer at Dalian Maritime University, Dalian. He focuses on intelligent manufacturing and machine learning.


Qiao Ning received the BS and PhD degrees from the School of Information Science and Technology, Northeast Normal University, China, in 2019. She is currently a lecturer with the Department of Information Science and Technology, Dalian Maritime University, Dalian. Her research interests include machine learning and bioinformatics.
