Analytical Biochemistry
journal homepage: www.elsevier.com/locate/yabio
ARTICLE INFO

Index Terms:
Multi-view neural network
Non-histone
Crotonacylation sites
Adaptive encoding features

ABSTRACT

Crotonylation on lysine sites in human non-histone proteins plays a crucial role in biological activities. However, because traditional experimental methods for crotonylation site identification are time-consuming and labor-intensive, computational prediction methods have become increasingly popular in recent years. Despite its significance, crotonylation site prediction has received less attention in non-histone proteins than in histones. In this study, we propose a Multi-View Neural Network for identification of Human Non-Histone Crotonylation sites, named MVNN-HNHC. MVNN-HNHC integrates multi-view encoding features and adaptive encoding features through a multi-channel neural network to learn the attribute differences between crotonylation sites and non-crotonylation sites from various aspects. In MVNN-HNHC, convolutional neural networks obtain local information from these features, and bidirectional long short-term memory networks extract sequence information. We then employ an attention mechanism to fuse the outputs of the various feature extraction modules. Finally, a fully connected network acts as the classifier to predict whether a lysine site is a crotonylation site or a non-crotonylation site. On the independent test set, sensitivity, specificity, accuracy, Matthews correlation coefficient, and area under the curve (AUC) reach 80.06 %, 75.77 %, 77.06 %, 0.5203, and 0.7792, respectively. To verify the effectiveness of this method, we carry out a series of experiments, and the results show that MVNN-HNHC is an effective tool for predicting crotonylation sites in non-histone proteins. The data and code are available at https://github.com/xbbxhbc/junjun0612.git.
1. Introduction

Lysine crotonylation is a recently discovered post-translational modification (PTM) in which a small modifying group is covalently bound to specific lysine sites of the substrate protein [1,2]. Crotonylation alters the structure-function relationship of proteins and thereby affects various physiological and pathological processes in the organism [3–7]. According to previous studies, there are more than 600 types of protein post-translational modification in eukaryotes. Lysine modifications that occur on non-histone proteins are reported to be closely related to cell signaling, protein activity regulation, and protein transport [8–12]. With the progress of research technology for protein post-translational modification, more and more lysine sites have been identified, and a richer variety of histone lysine modifications has been discovered [13]. Zhao et al. first proposed a model for identifying eight new types of PTM within lysine acylation: propionylation, crotonylation, succinylation, butyrylation, malonylation, glutarylation, 2-hydroxyisobutyrylation and 3-hydroxybutyrylation [14]. Although acylation has been intensively studied, few studies have addressed crotonylation sites, especially those in non-histone proteins. Lysine crotonylation plays a key role in many physiological processes, such as development, metabolism and disease [15]. Therefore, we focus on the accurate identification of lysine crotonylation sites on human non-histone proteins.

To better understand the function of lysine crotonylation sites and their related roles, many researchers have carried out lysine-related experiments in recent years; accurately locating the crotonylation sites is the first and key step for this follow-up work. Research on this problem has been carried out mainly through experimental and computational methods
* Corresponding author.
** Corresponding author.
E-mail addresses: Chen_Chen@dlmu.edu.cn (C. Chen), ningq669@dlmu.edu.cn (Q. Ning).
https://doi.org/10.1016/j.ab.2023.115426
Received 28 August 2023; Received in revised form 21 November 2023; Accepted 6 December 2023
Available online 22 December 2023
0003-2697/© 2023 Elsevier Inc. All rights reserved.
J. Gao et al. Analytical Biochemistry 687 (2024) 115426
[9]. With the progress of proteomics technology, various related techniques have been adopted, such as high-performance liquid chromatography (HPLC) fractionation, isotopic labeling, affinity enrichment, and high-performance liquid chromatography tandem mass spectrometry (MS) [16]. However, experimental methods require a great deal of manpower, are complex to design, and have long cycles and high costs, so they are difficult to apply widely across many species [8]. By contrast, computational methods for site identification have the advantages of short running time and high accuracy. Therefore, we shifted our focus to designing computational approaches to identify lysine crotonylation sites.

To date, a number of computational methods have been developed to predict protein lysine crotonylation (Kcr) sites. Huang and Zeng [17] proposed the first predictor of Kcr sites, named CrotPred, based on the hypothesis that peptides carrying crotonylation were generated by different hidden Markov models. Qiu et al. [18] proposed a method that uses position-weighted amino acid composition for feature coding and a support vector machine as the classifier to predict crotonylation sites. Malebary et al. [19] developed a computational predictor called iCrotoK-PseAAC, a model that incorporates position-relative and composition-relative features as well as statistical matrices into the pseudo-amino acid composition to identify Kcr sites. None of these methods provides an online server, which is inconvenient for biologists, so there is still a lot of room for improvement. Subsequently, Ju et al. [20] proposed the CKSAAP_CrotSite model, selecting k-spaced amino acid pairs as the feature coding scheme from among amino acid frequency, amino acid factor, bi-profile Bayes, binary encoding and k-spaced amino acid pairs. Qiu et al. [21] reported a new predictor, iKcr-PseEns, established by coupling five layers of amino acid pairs to the general pseudo-amino acid composition. In these reports, the researchers used different techniques such as position weight matrices, support vector machines, K-nearest neighbors and many others. However, the maximum predictive accuracy achieved by these techniques was not very high. To improve performance, Liu et al. [22] took into account sequence-based features, physicochemical properties and evolution-derived features of protein sequences, adopted five feature extraction methods, and employed ElasticNet to reduce the dimension of the original feature space. The synthetic minority over-sampling technique (SMOTE) was then used to address the data imbalance problem, and finally a LightGBM classifier was used to predict Kcr sites. Lv et al. [23] developed a deep learning method called Deep-Kcr, which fused multiple types of features and used a convolutional neural network for feature extraction. Chen et al. [24] first carried out a comprehensive review of six methods for predicting crotonylation sites and then proposed a new method named nh-Kcr. Its deep learning framework, called CNNrgb, uses the amino acid index, binary encoding and BLOSUM62 encoding schemes as the red, green and blue color channel matrices of a convolutional neural network, respectively, for benchmark testing. Qiao et al. [25] proposed a new predictor, Bert-Kcr, developed using a transfer learning approach and a pretrained bidirectional encoder representations from transformers (BERT) model for protein Kcr site prediction. Dou et al. [26] constructed a convolutional neural network framework called iKcr_CNN and used the focal loss function instead of standard cross entropy to optimize the model for identifying human non-histone Kcr modifications. Li et al. [27] established a new predictor, Adapt-Kcr, a relatively advanced end-to-end deep learning model of recent years. It uses adaptive embedding and captures important information with a convolutional neural network, a bidirectional long short-term memory network and an attention structure. It performs well and remains a strong model to compete against.

Although the previously proposed models all performed well in predicting lysine crotonylation sites, some aspects can still be improved. First, most past models were developed and trained on histones, while research on non-histone proteins is still lacking. It is necessary to design a well-performing model for the identification of Kcr sites in non-histone proteins. Second, for feature coding, previous researchers relied either on traditional manual coding or on adaptive coding, but neglected to integrate the two to obtain better information from different angles. Computers can help extract useful information beyond what is already known. Therefore, drawing on the experience of previous work, we propose a deep learning framework for identifying non-histone Kcr sites, named MVNN-HNHC. For each type of feature coding, convolutional neural networks and bidirectional long short-term memory networks are applied to extract features and reduce dimensions, so as to obtain more valuable information. Finally, the outputs are integrated, and an attention network is used to pick out the key information. Compared with other existing methods, the proposed model gives better prediction results. The framework for this work is shown in Fig. 1.

2. Materials and methods

2.1. Construction of the benchmark data set

In this study, we used the same dataset as Chen et al., which contains a large number of experimentally verified human non-histone Kcr sites. First, 19287 Kcr sites were obtained from 4230 non-histone proteins in the UniProt database. To remove redundant protein sequences, CD-HIT [28] was applied with a 30 % sequence identity threshold. To determine the size of the sliding window, the Two Sample Logo software [29] was used to analyze the position specificity of positive and negative samples and the distribution of residues around positive samples. As shown in Fig. 2, residues around the central lysine were mainly concentrated between positions −10 and 10, and there were obvious sequence differences between crotonylation and non-crotonylation samples. To avoid omitting information, and following previous studies [24], the sliding window for extracting protein sequence fragments was set to 29 residues (positions −14 to +14 around the central K). If the central lysine was crotonylated, the fragment was regarded as a positive sample. Note that if a fragment is shorter than the window, the virtual amino acid "o" is used to fill the missing positions.

2.2. Multi-view feature encoding scheme

The choice of feature coding scheme also has a great impact on the prediction results and is an important step in the whole model framework. We select five coding schemes suitable for this model from manual coding and adaptive coding, and divide them into two categories: (1) Traditional manual coding: CKSAAP, CTDD, AAINDEX, BLOSUM62. (2) Adaptive embedding coding: ADAPTIVE-EMBEDDING. The encoding schemes are described in detail below.

2.2.1. Traditional manual coding

CKSAAP: The composition of k-spaced amino acid pairs (CKSAAP) encoding counts the frequency of amino acid pairs separated by k residues in the protein fragment, yielding a feature vector that reflects the interaction of amino acid pairs at a given spacing. It is widely used in protein bioinformatics [30–32]. The value of k is the spacing between the two amino acids of a pair. If k is set to 0, there are 400 possible amino acid pairs with zero spacing (i.e., AA, AC, AD, …, YY). The feature vector is calculated as follows:

$$\left( \frac{N_{AA}}{N_{Total}}, \frac{N_{AC}}{N_{Total}}, \frac{N_{AD}}{N_{Total}}, \cdots, \frac{N_{YY}}{N_{Total}} \right)_{400} \tag{1}$$

where $N_{Total} = l - k - 1$, $l$ is the length of the window, and $N_{AA}$, $N_{AC}$, $N_{AD}$, …, $N_{YY}$ represent the frequency of the corresponding amino acid pairs in the fragment.

AAINDEX: Amino acid index (AAINDEX) summarizes a total of 500
Fig. 2. Motif conservation analysis of sequence identification of crotonylation and non-crotonylation on human non-histone proteins.
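The fragment construction of Section 2.1 and the CKSAAP encoding of Eq. (1) can be sketched together as follows. This is a minimal illustration, not the authors' code: the function names `lysine_windows` and `cksaap` are hypothetical, and the choice to leave pairs containing the virtual residue "o" out of the 400 bins is an assumption, since the text does not say how padded positions are counted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by code
PAD = "o"                             # virtual residue named in Section 2.1
FLANK = 14                            # 14 residues per side gives a window of 29


def lysine_windows(sequence, flank=FLANK, pad=PAD):
    """Yield (position, fragment) for every lysine (K) in `sequence`.

    `position` is the 0-based index of the central K. Fragments that would run
    past either end of the protein are filled with `pad`, so every fragment
    has length 2 * flank + 1.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + flank in `padded`, so the window
            # centred on it is padded[i : i + 2 * flank + 1].
            yield i, padded[i:i + 2 * flank + 1]


def cksaap(fragment, k=0):
    """Return the 400-dimensional vector of Eq. (1) for a single spacing k.

    Each entry is N_xy / N_total with N_total = l - k - 1. Pairs that contain
    the pad symbol fall into none of the 400 bins (assumption of this sketch).
    """
    n_total = len(fragment) - k - 1
    if n_total <= 0:
        raise ValueError("fragment is too short for this value of k")
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # AA, AC, ..., YY
    counts = dict.fromkeys(pairs, 0)
    for i in range(n_total):
        pair = fragment[i] + fragment[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / n_total for p in pairs]


if __name__ == "__main__":
    toy_protein = "MKAAKTW"  # made-up sequence, for illustration only
    for pos, frag in lysine_windows(toy_protein):
        vec = cksaap(frag, k=0)
        print(pos, frag, len(vec))
```

Repeating `cksaap` for several values of k and concatenating the results would give a longer vector; which values of k the model uses is not stated in this section.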
Fig. 3. Flow chart of the MVNN-HNHC predictor. Model development based on CNN, BLSTM, and Attention.
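The attention stage named in the caption fuses the outputs of the feature extraction branches. The exact attention formulation is not given in this part of the text, so the sketch below uses plain dot-product attention as a stand-in: each branch output is scored against a query vector, the scores are normalized with softmax, and the fused vector is the weighted sum. The function `attention_fuse` and the `query` vector (which would be learned in a real model) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(branch_outputs, query):
    """Fuse equal-length branch vectors by dot-product attention.

    branch_outputs: list of vectors, one per feature extraction branch.
    query: vector of the same length (hypothetical learned parameter).
    Returns (fused_vector, attention_weights).
    """
    dim = len(query)
    if any(len(branch) != dim for branch in branch_outputs):
        raise ValueError("all branch outputs must match the query length")
    # One scalar relevance score per branch.
    scores = [sum(h * q for h, q in zip(branch, query)) for branch in branch_outputs]
    weights = softmax(scores)
    # Weighted sum of the branch vectors, computed coordinate by coordinate.
    fused = [
        sum(w * branch[j] for w, branch in zip(weights, branch_outputs))
        for j in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # Two made-up branch outputs; a zero query weights them equally.
    fused, weights = attention_fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(fused, weights)
```

In a trained network the branch vectors would come from the CNN and BLSTM modules and the fused vector would feed the fully connected classifier; here they are plain lists so the weighting step can be read in isolation.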
Table 1
Performance comparison of different models.
Sn Sp ACC MCC AUC
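For reference, the first four column headings (Sn, Sp, ACC, MCC) follow the standard confusion-matrix definitions, sketched below. The function name and the counts in the example are illustrative only and are not taken from the paper's results; AUC is omitted because it needs the ranked prediction scores rather than the four counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts.

    tp/fn: crotonylation sites predicted as positive/negative.
    tn/fp: non-crotonylation sites predicted as negative/positive.
    Ratios with an empty denominator are reported as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sn = ratio(tp, tp + fn)                    # sensitivity (recall)
    sp = ratio(tn, tn + fp)                    # specificity
    acc = ratio(tp + tn, tp + tn + fp + fn)    # accuracy
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_denominator)
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(classification_metrics(tp=8, tn=7, fp=3, fn=2))
```

Because MCC uses all four counts, it is less affected by class imbalance between positive and negative sites than accuracy.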
Fig. 7. Performance comparison of different models: (a) MVNN-HNHC, (b) Deep-Kcr, (c) Bert-Kcr, (d) Adapt-Kcr, and (e) nh-Kcr.
4. Conclusion

This paper proposed a site recognition model based on multi-view encoding features and adaptive encoding features for crotonylation prediction. The model integrated sequence-based features, physicochemical features, a protein site-specific scoring matrix, adaptive coding and other feature representation methods. Different types of experiments further confirmed that combining traditional manual coding with adaptive coding, and using the computer to assist human recognition, could effectively mine discriminative features and identify useful information not apparent to humans. Compared with manual coding or adaptive coding alone, the combined features gave better results. The convolutional neural network was used to characterize the local information of the sequence, and the long short-term memory network was used to capture the relationships in the contextual information. Finally, the attention network was used to further screen the obtained information. Integrating these modules increased the complexity of the model, but it achieved higher performance than previously proposed models. These results showed that the proposed model is a good site identification tool. Although the predictions were satisfactory, several issues still need further work. For example, the model is not very interpretable, and testing on an integration of multiple types of data sets needs to be improved. In future work, we will address these shortcomings and provide an improved model.

Data availability

The link to the data and source code is provided in the paper.

Acknowledgement

This work has been supported by the National Natural Science Foundation of China (62302075, 62002039) and the Fundamental Research Funds for the Central Universities (3132023265, 3132023255, 3132023257).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ab.2023.115426.
References

[1] R.L. Soffer, Post-translational modification of proteins catalyzed by aminoacyl-tRNA-protein transferases, Mol. Cell. Biochem. 2 (1) (1973) 3–14.
[2] F. Wold, In vivo chemical modification of proteins (post-translational modification), Annu. Rev. Biochem. 50 (1) (1981) 783–814.
[3] R. Fellows, J. Denizot, C. Stellato, A. Cuomo, P. Jain, E. Stoyanova, P. Varga-Weisz, Microbiota derived short chain fatty acids promote histone crotonylation in the colon through histone deacetylases, Nat. Commun. 9 (1) (2018) 105.
[4] H. Huang, D. Zhang, Y. Wang, M. Perez-Neut, Z. Han, Y.G. Zheng, Q. Hao, Y. Zhao, Lysine benzoylation is a histone mark regulated by SIRT2, Nat. Commun. 9 (1) (2018) 3374.
[5] G. Jiang, D. Nguyen, N.M. Archin, S.A. Yukl, G. Méndez-Lagares, Y. Tang, HIV latency is reversed by ACSS2-driven histone crotonylation, J. Clin. Invest. 128 (3) (2018) 1190–1198.
[6] S. Liu, H. Yu, Y. Liu, X. Liu, Y. Zhang, C. Bu, S. Yuan, Z. Chen, G. Xie, W. Li, B. Xu, J. Yang, L. He, Chromodomain protein CDYL acts as a crotonyl-CoA hydratase to regulate histone crotonylation and spermatogenesis, Mol. Cell 67 (5) (2017) 853–866, e855.
[7] O. Ruiz-Andres, M.D. Sanchez-Niño, P. Cannata-Ortiz, M. Ruiz-Ortega, J. Egido, A. Ortiz, Histone lysine crotonylation during acute kidney injury in mice, Dis. Models Mech. 9 (6) (2016) 633–645.
[8] H. Huang, D.L. Wang, Y. Zhao, Quantitative crotonylome analysis expands the roles of p300 in the regulation of lysine crotonylation pathway, Proteomics 18 (2018), e1700230.
[9] W. Wei, A. Mao, B. Tang, Q. Zeng, S. Gao, Large-scale identification of protein crotonylation reveals its role in multiple cellular functions, J. Proteome Res. 16 (2017) 1743–1752.
[10] Q. Wu, W. Li, C. Wang, P. Fan, L. Cao, Z. Wu, Ultradeep lysine crotonylome reveals the crotonylation enhancement on both histones and nonhistone proteins by SAHA treatment, J. Proteome Res. 16 (2017) 3664–3671.
[11] W. Xu, J. Wan, J. Zhan, X. Li, H. He, Z. Shi, H. Zhang, Global profiling of crotonylation on non-histone proteins, Cell Res. 27 (2017) 946–949.
[12] H. Yu, C. Bu, Y. Liu, T. Gong, X. Liu, S. Liu, X. Peng, W. Zhang, Y. Peng, J. Yang, L. He, Y. Zhang, Global crotonylome reveals CDYL-regulated RPA1 crotonylation in homologous recombination-mediated DNA repair, Sci. Adv. 6 (2020), eaay4697.
[13] M. Tan, H. Luo, S. Lee, F. Jin, J.S. Yang, E. Montellier, T. Buchou, Z. Cheng, S. Rousseaux, N. Rajagopal, Z. Lu, Z. Ye, Q. Zhu, J. Wysocka, Y. Ye, S. Khochbin, B. Ren, Y. Zhao, Identification of 67 histone marks and histone lysine crotonylation as a new type of histone modification, Cell 146 (2011) 1016–1028.
[14] R.G. Krishna, F. Wold, Post-translational modification of proteins, Adv. Enzymol. Relat. Area Mol. Biol. 67 (1993) 265–298.
[15] B.R. Sabari, D. Zhang, C.D. Allis, Y. Zhao, Metabolic regulation of gene expression through histone acylations, Nat. Rev. Mol. Cell Biol. 18 (2017) 90–101.
[16] H. Yu, C. Bu, Y. Liu, T. Gong, X. Liu, S. Liu, X. Peng, W. Zhang, Y. Peng, J. Yang, L. He, Y. Zhang, X. Yi, X. Yang, L. Sun, Y. Shang, Z. Cheng, J. Liang, Global crotonylome reveals CDYL-regulated RPA1 crotonylation in homologous recombination-mediated DNA repair, Sci. Adv. 6 (2020), eaay4697.
[17] G.H. Huang, W.F. Zeng, A discrete hidden Markov model for detecting histone crotonyllysine sites, Match-Commun Math Co 75 (2016) 717–730.
[18] W.R. Qiu, B.Q. Sun, H. Tang, J. Huang, H. Lin, Identify and analysis crotonylation sites in histone by using support vector machines, Artif. Intell. Med. 83 (2017) 75–81.
[19] S.J. Malebary, M.S.U. Rehman, Y.D. Khan, iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou’s 5-step rule, PLoS One 14 (2019), e0223993.
[20] Z. Ju, J.J. He, Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into Chou’s general PseAAC, J. Mol. Graph. Model. 77 (2017) 200–204.
[21] W.R. Qiu, B.Q. Sun, X. Xiao, Z.C. Xu, J.H. Jia, K.C. Chou, iKcr-PseEns: identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier, Genomics 110 (2018) 239–246.
[22] Y. Liu, Z. Yu, C. Chen, Y. Han, B. Yu, Prediction of protein crotonylation sites through LightGBM classifier based on SMOTE and elastic net, Anal. Biochem. 609 (2020), 113903.
[23] H. Lv, F.Y. Dao, Z.X. Guan, H. Yang, Y.W. Li, H. Lin, Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method, Briefings Bioinf. 22 (2021).
[24] Y.Z. Chen, Z.Z. Wang, Y. Wang, G. Ying, Z. Chen, J. Song, nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning, Briefings Bioinf. 22 (2021).
[25] Y. Qiao, X. Zhu, H. Gong, BERT-Kcr: prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models, Bioinformatics 38 (2022) 648–654.
[26] L. Dou, Z. Zhang, L. Xu, Q. Zou, iKcr_CNN: a novel computational tool for imbalance classification of human nonhistone crotonylation sites based on convolutional neural networks with focal loss, Comput. Struct. Biotechnol. J. 20 (2022) 3268–3279.
[27] Z. Li, J. Fang, S. Wang, L. Zhang, Y. Chen, C. Pian, Adapt-Kcr: a novel deep learning framework for accurate prediction of lysine crotonylation sites based on learning embedding features and attention architecture, Briefings Bioinf. 23 (2022).
[28] Y. Huang, B. Niu, Y. Gao, L. Fu, W. Li, CD-HIT Suite: a web server for clustering and comparing biological sequences, Bioinformatics 26 (2010) 680–682.
[29] V. Vacic, L.M. Iakoucheva, P. Radivojac, Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments, Bioinformatics 22 (2006) 1536–1537.
[30] F. Li, C. Li, M. Wang, G.I. Webb, Y. Zhang, J.C. Whisstock, J. Song, GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome, Bioinformatics 31 (2015) 1411–1419.
[31] Z. Chen, Y.Z. Chen, X.F. Wang, C. Wang, R.X. Yan, Z. Zhang, Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs, PLoS One 6 (2011), e22930.
[32] M.P. Mosharaf, M.M. Hassan, F.F. Ahmed, M.S. Khatun, M.A. Moni, M.N.H. Mollah, Computational prediction of protein ubiquitination sites mapping on Arabidopsis thaliana, Comput. Biol. Chem. 85 (2020), 107238.
[33] S. Kawashima, P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama, M. Kanehisa, AAindex: amino acid index database, progress report 2008, Nucleic Acids Res. 36 (2008) D202–D205.
[34] I. Dubchak, I. Muchnik, S.R. Holbrook, S.H. Kim, Prediction of protein folding class using global description of amino acid sequence, Proc. Natl. Acad. Sci. U.S.A. 92 (1995) 8700–8704.
[35] J.H. Yang, H.P. Choi, A. Yang, R. Azad, F. Chen, Z. Liu, K.M. Azadzoi, Post-translational modification networks of contractile and cellular stress response proteins in bladder ischemia, Cells 10 (2021).
[36] L. Wei, C. Zhou, H. Chen, J. Song, R. Su, ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides, Bioinformatics 34 (2018) 4007–4016.
[37] D. Wang, Y. Liang, D. Xu, Capsule network for protein post-translational modification site prediction, Bioinformatics 35 (2019) 2386–2394.
[38] Z. Lin, M. Feng, C.N.D. Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A Structured Self-Attentive Sentence Embedding, 2017.
[39] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[41] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2014.

Jun Gao is a postgraduate student in information science and technology, Dalian Maritime University. Her research interests include disease and noncoding RNAs, protein site prediction and semi-supervised learning.

Yaomiao Zhao is a postgraduate student in information science and technology, Dalian Maritime University. Her research interests include miRNA-disease association prediction, protein site prediction and machine learning.

Chen Chen received the BS and PhD degrees from the College of Mechanical Engineering, Dalian University of Technology, China, in 2020. He is currently a lecturer at Dalian Maritime University, Dalian. He focuses on intelligent manufacturing and machine learning.
Qiao Ning received the BS and PhD degrees from the School of Information Science and Technology, Northeast Normal University, China, in 2019. She is currently a lecturer with the Department of Information Science and Technology, Dalian Maritime University, Dalian. Her research interests include machine learning and bioinformatics.