Professional Documents
Culture Documents
A Novel Methodology On Distributed Representations of
A Novel Methodology On Distributed Representations of
* arzucan.ozgur@boun.edu.tr; elif.ozkirimli@boun.edu.tr
Abstract
The effective representation of proteins is a crucial task that directly affects the
performance of many bioinformatics problems. Related proteins usually bind to
similar ligands. Chemical characteristics of ligands are known to capture the
functional and mechanistic properties of proteins suggesting that a ligand based
approach can be utilized in protein representation. In this study, we propose
SMILESVec, a SMILES-based method to represent ligands and a novel method
to compute similarity of proteins by describing them based on their ligands.
The proteins are defined utilizing the word-embeddings of the SMILES strings
of their ligands. The performance of the proposed protein description method is
evaluated in protein clustering task using TransClust and MCL algorithms. Two
other protein representation methods that utilize protein sequence, BLAST and
ProtVec, and two compound fingerprint based protein representation methods
are compared. We showed that ligand-based protein representation, which uses
only SMILES strings of the ligands that proteins bind to, performs as well as
protein-sequence based representation methods in protein clustering. The re-
sults suggest that ligand-based protein description can be an alternative to the
traditional sequence or structure based representation of proteins and this novel
approach can be applied to different bioinformatics problems such as prediction
of new protein-ligand interactions and protein function annotation.
Introduction
The aging population is putting drug design studies under pressure as we see
an increase in the incidence of complex diseases. Multiple proteins from dif-
ferent protein families or protein networks are usually implicated in these com-
plex diseases such as cancer, cardiovascular, immune and neurodegenerative dis-
eases [18,33,34]. Reliable representation of proteins plays a crucial role in the per-
formance of many bioinformatics tasks such as protein family classification and
2
BLAST
Basic Local Alignment Tool (BLAST) reports the similarity between protein
sequences using local alignment [1]. For the ASTRAL data sets, we used both
BLAST sequence identity values and BLAST e-values that were previously ob-
tained [4] with all-versus-all BLAST with e-value threshold of 100.
representations of ligands where each bit refers to chemical features such as spe-
cific substructures and rings in which MACCS and Extended Fingerprint encode
166 and 1024 bits, respectively. The proteins were represented as described in
Equation 12 in which fingerprints were used to represent each interacting ligand.
Pp
vector(F ingerprintM ethodk )
vector(protein) = k=1 (12)
p
Fingerprints were used in order to compare a knowledge-based ligand description
with a data-driven approach (SMILESVec).
Clustering Algorithms
We evaluated the effectiveness of the different protein representation approaches
for the task of protein clustering. Transitivity Clustering (TransClust), which
has been shown to produce the best F-measure score amongst several other
algorithms in protein clustering [4] and the commonly used Markov Clustering
Algorithm (MCL) were used as the protein clustering algorithms.
Evaluation
In order to evaluate the performance of the proposed methods, we utilized the
F-measure, precision and recall metrics. These metrics are widely used in the
9
maxg indicates that for each family f , we compute precision and recall values
for each corresponding g cluster, and choose the maximum score.
Results
We evaluated the performance of five different protein similarity computation
approaches in clustering of the A-50 dataset. The similarity approaches were
BLAST, ProtVec, SMILESVec, MACCS, and Extended Fingerprint, the first
two of which are protein sequence based similarity methods, whereas the latter
three utilize the ligands to which proteins bind. We took word-frequency based
protein similarity methods that use protein sequences and compound SMILES
strings, respectively, as the baseline. Average (avg) and minimum/maximum
(min/max) of the vectors were taken to build combined vectors for ProtVec and
SMILESVec from their subsequence vectors.
We performed our experiments on the A-50 dataset using two different cluster-
ing algorithms, TransClust and MCL. The ligand-based (SMILESVec, MACCS
and Extended Fingerprint) protein representation approaches require a protein
to bind to at least one ligand in order to define a ligand-based vector for that
protein. Therefore, we removed the proteins with no ligand binding information
from both data sets. Table 1 provides a summary of A-50 data set before and
after filtering.
Table 1. Distribution of families and super-families in A-50 data set before and
after filtering
Data set Num. Sequences Super-families Families
Before filtering
A50 10816 1080 2109
After filtering
A50 1639 425 652
Table 2 summarizes the top-10 most frequent families and super-families be-
fore and after filtering. We can observe that less than half of the frequent
families and super-families remained in top-ten list such as Immunoglobulin
10
(b.1.1) and Fibronectin type III (b.1.2) super-families and their descendants,
Immunoglobulin I set (b.1.1.4) and Fibronectin type III (b.1.2.1) families, re-
spectively. Super-families and families that weren’t initially in the top-10 list
such as Protein-kinase like (d.144.1) super-family and nuclear-receptor binding
domain (a.123.1) and their respective descendant families also made it among
the frequent set of proteins when ligand interactions were taken into account.
In the filtered data set in which all proteins have an interacting ligand, there
were 1057 proteins with fewer than 200 ligands (64% of the whole proteins) 101 of
which were proteins with single ligands (0.6% of the whole proteins). There were
67 proteins with more than 10000 interacting ligands (0.4%), thus increasing the
average of the interacting ligands to 1791. The protein with the highest number
of interacting ligands was d2dpia2 (DNA polymerase iota), a protein involved
in DNA repair [20] and implicated in esophageal squamous cell cancer [43] and
breast cancer [42], with 115018 ligands.
cerned. In our case, since SMILES of the interacting ligands of A-50 data set
was collected from ChEMBL database, the performance of the SMILESVec in
which embeddings were learned from training with ChEMBL SMILES rather
than Pubchem SMILES was notably better.
We also investigated whether using the combination of the SMILES corpus
of ChEMBL and Pubchem can improve the performance of SMILESVec embed-
dings. We indeed reported an improvement on character-based embedding in
family clustering (0.739) whereas word-based embedding produced F-measure
values higher than the Pubchem-based learning and lower than the ChEMBL-
based learning. We can suggest that the increase in the performance of the
character-based learning with the combination of two different SMILES corpora
might be positively correlated with the increase in SMILES samples, while the
number of unique letters that appear in the SMILES did not significantly change
between databases (e.g. absence/presence of the few characters that represent
isometry information). However, with the word-based learning, we observed that
there was significant increase in the variety of the chemical words, thus the com-
bined SMILES corpus model did not work as well as it did in character-based
learning. This result suggests that the size of the learning corpus may affect
the representation of the embeddings, thus we might suggest a larger SMILES
corpus could lead to better character-based embeddings for SMILESVec.
Considering only ChEMBL trained SMILESVec, we observe that even though
producing comparable scores, word-based approach was better than character-
based SMILESVec in terms of F-measure in family clustering. In super-family
clustering however, character-based approach performs as well as word-based
SMILESVec. Similarly, ProtVec is also better represented in word-level rather
than character-level.
The ligand-based protein representation methods, SMILESVec and MACCS-
based approach performed almost as well as ProtVec in family and super-family
clustering with TransClust algorithm, even though no protein sequence infor-
mation was used. With MCL, a lower clustering performance was obtained
compared to TransClust, and both SMILESVec and MACCS-based method pro-
duced slightly better F-measure than ProtVec Avg in both super-family and
family clustering. We can suggest that, since ligand-based protein represen-
tation methods capture indirect function information through ligand binding,
they were recognizably better at detecting super-families than families com-
pared to sequence-based ProtVec on a relatively distant data set. Furthermore,
SMILESVec, a text-based unsupervised learning model, produced comparable F-
measure scores to MACCS and Extended fingerprints, which are binary vectors
based on human-engineered feature descriptions.
Table 4 reports the Pearson correlations [31] among the protein similarity
computation methods. Comparison with BLAST e-value resulted in negative
correlation, as expected, since e-values closer to zero indicate high match (sim-
ilarity). Ligand based protein representation methods had higher correlation
values with BLAST e-value than protein-sequence based methods. We also ob-
served strong correlation among the ligand-based protein representation meth-
ods, suggesting that, regardless of the ligand representation approach, the use
13
Discussion
In this study, we first propose a ligand-representation method, SMILESVec,
which uses a word embeddings model. Then, we represent proteins using their
interacting ligands. In this approach, the interacting ligands of each protein in
the data set are collected. Then, the SMILES string of each ligand is divided
into fixed-length overlapping substrings. These created substrings are then used
to build real-valued vectors with the Word2vec model and then the vectors are
combined into a single vector to represent the whole SMILES string. Finally,
protein vectors are constructed by taking the average of the vectors of their
ligands. The effectiveness of the proposed method in describing the proteins
was measured by performing clustering on ASTRAL 50 (A-50) dataset from
the SCOPe database using two different clustering algorithms, TransClust and
MCL. Both of these clustering algorithms use protein similarity scores to iden-
tify cliques. SMILESVec based protein representation was compared with other
protein representation methods, namely BLAST and ProtVec, both of which
depend on protein sequence to measure protein similarity, and the MACCS and
14
Acknowledgments
TUBITAK-BIDEB 2211-E Scholarship Program (to HO) and BAGEP Award
of the Science Academy (to AO) are gratefully acknowledged. We thank Prof.
Kutlu O. Ulgen and Mehmet Aziz Yirik for helpful discussions.
Funding
This work is funded by Bogazici University Research Fund (BAP) Grant Number
12304.
References
1. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic
local alignment search tool. Journal of molecular biology, 215(3):403–410,
1990.
6. D.-S. Cao, J.-C. Zhao, Y.-N. Yang, C.-X. Zhao, J. Yan, S. Liu, Q.-N. Hu,
Q.-S. Xu, and Y.-Z. Liang. In silico toxicity prediction by support vector
16
9. Y.-Y. Chiu, J.-H. Tseng, K.-H. Liu, C.-T. Lin, K.-C. Hsu, and J.-M. Yang.
Homopharma: A new concept for exploring the molecular binding mech-
anisms and drug repurposing. BMC genomics, 15(9):S8, 2014.
26. T. OEChem. Openeye scientific software. Inc., Santa Fe, NM, USA, 2012.
31. K. Pearson. Note on regression and inheritance in the case of two parents.
Proceedings of the Royal Society of London, 58:240–242, 1895.
37. J.-Y. Shi, S.-M. Yiu, Y. Li, H. C. Leung, and F. Y. Chin. Predicting
drug–target interaction for new drugs using enhanced similarity measures
and super-target clustering. Methods, 83:98–104, 2015.
43. S. Zou, Z.-F. Shang, B. Liu, S. Zhang, J. Wu, M. Huang, W.-Q. Ding, and
J. Zhou. Dna polymerase iota (pol ι) promotes invasion and metastasis
of esophageal squamous cell carcinoma. Oncotarget, 7(22):32274, 2016.