Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

1

A novel methodology on distributed representations of


proteins using their interacting ligands
Hakime Öztürk1 , Elif Ozkirimli2,* , and Arzucan Özgür1,*
arXiv:1801.10199v1 [stat.ML] 30 Jan 2018

1 Department of Computer Engineering, Bogazici University,


Istanbul, 34342, Turkey
2 Department of Chemical Engineering, Bogazici University,
Istanbul, 34342, Turkey

* arzucan.ozgur@boun.edu.tr; elif.ozkirimli@boun.edu.tr

Abstract
The effective representation of proteins is a crucial task that directly affects the
performance of many bioinformatics problems. Related proteins usually bind to
similar ligands. Chemical characteristics of ligands are known to capture the
functional and mechanistic properties of proteins suggesting that a ligand based
approach can be utilized in protein representation. In this study, we propose
SMILESVec, a SMILES-based method to represent ligands and a novel method
to compute similarity of proteins by describing them based on their ligands.
The proteins are defined utilizing the word-embeddings of the SMILES strings
of their ligands. The performance of the proposed protein description method is
evaluated in protein clustering task using TransClust and MCL algorithms. Two
other protein representation methods that utilize protein sequence, BLAST and
ProtVec, and two compound fingerprint based protein representation methods
are compared. We showed that ligand-based protein representation, which uses
only SMILES strings of the ligands that proteins bind to, performs as well as
protein-sequence based representation methods in protein clustering. The re-
sults suggest that ligand-based protein description can be an alternative to the
traditional sequence or structure based representation of proteins and this novel
approach can be applied to different bioinformatics problems such as prediction
of new protein-ligand interactions and protein function annotation.

Introduction
The aging population is putting drug design studies under pressure as we see
an increase in the incidence of complex diseases. Multiple proteins from dif-
ferent protein families or protein networks are usually implicated in these com-
plex diseases such as cancer, cardiovascular, immune and neurodegenerative dis-
eases [18,33,34]. Reliable representation of proteins plays a crucial role in the per-
formance of many bioinformatics tasks such as protein family classification and
2

clustering, prediction of protein functions and prediction of the interactions be-


tween protein-protein and protein-ligand pairs. Proteins are usually represented
based on their sequences [5, 10, 19]. A recent study adapted Word2Vec [24],
which is a widely-used word-embeddings model in Natural Language Process-
ing (NLP) tasks, into the genomic space to describe proteins as real-valued
continuous vectors using their sequences, and utilized these vectors to classify
proteins [2]. However, even though the structure of a protein is determined by
its sequence, sequence alone is usually not adequate to completely understand
its mechanism. Furthermore, the relationship between fold or architecture and
function was shown to be weak, while a strong correlation was reported for archi-
tecture and bound ligand [23]. Semantic features such as functional categories
and annotations, and Gene Ontology (GO) classes [7, 16, 25, 37] have been sug-
gested to support the functional understanding of proteins, nevertheless these
features are usually described in the form of binary vectors preventing the di-
rect use of the provided information. Therefore, a novel approach that defines
proteins by integrating functional characterizations can provide important in-
formation toward understanding and predicting protein structure, function and
mechanism. Ligand-centric approaches are based on the chemical similarity of
compounds that interact with similar proteins [32] and have been successfully
adopted for tasks such as target fishing, off-target effect prediction and protein-
clustering [9, 35]. The use of chemical similarity of the interacting ligands of
proteins to group them resulted in both biologically and functionally related
protein clusters [22, 28]. Motivated by these results, we propose to describe
proteins using their interacting ligands.
In order to define the protein with a ligand centric approach, the description
of the ligand is critical. Ligands can be represented in many different forms
including knowledge-based fingerprints, graphs, or strings. Simplified Molecular
Input Line Entry System (SMILES), which is a character-based representation
of ligands, has been used for QSAR studies [6,36] and protein-ligand interaction
prediction [21, 29]. Even though it is a string based representation form, use
of SMILES performed as well as powerful graph-based representation methods
in protein-ligand interaction prediction and has proven to be computationally
less expensive [29]. A recent study that employed a Recurrent Neural Networks
(RNN) based model to describe compound properties also used SMILES to pre-
dict chemical properties [17]. However, such deep-learning based approaches
require more computational power. An advantage of SMILES is that it provides
a promising environment for the adoption of NLP approaches because it is char-
acter based. Distributed word representation models have been widely used in
recent studies of NLP tasks, especially with the introduction of Word2Vec [24].
The model requires a large amount of text data to learn the representations of
words to describe them in low-dimensional space as real valued vectors. These
vectors comprise the syntactic and semantic features of the words, e.g., the vec-
tors of words with similar meanings are also similar.
In this study, we introduce SMILESVec, in which we adopted the word-
embeddings approach to define ligands by utilizing their SMILES strings. Lig-
ands are represented by learning features from a large SMILES corpus via
3

Word2Vec [24], instead of using manually constructed ligand features as it is


done in fingerprint models. We then describe each protein using the average of
its interacting ligand vectors that are built by SMILESVec. We followed a similar
pipeline for evaluation that is presented in [4] in which the authors compared the
performances of different clustering algorithms on the task of detecting remote
homologous protein families. We measured how well SMILESVec-based protein
representation describes proteins within a protein clustering task by using two
state-of-the-art clustering algorithms; Transitive Clustering (TransClust) [41]
and Markov Clustering Algorithm (MCL) [14].
The performance of clustering using SMILESVec-based protein representa-
tion was compared with that using the traditional BLAST, MACCS-based [40]
and Extended Fingerprint-based protein representations as well as the recently
proposed distributed protein vector representation, which is called ProtVec [2].
ASTRAL data set (A-50) of SCOPe database was used as benchmark [8, 15].
The results showed that the representation of proteins with their ligands is
a promising method with competitive F-scores in the protein clustering task,
even though no sequence or structure information is used. SMILESVec can be
an alternative approach to binary-vector based fingerprint models for ligand-
representation. The ligand-based protein representation might be useful in dif-
ferent bioinformatics tasks such as identifying new protein-ligand interactions
and protein function annotations.

Materials and Methods


Data set
The ASTRAL data sets are the part of Structural Classification of Proteins
(SCOP) collection and classified under folds, families and super-families [15]. A
family denotes a group of proteins with typically distinct functionalities but also
with high sequence similarities, whereas a super-family is a group of protein fam-
ilies with structural and functional similarities amongst families. The ASTRAL
data sets are named based on the minimum sequence similarity of the proteins
that they comprise. For instance, ASTRAL 50 (A-50) data set includes proteins
with at most 50% sequence similarity (http://scop.berkeley.edu/astral/subsets/ver=1.75&seqOption=1).
In this study, we used A-50 data set from SCOP 1.75 version to demonstrate
the performance of the protein representation methods and considered clustering
into families and super-families for evaluation. Families and super-families with
single protein were removed while preparing the data [4]. We used the same
protein pairs that [4] used for A-50 to compute similarity scores
(http://www.lcqb.upmc.fr/julianab/software/cluster/).

Collection of Protein-Ligand Interactions


First, the corresponding UniProt identifiers were extracted for each protein in A-
50 dataset using Bioservices Python package [11]. Then, the interacting ligands
with their corresponding canonical SMILES were retrieved from ChEMBL using
ChEMBL web services [12] (Data collected on Dec 30, 2017). The workflow of
4

protein-ligand interaction extraction is illustrated in Figure 1. The collected


interactions were used to build the proposed SMILESVec-based protein repre-
sentations.

Figure 1. Extraction of protein-ligand interactions. As an example protein,


Myosin Binding ProteinC is provided as input with its corresponding SCOPe
ID: d2yxma

Distributed Representation of Proteins and Ligands


The Word2Vec model, which is based on feed-forward neural networks, has
been previously adopted to represent proteins using their sequences [2]. The
approach, that we will refer to as ProtVec throughout the article, improved the
performance for the protein classification problem. In this study, we used the
Word2Vec model with the Skip-gram approach to consider the order of the sur-
rounding words. In the biological context, we can use the string representations
of proteins/ligands (e.g., FASTA sequence for proteins and SMILES for ligands)
in textual format and define words as sub-sequences of these representations.
Figure 2 illustrates a sample protein sequence and its sequence list (biological
words) as well as a sample ligand SMILES and its corresponding sub-sequences
(chemical words). The biological words which are referred to as sequence-lists are
created with a set of three characters of non-overlapping sub-sequences for each
list that starts from the character indices 1,2, and 3, respectively, therefore lead-
ing to three sequence lists [2]. The chemical words were created as 8-character
long overlapping substrings of SMILES with sliding window approach. As shown
in Figure 2, the SMILES string “C(C1CCCCC1)N2CCCC2” is divided into the
following chemical words: “C(C1CCCC”, “(C1CCCCC”, “C1CCCCC1”, “1CC-
CCC1), “CCCCC1)N”, “CCCC1)N2”, ... , “)N2CCCC2”. We performed sev-
eral experiments in which word size varied in the range of 4-12 characters and
8-charactered chemical words obtained the best results.
With the use of the Word2Vec model we were able to describe complex struc-
tures using their simplified representations. For each subsequence (word) that
was extracted from protein sequence/ligand SMILES, Word2Vec produced a real-
valued vector that is learned from a large training set. The vector learning is
based on the context of each subsequence (e.g. its surrounding subsequences)
and can detect some important subsequences that usually occur in the same con-
texts. Therefore, with the help of the neural-network based nature of Word2Vec,
5

Figure 2. Representation of biological and chemical words .

every subsequence of a protein sequence/ligand SMILES was described in a se-


mantically meaningful way. The Word2Vec model defined a vector representa-
tion for each of the 3-residue subsequences of the proteins. Protein vectors were
constructed as the average of the summation of these subsequence vectors as de-
scribed in Equation 1 where vector(subsequencek ) refers to the 100-dimensional
real-valued vector for the kth subsequence and m is equal to the total number of
sub-sequences that can be extracted from a protein sequence. For proteins, 550K
protein sequences from UniProt were used to train Word2Vec with skip-gram
approach.
Pm
vector(subsequencek )
P rotV ec = vector(protein) = k=1 (1)
m
Similarly, the Word2Vec model produced a real-valued vector for each SMILES
word and the corresponding ligand vector is constructed as the average of the
summation of these SMILES word vectors as described in Equation 2. V ector(subsequencek )
represents the Word2Vec output for the 8-character long kth subsequence of
the SMILES string and n indicates the total number of these SMILES subse-
quences (words). We will refer to ligand vectors as SMILESVec throughout
the article. For learning, 1.7M canonical SMILES from CHEMBL database
(ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl23 ) were retrieved
[39]. The skip-gram approach with vector size set to 100 was used.
Pn
vector(subsequencek )
SM ILESV ec = vector(ligand) = k=1 (2)
n
We also used the Word2Vec model to learn embeddings for the characters in
the SMILES alphabet. Therefore instead of word-level, we created char-level
embeddings for the unique characters that appear in SMILES in ChEM BL23
data set (58 chars). Equation 3 describes SM ILESV ecchar where n in this case
represents the total number of the characters in a SMILES.
Pn
vector(chark )
SM ILESV ecchar = vector(ligand) = k=1 (3)
n
6

We further investigated an important aspect when working with SMILES rep-


resentation, since there are several valid SMILES for a single molecule. Canoni-
calization algorithms were coined for the purpose of generating a unique SMILES
for a molecule, however couldn’t prevent the diversity that came with differ-
ent canonicalization algorithms. Thus, it is not that surprising that canonical
SMILES definition can differ from database to database. ChEMBL uses Accel-
rys’s Pipeline Pilot that uses an algorithm derived from Daylight’s [30], whereas
Pubchem uses OpenEye software [26] for canonical SMILES generation [3]. The
most evident difference between the canonical SMILES of two databases is that
ChEMBL includes isomeric information, whereas Pubchem does not. There-
fore, even though we collected the SMILES of the interacting ligands from the
ChEMBL database, we both experimented learning chemical words and charac-
ters from ChEMBL and Pubchem canonical SMILES corpora both separately
and together (combined).
We can represent a protein/ligand vector as the output of the maximum
or minimum functions, where m is the total number of the subsequences that
are created from the protein/ligand sequence and d is the dimensionality of
the vector (i.e. the number of features). M INi represents the minimum value
of the ith feature among m subsequences (Equation 4). To obtain a protein
vector of minimum, M INi is selected for each feature as defined in Equation
5. Similarly, M AXi represents the maximum value of the ith feature among
m subsequences (Equation 6) and protein vector of maximum is created as in
Equation 7 for d number of features. The concatenation of these minimum and
maximum protein vectors results in a vector with twice the dimensionality of
the original vectors [13]. The min/max representation is described in Equation
8.

M INi = min([subsequence0i , subsequencemi ]) (4)

vectormin (protein) = [M IN0 M IN1 ...M INi ...M INd ] (5)

M AXi = max([subsequence0i , subsequencemi ]) (6)

vectormax (protein) = [M AX0 M AX1 ...M AXi ...M AXd ] (7)

vectorminmax (protein) = [vectormin (protein)][vectormax (protein)] (8)

Protein Similarity Computation


We used BLAST and ProtVec-based methods as baseline to compare to the
ligand-centric protein representation that we proposed.
7

BLAST
Basic Local Alignment Tool (BLAST) reports the similarity between protein
sequences using local alignment [1]. For the ASTRAL data sets, we used both
BLAST sequence identity values and BLAST e-values that were previously ob-
tained [4] with all-versus-all BLAST with e-value threshold of 100.

Word Frequency-based Protein Similarity


Word frequency-based protein similarity method uses three-charactered protein
words that are created as it was explained in Section . However, instead of
learning process, we simply count the occurrence of protein words that appear
in a protein sequence. In order to compute similarity between two proteins, we
used the formula depicted in Equation 9 [38]:
Pm |NP1 ,i −NP2 ,i |
i=1 1 − |NP ,i +NP ,i |
1 2
W ordF requencysim(L1 , L2 ) = (9)
m
where m is the total number of unique words created from protein sequences
P1 and P2 , NP1 ,i is the frequency of words of type i in protein P1 and NP2 ,i is
the frequency of words of type i in protein P2 .

ProtVec-based Protein Similarity


In ProtVec based clustering, protein vectors were constructed as defined in Sec-
tion 2.3, either with the average or minmax method. Cosine similarity function
was used to compute the similarity between two protein vectors P 1 and P 2 as
in Equation 10 where d denotes the size (dimensionality) of the vectors.
Pd
P 1i P 2i
CosSim(P 1, P 2) = i=1 (10)
kP 1kkP 2k

SMILESVec-based Protein Similarity


First, the ligand vectors were constructed by SMILESVec approach described
in Section 2.3. Then each protein was represented as the average of the vectors
of the ligands they interact with. Equation 11 describes the construction of a
protein vector from its binding ligands where SMILESVec represents the ligand
vector and p represents the total number of ligands that the protein interacts
with. Pp
vector(SM ILESV eck )
vector(protein) = k=1 (11)
p
Similarly, protein similarity is computed using the cosine similarity function.

Fingerprint-based Protein Similarity


We used two popular fingerprint-based compound representation methods as an
alternative to SMILESVec, namely MACCS and Extended Fingerprint. Chemi-
cal Development Kit descriptors were used to build MACCS and Extended fin-
gerprints of the ligands [40]. Fingerprints are binary (absence/presence) vector
8

representations of ligands where each bit refers to chemical features such as spe-
cific substructures and rings in which MACCS and Extended Fingerprint encode
166 and 1024 bits, respectively. The proteins were represented as described in
Equation 12 in which fingerprints were used to represent each interacting ligand.
Pp
vector(F ingerprintM ethodk )
vector(protein) = k=1 (12)
p
Fingerprints were used in order to compare a knowledge-based ligand description
with a data-driven approach (SMILESVec).

SMILES word frequency-based Protein Similarity


For each interacting ligand of a protein, 8-character-long SMILES words were
created as explained in Section . Then similarity between two proteins were com-
puted as in Equation 9 using the collection of chemical words of their respective
interacting ligands.

Clustering Algorithms
We evaluated the effectiveness of the different protein representation approaches
for the task of protein clustering. Transitivity Clustering (TransClust), which
has been shown to produce the best F-measure score amongst several other
algorithms in protein clustering [4] and the commonly used Markov Clustering
Algorithm (MCL) were used as the protein clustering algorithms.

Transitivity Clustering (TransClust)


TransClust is a clustering method that is based on the weighted transitive graph
projection problem [41]. The main idea behind TransClust is to construct tran-
sitive graphs by adding or removing edges from an intransitive graph using a
weighted cost function. Weighted cost function is calculated as the distance be-
tween a user-defined threshold and a pairwise similarity function. TransClust
connects two proteins on the network if their similarity is greater than the user-
defined threshold. The graph is expanded by adding or removing edges until it
becomes a disjoint union of cliques [4].

Markov Clustering Algorithm (MCL)


MCL is a network clustering algorithm that considers the weights of the edges
(flows) in the network [14] and utilized to build a flow matrix of the network. The
algorithm is implemented for a given number of iterations. The iteration number
is called granularity inflation defining the homogeneity and the heterogeneity of
the clusters. We used the default value (2.0) of the inflation parameter in MCL.

Evaluation
In order to evaluate the performance of the proposed methods, we utilized the
F-measure, precision and recall metrics. These metrics are widely used in the
9

evaluation of classification methods. To adapt these metrics into the assessment


of clustering task, we followed the formulation explained by Bernardes and co-
workers [4].
For a data set of n proteins, let us assume nf represents the number of
proteins that belong to the f th family or class, ng is the number of proteins
that are placed in the g th cluster and nf g represents the number of proteins that
belong to the f th family and are placed in the g th cluster. Precision of cluster
g with respect to the f th family is computed as precisionf g = nf g /ng , whereas
recall is defined as recallf g = nf g /nf . Finally we can define F-measure as in
Equation 13:
1X 2precisionf g recallf g
F − measure = nf maxg (13)
n precisionf g + recallf g
f

maxg indicates that for each family f , we compute precision and recall values
for each corresponding g cluster, and choose the maximum score.

Results
We evaluated the performance of five different protein similarity computation
approaches in clustering of the A-50 dataset. The similarity approaches were
BLAST, ProtVec, SMILESVec, MACCS, and Extended Fingerprint, the first
two of which are protein sequence based similarity methods, whereas the latter
three utilize the ligands to which proteins bind. We took word-frequency based
protein similarity methods that use protein sequences and compound SMILES
strings, respectively, as the baseline. Average (avg) and minimum/maximum
(min/max) of the vectors were taken to build combined vectors for ProtVec and
SMILESVec from their subsequence vectors.
We performed our experiments on the A-50 dataset using two different cluster-
ing algorithms, TransClust and MCL. The ligand-based (SMILESVec, MACCS
and Extended Fingerprint) protein representation approaches require a protein
to bind to at least one ligand in order to define a ligand-based vector for that
protein. Therefore, we removed the proteins with no ligand binding information
from both data sets. Table 1 provides a summary of A-50 data set before and
after filtering.

Table 1. Distribution of families and super-families in A-50 data set before and
after filtering
Data set Num. Sequences Super-families Families
Before filtering
A50 10816 1080 2109
After filtering
A50 1639 425 652

Table 2 summarizes the top-10 most frequent families and super-families be-
fore and after filtering. We can observe that less than half of the frequent
families and super-families remained in top-ten list such as Immunoglobulin
10

(b.1.1) and Fibronectin type III (b.1.2) super-families and their descendants,
Immunoglobulin I set (b.1.1.4) and Fibronectin type III (b.1.2.1) families, re-
spectively. Super-families and families that weren’t initially in the top-10 list
such as Protein-kinase like (d.144.1) super-family and nuclear-receptor binding
domain (a.123.1) and their respective descendant families also made it among
the frequent set of proteins when ligand interactions were taken into account.
In the filtered data set in which all proteins have an interacting ligand, there
were 1057 proteins with fewer than 200 ligands (64% of the whole proteins) 101 of
which were proteins with single ligands (0.6% of the whole proteins). There were
67 proteins with more than 10000 interacting ligands (0.4%), thus increasing the
average of the interacting ligands to 1791. The protein with the highest number
of interacting ligands was d2dpia2 (DNA polymerase iota), a protein involved
in DNA repair [20] and implicated in esophageal squamous cell cancer [43] and
breast cancer [42], with 115018 ligands.

Table 2. Summary of the top-10 frequent families and super-families in A-50


data set of SCOPe given with the family name and the number of proteins that
belong to them.
Before filtering After filtering
Super-family # prots. Family # prots. Super-family # prots. Family # prots.
P-loop containing nucleoside
Fibronectin type III Protein kinase-like Protein kinases, catalytic subunit
1 triphosphate hydrolases 312 100 47 39
(b.1.2.1) (d.144.1) (d.144.1.7)
(c.37.1)
NAD(P)-binding P-loop containing nucleoside
Tyrosine-dependent oxidoreductases Fibronectin type III
2 Rossmann-fold domain 269 86 triphosphate hydrolases 43 28
(c.2.1.2) (b.1.2.1)
(c.2.1) (c.37.1)
Immunoglobulin Canonical RNA-binding domain Immunoglobulin Eukaryotic proteases
3 174 82 41 25
(b.1.1) (d.58.7.1) (b.1.1) (b.47.1.2)
NAD(P)-binding
”Winged helix” DNA-binding domain Immunoglobulin I set EGF-type module
4 157 67 Rossmann-fold domain 32 24
(a.4.5) (b.1.1.4) (g.3.11.1)
(c.2.1)
Thioredoxin-like G proteins Trypsin-like serine proteases Immunoglobulin I set
5 149 64 31 23
(c.47.1) (c.37.1.8) (b.47.1) (b.1.1.4)
(Trans)glycosidases Immunoglobulin V set Fibronectin type III SH2 domain
6 126 63 28 22
(c.1.8) (b.1.1.1) (b.1.2) (d.93.1.1)
Nuclear receptor
Nucleic acid-binding proteins Classic zinc finger, C2H2 EGF/Laminin
7 110 60 27 ligand-binding domain 18
(b.40.4) (g.37.1.1) (g.3.11)
(a.123.1.1)
Homeodomain-like Phosphate binding protein-like SH2 domain Cyclin
8 102 56 22 15
(a.4.1) (c.94.1.1) (d.93.1) (a.74.1.1)
Fibronectin type III PDZ domain Cysteine proteinases Pleckstrin-homology domain
9 100 54 20 15
(b.1.2) (b.36.1.1) (d.3.1) (b.55.1.1)
S-adenosyl-L-methionine-dependent Nuclear receptor
N-acetyl transferase, NAT Tyrosine-dependent oxidoreductases
10 methyltransferases 99 53 ligand-binding domain 19 15
(d.108.1.1) (c.2.1.2)
(c.66.1) (a.123.1)

Finally, we assessed the performance of the clustering algorithms with F-


measure values for two different clustering scenarios, family and super-family
clustering. TransClust requires a user-defined threshold to identify clusters,
therefore in order to choose the best threshold value, we computed the F-measure
values for similarity threshold range of [0, 1] with 0.001 step-size for similarity
computation methods that outputs in the range of 0-1. For BLAST, range of [0,
100] with step-size value of 0.05 was tested for similarity threshold. We chose the
similarity thresholds that gave the best F-measure for super-family and family
to decide the final clusters.
Table 3 reports the F-measure values for family and super-family clustering
and the number of clusters that are detected with TransClust and MCL algo-
rithms, respectively.
Between TransClust and MCL, TransClust produced better F-measure val-
ues in all representation methods on A-50 data set. The results obtained by
11

both clustering algorithms were better in family clustering than in super-family


clustering, which was an expected outcome since detection of distantly related
proteins is a much harder task.
Both clustering algorithms relied on similarity scores in order to group pro-
teins. Among the protein sequence-based similarity methods, the poorest cluster-
ing performance in super-family/family (0.350/0.500) belonged to BLAST with
e-value, the baseline. Protein word frequency (0.686/0.744) obtained the best
performance on the A-50 dataset in super-family and family clustering, respec-
tively. The performances of the ProtVec Avg (0.681/0.739) and the ligand-based
protein representation methods followed the best result closely. Though, bring-
ing in a semantic aspect with learning through Word2Vec model, ProtVec-based
similarity (avg and minmax), was outperformed by the straightforward word-
frequency based approach.
The results also showed than average-based combination method (ProtVec
avg) was better than min/max-based combination method (ProtVec minmax)
to build a single protein vector from subsequence vectors in the protein clus-
tering task. Since min/max-based combination method did not perform well
in sequence-based protein similarity, we did not test the technique for SMILES-
based protein similarity approaches.

Table 3. Performances of TransClust and MCL algorithms in super-family and


family clustering for all protein similarity computation methods with F-measure
values.
Transclust MCL
Super-family Family Super-family Family
No.Clusters F-measure No.Clusters F-measure No. Clusters F-measure No. Clusters F-measure
Protein sequence based
Blast (e-val) A-50 1596 0.350 1636 0.500 728 0.290 728 0.379
Blast (identity) A-50 606 0.595 660 0.631 783 0.540 783 0.592
Protein Word frequency A-50 708 0.686 688 0.744 411 0.590 411 0.606
ProtVec Avg (word) A-50 655 0.681 704 0.739 1001 0.596 1001 0.665
ProtVec Avg (char) A-50 707 0.674 707 0.729 1017 0.590 1017 0.662
ProtVec MinMax (word) A-50 586 0.667 704 0.718 1014 0.590 1014 0.662
Ligand based
SMILES word frequency A-50 801 0.624 957 0.704 312 0.470 312 0.475
SMILESVec (word, chembl) A-50 621 0.677 730 0.735 867 0.608 867 0.667
SMILESVec (word, pubchem) A-50 573 0.668 692 0.730 857 0.604 857 0.664
SMILESVec (word, combined) A-50 617 0.675 764 0.735 894 0.607 894 0.668
SMILESVec(char, chembl) A-50 636 0.678 710 0.729 999 0.596 999 0.668
SMILESVec(char, pubchem) A-50 714 0.671 715 0.729 977 0.595 977 0.667
SMILESVec(char, combined) A-50 712 0.675 712 0.739 1006 0.595 1006 0.669
MACCS A-50 589 0.679 683 0.736 874 0.606 874 0.667
Extended Fingerprint A-50 607 0.680 756 0.732 744 0.609 744 0.655

Among the ligand based representation methods, we examined the perfor-


mance of the word-based embeddings and character-based embeddings as well
as the effect of the source of the training data set on embeddings. We col-
lected canonical SMILES from both ChEMBL ( 1.7M) and Pubchem ( 2.3M)
databases. The SMILES strings of the interacting ligands were only collected
from ChEMBL as explained in Section . The main difference between these
two databases is that ChEMBL allows the isometric information of the molecule
to be encoded within SMILES. The results clearly indicated that the choice of
the training set for embedding learning is important where SMILES was con-
12

cerned. In our case, since SMILES of the interacting ligands of A-50 data set
was collected from ChEMBL database, the performance of the SMILESVec in
which embeddings were learned from training with ChEMBL SMILES rather
than Pubchem SMILES was notably better.
We also investigated whether using the combination of the SMILES corpus
of ChEMBL and Pubchem can improve the performance of SMILESVec embed-
dings. We indeed reported an improvement on character-based embedding in
family clustering (0.739) whereas word-based embedding produced F-measure
values higher than the Pubchem-based learning and lower than the ChEMBL-
based learning. We can suggest that the increase in the performance of the
character-based learning with the combination of two different SMILES corpora
might be positively correlated with the increase in SMILES samples, while the
number of unique letters that appear in the SMILES did not significantly change
between databases (e.g. absence/presence of the few characters that represent
isometry information). However, with the word-based learning, we observed that
there was significant increase in the variety of the chemical words, thus the com-
bined SMILES corpus model did not work as well as it did in character-based
learning. This result suggests that the size of the learning corpus may affect
the representation of the embeddings, thus we might suggest a larger SMILES
corpus could lead to better character-based embeddings for SMILESVec.
Considering only ChEMBL trained SMILESVec, we observe that even though
producing comparable scores, word-based approach was better than character-
based SMILESVec in terms of F-measure in family clustering. In super-family
clustering however, character-based approach performs as well as word-based
SMILESVec. Similarly, ProtVec is also better represented in word-level rather
than character-level.
The ligand-based protein representation methods, SMILESVec and MACCS-
based approach performed almost as well as ProtVec in family and super-family
clustering with TransClust algorithm, even though no protein sequence infor-
mation was used. With MCL, a lower clustering performance was obtained
compared to TransClust, and both SMILESVec and MACCS-based method pro-
duced slightly better F-measure than ProtVec Avg in both super-family and
family clustering. We can suggest that, since ligand-based protein represen-
tation methods capture indirect function information through ligand binding,
they were recognizably better at detecting super-families than families com-
pared to sequence-based ProtVec on a relatively distant data set. Furthermore,
SMILESVec, a text-based unsupervised learning model, produced comparable F-
measure scores to MACCS and Extended fingerprints, which are binary vectors
based on human-engineered feature descriptions.
Table 4 reports the Pearson correlations [31] among the protein similarity
computation methods. Comparison with BLAST e-value resulted in negative
correlation, as expected, since e-values closer to zero indicate high match (sim-
ilarity). Ligand based protein representation methods had higher correlation
values with BLAST e-value than protein-sequence based methods. We also ob-
served strong correlation among the ligand-based protein representation meth-
ods, suggesting that, regardless of the ligand representation approach, the use
13

of interacting ligands to represent proteins provides similar information.

Table 4. Pearson correlation between protein similarity methods


Method Method Pearson correlation
BLAST (e-value) BLAST (identity) -0.109
BLAST (e-value) Protein word frequency -0.250
BLAST (e-value) ProtVec (avg) -0.291
BLAST (e-value) SMILESVec (word, chembl) -0.335
BLAST (e-value) SMILESVec (char, chembl) -0.207
BLAST (e-value) MACCS -0.336
SMILESVec (word, chembl) MACCS 0.895
SMILESVec (char, pubchem) MACCS 0.590
SMILESVec (word, chembl) SMILESVec (char, pubchem) 0.682
SMILESVec (word, chembl) Extended Fp. 0.938
Extended Fp. MACCS 0.937

We further investigated a case in which similar super-family clusters were pro-


duced with SMILESVec-based protein similarity and ProtVec protein similarity
using TransClust algorithm. We observed that Fibronectin Type III proteins
(7 proteins) were clustered together when SMILESVec was used, whereas us-
ing ProtVec placed them into four different clusters; one cluster contained four
of those proteins, another cluster contained a single protein and the other two
proteins were part of other clusters. The protein that was clustered by itself
(SCOPe ID:d1n26a3, Human Interleukin-6 Receptor alpha chain) had two inter-
acting ligands (CHEMBL81;Raloxifene and CHEMBL46740;Bazedoxifene) that
were also shared by a protein (SCOPe ID:d1bqua2,Cytokine-binding region of
GP130) clustered separately with ProtVec. Thus, we can suggest that using
information on common interacting ligands, SMILESVec achieved to combine
these seven proteins into a single cluster, while ProtVec failed to do so with a
sequence-based approach.

Discussion
In this study, we first propose a ligand-representation method, SMILESVec,
which uses a word embeddings model. Then, we represent proteins using their
interacting ligands. In this approach, the interacting ligands of each protein in
the data set are collected. Then, the SMILES string of each ligand is divided
into fixed-length overlapping substrings. These created substrings are then used
to build real-valued vectors with the Word2vec model and then the vectors are
combined into a single vector to represent the whole SMILES string. Finally,
protein vectors are constructed by taking the average of the vectors of their
ligands. The effectiveness of the proposed method in describing the proteins
was measured by performing clustering on ASTRAL 50 (A-50) dataset from
the SCOPe database using two different clustering algorithms, TransClust and
MCL. Both of these clustering algorithms use protein similarity scores to iden-
tify cliques. SMILESVec based protein representation was compared with other
protein representation methods, namely BLAST and ProtVec, both of which
depend on protein sequence to measure protein similarity, and the MACCS and
14

Extended Fingerprint binary fingerprint based ligand-centric protein represen-


tation approaches. The performance of the clustering algorithms, as reported
by F-measure, showed that protein word-frequency based similarity model was
a better alternative to BLAST e-value or sequence identity to measure protein
similarity. Furthermore, ligand-based protein representation methods also pro-
duced comparable F-measure scores to ProtVec.
Using SMILESVec, we were able to define proteins based on their interacting
ligands even in the absence of sequence or structure information. SMILESVec-
based protein representation had better clustering performance than BLAST
and comparable clustering performance to protein word-frequency based method,
both of which use protein sequences. We should emphasize that SCOPe data
sets were constructed based on protein similarity, thus high performance with the
protein sequence-based models in family/super-family clustering is no surprise.
However, having ligand-based protein representation methods, either learning
from SMILES or represented with binary compound features, performing as
well as protein sequence-based models is quite intriguing and promising.
SMILESVec and MACCS representation performed similarly in the task of
protein clustering and better than Extended Fingerprint representation, sug-
gesting that the word-embeddings approach that learns representations from a
large SMILES corpus in an unsupervised manner is as accurate as a knowledge-
based fingerprint model. We propose that the ligand-based representation of
proteins might reveal important clues especially in protein-ligand interaction re-
lated tasks like drug specificity or identification of proteins for drug targeting.
The similarity between a candidate ligand and the SMILESVec for a protein can
be used as an indicator for a possible interaction.
We would like to mention that ASTRAL data sets contain domains rather
than full length proteins while CHEMBL collects protein - ligand interaction
information based on the whole protein sequence from UniProt. A multidomain
protein may have multiple and diverse chemotypes of ligands binding to each
domain and retrieving ligand information based on the full length protein may
lump this disparate information together, leading to loss of information on do-
main specific ligand interactions. The performance of domain sequence based
methods is therefore at an advantage because family/superfamily assignment
in SCOPe is also based on domain sequence while the ligand based approach
we use in SMILESVec uses more noisy data. Despite this disadvantage, ligand
based approach performs as well as the sequence based approaches. We hypoth-
esize that if domain - ligand interactions are taken into account, ligand based
approaches would have higher performance.
The study we conducted here also showed that SMILES description is sensi-
tive to the database definition conventions, therefore SMILES strings requires
careful consideration. Since we collected the protein-ligand interaction and lig-
and SMILES information from ChEMBL database to represent proteins, build-
ing SMILESVec vectors from the chemical words trained in ChEMBL SMILES
corpus yielded better F-measure than the model in which the Pubchem SMILES
corpus was used for training of the chemical words.
We showed that ligand-centric protein representation performed at least as
15

well as protein sequence based representations in the clustering task even in


the absence of sequence information. Ligand-centric protein representation is
only available for proteins with at least one known ligand interaction, while a
sequence based approach can miss key functional/mechanistic properties of the
protein. The orthogonal information that can be obtained from the two ap-
proaches has been previously observed [27]. As future work, we will investigate
combining both sequence and ligand information in protein representation. We
believe that this approach will provide a deeper understanding of protein func-
tion and mechanism toward the use of these representations in clustering and
other bioinformatics tasks such as function annotation and prediction of novel
protein - drug interactions.

Acknowledgments
TUBITAK-BIDEB 2211-E Scholarship Program (to HO) and BAGEP Award
of the Science Academy (to AO) are gratefully acknowledged. We thank Prof.
Kutlu O. Ulgen and Mehmet Aziz Yirik for helpful discussions.

Funding
This work is funded by Bogazici University Research Fund (BAP) Grant Number
12304.

References
1. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic
local alignment search tool. Journal of molecular biology, 215(3):403–410,
1990.

2. E. Asgari and M. R. Mofrad. Continuous distributed representation


of biological sequences for deep proteomics and genomics. PloS one,
10(11):e0141287, 2015.

3. K. V. Balakin. Pharmaceutical data mining: approaches and applications


for drug discovery, volume 6. John Wiley & Sons, 2009.

4. J. S. Bernardes, F. R. Vieira, L. M. Costa, and G. Zaverucha. Evaluation


and improvements of clustering algorithms for detecting remote homolo-
gous protein families. BMC bioinformatics, 16(1):34, 2015.

5. C. Cai, L. Han, Z. L. Ji, X. Chen, and Y. Z. Chen. Svm-prot: web-


based support vector machine software for functional classification of a
protein from its primary sequence. Nucleic acids research, 31(13):3692–
3697, 2003.

6. D.-S. Cao, J.-C. Zhao, Y.-N. Yang, C.-X. Zhao, J. Yan, S. Liu, Q.-N. Hu,
Q.-S. Xu, and Y.-Z. Liang. In silico toxicity prediction by support vector
16

machine and smiles representation-based string kernel. SAR and QSAR


in Environmental Research, 23(1-2):141–153, 2012.

7. R. Cao and J. Cheng. Integrated protein function prediction by mining


function associations, sequences, and protein–protein and gene–gene in-
teraction networks. Methods, 93:84–91, 2016.

8. J.-M. Chandonia, N. K. Fox, and S. E. Brenner. Scope: Manual curation


and artifact removal in the structural classification of proteins–extended
database. Journal of molecular biology, 429(3):348–355, 2017.

9. Y.-Y. Chiu, J.-H. Tseng, K.-H. Liu, C.-T. Lin, K.-C. Hsu, and J.-M. Yang.
Homopharma: A new concept for exploring the molecular binding mech-
anisms and drug repurposing. BMC genomics, 15(9):S8, 2014.

10. K.-C. Chou. Prediction of protein cellular attributes using pseudo-amino


acid composition. Proteins: Structure, Function, and Bioinformatics,
43(3):246–255, 2001.

11. T. Cokelaer, D. Pultz, L. M. Harder, J. Serra-Musach, and J. Saez-


Rodriguez. Bioservices: a common python package to access biological
web services programmatically. Bioinformatics, 29(24):3241–3242, 2013.

12. M. Davies, M. Nowotka, G. Papadatos, N. Dedman, A. Gaulton, F. Atkin-


son, L. Bellis, and J. P. Overington. Chembl web services: streamlin-
ing access to drug discovery data and utilities. Nucleic acids research,
43(W1):W612–W620, 2015.

13. C. De Boom, S. Van Canneyt, T. Demeester, and B. Dhoedt. Repre-


sentation learning for very short texts using weighted word embedding
aggregation. Pattern Recognition Letters, 80:150–156, 2016.

14. A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algo-


rithm for large-scale detection of protein families. Nucleic acids research,
30(7):1575–1584, 2002.

15. N. K. Fox, S. E. Brenner, and J.-M. Chandonia. Scope: Structural classi-


fication of proteins—extended, integrating scop and astral data and clas-
sification of new structures. Nucleic acids research, 42(D1):D304–D309,
2013.

16. M. Frasca and N. Cesa-Bianchi. Multitask protein function prediction


through task dissimilarity. IEEE/ACM Transactions on Computational
Biology and Bioinformatics, 2017.

17. G. B. Goh, N. O. Hodas, C. Siegel, and A. Vishnu. Smiles2vec: An


interpretable general-purpose deep neural network for predicting chemical
properties. arXiv preprint arXiv:1712.02034, 2017.

18. J. X. Hu, C. E. Thomas, and S. Brunak. Network biology concepts in


complex disease comorbidities. Nature Reviews Genetics, 2016.
17

19. M. J. Iqbal, I. Faye, A. M. Said, and B. B. Samir. A distance-based feature-


encoding technique for protein sequence classification in bioinformatics.
In Computational Intelligence and Cybernetics (CYBERNETICSCOM),
2013 IEEE International Conference on, pages 1–5. IEEE, 2013.

20. R. Jain, J. R. Choudhury, A. Buku, R. E. Johnson, L. Prakash, S. Prakash,


and A. K. Aggarwal. Mechanism of error-free dna synthesis across n1-
methyl-deoxyadenosine by human dna polymerase-ι. Scientific reports,
7:43904, 2017.

21. Jastrzke. Learning to smile (s).

22. M. J. Keiser, B. L. Roth, B. N. Armbruster, P. Ernsberger, J. J. Irwin,


and B. K. Shoichet. Relating protein pharmacology by ligand chemistry.
Nature biotechnology, 25(2):197, 2007.

23. A. C. Martin, C. A. Orengo, E. G. Hutchinson, S. Jones, M. Karmirantzou,


R. A. Laskowski, J. B. Mitchell, C. Taroni, and J. M. Thornton. Protein
folds and functions. Structure, 6(7):875–884, 1998.

24. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Dis-


tributed representations of words and phrases and their compositionality.
In Advances in neural information processing systems, pages 3111–3119,
2013.

25. A. C. Nascimento, R. B. Prudêncio, and I. G. Costa. A multiple kernel


learning algorithm for drug-target interaction prediction. BMC bioinfor-
matics, 17(1):46, 2016.

26. T. OEChem. Openeye scientific software. Inc., Santa Fe, NM, USA, 2012.

27. M. J. O’Meara, S. Ballouz, B. K. Shoichet, and J. Gillis. Ligand similarity


complements sequence, physical interaction, and co-expression for gene
function prediction. PloS one, 11(7):e0160098, 2016.

28. H. Öztürk, E. Ozkirimli, and A. Özgür. Classification of beta-lactamases


and penicillin binding proteins using ligand-centric network models. PloS
one, 10(2):e0117874, 2015.

29. H. Öztürk, E. Ozkirimli, and A. Özgür. A comparative study of smiles-


based compound similarity functions for drug-target interaction predic-
tion. BMC bioinformatics, 17(1):128, 2016.

30. G. Papadatos and J. P. Overington. The chembl database: a taster for


medicinal chemists. Future, 6(4):361–364, 2014.

31. K. Pearson. Note on regression and inheritance in the case of two parents.
Proceedings of the Royal Society of London, 58:240–242, 1895.

32. A. Peón, C. C. Dang, and P. J. Ballester. How reliable are ligand-centric


methods for target fishing? Frontiers in chemistry, 4, 2016.
18

33. P. Poornima, J. D. Kumar, Q. Zhao, M. Blunder, and T. Efferth. Network


pharmacology of cancer: From understanding of complex interactomes to
the design of multi-target specific therapeutics from nature. Pharmaco-
logical research, 111:290–302, 2016.

34. J. A. Santiago and J. A. Potashkin. A network approach to clinical in-


tervention in neurodegenerative diseases. Trends in molecular medicine,
20(12):694–703, 2014.

35. M. Schenone, V. Danvcik, B. K. Wagner, and P. A. Clemons. Target iden-


tification and mechanism of action in chemical biology and drug discovery.
Nature chemical biology, 9(4):232–240, 2013.

36. J. Schwartz, M. Awale, and J.-L. Reymond. Smifp (smiles fingerprint)


chemical space for virtual screening and visualization of large databases
of organic molecules. Journal of chemical information and modeling,
53(8):1979–1989, 2013.

37. J.-Y. Shi, S.-M. Yiu, Y. Li, H. C. Leung, and F. Y. Chin. Predicting
drug–target interaction for new drugs using enhanced similarity measures
and super-target clustering. Methods, 83:98–104, 2015.

38. D. Vidal, M. Thormann, and M. Pons. Lingo, an efficient holographic


text based method to calculate biophysical properties and intermolecular
similarities. Journal of chemical information and modeling, 45(2):386–393,
2005.

39. Y. Wang, S. H. Bryant, T. Cheng, J. Wang, A. Gindulyte, B. A. Shoe-


maker, P. A. Thiessen, S. He, and J. Zhang. Pubchem bioassay: 2017
update. Nucleic acids research, 45(D1):D955–D963, 2016.

40. E. L. Willighagen, J. W. Mayfield, J. Alvarsson, A. Berg, L. Carlsson,


N. Jeliazkova, S. Kuhn, T. Pluskal, M. Rojas-Chertó, O. Spjuth, et al.
The chemistry development kit (cdk) v2. 0: atom typing, depiction, molec-
ular formulas, and substructure searching. Journal of Cheminformatics,
9(1):33, 2017.

41. T. Wittkop, D. Emig, S. Lange, S. Rahmann, M. Albrecht, J. H. Morris,


S. Böcker, J. Stoye, and J. Baumbach. Partitioning biological data with
transitivity clustering. Nature methods, 7(6):419–420, 2010.

42. J. Yang, Z. Chen, Y. Liu, R. J. Hickey, and L. H. Malkas. Altered dna


polymerase ι expression in breast cancer cells leads to a reduction in dna
replication fidelity and a higher rate of mutagenesis. Cancer research,
64(16):5597–5607, 2004.

43. S. Zou, Z.-F. Shang, B. Liu, S. Zhang, J. Wu, M. Huang, W.-Q. Ding, and
J. Zhou. Dna polymerase iota (pol ι) promotes invasion and metastasis
of esophageal squamous cell carcinoma. Oncotarget, 7(22):32274, 2016.

You might also like