
Computational Biology and Chemistry 64 (2016) 263–270


Research Article

Perceptron ensemble of graph-based positive-unlabeled learning for disease gene identification

Gholam-Hossein Jowkar*, Eghbal G. Mansoori
School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran

A R T I C L E   I N F O

Article history:
Received 27 March 2016
Received in revised form 25 June 2016
Accepted 8 July 2016
Available online 12 July 2016

Keywords:
Disease gene identification
Biological networks
Positive-unlabeled learning
Ensemble of classifiers
Perceptron

A B S T R A C T

Identification of disease genes using computational methods is an important issue in biomedical and bioinformatics research. Based on the observation that diseases with the same or similar phenotype share common biological characteristics, researchers have tried to identify disease genes using machine learning tools. In recent attempts, semi-supervised learning methods known as positive-unlabeled learning have been used for disease gene identification. In this paper, we present a Perceptron ensemble of graph-based positive-unlabeled learning (PEGPUL) built on three types of biological attributes: gene ontologies, protein domains and protein-protein interaction networks. In our method, a reliable set of positive and negative genes is extracted using a co-training schema. Then, the similarity graph of genes is built using metric learning, and the multi-rank-walk method is used to perform inference from the labeled genes. At last, a Perceptron ensemble is learned from three weighted classifiers: multilevel support vector machine, k-nearest neighbor and decision tree. The main contributions of this paper are: (i) incorporating the statistical properties of gene data through choosing proper metrics, (ii) statistical evaluation of biological features, and (iii) noise robustness of PEGPUL via a multilevel schema. In order to assess PEGPUL, we have applied it on 12,950 disease genes, with 949 positive genes from six classes of diseases and 12,001 unlabeled genes. Compared with some popular disease gene identification methods, the experimental results show that PEGPUL has reasonable performance.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

In biomedical research, identification of the genes underlying human hereditary diseases is essential for prenatal and postnatal diagnosis and treatment (Piro and Cunto, 2012). Huntington's disease, the first genetic disease mapped to a location — the 4th chromosome of human DNA — was discovered by using polymorphism information (Bromberg, 2013). After that, biologists focused on gene-associated diseases and mutations in genes to identify genetic disorders. By screening them, the vulnerabilities of a child to inherited diseases can be determined before birth. Also, the prognosis and counselling of affected families can be discussed, and in some cases, this can lead to the development of therapeutic strategies (Piro and Cunto, 2012).

Since the abnormal function of genes in the body causes some diseases, it is necessary to identify the molecular pathways of these disorders (Bromberg, 2013). In this regard, studies of the properties of disease genes showed that genes associated with the same or similar diseases stay in the same neighborhood in molecular networks (Piro and Cunto, 2012). Moreover, the traditional tools were expensive and time-consuming. These observations led to the development of computational approaches for the prediction or prioritization of candidate disease genes (Wang et al., 2011). These approaches rely on the observation that diseases with the same or similar phenotype share common biological characteristics. In this regard, computational analysis is used to combine different data sources, functional information of genes is used to extract disease gene knowledge, and machine learning methods are used to predict the disease genes.

Disease gene identification, in terms of learning type, can be categorized into three groups: unsupervised, supervised and semi-supervised learning. Traditionally, researchers have treated the classification of disease genes as a supervised learning problem (Kohler et al., 2008; Smalter et al., 2007; Radivojac et al., 2008), though it is regarded as a semi-supervised problem in some research (Yang et al., 2012; Yang et al., 2014). Since in semi-supervised methods the learning starts with a small set of labeled (positive and negative) samples, the obtained model faces a small subset of positive samples and a huge subset of unlabeled samples

* Corresponding author.
E-mail addresses: hjowkar@shirazu.ac.ir (G.-H. Jowkar), mansoori@shirazu.ac.ir (E.G. Mansoori).

http://dx.doi.org/10.1016/j.compbiolchem.2016.07.004
1476-9271/© 2016 Elsevier Ltd. All rights reserved.

(possibly negative and/or positive) (Cerulo et al., 2010). This model is known as positive-unlabeled (PU) learning, since some disease genes have not been identified yet but have been treated as negative in the unlabeled set. In PU learning, these unidentified genes are used to identify positive genes. The main question to be answered is: does unlabeled data help? The answer depends on the problem under study.

In this paper, we present a Perceptron ensemble of graph-based positive-unlabeled learning (PEGPUL) on three biological networks: gene ontology (GO), protein domain (PD) and protein-protein interaction (PPI) networks. In this regard, after a brief explanation of building the biological networks, a reliable set of negative genes is extracted with the help of a co-training schema in order to form a two-class problem. Next, the similarity graph is built using metric learning, and a modified random walk method is used to perform inference from the labeled genes based on the guilt-by-association rule. At last, a Perceptron ensemble is learned from three classifiers: weighted multilevel support vector machine (SVM), weighted k-nearest neighbor (KNN) and weighted classification and regression tree (CART).

The rest of this paper is organized as follows. In Section 2, some recent related works are reviewed and discussed. Materials and our proposed PEGPUL method are explained in Section 3. In Section 4, the experimental settings and results are presented. Section 5 concludes the paper.

2. Related works

In early attempts, unsupervised clustering methods were used on GO (Freudenberg and Propping, 2002). Diseases of known genetic origin were clustered based on their phenotype similarities, and each candidate gene was scored according to its similarity to the clusters. This score showed the association degree of that gene when searching for a mutation in monogenic diseases. As a supervised method, Adie et al. proposed a decision-tree-based algorithm called PROSPECTR which uses a variety of genome sequence-based features (Adie et al., 2005). Smalter et al. (2007) used an SVM classifier with topological and sequence-based features of PPI networks to classify disease genes. Radivojac et al. (2008) proposed an SVM-based method called PhenoPred which uses three types of features: PPI network, protein sequences and protein functional information. PhenoPred combines three individual SVM classifiers, designed on the three feature sets, to form the final classifier.

Selecting a subset (prioritization) from candidate disease genes has been studied recently. Kohler et al. (2008) used PPI data to build the similarity network and then prioritized the candidate genes by random walk. They showed that PPI data is a valuable resource for this problem and is far better than direct interaction or shortest path measures for capturing relations in global similarity networks of genes. Vanunu et al. proposed PRINCE as a network-based method for prioritizing disease genes and inferring protein complex associations using PPI and disease-disease similarity measures (Vanunu et al., 2010). Categorized by the type of evidence used, some disease gene identification methods work with functional annotations (Yang et al., 2012; Yang et al., 2014), gene expression data (Adie et al., 2005) and ontologies (Yang et al., 2012; Yang et al., 2014; Freudenberg and Propping, 2002; Wang et al., 2007). Newer methods make use of PPI networks (Kohler et al., 2008; Yang et al., 2012; Yang et al., 2014) as a precious source for candidate gene prioritization.

In most of the research mentioned, the confirmed disease genes have been considered as positive genes and the unconfirmed genes as negative genes. Although known genes can safely be assumed to be positive, obtaining negative genes is not straightforward, because the non-involvement of these genes in hereditary disease has not been proved. No matter which supervised method is used, training with such wrongly labeled potential positive genes affects the performance of the classifier. With regard to these limitations, PU learning was proposed as a suitable procedure which considers unconfirmed disease genes as an unlabeled (rather than negative) set (Yang et al., 2012; Yang et al., 2014; Cerulo et al., 2010; Mordelet and Vert, 2011). There are two main approaches in PU learning (Cerulo et al., 2010): probability estimate correction, which learns without negative samples, and selection of reliable negatives, which extracts negative samples (Yang et al., 2012; Yang et al., 2014; Cerulo et al., 2010; Mordelet and Vert, 2011).

As an example of the first approach, Cerulo et al. (2010) presented a method called PosOnly. It trains an SVM classifier on positive and unlabeled gene regulatory networks to predict probabilities that differ by only a constant factor from the true conditional probabilities of being positive. In the other approach, the aim is to extract negative samples from the unlabeled genes. One way of choosing reliable negatives is selecting random subsets of the unlabeled genes. Mordelet and Vert proposed a bagging method called ProDiGe which repeatedly selects random subsets from the unlabeled genes and runs a different SVM classifier on each bootstrap (Mordelet and Vert, 2011). They used nine sources of information about the genes, which can be categorized into three types of features: PPI network, protein sequences and protein functional information. The final result was obtained by aggregating all of the individual results with a confidence score.

In an early adoption of PU learning for disease gene identification, Yang et al. proposed a method called PUDI that used three biological networks: GO, PPI network and PDs (Yang et al., 2012). Based on PU selection of reliable negatives and on the similarity to positive/negative genes, the unlabeled genes were divided into multiple subsets called reliable negative, likely positive, likely negative and weak negative. According to their similarity to the positive or negative class, the genes received different weights in order to be presented to multi-level weighted SVMs. In a newer work (Yang et al., 2014), Yang et al. extended their previous work and proposed Ensemble based PU learning (EPU). They added gene expression data and phenotypic similarity networks to their previous data sources. For ensembling classifiers, they used KNN, SVM and naïve Bayes.

3. Materials and methods

Disease gene identification methods typically involve two stages: extracting a list of candidate genes, and setting the criteria for learning, such as the involvement in a particular disease and the learning procedure (Moreau and Tranchevent, 2012). In this section, the specification and statistics of the disease datasets are described. Also, the related biological networks, the extracted features and their meanings are presented. Then, our proposed approach for disease gene identification is explained.

3.1. Data and biological networks

In this work, using the Online Mendelian Inheritance in Man (OMIM) database, positive genes are marked and other genes are treated as unlabeled. Six groups of confirmed diseases are selected as positive genes, based on (Goh et al., 2007): cardiovascular, endocrine, cancer, metabolic, neurological and ophthalmological diseases. With respect to the quality and the performance of disease gene identification methods, the data are derived from multiple biological sources (Gill et al., 2014). Following the methods outlined in (Yang et al., 2012), PD, PPI and GO data have been used as the feature vector of each gene.

PDs are evolutionary features, since they are the natural functional building blocks of proteins. Indeed, these domains are essential units that participate in transcription and in interactions with other molecules (Yang et al., 2012). Pfam, a database of protein families, is built by training hidden Markov models on curated alignments of representative sequences (Punta et al., 2011). Pfam includes two collections: Pfam-A, of high quality since it is manually curated, and the computationally derived Pfam-B of lower quality. In the current research, only Pfam-A is used.

GO is a set of controlled vocabularies to annotate genes and their products (Yang et al., 2014). It describes the biological roles of molecular entities and their relationships (Freudenberg and Propping, 2002). GO includes three sub-ontologies: molecular function (MF), the elemental activities of a gene product at the molecular level; biological process (BP), a set of molecular functions; and cellular component (CC), which represents parts of a cell or its extracellular environment. In this regard, the feature vector of each gene contains three components {MF, BP, CC}. By using the graph-based approach in (Wang et al., 2007), the GO attributes are converted into numeric features. Accordingly, these attributes are ranked and the top-scored features of each of the three sub-ontologies are selected such that |MF| = |BP| = |CC|.

The PPI network represents the physical interactions between proteins. It is an undirected graph whose nodes are the genes and whose edges are the mapped interactions of the proteins encoded by the genes (Kohler et al., 2008). From the PPI network, four topological features are used: D, the degree of each protein (i.e., the number of genes in its radius-1 neighborhood); 1PN, the 1-positive-neighbor feature (i.e., the number of positive genes in the radius-1 neighborhood of a gene divided by its degree); 2PN, the 2-positive-neighbor feature (i.e., the number of positive genes in the radius-2 neighborhood of a gene divided by its degree); and CLC, the clustering coefficient, which measures the degree to which the direct neighbors of a gene tend to be clustered together.

A gene dataset with positive set P and unlabeled set U is represented as G = P ∪ U, where each gene g is represented by the vector V_g of its components, V_g = (MF, BP, CC, PD, D, 1PN, 2PN, CLC). The statistics of the gene components are summarized in Table 1.

Table 1
Gene vector description.

Component           GO                     PD      PPI
                    MF     BP     CC               D    1PN   2PN   CLC
Component length    1000   1000   1000     1000    1    1     1     1

The description of the disease gene dataset is given in Table 2. From the number of genes, it is clear that this dataset is highly imbalanced, such that over 92% of the data belongs to the unlabeled genes.

Table 2
Statistics of examined disease genes.

Label of genes   Disease class       No. of genes   Gene space %
Positive         Cardiovascular      104            0.80%
                 Endocrine           81             0.62%
                 Metabolic           264            2.03%
                 Neurological        219            1.68%
                 Ophthalmological    107            0.82%
                 Cancer              174            1.34%
Unlabeled        –                   12001          92.31%

3.2. Proposed PEGPUL method

Using the feature vectors described earlier, the PEGPUL method is proposed to identify the disease genes. In the first step, it extracts a reliable set of negative genes (Yang et al., 2012; Yang et al., 2014) as seed initialization with the help of a co-training scheme. Next, using the positive and the enriched reliable negative genes, the similarity graph is built via metric learning. After that, a modified version of the random walk method, called multi-rank-walk, is used to propagate labels to the remaining unlabeled genes. According to its results, the weights of three classifiers are determined. At last, a Perceptron neural network is used to establish an ensemble of these classifiers for gene prediction. The algorithm of PEGPUL is presented here, while its steps are explained in the following subsections.

Algorithm: PEGPUL method
  Using co-training algorithm, extract reliable negative genes
  Using metric learning, construct similarity graph
  Using multi-rank-walk, propagate labels
  Using Perceptron ensemble of weighted classifiers, predict the disease genes
  Return

3.2.1. Extracting negative set using co-training

The input of a binary gene classifier consists of two classes: positive and negative genes. As mentioned before, the major difference between ordinary semi-supervised learning and PU learning is the existence of a negative set. In order to extract a reliable set of negative genes, we have extended the naïve but efficient approach in (Yang et al., 2012; Yang et al., 2014). Its basis is the idea that an unlabeled gene with the most dissimilarity to the positive set is a reliable negative gene. In other words, those genes from the unlabeled set U which have maximum distance from the positive genes in P are good candidates for negative genes. In this regard, a representative of the positive genes (denoted by g_p) is introduced, and its vector is computed as the mean of the genes in P:

V_{g_p} = \frac{1}{|P|} \sum_{g \in P} V_g    (1)

Then, the genes with larger distances from g_p are considered as the more probable candidates for the negative set:

N = \{ g \in U \mid Dist_E(g, g_p) > \overline{Dist} \}    (2)

where \overline{Dist} is a distance threshold, defined as the average distance of the unlabeled genes from g_p:

\overline{Dist} = \frac{1}{|U|} \sum_{g \in U} Dist_E(g, g_p)    (3)

In Eq. (3), Dist_E(·,·) is the Euclidean distance. Using N as the seed of reliable negative genes, a semi-supervised learning algorithm is applied to enrich this set. With respect to the semi-supervised literature, the co-training method (Zhu, 2005; Blum and Mitchell, 1998) is used here. In order to use this method, three assumptions should be satisfied: (i) the features can be split into two subsets, (ii) each subset is sufficient for training a good classifier, and (iii) the two subsets are conditionally independent given the labels.

Using co-training, the features are split into two subsets: GOs and proteins (PDs and PPIs). This approach relies on the assumption that these two subsets are conditionally independent. Then, two classifiers are independently run on these two views to predict the unlabeled genes confidently. If the two views agree, the unlabeled genes take positive/negative labels. The negative ones are added to the training set as high-confidence genes in order to be considered in the next iterations of the algorithm.

Algorithm: Co-training algorithm for negative gene extraction
  Inputs: P: set of positive genes, U: set of unlabeled genes, C1, C2: two distinct classifiers
  Outputs: N: set of reliable negative genes
  Using Eq. (1), obtain the representative vector V_{g_p}
  Using Eq. (3), compute \overline{Dist}
  N = { g ∈ U : Dist_E(g, g_p) > \overline{Dist} }
  U = U − N
  Repeat
    Train classifier C1 on P ∪ N with only GO features
    Train classifier C2 on P ∪ N with only PD and PPI features
    Nr = { g ∈ U : g is agreed by C1 and C2 as negative }
    N = N ∪ Nr
    U = U − Nr
  Until Nr is empty
  Return N
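As an illustration of the seed extraction of Eqs. (1)–(3) and the agreement-based loop above, a minimal Python sketch follows. The algorithm leaves the two view classifiers C1 and C2 unspecified, so logistic regression is our assumption, and all names are illustrative rather than the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_negatives(X_pos, X_unl, go_cols, prot_cols):
    """Seed a reliable negative set (Eqs. 1-3), then enrich it by co-training."""
    g_p = X_pos.mean(axis=0)                    # Eq. (1): positive representative
    d = np.linalg.norm(X_unl - g_p, axis=1)     # Euclidean distance to g_p
    seed = d > d.mean()                         # Eqs. (2)-(3): farther than average
    N, U = X_unl[seed], X_unl[~seed]
    while len(U) > 0:
        X = np.vstack([X_pos, N])
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(N))]
        c1 = LogisticRegression(max_iter=1000).fit(X[:, go_cols], y)    # GO view
        c2 = LogisticRegression(max_iter=1000).fit(X[:, prot_cols], y)  # PD+PPI view
        # a gene joins N only when both views agree that it is negative
        agree = (c1.predict(U[:, go_cols]) == 0) & (c2.predict(U[:, prot_cols]) == 0)
        if not agree.any():
            break                               # Nr is empty: stop
        N, U = np.vstack([N, U[agree]]), U[~agree]
    return N
```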
3.2.2. Similarity graph construction by metric learning

Suppose we are given a positive set and a negative set of genes, plus the remaining unlabeled genes. We need to construct a similarity graph, as in graph-based semi-supervised learning. In this regard, the k-nearest neighbor graph method from the semi-supervised literature is used (Zhu, 2005): an edge is added between a gene and each of its neighbors to express the similarity between the two genes.

In most graph-based methods, the main focus is on label propagation over an already-built graph rather than on the graph construction technique. Accordingly, many learning methods do not use the label information to measure the similarity or dissimilarity of two genes (so-called unsupervised construction). Here, for constructing the similarity graph, the label information is used in a supervised manner to learn the distance metric (Dhillon et al., 2010; Huang et al., 2013). The Mahalanobis distance is defined as:

Dist_M(g, g') = (V_g − V_{g'})^T A (V_g − V_{g'})    (4)

where A is the inverse covariance matrix and V_g is the vector of gene g, as described before. Notice that the Euclidean distance is obtained from Eq. (4) if A = I. In this paper, however, instead of computing the inverse covariance matrix, A is learned in a supervised manner using the large margin nearest neighbor approach (Dhillon et al., 2010), in which A is learned as a linear transformation pulling genes toward their k nearest neighbors with the same labels.

Using the learned metric in Eq. (4), the similarity between all genes is computed and the similarity graph is built. In this graph, each node indicates a gene while the edges represent the similarity or relations between genes. The graph can be represented by a row-normalized adjacency matrix W = [w_{gg'}] where:

w_{gg'} = 1 − \frac{Dist_M(g, g') − \min_{g'' \in G} Dist_M(g, g'')}{\max_{g'' \in G} Dist_M(g, g'') − \min_{g'' \in G} Dist_M(g, g'')}    (5)

where Dist_M(·,·) is the Mahalanobis distance of Eq. (4). For some biological interpretation, Fig. 1 shows a small portion of this graph. According to the similarity degrees on the edges, genes ABL1 and AR are very similar with respect to their role in cancer. Also, the self-similarity of each gene is high. However, the similarity between ABL1 and the metabolic disease gene PCCA is minimal.

[Fig. 1. A portion of the similarity graph.]
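A dense Python sketch of Eqs. (4), (5) and (7) is given below for small gene sets; the learned Mahalanobis matrix A is taken as an input (passing the identity recovers the Euclidean case), since the metric-learning step itself is delegated to the cited approach.

```python
import numpy as np

def similarity_graph(V, A):
    """Pairwise Mahalanobis distances (Eq. 4), min-max similarity (Eq. 5),
    and the row-normalized walk matrix W' = D^{-1} W (Eq. 7)."""
    diff = V[:, None, :] - V[None, :, :]                 # all pairwise differences
    dist = np.einsum('ijk,kl,ijl->ij', diff, A, diff)    # Eq. (4) for every pair
    lo = dist.min(axis=1, keepdims=True)                 # min over g'' for each g
    hi = dist.max(axis=1, keepdims=True)                 # max over g'' for each g
    W = 1.0 - (dist - lo) / (hi - lo)                    # Eq. (5)
    W_prime = W / W.sum(axis=1, keepdims=True)           # Eq. (7)
    return W, W_prime
```

Note that each gene's self-distance is zero, so its self-similarity is 1, matching the observation about Fig. 1 above.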
Using this similarity graph, the labels are propagated via a modified multi-rank-walk to handle the remaining unlabeled genes.

3.2.3. Propagating labels using multi-rank-walk

There is a variety of label propagation methods in the literature, including some kernelized methods (Valentini et al., 2014). In this section, a method based on random walk with restart on the graph is proposed, which estimates the probability that an unlabeled gene belongs to the negative/positive class. We aim to find the similar genes based on label propagation from the currently labeled genes.

The multi-rank-walk algorithm is based on random walk with restart, defined as a walker's transition from its current node (gene) to randomly chosen neighbors (Kohler et al., 2008; Lin and Cohen, 2010; Le and Kwon, 2013). At each step, the random walker follows an edge with probability 1 − r or jumps back to a source gene (restart) with probability r. Let p_+^t and p_−^t be the probability vectors of all genes in G regarding the positive and negative classes, respectively, at time step t. These probabilities at time step t + 1 are computed as:

p_c^{t+1} = (1 − r) W' p_c^t + r p_c^0, \quad for c ∈ {+, −}    (6)

where W' is the random-walk-normalized Laplacian matrix, computed as:

W' = D^{−1} W    (7)

where W is the similarity matrix of Eq. (5) and D is a diagonal matrix with entries D_{gg} = \sum_{g' \in G} w_{gg'}.

Variations of the random walk algorithm differ in the setting and interpretation of p_+^0 and p_−^0, the initial seed probability vectors of the genes (Lin and Cohen, 2010). In the proposed algorithm, these probabilities are initialized as:

p_+^0 = \frac{1}{|P|} (1_{|P|}, 0_{|N|}, 0_{|U|})    (8)

and

p_−^0 = \frac{1}{|N|} (0_{|P|}, 1_{|N|}, 0_{|U|})    (9)

where 1_n denotes a vector of n ones. Clearly, ||p_+^0||_1 = ||p_−^0||_1 = 1.



After the algorithm converges, the label of each gene is obtained by comparing its related probabilities in p_+^t and p_−^t (Yang et al., 2012; Yang et al., 2014). In this regard, the unlabeled genes in U are divided into three subclasses. According to the threshold 1 − r, the unlabeled genes with higher p_+^t are categorized as quasi positive (QP); those having larger p_−^t as quasi negative (QN); and all remaining genes are called moderate negative (MN). Using these categories, the weights of the classifiers are determined. The details of this algorithm are described here.

Algorithm: Multi-rank-walk algorithm for label propagation
  Inputs: P, N, U: sets of positive, negative and unlabeled genes; r: restart probability
  Outputs: QP: quasi positive set, QN: quasi negative set, MN: moderate negative set
  Using Eq. (5), compute W
  Using Eq. (7), compute W'
  Initialize p_+^0 using Eq. (8)
  Initialize p_−^0 using Eq. (9)
  t = 0
  Repeat
    p_+^{t+1} = (1 − r) W' p_+^t + r p_+^0
    p_−^{t+1} = (1 − r) W' p_−^t + r p_−^0
    t = t + 1
  Until ||p_+^{t+1} − p_+^t||_1 + ||p_−^{t+1} − p_−^t||_1 < ε
  QP = { g ∈ U : p_+^t(g) > p_−^t(g) and p_+^t(g) > 1 − r }
  QN = { g ∈ U : p_−^t(g) > p_+^t(g) and p_−^t(g) > 1 − r }
  MN = U − (QP ∪ QN)
  Return QP, QN, MN
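The iteration of Eq. (6) and the QP/QN/MN split can be sketched as follows. The restart probability r and the tolerance eps are assumed values, and the orientation of the walk matrix (transposed so that probability mass flows along edges) is an implementation choice, since the paper's notation leaves it implicit.

```python
import numpy as np

def multi_rank_walk(W_prime, pos_idx, neg_idx, n, r=0.15, eps=1e-6):
    """Random walk with restart from the positive and negative seeds."""
    p0_pos = np.zeros(n); p0_pos[pos_idx] = 1.0 / len(pos_idx)   # Eq. (8)
    p0_neg = np.zeros(n); p0_neg[neg_idx] = 1.0 / len(neg_idx)   # Eq. (9)
    p_pos, p_neg = p0_pos.copy(), p0_neg.copy()
    while True:
        q_pos = (1 - r) * W_prime.T @ p_pos + r * p0_pos         # Eq. (6), c = +
        q_neg = (1 - r) * W_prime.T @ p_neg + r * p0_neg         # Eq. (6), c = -
        delta = np.abs(q_pos - p_pos).sum() + np.abs(q_neg - p_neg).sum()
        p_pos, p_neg = q_pos, q_neg
        if delta < eps:                                          # L1 convergence test
            return p_pos, p_neg

def categorize(p_pos, p_neg, unl_idx, r=0.15):
    """QP / QN / MN split of the unlabeled genes, as in the algorithm above."""
    qp = {g for g in unl_idx if p_pos[g] > p_neg[g] and p_pos[g] > 1 - r}
    qn = {g for g in unl_idx if p_neg[g] > p_pos[g] and p_neg[g] > 1 - r}
    mn = set(unl_idx) - qp - qn
    return qp, qn, mn
```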
3.2.4. Ensembling classifiers

After categorizing the unlabeled genes using the multi-rank-walk algorithm, all genes in G are divided into five subclasses, that is, G = P ∪ N ∪ (QP ∪ QN ∪ MN). In this section, we assign different weights to these subsets, based on labeling confidence, via three distinct weighted classifiers. Using these independent classifiers to predict a disease gene, their posterior probabilities are presented to a Perceptron ensembler which makes the final prediction.
presented to a Perceptron ensembler to make final prediction. WCARTÞ are calculated.

3.2.4.1. Weighted multilevel support vector machine

SVM is a classifier which solves an optimization problem to find support vectors. If outlier genes are chosen as support vectors during learning, the decision boundary may end up far from the true boundary. To overcome this problem, different weights are assigned to the genes according to their subclasses (called multi-level learning (Yang et al., 2012)). In the weighted multi-level support vector machine (WMSVM), the decision hyperplanes are learned according to the relative importance of the genes in the training set. Using the five subclasses of G, WMSVM is formulated as:

\arg\min_{V, b, \xi} \; \frac{1}{2} \|V\|^2 + \rho \left( \nu_P \xi_P |P| + \nu_N \xi_N |N| + \nu_{QP} \xi_{QP} |QP| + \nu_{QN} \xi_{QN} |QN| + \nu_{MN} \xi_{MN} |MN| \right)
\quad s.t. \; C_g (V^T V_g + b) \geq 1 − \xi_g, \quad g \in G, \; C_g \in \{P, N, QP, QN, MN\}    (10)

where \xi_g is the slack variable which measures the misclassification degree of gene g; the same \xi is assumed for all genes in a given subclass (e.g., \xi_P for all genes in P). In Eq. (10), \rho is a penalty factor searched over the exponentially growing sequence \{2^{−8}, 2^{−7}, \ldots, 2^{7}, 2^{8}\}. Also, the weight vector \nu = (\nu_P, \nu_N, \nu_{QP}, \nu_{QN}, \nu_{MN}) is initialized as \nu^0 = (1, 1, 1, 1, 1) and is then tuned empirically by varying one element while keeping the others fixed (Yang et al., 2012). After learning on the training genes, the posterior probability p(g | class: +/−, model: WMSVM) is obtained, to be presented to the Perceptron.
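One simple way to realize the multilevel weighting of Eq. (10) with off-the-shelf tools is to pass subclass-dependent costs as per-sample weights to a standard SVM. This is a sketch under stated assumptions: sklearn's SVC is our stand-in for the authors' solver, and the ν values shown are just the initialization ν⁰ = (1, 1, 1, 1, 1).

```python
import numpy as np
from sklearn.svm import SVC

def train_wmsvm(X, y, subclass, nu=None, rho=1.0):
    """Weighted multilevel SVM in the spirit of Eq. (10).

    subclass: per-gene tag in {'P','N','QP','QN','MN'}; nu: the paper's
    importance weights; rho: penalty factor, searched over {2**-8,...,2**8}.
    """
    nu = nu or {'P': 1.0, 'N': 1.0, 'QP': 1.0, 'QN': 1.0, 'MN': 1.0}
    weights = np.array([nu[s] for s in subclass])
    clf = SVC(C=rho, kernel='linear', probability=True)
    clf.fit(X, y, sample_weight=weights)    # subclass-weighted slack penalties
    return clf                              # predict_proba -> p(g | +/-, WMSVM)
```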
learned according to the relative importance of genes in the
training set. Using the five subclasses of G, WMSVM is formulated
where wgi is set of weights we are looking for, ygi is genes label
as:
(þ= for positive/negative genes). The algorithm uses stochastic
1  gradient descent to minimize this objective function Q as learning
argmin k V k2 þ r vP jP jPj þ vN jN jNj þ vQP jQP jQPj
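The Chebyshev metric of Eq. (11) and the inverse squared distance voting can be expressed directly with sklearn's nearest-neighbor classifier; k = 9 anticipates the value selected in Table 3, and the small epsilon guarding against zero distances is our addition.

```python
from sklearn.neighbors import KNeighborsClassifier

def train_wknn(X, y, k=9):
    """Weighted KNN: Chebyshev distance (Eq. 11), inverse squared votes."""
    clf = KNeighborsClassifier(
        n_neighbors=k,
        metric='chebyshev',                        # infinity-norm distance
        weights=lambda d: 1.0 / (d ** 2 + 1e-12),  # inverse squared distance votes
    )
    return clf.fit(X, y)                           # predict_proba for the ensemble
```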
3.2.4.3. Weighted decision tree

CART (as a decision tree) is built by recursively partitioning the gene space based on impurity measures such as the Gini index. In this paper, the Gini diversity index (GDI) is taken as the impurity: for a pure node containing just one class, the GDI is zero; otherwise, it is positive. This index for a node g is defined as:

GDI(g) = 1 − p^2(+|g) − p^2(−|g)    (12)

where p(+|g) and p(−|g) are the observed fractions of the positive and negative class in the training data, respectively. The objective of CART is to minimize the misclassification cost of the genes.

As in the two previous classifiers, the weights of the genes in the weighted CART (WCART) are computed based on multi-level learning with the same parameter setting schema. From this model, the posterior probability p(g | class: +/−, model: WCART) is calculated.
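A CART grown with the Gini diversity index of Eq. (12) and the same multilevel sample weights might look as follows; sklearn's tree is an assumed implementation, not the authors'.

```python
from sklearn.tree import DecisionTreeClassifier

def train_wcart(X, y, sample_weight):
    """CART with the Gini diversity index (Eq. 12) as impurity; sample_weight
    carries the same multilevel subclass weights used for WMSVM."""
    clf = DecisionTreeClassifier(criterion='gini')
    return clf.fit(X, y, sample_weight=sample_weight)
```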
3.2.4.4. Perceptron ensemble

The Perceptron is a single-layer neural network which avoids blind greedy learning (Hastie et al., 2005). Its main scheme is that, during learning, the weights of the Perceptron are adjusted only when an error occurs. The Perceptron algorithm tries to minimize the following objective function over the misclassified genes g_i:

Q = −\sum_{g_i} y_{g_i} (w^T V_{g_i} + w_0)    (13)

where w is the set of weights we are looking for and y_{g_i} is the gene's label (+/− for positive/negative genes). The algorithm uses stochastic gradient descent to minimize this objective function Q, and the learning continues until no error occurs.

Using the Perceptron, the posterior probabilities p(g | class: +/−, model) of the three models WMSVM, WKNN and WCART are ensembled by a Perceptron predictor. After adjusting the weights of the Perceptron via training, the final decision on a new gene is made by a weighted combination of the three posterior probabilities.

4. Experimental results

In this section, the preprocessing of the disease gene dataset is explained first. Then, the data is examined statistically with some biological interpretation. Next, the action of the PEGPUL steps on the gene data is presented and its performance is compared against some other works. At last, PEGPUL is investigated from statistical and machine learning perspectives.

4.1. Disease gene data preprocessing

Among the 12,950 genes in the dataset, 949 confirmed positive genes of six disease classes based on (Goh et al., 2007) are considered against 12,001 unlabeled genes. In this regard, each gene g is presented by the vector V_g, as described in Section 3.1. Since some components of V_g might be missing, we have used zero values to declare their absence as an imputation strategy. Moreover, before presenting the genes to the classifiers, the values in the gene vectors are normalized.

As shown in Table 2, the dataset is highly imbalanced. To reduce its side-effect on the performance of the predictors, the 12,001 unlabeled genes are under-sampled to 949. In this regard, ten subsets (of size 949) of unlabeled genes are selected randomly. Then, each one is joined with the positive genes to form a balanced dataset of size 1898 (altogether ten datasets). Each dataset is used in the experiments in a 3-fold cross-validation manner: 2 folds of the data are used for model construction and the remaining fold for model testing.
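The resampling protocol just described could be organized as below; run_pegpul is a placeholder for the full pipeline and the random seed is our own choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_experiments(X_pos, X_unl, run_pegpul, n_repeats=10, seed=0):
    """Ten random 949-gene subsamples of the unlabeled set, each joined with
    the positives (size 1898) and evaluated by 3-fold cross-validation."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(len(X_unl), size=len(X_pos), replace=False)
        X = np.vstack([X_pos, X_unl[idx]])      # balanced set of size 1898
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_pos))]
        cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
        for train, test in cv.split(X, y):
            scores.append(run_pegpul(X[train], y[train], X[test], y[test]))
    return scores
```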
4.2. Disease gene data analysis

In order to study the extracted features of the disease gene data, their discrimination ability and correlation are investigated here. First, the probability density function (PDF) of a sample feature (e.g., feature 880) with respect to the positive/unlabeled genes is plotted to show the capability of this feature in class discrimination; it is visualized through kernel density estimation in Fig. 2. Apparently, this feature has a multimodal PDF and is not discriminative at all. Therefore, the PU assumption that the unlabeled genes may contain positive genes is probably true.

[Fig. 2. PDF of feature 880.]

As another analysis, the Pearson correlation (Murphy, 2012) between some pairs of features is examined. This study reveals high correlation between some pairs of features. Using features 5 and 7 as representatives, their Pearson correlation reached as high as 0.99.

4.3. Performance of PEGPUL on gene data

After extracting the feature vectors from the gene data, the similarity graph is built. The edges in this graph represent the similarity between two genes, computed using Eq. (4). Applying the multi-rank-walk algorithm to this similarity graph to propagate the labels divides the unlabeled genes into the quasi positive, quasi negative and moderate negative subsets. These subsets of genes, with different weights, are used in the multilevel learning of the Perceptron ensemble to make the final predictions.

In the current learning schema, there are several hyper-parameters which should be chosen, such as the desired number of nearest neighbors in WKNN and the split criterion in WCART. In this regard, some important ones are discussed here. In WKNN, candidate values for K, the number of nearest neighbors, and the corresponding performances are shown in Table 3. According to the F-measure and Matthews correlation coefficient (MCC) (The MicroArray Quality Control (MAQC) Consortium, 2010) criteria, setting K = 9 is suitable for the WKNN classifier.

Table 3
Effect of K on WKNN classifier.

No. of neighbors, K   Precision      Recall         F-measure      MCC
1                     72.23 ± 1.58   85.19 ± 0.90   78.73 ± 1.18   0.55 ± 0.03
3                     74.66 ± 0.82   85.85 ± 1.01   79.85 ± 0.65   0.58 ± 0.01
9                     75.77 ± 0.94   86.05 ± 0.81   80.56 ± 0.60   0.59 ± 0.01
13                    75.97 ± 0.86   85.65 ± 0.95   80.50 ± 0.48   0.59 ± 0.01
15                    76.17 ± 0.69   85.73 ± 0.82   80.65 ± 0.38   0.59 ± 0.01
19                    76.45 ± 0.76   85.36 ± 0.87   80.65 ± 0.38   0.59 ± 0.01
21                    76.53 ± 0.89   85.32 ± 0.94   80.66 ± 0.43   0.59 ± 0.01
23                    76.69 ± 0.81   85.35 ± 0.82   80.76 ± 0.39   0.59 ± 0.01
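For reference, the four criteria reported in Tables 3–8 can be computed as below (precision, recall and F-measure in percent, MCC in [−1, 1]); scikit-learn is an assumed tooling choice, not something the paper specifies.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef)

def evaluate(y_true, y_pred):
    """Precision, recall, F-measure and MCC as reported in the tables."""
    return {
        'precision': 100 * precision_score(y_true, y_pred),
        'recall':    100 * recall_score(y_true, y_pred),
        'f_measure': 100 * f1_score(y_true, y_pred),
        'mcc':       matthews_corrcoef(y_true, y_pred),
    }
```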

In the weighted version of each classifier, as described for WMSVM, the weight vector is obtained empirically by varying one element of the weights while keeping the others fixed. Before constructing the ensemble, the impact of these weights on each individual classifier is examined. For this purpose, the performance of each base classifier, before and after applying the weights, is reported in Table 4. According to the precision, recall, F-measure and MCC in this table, the weighted versions of the classifiers outperform their unweighted counterparts. This also confirms the goodness and effectiveness of the obtained weights.

Table 4
The effects of weights on base classifiers.

Classifier   Precision      Recall         F-measure      MCC
KNN          69.87 ± 1.11   67.96 ± 1.20   68.83 ± 0.90   0.38 ± 0.01
WKNN         75.77 ± 0.95   86.05 ± 0.81   80.56 ± 0.60   0.59 ± 0.01
CART         64.71 ± 1.23   71.95 ± 2.24   68.12 ± 1.48   0.33 ± 0.06
WCART        75.56 ± 1.62   70.02 ± 1.46   72.61 ± 1.05   0.47 ± 0.02
SVM          66.94 ± 1.09   75.19 ± 0.80   70.78 ± 0.77   0.38 ± 0.02
WMSVM        80.03 ± 1.17   77.64 ± 0.86   78.79 ± 0.67   0.58 ± 0.01

In order to motivate the use of the Perceptron ensemble, three well-known ensembling procedures are compared in Table 5. In this regard, beside the result of the Perceptron, a stacking ensemble (Hastie et al., 2005) and a multilayer perceptron (MLP) (Murphy, 2012) are also examined. Aggregating (voting) the results via the stacking ensemble is the simplest way to combine the individual classifier results. MLP, as an ensembling approach, uses the same structure as the Perceptron with one hidden layer and a nonlinear activation function. The performance comparison of these ensembles is presented in Table 5. Though the stacking ensemble recalls most of the disease genes, the precision and F-measure of the neural network ensembles are higher. Overall, the Perceptron achieves the best results (shown in bold). This means that a neural network with only an output layer is sufficient for this ensembling purpose. Apparently, using more layers, as in MLP, results in worse performance because of a wrong number of hidden neurons, inappropriate learning algorithm selection, or overtraining.

Table 5
Performance comparison of ensembles.

Classifier            Precision      Recall         F-measure      MCC
Perceptron ensemble   76.67 ± 0.58   89.31 ± 0.46   82.49 ± 0.38   0.63 ± 0.01
Stacking ensemble     67.83 ± 0.83   96.24 ± 0.46   79.56 ± 0.63   0.56 ± 0.15
MLP ensemble          74.32 ± 2.65   88.04 ± 2.15   80.45 ± 1.58   0.58 ± 0.03

Additionally, the performance of each weighted classifier in the individual and ensemble scenarios is presented in Table 6. As expected, the ensemble of the three weighted classifiers works better than each individual classifier. According to these results, WCART is the weakest classifier in terms of both precision and recall, while the precision of WMSVM is the highest, even better than the ensemble. However, the Perceptron ensemble is the best in terms of F-measure (shown in bold).

Table 6
Performance comparison of individual classifiers versus Perceptron ensemble.

Classifier            Precision      Recall         F-measure      MCC
WKNN                  75.77 ± 0.95   86.05 ± 0.81   80.56 ± 0.60   0.59 ± 0.01
WCART                 75.56 ± 1.62   70.02 ± 1.46   72.61 ± 1.05   0.47 ± 0.02
WMSVM                 80.03 ± 1.17   77.64 ± 0.86   78.79 ± 0.67   0.58 ± 0.01
Perceptron ensemble   76.67 ± 0.58   89.31 ± 0.46   82.49 ± 0.38   0.63 ± 0.01

To illustrate the robustness of PEGPUL to noise, some of the outlying genes in the dataset are treated as noisy and removed from the data. For this purpose, the centroids (means) of the positive and unlabeled genes are computed and some of the genes farthest from their centroids are set aside. Then, PEGPUL is applied on the noise-purified gene data and its performance is computed as before. Table 7 reports the precision, recall and F-measure of PEGPUL for four distinct percentages of noise removal. According to these results, our algorithm is able to tolerate noisy gene data and outliers.

Table 7
Effect of removing noisy gene data on performance of PEGPUL.

Noisy genes   Precision      Recall         F-measure      MCC
0%            76.54 ± 3.46   89.98 ± 1.41   82.68 ± 2.04   0.63 ± 0.03
10%           76.87 ± 2.20   90.24 ± 2.40   83.00 ± 1.67   0.64 ± 0.02
20%           78.97 ± 4.20   88.38 ± 4.59   83.28 ± 1.54   0.66 ± 0.02
30%           77.63 ± 2.93   90.94 ± 1.55   83.75 ± 2.24   0.67 ± 0.03

As the last experiment on the gene data, the effectiveness of PEGPUL, as a semi-supervised approach, is compared against some supervised and unsupervised learning methods and/or ensembles of classifiers. In this regard, PUDI (Yang et al., 2012), the first attempt at PU learning in disease gene identification, is considered, where the biological networks GO, PPI and PDs are used as in PEGPUL. Additionally, ProDiGe, a bagging method (Mordelet and Vert, 2011), is used in our comparison. This method repeatedly selects random subsets from the unlabeled genes and runs a different SVM classifier on each bootstrap. Also, EPU, an ensemble extension of PUDI, is compared (Yang et al., 2014). Moreover, the results of the LogitBoost ensemble (Murphy, 2012), as a boosting method, are presented. Bagging via random forest, as another supervised method, is also used. To compare PEGPUL with unsupervised methods, the k-means (Murphy, 2012) and k-medoids (Murphy, 2012) clustering algorithms are included, where the number of clusters is set to the number of classes in a supervised manner (i.e., k = 2); in these cases, the accuracy of the clusters is computed in terms of pureness. Table 8 compares the precision, recall, F-measure and MCC of these methods, where the best results are in boldface. According to these results, PEGPUL has the best performance. In comparison with EPU, its nearest competitor, PEGPUL obtains only slightly better results, though this small improvement is noticeable in bioinformatics.

Table 8
Performance comparison of disease gene identification methods.

Prediction method   Precision      Recall         F-measure      MCC
PUDI                79.99 ± 1.13   78.00 ± 1.01   78.95 ± 0.86   0.58 ± 0.02
ProDiGe             61.04 ± 0.91   82.10 ± 0.82   70.20 ± 0.75   0.38 ± 0.01
EPU                 78.12 ± 0.65   85.21 ± 0.35   81.34 ± 0.25   0.61 ± 0.00
PEGPUL              76.67 ± 0.58   89.31 ± 0.46   82.49 ± 0.38   0.63 ± 0.01
Bagging             76.74 ± 1.22   77.59 ± 1.27   77.13 ± 1.01   0.54 ± 0.02
LogitBoost          80.00 ± 1.14   77.19 ± 0.94   78.54 ± 0.92   0.57 ± 0.01
k-means             61.17 ± 1.05   65.91 ± 0.94   63.41 ± 0.64   0.24 ± 0.02
k-medoids           60.64 ± 1.56   66.57 ± 3.64   62.98 ± 1.11   0.23 ± 0.02

In order to evaluate the performance of PEGPUL in another application, four datasets available from the UCI repository (Asuncion and Newman, 2007) are used in another experiment. Table 9 summarizes the specifications of these datasets.

Table 9
UCI datasets used in the experiments.

Dataset           No. of features   No. of classes   No. of instances
Sonar             60                2                208
Pima (Diabetes)   8                 2                768
Bupa              6                 2                345
Cancer (Breast)   9                 2                699

In order to cast these datasets into the PU learning setting, about 20% of the positive instances are added to the negative ones to build an unlabeled set (a sketch of this conversion is given after Table 10). Then, these new datasets are used in experiments with PUDI, EPU and PEGPUL. Table 10 reports their performance in terms of F-measure and MCC, with the best results shown in boldface. Obviously, the performance of PEGPUL is comparable to EPU and PUDI. On the Cancer dataset, however, the F-measure of PUDI is a little better. Also, on the Bupa and Cancer datasets, PUDI and EPU have higher MCC.

Table 10
Comparison of PEGPUL against EPU and PUDI on UCI datasets.

          F-measure                                      MCC
Dataset   PUDI           EPU            PEGPUL           PUDI          EPU           PEGPUL
Sonar     71.88 ± 0.89   71.81 ± 1.18   76.06 ± 0.40     0.41 ± 0.16   0.39 ± 0.23   0.47 ± 0.06
Pima      71.83 ± 0.34   72.45 ± 0.43   73.36 ± 0.35     0.41 ± 0.02   0.41 ± 0.08   0.41 ± 0.01
Bupa      60.63 ± 0.25   65.63 ± 0.75   66.25 ± 0.59     0.18 ± 0.03   0.10 ± 0.08   0.02 ± 0.01
Cancer    90.41 ± 0.19   90.35 ± 0.13   89.38 ± 0.16     0.81 ± 0.03   0.81 ± 0.02   0.78 ± 0.01
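The PU conversion of the UCI datasets referenced above can be sketched as follows; the 20% fraction comes from the text, while the interpretation (hiding some positive labels in the unlabeled pool) and all names are illustrative.

```python
import numpy as np

def to_pu(y, frac=0.2, seed=0):
    """Hide a fraction of the positives: the returned s is 1 for labeled
    positives and 0 for the unlabeled pool (hidden positives + negatives)."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    hidden = rng.choice(pos, size=int(frac * len(pos)), replace=False)
    s = (y == 1).astype(int)
    s[hidden] = 0            # these positives now look unlabeled
    return s
```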

At last, to statistically analyze the performance of PEGPUL against PUDI and EPU, a paired t-test (Moreau and Tranchevent, 2012) is examined on the null hypothesis that the F-measure of PEGPUL on the gene data and the four UCI datasets is not less than that of the others. The p-values of the paired comparisons, at α = 0.05, are reported in Table 11; small p-values would cast doubt on the validity of the null hypothesis. Since all p-values exceed α, the null hypothesis is not rejected: the F-measure of PEGPUL is close to PUDI and comparable to EPU on these datasets.

Table 11
The paired t-test results of disease gene identification methods.

Ensemble method      p-value
PEGPUL versus PUDI   0.1782
EPU versus PUDI      0.3424
PEGPUL versus EPU    0.3530

5. Conclusion

In this paper, we proposed PEGPUL as a new PU learning method for disease gene identification. Our graph-based ensemble method improves on state-of-the-art PU-based methods by exploiting the strengths of EPU. Compared to unsupervised methods for disease gene identification, supervised methods are potentially more accurate, but they need a complete set of known genes for training. Given the lack of labeled genes, semi-supervised methods are better options for this purpose. In this regard, we put to the test the hypothesis that unlabeled data can be used to help identify the positive genes in this context.

In viewing PEGPUL from a machine learning perspective, we investigated its robustness to noise and outliers through building accurate decision boundaries, though the lack of discriminative features still remains a serious challenge. To address these issues, the base classifiers and the ensemble method were chosen for their high resistance to noise. On one hand, by choosing appropriate support vectors for WMSVM with the help of multilevel learning, it could determine accurate decision boundaries. On the other hand, since the performance of KNN relies on the distance metric, WKNN could exploit the topological properties of the gene data by choosing the Chebyshev distance. Moreover, since WCART uses weighted decision trees, it is robust to noise and outliers. At last, by using a Perceptron neural network as the ensembling method, its noise and fault tolerance are also employed.

Future work should focus on extracting more discriminative features, such as sequence-based features, from biological sources. Moreover, feature selection methods can be used to identify more relevant features.

Acknowledgment

We would like to thank Prof. Peng Yang from the Institute for Infocomm Research (I2R) for his helpful suggestions.

References

Piro, R., Cunto, F.D., 2012. Computational approaches to disease-gene prediction: rationale, classification and successes. FEBS J. 279, 678–696.
Bromberg, Y., 2013. Disease gene prioritization. FEBS J. 9 (4), 1–16.
Wang, X., Gulbahce, N., Yu, H., 2011. Network-based methods for human disease gene prediction. Brief. Funct. Genomics 10 (5), 280–293.
Kohler, S., et al., 2008. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958.
Smalter, A., Lei, S.F., Chen, X.-W., 2007. Human disease-gene classification with integrative sequence-based and topological features of protein-protein interaction networks. IEEE Int. Conf. on Bioinformatics and Biomedicine, CA.
Radivojac, P., et al., 2008. An integrated approach to inferring gene-disease associations in humans. Proteins 72 (3), 1030–1037.
Yang, P., et al., 2012. Positive-unlabeled learning for disease gene identification. Bioinformatics, Oxford University Press, pp. 1–7.
Yang, P., et al., 2014. Ensemble positive unlabeled learning for disease gene identification. PLoS One 9 (5).
Cerulo, L., Elkan, C., Ceccarelli, M., 2010. Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinform. 11 (1), 228.
Freudenberg, J., Propping, P., 2002. A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics 18, 110–115.
Adie, E., et al., 2005. Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinform. 22 (6).
Vanunu, O., et al., 2010. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6 (1), 1–9.
Wang, J.Z., et al., 2007. A new method to measure the semantic similarity of GO terms. Bioinformatics 23 (10), 1274–1281.
Mordelet, F., Vert, J., 2011. ProDiGe: Prioritization of Disease Genes with multitask machine learning from positive and unlabeled examples. BMC Bioinform. 12.
Moreau, Y., Tranchevent, L.-C., 2012. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat. Rev. Genet. 13 (8), 523–536.
Goh, K., et al., 2007. The human disease network. PNAS 104 (21), 8685–8690.
Gill, N., Singh, S., Aseri, T.C., 2014. Computational disease gene prioritization: an appraisal. J. Comput. Biol. 21 (6), 456–465.
Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E.L.L., Eddy, S.R., Bateman, A., Finn, R.D., 2011. The Pfam protein families database. Nucleic Acids Res. 40 (D1), D290–D301.
Zhu, X., 2005. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.
Blum, A., Mitchell, T., 1998. Combining labeled and unlabeled data with co-training. Proc. of the 11th Annu. Conf. on Computational Learning Theory.
Dhillon, P.S., Talukdar, P.P., Crammer, K., 2010. Inference Driven Metric Learning (IDML) for Graph Construction. Technical Report, CIS.
Huang, Y., Li, C., Georgiopoulos, M., 2013. Reduced-rank local distance metric learning. Machine Learning and Knowledge Discovery in Databases. Springer, Berlin Heidelberg, pp. 224–239.
Valentini, G., et al., 2014. An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods. Artif. Intell. Med. 61, 63–78.
Lin, F., Cohen, W.W., 2010. Semi-supervised classification of network data using very few labels. Int. Conf. on Advances in Social Networks Analysis and Mining (ASONAM).
Le, D.-H., Kwon, Y.-K., 2013. Neighbor-favoring weight reinforcement to improve random walk-based disease gene prioritization. Comput. Biol. Chem. 44, 1–8.
Ferrandiz, S., Boullé, M., 2010. Bayesian instance selection for the nearest neighbor rule. Mach. Learn. 81 (3), 229–256.
Murphy, K.P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, London, England, pp. 45–46; 563–564; 556–558; 354–356; 489–492.
The MicroArray Quality Control (MAQC) Consortium, 2010. The MAQC-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. 28, 827–838.
Hastie, T., et al., 2005. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York, USA, pp. 130–132; 288–290.
Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. Department of Information and Computer Science, University of California, Irvine, CA. Available at http://archive.ics.uci.edu/ml/datasets.html.
