Mining Protein Networks With Maximal Quasi-Bicliques



Kelvin Sim (1)                  shsim@i2r.a-star.edu.sg
Haiquan Li (2)                  hqli@noble.org
Jinyan Li (3)                   JYLi@ntu.edu.sg
Vivekanand Gopalkrishnan (3)    asvivek@ntu.edu.sg

(1) Institute for Infocomm Research, Singapore.
(2) The Samuel Roberts Noble Foundation, Inc., USA.
(3) School of Computer Engineering, Nanyang Technological University, Singapore.

Keywords: protein networks, maximal bicliques, maximal quasi-bicliques

1 Introduction
Discovering structured protein groups that exhibit some form of interaction is important in the study
of biological processes. We focus on mining structured pairs of protein groups that can be represented
by maximal quasi-bicliques. A quasi-biclique is a subgraph consisting of two disjoint vertex sets with
edges between them. The "quasi" relaxation allows some edges between the two vertex sets to be missing,
so a vertex in one vertex set need not be adjacent to every vertex in the other vertex set. A
quasi-biclique is maximal when it is not properly contained in another quasi-biclique. The formal
definition of a maximal quasi-biclique is given in [2].
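
The informal definition above can be illustrated with a small Python sketch. It is not the formal
definition from [2]. It assumes that the tolerance (called eps here) is the maximum number of
vertices on the opposite side that any single vertex may fail to be adjacent to, and that a minimum
size ms applies to each vertex set. The function name, parameter names, and toy protein labels are
all hypothetical.

```python
from itertools import product
from typing import Dict, Hashable, Set

Adjacency = Dict[Hashable, Set[Hashable]]


def is_quasi_biclique(adj: Adjacency, left: Set[Hashable], right: Set[Hashable],
                      eps: int, ms: int = 1) -> bool:
    """Check the ASSUMED quasi-biclique condition for the vertex sets (left, right).

    Assumption (not taken verbatim from the source): every vertex may be
    non-adjacent to at most `eps` vertices of the opposite set, and each
    set must contain at least `ms` vertices.
    """
    if left & right:                      # the two vertex sets must be disjoint
        return False
    if len(left) < ms or len(right) < ms:
        return False
    left_ok = all(len(right - adj.get(u, set())) <= eps for u in left)
    right_ok = all(len(left - adj.get(v, set())) <= eps for v in right)
    return left_ok and right_ok


# Toy graph: all 3 x 3 edges between the two groups except a3-b3.
left = {"a1", "a2", "a3"}
right = {"b1", "b2", "b3"}
adj: Adjacency = {v: set() for v in left | right}
for a, b in product(sorted(left), sorted(right)):
    if (a, b) != ("a3", "b3"):
        adj[a].add(b)
        adj[b].add(a)

print(is_quasi_biclique(adj, left, right, eps=0, ms=3))  # False: not all-versus-all
print(is_quasi_biclique(adj, left, right, eps=1, ms=3))  # True: one missing edge tolerated
```

With eps = 0 the check reduces to the biclique (all-versus-all) condition, so the single missing
edge rules the pair out; with eps = 1 the same pair is accepted.
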
Our work is motivated by the limitations of the maximal bicliques used by Li et al. [1]. A maximal
biclique is a more stringent form of maximal quasi-biclique, in which every vertex in one vertex set is
adjacent to all vertices in the other vertex set. Thus, a pair of protein groups represented by a maximal
biclique exhibits an all-versus-all interaction, whereas a pair of protein groups represented by a maximal
quasi-biclique exhibits a most-versus-most interaction. Li et al. observe two characteristics of protein
interaction data sets that impede the use of maximal bicliques. (1) Not all pairs of protein groups
exhibit all-versus-all interactions. Mining with maximal bicliques will omit significant pairs of protein
groups that exhibit most-versus-most interactions. (2) Protein interaction data sets are incomplete
and are constantly being updated. The data sets also contain errors: an interaction between proteins may
be omitted, yielding a false negative, or an interaction between proteins may be wrongly reported
to exist, yielding a false positive. Hence maximal bicliques perform poorly when applied to incomplete
and erroneous data sets.
Maximal quasi-bicliques address the first characteristic directly, because they can be used to mine
pairs of protein groups that exhibit most-versus-most interactions. With regard to the second
characteristic, maximal quasi-bicliques generalize maximal bicliques by tolerating missing edges, which
reduces the negative impact of incomplete and erroneous data sets.

2 Experimental Method and Results


The yeast (Saccharomyces cerevisiae) protein-protein interaction (ppi) dataset was downloaded from
the Database of Interacting Proteins (DIP) on 23 October 2005. This dataset is modeled as an
undirected graph in which the vertices represent proteins and an edge between a pair of vertices
represents an interaction between the corresponding pair of proteins. After removing all
the complex-level experiments, the graph contains 4,959 vertices and 10,640 edges.
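
The graph model described above can be sketched in a few lines of Python. This is only an
illustration: the input format (a list of protein-identifier pairs), the function names, and the
placeholder identifiers P1 to P4 are assumptions, and the sketch skips self-interactions for
simplicity, which the source does not state.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_ppi_graph(interactions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected graph as an adjacency map from (protein, protein) pairs."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for p, q in interactions:
        if p == q:
            continue  # simplifying assumption: ignore self-interactions
        adj[p].add(q)  # store both directions so the graph is undirected
        adj[q].add(p)
    return dict(adj)


def graph_size(adj: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (number of vertices, number of edges); each edge is stored twice."""
    n_vertices = len(adj)
    n_edges = sum(len(neighbours) for neighbours in adj.values()) // 2
    return n_vertices, n_edges


# Hypothetical interaction list; duplicates in either orientation collapse to one edge.
interactions = [("P1", "P2"), ("P2", "P3"), ("P2", "P1"), ("P4", "P4")]
graph = build_ppi_graph(interactions)
print(graph_size(graph))  # (3, 2)
```

Storing neighbours in sets means a repeated or reversed pair does not create a second edge, which
matches the undirected, simple-graph reading of the dataset.
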

Table 1: No. of maximal quasi-bicliques/bicliques mined from yeast ppi dataset and their significance.

basic results      group validation                            pair validation
ms   ε     pairs   covered domains   validated groups (rate)   iPfam pairs   Interdom pairs
11   0        53               386               92 (86.79%)             0                5
11   1      7251              1657            12423 (85.66%)           128              420
12   0         4                24                7 (87.50%)             0                0
12   1      1381              1509             2353 (85.19%)            28              115
13   1        79               318              104 (65.82%)             0                0
14   2      1150              1164             1961 (85.26%)            22               67
15   2        13               118               25 (96.15%)             0                0
16   3        12                87               17 (70.83%)             0                0
17   4        15                86               19 (63.33%)             0                0

Using this yeast ppi dataset, we mine both maximal quasi-bicliques and maximal bicliques. The
maximal quasi-bicliques are mined using CompleteQB [2] and the maximal bicliques are mined using
the method described in [1]. We then use the two validation techniques of [1], group validation and
pair validation, to check whether maximal quasi-bicliques yield more significant discoveries than
maximal bicliques.
We compared the results obtained from maximal quasi-bicliques and maximal bicliques by varying
the error tolerance rate ε while maintaining a constant minimum support ms on the number of vertices
in the maximal quasi-bicliques/bicliques. Table 1 presents the results. The third column, pairs, shows
the number of maximal quasi-bicliques/bicliques mined at a given ms and ε. The results with ε = 0
are obtained with maximal bicliques, whereas the others are obtained with maximal quasi-bicliques.
For ms ≥ 13, no maximal bicliques are found, but we are still able to find maximal quasi-bicliques by
increasing ε. This demonstrates the flexibility of maximal quasi-bicliques, whereby large interacting
pairs of protein groups can be obtained by relaxing the all-versus-all constraint of maximal bicliques.
Each maximal quasi-biclique/biclique represents a pair of protein groups, and we validate
each protein group using group validation. For the results on covered domains, at ms = 11 and
ms = 12 with ε = 1, we are able to obtain 4.5 and 62.9 times more covered domains respectively
by using maximal quasi-bicliques. Similarly, for the results on validated groups at ms = 11 and
ms = 12, using maximal quasi-bicliques enables us to obtain 135 and 336 times more validated groups
respectively. The validated groups rate in the fifth column indicates that a high percentage (> 80%
at ms = 11 and ms = 12) of interacting protein groups generated by maximal quasi-bicliques can be
mapped to domains in the domain database, and there is no obvious decrease in the validated rate.
For the validation of pairs of protein groups, at ms = 11 and ms = 12 and using maximal
bicliques, we can find only 5 pairs of groups that can be mapped to domain-domain pairs, and they
are found only in the Interdom database. By using maximal quasi-bicliques, we can map 691 pairs of
groups to domain-domain pairs across the two domain-domain interaction databases, iPfam and Interdom.
Maximal quasi-bicliques thus let us discover more relationships between interacting protein group
pairs and domain-domain pairs.
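
The comparisons of validated groups and mapped pairs can be re-derived from the ms = 11 and ms = 12
rows of Table 1 with the short Python sketch below. The dictionary layout and function names are
hypothetical. The rate formula, validated groups divided by twice the number of pairs (two protein
groups per pair), is an inference that reproduces the percentages in Table 1; the source does not
state it explicitly.

```python
# Rows of Table 1 for ms = 11 and ms = 12, keyed by (ms, eps).
# Values: (pairs, covered domains, validated groups, iPfam pairs, Interdom pairs)
TABLE1 = {
    (11, 0): (53, 386, 92, 0, 5),
    (11, 1): (7251, 1657, 12423, 128, 420),
    (12, 0): (4, 24, 7, 0, 0),
    (12, 1): (1381, 1509, 2353, 28, 115),
}
PAIRS, COVERED, VALIDATED, IPFAM, INTERDOM = range(5)


def validated_rate(ms: int, eps: int) -> float:
    """Inferred rate (%): validated groups over 2 * pairs (two groups per pair)."""
    row = TABLE1[(ms, eps)]
    return 100.0 * row[VALIDATED] / (2 * row[PAIRS])


def validated_fold_increase(ms: int) -> float:
    """Ratio of validated groups with eps = 1 to validated groups with eps = 0."""
    return TABLE1[(ms, 1)][VALIDATED] / TABLE1[(ms, 0)][VALIDATED]


def mapped_pairs(eps: int) -> int:
    """Total pairs mapped to iPfam or Interdom over ms = 11 and ms = 12."""
    return sum(TABLE1[(ms, eps)][IPFAM] + TABLE1[(ms, eps)][INTERDOM] for ms in (11, 12))


print(round(validated_fold_increase(11)), round(validated_fold_increase(12)))  # 135 336
print(mapped_pairs(0), mapped_pairs(1))                                        # 5 691
print(f"{validated_rate(11, 0):.2f} {validated_rate(11, 1):.2f}")              # 86.79 85.66
```
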

References
[1] H. Li, J. Li, and L. Wong. Discovering motif pairs at interaction sites from protein sequences on a
proteome-wide scale. Bioinformatics, 22(8):989-996, 2006.

[2] K. Sim, J. Li, V. Gopalkrishnan, and G. Liu. Mining maximal quasi-bicliques to co-cluster stocks
and financial ratios for value investment. In ICDM, pages 1059-1063, 2006.
