Professional Documents
Culture Documents
Mining Maximal Quasi-Bicliques To Co-Cluster Stocks and Financial Ratios For Value Investment
Mining Maximal Quasi-Bicliques To Co-Cluster Stocks and Financial Ratios For Value Investment
Abstract stocks.
We model stocks and their financial ratio values by a bi-
We introduce an unsupervised process to co-cluster partite graph. We use one side of the vertices in the bipartite
groups of stocks and financial ratios, so that investors can graph to represent the stocks, the other set to represent the
gain more insight on how they are correlated. Our idea for financial ratio values as shown in Figure 1(a). As the fi-
the co-clustering is based on a graph concept called max- nancial ratio values are continuous, we use a discretization
imal quasi-bicliques, which can tolerate erroneous or/and technique to divide financial ratio values into intervals, and
missing information that are common in the stock and finan- then each interval is represented by a vertex. An edge exists
cial ratio data. Compared to previous works, our maximal between a stock vertex s and a financial ratio vertex f , if the
quasi-bicliques require the errors to be evenly distributed, financial ratio value for the stock s falls in the interval f of
which enable us to capture more meaningful co-clusters. this financial ratio.
We develop a new algorithm that can efficiently enumer- Maximal quasi-biclique subgraphs, our newly defined
ate maximal quasi-bicliques from an undirected graph. The graph concept, are then used to co-cluster stocks and finan-
concept of maximal quasi-bicliques is domain-independent; cial ratios from the bipartite graph. A quasi-biclique sub-
it can be extended to perform co-clustering on any set of graph consists of two disjoint sets of vertices Si , i = 1, 2,
data that are modeled by graphs. such that every vertex in Si is allowed to disconnect up to
vertices in Sj , j = i, where the error tolerance rate is a
user defined integer. A quasi-biclique subgraph is maximal
if it is not a proper subset of any other quasi-biclique sub-
1 Introduction graphs. For example, in Figure 1(a), {Stock A, B, C} and
{Current Ratio < 1, Debt/Equity Ratio < 1, Return On
The traditional approach of value investment [4] is of Equity Ratio > 10%} are two disjoint vertex sets forming
linear “query”-style. A number of financial ratios are pre- a quasi-biclique subgraph. Every vertex in this subgraph is
selected, and some thresholds are subjectively set for these required to connect to at least 2 of the vertices in the other
financial ratio values. Then stocks satisfying these condi- vertex set. This subgraph is also maximal.
tions are grouped together for further analysis. However, The problem of co-clustering stocks and financial ra-
this process has a weakness; it is hard to know which finan- tios can be better studied by using maximal quasi-bicliques,
cial ratios should be selected and what thresholds should than using the classical maximal bicliques [3], due to the
be set to find correlated stocks. Exhaustive combinations following two reasons: (1) Like most real-world data, finan-
of different financial ratios and their thresholds make this cial ratios data are prone to errors and missing values. This
decision process a daunting task for investors. has an adverse effect on the quality of the co-clusterings.
In this paper, we propose an unsupervised graph-based Maximal quasi-bicliques attempt to reduce this negative im-
data mining method to co-cluster stocks and financial ra- pact by introducing the error tolerance rate . That is, co-
tios. The stocks are clustered based on their similarities in clusters determined by maximal quasi-bicliques can tolerate
financial ratios and concurrently, these financial ratios are up to number of errors. (2) Even if the dataset is free of er-
implicitly clustered according to their occurrences in the roneous or missing values, the relaxation of the connection
stocks. Our co-clustering patterns can be interpreted by in- requirement between the cluster pairs by our method can re-
vestors to understand which clusters of financial ratios and sult in discovering more general cluster groups between the
at what levels of thresholds they are correlated to clusters of stocks and the financial ratios.
v7 {{u, v}|u, v ∈ V (G)}. Vertices u, v ∈ V (G) are adjacent
Return on Equity < 10%
v8 to each other if there is an edge {u, v} connecting them.
v1 v9
Stock A
Current Ratio < 1 v2
Throughout the rest of the paper, we assume that all graphs
v 10
Stock B Debt/Equity Ratio < 1 v3 v 11 are undirected.
Return on Equity > 10 %
v4 v 12 The neighborhood of v in a graph G is denoted as
Stock C v5 Γ(v) = {u|{v, u} ∈ E(G) ∧ u ∈ V (G)}. Let V ⊂ V (G),
v be a vertex in V (G) \ V . We denote the set of ver-
Net Profit Margin < 10% v6
Time(sec)
Time(sec)
10000 10 100
10
5000 1
1
0 0.1 0.1
3 4 5 6 7 8 9 10 11 3 4 5 6 7 8 9 5 6 7 8 9 10 11
Minimum Size Threshold Minimum Size Threshold Minimum Size Threshold
(a) # of maximal quasi-bicliques (b) Running Time (Err=1) (c) Running Time (Err=2)
Figure 2. Running time and the number of maximal quasi-bicliques enumerated from the dataset.
Year References
Figure 3. Price performances of the stocks in [1] J. Abello, M. G. C. Resende, and S. Sudarsky. Massive
a co-cluster. quasi-clique detection. In LATIN ’02, pages 598–612, 2002.
[2] D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang,
optimized CompleteQB increases at an even higher rate S. Sun, L. Ling, N. Zhang, G. Li, and R. Chen. Topological
when we increase but decrease ms. Thus, it is impractical structure analysis of the protein-protein interaction network
to find all small maximal quasi-bicliques that allow large in budding yeast. In Nucleic Acids Research 31(9):2443-
2450, 2003.
number of errors.
[3] D. Eppstein. Arboricity and bipartite subgraph listing al-
In general, we can see that the optimized CompleteQB gorithms. Information Processing Letters, 51(4):207–211,
runs approximately linear to the number of subgraphs enu- 1994.
merated. [4] B. Graham and D. Dodd. Security Analysis. McGraw-Hill
Example of an interesting co-cluster We present Professional, 1934.
a co-cluster discovered by our algorithm, which con- [5] J. Han and M. Kamber. Data Mining: Concepts and Tech-
tains stocks {AVP, CLX, MKC, PG} and financial ra- niques. Morgan Kaufmann, 2000.
tios {ROA (7.99, 8.27), Dividend Yield (1.45, 2.48), PE [6] H. Li, J. Li, and L. Wong. Discovery motif pairs at interac-
tion sites from protein sequences on a proteome-wide scale.
(19.69, 21.7), Price/Sales (0.16, 2.4)}. Interestingly, we
In Bioinformatics 22(8):989-996, 2006.
found that all the price performances of the stocks increased [7] G. Liu, K. Sim, and J. Li. Efficient mining of large maximal
for the year 2002, unlike the S&P 500 Index, as shown in bicliques. In Dawak, pages 437–448, 2006.
Figure 3. From this co-cluster, investors gained two types of [8] N. Mishra, D. Ron, and R. Swaminathan. A new concep-
information; firstly, investors can buy this cluster of stocks tual clustering framework. Mach. Learn., 56(1-3):115–151,
to diversify their portfolios, secondly, this cluster of finan- 2005.
cial ratios can be considered as an useful pattern to find [9] T. Murata. Discovery of user communities from web audi-
good performing stocks. ence measurement data. In WI, pages 673–676, 2004.
[10] C. Yan, J. G. Burleigh, and O. Eulenstein. Identify-
ing optimal incomplete phylogenetic data sets from se-
6 Conclusion quence databases. In Molecular Phylogenetics and Evolu-
tion 35(2005):528-535, 2005.
We proposed to use maximal quasi-bicliques to co-
cluster stocks and financial ratios. Compared to the concept
of maximal bicliques, maximal quasi-bicliques are more