MMB 2023 SDM
Abstract
Machine learning and data analytics tasks on graphs enjoy a lot of attention from both researchers and practitioners due to the utility that a graph structure among data entities adds for downstream tasks. In many cases, however, a graph structure is not known a priori, and instead has to be inferred from data. Specifically, learning a graph associating time series may elucidate hidden dependencies and also enable improved performance in tasks like classification, forecasting and clustering. While approaches based on pairwise correlation and precision matrix estimation have been employed widely, re-

[Figure 1(a): Dictionary-based graph inference — schematic with X, U, a dictionary Φ (e.g. Fourier) and L, illustrated on a graph of technology-sector stocks grouped into Software, B2B tech and Semiconductor clusters.]
[Figure 3 plots: AUC, NMI and Purity curves over SNR, smoothness, density ratio β_in/β_out, and time series length, comparing Corr, Dong, K-Dong, K-Log, IDGL, GIST(S), GIST(R) and GIST-SP.]
Figure 3: Effect of SNR on edge inference for smoothed encodings (α = 0.9) (a) and shared period (b) in synthetic data. Effect of encoding smoothness (c) and cluster density ratio on edge (d) and group (e) recovery (SNR=1/5). Effect of time series length t on edge recovery in Synthetic (SNR=1/5) (f) and Road (g); and on group recovery in Stocks (h).
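The edge-inference AUC reported in these plots treats the inferred (weighted) adjacency entries as scores for the presence of ground-truth edges over all unordered node pairs. A minimal sketch of this evaluation, assuming dense adjacency matrices as nested lists; this is not the authors' implementation, and `edge_auc` is a hypothetical helper name:

```python
def edge_auc(W, A):
    """ROC AUC of inferred edge weights W against a binary
    ground-truth adjacency A, over all unordered node pairs."""
    n = len(W)
    pos, neg = [], []  # scores of true edges vs. non-edges
    for i in range(n):
        for j in range(i + 1, n):
            (pos if A[i][j] else neg).append(W[i][j])
    # AUC equals P(random true-edge score > random non-edge score),
    # counting ties as 1/2 (the Mann-Whitney formulation).
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example: the two strongly weighted pairs are the true edges.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
W = [[0.0, 0.9, 0.1, 0.2],
     [0.9, 0.0, 0.05, 0.15],
     [0.1, 0.05, 0.0, 0.8],
     [0.2, 0.15, 0.8, 0.0]]
print(edge_auc(W, A))  # 1.0: every true edge outscores every non-edge
```

Because the metric is rank-based, it is insensitive to the absolute scale of the learned edge weights, which makes it a fair comparison across methods with differently scaled outputs.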
rameter α (see [36] for details). Large values of α increase the "alignment" of neighbor encodings. GIST's quality dominates alternatives and its advantage widens with α (Fig. 3(c)). A similar trend is evident for alternative methods, since within-block time series become more similar in both the encoding and time domains. It is important to note that for low values of smoothness (α = [0.1, 0.2]), GIST-SP emerges as the dominant method, as its ability to associate time series based on shared periods, but not necessarily similar coefficients, is central to its objective.

Varying Cluster Structure. We also study the effect of the underlying ground-truth graph structure on the ability of competing methods to recover known clusters. To this end, we vary the density of within- and across-cluster edges in the SB model. In addition to reporting the AUC for edge inference (Fig. 3(d)), we also quantify the ability to recover ground-truth blocks of the SB model in terms of NMI (Fig. 3(e)). In terms of AUC, GIST(R) and GIST-SP are always the best two methods. At low ratios the graph is relatively simple (few edges) to learn. Alternatively, at high density ratios there are more edges to learn, but the clusters are much more "discernible", which helps all baselines achieve a better AUC. The task is most challenging for mid density ratios of 10−20. In the cluster recovery task (Fig. 3(e)), there is a clear monotonic trend of increasing performance with stronger clusters. GIST(R) is again the dominant method in terms of NMI, with a small advantage over GIST-SP.

Varying Time Series Length. Next we evaluate the data complexity of competing techniques. Specifically, we are interested in how the observed time series length affects competitors' inference quality. For synthetic data we vary the number of time steps t in X between 30 and 300, set the SNR to 1/5, and report the AUC for competing techniques in Fig. 3(f). GIST(R)'s advantage over baselines is small when only a short prefix of the time series is employed (t = 30). As the length of observations increases, our method can learn a better encoding matrix U and consequently a more accurate Laplacian. While the AUC of baselines also improves with t, it remains significantly lower than that of GIST(R). We also vary the number of observed timesteps t for two real-world datasets. In Road, we test the AUC for edge inference and report results in Fig. 3(g). While all methods improve with t, GIST(S) retains dominant performance across all values of t. We also evaluate the effect of t on the quality of cluster recovery (Fig. 3(h)). GIST(R) and GIST-SP dominate alternatives on this task across all values of t.

6.4 Evaluation on real-world datasets. We next evaluate all techniques on the real-world datasets (including two versions of Stock) and report results for edge prediction and cluster detection in Tbl. 2.

Edge prediction. In all datasets, a variation of our method performs best in terms of edge prediction AUC, with the exception of the TEMP dataset where Dong has a small advantage. Despite this disadvantage, in
Table 2: Comparison of all competitors on real-world datasets for (i) edge prediction in terms of Area under the ROC curve (AUC),
(ii) cluster recovery via Normalized Mutual Information (NMI) and Purity (Pur), and (iii) running time t(s) in seconds.
Covid and Crime we observe an advantage when utilizing GIST-SP. The ground-truth edges in Covid reflect neighboring counties; however, counties may be temporally lagged in these patterns. Since GIST-SP only considers shared periods when inferring edges, it is able to capture such lag between neighbors better than alternatives, resulting in better AUC. The advantage in the Reality Mining dataset may also be due to lag in user message responses, resulting in "off-phase" time series. Overall, the strong performance of our methods demonstrates that edges in these datasets reflect global temporal patterns. For example, the connections in Bike rental and Road would be highly reflective of weekly traffic patterns. Note that IDGL was not evaluated on datasets without GT communities, since we employ these communities as node labels for IDGL.

Figure 4: Scalability comparison of GIST and competing baselines with increasing number of nodes (a, varying nodes, T=5000) and number of time steps (b, varying timesteps, N=500). Running time is in seconds.
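The group-recovery scores used throughout (NMI in Fig. 3(e,h), NMI and Purity in Tbl. 2) compare predicted cluster labels against ground-truth groups. A self-contained sketch of both metrics, using the common geometric-mean normalization of NMI; the paper's exact variant may differ:

```python
import math
from collections import Counter

def nmi(pred, truth):
    """Normalized Mutual Information between two labelings
    (geometric-mean normalization)."""
    n = len(pred)
    cp, ct = Counter(pred), Counter(truth)
    joint = Counter(zip(pred, truth))
    mi = sum((c / n) * math.log((c * n) / (cp[a] * ct[b]))
             for (a, b), c in joint.items())
    hp = -sum((c / n) * math.log(c / n) for c in cp.values())
    ht = -sum((c / n) * math.log(c / n) for c in ct.values())
    if hp == 0.0 or ht == 0.0:  # degenerate single-cluster labeling
        return 0.0
    return mi / math.sqrt(hp * ht)

def purity(pred, truth):
    """Fraction of points falling in the majority ground-truth
    class of their predicted cluster."""
    members = {}
    for p, t in zip(pred, truth):
        members.setdefault(p, []).append(t)
    return sum(max(Counter(ms).values()) for ms in members.values()) / len(pred)

# Identical groupings up to relabeling score perfectly on both metrics.
pred = [0, 0, 1, 1]
truth = ['a', 'a', 'b', 'b']
print(round(nmi(pred, truth), 6), purity(pred, truth))  # → 1.0 1.0
```

Both metrics are invariant to cluster label permutations, which matters here because spectral clustering assigns arbitrary cluster IDs.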
Clustering. In the clustering experiments we evaluate how well the graphs inferred by competing techniques reflect ground-truth groups. To this end, we perform spectral clustering on the learned graphs and report the NMI and Purity (Pur) scores of the learned clusters compared to the ground truth. Results are also reported in Tbl. 2. In all datasets with ground-truth clusters, GIST obtains both the highest NMI and Purity. In TEMP, although not the best at edge inference (AUC), GIST(R)'s clustering of US cities aligns best with known climate zones. Edges predicted by Dong, K-Log and K-Dong are less reflective of long-term temperature patterns due to enforcing smoothness in the temporal rather than the encoding domain. In Stock-small, GIST-SP produces groupings that best align with sectors; however, GIST(R) performs best in the full Stock data. Stock-small contains only three sectors (Basic Materials, Communication Services, and Energy) with low likelihood of cross-sector shared trends, hence the better results across all methods as compared to the full Stock data.

6.5 Running time. Beyond quality, Tbl. 2 also lists running times for all methods and datasets. While a simpler baseline like Corr and the optimized signal processing methods from [28] run between 5 and 10 times faster than our method, they produce graphs of lower quality. Our GIST-SP objective is faster than alternatives as it follows a Gradient Descent (GD) approach. GIST(R) and GIST(S) are generally faster than the original Dong method. To further compare the scalability of all methods, we vary the number of nodes and the number of time steps in a synthetic dataset and report running times in Fig. 4. The running time of all competitors grows super-linearly with the number of nodes (Fig. 4(a)), which is expected as the number of possible edges grows quadratically. Dong, GIST(S) and GIST(R) are relatively slower than the GIST-SP, K-Log and K-Dong methods, mainly because they use closed-form solutions and/or optimize relatively more complex objectives. All methods can infer a graph among 2000 time series within one hour. The running time of competitors is less affected by the number of time steps (see Fig. 4(b)).

6.6 Case Studies: Tech Stocks and Covid in Vermont. We perform a case study of a subset of the Stock dataset corresponding to the technology sector. We compare the top 15 edges and associated companies