Information Sciences 644 (2023) 119287


Scalable maximal subgraph mining with backbone-preserving graph convolutions
Thanh Toan Nguyen a, Thanh Trung Huynh b, Matthias Weidlich c, Quan Thanh Tho d,e,∗, Hongzhi Yin f, Karl Aberer b, Quoc Viet Hung Nguyen g

a Faculty of Information Technology, HUTECH University, Ho Chi Minh City, Viet Nam
b École Polytechnique Fédérale de Lausanne, Switzerland
c Humboldt-Universität zu Berlin, Germany
d Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Viet Nam
e Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Viet Nam
f The University of Queensland, Australia
g Griffith University, Australia

ARTICLE INFO

Keywords: Maximal subgraph mining; Graph embedding; Graph convolutional networks; Scalable approximation

ABSTRACT

Maximal subgraph mining is increasingly important in various domains, including bioinformatics, genomics, and chemistry, as it helps identify common characteristics among a set of graphs and enables their classification into different categories. Existing approaches for identifying maximal subgraphs typically rely on traversing a graph lattice. However, in practice, these approaches are limited to relatively small subgraphs due to the exponential growth of the search space and the NP-completeness of the underlying subgraph isomorphism test. In this work, we propose SCAMA, an approach that addresses these limitations by adopting a divide-and-conquer strategy for efficient mining of maximal subgraphs. Our approach involves initially partitioning a graph database into equivalence classes using bootstrapped backbones, which are tree-shaped frequent subgraphs. We then introduce a learning process based on a novel graph convolutional network (GCN) to extract maximal backbones for each equivalence class. A critical insight of our approach is that by estimating each maximal backbone directly in the embedding space, we avoid the exponential traversal of the graph lattice. From the extracted maximal backbones, we construct the maximal frequent subgraphs. Furthermore, we outline how SCAMA can be extended to perform top-𝑘 largest frequent subgraph mining and how the discovered patterns facilitate graph classification. Our experimental results demonstrate the effectiveness of SCAMA in identifying nearly all maximal frequent subgraphs, while running approximately 10 times faster than the best baseline technique.

∗ Corresponding author at: Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Viet Nam.
E-mail addresses: nt.toan@hutech.edu.vn (T.T. Nguyen), thanhtrunghuynh93@gmail.com (T.T. Huynh), matthias.weidlich@hu-berlin.de (M. Weidlich),
qttho@hcmut.edu.vn (Q.T. Tho), h.yin1@uq.edu.au (H. Yin), karl.aberer@epfl.ch (K. Aberer), henry.nguyen@griffith.edu.au (Q.V.H. Nguyen).

https://doi.org/10.1016/j.ins.2023.119287
Received 6 November 2022; Received in revised form 31 May 2023; Accepted 3 June 2023
Available online 8 June 2023
0020-0255/© 2023 Elsevier Inc. All rights reserved.

1. Introduction

Maximal subgraph mining (MSM) is the task of identifying only the maximal frequent subgraphs among all frequent subgraphs
in a graph database [1]. These frequent subgraphs capture important patterns in various domains, including bioinformatics [2],
genomics [3], and the social sciences [4,5]. Applications in these domains often require the identification of large structural patterns
in dense datasets that are continuously growing in size [6]. For example, in bioinformatics, analysing the molecular structures of
hundreds to thousands of virus variants can provide insights for drug development and vaccinations. In the social sciences, studying
the propagation of rumours among millions of users in a network helps detect and mitigate social damage. The aforementioned
domains induce several fundamental challenges for maximal subgraph mining, as follows. (C1) An exponential search space: The
space of subgraphs to consider in the mining process is a lattice that grows exponentially in the number of edges of the graphs in a
database [7]. (C2) Intractability of the isomorphism test: To assess the similarity of subgraphs during the search, common techniques
rely on a subgraph isomorphism test, which is NP-complete [8]. (C3) Large graph databases: Various applications include very large
databases. For instance, variants of viruses may lead to thousands of molecular structures, as in the case of AIDS [9].
Existing approaches for maximal subgraph mining are severely limited in addressing these challenges. Traditional MSM methods
adopt a combinatorial search to identify frequent subgraphs in a bottom-up manner, scanning the graph lattice from small to large
subgraphs [10–12]. Due to the exponential search space (C1), the need to perform isomorphism tests (C2) and the sheer size of graph
databases (C3), these approaches are limited to relatively small subgraphs [3]. Other work, such as MARGIN [12], scans the lattice
top-down, i.e., from large to small patterns. However, such an approach is applicable only for sparse graphs, as it resolves neither the fundamental challenge of the exponential growth of the search space, nor the challenges induced by its exploration.
We conclude that working directly on graphs makes it difficult to avoid the exponential time complexity of maximal subgraph
mining. However, graph neural networks, which transform graphs into lower-dimensional spaces, have been overlooked in frequent
subgraph mining, with few exceptions like SPMiner [3]. Nevertheless, SPMiner mainly focuses on subgraphs with the highest frequency and does not guarantee the discovery of maximal subgraphs based on a given support threshold.
In this paper, we propose SCAMA as an approach for SCAlable MAximal subgraph mining. SCAMA addresses the three fundamental
challenges for maximal subgraph mining directly: It adopts a divide-and-conquer strategy, so that mining is conducted on the level of
equivalence classes of graphs that are significantly smaller than the original dataset (C3) and can be handled in parallel. Moreover,
by relying on graph embeddings as representations, we avoid any explicit (bottom-up or top-down) scan of the graph lattice (C1),
and directly reach the front of maximal frequent patterns in the search space (C2). However, both the construction of equivalence classes and the embedding-based identification of maximal patterns are first conducted on backbones, i.e., auxiliary tree-structured graphs from which the maximal subgraphs are derived only in a post-processing step.
We summarize our contributions, as well as the structure of the paper following some background (§2) and a problem statement
(§3), as follows:

• SCAMA: §4 presents SCAMA, an approach to solve maximal subgraph mining at scale. It includes three steps: a construction of
graph equivalence classes based on backbones, a learning process to identify maximal backbones, and a derivation of maximal
subgraphs from the backbones.
• Construction of equivalence classes: §5 presents an algorithm to split the graph database into equivalence classes of graphs based
on bootstrapped backbones.
• Backbone-preserving graph embedding: §6 proposes a graph convolutional network to derive graph embeddings that ensure intra-
and inter-graph consistency to facilitate maximal backbone extraction.
• Pattern construction: §7 introduces an algorithm to efficiently derive maximal frequent subgraphs from their maximal backbones.
• Applications: §8 extends SCAMA for the problem of top-𝑘 largest frequent pattern mining and shows how mined patterns are used
as features for downstream tasks, such as graph classification.

In §9, we evaluate our approach using real-world datasets. The results show that SCAMA correctly identifies most of the maximal frequent subgraphs, while being 10 times faster than the baselines and requiring less memory in all configurations. Finally, §10 concludes
the paper.

2. Background

In this section, we first illustrate the motivation for maximal subgraph mining (§2.1), before reviewing related work (§2.2).

2.1. Motivation

Maximal subgraph mining (MSM) finds applications in various domains, including bioinformatics, computer vision, and social
sciences. In this context, we highlight the significance of MSM in understanding molecular structures, with a specific focus on RNA
analysis. While our experimental datasets cover other domains as well, we use the example of RNA to demonstrate the practical
implications of MSM. RNA plays a critical role in numerous biological processes. Identifying common structures among different
parts of RNA provides insights into their encoded functionality. Unlike DNA, RNA lacks detectable sequence patterns, making direct
alignment impossible [13]. Consequently, graph-based representations are used to capture the secondary structure of RNA (Fig. 1-A).
The analysis of RNA then relies on topological fingerprints of its various parts (Fig. 1-B), which are represented as subgraphs [13].


Fig. 1. (A) The mapping of the secondary structure of RNA to nodes and edges; (B) graphs representing molecular structures that serve as RNA fingerprints. (Edge
colours indicate different edge labels.)

When working with multiple RNA samples, discovering commonalities can be approached using algorithms for frequent subgraph
mining (FSM) in general, and specifically for maximal subgraph mining (MSM).

2.2. Related work

This section first reviews related work on frequent subgraph mining, before turning to graph embeddings.

Frequent Subgraph Mining. Regular frequent subgraph mining (FSM) targets the extraction of all frequent patterns found in a set
of graphs [1]. FSG [14] and AMG [15] explore the space of possible subgraphs by joining pairs of frequent subgraphs. However,
the join operations and the frequency evaluation of the generated candidates induce high computational costs. gSpan [10] improves
the performance by growing candidate subgraphs directly from a single subgraph rather than joining two candidates. However,
all possible candidates need to be stored during this process, which leads to exponentially increasing compute and storage costs.
GraphSig [16] and LEAP [17] focus on mining of highly frequent subgraphs, favouring those with extreme support over common
and trivial patterns. However, to find these subgraphs in a timely manner, the exploration of the search space is sacrificed, and
only a small part of the graph lattice is accessed. UGM & CGM [18] adopt the idea of frequent-edge pruning to accelerate the mining process. However, prior literature [19] reports that their performance is often poor in terms of time and memory, as the traversal still happens in the traditional graph lattice space, which can grow exponentially even though additional candidates are pruned.
Since the number of frequent subgraphs may become extremely large in practice, it was suggested to extract only the top-𝑘
subgraphs of the largest size [20]. However, specifying the parameter 𝑘 is difficult in practice and existing schemes follow a
randomisation approach that returns subgraphs with approximate rather than exact isomorphism [20]. A more comprehensive characterization of the commonalities of a set of graphs is achieved by all maximal frequent subgraphs, i.e., the frequent subgraphs that
are not contained in any other frequent subgraph. The number of such maximal subgraphs can be expected to be much smaller than
the total number of frequent subgraphs. However, all frequent subgraphs can be extracted from the maximal ones. State-of-the-art
algorithms for maximal subgraph mining (MSM) are SPIN [11] and MARGIN [12]. However, these show limited scalability as they
need to generate the full graph lattice before conducting the search for the maximal frequent subgraphs. Fig. 2 (c) illustrates this
approach schematically for the case of MARGIN [12]. Here, the red region represents the candidate subgraphs that are generated
by the mining algorithm, while the green region comprises those that are actually visited as part of the search process. SPIN [11]
follows a similar approach, but adopts a bottom-up search. Either way, the graph lattice grows exponentially in the number of edges
of the considered graphs, so that the respective algorithms become intractable for large collections of dense graphs.
With SCAMA, we follow a fundamentally different approach. As sketched in Fig. 2 (d), we generate only a small portion of the
graph lattice to extract knowledge that enables us to directly identify the region of maximal subgraphs in the search space. To this
end, SCAMA adopts a learning process based on graph embeddings, i.e., numerical representations of graphs, as reviewed next.


Fig. 2. MSM search space. a) The graph database contains three graphs {𝐺1 , 𝐺2 , 𝐺3 }. b) All enumerated subgraphs, in which coloured subgraphs are frequent and black subgraphs are infrequent. c) MARGIN generates all subgraphs in a top-down manner (red zone) by gradually deleting one edge from the graph and searches for the maximal subgraphs on the border (green zone) between frequent and infrequent subgraphs. d) SCAMA first adopts a bootstrapped bottom-up search, and then employs Bp-GCN to find the maximal backbone (green zone) without gradual enumeration.

Graph Embedding. Graph embedding vectorises the nodes of a graph to a numerical representation, such that its structure is
preserved [21,22]. Graph embedding techniques can be classified into three types: (i) matrix-factorisation approaches [23], which
leverage the adjacency matrix to learn the representation directly using matrix decomposition; (ii) random-walk approaches [24],
which perform a random walk and learn the embedding such that the decoder optimises the co-occurrences of nodes that appeared
in the same walk; (iii) deep learning approaches, such as graph convolutional networks (GCNs) [25,26] and graph attention networks
(GATs) [27,28], which capture features assigned to nodes and the structure of the graph in a single model.
Compared to other application areas [29,30], only a few results are available on using embeddings for graph mining. The com-
putation of a graph isomorphism based on vectors of hand-crafted features via Weisfeiler-Lehman labelling was proposed in [31].
However, the feature generation process is costly and requires the subgraph to be known before-hand, which is not the case in
frequent subgraph mining. Recently, SPMiner [3] was proposed as an algorithm for embedding-based mining of frequent subgraphs
of a given size. While it employs a GCN to derive an embedding, it is not applicable for maximal subgraph mining as the subgraph
sizes are not known.
With SCAMA, we contribute a new GCN for graph mining. Unlike existing models, it preserves the relations of specific tree-shaped
subgraphs (coined backbones) between graphs. As a result, the search for maximal subgraphs relies on numerical computations,
thereby avoiding expensive isomorphism tests directly on graphs.

3. Problem statement

Our work is based on undirected labelled graphs. Labels are assigned to nodes and edges to capture their intrinsic properties.

Definition 1 (Labelled graph). A labelled graph, or simply graph, is a 4-tuple 𝐺 = (𝑉 , 𝐸, Σ, 𝑙) where

(1) 𝑉 is a set of nodes,
(2) 𝐸 ⊆ [𝑉 ]2 is a set of undirected edges (two-sets of nodes),
(3) Σ is a set of labels,
(4) 𝑙 ∶ 𝑉 ∪ 𝐸 → Σ assigns labels to nodes and edges.

In the remainder, we assume the alphabet of labels Σ to be defined as R𝑘 . That is, labels are 𝑘-dimensional real vectors. Next, we
define the notions of a subgraph and of a graph isomorphism.

Definition 2 (Subgraph). A graph 𝑆 = (𝑉𝑆 , 𝐸𝑆 , Σ𝑆 , 𝑙𝑆 ) is a subgraph of a graph 𝐺 = (𝑉𝐺 , 𝐸𝐺 , Σ𝐺 , 𝑙𝐺 ), if 𝑉𝑆 ⊆ 𝑉𝐺 , 𝐸𝑆 ⊆ 𝐸𝐺 , and 𝑙𝑆 (𝑣) = 𝑙𝐺 (𝑣) for all 𝑣 ∈ 𝑉𝑆 .

Definition 3 (Isomorphism). Two graphs 𝐺1 = (𝑉1 , 𝐸1 , Σ1 , 𝑙1 ) and 𝐺2 = (𝑉2 , 𝐸2 , Σ2 , 𝑙2 ) are isomorphic, if there exists an edge- and label-preserving bijection 𝑓 ∶ 𝑉1 → 𝑉2 such that:

(1) ∀ 𝑣 ∈ 𝑉1 ∶ 𝑙1 (𝑣) = 𝑙2 (𝑓 (𝑣)),
(2) ∀ {𝑣1 , 𝑣2 } ∈ 𝐸1 ∶ {𝑓 (𝑣1 ), 𝑓 (𝑣2 )} ∈ 𝐸2 , and
(3) ∀ {𝑣1 , 𝑣2 } ∈ 𝐸1 ∶ 𝑙1 ({𝑣1 , 𝑣2 }) = 𝑙2 ({𝑓 (𝑣1 ), 𝑓 (𝑣2 )}).

Fig. 3. Example of a maximal frequent subgraph.

Based thereon, we consider a graph database, i.e., a set of graphs. The support of a graph is then defined as the number of graphs
that contain an isomorphic subgraph.

Definition 4 (Support). Let 𝒟 = {𝐺1 , 𝐺2 , … , 𝐺𝑛 } be a graph database and 𝑆 be a graph. For some graph 𝐺 ∈ 𝒟, let

$$map(S, G) = \begin{cases} 1 & \text{if } S \text{ is an isomorphic subgraph of } G, \\ 0 & \text{otherwise.} \end{cases}$$

The support of 𝑆 in 𝒟, denoted as sup(𝑆), is defined as

$$\sup(S) = \frac{\sum_{G \in \mathcal{D}} map(S, G)}{|\mathcal{D}|}.$$

As mentioned, instead of considering all frequent subgraphs of a graph database, the maximal subgraphs provide a compact
representation of the commonalities of a set of graphs.

Definition 5 (Maximal frequent graph). Let 𝒟 = {𝐺1 , 𝐺2 , … , 𝐺𝑛 } be a graph database and 0 < 𝜏 < 1 be a support threshold. A graph 𝑆 is maximal frequent, if it is:

1. Frequent: sup(𝑆) ≥ 𝜏, and
2. Maximal: there does not exist a graph 𝑆′ not isomorphic to 𝑆, so that 𝑚𝑎𝑝(𝑆, 𝑆′ ) = 1 and sup(𝑆′ ) ≥ 𝜏.

Example 1. For illustration purposes, consider the graph database in Fig. 3. Both 𝑄 and 𝑆 are frequent subgraphs with a support
threshold 𝜏 = 2∕3 (𝑄 appears three times and 𝑆 appears twice). However, only 𝑆 is maximal frequent. 𝑄 is not maximal, since 𝑄 is
a subgraph of 𝑆.
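
A maximality filter follows directly from Definition 5. A minimal sketch, reusing the hypothetical `is_subgraph_isomorphic` placeholder from above:

```python
def maximal_frequent(frequent, is_subgraph_isomorphic):
    """Keep only the maximal graphs among a collection of frequent subgraphs (Def. 5)."""
    return [S for S in frequent
            if not any(S is not T and is_subgraph_isomorphic(S, T) for T in frequent)]
```

Applied to Example 1, the filter discards 𝑄 (which is contained in the frequent graph 𝑆) and keeps 𝑆.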

Based on the above notions, we capture the problem addressed in this paper, as follows.

Problem 1 (Maximal subgraph mining (MSM)). Given a graph database 𝒟 and a support threshold 𝜏, the problem of maximal subgraph mining is to find all maximal frequent subgraphs in 𝒟.

4. The SCAMA approach

To address the problem of maximal subgraph mining at scale, we propose SCAMA. Before giving an overview of the approach, we
summarize its underlying design principles.
Design Principles. Reflecting on emerging applications of graph mining [6] and on the empirical results reported for state-of-the-art
algorithms [11,12,3], we derive the following requirements for a solution to maximal subgraph mining:

(R1) Scalability: the solution shall scale to large graph databases and large maximal frequent subgraphs.
(R2) Progressiveness: the solution shall output found subgraphs immediately, before the entire mining process completes.
(R3) Configurability: the solution shall enable users to determine an order according to which subgraphs are prioritized, e.g., only
the top-𝑘 largest patterns shall be returned.

Framework Overview. We illustrate the main steps of SCAMA in Fig. 4. They are summarized as follows:
Step 1: Construction of equivalence classes using bootstrapped backbones. In the first step, we split the graph database into equivalence
classes. To this end, we efficiently mine frequent subgraphs that are tree-shaped and serve as backbones for the maximal subgraphs
that are actually of interest. We then use the bootstrapped backbones to split the graph database, where each backbone induces an
equivalence class. This way, we achieve scalability (R1), since MSM is performed only for classes that are larger than the support


Fig. 4. Schematic illustration of the SCAMA framework for scalable maximal subgraph mining.

Algorithm 1: SCAMA framework.

inputs : 𝒟, a graph database; 𝜏, a support threshold.
output: ℳ, the set of maximal frequent subgraphs.
1 Step 1: 𝒫𝑏 ← SplittingDatabase(𝒟);
2 𝒫𝑏 ← Refine(𝒫𝑏 );
3 @parallel computation;
4 foreach 𝑝𝑏 ∈ 𝒫𝑏 do
5   if 𝑝𝑏 is not a maximal backbone then
6     Step 2: 𝑝𝑀 ← MaximalBackbone(𝑝𝑏 );
7   else 𝑝𝑀 ← 𝑝𝑏 ;
8   Step 3: 𝒮 ← SubgraphConstruction(𝑝𝑀 );
9 return ℳ ← {𝑠 ∣ 𝑠 ∈ 𝒮, 𝑠 is maximal};

threshold 𝜏, and configurability (R3), since user preferences on the exploration of these equivalence classes can be incorporated. The
subsequent steps then consider each equivalence class separately. Again, this improves scalability (R1) since the size of each class can
be expected to be significantly smaller than the complete graph database. At the same time, the separate processing of equivalence classes also means that the results can be returned progressively (R2), i.e., subgraphs are returned as soon as they are found. The
details of this initial step of SCAMA are given in §5.
Step 2: Maximal backbone mining with graph embedding. In the second step, for each equivalence class, we employ a novel graph
convolutional network to learn graph embeddings. This model uses the bootstrapped backbone as prior knowledge and ensures
that the learned graph representations show intra- and inter-graph consistency. Based thereon, a maximal backbone can be derived
efficiently for each of the equivalence classes. This step is explained in detail in §6.
Step 3: Pattern construction from maximal backbones. The final step assists in constructing the maximal frequent subgraphs from
their maximal backbones. Again, this step is conducted for each equivalence class separately. Rather than checking for edge con-
sistency gradually, edge by edge, we present an algorithm for the efficient expansion of the backbone to subgraphs. Details are
presented in §7.
We separate the three above steps to highlight the conceptual design of our framework. However, the interplay of these steps is
outlined in Algorithm 1. Here, user preferences (R3) may be incorporated in the selection of an equivalence class to explore (line 2).
Also, the algorithm illustrates that patterns can be returned progressively (R2), since equivalence classes can be explored in parallel
and independently.

5. Graph equivalence classes based on bootstrapped backbones

This section shows how to split the graph database into equivalence classes using bootstrapped backbones. We first elaborate on
the notion of a backbone (§5.1), before discussing the induced equivalence classes (§5.2). Finally, we present an algorithm to split a
graph database (§5.3).


Fig. 5. Backbone representation.

Fig. 6. Backbone equivalence classes.

5.1. Backbone representation

A backbone refers to a tree-shaped labelled graph. A rooted backbone is a backbone with one node being designated as the root.
Similar to the idea presented in [32], we use backbones as a descriptor for a set of patterns, i.e., a set of subgraphs. If a subgraph
can be derived from a backbone through expansion, i.e., by including more edges, we refer to the subgraph as being rooted in the
backbone.
We aim to design a unique string representation for every rooted backbone. Inspired by [33], the construction of this string
representation is based on the tree of nodes and comprises two steps: 1) construction of a synthetic label for nodes. For each pair of
nodes (𝑢, 𝑣) in the backbone, where node 𝑣 is the child of node 𝑢 in the tree structure, a label for 𝑣 is synthesized as the concatenation
𝑒𝑙 𝑣𝑙 , with 𝑒𝑙 as the label of edge {𝑢, 𝑣} and 𝑣𝑙 as the label of node 𝑣. 2) construction of a string for the backbone. A breadth-first search
is conducted over the tree structure of the backbone starting from the root and following the order of actual labels of child nodes.
Then, the string representation of the backbone is obtained by concatenating the synthetic labels of nodes according to the search
order, while grouping siblings with a dedicated symbol “$” and appending a dedicated symbol “#” once the search has completed.

Example 2. We show an example of the string representation of backbone 𝑇1 in Fig. 5. We start with the actual label of the root node 𝐵. Enumerating the nodes according to the breadth-first search strategy, we append the string with 𝑦𝐴𝑦𝐴, since 𝑦𝐴 and 𝑦𝐴 are the synthetic labels of the nodes in the first level. We continue the process, adding “$” after each sibling group during the search, and eventually obtain the string 𝐵$𝑦𝐴𝑦𝐴$𝑦𝐶𝑥𝐷$𝑦𝐶$# as the unique representation of the backbone.
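
The construction of Example 2 can be sketched in a few lines. This is our own minimal sketch, assuming a simple dictionary encoding of the rooted backbone and assuming children are visited in the order of their node labels; the canonical form discussed next additionally maximises over all root choices.

```python
from collections import deque

def backbone_string(children, labels, root):
    """String representation of a rooted backbone (sketch of the two-step construction).

    children: maps each node to a list of (edge_label, child) pairs
    labels:   maps each node to its node label
    Synthetic child labels concatenate edge and node labels; sibling groups
    are separated by '$' and the string is terminated by '#'.
    """
    parts, queue = [labels[root]], deque([root])
    while queue:
        node = queue.popleft()
        kids = sorted(children.get(node, []), key=lambda p: labels[p[1]])
        if not kids:
            continue  # leaves contribute no sibling group
        parts.append("".join(e + labels[c] for e, c in kids))
        queue.extend(c for _, c in kids)
    return "$".join(parts) + "$#"

# Backbone T1 of Fig. 5: root B with two A-children (edge label y); the first A
# has children C (edge y) and D (edge x), the second A has one child C (edge y).
children = {"B": [("y", "A1"), ("y", "A2")], "A1": [("y", "C1"), ("x", "D1")], "A2": [("y", "C2")]}
labels = {"B": "B", "A1": "A", "A2": "A", "C1": "C", "C2": "C", "D1": "D"}
print(backbone_string(children, labels, "B"))  # B$yAyA$yCxD$yC$#
```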

After transforming backbones into string representations, the lexicographical order of these strings induces an order over backbones. For example, in Fig. 5, it holds that 𝑇1 ≥ 𝑇2 ≥ 𝑇3 . Although a backbone can be represented by multiple strings by designating different nodes as the root, there is a unique maximal string representation under the lexicographical order. This maximal representation is called the canonical representation of the backbone and can be computed efficiently using a greedy algorithm similar to [33]. Note that the original mechanism proposed in [33] used a different string representation based on DFS, which does not fully support the subsumption relationship of subgraph patterns.

5.2. Backbone equivalence classes

The canonical representation of a backbone serves as a summary of a set of subgraphs. As such, it induces an equivalence class
over subgraphs, captured as follows.

Definition 6 (Backbone equivalence class). Two graphs 𝑆1 and 𝑆2 are backbone-based equivalent, denoted as 𝑆1 ≐ 𝑆2 , if their back-
bones have the same canonical representation.

Example 3. Fig. 6 shows four graphs, 𝑆1 , 𝑆2 , 𝑆3 , 𝑆4 . Although these graphs differ, they are expanded from the same backbone and,
hence, belong to the same equivalence class.
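
Definition 6 effectively turns the canonical string into a grouping key. A minimal sketch, with `backbone_of` (extracting a graph's bootstrapped backbone) and `canonical_string` as hypothetical helpers:

```python
from collections import defaultdict

def split_into_classes(graphs, backbone_of, canonical_string):
    """Group graphs by the canonical string of their bootstrapped backbone (Def. 6)."""
    classes = defaultdict(list)
    for G in graphs:
        classes[canonical_string(backbone_of(G))].append(G)
    return dict(classes)
```

Classes whose relative size falls below the support threshold can then be dropped, as discussed next.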

The notion of an equivalence class based on backbones enables us to structure the mining process, as each respective subset is checked for a maximal subgraph. The benefit of splitting the graph database into equivalence classes is that we can safely prune an equivalence class from the subsequent mining steps if the cardinality of its subset is smaller than the support threshold. Moreover,


Algorithm 2: Splitting a graph database.

Input : 𝒟, a graph database.
Output: 𝒫𝑏 , a set of equivalence classes described by a set of backbones.
1 Splitting (𝒟)
2   𝒞 ← {𝑐 ∣ 𝑐 is a frequent edge in 𝒟};
3   (𝒫𝑏 , _) ← BootstrappedBackbone(𝒞, ∅);
4   return {𝑝𝑏 ∈ 𝒫𝑏 ∣ 𝑝𝑏 is maximal};
5
6 BootstrappedBackbone (𝐶, 𝑅)
7   𝑄 ← ∅;
8   foreach 𝑋 ∈ 𝐶 do
9     𝑆 ← {𝑌 ∣ 𝑌 is a backbone with one more node than 𝑋};
10    if |𝑆 − 𝑅| < |𝑆| then continue;
11    if MaxSup({𝑠 ∣ 𝑠 ∈ 𝑆}) = MaxSup({𝑞 ∣ 𝑞 ∈ 𝑄}) then
12      break; // stop if no further split is possible
13    (𝑈 , 𝑉 ) ← BootstrappedBackbone(𝐶, 𝑅);
14    𝑄 ← 𝑄 ∪ 𝑈;
15    𝑅 ← 𝑅 ∪ {𝑋} ∪ 𝑉 ;
16  return 𝑄, 𝑅;

once backbones have been enumerated, those having the same canonical representation as one that has already been considered may be pruned without losing information on potential maximal subgraphs. The reason is that the pruning strategy follows the downward closure property [34]: any supergraph of an infrequent subgraph is also infrequent. This property is widely applied by existing graph mining techniques [35] to speed up pattern search.

5.3. Database splitting using backbones

Next, we discuss how to mine backbones and split the graph database into equivalence classes. We start by describing a naive
approach, before presenting an efficient algorithm.

A Naive Approach. A backbone is a frequent tree-shaped subgraph in the graph database. Hence, any FSM technique [10,36] can
be used to mine such backbones by limiting the topology of the mined subgraphs. Such a naive approach involves two steps:

1. Candidate generation: All possible tree-shaped candidate graphs are generated for the graph database. While the set of candidates
grows exponentially [37], a large number of candidates are duplicates and, hence, redundant.
2. Frequency evaluation: For each generated candidate, the support is computed. This step requires a subgraph isomorphism test,
which is an NP-complete problem [10].

Both steps of this naive approach are computationally expensive. Therefore, we present a more efficient algorithm.

Branch-and-bound Algorithm with Early Pruning. We outline a depth-first backbone enumeration procedure in Algorithm 2. The
input of the algorithm is the graph database , while the output is the set of equivalence classes summarised by a set of backbones
𝑏 . To avoid solving an NP-complete problem (of the subgraph isomorphism test) multiple times (step (2) of the naive approach),
we start by greedily isolating all frequent edges 𝐶 in a graph database (line 2). Then, since only a combination of frequent edges
could be part of a frequent tree-shaped subgraph, we recursively expand the candidates 𝑌 by gradually increasing the candidate
size by one frequent edge (the BootstrappedBackbone subroutine). The input of the subroutine is the set of frequent edges 𝐶, and the
already-visited backbones 𝑅. In contrast to the naive approach, the set 𝑅 is used to detect and prune duplicated candidates (line 10).
Furthermore, we stop early when the split result of the current iteration is unchanged compared to the one of the previous iteration
(lines 11–12). Finally, the maximal backbones within all enumerated candidates are reported as the result (line 4).

Complexity Analysis. Similar to SPIN [11], we enumerate frequent tree-shaped subgraphs in a depth-first manner. However, unlike the original SPIN algorithm, which mines backbones of maximal size 𝑀 , we only bootstrap backbones of size 𝑏 (𝑏 ≪ 𝑀 ). As the complexity of such a mining routine is exponential in the size of the backbone, the mining process is accelerated significantly. In addition, we further improve the performance using an early pruning technique, as illustrated in the example below.

Example 4. Consider a graph database that contains two backbones 𝑇1 , 𝑇2 as in Fig. 7-a and assume that backbone 𝑇1 was enumerated already. Then, in the next iteration of our algorithm (Fig. 7-b), since the current backbone 𝐴𝐷𝐶𝐵 has been found already, we apply early pruning. However, this does not mean that backbone 𝑇2 is missed, as it will be found in a future iteration (Fig. 7-c).

In contrast to SPIN [11], the backbones discovered by our algorithm are not necessarily maximal spanning trees. This choice is
motivated as follows: If the database contains a large frequent pattern, the computational effort for mining maximal spanning trees
for such a pattern is high as well. So, instead, we resort to mining backbones with a limited size, which is sufficient to split the graph


Fig. 7. Early pruning exemplified.

Fig. 8. Anchor link.

database into equivalence classes. Equivalence classes of a size smaller than the support threshold can be safely pruned, without
compromising correctness of the approach.

6. Maximal backbone mining with graph embedding

Using the approach introduced above, we split the graph database 𝒟 into equivalence classes, where 𝑃𝑏(𝑖) is the bootstrapped backbone for class 𝐶𝑖 . However, this bootstrapped backbone is not necessarily the maximal backbone, denoted as 𝑃𝑀(𝑖) , of class 𝐶𝑖 . Therefore, this section proposes a learning algorithm that efficiently mines such a maximal backbone for each class. Specifically, we first mine the local maximal backbone in each pair of graphs of the database, guided by their shared backbone (§6.1). Then, we propose a mechanism to extract the global maximal backbone among all pairs of graphs within an equivalence class (§6.2).

6.1. Local maximal backbone

We distinguish the maximal backbone of all graphs in a class, referred to as the global maximal backbone, from the maximal
backbone between a specific pair of graphs, called the local maximal backbone. This section is devoted to computing the local
maximal backbone within a pair of graphs. Before diving into the details of the algorithm, we introduce some concepts that are used
later.

Anchor Link. Let 𝐺𝑖(1) = (𝑉𝑖(1) , 𝐸𝑖(1) ) and 𝐺𝑖(2) = (𝑉𝑖(2) , 𝐸𝑖(2) ) be two graphs in equivalence class 𝐶𝑖 . We know that 𝐺𝑖(1) and 𝐺𝑖(2) share
a common bootstrapped backbone 𝑃𝑏(𝑖) . This backbone induces a set 𝐴 = {(𝑢, 𝑣) ∣ 𝑢 ∈ 𝑉𝑖(1) , 𝑣 ∈ 𝑉𝑖(2) } of known anchor links between
nodes of either graph that are part of 𝑃𝑏(𝑖) .

Example 5. Fig. 8 illustrates a bootstrapped backbone involving nodes 𝐴 and 𝐵 between graphs 𝐺(1) and 𝐺(2) . Hence, (𝐴, 𝐴) and
(𝐵, 𝐵) are two known anchor links.

The maximal backbone can be extracted by identifying all hidden anchor links between two graphs. This extraction can be
phrased as a link prediction problem [38].


Anchor Link Prediction. Given two graphs 𝐺𝑖(1) and 𝐺𝑖(2) and a set of known anchor links, the problem of anchor link prediction is the task of detecting all hidden anchor links between them.
To solve the anchor link prediction problem, we propose to use graph embeddings to find a node-preserving isomorphism between
the two graphs. That is, we aim at constructing a bijection between the nodes of the graphs that preserves the node labels, but not
necessarily edges and edge labels (i.e., it enforces solely condition (1) of Definition 3 for the general graph isomorphism). However,
we consider the following observations before proposing the graph embedding model to solve the problem.

Observation 1. Drawing on empirical results reported in [39], we observe that an anchor link prediction model shall satisfy:

(C1) intra-graph consistency: consistency is required regarding the structure and labels in order to support finding the true corresponding
nodes for unobserved anchor links.
(C2) inter-graph consistency: nodes of different graphs that belong to a known anchor link shall have similar embeddings.

Based thereon, we devise a graph convolutional network (GCN) model for anchor link prediction. Our model, coined Backbone-
Preserving Graph Convolutional Network (Bp-GCN), leverages the known anchor links to satisfy the two consistency criteria. It proceeds
in two steps: The embedding stage and the querying stage. The two-stage separation of our Bp-GCN model is purposely introduced to
make it easier to explain the key ideas of the process. However, its actual learning process unifies both steps.

6.1.1. Bp-GCN embedding stage


We design a specific GCN model to learn the backbone-aware representation of nodes, while ensuring the aforementioned notions
of consistency.

Input Layer. Let (𝐺𝑖(1) , 𝐺𝑖(2) ) be a pair of graphs together with a set of known anchor links 𝐴 as induced by the backbone 𝑃𝑏 . Without loss of generality, we consider a unified graph 𝐺 = 𝐺(1) ∪ 𝐺(2) . The input node features are given as one-hot vectors of their labels. A node 𝑣𝑖 has the node feature 𝛼𝑖 ∈ ℝ𝑎×1 , where 𝑎 is the number of node labels. The input feature 𝛼𝑖 is encoded into a 𝑑-dimensional hidden feature $h_i^{(0)}$ using a simple linear mapping function before passing it into the graph neural network:

$$h_i^{(0)} = W^{(0)} \alpha_i + b^{(0)} \tag{1}$$

where $W^{(0)} \in \mathbb{R}^{d \times a}$ and $b^{(0)} \in \mathbb{R}^{d}$.

GCN Layers. A graph convolutional network is similar to a convolutional neural network (CNN) in terms of weight sharing between layers [21]. Each GCN layer computes the hidden features for the nodes through recursive neighbourhood diffusion. Each node representation derives its features from the previous layer and from its neighbours to preserve the local structure of the graph. Stacking 𝓁 layers allows the model to learn node representations from the 𝓁-hop neighbourhood of the nodes.
We denote by $h_i^{(\ell)}$ the feature of node 𝑖 at layer 𝓁. The new feature $h_i^{(\ell+1)}$ for node 𝑖 at layer 𝓁 + 1 is obtained by applying a non-linear transformation to the feature $h_i^{(\ell)}$ and the features $h_j^{(\ell)}$ of the neighbours 𝑗 of node 𝑖. This transformation guarantees that the node representations are invariant to the graph size and node ordering. Formally, the feature of node 𝑖 at the next layer (𝓁 + 1) is defined as:

$$h_i^{(\ell+1)} = \sigma\left(h_i^{(\ell)}, \{h_j^{(\ell)} : j \to i\}\right) \tag{2}$$

where $\{j \to i\}$ is the set of neighbouring nodes 𝑗 that point to node 𝑖, and 𝜎 is an activation function.
Although ReLU is widely employed as the activation function in GCN models [35], it is not suitable for anchor prediction: it collapses all negative inputs to zero and thus cannot differentiate between them (i.e., it is not injective). Since we need to preserve all information in the aggregation process, the activation function must be bijective to preserve the isomorphic relation between nodes without an explicit Weisfeiler-Lehman test [40]. We therefore adopt the hyperbolic tangent. Also, we add a self-loop to each node and consider a node as its own neighbour. Eq. (2) then simplifies to:

$$h_i^{(\ell+1)} = \tanh\left(\mathrm{Mean}\left(\{h_j^{(\ell)} : j \to i\}\right)\right) \tag{3}$$

To accelerate the learning process, we vectorise the features of all input nodes and feed them to the network at once rather than computing each node separately. As a result, evaluating Eq. (3) for all nodes can be performed efficiently using sparse matrix operations. The equivalent vector form of Eq. (3) for the batch input can be rewritten as follows:

$$H^{(\ell+1)} = \tanh\left(\tilde{A} H^{(\ell)} W^{(\ell)}\right) \tag{4}$$

where $\tilde{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ is the normalized Laplacian matrix, $H^{(\ell)} \in \mathbb{R}^{|V| \times d^{(\ell)}}$ denotes the embedding matrix, $d^{(\ell)}$ denotes the dimension of the embedding at layer 𝓁, and $\hat{A} = A + I$ denotes the adjacency matrix together with the self-node connections. The matrix 𝐼 is the identity matrix, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, and $W^{(\ell)} \in \mathbb{R}^{d^{(\ell)} \times d^{(\ell+1)}}$ is the learnable weight matrix mapping layer 𝓁 to layer 𝓁 + 1.
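
For illustration, the propagation rule of Eq. (4) admits a compact PyTorch sketch. This is our own sketch, not the authors' released code:

```python
import torch

def normalized_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Compute the normalized Laplacian of Eq. (4): D_hat^{-1/2} (A + I) D_hat^{-1/2}."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)

class BpGcnLayer(torch.nn.Module):
    """One layer H^{(l+1)} = tanh(A_tilde H^{(l)} W^{(l)}), cf. Eq. (4)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = torch.nn.Linear(d_in, d_out, bias=False)

    def forward(self, A_norm: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return torch.tanh(A_norm @ self.W(H))
```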

Backbone-preserving Loss. Our Bp-GCN model maps the graphs 𝐺(1) and 𝐺(2) into a unified embedding space, as guided by the
known anchor links, so that the vectors of nodes that belong to a true anchor link are similar. To preserve the characteristics of
anchor links, we define a backbone-preserving loss:

$$L_b = \sum_{i \in 1..\ell} \; \sum_{(u,v) \in A} \left\| H^{(i)}(u) - H^{(i)}(v) \right\| \tag{5}$$

where $A = \{(u, v) \mid u \in V^{(1)}, v \in V^{(2)}\}$ is the set of known anchor links induced by the common backbone 𝑃𝑏 . This loss function helps to achieve inter-graph consistency (C2).

Consistency Loss. To also achieve intra-graph consistency (C1), we incorporate a consistency loss. It ensures that nodes with similar neighbourhood structures have similar embeddings. We define the consistency loss as:

$$L_c = \sum_{i \in 1..\ell} \left\| \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} - H^{(i)} \cdot {H^{(i)}}^{T} \right\|_F \tag{6}$$

where $\|\cdot\|_F$ is the Frobenius norm. Here, the normalized Laplacian matrix is employed rather than the adjacency matrix, as we want to integrate more structural information into the embeddings to avoid a collapsing embedding space.

The Unified Loss. Finally, the loss functions are unified, as follows:

$$L(G) = \beta L_c + (1 - \beta) L_b \tag{7}$$

Unlike existing GCN-based models that employ only the final-layer embeddings $H^{(\ell)}$ for computing loss functions [35], we use all hidden representations for the loss computation (as in Eq. (5) and Eq. (6)). This balances the trade-off between the information of deep and shallow layers: deep layers capture more global connectivity but tend to blur local structure, whereas shallow layers preserve local consistency but lack global context.
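
Under the same assumptions as the layer sketch above, the three losses can be written down directly. `H_layers` holds the per-layer embedding matrices of the unified graph and `anchors` the index pairs of known anchor links; 𝛽 = 0.8 matches the setting reported in §9.1.4.

```python
def backbone_loss(H_layers, anchors):
    """Eq. (5): pull the embeddings of anchored node pairs together, over all layers."""
    return sum((H[u] - H[v]).norm() for H in H_layers for u, v in anchors)

def consistency_loss(H_layers, A_norm):
    """Eq. (6): match each layer's Gram matrix H H^T to the normalized Laplacian."""
    return sum((A_norm - H @ H.T).norm(p="fro") for H in H_layers)

def unified_loss(H_layers, anchors, A_norm, beta=0.8):
    """Eq. (7): weighted combination of the consistency and backbone-preserving losses."""
    return beta * consistency_loss(H_layers, A_norm) + (1 - beta) * backbone_loss(H_layers, anchors)
```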

6.1.2. Bp-GCN querying stage

To extract the maximal backbone from a pair of graphs, for any given node $v_i$ of graph $G^{(1)}$ with representation $h_i^{(\ell)}$, we identify the counterpart node $v_j$ in graph $G^{(2)}$ that is closest to $v_i$ in the embedding space. To this end, we rely on the following equation:

$$\min_{j=1..n} \left\| h_i^{(\ell)} - h_j^{(\ell)} \right\|_F \tag{8}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, and 𝑛 is the number of nodes in $G^{(2)}$. As we aim to find a one-to-one mapping, nodes that have already been mapped are filtered out from subsequent queries. The mapping process terminates upon encountering a pair of nodes with different labels. The tree-shaped backbone that includes all node correspondences between the two graphs then forms the maximal backbone.
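
The querying stage thus reduces to a nearest-neighbour search with two filters, as in the following sketch (our own reading of Eq. (8); `H1`, `H2` are the final-layer embeddings of the two graphs):

```python
import torch

def query_anchor_links(H1, H2, labels1, labels2):
    """Greedy one-to-one matching of nodes of G(1) to nodes of G(2) via Eq. (8)."""
    mapping, used = [], set()
    for i in range(H1.size(0)):
        dists = (H2 - H1[i]).norm(dim=1)  # distances to all nodes of G(2)
        for j in used:                    # already-mapped nodes are filtered out
            dists[j] = float("inf")
        j = int(dists.argmin())
        if labels1[i] != labels2[j]:
            break                         # terminate on a pair with different labels
        mapping.append((i, j))
        used.add(j)
    return mapping
```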

6.2. Global maximal backbone

Next, we discuss how to find the maximal backbone across all graphs within an equivalence class 𝐶𝑖 . We first sort all graphs by the number of their frequent edges, from highest to lowest. Based thereon, we find the local backbone between all pairs of graphs following this order. This sorting heuristic is widely used in maximal itemset mining to improve the performance of the mining process by prioritizing large patterns [40]. Note, however, that without the step of splitting the graph database based on bootstrapped backbones (as introduced in §5), this heuristic cannot be expected to work well, as the frequent edges are less likely to belong to the same pattern. Finally, the global maximal backbone 𝑃𝑀(𝑖) is obtained by intersecting all local backbones of the equivalence class 𝐶𝑖 .

7. Pattern construction from maximal backbones

Having identified a maximal backbone 𝑃𝑀(𝑖) for an equivalence class 𝐶𝑖 , we construct all frequent subgraphs whose canonical backbones are rooted in 𝑃𝑀(𝑖) . The maximal frequent subgraphs are then extracted from this set of frequent subgraphs.
As detailed in Algorithm 3, we first identify the set of edges shared by all graphs and the backbone (line 1). Then, we expand the backbone 𝑃𝑀(𝑖) with all frequent edges. We prune the equivalence class if the subgraphs before and after expansion have different canonical representations (lines 2–3). This heuristic is commonly used in frequent subgraph mining to improve the performance without compromising result correctness [11].

Observation 2. Let $h_i^{(\ell)}$ be the embedding of a node $v_i$. Since Bp-GCN preserves intra-graph consistency (C1) within an 𝓁-hop neighbourhood, the frequent edges within the backbone's receptive field are likely to belong to the same frequent pattern.

Following the above observation, rather than exploring subgraphs by expanding them with single edges, we first explore the largest candidate and return it if it turns out to be frequent (line 8). If this is not the case, we proceed with the edge-wise expansion. Note that the above observation and, hence, the respective optimisation is enabled only by our Bp-GCN model, which preserves intra-graph consistency.

Example 6. Fig. 9 shows the expansion of a backbone 𝑆1 . Algorithm 3 greedily checks the largest candidate 𝑆4 first. If it turns out
to be frequent, the intermediate candidates 𝑆2 and 𝑆3 are not explored.


Algorithm 3: Subgraph Construction.

Input : 𝑃𝑀(𝑖) , backbone for expansion;
        (𝑉𝑖(1) , 𝐸𝑖(1) , Σ𝑖(1) , 𝑙𝑖(1) ), … , (𝑉𝑖(𝑛) , 𝐸𝑖(𝑛) , Σ𝑖(𝑛) , 𝑙𝑖(𝑛) ), graphs of 𝐶𝑖 .
Output: Frequent subgraphs expanded from 𝑃𝑀(𝑖) .
1 𝐸 ← {𝑐 ∈ 𝐸𝑖(1) ∩ … ∩ 𝐸𝑖(𝑛) ∣ 𝑐 is part of 𝑃𝑀(𝑖) };
2 𝐺 ← 𝑃𝑀(𝑖) ⊕ 𝐸;
3 if 𝐺 and 𝑃𝑀(𝑖) have different canonical representations then return ∅;
4 𝑄 ← SearchGraphs(𝑃𝑀(𝑖) , 𝐸);
5 return {𝐺 ∈ 𝑄 ∣ 𝐺 is maximal};
6
7 SearchGraphs (𝐺, 𝐶)
8   if 𝐺 ⊕ 𝐶 is frequent then return {𝐺 ⊕ 𝐶};
9   𝑄 ← ∅;
10  for 𝑐 ∈ 𝐶 do 𝑄 ← 𝑄 ∪ SearchGraphs(𝐺 ⊕ {𝑐}, 𝐶 ⧵ {𝑐});
11  return 𝑄;

Fig. 9. Example for the subgraph construction algorithm.

8. Applications

This section illustrates two applications of our SCAMA framework. First, we extend it for the problem of top-𝑘 largest frequent
subgraph mining (§8.1). Then, we demonstrate how the mined patterns are used as features for graph classification (§8.2).

8.1. Top-𝑘 largest frequent subgraph mining

In certain application contexts, it may be infeasible to extract all maximal frequent subgraphs, or their number may be overwhelming. In such cases, it is useful to consider solely the top-𝑘 largest frequent subgraphs. The respective mining problem is formulated
as an extension of the MSM problem introduced earlier (Problem 1).

Problem 2 (Top-𝑘 largest frequent subgraph mining (𝑘-LFSM)). Given a graph database 𝒟 and a support threshold 𝜏, the problem of top-𝑘 largest frequent subgraph mining is to find 𝑘 frequent subgraphs 𝑆1 , … , 𝑆𝑘 of 𝒟, so that any other frequent subgraph 𝑆′ of 𝒟 is not larger than any subgraph 𝑆𝑖 , 1 ≤ 𝑖 ≤ 𝑘.

It is non-trivial to adapt existing solutions to the MSM problem for the 𝑘-LFSM problem. The reason is that existing techniques that scan the graph lattice first need to extract all frequent patterns before they can be ranked to select the 𝑘 largest subgraphs. This is inefficient and mines a large number of superfluous subgraphs.
Our SCAMA framework, however, offers a more efficient solution to the 𝑘-LFSM problem. To this end, we largely follow the procedure introduced to solve the MSM problem, as sketched below. We first split the graph database into equivalence classes based on bootstrapped backbones and then mine the maximal backbone for each class using the Bp-GCN model. However, only the top-𝑘 largest backbones are kept for the subsequent construction of the maximal frequent subgraphs. As the construction may result in more than 𝑘 patterns, a final step identifies the 𝑘 largest of them. Note that our solution to the 𝑘-LFSM problem is non-parametric: the backbones are ranked naturally in the second step, so that a user solely has to decide how many patterns to retain.
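
All helpers in the following sketch (`mine_classes`, `maximal_backbone`, `expand`) are hypothetical stand-ins for Steps 1–3 of SCAMA, and graphs are assumed to expose `number_of_edges()` (as, e.g., in networkx):

```python
def k_largest_frequent_subgraphs(database, tau, k, mine_classes, maximal_backbone, expand):
    """k-LFSM on top of SCAMA (sketch): construct patterns only for the k largest backbones."""
    classes = mine_classes(database, tau)                     # Step 1: equivalence classes
    backbones = [maximal_backbone(c) for c in classes]        # Step 2: Bp-GCN per class
    backbones.sort(key=lambda b: b.number_of_edges(), reverse=True)
    patterns = [p for b in backbones[:k] for p in expand(b)]  # Step 3: top-k backbones only
    return sorted(patterns, key=lambda p: p.number_of_edges(), reverse=True)[:k]
```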


Table 1
Statistics of the datasets.

Domain           Dataset    #Graphs  #Nodes  #Edges
Synthetic        SYN-1      100      1.5K    3.9K
Synthetic        SYN-2      150      3.0K    6.9K
Synthetic        SYN-3      200      5.0K    9.4K
Synthetic        SYN-4      250      7.5K    15.3K
Synthetic        SYN-5      300      10.5K   22.0K
Molecules        AIDS       2000     31.28K  32.4K
Chemical         COX2       467      19.3K   20.3K
Computer Vision  CUNEIFORM  267      5.68K   11.96K
Bioinformatics   ENZYMES    600      19.6K   37.3K

8.2. Graph classification

We further illustrate the application of maximal frequent subgraphs mined with the SCAMA framework for a downstream analysis task, i.e., graph classification. A good classifier relies on the choice of a meaningful feature set. Although all frequent subgraphs could be employed as features, such a classifier would be unlikely to achieve good performance: small frequent subgraphs can be expected to be very common in a graph database and, hence, have little discriminative power. In contrast, maximal frequent subgraphs can be expected to separate graphs more effectively.
To instantiate a classifier from the maximal patterns, we create a vector representation for each graph. With 𝑃 = {𝑝1 , … , 𝑝𝑛 } as the set of maximal patterns, a graph 𝐺 is represented by a vector 𝑓𝐺 = [𝑓1 , … , 𝑓𝑛 ], such that 𝑓𝑖 = 1 if 𝑝𝑖 is a subgraph of 𝐺, and 𝑓𝑖 = 0 otherwise. Based on this encoding, a common classification algorithm may be applied, e.g., a Support Vector Machine (SVM).
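
A sketch of this encoding, reusing the hypothetical `is_subgraph_isomorphic` placeholder and scikit-learn's SVC as one possible classifier:

```python
import numpy as np
from sklearn.svm import SVC

def pattern_features(graphs, patterns, is_subgraph_isomorphic):
    """Encode each graph as a binary vector over the set of maximal patterns."""
    return np.array([[1 if is_subgraph_isomorphic(p, G) else 0 for p in patterns]
                     for G in graphs])

# Hypothetical usage:
# X = pattern_features(train_graphs, mined_patterns, iso_test)
# clf = SVC().fit(X, y_train)
```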

9. Empirical evaluation

This section presents an empirical evaluation of SCAMA. We aim to answer the following research questions:

• RQ1 Does SCAMA outperform diverse baseline techniques?


• RQ2 How does the graph density affect SCAMA?
• RQ3 What is the memory footprint of SCAMA?
• RQ4 What is the impact of each component of SCAMA?
• RQ5 What influences the performance of SCAMA?
• RQ6 How much can SCAMA benefit from parallelisation?
• RQ7 How does SCAMA perform in top-k pattern mining and graph classification?

Before presenting our empirical results and findings, we first introduce the experimental setup.

9.1. Experiment setup

9.1.1. Datasets
In this work, we evaluate our proposed framework using both synthetic and real-world datasets, and their characteristics are
summarised in Table 1.

Synthetic Datasets. We use a synthetic graph data generator1 to generate the synthetic datasets. The sizes of the synthetic datasets are increased until the competing baselines can no longer finish within a reasonable time; we fix this time budget to 24 hours for all experiments.

Real-world Datasets. We use four real-world datasets2 from a variety of different domains. The first dataset we choose is the
AIDS [41] molecule dataset, which comprises 2000 graphs. The other three domains we include in the experiment are chemical
(COX2 [42]), computer vision (CUNEIFORM [42]), and bioinformatics (ENZYMES [13]).

9.1.2. Baseline techniques


The following algorithms denote the state of the art for maximal subgraph mining:

SPIN [11]. SPIN mines maximal frequent subgraphs by generating the graph lattice in a bottom-up manner. As such, we use it to
compare SCAMA against approaches that directly work on the graph lattice and search from small to large subgraphs.

MARGIN [12]. MARGIN generates the graph lattice in a top-down manner, which provides a different angle for comparison.

1
https://cse.hkust.edu.hk/graphgen/.
2
https://chrsmrrs.github.io/datasets/docs/datasets/.


Fig. 10. Efficiency.

SPMiner [3]. SPMiner maps the graphs into an embedding space, in which the search procedure for frequent subgraphs is efficient.
It uses a contemporary graph embedding technique that is similar to our proposed framework.

9.1.3. Metrics
As it is common in the evaluation of algorithms for the MSM problem [12,11], we measure the total runtime when varying the
support threshold. Additionally, for a given threshold, we also compute the following measures:

Pattern Size. We evaluate the sizes of the found subgraphs. Here, the larger the size, the better the result.

Detection Coefficient. If some ground truth is known (i.e., in the RNA datasets), the detection coefficient [43] is useful to assess the coverage of pattern detection. If we denote the ground truth and the found patterns as $\mathcal{P}^*$ and $\mathcal{P}$ respectively, the coefficient is defined as $\frac{|\mathcal{P}^* \cap \mathcal{P}|}{|\mathcal{P}^* \cup \mathcal{P}|}$, so that larger values are better.
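
In code, the detection coefficient is simply a Jaccard index over pattern sets (a sketch, assuming patterns are hashable identifiers):

```python
def detection_coefficient(found, ground_truth):
    """Jaccard overlap between the found and the ground-truth pattern sets."""
    found, ground_truth = set(found), set(ground_truth)
    return len(found & ground_truth) / len(found | ground_truth)
```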

Hit-time. As patterns shall be reported as soon as possible, the hit-time indicates how early the first pattern is obtained.

9.1.4. Hyperparameters
For SCAMA, we set the embedding dimension, learning rate, and training epochs of Bp-GCN to 200, 0.01, and 20, respectively.
The weight balancing the consistency loss against the backbone-preserving loss is 𝛽 = 0.8 (cf. Eq. (7)). For the baselines, we tune the parameters based on the
original papers and report the best results.

9.1.5. Reproducibility
All results were obtained on an Intel i7 3.2 GHz system (four cores, 120 GB RAM) running Ubuntu 16.06. SCAMA was implemented
in Python 3. An implementation of SPMiner in Python was obtained from the authors. SPIN and MARGIN were also implemented in
Python.

9.2. End-to-end comparison

To answer the first question (RQ1), we assess the performance of SCAMA in an end-to-end manner, focusing on three aspects:
(i) the general efficiency, (ii) the pattern distribution, and (iii) the effect of our pruning optimisation.

Efficiency. We first compare the total runtime of SCAMA against the baseline techniques when varying the support threshold (𝜏). Fig. 10 shows the observed total runtime (in seconds) on a logarithmic scale for all four datasets.
The figure shows that SCAMA consistently outperforms the baseline techniques, with the runtime being at least an order of
magnitude smaller for most support thresholds. For SPIN and MARGIN, this is explained by the need to explore an exponentially
growing graph lattice. Also, the efficiency of SPIN is higher compared to MARGIN on all real-world datasets, which is consistent
with the results reported in [12]. Similar to our work, SPMiner exploits an embedding technique, so that the runtime depends on the
graph size, but is independent of the support threshold. However, the runtime of SPMiner is much higher than the one observed for
SCAMA. The reason is that SPMiner employs a subgraph-level embedding mechanism, and the number of subgraphs derived from the
graphs is much higher than the number of graph nodes.

Pruning. Next, to further explain the efficiency improvement, we quantify the portion of the graph lattice that is explored by the different techniques. Fig. 11 compares the number of examined nodes in the search space (MARGIN adopts a top-down search, SPIN proceeds bottom-up) over all datasets that all baselines can finish. The results of SPMiner are omitted, as it relies on subgraph embeddings instead of exploring the graph lattice. We observe that the size of the search space explored by the


Fig. 11. Lattice explored.

Fig. 12. Pattern distribution.

baseline techniques is huge in comparison to SCAMA. Our approach explores only a small area to find the bootstrapped backbones
for the maximal frequent subgraphs.

Pattern Distribution. To examine the correctness of the found patterns, we plot the pattern distribution identified by all techniques. Fig. 12 shows the pattern distribution for the COX2 (𝜏 = 10%) and SYN-1 (𝜏 = 20%) datasets (the other datasets expose the same trend). Here, the exact methods (i.e., SPIN and MARGIN) serve as ground truth. We see that SCAMA correctly identifies most of the maximal patterns. Comparing the results of the approximate methods, SPMiner tends to return small patterns with high frequency rather than the maximal ones. This is due to its biased exploration of the embedding space and the greedy prioritization of high-frequency pattern candidates. Interestingly, SPMiner also returns patterns with a frequency that is smaller than the support threshold. For example, in the COX2 dataset, SPMiner yields patterns of sizes 15 to 20; however, when comparing these results with the actual patterns, the support of these patterns is less than the given support threshold, so they do not satisfy the mining requirements. The reason for this observation is that SPMiner adopts an approximate procedure when searching for frequent patterns in the embedding space and, hence, may arrive at erroneous frequency values.

False Negative Analysis. To assess the approximation quality compared to the results of exact methods (i.e., SPIN and MARGIN), we analyse the false-negative patterns; the results for the SYN-1 dataset are presented in Fig. 13. The false-negative patterns are grouped into categories based on the number of mismatched nodes, which are highlighted in pink in the sample patterns. We see that most false-negative patterns differ from the actual patterns in at most three nodes, and these nodes lie on the boundary of the patterns. The mismatched nodes stem from the inherent nature of graph embedding techniques, which may loosen the structural proximity of high-order nodes [21]. However, the estimation error rate (in percent) decreases monotonically as the number of mismatched nodes increases. This observation hints at a reasonable trade-off for the efficiency improvement over exact methods in cases where the latter are unable to complete the mining task.


Fig. 13. False negative patterns.

Fig. 14. Scalability vs. density.

9.3. Scalability

To understand the scalability of SCAMA and answer research question RQ2, we study the performance of all techniques when varying the graph density, the pattern size, and the dataset size. As the exact methods may not finish at some scales, we fix the execution time budget to 24 hours for all approaches and report the results.

Effects on Density. In this setting, we use the SYN-1 dataset as the sparse density level. We increase the density by
multiplying the underlying density of the SYN-1 dataset by 2, 4, and 8 to form the medium, moderate, and dense datasets, respectively.
We present the efficiency of all baselines in Fig. 14. The figure shows that both MARGIN and SPIN cannot complete the computation
for the dense dataset within 24 hours, so their results are omitted. SCAMA turns out to scale well with the dataset density. This is
because the runtime of SCAMA is bounded by the time needed to compute the embeddings, which is linear in the degree of the graph
(see also §9.6). The runtime of SPMiner also shows a linear trend in the dataset's density, as it employs graph embeddings similar
to our approach. However, the total runtime of SPMiner is high, as it spends most of the time in a heuristic search based on
random walks in the embedding space. SPIN and MARGIN perform the worst in most cases; their runtime grows exponentially with
the average degree of the graphs. This phenomenon is attributed to the positive correlation between the density of the dataset and
the size of the search space explored by these techniques.
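
For reproducibility of this setting, the following sketch shows how density-scaled dataset variants and the 24-hour budget can be set up; it is an assumption-laden simplification using Erdős–Rényi graphs in networkx (the actual generator behind SYN-1 may differ), and the miner interface is hypothetical:

```python
import time
import networkx as nx

def make_dataset(n_graphs=100, n_nodes=50, base_p=0.04, factor=1, seed=0):
    """Synthetic graphs whose edge probability is base_p scaled by `factor`
    (factor 1, 2, 4, 8 -> sparse, medium, moderate, dense)."""
    return [nx.gnp_random_graph(n_nodes, base_p * factor, seed=seed + i)
            for i in range(n_graphs)]

BUDGET = 24 * 3600  # 24-hour runtime budget per run, in seconds

def timed_run(miner, dataset):
    """Run one mining routine and report its result and wall-clock time;
    results exceeding the budget are discarded, mirroring the experiment."""
    start = time.perf_counter()
    result = miner(dataset)
    elapsed = time.perf_counter() - start
    return (result if elapsed <= BUDGET else None), elapsed

datasets = {factor: make_dataset(factor=factor) for factor in (1, 2, 4, 8)}
```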

Effects on Pattern Size. In this experiment, we again use the SYN-1 dataset, but implant patterns of different sizes to examine
the scalability of all approaches with respect to the pattern size. The results are shown in Fig. 15-a. Interestingly, when the graph size is fixed,
the running time of SCAMA and SPMiner does not increase significantly as the pattern size increases. The reason is that both SCAMA
and SPMiner employ embedding techniques whose complexity is bounded by the graph size. In contrast, the runtime of both SPIN
and MARGIN grows exponentially and exceeds the runtime budget at a pattern size of 18 nodes. This behaviour can be attributed to the
explosion in the number of intermediate candidates that need to be explored and evaluated.

Effects on Graph Size. We further assess the scalability of all approaches as the dataset size increases. All synthetic datasets are
employed in this experiment, and the results are shown in Fig. 15-b. In general, both SCAMA and SPMiner scale best with the
dataset size. MARGIN and SPIN cannot finish within 24 hours when applied to SYN-3 and SYN-5, respectively, which suggests the
inefficiency of both bottom-up and top-down search in an exponential lattice space.


Fig. 15. Scalability vs. pattern and dataset size.

Fig. 16. Resource consumption.

9.4. Memory consumption

To evaluate the effectiveness in terms of resource consumption (RQ3), we measure the memory footprint of all methods. In
Fig. 16, we show the consumed memory relative to the support threshold.
We note that the memory consumption of SPIN is moderate. This is expected, since SPIN only explores the lattice nodes that
support finding the maximal spanning tree. SCAMA only explores a small, bottom portion of the graph lattice to find the bootstrapped
backbones for the patterns and thus consumes the least memory. SPMiner and MARGIN consume the most memory of all
techniques: MARGIN needs to store about half of the graph lattice, while SPMiner
decomposes and stores a vast number of possible subgraphs in memory to facilitate its subgraph-level embedding mechanism.
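
A lightweight way to obtain such memory footprints, sketched below with Python's standard tracemalloc module (the miner interface is a hypothetical placeholder), is to trace the peak allocation of a single mining call per support threshold:

```python
import tracemalloc

def peak_memory_mib(miner, dataset, tau):
    """Run one mining call and return its peak traced allocation in MiB.
    tracemalloc only tracks Python-level allocations, so this is a
    lower-bound estimate of the true process footprint."""
    tracemalloc.start()
    miner(dataset, tau)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

# Example usage: footprint relative to the support threshold.
# for tau in (10, 20, 40, 60):
#     print(tau, peak_memory_mib(my_miner, my_dataset, tau))
```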

9.5. Ablation testing

To better understand the interplay of the components of our framework, we conduct an ablation study (RQ4). We compare the
performance of SCAMA with several variants:

• SCAMA-1: We mine backbones with FFSM [36], an established algorithm for efficient mining of frequent trees, rather than with
the algorithm proposed as the first step of SCAMA.
• SCAMA-2: We employ a state-of-the-art graph convolutional network [44] to replace our backbone-preserving graph embedding
step (step 2).
• SCAMA-3: To evaluate the effectiveness of the optimised subgraph expansion algorithm, we use a naive expansion process that
gradually expands one edge at a time (see the sketch below).
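
For illustration, the following sketch shows a naive one-edge-at-a-time expansion in the spirit of SCAMA-3; it is a simplified stand-in (assuming networkx graphs and a hypothetical support function), not the optimised expansion used by SCAMA:

```python
import networkx as nx

def naive_expand(subgraph, host, database, tau, support_fn):
    """One naive expansion step: try every boundary edge of `subgraph`
    within `host` (exactly one endpoint inside the current subgraph),
    and keep the expansions whose support stays >= tau. Missing internal
    edges are ignored for simplicity."""
    frequent = []
    for u, v in host.edges():
        in_u, in_v = u in subgraph, v in subgraph
        if in_u == in_v:  # skip internal and fully external edges
            continue
        candidate = subgraph.copy()
        new_node = v if in_u else u
        candidate.add_node(new_node, **host.nodes[new_node])
        candidate.add_edge(u, v)
        if support_fn(candidate, database) >= tau:
            frequent.append(candidate)
    return frequent
```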

Table 2 shows the results for the RNA-2 dataset with a medium level of density, as it best reflects a real-world scenario (other
datasets reveal the same trend). We use the core patterns as the ground truth to assess the detection coefficient of all techniques.
Although SCAMA-1 and SCAMA-3 successfully identify all core patterns, their runtime is higher. Without the early-pruning
optimisation used when searching for backbones, SCAMA-1 suffers from overhead due to duplicates. SCAMA-3, in turn, faces overhead
due to the evaluation of the power set of candidate edges. While SCAMA-2 achieves a slight runtime improvement over SCAMA,
the correctness of its results is compromised significantly, because the consistency properties provided by SCAMA are no
longer ensured.


Table 2
Importance of each model component.

Variant     Runtime (s)    Detection coefficient (%)
SCAMA            205.95    100
SCAMA-1        1,913.44    100
SCAMA-2          197.07     20
SCAMA-3          219.15    100

Fig. 17. Qualitative analysis of Bp-GCN.

Fig. 18. Profiling of computation cost.

Qualitative Study. To further explore why the detection coefficient of SCAMA-2 degrades significantly, we qualitatively evaluate the
embeddings generated by the standard GCN model and by Bp-GCN. We visualise the embeddings of nodes that belong to a common backbone
shared by two graphs in the RNA-1 dataset. We see that Bp-GCN (Fig. 17-b) produces closer embeddings for the nodes of common
backbones than the standard GCN (Fig. 17-a).
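
The visualisation itself can be reproduced with a short script; the sketch below (assuming scikit-learn and matplotlib, with the two embedding matrices given as NumPy arrays) projects the backbone-node embeddings of both graphs to 2-D via PCA and overlays them:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_backbone_embeddings(emb_a, emb_b):
    """Project two sets of node embeddings (one per graph) to 2-D and
    overlay them; tighter overlap indicates closer backbone embeddings."""
    pts = PCA(n_components=2).fit_transform(np.vstack([emb_a, emb_b]))
    plt.scatter(*pts[: len(emb_a)].T, marker="o", label="graph A backbone")
    plt.scatter(*pts[len(emb_a):].T, marker="^", label="graph B backbone")
    plt.legend()
    plt.show()
```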

9.6. Performance profiling

SCAMA turned out to yield better overall performance than its restricted variants and the baseline techniques. Yet, it is also
valuable to understand the reasons for these improvements in terms of a break-down of runtime and memory consumption (RQ5).

Computation Cost. Fig. 18 shows the profiling results in terms of the percentage of time spent on each step of our framework
for each of the RNA datasets, relative to the support threshold. Interestingly, SCAMA spends most of its time in the step that splits the graph
database (step 1). However, our pruning mechanism keeps this time at a reasonable level. In fact, this is the reason why the total
runtime of SCAMA does not grow exponentially, as observed for traditional approaches to subgraph mining such as SPIN and MARGIN.
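
Such a per-step break-down can be collected with a simple timing harness; the sketch below (using only the Python standard library, with step names as placeholders) accumulates wall-clock time per step and reports relative shares:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def step(name):
    """Accumulate wall-clock time per pipeline step."""
    start = time.perf_counter()
    yield
    timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

# with step("1: split database"):  ...bootstrapped backbone mining...
# with step("2: embed"):           ...Bp-GCN forward passes...
# with step("3: construct"):       ...pattern construction...

total = sum(timings.values()) or 1.0
shares = {name: 100 * t / total for name, t in timings.items()}
```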

Memory Consumption. Fig. 19 presents a break-down of the memory consumed by SCAMA. Around 30% of the memory consumption
is due to the bootstrapped backbone mining (step 1), as this step filters out all duplicate candidates and does not store them
in main memory during the search. Although the subsequent steps (steps 2 & 3) account for 70% of the memory, they are based on
graph embeddings, so the total amount of memory used by SCAMA is much smaller than the memory footprint of the baseline
techniques, as reported in §9.4.


Table 3
Hit time (in seconds).

Baseline     SYN-1       SYN-2       SYN-3
SCAMA          183.1       191.3       269.8
SPMiner        637.17      831.68    1,008.45
MARGIN         915.7     2,057.73    1,071.31
SPIN         3,009.6    20,202.5    52,716.6

Fig. 19. Profiling of resource use.

Fig. 20. Progressiveness (RNA-2).

9.7. Effects of parallelisation

Unlike the baseline techniques, which search for maximal patterns sequentially, SCAMA enables parallel mining. While this helps to boost
the overall performance (see §9.2), it also improves the earliness with which the discovered patterns are reported (RQ6).

Hit-time. The experiments are conducted on the three synthetic datasets SYN-1 to SYN-3 (𝜏 = 20), and the hit times of all techniques are presented in Table 3.
MARGIN and SPMiner tend to report their final results only at the end of the mining process. Although SPIN does not wait until the end of
the mining process to return the first pattern, it takes a relatively long time to detect the first maximal subgraph. This behaviour is
expected, as SPIN follows a gradual scanning strategy: a frequent subgraph cannot be reported as maximal immediately, since its extensions
(i.e., larger supergraphs) may still be frequent. SCAMA is the quickest technique in returning the first maximal pattern.

Progressiveness. In addition to returning the first result faster, SCAMA also progressively returns further subgraphs without long
waiting periods. Fig. 20 compares the progressiveness of SPIN and SCAMA on the RNA-2 dataset. MARGIN and SPMiner only
report their results at the end of the mining process, so they are omitted from this experiment. The behaviour seen in Fig. 20
is expected, as our results in §9.6 indicate that the time for pattern construction is only a small fraction of the total runtime of
SCAMA, and this construction is also done in parallel. While SPIN also progressively returns the found subgraphs, the delay between
the points in time at which results are reported is large. This is due to the need for sequential processing and the assessment of
intermediate subgraphs.
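
Both hit time and progressiveness can be measured with one wrapper, sketched below under the assumption that a miner exposes maximal patterns as a Python generator: the first timestamp is the hit time, and the full series yields the progressiveness curve.

```python
import time

def emission_times(pattern_stream):
    """Record when each maximal pattern is emitted by the miner.
    stamps[0][0] is the hit time; the full series gives the
    progressiveness curve (cumulative patterns over time)."""
    start = time.perf_counter()
    stamps = []
    for pattern in pattern_stream:
        stamps.append((time.perf_counter() - start, pattern))
    return stamps
```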

Sensitivity. Fig. 21 studies the sensitivity of the hyperparameter 𝛽 that balances the consistency loss and the backbone-preserving loss.
Varying 𝛽, we employ success@1 [39] to assess the performance of Bp-GCN in recovering the maximal backbones of
the implanted core patterns. While 𝛽 = 0.8 provides the sweet spot of the trade-off, it is interesting to see that higher values of 𝛽
do not increase the performance further.
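
For reference, success@1 reduces to a one-line check once backbone similarity scores are available; the sketch below (with a hypothetical score matrix and gold labels) makes the metric explicit:

```python
import numpy as np

def success_at_1(scores, gold):
    """scores[i, j]: similarity of query backbone i to candidate j;
    gold[i]: index of the correct candidate for query i.
    success@1 is the fraction of queries whose top-ranked
    candidate is the correct one."""
    return float(np.mean(scores.argmax(axis=1) == gold))
```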


Table 4
Graph classification accuracy (%) on the AIDS dataset.

Threshold   SCAMA   SPIN/MARGIN   SPMiner
60          77.20      78.78       80.75
40          87.45      86.36       80.75
20          92.76      92.57       80.75
10          94.15      94.54       80.75

Fig. 21. Sensitivity (𝛽).

Fig. 22. Top-𝑘 patterns.

9.8. Results for applications

Finally, we consider the performance of SCAMA for the aforementioned applications, thereby addressing research question RQ7.

Top-𝑘 Pattern Mining. We evaluate the ability of SCAMA to prioritise the 𝑘 largest patterns, with 𝑘 set to four. Fig. 22 presents
the outcome on the RNA-4 dataset with high density, which is the most challenging scenario (results on the other datasets are omitted
as they show the same trend). We use the core patterns as the ground truth to assess the detection coefficient. In all four cases, SCAMA
successfully returns the largest patterns, which are correct according to the ground truth. Interestingly, the size of the discovered patterns is
slightly larger than the size of the ground-truth patterns. The reason is that the connections between the patterns and the background
graph lead to a slight increase in the pattern sizes.
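
Selecting the 𝑘 largest patterns from a set of mined frequent patterns is a simple post-processing step; a minimal sketch (assuming networkx pattern graphs) is:

```python
import heapq

def top_k_largest(frequent_patterns, k=4):
    """Keep the k largest frequent patterns, ranked by node count."""
    return heapq.nlargest(k, frequent_patterns,
                          key=lambda g: g.number_of_nodes())
```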

Graph Classification. To investigate the quality of the mined patterns, we evaluate their utility for graph classification.
Table 4 presents the classification results on the AIDS dataset (we obtained similar results for the other datasets). The
accuracy of graph classification is evaluated with 5-fold cross-validation. Clearly, SCAMA outperforms the other approximate
method (i.e., SPMiner) and is largely in line with the exact methods (i.e., SPIN and MARGIN). Interestingly, the accuracy increases
as the frequency threshold decreases. This is because large and distinct patterns tend to appear at low thresholds, and using
these patterns as features increases the classifier's discriminative power. While SPMiner yields better performance in high-support
scenarios, its classification becomes less effective than SCAMA's for lower support values. The reason is that SPMiner
yields a fixed set of patterns, independent of the chosen support.
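
A hedged sketch of this evaluation pipeline is given below, assuming networkx for the containment tests and scikit-learn for 5-fold cross-validation; the classifier choice is our assumption, as the experiments do not specify one:

```python
import numpy as np
import networkx as nx
from networkx.algorithms import isomorphism
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` (subgraph monomorphism)."""
    matcher = isomorphism.GraphMatcher(
        graph, pattern,
        node_match=isomorphism.categorical_node_match("label", None))
    return matcher.subgraph_is_monomorphic()

def pattern_features(graphs, patterns):
    """Binary feature matrix: X[i, j] = 1 iff pattern j occurs in graph i."""
    return np.array([[int(contains(g, p)) for p in patterns] for g in graphs])

def classify(graphs, labels, patterns):
    X = pattern_features(graphs, patterns)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, labels, cv=5).mean()  # 5-fold accuracy
```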

9.9. Discussion

Frequent subgraph mining is a crucial problem in graph and data mining, involving the identification of frequently occurring sub-
graphs in a given database [10]. However, the problem is computationally expensive, and exact algorithms can become infeasible for
large datasets [3]. Therefore, approximate algorithms are often used to solve the problem efficiently. One commonly used
approximate approach is based on a divide-and-conquer strategy [18], and our work follows the same paradigm:
a smaller subset of graphs is partitioned from the original graph database, frequent subgraphs are mined from this subset, and
the results obtained from the subset are then extrapolated to the entire graph database [3].
However, as our approach is approximate, it can suffer from a pessimistic approximation error [45], i.e., the error
introduced when the approximate solution is worse than the actual solution. In the context of frequent subgraph mining using graph
sampling, this can happen when the subset does not generalise certain infrequent subgraphs in the original graph database [18,3].
This can lead to an underestimation of the frequency of these subgraphs and, consequently, degrade the overall accuracy of the results.
While exact approaches may struggle to identify frequent subgraphs in dense graphs due to the inherently exponential search space, our
method may be less practical for sparse graphs, which can cause false negatives (as discussed in §9.2). However, this pessimistic
approximation error can be mitigated by combining our approach with an exact approach, leveraging the strengths of both. Given the
diversity of algorithms for maximal subgraph mining, combining multiple algorithms to improve the accuracy of the results is a promising
direction, which we leave as future work to facilitate recent advanced applications [46–50].

10. Conclusions

In this paper, we propose SCAMA, a scalable framework for mining maximal frequent subgraphs. The SCAMA framework consists
of three main steps. Firstly, the graph database is partitioned into equivalence classes using bootstrapped backbones, which are tree-
shaped frequent subgraphs. Then, a novel Bp-GCN model is employed to learn graph representations, enabling the construction of
maximal backbones. Finally, the maximal frequent subgraphs are derived from the obtained maximal backbones. Our experiments on
real-world graph collections from various domains demonstrate the effectiveness of SCAMA in mining maximal frequent subgraphs.
We observe the following key findings:

• Fast: it runs around 10 times faster than existing algorithms in most configurations (Fig. 10).
• Scalable: it scales linearly in the density of the graph (Fig. 14).
• Memory efficient: by avoiding the generation of a graph lattice, it consumes less memory than baselines (Fig. 16).

Furthermore, we demonstrate that mining maximal patterns with SCAMA enhances graph classification performance (refer
to Table 4). Additionally, we illustrate the capability of SCAMA to efficiently mine the top-𝑘 largest frequent patterns (refer to
Fig. 22). In terms of potential future work on maximal subgraph mining, we envision several directions. Firstly, multi-level sampling
could be explored to sample at different levels of abstraction, reducing the search space. Secondly, incremental mining techniques
could be developed to handle graph databases that are frequently updated. Lastly, noise-handling methods are essential for
noisy or error-prone input data. Overall, future research in maximal subgraph mining should focus on enhancing the efficiency and
effectiveness of algorithms to handle large-scale and noisy graph data, catering to diverse application domains.

CRediT authorship contribution statement

Thanh Toan Nguyen: Conceptualization, Methodology, Formal analysis, Validation, Writing – Original Draft. Thanh Trung
Huynh: Term, Software, Resources, Data Curation, Visualization. Matthias Weidlich: Validation, Formal analysis, Writing – Review
& Editing. Quan Thanh Tho: Project administration, Funding acquisition. Hongzhi Yin: Investigation, Supervision. Karl Aberer:
Supervision, Project administration, Funding acquisition. Quoc Viet Hung Nguyen: Supervision, Project administration, Funding
acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

Acknowledgement

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant
number IZVSZ2.203310.

References

[1] V. Ingalalli, D. Ienco, P. Poncelet, Mining frequent subgraphs in multigraphs, Inf. Sci. 451 (2018) 50–66.
[2] T.T. Nguyen, T.C. Phan, M.H. Nguyen, M. Weidlich, H. Yin, J. Jo, Q.V.H. Nguyen, Model-agnostic and diverse explanations for streaming rumour graphs,
Knowl.-Based Syst. 253 (2022) 109438.
[3] R. Ying, A. Wang, J. You, J. Leskovec, Frequent subgraph mining by walking in order embedding space, in: Proc. Int. Conf. Mach. Learn. Workshops, 2020.
[4] T.T. Nguyen, T.C. Phan, H.T. Pham, T.T. Nguyen, J. Jo, Q.V.H. Nguyen, Example-based explanations for streaming fraud detection on graphs, Inf. Sci. 621
(2023) 319–340.
[5] T.T. Nguyen, T.T. Huynh, H. Yin, M. Weidlich, T.T. Nguyen, T.S. Mai, Q.V.H. Nguyen, Detecting rumours with latency guarantees using massive streaming data,
VLDB J. (2022) 1–19.
[6] J. Kim, S. Lim, J. Kim, OCSM: finding overlapping cohesive subgraphs with minimum degree, Inf. Sci. 607 (2022) 585–602.
[7] N.-T. Le, B. Vo, L.B. Nguyen, H. Fujita, B. Le, Mining weighted subgraphs in a single large graph, Inf. Sci. 514 (2020) 149–165.


[8] J. Wang, X. Ren, S. Anirban, X.-W. Wu, Correct filtering for subgraph isomorphism search in compressed vertex-labeled graphs, Inf. Sci. 482 (2019) 363–373.
[9] National cancer institute, http://www.nci.nih.gov/, 2021.
[10] X. Yan, J. Han, gSpan: graph-based substructure pattern mining, in: 2002 IEEE International Conference on Data Mining, 2002, Proceedings, IEEE, 2002,
pp. 721–724.
[11] J. Huan, W. Wang, J. Prins, J. Yang, SPIN: mining maximal frequent subgraphs from graph databases, in: Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2004, pp. 581–586.
[12] L.T. Thomas, S.R. Valluri, K. Karlapalem, MARGIN: maximal frequent subgraph mining, ACM Trans. Knowl. Discov. Data 4 (3) (2010) 1–42.
[13] K.M. Borgwardt, C.S. Ong, S. Schönauer, S. Vishwanathan, A.J. Smola, H.-P. Kriegel, Protein function prediction via graph kernels, Bioinformatics 21 (suppl_1)
(2005) i47–i56.
[14] A. Inokuchi, T. Washio, H. Motoda, An apriori-based algorithm for mining frequent substructures from graph data, in: Principles of Data Mining and Knowledge
Discovery: 4th European Conference, PKDD 2000 Lyon, France, September 13–16, 2000 Proceedings 4, Springer, 2000, pp. 13–23.
[15] M. Kuramochi, G. Karypis, Frequent subgraph discovery, in: Proceedings 2001 IEEE International Conference on Data Mining, IEEE, 2001, pp. 313–320.
[16] X. Yan, H. Cheng, J. Han, P.S. Yu, Mining significant graph patterns by leap search, in: Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data, 2008, pp. 433–444.
[17] S. Ranu, A.K. Singh, GraphSig: a scalable approach to mining significant subgraphs in large graph databases, in: 2009 IEEE 25th International Conference on
Data Engineering, IEEE, 2009, pp. 844–855.
[18] Ł. Skonieczny, Mining for unconnected frequent graphs with direct subgraph isomorphism tests, in: Man-Machine Interactions, Springer, 2009, pp. 523–531.
[19] R. Ayed, M.-S. Hacid, R. Haque, A. Jemai, An updated dashboard of complete search FSM implementations in centralized graph transaction databases, J. Intell.
Inf. Syst. 55 (1) (2020) 149–182.
[20] F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, P.S. Yu, Mining top-k large structural patterns in a massive network, Proc. VLDB Endow. 4 (11) (2011) 807–818.
[21] S. Kumar, A. Mallik, A. Khetarpal, B. Panda, Influence maximization in social networks using graph embedding and graph neural network, Inf. Sci. 607 (2022)
1617–1636.
[22] M. Qiao, X. He, X. Cheng, P. Li, Q. Zhao, C. Zhao, Z. Tian, KSTAGE: a knowledge-guided spatial-temporal attention graph learning network for crop yield
prediction, Inf. Sci. 619 (2023) 19–37.
[23] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: a survey, Knowl.-Based Syst. 151 (2018) 78–94.
[24] B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2014, pp. 701–710.
[25] Y. Jiang, H. Lin, Y. Li, Y. Rong, H. Cheng, X. Huang, Exploiting node-feature bipartite graph in graph convolutional networks, Inf. Sci. 628 (2023) 409–423.
[26] B. Yu, H. Xie, Z. Xu, PN-GCN: positive-negative graph convolution neural network in information system to classification, Inf. Sci. 632 (2023) 411–423.
[27] Y. Wang, H. Wang, W. Lu, Y. Yan, HyGGE: hyperbolic graph attention network for reasoning over knowledge graphs, Inf. Sci. 630 (2023) 190–205.
[28] L. He, L. Bai, X. Yang, H. Du, J. Liang, High-order graph attention network, Inf. Sci. 630 (2023) 222–234.
[29] N.T. Tam, H.T. Trung, H. Yin, T. Van Vinh, D. Sakong, B. Zheng, N.Q.V. Hung, Entity alignment for knowledge graphs with multi-order convolutional networks,
IEEE Trans. Knowl. Data Eng. 34 (9) (2022) 4201–4214.
[30] C.T. Duong, T.T. Nguyen, T.-D. Hoang, H. Yin, M. Weidlich, Q.V.H. Nguyen, Deep MinCut: learning node embeddings from detecting communities, Pattern
Recognit. (2022) 109126.
[31] C.T. Duong, T.T. Nguyen, H. Yin, M. Weidlich, T.S. Mai, K. Aberer, Q.V.H. Nguyen, Efficient and effective multi-modal queries through heterogeneous network
embedding, IEEE Trans. Knowl. Data Eng. 34 (11) (2022) 5307–5320.
[32] S. Guan, H. Ma, Y. Wu, Attribute-driven backbone discovery, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, 2019, pp. 187–195.
[33] Y. Chi, Y. Yang, R.R. Muntz, Indexing and mining free trees, in: Third IEEE International Conference on Data Mining, IEEE, 2003, pp. 509–512.
[34] N. Tung, L.T. Nguyen, T.D. Nguyen, P. Fourier-Viger, N.-T. Nguyen, B. Vo, Efficient mining of cross-level high-utility itemsets in taxonomy quantitative databases,
Inf. Sci. 587 (2022) 41–62.
[35] P. Lin, Y. Li, W. Luo, X. Zhou, Y. Zeng, K. Li, K. Li, Personalized query techniques in graphs: a survey, Inf. Sci. 607 (2022) 961–1000.
[36] J. Huan, W. Wang, J. Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, in: Third IEEE International Conference on Data Mining,
IEEE, 2003, pp. 549–552.
[37] T. Bhadra, S. Bandyopadhyay, Supervised feature selection using integration of densest subgraph finding with floating forward–backward search, Inf. Sci. 566
(2021) 1–18.
[38] T.T. Nguyen, T.T. Nguyen, T.T. Nguyen, B. Vo, J. Jo, Q.V.H. Nguyen, JUDO: just-in-time rumour detection in streaming social platforms, Inf. Sci. 570 (2021)
70–93.
[39] H.T. Trung, N.T. Toan, T. Van Vinh, H.T. Dat, D.C. Thang, N.Q.V. Hung, A. Sattar, A comparative study on network alignment techniques, Expert Syst. Appl.
140 (2020) 112883.
[40] Z. Halim, O. Ali, M.G. Khan, On the efficient representation of datasets as graphs to mine maximal frequent itemsets, IEEE Trans. Knowl. Data Eng. 33 (4) (2019)
1674–1691.
[41] K. Riesen, H. Bunke, et al., IAM graph database repository for graph based pattern recognition and machine learning, in: SSPR/SPR, vol. 5342, 2008, pp. 287–297.
[42] R.A. Rossi, N.K. Ahmed, The network data repository with interactive graph analytics and visualization, in: AAAI, 2015, https://networkrepository.com.
[43] N.T. Tam, M. Weidlich, B. Zheng, H. Yin, N.Q.V. Hung, B. Stantic, From anomaly detection to rumour detection using data streams of social platforms, Proc.
VLDB Endow. 12 (9) (2019) 1016–1029.
[44] J. Qiu, L. Dhulipala, J. Tang, R. Peng, C. Wang, LightNE: a lightweight graph processing system for network embedding, in: Proceedings of the 2021 International
Conference on Management of Data, 2021, pp. 2281–2289.
[45] M.S. Stein, J.A. Nossek, A pessimistic approximation for the fisher information measure, IEEE Trans. Signal Process. 65 (2) (2016) 386–396.
[46] T.T. Nguyen, M. Weidlich, H. Yin, B. Zheng, Q.V.H. Nguyen, B. Stantic, User guidance for efficient fact checking, Proc. VLDB Endow. 12 (8) (2019) 850–863.
[47] T.T. Nguyen, M. Weidlich, H. Yin, B. Zheng, Q.H. Nguyen, Q.V.H. Nguyen, FactCatch: incremental pay-as-you-go fact checking with minimal user effort, in:
Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2165–2168.
[48] T.T. Huynh, M.H. Nguyen, T.T. Nguyen, P.L. Nguyen, M. Weidlich, Q.V.H. Nguyen, K. Aberer, Efficient integration of multi-order dynamics and internal dynamics
in stock movement prediction, in: WSDM ’23: The Sixteenth ACM International Conference on Web Search and Data Mining, Virtual Event, Singapore, February
27–March 3, 2023, ACM, 2023, pp. 1–9.
[49] J. Chen, H. Xiong, H. Zheng, D. Zhang, J. Zhang, M. Jia, Y. Liu, EGC2: enhanced graph classification with easy graph compression, Inf. Sci. 629 (2023) 376–397.
[50] W. Wu, C. Song, J. Zhao, Z. Xu, Physics-informed gated recurrent graph attention unit network for anomaly detection in industrial cyber-physical systems, Inf.
Sci. 629 (2023) 618–633.
