GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation

Nikhil Goyal∗, Harsh Vardhan Jain∗, Sayan Ranu
Indian Institute of Technology, Delhi
nikhil10sep@gmail.com, harshvardhanjain7@gmail.com, sayanranu@cse.iitd.ac.in

∗ Both authors contributed equally to this research.

arXiv:2001.08184v2 [cs.LG] 8 Apr 2020

Abstract

Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough, due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at: https://github.com/idea-iitd/graphgen.

ACM Reference Format:
Nikhil Goyal, Harsh Vardhan Jain, and Sayan Ranu. 2020. GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3366423.3380201

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-7023-3/20/04. https://doi.org/10.1145/3366423.3380201

1 Introduction and Related Work

Modeling and generating real-world graphs have applications in several domains, such as understanding interaction dynamics in social networks[12, 35, 36], graph classification[3, 25, 27], and anomaly detection[26]. Owing to their wide applications, the development of generative models for graphs has a rich history, and many methods have been proposed. However, a majority of the techniques make assumptions about certain properties of the graph database and generate graphs that adhere to those assumed properties[1, 2, 18, 28, 41]. A key area that lacks development is the ability to directly learn patterns, both local as well as global, from a given set of observed graphs and use this knowledge to generate graphs instead of making prior assumptions. In this work, we bridge this gap by developing a domain-agnostic graph generative model for labeled graphs without making any prior assumption about the dataset. Such an approach reduces the usability barrier for non-technical users and applies to domains where the distribution assumptions made by traditional techniques do not fit well, e.g., chemical compounds, protein interaction networks, etc.

Table 1: Limitations of existing graph generative models. n and m denote the number of nodes and edges respectively. Any other variable is a hyper-parameter of the respective model. We consider a model scalable if its complexity is linear in n and m.

Technique      Domain-agnostic  Node Labels  Edge Labels  Scalable  Complexity
MolGAN[7]      ✗                ✓            ✓            ✗         O(n^2)
NeVAE[31]      ✗                ✓            ✓            ✓         O(Sm)
GCPN[39]       ✗                ✓            ✓            ✗         O(m^2)
LGGAN[8]       ✓                ✓            ✗            ✗         O(n^2)
Graphite[12]   ✓                ✓            ✗            ✗         O(n^2)
DeepGMG[19]    ✓                ✓            ✓            ✗         O(n^2 m)
GraphRNN[40]   ✓                ✗            ✗            ✓         O(nM)
GraphVAE[34]   ✓                ✓            ✓            ✗         O(n^4)
GRAN[21]       ✓                ✗            ✗            ✗         O(n(m + n))
GraphGen       ✓                ✓            ✓            ✓         O(m)

Modeling graph structures is a challenging task due to the inherent complexity of encoding both local and global dependencies in link formation among nodes. We briefly summarize how existing techniques tackle this challenge and their limitations.

Traditional graph generative models: Several models exist[1, 2, 18, 28, 29, 37] that are engineered towards modeling a pre-selected family of graphs, such as random graphs[28], small-world networks[37], and scale-free graphs[2]. The a priori assumption introduces several limitations. First, due to pre-selecting the family of graphs, i.e., distributions modeling some structural properties, they cannot adapt to datasets that either do not fit well or evolve to a different structural distribution. Second, these models only incorporate structural properties and do not look into the patterns encoded by labels. Third, these models assume the graph to be homogeneous. In reality, it is possible that different local communities (subgraphs) within a graph adhere to different structural distributions.

Learning-based graph generative models: To address the above-outlined limitations, recent techniques have shifted towards a learning-based approach[8, 16, 19, 21, 34, 40]. While impressive progress has been made through this approach, there is scope to further improve the performance of graph generative models. We outline the key areas we target in this work.
…minimum DFS codes, and given that DFS codes are canonical labels, modeling a database of graphs G = {G1, · · · , Gn} is equivalent to modeling their sequence representations S = {F(G1), · · · , F(Gn)} = {S1, · · · , Sn}, i.e., pmodel(G) ≡ pmodel(S). To model the sequential nature of the DFS codes, we use an auto-regressive model to learn pmodel(S). At inference time, we sample DFS codes from this distribution instead of directly sampling graphs. Since the mapping of a graph to its minimum DFS code is a bijection, i.e., G = F^{-1}(F(G)), the graph structure along with all node and edge labels can be fully constructed from the sampled DFS code.

Figure 3: Architecture of GraphGen. Red arrows indicate data flow in the RNN, whose hidden state hi captures the state of the graph generated so far. Blue arrows show the information flow to generate the new edge tuple si.

Fig. 3 provides an overview of the GraphGen model. For a graph G(V, E) with F(G) = S = [s1, · · · , sm], the proposed auto-regressive model generates si sequentially from i = 1 to i = m = |E|. This means that our model produces a single edge at a time, and since it belongs to a DFS sequence, it can only be between two previously generated nodes (backward edge) or one of the already generated nodes and an unseen node (forward edge), in which case a new node is also formed. Consequently, only the first generated edge produces two new nodes, and all subsequent edges produce at most one new node.

3.2.2 Model Details. We now describe the proposed algorithm to characterize p(S) in detail. Since S is sequential in nature, we decompose the probability of sampling S from p(S) as a product of conditional distributions over its elements as follows:

p(S) = ∏_{i=1}^{m+1} p(si | s1, · · · , si−1) = ∏_{i=1}^{m+1} p(si | s<i)    (1)

where m is the number of edges and sm+1 is an end-of-sequence token EOS to allow variable-length sequences. We denote p(si | s1, · · · , si−1) as p(si | s<i) in further discussions.

Recall that each element si is an edge tuple of the form (tu, tv, Lu, Le, Lv), where e = (u, v). We make the simplifying assumption that the timestamps tu, tv, node labels Lu, Lv, and edge label Le at each generation step of si are independent of each other. This makes the model tractable and easier to train. Note that this assumption does not depend on the data and hence does not impose any prior bias on it. Mathematically, Eq. 1 reduces to the following.

p(si | s<i) = p((tu, tv, Lu, Le, Lv) | s<i)
            = p(tu | s<i) × p(tv | s<i) × p(Lu | s<i) × p(Le | s<i) × p(Lv | s<i)    (2)

Eq. 2 is extremely complex as it has to capture complete information about an upcoming edge, i.e., the nodes to which the edge is connected, their labels, and the label of the edge itself. To capture this highly complex nature of p(si | s<i), we propose to use expressive neural networks. Recurrent neural networks (RNNs) are capable of learning features and long-term dependencies from sequential and time-series data[30]. Specifically, LSTMs[13] have been one of the most popular and efficient methods for reducing the effects of vanishing and exploding gradients in training RNNs. In recent times, LSTMs and stacked LSTMs have delivered promising results in cursive handwriting[9], sentence modeling[5], and speech recognition[10].

In this paper, we design a custom LSTM that consists of a state transition function ftrans (Eq. 3), an embedding function femb (Eq. 3), and five separate output functions (Eqs. 4–8), one for each of the five components of the 5-tuple si = (tu, tv, Lu, Le, Lv). (∼M in the following equations represents sampling from a multinomial distribution.)

hi = ftrans(hi−1, femb(si−1))    (3)
tu ∼M θtu = ftu(hi)    (4)
tv ∼M θtv = ftv(hi)    (5)
Lu ∼M θLu = fLu(hi)    (6)
Le ∼M θLe = fLe(hi)    (7)
Lv ∼M θLv = fLv(hi)    (8)
si = concat(tu, tv, Lu, Le, Lv)    (9)

Here, si ∈ {0, 1}^k is the concatenated component-wise one-hot encoding of the real edge, and hi ∈ R^d is an LSTM hidden state vector that encodes the state of the graph generated so far. The embedding function femb is used to compress the sparse information possessed by the large edge representation si into a small vector of real numbers (Eq. 3). Given the graph state hi, the outputs of the functions ftu(hi), ftv(hi), fLu(hi), fLe(hi), fLv(hi), i.e., θtu, θtv, θLu, θLe, θLv respectively, represent the multinomial distributions over the possibilities for the five components of the newly formed edge tuple. Finally, the components of the newly formed edge tuple are sampled from the multinomial distributions parameterized by the respective θs and concatenated to form the new edge si (Eq. 9).

With the above design, the key components that dictate the modeling quality are the multinomial probability distribution parameters θtu, θtv, θLu, θLe, θLv over the edge tuple components tu, tv, Lu, Le, Lv. Our goal is, therefore, to ensure that the learned distributions are as close as possible to the real distributions in the training data. Using the broad mathematical architecture explained above, we will now explain the graph generation process followed by the sequence modeling / training process.
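To make the decoding step of Eqs. 4–9 concrete, the following plain-Python sketch performs one step: each of the five heads turns the hidden state into a multinomial distribution, one value is sampled per head, and the one-hot encodings are concatenated into the next input si. The logits dictionary, head sizes, and function names here are illustrative stand-ins; in the actual model each head fc is a learned MLP applied to hi.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax turning raw head scores into a distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, size):
    v = [0] * size
    v[index] = 1
    return v

def sample_edge_tuple(head_logits):
    """One generation step (Eqs. 4-9): sample each of the five components
    (tu, tv, Lu, Le, Lv) independently from its own multinomial head and
    concatenate the one-hot encodings into s_i (Eq. 9).
    `head_logits` maps a component name to the raw scores of its head;
    the MLPs f_c(h_i) that would produce these scores are elided here."""
    s_i = []
    sampled = {}
    for c in ("tu", "tv", "Lu", "Le", "Lv"):
        theta = softmax(head_logits[c])                            # theta_c = f_c(h_i)
        idx = random.choices(range(len(theta)), weights=theta)[0]  # c ~M theta_c
        sampled[c] = idx
        s_i.extend(one_hot(idx, len(theta)))                       # Eq. 9: concat
    return sampled, s_i
```

Note how the independence assumption of Eq. 2 shows up directly: each component is drawn from its own distribution, with no conditioning between heads within a step; dependence across steps enters only through hi.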
3.2.3 Graph Generation. Given the learned state transition function ftrans, embedding function femb, and output functions ftu, ftv, fLu, fLe, fLv, Alg. 1 provides the pseudocode for the graph generation process. In Alg. 1, S stores the generated sequence and is initialized along with other variables (lines 1–3). ŝi ∈ {0, 1}^k, which represents the i-th edge tuple formed for sequence S, is component-wise sampled from the multinomial distributions learned by the output functions (loop at line 8) and finally appended to the sequence S (line 13). This process of forming a new edge and appending it to sequence S is repeated until any of the five components is sampled as an EOS. Note that since each of the five components of ŝi, i.e., ci, c ∈ {tu, tv, Lu, Le, Lv}, is a one-hot encoding and can be of a different size, their EOS tokens could be of different sizes, in which case each component is compared against its respective EOS to indicate the stopping condition.

Algorithm 1: Graph generation algorithm
Input: LSTM-based state transition function ftrans, embedding function femb, five output functions ftu, ftv, fLu, fLe, fLv, empty graph state h0, start token SOS, and end token EOS
Output: Graph G
1  S ← ∅
2  ŝ0 ← SOS
3  i ← 0
4  repeat
5      i ← i + 1
6      hi ← ftrans(hi−1, femb(ŝi−1))
7      ŝi ← ∅
8      for c ∈ {tu, tv, Lu, Le, Lv} do
9          θc ← fc(hi)
10         ci ∼M θc    // sample ci from the multinomial distribution parameterized by θc
11         ŝi ← concat(ŝi, ci)
12     end
13     S ← S || ⟨ŝi⟩    // append ŝi to sequence S
14 until ∃ ci, ci = EOS
15 return F^{-1}(S)    // F^{-1} is the mapping from a minimum DFS code to a graph

3.2.4 Graph Modeling. We now discuss how the required inference functions (Eqs. 3–9) are learned from a set of training graphs. Given an input graph dataset G, Alg. 2 first converts the graph dataset G to a sequence dataset S of minimum DFS codes (line 1). We initialize the unlearned neural functions with suitable weights (line 2); further details about these functions are given in Sec. 3.2.5. For each edge si in each sequence S ∈ S, a concatenated component-wise one-hot encoding of the real edge si is fed to the embedding function femb, which compresses it, and its compressed form, along with the previous graph state, is passed to the transition function ftrans. ftrans generates the new graph state hi (line 8). This new state hi is passed to all five output functions, which provide the corresponding probability distributions over the components of the new edge tuple (line 11). We concatenate these probability distributions to form s̃i ∈ R^k (the same size as si), which is representative of a new edge (line 12). Finally, this component-wise concatenated probability distribution s̃i is matched (component-wise) against the real tuple from the DFS code being learned. This implies that we use the ground truth rather than the model's own prediction during successive time steps in training. In contrast, GraphGen uses its own predictions during graph generation time. The accuracy of the model is optimized using the cumulative binary cross-entropy (BCE) loss function. Here si[c] represents component c ∈ {tu, tv, Lu, Le, Lv}, and log is taken elementwise on a vector.

BCE(s̃i, si) = − Σ_c [ si[c]^T log s̃i[c] + (1 − si[c])^T log(1 − s̃i[c]) ]    (10)

Algorithm 2: Graph modeling algorithm
Input: Dataset of graphs G = {G1, · · · , Gn}
Output: Learned functions ftrans, ftu, ftv, fLu, fLe, fLv, and embedding function femb
1  S = {S = F(G) | ∀G ∈ G}
2  Initialize ftu, ftv, fLu, fLe, fLv, femb
3  repeat    // 1 epoch
4      for ∀S = [s1, · · · , sm] ∈ S do
5          s0 ← SOS; initialize h0
6          loss ← 0
7          for i from 1 to m + 1 do    // sm+1 for EOS tokens
8              hi ← ftrans(hi−1, femb(si−1))
               // s̃i will contain the component-wise probability distribution vectors of ŝi
9              s̃i ← ∅
10             for c ∈ {tu, tv, Lu, Le, Lv} do
11                 θc ← fc(hi)
12                 s̃i ← concat(s̃i, θc)
13             end
               // Note: si consists of 5 components (see Eq. 10)
14             loss ← loss + BCE(s̃i, si)
15         end
16         Back-propagate loss and update weights
17     end
18 until stopping criterion    // typically when validation loss is minimized

3.2.5 Parameters. The state transition function ftrans in our architecture is a stacked LSTM cell; each of the five output functions ftu, ftv, fLu, fLe, fLv is a fully connected, independent multi-layer perceptron (MLP) with dropout; and femb is a simple linear embedding function. The dimensions of the one-hot encodings of tu, tv, Lu, Le, Lv are estimated from the training set. Specifically, the largest values for each element are computed in the first pass while computing the minimum DFS codes of the training graphs. Table 2 shows the dimensions of the one-hot encodings for each component of si. Note that each variable has one size extra over its intuitive size to incorporate the EOS token.

Table 2: Variables and their sizes
Variable (one-hot vector)    Dimension
tu, tv                       max_{G(V,E)∈G} |V| + 1
Lu, Lv                       |V| + 1 (number of distinct node labels)
Le                           |E| + 1 (number of distinct edge labels)

Since we set the dimensions of tu and tv to the largest training graph size, in terms of the number of nodes, the largest generated graph cannot exceed the largest training graph. Similarly, setting the dimensions of Lu, Lv, and Le to the number of unique node and edge labels in the training set means that our model is bound to produce only labels that have been observed at least once. Since the size of each component of si is fixed, the size of si, which is formed by concatenating these components, is also fixed, and therefore satisfies the prerequisite for LSTMs to work, i.e., each input must be of the same size. We employ weight sharing among all time steps in the LSTM to make the model scalable.

3.2.6 Complexity Analysis of Modeling and Inference. Consistent with the complexity analysis of existing models (recall Table 1), we only analyze the complexity of the forward and backward propagations. The optimization step to learn weights is not included.
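For concreteness, the per-tuple loss of Eq. 10 can be sketched in plain Python as follows. The dictionary-of-lists representation of s̃i and si is an illustrative assumption, not the paper's implementation: each component maps to the predicted probability vector and the ground-truth one-hot vector of the same length.

```python
import math

def bce_edge_tuple(s_tilde, s_real):
    """Cumulative binary cross-entropy of Eq. 10, summed over the five
    components of an edge tuple. `s_tilde[c]` is the predicted probability
    vector for component c, `s_real[c]` the ground-truth one-hot vector."""
    loss = 0.0
    for c in ("tu", "tv", "Lu", "Le", "Lv"):
        pred, real = s_tilde[c], s_real[c]
        for p, y in zip(pred, real):
            # -( s_i[c]^T log s~_i[c] + (1 - s_i[c])^T log(1 - s~_i[c]) )
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss
```

In Alg. 2 this quantity is accumulated over all m + 1 positions of a sequence (line 14) before the weights are updated once per sequence.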
The length of the sequence representation for a graph G(V, E) is |E|. All operations in the LSTM model consume O(1) time per edge tuple, and hence the modeling complexity is O(|E|). The generation algorithm likewise consumes O(1) time per edge tuple, and for a generated graph with edge set E, the complexity is O(|E|).

4 Experiments

In this section, we benchmark GraphGen on real graph datasets and establish that:
• Quality: GraphGen, on average, is significantly better than the state-of-the-art techniques of GraphRNN[40] and DeepGMG[19].
• Scalability: GraphGen, on average, is 4 times faster than GraphRNN, which is currently the fastest generative model that works for labeled graphs. In addition, GraphGen scales better with graph size, both in quality and efficiency.
The code and datasets used for our empirical evaluation can be downloaded from https://github.com/idea-iitd/graphgen.

4.1 Experimental setup

All experiments are performed on a machine running a dual Intel Xeon Gold 6142 processor with 16 physical cores each, one Nvidia 1080 Ti GPU card with 12GB GPU memory, and 384 GB RAM, with the Ubuntu 16.04 operating system.

4.1.1 Datasets. Table 3 lists the various datasets used for our empirical evaluation. The semantics of the datasets are as follows.

Table 3: Summary of the datasets. ✗ means labels are absent.
#  Name                   Domain    No. of graphs  |V|       |E|       #Node labels  #Edge labels
1  NCI-H23 (Lung)[23]     Chemical  24k            [6, 50]   [6, 57]   11            3
2  Yeast[23]              Chemical  47k            [5, 50]   [5, 57]   11            3
3  MOLT-4 (Leukemia)[23]  Chemical  33k            [6, 111]  [6, 116]  11            3
4  MCF-7 (Breast)[23]     Chemical  23k            [6, 111]  [6, 116]  11            3
5  ZINC[14]               Chemical  3.2M           [3, 42]   [2, 49]   9             4
6  Enzymes[4]             Protein   575            [2, 125]  [2, 149]  3             ✗
7  Citeseer[33]           Citation  1              3312      4460      6             ✗
8  Cora[33]               Citation  1              2708      5278      6             ✗

• Chemical compounds: The first five datasets are chemical compounds. We convert them to labeled graphs where nodes represent atoms, edges represent bonds, node labels denote the atom type, and edge labels encode the bond order such as single bond, double bond, etc.
• Citation graphs: Cora and Citeseer are citation networks; nodes correspond to publications, and an edge represents one paper citing the other. Node labels represent the publication area.
• Enzymes: This dataset contains protein tertiary structures representing 600 enzymes from the BRENDA enzyme database[32]. Nodes in a graph (protein) represent secondary structure elements, and two nodes are connected if the corresponding elements are interacting. The node labels indicate the type of secondary structure, which is either helices, turns, or sheets. This is an interesting dataset since it contains only three label types. We utilize this dataset to showcase the impact of graph invariants. Specifically, we supplement the node labels with node degrees. For example, if a node has label "A" and degree 5, we alter the label to "5, A".

4.1.2 Baselines. We compare GraphGen with DeepGMG[19]¹ and GraphRNN[40]. For GraphRNN, we use the original code released by the authors. Since this code does not support labels, we extend the model to incorporate node and edge labels based on the discussion provided in the paper[40].

4.1.3 Parameters and Training sets. For both GraphRNN and DeepGMG, we use the parameters recommended in the respective papers. Estimation of M in GraphRNN and the evaluation of DFS codes² from the graph database in GraphGen are done in parallel using 48 threads.

For GraphGen, we use 4 layers of LSTM cells for ftrans with a hidden state dimension of 256, and the dimension of femb is set to 92. Hidden layers of size 512 are used in the MLPs for ftu, ftv, fLu, fLe, fLv. We use the Adam optimizer with a batch size of 32 for training. We use a dropout of p = 0.2 in the MLP and LSTM layers. Furthermore, we use gradient clipping to remove exploding gradients and an L2 regularizer to avoid over-fitting.

To evaluate the performance on any dataset, we split it into three parts: the train set, validation set, and test set. Unless specifically mentioned, the default split among training, validation, and test is 80%, 10%, and 10% of the graphs respectively. We stop the training when the validation loss is minimized or less than 0.05% change in validation loss is observed over an extended number of epochs. Note that both Citeseer and Cora each consist of a single graph. To form the training set in these datasets, we sample subgraphs by performing random walks with restarts from multiple nodes. More specifically, to sample a subgraph, we choose a node with probability proportional to its degree. Next, we initiate a random walk with restarts, with restart probability 0.15. The random walk stops after 150 iterations. Any edge that is sampled at least once during the random walk with restarts is part of the sampled subgraph. This process is then repeated X times to form a dataset of X graphs. The sizes of the sampled subgraphs range over 1 ≤ |V| ≤ 102 and 1 ≤ |E| ≤ 121 in Citeseer, and 9 ≤ |V| ≤ 111 and 20 ≤ |E| ≤ 124 in Cora.

4.1.4 Metrics. The metrics used in the experiments can be classified into the following categories:
• Structural metrics: We use the metrics used by GraphRNN[40] for structural evaluation: (1) node degree distribution, (2) clustering coefficient distribution of nodes, and (3) orbit count distribution, which measures the number of all orbits with 4 nodes. Orbits capture higher-level motifs that are shared between generated and test graphs. The closer the distributions between the generated and test graphs, the better the quality. In addition, we also compare the graph size of the generated graphs with the test graphs in terms of the (4) average number of nodes and (5) average number of edges.
• Label accounting metrics: We compare the distributions of (1) node labels, (2) edge labels, and (3) the joint distribution of node labels and degrees in generated and test graphs.
• Graph similarity: To capture the similarity of generated graphs in a more holistic manner that considers both structure and labels, we use the Neighbourhood Sub-graph Pairwise Distance Kernel (NSPDK)[6]. NSPDK measures the distance between two graphs by matching pairs of subgraphs with different radii and distances. The lower the NSPDK distance, the better the performance. This quality metric is arguably the most important since it captures the global similarity instead of local individual properties.

¹ We modified the DeepGMG implementation for unlabeled graphs provided by Deep Graph Library (DGL).
² We adapted the minimum DFS code implementation from kaviniitm.
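The subgraph-sampling procedure described in Sec. 4.1.3 (degree-proportional seed choice, restart probability 0.15, a 150-iteration walk, keeping every traversed edge) can be sketched as below. The adjacency-list representation and the function name are our own illustrative choices, and the sketch restarts at the original seed node, which is one common reading of "random walk with restarts".

```python
import random

def sample_subgraph(adj, restart_prob=0.15, iterations=150):
    """Sketch of the training-set sampler used for Citeseer/Cora: pick a
    seed node with probability proportional to its degree, run a random
    walk with restarts, and keep every edge traversed at least once.
    `adj` maps each node to the list of its neighbours."""
    nodes = list(adj)
    degrees = [len(adj[u]) for u in nodes]
    seed = random.choices(nodes, weights=degrees)[0]  # degree-proportional seed
    current = seed
    edges = set()
    for _ in range(iterations):
        if random.random() < restart_prob or not adj[current]:
            current = seed                            # restart at the seed node
            continue
        nxt = random.choice(adj[current])
        edges.add((min(current, nxt), max(current, nxt)))
        current = nxt
    return edges
```

Repeating this X times yields the X training subgraphs described above; the induced node set is simply the set of endpoints of the returned edges.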
Table 4: Summary of the performance of GraphGen, GraphRNN, and DeepGMG across 11 different metrics measuring various aspects of quality, robustness, and efficiency on 6 datasets. The best performance achieved under each metric for a particular dataset is highlighted in bold font. Any value less than 10^-3 is reported as ≈0. The quality achieved by DeepGMG on Enzymes, Cora, and Citeseer is not reported since the loss function for DeepGMG goes to NaN and hence fails to scale.

Columns: Model | Degree | Clustering | Orbit | NSPDK | Avg # Nodes (Gen/Gold) | Avg # Edges (Gen/Gold) | Node Label | Edge Label | Joint Node Label & Degree | Novelty | Uniqueness | Training Time | # Epochs

Lung:
  GraphGen  0.009  ≈0     ≈0     0.035  35.20/35.88  36.66/37.65  0.001  ≈0     0.205  ≈100%  ≈100%  6h      550
  GraphRNN  0.103  0.301  0.043  0.325  6.32/35.88   6.35/37.65   0.193  0.005  0.836  86%    45%    1d 1h   1900
  DeepGMG   0.123  0.001  0.026  0.260  11.04/35.88  10.28/37.65  0.083  0.002  0.842  98%    98%    23h     20

Yeast:
  GraphGen  0.006  ≈0     ≈0     0.026  35.58/32.11  36.78/33.22  0.001  ≈0     0.093  97%    99%    6h      250
  GraphRNN  0.512  0.153  0.026  0.597  26.58/32.11  27.01/33.22  0.310  0.002  0.997  93%    90%    21h     630
  DeepGMG   0.056  0.002  0.008  0.239  34.91/32.11  35.08/33.22  0.115  ≈0     0.967  90%    89%    2d 3h   18

Mixed: Lung + Leukemia + Yeast + Breast:
  GraphGen  0.005  ≈0     ≈0     0.023  37.87/37.6   39.24/39.14  0.001  ≈0     0.140  97%    99%    11h     350
  GraphRNN  0.241  ≈0     0.039  0.331  8.19/37.6    7.2/39.14    0.102  0.010  0.879  62%    52%    23h     400
  DeepGMG   0.074  0.002  0.002  0.221  24.35/37.6   24.11/39.14  0.092  0.002  0.912  83%    83%    1d 23h  11

Enzymes:
  GraphGen  0.243  0.198  0.016  0.051  32.44/32.95  52.83/64.15  0.005  -      0.249  98%    99%    3h      4000
  GraphRNN  0.090  0.151  0.038  0.067  11.87/32.95  23.52/64.15  0.048  -      0.312  99%    97%    15h     20900

Citeseer:
  GraphGen  0.089  0.083  0.100  0.020  35.99/48.56  41.81/59.14  0.024  -      0.032  83%    95%    4h      400
  GraphRNN  1.321  0.665  1.006  0.052  31.42/48.56  77.13/59.14  0.063  -      0.035  62%    100%   12h     1450

Cora:
  GraphGen  0.061  0.117  0.089  0.012  48.66/58.39  55.82/68.61   0.017  -     0.013  91%    98%    4h      400
  GraphRNN  1.125  1.002  0.427  0.093  54.01/58.39  226.46/68.61  0.085  -     0.015  70%    93%    9h      400
• Redundancy checks: Consider a generative model that generates the exact same graphs that it saw in the training set. Although this model would perform very well on the above metrics, such a model is practically useless. Ideally, we would want the generated graphs to be diverse and similar, but not identical. To quantify this aspect, we check (1) Novelty, which measures the percentage of generated graphs that are not subgraphs of the training graphs and vice versa. Note that identical graphs are subgraph isomorphic to each other. In other words, novelty checks if the model has learned to generalize to unseen graphs. We also compute (2) Uniqueness, which captures the diversity in generated graphs. To compute Uniqueness, we first remove the generated graphs that are subgraph isomorphic to some other generated graphs. The percentage of graphs remaining after this operation is defined as Uniqueness. For example, if the model generates 100 graphs, all of which are identical, the uniqueness is 1/100 = 1%.

To compute the distance between two distributions, like GraphRNN[40], we use Maximum Mean Discrepancy (MMD)[11]. To quantify quality using a particular metric, we generate 2560 graphs and compare them with a random sample of 2560 test graphs. On all datasets except Enzymes, we report the average of 10 runs of computing the metric, comparing 256 graphs in a single run. Since the Enzymes dataset contains only 575 graphs, we sample 64 graphs randomly from the test set 40 times and report the average.

4.2 Quality

Table 4 presents the quality achieved by all three benchmarked techniques across 10 different metrics on 6 datasets. We highlight the key observations that emerge from these experiments. Note that some of the generated graphs may not adhere to the structural properties assumed in § 2, i.e., no self-loops, multiple edges, or disconnected components, so we prune all self-edges and take the maximum connected component of each generated graph.

Graph and sub-graph level similarity: On the NSPDK metric, which is the most important due to capturing the global similarity of generated graphs with test graphs, GraphGen is significantly better across all datasets. This same trend also transfers to orbit count distributions. Orbit count captures the presence of motifs also seen in the test set. These two metrics combined clearly establish that GraphGen models graphs better than GraphRNN and DeepGMG.

Node-level metrics: Among the other metrics, GraphGen remains the dominant performer. To be more precise, GraphGen is consistently the best in both node and edge label preservation, as well as graph size in terms of the number of edges. It even performs the best on the joint node label and degree metric, indicating its superiority in capturing structural and semantic information together. The performance of GraphGen is comparatively less impressive on the Enzymes dataset, where GraphRNN marginally outperforms it on the degree and clustering coefficient metrics. Nonetheless, even on Enzymes, GraphGen is better on six of the eight metrics. This result also highlights the need to not rely only on node-level metrics. Specifically, although GraphRNN models the node degrees and clustering coefficients well, it generates graphs that are much smaller in size, and hence the other metrics, including NSPDK, suffer.

Novelty and Uniqueness: Across all datasets, GraphGen has a uniqueness of at least 98%, which means it does not generate the same graph multiple times. In contrast, GraphRNN has a significantly lower uniqueness on the chemical compound datasets. On Lung, only 45% of the generated graphs are unique. On several of the datasets, GraphRNN also has low novelty, indicating that it regenerates graphs (or subgraphs) it saw during training. For example, on Cora, Citeseer, and the mixed chemical dataset, at least 30% of the generated graphs are regenerated from the training set. While we cannot pinpoint the reason behind this performance, training on random graph sequence representations could be a factor. More specifically, even though the generative model may generate new sequences, they may correspond to the same graph. While this is possible in GraphGen as well, the likelihood is much lower since it is trained on DFS codes, which have a one-to-one mapping with graphs.

Analysis of GraphRNN and DeepGMG: Between GraphRNN and DeepGMG, DeepGMG generally performs better on most metrics. However, DeepGMG is extremely slow and fails to scale to larger networks due to its O(|V||E|²) complexity. GraphRNN's major weakness is in learning graph sizes. As visible in Table 4, GraphRNN consistently generates graphs that are much smaller than the test graphs. We also observe that this issue
Figure 5: Visual comparison of the generated graphs with real graphs from Cora. In this figure, the colors of the nodes denote
their labels.
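The Uniqueness metric described in Sec. 4.1.4 can be sketched as follows. Note a simplification: the paper removes generated graphs that are subgraph isomorphic to some other generated graph, whereas this sketch only removes exact duplicates under a caller-supplied canonical form (e.g., a minimum DFS code), which is a cheap stand-in for the isomorphism test.

```python
def uniqueness(generated, canonical):
    """Simplified Uniqueness: percentage of generated graphs remaining
    after deduplication. `canonical` maps a graph to a hashable canonical
    representation; graphs with equal canonical forms are duplicates."""
    distinct = {canonical(g) for g in generated}
    return 100.0 * len(distinct) / len(generated)
```

With 100 identical generated graphs this returns 1.0, matching the "1/100 = 1%" example in the metric's definition.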
fails to model large graphs. GraphRNN runs out of GPU memory (12GB) in the largest bucket size on Cora.

As visible, there is a clear drop in quality (increase in NSPDK) as the graph sizes grow. This result is expected since larger graphs involve larger output spaces, and the larger the output space, the more complex the modeling task. Furthermore, both GraphGen and GraphRNN convert graph modeling into sequence modeling. It is well known from the literature that auto-regressive neural models struggle to learn long-term dependencies in sequences. When the sequence lengths are large, this limitation impacts the quality. Nonetheless, GraphGen has a more gradual decrease in quality compared to GraphRNN and DeepGMG.

Next, we analyze the impact of graph size on the training time. Theoretically, we expect the training time to grow, and this pattern is reflected in Figs. 7(g) and 7(i). Note that the y-axis is in log scale in these plots. Consistent with previous results, GraphGen is faster. In addition to the overall running time, we also show the average time taken by GraphGen in computing the minimum DFS code. Recall that the worst-case complexity of this task is factorial in the number of nodes in the graph. However, we earlier argued that in practice, the running times are much smaller. This experiment adds credence to this claim by showing that DFS code generation consumes only a small fraction of the training time of 10 hours. GraphRNN has a similar pre-processing component as well (GraphRNN-M), where it generates a large number of BFS sequences to estimate a hyper-parameter. We find that this pre-processing time, although of polynomial-time complexity, is slower than the minimum DFS code computation on chemical compounds. Overall, the results of this experiment support the proposed strategy of feeding canonical labels to the neural model instead of random sequence representations.

4.4 Impact of Invariants

Minimum DFS code generation relies on the node and edge labels to prune the search space and identify the lexicographically smallest code quickly. When graphs are unlabeled or have very few labels, this process may become more expensive. In such situations, labels based on vertex invariants allow us to augment the existing labels. To showcase this aspect, we use the Enzymes dataset, which has only 3 node labels and no edge labels. We study the training time and the impact on quality based on the vertex invariants used to label the nodes. We study four combinations: (1) existing node labels only, (2) node label + node degree, (3) node label + clustering coefficient (CC), and (4) label + degree + CC. Fig. 6 presents the results. As visible, there is a steady decrease in the DFS code generation time
is a small portion of the total running time. For example, in the as more invariants are used. No such trend, however, is visible in
largest size bucket of chemical compounds, DFS code generation the total training time as additional features may both increase or
of all 10, 000 graphs take 30 seconds out of the total training time decrease the time taken by the optimizer to find the validation loss
of 4 hours. In Cora, on the largest bucket size of 500 edges, the minima. From the quality perspective, both NSPDK and Orbit show
DFS construction component consumes 50 minutes out of the total improvement (reduction) through the use of invariants. Between
degree and CC, degree provides slightly improved results. Overall,
this experiment shows that invariants are useful in both improving
the quality and reducing the cost of minimum DFS code generation.
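The invariant combinations above can be sketched in a few lines of stdlib Python. The function names, the string encoding of the combined label, and the bucketing of the continuous clustering coefficient into coarse bins are all illustrative assumptions, not GraphGen's actual implementation:

```python
def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))

def augment_labels(adj, labels, use_degree=True, use_cc=True, cc_bins=4):
    """Combine each node's raw label with vertex invariants.

    The richer the combined label, the faster the minimum-DFS-code search
    can prune candidate orderings, since fewer nodes are indistinguishable.
    """
    out = {}
    for v in adj:
        parts = [str(labels[v])]
        if use_degree:
            parts.append(f"d{len(adj[v])}")
        if use_cc:
            bucket = min(int(clustering_coefficient(adj, v) * cc_bins), cc_bins - 1)
            parts.append(f"c{bucket}")
        out[v] = "|".join(parts)
    return out

# Toy example: a triangle plus a pendant node, all sharing one raw label "C".
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
labels = {v: "C" for v in adj}
print(augment_labels(adj, labels))
```

Even though all four nodes carry the same raw label, the augmented labels separate them into three classes, which is exactly the effect the Enzymes experiment exploits.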
Figure 7: Impact of training set size on (a-b) efficiency and (c-e) quality. Impact of graph size on (f, h) quality and (g, i) training time; for the chemical panels, graphs are bucketed by edge count (|E| < 25, 25 ≤ |E| < 50, 50 ≤ |E| < 117). Panels (a), (c), and (d) are on ZINC; (b) and (e) on Cora; (f) and (g) on all chemical compounds; (h) and (i) on Cora. The y-axes of (g) and (i) are in log scale.
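As noted at the start of the evaluation, generated graphs are pruned of self-loops and reduced to their maximum connected component before any metric is computed. A minimal sketch of that post-processing, assuming an undirected edge-list representation (the function name and representation are illustrative):

```python
from collections import defaultdict, deque

def prune(edges):
    """Drop self-loops, then keep only the largest connected component."""
    edges = [(u, v) for u, v in edges if u != v]      # remove self-loops
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for s in adj:                                     # BFS over each component
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    q.append(y)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return [(u, v) for u, v in edges if u in best and v in best]

# The self-loop at node 0 and the smaller component {3, 4} are pruned away.
print(prune([(0, 0), (0, 1), (1, 2), (3, 4)]))  # → [(0, 1), (1, 2)]
```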
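The uniqueness results in this evaluation hinge on each graph having exactly one representation. Computing true minimum DFS codes (gSpan-style) is involved, but the canonicalization idea can be sketched by brute force: take the lexicographically smallest encoding over all node orderings. This is O(n!) and feasible only for toy graphs, yet it exhibits the key property that isomorphic labeled graphs receive identical codes. The encoding below is a simplified stand-in, not the paper's DFS-code format:

```python
from itertools import permutations

def canonical_code(nodes, labels, edges):
    """Lexicographically smallest encoding over all node orderings.

    Serves the same purpose as a minimum DFS code: isomorphic labeled
    graphs map to the identical code, so duplicates are detectable by
    simple equality. Brute force, for illustration on toy graphs only.
    """
    best = None
    for perm in permutations(nodes):
        pos = {v: i for i, v in enumerate(perm)}
        code = (
            tuple(labels[v] for v in perm),
            tuple(sorted((min(pos[u], pos[v]), max(pos[u], pos[v]), lbl)
                         for u, v, lbl in edges)),
        )
        if best is None or code < best:
            best = code
    return best

# Two isomorphic labeled triangles with different node ids and edge orderings.
g1 = canonical_code([0, 1, 2], {0: "C", 1: "O", 2: "C"},
                    [(0, 1, "s"), (1, 2, "d"), (0, 2, "s")])
g2 = canonical_code(["a", "b", "c"], {"a": "O", "b": "C", "c": "C"},
                    [("b", "a", "s"), ("a", "c", "d"), ("c", "b", "s")])
print(g1 == g2)  # → True
```

Because every graph collapses to a single code, uniqueness of a generated set can be measured as the fraction of distinct codes, with no risk of counting two representations of the same graph as different.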
huge increase in NSPDK distance, which means inferior quality. An even more significant result is observed in Uniqueness, which is close to 0 for SentenceVAE; this means the same structure is repeatedly generated by SentenceVAE. In DFS codes, every tuple is unique due to the timestamps, and hence sentence modeling techniques that rely on the repetition of words and phrases fail to perform well. While this result does not say that LSTMs are the only choice, it does indicate the need for architectures that are custom-made for DFS codes.
5 Concluding Insights
In this paper, we studied the problem of learning generative models for labeled graphs in a domain-agnostic and scalable manner. There are two key takeaways from the conducted study. First, existing techniques model graphs either through their adjacency matrices or through sequence representations. In both cases, the mapping from a graph to its representation is one-to-many. Existing models feed multiple representations of the same graph to the model and rely on the model to overcome this information redundancy. In contrast, we construct canonical labels of graphs in the form of minimum DFS codes and thereby reduce the learning complexity. Canonical label construction has non-polynomial time complexity, which could be a factor behind this approach not being adequately explored. Our study shows that although the worst-case complexity is factorial in the number of nodes, by exploiting node/edge labels and vertex invariants, DFS-code construction is a small component of the overall training time (Figs. 7(g) and 7(i)). Consequently, the time complexity of graph canonization is not a practical concern, and feeding more precise graph representations to the training model is a better approach in terms of both quality and scalability.
The second takeaway is the importance of quality metrics. Evaluating graph quality is a complicated task due to the multi-faceted nature of labeled structures. It is, therefore, important to deploy enough metrics to cover all aspects of graphs, such as local node-level properties, structural properties in the form of motifs and global similarity, graph sizes, node and edge labels, and redundancy in the generated graphs. A learned model is effective only if it performs well across all of these metrics. As shown in this study, GraphGen satisfies this criterion.
The conducted study does not solve all problems related to graph modeling. Several graphs today are annotated with feature vectors; examples include property graphs and knowledge graphs. Can we extend label modeling to feature vectors? Moreover, all existing learning-based techniques, including GraphGen, cannot scale to graphs containing millions of nodes. Can we overcome this limitation? We plan to explore these questions further by extending the platform provided by GraphGen.
Acknowledgments
The authors thank Lovish Madaan for helpful discussions and comments on the paper. The authors also thank the IIT Delhi HPC facility for the computational resources.