
DELATOR: Money Laundering Detection via Multi-Task Learning on Large Transaction Graphs

Henrique S. Assumpção – Universidade Federal de Minas Gerais – henriquesoares@dcc.ufmg.br
Fabrício Souza – Universidade Federal de Minas Gerais – fabricio.souza@dcc.ufmg.br
Leandro Lacerda Campos – InterMind (Inter Bank), Universidade Federal de Minas Gerais – leandro.campos@bancointer.com.br
Vinícius T. de Castro Pires – InterMind (Inter Bank), Universidade Federal de Minas Gerais – vinicius.pires@bancointer.com.br
Paulo M. Laurentys de Almeida – InterMind (Inter Bank) – paulo.laurentys@bancointer.com.br
Fabricio Murai – Worcester Polytechnic Institute, Universidade Federal de Minas Gerais – fmurai@wpi.edu

arXiv:2205.10293v2 [cs.LG] 25 Oct 2022

Abstract—Money laundering has become one of the most relevant criminal activities in modern societies, as it causes massive financial losses for governments, banks and other institutions. Detecting such activities is among the top priorities when it comes to financial analysis, but current approaches are often costly and labor intensive, partly due to the sheer amount of data to be analyzed. Hence, there is a growing need for automatic anti-money laundering systems to assist experts. In this work, we propose DELATOR, a novel framework for detecting money laundering activities based on graph neural networks that learn from large-scale temporal graphs. DELATOR provides an effective and efficient method for learning from heavily imbalanced graph data, by adapting concepts from the GraphSMOTE framework and incorporating elements of multi-task learning to obtain rich node embeddings for node classification. DELATOR outperforms all considered baselines, including an off-the-shelf solution from Amazon AWS, by 23% with respect to AUC-ROC. We also conducted real experiments that led to the discovery of 7 new suspicious cases among the 50 analyzed ones, which have been reported to the authorities.

Index Terms—Money Laundering Detection, Graph Neural Networks, Multi-Task Learning, Large Transaction Graphs

I. INTRODUCTION

Money laundering is a general term referring to a myriad of different criminal activities that seek to legitimize illicit financial gains or to conceal the origin of money transferred to illegal entities, e.g., terror financing and drug trafficking [1]. Although varied in nature, money laundering activities can be categorized into three well-defined stages that represent the full process of legitimizing a series of illegal economic activities: (i) placement, (ii) layering, and (iii) integration [2]. Placement refers to the initial process of introducing the illicitly gained assets into the legitimate financial system, by removing obvious traces of illegality. Layering is one of the most important and complex stages of money laundering. It consists of a series of transactions with no purpose other than to conceal the illicit origin of the money. Finally, integration aims to integrate these assets into the legal economy.

Most western countries implement a rule-based regulation that all banks must follow in order to comply with legislation [1], [3]. Such rules vary in nature, but can often describe numerical thresholds for flagging suspicious transactions, e.g., receiving or sending an amount that surpasses n times the client's income. Traditionally, flagged clients are manually analyzed by an anti-money laundering (AML) team of experts, who try to determine the existence of extraordinary circumstances that could justify the threshold violation (e.g., a client sold a property they owned). Based on that analysis and the current legislation, the team decides whether to report the client to the authorities or not. However, this manual inspection of flagged clients is an expensive and time-consuming process. Not surprisingly, in Brazil, banks have 45 days to file reports regarding the transactions performed in a given month.

Such rule-based AML systems have three main limitations. First, it is hard to prioritize flagged clients, except by ranking them according to the number of rule violations¹. A good prioritization scheme could significantly expedite the reporting of criminals and could provide side information to help experts close cases unlikely to be reported. Second, there is potentially a large number of clients "just below" thresholds for being flagged that will never be analyzed. Despite not violating any given rule, altogether, the set of transactions made by a client could provide evidence of a crime. Potential losses for a false negative are immense, since the financial institution could face significant legal backlash. Last, the fact that some common transaction patterns (e.g., Smurfing [4]) occur in money laundering activities underlines the limitations of rules that apply to clients individually, namely, they do not account for structural information in the network formed by all transactions.

Therefore, there is a notable demand for new data-driven tools for money laundering detection that escape the rule-based approaches. In this paper, we will focus on building a robust data-driven framework for detecting money laundering activity in the layering stage by adapting different machine learning models and novel approaches for graph-structured data.

¹ Although a violation severity score could be computed for each rule, it is not obvious how to combine them into a single score.
We propose a multi-task learning framework named DELATOR for detecting money laundering in dynamic financial transaction networks. For each graph snapshot, it generates client (node) embeddings based both on an unsupervised link prediction task and on a supervised edge regression task where the labels are the transaction values. Next, the embeddings of each client are concatenated over the snapshots. DELATOR then generates synthetic suspicious clients in the embedding space to oversample the minority class. Last, it trains a supervised classifier for predicting the probability of a given client being involved in money laundering.

Thus, we summarize our contributions as follows:
• We propose a scalable and effective AML framework that experimentally outperforms state-of-the-art methods and has a relatively simple implementation.
• DELATOR is among the first methods for detecting money laundering activity on large transaction graphs that are temporal, heterogeneous, and have extremely imbalanced target classes. By leveraging different aspects of the network, we are able to simplify the data modeling and create a method that performs well on a real-world large dataset.
• We evaluate our framework in a real setting by performing data-driven inference on millions of bank accounts, which ultimately led to the detection of multiple suspicious clients, that were then reported to the authorities for potential involvement in money laundering activities. To the best of our knowledge, this is the first system of its kind to be employed successfully in the context of Brazilian banks.

II. RELATED WORK

A. Graph Neural Networks

Our work is mostly related to graph representation learning, and more specifically, learning node representations on a latent space based on unsupervised or semi-supervised tasks. In recent years, models based on Graph Neural Networks (GNNs) [5] have taken a prominent role in the context of learning latent representations for graph-structured data. GraphSAGE [6] was among the first unsupervised learning frameworks that could yield an inductive model for graphs. Its loss function is based on random walks, and it encourages nearby nodes to have similar embeddings, while ensuring that distant nodes have dissimilar embeddings. The Graph Attention architecture [7] has also been successful in many different graph-related tasks, and it utilizes masked self-attention during the message passing mechanism, allowing the network to learn which neighbors to prioritize when aggregating information.

There are also many recent techniques specifically designed for temporal graphs. EvolveGCN [8] combines a Recurrent Neural Network (RNN) with a GNN to account for the temporal aspect of dynamic graphs. The authors propose two alternative approaches for these networks: (i) consider the GNN weight matrix as the hidden state of the dynamical system, or (ii) use the GNN weight matrix as input of the dynamical system. Both methods allow for a time-aware representation on the embedding space.

The aforementioned approaches are designed for homogeneous graphs, but there are also many attempts to generalize GNNs to heterogeneous graphs. The authors who proposed RGCN [9] were among the first to extend the message passing mechanism of the original GNN to heterogeneous graphs, by learning on subgraphs independently created for each relation and then aggregating the information across all relations. More recent approaches like HGT [10] combine elements of the Transformer architecture [11] with GNNs in order to model heterogeneous topological relations that can also evolve through time. These approaches generally focus on training the models directly on a downstream task. One exception is the DGL-KE [12] framework implemented on the Deep Graph Library [13], which provides a general framework for learning from knowledge graphs based on a generalization of the unsupervised link prediction task that yields both node and relation embeddings.

B. Class Imbalance

Class imbalance refers to a significant difference in the number of instances of each class and is a characteristic displayed by many real-world datasets. It tends to bias model predictions towards the majority classes. For this reason, class imbalance has long been a relevant research topic within the machine learning community [14], [15].

There are many strategies for dealing with this problem, some of which seek to adjust class size through over- or undersampling, i.e., making the majority classes smaller and the minority classes bigger. SMOTE [16] is one of the most popular oversampling algorithms, and it addresses the problem by interpolating samples belonging to minority classes with their nearest neighbors in order to create new synthetic samples. However, SMOTE does not take topological information into consideration, and it is dependent on the distance metric chosen to determine the nearest neighbors. GraphSMOTE [17] tries to solve this potential issue by adapting the SMOTE algorithm to better suit graph representation learning applications. GraphSMOTE first extracts node embeddings from the graph by using a single-layer GNN, and then applies SMOTE to each minority class in order to generate synthetic nodes to balance the previously underrepresented classes. GraphSMOTE also creates an edge generator in order to connect these nodes to the network, by training a GNN-based classifier that aims to be good at reconstructing the original adjacency matrix of the network. The framework then proceeds to train a final classifier for node classification on the downstream task.

Other approaches like Pick-and-Choose [18] try to deal with the imbalance problem by creating synthetic edges between nodes in the minority class; however, the authors of [19] show that this technique has a significantly lower performance on highly imbalanced datasets. After an extensive analysis of the available literature, we decided to adapt some of the ideas from GraphSMOTE – which was originally designed for static graphs – to our framework in order to improve the process of learning information about our dynamic network.
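For intuition, the interpolation at the heart of SMOTE, which both SMOTE and GraphSMOTE rely on, can be sketched in a few lines. This is a minimal illustration under the assumption of a plain feature matrix with more than k minority samples; it is not the routine used by any of the cited works.

```python
import numpy as np

def smote_sample(X_min, k=5, seed=0):
    """Create one synthetic minority-class sample by SMOTE-style interpolation.

    X_min: (n, d) array holding only the minority-class feature vectors (n > k).
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(len(X_min))
    # distances from the chosen sample to every other minority sample
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(dist)[1 : k + 1]  # k nearest, excluding the sample itself
    j = rng.choice(neighbors)
    lam = rng.random()                       # interpolation factor in (0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])
```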
C. Money Laundering Detection

Detecting money laundering activity on financial transaction networks can be seen as a particular case of fraud detection on graph-based data. In this more general context, most of the problems are related to imbalanced scenarios, where a small minority of individuals are actually involved in the target illicit activity.

The authors of [20] provide a simple and efficient method for detecting suspicious activity specifically related to money laundering, by employing a series of standard database join operations to detect sub-graphs that follow the Smurf patterns. The authors of [21] provide a GNN-based approach for money laundering detection on graphs, by employing an LSTM network coupled with a graph convolutional network in order to simultaneously model the topological and temporal relationships between nodes on the graph. Finally, the authors of [22] provide a comparative analysis between classical machine learning methods – such as Random Forests and Logistic Regression – and GNN approaches for detecting money laundering activity in a large bitcoin dataset, and find that the GNN models with skip connections have the best performance amongst the considered baselines. However, none of the aforementioned models directly deals with the imbalance problem that is prominent in our data.

III. DATA MODELING

In this section, we explain the model we consider in this work and discuss a potential, but more complex, alternative. The available transaction data consists of a multiset E of monetary transactions with partition E = ⋃_{t=1}^{T} E^t, where each E^t represents all transactions made at timestep t ∈ {1, 2, ..., T}, a set V of users, and a set X = {(x_1, y_1), ..., (x_n, y_n)}, where each x_i ∈ R^d represents a given user's attributes. Each y_i represents the target variable indicating if the user was suspected of being involved in money laundering activity. Each element e = ((v, u), w, c) ∈ E^t represents a transaction from user v to user u, of amount w and type c ∈ {1, 2, ..., m}, executed at timestep t, and there is no restriction on the number of transactions between users. Since the data is dynamic in nature, it is natural to model it as a dynamic graph, more specifically as a discrete-time dynamic graph (DTDG), which is a sequence of static graph snapshots taken at certain time intervals. In our context, each snapshot contains the same set of nodes and node attributes. This data structure allows for two alternative modeling approaches, which will be referred to as A1 and A2.

A1: We can model the data as a homogeneous directed weighted DTDG, i.e., a sequence S_homo = {G^1, G^2, ..., G^T} of graphs, where each G^t ∈ S_homo represents the network at time t. In order to obtain a simple homogeneous graph at each snapshot t, we create a new edge set E^t = {((v, u), w′) | w′ = Σ_{e ∈ E^t : e_1 = (v, u)} w} – where e_i represents the i-th element of the tuple e – that aggregates all edges between each pair of nodes (v, u) by setting w′ as the total amount transacted from v to u. We can then define each snapshot as G^t = (V, E^t, X). This modeling aggregates transactions with different types into a single transaction, thus simplifying the graph structure when compared to approach A2.
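To make approach A1 concrete, the sketch below collapses a multiset of typed transactions into one weighted edge list per snapshot. It is only an illustration: the (t, sender, receiver, amount, type) tuple layout is an assumption made for the example, not the format of the actual dataset.

```python
from collections import defaultdict

def build_homogeneous_snapshots(transactions, num_snapshots):
    """Approach A1: aggregate raw transactions into homogeneous weighted snapshots.

    `transactions` is assumed to be an iterable of (t, sender, receiver, amount,
    tx_type) tuples with t in {1, ..., T}. Transaction types are discarded, and
    parallel edges between the same ordered pair (v, u) within a snapshot are
    summed into a single weight w'.
    """
    totals = [defaultdict(float) for _ in range(num_snapshots)]
    for t, v, u, w, _tx_type in transactions:
        totals[t - 1][(v, u)] += w  # w' = total amount sent from v to u at snapshot t
    # each snapshot G^t becomes an edge list [((v, u), w'), ...]
    return [list(s.items()) for s in totals]
```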
A2: We can model the data as a heterogeneous directed weighted DTDG, i.e., a sequence S_hetero = {G^1, G^2, ..., G^T} of graphs, where each G^t ∈ S_hetero represents the network at time t. In this scenario, each snapshot is called a knowledge graph, and it consists of a set of relations that represent each transaction type. More formally, we can define each snapshot as G^t = {G^t_1, G^t_2, ..., G^t_m}, where each G^t_c = (V, E^t_c, X) is the relational graph for snapshot t and transaction type c, and E^t_c = {e ∈ E^t | e_3 = c}. In contrast to A1, approach A2 provides a more complete description of the network; however, this comes at the cost of being more complex and costly to model computationally.

IV. PROPOSED FRAMEWORK

In this section, we provide a brief overview of DELATOR, and then proceed to detail each step of the framework. DELATOR follows approach A1 for modeling the data, and it is comprised of three main steps:
1) The first step consists of a multi-task learning algorithm to obtain node embeddings for each graph snapshot. We first train a GNN model that aims to optimize the link prediction loss, i.e., a loss function that tends to enforce embedding similarity between connected nodes. Next, we train a second GNN model that aims to optimize the edge regression loss, i.e., a task for predicting the edge weights. The second network also generates node embeddings, which are then concatenated with the embeddings generated by the first GNN and passed through a Multi-Layer Perceptron (MLP) to obtain the final prediction for the edge weight. These networks are trained individually on their respective tasks, and each generates node embeddings for all graph snapshots. We concatenate such embeddings into a single representation per snapshot, and then concatenate the representations across all snapshots in order to obtain a single time-aware description of the nodes in Euclidean space;
2) Next, we create synthetic nodes to oversample the minority class, by applying the SMOTE algorithm directly to the embedding space on the training set in order to obtain new embeddings for the minority class;
3) We finally proceed to train a classifier on the aggregated oversampled data that predicts the probability of a user being suspected of being involved in money laundering.

We highlight the existence of major challenges for modeling the money laundering detection task using current machine learning methods on graphs, both due to the inherently difficult nature of the problem at hand and also due to specific characteristics of our data, namely: (i) the transaction data is dynamic in nature, which significantly increases the complexity of the proposed task; (ii) there are multiple transaction types in the
data, and thus the resulting graph is highly heterogeneous, which significantly increases the complexity of the problem, since there are few methods in the literature suited for dynamic heterogeneous graphs, and even fewer that specifically target money laundering detection; (iii) the target classes are extremely imbalanced, since only a small percentage of users are actually labeled as involved in money laundering activity; and (iv) the data is considerably large in size, thus any proposed method will need to be scalable, in addition to being effective.

Regarding the temporal aspect of the data (challenge (i)), we only have access to a relatively small time interval, and thus methods based on Recurrent Neural Networks have proven to be ineffective. For this reason, we chose to employ a simple concatenation procedure for the embeddings obtained at each snapshot that is able to consider the temporal aspect of our data. In order to address the other challenges, we propose a framework that has three main advantages: (i) by disregarding different edge types and modeling the data as a homogeneous simple graph, we significantly increase the framework's efficiency, since most modern approaches for heterogeneous graphs suffer from significant scaling issues when dealing with a high number of relations, and, as we will later discuss, this simplification does not interfere with the framework's efficacy on our dataset; (ii) we incorporate a strong oversampling technique into the framework, thus relieving the impact of class imbalance on the final classification task; and (iii) the framework is modular in nature, and thus it can be largely modified in order to adapt to different modeling scenarios.

Fig. 1: Overview of DELATOR. Step 1 (node embedding generation): the transaction data is split into graph snapshots, and two GNNs – one for link prediction, one for edge regression – generate embeddings from the snapshots and raw features, which are concatenated into per-node embeddings. Steps 2 (oversampling) and 3 (node classification): the node embeddings are split into train and test sets, SMOTE oversamples the suspicious class on the train set, and a classifier produces the final prediction.

A. Node Embedding Generation

The first part of the framework consists of extracting node embeddings from the network. We adopt a multi-task learning approach and thus subdivide the training procedure into two steps.

1) Link Prediction (LP): First, we train a GNN to obtain node embeddings based on an unsupervised loss for link prediction, i.e., predicting if two nodes are connected in the graph. Our framework allows for any GNN architecture to be used in this step, and we will present an example using the GraphSAGE architecture. Recall the sequence S_homo defined in approach A1. For each graph snapshot G^t ∈ S_homo, the following equation describes the message passing mechanism for an arbitrary layer l out of a total of L layers of GraphSAGE:

h_v^{t,l} = f((W_l · AGG({h_u^{t,l-1}, ∀u ∈ N(v)})) ∥ (B_l · h_v^{t,l-1}))    (1)

In (1), h_v^{t,l} denotes the embedding of node v at layer l and snapshot t, W_l denotes a set of learnable weights, B_l denotes a set of learnable biases, ∥ denotes the concatenation operator along the columns, AGG denotes an aggregation function, N(v) denotes the outgoing neighbors of node v, f denotes an activation function, e.g., ReLU, and · denotes matrix multiplication. We initialize h_v^{t,0} = x_v for all nodes. For simplicity, we will refer to h_v^{t,L} as h_v^t.
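A minimal PyTorch sketch of the layer in (1) is given below. The mean aggregator and the dense adjacency matrix are simplifying assumptions made for readability; they are not claims about the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    """One GraphSAGE-style layer implementing Eq. (1) with a mean AGG."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # W_l, applied to AGG(...)
        self.B = nn.Linear(in_dim, out_dim, bias=True)   # B_l, applied to h_v^{t,l-1}

    def forward(self, h, adj):
        # h: (num_nodes, in_dim) embeddings from layer l-1 of snapshot t
        # adj: (num_nodes, num_nodes) dense adjacency, adj[v, u] = 1 iff u ∈ N(v)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ h) / deg                              # mean over outgoing neighbors
        # the concatenation "∥" doubles the output width; f = ReLU
        return torch.relu(torch.cat([self.W(agg), self.B(h)], dim=1))
```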
We can then define the link prediction loss L_lp^t for a given snapshot t, that we wish to minimize, as follows:

L_pos^t(G^t) = − Σ_{((v,u),w) ∈ E^t} log(σ((h_v^t)^⊤ · h_u^t))    (2)

L_neg^t(G^t) = − Σ_{((v,u),w) ∈ Ẽ^t} log(1 − σ((h_v^t)^⊤ · h_u^t))    (3)

L_lp^t = (L_neg^t(G^t) + L_pos^t(G^t)) / (|E^t| + |Ẽ^t|)    (4)

In (2)-(4), σ is the sigmoid function, L_pos^t(G^t) denotes the log-likelihood of the link prediction for the current snapshot, L_neg^t(G^t) denotes the log-likelihood of the link prediction for a negative sample of the current snapshot, i.e., we define a set Ẽ^t of randomly selected edges, and ⊤ denotes the transpose operator. This loss function is a direct adaptation of the GraphSAGE unsupervised loss, and it encourages the model to create embeddings that are similar for connected nodes while enforcing a distinct representation for disconnected nodes. The GNN model is trained consecutively on S_homo, i.e., at each epoch, we compute the embeddings and scores according to the aforementioned equations, and then perform one optimization step at each snapshot. Empirical testing showed that this method is preferable to optimizing w.r.t. the mean of the losses throughout all snapshots.
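The loss in (2)-(4) translates almost line by line into code. The sketch below assumes edges are given as index pairs and adds a small epsilon inside the logarithms for numerical stability, which is not part of the equations themselves.

```python
import torch

def link_prediction_loss(h, pos_edges, neg_edges, eps=1e-10):
    """Eqs. (2)-(4) for one snapshot.

    h: (num_nodes, d) final-layer embeddings h_v^t.
    pos_edges: (|E^t|, 2) long tensor of connected (v, u) pairs.
    neg_edges: (|E~^t|, 2) long tensor of randomly sampled negative pairs.
    """
    def dot(edges):
        return (h[edges[:, 0]] * h[edges[:, 1]]).sum(dim=1)   # (h_v^t)^T · h_u^t

    l_pos = -torch.log(torch.sigmoid(dot(pos_edges)) + eps).sum()      # Eq. (2)
    l_neg = -torch.log(1 - torch.sigmoid(dot(neg_edges)) + eps).sum()  # Eq. (3)
    return (l_pos + l_neg) / (len(pos_edges) + len(neg_edges))         # Eq. (4)
```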
2) Edge Regression (ER): After training a GNN model for link prediction, we now train a model to perform edge regression, i.e., predict the value of a given transaction between two users. The model consists of a separate GNN that generates embeddings z_v^t for all nodes, and we again note that this step can be performed with any GNN architecture. For an arbitrary edge e = ((v, u), w) ∈ E^t, we obtain the predicted edge weight ŵ(e) as follows:

ŵ(e) = MLP(h_v^t ∥ z_v^t ∥ h_u^t ∥ z_u^t)    (5)

In (5), the predicted edge weight is obtained by concatenating the link prediction and edge regression embeddings of both v and u, and then passing them through an MLP.
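Under the dimensions reported later in Section V (d_LP = 40, d_ER = 20, MLP layers of sizes (120, 60) and (60, 1)), the predictor of (5) and the smooth L1 objective that Eq. (6) below formalizes can be sketched as follows; treat this as one interpretation of the description, not the authors' exact code.

```python
import torch
import torch.nn as nn

class EdgeRegressor(nn.Module):
    """Eq. (5): predict the edge weight from h_v ∥ z_v ∥ h_u ∥ z_u."""

    def __init__(self, d_lp=40, d_er=20):
        super().__init__()
        in_dim = 2 * (d_lp + d_er)  # 120 for the reported embedding sizes
        self.mlp = nn.Sequential(nn.Linear(in_dim, 60), nn.ReLU(), nn.Linear(60, 1))

    def forward(self, h, z, edges):
        v, u = edges[:, 0], edges[:, 1]
        x = torch.cat([h[v], z[v], h[u], z[u]], dim=1)
        return self.mlp(x).squeeze(-1)  # predicted weights ŵ(e)

# Eq. (6) is the standard smooth L1 loss; PyTorch's beta plays the role of γ.
edge_loss = nn.SmoothL1Loss(beta=3.0)
```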
We can then define the edge regression loss L_er^t for a given snapshot t, that we wish to minimize, as follows:

l_er^t(w, ŵ) = { 0.5 · (ŵ − w)² / γ,   if |ŵ − w| < γ
              { |ŵ − w| − 0.5 · γ,    otherwise

L_er^t = (1 / |E^t|) · Σ_{e=((v,u),w) ∈ E^t} l_er^t(w, ŵ(e))    (6)

In (6), γ denotes a threshold hyperparameter, and l_er^t(w, ŵ) denotes the partial loss for a given edge. Equation (6) represents the smooth L1 loss function of the edge regression task. We also train this model on S_homo in a consecutive fashion. After training the models, we obtain a single time-aware embedding for a given node v, denoted by η(v), as follows:

η(v) = ∥_{t=1}^{T} (h_v^t ∥ z_v^t)    (7)

Equation (7) describes the procedure of concatenating the link prediction and edge regression embeddings for a given node to obtain a single embedding for each snapshot, and then concatenating these embeddings along all snapshots. We can then build a single embedding matrix H ∈ R^{|V| × (T · (d_LP + d_ER))}, where d_LP (d_ER) denotes the embedding size for the link prediction (edge regression) task, by concatenating the final embedding η for all nodes along the rows, i.e., each row of H is the corresponding vector η for that given node.

B. Oversampling

We now seek to generate synthetic nodes to oversample the minority class. In this work we decided to adopt the SMOTE algorithm; however, our framework is compatible with any oversampling method that can use the generated embeddings. The intuition behind SMOTE is to interpolate samples from the minority class with their nearest same-class neighbors in the embedding space, in order to generate new embeddings that are similar to the ones found in this minority class. Given a node v ∈ V belonging to a certain minority class, e.g., individuals involved in money laundering, consider the final embedding η(v). SMOTE generates a new sample v′ according to the following equation:

η(v′) = (1 − λ) · η(v) + λ · η(nn(v))    (8)

In (8), nn(v) denotes the nearest neighbor of v from the same class, according to the Euclidean distance in the embedding space, and λ ∼ U(0, 1) denotes a uniform random variable. Since the new node is generated via interpolation of nodes from the same class, we can label the new synthetic node v′ as belonging to the same class as v; thus, we can artificially oversample the minority class to overcome the imbalanced scenario. We highlight that this oversampling step is only performed on the training data. We will refer to the oversampled aggregated embeddings matrix as H′.

C. Node Classification

The last step of DELATOR consists of a supervised classification task for detecting users suspected of being involved in money laundering activity. After oversampling the minority class, we employ LightGBM [23], a state-of-the-art gradient boosting algorithm for supervised learning. The LightGBM model takes H′ as input and outputs a probability classification Ŷ of the users, i.e., a real value between 0 and 1 indicating the probability of a given user being suspicious. We highlight that our framework supports any supervised classifier at this step. Algorithm 1 describes the entire process of training DELATOR and predicting users involved in money laundering.

Algorithm 1 DELATOR Framework
Input: S_homo = {G^1, ..., G^T}, number of epochs N
Output: Predicted probabilities Ŷ
  Randomly initialize the weights for the GNNs of step 1, i.e., the GNN for link prediction and the GNN for edge regression
  Create train-test-validation splits based on the edges of each graph snapshot for the aforementioned tasks
  for n ∈ {1, 2, ..., N} do
    for G^t ∈ S_homo do
      Obtain the node embeddings for the current snapshot via LP according to (1)
      Compute L_lp^t according to (2)-(4)
      Perform one optimization step
    end for
  end for
  for n ∈ {1, 2, ..., N} do
    for G^t ∈ S_homo do
      Obtain the node embeddings for the current snapshot via ER according to (1)
      Compute the predicted edge weights for all edges according to (5)
      Compute L_er^t according to (6)
      Perform one optimization step
    end for
  end for
  Compute the final embeddings according to (7), and then concatenate them to obtain H
  Create train-test-validation splits for H
  Oversample the minority class on the train split according to (8) to obtain H′
  Train a supervised classifier with H′ to obtain the final prediction Ŷ

V. EXPERIMENTS

In this section, we detail the experiments conducted to evaluate the performance of DELATOR. Specifically, we are interested in the following questions:
1) Can DELATOR's simpler approach (A1) perform well on complex, heterogeneous relational networks?
2) How well does DELATOR perform when compared to other state-of-the-art techniques for node embeddings and node classification?
3) How sensitive is DELATOR w.r.t. its hyperparameters?
4) How important is each component of the framework to its performance?
5) Can DELATOR help the AML team detect suspicious users in a real-world experiment?

A. Experimental Setup

1) Dataset: We use the dataset from Inter Bank, one of the biggest and fastest growing digital banks in Brazil. The dataset is composed of transactions made to/from clients of the bank. Due to privacy reasons, we cannot provide a fully detailed description of the dataset's attributes; however, we provide a quantitative description of the data.

The dataset is comprised of a sample of 20 million users, 200 transaction types, and 110 million transactions that span over a month. We split this data into 5 distinct snapshots, each approximately representing a week. As mentioned before, all snapshots contain the same set of users; however, the number of transactions varies between them. Many of the users in the network are not clients from Inter – referred to as non-Inter clients –, e.g., clients from other banks or financial institutions. Each Inter client has 16 tabular attributes, e.g., Personal Income and Address, out of which 13 are categorical, and all numerical features were standardized. Since Inter does not have tabular information on non-Inter clients, we define their attributes as a vector of ones and also add a dummy variable to all user attributes indicating if the user is an Inter client. Empirical testing showed us that, in the case of A2, it was preferable to keep the dummy variable instead of also making the graph heterogeneous w.r.t. its nodes. Classifying such non-Inter clients is beyond the scope of this work, and thus we only keep them during the node embedding generation step (Step 1 in Section IV) due to their topological value for understanding the overall structure of the network, after which they are no longer considered.

The minority class ('suspicious') of the dataset represents if the Inter client was suspected of being involved in money laundering, and thus reported to the authorities by Inter's AML team. The majority class ('non-suspicious') represents Inter clients that: (i) did not trigger any of the rules mentioned in Section I, or (ii) triggered a number of rules and, after manual analysis by the AML team, were discarded as a suspect. The task in question is to detect Inter clients that were suspected to be involved in money laundering activity, and the imbalance ratio between the classes – the ratio between the number of Inter clients in the minority class and the majority class – is 2 · 10⁻⁵. The ratio between non-Inter and Inter clients is not disclosed due to privacy reasons.

A natural issue with this type of modeling is that it biases the classes toward the aforementioned fail-prone rule-based system, since there could be – and in fact there are, as we will show in future sections – clients that are involved in money laundering and haven't triggered any of the rules, and thus would be mislabeled as 'non-suspicious'. A more general approach could then be to create a new class representing unlabeled clients, and then proceed to employ unsupervised methods for labeling. However, there are two main issues with this approach: (i) automatically labeling a great number of real-world clients as suspected of money laundering is a sensitive topic, since some countries – including Brazil – demand a manual specialized analysis of the client in order for the authorities to take action; moreover, many of the state-of-the-art models for unsupervised labeling do not have good explainability, making it even harder to argue for their usage with banks in a real-world task; and (ii) the overwhelming majority of clients would be unlabeled, far exceeding the capability of current unsupervised labeling methods.

We highlight that, even though DELATOR does predict probabilities for all Inter clients in the network, it does not perform the automatic labeling of such clients. This step is only made after a manual analysis of the most likely clients by the AML team. More formally, DELATOR is designed to be a CAAT (computer-assisted audit technology), i.e., it is meant to assist the AML team, and not to automate their analysis.

2) Baselines: We now detail the state-of-the-art methods used as baselines. First, we consider GNN baseline methods for extracting node embeddings from graphs, often referred to as embedding generators:
• DGL-KE: A state-of-the-art framework developed by Amazon AWS for learning representations on knowledge graphs, i.e., heterogeneous graphs with multiple edge and node types.
• GraphSAGE (SAGE): A state-of-the-art GNN architecture that generalizes the original model for learning on graphs, by allowing for multiple different aggregator functions in the message passing step.
• Graph Attention (GAT): A state-of-the-art GNN architecture that introduces an attention mechanism to the message passing algorithm.
• Graph Convolution (GCN): The first proposed GNN architecture based on message passing on graphs.

As we will discuss in future sections, the SMOTE step proved to be essential for GNN-based models to perform well, since the large class imbalance makes learning on the available data extremely hard, which also made training models directly on the final supervised classification task ineffective. Thus, we decided to train all of the GNN baseline models according to a similar training procedure as the one described in Section IV. Specifically, we train the models on an unsupervised link prediction task to obtain node embeddings, then we oversample the minority class on the embedding space for the training set, and last, we train a supervised classifier on the final classification task. We employ approach A1 for modeling the data to train the SAGE, GAT, and GCN methods, and approach A2 for modeling the data to train DGL-KE. We note that DGL-KE utilizes a modified version of the link prediction task, described in detail in [12].
We also experimented with different architectures of embedding generators for DELATOR in the first step of the framework, i.e., obtaining embeddings according to both the link prediction and edge regression tasks. We denote these variants by including their names as a sub-index, e.g., DELATOR_SAGE denotes the version utilizing GraphSAGE layers on both tasks. We highlight that comparing our framework with the SAGE, GAT, and GCN baseline methods will make explicit how the edge regression task impacts the overall performance, since such methods only execute the link prediction task, and comparing DELATOR with DGL-KE will allow us to understand which of the proposed modeling approaches (A1 and A2) is best suited for our data.

As a baseline that does not utilize the network's transaction information in its modeling, we consider CatBoost [24], a state-of-the-art gradient boosting algorithm for supervised learning that allows for better performance on highly categorical feature spaces. This model learns directly on the client attributes X in order to obtain the final prediction. Since the raw feature space is mostly categorical, the CatBoost model does not oversample the minority class. There are versions of SMOTE made specifically for dealing with mixed feature spaces, such as SMOTE-NC; however, these versions suffer from severe scaling issues, and in our scenario employing them proved to be intractable. Other simpler techniques for dealing with class imbalance, such as utilizing loss functions that increase the penalty for misclassifications in the minority class, also proved to be ineffective and/or inefficient for CatBoost.

3) Evaluation Metrics: For evaluating the effectiveness of the proposed framework for detecting suspicious clients in transaction networks, we propose the following metrics:
• AUC-ROC: area under the ROC curve (true positive rate vs. false positive rate).
• AUPR: area under the PR curve (precision vs. recall).
• F1-Fraud (F1-F): harmonic mean between the precision and recall w.r.t. the label that represents suspicious activity.
• Maximum F1-Fraud (Max. F1-F): maximum value of the F1-Fraud w.r.t. all possible thresholds.
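These quantities can be computed with scikit-learn as sketched below; approximating AUPR by average precision and using a 0.5 threshold for the single-threshold F1-Fraud are assumptions we make for the example.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_recall_curve, roc_auc_score)

def evaluate(y_true, y_score, threshold=0.5):
    """AUC-ROC, AUPR, F1-Fraud at a fixed threshold, and Max. F1-Fraud."""
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    with np.errstate(divide="ignore", invalid="ignore"):
        f1_curve = 2 * prec * rec / (prec + rec)  # F1 at every threshold
    return {
        "auc_roc": roc_auc_score(y_true, y_score),
        "aupr": average_precision_score(y_true, y_score),
        "f1_fraud": f1_score(y_true, (y_score >= threshold).astype(int)),
        "max_f1_fraud": float(np.nanmax(f1_curve)),
    }
```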
4) Computing Specifications: All of the experiments were conducted on the AWS SageMaker platform for cloud-based computing. The experiments for the main framework and baselines using approach A1 were conducted on a single ml.m5.24xlarge machine, with 96 Intel® Xeon® vCPUs and 384 GB of memory. The experiments using DGL-KE were conducted on a single ml.g4dn.16xlarge machine, with 64 Intel® Xeon® vCPUs, 256 GB of memory, and 1 NVIDIA® V100 Tensor Core GPU with 32 GB of memory.

5) Implementation Details: All GNN models, except for DGL-KE, were trained using the Adam [25] optimization algorithm, with a learning rate of 0.01. These models were trained until the validation loss increased for 5 consecutive epochs (taking, on average, 40 epochs). A parameter tuning procedure showed that two GNN layers were preferable for all of the aforementioned models, as well as showing that one attention head was the best option for the methods that use the GAT model. We also utilized an embedding size of 60 per snapshot for DELATOR, with d_LP = 40 and d_ER = 20 – and since we have 5 snapshots, this resulted in a final embedding size of 300 –, and a negative sample size of 2 (per positive sample). The remaining baseline GNN models utilized an embedding size of 40 per snapshot. For all versions of DELATOR, the MLP used in the edge regression step had two hidden layers with dimensions (120, 60) and (60, 1), respectively, with the ReLU function as a nonlinearity for the first layer. The γ parameter of (6) was set to 3.

DGL-KE provides four different models for representation learning, and we found that TransE_l2 yielded the best results. The model utilizes the edge weights, but by design does not utilize the node features. A parameter tuning procedure showed that higher embedding sizes did not contribute to higher performance, hence an embedding size of 4 per snapshot was utilized (tested 4, 8, 16, 32, 64), resulting in a final embedding size of 20. We employed a negative sample size of 128, and the remaining hyperparameters were initialized with their default values.

Because an extensive grid search over the hyperparameter space revealed little performance improvement, the LightGBM model used in DELATOR's final stage was initialized with its default hyperparameters, as per the official Python implementation. The same procedure was applied to CatBoost's hyperparameter space, and the final version utilized 400 estimators, a maximum tree depth of 10 and a learning rate of 1, using "log-loss" as objective, with the remaining hyperparameters being set to their default values.

For the initial embedding generation task (Step 1 in Section IV), we utilized a 75-10-15 train-test-validation split on the edges of each graph snapshot. For the final supervised task (Step 3 in Section IV), we utilized an 80-20 train-test split on the Inter clients and then performed a 5-fold cross validation on the training set. The random number generator seed for this task was also randomized 5 times in order to obtain 25 different training configurations to check the statistical significance of our experiments. The numerical values described in Tables I and III represent the mean results, together with their respective standard deviations, w.r.t. the proposed metrics.

B. Results

1) Quantitative Results: To answer the first two questions posed in this section, we compared the results of DELATOR with the proposed baselines for the final task of detecting suspicious clients.

Table I shows the overall evaluation results for the final prediction task. We can observe that, for all metrics, the versions of DELATOR have outperformed the proposed baselines, especially when compared to DGL-KE. The results w.r.t. AUC-ROC show that, on average, DELATOR_SAGE provides a more consistent ranking of clients when compared to the other baselines. The AUPR score is also higher for DELATOR_GCN, showing that it is easier to select a classification threshold such that the framework's precision dominates its recall. In addition, we note that the F1-Fraud and Maximum F1-Fraud metrics are also higher for both DELATOR_SAGE and DELATOR_GCN. This is a strong indicator that the framework has the best performance for correctly classifying clients that are suspected of being involved in money laundering.
TABLE I: Evaluation results for the final prediction task of detecting suspicious clients.

Method                 AUC-ROC ↑        AUPR ↑           F1-F ↑           Max. F1-F ↑
CatBoost               0.653 ± 0.075    0.007 ± 0.015    0 ± 0            0.009 ± 0.003
DGL-KE                 0.735 ± 0.028    0 ± 0            0 ± 0            0.001 ± 0
SAGE                   0.846 ± 0.039    0.001 ± 0        0.002 ± 0        0.019 ± 0.017
GAT                    0.743 ± 0.043    0.001 ± 0.001    0.001 ± 0.001    0.025 ± 0.019
GCN                    0.876 ± 0.018    0.002 ± 0.002    0.002 ± 0        0.032 ± 0.025
DELATOR_SAGE (Ours)    0.905 ± 0.027    0.001 ± 0        0.005 ± 0.001    0.033 ± 0.010
DELATOR_GCN (Ours)     0.889 ± 0.014    0.011 ± 0.019    0.003 ± 0.001    0.049 ± 0.043
DELATOR_GAT (Ours)     0.882 ± 0.023    0.001 ± 0        0.003 ± 0.001    0.018 ± 0.015

It is worth noting that, although DELATOR_GCN outperforms DELATOR_SAGE w.r.t. AUPR and Maximum F1-Fraud, the results for the latter have a considerably smaller standard deviation, i.e., DELATOR's GraphSAGE variant exhibits more consistent performance than the other variants. As for CatBoost, the model also presents a higher standard deviation for the AUPR metric when compared to DELATOR_SAGE, coupled with lower mean values for the AUC-ROC, F1-Fraud and Maximum F1-Fraud metrics when compared to all other methods, meaning that CatBoost struggles to learn consistently on our dataset.

Fig. 2 provides the ROC and PR curves for the best versions of DELATOR, GCN, and DGL-KE (based on validation sets). We observe that the ROC curve for DELATOR increases in height at a much faster pace than the baselines, signalling that it is easier to choose a classification threshold for DELATOR such that the true positive rate is high while the false positive rate is reasonably low, i.e., DELATOR is better at detecting suspicious clients. The best version of DELATOR also has a considerably higher AUC-ROC when compared to DGL-KE and GCN. The PR curve for DELATOR shows a much more desirable behavior, as it does not decrease as fast as DGL-KE's or GCN's.

Fig. 2: ROC and PR curves for the best versions of DELATOR, GCN and DGL-KE, based on validation sets, for the final task of detecting suspicious clients. Reported values: DELATOR_SAGE (AUC = 0.946, AUPR = 0.00169), GCN (AUC = 0.856, AUPR = 0.00047), DGL-KE (AUC = 0.757, AUPR = 8e-05).

Table II shows the confusion matrix for the best versions of DELATOR, GCN, and DGL-KE (based on validation sets). All methods show a similar number of true negatives and false negatives; however, we can see that the number of false positives – 'non-suspicious' clients classified as 'suspicious' – is one order of magnitude greater for DGL-KE when compared to our framework. Moreover, the number of true positives for DELATOR is approximately three times greater than DGL-KE's, and two times greater than GCN's. These results again show how DELATOR outperforms the baselines in detecting suspicious clients.

TABLE II: Confusion matrix for the best versions of DELATOR, GCN, and DGL-KE, based on validation sets. The best version of DELATOR utilized SAGE layers. Rows give the real class; columns give the predicted class.

Method    Real class        Pred. non-suspicious    Pred. suspicious
DELATOR   non-suspicious    99.806%                 0.191%
          suspicious        18 · 10⁻⁴ %             4.7 · 10⁻⁴ %
GCN       non-suspicious    99.699%                 0.298%
          suspicious        20 · 10⁻⁴ %             2.35 · 10⁻⁴ %
DGL-KE    non-suspicious    97.977%                 2.021%
          suspicious        22 · 10⁻⁴ %             1.6 · 10⁻⁴ %

2) Parameter Sensitivity: In order to answer the third question posed in this section, we now explore the impact of DELATOR's hyperparameters on the model's performance. More specifically, we analyze the impact of the final embedding size for both GNNs involved in the framework. The GNN trained using the link prediction task has an embedding size of d_LP, and the GNN trained using the edge regression task has an embedding size of d_ER. Fig. 3 shows the impact of the embedding size on the AUC-ROC and AUPR metrics. For the AUC-ROC metric, we observe that a larger embedding size significantly benefits the model's ability to classify clients. Nevertheless, these variations are not drastic, thus showing that the model is robust to parameter variations w.r.t. the metric. For the AUPR metric, the intermediate values for the embedding size show lower performance, while the extreme values show similarly good results.

3) Ablation Study: In order to answer the fourth question posed in this section, we perform an ablation study of DELATOR's components detailed in Table III. We note that, since all baseline GNN models were trained following the steps described in Section IV, the results in Table I w.r.t. the GAT, GCN, and SAGE models when compared to the versions of DELATOR already provide the necessary information for underlining the importance of the edge regression task for the final prediction.
TABLE III: Ablation study of DELATOR w.r.t. the final prediction task of detecting suspicious clients.

Embedding Generator    Uses LP?    Uses ER?    Uses SMOTE?    AUC-ROC ↑        AUPR ↑
SAGE                   yes         yes         yes            0.905 ± 0.027    0.001 ± 0
SAGE                   yes         no          yes            0.846 ± 0.039    0.001 ± 0
SAGE                   yes         yes         no             0.454 ± 0.062    0 ± 0
SAGE                   yes         no          no             0.290 ± 0.120    0 ± 0
GCN                    yes         yes         yes            0.889 ± 0.014    0.011 ± 0.019
GCN                    yes         no          yes            0.876 ± 0.018    0.002 ± 0.002
GCN                    yes         yes         no             0.471 ± 0.062    0 ± 0
GCN                    yes         no          no             0.497 ± 0.001    0 ± 0
GAT                    yes         yes         yes            0.882 ± 0.023    0.001 ± 0
GAT                    yes         no          yes            0.743 ± 0.043    0.001 ± 0.001
GAT                    yes         yes         no             0.502 ± 0.045    0 ± 0
GAT                    yes         no          no             0.440 ± 0.036    0.003 ± 0

Fig. 3: Impact of embedding size (d_LP, d_ER) on AUC-ROC and AUPR for the final prediction task of detecting suspicious clients, on a fixed seed, for DELATOR_SAGE. Tested sizes: (16, 8), (24, 16), (32, 24), (40, 20).

Fig. 4: ROC and PR curves for different versions of DELATOR, on a fixed seed, for the final task of detecting suspicious clients. Reported values: DELATOR_SAGE (AUC = 0.946, AUPR = 0.00169), DELATOR_GAT (AUC = 0.896, AUPR = 0.00063), DELATOR_GCN (AUC = 0.884, AUPR = 0.0005).

Table III shows that the SMOTE step is crucial for the performance of the framework in the final task, since all versions of the framework trained without this step resulted in models with AUC-ROC score close to or below 0.5, i.e., the models were at most as good as random guessing. We can also see that, for all models, the edge regression task does provide a considerable improvement to the final classification scores when compared to the baseline models that were only trained on the link prediction task. Again, this shows evidence of the superiority of our approach when compared to the considered baselines.

Fig. 4 shows the ROC and PR curves for each variant of DELATOR. We conclude that DELATOR's GraphSAGE variant exhibits the best performance, since both curves display the best behavior for choosing a good classification threshold.

4) Real-World Experiment: In order to answer the final question posed in this section, we also conducted a real-world experiment with Inter to verify the performance of DELATOR. As previously discussed, most Brazilian banks implement a flawed rule-based system for detecting clients suspected of being involved in money laundering. One of the most significant issues with this system is the fact that the rules themselves are not defined according to a data-driven procedure, meaning that many criminals who do not fit the human-defined rules are not considered for analysis. Since most banks in Brazil work in a similar manner, they lack a good way of ranking which clients labelled as 'non-suspicious' should be analyzed in order to detect possible failures in the rules.

This is where DELATOR can provide significant help for the AML team. Due to the framework's low training and prediction time, it can effectively produce a list of the most likely suspicious clients. In this context, we perform an experiment consisting of ranking all clients that did not trigger any alarms during a given time interval. We consider the top 50 clients with the highest probability of being classified as 'suspicious' according to DELATOR. These clients were manually analyzed by the AML team, which resulted in the discovery of 7 new clients involved in suspicious activity, that were then reported to the authorities. This experiment shows that DELATOR can create a new effective and efficient way of detecting suspicious clients that would otherwise be mislabelled as 'non-suspicious' by the rule-based system, as well as providing a way of detecting possible rule failures, which can then be updated by the analysts, thus vastly enhancing the capabilities of the AML team.

VI. PRIVACY AND ETHICAL CONSIDERATIONS

In this section, we provide a few comments regarding the ethical details of our work. We note that the proposed framework does not violate ethical aspects of the relationship between Inter and its clients, since Brazilian banks have the legal obligation to investigate financial crimes.
This work also does not violate the privacy of clients, since all of the data was provided in an anonymized form, according to the recommendations of Inter's data privacy team. Moreover, the entire data processing procedure was performed on the Amazon AWS SageMaker platform from computers that were provided by Inter and that implement a myriad of security techniques, e.g., using an exclusive VPN service.

VII. CONCLUSION AND FUTURE WORK

In this work, we introduce a novel framework for detecting money laundering activity in large transaction graphs. By leveraging different information available in the network, we are able to construct a multi-task learning procedure that generates time-aware node representations in Euclidean space, which then allows us to perform a supervised learning procedure in order to predict the probability of a given client being involved in money laundering. This procedure is robust to highly imbalanced target classes, allowing for an effective method for learning on dynamic networks that suffer from large class imbalance. We then perform a series of experiments to evaluate the performance of our model, and experimentally demonstrate that DELATOR outperforms all considered baselines w.r.t. the evaluation metrics. We also perform a real-world experiment with the help of Inter's AML team to evaluate DELATOR's performance in a practical scenario, and show that the model was able to detect multiple clients involved in suspicious activity. Overall, we have shown that DELATOR provides an efficient and effective way of detecting money laundering activity and that the framework can provide significant help to banks and financial institutions.

In the future, we plan to test DELATOR on other datasets related to money laundering detection, such as the datasets from the AMLSim project [26]. We also intend to test the generalization capabilities of the framework in other tasks related to financial activities, such as financial fraud detection.

REFERENCES

[1] FATF, "International standards on combating money laundering and the financing of terrorism & proliferation," 2012-2021.
[2] E. Ebikake, "Money laundering: An assessment of soft law as a technique for repressive and preventive anti-money laundering control," Journal of Money Laundering Control, vol. 19, no. 4, pp. 346-375, 2016.
[3] BACEN, "Circular nº 3.978, de 23 de janeiro de 2020," 2020. [Online]. Available: https://www.in.gov.br/web/dou/-/circular-n-3.978-de-23-de-janeiro-de-2020-239631175
[4] S. N. Welling, "Smurfs, money laundering and the federal criminal law: The crime of structuring transactions," vol. 41, pp. 287-343, 1989.
[5] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," CoRR, vol. abs/1609.02907, 2016. [Online]. Available: http://arxiv.org/abs/1609.02907
[6] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," CoRR, vol. abs/1706.02216, 2017. [Online]. Available: http://arxiv.org/abs/1706.02216
[7] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," 2018.
[8] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, and C. E. Leiserson, "EvolveGCN: Evolving graph convolutional networks for dynamic graphs," CoRR, vol. abs/1902.10191, 2019. [Online]. Available: http://arxiv.org/abs/1902.10191
[9] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," 2017. [Online]. Available: https://arxiv.org/abs/1703.06103
[10] Z. Hu, Y. Dong, K. Wang, and Y. Sun, "Heterogeneous graph transformer," CoRR, vol. abs/2003.01332, 2020. [Online]. Available: https://arxiv.org/abs/2003.01332
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," CoRR, vol. abs/1706.03762, 2017. [Online]. Available: http://arxiv.org/abs/1706.03762
[12] D. Zheng, X. Song, C. Ma, Z. Tan, Z. Ye, J. Dong, H. Xiong, Z. Zhang, and G. Karypis, "DGL-KE: Training knowledge graph embeddings at scale," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 739-748.
[13] M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma, Z. Huang, Q. Guo, H. Zhang, H. Lin, J. Zhao, J. Li, A. J. Smola, and Z. Zhang, "Deep graph library: Towards efficient and scalable deep learning on graphs," CoRR, vol. abs/1909.01315, 2019. [Online]. Available: http://arxiv.org/abs/1909.01315
[14] R. Longadge and S. Dongre, "Class imbalance problem in data mining review," CoRR, vol. abs/1305.1707, 2013. [Online]. Available: http://arxiv.org/abs/1305.1707
[15] N. Japkowicz and S. Stephen, "The class imbalance problem: A systematic study," Intelligent Data Analysis, pp. 429-449, 2002.
[16] K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," CoRR, vol. abs/1106.1813, 2011. [Online]. Available: http://arxiv.org/abs/1106.1813
[17] T. Zhao, X. Zhang, and S. Wang, "GraphSMOTE: Imbalanced node classification on graphs with graph neural networks," in Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 833-841.
[18] Y. Liu, X. Ao, Z. Qin, J. Chi, H. Yang, and Q. He, "Pick and choose: A GNN-based imbalanced learning approach for fraud detection," in WWW, 2021.
[19] R. Pereira and F. Murai, "Quão efetivas são redes neurais baseadas em grafos na detecção de fraude para dados em rede?" in BraSNAM, 2021, pp. 205-210. [Online]. Available: https://sol.sbc.org.br/index.php/brasnam/article/view/16141
[20] M. Starnini, C. E. Tsourakakis, M. Zamanipour, A. Panisson, W. Allasia, M. Fornasiero, L. L. Puma, V. Ricci, S. Ronchiadin, A. Ugrinoska, M. Varetto, and D. Moncalvo, "Smurf-based anti-money laundering in time-evolving transaction networks," in Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. Springer International Publishing, 2021, pp. 171-186. [Online]. Available: https://doi.org/10.1007%2F978-3-030-86514-6_11
[21] I. Alarab and S. Prakoonwit, "Graph-based LSTM for anti-money laundering: Experimenting temporal graph convolutional network with bitcoin data," Neural Processing Letters, Jun 2022. [Online]. Available: https://doi.org/10.1007/s11063-022-10904-8
[22] M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, and C. E. Leiserson, "Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics," CoRR, vol. abs/1908.02591, 2019. [Online]. Available: http://arxiv.org/abs/1908.02591
[23] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
[24] A. V. Dorogush, A. Gulin, G. Gusev, N. Kazeev, L. O. Prokhorenkova, and A. Vorobev, "Fighting biases with dynamic boosting," CoRR, vol. abs/1706.09516, 2017. [Online]. Available: http://arxiv.org/abs/1706.09516
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014. Published as a conference paper at the 3rd International Conference on Learning Representations, San Diego, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[26] T. Suzumura and H. Kanezashi, "Anti-Money Laundering Datasets: InPlusLab anti-money laundering datasets," http://github.com/IBM/AMLSim/, 2021.
