Professional Documents
Culture Documents
Inspection-L: A Self-Supervised GNN-Based Money Laundering Detection System For Bitcoin
Inspection-L: A Self-Supervised GNN-Based Money Laundering Detection System For Bitcoin
Inspection-L: A Self-Supervised GNN-Based Money Laundering Detection System For Bitcoin
Bitcoin
Abstract
Criminals have become increasingly experienced in using cryptocurrencies, such as Bitcoin, for money laundering. The use of
cryptocurrencies can hide criminal identities and transfer hundreds of millions of dollars of dirty funds through their criminal digital
wallets. However, this is considered a paradox because cryptocurrencies are gold mines for open-source intelligence, allowing law
arXiv:2203.10465v3 [cs.CR] 25 Apr 2022
enforcement agencies to have more power in conducting forensic analyses. This paper proposed Inspection-L, a graph neural
network (GNN) framework based on self-supervised Deep Graph Infomax (DGI) and Graph Isomorphism Network (GIN), with
supervised learning algorithms, namely Random Forest (RF) to detect illicit transactions for AML. To the best of our knowledge,
our proposal is the first to apply self-supervised GNNs to the problem of AML in Bitcoin. The proposed method has been evaluated
on the Elliptic dataset and shows that our approach outperforms the state-of-the-art in terms of key classification metrics, which
demonstrates the potential of self-supervised GNN in cryptocurrency illicit transaction detection.
Keywords: Graph Neural Networks, Machine Learning, Anomaly Detection, Cryptocurrencies
1. Introduction are able to leverage the inherent structure of the graph data by
building relational inductive biases into the deep learning ar-
The advent of the first cryptocurrency - Bitcoin [1] has rev- chitecture. This provides the ability to learn, reason, and gen-
olutionized the conventional financial ecosystem as it enables eralize from the graph data, inspired by the concept of message
low-cost, anonymous, peer-to-peer cash transfers within and propagation [4].
across various borders. Due to their pseudonymity, many cy- The Bitcoin transaction flow data can naturally be repre-
bercriminals, terrorists, and hackers have started to use them for sented in a graph format, where the transactions are mapped
illegal transactions. For example, the WannaCry ransomware to graph nodes, and transaction flows are mapped to the graph
attack accepted using Bitcoin [2] as the payment method due to edges. Both the topological information as well as the informa-
non-traceability. The criminals received about 3.4 million (46.4 tion contained in node features are crucial for detecting illicit
BTC) ransom only four days after WannaCry spread [2]. How- transactions.
ever, it is also a paradox, as cryptocurrencies are gold mines for Specifically, we will investigate the Elliptic dataset [5], a
open-source intelligence as transaction network data is publicly realistic, partially labelled Bitcoin temporal graph-based trans-
available, allowing law enforcement agencies to conduct foren- action dataset consisting of real entities belonging to licit (e.g.,
sic analysis of the currency flows. Therefore, effective detection wallet, miners), illicit entities (e.g. scams, terrorist organiza-
of illicit transactions in Bitcoin transaction graphs is essential tions, ransomware), or unknown categories. The proposed
for preventing illegal transactions. This problem is challenging Inspection-L framework aims to detect illegal transactions based
for law enforcement agencies owing to the untraceable nature on graph representation learning in a self-supervised manner.
of Bitcoin transactions. Current graph machine learning approaches, such as [5], gen-
Graph representation learning has shown great potential for erally apply supervised graph neural network approaches for
detecting money laundering activities using cryptocurrencies. the detection of illicit transactions. However, the main limita-
Thus, this paper proposes Inspection-L, a Graph Neural Net- tion is that it is limited to only K-layer neighbourhoods, as the
work (GNN) framework based on Deep Graph Infomax (DGI) extreme sparsity of the training signal causes a threat of overfit-
[3] and supervised learning algorithms such as Random Forest ting. On the other hand, self-supervised graph neural network
(RF), to detect illicit transactions for Anti Money Laundering algorithms [6][7] allow for every node to have access to struc-
(AML). tural patterns of the entire graph, which can improve the quality
GNNs are tailored to applications with graph-structured data, of global representation and can outperform supervised graph
such as social sciences, chemistry, and telecommunications, and machine learning algorithms.
In this paper, we demonstrate how the self-supervised DGI
Email addresses: w.w.lo@uq.net.au (Wai Weng Lo), algorithm can be integrated with standard machine learning clas-
m.sarhan@uq.net.au (Mohanad Sarhan), siamak.layeghy@uq.net.au sification algorithms, i.e. Random Forest, to build an an ef-
(Siamak Layeghy), marius@itee.uq.edu.au (Marius Portmann)
3
Figure 1: Overview of Elliptic Dataset
• A true graph G with the true nodes and edges that connect
them, as well as the real node features associated with
each node.
• A corrupted graph H in which the nodes and edges have
been changed using a corruption function. The paper [25]
suggests that the corruption function can randomly shuf-
fle each node feature and maintain the same edges as the
true graph G.
4
a sigmoid activation function to compute the whole graph em-
N
beddings s .
1
E(X,A) log D ~hi , ~s +
X h i
L= n
N + M i=1 1 X (L)
s̄ = σ h (4)
n i=1 i
(3)
M i! In Algorithm 1, line 5 to 6 , for the discriminator D, we
~
X h
E(X,A) log 1 − D h j , ~s introduced the logistic sigmoid non-linear function which com-
j=1 pares a node embedding vector ~hi against a real whole graph
embedding s to calculate the score of (~hi , s) being positive or
Overall, the pseudocode and overall procedure of our pro-
negative.
posed algorithm are shown in Algorithm 1 and Fig. 3.
D (hi , s̄) = σ hTi w s̄ (5)
5. PROPOSED METHOD
For constructing Bitcoin transaction graphs from the dataset, hi , s̄ = σ e
D e hTi w s̄ (6)
we constructed 49 different Bitcoin transaction graphs (TGs) We then applied a binary cross-entropy loss objective func-
using time steps so that the nodes can be represented as node tion to perform gradient descent as shown in Algorithm 1, line
transactions and the edges can be represented as flows of Bit- 8. To perform gradient descent, we maximize the score if the
node embedding is a true node embedding ~hi and minimize the
coin transactions. This is a very natural way to represent Bit-
coin transactions.
score if it is a corrupted node embedding ~hi compared to the
DGI Training. For training the proposed model, the input in- global graph summary generated by the read-out function R.
cludes the transaction graphs G with node features (AF), and As a result, we can maximize the mutual information between
the specified number of training epochs K to extract true node patch representations and the whole real graph summary. After
embeddings and corrupted node embeddings. Before this, we the training process, the trained encoder can be used to generate
need to define the corruption function C to generate the cor- new graph embeddings for downstream purposes, in this case,
rupted transaction graphs C(G) for our GIN encoder to extract illegal transaction detection.
the corrupted node embeddings. In this paper, we randomly
n
shuffled the node features among the nodes in real transaction 1 X
(L)
EG log D h(L)
LDGI = − i , s + EḠ log 1 − D h̃i , s (7)
graphs T G to generate the corrupted transaction graphs. This is 2n i=1
done by shuffling the feature matrix in rows X. Overall, instead In our experiments, we used all 49 different Bitcoin trans-
of adding or removing edges from the adjacency matrix such action graphs to train the DGI with the GIN encoder in a self-
e , A), we use corruption function C which shuffle the
that (A supervised manner without using any labels. For each training
node features such that (X̃ , X) but retain the adjacency matrix graph, we train 300 epochs using an Adam optimizer with a
e = A).
(A learning rate of 0.0001 to perform gradient descent to update
For each training epoch, in Line 3 to 4, we use our proposed the model weights and hyperparameters as shown in Algorithm
GIN encoder to extract true node and corrupted node embed- 1, line 9.
dings; we proposed a GIN encoder as shown in Fig. 3 with
2 layers MLP which consists of 128 hidden units with ReLU Supervised Machine Learning Classification. After the DGI train-
activation function with Batch normalization (as shown in Al- ing, we can reuse the encoder to generate node embeddings as
gorithm 2) [30] for extracting the true node and corrupted node shown in Algorithm 1, line 10 to train and test the RF classi-
embeddings. fier as shown in Algorithm 1, line 11. In our experiments, we
The design of MLPs is motivated by the fundamental goal performed 70:30 splitting for 34 different Bitcoin transactions
of a GNN-based model. Ideally, various types of different graph for training and the remaining 15 bitcoin transaction graphs for
patterns should be distinguishable via the graph encoder, which testing. For training and testing the RF classifier, we used 100
means that different graph nodes should be able to map differ- estimators for training and testing.
ent graphs to the embedding space. This implies that it has
the capability to solve the graph isomorphism problem, where 1. Node Embeddings: After the DGI training, we reuse the
non-isomorphic graphs should be mapped to different represen- encoder to generate node embeddings for training and
tations. testing the RF classifier as mentioned above.
We applied a full neighbour sampling technique. We spec- 2. Node Embeddings with LF features: Similar to sce-
ified the size of full two-hop neighbour samples for the GIN nario 1, we also combine local features (i.e, first 94 raw
encoder with Batch normalization , as DGI benefits from em- features) with node embeddings generated by the trained
ploying wider rather than deeper models [3]. encode for training and testing the RF classifier as men-
For the read-out function R, we applied the mean opera- tioned above.
tion on all node embeddings in a real graph and then applied
5
Figure 3: Proposed Method
Algorithm 2: Batch Normalizing Transform for training and testing the RF classifier as mentioned
[30] above.
input : Values of x over a mini-batch: B = {x1...m }
Parameters can be learned: γ, β 5.0.1. Implementation Environments
n o
output: yi = BNγ,β (xi ) This paper was carried out using a 2.3GHz 2-core Intel(R)
µB ← m1 m
P Xeon(R) processor with 12 GB memory and Tesla P100 GPU
1 i=1 xi
2 σB ← m i=1 (xi − µB )2
2 1 Pm on Linux operating system. The proposed approach is devel-
x −µ
3 xi ← qi B
b oped using the Python programming language with several sta-
σ2B +
tistical and visualization packages such as Sckit-learn, Numpy,
4 yi ← γb
xi + β ≡ BNγ,β (xi )
Pandas, PyTorch Geometric, and Matplotlib. Table 1 summa-
rizes the system configuration.
6
6. Performance Evaluation Table 3: Results of binary classification by Inspection-L compared to state-of-
the-art. AF refers to all raw features, LF refers to the local raw features, i.e. the
first 94 raw features, GNE refers to the node embeddings generated by GCN in
For evaluating the performance of the proposed methods, [5] by using labels and DNE refers to the node embeddings computed by DGI
the standard metrics listed in Table 2 are used, where T P, T N, without using labels.
FP and FN represent the number of True Positives, True Neg-
atives, False Positives and False Negatives, respectively. Illicit
Methods Precision Recall F1 AUC
The proposed method is evaluated by using Precision, Re-
Logistic RegrAF [5] 0.404 0.593 0.481 −
call, F1-score and Area under the receiver operating character- Logistic Regr AF + GNE [5] 0.537 0.528 0.533 −
istics (ROC) curve. All the above metrics can be obtained using Logistic Regr LF [5] 0.348 0.668 0.457 −
the confusion matrix (CM). Logistic Regr LF + GNE [5] 0.518 0.571 0.543 −
RandomForest AF [5] 0.956 0.670 0.788 −
In Table 2, True positive (TP) means the total number of
RandomForest AF + GNE [5] 0.971 0.675 0.796 −
true positives, True negative (TN) indicates the total number of RandomForest AF [15] 0.897 0.721 0.800 −
false positives, False positive (FP) is the total number of false RandomForest AF + GNE [15] 0.958 0.715 0.819 −
negatives and False negative (TN) is the total number of true XGB AF [15] 0.921 0.732 0.815 −
XGB AF + GNE [15] 0.986 0.692 0.813 −
negatives. Based on the aforementioned terms, the evaluation
RandomForest LF [5] 0.803 0.611 0.694 −
metrics are calculated as follows. For a balanced test dataset, RandomForest LF + GNE [5] 0.878 0.668 0.759 −
higher accuracy indicates the model is well learned, but for im- MLP AF [5] 0.694 0.617 0.653 −
balance test dataset scenarios, in this case, only considering ac- MLP AF + GNE [5] 0.780 0.617 0.689 −
MLP LF [5] 0.637 0.662 0.649 −
curacy measures may lead to misleading conclusions since it is
MLP LF + GNE [5] 0.681 0.578 0.625 −
strongly biased to favour the licit class. Thus for this case, re- GCN [5] 0.812 0.512 0.628 −
call and F1-score metrics give a more reasonable explanation of GCN [18] 0.899 0.678 0.773 −
the model’s performance. Skip-GCN [5] 0.812 0.623 0.705 −
EvolveGCN [5] 0.850 0.624 0.720 −
Recall (also known as Detection Rate) is the total number
Inspection-L AF + DNE (RF) 0.972 0.721 0.828 0.916
of true positives divided by the total number of true positives Inspection-L LF + DNE (RF) 0.906 0.712 0.797 0.895
and false negatives. If the recall rate is very low, that means the Inspection-L DNE (RF) 0.593 0.032 0.061 0.735
classifier cannot detect illicit transactions.
Precision measures the quality of the correct predictions. It
is the number of true positives divided by the number of true 7. Experimental Results
positives and false positives. If the false positive is very high, it
Table 3 shows the corresponding results of our Inspection-L
will cause low precision. Our goal is to maximize the precision
compared to state-of-the-art in terms of key metrics. As can be
as much as possible.
observed from the table, in regards to illicit F1-Score, Inspection-
F1-score is the trade-off between precision and recall. Math-
L outperforms the best-reported classifiers. We concatenated
ematically, it is the harmonic mean of precision and recall.
the node embeddings generated from DGI with the original raw
The area under the curve (AUC) computes the trade-off be-
features (AF) in this setting. The experiments achieved an F1
tween sensitivity and specificity, plotted based on the trade-off
score and Recall of 0.828 and 0.721, respectively. Using all
between the true positive rate on the y-axis and the false posi-
AF with node embeddings as features, the ML model’s perfor-
tive rate on the x-axis. Our goal is to maximize the AUC score
mance has significantly increased with an AUC of 0.916, com-
as much as possible.
pared to 0.735 when only using node embeddings. The exper-
iments demonstrate that graph information (node embeddings)
Table 1: Implementation environment specification was useful to enhance the transaction representation.
In the second experiment, we concatenated the node embed-
Unit Description
dings generated from DGI with the local features (LF), which
Processor 2.3 GHz 2-core Inter Xeon(R) Processor
can achieve an F1-score and Recall of 0.712 and 0.797, respec-
RAM 12GB
GPU Tesla P100 GPU 16GB
tively. Both of the results are superior to the state-of-the-art
Operating System Linux algorithms.
Packages Sckit-learn, Numpy, Pandas, These results demonstrate that the power of our self-supervised
PyTorch Geometric, and Matplotlib GNN-based can generate an enhanced feature set to improve
the anti-money laundering detection performance. The results
Table 2: Evaluation metrics used in this study show that the accuracy of the model improves with the en-
hanced feature set for full features. The RF classifier is trained
Metric Definition using all raw features and embedding features to improve over-
Detection Rate (Recall) TP all performance. It can not only detect illicit transactions, but
T P+FN
TP
can also achieve a low false alarm rate.
Precision T P+FP Figure 5 shows the confusion matrix of the three different
Recall×Precision
F1-Score 2× Recall+Precision scenarios. As we can see, the RF classifier using only embed-
AUC-Score
R1 TP
d FP ding features classifies a large number of illicit transactions into
0 T P+FN T N+FP
7
Figure 4: Illicit Recall over test timestep
True label
True label
licit transactions, which also have low false alarm rates. Al- and overall outperforms the state-of-the-art ML-based/Graph-
though the classifier trained with embedding features cannot ac- based classifier. The evaluation results of our initial classifier
curately detect illicit transactions, it rarely classifies licit trans- demonstrate the potential of a self-supervised GNN-based ap-
actions as illicit. Therefore, the false alarm rate is very low. proach for illegal transaction detection in cryptocurrencies. We
The RF classifier trained using both raw features and embed- hope to inspire others to work on the important challenge of
ding features has the advantage of achieving a high detection using graph machine learning to perform financial forensics
rate and a low false alarm rate. through this research, which is very lacking in current research.
Figure 4 shows the recall measure of three different models In the future, we plan to integrate with unsupervised anomaly
across various testing timesteps. Interestingly, all three mod- detection algorithms to detect illegal transactions in an unsu-
els cannot detect new illicit transactions after dark market shut- pervised manner.
down, and it was occurring at time step 43 [5]. Therefore, de-
veloping robust methods to detect illicit transactions without
References
being affected by emerging events has become a major chal-
lenge to address. [1] S. Nakamoto, Bitcoin: A peer-to-peer electronic cash system, Tech. rep.,
Manubot (2019).
[2] N. Kshetri, J. Voas, Do crypto-currencies fuel ransomware?, IT profes-
8. Conclusions and Future Work sional 19 (5) (2017) 11–15.
[3] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, R. D. Hjelm,
This paper presents a novel approach for Bitcoin illicit trans- Deep graph infomax, in: International Conference on Learning Represen-
action detection based on self-supervised GNNs. We first used tations, 2019.
URL https://openreview.net/forum?id=rklz9iAcKQ
the DGI for generating the node embedding with raw features to [4] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive
train the Random Forest for detection. Our experimental eval- survey on graph neural networks, IEEE Transactions on Neural Networks
uation indicates that our approach performs exceptionally well and Learning Systems 32 (1) (2021) 4–24.
8
[5] M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robin- sentations (ICLR).
son, C. E. Leiserson, Anti-money laundering in bitcoin: Experiment- [29] S. Yun, M. Jeong, R. Kim, J. Kang, H. J. Kim, Graph transformer net-
ing with graph convolutional networks for financial forensics, in: ACM works, Advances in neural information processing systems 32.
SIGKDD International Workshop on Knowledge discovery and data min- [30] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network
ing, 2019. training by reducing internal covariate shift, in: International conference
[6] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, J. Tang, Self- on machine learning, PMLR, 2015, pp. 448–456.
supervised learning: Generative or contrastive, IEEE Transactions on [31] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural
Knowledge and Data Engineering. networks?, arXiv preprint arXiv:1810.00826.
[7] Y. Liu, S. Pan, M. Jin, C. Zhou, F. Xia, P. S. Yu, Graph self-supervised
learning: A survey, arXiv preprint arXiv:2103.00111.
[8] C. M. Bishop, N. M. Nasrabadi, Pattern recognition and machine learn-
ing, Vol. 4, Springer, 2006.
[9] T. N. Kipf, M. Welling, Semi-supervised classification with graph convo-
lutional networks, arXiv preprint arXiv:1609.02907.
[10] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi,
T. Kaler, T. Schardl, C. Leiserson, Evolvegcn: Evolving graph convolu-
tional networks for dynamic graphs, in: Proceedings of the AAAI Con-
ference on Artificial Intelligence, Vol. 34, 2020, pp. 5363–5370.
[11] Y. Hu, S. Seneviratne, K. Thilakarathna, K. Fukuda, A. Seneviratne, Char-
acterizing and detecting money laundering activities on the bitcoin net-
work, arXiv preprint arXiv:1912.12060.
[12] Wallet explorer.
URL https://www.walletexplorer.com
[13] J. A. Bondy, U. S. R. Murty, et al., Graph theory with applications, Vol.
290, Macmillan London, 1976.
[14] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks,
in: Proceedings of the 22nd ACM SIGKDD international conference on
Knowledge discovery and data mining, 2016, pp. 855–864.
[15] D. Vassallo, V. Vella, J. Ellul, Application of gradient boosting algo-
rithms for anti-money laundering in cryptocurrencies, SN Computer Sci-
ence 2 (3) (2021) 1–15.
[16] C. Lee, S. Maharjan, K. Ko, J. W.-K. Hong, Toward detecting illegal
transactions on bitcoin using machine-learning methods, in: International
Conference on Blockchain and Trustworthy Systems, Springer, 2019, pp.
520–533.
[17] Blockchain explorer.
URL https://www.blockchain.com/ko/explorer
[18] I. Alarab, S. Prakoonwit, M. I. Nacer, Competence of graph convolu-
tional networks for anti-money laundering in bitcoin blockchain, in: Pro-
ceedings of the 2020 5th International Conference on Machine Learning
Technologies, 2020, pp. 23–27.
[19] L. Nan, D. Tao, Bitcoin mixing detection using deep autoencoder, in:
2018 IEEE Third international conference on data science in cyberspace
(DSC), IEEE, 2018, pp. 280–287.
[20] J. Lorenz, M. I. Silva, D. Aparı́cio, J. T. Ascensão, P. Bizarro, Machine
learning methods to detect money laundering in the bitcoin blockchain in
the presence of label scarcity, in: Proceedings of the First ACM Interna-
tional Conference on AI in Finance, 2020, pp. 1–8.
[21] T. Pham, S. Lee, Anomaly detection in bitcoin network using unsuper-
vised learning methods, arXiv preprint arXiv:1611.03941.
[22] P. Monamo, V. Marivate, B. Twala, Unsupervised learning for robust
bitcoin fraud detection, in: 2016 Information Security for South Africa
(ISSA), IEEE, 2016, pp. 129–134.
[23] S. Li, F. Xu, R. Wang, S. Zhong, Self-supervised incremental deep
graph learning for ethereum phishing scam detection, arXiv preprint
arXiv:2106.10176.
[24] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, K. M.
Borgwardt, Weisfeiler-lehman graph kernels., Journal of Machine Learn-
ing Research 12 (9).
[25] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, R. D. Hjelm,
Deep graph infomax, International Conference on Learning Representa-
tions (ICLR).
[26] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning
on large graphs, in: Advances in Neural Information Processing Systems,
2017. arXiv:1706.02216.
[27] C. Zhang, D. Song, C. Huang, A. Swami, N. V. Chawla, Heterogeneous
graph neural network, in: Proceedings of the 25th ACM SIGKDD Inter-
national Conference on Knowledge Discovery & Data Mining, 2019, pp.
793–803.
[28] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio,
Graph attention networks, International Conference on Learning Repre-