ML to detect money laundering in the Bitcoin blockchain

By Joana Lorenz, Maria Inês Silva, David Aparício, João Tiago Ascensão, Pedro Bizarro

Feedzai

Paper link | GitHub link

Abstract

Every year, criminals launder billions of dollars acquired from serious felonies (e.g., terrorism, drug smuggling, or human trafficking) harming countless people and economies. Cryptocurrencies, in particular, have developed as a haven for money laundering activity. Machine Learning can be used to detect these illicit patterns. However, labels are so scarce that traditional supervised algorithms are inapplicable. Here, we address money laundering detection assuming minimal access to labels. First, we show that existing state-of-the-art solutions using unsupervised anomaly detection methods are inadequate to detect the illicit patterns in a real Bitcoin transaction dataset.

Image courtesy: Bitcoin Magazine

In the financial sector, Anti-Money Laundering (AML) efforts often rely on rule-based systems. However, vulnerabilities derive from the relative simplicity of publicly available rule-sets, leading to high false-positive rates (FPR) and low detection rates. Machine learning (ML) techniques overcome the rigidity of rule-based systems by inferring complex patterns from historical data, and can potentially increase detection rates and decrease FPRs.

How to detect money laundering in a dataset with few labels.

Detecting money laundering cases in the Bitcoin network without any labels is impossible since illicit transactions hide within clusters of licit behavior.
With just a few labels (approximately 5% of the total), one can match the results of a supervised baseline by using Active Learning (AL). This setting mimics a real-world scenario with limited availability of human analysts for manual labeling.

The authors extend the existing research on unsupervised illicit activity detection in cryptocurrency and financial transactions by benchmarking different methods on a real-world dataset with a relatively large number of positive cases. In this way, they overcome the typical limitation of evaluating on synthetic data or real data with few positive samples.

Dataset: https://www.kaggle.com/ellipticco/elliptic-data-set

The authors use the Bitcoin dataset released by Elliptic, a company dedicated to detecting financial crime in cryptocurrencies. It includes 49 graphs sampled from the Bitcoin blockchain at different sequential moments in time (time-steps), as presented in Figure 1. Each graph is a directed acyclic graph, starting from one transaction, and including subsequent related transactions on the blockchain, containing approximately two weeks of data.

Bitcoin transactions are transfers from one Bitcoin address (e.g., a person or company) to another. Each transaction is a node in the graph; it consumes the outputs of past transactions and generates outputs that can be spent by future transactions. The edges in the graph represent the flow of Bitcoins between transactions.

The dataset consists of 203,769 transactions, of which 21% are labeled as licit and 2% as illicit, based on the category of the Bitcoin address that created the transaction. The remaining transactions are unlabeled. Illicit categories include scams, malware, terrorist organizations, and Ponzi schemes. Licit categories include exchanges, wallet providers, miners, and licit services.

Each transaction has 166 features, 94 of which represent information about the transaction itself. The remaining features were constructed by Weber et al. using information one-hop backward/forward from the transaction, such as the minimum, maximum, and standard deviation of each transaction feature. All features, except for the time-step, are fully anonymized and standardized with zero mean and unit variance.
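To make the one-hop aggregated features and the standardization more concrete, here is a minimal sketch on a toy graph. The graph, the single `amount` feature, and the helper names are all hypothetical, and the use of the population standard deviation is an assumption; the real features are anonymized and the exact construction used by Weber et al. is not described in this post.

```python
from statistics import mean, pstdev

# Toy directed graph: (source_tx, target_tx) means Bitcoins flow from source to target.
edges = [("t1", "t2"), ("t1", "t3"), ("t2", "t4"), ("t3", "t4")]

# One hypothetical local feature per transaction (the real features are anonymized).
amount = {"t1": 10.0, "t2": 4.0, "t3": 6.0, "t4": 9.0}


def one_hop_neighbors(tx, edges):
    """Return the transactions one hop backward (inputs) and forward (outputs) of tx."""
    backward = [src for src, dst in edges if dst == tx]
    forward = [dst for src, dst in edges if src == tx]
    return backward + forward


def aggregate(tx, edges, feature):
    """Minimum, maximum, and standard deviation of a feature over one-hop neighbors."""
    values = [feature[n] for n in one_hop_neighbors(tx, edges)]
    # Assumption: population standard deviation; every toy node has a neighbor.
    return {"min": min(values), "max": max(values), "std": pstdev(values)}


def standardize(feature):
    """Rescale a feature to zero mean and unit variance."""
    values = list(feature.values())
    mu, sigma = mean(values), pstdev(values)
    return {tx: (v - mu) / sigma for tx, v in feature.items()}


aggregated = {tx: aggregate(tx, edges, amount) for tx in amount}
standardized = standardize(amount)

for tx in amount:
    print(tx, aggregated[tx], round(standardized[tx], 3))
```

In this toy graph, `t1` has the two forward neighbors `t2` and `t3`, so its aggregated features summarize the values 4.0 and 6.0.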

Unsupervised Learning

Anomaly detection methods are unsupervised learning techniques to detect outliers in a dataset. Literature suggests their effectiveness in the AML context.

The authors tested seven common anomaly detection algorithms with readily available Python implementations:

Local Outlier Factor (LOF)

K-Nearest Neighbours (KNN)

Principal Component Analysis (PCA)

One-Class Support Vector Machine (OCSVM)

Cluster-based Outlier Factor (CBLOF)

Angle-based Outlier Detection (ABOD; see my earlier post)

Isolation Forest (IF)
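As an illustration of how a distance-based detector from this list assigns anomaly scores, below is a from-scratch sketch of the idea behind the KNN method: each point is scored by the distance to its k-th nearest neighbor. This is not the authors' code or any library's implementation; the function name, the synthetic 2-D data, and `k=3` are assumptions made for the example.

```python
import math
import random


def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    A larger score means the point lies in a sparser region of feature space.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


rng = random.Random(0)

# Hypothetical data: a dense cluster of 50 points plus one far-away point.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
points = cluster + [(8.0, 8.0)]

scores = knn_outlier_scores(points, k=3)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print("index of the highest-scoring point:", most_anomalous)
```

The isolated point gets the highest score because it is far from everything else. A point that sits inside the dense cluster gets a low score whatever its true label is, which is why such detectors struggle when illicit transactions hide within clusters of licit behavior.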

Active Learning

AL is an incremental learning approach that interactively queries instances for labeling (e.g., by human analysts) and uses the increasing number of labeled instances to (re-)train a supervised model. It fits the AML context by addressing label scarcity and has previously been successfully applied to detect money laundering accounts based on financial transaction history. For an extensive survey on AL, we refer the reader to Settles. The goal of AL is to minimize the number of labels necessary to achieve adequate classifier performance. The process starts with a pool of unlabeled instances (the unlabeled pool), although sometimes there is a residual number of labels. At each iteration, a query strategy queries a batch of instances for manual labeling. After labeling, the instances go into the labeled pool. Finally, a supervised algorithm (the classifier) is trained on the labeled pool and evaluated on a test set. If the performance is not satisfactory, the querying process continues to enrich the labeled pool incrementally. To mimic the manual labeling process in our experiments, we append the labels to the queried instances.
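The loop described above can be sketched as follows. Everything specific here is an assumption chosen to keep the example short and dependency-free: the synthetic 2-D data, the nearest-centroid classifier, uncertainty sampling as the query strategy, F1 on the illicit class as the metric, and the batch size and stopping threshold. The post does not state which classifiers or query strategies the authors used.

```python
import math
import random


def make_data(n, rng):
    """Hypothetical 2-D data; label 1 (illicit) is the minority class."""
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.2 else 0
        center = 2.0 if label else -2.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def train(labeled):
    """Fit a nearest-centroid classifier: one centroid per class."""
    return {y: centroid([x for x, lab in labeled if lab == y]) for y in (0, 1)}


def margin(model, x):
    """Distance to centroid 0 minus distance to centroid 1; near zero = uncertain."""
    return math.dist(x, model[0]) - math.dist(x, model[1])


def predict(model, x):
    return 1 if margin(model, x) > 0 else 0


def f1_illicit(model, data):
    pairs = [(predict(model, x), y) for x, y in data]
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def active_learning(pool, test_set, batch_size=5, max_iterations=10, target=0.95):
    # Residual labels: seed the labeled pool with one instance per class.
    labeled = [next(item for item in pool if item[1] == y) for y in (0, 1)]
    unlabeled = [item for item in pool if item not in labeled]
    model = train(labeled)
    history = [f1_illicit(model, test_set)]
    for _ in range(max_iterations):
        if history[-1] >= target or not unlabeled:
            break
        # Query strategy (uncertainty sampling): least confident instances first.
        unlabeled.sort(key=lambda item: abs(margin(model, item[0])))
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # Mimic the analyst by appending the known labels of the queried batch.
        labeled.extend(batch)
        model = train(labeled)
        history.append(f1_illicit(model, test_set))
    return model, labeled, history


rng = random.Random(1)
pool, test_set = make_data(200, rng), make_data(200, rng)
model, labeled, history = active_learning(pool, test_set)
print("labels used:", len(labeled), "F1 per iteration:", [round(h, 3) for h in history])
```

The loop stops either when the metric reaches the target or when the labeling budget (`max_iterations * batch_size` queries) is spent, so the labeled pool stays a small fraction of the full pool.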

Conclusions

Results indicate that unsupervised anomaly detection methods have poor performance, and we present evidence that anomalies in the feature-space are not indicative of illicit behaviour. This finding highlights that experiments conducted on (partially) synthetic data can be misleading and emphasizes the importance of conducting experiments on real-life datasets to draw reliable conclusions.
