
ML to detect money laundering in the Bitcoin blockchain

By Joana Lorenz, Maria Inês Silva, David Aparício, João Tiago Ascensão, Pedro Bizarro
Feedzai

Paper link | GitHub link

Abstract

Every year, criminals launder billions of dollars acquired from serious felonies (e.g., terrorism, drug smuggling, or human trafficking) harming countless people and economies. Cryptocurrencies, in particular, have developed as a haven for money laundering activity. Machine Learning can be used to detect these illicit patterns. However, labels are so scarce that traditional supervised algorithms are inapplicable. Here, we address money laundering detection assuming minimal access to labels. First, we show that existing state-of-the-art solutions using unsupervised anomaly detection methods are inadequate to detect the illicit patterns in a real Bitcoin transaction dataset.

Image courtesy: Bitcoin Magazine

In the financial sector, Anti-Money Laundering (AML) efforts often rely on rule-based systems. However, vulnerabilities derive from the relative simplicity of publicly available rule-sets, leading to high false-positive rates (FPR) and low detection rates. Machine learning (ML) techniques overcome the rigidity of rule-based systems by inferring complex patterns from historical data, and can potentially increase detection rates and decrease FPRs.


How to detect money laundering in a dataset with few labels.

  1. Detecting money laundering cases in the Bitcoin network without any labels is impossible since illicit transactions hide within clusters of licit behavior. 
  2. With just a few labels (approximately 5% of the total), one can match the results of a supervised baseline by using Active Learning (AL). This setting mimics a real-world scenario with limited availability of human analysts for manual labeling. 

The authors also extend the existing research on unsupervised illicit activity detection in cryptocurrency and financial transactions by benchmarking different methods on a real-world dataset with a relatively large number of positive cases. In this way, they overcome the typical limitation of evaluating on synthetic data or on real data with few positive samples.





The experiments use the Bitcoin dataset released by Elliptic, a company dedicated to detecting financial crime in cryptocurrencies. It includes 49 graphs sampled from the Bitcoin blockchain at different sequential moments in time (time-steps), as presented in Figure 1. Each graph is a directed acyclic graph that starts from one transaction and includes subsequent related transactions on the blockchain, covering approximately two weeks of data.






Bitcoin transactions are transfers from one Bitcoin address (e.g., a person or company) to another. Each transaction is represented as a node in the graph. Each transaction consumes the outputs of past transactions and generates outputs that can be spent by future transactions. The edges in the graph represent the flow of Bitcoins between transactions.

The dataset consists of 203,769 transactions, of which 21% are labeled as licit and 2% as illicit, based on the category of the Bitcoin address that created the transaction. The remaining transactions are unlabeled. Illicit categories include scams, malware, terrorist organizations, and Ponzi schemes. Licit categories include exchanges, wallet providers, miners, and licit services. Each transaction has 166 features, 94 of which represent information about the transaction itself. The remaining features were constructed by Weber et al. using information one hop backward and forward from the transaction, such as the minimum, maximum, and standard deviation of each transaction feature. All features, except for the time-step, are fully anonymized and standardized to zero mean and unit variance.
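To make the one-hop aggregated features more concrete, here is a minimal sketch on a toy graph. The transaction IDs, the single local feature, and the helper names are all hypothetical, and the population standard deviation is an assumption; the real dataset is anonymized, so this only illustrates the general idea, not how Weber et al. actually built their features.

```python
from statistics import mean, pstdev

# Toy directed transaction graph: edge (u, v) means Bitcoin flows from
# transaction u to transaction v. IDs and values are made up.
edges = [("t1", "t3"), ("t2", "t3"), ("t3", "t4"), ("t3", "t5")]

# One hypothetical local feature per transaction (e.g., an amount).
local_feature = {"t1": 1.0, "t2": 3.0, "t3": 2.0, "t4": 5.0, "t5": 7.0}


def one_hop_neighbours(node):
    """Return (backward, forward) one-hop neighbours of a transaction."""
    backward = [u for u, v in edges if v == node]
    forward = [v for u, v in edges if u == node]
    return backward, forward


def aggregate(values):
    """Min, max, and standard deviation of the neighbours' feature values."""
    if not values:
        return {"min": None, "max": None, "std": None}
    return {"min": min(values), "max": max(values), "std": pstdev(values)}


def neighbourhood_features(node):
    """Aggregate the local feature over backward and forward neighbours."""
    backward, forward = one_hop_neighbours(node)
    return {
        "backward": aggregate([local_feature[n] for n in backward]),
        "forward": aggregate([local_feature[n] for n in forward]),
    }


def standardize(values):
    """Rescale a feature column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


t3 = neighbourhood_features("t3")
z = standardize(list(local_feature.values()))
```

Here `t3` holds the aggregates over the two transactions feeding into `t3` and the two transactions spending its outputs, and `z` is the standardized version of the toy feature column.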


Unsupervised Learning
Anomaly detection methods are unsupervised learning techniques to detect outliers in a dataset. Literature suggests their effectiveness in the AML context.

The authors tested seven common anomaly detection algorithms with readily available Python implementations:
  • Local Outlier Factor (LOF)
  • K-Nearest Neighbours (KNN)
  • Principal Component Analysis (PCA)
  • One-Class Support Vector Machine (OCSVM)
  • Cluster-based Outlier Factor (CBLOF)
  • Angle-based Outlier Detection (ABOD; see my earlier post)
  • Isolation Forest (IF)
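As an illustration of how one of these methods scores points, below is a minimal sketch of a KNN-style outlier score written with the Python standard library only. It is not the implementation used in the paper; the function name, the choice of `k`, and the synthetic data are assumptions for the example. Each point is scored by the distance to its k-th nearest neighbour, so points in sparse regions get high scores.

```python
import math
import random


def knn_outlier_scores(points, k=3):
    """Score each point by its distance to its k-th nearest neighbour.

    Larger scores mean the point sits in a sparser region.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores


random.seed(0)
# Synthetic 2-D data: a dense cluster plus one far-away point.
cluster = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
points = cluster + [(10.0, 10.0)]

scores = knn_outlier_scores(points, k=3)
# Index of the point with the highest outlier score.
most_anomalous = max(range(len(points)), key=scores.__getitem__)
```

On this toy data the isolated point gets the highest score. Note that this only shows the mechanics: a score like this flags points that are unusual in feature space, which is not necessarily the same thing as illicit.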



Active Learning 

AL is an incremental learning approach that interactively queries instances for labeling (e.g., by human analysts) and uses the increasing number of labeled instances to (re-)train a supervised model. It fits the AML context by addressing label scarcity and has previously been successfully applied to detect money laundering accounts based on financial transaction history. For an extensive survey on AL, we refer the reader to Settles. The goal of AL is to minimize the number of labels necessary to achieve adequate classifier performance. The process starts with a pool of unlabeled instances (the unlabeled pool), although sometimes there is a residual number of labels. At each iteration, a query strategy queries a batch of instances for manual labeling. After labeling, the instances go into the labeled pool. Finally, a supervised algorithm (the classifier) is trained on the labeled pool and evaluated on a test set. If the performance is not satisfactory, the querying process continues to enrich the labeled pool incrementally. To mimic the manual labeling process in our experiments, we append the labels to the queried instances.
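The loop described above can be sketched roughly as follows. This is a toy version under several assumptions that are not from the paper: the data is synthetic, the classifier is a simple nearest-centroid stand-in, the query strategy is plain uncertainty sampling, and the batch size and number of rounds are arbitrary. The true labels stay hidden until an instance is queried, which mimics appending labels to queried instances.

```python
import math
import random

random.seed(0)


def make_data(n):
    """Synthetic 2-D points: class 0 around (0, 0), class 1 around (3, 3)."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        centre = 3.0 * label
        data.append(((random.gauss(centre, 1), random.gauss(centre, 1)), label))
    return data


def fit_centroids(labeled):
    """Train a nearest-centroid classifier: one mean point per class."""
    centroids = {}
    for cls in (0, 1):
        pts = [x for x, y in labeled if y == cls]
        centroids[cls] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return centroids


def margin(x, centroids):
    """Signed margin: positive means x is closer to the class 1 centroid."""
    return math.dist(x, centroids[0]) - math.dist(x, centroids[1])


def accuracy(centroids, test):
    """Fraction of test points assigned to the correct class."""
    hits = sum((margin(x, centroids) > 0) == bool(y) for x, y in test)
    return hits / len(test)


pool = make_data(200)   # labels here are only revealed when queried
test = make_data(200)

# Residual starting labels: one instance of each class.
labeled = [next(d for d in pool if d[1] == 0), next(d for d in pool if d[1] == 1)]
unlabeled = [d for d in pool if d not in labeled]

BATCH_SIZE, ROUNDS = 5, 5   # arbitrary values for the sketch
centroids = fit_centroids(labeled)
history = []
for _ in range(ROUNDS):
    # Uncertainty sampling: query the instances closest to the decision boundary.
    unlabeled.sort(key=lambda d: abs(margin(d[0], centroids)))
    batch, unlabeled = unlabeled[:BATCH_SIZE], unlabeled[BATCH_SIZE:]
    labeled.extend(batch)                  # the "analyst" supplies the labels
    centroids = fit_centroids(labeled)     # re-train on the labeled pool
    history.append(accuracy(centroids, test))
```

After the loop, `history` holds the test accuracy after each round. In a real setting the loop would stop once performance is satisfactory rather than after a fixed number of rounds.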



Conclusions

Results indicate that unsupervised anomaly detection methods have poor performance, and we present evidence that anomalies in the feature-space are not indicative of illicit behaviour. This finding highlights that experiments conducted on (partially) synthetic data can be misleading and emphasizes the importance of conducting experiments on real-life datasets to draw reliable conclusions.
