
ML to detect money laundering in the Bitcoin blockchain

-By Joana Lorenz, Maria Inês Silva, David Aparício, João Tiago Ascensão, Pedro Bizarro
Feedzai

Paper link | GitHub link

Abstract

Every year, criminals launder billions of dollars acquired from serious felonies (e.g., terrorism, drug smuggling, or human trafficking) harming countless people and economies. Cryptocurrencies, in particular, have developed as a haven for money laundering activity. Machine Learning can be used to detect these illicit patterns. However, labels are so scarce that traditional supervised algorithms are inapplicable. Here, we address money laundering detection assuming minimal access to labels. First, we show that existing state-of-the-art solutions using unsupervised anomaly detection methods are inadequate to detect the illicit patterns in a real Bitcoin transaction dataset.

Image courtesy: Bitcoin Magazine

In the financial sector, Anti-Money Laundering (AML) efforts often rely on rule-based systems. However, vulnerabilities derive from the relative simplicity of publicly available rule-sets, leading to high false-positive rates (FPR) and low detection rates. Machine learning (ML) techniques overcome the rigidity of rule-based systems by inferring complex patterns from historical data, and can potentially increase detection rates and decrease FPRs.


How to detect money laundering in a dataset with few labels.

  1. Detecting money laundering cases in the Bitcoin network without any labels is impossible since illicit transactions hide within clusters of licit behavior. 
  2. With just a few labels (approximately 5% of the total), one can match the results of a supervised baseline by using Active Learning (AL). This setting mimics a real-world scenario with limited availability of human analysts for manual labeling. 
The authors extend the existing research on unsupervised illicit activity detection in cryptocurrency and financial transactions by benchmarking different methods on a real-world dataset with a relatively large number of positive cases. In this way, they overcome the typical limitation of evaluating on synthetic data or on real data with few positive samples.





The study uses the Bitcoin dataset released by Elliptic, a company dedicated to detecting financial crime in cryptocurrencies. It includes 49 graphs sampled from the Bitcoin blockchain at different sequential moments in time (time-steps), as presented in Figure 1. Each graph is a directed acyclic graph that starts from one transaction, includes the subsequent related transactions on the blockchain, and contains approximately two weeks of data.






Bitcoin transactions are transfers from one Bitcoin address (e.g., a person or company) to another, and each transaction is represented as a node in the graph. Each transaction consumes the output of past transactions and generates outputs that can be spent by future transactions. The edges in the graph represent the flow of Bitcoins between transactions. The dataset consists of 203,769 transactions, of which 21% are labeled as licit and 2% as illicit, based on the category of the Bitcoin address that created the transaction. The remaining transactions are unlabeled. Illicit categories include scams, malware, terrorist organizations, and Ponzi schemes. Licit categories include exchanges, wallet providers, miners, and licit services. Each transaction has 166 features, 94 of which represent information about the transaction itself. The remaining features were constructed by Weber et al. using information one hop backward/forward from the transaction, such as the minimum, maximum, and standard deviation of each transaction feature. All features, except for the time-step, are fully anonymized and standardized with zero mean and unit variance.
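To make the idea of one-hop aggregated features concrete, here is a minimal sketch on a toy graph. The transaction ids, the single `amount` feature, and the choice of population standard deviation are all assumptions for illustration; the exact aggregation used by Weber et al. is not described here beyond minimum, maximum, and standard deviation.

```python
from statistics import pstdev

# Toy transaction graph (made-up ids): an edge (a, b) means that
# transaction b spends an output of transaction a.
edges = [("t1", "t3"), ("t2", "t3"), ("t3", "t4"), ("t3", "t5")]

# One hypothetical per-transaction feature; the real features are anonymized.
amount = {"t1": 1.0, "t2": 3.0, "t3": 4.0, "t4": 2.5, "t5": 1.5}


def one_hop_neighbors(node, edges):
    """Transactions one hop backward (inputs) and forward (outputs) of node."""
    backward = [src for src, dst in edges if dst == node]
    forward = [dst for src, dst in edges if src == node]
    return backward + forward


def aggregate(node, edges, feature):
    """Min, max, and (population) standard deviation over one-hop neighbors."""
    values = [feature[n] for n in one_hop_neighbors(node, edges)]
    if not values:
        return {"min": None, "max": None, "std": None}
    return {"min": min(values), "max": max(values), "std": pstdev(values)}


agg_t3 = aggregate("t3", edges, amount)
print(agg_t3)
```

Each aggregated value would become an extra column next to the transaction's own features.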


Unsupervised Learning
Anomaly detection methods are unsupervised learning techniques to detect outliers in a dataset. Literature suggests their effectiveness in the AML context.

The authors tested seven common anomaly detection algorithms with readily available Python implementations:
  • Local Outlier Factor (LOF)
  • K-Nearest Neighbours (KNN)
  • Principal Component Analysis (PCA)
  • One-Class Support Vector Machine (OCSVM)
  • Cluster-based Outlier Factor (CBLOF)
  • Angle-based Outlier Detection (ABOD; see my earlier post)
  • Isolation Forest (IF).
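As a rough illustration of how such a detector scores points, the sketch below implements a plain k-nearest-neighbour distance score (the idea behind the KNN detector in the list) using only the standard library. The data are synthetic and deliberately constructed so that the "illicit" points sit inside the licit cluster while a few licit points lie far away; this is an assumption chosen to mirror the paper's finding, not a reproduction of its experiments or of any library's implementation.

```python
import math
import random

random.seed(0)

# Synthetic 2-D data (an assumption, not the Elliptic features):
# licit points form one cluster, illicit points hide inside that cluster,
# and a few harmless licit points lie far away from everything else.
licit = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
illicit = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(10)]
licit_outliers = [(random.uniform(6, 8), random.uniform(6, 8)) for _ in range(5)]

points = licit + illicit + licit_outliers
labels = [0] * len(licit) + [1] * len(illicit) + [0] * len(licit_outliers)


def knn_score(i, points, k=5):
    """Outlier score: distance from points[i] to its k-th nearest neighbour."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]


scores = [knn_score(i, points) for i in range(len(points))]

# Flag as many points as there are illicit ones, highest scores first.
n_flagged = sum(labels)
flagged = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n_flagged]
hits = sum(labels[i] for i in flagged)
print(f"illicit among the {n_flagged} most anomalous points: {hits}")
```

In this constructed setting the detector spends its budget on points that are unusual but licit, which is the failure mode reported for the real dataset.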



Active Learning 

AL is an incremental learning approach that interactively queries instances for labeling (e.g., by human analysts) and uses the growing number of labeled instances to (re-)train a supervised model. It fits the AML context by addressing label scarcity and has previously been applied successfully to detect money laundering accounts based on financial transaction history. For an extensive survey on AL, the authors refer the reader to Settles.

The goal of AL is to minimize the number of labels necessary to achieve adequate classifier performance. The process starts with a pool of unlabeled instances (the unlabeled pool), although sometimes a small number of labels is already available. At each iteration, a query strategy selects a batch of instances for manual labeling. After labeling, the instances go into the labeled pool. Finally, a supervised algorithm (the classifier) is trained on the labeled pool and evaluated on a test set. If the performance is not satisfactory, the querying process continues to enrich the labeled pool incrementally. To mimic the manual labeling process in their experiments, the authors append the known labels to the queried instances.
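The loop described above can be sketched in a few lines. Everything specific in this example is an assumption made to keep it dependency-free: the synthetic 2-D data, the nearest-centroid classifier, uncertainty sampling as the query strategy, plain accuracy as the metric, a fixed number of iterations instead of a performance-based stopping rule, and two initial labels (one per class). The classifiers and query strategies used in the paper may differ.

```python
import math
import random

random.seed(1)


def make_data(n):
    """Synthetic 2-D points: class 0 around (0, 0), class 1 around (2, 2)."""
    data = []
    for _ in range(n):
        y = 1 if random.random() < 0.2 else 0
        center = 2.0 * y
        data.append(((random.gauss(center, 1), random.gauss(center, 1)), y))
    return data


def fit(labeled):
    """Nearest-centroid classifier: one mean point per class."""
    centroids = {}
    for cls in (0, 1):
        xs = [x for x, y in labeled if y == cls]
        centroids[cls] = tuple(sum(c) / len(xs) for c in zip(*xs))
    return centroids


def margin(model, x):
    """Signed margin: positive means x is closer to the class 1 centroid."""
    return math.dist(x, model[0]) - math.dist(x, model[1])


def accuracy(model, test):
    return sum((margin(model, x) > 0) == (y == 1) for x, y in test) / len(test)


pool = make_data(500)   # labels in the pool stay hidden until queried
test = make_data(500)

# Initial labels: one instance per class so the first model can be trained.
labeled = [next(p for p in pool if p[1] == 0), next(p for p in pool if p[1] == 1)]
unlabeled = [p for p in pool if p not in labeled]

batch_size, n_iterations = 10, 5
history = []
for _ in range(n_iterations):
    model = fit(labeled)                      # (re-)train on the labeled pool
    history.append((len(labeled), accuracy(model, test)))
    # Query strategy: uncertainty sampling, smallest absolute margin first.
    unlabeled.sort(key=lambda p: abs(margin(model, p[0])))
    queried, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
    # Oracle step: the known labels travel with the queried instances.
    labeled.extend(queried)

for n_labels, acc in history:
    print(f"{n_labels:3d} labels -> test accuracy {acc:.3f}")
```

In practice the fixed iteration count would be replaced by a check on the test metric, and the oracle step by a human analyst.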



Conclusions

Results indicate that unsupervised anomaly detection methods have poor performance, and we present evidence that anomalies in the feature-space are not indicative of illicit behaviour. This finding highlights that experiments conducted on (partially) synthetic data can be misleading and emphasizes the importance of conducting experiments on real-life datasets to draw reliable conclusions.
