
Segment-Based Credit Scoring Using Latent Clusters in the Variational Autoencoder

- By Rogelio A. Mancisidor, Michael Kampffmeyer, Kjersti Aas, Robert Jenssen
UiT Machine Learning Group




Abstract

Identifying customer segments in retail banking portfolios with different risk profiles can improve the accuracy of credit scoring. The Variational Autoencoder (VAE) has shown promising results in different research domains, and the informative structure embedded in its latent space is well documented. Specifically, the Weight of Evidence (WoE) transformation encapsulates a customer's propensity to fall into financial distress, and the latent space of the VAE preserves this characteristic in a well-defined clustering structure. These clusters have considerably different risk profiles and are therefore suitable not only for credit scoring but also for marketing and customer purposes.

This new clustering methodology addresses some of the challenges of existing clustering algorithms: it suggests the number of clusters, assigns cluster labels to new customers, enables cluster visualization, scales to large datasets, and captures non-linear relationships, among others. Finally, for portfolios with a large number of customers in each cluster, developing one classifier model per cluster can improve the credit scoring assessment.
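As a concrete illustration of the WoE transformation mentioned in the abstract, here is a minimal sketch. The function name and the assumption that every category occurs in both classes are mine, not the paper's; a production implementation would smooth zero counts.

```python
import math

def weight_of_evidence(values, labels):
    """WoE(c) = ln( P(value = c | good) / P(value = c | bad) ).

    `labels`: 0 = good (no financial distress), 1 = bad.
    Sketch only: assumes every category occurs in both classes;
    real scorecard code smooths zero counts before taking the log.
    """
    goods = [v for v, y in zip(values, labels) if y == 0]
    bads = [v for v, y in zip(values, labels) if y == 1]
    return {
        c: math.log((goods.count(c) / len(goods)) / (bads.count(c) / len(bads)))
        for c in set(values)
    }
```

A category over-represented among good customers gets a positive WoE, one over-represented among bad customers a negative WoE, which is exactly the distress-propensity signal the latent space preserves.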


Lending is the principal driver of bank revenues in retail banking, where banks must assess whether to grant a loan at the moment of application. Therefore, banks focus on different aspects to improve this credit assessment. Understanding customers and target populations can improve risk assessment. For example, bank analysts possess the knowledge to understand the underlying risk drivers. This information can be used to identify different groups within a given portfolio, i.e. heuristic-based segmentation, and carry out the risk assessment segment-wise.


There are five factors that can motivate the use of different segments:

i) Marketing,
ii) Customer,
iii) Data,
iv) Process and
v) Model fit.

Marketing factors arise where banks require greater confidence in a specific segment to ensure the ongoing health of the business. Customer factors apply where banks want to treat customers with particular characteristics separately, e.g., customers with no delinquency records. The data factors relate to different operating channels, e.g., internet or dealer, where application data can be entirely different, and process factors refer to the fact that there are products that are treated differently from a managerial point of view. Finally, model fit factors are interactions within the data where different characteristics predict differently for different groups.


Segmentation can be done using a heuristic or statistical-based approach. In the heuristic-based approach, bank analysts define segments based on customer characteristics and expert knowledge of the business. For example, a bank can segment customers based on whether they have previous delinquency records. A classification model is then trained within each of these two segments to separate good from bad customers. When classification is the final goal of segmentation, segments should not only be identified based on demographic factors, but rather on risk-based performance.
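The delinquency-based heuristic described above can be sketched as a simple rule; the field name `delinquencies` is a hypothetical stand-in for whatever the bank's data dictionary actually uses.

```python
def heuristic_segment(customer: dict) -> str:
    """Toy heuristic-based segmentation: split on previous delinquency records.
    A separate scorecard would then be developed for each segment."""
    return "delinquent" if customer.get("delinquencies", 0) > 0 else "clean"
```

The statistical approach replaces this hand-written rule with structure learned from the data, which is where the VAE's latent clusters come in.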


The Variational Autoencoder has shown promising results in different research domains. In some cases it outperforms state-of-the-art methodologies, e.g., predicting drug response and classifying disease subtypes from DNA methylation in the medical field, speech emotion recognition, generative modelling of language sentences to impute missing words in a large corpus, and facial attribute prediction, among others. The authors use the VAE and the Auto-Encoding Variational Bayes (AEVB) algorithm in this research to identify hidden segments in customer portfolios in the banking industry.

A simple autoencoder is shown above.

Graphical representation of the AEVB algorithm. The feedforward neural network on the left corresponds to the probabilistic encoder qφ(z|xi), where xi ∈ R^dx is the network input. The outputs of the network are the parameters of qφ(z|xi) = N(µ_z|xi, σ²_z|xi I). Note that ε ∼ N(0, I) is drawn outside the network so that gradient descent and backpropagation can be used (the reparameterization trick). Similarly, the feedforward network on the right corresponds to the probabilistic decoder pθ(xi|z). In this case, the input is the latent variable z ∈ R^dz and the network outputs are the parameters of pθ(xi|z) = N(µ_xi|z, σ²_xi|z I). The reconstruction is given by x̃ = µ_xi|z. For readability, the parameters φ, θ are not shown in the networks; they are represented by the lines joining the nodes plus a bias term attached to each node.
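The reparameterization step described in the caption, drawing ε ∼ N(0, I) outside the network, can be sketched in NumPy as follows. The single-layer encoder and the weight shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def encode(x, W_mu, W_logvar):
    """Illustrative one-layer encoder: maps x to the parameters of
    q(z|x) = N(mu, diag(exp(log_var)))."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, I) drawn outside the
    network so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))           # 4 customers, d_x = 8
W_mu = rng.standard_normal((8, 2)) * 0.1  # d_z = 2
W_logvar = rng.standard_normal((8, 2)) * 0.1
mu, log_var = encode(x, W_mu, W_logvar)
z = reparameterize(mu, log_var, rng)      # latent codes, shape (4, 2)
```

Because the randomness lives entirely in ε, the sample z is a deterministic, differentiable function of µ and log σ², which is what makes the encoder trainable by backpropagation.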




Graphical representation of the development methodology. 30% of the majority class (y = 0) data is used for training the VAE. Once the VAE is trained, it is used to generate the latent variables z for the remaining data, i.e., 70% of the majority class and 100% of the minority class (y = 1). Based on the clusters in the latent space, Multi-Layer Perceptron (MLP) classifier models are trained using a classical 70%-30% partition for training and testing, respectively. The performance of these segment-based MLP models is compared against a portfolio-based MLP model where no segmentation is considered (the dashed box denotes this model).
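The segment-based step in the figure can be sketched minimally as below, assuming cluster centroids estimated from the latent space and a caller-supplied `fit_fn(X, y)` training routine; both names are my assumptions, not the paper's code.

```python
import numpy as np

def assign_cluster(z, centroids):
    """Assign each latent code to the nearest centroid (Euclidean distance),
    which is also how a new customer would be routed to an existing cluster."""
    d = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def train_segment_models(z, y, centroids, fit_fn):
    """Fit one classifier per latent cluster; `fit_fn` is any (X, y) -> model,
    e.g. a scikit-learn-style MLP wrapped in a closure."""
    labels = assign_cluster(z, centroids)
    return {k: fit_fn(z[labels == k], y[labels == k])
            for k in range(len(centroids))}
```

At scoring time, a new customer's latent code is assigned with `assign_cluster` and scored by that segment's model, mirroring the routing shown in the figure.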


Data Sets


The Kaggle car loan dataset and a Finnish car loan dataset are used for testing. Results are reported as segment-based and portfolio-based statistics, and a table in the paper compares the different algorithms and their p-values.


Conclusion

The Variational Autoencoder has the advantage of capturing non-linear relationships which are projected in the latent space. In addition, the number of clusters is clearly suggested by the latent space in the VAE and, for a low-dimensional latent space, they can be visualized. Furthermore, the VAE can generate the latent configuration of new customers and assign them to one of the existing clusters. The clustering structure in the latent space of the VAE can be used for marketing, customer, and model fit purposes in the bank industry, given that the clusters have considerably different risk profiles and their salient dimensions are business intuitive.





