By Leonardo Baldassini, José Antonio Rodríguez Serrano
BBVA Data & Analytics
Abstract
We present client2vec, an internal library to rapidly build baselines for banking applications. Client2vec uses marginalized stacked denoising
autoencoders on current account transaction data to create vector embeddings which represent the behaviors of our
clients. These representations can then be used in, and optimized against, a variety of tasks such as client segmentation,
profiling and targeting.
Most data analytics and commercial campaigns in retail
banking revolve around the concept of behavioral similarity, for instance: studies and campaigns on client retention; product recommendations; web applications where clients can compare their expenses with those of similar
people in order to better manage their own finances; and data integrity tools.
The analytic work behind each of these products normally
requires the construction of a set of customer attributes and
a model, both typically tailored to the problem of interest. Our aim is to systematize this process in order to encourage
model and code reuse, reduce project feasibility assessment
times and promote homogeneous practices.
To this end we built client2vec, a library to speed
up the construction of informative baselines for behavior-centric banking applications. In particular, client2vec focuses on behaviors which can be extracted from account
transaction data by encoding that information into vector form (client embedding). These embeddings make it possible to quantify how similar two customers are and, when input into clustering or regression algorithms, outperform the
sociodemographic customer attributes traditionally used for
customer segmentation or marketing campaigns. The proposed solution has minimal computational and preprocessing requirements and could run even on simple infrastructures. Client2vec offers our
data scientists the possibility to optimize the embeddings
against the business problem at hand. For instance, the embedding may be tuned to optimize the average precision for
the task of retrieving suitable targets for a campaign.
Approach
We designed client2vec following an analogy with unsupervised
word embeddings, whereby account transactions
can be seen as words, clients as documents (bags or sequences
of words) and the behavior of a client as the summary of a
document. Just like word or document embeddings, client
embeddings should exhibit the fundamental property that
neighboring points in the space of embeddings correspond
to clients with similar behaviors.
First approach: extract
vector representations of transactions and compose them
into client embeddings, as done with word embeddings to
extract phrase or document embeddings via averaging or
more sophisticated techniques.
Second approach: embed
clients straight away.
We explored the former option by applying the famed word2vec algorithm to our data and
then pooling the embeddings of individual transactions into
client representations with a variety of methods. For the
latter approach, which is the one currently employed by
client2vec, we built client embeddings via a marginalized stacked denoising autoencoder (mSDA). For comparison
and benchmarking purposes, we also tested the embedding
comprising the raw transactional data of a client and the
one produced by sociodemographic variables.
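To make the mSDA approach concrete, a single marginalized denoising autoencoder layer, the building block of mSDA, admits a closed-form solution. The sketch below follows that formulation under illustrative assumptions: the data matrix, noise level and regularizer are synthetic and are not the settings used by client2vec.

```python
import numpy as np

def mda_layer(X, p=0.5, reg=1e-5):
    """One marginalized denoising autoencoder layer (closed form).

    X   : (d, n) data matrix, one column per client.
    p   : probability of corrupting (zeroing) each feature.
    reg : ridge term keeping the inverse well conditioned.
    Returns the mapping W and the hidden representation tanh(W X).
    """
    d = X.shape[0]
    S = X @ X.T                           # scatter matrix
    q = np.full(d, 1.0 - p)               # per-feature survival probability
    EQ = S * np.outer(q, q)               # expected corrupted scatter (off-diagonal)
    np.fill_diagonal(EQ, q * np.diag(S))  # a feature co-occurs with itself w.p. q_i
    EP = S * q[np.newaxis, :]             # expected clean-vs-corrupted cross-scatter
    W = EP @ np.linalg.inv(EQ + reg * np.eye(d))
    return W, np.tanh(W @ X)

def msda(X, n_layers=3, p=0.5):
    """Stack layers, feeding each output into the next, and
    concatenate the layer outputs into the final embedding."""
    hs, h = [], X
    for _ in range(n_layers):
        _, h = mda_layer(h, p=p)
        hs.append(h)
    return np.vstack(hs)

rng = np.random.default_rng(0)
X = rng.random((20, 100))                 # 20 features, 100 synthetic clients
emb = msda(X, n_layers=2)
print(emb.shape)                          # 2 layers x 20 features per client
```

Because the corruption is marginalized out analytically, training reduces to a few matrix products, which is what keeps the computational requirements minimal.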
Embeddings are then turned into actionable baselines by
casting business problems as nearest neighbor regressions.
This builds on successful works in computer vision which adopt the principle of the unreasonable effectiveness
of data.
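Casting a business problem as a nearest-neighbor regression over embeddings can be sketched as follows; the target variable (a synthetic propensity score standing in for, e.g., campaign response) and the scikit-learn setup are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
emb = rng.random((500, 32))      # 500 clients, 32-dim embeddings
y = rng.random(500)              # synthetic propensity scores

# Predict a client's score as the average over its nearest neighbors:
# behaviorally similar clients are assumed to behave similarly.
knn = KNeighborsRegressor(n_neighbors=10).fit(emb, y)
pred = knn.predict(emb[:5])
print(pred.shape)
```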
Sociodemographic variables
The fundamental benchmark against which we compared
all methods is the set of sociodemographic variables: age, gender, income range, postcode, city and province. Such variables are
typically considered by banks, retailers and other organizations for purposes like segmentation or campaigns. All of
these variables are categorical, even income, which has been
binned into several ranges. As such, we one-hot encode them
and then reduce the dimensionality of the resulting vector in order to measure the Euclidean distance between
two sociodemographic representations.
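This pipeline can be sketched with scikit-learn; the records below are synthetic and the columns are illustrative, not the actual variables used:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD

# Synthetic sociodemographic records: age range, gender, income range, postcode.
records = np.array([
    ["30-40", "F", "20k-30k", "28001"],
    ["30-40", "M", "20k-30k", "28002"],
    ["50-60", "M", "40k-50k", "08001"],
])

# One-hot encode the categorical variables, then reduce dimensionality
# so that Euclidean distances between clients are meaningful.
onehot = OneHotEncoder().fit_transform(records).toarray()
lowdim = TruncatedSVD(n_components=2).fit_transform(onehot)

dist = np.linalg.norm(lowdim[0] - lowdim[1])  # clients 0 and 1
print(lowdim.shape)
```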
Raw transactions
Embedding via word2vec
Word2vec is a family of embeddings of words in documents, which express each word token with a dense vector.
These vectors result from the intermediate encoding of a 2-
layer network trained to reconstruct the linguistic context of
each token and exhibit strong semantic properties, e.g. two
nearby vectors refer to words that may share the same topic
or even be synonyms.
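Under this analogy, a word2vec model trained on clients' transaction sequences yields one vector per transaction type, which must then be pooled into a client embedding. The sketch below assumes such per-transaction vectors have already been learned (the category codes and 4-dimensional vectors are synthetic placeholders) and shows the simplest pooling, averaging:

```python
import numpy as np

# Assume word2vec has already produced a vector per transaction code
# (synthetic 4-dim vectors here; real ones would be learned from data).
txn_vec = {
    "grocery":    np.array([0.9, 0.1, 0.0, 0.2]),
    "fuel":       np.array([0.1, 0.8, 0.1, 0.0]),
    "restaurant": np.array([0.2, 0.0, 0.9, 0.3]),
}

def client_embedding(transactions):
    """Mean-pool transaction vectors into a client representation."""
    return np.mean([txn_vec[t] for t in transactions], axis=0)

emb = client_embedding(["grocery", "grocery", "fuel"])
print(emb)
```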
Model selection
We treat the preprocessing options for mSDAs listed above
as hyperparameters to optimize at train time. Likewise,
the hyperparameters for the word2vec benchmark are the word-embedding dimension and the context window size [28],
while for the raw transaction embeddings we only choose
whether to L2-normalize, log-normalize or binarize. The
optimization is carried out separately for each use case we
consider.
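Selecting among the raw-transaction preprocessing options can be sketched as a small validation loop; the candidate transforms match those named above, but the data and the scoring function are illustrative placeholders (in practice the score would be a task-specific metric such as average precision on a retrieval task):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.poisson(3.0, size=(100, 10)).astype(float)  # synthetic transaction counts

transforms = {
    "l2":     lambda X: X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12),
    "log":    lambda X: np.log1p(X),
    "binary": lambda X: (X > 0).astype(float),
}

def validation_score(emb):
    # Placeholder for the use-case metric optimized in practice.
    return float(emb.mean())

best = max(transforms, key=lambda name: validation_score(transforms[name](X)))
print(best)
```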
Results
Conclusions
We presented client2vec, an attempt to
develop an internal tool that could catalyze data-driven decision making at BBVA. We described how we worked towards a solution that is simple to use, fast to deploy and
integrate into colleagues' processes, and that requires minimal preprocessing. Along the way, we learned that composing transactional embeddings extracted with word2vec into
customer embeddings does not always offer an acceptable performance, while mSDAs help us capture a good deal of behavioral information. Furthermore, we highlighted how this
information can be extracted even from simple, coarse transactional data. We plan to keep expanding the client2vec library by adding new representations as new use cases arise,
as well as by proactively exploring algorithms that fit its
philosophy of simplicity, such as the nonlinear extension of
mSDA or metric learning to further boost
the performance of mSDA embeddings in client targeting.