Original article was published by Nitin Sharma on Deep Learning on Medium
Multi-Domain Fraud Detection While Reducing Good User Declines — Part I
Learning Fraud Patterns Using Multi-task Deep Learning Methods while approving good customers swiftly
This post is the first in a series of two blog posts that outline the approach towards multi-domain fraud detection while optimizing for fraud catch rate across different domains. Further, the multi-task learning strategy is augmented with robust feature learning to mitigate performance deterioration due to post-deployment feature shifts.
The payments fraud detection domain involves deploying multiple machine learning models to detect different types of fraud patterns. Heterogeneous payment volume gives rise to fraud patterns that vary with factors such as the nature of the fraud, the entity being protected, the geographical location and the purpose of the transaction. Although such patterns might appear starkly different, they have a lot in common because of the pathways through which fraud occurs. In such instances, learning representations in a joint optimization framework, one that optimizes for multiple fraud patterns simultaneously as part of a single problem formulation, often improves fraud detection.
While the applied research literature in image classification, natural language processing and speech processing (translation, summarization and dialog systems) presents several use cases and proofs of concept for multi-task learning, the problem formulation for fraud detection is peculiar in the following ways:
- A large-volume payments ecosystem with diverse groups of users across multiple countries, who use multiple payment methods and use PayPal for a myriad of purposes.
- Diverse and ever-evolving fraud patterns manifesting through multiple complex e-commerce pathways.
- Multiple types of variations in fraud patterns with factors such as high temporality and varying loss pressures.
- A multi-objective optimization context: good user traffic must be approved effectively and briskly, which forms a core tenet of customer satisfaction, while maintaining high temporal accuracy of fraud detection.
In the fraud detection world, several fraud modus operandi might be closely related to each other, often with manifestations that are overlapping. Multiple fraud patterns can result from different kinds of fraudulent activities. For example: fraudsters try to steal identities of legitimate customers and use the associated credentials for unauthorized access of those accounts, further monetizing by using financial instruments (credit card, bank account or balance) of the legitimate customers to purchase goods and services. It is also possible for fraudsters to steal financial instruments and then create accounts to fund e-commerce transactions.
As mentioned earlier, the manifestations of fraud can often be overlapping. For example, stolen identity fraud could be unearthed through a chargeback on a potentially stolen instrument. A key question, then, is whether simultaneous domain-specific fraud discovery can be carried out during the cross-optimization of multiple tasks, where each task is a specific fraud type. It is also critical to learn each task while identifying features that are robust to temporal changes in the distribution of the data, or more specifically in the multivariate mixture distribution represented by the multiple tasks in the co-learning process.
We present a multi-purpose algorithm for simultaneous detection of multiple account-takeover-oriented fraud patterns using a single deep multi-task learning model. The proposed method employs a multi-task learning framework that regularizes the shared parameters, together with a robust representation learning scheme that optimizes for performance stability during post-deployment shifts. Further, we demonstrate changes to the architecture that optimize for accurately identifying good users and incrementally reducing declines (false positives), with no adverse effects on robustness or accuracy. While the model development process involves several choices and nuances, what follows describes some of the salient aspects of the method and architecture for the proposed problem.
Optimizing for Catch Rate Across Tasks
The model architecture involves a multi-task learner with hard parameter sharing⁴ ¹ to reduce overfitting, with multiple shared layers serving to jointly optimize across different tasks, each representing a specific type of account takeover (ATO) fraud pattern. In general, the larger the number of tasks to be learned, the broader the search for optimal shared parameters, which lowers the risk of overfitting. There is arguably some possibility of negative transfer due to the occasional dissimilarity of tasks, but the present context comprises a single main type of fraud modus operandi (MO), related to ATO. A clustering constraint is therefore also introduced and tested, penalizing both the norms of the task column vectors a⋅,₁, …, a⋅,t and their variance, based on the following constraint⁴ ²:
Such a penalty provides a framework for constrained clustering of task-specific (or sub-MO-specific) parameter vectors towards their mean, with tuning parameter λ. As an iterative process, all stolen identity modus operandi are trained and tuned as separate model architectures. This further avoids over- or under-fitting on a specific sub-MO, whether because it is the dominant component in the mixture distribution or because of relative differences in class imbalance (fraud versus good users) across tasks.
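For reference, the mean-regularized clustering penalty described in Ruder's overview⁴ matches this description and can be written as below. This is a reconstruction under that assumption, not necessarily the exact form used in the model; T denotes the number of tasks and ā the mean of the task parameter vectors.

```latex
\Omega = \lVert \bar{a} \rVert^{2}
       + \frac{\lambda}{T} \sum_{t=1}^{T} \lVert a_{\cdot,t} - \bar{a} \rVert^{2},
\qquad
\bar{a} = \frac{1}{T} \sum_{t=1}^{T} a_{\cdot,t}
```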
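To make the effect of such a penalty concrete, here is a minimal NumPy sketch. It assumes the mean-regularized form from Ruder's overview⁴ (squared norm of the mean vector plus λ/T times the summed squared distances to the mean); the function name and the toy parameter matrices are hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_clustering_penalty(A: np.ndarray, lam: float) -> float:
    """Assumed form: ||a_bar||^2 + (lam / T) * sum_t ||a_t - a_bar||^2.

    A has shape (d, T): one column of task-specific parameters per task.
    """
    T = A.shape[1]
    a_bar = A.mean(axis=1, keepdims=True)      # mean parameter vector, shape (d, 1)
    mean_norm_sq = float(np.sum(a_bar ** 2))   # squared norm of the mean vector
    spread = float(np.sum((A - a_bar) ** 2))   # summed squared distance to the mean
    return mean_norm_sq + (lam / T) * spread

# Hypothetical parameters for three sub-MO tasks (columns), two weights each.
A_similar = np.array([[1.0, 1.1, 0.9],
                      [2.0, 2.1, 1.9]])
A_spread = np.array([[1.0, 3.0, -1.0],
                     [2.0, 5.0, -1.0]])

print(mean_clustering_penalty(A_similar, lam=1.0))
print(mean_clustering_penalty(A_spread, lam=1.0))
```

Both matrices have the same mean vector, so the difference between the two values comes entirely from the spread term: the larger λ is, the more strongly task vectors that drift away from the mean are penalized.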
Further, cross-stitch units² are introduced to let the model determine how task-specific networks leverage other tasks, and to improve generalization across different sub-MOs. This is done by learning shared representations as a linear combination of activation maps inside the cross-stitch units. The units combine post-activation outputs from different layers of the network to strike a balance between shared and task-oriented representations for specific sub-MOs. Such an architecture also allows the modeler to fine-tune the extent of shared representation learning by controlling the value of 𝛼, which parameterizes the linear combination of activation maps. For relatively similar sub-MOs, the search process starts with a larger value of 𝛼 and adjusts it towards lower values where the sub-MOs diverge. More specifically, as recommended by Misra et al. (2016)², the 𝛼 matrix is initialized with values in the range [0, 1] for stable learning, which also keeps the output activation maps (after the cross-stitch unit) at the same order of magnitude as the inputs (before it).
In this setting, two variations were attempted: first, to train each network separately in a task-specific manner and then use that initialization to train the joint architecture represented by the cross-stitch units; and second, to start all networks from the same initialization and train them jointly along with the cross-stitch units. Through a series of iterations, the best performance was achieved by shallow-training the networks separately first: the search trajectory aligned better while optimizing for performance on individual tasks, yet left sufficient room for joint optimization, re-alignment of network weights and feature learning in the subsequent multi-task learning iterations.
To address temporal stability and further optimize for robustness, feature representations are drawn from separate networks with custom cost functions, trained on long-term training data. The network architecture also involves a joint optimizer, with networks trained to capture robust long-term fraud patterns and, further, to improve catch rate on more near-term fraud patterns. Training is also carried out on rolling windows of incremental duration. On the subject of cost functions specifically, the approach uses cross-task regularization penalties (norm-based group sparsity) to meaningfully constrain the underlying joint feature selection process. Overall, the architecture involves the following high-level design:
It may be noted that the method can be modified for tasks that are hierarchical or sequential in nature, or only loosely related to each other, using latent multi-task architectures or Sluice Networks⁶. The problem context above involves fraud patterns that are non-hierarchical and that co-exist independently in different sub-populations.
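The linear combination performed by a cross-stitch unit can be sketched for two tasks as follows. This is a simplified illustration in the spirit of Misra et al.², not the production architecture: the function name is hypothetical, and 𝛼 is a fixed 2×2 matrix here, whereas in a real network it would be a trainable parameter learned alongside the weights.

```python
import numpy as np

def cross_stitch(x_a: np.ndarray, x_b: np.ndarray, alpha: np.ndarray):
    """Linearly recombine the activations of two task networks.

    x_a, x_b: activations of tasks A and B at the same layer, shape (batch, units).
    alpha:    2x2 matrix [[a_AA, a_AB], [a_BA, a_BB]] of mixing weights.
    """
    out_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b  # what task A passes to its next layer
    out_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b  # what task B passes to its next layer
    return out_a, out_b

# Toy activations for two hypothetical sub-MO networks.
x_a = np.ones((2, 3))
x_b = np.zeros((2, 3))

# Mostly task-specific, with a small amount of sharing between the tasks.
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
out_a, out_b = cross_stitch(x_a, x_b, alpha)
print(out_a)
print(out_b)
```

With the identity matrix as 𝛼 the two networks stay fully separate, while equal entries make every layer output fully shared; the learned values sit somewhere between these extremes.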
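As a side illustration of the norm-based group sparsity penalty mentioned in this section (it covers only the regularizer, not the overall design), here is a minimal sketch. The post does not state which norm is used; the ℓ2,1 norm below is an assumption, chosen because it is a common choice for joint feature selection across tasks. The function name, the weight matrix and the weighting factor `mu` are hypothetical.

```python
import numpy as np

def l21_group_sparsity(W: np.ndarray) -> float:
    """Sum over features (rows) of the L2 norm taken across tasks (columns).

    W has shape (n_features, n_tasks). A row is driven to zero only when the
    feature is unhelpful for all tasks, so features are selected jointly.
    """
    return float(np.sum(np.linalg.norm(W, axis=1)))

# Hypothetical first-layer weights: 3 features, 2 tasks.
W = np.array([[3.0, 4.0],   # feature used by both tasks
              [0.0, 0.0],   # feature dropped for every task
              [1.0, 0.0]])  # feature used by one task only

penalty = l21_group_sparsity(W)

# The penalty would be added to the summed task losses with a tuning weight.
task_loss = 0.42            # placeholder value, for illustration only
mu = 0.01
total_loss = task_loss + mu * penalty
print(penalty, total_loss)
```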
Robust Feature Learning for Post-Deployment Shifts
As a variation to the above architecture, the robust representation learning method is introduced, so the model is robust to sporadic changes either in data distribution or systemic corruption. The idea here is to discover stable feature representation spaces, so as to boost robustness, especially in contexts where critical features could potentially be missing either due to systemic/infrastructure glitches or due to data encoding and transformation errors. Robust feature representations are learned by training stacked de-noising autoencoders that help reconstruct the input from associated corrupted versions. The approach utilizes past data and domain know-how to simulate perturbations in multivariate vectors recreating new training data, while retaining the original vector at the output. The hidden layers in such cases are constrained to discover robust features or representations that generate the original data from systematically corrupted or perturbed input data.
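The construction of training pairs for a denoising autoencoder can be sketched as below. The perturbations described above are derived from past data and domain know-how; the uniform random masking used here is a simplifying assumption that only mimics features going missing, and the function name, missing rate and fill value are hypothetical.

```python
import numpy as np

def corrupt_with_masking(X, missing_rate, rng, fill_value=0.0):
    """Simulate missing features by overwriting random entries with fill_value.

    Returns the corrupted copy and the boolean mask of corrupted entries.
    The original X is left untouched and serves as the reconstruction target.
    """
    mask = rng.random(X.shape) < missing_rate
    X_corrupted = np.where(mask, fill_value, X)
    return X_corrupted, mask

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(1000, 8))          # stand-in for real feature vectors
X_noisy, mask = corrupt_with_masking(X_clean, missing_rate=0.25, rng=rng)

# A denoising autoencoder would then be fit on (input=X_noisy, target=X_clean),
# so its hidden layers must learn features that survive the corruption.
print(X_noisy.shape, round(float(mask.mean()), 3))
```

Returning the mask alongside the corrupted data is a deliberate choice: it lets a later loss function treat corrupted and uncorrupted features differently.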
As a last step, further fine-tuning was carried out using an emphasized denoising autoencoder, altering the cross-entropy loss function to allocate a higher weight to corrupted features and a lower weight to all other features. Building on the cost-function modification proposed by Vincent et al. (2010)⁵, the loss was modified further:
In general, the framework is also applicable to simulating the feature distribution shifts behind a large pool of post-deployment issues. Representations learned by such models are used directly in the transfer learning context described in the previous section, for the second-stage supervised multi-task learning process.
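For reference, the emphasized cross-entropy loss from Vincent et al. (2010)⁵ has the form below, where x is the clean input, z the reconstruction, J(x̃) the set of indices of corrupted components, and α > β the two weights. This is the published starting point only; the further modification applied in this work is not specified here.

```latex
L_{H,\alpha}(x, z) =
  \alpha \Bigl( - \sum_{j \in J(\tilde{x})}
      \bigl[ x_j \log z_j + (1 - x_j) \log (1 - z_j) \bigr] \Bigr)
  + \beta \Bigl( - \sum_{j \notin J(\tilde{x})}
      \bigl[ x_j \log z_j + (1 - x_j) \log (1 - z_j) \bigr] \Bigr)
```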
In the next post, an additional variation to the aforementioned architecture will be introduced to minimize declines of good users based on techniques related to online hard example mining (OHEM) and generative modeling. A summary of conclusions and analytical results stemming from application of such techniques to the fraud detection problem will also be provided.
1. Caruana, R. (1993). Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of the Tenth International Conference on Machine Learning.
2. Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2016.433
3. Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training Region-based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://arxiv.org/abs/1604.03540
4. Ruder, S. (2017). An Overview of Multi-Task Learning in Deep Neural Networks. https://ruder.io/multi-task/
5. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 11, 3371–3408. https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf
6. Ruder, S., Bingel, J., Augenstein, I., & Sogaard, A. (2019). Latent Multi-task Architecture Learning. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Hawaii, Jan 27 to Feb 1, 2019. https://arxiv.org/pdf/1705.08142.pdf
7. Zhang, Y., & Yang, Q. (2018). A Survey on Multi-Task Learning. https://arxiv.org/pdf/1707.08114.pdf