How We Implemented Finastra Mortgagebot’s Machine Learning Feature

Source: Deep Learning on Medium

If you’re here, you have probably seen our first Mortgagebot post, which took a more holistic view. Here, we dive into the details of the tools we used and how we implemented the latest technology to bring this project into production.

As with any machine learning project, the data is the most important part. Looking through the database, we identified 4 distinct tables containing relevant information that we wanted to use for feature generation: Account, Borrower, Property, and Product. The data was stored in a Microsoft SQL Server database with tens to hundreds of millions of records. We selected the most recent 3 years of data and moved it to Azure Blob Storage so we could use the DataBricks notebook environment for our model development.
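As a rough sketch of the windowing step, here is how a 3-year cutoff could be applied with pandas. The table and column names here are hypothetical stand-ins; the real data lives in SQL Server and Azure Blob Storage.

```python
import pandas as pd

# Hypothetical stand-in for the Account table; real column names differ.
records = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "created_date": pd.to_datetime(
        ["2016-01-15", "2017-06-30", "2018-03-01", "2019-05-20"]
    ),
})

# Keep only the most recent 3 years of data, relative to the newest record.
cutoff = records["created_date"].max() - pd.DateOffset(years=3)
recent = records[records["created_date"] > cutoff]
```

In production the same filter would run in the database query or in Spark before the copy to Blob Storage, rather than in pandas after the fact.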

Our team at the Innovation Lab is full of big Python users, and these are the libraries we used:

Numpy — Provides fast computation and manipulation of arrays and matrices for data handling

Pandas — Allows us to use in-memory table objects to organize, process, and manipulate data in an efficient manner

PySpark — The Python API for Apache Spark, a distributed in-memory computing framework

Koalas — Implements the pandas API on top of Apache Spark

Sklearn — Rapid machine learning model development for baseline and more advanced algorithms

Keras — Neural network development across many deep learning architectures

Since we were dealing with large amounts of data that would not fit in the memory on our local machines, we utilized the Microsoft Azure cloud platform. Specifically, we used Azure DataBricks, an Apache Spark-based analytics platform that is optimized for Azure. It provides an integrated, interactive development workspace where data scientists and engineers can collaborate, and it lets us create model experiments and track the results of our models across different feature sets, parameters, and model choices. We attached our DataBricks environment to a cluster, a set of virtual machines that we can configure to our liking. Ours, named develop, was a standard DS12 v2 cluster with 28.0 GB of memory, 4 cores, 1 DBU, and autoscaling between 2 and 3 workers. It ran Python 3 with the Databricks 5.4 ML runtime, which bundles all the machine learning and data science libraries we needed, along with Spark 2.4.3 and Scala 2.11. TL;DR, we used DataBricks because it was convenient and very well-suited for our needs.
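For reference, a cluster like the one described could be declared roughly like this through the Databricks Clusters API. The field names follow that API, but the exact runtime version string is an assumption on our part, so treat this as an illustrative config rather than the one we actually used.

```json
{
  "cluster_name": "develop",
  "spark_version": "5.4.x-ml-scala2.11",
  "node_type_id": "Standard_DS12_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 3
  }
}
```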

Now that we had a high powered Azure development environment in place, we were ready to dig in and start the process for building our model. We performed some statistical analysis in a pairwise fashion across our data columns, looked at correlations and t-tests, and built a preliminary model to understand how an algorithm considers the features. Some of the features that we ultimately decided to use were Employment Classification, Debt-to-Income Ratio, Income, and Marital Status.
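The pairwise screening described above can be sketched in a few lines with pandas and SciPy. The columns here are synthetic placeholders for the real Mortgagebot features, so the numbers are illustrative only; the point is the shape of the analysis, not the values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic placeholders for real features such as Income and
# Debt-to-Income Ratio, with a binary approval outcome.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(75_000, 20_000, 500),
    "dti_ratio": rng.uniform(0.1, 0.6, 500),
    "approved": rng.integers(0, 2, 500),
})

# Pairwise Pearson correlations across the columns.
corr = df.corr()

# Two-sample t-test: does income differ between approved and denied loans?
approved = df.loc[df["approved"] == 1, "income"]
denied = df.loc[df["approved"] == 0, "income"]
t_stat, p_value = stats.ttest_ind(approved, denied, equal_var=False)
```

A low p-value on a feature like income would be one signal (among several) that the feature is worth keeping for the model.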

Since our goal was to build a binary classification model to determine if a loan would be approved or not (and determine the probability), we tested the following models with rigorous parameter and hyper-parameter tuning:

Logistic Regression

Random Forest

XGBoost

LightGBM

K-Nearest Neighbors

4-layer feed-forward neural network
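The tuning loop over these candidates can be sketched with scikit-learn's GridSearchCV. For brevity this sketch covers only two of the model families (the XGBoost, LightGBM, KNN, and Keras candidates would slot into the same dictionary), and it uses synthetic data in place of the real application records.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real feature matrix and approval labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each candidate family gets its own hyper-parameter grid.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring="roc_auc")
    search.fit(X_train, y_train)
    results[name] = search

# Pick the family with the best cross-validated score.
best_name = max(results, key=lambda n: results[n].best_score_)
best_model = results[best_name].best_estimator_

# predict_proba yields the approval probability mentioned above.
proba = best_model.predict_proba(X_test)[:, 1]
```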

To save our model so we could reload and reuse it in the future, we used the joblib library to serialize the model to a .pkl file. To create a retraining pipeline, we wrapped our model in a class structure so that we could retrain it with new data streamed into the Mortgagebot database.
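A minimal sketch of such a wrapper is shown below. The class name, methods, and storage path are hypothetical, not the production code, but the joblib save/load pattern is the one described above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

class ApprovalModel:
    """Hypothetical wrapper around the trained classifier."""

    def __init__(self, model=None):
        self.model = model or LogisticRegression(max_iter=1000)

    def retrain(self, X, y):
        # Refit on the latest batch streamed from the source database.
        self.model.fit(X, y)
        return self

    def predict_proba(self, X):
        # Probability of approval for each application.
        return self.model.predict_proba(X)[:, 1]

    def save(self, path):
        joblib.dump(self, path)

    @staticmethod
    def load(path):
        return joblib.load(path)

# Train on synthetic data, persist as .pkl, and restore.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
path = os.path.join(tempfile.mkdtemp(), "approval_model.pkl")
wrapper = ApprovalModel().retrain(X, y)
wrapper.save(path)
restored = ApprovalModel.load(path)
```

Because the whole wrapper is pickled, the restored object carries both the fitted model and the retraining logic, which is what makes the automated retraining pipeline straightforward.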

Now that we have a trained model and a class in a DataBricks notebook that handles retraining and model predictions, we need to integrate this notebook into our Azure pipeline so that we can automatically retrain this model and obtain predictions in real-time.

Here is how it all works:

Data is pulled from the source system POS database via stored procedures which are executed through an Azure Data Factory pipeline

Data is copied to the staging schema in the Azure SQL PaaS database and any record that would overflow the structure of the schema is sent to a storage account for review

All data is copied to the staging schema and a merge process is initiated

Data merges successfully and the control table is updated

An Azure Data Factory pipeline kicks off the DataBricks environment

Account, Borrower, Product, and Property tables are grabbed and submitted to the DataBricks notebook for retraining and/or prediction

If the model is selected for retraining, a new model is saved in a specified path in the database file store. If the model is selected for predictions, the predictions are pushed to an Azure SQL database, which PowerBI reads from.
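The last two steps, where the notebook either retrains or predicts depending on what the pipeline selected, can be sketched as a small dispatch function. The table-loading, file-store, and Azure SQL details are simplified stand-ins here; only the retrain-or-predict branching mirrors the flow described above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_features(tables):
    """Stand-in for the real join of the four source tables."""
    X = np.hstack([tables[n] for n in ("account", "borrower", "product", "property")])
    return X, tables["labels"]

def run_notebook(mode, tables, model, model_path, sink=None):
    # mode ("retrain" or "predict") is chosen upstream by the pipeline.
    X, y = build_features(tables)
    if mode == "retrain":
        model.fit(X, y)
        joblib.dump(model, model_path)      # saved to the file store
    elif mode == "predict":
        sink(model.predict_proba(X)[:, 1])  # pushed on for the dashboard

# Demo with synthetic tables in place of the real database extracts.
rng = np.random.default_rng(0)
tables = {
    "account": rng.normal(size=(100, 2)),
    "borrower": rng.normal(size=(100, 2)),
    "product": rng.normal(size=(100, 1)),
    "property": rng.normal(size=(100, 1)),
    "labels": rng.integers(0, 2, 100),
}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
scores = []
run_notebook("retrain", tables, LogisticRegression(max_iter=1000), path)
run_notebook("predict", tables, joblib.load(path), path, sink=scores.append)
```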

Our Azure pipeline for loading, handling and choosing data to query our model.

Below is the frontend dashboard displaying the estimated approval rate for a selected bank:

As a final result we have a PowerBI Mortgagebot dashboard that is fully integrated for retraining and predictions from a machine learning model.

That’s it, guys! In a nutshell, here’s what we did:

Pulled REAL mortgage application data, then cleaned it up with Python/Spark tools

Decided which features we were going to use

Developed a few models, tuned them, and then chose the best one

Saved the model into a .pkl file for reload and reuse

Created a class structure for updating the model based on more real data coming in

Integrated the DataBricks notebook into our Azure pipeline so the Account, Borrower, Product, and Property tables are pulled in for retraining and/or prediction

Streamed the predictions to a PowerBI dashboard so we could display the results visually

Thanks for reading! If you haven’t already read the other blog post, I highly recommend checking it out here! We greatly appreciate feedback, concerns, and even questions about why we used what, so please don’t hesitate to reach out to us for whatever you need.