Modeling Bank’s Churn Rate with AdaNet: A Scalable, Flexible Auto-Ensemble Learning Framework

Logistic Regression with Google’s AdaNet, an Auto-Learning Ensemble Framework

Motivation

AdaNet provides a framework that can automatically produce a high-quality model given an arbitrary set of features and a model search space. In addition, it builds ensembles from productionized TensorFlow models to reduce prediction churn, reuse domain knowledge, and conform with business and explainability requirements. The framework is capable of handling datasets containing thousands to billions of examples in a distributed environment.

Several open-source AutoML frameworks, such as auto-sklearn and auto-pytorch, encode the expertise necessary to automate ensemble construction and achieve impressive results. In comparison, AdaNet is built with the following capabilities:

  • Built on TensorFlow to facilitate integration with TensorFlow-based production infrastructure and tooling.
  • Designed to execute efficiently on hundreds of heterogeneous workers in a distributed cluster and to run on TPUs.
  • Open-sourced to share with the entire AutoML community across companies and universities and to accelerate open research.

Architecture

AdaNet combines two orthogonal ensembling components:

  • Parallel (similar to bagging) and
  • Sequential (similar to boosting).

These form the axes of the adaptive search space that the framework iteratively explores for an optimal ensemble as illustrated in the figure below.

The search space is defined by the combination of the following components (a schematic sketch of the resulting search loop follows this list):

  • Subnetwork Generators, which generate candidate subnetworks hₜ for iteration t.
  • Ensemble Strategies, which form discrete groups of subnetworks, saving time by combining networks of different depths and breadths.
  • Ensemblers, which combine the predictions of the grouped subnetworks into ensembles. The framework is responsible for managing and training these ensembles and subnetworks.
  • In AdaNet, the units of a subnetwork learned at a previous iteration can serve as input to deeper subnetworks added later, so the deeper subnetworks can take advantage of the embeddings learned at the previous iterations.
  • The Evaluator evaluates the candidate ensembles once they have finished training, selecting and fixing f∗ₜ, the ensemble with the best performance on the objective, together with its component subnetworks.
  • The search then proceeds to iteration t+1, where the Subnetwork Generator adapts its search space according to the previous best ensemble. For example, if the subnetwork search space explores increasingly deeper neural networks and the deepest subnetwork in the ensemble is l layers deep, the Subnetwork Generator could generate one candidate subnetwork with l hidden layers and another with l+1.
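
The loop below is a minimal, purely illustrative sketch of how these components interact during the search; the names (subnetwork_generator, ensemble_strategy, ensembler, evaluator, train, num_iterations) are hypothetical stand-ins rather than the actual adanet API.

# Schematic sketch of the AdaNet search loop described above (illustrative
# pseudocode only, not the library's internals).
best_ensemble = None
for t in range(1, num_iterations + 1):
    # The Subnetwork Generator proposes candidate subnetworks h_t, adapting to
    # the best ensemble found so far.
    candidates = subnetwork_generator.generate_candidates(
        previous_ensemble=best_ensemble)

    # Ensemble Strategies group the candidates; Ensemblers combine each group's
    # predictions (e.g. with uniform averaging or learned mixture weights).
    groups = ensemble_strategy.group(candidates, previous_ensemble=best_ensemble)
    ensembles = [ensembler.combine(group, previous_ensemble=best_ensemble)
                 for group in groups]

    # Train the new subnetworks and mixture weights, then let the Evaluator
    # select and fix f*_t, the ensemble with the best objective value.
    train(ensembles)
    best_ensemble = evaluator.select_best(ensembles)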

As illustrated in the figure below, the input (blue) and the output (green) are connected through several hidden layers.

Fig: Architecture – Units in the yellow block are added at the first iteration, while units in purple are added at the second iteration. Two candidate extensions of the architecture are considered at the third iteration (shown in red): (a) a two-layer extension; (b) a three-layer extension.
  • AdaNet’s objective is to balance and optimize the trade-off between the ensemble’s performance on the training set and its ability to generalize to unseen data. A candidate subnetwork is added to the ensemble only when it improves the ensemble’s training loss more than it affects its ability to generalize, which guarantees that the generalization error of the ensemble is bounded by its training error and complexity (the objective is sketched after this list).
  • AdaNet extends TensorFlow’s tf.estimator.Estimator to encapsulate training, evaluation, prediction and export for serving, given a target task such as regression, classification, or multi-task learning. This abstraction enables the same application code to run on different hardware, including CPUs, GPUs and TPUs.
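
For reference, the trade-off above is formalized by the AdaNet objective from Cortes et al. (2017); a rough sketch of it in LaTeX notation, where Φ is a surrogate loss (e.g. the logistic loss), w_j is the mixture weight of subnetwork h_j, r_j is its Rademacher complexity, and λ, β are hyperparameters (the adanet_lambda parameter used later in the code corresponds to λ):

F(w) = \frac{1}{m}\sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{j=1}^{N} w_j\, h_j(x_i)\Big) + \sum_{j=1}^{N} \big(\lambda\, r_j + \beta\big)\, \lvert w_j \rvert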

Auto-Estimation and Adaptation

The figure below illustrates the auto-estimation and adaptation strategies employed by AdaNet during the training phase to yield the best ensemble model.

  • AdaNet allows users to specify their search spaces using the adanet.subnetwork package to define how subnetworks adapt at each iteration, and the adanet.ensemble package to define how ensembles are composed, pruned, and combined.
  • AdaNet provides AutoEnsembleEstimator for users who want a higher-level API that defines a search space in only a few lines of code using canned Estimators such as DNNEstimator and BoostedTreesClassifier (a sketch follows this list).
  • For visualizing model performance during training, the framework integrates with TensorBoard. When training is finished, the framework exports a TensorFlow SavedModel that can be deployed with TensorFlow Serving or similar services.
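
A minimal sketch of how this higher-level API can be used, adapted from the adanet README; the head, feature columns, optimizers, hidden units, feature count, and step count below are illustrative assumptions for this churn dataset, and the exact constructor arguments may vary across adanet and TensorFlow 1.x releases.

import adanet
import tensorflow as tf

# Hypothetical higher-level search space: one linear and one DNN candidate per
# iteration, ensembled automatically by AdaNet.
feature_columns = [tf.feature_column.numeric_column("x", shape=[10])]
head = tf.contrib.estimator.binary_classification_head()

auto_estimator = adanet.AutoEnsembleEstimator(
    head=head,
    # The candidate pool defines the search space.
    candidate_pool={
        "linear": tf.estimator.LinearEstimator(
            head=head,
            feature_columns=feature_columns,
            optimizer=tf.train.FtrlOptimizer(learning_rate=0.01)),
        "dnn": tf.estimator.DNNEstimator(
            head=head,
            feature_columns=feature_columns,
            hidden_units=[64, 32],
            optimizer=tf.train.AdagradOptimizer(learning_rate=0.01)),
    },
    # Number of training steps per AdaNet iteration.
    max_iteration_steps=5000)
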
The adaptive computation graph. Within an iteration, a static computation graph contains multiple subnetworks and ensembles, including the best ensemble and subnetworks from the previous iteration. Each new ensemble candidate is composed of a subset of the present subnetworks. The iteration tracks and exposes the predictions of the best ensemble at each training step. Across iterations, AdaNet adapts its computation graph for the next iteration.
  • In a typical two-phase approach, each base learner is trained in a separate process in the first phase, and the trained learners are then ensembled in the second phase.
  • However, accelerators such as TPUs do not allow multiple processes to share the same hardware, which limits the number of candidates that can be trained in parallel.
  • AdaNet creates all candidates within the same computation graph and session, including new subnetworks and ensembles for iteration t, and the best ensemble and corresponding subnetworks from t−1. This design allows candidates to share tensors; for example, subnetworks can share the same input pipeline and ensembles can share subnetwork outputs.
  • Keeping all the candidates within the same graph allows compilers such as the TensorFlow compiler and XLA to optimally place ops on logical devices, in order to maximize multi-core accelerator utilization, such as when training many small subnetworks on a TPU.
  • TensorFlow and XLA were designed for a static computation graph, which limits creating new operations during training. One workaround is to dynamically store a model in a resource variable, but it requires users to rewrite their models in low-level operations in C++ instead of Python, thereby increasing the cost of adopting AdaNet and limiting flexibility.
  • Eager execution is another possible workaround, but it supports neither distributed nor TPU training.

Adaptive Computation by AdaNet

  • AdaNet modifies the training loop to create an adaptive computation graph after completing training of all candidates in iteration t.
  • AdaNet employs a bookkeeping phase, where it reconstructs the graph with the evaluation dataset and evaluates all the candidates to determine the best ensemble for iteration t.
  • In the next step, it serializes metadata about the architecture of the best ensemble and uses this metadata to construct a new graph for t+1. The new graph includes the best ensemble along with all the new subnetworks and ensembles, and warm-starts the variables of the best ensemble from the most recent checkpoint.
  • Finally, it creates a new checkpoint with the modified graph for t+1 and increments the iteration number variable in the new checkpoint.
  • On resuming training, the Estimator constructs a new static graph based on the architecture metadata from iteration t and restores variables from the new checkpoint.
  • Evaluation and prediction have no effect on the iteration number, so their methods require no modification.
Source: https://ai.googleblog.com/2018/10/introducing-adanet-fast-and-flexible.html

Distributed Training

  • AdaNet allows training or debugging on small datasets in a single process. To speed up training and evaluation, the Estimator can distribute work across worker machines and parameter servers (a minimal configuration sketch follows this list).
  • AdaNet provides two distribution strategies: replication and an AdaNet-specific round-robin strategy.
  • The Estimator provides the default replication strategy, in which each worker replicates a copy of the full computation graph containing all candidates and shares variable updates through the parameter servers.
  • In the round-robin strategy, candidate subnetworks are placed on dedicated workers in a round-robin fashion, since subnetworks can be trained independently of one another. Certain designated workers load all subnetworks as read-only and train only the ensemble parameters.
  • The round-robin strategy reduces the load on the workers and parameter servers, speeds up training when subnetworks are large, and allows the system to scale linearly with the number of subnetworks and workers.
  • AdaNet’s distributed training and search execution are fault tolerant: a worker terminated by the cluster (due to preemption or an exception) is restored from a checkpoint and continues training with minimal loss of training time.
  • To adapt the computation graph during distributed training, AdaNet designates one worker as chief, which is responsible for bookkeeping. The other workers idly loop until the chief writes the expanded checkpoint with an incremented iteration number.
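
A minimal sketch of how a tf.estimator-based cluster is typically configured through the TF_CONFIG environment variable; this is standard Estimator distribution rather than anything AdaNet-specific, and the host names, ports, and task assignment below are placeholders.

import json
import os

# Hypothetical cluster: one chief (which also handles AdaNet's bookkeeping),
# two workers, and one parameter server. Each process sets its own "task".
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host0.example.com:2222"],
        "worker": ["host1.example.com:2222", "host2.example.com:2222"],
        "ps": ["host3.example.com:2222"],
    },
    # This particular process's role; other processes set their own type/index.
    "task": {"type": "worker", "index": 0},
})

# With TF_CONFIG set, tf.estimator.train_and_evaluate(...) runs the same code
# on every machine and distributes the work across the cluster.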

Implementation — Modeling Bank’s Churn Rate

For modelling the bank’s churn rate, the dataset has been downloaded from https://data.world/kashundavis/data-science-data-modeling, which contains both the training and testing files. The code snippet below loads the data, converts the categorical variables to numeric features (using LabelEncoder), and casts all columns of the dataframe to float (including any integer columns), since the features fed under FEATURES_KEY = "x" must be float values.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import functools

import adanet
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder


RANDOM_SEED = 42

train_data = pd.read_csv('./datasets/Churn-Modelling-Test-Data.csv')
test_data = pd.read_csv('./datasets/Churn-Modelling-Train-Data.csv')

# Drop the identifier columns (row number, customer id, surname);
# the last column is the churn label.
x_train = train_data.iloc[:, 3:-1]
y_train = train_data.iloc[:, -1:]

x_test = test_data.iloc[:, 3:-1]
y_test = test_data.iloc[:, -1:]


# Encode the categorical columns as integers and cast everything to float.
x_train['Geography_encoded'] = LabelEncoder().fit_transform(x_train['Geography'])
x_train['Gender_encoded'] = LabelEncoder().fit_transform(x_train['Gender'])

x_train = x_train.drop(['Geography', 'Gender'], axis=1)
x_train = x_train.astype(np.float32)
y_train = y_train.astype(np.float32)


x_test['Geography_encoded'] = LabelEncoder().fit_transform(x_test['Geography'])
x_test['Gender_encoded'] = LabelEncoder().fit_transform(x_test['Gender'])

x_test = x_test.drop(['Geography', 'Gender'], axis=1)
x_test = x_test.astype(np.float32)
y_test = y_test.astype(np.float32)


FEATURES_KEY = "x"
_NUM_LAYERS_KEY = "num_layers"


def input_fn(partition, training, batch_size):
  """Generates an input function for the Estimator."""

  def _input_fn():

    if partition == "train":
      dataset = tf.data.Dataset.from_tensor_slices(({
          FEATURES_KEY: tf.log1p(x_train)
      }, tf.log1p(y_train)))
    else:
      dataset = tf.data.Dataset.from_tensor_slices(({
          FEATURES_KEY: tf.log1p(x_test)
      }, tf.log1p(y_test)))

    # repeat is called after shuffling to prevent separate epochs from
    # blending together.
    if training:
      dataset = dataset.shuffle(10 * batch_size, seed=RANDOM_SEED).repeat()

    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels

  return _input_fn

The code snippet below shows how to build a deep neural network subnetwork for AdaNet, with inputs: optimizer, layer_size, num_layers, learn_mixture_weights and seed.

_SimpleDNNBuilder, together with the generator defined further below, creates two candidate fully-connected neural networks at each iteration with the same width, but one with an additional hidden layer. To make the search adaptive, each subnetwork is constructed with at least the same number of hidden layers as the most recently added subnetwork in the previous_ensemble.

Subnetworks with more hidden layers have more capacity, so their mixture weights are more heavily regularized.

class _SimpleDNNBuilder(adanet.subnetwork.Builder):
  """Builds a DNN subnetwork for AdaNet."""

  def __init__(self, optimizer, layer_size, num_layers, learn_mixture_weights,
               seed):
    """Initializes a `_SimpleDNNBuilder`.

    Args:
      optimizer: An `Optimizer` instance for training both the subnetwork and
        the mixture weights.
      layer_size: The number of nodes to output at each hidden layer.
      num_layers: The number of hidden layers.
      learn_mixture_weights: Whether to solve a learning problem to find the
        best mixture weights, or use their default value according to the
        mixture weight type. When `False`, the subnetworks will return a no_op
        for the mixture weight train op.
      seed: A random seed.

    Returns:
      An instance of `_SimpleDNNBuilder`.
    """

    self._optimizer = optimizer
    self._layer_size = layer_size
    self._num_layers = num_layers
    self._learn_mixture_weights = learn_mixture_weights
    self._seed = seed

  def build_subnetwork(self,
                       features,
                       labels,
                       logits_dimension,
                       training,
                       iteration_step,
                       summary,
                       previous_ensemble=None):
    """See `adanet.subnetwork.Builder`."""

    input_layer = tf.to_float(features[FEATURES_KEY])
    kernel_initializer = tf.glorot_uniform_initializer(seed=self._seed)
    last_layer = input_layer
    for _ in range(self._num_layers):
      last_layer = tf.layers.dense(
          last_layer,
          units=self._layer_size,
          activation=tf.nn.relu,
          kernel_initializer=kernel_initializer)
    logits = tf.layers.dense(
        last_layer,
        units=logits_dimension,
        kernel_initializer=kernel_initializer)

    persisted_tensors = {_NUM_LAYERS_KEY: tf.constant(self._num_layers)}
    return adanet.Subnetwork(
        last_layer=last_layer,
        logits=logits,
        complexity=self._measure_complexity(),
        persisted_tensors=persisted_tensors)

  def _measure_complexity(self):
    """Approximates Rademacher complexity as the square-root of the depth."""
    return tf.sqrt(tf.to_float(self._num_layers))

  def build_subnetwork_train_op(self, subnetwork, loss, var_list, labels,
                                iteration_step, summary, previous_ensemble):
    """See `adanet.subnetwork.Builder`."""
    return self._optimizer.minimize(loss=loss, var_list=var_list)

  def build_mixture_weights_train_op(self, loss, var_list, logits, labels,
                                     iteration_step, summary):
    """See `adanet.subnetwork.Builder`."""
    if not self._learn_mixture_weights:
      return tf.no_op()
    return self._optimizer.minimize(loss=loss, var_list=var_list)

  @property
  def name(self):
    """See `adanet.subnetwork.Builder`."""
    if self._num_layers == 0:
      # A DNN with no hidden layers is a linear model.
      return "linear"
    return "{}_layer_dnn".format(self._num_layers)

SimpleDNNGenerator (see Fig: Architecture) generates two DNN subnetworks at each iteration, where the first DNN has an identical shape to the most recently added subnetwork in the previous_ensemble, and the second has the same shape plus one more dense layer on top.

class SimpleDNNGenerator(adanet.subnetwork.Generator):
  """Generates two DNN subnetworks at each iteration."""

  def __init__(self,
               optimizer,
               layer_size=32,
               learn_mixture_weights=False,
               seed=None):
    """Initializes a DNN `Generator`.

    Args:
      optimizer: An `Optimizer` instance for training both the subnetwork and
        the mixture weights.
      layer_size: Number of nodes in each hidden layer of the subnetwork
        candidates. This parameter is ignored in a DNN with no hidden layers.
      learn_mixture_weights: Whether to solve a learning problem to find the
        best mixture weights, or use their default value according to the
        mixture weight type. When `False`, the subnetworks will return a no_op
        for the mixture weight train op.
      seed: A random seed.

    Returns:
      An instance of `Generator`.
    """

    self._seed = seed
    self._dnn_builder_fn = functools.partial(
        _SimpleDNNBuilder,
        optimizer=optimizer,
        layer_size=layer_size,
        learn_mixture_weights=learn_mixture_weights)

  def generate_candidates(self, previous_ensemble, iteration_number,
                          previous_ensemble_reports, all_reports, config):
    """See `adanet.subnetwork.Generator`."""

    num_layers = 0
    seed = self._seed
    if previous_ensemble:
      num_layers = tf.contrib.util.constant_value(
          previous_ensemble.weighted_subnetworks[
              -1].subnetwork.persisted_tensors[_NUM_LAYERS_KEY])
    if seed is not None:
      seed += iteration_number
    return [
        self._dnn_builder_fn(num_layers=num_layers, seed=seed),
        self._dnn_builder_fn(num_layers=num_layers + 1, seed=seed),
    ]

The code snippet below sets the learning parameters and then trains and evaluates the AdaNet estimator.

# AdaNet parameters
LEARNING_RATE = 0.001  #@param {type:"number"}
TRAIN_STEPS = 100000  #@param {type:"integer"}
BATCH_SIZE = 32  #@param {type:"integer"}

LEARN_MIXTURE_WEIGHTS = False  #@param {type:"boolean"}
ADANET_LAMBDA = 0  #@param {type:"number"}
BOOSTING_ITERATIONS = 5  #@param {type:"integer"}


def train_and_evaluate(learn_mixture_weights=LEARN_MIXTURE_WEIGHTS,
                       adanet_lambda=ADANET_LAMBDA):
  """Trains an `adanet.Estimator` to predict churn (yes/no)."""

  estimator = adanet.Estimator(
      # Since we are predicting churn, we'll use a regression
      # head that optimizes for MSE.
      head=tf.contrib.estimator.regression_head(
          loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE),

      # Define the generator, which defines our search space of subnetworks
      # to train as candidates to add to the final AdaNet model.
      subnetwork_generator=SimpleDNNGenerator(
          optimizer=tf.train.RMSPropOptimizer(learning_rate=LEARNING_RATE),
          learn_mixture_weights=learn_mixture_weights,
          seed=RANDOM_SEED),

      # Lambda is the strength of complexity regularization. A larger value
      # penalizes more complex subnetworks.
      adanet_lambda=adanet_lambda,

      # The number of train steps per iteration.
      max_iteration_steps=TRAIN_STEPS // BOOSTING_ITERATIONS,

      # The evaluator will evaluate the model on the full training set to
      # compute the overall AdaNet loss (train loss + complexity
      # regularization) to select the best candidate to include in the
      # final AdaNet model.
      evaluator=adanet.Evaluator(
          input_fn=input_fn("train", training=False, batch_size=BATCH_SIZE)),

      # The report materializer will evaluate the subnetworks' metrics
      # using the full training set to generate the reports that the generator
      # can use in the next iteration to modify its search space.
      report_materializer=adanet.ReportMaterializer(
          input_fn=input_fn("train", training=False, batch_size=BATCH_SIZE)),

      # Configuration for Estimators.
      config=tf.estimator.RunConfig(
          save_checkpoints_steps=50000,
          save_summary_steps=50000,
          tf_random_seed=RANDOM_SEED))

  # Train and evaluate using the tf.estimator tooling.
  train_spec = tf.estimator.TrainSpec(
      input_fn=input_fn("train", training=True, batch_size=BATCH_SIZE),
      max_steps=TRAIN_STEPS)
  eval_spec = tf.estimator.EvalSpec(
      input_fn=input_fn("test", training=False, batch_size=BATCH_SIZE),
      steps=None)
  return tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)


def ensemble_architecture(result):
  """Extracts the ensemble architecture from evaluation results."""

  architecture = result["architecture/adanet/ensembles"]
  # The architecture is a serialized Summary proto for TensorBoard.
  summary_proto = tf.summary.Summary.FromString(architecture)
  return summary_proto.value[0].tensor.string_val[0]


results, _ = train_and_evaluate()
print("Loss:", results["average_loss"])
print("Architecture:", ensemble_architecture(results))

results, _ = train_and_evaluate(learn_mixture_weights=True)
print("Loss:", results["average_loss"])
print("Results:", results)
print("Architecture:", ensemble_architecture(results))

results, _ = train_and_evaluate(learn_mixture_weights=True, adanet_lambda=.015)
print("Loss:", results["average_loss"])
print("Results:", results)
print("Architecture:", ensemble_architecture(results))

Results

The results obtained without setting any custom parameters:

Loss: 0.07183863
Architecture: b’| 1_layer_dnn | 2_layer_dnn | 3_layer_dnn | 4_layer_dnn | 4_layer_dnn |’

The results obtained after setting the parameter learn_mixture_weights to True:

Loss: 0.094054244
Results: {'architecture/adanet/ensembles': b'\nn\n\x13architecture/adanetBM\x08\x07\x12\x00BG| 1_layer_dnn | 2_layer_dnn | 3_layer_dnn | 4_layer_dnn | 4_layer_dnn |J\x08\n\x06\n\x04text', 'label/mean': 0.14119416, 'prediction/mean': 0.20065315, 'loss': 0.094078295, 'global_step': 100000, 'average_loss': 0.094054244}
Architecture: b'| 1_layer_dnn | 2_layer_dnn | 3_layer_dnn | 4_layer_dnn | 4_layer_dnn |'

The results obtained after setting learn_mixture_weights to True and adanet_lambda to 0.015. Lambda is the strength of complexity regularization; a larger value penalizes more complex subnetworks.

Loss: 0.08251198
Results: {'architecture/adanet/ensembles': b'\ni\n\x13architecture/adanetBH\x08\x07\x12\x00BB| linear | 1_layer_dnn | 2_layer_dnn | 3_layer_dnn | 4_layer_dnn |J\x08\n\x06\n\x04text', 'label/mean': 0.14119416, 'prediction/mean': 0.20957457, 'loss': 0.0825069, 'global_step': 100000, 'average_loss': 0.08251198}
Architecture: b'| linear | 1_layer_dnn | 2_layer_dnn | 3_layer_dnn | 4_layer_dnn |'
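
As mentioned earlier, the trained Estimator can export a TensorFlow SavedModel for TensorFlow Serving. The snippet below is a minimal, hedged sketch of one way to do this explicitly, assuming the adanet.Estimator instance is kept in scope after training; the export path and placeholder shape (10 features after encoding) are illustrative, and older TensorFlow 1.x releases expose the same functionality as export_savedmodel rather than export_saved_model.

# Serving input function that accepts raw float features under the same
# FEATURES_KEY used during training.
serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
    FEATURES_KEY: tf.placeholder(dtype=tf.float32, shape=[None, 10],
                                 name=FEATURES_KEY)
})

# Export the best ensemble as a SavedModel for TensorFlow Serving.
export_dir = estimator.export_saved_model("./export", serving_input_fn)
print("SavedModel written to:", export_dir)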

Conclusion

  • AdaNet is a flexible and scalable framework for training, evaluating, and deploying ensembles of TensorFlow models (e.g. deep neural networks, trees, and linear models).
  • It provides AutoML capabilities including automatic search over a space of candidate ensembles, supports CPU, GPU, and TPU hardware, and can scale from a single process to a cluster seamlessly with tf.estimator.Estimator infrastructure.
  • AdaNet models can serve as replacements for existing Estimator models, and integrate with tools in the TensorFlow open-source ecosystem (https://github.com/tensorflow) like TensorFlow Hub, Model Analysis, and Serving.
  • The framework is flexible and can be extended to include a prior (e.g. fine-tuned production models) in its search space. It offers several out-of-the-box ways of training and ensembling subnetworks (e.g. uniform average weighting, learned mixture weights).
  • Automatic ensembling of neural networks has two main challenges: choosing the best subnetwork architectures, and using the right number of subnetworks. The AdaNet framework supports different ways to handle the learning of the mixture weights and to tackle these two challenges.

References

  1. https://github.com/tensorflow/adanet/
  2. https://deepai.org/publication/adanet-a-scalable-and-flexible-framework-for-automatically-learning-ensembles