Wide & Deep Learning for Recommender Systems

Source: Deep Learning on Medium

The paper presented here, Wide & Deep Learning, jointly trains wide linear models and deep neural networks to combine the benefits of memorisation and generalisation for recommender systems. The authors report that they productionised and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps.

Claim: online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models.

Recommender systems as a ranking system

A recommender system can be viewed as a search ranking system, where the input query is a set of user and contextual information, and the output is a ranked list of items. Given a query, the recommendation task is to find the relevant items in a database and then rank them based on certain objectives, such as clicks or purchases.
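The retrieve-then-rank view above can be sketched in a few lines of Python. This is only an illustration: the `rank_items` helper, the app names and the scores in `TOY_SCORES` are hypothetical stand-ins for a learned model, not anything taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def rank_items(
    query: Dict[str, object],
    candidates: Sequence[str],
    score: Callable[[Dict[str, object], str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate for the query and return them best-first."""
    scored = [(item, score(query, item)) for item in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores standing in for a model's predicted click probability.
TOY_SCORES = {"chess_app": 0.7, "news_app": 0.2, "puzzle_app": 0.9}

def toy_score(query: Dict[str, object], item: str) -> float:
    # A real model would use the query features; this toy scorer ignores them.
    return TOY_SCORES[item]

query = {"user_language": "en", "hour_of_day": 20}  # user and contextual information
ranked = rank_items(query, list(TOY_SCORES), toy_score)
ranked_names = [item for item, _ in ranked]
```

The rest of the article is about how the scoring function is built; the ranking step itself stays this simple.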

Memorisation vs Generalisation

Memorisation can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data.

1. Memorisation of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalisation requires more feature engineering effort.

2. Recommendations based on memorisation are usually more topical and directly relevant to the items on which users have already performed actions.

3. Memorisation can be achieved effectively using cross-product transformations over sparse features. Such a transformation captures how the co-occurrence of a feature pair correlates with the target label.

4. One limitation of cross-product transformations is that they do not generalise to query-item feature pairs that have not appeared in the training data.

5. Wide linear models can effectively memorise sparse feature interactions using cross-product feature transformations.

Generalisation, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past.

1. With less feature engineering, deep neural networks can generalise better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features.

2. However, deep neural networks with embeddings can over-generalise and recommend less relevant items when the user-item interactions are sparse and high-rank.

3. Generalisation tends to improve the diversity of the recommended items. Generalisation can be added by using features that are less granular, but manual feature engineering is often required.

4. For massive-scale online recommendation and ranking systems in an industrial setting, generalised linear models such as logistic regression are widely used because they are simple, scalable and interpretable. The models are often trained on binarised sparse features with one-hot encoding.
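As a minimal sketch of the last point, the snippet below one-hot encodes a categorical value and feeds it to a logistic regression. The vocabulary and the weights are hypothetical values chosen only for illustration.

```python
import math
from typing import List, Sequence

def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Binarise a categorical value: 1.0 at the matching vocabulary slot, else 0.0."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

def predict(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic regression: sigmoid of the weighted feature sum plus bias."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

LANGUAGES = ["en", "fr", "de"]   # hypothetical vocabulary
weights = [0.2, -0.4, 0.1]       # hypothetical learned weights, one per vocabulary entry

x = one_hot("fr", LANGUAGES)     # only the "fr" slot is active
p = predict(x, weights, bias=0.0)
```

Because exactly one slot is active, the prediction depends on a single weight, which is what makes such models easy to inspect.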

Dense vs Sparse interactions

In generalisation scenario

Embedding-based models, such as factorization machines or deep neural networks, can generalize to previously unseen query-item feature pairs by learning a low-dimensional dense embedding vector for each query and item feature, with less burden of feature engineering. However, it is difficult to learn effective low-dimensional representations for queries and items when the underlying query-item matrix is sparse and high-rank, such as users with specific preferences or niche items with a narrow appeal.

In such cases, there should be no interactions between most query-item pairs, but dense embeddings will lead to nonzero predictions for all query-item pairs, and thus can over-generalize and make less relevant recommendations.

In memorisation scenario

On the other hand, linear models with cross-product feature transformations can memorize these “exception rules” with much fewer parameters.
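The contrast between the two scenarios can be illustrated with a toy example. Everything here is assumed for the sketch: the user and item names, the random embeddings, and the single stored cross-product weight.

```python
import random

random.seed(0)
DIM = 4
users = ["u1", "u2", "u3"]
items = ["a", "b", "c"]

# Dense side: every user and item gets a (here random) embedding vector.
user_emb = {u: [random.gauss(0.0, 1.0) for _ in range(DIM)] for u in users}
item_emb = {i: [random.gauss(0.0, 1.0) for _ in range(DIM)] for i in items}

def dense_score(user: str, item: str) -> float:
    """Dot product of embeddings: defined, and generally nonzero, for any pair."""
    return sum(p * q for p, q in zip(user_emb[user], item_emb[item]))

# Wide side: one weight per observed (user, item) cross-product feature.
cross_weights = {("u1", "a"): 2.0}  # a single memorised "exception rule"

def wide_score(user: str, item: str) -> float:
    """Pairs never seen in training contribute exactly zero."""
    return cross_weights.get((user, item), 0.0)

pairs = [(u, i) for u in users for i in items]
dense_nonzero = sum(1 for u, i in pairs if dense_score(u, i) != 0.0)
wide_nonzero = sum(1 for u, i in pairs if wide_score(u, i) != 0.0)
```

The dense model produces a score for all nine pairs, while the wide model stays silent except for the one pair it has memorised.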

The model

1. wide component

The wide component is a generalized linear model of the form y = w^T x + b, as illustrated in Figure 1 (left). Here y is the prediction, x = [x1, x2, …, xd] is a vector of d features, w = [w1, w2, …, wd] are the model parameters and b is the bias. The feature set includes raw input features and transformed features. One of the most important transformations is the cross-product transformation, which is defined as:

$$\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$$

where c_ki is 1 if the i-th feature is part of the k-th transformation φ_k, and 0 otherwise.

This captures the interactions between the binary features, and adds nonlinearity to the generalized linear model.
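A minimal sketch of the cross-product transformation, assuming the paper's definition φ_k(x) = ∏ x_i^(c_ki), where c_ki is 1 if feature i takes part in transformation k and 0 otherwise. The feature names and the selector `AND_FEMALE_EN` are illustrative.

```python
from typing import Sequence

def cross_product(x: Sequence[int], c: Sequence[int]) -> int:
    """Return 1 only if every feature selected by c is active in x."""
    result = 1
    for x_i, c_ki in zip(x, c):
        result *= x_i ** c_ki  # x_i ** 0 == 1, so unselected features are ignored
    return result

# Binary feature vector layout (illustrative):
FEATURES = ["gender=female", "language=en", "language=fr"]
AND_FEMALE_EN = [1, 1, 0]  # selects "gender=female" AND "language=en"

female_en = cross_product([1, 1, 0], AND_FEMALE_EN)  # both selected features active
female_fr = cross_product([1, 0, 1], AND_FEMALE_EN)  # "language=en" is inactive
```

For binary inputs the product behaves like a logical AND, which is the nonlinearity the wide model gains from this transformation.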

2. deep component

The deep component is a feed-forward neural network, as shown in Figure 1 (right). For categorical features, the original inputs are feature strings (e.g., “language=en”). Each of these sparse, high-dimensional categorical features is first converted into a low-dimensional and dense real-valued vector, often referred to as an embedding vector. The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly and then the values are trained to minimize the final loss function during model training. These low-dimensional dense embedding vectors are then fed into the hidden layers of a neural network in the forward pass. Specifically, each hidden layer performs the following computation:

$$a^{(l+1)} = f\left(W^{(l)} a^{(l)} + b^{(l)}\right)$$

where l is the layer number and f is the activation function, often a rectified linear unit (ReLU). a^(l), b^(l), and W^(l) are the activations, bias, and model weights at the l-th layer.

Deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings.
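The computation of a single hidden layer, a^(l+1) = f(W^(l) a^(l) + b^(l)) with f = ReLU, can be sketched as follows. The weights, bias and input activations are hand-picked numbers for illustration, not values from the paper.

```python
from typing import List, Sequence

def relu(values: Sequence[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def hidden_layer(
    weights: Sequence[Sequence[float]],  # one row per output unit
    bias: Sequence[float],
    activations: Sequence[float],
) -> List[float]:
    """Compute f(W a + b) with f = ReLU."""
    pre_activation = [
        sum(w * a for w, a in zip(row, activations)) + b_j
        for row, b_j in zip(weights, bias)
    ]
    return relu(pre_activation)

a0 = [1.0, -2.0]                             # activations entering the layer
W0 = [[0.5, 0.25], [1.0, -1.0], [2.0, 0.5]]  # 3 output units, 2 inputs
b0 = [0.0, 0.5, -0.5]
a1 = hidden_layer(W0, b0, a0)
```

Stacking several such calls, each fed with the previous output, gives the forward pass of the deep component.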

Joint training of the wide and deep component

The wide component and deep component are combined using a weighted sum of their output log odds as the prediction, which is then fed to one common logistic loss function for joint training.

Joint training of a Wide & Deep Model is done by backpropagating the gradients from the output to both the wide and deep part of the model simultaneously using mini-batch stochastic optimization.

In the experiments, the authors used the Follow-the-regularized-leader (FTRL) algorithm with L1 regularization as the optimizer for the wide part of the model, and AdaGrad for the deep part. For a logistic regression problem, the model’s prediction is:

$$P(Y = 1 \mid x) = \sigma\left(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b\right)$$

where Y is the binary class label, σ(·) is the sigmoid function, φ(x) are the cross-product transformations of the original features x, and b is the bias term. w_wide is the vector of all wide model weights, and w_deep are the weights applied on the final activations a^(l_f).
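The combined prediction can be sketched directly from the symbols defined above, assuming the form P(Y = 1 | x) = σ(w_wide^T [x, φ(x)] + w_deep^T a^(l_f) + b). The function name `wide_deep_predict` and all numeric values are illustrative.

```python
import math
from typing import Sequence

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def wide_deep_predict(
    x: Sequence[float],        # raw wide features
    phi_x: Sequence[float],    # cross-product transformations of x
    a_final: Sequence[float],  # final activations of the deep part
    w_wide: Sequence[float],   # weights over the concatenation [x, phi(x)]
    w_deep: Sequence[float],   # weights over a_final
    b: float,
) -> float:
    """Sum the wide and deep log odds, add the bias, and squash with a sigmoid."""
    wide_logit = dot(w_wide, list(x) + list(phi_x))
    deep_logit = dot(w_deep, a_final)
    return sigmoid(wide_logit + deep_logit + b)

p = wide_deep_predict(
    x=[1.0, 0.0], phi_x=[1.0], a_final=[0.5, 2.0],
    w_wide=[0.2, -0.3, 0.4], w_deep=[1.0, -0.5], b=0.1,
)
```

Here the wide part contributes a logit of about 0.6 and the deep part -0.5, so the two components can push the prediction in opposite directions before the sigmoid is applied.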

Joint training vs Ensemble

In an ensemble, individual models are trained separately without knowing each other, and their predictions are combined only at inference time but not at training time. In contrast, joint training optimizes all parameters simultaneously by taking both the wide and deep part as well as the weights of their sum into account at training time.

According to the paper there are implications on model size too: For an ensemble, since the training is disjoint, each individual model size usually needs to be larger (e.g., with more features and transformations) to achieve reasonable accuracy for an ensemble to work. In comparison, for joint training the wide part only needs to complement the weaknesses of the deep part with a small number of cross-product feature transformations, rather than a full-size wide model.
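To make the difference from an ensemble concrete, here is a sketch of one joint training step in which a single logistic loss updates both parts at once. It makes several simplifying assumptions: plain SGD is used for both parts instead of the FTRL and AdaGrad optimizers mentioned earlier, the deep part has a single ReLU layer, and all names and initial values are hypothetical.

```python
import math
from typing import Dict, List

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def joint_step(x_wide: List[float], x_deep: List[float], y: int,
               params: Dict[str, list], lr: float = 0.1) -> float:
    """One SGD step on the shared loss; returns the loss before the update."""
    w_wide, W1, w_deep, b = (params[k] for k in ("w_wide", "W1", "w_deep", "b"))
    # Forward pass: deep hidden layer, then the summed log odds.
    h_pre = [sum(w * x for w, x in zip(row, x_deep)) for row in W1]
    h = [max(0.0, v) for v in h_pre]
    logit = (sum(w * x for w, x in zip(w_wide, x_wide))
             + sum(w * a for w, a in zip(w_deep, h)) + b[0])
    p = sigmoid(logit)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    # Backward pass: the same error signal (p - y) reaches both parts.
    g = p - y
    grad_h = [g * w for w in w_deep]  # computed before w_deep is updated
    params["w_wide"] = [w - lr * g * x for w, x in zip(w_wide, x_wide)]
    params["w_deep"] = [w - lr * g * a for w, a in zip(w_deep, h)]
    params["W1"] = [
        [w - lr * gh * (1.0 if pre > 0.0 else 0.0) * x for w, x in zip(row, x_deep)]
        for row, gh, pre in zip(W1, grad_h, h_pre)
    ]
    params["b"] = [b[0] - lr * g]
    return loss

params = {
    "w_wide": [0.0, 0.0],
    "W1": [[0.5, -0.5], [0.3, 0.8]],
    "w_deep": [0.1, -0.1],
    "b": [0.0],
}
losses = [joint_step([1.0, 0.0], [1.0, 2.0], 1, params) for _ in range(20)]
```

In an ensemble, each model would run its own version of this loop with its own loss; here the wide weights only need to correct whatever error the deep part leaves behind.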

Training details

During training, the input layer takes in training data and vocabularies and generates sparse and dense features together with a label. The wide component consists of the cross-product transformation of user installed apps and impression apps. For the deep part of the model, a 32-dimensional embedding vector is learned for each categorical feature. The model concatenates all the embeddings together with the dense features, resulting in a dense vector of approximately 1200 dimensions. The concatenated vector is then fed into 3 ReLU layers, and finally the logistic output unit.
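The deep part's input pipeline can be sketched at toy scale. Only the 32-dimensional embeddings and the three ReLU layers come from the description above; the feature names, vocabularies, dense values and hidden-layer widths are hypothetical, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)
EMBEDDING_DIM = 32            # per categorical feature, as described above
HIDDEN_WIDTHS = [8, 4, 2]     # hypothetical widths for the 3 ReLU layers
CATEGORICAL_VOCABS = {        # hypothetical features and vocabularies
    "language": ["en", "fr"],
    "device": ["phone", "tablet"],
    "country": ["us", "in", "br"],
}

def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

# One embedding vector per vocabulary entry of each categorical feature.
embeddings = {
    feature: {value: rand_vec(EMBEDDING_DIM) for value in vocab}
    for feature, vocab in CATEGORICAL_VOCABS.items()
}

def build_input(categorical, dense):
    """Concatenate the looked-up embeddings with the dense features."""
    vec = []
    for feature in CATEGORICAL_VOCABS:
        vec.extend(embeddings[feature][categorical[feature]])
    vec.extend(dense)
    return vec

inputs = build_input(
    {"language": "en", "device": "phone", "country": "br"},
    dense=[0.3, 1.0, 0.0, 0.5],
)

# Three ReLU layers followed by a logistic output unit.
sizes = [len(inputs)] + HIDDEN_WIDTHS
layers = [
    ([rand_vec(n_in) for _ in range(n_out)], [0.0] * n_out)
    for n_in, n_out in zip(sizes, sizes[1:])
]
w_out = rand_vec(HIDDEN_WIDTHS[-1])

def forward(vec):
    a = vec
    for W, b in layers:
        a = [max(0.0, sum(w * x for w, x in zip(row, a)) + b_j)
             for row, b_j in zip(W, b)]
    logit = sum(w * x for w, x in zip(w_out, a))
    return 1.0 / (1.0 + math.exp(-logit))

p = forward(inputs)
```

With three categorical features and four dense values the concatenated vector has 3 × 32 + 4 = 100 dimensions; the production model reaches roughly 1200 in the same way, with more features.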


The Wide & Deep model improved the app acquisition rate on the main landing page of the app store by +3.9% relative to the control group (statistically significant). The results were also compared with another 1% group using only the deep part of the model with the same features and neural network structure, and the Wide & Deep model had a +1% gain on top of the deep-only model (statistically significant).