Deep Cross Network (DCN) for Deep Learning Recommendation Systems


For the last 2 weeks, I have researched and worked on developing a Video Recommendation System. For many years, Content-based and Collaborative Filtering approaches have been heavily used in Recommendation Systems. Content-based systems rely on similarity among items’ characteristics (e.g. cosine similarity), while Collaborative Filtering systems rely on user-item interactions (e.g. Alternating Least Squares). These 2 approaches have gained success in industry; however, based on my understanding, they are limited by large sparse matrices and poor generalization. Hence, in this article, I introduce and explain the Deep Cross Network for Recommendation Systems presented in the following paper: Deep & Cross Network for Ad Click Predictions [1].

Motivation

Click-Through Rate (CTR) prediction is a large-scale problem in the advertising industry: advertisers pay publishers to display their ads, and ads that are clicked frequently improve product/brand recognition. Examining and predicting the CTR of ads tells advertisers and publishers how to improve them. Hence, CTR prediction can also be considered a recommendation problem.

Now, let’s dig deep into Deep Cross Network and see how simple and interesting it is.

The rest of this article is structured into the following parts:

  • Embedding & Stacking Layer
  • Cross Network
  • Deep Network
  • Combination Output Layer
  • Log Loss
  • Implementation & Training
  • Model Analysis: Advantages

Embedding & Stacking Layer

In the exploding era of the Internet and Social Media, the size and dimensionality of data generated by humans increase dramatically. To avoid extensive task-specific feature engineering (which accounts for up to 75% of the time of AI projects), an Embedding layer is used after the Input layer in Fig. 1 to convert sparse features (e.g. categorical features) into low-dimensional dense features. The Embedding layer was first invented in Natural Language Processing to convert tokenized words into fixed-dimension, dense vectors, aka Word Embeddings [5]. This avoids the high-dimensional and sparse features generated by CountVectorizer or TF-IDF [2], which leads to less computation and less feature engineering on text. Then, processed continuous features (e.g. normalized) and the embeddings of categorical features are stacked (concatenated) together.

Figure 1
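
To make the stacking step concrete, here is a minimal Keras sketch, assuming one categorical feature with 10,000 ids and 13 continuous features (all names and sizes are made up for illustration, not taken from the paper):

```python
import tensorflow as tf

# Hypothetical inputs: one categorical id and 13 already-normalized continuous features.
cat_in = tf.keras.layers.Input(shape=(1,), dtype="int32", name="category_id")
num_in = tf.keras.layers.Input(shape=(13,), name="continuous_features")

# Embedding layer: sparse id -> low-dimensional dense vector (dimension 8 chosen arbitrarily).
cat_emb = tf.keras.layers.Embedding(input_dim=10_000, output_dim=8)(cat_in)
cat_emb = tf.keras.layers.Flatten()(cat_emb)

# Stacking layer: concatenate the embedding with the continuous features into x0.
x0 = tf.keras.layers.Concatenate(name="x0_stacked")([cat_emb, num_in])
```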

Cross Network

In [3], He et al. proposed the Residual Network to address vanishing gradients, a fundamental problem of many very deep neural networks. In DCN, the Cross Network is built on the same residual idea: each layer is computed as follows:

Figure 2

In Fig. 2, the function f is the feature-crossing function used to generate synthetic features [4]. Synthetic features are combinations of non-linear features (e.g. categories of movies) and linear features (e.g. movie ratings). Hence, this process merges categorical and continuous features, which simplifies and largely avoids manual feature engineering. If you look at Fig. 2 carefully, each layer of the Cross Network is simply a linear model with the previous layer’s output added back (the residual connection); no activation is applied here. The Cross Network plays a similar role to the Wide part of the Wide & Deep model [6]: both are responsible for memorization, i.e. learning the frequent co-occurrence of items or features and exploiting the correlations available in the historical data.
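
As a rough sketch (not the authors’ code), a single cross layer from [1] can be written in a few lines of NumPy, where x0 is the stacked vector from the Embedding & Stacking layer:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    # One cross layer: x_{l+1} = x0 * (xl . w) + b + xl
    # x0, xl : (d,) stacked input and previous layer's output
    # w, b   : (d,) learnable weight and bias of this layer
    # xl . w is a scalar, so x0 * (xl . w) is the explicit feature crossing;
    # adding xl back at the end is the residual connection discussed above.
    return x0 * np.dot(xl, w) + b + xl

# Toy example: stack 3 cross layers on a 4-dimensional input.
d = 4
rng = np.random.default_rng(0)
x0 = rng.normal(size=d)
x = x0
for _ in range(3):
    x = cross_layer(x0, x, rng.normal(size=d), np.zeros(d))
```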

Deep Network

In Fig. 1, the Deep Network is a stack of Fully-Connected (Dense) layers with the ReLU activation function. It is similar to the Deep part of the Wide & Deep Learning model [6]: it is responsible for generalization, i.e. exploring new feature combinations that have never or rarely occurred in the past.
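
A minimal Keras sketch of the deep part, assuming the stacked input x0 is 64-dimensional and using two hidden layers of 128 units (both sizes are arbitrary choices, not the paper’s settings):

```python
import tensorflow as tf

# Placeholder input standing in for the stacked vector x0.
x0 = tf.keras.layers.Input(shape=(64,), name="x0_stacked")

# Deep Network: a stack of Dense layers with ReLU activation.
deep = x0
for units in (128, 128):
    deep = tf.keras.layers.Dense(units, activation="relu")(deep)

deep_model = tf.keras.Model(x0, deep, name="deep_network")
```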

Combination Output Layer

Outputs of the Deep and Cross Networks are concatenated and fed into a standard logits layer. The output head could be modified for other tasks. In [1], a sigmoid is chosen to predict a probability (0 < x < 1), i.e. the chance that a user clicks on an ad. The sigmoid is computed as follows:

Figure 3
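
Continuing the sketch, the combination head could look like this in Keras (the widths 64 and 128 are placeholders for whatever the cross and deep parts actually produce):

```python
import tensorflow as tf

# Placeholder tensors for the outputs of the Cross Network and the Deep Network.
cross_out = tf.keras.layers.Input(shape=(64,), name="cross_output")
deep_out = tf.keras.layers.Input(shape=(128,), name="deep_output")

# Combination output layer: concatenate both parts, then one logit squashed by a
# sigmoid to produce the predicted click probability.
combined = tf.keras.layers.Concatenate()([cross_out, deep_out])
p_click = tf.keras.layers.Dense(1, activation="sigmoid", name="ctr")(combined)

head = tf.keras.Model([cross_out, deep_out], p_click, name="combination_head")
```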

Log Loss

Log Loss is identical to Cross Entropy; you can read more about Cross Entropy in my Cross Entropy post or in other articles. In [1], the authors added L2 regularization to the Log Loss in order to prevent overfitting (Fig. 4).

Figure 4
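
A minimal NumPy sketch of this objective (binary log loss plus an L2 penalty on the weights; the regularization strength lam below is an arbitrary example value, not the paper’s setting):

```python
import numpy as np

def logloss_with_l2(y_true, y_pred, weights, lam=1e-4, eps=1e-7):
    # y_true : 0/1 click labels, y_pred : predicted click probabilities
    # weights: list of weight arrays to regularize, lam: regularization strength
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    logloss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    l2_penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return logloss + l2_penalty

# Toy usage with made-up labels, predictions, and one weight vector.
y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.6])
print(logloss_with_l2(y_true, y_pred, weights=[np.ones(4)]))
```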

Implementation & Training

In [1], the authors used the Criteo Display Ads dataset and non-CTR datasets (e.g. the CoverType dataset) to evaluate the model; the results can be examined in [1]. For implementation, the real-valued (continuous) features are normalized, and the categorical features are embedded into dense vectors (through the Embedding layer) of dimension 6 × (category cardinality)^0.25. The 1/4 power is chosen to reduce the dimensionality of the sparse features in the embedding vectors [7] and could be tuned to optimize the model’s performance. Optimization details and hyperparameters are available in [1].
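
As a quick worked example of that sizing rule (the cardinalities below are made up):

```python
def embedding_dim(cardinality: int) -> int:
    # 6 x cardinality^(1/4), rounded to the nearest integer
    return round(6 * cardinality ** 0.25)

print(embedding_dim(10_000))  # 10_000^0.25 = 10 -> dimension 60
print(embedding_dim(81))      # 81^0.25     = 3  -> dimension 18
```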

Model Analysis

At this point, you have probably understood how DCN works. The most noticeable advantage of DCN is that it simplifies the feature engineering process, which accounts for up to 75% of the time of many AI/ML projects. Feature engineering is a nontrivial process that may require manual work, exhaustive searching/testing, and knowledge of the data domain. DCN sidesteps much of this by using the Embedding layer to reduce the sparse, high-dimensional categorical features into low-dimensional dense features, and the Cross Network to learn feature crosses automatically.

Last thoughts

I have read and worked with DL models for Computer Vision and NLP projects. However, the Deep Cross Network in [1] simply caught my attention in Recommendation Systems. Also, since I do not see many posts about Deep Learning in Recommendation Systems, I wrote this article to share my thoughts and the key points of DCN. Hopefully, this article is helpful for all of you.

References

[1] Deep & Cross Network for Ad Click Predictions, https://arxiv.org/pdf/1708.05123.pdf

[2] An Introduction to Bag-of-Words in NLP, https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0#:~:text=In%20memory%2Dbased%20algorithms%2C%20we,Pearson%20correlation%20or%20cosine%20similarity.

[3] Deep Residual Learning for Image Recognition, https://arxiv.org/pdf/1512.03385.pdf

[4] Feature Crosses, https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture

[5] What are Word Embeddings?, https://machinelearningmastery.com/what-are-word-embeddings/#:~:text=A%20word%20embedding%20is%20a,challenging%20natural%20language%20processing%20problems.

[6] Wide & Deep Learning for Recommender Systems, https://arxiv.org/pdf/1606.07792.pdf

[7] Embedding layer size rule, https://forums.fast.ai/t/embedding-layer-size-rule/50691