Deep Interest Network for CTR Prediction | Paper Review, Explanation and Implementation

Source: Deep Learning on Medium


Before getting into the paper, let’s discuss why CTR prediction is needed.

What does a good recommendation system look like?
A recommendation system shows the most relevant items based on a user’s historical behavior and ranks those items by predicted CTR.

For example, in the paper Alibaba uses the DIN model to predict CTR so that candidate ads can be ranked and shown to users according to their interests.

In the advertising world, the main metrics are:
+ CPC — Cost per Click
+ CPM — Cost per 1000 impressions
+ RPM — Revenue per 1000 impressions
+ CPE — Cost per engagement (for example, whether a user adds a product to the cart (ATC) or completes a transaction)

To maximize both revenue and user experience, ads are ranked by eCPM (effective cost per mille), the product of the bid price and CTR, where CTR is the click-through rate predicted by the model.
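As a minimal sketch of the ranking rule above, the snippet below sorts a few candidate ads by bid price times predicted CTR. The ad IDs, bids, and CTR values are made-up illustrative numbers, and `rank_by_ecpm` is a hypothetical helper, not something from the paper.

```python
def rank_by_ecpm(candidates):
    """Sort candidate ads by bid * predicted CTR, highest score first."""
    return sorted(candidates, key=lambda ad: ad["bid"] * ad["pctr"], reverse=True)


# Hypothetical candidates: "pctr" stands for the CTR predicted by the model.
ads = [
    {"ad_id": "A", "bid": 2.0, "pctr": 0.010},  # score 0.020
    {"ad_id": "B", "bid": 0.5, "pctr": 0.050},  # score 0.025
    {"ad_id": "C", "bid": 1.0, "pctr": 0.015},  # score 0.015
]

ranked_ids = [ad["ad_id"] for ad in rank_by_ecpm(ads)]
print(ranked_ids)  # ['B', 'A', 'C']
```

Ad B wins despite the lowest bid because its predicted CTR is much higher, which is why the accuracy of the CTR model feeds directly into the ranking.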

CTR prediction has a direct impact on final revenue and plays a key role in the advertising system and recommendations.

The paper’s main point is that compressing a user into a single fixed-length vector is a bottleneck for capturing the user’s diverse interests. It also observes that, for a given candidate product or ad, only part of the user’s interests influences the decision to click or not. For example, a female swimmer will click a recommended pair of goggles mostly because she bought a bathing suit, not because of the shoes in last week’s shopping list.

Based on this idea, the DIN model pays attention to the related user interests by soft-searching for the relevant parts of historical behavior, and uses weighted sum pooling to obtain a representation of user interests with respect to the candidate ad.


Features that depict users and ads are the basic elements of CTR modeling in an advertising system.
Feature Representation:
Data in industrial CTR prediction tasks is mostly in a multi-group categorical form, for example, [weekday=Friday, gender=Female, visited_cate_ids={Bag,Book}, ad_cate_id=Book], which is normally transformed into high-dimensional sparse binary features via one-hot or multi-hot encoding.

Fig. 2 Feature Representation
Fig. 3 Statistics of feature sets used in the display advertising system in Alibaba. Features are composed of sparse binary vectors in a group-wise manner
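To make the group-wise encoding concrete, here is a small sketch that encodes the example record above. The vocabularies are tiny hypothetical ones chosen for illustration (real systems have vocabularies with millions of IDs), and `encode_group` is an assumed helper name.

```python
import numpy as np

# Hypothetical vocabularies, one per feature group.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
GENDERS = ["Female", "Male"]
CATEGORIES = ["Bag", "Book", "Shoes", "Phone"]


def encode_group(values, vocab):
    """Binary vector with a 1 for every value present (one-hot if a single value)."""
    vec = np.zeros(len(vocab), dtype=np.int8)
    for value in values:
        vec[vocab.index(value)] = 1
    return vec


weekday = encode_group(["Friday"], WEEKDAYS)         # one-hot
gender = encode_group(["Female"], GENDERS)           # one-hot
visited = encode_group(["Bag", "Book"], CATEGORIES)  # multi-hot
ad_cate = encode_group(["Book"], CATEGORIES)         # one-hot

# The full sparse feature vector is the concatenation of all groups.
x = np.concatenate([weekday, gender, visited, ad_cate])
print(visited.tolist())  # [1, 1, 0, 0]
print(x.shape)           # (17,)
```

Single-valued groups such as weekday produce one-hot vectors, while multi-valued groups such as visited_cate_ids produce multi-hot vectors with one entry per behavior.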

The base model embeds these features and combines the embeddings of a user’s behaviors by sum or average pooling, which yields a fixed-length vector regardless of the candidate ad.
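The pooling step of the base model can be sketched as follows. The embedding table is filled with random numbers purely for illustration, and `pool_behaviors` is a hypothetical helper; the point is only that users with different numbers of behaviors end up with vectors of the same length.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8
# Hypothetical embedding table: one row per category ID (4 categories).
embedding_table = rng.normal(size=(4, EMBED_DIM))


def pool_behaviors(behavior_ids, table, mode="sum"):
    """Look up behavior embeddings and pool them into one fixed-length vector."""
    embs = table[behavior_ids]  # shape: (num_behaviors, embed_dim)
    return embs.sum(axis=0) if mode == "sum" else embs.mean(axis=0)


user_short = pool_behaviors([0, 1], embedding_table)       # 2 behaviors
user_long = pool_behaviors([0, 1, 2, 3], embedding_table)  # 4 behaviors
user_avg = pool_behaviors([0, 1], embedding_table, mode="average")

print(user_short.shape, user_long.shape)  # (8,) (8,)
```

Every behavior contributes equally here, and the result does not depend on which ad is being scored. That is the limitation DIN addresses.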

DIN: Instead of expressing all of a user’s diverse interests with the same vector, DIN adaptively calculates the representation vector of user interests with respect to a candidate ad. It uses the weighted sum pooling equation in Fig. 4:

Fig. 4 Local Activation Unit Equation
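
Written out in LaTeX, the weighted sum pooling in the figure can be expressed as follows. This is a reconstruction from the symbol definitions in the next paragraph; w_j is shorthand introduced here for the activation weight a(e_j, Va).

```latex
V_u(A) = f(V_a, e_1, e_2, \ldots, e_H)
       = \sum_{j=1}^{H} a(e_j, V_a)\, e_j
       = \sum_{j=1}^{H} w_j\, e_j
```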

where {e1, e2, …, eH} (shown in the image above) is the list of embedding vectors of the behaviors of user U, and Va is the embedding vector of ad A. In this way, Vu(A) varies across ads. a(·) is a feed-forward network whose output is the activation weight (Fig. 1). Apart from the two input embedding vectors, a(·) also feeds their outer product into the subsequent network, which is explicit knowledge that helps relevance modeling.
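The following NumPy sketch shows one way the local activation unit described above could look. It is a toy with random, untrained weights: the single hidden layer, the ReLU activation, the layer sizes, and the names `activation_unit` and `din_user_vector` are all assumptions made for illustration, not the paper’s exact architecture.

```python
import numpy as np


def activation_unit(e_j, v_a, params):
    """a(e_j, v_a): score one behavior embedding against the candidate ad."""
    outer = np.outer(e_j, v_a).ravel()     # outer product, flattened
    x = np.concatenate([e_j, outer, v_a])  # behavior, interaction, ad
    hidden = np.maximum(0.0, params["W1"] @ x + params["b1"])  # ReLU (assumed)
    return float(params["W2"] @ hidden + params["b2"])         # scalar weight


def din_user_vector(behavior_embs, v_a, params):
    """Weighted sum pooling: sum over j of a(e_j, v_a) * e_j."""
    weights = np.array([activation_unit(e_j, v_a, params) for e_j in behavior_embs])
    return weights @ behavior_embs, weights


rng = np.random.default_rng(0)
d, hidden_units, H = 4, 8, 3  # embedding dim, hidden size, number of behaviors
params = {
    "W1": rng.normal(size=(hidden_units, d + d * d + d)),
    "b1": np.zeros(hidden_units),
    "W2": rng.normal(size=hidden_units),
    "b2": 0.0,
}
behavior_embs = rng.normal(size=(H, d))  # e_1 ... e_H for one user
ad_a = rng.normal(size=d)                # embedding of candidate ad A
ad_b = rng.normal(size=d)                # embedding of a different ad B

v_u_a, w_a = din_user_vector(behavior_embs, ad_a, params)
v_u_b, w_b = din_user_vector(behavior_embs, ad_b, params)
print(v_u_a.shape, w_a.shape)  # (4,) (3,)
```

The same behavior list is scored twice, once per candidate ad, so the user vector is recomputed for each ad instead of being fixed. Note that the weights are used as-is in this sketch, with no normalization step.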

The weighted sum pooling strategy gives more influence to behaviors with larger activation scores.

Illustration of adaptive activation in DIN. Behaviors with high relevance to the candidate ad get high activation weights.
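
A tiny worked example of this effect, using the swimmer scenario from earlier. The 2-dimensional embeddings and the activation weights are hand-picked hypothetical values, not outputs of a trained model.

```python
import numpy as np

# Hypothetical behavior embeddings for one user.
behaviors = np.array([
    [1.0, 0.0],  # bathing suit
    [0.0, 1.0],  # shoes
])

# Hypothetical activation weights when the candidate ad is swimming goggles.
weights_for_goggles = np.array([0.8, 0.1])

din_vector = weights_for_goggles @ behaviors  # weighted sum pooling
sum_vector = behaviors.sum(axis=0)            # plain sum pooling for comparison

print(din_vector)  # [0.8 0.1]
print(sum_vector)  # [1. 1.]
```

With plain sum pooling both purchases count equally, while the weighted version is dominated by the bathing suit, the behavior most relevant to the goggles ad.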