HierTCN: Deep learning models for dynamic recommendations and inferring user interests

Source: Deep Learning on Medium

by Aditya Pal and Pong Eksombatchai | Applied Science

As we build a visual discovery engine, it’s crucial to understand user intent and serve relevant content through recommender systems that constantly learn from a cross-section of data to dynamically predict the next idea a Pinner will love. However, we found existing approaches limited in both speed and memory consumption, and unable to incorporate cross-session information. Additionally, a key challenge is that user interests dynamically shift and evolve over time (as seen in Figure 1). Our aim is to build technology that adapts to such evolving user interest patterns at a large scale.

In response, we developed Hierarchical Temporal Convolutional Networks (HierTCN), a deep learning architecture that makes dynamic recommendations based on users’ sequential multi-session interactions with items. This work introduces a hierarchical model that employs a Recurrent Neural Network (RNN) and a Temporal Convolutional Network (TCN) to effectively capture users’ long-term and short-term interests. We found HierTCN outperformed several baseline models in an offline setting.

Here we’ll share more about our approach. Additionally, you can find details in our paper, published this week in conjunction with The Web Conference in San Francisco: Hierarchical Temporal Convolutional Networks for Dynamic Recommender Systems

Figure 1. A sample sequence of actions of a randomly selected user.

Our approach is motivated by the observation that a user’s interest across sessions depends on their longer-term interests, whereas their short-term in-session interests tend to evolve rapidly. Therefore, an ideal user model should capture these different levels of user dynamics. We operationalize this observation with a two-level model, HierTCN, which uses a Recurrent Neural Network (RNN) to aggregate information across sessions and a Temporal Convolutional Network (TCN) to aggregate information within sessions.

The input to our model is a sequence of Pins (with session information) that a user interacted with (saved or clicked). The input Pins are represented by their PinSage embeddings, and the main goal is to learn user embeddings in the same latent space as PinSage. Figure 2 depicts the architecture of HierTCN. Here the high-level model is a Gated Recurrent Unit (GRU), which is updated after each session with an aggregation of that session’s interactions computed by the function AGG(.). The low-level model uses a TCN to predict the user embedding at each time step from the GRU hidden state and the interactions in the current session.
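To make the two-level structure concrete, here is a minimal NumPy sketch of a HierTCN-style forward pass. The toy dimensions, random weights, mean aggregator for AGG(.), and single causal-convolution layer are illustrative assumptions; the actual model uses trained parameters, larger PinSage embeddings, and a deeper TCN.

```python
import numpy as np

rng = np.random.default_rng(7)
D = 8          # toy embedding size; real PinSage embeddings are much larger
KERNEL = 2     # toy causal-convolution kernel width

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical random parameters standing in for trained weights.
gru_W = {k: rng.normal(scale=0.1, size=(D, D))
         for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
tcn_W = rng.normal(scale=0.1, size=(KERNEL * D + D, D))

def gru_step(h, x):
    """High-level model: update the cross-session user state h from a session summary x."""
    z = sigmoid(x @ gru_W["Wz"] + h @ gru_W["Uz"])
    r = sigmoid(x @ gru_W["Wr"] + h @ gru_W["Ur"])
    h_tilde = np.tanh(x @ gru_W["Wh"] + (r * h) @ gru_W["Uh"])
    return (1 - z) * h + z * h_tilde

def tcn_step(h, session_pins, t):
    """Low-level model: causal convolution over the Pins seen so far in this
    session, conditioned on the high-level GRU state h."""
    # Left-pad so the convolution at step t only sees steps <= t (causality).
    padded = np.vstack([np.zeros((KERNEL - 1, D)), session_pins[: t + 1]])
    window = padded[-KERNEL:].reshape(-1)   # the last KERNEL interactions
    return np.tanh(np.concatenate([window, h]) @ tcn_W)

def hiertcn(sessions):
    """sessions: list of (num_pins, D) arrays of PinSage embeddings.
    Returns one user embedding per interaction and the final GRU state."""
    h = np.zeros(D)
    user_embeddings = []
    for pins in sessions:
        for t in range(len(pins)):
            user_embeddings.append(tcn_step(h, pins, t))
        h = gru_step(h, pins.mean(axis=0))  # AGG(.) sketched as a mean
    return np.array(user_embeddings), h
```

Note that the GRU state is updated only once per session, while the TCN produces an embedding at every interaction; that asymmetry is what lets the high-level model stay cheap across long histories.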

Figure 2: HierTCN

The training objective is to take a user’s interaction sequence up until time t as input and generate an embedding u_t that predicts the next interacted Pin. To make this objective tractable, we use a negative sampling approach in which each interaction Pin in the sequence has an associated set of negative Pins that should be ranked lower (typically the Pins impressed during the corresponding session).
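One common way to realize this objective is a margin-based ranking loss over the negative set. The sketch below is an illustrative hinge loss, not necessarily the exact loss used in the paper; the margin value and dot-product similarity are assumptions.

```python
import numpy as np

def hinge_ranking_loss(u, pos, negs, margin=0.1):
    """Push the interacted Pin above its negatives (e.g. impressed-but-skipped Pins).

    u:    (D,) user embedding at time t
    pos:  (D,) embedding of the next interacted Pin
    negs: (K, D) embeddings of the associated negative Pins
    """
    pos_score = u @ pos
    neg_scores = negs @ u
    # Penalize every negative that scores within `margin` of the positive.
    return np.maximum(0.0, margin - pos_score + neg_scores).mean()
```

Because the negatives come from the same session’s impressions, the loss trains the model to discriminate among Pins the user actually saw, rather than against easy random negatives.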

Figure 3 shows our model at work in the wild. The first row is the sequence of interaction Pins, and subsequent rows show the rankings produced by different models. We note that moving-average-based models rank food-related Pins higher (long-term interest), whereas the TCN pivots on the home-decor Pins (short-term interest). HierTCN, on the other hand, blends these short-term and long-term interests better.

Figure 3

A large scale offline validation of different models (Figure 3) shows that HierTCN provides a consistent lift of 10–15% on all evaluation metrics over competing baselines.
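For reference, the evaluation metrics behind such comparisons are typically computed per interaction over a candidate set and then averaged. The sketch below uses the standard definitions of recall@k and reciprocal rank; the paper’s exact evaluation protocol (candidate set construction, averaging) may differ.

```python
import numpy as np

def recall_at_k(scores, positive_idx, k):
    """1.0 if the interacted Pin ranks in the top k of the candidate set, else 0.0."""
    topk = np.argsort(scores)[::-1][:k]
    return float(positive_idx in topk)

def reciprocal_rank(scores, positive_idx):
    """Reciprocal of the 1-based rank of the interacted Pin; averaging this
    over interactions gives the mean reciprocal rank (MRR)."""
    order = np.argsort(scores)[::-1]
    rank = int(np.where(order == positive_idx)[0][0]) + 1
    return 1.0 / rank
```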


Ultimately, HierTCN is designed for web-scale systems with billions of items and hundreds of millions of users. It consists of two levels of models: the high-level model uses an RNN to aggregate users’ evolving long-term interests across different sessions, while the low-level model is implemented with a TCN, utilizing both the long-term interests and the short-term interactions within sessions to predict the next interaction. We conducted extensive experiments on the public XING dataset and a large-scale Pinterest dataset that contains 6 million users with 1.6 billion interactions.

We found that HierTCN is 2.5x faster than RNN-based models and uses 90% less data memory compared to TCN-based models. We further developed an effective data caching scheme and a queue-based mini-batch generator, which enabled our model to be trained within 24 hours on a single GPU. Our model consistently outperforms state-of-the-art dynamic recommendation methods, with up to 18% improvement in recall and 10% in mean reciprocal rank.
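The idea behind a queue-based generator can be sketched as follows: keep a fixed number of users "active" at once, feed one session per active user into each mini-batch, and swap in a fresh user from the pool whenever one finishes. This is a minimal stdlib sketch of that pattern, not the paper’s actual implementation; the data layout and batching policy are assumptions.

```python
from collections import deque

def queue_minibatches(user_sessions, batch_size):
    """Yield mini-batches of (user_id, session) pairs.

    A deque keeps up to `batch_size` users active at once, so each user's
    recurrent state can be carried across consecutive batches instead of
    re-processing the full interaction history at every step.
    """
    pool = deque((uid, deque(sessions)) for uid, sessions in user_sessions.items())
    active = deque()
    while len(active) < batch_size and pool:
        active.append(pool.popleft())
    while active:
        batch = []
        for _ in range(min(batch_size, len(active))):
            uid, stream = active.popleft()
            batch.append((uid, stream.popleft()))
            if stream:                 # user still has sessions left
                active.append((uid, stream))
            elif pool:                 # replace the finished user
                active.append(pool.popleft())
        yield batch
```

Keeping the same users across consecutive batches is what makes a data cache effective here: the hidden states and cached session features for active users stay warm until those users are exhausted.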

We’re working on productionizing the model to serve recommendations at scale within Pinterest.

Acknowledgements: The core contributors to this work were Jiaxuan You, Yichen Wang, Aditya Pal, Pong Eksombatchai, Chuck Rosenberg, and Jure Leskovec.