This post addresses the general problem of constructing a deep-learning-based recommender system. The particular architecture described here is the one powering the new smart feed of the iki service, pushing your skills on a daily basis — to check its performance, please try the product beta.
If you are familiar with the general idea of recommender systems and the mainstream approaches, and would like to go straight to the details of our solution, please skip the first two sections of the paper.
Recommender systems have changed the way we interact with many services. Instead of providing static data, they bring an interactive experience: an option to leave feedback and to personalise the information you are given. A recommender system creates a personalised informational flow independently for each user while also taking into account the behaviour of all users of a service.
Often it is impossible for a user to choose among the variety of available options by checking each of them, be it movies in an online cinema, consumer goods in an online store, or any other content. When a service contains a lot of content, a user faces the problem of information overload. One way to solve this problem is to take advantage of targeted recommendations.
One example of a service where targeted content plays a crucial role is iki. iki is your personal career and professional consultant powered by machine learning and artificial intelligence technologies. After a user has defined the areas of his professional interests, iki creates a daily content feed developing knowledge and chosen skills with the help of the recommender system with the deep learning architecture described below.
A general description of the recommender system problem setting
The main objects present in any recommender system are users U, items R, and interactions between them. These interactions are usually represented as a matrix F (|U| × |R|), each cell containing some information about an interaction: the fact that an item has been viewed or bought, a given rating, a like or dislike, or something else. Depending on the number of users and items in a system, the matrix F can become enormously large. But each user normally rates or interacts with only a small percentage of the items in the system, which results in a very sparse ratings matrix. It is known that one can reduce the number of independent parameters of a sparse model without a significant loss of information.

We thus come to the problem of the right representation of the rating data in a recommender system: we have to map each user to a vector of fixed length n, n << |U|, and each item to a vector of fixed length m, m << |R|. One of the most popular solutions to this problem is SVD (Singular Value Decomposition) of the matrix F, which for a given n yields two matrices with shapes |U| × n and n × |R| whose product is the best rank-n approximation of F. The rows of the first matrix are the vectors corresponding to users, and the columns of the second are the vectors corresponding to the system's items; let us call these vectors the "SVD embeddings" of users and items respectively.
The traditional method of making predictions in recommender systems is collaborative filtering. The main assumption of this method is that users who have given similar ratings to some items in the past tend to give similar ratings to other items in the future. One of the most popular ways of solving the collaborative filtering problem is to create dense user and item embeddings via SVD of the initial sparse ratings matrix F. To predict the rating of user u on item i, we calculate the scalar product of the corresponding user and item SVD embeddings.
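The SVD-based prediction described above can be sketched on a toy ratings matrix. The matrix F, the embedding size n, and the user/item indices below are made up for illustration; a real system would factorise a large sparse |U| × |R| matrix.

```python
# SVD-based collaborative filtering on a toy dense ratings matrix.
import numpy as np

F = np.array([               # rows: users, columns: items
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

n = 2                             # embedding dimension, n << |U|, |R|
U_svd, s, Vt = np.linalg.svd(F, full_matrices=False)
user_emb = U_svd[:, :n] * s[:n]   # |U| x n user embeddings
item_emb = Vt[:n, :]              # n x |R| item embeddings

# Predicted rating of user u on item i: dot product of their embeddings.
def predict(u, i):
    return user_emb[u] @ item_emb[:, i]

approx = user_emb @ item_emb      # best rank-n approximation of F
```

Here `predict(0, 1)` returns a value close to the observed rating of 4, because the rank-2 product of the two factor matrices approximates F well.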
There are several open-source libraries providing tools for constructing recommender systems based on the user ratings matrix. Among them are various methods of matrix factorisation (SVD, non-negative matrix factorisation (NMF)), different variations of the nearest-neighbours algorithm, co-clustering methods, and user-based and item-based models. Examples of such solutions are the Surprise and LightFM libraries.
Besides the information about users' interactions with items, a recommender system usually holds other data describing users and items separately. This data can be varied and heterogeneous: items and users may carry textual descriptions, numerical characteristics, categorical features, images, and other types of data. The way this data is processed and used depends on the particular problem and is not a universal part of recommender systems.
The main properties, or metrics, of a recommender system are the relevance of its recommendations and its scalability to very large numbers of users or items. This setting leads to the common recommender system architecture consisting of two sequential blocks: a candidate generator, which chooses a relatively small subset of the large item set, and a ranking module, which assigns each item in the chosen subset a rating of relevance to the user's interests. This rating is often used to rank items in the order the user will see them.
Deep learning powered recommender system architecture
A content-based recommender system with a deep learning architecture is closely tied to the actual content present in the system. Further on we shall dive into the details of the iki recommender system to describe the DL approach.
In iki, a user's interactions with content are views and ratings: a user can like or dislike any content element. An item in our case is a webpage with a blog, an article, a tutorial, a course, or some other content providing professional knowledge. Each content element has a textual description or is itself a text.
Content embeddings are the result of a text vectorisation procedure. We used the GloVe language model with 300 dimensions to vectorise individual words. GloVe is a language model developed at Stanford University, resembling word2vec in that it preserves the semantic similarity of words. The difference is that, besides capturing semantics, GloVe takes advantage of a statistical approach, taking into account global word co-occurrence counts in the text corpus.
We compute tf-idf weights for all words occurring in the text corpus; the vector corresponding to a content element is then calculated as the tf-idf-weighted sum of the GloVe vectors of the words in that content element.
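The weighted averaging above can be sketched as follows. The tiny corpus and the 4-dimensional word vectors are stand-ins; the real system uses pretrained 300-dimensional GloVe embeddings.

```python
# tf-idf weighted averaging of word vectors (toy data).
import math
from collections import Counter
import numpy as np

corpus = [
    "deep learning for recommender systems",
    "deep neural networks",
    "collaborative filtering systems",
]
docs = [doc.split() for doc in corpus]

# Hypothetical word vectors; the real system loads pretrained GloVe.
rng = np.random.default_rng(0)
vocab = sorted({w for d in docs for w in d})
glove = {w: rng.normal(size=4) for w in vocab}

def idf(word):
    df = sum(word in d for d in docs)          # document frequency
    return math.log(len(docs) / df)

def embed(doc):
    counts = Counter(doc)
    weights = {w: (c / len(doc)) * idf(w) for w, c in counts.items()}
    total = sum(weights.values())
    vec = sum(weights[w] * glove[w] for w in weights)
    return vec / total if total > 0 else vec   # weighted average

content_emb = embed(docs[0])
```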
A user can add to his profile information about his current and past jobs (such as the position name and a description of each job). Our system processes this data and extracts a list of professional skills for each user with a special language model, Skills Extractor, developed for this purpose. In a nutshell, the idea of this model is to learn the semantics of "skills" on a corpus of English CVs and to be able to extract previously unseen skills from unstructured English text. A user can also add skills manually. To tune the content feed, a user selects areas of professional interest on our hierarchical cloud of tags, similar to the Apple Music mechanics. iki offers several hundred professional interests that can be selected as part of a user's content feed. The user's skills and selected tags are used in our recommender system as user features.
Another set of features used to train our recommender system is the content and user embeddings obtained by applying SVD to the user ratings matrix. This is not the only use of rating and view information in the system: we also create a user vector representation describing the content he has viewed and rated. This is done as follows: given the content vector representations, we take the minimum, the maximum, and the average value of each coordinate over the whole set of content the user has viewed. This operation yields three vectors with the same length as the content embeddings. We perform the same operation on the sets of liked and disliked content, obtaining another six vectors of the same length. These dense embeddings are also used as user features during the recommender system training step.
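The aggregation step just described (min, max, and mean per coordinate over the viewed, liked, and disliked sets, nine vectors in total) can be sketched like this; the embeddings are random placeholders.

```python
# Per-coordinate min / max / mean aggregation of content embeddings.
import numpy as np

rng = np.random.default_rng(1)
dim = 8                                   # content embedding length

viewed   = rng.normal(size=(20, dim))     # embeddings of viewed items
liked    = rng.normal(size=(7, dim))      # embeddings of liked items
disliked = rng.normal(size=(4, dim))      # embeddings of disliked items

def aggregate(embs):
    # One vector each for min, max, and mean over the item axis.
    return [embs.min(axis=0), embs.max(axis=0), embs.mean(axis=0)]

# 3 vectors from views + 3 from likes + 3 from dislikes = 9 vectors.
user_feature = np.concatenate(
    aggregate(viewed) + aggregate(liked) + aggregate(disliked)
)
```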
For each tag (content topic) our recommender system creates a separate set of recommendations and, in particular, selects a subset of candidates for further ranking. The goal of the iki recommender system is not only to provide content on the corresponding topic but also to rank content elements by quality and expertise level, fitting the resulting feed to each user's level of expertise.
Let us go deeper into the details of the candidate generation step. For each tag, our system selects candidates for ranking as follows: the content elements that recently received positive ratings from the particular user are selected, and the sets of content elements surrounding them (the union of several neighbourhoods in the content embedding space) are added to the candidate set. We also take into consideration the set of users with similar recent activity on the selected tag (similarity is measured as the cosine distance between their rating and view vectors); the content elements recently viewed or rated by these users are also added to the set of candidates. Varying the radius of the described neighbourhoods yields different numbers of generated candidates.
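A rough sketch of this candidate generation logic for one tag is shown below. All data is synthetic, and the radius and similar-user count are made-up parameters, not the values iki actually uses.

```python
# Candidate generation: neighbourhoods of liked items plus items
# touched by users with similar recent activity.
import numpy as np

rng = np.random.default_rng(2)
item_emb = rng.normal(size=(100, 8))          # item embeddings for one tag
ratings  = rng.choice([-1, 0, 1], size=(10, 100), p=[0.1, 0.8, 0.1])

def neighbourhood(item, radius):
    d = np.linalg.norm(item_emb - item_emb[item], axis=1)
    return set(np.flatnonzero(d < radius))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def candidates(user, radius=2.0, k_similar=3):
    liked = set(np.flatnonzero(ratings[user] > 0))
    cand = set(liked)
    for i in liked:                            # union of neighbourhoods
        cand |= neighbourhood(i, radius)
    others = [v for v in range(len(ratings)) if v != user]
    sims = [cosine(ratings[user], ratings[v]) for v in others]
    top = np.argsort(sims)[::-1][:k_similar]   # most similar users
    for v in (others[j] for j in top):
        cand |= set(np.flatnonzero(ratings[v] != 0))
    return cand

pool = candidates(0)
```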
For each tag, a separate neural network instance is trained and used for ranking the content candidates. Each neural net instance is trained and then used for ranking on the content subset corresponding to its tag. The ranking net takes data about the content element and the user as input and outputs a number between -1 and 1 as the predicted rating used for the final ranking of the content elements. The model is trained on the set of given ratings and views; each sample of the set comprises a user and a content element, and the target value is 1 for a like, -1 for a dislike, and 0 if the user just viewed the content without giving it any rating.
The ranking net has 6 inputs with different dimensions. The first input takes information about the user's selected tags and skills, which is transformed into a dense user profile embedding in the next layer. This information allows the system to take into account the other interests (tags) of the particular user while ranking content corresponding to one particular interest. There is a finite number of tags in iki, so the information about them is encoded as a fixed-length binary vector (1 for selected tags, 0 for the rest).
The skills extracted from a user's CV with Skills Extractor may be quite varied, so we statistically normalise them to eliminate very rare phrases and end up with a fixed-length set. We then vectorise skills in the same binary way as the tags. During the periodic recommender system retraining we update the actual set of skills.
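The binary tag encoding is straightforward; the tag list below is a made-up sample, since iki's real tag cloud contains several hundred entries.

```python
# Fixed-length binary encoding of selected tags (hypothetical tag list).
ALL_TAGS = ["python", "machine-learning", "devops", "product", "design"]

def encode_tags(selected):
    # 1 for each tag the user selected, 0 for the rest.
    return [1 if t in selected else 0 for t in ALL_TAGS]

vec = encode_tags({"python", "design"})
# vec == [1, 0, 0, 0, 1]
```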
The next two inputs take the user vector representations described above, built from the aggregated embeddings of rated and viewed content. Their outputs are united in the next layer, which creates a dense user embedding in the content embedding space; we call it the dense user preference embedding. These embeddings are united with the remaining inputs, which take the SVD embeddings of the content and the user as well as the content embedding itself. The resulting vector is passed forward through several dense layers of the neural net. The activation function of the last layer is tanh, which gives an output value in the [-1, 1] range.
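An illustrative numpy forward pass of this shape is sketched below: six inputs are concatenated and passed through dense layers with a tanh output, so the predicted rating lies in [-1, 1]. The layer sizes and random weights are placeholders, not the real model's parameters.

```python
# Toy forward pass mirroring the ranking net's concatenate-then-dense
# structure with a tanh output layer.
import numpy as np

rng = np.random.default_rng(3)

def dense(x, w, b, act):
    return act(x @ w + b)

relu = lambda x: np.maximum(x, 0.0)

# Six inputs: tags, skills, viewed-agg, rated-agg, SVD embs, content emb
# (dimensions below are illustrative guesses).
inputs = [rng.normal(size=d) for d in (300, 500, 24, 48, 32, 300)]
x = np.concatenate(inputs)

w1, b1 = rng.normal(size=(x.size, 64)) * 0.05, np.zeros(64)
w2, b2 = rng.normal(size=(64, 1)) * 0.05, np.zeros(1)

h = dense(x, w1, b1, relu)
score = dense(h, w2, b2, np.tanh)[0]   # predicted rating in [-1, 1]
```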
The cold start problem in iki is solved the following way: for a particular tag without any ratings yet, the system recommends a new user a set of random content elements with significant distance between them. This provides a variety of quite different content elements for a new user to choose from at the beginning of his experience.
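One way to pick such mutually distant elements is greedy farthest-point sampling over the content embeddings. This is an assumed implementation for illustration, not necessarily the exact mechanism iki uses.

```python
# Greedy farthest-point sampling: each new pick maximises its distance
# to the already chosen set, yielding a diverse sample.
import numpy as np

def diverse_sample(embs, k, seed=0):
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(embs)))]     # random starting item
    while len(chosen) < k:
        # Distance from every item to its nearest already-chosen item.
        d = np.min(
            np.linalg.norm(embs[:, None] - embs[chosen][None], axis=2),
            axis=1,
        )
        d[chosen] = -1.0                        # never re-pick an item
        chosen.append(int(np.argmax(d)))        # farthest from chosen set
    return chosen

embs = np.random.default_rng(4).normal(size=(50, 8))
picks = diverse_sample(embs, 5)
```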
Results and model benchmark
We tested various recommender system models and their combinations (SVD, NMF, and others) during the research phase, before constructing the final model described above. The model we ended up with shows the best accuracy on the specific iki dataset. Providing a meaningful benchmark for a deep-learning-based recommender system is not a trivial task, because the ranking net's architecture is tailored to a specific type of data; it is quite expected that this architecture outperforms any interaction-matrix-based recommender system, simply because it takes advantage of more data when forming the predicted score.
The other problem is that when you do not have a large enough real user base with all the profile and rating data, it is hard to simulate real recommender system performance. You can generate a dataset, but there is no sense in generating a noisy one, so you end up tailoring sets of skills, experience, and preferences to particular users; your recommender system then decodes this information and fits its recommendations to the artificial user profiles you have generated. This approach leaves no room for processing spontaneous real-world user actions, which is what really matters.
Source: Deep Learning on Medium