Source: Deep Learning on Medium
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
So finally Pinterest has deployed PinSage as their recommender system and everyone is really excited about the change and the shift in paradigm. But what is it really and how is it related to GraphSAGE?
If we look at PinSage title it reads as follows
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
What are they mentioning in their title? Let me break down the title down into three main keywords
- Graph Convolutional Neural Networks
- Recommender System
Graph Convolutional Neural Networks
Short for GCN, is a model in which both Graphs and Neural Networks comes into picture, already this is getting exciting right? GCN are basically useful for feature representation of nodes in a graph using Neural Networks by taking into consideration of neighbours on k-hops. The convolution mentioned here is not a normal convolution that we see in Deep Learning with images, but the concept is same.
Web-Scale Recommender System
This is saying that they are developing a GCN for web-scale which means for a large graph and also that it is a recommender system, which means that they are trying to generate some kind of representation of nodes in which they train on one and then they use the trained representation for recommending tasks for unseen nodes.
Let me give a small intro on what PinSage is. PinSage is a Recommender System which is built on top of GraphSAGE, and this develops high quality node embeddings using the graph structure and the node feature information. And this model is currently deployed in Pinterest.
Let me pick a small paragraph from PinSage paper, it reads as follows
GraphSAGE is an inductive variant of GCNs that we modify to avoid operating on the entire graph Laplacian. We fundamentally improve upon GraphSAGE by removing the limitation that the whole graph be stored in GPU memory, using low-latency random walks to sample graph neighbourhoods in a producer-consumer architecture. — PinSage
From the above statement what they are saying is that they are just modifying the GraphSAGE which is a variant of GCN and they are modifying in such a way that they need not work on the whole graph Laplacian. Laplacian matrix is a matrix in which we take the whole graph into consideration and this is computed by subtracting the degree matrix with the adjacency matrix. And they are also removing the limitation of the whole graph being stored in GPU which happens in the case of GraphSAGE, and to do this they are using a producer consumer architecture which we will discuss below.
Laplacian Matrix = Degree Matrix — Adjacency Matrix
As they are talking on and on about GraphSAGE let me just give a small into on it.
- Unsupervised Algorithm: for starters, graphSage is an unsupervised algorithm, which means we don’t have any sort of labels.
- Sampling Neighbours: when computing for a particular node u, we need to also consider the set of its neighbours. When all the neighbours are considered, there’s a chance that we are retrieving a lot of nodes which just is computationally very expensive. So to avoid this, only the top t neighbours are sampled from the neighbourhood.
- Aggregator: We got all of the neighbours in the above step, what an aggregator does is that it will aggregate all of the sampled neighbourhood information to a particular one dimensional embedding.
- Unseen Data: GraphSAGE is mainly motived to work on unseen data, and for any unseen data they take in attributes such as text, node profile information, node degrees to generate the initial embeddings.
- Activation: Sigmoid. No comments!
Now that we saw few of the important points, let us go through their algorithm a bit.
Fret not! i will explain the algorithm in as simple terms as possible. But just follow the notations i highlight.
V is a collection of all nodes that are there. x(v) is the initial input embedding that we generate for every node (unseen). h(v) is always considered as the current embedding, but as we have different layers in this module, so h(v) is going to be h(v, k) where k is the layers.
In line 1, we take all the nodes representation x(v) for all nodes v in collection V, then we iterate over all the layers (depth) in the next step.
We Iterate for all nodes v (line 3), and we aggregate (we are going to discuss later on what kind of aggregation function they are using) all of the nodes information (line 4) and pass them along to a Layer in the neural network which will generate the embeddings. We repeat this for all the depths K. Here depth came into picture because, when we take only the one-hop of a nodes, we just consider the immediate neighbours. If we consider two-hops of a nodes, we gain the information of node neighbours and neighbours neighbours. In this way when we do k-hops, we get all the k depth nodes information embedded by the aggregator gaining more neighbourhood information. At the end of the for loop, there is normalisation so that it will be better for computation in the neural network layer. Finally, the embeddings will be in the final layer h(v, K), and this is copied to the vector representation.
If you clearly see the steps 1 and 3, they are computing on the whole set of nodes that they have, and this is the exact thing that PinSage is trying to avoid.
Again follow my notations clearly. z(u) is the current embedding of node for which we are computing the loss function. z(v) are the relevant examples and z(vn) are the non relevant examples. and Q is the number of the non relevant examples.
Whoooaaah… hang on! we just discussed that this is unsupervised data right? then from where on earth did we get the relevant and non relevant information? Let me explain, they used a concept called Random Walk. Its a technique in which a random walk takes place from the node u for a particular depth k, and all of the nodes that appear in the random walk are considered to be closer to u, and they are taken as relevant examples and the rest as the non relevant examples.
Now that we know GraphSAGE a bit, we can understand PinSage much easier as this is very similar to GraphSAGE.
- PinSage is a Supervised algorithm.
- Convolution: This module is same as aggregator in GraphSage, this gets the neighbourhood information into a single column vector representation.
- Stacking Convolutions: Just like GraphSAGE, this also goes on for multiple hops (depths), by going deeper the node learning becomes better by gaining more information from neighbours and their neighbours
- Pinterest is a platform in which they Share and Organise images. Images are considered as Pins. And users can stack a collection of images in albums, which is considered as a board. Users are assumed to stack similar images in albums (boards). Finally, PinSage does a very smart thing to simplify the graph here, they form a Bipartite Graph of pins — boards. this works for this situation because there is no pin — pin and board — board relation.
- There are almost 2 billion pins and 1 billion boards, and both of those are considered as node, so totally 3 billion nodes and there are around 18 billion edges.
Although GraphSAGE algorithm would have worked in this scenario too, but they will have to compute on the whole graph Laplacian which is computationally very expensive, and Pinterst has billions of nodes and edges, and also they need to do some kinds of optimisations during the run time, for faster computations.
- On-the-fly convolutions: They are sampling the neighbourhood of nodes on demand.
- Producer consumer mini-batch construction: Making use of CPU and GPU efficiently by assigning them different tasks.
- Efficient MapReduce Inference: During computation of convolutions (aggregations) we may have to compute the convolutions repeatedly, and this repetition of multiple tasks are minimised by the MapReduce.
- Random Walks: They use a random walk process to sample the neighbourhood
- Importance Pooling: They are computing an importance score of each and every neighbour, with the help of random walk. (not done by GraphSAGE)
- Curriculum Training: They feed harder and harder examples each time to the model to make it more robust and generalise for hard-negative examples. (not done by GraphSAGE)
Nothing to worry, as always, just follow my notations. h(v) is always the current embeddings, N(u) gives the sampled neighbourhood of u, alpha is importance factor of all of the neighbours which we compute during the random walk. gamma is an aggregator function. n(u) in first line holds the neighbourhood information of u. W, Q are Weights, and w, q are bias.
If you are clear with the notations, you can go head. In line 1, we are getting all of the neighbourhood information into n(u). We take all of the neighbours embeddings h(v) and pass it to a layer in the neural network and then apply the importance factor alpha, and aggregate the whole matrix by taking column wise average or mean. Now n(u) in line one has the whole neighbourhood information. In line 2, we concatenate the neighbourhood information with the current representation z(u). The concatenation is done instead of average or mean just due to the computational efficiency. z(u) is same as h(u) as both are current node embeddings of u. After concatenation, this is again sent to another layer in the neural network. and the output will be the new embeddings of u. In the third line, later we normalise the data using l2 normalisation just so that it will help in the neural network computation. I will explain the above algorithm along with the dimensions in the below section.
Each node is considered to be a dimension of the m. It’s mentioned in the paper that they want same number of neurons (d) for all the hidden layers for simplicity. But we can consider this as m, as we will be passing the output as the input in the next iteration.
Now coming to dimensions, every node in the graph has an embedding size of m. So from n neighbours of our node u, we are going to sample t nodes (select t nodes from n such that t<n) now we know the dimension of each node which is m. So the dimension of one neighbour is 1*m. And if there are t such neighbours, the dimension of the whole neighbours is considered as t*m. Now we are sending this to a layer in the neural network with weights Q and bias q. It is mentioned in the paper that all the layers in the neural network are fixed to have the same number of neurons d. And we also consider that we will be sending the output dimensions of the layers as input again. So We can conclude that each layer can have m dimension instead. The number of parameters that will be there for the Q matrix will be m*m as we will have to multiply t*m and m*1 matrices. And the final output dimensional vector is t*m which is the whole neighbourhood processed matrix. Now we will be sending this to the aggregator function which will take the mean or average of all the columns of the neighbourhood processed matrix. We also have an importance factor for each matrix using which we will multiply the corresponding row value (embedding) of the neighbourhood matrix. Now after applying the aggregating function along with the importance score alpha, we will get nu which is the whole neighbourhood representation, and this will be of dimension 1*m.
Now, the neighbourhood representation is concatenated with the current embedding of u. The current embedding is always hu but as we are considering zu here, that also is the same. The concatenation is done instead of aggregation function only to save little computation. After the concatenation, the output dimension will be 1*2m, as we will be concatenation the neighbourhood which is of 1*m and the current representation which is of 1*m dimensions respectively.
This concatenated value is now sent to another layer in the neural network W with the number of neurons m. So the number of parameters that this will have is 2m*m. And after the layer computation we will get an output embedding 1*m, which is the new embedding of u. This is just normalised using l2 normalisation so that it could help in the computation and so that it won’t let the neural network values explode.
We were discussing in the above section that we want to sample the neighbourhood of u. But how do we do it?
We simply consider the neighbourhood nodes which has high influence on u. Again how do we decide this?
We perform random walks on u. And during this random walk process, we will take the visit count of each node and normalise using the l1 normalisations and from this we will take top t valued nodes. Fixing that number of sampling t, helps during the computation and also helps in estimating the computational resource needed on an estimate.
Calculating Importance Factor
We know that alpha is the importance factor. How do we calculate it?
Remember the random walk we did above to find the importance of a neighbour to u? We take the similarity in the random walk. For the above given example, importance of the neighbour is taken as the intersection of the parent random walk and the neighbour random walk. Assuming that the red colour is the random walk of u and black is the random walk of neighbour n1, we take their intersection part of the ellipses.
What does this even mean? Just now we have learnt what a convolution is. A single convolution learns the information from its neighbours. But when we stack such convolutions together, the model will tend to learn the information of its neighbours and the information of neighbours neighbours. Doing this will make the model gain more information by looking at its neighbours. And each layer of convolution always depends on its previous layer. So k always depends on k-1 where as k is the depth. Each layer has two hidden layers of Neural Network, all the layers in the same same depth share weights and bias. W, w, Q, q is shared among nodes but differ among layers. So they can be rewritten as W(k), w(k), Q(k), q(k) finally after the final output from the convolution, there is another common layer in the architecture and this layer of neurons share parameters throughout the model.
This algorithm gives the over all picture of how the training happens. As always parameters first. V is the set of all nodes. M is the subset nodes of V. S is the set of samples of nodes (the selected nodes from the neighbourhood). K is the depth. x(u) is the initial embedding of u.
There are two major parts for this algorithm, first is where it samples the neighbourhood, and second is where it computes the new embedding by convolutions. S holds the nodes, not their embeddings. In the first line, the the subset nodes M from V is copied to S(K), this is the last layer of sampling. Now we iterate in the decreasing order of depth, during this loop for every depth S(k) we copy the nodes S(k-1)and their neighbourhoods N(u) nodes from their next layer. In line 3, we copy the nodes to layer k-1 and in line 5 we copy the neighbours information also to layer k-1. Finally S(0) will hold the nodes of M and its neighbours till k-hop.
For all the nodes which are in S(0), we take the initial embeddings and store them in h(u,0). Now iterating in an increasing order, for every node in layer S(k) (line 10), we sample the nodes from the previous layer which are neighbours of v (line 11). Now we perform the convolve operation (the algorithm we saw first) on the neighbourhood nodes and the current representation of u in its previous layer, which gives the current embeddings of h(u, k) node u and depth k. Finally, we will have all of the embeddings in the Final Layer of h(u, K). We pass this through another layer of neural network which shares parameters G2, G1, g. where G1, g are weights and bias respectively.
As we discussed earlier, PinSage is a supervised algorithm. So it assumes labelled data (q, i) -> L query item pair, relating to label L.
This is a maximum margin loss. z(q) is the embedding of q, z(i) are the relevant examples and z(nk) are the non relevant examples. delta is the hyper parameter for the margin. This loss function maximises the inner product of positive examples. If z(nk) is more, then there will be a positive value, and taking the max will return a positive vale denoting that the loss is high. And if z(i) is high, then the value will be negative and taking the max will return 0 denoting that there is no loss.
Sampling Negative Items
In the above section we discussed that z(nk) are negative items. Now how do we take the non negative items? If we take all of the non relevant as negative samples, then the model will not generalise and wont predict the near correct results (results which are close enough).
500 negative examples are sampled from each mini-batch. Have a look at the above image, Query is the query that they give and positive example is the positive results that they get from the model, and Random Negative is the negative example that is selected from non positive items, Hard Negative is the negative example which is not completely positive but not completely negative. So instead, they ranked the similarity of results and pick few negatives from the rank 2000–5000 and again trained on those examples so that the model will generalise better for near correct results. This is called curriculum training, in which they train on harder and harder examples in each instance.
How do we decide the First Embeddings?
Image and Annotations information is considered for every pin. Every image is passed to VGG16 and the embeddings from the 10th layer is extracted which gives 4096 dimensional embeddings. And word2vec is used to generate word embeddings. They merge both the embeddings to generate the initial embeddings.
- To make the computations better, they split data to multiple mini-bathes and send them to different GPU and take back the output from them and fuse together. Doing this, multiple GPU’s can used simultaneously reducing the runtime.
- Producer Consumer: All of the graph is stored in CPU, and whenever GPU needs a nodes information and node’s neighbours, GPU instructs the CPU, and CPU generates the information of neighbourhood embeddings and then GPU will do the Convolution operation. This kind of architecture helps the maximum utilisation of both CPU and GPU
- CPU — (extracting features, re-indexing, negative sampling)
- GPU — (Model Computations)
- For finding similarity of two embeddings, they perform KNN which is K-Nearest Neighbours.
- Offline Metrics — MRR (Mean Reciprocal Rank): Rank which will be reciprocated and taken mean of.
- Controlled User Studies: Show the outputs to experts to judge if the generated recommendation are relevant or not.
- A/B Test: Show 50% of Pinterest users the recommendations from model A, and another 50% from model B for the same query. They then check the user interactions or activities on it. A and B are the models that generate the embeddings which produce slightly different results. This is checked by actually deploying in Pinterest.
From the results table in the paper, it is considered a win, if user selects the recommendation fo PinSage and a loose if the user selects the opponent and draw if none is selected. And the fraction of wins is among the wins-lose.