Source: Deep Learning on Medium
Machine Learning — Recommender System
Every e-commerce company and content provider needs a recommender system to suggest related products or content to users. Even though there is still plenty of room for improvement, recommender systems are one of the most successful stories in ML.
However, as Mike Tyson said, “Everyone has a plan until they get punched in the mouth.” The recommender systems deployed in real life are built from lessons learned over many years. They deal with far more diversified objectives under tougher scalability constraints; they are not simply minimizing the MSE of predictions on historical datasets. I want to point out this elephant in the room first: real production systems are tedious, with many ensemble methods applied in multiple steps. This is a hard topic with conflicting objectives, so because of the scope, this article will cover the basic concepts only.
Content-based filtering recommends items based on the similarity in features between an item and a user preference. For example, we can extract features from a movie description and match them against what a user prefers: the movie “Spy” can be featured as a Melissa McCarthy movie and will match users who watch McCarthy movies regularly. Information about a movie can be extracted from the movie title, the description, and information provided by the studio. A user preference can be determined from explicit information given by a user, like demographic information, or from information collected indirectly from what the viewer watches, interacts with, and purchases. Many recommender systems need to deal with the cold start problem when not enough information has been collected for a new user, so deriving user meta-information from user interactions is important. For example, we can collect the descriptions of the movies that a user watched and use them as input to extract a user preference.
There are many feature extraction techniques. We can create a bag of words from the movie description or from the words collected for the user preference.
Then we use the TF-IDF method to compute each value in the bag-of-words vector. TF-IDF reduces the importance of words that occur frequently across movie descriptions.
We compare the similarity of the corresponding vector of a movie and a user preference for recommendations. For example, we can use the cosine similarity to match a user to different movies.
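To make the last three steps concrete, here is a minimal sketch in plain Python. The token lists and the tiny “descriptions” are made up for illustration; real systems would use a proper tokenizer and a library such as scikit-learn.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # tf-idf = term frequency * log inverse document frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy data: two movie descriptions and a user profile built from viewing history
movie_a = "spy comedy action mccarthy".split()
movie_b = "space drama epic".split()
user    = "comedy mccarthy spy".split()

vecs = tfidf_vectors([movie_a, movie_b, user])
print(cosine(vecs[2], vecs[0]) > cosine(vecs[2], vecs[1]))  # → True
```

The user profile shares TF-IDF mass with the first description and none with the second, so the cosine score correctly ranks the first movie higher.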
Other measures, including the Pearson coefficient, L1/L2 distance, or Jaccard similarity, can be tried depending on how the vector is calculated. If you have trouble catching up with the ML terms so far, you can spend some time on this article first.
As with many ML methods, data preprocessing is important. We review the vocabulary used in preparing the bag of words and limit its size. Stop words, like “the”, should be removed. Review the top words manually and remove them if they are irrelevant in your problem domain. We may also match phrases and treat them as a single word in the vocabulary. In addition, there are variants of words that should map to the same word; for example, all plurals can be treated as singulars.
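A rough sketch of these preprocessing rules, with a made-up stop-word list and phrase table; the plural handling here is deliberately naive, and a real pipeline would use a stemmer or lemmatizer:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in"}   # tiny illustrative list
PHRASES = {"science fiction": "science_fiction"}     # phrases folded into one token

def preprocess(text):
    """Lowercase, fold known phrases, drop stop words, and naively singularize."""
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    words = re.findall(r"[a-z_]+", text)
    words = [w for w in words if w not in STOP_WORDS]
    # crude plural -> singular rule; real systems use a stemmer or lemmatizer
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]

print(preprocess("The robots of a Science Fiction epic"))
# → ['robot', 'science_fiction', 'epic']
```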
Nevertheless, even though TF-IDF is frequently mentioned in the recommender system literature, it reduces the importance of common category words like “comedy” that are important in making recommendations. TF-IDF has an important role in query-type searching, but in some contexts it may produce too narrowly focused recommendations and hurt their diversity. So some systems eliminate the common and rare words that have little impact on recommendations, then simply use the frequency or occurrence of the remaining words instead of TF-IDF. The exact choice is influenced by the application context: the words selected for the bag of words, the number of item suggestions, and what functions the application provides are all important factors. This is a good example of why we cannot take textbook recommendations for granted, in particular in the field of recommender systems. We are dealing with far more complex objectives, and they often conflict with each other. In practice, many real systems use A/B testing to determine their design choices.
Collaborative Filtering (User to user or Item to item)
Content-based filtering is limited by the quality of the descriptions provided by the content provider. Technically, there is a limit to what features can be extracted from the limited amount of content information. Extracting “content” information from the user is hard too. In reality, it is easier to judge people by what they do than by what they say. Behavior information, like what they view and how they rate items, is much easier to collect, less intrusive, and more accurate.
In collaborative filtering, we focus on making predictions from user behaviors instead of extracting content features. The abundance of this information makes collaborative filtering appealing.
In collaborative filtering, we are provided with labels but not content features. We collect behavior information on how you interact with items: what you rate, how you rate it, what you view, and/or what you purchase. One typical approach is to create a user rating matrix containing the users’ ratings on movies. Then we find similarities and make recommendations.
For example, we can locate people who give similar movie ratings to yours and suggest movies that you have not watched but that these people rated highly (user-to-user). Alternatively, instead of finding similarities between people, we can find similarities between movies (item-to-item) and make suggestions from what you already watch. Regarding the cold start problem, item-to-item suggestions may be easier when information for a new user is limited.
One possible implementation of collaborative filtering is to locate the k-nearest neighbors and use their average rating as our prediction. Or we can compute the similarity (u) between you and every other user and use those similarities to compute a weighted average as your predicted rating. Of course, some approximation or simplification is needed to scale the second approach.
Some people may be more generous in their ratings. Therefore, we may need to normalize each person’s ratings before computing their similarity. In the example below, we use the Pearson correlation to compute similarity.
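The similarity-weighted prediction with mean normalization can be sketched as follows. The ratings dictionary is invented for illustration; the prediction is the target user's mean rating plus a Pearson-weighted average of the other users' mean-centered ratings.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation computed over the items both users rated."""
    common = set(u) & set(v)
    if len(common) < 2:
        return 0.0
    mu_u = sum(u[i] for i in common) / len(common)
    mu_v = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu_u) * (v[i] - mu_v) for i in common)
    den = sqrt(sum((u[i] - mu_u) ** 2 for i in common)) * \
          sqrt(sum((v[i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """Target user's mean plus a similarity-weighted average of the
    other users' mean-centered ratings of `item` (normalizes generosity)."""
    mu = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = pearson(ratings[user], r)
        mu_o = sum(r.values()) / len(r)
        num += s * (r[item] - mu_o)
        den += abs(s)
    return mu + num / den if den else mu

ratings = {
    "ann": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 4, "m2": 5, "m3": 2, "m4": 5},
    "cat": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}
print(predict(ratings, "ann", "m4"))
```

Ann agrees with Bob (who loved m4) and disagrees with Cat (who hated it), so the predicted rating lands well above Ann's own average.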
Collaborative Filtering (Matrix factorization)
In a previous ML article, we used matrix decomposition to find the principal components of the data. This is an abstract linear algebra concept that is extremely important in ML.
SVD above is one of the matrix factorization methods. It decomposes data into independent principal components like the one below.
In a nutshell, from a rating matrix, we learn the latent factors (the principal components) that users use in rating movies. For example, these latent factors may be the genre of the movie, the year of release, or the actors in the movie. But we don’t define them explicitly, and we do not know what latent factors have been learned unless we examine the results manually. We let the computer learn them from the data, just like other ML methods. Matrix decomposition is a big topic and we will not elaborate on it further. A high-level discussion can be found here, and a more detailed one here.
The number of latent factors can be huge, but many of them may not be important. In PCA, we simply truncate those that are not important (those with a very small singular value σᵢ compared with the others).
So even though the rating matrix is large, we can reduce it to latent factors that represent a user or a movie in a much lower dimension. This is why the concept is also termed low-rank matrix factorization: we are reducing the rank (dimension) of the matrix through factorization.
With the concept of SVD, any user rating matrix can be decomposed into matrix P containing user latent factor and matrix Q containing item latent factor. The movie rating from a user can be reconstructed by multiplying the corresponding latent factor of the user and the movie.
But how can we find these matrices when there are missing entries in the user rating matrix? Users rate or interact with only a handful of movies. To solve that, we apply SGD to learn the decomposed matrices.
But we include error terms only for the valid entries in the rating matrix R. We do not introduce an error into the cost function for entries with missing ratings.
In addition, individual users have rating biases, and some cult movies may be rated much higher than average. We add these biases when making a prediction and add regularization terms for them in the cost function.
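The whole scheme, including SGD over only the observed entries, user/item biases, and regularization, can be sketched in plain Python. The rating data, latent dimension, and hyperparameters below are made up for illustration:

```python
import random

def factorize(R, k=2, steps=20000, lr=0.01, reg=0.02, seed=0):
    """Learn user/item latent factors plus bias terms with SGD,
    using only the observed entries of the rating dict R."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in R})
    items = sorted({i for _, i in R})
    P = {u: [rng.gauss(0, 0.1) for _ in range(k)] for u in users}  # user factors
    Q = {i: [rng.gauss(0, 0.1) for _ in range(k)] for i in items}  # item factors
    bu = {u: 0.0 for u in users}                                   # user bias
    bi = {i: 0.0 for i in items}                                   # item bias
    mu = sum(R.values()) / len(R)                                  # global mean
    obs = list(R.items())
    for _ in range(steps):
        (u, i), r = rng.choice(obs)        # sample one *observed* rating
        pred = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))
        e = r - pred                       # error on this entry only
        bu[u] += lr * (e - reg * bu[u])    # regularized bias updates
        bi[i] += lr * (e - reg * bi[i])
        for f in range(k):
            p, q = P[u][f], Q[i][f]
            P[u][f] += lr * (e * q - reg * p)
            Q[i][f] += lr * (e * p - reg * q)
    return lambda u, i: mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))

# Observed ratings keyed by (user, item); missing pairs contribute no error term.
R = {("a", "x"): 5, ("a", "y"): 3, ("b", "x"): 4,
     ("b", "z"): 1, ("c", "y"): 4, ("c", "z"): 2}
predict = factorize(R)
print(predict("a", "x"))
```

The returned function can also score pairs never observed in R, which is the whole point: the latent factors generalize beyond the sparse ratings used for training.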
Many Markov processes can be solved with random walks. Say we drop off 100 shoppers randomly around the downtown area of San Francisco and provide a transition matrix giving the probability of where a shopper heads next from the current position. Eventually, we can spot where the most interesting shops are located. A recommender system may use a similar concept, except that we drop all the shoppers in one place and check where they are after a certain time.
We can represent each movie as a node in a graph. Based on many factors, including what products users may select next, items that may be viewed or purchased together, products a user rated highly together, and other user behaviors, we can formulate a transition probability from one product to another. These probabilities can be represented as a transition matrix, as below. We want to know the probability distribution of where we may land as time approaches infinity. These probabilities serve as the ranking in making product recommendations.
Fortunately, we can find this solution in a finite number of steps. If we keep multiplying by the transition matrix above, the result converges, and in many cases it converges very fast (we don’t need to perform infinite steps). The vector on the R.H.S. gives the probability of each state we may be in.
Alternatively, we can find the eigenvectors of the transition matrix, but this is usually hard to scale to real e-commerce systems. In practice, we sample transitions, and after many trials, we can estimate the probability distribution by looking at where we land in each trial.
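A sketch of the repeated-multiplication approach, with a hypothetical 3-product transition matrix (the numbers are invented; real transition probabilities come from user behavior data):

```python
def stationary(T, steps=100):
    """Power-iterate a row-stochastic transition matrix until the
    distribution settles; T[i][j] = P(next = j | current = i)."""
    n = len(T)
    p = [1.0 / n] * n                      # start shoppers uniformly at random
    for _ in range(steps):
        # one step of the walk: new_p = p @ T
        p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return p

# Hypothetical transition matrix: most behaviors lead toward the third product.
T = [
    [0.1, 0.4, 0.5],
    [0.2, 0.2, 0.6],
    [0.3, 0.3, 0.4],
]
rank = stationary(T)
print(rank)  # the third product accumulates the most probability mass
```

The resulting distribution is the ranking signal: items where the walk spends the most time are recommended first.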
One of the most difficult parts is computing the transition probabilities from user behaviors. Like many engineering solutions, this involves much trial and error, heavy experimentation, and tuning. That is why many recommender systems are tedious, with many known or unknown legacies behind them.
The random walk is one of the most popular recommender algorithms in industrial deployments. Nevertheless, it is a huge topic on its own, and we will not detail it for now.
Google PageRank utilizes the inbound and outbound links of a page to rank search results. The basic idea is similar to the random walk on a Markov chain that we just discussed. The details of PageRank are discussed in the context of eigenvectors here.
By finding the stable state of such a random walk, the corresponding probability of each state can be used as the final ranking. The problem can be framed as finding the eigenvectors of the transition matrix.
In practice, solving it directly with all the web pages in the world is hard. Instead, we keep multiplying by the matrix R and hope that it converges fast. The PageRank paper demonstrated that with 322 million page links, the solution converges to a tolerable limit in 52 iterations, so the solution scales unexpectedly well (details).
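A simplified PageRank iteration on a tiny invented link graph; the damping factor d models a surfer who follows a random outbound link with probability d and jumps to a random page otherwise (real PageRank also handles dangling pages, which this sketch omits):

```python
def pagerank(links, d=0.85, steps=50):
    """Iterate the damped PageRank update on a dict page -> outbound links."""
    pages = sorted(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(steps):
        new = {}
        for p in pages:
            # mass flowing into p from every page that links to it
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# Hypothetical graph: every other page links to "home", so it ranks highest.
links = {"home": ["a", "b"], "a": ["home"], "b": ["home"]}
pr = pagerank(links)
print(max(pr, key=pr.get))  # → home
```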
Here is an example comparing the Coaches Poll NCAA basketball ranking with one computed using the random-walk concept of a Markov process. It is not bad that the computer’s findings are close to the coaches’ poll.
The transition probability is updated by the following equations: if team A wins over team B, the transition probability from B to A increases, and it increases further if the game is a blowout.
Transaction data reveals a lot of user behaviors and patterns. By finding associations between items across millions of transactions, we can design deals, product placements, and recommendations better.
Let’s define a few terms first. The support of S is the probability of a set of events occurring together (say, the probability of purchasing items S₁, S₂, … Sk ∈ S).
Confidence is how often T happens when S happens.
In association, we are looking for high support and high confidence.
For example, an e-commerce company may display associated items, as related products, for the current items in your shopping cart through association analysis.
First, we start with the transaction records of customers. Starting from single items, we compute the corresponding support. (For simplicity, we use counts instead of probabilities in our examples.) To avoid exponential growth in the number of combinations, entries with a count smaller than or equal to a threshold are eliminated.
Next, we create 2-item combinations from the items remaining on the left above. This stops our tree from growing exponentially.
Then we continue to build 3-, 4-, and 5-item combinations until there are no more items left. After this exercise, we have created the frequent itemsets (A, P, K, M, AP, AM, PM, KM & APM).
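The level-wise pruning above can be sketched as follows. The transactions and the minimum count are invented for illustration, and candidate generation here is a simplified version of the strict Apriori join-and-prune step:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count=2):
    """Apriori-style level-wise search: only items that survive one level
    are used to build the candidate itemsets of the next level."""
    items = sorted({i for t in transactions for i in t})
    frequent, size = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        kept = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(kept)
        size += 1
        # next-level candidates built only from items in surviving itemsets
        survivors = sorted({i for c in kept for i in c})
        candidates = [frozenset(c) for c in combinations(survivors, size)]
    return frequent

# Toy transactions over items a, p, k, m
transactions = [set("apm"), set("apk"), set("pm"), set("akm"), set("ap")]
freq = frequent_itemsets(transactions, min_count=2)
print(sorted("".join(sorted(s)) for s in freq))
# → ['a', 'ak', 'am', 'ap', 'k', 'm', 'mp', 'p']
```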
Our next task is to create association rules. For the combination APM, the possibilities are
But as the number of items grows, the number of possible rules also grows exponentially. So we need to prioritize which itemsets to investigate first. Usually, we start with the itemset with the highest confidence.
Nevertheless, confidence alone can be misleading. The confidence P(Y|X) can be high simply because Y is popular. We want to know the additional confidence in seeing Y given X, so lift recalibrates the confidence by dividing it by P(Y).
If the lift is greater than one, users are more likely to purchase itemset 𝑌 given that they buy itemset 𝑋, and less likely if it is smaller than one. Conviction compares the expected frequency of X occurring without Y, if X and Y were independent, with the observed frequency: a value of 1.3 means the rule would be wrong 30% more often if X and Y were related only by chance.
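The three rule metrics follow directly from the support counts. A minimal sketch over invented transactions (conviction is computed as (1 − P(Y)) / (1 − confidence)):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def metrics(transactions, X, Y):
    """Confidence, lift, and conviction for the rule X -> Y."""
    sx, sy = support(transactions, X), support(transactions, Y)
    sxy = support(transactions, X | Y)
    conf = sxy / sx                                     # P(Y | X)
    lift = conf / sy                                    # P(Y | X) / P(Y)
    conv = (1 - sy) / (1 - conf) if conf < 1 else float("inf")
    return conf, lift, conv

# Toy transactions over items a, p, k, m
transactions = [set("apm"), set("apk"), set("pm"), set("akm"), set("ap")]
conf, lift, conv = metrics(transactions, {"a"}, {"p"})
print(round(conf, 2), round(lift, 2), round(conv, 2))  # → 0.75 0.94 0.8
```

Here the confidence looks high (0.75), but the lift below one reveals that "p" is simply popular: buying "a" actually makes "p" slightly less likely in this toy data.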
Let’s look at an example of applying the Apriori algorithm to form association rules. The example below tries to find associations in the responses to a 14-question survey. For questions with categorical answers, each possible answer is treated as a separate item. For an ordinal answer, this example treats it as either smaller than the median or greater than or equal to the median. So the possible items look like (male, single, married, divorced, income < $40,000, income ≥ $40,000, …).