Let’s continue our discussion on recommender systems. The following figure briefly summarizes branches in recommender systems.
In the previous blog, we explored the content-based recommender system. We will look more into collaborative filtering recommender systems in this blog and implement an example with MovieLens dataset using Pyspark ML ALS Recommendation Module in Azure Databricks in our next blog. For the big size of data, Azure Databricks notebook based on the Apache Spark frame provides us with a reliable environment for fast job execution.
Unlike the content-based method, collaborative filtering does not require features of individual items. Instead, the general tendency of users throughout items are collected to group similar users and the algorithm learns what features to use to determine similarities between items.
Collaborative filtering is usually divided into three branches: memory-based, model-based and hybrid method which combines the two methods (memory-based+model-based). The memory-based method is sometimes called “instant-based” method since it temporarily stores information in the computer memory to generate recommendations. In short, the memory-based approach is differentiated from the model-based approach by the amount of user-item data required to develop recommender systems. The former requires the entire user-item dataset to compute recommendations. Meanwhile, the latter makes use of partial data through dimension reduction to create a model. The overall user-item interactions are then approximated from the model.
Memory-based collaborative filtering computes similarities between users or items and predicts a new rating for an item by taking the weighted average of ratings from the similar group. There are two types in this approach: user-based filtering and item-based filtering. Basic ideas behind these methods are;
For user-based filtering: “users who are similar to a certain user also liked…”, and
For item-based filtering: “users who preferred a certain item also liked…”
User-based filtering first selects a user and finds users who have similar rating patterns. The recommender system then can suggest items that those similar users liked. Item-based filtering, on the other hand, takes an item first and finds users who liked the particular item, then searches other items that those users also liked. Item-based filtering has an advantage over user-based filtering. In that, the algorithm can reduce the data size to explore. It only looks at the items that have been rated by users and compiles similar items, rather than comparing all users.
The following figure illustrates the two filtering methods in the user-item rating matrix. For user-based filtering, the similarity between user ukand user ulis calculated by averaging similarities between ratings in columns i2,i3and in. This process is iterated to compare all users. For item-based filtering, the similarity between item iiand item ijis measured by calculating similarities between ratings in rows u1,u2and uk.
Similarities are commonly measured by “Pearson correlation” or “vector cosine”. The first method computes the linear correlation between two user ratings (or item ratings) and validates how much those two users (or items) vary from each other. The second method, as explained in our previous blog, treats two users (or items) as multi-dimensional vectors and compares the angle between them by taking the cosine value.
Memory-based collaborative filtering applies simple algorithms to implement the recommender system and generates good prediction quality. However, this approach is limited in scaling to larger dataset since it basically travels over the entire user-item matrix, which leads to slow computing process. Model-based collaborative filtering is scalable and faster as it takes partial data to create a model and avoids overfitting when the model represents enough of real-world data.
In model-based collaborative filtering, the most popular algorithm is matrix factorization (or latent factorization). Matrix factorization adapts “Singular Value Decomposition (SVD)” methodology which breaks down the original user-item matrix into submatrices, containing latent features (factors).
Okay, since things are getting complicated, let’s not deal with mathematics, and look into the concept in a much simpler way. As we can see in the following diagram, SVD predicts the unknown ratings in user-item matrix M through the inner product of U (user and user feature matrix), Σ (user feature and item feature matrix) and VT(item feature and item matrix). Here, VTis just the transpose of matrix V for item and item features. Don’t get too bothered by this! Full SVD considers all m x n features in computation, however, we can take only k-number of features from Σ and approximate the original matrix M. This is the idea of dimension reduction in model-based collaborative filtering.
When applied to recommender systems, we presume that the feature matrix Σ is folded into U and VT, therefore we can predict the ‘approximated’ rating matrix M’ by the inner product of U and VT.
For the MovieLens practice which will be introduced in our next blog, matrix factorization considers users holding k-latent preferences (e.g. how much the user prefers action movies) and movies having k-latent attributes (e.g. how much the movie has action contents). Multiplications of each preference of a user and each attribute of a movie are added to generate the specific user’s rating on a certain movie. For instance, the value for M’11in the following figure is a summation of f1∙F1+f2∙F2+ … + fk∙Fkfrom U and VT.
Okay, then how do we define user/movie features when we only have rating information? Well, the algorithm learns by itself. We don’t have to know what these features are. (We will see how this works in the following steps!) What we need to do is just randomly pick up the number for k, and repeat the prediction for feature values in U and VTuntil we minimize the errors (mostly, RMSE) between the approximated matrix M’ and the actual matrix M. Let’s see these in steps.
In basic SVD matrix factorization, computation steps are as below:
- Prepare matrix M. You may want to normalize it first by adjusting the values to make a sum of 1 for each row and column.
- Define the number of factors, k.
- Initialize matrix U and VTwith random values. For instance, you may want to set all elements in Uand VTas 1.
- Change and optimize the values in U and VTto minimize the RMSE value.
- Stop iterations when the RMSE value converges.
SVD is a good approach to build a model, however, it still takes time for computation. To enhance the computing efficiency, “Alternative Least Square (ALS)” method or “Stochastic Gradient Descent (SGD)” method can be adapted in minimizing the RMSE. Steps for ALS matrix factorization are as stated below:
- Prepare matrix M with normalization.
- Determine the dimension.
- Fix VTwith random values, and optimize U by minimizing the RMSE value.
- Fix U with random values, and optimize VTby minimizing the RMSE value.
- Repeat step 3 and 4 until the RMSE value converges.
Steps for SGD matrix factorization are as follows:
- Prepare matrix M with normalization.
- Determine the dimension.
- Initialize U and VTwith random values.
- Randomly select a known element Mij from the actual user-item matrix M.
- Calculate residuals between the actual value Mij and the predicted value M’ij which is calculated by the random values in U and VT. For instance, if M11 was selected, the value for M’11 is f1∙F1+ f2∙F2+ … + fk∙Fk and the residual is (M11– M’11).
- Update the elements in U and VTrelated to Mij value to minimize the error in residuals.
- Repeat step 4 to 6 until the stopping condition is met.
In general SGD is faster than ALS, however, ALS sometimes outperforms SGD in a distributed computing environment.
Now it’s time to implement a collaborative filtering algorithm for practice. Azure Databricks is equipped with a good Pyspark package for ALS model-based collaborative filtering in ML ALS Recommendation Module. Let’s dive into this in the next blog!