Do you ever wonder how websites such as Amazon bring up items similar to the products you search for? This is a common advertising approach that uses a data mining concept known as recommender systems to show products that are similar in nature. We use recommender systems to predict unexplored items (products, topics, etc.) that users (customers, subscribers, etc.) may want (purchase, read, etc.) based on information from other items or users.
One familiar approach in recommender systems is the Association Rule Mining algorithm. As introduced in our previous blog, Data Science Series: Association Rule Mining, association rule mining focuses on finding correlations between items so that the next combination of items can be suggested. This method has been a powerful tool for companies, especially when individual customer profiles were not available. For example, retail companies such as Walmart and Target explore historical transaction data across the entire customer base. They then discover the itemsets most frequently purchased together. These itemsets are ranked by their frequency of occurrence to determine the next combination of items to recommend. If milk is frequently purchased together with eggs and sugar, customers who buy milk are assumed to be interested in eggs or sugar (or both). Companies can display these items in the same section of the store or send a flyer promoting them together.
As information about each customer became accessible (e.g. a user profile through a personal account at an online shop), recommender systems were developed to embed the customer’s preferences or behavior patterns into the algorithms. The two most common approaches are the “content-based” method and “collaborative filtering”. The difference between these two algorithms lies in the usage of the customer “group”. As shown in the following figures, the content-based method focuses on an individual user’s preference, while the collaborative filtering recommender system refers to the pattern of the “user community”.
In this blog, we will explore the content-based recommender system in detail and continue our discussion on the collaborative filtering in the next blog.
The basic premise of the content-based method is that items with similar features will be preferred similarly by the same user. Discrete features (e.g. model name, color, producer, etc.) of each item are scanned to group similar items. The user’s preferences (e.g. ratings, reviews, purchase history, etc.) for each item they consumed are also recorded to create individual user profiles. By exploring the user profile together with the item features, the algorithm scores the items and recommends similar products that individual users may like.
Let’s implement a content-based recommender system using the MovieLens dataset. MovieLens is a well-known dataset for recommender system practice, composed of 20,000,263 ratings (ranging from 1 to 5) and 465,564 tag applications across 27,278 movies reviewed by 138,493 users. The goal is to recommend certain movies to a particular user by predicting his/her ratings on unexplored movies. The implementation includes the computation of similarities between movies (items) based on their tags (item features) and the generation of a user profile based on his/her ratings of movies (user preferences). For data of this size, the Azure platform outperforms an on-premise machine in computing power.
Data preparation and analysis for this practice were conducted in an Azure Databricks notebook. To configure the overall process on the Azure platform, the MovieLens data were stored in Azure Data Lake Store (ADLS) as the data source. To load the data into the Azure Databricks notebook, see the following link on how to mount ADLS to the Databricks file system: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html
Step 1- Load the datasets
Three main datasets – “movies”, “ratings” and “tags” – were loaded as Pandas dataframes for the Python application. The attributes in each dataframe are shown below.
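As a minimal sketch of this loading step, the snippet below reads the three datasets with Pandas. The inline CSV strings are tiny hypothetical stand-ins so the snippet runs on its own; in Databricks you would instead pass the mounted ADLS file paths to `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in CSV content (hypothetical rows, not the real MovieLens data).
# In Databricks, replace io.StringIO(...) with the mounted ADLS paths,
# e.g. pd.read_csv("/dbfs/mnt/<your-mount>/movies.csv").
movies_csv = "movieId,title,genres\n1,Toy Story (1995),Animation\n2,Jumanji (1995),Adventure\n"
ratings_csv = "userId,movieId,rating,timestamp\n65,1,4.0,1112486027\n65,2,3.5,1112484676\n"
tags_csv = "userId,movieId,tag,timestamp\n65,1,pixar,1137206825\n65,2,fantasy,1137206825\n"

movies = pd.read_csv(io.StringIO(movies_csv))
ratings = pd.read_csv(io.StringIO(ratings_csv))
tags = pd.read_csv(io.StringIO(tags_csv))

print(movies.columns.tolist())   # ['movieId', 'title', 'genres']
print(ratings.columns.tolist())  # ['userId', 'movieId', 'rating', 'timestamp']
print(tags.columns.tolist())     # ['userId', 'movieId', 'tag', 'timestamp']
```

The column names above follow the standard MovieLens file layout.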
Step 2- Compute the item feature vector
To evaluate the similarities between movies, we first need to find the features of the movies. A feature is an attribute that describes an item – here, a movie. In this practice, the “tag” attribute contains terms that represent the characteristics of each movie.
Since “tag” does not have a quantitative measure, we need to develop a quantitative metric to evaluate the features of movies. TF-IDF values are used for this purpose. TF-IDF is the product of “Term Frequency (TF)” and “Inverse Document Frequency (IDF)”. TF counts how many times a given term appears in a document. IDF is the inverse of the document frequency, i.e. the fraction of documents that contain the term.
In this practice, TF means how frequently a tag is applied to a movie and IDF indicates how rarely a tag is applied. TF-IDF is calculated using the following equation:

$$w_{ij} = \mathrm{tf}_{ij} \times \log\frac{N}{\mathrm{df}_i}$$

where $\mathrm{tf}_{ij}$ is the total number of occurrences of a term $i$ in document $j$, $\mathrm{df}_i$ is the total number of documents containing $i$, and $N$ is the total number of documents.
TF-IDF lets us score the terms more reliably. For example, assume we have the set of terms “the ultimate action” describing a movie. The term “the” would occur more frequently than “action” in the dataset; however, it is less important than “action” in evaluating the similarities. TF-IDF helps to penalize such terms so that the similarity computation is less affected by them. The TF-IDF values for the MovieLens data were calculated as below.
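To make the formula concrete, here is a toy computation of TF-IDF over a hypothetical set of three movies and their tag applications (not the real MovieLens data):

```python
import math
from collections import Counter

# Hypothetical tag applications: movieId -> list of applied tags.
movie_tags = {
    1: ["the", "ultimate", "action"],
    2: ["the", "action"],
    3: ["the", "drama"],
}

N = len(movie_tags)  # total number of "documents" (movies)
# Document frequency df_i: number of movies whose tag set contains term i.
df = Counter(tag for tags in movie_tags.values() for tag in set(tags))

def tf_idf(movie_id):
    """TF-IDF weight per tag: tf_ij * log(N / df_i)."""
    counts = Counter(movie_tags[movie_id])  # tf_ij
    return {t: c * math.log(N / df[t]) for t, c in counts.items()}

w = tf_idf(1)
print(w)  # "the" appears in every movie, so its weight is exactly 0
```

Note how the ubiquitous term “the” (df = 3 out of N = 3) is driven to zero, while the rare tag “ultimate” gets the highest weight.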
Now, how do we rate the similarities between movies with respect to each user’s preferences? To compute this, we use the Vector Space Model. This model represents each movie as a vector of its features in a multi-dimensional space. The individual user profile is also stored as a vector, and the similarities between movies, as well as the proximity to the user profile, are determined by the angles between the vectors.
The TF-IDF values were converted into a feature vector for each movie by normalizing them to unit length. In the following result, the attribute “tag_vec” holds the unit-length feature vector of each movie.
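The normalization step can be sketched as follows; the raw TF-IDF weights here are hypothetical values for one movie over a shared tag vocabulary:

```python
import numpy as np

# Hypothetical raw TF-IDF weights for one movie (one entry per vocabulary tag).
tfidf = np.array([0.0, 1.0986, 0.4055])

# Divide by the Euclidean norm so the feature vector has unit length.
# With unit-length vectors, cosine similarity reduces to a plain dot product.
tag_vec = tfidf / np.linalg.norm(tfidf)

print(round(float(np.linalg.norm(tag_vec)), 6))  # 1.0
```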
Step 3- Compute the user profile vector
The movie feature vectors are ready with the tag applications. What we need now is the user profile vector for a particular user, based on his/her previous ratings of movies. The user profile indicates the degree of user preference and is therefore the sum of the feature vectors of all movies that the user rated positively. For this demonstration, User No. 65 was selected, and the user profile was computed from ratings of 3 or higher. The following result shows the user profile vector in the attribute “tag_pref”.
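A minimal sketch of the profile construction, using hypothetical unit-length movie feature vectors and ratings (not User No. 65’s real data):

```python
import numpy as np

# Hypothetical unit-length feature vectors ("tag_vec") for rated movies.
movie_vecs = {
    10: np.array([0.8, 0.6, 0.0]),
    20: np.array([0.0, 0.6, 0.8]),
    30: np.array([1.0, 0.0, 0.0]),
}
user_ratings = {10: 4.0, 20: 3.0, 30: 2.5}  # movieId -> rating

# Profile = sum of feature vectors of positively rated movies (rating >= 3).
tag_pref = sum(vec for mid, vec in movie_vecs.items() if user_ratings[mid] >= 3.0)

print(tag_pref)  # movie 30 is excluded because its rating is below 3
```

A common refinement is to weight each vector by its rating instead of a plain sum, but the simple sum matches the rule described above.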
Step 4- Compute the cosine similarity to predict item ratings
The proximity of each movie’s feature vector to the profile vector of User No. 65 was determined by taking the cosine of the angle between the vectors. The first 10 predicted movie ratings for User No. 65 are shown below:
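The scoring and ranking step can be sketched like this; the profile vector and candidate movie vectors are hypothetical, carried over from the toy example above:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

tag_pref = np.array([0.8, 1.2, 0.8])  # hypothetical user profile vector

# Hypothetical feature vectors of movies the user has not rated yet.
candidates = {
    40: np.array([0.6, 0.8, 0.0]),
    50: np.array([0.0, 0.0, 1.0]),
    60: np.array([0.577, 0.577, 0.577]),
}

# Score every unexplored movie against the profile, then rank descending.
scores = {mid: cosine(vec, tag_pref) for mid, vec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(ranked)  # movie IDs ordered by predicted affinity: [60, 40, 50]
```

The top entries of `ranked` are the recommendations; movie 60 wins because its direction is closest to the profile’s, regardless of vector length.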
The top 10 movies to recommend to User No. 65 are:
| Movie ID | Movie Title |
|----------|-------------|
| 1748 | As Good as It Gets (1997) |
| 4878 | Donnie Darko (2001) |
| 4975 | Vanilla Sky (2001) |
| 7147 | Big Fish (2003) |
| 44191 | V for Vendetta (2006) |
| 1206 | Clockwork Orange, A (1971) |
| 32 | Twelve Monkeys (a.k.a. 12 Monkeys) (1995) |
| 48774 | Children of Men (2006) |
| 29 | City of Lost Children, The (Cité des enfants perdus, La) (1995) |
Now, you’re all set to build your own content-based recommender system. Find the full code for this implementation here!
One issue with the content-based method is that the recommendations are limited to products from similar categories, rather than other items that customers may also be interested in. As briefly mentioned, collaborative filtering helps resolve this bias by considering the preferences of the user “group”. In the next blog, we will dive into collaborative filtering in more detail with the MovieLens dataset.