We learned what collaborative filtering is in our previous blog. Now, let's build an ALS collaborative filtering recommendation model using the MovieLens dataset. The raw data is stored in Azure Data Lake Store (ADLS), and we need to load it into an Azure Databricks notebook by mounting ADLS to, or directly connecting it with, the Azure Databricks file system.
You can also find the full code for this practice here!
Step 1. Prepare the datasets
The “ratings”, “movies”, and “tags” datasets were loaded as PySpark dataframes to implement the DataFrame-based ML ALS Recommendation module. If you are more familiar with the RDD format, the PySpark MLlib ALS Recommendation module could be your choice instead.
Attributes in each dataframe are shown above. We can also check the data types of these attributes.
To build a user-item matrix in the ALS model for this practice, we use the explicit preferences of users – their ratings on movies. Since the “ratings” dataset provides the users, the movies, and the corresponding ratings from users, we will focus on this dataset to develop our model.
Let’s clean the data by removing the “timestamp” column, which is not necessary for the analysis. We also need to convert “userId” and “movieId” to integer type, and “rating” to double type, as the ALS recommendation module requires. The cleaned “ratings” dataset is now prepared as “ratings_df”.
Step 2. Build the ALS Collaborative Filtering model
The “ratings_df” dataframe is divided into two sets with a random split function to train and test the model. Of the 20 million rows of data, 70% is used for training and the remaining 30% for testing.
The ALS implementation includes several parameters that we can adjust to fit the modeling conditions.
One thing to note is that we need to leave implicitPrefs as False (the default), since we use the explicit-feedback ALS variant (i.e. ratings) in this practice. If we adopted implicit feedback such as views, clicks, and purchases, the model would predict better with implicitPrefs = True. Another parameter of concern is coldStartStrategy. This parameter manages the situation where the training dataset does not contain the movie items that appear in the test dataset (the “cold start problem”). ALS automatically assigns NaN as the predicted value when a new item appears, which leads to a NaN result for the evaluation metric during cross-validation. To avoid a NaN evaluation result, we set coldStartStrategy to “drop”, which drops any prediction rows that contain NaN values.
Once the ALS estimator is configured, the model can be fit on the training data as below.
Step 3. Predict the ratings and evaluate the model
The model built in Step 2 is used to predict the ratings on movies in the test dataset. Brief results are shown in the table below:
We can see that the predicted values (“prediction”) are somewhat close to the actual rating values (“rating”) from users in the test dataset. Model accuracy can be evaluated with multiple metrics; one of the most popular is root mean squared error (RMSE).
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual value for the $i$-th item, $\hat{y}_i$ is the predicted value for the $i$-th item, and $N$ is the total number of items.
The RMSE value for our model is calculated as below.
Step 4. Recommend top 10 movies
Using the prediction results from Step 3, we can recommend the top 10 movies to a user based on the predicted ratings. For example, the model computes the following top 10 recommendations for User No. 96393:
Hence, the top 10 movies for User No. 96393 are:
| Movie ID | Movie Title |
|---|---|
| 101940 | Seeking Asian Female (2012) |
| 91617 | New Life, A (La vie nouvelle) (2002) |
| 120815 | Patton Oswalt: Werewolves and Lollipops (2007) |
| 81117 | Moth, The (Cma) (1980) |
| 108149 | At Berkeley (2013) |
| 126959 | The Epic of Everest (1924) |
| 89083 | Great White Silence, The (1924) |
| 87719 | Living with Wolves (2005) |
| 93918 | Big Night, The (1951) |
| 74014 | Silent Wedding (Nunta Muta) (2008) |
As seen in Step 2, the ALS algorithm takes multiple parameters to create the model. We can iterate over different values for these parameters to find the model with the highest prediction accuracy. One of the most important parameters is “rank”, which sets the number of latent factors in the model (the “features” in our discussion above).
You can find these optional steps for determining the best model by varying the rank value in our full code!
So far, we have explored different recommendation strategies and implemented a sample algorithm using Azure Databricks. Now that we have seen that Azure Databricks provides a reliable environment for data analysis, we may want to integrate this service with other Azure applications through pipelines so that we can automate the job schedule. In the next blog, we will introduce the steps to incorporate Azure Databricks into an Azure Data Factory pipeline.