
Data Science Series: Implementing Collaborative Filtering in Azure using an ALS Algorithm

Nov 9, 2018


In our previous blog, we learned what collaborative filtering is. Now, let’s build an ALS (alternating least squares) collaborative filtering recommendation model using the MovieLens dataset. The raw data is stored in Azure Data Lake Store (ADLS), and we load it into an Azure Databricks notebook either by mounting ADLS to the Azure Databricks file system or by connecting to ADLS directly.

You can also find the full code for this practice here!

Step 1. Prepare the datasets

The “ratings”, “movies” and “tags” datasets were loaded as PySpark dataframes so that we can use the ALS recommendation module in Spark ML (pyspark.ml). If you prefer to work with RDDs, the ALS recommendation module in PySpark MLlib (pyspark.mllib) could be your choice.
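As a rough sketch of the loading step, the three files could be read into PySpark dataframes as follows. The `base_path` argument is a hypothetical mount point, not a path from the original notebook; `spark` is assumed to be the SparkSession that a Databricks notebook provides, and the file names follow the standard MovieLens CSV layout.

```python
def load_movielens(spark, base_path):
    """Read the three MovieLens CSV files into a dict of DataFrames.

    `spark` is assumed to be an active SparkSession and `base_path` a
    hypothetical location (e.g. an ADLS mount point) holding the files.
    """
    names = ("ratings", "movies", "tags")
    # header=True uses the first CSV row as column names; without
    # inferSchema every column is read as a string and is cast later.
    return {
        name: spark.read.csv(f"{base_path}/{name}.csv", header=True)
        for name in names
    }
```

In a notebook this might be called as `dfs = load_movielens(spark, "/mnt/movielens")`, after which `dfs["ratings"]` holds the ratings dataframe.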

[Images: attributes of the “ratings”, “movies” and “tags” dataframes]

Attributes in each dataframe are shown above. We can also check the data types of these attributes.

[Image: data types of the dataframe attributes]

To build the user-item matrix for the ALS model in this practice, we use the explicit preferences of users, i.e. their ratings of movies. Since the “ratings” dataset contains the users, the movies and the corresponding ratings, we will focus on this dataset to develop our model.

Let’s clean the data by removing the “timestamp” column, which is not needed for the analysis. We also need to cast “userId” and “movieId” to integer and “rating” to double so that the ALS recommendation module can use them. The cleaned “ratings” dataset is stored as “ratings_df”.
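A minimal sketch of this cleaning step is shown here, assuming `ratings` is the PySpark dataframe holding the raw ratings data. The helper name `prepare_ratings` and the `RATINGS_TYPES` mapping are illustrative and are not taken from the original notebook.

```python
# Target types for the columns that ALS consumes (taken from the text above).
RATINGS_TYPES = {"userId": "int", "movieId": "int", "rating": "double"}


def prepare_ratings(ratings):
    """Drop the unused timestamp column and cast the remaining columns.

    `ratings` is assumed to be a PySpark DataFrame with the columns
    userId, movieId, rating and timestamp.
    """
    ratings_df = ratings.drop("timestamp")
    for name, dtype in RATINGS_TYPES.items():
        # withColumn with an existing name replaces that column in place.
        ratings_df = ratings_df.withColumn(name, ratings_df[name].cast(dtype))
    return ratings_df
```

Calling `ratings_df = prepare_ratings(ratings)` and then `ratings_df.printSchema()` is one way to confirm the resulting types.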


Step 2. Build the ALS Collaborative Filtering model

The “ratings_df” dataframe is divided into two sets with a random split function to train and test the model. Of the 20 million rows of data, about 70% are used for training and the remaining 30% for testing.
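The split can be sketched with the dataframe’s `randomSplit` method. The helper name and the seed value are assumptions added for reproducibility, not values from the original notebook.

```python
def split_ratings(ratings_df, train_fraction=0.7, seed=42):
    """Randomly split a PySpark DataFrame into (training, test) sets.

    The seed is an arbitrary choice that makes the split repeatable.
    """
    weights = [train_fraction, 1.0 - train_fraction]
    # randomSplit returns one DataFrame per weight; the resulting sizes
    # match the weights only approximately because rows are sampled.
    training, test = ratings_df.randomSplit(weights, seed=seed)
    return training, test
```

Because the rows are sampled at random, the training set will hold roughly, not exactly, 70% of the rows.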


The ALS implementation exposes several parameters that we can adjust to suit the modeling conditions.


One thing to note is that implicitPrefs must be set to False (the default) because we use the explicit feedback variant of ALS (i.e. ratings) in this practice. If we relied on implicit feedback such as views, clicks and purchases instead, implicitPrefs = True would be the appropriate setting.

Another parameter of concern is coldStartStrategy. It controls what happens when the test dataset contains users or movies that do not appear in the training dataset (i.e. the “cold start problem”). By default, ALS assigns NaN to the predicted value for such rows, which leads to a NaN result for the evaluation metric, for example during cross-validation. To avoid a NaN evaluation result, we set coldStartStrategy to “drop”, which drops any rows of predictions that contain NaN values.
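The settings discussed above might be collected as follows. The `rank`, `maxIter` and `regParam` values shown are Spark’s documented defaults and are only placeholders; the values used in the original notebook are not known. The helper name `build_als` is also illustrative.

```python
# Keyword arguments for pyspark.ml.recommendation.ALS.
ALS_PARAMS = {
    "userCol": "userId",
    "itemCol": "movieId",
    "ratingCol": "rating",
    "rank": 10,                   # number of latent factors (Spark default)
    "maxIter": 10,                # ALS iterations (Spark default)
    "regParam": 0.1,              # regularization strength (Spark default)
    "implicitPrefs": False,       # explicit feedback: ratings
    "coldStartStrategy": "drop",  # drop rows whose prediction is NaN
}


def build_als(params=None):
    """Create an ALS estimator from the parameter dictionary above."""
    # Imported here so the parameters can be inspected without Spark.
    from pyspark.ml.recommendation import ALS
    return ALS(**(params or ALS_PARAMS))
```

Keeping the parameters in a dictionary makes it easy to override a single value, for example `build_als({**ALS_PARAMS, "rank": 20})`.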

Once the ALS estimator is configured, the model can be fit on the training data as below.

[Image: code fitting the ALS model on the training data]
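A sketch of the fitting step, together with the scoring used in Step 3, could look like this. It assumes `als` is an already configured `pyspark.ml.recommendation.ALS` estimator and that `training` and `test` are the two dataframes produced by the random split; the helper name is illustrative.

```python
def train_and_predict(als, training, test):
    """Fit the ALS estimator and score the held-out ratings.

    Returns the fitted model and a DataFrame of test rows with an
    added "prediction" column.
    """
    model = als.fit(training)            # learns user and item factors
    predictions = model.transform(test)  # appends the "prediction" column
    return model, predictions
```

Fitting is the expensive part of the workflow, so it can be worth keeping the returned `model` around rather than refitting it for each later step.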

Step 3. Predict the ratings and evaluate the model

The model built in Step 2 is used to predict the ratings on movies in the test dataset. Brief results are shown in the table below:

[Image: sample of predicted ratings for the test dataset]

We can see that the predicted values (“prediction”) are somewhat close to the actual rating values (“rating”) given by users in the test dataset. Model accuracy can be evaluated with several metrics; one of the most popular is the root mean squared error (RMSE).

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual value for the $i$-th item, $\hat{y}_i$ is the predicted value for the $i$-th item, and $N$ is the total number of items.
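To make the metric concrete, here is a plain-Python sketch of RMSE that also shows why NaN predictions from cold-start rows have to be dropped. It illustrates the arithmetic only and is not the PySpark evaluator used in the notebook; the function name and the sample numbers are made up for the example.

```python
import math


def rmse(actual, predicted, drop_nan=True):
    """Root mean squared error over paired actual/predicted ratings.

    With drop_nan=True, pairs whose prediction is NaN are skipped,
    which mirrors the effect of coldStartStrategy="drop".
    """
    pairs = list(zip(actual, predicted))
    if drop_nan:
        pairs = [(y, y_hat) for y, y_hat in pairs if not math.isnan(y_hat)]
    if not pairs:
        raise ValueError("no usable (actual, predicted) pairs")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, float("nan")]    # third row: unseen movie
print(rmse(actual, predicted))                  # finite value
print(rmse(actual, predicted, drop_nan=False))  # nan
```

A single NaN prediction is enough to turn the whole metric into NaN, which is why the cold-start rows are removed before evaluation.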

The RMSE value for our model is calculated as below.

[Image: RMSE value computed for the model]

Step 4. Recommend top 10 movies

Using the prediction results from Step 3, we can recommend the top 10 movies to User No. 96393 based on the predicted ratings.

Hence, the top 10 movies for User No. 96393 are:

Movie ID   Movie Title
101940     Seeking Asian Female (2012)
91617      New Life, A (La vie nouvelle) (2002)
120815     Patton Oswalt: Werewolves and Lollipops (2007)
81117      Moth, The (Cma) (1980)
108149     At Berkeley (2013)
126959     The Epic of Everest (1924)
89083      Great White Silence, The (1924)
87719      Living with Wolves (2005)
93918      Big Night, The (1951)
74014      Silent Wedding (Nunta Muta) (2008)
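The selection logic behind such a list can be sketched in plain Python: keep the rows for one user and take the highest predicted ratings. The function name and the tuple layout are assumptions for illustration; in the notebook the same filtering and ordering would be done on the predictions dataframe.

```python
import heapq


def top_n_for_user(predictions, user_id, n=10):
    """Return the n (movieId, predicted_rating) pairs with the highest
    predicted rating for one user.

    `predictions` is assumed to be an iterable of
    (userId, movieId, predicted_rating) tuples.
    """
    user_rows = [
        (movie_id, score)
        for uid, movie_id, score in predictions
        if uid == user_id
    ]
    # nlargest returns the rows sorted from highest to lowest score.
    return heapq.nlargest(n, user_rows, key=lambda row: row[1])
```

Joining the resulting movie IDs with the “movies” dataset then gives the titles shown in the table.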

As seen in Step 2, the ALS algorithm takes multiple parameters to create the model. We can iterate over different values of these parameters to find the model with the highest prediction accuracy. One of the most important parameters is “rank”, which sets the number of latent factors in the model (i.e. the “features” in our earlier discussion).
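One simple way to organize such a search is to evaluate each candidate rank and keep the one with the lowest RMSE. In this sketch, `evaluate` is a hypothetical callable that would fit an ALS model with the given rank and return its RMSE on held-out data; it is not part of the original notebook.

```python
def select_best_rank(ranks, evaluate):
    """Pick the rank with the lowest error.

    `evaluate` is a hypothetical callable mapping a rank to an RMSE
    measured on held-out data. Returns (best_rank, all_scores).
    """
    scores = {rank: evaluate(rank) for rank in ranks}
    # Lower RMSE means higher prediction accuracy.
    best_rank = min(scores, key=scores.get)
    return best_rank, scores
```

Returning every score alongside the winner makes it easier to see whether the error is still falling at the largest rank tried, which would suggest widening the search.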

Find these optional steps to determine the best model by changing rank values in our full code!

What’s next?

So far, we have explored different recommendation strategies and implemented sample algorithms using Azure Databricks. Having seen that Azure Databricks provides a reliable environment for data analysis, we may want to integrate it with other Azure services through pipelines so that jobs can be scheduled automatically. In the next blog, we will introduce the steps to incorporate Azure Databricks into an Azure Data Factory pipeline.


