Recently, Alejandro Bellogín and I released a toolkit for recommender system evaluation and benchmarking. We wanted to create an objective, transparent, and portable setting for evaluating recommender systems. Why, you might ask? Well, we had a feeling this would be a good thing.
In 2013, both Alejandro and I found ourselves at CWI as Marie Curie fellows. Given the topics of our (then quite recent) respective dissertations, we naturally started looking into different aspects of recommender system evaluation. One of these was how the evaluation performed in different frameworks (e.g. MyMediaLite or Apache Mahout) compared to one another. This started as a thought experiment while I was finalizing a book chapter on benchmarking together with Domonkos Tikk and Paolo Cremonesi. For that chapter, I put together a list of recommendation frameworks (below). Alejandro and I thought it would be interesting to see how some of these frameworks compared to each other when using the same datasets, evaluation settings, and overall setup.
Common frameworks used for recommendation
This led to us sitting down and, over a few months, writing a lot of code to run different frameworks and benchmark them against each other using a separate evaluation tool. We started by adding support for Lenskit and Mahout, as both are written in Java, the language we selected for our toolkit. We also added some support for MyMediaLite through bash scripts. After much coding, we had a standalone toolkit that does not need to know how or where the recommendations are generated, as long as it gets access to a predictions file and an evaluation file. We christened the toolkit RiVal and started running benchmarks.
An evaluation setup in RiVal consists of three steps: data splitting, candidate item generation, and evaluation. Each step is configurable, both to make sure the toolkit can be used in many contexts and to provide objective evaluations. The split step handles the data splitting, i.e. which portion of the dataset ends up in the training set and which in the test set. The step is configured through a strategy, e.g. random, time-aware, or a fixed ratio (80-20, 90-10, etc.). The candidate item generation step specifies which items/users will actually be evaluated, e.g. whether all users should exist in both the training and test sets (user-based), whether, correspondingly, all items should exist in both sets, or whether a certain relevance threshold should be used when selecting items for the test set (item-based).
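To give a feel for what a ratio-based split strategy does, here is a minimal sketch in Java. Note that this is an illustration, not RiVal's actual API; the class and method names are hypothetical, and each string simply stands in for one "user item rating" record:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a "ratio" split strategy: shuffle the
// interactions with a fixed seed and put the first trainRatio fraction
// in the training set and the remainder in the test set.
public class RatioSplit {
    public static List<List<String>> split(List<String> data, double trainRatio, long seed) {
        List<String> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible
        int cut = (int) Math.round(shuffled.size() * trainRatio);
        List<String> train = new ArrayList<>(shuffled.subList(0, cut));
        List<String> test = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add("u" + i + " i" + i + " 4.0");
        List<List<String>> parts = split(data, 0.8, 42L);
        System.out.println(parts.get(0).size() + " train / " + parts.get(1).size() + " test");
        // prints "80 train / 20 test"
    }
}
```

Fixing the random seed is what makes such a split reproducible across frameworks, which is exactly the point of doing the splitting outside the recommenders themselves.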
The final step, the evaluation, is the only mandatory one; it is where the evaluation metrics to be used are selected and computed. As input, the process requires a validation set and the recommendations generated by the recommender. The split and candidate item steps can be skipped if the dataset splitting is internal to the recommendation framework, although doing so does not allow the same fine-grained control of the evaluation and benchmarking process as the full pipeline does.
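As an example of what the evaluation step computes, the sketch below shows the standard nDCG metric (used in the results later in this post) over a ranked recommendation list and the test-set ratings. Again, this is a hypothetical illustration rather than RiVal's API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative nDCG@k: the DCG of the recommended ranking divided by the
// DCG of the ideal ranking (items sorted by their test-set relevance).
public class Ndcg {
    static double dcg(List<Double> gains) {
        double dcg = 0.0;
        for (int i = 0; i < gains.size(); i++) {
            dcg += gains.get(i) / (Math.log(i + 2) / Math.log(2)); // discount: log2(rank + 1)
        }
        return dcg;
    }

    public static double ndcgAtK(List<String> ranking, Map<String, Double> testRatings, int k) {
        List<Double> gains = ranking.stream().limit(k)
                .map(item -> testRatings.getOrDefault(item, 0.0))
                .collect(Collectors.toList());
        List<Double> ideal = testRatings.values().stream()
                .sorted(Comparator.reverseOrder()).limit(k)
                .collect(Collectors.toList());
        double idcg = dcg(ideal);
        return idcg == 0.0 ? 0.0 : dcg(gains) / idcg;
    }

    public static void main(String[] args) {
        Map<String, Double> test = Map.of("a", 3.0, "b", 2.0, "c", 1.0);
        System.out.println(ndcgAtK(List.of("a", "b", "c"), test, 3)); // perfect ranking: 1.0
        System.out.println(ndcgAtK(List.of("c", "b", "a"), test, 3)); // reversed ranking: < 1.0
    }
}
```

Even for a metric this standard, implementations differ in details such as the discount base and how unrated items are handled, which is one reason results from different frameworks are hard to compare directly.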
Evaluating Lenskit, Mahout and MyMediaLite
Coming back to our initial idea of comparing the same algorithms in different frameworks, we decided to compare three very common algorithms implemented in Lenskit, MyMediaLite and Mahout, namely Item-based Collaborative Filtering, User-based Collaborative Filtering (k-nearest neighbors) and Matrix Factorization (FunkSVD and SVD++). We focused on comparing not only the recommendation accuracy, but also the (item and user) coverage of the recommendations, and the time needed for the framework to recommend.
Since no single SVD factorization method is implemented in all three frameworks, we used FunkSVD in Mahout and Lenskit, and SVD++ in MyMediaLite - this is obviously reflected in the results.
For each algorithm, we ran the evaluation on the MovieLens 100k dataset, using two different training/test splits, cross-validation (cv) and per-user ratio (rt), and three different values for k (the number of nearest neighbors, or of latent factors for SVD): 10, 50, and the square root of the number of items in the training set.
Below are the results for different metrics used in our evaluation.
Keys to column labels: AM = Apache Mahout; LK = Lenskit; MML = MyMediaLite; gl = global; pu = per user; cv = cross validation; rt = ratio.
Keys to row labels: UB = User-based; Pea = Pearson Correlation; Cos = Cosine similarity; IB = Item-based.
Time spent on recommending
Above, we see, not unexpectedly, that the running time of the algorithms differs. Specifically, MyMediaLite appears to be generally a little slower than the other frameworks. It should be noted that all experiments were run on two identical workstations running Fedora Linux. Since MyMediaLite is written in C#, it was run using the Mono framework.
When it comes to catalog coverage, i.e. the fraction of items in the training set that are actually recommended, we again see that the three frameworks perform differently. Mahout and MyMediaLite appear to have problems achieving high coverage (note that blue indicates low coverage).
Similarly, when looking at user coverage, Mahout again appears to have problems when it comes to recommending to all users. This is likely because Mahout will only recommend to users who have at least a predefined number of items (twice the number of items to be recommended).
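Both coverage notions come down to simple set arithmetic over the recommendation lists. A hypothetical sketch (not RiVal's API) of how they can be computed:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative coverage metrics:
// catalog coverage = distinct recommended training-set items / all training-set items
// user coverage    = users who received at least one recommendation / all users
public class Coverage {
    public static double catalogCoverage(Map<String, List<String>> recs, Set<String> trainingItems) {
        Set<String> recommended = recs.values().stream()
                .flatMap(List::stream).collect(Collectors.toSet());
        recommended.retainAll(trainingItems); // only count items from the training catalog
        return (double) recommended.size() / trainingItems.size();
    }

    public static double userCoverage(Map<String, List<String>> recs, Set<String> allUsers) {
        long covered = allUsers.stream()
                .filter(u -> !recs.getOrDefault(u, List.of()).isEmpty()).count();
        return (double) covered / allUsers.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> recs = Map.of(
                "u1", List.of("i1", "i2"),
                "u2", List.of("i1"));
        Set<String> items = Set.of("i1", "i2", "i3", "i4");
        Set<String> users = Set.of("u1", "u2", "u3");
        System.out.println(catalogCoverage(recs, items)); // 0.5 (2 of 4 items recommended)
        System.out.println(userCoverage(recs, users));    // 2 of 3 users covered
    }
}
```

The point of reporting these alongside accuracy is visible in the example: a recommender can look accurate while silently skipping users or large parts of the catalog.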
Perhaps most interesting is the figure depicting more traditional recommendation accuracy, i.e. nDCG, above. If the three frameworks performed similarly, the colors on each row would not change. What we see instead is that the frameworks perform widely differently from each other. MyMediaLite seems to generally outperform the others. However, taking the coverage issues in the figures above into consideration, this does not necessarily mean that MyMediaLite is the "best" of the frameworks.
So, what does this mean for me?
Well, provided that you are a recommender systems practitioner or researcher, you might be interested in using RiVal - or at the very least in making sure that you look at your results in the right context. A direct comparison of your algorithm's performance to something you read in a paper or online might not be meaningful. Only compare your performance within the same framework, on the same data, and with the same data splits. You can also, obviously, use RiVal in your work ;)
Where do I get RiVal?
Where do I read more about this?
- A. Said, A. Bellogín. Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. ACM RecSys 2014.
- A. Said, A. Bellogín. RiVal - A Toolkit to Foster Reproducibility in Recommender System Evaluation. ACM RecSys 2014.