In this post I'll describe our two most recent papers related to the magic barrier of recommender systems. The magic barrier is a concept introduced by Herlocker et al., the term refers to the point at which the accuracy of an algorithm cannot be improved, e.g. where higher precision accuracy, or lower prediction error, does not actually mean that the algorithm is improving.

Before you start reading, you should know that this post has it's fair share of equations and math, but the core concept of it should be clear even when not looking into the mathematical details (at least the first half of it).

I'll start by describing some of the background to why we got into the magic barrier business to start with, if you're only interested in the actual formalization of the magic barrier then go ahead an skip to here.

## Background

The first time I attended the ACM Recommender Systems Conference (RecSys2009), Xavier Amatriain (@xamat) presented a paper titled "Rate in again: Increasing recommendation accuracy by user re-rating" (Xavier actually wrote a blog post about the paper, which I recommend you to read). For me, this paper was one of the most interesting I saw at the conference. It wasn't entirely related to what I thought would be my research topic (still being fairly new in the Ph.D game then). Roughly two years later I did however find myself doing something related to the topics covered in the paper.

At the time, I was working in a project with moviepilot, and one of the things we wanted to figure out was whether we could measure the quality of their recommender engine other than by means of precision, recall, rmse, and the likes. Together with a few colleagues (Sascha & Till) and moviepilot we thought up the idea to ask moviepilot's users about their opinions on movies they had rated in the past. We chose the term opinions (German: Meinung) instead of re-rating to make sure we weren't subconsciously asking the participants to change their ratings. To collect these opinions we created a Web interface, and then the good people at moviepilot posted the link to the thing on their website.

## The Study

The concept of the study was pretty straight-forward: get a bunch of users - ask them to give opinions on movies they've seen in the past - analyze the data - profit.

We needed to make sure that the data we collected would not be biased in any way. This meant that the experience of giving opinions needed to be similar to when a user rates movie. To make sure of this, we practically made a 1:1 copy of moviepilot's own rating interface, but instead of populating it with movie recommendations, we filled it with a random selection of the movies the users had rated.

The two images below show moviepilot's rating interface (first image) and our rating interface (second image).

[caption id="attachment_1005" align="aligncenter" width="300"] moviepilot's rating interface[/caption]

[caption id="attachment_1006" align="aligncenter" width="310"] Our opinion collector[/caption]

We ran the study for a couple of weeks in April and May of 2011, and in the end we had collected 6, 299 new opinions on 2, 329 movies by 306 users. This data is what we then used to estimate the magic barrier of moviepilot.

## The Magic Barrier

The study lead to two publications, a poster titled Estimating the Magic Barrier of Recommender Systems: A User Study at SIGIR 2012 and a full paper at UMAP 2012 titled Users and Noise: The Magic Barrier of Recommender Systems. The poster is based on a very early analysis of our data and presents a fairly simplistic attempt at estimating the magic barrier. Both of these papers would have not been possible without the help of Brijnesh who came up with the idea and approach of using ERM (see The Longer One) for magic barrier estimation.

### The Short and Simple (RMSE-based)

This simple estimate of the magic barrier is almost embarrassingly trivial. In a traditional rating prediction-based evaluation, the quality of the recommender engine is expressed in RMSE where the error calculated by comparing the predicted rating  to an actual rating which has been withheld from the recommendation algorithm (e.g. it's not in the training set).

Given the fact that we now have an additional rating set, i.e. the opinions we collected, we can calculate the RMSE of the users themselves, i.e. giving us an estimate of how much or little the users of moviepilot actually now about their own taste.

[caption id="attachment_1009" align="aligncenter" width="300"] Simple RMSE estimation[/caption]

In the equation above, we simply replace the traditional predicted value with the collected opinion (o above) and compare it to the original/true rating (r above).

Our assumption is that the error measure we obtain reflects the level of optimization we can perform on an algorithm before the improvement in prediction accuracy will not be perceived by the users.

If attempting to use a similar magic barrier estimate, one should take the user's rating context into consideration, i.e. perhaps not calculating one magic barrier for the whole set of ratings, rather focusing on creating a set of magic barriers - one for each context.

The complete approach is described in this paper:

@inproceedings{Said:2012:EMB,
author = {Said, Alan and Jain, Brijnesh J. and Narr, Sascha and Plumbaum, Till and Albayrak, Sahin and Scheel, Christian},
title = {Estimating the Magic Barrier of Recommender Systems: A User Study},
booktitle = {Proceedings of the 35th international ACM SIGIR conference on Research and development in Information Retrieval},
series = {SIGIR '12},
year = {2012},
isbn = {978-1-4503-1472-5},
location = {Portland, OR, USA},
numpages = {2},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {recommender systems, evaluation, noise},
}

### The Longer One (ERM-based)

Ok, so this one is not as short, and not as simple as the short and simple one. But it does make more sense..

The idea behind this magic barrier estimation is based on empirical risk minimization (ERM). For those who fell asleep in Statistics 101, Wikipedia says "Empirical risk minimization is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on the performance of learning algorithms." (click the link above to see the full article).

$$p(u,i)$$ - describes the likelihood of user u rating item i
$$p(r|u,i)$$ - describes the likelihood of user u rating item i with rating score r
$$\mathcal{F}$$ - is a class of rating functions from which we learn the rating function f

We now want to learn a rating function $$f \in \mathcal{F}$$ which minimizes the expected risk function $$R(f) = \sum_{(u,i, r)} p(u, i, r)\big(f(u,i) - r\big)^2$$, meaning that whenever the probability $$p(u,i,r)$$ is high, we want our rating function $$f(u,i)$$ to give good predictions, otherwise the risk becomes high. Consequently, when the probability is low, the prediction error doesn't really matter (well, it does, but not as much) as it will be reduced through the low probability value.

However, since the distribution $$p(u,i,r)$$ is unknown, we cannot compute the optimal rating function $$f_* = \arg\min_{f \in \mathcal{F}} R(f)$$. What we can do instead is approximate $$f_*$$ by minimizing the empirical risk, i.e.
$$\widehat{R}(f|\mathcal{X}) = \frac{1}{\mathop{|\mathcal{X}|}} \sum_{r_{ui}\in \mathcal{X}} \big(f(u,i) - r_{ui}\big)^{2}$$, where $$\mathcal{X} \subseteq \mathcal{R}$$ is the training set (i.e. all ratings given by user u on item i). As a result, minimizing the empirical risk is the same as minimizing RMSE.

#### Modeling User Inconsistencies

Now, that we've formulated an ERM model for minimizing RMSE, we continue with a model for user inconsistencies (i.e. the reason to why the collected opinions were not identical to the original ratings given by users). The goal of this being to show that the magic barrier is the standard deviation of the inconsistencies.

Numerous publications have discussed the inherent rating inconsistencies that users/people seem to be afflicted by (the paper mentioned by Amatriain et al. in the beginning of this post being just one of them). In order to formulate our model for the magic barrier, just as in the RMSE-based model, we will regard all inconsistencies as noise. We're not trying to argue that the taste of users doesn't change (it does), what we're trying to show is a general model of the magic barrier. The final model can be applied to context-aware approaches in order to get context-specific magic barriers (that's something we're currently leaving as future work).

Keeping our user study in mind, we say that the expected rating of user u on item i is defined by the expectation $$\mathbb{E}[R_{ui}] = \mu_{ui}$$ where $$R_{ui}$$ is a random variable taking a value from the rating scale, e.g. in Movielens or Netflix this would be 1, ..., 5, in moviepilot this is a value between 0 and 10 in steps of 0.5).
A rating then becomes $$r_{ui} = \mu_{ui} + \varepsilon_{ui}$$ where $$\mu_{ui}$$ is the expected value of the rating, and the $$\varepsilon_{ui}$$ the noise (which arises from the user's inconsistency).

#### Modeling the Magic Barrier

Having created a model for user inconsistencies, we can now model the the magic barrier. Let's assume that $$f_*$$ is the true and unknown rating function - it knows all expected ratings of all users on all items, i.e. $$f_*(u,i)=\mu_{ui}$$.
This optimal rating function minimizes the expected risk function $$R(f)$$ above, which we (given the above equations) now can express as $$R(f_*) = \sum_{(u,i)} p(u,i)\mathbb{E}\left[\varepsilon_{ui}^2 \right] = \sum_{(u,i)} p(u,i)\mathbb{V}\left[\varepsilon_{ui} \right]$$, where $$\mathbb{V}\left[\varepsilon_{ui} \right]$$ is the noise variance.

So, having all this, we can now express the magic barrier of a recommender system in terms of RMSE as $$B_{\mathcal{U} \times \mathcal{I}} = \sqrt{\sum_{(u,i)} p(u,i)\mathbb{V}\left[\varepsilon_{ui} \right]}$$ (if you can't see why, then replace all symbols in the equation with their corresponding equations).

Finally, we can express the RMSE of the function $$f$$ as $$E_{RMSE}(f) = B_{\mathcal{U} \times \mathcal{I}} +E_f > B_{\mathcal{U} \times \mathcal{I}}$$.

Having this expression for the magic barrier, we used the dataset collected in the survey described at the top of this article to see whether moviepilot's algorithm still had room for improvement or whether it had hit the magic barrier. The whole process is described in detail in this paper:

@incollection{Said:2012:UNM,
year={2012},
isbn={978-3-642-31453-7},
volume={7379},
series={Lecture Notes in Computer Science},
editor={Masthoff, Judith and Mobasher, Bamshad and Desmarais, MichelC. and Nkambou, Roger},
doi={10.1007/978-3-642-31454-4_20},
title={Users and Noise: The Magic Barrier of Recommender Systems},
url={http://dx.doi.org/10.1007/978-3-642-31454-4_20},
publisher={Springer Berlin Heidelberg},
keywords={Recommender Systems; Noise; Evaluation Measures; User Inconsistencies},
author={Said, Alan and Jain, BrijneshJ. and Narr, Sascha and Plumbaum, Till},
pages={237-248}
}

This post has also been posted here.

Updated: