Recently I presented a paper on the effects of similarity weighting schemes on collaborative filtering recommender systems at the 27th ACM Symposium on Applied Computing, more specifically in the Track on Trust, Reputation, Evidence and Collaborative Know-How.
I received a few questions about the presentation; this post aims to answer some of them and to provide a short summary of the paper.
In any interaction dataset, a quick analysis will show a large popularity bias. The reasons for this range from the media exposure of blockbuster movies to the way people reason when they choose to watch, buy, or listen to a certain item (Marlin and Zemel 2009).
Rating-wise, this has the effect that popular items are also predominantly highly rated: in the Movielens dataset, the three most popular movies are rated with 5 stars by between 15 and 18 thousand users (i.e. 20-25% of all users).
The effect of this is that many users will be similar to each other based only on the popular and, in general, highly rated movies, meaning that the movies they get recommended might not reflect their "true" taste.
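To make this concrete, here is a small illustrative sketch (not from the paper; the rating vectors are made up) of how a few universally loved blockbusters can dominate a similarity score between two users who otherwise disagree:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two rating vectors (0 = unrated)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two hypothetical users who agree only on three blockbusters (items 0-2,
# rated 5 stars by "everyone") and disagree on the rest of their profiles.
alice = np.array([5, 5, 5, 1, 0, 2, 0])
bob   = np.array([5, 5, 5, 5, 4, 0, 1])

print(cosine(alice, bob))          # high overall similarity...
print(cosine(alice[3:], bob[3:]))  # ...driven almost entirely by the blockbusters
```

With the popular items included the two users look like close neighbors; restricted to the long-tail items, the similarity drops sharply.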
Using weighting schemes to mitigate the effects of popularity bias
Weighting schemes have been used in collaborative filtering before, and it has been shown that, in general, they do not add much to the performance of the system. However, no work has investigated whether this is a general effect, or whether weighting schemes could improve the performance of a recommender system when applied to a group of users.
For this purpose, we compared how a k Nearest Neighbor recommender performs for users who have rated few, more than few, and many items. We used three different similarity metrics to build the neighborhoods: cosine similarity, Pearson correlation, and Euclidean distance, and measured the performance of all three on both datasets (Movielens and moviepilot).
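The three neighborhood metrics can be sketched as follows; this is a minimal illustration over the co-rated items of two users, not the paper's exact implementation (details such as handling of co-rating overlap may differ):

```python
import numpy as np

def corated(u, v):
    """Restrict two rating vectors to the items both users have rated (0 = unrated)."""
    mask = (u > 0) & (v > 0)
    return u[mask], v[mask]

def cosine_sim(u, v):
    a, b = corated(u, v)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson_sim(u, v):
    a, b = corated(u, v)
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / denom if denom else 0.0

def euclidean_sim(u, v):
    # Euclidean distance mapped into (0, 1], so identical profiles score 1.
    a, b = corated(u, v)
    return 1.0 / (1.0 + np.linalg.norm(a - b))

u = np.array([5, 3, 0, 4])
v = np.array([4, 3, 2, 5])
print(cosine_sim(u, v), pearson_sim(u, v), euclidean_sim(u, v))
```

A kNN recommender then picks, for each user, the k neighbors with the highest similarity and aggregates their ratings into recommendations.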
The results of the experiments can be summarized as follows: when users have rated more than few movies, but less than many (30-100 ratings), and the rating scale is compact (1-5 stars, as in Movielens), then applying a weighting scheme adds to the quality of the recommendations (in terms of precision and recall).
The picture above shows the percentage change in precision for the recommenders using the Pearson correlation, compared to the unweighted approach. The difference between the two weighting schemes used (inverse user frequency and inverse popularity) is insignificant. The axes show the number of ratings by the user (x-axis) and the percentage improvement (y-axis).
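For readers unfamiliar with these weighting schemes, here is a rough sketch of inverse user frequency in the style of Breese et al. (1998); the exact formulas used in the paper are not reproduced here, and the rating matrix is a toy example:

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items, 0 = unrated).
R = np.array([
    [5, 5, 0, 1],
    [5, 4, 2, 0],
    [5, 0, 3, 4],
    [5, 5, 0, 0],
])

n_users = R.shape[0]
n_rated = (R > 0).sum(axis=0)   # how many users rated each item

# Inverse user frequency: log(n / n_i). An item rated by every user gets
# weight 0, while rarely rated (long-tail) items get the largest weights.
iuf = np.log(n_users / n_rated)

# Applying the weights before computing similarities means item 0
# (rated by all four users) no longer contributes to any similarity.
R_weighted = R * iuf
print(iuf)
```

The idea in both schemes is the same: down-weight ratings on popular items so that similarity between users is driven by the long tail rather than by the blockbusters everyone has rated.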
Taking a closer look at the specific recommender, we see the following
which shows an improvement for users who have rated between 30 and 100 items. However, this improvement was not significant in the moviepilot dataset. The primary difference between the two datasets is the rating scale, which in Movielens is 1-5 stars and in moviepilot 0-10 stars. The more "compact" scale of Movielens creates what we believe are artificial similarities between users, based on popularity. This also happens with the wider rating scale of moviepilot, but the effects of the bias are lower, and thus so are the effects of a weighting scheme.