I’ve been reading Programming Collective Intelligence, and this is a summary of what I learnt from chapter 2 'making recommendations'. note my github repo for code example.

Making recommendation in context of collective intelligence is usually based on accepting average weighted preference of other users based on how similar they are from a user whom you are making recommendations for.

We do this all the time when we take recommendations from people. We naturally build a sense of similarities in taste with people we know. Recommendation from people with similar taste will carry more weight, and we generally take bunch of recommendations before selecting a few that are most favoured by majority. It’s fair to say that we often have different circles of people who we seek recommendations from depending on the nature of subject and thus it’s natural to expect existence of multi dimensional similarity matrix.

Now for the time being, if we turn that fuzzy measure of similarity into a number between -1 to 1, it all seems fairly mathematical. (-1 being completely opposite in taste and 1 being completely in sync in taste)

Weighted preference of another user is a product of similarity score and rating for the preference. So rating for the preference will carry more weight if she or he is similar to you. We would then take an average weighted preferences from all the users in the system so that you get representative recommendations of population.

It’s also possible to turn the similarity score around to measure item similarity. Item similarity score is useful to recommend similar items.

There are quite a few well known ways to calculate similarity score. One of them is Pearson Correlation Score. Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations

Covariance of two variables is a bit like variance of one variable except instead of squaring the deviation from the mean, you multiply two deviations from the mean for each variable. If two variables moves away from its means with the same speed in the same direction then the covariance of two variables is going to be the same as variance of each variable. With similar reasoning, if two variables moves away from its means with the same speed in the opposite direction, it will be same as negative value of variance of each variable. This value is then divided by product of standard deviations of two variables to normalise the covariance into number between -1 to 1.

It’s all good, however you will need user preference data with enough intersection to build a good similarity scores and thus make a good recommendation. That’s why Amazon’s recommendation works quite well.

What if we use Amazon’s recommendation engine instead? Every products in Amazon has its own SKU (stock keeping unit) identifier (ASIN - Amazon Standard Item Number) as well as other identifiers such as UPC, EAN etc. ItemLookup operation can be used to find corresponding product from Amazon with more industry standard identifiers such as UPC and EAN. Includes Similarities in ResponseGroup to get similar products. Now match those similar products in Amazon back to products in your store.