07 August 2011

Making recommendations from Amazon metrics

I’ve been reading Programming Collective Intelligence, and this is a summary of what I learnt from chapter 2, 'Making Recommendations'. See my GitHub repo for code examples.

Making recommendations in the context of collective intelligence is usually based on taking the average weighted preferences of other users, weighted by how similar they are to the user you are making recommendations for.

We do this all the time when we take recommendations from people. We naturally build a sense of similarity in taste with people we know. Recommendations from people with similar taste carry more weight, and we generally gather a bunch of recommendations before selecting the few that are most favoured by the majority. It’s fair to say that we often have different circles of people we seek recommendations from depending on the subject, so it’s natural to expect a multi-dimensional similarity matrix.

Now, for the time being, if we turn that fuzzy measure of similarity into a number between -1 and 1, it all seems fairly mathematical (-1 being completely opposite in taste and 1 being completely in sync).

The weighted preference of another user is the product of their similarity score and their rating for the item, so a rating carries more weight if that user is similar to you. We would then take the average of the weighted preferences from all the users in the system, so that you get recommendations representative of the population.
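As a minimal sketch of that idea (all names here are illustrative, and similarity scores are assumed to be precomputed):

```python
def recommend(prefs, similarity, target):
    """Recommend items to `target` as a similarity-weighted average of
    other users' ratings. `prefs` maps user -> {item: rating};
    `similarity` maps user -> score in [-1, 1] relative to `target`."""
    totals, sim_sums = {}, {}
    for user, ratings in prefs.items():
        if user == target:
            continue
        sim = similarity.get(user, 0)
        if sim <= 0:  # ignore users with no or negative correlation
            continue
        for item, rating in ratings.items():
            if item in prefs[target]:
                continue  # only recommend items the target hasn't rated
            totals[item] = totals.get(item, 0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0) + sim
    # divide by the total similarity so items rated by many users
    # aren't favoured simply for being rated often
    ranked = [(total / sim_sums[item], item) for item, total in totals.items()]
    return sorted(ranked, reverse=True)
```

Dividing by the sum of similarities, rather than the number of users, is what makes this a weighted average.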

It’s also possible to turn the similarity score around to measure item similarity, which is useful for recommending similar items.

There are quite a few well-known ways to calculate a similarity score. One of them is the Pearson correlation score. Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations.

The covariance of two variables is a bit like the variance of one variable, except instead of squaring the deviation from the mean, you multiply the two variables' deviations from their respective means. If two variables move away from their means at the same speed in the same direction, their covariance equals the variance of each variable. By similar reasoning, if they move away from their means at the same speed in opposite directions, the covariance equals the negative of the variance of each variable. This value is then divided by the product of the standard deviations of the two variables to normalise the covariance into a number between -1 and 1.
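Following that definition directly, a minimal Pearson score between two users, computed over the items both have rated, might look like this sketch:

```python
from math import sqrt

def pearson(ratings1, ratings2):
    """Pearson correlation over the items both users have rated.
    ratings1/ratings2 map item -> rating; returns a score in [-1, 1]."""
    shared = [item for item in ratings1 if item in ratings2]
    n = len(shared)
    if n == 0:
        return 0  # no overlap, no basis for a score
    xs = [ratings1[i] for i in shared]
    ys = [ratings2[i] for i in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # covariance: multiply the two deviations from the mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # standard deviations (the 1/n factors cancel in the ratio, so
    # they are omitted from both numerator and denominator)
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0  # a flat line has no defined correlation
    return cov / (sd_x * sd_y)
```

Two users whose shared ratings rise together score close to 1; ratings that move in opposite directions score close to -1.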

That’s all good; however, you will need user preference data with enough intersection to build good similarity scores and thus make good recommendations. That’s why Amazon’s recommendations work quite well.

What if we use Amazon’s recommendation engine instead? Every product in Amazon has its own SKU (stock keeping unit) identifier (ASIN, Amazon Standard Identification Number) as well as other identifiers such as UPC, EAN, etc. The ItemLookup operation can be used to find the corresponding product in Amazon via more industry-standard identifiers such as UPC and EAN. Include Similarities in the ResponseGroup to get similar products, then match those similar products in Amazon back to the products in your store.
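As a rough sketch of the request parameters involved (endpoint, credentials and request signing are omitted, and the helper name is my own; the parameter names follow the Product Advertising API):

```python
def item_lookup_params(upc, associate_tag, access_key):
    """Build the query parameters for an ItemLookup request that finds
    a product by UPC and asks for its similar products."""
    return {
        "Operation": "ItemLookup",
        "IdType": "UPC",       # look up by an industry-standard identifier
        "ItemId": upc,
        "SearchIndex": "All",  # a SearchIndex is required for non-ASIN lookups
        "ResponseGroup": "Similarities",  # include similar products
        "AssociateTag": associate_tag,
        "AWSAccessKeyId": access_key,
    }
```

The Similarities response group returns the ASINs and titles of similar products, which you can then map back to your own catalogue.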

09 April 2011

Dealing with lots of small files in Hadoop MapReduce with CombineFileInputFormat

Input to a Hadoop MapReduce process is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper-bounded by the block size. This means the number of input splits is lower-bounded by the number of input files. This is not an ideal environment for a MapReduce process dealing with a large number of small files, because the overhead of coordinating distributed processes is far greater than when there is a relatively small number of large files. Note that when an input split spills over block boundaries, this can work against the general rule of 'having the process close to the data', because the blocks could be at different network locations.

Enter CombineFileInputFormat: it packs many files into each split so that each mapper has more to process. CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it doesn't suffer from the problems of simply using a big split size.

public class MyCombineFileInputFormat extends CombineFileInputFormat&lt;Text, Text&gt; {

  public static class MyKeyValueLineRecordReader implements RecordReader&lt;Text, Text&gt; {
    private final KeyValueLineRecordReader delegate;

    public MyKeyValueLineRecordReader(
        CombineFileSplit split, Configuration conf, Reporter reporter, Integer idx) throws IOException {
      // carve the single file at index idx out of the combined split
      FileSplit fileSplit = new FileSplit(
          split.getPath(idx), split.getOffset(idx), split.getLength(idx), split.getLocations());
      delegate = new KeyValueLineRecordReader(conf, fileSplit);
    }

    public boolean next(Text key, Text value) throws IOException {
      return delegate.next(key, value);
    }

    public Text createKey() {
      return delegate.createKey();
    }

    public Text createValue() {
      return delegate.createValue();
    }

    public long getPos() throws IOException {
      return delegate.getPos();
    }

    public void close() throws IOException {
      delegate.close();
    }

    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
  }

  @Override
  public RecordReader&lt;Text, Text&gt; getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new CombineFileRecordReader&lt;Text, Text&gt;(
        job, (CombineFileSplit) split, reporter, (Class) MyKeyValueLineRecordReader.class);
  }
}
CombineFileInputFormat is an abstract class that you extend, overriding the getRecordReader method. CombineFileRecordReader manages the multiple input splits within a CombineFileSplit simply by constructing a new RecordReader for each one. MyKeyValueLineRecordReader creates a KeyValueLineRecordReader to delegate operations to.

Remember to set mapred.max.split.size to a small multiple of the block size in bytes, as otherwise there will be no splitting at all.

03 April 2011

Search over key value storage

This is a post about my talk ‘search over key value storage’ for our recent away day at Forward.

It's exciting to see what people implement on top of simple key value storage. With their inherent limitation to simple key lookups, these key value stores encourage developers to think in terms of data access patterns. This is a good thing. After all, your data storage strategy is largely limited by your choice of key value store and governed by your data access patterns. Thinking in terms of the data access patterns your storage needs to support will help you choose the right implementation.

The term ‘search’ is rather broad; here I’m just interested in realtime lookup based on redundant query indexes.

Let’s look at an example of building query indexes. I’m going to use Redis as my key value store to build a weblog publishing tool. I could model my blog entries like this:
  entry:9 => {
    title => 'processing event log part1',
    tags => [nosql, couchdb],
    date_published => 01/11/2010
  }
  entry:10 => {
    title => 'applying agile practices',
    tags => [randomthought, agilepractice],
    date_published => 20/11/2010
  }
  entry:11 => {
    title => 'search over key value storage',
    tags => [nosql, randomthought],
    date_published => 13/12/2010
  }
To support lookup by tag, I need sets of entry ids keyed by tag. This is an example of a typical property match lookup. It's important to realise that you don't need all the permutations of tags as keys to support the lookup; consider set operations instead. To look up all the entries tagged nosql but not randomthought, you just need the simple set difference tags:nosql - tags:randomthought:
  tags:nosql => [9,11]
  tags:randomthought => [10,11]
  tags:couchdb => [9]
  tags:agilepractice => [10]
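Plain Python sets can illustrate the lookup (a Redis implementation would build the sets with SADD and take the difference with SDIFF):

```python
# tag -> set of entry ids, mirroring the index above
tags = {
    "nosql": {9, 11},
    "randomthought": {10, 11},
    "couchdb": {9},
    "agilepractice": {10},
}

# entries tagged nosql but not randomthought: a set difference
result = tags["nosql"] - tags["randomthought"]
print(sorted(result))  # [9]

# entries tagged both nosql and randomthought: a set intersection
both = tags["nosql"] & tags["randomthought"]
print(sorted(both))  # [11]
```

Four small sets answer any boolean combination of the four tags, which is why the full permutation of tag keys isn't needed.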

To support lookup over a date_published range, I need an ordered set of entry ids with a custom weight honouring the natural order of date_published. This is an example of a typical property range lookup. To look up all the entries between 15/11/2010 and 15/12/2010, work out the weights of those two dates first, then get the subset of entries:by_date_published between those two weights. If your key value store doesn't support ordered sets with custom weights, you can implement a trie on top of it. Have a read of this white paper.
  entries:by_date_published => [9,10,11]
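The same idea sketched in Python: turn each date into a numeric weight (epoch seconds here) and keep entry ids sorted by that weight, so a range lookup becomes a slice. A Redis implementation would use ZADD and ZRANGEBYSCORE instead:

```python
import bisect
from datetime import datetime

def weight(date_string):
    """Turn a dd/mm/yyyy date into a sortable numeric weight."""
    return datetime.strptime(date_string, "%d/%m/%Y").timestamp()

# entries:by_date_published as (weight, entry_id) pairs, sorted by weight
entries_by_date = sorted([
    (weight("01/11/2010"), 9),
    (weight("20/11/2010"), 10),
    (weight("13/12/2010"), 11),
])

def published_between(start, end):
    """Entry ids whose date_published falls between two dates, inclusive."""
    lo = bisect.bisect_left(entries_by_date, (weight(start),))
    hi = bisect.bisect_right(entries_by_date, (weight(end), float("inf")))
    return [entry_id for _, entry_id in entries_by_date[lo:hi]]

print(published_between("15/11/2010", "15/12/2010"))  # [10, 11]
```

Any weight works as long as it preserves the natural order of the dates; epoch seconds are just a convenient choice.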

Should we build query indexes ourselves? Although it was fun to consider different strategies for building query indexes, it always seemed rather tedious and expensive to me. How about putting a search engine on top of the key value store? The good thing is that, when it comes to indexing the big data that made key value storage interesting in the first place, search engines like ElasticSearch and Solr support a distributed index space and help manage the locality of related keys. Here is an example using ElasticSearch with the rubberband gem:
  @rubberband = ElasticSearch.new('')
  @rubberband.update_mapping({
    :dynamic => "false",
    :properties => {
      :tags => {:type => "string", :index => "not_analyzed"},
      :date_published => {:type => "date", :format => "dd/MM/yyyy"}
    }
  }, :index => "yetitrails", :type => "entry")

  @rubberband.index(
    {:tags => %w[nosql couchdb], :date_published => "01/11/2010"},
    :index => "yetitrails", :type => "entry")

  @rubberband.index(
    {:tags => %w[randomthought agilepractice], :date_published => "20/11/2010"},
    :index => "yetitrails", :type => "entry")

  @rubberband.index(
    {:tags => %w[nosql randomthought], :date_published => "13/12/2010"},
    :index => "yetitrails", :type => "entry")

  @rubberband.search({
    :query => {:constant_score => {
      :filter => {
        :and => [
          {:term => {:tags => "nosql"}},
          {:range => {:date_published =>
            {:from => "15/11/2010", :to => "15/12/2010"}}}
        ]
      }
    }}
  }, :index => "yetitrails", :type => "entry")

There's an ongoing effort toward better integration between NoSQL databases and search engines, so that all sounds good. But what if we need to search for different things tomorrow? That question is directly related to the volatility of your data access patterns and query indexes. Given the nature of big data, key value storage solutions like HBase and Cassandra provide Hadoop integration so that we can reindex in batch. Hadoop's contrib/index provides a utility to build or update a Lucene index using a Hadoop cluster.

Given all the complications, should we search over key value storage? It depends on the nature of your problem domain. But if you do need to, it's not such a bad idea either :)

13 February 2011

Applying agile practices in an environment of mistrust

Recently I had an opportunity to reflect on my past experience of applying agile practices in various projects. I want to share how I think their effective application relies heavily on trust within a team.

There’s an element of trust in any relationship, and we tend to grow mutual trust over time. When trust is broken, people sometimes seek counselling to start the long process of rebuilding the relationship. A counsellor is a bit like the external consultants you might bring in when things are not working in your organisation. As external consultants, it’s tricky to get the nature of the engagement right. I’ve been on a few projects where external consultants were brought in to deliver a solution, but no one in the company was interested in mending that broken relationship. Often many agile practices were used with the aim of providing greater transparency and repeatable success. They are a great stepping stone towards building trust and eventually mending that broken relationship. It’s important to keep in mind, though, that agile practices are just tools, not your main objective. In fact, like any tool, they can be misused, especially in an environment of mistrust.

Let’s talk about the misuse of velocity. Velocity measures how fast a team goes through an iteration. It’s based on the trust that everyone in the team performed at their best. Actual velocity, when sustained over a period of time, can represent the team's delivery capacity, and knowing that capacity helps to prioritise stories because you know how much can be done. I’ve seen a few projects where management focused solely on managing the target velocity and the variation from the estimated points for each story, so much so that there was no time left for them to engage with the rest of the team as new features were built. This encourages negative behaviour from developers. I’ve seen a few instances of skewed representations of velocity (such as taking partial points from partially done stories, or from stories that hadn't been integrated with the rest of the system) to live up to an expectation of higher velocity than what was delivered. I’ve also seen inflationary trends in story points over time, a bit like a bubble in the property market. When velocity turns into a target, it’s no longer a useful tool.

What about iterations? No, I’m not talking about iterative development. An iteration is a tool for iterative development, and it’s not the only way to do it. I’ve seen a few projects where planning an iteration was a struggle: filling enough work for an iteration of one or two weeks was a relatively expensive exercise. Running iterations in such an environment can sometimes encourage people to switch their mindset back to project delivery based on a rigid contract. Friction between management and developers arising from requirements churn was not uncommon. “That’s a new story, it wasn’t in the acceptance criteria,” said a developer. “This new story can’t have any story points associated with it, it’s a bug,” said a manager. When an iteration turns into a rigid, contract-based interaction between management and developers, it’s no longer a useful tool.

It’s true that agile practices make inefficiencies in an organisation painfully visible. But applying agile practices out of context could in fact harm the business as it moves forward, out of that bitter relationship you sought counselling for. In my opinion, all these agile practices should be treated as tools, not as goals; the focus should be on supporting management to run their business ideas iteratively. Look out for opportunities to engage with management. Support them in trying their incomplete ideas to capture time-sensitive business opportunities. Does that encourage positive change in behaviour? It has a vast impact on people's mindset. For developers, it has to be cheap to try a fuzzy idea. For management, it has to be nimble enough to respond to an unexpected outcome. You will be encouraged to innovate, and you will find a closer relationship between business and technology, and thus greater trust. One tool I would recommend to support that series of short bursts of trial and learning is continuous delivery. In fact, releasing a few times a week for every new feature shouldn’t be hard. If that sounds hard and risky, make it easy to deploy and revert. Keep your development and deployment cycle short so it’s cheap to make mistakes along the way.