09 April 2011

Dealing with lots of small files in Hadoop MapReduce with CombineFileInputFormat

Input to Hadoop MapReduce process is abstracted by InputFormat. FileInputFormat is a default implementation that deals with files in HDFS. With FileInputFormat, each file is splited into one or more InputSplits typically upper bounded by block size. This means the number of input splits are lower bounded by number of input files. This is not an ideal environment for MapReduce process when it's dealing with large number of small files, because overhead of coordinating distributed processes is far greater than when there are relatively small number of large files. Note here when the input split spills over block boundaries this could work against the general rule of 'having process close to data', because blocks could be at different network locations.

Enter CombineFileInputFormat, it packs many files into each split so that each mapper has more to process. CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split so it doesn't suffer from the same problem of simply having a big split size.

public class MyCombineFileInputFormat extends CombineFileInputFormat {

  public static class MyKeyValueLineRecordReader implements RecordReader {
    private final KeyValueLineRecordReader delegate;

    public MyKeyValueLineRecordReader(
      CombineFileSplit split, Configuration conf, Reporter reporter, Integer idx) throws IOException {
      FileSplit fileSplit = new FileSplit(
        split.getPath(idx), split.getOffset(idx), split.getLength(idx), split.getLocations());
      delegate = new KeyValueLineRecordReader(conf, fileSplit);
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
      return delegate.next(key, value);
    }

    @Override
    public Text createKey() {
      return delegate.createKey();
    }

    @Override
    public Text createValue() {
      return delegate.createValue();
    }

    @Override
    public long getPos() throws IOException {
      return delegate.getPos();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
  }

  @Override
  public RecordReader getRecordReader(
    InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new CombineFileRecordReader(
      job, (CombineFileSplit) split, reporter, (Class) MyKeyValueLineRecordReader.class);
  }
}

CombineFileInputFormat is an abstract class that you need to extend and override getRecordReader method. CombineFileRecordReader manages multiple input splits in CombineFileSplit simply by constructing new RecordReader for each input split within. MyKeyValueLineRecordReader creates a KeyValueLineRecordReader to delegate operations to.

Remember to set mapred.max.split.size to a small multiple of block size in bytes as otherwise there will be no split at all.