Tuesday, May 25, 2010

Archiving a large number of small files into a small number of large files

A small file is one which is significantly smaller than the HDFS block size (default 64MB).

We have a lot of data feeds in the range of 2MB per day; storing each day's feed as a separate file is far from optimal.

The problem is that HDFS can't handle lots of files: every file, directory and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware; certainly a billion files is not feasible.
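
To make the numbers concrete, here is a back-of-the-envelope calculation (a rough sketch only; the 150-byte figure is an approximation and directory objects are ignored):

// Rough namenode memory estimate for 10 million small files.
// Each file contributes roughly one file object plus one block object,
// each costing about 150 bytes of namenode heap.
public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 10000000L;       // 10 million files
        long objectsPerFile = 2;      // 1 file entry + 1 block entry
        long bytesPerObject = 150;    // approximate overhead per object
        long totalBytes = files * objectsPerFile * bytesPerObject;
        System.out.printf("~%.1f GB of namenode heap%n", totalBytes / 1e9);  // prints ~3.0 GB
    }
}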

Furthermore, HDFS is not geared up for efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, which is an inefficient data access pattern.

Also, HDFS does not support appends (see http://www.cloudera.com/blog/2009/07/file-appends-in-hdfs/).

Known options are:
  1. Load the data into an HBase table and periodically export it to files for long-term storage. For example, key each product log by date/timestamp and store the content of the files as plain text in the HBase table (see the sketch after this list).
  2. Alternatively, we can treat these files as pieces of a larger logical file and incrementally consolidate additions into a newer file. That is, file x was archived on day zero; the next day, new records are available to be archived. We rename the existing file to, say, x.bkp, and then run a MapReduce job that reads the content of the existing file and the new file and writes it back out as file x (a sketch also follows this list).
  3. Apache Chukwa solves a similar problem of distributed data collection and archival for log processing. We can also take inspiration from its design and build a custom solution to suit our requirements, if needed.
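
For option 1, a minimal sketch of the write path, assuming the HBase 0.20-era Java client (HTable/Put); the table name feed_archive, the column family content and the row-key scheme are made up for illustration, and the periodic export to large files would be a separate scan job:

// Store one small feed file as a row in HBase, keyed by feed name and date.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseFeedArchiver {
    public static void archive(String feedName, String date, byte[] fileBytes)
            throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "feed_archive");
        // Row key = feed name + date/timestamp, so the periodic export job can
        // scan a date range and write the row contents out to one large file.
        Put put = new Put(Bytes.toBytes(feedName + "/" + date));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), fileBytes);
        table.put(put);  // auto-flush is on by default, so the row is written immediately
    }
}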
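
For option 2, a minimal sketch of the consolidation job, assuming plain-text records whose order does not matter (a single reducer emits the lines in sorted order); all paths and names are hypothetical:

// Merge the previously archived file (renamed to x.bkp) with the new day's
// records and write the result back out as a single file x.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConsolidateFeed {

    // Drop the byte-offset key so only the record text is shuffled and written.
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(line, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path archived = new Path("/feeds/x");              // yesterday's archive
        Path backup   = new Path("/feeds/x.bkp");
        Path incoming = new Path("/feeds/incoming/today"); // today's new records
        Path outDir   = new Path("/feeds/x.tmp");

        fs.rename(archived, backup);                       // keep the old archive around

        Job job = new Job(conf, "consolidate-feed-x");
        job.setJarByClass(ConsolidateFeed.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(Reducer.class);                // identity reduce
        job.setNumReduceTasks(1);                          // one reducer -> one output file
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, backup);         // old archive...
        FileInputFormat.addInputPath(job, incoming);       // ...plus the new records
        FileOutputFormat.setOutputPath(job, outDir);

        if (job.waitForCompletion(true)) {
            // Promote the single merged part file back to x.
            fs.rename(new Path(outDir, "part-r-00000"), archived);
            fs.delete(outDir, true);
        }
    }
}

Either way, the small daily feeds end up consolidated into a small number of large files, which is the access pattern HDFS handles well.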
