Sunday, March 28, 2010

Cascading: How does Cascading decide which fields go to which column family?

I was playing with the Cascading code sample given here.

Problem Statement: let's say our tuple carries the sample's three fields plus a new field, double, e.g.
line_num, lower, upper, double
1, a, A, AA

and I wish to write double to its own column family, or club it with an existing column family such as 'right'. How do I do that?

Solution:
        String tableName = "DataLoadTable";
        Fields keyFields = new Fields("line_num");
        // add a new family name
        String[] familyNames = new String[] { "left", "right", "double" };
        // group your fields in the order in which you would like them to be
        // added to the column families
        Fields[] valueFields = new Fields[] { new Fields("lower"),
                new Fields("upper"), new Fields("double") };
        HBaseScheme hbaseScheme = new HBaseScheme(keyFields, familyNames,
                valueFields);
        Tap sink = new HBaseTap(tableName, hbaseScheme, SinkMode.REPLACE);
        // describe your tuple entry: add the new field
        Fields fieldDeclaration = new Fields("line_num", "lower", "upper",
                "double");
        Function function = new RegexSplitter(fieldDeclaration, ", ");

The rest of the code remains the same as in the example.
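For completeness, here is a rough sketch of how these pieces might wire into the rest of the flow; the input path and pipe name below are my own placeholders and are not taken from the original sample:

        // placeholder input file read as plain text lines
        Tap source = new Lfs(new TextLine(), "input/data.txt");
        // apply the RegexSplitter defined above to each input line
        Pipe parsePipe = new Each("insert", new Fields("line"), function);
        // connect source, pipe, and the HBase sink, then run the flow
        Flow flow = new FlowConnector().connect(source, sink, parsePipe);
        flow.complete();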

Either the above was so obvious that the authors didn't bother covering it in the user guide, or I don't know how to describe the problem well enough to find the answer.

Let me know if I'm wrong.

Thursday, March 11, 2010

Push Button Automation for the Humans

Have you ever been responsible for installing software that has to be distributed and installed across a myriad of execution environments, with an equally diverse set of configurations, in a department where machines are constantly being cleaned and re-imaged? Picture a situation where you are supporting several configurations and versions of your module. Rather than spending hours installing by hand, wouldn't it be great to have a way to automate the installation process so that you could just kick it off, go get a coffee, come back, and have it all installed and ready? Call it 'Push Button' automation :) (please bear with me for throwing around newer phrases)

Strange as it might seem, this hand-made configuration is proving to be a nightmare of sorts: one ends up exhausted, without any real sense of accomplishment, from solving petty issues that should never have come up in the first place. It is high time to set up:
  • a standardized installation process
  • application diagnostics that enable the operations team to resolve small issues in time
Time and again, Apache Ant comes to my rescue.

Thursday, March 4, 2010

How does the data flow when a job is submitted to Hadoop?

Based on the discussion here, typically the data flow is like this:
  1. Client submits a job description to the JobTracker (see the client-side sketch after this list). 
  2. JobTracker figures out block locations for the input file(s) by talking to HDFS NameNode. 
  3. JobTracker creates a job description file in HDFS which will be read by the nodes to copy over the job's code etc. 
  4. JobTracker starts map tasks on the slaves (TaskTrackers) with the appropriate data blocks. 
  5. After running, maps create intermediate output files on those slaves. These are not in HDFS, they're in some temporary storage used by MapReduce. 
  6. JobTracker starts reduces on a series of slaves, which copy over the appropriate map outputs, apply the reduce function, and write the outputs to HDFS (one output file per reducer). 
  7. Some logs for the job may also be put into HDFS by the JobTracker.
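As an illustration of step 1, here is a minimal sketch of the client side using the old org.apache.hadoop.mapred API of that era; it uses the built-in identity mapper and reducer just to stay self-contained, and the class name and input/output paths are placeholders of my own:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJobExample.class);
        conf.setJobName("pass-through");
        // identity mapper/reducer, only to keep the example self-contained
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        // default TextInputFormat emits LongWritable offsets and Text lines
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // hands the job description to the JobTracker (step 1) and blocks
        // until the job completes
        JobClient.runJob(conf);
    }
}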
However, there is a big caveat: the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. for joining a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc., then you are launching arbitrary processes which may also access external resources in this manner. Some people also read from/write to databases (e.g. MySQL) from their tasks.
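As an illustration of that caveat, a map task written against the same old API might open a small HDFS side file in configure() and join it against its input; the file path, field layout, and class name below are assumptions for illustration only:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class JoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    public void configure(JobConf job) {
        try {
            // read the small table from HDFS once, before any map() calls
            FileSystem fs = FileSystem.get(job);
            Path small = new Path("/data/small_table.txt");   // assumed path
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(small)));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);          // assumed tab-separated layout
                if (parts.length == 2) lookup.put(parts[0], parts[1]);
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        String[] fields = value.toString().split("\t", 2);
        String joined = lookup.get(fields[0]);                 // join against the small table
        if (fields.length == 2 && joined != null) {
            out.collect(new Text(fields[0]), new Text(fields[1] + "\t" + joined));
        }
    }
}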