Thursday, March 4, 2010

How does the data flow when a job is submitted to Hadoop?

Based on the discussion here, the data flow typically looks like this:
  1. Client submits a job description to the JobTracker (see the submission sketch after this list). 
  2. JobTracker figures out block locations for the input file(s) by talking to the HDFS NameNode. 
  3. JobTracker creates a job description file in HDFS, which the nodes read to copy over the job's code etc. 
  4. JobTracker starts map tasks on the slaves (TaskTrackers) that hold the appropriate data blocks. 
  5. As they run, the maps write intermediate output files on those slaves. These are not in HDFS; they go to local temporary storage used by MapReduce. 
  6. JobTracker starts reduce tasks on a set of slaves; each reducer copies over the appropriate map outputs, applies the reduce function, and writes its output to HDFS (one output file per reducer). 
  7. Some logs for the job may also be put into HDFS by the JobTracker.
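
To make the client side of step 1 concrete, here is a minimal word-count job using the classic org.apache.hadoop.mapred API of that era. The input/output paths and the choice of four reducers are illustrative assumptions, not something from the discussion above: JobClient.runJob() is where the job description gets handed to the JobTracker, and setNumReduceTasks() controls how many output files end up in HDFS.

// Minimal job submission sketch using the classic mapred API.
// Paths and the reducer count are illustrative only.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      for (String word : line.toString().split("\\s+")) {
        out.collect(new Text(word), ONE);   // intermediate output, spilled to local disk (step 5)
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(word, new IntWritable(sum)); // final output, written to HDFS (step 6)
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(4);                          // four reducers -> four output files
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                             // step 1: hand the job to the JobTracker
  }
}

With four reducers, the output directory in HDFS ends up containing four part files (part-00000 through part-00003), matching the "one output file per reducer" point in step 6.
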
However, there is a big caveat: the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. to join a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc., then you are launching arbitrary processes that may also access external resources in this manner. Some people also read from or write to databases (e.g. MySQL) from their tasks.
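
As a sketch of that "second HDFS file" pattern, the mapper below loads a small key/value table from HDFS in configure() and joins each record of the big input against it. The config key join.small.table, the tab-separated record format, and the class name are all hypothetical, chosen only to show that task code can open HDFS files outside the normal input splits.

// Hypothetical map-side join: load a small side table from HDFS into memory
// in configure(), then join every big-file record against it in map().
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final HashMap<String, String> smallTable = new HashMap<String, String>();

  @Override
  public void configure(JobConf job) {
    // Read the small table directly from HDFS, outside the input-split
    // mechanism. Assumes the job driver set the (made-up) key
    // "join.small.table" and that lines look like "key<TAB>value".
    Path side = new Path(job.get("join.small.table"));
    try {
      FileSystem fs = FileSystem.get(job);
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(side)));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          smallTable.put(parts[0], parts[1]);
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load side table " + side, e);
    }
  }

  public void map(LongWritable offset, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Big-file records are also "key<TAB>value"; emit only rows with a match.
    String[] parts = value.toString().split("\t", 2);
    String match = parts.length == 2 ? smallTable.get(parts[0]) : null;
    if (match != null) {
      out.collect(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}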
