Thursday, March 4, 2010

How does the data flow when a job is submitted to Hadoop?

Based on the discussion here, the typical data flow is as follows:

- Client submits a job description to the JobTracker (see the sketch after this list).
- JobTracker figures out the block locations of the input file(s) by talking to the HDFS NameNode.
- JobTracker creates a job description file in HDFS, which the nodes read in order to copy over the job's code and related files.
- JobTracker starts map tasks on the slaves (TaskTrackers) that hold the appropriate data blocks.
- After running, the map tasks leave intermediate output files on those slaves. These files are not in HDFS; they live in local temporary storage used by MapReduce.
- JobTracker starts reduce tasks on a set of slaves; each one copies over the appropriate map outputs, applies the reduce function, and writes its output to HDFS (one output file per reducer).
- Some logs for the job may also be put into HDFS by the JobTracker.
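
To make the flow above concrete, here is a minimal client-side sketch using the classic org.apache.hadoop.mapred API from the JobTracker/TaskTracker era. The WordCount logic, class names, and paths are illustrative assumptions, not something from the discussion above; the point is that the JobConf is the "job description" the client hands to the JobTracker via JobClient.runJob.

```java
// Illustrative sketch only: a classic WordCount driver submitted to the JobTracker.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map tasks run on the TaskTrackers holding the input blocks; their output
  // goes to local temporary storage on those slaves, not to HDFS.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce tasks copy the relevant map outputs, apply this function, and write
  // one part-NNNNN file per reducer into the HDFS output directory.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);   // the "job description"
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(2);                     // => two output files in HDFS

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output dir in HDFS

    // Submits the job to the JobTracker and waits for it to complete.
    JobClient.runJob(conf);
  }
}
```

With setNumReduceTasks(2), the HDFS output directory ends up with one file per reducer (part-00000 and part-00001), which is the last step of the flow described above.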