Tuesday, May 25, 2010

Hadoop application packaging

The job jar must be packaged as follows:

job.jar
|--META-INF
|----MANIFEST.MF
|------Main-Class: x.y.z.Main
|--lib
|---- commons-lang.jar   (Note: place your dependent jars inside the lib directory)
|--org/zero
|---- application classes here
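
To make the layout concrete, here is a minimal sketch that assembles such a jar in memory with Python's standard zipfile module. The function name build_job_jar and its parameters are made up for illustration; in practice you would more likely use the jar tool or your build system, so treat this only as a picture of where each entry goes.

```python
import io
import zipfile


def build_job_jar(main_class, classes, libs):
    """Return the bytes of a jar laid out like the tree above.

    classes: dict of class-file path (e.g. "org/zero/Main.class") -> bytes
    libs:    dict of dependency jar name (e.g. "commons-lang.jar") -> bytes
    (Both parameters are hypothetical; they only serve this sketch.)
    """
    manifest = "Manifest-Version: 1.0\r\nMain-Class: {}\r\n".format(main_class)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as jar:
        # The manifest goes in first and names the entry point.
        jar.writestr("META-INF/MANIFEST.MF", manifest)
        # Dependent jars go under lib/.
        for name, data in libs.items():
            jar.writestr("lib/" + name, data)
        # Application classes sit at the root, in their package directories.
        for path, data in classes.items():
            jar.writestr(path, data)
    return buf.getvalue()


if __name__ == "__main__":
    jar_bytes = build_job_jar(
        "x.y.z.Main",
        {"org/zero/Main.class": b"placeholder class bytes"},
        {"commons-lang.jar": b"placeholder jar bytes"},
    )
    for entry in zipfile.ZipFile(io.BytesIO(jar_bytes)).namelist():
        print(entry)
```

The placeholder byte strings stand in for real class files and dependency jars; only the entry names matter for the layout.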

Archiving a large number of small files into a small number of large files

A small file is one which is significantly smaller than the HDFS block size (default 64MB).

We have a lot of data feeds in the range of 2MB per day, and storing each one as a separate file is far from optimal.

The problem is that HDFS can't handle lots of files, because every file, directory and block in HDFS is represented as an object in the namenode's memory, and each object occupies about 150 bytes. So 10 million files, each using one block, amount to 20 million objects (one per file plus one per block) and would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
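
The 3 gigabyte figure can be reproduced with a little arithmetic. The sketch below counts one namenode object per file and one per block, ignores directories, and uses the rough 150 bytes per object quoted above; the function name is made up for illustration.

```python
BYTES_PER_OBJECT = 150  # rough per-object figure quoted in the text


def namenode_bytes(num_files, blocks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate namenode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


if __name__ == "__main__":
    print(namenode_bytes(10_000_000) / 1e9)     # 3.0 (gigabytes)
    print(namenode_bytes(1_000_000_000) / 1e9)  # 300.0 (gigabytes)
```

By the same estimate a billion single-block files would need around 300 gigabytes of namenode memory, which is why consolidating small files matters.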

Furthermore, HDFS is not geared to accessing small files efficiently: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each file, all of which is an inefficient data access pattern.

Also, HDFS does not support appends (see http://www.cloudera.com/blog/2009/07/file-appends-in-hdfs/).

Known options are:
  1. Load the data into an HBase table and periodically export it to files for long-term storage. For example, a product log for a particular date/timestamp would be keyed against the content of the file, stored as plain text in the HBase table.
  2. Alternatively, we can treat these files as pieces of a larger logical file and incrementally consolidate additions into a newer file. That is, file x was archived on day zero, and the next day new records are available to be archived. We rename the existing file to, say, x.bkp and then run a MapReduce job that reads the content of the renamed file (x.bkp) and the new file and writes it to file x.
  3. Apache Chukwa solves a similar problem of distributed data collection and archival for log processing. We can also take inspiration from it and build a custom solution to suit our requirements, if needed.
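
The rename-and-merge flow in option 2 can be pictured with a small sketch. This is only a local-filesystem analogue in plain Python, not a MapReduce job and not HDFS API code; the function name consolidate and the file names in the usage section are made up for illustration.

```python
import os
import shutil


def consolidate(archive_path, new_path):
    """Fold the records in new_path into archive_path.

    The previous archive is kept as archive_path + ".bkp"; cleaning it up
    is left to the caller (an assumption of this sketch).
    """
    backup = archive_path + ".bkp"
    if os.path.exists(archive_path):
        os.replace(archive_path, backup)  # rename x -> x.bkp
    with open(archive_path, "wb") as out:
        # Old records first, then the new day's records.
        for src in (backup, new_path):
            if os.path.exists(src):
                with open(src, "rb") as f:
                    shutil.copyfileobj(f, out)


if __name__ == "__main__":
    import tempfile

    work = tempfile.mkdtemp()
    x = os.path.join(work, "x")
    for day, records in (("day0", b"r1\n"), ("day1", b"r2\n")):
        feed = os.path.join(work, day)
        with open(feed, "wb") as f:
            f.write(records)
        consolidate(x, feed)
    with open(x, "rb") as f:
        print(f.read())
```

Keeping x.bkp until the new x is fully written means a failed run can be recovered by renaming the backup into place again.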

Saturday, May 22, 2010

One wish

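If someone sang, I would fall asleep

In the wide ocean of the world,
inside the boat of dreams,
rising and falling on waves of sorrow and joy,
I would drift along, I would fall asleep.

With eyes filled with undying love,
with a palm full of blessings,
if someone laid my head in their lap
and stroked it, I would fall asleep.

The salt water of my life,
the deadly poison of my life,
if someone sweetened it with their voice
and rained it down, I would fall asleep.

If someone sang, I would fall asleep,
I would fall asleep,
I would fall asleep.

 - Harivansh Rai Bachchan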

Then the drop becomes a pearl ...


A Drop
------------
Just as she slipped out of the lap of the clouds,
a drop had barely moved a little ahead
when she began to wonder, again and again:
oh, why did I leave home and set out like this?
Will I survive, or will I end up in the dust,
or will I drip onto a lotus flower?
Just then such a wind blew
that she drifted, listless, toward the sea.
The mouth of a beautiful oyster lay open;
she fell right into it and became a pearl.

People hesitate and worry in just this way
when they have to leave home,
but leaving home often
makes something else of them, as it did the drop!

- Ayodhya Singh Upadhyay 'Hariaudh'

A Dedication

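What do we free spirits amount to

What do we free spirits amount to,
here today, gone somewhere else tomorrow.
A mood of revelry travelled with us
wherever we went, kicking up dust.

We came at times as exuberance,
and just now flowed away as tears.
Everyone was left saying,
"Hey, how did you come, and where are you off to?"
Which way are we headed? Do not ask;
we go simply because we must keep going.

We went taking something of the world's,
and giving the world something of ours.
We said a few words, we heard a few words,
laughed a little and then wept a little.
Draining draughts of joy and sorrow to the full,
we drank them down in the same spirit.

In this world of beggars,
we went on, freely squandering love.
With one mark upon the heart,
we went on carrying the burden of failure.

Free of honour, free of dishonour,
we have played openly to our heart's content.
Laughing all the way, here today
we leave having lost the wager of our lives.

Now what is ours and what is another's?
May those who stay behind prosper.
We had bound ourselves, and by ourselves
we broke our bonds and walked on.

- Bhagwati Charan Verma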

The poem above pretty much sums up the way I wish I could live my life... trying each moment... and keep walking.

Sunday, May 9, 2010

How about teaching to learn?


Since my childhood days I have always been told that sharing knowledge improves your own learning. However, our social conditioning is such that we are not really comfortable sharing, primarily because of:
  • Fear of being dispensable, because there are others who may replace you. This is the single most important reason people 'hoard' knowledge.
  • Fear of being exposed, as sharing might reveal your ignorance.
  • ..
  • ....
There could be many more. The idea here is not to create an exhaustive list of the reasons people find to escape ... it is to make a public assertion that I no longer want to be a passive consumer of information. I want to create information, distil it so that the people around me find it easy to consume, and add greater value to make the world a better place to live.

I hope that makes a true dedication to my Mother on this day. I seek your blessings.