Hadoop 1.0 vs Hadoop 2.0




Apache Hadoop, created by Doug Cutting and Mike Cafarella, originated in 2005 and uses the MapReduce processing engine to support the distributed processing of large-scale data loads. Several years later, major changes were made so that the Hadoop framework supports not just MapReduce but other distributed computing models as well.


Hadoop 1.0
In a typical Hadoop cluster, a file is broken into chunks of 64 MB by default, and these chunks are distributed across DataNodes. Each chunk has a default replication factor of 3, so three copies of each chunk are kept on different racks. The JobTracker assigns tasks to nodes depending on where the data is located. Since Hadoop is 'Rack Aware', the NameNode can determine the 'closest' chunk to a client during reads.
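
For a concrete illustration, here is a minimal Java sketch that writes a file through the HDFS FileSystem API with an explicit replication factor and block size matching those defaults. The path and payload are hypothetical, and exact configuration keys vary across Hadoop versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a file with an explicit replication factor (3) and block
        // size (64 MB), mirroring the Hadoop 1.0 defaults described above.
        Path out = new Path("/data/example.txt");   // hypothetical path
        short replication = 3;
        long blockSize = 64L * 1024 * 1024;
        try (FSDataOutputStream stream =
                 fs.create(out, true, 4096, replication, blockSize)) {
            stream.writeUTF("hello hdfs");
        }
    }
}
```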

Limitations of Hadoop 1.0:

1. No horizontal scalability:

Hadoop 1.0 scales the NameNode only vertically: the entire filesystem namespace must fit in the memory of a single machine.
When there are, say, thousands of DataNodes, the NameNode stores metadata about every file and block held on those DataNodes. There is a hard limit on how much metadata can fit in the RAM of one machine, and since Hadoop 1.0 provides no mechanism to partition this metadata, a different framework was required to scale the NameNode horizontally.
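
A rough back-of-envelope sketch of the problem, assuming the commonly cited figure of about 150 bytes of NameNode heap per namespace object (files and blocks) and a hypothetical cluster size:

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        // ~150 bytes of heap per namespace object is a commonly cited rule
        // of thumb, not an exact number from the Hadoop source.
        long bytesPerObject = 150;
        long files = 500_000_000L;   // hypothetical file count
        long blocksPerFile = 2;      // hypothetical average
        long objects = files + files * blocksPerFile;
        double heapGb = objects * bytesPerObject / (1024.0 * 1024 * 1024);
        System.out.printf("Estimated NameNode heap: %.0f GB%n", heapGb);
        // ~1.5e9 objects * 150 B is roughly 210 GB, far beyond what one JVM
        // heap can comfortably hold: this is the scalability wall above.
    }
}
```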





2. No High Availability in NameNode:

In Hadoop 1.0, the secondary NameNode holds the metadata in the form of snapshots, acting as a housekeeping backup for the NameNode. It stores a new snapshot after every checkpoint.

If the NameNode fails, it is recovered from the secondary NameNode: the system must be rebooted, after which the NameNode loads the last FSImage/snapshot. However, if the NameNode fails between checkpoints, the metadata changes that have not yet reached the secondary NameNode are lost, because recovery can only restore the state as of the last stored checkpoint.
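
The size of that data-loss window is governed by the checkpoint period. A minimal sketch, assuming the Hadoop 1.x property name fs.checkpoint.period (renamed to dfs.namenode.checkpoint.period in 2.x), with a default of one hour:

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointWindow {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hadoop 1.x key (renamed to "dfs.namenode.checkpoint.period" in 2.x).
        // Edits made after the last checkpoint -- up to one full period --
        // are exactly what is lost if the NameNode dies, as described above.
        long periodSecs = conf.getLong("fs.checkpoint.period", 3600);
        System.out.println("Checkpoint every " + periodSecs
                + " s; worst-case metadata loss window = " + periodSecs + " s");
    }
}
```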




3. JobTracker Overburdened:

JobTracker is a single daemon that has to communicate with, and manage, thousands of TaskTrackers. A single JobTracker is responsible for every TaskTracker in the cluster.
Roughly every 10 seconds, the JobTracker receives heartbeat signals from all the TaskTrackers running on the slave nodes. When a client submits a MapReduce application, the JobTracker accepts the job, divides it into map tasks and reduce tasks, and then schedules and manages those tasks across the TaskTrackers.
With thousands of nodes, the JobTracker becomes overburdened, since it alone manages the life cycle of every MapReduce job.
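
The bottleneck can be pictured with a deliberately simplified sketch; the types below are hypothetical placeholders for illustration only, not the real JobTracker source:

```java
import java.util.List;

// Schematic of why a single JobTracker becomes a bottleneck: one central
// daemon handles heartbeats from every TaskTracker plus all job
// life-cycle bookkeeping.
public class JobTrackerSketch {
    // Hypothetical placeholder types, for illustration only.
    interface TaskTracker { Heartbeat nextHeartbeat(); }
    record Heartbeat(String trackerId, int freeSlots) {}

    void serveForever(List<TaskTracker> trackers) {
        while (true) {
            for (TaskTracker tt : trackers) {       // thousands of trackers...
                Heartbeat hb = tt.nextHeartbeat();  // ...all funneled through
                schedule(hb);                       // one central daemon
            }
        }
    }

    void schedule(Heartbeat hb) {
        // assign map/reduce tasks to the tracker's free slots (elided)
    }
}
```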




4. MRv1 performs only MapReduce jobs:

Hadoop 1.0 supports only the MapReduce programming model, whether jobs are written directly as MapReduce Java programs or generated by higher-level tools such as Pig and Hive. If we want to analyse a graph, or perform any other kind of computation, we have to move the data out of HDFS and run the graph application elsewhere to process it.
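
For reference, the classic word count shows the shape every MRv1 job must take; the sketch below is close to the standard Hadoop tutorial example, with hypothetical input/output paths passed on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);            // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```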



Hadoop 2.0


Hadoop 2.0 is an advanced version of Hadoop 1.0. It has the following features:

1. Multiple Namenodes

Hadoop 2.0 supports multiple NameNodes. One is active, and the others serve as standby NameNodes. Every change made by the active NameNode is written to a shared edit log, and the standby NameNodes read from it so that active and standby stay in the same state. The secondary NameNode does the same checkpointing work it did in Hadoop 1.0; it is not a standby for the active NameNode.
So Hadoop 2.0 provides high availability with no data loss: if the active NameNode fails, a standby NameNode is ready to take over.
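
Clients address the logical nameservice rather than an individual NameNode, which is what makes failover transparent to them. A minimal client-side sketch, using property names from the Hadoop 2 HA documentation with a hypothetical nameservice ID and hostnames:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "mycluster" and the hostnames are hypothetical; clients talk to
        // the logical nameservice, so a failover from nn1 to nn2 is
        // transparent to them.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2:8020");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}
```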


2. Multiple applications can analyze their data

Unlike Hadoop 1.0, Hadoop 2.0 allows multiple applications to analyse the data stored in HDFS, for example Online Transaction Processing (OLTP) workloads or Giraph graph processing.


3. YARN component

YARN was introduced to run applications beyond MapReduce. With YARN, multiple applications can run side by side, sharing a common resource manager that reconciles how applications use Hadoop system resources, with NodeManager agents monitoring the processing operations of individual nodes. Separating HDFS from MapReduce through YARN makes the Hadoop environment more suitable for applications that cannot wait for batch jobs to finish.
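
A small sketch using the YARN client API illustrates this: the ResourceManager tracks every application regardless of framework, so MapReduce, Spark, or Giraph jobs all show up through the same interface (cluster addresses are assumed to come from yarn-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();
        // Any framework (MapReduce, Spark, Giraph, ...) appears here:
        // YARN manages resources; applications bring their own logic.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("%s  %s  %s%n",
                app.getApplicationId(),
                app.getApplicationType(),
                app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```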


4. Schedulers

The scheduler is responsible for allocating resources to the various applications. It offers no guarantee about restarting tasks that fail due to application or hardware failure; it performs its scheduling purely on the basis of the resources requested by the applications.
The Capacity Scheduler in Hadoop 2.0 supports hierarchical queues, allowing more predictable sharing of cluster resources. It lets multiple tenants securely share a large cluster, so that every application is allocated resources in a timely manner. It divides the cluster capacity into percentages across queues, which is why cluster utilization is better in Hadoop 2.0.
The Fair Scheduler, by contrast, assigns resources so that all running applications get, on average, an equal share over time.
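
For example, a job can be directed to a specific Capacity Scheduler queue at submission time; the queue name below is hypothetical and would have to be defined by the cluster administrator in capacity-scheduler.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "analytics" is a hypothetical queue defined by the cluster admin;
        // the Capacity Scheduler caps each queue at its configured share.
        // (MRv1 used the older key "mapred.job.queue.name".)
        conf.set("mapreduce.job.queuename", "analytics");
        Job job = Job.getInstance(conf, "queued job");
        // ... set mapper/reducer/paths as usual, then submit ...
    }
}
```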



Hadoop 2.0 is the advanced version of Hadoop 1.0, and the shift from Hadoop 1.0 to Hadoop 2.0 happened at the architecture level. Hadoop 2.0 removes the problems of cascading failure, poor multi-tenancy, lack of high availability, and under-utilized data in HDFS. YARN is a re-architecture that allows multiple applications to run on the same platform, enabling applications beyond MapReduce.
