An Introduction to Hadoop



Apache Hadoop is an open-source software framework that enables the storage, processing and analysis of large data sets on clusters of commodity hardware in a distributed computing environment. Its creator, Doug Cutting, developed the project to cope with an avalanche of data that could not be handled easily by the existing database systems. The platform supports storage and management of Big Data in an inexpensive and efficient manner.
   
According to Apache, “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

This means that it is possible to run applications on systems with thousands of nodes, working over many terabytes of data. Hardware failures are handled by the framework itself, so Big Data applications keep running uninterrupted even when several nodes become inoperative.
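
This fault tolerance rests largely on replication: HDFS keeps several copies of every data block on different machines, so losing a node does not lose data. Below is a minimal hdfs-site.xml sketch setting the replication factor; the value 3 is only the common default, not something prescribed by this article.

<!-- hdfs-site.xml: illustrative snippet; 3 is the usual default replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Number of copies kept of each HDFS block, so data survives the
      loss of individual DataNodes.</description>
  </property>
</configuration>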

The Apache Hadoop project consists of the following modules:
1. Hadoop Common: This includes all the common utilities that are required by the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): It is a distributed file system written in Java that allows storage of huge amounts of data on commodity machines, providing very high aggregate bandwidth across the cluster.
3. Hadoop YARN: It is a framework for scheduling applications, i.e. job scheduling, and managing cluster resources.
4. Hadoop MapReduce: It is a system for parallel processing of large data sets (a minimal word-count example follows this list).
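
To make the “simple programming models” from the Apache description concrete, here is a sketch of the classic word-count job written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The mapper emits a (word, 1) pair for every word and the reducer sums the counts per word; the class name WordCount and the command-line input/output paths are illustrative, not taken from this article.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner is optional but cheap here
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job is typically packaged as a jar and submitted to the cluster with something like “hadoop jar wordcount.jar WordCount /input /output”; YARN then schedules the map and reduce tasks across the available nodes.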

With the sudden explosion of data, including semi-structured and unstructured data, it has become difficult for organizations and companies to handle the sheer scale of the information they generate effectively. Hadoop is by far the best answer to the limitations of traditional database management systems with respect to Big Data, and calling it an economical, scalable and reliable tool is no exaggeration.

Hadoop is, therefore, the most popular framework of its kind, used by top-notch companies like Google, Facebook, IBM and several others.
