What is Big Data and Hadoop

We live in the age of Big Data, where the data volumes we need to work with day to day have outgrown the storage and processing capabilities of a single host. Big Data brings two challenges: how to store this data and how to analyze it.





Big Data is a collection of data sets that can be of various types. Big Data is estimated to grow to 35 zettabytes by 2020. Traditional databases do not deal with semi-structured and unstructured data, so we need a technology that can handle these types of data.



Where is this Big Data generated?


Facebook generates 1 terabyte of data per day, and the New York Stock Exchange generates 1 petabyte of data per day. Around 5,700 tweets are generated per minute, so you can imagine the number of tweets users produce in one day!



IBM scientists defined the three V's of Big Data: Volume, Velocity, and Variety.


How does real-time data generate Big Data?

Sources of Big Data:
Social media sites - How does social media generate data?
People use Facebook to upload photos, send messages, comment, and like posts, and all of this generates a huge amount of Big Data.



How do mobile devices generate Big Data?

Earlier, mobile phones were only used for calling and messaging, but nowadays they generate a lot of data through applications. Almost everyone carries a smartphone, and the applications and games we use on it generate huge amounts of data. Sensors also generate Big Data: satellites orbiting the Earth produce a lot of data, and sensor devices can generate huge amounts of data in just seconds.


Now the big question is: how will we analyze such huge amounts of data?




Hadoop fills this gap in the market by providing both storage and analysis capabilities. Hadoop is a platform that offers shared storage and computational capabilities. It was first conceived to fix a scalability issue in Nutch, an open source crawler and search engine. Hadoop provides a distributed file system with a high degree of fault tolerance.





Hadoop has two core components:

1. Hadoop Distributed File System (HDFS)
2. MapReduce

HDFS is optimized for high throughput and works best for reading and writing large files. Scalability and availability are also key traits of HDFS.
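To make this concrete, here is a minimal sketch of writing and reading a file through the Hadoop FileSystem Java API. It assumes a cluster whose address is picked up from the configuration on the classpath, and the path /user/demo/sample.txt is just an example.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // example path

        // Write: the client streams data; HDFS splits it into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode supplies block locations, and the client
        // then reads the blocks directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}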

HDFS is a distributed file system that is responsible for holding data across the cluster's servers. HDFS has three major components: the NameNode, the Secondary NameNode, and the DataNodes.


The NameNode is the brain of HDFS: it allocates blocks and DataNodes for each client write request and manages the metadata, which records the addresses of the blocks stored on the DataNodes (replicas exist for each block). The NameNode is the primary, active node. The Secondary NameNode periodically checkpoints the NameNode's metadata, and that checkpoint is what lets the file system be restored when the primary node fails or is down due to technical issues.
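A client can actually see this metadata by asking for the block locations of a file. The sketch below uses the same FileSystem API; the file path is again hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file already stored in HDFS.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));

        // The NameNode answers this query from its metadata: for every
        // block of the file it returns the DataNodes holding a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}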


DataNodes are the physical data storage nodes; they communicate with each other through pipelines when reading and writing files. Files are made up of blocks, and each block is replicated multiple times. DataNodes send heartbeat signals to the NameNode as long as they are functional. If the NameNode stops receiving heartbeats from a DataNode, it concludes that the DataNode has failed and re-replicates the blocks that were stored on it.
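The replication factor mentioned above can be inspected, and changed per file, from client code. This is a small sketch with an assumed path and an assumed target of 3 replicas (the usual cluster default set by dfs.replication).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // assumed path

        // Current replication factor for this file.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("replication = " + current);

        // Ask HDFS to keep 3 replicas of every block of this file;
        // the NameNode schedules the extra copies on other DataNodes.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}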


MapReduce is a programming model for analyzing the data. MapReduce has two phases: the Map phase and the Reduce phase.


In the Map phase, the raw data is treated as records and split into lines of data. These lines are mapped into key-value pairs. The key-value pairs are then passed through the Combiner, which acts as a mini-reducer, and the Partitioner, which decides which Reducer each pair goes to. Then comes the Reduce phase, in which we perform the aggregation logic. The Reducer performs its task, and the output it produces is stored in HDFS.
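As a concrete illustration of the two phases described above, here is the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are supplied on the command line; class names are just examples.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line arrives as a record; emit a (word, 1)
    // key-value pair for every word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word; the same class is
    // reused as the Combiner (the "mini-reducer") on the map side.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job writes its results into the output directory in HDFS, which brings us to the next point.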



The client can then read the output from HDFS.

HDFS provides shared storage, just like a file system, and MapReduce provides the mechanism to analyze that data. Together, they make Hadoop an efficient way to analyze Big Data.