We live in the age of Big Data, where the data volumes we work with day-to-day have outgrown the storage and processing capabilities of a single host. Big Data brings two challenges: how to store this data and how to analyze it.
Big Data is a collection of data sets that can be of various types. Big Data was estimated to grow to 35 zettabytes by 2020. Traditional databases do not handle semi-structured and unstructured data well, so we need a technology that can deal with these types of data.
Where is this Big Data generated?
Facebook generates about 1 terabyte of data per day, and the New York Stock Exchange generates about 1 petabyte of data per day. Around 5,700 tweets are generated per second, so you can imagine the number of tweets generated by users in one day!
IBM scientists define the 3 V's of Big Data: Velocity, Variety and Volume.
How does real-time data generate Big Data?
Sources of Big Data:
Social Media Sites - How does social media generate data?
People use Facebook to upload photos, send messages, comment and like posts, and all of this generates a huge amount of data.
Mobile Devices - How do mobile devices generate Big Data?
Earlier, mobile phones were used only for calls and messages, but nowadays they generate a lot of data through applications. Almost everyone carries a smartphone, and the applications and games we use produce huge amounts of data. Sensors also generate Big Data: satellites orbiting the Earth, for example, produce enormous amounts of sensor data in just seconds.
Now the big question is: how will we analyze such huge amounts of data?
Hadoop fills this gap by providing distributed storage and analysis capabilities. Hadoop is a platform for shared storage and computation. It was first conceived to fix a scalability issue in Nutch, an open-source crawler and search engine. Hadoop provides a distributed file system with a high degree of fault tolerance.
Hadoop has 2 core components:
1. Hadoop Distributed File System (HDFS)
2. MapReduce
HDFS is optimized for high throughput and works best for reading and writing large files. Scalability and availability are also key traits of HDFS.
HDFS is a distributed file system responsible for storing data across a cluster of servers. HDFS has three major components: the NameNode, the Secondary NameNode and the DataNodes.
The NameNode is the brain of HDFS; it allocates blocks and DataNodes for each client write request. The NameNode manages the metadata, which contains the addresses of the blocks stored on the DataNodes (replicas exist for each block). The NameNode is the primary, or active, node. The Secondary NameNode periodically checkpoints the NameNode's metadata, which helps with recovery when the primary NameNode fails or is down due to technical issues.
DataNodes provide the physical data storage and communicate with each other through pipelines when reading and writing files. Files are made up of blocks, and each block is replicated multiple times. DataNodes send periodic heartbeat signals to the NameNode while they are functional. If the NameNode stops receiving heartbeats from a DataNode, it marks that DataNode as failed and re-replicates its blocks onto other DataNodes.
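To make this concrete, here is a minimal sketch of how a client writes a file to HDFS and reads it back using Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are illustrative assumptions, not values from this post; block allocation, the DataNode pipeline and replication all happen behind this API.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write: the client asks the NameNode for blocks and DataNodes,
        // then streams the data down the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, and the client reads from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}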
MapReduce is a programming model for analyzing the data. MapReduce has two phases: the Map phase and the Reduce phase.
In the Map phase, the raw data is split into records (typically lines of data), and each record is mapped to a key-value pair. The key-value pairs then pass through the Combiner, which acts as a mini-reducer, and the Partitioner, which decides which reducer each key goes to. In the Reduce phase, we apply the aggregation logic: the reducer processes all values for a key, and the produced output is stored in HDFS.
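To see the Map and Reduce phases in code, here is a sketch of the classic word-count example (the standard way this model is usually illustrated, not code from this post): the Mapper turns each line into (word, 1) pairs, and the Reducer sums the counts for each word. The same Reducer can also serve as the Combiner mentioned above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each input line becomes (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}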
The client can then read the output from HDFS.
HDFS provides shared storage, just like a filesystem, and MapReduce provides the mechanism to analyze that data. Together, they make Hadoop an efficient way of analyzing Big Data.
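As a final sketch, a small driver ties the two components together: it submits the word-count Mapper and Reducer from the earlier example as a MapReduce job, reading its input from HDFS and writing its output back to HDFS. The input and output paths are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Reusing the reducer as the combiner works because summing is associative.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Illustrative HDFS paths; the input must exist and the output directory must not.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When the job finishes, the results appear in HDFS as part-r-00000 files under the output directory, which the client can read like any other HDFS file.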