Crunching Big Data with 18 Essential Hadoop Tools


The Hadoop community is fast evolving to include businesses that offer support, rent time on managed clusters, build sophisticated enhancements to the open source core, or add their own tools to the mix.

Here is a look at the most prominent pieces of today’s Hadoop ecosystem. What appears here is a foundation of tools and code that runs together under the collective heading Hadoop.

1. Hadoop


While many refer to the entire constellation of map and reduce tools as Hadoop, there's still one small pile of code at the center known as Hadoop. The Java-based code synchronizes worker nodes in executing a function on data stored locally. Results from these worker nodes are aggregated and reported. The first step is known as "map"; the second, "reduce."
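As a quick illustration of those two steps, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class and field names are invented for the example rather than taken from any particular distribution.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // "Map" step: runs on each worker node against locally stored blocks.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // "Reduce" step: aggregates the per-node results into final counts.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

A driver class that configures a Job object and points it at these two classes ties everything together; that part is omitted here.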

The code is distributed under the Apache license and is available at http://hadoop.apache.org/.

2. Ambari



Ambari offers a Web-based GUI with wizard scripts for setting up clusters with most of the standard components. Once you get Ambari up and running, it will help you provision, manage, and monitor your Hadoop cluster.

Ambari is an incubator project at Apache and is supported by Hortonworks. The code is available at  http://incubator.apache.org/ambari/.

3. HDFS (Hadoop Distributed File System)


The Hadoop Distributed File System offers a basic framework for splitting up data collections between multiple nodes while using replication to recover from node failure. Large files are broken into blocks, and the blocks of a single file are spread across several nodes.
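Application code rarely sees those blocks directly; it goes through the Java FileSystem API, which hides the splitting and replication. A minimal sketch, assuming the cluster address is configured in core-site.xml and using a made-up path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the cluster configuration on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS splits it into blocks and replicates them behind the scenes.
    try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
      out.writeUTF("hello hdfs");
    }

    // Read it back; the client fetches blocks from whichever nodes hold them.
    try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
      System.out.println(in.readUTF());
    }
  }
}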

HDFS is also distributed under the Apache license from http://hadoop.apache.org/.

4. HBase


When the data falls into a big table, HBase will store it, search it, and automatically shard the table across multiple nodes so MapReduce jobs can run locally. Billions of rows of data go in, and then the local versions of the jobs can query them.
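A minimal sketch with the HBase Java client (the newer Connection-based API from HBase 1.0 onward; earlier releases use HTable directly). The table name, column family, and row key below are invented, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {
      // Write one row with a single cell in column family "d".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"), Bytes.toBytes("hello hbase"));
      table.put(put);

      // Read the same row back; HBase finds the region server holding it.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("msg"))));
    }
  }
}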

The system, often compared to Google's BigTable, can be found at http://hbase.apache.org. 

5. Hive


Hive is designed to regularize the process of extracting bits from all of the files stored in HDFS or HBase. It offers an SQL-like language that will dive into the files and pull out the snippets your code needs. The data arrives in standard formats, and Hive turns it into a queryable stash.
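A minimal sketch of the idea: plain JDBC plus Hive's driver is enough to send HiveQL from Java. The HiveServer2 host and port, the credentials, and the web_logs table are all assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL; Hive turns it into jobs over the underlying files.
         ResultSet rs = stmt.executeQuery(
             "SELECT ip, COUNT(*) AS hits FROM web_logs GROUP BY ip")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}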

Hive is distributed by the Apache project at http://hive.apache.org/.


6. Sqoop



Sqoop moves large tables full of information out of the traditional databases and into the control of tools like Hive or HBase.

Sqoop is a command-line tool that controls the mapping between the tables and the data storage layer, translating the tables into a configurable combination for HDFS, HBase, or Hive.
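A typical invocation looks something like the following; the connection string, credentials, table, and target directory are all made up for the example.

# Pull the "orders" table from MySQL into HDFS using four parallel map tasks.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
# Adding --hive-import instead loads the rows straight into a Hive table.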

The latest stable version is 1.4.4, but version 2.0 is progressing well. Both are available from http://sqoop.apache.org/ under the Apache license.


7. Pig


Apache's Pig plows through the data, running code written in its own language, called Pig Latin, filled with abstractions for handling the data. This structure steers users toward algorithms that are easy to run in parallel across the cluster.
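A rough sketch of the flavor, with invented paths and field names. Here the Pig Latin is submitted through the PigServer Java API, though the same lines could be typed into Pig's grunt shell or saved as a script.

import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer("mapreduce");  // "local" also works for testing
    // Load tab-separated log lines, group them by user, and sum the bytes per user.
    pig.registerQuery("logs = LOAD '/data/logs' USING PigStorage('\\t') "
        + "AS (userid:chararray, bytes:long);");
    pig.registerQuery("grouped = GROUP logs BY userid;");
    pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");
    pig.store("totals", "/data/bytes_per_user");
  }
}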

The latest version can be found at http://pig.apache.org.

8. ZooKeeper


ZooKeeper imposes a file system-like hierarchy on the cluster and stores all of the metadata for the machines, so the work of the various nodes can be synchronized. The nodes use ZooKeeper to signal each other when they're done so the others can start up with the data.
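A minimal sketch with the ZooKeeper Java client: a worker registers itself under an ephemeral znode, and the others can list (or watch) the parent to see who is ready. The server address and paths are invented, and the /workers parent is assumed to exist.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
  public static void main(String[] args) throws Exception {
    // 3000 ms session timeout; a real client would wait for the connection event first.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {});

    // An ephemeral znode vanishes if this worker dies, which signals the others.
    zk.create("/workers/worker-1", "ready".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Any node can see which workers are currently registered.
    List<String> workers = zk.getChildren("/workers", false);
    System.out.println(workers);
    zk.close();
  }
}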

For more information, documentation, and the latest builds turn to http://zookeeper.apache.org/.

9. NoSQL

Not all Hadoop clusters use HBase or HDFS. Some integrate with NoSQL data stores that come with their own mechanisms for storing data across a cluster of nodes. This enables them to store and retrieve data with all the features of the NoSQL database and then use Hadoop to schedule data analysis jobs on the same cluster.

10. Mahout

There are a great number of algorithms for data analysis, classification, and filtering, and Mahout is a project designed to bring implementations of these to Hadoop clusters. Many of the standard algorithms, such as K-Means, Dirichlet, parallel pattern, and Bayesian classification, are ready to run on your data with a Hadoop-style map and reduce.

It's just one of several data analysis tools built to run on top of Hadoop.

Mahout comes from the Apache project and is distributed under the Apache license from http://mahout.apache.org/.


11. Lucene/Solr

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. 
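On the Lucene side, a handful of classes cover both indexing and search. A minimal sketch follows (Lucene 5-or-later style constructors; the index path, field name, and text are invented, and details shift between versions):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory index = FSDirectory.open(Paths.get("lucene-index"));

    // Index a single document with one full-text field.
    try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "hadoop crunches big data", Field.Store.YES));
      writer.addDocument(doc);
    }

    // Search the index for a term.
    try (DirectoryReader reader = DirectoryReader.open(index)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("hadoop"), 10);
      System.out.println(hits.totalHits + " matching document(s)");
    }
  }
}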

Lucene and many of its descendants are part of the Apache project and available from http://www.apache.org.


12. Avro


Avro is a data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. 
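A minimal sketch using Avro's generic API; the record schema here is invented for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
  public static void main(String[] args) throws Exception {
    // Data types are defined in JSON.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"clicks\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "ada");
    user.put("clicks", 42);

    // Serialize to Avro's compact binary container format.
    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Read it back; the schema travels with the file.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    }
  }
}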

Avro is another Apache project with APIs and code in Java, C++, Python, and other languages at http://avro.apache.org.


13. Oozie

Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java web application that runs in a Java servlet container.
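Workflows themselves are described in an XML file stored in HDFS; submitting one from Java goes through the Oozie client API. A minimal sketch, with the server URL and application path as assumptions:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSketch {
  public static void main(String[] args) throws Exception {
    // 11000 is the customary Oozie server port; the host name is made up.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // Point the job at a workflow.xml already uploaded to HDFS.
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/my-workflow");

    String jobId = oozie.run(conf);
    System.out.println("Submitted workflow job " + jobId);
  }
}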

The code, protected by the Apache license, is found at http://oozie.apache.org/.


14. GIS tools


The GIS (Geographic Information Systems) tools for Hadoop project has adapted some of the best Java-based tools for understanding geographic information to run with Hadoop. Your databases can handle geographic queries using coordinates instead of strings. Your code can use the GIS tools to calculate in three dimensions.
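The project builds on the Esri Geometry API for Java, which supplies the spatial types and predicates that Hadoop jobs then apply record by record. A rough sketch of a point-in-polygon test, with made-up coordinates:

import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.Polygon;
import com.esri.core.geometry.SpatialReference;

public class GisSketch {
  public static void main(String[] args) {
    SpatialReference wgs84 = SpatialReference.create(4326);  // ordinary lat/long coordinates

    // An invented triangular region.
    Polygon region = new Polygon();
    region.startPath(-122.5, 37.6);
    region.lineTo(-122.5, 37.9);
    region.lineTo(-122.2, 37.9);
    region.closeAllPaths();

    Point sample = new Point(-122.45, 37.75);

    // The kind of geographic predicate a Hadoop job can evaluate for every record.
    System.out.println(GeometryEngine.contains(region, sample, wgs84));
  }
}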

 The tools are available from http://esri.github.io/gis-tools-for-hadoop/.


15. Flume


Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
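Those data flows are wired together in an agent's properties file, which names a source, a channel, and a sink. A minimal sketch that tails a local log file into HDFS (the agent, component, and path names are invented):

# Agent "a1": read new lines from a log file and deliver them to HDFS
# through an in-memory channel.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/app-logs
a1.sinks.k1.channel = c1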

The code is available under the Apache license from http://flume.apache.org.

16. SQL on Hadoop



If you want to run a quick, ad hoc query of all that data sitting on your huge cluster, you could write a new Hadoop job, but that would take a bit of time. After programmers started doing this too often, they began pining for the old SQL databases, which could answer questions posed in that relatively simple language. They scratched that itch, and now a number of tools are emerging from various companies, all offering a faster path to answers.

Some of the most notable include HAWQ, Impala, Drill, Stinger, and Tajo.


17. Cloud


Many of the cloud platforms are scrambling to attract Hadoop jobs because they can be a natural fit for the flexible business model that rents machines by the minute. Companies can spin up thousands of machines to crunch on a big data set in a short amount of time instead of buying permanent racks of machines that can take days or even weeks to do the same calculation.

Some companies, such as Amazon, add another layer of abstraction by accepting just the JAR file filled with software routines. Everything else is set up and scheduled by the cloud.



18. Spark



Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System.
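A minimal sketch with Spark's Java API (Java 8 lambda syntax; the HDFS path and the local master setting are assumptions): load a file into a resilient distributed dataset, then filter and count it, with Spark keeping intermediate data in memory rather than writing everything back to disk between steps.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ErrorCount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // The filter and count run in parallel across the dataset's partitions.
    long errors = sc.textFile("hdfs:///data/app-logs")
        .filter(line -> line.contains("ERROR"))
        .count();

    System.out.println(errors + " error lines");
    sc.stop();
  }
}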

Spark is being incubated by Apache and is available from http://spark.incubator.apache.org/.