Crunching Big Data with 18 Essential Hadoop Tools


The Hadoop community is fast evolving to include businesses that offer support, rent time on managed clusters, build sophisticated enhancements to the open source core, or add their own tools to the mix.

Here is a look at the most prominent pieces of today’s Hadoop ecosystem. What appears here is a foundation of tools and code that runs together under the collective heading Hadoop.

1. Hadoop


While many refer to the entire constellation of map and reduce tools as Hadoop, there's still one small pile of code at the center known as Hadoop. The Java-based code synchronizes worker nodes in executing a function on data stored locally. Results from these worker nodes are aggregated and reported. The first step is known as "map"; the second, "reduce."
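As a quick illustration of those two steps, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class and field names are invented for the example rather than taken from any particular distribution.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // "Map" step: runs on each worker node against locally stored blocks.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  // "Reduce" step: aggregates the per-node results into final counts.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

A driver class that configures a Job object and points it at these two classes ties everything together; that part is omitted here.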

The code is distributed under the Apache license and is available at http://hadoop.apache.org/.

2. Ambari



Ambari offers a Web-based GUI with wizard scripts for setting up clusters with most of the standard components. Once you get Ambari up and running, it will help you provision, manage, and monitor your Hadoop cluster.

Ambari is an incubator project at Apache and is supported by Hortonworks. The code is available at  http://incubator.apache.org/ambari/.

3. HDFS (Hadoop Distributed File System)


The Hadoop Distributed File System offers a basic framework for splitting up data collections between multiple nodes while using replication to recover from node failure. Large files are broken into blocks, and the blocks of a single file are spread across several nodes.
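Application code rarely sees those blocks directly; it goes through the Java FileSystem API, which hides the splitting and replication. A minimal sketch, assuming the cluster address is configured in core-site.xml and using a made-up path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the cluster configuration on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS splits it into blocks and replicates them behind the scenes.
    try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
      out.writeUTF("hello hdfs");
    }

    // Read it back; the client fetches blocks from whichever nodes hold them.
    try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
      System.out.println(in.readUTF());
    }
  }
}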

HDFS is also distributed under the Apache license from http://hadoop.apache.org/.

4. HBase


When the data falls into a big table, HBase will store it, search it, and automatically shard the table across multiple nodes so MapReduce jobs can run locally. Billions of rows of data go in, and then the local versions of the jobs can query them.
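A minimal sketch with the HBase Java client (the newer Connection-based API from HBase 1.0 onward; earlier releases use HTable directly). The table name, column family, and row key below are invented, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {
      // Write one row with a single cell in column family "d".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"), Bytes.toBytes("hello hbase"));
      table.put(put);

      // Read the same row back; HBase finds the region server holding it.
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("msg"))));
    }
  }
}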

The system, often compared to Google's BigTable, can be found at http://hbase.apache.org. 

5. Hive


Hive is designed to regularize the process of extracting bits from all of the files stored in HDFS or HBase. It offers an SQL-like language that will dive into the files and pull out the snippets your code needs. The data arrives in standard formats, and Hive turns it into a queryable stash.
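A minimal sketch of the idea: plain JDBC plus Hive's driver is enough to send HiveQL from Java. The HiveServer2 host and port, the credentials, and the web_logs table are all assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL; Hive turns it into jobs over the underlying files.
         ResultSet rs = stmt.executeQuery(
             "SELECT ip, COUNT(*) AS hits FROM web_logs GROUP BY ip")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}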

Hive is distributed by the Apache project at http://hive.apache.org/.


6. Sqoop



Sqoop moves large tables full of information out of the traditional databases and into the control of tools like Hive or HBase.

Sqoop is a command-line tool that controls the mapping between the tables and the data storage layer, translating the tables into a configurable combination for HDFS, HBase, or Hive.
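A typical invocation looks something like the following; the connection string, credentials, table, and target directory are all made up for the example.

# Pull the "orders" table from MySQL into HDFS using four parallel map tasks.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
# Adding --hive-import instead loads the rows straight into a Hive table.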

The latest stable version is 1.4.4, but version 2.0 is progressing well. Both are available from http://sqoop.apache.org/ under the Apache license.


7. Pig


Apache's Pig plows through the data, running code written in its own language, called Pig Latin, filled with abstractions for handling the data. This structure steers users toward algorithms that are easy to run in parallel across the cluster.
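A rough sketch of the flavor, with invented paths and field names. Here the Pig Latin is submitted through the PigServer Java API, though the same lines could be typed into Pig's grunt shell or saved as a script.

import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer("mapreduce");  // "local" also works for testing
    // Load tab-separated log lines, group them by user, and sum the bytes per user.
    pig.registerQuery("logs = LOAD '/data/logs' USING PigStorage('\\t') "
        + "AS (userid:chararray, bytes:long);");
    pig.registerQuery("grouped = GROUP logs BY userid;");
    pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");
    pig.store("totals", "/data/bytes_per_user");
  }
}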

The latest version can be found at http://pig.apache.org.

8. ZooKeeper


ZooKeeper imposes a file system-like hierarchy on the cluster and stores all of the metadata for the machines, so the work of the various nodes can be synchronized. The nodes use ZooKeeper to signal each other when they're done so the others can start up with the data.
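A minimal sketch with the ZooKeeper Java client: a worker registers itself under an ephemeral znode, and the others can list (or watch) the parent to see who is ready. The server address and paths are invented, and the /workers parent is assumed to exist.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
  public static void main(String[] args) throws Exception {
    // 3000 ms session timeout; a real client would wait for the connection event first.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {});

    // An ephemeral znode vanishes if this worker dies, which signals the others.
    zk.create("/workers/worker-1", "ready".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Any node can see which workers are currently registered.
    List<String> workers = zk.getChildren("/workers", false);
    System.out.println(workers);
    zk.close();
  }
}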

For more information, documentation, and the latest builds turn to http://zookeeper.apache.org/.

9. NoSQL

Not all Hadoop clusters use HBase or HDFS. Some integrate with NoSQL data stores that come with their own mechanisms for storing data across a cluster of nodes. This enables them to store and retrieve data with all the features of the NoSQL database and then use Hadoop to schedule data analysis jobs on the same cluster.

10. Mahout

There are a great number of algorithms for data analysis, classification, and filtering, and Mahout is a project designed to bring implementations of these to Hadoop clusters. Many of the standard algorithms, such as K-Means, Dirichlet, parallel pattern, and Bayesian classification, are ready to run on your data with a Hadoop-style map and reduce.

It's just one of several data analysis tools built to run on top of Hadoop.

Mahout comes from the Apache project and is distributed under the Apache license from http://mahout.apache.org/.


11. Lucene/Solr

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. 
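On the Lucene side, a handful of classes cover both indexing and search. A minimal sketch follows (Lucene 5-or-later style constructors; the index path, field name, and text are invented, and details shift between versions):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory index = FSDirectory.open(Paths.get("lucene-index"));

    // Index a single document with one full-text field.
    try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "hadoop crunches big data", Field.Store.YES));
      writer.addDocument(doc);
    }

    // Search the index for a term.
    try (DirectoryReader reader = DirectoryReader.open(index)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("hadoop"), 10);
      System.out.println(hits.totalHits + " matching document(s)");
    }
  }
}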

Lucene and many of its descendants are part of the Apache project and available from http://www.apache.org.


12. Avro


Avro is a data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. 
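A minimal sketch using Avro's generic API; the record schema here is invented for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
  public static void main(String[] args) throws Exception {
    // Data types are defined in JSON.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"clicks\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "ada");
    user.put("clicks", 42);

    // Serialize to Avro's compact binary container format.
    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Read it back; the schema travels with the file.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    }
  }
}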

Avro is another Apache project with APIs and code in Java, C++, Python, and other languages at http://avro.apache.org.


13. Oozie

Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java web application that runs in a Java servlet container.
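Workflows themselves are described in an XML file stored in HDFS; submitting one from Java goes through the Oozie client API. A minimal sketch, with the server URL and application path as assumptions:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSketch {
  public static void main(String[] args) throws Exception {
    // 11000 is the customary Oozie server port; the host name is made up.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // Point the job at a workflow.xml already uploaded to HDFS.
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/my-workflow");

    String jobId = oozie.run(conf);
    System.out.println("Submitted workflow job " + jobId);
  }
}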

The code, protected by the Apache license, is found at http://oozie.apache.org/.


14. GIS tools


The GIS (Geographic Information Systems) tools for Hadoop project has adapted some of the best Java-based tools for understanding geographic information to run with Hadoop. Your databases can handle geographic queries using coordinates instead of strings. Your code can use the GIS tools to calculate in three dimensions.
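The project builds on the Esri Geometry API for Java, which supplies the spatial types and predicates that Hadoop jobs then apply record by record. A rough sketch of a point-in-polygon test, with made-up coordinates:

import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.Polygon;
import com.esri.core.geometry.SpatialReference;

public class GisSketch {
  public static void main(String[] args) {
    SpatialReference wgs84 = SpatialReference.create(4326);  // ordinary lat/long coordinates

    // An invented triangular region.
    Polygon region = new Polygon();
    region.startPath(-122.5, 37.6);
    region.lineTo(-122.5, 37.9);
    region.lineTo(-122.2, 37.9);
    region.closeAllPaths();

    Point sample = new Point(-122.45, 37.75);

    // The kind of geographic predicate a Hadoop job can evaluate for every record.
    System.out.println(GeometryEngine.contains(region, sample, wgs84));
  }
}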

 The tools are available from http://esri.github.io/gis-tools-for-hadoop/.


15. Flume


Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
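Those data flows are wired together in an agent's properties file, which names a source, a channel, and a sink. A minimal sketch that tails a local log file into HDFS (the agent, component, and path names are invented):

# Agent "a1": read new lines from a log file and deliver them to HDFS
# through an in-memory channel.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/app-logs
a1.sinks.k1.channel = c1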

The code is available under the Apache license from http://flume.apache.org.

16. SQL on Hadoop



If you want to run a quick, ad hoc query of all that data sitting on your huge cluster, you could write a new Hadoop job, but that would take a bit of time. After programmers started doing this too often, they began pining for the old SQL databases, which could answer questions posed in that relatively simple language. They scratched that itch, and now a number of tools are emerging from various companies, all offering a faster path to answers.

Some of the most notable include HAWQ, Impala, Drill, Stinger, and Tajo.


17. Cloud


Many of the cloud platforms are scrambling to attract Hadoop jobs because they can be a natural fit for the flexible business model that rents machines by the minute. Companies can spin up thousands of machines to crunch on a big data set in a short amount of time instead of buying permanent racks of machines that can take days or even weeks to do the same calculation.

Some companies, such as Amazon, add another layer of abstraction by accepting just the JAR file filled with software routines. Everything else is set up and scheduled by the cloud.



18. Spark



Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System.
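A minimal sketch with Spark's Java API (Java 8 lambda syntax; the HDFS path and the local master setting are assumptions): load a file into a resilient distributed dataset, then filter and count it, with Spark keeping intermediate data in memory rather than writing everything back to disk between steps.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ErrorCount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // The filter and count run in parallel across the dataset's partitions.
    long errors = sc.textFile("hdfs:///data/app-logs")
        .filter(line -> line.contains("ERROR"))
        .count();

    System.out.println(errors + " error lines");
    sc.stop();
  }
}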

Spark is being incubated by Apache and is available from http://spark.incubator.apache.org/.