Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen


There are a lot of different approaches to working with Hadoop data. Since most business users, data analysts, and programmers are already familiar with issuing SQL commands against relational databases (such as Oracle, DB2, Teradata, etc.), tools that provide an SQL-like interface to Hadoop data are very popular. They provide a bridge between the old, familiar relational database world and the new ‘big data’ world of Hadoop.



Programming Hive

Hive, with its Hive Query Language (HiveQL), provides an SQL-like way of accessing data in Hadoop.


This book is simple and easy to learn from. Each topic is explained separately and extensively, with examples. Hive makes it easier for developers to port SQL-based applications to Hadoop.

This book covers:

--> An overview of Hadoop and MapReduce.

--> The basic differences between Hive and related tools such as the Pig programming language and HBase.

--> How to install Hive and how to configure it to work with Hadoop.

--> How to start Hive.

--> The main Hive commands and how to run them on your system.
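To give a flavor of the commands the book covers, here is a minimal HiveQL session. The table name, schema, and file path below are illustrative examples, not taken from the book:

```sql
-- Create a managed table (hypothetical schema for illustration).
CREATE TABLE IF NOT EXISTS employees (
  name   STRING,
  salary FLOAT,
  dept   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- Load a local tab-separated file into the table.
LOAD DATA LOCAL INPATH '/tmp/employees.tsv'
OVERWRITE INTO TABLE employees;

-- A familiar SQL-style query; Hive compiles it into MapReduce jobs.
SELECT dept, AVG(salary)
FROM employees
GROUP BY dept;
```

Anyone who has written SQL against a relational database will recognize this pattern immediately, which is exactly the appeal the book builds on.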

Through this post I’m reviewing the most comprehensive and detailed guide to using Hive available on this topic today: “Programming Hive” by Capriolo, Wampler and Rutherglen.

  • Title:  Programming Hive
  • Authors:  Capriolo, Wampler & Rutherglen
  • Publisher:  O’Reilly Media
  • Edition:  1st edition
  • Publication date:  October 2012
  • Hive versions:  Up to version 0.9.0


The authors clearly have a lot of real-world experience working with Hive. Edward Capriolo is a committer on the Hive project. The other authors, Dean Wampler and Jason Rutherglen, both work for Think Big Analytics where they have supported numerous big data projects. 



The authors also provide an introduction to several of the other key software products in the Hadoop ecosystem. This gives the reader the ability to determine whether Hive is the best tool for a given task.



The book shines in its treatment of advanced topics such as:

--> View creation, table design, and its relationship to physical storage options.

--> Working with the Hadoop streaming environment.

--> Setting up the Hive web interface and using the Hive Thrift service for remote access to Hive from other processes, including via JDBC and ODBC.

--> Integration with Amazon Web Services.

--> Use of HCatalog to make Hive metadata available to other users and tools.
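As a taste of the streaming topic in the list above, HiveQL’s TRANSFORM clause pipes rows through an external script using Hadoop streaming. Both the script and the table below are hypothetical, for illustration only:

```sql
-- Ship a script to the cluster; to_upper.py is a hypothetical
-- Python script that reads tab-separated lines from stdin and
-- writes transformed lines to stdout.
ADD FILE /tmp/to_upper.py;

-- Stream each row of a (hypothetical) employees table through
-- the script and capture its output columns.
SELECT TRANSFORM (name, dept)
       USING 'python to_upper.py'
       AS (name_upper, dept)
FROM employees;
```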


HCatalog is a table and storage management layer that enables users of different processing tools, such as MapReduce and Pig, to read and write data on the grid. HCatalog presents a relational view of data in the Hadoop Distributed File System: users don’t need to worry about where or in what format their data is stored, whether as RCFiles, text files, sequence files, or ORC files.

HCatalog allows users to read and write files in any format for which a SerDe (Serializer-Deserializer) can be written.

By default, HCatalog supports RCFile, CSV, JSON, and sequence files. To use a custom format, you need to provide the InputFormat, OutputFormat, and SerDe.
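A table with an explicit format is declared in HiveQL roughly as follows. This is a sketch; the exact SerDe class name varies by Hive/HCatalog version and distribution, and the table schema here is invented for illustration:

```sql
-- Declare a table whose rows are parsed by an explicit SerDe and
-- read/written with explicit input/output format classes.
CREATE TABLE logs (
  host STRING,
  path STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```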

HCatalog uses the Hive command-line interface (CLI) for issuing data-definition and metadata-exploration commands.
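Typical metadata-exploration commands issued through the CLI look like the following; because HCatalog shares the Hive metastore, the same tables are visible to Pig and MapReduce jobs (the table name is a hypothetical example):

```sql
-- List the tables registered in the shared metastore.
SHOW TABLES;

-- Inspect the schema and storage details of one table.
DESCRIBE FORMATTED employees;
```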

