Interview Questions on Hive





Q1. What is Hive? How is it different from other applications?


Solution - Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. Hive provides a SQL-like query language called HiveQL. It also allows programmers to plug in custom mappers and reducers.

Hive differs from other applications in that:

1. It makes Extract/Transform/Load (ETL) of data easy.
2. It is best suited to batch jobs over large sets of append-only data (such as web logs).


Q2. What are the Hive components? Why choose Hive when there are a lot of similar applications?


Solution : The Hive components are the Shell, Driver, Compiler, Metastore, and Execution Engine.

Hive provides:
  A SQL-like environment.
  Table partitioning.
  Views, joins, and streaming.

Q3. Which interfaces are used in Hive?

Solution: Besides the command-line shell, Hive offers a simple web interface called the "Hive Web Interface (HWI)" (available in older releases) and programmatic access through JDBC, ODBC, and a Thrift server.

Q4. What kinds of data warehouse applications are suitable for Hive?


Solution:
      1. Data mining.
      2. Document indexing.
      3. Log processing.
      4. Customer-facing business intelligence.
      5. Predictive modeling.

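Q5. What is the difference between 'ORDER BY' and 'SORT BY' in Hive?


Solution : 1. ORDER BY performs a total ordering of the query result set, which means all the data is passed through a single reducer. SORT BY orders the data only within each reducer, performing a local ordering: each reducer's output is sorted, but the combined output is not globally sorted when more than one reducer is used.

              2. SORT BY usually gives better performance on large data sets, because the sorting work is spread across several reducers instead of being funneled through one.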
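
The difference is easier to see with a small simulation. The Python sketch below is not Hive code; it only models the behavior described above, and the round-robin assignment of rows to reducers is an assumption made for illustration (Hive decides the distribution itself unless DISTRIBUTE BY is given).

```python
# Model of how rows reach reducers for SORT BY versus ORDER BY.
# The round-robin distribution is an illustrative assumption, not Hive's rule.

def sort_by(rows, num_reducers):
    """Each reducer sorts only the rows it received (local ordering)."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        reducers[i % num_reducers].append(row)
    return [sorted(part) for part in reducers]


def order_by(rows):
    """A single reducer sees every row, so the result is totally ordered."""
    return sorted(rows)


rows = [7, 3, 9, 1, 8, 2]
local = sort_by(rows, num_reducers=2)
total = order_by(rows)

print(local)  # one sorted list per reducer
print(total)  # one globally sorted list
```

With two reducers, each partial list is sorted, but concatenating them does not give a globally sorted result. That is the trade-off SORT BY makes for parallelism.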

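Q6. What is the difference between the 'LIKE' and 'RLIKE' predicate operators?


Solution : 

The LIKE predicate operator is used when we need a STRING that begins or ends with a particular substring, or when the substring appears anywhere within the string. It uses simple SQL wildcards: '%' matches any sequence of characters and '_' matches a single character.
Suppose we have to find gyansha in a string; then we can use '%gyansha' (ends with), 'gyansha%' (begins with), or '%gyansha%' (appears anywhere).

The RLIKE predicate operator matches the string against a Java regular expression, so it can express patterns that LIKE cannot, such as alternatives or character classes.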
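
As a rough illustration of the two matching styles, the Python sketch below emulates both operators. The helper names `like` and `rlike` are hypothetical, and Python's `re` module stands in for Java regular expressions, so edge cases may differ from what Hive does.

```python
import re


def like(value, pattern):
    """Emulate SQL LIKE: '%' matches any run of characters, '_' exactly one.
    The whole string must match the pattern."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


def rlike(value, regex):
    """Emulate RLIKE: true if any substring of value matches the regex."""
    return re.search(regex, value) is not None


print(like("gyansha.com", "gyansha%"))        # begins with
print(like("www.gyansha", "%gyansha"))        # ends with
print(like("www.gyansha.com", "%gyansha%"))   # appears anywhere
print(rlike("gyansha2014", r"^gyansha[0-9]+$"))  # needs a regex
```

The last pattern (a fixed prefix followed only by digits) cannot be written with LIKE wildcards alone, which is the typical reason to reach for RLIKE.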

Q7. What is Hive Metastore?


Solution: The Metastore contains metadata about tables, partitions, and databases. The Query Processor uses this metadata during plan generation.

Q8. Can multiple users share the same metastore when Hive uses the embedded metastore?

Solution : No. The embedded metastore (backed by Derby) allows only one active session at a time, so it cannot be used in sharing mode. For multi-user setups it is recommended to use a standalone "real" database such as MySQL or PostgreSQL.

Q9. What are the different collection types in Hive?

Solution : Hive has three collection types: Map, Struct, and Array.
   

Q10. What are the limitations of Hive?

Solution : 1. Latency for Hive queries is generally high (on the order of minutes).
           2. It is not designed for real-time queries or row-level updates.
 

Q11. What are the different Hive data models?


Solution :
      1. Databases (a database defines a namespace).

      2. Tables (a table defines the schema of the data within a namespace).

      3. Partitions (partitions define how data is stored in HDFS, grouping data based on the value of some column).

      4. Buckets (buckets divide partitions further into a fixed number of parts; they are used for data sampling).
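
To make partitions and buckets more concrete, here is a small Python sketch of how a row could be placed. It assumes integer bucketing keys whose hash is the value itself, and the table name `logs`, the column `dt`, and the helper names are hypothetical; it models the idea rather than reproducing Hive's implementation.

```python
# Sketch of partition and bucket placement, under the assumptions stated above.

def partition_dir(table, column, value):
    """A partition is stored as a sub-directory named column=value."""
    return f"{table}/{column}={value}"


def bucket_for(key, num_buckets):
    """Assign a non-negative integer key to one of num_buckets buckets."""
    return key % num_buckets


user_ids = [101, 102, 103, 104, 105]
buckets = {}
for uid in user_ids:
    buckets.setdefault(bucket_for(uid, 4), []).append(uid)

print(partition_dir("logs", "dt", "2024-01-01"))
print(buckets)
```

Because every key always lands in the same bucket, reading a single bucket gives a repeatable sample of the data, which is why buckets are useful for sampling.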
 

Q12. How to load data into a table?


Solution : First create the table, then load data into it with the following command. LOCAL reads the file from the local file system rather than HDFS, and OVERWRITE replaces any data already in the table.

  LOAD DATA LOCAL INPATH 'path/filename' OVERWRITE INTO TABLE table_name;

Q13. Why do we need Views in Hive?


Solution : A view allows a query to be saved and treated like a table. It is a logical construct and does not store data itself. Views are used when a query becomes complicated: by dividing the query into smaller, more manageable pieces, a view hides the complexity and makes the data easier for end users to work with.
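
The idea can be tried out without a Hive installation. The sketch below uses Python's built-in `sqlite3` module as a stand-in (an assumption made only so the example runs anywhere); the table `orders` and the view `customer_totals` are hypothetical names. The view statement itself is plain SQL, though other details may differ in HiveQL.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10), ("b", 5), ("a", 7)],
)

# The aggregation is saved once as a view...
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

# ...and end users query it like an ordinary table.
result = conn.execute(
    "SELECT customer, total FROM customer_totals "
    "WHERE total > 10 ORDER BY customer"
).fetchall()
print(result)
conn.close()
```

The second query never mentions GROUP BY or SUM; that complexity stays inside the view definition.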

Q14. How to see which database you are currently working on, and how to enable this?


Solution : The command "USE database_name;" switches to a database but does not display it. To show the name of the current database in the CLI prompt, run SET hive.cli.print.current.db=true; This setting lasts for the current session. In recent Hive versions, SELECT current_database(); also returns the name.

Q15. Can we add a 'COMMENT' to any column or table?


Solution : Yes, we can add a COMMENT to any column or table.

Q16. How to describe the metadata or schema of a table?

Solution : DESCRIBE table_name;

Q17. What is a SerDe in Apache Hive?


Solution : SerDe stands for Serializer-Deserializer. A SerDe allows Hive to read data in from a table and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
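
A real SerDe is written in Java against Hive's SerDe interfaces, but the two directions it handles can be sketched in a few lines of Python. The sketch assumes a delimited text format using Ctrl-A (\x01), Hive's default field delimiter for text tables; the column list and function names are hypothetical.

```python
# Conceptual sketch of what a SerDe does; not Hive's actual Java interface.

FIELD_DELIM = "\x01"               # Ctrl-A, default field delimiter for text tables
COLUMNS = ["id", "name", "city"]   # hypothetical table schema


def deserialize(line):
    """Turn one stored line into a row that the query engine can read."""
    values = line.rstrip("\n").split(FIELD_DELIM)
    return dict(zip(COLUMNS, values))


def serialize(row):
    """Turn a row back into the stored line format."""
    return FIELD_DELIM.join(str(row[col]) for col in COLUMNS)


stored = "1\x01Asha\x01Pune\n"
row = deserialize(stored)
print(row)
print(repr(serialize(row)))
```

Supporting a different storage format means swapping these two functions while the table schema and the queries stay the same.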

Q18. What is the functionality of the Query Processor in Apache Hive?


Solution : The following are the main components of the Hive Query Processor:

1. Parsing and semantic analysis.
2. Optimizer.
3. Plan components.
4. Metadata layer.
5. Hive function framework.

