Hadoop Brainstorming Questions and Answers



Multiple-choice questions

1. Which of the following processes is responsible for HDFS data storage?

A) NameNode
B) JobTracker
C) DataNode
D) SecondaryNameNode
E) TaskTracker

Ans: DataNode

2. How many copies of each block does HDFS save by default?

A) 3 copies
B) 2 copies
C) 1 copy
D) It is not fixed

Ans: 3 copies

3. Which of the following programs is usually started on the same node as the NameNode?

A) SecondaryNameNode
B) DataNode
C) TaskTracker
D) JobTracker

Ans: JobTracker

Analysis:

A Hadoop cluster is based on the master/slave model: the NameNode and JobTracker belong to the master, while the DataNodes and TaskTrackers belong to the slaves. There is only one master, but there are many slaves. The SecondaryNameNode's memory requirement is of the same order of magnitude as the NameNode's, so the SecondaryNameNode is usually run on a separate physical machine rather than together with the NameNode.

  • JobTracker and TaskTracker
  • JobTracker corresponds to NameNode
  • TaskTracker corresponds to DataNode
  • DataNode and NameNode are for data storage
  • JobTracker and TaskTracker are for MapReduce execution


MapReduce execution can be divided into several main threads of control: JobClient, JobTracker, and TaskTracker.

1. On the client side, the JobClient class packages the application and its parameters into a jar file, stores it in HDFS, and submits the path to the JobTracker. The JobTracker then creates each Task (MapTask and ReduceTask) and distributes them to the TaskTracker services for execution.

2. The JobTracker is a master service. After it starts, it receives jobs and is responsible for scheduling each sub-task (Task) of a job to run on a TaskTracker, monitoring them and re-running any task that fails. In general, the JobTracker should be deployed on a separate machine.

3. The TaskTracker is a slave service running on multiple nodes. It actively communicates with the JobTracker, receives tasks, and is responsible for executing each task directly. A TaskTracker needs to run on an HDFS DataNode.

4. Who is the author of Hadoop?

A) Martin Fowler
B) Kent Beck
C) Doug Cutting


Ans: Doug Cutting

5. What is the default HDFS block size?

A) 32MB
B) 64MB
C) 128MB

Ans: 64MB (the Hadoop 1.x default; Hadoop 2.x raised it to 128MB)

6. Which of the following is usually the main bottleneck of a cluster?

A) CPU
B) network
C) disk IO
D) memory

Ans: disk IO

Analysis:

The purpose of a cluster is to save costs by replacing minicomputers and mainframes with cheap PC machines. Minicomputers and mainframes are characterized by:

1. strong CPU processing power
2. large memory

So the cluster's bottleneck cannot be A or D.

3. The network is a scarce resource, but it is not the bottleneck.

4. Big data means massive amounts of data, and reading and writing that data requires disk I/O. On top of that, Hadoop normally keeps 3 replicas of each block, so the I/O load is multiplied. Disk I/O is therefore the bottleneck.

7. Which statement about the SecondaryNameNode is correct?

A) It is a hot backup of the NameNode
B) It requires no memory
C) Its purpose is to help the NameNode merge the edit log, reducing NameNode startup time
D) The SecondaryNameNode should be deployed on the same node as the NameNode

Ans: Its purpose is to help the NameNode merge the edit log, reducing NameNode startup time

Multiple-answer questions:

8. Which of the following can be used as cluster management tools?

A) Puppet
B) Pdsh
C) Cloudera Manager
D) ZooKeeper

9. Which of the following statements about rack configuration are correct?

A) If one rack fails, it does not affect data reads and writes
B) Data is written to DataNodes in different racks
C) MapReduce uses rack awareness to fetch data from the network location closest to it

10. Which of the following is correct when a client uploads a file?

A) The data is passed to the DataNodes via the NameNode
B) The client splits the file into blocks and then uploads them
C) The client uploads the data to a single DataNode only, and the NameNode is then responsible for block replication

Analysis:

The client initiates a file write request to the NameNode.

The NameNode returns information about the DataNodes it manages, based on the file size and block configuration.

The client divides the file into multiple blocks and writes them in sequence to the DataNodes, according to the DataNode address information returned. (So B is correct.)
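The client-side split step can be sketched in a few lines of Python (a minimal illustration only; the function name and the 64 MB default are assumptions for the example, not Hadoop's actual client code):

```python
def split_into_blocks(data: bytes, block_size: int = 64 * 1024 * 1024):
    """Split a byte buffer into fixed-size blocks, the way an HDFS client
    cuts a file into Blocks before writing them to the DataNodes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Tiny block size so the split is visible:
print([len(b) for b in split_into_blocks(b"x" * 10, block_size=4)])  # [4, 4, 2]
```

Note that only the last block may be shorter than the block size, which is why a 10-byte file with a 4-byte block size yields blocks of 4, 4, and 2 bytes.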

11. Which of the following are Hadoop run modes?

A) Standalone (local) mode
B) Pseudo-distributed mode
C) Fully distributed mode

Ans: A, B, C

12. In what ways does Cloudera offer CDH?

A) Cloudera Manager
B) Tarball
C) Yum
D) RPM

True or False:

13. Ganglia can not only monitor but can also send alerts. (Correct)

Analysis: This question examines your understanding of Ganglia. Strictly speaking, the statement is correct. Ganglia, one of the most commonly used monitoring tools in Linux environments, is good at collecting data from nodes at low cost, but it is not good at notifying users when warnings and incidents occur, although the latest Ganglia releases include some of this functionality. Nagios is better at alerting and notification. By combining the two, using the data collected by Ganglia as a data source for Nagios and letting Nagios send the alert notifications, a complete monitoring and management system can be built.

14. The block size cannot be modified. (Wrong)

Analysis: It can be modified. Hadoop's base configuration file is hadoop-default.xml. By default, creating a job builds a JobConf, which first reads the configuration in hadoop-default.xml and then reads hadoop-site.xml (a file that is initially empty); hadoop-site.xml holds the site-specific settings that override the system-level defaults in hadoop-default.xml.
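For example, a larger block size can be set in the site configuration (shown here with the Hadoop 1.x property name dfs.block.size; in Hadoop 2.x the property was renamed dfs.blocksize):

```xml
<!-- hadoop-site.xml / hdfs-site.xml: override the default 64 MB block size -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>
```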

15. Nagios cannot monitor a Hadoop cluster because it does not provide Hadoop support. (Wrong)

Analysis: Nagios is a general cluster monitoring tool, one of the three major cloud-computing monitoring tools, and it can monitor a Hadoop cluster.

16. If the NameNode terminates unexpectedly, the SecondaryNameNode takes over to keep the cluster working. (Wrong)

Analysis: The SecondaryNameNode helps with recovery of the NameNode's metadata; it does not replace the NameNode.

17. Cloudera CDH requires payment to use. (Wrong)

Analysis: CDH itself is free. Cloudera's first paid product, Cloudera Enterprise, was announced at the Hadoop Summit in California; it enhances Hadoop with a number of proprietary management, monitoring, and operations tools. Fees are contract-based, and the price varies with the size of the Hadoop cluster.

18. Hadoop is developed in Java, so MapReduce only supports the Java language. (Wrong)

Analysis: RHadoop, for example, is developed in the R language. MapReduce is a framework, an idea, and jobs can be written in other languages, for instance through Hadoop Streaming.
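As a sketch, here is a word-count job in the Hadoop Streaming style, written in Python (an illustration only; Streaming runs any executable that reads lines on stdin and writes key/tab/value lines on stdout, and the pipeline below is simulated locally rather than run on a cluster):

```python
from itertools import groupby

def mapper(lines):
    """Map phase: for every word, emit a 'word<TAB>1' line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Reduce phase: sum the counts per word. The input must be sorted by
    key, which Hadoop's shuffle/sort phase guarantees."""
    split = (p.split("\t") for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the map -> sort (shuffle) -> reduce pipeline locally:
for out in reducer(sorted(mapper(["hello hadoop", "hello streaming"]))):
    print(out)
```

Under Hadoop Streaming, the map and reduce halves would be two separate scripts reading stdin and writing stdout, passed to the streaming jar with -mapper and -reducer.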

19. Hadoop supports random reads and writes of data. (Wrong)

Analysis: Lucene supports random reads and writes, while HDFS supports only random reads (writes are append-only). HBase remedies this: it provides random reads and writes on top of Hadoop, solving a problem Hadoop alone cannot handle. HBase was designed from the ground up for scalability: tables can be "tall", with billions of rows, or "wide", with millions of columns, and are automatically replicated across ordinary commodity nodes. The table schema is a direct reflection of the physical storage, which lets the system serialize, store, and retrieve its data structures efficiently.

20. The NameNode is responsible for managing metadata. For every client read or write request, it reads the metadata from disk, or writes the metadata to disk, and then responds to the client. (Wrong)

Analysis:

The NameNode does not need to read metadata from disk: all metadata is kept in memory, and the copy on disk is only a serialized snapshot, which is read only when the NameNode starts.

1) File write
The client initiates a file write request to the NameNode.
The NameNode returns information about the DataNodes it manages, based on the file size and block configuration.
The client divides the file into multiple blocks and writes them in sequence to the DataNodes, according to the DataNode address information.

2) File read
The client initiates a file read request to the NameNode.
The NameNode returns the DataNode locations of the blocks that make up the file.
The client reads the file data directly from the DataNodes.



21. The NameNode's local disk holds the locations of the blocks. (The original author considered this correct; strictly speaking, block locations are not persisted on the NameNode's disk, they are rebuilt from the block reports that DataNodes send in. Other comments welcome.)

Analysis: The DataNode is the basic unit of file storage. It stores blocks in its local file system, saves the blocks' metadata, and periodically sends information about all of its existing blocks to the NameNode.

22. DataNodes maintain communication with the NameNode over a long connection. ( )

Opinions differ on this one; the following background may help.

First, the concepts:

(1) Long connection
The client and server establish a communication connection and keep it open afterwards, sending and receiving messages over the same connection. Because the connection persists, this method is often used for point-to-point communication.

(2) Short connection
The client and server establish a connection only when a message needs to be sent or received, and disconnect immediately after the transaction. This method is often used for point-to-multipoint communication, such as many clients connecting to one server.
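The long-connection pattern can be illustrated with a small Python socket sketch (a generic illustration, not Hadoop's actual RPC code; every name below is made up for the example):

```python
import socket
import threading

def echo_server(sock):
    """Accept one client and echo messages back until it disconnects."""
    conn, _ = sock.accept()
    with conn:
        while data := conn.recv(1024):
            conn.sendall(data)

def long_connection_demo(messages):
    """Open ONE connection and reuse it for every message (long connection),
    instead of reconnecting once per message (short connection)."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
    server.listen(1)
    threading.Thread(target=echo_server, args=(server,), daemon=True).start()

    replies = []
    with socket.create_connection(server.getsockname()) as client:
        for msg in messages:            # same connection, many exchanges
            client.sendall(msg)
            replies.append(client.recv(1024))
    server.close()
    return replies

print(long_connection_demo([b"heartbeat", b"blockreport"]))
```

A short-connection variant would call socket.create_connection once per message; keeping the single connection open is what saves the repeated setup and teardown cost.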


23. Hadoop itself has strict rights management and security measures to ensure the normal operation of the cluster. (Wrong)

Analysis: Hadoop's permission model can only stop good people from making mistakes by accident; it cannot stop bad people from deliberately doing bad things.

24. Slave nodes store the data, so the bigger their disks the better. (Wrong)

Analysis: Once a Slave node with a huge disk goes down, recovering (re-replicating) all of its data is a challenge.


25. The hadoop dfsadmin -report command is used to detect damaged HDFS blocks. (Wrong)

Analysis: hadoop dfsadmin -report shows the basic statistics and status of HDFS; hadoop fsck is the command for checking for corrupt or missing blocks.

26. Hadoop's default scheduler policy is FIFO. (Correct)

27. Every node in the cluster should be equipped with RAID, to avoid a single disk failure affecting the whole node. (Wrong)

Analysis: First understand what RAID is (see any reference on disk arrays). The mistake in this statement is that it is too absolute; it depends on the situation. Because Hadoop itself provides redundancy through block replication, RAID is not needed unless the requirements are very strict. See also question 2.

28. Because HDFS keeps multiple replicas of the data, the NameNode has no single point of failure. (Wrong)

29. Each map slot is a thread. (Wrong)

Analysis: First, what is a map slot? A map slot is only a logical value (org.apache.hadoop.mapred.TaskTracker.TaskLauncher.numFreeSlots); it does not correspond to a thread or a process.

30. A MapReduce input split is a block. (Wrong)

Analysis: An input split is a logical division of the input for a single map task. By default its size matches the block size, but a split is not the same thing as a block and its size can be configured.

31. The NameNode's Web UI port is 50030, and it starts an HTTP service through Jetty. (Wrong)

Analysis: 50030 is the JobTracker's Web UI port; the NameNode's Web UI defaults to port 50070. Both are served by Jetty.

32. HADOOP_HEAPSIZE in the Hadoop environment variables is used to set the heap size for all Hadoop daemons. It defaults to 200 GB. (Wrong)

Analysis: HADOOP_HEAPSIZE is specified in MB and defaults to 1000 MB (1 GB), not 200 GB.
