Hadoop Administration Interview Questions & Answers

  1. Question 1. How Will You Decide Whether You Need To Use The Capacity Scheduler Or The Fair Scheduler?

    Answer :

    Fair scheduling assigns resources to jobs so that, over time, every job gets an equal share of the cluster's resources.

    Fair Scheduler can be used under the following circumstances:

    i) If you want jobs to make equal progress instead of following FIFO order, use fair scheduling.

    ii) If connectivity is slow and data locality makes a significant difference to job runtime, use fair scheduling.

    iii) Use fair scheduling if there is a lot of variability in utilization between pools.
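
    The idea of an equal share can be illustrated with a small sketch. It uses max-min fairness as a simplified stand-in and is not the Fair Scheduler's actual code; the function name, job names and slot counts are hypothetical.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` slots across jobs so that no job gets more than it
    asked for and leftover slots are shared equally among the rest."""
    allocation = {job: 0.0 for job in demands}
    unsatisfied = {job for job, demand in demands.items() if demand > 0}
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        # Offer every job that still wants slots an equal share of what is left.
        share = remaining / len(unsatisfied)
        for job in sorted(unsatisfied):
            grant = min(share, demands[job] - allocation[job])
            allocation[job] += grant
            remaining -= grant
        # Jobs whose demand is now met drop out of the next round.
        unsatisfied = {
            job for job in unsatisfied if demands[job] - allocation[job] > 1e-9
        }
    return allocation


# Hypothetical cluster with 12 slots and three jobs of different sizes.
shares = max_min_fair_share(12, {"job_a": 2, "job_b": 8, "job_c": 10})
print(shares)
```

    The small job receives everything it asked for, and the two larger jobs split the remaining slots evenly instead of waiting behind each other as they would under FIFO.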

    The Capacity Scheduler lets you run the Hadoop MapReduce cluster as a shared, multi-tenant cluster while maximizing cluster utilization and throughput.

    Capacity Scheduler can be used under the following circumstances:

    i) If the jobs require scheduler determinism, the Capacity Scheduler can be useful.

    ii) The Capacity Scheduler's memory-based scheduling is useful if jobs have varying memory requirements.

    iii) If you know the cluster utilization and workload well and want to enforce resource allocation, use the Capacity Scheduler.

  2. Question 2. What Are The Daemons Required To Run A Hadoop Cluster?

    Answer :

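    NameNode, DataNode, TaskTracker and JobTracker. A Secondary NameNode usually runs as well. (In Hadoop 2 and later with YARN, the ResourceManager and NodeManager take the place of the JobTracker and TaskTracker.)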

  4. Question 3. How Will You Restart A Namenode?

    Answer :

    The easiest way is to run the stop-all.sh shell script to stop the running daemons, and then run start-all.sh to start them again, which brings the NameNode back up. Note that this restarts every daemon, not only the NameNode.

  5. Question 4. Explain The Different Schedulers Available In Hadoop.

    Answer :

    FIFO Scheduler – This scheduler does not consider the heterogeneity in the system; it orders jobs in a queue by their arrival times.

    COSHH – This scheduler considers workload, cluster and user heterogeneity when making scheduling decisions.

    Fair Sharing – This scheduler defines a pool for each user. Each pool contains a number of map and reduce slots on a resource, and each user can use their own pool to run jobs.

  7. Question 5. List A Few Hadoop Shell Commands That Are Used To Perform A Copy Operation.

    Answer :

    1. hadoop fs -put
    2. hadoop fs -copyToLocal
    3. hadoop fs -copyFromLocal

  9. Question 6. What Is Jps Command Used For?

    Answer :

    The jps command is used to verify whether the daemons that run the Hadoop cluster are working. It lists the Java processes running on the machine, so its output shows whether the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker are up.
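
    A check like this is easy to script. The sketch below assumes jps prints one "pid name" pair per line; the sample output, process IDs and helper name are hypothetical and only for illustration.

```python
EXPECTED_DAEMONS = (
    "NameNode",
    "SecondaryNameNode",
    "DataNode",
    "JobTracker",
    "TaskTracker",
)


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each useful line looks like "<pid> <process name>".
        if len(parts) >= 2:
            running.add(parts[1])
    return [name for name in expected if name not in running]


# Hypothetical output captured on a node where the MapReduce daemons are down.
SAMPLE_JPS_OUTPUT = """\
4821 NameNode
4990 DataNode
5177 SecondaryNameNode
5302 Jps
"""

print(missing_daemons(SAMPLE_JPS_OUTPUT))
```

    On a real node the text would come from running jps itself, for example through subprocess, rather than from a hard-coded string.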

  10. Question 7. What Are The Important Hardware Considerations When Deploying Hadoop In Production Environment?

    Answer :

    Memory – Memory requirements vary between the worker services and the management services, depending on the application.

    Operating System – A 64-bit operating system avoids restrictions on the amount of memory that can be used on worker nodes.

    Storage – It is preferable to design a Hadoop platform that moves compute activity to the data, to achieve scalability and high performance.

    Capacity – Large Form Factor (3.5”) disks cost less and store more than Small Form Factor disks.

    Network – Two TOR (top-of-rack) switches per rack provide better redundancy.

    Computational Capacity – This is determined by the total number of MapReduce slots available across all the nodes in the Hadoop cluster.

  12. Question 8. How Many Namenodes Can You Run On A Single Hadoop Cluster?

    Answer :

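    Only one. (This holds for Hadoop 1.x. Hadoop 2 and later can run a standby NameNode for high availability, and HDFS federation allows several NameNodes, each managing its own namespace.)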

  13. Question 9. What Happens When The Namenode On The Hadoop Cluster Goes Down?

    Answer :

    The file system goes offline whenever the NameNode is down.

  15. Question 10. What Is The Conf/hadoop-env.sh File And Which Variable In The File Should Be Set For Hadoop To Work?

    Answer :

    This file sets up the environment in which Hadoop runs. It contains variables such as HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. JAVA_HOME must be set for Hadoop to run.

  17. Question 11. Apart From Using The Jps Command, Is There Any Other Way To Check Whether The Namenode Is Working Or Not?

    Answer :

    Use the command: /etc/init.d/hadoop-0.20-namenode status

  19. Question 12. In A Mapreduce System, The Hdfs Block Size Is 64 MB And There Are 3 Files Of Size 127 MB, 64 KB And 65 MB Read With Fileinputformat. How Many Input Splits Are Likely To Be Made By The Hadoop Framework?

    Answer :

    2 splits each for 127 MB and 65 MB files and 1 split for the 64KB file.
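
    The arithmetic behind this answer can be sketched as follows. This is a simplified model in which the split size equals the block size, not Hadoop's own code; the function name and the slop parameter are assumptions made for illustration. FileInputFormat implementations typically let the last split run about 10% over the split size, which slop=1.1 models; under that setting the 65 MB file would yield a single split.

```python
MB = 1024 * 1024
KB = 1024


def count_splits(file_size, split_size, slop=1.0):
    """Count input splits for one non-empty file.

    Full-size splits are carved off while the remaining bytes exceed
    split_size * slop; whatever is left becomes the final split.
    """
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


files = {"127 MB": 127 * MB, "64 KB": 64 * KB, "65 MB": 65 * MB}
block_size = 64 * MB

for name, size in files.items():
    print(name, count_splits(size, block_size))

print("total:", sum(count_splits(size, block_size) for size in files.values()))
```

    With the default slop of 1.0 the sketch reproduces the answer above: two splits for each of the larger files and one for the small file, five in total.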

  21. Question 13. Which Command Is Used To Verify If The Hdfs Is Corrupt Or Not?

    Answer :

    The hadoop fsck (file system check) command is used to check HDFS health. It reports problems such as missing, corrupt and under-replicated blocks.

  23. Question 14. List Some Use Cases Of The Hadoop Ecosystem?

    Answer :

    Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.

  24. Question 15. How Can You Kill A Hadoop Job?

    Answer :

    hadoop job -kill <jobID>

  26. Question 16. I Want To See All The Jobs Running In A Hadoop Cluster. How Can You Do This?

    Answer :

    The command hadoop job -list gives the list of jobs running in a Hadoop cluster.

  28. Question 17. Is It Possible To Copy Files Across Multiple Clusters? If Yes, How Can You Accomplish This?

    Answer :

    Yes. Files can be copied across multiple Hadoop clusters using distributed copy: the DistCp command (hadoop distcp) is used for both intra-cluster and inter-cluster copying.

  30. Question 18. Which Is The Best Operating System To Run Hadoop?

    Answer :

    Linux (for example, Ubuntu) is the preferred operating system for running Hadoop. Windows can also be used, but it leads to several problems and is not recommended.

  32. Question 19. What Are The Network Requirements To Run Hadoop?

    Answer :

    • SSH is required to launch server processes on the slave nodes.
    • A passwordless SSH connection is required between the master, the secondary machines and all the slaves.

  34. Question 20. The Mapred.output.compress Property Is Set To True To Make Sure That All Output Files Are Compressed For Efficient Space Usage On The Hadoop Cluster. If A Cluster User Does Not Need Compressed Output For A Particular Job, What Would You Suggest They Do?

    Answer :

    If the user does not want to compress the output of a particular job, they should create their own configuration file, set the mapred.output.compress property to false in it, and load that file as a resource into the job.

  36. Question 21. What Is The Best Practice To Deploy A Secondary Namenode?

    Answer :

    It is always better to deploy the secondary NameNode on a separate standalone machine, so that it does not interfere with the operations of the primary NameNode.

  37. Question 22. How Often Should The Namenode Be Reformatted?

    Answer :

    The NameNode should never be reformatted. Doing so results in complete data loss. The NameNode is formatted only once, at the beginning, when it creates the directory structure for the file system metadata and the namespace ID for the entire file system.

  39. Question 23. If Hadoop Spawns 100 Tasks For A Job And One Of The Tasks Fails, What Does Hadoop Do?

    Answer :

    The task is started again on a new TaskTracker. If it fails 4 times, which is the default maximum number of attempts (the value can be changed), the job is killed.
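
    The retry behaviour can be modelled with a short sketch. It assumes a limit of 4 attempts per task and does not model the move to a different TaskTracker; the function names and the flaky task are hypothetical.

```python
def run_task(task, max_attempts=4):
    """Run `task` until it succeeds or max_attempts is used up.

    Returns (attempts_used, result). Raises RuntimeError when every
    attempt fails, which stands in for the job being killed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, task(attempt)
        except Exception as error:
            # A failed attempt is recorded and the task is tried again.
            last_error = error
    raise RuntimeError(
        f"task failed {max_attempts} times; job killed"
    ) from last_error


def flaky_task(attempt):
    """Hypothetical task that fails on its first two attempts."""
    if attempt < 3:
        raise IOError(f"attempt {attempt} failed")
    return "done"


print(run_task(flaky_task))
```

    A task that recovers within the limit lets the job continue, while a task that fails on every attempt raises the error that represents the killed job.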

  41. Question 24. How Can You Add And Remove Nodes From The Hadoop Cluster?

    Answer :

    • To add new nodes to the HDFS cluster, add their hostnames to the slaves file and then start the DataNode and TaskTracker on each new node.
    • To remove or decommission nodes from the HDFS cluster, list their hostnames in the exclude file (the one referenced by dfs.hosts.exclude) and run hadoop dfsadmin -refreshNodes. Once decommissioning has finished, the hostnames can be removed from the slaves file.

  43. Question 25. You Increase The Replication Level But Notice That The Data Is Under Replicated. What Could Have Gone Wrong?

    Answer :

    Possibly nothing has gone wrong. If there is a huge volume of data, replication takes time in proportion to the data size, because the cluster has to copy the data, and this might take a few hours.

  45. Question 26. Explain The Different Configuration Files And Where They Are Located.

    Answer :

    The configuration files are located in the “conf” subdirectory. Hadoop has three configuration files: hdfs-site.xml, core-site.xml and mapred-site.xml.
