Question 1. What Is Apache Oozie?
Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie is a scalable, reliable and extensible system. Oozie is used in production at Yahoo!, running more than 200,000 jobs every day.
Question 2. Mention Some Features Of Oozie?
- Oozie has a client API and a command line interface which can be used to launch, control and monitor jobs from a Java application.
- Using its Web Service APIs one can control jobs from anywhere.
- Oozie has provision to execute jobs which are scheduled to run periodically.
- Oozie has provision to send email notifications upon completion of jobs.
Question 3. Explain Need For Oozie?
With Apache Hadoop becoming the open source de-facto standard for processing and storing Big Data, many other languages like Pig and Hive have followed – simplifying the process of writing big data applications based on Hadoop.
Although Pig, Hive and many others have simplified the process of writing Hadoop jobs, a single Hadoop job is often not sufficient to get the desired output. Many Hadoop jobs have to be chained and data has to be shared between them, which makes the whole process very complicated.
Question 4. What Are The Alternatives To Oozie Workflow Scheduler?
- Azkaban is a batch workflow job scheduler.
- Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data.
- Apache Falcon is a feed management and data processing platform.
Question 5. Explain Types Of Oozie Jobs?
Oozie supports job scheduling for the full Hadoop stack like Apache MapReduce, Apache Hive, Apache Sqoop and Apache Pig.
It consists of two parts:
Workflow engine: the responsibility of the workflow engine is to store and run workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
Coordinator engine: It runs workflow jobs based on predefined schedules and availability of data.
Question 6. Explain Oozie Workflow?
An Oozie Workflow is a collection of actions arranged in a Directed Acyclic Graph (DAG). Control nodes define the job chronology, setting rules for beginning and ending a workflow and controlling the execution path with decision, fork and join nodes. Action nodes trigger the execution of tasks.
Workflow nodes are classified into control flow nodes and action nodes:
Control flow nodes: nodes that control the start and end of the workflow and workflow job execution path.
Action nodes: nodes that trigger the execution of a computation/processing task.
Workflow definitions can be parameterized. The parameterization of workflow definitions is done using JSP Expression Language syntax, allowing not only variables as parameters but also functions and complex expressions.
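As a sketch of this parameterization, the workflow definition below mixes plain variables (such as ${jobTracker} and ${inputDir}, resolved from job.properties) with built-in EL functions such as wf:user() and wf:errorMessage(). All names here are illustrative, not taken from a real application:

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="pig-node"/>
    <action name="pig-node">
        <pig>
            <!-- plain variables, resolved from job.properties -->
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>process.pig</script>
            <param>input=${inputDir}</param>
            <!-- EL function: the user who submitted the job -->
            <param>user=${wf:user()}</param>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <!-- complex EL expression combining two functions -->
        <message>Pig failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```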
Question 7. What Is Oozie Workflow Application?
A workflow application is a ZIP file that includes the workflow definition and the necessary files to run all the actions.
It contains the following files:
- Workflow definition – workflow.xml
- Configuration file – config-default.xml
- App files – lib/ directory with JAR and SO files
- Pig scripts
Question 8. What Are The Properties That We Have To Mention In .properties?
- Name Node
- Job Tracker
- Lib Path
- Jar Path
Question 9. What Are The Extra Files We Need When We Run A Hive Action In Oozie?
A Hive action needs the Hive script file it runs, and a hive-site.xml (referenced through the action's job-xml element or placed on the workflow's lib path) so the action can reach the Hive metastore with the right configuration.
Question 10. What Is Decision Node In Oozie?
Decision nodes are like switch statements: they run different jobs based on the outcome of an expression.
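A minimal decision node might look like the following sketch, which routes the workflow based on input size using the built-in fs:fileSize() EL function (the node and transition names are hypothetical):

```xml
<decision name="check-size">
    <switch>
        <!-- run the heavy job only when the input exceeds 1 GB -->
        <case to="big-job">${fs:fileSize(inputDir) gt 1073741824}</case>
        <!-- otherwise take the default transition -->
        <default to="small-job"/>
    </switch>
</decision>
```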
Question 11. Explain Oozie Coordinator?
Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability. Oozie Coordinator can also manage multiple workflows that depend on each other's outcomes: the output of one workflow becomes the input to the next. This chain is called a 'data application pipeline'.
Oozie processes coordinator jobs in a fixed timezone with no DST (typically UTC); this timezone is referred to as the 'Oozie processing timezone'. The Oozie processing timezone is used to resolve coordinator job start/end times, job pause times and the initial-instance of datasets. All coordinator dataset instance URI templates are also resolved to a datetime in the Oozie processing timezone.
The usage of Oozie Coordinator can be categorized in 3 different segments:
Small: consisting of a single coordinator application with embedded dataset definitions
Medium: consisting of a single shared set of dataset definitions and a few coordinator applications
Large: consisting of a single or multiple shared dataset definitions and several coordinator applications
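A coordinator application combining time and data triggers might be sketched as follows; the dataset, paths and dates are illustrative assumptions, not from a real deployment:

```xml
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- embedded dataset definition (the "Small" usage above) -->
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="2024-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- wait for today's dataset instance to become available -->
        <data-in name="input" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs:///apps/daily-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```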
Question 12. Explain Briefly About Oozie Bundle ?
Oozie Bundle is a higher-level Oozie abstraction that batches a set of coordinator applications. The user can start/stop/suspend/resume/rerun at the bundle level, resulting in better and easier operational control.
More specifically, the Oozie Bundle system allows the user to define and execute a bunch of coordinator applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a bundle. However, a user could use the data dependency of coordinator applications to create an implicit data application pipeline.
Oozie executes workflow based on:
- Time Dependency(Frequency)
- Data Dependency
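A bundle definition is little more than a named list of coordinator applications plus an optional kick-off time; the names and HDFS paths below are hypothetical:

```xml
<bundle-app name="pipeline-bundle" xmlns="uri:oozie:bundle:0.2">
    <controls>
        <!-- when the bundle job should start submitting its coordinators -->
        <kick-off-time>2024-01-01T00:00Z</kick-off-time>
    </controls>
    <coordinator name="ingest-coord">
        <app-path>hdfs:///apps/ingest-coord</app-path>
    </coordinator>
    <coordinator name="report-coord">
        <app-path>hdfs:///apps/report-coord</app-path>
    </coordinator>
</bundle-app>
```

There is no explicit ordering between the two coordinators; if report-coord's input dataset is ingest-coord's output dataset, the pipeline dependency is implicit.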
Question 13. What Is Application Pipeline In Oozie?
It is often necessary to connect workflow jobs that run regularly but at different time intervals. The outputs of multiple subsequent runs of a workflow become the input to the next workflow. Chaining these workflows together is referred to as a data application pipeline.
Question 14. How Does Oozie Work?
- Oozie runs as a service in the cluster and clients submit workflow definitions for immediate or later processing. Oozie workflow consists of action nodes and control-flow nodes.
- An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig or Hive job, importing data using Sqoop or running a shell script or a program written in Java.
- A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic, wherein different branches may be followed depending on the result of an earlier action node. Start Node, End Node and Error Node fall under this category of nodes.
- Start Node, designates start of the workflow job.
- End Node, signals end of the job.
- Error Node, designates an occurrence of error and corresponding error message to be printed.
At the end of a workflow's execution, Oozie uses an HTTP callback to update the client with the workflow status. Entry to or exit from an action node may also trigger a callback.
Question 15. How To Deploy Application?
$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount
Question 16. Mention Workflow Job Parameters?
$ cat job.properties
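The command above only prints the file; a minimal job.properties for the wordcount deployment shown in this document might look like the following (port numbers and the queue name are assumptions):

```properties
# cluster endpoints, referenced as ${nameNode}/${jobTracker} in workflow.xml
nameNode=hdfs://bar.com:9000
jobTracker=bar.com:8032
# where the workflow application was deployed in HDFS
oozie.wf.application.path=${nameNode}/usr/abc/wordcount
# pick up shared action libraries from the Oozie sharelib
oozie.use.system.libpath=true
queueName=default
```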
Question 17. How To Execute Job?
$ oozie job -run -config job.properties
Question 18. What Are All The Actions Can Be Performed In Oozie?
- Email Action
- Hive Action
- Shell Action
- Ssh Action
- Sqoop Action
- Writing a custom Action Executor
Question 19. Why We Use Fork And Join Nodes Of Oozie?
- A fork node splits one path of execution into multiple concurrent paths of execution.
- A join node waits until every concurrent execution path of a previous fork node arrives to it.
- The fork and join nodes must be used in pairs. The join node assumes concurrent execution paths are children of the same fork node.
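A fork/join pair can be sketched as below, where two illustrative Sqoop imports run concurrently and the workflow continues only after both finish (action bodies elided):

```xml
<fork name="parallel-load">
    <path start="import-users"/>
    <path start="import-orders"/>
</fork>
<!-- both actions point their ok transition at the same join node -->
<action name="import-users">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4"> ... </sqoop>
    <ok to="joined"/>
    <error to="fail"/>
</action>
<action name="import-orders">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4"> ... </sqoop>
    <ok to="joined"/>
    <error to="fail"/>
</action>
<join name="joined" to="end"/>
```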
Question 20. Why Oozie Security?
- Users are not allowed to alter the jobs of another user.
- Hadoop does not support the authentication of end users.
- Oozie has to verify and confirm its user before transferring the job to Hadoop.