Jobtypes¶
Azkaban's job type plugin design gives developers the flexibility to create job executors that work with essentially any type of system, all managed and triggered by the core Azkaban workflow management.
Here we provide a common set of plugins that should be useful for most Hadoop-related use cases, as well as sample job packages. Most of these job types are used in LinkedIn’s production clusters, only with different configurations. We also give a simple guide on how to create new job types, either from scratch or by extending existing ones.
Command Job Type (built-in)¶
The command job type is one of the basic built-in types. It runs one or more UNIX commands using Java's ProcessBuilder. Upon execution, Azkaban spawns a process to run the command.
How To Use¶
One can run one or multiple commands within one command job. Here is what is needed:
Parameter | Description |
---|---|
command | The full command to run |
For multiple commands, use command.1, command.2, etc.
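As a rough illustration of how numbered command properties could be gathered into an ordered list, here is a minimal sketch. It assumes the commands run in the order command, command.1, command.2, and that numbering stops at the first missing index; both are assumptions rather than documented behavior, and CommandListSketch is a hypothetical name, not an Azkaban class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only: CommandListSketch is a hypothetical name,
// not a class from the Azkaban code base.
final class CommandListSketch {

  // Collects "command" first, then "command.1", "command.2", ...
  // and stops at the first number that is not present (an assumption).
  static List<String> commands(Properties jobProps) {
    List<String> result = new ArrayList<>();
    String first = jobProps.getProperty("command");
    if (first != null) {
      result.add(first);
    }
    for (int i = 1; jobProps.containsKey("command." + i); i++) {
      result.add(jobProps.getProperty("command." + i));
    }
    return result;
  }

  public static void main(String[] args) {
    // Properties as they might appear in a .job file of type "command".
    Properties job = new Properties();
    job.setProperty("type", "command");
    job.setProperty("command", "echo start");
    job.setProperty("command.1", "whoami");
    job.setProperty("command.2", "echo done");
    for (String command : commands(job)) {
      System.out.println(command);
    }
  }
}
```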
Sample Job Package¶
Here is a sample job package, just to show how it works:
Download command.zip (Uploaded May 13, 2013)
HadoopShell Job Type¶
In large part, this is the same as the Command type. The difference is its ability to talk to a Hadoop cluster securely, via Hadoop tokens.
The HadoopShell job type is one of the basic built-in types. It runs one or more UNIX commands using Java's ProcessBuilder. Upon execution, Azkaban spawns a process to run the command.
How To Use¶
The HadoopShell job type talks to a secure cluster via Hadoop tokens. The admin should specify obtain.binary.token=true if the Hadoop cluster security is turned on. Before executing a job, Azkaban will obtain name node and job tracker tokens for this job. These tokens will be written to a token file, to be picked up by the user job process during its execution. After the job finishes, Azkaban takes care of canceling these tokens from the name node and job tracker.
Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than configured token life.
One can run one or multiple commands within one command job. Here is what is needed:
Parameter | Description |
---|---|
command | The full command to run |
For multiple commands, use command.1, command.2, etc.
Here are some common configurations that make a hadoopShell job for a user:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. hadoopShell |
dependencies | The other jobs in the flow this job is dependent upon. |
user.to.proxy | The Hadoop user this job should run under. |
hadoop-inject.FOO | FOO is automatically added to the Configuration of any Hadoop job launched. |
Here is what is needed and normally configured by the admin:
Parameter | Description |
---|---|
hadoop.security.manager.class | The class that handles talking to Hadoop clusters. |
azkaban.should.proxy | Whether Azkaban should proxy as individual user Hadoop accounts. |
proxy.user | The Azkaban user configured with kerberos and Hadoop, for secure clusters. |
proxy.keytab.location | The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user |
obtain.binary.token | Whether Azkaban should request tokens. Set this to true for secure clusters. |
Java Job Type¶
The java job type was widely used in the original Azkaban as a built-in type. It is no longer a built-in type in Azkaban2. The javaprocess type is still built-in in Azkaban2. The main differences between the java and javaprocess job types are:
- javaprocess runs a user program that has a “main” method; java runs an Azkaban-provided main method, which invokes the user program's “run” method.
- Azkaban can do the setup, such as getting a Kerberos ticket or requesting Hadoop tokens, in the provided main in the java type, whereas in javaprocess the user is responsible for everything.
As a result, most users use the java type for running anything that talks to Hadoop clusters. That usage should now be replaced by the hadoopJava type, which is secure. The java type is still kept in the plugins for backwards compatibility.
How to Use¶
Azkaban spawns a local process for the java job type that runs user programs. It differs from the “javaprocess” job type in that Azkaban already provides a main method, called JavaJobRunnerMain. JavaJobRunnerMain looks for the run method, which can be specified by method.run (default is run). Users can also specify a cancel method in case they want to gracefully terminate the job in the middle of the run.
For the most part, using the java type should be no different from using hadoopJava.
Sample Job¶
Please refer to the hadoopJava type.
hadoopJava Type¶
In large part, this is the same as the java type. The difference is its ability to talk to a Hadoop cluster securely, via Hadoop tokens. Most Hadoop job types, such as Pig and Hive, can be created by running a hadoopJava job.
How To Use¶
The hadoopJava type runs a user Java program. Upon execution, it tries to construct an object that has the constructor signature constructor(String, Props) and runs its run method. If the user wants to cancel the job, it tries the user-defined cancel method before doing a hard kill on that process.
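The construct-then-run sequence described above can be pictured with plain Java reflection. This is only a sketch under stated assumptions: Props here is a hypothetical stand-in for Azkaban's own Props class, SampleJob is an invented user job, and the real launcher may differ in its details.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only. Props below is a hypothetical stand-in for
// Azkaban's own Props class, and ReflectiveJobSketch is not Azkaban code.
final class ReflectiveJobSketch {

  // Minimal stand-in for the job properties object passed to the constructor.
  static final class Props {
    private final Map<String, String> values = new HashMap<>();

    Props put(String key, String value) {
      values.put(key, value);
      return this;
    }

    String getString(String key, String defaultValue) {
      return values.getOrDefault(key, defaultValue);
    }
  }

  // Example user job: a (String, Props) constructor plus run and cancel methods.
  static final class SampleJob {
    final String id;
    boolean ran = false;
    boolean cancelled = false;

    public SampleJob(String id, Props props) {
      this.id = id;
    }

    public void run() {
      ran = true;
    }

    public void cancel() {
      cancelled = true;
    }
  }

  // Builds the job through its public (String, Props) constructor.
  static Object construct(Class<?> jobClass, String jobId, Props props) {
    try {
      Constructor<?> ctor = jobClass.getConstructor(String.class, Props.class);
      return ctor.newInstance(jobId, props);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot construct " + jobClass.getName(), e);
    }
  }

  // Invokes the public no-argument method named by the given property,
  // falling back to the default name when the property is not set.
  static void invoke(Object job, Props props, String property, String defaultName) {
    String methodName = props.getString(property, defaultName);
    try {
      Method method = job.getClass().getMethod(methodName);
      method.invoke(job);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot invoke " + methodName, e);
    }
  }

  public static void main(String[] args) {
    Props props = new Props().put("method.run", "run");
    Object job = construct(SampleJob.class, "sample-job", props);
    invoke(job, props, "method.run", "run");
    System.out.println("ran=" + ((SampleJob) job).ran);
  }
}
```

A cancel request would follow the same path with method.cancel and the default name cancel, before any hard kill of the process.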
The hadoopJava job type talks to a secure cluster via Hadoop tokens. The admin should specify obtain.binary.token=true if the Hadoop cluster security is turned on. Before executing a job, Azkaban will obtain name node and job tracker tokens for this job. These tokens will be written to a token file, to be picked up by the user job process during its execution. After the job finishes, Azkaban takes care of canceling these tokens from the name node and job tracker.
Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than configured token life.
If there are multiple job submissions inside the user program, the user should also take care not to have a single MR step cancel the tokens upon completion, thereby failing all other MR steps when they try to authenticate with Hadoop services.
In many cases, it is also necessary to add code like the following so that the user program picks up the Hadoop tokens in its “conf” or “jobconf”:
// Requires: import org.apache.hadoop.conf.Configuration;
// Suppose this is how one gets the conf
Configuration conf = new Configuration();
// Point the job at the token file, if one was provided via the environment
String tokenFile = System.getenv("HADOOP_TOKEN_FILE_LOCATION");
if (tokenFile != null) {
    conf.set("mapreduce.job.credentials.binary", tokenFile);
}
Here are some common configurations that make a hadoopJava job for a user:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. hadoopJava |
job.class | The fully qualified name of the user job class. |
classpath | The resources that should be on the execution classpath, accessible to the local filesystem. |
main.args | Main arguments passed to user program. |
dependencies | The other jobs in the flow this job is dependent upon. |
user.to.proxy | The Hadoop user this job should run under. |
method.run | The run method, defaults to run() |
method.cancel | The cancel method, defaults to cancel() |
getJobGeneratedProperties | The method user should implement if the output properties should be picked up and passed to the next job. |
jvm.args | The -D options for the new JVM process |
hadoop-inject.FOO | FOO is automatically added to the Configuration of any Hadoop job launched. |
Here is what is needed and normally configured by the admin:
Parameter | Description |
---|---|
hadoop.security.manager.class | The class that handles talking to Hadoop clusters. |
azkaban.should.proxy | Whether Azkaban should proxy as individual user Hadoop accounts. |
proxy.user | The Azkaban user configured with kerberos and Hadoop, for secure clusters. |
proxy.keytab.location | The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user |
hadoop.home | The Hadoop home where the jars and conf resources are installed. |
jobtype.classpath | The items that every such job should have on its classpath. |
jobtype.class | Should be set to azkaban.jobtype.HadoopJavaJob |
obtain.binary.token | Whether Azkaban should request tokens. Set this to true for secure clusters. |
Since Azkaban job types are named by their directory names, the admin should also make those names public and consistent.
Sample Job Package¶
Here is a sample job package that does a word count. It relies on a Pig job to first upload the text file onto HDFS. One can also manually upload a file and run the word count program alone. The source code is in azkaban-plugins/plugins/jobtype/src/azkaban/jobtype/examples/java/WordCount.java
Download java-wc.zip (Uploaded May 13, 2013)
Pig Type¶
The Pig type is for running Pig jobs. In the azkaban-plugins repo, we have included Pig types from pig-0.9.2 to pig-0.11.0. It is up to the admin to alias one of them as the pig type for Azkaban users.
The Pig type uses Hadoop tokens to talk to secure Hadoop clusters. Therefore, individual Azkaban Pig jobs are restricted to run within the token’s lifetime, which is set by Hadoop admins. It is also important that an individual MR step inside a single Pig script doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with the job tracker or name node.
Vanilla Pig types don’t provide all UDF jars. It is often up to the admin who sets up Azkaban to provide a pre-configured Pig job type with company-specific UDFs registered and name spaces imported, so that users don’t need to provide all the jars and do the configuration in their own Pig job conf files.
How to Use¶
The Pig job runs user Pig scripts. It is important to remember, however, that running any Pig script might require a number of dependency libraries that need to be placed on the local Azkaban job classpath, or be registered with Pig and carried remotely, or both. By using classpath settings, as well as pig.additional.jars and udf.import.list, the admin can create a Pig job type that has very different default behavior than the most basic “pig” type.
Pig jobs talk to a secure cluster via hadoop tokens. The admin should specify obtain.binary.token=true if the hadoop cluster security is turned on. Before executing a job, Azkaban will obtain name node and job tracker tokens for this job. These tokens will be written to a token file, which will be picked up by the user job process during its execution. For Hadoop 1 (HadoopSecurityManager_H_1_0), after the job finishes, Azkaban takes care of canceling these tokens from the name node and job tracker. In Hadoop 2 (HadoopSecurityManager_H_2_0), due to issues with tokens being canceled prematurely, Azkaban does not cancel the tokens.
Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than the configured token life. It is also important that an individual MR step inside a single Pig script doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with hadoop services. In Hadoop 2, you may need to set -Dmapreduce.job.complete.cancel.delegation.tokens=false to prevent tokens from being canceled prematurely.
Here are the common configurations that make a Pig job for a user:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. pig |
pig.script | The Pig script location, e.g. src/wordcountpig.pig |
classpath | The resources that should be on the execution classpath, accessible to the local filesystem. |
dependencies | The other jobs in the flow this job is dependent upon. |
user.to.proxy | The hadoop user this job should run under. |
pig.home | The Pig installation directory. Can be used to override the default set by Azkaban. |
param.SOME_PARAM | Equivalent to Pig’s -param |
use.user.pig.jar | If true, will use the user-provided Pig jar to launch the job. If false, the Pig jar provided by Azkaban will be used. Defaults to false. |
hadoop-inject.FOO | FOO is automatically added to the Configuration of any Hadoop job launched. |
Here is what is needed and normally configured by the admin:
Parameter | Description |
---|---|
hadoop.security.manager.class | The class that handles talking to hadoop clusters. |
azkaban.should.proxy | Whether Azkaban should proxy as individual user hadoop accounts. |
proxy.user | The Azkaban user configured with kerberos and hadoop, for secure clusters. |
proxy.keytab.location | The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user |
hadoop.home | The hadoop home where the jars and conf resources are installed. |
jobtype.classpath | The items that every such job should have on its classpath. |
jobtype.class | Should be set to azkaban.jobtype.HadoopJavaJob |
obtain.binary.token | Whether Azkaban should request tokens. Set this to true for secure clusters. |
Dumping MapReduce Counters: this is useful in the case where a Pig script uses UDFs, which may add a few custom MapReduce counters
Parameter | Description |
---|---|
pig.dump.hadoopCounter | Setting this parameter to true triggers the dumping of MapReduce counters for each MapReduce job generated by the Pig script. |
Since Pig jobs are essentially Java programs, the configurations for Java jobs could also be set.
Since Azkaban job types are named by their directory names, the admin should also make those names public and consistent. For example, while there are multiple versions of Pig job types, the admin can link one of them as pig for the default Pig type. Experimental Pig versions can be tested in parallel under a different name and promoted to the default Pig type once proven stable. At LinkedIn, we also provide Pig job types that have a number of useful UDF libraries, including datafu and LinkedIn-specific ones, pre-registered and imported, so that users in most cases only need Pig scripts in their Azkaban job packages.
Sample Job Package¶
Here is a sample job package that does word count. It assumes you have hadoop installed and gets some dependency jars from $HADOOP_HOME:
Download pig-wc.zip (Uploaded May 13, 2013)
Hive Type¶
The hive type is for running Hive jobs. In the azkaban-plugins repo, we have included a hive type based on hive-0.8.1. It should work for higher Hive versions as well. It is up to the admin to alias one of them as the hive type for Azkaban users.
The hive type uses Hadoop tokens to talk to secure Hadoop clusters. Therefore, individual Azkaban Hive jobs are restricted to run within the token’s lifetime, which is set by the Hadoop admin. It is also important that an individual MR step inside a single Hive script doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with the JobTracker or NameNode.
How to Use¶
The Hive job runs user Hive queries. The Hive job type talks to a secure cluster via Hadoop tokens. The admin should specify obtain.binary.token=true if the Hadoop cluster security is turned on. Before executing a job, Azkaban will obtain NameNode and JobTracker tokens for this job. These tokens will be written to a token file, which will be picked up by the user job process during its execution. After the job finishes, Azkaban takes care of canceling these tokens from the NameNode and JobTracker.
Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than the configured token life. It is also important that an individual MR step inside a single Hive script doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with Hadoop services.
Here are the common configurations that make a hive job for a single-line Hive query:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. hive |
azk.hive.action | use execute.query |
hive.query | Used for single line hive query. |
user.to.proxy | The hadoop user this job should run under. |
Specify these for a multi-line Hive query:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. hive |
azk.hive.action | use execute.query |
hive.query.01 | fill in the individual hive queries, starting from 01 |
user.to.proxy | The Hadoop user this job should run under. |
Specify these for query from a file:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. hive |
azk.hive.action | use execute.query |
hive.query.file | location of the query file |
user.to.proxy | The Hadoop user this job should run under. |
Here is what is needed and normally configured by the admin. The following properties go into private.properties:
Parameter | Description |
---|---|
hadoop.security.manager.class | The class that handles talking to hadoop clusters. |
azkaban.should.proxy | Whether Azkaban should proxy as individual user hadoop accounts. |
proxy.user | The Azkaban user configured with kerberos and hadoop, for secure clusters. |
proxy.keytab.location | The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user |
hadoop.home | The hadoop home where the jars and conf resources are installed. |
jobtype.classpath | The items that every such job should have on its classpath. |
jobtype.class | Should be set to azkaban.jobtype.HadoopJavaJob |
obtain.binary.token | Whether Azkaban should request tokens. Set this to true for secure clusters. |
hive.aux.jars.path | Where to find auxiliary library jars |
env.HADOOP_HOME | $HADOOP_HOME |
env.HIVE_HOME | $HIVE_HOME |
env.HIVE_AUX_JARS_PATH | ${hive.aux.jars.path} |
hive.home | $HIVE_HOME |
hive.classpath.items | Those that need to be on hive classpath, include the conf directory |
These go into plugin.properties:
Parameter | Description |
---|---|
job.class | azkaban.jobtype.hiveutils.azkaban.HiveViaAzkaban |
hive.aux.jars.path | Where to find auxiliary library jars |
env.HIVE_HOME | $HIVE_HOME |
env.HIVE_AUX_JARS_PATH | ${hive.aux.jars.path} |
hive.home | $HIVE_HOME |
hive.jvm.args | -Dhive.querylog.location=. -Dhive.exec.scratchdir=YOUR_HIVE_SCRATCH_DIR -Dhive.aux.jars.path=${hive.aux.jars.path} |
Since hive jobs are essentially java programs, the configurations for Java jobs could also be set.
Sample Job Package¶
Here is a sample job package. It assumes you have hadoop installed and gets some dependency jars from $HADOOP_HOME. It also assumes you have Hive installed and configured correctly, including setting up a MySQL instance for Hive Metastore.
Download hive.zip (Uploaded May 13, 2013)
New Hive Jobtype¶
We’ve added a new Hive jobtype whose jobtype class is azkaban.jobtype.HadoopHiveJob. The configurations have changed from the old Hive jobtype.
Here are the configurations that a user can set:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. hive |
hive.script | The relative path of your Hive script inside your Azkaban zip |
user.to.proxy | The hadoop user this job should run under. |
hiveconf.FOO | FOO is automatically added as a hiveconf variable. You can reference it in your script using ${hiveconf:FOO}. These variables also get added to the configuration of any launched Hadoop jobs. |
hivevar.FOO | FOO is automatically added as a hivevar variable. You can reference it in your script using ${hivevar:FOO}. These variables are NOT added to the configuration of launched Hadoop jobs. |
hadoop-inject.FOO | FOO is automatically added to the Configuration of any Hadoop job launched. |
Here is what is needed and normally configured by the admin. The following properties go into private.properties (or into ../commonprivate.properties):
Parameter | Description |
---|---|
hadoop.security.manager.class | The class that handles talking to hadoop clusters. |
azkaban.should.proxy | Whether Azkaban should proxy as individual user hadoop accounts. |
proxy.user | The Azkaban user configured with kerberos and hadoop, for secure clusters. |
proxy.keytab.location | The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user |
hadoop.home | The hadoop home where the jars and conf resources are installed. |
jobtype.classpath | The items that every such job should have on its classpath. |
jobtype.class | Should be set to azkaban.jobtype.HadoopHiveJob |
obtain.binary.token | Whether Azkaban should request tokens. Set this to true for secure clusters. |
obtain.hcat.token | Whether Azkaban should request HCatalog/Hive Metastore tokens. If true, the HadoopSecurityManager will acquire an HCatalog token. |
hive.aux.jars.path | Where to find auxiliary library jars |
hive.home | $HIVE_HOME |
These go into plugin.properties (or into ../common.properties):
Parameter | Description |
---|---|
hive.aux.jars.path | Where to find auxiliary library jars |
hive.home | $HIVE_HOME |
jobtype.jvm.args | -Dhive.querylog.location=. -Dhive.exec.scratchdir=YOUR_HIVE_SCRATCH_DIR -Dhive.aux.jars.path=${hive.aux.jars.path} |
Since hive jobs are essentially java programs, the configurations for Java jobs can also be set.
Common Configurations¶
This section lists the configurations that are common to all job types.
other_namenodes¶
This job property is useful for jobs that need to read data from or write data to more than one Hadoop NameNode. By default, Azkaban requests an HDFS_DELEGATION_TOKEN on behalf of the job for the cluster that Azkaban is configured to run on. When this property is present, Azkaban will try to request an HDFS_DELEGATION_TOKEN for each of the specified HDFS NameNodes.
The value of this property is a comma-separated list of NameNode URLs.
For example: other_namenodes=webhdfs://host1:50070,hdfs://host2:9000
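To make the value format concrete, here is a minimal sketch of splitting such a comma-separated list into individual NameNode URIs. OtherNameNodesSketch is a hypothetical helper written for illustration; it is not how Azkaban itself necessarily parses the property, and trimming whitespace around entries is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: OtherNameNodesSketch is a hypothetical helper,
// not part of Azkaban.
final class OtherNameNodesSketch {

  // Splits the property value on commas and turns each entry into a URI.
  // Blank entries are skipped and surrounding whitespace is trimmed (assumptions).
  static List<URI> parse(String propertyValue) {
    List<URI> nameNodes = new ArrayList<>();
    if (propertyValue == null) {
      return nameNodes;
    }
    for (String part : propertyValue.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        nameNodes.add(URI.create(trimmed));
      }
    }
    return nameNodes;
  }

  public static void main(String[] args) {
    for (URI nameNode : parse("webhdfs://host1:50070,hdfs://host2:9000")) {
      System.out.println(nameNode.getScheme() + " -> " + nameNode.getHost() + ":" + nameNode.getPort());
    }
  }
}
```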
HTTP Job Callback¶
The purpose of this feature is to allow Azkaban to notify external systems via an HTTP request when a job starts or completes. The new properties are in the following format:
- job.notification.<status>.<sequence number>.url
- job.notification.<status>.<sequence number>.method
- job.notification.<status>.<sequence number>.body
- job.notification.<status>.<sequence number>.headers
Supported values for status¶
- started: when a job is started
- success: when a job is completed successfully
- failure: when a job failed
- completed: when a job is either successfully completed or failed
Number of callback URLs¶
The maximum number of callback URLs per job is 3, so the <sequence number> can range from 1 to 3. If a gap is detected, only the ones before the gap are used.
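The gap rule can be sketched as a simple scan over sequence numbers. This is an illustration of the rule as stated, not Azkaban's actual implementation; CallbackSequenceSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only: CallbackSequenceSketch is a hypothetical name,
// not a class from the Azkaban code base.
final class CallbackSequenceSketch {

  // Maximum number of callback URLs per job, as stated in the text.
  static final int MAX_CALLBACKS = 3;

  // Returns the callback URLs for one status, scanning sequence numbers
  // 1..3 and stopping at the first missing number (the "gap").
  static List<String> urlsFor(Properties jobProps, String status) {
    List<String> urls = new ArrayList<>();
    for (int seq = 1; seq <= MAX_CALLBACKS; seq++) {
      String url = jobProps.getProperty("job.notification." + status + "." + seq + ".url");
      if (url == null) {
        break;
      }
      urls.add(url);
    }
    return urls;
  }

  public static void main(String[] args) {
    Properties job = new Properties();
    job.setProperty("job.notification.completed.1.url", "http://abc.com/first");
    // Sequence number 2 is missing, so number 3 is ignored.
    job.setProperty("job.notification.completed.3.url", "http://abc.com/third");
    System.out.println(urlsFor(job, "completed"));
  }
}
```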
HTTP Method¶
The supported methods are GET and POST. The default method is GET.
Headers¶
Each job callback URL can optionally specify headers in the following format:
job.notification.<status>.<sequence number>.headers=<name>:<value>\r\n<name>:<value>
The delimiter between headers is ‘\r\n’ and the delimiter between a header name and its value is ‘:’.
The headers are applicable for both GET and POST job callback URLs.
Job Context Information¶
It is often desirable to include some dynamic context information about the job in the URL or POST request body, such as status, job name, flow name, execution id and project name. If the URL or POST request body contains any of the following tokens, Azkaban replaces them with the actual values before making the HTTP callback.
- ?{server} - Azkaban host name and port
- ?{project}
- ?{flow}
- ?{executionId}
- ?{job}
- ?{status} - possible values are started, failed, succeeded
The value of these tokens will be HTTP encoded if they are on the URL, but will not be encoded when they are in the HTTP body.
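A minimal sketch of this token replacement is shown below. It assumes that “HTTP encoded” means URL (percent) encoding, and CallbackTokenSketch is a hypothetical helper, not Azkaban's implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: CallbackTokenSketch is a hypothetical helper,
// not part of Azkaban.
final class CallbackTokenSketch {

  // Replaces every ?{name} token in the template with its context value.
  // Values are URL-encoded for URLs and left untouched for request bodies.
  static String expand(String template, Map<String, String> context, boolean urlEncode) {
    String result = template;
    for (Map.Entry<String, String> entry : context.entrySet()) {
      String token = "?{" + entry.getKey() + "}";
      String value = urlEncode ? encode(entry.getValue()) : entry.getValue();
      // String.replace treats the token as a literal, not as a regex.
      result = result.replace(token, value);
    }
    return result;
  }

  // Assumption: "HTTP encoded" is taken to mean URL encoding with UTF-8.
  static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always available", e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> context = new LinkedHashMap<>();
    context.put("job", "word count");
    context.put("status", "started");
    System.out.println(expand("http://abc.com/api/v2/message?job=?{job}&status=?{status}", context, true));
    System.out.println(expand("{\"job\":\"?{job}\",\"status\":\"?{status}\"}", context, false));
  }
}
```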
Examples¶
GET HTTP Method
- job.notification.started.1.url=http://abc.com/api/v2/message?text=wow!!&job=?{job}&status=?{status}
- job.notification.completed.1.url=http://abc.com/api/v2/message?text=wow!!&job=?{job}&status=?{status}
- job.notification.completed.2.url=http://abc.com/api/v2/message?text=yeah!!
POST HTTP Method
- job.notification.started.1.url=http://abc.com/api/v1/resource
- job.notification.started.1.method=POST
- job.notification.started.1.body={"type":"workflow", "source":"Azkaban", "content":"?{server}:?{project}:?{flow}:?{executionId}:?{job}:?{status}"}
- job.notification.started.1.headers=Content-type:application/json
VoldemortBuildandPush Type¶
Pushing data from Hadoop to a Voldemort store used to be done entirely in Java. This created lots of problems, mostly due to users having to keep track of jars and dependencies and keep them up to date. We created the VoldemortBuildandPush job type to address this problem. Jars and dependencies are now managed by admins; absolutely no jars or Java code are required from users.
How to Use¶
This is essentially a hadoopJava job, with all jars controlled by the admins. Users only need to provide a .job file for the job and specify all the parameters. The following need to be specified:
Parameter | Description |
---|---|
type | The type name as set by the admin, e.g. VoldemortBuildandPush |
push.store.name | The voldemort push store name |
push.store.owners | The push store owners |
push.store.description | Push store description |
build.input.path | Build input path on hdfs |
build.output.dir | Build output path on hdfs |
build.replication.factor | replication factor number |
user.to.proxy | The hadoop user this job should run under. |
build.type.avro | if build and push avro data, true, otherwise, false |
avro.key.field | if using Avro data, key field |
avro.value.field | if using Avro data, value field |
Here is what is needed and normally configured by the admin (always put common properties in commonprivate.properties and common.properties for all job types).
These go into private.properties:
Parameter | Description |
---|---|
hadoop.security.manager.class | The class that handles talking to hadoop clusters. |
azkaban.should.proxy | Whether Azkaban should proxy as individual user hadoop accounts. |
proxy.user | The Azkaban user configured with kerberos and hadoop, for secure clusters. |
proxy.keytab.location | The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user |
hadoop.home | The hadoop home where the jars and conf resources are installed. |
jobtype.classpath | The items that every such job should have on its classpath. |
jobtype.class | Should be set to azkaban.jobtype.HadoopJavaJob |
obtain.binary.token | Whether Azkaban should request tokens. Set this to true for secure clusters. |
azkaban.no.user.classpath | Set to true such that Azkaban doesn’t pick up user supplied jars. |
These go into plugin.properties:
Parameter | Description |
---|---|
job.class | voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob |
voldemort.fetcher.protocol | webhdfs |
hdfs.default.classpath.dir | HDFS location for distributed cache |
hdfs.default.classpath.dir.enable | set to true if using distributed cache to ship dependency jars |
For more information¶
Please refer to the Voldemort project site for more info.
Create Your Own Jobtypes¶
With the plugin design of Azkaban job types, it is possible to extend Azkaban for various system environments. You should be able to execute any job under the same Azkaban workflow management and scheduling.
Creating new job types is often very easy. Here are several ways to do it:
New Types with only Configuration Changes¶
One doesn’t always need to write Java code to create job types for end users. Often, configuration changes to existing job types create significantly different behavior for end users. For example, at LinkedIn, apart from the pig types, we also have pigLi types that come with all the useful library jars pre-registered and imported. This way, normal users only need to provide their Pig scripts and their own UDF jars to Azkaban. The Pig job should run as if it were run on the gateway machine from Pig grunt. In comparison, if users are required to use the basic pig job types, they need to package all the necessary jars in the Azkaban job package and do all the registering and importing themselves, which often poses a learning curve for new Pig/Azkaban users.
The same practice applies to most other job types. Admins should create or tailor job types to their specific company needs or clusters.
New Types Using Existing Job Types¶
If one needs to create a different job type, a good starting point is to see whether it can be done by using an existing job type. In Hadoop land, this most often means the hadoopJava type. Essentially all Hadoop jobs, from the most basic MapReduce job to Pig, Hive, Crunch, etc., are Java programs that submit jobs to Hadoop clusters. It is usually straightforward to create a job type that takes user input and runs a hadoopJava job.
For example, one can take a look at the VoldemortBuildandPush job type. It takes in user input such as which cluster to push to, the Voldemort store name, etc., and runs a hadoopJava job that does the work. For end users, though, this is a VoldemortBuildandPush job type with which they only need to fill out the .job file to push data from Hadoop to Voldemort stores.
The same applies to the hive type.
New Types by Extending Existing Ones¶
For the most flexibility, one can always build new types by extending the existing ones. Azkaban uses reflection to load job types that implement the job interface, and tries to construct a sample object upon loading for basic testing. When executing a real job, Azkaban calls the run method to run the job, and the cancel method to cancel it.
For new Hadoop job types, it is important to use the correct hadoopsecuritymanager class, which is also included in the azkaban-plugins repo. This class handles talking to the Hadoop cluster and, if needed, requests tokens for job execution or for name node communication.
For better security, tokens should be requested in the Azkaban main process and written to a file. Before executing user code, the job type should implement a wrapper that picks up the token file and sets it in the Configuration or JobConf object. Please refer to HadoopJavaJob and HadoopPigJob to see example usage.
System Statistics¶
The Azkaban server maintains certain system statistics, which can be seen at http://<host>:<port>/stats
To enable this feature, add the property “executor.metric.reports=true” to azkaban.properties.
The property “executor.metric.milisecinterval.default” controls the interval at which the metrics are collected.
Statistic Types¶
Metric Name | Description |
---|---|
NumFailedFlowMetric | Number of failed flows |
NumRunningFlowMetric | Number of running flows |
NumQueuedFlowMetric | Number of flows in the queue |
NumRunningJobMetric | Number of running jobs |
NumFailedJobMetric | Number of failed jobs |
To change the statistic collection at run time, the following options are available:
- To change the time interval at which a specific type of statistic is collected - /stats?action=changeMetricInterval&metricName=NumRunningJobMetric&interval=60000
- To change the duration for which the statistics are maintained - /stats?action=changeCleaningInterval&interval=604800000
- To change the number of data points to display - /stats?action=changeEmitterPoints&numInstances=50
- To enable the statistic collection - /stats?action=enableMetrics
- To disable the statistic collection - /stats?action=disableMetrics
Reload Jobtypes¶
When you want to make changes to your jobtype configurations or add/remove jobtypes, you can do so without restarting the executor server. You can reload all jobtype plugins as follows:
curl http://localhost:EXEC_SERVER_PORT/executor?action=reloadJobTypePlugins
Examples¶
The project az-hadoop-jobtype-plugin provides examples that show how to use some of the included jobtypes. Below you can find how to setup a solo-server to run some of them.
java-wc¶
This example uses the pig-0.12.0 job type to upload an input file to HDFS and the hadoopJava job type to count the number of occurrences of each word.
We need to install Hadoop and Pig on the solo-server by expanding the tar into /export/apps/hadoop/latest and /export/apps/pig/latest respectively. Then set the HADOOP_HOME and PIG_HOME variables with their paths:
export HADOOP_HOME=/export/apps/hadoop/latest
export PIG_HOME=/export/apps/pig/latest
If you prefer to install Hadoop and Pig under a different path, please update common.properties and commonprivate.properties under ./az-hadoop-jobtype-plugin/src/jobtypes/ to match it.
Follow the Hadoop Single Cluster instructions to run HDFS on a single cluster. You will need to modify etc/hadoop/core-site.xml and run sbin/start-dfs.sh.
Build the source code and copy the pig-0.12.0 and hadoopJava job type directories, along with common.properties and commonprivate.properties, from ./az-hadoop-jobtype-plugin/src/jobtypes to ./azkaban-solo-server/build/install/azkaban-solo-server/plugins/jobtypes. Then start the solo server by running:
./azkaban-solo-server/build/install/azkaban-solo-server/bin/start-solo.sh
Create a zip file with the contents under ./az-hadoop-jobtype-plugin/src/examples/java-wc
Open Azkaban at http://localhost:8081 and enter the credentials found under /azkaban-solo-server/conf/azkaban-users.xml.
Select Create Project, enter your project details and click Upload. Then select the zip created in the step above. Start the job by clicking on Execute Flow.