Jobtypes¶

Azkaban’s job type plugin design gives developers the flexibility to create job executors that work with essentially any type of system, all managed and triggered by the core Azkaban workflow management.

Here we provide a common set of plugins that should be useful for most Hadoop-related use cases, as well as sample job packages. Most of these job types are used in LinkedIn’s production clusters, only with different configurations. We also give a simple guide on how to create new job types, either from scratch or by extending existing ones.

Command Job Type (built-in)¶

The command job type is one of the basic built-in types. It runs one or more UNIX commands using Java’s ProcessBuilder. Upon execution, Azkaban spawns a process to run the command.

How To Use¶

One can run one or multiple commands within one command job. Here is what is needed:

Parameter	Description
command	The full command to run

To run multiple commands, add numbered properties: command.1, command.2, and so on.
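The numbering scheme can be illustrated with a minimal Java sketch. It is not Azkaban's own code: the class name CommandListSketch is hypothetical, java.util.Properties stands in for a parsed .job file, and stopping at the first missing index is an assumption about how numbered commands are collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only; this is not Azkaban's own code.
final class CommandListSketch {

    // Collects "command", then "command.1", "command.2", ... and stops at
    // the first missing index (assumed behaviour, not confirmed by the docs).
    static List<String> commandsFrom(Properties job) {
        List<String> commands = new ArrayList<>();
        String first = job.getProperty("command");
        if (first != null) {
            commands.add(first);
        }
        for (int i = 1; job.getProperty("command." + i) != null; i++) {
            commands.add(job.getProperty("command." + i));
        }
        return commands;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a .job file.
        Properties job = new Properties();
        job.setProperty("type", "command");
        job.setProperty("command", "echo first");
        job.setProperty("command.1", "echo second");
        job.setProperty("command.2", "echo third");
        // Prints the commands in the order they were collected.
        commandsFrom(job).forEach(System.out::println);
    }
}
```

Under this assumption, keeping the indices contiguous avoids silently dropping a command.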

Sample Job Package¶

Here is a sample job package, just to show how it works:

Download command.zip (Uploaded May 13, 2013)

HadoopShell Job Type¶

In large part, this is the same as the Command type. The difference is its ability to talk to a Hadoop cluster securely, via Hadoop tokens.

The HadoopShell job type is one of the basic built-in types. It runs one or more UNIX commands using Java’s ProcessBuilder. Upon execution, Azkaban spawns a process to run the command.

How To Use¶

The HadoopShell job type talks to a secure cluster via Hadoop tokens. The admin should specify obtain.binary.token=true if the Hadoop cluster security is turned on. Before executing a job, Azkaban will obtain name node token and job tracker tokens for this job. These tokens will be written to a token file, to be picked up by user job process during its execution. After the job finishes, Azkaban takes care of canceling these tokens from name node and job tracker.

Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than configured token life.

One can run one or multiple commands within one HadoopShell job. Here is what is needed:

Parameter	Description
command	The full command to run

To run multiple commands, add numbered properties: command.1, command.2, and so on.

Here are some common configurations that make a hadoopShell job for a user:

Parameter	Description
type	The type name as set by the admin, e.g. `hadoopShell`
dependencies	The other jobs in the flow this job is dependent upon.
user.to.proxy	The Hadoop user this job should run under.
hadoop-inject.FOO	FOO is automatically added to the Configuration of any Hadoop job launched.

Here is what is needed and normally configured by the admin:

Parameter	Description
hadoop.security.manager.class	The class that handles talking to Hadoop clusters.
azkaban.should.proxy	Whether Azkaban should proxy as individual user Hadoop accounts.
proxy.user	The Azkaban user configured with kerberos and Hadoop, for secure clusters.
proxy.keytab.location	The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user
obtain.binary.token	Whether Azkaban should request tokens. Set this to true for secure clusters.

Java Job Type¶

The java job type was widely used in the original Azkaban as a built-in type. It is no longer a built-in type in Azkaban2, although javaprocess still is. The main differences between the java and javaprocess job types are:

javaprocess runs a user program that has a “main” method, whereas java runs an Azkaban-provided main method that invokes the user program’s “run” method.
In the java type, Azkaban can do the setup, such as getting a Kerberos ticket or requesting Hadoop tokens, in the provided main method, whereas in javaprocess the user is responsible for everything.

As a result, most users use java type for running anything that talks to Hadoop clusters. That usage should be replaced by hadoopJava type now, which is secure. But we still keep java type in the plugins for backwards compatibility.

How to Use¶

Azkaban spawns a local process for the java job type that runs user programs. It is different from the “javaprocess” job type in that Azkaban already provides a main method, called JavaJobRunnerMain. Inside JavaJobRunnerMain, it looks for the run method which can be specified by method.run (default is run). Users can also specify a cancel method in the case the user wants to gracefully terminate the job in the middle of the run.

For the most part, using java type should be no different from hadoopJava.

Sample Job¶

Please refer to the hadoopJava type.

hadoopJava Type¶

In large part, this is the same as the java type. The difference is its ability to talk to a Hadoop cluster securely, via Hadoop tokens. Most Hadoop job types can be created by running a hadoopJava job, such as Pig, Hive, etc.

How To Use¶

The hadoopJava type ultimately runs a user Java program. Upon execution, it tries to construct an object that has the constructor signature of constructor(String, Props) and runs its run method. If the user wants to cancel the job, it tries the user-defined cancel method before doing a hard kill on that process.
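The construct-then-run sequence can be sketched with plain Java reflection. This is a minimal illustration, not Azkaban's runner: ReflectiveRunnerSketch and WordCountJob are hypothetical names, and java.util.Properties stands in for Azkaban's Props class so the sketch compiles without Azkaban on the classpath.

```java
import java.lang.reflect.Method;
import java.util.Properties;

// Illustrative sketch only; this is not Azkaban's runner.
final class ReflectiveRunnerSketch {

    // Builds the job through its (String, Properties) constructor, then
    // invokes the method named by method.run, falling back to "run".
    static Object runJob(String className, String jobId, Properties props) {
        try {
            Class<?> jobClass = Class.forName(className);
            Object job = jobClass.getConstructor(String.class, Properties.class)
                    .newInstance(jobId, props);
            Method run = jobClass.getMethod(props.getProperty("method.run", "run"));
            run.invoke(job);
            return job;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not run " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("input.path", "/tmp/input.txt");
        runJob(WordCountJob.class.getName(), "wordcount", props);
    }
}

// Hypothetical user job. Properties stands in for Azkaban's Props class.
class WordCountJob {
    private final String jobId;
    private final Properties props;
    boolean ran = false;

    public WordCountJob(String jobId, Properties props) {
        this.jobId = jobId;
        this.props = props;
    }

    // Invoked reflectively by the runner.
    public void run() {
        ran = true;
        System.out.println("Running " + jobId + " on " + props.getProperty("input.path"));
    }

    // Gives the job a chance to stop gracefully before a hard kill.
    public void cancel() {
        System.out.println("Cancelling " + jobId);
    }
}
```

A real job class would take Azkaban's Props type in its constructor instead of Properties.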

The hadoopJava job type talks to a secure cluster via Hadoop tokens. The admin should specify obtain.binary.token=true if the Hadoop cluster security is turned on. Before executing a job, Azkaban will obtain name node token and job tracker tokens for this job. These tokens will be written to a token file, to be picked up by user job process during its execution. After the job finishes, Azkaban takes care of canceling these tokens from name node and job tracker.

Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than configured token life.

If there are multiple job submissions inside the user program, the user should also take care not to have a single MR step cancel the tokens upon completion, thereby failing all other MR steps when they try to authenticate with Hadoop services.

In many cases, it is also necessary to add code like the following so that the user program picks up the Hadoop tokens in its “conf” or “jobconf”:

import org.apache.hadoop.conf.Configuration;

// Suppose this is how one gets the conf
Configuration conf = new Configuration();

// Point the job at the token file written by Azkaban, if there is one
String tokenFile = System.getenv("HADOOP_TOKEN_FILE_LOCATION");
if (tokenFile != null) {
    conf.set("mapreduce.job.credentials.binary", tokenFile);
}

Here are some common configurations that make a hadoopJava job for a user:

Parameter	Description
type	The type name as set by the admin, e.g. `hadoopJava`
job.class	The fully qualified name of the user job class.
classpath	The resources that should be on the execution classpath, accessible to the local filesystem.
main.args	Main arguments passed to user program.
dependencies	The other jobs in the flow this job is dependent upon.
user.to.proxy	The Hadoop user this job should run under.
method.run	The run method, defaults to run()
method.cancel	The cancel method, defaults to cancel()
getJobGeneratedProperties	The method user should implement if the output properties should be picked up and passed to the next job.
jvm.args	The `-D` for the new jvm process
hadoop-inject.FOO	FOO is automatically added to the Configuration of any Hadoop job launched.

Here is what is needed and normally configured by the admin:

Parameter	Description
hadoop.security.manager.class	The class that handles talking to Hadoop clusters.
azkaban.should.proxy	Whether Azkaban should proxy as individual user Hadoop accounts.
proxy.user	The Azkaban user configured with kerberos and Hadoop, for secure clusters.
proxy.keytab.location	The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user
hadoop.home	The Hadoop home where the jars and conf resources are installed.
jobtype.classpath	The items that every such job should have on its classpath.
jobtype.class	Should be set to `azkaban.jobtype.HadoopJavaJob`
obtain.binary.token	Whether Azkaban should request tokens. Set this to true for secure clusters.

Since Azkaban job types are named by their directory names, the admin should also make those names public and consistent.

Sample Job Package¶

Here is a sample job package that does a word count. It relies on a Pig job to first upload the text file onto HDFS. One can also manually upload a file and run the word count program alone. The source code is in azkaban-plugins/plugins/jobtype/src/azkaban/jobtype/examples/java/WordCount.java

Download java-wc.zip (Uploaded May 13, 2013)

Pig Type¶

Pig type is for running Pig jobs. In the azkaban-plugins repo, we have included Pig types from pig-0.9.2 to pig-0.11.0. It is up to the admin to alias one of them as the pig type for Azkaban users.

The Pig type uses Hadoop tokens to talk to secure Hadoop clusters. Therefore, individual Azkaban Pig jobs are restricted to run within the token’s lifetime, which is set by Hadoop admins. It is also important that an individual MR step inside a single Pig script doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with the job tracker or name node.

Vanilla Pig types don’t provide all udf jars. It is often up to the admin who sets up Azkaban to provide a pre-configured Pig job type with company specific udfs registered and name space imported, so that the users don’t need to provide all the jars and do the configurations in their specific Pig job conf files.

How to Use¶

The Pig job runs user Pig scripts. It is important to remember, however, that running any Pig script might require a number of dependency libraries that need to be placed on the local Azkaban job classpath, or be registered with Pig and carried remotely, or both. By using classpath settings, as well as pig.additional.jars and udf.import.list, the admin can create a Pig job type that has very different default behavior than the most basic “pig” type.

Pig jobs talk to a secure cluster via Hadoop tokens. The admin should specify obtain.binary.token=true if the Hadoop cluster security is turned on. Before executing a job, Azkaban will obtain name node and job tracker tokens for this job. These tokens will be written to a token file, which will be picked up by the user job process during its execution. For Hadoop 1 (HadoopSecurityManager_H_1_0), after the job finishes, Azkaban takes care of canceling these tokens from the name node and job tracker. In Hadoop 2 (HadoopSecurityManager_H_2_0), due to issues with tokens being canceled prematurely, Azkaban does not cancel the tokens.

Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than configured token life. It is also important that individual MR step inside a single Pig script doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with hadoop services. In Hadoop 2, you may need to set -Dmapreduce.job.complete.cancel.delegation.tokens=false to prevent tokens from being canceled prematurely.

Here are the common configurations that make a Pig job for a user:

Parameter	Description
type	The type name as set by the admin, e.g. `pig`
pig.script	The Pig script location. e.g. `src/wordcountpig.pig`
classpath	The resources that should be on the execution classpath, accessible to the local filesystem.
dependencies	The other jobs in the flow this job is dependent upon.
user.to.proxy	The hadoop user this job should run under.
pig.home	The Pig installation directory. Can be used to override the default set by Azkaban.
param.SOME_PARAM	Equivalent to Pig’s `-param`
use.user.pig.jar	If true, will use the user-provided Pig jar to launch the job. If false, the Pig jar provided by Azkaban will be used. Defaults to false.
hadoop-inject.FOO	FOO is automatically added to the Configuration of any Hadoop job launched.

Here is what is needed and normally configured by the admin:

Parameter	Description
hadoop.security.manager.class	The class that handles talking to hadoop clusters.
azkaban.should.proxy	Whether Azkaban should proxy as individual user hadoop accounts.
proxy.user	The Azkaban user configured with kerberos and hadoop, for secure clusters.
proxy.keytab.location	The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user
hadoop.home	The hadoop home where the jars and conf resources are installed.
jobtype.classpath	The items that every such job should have on its classpath.
jobtype.class	Should be set to `azkaban.jobtype.HadoopPigJob`
obtain.binary.token	Whether Azkaban should request tokens. Set this to true for secure clusters.

Dumping MapReduce Counters: this is useful in the case where a Pig script uses UDFs, which may add a few custom MapReduce counters

Parameter	Description
pig.dump.hadoopCounter	Setting this parameter to true triggers the dumping of MapReduce counters for each MapReduce job generated by the Pig script.

Since Pig jobs are essentially Java programs, the configurations for Java jobs could also be set.

Since Azkaban job types are named by their directory names, the admin should also make those names public and consistent. For example, while there are multiple versions of Pig job types, the admin can link one of them as pig for the default Pig type. Experimental Pig versions can be tested in parallel under a different name and promoted to the default Pig type once proven stable. At LinkedIn, we also provide Pig job types that have a number of useful udf libraries, including datafu and LinkedIn-specific ones, pre-registered and imported, so that users in most cases will only need Pig scripts in their Azkaban job packages.

Sample Job Package¶

Here is a sample job package that does word count. It assumes you have hadoop installed and gets some dependency jars from $HADOOP_HOME:

Download pig-wc.zip (Uploaded May 13, 2013)

Hive Type¶

The hive type is for running Hive jobs. In the azkaban-plugins repo, we have included a hive type based on hive-0.8.1. It should work with later Hive versions as well. It is up to the admin to alias one of them as the hive type for Azkaban users.

The hive type uses Hadoop tokens to talk to secure Hadoop clusters. Therefore, individual Azkaban Hive jobs are restricted to run within the token’s lifetime, which is set by the Hadoop admin. It is also important that an individual MR step inside a single Hive query doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with the JobTracker or NameNode.

How to Use¶

The Hive job runs user Hive queries. The Hive job type talks to a secure cluster via Hadoop tokens. The admin should specify obtain.binary.token=true if the Hadoop cluster security is turned on. Before executing a job, Azkaban will obtain NameNode and JobTracker tokens for this job. These tokens will be written to a token file, which will be picked up by user job process during its execution. After the job finishes, Azkaban takes care of canceling these tokens from NameNode and JobTracker.

Since Azkaban only obtains the tokens at the beginning of the job run, and does not request new tokens or renew old tokens during the execution, it is important that the job does not run longer than the configured token lifetime. It is also important that an individual MR step inside a single Hive query doesn’t cancel the tokens upon its completion. Otherwise, all following steps will fail on authentication with Hadoop services.

Here are the common configurations that make a hive job for a single-line Hive query:

Parameter	Description
type	The type name as set by the admin, e.g. `hive`
azk.hive.action	use `execute.query`
hive.query	Used for single line hive query.
user.to.proxy	The hadoop user this job should run under.

Specify these for a multi-line Hive query:

Parameter	Description
type	The type name as set by the admin, e.g. `hive`
azk.hive.action	use `execute.query`
hive.query.01	fill in the individual hive queries, starting from 01
user.to.proxy	The Hadoop user this job should run under.

Specify these for query from a file:

Parameter	Description
type	The type name as set by the admin, e.g. `hive`
azk.hive.action	use `execute.query`
hive.query.file	location of the query file
user.to.proxy	The Hadoop user this job should run under.

Here is what is needed and normally configured by the admin. The following properties go into private.properties:

Parameter	Description
hadoop.security.manager.class	The class that handles talking to hadoop clusters.
azkaban.should.proxy	Whether Azkaban should proxy as individual user hadoop accounts.
proxy.user	The Azkaban user configured with kerberos and hadoop, for secure clusters.
proxy.keytab.location	The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user
hadoop.home	The hadoop home where the jars and conf resources are installed.
jobtype.classpath	The items that every such job should have on its classpath.
jobtype.class	Should be set to `azkaban.jobtype.HadoopJavaJob`
obtain.binary.token	Whether Azkaban should request tokens. Set this to true for secure clusters.
hive.aux.jars.path	Where to find auxiliary library jars
env.HADOOP_HOME	`$HADOOP_HOME`
env.HIVE_HOME	`$HIVE_HOME`
env.HIVE_AUX_JARS_PATH	`${hive.aux.jars.path}`
hive.home	`$HIVE_HOME`
hive.classpath.items	Those that need to be on hive classpath, include the conf directory

These go into plugin.properties

Parameter	Description
job.class	`azkaban.jobtype.hiveutils.azkaban.HiveViaAzkaban`
hive.aux.jars.path	Where to find auxiliary library jars
env.HIVE_HOME	`$HIVE_HOME`
env.HIVE_AUX_JARS_PATH	`${hive.aux.jars.path}`
hive.home	`$HIVE_HOME`
hive.jvm.args	`-Dhive.querylog.location=.` `-Dhive.exec.scratchdir=YOUR_HIVE_SCRATCH_DIR` `-Dhive.aux.jars.path=${hive.aux.jars.path}`

Since hive jobs are essentially java programs, the configurations for Java jobs could also be set.

Sample Job Package

Here is a sample job package. It assumes you have hadoop installed and gets some dependency jars from $HADOOP_HOME. It also assumes you have Hive installed and configured correctly, including setting up a MySQL instance for Hive Metastore.

Download hive.zip (Uploaded May 13, 2013)

New Hive Jobtype

We’ve added a new Hive jobtype whose jobtype class is azkaban.jobtype.HadoopHiveJob. The configurations have changed from the old Hive jobtype.

Here are the configurations that a user can set:

Parameter	Description
type	The type name as set by the admin, e.g. `hive`
hive.script	The relative path of your Hive script inside your Azkaban zip
user.to.proxy	The hadoop user this job should run under.
hiveconf.FOO	FOO is automatically added as a hiveconf variable. You can reference it in your script using ${hiveconf:FOO}. These variables also get added to the configuration of any launched Hadoop jobs.
hivevar.FOO	FOO is automatically added as a hivevar variable. You can reference it in your script using ${hivevar:FOO}. These variables are NOT added to the configuration of launched Hadoop jobs.
hadoop-inject.FOO	FOO is automatically added to the Configuration of any Hadoop job launched.

Here is what is needed and normally configured by the admin. The following properties go into private.properties (or into ../commonprivate.properties):

Parameter	Description
hadoop.security.manager.class	The class that handles talking to hadoop clusters.
azkaban.should.proxy	Whether Azkaban should proxy as individual user hadoop accounts.
proxy.user	The Azkaban user configured with kerberos and hadoop, for secure clusters.
proxy.keytab.location	The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user
hadoop.home	The hadoop home where the jars and conf resources are installed.
jobtype.classpath	The items that every such job should have on its classpath.
jobtype.class	Should be set to `azkaban.jobtype.HadoopHiveJob`
obtain.binary.token	Whether Azkaban should request tokens. Set this to true for secure clusters.
obtain.hcat.token	Whether Azkaban should request HCatalog/Hive Metastore tokens. If true, the HadoopSecurityManager will acquire an HCatalog token.
hive.aux.jars.path	Where to find auxiliary library jars
hive.home	`$HIVE_HOME`

These go into plugin.properties (or into ../common.properties):

Parameter	Description
hive.aux.jars.path	Where to find auxiliary library jars
hive.home	`$HIVE_HOME`
jobtype.jvm.args	`-Dhive.querylog.location=.` `-Dhive.exec.scratchdir=YOUR_HIVE_SCRATCH_DIR` `-Dhive.aux.jars.path=${hive.aux.jars.path}`

Since hive jobs are essentially java programs, the configurations for Java jobs can also be set.

Common Configurations¶

This section lists the configurations that are common to all job types.

other_namenodes¶

This job property is useful for jobs that need to read data from or write data to more than one Hadoop NameNode. By default, Azkaban requests an HDFS_DELEGATION_TOKEN on behalf of the job for the cluster that Azkaban is configured to run on. When this property is present, Azkaban will also try to request an HDFS_DELEGATION_TOKEN for each of the specified HDFS NameNodes.

The value of this property is a comma-separated list of NameNode URLs.

For example: other_namenodes=webhdfs://host1:50070,hdfs://host2:9000
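As a rough illustration of the value format, the following Java sketch splits such a list into URIs. The class name OtherNameNodesSketch is hypothetical, and trimming whitespace and skipping empty entries are assumptions rather than documented Azkaban behavior.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this is not Azkaban's own parsing code.
final class OtherNameNodesSketch {

    // Splits a comma-separated other_namenodes value into URIs,
    // trimming whitespace and skipping empty entries (assumed behaviour).
    static List<URI> parse(String value) {
        List<URI> nameNodes = new ArrayList<>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                nameNodes.add(URI.create(trimmed));
            }
        }
        return nameNodes;
    }

    public static void main(String[] args) {
        for (URI nameNode : parse("webhdfs://host1:50070,hdfs://host2:9000")) {
            System.out.println(nameNode.getScheme() + " -> "
                    + nameNode.getHost() + ":" + nameNode.getPort());
        }
    }
}
```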

HTTP Job Callback¶

The purpose of this feature is to allow Azkaban to notify external systems via an HTTP request when a job starts or completes. The new properties are in the following format:

job.notification.<status>.<sequence number>.url
job.notification.<status>.<sequence number>.method
job.notification.<status>.<sequence number>.body
job.notification.<status>.<sequence number>.headers

Supported values for status¶

started: when a job is started
success: when a job is completed successfully
failure: when a job failed
completed: when a job is either successfully completed or failed

Number of callback URLs¶

The maximum number of callback URLs per job is 3, so the <sequence number> can range from 1 to 3. If a gap is detected, only the URLs before the gap are used.
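The gap rule can be sketched in a few lines of Java. This is an illustration under stated assumptions, not Azkaban's code: CallbackUrlSketch is a hypothetical name, java.util.Properties stands in for the job properties, and the limit of 3 is applied here to each status separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative sketch only; this is not Azkaban's own code.
final class CallbackUrlSketch {
    static final int MAX_CALLBACKS = 3;

    // Reads job.notification.<status>.<n>.url for n = 1..3 and stops at the
    // first missing sequence number, so entries after a gap are ignored.
    static List<String> urlsFor(Properties job, String status) {
        List<String> urls = new ArrayList<>();
        for (int seq = 1; seq <= MAX_CALLBACKS; seq++) {
            String url = job.getProperty("job.notification." + status + "." + seq + ".url");
            if (url == null) {
                break;
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        job.setProperty("job.notification.completed.1.url", "http://abc.com/first");
        // Sequence number 2 is missing, so number 3 is not picked up.
        job.setProperty("job.notification.completed.3.url", "http://abc.com/third");
        System.out.println(urlsFor(job, "completed"));
    }
}
```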

HTTP Method¶

The supported methods are GET and POST. The default method is GET.

Headers¶

Each job callback URL can optionally specify headers in the following format:

job.notification.<status>.<sequence number>.headers=<name>:<value>\r\n<name>:<value>

The delimiter between headers is ‘\r\n’ and the delimiter between a header name and its value is ‘:’.

The headers are applicable for both GET and POST job callback URLs.
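A minimal Java sketch of how such a header value could be split is shown here. It assumes the delimiter between headers is a carriage return followed by a line feed, that the first colon separates name from value, and that entries without a name are skipped. CallbackHeaderSketch is a hypothetical name, not part of Azkaban.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this is not Azkaban's own parsing code.
final class CallbackHeaderSketch {

    // Splits the raw value on CR LF, then splits each entry at its first ':'.
    static Map<String, String> parse(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String entry : raw.split("\r\n")) {
            int colon = entry.indexOf(':');
            if (colon <= 0) {
                continue; // skip entries without a header name (assumed)
            }
            headers.put(entry.substring(0, colon).trim(),
                        entry.substring(colon + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        String raw = "Content-type:application/json\r\nAccept:text/plain";
        parse(raw).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Splitting at the first colon only keeps header values that themselves contain a colon intact.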

Job Context Information¶

It is often desirable to include some dynamic context information about the job in the URL or POST request body, such as status, job name, flow name, execution id and project name. If the URL or POST request body contains any of the following tokens, Azkaban replaces them with the actual values before making the HTTP callback.

?{server} - Azkaban host name and port
?{project}
?{flow}
?{executionId}
?{job}
?{status} - possible values are started, failed, succeeded

The value of these tokens will be HTTP encoded if they are on the URL, but will not be encoded when they are in the HTTP body.
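The substitution rule can be illustrated with a short Java sketch. It is not Azkaban's implementation: CallbackTokenSketch and its encode flag are hypothetical, and java.net.URLEncoder (which encodes a space as "+") is used as a stand-in for whatever encoding Azkaban applies to URLs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this is not Azkaban's own code.
final class CallbackTokenSketch {

    // Replaces each ?{name} token with its value. Pass encode=true for URLs
    // and encode=false for POST bodies, mirroring the rule described above.
    static String substitute(String template, Map<String, String> context, boolean encode) {
        String result = template;
        for (Map.Entry<String, String> e : context.entrySet()) {
            String value = encode ? urlEncode(e.getValue()) : e.getValue();
            result = result.replace("?{" + e.getKey() + "}", value);
        }
        return result;
    }

    private static String urlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> context = new LinkedHashMap<>();
        context.put("job", "my job");
        context.put("status", "started");
        System.out.println(substitute(
                "http://abc.com/api/v2/message?job=?{job}&status=?{status}", context, true));
        System.out.println(substitute("?{job}:?{status}", context, false));
    }
}
```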

Examples¶

GET HTTP Method

job.notification.started.1.url=http://abc.com/api/v2/message?text=wow!!&job=?{job}&status=?{status}
job.notification.completed.1.url=http://abc.com/api/v2/message?text=wow!!&job=?{job}&status=?{status}
job.notification.completed.2.url=http://abc.com/api/v2/message?text=yeah!!

POST HTTP Method

job.notification.started.1.url=http://abc.com/api/v1/resource
job.notification.started.1.method=POST
job.notification.started.1.body={"type":"workflow", "source":"Azkaban", "content":"?{server}:?{project}:?{flow}:?{executionId}:?{job}:?{status}"}
job.notification.started.1.headers=Content-type:application/json

VoldemortBuildandPush Type¶

Pushing data from Hadoop to a Voldemort store used to be done entirely in Java. This created lots of problems, mostly because users had to keep track of jars and dependencies and keep them up to date. We created the VoldemortBuildandPush job type to address this problem. Jars and dependencies are now managed by admins; no jars or Java code are required from users.

How to Use¶

This is essentially a hadoopJava job, with all jars controlled by the admins. Users only need to provide a .job file for the job and specify all the parameters. The following needs to be specified:

Parameter	Description
type	The type name as set by the admin, e.g. `VoldemortBuildandPush`
push.store.name	The voldemort push store name
push.store.owners	The push store owners
push.store.description	Push store description
build.input.path	Build input path on hdfs
build.output.dir	Build output path on hdfs
build.replication.factor	replication factor number
user.to.proxy	The hadoop user this job should run under.
build.type.avro	if build and push avro data, true, otherwise, false
avro.key.field	if using Avro data, key field
avro.value.field	if using Avro data, value field

Here is what is needed and normally configured by the admin (always put common properties in commonprivate.properties and common.properties for all job types).

These go into private.properties:

Parameter	Description
hadoop.security.manager.class	The class that handles talking to hadoop clusters.
azkaban.should.proxy	Whether Azkaban should proxy as individual user hadoop accounts.
proxy.user	The Azkaban user configured with kerberos and hadoop, for secure clusters.
proxy.keytab.location	The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified `proxy.user`
hadoop.home	The hadoop home where the jars and conf resources are installed.
jobtype.classpath	The items that every such job should have on its classpath.
jobtype.class	Should be set to `azkaban.jobtype.HadoopJavaJob`
obtain.binary.token	Whether Azkaban should request tokens. Set this to true for secure clusters.
azkaban.no.user.classpath	Set to true such that Azkaban doesn’t pick up user supplied jars.

These go into plugin.properties:

Parameter	Description
job.class	`voldemort.store.readonly.mr.azkaban.VoldemortBuildAndPushJob`
voldemort.fetcher.protocol	`webhdfs`
hdfs.default.classpath.dir	HDFS location for distributed cache
hdfs.default.classpath.dir.enable	set to true if using distributed cache to ship dependency jars

For more information¶

Please refer to Voldemort project site for more info.

Create Your Own Jobtypes¶

With the plugin design of Azkaban job types, it is possible to extend Azkaban for various system environments. You should be able to execute any job under the same Azkaban workflow management and scheduling.

Creating new job types is often very easy. Here are several ways one can do it:

New Types with only Configuration Changes¶

One doesn’t always need to write Java code to create job types for end users. Often, configuration changes to existing job types create significantly different behavior for end users. For example, at LinkedIn, apart from the pig types, we also have pigLi types that come with all the useful library jars pre-registered and imported. This way, normal users only need to provide their Pig scripts and their own udf jars to Azkaban. The Pig job should run as if it were run on the gateway machine from Pig grunt. In comparison, if users are required to use the basic pig job types, they need to package all the necessary jars in the Azkaban job package and do all the registering and importing themselves, which often poses a learning curve for new Pig/Azkaban users.

The same practice applies to most other job types. Admins should create or tailor job types to their specific company needs or clusters.

New Types Using Existing Job Types¶

If one needs to create a different job type, a good starting point is to see if this can be done by using an existing job type. In Hadoop land, this most often means the hadoopJava type. Essentially all Hadoop jobs, from the most basic MapReduce job to Pig, Hive, Crunch, etc., are Java programs that submit jobs to Hadoop clusters. It is usually straightforward to create a job type that takes user input and runs a hadoopJava job.

For example, one can take a look at the VoldemortBuildandPush job type. It will take in user input such as which cluster to push to, voldemort store name, etc, and runs hadoopJava job that does the work. For end users though, this is a VoldemortBuildandPush job type with which they only need to fill out the .job file to push data from hadoop to voldemort stores.

The same applies to the hive type.

New Types by Extending Existing Ones¶

For the most flexibility, one can always build new types by extending the existing ones. Azkaban uses reflection to load job types that implement the job interface, and tries to construct a sample object upon loading for basic testing. When executing a real job, Azkaban calls the run method to run the job, and cancel method to cancel it.

For new Hadoop job types, it is important to use the correct HadoopSecurityManager class, which is also included in the azkaban-plugins repo. This class handles talking to the Hadoop cluster and, if needed, requests tokens for job execution or for name node communication.

For better security, tokens should be requested in the Azkaban main process and written to a file. Before executing user code, the job type should implement a wrapper that picks up the token file and sets it in the Configuration or JobConf object. Please refer to HadoopJavaJob and HadoopPigJob to see example usage.

System Statistics¶

The Azkaban server maintains certain system statistics, which can be viewed at http://<host>:<port>/stats

To enable this feature, add the property executor.metric.reports=true to azkaban.properties

The property executor.metric.milisecinterval.default controls the interval at which the metrics are collected

Statistic Types¶

Metric Name	Description
NumFailedFlowMetric	Number of failed flows
NumRunningFlowMetric	Number of running flows
NumQueuedFlowMetric	Number of flows in the queue
NumRunningJobMetric	Number of running jobs
NumFailedJobMetric	Number of failed jobs

To change the statistic collection at run time, the following options are available

To change the time interval at which the specific type of statistics are collected - /stats?action=changeMetricInterval&metricName=NumRunningJobMetric&interval=60000
To change the duration at which the statistics are maintained - /stats?action=changeCleaningInterval&interval=604800000
To change the number of data points to display - /stats?action=changeEmitterPoints&numInstances=50
To enable the statistic collection - /stats?action=enableMetrics
To disable the statistic collection - /stats?action=disableMetrics

Reload Jobtypes¶

When you want to make changes to your jobtype configurations or add/remove jobtypes, you can do so without restarting the executor server. You can reload all jobtype plugins as follows:

curl "http://localhost:EXEC_SERVER_PORT/executor?action=reloadJobTypePlugins"

Examples¶

The project az-hadoop-jobtype-plugin provides examples that show how to use some of the included jobtypes. Below you can find how to set up a solo-server to run some of them.

java-wc¶

This example uses the pig-0.12.0 job type to upload an input file to HDFS and the hadoopJava job type to count the occurrences of each word.

We need to install Hadoop and Pig on the solo-server by expanding the tar into /export/apps/hadoop/latest and /export/apps/pig/latest respectively. Then set the HADOOP_HOME and PIG_HOME variables with their paths:

export HADOOP_HOME=/export/apps/hadoop/latest
export PIG_HOME=/export/apps/pig/latest

If you prefer to install Hadoop and Pig under a different path, please update common.properties and commonprivate.properties under ./az-hadoop-jobtype-plugin/src/jobtypes/ to match it.

Follow the Hadoop Single Cluster instructions to run HDFS on a single cluster. You will need to modify etc/hadoop/core-site.xml and run sbin/start-dfs.sh.

Build the source code and copy the pig-0.12.0 and hadoopJava job type directories, along with common.properties and commonprivate.properties, from ./az-hadoop-jobtype-plugin/src/jobtypes to ./azkaban-solo-server/build/install/azkaban-solo-server/plugins/jobtypes. Then start the solo server by running:

./azkaban-solo-server/build/install/azkaban-solo-server/bin/start-solo.sh

Create a zip file with the contents under ./az-hadoop-jobtype-plugin/src/examples/java-wc

Launch Azkaban by going to http://localhost:8081 and enter the credentials found under /azkaban-solo-server/conf/azkaban-users.xml.

Select Create Project, enter your project details and click Upload. Then select the zip created in the step above. Start the job by clicking on Execute Flow.