Plugins

Azkaban is designed to be modular. We can plug in code to add viewer pages or to execute jobs in a customizable manner. These pages describe the azkaban-plugins that can be downloaded from the download page and explain how to extend Azkaban by creating your own plugins or extending existing ones.

HadoopSecurityManager

The most common adoption of Azkaban has been on big data platforms such as Hadoop. Azkaban's jobtype plugin system provides flexible support for such systems.

Azkaban can support all Hadoop versions, including Hadoop security features, and it can support various ecosystem components in multiple versions, such as different versions of Pig and Hive, on the same instance.

A common pattern to achieve this is the HadoopSecurityManager class, which handles communication with a Hadoop cluster and takes care of Hadoop security.

Hadoop Security with Kerberos, Hadoop Tokens

When Hadoop is used in an enterprise production environment, it is advisable to turn its security features on to protect your data and guard against mistakes.

Kerberos Authentication

The most common authentication provided by Apache Hadoop is via Kerberos, which requires a KDC to authenticate users and services.

A user can authenticate with the KDC via username/password or with a keytab. The KDC distributes a TGT (ticket-granting ticket) to authenticated users. Hadoop services, such as the name node and the job tracker, can then use this TGT to verify that the user is authenticated.

Hadoop Tokens

Once a user is authenticated with Hadoop services, Hadoop issues tokens to the user so that its internal services won't flood the KDC. For a description of tokens, see the Hadoop delegation token documentation.

Hadoop SecurityManager

Human users authenticate with the KDC via the kinit command. But a scheduler such as Azkaban, which runs jobs on behalf of other users, needs to acquire tokens that those users' jobs will use. Specific Azkaban job types should handle this with the HadoopSecurityManager class.

For instance, when Azkaban loads the pig job type, it will instantiate a HadoopSecurityManager that is authenticated with the desired KDC and Hadoop cluster. The pig job type conf should specify which tokens are needed to talk to different services; at minimum it needs tokens from the name node and the job tracker. When a pig job starts, it goes to the HadoopSecurityManager to acquire all those tokens. When the user process finishes, the pig job type calls the HadoopSecurityManager again to cancel all those tokens.

Settings Common to All Hadoop Clusters

When a user program wants to talk to a Hadoop cluster, it needs to know where the name node and job tracker are. It also needs to know how to authenticate with them. This information is all in the Hadoop config files, normally found in $HADOOP_HOME/conf. For this reason, this conf directory as well as the hadoop-core jar need to be on the Azkaban executor server classpath.

If you are using Hive with HCatalog as its metastore, the relevant Hive jars and the Hive conf directory need to be on the classpath as well.
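For example, these entries can be supplied through the jobtype plugins' jobtype.global.classpath setting in commonprivate.properties; the paths below are assumptions and should be adjusted to your installation:

hadoop.home=/opt/hadoop
hive.home=/opt/hive
jobtype.global.classpath=${hadoop.home}/conf,${hadoop.home}/hadoop-core-1.0.4.jar,${hive.home}/conf,${hive.home}/lib/*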

Native Library

Most likely your Hadoop platform depends on some native libraries. Their location should be specified via java.library.path on the Azkaban executor server.
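For example, the library path can be passed to every job type JVM via jobtype.global.jvm.args in commonprivate.properties (a sketch; the native library directory below is an assumption):

jobtype.global.jvm.args=-Djava.library.path=/opt/hadoop/lib/native/Linux-amd64-64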

temp dir

Besides those, many tools on Hadoop, such as Pig, Hive, and Crunch, write files into a temporary directory. By default, they all go to /tmp. This can cause operational issues when many jobs run concurrently. Because of this, you may want to change the location by setting java.io.tmpdir to a different directory.
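A minimal sketch, assuming a dedicated temp directory exists on the executor hosts (combine the flag with any other entries already on the jobtype.global.jvm.args line):

jobtype.global.jvm.args=-Djava.io.tmpdir=/export/azkaban/tmp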

Settings To Talk to UNSECURE Hadoop Cluster

If you are just starting out with Hadoop, chances are you don't have Kerberos authentication enabled for your Hadoop cluster. Depending on whether you want to run everything as the azkaban user (or whatever user started the Azkaban executor server), you can use the following settings (a sketch follows this list):

  • If you started the executor server as a user named azkaban and you want to run all jobs as azkaban on Hadoop, set azkaban.should.proxy=false and obtain.binary.token=false.
  • If you started the executor server as a user named azkaban but you want to run Hadoop jobs as their individual submitting users, set azkaban.should.proxy=true and obtain.binary.token=false.
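For example, to run Hadoop jobs as their individual submitting users on an unsecured cluster, commonprivate.properties could contain (a minimal sketch):

azkaban.should.proxy=true
obtain.binary.token=false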

Settings To Talk to SECURE Hadoop Cluster

For secure Hadoop clusters, Azkaban needs its own Kerberos keytab to authenticate with the KDC. Azkaban job types should acquire the necessary Hadoop tokens before the user job process starts, and should cancel the tokens after the user job finishes.

All job type specific settings should go to their respective plugin conf files. Some of the common settings can go to commonprivate.properties and common.properties.

For instance, Hadoop job types usually require name node tokens and job tracker tokens. These can go in commonprivate.properties.

Azkaban as proxy user

The following settings are needed for HadoopSecurityManager to authenticate with KDC:

proxy.user=YOUR_AZKABAN_KERBEROS_PRINCIPAL

This principal should also be set in core-site.xml in Hadoop conf with corresponding permissions.

proxy.keytab.location=KEYTAB_LOCATION

One should verify that the proxy user and keytab work with the specified KDC.
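A quick way to check is to log in from the keytab directly with kinit; the principal and keytab path below are placeholders:

$ kinit -k -t /path/to/azkaban.keytab YOUR_AZKABAN_KERBEROS_PRINCIPAL
$ klist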

Obtaining tokens for user jobs

Here is what is common for most Hadoop job types:

hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0

This implementation should work with Hadoop 1.x

azkaban.should.proxy=true
obtain.binary.token=true
obtain.namenode.token=true
obtain.jobtracker.token=true

Additionally, if your job needs to talk to HCatalog, for example if you have Hive installed with a Kerberized HCatalog metastore, or your pig job needs to talk to HCat, you will need to set the following for those job types:

obtain.hcat.token=true

This makes the HadoopSecurityManager acquire an HCat token as well.

Making a New Job Type on Secure Hadoop Cluster

If you are making a new job type that will talk to a Hadoop cluster, you can use the HadoopSecurityManager to take care of security.

For an unsecured Hadoop cluster, nothing special is needed.

For secure Hadoop clusters, there are two approaches included in the hadoopsecuritymanager package:

  • Give the keytab information to the user job process. The hadoopsecuritymanager static method takes care of logging in from that common keytab and proxying to the user. This is convenient for prototyping, as a real TGT is granted to the user job. The downside is that the user could potentially use the keytab to log in and proxy as someone else, which presents a security hole.
  • Obtain Hadoop tokens before the user job process starts. The job wrapper will pick up these binary tokens inside the user job process. The tokens should be explicitly cancelled after the user job finishes.

By pairing a properly configured hadoopsecuritymanager with basic job types such as hadoopJava, pig, and hive, one can make these job types work with different versions of Hadoop under various security settings.
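For example, a new job type's private.properties could pair the security manager with the token settings above (a sketch; the classpath entries are assumptions about your installation):

hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0
azkaban.should.proxy=true
obtain.binary.token=true
obtain.namenode.token=true
obtain.jobtracker.token=true
jobtype.classpath=${hadoop.home}/conf,${hadoop.home}/hadoop-core-1.0.4.jar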

Included in the azkaban-plugins is the hadoopsecuritymanager for Hadoop 1.x versions. It is not compatible with Hadoop 0.20 and prior versions because the Hadoop UGI is not backwards compatible there. However, it should not be difficult to implement one that works with them. Going forward, the Hadoop UGI is mostly backwards compatible, and one only needs to recompile the hadoopsecuritymanager package against newer versions of Hadoop.

Azkaban HDFS Browser

The Azkaban HDFS Browser is a plugin that allows you to view the HDFS filesystem and decode several file types. It was originally created at LinkedIn to view Avro files, LinkedIn's BinaryJson format, and text files. As this plugin matures further, we may add decoding of more file types in the future.

[Screenshot: the Azkaban HDFS Browser UI]

Setup

Download the HDFS plugin from the download page and extract it into the web server's plugins directory. This is often azkaban_web_server_dir/plugins/viewer/.
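For example (the archive name is an assumption; the extracted directory typically needs to match the configured viewer.path, hdfs by default):

$ cd azkaban_web_server_dir/plugins/viewer/
$ tar -xvzf /path/to/azkaban-hdfs-viewer.tar.gz
$ mv azkaban-hdfs-viewer hdfs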

Users

By default, the Azkaban HDFS browser does a do-as to impersonate the logged-in user. Oftentimes, data is created and handled by a headless account. To view those files, if user proxying is turned on, the user can switch to the headless account as long as the switch is validated by the UserManager.

Settings

These are properties to configure the HDFS Browser on the AzkabanWebServer. They can be set in azkaban_web_server_dir/plugins/viewer/hdfs/conf/plugin.properties.

  • viewer.name: The name of this viewer plugin. Default: HDFS
  • viewer.path: The path to this viewer plugin inside the viewer directory. Default: hdfs
  • viewer.order: The order of this viewer plugin amongst all viewer plugins. Default: 1
  • viewer.hidden: Whether this plugin should be hidden on the web UI. Default: false
  • viewer.external.classpath: Extra jars this viewer plugin should load upon init. Default: extlib/*
  • viewer.servlet.class: The main servlet class for this viewer plugin. Use azkaban.viewer.hdfs.HdfsBrowserServlet for the HDFS browser.
  • hadoop.security.manager.class: The class that handles talking to Hadoop clusters. Use azkaban.security.HadoopSecurityManager_H_1_0 for Hadoop 1.x.
  • azkaban.should.proxy: Whether Azkaban should proxy as individual user Hadoop accounts on a secure cluster. Default: false
  • proxy.user: The Azkaban user configured with Kerberos and Hadoop. Similar to how Oozie should be configured, for secure Hadoop installations.
  • proxy.keytab.location: The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user.
  • allow.group.proxy: Whether to allow users in the same headless user group to view the HDFS filesystem as that headless user. Default: false
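Putting these together, a minimal plugin.properties for a secure cluster might look like the following sketch (the principal and keytab path are placeholders):

viewer.name=HDFS
viewer.path=hdfs
viewer.servlet.class=azkaban.viewer.hdfs.HdfsBrowserServlet
hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0
azkaban.should.proxy=true
proxy.user=YOUR_AZKABAN_KERBEROS_PRINCIPAL
proxy.keytab.location=/path/to/azkaban.keytab
allow.group.proxy=true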

JobType Plugins

Azkaban Jobtype Plugins Configurations

These are properties to configure the jobtype plugins that are installed with the AzkabanExecutorServer. Note that Azkaban uses the directory structure to infer global settings versus individual jobtype-specific settings. Sub-directory names also determine the job type names available to running Azkaban instances (see the sketch below).
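For example, a jobtype plugin directory might be laid out as follows (the job type names are illustrative):

plugins/jobtypes/
    commonprivate.properties     # global settings not visible to user code
    common.properties            # global settings visible to user code
    hadoopJava/
        private.properties       # settings specific to the hadoopJava job type
    pig-0.10.1/
        private.properties       # settings specific to this pig version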

Introduction

Jobtype plugins determine how individual jobs are actually run locally or on a remote cluster. It gives great benefits: one can add or change any job type without touching Azkaban core code; one can easily extend Azkaban to run on different hadoop versions or distributions; one can keep old versions around while adding new versions of the same types. However, it is really up to the admin who manages these plugins to make sure they are installed and configured correctly.

Upon AzkabanExecutorServer startup, Azkaban will try to load all the job type plugins it can find. Azkaban will run some very simple tests and drop the bad ones. One should always run a few test jobs to make sure the job types really work as expected.

Global Properties

One can pass global settings to all job types, including cluster-dependent settings that will be used by all job types. These settings can also be specified in each job type's own settings.

Private settings

In commonprivate.properties, one can pass global settings that are needed by job types but should not be accessible to user code. For example, the following settings are often needed for a Hadoop cluster:

  • hadoop.security.manager.class: The HadoopSecurityManager class that handles talking to a Hadoop cluster. Use azkaban.security.HadoopSecurityManager_H_1_0 for Hadoop 1.x versions.
  • azkaban.should.proxy: Whether Azkaban should proxy as individual user Hadoop accounts, or run as the Azkaban user itself. Defaults to true.
  • proxy.user: The Azkaban user configured with Kerberos and Hadoop. Similar to how Oozie should be configured, for secure Hadoop installations.
  • proxy.keytab.location: The location of the keytab file with which Azkaban can authenticate with Kerberos for the specified proxy.user.
  • jobtype.global.classpath: The jars or xml resources every job type should have on its classpath (e.g. ${hadoop.home}/hadoop-core-1.0.4.jar,${hadoop.home}/conf).
  • jobtype.global.jvm.args: The JVM args that every job type should pass to its JVM.
  • hadoop.home: The $HADOOP_HOME setting.
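Putting these together, a commonprivate.properties for a secure Hadoop 1.x cluster might look like the following sketch (the paths and principal are placeholders):

hadoop.security.manager.class=azkaban.security.HadoopSecurityManager_H_1_0
azkaban.should.proxy=true
proxy.user=YOUR_AZKABAN_KERBEROS_PRINCIPAL
proxy.keytab.location=/path/to/azkaban.keytab
hadoop.home=/opt/hadoop
jobtype.global.classpath=${hadoop.home}/conf,${hadoop.home}/hadoop-core-1.0.4.jar
obtain.namenode.token=true
obtain.jobtracker.token=true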

Public settings

One can pass global settings that are needed by job types and should be visible to user code in common.properties. For example, hadoop.home should normally be passed along to user programs.
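For example (the paths are placeholders):

hadoop.home=/opt/hadoop
pig.home=/opt/pig
hive.home=/opt/hive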

Settings for individual job types

In most cases, no extra settings are needed for job types to work, other than variables like hadoop.home, pig.home, and hive.home. However, this is also where most of the customization comes from. For example, one can configure two job types with the same jar resources but different Hadoop configurations, thereby submitting jobs to different clusters. One can also configure a pig job type with pre-registered jars and namespace imports for specific organizations. Also note that the list of common job type plugins includes different pig versions. The admin needs to make a soft link to one of them, such as

$ ln -s pig-0.10.1 pig

so that the users can use a default “pig” type.
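As an illustration of the per-job-type customization described above, two pig job types could each point at a different cluster by putting that cluster's Hadoop conf directory on the job type classpath (a sketch; the directory names and paths are assumptions):

In plugins/jobtypes/pigCluster1/private.properties:
jobtype.classpath=/etc/hadoop-cluster1/conf,${pig.home}/pig-0.10.1.jar

In plugins/jobtypes/pigCluster2/private.properties:
jobtype.classpath=/etc/hadoop-cluster2/conf,${pig.home}/pig-0.10.1.jar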