Flow Level Retry

Authors: Zhi Zeng

Ref: Azkaban Containerized Executions - Design Doc

Background

As Azkaban moves into containerization, executions are observing some k8s pod related / infra related issues that result in the whole execution falls. Plus, many of the flows take hours to days to run, and if not attended, a failure may takes hours before the user restart it.

So we build the feature that allows users to enable auto-retry on flow-level, as well as opt-in-able strategies on how to restart the execution.

Currently the retried execution will be with a new / different execution ID than the original execution.

Use cases

  • Let the execution to have auto-restart on flow level upon failures, lifting human operation and delay.
  • Let the execution to auto-restarted when encountering containerization failures or other infra issues.
  • Let the retried execution to skip some nodes that succeeded (correspondingly in the original execution) to save time & resources.

Flow-level retry Parameters

This set of parameters defines the retry behavior of flow / execution.

Flow Param Value Type / Note Example Usage
flow.retry.statuses String; Comma delimited list of flow statuses. FAILED,EXECUTION_STOPPED Enable flow-level retry: restart the execution when it falls into the defined statues
flow.max.retries Integer 2 Define the maximum number of flow-level retries that will be attempted, for the execution(s) falls into the defined statuses
flow.retry.strategy String retryAsNew (default value) Define how will an execution be retried when falls into the defined statuses. Currently support strategies: “retryAsNew” & “disableSucceededNodes”

Examples

UI approach:

Set the flow-parameters config in web UI directly when submit / schedule executions, like

_images/FlowRetryUISettingExample.png

In this example:

  • the execution will be retried if it falls to FAIL or EXECUTION_STOPPED status;
  • it will be retried at most 1 time;
  • when retrying, those jobs succeeded in original execution will be disabled in the retried execution.

DSL approach:

Define the flow-parameters using DSL in the gradle files, or in the corresponding .flow files.