
Azure Data Factory Orchestrator

Enterprise

Image: $DATAOPS_ADF_RUNNER_IMAGE

The Azure Data Factory (ADF) Orchestrator runs pre-existing pipelines registered in Azure Data Factory in the customer's Azure subscription account.

note

This orchestrator inherits from the Azure orchestrator, so you can use the Azure orchestrator's functionality if you require further pre- or post-processing.

Mixing the Azure and ADF orchestrator functionalities in the same job is possible.

Usage

Before using this orchestrator, ensure that the ADF pipeline and its dependencies listed below are already defined:

  • An Azure Resource Group
  • A working Data Factory instance
  • All of the required datasets
  • Any linked services used by the pipeline, like Azure Blob Storage and Office 365
  • The ADF pipeline and any child pipelines

The following YAML script describes how to define an ADF job:

pipelines/includes/local_includes/adf_jobs/my_adf_job.yml
"My ADF Job":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: XXXX
DATAOPS_ADF_PIPELINE_NAME: XXXX
DATAOPS_ADF_RESOURCE_GROUP: XXXX
DATAOPS_ADF_TIMEOUT: 300
DATAOPS_ADF_PIPELINE_RERUN_FROM: Failure
DATAOPS_ADF_SLEEP: 100

# use one of the following connection methods
# identity inheritance
DATAOPS_USE_IDENTITY: 1

# or default vault expansion
SET_AZ_KEYS_TO_ENV: 1

# or custom vault expansion
AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png

Connecting to Azure

The Azure Data Factory orchestrator supports two methods of connecting to Azure. Use one or the other, but not both.

1. Username and password

With this method, you provide your Azure username and password to the DataOps pipeline so that it can connect to the Azure services, which in turn connect to Azure Data Factory. Set the environment variables AZURE_USER and AZURE_PASSWORD to achieve this.

We recommend that you keep your third-party credentials in the DataOps Vault. Storing them at the default paths AZURE.DEFAULT.USER and AZURE.DEFAULT.PASSWORD allows you to retrieve them by setting the environment variable SET_AZ_KEYS_TO_ENV.
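For example, with your credentials stored at those default paths, the job only needs the flag itself, exactly as shown in the job template above:

variables:
  SET_AZ_KEYS_TO_ENV: 1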

If you have stored your credentials at different vault paths, use the DATAOPS_VAULT() functionality to retrieve them:

variables:
  AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
  AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)

2. Inheriting permissions from the virtual machine

The Azure Data Factory orchestrator also supports using the virtual machine's identity to connect to Azure. To use this method, set the variable DATAOPS_USE_IDENTITY.
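As in the job template above, enabling the flag is all that is needed when the DataOps runner's virtual machine already holds the required Azure permissions:

variables:
  DATAOPS_USE_IDENTITY: 1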

Supported parameters

| Parameter | Required/Default | Description |
|-----------|------------------|-------------|
| DATAOPS_ADF_ACTION | REQUIRED | Must be START to use the orchestrator |
| DATAOPS_ADF_FACTORY_NAME | REQUIRED | The factory name |
| DATAOPS_ADF_PIPELINE_NAME | REQUIRED | The pipeline name |
| DATAOPS_ADF_RESOURCE_GROUP | REQUIRED | The resource group name |
| DATAOPS_ADF_PIPELINE_PARAMETERS | Optional | File path for the ADF pipeline parameters (JSON); see the sketch after this table |
| DATAOPS_ADF_PIPELINE_RERUN_FROM | Optional. Defaults to START | Defines the ADF pipeline recovery behavior if the DataOps ADF job retries. Possible values are FAILURE, START, and <Name of activity> |
| DATAOPS_ADF_SLEEP | Optional. Defaults to 10 | Time in seconds between retry attempts of the ADF pipeline |
| DATAOPS_ADF_TIMEOUT | Optional. Defaults to 300 | Time in seconds to wait for the completion of the ADF pipeline |
| DATAOPS_USE_IDENTITY | Optional | If set, uses the Azure permissions inherited from the VM |
| SET_AZ_KEYS_TO_ENV | Optional | If set, exports the Azure username (AZURE.DEFAULT.USER) and password (AZURE.DEFAULT.PASSWORD) from the vault to AZURE_USER and AZURE_PASSWORD |
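For DATAOPS_ADF_PIPELINE_PARAMETERS, the referenced file holds the parameters passed to the ADF pipeline run. As a minimal sketch only, assuming a simple name-to-value JSON mapping that matches the parameters defined on your ADF pipeline (the parameter names sourceContainer and runDate below are hypothetical):

{
  "sourceContainer": "raw-landing",
  "runDate": "2023-01-01"
}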

The following points expand on these supported parameters:

DATAOPS_ADF_PIPELINE_RERUN_FROM

If the ADF pipeline fails or times out, the Azure Data Factory orchestrator can rerun it by retrying the existing DataOps pipeline instead of running a new DataOps pipeline.

warning

You cannot rerun a DataOps pipeline if another instance of the same DataOps pipeline has been run in the same environment.

The Azure Data Factory orchestrator uses the ADF pipeline runId to retry the run. Users can define the point from which to resume the run with the variable DATAOPS_ADF_PIPELINE_RERUN_FROM.
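As a sketch of how the retry-related variables fit together (the timeout and sleep values below are illustrative only, not recommendations):

variables:
  DATAOPS_ADF_ACTION: START
  DATAOPS_ADF_TIMEOUT: 600                  # wait up to 600 seconds for the ADF pipeline run to complete
  DATAOPS_ADF_SLEEP: 30                     # wait 30 seconds between retry attempts
  DATAOPS_ADF_PIPELINE_RERUN_FROM: FAILURE  # on a retried DataOps job, resume the same ADF run from the failed activity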

The following DATAOPS_ADF_PIPELINE_RERUN_FROM values are available:

START

If this parameter is set to START, the ADF pipeline rerun will restart from the beginning.

FAILURE

If this parameter is set to FAILURE, the pipeline run will restart from where the ADF pipeline failed.

From Activity

If this parameter is set to the name of an activity, the ADF pipeline run will restart from that activity.

Activity Name

START and FAILURE are case-sensitive reserved words for the Azure Data Factory orchestrator and cannot be used as an activity name.

Example jobs

Here are several examples of ADF orchestrator jobs:

Standard job setup for the ADF orchestrator

Use the default setup if you do not need to rerun the ADF pipeline in the case of a failure:

pipelines/includes/local_includes/adf_jobs/azure_data_factory_job.yml
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_USE_IDENTITY: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png

Retrying the job from the failed ADF activity

Use the DATAOPS_ADF_PIPELINE_RERUN_FROM parameter if you want to leverage the ADF pipeline recovery feature, as follows:

pipelines/includes/local_includes/adf/azure_data_factory_job.yml
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_ADF_PIPELINE_RERUN_FROM: FAILURE
DATAOPS_USE_IDENTITY: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png

Retrying the job from a specified ADF activity

Finally, here is an example that reruns from a named ADF activity. It also shows how to use username/password authorization instead of the recommended inherited Azure identity.

pipelines/includes/local_includes/adf/azure_data_factory_job.yml
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_ADF_PIPELINE_RERUN_FROM: DummyActivity
AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png