# Azure Data Factory Orchestrator
| Type | Pre-Set |
| --- | --- |
| Image | `$DATAOPS_ADF_RUNNER_IMAGE` |
The Azure Data Factory (ADF) Orchestrator is a pre-set orchestrator that executes pre-existing pipelines registered in Azure Data Factory in the customer's Azure subscription.

This orchestrator inherits from the Azure orchestrator, so you can call Azure orchestrator functionality if you require further pre- or post-processing. Mixing Azure and ADF orchestrator functionality in the same job is possible, as sketched below.
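For illustration only, a mixed job might run an Azure CLI step before triggering the ADF pipeline. This is a minimal sketch: the job name, stage, and the `az group list` pre-check are hypothetical and assume the Azure CLI is available in `$DATAOPS_ADF_RUNNER_IMAGE`.

```yaml
"ADF With Azure Pre-Check":
  extends:
    - .agent_tag
  stage: "Batch Ingestion"
  image: $DATAOPS_ADF_RUNNER_IMAGE
  variables:
    DATAOPS_ADF_ACTION: START
    DATAOPS_ADF_FACTORY_NAME: XXXX
    DATAOPS_ADF_PIPELINE_NAME: XXXX
    DATAOPS_ADF_RESOURCE_GROUP: XXXX
    DATAOPS_USE_IDENTITY: 1
  script:
    # Hypothetical pre-processing step using Azure orchestrator functionality
    - az group list --output table
    # Trigger the ADF pipeline run
    - /dataops
  icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
```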
## Usage
Before using this orchestrator, ensure that the ADF pipeline and the dependencies listed below are defined:
- An Azure Resource Group
- A working Data Factory instance
- All of the required datasets
- Any linked services used by the pipeline, like Azure Blob Storage and Office 365
- The ADF pipeline and any child pipelines
The following YAML script describes how to define an ADF job:
"My ADF Job":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: XXXX
DATAOPS_ADF_PIPELINE_NAME: XXXX
DATAOPS_ADF_RESOURCE_GROUP: XXXX
DATAOPS_ADF_TIMEOUT: 300
DATAOPS_ADF_PIPELINE_RERUN_FROM: Failure
DATAOPS_ADF_SLEEP: 100
# use one of the following connection methods
# identity inheritance
DATAOPS_USE_IDENTITY: 1
# or default vault expansion
SET_AZ_KEYS_TO_ENV: 1
# or custom vault expansion
AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
## Connecting to Azure

To connect to Azure, the Azure Data Factory orchestrator supports two methods. Use one or the other, but not both.
### 1. Username and password

With this method, you provide your Azure username and password to the DataOps pipeline so that it can connect to the Azure services, which, in turn, connect to Azure Data Factory. Setting the environment variables `AZURE_USER` and `AZURE_PASSWORD` achieves this.
We recommend that you keep your third-party credentials in the DataOps Vault. Storing them at the default paths `AZURE.DEFAULT.USER` and `AZURE.DEFAULT.PASSWORD` allows you to retrieve them by setting the environment variable `SET_AZ_KEYS_TO_ENV`.
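For example, with credentials stored at those default vault paths, the job only needs to set this flag (a minimal sketch mirroring the Usage block above):

```yaml
variables:
  # Expands AZURE.DEFAULT.USER and AZURE.DEFAULT.PASSWORD from the DataOps Vault
  SET_AZ_KEYS_TO_ENV: 1
```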
If you have stored your credentials at different vault paths, use the `DATAOPS_VAULT()` functionality to retrieve them:
```yaml
variables:
  AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
  AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
```
### 2. Inheriting permissions from the virtual machine

The Azure Data Factory orchestrator also supports using the virtual machine's identity to connect to Azure. To use this method, set the variable `DATAOPS_USE_IDENTITY`.
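For example (a minimal sketch mirroring the identity-inheritance option in the Usage block above):

```yaml
variables:
  # Use the Azure permissions inherited from the runner's virtual machine
  DATAOPS_USE_IDENTITY: 1
```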
## Supported parameters
| Parameter | Required/Default | Description |
| --- | --- | --- |
| `DATAOPS_ADF_ACTION` | REQUIRED | Must be `START` to use the orchestrator |
| `DATAOPS_ADF_FACTORY_NAME` | REQUIRED | The data factory name |
| `DATAOPS_ADF_PIPELINE_NAME` | REQUIRED | The ADF pipeline name |
| `DATAOPS_ADF_RESOURCE_GROUP` | REQUIRED | The resource group name |
| `DATAOPS_ADF_PIPELINE_PARAMETERS` | Optional | File path of the ADF pipeline parameters (JSON) |
| `DATAOPS_ADF_PIPELINE_RERUN_FROM` | Optional. Defaults to `START` | Defines the ADF pipeline recovery behavior when the DataOps ADF job retries. Possible values are `FAILURE`, `START`, and `<Name of activity>` |
| `DATAOPS_ADF_SLEEP` | Optional. Defaults to `10` | Time in seconds between retry attempts of the ADF pipeline |
| `DATAOPS_ADF_TIMEOUT` | Optional. Defaults to `300` | Time in seconds to wait for the ADF pipeline to complete |
| `DATAOPS_USE_IDENTITY` | Optional | If set, uses the Azure permissions inherited from the virtual machine |
| `SET_AZ_KEYS_TO_ENV` | Optional | If set, exports the Azure username (`AZURE.DEFAULT.USER`) and password (`AZURE.DEFAULT.PASSWORD`) from the DataOps Vault into the corresponding environment variables |
The following points expand on these supported parameters:
### DATAOPS_ADF_PIPELINE_RERUN_FROM

The Azure Data Factory orchestrator can rerun the ADF pipeline in the event of a failure or timeout by rerunning the existing DataOps pipeline instead of running a new DataOps pipeline.

You cannot rerun a DataOps pipeline if another instance of the same DataOps pipeline has been run in the same environment.

The Azure Data Factory orchestrator uses the ADF pipeline `runId` to retry the run. You define the point from which to resume the run in the variable `DATAOPS_ADF_PIPELINE_RERUN_FROM`.

The following `DATAOPS_ADF_PIPELINE_RERUN_FROM` values are available:
- `START`: the ADF pipeline rerun restarts from the beginning.
- `FAILURE`: the ADF pipeline rerun restarts from the point where the ADF pipeline failed.
- `<Name of activity>`: the ADF pipeline rerun restarts from the specified activity.

`START` and `FAILURE` are case-sensitive reserved words for the Azure Data Factory orchestrator and cannot be used as activity names.
## Example jobs
Here are several examples of ADF orchestrator jobs:
### Standard job setup for the ADF orchestrator
Use the default setup if you do not need to rerun the ADF pipeline in the event of a failure.
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_USE_IDENTITY: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
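The file referenced by `DATAOPS_ADF_PIPELINE_PARAMETERS` supplies the parameters passed to the ADF pipeline as JSON. The sketch below is purely hypothetical and assumes the file holds a flat name/value object whose keys match the parameters your ADF pipeline declares:

```json
{
  "sourceContainer": "landing-zone",
  "loadDate": "2024-01-01"
}
```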
### Retrying the job from the failed ADF activity

Use the `DATAOPS_ADF_PIPELINE_RERUN_FROM` parameter if you do want to leverage the ADF pipeline recovery feature, as follows:
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_ADF_PIPELINE_RERUN_FROM: FAILURE
DATAOPS_USE_IDENTITY: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
### Retrying the job from a specified ADF activity

Finally, here is an example that resumes from a named ADF activity. This example also shows how to use username/password authentication instead of the recommended inherited Azure identity.
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_ADF_PIPELINE_RERUN_FROM: DummyActivity
AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
## Project Resources

None

## Host dependencies (and resources)

None