Azure Data Factory Orchestrator
Enterprise
Image | $DATAOPS_ADF_RUNNER_IMAGE |
---|---|
The Azure Data Factory (ADF) orchestrator runs pre-existing pipelines registered in Azure Data Factory in the customer's Azure subscription.
This orchestrator inherits from the Azure orchestrator, so you can call the Azure orchestrator's functionality if you require further pre- or post-processing.
Mixing the Azure and ADF orchestrator functionalities in the same job is possible, as shown in the sketch below.
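As a minimal, hypothetical sketch of such a mixed job: the extra script lines before and after /dataops are illustrative pre- and post-processing steps, and they assume the inherited Azure orchestrator tooling (here, the az CLI) is available in the job image.

```yaml
"My Mixed Azure and ADF Job":
  extends:
    - .agent_tag
  stage: "Batch Ingestion"
  image: $DATAOPS_ADF_RUNNER_IMAGE
  variables:
    DATAOPS_ADF_ACTION: START
    DATAOPS_ADF_FACTORY_NAME: XXXX
    DATAOPS_ADF_PIPELINE_NAME: XXXX
    DATAOPS_ADF_RESOURCE_GROUP: XXXX
    DATAOPS_USE_IDENTITY: 1
  script:
    # hypothetical pre-processing step using the inherited Azure orchestrator tooling
    - az account show
    # run the registered ADF pipeline (ADF orchestrator functionality)
    - /dataops
    # hypothetical post-processing step
    - az group show --name XXXX
  icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
```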
Usage
Before using this orchestrator, ensure that the ADF pipeline and the dependencies listed below are already defined:
- An Azure Resource Group
- A working Data Factory instance
- All of the required datasets
- Any linked services used by the pipeline, like Azure Blob Storage and Office 365
- The ADF pipeline and any child pipelines
The following YAML script describes how to define an ADF job:
"My ADF Job":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: XXXX
DATAOPS_ADF_PIPELINE_NAME: XXXX
DATAOPS_ADF_RESOURCE_GROUP: XXXX
DATAOPS_ADF_TIMEOUT: 300
DATAOPS_ADF_PIPELINE_RERUN_FROM: Failure
DATAOPS_ADF_SLEEP: 100
# use one of the following connection methods
# identity inheritance
DATAOPS_USE_IDENTITY: 1
# or default vault expansion
SET_AZ_KEYS_TO_ENV: 1
# or custom vault expansion
AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
# or service principal
DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL: 1
TENANT_ID: <tenant_id>
AZURE_APP_ID: DATAOPS_VAULT(PATH.TO.APPLICATION_ID.IN.VAULT)
AZURE_CLIENT_SECRET: DATAOPS_VAULT(PATH.TO.CLIENT_SECRET.IN.VAULT)
DISABLE_ALLOW_NO_SUBSCRIPTIONS: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
Connecting to Azure
The orchestrator supports several methods of connecting to Azure. Use only one of them per job.
1. Username and password
To use the Azure orchestrator, you must provide your Azure username and password to the DataOps pipeline to connect to the Azure services, which will, in turn, connect to Azure Data Factory. Setting the environment variables AZURE_USER and AZURE_PASSWORD achieves this.
We recommend that you keep your third-party credentials in the DataOps Vault. Storing them at the default paths AZURE.DEFAULT.USER and AZURE.DEFAULT.PASSWORD allows you to retrieve them by setting the environment variable SET_AZ_KEYS_TO_ENV.
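For example, a job using the default vault paths only needs this flag in its variables (a minimal sketch; the job name is illustrative):

```yaml
"My Azure Job":
  variables:
    SET_AZ_KEYS_TO_ENV: 1
```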
Use the DATAOPS_VAULT() functionality to retrieve your credentials if you have stored them at different vault paths:
"My Azure Job":
variables:
AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
2. Inheriting permissions from the virtual machine
The Azure Data Factory orchestrator supports using the virtual machine's identity to connect to Azure. To use this method, set the variable DATAOPS_USE_IDENTITY.
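For example (the job name is illustrative):

```yaml
"My Azure Job":
  variables:
    DATAOPS_USE_IDENTITY: 1
```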
3. Using a service principal
The Azure Data Factory orchestrator also supports using a service principal to connect to Azure. To use this method, set the variable DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL and provide the additional parameters:
"My Azure Job":
variables:
DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL: 1
TENANT_ID: <tenant_id>
AZURE_APP_ID: DATAOPS_VAULT(PATH.TO.APPLICATION_ID.IN.VAULT)
AZURE_CLIENT_SECRET: DATAOPS_VAULT(PATH.TO.CLIENT_SECRET.IN.VAULT)
DISABLE_ALLOW_NO_SUBSCRIPTIONS: 1
Supported parameters
Parameter | Required/Default | Description |
---|---|---|
DATAOPS_ADF_ACTION | REQUIRED | Must be START to use the orchestrator |
DATAOPS_ADF_FACTORY_NAME | REQUIRED | The factory name |
DATAOPS_ADF_PIPELINE_NAME | REQUIRED | The pipeline name |
DATAOPS_ADF_RESOURCE_GROUP | REQUIRED | The resource group name |
DATAOPS_ADF_PIPELINE_PARAMETERS | Optional | File path for the ADF pipeline parameters (JSON); see the example after this table |
DATAOPS_ADF_PIPELINE_RERUN_FROM | Optional. Defaults to START | Defines the ADF pipeline recovery behavior if the DataOps ADF job retries. Possible values are FAILURE, START, and <Name of activity> |
DATAOPS_ADF_SLEEP | Optional. Defaults to 10 | Time in seconds between retry attempts of the ADF pipeline |
DATAOPS_ADF_TIMEOUT | Optional. Defaults to 300 | Time in seconds to wait for the completion of the ADF pipeline |
DATAOPS_USE_IDENTITY | Optional | If set, uses the Azure permissions inherited from the VM |
AZURE_USER | Optional | The Azure username |
AZURE_PASSWORD | Optional | The Azure password |
SET_AZ_KEYS_TO_ENV | Optional | If set, exports the Azure username and password from the DataOps Vault to the environment |
DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL | Optional | If set, the Azure sign-in uses a service principal |
AZURE_APP_ID | Optional | The application (client) ID associated with the service principal. It is required if DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL is set to 1 |
AZURE_CLIENT_SECRET | Optional | The client secret associated with the service principal. It is required if DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL is set to 1 |
DISABLE_ALLOW_NO_SUBSCRIPTIONS | Optional | If set, disables ALLOW_NO_SUBSCRIPTIONS. It is required if DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL is set to 1 |
TENANT_ID | Optional | The Azure tenant ID. It is required if DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL is set to 1 |
DATAOPS_AZURE_RETRY_ATTEMPTS | Optional. Defaults to 1 | Configures the number of retries in case of connection failures |
DATAOPS_AZURE_RETRY_INTERVAL | Optional. Defaults to 10 | Configures the number of seconds to wait between retries |
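To illustrate DATAOPS_ADF_PIPELINE_PARAMETERS, the referenced file holds the ADF pipeline's parameters as a JSON object of name/value pairs. This is a minimal sketch with hypothetical parameter names; use the parameters your ADF pipeline actually declares:

```json
{
  "SourceContainer": "landing-zone",
  "LoadDate": "2023-01-01"
}
```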
The following points expand on these supported parameters:
DATAOPS_ADF_PIPELINE_RERUN_FROM
The Azure Data Factory orchestrator supports the ability to rerun the ADF pipeline in the event of a failure or timeout by rerunning the existing DataOps pipeline instead of running a new DataOps pipeline.
You cannot rerun a DataOps pipeline if another instance of the same DataOps pipeline has been run in the same environment.
The Azure Data Factory orchestrator uses the ADF pipeline runId to retry the run. You can define the point from which to resume the run in the variable DATAOPS_ADF_PIPELINE_RERUN_FROM.
The following DATAOPS_ADF_PIPELINE_RERUN_FROM values are available:
- START: the ADF pipeline rerun restarts from the beginning.
- FAILURE: the ADF pipeline run restarts from the point where it failed.
- <Name of activity>: the ADF pipeline run restarts from the specified activity.
START and FAILURE are case-sensitive reserved words for the Azure Data Factory orchestrator and cannot be used as activity names.
Example jobs
Here are several examples of ADF orchestrator jobs:
Standard job setup for the ADF orchestrator
Use this default setup if you do not need to rerun the ADF pipeline after a failure.
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_USE_IDENTITY: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
Retrying the job from the failed ADF activity
Use the DATAOPS_ADF_PIPELINE_RERUN_FROM parameter if you want to leverage the ADF pipeline recovery feature, as follows:
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_ADF_PIPELINE_RERUN_FROM: FAILURE
DATAOPS_USE_IDENTITY: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
Example using the service principal authorization
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_ADF_PIPELINE_RERUN_FROM: FAILURE
DATAOPS_AZURE_LOGIN_AS_SERVICE_PRINCIPAL: 1
TENANT_ID: <tenant_id>
AZURE_APP_ID: DATAOPS_VAULT(PATH.TO.APPLICATION_ID.IN.VAULT)
AZURE_CLIENT_SECRET: DATAOPS_VAULT(PATH.TO.CLIENT_SECRET.IN.VAULT)
DISABLE_ALLOW_NO_SUBSCRIPTIONS: 1
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png
Retrying the job from a specified ADF activity
Finally, here is an example that reruns from a named ADF activity. This example also shows how to use username/password authorization instead of the recommended inherited Azure identity.
"Azure Data Factory":
extends:
- .agent_tag
stage: "Batch Ingestion"
image: $DATAOPS_ADF_RUNNER_IMAGE
variables:
DATAOPS_ADF_ACTION: START
DATAOPS_ADF_FACTORY_NAME: dataopslivedatafactory
DATAOPS_ADF_PIPELINE_NAME: DummyPipeline
DATAOPS_ADF_RESOURCE_GROUP: dataops
DATAOPS_ADF_PIPELINE_PARAMETERS: $CI_PROJECT_DIR/dataops/azure/properties.json
DATAOPS_ADF_PIPELINE_RERUN_FROM: DummyActivity
AZURE_USER: DATAOPS_VAULT(PATH.TO.USERNAME.IN.VAULT)
AZURE_PASSWORD: DATAOPS_VAULT(PATH.TO.PASSWORD.IN.VAULT)
script:
- /dataops
icon: https://dataops-public-assets.s3.eu-west-2.amazonaws.com/adf.png