Modelling and Transformation Orchestrator
| Type | Pre-Set |
| --- | --- |
| Image | `$DATAOPS_TRANSFORM_RUNNER_IMAGE` |
The Modelling and Transformation (MATE) orchestrator is a pre-set orchestrator. It takes the models in the `/dataops/modelling` directory and executes them in a Snowflake data warehouse by compiling them to SQL and then running the resultant SQL statements.

Multiple operations are possible within the Modelling and Transformation Engine (MATE). To trigger the selected operation within MATE, set the parameter `TRANSFORM_ACTION` to one of the supported values.
Usage
You must always use the Modelling and Transformation orchestrator with the DataOps Reference Project, which provides, among others, the `.modelling_and_transformation_base` job.
```yaml
"Build all Models":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Data Transformation"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: RUN
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
Supported parameters
When you want to apply a transformation action to all jobs in a pipeline, you must define the relevant parameter at the project level, i.e., in the `pipelines/includes/config/variables.yml` file. If you set the parameter in a job definition, it applies only to that job.
| Parameter | Required/Default | Description |
| --- | --- | --- |
| `TRANSFORM_ACTION` | REQUIRED | Must be one of `BUILD`, `RUN`, `COMPILE`, `SNAPSHOT`, `DOCS`, `TEST`, `OPERATION`, `RENDER`, or `SEED` |
| `DATABASE_PROFILE` | REQUIRED. Defaults to `snowflake_operations` | Which dbt profile to use from `profiles.template.yml` |
| `TRANSFORM_PROJECT_PATH` | REQUIRED. Defaults to `$CI_PROJECT_DIR/templates/modelling` | The directory in the project structure where the base of the Modelling and Transformation project is located |
| `TRANSFORM_OPERATION_NAME` | REQUIRED if `TRANSFORM_ACTION` is set to `OPERATION` | The macro/operation to execute |
| `DATABASE_TARGET` | REQUIRED. Defaults to `other` | Which dbt profile target to use from `profiles.template.yml` |
| `DATAOPS_FEATURE_MATE_DBT_VERSION` | Optional. Defaults to `1.4` | Specifies the dbt version with which to run MATE. See Switching dbt version for more information |
| `TRANSFORM_DEBUG_TIMING` | Optional. Defaults to blank | If set, saves performance profiling information to a file `timing.log` in the `TRANSFORM_PROJECT_PATH` directory. To view the report, use the tool snakeviz: `snakeviz timing.log` |
| `TRANSFORM_EXTRA_PARAMS_AFTER` | Optional. Defaults to blank | Additional command-line arguments added to the end of the transform orchestrator process |
| `TRANSFORM_EXTRA_PARAMS_BEFORE` | Optional. Defaults to blank | Additional command-line arguments added to the beginning of the transform orchestrator process |
| `TRANSFORM_VARS` | Optional. Defaults to blank | Key-value pairs added when running models. See Adding/Overriding variables to dbt at runtime for usage |
| `TRANSFORM_FORCE_DEPS` | Optional. Defaults to blank | Decides whether to force a refresh of external transformation libraries before execution. Note: When using additional packages across your project, and not only in a few selected jobs in your pipeline, you must define this parameter at the project level, i.e., in the `variables:` section of `pipelines/includes/config/variables.yml`. Everything you define in this file is available to all jobs |
| `TRANSFORM_MODEL_INDIRECT_SELECTION_MODE` | Optional. Defaults to blank | dbt indirectly selects all tests that touch any resource you select directly; this is called "eager" indirect selection. If you don't want to include tests selected indirectly, set this variable to `cautious`. See Exclude tests selected indirectly for more information |
| `TRANSFORM_MODEL_SELECTOR` | Optional. Defaults to blank | If set, specifies which models to run, selected by model name or model-level tag, e.g., `person` or `tag:curation` |
| `TRANSFORM_YAML_SELECTOR` | Optional. Defaults to blank | If set, executes models according to the resource selectors defined in YAML. See Resource selectors in YAML for more information |
| `TRANSFORM_OPERATION_ARGS` | Optional. Defaults to `{}` | The YAML string representing the macro arguments, e.g., `{arg1: value1, arg2: 345}` |
| `TRANSFORM_PARTIAL_PARSE` | Optional. Defaults to blank | If set, disables partial parsing in the project. See the section on partial parsing for more information |
| `FULL_REFRESH` | Optional. Defaults to blank | If set, forces incremental models to be fully refreshed |
| `DATAOPS_REMOVE_RENDERED_TEMPLATES` | Optional. Defaults to blank | If set, the system removes any templates found after processing. This allows files of the format `<modelname>.template.yml` to be used without creating extra models in the project. Note: Set this variable in the `variables.yml` file in the project settings, not within your job |
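For example, a project-level default in `pipelines/includes/config/variables.yml` might look like the following sketch (the chosen values are illustrative, not recommendations):

```yaml
# pipelines/includes/config/variables.yml
variables:
  ## Project-wide defaults inherited by every job in the pipeline
  TRANSFORM_ACTION: RUN
  TRANSFORM_FORCE_DEPS: 1
```

Any job can still override these defaults in its own `variables:` section.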
The following topics reference and expand on several of these supported parameters:
Using TRANSFORM_ACTION: BUILD
Prerequisite (dbt 1.4 or later)
MATE with dbt adds a new `TRANSFORM_ACTION: BUILD`, which runs the following four commands in a single job:

- `RUN`
- `TEST`
- `SNAPSHOT`
- `SEED`

It follows the Directed Acyclic Graph (DAG) order for selected resources or the entire project. The `BUILD` action allows you to combine jobs spread across multiple stages into one.
Make sure you have set the dbt version with which to use MATE. See Switching dbt version.
The following example shows how to run `BUILD` for all models, snapshots, and seeds tagged with `finance`:
```yaml
"Build all Models":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Data Transformation"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: BUILD
    TRANSFORM_MODEL_SELECTOR: tag:finance
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
Testing reporting
The test reporting feature breaks down into the following categories:
1. Enable test reporting
The Transform orchestrator generates a test report when running a `TEST` job. To surface this report in the DataOps platform, the job must publish it as an artifact, as the `artifacts` section in the following example shows:
```yaml
"My Test Job":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Transformation Testing"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: TEST
  script:
    - /dataops
  artifacts:
    when: always
    reports:
      junit: $CI_PROJECT_DIR/report.xml
  icon: ${TRANSFORM_ICON}
```
Once the job has run successfully, it adds the test results to the pipeline's Tests tab.
2. Test states
A test result, or test state, falls into one of the following categories:
| State | Description | Example |
| --- | --- | --- |
| PASS | The test executed and passed | The target fulfilled its specified condition, such as the correct number of rows present in a database table |
| FAIL | The test executed and failed | The target did not meet its specified condition, such as an incorrect number of rows present in a table |
| ERROR | An error occurred while executing the test | An invalid column/table name specified in a table/dataset |
| SKIPPED | The test was skipped due to an unfulfilled condition | The test did not execute because of an unfulfilled condition in a dataset/table |
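These states correspond to the per-test `status` values dbt records in its `run_results.json` artifact. As a sketch of how such a report could be derived, the snippet below maps those statuses onto the states above (the status strings follow dbt's artifact schema; the mapping itself is an illustration, not the orchestrator's actual code):

```python
# Map dbt run_results.json "status" values onto MATE-style report states.
# Illustrative only -- not the Transform orchestrator's actual implementation.
STATE_MAP = {
    "pass": "PASS",
    "fail": "FAIL",
    "error": "ERROR",
    "skipped": "SKIPPED",
    "warn": "FAIL",  # e.g. when warnings are treated as failures
}

def summarize_test_states(run_results: dict) -> dict:
    """Count test results per report state from a parsed run_results.json."""
    counts = {"PASS": 0, "FAIL": 0, "ERROR": 0, "SKIPPED": 0}
    for result in run_results.get("results", []):
        state = STATE_MAP.get(result.get("status"))
        if state:
            counts[state] += 1
    return counts

# Minimal stand-in for a run_results.json payload
sample = {"results": [{"status": "pass"}, {"status": "fail"}, {"status": "skipped"}]}
print(summarize_test_states(sample))  # → {'PASS': 1, 'FAIL': 1, 'ERROR': 0, 'SKIPPED': 1}
```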
3. Test report control
The following parameters control the generation of test reports:
| Parameter | Required/Optional | Description |
| --- | --- | --- |
| `JSON_PATH` | Optional. Defaults to `$TRANSFORM_PROJECT_PATH/target/run_results.json` | The path to the JSON result generated by the Transform orchestrator |
| `REPORT_NAME` | Optional. Defaults to `report.xml` | The generated report name |
| `REPORT_DIR` | Optional. Defaults to `$CI_PROJECT_DIR` | The path where the generated report is saved. Note: This directory must already exist before the tests run |
| `TREAT_TEST_ERRORS_AS_FAILED` | Optional. Defaults to `FALSE` | If enabled, reports a test error as FAIL. See Test states for more information |
| `TREAT_TEST_WARNS_AS_FAILED` | Optional. Defaults to `FALSE` | If enabled, reports a test warning as FAIL. See Test states for more information |
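As an illustration, a test job could combine these parameters to rename the report and write it to a custom directory (the job name, report name, and directory below are hypothetical; the directory must already exist before the tests run):

```yaml
"My Test Job with Custom Report":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Transformation Testing"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: TEST
    REPORT_NAME: mate_tests.xml          # illustrative name
    REPORT_DIR: $CI_PROJECT_DIR/reports  # must exist before the tests run
    TREAT_TEST_WARNS_AS_FAILED: TRUE
  script:
    - /dataops
  artifacts:
    when: always
    reports:
      junit: $CI_PROJECT_DIR/reports/mate_tests.xml
  icon: ${TRANSFORM_ICON}
```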
Partial parse
Partial parsing can improve the performance of DataOps pipeline runs by limiting the number of files a pipeline must parse every time it runs. Here, "parsing" means reading the files in a project from disk and capturing `ref()` and `config()` method calls.

These method calls are used to determine the following:

- The shape of the dbt DAG (Directed Acyclic Graph)
- The supplied resource configurations
There is no need to re-parse these files if partial parsing is enabled and the files are unchanged between job requests. The Transform orchestrator can use the parsed representation from the last job request. However, if a file has changed between invocations, then the orchestrator will re-parse the file and update the parsed node cache accordingly.
Partial parsing is enabled by default. To disable it, set `TRANSFORM_PARTIAL_PARSE` to `1`.
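For example, to turn partial parsing off for every MATE job in a pipeline, the variable could be set at the project level (a sketch):

```yaml
# pipelines/includes/config/variables.yml
variables:
  ## Disable partial parsing for all MATE jobs
  TRANSFORM_PARTIAL_PARSE: 1
```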
To utilize partial parsing in a DataOps project, enable caching in the `.modelling_and_transformation_base` job by overriding the settings from the reference project base job, creating a definition in your project as follows:
```yaml
.modelling_and_transformation_base:
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    DATAOPS_TEMPLATES_DIR: /tmp/local_config
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
  cache:
    key: $CI_PIPELINE_ID
    paths:
      - dataops/modelling/target/
  icon: ${TRANSFORM_ICON}
```
Switching dbt version
Prerequisite (dbt 1.4 or later)
You can select the dbt version you want to use to run MATE actions using the variable `DATAOPS_FEATURE_MATE_DBT_VERSION`. This variable automatically supports the latest dbt minor release. The supported versions include bundled packages as well.

Currently, MATE uses dbt version 1.0 by default. We highly recommend selecting a later version, as 1.0 is no longer supported by dbt Labs. To select a later, supported version of dbt, specify the version in the parameter `DATAOPS_FEATURE_MATE_DBT_VERSION`, e.g., `1.4`.

Installing specific or unsupported dbt versions at runtime is also possible; set the value in the variable to use them with MATE. See the topic below for more details.
Using a specific or unsupported version
You can specify a dbt version currently not supported by DataOps, such as `1.3`, or a specific version of dbt-core, such as `1.4.3`, in the variable `DATAOPS_FEATURE_MATE_DBT_VERSION`.
If you set an unsupported version, MATE will attempt to install the required dbt-core packages for the specified value. Use at your own risk.
Bundled packages such as `dataops`, `dbtvault`, etc., are unavailable if an unsupported version of dbt is specified. In that case, you must set the required package in the file `dataops/modelling/packages.yml` and enable the variable `TRANSFORM_FORCE_DEPS`. Read more about this package installation at Installing additional packages.
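As an illustration, such a manually managed `packages.yml` might pin the package explicitly (the package coordinates and version shown are examples only; use those your project requires):

```yaml
# dataops/modelling/packages.yml
packages:
  # Example only: pin the exact package version your project requires
  - package: Datavault-UK/dbtvault
    version: 0.9.0
```

Remember to also enable `TRANSFORM_FORCE_DEPS` so the package is refreshed before execution.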
Use the following formats to install new dbt versions:
| Value | Installed dbt-core Version | Installed dbt-snowflake Version |
| --- | --- | --- |
| `1.3` | Highest available 1.3 version (>=1.3 and <1.4) | Highest available 1.3 version (>=1.3 and <1.4) |
| `1.4.3` | 1.4.3 | Highest available 1.4 version (>=1.4 and <1.5) |
Switching dbt version in MATE does not support installing pre-release versions of dbt.
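Following the formats above, pinning MATE to a specific dbt-core release is a one-line project-level setting (a sketch; the version shown is illustrative):

```yaml
# pipelines/includes/config/variables.yml
variables:
  ## Installs dbt-core 1.4.3 and the highest dbt-snowflake 1.4.x
  DATAOPS_FEATURE_MATE_DBT_VERSION: 1.4.3
```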
Defining a custom profiles.template.yml
You can customize `profiles.template.yml` in the `/dataops/profiles` project directory for the MATE orchestrator to use. If no `profiles.template.yml` is defined in `/dataops/profiles`, a default one is used.
The rendered file `/dataops/profiles/profiles.yml` is removed after the `/dataops` script call to prevent exposing credentials. You can set the parameter `PREVENT_KEY_PAIR_CLEANUP` to `1` to prevent the removal; doing so also avoids cleaning up key-pair files (if in use).
The `/dataops/profiles/profiles.yml` file contains all the details required to connect to your Snowflake account, including sensitive information like credentials.

You can set `DATABASE_PROFILE` and `DATABASE_TARGET` to execute a macro with higher privileges. See Running a macro using the SOLE Admin role.
The default profiles.template.yml
The default `profiles.template.yml` is provided below. It defines the default `DATABASE_PROFILE` named `snowflake_operations` and the default `DATABASE_TARGET` named `other`.
```yaml
snowflake_operations:
  target: other
  outputs:
    other:
      type: snowflake
      account: {{ SNOWFLAKE.ACCOUNT }}
      user: {{ SNOWFLAKE.TRANSFORM.USERNAME }}
      role: {{ SNOWFLAKE.TRANSFORM.ROLE }}
      {% if env.DATAOPS_SNOWFLAKE_AUTH %}
      {% if env.DATAOPS_SNOWFLAKE_AUTH == "KEY_PAIR" %}
      private_key_path: {{ env.DATAOPS_SNOWFLAKE_KEY_PAIR_PATH }}
      {% if env.DATAOPS_SNOWFLAKE_PASSPHRASE %}
      private_key_passphrase: {{ env.DATAOPS_SNOWFLAKE_PASSPHRASE }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.TRANSFORM.PASSWORD }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.TRANSFORM.PASSWORD }}
      {% endif %}
      database: {{ env.DATABASE }}
      schema: BASESCHEMA
      warehouse: {{ SNOWFLAKE.TRANSFORM.WAREHOUSE }}
      threads: {{ SNOWFLAKE.TRANSFORM.THREADS }}
      client_session_keep_alive: False
    qa:
      type: snowflake
      account: {{ SNOWFLAKE.ACCOUNT }}
      user: {{ SNOWFLAKE.TRANSFORM.USERNAME }}
      role: {{ SNOWFLAKE.TRANSFORM.ROLE }}
      {% if env.DATAOPS_SNOWFLAKE_AUTH %}
      {% if env.DATAOPS_SNOWFLAKE_AUTH == "KEY_PAIR" %}
      private_key_path: {{ env.DATAOPS_SNOWFLAKE_KEY_PAIR_PATH }}
      {% if env.DATAOPS_SNOWFLAKE_PASSPHRASE %}
      private_key_passphrase: {{ env.DATAOPS_SNOWFLAKE_PASSPHRASE }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.TRANSFORM.PASSWORD }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.TRANSFORM.PASSWORD }}
      {% endif %}
      database: {{ env.DATABASE }}
      schema: BASESCHEMA
      warehouse: {{ SNOWFLAKE.TRANSFORM.WAREHOUSE }}
      threads: {{ SNOWFLAKE.TRANSFORM.THREADS }}
      client_session_keep_alive: False
    prod:
      type: snowflake
      account: {{ SNOWFLAKE.ACCOUNT }}
      user: {{ SNOWFLAKE.TRANSFORM.USERNAME }}
      role: {{ SNOWFLAKE.TRANSFORM.ROLE }}
      {% if env.DATAOPS_SNOWFLAKE_AUTH %}
      {% if env.DATAOPS_SNOWFLAKE_AUTH == "KEY_PAIR" %}
      private_key_path: {{ env.DATAOPS_SNOWFLAKE_KEY_PAIR_PATH }}
      {% if env.DATAOPS_SNOWFLAKE_PASSPHRASE %}
      private_key_passphrase: {{ env.DATAOPS_SNOWFLAKE_PASSPHRASE }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.TRANSFORM.PASSWORD }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.TRANSFORM.PASSWORD }}
      {% endif %}
      database: {{ env.DATABASE }}
      schema: BASESCHEMA
      warehouse: {{ SNOWFLAKE.TRANSFORM.WAREHOUSE }}
      threads: {{ SNOWFLAKE.TRANSFORM.THREADS }}
      client_session_keep_alive: False
snowflake_ingestion:
  outputs:
    default:
      type: snowflake
      account: {{ SNOWFLAKE.ACCOUNT }}
      user: {{ SNOWFLAKE.INGESTION.USERNAME }}
      role: {{ SNOWFLAKE.INGESTION.ROLE }}
      {% if env.DATAOPS_SNOWFLAKE_AUTH %}
      {% if env.DATAOPS_SNOWFLAKE_AUTH == "KEY_PAIR" %}
      private_key_path: {{ env.DATAOPS_SNOWFLAKE_KEY_PAIR_PATH }}
      {% if env.DATAOPS_SNOWFLAKE_PASSPHRASE %}
      private_key_passphrase: {{ env.DATAOPS_SNOWFLAKE_PASSPHRASE }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.INGESTION.PASSWORD }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.INGESTION.PASSWORD }}
      {% endif %}
      database: {{ DATABASE }}
      schema: BASESCHEMA
      warehouse: {{ SNOWFLAKE.INGESTION.WAREHOUSE }}
      threads: {{ SNOWFLAKE.INGESTION.THREADS }}
      client_session_keep_alive: False
snowflake_master:
  outputs:
    default:
      type: snowflake
      account: {{ SNOWFLAKE.ACCOUNT }}
      user: {{ SNOWFLAKE.MASTER.USERNAME }}
      role: {{ SNOWFLAKE.MASTER.ROLE }}
      {% if env.DATAOPS_SNOWFLAKE_AUTH %}
      {% if env.DATAOPS_SNOWFLAKE_AUTH == "KEY_PAIR" %}
      private_key_path: {{ env.DATAOPS_SNOWFLAKE_KEY_PAIR_PATH }}
      {% if env.DATAOPS_SNOWFLAKE_PASSPHRASE %}
      private_key_passphrase: {{ env.DATAOPS_SNOWFLAKE_PASSPHRASE }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.MASTER.PASSWORD }}
      {% endif %}
      {% else %}
      password: {{ SNOWFLAKE.MASTER.PASSWORD }}
      {% endif %}
      database: {{ env.DATABASE }}
      schema: BASESCHEMA
      warehouse: {{ SNOWFLAKE.TRANSFORM.WAREHOUSE }}
      threads: {{ SNOWFLAKE.TRANSFORM.THREADS }}
      client_session_keep_alive: False
config:
  send_anonymous_usage_stats: False
```
Migrating to the /dataops/profiles directory
If you had customized the `profiles.template.yml` in the DataOps Runner host directory `/app`, you need to copy the content to the new project git directory `/dataops/profiles`. To leverage key pair authentication, start from the default `profiles.template.yml` and merge in your modifications. In most cases, using the default file is sufficient.

If no profile file named `profiles.template.yml` is present in the project directory `/dataops/profiles`, the default is used.
Authentication
Key pair authentication
The MATE orchestrator supports key pair authentication. To learn more about configuring it, see Key Pair Authentication.
Example jobs
The examples below extend the base job `.modelling_and_transformation_base` to simplify the MATE job definition. See the reference project base job for all details about these examples.
1. Build all models
Build all the models in your project:
```yaml
"Build all Models":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Data Transformation"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: RUN
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
2. Build a directory of models
Build all the models in the `divisions/finance` directory:
```yaml
"Build all Models":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Data Transformation"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: models/divisions/finance
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
3. Build tagged models
Build all the models tagged `finance`:
```yaml
"Build all Models":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Data Transformation"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: tag:finance
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
4. Test all models
The `TRANSFORM_MODEL_SELECTOR` variable works the same way with `TEST` as it does with `RUN`.
```yaml
"Test all Models":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Transformation Testing"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: TEST
  script:
    - /dataops
  artifacts:
    when: always
    reports:
      junit: $CI_PROJECT_DIR/report.xml
  icon: ${TESTING_ICON}
```
5. Build models by running a macro
Rather than building/testing all or part of the MATE models, a MATE job can also execute a standalone macro as its primary operation.
```yaml
Run My Macro:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Additional Configuration"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: OPERATION
    TRANSFORM_OPERATION_NAME: my_macro
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
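For context, the macro named in `TRANSFORM_OPERATION_NAME` is an ordinary dbt macro in the project's macros directory. A minimal sketch of what `my_macro` could contain (the file path, grant statement, and role name are purely illustrative):

```sql
-- dataops/modelling/macros/my_macro.sql (illustrative path and contents)
{% macro my_macro() %}
  {% do log("Running my_macro as a standalone operation", info=True) %}
  {% set sql %}
    -- example statement only; substitute your own DDL/DML
    grant usage on schema BASESCHEMA to role REPORTER
  {% endset %}
  {% do run_query(sql) %}
{% endmacro %}
```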
6. Run a macro using the SOLE Admin role
Setting `DATABASE_PROFILE` and `DATABASE_TARGET` to the values `snowflake_master` and `default`, respectively (as in the example below), executes the macro using the higher privileges that SOLE uses.
```yaml
"Run My Macro as Admin":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Additional Configuration"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: OPERATION
    TRANSFORM_OPERATION_NAME: my_macro
    DATABASE_PROFILE: snowflake_master
    DATABASE_TARGET: default
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
7. Resource selectors in YAML
Prerequisite (dbt 1.4 or later)
Build models according to what you set in the resource selector in YAML, which allows you to write resource selector configurations in a human-readable and reusable format:
Full YAML:

```yaml
selectors:
  - name: business_vault_tests
    description: "Non-incremental Snowplow models that power nightly exports"
    definition:
      union:
        - method: tag
          value: datavault_business_vault
        - exclude:
            - union:
                - method: tag
                  value: datavault_raw
                - method: tag
                  value: datavault_stage
```
CLI-style:

```yaml
selectors:
  - name: business_vault_tests
    description: "Non-incremental Snowplow models that power nightly exports"
    definition:
      union:
        - "tag:datavault_business_vault"
        - exclude:
            - union:
                - "tag:datavault_raw"
                - "tag:datavault_stage"
```
Set the name of the selector in `TRANSFORM_YAML_SELECTOR`:
```yaml
Run My Macro:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Additional Configuration"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_YAML_SELECTOR: business_vault_tests
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
8. Exclude tests selected indirectly
You can select tests directly by methods and operators that capture one of the tests' attributes: name, properties, tags, etc. But dbt also indirectly selects all tests that touch any resource you select directly. See the example below for more information.
```yaml
"Test Direct Selection Model":
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Transformation Testing"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: TEST
    TRANSFORM_MODEL_SELECTOR: tag:direct
    TRANSFORM_MODEL_INDIRECT_SELECTION_MODE: cautious
  script:
    - /dataops
  artifacts:
    when: always
    reports:
      junit: $CI_PROJECT_DIR/report.xml
  icon: ${TESTING_ICON}
```
Test against a directly tagged parent model:

```yaml
version: 2
models:
  - name: direct_selection_model
    description: "dbt model"
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - unique
```
Test against an indirectly tagged child model (referenced from the directly tagged model):

While testing a directly tagged model, if you set the `TRANSFORM_MODEL_INDIRECT_SELECTION_MODE` variable to `cautious` in the pipeline, dbt will not include the test below for the child model.

```yaml
version: 2
models:
  - name: indirect_selection_model
    tests:
      - dataops.same_rows_as:
          compare_model: ref('direct_selection_model')
    description: >
      Some very interesting stuff
```
For more details about indirect selection, see test selection examples.
9. Adding/Overriding variables to dbt at runtime
Prerequisite (dbt 1.4 or later)
You can specify variables with a dynamic or run-time value using the parameter `TRANSFORM_VARS`. If set, the value is added to the dbt command-line arguments as `--vars`.
TRANSFORM_VARS as string:

```yaml
Run My Macro:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Additional Configuration"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_VARS: "{ var_one: Value one, var_two: 2 }"
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
TRANSFORM_VARS as dictionary:

```yaml
Run My Macro:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  stage: "Additional Configuration"
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_VARS: >-
      { var_one: Value one, var_two: 2 }
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```
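Within a model, dbt's standard `var()` function reads the values supplied through `TRANSFORM_VARS`. A hypothetical model using the two variables above (the model path is illustrative):

```sql
-- models/example/var_demo.sql (hypothetical model)
select
  '{{ var("var_one") }}' as var_one,  -- resolves to: Value one
  {{ var("var_two", 0) }} as var_two  -- resolves to: 2 (0 is a fallback default)
```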