# Data Product Orchestrator

Editions: Professional, Enterprise

| Image | $DATAOPS_DATAPRODUCT_RUNNER_IMAGE |
| --- | --- |
| Feature Status | PubPrev |
The Data Product orchestrator simplifies the generation of data products within the CI/CD process. It facilitates pipeline runs to build, test, deploy, and update data products within your DataOps project. For more information, see Data Product Pipelines.
## Usage
The Data Product orchestrator enriches the data product specification with the metadata from the pipeline run. At the end of the pipeline, the orchestrator generates the data product manifest as a merged document from the specification and the data product snippets and publishes the manifest into the data product registry.
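For orientation, the sketch below shows roughly what a specification file might contain. The field names are illustrative assumptions based on the dataset/SLI selector mechanism described later on this page, not the authoritative schema:

```yaml
# dataops/data-product-definitions/my_data_product.yml
# Illustrative sketch only -- the field names are assumptions, not the authoritative schema
id: my-data-product # stable identifier compared during backward compatibility testing
version: 1.0.0
dataset:
  # MATE selector that picks the models belonging to this data product
  - mate_selector: tag:my_data_product
service_level_indicators:
  # MATE tests whose results feed the service-level indicators
  - mate_selector: test_name:not_null_orders_order_id
```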
In your project directory under `pipelines/includes/local_includes/`, create a `data_product_orchestrator.yml` job file with the following structure:
"Data Product Orchestrator":
extends:
- .agent_tag
stage: "Data Product"
image: $DATAOPS_DATAPRODUCT_ORCHESTRATOR_IMAGE
variables:
DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: dataops/data-product-definitions/my_data_product.yml
# path to the Data Product Manifest that we would like to use for backward compatibility
# DATAOPS_DATA_PRODUCT_REFERENCE_FILE: dataops/data-product-definitions/reference_manifest.yml
# DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH: 1
script:
- /dataops
artifacts:
paths:
- dataops/report/data_product_manifest_viewer/
name: "Data Product Manifest Viewer"
expose_as: "dataopsreport"
icon: ${DATAOPS_ICON}
You can use different methods to transform data in your data products. For more information, see Creating Custom Data Products Snippets.
Then add the stage `Data Product` to the stages list (`pipelines/includes/config/stages.yml`) towards the end, right before `Clean Up`:
```yaml
stages:
  - Pipeline Initialisation
  - Vault Initialisation
  - Snowflake Setup
  - Additional Configuration
  - Data Ingestion
  - Source Testing
  - Data Transformation
  - Transformation Testing
  - Generate Docs
  - Data Product
  - Clean Up
```
Finally, add the job file `/pipelines/includes/local_includes/data_product_orchestrator.yml` to the `full-ci.yml` pipeline file:
```yaml
include:
  - /pipelines/includes/bootstrap.yml
  ...
  - /pipelines/includes/local_includes/data_product_orchestrator.yml
```
## Enriching the data product specification
To enrich the data product specification with metadata from the orchestrator, that is, the tables used in the pipeline and the test results, you must add the following variables and settings:

- Use the following image for the orchestrator:

  ```yaml
  "Data Product Orchestrator":
    image: $DATAOPS_DATAPRODUCT_RUNNER_IMAGE
  ```

- Add a variable that points to the source data product specification you are using to build the data product:

  ```yaml
  "Data Product Orchestrator":
    variables:
      DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: dataops/data-product-definitions/data_product_1.yml
  ```

  Using the MATE selectors in the data product specification file, the orchestrator adds all objects and tests that are part of the data product. The orchestrator also adds the following pipeline run metadata to the source data product specification:

  ```yaml
  commit: <the commit ID>
  branch: <the branch name>
  pipeline_id: <the pipeline ID>
  run_start: <the pipeline start datetime>
  publication_datetime: <the datetime of generating the enriched data product definition>
  ```

  The MATE orchestrator extracts the relevant information from the MATE logs based on the selectors provided in the data product definition (in the dataset and SLI sections). The Data Product orchestrator then adds these extracts to the data product specification file.
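For example, after a pipeline run, the enriched specification might carry metadata values such as these (all values below are illustrative, not output captured from a real run):

```yaml
commit: 9f2c41a7
branch: main
pipeline_id: 48213
run_start: 2024-03-14 09:14:27
publication_datetime: 2024-03-14 09:32:05
```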
## Performing backward compatibility testing
To use the orchestrator to run a backward compatibility test against an existing data product, include the following variable in the orchestrator definition:

```yaml
"Data Product Orchestrator":
  variables:
    DATAOPS_DATA_PRODUCT_REFERENCE_FILE: <path to the reference data product manifest>
```
The job checks whether the dataset and SLO sections of the produced data product match the reference data product. The job fails if the ID and version are identical but the data product manifest has new or dropped attributes.
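As a sketch of what counts as a breaking change, consider a manifest whose dataset section lists objects and their columns. The layout and names below are illustrative assumptions, not the authoritative manifest schema; dropping a column between the reference manifest and the newly produced one fails the job:

```yaml
# Reference manifest (excerpt) -- illustrative layout and names
dataset:
  - name: CUSTOMER_ORDERS
    columns:
      - name: ORDER_ID
      - name: ORDER_TOTAL

# Newly produced manifest (excerpt) with the same ID and version
dataset:
  - name: CUSTOMER_ORDERS
    columns:
      - name: ORDER_ID
      # ORDER_TOTAL was dropped: a breaking change, so the job fails
```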
## Supported optional parameters
The following parameters are optional for the Data Product orchestrator:
| Parameter | Value | Description |
| --- | --- | --- |
| `DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH` | `0` or `1` (default `0`) | If set to `1`, the database value is skipped during the validation |
| `DATAOPS_DATA_PRODUCT_ALLOW_BREAKING_CHANGES` | `0` or `1` (default `0`) | If set to `1`, the orchestrator raises a warning instead of failing the pipeline when there is a breaking change or new attributes |
| `DATAOPS_DATA_PRODUCT_EXCLUDE_OBJECT_ATTRIBUTES_LIST` | Default: `columns,mate_unique_id,type` | Comma-separated list of the object attributes to exclude from the backward compatibility check |
| `DATAOPS_DATA_PRODUCT_EXCLUDE_COLUMN_ATTRIBUTES_LIST` | Default: `index,comment` | Comma-separated list of the column attributes to exclude from the backward compatibility check |
| `DATAOPS_DATA_PRODUCT_EXCLUDE_SLO_ATTRIBUTES_LIST` | Default: `description,test_select` | Comma-separated list of the SLO attributes to exclude from the backward compatibility check |
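For example, to downgrade breaking changes to warnings and extend the SLO exclusions, you could set the parameters as job variables. This sketch assumes that setting an exclusion list variable replaces the default list, and the extra `owner` attribute is a hypothetical example:

```yaml
"Data Product Orchestrator":
  variables:
    # Raise a warning instead of failing the pipeline on breaking changes
    DATAOPS_DATA_PRODUCT_ALLOW_BREAKING_CHANGES: 1
    # Keep the default exclusions and also ignore a hypothetical "owner" SLO attribute
    DATAOPS_DATA_PRODUCT_EXCLUDE_SLO_ATTRIBUTES_LIST: description,test_select,owner
```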
## Example jobs
Creating data products involves several basic steps, including adding the data product orchestrator to the project pipeline. Below is an example showing a typical job for building a data product.
For more information on the workflow for building data products, see the Standalone data products section.
"Data Product Orchestrator":
extends:
- .agent_tag
stage: "Data Product"
image: $DATAOPS_DATAPRODUCT_RUNNER_IMAGE
variables:
DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: [path to the data product specification file]
DATAOPS_DATA_PRODUCT_REFERENCE_FILE: [path to the Data Product Manifest that we would like to use for backward compatability][optional]
DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH: 1 [optional]
script:
- /dataops
artifacts:
paths:
- dataops/report/data_product_manifest_viewer/
name: "Data Product Manifest Viewer"
expose_as: "dataopsreport"
icon: ${DATAOPS_ICON}