With the DataOps.live platform, you can use pipelines to build, test, deploy, and update standalone or composite data products in your DataOps project. DataOps pipelines use the data product specification to create and publish data product manifests that hold all the properties and information of a data product and are refreshed every time a pipeline runs.
Follow the steps below to build data products or to migrate your existing DataOps pipelines to data products. See Building Data Products for detailed information.
Creating your first data product
DataOps.live projects are the central hub where teams collaborate, build, and manage data products. Each project has visual flows representing the pipelines of data transformations and movement from start to finish.
You can structure and organize each data product by its data characteristics and relationships within the data ecosystem, allowing other teams and users to discover and reuse what has already been done.
Let's see how you can create your first pipeline in your DataOps project to build a data product. All execution of DataOps code happens within DataOps pipelines, which comprise a series of individual jobs.
First, modify your dbt connection profile so that dbt can also connect to your warehouse during development. To do so, browse to /dataops/modelling in your project, open dbt_project.yml, and add the profile.
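As a minimal sketch, assuming your profiles.yml defines a connection profile named snowflake_operations (an illustrative name), the dbt_project.yml entry looks like this:

```yaml
# dbt_project.yml (excerpt)
# "profile" tells dbt which connection profile from profiles.yml to use.
# snowflake_operations is an illustrative name; substitute your own.
profile: snowflake_operations
```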
Second, set up DDE (DataOps Development Environment), the ready-to-code environment for your DataOps use cases. Add a .gitpod.yml file to the project directory containing the data product. This instructs Gitpod to use the correct container. For example:
```yaml
tasks:
  - name: Setup
    command: |
      # Needed when comparing from a feature branch
      gp env GOLDEN_MANIFEST_FROM_DIFFERENT_BRANCH=1
      # Installs optional pre-commit hook(s)
      # pip install pre-commit
      # pre-commit install

# The current image used for data products in DDE during the private preview
```
See Working with data products in DDE for more information.
Finally, configure your project with the variables required for the private preview.
1. Creating a Data Product Specification file
First, create the folder data-product-definitions in your project directory under dataops. Then create a new data product specification file my_data_product.yml with the following structure:
```yaml
id: [data product id]
name: [data product name]
description: [data product description]
dataset:
  name: [Name of the dataset]
  description: [description of the dataset]
  mate_select: [selector of the models that we would like to include in the data product]
output_ports:
  - id: [id of the output_port]
    name: [output port name]
    type: [type, usually Tables]
    description: [description of the output port]
    test_select: [selector of the tests that we would like to include in the data product]
service_level_objectives:
  - name: [name of the SLO]
    description: [description of the SLO]
    test_select: [selector of the test that is related to this SLO]
```
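For instance, a filled-in specification might look like the following sketch; the ids, names, and tag:orders selectors are illustrative and assume your dbt models and tests carry matching tags:

```yaml
id: 8f2d1c3e-0000-0000-0000-000000000001   # illustrative id
name: Customer Orders
description: Curated, tested view of customer orders for downstream teams
dataset:
  name: ORDERS
  description: One row per customer order, refreshed by the pipeline
  mate_select: tag:orders              # illustrative model selector
output_ports:
  - id: orders_tables                  # illustrative id
    name: Orders tables
    type: Tables
    description: Tables exposed to consumers of this data product
    test_select: tag:orders            # illustrative test selector
service_level_objectives:
  - name: orders_freshness
    description: Orders data is no older than 24 hours
    test_select: tag:freshness         # illustrative test selector
```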
2. Creating a Data Product orchestrator job in the project
The Data Product orchestrator enriches the data product specification with the metadata from the pipeline run. At the end of the pipeline, the orchestrator generates the data product manifest as a merged document from the specification and the data product snippets and publishes the manifest into the data product registry.
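Conceptually, the published manifest is the specification enriched with metadata from the run. The following sketch is purely illustrative; the pipeline_run section and its field names are hypothetical, not the actual manifest schema:

```yaml
# Hypothetical merged manifest (illustrative field names only)
id: 8f2d1c3e-0000-0000-0000-000000000001
name: Customer Orders
dataset:
  name: ORDERS
pipeline_run:                      # hypothetical section added by the orchestrator
  pipeline_id: 12345               # id of the run that produced this manifest
  published_at: "2024-01-15T09:30:00Z"
```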
In your project directory under pipelines/includes/local_includes/, create a data_product_orchestrator.yml job file with the following structure:

```yaml
"Data Product Orchestrator":
  stage: "Data Product"
  variables:
    # path to the Data Product Manifest that we would like to use for backward compatibility
    # DATAOPS_DATA_PRODUCT_REFERENCE_FILE: dataops/data-product-definitions/reference_manifest.yml
    # DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH: 1
  artifacts:
    name: "Data Product Manifest Viewer"
```
You can use different methods to transform data in your data products. For more information, see Creating Custom Data Products Snippets.
Then add the stage Data Product in the stages file (pipelines/includes/config/stages.yml) towards the end, right before Clean Up:

```yaml
stages:
  - Pipeline Initialisation
  - Vault Initialisation
  - Snowflake Setup
  - Additional Configuration
  - Data Ingestion
  - Source Testing
  - Data Transformation
  - Transformation Testing
  - Generate Docs
  - Data Product
  - Clean Up
```
Finally, add the job /pipelines/includes/local_includes/data_product_orchestrator.yml to the full-ci.yml pipeline file.
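As a sketch, assuming full-ci.yml follows the usual DataOps pattern of including job files via a GitLab CI include list, the entry looks like this:

```yaml
# full-ci.yml (excerpt)
include:
  - /pipelines/includes/local_includes/data_product_orchestrator.yml
```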
3. Creating a deploy token to publish to the registry
You must create a deploy token named gitlab-deploy-token with read and write registry access; GitLab exposes a token with this exact name to pipeline jobs as CI_DEPLOY_PASSWORD, so you cannot use a different name. You need the Maintainer access level on the project to create deploy tokens:
First, open your project and browse to Settings > Repository. Then expand Deploy tokens and enter gitlab-deploy-token in the Name field.
Complete the deploy token setup by selecting the read_package_registry and write_package_registry checkboxes, then click Create deploy token.
4. Building your first data product
If everything is set up correctly, you can run the pipeline defined in the full-ci.yml file in your project directory. See Running Pipelines for more information about the methods of running pipelines.
5. Configuring backward compatibility
Subsequent pipeline runs can check for backward compatibility. To enable this, add a reference manifest holding the last approved data product metadata; the pipeline uses it to detect breaking metadata changes to the dataset and SLOs (Service Level Objectives).
First, in your project, navigate to Packages and registries > Package Registry. Then click the folder of the relevant data product (DataProductA) and, in the window that opens, click the latest data product YAML file to download it. Next, add the downloaded file to the project repository at dataops/data-product-definitions/, e.g. as reference_manifest.yml, to match the path used by the orchestrator job.
Finally, in your project directory under pipelines/includes/local_includes, uncomment the following variables in the "Data Product Orchestrator" job: