
Data Product Pipelines

Feature release status: Public Preview

With the data product platform, you can use pipelines to build, test, deploy, and update standalone or composite data products in your DataOps project. DataOps pipelines use the data product specification to create and publish data product manifests that hold all the properties and information of a data product, refreshing them on every pipeline run.

Follow the steps below to build data products or migrate your existing DataOps pipelines to data products. See Building Data Products for detailed information.

Creating your first data product

DataOps.live projects are the central hub where teams collaborate, build, and manage data products. Each project has visual flows representing the pipelines of data transformations and movement from start to finish.

You can structure and organize each data product through data characteristics and relationships within a data ecosystem to allow other teams or users to discover and reuse what has already been done.

Let's see how you can build your first pipeline in your DataOps project to build a data product. All executions of DataOps code happen within DataOps pipelines that comprise a series of individual jobs.

Prerequisites

  • Modify your dbt connection profile so that dbt can also connect to your warehouse during development. To do so, browse to /dataops/modelling in your project, open dbt_project.yml, and add the profile snowflake_operations.

    /dataops/modelling/dbt_project.yml
    name: MyProject
    version: "1.0"
    config-version: 2
    profile: snowflake_operations
  • Set up the DataOps development environment - the ready-to-code environment for your DataOps use cases.

    Add a .gitpod.yml file to the project directory containing the data product. This instructs Gitpod to use the correct container. For example:

    .gitpod.yml
    tasks:
      - name: Setup
        before: |
          # Needed when comparing from a feature branch
          gp env GOLDEN_MANIFEST_FROM_DIFFERENT_BRANCH=1
          /dataops-cde/scripts/dataops_cde_setup.sh
          # Installs optional pre-commit hook(s)
          # pip install pre-commit
          # pre-commit install

    # The current image used for data products in the development environment
    image: $DATAOPS_DATAPRODUCT_RUNNER_IMAGE

    See Working with data products in the DataOps development environment for more information.

Step 1 - Create a Data Product Specification file

First, create the folder data-product-definitions in your project directory under /dataops.
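From the project root, a single shell command creates it:

mkdir -p dataops/data-product-definitions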

[Image: Creating the data product specification file]

Then create a new data product specification file my_data_product.yml with the following structure:

my_data_product.yml
id: [data product id]
name: [data product name]
description: [data product description]
schema_version: 1.0.0
dataset:
  name: [name of the dataset]
  description: [description of the dataset]
  mate_models:
    - select: [selector of the models to include in the data product]
output_ports:
  - id: [id of the output port]
    name: [output port name]
    type: [type, usually Tables]
    description: [description of the output port]
service_level_indicators:
  - mate_tests:
      - select: [selector of the tests to include in the data product]
service_level_objectives:
  - name: [name of the SLO]
    description: [description of the SLO]
    test_select: [selector of the test related to this SLO]
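For orientation, here is a hypothetical filled-in specification. Every value below (IDs, names, selectors) is illustrative, not taken from a real project:

my_data_product.yml (illustrative)
id: orders_data_product
name: Orders
description: Curated orders data for downstream analytics teams
schema_version: 1.0.0
dataset:
  name: Orders dataset
  description: One row per customer order, refreshed by every pipeline run
  mate_models:
    - select: tag:orders
output_ports:
  - id: orders_tables
    name: Orders tables
    type: Tables
    description: Snowflake tables backing the Orders data product
service_level_indicators:
  - mate_tests:
      - select: tag:orders_quality
service_level_objectives:
  - name: Order completeness
    description: All orders pass the not-null and uniqueness tests
    test_select: tag:orders_quality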

Step 2 - Create a Data Product orchestrator job in the project

The Data Product orchestrator enriches the data product specification with metadata from the pipeline run. At the end of the pipeline, it generates the data product manifest by merging the specification with the data product snippets, then publishes the manifest to the data product registry.
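Purely as an illustration of that merge (the pipeline_metadata block and its field names are hypothetical, not the actual manifest schema), the published manifest carries the specification content plus run details:

# Illustrative manifest fragment; real field names may differ
id: orders_data_product
name: Orders
...
# Enriched by the orchestrator from the pipeline run
pipeline_metadata:
  pipeline_id: 12345
  commit_sha: 9f3c2ab
  run_timestamp: "2024-01-15T09:30:00Z"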

In your project directory under pipelines/includes/local_includes/, create a data_product_orchestrator.yml job file with the following structure:

data_product_orchestrator.yml
"Data Product Orchestrator":
  extends:
    - .agent_tag
  stage: "Data Product"
  image: $DATAOPS_DATAPRODUCT_ORCHESTRATOR_IMAGE
  variables:
    DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: dataops/data-product-definitions/my_data_product.yml
    # Path to the data product manifest to use for backward compatibility checks
    # DATAOPS_DATA_PRODUCT_REFERENCE_FILE: dataops/data-product-definitions/reference_manifest.yml
    # DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH: 1
  script:
    - /dataops
  artifacts:
    paths:
      - dataops/report/data_product_manifest_viewer/
    name: "Data Product Manifest Viewer"
    expose_as: "dataopsreport"
  icon: ${DATAOPS_ICON}
Note: You can use different methods to transform data in your data products. For more information, see Creating Custom Data Products Snippets.

Then add the Data Product stage to the stage list in pipelines/includes/config/stages.yml, towards the end, right before Clean Up.

pipelines/includes/config/stages.yml
stages:
- Pipeline Initialisation
- Vault Initialisation
- Snowflake Setup
- Additional Configuration
- Data Ingestion
- Source Testing
- Data Transformation
- Transformation Testing
- Generate Docs
- Data Product
- Clean Up

Finally, add the job /pipelines/includes/local_includes/data_product_orchestrator.yml to the full-ci.yml pipeline file.

full-ci.yml
include:
- /pipelines/includes/bootstrap.yml

...

- /pipelines/includes/local_includes/data_product_orchestrator.yml

Step 3 - Build your first data product

If everything is set up correctly, you can run the full-ci.yml file containing the pipeline definition in your project directory. See Running Pipelines for more information about the methods to run pipelines.

[Image: Data product pipeline]

Step 4 - Configure backward compatibility

Subsequent pipeline runs can check the data product for backward compatibility.

To enable this, you must add a reference manifest that holds the last approved data product metadata; the pipeline compares against it to detect breaking metadata changes in the dataset and service level objectives (SLOs).

  1. In your project, navigate to CI/CD > Pipelines and, from the latest run of the pipeline that produces the data product, go to Reports.

    [Image: Data product manifest download]

  2. Click the download icon next to the Data Product Manifest Viewer.

  3. From the downloaded zip, rename the data_product_manifest.yml file to reference_manifest.yml (see the shell sketch after these steps).

  4. Add the reference_manifest.yml file to the project repository at dataops/data-product-definitions/reference_manifest.yml.

  5. In your project directory under pipelines/includes/local_includes/, uncomment the following variables in data_product_orchestrator.yml.

data_product_orchestrator.yml
"Data Product Orchestrator":
  ...
  variables:
    DATAOPS_DATA_PRODUCT_REFERENCE_FILE: dataops/data-product-definitions/reference_manifest.yml
    DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH: 1
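
Steps 3 and 4 can also be scripted. A minimal shell sketch, assuming the report artifact was downloaded as data_product_manifest_viewer.zip and the manifest sits at the root of the zip:

# Unpack the downloaded report artifact (the zip name is an assumption)
unzip data_product_manifest_viewer.zip -d /tmp/dp_report

# Rename the manifest and place it in the repository (steps 3 and 4)
mv /tmp/dp_report/data_product_manifest.yml \
   dataops/data-product-definitions/reference_manifest.yml
git add dataops/data-product-definitions/reference_manifest.yml
git commit -m "Add reference manifest for backward compatibility checks"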