DataOps Pipeline Overview

Pipelines are the mechanism that moves data through the DataOps platform. A DataOps project can contain multiple pipelines, each implementing a different behavior, such as hourly, daily, or weekly ingestion tasks or jobs.

Pipeline Types

DataOps supports two main types of pipelines:

  • Batch pipeline or data pipeline
  • Real-time/event pipeline (similar to a software pipeline)

1. Batch Pipeline

When executing a batch pipeline, data moves from left to right. As the pipeline progresses, the data is extracted, loaded, and transformed (ELT), and then made available to end-users.
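As a sketch, a minimal batch pipeline defines one job per ELT phase. The stage names, job names, and scripts below are illustrative only, not platform defaults:

```yaml
# Illustrative sketch of a batch (ELT) pipeline.
# Stage names, job names, and scripts are hypothetical.
stages:
  - Extract and Load
  - Transform
  - Test

"Ingest Source Data":
  stage: Extract and Load
  script:
    - /dataops              # run the relevant ingestion orchestrator

"Build Models":
  stage: Transform
  script:
    - /dataops              # run modelling and transformation

"Test Models":
  stage: Test
  script:
    - /dataops              # validate results before serving end-users
```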

note

Most of the pipelines in our demo projects are batch pipelines.

2. Real-Time/Event Pipeline

Executing a real-time (or event) pipeline does not directly cause data to be moved, tested, or transformed. Instead, it works like a traditional DevOps pipeline, where a software component is built, tested, and deployed to a particular environment. Once deployed, this pipeline is expected to stay running, typically waiting for external real-time events to trigger its actions. Examples include an API Gateway or a messaging hub.

note

While real-time pipelines may move real-time data, they are often coupled with a batch pipeline to perform data testing and update models that depend on the real-time pipeline's data.

Various testing types can be included as part of these pipelines, including unit testing, regression testing, and integration testing.

For example, consider a Talend component that reads messages from a Kafka topic, processes them, and writes them to a different Kafka topic.

In DataOps, this component is built, its unit tests are run, the prerequisites for integration tests are set up, the integration tests are executed, and, assuming everything passes, the resulting artifact is stored in the artifact repository and then deployed as part of a DataOps pipeline.
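The build-test-deploy sequence above can be sketched as a pipeline file. The stage names, build tool, and helper scripts here are hypothetical and only illustrate the flow:

```yaml
# Hypothetical real-time pipeline for the Kafka-processing component.
stages:
  - Build
  - Unit Test
  - Integration Test
  - Deploy

"Build Component":
  stage: Build
  script:
    - mvn package                 # assumed build tool for the component

"Run Unit Tests":
  stage: Unit Test
  script:
    - mvn test

"Run Integration Tests":
  stage: Integration Test
  script:
    - ./setup_kafka_topics.sh     # hypothetical: set up test prerequisites
    - mvn verify

"Deploy Component":
  stage: Deploy
  script:
    - ./deploy.sh                 # hypothetical: store artifact and deploy
```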

Pipeline Files

It's entirely possible to build a DataOps pipeline in a single large <pipeline-name>-ci.yml file. However, there are several issues with this:

  • The pipeline file can grow very large and become hard to maintain.
  • When several pipelines are needed, large amounts of configuration end up being duplicated, which is highly undesirable.

Therefore, standard pipelines split their configurations into reusable, smaller files. Typical reusable configurations are Jobs, Stages, and Parameters. Examples of these are included in all demo projects as well as the DataOps template project.

If you want to learn how to build a DataOps pipeline from scratch in a single file and then restructure it, one step at a time, to match our best practices, work through the DataOps Fundamentals Guide.

Pipeline Artifacts

Pipeline artifacts are files created by DataOps pipelines after a particular pipeline completes its run. These artifacts are saved to disk or object storage and are accessible as follows:

  1. Navigate to CI/CD > Pipelines on the left-hand sidebar

  2. On the far right-hand side of the pipeline details, there are two icons: Documentation and Artifacts

  3. Click on Artifacts to display a dropdown list of downloadable pipeline artifacts

  4. Select an Artifact to download and view

By default, these artifacts are deleted seven days after being generated if they are not the latest artifact. DataOps automatically locks the latest artifacts generated by successful pipeline runs on an active branch, merge request, or tag. As a result, you can set an aggressive expiration policy that cleans up older artifacts and reduces disk space consumption, while ensuring that the latest artifact is always on hand.
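Pipeline files follow GitLab-style CI syntax, in which a job can declare its own artifacts and expiration. A sketch, with hypothetical paths and duration:

```yaml
# Illustrative job-level artifact configuration.
"Build All Models":
  script:
    - /dataops
  artifacts:
    paths:
      - outputs/        # hypothetical directory of generated files
    expire_in: 7 days   # removed after 7 days unless locked as the latest
```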

Leverage the Template Project

There are many ways to structure a valid <pipeline-name>-ci.yml or full-ci.yml pipeline file. The recommended approach makes heavy use of include files to improve maintainability, adopting the default DataOps project structure generated when building a project by cloning the DataOps template project.

note

There is no difference between the <pipeline-name>-ci.yml and the full-ci.yml pipeline file names. Therefore, these terms are used interchangeably throughout this section.

tip

You can have as many pipeline files as you need, each individually named. As long as the file name ends in -ci.yml, DataOps recognizes it as a pipeline file. However, if you have just one pipeline in your DataOps project, the default name is full-ci.yml, as in the DataOps Template Project. See the standard pipeline structure below for more details.

A subset of the configurations that a pipeline frequently uses is as follows:

```
/
├── pipelines/
│   └── includes/
│       ├── config/
│       │   ├── agent_tag.yml
│       │   ├── stages.yml
│       │   └── variables.yml
│       ├── local_includes/
│       │   └── your jobs.yml
│       └── bootstrap.yml
```

The Standard Pipeline Structure

As noted above (in leveraging the Template Project), the standard pipeline file is named full-ci.yml. Initialized from the template project, the standard content of the full-ci.yml file is similar to the following code snippet:

/full-ci.yml

```yaml
include:
  - /pipelines/includes/bootstrap.yml

  ## Load Secrets job
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5-stable
    file: /pipelines/includes/default/load_secrets.yml

  ## Snowflake Object Lifecycle jobs
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5-stable
    file: /pipelines/includes/default/snowflake_lifecycle.yml

  ## Modelling and transformation jobs
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml

  ## Generate modelling and transformation documentation
  - project: 'reference-template-projects/dataops-template/dataops-reference'
    ref: 5-stable
    file: '/pipelines/includes/default/generate_modelling_and_transformation_documentation.yml'

# variables:
#   DATAOPS_DEBUG: 1
```

Pipeline bootstrap.yml

The bootstrap.yml file integrates the DataOps Reference Project with the new DataOps project, providing access to all the standard DataOps functionality. It defines all config and default files and gets a project ready to run. The following code snippet shows what the standard bootstrap.yml file looks like:

/pipelines/includes/bootstrap.yml

```yaml
include:
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5-stable
    file: /pipelines/includes/base_bootstrap.yml

  - /pipelines/includes/config/agent_tag.yml
  - /pipelines/includes/config/variables.yml
  - /pipelines/includes/config/stages.yml
```

The reference project's definitions are included first so that your project settings from the config directory can override the default project configurations.

Include your Project Settings from the Config Directory

The config directory contains the project settings configuration files shared by all your pipelines.

```
├── pipelines/
│   └── includes/
│       ├── config/
│       │   ├── agent_tag.yml
│       │   ├── stages.yml
│       │   └── variables.yml
```

These project settings are included via bootstrap.yml by default in every pipeline. Navigate to DataOps Project Settings to learn how to change these settings.
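For instance, pipelines/includes/config/variables.yml typically holds key-value pairs shared by every pipeline in the project. The values below are illustrative only:

```yaml
# pipelines/includes/config/variables.yml (illustrative content)
variables:
  DATAOPS_PREFIX: DATAOPS   # hypothetical project-wide object prefix
  DATAOPS_DEBUG: 0          # set to 1 for verbose pipeline logging
```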

Reusing the DataOps Reference Project jobs

Every DataOps pipeline should reuse the default jobs inherited from the DataOps Reference Project.

/full-ci.yml

```yaml
include:
  ## Load Secrets job
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5-stable
    file: /pipelines/includes/default/load_secrets.yml

  ## Snowflake Object Lifecycle jobs
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5-stable
    file: /pipelines/includes/default/snowflake_lifecycle.yml

  ## Generate modelling and transformation documentation
  - project: 'reference-template-projects/dataops-template/dataops-reference'
    ref: 5-stable
    file: '/pipelines/includes/default/generate_modelling_and_transformation_documentation.yml'
```

The standard pipeline jobs are run in the following order:

  1. The job to run the Secrets Manager Orchestrator to access the DataOps Vault
  2. The job to run the Snowflake Object Lifecycle Orchestrator to perform DataOps Snowflake Object Lifecycle Management
  3. The job to generate data model documentation via the DataOps Modelling and Transformation Engine
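This ordering is driven by the stages each job declares. A stages.yml consistent with it might look like the following; the stage names are illustrative, not the platform defaults:

```yaml
# pipelines/includes/config/stages.yml (illustrative)
stages:
  - Pipeline Initialisation   # Secrets Manager Orchestrator
  - Snowflake Setup           # Snowflake Object Lifecycle Orchestrator
  - Data Transformation       # modelling and transformation jobs
  - Documentation Generation  # data model documentation
```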

Shared Jobs in the local_includes Directory

The local_includes directory is where the pipeline job definitions are stored.

Even the most commonly used ELT pipeline jobs must be included in the default full-ci.yml file.

For example:

/full-ci.yml

```yaml
include:
  ## Modelling and transformation jobs
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
```

Leveraging the DataOps Modelling and Transformation Engine in a DataOps pipeline provides the following functionality:

  • Test all data sources, e.g., for data integrity or freshness
  • Execute data transformation and modeling of your data products
  • Test the results of the transformation before serving them to your users

note

You must incorporate any files stored in this directory (local_includes) into the top-level <pipeline-name>-ci.yml file.

For instance, let's assume that you've created a Fivetran job that uses the Fivetran Orchestrator to ingest data into Snowflake from a Fivetran pipeline. This YAML file is stored in the DataOps local_includes directory and is similar to the following code snippet:

pipelines/includes/local_includes/fivetran-jobs/data_expense_reports.yml

```yaml
"Data for Expense Reports":
  extends:
    - .agent_tag
    - .should_run_ingestion
    - .orchestration_styling
  stage: "Batch Ingestion"
  variables:
    FIVETRAN_ACTION: START
    FIVETRAN_API_KEY: DATAOPS_VAULT(FIVETRAN.DEFAULT.API_KEY)
    FIVETRAN_API_SECRET: DATAOPS_VAULT(FIVETRAN.DEFAULT.API_SECRET)
    CONNECTOR_ID: my_connectorID
  image: $DATAOPS_FIVETRAN_RUNNER_IMAGE
  script:
    - /dataops
  icon: ${FIVETRAN_ICON}
```

To ensure that this job is run during the DataOps pipeline run, it must be added to the full-ci.yml file as follows:

/full-ci.yml

```yaml
include:
  - /pipelines/includes/bootstrap.yml

  ## Load Secrets job
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5.0.0-beta.1
    file: /pipelines/includes/default/load_secrets.yml

  ## Snowflake Object Lifecycle jobs
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5.0.0-beta.1
    file: /pipelines/includes/default/snowflake_lifecycle.yml

  ## Modelling and transformation jobs
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml

  ## Generate modelling and transformation documentation
  - project: 'reference-template-projects/dataops-template/dataops-reference'
    ref: 5.0.0-beta.1
    file: '/pipelines/includes/default/generate_modelling_and_transformation_documentation.yml'

  ## Data ingestion jobs
  - /pipelines/includes/local_includes/fivetran-jobs/data_expense_reports.yml
```

A second example of a local_includes job file demonstrates how to add a YAML file that ingests Microsoft SQL Server source data:

<pipeline-name>-ci.yml

```yaml
include:
  ## Modelling and transformation jobs
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml

  # Any number of additional files can be included after this point
  - /pipelines/includes/local_includes/ingestion_jobs/mssql-sources.yml
```

Overriding versus Extending Variable Defaults

In a <pipeline-name>-ci.yml file, keys that hold collections of values, like variables:, can be extended.

For instance, adding the FOO: BAR key-value pair to the <pipeline-name>-ci.yml file will merge this new key-value pair with the existing variables in the DataOps Project Settings files, as well as all the variables in other config files.

<pipeline-name>-ci.yml

```yaml
include:
  - /pipelines/includes/bootstrap.yml
  ## your jobs

variables:
  FOO: BAR
  # DATAOPS_DEBUG: 1
```

However, the values in all other non-array keys override their associated default values.

A good example of this principle is in stages: in the project settings pipelines/includes/config/stages.yml file. For a discussion on how you can override the default DataOps Reference Project's stages, see the DataOps Project Settings doc.

Whether you override or extend, remember to include your changes in your project's full-ci.yml file after the default bootstrap.yml include line. This ensures that your pipeline uses your DataOps project's settings from the config and local_includes directories rather than the reference project's default configurations.

Driving Pipeline Behaviors with Pipeline Parameters

Parameters are a common concept in any software solution. DataOps.live is no different. As a result, parameters are available at the project, pipeline, and job levels. Refer to DataOps Pipeline Parameters for commonly used pipeline-level parameters.

Pipeline Graph Styling

To further express the role and purpose of each pipeline job, Pipeline Graph Styling is available for use. It allows you to color your graph and use job-specific icons.

Pipeline Cache

The DataOps Pipeline Cache plays a significant role in optimizing pipeline build and runtime by storing reusable components and dependencies on the DataOps Runner's host machine.
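As with artifacts, caching follows GitLab-style CI syntax: a job can declare a cache key and the paths to persist between runs. A sketch with hypothetical paths:

```yaml
# Illustrative job-level cache configuration.
"Build All Models":
  script:
    - /dataops
  cache:
    key: "$CI_COMMIT_REF_SLUG"   # one cache per branch
    paths:
      - .cache/                  # hypothetical dependency directory
```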