Pipeline Overview
Pipelines are the mechanism that moves data through the data product platform. DataOps.live supports multiple pipelines in a project, so you can use pipelines to implement different behaviors, such as hourly, daily, and weekly ingestion tasks or jobs.
Pipeline types
DataOps.live supports two main types of pipelines:
- Batch pipeline or data pipeline
- Real-time/event pipeline (similar to a software pipeline)
1. Batch pipeline
When executing a batch pipeline, data moves from left to right. As the pipeline progresses, the data is extracted, loaded, and transformed (ELT) and made available to end-users.
Most of the pipelines in our demo projects are batch pipelines.
2. Real-time/Event pipeline
Executing a real-time (or event) pipeline does not directly cause data to be moved, tested, or transformed. Instead, it works like a traditional DevOps pipeline in which a software component is built, tested, and deployed to a particular environment. Once deployed, the component is expected to stay running, typically waiting for external real-time events to trigger its actions. Examples include an API Gateway or a messaging hub.
While real-time pipelines may move real-time data, they are often coupled with a batch pipeline to perform data testing and update models that depend on the real-time pipeline's data.
Various testing types can be included as part of these pipelines, including unit testing, regression testing, and integration testing.
For example, consider a Talend component that reads messages from a Kafka topic, processes them, and writes them to a different Kafka topic.
In DataOps, this component is built, its unit tests are run, the prerequisites for integration tests are set up, the integration tests are executed, and, assuming everything passes, the resulting artifact is stored in the artifact repository and then deployed as part of a DataOps pipeline.
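As a rough sketch of how such a real-time component's pipeline could be laid out, the jobs below mirror that build/test/deploy flow. All job names, stage names, and script commands here are illustrative assumptions, not standard DataOps reference jobs:
## Hypothetical build/test/deploy jobs for a Kafka-processing component.
## Job names, stage names, and commands are illustrative only.
"Build Kafka Processor":
  stage: "Build"
  script:
    - ./build.sh                    # compile and package the component
  artifacts:
    paths:
      - target/                     # keep the built artifact for later jobs

"Unit Test Kafka Processor":
  stage: "Unit Test"
  script:
    - ./run_unit_tests.sh

"Integration Test Kafka Processor":
  stage: "Integration Test"
  script:
    - ./setup_test_topics.sh        # set up integration test prerequisites
    - ./run_integration_tests.sh

"Deploy Kafka Processor":
  stage: "Deploy"
  script:
    - ./deploy.sh                   # deploy the long-running component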
Pipeline files
It's possible to build a DataOps pipeline in a single large <pipeline-name>-ci.yml file. However, there are several issues with this:
- The pipeline file can grow very large and become hard to maintain.
- When several pipelines are needed, large amounts of configuration end up being duplicated, which is highly undesirable.
Therefore, standard pipelines split their configurations into reusable, smaller files. Typical reusable configurations are Jobs, Stages, and Parameters. Examples are included in all demo projects and the DataOps template project.
If you want to learn how to build a DataOps pipeline from scratch in a single file and then restructure it, one step at a time, to match our best practices, work through the DataOps Fundamentals Guide.
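As a rough idea of the end state, a restructured pipeline file often contains little more than include statements pointing at the reusable files. In the sketch below, bootstrap.yml and the modelling and transformation include match the standard structure described later in this section, while the ingestion job path is purely illustrative:
## A minimal restructured <pipeline-name>-ci.yml: all jobs live in reusable include files
include:
- /pipelines/includes/bootstrap.yml
## Hypothetical ingestion job file
- /pipelines/includes/local_includes/ingestion_jobs/my_ingestion_job.yml
## Standard modelling and transformation job file
- /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml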
Pipeline artifacts
Pipeline artifacts are files that a DataOps pipeline creates once its run completes. These artifacts are saved to disk or object storage and are accessible as follows:
- Navigate to CI/CD → Pipelines on the left sidebar.
- On the far right side of the pipeline details, click Download artifacts.
- From the dropdown list, select the artifact to download and view.
By default, artifacts that are not the latest are deleted seven days after being generated. DataOps automatically locks the latest artifacts generated by successful pipeline runs on an active branch, merge request, or tag. The net effect is that you can set a stricter expiration policy to clean up older artifacts, reducing disk space consumption while ensuring that the latest artifact is always on hand.
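Artifacts are declared on individual jobs. The sketch below assumes GitLab-style artifacts keywords, which DataOps pipeline files generally follow; the job name, stage, paths, and retention period are illustrative:
## Illustrative job showing how artifacts and their retention are declared
"Generate Model Docs":
  stage: "Documentation"          # illustrative stage name
  script:
    - /dataops
  artifacts:
    paths:
      - outputs/                  # files produced by this job's run
    expire_in: 7 days             # non-latest artifacts are cleaned up after this period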
Leverage the template project
There are many ways to structure a valid <pipeline-name>-ci.yml or full-ci.yml pipeline file. The recommended approach makes heavy use of include files to improve maintainability. Adopt the default DataOps Project Structure generated when building your project by cloning the DataOps Template Project.
There is no difference between the <pipeline-name>-ci.yml and full-ci.yml pipeline file names. Therefore, these terms are used interchangeably throughout this section.
You can have as many pipeline files as you need, each individually named. As long as the file name ends with -ci.yml, DataOps recognizes it as a pipeline file. However, if you have just one pipeline in your DataOps project, the default name is full-ci.yml, as in the DataOps Template Project. See the standard pipeline structure below for more details.
A pipeline typically uses a subset of configuration files laid out as follows:
/
├── pipelines/
|   └── includes/
|       ├── config/
|       |   ├── agent_tag.yml
|       |   ├── stages.yml
|       |   └── variables.yml
|       ├── local_includes/
|       |   └── your jobs.yml
|       └── bootstrap.yml
The standard pipeline structure
As noted above (in leveraging the Template Project), the standard pipeline file is named full-ci.yml. Initialized from the template project, the content of the full-ci.yml file is similar to the following code snippet:
include:
- /pipelines/includes/bootstrap.yml
## Load Secrets job
- project: reference-template-projects/dataops-template/dataops-reference
  ref: 5-stable
  file: /pipelines/includes/default/load_secrets.yml
## Snowflake Object Lifecycle jobs
- project: reference-template-projects/dataops-template/dataops-reference
  ref: 5-stable
  file: /pipelines/includes/default/snowflake_lifecycle.yml
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
- /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
## Generate modelling and transformation documentation
- project: "reference-template-projects/dataops-template/dataops-reference"
  ref: 5-stable
  file: "/pipelines/includes/default/generate_modelling_and_transformation_documentation.yml"

# variables:
#   DATAOPS_DEBUG: 1
Pipeline bootstrap.yml
The bootstrap.yml file integrates the DataOps Reference Project with the new DataOps project, providing access to all the standard DataOps functionality. It defines all config and default files and gets a project ready to run. The following code snippet shows what the standard bootstrap.yml file looks like:
include:
- project: reference-template-projects/dataops-template/dataops-reference
  ref: 5-stable
  file: /pipelines/includes/base_bootstrap.yml
- /pipelines/includes/config/agent_tag.yml
- /pipelines/includes/config/variables.yml
- /pipelines/includes/config/stages.yml
The reference project definitions are included first so your project settings from the config directory can override the default project configurations.
Include your project settings from the config directory
The config directory contains the project settings configuration files shared by all your pipelines.
├── pipelines/
|   └── includes/
|       ├── config/
|       |   ├── agent_tag.yml
|       |   ├── stages.yml
|       |   └── variables.yml
These project settings are included by default via the bootstrap.yml in every pipeline. Navigate to DataOps Project Settings to learn how to change these settings.
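For orientation, each of these config files is a small YAML fragment. The sketch below shows hypothetical contents; the variable value and runner tag are illustrative, not project defaults:
## pipelines/includes/config/variables.yml (illustrative content)
variables:
  DATAOPS_DEBUG: 0

## pipelines/includes/config/agent_tag.yml (illustrative content)
.agent_tag:
  tags:
    - my-dataops-runner           # tag of the DataOps Runner that should execute the jobs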
Reusing the DataOps Reference Project jobs
Every DataOps pipeline should reuse the default jobs inherited from the DataOps Reference Project.
include:
## Load Secrets job
- project: reference-template-projects/dataops-template/dataops-reference
  ref: 5-stable
  file: /pipelines/includes/default/load_secrets.yml
## Snowflake Object Lifecycle jobs
- project: reference-template-projects/dataops-template/dataops-reference
  ref: 5-stable
  file: /pipelines/includes/default/snowflake_lifecycle.yml
## Generate modelling and transformation documentation
- project: "reference-template-projects/dataops-template/dataops-reference"
  ref: 5-stable
  file: "/pipelines/includes/default/generate_modelling_and_transformation_documentation.yml"
The standard pipeline jobs are run in the following order:
- The job to run the Secrets Manager Orchestrator to access the DataOps Vault
- The job to run the Snowflake Object Lifecycle Orchestrator to perform DataOps Snowflake Object Lifecycle Management
- The job to generate data model documentation via the DataOps Modelling and Transformation Engine
Shared jobs in the local_includes directory
The local_includes directory is where the pipeline job definitions are stored.
Even the most commonly used ELT pipeline jobs must be included in the default full-ci.yml file.
For example:
include:
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
- /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
Leveraging the DataOps Modelling and Transformation Engine in a DataOps pipeline provides the following functionality (a sketch of one such job file follows this list):
- Test all data sources, e.g., for data integrity or freshness
- Execute data transformation and modeling of your data products
- Test the results of the transformation before serving them to your users
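As a sketch of what one of these modelling and transformation job files might contain, the job below follows the same shape as the Fivetran job shown later in this section. The stage name, image variable, and action variable are assumptions used for illustration:
## Hypothetical content of build_all_models.yml
"Build All Models":
  extends:
    - .agent_tag
  stage: "Data Transformation"              # illustrative stage name
  image: $DATAOPS_TRANSFORM_RUNNER_IMAGE    # assumed MATE runner image variable
  variables:
    TRANSFORM_ACTION: RUN                   # assumed action that builds all models
  script:
    - /dataops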
You must incorporate any files stored in this directory (local_includes) into the top-level <pipeline-name>-ci.yml file.
For instance, let's assume that you've created a Fivetran job that uses the Fivetran Orchestrator to ingest data into Snowflake from a Fivetran pipeline. This YAML file is stored in the DataOps local_includes directory and is similar to the following code snippet:
"Data for Expense Reports":
extends:
- .agent_tag
- .should_run_ingestion
- .orchestration_styling
stage: "Batch Ingestion"
variables:
FIVETRAN_ACTION: START
FIVETRAN_API_KEY: DATAOPS_VAULT(FIVETRAN.DEFAULT.API_KEY)
FIVETRAN_API_SECRET: DATAOPS_VAULT(FIVETRAN.DEFAULT.API_SECRET)
CONNECTOR_ID: my_connectorID
image: $DATAOPS_FIVETRAN_RUNNER_IMAGE
script:
- /dataops
icon: ${FIVETRAN_ICON}
To ensure that this job is run during the DataOps pipeline run, it must be added to the full-ci.yml file as follows:
include:
- /pipelines/includes/bootstrap.yml
## Load Secrets job
- project: reference-template-projects/dataops-template/dataops-reference
  ref: 5.0.0-beta.1
  file: /pipelines/includes/default/load_secrets.yml
## Snowflake Object Lifecycle jobs
- project: reference-template-projects/dataops-template/dataops-reference
  ref: 5.0.0-beta.1
  file: /pipelines/includes/default/snowflake_lifecycle.yml
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
- /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
## Generate modelling and transformation documentation
- project: "reference-template-projects/dataops-template/dataops-reference"
  ref: 5.0.0-beta.1
  file: "/pipelines/includes/default/generate_modelling_and_transformation_documentation.yml"
## Data ingestion jobs
- /pipelines/includes/local_includes/fivetran-jobs/data_expense_reports.yml
A second example of a local_includes job file shows how to include a YAML file that ingests data from an MSSQL source:
include:
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
- /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
# Any number of additional files can be included after this point
- /pipelines/includes/local_includes/ingestion_jobs/mssql-sources.yml
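What mssql-sources.yml contains depends on the orchestrator used for ingestion. A hypothetical sketch, following the same pattern as the Fivetran job above (the stage, image variable, and vault paths are assumptions):
## Hypothetical content of /pipelines/includes/local_includes/ingestion_jobs/mssql-sources.yml
"Ingest MSSQL Sources":
  extends:
    - .agent_tag
    - .should_run_ingestion
  stage: "Batch Ingestion"
  variables:
    SOURCE_SERVER: DATAOPS_VAULT(MSSQL.DEFAULT.SERVER)       # assumed vault paths
    SOURCE_USER: DATAOPS_VAULT(MSSQL.DEFAULT.USERNAME)
    SOURCE_PASSWORD: DATAOPS_VAULT(MSSQL.DEFAULT.PASSWORD)
  image: $DATAOPS_MSSQL_RUNNER_IMAGE                          # assumed orchestrator image variable
  script:
    - /dataops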
Overriding versus extending variable defaults
In a <pipeline-name>-ci.yml file, keys whose values are maps, such as variables:, are merged (extended) with the included defaults rather than replacing them.
For instance, adding the FOO: BAR key-value pair to the <pipeline-name>-ci.yml file will merge this new key-value pair with the existing variables in the DataOps Project Settings files, as well as with all the variables in other config files.
include:
- /pipelines/includes/bootstrap.yml
## your jobs

variables:
  FOO: BAR
  # DATAOPS_DEBUG: 1
However, keys whose values are not merged in this way, such as arrays and scalars, override (replace) their associated default values. A good example of this principle is stages: in the project settings file pipelines/includes/config/stages.yml. For a discussion on overriding the default DataOps Reference Project's stages, see the DataOps Project Settings doc.
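For example, a project-level stages.yml replaces the entire default stages list rather than merging with it. A sketch with illustrative stage names:
## pipelines/includes/config/stages.yml (stage names are illustrative)
## Because stages: takes an array, this list completely replaces the default stages.
stages:
  - "Pipeline Initialisation"
  - "Snowflake Setup"
  - "Batch Ingestion"
  - "Data Transformation"
  - "Testing"
  - "Documentation"
  - "Clean Up"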
Whether you override or extend, remember to include your changes in your full-ci.yml file after the default bootstrap.yml include. Doing so ensures that your pipeline uses your DataOps project's settings from the local_includes directory rather than the reference project's default configurations.
Driving pipeline behaviors with pipeline parameters
Parameters are a common concept in any software solution. DataOps.live is no different. As a result, parameters are available at the project, pipeline, and job levels. Refer to DataOps Pipeline Parameters for commonly used pipeline-level parameters.
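As a quick sketch of where parameters can be set: a pipeline-level parameter is a variable defined in the pipeline file (or overridden when launching a run), while a job-level parameter lives in that job's variables: block. The job name and second parameter below are illustrative:
## Pipeline-level parameter: applies to every job in the pipeline
variables:
  DATAOPS_DEBUG: 1

## Job-level parameter: applies only to this (illustrative) job
"My Ingestion Job":
  stage: "Batch Ingestion"
  variables:
    INGESTION_WINDOW_HOURS: 24    # hypothetical job-specific parameter
  script:
    - /dataops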
Pipeline graph styling
To further express the role and purpose of each pipeline job, you can use Pipeline Graph Styling to color your graph and apply job-specific icons.
Pipeline cache
The DataOps Pipeline Cache plays a significant role in optimizing pipeline build and runtime by storing reusable components and dependencies on the DataOps Runner's host machine.
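The DataOps Pipeline Cache itself lives on the DataOps Runner's host machine, as noted above. For comparison only, a generic GitLab-style per-job cache: block looks like the sketch below; treat it as an illustration of job-level caching syntax, not as the DataOps Pipeline Cache configuration:
## Generic GitLab-style per-job cache (illustrative only)
"Build All Models":
  stage: "Data Transformation"
  cache:
    key: "$CI_COMMIT_REF_SLUG"    # one cache per branch
    paths:
      - .cache/                   # illustrative dependency cache directory
  script:
    - /dataops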