How to Use Parent-Child Pipelines
You can configure DataOps projects to trigger one pipeline from another using a technique called parent and child pipelines. The child (downstream) pipeline can reside in the same project as the parent or in a different one, and variables can be passed from the parent pipeline to the child.
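As a minimal sketch of the idea (the job name, child file name, and variable below are illustrative, not part of the standard template), a parent pipeline triggers a child via a trigger job:

```yaml
# Illustrative parent-side trigger job; all names here are hypothetical.
Trigger Child:
  stage: Downstream              # a stage defined in the project's stages.yml
  variables:
    SOME_VARIABLE: some-value    # passed down into the child pipeline
  trigger:
    include: child-ci.yml        # child pipeline definition in this project
    strategy: depend             # wait for, and mirror, the child's result
```

The real-world examples later in this section show the actual template files this pattern is built from.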
Why use parent-child pipelines?
The #TrueDataOps philosophy favors small, atomic pieces of code that can be reused quickly and efficiently to reduce overall prototyping and development time (from the #TrueDataOps Philosophy). To that end, each DataOps pipeline should ideally be a highly focussed unit of work, performing the minimum number of actions (jobs) needed to achieve its primary purpose.
However, many of the activities within a pipeline (e.g., Snowflake setup, source testing, model building, etc.) are common across multiple pipelines, often across multiple projects, potentially leading to code duplication (violating the DRY principle). Dedicated job definitions, shared within each project and from reference projects, are an existing solution for avoiding the duplication of individual job configurations.
Parent and child pipelines extend this concept of encapsulation, allowing larger pipelines to be broken down into smaller, more focused units of work which can be reused and triggered from one another.
Conceptual example
Here's a screenshot of a standard DataOps pipeline, straight out of the project template:
This can be abstracted conceptually to the following equivalent diagram:
Apart from the initial section Pipeline start-up, which will necessarily feature in every DataOps pipeline, there are two activities being performed here: an infrastructure task (maintaining Snowflake's databases, schemas, grants, etc.) and a data modeling task. However, many use cases will not require both activities to be performed together in this manner, and developers will often want to execute these activities separately when only one part of the codebase is being worked on.
We can therefore break this down into two separate pipelines, capable of being executed independently but chained together for the standard approach we see above.
This means we can now execute both activities independently or together, without duplication. And that's not all; using parent-child pipelines, we can further:
- allow either pipeline to be triggered from another project's pipeline
- start other pipelines (in this project or elsewhere) at the end of these pipelines
- parameterize the model build/test activity and run it in parallel for different sets of models
- and more
Real-world examples
Here are some actual project configurations that can be used to set up working parent-child implementations.
Example 1: simple in-project parent-child
This sample configuration follows the conceptual example above, separating the SOLE (Snowflake Object Lifecycle Engine) and MATE (Modelling and Transformation Engine) sections into different pipelines.
Here, we are using three separate pipeline files:
- `sole-ci.yml` - The SOLE jobs from the standard pipeline configuration
- `mate-ci.yml` - The MATE jobs from the standard pipeline configuration
- `full-ci.yml` - The full parent-child pipeline implementation
It's worth noting that every pipeline file includes a reference to `bootstrap.yml`, ensuring that each pipeline can be executed independently; when pipelines are combined, the redundant includes are automatically deduplicated.
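As a contrived illustration of this deduplication (directly combining the two files, which none of the examples below actually does), a pipeline including both files would still process `bootstrap.yml` only once:

```yaml
# Contrived example: both included files themselves include
# /pipelines/includes/bootstrap.yml, but that duplicate reference is
# resolved only once in the combined pipeline.
include:
  - /sole-ci.yml
  - /mate-ci.yml
```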
sole-ci.yml:

```yaml
include:
  - /pipelines/includes/bootstrap.yml

  ## Snowflake Object Lifecycle jobs
  - project: "reference-template-projects/dataops-template/dataops-reference"
    ref: 5-stable
    file: "/pipelines/includes/default/snowflake_lifecycle.yml"
```
mate-ci.yml:

```yaml
include:
  - /pipelines/includes/bootstrap.yml

  ## Modelling and transformation jobs
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
  - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml

  ## Generate modelling and transformation documentation
  - project: "reference-template-projects/dataops-template/dataops-reference"
    ref: 5-stable
    file: "/pipelines/includes/default/generate_modelling_and_transformation_documentation.yml"
```
full-ci.yml:

```yaml
include:
  - /pipelines/includes/bootstrap.yml
  - /sole-ci.yml

Trigger MATE:
  stage: Downstream
  inherit:
    variables: false
  trigger:
    include: mate-ci.yml
    strategy: depend
```
The following parent-child features have been employed in full-ci.yml's Trigger MATE job:

- A new stage named `Downstream` is used. This stage must be added to the project's stages.yml file, usually towards the end of the list (the exact position will depend on your use case).
- Setting `inherit:variables` to `false` prevents the parent pipeline's configuration from polluting that of the child pipeline, which picks up its configuration as if it had been executed separately.
- Setting `trigger:strategy` to `depend` causes the parent pipeline to wait until the child pipeline completes, and to report success only if the child pipeline is successful.
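For example, a stages.yml extended with the new stage might look like the sketch below; the other stage names here are placeholders, so keep your project's actual stage list and simply append the new entry:

```yaml
# Illustrative stages.yml; the first three names are placeholders for
# your project's existing stages, with Downstream appended near the end.
stages:
  - Pipeline Initialisation
  - Snowflake Setup
  - Data Transformation
  - Downstream
```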
Example 2: multiple child pipelines
Extending the above simple example further, we can execute more than one child pipeline in parallel. In this sample configuration, our MATE models are split in two, with one child pipeline building/testing each set.
Here we are reusing the same sole-ci.yml and mate-ci.yml as in the previous example.
full-ci.yml:

```yaml
include:
  - /pipelines/includes/bootstrap.yml
  - /sole-ci.yml

Trigger MATE (Set 1):
  stage: Downstream
  inherit:
    variables: false
  variables:
    TRANSFORM_MODEL_SELECTOR: models/set1
  trigger:
    include: mate-ci.yml
    strategy: depend

Trigger MATE (Set 2):
  stage: Downstream
  inherit:
    variables: false
  variables:
    TRANSFORM_MODEL_SELECTOR: models/set2
  trigger:
    include: mate-ci.yml
    strategy: depend
```
Note the use of the variable `TRANSFORM_MODEL_SELECTOR` in each trigger job, which is passed into each child pipeline to control the operation of all MATE jobs.
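To illustrate how a variable passed this way could be consumed, here is a hypothetical sketch of a child-pipeline job; this is not the template's actual MATE job definition, and the dbt command and stage name are assumed for illustration only:

```yaml
# Hypothetical child-pipeline job: the variable set in the parent's
# trigger job arrives as an ordinary environment variable.
Build Selected Models:
  stage: Data Transformation       # assumed stage name
  script:
    - dbt build --select "$TRANSFORM_MODEL_SELECTOR"   # assumed command
```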
Example 3: multi-project pipelines
Instead of breaking down a project's pipelines internally, as we have done in the above examples, this sample configuration triggers the full-ci.yml pipeline in another project once the local pipeline jobs are finished.
The parent project's full-ci.yml is shown below, but the child project's full-ci.yml needs no particular configuration for this functionality.
full-ci.yml (parent project):

```yaml
include:
  - /pipelines/includes/bootstrap.yml
  - /sole-ci.yml
  - /mate-ci.yml

Trigger Project 2:
  stage: Downstream
  inherit:
    variables: false
  variables:
    _PIPELINE_FILE_NAME: full-ci.yml
  trigger:
    project: "dataops-internal/sam/parent-and-child-pipelines/project-two"
    branch: $CI_COMMIT_REF_NAME
    # strategy: depend
```
Please note the parent-child features that have been used here:

- The parameter `inherit:variables` is still set to `false`, as this configuration has the same issue with the pollution of child pipeline variables.
- To select which of the child project's pipelines to trigger, set the variable `_PIPELINE_FILE_NAME` to the pipeline filename.
- Passing the built-in variable `$CI_COMMIT_REF_NAME` to `trigger:branch` ensures that the child pipeline runs on the branch with the same name as the one the parent pipeline is executing on.
- Typically, with multi-project pipelines, we want to trigger the child project's pipeline without waiting for the result, which is why this configuration comments out the `strategy` parameter.
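Should you instead need the parent to wait for the child project's pipeline and mirror its status, reinstating the commented-out `strategy` parameter achieves this:

```yaml
# Same trigger job as above, but blocking on the child project's result.
Trigger Project 2:
  stage: Downstream
  inherit:
    variables: false
  variables:
    _PIPELINE_FILE_NAME: full-ci.yml
  trigger:
    project: "dataops-internal/sam/parent-and-child-pipelines/project-two"
    branch: $CI_COMMIT_REF_NAME
    strategy: depend
```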