DataOps Pipeline Job Dependencies

Pipelines are at the heart of DataOps. The pipeline graph visualizes pipeline executions either by stage or by job dependencies. The image below shows part of a full pipeline graph, which displays all of the jobs in the DataOps pipeline, grouped by stage by default, helping you track the progress of your jobs in the order in which they execute.

partial-pipeline-graph

You can also create a pipeline Directed Acyclic Graph (DAG) using the needs: keyword. This keyword configures the order in which jobs run and lets jobs start sooner than they would if they were ordered by stages alone. One of the challenges of using needs: is that it creates ambiguity when looking at the pipeline graph.

A good example of this ambiguity is found in the pipeline graph representation of this YAML pipeline file:

show-pipeline-dependencies-ci.yml
include:
  - "/pipelines/includes/config/agent_tag.yml"

data_prep_1:
  extends:
    - .agent_tag
  stage: build
  script:
    - echo preparing first data

data_prep_2:
  extends:
    - .agent_tag
  stage: build
  needs: []
  script:
    - echo preparing second data

test:
  extends:
    - .agent_tag
  stage: test
  needs: [data_prep_1, data_prep_2]
  script:
    - echo test

test_transformers:
  extends:
    - .agent_tag
  stage: test
  script:
    - echo test

transform_1:
  extends:
    - .agent_tag
  stage: deploy
  needs: [test_transformers]
  script:
    - echo transforming

transform_2:
  extends:
    - .agent_tag
  stage: deploy
  needs: [test_transformers]
  script:
    - echo transforming

From analyzing this script, the jobs start under the following conditions:

  • data_prep_2: This job starts as soon as the pipeline is created because it has no dependencies (needs: [])
  • data_prep_1: This job has no needs: keyword, so it follows stage order; it is in the build stage alongside data_prep_2 and does not wait for data_prep_2
  • test: This job won't run until both data_prep_1 and data_prep_2 have completed
  • test_transformers: This job has no needs: keyword, so it runs once all jobs in the build stage have completed; it does not wait for test
  • transform_1: This job won't run until test_transformers has completed
  • transform_2: This job won't run until test_transformers has completed; it does not wait for transform_1
note

Setting the needs: keyword to an empty array (needs: []) indicates that the job starts as soon as the pipeline is created.
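The effect of an empty needs: array is easiest to see on a job in a late stage. The following is a minimal sketch, not taken from the example file: the job names publish_report and refresh_docs are hypothetical, and it assumes that the build, test, and deploy stages run in that order and that the agent_tag.yml include behaves as in the pipeline file above.

```yaml
include:
  - "/pipelines/includes/config/agent_tag.yml"

# Hypothetical job: no needs: keyword, so it follows stage order and
# waits for every job in the earlier stages to complete.
publish_report:
  extends:
    - .agent_tag
  stage: deploy
  script:
    - echo publishing report

# Hypothetical job: needs: [] declares no dependencies, so it starts as
# soon as the pipeline is created, even though it is in the deploy stage.
refresh_docs:
  extends:
    - .agent_tag
  stage: deploy
  needs: []
  script:
    - echo refreshing docs
```

In a stage-based graph, both of these jobs would appear in the same deploy column, even though refresh_docs can start long before publish_report.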

The pipeline graph for the show-pipeline-dependencies-ci.yml pipeline, grouped by stages, is as follows:

pipeline-graph-by-stages

If we compare the pipeline graph with the text analysis of how the jobs should run, we can see that the two don't match. The stage-based pipeline graph does not show the running order that the needs: keyword defines, resulting in ambiguity and confusion.

How do you solve this? When the job order is determined by needs: rather than by stages, you can view the pipeline by job dependencies. The first step is to click the Job dependencies button, as indicated in the following image:

show-job-dependencies

Comparing this image with the previous one (by stages), you will notice that the order of the jobs has changed, which reduces the ambiguity and confusion.

When you switch Show dependencies on, additional lines on the graph link the jobs to each other, providing a visual representation of the dependencies between jobs in the pipeline.

graph-job-dependencies