How to Prevent Concurrent Running of Pipelines
By default, DataOps runs pipelines in parallel, and jobs in the same stage of a pipeline also run in parallel. To sequence jobs within a single pipeline, you only need to place them in different stages.
The remaining question is how to sequence runs across pipelines: either different instances of the same pipeline or different pipelines altogether.
This article answers this question by focusing on how to ensure that a new pipeline does not start before the current pipeline run has been completed.
Adding a resource group
Resource groups are a means of limiting the concurrency of DataOps jobs. We will leverage them in this example by applying the same resource group to all jobs in a given pipeline.
The resulting workflow is as follows:
- A pipeline starts
- Before a job in the pipeline runs, it checks whether the resource of the given name is free
- If the resource is not free, the pipeline waits and runs the job only once the resource is free
To achieve the desired result, we introduce a new job, No Parallel Run, at the Pipeline Initialisation stage to ensure that the pipeline is blocked at its earliest possible stage. In addition, we apply the resource group name sequential-pipeline to every job in the pipeline:
    No Parallel Run:
      ## Recommended values for resource group names
      # Using the name of the job - limiting concurrency at the job level
      # Using the name of the pipeline - limiting concurrency at the pipeline level
      # the example uses the fixed name sequential-pipeline
      resource_group: sequential-pipeline
      stage: Pipeline Initialisation
      script:
        - echo 'sequential-pipeline starting'

    Long Ingestion Job:
      # continue to use the fixed name sequential-pipeline to prevent the long running
      # ingest running concurrently
      resource_group: sequential-pipeline
      stage: Data Ingestion
      script:
        - echo 'Starting execution ...'
        - sleep 60
        - echo 'Completed execution ...'

    # job name below is assumed for illustration
    Clean Up Job:
      # optional - continue to use the fixed name sequential-pipeline through all jobs
      resource_group: sequential-pipeline
      stage: Clean Up
      script:
        - echo 'sequential-pipeline done'
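The naming options mentioned in the comments above (job-level and pipeline-level resource group names) can be sketched as follows. This is a minimal sketch, not the article's own configuration: it assumes that the GitLab-style predefined variable `CI_JOB_NAME` is available and that variables are expanded in `resource_group`. The job names and the variable `PIPELINE_LOCK_NAME` are hypothetical.

```yaml
variables:
  # Hypothetical variable: one lock name shared by every job in this pipeline
  PIPELINE_LOCK_NAME: nightly-ingest

# Job-level lock: only one instance of this job runs at a time,
# while other jobs of concurrent pipelines may still overlap.
Job Level Lock Example:        # hypothetical job name
  resource_group: $CI_JOB_NAME # assumes CI_JOB_NAME is exposed to the pipeline
  stage: Data Ingestion
  script:
    - echo 'only one copy of this job runs at a time'

# Pipeline-level lock: every job that uses this name shares one lock.
Pipeline Level Lock Example:   # hypothetical job name
  resource_group: $PIPELINE_LOCK_NAME
  stage: Clean Up
  script:
    - echo 'shares a lock with all jobs using PIPELINE_LOCK_NAME'
```

A fixed name such as sequential-pipeline behaves like the pipeline-level variant, with the lock name written out literally.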
Once these job definitions are included in a DataOps pipeline, as in the following example, you can start two instances of the pipeline at the same time, and each will wait whenever the other holds the resource group.
# sync execution
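A minimal sketch of what such a pipeline file might look like, assuming the job definitions above are saved in a hypothetical file named `no-parallel-run.yml` and that the pipeline uses the GitLab-style `include` keyword with `local` paths. The path is an assumption, and any other includes the project needs are omitted.

```yaml
include:
  # sync execution: pull in the jobs that share the sequential-pipeline resource group
  # (hypothetical path; adjust to where the job definitions are stored)
  - local: /pipelines/includes/local_includes/no-parallel-run.yml
```

Triggering this pipeline twice in quick succession should then show the second run's jobs waiting until the first run releases the resource group.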
Observing the two pipeline executions, you will see results similar to the following images:
Considerations for jobs in a resource group
When you set a resource group on a job that runs on a schedule, we recommend also setting a timeout on that job that is close to the schedule interval. This prevents jobs from building up if one run is unusually slow.
As a simple example, let’s say you have a job in a resource group that normally takes 20 minutes and runs on an hourly schedule. If that job suddenly takes 90 minutes, a backlog of jobs starts building up, and the situation gets progressively worse. If this goes unnoticed, several days' worth of jobs could build up, even after the underlying cause of the slowdown has been fixed.
It very much depends on the job in question, but take time to consider what happens when a job in a resource group suddenly takes longer than expected.
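For the hourly schedule described above, a timeout could be sketched as follows. This assumes DataOps honors the GitLab-style job-level `timeout` keyword; the value of 60 minutes is an illustrative choice matching the schedule interval, not a recommendation from the article.

```yaml
Long Ingestion Job:
  resource_group: sequential-pipeline
  stage: Data Ingestion
  # assumed value: roughly the schedule interval, so a slow run is stopped
  # before the next scheduled run queues up behind it
  timeout: 60 minutes
  script:
    - echo 'Starting execution ...'
    - sleep 60
    - echo 'Completed execution ...'
```

With a limit like this, a run that stalls fails after about an hour instead of holding the resource group while later scheduled runs queue behind it.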