
How to Automatically Retry a Failed Job

Sometimes you may want a failed job to be retried automatically until it succeeds or reaches a set number of retries. This is useful when the resources your job depends on frequently have temporary issues that resolve themselves after a certain amount of time. With automatic retries, you no longer have to manually retry a failed job in a DataOps pipeline, and the pipeline can heal itself.

Use the retry keyword to configure how many times a failed job is reprocessed. The value can be 0, 1, or 2. If the value isn't defined, it defaults to 0. Here is an example job:

example_job.yml
Test all Sources:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: TEST
    TRANSFORM_MODEL_SELECTOR: source:*
  stage: Source Testing
  script:
    - /dataops
  retry: 2
  icon: ${TESTING_ICON}
  artifacts:
    when: always
    reports:
      junit: $CI_PROJECT_DIR/report.xml
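
To make the retry count concrete, here is a minimal Python sketch of the behavior described above: a job with retry: 2 runs at most three times in total (the first attempt plus two retries). It is only an illustrative model, not how the DataOps runner is implemented. The names run_with_retry and make_flaky_job are hypothetical and exist only for this example.

```python
from typing import Callable

MAX_RETRY = 2  # per the text above, retry accepts 0, 1 or 2


def run_with_retry(job: Callable[[], bool], retry: int = 0) -> int:
    """Run `job` until it succeeds or the retries are used up.

    Returns the number of attempts made. Raises RuntimeError if every
    attempt fails. This models the keyword's semantics only.
    """
    if not 0 <= retry <= MAX_RETRY:
        raise ValueError(f"retry must be between 0 and {MAX_RETRY}")
    attempts = 0
    for _ in range(1 + retry):  # one initial run plus `retry` retries
        attempts += 1
        if job():
            return attempts
    raise RuntimeError(f"job failed after {attempts} attempts")


def make_flaky_job(failures: int) -> Callable[[], bool]:
    """Build a hypothetical job that fails `failures` times, then succeeds."""
    remaining = failures

    def job() -> bool:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return False  # simulated temporary issue
        return True

    return job


if __name__ == "__main__":
    # A job that hits two temporary failures succeeds on its third attempt.
    print(run_with_retry(make_flaky_job(2), retry=2))
```

In this model, a job whose dependency recovers within two failed attempts passes without manual intervention, while a job that keeps failing still ends in failure, only later.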

Things to consider

Be careful not to overuse the retry keyword. If your pipeline fails often, it is better to debug it and rethink its logic than to prolong every run with one or two retries.