How to Automatically Retry a Failed Job

Sometimes you want a failed job to be retried automatically until it succeeds or reaches a set number of retries. This is useful when the resources your job depends on frequently have temporary issues that resolve themselves after a certain amount of time. With automatic retries, you no longer have to manually retry a failed job in a DataOps pipeline, and the pipeline can heal itself.

Use the retry keyword to configure how many times a failed job is retried. Valid values are 0, 1, or 2. If the value isn't defined, it defaults to 0, which means the job is not retried. Here is an example job:

example_job.yml
Test all Sources:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: TEST
    TRANSFORM_MODEL_SELECTOR: source:*
  stage: Source Testing
  script:
    - /dataops
  retry: 2
  icon: ${TESTING_ICON}
  artifacts:
    when: always
    reports:
      junit: $CI_PROJECT_DIR/report.xml

Things to Consider

Be careful not to overuse the retry keyword. If your pipeline fails often, it is better to debug and rethink its logic than to prolong every run with one or two retries.
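
One way to keep retries targeted is to retry only on failure types that are likely to be transient. The sketch below assumes that your DataOps version accepts the extended retry form defined by GitLab CI (a max value plus a when list of failure types). This page does not confirm that, so verify it on your platform before relying on it. The failure types chosen here are illustrative.

```yaml
# Assumption: the GitLab CI extended retry syntax (max/when) is supported
# by your DataOps version. The rest of the job matches example_job.yml.
Test all Sources:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: TEST
    TRANSFORM_MODEL_SELECTOR: source:*
  stage: Source Testing
  script:
    - /dataops
  retry:
    max: 2                          # same upper limit as `retry: 2`
    when:
      - runner_system_failure       # runner-side problem, not a test failure
      - stuck_or_timeout_failure    # job hung or exceeded its timeout
```

Under that assumption, a job that fails because its script reports a genuine test failure is not retried, so retries are spent only on problems that may resolve themselves.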