Pipeline Cache

note

For an in-depth explanation of the DataOps pipeline types and structure, see the Pipeline Overview documentation.

The pipeline cache is saved on the DataOps.live runner host machine in the /agent_cache folder, or in the cache folder specified during the Runner Installation. Inside /agent_cache, a folder named after the pipeline ID is created for each pipeline. The pipeline cache is designed to reduce pipeline build and run times by allowing components or dependencies from one pipeline run to be reused in subsequent runs, rather than being created or downloaded at the start of every run.
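As a sketch of the layout this produces, the snippet below creates one cache folder per pipeline ID under the cache root. A temporary directory stands in for /agent_cache on the runner host, and the pipeline IDs are made up for illustration:

```shell
# Stand-in for /agent_cache on the runner host (hypothetical pipeline IDs).
CACHE_ROOT=$(mktemp -d)

# Each pipeline gets its own folder named after its pipeline ID.
for pipeline_id in 1001 1002 1003; do
  mkdir -p "$CACHE_ROOT/$pipeline_id"
done

ls "$CACHE_ROOT"   # one folder per pipeline ID
```

Subsequent runs of the same pipeline reuse the folder matching their pipeline ID instead of rebuilding its contents.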

This cache also saves and retrieves the encrypted secrets stored in the DataOps.live vault. This vault contains the credentials specified in YAML files in the host machine's /secrets folder.

See the Vault document for more information.

By default, the pipeline cache is not removed from the host machine. However, if required, it can be destroyed by setting the CACHE_CLEANUP environment variable. The following job shows how to set CACHE_CLEANUP so that the cache is removed from the host machine at the end of the pipeline run:

```yaml
clean-cache:
  extends:
    - .agent_tag
  image: $DATAOPS_BASE_RUNNER_IMAGE
  stage: Clean Up
  variables:
    CACHE_CLEANUP: 1
```

When /dataops is called with CACHE_CLEANUP set, the whole cache for the current pipeline is cleaned up.
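The effect of this cleanup can be sketched as removing the current pipeline's cache directory. The snippet below assumes that behavior; a temporary directory stands in for /agent_cache, and the pipeline ID is hypothetical:

```shell
# Stand-in for /agent_cache with one pipeline's cached content.
CACHE_ROOT=$(mktemp -d)
CI_PIPELINE_ID=1001
mkdir -p "$CACHE_ROOT/$CI_PIPELINE_ID"
touch "$CACHE_ROOT/$CI_PIPELINE_ID/artifact.txt"

# With CACHE_CLEANUP set, the whole per-pipeline cache folder is removed.
CACHE_CLEANUP=1
if [ "$CACHE_CLEANUP" = "1" ]; then
  rm -rf "$CACHE_ROOT/$CI_PIPELINE_ID"
fi
```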

danger

If any other job runs after the cache is deleted, it may fail or produce incorrect results. Avoid setting the CACHE_CLEANUP environment variable unless the cleanup job is the last job in the pipeline.

Pipeline cache details

The DataOps Runner configuration will mount /cache/$AGENT_TAG:/agent_cache/:rw. In other words, /cache/$AGENT_TAG on the host gets mounted to /agent_cache in each pipeline's job container.

To avoid conflicts between different pipelines that use the same agent, each job creates a /agent_cache/$CI_PIPELINE_ID directory (if it doesn't already exist) and then logically links it to /cache. As a result, all other functions in a job can simply read from and write to /cache without interfering with other pipelines.
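The per-job setup described above can be sketched as follows. Temporary directories stand in for the container paths /agent_cache and /cache, and the pipeline ID is hypothetical; the link is shown here as a symlink, which is one way to realize the logical link:

```shell
# Stand-ins for /agent_cache (the mounted host cache) and the job's root.
AGENT_CACHE=$(mktemp -d)
JOB_ROOT=$(mktemp -d)
CI_PIPELINE_ID=1001

# Create the pipeline-specific directory if it does not exist ...
mkdir -p "$AGENT_CACHE/$CI_PIPELINE_ID"
# ... and link it to the job's cache path, so the job reads and writes
# "cache" without touching other pipelines' folders.
ln -s "$AGENT_CACHE/$CI_PIPELINE_ID" "$JOB_ROOT/cache"

# A write through the job's cache path lands in this pipeline's folder only.
echo "hello" > "$JOB_ROOT/cache/state.txt"
```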

Lastly, if there is a security requirement to isolate one pipeline from all others, that pipeline should have its own DataOps Runner, which prevents other pipelines from physically reading any of its caches.