DataOps Pipeline Cache

The DataOps pipeline cache is stored on the DataOps Runner host machine in the /agent_cache folder, or in the cache folder specified during the DataOps Runner installation. It reduces pipeline build and run times by letting components or dependencies from one pipeline run be reused in subsequent runs instead of being created or downloaded at the start of every run.

note

A folder named after the pipeline ID is created inside /agent_cache for each pipeline.

This cache is also used to save and retrieve the encrypted secrets stored in the DataOps Vault. This vault contains the credentials specified in YAML files in the host machine's /secrets folder.

See the DataOps Vault document for more information.

By default, the pipeline cache is not removed from the host machine. However, if required, it can be destroyed by setting the environment variable CACHE_CLEANUP. The following code snippet shows how to set CACHE_CLEANUP to ensure that the cache is removed from the host machine at the end of the pipeline run.

clean-cache:
  extends:
    - .agent_tag
  image: $DATAOPS_BASE_RUNNER_IMAGE
  stage: Clean Up
  variables:
    CACHE_CLEANUP: 1

In practice, the Base Orchestrator cleans up the whole cache for the current pipeline when /dataops is called with CACHE_CLEANUP set.
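The cleanup behavior described above can be pictured with a minimal Python sketch. This is not the Base Orchestrator's actual code: the function name cleanup_pipeline_cache and its parameters are hypothetical, and treating any non-empty CACHE_CLEANUP value as "set" is an assumption.

```python
import os
import shutil
from pathlib import Path


def cleanup_pipeline_cache(agent_cache: Path, pipeline_id: str, env=os.environ) -> bool:
    """Delete one pipeline's cache folder if CACHE_CLEANUP is set.

    Hypothetical helper for illustration only.
    Returns True when a folder was removed, False otherwise.
    """
    # Assumption: any non-empty value counts as "set".
    if not env.get("CACHE_CLEANUP"):
        return False  # default behavior: the cache is kept
    target = agent_cache / pipeline_id
    if not target.is_dir():
        return False  # nothing is cached for this pipeline
    shutil.rmtree(target)  # removes only this pipeline's folder
    return True
```

Only the folder of the current pipeline is removed in this sketch, so caches that belong to other pipelines on the same runner stay in place.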

warning

If any other job runs after the cache is deleted, it may fail or produce incorrect results. Therefore, it is best not to set the CACHE_CLEANUP environment variable.

Pipeline Cache Details

The DataOps Runner configuration mounts /cache/$AGENT_NAME:/agent_cache/:rw. In other words, /cache/$AGENT_NAME on the host is mounted at /agent_cache in each pipeline's job container.

To avoid conflicts between different pipelines that use the same agent, each job creates a /agent_cache/$CI_PIPELINE_ID directory (if it doesn't already exist) and then logically links it to /cache. As a result, everything else in a job can simply read from and write to /cache without interfering with other pipelines.
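The per-pipeline isolation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the runner's actual implementation: the function name link_pipeline_cache is hypothetical, and the "logical link" is assumed to be a symbolic link.

```python
from pathlib import Path


def link_pipeline_cache(agent_cache: Path, pipeline_id: str, link: Path) -> Path:
    """Create agent_cache/pipeline_id if needed and point `link` at it.

    Hypothetical helper; assumes the link is a symbolic link.
    Raises FileExistsError if `link` exists and is not a symlink.
    """
    target = agent_cache / pipeline_id
    target.mkdir(parents=True, exist_ok=True)  # create only if it doesn't exist
    if link.is_symlink():
        link.unlink()  # replace a link left behind by an earlier job
    link.symlink_to(target, target_is_directory=True)
    return target
```

In a job container, the arguments would correspond to Path("/agent_cache"), the value of CI_PIPELINE_ID, and Path("/cache"). Anything written through the link then lands in that pipeline's own directory.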

Lastly, if a pipeline must be isolated from all other pipelines for security reasons, it should have its own DataOps Runner, which prevents other pipelines from physically reading any of its caches.