DataOps.live Vault

DataOps.live provides vault functionality to keep all confidential data private by storing it on the host machine in an encrypted format. Additionally, all the secrets saved in the vault are available to pipelines to be used by any job.

One of the first pipeline jobs initializes the vault, populating secrets and other content before other jobs run. Depending on the project and pipeline configuration, this initialization is layered from different files and sources.

Vault structure

The DataOps.live vault has a YAML-like structure composed of mandatory and optional objects.

Before defining the secrets and credentials in the vault, you need a Snowflake account. The SQL script in the topic Create Snowflake instance helps you automatically create a Snowflake account with all the necessary objects. You can use what the script creates to complete the vault example snippet below.

A typical vault structure looks like the following code snippet:

SNOWFLAKE:
  ACCOUNT: <account>
  TRANSFORM:
    USERNAME: <transform_username>
    ROLE: <transform_role>
    PASSWORD: <transform_password>
    WAREHOUSE: <transform_warehouse>
    THREADS: 8
  INGESTION:
    USERNAME: <ingestion_username>
    ROLE: <ingestion_role>
    PASSWORD: <ingestion_password>
    WAREHOUSE: <ingestion_warehouse>
    THREADS: 8
  MAIN:
    USERNAME: <main_username>
    ROLE: <main_role>
    PASSWORD: <main_password>
AWS:
  DEFAULT:
    S3_KEY: XXXXXXXXXX
    S3_SECRET: XXXXXXXXXX

The SNOWFLAKE section is currently standardized and mandatory, along with the AWS.DEFAULT credentials section. However, you can add any other content to the vault outside of these objects without causing pipeline issues.

Once you have defined the DataOps vault content (secrets and credentials), you must store it either in a vault.yml file or in a secrets manager. Read through the sections below for more information.

Vault initialization

As the image below shows, initializing the vault on each pipeline run comprises several layers that can add sensitive and non-sensitive content to the vault.

[Image: vault initialization steps from bootstrap to project settings]

1. Local DataOps runner content

The original vault.yml file from the DataOps Runner's /secrets mount point is used as the base content for initializing the vault.

2. Sensitive content

You can configure any pipeline to use a secrets manager from which sensitive values such as passwords and security keys can be loaded. This content will add to and override values loaded in the previous step.

3. Additional content

It is also possible to add a final layer of content to the vault by using a vault.template.yml file in the project itself. This can keep vault configurations local to the project (rather than the runner) and allow mapping of non-standard key/value naming schemes.

Configure the DataOps vault

The vault encryption is configured using a two-part method based on a key and a salt:

  • The Vault Key is a random string of characters configured in each project, usually in the file pipelines/includes/config/variables.yml using the variable DATAOPS_VAULT_KEY (see the sketch after this list).
  • The Vault Salt is another random string contained in a file on the runner system. This is set up as part of the DataOps Runner Installation instructions.
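
For illustration, a minimal sketch of setting the vault key; the value shown here is a placeholder, and in practice you should generate your own long random string:

pipelines/includes/config/variables.yml
variables:
  ## Placeholder - use your own long random string
  DATAOPS_VAULT_KEY: wJ4kP9xQ2mT7vB3nR8sL5hC1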

1. Configure local DataOps runner content

It is possible to omit all information from this vault configuration file and populate the vault using only layers 2 and 3 (a secrets manager and additional vault template content). See below for an example of an empty vault configuration file.

note

Current system limitations require that the vault.yml file must exist on the runner. However, as described above, you can set its content to an empty object {}.

An example of an empty vault.yml file is as follows:

{}

For more information, see the DataOps Runner Installation instructions.

2. Configure secrets manager

Most information security best practices mandate strong architectural security for protecting sensitive information. DataOps recommends using a secrets manager (such as AWS Secrets Manager or Azure Key Vault) for this purpose, and support for these systems is built into the data product platform.

Secrets loaded from a secrets manager will be applied to the vault after any local runner content.
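
As an illustration of this layering (all key names and values below are placeholders):

## Layer 1 - base content from the runner's vault.yml
SNOWFLAKE:
  MAIN:
    USERNAME: DATAOPS_MAIN
    PASSWORD: local-placeholder

## Layer 2 - values loaded from the secrets manager override
## matching keys, so the final vault holds this PASSWORD
SNOWFLAKE:
  MAIN:
    PASSWORD: value-from-secrets-manager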

note

This process may overwrite values. Therefore, make sure to demarcate which values are the responsibility of which platform.

See the Secrets Manager Orchestrator for complete configuration and usage details.

Configure additional vault content

The above two methods are more than sufficient to provide configurability and security for many use cases. However, an additional layer of vault information can be supplied to pipelines, which is particularly useful in the following circumstances:

  • Moving non-sensitive configurations away from the runner's local vault.yml file
  • Re-mapping content loaded from a secrets manager into a different vault structure

To address the first point, it can be more convenient to move away from holding configuration on the DataOps Runner, particularly values such as the Snowflake account name or configured numbers of threads. Instead, add these values to a vault.template.yml file in the project, which will be applied to the vault when pipelines run.

Secondly, it is not always possible or convenient to populate a secrets manager with value keys that precisely follow the DataOps vault structure. These values will still be loaded into the vault but at a different location. A vault template can be used to re-map them into the desired places.

To create a vault template, create a file in your project at vault-content/vault.template.yml. This file can look as follows:

SNOWFLAKE:
  ACCOUNT: "{{ env.SNOWFLAKE_ACCOUNT }}"
  MAIN:
    USERNAME: "{{ env.DATAOPS_PREFIX }}_MAIN"
    ## PASSWORD is set in Secrets Manager
    ROLE: DATAOPS_ADMIN
  TRANSFORM:
    USERNAME: "{{ env.DATAOPS_PREFIX }}_TRANSFORMATION"
    ## PASSWORD is set in Secrets Manager
    ROLE: "{{ env.DATAOPS_PREFIX }}_WRITER"
    WAREHOUSE: "{{ env.DATAOPS_PREFIX }}_TRANSFORMATION"
    THREADS: 8
  INGESTION:
    USERNAME: "{{ env.DATAOPS_PREFIX }}_INGESTION"
    ## PASSWORD is set in Secrets Manager
    ROLE: "{{ env.DATAOPS_PREFIX }}_WRITER"
    WAREHOUSE: "{{ env.DATAOPS_PREFIX }}_INGESTION"
    THREADS: 8

As this is a .template file (see below), we can include Jinja variables in the same manner as elsewhere in DataOps, with two primary benefits:

  • The ability to specify static values for vault keys (e.g. THREADS: 8)
  • The ability to initialize values from environment variables (e.g. ACCOUNT: {{ env.SNOWFLAKE_ACCOUNT }})

Furthermore, we can refer to existing vault values in the vault.template.yml file, because it is applied to the vault after initialization from the DataOps Runner's vault.yml (if it exists) and after any secrets manager information has been loaded. This allows information from the secrets manager to be re-mapped into other locations in the vault.

For example, if our secrets manager contains a key called my.snowflake.password, then you can map this into the vault in vault.template.yml as follows:

SNOWFLAKE:
  ...
  MAIN:
    ...
    PASSWORD: {{ my.snowflake.password }}

note

Using this method, it is important not to introduce circular dependencies into the vault. Only vault content loaded from a previous layer can be referenced in vault template variables.
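
For example, the following sketch (key names illustrative) introduces an invalid self-reference, because both values belong to the same template layer:

SNOWFLAKE:
  MAIN:
    PASSWORD: "{{ my.snowflake.password }}"
    ## Invalid: PASSWORD is set by this same template layer,
    ## so it cannot be referenced from within the template
    PASSWORD_COPY: "{{ SNOWFLAKE.MAIN.PASSWORD }}"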

Use the vault

The most common methods for using values from the vault are referencing them in .template files and setting variables directly using the DATAOPS_VAULT(...) syntax.

To set the INGESTION credentials in variables prefixed with SNOW_, you can use the following config:

variables:
  SNOW_ACCOUNT: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT)
  SNOW_USER: DATAOPS_VAULT(SNOWFLAKE.INGESTION.USERNAME)
  SNOW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.INGESTION.PASSWORD)
  SNOW_ROLE: DATAOPS_VAULT(SNOWFLAKE.INGESTION.ROLE)
  SNOW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.INGESTION.WAREHOUSE)

The variables section can be defined in any job or pipeline configuration file.
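
For instance, a minimal job sketch (the job name is illustrative) that makes these credentials available to its orchestration script:

my-ingestion-job:
  variables:
    SNOW_USER: DATAOPS_VAULT(SNOWFLAKE.INGESTION.USERNAME)
    SNOW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.INGESTION.PASSWORD)
  script:
    - /dataops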

DataOps templating

DataOps Template Rendering is used to extract secrets from the DataOps Vault and inject them into configuration files such as databases.template.yml as seen below.

Jinja variables can be included in templates using the {{ ... }} syntax, and the whole vault is scoped into the variable renderer so that you can use any vault path. For example, a template can include {{ SNOWFLAKE.ACCOUNT }}, which will be rendered as the configured Snowflake account string from the vault.

Additionally, the full environment is available under the prefix env., so it is possible to render an environment variable into a template, for example, {{ env.DATAOPS_DATABASE }}. Template files can include Jinja variables and other control structures, allowing a flexible and configurable method for building dynamic content.
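
As a brief sketch, a hypothetical connection.template.yml could combine vault paths and environment variables:

connection.template.yml
account: "{{ SNOWFLAKE.ACCOUNT }}"
warehouse: "{{ SNOWFLAKE.TRANSFORM.WAREHOUSE }}"
database: "{{ env.DATAOPS_DATABASE }}"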

Vault examples

  1. The first example is a SOLE database configuration, as found in databases.template.yml:

databases:
  "{{ env.DATAOPS_DATABASE }}":
    {# For non-production branches, this will be a clone of production #}
    {% if (env.DATAOPS_ENV_NAME != 'PROD' and env.DATAOPS_ENV_NAME != 'QA') %}
    from_database: "{{ env.DATAOPS_DATABASE_MASTER }}"
    {% endif %}
    comment: This is the main DataOps database for environment {{ env.DATAOPS_ENV_NAME }}

This example creates a database whose name is defined by the environment variable DATAOPS_DATABASE. However, if the current environment is not PROD or QA, this database will be cloned from the production (main) database.
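
For a non-production branch, the rendered result might look like the following sketch (database and environment names are illustrative):

databases:
  "MYPROJECT_FB_MY_FEATURE":
    from_database: "MYPROJECT_PROD"
    comment: This is the main DataOps database for environment DEV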

  2. The second example contains the SOLE configuration for multiple warehouses in warehouses.yml:

warehouses:
  {% for team_name in ['FINANCE', 'OPERATIONS', 'HR', 'SALES', 'MARKETING'] %}
  "{{ team_name }}":
    comment: Warehouse for {{ team_name }} team usage only
    warehouse_size: MEDIUM
    auto_suspend: 40
    auto_resume: true
    grants:
      USAGE:
        - FUNC_{{ team_name }}_ROLE
  {% endfor %}

The YAML code in this configuration file creates five identical warehouses without lengthy, repetitive configuration by iterating over an inline list of team names in a for loop.
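
For example, the first loop iteration renders to the following sketch:

warehouses:
  "FINANCE":
    comment: Warehouse for FINANCE team usage only
    warehouse_size: MEDIUM
    auto_suspend: 40
    auto_resume: true
    grants:
      USAGE:
        - FUNC_FINANCE_ROLE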

Ingesting variables from the vault

Many simple uses of vault secrets do not require template files (see above), particularly when just passing secure values into an orchestrator, such as an access key for an API or login details for a remote system. In this case, you can create variables in the relevant job initialized from specific vault values using the DATAOPS_VAULT(...) syntax.

To use this syntax, create a variable in a job's variables block and set the value to DATAOPS_VAULT(path.to.vault.value). When the job runs, as long as the enclosed vault path is valid, the job will replace this value with the corresponding value from the vault.

For example, this sample Talend Cloud job configuration loads the authentication token for the remote platform from the DataOps Vault:

Sample Talend Job:
  ...
  variables:
    TMC_TASK_ID: ...
    TMC_ACCESS_TOKEN: DATAOPS_VAULT(TALEND.EMEA.ACCESS_TOKEN)
    TMC_TASK_PARAMETERS: ...
  script: /dataops

note

The variable rendering mechanism is run within each job's /dataops entry point script. As a result, values can only be used by scripts and applications that run within orchestration scripts launched by /dataops.

Secret masking

Available since the February 2023 release.

For additional security, DataOps provides functionality to mask the values of all secrets stored in the vault.

Any pipeline logs that use the script entry point /dataops will have the values of vault secrets masked to prevent sensitive values from being visible to anyone with access to the pipelines.

info

Secret masking is enabled by default and cannot be disabled by environment variables.

Values added to the vault are masked whether they come from the vault.template.yml or from supported Secrets Managers.

Secret masking tool

When you want to access the vault outside the /dataops script, you can use the dataops-log-secret-masker tool to mask secrets for you.

The tool requires a stdin stream.

Example uses

Here is an example of a read-vault script that accesses the DataOps vault and attempts to log values. We must capture stdout and use a bash pipe (|) to mask all values that are also present in the DataOps vault:

warning

Always pipe the dataops-vault read commands through the dataops-log-secret-masker to avoid showing sensitive values in the logs.

my-read-vault-script.sh
export DATAOPS_VAULT_FILE=/agent_cache/$CI_PIPELINE_ID/dataops.vault
export DATAOPS_VAULT_SALT_FILE=/secrets/vault.salt

dataops-vault read-all | dataops-log-secret-masker

example-ci.yml
your-job:
  script:
    - "my-read-vault-script.sh"

The bash pipe will only capture stdout and not stderr, so we can merge the streams by redirecting stderr (file descriptor 2) into stdout (file descriptor 1):

my-read-vault-script.sh
export DATAOPS_VAULT_FILE=/agent_cache/$CI_PIPELINE_ID/dataops.vault
export DATAOPS_VAULT_SALT_FILE=/secrets/vault.salt

dataops-vault read-all 2>&1 | dataops-log-secret-masker

Add your own items to the DataOps vault by using the dataops-vault command and then check they have been added, masking the values:

my-read-vault-script.sh
export DATAOPS_VAULT_FILE=/agent_cache/$CI_PIPELINE_ID/dataops.vault
export DATAOPS_VAULT_SALT_FILE=/secrets/vault.salt

dataops-vault write-value --key CUSTOM_KEY --value CUSTOM_VALUE
dataops-vault read-all 2>&1 | dataops-log-secret-masker
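
The masked output would then hide the custom value; the exact read-all output format may differ, but illustratively:

CUSTOM_KEY: [MASKED]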

Masking exceptions

The following DataOps standard vault keys will not have their values masked as they contain non-sensitive information.

Any values stored with these keys, or key patterns, will still appear in the logs when DATAOPS_DEBUG is set to 1.

| Key pattern           | Example key                      |
| --------------------- | -------------------------------- |
| SNOWFLAKE.ACCOUNT     | SNOWFLAKE.ACCOUNT                |
| SNOWFLAKE.SOLE.ACCOUNT | SNOWFLAKE.SOLE.ACCOUNT          |
| SNOWFLAKE.*.ROLE      | SNOWFLAKE.SECOND.THIRD.ROLE      |
| SNOWFLAKE.*.WAREHOUSE | SNOWFLAKE.SECOND.THIRD.WAREHOUSE |
| SNOWFLAKE.*.THREADS   | SNOWFLAKE.SECOND.THIRD.THREADS   |

danger

Any secret value under 8 characters will not be masked, as a secret of this length is not considered secure and could be guessed from context.

For example, if secret_value=dataops, it remains unmasked as it is not considered secure. Furthermore, if short values like dataops were masked, [MASKED] would replace every occurrence of that string in the logs, including non-secret uses.

Masked vault

Any secret value not in the exceptions list will be replaced with [MASKED] in the pipeline logs.

After secret masking, if DATAOPS_DEBUG=1, a typical vault structure will appear in the logs as below:

SNOWFLAKE:
  ACCOUNT: <account>
  TRANSFORM:
    USERNAME: [MASKED]
    ROLE: <transform_role>
    PASSWORD: [MASKED]
    WAREHOUSE: <transform_warehouse>
    THREADS: 8
  INGESTION:
    USERNAME: [MASKED]
    ROLE: <ingestion_role>
    PASSWORD: [MASKED]
    WAREHOUSE: <ingestion_warehouse>
    THREADS: 8
  MAIN:
    USERNAME: [MASKED]
    ROLE: <main_role>
    PASSWORD: [MASKED]
AWS:
  DEFAULT:
    S3_KEY: [MASKED]
    S3_SECRET: [MASKED]

Password usage

If you need to set passwords as plain text on the data product platform, refer to the table below:

note

The usage in each template for all the passwords below is:

"{{ SNOWFLAKE.INGESTION.PASSWORD }}"

| Description                      | Example     | Password in Secrets |
| -------------------------------- | ----------- | ------------------- |
| Two Double-Quotes                | ""dAtaop3s! | \"\"dAtaop3s!       |
| Double-Quotes and a Single-Quote | "'dAtop3s!  | \"'dAtop3s!         |
| Two Single-Quotes                | ''DataOps2" | ''DataOps2"         |
| At Sign                          | @Dataops2"  | @Dataops2"          |
| Hash/Pound                       | #!a""Data2' | #!a\"\"Data2'       |
| Dollar Sign                      | dAt$op3s!   | dAt$$op3s!          |
| Exclamation Mark                 | !Dataops"   | !Dataops"           |
| Ampersand                        | &Dataops1"  | &Dataops1"          |
| Open-Parenthesis                 | (Dataops1   | (Dataops1           |
| Close-Parenthesis                | )Dataops1   | )Dataops1           |
| Asterisk                         | *Dataops1"  | *Dataops1"          |
| Plus-Sign                        | +Dataops1   | +Dataops1           |
| Comma                            | ,Dataops1"  | ,Dataops1"          |
| Period                           | .Dataops1   | .Dataops1           |
| Slash                            | /Dataops1   | /Dataops1           |
| Percent Sign                     | %Dataops1"  | %Dataops1"          |
| Colon                            | :Dataops1"  | :Dataops1"          |
| Semicolon                        | ;Dataops1   | ;Dataops1           |
| Less-than Sign                   | <Dataops1   | <Dataops1           |
| Equals Sign                      | =Dataops1   | =Dataops1           |
| Question Mark                    | ?Dataops1   | ?Dataops1           |
| Backslash                        | \Dataops1   | \\Dataops1          |
| Square Brackets                  | []Dataops1" | []Dataops1"         |
| Caret                            | ^Dataops1   | ^Dataops1           |
| Underscore                       | _Dataops1   | _Dataops1           |
| Tilde                            | ~Dataops1   | ~Dataops1           |
| Curly Brackets                   | {}Dataops1  | {}Dataops1          |
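
For instance, taking the Dollar Sign row above as a sketch: store the escaped form in the secrets manager, and the template renders the literal password.

## Stored in the secrets manager (escaped form)
SNOWFLAKE:
  INGESTION:
    PASSWORD: dAt$$op3s!

## "{{ SNOWFLAKE.INGESTION.PASSWORD }}" then renders in the
## template as the literal password: dAt$op3s!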