DataOps.live Vault
DataOps.live provides vault functionality to keep all confidential data private by storing it on the host machine in an encrypted format. Additionally, all the secrets saved in the vault are available to pipelines to be used by any job.
One of the first pipeline jobs initializes the vault, populating the secrets and other content before other jobs run. Depending on the project and pipeline configuration, this initialization is layered from several files and sources.
Vault structure
The DataOps.live vault has a YAML-like structure composed of mandatory and optional objects.
Before defining the secrets and credentials in the vault, you need a Snowflake account. The SQL script in the topic Create Snowflake instance helps you automatically create a Snowflake account with all the necessary objects. You can use what the script creates to complete the vault example snippet below.
A typical vault structure looks like the following code snippet:
```yaml
SNOWFLAKE:
  ACCOUNT: <account>
  TRANSFORM:
    USERNAME: <transform_username>
    ROLE: <transform_role>
    PASSWORD: <transform_password>
    WAREHOUSE: <transform_warehouse>
    THREADS: 8
  INGESTION:
    USERNAME: <ingestion_username>
    ROLE: <ingestion_role>
    PASSWORD: <ingestion_password>
    WAREHOUSE: <ingestion_warehouse>
    THREADS: 8
  MAIN:
    USERNAME: <main_username>
    ROLE: <main_role>
    PASSWORD: <main_password>
AWS:
  DEFAULT:
    S3_KEY: XXXXXXXXXX
    S3_SECRET: XXXXXXXXXX
```
The SNOWFLAKE section is currently standardized and mandatory, along with the AWS.DEFAULT credentials section. However, you can add any other content to the vault outside these objects without causing pipeline issues.
Once you have defined the DataOps vault secrets and credentials, you must store them either in a vault.yml file or in a secrets manager. Read through the sections below for more information.
Vault initialization
Initializing the vault on each pipeline run comprises several layers, each of which can add sensitive and non-sensitive content to the vault.
1. Local DataOps runner content
The original vault.yml file from the DataOps Runner's /secrets mount point is used as the base content for initializing the vault.
2. Sensitive content
You can configure any pipeline to use a secrets manager from which sensitive values such as passwords and security keys can be loaded. This content will add to and override values loaded in the previous step.
3. Additional content
It is also possible to add a final layer of content to the vault by using a vault.template.yml file in the project itself. This can keep vault configurations local to the project (rather than the runner) and allow mapping of non-standard key/value naming schemes.
Configure the DataOps vault
Firstly, vault encryption is configured using a two-part method: a key and a salt.
- The Vault Key is a random string of characters configured in each project, usually in the file pipelines/includes/config/variables.yml using the variable DATAOPS_VAULT_KEY (see the example after this list).
- The Vault Salt is another random string contained in a file on the runner system. This is set up as part of the DataOps Runner Installation instructions.
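For illustration, a minimal sketch of setting the Vault Key in pipelines/includes/config/variables.yml (the value shown is a placeholder, not a real key):

```yaml
variables:
  # Placeholder only: use your own long, randomly generated string
  DATAOPS_VAULT_KEY: REPLACE-WITH-A-LONG-RANDOM-STRING
```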
1. Configure local DataOps runner content
It is possible to omit all information from this vault configuration file and rely solely on layers 2 and 3 (the secrets manager and additional vault content). See below for an example of an empty vault configuration file.
Current system limitations require that the vault.yml file must exist on the runner. However, as described above, you can set its content to an empty object {}.

An example of an empty vault.yml file is as follows:

```yaml
{}
```
For more information, see the DataOps Runner Installation instructions.
2. Configure secrets manager
Most information security best practices mandate strong architectural security for protecting sensitive information. DataOps recommends using a secrets manager (such as AWS Secrets Manager or Azure Key Vault) for this purpose, and support for these systems is built into the data product platform.
Secrets loaded from a secrets manager are applied to the vault after any local runner content and may therefore overwrite values. Make sure it is clear which values are the responsibility of which system.
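As a sketch of this layering (the keys and values here are illustrative only): if both the runner's vault.yml and the secrets manager supply SNOWFLAKE.MAIN.PASSWORD, the secrets manager value takes precedence:

```yaml
# Layer 1: the runner's local vault.yml supplies a value...
SNOWFLAKE:
  MAIN:
    PASSWORD: value-from-runner # overwritten by the next layer
---
# Layer 2: the secrets manager also supplies SNOWFLAKE.MAIN.PASSWORD,
# so after initialization the vault holds:
SNOWFLAKE:
  MAIN:
    PASSWORD: value-from-secrets-manager
```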
See the Secrets Manager Orchestrator for complete configuration and usage details.
Configure additional vault content
The above two methods are more than sufficient to provide configurability and security for many use cases. However, an additional layer of vault information can be supplied to pipelines, which is particularly useful in the following circumstances:
- Moving non-sensitive configurations away from the runner's local vault.yml file
- Re-mapping content loaded from a secrets manager into a different vault structure.
To address the first point, it can be more convenient to move configuration away from the DataOps Runner, particularly values such as the Snowflake account name or configured numbers of threads. Instead, add these values to a vault.template.yml file in the project, which will be applied to the vault when pipelines run.
Secondly, it is not always possible or convenient to populate a secrets manager with value keys that precisely follow the DataOps vault structure. These values will still be loaded into the vault but at a different location. A vault template can be used to re-map them into the desired places.
To create a vault template, create a file in your project at vault-content/vault.template.yml. This file can look as follows:
```yaml
SNOWFLAKE:
  ACCOUNT: "{{ env.SNOWFLAKE_ACCOUNT }}"
  MAIN:
    USERNAME: "{{ env.DATAOPS_PREFIX }}_MAIN"
    ## PASSWORD is set in Secrets Manager
    ROLE: DATAOPS_ADMIN
  TRANSFORM:
    USERNAME: "{{ env.DATAOPS_PREFIX }}_TRANSFORMATION"
    ## PASSWORD is set in Secrets Manager
    ROLE: "{{ env.DATAOPS_PREFIX }}_WRITER"
    WAREHOUSE: "{{ env.DATAOPS_PREFIX }}_TRANSFORMATION"
    THREADS: 8
  INGESTION:
    USERNAME: "{{ env.DATAOPS_PREFIX }}_INGESTION"
    ## PASSWORD is set in Secrets Manager
    ROLE: "{{ env.DATAOPS_PREFIX }}_WRITER"
    WAREHOUSE: "{{ env.DATAOPS_PREFIX }}_INGESTION"
    THREADS: 8
```
As this is a .template file (see below), we can include Jinja variables in the same manner as elsewhere in DataOps, with two primary benefits:

- The ability to specify static values for vault keys (e.g. THREADS: 8)
- The ability to initialize values from environment variables (e.g. ACCOUNT: {{ env.SNOWFLAKE_ACCOUNT }})
Furthermore, the template file can refer to values already in the vault, because it is applied after the vault has been initialized from the DataOps Runner's vault.yml (if it exists) and after any secrets manager information has been loaded. This allows information from the secrets manager to be re-mapped into other locations in the vault.
For example, if your secrets manager contains a key called my.snowflake.password, you can map it into the vault in vault.template.yml as follows:
```yaml
SNOWFLAKE:
  ...
  MAIN:
    ...
    PASSWORD: {{ my.snowflake.password }}
```
When using this method, it is important not to introduce circular dependencies into the vault. Only vault content loaded from a previous layer can be referenced in vault template variables.
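For instance, a hypothetical template entry like the following cannot be resolved, because the key refers to itself rather than to content loaded in a previous layer:

```yaml
# BAD: circular reference. SNOWFLAKE.TRANSFORM.PASSWORD is being defined
# by this very template, so there is no earlier-layer value to read.
SNOWFLAKE:
  TRANSFORM:
    PASSWORD: "{{ SNOWFLAKE.TRANSFORM.PASSWORD }}"
```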
Use the vault
The most common methods for using values from the vault are in .template files and by directly setting variables using the DATAOPS_VAULT(...) syntax.
To set the INGESTION credentials in variables prefixed with SNOW_, you can use the following config:
```yaml
variables:
  SNOW_ACCOUNT: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT)
  SNOW_USER: DATAOPS_VAULT(SNOWFLAKE.INGESTION.USERNAME)
  SNOW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.INGESTION.PASSWORD)
  SNOW_ROLE: DATAOPS_VAULT(SNOWFLAKE.INGESTION.ROLE)
  SNOW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.INGESTION.WAREHOUSE)
```
The variables section can be defined in any job or pipeline configuration file.
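For example, a minimal sketch of a job-scoped variant (the job name is illustrative):

```yaml
load-raw-data: # hypothetical job name
  variables:
    # These vault lookups apply to this job only
    SNOW_USER: DATAOPS_VAULT(SNOWFLAKE.INGESTION.USERNAME)
    SNOW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.INGESTION.PASSWORD)
  script: /dataops
```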
DataOps templating
DataOps Template Rendering is used to extract secrets from the DataOps Vault and inject them into configuration files such as databases.template.yml, as seen below.
Jinja variables can be included in templates using the {{ ... }} syntax, and the whole vault is scoped into the variable renderer so that you can use any vault path. For example, a template can include {{ SNOWFLAKE.ACCOUNT }}, which will be rendered as the configured Snowflake account string from the vault.
Additionally, the full environment is available under the prefix env., so it is possible to render an environment variable into a template, for example, {{ env.DATAOPS_DATABASE }}. Template files can include Jinja variables and other control structures, allowing a flexible and configurable method for building dynamic content.
Vault examples
- The first example is a SOLE database configuration, as found in databases.template.yml:
```yaml
databases:
  "{{ env.DATAOPS_DATABASE }}":
    {# For non-production branches, this will be a clone of production #}
    {% if (env.DATAOPS_ENV_NAME != 'PROD' and env.DATAOPS_ENV_NAME != 'QA') %}
    from_database: "{{ env.DATAOPS_DATABASE_MASTER }}"
    {% endif %}
    comment: This is the main DataOps database for environment {{ env.DATAOPS_ENV_NAME }}
```
This example creates a database whose name is defined by the environment variable DATAOPS_DATABASE. If the current environment is neither PROD nor QA, the database is created as a clone of the production (main) database.
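As an illustration, in a hypothetical feature-branch environment where DATAOPS_ENV_NAME is FB_MY_FEATURE, the template above might render to something like this (the database names are placeholders):

```yaml
databases:
  "MYPROJ_FB_MY_FEATURE": # from env.DATAOPS_DATABASE (placeholder name)
    from_database: "MYPROJ_PROD" # clone of production (placeholder name)
    comment: This is the main DataOps database for environment FB_MY_FEATURE
```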
- The second example contains the SOLE configuration for multiple warehouses in warehouses.template.yml:
```yaml
warehouses:
  {% for team_name in ['FINANCE', 'OPERATIONS', 'HR', 'SALES', 'MARKETING'] %}
  "{{ team_name }}":
    comment: Warehouse for {{ team_name }} team usage only
    warehouse_size: MEDIUM
    auto_suspend: 40
    auto_resume: true
    grants:
      USAGE:
        - FUNC_{{ team_name }}_ROLE
  {% endfor %}
```
The YAML code in this configuration file creates five identical warehouses without a lengthy, repetitive configuration by feeding an inline list of team names into a for loop.
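For instance, the first loop iteration would expand to something like the following after rendering (the remaining teams render identically):

```yaml
warehouses:
  "FINANCE":
    comment: Warehouse for FINANCE team usage only
    warehouse_size: MEDIUM
    auto_suspend: 40
    auto_resume: true
    grants:
      USAGE:
        - FUNC_FINANCE_ROLE
  # ...and likewise for OPERATIONS, HR, SALES, and MARKETING
```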
Ingesting variables from the vault
Many simple uses of vault secrets do not require template files (see above), particularly when just passing secure values into an orchestrator, such as an access key for an API or login details for a remote system. In this case, you can create variables in the relevant job initialized from specific vault values using the DATAOPS_VAULT(...) syntax.
To use this syntax, create a variable in a job's variables block and set the value to DATAOPS_VAULT(path.to.vault.value). When the job runs, as long as the enclosed vault path is valid, the job will replace this value with the corresponding value from the vault.
For example, this sample Talend Cloud job configuration loads the authentication token for the remote platform from the DataOps Vault:
```yaml
Sample Talend Job:
  ...
  variables:
    TMC_TASK_ID: ...
    TMC_ACCESS_TOKEN: DATAOPS_VAULT(TALEND.EMEA.ACCESS_TOKEN)
    TMC_TASK_PARAMETERS: ...
  script: /dataops
```
The variable rendering mechanism is run within each job's /dataops entry point script. As a result, values can only be used by scripts and applications that run within orchestration scripts launched by /dataops.
Secret masking
Available since the February 2023 release.
For additional security, DataOps provides functionality to mask the values of all secrets stored in the vault.
Any pipeline logs produced via the script entry point /dataops will have the values of vault secrets masked, preventing sensitive values from being visible to anyone with access to the pipelines.
Secret masking is enabled by default and cannot be disabled by environment variables.
Values added to the vault are masked whether they come from the vault.template.yml file or from supported secrets managers.
Secret masking tool
When you want to access the vault outside the /dataops script, you can use the dataops-log-secret-masker tool to mask secrets for you.
The tool requires a stdin stream.
Example uses
Here is an example of a read-vault script that accesses the DataOps vault and attempts to log values. We must capture stdout and use a bash pipe (|) to mask all the values that are also present in the DataOps vault:
Always pipe dataops-vault read commands through dataops-log-secret-masker to avoid showing sensitive values in the logs.
```bash
DATAOPS_VAULT_FILE=/agent_cache/$CI_PIPELINE_ID/dataops.vault
DATAOPS_VAULT_SALT_FILE=/secrets/vault.salt
dataops-vault read-all | dataops-log-secret-masker
```
```yaml
your-job:
  script:
    - "my-read-vault-script.sh"
```
The bash pipe will only capture stdout and not stderr, so we can merge the two streams by redirecting stderr (file descriptor 2) into stdout (file descriptor 1):
```bash
DATAOPS_VAULT_FILE=/agent_cache/$CI_PIPELINE_ID/dataops.vault
DATAOPS_VAULT_SALT_FILE=/secrets/vault.salt
dataops-vault read-all 2>&1 | dataops-log-secret-masker
```
Add your own items to the DataOps vault by using the dataops-vault command and then check they have been added, masking the values:
```bash
DATAOPS_VAULT_FILE=/agent_cache/$CI_PIPELINE_ID/dataops.vault
DATAOPS_VAULT_SALT_FILE=/secrets/vault.salt
dataops-vault write-value --key CUSTOM_KEY --value CUSTOM_VALUE
dataops-vault read-all 2>&1 | dataops-log-secret-masker
```
Masking exceptions
The following DataOps standard vault keys will not have their values masked as they contain non-sensitive information.
Any values stored with these keys, or key patterns, will still appear in the logs when DATAOPS_DEBUG is set to 1.
| Key pattern | Example key |
| --- | --- |
| SNOWFLAKE.ACCOUNT | SNOWFLAKE.ACCOUNT |
| SNOWFLAKE.SOLE.ACCOUNT | SNOWFLAKE.SOLE.ACCOUNT |
| SNOWFLAKE.*.ROLE | SNOWFLAKE.SECOND.THIRD.ROLE |
| SNOWFLAKE.*.WAREHOUSE | SNOWFLAKE.SECOND.THIRD.WAREHOUSE |
| SNOWFLAKE.*.THREADS | SNOWFLAKE.SECOND.THIRD.THREADS |
Caution: Any secret value under 8 characters will not be masked, as a secret of this length is not considered secure and could be guessed from context.

For example, if secret_value=dataops, it remains unmasked because it is too short to be considered secure. Furthermore, if we were to mask short values like dataops, [MASKED] would replace every occurrence of that string in the logs, not just the secret ones.
Masked vault
Any secret value not in the exceptions list will be replaced with [MASKED] in the pipeline logs.
After secret masking, if DATAOPS_DEBUG=1, a typical vault structure will appear in the logs as below:
```yaml
SNOWFLAKE:
  ACCOUNT: <account>
  TRANSFORM:
    USERNAME: [MASKED]
    ROLE: <transform_role>
    PASSWORD: [MASKED]
    WAREHOUSE: <transform_warehouse>
    THREADS: 8
  INGESTION:
    USERNAME: [MASKED]
    ROLE: <ingestion_role>
    PASSWORD: [MASKED]
    WAREHOUSE: <ingestion_warehouse>
    THREADS: 8
  MAIN:
    USERNAME: [MASKED]
    ROLE: <main_role>
    PASSWORD: [MASKED]
AWS:
  DEFAULT:
    S3_KEY: [MASKED]
    S3_SECRET: [MASKED]
```
Password usage
If you need to set passwords as plain text on the data product platform, refer to the table below. The usage in each template for all the passwords below is:

```yaml
"{{ SNOWFLAKE.INGESTION.PASSWORD }}"
```
| Description | Example | Password in Secrets |
| --- | --- | --- |
| Two double-quotes | ""dAtaop3s! | \"\"dAtaop3s! |
| Double-quote and a single-quote | "'dAtop3s! | \"'dAtop3s! |
| Two single-quotes | ''DataOps2 | "''DataOps2" |
| At sign | @Dataops2 | "@Dataops2" |
| Hash/pound | #!a""Data2 | '#!a\"\"Data2' |
| Dollar sign | dAt$op3s! | dAt$$op3s! |
| Exclamation mark | !Dataops | "!Dataops" |
| Ampersand | &Dataops1 | "&Dataops1" |
| Open-parenthesis | (Dataops1 | (Dataops1 |
| Close-parenthesis | )Dataops1 | )Dataops1 |
| Asterisk | *Dataops1 | "*Dataops1" |
| Plus sign | +Dataops1 | +Dataops1 |
| Comma | ,Dataops1 | ",Dataops1" |
| Period | .Dataops1 | .Dataops1 |
| Slash | /Dataops1 | /Dataops1 |
| Percent sign | %Dataops1 | "%Dataops1" |
| Colon | :Dataops1 | ":Dataops1" |
| Semicolon | ;Dataops1 | ;Dataops1 |
| Less-than sign | <Dataops1 | <Dataops1 |
| Equals sign | =Dataops1 | =Dataops1 |
| Question mark | ?Dataops1 | ?Dataops1 |
| Backslash | \Dataops1 | \\Dataops1 |
| Square brackets | []Dataops1 | "[]Dataops1" |
| Caret | ^Dataops1 | ^Dataops1 |
| Underscore | _Dataops1 | _Dataops1 |
| Tilde | ~Dataops1 | ~Dataops1 |
| Curly brackets | {}Dataops1 | {}Dataops1 |