How to Manage Multiple Databases
Out of the box, using the standard project template, DataOps provides configuration for a single Snowflake database (per environment). This database is referred to in the project by the environment variable DATAOPS_DATABASE, and its name comprises the DATAOPS_PREFIX value joined with an underscore to the DATAOPS_ENV_NAME value. For example, with DATAOPS_PREFIX set to DATAOPS and DATAOPS_ENV_NAME set to PROD, the default database is DATAOPS_PROD.
However, many use cases benefit from, if not strongly require, more than one database in Snowflake.
Managing databases with SOLE and MATE
It's possible, and fairly straightforward, to manage multiple databases with DataOps SOLE and MATE:
- alongside the default database - this just needs a bit of extra configuration
- dynamically replacing the default database - this requires working around some delicate DataOps internals (proceed with caution!)
Before you start
Keep in mind that additional databases will add an extra layer of complexity to your project. If your use case works just as well with schemas, that may be the easier option.
Also, DataOps can provision your databases into every environment, including feature branches. Do you want several databases cloned every time you create a feature branch? If your databases are similar, you may choose to develop against just one.
Finally, think about how your databases will be provisioned and managed - together or individually. Will you have a configuration that creates/updates all your databases at once, or do you need to run a pipeline for a single database, maybe using a variable to select which one?
Keep all this in mind as you work through this document.
Notes on examples
Standard variables
In the examples below, the following variables are set accordingly:
| Variable | Value |
|---|---|
| DATAOPS_PREFIX | DATAOPS |
Architecture 1: Extra, individual databases
In this first architecture, let's consider the requirement for one or more additional databases alongside the default DataOps database. This has the benefit of not altering or removing the default database, so it can be the most straightforward architecture to implement.
For this example we will create two additional databases with separate content that can be optionally cloned into feature branches.
SOLE configuration
To include additional databases in the standard SOLE configuration:
- Add them as first-level items in the same way as any other account-level object:

  databases:
    ## Default database - keep this in here!
    "{{ env.DATAOPS_DATABASE }}":
      ...

    ALPHA:
      comment: My first additional database
      ...

    BRAVO:
      comment: My other additional database
      ...

- Configure each additional database as required, using the SOLE database documentation as needed. Feel free to use YAML anchors or Jinja blocks to repeat or generate content.
In this example, the namespacing attribute is omitted, so it will take the default value (both), meaning that the resultant database name will also include the project prefix (DATAOPS_PREFIX) and the environment suffix (as we want the databases to exist in multiple environments). Therefore, in production, their full names will be DATAOPS_ALPHA_PROD and DATAOPS_BRAVO_PROD, respectively.
When the standard pipeline first runs in main, three databases will be created as per the above configuration.
- Create a working variable at the top of databases.template.yml that contains either the current environment name or is blank, depending on an environment variable we will use called SKIP_ADDITIONAL_DBS:

  {% set db_switch = '' if env.SKIP_ADDITIONAL_DBS else env.DATAOPS_ENV_NAME %}

- Set the environment attribute of each additional database to use this working variable:

  databases:
    ...
    ALPHA:
      comment: My first additional database
      environment: '{{ db_switch }}'
      ...

- Run a feature branch pipeline with SKIP_ADDITIONAL_DBS set to 1.
How this works: The environment attribute in SOLE says, "only create this object in the specified environment," so we're using it here as a kind of on-off switch for each database. Setting it to the value of DATAOPS_ENV_NAME means the database gets created, whereas setting it to a blank value will cause SOLE to skip the object.
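Putting these steps together, the relevant parts of databases.template.yml might look like the following sketch (the comments and exact layout are illustrative, not taken verbatim from the standard template):

{% set db_switch = '' if env.SKIP_ADDITIONAL_DBS else env.DATAOPS_ENV_NAME %}

databases:
  ## Default database - keep this in here!
  "{{ env.DATAOPS_DATABASE }}":
    ...

  ## Additional databases: skipped when SKIP_ADDITIONAL_DBS is set
  ALPHA:
    comment: My first additional database
    environment: '{{ db_switch }}'

  BRAVO:
    comment: My other additional database
    environment: '{{ db_switch }}'

Leaving SKIP_ADDITIONAL_DBS unset keeps ALPHA and BRAVO managed as normal, while setting it to 1 for a feature branch pipeline gives developers just the default database.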
MATE configuration
The MATE configuration within the standard DataOps template does not, by default, include a setting for the database. This is because the template assumes a single database will be used, and so MATE will helpfully take care of setting this behind the scenes in the MATE orchestrator.
In this example, each additional database will have its own MATE models, and we will assume they are kept in their own directory within dataops/modelling/models. We can then configure dbt_project.yml to build those models in the correct database:
...
models:
  ...
  MyProject:
    ...
    alpha: ## this corresponds to directory path dataops/modelling/models/alpha
      +database: "{{ env_var('DATAOPS_PREFIX') }}_ALPHA_{{ env_var('DATAOPS_ENV_NAME') }}"
      ...
    bravo: ## this corresponds to directory path dataops/modelling/models/bravo
      +database: "{{ env_var('DATAOPS_PREFIX') }}_BRAVO_{{ env_var('DATAOPS_ENV_NAME') }}"
      ...
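To build each database's models from its own pipeline job, you can point a MATE job at one directory with TRANSFORM_MODEL_SELECTOR. Here is a minimal sketch, following the job pattern used later in this guide (the job name is illustrative):

Build ALPHA Models:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: RUN
    ## Only run the models under dataops/modelling/models/alpha, which
    ## dbt_project.yml maps to the ALPHA database
    TRANSFORM_MODEL_SELECTOR: models/alpha
  stage: Data Transformation
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}

A similar job with TRANSFORM_MODEL_SELECTOR: models/bravo would cover the other database.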
Architecture 2: Multiple identical databases
In the previous architecture, we considered the additional databases to be individual and self-contained, but in this case, we will manage multiple databases that share a SOLE and MATE configuration. This scenario could most commonly arise when a project needs to manage a database for each client, product, or other logical unit, where the structure is similar for each.
The advantage here is that you can maintain a single set of configurations, both for SOLE and MATE, and apply it iteratively to a whole list of clients, products, etc.
SOLE configuration
Since all additional databases will have the same SOLE configuration, you can use a simple loop to avoid duplication in the YAML file:
databases:
  ...
  {% for client_name in ['Smiths', 'Jones'] %}
  "{{ client_name|upper }}":
    comment: My client database for {{ client_name }}
    ...
  {% endfor %}
On the main branch, the above configuration will create two databases called DATAOPS_SMITHS_PROD and DATAOPS_JONES_PROD.
When creating a feature branch environment, developers probably don't want N+1 feature databases (where N is the number of additional databases). Therefore, a good approach is to create one more database that contains sample data for development/testing, and only create that one in the dev/feature environments.
Add this logic (or something like it) into your project's SOLE configuration:
databases:
  ...
  {% set client_names = ['Smiths', 'Jones'] if env.DATAOPS_ENV_NAME in [env.DATAOPS_ENV_NAME_PROD, env.DATAOPS_ENV_NAME_QA] else ['Sample'] %}
  {% for client_name in client_names %}
  "{{ client_name|upper }}":
    ...
This creates the same client databases for the PROD and QA environments, but the DEV/feature environments will just get a SAMPLE database as well as the DataOps default.
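Putting the loop and the environment-dependent client list together, the databases section might look like this sketch (comments are illustrative; the default database entry follows the pattern from Architecture 1):

databases:
  ## Default database - keep this in here!
  "{{ env.DATAOPS_DATABASE }}":
    ...

  ## Real client databases in PROD/QA, a single SAMPLE database elsewhere
  {% set client_names = ['Smiths', 'Jones'] if env.DATAOPS_ENV_NAME in [env.DATAOPS_ENV_NAME_PROD, env.DATAOPS_ENV_NAME_QA] else ['Sample'] %}
  {% for client_name in client_names %}
  "{{ client_name|upper }}":
    comment: My client database for {{ client_name }}
  {% endfor %}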
MATE configuration
With the previous architecture, we had a directory of MATE models for each of the additional databases, but in this architecture, all additional databases share the same structure. This makes the configuration simpler, as there is now only one set of models, but it leaves the challenge of applying those models to multiple databases.
One solution is to parameterize the database name in dbt_project.yml, using your own environment variable called something like CLIENT_NAME:
...
models:
  ...
  MyProject:
    ...
    client: ## this corresponds to directory path dataops/modelling/models/client
      +database: "{{ env_var('DATAOPS_PREFIX') }}_{{ env_var('CLIENT_NAME')|upper }}_{{ env_var('DATAOPS_ENV_NAME') }}"
      ...
This will allow the creation of MATE jobs that apply these models (by setting TRANSFORM_MODEL_SELECTOR: models/client) to a specific client database (by setting CLIENT_NAME: XXXX). Here's an example job that runs the models for one customer:
Build Models (Smiths):
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: models/client
    CLIENT_NAME: Smiths
  stage: Data Transformation
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
This approach is fine for a few customers but may not scale well to larger numbers. However, the data product platform lets you dynamically generate several parallel jobs from a list of parameter values:
Build Client Models:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: models/client
  parallel:
    matrix:
      - CLIENT_NAME: ["Smiths", "Jones"]
  stage: Data Transformation
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
Architecture 3: Single dynamic database
Whilst both the previous architectures create databases alongside the DataOps default database, this approach redefines the DATAOPS_DATABASE variable dynamically at run time to point the pipeline at a different database on each run.
Overriding critical DataOps variables such as DATAOPS_DATABASE can have serious consequences. Thoroughly test any implementation of this architecture in a safe place before even thinking about deploying to production.
The general approach here is to tweak the value of DATAOPS_DATABASE in each pipeline run, so a different database is built/updated each time.
Option A: Override DATAOPS_DATABASE
Unfortunately, it's not as simple as setting DATAOPS_DATABASE: XXXX in a pipeline or job configuration, as the before_script in every job will recompute this variable and reset it back to the standard pattern for the project. Therefore, the override must be applied after the before_script has run.
Here's an example of overriding DATAOPS_DATABASE in a SOLE setup job, which will set the name of the default database to DATAOPS_SMITHS_PROD:
Set Up Snowflake:
  ...
  variables:
    CLIENT_NAME: Smiths
  script:
    - export DATAOPS_DATABASE=${DATAOPS_PREFIX}_${CLIENT_NAME^^}_${DATAOPS_ENV_NAME}
    - /dataops
  ...
Every time SOLE runs, it drops any Snowflake object that has been removed from the project configuration, as long as the internal SOLE state file confirms that it is an object that is managed by SOLE. This enables the deletion of objects by just removing them from your configuration.
However, if dynamic elements are introduced into the SOLE configuration, causing object names to change between pipeline runs (as we're doing here), SOLE may drop the objects created by a previous pipeline, since their old names no longer appear in the configuration.
The safest way to avoid this issue is to permanently set the LIFECYCLE_STATE_RESET variable to 1 in projects that use this advanced behavior. This will cause SOLE to "forget" about any other objects in Snowflake.
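For example, this could sit with your other project defaults in variables.yml (a minimal sketch; the surrounding entries are assumed):

variables:
  ...
  ## Stop SOLE from dropping databases whose names have changed since the last run
  LIFECYCLE_STATE_RESET: 1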
All jobs that use DATAOPS_DATABASE will need the above script line (and any other logic you implement) added to keep behavior consistent throughout the pipeline.
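For instance, a MATE job in the same pipeline might repeat the override before calling /dataops (a sketch based on the job pattern shown earlier; the job name is illustrative):

Build Models:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: RUN
    CLIENT_NAME: Smiths
  stage: Data Transformation
  script:
    ## Re-point DATAOPS_DATABASE after the standard before_script has reset it
    - export DATAOPS_DATABASE=${DATAOPS_PREFIX}_${CLIENT_NAME^^}_${DATAOPS_ENV_NAME}
    - /dataops
  icon: ${TRANSFORM_ICON}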
Option B: Override DATAOPS_PREFIX
Although not quite as flexible as the previous option, this approach is much cleaner and works centrally in the project. However, it will affect every object that uses DATAOPS_PREFIX as part of its namespacing, so it is better suited to projects where the database is the only account-level object managed by SOLE.
It's possible to define the DATAOPS_PREFIX variable dynamically, based on other variables. For example, it can include a CLIENT_NAME variable that is set further up in the variables.yml file:
variables:
  ...
  CLIENT_NAME: SMITHS
  DATAOPS_PREFIX: DATAOPS_$CLIENT_NAME
When applied to the standard DataOps template project, this will create a default database named DATAOPS_SMITHS_PROD. However, the roles and warehouses will also carry this client name as part of their prefix. Make sure this is what you need when choosing this approach.
As with the previous approach, ensure LIFECYCLE_STATE_RESET is set in the project to make sure objects do not get removed by future pipelines.
Option C: Use a custom before_script
The two previous options have shown how tweaking the values of built-in DataOps variables, whilst potentially dangerous, can provide a very flexible approach to working with dynamic content in Snowflake.
To avoid having to include additional script lines in each job, consider overriding variables centrally by implementing a custom before_script.
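As a sketch only: one way to centralize this is a hidden base job that carries the override in its before_script, which every affected job then extends. How you chain to the platform's standard before_script (for example, with GitLab's !reference tags) depends on how your project template defines it - the job name and that chaining step are assumptions to verify, not documented DataOps behavior:

.with_client_database:
  before_script:
    ## ...the standard DataOps before_script steps must still run here first...
    ## then re-point DATAOPS_DATABASE at the client database
    - export DATAOPS_DATABASE=${DATAOPS_PREFIX}_${CLIENT_NAME^^}_${DATAOPS_ENV_NAME}

Jobs that need the dynamic database would then extend .with_client_database alongside their usual base jobs, rather than repeating the export in every script block.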