Skip to main content

How to Clone Schemas in Feature Branches

When the DataOps database becomes sufficiently large that cloning it into a dev or feature branch environment takes a long time, it's time to consider switching to schema cloning.

With this feature implemented in a DataOps project, new pipelines in the dev or feature branches will create a new, empty database, populated with just the cloned schemas specified in a configurable list.

Configuration

1. Remove database cloning

The first thing to do is to remove the default database cloning logic from the project's Snowflake (SOLE) configuration:

dataops/snowflake/databases.template.yml
databases:
"{{ env.DATAOPS_DATABASE }}":
comment: This is the main DataOps database for environment {{ env.DATAOPS_ENV_NAME }}
{% if (env.DATAOPS_ENV_NAME != env.DATAOPS_ENV_NAME_PROD and env.DATAOPS_ENV_NAME != env.DATAOPS_ENV_NAME_QA) %}
## Pipelines running in branches that are not master or qa will clone this database from the main production database.
from_database: "{{ env.DATAOPS_DATABASE_MASTER }}"
data_retention_time_in_days: 1
{% endif %}
...

Remove the highlighted row to leave the following (plus any database grants and additional configuration you may have):

dataops/snowflake/databases.template.yml
databases:
"{{ env.DATAOPS_DATABASE }}":
comment: This is the main DataOps database for environment {{ env.DATAOPS_ENV_NAME }}
...

2. Configure all schemas

Next, still in the Snowflake configuration, make sure all schemas are defined in dataops/snowflake/databases.template.yml.

Each schema's manage_mode attribute must be set dynamically depending on whether a pipeline is running in prod/qa or a development branch, and this can be achieved using the following variable:

dataops/snowflake/databases.template.yml
{% set schema_manage_mode = 'all' if env.DATAOPS_ENV_NAME in [env.DATAOPS_ENV_NAME_PROD, env.DATAOPS_ENV_NAME_QA] else 'none' %}
databases:
"{{ env.DATAOPS_DATABASE }}":
...
schemas:
SCHEMA1:
manage_mode: '{{ schema_manage_mode }}'
...
Please note:

All schemas should be defined here, and have the manage_mode set dynamically this way, regardless of whether they will be cloned in dev/feature branches.

3. Create pre-hooks for schema cloning

This is where the real fun starts: setting up SOLE hooks to run between the database and database-level phases that will create the selected schemas.

dataops/snowflake/hooks.template.yml
{% if env.DATAOPS_ENV_NAME not in [env.DATAOPS_ENV_NAME_PROD, env.DATAOPS_ENV_NAME_QA] %}

database_level_hooks:
pre_hooks:
{% for schema in ['SCHEMA1', 'SCHEMA2', 'SCHEMA3'] %}
- command: "CREATE OR REPLACE SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} CLONE {{ env.DATAOPS_DATABASE_MASTER }}.{{ schema }};"
environment: snowflake
- command: "GRANT USAGE ON SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} TO ROLE {{ env.DATAOPS_PREFIX }}_READER;"
environment: snowflake
- command: "GRANT CREATE TABLE, CREATE VIEW ON SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} TO ROLE {{ env.DATAOPS_PREFIX }}_WRITER;"
environment: snowflake
{% endfor %}

{% else %}
database_level_hooks: {}
{% endif %}

These hooks will run before the database-level phase of SOLE (i.e. where schemas, tables, etc. get created/updated), stepping through each schema in a list to:

  1. Create the schema in the dev/feature database by cloning the same schema from the production one
  2. Grant USAGE on the schema to the built-in READER role (this may change in your configuration)
  3. Grant CREATE TABLE and CREATE VIEW on the schema to the WRITER role (you may have additional grants/roles here)

4. Manage the list of schemas

Although having the list of schemas that will be cloned in dev/feature branches managed inline in the configuration of the pre-hook, a cleaner approach is to maintain this list in a separate configuration file and reference it in the hooks config.

dataops/snowflake/configuration/feature_branch_schemas.yml
feature_branch_schemas:
- SCHEMA1
- SCHEMA2
- SCHEMA3

This list can then be referenced in the hooks:

dataops/snowflake/hooks.template.yml
...
database_level_hooks:
pre_hooks:
{% for schema in feature_branch_schemas %}
- command: "CREATE OR REPLACE SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} CLONE {{ env.DATAOPS_DATABASE_MASTER }}.{{ schema }};"
...

As per pre-hook documentation, for SOLE to recognize the new configuration file, an additional variable is required:

pipelines/includes/config/variables.yml
variables:
...
DATAOPS_CONFIGURATION_DIR: $CI_PROJECT_DIR/dataops/snowflake/configuration
...

Things to watch out for

In a dev/feature pipeline, only the schemas you specify in the above list will be created, and will therefore be the only schemas in that branch's database. Therefore, every MATE model that builds in that dev/feature pipeline must have a schema that exists in that dev/feature database. Using branch-specific logic to override the TRANSFORM_MODEL_SELECTOR can help here, limiting the models that are built/tested in a feature branch, and has the additional benefit of reducing the build time.