How to Clone Schemas in Feature Branches
When the DataOps database becomes sufficiently large that cloning it into a dev or feature branch environment takes a long time, it's time to consider switching to schema cloning.
With this feature implemented in a DataOps project, pipelines in dev or feature branches create a new database containing only the schemas named in a configurable list, each cloned from the production database.
With the release of the Snowflake 2023_07 bundle planned for January 2024, when a table is cloned, its load history will also be cloned.
As a result, files are not reloaded, and data is not duplicated in the cloned table. You can override this behavior with the FORCE = TRUE option of the COPY command.
See the Snowflake documentation for more information.
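For illustration only, if you do need files reloaded into a cloned table, the FORCE option can be added to the COPY statement. A minimal sketch, wrapped as a SOLE hook; the table, stage, and hook placement are hypothetical and not part of the setup described below:

```yaml
database_level_hooks:
  post_hooks:
    # Hypothetical example: reload staged files into a cloned table even though
    # its load history was cloned along with the table.
    - command: "COPY INTO {{ env.DATAOPS_DATABASE }}.SCHEMA1.MY_TABLE FROM @{{ env.DATAOPS_DATABASE }}.SCHEMA1.MY_STAGE FORCE = TRUE;"
      environment: snowflake
```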
Configuration
1. Remove database cloning
The first thing to do is to remove the default database cloning logic from the project's Snowflake (SOLE) configuration:
```yaml
databases:
  "{{ env.DATAOPS_DATABASE }}":
    comment: This is the main DataOps database for environment {{ env.DATAOPS_ENV_NAME }}
    {% if (env.DATAOPS_ENV_NAME != env.DATAOPS_ENV_NAME_PROD and env.DATAOPS_ENV_NAME != env.DATAOPS_ENV_NAME_QA) %}
    ## Pipelines running in branches that are not main or qa will clone this database from the main production database.
    from_database: "{{ env.DATAOPS_DATABASE_MASTER }}"
    data_retention_time_in_days: 1
    {% endif %}
    ...
```
Remove the `{% if %}...{% endif %}` block containing `from_database` to leave the following (plus any database grants and additional configuration you may have):
```yaml
databases:
  "{{ env.DATAOPS_DATABASE }}":
    comment: This is the main DataOps database for environment {{ env.DATAOPS_ENV_NAME }}
    ...
```
2. Configure all schemas
Next, still in the Snowflake configuration, make sure all schemas are defined in `dataops/snowflake/databases.template.yml`.
Each schema's `manage_mode` attribute must be set dynamically, depending on whether the pipeline is running in prod/qa or a development branch. This can be achieved using the following variable:
```yaml
{% set schema_manage_mode = 'all' if env.DATAOPS_ENV_NAME in [env.DATAOPS_ENV_NAME_PROD, env.DATAOPS_ENV_NAME_QA] else 'none' %}

databases:
  "{{ env.DATAOPS_DATABASE }}":
    ...
    schemas:
      SCHEMA1:
        manage_mode: '{{ schema_manage_mode }}'
      ...
```
All schemas should be defined here with `manage_mode` set dynamically, regardless of whether they will be cloned in dev/feature branches.
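To illustrate what this achieves: in a dev/feature branch the condition is false, so every schema renders with `manage_mode: 'none'`, meaning SOLE will not try to create or alter those schemas itself; the pre-hooks in the next step create them by cloning instead. A sketch of the rendered configuration in a feature branch (database and schema names are illustrative):

```yaml
# Rendered output in a feature branch (illustrative names)
databases:
  MY_PROJECT_FB_MY_FEATURE:
    schemas:
      SCHEMA1:
        manage_mode: 'none'  # left to the cloning pre-hooks in the next step
```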
3. Create pre-hooks for schema cloning
This is where the real fun starts: setting up SOLE hooks that run between the database phase and the database-level phase to create the selected schemas.
```yaml
{% if env.DATAOPS_ENV_NAME not in [env.DATAOPS_ENV_NAME_PROD, env.DATAOPS_ENV_NAME_QA] %}
database_level_hooks:
  pre_hooks:
    {% for schema in ['SCHEMA1', 'SCHEMA2', 'SCHEMA3'] %}
    - command: "CREATE OR REPLACE SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} CLONE {{ env.DATAOPS_DATABASE_MASTER }}.{{ schema }};"
      environment: snowflake
    - command: "GRANT USAGE ON SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} TO ROLE {{ env.DATAOPS_PREFIX }}_READER;"
      environment: snowflake
    - command: "GRANT CREATE TABLE, CREATE VIEW ON SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} TO ROLE {{ env.DATAOPS_PREFIX }}_WRITER;"
      environment: snowflake
    {% endfor %}
{% else %}
database_level_hooks: {}
{% endif %}
```
These hooks run before the database-level phase of SOLE (i.e., the phase where schemas, tables, etc. are created or updated), stepping through each schema in the list to:
- Create the schema in the dev/feature database by cloning the same schema from the production one.
- Grant `USAGE` on the schema to the built-in `READER` role (this may change in your configuration).
- Grant `CREATE TABLE` and `CREATE VIEW` on the schema to the `WRITER` role (you may have additional grants/roles here; see the sketch after this list).
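If your environment uses additional roles, the same loop can carry further grant statements. For example, a sketch granting read access to a hypothetical ANALYST role; the role name and privileges are assumptions, so adapt them to your project's security model. These lines would sit inside the `{% for %}` loop alongside the grants above:

```yaml
    # Hypothetical extra grants for an ANALYST role (adjust to your own roles).
    - command: "GRANT USAGE ON SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} TO ROLE {{ env.DATAOPS_PREFIX }}_ANALYST;"
      environment: snowflake
    - command: "GRANT SELECT ON ALL TABLES IN SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} TO ROLE {{ env.DATAOPS_PREFIX }}_ANALYST;"
      environment: snowflake
```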
4. Manage the list of schemas
Although the list of schemas to clone in dev/feature branches can be managed inline in the pre-hook configuration as above, a cleaner approach is to maintain it in a separate configuration file and reference it from the hooks config.
```yaml
feature_branch_schemas:
  - SCHEMA1
  - SCHEMA2
  - SCHEMA3
```
You can then reference this list in the hooks:
```yaml
...
database_level_hooks:
  pre_hooks:
    {% for schema in feature_branch_schemas %}
    - command: "CREATE OR REPLACE SCHEMA {{ env.DATAOPS_DATABASE }}.{{ schema }} CLONE {{ env.DATAOPS_DATABASE_MASTER }}.{{ schema }};"
    ...
```
As described in the pre-hook documentation, an additional variable is required for SOLE to recognize the new configuration file:
```yaml
variables:
  ...
  DATAOPS_CONFIGURATION_DIR: $CI_PROJECT_DIR/dataops/snowflake/configuration
  ...
```
Things to watch out for
In a dev/feature pipeline, only the schemas in the above list will be created, so they will be the only schemas in that branch's database. Every MATE model built in that pipeline must therefore target a schema that exists in the dev/feature database. Using branch-specific logic to override `TRANSFORM_MODEL_SELECTOR` can help limit the models built and tested in a feature branch, with the added benefit of reducing build time; a sketch of one way to do this is shown below.
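For example, one approach is to apply GitLab CI rules to the MATE job so that a selector is only set on non-production branches. This is a sketch: the job name, base job, and `feature_branch` tag are assumptions based on a typical DataOps project layout, so adapt them to yours:

```yaml
# Hypothetical override of the standard MATE job.
"Build all models":
  extends: .modelling_and_transformation_base
  variables:
    TRANSFORM_ACTION: RUN
  rules:
    # Production and QA pipelines keep the default behavior (build all models).
    - if: '$CI_COMMIT_REF_NAME == "main" || $CI_COMMIT_REF_NAME == "qa"'
    # Any other branch builds only models tagged for feature-branch work.
    - if: '$CI_COMMIT_REF_NAME'
      variables:
        TRANSFORM_MODEL_SELECTOR: "tag:feature_branch"
```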