
SOLE for Data Products


Overview

We have introduced SOLE for Data Products as a new framework for the Snowflake Object Lifecycle Engine (SOLE) to help you easily build an ecosystem of data products and benefit from greater accuracy and accessibility of data. The major difference is how you define Snowflake objects in the SOLE configuration.

SOLE for Data Products is currently available as a private preview.

  • To enable the private preview in any branch, add the following variable to the project variables.yml file:

    pipelines/includes/config/variables.yml
    variables:
      DATAOPS_SNOWFLAKEOBJECTLIFECYCLE_RUNNER_IMAGE: dataopslive/dataops-snowflakeobjectlifecycle-runner:pripre-sole-for-data-products

    This lets you use the SOLE for Data Products functionality while ensuring that the rest of your orchestrators keep using the default stable tag. To address potential MATE job issues arising from the new default dbt version 1.5, see the dbt 1.5 Migration documentation.

  • To enable and use the functionality, follow the steps detailed in Enabling SOLE for Data Products.

note

We have also introduced data products as an extra layer on top of the data product platform capabilities, making managing data products easier than ever. Learn more about Data Products, currently available as a private preview.

Why SOLE for Data Products?

SOLE for Data Products addresses two challenges:

  • It simplifies the management of Snowflake object configuration in large infrastructures by treating data as a product.
  • It allows the grouping of data products by domains to facilitate consuming data for anybody in an organization who needs data for a particular function. For example, a finance team needs its own set of objects like databases, roles, schemas, etc., while a marketing person may need product-related objects.

In this new version of SOLE, it is possible to relate object definitions to domains, where a domain can be a team. This lets you organize files in any way that better fits your organization.

The table below summarizes the differences between the current SOLE and SOLE for Data Products.

| Features and Characteristics | Current SOLE | SOLE for Data Products | Benefits/Comment |
| --- | --- | --- | --- |
| Object definitions | Objects are grouped based on the object types, and all configurations must be in the same file. If the object is schema-level, it must be in the same file as the database. | Object configurations are self-describing and don't have to be in the same file under the parent's configuration. | Break away from the nested structure and organize files into type-specific or domain-specific categories. Learn more about flat vs tree structure definitions. |
| File structure | Organize all object-specific configurations in a single file following the hierarchical data tree structure. | Organize object configurations in separate files and in any structure that meets your organization's needs. | Support file structures that match how data products are set up in a domain. Learn more about file structure. |
| Components (reuse object configuration) | Not applicable | Define code blocks of YAML configuration as part of any object definition and reuse them in any file. | Reduce the duplication required when configuring objects such as grants on a table or common Jinja templates. Learn more about components. |
| Object relations | Not applicable | Define the relationship between managed objects by using the rel() function. | Make the relation between managed entities more explicit. This reduces the chances of SOLE making assumptions about the implied configuration and focuses more on the specified configuration. Learn more about the rel function. |
| Directory defaults | Not applicable | Use the project configuration file to define default values that are applied to all objects within a directory and, recursively, within all subdirectories. | Apply specific parameters to all objects in a directory to avoid, for example, defining the database name, schema name, and similar properties in every schema-level object and ending up with lengthy configurations. Learn more about directory defaults. |
| Name collision and dataops.id | Not applicable | Assign a unique ID to an object and use it to resolve relationships in case of conflicts. | Resolve the relationship defined using the rel() function when multiple matches are found for an object. Learn more about unique SOLE IDs. |

Enabling SOLE for Data Products

To enable SOLE for Data Products, add the file dataops_config.yml to the Snowflake configuration directory dataops/snowflake.
This switches SOLE from the current configuration format to SOLE for Data Products.
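As a minimal sketch (assuming only the mandatory config version is set; the other keys described in this section are optional), the file can be as small as:

```yaml
# dataops/snowflake/dataops_config.yml
# Minimal sketch: the presence of this file switches SOLE
# to the SOLE for Data Products configuration format.
config-version: 2
```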

Private Preview

SOLE for Data Products is currently not supported in 5-stable releases of the orchestrators because it is only available as a private preview.
To enable the private preview, see the Overview.

Mutually Exclusive Formats

The current configuration structure and SOLE for Data Products are mutually exclusive. If SOLE for Data Products is enabled and configuration for the current structure is found, SOLE will throw an error.

The new configuration file dataops_config.yml helps you to:

  • Define defaults for directories.
  • Specify folders that contain non-object configuration.
  • Specify the config version.

Directory defaults

You can use the project configuration file to define defaults for a directory. This lets you apply some parameters for all objects present in a directory.

This is supported by the following elements:

  • Object parent
    • database
    • schema
  • SOLE keywords
    • namespacing
    • manage_mode
    • environment

The defaults can be overridden in the following increasing order of precedence, where later entries win; for example, a value in the object definition overrides the same value in an included component:

  • Default specified for the directory or subdirectory
  • Value specified in an included component
  • Value specified in the object definition
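As an illustrative sketch of this precedence (the directory, component, and object names here are hypothetical), a default set in the project configuration file is overridden by an included component, which is in turn overridden by the object definition:

```yaml
# dataops_config.yml: directory default (lowest precedence)
reporting_objects:
  +manage_mode: all

# components/modes.yml: the component value overrides the directory default
grants_only:
  manage_mode: grants

# Object definition (highest precedence): the effective value is manage_mode: none
- table:
    name: REPORTING_SNAPSHOT
    dataops.extends:
      - component.grants_only
    manage_mode: none
```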

Non-object configuration

SOLE for Data Products supports specifying configurations other than object definitions, such as hooks, runtime configuration, and the new Components functionality.

You must store these special configurations in dedicated files or directories so that they are ignored during object validation. Use the following keys to define the directories and files:

  • Hooks: hook-paths
  • Configuration: configuration-paths
  • Component: component-paths

Config version

Project configuration in SOLE for Data Products will be further improved over time, with each major improvement added incrementally. To support this evolution, the project configuration file adds a new parameter, config-version.

Currently, this parameter only supports the value 2, but over time, new versions will be added in the form of 2.1, 2.2, etc., allowing a seamless transition to new configuration functionalities.

Following is an example of dataops_config.yml with all the above-mentioned parameters:

config-version: 2

configuration-paths: [configuration]
component-paths: [components]
hook-paths: [hooks.yml]

ingestion_objects:
  +database: rel(database.{{ env.DATAOPS_DATABASE }})
  +schema: rel(schema.INGESTION_SCHEMA)
  fivetran:
    +schema: rel(schema.FIVETRAN_SCHEMA)

transformation_objects:
  +database: rel(database.{{ env.DATAOPS_DATABASE }})
  +schema: rel(schema.TRANSFORMATION_SCHEMA)

namespacing_prefix:
  +namespacing: prefix

Flat object definition

The current version of SOLE requires Snowflake object configuration to be grouped based on the object type. This means if you have multiple roles, you must define all of them in a single YAML file under the roles key. This constraint becomes even more burdensome when you define schemas or schema-level objects, as these must follow a tree structure, which is a challenge for organizations that manage a large number of objects in SOLE.

To address this challenge, we have made a few major changes to how you define objects. Rather than having a grouped collection of objects, SOLE for Data Products opts for modular, self-describing, and explicit object definitions. All definitions specify the object type and the relation to other objects, such as their parents.

Self-describing objects

Object configuration in SOLE for Data Products is self-describing: anyone looking at just the object definition can immediately tell the following:

  • Object type
  • Object name
  • Object dependencies
  • Object relations

Following are a few examples showing this behavior:

- warehouse: # Type of the object
    dataops.extends:
      - component.default_warehouse # Adding a 'component' with the name 'default_warehouse'
    comment: Warehouse for Ingestion operations
    name: DATAOPS_INGESTION # Name of the object
    resource_monitor: rel(resource_monitor.INGESTION) # Adding a relation to managed Resource Monitor INGESTION
    grants:
      USAGE:
        - rel(role.DATAOPS_WRITER) # Adding a relation to managed Role DATAOPS_WRITER
      MONITOR:
        - SYSADMIN # Non-managed Role SYSADMIN
        - rel(role.DATAOPS_WRITER)

Break away from the nested structure

With SOLE's new structure for defining objects, child objects don't have to be in the same file under the parent's configuration. Each child object defines the details of the parent object as part of its configuration. This allows you to define the configuration for hierarchical objects in completely separate files.

The following configurations show how you can define a database, schema, table, and stage in separate files to match organization needs. In this example, objects are organized in object-specific directories, but you are free to use whatever structure suits you most.

dataops/snowflake/databases/dataops_database.template.yml
- database:
    name: "{{ env.DATAOPS_DATABASE }}"
    {% if (env.DATAOPS_ENV_NAME != 'PROD' and env.DATAOPS_ENV_NAME != 'QA') %}
    from_database: "{{ env.DATAOPS_DATABASE_MASTER }}" # If the database name and from_database differ, the new database is cloned; otherwise it is created with the specified definition
    data_retention_time_in_days: 1
    {% endif %}
    comment: This Database is used for demonstration of the usage of the Snowflake Object LifeCycle Runner
    grants:
      USAGE:
        - rel(role.DATAOPS_WRITER)
        - rel(role.DATAOPS_READER)
        - rel(role.DATAOPS_ADMIN)
      MODIFY:
        - rel(role.DATAOPS_WRITER)
        - rel(role.DATAOPS_ADMIN)
      MONITOR:
        - rel(role.DATAOPS_ADMIN)
        - SYSADMIN
      CREATE SCHEMA:
        - rel(role.DATAOPS_WRITER)
        - rel(role.DATAOPS_READER)
        - rel(role.DATAOPS_ADMIN)
dataops/snowflake/schemas/ingestion_schemas.template.yml
- schema:
    dataops.extends:
      - component.default_schema_grants
    name: ADVENTUREWORKS_HUMANRESOURCES
    database: rel(database.{{ env.DATAOPS_DATABASE }})
dataops/snowflake/tables/ingestion_tables.template.yml
- table:
    name: JOBCANDIDATE
    schema: rel(schema.ADVENTUREWORKS_HUMANRESOURCES)
    database: rel(database.{{ env.DATAOPS_DATABASE }})
    comment: "JOBCANDIDATE Table"
    dataops.extends: component.ingestion_table_grants
    columns:
      JOBCANDIDATEID:
        type: NUMBER(38,0)
        comment: "Primary Key of the job candidate table"
      BUSINESSENTITYID:
        type: NUMBER(38,0)
        comment: "The ID of the business entity"
      RESUME:
        type: VARCHAR(16777216)
        comment: "Resume of the candidate"
      MODIFIEDDATE:
        type: TIMESTAMP_NTZ(9)
        comment: "Timestamp when the record was modified"
dataops/snowflake/stages/ingestion.template.yml
- stage:
    name: S3_STAGE
    schema: rel(schema.ADVENTUREWORKS_HUMANRESOURCES)
    database: rel(database.{{ env.DATAOPS_DATABASE }})
    comment: Created to load the Adventureworks data
    url: s3://dataops-demo/adventureworks/
    CREDENTIALS: AWS_KEY_ID = '{{ AWS.INGESTION.ACCESS_KEY_ID }}' AWS_SECRET_KEY = '{{ AWS.INGESTION.SECRET_ACCESS_KEY }}'
    file_format: rel(file_format.ADVW_UTF8_TAB_PUB)
    grants:
      ALL PRIVILEGES:
        - rel(role.DATAOPS_WRITER)
        - rel(role.DATAOPS_READER)
        - rel(role.DATAOPS_ADMIN)
      USAGE:
        - rel(role.DATAOPS_WRITER)
        - rel(role.DATAOPS_READER)
        - rel(role.DATAOPS_ADMIN)

File structure

SOLE for Data Products gives you the freedom to organize object definitions however you like, in whatever structure meets your organization's needs.

And as the name suggests, SOLE for Data Products supports file structures that match how data products are set up in a domain. You can set up top-level directories for each data product in a domain (it is recommended to maintain one project per domain). Objects belonging to a data product can then have their definitions in that directory. You can also organize data product directories with more subdirectories to represent an action, business logic, or object type.

Components

As many objects have a similar configuration, whether it is the grants on a table or file format properties, it makes sense to move such a configuration to a code block and reuse it. To answer this need, SOLE for Data Products introduces components that are pieces of YAML configuration you can use as part of any object definition.

Components are based on YAML anchors but without their limitation that the anchor and its usage must be in the same file. Components in SOLE for Data Products let you define object configuration in any file.
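For contrast, a sketch of plain YAML anchors (standard YAML, not SOLE-specific; the names here are hypothetical) shows the restriction components remove, since an anchor and its alias must be parsed from the same file:

```yaml
# Plain YAML: the anchor (&) and its alias (*) must appear in the same file.
default_grants: &default_grants
  USAGE:
    - DATAOPS_READER

my_stage:
  grants: *default_grants
```

Components lift this restriction by resolving component.<component_name> references across all files listed under component-paths.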

Modular configuration

Components are designed to serve as modular configurations you can use in any object definition — assuming that the object supports the specified parameters. A Component can be as small as a single parameter or all parameters of an object. You can then use components directly in an object definition or combine them with other components.

Following are a few examples of components:

dataops/snowflake/components/namespacing.yml
namespacing_prefix:
  namespacing: prefix

namespacing_none:
  namespacing: none
dataops/snowflake/components/table_grants.yml
ingestion_table_grants:
  grants:
    SELECT:
      - rel(role.DATAOPS_READER)
      - rel(role.DATAOPS_WRITER)
      - rel(role.DATAOPS_ADMIN)
    INSERT:
      - rel(role.DATAOPS_WRITER)
      - rel(role.DATAOPS_ADMIN)
    UPDATE:
      - rel(role.DATAOPS_WRITER)
      - rel(role.DATAOPS_ADMIN)
    DELETE:
      - rel(role.DATAOPS_ADMIN)
dataops/snowflake/components/warehouse.yml
default_warehouse:
  grants:
    USAGE:
      - rel(role.DATAOPS_WRITER)
      - rel(role.DATAOPS_READER)
      - rel(role.DATAOPS_ADMIN)
    MONITOR:
      - SYSADMIN
      - rel(role.DATAOPS_ADMIN)
  warehouse_size: X-SMALL
  auto_suspend: 40
  auto_resume: true
  namespacing: none

Object Definition vs Component Definition

Object definitions are written as list items, whereas component definitions are configured under a key. This makes the two visually distinct and keeps the configuration self-describing.

Using components

As components are not Snowflake object configurations, they have to be organized in a directory. You must define this directory in the project configuration file dataops_config.yml with the key component-paths.

To use components:

  1. Add a new parameter dataops.extends as part of the object or component definition. This parameter accepts either a string for a single component or a list for multiple components.

  2. Use the format component.<component_name> to include the components.

    - warehouse:
        dataops.extends:
          - component.warehouse_properties
          - component.warehouse_grants
        name: TRANSFORMATION
        comment: Warehouse for Transformation operations

    - file_format:
        name: CSV_FORMAT
        schema: rel(schema.INGESTION_SCHEMA)
        database: rel(database.{{ env.DATAOPS_DATABASE }})
        dataops.extends: component.csv_file_format

Reusing components

In the previous section, you saw how components are included in an object definition.

As components are meant to be modular configurations, you can include groups of parameters in different components and use them as building blocks that you can include in other components:

warehouse_grants:
  dataops.extends:
    - component.warehouse_usage_grants
    - dataops.merge: component.warehouse_monitor_grants
    - dataops.merge: component.warehouse_operate_grants

warehouse_usage_grants:
  grants:
    USAGE:
      - rel(role.READER)

warehouse_monitor_grants:
  grants:
    MONITOR:
      - rel(role.ADMIN)

warehouse_operate_grants:
  grants:
    OPERATE:
      - rel(role.ADMIN)

Overriding values

The dataops.extends keyword supports both a string to include a single component and a list to include multiple components. If multiple components are included and two or more components define the same parameter, then the order of including components determines the final value.

As you go down the list of included components, the value is overridden. This means that the value specified in the last included component will be used:

warehouse_x_small:
  warehouse_size: XSMALL

warehouse_small:
  warehouse_size: SMALL

warehouse_medium:
  warehouse_size: MEDIUM

warehouse_large:
  warehouse_size: LARGE

warehouse_2x_large:
  warehouse_size: X2LARGE

warehouse_properties:
  dataops.extends:
    - component.warehouse_small
    - component.warehouse_2x_large
    - component.warehouse_x_small
    - component.warehouse_medium
    - component.warehouse_large
  # The value of warehouse_size is set to LARGE,
  # as the last included component sets it

warehouse_properties_override:
  dataops.extends:
    - component.warehouse_small
    - component.warehouse_2x_large
    - component.warehouse_x_small
    - component.warehouse_medium
    - component.warehouse_large
  warehouse_size: SMALL # The value of warehouse_size is set to SMALL,
                        # as the current component overrides it

info

Parameters defined directly in the object definition always override values from included components.

Merging values

Scalar values such as string, integer, and boolean are always overridden when including a component. For list or map values, SOLE for Data Products supports merging in addition to override. Merging allows combining values from different sources into one final value.

By default, all values and components are overridden. To enable merge, you must use dataops.merge.

Merging components

To merge components, define the included component by using dataops.merge: component.<component_name>. Following is an example of merging components:

dataops/snowflake/components/warehouse.yml
warehouse_default_usage_grants:
  grants:
    USAGE:
      - rel(role.DATAOPS_WRITER)
      - rel(role.DATAOPS_READER)
      - rel(role.DATAOPS_ADMIN)

warehouse_default_monitor_grants:
  grants:
    MONITOR:
      - SYSADMIN
      - rel(role.DATAOPS_ADMIN)

warehouse_default_grants:
  dataops.extends:
    - dataops.merge: component.warehouse_default_usage_grants
    - dataops.merge: component.warehouse_default_monitor_grants

Merging components into values

As previously mentioned, the parameters defined in an object definition always override the value defined in components.

If, instead of the default override behavior, you want to merge the defined value with the component's value, add the parameter in the format dataops.merge.<parameter>.
This merges the values defined in the parameter with the values included from components, recursively.

dataops/snowflake/components/warehouse.yml
warehouse_default_grants:
  grants:
    USAGE:
      - rel(role.DATAOPS_WRITER)
      - rel(role.DATAOPS_READER)
      - rel(role.DATAOPS_ADMIN)
    MONITOR:
      - SYSADMIN
      - rel(role.DATAOPS_ADMIN)
dataops/snowflake/warehouses/data_science.yml
- warehouse:
    dataops.extends:
      - component.namespacing_none
      - component.warehouse_default_grants
    comment: Warehouse for the Data Science Team
    warehouse_size: SMALL
    dataops.merge.grants:
      USAGE:
        - rel(role.DATA_SCIENCE)
      MONITOR:
        - rel(role.DATA_SCIENCE)

Jinja template in components

You can also use components to store Jinja logic as a named key. This allows easy usage of Jinja in objects and components without repeating it.

dataops/snowflake/components/share.template.yml
share_namespacing:
  {% if env.DATAOPS_ENV_NAME == 'PROD' %}
  namespacing: prefix
  {% else %}
  namespacing: both
  {% endif %}
dataops/snowflake/shares/ingestion.yml
- share:
    name: INGESTION
    dataops.extends:
      - component.share_namespacing
    comment: "Ingestion Share"

rel function

As part of SOLE for Data Products, we have also updated how the relationship between managed objects is defined. You can see this in the examples used so far.

SOLE for Data Products puts more focus on validation and preventing accidental configuration. To achieve this, you must define the relationship to managed objects by using the rel() function. The function takes a single string parameter in the format <object_type>.<object_id>.

- warehouse:
    name: INGESTION
    comment: Warehouse for Ingestion operations
    resource_monitor: rel(resource_monitor.INGESTION) # Relation defined to managed Resource Monitor 'INGESTION'
    warehouse_size: LARGE
    grants:
      MONITOR:
        - SYSADMIN
        - rel(role.DATAOPS_ADMIN)

- stage:
    comment: Load CSV Data
    name: CSV_STAGE
    schema: rel(schema.FIVETRAN_SCHEMA) # Relation defined to schema 'FIVETRAN_SCHEMA'
    database: rel(database.{{ env.DATAOPS_DATABASE }}) # Relation defined to database '{{ env.DATAOPS_DATABASE }}', whose value is resolved at runtime
    file_format: rel(file_format.INGESTION_CSV_FORMAT) # Relation defined to managed File Format '{{ env.DATAOPS_DATABASE }}'.'FIVETRAN_SCHEMA'.'INGESTION_CSV_FORMAT'
    url: s3://source-bucket/csv/
    CREDENTIALS: AWS_KEY_ID = '{{ AWS.INGESTION.ACCESS_KEY_ID }}' AWS_SECRET_KEY = '{{ AWS.INGESTION.SECRET_ACCESS_KEY }}'

Explicit relations

The rel() function makes the relation between managed entities more explicit. It reduces the chances of SOLE making assumptions about the implied configuration and instead focuses on the specified configuration.

The new structure also makes the configuration more readable when trying to determine what is and isn't managed by SOLE.
If an object is not managed by SOLE, the object's name is a string not wrapped in rel().

- warehouse:
    name: INGESTION
    comment: Warehouse for Ingestion operations
    resource_monitor: rel(resource_monitor.INGESTION) # Relation defined to managed Resource Monitor 'INGESTION'
    warehouse_size: LARGE
    grants:
      MONITOR:
        - SYSADMIN # Role not managed by SOLE
        - rel(role.DATAOPS_ADMIN) # Role managed by SOLE

Configuration validation

Using rel() to define relations allows SOLE to verify configuration more thoroughly before it is applied in Snowflake.

If an object is defined as a relation but its configuration is missing, whether due to a partial commit or an accidental delete, SOLE throws an error early in the stage. This prevents accidentally dropping objects due to incorrect or partial configuration.

Unique SOLE ID

For account-level objects, you can use the config name when adding a relation in rel(). For schemas or schema-level objects, you might have two or more objects with the same name but a different parent. If the config name is specified in rel(), it could lead to an error as SOLE might not be able to resolve the relationship with certainty.

For such cases, you can use the new parameter dataops.id to assign a unique ID for the object. This ID is used only to resolve the relationship and does not affect the object name in Snowflake.

- file_format:
    name: CSV_FORMAT
    dataops.id: INGESTION_CSV_FORMAT
    schema: rel(schema.INGESTION_SCHEMA)
    database: rel(database.{{ env.DATAOPS_DATABASE }})
    dataops.extends:
      - component.csv_file_format

- file_format:
    name: CSV_FORMAT
    schema: rel(schema.TRANSFORMATION_SCHEMA)
    database: rel(database.{{ env.DATAOPS_DATABASE }})
    dataops.extends:
      - component.csv_file_format

- stage:
    comment: Load CSV Data
    name: CSV_STAGE
    schema: rel(schema.FIVETRAN_SCHEMA)
    database: rel(database.{{ env.DATAOPS_DATABASE }})
    file_format: rel(file_format.INGESTION_CSV_FORMAT)

In the above example, there are two file formats with the same name CSV_FORMAT present in different schemas. The stage FIVETRAN_SCHEMA.CSV_STAGE wants to use the file format present in INGESTION_SCHEMA. To do so, you assign a unique ID INGESTION_CSV_FORMAT to the file format using dataops.id. You can now use this file format by specifying the unique ID in rel().

Schema-level objects

If the specified object is a schema-level object, SOLE can resolve the parent's details if the object has a unique name or if you have defined a unique ID. This reduces the number of code lines that have to be added when referencing schema-level objects.

Following is an example of adding relation to file format from stages:

databases:
  DATABASE_1:
    schemas:
      SCHEMA_1:
        file_formats:
          FORMAT_1:
            format_type: CSV
        stages:
          STAGE_1:
            file_format:
              format_name: FORMAT_1
      SCHEMA_2:
        stages:
          STAGE_2:
            file_format:
              format_name: FORMAT_1
              format_schema: SCHEMA_1
  DATABASE_2:
    schemas:
      SCHEMA_3:
        stages:
          STAGE_3:
            file_format:
              format_name: FORMAT_1
              format_schema: SCHEMA_1
              format_database: DATABASE_1

You can also use the rel() function as a parameter, such as when using tags.

databases:
  DATABASE_1:
    schemas:
      SCHEMA_1:
        tags:
          TAG_1:
            allowed_values: ["Red", "Yellow", "Blue"]
        tables:
          TABLE_1:
            with_tags:
              TAG_1:
                value: Red
            columns:
              COLUMN_1:
                type: NUMBER(38,0)
      SCHEMA_2:
        tables:
          TABLE_2:
            with_tags:
              TAG_1:
                value: Yellow
                schema: SCHEMA_1
            columns:
              COLUMN_1:
                type: NUMBER(38,0)
  DATABASE_2:
    schemas:
      SCHEMA_3:
        tables:
          TABLE_3:
            with_tags:
              TAG_1:
                value: Blue
                schema: SCHEMA_1
                database: DATABASE_1
            columns:
              COLUMN_1:
                type: NUMBER(38,0)

Resolving relation

To resolve the relationship defined in rel(), SOLE performs a lookup in the following order:

  1. Search for defined relation in the list of unique IDs.
  2. Search for defined relation in the same schema as the current object.
  3. Search for defined relations in all schemas in the same database as the current object.
  4. Search for defined relations in all schemas in all the databases.

If a unique match is found, SOLE uses it to resolve rel(). If multiple matches are found, SOLE throws an error. In such cases, it is recommended that you assign a unique ID by using dataops.id.
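As an illustrative sketch of this lookup order (the object and schema names here are hypothetical), a stage that references a file format name existing in two schemas resolves to the one in its own schema via step 2, because no unique ID matches at step 1:

```yaml
- file_format:
    name: CSV_FORMAT
    schema: rel(schema.SCHEMA_A)
    database: rel(database.DEMO_DB)

- file_format:
    name: CSV_FORMAT
    schema: rel(schema.SCHEMA_B)
    database: rel(database.DEMO_DB)

- stage:
    name: LOAD_STAGE
    schema: rel(schema.SCHEMA_A)
    database: rel(database.DEMO_DB)
    # No unique ID matches (step 1), so SOLE searches the stage's own
    # schema (step 2) and resolves to SCHEMA_A.CSV_FORMAT.
    file_format: rel(file_format.CSV_FORMAT)
```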

Converting the DataOps template to use SOLE for Data Products

To use SOLE for Data Products with your existing DataOps projects, you must migrate your current Snowflake object configuration to match the new structure. Steps to migrate the DataOps Template Project are documented in SOLE Data Products Migration, providing an example of how to convert current object configurations to SOLE for Data Products.