How to use Snowflake Iceberg tables with SOLE for Data Products
This article offers a comprehensive guide on using Snowflake Iceberg tables with SOLE for Data Products. It covers the critical concepts of Iceberg tables and provides a detailed, step-by-step guide on defining and using them.
Understanding Iceberg tables in Snowflake
Iceberg tables in Snowflake allow you to efficiently access and manage data stored in Parquet files on S3. Depending on your use case, you can manage these tables either externally or internally.
Internal catalog (managed by Snowflake): This method supports the creation of read-write Iceberg tables. Only an external volume is necessary, as Snowflake manages the catalog internally.
Externally managed catalog (using AWS Glue or an object store): This method allows you to create read-only Iceberg tables. To set up such a table, you need both an external volume and an external catalog.
Prerequisites for creating an Iceberg table
Since Iceberg tables use cloud storage, you must first set up your cloud provider's storage. Once your cloud storage is ready, you can create the external volume required for your Iceberg table.
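As an illustration only, a minimal sketch of creating an account-level external volume for an S3 bucket (the volume name matches the later example; the bucket URL and IAM role ARN are placeholders your own cloud setup would provide):
-- Sketch: account-level external volume pointing at the S3 bucket that holds the Iceberg files
CREATE EXTERNAL VOLUME iceberg_external_volume_soledp
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'my-s3-us-east-1'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://mybucket_us_east_1/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-iceberg-role'
      )
    );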
You can assign an external volume and catalog directly during Iceberg table creation or at the database level. Both the external volume and catalog are account-level objects in Snowflake.
Example commands:
- To create a database with an external volume and catalog:
CREATE DATABASE <database_name>
EXTERNAL_VOLUME = <external_volume_name>
CATALOG = <external_catalog_name>
- To alter an existing database:
ALTER DATABASE IF EXISTS <database_name> SET
EXTERNAL_VOLUME = <external_volume_name>
CATALOG = <external_catalog_name>
For read-write Iceberg tables, the catalog value remains SNOWFLAKE and does not have to be specified.
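For externally managed (read-only) tables, the external catalog is a catalog integration. As an illustrative sketch only, a catalog integration for Iceberg metadata stored directly in object storage (the integration name matches the later example) might look like this:
-- Sketch: catalog integration for Iceberg metadata files kept in object storage
CREATE CATALOG INTEGRATION IcebergCatalogInt_soledp
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;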
METADATA_FILE_PATH
When setting up cloud storage, determine the path value that designates the metadata file path for a table created from Iceberg files stored in an object storage system. This path should be relative to the active storage location of the external volume linked to the table.
For example, if the full path to your metadata file is s3://mybucket_us_east_1/metadata/v1.metadata.json and the external storage location is s3://mybucket_us_east_1/, you should set METADATA_FILE_PATH to metadata/v1.metadata.json.
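For instance, a minimal sketch of how that value appears in a CREATE ICEBERG TABLE statement (the table name is a placeholder; the volume and catalog names reuse the examples above):
-- Sketch: the path is relative to the external volume's active storage location
CREATE ICEBERG TABLE my_read_only_iceberg_table
  EXTERNAL_VOLUME = iceberg_external_volume_soledp
  CATALOG = IcebergCatalogInt_soledp
  METADATA_FILE_PATH = 'metadata/v1.metadata.json';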
Step-by-step guide: creating an externally managed Iceberg table
Step 1 - Define a read-only Iceberg table in SOLE for Data Products
When working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory:
config-version: 2
hook-paths: [hooks.yml]
- If the hooks.template.yml file exists: Verify the presence of the database_level_hooks object group section in hooks.template.yml. If it does not exist, create this section and include database_level_hooks to add commands.
- If the hooks.template.yml file does not exist: Create a new file named hooks.template.yml within the dataops/snowflake directory. Ensure that the newly created file includes a database_level_hooks object group section.
- As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hook section of account_level_hooks. Add the hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory, with the following content.
Note: If the external volume and catalog are set at the database level, you don't need to specify them again when creating the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name> EXTERNAL_VOLUME = <External_volume_name>
          CATALOG = <External_catalog_name> METADATA_FILE_PATH = '<metadata_file_path_in_cloud_storage>'"
Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.
- Create the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST EXTERNAL_VOLUME = iceberg_external_volume_soledp
          CATALOG = IcebergCatalogInt_soledp METADATA_FILE_PATH = 'ICEBERG_TABLE_STORAGE/metadata/00000-e80344f7-7c89-47df-b1ab-8f9e2b44b3db.metadata.json' COMMENT = 'Table 1'"
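As an optional check after the SOLE pipeline has run, you can confirm that the table exists and is queryable. A minimal sketch, assuming the example names above (<rendered_database_name> stands for the database name produced from your prefix and environment):
-- Verify the externally managed Iceberg table created by the hook
SHOW ICEBERG TABLES IN SCHEMA <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA;
-- Read a few rows to confirm the metadata file resolves correctly
SELECT * FROM <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST LIMIT 10;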
Step 2 - Using Iceberg tables as a dbt source in MATE
- Define the dbt source in sources.yml
In dbt, sources are defined in a sources.yml file, which connects your dbt models to the actual tables in your data warehouse. When mapping to an Iceberg table, include all relevant details, such as the database, schema, and table name. This lets you use the Iceberg table as a source to perform MATE job transformations.
Example:
sources.yml
version: 2
sources:
  - name: TEST_ICEBERG
    database: "{{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}"
    schema: ICEBERG_SOLEDP_SCHEMA
    tables:
      - name: AWS_ICEBERG_TABLE
        description: "Read-only Iceberg table"
        columns:
          - name: ID
            description: "Column id"
          - name: NAME
            description: "Column name"
- Reference the source in your dbt model
After defining the source, you can reference it in your dbt model using the source function. This ensures that dbt pulls data from the correct Iceberg table.
Example dbt model:
models/my_iceberg_model.sql
{{ config(alias='INGESTION_TABLE') }}
SELECT
    *
FROM {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }}
Use the {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }} macro to refer directly to the defined source.
- Reference the source in your dbt_project file
models:
  source_iceberg:
    schema: ICEBERG_SOURCE
    materialized: table
    iceberg:
      tags:
        - "source_iceberg"
To run a MATE job with specific variables, set TRANSFORM_ACTION to RUN and specify your model name in TRANSFORM_MODEL_SELECTOR. For example:
mate job:
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: "tag:source_iceberg"
Step 3 - Refresh Iceberg table metadata in SOLE for Data Products
If the data stored in cloud storage has been updated, you must refresh the Iceberg table to synchronize its metadata with the latest changes.
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE <iceberg_table_name> REFRESH '<metadata_file_relative_path>'"
The metadata_file_relative_path specifies the metadata file path for a table created from Iceberg files in object storage. The path must be relative to the active storage location of the external volume associated with the table.
Refreshing Iceberg table metadata example:
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST REFRESH '<metadata_file_relative_path>'"
After updating the Iceberg table, execute the MATE job again to apply transformations to the refreshed metadata.
Step-by-step guide: creating an internally managed Iceberg table
Step 1 - Define a read-write Iceberg table in SOLE for Data Products
When working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory:
config-version: 2
hook-paths: [hooks.yml]
- If the hooks.template.yml file exists: Verify the presence of the database_level_hooks object group section in hooks.template.yml. If it does not exist, create this section and include database_level_hooks to add commands.
- If the hooks.template.yml file does not exist: Create a new file named hooks.template.yml within the dataops/snowflake directory. Ensure that the newly created file includes a database_level_hooks object group section.
- As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hook section of account_level_hooks. Add the hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory, with the following content.
Note: If the external volume and catalog are set at the database level, you don't need to specify them again when creating the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name> (<COLUMN_NAME> <COLUMN_TYPE>) EXTERNAL_VOLUME = <External_volume_name>
          CATALOG = SNOWFLAKE BASE_LOCATION = '<folder_name_to_create_in_cloud_storage>'"
Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.
- Create the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST (customer_id int, name string) EXTERNAL_VOLUME = iceberg_external_volume_soledp
          CATALOG = SNOWFLAKE BASE_LOCATION = 'my_iceberg_table'"
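Because the catalog is SNOWFLAKE, this table is read-write. As an illustrative sketch only, assuming the example table above and that your session's current database is set to the rendered <prefix>_<env> database, you could write to and read from it like this:
-- Insert a couple of rows into the Snowflake-managed (read-write) Iceberg table
INSERT INTO ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST (customer_id, name)
VALUES (1, 'Alice'), (2, 'Bob');
-- Read them back
SELECT customer_id, name
FROM ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST;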
Step 2 - Using Iceberg tables as a model with MATE
Currently, there is no built-in materialization support for creating or managing Iceberg tables in dbt. The existing materialization options, such as table, view, incremental, and ephemeral, do not extend to Iceberg tables. As a result, creating or managing Iceberg tables directly through MATE is not supported. This limitation also applies to the DataOps platform, which does not currently support write operations on Iceberg tables. Users who want to work with read-write Iceberg tables will need to consider alternative approaches or await future updates that may introduce this functionality.
Conclusion
This guide provides a clear path for using Iceberg tables in SOLE for Data Products and MATE, from setup to integration with MATE models, ensuring a smooth and efficient data management process.