How to use Snowflake Iceberg tables with SOLE for Data Products
This article explains how to use Snowflake Iceberg tables with SOLE for Data Products. It covers the key concepts behind Iceberg tables and provides a step-by-step guide to defining and using them.
Understanding Iceberg tables in Snowflake
Iceberg tables in Snowflake allow you to efficiently access and manage data stored in Parquet files on S3. Depending on your use case, you can manage these tables either externally or internally.
Internal catalog (managed by Snowflake):
This method supports the creation of read-write Iceberg tables. Only an external volume is necessary, as Snowflake manages the catalog internally.
Externally managed catalog (using AWS Glue or an object store):
This method allows you to create read-only Iceberg tables. To set up such a table, you need both an external volume and an external catalog.
Prerequisites for creating an Iceberg table
Since Iceberg tables use cloud storage, you must first set up storage with your cloud provider. Once your cloud storage is ready, you can create the external volume required for your Iceberg table.
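As an illustration, a minimal external volume backed by an S3 bucket might look like the following; the volume name, bucket, role ARN, and external ID are placeholders for your own setup:

CREATE EXTERNAL VOLUME iceberg_external_volume
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'my-s3-us-east-1'
        STORAGE_PROVIDER = 'S3'
        -- placeholder values; replace with your bucket and IAM role
        STORAGE_BASE_URL = 's3://mybucket_us_east_1/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_snowflake_role'
        STORAGE_AWS_EXTERNAL_ID = 'iceberg_table_external_id'
      )
    );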
You can assign an external volume and catalog directly during Iceberg table creation or at the database level. Both the external volume and catalog are account-level objects in Snowflake.
Example commands:
- To create a database with an external volume and catalog:

  CREATE DATABASE <database_name>
    EXTERNAL_VOLUME = <external_volume_name>
    CATALOG = <external_catalog_name>;

- To alter an existing database:

  ALTER DATABASE IF EXISTS <database_name> SET
    EXTERNAL_VOLUME = <external_volume_name>
    CATALOG = <external_catalog_name>;
For read-write Iceberg tables, the catalog value remains SNOWFLAKE and does not have to be specified.
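For externally managed catalogs, the catalog itself is an account-level catalog integration. As a sketch, an integration for Iceberg files registered directly from object storage could be created like this (the integration name is a placeholder):

CREATE CATALOG INTEGRATION my_object_store_catalog
  -- read Iceberg metadata directly from files in object storage
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;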
METADATA_FILE_PATH
When setting up cloud storage, determine the path value that designates the metadata file path for a table created from Iceberg files stored in an object storage system. This path should be relative to the active storage location of the external volume linked to the table.
For example, if the full path to your metadata file is s3://mybucket_us_east_1/metadata/v1.metadata.json and the external storage location is s3://mybucket_us_east_1/, you should set METADATA_FILE_PATH to metadata/v1.metadata.json.
Step-by-step guide: creating an externally managed Iceberg table
Step 1 - Define a read-only Iceberg table in SOLE for Data Products
While working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory:
config-version: 2
hook-paths: [hooks.yml]
- If the hooks.template.yml file exists: Verify the presence of the database_level_hooks object group section in hooks.template.yml. If it does not exist, create this section so you can add commands under database_level_hooks.
- If the hooks.template.yml file does not exist: Create a new file named hooks.template.yml within the dataops/snowflake directory, and ensure that it includes a database_level_hooks object group section.
- As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hook section of account_level_hooks. Then add the table-creation commands as post_hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory, as shown in the following content.

  info: If the external volume and catalog are set at the database level, you don't need to specify them again when creating the Iceberg table.
hooks.template.yml:

database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name>
           EXTERNAL_VOLUME = <external_volume_name>
           CATALOG = <external_catalog_name>
           METADATA_FILE_PATH = '<metadata file path in cloud storage>'"

Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.
- Create the Iceberg table.

hooks.template.yml:

database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST
           EXTERNAL_VOLUME = iceberg_external_volume_soledp
           CATALOG = IcebergCatalogInt_soledp
           METADATA_FILE_PATH = 'ICEBERG_TABLE_STORAGE/metadata/00000-e80344f7-7c89-47df-b1ab-8f9e2b44b3db.metadata.json'
           COMMENT = 'Table 1'"
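Once the pipeline has run, you can sanity-check the new table from a Snowflake worksheet. The fully qualified name below is illustrative and depends on how your database name is rendered:

-- list Iceberg tables in the schema, then spot-check the data
SHOW ICEBERG TABLES IN SCHEMA <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA;
SELECT * FROM <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST LIMIT 10;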
Step 2 - Use Iceberg tables as a dbt source in MATE
- Define the dbt source in sources.yml.

  In dbt, sources are defined in a sources.yml file, which connects your dbt models to the actual tables in your data warehouse. When mapping to an Iceberg table, include all relevant details, such as the database, schema, and table name. You can then use the Iceberg table as a source to perform MATE job transformations.
Example:
sources.yml:

version: 2
sources:
  - name: TEST_ICEBERG
    database: "{{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}"
    schema: ICEBERG_SOLEDP_SCHEMA
    tables:
      - name: AWS_ICEBERG_TABLE
        description: "Read-only Iceberg table"
        columns:
          - name: ID
            description: "Column id"
          - name: NAME
            description: "Column name"
- Reference the source in your dbt model.

  After defining the source, you can reference it in your dbt model using the source function. This ensures that dbt pulls data from the correct Iceberg table.
Example dbt model:
models/my_iceberg_model.sql:

{{ config(alias='INGESTION_TABLE') }}

SELECT
    *
FROM {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }}

Use the {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }} macro to refer directly to the defined source.
- Reference the source in your dbt_project file.

  models:
    source_iceberg:
      schema: ICEBERG_SOURCE
      materialized: table
      iceberg:
        tags:
          - "source_iceberg"

To run a MATE job with specific variables, set TRANSFORM_ACTION to RUN and specify your model selector in TRANSFORM_MODEL_SELECTOR. For example:

mate job:
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: "tag:source_iceberg"
Step 3 - Refresh Iceberg table metadata in SOLE for Data Products
If the data stored in cloud storage has been updated, you must refresh the Iceberg table to synchronize its metadata with the latest changes.
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE <iceberg_table_name> REFRESH '<metadata_file_relative_path>'"
The metadata_file_relative_path specifies the metadata file path for a table created from Iceberg files in object storage. The path must be relative to the active storage location of the external volume associated with the table.
Refreshing Iceberg table metadata example:
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST REFRESH '<metadata_file_relative_path>'"
After updating the Iceberg table, execute the MATE job again to apply transformations to the refreshed metadata.
Step-by-step guide: creating an internally managed Iceberg table
This section walks you through two ways to define and use such tables in your SOLE for Data Products project: first, using traditional hooks; and second, leveraging the newer, dbt-native Iceberg support introduced in dbt v1.9+.
1. Define a read-write Iceberg table in SOLE for Data Products
While working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory:
config-version: 2
hook-paths: [hooks.yml]
- If the hooks.template.yml file exists: Verify the presence of the database_level_hooks object group section in hooks.template.yml. If it does not exist, create this section so you can add commands under database_level_hooks.
- If the hooks.template.yml file does not exist: Create a new file named hooks.template.yml within the dataops/snowflake directory, and ensure that it includes a database_level_hooks object group section.
- As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hook section of account_level_hooks. Then add the table-creation commands as post_hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory, as shown in the following content.

  info: If the external volume and catalog are set at the database level, you don't need to specify them again when creating the Iceberg table.
hooks.template.yml:

database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name> (<column_name> <column_type>)
           EXTERNAL_VOLUME = <external_volume_name>
           CATALOG = SNOWFLAKE
           BASE_LOCATION = '<folder to create in the cloud storage location>'"

Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.
- Create the Iceberg table.

hooks.template.yml:

database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST (customer_id INT, name STRING)
           EXTERNAL_VOLUME = iceberg_external_volume_soledp
           CATALOG = SNOWFLAKE
           BASE_LOCATION = 'my_iceberg_table'"
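Because this table is Snowflake-managed, it is writable. As a quick smoke test from a Snowflake worksheet, you could insert and read back a row; the fully qualified name is illustrative and depends on how your database name is rendered:

-- write a test row, then read it back
INSERT INTO <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST (customer_id, name)
  VALUES (1, 'Alice');
SELECT * FROM <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST;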
2. Create and use Iceberg tables as a model with MATE
With dbt v1.9 or later, you can create Iceberg tables directly within dbt models without requiring pre- or post-hooks. This simplifies workflows and lets MATE (the DataOps.live dbt transformation engine) handle Iceberg natively.
To enable this functionality, you must add the Iceberg materializations flag in your dbt_project.yml:
flags:
  enable_iceberg_materializations: true
Step 1: Create the dbt model
dbt models usually live inside the models folder. Make sure you specify the table_format as iceberg in the config block, along with the external volume the table should use.

{{ config(
    materialized = 'table',
    table_format = 'iceberg',
    external_volume = 'iceberg_external_volume_soledp',
    base_location_subpath = 'table_base_location'
) }}

select
    customer_id,
    count(order_id) as orders,
    sum(total_amt) as revenue
from {{ ref('orders') }}
group by customer_id
Step 2: Tag & expose the model in dbt_project.yml
Add a tag so MATE can identify and run the model:
models:
  project_name:
    +tags: ["iceberg"]
Step 3: Create / update the MATE job definition
mate job:
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: "tag:iceberg"
What happens when this job runs:
- MATE runs dbt run --select tag:iceberg.
- dbt compiles your model into a CREATE ICEBERG TABLE AS SELECT statement (see the sketch after this list).
- Snowflake writes the table's Parquet and metadata files to your configured external volume.
- The Iceberg table becomes immediately queryable inside Snowflake.
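dbt generates the DDL for you; as a rough sketch, the compiled statement for the model above might look something like this. The database, schema, and base location are illustrative, and the exact DDL depends on your dbt and dbt-snowflake versions:

-- illustrative compiled output, not the literal DDL dbt emits
CREATE OR REPLACE ICEBERG TABLE my_db.my_schema.my_iceberg_model
  EXTERNAL_VOLUME = 'iceberg_external_volume_soledp'
  CATALOG = 'SNOWFLAKE'
  BASE_LOCATION = '_dbt/my_schema/my_iceberg_model'
  AS (
    select customer_id, count(order_id) as orders, sum(total_amt) as revenue
    from my_db.my_schema.orders
    group by customer_id
  );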
Step 4: Verify in Snowflake
SHOW ICEBERG TABLES IN DATABASE my_db;
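To confirm that Snowflake is writing Iceberg metadata to your external volume, you can also retrieve the latest metadata file location for the table; the fully qualified name below is a placeholder:

-- returns the current metadata location for a Snowflake-managed Iceberg table
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('my_db.my_schema.my_iceberg_model');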
Conclusion
This guide provides a clear path for using Iceberg tables with SOLE for Data Products, from initial setup to integration with MATE models, ensuring a smooth and efficient data management process.