How to use Snowflake Iceberg tables with SOLE for Data Products
This article offers a comprehensive guide on using Snowflake Iceberg tables with SOLE for Data Products. It covers the critical concepts of Iceberg tables and provides a detailed, step-by-step guide on defining and using them.
Understanding Iceberg tables in Snowflake
Iceberg tables in Snowflake allow you to efficiently access and manage data stored in Parquet files on S3. Depending on your use case, you can manage these tables either externally or internally.
Internal catalog (managed by Snowflake): This method supports the creation of read-write Iceberg tables. Only an external volume is necessary, as Snowflake manages the catalog internally.
Externally managed catalog (using AWS Glue or an object store): This method allows you to create read-only Iceberg tables. To set up such a table, you need both an external volume and an external catalog.
Prerequisites for creating an Iceberg table
Since Iceberg tables use cloud storage, you must first set up your cloud provider's storage. Once your cloud storage is ready, you can create the external volume required for your Iceberg table.
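As an illustration only, a minimal sketch of creating an account-level external volume for an S3 bucket (the volume name matches the later example; the bucket URL and IAM role ARN are placeholders your own cloud setup would provide):
-- Sketch: account-level external volume pointing at the S3 bucket that holds the Iceberg files
CREATE EXTERNAL VOLUME iceberg_external_volume_soledp
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'my-s3-us-east-1'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://mybucket_us_east_1/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-iceberg-role'
      )
    );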
You can assign an external volume and catalog directly during Iceberg table creation or at the database level. Both the external volume and catalog are account-level objects in Snowflake.
Example commands:
- To create a database with an external volume and catalog:
CREATE DATABASE <database_name>
EXTERNAL_VOLUME = <external_volume_name>
CATALOG = <external_catalog_name>
- To alter an existing database:
ALTER DATABASE IF EXISTS <database_name> SET
EXTERNAL_VOLUME = <external_volume_name>
CATALOG = <external_catalog_name>
For read-write Iceberg tables, the catalog value remains SNOWFLAKE and does not have to be specified.
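For externally managed (read-only) tables, the external catalog is a catalog integration. As an illustrative sketch only, a catalog integration for Iceberg metadata stored directly in object storage (the integration name matches the later example) might look like this:
-- Sketch: catalog integration for Iceberg metadata files kept in object storage
CREATE CATALOG INTEGRATION IcebergCatalogInt_soledp
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;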
METADATA_FILE_PATH
When setting up cloud storage, determine the path value that designates the metadata file path for a table created from Iceberg files stored in an object storage system. This path should be relative to the active storage location of the external volume linked to the table.
For example, if the full path to your metadata file is s3://mybucket_us_east_1/metadata/v1.metadata.json and the external storage location is s3://mybucket_us_east_1/, you should set METADATA_FILE_PATH to metadata/v1.metadata.json.
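For instance, a minimal sketch of how that value appears in a CREATE ICEBERG TABLE statement (the table name is a placeholder; the volume and catalog names reuse the examples above):
-- Sketch: the path is relative to the external volume's active storage location
CREATE ICEBERG TABLE my_read_only_iceberg_table
  EXTERNAL_VOLUME = iceberg_external_volume_soledp
  CATALOG = IcebergCatalogInt_soledp
  METADATA_FILE_PATH = 'metadata/v1.metadata.json';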
Step-by-step guide: creating an externally managed Iceberg table
Step 1 - Define a read-only Iceberg table in SOLE for Data Products
When working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory:
config-version: 2
hook-paths: [hooks.yml]
- If the hooks.template.yml file exists: Verify the presence of the database_level_hooks object group section in hooks.template.yml. If it does not exist, create this section and include database_level_hooks to add commands.
- If the hooks.template.yml file does not exist: Create a new file named hooks.template.yml within the dataops/snowflake directory. Ensure that the newly created file includes a database_level_hooks object group section.
- As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hook section of account_level_hooks. Add the hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory, with the following content.
Note: If the external volume and catalog are set at the database level, you don't need to specify them again when creating the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name> EXTERNAL_VOLUME = <External_volume_name>
          CATALOG = <External_catalog_name> METADATA_FILE_PATH = '<metadata_file_path_in_cloud_storage>'"
Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.
- Create the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST EXTERNAL_VOLUME = iceberg_external_volume_soledp
          CATALOG = IcebergCatalogInt_soledp METADATA_FILE_PATH = 'ICEBERG_TABLE_STORAGE/metadata/00000-e80344f7-7c89-47df-b1ab-8f9e2b44b3db.metadata.json' COMMENT = 'Table 1'"
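As an optional check after the SOLE pipeline has run, you can confirm that the table exists and is queryable. A minimal sketch, assuming the example names above (<rendered_database_name> stands for the database name produced from your prefix and environment):
-- Verify the externally managed Iceberg table created by the hook
SHOW ICEBERG TABLES IN SCHEMA <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA;
-- Read a few rows to confirm the metadata file resolves correctly
SELECT * FROM <rendered_database_name>.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST LIMIT 10;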
Step 2 - Using Iceberg tables as a dbt source in MATE
- Define the dbt source in sources.yml
In dbt, sources are defined in a sources.yml file, which connects your dbt models to the actual tables in your data warehouse. When mapping to an Iceberg table, include all relevant details, such as the database, schema, and table name. This lets you use the Iceberg table as a source to perform MATE job transformations.
Example:
sources.yml
version: 2
sources:
  - name: TEST_ICEBERG
    database: "{{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}"
    schema: ICEBERG_SOLEDP_SCHEMA
    tables:
      - name: AWS_ICEBERG_TABLE
        description: "Read-only Iceberg table"
        columns:
          - name: ID
            description: "Column id"
          - name: NAME
            description: "Column name"
- Reference the source in your dbt model
After defining the source, you can reference it in your dbt model using the source function. This ensures that dbt pulls data from the correct Iceberg table.
Example dbt model:
models/my_iceberg_model.sql
{{ config(alias='INGESTION_TABLE') }}
SELECT
    *
FROM {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }}
Use the {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }} macro to refer directly to the defined source.
- Reference the source in your dbt_project file
models:
  source_iceberg:
    schema: ICEBERG_SOURCE
    materialized: table
    iceberg:
      tags:
        - "source_iceberg"
To run a MATE job with specific variables, set TRANSFORM_ACTION to RUN and specify your model name in TRANSFORM_MODEL_SELECTOR. For example:
mate job:
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: "tag:source_iceberg"
Step 3 - Refresh Iceberg table metadata in SOLE for Data Products
If the data stored in cloud storage has been updated, you must refresh the Iceberg table to synchronize its metadata with the latest changes.
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE <iceberg_table_name> REFRESH '<metadata_file_relative_path>'"
The metadata_file_relative_path specifies the metadata file path for a table created from Iceberg files in object storage. The path must be relative to the active storage location of the external volume associated with the table.
Refreshing Iceberg table metadata example:
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST REFRESH '<metadata_file_relative_path>'"
After updating the Iceberg table, execute the MATE job again to apply transformations to the refreshed metadata.
Step-by-step guide: creating an internally managed Iceberg table
Step 1 - Define a read-write Iceberg table in SOLE for Data Products
When working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory:
config-version: 2
hook-paths: [hooks.yml]
- If the hooks.template.yml file exists: Verify the presence of the database_level_hooks object group section in hooks.template.yml. If it does not exist, create this section and include database_level_hooks to add commands.
- If the hooks.template.yml file does not exist: Create a new file named hooks.template.yml within the dataops/snowflake directory. Ensure that the newly created file includes a database_level_hooks object group section.
- As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hook section of account_level_hooks. Add the hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory, with the following content.
Note: If the external volume and catalog are set at the database level, you don't need to specify them again when creating the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name> (<COLUMN_NAME> <COLUMN_TYPE>) EXTERNAL_VOLUME = <External_volume_name>
          CATALOG = SNOWFLAKE BASE_LOCATION = '<folder_name_to_create_in_cloud_storage>'"
Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.
- Create the Iceberg table.
hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST (customer_id int, name string) EXTERNAL_VOLUME = iceberg_external_volume_soledp
          CATALOG = SNOWFLAKE BASE_LOCATION = 'my_iceberg_table'"
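Because the catalog is SNOWFLAKE, this table is read-write. As an illustrative sketch only, assuming the example table above and that your session's current database is set to the rendered <prefix>_<env> database, you could write to and read from it like this:
-- Insert a couple of rows into the Snowflake-managed (read-write) Iceberg table
INSERT INTO ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST (customer_id, name)
VALUES (1, 'Alice'), (2, 'Bob');
-- Read them back
SELECT customer_id, name
FROM ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST;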
Step 2 - Using Iceberg tables as a model with MATE
Currently, there is no built-in materialization support for creating or managing Iceberg tables in dbt. The existing materialization options, such as table, view, incremental, and ephemeral, do not extend to Iceberg tables. As a result, creating or managing Iceberg tables directly through MATE is not supported. This limitation also applies to the DataOps platform, which does not currently support write operations on Iceberg tables. Users who want to work with read-write Iceberg tables will need to consider alternative approaches or await future updates that may introduce this functionality.
Conclusion
This guide provides a clear path for using Iceberg tables in SOLE for Data Products and MATE, from setup to integration with MATE models, ensuring a smooth and efficient data management process.