How to use Snowflake Iceberg tables with SOLE for Data Products

This article explains how to use Snowflake Iceberg tables with SOLE for Data Products. It covers the key concepts behind Iceberg tables and walks through defining and using them step by step.

Understanding Iceberg tables in Snowflake

Iceberg tables in Snowflake allow you to efficiently access and manage data stored in Parquet files on S3. Depending on your use case, you can manage these tables either externally or internally.

Internal Catalog (managed by Snowflake):

This method supports the creation of read-write Iceberg tables. Only an external volume is necessary, as Snowflake manages the catalog internally.
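
For illustration only, a minimal Snowflake-managed Iceberg table could be defined as follows; the table, column, external volume, and base location names are placeholders, and CATALOG defaults to SNOWFLAKE so it can be omitted:

-- Illustrative sketch: replace the names with your own values
CREATE ICEBERG TABLE my_iceberg_table (id INT, name STRING)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my_iceberg_table/';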

Externally managed catalog (using AWS Glue or Object Store):

This method allows you to create read-only Iceberg tables. To set up such a table, you need an external volume and external catalog.
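
In Snowflake, the external catalog is represented by a catalog integration. As a minimal sketch for Iceberg metadata stored directly in object storage (the integration name is a placeholder), it might look like this:

-- Illustrative sketch: the integration name is a placeholder
CREATE CATALOG INTEGRATION my_object_store_catalog
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;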

Prerequisites for creating an Iceberg table

Since Iceberg tables use cloud storage, you must first set up storage with your cloud provider. Once your cloud storage is ready, you can create the external volume required for your Iceberg table.
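
As an illustrative sketch only, an external volume for an S3 bucket (for example, the s3://mybucket_us_east_1/ bucket used in the METADATA_FILE_PATH example below) can be created along these lines; the volume name, storage location name, and IAM role ARN are placeholders you must replace with your own values:

-- Illustrative sketch: replace the names, bucket, and role ARN with your own values
CREATE EXTERNAL VOLUME iceberg_external_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-us-east-1'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://mybucket_us_east_1/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_iceberg_role'
    )
  );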

info

You can assign an external volume and catalog directly during Iceberg table creation or at the database level. Both the external volume and catalog are account-level objects in Snowflake.

Example commands:

  • To create a database with an external volume and catalog:

    CREATE DATABASE <database_name>
    EXTERNAL_VOLUME = <external_volume_name>
    CATALOG = <external_catalog_name>
  • To alter an existing database:

    ALTER DATABASE IF EXISTS <database_name> SET
    EXTERNAL_VOLUME = <external_volume_name>
    CATALOG = <external_catalog_name>
info

For read-write Iceberg tables, the catalog value remains SNOWFLAKE and does not have to be specified.

METADATA_FILE_PATH

When setting up cloud storage, determine the path value that designates the metadata file path for a table created from Iceberg files stored in an object storage system. This path should be relative to the active storage location of the external volume linked to the table.

For example, if the full path to your metadata file is s3://mybucket_us_east_1/metadata/v1.metadata.json and the external storage location is s3://mybucket_us_east_1/, you should set METADATA_FILE_PATH to metadata/v1.metadata.json.

Step-by-step guide: creating an externally managed Iceberg table

Step 1 - define a read-only Iceberg table in SOLE for Data Products

When working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory.

dataops_config.yml
config-version: 2
hook-paths: [hooks.yml]
  • If the hooks.template.yml file exists:

    Verify that a database_level_hooks object group section is present in hooks.template.yml. If it does not exist, create the database_level_hooks section so you can add commands to it.

  • If the hooks.template.yml file does not exist:

    Create a new file named hooks.template.yml within the dataops/snowflake directory.

    Ensure that the newly created file includes a database_level_hooks object group section.

  • As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hooks section of account_level_hooks (a minimal sketch follows this step's examples).

    Add post_hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory with the following content:

    info

    If the external volume and catalog are set at the database level, you don't need to specify them again when creating the Iceberg table.

    hooks.template.yml
    database_level_hooks:
      post_hooks:
        - environment: snowflake
          commands:
            - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name>
               EXTERNAL_VOLUME = <external_volume_name>
               CATALOG = <external_catalog_name>
               METADATA_FILE_PATH = '<metadata_file_path_in_cloud_storage>'"

    Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.

  • Create the Iceberg table.

    hooks.template.yml
    database_level_hooks:
      post_hooks:
        - environment: snowflake
          commands:
            - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST
               EXTERNAL_VOLUME = iceberg_external_volume_soledp
               CATALOG = IcebergCatalogInt_soledp
               METADATA_FILE_PATH = 'ICEBERG_TABLE_STORAGE/metadata/00000-e80344f7-7c89-47df-b1ab-8f9e2b44b3db.metadata.json'
               COMMENT = 'Table 1'"
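
If the target database and schema do not already exist, the following is a minimal, illustrative sketch of creating them through account_level_hooks pre_hooks. It reuses the database naming and the ICEBERG_SOLEDP_SCHEMA schema from the examples above; adjust the names to your project:

hooks.template.yml
account_level_hooks:
  pre_hooks:
    - environment: snowflake
      commands:
        # Illustrative only: database and schema names follow the examples above
        - "CREATE DATABASE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}"
        - "CREATE SCHEMA IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA"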

Step 2 - Using Iceberg tables as a dbt source in MATE

  • Define the dbt source in sources.yml

    In dbt, sources are defined in a sources.yml file, which connects your dbt models to the actual tables in your data warehouse. When mapping to an Iceberg table, include all relevant details, such as the database, schema, and table name.

    This lets you use the Iceberg table as a source for MATE job transformations.

    Example:

    sources.yml
    version: 2
    sources:
      - name: TEST_ICEBERG
        database: "{{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}"
        schema: ICEBERG_SOLEDP_SCHEMA
        tables:
          - name: AWS_ICEBERG_TABLE
            description: "Read-only Iceberg table"
            columns:
              - name: ID
                description: "Column id"
              - name: NAME
                description: "Column name"
  • Reference the Source in your dbt model

    After defining the source, you can reference it in your dbt model using the source function. This ensures that dbt pulls data from the correct Iceberg table.

    Example dbt model:

    models/my_iceberg_model.sql
    {{ config(alias='INGESTION_TABLE') }}

    SELECT *
    FROM {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }}

    Use the {{ source('TEST_ICEBERG', 'AWS_ICEBERG_TABLE') }} macro to refer directly to the defined source.

  • Reference the source in your dbt_project file

    dbt_project.yml
    models:
      source_iceberg:
        schema: ICEBERG_SOURCE
        materialized: table
        iceberg:
          tags:
            - "source_iceberg"

    To run a MATE job with specific variables, set TRANSFORM_ACTION to RUN and specify your model name in TRANSFORM_MODEL_SELECTOR.

    For example:

    mate job:
      variables:
        TRANSFORM_ACTION: RUN
        TRANSFORM_MODEL_SELECTOR: "tag:source_iceberg"

Step 3 - Refresh Iceberg table metadata in SOLE for Data Products

If the data stored in cloud storage has been updated, you must refresh the Iceberg table to synchronize its metadata with the latest changes.

hooks.template.yml
database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE <iceberg_table_name> REFRESH '<metadata_file_relative_path>'"

The metadata_file_relative_path specifies a metadata file path for a table created from Iceberg files in object storage. The path must be relative to the active storage location of the external volume associated with the table.

Refreshing Iceberg table metadata example:

database_level_hooks:
  post_hooks:
    - environment: snowflake
      commands:
        - "ALTER ICEBERG TABLE {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST REFRESH '<metadata_file_relative_path>'"

After updating the Iceberg table, execute the MATE job again to apply transformations to the refreshed metadata.

Step-by-step guide: creating an internally managed Iceberg table

Step 1 - Define a read-write Iceberg table in SOLE for Data Products

When working with Iceberg tables in SOLE for Data Products, ensure a hooks file path is specified in the dataops_config.yml file in the dataops/snowflake directory.

dataops_config.yml
config-version: 2
hook-paths: [hooks.yml]
  • If the hooks.template.yml file exists:

    Verify that a database_level_hooks object group section is present in hooks.template.yml. If it does not exist, create the database_level_hooks section so you can add commands to it.

  • If the hooks.template.yml file does not exist:

    Create a new file named hooks.template.yml within the dataops/snowflake directory.

    Ensure that the newly created file includes a database_level_hooks object group section.

  • As an Iceberg table is a schema-level object, it requires specifying the database and schema. If you want to use an existing database and schema, include their names when creating the Iceberg table. Otherwise, define them in the pre_hooks section of account_level_hooks.

    Add post_hooks under the database_level_hooks section of hooks.template.yml in the dataops/snowflake directory with the following content:

    info

    If the external volume and catalog are set at the database level, you don’t need to specify them again when creating the Iceberg table.

    hooks.template.yml
    database_level_hooks:
      post_hooks:
        - environment: snowflake
          commands:
            - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_<database-name>_{{ env.DATAOPS_ENV_NAME }}.<schema_name>.<iceberg_table_name> (<column_name> <column_type>)
               EXTERNAL_VOLUME = <external_volume_name>
               CATALOG = SNOWFLAKE
               BASE_LOCATION = '<folder_name_to_create_in_cloud_storage>'"

    Here, {{ env.DATABASE_NAME_PREFIX }} serves as the prefix, and {{ env.DATAOPS_ENV_NAME }} acts as the suffix.

  • Create the Iceberg table.

    hooks.template.yml
    database_level_hooks:
      post_hooks:
        - environment: snowflake
          commands:
            - "CREATE ICEBERG TABLE IF NOT EXISTS {{ env.DATABASE_NAME_PREFIX }}_{{ env.DATAOPS_ENV_NAME }}.ICEBERG_SOLEDP_SCHEMA.ICEBERG_TABLE_TEST (customer_id int, name string)
               EXTERNAL_VOLUME = iceberg_external_volume_soledp
               CATALOG = SNOWFLAKE
               BASE_LOCATION = 'my_iceberg_table'"

Step 2 - Using Iceberg tables as a model with MATE

Currently, dbt has no built-in materialization support for creating or managing Iceberg tables. The existing materialization options, such as table, view, incremental, and ephemeral, do not extend to Iceberg tables, so creating or managing Iceberg tables directly through MATE is not supported. This limitation also applies to the DataOps platform, which does not currently support write operations on Iceberg tables. If you need read-write Iceberg tables, consider alternative approaches or wait for future updates that may introduce this functionality.

Conclusion

This guide provides a clear path for using Iceberg tables in SOLE for Data Products and MATE, from setup to integration with MATE models, ensuring a smooth and efficient data management process.