Skip to main content

How to Create Incremental Models in MATE

Creating an incremental model is as simple as this example demonstrates. You have a source table that gets loaded once a day with data. From that, you create an incremental model that only processes "new" records loaded. This ensures a lighter model that only inserts new rows rather than recreating the entire table from scratch.

Setup

Let's say you have the following setup:

  1. An ingestion table staging the daily load. Let's call this SOURCE_TABLE

    SOURCE_TABLE !!shadow!!

    Set the table as a source:

    dataops/modelling/sources/source_table.yml
    version: 2

    sources:
    - name: staging
    tables:
    - name: source_table
    columns:
    - name: name
    - name: schema
    - name: load_date
    - name: load_count
  2. Set the incremental model history_inc:

    dataops/modelling/models/history_inc.sql
     {{ config(
    materialized='incremental'
    ) }}

    select
    name,
    schema,
    load_date,
    load_count
    from {{source('staging', 'source_table') }}

    {% if is_incremental() %}

    -- this filter will only be applied on an incremental run
    where load_date > (select max(load_date) from {{ this }})

    {% endif %}

A couple of things happen here:

  • You configure the model with materialized = 'incremental'. The first run rebuilds the model entirely, and every subsequent run that does not have the --full-refresh option set triggers the is_incremental() clause.

  • You select to read the SOURCE_TABLE from {{source('staging', 'source_table') }}.

  • You have the is_incremental() clause in Jinja {% if ... %} defining how to grab new rows. The condition what are new rows is:

    where load_date > (select max(load_date) from {{ this }}

    Where you say, "give me only rows with load_date more recent than what is currently the oldest row in this table".

If you want to read more about Jinja, head to our template rendering section.

Initial pipeline run

The initial pipeline run will produce the history_inc model:

initial pipeline run !!shadow!!

You can see that history_inc gets created. Checking the content, it is the same as the original table:

source and incremental model tables !!shadow!!

Subsequent pipeline runs

Now let's assume that ingestion happened and the original table had changed. It has an additional row that looks like row 5:

updated SOURCE_TABLE after ingestion !!shadow!!

The subsequent MATE job run triggers the where clause and only selects the corresponding rows.

subsequent MATE job run !!shadow!!

What happens is that the new row 5 gets added to the history_inc model:

row 5 gets added to the history model from SOURCE_TABLE !!shadow!!

Things to consider

It would be best if you were careful when using incremental runs. Defining where to put is_incremental() macro may significantly lighten your model and execution time, but one should take caution, e.g., when using window functions.