MATE Core Concepts

The Modelling and Transformation Engine (MATE) is built on the following concepts:

Project

A project is a complete set of YAML and SQL files that define all sources and transformations for a MATE project. The primary project definition is held in dbt_project.yml, which contains all top-level configurations for the project. This project includes all sources, models, and other items, as detailed below.

In DataOps, there is typically a single MATE project per DataOps project (repository), located in /dataops/modelling.

Sources

One of the first steps in a MATE project is to define your sources. These describe the initial database tables on which all models and transformations are based. Defining sources this way makes it possible to maintain lineage back to the data source and apply freshness and other tests to unaltered source data.

tip

Sources are defined in YAML files in the project's models directory.
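
For example, a minimal source definition might look like the following. The schema and table names here are purely illustrative:

```yaml
version: 2

sources:
  - name: stack          # logical name used when models reference this source
    database: raw
    tables:
      - name: customers
```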

Models

Models use SQL SELECT statements to define data transformations, inheriting from sources or other models. These statements are then stored in SQL template files. Jinja templating is used to insert dynamic content into models, such as logical references to source data.

For instance:

/dataops/modelling/models/customers/source_stack_customers.sql
with source_stack_customers as (

    select
        id as customer_id,
        first_name,
        last_name

    from source.customers
)

select * from source_stack_customers

Materializations

Models are either materialized as tables or views, or not materialized at all, existing only as logical "stepping stones" for descendant models.

The following materializations can be configured for models:

Table

A model with the table materialization is rebuilt as a full table on each run, resulting in faster response times for BI tools that query the data.

For instance:

The following code snippet creates a model.

/dataops/modelling/models/customers/source_stack_customers.sql
with source_stack_customers as (

    select
        id as customer_id,
        first_name,
        last_name

    from source.customers
    where is_active = true
)

select * from source_stack_customers

Once this table is built, we can select from the table as follows:

select *
from source_stack_customers
note

See the section on Materializations: Table or View for more information on how to tell MATE to build a particular model as a table or view.
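
The behavior of a table materialization can be sketched with Python's built-in sqlite3 module. This is not MATE code, and the table and column names are illustrative; it only shows that a table is a physical snapshot taken at build time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customers (id integer, first_name text, is_active integer)")
conn.execute("insert into customers values (1, 'Ada', 1), (2, 'Ben', 0)")

# "Build" the model as a table: the result set is written out in full.
conn.execute("""
    create table source_stack_customers as
    select id as customer_id, first_name
    from customers
    where is_active = 1
""")

# Source rows arriving *after* the build are not visible until the next rebuild.
conn.execute("insert into customers values (3, 'Cy', 1)")
table_rows = conn.execute("select count(*) from source_stack_customers").fetchone()[0]
```

Here `table_rows` is 1: only the row that was active at build time is in the table, even though a new active customer arrived afterwards.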

View

A logical view is created for the model rather than a table, meaning that no additional data is stored in the database.

In practice, if we store the following SQL statement as a view, the SQL query (statement) itself gets stored in the Snowflake data warehouse.

/dataops/modelling/models/customers/source_stack_customers.sql
with source_stack_customers as (

    select
        id as customer_id,
        first_name,
        last_name

    from source.customers
    where is_active = true
)

select * from source_stack_customers

We can select from this view in the same way we selected from the table above; the difference is that the query stored in source_stack_customers.sql runs every time we run a SQL query against this view.
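
The contrast with a table can again be sketched with sqlite3 (illustrative names only, not MATE code): a view stores just the query, so results always reflect the current source data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customers (id integer, first_name text, is_active integer)")
conn.execute("insert into customers values (1, 'Ada', 1)")

# "Build" the model as a view: only the query text is stored.
conn.execute("""
    create view source_stack_customers as
    select id as customer_id, first_name
    from customers
    where is_active = 1
""")

# The view's query re-runs on every select, so later source rows appear
# immediately, with no rebuild.
conn.execute("insert into customers values (2, 'Ben', 1)")
view_rows = conn.execute("select count(*) from source_stack_customers").fetchone()[0]
```

Here `view_rows` is 2: the row inserted after the view was created is visible on the very next query.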

note

See the section on Materializations: Table or View for more information on how to tell MATE to build a particular model as a table or view.

Incremental

MATE models usually define a complete SELECT whose target is truncated and reloaded on each run, but a model can be configured as incremental so that subsequent runs load only the rows that have changed. In other words, incremental models allow DataOps to insert or update records in a table on a schedule, loading only the data that arrived since the last run of the MATE project inside the DataOps pipeline.

note

Incremental models are best suited for event data with large datasets and events logged in real-time or near-real-time.

For instance, let's assume we have a table that collects click-stream data. We have a very large eCommerce site and we collect data about how often product pages are clicked on and from what geographical locations, as well as other data, including shopping cart data.

Our source YAML file will look as follows:

version: 2

sources:
  - name: click_stream_data
    database: raw
    tables:
      - name: website_click_events
        loaded_at_field: click_event_timestamp
        freshness:
          error_after: { count: 30, period: minute }

The incremental model code snippet is as follows:

{{ config(
    materialized = 'incremental',
    unique_key = 'website_click_event_id'
) }}

with events as (

    select * from {{ source('click_stream_data', 'website_click_events') }}

    {% if is_incremental() %}
    where click_event_timestamp >= (select max(click_event_timestamp) from {{ this }})
    {% endif %}

),

click_events as (

    select * from events
    where event = 'click_event'

),

aggregated_click_events as (

    select
        website_click_event_id,
        count(*) as num_click_events,
        min(click_event_timestamp) as click_event_start,
        max(click_event_timestamp) as max_click_event_timestamp
    from events
    group by 1

),

joined_click_events as (

    select *
    from click_events
    left join aggregated_click_events using (website_click_event_id)

)

select * from joined_click_events
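
The incremental loading pattern itself can be sketched with sqlite3. This is not MATE code; the table names are illustrative, and `insert or replace` stands in for the deduplication that the model's unique key provides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table website_click_events
    (website_click_event_id integer primary key, click_event_timestamp text)""")
conn.execute("insert into website_click_events values (1, '2023-01-01'), (2, '2023-01-02')")

# First run: the target does not exist yet, so load everything.
conn.execute("""create table click_events_model
    (website_click_event_id integer primary key, click_event_timestamp text)""")
conn.execute("insert into click_events_model select * from website_click_events")

def run_incremental(conn):
    """Subsequent run: load only rows at or after the newest timestamp
    already in the target; the primary key deduplicates the boundary row."""
    conn.execute("""
        insert or replace into click_events_model
        select * from website_click_events
        where click_event_timestamp >=
            (select max(click_event_timestamp) from click_events_model)
    """)

# New events arrive between runs...
conn.execute("insert into website_click_events values (3, '2023-01-03')")
run_incremental(conn)
loaded = conn.execute("select count(*) from click_events_model").fetchone()[0]
```

After the incremental run, `loaded` is 3: only the new row was inserted, while rows loaded earlier were left in place rather than being truncated and reloaded.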

Ephemeral

These models are not built into the database at all. Instead, they remain a logical concept in DataOps to allow granularity in model hierarchies. In summary, ephemeral models function as follows:

  • They do not exist in the database.
  • They are reusable code snippets.
  • They are inserted as a common table expression (CTE) into any model that references them.

It is worth noting the following about ephemeral models:

  • They are best used for lightweight transformations early in your modeling and transformation process or DAG (Directed Acyclic Graph).
  • You cannot select directly from these models.
  • They are best only used in a few downstream models.
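
The inlining behavior can be sketched with sqlite3 (illustrative names, not MATE code): the "ephemeral model" below exists only as a CTE inside the query that references it, never as a database object:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, amount real, status text)")
conn.executemany("insert into orders values (?, ?, ?)",
                 [(1, 10.0, 'paid'), (2, 5.0, 'refunded'), (3, 7.5, 'paid')])

# paid_orders is never created in the database; its SQL is inlined into the
# downstream query as a common table expression.
query = """
    with paid_orders as (
        select id, amount from orders where status = 'paid'
    )
    select sum(amount) from paid_orders
"""
total = conn.execute(query).fetchone()[0]
```

Here `total` is 17.5, computed entirely within the one query; there is no `paid_orders` table or view to select from afterwards.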

Snapshots

Implementing built-in history in data warehouse tables using Slowly Changing Dimensions (SCD) can be complex to manage, but MATE snapshots allow type-2 SCD to be set up with minimal configuration. Configuration is added to model-like SQL files managed within the project's snapshots directory.
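
What type-2 SCD means mechanically can be sketched with sqlite3. This is a hand-rolled illustration, not MATE's snapshot implementation: when a tracked value changes, the current row is closed out and a new row is opened, preserving full history:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table customer_snapshot (
    customer_id integer, status text, valid_from text, valid_to text)""")

def snapshot(conn, customer_id, status, as_of):
    """Minimal type-2 SCD step: close the current row if the tracked
    column changed, then insert the new version."""
    current = conn.execute(
        "select status from customer_snapshot where customer_id = ? and valid_to is null",
        (customer_id,)).fetchone()
    if current is not None and current[0] == status:
        return  # no change: nothing to record
    if current is not None:
        conn.execute(
            "update customer_snapshot set valid_to = ? "
            "where customer_id = ? and valid_to is null",
            (as_of, customer_id))
    conn.execute("insert into customer_snapshot values (?, ?, ?, null)",
                 (customer_id, status, as_of))

snapshot(conn, 1, 'trial', '2023-01-01')
snapshot(conn, 1, 'paid', '2023-02-01')   # change: old row closed, new row opened
history = conn.execute(
    "select status, valid_to from customer_snapshot order by valid_from").fetchall()
```

`history` is `[('trial', '2023-02-01'), ('paid', None)]`: the superseded status keeps its validity window, and the open-ended row is the current state. MATE snapshots set up this bookkeeping for you with minimal configuration.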

Tests

Often overlooked and undervalued in data projects, the ability to write tests that validate code and data integrity is a key pillar of DataOps. In DataOps MATE, tests are defined in the same YAML files that hold other source and model configurations. A set of built-in tests exist, allowing simple operations such as checks for null values. The DataOps platform also provides a rich library of additional test functions.

note

Tests are executed as part of a standard pipeline, with full configurability that allows for independent testing of tagged sets of models.
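
As a hedged illustration of what such test configuration can look like, built-in checks are attached to columns in the model YAML. The model and column names here are illustrative:

```yaml
version: 2

models:
  - name: source_stack_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
```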

More Information

Materializations: Table or View

Whether a model is built as a table or a view depends on one of two configurations:

In the dbt_project.yml file:

name: my_project

models:
  my_project:
    website_click_events:
      # materialize all models in models/website_click_events as tables or views
      +materialized: <table/view>

In a configuration block at the top of the model SQL file:

/dataops/modelling/models/customers/source_stack_customers.sql

{{ config(materialized='<table/view>', sort='customer_id', dist='customer_id') }}

with source_stack_customers as (

    select
        id as customer_id,
        first_name,
        last_name

    from source.customers
    where is_active = true
)

select * from source_stack_customers

MATE Transform Orchestrator

All the modelling and transformations are executed by the MATE Transform Orchestrator.