MATE Core Concepts
The concepts that make up the Modelling and Transformation Engine or MATE are as follows:
Project
A project is the complete set of YAML, SQL, and Python files that define all sources and transformations for MATE. The primary project definition is held in dbt_project.yml, which contains all top-level configurations for the project. The project includes all sources, models, and other items, as detailed below.
DataOps typically has a single MATE project per DataOps project (repository), located in /dataops/modelling.
Support for Python files is currently in Public Preview. For more information, see Using Python to build MATE models.
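For illustration, here is a minimal dbt_project.yml sketch; the project name, profile, and paths are placeholders rather than required values:
name: my_mate_project
version: '1.0.0'
config-version: 2
profile: my_profile

model-paths: ['models']
snapshot-paths: ['snapshots']

models:
  my_mate_project:
    +materialized: view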
Sources
One of the first steps in a MATE project is to define your sources. These describe the initial database tables that models and transformations are based on. Defining sources this way makes it possible to maintain lineage back to the data source and to apply freshness and other tests to unaltered source data.
You define sources in YAML files in the project's models directory.
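For illustration, a minimal source definition sketch; the source name, database, and table name used here are placeholders:
version: 2

sources:
  - name: stack
    database: raw
    tables:
      - name: customers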
Models
SQL models use SQL SELECT statements to define data transformations, inheriting from sources or other models. These statements are stored in SQL template files. Jinja templating is used to insert dynamic content into models, such as logical references to source data.
For instance:
with source_stack_customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from source.customers
)

select * from source_stack_customers
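In practice, the reference to the source table is usually written with the Jinja source() function so MATE can track lineage back to the declared source. A minimal sketch, assuming a source named stack with a customers table (placeholder names):
with source_stack_customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from {{ source('stack', 'customers') }}
)

select * from source_stack_customers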
Support for Python models is currently in Public Preview. For more information, see Using Python to build MATE models.
Python models use a function named model(), which takes two parameters:
- dbt: A class compiled by dbt Core, unique to each model, that enables you to run your Python code in the context of your dbt project and DAG (Directed Acyclic Graph).
- session: A class that represents your data platform's connection to the Python backend. The session is needed to read tables as DataFrames and to write DataFrames back to tables. In PySpark, the SparkSession is named spark and is available globally. For consistency across platforms, we always pass it into the model function as an explicit argument called session.
For instance:
def model(dbt, session):
    target_name = dbt.config.get("target_name")
    specific_var = dbt.config.get("specific_var")
    specific_env_var = dbt.config.get("specific_env_var")

    orders_df = dbt.ref("fct_orders")

    # limit data in dev
    if target_name == "dev":
        orders_df = orders_df.limit(500)

    return orders_df
Materializations
Models are either materialized as tables or views, or are not materialized at all, existing only as logical "stepping stones" for descendant models.
You can configure the following materializations for models:
Python models only support Table and Incremental materializations. For more information, see the dbt documentation.
Table
With the table materialization, the model is built and rebuilt as a full table on each run, resulting in faster response times for BI tools that access the data.
For instance:
The following code snippet creates a SQL model.
with source_stack_customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from source.customers
    where is_active = true
)

select * from source_stack_customers
Once this table is built, you can select from the table as follows:
select *
from source_stack_customers
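How a particular model is materialized is typically set either in dbt_project.yml or in a config block at the top of the model's SQL file; a minimal sketch of the latter:
{{ config(materialized='table') }}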
The following code snippet creates a Python model.
from snowflake.snowpark.functions import col

def model(dbt, session):
    dbt.config(materialized="table")
    df = dbt.ref("raw_orders")
    df_new = df.select(df["id"].alias("order_id"), df["user_id"].alias("customer_id"),
                       col("order_date"), col("status"))
    return df_new
See the section on Materializations: Table or View for more information on how to tell MATE to build a particular model as a table or view.
View
A logical view is created for the model rather than a table, meaning that no additional data is stored in the database.
In practice, if you store the following SQL statement as a view, the query itself, rather than its results, is stored in the Snowflake data warehouse.
with source_stack_customers as (
    select
        id as customer_id,
        first_name,
        last_name
    from source.customers
    where is_active = true
)

select * from source_stack_customers
You can select from this view the same way you selected from the table above, but the difference is that the source_stack_customers.sql statement is run every time you run a SQL query on this view.
See the section on Materializations: Table or View for more information on how to tell MATE to build a particular model as a table or view.
Incremental
Although a MATE model usually defines a complete SELECT for a full truncate and reload, it can be configured as incremental so that subsequent runs load only the changed rows. In other words, incremental models let DataOps insert or update records in a table on a schedule or since the last time the MATE project inside the DataOps pipeline was run.
Incremental models are best suited for large event datasets where events are logged in real time or near-real time.
As an example for the SQL model, let's assume you have a table that collects click-stream data. You run an extensive eCommerce site and collect data about how often product pages are clicked, the geographical locations the clicks come from, and other data, including shopping cart data.
The source YAML file looks as follows:
version: 2

sources:
  - name: click_stream_data
    database: raw
    tables:
      - name: website_click_events
        loaded_at_field: click_event_timestamp
        freshness:
          error_after: { count: 30, period: minute }
The SQL incremental model code snippet is as follows:
{{ config(
    materialized = 'incremental',
    unique_key = 'website_click_event_id')
}}

with events as (
    select * from {{ source('click_stream_data', 'website_click_events') }}
    {% if is_incremental() %}
    where click_event_timestamp >= (select max(click_event_timestamp) from {{ this }})
    {% endif %}
),

click_events as (
    select * from events
    where event = 'click_event'
),

aggregated_click_events as (
    select
        website_click_event_id,
        count(*) as num_click_events,
        min(click_event_timestamp) as click_event_start,
        max(click_event_timestamp) as max_click_event_timestamp
    from events
    group by 1
),

joined_click_events as (
    select *
    from click_events
    left join aggregated_click_events using (website_click_event_id)
)

select * from joined_click_events
The following code snippet creates a Python model:
Snowpark:
import snowflake.snowpark.functions as F

def model(dbt, session):
    dbt.config(materialized = "incremental")
    df = dbt.ref("upstream_table")

    if dbt.is_incremental:
        # only new rows compared to max in current table
        max_from_this = f"select max(updated_at) from {dbt.this}"
        df = df.filter(df.updated_at >= session.sql(max_from_this).collect()[0][0])

        # or only rows from the past 3 days
        df = df.filter(df.updated_at >= F.dateadd("day", F.lit(-3), F.current_timestamp()))

    ...

    return df
PySpark:
import pyspark.sql.functions as F

def model(dbt, session):
    dbt.config(materialized = "incremental")
    df = dbt.ref("upstream_table")

    if dbt.is_incremental:
        # only new rows compared to max in current table
        max_from_this = f"select max(updated_at) from {dbt.this}"
        df = df.filter(df.updated_at >= session.sql(max_from_this).collect()[0][0])

        # or only rows from the past 3 days
        df = df.filter(df.updated_at >= F.date_add(F.current_timestamp(), F.lit(-3)))

    ...

    return df
Ephemeral
These models are not built into the database at all. Instead, they remain a logical concept in DataOps, allowing granularity in model hierarchies. In summary, ephemeral models function as follows:
- They do not exist in the database.
- They are reusable code snippets.
- They are inserted as a common table expression (CTE) in any model that references the ephemeral model.
It is worth noting the following about ephemeral models:
- They are best used for lightweight transformations early in your modeling and transformation process or DAG (Directed Acyclic Graph).
- You cannot select directly from these models.
- They are best used in only a few downstream models.
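For illustration, a model is marked as ephemeral with a config block at the top of its SQL file; this minimal sketch assumes a hypothetical upstream model named stg_customers:
{{ config(materialized='ephemeral') }}

select
    customer_id,
    first_name,
    last_name
from {{ ref('stg_customers') }}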
Snapshots
Implementing built-in history in data warehouse tables using Slowly Changing Dimensions (SCD) can be complex to manage, but MATE snapshots let you set up type-2 SCD with minimal configuration. Configuration is added to model-like SQL files managed within the project's snapshots directory.
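For illustration, a minimal snapshot sketch using the timestamp strategy; the snapshot name, target schema, keys, and source reference are placeholders:
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('stack', 'customers') }}

{% endsnapshot %}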