MATE Packages
Bundled packages
The DataOps Template Rendering topic notes that packages are integral to the MATE ecosystem. They are vital for importing macros, models, and other logic into a DataOps project. An excellent example of a package is dbt_utils, a library of open-source macros you can use and reuse across DataOps MATE projects. Another vital package used in MATE transformations is the dataops package. MATE ships with the dbt_utils and dataops packages by default.
The full set of bundled MATE packages depends on the dbt version you use (dbt 1.8, 1.7, 1.5, or 1.4).
For dbt 1.7 and 1.8:
- Datavault-UK/automate_dv (version: 0.11.0)
- tnightengale/dbt_meta_testing (version: 0.3.6)
- dbt-labs/dbt_utils (version: 1.3.0)
- data-mie/dbt_profiler (version: 0.8.2)
The package Datavault-UK/dbtvault has been renamed to Datavault-UK/automate_dv in dbt 1.7. To prevent potential failures, update your usage of the package by replacing dbtvault with automate_dv. For example:
Previous configuration:
{%- set source_model = "v_stg_orders" -%}
{%- set src_pk = "ORDER_PK" -%}
{%- set src_nk = "ORDERKEY" -%}
{%- set src_ldts = "LOAD_DATE" -%}
{%- set src_source = "RECORD_SOURCE" -%}
{{ dbtvault.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
src_source=src_source, source_model=source_model) }}
New configuration:
{%- set source_model = "v_stg_orders" -%}
{%- set src_pk = "ORDER_PK" -%}
{%- set src_nk = "ORDERKEY" -%}
{%- set src_ldts = "LOAD_DATE" -%}
{%- set src_source = "RECORD_SOURCE" -%}
{{ automate_dv.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
src_source=src_source, source_model=source_model) }}
For dbt 1.5:
- Datavault-UK/dbtvault (version: 0.9.5)
- tnightengale/dbt_meta_testing (version: 0.3.6)
- dbt-labs/dbt_utils (version: 1.2.0)
- data-mie/dbt_profiler (version: 0.8.2)
When upgrading your dbt version, breaking changes may affect specific existing setups:
- If you move from dbt 1.4 to 1.5, MATE jobs could fail if their configuration contains an empty tests: keyword in your YAML file. You can resolve this by removing the empty tests: keyword.
Check dbt migration for detailed guidance to help you resolve any potential issues.
For dbt 1.4:
- Datavault-UK/dbtvault (version: 0.9.4)
- tnightengale/dbt_meta_testing (version: 0.3.6)
- dbt-labs/dbt_utils (version: 1.2.0)
- data-mie/dbt_profiler (version: 0.8.2)
We strongly advise upgrading to the latest dbt version. However, when upgrading from dbt 1.4 to 1.5, MATE jobs could fail if their configuration contains an empty tests: keyword in your YAML file. You can resolve this by removing the empty tests: keyword.
Check dbt migration for detailed guidance to help you resolve any potential issues.
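For illustration, here is a minimal sketch of the failing pattern; the model and column names are hypothetical. A model YAML entry like this, with an empty tests: keyword, triggers the dbt 1.5 failure:

version: 2
models:
  - name: my_model
    columns:
      - name: id
        tests:

Removing the empty tests: line (or listing at least one test beneath it) resolves the error.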
Many packages are available on hub.getdbt.com.
Installing additional packages
If the bundled MATE packages are insufficient, it is possible to install additional ones. To do this, you must modify the packages.yml file in the project's dataops/modelling directory.
Specify a package using one of the following methods depending on where your package is stored.
Prerequisites
To install additional MATE packages successfully:
- Add the variable TRANSFORM_FORCE_DEPS: 1 to the file pipelines/includes/config/variables.yml. Adding the variable here makes it available to all MATE jobs (see the example after this list).
- If you created your project from the DataOps Template Project and the file dataops/modelling/packages.yml contains the line local: ../../../modelling-and-transformation-libs, remove that line. Otherwise, you may get a compilation error when installing the packages.
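As a sketch, assuming the standard top-level variables: block used in the template project's variables file, the entry looks like this:

variables:
  TRANSFORM_FORCE_DEPS: 1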
dbt-Hub packages
dbt Hub is a registry for dbt packages. Install the packages listed on dbt Hub using the syntax:
packages:
  - package: calogica/dbt_expectations
    version: 0.5.6
Git packages
Public repository
Install the packages stored on a Git server using the syntax:
packages:
  - git: "https://github.com/dbt-labs/dbt-codegen.git"
    revision: 0.8.1
Private repository
This method lets you clone private packages via HTTPS by passing in a Git token via an environment variable.
packages:
  # use this format when accessing your repository via a personal access token
  - git: "https://{{env_var('GITHUB_USERNAME')}}:{{env_var('GIT_CREDENTIAL')}}@github.com/<your_repository_path>.git" # git HTTPS URL
Read more about creating a GitHub Personal Access token.
Project subdirectories
The dbt documentation gives a great example of installing packages from repository sub-directories.
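For reference, a minimal sketch of such an entry uses dbt's subdirectory key for Git packages; the repository, revision, and sub-directory values below are illustrative:

packages:
  - git: "https://github.com/dbt-labs/dbt-labs-experimental-features"
    subdirectory: "materialized-views"
    revision: 0.0.1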
Local packages
Install the packages that you have stored locally by specifying the path to the project using the syntax:
packages:
  - local: /opt/dbt/foobar # use a local path
When the MATE job runs in the DataOps pipelines, the MATE Orchestrator goes through the packages.yml file and installs all the packages (or dependencies) needed to run all the MATE jobs.
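Putting these methods together, a single packages.yml file can mix dbt Hub, Git, and local entries; the entries below simply reuse the earlier examples for illustration:

packages:
  - package: calogica/dbt_expectations
    version: 0.5.6
  - git: "https://github.com/dbt-labs/dbt-codegen.git"
    revision: 0.8.1
  - local: /opt/dbt/foobar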
Installing the dataops package
DataOps.live provides the dataops package, adding model- and column-level tests. This bundled package is included by default if you use MATE with the supported dbt versions. However, if you want to use a different version of dbt but still want to benefit from the tests or grant management capabilities of the dataops package, you need to install it using the syntax:
packages:
  - local: /macro_libs/modelling_and_transformation_libs
Configuring packages
Add any variables needed by these packages to the dbt_project.yml file. The dbt documentation gives a great example of adding (or configuring) variables for the Snowflake snowplow/snowplow_web package, found on hub.getdbt.com/snowplow/snowplow_web. Here is part of that code snippet:
vars:
  snowplow:
    "snowplow:timezone": "America/New_York"
    "snowplow:page_ping_frequency": 10
    "snowplow:events": "{{ ref('sp_base_events') }}"
    "snowplow:context:web_page": "{{ ref('sp_base_web_page_context') }}"
    "snowplow:context:performance_timing": false
    "snowplow:context:useragent": false
    "snowplow:pass_through_columns": []
Using packages
Now that you know more about MATE packages, the next step is to use one or more of them in your MATE jobs.
Packages don't only contain macros; they can also include models, seeds, and all sorts of other helpful MATE logic.
Seeds are CSV files stored in the MATE project's seeds directory under dataops/modelling. They are loaded into the Snowflake data warehouse by a MATE job and can be referenced in downstream models using the ref function.
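As a minimal sketch (the project and seed names here are hypothetical), a seed dropped into that directory can be configured in the dbt_project.yml file and is then available to ref like any model:

seeds:
  my_project:
    country_codes:
      +column_types:
        country_code: varchar(2)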
However, macros and models are the MATE package's two most valuable and well-used components. As a result, let's take a deep dive into using macros and models contained within these packages.
Using package macros
One of the most widely used packages in MATE is dbt_utils.
dbt_utils.star
dbt_utils.star is a very helpful macro. Let's look at its function and purpose, as well as an example of how to use it in your MATE models.
According to the dbt_utils documentation, dbt_utils.star does the following:
It generates a comma-separated list of all fields in the from relation, excluding any listed in the except argument.
This macro takes the following arguments:
| Argument | Required/Optional | Description |
|---|---|---|
| from | Required | A ref or source value that holds the list of columns you want to select from |
| except | Optional, default = [] | Lists the columns you wish to exclude |
| relation_alias | Optional, default = '' | Adds a prefix to all the generated fields with an alias (relation_alias.field_name) |
| prefix | Optional, default = '' | Adds a prefix to the output field_name (field_name as prefix_field_name) |
| suffix | Optional, default = '' | Adds a suffix to the output field_name (field_name as field_name_suffix) |
And you can use it following the syntax:
select
{{ dbt_utils.star(from=ref('my_model'), except=["exclude_field_1", "exclude_field_2"], prefix="max_") }}
from {{ ref('my_model') }}
To expand on the dbt_utils.star macro detail, let's look at a practical use case. Assume you have the following models (stg_orders and stg_product_types), as seen in the Building MATE Models documentation.
stg_orders as (
    select
        id as order_id,
        prod_type_code as product_type,
        items as items_ordered,
        date as order_date
    from
        raw.stock_orders
)
stg_product_types as (
    select
        id as product_type_id,
        code as product_type_code,
        prod_type_desc as product_type_description
    from
        raw.product_types
)
Use these staging models to build a downstream model that shows the number of orders per product type between a start and an end date:
1. Declare the start_date and end_date variables in the dbt_project.yml file. See What is modeling and transformation? for more information on the dbt_project.yml file. The following code snippet shows you how to add the start_date and end_date variables to the dbt_project.yml file:

dbt_project.yml

name: my_dbt_project
version: 1.0.0

# Define variables here
vars:
  start_date: 01 January 2021
  end_date: 31 December 2021
2. Use the following SQL statement to build the dim_total_orders_product_type model:

with product_types as (
    select * from {{ ref('stg_product_types') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

final as (
    select
        product_types.product_type_code as product_type_code,
        product_types.product_type_description as product_type_description,
        coalesce(sum(orders.items_ordered), 0) as total_orders
    from product_types
    left join orders on orders.product_type = product_types.product_type_code
    where orders.order_date between '{{ var("start_date") }}' and '{{ var("end_date") }}'
    group by product_type_code, product_type_description
)

select * from final
3. To return a comma-separated list of all the fields in the dim_total_orders_product_type model, call the dbt_utils.star macro. The code is as follows:

select
    {{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"]) }}
from {{ ref('dim_total_orders_product_type') }}
Once the MATE job runs, dbt_utils.star executes, and the output returned is:
product_type_code,total_orders
As noted in this macro's description, the different arguments return different outputs. Let's look at a few of the available combinations and their results:
select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type')) }}
from {{ ref('dim_total_orders_product_type') }}
This SQL returns all the available columns in the dim_total_orders_product_type model, namely product_type_code, product_type_description, and total_orders.
The SQL statement to add a prefix to each column name is as follows:
select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), prefix="final_") }}
from {{ ref('dim_total_orders_product_type') }}
And the results returned are:
final_product_type_code,final_product_type_description,final_total_orders
The same principle applies to adding a suffix to the returned column names.
The SQL statement to add a suffix to each column name is as follows:
select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"], suffix="_final") }}
from {{ ref('dim_total_orders_product_type') }}
And the results returned are:
product_type_code_final,total_orders_final
dbt_utils.date_spine
This second example describes how to use the dbt_utils.date_spine macro, documented on hub.getdbt.com, to generate the SQL code needed to build a date spine.
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="to_date('01/01/2016', 'mm/dd/yyyy')",
    end_date="dateadd(week, 1, current_date)"
)
}}
Once the job runs, the following SQL statement uses the dbt_utils.date_spine macro to create a date spine view: one row per day, from 01 January 2021 until 01 January 2022.
{{
    config(
        materialized = 'view',
    )
}}

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="to_date('01/01/2021', 'mm/dd/yyyy')",
    end_date="to_date('01/01/2022', 'mm/dd/yyyy')"
)
}}
dbt_utils.surrogate_key
A third and last example is a macro that creates a surrogate primary key from a list of columns in a model. For instance, assume you want to create a model that returns the number of orders per product type per day, between a start date and an end date, from the staging model stg_product_orders. This model's SQL is as follows:
select
    product_type_id,
    order_date,
    count(*) as total_orders
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 1, 2
This model does not have the primary key it needs.
How do you create one?
Use the dbt_utils.surrogate_key macro. This macro ingests a list of columns and generates a hash for each row that is always the same when the column values are the same.
By adding this macro, the model's SQL is now as follows:
select
    {{ dbt_utils.surrogate_key(['product_type_id', 'order_date']) }} as surrogate_key,
    product_type_id,
    order_date,
    count(*) as total_orders
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 1, 2, 3
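If you also want to assert that the generated key is unique, you could add standard schema tests on it. A minimal sketch, assuming the model is saved as orders_per_product_type_per_day (a hypothetical name) and the macro output is aliased surrogate_key as above:

version: 2
models:
  - name: orders_per_product_type_per_day
    columns:
      - name: surrogate_key
        tests:
          - unique
          - not_null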
Using package models
As highlighted above, packages such as dbt-labs/dbt_utils can also include models. Let's expand on this topic by diving into how to access and use the models in this package.
The first step is to add this package to the packages.yml file, as seen in the code snippet below:
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
Before referencing the models in this package in downstream models, extra steps are required.
Why?
These models could select from or depend on specific sources. Therefore, you need to ensure that these sources are set up and ready to be accessed by the code referencing the models in the specified package.
For instance, one of the steps you need to take is to create a CSV file seed containing Snowflake contract rates:
effective_date, rate
2020-06-01, 2.55
2020-08-01, 2.48
The example below shows a schema test, dbt_utils.not_constant, which asserts that a column does not have the same value in all rows:
version: 2

models:
  - name: model_name
    columns:
      - name: column_name
        tests:
          - dbt_utils.not_constant
Under the hood, the dbt_utils package defines this test as follows:

{% test not_constant(model, column_name) %}
{{ return(adapter.dispatch('test_not_constant', 'dbt_utils')(model, column_name)) }}
{% endtest %}

{% macro default__test_not_constant(model, column_name) %}

select
    {# In TSQL, subquery aggregate columns need aliases #}
    {# thus: a filler col name, 'filler_column' #}
    count(distinct {{ column_name }}) as filler_column
from {{ model }}
having count(distinct {{ column_name }}) = 1

{% endmacro %}
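If you prefer this test to raise a warning instead of failing the pipeline, dbt's standard test configuration can be added to the same YAML entry. A minimal sketch, reusing the model and column names from the example above:

version: 2
models:
  - name: model_name
    columns:
      - name: column_name
        tests:
          - dbt_utils.not_constant:
              config:
                severity: warn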
Using PyPI packages (Public Preview)
For Python dbt models, you can define functions that depend on third-party packages, provided those packages are installed and available to the Python runtime on your data platform.
In this example, let's use the holidays package to determine whether a given date is a holiday in France. The code below uses the pandas API for simplicity and consistency across platforms. The exact syntax, and the need to refactor for multi-node processing, still vary by platform.
Snowpark:
import holidays

def is_holiday(date_col):
    # Chez Jaffle
    french_holidays = holidays.France()
    is_holiday = (date_col in french_holidays)
    return is_holiday

def model(dbt, session):
    dbt.config(
        materialized = "table",
        packages = ["holidays"]
    )

    orders_df = dbt.ref("stg_orders")

    df = orders_df.to_pandas()

    # apply our function
    # (columns need to be in uppercase on Snowpark)
    df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday)

    # return final dataset (Pandas DataFrame)
    return df
PySpark:

import holidays

def is_holiday(date_col):
    # Chez Jaffle
    french_holidays = holidays.France()
    is_holiday = (date_col in french_holidays)
    return is_holiday

def model(dbt, session):
    dbt.config(
        materialized = "table",
        packages = ["holidays"]
    )

    orders_df = dbt.ref("stg_orders")

    df = orders_df.to_pandas_on_spark()  # Spark 3.2+
    # df = orders_df.toPandas() in earlier versions

    # apply our function
    df["is_holiday"] = df["order_date"].apply(is_holiday)

    # convert back to PySpark
    df = df.to_spark()  # Spark 3.2+
    # df = session.createDataFrame(df) in earlier versions

    # return final dataset (PySpark DataFrame)
    return df
Configuring Python packages (Public Preview)
We encourage you to configure the required packages and versions so dbt can track them in project metadata. On some platforms, this configuration is required. If you need specific versions of packages, specify them.
def model(dbt, session):
    dbt.config(
        packages = ["numpy==1.23.1", "scikit-learn"]
    )
version: 2

models:
  - name: my_python_model
    config:
      packages:
        - "numpy==1.23.1"
        - scikit-learn