Skip to main content

MATE Packages

Bundled packages

The DataOps Template Rendering topic notes that packages are integral to the MATE ecosystem. They are vital for importing macros, models, and other logic into a DataOps project. An excellent example of a package is dbt_utils, a library of open-source macros you can use and reuse across DataOps MATE projects. Another vital package used in MATE transformations is the dataops package. MATE ships with the packages dbt_utils and dataops by default.

The full set of bundled MATE packages includes:

tip

Many packages are available on hub.getdbt.com.

Installing additional packages

If the bundled MATE packages are insufficient, it is possible to install additional ones. To do this, you must modify the packages.yml file in the project's dataops/modelling directory.

Specify a package using one of the following methods depending on where your package is stored.

Prerequisites

To install additional MATE packages successfully:

  • Add the variable TRANSFORM_FORCE_DEPS:1 in the file pipelines/includes/config/variables.yml.
    Adding the variable here makes it available to all MATE jobs.
  • If you have created your project from the DataOps Template Project and have the line local: ../../../modelling-and-transformation-libsfrom packages.yml in the file dataops/modelling/packages.yml, remove this line.
    Otherwise, you may have a compilation error installing the packages.

dbt-Hub packages

dbt Hub is a registry for dbt packages. Install the packages listed on dbt Hub using the syntax:

/dataops/modelling/packages.yml
packages:
- package: calogica/dbt_expectations
version: 0.5.6

Git packages

Public repository

Install the packages stored on a Git server using the syntax:

/dataops/modelling/packages.yml
packages:
- git: "https://github.com/dbt-labs/dbt-codegen.git"
revision: 0.8.1

Private repository

This method lets you clone private packages via HTTPS by passing in a Git token via an environment variable.

/dataops/modelling/packages.yml
packages:
# use this format when accessing your repository via a personal access token
- git: "https://{{env_var('GITHUB_USERNAME')}}:{{env_var('GIT_CREDENTIAL')}}@github.com/<your_repository_path>.git" # git HTTPS URL"

Read more about creating a GitHub Personal Access token.

Project subdirectories

The dbt documentation gives a great example for installing packages from sub-directories.

Local packages

Install the packages that you have stored locally by specifying the path to the project using the syntax:

/dataops/modelling/packages.yml
packages:
- local: /opt/dbt/foobar # use a local path

When the MATE job runs in the DataOps pipelines, the MATE Orchestrator goes through the packages.yml file and installs all the packages (or dependencies) needed to run all the MATE jobs.

Configuring packages

Add any variables needed by these packages to the dbt_project.yml file. The dbt documentation gives a great example of adding (or configuring) variables for the Snowflake snowplow/snowplow_web package, found on hub.getdbt.com/snowplow/snowplow_web.

Here is a duplication of part of this code snippet:

/dataops/modelling/dbt_project.yml
vars:
snowplow:
"snowplow:timezone": "America/New York"
"snowplow:page_ping_frequency": 10
"snowplow:events": "{{ ref('sp_base_events') }}"
"snowplow:context:web_page": "{{ ref('sp_base_web_page_context') }}"
"snowplow:context:performance_timing": false
"snowplow:context:useragent": false
"snowplow:pass_through_columns": []

Using packages

Now that you know better MATE packages, the next step is to use one or more packages in your MATE jobs.

Packages don't only have macros; they can also include models, seeds, and all sorts of other helpful MATE logic.

Seeds are CSV files stored in the MATE project's seeds directory under dataops/modelling. They are loaded into the Snowflake data warehouse by a MATE job and can be referenced in downstream models using the ref function.

However, macros and models are the MATE package's two most valuable and well-used components. As a result, let's take a deep dive into using macros and models contained within these packages.

Using package macros

One of the most widely used packages in MATE is dbt_utils.

dbt_utils.star

dbt_utils.star is a very helpful macro. Let's look at its function and purpose as well as an example of how to use it in your MATE models.

According to the dbt_utils documentation, dbt_utils.star does the following:

It generates a comma-separated list of all fields in the from relation, excluding any listed in the except argument.

This macro takes the following arguments:

ArgumentRequired/OptionalDescription
fromREQUIREDHas a ref or source value that holds the list of columns you want to select from
exceptOptional, default = []Lists the columns you wish to exclude
relation_aliasOptional, default = ''Adds a prefix to all the generated fields with an alias (relation_alias.field_name)
prefixOptional, default = ''Adds a prefix to the output field_name (field_name as prefix_field_name)
suffixOptional, default = ''Adds a suffix to the output field_name (field_name as field_name_suffix)

And you can use it following the syntax:

select
{{ dbt_utils.star(from=ref('my_model'), except=["exclude_field_1", "exclude_field_2"], prefix="max_") }}
from {{ ref('my_model') }}

To expand on the dbt_utils.star macro detail, let's look at a practical use case example. Let's assume you have the following models (stg_orders and stg_product_types) as seen in the Building MATE Models documentation.

stg_orders.sql
stg_orders as (
select
id as order_id,
prod_type_code as product_type
items as items_ordered,
date as order_date
from
raw.stock_orders
)
stg_product_types.sql
stg_product_types as (
select
id as product_type_id,
code as product_type_code,
prod_type_desc as product_type_description
from
raw.product_types
)

Use these staging models to build a downstream model that shows the number of orders per product type between a start and an end date:

  1. Declare the start_date and end_date variables in the dbt_project.yml file.

    See MATE Fundamentals: What is Modelling and Transformation? for more information on the dbt_project.yml file.

    The following code snippet shows you how to add the start_date and end_date variables to the dbt_project.yml file:

    dbt_project.yml
    name: my_dbt_project
    version: 1.0.0
    # Define variables here
    vars:
    start_date: 01 January 2021
    end_date: 31 December 2021
  2. Use the below SQL statement to build the dim_total_orders_product_type model:

    with product_types as (
    select * from {{ ref('stg_product_types') }}
    ),

    with orders as (
    select * from {{ ref('stg_orders') }}
    ),

    final as (
    select product_types.product_type_code as product_type_code,
    product_types.product_type_description as product_type_description,
    coalesce(sum(orders.items_ordered), 0), as total_orders

    from product_types
    left join orders using (orders.product_type)
    where orders.order_date between '{{ var("start_date") }}' and '{{ var("end_date") }}'
    )

    select * from final
  3. To return a comma-generated list of all the fields in the dim_total_orders_product_type model, call the dbt_utils.star macro. The code is as follows:

    select
    {{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"]) }}
    from {{ ref('dim_total_orders_product_type') }}

Once the MATE job runs, dbt_utils.star runs, and the output returned is:

product_type_code,total_orders

As noted in this macro's description, the different arguments return different outputs. Let's look at a few of the available combinations and their results:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type')) }}
from {{ ref('dim_total_orders_product_type') }}

This SQL returns all the available columns in the dim_total_orders_product_type model, namely, product_type_code, product_type_description, and total_orders.

The SQL statement to add a prefix to each column name is as follows:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), prefix="final_") }}
from {{ ref('dim_total_orders_product_type') }}

And the results returned are:

final_product_type_code,final_product_type_description,final_total_orders

The same principle applies to adding a suffix to the returned column names.

The SQL statement to add a prefix to each column name is as follows:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"], suffix="_final") }}
from {{ ref('dim_total_orders_product_type') }}
product_type_code_final,total_orders_final

dbt_utils.date_spine

This second example describes how to consume the dbt_utils.date_spine macro's source code as taken from hub.getdbt.com and generate the SQL code needed to build a date spine.

{{ dbt_utils.date_spine(
datepart="day"
start_date="to_date('01/01/2016', 'mm/dd/yyyy')",
end_date="dateadd(week, 1, current_date)"
)
}}

Once the job runs, this SQL statement uses the dbt_utils.date_spine macro to create a date spine view. It creates a view of days, one row per day, from 01 January 2021 until 1 January 2022.

fct_date_spine_view.sql
{{
config(
materialized = 'view',
)
}}

{{ dbt_utils.date_spine(
datepart="day",
start_date="to_date('01/01/2021', 'mm/dd/yyyy')",
end_date="to_date('01/01/2022', 'mm/dd/yyyy')"
)
}}

dbt_utils.surrogate_key

A third and last example is a macro to create a surrogate primary key from a list of columns in a model. For instance, let's assume that you want to create a model that returns the number of orders per product type per day, between a start date and an end date from the staging model (stg_product_orders). This model's SQL is as follows:

dim_customer_order_summary.sql
select
product_type_id,
order_date,
count(*) as
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 1, 2

This model does not have a primary key it needs. You can create a primary key using the dbt_utils.surrogate_key macro.

How?

Use the dbt_utils.surrogate_key macro. This macro ingests a list of columns and generates a unique hash for each row that is always the same when the column values are the same.

By adding this macro, the model's SQL is now as follows:

dim_customer_order_summary.sql
select
{{ dbt_utils.surrogate_key(['product_type_id', 'order_date']) }}
product_type_id,
order_date,
count(*) as
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 1, 2

Using package models

As highlighted above, packages can also include models, such as dbt-labs/dbt_utils. Let's expand on this topic by diving into how to access and use the models in this package.

The first step is to add this package to the packages.yml file, as seen in the code snippet below:

/dataops/modelling/packages.yml
packages:
- package: dbt-labs/dbt_utils
version: 0.8.0

Before referencing the models in this package in downstream models, extra steps are required.

Why?

These models could select from or depend on specific sources. Therefore, you need to ensure that these sources are set up and ready to be accessed by the code referencing the models in the specified package.

For instance, one of the steps you need to take is to create a CSV file seed containing Snowflake contract rates:

/dataops/modelling/seeds/snowflake_contract_rates.csv
effective_date, rate
2020-06-01, 2.55
2020-08-01, 2.48

The below example shows the schema test asserts if the column does not have the same value in all rows.

/dataops/modelling/models/not_constant.yml
version: 2

models:
- name: model_name
columns:
- name: column_name
tests:
- dbt_utils.not_constant
/dataops/modelling/macros/not_constant.sql
{% test not_constant(model, column_name) %}
{{ return(adapter.dispatch('test_not_constant', 'dbt_utils')(model, column_name)) }}
{% endtest %}

{% macro default__test_not_constant(model, column_name) %}


select
{# In TSQL, subquery aggregate columns need aliases #}
{# thus: a filler col name, 'filler_column' #}
count(distinct {{ column_name }}) as filler_column

from {{ model }}

having count(distinct {{ column_name }}) = 1


{% endmacro %}

Feature release status badge: PriPrev
PriPrev
Using PyPI packages

For python dbt models, you can define functions that depend on third-party packages if those packages are installed and available to the Python runtime on the data platform.

In this example, let's use the holidays package to determine if a given date is a holiday in France. The code below uses the pandas API for simplicity and consistency across platforms. The exact syntax, and the need to refactor for multi-node processing, still vary.

/dataops/modelling/models/customers/my_python_model.py
import holidays

def is_holiday(date_col):
# Chez Jaffle
french_holidays = holidays.France()
is_holiday = (date_col in french_holidays)
return is_holiday

def model(dbt, session):
dbt.config(
materialized = "table",
packages = ["holidays"]
)

orders_df = dbt.ref("stg_orders")

df = orders_df.to_pandas()

# apply our function
# (columns need to be in uppercase on Snowpark)
df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday)

# return final dataset (Pandas DataFrame)
return df

Feature release status badge: PriPrev
PriPrev
Configuring packages

We encourage you to configure the required packages and versions so dbt can track them in project metadata. This configuration is required for the implementation on some platforms. If you need specific versions of packages, specify them.

/dataops/modelling/models/customers/my_python_model.py
def model(dbt, session):
dbt.config(
packages = ["numpy==1.23.1", "scikit-learn"]
)
/dataops/modelling/models/customers/config.yml
version: 2

models:
- name: my_python_model
config:
packages:
- "numpy==1.23.1"
- scikit-learn