MATE Packages
Bundled packages
The DataOps Template Rendering topic notes that packages are integral to the MATE ecosystem. They are vital for importing macros, models, and other logic into a DataOps project. An excellent example of a package is dbt_utils, a library of open-source macros you can use and reuse across DataOps MATE projects. Another vital package used in MATE transformations is the dataops package. MATE ships with the dbt_utils and dataops packages by default.
The full set of bundled MATE packages depends on the dbt version you use (dbt 1.8, 1.7, 1.5, or 1.4).
For dbt 1.7 and 1.8:
- Datavault-UK/automate_dv (version: 0.11.0)
- tnightengale/dbt_meta_testing (version: 0.3.6)
- dbt-labs/dbt_utils (version: 1.3.0)
- data-mie/dbt_profiler (version: 0.8.2)
The package Datavault-UK/dbtvault has been renamed to Datavault-UK/automate_dv in dbt 1.7. To prevent potential failures, update your usage of the package by replacing dbtvault with automate_dv. For example:
Previous configuration:
{%- set source_model = "v_stg_orders" -%}
{%- set src_pk = "ORDER_PK" -%}
{%- set src_nk = "ORDERKEY" -%}
{%- set src_ldts = "LOAD_DATE" -%}
{%- set src_source = "RECORD_SOURCE" -%}
{{ dbtvault.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
src_source=src_source, source_model=source_model) }}
New configuration:
{%- set source_model = "v_stg_orders" -%}
{%- set src_pk = "ORDER_PK" -%}
{%- set src_nk = "ORDERKEY" -%}
{%- set src_ldts = "LOAD_DATE" -%}
{%- set src_source = "RECORD_SOURCE" -%}
{{ automate_dv.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
src_source=src_source, source_model=source_model) }}
For dbt 1.5:
- Datavault-UK/dbtvault (version: 0.9.5)
- tnightengale/dbt_meta_testing (version: 0.3.6)
- dbt-labs/dbt_utils (version: 1.2.0)
- data-mie/dbt_profiler (version: 0.8.2)
When upgrading your dbt version, breaking changes may affect specific existing setups:
- If you move from dbt 1.4 to 1.5, MATE jobs could fail if their configuration contains an empty tests: keyword in your YAML file. You can resolve this by removing the empty tests: keyword.
Check dbt migration for detailed guidance to help you resolve any potential issues.
For dbt 1.4:
- Datavault-UK/dbtvault (version: 0.9.4)
- tnightengale/dbt_meta_testing (version: 0.3.6)
- dbt-labs/dbt_utils (version: 1.2.0)
- data-mie/dbt_profiler (version: 0.8.2)
We strongly advise upgrading to the latest dbt version. However, when upgrading from dbt 1.4 to 1.5, MATE jobs could fail if their configuration contains an empty tests: keyword in your YAML file. You can resolve this by removing the empty tests: keyword.
Check dbt migration for detailed guidance to help you resolve any potential issues.
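For illustration, here is a minimal sketch of the failing pattern; the model and column names are hypothetical. A model YAML entry like this, with an empty tests: keyword, triggers the dbt 1.5 failure:

version: 2
models:
  - name: my_model
    columns:
      - name: id
        tests:

Removing the empty tests: line (or listing at least one test beneath it) resolves the error.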
Many packages are available on hub.getdbt.com.
Installing additional packages
If the bundled MATE packages are insufficient, it is possible to install additional ones. To do this, you must modify the packages.yml file in the project's dataops/modelling directory.
Specify a package using one of the following methods depending on where your package is stored.
Prerequisites
To install additional MATE packages successfully:
- Add the variable TRANSFORM_FORCE_DEPS: 1 to the file pipelines/includes/config/variables.yml. Adding the variable here makes it available to all MATE jobs (see the example after this list).
- If you created your project from the DataOps Template Project and the file dataops/modelling/packages.yml contains the line local: ../../../modelling-and-transformation-libs, remove that line. Otherwise, you may get a compilation error when installing the packages.
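As a sketch, assuming the standard top-level variables: block used in the template project's variables file, the entry looks like this:

variables:
  TRANSFORM_FORCE_DEPS: 1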
dbt-Hub packages
dbt Hub is a registry for dbt packages. Install the packages listed on dbt Hub using the syntax:
packages:
  - package: calogica/dbt_expectations
    version: 0.5.6
Git packages
Public repository
Install the packages stored on a Git server using the syntax:
packages:
  - git: "https://github.com/dbt-labs/dbt-codegen.git"
    revision: 0.8.1
Private repository
This method lets you clone private packages via HTTPS by passing in a Git token via an environment variable.
packages:
  # use this format when accessing your repository via a personal access token
  - git: "https://{{env_var('GITHUB_USERNAME')}}:{{env_var('GIT_CREDENTIAL')}}@github.com/<your_repository_path>.git" # git HTTPS URL
Read more about creating a GitHub Personal Access token.
Project subdirectories
The dbt documentation gives a great example of installing packages from repository sub-directories.
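For reference, a minimal sketch of such an entry uses dbt's subdirectory key for Git packages; the repository, revision, and sub-directory values below are illustrative:

packages:
  - git: "https://github.com/dbt-labs/dbt-labs-experimental-features"
    subdirectory: "materialized-views"
    revision: 0.0.1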
Local packages
Install the packages that you have stored locally by specifying the path to the project using the syntax:
packages:
  - local: /opt/dbt/foobar # use a local path
When the MATE job runs in the DataOps pipelines, the MATE Orchestrator goes through the packages.yml file and installs all the packages (or dependencies) needed to run all the MATE jobs.
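Putting these methods together, a single packages.yml file can mix dbt Hub, Git, and local entries; the entries below simply reuse the earlier examples for illustration:

packages:
  - package: calogica/dbt_expectations
    version: 0.5.6
  - git: "https://github.com/dbt-labs/dbt-codegen.git"
    revision: 0.8.1
  - local: /opt/dbt/foobar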
Installing the dataops package
DataOps.live provides the dataops package, adding model- and column-level tests. This bundled package is included by default if you use MATE with the supported dbt versions. However, if you want to use a different version of dbt but still want to benefit from the tests or grant management capabilities of the dataops package, you need to install it using the syntax:
packages:
  - local: /macro_libs/modelling_and_transformation_libs
Configuring packages
Add any variables needed by these packages to the dbt_project.yml file. The dbt documentation gives a great example of adding (or configuring) variables for the Snowflake snowplow/snowplow_web package, found on hub.getdbt.com/snowplow/snowplow_web. Here is part of that code snippet:
vars:
  snowplow:
    "snowplow:timezone": "America/New_York"
    "snowplow:page_ping_frequency": 10
    "snowplow:events": "{{ ref('sp_base_events') }}"
    "snowplow:context:web_page": "{{ ref('sp_base_web_page_context') }}"
    "snowplow:context:performance_timing": false
    "snowplow:context:useragent": false
    "snowplow:pass_through_columns": []
Using packages
Now that you know more about MATE packages, the next step is to use one or more of them in your MATE jobs.
Packages don't only contain macros; they can also include models, seeds, and all sorts of other helpful MATE logic.
Seeds are CSV files stored in the MATE project's seeds directory under dataops/modelling. They are loaded into the Snowflake data warehouse by a MATE job and can be referenced in downstream models using the ref function.
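As a minimal sketch (the project and seed names here are hypothetical), a seed dropped into that directory can be configured in the dbt_project.yml file and is then available to ref like any model:

seeds:
  my_project:
    country_codes:
      +column_types:
        country_code: varchar(2)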
However, macros and models are the MATE package's two most valuable and well-used components. As a result, let's take a deep dive into using macros and models contained within these packages.
Using package macros
One of the most widely used packages in MATE is dbt_utils.
dbt_utils.star
dbt_utils.star is a very helpful macro. Let's look at its function and purpose, as well as an example of how to use it in your MATE models.
According to the dbt_utils documentation, dbt_utils.star does the following:
It generates a comma-separated list of all fields in the from relation, excluding any listed in the except argument.
This macro takes the following arguments:
| Argument | Required/Optional | Description |
|---|---|---|
| from | Required | A ref or source value that holds the list of columns you want to select from |
| except | Optional, default = [] | Lists the columns you wish to exclude |
| relation_alias | Optional, default = '' | Adds a prefix to all the generated fields with an alias (relation_alias.field_name) |
| prefix | Optional, default = '' | Adds a prefix to the output field_name (field_name as prefix_field_name) |
| suffix | Optional, default = '' | Adds a suffix to the output field_name (field_name as field_name_suffix) |
And you can use it following the syntax:
select
{{ dbt_utils.star(from=ref('my_model'), except=["exclude_field_1", "exclude_field_2"], prefix="max_") }}
from {{ ref('my_model') }}
To expand on the dbt_utils.star macro detail, let's look at a practical use case. Assume you have the following models (stg_orders and stg_product_types), as seen in the Building MATE Models documentation.
stg_orders as (
    select
        id as order_id,
        prod_type_code as product_type,
        items as items_ordered,
        date as order_date
    from
        raw.stock_orders
)
stg_product_types as (
    select
        id as product_type_id,
        code as product_type_code,
        prod_type_desc as product_type_description
    from
        raw.product_types
)
Use these staging models to build a downstream model that shows the number of orders per product type between a start and an end date:
1. Declare the start_date and end_date variables in the dbt_project.yml file. See What is modeling and transformation? for more information on the dbt_project.yml file. The following code snippet shows you how to add the start_date and end_date variables to the dbt_project.yml file:

dbt_project.yml

name: my_dbt_project
version: 1.0.0

# Define variables here
vars:
  start_date: 01 January 2021
  end_date: 31 December 2021
2. Use the following SQL statement to build the dim_total_orders_product_type model:

with product_types as (
    select * from {{ ref('stg_product_types') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

final as (
    select
        product_types.product_type_code as product_type_code,
        product_types.product_type_description as product_type_description,
        coalesce(sum(orders.items_ordered), 0) as total_orders
    from product_types
    left join orders on orders.product_type = product_types.product_type_code
    where orders.order_date between '{{ var("start_date") }}' and '{{ var("end_date") }}'
    group by product_type_code, product_type_description
)

select * from final
3. To return a comma-separated list of all the fields in the dim_total_orders_product_type model, call the dbt_utils.star macro. The code is as follows:

select
    {{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"]) }}
from {{ ref('dim_total_orders_product_type') }}
Once the MATE job runs, dbt_utils.star executes, and the output returned is:
product_type_code,total_orders
As noted in this macro's description, the different arguments return different outputs. Let's look at a few of the available combinations and their results:
select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type')) }}
from {{ ref('dim_total_orders_product_type') }}
This SQL returns all the available columns in the dim_total_orders_product_type model, namely product_type_code, product_type_description, and total_orders.
The SQL statement to add a prefix to each column name is as follows:
select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), prefix="final_") }}
from {{ ref('dim_total_orders_product_type') }}
And the results returned are:
final_product_type_code,final_product_type_description,final_total_orders
The same principle applies to adding a suffix to the returned column names.
The SQL statement to add a suffix to each column name is as follows:
select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"], suffix="_final") }}
from {{ ref('dim_total_orders_product_type') }}
And the results returned are:
product_type_code_final,total_orders_final
dbt_utils.date_spine
This second example describes how to use the dbt_utils.date_spine macro, documented on hub.getdbt.com, to generate the SQL code needed to build a date spine.
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="to_date('01/01/2016', 'mm/dd/yyyy')",
    end_date="dateadd(week, 1, current_date)"
)
}}
Once the job runs, the following SQL statement uses the dbt_utils.date_spine macro to create a date spine view: one row per day, from 01 January 2021 until 01 January 2022.
{{
    config(
        materialized = 'view',
    )
}}

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="to_date('01/01/2021', 'mm/dd/yyyy')",
    end_date="to_date('01/01/2022', 'mm/dd/yyyy')"
)
}}
dbt_utils.surrogate_key
A third and last example is a macro that creates a surrogate primary key from a list of columns in a model. For instance, assume you want to create a model that returns the number of orders per product type per day, between a start date and an end date, from the staging model stg_product_orders. This model's SQL is as follows:
select
    product_type_id,
    order_date,
    count(*) as total_orders
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 1, 2
This model does not have the primary key it needs.
How do you create one?
Use the dbt_utils.surrogate_key macro. This macro ingests a list of columns and generates a hash for each row that is always the same when the column values are the same.
By adding this macro, the model's SQL is now as follows:
select
    {{ dbt_utils.surrogate_key(['product_type_id', 'order_date']) }} as surrogate_key,
    product_type_id,
    order_date,
    count(*) as total_orders
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 1, 2, 3
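If you also want to assert that the generated key is unique, you could add standard schema tests on it. A minimal sketch, assuming the model is saved as orders_per_product_type_per_day (a hypothetical name) and the macro output is aliased surrogate_key as above:

version: 2
models:
  - name: orders_per_product_type_per_day
    columns:
      - name: surrogate_key
        tests:
          - unique
          - not_null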
Using package models
As highlighted above, packages such as dbt-labs/dbt_utils can also include models. Let's expand on this topic by diving into how to access and use the models in this package.
The first step is to add this package to the packages.yml file, as seen in the code snippet below:
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
Before referencing the models in this package in downstream models, extra steps are required.
Why?
These models could select from or depend on specific sources. Therefore, you need to ensure that these sources are set up and ready to be accessed by the code referencing the models in the specified package.
For instance, one of the steps you need to take is to create a CSV file seed containing Snowflake contract rates:
effective_date, rate
2020-06-01, 2.55
2020-08-01, 2.48
The example below shows a schema test, dbt_utils.not_constant, which asserts that a column does not have the same value in all rows:
version: 2

models:
  - name: model_name
    columns:
      - name: column_name
        tests:
          - dbt_utils.not_constant
Under the hood, the dbt_utils package defines this test as follows:

{% test not_constant(model, column_name) %}
{{ return(adapter.dispatch('test_not_constant', 'dbt_utils')(model, column_name)) }}
{% endtest %}

{% macro default__test_not_constant(model, column_name) %}

select
    {# In TSQL, subquery aggregate columns need aliases #}
    {# thus: a filler col name, 'filler_column' #}
    count(distinct {{ column_name }}) as filler_column
from {{ model }}
having count(distinct {{ column_name }}) = 1

{% endmacro %}
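If you prefer this test to raise a warning instead of failing the pipeline, dbt's standard test configuration can be added to the same YAML entry. A minimal sketch, reusing the model and column names from the example above:

version: 2
models:
  - name: model_name
    columns:
      - name: column_name
        tests:
          - dbt_utils.not_constant:
              config:
                severity: warn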
Using PyPI packages (Public Preview)
For Python dbt models, you can define functions that depend on third-party packages, provided those packages are installed and available to the Python runtime on your data platform.
In this example, let's use the holidays package to determine whether a given date is a holiday in France. The code below uses the pandas API for simplicity and consistency across platforms. The exact syntax, and the need to refactor for multi-node processing, still vary by platform.
Snowpark:
import holidays

def is_holiday(date_col):
    # Chez Jaffle
    french_holidays = holidays.France()
    is_holiday = (date_col in french_holidays)
    return is_holiday

def model(dbt, session):
    dbt.config(
        materialized = "table",
        packages = ["holidays"]
    )

    orders_df = dbt.ref("stg_orders")

    df = orders_df.to_pandas()

    # apply our function
    # (columns need to be in uppercase on Snowpark)
    df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday)

    # return final dataset (Pandas DataFrame)
    return df
PySpark:

import holidays

def is_holiday(date_col):
    # Chez Jaffle
    french_holidays = holidays.France()
    is_holiday = (date_col in french_holidays)
    return is_holiday

def model(dbt, session):
    dbt.config(
        materialized = "table",
        packages = ["holidays"]
    )

    orders_df = dbt.ref("stg_orders")

    df = orders_df.to_pandas_on_spark()  # Spark 3.2+
    # df = orders_df.toPandas() in earlier versions

    # apply our function
    df["is_holiday"] = df["order_date"].apply(is_holiday)

    # convert back to PySpark
    df = df.to_spark()  # Spark 3.2+
    # df = session.createDataFrame(df) in earlier versions

    # return final dataset (PySpark DataFrame)
    return df
Configuring Python packages (Public Preview)
We encourage you to configure the required packages and versions so dbt can track them in project metadata. On some platforms, this configuration is required. If you need specific versions of packages, specify them.
def model(dbt, session):
    dbt.config(
        packages = ["numpy==1.23.1", "scikit-learn"]
    )
version: 2

models:
  - name: my_python_model
    config:
      packages:
        - "numpy==1.23.1"
        - scikit-learn