MATE Packages

Bundled Packages

The DataOps Template Rendering guide notes that packages are integral to the MATE ecosystem. They are vital for importing macros, models, and other logic into a DataOps project. A good example of a package is dbt_utils, a library of open-source macros that can be used and reused across DataOps MATE projects. Another vital package used in MATE transformations is the dataops package. MATE ships with the packages dbt_utils and dataops by default.

The full set of bundled MATE packages includes:

tip

There are many different packages available on hub.getdbt.com.

Installing Additional Packages

If the bundled MATE packages are not sufficient, you can install additional ones. To do this, modify the packages.yml file in the project's dataops/modelling directory. The default content is:

/dataops/modelling/packages.yml
packages:
  - local: ../dataops-individual-runners/modelling_and_transformation_libs

There are three ways to reference a package in packages.yml:

The first way is to use the package syntax as follows:

/dataops/modelling/packages.yml
packages:
  - local: ../dataops-individual-runners/modelling_and_transformation_libs

  - package: calogica/dbt_expectations
    version: 0.5.6

The second way is to reference the package's source repository directly in packages.yml, using the git syntax instead of the package syntax:

/dataops/modelling/packages.yml
packages:
  - local: ../dataops-individual-runners/modelling_and_transformation_libs

  - git: "https://github.com/guy-adams/dbtvault.git"
    revision: dataops-5

The third way is to install local sub-directories or parent directories as projects. For instance, the MATE package modelling_and_transformation_libs is included in all projects this way. The following code snippet shows how to add a package located in a project_libs folder:

/dataops/modelling/packages.yml
packages:
  - local: ../dataops-individual-runners/modelling_and_transformation_libs

  - local: project_libs

When a MATE job runs in a DataOps pipeline, the MATE Orchestrator goes through the packages.yml file and installs all the packages (or dependencies) needed to run the MATE jobs.

Configure Packages

We must add any variables needed by these packages to the dbt_project.yml file. The dbt docs provide a good example of adding (or configuring) variables for the snowplow/snowplow_web package, found at hub.getdbt.com/snowplow/snowplow_web.

Here is part of that code snippet:

/dataops/modelling/dbt_project.yml
vars:
  snowplow:
    "snowplow:timezone": "America/New_York"
    "snowplow:page_ping_frequency": 10
    "snowplow:events": "{{ ref('sp_base_events') }}"
    "snowplow:context:web_page": "{{ ref('sp_base_web_page_context') }}"
    "snowplow:context:performance_timing": false
    "snowplow:context:useragent": false
    "snowplow:pass_through_columns": []

Using Packages

Now that we understand what MATE packages are, the next step is to use one or more packages in your MATE jobs.

Packages do not only contain macros; they can also include models, seeds, and all sorts of other helpful MATE logic.

Seeds are CSV files stored in the MATE project's seeds directory, under dataops/modelling. They are loaded into the Snowflake data warehouse during a MATE job and can be referenced in downstream models using the ref function.
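
For example, assuming a hypothetical seed file named country_codes.csv (with columns country_code and country_name), a downstream model could reference it exactly like a model:

-- country_codes is a hypothetical seed (a CSV file in dataops/modelling/seeds)
-- Once loaded, a seed is referenced with ref(), just like a model
select
    country_code,
    country_name
from {{ ref('country_codes') }}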

That said, macros and models are the two most valuable and most widely used components of a MATE package, so let's take a deeper look at using both.

Using Package Macros

One of the most widely used packages in MATE is dbt_utils.

dbt_utils.star

Its most frequently used macro is dbt_utils.star. Let's look at this macro's function and purpose, as well as an example of how to use it in your MATE models.

According to the dbt_utils docs, the dbt_utils.star macro performs the following function:

It generates a comma-separated list of all fields that exist in the from relation, excluding any fields listed in the except argument.

Moreover, this macro takes the following arguments:

Argument | Required/Optional | Description
from | Required | A ref or source that contains the list of columns you want to select from
except | Optional, default = [] | The columns you wish to exclude
relation_alias | Optional, default = '' | Prefixes all generated fields with an alias (relation_alias.field_name)
prefix | Optional, default = '' | Prefixes each output field_name (field_name as prefix_field_name)
suffix | Optional, default = '' | Suffixes each output field_name (field_name as field_name_suffix)

And it is used in the following way:

select
{{ dbt_utils.star(from=ref('my_model'), except=["exclude_field_1", "exclude_field_2"], prefix="max_") }}
from {{ ref('my_model') }}

To expand on the dbt_utils.star macro, let's look at a practical example. Assume we have the following models (stg_orders and stg_product_types), as seen in the Building MATE Models doc:

stg_orders.sql
stg_orders as (
select
id as order_id,
prod_type_code as product_type,
items as items_ordered,
date as order_date
from
raw.stock_orders
)
stg_product_types.sql
stg_product_types as (
select
id as product_type_id,
code as product_type_code,
prod_type_desc as product_type_description
from
raw.product_types
)

We then use these staging models to build a downstream model that calculates the number of orders per product type between a start date and an end date.

tip

You must declare the start_date and end_date variables in the dbt_project.yml file. Navigate to MATE Fundamentals: What is Modelling and Transformation? for more information on the dbt_project.yml file.

The following code snippet shows how the start_date and end_date variables are added to the dbt_project.yml file:

dbt_project.yml
name: my_dbt_project
version: 1.0.0

# Define variables here
vars:
  start_date: "2021-01-01"
  end_date: "2021-12-31"

This SQL statement builds the dim_total_orders_product_type model:

with product_types as (
    select * from {{ ref('stg_product_types') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

final as (
    select
        product_types.product_type_code as product_type_code,
        product_types.product_type_description as product_type_description,
        coalesce(sum(orders.items_ordered), 0) as total_orders
    from product_types
    left join orders
        on orders.product_type = product_types.product_type_code
    where orders.order_date between '{{ var("start_date") }}' and '{{ var("end_date") }}'
    group by 1, 2
)

select * from final

Lastly, to return a comma-separated list of all the fields in the dim_total_orders_product_type model, excluding product_type_description, we call the dbt_utils.star macro. The code is as follows:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"]) }}
from {{ ref('dim_total_orders_product_type') }}

When dbt_utils.star runs, the output returned is:

product_type_code,total_orders

As noted in this macro's description, the different arguments return different outputs. Therefore, for the sake of completeness, let's look at a few of the available combinations and their results:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type')) }}
from {{ ref('dim_total_orders_product_type') }}

This SQL returns all the available columns in the dim_total_orders_product_type model; namely, product_type_code, product_type_description, and total_orders.

The next SQL statement adds a prefix to each column name:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), prefix="final_") }}
from {{ ref('dim_total_orders_product_type') }}

And the results returned are:

final_product_type_code,final_product_type_description,final_total_orders

Lastly, the same principle applies to adding a suffix to the returned column names. The SQL statement to add a suffix to each column name is as follows:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), except=["product_type_description"], suffix="_final") }}
from {{ ref('dim_total_orders_product_type') }}

And the results returned are:

product_type_code_final,total_orders_final
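
The relation_alias argument, in turn, qualifies each generated field with a table alias, which is handy when the model is joined under that alias. A minimal sketch, reusing the same dim_total_orders_product_type model:

select
{{ dbt_utils.star(from=ref('dim_total_orders_product_type'), relation_alias='totals') }}
from {{ ref('dim_total_orders_product_type') }} as totals

Here each generated field is qualified as totals.field_name.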

dbt_utils.date_spine

This second example shows how to use the dbt_utils.date_spine macro, as documented on hub.getdbt.com, to generate the SQL needed to build a date spine.

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="to_date('01/01/2016', 'mm/dd/yyyy')",
    end_date="dateadd(week, 1, current_date)"
   )
}}

The following SQL statement uses the dbt_utils.date_spine macro to create a date spine view: one row per day, from 1 January 2021 until 1 January 2022.

fct_date_spine_view.sql
{{
    config(
        materialized = 'view',
    )
}}

{{ dbt_utils.date_spine(
    datepart="day",
    start_date="to_date('01/01/2021', 'mm/dd/yyyy')",
    end_date="to_date('01/01/2022', 'mm/dd/yyyy')"
   )
}}
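
As an illustration of how such a date spine might be used, the view could be left joined to the stg_orders model from earlier so that days with no orders still appear with a zero total (date_day is the column name dbt_utils.date_spine generates when datepart is "day"):

-- Sketch only: joins the generated date spine to the stg_orders model defined above
with date_spine as (
    select date_day from {{ ref('fct_date_spine_view') }}
),

orders as (
    select order_date, items_ordered from {{ ref('stg_orders') }}
)

select
    date_spine.date_day,
    coalesce(sum(orders.items_ordered), 0) as total_items_ordered
from date_spine
left join orders on orders.order_date = date_spine.date_day
group by 1
order by 1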

dbt_utils.surrogate_key

A third and last example is a macro to create a surrogate primary key from a list of columns in a model. For instance, let's assume that we want to create a model that returns the number of orders per product type per day between a start date and end date from the staging model (stg_product_orders). This model's SQL is as follows:

dim_customer_order_summary.sql
select
    product_type_id,
    order_date,
    count(*) as total_orders
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 1, 2

This model does not have a primary key, but it needs one.

How do we create it?

The succinct answer is to use the dbt_utils.surrogate_key macro. This macro takes a list of columns and generates a hash for each row that is always the same whenever the column values are the same.

By adding this macro, our model's SQL is now as follows:

dim_customer_order_summary.sql
select
    {{ dbt_utils.surrogate_key(['product_type_id', 'order_date']) }} as order_summary_key,
    product_type_id,
    order_date,
    count(*) as total_orders
from {{ ref('stg_product_orders') }}
where order_date between to_date('01/01/2022', 'mm/dd/yyyy') and to_date('01/07/2022', 'mm/dd/yyyy')
group by 2, 3
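
To confirm the hash really is unique per product_type_id and order_date combination, a quick check such as the following (using the illustrative order_summary_key alias from the model above) should return no rows:

-- Any row returned here means two rows share the same surrogate key
select
    order_summary_key,
    count(*) as occurrences
from {{ ref('dim_customer_order_summary') }}
group by 1
having count(*) > 1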

Using Package Models

As highlighted above, packages can also include models, such as gitlabhq/snowflake_spend. Let's expand on this topic by diving into how to access and use the models in this package.

The first step is to add this package to the packages.yml file, as seen in the code snippet below:

/dataops/modelling/packages.yml
packages:
  - local: ../dataops-individual-runners/modelling_and_transformation_libs

  - package: gitlabhq/snowflake_spend
    version: 1.2.0

Extra steps are required before the models in this package can be referenced in downstream models.

Why?

These models select from, or depend on, specific sources. We therefore need to ensure those sources are set up and accessible before the package's models are used.

For instance, one of the required steps is to create a seed CSV file containing Snowflake contract rates:

/dataops/modelling/seeds/snowflake_contract_rates.csv
effective_date, rate
2020-06-01, 2.55
2020-08-01, 2.48

When the MATE Orchestrator installs this package, extra models such as snowflake_spend, snowflake_warehouse_metering, and dbt_models are created. They will be visible in the Snowflake data warehouse and in the auto-generated docs, but the model source code will not be in your MATE project.

The models contained within the gitlabhq/snowflake_spend package must be configured in the dbt_project.yml file. Therefore, the next step is to configure this package as follows:

/dataops/modelling/dbt_project.yml
models:
  snowflake_spend:
    +enabled: true
    +schema: my_snowflake_spend_schema
    xf:
      snowflake_warehouse_monitoring_xf:
        +enabled: true

Once these models have been configured, it's time to use them in our downstream models. Here is a code sample of how to reference these models.

calc_cost_per_credit.sql
select
warehouse_id,
warehouse_name,
start_time,
end_time,
usage_month,
usage_length,
credits_used,
dollars_spent
from {{ ref('usage') }}
order by 1