MATE Best Practice Recommendations
Review the following best practice recommendations for using MATE, the SQL-based database modeling and testing engine built on top of the dbt framework:
- Where should I create my tests?
- How do I test composite keys for uniqueness?
- How do I update and delete data in my models?
- How do I use @ in dbt model selection?
Where should I create my tests?
The standard DataOps 4-layer model is as follows:
- Ingestion
- Curation
- Transformation (calculation)
- Consumption
Based on this information, a commonly asked question is:
"In which layer should I run my tests?"
The answer to this is usually relatively simple and is a DataOps best practice recommendation:
Create your tests as close as possible to the earliest point at which the data can break.
In other words, if a test is going to fail, we want it to fail as quickly/early as possible. There are a few reasons for this:
- The quicker something fails, the sooner you can resolve it.
- Root cause analysis: If something that can be tested at the ingestion layer doesn't get tested, you risk ending up with any number of models failing downstream. As a result, you will have to perform a root cause analysis to determine the underlying failure.
Test in the curation layer
Here is a typical use case scenario that demonstrates how to build and run tests in the curation layer of the DataOps model:
Let's assume you are ingesting data from a stock control system and modeling it in MATE. A good example is the staging model called stg_product_types described in the MATE documentation.
For example, this SQL code snippet builds the stg_product_types model:
with stg_product_types as (
    select
        id as product_type_id,
        code as product_type_code,
        prod_type_desc as product_type_description
    from raw.product_types
)
select * from stg_product_types
It stands to reason that you must test the data in this model for integrity and accuracy. Let's assume that your company manufactures three different product types. Therefore, before using this model in downstream models and transformations, you must ensure that the product type code column contains only valid product type codes and that the description column contains accurate product type descriptions.
As described in the MATE Automated Data Testing doc, you can use the accepted_values test.
For instance:
version: 2
models:
  - name: stg_product_types
    columns:
      - name: product_type_code
        tests:
          - accepted_values:
              values: ["toy_trains", "toy_cars", "toy_airplanes"]
      - name: product_type_description
        tests:
          - accepted_values:
              values: ["Toy Trains", "Toy Cars", "Toy Airplanes"]
You should create the test against the first model that uses the raw.product_types source (typically the curation model).
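Conceptually, the accepted_values test asserts that every value in a column belongs to a fixed set. A minimal Python sketch of the same check (the sample rows are hypothetical, not real data):

```python
# Hypothetical sample of the product_type_code column.
rows = ["toy_trains", "toy_cars", "toy_airplanes", "toy_cars"]

# The set of values the column is allowed to contain.
accepted = {"toy_trains", "toy_cars", "toy_airplanes"}

# The test fails when any value falls outside the accepted set.
violations = [value for value in rows if value not in accepted]
assert not violations, f"unexpected values: {violations}"
```

If a row contained, say, "toy_boats", the violations list would be non-empty and the assertion (like the MATE test) would fail.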
Test in the transformation layer
A second use case is where testing occurs in the transformation (or calculation) layer of the DataOps model.
Let's assume that you are joining together stock and sales order data. Executive management wants to know the total number of orders per product type.
The following code joins the stg_orders and stg_product_types models to build a model (called dim_total_orders_product_type) that returns the total number of orders per product type.
with product_types as (
    select * from {{ ref('stg_product_types') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

final as (
    select
        product_types.product_type_code,
        product_types.product_type_description,
        coalesce(sum(orders.items_ordered), 0) as total_orders
    from product_types
    left join orders
        on orders.product_type = product_types.product_type_code
    group by
        product_types.product_type_code,
        product_types.product_type_description
)

select * from final
We need to test that no column in this model contains null values. We can use the not_null generic test to check each column.
For example:
version: 2
models:
  - name: dim_total_orders_product_type
    columns:
      - name: product_type_code
        tests:
          - not_null
      - name: product_type_description
        tests:
          - not_null
      - name: total_orders
        tests:
          - not_null
It's important to note that this test must be run in the transformation (calculation) layer. It cannot be run earlier in the DataOps model because the sources have not yet been ingested, and the staging models have not been built. If the stg_product_types model has not been tested in the curation layer, any bugs or inconsistencies would be passed downstream to this transformation model.
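Conceptually, the not_null generic test asserts that no row has a null in any of the listed columns. A minimal Python sketch (the sample rows are hypothetical):

```python
# Hypothetical output rows from dim_total_orders_product_type.
rows = [
    {"product_type_code": "toy_cars", "product_type_description": "Toy Cars", "total_orders": 42},
    {"product_type_code": "toy_trains", "product_type_description": "Toy Trains", "total_orders": 0},
]
checked_columns = ("product_type_code", "product_type_description", "total_orders")

# not_null fails if any checked column holds a null (None) value.
nulls = [
    (i, col)
    for i, row in enumerate(rows)
    for col in checked_columns
    if row[col] is None
]
assert not nulls, f"null values found at: {nulls}"
```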
How do I test composite keys for uniqueness?
When testing composite keys for uniqueness, you can't simply check the uniqueness of a concatenation of the columns: any of the columns could be null, resulting in an invalid output. Also, as the table below shows, the column values can interact and make unique combinations look like duplicates:
| Type | FK_ID | concat(Type, FK_ID) | Result |
| --- | --- | --- | --- |
| Foo-1 | 21 | Foo-121 | Great |
| Foo-1 | 22 | Foo-122 | Great |
| Foo-12 | 1 | Foo-121 | Not great |
| Foo-12 | 2 | Foo-122 | Not great |
The combination of Type and FK_ID is unique, but simply concatenating the columns does not work. You can solve the null issue by wrapping ifnull functions around the column names, and solve the interaction issue by putting commas or dashes between the columns.
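The interaction issue can be reproduced in a few lines of Python, using the values from the table above:

```python
# Two distinct (Type, FK_ID) pairs from the table above.
rows = [("Foo-1", 21), ("Foo-12", 1)]

# Naive concatenation collapses both pairs to "Foo-121": a false duplicate.
naive = [type_ + str(fk_id) for type_, fk_id in rows]
assert len(set(naive)) == 1  # looks non-unique, but the pairs ARE unique

# A delimiter between the columns keeps the combinations distinct.
delimited = [f"{type_}|{fk_id}" for type_, fk_id in rows]
assert len(set(delimited)) == len(rows)
```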
However, there is a generic test that does this much more easily: unique_combination_of_columns.
For example:
version: 2
models:
  - name: model_name
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - Type
            - FK_ID
For more information on this test, see https://hub.getdbt.com/fishtown-analytics/dbt_utils/latest/.
How do I update and delete data in my models?
The first part of this recommendation describes what not to do:
Let's assume you have a stored procedure with 4 steps:
- A->B
- Update several rows in B
- Delete several rows from B
- B->C
These steps run in the predefined order when implemented in the stored procedure. However, this does not work the same way in DataOps: how would MATE work out whether to run step 2 or step 3 first? Additionally, because steps 2 and 3 both read from and write to B, they are cyclic, and cyclic dependencies cannot be represented in a DAG (Directed Acyclic Graph).
Therefore, the question is, what is the solution to this challenge?
The straightforward answer is to build a set of MATE SQL models. The workflow is now as follows:
- Model B1 (ephemeral): Select from A
- Model B2 (ephemeral): Join with reference set X to update the requisite rows
- Model C: Join with reference set Y to delete the requisite rows
By moving from a stored procedure to a set of SQL models, we move from imperative to declarative code.
The best practice recommendation is not to use UPDATE or DELETE. Instead, implement additional intermediate steps like B1 and B2 above.
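The B1/B2/C workflow above can be sketched as a chain of pure transformations in Python (the source data and the reference sets X and Y are hypothetical):

```python
# Hypothetical source table A and reference sets.
table_a = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
    {"id": 3, "status": "old"},
]
reference_set_x = {1: "updated"}  # ids to "update", with their new status
reference_set_y = {3}             # ids of rows to "delete"

def model_b1(rows):
    # B1 (ephemeral): plain select from A.
    return [dict(row) for row in rows]

def model_b2(rows):
    # B2 (ephemeral): "update" by deriving new rows instead of mutating B.
    return [
        {**row, "status": reference_set_x.get(row["id"], row["status"])}
        for row in rows
    ]

def model_c(rows):
    # C: "delete" by filtering rows out instead of removing them from B.
    return [row for row in rows if row["id"] not in reference_set_y]

result = model_c(model_b2(model_b1(table_a)))
# result keeps ids 1 (status "updated") and 2 (status "new"); id 3 is filtered out.
```

Each step only reads its input and produces a new output, so the dependency graph stays acyclic, exactly as in the model-based workflow.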
How do I use @ in dbt model selection?
This is confusing in the documentation, which implies that model@
is similar to model+
but just adds the parents of every model selected, i.e., a recursive +
, and that @model
relates to +model
in the same way. In fact, model@
is invalid and will never return any models. @model
selects the model and all of its children (even though the @ is at the start, hence the confusion), and then all the parents of those models.