DataOps Best Practice Recommendations

The DataOps best practice recommendations are divided into the following categories:

Testing

This category provides answers to the following best-practice questions:

Where should I create my tests?

The standard DataOps 4-layer model is as follows:

  • Ingestion
  • Curation
  • Transformation (calculation)
  • Consumption

Based on this information, a commonly asked question is:

"In which layer should I run my tests?"

The answer to this is usually relatively simple and is a DataOps best practice recommendation:

Create your test as close as possible to the earliest point data can break.

In other words, if a test is going to fail, we want it to fail as quickly/early as possible. There are a few reasons for this:

  • The quicker something fails, the sooner you can resolve it.
  • Root cause analysis: If something that can be tested at the ingestion layer doesn't get tested, you risk ending up with any number of models failing downstream. As a result, you will have to perform a root cause analysis to determine where the underlying failure is.
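As a sketch, testing at the earliest possible point could be as simple as a singular test that queries the raw source directly, before any downstream model is built (the table and column names here are illustrative, not part of the standard model):

```sql
-- Hypothetical singular test: returns the rows that would make the test fail.
-- Running this against the raw source catches missing product type codes at
-- the ingestion layer, long before any downstream model depends on them.
select *
from raw.product_types
where code is null
```

If this query returns any rows, the test fails at ingestion, and no root cause analysis across downstream models is needed.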

Test in the Curation Layer

Here is a typical use case scenario that demonstrates how to build and run tests in the curation layer of the DataOps model:

Let's assume you are ingesting data from a stock control system and modeling it in MATE. A good example of this is the staging model called stg_product_types described in the MATE documentation.

For example, this SQL code snippet builds the stg_product_types model:

stg_product_types.sql

```sql
with stg_product_types as (

    select
        id as product_type_id,
        code as product_type_code,
        prod_type_desc as product_type_description
    from raw.product_types

)

select * from stg_product_types
```

It stands to reason that you must test the data in this model for veracity and accuracy. Let's assume that your company manufactures three different product types. Before using this model in downstream models and transformations, you must therefore ensure that the product_type_code field contains only valid product type codes and that the product_type_description field contains accurate product type descriptions.

As described in the MATE Automated Data Testing documentation, you can use the accepted_values test.

For instance:

```yaml
version: 2

models:
  - name: stg_product_types
    columns:
      - name: product_type_code
        tests:
          - accepted_values:
              values: ["toy_trains", "toy_cars", "toy_airplanes"]
      - name: product_type_description
        tests:
          - accepted_values:
              values: ["Toy Trains", "Toy Cars", "Toy Airplanes"]
```
Note: The test should be created against the first model that uses the raw.product_types source (typically the curation model).

Test in the Transformation layer

A second use case is where testing occurs in the transformation (or calculation) layer of the DataOps model.

Let's assume that you are joining together stock and sales order data. Executive management wants to know the total number of orders per product type.

The following code joins the stg_orders and stg_product_types models to build a model (called dim_total_orders_product_type) that returns the total number of orders per product type.

dim_total_orders_product_type.sql

```sql
with product_types as (
    select * from {{ ref('stg_product_types') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),

final as (
    select
        product_types.product_type_code,
        product_types.product_type_description,
        coalesce(sum(orders.items_ordered), 0) as total_orders
    from product_types
    left join orders
        on orders.product_type = product_types.product_type_code
    group by 1, 2
)

select * from final
```

We need to test that none of the values in this model are null. We can use the not_null generic test to check each of its columns.

For example:

```yaml
version: 2

models:
  - name: dim_total_orders_product_type
    columns:
      - name: product_type_code
        tests:
          - not_null
      - name: product_type_description
        tests:
          - not_null
      - name: total_orders
        tests:
          - not_null
```

It's important to note that this test must be run in the transformation (calculation) layer. It cannot be run earlier in the DataOps model because the sources have not yet been ingested, and the staging models have not been built. If the stg_product_types model has not been tested in the curation layer, any bugs or inconsistencies would be passed downstream to this transformation model.

How do I test composite keys for uniqueness?

When testing composite keys for uniqueness, you can't simply check the uniqueness of a concatenation of the columns, because any of the columns could be null, resulting in an invalid output. Also, as seen in the table below, the column values can interact and make it seem like combinations aren't unique when they are:

| Type   | FK_ID | concat(Type, FK_ID) |           |
| ------ | ----- | ------------------- | --------- |
| Foo-1  | 21    | Foo-121             | Great     |
| Foo-1  | 22    | Foo-122             | Great     |
| Foo-12 | 1     | Foo-121             | Not great |
| Foo-12 | 2     | Foo-122             | Not great |

The combination of Type and FK_ID is unique, but just concatenating them together does not work. You can solve the null issue by wrapping ifnull functions around the column names and solve the interaction issue by putting commas or dashes between columns.
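The manual approach just described could be sketched as a singular test along the following lines (the model and column names are illustrative, and the delimiter choice assumes it does not itself appear ambiguously in the data):

```sql
-- Hypothetical singular test: build a null-safe, delimited composite key and
-- return any key that occurs more than once (returned rows = test failure).
-- ifnull prevents a null column from nulling out the whole concatenation;
-- the '|' delimiter keeps distinct combinations such as ('Foo-1', '21') and
-- ('Foo-12', '1') from colliding.
select
    ifnull(Type, '') || '|' || ifnull(FK_ID, '') as composite_key,
    count(*) as occurrences
from {{ ref('model_name') }}
group by composite_key
having count(*) > 1
```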

However, there is a generic test that does this much more easily: dbt_utils.unique_combination_of_columns.

For example:

```yaml
version: 2

models:
  - name: model_name
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - Type
            - FK_ID
```

For more information on this test, navigate to https://hub.getdbt.com/fishtown-analytics/dbt_utils/latest/.

Modeling

This category includes the following best practice recommendations:

How do I update and delete data in my models?

The first part of this recommendation describes what not to do:

Let's assume you have a stored procedure with 4 steps:

  1. A->B
  2. Update several rows in B
  3. Delete several rows from B
  4. B->C

These steps run in the predefined order when implemented in the stored procedure. However, this does not work the same way in DataOps. How would MATE work out whether to run step 2 or step 3 first? Additionally, because steps 2 and 3 both read from and write to B, they are cyclic, and cycles are not permitted in a DAG (Directed Acyclic Graph).

So what is the solution to this challenge?

The straightforward answer is to build a set of MATE SQL models. The workflow is now as follows:

  • Model B1 (ephemeral): Select from A
  • Model B2 (ephemeral): Join with reference set X to update the requisite rows
  • Model C: Join with reference set Y to delete the requisite rows
Note: Moving from a stored procedure to a set of SQL models means moving from imperative to declarative code.

The best practice recommendation is not to use UPDATE or DELETE. Instead, implement additional intermediate steps like B1 and B2 above.
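The B1 → B2 → C workflow above could be sketched as three dbt-style models; the model names, reference sets, and columns below are illustrative assumptions, not fixed conventions:

```sql
-- b1.sql (ephemeral): a plain select from A
select * from {{ ref('a') }}

-- b2.sql (ephemeral): the "update" expressed as a left join against
-- reference set X; rows with a match take the replacement value
select
    b1.id,
    coalesce(x.new_value, b1.value) as value
from {{ ref('b1') }} as b1
left join {{ ref('reference_set_x') }} as x
    on x.id = b1.id

-- c.sql: the "delete" expressed as an anti-join against reference set Y;
-- rows listed in Y simply never reach C
select b2.*
from {{ ref('b2') }} as b2
where not exists (
    select 1
    from {{ ref('reference_set_y') }} as y
    where y.id = b2.id
)
```

Each model is a pure select, so the B1 → B2 → C ordering is unambiguous in the DAG and no UPDATE or DELETE statements are required.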

How do I use @ in dbt model selection?

The documentation is confusing here. It implies that model@ is similar to model+ but also adds the parents of every model selected, i.e., a recursive +, and likewise that @model is similar to +model. In fact, model@ is invalid and will never return any models. @model (even though the @ is at the start, hence the confusion) selects the model, all of its children, and then all of the parents of those children.