MATE Automated Data Testing
The value of data testing in MATE lies in the ability to make sure that the models that we build meet our assertions. Not only does MATE contain a set of standard tests as well as the ability to write as many custom test sets as you require, but the data testing process is automated.
MATE also contains a rich library of macros and additional test sets to help you write your tests to ensure that your data models are thoroughly tested to ensure the production data models' accuracy, validity, and trustworthiness.
In practice, MATE automated testing takes a set of tests defined in .yml files under /dataops/modelling/
and executes them as part of a standard/default MATE job. The standard automated testing YAML config file is as follows:
data_testing:
extends:
- .modelling_and_transformation_base
- .agent_tag
variables:
TRANSFORM_ACTION: TEST
stage: "Business Domain Validation Checks"
script:
- /dataops
icon: ${TRANSFORM_ICON}
Generic tests
Generic tests are specified in a YAML file and can be set at a column level.
Additionally, the standard generic tests shipped with dbt comprise the following four types:
- Unique
- Not null
- Accepted values
- Relationships
Unique test
Unique tests determine whether the values in a specific column are unique or not. It is typically described as follows:
version: 2
models:
- name: stg_product_type
columns:
- name: product_type_id
tests:
- unique
When run, this test will check every value in the product_type_id
column and return an error if there are any duplicate values in the stg_product_type
model.
Not null test
Not null (not_null
) tests are similar to the unique test described above, except that they test whether any values in the specified column are null or not. The test passes if there are no null values and fails if at least one null value is in this column.
For example:
version: 2
models:
- name: stg_product_type
columns:
- name: product_type_id
tests:
- not_null
Accepted values test
The accepted values (accepted_values
) test determines whether the specified column values are equal to one of the values in the list that accompanies the test declaration. For instance, this code snippet shows that the accepted_values
test will check all the values in the product_type
column to ensure all the values are in the values list below.
version: 2
models:
- name: stg_product_type
columns:
- name: product_type
tests:
- accepted_values:
values:
- toy_trains
- toy_cars
- toy_dolls
- toy_airplanes
Relationships test
The relationship test relationships
tests the primary - foreign key referential integrity of values in a column.
Wikipedia defines referential integrity as follows:
Referential integrity is a property of data stating that all its references are valid. In the context of relational databases, it requires that if a value of one attribute (column) of a relation (table) references a value of another attribute (either in the same or a different relation), then the referenced value must exist.
To test for referential integrity the test of the relationship is structured as follows:
version: 2
models:
- name: final_stock_levels
columns:
- name: product_type
tests:
- relationships:
to: ref('stg_product_type')
field: product_type
Singular tests
Not only does MATE ship with [generic test types][(#generic-tests)], but it also includes the functionality to write singular (or stand-alone) tests. A singular test is a SQL query stored in a .sql
file.
Let's assume you want to determine whether the stock levels in your stock level table are not negative. To test this determination, your SQL statement will need to look something like this:
select
stock_id
sum(stock_level) as total_stock_per_id
from {{ ref ('stock_levels')}}
group by stock_id
having not(total_stock_per_id < 0)
DataOps package tests
DataOps.live provides the package dataops
, adding model- and column-level tests. The list includes:
- Model-level row tests:
- Model-level compare tests:
- Column-level tests:
Use them together with generic tests, and tests from bundled packages. Example:
version: 2
models:
- name: calc_staff_member
tests:
# Check we have at least 250 rows
- dataops.minimum_rows:
rows: 250
# Check that each row in the Employees table/model made it into this one
- dataops.same_rows_as:
compare_model: ref('cur_employee')
columns:
- name: "JOB_TITLE"
tests:
# Job title should not be none
- not_null
# Expect at least 10 different job titles
- dataops.distinct_count_minimum: 10
minimum_count:
- name: "NAME"
tests:
# Name should not be none
- not_null
# Name should not be the same for everyone
- dbt_utils.not_constant
Note that:
- The
name:
needs to match the model name. - The
tests:
next to the model name apply to the overall model (table or view). - The
tests:
undercolumns
apply to just those columns.
MATE ships with the dataops
package by default, but you need to install this package manually if you want to use a specific version of dbt. See Install the DataOps
package for more information.
DataOps model-level row tests
The following row-level tests are used to test row counts
dataops.exact_rows
exact_rows
tests that the exact number of rows in the given model (table, view) is as expected
- parameter
rows
- number of rows
Usage:
version: 2
models:
- name: my_relation
tests:
# Check for exactly 250 rows
- dataops.exact_rows:
rows: 250
dataops.maximum_rows_v2
maximum_rows_v2
tests that the maximum number of rows in a model does not exceed a given value
- parameter
rows
- number of rows
Usage:
version: 2
models:
- name: my_relation
tests:
# Check for 250 or less rows
- dataops.maximum_rows_v2:
rows: 250
dataops.minimum_rows
minimum_rows
tests that the minimum number of rows in a model is never less than a given value
- parameter
rows
- number of rows
Usage:
version: 2
models:
- name: my_relation
tests:
# Check for 250 or more rows
- dataops.minimum_rows:
rows: 250
DataOps model-level compare tests
These tests compare the row count for a specific model against its reference model:
dataops.fewer_rows_than_or_equal
fewer_rows_than_or_equal
tests that the number of rows in a model (table, view) must be less than or equal to a reference model
- parameter
compare_model
- relation to another model or source
Usage:
version: 2
models:
- name: my_relation
tests:
- dataops.fewer_rows_than_or_equal:
compare_model: ref('other_relation')
dataops.fewer_rows_than
fewer_rows_than
tests that the number of rows in a model (table, view) must be less than a reference model
- parameter
compare_model
- relation to another model or source
Usage:
version: 2
models:
- name: my_relation
tests:
- dataops.fewer_rows_than:
compare_model: ref('other_relation')
dataops.same_rows_as
same_rows_as
tests that the number of rows in a model (table, view) must be the same as a reference model
- parameter
compare_model
- relation to another model or source
Usage:
version: 2
models:
- name: my_relation
tests:
- dataops.same_rows_as:
compare_model: ref('other_relation')
DataOps column-level tests
dataops.distinct_count_maximum
distinct_count_maximum
tests the maximum number of distinct values in the specified column
- parameter
maximum_count
- the count
Usage:
version: 2
models:
- name: my_relation
columns:
- name: COL_A
tests:
- dataops.distinct_count_maximum:
maximum_count: 250
dataops.distinct_count_minimum
distinct_count_minimum
tests the minimum number of distinct values in the specified column
- parameter
minimum_count
- the count
Usage:
version: 2
models:
- name: my_relation
columns:
- name: COL_A
tests:
- dataops.distinct_count_minimum:
minimum_count: 250
dataops.max_value_equals
max_value_equals
tests the maximum value of a column to be equal to the parameter
- parameter
check_value
- the exact value to match
Usage:
version: 2
models:
- name: my_relation
columns:
- name: COL_A
tests:
- dataops.max_value_equals:
check_value: 250
dataops.test_sql_condition
test_sql_condition
tests a column against a given error condition (WHERE statement syntax)
- parameter
error_condition
- SQL WHERE expression without the WHERE matching errors
Usage:
version: 2
models:
- name: my_relation
columns:
- name: COL_A
tests:
# match all distinct 3 character values as errors
- dataops.test_sql_condition:
error_condition: value_field like '___'
Tips
Combining multiple tests
It is important to note that you can specify more than one test for a particular column. The following code snippet demonstrates this:
version: 2
models:
- name: stg_product_type
columns:
- name: product_type_id
tests:
- unique
- not_null
When run, this script will first test whether all the values in the product_type_id
column are unique, and then it will test whether all the values in the product_type_id
column are not null.
Single test YAML files
MATE includes a single test definition .yml
file that can cover many models; however, we recommend using a single .yml file for each model or SQL file. For example: