MATE Exercises
To carry out the exercises in this user guide, you will need the following:
- A new DataOps project, created from the standard template and configured with a runner.
- A Snowflake account with the Snowflake sample data available.
Exercise 1: Create a new feature branch
The first thing we need to do is create a new feature branch called exercises.
1. Navigate to Repository > Branches.
2. Click the New branch button at the top right.
3. Name the branch exercises.
Exercise 2: Run the pipeline and look at the docs
The next thing we need to do is run the pipeline on the new exercises feature branch so that the template content (based on the Snowflake sample data) is built in Snowflake.
1. Open your project's exercises branch in WebIDE and check that the .agent_tag in the file /pipelines/includes/config/agent_tag.yml is configured and the variables are set up (a sketch of this file's typical shape follows these steps).
2. Open /full-ci.yml and comment out the three sample MATE jobs for now:
```yaml
include:
...
## Modelling and transformation jobs
# - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
# - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
# - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
```

3. Commit your changes to exercises and run the full-ci.yml pipeline.
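For reference, here is a minimal sketch of what agent_tag.yml typically looks like. The tag value below is an assumption; use the tag your own runner was registered with.

```yaml
## Sketch of /pipelines/includes/config/agent_tag.yml; the tag value is an
## assumption - substitute the tag your runner registered with.
.agent_tag:
  tags:
    - my-dataops-runner
```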
DataOps automatically generates documentation from your MATE content, even if you do not compile/run any models.
Using the project sidebar, navigate to CI / CD > Pipelines.
Your most recent pipeline (probably the only pipeline, unless you needed to re-run it) will be at the top of the list. Click the documentation 📄 button on the far right of the pipeline row to open the docs in a new tab.
A few things to note about the docs:
- There are several projects listed in the left-hand navigation. The Snowflake sample content from your DataOps project will be under the MyTemplate heading (as this is the default project name at the top of your dbt_project.yml file). The other projects are DataOps and dbt libraries that are included in the build process.
- Clicking the Database tab on the left-hand navigation shows that we have sources from the SNOWFLAKE_SAMPLE_DATA database and that the sample models are in the SAMPLES schema within our production database.
- These docs are built using the dbt docs package, so navigation functionality and much of the content are very similar to the standard dbt docs.
- Docs are generated for every pipeline unless you remove the reference to generate_modelling_and_transformation_documentation.yml from your pipeline file (full-ci.yml); this job originates from the DataOps reference project.
Exercise 3: Test the source tables
After ingestion, the first step in most DataOps pipelines should be to run a comprehensive set of data tests on the ingested data. It is common practice to create dbt sources on all tables that will be used as source data for MATE models, so we can add the tests there.
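To make the pattern concrete, a source definition with column tests looks roughly like the sketch below. The names match the Snowflake sample data, but treat the exact layout as illustrative rather than a copy of the template file.

```yaml
## Illustrative dbt source definition with column tests; the template's
## customer.yml may differ in layout.
version: 2

sources:
  - name: tpch_sf1
    database: SNOWFLAKE_SAMPLE_DATA
    schema: TPCH_SF1
    tables:
      - name: CUSTOMER
        columns:
          - name: C_CUSTKEY
            tests:
              - unique
              - not_null
```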
1. Open WebIDE and check that you are editing the exercises branch.
2. Open the file dataops/modelling/sources/snowflake_sample_data/tpch_sf1/customer.yml and examine the tests against column C_CUSTKEY. These are standard column tests from MATE: unique and not_null.
3. Open the file full-ci.yml and uncomment the job file test_all_sources.yml:
```yaml
include:
...
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
# - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
# - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
```

4. Commit your changes to exercises and run the full-ci.yml pipeline.
5. Open the pipeline view, either by:
   - navigating to CI / CD > Pipelines and clicking the status (should be RUNNING) against the current pipeline; or
   - clicking the link to the pipeline that appears in the lower-left corner of the WebIDE panel a few seconds after committing the changes.
6. Once the pipeline completes, open the Tests tab (you may need to refresh the page) and click through to view the results.
Exercise 4: Build all models
Now that we have well-tested source data (a real project would, of course, have many more tests than this), we can start building models. Fortunately, the sample data included with our project template has a curation layer of simple models that build from the sources we just tested.
Many modern data warehouse implementations use a curation layer (or a similarly named layer) for data cleansing and standardization after the ingestion stage. These tables usually mirror the ingestion tables column-for-column and row-for-row, applying type conversions, cleansing rules, and so on.
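For illustration only, a curation model in this style might look like the following sketch; the column list and cleansing rules here are assumed, not copied from the template.

```sql
-- Illustrative curation model (columns and rules assumed): a
-- column-for-column pass over the source with light standardization.
SELECT
    C_CUSTKEY                 AS C_CUSTKEY,
    TRIM(C_NAME)              AS C_NAME,
    UPPER(TRIM(C_MKTSEGMENT)) AS C_MKTSEGMENT
FROM {{ source('tpch_sf1', 'CUSTOMER') }}
```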
1. Open WebIDE and check that you are editing the exercises branch.
2. Open the file dataops/modelling/models/snowflake_sample_data/curation_customer.sql and examine the model SQL.
3. The project template includes a job to build these models: open the file full-ci.yml and uncomment the job file build_all_models.yml (leave test_all_sources.yml uncommented too):
```yaml
include:
...
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
- /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
# - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
```

4. Commit your changes to exercises and run the full-ci.yml pipeline.
5. Once the pipeline completes, open Snowflake and take a look at the new tables in the SAMPLES schema of your production database.
If you need to check which database DataOps is using, i.e. the name of your production database, check the generated documentation for the pipeline that just ran.
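If you prefer to check directly from a Snowflake worksheet, something like the following works; the database name is a placeholder for your production database.

```sql
-- MY_PROD_DATABASE is a placeholder; substitute your production database.
SHOW TABLES IN SCHEMA MY_PROD_DATABASE.SAMPLES;
```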
Exercise 5: Test all models
OK, so we have built our first proper data warehouse layer. The next step is to test it.
1. Open WebIDE and check that you are editing the exercises branch.
2. Open the file dataops/modelling/models/samples/samples_customer.sql and examine the tests against column C_CUSTKEY. As with the source, we are using the same two standard tests to ensure data integrity after the model build (see the sketch at the end of this exercise).
3. The project template also includes a job to run these tests: open the file full-ci.yml and uncomment the job file test_all_models.yml (leave the previous two job files uncommented too):
```yaml
include:
...
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
- /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
```

4. Commit your changes to exercises and run the full-ci.yml pipeline.
5. Open the pipeline view and examine the additional test results once the run completes.
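For comparison with the source tests in Exercise 3, model tests are declared under a models: key. A minimal sketch (the file name and exact placement in the template are assumptions):

```yaml
## Illustrative declaration of tests on a model.
version: 2

models:
  - name: samples_customer
    columns:
      - name: C_CUSTKEY
        tests:
          - unique
          - not_null
```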
Exercise 6: Create a mart
Once a data warehouse has ingested, cleansed and tested source data, it is time to build some useful data products. For this exercise, we will create a very simple data mart that presents some of the source data in a new layer.
1. Open WebIDE and check that you are editing the exercises branch.
2. Create a file named dataops/modelling/models/mart/mart_orders.sql:

Did you know? If you create the file at the top level and specify a full path, WebIDE will create the intermediate directories for you.
```sql
{{ config(alias='ORDERS') -}}
SELECT
O_ORDERKEY AS "Order Key",
O_CUSTKEY AS "Customer Key",
O_ORDERSTATUS AS "Order Status",
O_TOTALPRICE AS "Total Price",
O_ORDERDATE AS "Order Date",
O_ORDERPRIORITY AS "Order Priority",
O_CLERK AS "Clerk",
O_SHIPPRIORITY AS "Ship Priority",
O_COMMENT AS "Comment",
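-- Correlated subquery: counts the line items belonging to each order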
(SELECT COUNT(*) FROM {{ ref('samples_lineitem') }} WHERE L_ORDERKEY = O_ORDERKEY) AS "Number of Line Items"
FROM {{ ref('samples_orders') }}
```

3. Open the file dataops/modelling/dbt_project.yml and add the mart definition to the models section (final two lines below):

```yaml
models:
...
MyTemplate:
snowflake_sample_data:
+schema: SAMPLES
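    ## New mart layer: these models will be built into their own schema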
mart:
      +schema: MART
```

4. Commit your changes to exercises and run the full-ci.yml pipeline.
5. Take a look at the generated documentation and the new table (and schema) in Snowflake.
Exercise 7: Reorganize the pipeline jobs
You will have noticed that we did not add or change any pipeline jobs in the last exercise, yet the new table was built. This is because the existing Build all Models job is configured to build all models in the project, so any new ones are picked up automatically.
However, that is not what we want. To build one layer at a time (so we can test each layer before building the next), we need separate jobs to build each layer.
1. Open WebIDE and check that you are editing the exercises branch.
2. Open the file pipelines/includes/config/stages.yml and add four new stages after Source Testing:

```yaml
stages:
- Pipeline Initialisation
- Vault Initialisation
- Snowflake Setup
- Additional Configuration
- Data Ingestion
- Source Testing
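## The four new stages added in this exercise: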
- Curation Build
- Curation Testing
- Mart Build
- Mart Testing
- Data Transformation
- Transformation Testing
- Generate Docs
- Clean Up
```

By the way: there is no need to delete any stages right now; if there are no jobs in a stage, the stage will not show on pipelines.
3. Create a new job file called pipelines/includes/local_includes/modelling_and_transformation/build_curation_models.yml with the following job definition:

```yaml
Build Curation Models:
extends:
- .modelling_and_transformation_base
- .agent_tag
variables:
TRANSFORM_ACTION: RUN
TRANSFORM_MODEL_SELECTOR: snowflake_sample_data
stage: Curation Build
script:
- /dataops
  icon: ${TRANSFORM_ICON}
```

4. Create another new job file called pipelines/includes/local_includes/modelling_and_transformation/test_curation_models.yml with the following job definition:

```yaml
Test Curation Models:
extends:
- .modelling_and_transformation_base
- .agent_tag
variables:
TRANSFORM_ACTION: TEST
TRANSFORM_MODEL_SELECTOR: snowflake_sample_data
stage: Curation Testing
script:
- /dataops
icon: ${TESTING_ICON}
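  ## Publish the test results as a JUnit report so they show up in the
  ## pipeline's Tests tab, even when tests fail (when: always).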
artifacts:
when: always
reports:
      junit: $CI_PROJECT_DIR/report.xml
```

5. Open the file full-ci.yml and edit the links to build_all_models.yml and test_all_models.yml to point to the new curation jobs:

```yaml
include:
...
## Modelling and transformation jobs
- /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
- /pipelines/includes/local_includes/modelling_and_transformation/build_curation_models.yml
- /pipelines/includes/local_includes/modelling_and_transformation/test_curation_models.yml
```

6. Commit your changes to exercises and run the full-ci.yml pipeline. You will see the new stages and jobs, and if you look in the logs for the Build Curation Models job, you will notice that it now builds only the SAMPLES schema tables.
Exercise 8: Build and test the new mart
Finally, we need to create pipeline jobs to build and test our new mart. Since you now know how and where to set up the necessary configuration files, the following steps just point you in the general direction of what is needed.
1. Make sure you are still on the exercises branch.
2. Create two new job files to build and test the mart.
Hint: you could copy the existing curation jobs; only a small difference is needed. Oh, and make sure you link them into full-ci.yml. If you get stuck, a sketch follows these steps.
3. Run the pipeline and check that it all works. You may notice that there are no tests to run in the Test Mart Models job.
4. Create a YAML file alongside the new mart model and add some tests.
How about adding another mart table and more tests?
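If you want to check your work against a sketch, a mart build job might look like the following; it assumes the mart models can be selected with mart, mirroring build_curation_models.yml with a different selector and stage.

```yaml
## Illustrative sketch only; your job file may differ.
Build Mart Models:
  extends:
    - .modelling_and_transformation_base
    - .agent_tag
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: mart
  stage: Mart Build
  script:
    - /dataops
  icon: ${TRANSFORM_ICON}
```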
Exercise 9: Merge your exercises into production
Finally, once you have completed all the exercises, it is time to bring them back to production. To do so, we will create a merge request:
1. Navigate in your sidebar to Merge requests.
2. Click the Create merge request button.
3. Verify that your new merge request states From exercises into master.
4. Review all the other settings and, at the bottom of the screen, click Create merge request.
5. After peer review, you can finally merge your changes back to your production branch.