Skip to main content

MATE Exercises

To carry out the exercises in this user guide, you will need the following:

Exercise 1: Create a new feature branch

The first thing we need to do is to create a new feature branch exercises.

  1. Navigate to Repository > Branches

  2. Click top-right the New branch button

  3. Name the branch exercises

Exercise 2: Run the pipeline and look at the docs

The next thing we need to do is run the new exercises (feature branch) pipeline so the template content (based on the Snowflake sample data) will be built in Snowflake.

  1. Open your project's exercises branch in WebIDE and check that the .agent_tag in the file /pipelines/includes/config/agent_tag.yml is configured and the variables are set up.

  2. Open /full-ci.yml and comment out the three sample MATE jobs for now:

    /full-ci.yml
    include:
    ...

    ## Modelling and transformation jobs
    # - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
    # - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
    # - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
  3. Commit your changes to exercises and run the full-ci.yml pipeline.

    Did you know?

    DataOps automatically generates documentation from your MATE content, even if you do not compile/run any models.

  4. Using the project sidebar, navigate to CI / CD > Pipelines.

  5. Your most recent pipeline (probably the only pipeline, unless you needed to re-run it) will be top of the list. Click the documentation 📄 button on the far right side of the pipeline row to open the docs in a new tab.

A few things to note about the docs:

  • There are several projects listed on the left-hand navigation. The Snowflake sample content from your DataOps project will be under the MyProject heading (as this is the default project name at the top of your dbt_project.yml file). The other projects are DataOps and dbt libraries that are included in the build process.
  • Clicking the Database tab on the left-hand navigation shows that we have sources from the SNOWFLAKE_SAMPLE_DATA database and the sample models are in the SAMPLES schema within our production database.
  • These docs are built using the dbt docs package, so navigation functionality and much of the content are very similar to the standard dbt docs.
  • Docs are generated for every pipeline unless you take out the reference to generate_modelling_and_transformation_documentation.yml from your pipeline file (full-ci.yml), which is a job originating from the DataOps reference project.

Exercise 3: Test the source tables

After ingestion, the first step in most DataOps pipelines should be to run a comprehensive set of data tests on the ingested data. It is common practice to create dbt sources on all tables that will be used as source data for MATE models, so we can add the tests there.

  1. Open WebIDE and check you are editing the exercises branch.

  2. Open file dataops/modelling/sources/snowflake_sample_data/tpch_sf1/customer.yml and examine the tests against column C_CUSTKEY. These are standard column tests from MATE: unique and not null.

  3. Open file full_ci.yml and uncomment the job file test_all_sources.yml:

    /full-ci.yml
    include:
    ...

    ## Modelling and transformation jobs
    - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
    # - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
    # - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
  4. Commit your changes to exercises and run the full-ci.yml pipeline.

  5. Open the pipeline view, either by:

    • Navigating to CI / CD > Pipelines and clicking the status (should be RUNNING) against the current pipeline; or
    • A few seconds after committing the changes, a link to the pipeline will appear in the lower left corner of the WebIDE panel:

      WebIDE pipeline link __shadow__
  6. Once the pipeline completes, open the Tests tab (you may need to refresh the page) and click through to view the results.

    CI / CD > Pipelines > Test Report __shadow__

Exercise 4: Build all models

Now we have well-tested source data (of course, many more tests than this will be in a real project.), we can start building models. Fortunately, the sample data included with our project template has a curation layer of simple models that build from the sources we just tested.

Did you know?

Many modern data warehouse implementations use a curation, or similarly-named layer, for data cleansing and standardization after the ingestion stage. These tables usually follow the ingestion tables column-for-column, row-for-row, including type conversion, cleansing rules, etc.

  1. Open WebIDE and check you are editing the exercises branch.

  2. Open file dataops/modelling/models/snowflake_sample_data/curation_customer.sql and examine the model SQL.

  3. The project template includes a job to run these tests: open file full_ci.yml and uncomment the job file build_all_models.yml (leave test_all_sources.yml also uncommented):

    /full-ci.yml
    include:
    ...

    ## Modelling and transformation jobs
    - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
    - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
    # - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
  4. Commit your changes to exercises and run the full-ci.yml pipeline.

  5. Once the pipeline is completed, open Snowflake and take a look at the new tables in the SAMPLES schema of your production database.

Remember

If you need to check which database DataOps is using, i.e. the name of your production database, check the generated documentation for the pipeline that just ran.

Exercise 5: Test all models

OK, so we have built our first proper data warehouse layer. The next step will be to test it.

  1. Open WebIDE and check you are editing the exercises branch.

  2. Open file dataops/modelling/models/samples/samples_customer.sql and examine the tests against column C_CUSTKEY. As with the source, we are using the same two standard tests to ensure data integrity after the model build.

  3. The project template also includes a job to run these tests: open file full_ci.yml and uncomment the job file test_all_models.yml (leave the previous two job files also uncommented):

    /full-ci.yml
    include:
    ...

    ## Modelling and transformation jobs
    - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
    - /pipelines/includes/local_includes/modelling_and_transformation/build_all_models.yml
    - /pipelines/includes/local_includes/modelling_and_transformation/test_all_models.yml
  4. Commit your changes to exercises and run the full-ci.yml pipeline.

  5. Open the pipeline view and examine the additional test results once the run completes.

    Test results __shadow__

Exercise 6: Create a mart

Once a data warehouse has ingested, cleansed and tested source data, it is time to build some useful data products. For this exercise, we will create a very simple data mart that presents some of the source data in a new layer.

  1. Open WebIDE and check you are editing the exercises branch.

  2. Create a file named dataops/modelling/models/mart/mart_orders.sql:

    Did you know?

    If you create the file at the top level and specify a full path, WebIDE will create intermediate directories for you.

    dataops/modelling/models/mart/mart_orders.sql
    {{ config(alias='ORDERS') -}}

    SELECT
    O_ORDERKEY AS "Order Key",
    O_CUSTKEY AS "Customer Key",
    O_ORDERSTATUS AS "Order Status",
    O_TOTALPRICE AS "Total Price",
    O_ORDERDATE AS "Order date",
    O_ORDERPRIORITY AS "Order Priority",
    O_CLERK AS "Clerk",
    O_SHIPPRIORITY AS "Ship Priority",
    O_COMMENT AS "Comment",
    (SELECT COUNT(*) FROM {{ ref('samples_lineitem') }} WHERE L_ORDERKEY = O_ORDERKEY) AS "Number of Line Items"
    FROM {{ ref('samples_orders') }}
  3. Open file dataops/modelling/dbt_project.yml and add the mart definition to the models section (final two lines below):

    dataops/modelling/dbt_project.yml
    models:
    ...
    MyTemplate:
    snowflake_sample_data:
    +schema: SAMPLES
    mart:
    +schema: MART
  4. Commit your changes to exercises and run the full-ci.yml pipeline.

  5. Take a look at the generated documentation and the new table (and schema) in Snowflake.

    Mart table in docs __shadow__

Exercise 7: Reorganize the pipeline jobs

You will have noticed that we have not added or changed any pipeline jobs in the last exercise, yet the new table was built. This is because the existing Build all Models is configured to build all models in the project, so any new ones get picked up automatically.

However, that is not what we want. To build each layer at a time (so we can test each one before building the next), we need to create separate jobs to build each layer.

  1. Open WebIDE and check you are editing the exercises branch.

  2. Open file pipelines/includes/config/stages.yml and add four new stages after Source Testing:

    pipelines/includes/config/stages.yml
    stages:
    - Pipeline Initialisation
    - Vault Initialisation
    - Snowflake Setup
    - Additional Configuration
    - Data Ingestion
    - Source Testing
    - Curation Build
    - Curation Testing
    - Mart Build
    - Mart Testing
    - Data Transformation
    - Transformation Testing
    - Generate Docs
    - Clean Up
    By the way

    No need to delete any stages right now - if there are no jobs in a stage then the stage will not show on pipelines.

  3. Create a new job file called pipelines/includes/local_includes/modelling_and_transformation/build_curation_models.yml with the following job definition:

    pipelines/includes/local_includes/modelling_and_transformation/build_curation_models.yml
    Build Curation Models:
    extends:
    - .modelling_and_transformation_base
    - .agent_tag
    variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: snowflake_sample_data
    stage: Curation Build
    script:
    - /dataops
    icon: ${TRANSFORM_ICON}
  4. Create another new job file called pipelines/includes/local_includes/modelling_and_transformation/test_curation_models.yml with the following job definition:

    pipelines/includes/local_includes/modelling_and_transformation/test_curation_models.yml
    Test Curation Models:
    extends:
    - .modelling_and_transformation_base
    - .agent_tag
    variables:
    TRANSFORM_ACTION: TEST
    TRANSFORM_MODEL_SELECTOR: snowflake_sample_data
    stage: Curation Testing
    script:
    - /dataops
    icon: ${TESTING_ICON}
    artifacts:
    when: always
    reports:
    junit: $CI_PROJECT_DIR/report.xml
  5. Open file full-ci.yml and edit the links to build_all_models.yml and build_all_models.yml to point to the new curation jobs:

    /full-ci.yml
    include:
    ...

    ## Modelling and transformation jobs
    - /pipelines/includes/local_includes/modelling_and_transformation/test_all_sources.yml
    - /pipelines/includes/local_includes/modelling_and_transformation/build_curation_models.yml
    - /pipelines/includes/local_includes/modelling_and_transformation/test_curation_models.yml
  6. Commit your changes to exercises and run the full-ci.yml pipeline. You will see the new stages and jobs, and looking in the logs for the Build Curation Models job, you will notice that it is only building the SAMPLES schema tables now.

    Build curation log __shadow__

Exercise 8: Build and test the new mart

Finally, we need to create pipeline jobs to build and test our new mart. As you now know how and where to set up the necessary configuration files for this, the following steps will point you in the general direction of what is needed to accomplish this.

  1. Make sure you are still in exercises.

  2. Create two new job files to build and test the mart.

    Hint

    You could copy the existing curation jobs - only a small difference is needed.
    Oh, and make sure you link them into full-ci.yml.

  3. Run the pipeline and check it all works. You may notice that there are no tests to run in the Test Mart Models job.

  4. Create a YAML file alongside the new mart model and add some tests.

Why not try...

...adding another mart table and more tests?

Exercise 9: Merge your exercises into production

Finally, once you completed all the exercises it's time to bring them back to production. To do so we will create a merge request:

  1. Navigate in your sidebar to Merge requests

  2. Click the Create merge request button

  3. Verify that your new merge request states From exercises into master

  4. Review all the other settings and at the bottom of the screen click Create merge request

  5. After peer review you can finally merge your change back to your production branch