The DataOps Reference Project

In the previous section, we learned how to structure our projects and eliminate duplication. That's great, but real-world DataOps projects have a lot more going on, and much of it is handled for you so that you do not need to worry about it yourself.

What you will learn

In this section, you will learn how the DataOps Reference Project makes your day-to-day work easier. We introduce the script entry point /dataops, which wires up all the infrastructure for our DataOps app. Finally, we look at template rendering as an easy mechanism for substituting values from the DataOps Vault into configuration files.

Set aside 15 minutes to complete the section.

Adding the reference project

We provide a DataOps Reference Project that you include in your own project. It supplies the common building blocks you will need later on. The reference project provides, among other things:

standard Stages,
standard DataOps Orchestrator image names,
standard Variables,
standard Icons for your jobs, and
everything you need to work with data, from ingress through Snowflake to egress.

Let us use the reference project.

Edit pipelines/includes/bootstrap.yml

pipelines/includes/bootstrap.yml
include:
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5-stable
    file: /pipelines/includes/base_bootstrap.yml

  - /pipelines/includes/config/agent_tag.yml
  - /pipelines/includes/config/variables.yml
  - /pipelines/includes/config/stages.yml

We include the reference project first because includes are order-sensitive. The generic definitions come first, and the more specific definitions in our project then override them.

If we run this now, the pipeline fails. The DataOps Reference Project defines its own set of stages, but we have overridden all of them with our own. Because the reference project also introduces a reusable job that runs in a stage called Pipeline Initialisation, that stage is now missing from our project.
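The failure can be pictured with a small Python sketch. It is only a model of the behaviour described here, not the platform's actual include logic: it assumes that a later include replaces a top-level key such as stages wholesale. The reference list shows only the two stage names mentioned in this section, and Stage One and Stage Two are hypothetical stand-ins for our own stages.

```python
def merge_includes(*configs):
    """Merge configs in order; a later top-level key replaces an earlier one."""
    merged = {}
    for config in configs:
        merged.update(config)
    return merged


# Only the two stage names mentioned in this section; the real list is longer.
reference = {"stages": ["Pipeline Initialisation", "Demo Jobs"]}
# Hypothetical stage names standing in for our own stages.yml.
ours = {"stages": ["Stage One", "Stage Two"]}

merged = merge_includes(reference, ours)
missing = "Pipeline Initialisation" not in merged["stages"]
print(merged["stages"], "-> initialisation stage missing:", missing)
```

In this model, our stage list wins because it is included last, so the stage that the reference job expects no longer exists.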

Using standard stages

We could add the missing stage to our own list. However, the stages available from the DataOps Reference Project already cover what we need, so let us instead switch our jobs to the standard stage Demo Jobs.

pipelines/includes/local_includes/base_hello.yml
.base_hello:
  extends:
    - .agent_tag
  stage: Demo Jobs
  image: dataopslive/dataops-python3-runner:5-stable
  script:
    - echo "Hello, $NAME!"

pipelines/includes/local_includes/say_hello_again.yml
Say Hello Again:
  extends:
    - .agent_tag
  stage: Demo Jobs
  image: dataopslive/dataops-python3-runner:5-stable
  script:
    - echo "Hello, $MY_NAME!"

And remove the include of our own stages:

pipelines/includes/bootstrap.yml
include:
  - project: reference-template-projects/dataops-template/dataops-reference
    ref: 5-stable
    file: /pipelines/includes/base_bootstrap.yml

  - /pipelines/includes/config/agent_tag.yml
  - /pipelines/includes/config/variables.yml
  #- /pipelines/includes/config/stages.yml

The file /pipelines/includes/config/stages.yml can now be deleted since it is unused.

note

All our jobs are now running concurrently because we put them in the same stage.

Using standard orchestrators

Rather than keeping track of orchestrators and their versions yourself, it is much easier to choose from the available orchestrator variables, which always point to the latest stable and tested version. For example, we can replace dataopslive/dataops-python3-runner:5-stable with $DATAOPS_PYTHON3_RUNNER_IMAGE.

While we are at it, we can give these jobs an icon from a range of provided icons.

pipelines/includes/local_includes/base_hello.yml
.base_hello:
  extends:
    - .agent_tag
  stage: Demo Jobs
  image: $DATAOPS_PYTHON3_RUNNER_IMAGE
  script:
    - echo "Hello, $NAME!"
  icon: ${DATAOPS_ICON}

pipelines/includes/local_includes/say_hello_again.yml
Say Hello Again:
  extends:
    - .agent_tag
  stage: Demo Jobs
  image: $DATAOPS_PYTHON3_RUNNER_IMAGE
  script:
    - echo "Hello, $MY_NAME!"
  icon: ${PYTHON_ICON}

The script entry point /dataops

In the previous sections, we learned that jobs execute on a specific orchestrator. Each run of any orchestrator executes a single bash shell script. So far, we have been using an inline script for each of the jobs:

pipelines/includes/local_includes/hello.yml
A Job:
  ...
  script:
    - echo "Hello, $MY_NAME!"
  ...

All standard orchestrators do far more than execute a simple script. No matter their purpose, they require an execution environment specific to the DataOps app. This includes pulling in credentials from the DataOps Vault, populating all standard variables, and setting up the cache area for inter-job communication. We therefore provide a standard shell script named /dataops that does all the heavy lifting and boilerplate for you. It is our de facto entry point for any orchestrator.

We offer orchestrators that perform specific tasks, such as kicking off an ELT job or building a Snowflake environment. With these, often all you need is a job that calls the /dataops script and sets a few variables defined in that orchestrator's documentation. In more flexible orchestrators (e.g., Python 3 or R), you are likely to want to add your own commands to the script section, but using the /dataops entry point script is still recommended in all cases.

Let us use the /dataops entry point:

pipelines/includes/local_includes/base_hello.yml
.base_hello:
  extends:
    - .agent_tag
  stage: Demo Jobs
  image: $DATAOPS_PYTHON3_RUNNER_IMAGE
  script:
    - /dataops
    - echo "Hello, $NAME!"
  icon: ${DATAOPS_ICON}

At this point, we also need to set the DATAOPS_VAULT_KEY variable to a long, random string. It provides part of the encryption key for the DataOps Vault while the vault is in use.

pipelines/includes/config/variables.yml
variables:
  MY_NAME: Sam
  NAME1: Justin
  NAME2: Guy
  NAME3: Colin
  DATAOPS_VAULT_KEY: 3kl4fj34g45g34f908uwejclvqh40fhgui3q40879
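If you need a long, random string of your own, one option is Python's standard secrets module. This is a minimal sketch. The helper name make_vault_key and the default of 32 random bytes are assumptions made for illustration, not documented requirements for DATAOPS_VAULT_KEY.

```python
import secrets


def make_vault_key(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically strong randomness as hex text."""
    # Each byte becomes two hexadecimal characters.
    return secrets.token_hex(num_bytes)


key = make_vault_key()
print(key)
```

Paste the printed value into variables.yml in place of the example key.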

Commit and Run

Compare the output from the Say Hello to Person 1 job, which is now using the /dataops entry point, with the output from the Say Hello Again job, which is not. You can start to see how much is going on.

Template rendering

A critical capability for DataOps is the ability to store templates in the repository and render them at run time using values from the DataOps Vault or values calculated in a previous job. To illustrate this, let us imagine that Person 1 does not want their name stored in the DataOps repository.

We solve this problem with text substitution using a template rendering engine. DataOps considers anything with .template in the file name as a template and renders it at run time. The resultant file name removes the .template portion of the original file name. We recommend placing .template right before the file extension.
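The naming rule can be sketched in a few lines of Python. The function rendered_name is a hypothetical helper written for this illustration, not part of DataOps, and it assumes that only the first .template marker in the file name is removed.

```python
def rendered_name(template_path: str) -> str:
    """Return the path a template renders to, with '.template' removed."""
    # Split off the directory so that only the file name is changed.
    directory, sep, filename = template_path.rpartition("/")
    if ".template" not in filename:
        raise ValueError(f"{template_path} is not a template file")
    return directory + sep + filename.replace(".template", "", 1)


print(rendered_name("dataops/say_hello_person1.template.txt"))
# dataops/say_hello_person1.txt
```

Placing .template right before the extension means the rendered file keeps its normal extension, so tools that rely on the extension still recognise it.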

Let us create a template file dataops/say_hello_person1.template.txt.

dataops/say_hello_person1.template.txt
Hello from the vault, {{ PEOPLE.NAME1 }}!

The {{ FOO.BAR }} syntax tells the Template Renderer to look for a key called BAR under a group called FOO in the DataOps Vault. For a simple demo, this information can be read from the /secrets/vault.yml file on your DataOps Runner host:

/secrets/vault.yml
PEOPLE:
  NAME1: Justin

If you are using the shared dataops-101-shared-runner, this is already done for you.
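To make the lookup concrete, here is a small Python stand-in for the substitution step. It is not the actual Template Renderer, whose implementation this section does not show. It only handles the {{ GROUP.KEY }} form described above and assumes the vault contents have already been loaded into a nested dictionary.

```python
import re

# Matches placeholders of the form {{ GROUP.KEY }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\.(\w+)\s*\}\}")


def render(template: str, vault: dict) -> str:
    """Replace each {{ GROUP.KEY }} with vault[GROUP][KEY]."""
    def lookup(match):
        group, key = match.group(1), match.group(2)
        # A missing group or key raises KeyError instead of rendering blank.
        return str(vault[group][key])

    return PLACEHOLDER.sub(lookup, template)


# The same content as the /secrets/vault.yml example.
vault = {"PEOPLE": {"NAME1": "Justin"}}
text = render("Hello from the vault, {{ PEOPLE.NAME1 }}!", vault)
print(text)  # Hello from the vault, Justin!
```

Failing loudly on a missing key is a design choice of this sketch. It makes a misspelled group or key visible immediately rather than producing a file with an empty value.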

As NAME1 is now defined in vault.yml, we can remove NAME1 from pipelines/includes/config/variables.yml:

pipelines/includes/config/variables.yml
variables:
  MY_NAME: Sam
  NAME2: Guy
  NAME3: Colin
  DATAOPS_VAULT_KEY: 3kl4fj34g45g34f908uwejclvqh40fhgui3q40879

In pipelines/includes/local_includes/say_hello.yml, rather than using the script defined in the .base_hello job, we override it with our own for Say Hello to Person 1:

pipelines/includes/local_includes/say_hello.yml
Say Hello:
  extends:
    - .base_hello
  variables:
    NAME: $MY_NAME

Say Hello to Person 1:
  extends:
    - .base_hello
  script:
    - /dataops
    - cat dataops/say_hello_person1.txt

Say Hello to Person 2:
  extends:
    - .base_hello
  variables:
    NAME: $NAME2

Say Hello to Person 3:
  extends:
    - .base_hello
  variables:
    NAME: $NAME3

As you can see in the video, the Say Hello to Person 2 job is unaffected. In Say Hello to Person 1, however, the Template Renderer has found our template and rendered it using the contents of the DataOps Vault, and the job has printed the rendered file.

danger

The DataOps Vault is normally used for more sensitive information, such as passwords, auth tokens, and keys. You should never print such credentials to the logs!

Checkpoint 3

We have now added to our DataOps project:

Use of the DataOps Reference Project for commonly used resources
Use of the /dataops entry point to wire up the infrastructure for the DataOps app
Use of template rendering to easily substitute values from the DataOps Vault into configuration files

This is a very simple pipeline that does not move any actual data yet. Still, we now have virtually all the building blocks we need. Let us look at a couple of final pieces to complete our understanding.

Before we do that, let us take a break.
