
DataOps Core Concepts

It is essential to understand how the following terms and concepts are used within DataOps, both as a philosophy and as a platform. The sections below define the core concepts and terms found within the DataOps platform:

Account and Project Structure

DataOps projects are functionally similar to repositories in other VCS platforms and are typically contained in a group/subgroup structure within a client account.

Account/Top-Level Group

An account is effectively a customer tenant within the DataOps platform and is materialized as a top-level group in the organizational structure. It contains the account's groups, subgroups, and projects.

A typical account URL looks something like https://app.dataops.live/acme-corp.

A DataOps user is assigned to an account (by DataOps.live staff) as a member, granting account-level access at different privilege levels.

Group/Subgroup

A group is a collection of projects and, optionally, further subgroups, all contained within an account.

Group owners can grant other DataOps users access to a group (as members), each at a chosen privilege level.

to bear in mind

It is normally best practice to register a DataOps Runner against a group rather than against individual projects.

Project


A project is primarily a Git-compliant code repository that contains configurations allowing code to be merged and pipelines to be run.

Projects can be created at the top level of an account. However, it is good practice to create projects within groups/subgroups.

User

A user is a login that permits a physical person (or system user) entry to the DataOps platform. Users are created by DataOps.live staff and assigned to an account.

Template Project


When setting up a new DataOps project, the DataOps administrator will typically create it from the main project template rather than starting from scratch. This brings in the standard structure, mandatory directories and files, and many best-practice configurations.

Reference Project

To avoid copying and maintaining many standard configuration files in every project, each DataOps project maintains a link to the standard DataOps Reference Project. This provides standard pipeline configuration settings, such as stages and default jobs. The entire reference project is cloned into the runtime workspace of each pipeline, supplying a large amount of additional project content.

DataOps Pipelines

All execution of DataOps code happens within DataOps pipelines that comprise a series of individual jobs.

Pipeline


A DataOps pipeline is an execution of a pipeline file in a project, running the pipeline's configured jobs in their specified stages.

Pipeline File

One of the main YAML configuration files within DataOps projects, each pipeline file (a project can have one or many) is identified by a suffix of -ci.yml, e.g., full-ci.yml.

note

Pipeline files must sit at the top level in the project, not in a subdirectory.
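As a sketch only, a minimal pipeline file might look like the following; the include path and variable name are hypothetical, not prescriptive:

```yaml
# full-ci.yml -- hypothetical minimal pipeline file at the project's top level
include:
  - /pipelines/includes/local_includes/my_jobs.yml  # hypothetical project include

variables:
  EXAMPLE_SETTING: some_value  # illustrative pipeline-level variable
```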

Job


Each pipeline file includes one or more jobs, whether defined in the project itself (usually as YAML files within pipelines/includes) or within a DataOps reference project.

Job definitions must include, at a minimum, the stage in which the job runs and the orchestrator image that executes it.
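As an illustration only (the job name, image variable, and stage name are all hypothetical), a job definition might look like:

```yaml
# Hypothetical job definition, e.g. inside pipelines/includes
"My Ingestion Job":
  image: $DATAOPS_EXAMPLE_ORCHESTRATOR_IMAGE  # orchestrator image (assumed variable name)
  stage: "Data Ingestion"                     # must match a configured pipeline stage
  script:
    - echo "job workload runs here"
```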

Base Job

Many jobs share similar configurations. For example, several MATE jobs may use the same image, stage, and variables. To avoid duplicating this configuration in every job, the principle of reusable code is applied: a base job provides an abstract job definition that never runs directly but is included in other jobs to prevent repeated content.

Base jobs are identified by a leading dot (.) in front of the job name, e.g., .modelling_and_transformation_base.
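A sketch of the pattern (all names and variables here are hypothetical): the base job never runs on its own, and concrete jobs pull its configuration in with `extends`:

```yaml
# Abstract base job -- the leading dot stops it from running directly
.modelling_and_transformation_base:
  image: $DATAOPS_EXAMPLE_ORCHESTRATOR_IMAGE  # assumed variable name
  variables:
    SHARED_SETTING: shared_value

# Concrete job reusing the base configuration
"Build Models":
  extends: .modelling_and_transformation_base
  stage: "Data Transformation"                # illustrative stage name
  script:
    - echo "run transformations"
```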

Pipeline Stage


The fundamental method for sequencing jobs in a pipeline is through stages. A project defines a series of stage names in a configuration file, and then each job is configured to run in one of these stages.

All the jobs in any given stage will execute in parallel, up to the concurrency limits of the DataOps Runner.
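As an illustrative sketch (the stage and job names are hypothetical), the configuration pairs a project-level stages list with a stage key on each job:

```yaml
# Project-level stage sequence -- stages run in the order listed
stages:
  - "Pipeline Initialisation"
  - "Data Ingestion"
  - "Data Transformation"

# Both jobs share a stage, so they run in parallel
# (subject to the runner's concurrency limits)
"Ingest Orders":
  stage: "Data Ingestion"
  script:
    - echo "ingest orders"

"Ingest Customers":
  stage: "Data Ingestion"
  script:
    - echo "ingest customers"
```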

Variables

Pipeline configuration files define variables that control the behavior of DataOps pipelines and their individual jobs. These variables are passed into a job's orchestrator image at execution time and added to the runtime environment, where apps and scripts can access them.
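A sketch of the idea (names and values are illustrative): variables can be declared at the pipeline level and overridden per job, and scripts read them from the runtime environment:

```yaml
variables:
  TARGET_ENV: production        # hypothetical pipeline-wide variable

"Example Job":
  stage: "Data Transformation"  # illustrative stage name
  variables:
    TARGET_ENV: development     # job-level override
  script:
    - echo "deploying to $TARGET_ENV"   # reads the variable from the environment
```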

Runtime Infrastructure

DataOps Runner

The DataOps Runner is a lightweight, long-running Docker container, installed by a DataOps administrator on client infrastructure (usually a cloud or on-premises virtual machine), that picks up and runs the jobs within DataOps pipelines.

Jobs that run on the DataOps Runner are instantiated as additional, short-lived orchestrators.

to bear in mind

The term runner may refer either to the logical runner process (a Docker container) or to the physical or virtual machine it runs on.

Orchestrators

Orchestrators execute the workload of specific DataOps pipeline jobs and are launched as subprocesses by the DataOps Runner.

When configuring a DataOps job, the developer must select the most appropriate orchestrator for the required task. Each orchestrator is defined by an Orchestrator Container Image, available as a predefined variable in the DataOps reference project.

note

The Orchestrator Container Image contains all the necessary tools and scripts to execute the job.

The DataOps Runner and DataOps Orchestrators interact as follows:

(Diagram: DataOps Runner and orchestrator interactions)

The default polling interval of the runner for new work is 10 seconds and can be altered during runner installation by modifying the check_interval in the generated config.toml.
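For example, the relevant setting in the generated config.toml is a single key; the value shown below is the 10-second default mentioned above:

```toml
# config.toml (excerpt) -- runner polling interval for new work, in seconds
check_interval = 10
```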

Vault

Each time a pipeline runs, DataOps automatically creates a secure, encrypted file known as the DataOps Vault, which resides only on the DataOps runner host. It is populated by vault files on the runner host and by configuration from a secrets manager if one is configured in the DataOps project.

Jobs can access information from the vault through variable substitution in template files, e.g., {{ SNOWFLAKE.ACCOUNT }}, or by setting variables using the DATAOPS_VAULT(...) syntax.
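A sketch of both access styles; apart from SNOWFLAKE.ACCOUNT, the vault paths and variable names shown are hypothetical:

```yaml
# In a template file, vault values are substituted at render time:
#   account: "{{ SNOWFLAKE.ACCOUNT }}"

# In pipeline configuration, a variable can be set from a vault path:
variables:
  EXAMPLE_PASSWORD: DATAOPS_VAULT(EXAMPLE.PASSWORD)  # hypothetical vault path
```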