Data Products Administration

Feature release status: Private Preview

Data Product orchestrator parameters

The DataOps.live Data Product orchestrator enriches the data product specifications with the metadata from the pipeline run. Using the MATE selectors in the data product specification file, the orchestrator adds all objects and tests that are part of the data product.

The orchestrator adds the following pipeline run metadata to the source data product specification:

commit: <the commit id>
branch: <the branch name>
pipeline_id: <pipeline id>
run_start: <pipeline start datetime>
publication_datetime: <datetime of generating the enriched data product definition>
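
For illustration, the metadata block added to an enriched specification might look like the following sketch; all values, and the datetime format, are hypothetical:

commit: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
branch: main
pipeline_id: 123456
run_start: 2024-05-14 09:30:00
publication_datetime: 2024-05-14 09:47:12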

Enriching the data product specification

To enrich the data product specification with metadata from the pipeline run, namely the tables used in the pipeline and the test results, you must add the following settings to the orchestrator:

  1. Use the following image for the orchestrator:

    "Data Product Orchestrator":
    image: dataopslive/dataops-transform-orchestrator:DATAOPS-8619
  2. Add a variable that points to the source data product specification you are using to build the data product:

    "Data Product Orchestrator":
    variables:
    DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: dataops/data-product-definitions/data_product_1.yml

    The MATE orchestrator extracts the relevant information from the MATE logs based on the selectors provided in the data product definition (in the dataset and SLI sections). These extracts will be added to the data product specification file in the Data Product orchestrator.

    All data product definitions are uploaded to the data product registry of the project. Apart from the data product definitions, an additional registry metadata package is created and updated - _dataops_data_product_registry_metadata. This package contains the metadata the data product engine uses to auto-trigger the pipelines.

    [Image: Data product package registry]
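
Putting the two snippets above together, a minimal orchestrator job definition could look like the following sketch; the stage name is a placeholder, and project-specific keys such as runner tags or a script entry are omitted:

"Data Product Orchestrator":
  image: dataopslive/dataops-transform-orchestrator:DATAOPS-8619
  variables:
    DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: dataops/data-product-definitions/data_product_1.yml
  stage: "Data Product"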

Performing backward compatibility testing

To use the orchestrator to run a backward compatibility test against an existing data product, include the following variable in the orchestrator definition:

"Data Product Orchestrator":
variables:
DATAOPS_DATA_PRODUCT_REFERENCE_FILE: path to the reference Data Product Manifest

The job checks if the dataset and SLO sections of the produced data product match the reference data product. The job will fail if the ID and version are identical but the data product manifest has new or dropped attributes.

Orchestrator optional parameters

The following parameters are optional for the Data Product orchestrator:

| Parameter | Value | Description |
| --- | --- | --- |
| DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH | 0 or 1 (default 0) | If set to 1, the database value is skipped during validation |
| DATAOPS_DATA_PRODUCT_ALLOW_BREAKING_CHANGES | 0 or 1 (default 0) | If set to 1, the orchestrator raises a warning instead of failing the pipeline when there is a breaking change or new attributes |
| DATAOPS_DATA_PRODUCT_EXCLUDE_OBJECT_ATTRIBUTES_LIST | Default: columns,mate_unique_id,type | Comma-separated list of object attributes to exclude from the backward compatibility check |
| DATAOPS_DATA_PRODUCT_EXCLUDE_COLUMN_ATTRIBUTES_LIST | Default: index,comment | Comma-separated list of column attributes to exclude from the backward compatibility check |
| DATAOPS_DATA_PRODUCT_EXCLUDE_SLO_ATTRIBUTES_LIST | Default: description,test_select | Comma-separated list of SLO attributes to exclude from the backward compatibility check |
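
For example, a backward compatibility job that only warns on breaking changes and extends the excluded object attributes could set its variables as in this sketch (the extra comment attribute is illustrative):

"Data Product Orchestrator":
  variables:
    DATAOPS_DATA_PRODUCT_REFERENCE_FILE: <path to the reference data product manifest>
    DATAOPS_DATA_PRODUCT_ALLOW_BREAKING_CHANGES: 1
    DATAOPS_DATA_PRODUCT_EXCLUDE_OBJECT_ATTRIBUTES_LIST: columns,mate_unique_id,type,comment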

Data products access tokens

The registration token is generated automatically by the DataOps.live platform and is used to link the data product engine with the DataOps project containing the data product.

You must generate two tokens:

  • A deploy token - necessary for the engine to read the data product definitions stored in its package registry.
  • A trigger token - necessary for the engine to trigger pipelines.

Generating a deploy token

  1. Open the project where you have defined the data product and browse to Settings > Repository > Deploy Tokens.


  2. Expand Deploy tokens.

  3. Enter gitlab-deploy-token in the Name field.


  4. Select the read_package_registry and write_package_registry checkboxes and click Create deploy token.

Generating a trigger token

  1. Open the project where you have defined the data product and browse to Settings > CI/CD.


  2. Expand Pipeline triggers.

  3. Enter data-product-engine-trigger in the Description field and click Add trigger.


Data products engine deployment

The data product dependencies engine is deployed to your infrastructure - like a DataOps runner - and sits within your system boundary.

See DataOps Docker Runner Prerequisites for detailed information about global requirements.

Installing the engine

Install the data product engine on a Linux server, host, or virtual machine. The exact nature of the server/virtual machine is up to you and can vary between bare metal, AWS, or Azure.

AWS EC2

Minimum specifications:

  • 2 CPU cores
  • 4GB RAM
  • 50GB Disk/Storage
  • As a guide, for most use cases, an AWS t3.medium (or equivalent) that uses EBS-only storage by default
  • A sudo user

Instance profile (IAM Role) (optional)

Attach an IAM role that has permission to read the sensitive configuration (see Setting up DataOps vault: AWS secret below).
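
If you keep the sensitive configuration in AWS Secrets Manager, as in the example later on this page, a minimal policy for that role might look like the sketch below; the resource ARN is a placeholder, and your setup may need further actions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:DataProductEngineSecret-*"
    }
  ]
}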

Docker

Install Docker following the instructions at the Docker site for your operating system of choice.

Then do the following if you didn't do it as part of the Docker installation instructions:

  1. Run sudo usermod -aG docker $USER.

    This allows you to run docker without being root.

  2. Log out and log in again.

  3. To test your docker install, run the hello-world container:

    docker run hello-world

Setting up DataOps vault: AWS secret

The DataOps.live platform's fundamental security model is that the platform and repository contain all the information about what to do, but none of the secrets. Only the data product engine stores these secrets, so no one else can access them.

You must register all data products with the data product engine. The sensitive configuration is a JSON list of objects, one per project. Each object has three required keys and one optional key (trigger_token). Use this structure in your configuration provider item (e.g., an entity within the Secrets Manager orchestrator):

Data product secret
[
  {
    "base_url": "(required, string): Base URL of the DataOps platform",
    "project_path": "(required, string): Unique project path, e.g. 'group/subgroup/project'.",
    "deploy_token": "(required, string): Allows the engine to obtain this project's data product definitions.",
    "trigger_token": "(optional, string): Allows the engine to trigger this project's data product pipelines."
  }
]

Pipeline auto-triggering

Include trigger_token only when you want the engine to trigger the data product pipelines in the current project.

For example, create a sensitive configuration JSON file named sensitive_configuration.json that looks like this:

sensitive_configuration.json
[
  {
    "base_url": "https://app.dataops.live/",
    "project_path": "<path-to-project-1>",
    "deploy_token": "xxxxxxxxxxxxxxxxxxxx",
    "trigger_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  {
    "base_url": "https://app.dataops.live/",
    "project_path": "<path-to-project-2>",
    "deploy_token": "xxxxxxxxxxxxxxxxxxxx",
    "trigger_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  }
]
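
If the engine should only read a project's data product definitions and never auto-trigger its pipelines, omit trigger_token from that project's object, for example:

[
  {
    "base_url": "https://app.dataops.live/",
    "project_path": "<path-to-project-1>",
    "deploy_token": "xxxxxxxxxxxxxxxxxxxx"
  }
]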

Create the Secrets Manager secret using the AWS CLI:

aws secretsmanager --region=<region> create-secret --name DataProductEngineSecret \
--description "Secret containing sensitive configuration for my Data Product Engine" \
--secret-string file://sensitive_configuration.json

The result should look like this:

{
  "ARN": "arn:aws:secretsmanager:<region>:<account>:secret:DataProductEngineSecret-5fMhVy",
  "Name": "DataProductEngineSecret",
  "VersionId": "<id>"
}

To update the secret later with a modified sensitive_configuration.json, run:

aws secretsmanager --region=<region> update-secret --secret-id DataProductEngineSecret \
  --secret-string file://sensitive_configuration.json

Running the data products engine

Run the data product engine after replacing the secret ARN and AWS region values:

docker run -d \
--env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=<secret-arn> \
--env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
--name dataops-data-product-engine \
dataopslive/dataops-data-product-engine:latest-next aws sm

Data products engine logs

Run the following command on the same instance as the data product engine to get details about the operations performed by the engine.

docker logs -f dataops-data-product-engine

Use command-line (CLI) arguments or environment variables to configure the data product engine at runtime and to tell it how to obtain the sensitive configuration from a configuration provider.

To get more information on the CLI arguments, we recommend that you run the command with the --help option:

docker run --rm -t dataopslive/dataops-data-product-engine:latest-next --help

You can use a set of global commands to include specific information in the engine logs:

| CLI argument | Equivalent environment variable | Description |
| --- | --- | --- |
| --version or -v | n/a | Shows the version of the data product engine |
| --log-level or -l | DATA_PRODUCT_ENGINE_LOG_LEVEL | Logging level as a string (case insensitive) or numeric value. Default: INFO |
| --wait-time | DATA_PRODUCT_ENGINE_WAIT_TIME | (Optional) Time interval between rules evaluations, in seconds. Default: 60 |
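
For example, to turn on more verbose logging and lengthen the evaluation interval, you could add the two environment variables from the table to the docker run command shown earlier (the values below are illustrative):

docker run -d \
  --env DATA_PRODUCT_ENGINE_LOG_LEVEL=DEBUG \
  --env DATA_PRODUCT_ENGINE_WAIT_TIME=120 \
  --env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=<secret-arn> \
  --env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next aws sm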

AWS configuration

Configuration parameters depend on the sensitive configuration provider you use. See Supported Secret Managers for information about the third-party remote secret managers the DataOps pipelines can work with.

If you want to use AWS as your provider, you must authenticate using either an instance profile (IAM role), which is recommended, or user access keys.

| CLI argument | Equivalent environment variable | Required/Optional | Description |
| --- | --- | --- | --- |
| --aws-parameter-store-parameter-name | DATA_PRODUCT_ENGINE_AWS_PARAMETER_STORE_PARAMETER_NAME | Required | Name of the item you want the engine to retrieve from AWS Parameter Store |
| --aws-region | DATA_PRODUCT_ENGINE_AWS_REGION | Required | AWS region in which the secret or parameter is stored |
| --aws-secret-manager-arn | DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN | Required | ARN of the secret you want the engine to retrieve from AWS Secrets Manager |
| --aws-access-key-id | DATA_PRODUCT_ENGINE_AWS_ACCESS_KEY_ID | Optional | AWS access key ID if not using an instance profile |
| --aws-secret-access-key | DATA_PRODUCT_ENGINE_AWS_SECRET_ACCESS_KEY | Optional | AWS secret access key if not using an instance profile |

Examples

Using command line arguments

Example using AWS secrets manager - authorized using an Instance Profile (IAM Role):

docker run -d \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next \
  aws --aws-region <aws-region> \
  sm --aws-secret-manager-arn arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>-<random-6-digit-string>

Using environment variables

Example using AWS secrets manager - authorized using an Instance Profile (IAM Role):

docker run -d \
  --env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>-<random-6-digit-string> \
  --env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next aws sm

Using short-term AWS credentials (for testing only)

export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-key>"
export AWS_SESSION_TOKEN="<session-token>"

Example using AWS secrets manager authorized by passing in exported AWS credentials:

docker run -d \
  --env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>-<random-6-digit-string> \
  --env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
  --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  --env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next aws sm