Data Products Administration
Data Product orchestrator parameters
The DataOps.live Data Product orchestrator enriches the data product specification with metadata from the pipeline run. Using the MATE selectors in the data product specification file, the orchestrator adds all objects and tests that are part of the data product.
The orchestrator adds the following pipeline run metadata to the source data product specification:

```yaml
commit: <the commit id>
branch: <the branch name>
pipeline_id: <pipeline id>
run_start: <pipeline start datetime>
publication_datetime: <datetime of generating the enriched data product definition>
```
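For illustration, after enrichment this block could be populated as follows (all values are placeholders, not real pipeline output):

```yaml
commit: 4f9a2c7e
branch: main
pipeline_id: 123456
run_start: 2024-01-15T10:15:00Z
publication_datetime: 2024-01-15T10:32:17Z
```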
Enriching the data product specification
You must add the following variables and settings to the orchestrator to enrich the data product specification with metadata from the orchestrator: the tables used in the pipeline and the test results.
Use the following image for the orchestrator:

```yaml
"Data Product Orchestrator":
  image: dataopslive/dataops-transform-orchestrator:DATAOPS-8619
```

Add a variable that points to the source data product specification you are using to build the data product:
"Data Product Orchestrator":
variables:
DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: dataops/data-product-definitions/data_product_1.ymlThe MATE orchestrator extracts the relevant information from the MATE logs based on the selectors provided in the data product definition (in the dataset and SLI sections). These extracts will be added to the data product specification file in the Data Product orchestrator.
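Putting both settings together, a complete job definition might look like the following sketch; the stage name and the `script` entry point are assumptions for illustration:

```yaml
"Data Product Orchestrator":
  image: dataopslive/dataops-transform-orchestrator:DATAOPS-8619
  variables:
    DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: dataops/data-product-definitions/data_product_1.yml
  stage: "Data Product"  # assumed stage name; use a stage defined in your pipeline
  script:
    - /dataops           # assumed standard DataOps orchestrator entry point
```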
All data product definitions are uploaded to the data product registry of the project. Apart from the data product definitions, an additional registry metadata package, `_dataops_data_product_registry_metadata`, is created and updated. This package contains the metadata the data product engine uses to auto-trigger the pipelines.
Performing backward compatibility testing
To use the orchestrator to perform a backward compatibility test against an existing data product, include the following variable in the orchestrator definition:
"Data Product Orchestrator":
variables:
DATAOPS_DATA_PRODUCT_REFERENCE_FILE: path to the reference Data Product Manifest
The job checks whether the dataset and SLO sections of the produced data product match the reference data product. If the ID and version are identical but the data product manifest has new or dropped attributes, the job fails.
Orchestrator optional parameters
The following parameters are optional for the Data Product orchestrator:
Parameter | Value | Description |
---|---|---|
`DATAOPS_DATA_PRODUCT_REFERENCE_FILE_FROM_DIFFERENT_BRANCH` | 0 or 1 (default 0) | If set to 1, the database value is skipped during validation |
`DATAOPS_DATA_PRODUCT_ALLOW_BREAKING_CHANGES` | 0 or 1 (default 0) | If set to 1, the orchestrator raises a warning instead of failing the pipeline when there are breaking changes or new attributes |
`DATAOPS_DATA_PRODUCT_EXCLUDE_OBJECT_ATTRIBUTES_LIST` | Default: columns,mate_unique_id,type | Comma-separated list of object attributes to exclude from the backward compatibility check |
`DATAOPS_DATA_PRODUCT_EXCLUDE_COLUMN_ATTRIBUTES_LIST` | Default: index,comment | Comma-separated list of column attributes to exclude from the backward compatibility check |
`DATAOPS_DATA_PRODUCT_EXCLUDE_SLO_ATTRIBUTES_LIST` | Default: description,test_select | Comma-separated list of SLO attributes to exclude from the backward compatibility check |
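As an illustrative sketch, a backward compatibility job that tolerates breaking changes could combine these parameters as follows (the reference file path is an assumed example):

```yaml
"Data Product Orchestrator":
  variables:
    DATAOPS_DATA_PRODUCT_REFERENCE_FILE: dataops/reference/data_product_1.yml  # assumed path
    DATAOPS_DATA_PRODUCT_ALLOW_BREAKING_CHANGES: 1  # warn instead of failing on breaking changes
    DATAOPS_DATA_PRODUCT_EXCLUDE_OBJECT_ATTRIBUTES_LIST: columns,mate_unique_id,type
```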
Data products access tokens
Registration tokens are generated from the DataOps.live platform and link the data product engine with the DataOps project containing the data product.
You must generate two tokens:
- A deploy token - necessary for the engine to read the data product definitions stored in its package registry.
- A trigger token - necessary for the engine to trigger pipelines.
Generating a deploy token
1. Open the project where you have defined the data product and browse to Settings > Repository.
2. Expand Deploy tokens.
3. Enter `gitlab-deploy-token` in the Name field.
4. Select the read_package_registry and write_package_registry checkboxes and click Create deploy token.
Generating a trigger token
1. Open the project where you have defined the data product and browse to Settings > CI/CD.
2. Expand Pipeline triggers.
3. Enter `data-product-engine-trigger` in the Description field and click Add trigger.
Data products engine deployment
The data product dependencies engine is deployed to your infrastructure, like a DataOps runner, and sits within your system boundary.
See DataOps Docker Runner Prerequisites for detailed information about global requirements.
Installing the engine
Install the data product engine on a Linux server, host, or virtual machine. The exact nature of the server/virtual machine is up to you and can vary between bare metal, AWS, or Azure.
AWS EC2
Minimum specifications:
- 2 CPU cores
- 4GB RAM
- 50GB Disk/Storage
- As a guide, for most use cases, an AWS t3.medium (or equivalent) that uses EBS-only storage by default
- A sudo user
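As a purely illustrative sketch, you could provision a matching instance with the AWS CLI; the AMI, key pair, and security group identifiers are placeholders you must supply:

```bash
# Launch a t3.medium with a 50 GB gp3 root volume (all identifiers are placeholders)
aws ec2 run-instances \
  --image-id <ami-id> \
  --instance-type t3.medium \
  --key-name <key-pair-name> \
  --security-group-ids <security-group-id> \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":50,"VolumeType":"gp3"}}]'
```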
Instance profile (IAM Role) (optional)
Attach an IAM role that has permission to read the sensitive configuration (the Secrets Manager secret or Parameter Store parameter described later in this guide).
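For the Secrets Manager case, a minimal identity policy sketch might look like the following; the resource ARN is an assumption based on the secret name used later in this guide, and you should scope it to your own secret:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:DataProductEngineSecret-*"
    }
  ]
}
```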
Docker
Install Docker following the instructions at the Docker site for your operating system of choice.
- Docker for Linux distributions and architectures
Then do the following if you didn't do it as part of the Docker installation instructions:
1. Run `sudo usermod -aG docker $USER`. This allows you to run Docker without being root.
2. Log out and log in again.
To test your Docker install, run the `hello-world` container:

```bash
docker run hello-world
```
Setting up DataOps vault: AWS secret
The DataOps.live platform's fundamental security model is that the platform and repository contain all the information about what to do but none of the secrets needed to do it. Only the data product engine stores these secrets, so no one else can access them.

You must register all data products with the data product engine. The sensitive configuration structure is a JSON list of objects. Each object has four keys, of which only one is optional. Use this structure in your configuration provider item (e.g., an entity within the Secrets Manager orchestrator):
```json
[
  {
    "base_url": "(required, string): Base URL of the DataOps platform",
    "project_path": "(required, string): Unique project path, e.g. 'group/subgroup/project'.",
    "deploy_token": "(required, string): Allows engine to obtain this project's data product definitions.",
    "trigger_token": "(optional, string): Allows engine to trigger this project's data product pipelines."
  }
]
```
Include `trigger_token` only when you want the engine to trigger the data product pipelines in the current project.
For example, create a sensitive configuration JSON file that looks like the example below and name it `sensitive_configuration.json`:
```json
[
  {
    "base_url": "https://app.dataops.live/",
    "project_path": "<path-to-project-1>",
    "deploy_token": "xxxxxxxxxxxxxxxxxxxx",
    "trigger_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  {
    "base_url": "https://app.dataops.live/",
    "project_path": "<path-to-project-2>",
    "deploy_token": "xxxxxxxxxxxxxxxxxxxx",
    "trigger_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  }
]
```
Create the Secrets Manager secret using the AWS CLI:

```bash
aws secretsmanager --region=<region> create-secret --name DataProductEngineSecret \
  --description "Secret containing sensitive configuration for my Data Product Engine" \
  --secret-string file://sensitive_configuration.json
```
The result should look like the following:

```json
{
  "ARN": "arn:aws:secretsmanager:<region>:<account>:secret:DataProductEngineSecret-5fMhVy",
  "Name": "DataProductEngineSecret",
  "VersionId": "<id>"
}
```
To update the secret later, run:

```bash
aws secretsmanager --region=<region> update-secret --secret-id DataProductEngineSecret \
  --secret-string file://sensitive_configuration.json
```
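To confirm the stored value, you can read the secret back (this assumes your credentials allow secretsmanager:GetSecretValue):

```bash
aws secretsmanager get-secret-value --region <region> --secret-id DataProductEngineSecret \
  --query SecretString --output text
```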
Running the data products engine
Run the data product engine after replacing the secret ARN and AWS region values:
```bash
docker run -d \
  --env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=<secret-arn> \
  --env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next aws sm
```
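You can then confirm that the container is running:

```bash
docker ps --filter name=dataops-data-product-engine
```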
Data products engine logs
You can run the command below on the same instance as the data product engine to get details about the operations performed by the engine:

```bash
docker logs -f dataops-data-product-engine
```
Use command-line (CLI) arguments or environment variables to configure the data product engine at runtime and to control how it obtains the sensitive configuration from a configuration provider.
To get more information on the CLI arguments, we recommend that you run the command with the `--help` option:
```bash
docker run --rm -t dataopslive/dataops-data-product-engine:latest-next --help
```
You can use a set of global options to control the engine and the information included in its logs:

CLI Argument | Equivalent Environment Variable | Description |
---|---|---|
`--version` or `-v` | n/a | Shows the version of the data product engine |
`--log-level` or `-l` | DATA_PRODUCT_ENGINE_LOG_LEVEL | Logging level as a string (case insensitive) or numeric value. Default: INFO |
`--wait-time` | DATA_PRODUCT_ENGINE_WAIT_TIME | (Optional) Time interval between rule evaluations, in seconds. Default: 60 |
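For example, to run the engine with debug logging and a 30-second evaluation interval, here is a sketch reusing the Secrets Manager setup from above:

```bash
docker run -d \
  --env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=<secret-arn> \
  --env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
  --env DATA_PRODUCT_ENGINE_LOG_LEVEL=DEBUG \
  --env DATA_PRODUCT_ENGINE_WAIT_TIME=30 \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next aws sm
```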
AWS configuration
Configuration parameters depend on the sensitive configuration provider you use. See Supported Secret Managers for information about the third-party remote secret managers the DataOps pipelines can work with.
If you want to use AWS as your provider, you must authenticate using either an instance profile (IAM role), which is recommended, or user access keys.
CLI Argument | Equivalent Environment Variable | Required/Optional | Description |
---|---|---|---|
`--aws-parameter-store-parameter-name` | DATA_PRODUCT_ENGINE_AWS_PARAMETER_STORE_PARAMETER_NAME | Required (Parameter Store only) | Name of the item you want the engine to retrieve from AWS Parameter Store |
`--aws-region` | DATA_PRODUCT_ENGINE_AWS_REGION | Required | AWS region in which the secret or parameter is stored |
`--aws-secret-manager-arn` | DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN | Required (Secrets Manager only) | ARN of the secret you want the engine to retrieve from AWS Secrets Manager |
`--aws-access-key-id` | DATA_PRODUCT_ENGINE_AWS_ACCESS_KEY_ID | Optional | AWS access key ID if not using an instance profile |
`--aws-secret-access-key` | DATA_PRODUCT_ENGINE_AWS_SECRET_ACCESS_KEY | Optional | AWS secret access key if not using an instance profile |
Examples
Using command line arguments
Example using AWS secrets manager - authorized using an Instance Profile (IAM Role):
```bash
docker run -d \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next \
  aws --aws-region <aws-region> \
  sm --aws-secret-manager-arn arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>-<random-6-digit-string>
```
Using environment variables
Example using AWS secrets manager - authorized using an Instance Profile (IAM Role):
```bash
docker run -d \
  --env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>-<random-6-digit-string> \
  --env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next aws sm
```
Using short-term AWS credentials (for testing only)
```bash
export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-key>"
export AWS_SESSION_TOKEN="<session-token>"
```
Example using AWS secrets manager authorized by passing in exported AWS credentials:
```bash
docker run -d \
  --env DATA_PRODUCT_ENGINE_AWS_SECRET_MANAGER_ARN=arn:aws:secretsmanager:<region>:<account-id>:secret:<secret-name>-<random-6-digit-string> \
  --env DATA_PRODUCT_ENGINE_AWS_REGION=<aws-region> \
  --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  --env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
  --name dataops-data-product-engine \
  dataopslive/dataops-data-product-engine:latest-next aws sm
```
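To stop the engine or move to a newer image, the standard Docker workflow applies (generic Docker usage, not an engine-specific procedure):

```bash
docker stop dataops-data-product-engine   # stop the running engine container
docker rm dataops-data-product-engine     # remove it so the container name can be reused
docker pull dataopslive/dataops-data-product-engine:latest-next  # fetch the newest image
# then start the engine again with the appropriate "docker run" command from above
```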