data.world catalog Orchestrator
Enterprise
Image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
The data.world catalog orchestrator interacts with the data.world data catalog to publish metadata about the data transformed in a DataOps pipeline. In summary, the data.world catalog orchestrator provides a single-click interface to the data.world service.
Usage
```yaml
data.world_v2:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  variables:
    DATADOTWORLD_ACTION: START
    DW_ORG: <org name>
    DW_INSTANCE: <private instance name>
    DW_DBT_PROJECT_NAME: <dbt_collection_name>
    DW_SNOWFLAKE_PROJECT_NAME: <snowflake_collection_name>
    DW_DBT_DATASET: <dbt-dataset>
    DW_SNOWFLAKE_DATASET: <snowflake-dataset>
    DW_AUTH_TOKEN: DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN)
    DW_UPLOAD: "true"
    DATAOPS_DDW_SNOWFLAKE_URL: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT).snowflakecomputing.com
    DATAOPS_DDW_DATABASE: ${DATAOPS_DATABASE}
    DATAOPS_DDW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.WAREHOUSE)
    DATAOPS_DDW_USER: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.USERNAME)
    DATAOPS_DDW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSWORD)
    DATAOPS_DDW_ROLE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.ROLE)
    # DW_BLOCK_PROFILE_UPLOAD: 1
    DATAOPS_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/profiles
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
  script:
    - /dataops
  icon: ${DATADOTWORLD_ICON}
```
The data.world catalog orchestrator assumes that a DataOps modeling and transformation (MATE) job completed its run earlier in the DataOps pipeline. It leverages the resulting metadata model to provide up-to-date information to the catalog at the end of every pipeline run.
Supported parameters
Parameter | Required/Default | Description |
---|---|---|
DW_ORG | REQUIRED | The data.world organization where the dataset fits |
DW_SNOWFLAKE_DATASET | REQUIRED | The data.world Snowflake dataset to update. The standard value is ddw-catalogs. Note: data.world may ask you to change this |
DW_DBT_DATASET | REQUIRED | The data.world dbt dataset to update. The standard value is ddw-catalogs. Note: data.world may ask you to change this |
DW_AUTH_TOKEN | REQUIRED | A secure, unique identifier used to access and authenticate with data.world's services and APIs. With this token, you can upload data, query datasets, and interact with the platform programmatically |
DW_DBT_PROJECT_NAME | REQUIRED | The name of the dbt collection where the collector output is stored |
DW_SNOWFLAKE_PROJECT_NAME | REQUIRED | The name of the Snowflake collection where the collector output is stored |
DATAOPS_DDW_SNOWFLAKE_URL | REQUIRED | The Snowflake URL used to connect to data.world |
DATAOPS_DDW_DATABASE | REQUIRED | The Snowflake database to connect to |
DATAOPS_DDW_USER | REQUIRED | The Snowflake username used to connect to the database |
DATAOPS_DDW_PASSWORD | REQUIRED | The Snowflake password used to connect to the database |
DATAOPS_DDW_ROLE | REQUIRED | The Snowflake role used to run the query. Note: DATAOPS_DDW_ROLE must have access to the SNOWFLAKE.ACCOUNT_USAGE schema if DW_TAG_COLLECTION and DW_POLICY_COLLECTION are set to true |
DW_INSTANCE | Optional (required for private instances) | The data.world private instance name. Set only when using a data.world private instance |
DATAOPS_DDW_WAREHOUSE | Optional | The Snowflake warehouse used to connect to data.world |
DW_UPLOAD | Optional | If set, uploads the generated catalog to the organization account's catalogs dataset |
DW_BLOCK_PROFILE_UPLOAD | Optional | If set, prevents updating the metadata profile during a job run |
DATAOPS_TEMPLATES_DIR | REQUIRED | The directory where you place your query templates. The recommended setting is $CI_PROJECT_DIR/dataops/profiles |
DATAOPS_SECONDARY_TEMPLATES_DIR | REQUIRED | The secondary directory where you place your query templates. The recommended setting is $CI_PROJECT_DIR/dataops/modelling |
DW_TAG_COLLECTION | Optional; defaults to true | If set, the Snowflake collector harvests tags |
DW_POLICY_COLLECTION | Optional; defaults to true | If set, the Snowflake collector harvests masking policies and row access policies |
DW_LOG_LEVEL | Optional; defaults to INFO | The logging level as a string: INFO, WARN, ERROR, or DEBUG |
Make sure the service user associated with the organization has Manage access.
Most of the configuration happens on the data.world application. When run, the orchestrator uploads a default profile file. The default profile is sufficient to get started.
The authentication key in the DataOps vault (DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN in the examples on this page) must hold a valid user authentication token obtained from the data.world settings at https://data.world/settings/advanced.
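As a sketch of what that vault entry looks like, the nesting below mirrors the dotted path used by DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN); the placeholder value is illustrative, not a real token:

```yaml
# Fragment of the DataOps vault (YAML); keys follow the dotted vault path.
DATADOTWORLD:
  DEFAULT:
    DW_AUTH_TOKEN: <token copied from https://data.world/settings/advanced>
```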
Example jobs
This example dynamically adjusts the organization being used based on the DataOps context (dev, test, prod). In other words, depending on the context, the default organization changes from dataopslivedev to dataopsliveqa and dataopslive, respectively.
```yaml
data.world_v2:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  variables:
    DW_ORG: dataopslive
    DW_INSTANCE: data.world
    DW_DBT_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_DATASET: ddw-staging
    DW_DBT_DATASET: ddw-staging
    DW_AUTH_TOKEN: DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN)
    DW_UPLOAD: "true"
    DATAOPS_DDW_SNOWFLAKE_URL: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT).snowflakecomputing.com
    DATAOPS_DDW_DATABASE: ${DATAOPS_DATABASE}
    DATAOPS_DDW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.WAREHOUSE)
    DATAOPS_DDW_USER: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.USERNAME)
    DATAOPS_DDW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSWORD)
    DATAOPS_DDW_ROLE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.ROLE)
    # DW_BLOCK_PROFILE_UPLOAD: 1
    DATAOPS_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/profiles
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
    DW_TAG_COLLECTION: "false"
    DW_POLICY_COLLECTION: "false"
    DW_LOG_LEVEL: DEBUG
  script:
    - /dataops
  icon: ${DATADOTWORLD_ICON}
```
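One possible way to wire up the context-based organization switch is with per-branch rule variables, since DataOps pipelines run on GitLab CI. This is only a sketch: the branch names (dev, qa, main) are assumptions about how your project maps branches to contexts, and your pipeline may derive the context differently:

```yaml
# Hypothetical rules overriding DW_ORG per branch (branch names assumed).
data.world_v2:
  rules:
    - if: $CI_COMMIT_BRANCH == "dev"
      variables:
        DW_ORG: dataopslivedev
    - if: $CI_COMMIT_BRANCH == "qa"
      variables:
        DW_ORG: dataopsliveqa
    - when: always
      variables:
        DW_ORG: dataopslive    # production default
```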
Project resources
The data.world catalog orchestrator assumes that MATE has already run in the pipeline. It then leverages the MATE results, specifically table-level lineage, including tags, descriptions, and other metadata.
The orchestrator uses four intermediate files: the catalog, manifest, dbt_project, and run_results. The files must be located at /dataops/modelling/target, the working directory of the standard MATE project found at /dataops/modelling/.
The details of these intermediate files are as follows:

- catalog.json - this file contains information from your data warehouse about the tables and views produced and defined by the resources in your project.
- manifest.json - this file contains a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.
- dbt_project.yml - every dbt project needs a dbt_project.yml file; this is how dbt knows a directory is a dbt project. It also contains important information that tells dbt how to operate on your project.
- run_results.json - this file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc.) that was executed.
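If the catalog job fails because these files are missing, a hypothetical pre-flight job (the job name is illustrative, not part of the standard pipeline) can confirm that MATE produced them before the catalog upload runs:

```yaml
# Sketch of a debugging job; fails early if any MATE artifact is absent.
check_mate_artifacts:
  extends:
    - .agent_tag
  stage: Data Catalog
  script:
    - ls -l /dataops/modelling/target/catalog.json
    - ls -l /dataops/modelling/target/manifest.json
    - ls -l /dataops/modelling/target/run_results.json
    - ls -l /dataops/modelling/dbt_project.yml
```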
Host dependencies (and Resources)
The example configurations use a data.world access token stored in the DataOps vault at DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN.