data.world catalog Orchestrator
Enterprise
| Image | $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE |
| --- | --- |
The data.world catalog orchestrator interacts with the data.world data catalog to publish metadata about the data transformed in a DataOps pipeline. In short, it provides a single-click interface to the data.world service.
Usage
```yaml
data.world_v2:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  variables:
    DATADOTWORLD_ACTION: START
    DW_ORG: <org name>
    DW_INSTANCE: <Private instance name>
    DW_DBT_PROJECT_NAME: <dbt_collection_name>
    DW_SNOWFLAKE_PROJECT_NAME: <snowflake_collection_name>
    DW_DBT_DATASET: <dbt-dataset>
    DW_SNOWFLAKE_DATASET: <snowflake-dataset>
    DW_AUTH_TOKEN: DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN)
    DW_UPLOAD: "true"
    DATAOPS_DDW_SNOWFLAKE_URL: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT).snowflakecomputing.com
    DATAOPS_DDW_DATABASE: ${DATAOPS_DATABASE}
    DATAOPS_DDW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.WAREHOUSE)
    DATAOPS_DDW_USER: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.USERNAME)
    DATAOPS_DDW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSWORD)
    DATAOPS_DDW_ROLE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.ROLE)
    # DW_BLOCK_PROFILE_UPLOAD: 1
    DATAOPS_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/profiles
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
  script:
    - /dataops
  icon: ${DATADOTWORLD_ICON}
```
The data.world catalog orchestrator assumes that a DataOps modeling and transformation job completed its run earlier in the DataOps pipeline. It leverages the metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.
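Because the orchestrator consumes the output of that earlier modeling and transformation job, the Data Catalog stage must be ordered after the transformation stage in your pipeline definition. The sketch below shows the idea only; the stage names other than Data Catalog, and the file where your project declares its stages, are assumptions you should adapt to your setup.

```yaml
# Illustrative stage ordering only; adapt stage names to your project configuration.
stages:
  # ... earlier stages (ingestion, environment setup, etc.) ...
  - Data Transformation # MATE modeling and transformation jobs run here
  - Data Catalog # the data.world catalog orchestrator runs here, after MATE
```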
Supported parameters
| Parameter | Required/Default | Description |
| --- | --- | --- |
| DATADOTWORLD_ACTION | REQUIRED | Action to be performed by the orchestrator. Must be START |
| DW_ORG | REQUIRED | The data.world organization the dataset belongs to |
| DW_INSTANCE | Optional (required for private instances) | Set only for a data.world private instance |
| DW_SNOWFLAKE_DATASET | REQUIRED | The data.world Snowflake dataset to update. The standard value is ddw-catalogs. Note: data.world may ask you to change this |
| DW_DBT_DATASET | REQUIRED | The data.world dbt dataset to update. The standard value is ddw-catalogs. Note: data.world may ask you to change this |
| DW_AUTH_TOKEN | REQUIRED | A data.world authentication token: a secure, unique identifier used to authenticate with data.world's services and APIs, allowing you to upload data, query datasets, and interact with the platform programmatically |
| DW_DBT_PROJECT_NAME | REQUIRED | The name of the dbt collection where the collector output is stored |
| DW_SNOWFLAKE_PROJECT_NAME | REQUIRED | The name of the Snowflake collection where the collector output is stored |
| DATAOPS_DDW_SNOWFLAKE_URL | REQUIRED | Snowflake URL used to connect to data.world |
| DATAOPS_DDW_DATABASE | REQUIRED | Snowflake database to connect to |
| DATAOPS_DDW_USER | REQUIRED | Snowflake username used to connect to the database |
| DATAOPS_DDW_ROLE | REQUIRED | Snowflake role used to run the query. Note: DATAOPS_DDW_ROLE must have access to the SNOWFLAKE.ACCOUNT_USAGE database/schema if DW_TAG_COLLECTION and DW_POLICY_COLLECTION are set to true |
| DATAOPS_DDW_PASSWORD | Optional | Snowflake password used to connect to the database |
| DATAOPS_SNOWFLAKE_AUTH | Optional (required for key-pair authentication) | Authentication method used to connect to Snowflake. See key-pair authentication for how to use it |
| DATAOPS_DDW_WAREHOUSE | Optional | Snowflake warehouse used to connect to data.world |
| DW_UPLOAD | Optional | If set, uploads the generated catalog to the organization account's catalogs dataset |
| DW_BLOCK_PROFILE_UPLOAD | Optional | If set, prevents updating the metadata profile during a job run |
| DATAOPS_TEMPLATES_DIR | REQUIRED | The directory where you place your query templates. The recommended setting is $CI_PROJECT_DIR/dataops/profiles |
| DATAOPS_SECONDARY_TEMPLATES_DIR | REQUIRED | The secondary directory where you place your query templates. The recommended setting is $CI_PROJECT_DIR/dataops/modelling |
| DW_TAG_COLLECTION | Optional - defaults to true | If set to true, harvests tags in the Snowflake collector |
| DW_POLICY_COLLECTION | Optional - defaults to true | If set to true, harvests masking policies and row access policies in the Snowflake collector |
| DW_LOG_LEVEL | Optional - defaults to INFO | Specify the logging level as a string: INFO, WARN, ERROR, or DEBUG |
Make sure the service user associated with the organization has Manage access.
Most of the configuration happens on the data.world application. When run, the orchestrator uploads a default profile file. The default profile is sufficient to get started.
The vault key referenced by DW_AUTH_TOKEN (DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN in the examples) must hold a valid user authentication token obtained from the data.world settings at https://data.world/settings/advanced.
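For reference, a token stored at DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN corresponds to a nested structure like the following in the vault content. This is an illustrative sketch only; your vault layout and tooling may differ.

```yaml
# Illustrative vault content; the placeholder value is not a real token.
DATADOTWORLD:
  DEFAULT:
    DW_AUTH_TOKEN: <your data.world authentication token>
```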
Authentication
Key-pair authentication
The data.world catalog orchestrator supports using Snowflake key-pair authentication. To learn how to configure it, see the key-pair authentication documentation.
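In practice, this means replacing DATAOPS_DDW_PASSWORD in the job variables with the key-pair settings shown in the key-pair example below. The vault paths here mirror that example and may differ in your project:

```yaml
# Key-pair authentication variables (vault paths are illustrative).
DATAOPS_SNOWFLAKE_AUTH: KEY_PAIR
DATAOPS_SNOWFLAKE_KEY_PAIR: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.KEY_PAIR)
DATAOPS_SNOWFLAKE_PASSPHRASE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSPHRASE)
```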
Example jobs
This example dynamically adjusts the organization based on the DataOps context (dev, test, prod). In other words, depending on the context, the default organization is dataopslivedev, dataopsliveqa, or dataopslive, respectively.
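For illustration, one way to select the organization per context is with GitLab-style rules that set DW_ORG based on the branch. The branch-to-context mapping below is a hypothetical sketch, not part of the orchestrator:

```yaml
# Hypothetical sketch: choose DW_ORG per pipeline context via rules.
# Adjust branch names and organizations to your own conventions.
.dw_org_by_context:
  rules:
    - if: $CI_COMMIT_BRANCH == "main" # production context
      variables:
        DW_ORG: dataopslive
    - if: $CI_COMMIT_BRANCH == "qa" # test context
      variables:
        DW_ORG: dataopsliveqa
    - if: $CI_COMMIT_BRANCH # any other branch: dev context
      variables:
        DW_ORG: dataopslivedev
```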
Password-based authentication:
```yaml
Publish metadata and lineage:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  variables:
    DATADOTWORLD_ACTION: START
    DW_ORG: dataopslive
    DW_INSTANCE: data.world
    DW_DBT_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_DATASET: ddw-staging
    DW_DBT_DATASET: ddw-staging
    DW_AUTH_TOKEN: DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN)
    DW_UPLOAD: "true"
    DATAOPS_DDW_SNOWFLAKE_URL: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT).snowflakecomputing.com
    DATAOPS_DDW_DATABASE: ${DATAOPS_DATABASE}
    DATAOPS_DDW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.WAREHOUSE)
    DATAOPS_DDW_ROLE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.ROLE)
    DATAOPS_DDW_USER: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.USERNAME)
    DATAOPS_DDW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSWORD)
    DATAOPS_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/profiles
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
    DW_TAG_COLLECTION: "false"
    DW_POLICY_COLLECTION: "false"
    DW_LOG_LEVEL: DEBUG
  script:
    - /dataops
  icon: ${DATADOTWORLD_ICON}
```
Key-pair-based authentication:

```yaml
Publish metadata and lineage:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  variables:
    DATADOTWORLD_ACTION: START
    DW_ORG: dataopslive
    DW_INSTANCE: data.world
    DW_DBT_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_DATASET: ddw-staging
    DW_DBT_DATASET: ddw-staging
    DW_AUTH_TOKEN: DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN)
    DW_UPLOAD: "true"
    DATAOPS_DDW_SNOWFLAKE_URL: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT).snowflakecomputing.com
    DATAOPS_DDW_DATABASE: ${DATAOPS_DATABASE}
    DATAOPS_DDW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.WAREHOUSE)
    DATAOPS_DDW_ROLE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.ROLE)
    DATAOPS_DDW_USER: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.USERNAME)
    DATAOPS_SNOWFLAKE_AUTH: KEY_PAIR
    DATAOPS_SNOWFLAKE_KEY_PAIR: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.KEY_PAIR)
    DATAOPS_SNOWFLAKE_PASSPHRASE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSPHRASE)
    DATAOPS_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/profiles
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
    DW_TAG_COLLECTION: "false"
    DW_POLICY_COLLECTION: "false"
    DW_LOG_LEVEL: DEBUG
  script:
    - /dataops
  icon: ${DATADOTWORLD_ICON}
```
Project resources
The data.world catalog orchestrator assumes that MATE has already run in the pipeline. It then leverages the MATE results, specifically table-level lineage, including tags, descriptions, and other metadata.
The orchestrator uses four intermediate files: the catalog, manifest, dbt_project, and run_results. These files must be located at /dataops/modelling/target, inside the working directory of the standard MATE project found at /dataops/modelling/.
The details of these intermediate files are as follows:
- catalog.json - contains information from your data warehouse about the tables and views produced and defined by the resources in your project.
- manifest.json - contains a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.
- dbt_project.yml - every dbt project needs a dbt_project.yml file; this is how dbt knows a directory is a dbt project. It also contains important information that tells dbt how to operate on your project.
- run_results.json - contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc.) that was executed.
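If you want to fail fast when these files are missing, a small pre-check job can verify them before the catalog job runs. The job below is a hypothetical sketch, not part of the orchestrator; the paths follow the description above, so adjust them if your project lays files out differently (for example, if dbt_project.yml lives at the MATE project root).

```yaml
# Hypothetical pre-check job; not part of the orchestrator.
Check MATE target files:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  script:
    - |
      # Verify the intermediate files the catalog orchestrator expects.
      for f in catalog.json manifest.json dbt_project.yml run_results.json; do
        test -f "/dataops/modelling/target/$f" || { echo "Missing $f"; exit 1; }
      done
```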
Host dependencies (and Resources)
The example configurations use a data.world access token stored in the DataOps vault at DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN.