
data.world catalog Orchestrator

Enterprise

Image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE

The data.world catalog orchestrator interacts with the data.world data catalog to publish metadata about the data transformed in a DataOps pipeline, providing a single-click interface to the data.world service.

Usage

pipelines/includes/local_includes/datadotworld_jobs/datadotworld_catalog.yml
data.world_v2:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  variables:
    DATADOTWORLD_ACTION: START
    DW_ORG: <org name>
    DW_INSTANCE: <Private instance name>
    DW_DBT_PROJECT_NAME: <dbt_collection_name>
    DW_SNOWFLAKE_PROJECT_NAME: <snowflake_collection_name>
    DW_DBT_DATASET: <dbt-dataset>
    DW_SNOWFLAKE_DATASET: <snowflake-dataset>
    DW_AUTH_TOKEN: DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN)
    DW_UPLOAD: "true"
    DATAOPS_DDW_SNOWFLAKE_URL: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT).snowflakecomputing.com
    DATAOPS_DDW_DATABASE: ${DATAOPS_DATABASE}
    DATAOPS_DDW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.WAREHOUSE)
    DATAOPS_DDW_USER: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.USERNAME)
    DATAOPS_DDW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSWORD)
    DATAOPS_DDW_ROLE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.ROLE)
    # DW_BLOCK_PROFILE_UPLOAD: 1
    DATAOPS_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/profiles
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
  artifacts:
    when: always
    name: data.world logs
    paths:
      - applog
      - dwcc-output
      - dwcc
  script:
    - /dataops
  icon: ${DATADOTWORLD_ICON}

The data.world catalog orchestrator assumes that a DataOps modeling and transformation job completed its run earlier in the DataOps pipeline. It leverages the metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.
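
For the job above to run, its definition file must be pulled into the project's pipeline configuration. A minimal sketch, assuming a GitLab-style local include and a hypothetical top-level pipeline file named full-ci.yml, looks like this (the Data Catalog stage referenced by the job must also be declared in the pipeline's list of stages):

full-ci.yml
include:
  # assumed path, matching the file-path caption shown above
  - /pipelines/includes/local_includes/datadotworld_jobs/datadotworld_catalog.yml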

Supported parameters

| Parameter | Required/Default | Description |
| --- | --- | --- |
| DW_ORG | REQUIRED | The data.world organization where the dataset fits |
| DW_INSTANCE | Optional (required for a private instance) | The data.world private instance. Set only when using a data.world private instance |
| DW_SNOWFLAKE_DATASET | REQUIRED | The data.world Snowflake dataset to update. The standard value is ddw-catalogs. Note: data.world may ask you to change this |
| DW_DBT_DATASET | REQUIRED | The data.world dbt dataset to update. The standard value is ddw-catalogs. Note: data.world may ask you to change this |
| DW_AUTH_TOKEN | REQUIRED | A data.world authentication token: a secure, unique identifier used to authenticate with data.world's services and APIs. With this token you can upload data, query datasets, and interact with the platform programmatically |
| DW_DBT_PROJECT_NAME | REQUIRED | The name of the dbt collection where the collector output is stored |
| DW_SNOWFLAKE_PROJECT_NAME | REQUIRED | The name of the Snowflake collection where the collector output is stored |
| DATAOPS_DDW_SNOWFLAKE_URL | REQUIRED | Snowflake URL used to connect to data.world |
| DATAOPS_DDW_DATABASE | REQUIRED | Snowflake database to connect to |
| DATAOPS_DDW_USER | REQUIRED | Snowflake username used to connect to the database |
| DATAOPS_DDW_PASSWORD | REQUIRED | Snowflake password used to connect to the database |
| DATAOPS_DDW_ROLE | REQUIRED | Snowflake role used to run the query. Note: DATAOPS_DDW_ROLE must have access to the SNOWFLAKE.ACCOUNT_USAGE database/schema if DW_TAG_COLLECTION and DW_POLICY_COLLECTION are set to true |
| DATAOPS_DDW_WAREHOUSE | Optional | Snowflake warehouse used to connect to data.world |
| DW_UPLOAD | Optional | If set, uploads the generated catalog to the organization account's catalogs dataset |
| DW_BLOCK_PROFILE_UPLOAD | Optional | If set, prevents updating the metadata profile during a job run |
| DATAOPS_TEMPLATES_DIR | REQUIRED | The directory where you place your query templates. The recommended setting is $CI_PROJECT_DIR/dataops/profiles |
| DATAOPS_SECONDARY_TEMPLATES_DIR | REQUIRED | The secondary directory where you place your query templates. The recommended setting is $CI_PROJECT_DIR/dataops/modelling |
| DW_TAG_COLLECTION | Optional - defaults to true | If set, harvests tags in the Snowflake collector |
| DW_POLICY_COLLECTION | Optional - defaults to true | If set, harvests masking policies and row access policies in the Snowflake collector |
| DW_LOG_LEVEL | Optional - defaults to INFO | The logging level, as a string: INFO, WARN, ERROR, or DEBUG |
Note: Make sure the service user associated with the organization has Manage access.

Most of the configuration happens in the data.world application. When run, the orchestrator uploads a default profile file, which is sufficient to get started.

The DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN key in the DataOps vault must hold a valid user authentication token obtained from the data.world settings at https://data.world/settings/advanced.

Example jobs

This example adjusts the organization according to the DataOps context (dev, test, prod): depending on the context, the organization changes from dataopslivedev to dataopsliveqa to dataopslive, respectively. The job below shows the production setting; one way to switch DW_ORG per context is sketched after the example.

pipelines/includes/local_includes/datadotworld_jobs/datadotworld_catalog.yml
data.world_v2:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  variables:
    DW_ORG: dataopslive
    DW_INSTANCE: data.world
    DW_DBT_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_PROJECT_NAME: data-world-catalog
    DW_SNOWFLAKE_DATASET: ddw-staging
    DW_DBT_DATASET: ddw-staging
    DW_AUTH_TOKEN: DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN)
    DW_UPLOAD: "true"
    DATAOPS_DDW_SNOWFLAKE_URL: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT).snowflakecomputing.com
    DATAOPS_DDW_DATABASE: ${DATAOPS_DATABASE}
    DATAOPS_DDW_WAREHOUSE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.WAREHOUSE)
    DATAOPS_DDW_USER: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.USERNAME)
    DATAOPS_DDW_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.PASSWORD)
    DATAOPS_DDW_ROLE: DATAOPS_VAULT(SNOWFLAKE.TRANSFORM.ROLE)
    # DW_BLOCK_PROFILE_UPLOAD: 1
    DATAOPS_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/profiles
    DATAOPS_SECONDARY_TEMPLATES_DIR: $CI_PROJECT_DIR/dataops/modelling
    DW_TAG_COLLECTION: "false"
    DW_POLICY_COLLECTION: "false"
    DW_LOG_LEVEL: DEBUG
  artifacts:
    when: always
    name: data.world logs
    paths:
      - applog
      - dwcc-output
      - dwcc
  script:
    - /dataops
  icon: ${DATADOTWORLD_ICON}
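
One way to switch DW_ORG per DataOps context is to let the pipeline select it by branch. The sketch below is illustrative only: it assumes branch names main, qa, and dev map to the prod, test, and dev contexts and uses GitLab's rules:variables to override DW_ORG; all remaining keys stay exactly as in the example above.

data.world_v2:
  extends:
    - .agent_tag
  stage: Data Catalog
  image: $DATAOPS_DATADOTWORLD_CATALOG_RUNNER_IMAGE
  rules:
    - if: $CI_COMMIT_BRANCH == "main" # prod context
      variables:
        DW_ORG: dataopslive
    - if: $CI_COMMIT_BRANCH == "qa" # test context
      variables:
        DW_ORG: dataopsliveqa
    - if: $CI_COMMIT_BRANCH # any other branch: dev context
      variables:
        DW_ORG: dataopslivedev
  # ...remaining variables, artifacts, and script as in the example above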

Project resources

The data.world catalog orchestrator assumes that MATE has already run in the pipeline. It then leverages the MATE results, specifically table-level lineage, including tags, descriptions, and other metadata.

The orchestrator uses four intermediate files: catalog, manifest, dbt_project, and run_results. These files must be located at /dataops/modelling/target, the target (working) directory of the standard MATE project found at /dataops/modelling/.

The details of these intermediate files are as follows:

  • catalog.json - This file contains information from your data warehouse about the tables and views produced and defined by the resources in your project.

  • manifest.json - This file contains a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.

  • dbt_project.yml - Every dbt project needs a dbt_project.yml file; it is how dbt knows a directory is a dbt project, and it contains important information that tells dbt how to operate on your project.

  • run_results.json - This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc.) that was executed.
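
Taken together, the layout the orchestrator expects is simply the standard MATE target directory. An illustrative view, based on the file list above:

/dataops/modelling/target/
├── catalog.json
├── manifest.json
├── dbt_project.yml
└── run_results.json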

Host dependencies (and Resources)

The example configurations use a data.world access token stored in the DataOps vault at DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN.
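
As an illustration, and assuming the usual nested-key layout of the DataOps vault, the entry resolved by DATAOPS_VAULT(DATADOTWORLD.DEFAULT.DW_AUTH_TOKEN) would look roughly like this (token value redacted):

DATADOTWORLD:
  DEFAULT:
    DW_AUTH_TOKEN: <your data.world auth token> # obtained from https://data.world/settings/advanced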