Collibra Orchestrator

Type: Pre-Set
Image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
Feature Status: PriPrev

The Collibra orchestrator is a pre-set orchestrator that interacts with Collibra to publish metadata about the data transformed in a DataOps pipeline. In summary, this orchestrator provides a single-click interface to the Collibra service.

Usage

pipelines/includes/local_includes/collibra_jobs/collibra.yml
```yaml
"Collibra Catalog Sync":
  extends:
    - .agent_tag
  stage: "Catalog Sync"
  image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
  script:
    - /dataops
  icon: ${COLLIBRA_ICON}
  variables:
    DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
    DATAOPS_COLLIBRA_USERNAME: DATAOPS_VAULT(DATAOPS_COLLIBRA_USERNAME)
    DATAOPS_COLLIBRA_PASSWORD: DATAOPS_VAULT(DATAOPS_COLLIBRA_PASSWORD)
    DATAOPS_COLLIBRA_IMPORT_FILE: sample.json
  artifacts:
    name: "Export input and output files"
    when: always
    paths:
      - $CI_PROJECT_DIR/$DATAOPS_COLLIBRA_IMPORT_FILE
      - $CI_PROJECT_DIR/manifest.json
      - $CI_PROJECT_DIR/run_results.json
      - $CI_PROJECT_DIR/catalog.json
```

The Collibra orchestrator assumes that a DataOps modeling and transformation job, including the Generate Docs stage, has completed in an earlier stage of the DataOps pipeline. It uses the resulting metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.

Supported parameters

| Parameter | Required/Default | Description |
| --- | --- | --- |
| DATAOPS_COLLIBRA_HOST | REQUIRED | URL of the Collibra REST API for the organization where the dataset belongs, for example "https://client_name.collibra.com/rest/2.0". You can find your client_name in the URL you use to access Collibra. |
| DATAOPS_COLLIBRA_USERNAME | REQUIRED if username/password authentication is chosen | Username for the Collibra service. If omitted, JWT authentication is performed. |
| DATAOPS_COLLIBRA_PASSWORD | REQUIRED if DATAOPS_COLLIBRA_USERNAME is set | Password for the Collibra service |
| DATAOPS_COLLIBRA_CONTINUE_ON_ERROR | Optional - defaults to False | Defines whether the import should continue if some of the import commands are invalid or fail to execute. If set to True, the valid commands are still committed to the database, which can lead to partial results being stored. |
| DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY | Optional - defaults to Data Governance Council | Name of the Data Quality Community in Collibra |
| DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN | Optional - defaults to Data Quality Dimensions | Name of the Data Quality Dimensions Domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra |
| DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN | Optional - defaults to DataOps.live Pipeline Data Quality Rules Catalog | Name of the Data Quality Domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra |
| DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP | Optional - defaults to 30 (seconds) | Time to wait between sending the import request and sending the request that fetches the status of the upload. It's recommended to increase the default value depending on the number of items imported. |
| DATAOPS_COLLIBRA_FINALIZATION_STRATEGY | Optional - defaults to CHANGE_STATUS | The synchronization finalization strategy used in the clean-up action. This determines whether to remove, ignore, or change the status of assets that no longer exist in the external system. Possible values are REMOVE_RESOURCES, CHANGE_STATUS, and IGNORE. When you select CHANGE_STATUS, you must also provide a value for DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID. |
| DATAOPS_COLLIBRA_IMPORT_FILE | Optional - defaults to sample.json | Name of the file uploaded to Collibra |
| DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID | Optional - defaults to 00000000-0000-0000-0000-000000005011 (Obsolete) | If DATAOPS_COLLIBRA_FINALIZATION_STRATEGY is set to CHANGE_STATUS, this parameter determines the new status ID for assets that no longer exist in the external system |
| DATAOPS_COLLIBRA_PARENT_COMMUNITY | Optional - defaults to None | Sets a Collibra Parent Community if desired |
| DATAOPS_COLLIBRA_RELATIONS_ACTION | Optional - defaults to ADD_OR_IGNORE | Defines whether, during a refresh, new relations are added alongside existing ones or existing relations are replaced. Allowed values are ADD_OR_IGNORE and REPLACE. |
| DATAOPS_COLLIBRA_SEND_NOTIFICATION | Optional - defaults to False | Defines whether the job status notification should be sent by email |
| DATAOPS_COLLIBRA_SIMULATION | Optional - defaults to False | Defines whether the import should be triggered as a simulation. If set to True, the result of the import simulation will be available at the end of the job, but no change will be applied to Data Governance Center (DGC). |
| DATAOPS_DATA_PRODUCTS_PATH | Optional - defaults to None | Path to the data_product.yml file |
| DATAOPS_PATH_TO_MATE_OUTPUT_FILES | Optional - defaults to ./dataops/modelling/target | Path to the output files of the MATE generate docs job |
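
A minimal sketch of how these parameters might be combined for a cautious first run, shown as the variables block of the job from the Usage section. It assumes that boolean and numeric values can be passed as quoted strings. The 120-second wait is an illustrative value, not a documented recommendation.

```yaml
"Collibra Catalog Sync":
  # Other keys (extends, stage, image, script, icon, artifacts) as in the Usage example.
  variables:
    DATAOPS_COLLIBRA_HOST: "https://client_name.collibra.com/rest/2.0"
    DATAOPS_COLLIBRA_USERNAME: DATAOPS_VAULT(DATAOPS_COLLIBRA_USERNAME)
    DATAOPS_COLLIBRA_PASSWORD: DATAOPS_VAULT(DATAOPS_COLLIBRA_PASSWORD)
    # Dry run: the result is reported at the end of the job, nothing changes in DGC.
    DATAOPS_COLLIBRA_SIMULATION: "True"
    # Assumed value: wait longer than the 30-second default before fetching the status.
    DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP: "120"
    # CHANGE_STATUS needs a status ID; this is the documented default (Obsolete).
    DATAOPS_COLLIBRA_FINALIZATION_STRATEGY: CHANGE_STATUS
    DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID: "00000000-0000-0000-0000-000000005011"
```

Once the simulated result looks right, removing DATAOPS_COLLIBRA_SIMULATION (or setting it back to False) lets the import apply its changes.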

Project resources

The Collibra orchestrator assumes that all the steps of the MATE job (run model, test, and generate docs) have run and finalized in the pipeline. The orchestrator then uses the MATE results, specifically table-level lineage, including columns, descriptions, tests, and other metadata.

The orchestrator uses three intermediate files: catalog.json, manifest.json, and run_results.json. These files must be located at DATAOPS_PATH_TO_MATE_OUTPUT_FILES. The default path is /dataops/modelling/target, where /dataops/modelling/ is the working directory of the standard MATE project. All three files are required for the orchestrator to work.

The details of the intermediate files are as follows:

  • catalog.json: has information from your data warehouse about the tables and views produced and defined by the resources in your project.

  • manifest.json: has a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.

  • run_results.json: has complete data about the results of tests that were run as well as the compilation status of your dbt project's resources.
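
If the MATE project writes its output somewhere other than the default location, the orchestrator can be pointed at that directory. A minimal sketch, shown as the variables block of the job from the Usage section. The path ./analytics/modelling/target is purely hypothetical.

```yaml
"Collibra Catalog Sync":
  # Other keys as in the Usage example.
  variables:
    # Hypothetical path: the directory that holds catalog.json,
    # manifest.json, and run_results.json.
    DATAOPS_PATH_TO_MATE_OUTPUT_FILES: ./analytics/modelling/target
```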

Collibra import API

The Collibra import API has two operation modes: import and synchronization. The Collibra orchestrator uses the synchronization operation, which works as follows:

The synchronization operation is intended for replicating the state of an external system, for example, a physical database schema. Typically you perform this operation regularly: you collect the metadata from the external system, such as database tables and columns, and upload the entire schema to Collibra. The Synchronization component makes sure that the Collibra data exactly replicates the external system.

Synchronization differs from import in two ways. First, it not only adds new assets and updates existing ones, but also removes from Collibra the assets that were removed from the external system. Second, it performs better on repeated runs: synchronizing identical data a second time is faster than a simple import because Collibra stores a hash of every synchronized asset and uses it to decide whether the asset needs to be updated. This is faster than comparing each attribute and relation individually.

You can learn more about the process in the Collibra Import API Documentation.

Community and domain names

The Collibra orchestrator can create a parent community if DATAOPS_COLLIBRA_PARENT_COMMUNITY is provided. Under this community, it imports all other communities and domains. The name of the child community is derived from the name defined in dbt_project.yml, MyProject in the example below. If DATAOPS_COLLIBRA_PARENT_COMMUNITY is not provided, the community named in dbt_project.yml becomes the top-level community and contains all created domains.

dataops/modelling/dbt_project.yml
```yaml
## Project
name: MyProject
version: 0.1
config-version: 2

## Sources
model-paths: [models, sources]
analysis-paths: [analysis]
test-paths: [tests]
seed-paths: [seeds]
macro-paths: [macros]
snapshot-paths: [snapshots]
```

Under this community, or under the parent community if one is provided, we create the following domains:

  • Physical Data Dictionary (one or more)
  • Logical Data Dictionary
  • Data Usage Registry
  • Technology Asset Domain
  • Rulebook
  • Glossary
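
A minimal sketch of setting a parent community, shown as the variables block of the job from the Usage section. The community name Analytics is hypothetical. With the dbt_project.yml above, the community MyProject would then be created under Analytics.

```yaml
"Collibra Catalog Sync":
  # Other keys as in the Usage example.
  variables:
    # Hypothetical parent community; the child community name (MyProject)
    # still comes from the name key in dbt_project.yml.
    DATAOPS_COLLIBRA_PARENT_COMMUNITY: "Analytics"
```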

By default, this orchestrator creates data quality metric assets that are related to the default Data Quality Dimension asset "Integrity", which is located in the Data Governance Council community (full path: Data Governance Council/Data Quality Dimensions/Integrity). The prerequisite package also creates the DataOps.live Pipeline Data Quality Rules Catalog domain. If either of these is missing, the pipeline fails. These values are configurable: you can change the defaults of DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY, DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN, and DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSION to point to a different object, which must be created before running the orchestrator.

To summarize, unless configured otherwise, the integration relies on the following objects existing in Collibra:

  • Community: Data Governance Council (configurable default)
  • Domain: Data Quality Dimensions (configurable default)
  • Asset: Data Governance Council/Data Quality Dimensions/Integrity (configurable default)
  • Domain: DataOps.live Pipeline Data Quality Rules Catalog (configurable default)
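
A minimal sketch of overriding these defaults, shown as the variables block of the job from the Usage section. All three names are hypothetical, and the corresponding community and domains would have to exist in Collibra before the orchestrator runs.

```yaml
"Collibra Catalog Sync":
  # Other keys as in the Usage example.
  variables:
    # Hypothetical community that replaces Data Governance Council.
    DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY: "Enterprise Data Quality"
    # Hypothetical domains, both expected under the community above.
    DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN: "Quality Dimensions"
    DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN: "Pipeline Quality Rules"
```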

Types of authentication

The Collibra orchestrator supports JWT or username/password authentication. If DATAOPS_COLLIBRA_USERNAME is not set, JWT authentication is attempted. For JWT authentication to succeed:

  • The DataOps username and Collibra username must be identical

  • The Collibra username should be a dedicated Collibra service account created for the integration, which has admin privileges

  • The following settings should be provided via the Collibra Console under Data Governance Center > Configuration > Security Configuration > JWT:

    | Parameter | Value |
    | --- | --- |
    | JSON Web Key Set URL | https://app.dataops.live/-/jwks |
    | JWT Token Types | JWT |
    | JWT Algorithms | RS256 |
    | JWT Issuer | app.dataops.live |
    | JWT Principal ID Claim Name | user_login |
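
With JWT authentication, the job omits the username and password variables. A minimal sketch based on the Usage example, assuming the JWT settings above are already in place in Collibra. The client_name in the host URL is a placeholder.

```yaml
"Collibra Catalog Sync":
  extends:
    - .agent_tag
  stage: "Catalog Sync"
  image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
  script:
    - /dataops
  icon: ${COLLIBRA_ICON}
  variables:
    DATAOPS_COLLIBRA_HOST: "https://client_name.collibra.com/rest/2.0"
    # No DATAOPS_COLLIBRA_USERNAME or DATAOPS_COLLIBRA_PASSWORD:
    # leaving the username unset makes the orchestrator attempt JWT authentication.
    DATAOPS_COLLIBRA_IMPORT_FILE: sample.json
```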

Prerequisite package

The Collibra orchestrator relies on a pre-configured Collibra operating model that can be modified using the Collibra orchestrator variables.

To help you get started with the expected Collibra configuration of the operating model, we have created the Collibra Archive file dataopslive_orchestrator.car, which you can import using the Collibra Import functionality. All changes are imported in a separate DataOps.live scope.

Troubleshooting

If the orchestrator returns an error, open your Collibra instance and go to profile/activities/ to see the list of imports, which gives more specific information on the reason for the failure. This helps you rule out misconfigurations and report a possible error.