
Collibra Orchestrator

Enterprise

Image: $DATAOPS_COLLIBRA_RUNNER_IMAGE

The Collibra orchestrator interacts with Collibra to publish metadata about the data transformed in a DataOps pipeline. In short, it provides a single-click interface to the Collibra service. For more information about the prerequisites and the Collibra Operating Model, see the documentation on the Collibra Marketplace.

Usage

pipelines/includes/local_includes/collibra_jobs/collibra.yml
"Collibra Catalog Sync":
extends:
- .agent_tag
stage: "Catalog Sync"
image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
script:
- /dataops
icon: ${COLLIBRA_ICON}
variables:
DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
DATAOPS_COLLIBRA_USERNAME: DATAOPS_VAULT(DATAOPS_COLLIBRA_USERNAME)
DATAOPS_COLLIBRA_PASSWORD: DATAOPS_VAULT(DATAOPS_COLLIBRA_PASSWORD)
DATAOPS_COLLIBRA_IMPORT_FILE: sample.json
artifacts:
name: "Export input and output files"
when: always
paths:
- $CI_PROJECT_DIR/$DATAOPS_COLLIBRA_IMPORT_FILE
- $CI_PROJECT_DIR/manifest.json
- $CI_PROJECT_DIR/run_results.json
- $CI_PROJECT_DIR/catalog.json

The Collibra orchestrator assumes that a DataOps modeling and transformation job, including the Generate Docs stage, has completed earlier in the DataOps pipeline. It uses the metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.
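For example, a stage configuration can order the pipeline so that the catalog sync runs after modeling, testing, and docs generation have finished. A minimal sketch; the file path and the stage names other than "Catalog Sync" are illustrative and depend on your project:

pipelines/includes/config/stages.yml
stages:
  - "Modelling and Transformation"   # MATE run and test jobs
  - "Generate Docs"                  # produces the metadata the orchestrator reads
  - "Catalog Sync"                   # the Collibra job shown above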

Supported parameters

Parameter | Required/Default | Description
DATAOPS_COLLIBRA_HOST | REQUIRED | The Collibra organization hosting the dataset, for example "https://client_name.collibra.com/rest/2.0". You can find your client_name in the URL you use to access Collibra.
DATAOPS_COLLIBRA_USERNAME | REQUIRED if username/password authentication is chosen | Username for the Collibra service. If omitted, JWT authentication is performed.
DATAOPS_COLLIBRA_PASSWORD | REQUIRED if DATAOPS_COLLIBRA_USERNAME is set | Password for the Collibra service
DATAOPS_COLLIBRA_CONTINUE_ON_ERROR | Optional. Defaults to False | Defines whether the import should continue if some of the import commands are invalid or fail to execute. If set to True, the valid commands are still committed to the database, which can lead to partial results being stored.
DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY | Optional. Defaults to Data Governance Council | Name of the Data Quality community in Collibra
DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN | Optional. Defaults to Data Quality Dimensions | Name of the Data Quality Dimensions domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra
DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN | Optional. Defaults to DataOps.live Pipeline Data Quality Rules Catalog | Name of the Data Quality domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra
DATAOPS_COLLIBRA_DO_NOT_GENERATE_GLOSSARY | Optional. Defaults to None | Defines whether to skip generating the glossary. If unset, the glossary is available at the end of the job.
DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP | Optional. Defaults to 30 (seconds) | The time to wait between initiating the import request and requesting the upload status. Increasing the default value is recommended depending on the number of items imported.
DATAOPS_COLLIBRA_FINALIZATION_STRATEGY | Optional. Defaults to CHANGE_STATUS | The synchronization finalization strategy used in the clean-up action. It determines whether to remove, ignore, or change the status of assets that no longer exist in the external system. Possible values are REMOVE_RESOURCES, CHANGE_STATUS, and IGNORE. When you select CHANGE_STATUS, you must also provide a value for DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID.
DATAOPS_COLLIBRA_IMPORT_FILE | Optional. Defaults to sample.json | Name of the file uploaded to Collibra
DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID | Optional. Defaults to 00000000-0000-0000-0000-000000005011 (Obsolete) | If DATAOPS_COLLIBRA_FINALIZATION_STRATEGY is set to CHANGE_STATUS, this parameter determines the new status ID for assets that no longer exist in the external system
DATAOPS_COLLIBRA_PARENT_COMMUNITY | Optional. Defaults to None | Sets a Collibra parent community if desired
DATAOPS_COLLIBRA_RELATIONS_ACTION | Optional. Defaults to ADD_OR_IGNORE | Defines whether existing relations are replaced, or added to and kept, during a refresh. Allowed values are ADD_OR_IGNORE and REPLACE.
DATAOPS_COLLIBRA_READ_API_TOKEN | Optional. Defaults to None | The token required if the generated data product of the project has an input port. It is recommended to store the token in the vault and retrieve it like this: DATAOPS_VAULT(DATAOPS_COLLIBRA_TOKEN).
DATAOPS_COLLIBRA_SEND_NOTIFICATION | Optional. Defaults to False | Defines whether the job status notification should be sent by email
DATAOPS_COLLIBRA_SIMULATION | Optional. Defaults to False | Defines whether the import should be triggered as a simulation. If set to True, the result of the import simulation is available at the end of the job, but no change is applied to the Data Governance Center (DGC).
DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE | Optional. Defaults to None | Path to the data_product_definition.yml file. If a data product is generated, set this to the path of that file.
DATAOPS_PATH_TO_MATE_OUTPUT_FILES | Optional. Defaults to ./dataops/modelling/target | Path to the output files of the MATE Generate Docs job
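Any of these defaults can be overridden in the job's variables block. A minimal sketch, assuming you want synchronization to remove assets that have disappeared from the external system and to wait longer before fetching results; the values shown are illustrative:

variables:
  DATAOPS_COLLIBRA_FINALIZATION_STRATEGY: REMOVE_RESOURCES  # instead of the default CHANGE_STATUS
  DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP: 60                  # wait longer for large imports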

Project resources

The Collibra orchestrator assumes that all the steps of the MATE job (run model, test, and generate docs) have run and finalized in the pipeline. The orchestrator then uses the MATE results, specifically table-level lineage, including columns, descriptions, tests, and other metadata.

The orchestrator uses three intermediate files: catalog.json, manifest.json, and run_results.json. These files must be located at DATAOPS_PATH_TO_MATE_OUTPUT_FILES; the default path is ./dataops/modelling/target, where /dataops/modelling/ is the working directory of a standard MATE project. All three files are required for the orchestrator to work. If your MATE job writes them elsewhere, override the path as shown after the list below.

The details of the intermediate files are as follows:

  • catalog.json: has information from your data warehouse about the tables and views produced and defined by the resources in your project.

  • manifest.json: has a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.

  • run_results.json: has complete data about the results of tests that were run as well as the compilation status of your dbt project's resources.
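A minimal sketch of pointing the orchestrator at a non-default target directory (the path is illustrative):

variables:
  # Must contain catalog.json, manifest.json, and run_results.json
  DATAOPS_PATH_TO_MATE_OUTPUT_FILES: $CI_PROJECT_DIR/custom/modelling/target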

Collibra import API

The Collibra import API has two operation modes: import and synchronization. The Collibra orchestrator uses the synchronization operation, which means the following:

The synchronization operation is intended for replicating the state of an external system, for example a physical database schema. Typically you perform this operation regularly: you collect the metadata from the external system, such as database tables and columns, and upload the entire schema to Collibra. The synchronization component makes sure that the data in Collibra exactly replicates the external system.

The difference between import and synchronization is that with synchronization, new assets are not only added and existing ones updated, but assets removed from the external system are also removed from Collibra. Another difference is performance: synchronizing identical data a second time is faster than a simple import, because Collibra stores the hash of every synchronized asset and uses it to decide whether an asset needs to be updated. This is faster than comparing each attribute and relation individually.

You can learn more about the process in the Collibra Import API Documentation.
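Because synchronization can remove assets, it can be useful to preview a run before committing it. A minimal sketch using the simulation parameter from the table above; the simulation result is available at the end of the job, and no change is applied to the Data Governance Center:

variables:
  DATAOPS_COLLIBRA_SIMULATION: True  # dry run; inspect the result before a real sync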

Community and domain names

The Collibra orchestrator can create a parent community if DATAOPS_COLLIBRA_PARENT_COMMUNITY is provided. Under this community, it imports all other communities and domains. The name of the child community is derived from the name defined in dbt_project.yml, MyProject in the example below. If DATAOPS_COLLIBRA_PARENT_COMMUNITY is not provided, the name defined in dbt_project.yml becomes the parent community, nesting all created domains.

dataops/modelling/dbt_project.yml
## Project
name: MyProject
version: 0.1
config-version: 2

## Sources
model-paths: [models, sources]
analysis-paths: [analysis]
test-paths: [tests]
seed-paths: [seeds]
macro-paths: [macros]
snapshot-paths: [snapshots]
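For example, to nest the MyProject community under an existing parent community, set the variable in the Collibra job (the community name is illustrative):

variables:
  DATAOPS_COLLIBRA_PARENT_COMMUNITY: "Enterprise Data Community"  # MyProject becomes a child community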

Under this community, or under the parent community if one is provided, we create the following domains:

  • Physical Data Dictionary (one or more)
  • Logical Data Dictionary
  • Data Usage Registry
  • Technology Asset Domain
  • Rulebook
  • Glossary

By default, this orchestrator creates data quality metric assets related to the default Data Quality Dimension asset "Integrity", which is located in the Data Governance Council community (the full path is Data Governance Council/Data Quality Dimensions/Integrity). As part of our prerequisite package, we also create the DataOps.live Pipeline Data Quality Rules Catalog domain. If either of those is missing, the pipeline fails. These values are configurable, however: you can change the default values of DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY, DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN, and DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN to point to a different object, which has to be created before running the orchestrator (see the sketch after the list below).

To summarize, during the integration, we rely on the following out-of-the-box custom objects to exist unless otherwise configured:

  • Data Governance Council as a default value that can be configured for the object type Community
  • Data Quality Dimensions as a default value that can be configured for the object type Domain
  • Data Governance Council/Data Quality Dimensions/Integrity as a default value that can be configured for the object type Asset
  • DataOps.live Pipeline Data Quality Rules Catalog as a default value that can be configured for the object type Domain
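A minimal sketch of pointing the orchestrator at alternative data quality objects; all names are illustrative, and the objects must exist in Collibra before the orchestrator runs:

variables:
  DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY: "My Governance Community"          # instead of Data Governance Council
  DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN: "My Quality Dimensions"    # instead of Data Quality Dimensions
  DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN: "My Pipeline Quality Rules Catalog"   # instead of the DataOps.live default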

Using Collibra orchestrator together with the Data Product orchestrator

The Collibra orchestrator acts as a centralized hub for data management, providing a comprehensive view of data assets and their associated metadata. It allows organizations to define, automate, and enforce data governance policies, ensuring data quality, compliance, and security throughout the data lifecycle. With the emphasis of DataOps.live on agile and automated data operations, it becomes crucial to integrate data products into the Collibra ecosystem.

In DataOps.live, data products are generated by using the Data Product orchestrator in our CI/CD process. These data products contain valuable insights and transformations applied to raw data and are typically represented as YAML configurations, accompanied by various parameters that govern their behavior. To fully leverage the potential of these data products and transfer them to Collibra, it is vital to integrate them with the Collibra orchestrator, allowing for centralized governance and visibility.

The integration process involves building a robust mechanism that reads the DataOps data product and parses it into a Collibra object. This parsed object serves as a representation of the data product within the Collibra ecosystem, capturing its essential attributes, such as lineage, metadata, and associated business rules. Additionally, the integration ensures that the parameters governing the data product's behavior are correctly mapped and synchronized with the Collibra orchestrator, enabling consistent governance across the organization. By integrating the Collibra orchestrator with DataOps.live data products, you can achieve the following benefits:

  • You gain a holistic view of your data landscape, promoting transparency and understanding of data flows and dependencies. This visibility enhances data governance, enabling your organization to track and manage the lineage of data products, understand their impact on downstream processes, and ensure compliance with regulatory requirements.
  • You can effectively manage and monitor the behavior and performance of these products. This allows for better coordination between data engineering teams, data scientists, data stewards, and business stakeholders, resulting in improved collaboration, data-driven decision-making, and optimized data product delivery.

To start using the integration between the Data Product orchestrator and Collibra orchestrator, you have to:

  • Make sure the Collibra job runs after the Data Product orchestrator job.
  • Set DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE to the path of the data_product_definition.yml file.
  • Make sure that your data_product_definition.yml file has a governance: collibra block. An example configuration looks like this:
dataops/dataproduct/data_product_definition.yml
id: twp
name: Operational data
description: All orders and customers data that are coming from the source system
schema_version: 1.0.0
governance:
  - collibra:
      collibra_community: sample_community_value # parent community for the domain, if any; if no parent is selected, the data domain becomes a top-level community
      collibra_domain: sample_domain_value # data domain owning the data product

If the data product has an input_port specified (meaning the data product is built on top of another data product), you must set up a DataOps Read API token and assign it to the DATAOPS_COLLIBRA_READ_API_TOKEN variable.
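As recommended in the parameters table, store the token in the vault and retrieve it in the job's variables block. A minimal sketch, assuming the token is stored under the vault key DATAOPS_COLLIBRA_TOKEN:

variables:
  # Read API token pulled from the vault rather than hard-coded
  DATAOPS_COLLIBRA_READ_API_TOKEN: DATAOPS_VAULT(DATAOPS_COLLIBRA_TOKEN)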

To properly link the two data products, the input port's project must also have a governance: collibra block in its data_product_definition.yml.

To specify the path to the data_product_definition.yml file, define the variable DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE and set it to the desired path. Because the Data Product orchestrator also uses this variable, setting its value globally is enough to make it available to both jobs.
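A minimal sketch of a global definition; the include file path is illustrative, so place it wherever your project declares project-wide variables:

pipelines/includes/config/variables.yml
variables:
  # Shared by the Data Product and Collibra orchestrator jobs
  DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: $CI_PROJECT_DIR/dataops/dataproduct/data_product_definition.yml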

For more information about data products in DataOps.live, see Data Product Pipelines. For more information about the Data Product orchestrator, see Data Products Administration.

Types of authentication

The Collibra orchestrator supports JWT or username/password authentication. If DATAOPS_COLLIBRA_USERNAME is not set, JWT authentication is attempted. For JWT authentication to succeed:

  • The DataOps username and Collibra username must be identical

  • The Collibra username should be a dedicated Collibra service account created for the integration, which has admin privileges

  • The following settings should be provided via the Collibra Console under Data Governance Center > Configuration > Security Configuration > JWT:

    Parameter | Value
    JSON Web Key Set URL | https://app.dataops.live/-/jwks
    JWT Token Types | JWT
    JWT Algorithms | RS256
    JWT Issuer | app.dataops.live
    JWT Principal ID Claim Name | user_login
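With JWT authentication, the job from the Usage section stays the same except that the username and password variables are omitted. A sketch of the reduced variables block:

variables:
  # No DATAOPS_COLLIBRA_USERNAME or DATAOPS_COLLIBRA_PASSWORD: JWT is attempted instead
  DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
  DATAOPS_COLLIBRA_IMPORT_FILE: sample.json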

Prerequisite package

The Collibra orchestrator relies on a pre-configured Collibra operating model that can be modified using the Collibra orchestrator variables.

To help you get started with the expected Collibra configurations of the operating model, we have created the Collibra archive file dataopslive_orchestrator.car, which you can import using the Collibra import functionality. All changes are imported in a separate DataOps.live scope.

Troubleshooting

If the orchestrator returns an error, visit your Collibra instance and open your profile's Activities page to see the list of imports with more specific information on the failure reason. This should help you report a possible error and eliminate any misconfigurations.