
Collibra Orchestrator

Enterprise

Image: $DATAOPS_COLLIBRA_RUNNER_IMAGE

The Collibra orchestrator interacts with Collibra to publish metadata about the data transformed in a DataOps pipeline. In short, it provides a single-click interface to the Collibra service. For more information about the prerequisites and the Collibra Operating Model, see the documentation on the Collibra Marketplace.

Usage

pipelines/includes/local_includes/collibra_jobs/collibra.yml
"Collibra Catalog Sync":
extends:
- .agent_tag
stage: "Catalog Sync"
image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
script:
- /dataops
icon: ${COLLIBRA_ICON}
variables:
DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
DATAOPS_COLLIBRA_USERNAME: DATAOPS_VAULT(DATAOPS_COLLIBRA_USERNAME)
DATAOPS_COLLIBRA_PASSWORD: DATAOPS_VAULT(DATAOPS_COLLIBRA_PASSWORD)
DATAOPS_COLLIBRA_IMPORT_FILE: sample.json
artifacts:
name: "Export input and output files"
when: always
paths:
- $CI_PROJECT_DIR/$DATAOPS_COLLIBRA_IMPORT_FILE
- $CI_PROJECT_DIR/manifest.json
- $CI_PROJECT_DIR/run_results.json
- $CI_PROJECT_DIR/catalog.json

The Collibra orchestrator assumes that a DataOps modeling and transformation job, including the Generate Docs stage, has completed earlier in the DataOps pipeline. It uses the metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.
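For example, a stage configuration can order the pipeline so that the catalog sync runs after modeling, testing, and docs generation have finished. A minimal sketch; the file path and the stage names other than "Catalog Sync" are illustrative and depend on your project:

pipelines/includes/config/stages.yml
stages:
  - "Modelling and Transformation"   # MATE run and test jobs
  - "Generate Docs"                  # produces the metadata the orchestrator reads
  - "Catalog Sync"                   # the Collibra job shown above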

Supported parameters

Parameter | Required/Default | Description
DATAOPS_COLLIBRA_HOST | REQUIRED | The Collibra organization hosting the dataset, for example "https://client_name.collibra.com/rest/2.0". You can find your client_name in the URL you use to access Collibra.
DATAOPS_COLLIBRA_USERNAME | REQUIRED if username/password authentication is chosen | Username for the Collibra service. If omitted, JWT authentication is performed.
DATAOPS_COLLIBRA_PASSWORD | REQUIRED if DATAOPS_COLLIBRA_USERNAME is set | Password for the Collibra service
DATAOPS_COLLIBRA_CONTINUE_ON_ERROR | Optional. Defaults to False | Defines whether the import should continue if some of the import commands are invalid or fail to execute. If set to True, the valid commands are still committed to the database, which can lead to partial results being stored.
DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY | Optional. Defaults to Data Governance Council | Name of the Data Quality community in Collibra
DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN | Optional. Defaults to Data Quality Dimensions | Name of the Data Quality Dimensions domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra
DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN | Optional. Defaults to DataOps.live Pipeline Data Quality Rules Catalog | Name of the Data Quality domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra
DATAOPS_COLLIBRA_DO_NOT_GENERATE_GLOSSARY | Optional. Defaults to None | Defines whether to skip generating the glossary. If unset, the glossary is available at the end of the job.
DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP | Optional. Defaults to 30 (seconds) | The time to wait between initiating the import request and requesting the upload status. Increasing the default value is recommended depending on the number of items imported.
DATAOPS_COLLIBRA_FINALIZATION_STRATEGY | Optional. Defaults to CHANGE_STATUS | The synchronization finalization strategy used in the clean-up action. It determines whether to remove, ignore, or change the status of assets that no longer exist in the external system. Possible values are REMOVE_RESOURCES, CHANGE_STATUS, and IGNORE. When you select CHANGE_STATUS, you must also provide a value for DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID.
DATAOPS_COLLIBRA_IMPORT_FILE | Optional. Defaults to sample.json | Name of the file uploaded to Collibra
DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID | Optional. Defaults to 00000000-0000-0000-0000-000000005011 (Obsolete) | If DATAOPS_COLLIBRA_FINALIZATION_STRATEGY is set to CHANGE_STATUS, this parameter determines the new status ID for assets that no longer exist in the external system
DATAOPS_COLLIBRA_PARENT_COMMUNITY | Optional. Defaults to None | Sets a Collibra parent community if desired
DATAOPS_COLLIBRA_RELATIONS_ACTION | Optional. Defaults to ADD_OR_IGNORE | Defines whether existing relations are replaced, or added to and kept, during a refresh. Allowed values are ADD_OR_IGNORE and REPLACE.
DATAOPS_COLLIBRA_READ_API_TOKEN | Optional. Defaults to None | The token required if the generated data product of the project has an input port. It is recommended to store the token in the vault and retrieve it like this: DATAOPS_VAULT(DATAOPS_COLLIBRA_TOKEN).
DATAOPS_COLLIBRA_SEND_NOTIFICATION | Optional. Defaults to False | Defines whether the job status notification should be sent by email
DATAOPS_COLLIBRA_SIMULATION | Optional. Defaults to False | Defines whether the import should be triggered as a simulation. If set to True, the result of the import simulation is available at the end of the job, but no change is applied to the Data Governance Center (DGC).
DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE | Optional. Defaults to None | Path to the data_product_definition.yml file. If a data product is generated, set this to the path of that file.
DATAOPS_PATH_TO_MATE_OUTPUT_FILES | Optional. Defaults to ./dataops/modelling/target | Path to the output files of the MATE Generate Docs job
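Any of these defaults can be overridden in the job's variables block. A minimal sketch, assuming you want synchronization to remove assets that have disappeared from the external system and to wait longer before fetching results; the values shown are illustrative:

variables:
  DATAOPS_COLLIBRA_FINALIZATION_STRATEGY: REMOVE_RESOURCES  # instead of the default CHANGE_STATUS
  DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP: 60                  # wait longer for large imports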

Project resources

The Collibra orchestrator assumes that all the steps of the MATE job (run model, test, and generate docs) have run and finalized in the pipeline. The orchestrator then uses the MATE results, specifically table-level lineage, including columns, descriptions, tests, and other metadata.

The orchestrator uses three intermediate files: catalog.json, manifest.json, and run_results.json. These files must be located at DATAOPS_PATH_TO_MATE_OUTPUT_FILES; the default path is ./dataops/modelling/target, where /dataops/modelling/ is the working directory of a standard MATE project. All three files are required for the orchestrator to work. If your MATE job writes them elsewhere, override the path as shown after the list below.

The details of the intermediate files are as follows:

  • catalog.json: has information from your data warehouse about the tables and views produced and defined by the resources in your project.

  • manifest.json: has a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.

  • run_results.json: has complete data about the results of tests that were run as well as the compilation status of your dbt project's resources.
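A minimal sketch of pointing the orchestrator at a non-default target directory (the path is illustrative):

variables:
  # Must contain catalog.json, manifest.json, and run_results.json
  DATAOPS_PATH_TO_MATE_OUTPUT_FILES: $CI_PROJECT_DIR/custom/modelling/target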

Collibra import API

The Collibra import API has two operation modes: import and synchronization. The Collibra orchestrator uses the synchronization operation, which means the following:

The synchronization operation is intended for replicating the state of an external system, for example a physical database schema. Typically you perform this operation regularly: you collect the metadata from the external system, such as database tables and columns, and upload the entire schema to Collibra. The synchronization component makes sure that the data in Collibra exactly replicates the external system.

The difference between import and synchronization is that with synchronization, new assets are not only added and existing ones updated, but assets removed from the external system are also removed from Collibra. Another difference is performance: synchronizing identical data a second time is faster than a simple import, because Collibra stores the hash of every synchronized asset and uses it to decide whether an asset needs to be updated. This is faster than comparing each attribute and relation individually.

You can learn more about the process in the Collibra Import API Documentation.
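Because synchronization can remove assets, it can be useful to preview a run before committing it. A minimal sketch using the simulation parameter from the table above; the simulation result is available at the end of the job, and no change is applied to the Data Governance Center:

variables:
  DATAOPS_COLLIBRA_SIMULATION: True  # dry run; inspect the result before a real sync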

Community and domain names

The Collibra orchestrator can create a parent community if DATAOPS_COLLIBRA_PARENT_COMMUNITY is provided. Under this community, it imports all other communities and domains. The name of the child community is derived from the name defined in dbt_project.yml, MyProject in the example below. If DATAOPS_COLLIBRA_PARENT_COMMUNITY is not provided, the name defined in dbt_project.yml becomes the parent community, nesting all created domains.

dataops/modelling/dbt_project.yml
## Project
name: MyProject
version: 0.1
config-version: 2

## Sources
model-paths: [models, sources]
analysis-paths: [analysis]
test-paths: [tests]
seed-paths: [seeds]
macro-paths: [macros]
snapshot-paths: [snapshots]
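For example, to nest the MyProject community under an existing parent community, set the variable in the Collibra job (the community name is illustrative):

variables:
  DATAOPS_COLLIBRA_PARENT_COMMUNITY: "Enterprise Data Community"  # MyProject becomes a child community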

Under this community, or under the parent community if one is provided, we create the following domains:

  • Physical Data Dictionary (one or more)
  • Logical Data Dictionary
  • Data Usage Registry
  • Technology Asset Domain
  • Rulebook
  • Glossary

By default, this orchestrator creates data quality metric assets related to the default Data Quality Dimension asset "Integrity", which is located in the Data Governance Council community (the full path is Data Governance Council/Data Quality Dimensions/Integrity). As part of our prerequisite package, we also create the DataOps.live Pipeline Data Quality Rules Catalog domain. If either of those is missing, the pipeline fails. These values are configurable, however: you can change the default values of DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY, DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN, and DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN to point to a different object, which has to be created before running the orchestrator (see the sketch after the list below).

To summarize, during the integration, we rely on the following out-of-the-box custom objects to exist unless otherwise configured:

  • Data Governance Council as a default value that can be configured for the object type Community
  • Data Quality Dimensions as a default value that can be configured for the object type Domain
  • Data Governance Council/Data Quality Dimensions/Integrity as a default value that can be configured for the object type Asset
  • DataOps.live Pipeline Data Quality Rules Catalog as a default value that can be configured for the object type Domain
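A minimal sketch of pointing the orchestrator at alternative data quality objects; all names are illustrative, and the objects must exist in Collibra before the orchestrator runs:

variables:
  DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY: "My Governance Community"          # instead of Data Governance Council
  DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN: "My Quality Dimensions"    # instead of Data Quality Dimensions
  DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN: "My Pipeline Quality Rules Catalog"   # instead of the DataOps.live default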

Using Collibra orchestrator together with the Data Product orchestrator

The Collibra orchestrator acts as a centralized hub for data management, providing a comprehensive view of data assets and their associated metadata. It allows organizations to define, automate, and enforce data governance policies, ensuring data quality, compliance, and security throughout the data lifecycle. With the emphasis of DataOps.live on agile and automated data operations, it becomes crucial to integrate data products into the Collibra ecosystem.

In DataOps.live, data products are generated by using the Data Product orchestrator in our CI/CD process. These data products contain valuable insights and transformations applied to raw data and are typically represented as YAML configurations, accompanied by various parameters that govern their behavior. To fully leverage the potential of these data products and transfer them to Collibra, it is vital to integrate them with the Collibra orchestrator, allowing for centralized governance and visibility.

The integration process involves building a robust mechanism that reads the DataOps data product and parses it into a Collibra object. This parsed object serves as a representation of the data product within the Collibra ecosystem, capturing its essential attributes, such as lineage, metadata, and associated business rules. Additionally, the integration ensures that the parameters governing the data product's behavior are correctly mapped and synchronized with the Collibra orchestrator, enabling consistent governance across the organization. By integrating the Collibra orchestrator with DataOps.live data products, you can achieve the following benefits:

  • You gain a holistic view of your data landscape, promoting transparency and understanding of data flows and dependencies. This visibility enhances data governance, enabling your organization to track and manage the lineage of data products, understand their impact on downstream processes, and ensure compliance with regulatory requirements.
  • You can effectively manage and monitor the behavior and performance of these products. This allows for better coordination between data engineering teams, data scientists, data stewards, and business stakeholders, resulting in improved collaboration, data-driven decision-making, and optimized data product delivery.

To start using the integration between the Data Product orchestrator and Collibra orchestrator, you have to:

  • Make sure the Collibra job runs after the Data Product orchestrator job.
  • Set DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE to the path of the data_product_definition.yml file.
  • Make sure that your data_product_definition.yml file has a governance: collibra block. An example configuration looks like this:
dataops/dataproduct/data_product_definition.yml
id: twp
name: Operational data
description: All orders and customers data that are coming from the source system
schema_version: 1.0.0
governance:
  - collibra:
      collibra_community: sample_community_value # parent community for the domain, if any; if no parent is selected, the data domain becomes a top-level community
      collibra_domain: sample_domain_value # data domain owning the data product

If the data product has an input_port specified (meaning the data product is built on top of another data product), you must set up a DataOps Read API token and assign it to the DATAOPS_COLLIBRA_READ_API_TOKEN variable.
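As recommended in the parameters table, store the token in the vault and retrieve it in the job's variables block. A minimal sketch, assuming the token is stored under the vault key DATAOPS_COLLIBRA_TOKEN:

variables:
  # Read API token pulled from the vault rather than hard-coded
  DATAOPS_COLLIBRA_READ_API_TOKEN: DATAOPS_VAULT(DATAOPS_COLLIBRA_TOKEN)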

To properly link the two data products, the input port's project must also have a governance: collibra block in its data_product_definition.yml.

To specify the path to the data_product_definition.yml file, define the variable DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE and set it to the desired path. Because the Data Product orchestrator also uses this variable, setting its value globally is enough to make it available to both jobs.
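A minimal sketch of a global definition; the include file path is illustrative, so place it wherever your project declares project-wide variables:

pipelines/includes/config/variables.yml
variables:
  # Shared by the Data Product and Collibra orchestrator jobs
  DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: $CI_PROJECT_DIR/dataops/dataproduct/data_product_definition.yml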

For more information about data products in DataOps.live, see Data Product Pipelines. For more information about the Data Product orchestrator, see Data Products Administration.

Types of authentication

The Collibra orchestrator supports JWT or username/password authentication. If DATAOPS_COLLIBRA_USERNAME is not set, JWT authentication is attempted. For JWT authentication to succeed:

  • The DataOps username and Collibra username must be identical

  • The Collibra username should be a dedicated Collibra service account created for the integration, which has admin privileges

  • The following settings should be provided via the Collibra Console under Data Governance Center > Configuration > Security Configuration > JWT:

    Parameter | Value
    JSON Web Key Set URL | https://app.dataops.live/-/jwks
    JWT Token Types | JWT
    JWT Algorithms | RS256
    JWT Issuer | app.dataops.live
    JWT Principal ID Claim Name | user_login
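With JWT authentication, the job from the Usage section stays the same except that the username and password variables are omitted. A sketch of the reduced variables block:

variables:
  # No DATAOPS_COLLIBRA_USERNAME or DATAOPS_COLLIBRA_PASSWORD: JWT is attempted instead
  DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
  DATAOPS_COLLIBRA_IMPORT_FILE: sample.json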

Prerequisite package

The Collibra orchestrator relies on a pre-configured Collibra operating model that can be modified using the Collibra orchestrator variables.

To help you get started with the expected Collibra configurations of the operating model, we have created the Collibra archive file dataopslive_orchestrator.car, which you can import using the Collibra import functionality. All changes are imported in a separate DataOps.live scope.

Troubleshooting

If the orchestrator returns an error, visit your Collibra instance and open your profile's Activities page to see the list of imports with more specific information on the failure reason. This should help you report a possible error and eliminate any misconfigurations.