The Collibra orchestrator is a pre-set orchestrator that interacts with Collibra to publish metadata about the data transformed in a DataOps pipeline. In summary, this orchestrator provides a single-click interface to the Collibra service.
"Collibra Catalog Sync":
stage: "Catalog Sync"
name: "Export input and output files"
The Collibra orchestrator assumes that a DataOps modeling and transformation job, including the Generate Docs stage, has completed in an earlier stage of the DataOps pipeline. It uses the metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.
|REQUIRED||The Collibra organization where the dataset fits. The example value is "https://|
|REQUIRED if username/password authentication is chosen||Username for the Collibra service. If omitted, JWT authentication is attempted.|
|REQUIRED if ||Password to the Collibra service|
|Optional - defaults to ||Defines whether the import should continue if some of the import commands are invalid or failed to execute. If set to |
|Optional - defaults to ||Name of the Data Quality Community in Collibra|
|Optional - defaults to ||Name of the Data Quality Domain under |
|Optional - defaults to ||Name of the Data Quality Domain under |
|Optional - defaults to ||Waiting time from the time of sending the import request until the time a request is sent to fetch the status of the upload. It's recommended to increase the default value depending on the number of items imported.|
|Optional - defaults to ||The synchronization finalization strategy used in the clean-up action. This determines whether to remove, ignore, or change the status of assets that no longer exist in the external system. Possible values are |
|Optional - defaults to ||Name of the file uploaded to Collibra|
|Optional - defaults to ||If |
|Optional - defaults to ||Sets a Collibra Parent Community if desired|
|Optional - defaults to ||Defines whether existing relations are replaced or added/updated during a refresh. Allowed values are 
|Optional - defaults to ||Defines whether the job status notification should be sent by email|
|Optional - defaults to ||Defines whether the import should be triggered as a simulation. If set to |
|Optional - defaults to ||Path to the |
|Optional - defaults to ||Path to the output files of the MATE 
The Collibra orchestrator assumes that all the steps of the MATE job (run model, test, and generate docs) have completed in the pipeline. The orchestrator then uses the MATE results, specifically table-level lineage, including columns, descriptions, tests, and other metadata.
The orchestrator uses three intermediate files: catalog.json, manifest.json, and
run_results.json. These files must be located at
DATAOPS_PATH_TO_MATE_OUTPUT_FILES — the default path is
/dataops/modelling/ is a working directory of the standard MATE project. All files are required for the orchestrator to work.
The details of the intermediate files are as follows:
- catalog.json: has information from your data warehouse about the tables and views produced and defined by the resources in your project.
- manifest.json: has a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.
- run_results.json: has complete data about the results of tests that were run as well as the compilation status of your dbt project's resources.
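Because all three files are required, a quick presence check before the orchestrator runs can save a failed pipeline. The following is a minimal sketch, not part of the orchestrator itself; the helper name `missing_mate_files` is hypothetical, and the directory is expected to be whatever `DATAOPS_PATH_TO_MATE_OUTPUT_FILES` points to.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("catalog.json", "manifest.json", "run_results.json")


def missing_mate_files(output_dir):
    """Return the required MATE file names that are absent from output_dir.

    Hypothetical helper for illustration; the orchestrator performs its own
    checks, which may differ.
    """
    base = Path(output_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

A caller could read the directory from the `DATAOPS_PATH_TO_MATE_OUTPUT_FILES` environment variable and stop the job early when the returned list is not empty.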
Collibra import API
The Collibra import API has two operation modes: import and synchronization. The Collibra orchestrator uses the synchronization operation. This means the following:
The synchronization operation is intended to be used when you want to replicate the state of the external system, for example, a physical database schema. Typically you perform this operation regularly. You collect the metadata from the external system, such as database tables and columns, and upload the entire schema to Collibra. The Synchronization component makes sure that the Collibra data replicates exactly the external system. The difference between import and synchronization is that with the latter new assets aren't only added and existing ones updated, but also assets removed from the external system are removed from Collibra. Another difference is performance; synchronizing the identical data for the second time is faster than the simple import because Collibra stores the hash of every synchronized asset and uses it to decide if an asset needs to be updated. This is faster than comparing each attribute and relation individually.
You can learn more about the process in the Collibra Import API Documentation.
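To make the difference between import and synchronization concrete, here is a toy model of the behavior described above. It is only an illustration under stated assumptions: the function names, the SHA-256 hash, and the dictionary layout are invented for this sketch and do not reflect how Collibra implements synchronization internally. Removal is shown as the clean-up action, although the finalization strategy can also ignore or change the status of such assets.

```python
import hashlib
import json


def asset_hash(attributes):
    """Stable hash of an asset's attributes (key order does not matter)."""
    payload = json.dumps(attributes, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def synchronize(stored_hashes, external_assets):
    """Toy synchronization step; returns (new_hashes, actions).

    stored_hashes maps asset name -> hash kept from the previous run.
    external_assets maps asset name -> attribute dict from the external system.
    """
    actions = {"added": [], "updated": [], "unchanged": [], "removed": []}
    new_hashes = {}
    for name, attributes in external_assets.items():
        digest = asset_hash(attributes)
        new_hashes[name] = digest
        if name not in stored_hashes:
            actions["added"].append(name)
        elif stored_hashes[name] != digest:
            actions["updated"].append(name)
        else:
            # Same hash as last time: no attribute-by-attribute comparison needed.
            actions["unchanged"].append(name)
    # Unlike a plain import, assets gone from the external system are cleaned up.
    actions["removed"] = sorted(set(stored_hashes) - set(external_assets))
    return new_hashes, actions
```

Running the function twice with identical input classifies every asset as unchanged on the second pass, which mirrors why a repeated synchronization is faster than a repeated import.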
Community and domain names
The Collibra orchestrator can create a parent community if
DATAOPS_COLLIBRA_PARENT_COMMUNITY is provided. Under this community, it imports all other communities and domains. The name of the child community is derived from the name defined in dbt_project.yml,
MyProject in the example below. If
DATAOPS_COLLIBRA_PARENT_COMMUNITY is not provided, the name defined in
dbt_project.yml becomes the parent community, nesting all created domains.
model-paths: [models, sources]
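The naming rule above can be sketched as a small function. This is an illustration only: `resolve_communities` is a hypothetical helper, "Data Platform" is a made-up parent community name, and treating an empty value as "not provided" is an assumption of this sketch.

```python
def resolve_communities(project_name, parent_community=None):
    """Return (top-level community, child community or None).

    project_name is the name defined in dbt_project.yml; parent_community
    is the value of DATAOPS_COLLIBRA_PARENT_COMMUNITY, if any.
    """
    if parent_community:
        # The project community is nested under the configured parent.
        return parent_community, project_name
    # No parent configured: the project name itself is the top-level community.
    return project_name, None
```

For example, `resolve_communities("MyProject", "Data Platform")` places MyProject under Data Platform, while `resolve_communities("MyProject")` makes MyProject the top-level community.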
Under this community, or under the parent community if one is provided, we create the following domains:
- Physical Data Dictionary (one or more)
- Logical Data Dictionary
- Data Usage Registry
- Technology Asset Domain
By default, this orchestrator creates data quality metric assets that are related to the default Data Quality Dimension asset "Integrity", which is located in the
Data Governance Council community (full path is
Data Governance Council/Data Quality Dimensions/Integrity). As part of our prerequisite package, we also create the
DataOps.live Pipeline Data Quality Rules Catalog domain. If either of these is missing, the pipeline fails. These values are configurable, however: you can change the default values of
DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSION and thus point to a new object, which has to be created before running the orchestrator.
To summarize, during the integration, we rely on the following out-of-the-box custom objects to exist unless otherwise configured:
- Data Governance Council as a default value that can be configured for the object type
- Data Quality Dimensions as a default value that can be configured for the object type
- Data Governance Council/Data Quality Dimensions/Integrity as a default value that can be configured for the object type
- DataOps.live Pipeline Data Quality Rules Catalog as a default value that can be configured for the object type
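When you point the orchestrator at a different Data Quality Dimension, it can help to see how a full path breaks down. The sketch below assumes the path always has exactly three segments in the order community, domain, asset, which matches the default value shown above but is not confirmed as a general rule; `split_asset_path` is a hypothetical helper.

```python
def split_asset_path(full_path):
    """Split a 'community/domain/asset' path into its three parts.

    Raises ValueError if the path does not have exactly three segments.
    """
    parts = full_path.split("/")
    if len(parts) != 3:
        raise ValueError(f"Expected community/domain/asset, got: {full_path!r}")
    community, domain, asset = parts
    return {"community": community, "domain": domain, "asset": asset}
```

Applied to the default, `split_asset_path("Data Governance Council/Data Quality Dimensions/Integrity")` yields the community Data Governance Council, the domain Data Quality Dimensions, and the asset Integrity.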
Types of authentication
The Collibra orchestrator supports JWT or username/password authentication. If
DATAOPS_COLLIBRA_USERNAME is not set, JWT authentication is attempted. For JWT authentication to succeed:
The DataOps username and Collibra username must be identical
The Collibra username should be a dedicated Collibra service account created for the integration, which has admin privileges
The following settings should be provided via the Collibra Console under
Data Governance Center>
JSON Web Key Set URL
JWT Token Types
JWT Principal ID Claim Name
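The selection between the two authentication types can be sketched as follows. This is not the orchestrator's code: `select_auth_mode` is a hypothetical helper, and treating an empty `DATAOPS_COLLIBRA_USERNAME` the same as an unset one is an assumption of this sketch.

```python
def select_auth_mode(env):
    """Return "username/password" if a username is configured, else "jwt".

    env is any mapping of variable names to values, for example os.environ.
    """
    if env.get("DATAOPS_COLLIBRA_USERNAME"):
        return "username/password"
    # No username configured: fall back to JWT authentication.
    return "jwt"
```

Passing `os.environ` as `env` inside a job shows which authentication type the current variable set would lead to.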
The Collibra orchestrator relies on a pre-configured Collibra operating model that can be modified using the Collibra orchestrator variables.
To help you get started with the expected Collibra configurations of the operating model, we have created the Collibra Archive file
dataopslive_orchestrator.car that you can import using the Collibra Import functionality. All changes are imported in a separate DataOps.live scope.
If the orchestrator returns an error, it is advised to visit your Collibra instance and click
profile/activities/ to open up the list of imports with more specific information on the failure reason. This should help you report a possible error and eliminate any misconfigurations.