Collibra Orchestrator
Enterprise
Image | $DATAOPS_COLLIBRA_RUNNER_IMAGE |
---|---|
The Collibra orchestrator interacts with Collibra to publish metadata about the data transformed in a DataOps pipeline. In short, this orchestrator provides a single-click interface to the Collibra service. For more information about the prerequisites and the Collibra Operating Model, see the documentation on the Collibra Marketplace.
Usage
"Collibra Catalog Sync":
extends:
- .agent_tag
stage: "Catalog Sync"
image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
script:
- /dataops
icon: ${COLLIBRA_ICON}
variables:
DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
DATAOPS_COLLIBRA_USERNAME: DATAOPS_VAULT(DATAOPS_COLLIBRA_USERNAME)
DATAOPS_COLLIBRA_PASSWORD: DATAOPS_VAULT(DATAOPS_COLLIBRA_PASSWORD)
DATAOPS_COLLIBRA_IMPORT_FILE: sample.json
artifacts:
name: "Export input and output files"
when: always
paths:
- $CI_PROJECT_DIR/$DATAOPS_COLLIBRA_IMPORT_FILE
- $CI_PROJECT_DIR/manifest.json
- $CI_PROJECT_DIR/run_results.json
- $CI_PROJECT_DIR/catalog.json
The Collibra orchestrator assumes that a DataOps modeling and transformation job completed running — including the Generate Docs stage — in an earlier stage of the DataOps pipeline. It uses the metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.
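For illustration, the required ordering can be sketched in the pipeline's stage list. The stage names below are hypothetical and should be adapted to your project:

```yaml
# Hypothetical stage order: the MATE jobs, including Generate Docs,
# must complete before the Collibra catalog sync job runs.
stages:
  - "Data Transformation"
  - "Generate Docs"
  - "Catalog Sync"  # the Collibra orchestrator job runs here, last
```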
Supported parameters
Parameter | Required/Default | Description |
---|---|---|
DATAOPS_COLLIBRA_HOST | REQUIRED | The REST API endpoint of the Collibra organization that owns the dataset. An example value is "https://client_name.collibra.com/rest/2.0". You can find your client_name in the URL you use to access Collibra. |
DATAOPS_COLLIBRA_USERNAME | REQUIRED if username/password authentication is chosen | Username for the Collibra service. If omitted, JWT authentication is performed. |
DATAOPS_COLLIBRA_PASSWORD | REQUIRED if DATAOPS_COLLIBRA_USERNAME is set | Password for the Collibra service |
DATAOPS_COLLIBRA_CONTINUE_ON_ERROR | Optional. Defaults to False | Defines whether the import should continue if some of the import commands are invalid or failed to execute. If set to True , the valid commands are still committed to the database, which can lead to partial results being stored. |
DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY | Optional. Defaults to Data Governance Council | Name of the Data Quality Community in Collibra |
DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN | Optional. Defaults to Data Quality Dimensions | Name of the Data Quality Dimensions Domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra |
DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN | Optional. Defaults to DataOps.live Pipeline Data Quality Rules Catalog | Name of the Data Quality Domain under DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY in Collibra |
DATAOPS_COLLIBRA_DO_NOT_GENERATE_GLOSSARY | Optional. Defaults to None | Defines whether to skip glossary generation. If unset, the glossary is generated and available at the end of the job. |
DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP | Optional. Defaults to 30 (seconds) | The time elapsed between initiating the import request and triggering the request to fetch the upload status. It's recommended to increase the default value depending on the number of items imported. |
DATAOPS_COLLIBRA_FINALIZATION_STRATEGY | Optional. Defaults to CHANGE_STATUS | The synchronization finalization strategy used in the clean-up action. This determines whether to remove, ignore, or change the status of assets that no longer exist in the external system. Possible values are REMOVE_RESOURCES , CHANGE_STATUS , and IGNORE . When you select CHANGE_STATUS , you must also provide a value for DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID . |
DATAOPS_COLLIBRA_IMPORT_FILE | Optional. Defaults to sample.json | Name of the file uploaded to Collibra |
DATAOPS_COLLIBRA_MISSING_ASSET_STATUS_ID | Optional. Defaults to 00000000-0000-0000-0000-000000005011 (Obsolete) | If DATAOPS_COLLIBRA_FINALIZATION_STRATEGY is set to CHANGE_STATUS , this parameter determines the new status ID for assets that no longer exist in the external system |
DATAOPS_COLLIBRA_PARENT_COMMUNITY | Optional. Defaults to None | Sets a Collibra Parent Community if desired |
DATAOPS_COLLIBRA_RELATIONS_ACTION | Optional. Defaults to ADD_OR_IGNORE | Defines whether existing relations are replaced, or added/updated if any, during a refresh. Allowed values are ADD_OR_IGNORE and REPLACE . |
DATAOPS_COLLIBRA_READ_API_TOKEN | Optional. Defaults to None | The token required if the generated data product of the project has an input port. It is recommended to store the token in the vault and retrieve it like this: DATAOPS_VAULT(DATAOPS_COLLIBRA_TOKEN) . |
DATAOPS_COLLIBRA_SEND_NOTIFICATION | Optional. Defaults to False | Defines whether the job status notification should be sent by email |
DATAOPS_COLLIBRA_SIMULATION | Optional. Defaults to False | Defines whether the import should be triggered as a simulation. If set to True , the result of the import simulation will be available at the end of the job, but no change will be applied to the Data Governance Center (DGC). |
DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE | Optional. Defaults to None . If a data product is generated, set this to the path of the data_product_definition.yml file | Path to the data_product_definition.yml file |
DATAOPS_PATH_TO_MATE_OUTPUT_FILES | Optional. Defaults to ./dataops/modelling/target | Path to the output files of the MATE generate docs job |
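As a sketch, the optional parameters from the table above can be overridden in the job's variables block. The example below shows a simulated import with a longer polling delay; the values are illustrative:

```yaml
"Collibra Catalog Sync":
  stage: "Catalog Sync"
  image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
  script:
    - /dataops
  variables:
    DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
    DATAOPS_COLLIBRA_SIMULATION: "True"         # dry run: no changes applied to the DGC
    DATAOPS_COLLIBRA_FETCH_RESULTS_SLEEP: 60    # wait longer before polling the upload status
    DATAOPS_COLLIBRA_CONTINUE_ON_ERROR: "True"  # commit valid commands even if some fail
```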
Project resources
The Collibra orchestrator assumes that all the steps of the MATE job (run model, test, and generate docs) have run and finalized in the pipeline. The orchestrator then uses the MATE results, specifically table-level lineage, including columns, descriptions, tests, and other metadata.
The orchestrator uses three intermediate files: catalog, manifest, and run_results. These files must be located at DATAOPS_PATH_TO_MATE_OUTPUT_FILES. The default path is /dataops/modelling/target, where /dataops/modelling/ is the working directory of a standard MATE project. All three files are required for the orchestrator to work.
The details of the intermediate files are as follows:
- catalog.json: has information from your data warehouse about the tables and views produced and defined by the resources in your project.
- manifest.json: has a complete representation of your dbt project's resources (models, tests, macros, etc.), including all node configurations and resource properties.
- run_results.json: has complete data about the results of the tests that were run, as well as the compilation status of your dbt project's resources.
Collibra import API
Collibra import API has two operation modes: import and synchronization. The Collibra orchestrator uses the synchronization operation. This means the following:
The synchronization operation is intended to be used when you want to replicate the state of an external system, for example, a physical database schema. Typically, you perform this operation regularly: you collect the metadata from the external system, such as database tables and columns, and upload the entire schema to Collibra. The synchronization component makes sure that the Collibra data exactly replicates the external system.

The difference between import and synchronization is that with synchronization, new assets aren't only added and existing ones updated; assets removed from the external system are also removed from Collibra. Another difference is performance: synchronizing identical data a second time is faster than a simple import, because Collibra stores a hash of every synchronized asset and uses it to decide whether an asset needs to be updated. This is faster than comparing each attribute and relation individually.
You can learn more about the process in the Collibra Import API Documentation.
Community and domain names
The Collibra orchestrator can create a parent community if DATAOPS_COLLIBRA_PARENT_COMMUNITY is provided. Under this community, it imports all other communities and domains. The name of the child community is derived from the name defined in dbt_project.yml (MyProject in the example below). If no DATAOPS_COLLIBRA_PARENT_COMMUNITY is provided, the community defined in dbt_project.yml becomes the parent community, nesting all created domains.
```yaml
## Project
name: MyProject
version: 0.1
config-version: 2

## Sources
model-paths: [models, sources]
analysis-paths: [analysis]
test-paths: [tests]
seed-paths: [seeds]
macro-paths: [macros]
snapshot-paths: [snapshots]
```
Under this community, or under the parent community if one is provided, we create the following domains:
- Physical Data Dictionary (one or more)
- Logical Data Dictionary
- Data Usage Registry
- Technology Asset Domain
- Rulebook
- Glossary
By default, this orchestrator creates data quality metric assets that are related to the default Data Quality Dimension asset "Integrity", which is located in the Data Governance Council community (the full path is Data Governance Council/Data Quality Dimensions/Integrity). As part of our prerequisite package, we also create the DataOps.live Pipeline Data Quality Rules Catalog domain. If either of those is missing, the pipeline will fail. That said, these values are configurable: it is possible to change the default values of DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY, DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN, and DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSION, and thus point to a new object, which has to be created before running the orchestrator.
To summarize, during the integration, we rely on the following out-of-the-box custom objects to exist unless otherwise configured:
- Data Governance Council as a default value that can be configured for the object type Community
- Data Quality Dimensions as a default value that can be configured for the object type Domain
- Data Governance Council/Data Quality Dimensions/Integrity as a default value that can be configured for the object type Asset
- DataOps.live Pipeline Data Quality Rules Catalog as a default value that can be configured for the object type Domain
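For example, a sketch of repointing the data quality defaults to objects of your own. The names below are placeholders, and the objects they name must already exist in Collibra before the orchestrator runs:

```yaml
variables:
  DATAOPS_COLLIBRA_DATA_QUALITY_COMMUNITY: "My Governance Community"   # placeholder community name
  DATAOPS_COLLIBRA_DATA_QUALITY_DIMENSIONS_DOMAIN: "My DQ Dimensions"  # placeholder domain name
  DATAOPS_COLLIBRA_DATA_QUALITY_DOMAIN: "My DQ Rules Catalog"          # placeholder domain name
```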
Using Collibra orchestrator together with the Data Product orchestrator
The Collibra orchestrator acts as a centralized hub for data management, providing a comprehensive view of data assets and their associated metadata. It allows organizations to define, automate, and enforce data governance policies, ensuring data quality, compliance, and security throughout the data lifecycle. With the emphasis of DataOps.live on agile and automated data operations, it becomes crucial to integrate data products into the Collibra ecosystem.
In DataOps.live, data products are generated by using the Data Product orchestrator in our CI/CD process. These data products contain valuable insights and transformations applied to raw data and are typically represented as yaml
configurations, accompanied by various parameters that govern their behavior. To fully leverage the potential of these data products and transfer them to Collibra, it is vital to integrate them with the Collibra orchestrator, allowing for centralized governance and visibility.
The integration process involves building a robust mechanism that reads the DataOps data product
and parses it into a Collibra object. This parsed object serves as a representation of the data product
within the Collibra ecosystem, capturing its essential attributes, such as lineage, metadata, and associated business rules. Additionally, the integration ensures that the parameters governing the data product's behavior are correctly mapped and synchronized with the Collibra orchestrator, enabling consistent governance across the organization.
By integrating the Collibra orchestrator with DataOps.live data products, you can achieve the following benefits:
- You gain a holistic view of your data landscape, promoting transparency and understanding of data flows and dependencies. This visibility enhances data governance, enabling your organization to track and manage the lineage of data products, understand their impact on downstream processes, and ensure compliance with regulatory requirements.
- You can effectively manage and monitor the behavior and performance of these products. This allows for better coordination between data engineering teams, data scientists, data stewards, and business stakeholders, resulting in improved collaboration, data-driven decision-making, and optimized data product delivery.
To start using the integration between the Data Product orchestrator and Collibra orchestrator, you have to:
- Make sure the Collibra job runs after the Data Product orchestrator job.
- Set DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE to the path of the data_product_definition.yml file.
- Make sure that your data_product_definition.yml file has a governance: collibra block.

An example configuration looks like this:
```yaml
id: twp
name: Operational data
description: All orders and customers data that are coming from the source system
schema_version: 1.0.0
governance:
  - collibra:
      collibra_community: sample_community_value # parent community for the domain, if any; if no parent is selected, the data domain becomes a top-level community
      collibra_domain: sample_domain_value # data domain owning the data product
```
If the data product has an input_port specified (meaning the data product is a product of another data product), you have to set up a DataOps ReadApi token and assign it to the variable DATAOPS_COLLIBRA_READ_API_TOKEN.

To properly link the two data products, a governance: collibra block is required in the input port's project data_product_definition.yml.
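Following the parameter table above, the token is best stored in the vault and referenced from there:

```yaml
variables:
  # Retrieve the ReadApi token from the DataOps vault rather than hardcoding it
  DATAOPS_COLLIBRA_READ_API_TOKEN: DATAOPS_VAULT(DATAOPS_COLLIBRA_TOKEN)
```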
To specify the path to the data_product_definition.yml file, define the variable DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE and set it to the desired path. Because the Data Product orchestrator uses the same variable, setting its value globally is enough to make it available to both orchestrators.
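A sketch of such a global setting, assuming a hypothetical file location inside the project:

```yaml
# In a globally included variables block; the path below is hypothetical.
variables:
  DATAOPS_DATA_PRODUCT_SPECIFICATION_FILE: $CI_PROJECT_DIR/dataops/data_product_definition.yml
```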
Types of authentication
The Collibra orchestrator supports JWT or username/password authentication. If DATAOPS_COLLIBRA_USERNAME is not set, JWT authentication is attempted. For a successful JWT authentication:
- The DataOps username and the Collibra username must be identical.
- The Collibra username should be a dedicated Collibra service account created for the integration, with admin privileges.
- The following settings should be provided via the Collibra Console under Data Governance Center > Configuration > Security Configuration > JWT:

Parameter | Value |
---|---|
JSON Web Key Set URL | https://app.dataops.live/-/jwks |
JWT Token Types | JWT |
JWT Algorithms | RS256 |
JWT Issuer | app.dataops.live |
JWT Principal ID Claim Name | user_login |
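With those settings in place, JWT authentication is simply a matter of omitting the username and password variables from the job, for example:

```yaml
"Collibra Catalog Sync":
  stage: "Catalog Sync"
  image: $DATAOPS_COLLIBRA_RUNNER_IMAGE
  script:
    - /dataops
  variables:
    DATAOPS_COLLIBRA_HOST: "https://dataops.collibra.com/rest/2.0"
    # DATAOPS_COLLIBRA_USERNAME and DATAOPS_COLLIBRA_PASSWORD are omitted,
    # so the orchestrator attempts JWT authentication.
```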
Prerequisite package
The Collibra orchestrator relies on a pre-configured Collibra operating model that can be modified using the Collibra orchestrator variables.
To help you get started with the expected Collibra configuration of the operating model, we have created the Collibra Archive file dataopslive_orchestrator.car, which you can import using the Collibra Import functionality. All changes are imported into a separate DataOps.live scope.
Troubleshooting
If the orchestrator returns an error, visit your Collibra instance and click profile/activities/ to open the list of imports, which includes more specific information on the failure reason. This should help you report the error and eliminate any misconfiguration.