Informatica Cloud Data Governance and Catalog (CDGC) Orchestrator
Enterprise
Image | $DATAOPS_INFORMATICA_CDGC_RUNNER_IMAGE |
---|---|
Feature Status | PubPrev |
The Informatica Cloud Data Governance and Catalog (CDGC) orchestrator interacts with Informatica Cloud to publish metadata about the data transformed in a DataOps pipeline.
Usage
"Informatica CDGC Orchestrator":
  extends:
    - .agent_tag
  stage: "Informatica CDGC Orchestrator"
  image: $DATAOPS_INFORMATICA_CDGC_RUNNER_IMAGE
  variables:
    DATAOPS_INFORMATICA_CDGC_URL: "https://<your_org>.informaticacloud.com"
    DATAOPS_INFORMATICA_CDGC_USERNAME: DATAOPS_VAULT(INFORMATICA.CDGC.USERNAME)
    DATAOPS_INFORMATICA_CDGC_PASSWORD: DATAOPS_VAULT(INFORMATICA.CDGC.PASSWORD)
    # your dbt catalog source ID
    DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID: "<update this>"
    # match this with your dbt catalog source configuration
    DATAOPS_INFORMATICA_CDGC_DBT_MANIFEST_FILEPATH: "/home/ubuntu/dbtmanifestpath/manifest.json"
    DATAOPS_INFORMATICA_CDGC_FILETRANSFER: "scp"
    DATAOPS_INFORMATICA_CDGC_SCP_SSH_HOST: DATAOPS_VAULT(INFORMATICA.CDGC.SSH_HOST)
    DATAOPS_INFORMATICA_CDGC_SCP_SSH_USER: ubuntu
    DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY: DATAOPS_VAULT(INFORMATICA.CDGC.SSH_PRIVATE_KEY)
  script:
    - /dataops
  icon: ${INFORMATICA_ICON}
The Informatica CDGC orchestrator assumes that a DataOps modeling and transformation job, including the Generate Docs stage, has completed in an earlier stage of the DataOps pipeline. It uses the resulting metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.
Prerequisites
For the orchestrator to work, you must use MATE with dbt version 1.7 or later. See switching dbt versions for details.
Supported parameters
Parameter | Required/Default | Description |
---|---|---|
DATAOPS_INFORMATICA_CDGC_URL | REQUIRED | Organization's Informatica Cloud URL, e.g. https://dm-us.informaticacloud.com/ for Americas or https://dm-em.informaticacloud.com/ for Europe. |
DATAOPS_INFORMATICA_CDGC_USERNAME | REQUIRED | Username required to log in to the Informatica organization account. |
DATAOPS_INFORMATICA_CDGC_PASSWORD | REQUIRED | Password required to log in to the Informatica organization account. |
DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID | REQUIRED | dbt Catalog Source ID from organization's Informatica Metadata Command Center. |
DATAOPS_INFORMATICA_CDGC_DBT_MANIFEST_FILEPATH | REQUIRED | File path, accessible to both the Informatica Secure Agent and the DataOps runner, where the DataOps runner stores the dbt manifest file. |
DATAOPS_INFORMATICA_CDGC_FILETRANSFER | Optional, defaults to fs | Method used to transfer the dbt manifest file from the DataOps runner to the Informatica Cloud Secure Agent. Accepted values are fs, when the file path used by the Informatica Cloud Secure Agent is directly accessible to the DataOps runner, or scp, for a secure file transfer using SSH. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_HOST | Optional, required for scp | Host name or IP address of the SSH server to which the dbt manifest file is transferred. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_PORT | Optional, required for scp, defaults to 22 | Port of the SSH server to which the dbt manifest file is transferred. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_USER | Optional, required for scp | Username on the SSH server for scp. Normally the operating system user the Informatica Secure Agent runs as. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY | Optional, required for scp | The content of the private key for SCP file transfer, retrieved from the DataOps vault. If you use this parameter, don't use DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY_FILENAME. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY_FILENAME | Optional for scp | Private key file name for SCP file transfer. The file is stored as a secure file outside the git repository. If you use this parameter, don't use DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_SERVER_FINGERPRINT | Optional for scp | SSH server fingerprint to trust. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_PASSPHRASE | Optional for scp | Passphrase to decrypt the SSH private key. |
DATAOPS_INFORMATICA_CDGC_SCP_SSH_KNOWN_HOSTS_FILEPATH | Optional for scp | File path of the known hosts file containing the SSH server host to trust. |
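The rules in the table can be checked before the job runs. The following is a minimal sketch and not part of the orchestrator: the function name check_cdgc_config is hypothetical, and treating the two private key parameters as alternatives of which exactly one is set is a reading of the table, not documented behavior.

```shell
# Hypothetical pre-flight check; not part of the orchestrator itself.
# The body runs in a subshell, so its helper variables do not leak.
check_cdgc_config() (
  missing=0
  required="DATAOPS_INFORMATICA_CDGC_URL
    DATAOPS_INFORMATICA_CDGC_USERNAME
    DATAOPS_INFORMATICA_CDGC_PASSWORD
    DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID
    DATAOPS_INFORMATICA_CDGC_DBT_MANIFEST_FILEPATH"

  # The transfer mode defaults to fs, as in the parameter table.
  if [ "${DATAOPS_INFORMATICA_CDGC_FILETRANSFER:-fs}" = "scp" ]; then
    required="$required
      DATAOPS_INFORMATICA_CDGC_SCP_SSH_HOST
      DATAOPS_INFORMATICA_CDGC_SCP_SSH_USER"
    key="${DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY:-}"
    key_file="${DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY_FILENAME:-}"
    # Assumption: exactly one of the two private key parameters is set.
    if [ -n "$key" ] && [ -n "$key_file" ]; then
      echo "set only one of the two private key parameters" >&2
      missing=1
    elif [ -z "$key" ] && [ -z "$key_file" ]; then
      echo "missing: an SSH private key parameter" >&2
      missing=1
    fi
  fi

  # Report every required parameter that is unset or empty.
  for name in $required; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)

# Example: check_cdgc_config || echo "fix the job variables first"
```

The function only inspects variables that are visible to the shell; it does not contact Informatica Cloud or the SSH server.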
Before you start
To use the orchestrator, you will need to set up and configure the following:
- A DataOps runner either on Kubernetes or as a Docker container
- A self-hosted Informatica Secure Agent (the Informatica Cloud Hosted Agent does not work for this integration, as it does not accept the metadata integration)
- A file transfer method between the DataOps runner and the Secure Agent
- The dbt Catalog Source in the Informatica Metadata Command Center
DataOps runner and Informatica Secure Agent configuration
First, ensure you have a DataOps runner on Docker or a DataOps runner on Kubernetes configured and deployed. Ensure that you can either:
- establish a network connection from the runner to the Informatica Secure Agent, or
- co-host the DataOps runner with the Informatica Secure Agent on the same machine, so that you can perform a file copy
Setting up the Informatica Secure Agent
Since you cannot use the Informatica Cloud Hosted Agent for this orchestrator, you need to use a self-hosted Secure Agent. If you do not have a self-hosted agent or agent group yet, review the Informatica instructions for Secure Agents and install the Secure Agent from Informatica > Administrator > Runtime Environments.
Configuring the file transfer between the DataOps Runner and the Informatica Secure Agent
The Informatica CDGC orchestrator must be able to copy files from the orchestrator's file system inside a container to the secure agent's file system. The orchestrator provides two options for this, selected with the parameter DATAOPS_INFORMATICA_CDGC_FILETRANSFER:
- select fs for a plain file transfer on the local file system
- select scp for a secure file copy to a different machine
Using plain file transfer
To achieve a plain file transfer, you need to be able to access the secure agent's file system. For that to work, mount that file system into the orchestrator's container by modifying the DataOps runner's config.toml.
Modify your existing runner configuration file. If you have a Docker runner, your config file should have a section similar to:
[runners.docker]
...
volumes = ["/app:/local_config:rw", "/agent_cache:/agent_cache:rw", "/secrets:/secrets:ro"]
...
Create a new volume mapping by updating the volumes line to map the host's secure agent installation directory /path/to/your/secure/agent to the directory /secure_agent within the orchestrator container:
[runners.docker]
...
volumes = ["/path/to/your/secure/agent:/secure_agent:rw", "/app:/local_config:rw", "/agent_cache:/agent_cache:rw", "/secrets:/secrets:ro"]
...
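After changing the volume mapping, it can help to confirm that the mounted directory is writable from inside the container. The following is a minimal sketch under that assumption; the function name can_write_to and the probe file name are hypothetical, and where you run the check (for example in a temporary job on the runner) is up to you.

```shell
# Hypothetical check: can the current user create a file in the directory?
can_write_to() {
  # $1 = directory inside the container, e.g. /secure_agent
  # Create a probe file and remove it again; fail if it cannot be created.
  touch "$1/.dataops_write_probe_$$" 2>/dev/null \
    && rm -f "$1/.dataops_write_probe_$$"
}

# Example: can_write_to /secure_agent || echo "mount is missing or read-only"
```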
Using secure file copy (SCP)
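To be able to transfer files using secure file copy (SCP), you need to generate and configure the supporting Secure Shell (SSH) key pair. Then the public key needs to be copied to the secure agent host, and the content of the private key needs to be stored in the DataOps vault.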
Step 1 — Generate the SSH key pair
First, generate a new SSH key pair that will be exclusively used to establish trust between the orchestrator and the host of the secure agent:
$ ssh-keygen -t rsa -b 4096 -C "Informatica CDGC orchestrator" -f infa_cdgc -N ''
Generating public/private rsa key pair.
Your identification has been saved in infa_cdgc
Your public key has been saved in infa_cdgc.pub
The key fingerprint is:
SHA256:9bsC8pSapLEx8uvG0K0GyWHH2g+NJo1bIX2K466lrVc Informatica CDGC orchestrator
The command writes two files: infa_cdgc (private key) and infa_cdgc.pub (public key).
While SSH can work with different algorithms, e.g. ED25519, the orchestrator currently only supports RSA. The example command uses 4096 bits to increase security.
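If you consider reusing an existing key pair, you can check the key type first. The sketch below relies on the OpenSSH public key file format, in which the line starts with the key type (ssh-rsa for RSA keys); the function name is_rsa_public_key is hypothetical.

```shell
# Hypothetical helper: succeeds if the public key file holds an RSA key.
is_rsa_public_key() {
  # $1 = public key file, e.g. infa_cdgc.pub
  # OpenSSH public key lines start with the key type, "ssh-rsa" for RSA.
  case "$(head -n 1 "$1")" in
    "ssh-rsa "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: is_rsa_public_key infa_cdgc.pub || echo "generate an RSA key instead"
```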
Step 2 — Copying the SSH public key to the Secure Agent host
Second, you must ensure the secure agent host trusts the newly created key. To do so, add the public key to the file ~/.ssh/authorized_keys on the server.
Conveniently, SSH comes with ssh-copy-id, which copies the necessary information and ensures it is installed for the correct secure agent operating system user:
ssh-copy-id -i infa_cdgc.pub <username>@<secure_agent_host>
For example, if your user is ubuntu and your secure_agent_host is 178.2.78.156, the command is:
ssh-copy-id -i infa_cdgc.pub ubuntu@178.2.78.156
You should see output similar to:
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "infa_cdgc.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ubuntu@178.2.78.156's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'infa_cdgc'"
and check to make sure that only the key(s) you wanted were added.
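Before moving on, you may want to confirm that the new key is accepted without a password prompt, since the orchestrator cannot answer prompts. This is a minimal sketch using standard OpenSSH client options; the function name verify_key_login is hypothetical, and the example values are the ones used in this step.

```shell
# Hypothetical helper: returns 0 if key-based login works without a prompt.
verify_key_login() {
  # $1 = private key file, $2 = SSH user, $3 = secure agent host
  # BatchMode=yes makes ssh fail instead of asking for a password.
  ssh -i "$1" -o BatchMode=yes -o ConnectTimeout=10 "$2@$3" true
}

# Example with the values used above:
# verify_key_login infa_cdgc ubuntu 178.2.78.156 && echo "key login works"
```

A non-zero exit status means that the server still asks for a password or that the host cannot be reached from where you ran the check.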
Step 3 — Copy the content of the private key to the DataOps vault
Finally, store the content of your private key in the DataOps Vault. Choose a vault key, e.g. INFORMATICA.CDGC.SCP.PRIVATE_KEY, and store the multi-line content as its value. Ensure that the line breaks are preserved.
Your private key should look similar to the following:
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
... more lines ....
EEuWXa1d7bQ9sSoFvghxAAAAHUluZm9ybWF0aWNhIENER0Mgb3JjaGVzdHJhdG9y
-----END OPENSSH PRIVATE KEY-----
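A key that was flattened onto a single line while copying is a common cause of failed logins. The sketch below checks only the shape of the text you are about to store (header line, footer line, and line breaks in between); it does not validate the key itself, and the function name has_preserved_line_breaks is hypothetical.

```shell
# Hypothetical shape check for a private key file; it does not validate the key.
has_preserved_line_breaks() {
  # $1 = file holding the value you are about to store in the vault
  case "$(head -n 1 "$1")" in "-----BEGIN "*) ;; *) return 1 ;; esac
  case "$(tail -n 1 "$1")" in "-----END "*) ;; *) return 1 ;; esac
  # Header, at least one line of key data, and footer on separate lines.
  [ $(( $(wc -l < "$1") )) -ge 3 ]
}

# Example: has_preserved_line_breaks infa_cdgc || echo "line breaks were lost"
```

For a stronger check, ssh-keygen -y -f infa_cdgc prints the matching public key only if the private key file can be parsed.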
Informatica setup
Before the Informatica CDGC orchestrator can run in a DataOps pipeline, a dbt catalog source must be set up in the Informatica Metadata Command Center. Additionally, a Snowflake catalog source should be set up there to sync the metadata of the Snowflake database. Finally, a connection assignment from the dbt catalog source to the Snowflake catalog source is needed to get the entire lineage after running the sync jobs.
Let's go through the steps.
Informatica Administrator setup
First, ensure your secure agent is up and running. Check under Informatica > Administrator > Runtime Environments. Let's assume you named the secure agent dataops-shared-agent. If you have not set up a secure agent yet, revisit the Informatica Secure Agent setup section.
Now navigate to Informatica > Administrator > Connections and create a new Snowflake Data Cloud connection using the secure agent dataops-shared-agent.
Informatica Metadata Command Center setup
Let's set up the required Informatica Metadata Command Center catalog sources. To start, go to Informatica > Metadata Command Center > Explore > Catalog Sources and review your existing sources.
Step 1 — Create the dbt Catalog Source
Create a new dbt catalog source:
Under Registration:
- Name: dbt catalog source
- Description: Catalog source for dbt imports for the Informatica CDGC orchestrator
- dbt Manifest JSON File Path: /home/ubuntu/dbtmanifestpath/manifest.json
Under Configuration:
- Runtime Environment: dataops-shared-agent
- Metadata Change Option: Delete
Complete the remaining steps and Save!
Once saved, capture the catalog source ID from your browser's address bar. If your address is https://mcc.dm-em.informaticacloud.com/catalogsource/decb1fb6-2969-36c6-9586-55f50f49bab7, then the catalog source ID is decb1fb6-2969-36c6-9586-55f50f49bab7.
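If you prefer to extract the ID on the command line, the last path segment of the address can be cut out with shell parameter expansion. This sketch assumes the address has the form shown above, with the ID as the final path segment and no query string; the variable names are illustrative.

```shell
# Address copied from the browser (example value from this page).
url="https://mcc.dm-em.informaticacloud.com/catalogsource/decb1fb6-2969-36c6-9586-55f50f49bab7"

# Drop one trailing slash, if present, then keep the last path segment.
url="${url%/}"
catalog_source_id="${url##*/}"

echo "$catalog_source_id"
# prints decb1fb6-2969-36c6-9586-55f50f49bab7
```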
Step 2 — Create the Snowflake Catalog Source
Create a new Snowflake catalog source:
Under Registration:
- Name: Snowflake catalog source
- Description: Catalog source for <ACCOUNT_LOCATOR> Snowflake account
- Connection: select the connection you created against the dataops-shared-agent runtime environment.
Complete the remaining steps and Save!
Once saved, capture the catalog source ID from your browser's address bar. If your address is https://mcc.dm-em.informaticacloud.com/catalogsource/ea3efe71-3a98-30e9-89ae-ca48fe8ae225, then the catalog source ID is ea3efe71-3a98-30e9-89ae-ca48fe8ae225.
Step 3 — Run both catalog sources
To complete the final steps, you must ensure that you can complete a Run for each catalog source.
For the dbt catalog source, you must grab a manifest.json from a MATE step's artifacts and manually place it at the dbt Manifest JSON File Path configured for the catalog source.
For the Snowflake source, ensure you give it sufficient time to complete a first run.
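For the dbt catalog source, the one-off manual placement can be done with scp when the secure agent runs on a different machine. The following is a sketch only: the function name copy_manifest and the local file name are hypothetical, and the key, user, host, and target path are the example values used elsewhere on this page.

```shell
# Hypothetical helper for the one-off manual copy; names are illustrative.
copy_manifest() {
  # $1 = file downloaded from the MATE job artifacts
  # $2 = private key file
  # $3 = user@host of the secure agent
  # $4 = dbt Manifest JSON File Path configured in the catalog source
  scp -i "$2" "$1" "$3:$4"
}

# Example with the values used on this page:
# copy_manifest ./manifest.json infa_cdgc ubuntu@178.2.78.156 \
#   /home/ubuntu/dbtmanifestpath/manifest.json
```

If the runner and the secure agent share a file system, a plain cp to the configured path serves the same purpose.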
Step 4 — Connection assignment
Once each catalog source job run is completed, you can perform a connection assignment such that the lineage between the incoming dbt model and the changes on Snowflake is traceable.
Under Informatica > Administrator > Monitor, find the dbt job and assign it to the Snowflake job.
Summary
Once you have completed the configuration steps, you have a functioning setup. The primary result is the dbt Catalog Source ID captured when creating the dbt catalog source. If you did not capture the ID, review the dbt catalog source section again.
The dbt Catalog Source ID is the value the orchestrator uses for the parameter DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID.