Informatica CDGC Orchestrator

Image: $DATAOPS_INFORMATICA_CDGC_RUNNER_IMAGE
Feature Status: PubPrev

The Informatica Cloud Data Governance and Catalog (CDGC) orchestrator interacts with Informatica Cloud to publish metadata about the data transformed in a DataOps pipeline.

Usage

pipelines/includes/local_includes/informatica_jobs/informatica_cdgc.yml
"Informatica CDGC Orchestrator":
  extends:
    - .agent_tag
  stage: "Informatica CDGC Orchestrator"
  image: $DATAOPS_INFORMATICA_CDGC_RUNNER_IMAGE
  variables:
    DATAOPS_INFORMATICA_CDGC_URL: "https://<your_org>.informaticacloud.com"
    DATAOPS_INFORMATICA_CDGC_USERNAME: DATAOPS_VAULT(INFORMATICA.CDGC.USERNAME)
    DATAOPS_INFORMATICA_CDGC_PASSWORD: DATAOPS_VAULT(INFORMATICA.CDGC.PASSWORD)
    # your dbt catalog source ID
    DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID: "<update this>"
    # match this with your dbt catalog source configuration
    DATAOPS_INFORMATICA_CDGC_DBT_MANIFEST_FILEPATH: "/home/ubuntu/dbtmanifestpath/manifest.json"
    DATAOPS_INFORMATICA_CDGC_FILETRANSFER: "scp"
    DATAOPS_INFORMATICA_CDGC_SCP_SSH_HOST: DATAOPS_VAULT(INFORMATICA.CDGC.SSH_HOST)
    DATAOPS_INFORMATICA_CDGC_SCP_SSH_USER: ubuntu
    DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY: DATAOPS_VAULT(INFORMATICA.CDGC.SSH_PRIVATE_KEY)

  script:
    - /dataops
  icon: ${INFORMATICA_ICON}

The Informatica CDGC orchestrator assumes that a DataOps modeling and transformation job, including the Generate Docs stage, has completed in an earlier stage of the DataOps pipeline. It uses the metadata (data about data) model to provide up-to-date information to the catalog at the end of every pipeline run.

public preview

While the orchestrator is in preview, contact DataOps support for concrete usage examples, as the job configuration shown does not perfectly match what you need to configure.

Prerequisites

dbt 1.7 or later required

For the orchestrator to work, you must use MATE with dbt version 1.7 or later. See switching dbt versions for details.

Supported parameters

| Parameter | Required/Default | Description |
| --- | --- | --- |
| DATAOPS_INFORMATICA_CDGC_URL | REQUIRED | Organization's Informatica Cloud URL, e.g. https://dm-us.informaticacloud.com/ for Americas or https://dm-em.informaticacloud.com/ for Europe. |
| DATAOPS_INFORMATICA_CDGC_USERNAME | REQUIRED | Username required to log in to the Informatica organization account. |
| DATAOPS_INFORMATICA_CDGC_PASSWORD | REQUIRED | Password required to log in to the Informatica organization account. |
| DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID | REQUIRED | dbt Catalog Source ID from the organization's Informatica Metadata Command Center. |
| DATAOPS_INFORMATICA_CDGC_DBT_MANIFEST_FILEPATH | REQUIRED | File path accessible to both the Informatica Secure Agent and the DataOps runner, where the DataOps runner stores the dbt manifest file. |
| DATAOPS_INFORMATICA_CDGC_FILETRANSFER | Optional, defaults to fs | Method used to transfer the dbt manifest file from the DataOps runner to the Informatica Cloud Secure Agent. Accepted values are fs, when the file path accessible to the Informatica Cloud Secure Agent is directly accessible to the DataOps runner, or scp, for secure file transfer using SSH. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_HOST | Optional, required for scp | Hostname or IP address of the SSH server to which the dbt manifest file is transferred. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_PORT | Optional, required for scp, defaults to 22 | Port of the SSH server to which the dbt manifest file is transferred. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_USER | Optional, required for scp | Username for the SSH server used for scp. Normally the operating system user the Informatica Secure Agent runs as. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY | Optional, required for scp | The content of the private key for SCP file transfer, retrieved from the DataOps vault. If you use this parameter, don't use DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY_FILENAME. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY_FILENAME | Optional for scp | Private key file name for SCP file transfer. The file is stored as a secure file outside the git repository. If you use this parameter, don't use DATAOPS_INFORMATICA_CDGC_SCP_SSH_PRIVATE_KEY. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_SERVER_FINGERPRINT | Optional for scp | SSH server fingerprint to trust. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_PASSPHRASE | Optional for scp | Passphrase to decrypt the SSH private key. |
| DATAOPS_INFORMATICA_CDGC_SCP_SSH_KNOWN_HOSTS_FILEPATH | Optional for scp | Path to the known hosts file containing the SSH server host to trust. |
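
The dependencies between these parameters can be easier to follow as a small check. The following is a minimal sketch in Python and not part of the orchestrator: check_cdgc_variables is a hypothetical helper, and treating "neither private key parameter set" as an error for scp is an assumption drawn from the parameter descriptions. It only checks whether values are present, not whether they are valid.

```python
from typing import List, Mapping

PREFIX = "DATAOPS_INFORMATICA_CDGC_"
ALWAYS_REQUIRED = ("URL", "USERNAME", "PASSWORD", "CATALOG_SOURCE_ID", "DBT_MANIFEST_FILEPATH")
SCP_REQUIRED = ("SCP_SSH_HOST", "SCP_SSH_USER")


def check_cdgc_variables(variables: Mapping[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the set looks complete."""

    def is_set(short_name: str) -> bool:
        return bool(variables.get(PREFIX + short_name, "").strip())

    problems = [f"{PREFIX}{name} is required" for name in ALWAYS_REQUIRED if not is_set(name)]

    # FILETRANSFER is optional and defaults to fs.
    transfer = variables.get(PREFIX + "FILETRANSFER", "").strip() or "fs"
    if transfer not in ("fs", "scp"):
        problems.append(f"{PREFIX}FILETRANSFER must be fs or scp, got {transfer!r}")

    if transfer == "scp":
        problems += [f"{PREFIX}{name} is required for scp" for name in SCP_REQUIRED if not is_set(name)]
        has_key = is_set("SCP_SSH_PRIVATE_KEY")
        has_file = is_set("SCP_SSH_PRIVATE_KEY_FILENAME")
        if has_key and has_file:
            problems.append("use only one of SCP_SSH_PRIVATE_KEY and SCP_SSH_PRIVATE_KEY_FILENAME")
        elif not (has_key or has_file):
            # Assumption: scp needs the key from one of the two parameters.
            problems.append("scp needs SCP_SSH_PRIVATE_KEY or SCP_SSH_PRIVATE_KEY_FILENAME")

    return problems
```

Running such a check locally against the variables block of your job before committing can surface a missing or mistyped variable name early.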

Before you start

To use the orchestrator, you will need to set up and configure the following:

  • A DataOps runner either on Kubernetes or as a Docker container
  • A self-hosted Informatica Secure Agent - the Informatica Cloud Hosted Agent will not work for this integration, as it does not accept the metadata integration
  • A file transfer method between the DataOps runner and the Secure Agent
  • The dbt Catalog Source in the Informatica Metadata Command Center

DataOps runner and Informatica Secure Agent configuration

First, ensure you have a DataOps runner on Docker or a DataOps runner on Kubernetes configured and deployed. Then ensure that you can either:

  • establish a network connection from the runner to the Informatica Secure Agent, or
  • co-host the DataOps runner with the Informatica Secure Agent on the same machine, so that you can perform a file copy

Setting up the Informatica Secure Agent

Since you cannot use the Informatica Cloud Hosted Agent for this orchestrator, you need to use a self-hosted Secure Agent. If you do not have a self-hosted agent or agent group yet, review the Informatica instructions for Secure Agents and install the Secure Agent from Informatica > Administrator > Runtime Environments.

Configuring the file transfer between the DataOps Runner and the Informatica Secure Agent

The Informatica CDGC Orchestrator must be able to copy files from the orchestrator's file system inside a container to the Secure Agent's file system. The orchestrator provides two options for this, selected with the parameter DATAOPS_INFORMATICA_CDGC_FILETRANSFER:

  • select fs for a plain file transfer on the local file system
  • select scp for a secure file copy to a different machine

Using plain file transfer

To achieve a plain file transfer, you need to be able to access the secure agent's file system. For that to work, you will need to mount that file system to the container of the orchestrator by modifying the DataOps runner's config.toml.

Modify your existing runner configuration file. If you have a Docker runner, your config file should have a section similar to:

/srv/dataops-runner-$AGENT_NAME/config/config.toml
[runners.docker]
...
volumes = ["/app:/local_config:rw", "/agent_cache:/agent_cache:rw", "/secrets:/secrets:ro"]
...

Add a new volume mapping to the volumes line that maps the Secure Agent installation directory on the host, /path/to/your/secure/agent, to the directory /secure_agent within the orchestrator container.

/srv/dataops-runner-$AGENT_NAME/config/config.toml
[runners.docker]
...
volumes = ["/path/to/your/secure/agent:/secure_agent:rw", "/app:/local_config:rw", "/agent_cache:/agent_cache:rw", "/secrets:/secrets:ro"]
...
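
To see what such a mapping does to a file path, the following minimal Python sketch resolves a path inside the orchestrator container to the corresponding path on the host. host_path_for is a hypothetical helper used only for illustration. It models the host:container[:mode] prefix mapping of the volumes entries and nothing else.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def host_path_for(container_path: str, volumes: List[str]) -> Optional[str]:
    """Translate a container path to its host path, or None if it is not on a mounted volume."""
    target = PurePosixPath(container_path)
    best = None  # (depth of the container mount point, resolved host path)
    for spec in volumes:
        parts = spec.split(":")
        if len(parts) < 2:
            continue  # not a host:container mapping
        host_dir, container_dir = PurePosixPath(parts[0]), PurePosixPath(parts[1])
        try:
            relative = target.relative_to(container_dir)
        except ValueError:
            continue  # target is not below this mount point
        depth = len(container_dir.parts)
        if best is None or depth > best[0]:
            best = (depth, str(host_dir / relative))  # prefer the most specific mount
    return best[1] if best else None


volumes = [
    "/path/to/your/secure/agent:/secure_agent:rw",
    "/app:/local_config:rw",
    "/agent_cache:/agent_cache:rw",
    "/secrets:/secrets:ro",
]
print(host_path_for("/secure_agent/dbtmanifestpath/manifest.json", volumes))
```

A file the orchestrator writes below /secure_agent therefore appears below the Secure Agent installation directory on the host, while a path outside all mapped directories stays inside the container.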

Using secure file copy (SCP)

To transfer files using secure file copy (SCP), you need to generate and configure a supporting Secure Shell (SSH) key pair. The public key then needs to be copied to the Secure Agent host, and the private key stored in the DataOps vault.

Step 1 — Generate the SSH key pair

First, generate a new SSH key pair that will be exclusively used to establish trust between the orchestrator and the host of the secure agent:

Generate SSH key pair
$ ssh-keygen -t rsa -b 4096 -C "Informatica CDGC orchestrator" -f infa_cdgc -N ''

Generating public/private rsa key pair.
Your identification has been saved in infa_cdgc
Your public key has been saved in infa_cdgc.pub
The key fingerprint is:
SHA256:9bsC8pSapLEx8uvG0K0GyWHH2g+NJo1bIX2K466lrVc Informatica CDGC orchestrator

The command will write two files: infa_cdgc (private key) and infa_cdgc.pub (public key).

Only RSA is currently supported

While SSH can work with different algorithms, e.g. ED25519, the orchestrator currently only supports RSA. The example command uses 4096 bits to increase security.
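
Because only RSA keys are accepted, it can help to confirm the algorithm of an existing key before using it. The sketch below is a minimal illustration in Python, with public_key_algorithm and is_rsa_public_key as hypothetical helper names. It relies on the OpenSSH public key file layout, where the first field of the line names the key type (ssh-rsa for RSA keys).

```python
def public_key_algorithm(public_key_line: str) -> str:
    """Return the key type field of an OpenSSH public key line, e.g. 'ssh-rsa'."""
    fields = public_key_line.strip().split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public key line")
    return fields[0]


def is_rsa_public_key(public_key_line: str) -> bool:
    """True if the public key line declares an RSA key."""
    return public_key_algorithm(public_key_line) == "ssh-rsa"
```

You could call is_rsa_public_key with the content of infa_cdgc.pub. A key generated with a different algorithm, such as ED25519, reports another type (ssh-ed25519) and should be regenerated with -t rsa.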

Step 2 — Copying the SSH public key to the Secure Agent host

Second, you must ensure the Secure Agent host trusts the newly created key. To do so, add the public key to the file ~/.ssh/authorized_keys on that host.

Conveniently, SSH comes with ssh-copy-id, which copies the public key and ensures it is installed for the correct Secure Agent operating system user:

ssh-copy-id syntax
ssh-copy-id -i infa_cdgc.pub <username>@<secure_agent_host>

For example, if your user is ubuntu and your secure_agent_host is 178.2.78.156, the command is:

ssh-copy-id example
ssh-copy-id -i infa_cdgc.pub ubuntu@178.2.78.156

You should see output similar to:

/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "infa_cdgc.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ubuntu@178.2.78.156's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'infa_cdgc'"
and check to make sure that only the key(s) you wanted were added.

Step 3 — Copy the content of the private key to the DataOps vault

Finally, store the content of your private key in the DataOps Vault. Decide on a vault key, e.g. INFORMATICA.CDGC.SCP.PRIVATE_KEY, and store the multi-line content as the value. Ensure that the line breaks are preserved.

Your private key should look similar to the following:

-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
... more lines ....
EEuWXa1d7bQ9sSoFvghxAAAAHUluZm9ybWF0aWNhIENER0Mgb3JjaGVzdHJhdG9y
-----END OPENSSH PRIVATE KEY-----
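
A private key whose line breaks were lost, for example by pasting it into a single-line field, is a common cause of failed key authentication. The following Python sketch shows one way to check a value before storing it. line_breaks_preserved is a hypothetical helper, and the check is deliberately shallow: it confirms that the BEGIN and END marker lines sit on their own lines with key material in between, without validating the key itself.

```python
import re

_BEGIN = re.compile(r"^-----BEGIN ([A-Z0-9 ]*)PRIVATE KEY-----$")
_END = re.compile(r"^-----END ([A-Z0-9 ]*)PRIVATE KEY-----$")


def line_breaks_preserved(key_text: str) -> bool:
    """True if the key still has separate BEGIN, body, and END lines."""
    lines = key_text.strip().splitlines()
    if len(lines) < 3:
        return False  # collapsed into one line or body missing
    begin = _BEGIN.match(lines[0].strip())
    end = _END.match(lines[-1].strip())
    # Both markers must be present and name the same key format.
    return bool(begin and end and begin.group(1) == end.group(1))
```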

Informatica setup

Before the Informatica CDGC orchestrator can run in a DataOps pipeline, a dbt catalog source must be set up in the Informatica Metadata Command Center. Additionally, a Snowflake catalog source should be set up there to sync the metadata of the Snowflake database. Finally, a connection assignment from the dbt catalog source to the Snowflake catalog source is needed to get the entire lineage after running the sync jobs.

Let's go through the steps.

Informatica Administrator setup

First, ensure your Secure Agent is up and running. Check under Informatica > Administrator > Runtime Environments. If you have not set up a Secure Agent yet, revisit the Informatica Secure Agent setup section. The following steps assume you named the Secure Agent dataops-shared-agent.

Now navigate to Informatica > Administrator > Connections and create a new Snowflake Data Cloud connection using the secure agent dataops-shared-agent:

Figure: Snowflake connection

Informatica Metadata Command Center setup

Let's set up the required Informatica Metadata Command Center catalog sources. To start, go to Informatica > Metadata Command Center > Explore > Catalog Sources and review your existing sources:

Figure: Informatica Metadata Command Center Catalog Sources

Step 1 — Create the dbt Catalog Source

Create a new dbt catalog source:

Figure: Informatica Metadata Command Center - dbt catalog source step 1

Under Registration:

  • Name: dbt catalog source
  • Description: Catalog source for dbt imports for the Informatica CDGC orchestrator
  • dbt Manifest JSON File Path: /home/ubuntu/dbtmanifestpath/manifest.json

Figure: Informatica Metadata Command Center - dbt catalog source step 2

Under Configuration:

  • Runtime Environment: dataops-shared-agent
  • Metadata Change Option: Delete

Figure: Informatica Metadata Command Center - dbt catalog source step 3

Complete the remaining steps and Save!

Once saved, capture the catalog source ID from your browser's address bar:

Figure: Informatica Metadata Command Center - dbt catalog source - capture ID

If your address is https://mcc.dm-em.informaticacloud.com/catalogsource/decb1fb6-2969-36c6-9586-55f50f49bab7, then the catalog source ID is decb1fb6-2969-36c6-9586-55f50f49bab7.
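
If you prefer not to pick the ID out of the address by hand, the extraction can be sketched in a few lines of Python. catalog_source_id_from_url is a hypothetical helper, and the sketch assumes the address has the shape shown in the example, with the ID as the path segment directly after catalogsource.

```python
from urllib.parse import urlparse


def catalog_source_id_from_url(url: str) -> str:
    """Return the last path segment of a .../catalogsource/<id> address."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if len(segments) < 2 or segments[-2] != "catalogsource":
        raise ValueError("address does not end in /catalogsource/<id>")
    return segments[-1]


print(catalog_source_id_from_url(
    "https://mcc.dm-em.informaticacloud.com/catalogsource/decb1fb6-2969-36c6-9586-55f50f49bab7"
))
```

The printed value is what goes into DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID in the job configuration.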

Step 2 — Create the Snowflake Catalog Source

Create a new Snowflake catalog source:

Figure: Informatica Metadata Command Center - Snowflake catalog source step 1

Under Registration:

  • Name: Snowflake catalog source
  • Description: Catalog source for <ACCOUNT_LOCATOR> Snowflake account
  • Connection: select the connection you created against the dataops-shared-agent runtime environment.

Figure: Informatica Metadata Command Center - Snowflake catalog source step 2

Complete the remaining steps and Save!

Once saved, capture the catalog source ID from your browser's address bar:

Figure: Informatica Metadata Command Center - Snowflake catalog source - capture ID

If your address is https://mcc.dm-em.informaticacloud.com/catalogsource/ea3efe71-3a98-30e9-89ae-ca48fe8ae225, then the catalog source ID is ea3efe71-3a98-30e9-89ae-ca48fe8ae225.

Step 3 — Run both catalog sources

To complete the final steps, you must ensure that you can complete a Run for each catalog source.

For the dbt catalog source, you must grab a manifest.json from a MATE step's artifacts and manually place it at the dbt Manifest JSON File Path configured in the catalog source.

For the Snowflake source, ensure you give it sufficient time to complete a first run.

Step 4 — Connection assignment

Once each catalog source job run is completed, you can perform a connection assignment such that the lineage between the incoming dbt model and the changes on Snowflake is traceable.

Under Informatica > Administrator > Monitor find the dbt job and assign it to Snowflake's job.

Figure: Informatica Metadata Command Center - connection assignment from dbt to Snowflake

Step 5 — Summary

Once you have completed the configuration steps, you have a functioning setup. The primary result is the dbt Catalog Source ID captured when creating the dbt catalog source. If you did not capture the ID, review the dbt catalog source section again.

The dbt Catalog Source ID is the value for the parameter DATAOPS_INFORMATICA_CDGC_CATALOG_SOURCE_ID the orchestrator uses.