DataOps FAQs
Here you can find Frequently Asked Questions (FAQs) about DataOps.live, the DataOps SaaS platform.
You can find questions on DataOps in general on our site at DataOps FAQs.
- Why does my Git Compare not show any changes?
- How do I disable the Snowflake Query Cache?
- Can I load data from S3 to a Snowflake tenant on Azure?
- How do I automate the documentation process?
- How do I simplify automated testing without resulting in a time-consuming process?
- Is the order of input important?
- How can I change branch/environment names?
- What kind of data can I pass between pipeline stages?
- Is there a pattern for generating tasks at pipeline runtime based on external data?
- Is there a pattern for triggering pipelines externally?
- Where does the schedule/pipeline manager run, and where are the individual task containers executed?
- Why does a runner fail to clone the repository during platform maintenance?
- Do jobs time out during platform maintenance?
- What is the behavior of scheduled jobs during platform maintenance?
- Why do Terraform logs show warnings after generating Destroy-Plan and Destroy Jobs?
- Why do SOLE logs show warnings while creating a new environment?
- How does Datadog integration work?
Why does my Git Compare not show any changes?
This is expected behavior: Git Compare only shows unidirectional (forward) changes, not backward changes. See Is the order of input important? below for more detail.
How do I disable the Snowflake Query Cache?
It is not a good idea to disable the Snowflake Query Cache. However, in some scenarios, it is helpful to be able to do it, for example, during performance testing.
You can achieve this by adding a pre-hook to the model in question, as the following code snippet shows:
{{
  config(
    pre_hook="alter session set USE_CACHED_RESULT = FALSE"
  )
}}
Can I load data from S3 to a Snowflake tenant on Azure?
Yes. The performance is a little slower, but functionally, this works fine.
How do I automate the documentation process?
We have built automated documentation processes into our DataOps.live platform. Since we have all the logic about how we build, transform, and test in our Git repository and have access to the target database, we have all the information needed to build a good set of automated documentation.
For more information about documentation, check out MATE Project Documentation.
How do I simplify automated testing without resulting in a time-consuming process?
One of the most significant and complex aspects of testing data pipelines and DataOps is that data models constantly shift as schemas change and user requirements evolve. Keeping test suites in sync can therefore become time-consuming when everything changes continually and rapidly.
We solved this challenge by implementing the following:
- Ensure your tests are stored in the same Git repository as your configuration and code files, so that as you make changes and deploy them, the functional changes and tests are deployed together.
- Ensure your tests are defined in the same place as (or alongside) your functional logic (see the sketch after this list). If your data modeling is defined in one place and your tests in another, it is virtually impossible to keep them in sync.
- Note that the same applies to grants and permissions. If you define them together with the functional code, they are much easier to manage and mistakes are harder to make.
- Deploy your functional changes using an automated, declarative approach like the Snowflake Object Lifecycle Engine (SOLE) in our DataOps platform. This removes the need to write endless ALTER TABLE statements.
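To illustrate keeping tests next to the functional logic, here is a minimal sketch of a MATE/dbt-style schema file that would sit alongside the model it tests; the model and column names are purely illustrative:
version: 2
models:
  - name: stg_customers       # hypothetical model defined alongside this file
    columns:
      - name: customer_id
        tests:
          - unique            # fails if duplicate IDs appear
          - not_null          # fails if any ID is missing
Because the schema file lives next to the model it describes, a change to the model and the corresponding change to its tests travel through the same merge request and pipeline run.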
For more information about data testing, check out MATE Automated Data Testing.
Is the order of input important?
Yes, it is. Think of Git Compare as a merge request. If you want to merge the changes in feature branch A to the main branch, the valid input order is as follows:
- Source: A
- Target: main
In this scenario, Git shows the changes as expected. But if you reverse the inputs as follows:
- Source: main
- Target: A
Git shows that there is nothing to compare, because Git Compare checks the head of the main branch and finds that it is simply behind the target branch.
If you find yourself in this situation, you can always swap the inputs around or force the contents of one branch to another, as described in Git in 30 Seconds.
How can I change branch/environment names?
To change the default DataOps environment names (PROD, QA, and DEV), add the following variables, shown here with their default values, to your project's variables.yml file and set them to your new names:
DATAOPS_ENV_NAME_PROD: PROD
DATAOPS_ENV_NAME_QA: QA
DATAOPS_ENV_NAME_DEV: DEV
Also, ensure that any logic that uses these environment names is updated with the new names; in particular, check dataops/snowflake/databases.template.yml.
If you also wish to rename the branches in a DataOps project, set these additional variables with their new names (don't forget to rename the actual branches):
DATAOPS_BRANCH_NAME_PROD: main
DATAOPS_BRANCH_NAME_QA: qa
DATAOPS_BRANCH_NAME_DEV: dev
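For illustration only, a variables.yml fragment that renames both the environments and the branches to hypothetical names might look like this (assuming the usual top-level variables: block; remember to rename the actual Git branches to match):
variables:
  DATAOPS_ENV_NAME_PROD: PRODUCTION
  DATAOPS_ENV_NAME_QA: TEST
  DATAOPS_ENV_NAME_DEV: DEVELOP
  DATAOPS_BRANCH_NAME_PROD: master
  DATAOPS_BRANCH_NAME_QA: test
  DATAOPS_BRANCH_NAME_DEV: develop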
We suggest that you only perform this change on a new project, renaming main first, then creating the other branches from it.
For more information about environments, check out DataOps Environment Management.
What kind of data can I pass between pipeline stages?
Pipeline jobs can pass files between themselves in four ways:
- As part of the pipeline cache, every job in a pipeline has access to a /cache directory. This is a filesystem shared between all the jobs in a pipeline, but only for that pipeline; it is not visible to any other pipeline. This is the usual way of passing data between jobs.
- Jobs in a pipeline also have access to a /persistent_cache, an area shared across multiple pipelines for the same branch. For instance, this is useful when you are doing incremental ingestion and want to store your high-water mark somewhere that will be available to the following pipeline run for that branch or environment (see the sketch after this list).
- Using something completely external such as an AWS S3 bucket.
- Job artifacts are sent to the SaaS platform at the end of a run and stored as downloadable objects for the job for a number of months. Artifacts are also available to subsequent jobs in the same pipeline. Since they are stored in our platform, we don't recommend this method for production data, for data governance and privacy reasons.
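As a minimal sketch of the /persistent_cache pattern, assuming GitLab-CI-style job syntax; the job name, stage, and file name are illustrative:
Store High Water Mark:
  stage: Ingestion             # hypothetical stage name
  script:
    # Read the high-water mark left by the previous pipeline run on this branch, if any
    - cat /persistent_cache/high_water_mark.txt 2>/dev/null || echo "no previous high-water mark"
    # Record a new high-water mark for the next pipeline run on this branch
    - date -u +%Y-%m-%dT%H:%M:%SZ > /persistent_cache/high_water_mark.txt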
For more information about sequencing jobs in a pipeline, check out Pipeline Stages.
Is there a pattern for generating tasks at pipeline runtime based on external data?
Have you encountered this use case before? How are others solving it?
There are two ways to do this:
- Within a single job: We do this regularly, since one job can still saturate all the resources of a host. The pattern takes the form create_list_of_jobs | parallel --jobs 800% do_some_work.sh. This creates a thread pool of eight threads per CPU core (32 threads on a 4-core machine). It then takes the first 32 jobs and calls do_some_work.sh, passing in the individual parameters each time the shell script is called. As soon as the first call finishes, the system starts on the 33rd call, and so on until all of the calls (5,000 in this example) have completed. This is a very effective way to perform a lot of work in parallel while still respecting upstream API concurrency limits: if your upstream only allows 16 simultaneous connections, fix the jobs limit with --jobs 16. Note that there are many variations on this we can help you with.
- With many jobs: In DataOps.live, you can create jobs dynamically or programmatically. This involves having some trivial script that essentially produces a large YAML block, as sketched below.
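For example, a hypothetical generator script might emit a YAML block with one job per upstream source, along these lines (the job names, stage, and ingestion script are all made up):
Ingest Source A:
  stage: Ingestion
  script:
    - ./ingest_source.sh source_a    # hypothetical per-source ingestion script

Ingest Source B:
  stage: Ingestion
  script:
    - ./ingest_source.sh source_b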
For more information about what you can do with REST APIs, check out Using the REST API.
Is there a pattern for triggering pipelines externally?
The DataOps.live platform includes a full-scope REST API that can do everything from kicking off pipelines and monitoring results to creating and approving merge requests, user management, and even creating new projects.
To kick off a pipeline run using the standard GitLab-compatible pipeline-creation endpoint, the call would look something like this:
curl -s --request POST --header "PRIVATE-TOKEN: $AUTH_TOKEN" "$URL/api/v4/projects/$PROJECT/pipeline?ref=$BRANCH" | jq
For AWS S3 (or the GCS equivalent), you could write a very simple trigger:
S3 -> Trivial Lambda -> DataOps REST API
Where does the schedule/pipeline manager run, and where are the individual task containers executed?
Scheduling and pipeline management happen in the DataOps SaaS platform in the cloud. After building the complete pipeline graph, the platform maintains, for each DataOps runner, a queue of pending jobs, that is, jobs that are ready to run because all their requirements and other dependencies have been met.
Each DataOps runner, a long-running and stateless container, dials home regularly (typically every second) and asks if there are any pending jobs for it to execute. If the answer is yes, the DataOps runner runs another container of a specific type, passes in the relevant job run information, and monitors it for completion, streaming the logs back in real time.
Today, our standard deployment model is that the long-running DataOps runner and the child containers it spawns run on a Linux machine, typically an EC2 instance in AWS. Therefore, resource allocation isn't very complex.
Why does a runner fail to clone the repository during platform maintenance?
At the start of a job, the runner clones the repository at the step “getting source from Git repository”. If platform maintenance is in progress, the job fails at this step because the runner cannot connect to the platform to clone the repository.
Jobs that are past this stage should complete without issue.
Do jobs time out during platform maintenance?
Yes. Jobs will continue to time out (if timeouts are set) during maintenance.
What is the behavior of scheduled jobs during platform maintenance?
A pipeline scheduled to start during maintenance will only run once the maintenance is complete. The Schedules view will show “X minutes ago” under Last Pipeline.
The following pipeline will start at the scheduled time.
Why do Terraform logs show warnings after generating Destroy-Plan and Destroy Jobs?
After generating a destroy plan or deleting resources, Terraform might log warning messages such as the following:
│ Warning: Resource targeting is in effect
│
│ You are creating a plan with the -target option, which means that the
│ result of this plan may not represent all of the changes requested by the
│ current configuration.
│
│ The -target option is not for routine use, and is provided only for
│ exceptional situations such as recovering from errors or mistakes, or when
│ Terraform specifically suggests to use it as part of an error message.
│ Warning: Applied changes may be incomplete
│
│ The plan was created with the -target option in effect, so some changes
│ requested in the configuration may have been ignored and the output values
│ may not be fully updated. Run the following command to verify that no other
│ changes are pending:
│ terraform plan
│
│ Note that the -target option is not suitable for routine use, and is
│ provided only for exceptional situations such as recovering from errors or
│ mistakes, or when Terraform specifically suggests to use it as part of an
│ error message.
These warning messages are generated if not all the objects in the Object Group are deleted.
Why do SOLE logs show warnings while creating a new environment?
When SOLE starts, it first tries to import objects in case they already exist in Snowflake. If your objects are new and are about to be created by SOLE itself, this import step fails and you see a few warning messages. This is expected behavior: there is simply nothing to import from Snowflake yet.
How does Datadog integration work?
There are two ways to do this:
- You can have a job at the end of a pipeline that gathers all the log information and fires it off to Datadog (see the sketch after this list).
- Our preferred method is to containerize and run the Datadog agent on the same host as the DataOps runner. We do this for our runner hosts since it gives us log information from the containers and valuable host information.
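A minimal sketch of the first approach, assuming GitLab-CI-style job syntax and Datadog's public HTTP log-intake endpoint; the job name, stage, and message are illustrative, and DD_API_KEY is assumed to be available as a project variable:
Send Logs to Datadog:
  stage: Clean Up              # hypothetical final pipeline stage
  script:
    # A real job would first gather the pipeline's log output;
    # this sketch just forwards a single JSON log event to Datadog's log-intake API
    - >
      curl -s -X POST "https://http-intake.logs.datadoghq.com/api/v2/logs"
      -H "DD-API-KEY: $DD_API_KEY"
      -H "Content-Type: application/json"
      -d '[{"ddsource": "dataops", "service": "dataops-pipeline", "message": "DataOps pipeline finished"}]'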