
Branching Strategies

As the fourth #TrueDataOps pillar states, environment management is a foundational part of the data product platform, and branching strategies are a big part of it.

The #TrueDataOps website describes environment management within the DataOps context as follows:

Environment management is one of the most complex elements of #TrueDataOps. Great environment management for #TrueDataOps requires all environments to be built, changed, and (where relevant) destroyed automatically. It also requires a fast way of making a new environment look just like production. This could involve the duplication of TB of data.

Therefore, let's dive into environment management and git branching strategies, starting with branching strategy concepts.

Git branching strategy concepts

One of the standout features of Git as a version control system is its environment management capabilities. It offers many different ways of handling, merging, and releasing branches, among other functions. You can read up on these different approaches in DataOps Environments.

Git usage best practices are imperative. Therefore, if your company has already instituted Git usage best practices, use them. However, most data teams do not yet have a set of Git best practices. Moreover, the differences between DevOps and DataOps mean that some of the typical DevOps approaches don't work quite as well for DataOps. As a result, we have adopted a standard and well-accepted Git model that has proven to work well for DataOps.

This approach is detailed below.

note

The standard pipeline -ci.yml files included in all template DataOps projects are based on this Git model and will need tweaking if a different model is used.

Overview of Git branches

In its simplest form, the Git development workflow looks like this:

  • Cut a new feature branch (feature-branch-name, e.g., named after your Jira ticket) from the production branch.
  • Make and test your changes on this feature branch (feature-branch-name).
  • Once happy, create a merge request (MR) from feature-branch-name back to production.

Because this process is all about data, the ability to create a feature branch data warehouse is one of the most powerful features of the data product platform. When branching your DataOps repository, you are, in effect, creating a sandbox where you can edit to your heart's content without the possibility of disrupting anyone else.

This is precisely what you get with branching.

In a mature environment, you will find other environments beyond your feature branch and production:

| Name | Lifecycle | Type | Protected | Environment Name | Required/optional | Data |
|---|---|---|---|---|---|---|
| production or prod | Long-lived | Shared | Yes | DATAOPS_ENV_NAME_PROD | Required | Ingested |
| qa | Long-lived | Shared | Yes | DATAOPS_ENV_NAME_QA | Optional, basic level of maturity needed | Ingested |
| dev | Long-lived | Shared | No | DATAOPS_ENV_NAME_DEV | Optional, basic level of maturity required | Cloned |
| feature-branch-name | Short-lived | Individual | No | DATAOPS_PREFIX_FB_FEATURE_BRANCH_NAME | Should be used for all individual changes | Cloned |

You can find details about the variables for environments in Project settings, Project variables. Note that only Maintainers can commit to protected branches.
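The table above can be read as a simple lookup from branch name to environment properties. The following is a minimal sketch of that lookup, not the platform's implementation. The `BranchProfile` type, the `profile_for` and `feature_env_suffix` helpers, and the rule for deriving the feature branch suffix (upper-case the branch name and replace non-alphanumeric characters with underscores) are assumptions made for illustration, inferred only from the feature-branch-name row.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchProfile:
    lifecycle: str   # "long-lived" or "short-lived"
    shared: bool     # True for shared environments, False for individual ones
    protected: bool  # True when only Maintainers can commit
    data: str        # "ingested" or "cloned"


# Rows copied from the table above.
LONG_LIVED = {
    "production": BranchProfile("long-lived", True, True, "ingested"),
    "prod": BranchProfile("long-lived", True, True, "ingested"),
    "qa": BranchProfile("long-lived", True, True, "ingested"),
    "dev": BranchProfile("long-lived", True, False, "cloned"),
}
FEATURE = BranchProfile("short-lived", False, False, "cloned")


def profile_for(branch: str) -> BranchProfile:
    """Treat any branch that is not a known long-lived branch as a feature branch."""
    return LONG_LIVED.get(branch, FEATURE)


def feature_env_suffix(branch: str) -> str:
    """Hypothetical naming rule: FB_ plus the upper-cased, underscore-separated branch name."""
    return "FB_" + re.sub(r"[^A-Za-z0-9]+", "_", branch).upper()
```

For example, `profile_for("dev").protected` is `False`, while any ticket-named branch falls through to the short-lived, individual, cloned profile.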

Git branching in detail

This section dives into the details of DataOps environments, including using branches.

The quality assurance branch and environment

A quality assurance branch (qa) and its environment (QA) form a virtually perfect replica of production. The primary purpose of this environment is to ensure that anything that would fail in production fails in qa first. In other words, by merging to qa, developers or engineers are stating that, from their perspective, the code is ready to be merged into production.

note

A QA environment is not the place to find elementary bugs in code. Treat this environment as you would production: letting a bug reach QA should be regarded as seriously as letting it reach production.

The QA development workflow is as follows:

  • Cut a feature branch (feature-branch-name) from the qa branch.
  • Make and test changes on the feature-branch-name.
  • Once happy, create a merge request (MR) from feature-branch-name back to qa.
  • Run tests.
  • Once tests run successfully in qa, create a merge request from qa to production.

The development branch and environment

The most frequent cause of failures (though still relatively rare) in a highly collaborative versioning model such as Git is a scenario where feature A and feature B each run successfully on their own but fail when combined.

A typical example of this cause of failure is as follows:

Let's assume we have two different feature branches that each introduce a calculation model called "calc_sales_order_average", but the logic differs between the two branches. There is no way to resolve this automatically. Thus, the question becomes: How do you solve this conflict?

A best-practice solution is as follows:

You need an integration environment (DEV) before QA and PROD where you can bring features together, test them with each other, confirm they work together, and then use a merge request to merge them into QA.

The primary function of a dev branch is to provide the DEV environment where features can be tested with one another before being merged into QA and then into PROD.
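To make the clash concrete, here is a small sketch that compares the models defined on two feature branches and reports the names that exist on both with different logic. The dictionaries, the SQL fragments, and the `conflicting_models` helper are illustrative assumptions. In practice the conflict would surface as a Git merge conflict or a failing pipeline in DEV.

```python
def conflicting_models(branch_a, branch_b):
    """Return the model names defined on both branches with different logic."""
    shared = branch_a.keys() & branch_b.keys()
    return sorted(name for name in shared if branch_a[name] != branch_b[name])


# Hypothetical model definitions on two feature branches.
feature_a = {"calc_sales_order_average": "SUM(amount) / COUNT(order_id)"}
feature_b = {
    "calc_sales_order_average": "AVG(amount)",
    "calc_sales_total": "SUM(amount)",
}

print(conflicting_models(feature_a, feature_b))
```

Each branch passes its own tests in isolation. Only when both are brought together, as the dev branch does, does the duplicate definition become visible.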

The development workflow is as follows:

  • Cut a feature branch (feature-branch-name) from the dev branch.

  • Make and test changes on the feature branch (feature-branch-name).

  • Once satisfied, don't merge into dev straightaway. First, merge from dev into your feature branch (feature-branch-name). Your feature branch will now look the way dev will after you merge, allowing you to spot any issues before they reach a shared branch.

    note

    This is an optional but great best-practice step.

  • Once happy, create a merge request (MR) from feature-branch-name back to the dev branch.

  • Once tests run successfully in dev and stakeholders are happy, create a merge request from dev to qa.

  • Once tests run successfully in qa and stakeholders are happy, create a merge request from qa to prod.
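The promotion order in the workflow above (feature branch, then dev, then qa, then production) can be sketched as a small check on merge request targets. The names `PROMOTION_PATH`, `merge_target`, and `is_allowed_merge` are hypothetical and exist only for this illustration. The platform does not necessarily enforce the order this way.

```python
# Long-lived branches in promotion order, as described in the workflow above.
PROMOTION_PATH = ["dev", "qa", "production"]


def merge_target(source_branch):
    """Return the branch a merge request from source_branch should target.

    Feature branches target dev; production has no further target (None).
    """
    if source_branch not in PROMOTION_PATH:
        return PROMOTION_PATH[0]
    position = PROMOTION_PATH.index(source_branch)
    if position + 1 < len(PROMOTION_PATH):
        return PROMOTION_PATH[position + 1]
    return None


def is_allowed_merge(source_branch, target_branch):
    """True when the merge request follows the promotion order one step at a time."""
    return target_branch is not None and merge_target(source_branch) == target_branch
```

With this sketch, a merge request from a feature branch straight to production is rejected, while dev to qa and qa to production are accepted.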

Merge a single feature from dev to qa

Sometimes it is necessary to promote a single feature toward production without promoting the rest of the changes in the dev branch. Ideally, the dev branch should always be ready to promote upward, but this is not always the case.

To promote a single feature, cherry-pick its commit out of the dev branch onto a new feature branch cut from qa, and then create a merge request to merge this feature branch into qa. This merge request should be approved and applied the same way as any other merge into the qa branch.

Short-lived branches

Feature branches should exist for the shortest period possible. In many situations, they live for hours or a few days and are deleted when a merge request is approved.

tip

Remember to check the "Squash commits when the merge request is accepted" checkbox when creating a merge request.

In rare cases, a feature branch may need to live for longer than this. The longer it exists, the greater its likely divergence from the production branch. If a feature branch must live for a long time, it's strongly recommended to do regular merges from your source branch, usually from dev, into your feature branch to minimize this divergence. At least once a week is a good practice, although some of the best DataOps engineers do this daily.
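As a rough illustration of the weekly guideline, the following sketch flags a feature branch whose last merge from its source branch is older than a chosen limit. The `needs_sync` helper and its seven-day default are assumptions for this example. How you obtain the date of the last merge depends on your tooling.

```python
from datetime import datetime, timedelta


def needs_sync(last_merge_from_source, now, max_age=timedelta(days=7)):
    """True when the feature branch has gone longer than max_age without merging from its source."""
    return now - last_merge_from_source > max_age


# Hypothetical dates: the branch last merged from dev ten days before the check.
last_merge = datetime(2024, 1, 1, 9, 0)
checked_at = datetime(2024, 1, 11, 9, 0)
print(needs_sync(last_merge, checked_at))
```

Passing `max_age=timedelta(days=1)` models the daily habit mentioned above.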

Also, consider how large your work scope must be. Don't get hung up on the word "feature." The name is just a convenience. It means "a branch that contains a deployable unit of code." This could be anything from a single-character typo fix up to a moderate-sized feature. Anything larger than that is best decomposed into small, deployable pieces, each with its own short-lived feature branch.

Long-lived branches

Long-lived branches should be protected in the data product platform and never deleted. They will, of course, change over time as features are added and deprecated. However, the goal is to merge small code increments often, from dev to qa and then to production.

The main reasons include the following:

  • If there are issues, it's much easier to find and debug them when the last merge had two rather than 200 changes in a single merge request.
  • Issues are much easier to find and fix when they are fresh in the minds of the developers.
  • #TrueDataOps principles include delivering value regularly to the business. The business doesn't derive value from a feature sitting in a dev or qa branch for a long time before being merged into production.

These factors have a material impact on how to break down your work. If you are working on a large feature, break the work down so that the pieces you have completed and tested can be pushed up the branches, even if they are not yet visible to end users. The idea is to integrate little and often. In this scenario, you wouldn't create a single feature branch for this work but a feature branch for each small unit of work. The last merge request should then be only a few lines that make the feature visible to general users, even though the feature itself has been developed, tested, and deployed in pieces over months.

Scheduling pipelines for branches

DataOps pipelines are classified as real-time or event pipelines. And as with DevOps, CI pipelines are configured to build, test, and deploy code to specific environments or long-lived branches such as production, qa, and dev.

Moreover, running regularly scheduled pipelines on these long-lived branches is integral to keeping the development lifecycle for new and upgraded features as short as possible. This principle is based on DevOps and CI/CD. For more information on CI/CD within the DataOps context, navigate to Background to DataOps, Snowflake Technical Webinars.

Based on this information, it is imperative to ensure regularly scheduled pipelines for the PROD, QA, and DEV environments.

For example:

| Environment | Schedule | Description |
|---|---|---|
| PROD | Weekly | This pipeline is scheduled to run once a week on the production branch |
| PROD | Daily | This pipeline is scheduled to run once a day on the production branch |
| PROD | Hourly | This pipeline is scheduled to run once every hour on the production branch |
| QA | Daily | This pipeline is scheduled to run once a day on the qa branch |
| QA | Hourly | This pipeline is scheduled to run once every hour on the qa branch |
| DEV | Daily | This pipeline is scheduled to run once a day on the dev branch |
| DEV | Hourly | This pipeline is scheduled to run once every hour on the dev branch |
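One way to picture the schedule table is as a list of (environment, branch, cadence) entries, each translated into a standard five-field cron expression. This is only a sketch. The specific run times (02:00, Sundays) and the `SCHEDULES`, `CRON`, and `cron_for` names are assumptions, and the source does not show how schedules are actually configured in the platform.

```python
# Standard five-field cron expressions: minute hour day-of-month month day-of-week.
# The chosen times are illustrative assumptions.
CRON = {
    "Weekly": "0 2 * * 0",  # 02:00 every Sunday
    "Daily": "0 2 * * *",   # 02:00 every day
    "Hourly": "0 * * * *",  # at minute 0 of every hour
}

# Rows from the table above: (environment, branch, cadence).
SCHEDULES = [
    ("PROD", "production", "Weekly"),
    ("PROD", "production", "Daily"),
    ("PROD", "production", "Hourly"),
    ("QA", "qa", "Daily"),
    ("QA", "qa", "Hourly"),
    ("DEV", "dev", "Daily"),
    ("DEV", "dev", "Hourly"),
]


def cron_for(environment):
    """Return (branch, cron expression) pairs for every schedule of an environment."""
    return [(branch, CRON[cadence]) for env, branch, cadence in SCHEDULES if env == environment]
```

Calling `cron_for("QA")` returns the daily and hourly entries for the qa branch.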

Merging from one shared branch to another

When merging new features from one shared branch to another, such as from qa to production, features not ready to be deployed to production could also be in qa and will, consequently, be merged into production. In most cases, this scenario does not occur because all features in qa are expected to be deployable to production. However, there is sometimes a use case for merging a single feature (Feature A) from qa into production, leaving the rest (Feature B) behind.

There are three options to merge the single feature into production:

  • Manual control: Don't accept the merge request for Feature B into qa until Feature A is already in production.
  • Cherry-picking: Manually select the commits from qa that relate to Feature A but not to Feature B and apply only those commits to production.
  • Commit SHA: Create the merge request from a particular commit SHA, that is, the point on the qa branch where Feature A is present but Feature B is not.

tip

Cherry-picking is your best option in this scenario.
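A toy model of the qa history helps show why cherry-picking is the more robust choice. Below, each commit is a (SHA, feature) pair. Cherry-picking collects every commit of Feature A wherever it sits, whereas merging from a point on the branch only works while no Feature B commit has yet appeared. The history, the SHAs, and both helper functions are hypothetical. Real `git cherry-pick` applies each commit's changes and can still hit conflicts.

```python
def cherry_pick(history, feature):
    """Select, in order, the SHAs of all commits belonging to the given feature."""
    return [sha for sha, commit_feature in history if commit_feature == feature]


def last_point_without(history, unwanted_feature):
    """Return the SHA of the last commit before the unwanted feature first appears, or None."""
    last_sha = None
    for sha, commit_feature in history:
        if commit_feature == unwanted_feature:
            break
        last_sha = sha
    return last_sha


# Hypothetical qa history where the two features are interleaved.
qa_history = [("a1", "A"), ("b1", "B"), ("a2", "A")]

print(cherry_pick(qa_history, "A"))
print(last_point_without(qa_history, "B"))
```

Here the point-on-branch approach stops at `a1` and misses `a2`, because Feature B was merged in between. Cherry-picking picks up both Feature A commits.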

Development and feature branch databases

As described at the outset of this text, environment management is crucial to DataOps. This is true for long-lived branches like prod, qa, and dev and short-lived feature branches.

Consider how DevOps is used for modern software development. One of the best use cases for DevOps is the ability to cut a branch of your source code, develop a new feature, run a CI/CD pipeline that builds your application, and then deploy it to a Kubernetes cluster to run the data product.

A feature branch is not trivial, nor is it a toy environment. It looks, feels, and works exactly like the production environment. This is critical so that the developer and stakeholders can test that additions to the codebase work before the environment is torn down for the next iteration.

DataOps.live does precisely this with data. Cutting a branch of your data ingestion, modeling, and transformation code and then building it requires creating a new fully-featured data warehouse that works just like the production data warehouse.

Feature branches must be isolated from production to allow for development and testing before being torn down for the subsequent development and testing iteration cycle.

Snowflake performs the same role as Kubernetes in the data product platform architecture. It is the elastic environment where we can deploy any number of fully isolated and secure instances, each covering everything from the database down to any other Snowflake object.

DataOps.live automatically leverages Snowflake zero-copy cloning to create an isolated working environment for your feature branch data warehouse. The feature branch database is based on the most recent production database. This can be configured in your project's database configuration. The default behavior for projects created from the DataOps Template Project is:

dataops/snowflake/databases.template.yml
databases:
  "{{ env.DATAOPS_DATABASE }}":
    {# For non-production branches, this will be a clone of production #}
    {% if (env.DATAOPS_ENV_NAME != 'PROD' and env.DATAOPS_ENV_NAME != 'QA') %}
    from_database: "{{ env.DATAOPS_DATABASE_MASTER }}"
    {% endif %}

    comment: This is the main DataOps database for environment {{ env.DATAOPS_ENV_NAME }}
    grants:
      USAGE:
        - WRITER
        - READER
    ...

The {% if ... %} condition ensures cloning is only performed for dev and feature branches. The from_database key defines the source database to clone. The value of env.DATAOPS_DATABASE_MASTER is the name of the production database.
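The effect of the template condition can be mirrored in a few lines of Python. This is a sketch of the logic only, not how the platform renders the template. The `database_config` helper and the database names in the usage example are hypothetical, and the grants section is omitted.

```python
# Environments that are built from ingested data rather than cloned,
# matching the condition in the template above.
NON_CLONED_ENVS = {"PROD", "QA"}


def database_config(env_name, database, master_database):
    """Build the database entry, adding from_database only for cloned environments."""
    config = {
        "comment": f"This is the main DataOps database for environment {env_name}",
    }
    if env_name not in NON_CLONED_ENVS:
        # Dev and feature branch environments are cloned from production.
        config["from_database"] = master_database
    return {database: config}


# Hypothetical database names for illustration.
dev = database_config("DEV", "MYPROJECT_DEV", "MYPROJECT_PROD")
prod = database_config("PROD", "MYPROJECT_PROD", "MYPROJECT_PROD")
print("from_database" in dev["MYPROJECT_DEV"], "from_database" in prod["MYPROJECT_PROD"])
```

Only the dev entry carries a `from_database` key. The PROD and QA entries are created without a clone source.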