Git in 30 Seconds

Git plays a pivotal role in the data product platform, so a working knowledge of Git is essential. This document covers the few critical commands needed to build and run DataOps pipelines and their jobs. If you need to learn Git basics, see the site for an introduction to Git. For a more exhaustive guide, consult the online Pro Git Book.

In the meantime, here is a list of useful Git CLI (Command Line Interface) commands:

git pull

This command updates your local copy of the repository with the latest changes from the data product platform.

git push

This command pushes the committed changes in the feature branch to the data product platform. In other words, once you have committed your changes using the git commit command, you push them to the data product platform. For instance:

$ git push

If you haven't already pushed the feature branch to the data product platform, the git push command on its own won't work. You need to type in the following command:

$ git push --set-upstream origin <feature branch name>

Let's assume your feature branch is named my_new_feature_branch, then the command you must type is as follows:

$ git push --set-upstream origin my_new_feature_branch
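
The first-push sequence above can be tried safely in a throwaway setup. The following is a minimal sketch, assuming a local bare repository stands in for the data product platform; the temporary directory, the demo identity, and the file contents are placeholders, not part of the platform.

```shell
set -eu

# Assumption: a local bare repository plays the role of the platform remote.
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main        # name the initial branch "main"
git config user.email "demo@example.com"     # placeholder identity
git config user.name "Demo User"
git remote add origin "$work/remote.git"

echo "# demo project" > README.md
git add README.md
git commit -q -m "Initial commit"

git checkout -q -b my_new_feature_branch
echo "job: demo" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"

# First push of the branch: name the remote and the branch, and record the upstream.
git push -q --set-upstream origin my_new_feature_branch

# With the upstream recorded, a plain "git push" is enough for later commits.
echo "timeout: 60" >> my_new_pipeline_job.yml
git commit -q -am "Set timeout"
git push -q

upstream="$(git rev-parse --abbrev-ref '@{upstream}')"
echo "upstream: $upstream"
```

Once the upstream is recorded, git pull on this branch also knows which remote branch to read from.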

git status

This command returns the state of the working directory on your local drive and the git staging area. In other words, it shows you which files aren't being tracked by git as well as which files have and haven't been staged.

For instance:

Let's assume you are editing a pipeline job file called my_new_pipeline_job.yml, which git already tracks, in a feature branch called my_new_feature_branch. When you save the file and type git status on the bash CLI, the following detail is returned:

$ git status

On branch my_new_feature_branch
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in the working directory)
modified: .../my_new_pipeline_job.yml

no changes added to commit (use "git add" and/or "git commit -a")
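
To see how git status distinguishes a modified tracked file from a file git has never seen, here is a small sketch in a scratch repository. It assumes the compact --short output format; the second file name and all file contents are illustrative only.

```shell
set -eu

# Scratch repository; identity and contents are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "job: demo" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"

echo "timeout: 60" >> my_new_pipeline_job.yml      # change a tracked file
echo "job: second" > my_second_pipeline_job.yml    # create an untracked file

# --short prints one line per path: " M" = modified but not staged, "??" = untracked.
status="$(git status --short)"
printf '%s\n' "$status"
```

The long format shown above carries the same information, with untracked files listed under their own "Untracked files" heading.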

git diff

git diff is a command that shows the changes between two versions of your files, such as the working directory and the staging area, or two branches. Two variants of this command are relevant to this guide:

git diff

For example:

Let's assume you've created, tested, and merged the my_new_pipeline_job.yml file to the main repo on the data product platform. Now there is a request to add additional functionality to this file. Your colleague completes the changes and, as per DataOps best practices, asks you to review the changes before merging the updated file with the main branch.

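Using git diff is a quick and easy way to see the updates to this file. With no arguments, it shows the changes in your working directory that have not yet been staged; to compare your colleague's feature branch with the main branch, pass both branch names, for example git diff main <feature branch name>.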

git diff HEAD

The git diff HEAD command shows all changes to tracked files in the feature branch, staged or not, that have not yet been committed.

For instance, after adding your changed files to the git staging area (using git add), typing git diff HEAD will show the files changed and the changes made to each file.
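
The difference between the variants is easiest to see when one change is staged and another is not. The sketch below assumes a scratch repository with placeholder contents; it also uses git diff --staged, which is not covered in this guide, to show the staged half on its own.

```shell
set -eu

# Scratch repository; identity and contents are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "job: demo" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"

echo "timeout: 60" >> my_new_pipeline_job.yml
git add my_new_pipeline_job.yml                  # this change is staged
echo "retries: 3" >> my_new_pipeline_job.yml     # this change is not staged

unstaged="$(git diff)"             # working directory vs staging area
staged="$(git diff --staged)"      # staging area vs last commit
since_commit="$(git diff HEAD)"    # working directory vs last commit (both changes)
printf '%s\n' "$since_commit"
```

Here git diff reports only the retries line as added, git diff --staged only the timeout line, and git diff HEAD both.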

git add

This command takes two forms; both add changed files to the staging area before your next commit. The git status command output describes how to use the git add command.

git add <filename>

The git add <filename> command adds a single file to the git staging area so that you can commit the changes to the feature branch and push it to the data product platform using git push.

For example:

Let's continue with the git status example:

$ git add ../my_new_pipeline_job.yml

When you type in git status again, the following will be returned:

$ git status

On branch my_new_feature_branch
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .../my_new_pipeline_job.yml

git add -A

This command adds all the files you have changed, created, or deleted to the staging area in one step. You don't have to add each file individually as you do with the git add <filename> command.

For example:

You've changed your pipeline job file, my_new_pipeline_job.yml, and created a second pipeline job file called my_second_pipeline_job.yml. You want to add both files to the git staging area simultaneously. Typing git add -A will do this for you.

$ git add -A

When you type in git status, the following result will be returned:

$ git status

On branch my_new_feature_branch
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .../my_new_pipeline_job.yml
new file: ../my_second_pipeline_job.yml

Every time you make changes to this file, or any file in the feature branch, you will have to add the files to the staging area using the git add command.
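
The git add -A example can be reproduced in a scratch repository. This is a minimal sketch with placeholder contents, using the compact git status --short format to confirm what was staged.

```shell
set -eu

# Scratch repository; identity and contents are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "job: demo" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"

echo "timeout: 60" >> my_new_pipeline_job.yml      # modify the existing file
echo "job: second" > my_second_pipeline_job.yml    # create a second file

git add -A    # stage the modification and the new file together

# In --short output, a letter in the first column means "staged":
# "M " = staged modification, "A " = staged new file.
staged="$(git status --short)"
printf '%s\n' "$staged"
```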

git commit

The git commit command records your staged changes in your local repository; they reach the data product platform only when you run git push. Again there are two variants of this command:

git commit

Just typing in git commit opens an editor where you must type in a commit message. For instance:

$ git commit

This is my commit message
# Please enter the commit message for your changes. Lines starting with # will be ignored ...

If you exit out of the text editor without adding a commit message and saving it, the following error will be returned:

Aborting commit due to empty commit message

git commit -m

This command allows you to commit the staged changes and add a message in one command.

For example:

$ git commit -m "This is my commit message"

When you type in git status, the following message is returned:

$ git status

On branch my_new_feature_branch
nothing to commit, working tree clean
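
As a quick check that the commit was recorded with the intended message, the sketch below commits in a scratch repository and reads the message back with git log. The repository, identity, and file contents are placeholders.

```shell
set -eu

# Scratch repository; identity and contents are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "job: demo" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "This is my commit message"

subject="$(git log -1 --format=%s)"    # subject line of the latest commit
pending="$(git status --short)"        # empty when the working tree is clean
echo "last commit: $subject"
```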

git rm <filename>

This command removes a file from your working directory and stages the removal for your next commit.

For example:

Let's assume you've committed a file (my_incorrect_pipeline_config.yml) to your local git repo that you no longer need. To remove it, type:

$ git rm my_incorrect_pipeline_config.yml

If you want only to remove the file from the git repo but keep it on disk, you can add the --cached flag.

$ git rm --cached my_incorrect_pipeline_config.yml
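
The following sketch contrasts the two forms in a scratch repository. The second file name, old_job.yml, is hypothetical and only there to show plain git rm next to git rm --cached.

```shell
set -eu

# Scratch repository; identity, contents, and old_job.yml are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "job: wrong" > my_incorrect_pipeline_config.yml
echo "job: old" > old_job.yml
git add -A
git commit -q -m "Add two config files"

git rm -q old_job.yml                                  # removed from git and from disk
git rm -q --cached my_incorrect_pipeline_config.yml    # removed from git, kept on disk

still_tracked="$(git ls-files)"    # paths git still tracks (none in this sketch)
git commit -q -m "Stop tracking both files"
ls
```

After --cached, the file shows up as untracked in git status unless it is listed in .gitignore.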

git checkout

The command git checkout is used to navigate between existing git feature branches. For instance, let's assume you have two feature branches, my_new_feature_branch and my_second_feature_branch, and you wish to navigate between the two branches. Here is the code sample to navigate to the branch, my_new_feature_branch:

$ git checkout my_new_feature_branch

git checkout -b <branch-name>

The git checkout -b <branch-name> command creates a new branch and navigates to it.

For instance:

Let's assume you want to create a new feature branch and check it out so you can make changes to your DataOps project. The command git checkout -b <branch-name> will allow you to achieve this.

$ git checkout -b my_dataops_project_feature_branch

Switched to a new branch 'my_dataops_project_feature_branch'
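
Both forms of git checkout can be exercised together. This is a minimal sketch in a scratch repository; the initial branch name main and the file contents are assumptions for illustration.

```shell
set -eu

# Scratch repository; identity and contents are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/main    # name the initial branch "main"
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "# demo project" > README.md
git add README.md
git commit -q -m "Initial commit"

# Create a new branch from the current commit and switch to it.
git checkout -q -b my_dataops_project_feature_branch
on_feature="$(git rev-parse --abbrev-ref HEAD)"

# Switch back to an existing branch.
git checkout -q main
on_main="$(git rev-parse --abbrev-ref HEAD)"

echo "$on_feature -> $on_main"
```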

git pull --force

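This command fetches from your DataOps project repository with the --force option and then integrates the result into your current branch. The --force option is passed to the underlying fetch, so it allows local references to be updated even when the update is not a fast-forward.


Use with caution. On its own, git pull --force does not discard uncommitted edits or local commits, so do not rely on it to reset a branch. Commands that do overwrite local work, such as git reset --hard, will destroy all changes to local files, including those you might need to keep.

For instance, let's assume you have changed a file in your local repository but have not pushed the changes to the remote DataOps repository. One of your team members has also changed this file, and you need their version in your local feature branch. Running git pull --force alone will either stop or attempt a merge rather than overwrite your edits. To replace your local version with the remote one, run git fetch and then reset the branch to its remote counterpart with git reset --hard origin/<branch-name>.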
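
One commonly used way to make a local branch match its remote counterpart exactly is git fetch followed by git reset --hard. The sketch below is an illustration under stated assumptions: a local bare repository stands in for the DataOps repository, the teammate and mine directories simulate two team members, and config.yml with its contents is hypothetical.

```shell
set -eu

# Assumption: a local bare repository plays the role of the remote.
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/main

# A team member creates the first version and pushes it.
git init -q "$work/teammate"
cd "$work/teammate"
git symbolic-ref HEAD refs/heads/main
git config user.email "teammate@example.com"    # placeholder identity
git config user.name "Team Mate"
git remote add origin "$work/remote.git"
echo "version: 1" > config.yml
git add config.yml
git commit -q -m "Add config"
git push -q --set-upstream origin main

# You clone the repository and edit the file without committing.
git clone -q "$work/remote.git" "$work/mine"
cd "$work/mine"
echo "version: local-experiment" > config.yml

# Meanwhile the team member pushes a newer version.
cd "$work/teammate"
echo "version: 2" > config.yml
git commit -q -am "Bump version"
git push -q

# Back in your copy: fetch, then discard local work and match the remote branch.
cd "$work/mine"
git fetch -q origin
git reset -q --hard origin/main    # destructive: uncommitted edits are lost

content="$(cat config.yml)"
echo "$content"
```

If the local edits might be needed later, running git stash before the reset keeps a copy that can be restored afterwards.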

git cherry-pick

Append commits to the current HEAD

This command allows you to pick individual Git commits by reference and append them to the current working HEAD or branch. It is similar to forcing the contents of one branch into another; however, instead of forcing all of the commits into another branch simultaneously, you pick one commit at a time and add it to the checked-out branch (the HEAD).

git cherry-pick has two typical use cases in DataOps:

Undo changes

Let's assume you added a commit to the wrong branch. You must reverse the changes linked to this commit on that branch; however, you don't want to lose them.

The solution is to switch to the destination (or correct) feature branch and cherry-pick the commit from the source (or incorrect) feature branch. Cherry-picking copies the commit rather than moving it, so you still need to remove it from the source branch afterwards.

Promote a single feature into production

Additionally, as described in Branching Strategies, Merge a Single Feature From Dev to QA, it is sometimes necessary to promote a single commit (without the rest of the commits) up the branch hierarchy into production: from dev into QA, and then into production.

Using git cherry-pick: a practical example

On its own, the git cherry-pick command is as follows:

$ git cherry-pick <commitSha>

Let's assume the dev branch has a commit (commitSha ca3ab763) that must be moved to qa. The following commands are needed to cherry-pick this commit from the dev branch and append it to the qa branch:

$ git checkout qa
$ git cherry-pick ca3ab763
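
To see that only the chosen commit arrives on qa, the same two commands can be run in a scratch repository. In this sketch the feature file names and contents are hypothetical, and the commit SHA is looked up at run time rather than being ca3ab763.

```shell
set -eu

# Scratch repository; identity, file names, and contents are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "# demo project" > README.md
git add README.md
git commit -q -m "Initial commit"
git branch qa                         # qa starts from the initial commit

git checkout -q -b dev
echo "feature: a" > feature_a.yml
git add feature_a.yml
git commit -q -m "Add feature A"
echo "feature: b" > feature_b.yml
git add feature_b.yml
git commit -q -m "Add feature B"
commit_sha="$(git rev-parse --short HEAD)"    # the commit to promote

# Promote only "Add feature B" to qa; "Add feature A" stays on dev.
git checkout -q qa
git cherry-pick "$commit_sha" > /dev/null

picked="$(git log -1 --format=%s)"
echo "qa now ends with: $picked"
```

The cherry-picked commit gets a new SHA on qa because its parent differs from the one on dev.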

Many more commands are available for advanced git use cases supported by the data product platform. As highlighted above, for a more exhaustive guide, consult the online Pro Git Book.

Forcing contents of one branch into another branch

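Should you need to force the contents of one branch, such as dev, into another branch, such as qa, run the following commands. The merge -s ours step records a merge on dev that keeps dev's content and ignores qa's differences, so the subsequent merge into qa brings qa in line with dev: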

$ git checkout dev
$ git pull
$ git merge -s ours qa
$ git checkout qa
$ git merge dev
$ git push
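
The effect of this sequence can be verified locally before running it against real branches. The sketch below is a minimal illustration in a scratch repository without a remote, so git pull and git push are left out; settings.yml and its contents are hypothetical, and --no-edit is added only so that no editor opens for the merge message.

```shell
set -eu

# Scratch repository; identity, file name, and contents are placeholders.
work="$(mktemp -d)"
cd "$work"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "env: base" > settings.yml
git add settings.yml
git commit -q -m "Base settings"
git branch dev
git branch qa

# Let the two branches diverge with conflicting content.
git checkout -q qa
echo "env: qa" > settings.yml
git commit -q -am "QA change"
git checkout -q dev
echo "env: dev" > settings.yml
git commit -q -am "Dev change"
dev_tree="$(git rev-parse 'dev^{tree}')"    # dev's content before the merge

# Same sequence as above, minus pull and push.
git checkout -q dev
git merge -q --no-edit -s ours qa    # merge commit that keeps dev's content as-is
git checkout -q qa
git merge -q dev                     # fast-forwards qa to dev's merge commit

qa_tree="$(git rev-parse 'qa^{tree}')"
cat settings.yml
```

After the sequence, qa's content is identical to dev's, while qa's earlier commits remain in the history.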