Git in 30 Seconds
Git plays a pivotal role in the data product platform, so a working knowledge of Git is essential. This document covers a few critical commands needed to build and run DataOps pipelines and their jobs. To learn Git basics, see the freecodecamp.org introduction to Git. For a more exhaustive guide, see the online Pro Git Book.
In the meantime, here is a list of useful Git CLI (Command Line Interface) commands:
git pull
This command updates your local copy of the repository with the latest changes from the data product platform.
git push
This command pushes the committed changes in the feature branch to the data product platform. In other words, once you have committed your changes using the git commit command, you push them to the DataOps.live data product platform. For instance:
$ git push
If you haven't already pushed the feature branch to the data product platform, the git push command on its own won't work because the branch has no upstream branch yet. Instead, type the following command:
$ git push --set-upstream origin <feature branch name>
If your feature branch is named my_new_feature_branch, the command you must type is as follows:
$ git push --set-upstream origin my_new_feature_branch
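The first-push behavior described above can be tried safely in a throwaway sandbox. The following is a minimal sketch, not the platform's own setup: it assumes a local bare repository can stand in for the data product platform remote, and the directory names, user identity, and file contents are hypothetical.

```shell
set -eu

# Sandbox: a local bare repository stands in for the platform remote (assumption).
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"
git remote add origin "$sandbox/remote.git"

# An initial commit, then a feature branch with one change.
echo "# demo project" > README.md
git add README.md
git commit -q -m "Initial commit"
git checkout -q -b my_new_feature_branch
echo "example: true" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"

# First push: the branch has no upstream yet, so name the remote and branch.
git push -q --set-upstream origin my_new_feature_branch

# Later pushes can use the short form because the upstream is now recorded.
echo "second: true" >> my_new_pipeline_job.yml
git commit -q -am "Update pipeline job"
git push -q

# Show which remote branch the local branch tracks.
upstream=$(git rev-parse --abbrev-ref --symbolic-full-name "@{u}")
echo "$upstream"
```

After the first push with --set-upstream, git remembers the remote branch, which is why the plain git push works for the second commit.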
git status
This command returns the state of the working directory on your local drive and the git staging area. In other words, it shows you which files aren't being tracked by git as well as which files have and haven't been staged.
For instance:
Let's assume you are editing a pipeline job file called my_new_pipeline_job.yml that is already tracked in a feature branch called my_new_feature_branch. When you save the file and type git status on the bash CLI, the following detail is returned:
$ git status
On branch my_new_feature_branch
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in the working directory)
modified: .../my_new_pipeline_job.yml
no changes added to commit (use "git add" and/or "git commit -a")
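To see how git status distinguishes a modified tracked file from a brand-new untracked one, here is a minimal sketch in a throwaway repository. The sandbox directory, user identity, and file contents are hypothetical; the --porcelain option prints the status in a short format intended for scripts.

```shell
set -eu

# Sandbox repository with one committed (tracked) file.
sandbox=$(mktemp -d)
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"
echo "v1" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"

# Modify the tracked file and create a second file that git has never seen.
echo "v2" >> my_new_pipeline_job.yml
echo "v1" > my_second_pipeline_job.yml

# ' M' marks a tracked file with unstaged modifications;
# '??' marks an untracked file.
status=$(git status --porcelain)
echo "$status"
```

The long-form git status output reports the same information under the "Changes not staged for commit" and "Untracked files" headings.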
git diff
The git diff command shows the changes between two sources, such as the working directory and the staging area, two commits, or two branches. Two variants of this command are relevant to this guide:
git diff
For example:
Let's assume you've created, tested, and merged the my_new_pipeline_job.yml file to the main repo on the data product platform. Now there is a request to add functionality to this file. Your colleague completes the changes and, as per DataOps best practices, asks you to review them before merging the updated file into the main branch.
Using git diff is a quick and easy way to see the updates to this file. With no arguments, git diff shows only the changes in the working directory that have not been staged; to compare your colleague's feature branch with main, pass both branch names, for example git diff main <feature branch name>.
git diff HEAD
The git diff HEAD command shows all changes made to tracked files in the feature branch, staged or not, that have not yet been committed.
For instance, after adding your changed files to the git staging area (using git add), typing git diff HEAD lists the files changed and the changes made to each file.
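The difference between the diff variants is easiest to see with one staged and one unstaged change in the same file. The following is a minimal sketch in a throwaway repository; the sandbox directory, user identity, and file contents are hypothetical, and git diff --staged is included only for comparison.

```shell
set -eu

# Sandbox repository with one committed file.
sandbox=$(mktemp -d)
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"
echo "v1" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"

# One change that is staged, and a later one that is not.
echo "staged change" >> my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
echo "unstaged change" >> my_new_pipeline_job.yml

unstaged=$(git diff)              # working directory vs staging area
staged=$(git diff --staged)       # staging area vs last commit
all_changes=$(git diff HEAD)      # working directory vs last commit

echo "$all_changes"
```

Here git diff reports only the unstaged line as added, while git diff HEAD reports both added lines because it compares against the last commit.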
git add
This command takes two forms, both of which add changed files to the staging area before your next commit. The git status command output describes how to use the git add command.
git add <filename>
The git add <filename> command adds a single file to the git staging area so that you can commit the change to the feature branch and push it to the data product platform using git push.
For example:
Let's continue with the git status example:
$ git add ../my_new_pipeline_job.yml
When you type in git status again, the following is returned:
$ git status
On branch my_new_feature_branch
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .../my_new_pipeline_job.yml
git add -A
This command adds all the files you have changed to the staging area simultaneously. You don't have to add each file individually, as you do with the git add <filename> command.
For example:
You've changed your pipeline job file, my_new_pipeline_job.yml, and created a second pipeline job file called my_second_pipeline_job.yml. You want to add both files to the git staging area simultaneously. Typing git add -A will do this for you.
$ git add -A
When you type in git status, the following result is returned:
$ git status
On branch my_new_feature_branch
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .../my_new_pipeline_job.yml
new file: ../my_second_pipeline_job.yml
Every time you change this file, or any file in the feature branch, you must add the changed files to the staging area again using the git add command.
git commit
The git commit command records your staged changes in your local repository; they reach the data product platform only when you run git push. Again, there are two variants of this command:
git commit
Typing git commit on its own opens an editor where you must type in a commit message. For instance:
$ git commit
This is my commit message
# Please enter the commit message for your changes. Lines starting with # will be ignored ...
If you exit the text editor without adding and saving a commit message, the following message is returned:
Aborting commit due to empty commit message
git commit -m
This command allows you to commit the staged changes and add a message in one command.
For example:
$ git commit -m "This is my commit message"
When you type in git status, the following message is returned:
$ git status
On branch my_new_feature_branch
nothing to commit, working tree clean
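A commit is recorded in the local repository only, and nothing reaches the remote until git push runs. The following minimal sketch makes that visible by counting the commits the local branch is ahead of its upstream. It assumes a local bare repository as a stand-in for the data product platform; the sandbox directories, user identity, and file contents are hypothetical.

```shell
set -eu

# Sandbox: a local bare repository stands in for the platform remote (assumption).
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/remote.git"
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"
git remote add origin "$sandbox/remote.git"

# First commit, pushed so the current branch gets an upstream.
echo "v1" > my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "Add pipeline job"
git push -q --set-upstream origin HEAD

# Second commit: stage, then commit with a message in one command.
echo "v2" >> my_new_pipeline_job.yml
git add my_new_pipeline_job.yml
git commit -q -m "This is my commit message"

# Count commits that exist locally but not on the upstream branch.
ahead_before=$(git rev-list --count "@{u}..HEAD")
git push -q
ahead_after=$(git rev-list --count "@{u}..HEAD")
echo "ahead before push: $ahead_before, after push: $ahead_after"
```

The count drops to zero only after the push, which is the point at which the commit becomes visible to the rest of the team.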
git rm <filename>
This command removes a file from your local repo: it deletes the file from your working directory and stages the removal for your next commit.
For example:
Let's assume you've created a file (my_incorrect_pipeline_config.yml) that you no longer need and saved it to your local git repo. To remove it, type:
$ git rm my_incorrect_pipeline_config.yml
If you only want to remove the file from the git repo but keep it on disk, add the --cached flag:
$ git rm --cached my_incorrect_pipeline_config.yml
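The effect of the --cached flag can be checked in a throwaway repository. This is a minimal sketch; the sandbox directory, user identity, and file contents are hypothetical.

```shell
set -eu

# Sandbox repository with the unwanted file already committed.
sandbox=$(mktemp -d)
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"
echo "wrong: settings" > my_incorrect_pipeline_config.yml
git add my_incorrect_pipeline_config.yml
git commit -q -m "Add config by mistake"

# Stop tracking the file but leave it on disk.
git rm -q --cached my_incorrect_pipeline_config.yml

# git no longer lists the file as tracked...
tracked=$(git ls-files)
echo "tracked files: '$tracked'"

# ...but it is still present in the working directory.
ls my_incorrect_pipeline_config.yml
```

The removal still has to be committed, and the file shows up as untracked afterwards unless it is added to .gitignore.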
git checkout
The git checkout command is used to navigate between existing git feature branches. For instance, let's assume you have two feature branches, my_new_feature_branch and my_second_feature_branch, and you wish to navigate between them. Here is the command to navigate to the branch my_new_feature_branch:
$ git checkout my_new_feature_branch
git checkout -b <branch-name>
The git checkout -b <branch-name> command creates a new branch and navigates to it.
For instance:
Let's assume you want to create a new feature branch and check it out so you can make changes to your DataOps project. The command git checkout -b <branch-name> allows you to achieve this.
$ git checkout -b my_dataops_project_feature_branch
Switched to a new branch 'my_dataops_project_feature_branch'
git pull --force
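This command is often used with the intention of overwriting local changes with those in your DataOps project repository. Note that the --force option is passed on to git fetch, so it forces the update of your local references to the remote branches; on its own, it does not discard modified files in your working directory.
To replace your local branch and files with the remote version, a common approach is to run git fetch followed by git reset --hard origin/<branch-name>. Use with caution, as this discards all local changes to tracked files, including those you might need to keep.
For instance, let's assume you have changed the README.md file in your local repository but have not pushed the changes to the remote DataOps repository. One of your team members has also made changes to this file that you need to reference in your local feature branch. Fetching and hard-resetting to the remote branch brings in the remote changes and overwrites your local ones.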
git cherry-pick
Append commits to the current HEAD
This command allows you to pick individual Git commits by reference and append them to the current working HEAD or branch. It is similar to forcing the contents of one branch into another; however, instead of bringing over all of the commits at once, you pick one commit at a time and add it to the checked-out branch (the HEAD).
git cherry-pick has two typical use cases in DataOps:
Undo changes
Let's assume you added a commit to the wrong branch. You must reverse the changes linked to this commit; however, you don't want to lose these changes.
The solution is to switch to the destination (correct) feature branch and cherry-pick the commit from the source (incorrect) feature branch. The git cherry-pick command is the ideal solution to this problem.
Promote a single feature into production
Additionally, as described in Branching Strategies, Merge a Single Feature From Dev to QA, it is sometimes necessary to promote a single commit (without the rest of the commits) up the branch hierarchy into production: from dev into QA, and then into production.
Using git cherry-pick: a practical example
On its own, the git cherry-pick command is as follows:
$ git cherry-pick commitSha
Let's assume the dev branch has a commit (commitSha ca3ab763) that must be moved to qa. The following commands cherry-pick this commit from dev and append it to the qa branch:
$ git checkout qa
$ git cherry-pick ca3ab763
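To try the promotion of a single commit end to end, the following minimal sketch builds a throwaway repository with dev and qa branches and cherry-picks one of two dev commits into qa. The sandbox directory, user identity, file names, and commit messages are hypothetical, and the commit SHA is read from the repository rather than hard-coded.

```shell
set -eu

# Sandbox repository with a shared starting commit for dev and qa.
sandbox=$(mktemp -d)
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"
echo "# demo project" > README.md
git add README.md
git commit -q -m "Initial commit"
git branch qa
git checkout -q -b dev

# Two commits on dev; only the first one should be promoted.
echo "feature: a" > feature_a.yml
git add feature_a.yml
git commit -q -m "Add feature A"
commit_sha=$(git rev-parse HEAD)
echo "feature: b" > feature_b.yml
git add feature_b.yml
git commit -q -m "Add feature B"

# Switch to the destination branch and append the chosen commit.
git checkout -q qa
git cherry-pick "$commit_sha"

# qa now contains feature A but not feature B.
git log --oneline
```

The cherry-picked commit gets a new SHA on qa because its parent differs, even though it carries the same change and message.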
Many more commands are available for advanced git use cases supported by the data product platform. As highlighted above, for a more exhaustive guide, consult the online Pro Git Book.
Forcing contents of one branch into another branch
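Should you need to force the contents of one branch, such as dev, into another branch, such as qa, run the following commands. The merge with the ours strategy records qa as merged into dev while keeping the contents of dev unchanged, so the final merge into qa is a simple fast-forward: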
$ git checkout dev
$ git pull
$ git merge -s ours qa
$ git checkout qa
$ git merge dev
$ git push
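The sequence above can be rehearsed in a throwaway repository before running it against real branches. This is a minimal sketch that assumes no remote, so the git pull and git push steps are left out; the sandbox directory, user identity, file names, and contents are hypothetical.

```shell
set -eu

# Sandbox repository where dev and qa have diverged with conflicting content.
sandbox=$(mktemp -d)
git init -q "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"
echo "base version" > config.yml
git add config.yml
git commit -q -m "Initial commit"
git branch qa
git checkout -q -b dev
echo "dev version" > config.yml
git commit -q -am "Change on dev"
git checkout -q qa
echo "qa version" > config.yml
git commit -q -am "Conflicting change on qa"

# Merge qa into dev with the 'ours' strategy: history is joined,
# but the resulting content is exactly what dev already had.
git checkout -q dev
git merge -q --no-edit -s ours qa

# qa can now fast-forward to dev without conflicts.
git checkout -q qa
git merge -q dev

cat config.yml
```

After the final merge, qa and dev point at the same commit, and every change that existed only on qa is no longer present in the files, so use this only when that outcome is intended.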