Leveling Up Git Skills
Git Step by Step from File to Commit #
Working with Git is a common task for developers, and different projects have different expectations of how Git should be used. Over the last few months I had to upgrade my Git skills, and I want to share some of the things I learned.
‼️ This post is about what I learned beyond the basics, so some basic Git knowledge is assumed.
For this blog post I mainly used the Git Book [1] and the O’Reilly Course [2].
Hashes as Object IDs #
Git identifies everything it stores by a hash, referred to as the Object ID (OID). For a file, Git hashes the content (prefixed with a short header containing the object type and size) using SHA-1, which produces a 40-character hexadecimal identifier that is, for practical purposes, unique to that content. Two files with the same content get the same hash; the file name and location do not matter.
> cat README.md | git hash-object --stdin
0588748a531675ba06e473ae4d74532cdb2f94d3
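The hashing step can be reproduced outside of Git. The following is a minimal sketch in Python, assuming the default SHA-1 object format; the function name git_blob_oid is made up for this post and is not part of any Git API.

```python
import hashlib


def git_blob_oid(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git does not hash the bare content: it prepends a header made of the
    # object type, a space, the content size in bytes, and a NUL byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The file name never enters the calculation, only the bytes do.
    print(git_blob_oid(b"test content\n"))
```

For the input above, the result should match what echo 'test content' | git hash-object --stdin prints (the Git Book uses the same example).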
Blob is one Git Object #
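When a file is added, Git creates a blob in the .git/objects directory, with the hash as its name and the zlib-compressed content of the file as its content. The first two characters of the hash become the directory name, the remaining 38 the file name.
Git is a content-addressable file system, which means that the content of a file determines where it is stored.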
.git/objects
├── 05
│ └── 88748a531675ba06e473ae4d74532cdb2f94d3
| "Blob of the file content (zipped)"
├── info
└── pack
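The layout above can be imitated in a few lines of Python. This is a sketch under the assumption of "loose" (unpacked) objects and SHA-1; write_loose_blob and read_loose_object are hypothetical helper names, and the objects are written to a temporary directory rather than a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_blob(objects_dir: Path, content: bytes) -> Path:
    """Store content the way a loose blob is laid out under .git/objects."""
    store = b"blob %d\x00" % len(content) + content
    oid = hashlib.sha1(store).hexdigest()
    # First two hex characters: directory name. Remaining 38: file name.
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return path


def read_loose_object(path: Path) -> bytes:
    """Decompress a loose object and strip its 'type size' header."""
    store = zlib.decompress(path.read_bytes())
    _header, _, content = store.partition(b"\x00")
    return content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        blob_path = write_loose_blob(Path(tmp), b"test content\n")
        print(blob_path.relative_to(tmp))
        print(read_loose_object(blob_path))
```

Writing the same content twice ends up at the same path, which is why identical files only need one blob.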
If multiple files are added, one blob is created for each file, and two files with the same content are stored as one blob. Editing a file and adding it again creates a new blob with a new hash; the old blob is still there. Without a commit, this blob is not reachable, because Git only keeps the history of committed files. Git objects are immutable: a blob cannot be changed, only new blobs can be created. Eventually, the Git garbage collector removes blobs that are no longer referenced.
Git does not copy files between versions. For an unchanged file no new blob is written; the new tree object simply contains the hash of the existing blob, so both versions point to the same content.
A Tree is a Git Object #
To store the relation between a file name and its blob, Git creates a tree object. A tree object is a file containing a list of entries, each with the file mode, the object type, the Object ID, and the file name. The hash of this tree file is in turn the Object ID of the tree, so it is also the same whenever all file contents and file names in the tree are the same.
> git write-tree
2d66c1b551aae5efa0d242cc105b8531d3ae5e4b
> git cat-file -p 2d66c1b551aae5efa0d242cc105b8531d3ae5e4b
040000 tree 9b6bc3d88d357e1f02cda9e279f0bc9a9c5c11a3 content
040000 tree b8e6c8eb08befc5a0fde3b0c2821c806c7be2b36 public
100644 blob 0588748a531675ba06e473ae4d74532cdb2f94d3 readme.md
100644 blob 0588748a531675ba06e473ae4d74532cdb2f94d3 readme_copy.md
040000 tree bac8dc35897c3fe58c6f3d321deea5ee7dfff50f resources
git write-tree creates a tree object from the staged files; this is usually done automatically when committing.
In the tree file, different files with the same content point to the same hash (blob).
The output shows the tree object with the following structure:
100644 blob 0588748a531675ba06e473ae4d74532cdb2f94d3 readme.md
100644 -> Git file type (first four bits: binary 1000 (regular file), 1010 (symbolic link), or 1110 (gitlink))
and the Unix file permissions as octal digits (0644/-rw-r--r--)
blob -> The type of the object, in this case a blob. Subdirectories (like public)
are modeled as trees.
0588748a...-> The OID (Object ID), Hash of the blob
readme.md -> The name of the file
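The split of the mode into type bits and permissions can be checked with Python's standard stat module. This is only an illustrative sketch; describe_mode is a made-up helper, not something Git provides.

```python
import stat


def describe_mode(mode_text):
    """Split a Git tree mode such as '100644' into type bits and permissions."""
    mode = int(mode_text, 8)  # the mode is written in octal
    type_bits = format(mode >> 12, "04b")  # the top four bits hold the file type
    permissions = stat.filemode(mode)[1:]  # drop the leading type character
    return type_bits, permissions


if __name__ == "__main__":
    for mode_text in ("100644", "120000", "160000", "040000"):
        print(mode_text, describe_mode(mode_text))
```

For 100644 this yields the type bits 1000 and the permissions rw-r--r--; directories (040000) show up with the type bits 0100.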
After git write-tree, the tree objects are also stored in the .git/objects directory.
.git/objects
├── 2d # The root tree object
│ └── 66c1b551aae5efa0d242cc105b8531d3ae5e4b
├── 9b # The tree object for the content directory
│ └── 6bc3d88d357e1f02cda9e279f0bc9a9c5c11a3
├── b8 # The tree object for the public directory
│ └── e6c8eb08befc5a0fde3b0c2821c806c7be2b36
├── 05 # The blob object for the readme.md file
│ └── 88748a531675ba06e473ae4d74532cdb2f94d3
...
├── info
└── pack
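How a tree ends up with its own Object ID can be sketched in Python as well. The assumptions here: SHA-1 object format, entries given as (mode, name, oid) tuples, and modes written the way Git stores them internally (a directory is "40000", without the leading zero that git cat-file -p prints). tree_oid and hash_object are hypothetical helper names.

```python
import hashlib


def hash_object(kind: bytes, body: bytes) -> str:
    """Hash an object body with Git's '<type> <size>\\0' header."""
    return hashlib.sha1(kind + b" %d\x00" % len(body) + body).hexdigest()


def tree_oid(entries):
    """Compute the OID of a tree from (mode, name, oid_hex) entries."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts entries by name; directories compare as if they ended in '/'.
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, oid_hex in sorted(entries, key=sort_key):
        # Each entry: "<mode> <name>\0" followed by the 20 raw bytes of the OID.
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid_hex)
    return hash_object(b"tree", body)


if __name__ == "__main__":
    blob = "0588748a531675ba06e473ae4d74532cdb2f94d3"
    print(tree_oid([("100644", "readme.md", blob), ("100644", "readme_copy.md", blob)]))
```

Because only modes, names and OIDs go into the hash, two directories with identical content produce the same tree OID, and renaming a single file changes it.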
A Commit is a Git Object #
Git creates a commit object, which again is a file containing the hash of a tree object, the hash of the parent commit (none for the first commit, several for a merge), an author and committer, and the commit message. The commit object is also hashed and stored under that hash in the .git/objects directory.
❯ git cat-file -p 46cbcbdd595811bb59cbbc307d061dfaa7478a85
tree 6a58a71284938e520f1c21f24971dbe14821fbd7
parent 45dda006756054c4fd77fd9a0423a7033f21dbfd
author Sofia Fischer <sofia@philodev.com> 1752449645 +0200
committer Sofia Fischer <sofia@philodev.com> 1752449645 +0200
✨ Post about bitemporal data
This behavior is what set Git apart from other version control systems in the past (although alternatives with this behavior exist by now).
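The commit object described above is hashed in the same way as blobs and trees. Here is a minimal Python sketch, assuming SHA-1, no GPG signature and no extra headers; commit_oid is a made-up name. Since the exact bytes of a real commit message are not visible in the output above, this is not expected to reproduce the hash shown there.

```python
import hashlib


def commit_oid(tree, parents, author, committer, message):
    """Compute the OID of a simple commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]  # zero, one, or several
    lines += [f"author {author}", f"committer {committer}"]
    lines += ["", message]  # a blank line separates the headers from the message
    body = ("\n".join(lines) + "\n").encode("utf-8")
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


if __name__ == "__main__":
    identity = "Sofia Fischer <sofia@philodev.com> 1752449645 +0200"
    print(
        commit_oid(
            tree="6a58a71284938e520f1c21f24971dbe14821fbd7",
            parents=["45dda006756054c4fd77fd9a0423a7033f21dbfd"],
            author=identity,
            committer=identity,
            message="✨ Post about bitemporal data",
        )
    )
```

Changing anything, including the parent or the timestamp, results in a different commit hash, which is why rewriting an old commit also rewrites every commit after it.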
Branches, Heads, Tags #
Branches #
A branch is a named reference to a commit. Branches are stored in the .git/refs/heads directory.
❯ ls .git/refs/heads
master bitemporal
❯ cat .git/refs/heads/master
96b73cbfd9cb66356737ce3282986c2f8aa225b1
Deleting a branch only deletes the file holding the reference to one commit. If the branch is not fully merged, this might delete the only reference to that commit, leaving it “orphaned” and unreachable.
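What "unreachable" means can be modelled with a tiny graph walk in Python. Everything in this sketch is hypothetical: the commit names c1, c2, c3, f1 and the dictionaries stand in for real commit objects and reference files.

```python
def reachable(refs, parents):
    """Return all commits reachable from any reference by following parents."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


if __name__ == "__main__":
    # c1 <- c2 <- c3 (master), and f1 branches off c2 (bitemporal).
    parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "f1": ["c2"]}
    refs = {"master": "c3", "bitemporal": "f1"}
    print(sorted(reachable(refs, parents)))

    del refs["bitemporal"]  # "delete the branch": only the reference goes away
    print(sorted(reachable(refs, parents)))  # f1 is no longer reachable
```

The commit f1 still exists after the deletion, but nothing points to it any more, so the garbage collector may eventually remove it.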
Head #
The head is a reference to the current branch. It is stored in .git/HEAD and contains the name of the current branch.
The head references the commit that will be the parent of the next commit.
❯ cat .git/HEAD
ref: refs/heads/main
Head may also point to a commit directly; this is called a “detached head”. What sounds spooky is actually a nice way to travel back in time and look at the state of the repository at a specific commit. The reason why “detached head” causes panic is that the next commit will not be on a branch, and will be orphaned as soon as head is changed. It is a feature of branches that they move on commit to point to the latest commit on the branch. So the solution is to just switch back to a branch.
One way to “save” a commit in a detached head state is to add a tag to the commit. Tags are references to a specific commit. Now when head is changed, the commit will not be orphaned, as the tag will still point to the commit.
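The two states of head can be told apart by reading the file. The following Python sketch assumes loose reference files only (it ignores packed-refs, where Git may also keep branches) and works on a fake directory; resolve_head is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path):
    """Return (ref_name, oid); ref_name is None when head is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref_name = head[len("ref: "):]
        # Attached: HEAD names a branch file, which contains the commit hash.
        return ref_name, (git_dir / ref_name).read_text().strip()
    # Detached: HEAD contains the commit hash itself.
    return None, head


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        git_dir = Path(tmp)
        (git_dir / "refs" / "heads").mkdir(parents=True)
        (git_dir / "refs" / "heads" / "main").write_text(
            "96b73cbfd9cb66356737ce3282986c2f8aa225b1\n"
        )
        (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
        print(resolve_head(git_dir))
```

In the detached case there is no branch file that would move along with a new commit, which is exactly why such a commit needs a tag or a branch to stay reachable.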
Making pretty Git Commits in Practice #
The git commits of some developers look as if the developer had a plan for the feature that unfolded flawlessly: one commit contains refactoring to make the change easy, one commit adds not-yet-reachable new code with tests, and one connects it in multiple places. This dream of every reviewer is usually not the result of a developer who planned every action perfectly upfront, but rather the result of using git to create readable, easy-to-review commits.
Committing some lines of a file #
Many of the features I implement start out as a single commit, but then I realise that some refactoring should go into a separate commit.
Git allows staging only parts of a file with git add -p or git add --patch. This will show the changes in the file and allow you to select a subset of the changes to stage. For each hunk of code, git will ask you whether you want to stage it:
y - yes, stage this hunk
n - no, do not stage this hunk
a - stage this and all the remaining hunks in the file
d - do not stage this hunk nor any of the remaining hunks in the file
g - go to the selected hunk
/ - search for a hunk matching the given regex
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks
e - edit the current hunk manually
? - print help
In PyCharm (or IntelliJ) you can also stage parts of a file by selecting the lines you want to stage in the preview of the commit dialog.
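To make the idea of a hunk more concrete, here is a small Python sketch using difflib from the standard library. It is not how git add -p is implemented; hunk_headers, stage_selected and the sample lines are made up for illustration, and the sketch assumes the changes are far enough apart that each change forms its own hunk.

```python
import difflib


def hunk_headers(old, new):
    """Return the '@@ ... @@' header of every hunk in a unified diff."""
    return [line for line in difflib.unified_diff(old, new) if line.startswith("@@")]


def stage_selected(old, new, selected):
    """Build the 'staged' content: the old lines plus only the selected changes.

    selected contains the indexes (starting at 0) of the changes to stage.
    """
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    staged = []
    change_no = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            staged.extend(old[i1:i2])
            continue
        # A changed block: take the new lines if selected, else keep the old ones.
        staged.extend(new[j1:j2] if change_no in selected else old[i1:i2])
        change_no += 1
    return staged


if __name__ == "__main__":
    old = [f"line {i}\n" for i in range(1, 21)]
    new = list(old)
    new[1] = "line 2 changed\n"    # pretend this is the refactoring
    new[17] = "line 18 changed\n"  # pretend this is the feature
    print(hunk_headers(old, new))
    staged = stage_selected(old, new, {0})  # like answering "y" then "n"
    print(staged[1], staged[17], sep="")
```

Staging only the first change leaves the second one in the working copy, ready to go into a separate commit.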
Change something in an existing commit #
It is also possible to change something in a past commit to ensure a clean commit history. The easiest way to do this is using git commit --amend. This will take the currently staged changes and add them to the last commit. This is only possible if the commit to change is the last commit.
If the commit is further down the commit history, git allows changing it by adding a new commit with the message fixup! <commit message>. As an easier way to do this, you can use git commit --fixup <commit hash>, or in the GUI of PyCharm right-click the commit to change and select “Fixup commit”.
As an example, to fix some spelling mistakes in a blog post ✨add post about bitemporal data after I changed some internal things for my blog, I could end up with the following commits:
d01fd6 fixup! ✨add post about bitemporal data
c028f4 🚀 use go modules instead of git submodules
b032c2 ✨add post about bitemporal data
The fixup commit will be squashed into the b032c2 commit during a rebase that uses --autosquash (for example git rebase -i --autosquash).
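The reordering that --autosquash performs on the rebase todo list can be approximated in Python. This is a simplified sketch: it only handles fixup! commits and only matches on the exact subject line, whereas git has more matching rules. The autosquash function and its input format (oldest commit first) are assumptions made for this example.

```python
def autosquash(commits):
    """Reorder (short_hash, subject) pairs, oldest first, into todo lines."""
    todo = []  # entries are (action, short_hash, subject)
    for short_hash, subject in commits:
        entry = ("pick", short_hash, subject)
        position = len(todo)
        if subject.startswith("fixup! "):
            target = subject[len("fixup! "):]
            for index, (action, _, existing) in enumerate(todo):
                if action == "pick" and existing == target:
                    # Place the fixup right after its target, behind any
                    # fixups that are already attached to it.
                    position = index + 1
                    while position < len(todo) and todo[position][0] == "fixup":
                        position += 1
                    entry = ("fixup", short_hash, subject)
                    break
        todo.insert(position, entry)
    return [" ".join(entry) for entry in todo]


if __name__ == "__main__":
    history = [  # oldest first, the reverse of the log shown above
        ("b032c2", "✨add post about bitemporal data"),
        ("c028f4", "🚀 use go modules instead of git submodules"),
        ("d01fd6", "fixup! ✨add post about bitemporal data"),
    ]
    print("\n".join(autosquash(history)))
```

The fixup commit d01fd6 moves directly behind b032c2 and its action changes from pick to fixup, so its changes are melded into the original commit while the original message is kept.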
Interactive Rebase #
Interactive rebase is a powerful tool to change the commit history. It allows you to reorder, squash, or edit commits. git rebase -i <commit hash> will open an editor with a list of the commits that come after the given commit hash.
> git rebase --interactive 96b73cbfd9cb66356737ce3282986c2f8aa225b1
pick 1e57725 # ✨ Post about git
pick be9293c # 🖋continue on git post
pick 3003a57 # 🖋add disclaimer for git post
# Rebase 96b73cb..3003a57 onto 96b73cb (3 commands)
#
# Commands:
# p, pick <commit> = use commit
# r, reword <commit> = use commit, but edit the commit message
# e, edit <commit> = use commit, but stop for amending
# s, squash <commit> = use commit, but meld into previous commit
# f, fixup [-C | -c] <commit> = like "squash" but keep only the previous
# commit's log message, unless -C is used, in which case
# keep only this commit's message; -c is same as -C but
# opens the editor
# x, exec <command> = run command (the rest of the line) using shell
# b, break = stop here (continue rebase later with 'git rebase --continue')
# d, drop <commit> = remove commit
# l, label <label> = label current HEAD with a name
# t, reset <label> = reset HEAD to a label
Don’t panic - this is vim 😬 (or whichever editor git is configured to use). In this text file, pick can be replaced with, for example, edit, which will stop the rebase at this commit and allow making changes to it; a separate break line pauses the rebase between two commits. Also, the order of the commits can be changed.
Once it is decided what should be done with each commit, :wq will save the changes, exit the editor, and start the rebase. If all the commits are picked, the rebase will run through without stopping, like a regular rebase.
If a commit is set to edit (or a break line is reached), the rebase will stop there and any changes can be made to the files, after which I can add / stage the changes and continue the rebase with git rebase --continue.
Mind that creating a new commit at this stage inserts it into the history at the point where the rebase stopped, before the remaining commits in the list.
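The effect of the todo list on the resulting history can be modelled in a few lines of Python. This sketch only covers pick, squash, fixup and drop, and it tracks commit messages instead of file contents; apply_todo and the messages dictionary are hypothetical and have nothing to do with how git implements the rebase.

```python
def apply_todo(todo_text, messages):
    """Replay a simplified todo list and return the resulting commit messages.

    messages maps short hashes to commit messages.
    """
    aliases = {"p": "pick", "s": "squash", "f": "fixup", "d": "drop"}
    history = []
    for raw_line in todo_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # ignore comments and subjects
        if not line:
            continue
        parts = line.split()
        command = aliases.get(parts[0], parts[0])
        short_hash = parts[1] if len(parts) > 1 else None
        if command == "pick":
            history.append(messages[short_hash])
        elif command == "squash":
            # Meld into the previous commit and combine both messages.
            history[-1] = history[-1] + "\n\n" + messages[short_hash]
        elif command in ("fixup", "drop"):
            # fixup: changes are melded, the previous message is kept.
            # drop: the commit disappears from the history.
            continue
        else:
            raise ValueError(f"command not modelled in this sketch: {command}")
    return history


if __name__ == "__main__":
    todo = """\
pick 1e57725 # ✨ Post about git
squash be9293c # 🖋continue on git post
fixup 3003a57 # 🖋add disclaimer for git post
"""
    messages = {
        "1e57725": "✨ Post about git",
        "be9293c": "🖋continue on git post",
        "3003a57": "🖋add disclaimer for git post",
    }
    print(apply_todo(todo, messages))
```

With this todo list, the three commits from the example collapse into a single commit whose message combines the first two messages.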
In the GUI of PyCharm, the interactive rebase can be started by right-clicking on a commit and selecting “Interactive Rebase from here”. The GUI will show a list of commits that can be reordered, or selected and marked to stop at by clicking on the “pause” button. The IDE will then stop at the selected commit and allow making changes to the files; adding and continuing is then also possible graphically (in the current version of PyCharm, the continue button is in the top left).
Why the fuss? Conclusion #
Many of these features are not easy to learn, and feel spooky even the third time. Many people who use git find it hard to understand and not very user-friendly, which is simply true. Git is so unintuitive that it even has the concept of “porcelain commands”, which are easier to use and follow a mental model of git storing changes instead of trees, and “plumbing commands”, which require an understanding of how git works internally.
There are alternatives: while git is the gold standard for version control, there are tools that build on top of git, like Jujutsu, which keep a much simpler mental model and a detailed operation history. It is built to edit past commits without the need for interactive rebase; it provides a much nicer UI by supporting things like “undo” (what a concept!), and it does not need to distinguish between “plumbing” and “porcelain” commands.
Looking into this is on my to-do list, but I am already missing the GUI of PyCharm to visualize the changes.
But there are reasons why advanced usage of git is worth the effort: a clean commit history provides information about how a file or repository evolved, which can be useful for new developers or those who revisit their project after some time. Clean commits in a Pull Request make reviews much more understandable, and therefore easier and faster. They separate the steps taken to implement a feature and make it possible to understand the reasoning behind the changes. Also, I found it good practice to split up the work into smaller steps that all still pass the tests and work; this makes my implementation more reliable, easier to map in my head, and easier to debug.
Happy Coding :)