Git: Snapshots, not diffs

Introduction

Ever wonder how Git manages to work so fast? Linus Torvalds created Git to continue development of the Linux kernel, because no existing source control management (SCM) system met the project's requirements for a distributed workflow. So from the start it was designed to handle one of the most demanding open-source repositories, with hundreds of developers working on it.

This article assumes you have at least a basic understanding of how Git works and what the most common commands do. If you are new to Git or want to improve your knowledge of it, head over to https://learngitbranching.js.org and follow their interactive tutorials.

Because traditional SCM solutions store per-file differences with each commit, many people jump to the conclusion that Git does too. But Git actually stores snapshots of the exact state of every tracked file at a given moment. This means that each commit has a reference (more on those later) to the complete version of each file, not just to the differences. This also applies to files that did not change.

Git's commands speak of "commits" rather than "snapshots", but "snapshot" is a good way to describe what a commit really is. Git has a reputation for being confusing and, in our experience, one of the top reasons is that people interpret commits as diffs. However, commits are snapshots, not diffs!

Behind the scenes

Internally, Git stores three main types of objects:

Blob Objects (files): A snapshot of a file at a given moment. It's the complete file, not a diff.
Tree Objects (folders): Each tree has one or more entries, each of which holds the SHA-1 hash of a blob (a reference to a file) or of a subtree.
Commit Objects: The main contents of a commit object are a reference to a tree, a reference to the parent commit, and some metadata.

*This is a simple version of the Git data model*
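
As a rough illustration of how these objects are addressed, the following Python sketch computes the ID Git assigns to a blob. It assumes the default SHA-1 object format, where the hash covers a "blob <size>" header, a NUL byte, and then the file content. The function name blob_id is ours and is not part of Git.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would give a blob with this content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The ID depends only on the content, never on the file name or path.
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
print(blob_id(b"hello world\n"))
```

If Git is installed, running git hash-object on a file with the same bytes should print the same value.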


What exactly is a commit

A commit object contains a reference to the top-level tree for the snapshot of the project at that point; the parent commit, if any (the first commit does not have a parent); the author and committer information (user.name and user.email from the configuration settings); a timestamp; and the commit message.
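
To make that layout concrete, here is a minimal Python sketch that serializes a commit and derives its ID. It assumes the default SHA-1 object format; the helper names build_commit and commit_id, the author "Jane Doe", and the timestamp are made up for illustration. The tree ID used is the well-known ID of Git's empty tree.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of Git's empty tree


def build_commit(tree, parents, author, committer, message):
    """Serialize a commit the way Git lays it out: headers, blank line, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # no parent line for the first commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message).encode()


def commit_id(body: bytes) -> str:
    """Hash the serialized commit, prefixed with a 'commit <size>' header."""
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical identity: name, email, Unix timestamp, timezone offset.
who = "Jane Doe <jane@example.com> 1700000000 +0000"
first = build_commit(EMPTY_TREE, [], who, who, "Initial commit\n")
second = build_commit(EMPTY_TREE, [commit_id(first)], who, who, "Second commit\n")

print(first.decode())
print(commit_id(second))
```

In a real repository, git cat-file -p followed by a commit hash prints the same kind of text for an existing commit.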

Here is an illustration that shows this:

To avoid storing a full copy of the working directory with each commit, Git creates new blob objects only for the files that were modified or added. For each file that is the same as in the previous commit, it just adds a reference to the already existing blob object. For example, if a repository has three commits and the file example.txt was added in the first commit and never modified, then all three commits reference the same blob object.
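
The sharing described above can be sketched with a small content-addressed store in Python. This is a toy model, not Git's implementation: object_store stands in for .git/objects, and each snapshot is flattened into a single path-to-ID mapping instead of nested trees. All names are hypothetical.

```python
import hashlib

object_store: dict[str, bytes] = {}  # stand-in for .git/objects


def store_blob(content: bytes) -> str:
    """Store content under its hash; identical content is stored only once."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    oid = hashlib.sha1(header + content).hexdigest()
    object_store.setdefault(oid, content)
    return oid


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Map every path to the ID of its blob (a flattened stand-in for a tree)."""
    return {path: store_blob(data) for path, data in files.items()}


commit_1 = snapshot({"example.txt": b"never modified\n"})
commit_2 = snapshot({"example.txt": b"never modified\n", "a.txt": b"v1\n"})
commit_3 = snapshot({"example.txt": b"never modified\n", "a.txt": b"v2\n"})

# All three snapshots point at the same blob for example.txt ...
assert commit_1["example.txt"] == commit_2["example.txt"] == commit_3["example.txt"]
# ... so only three blobs exist in total: example.txt, a.txt v1, a.txt v2.
print(len(object_store))  # 3
```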

That's why Git is so fast: each commit contains references to the full working directory, a complete snapshot. If each commit held only the differences from the previous one, Git would have to process many commits one by one to switch branches, to rebase, or to run almost any command that manipulates the history. That would be painfully slow in big projects, and that's just not what Linus wanted.
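
A toy comparison in Python shows the difference in work. The two lists below are hypothetical encodings of the same three-version history: one stores full snapshots, the other stores only what changed at each step. Reading version n is a single lookup in the first case and a replay of every earlier step in the second.

```python
# Three versions of a tiny project, each a full path -> content mapping.
snapshots = [
    {"a.txt": "1"},
    {"a.txt": "1", "b.txt": "x"},
    {"a.txt": "2", "b.txt": "x"},
]

# The same history stored as deltas: each entry lists what changed
# relative to the previous version (None would mean "file deleted").
deltas = [
    {"a.txt": "1"},
    {"b.txt": "x"},
    {"a.txt": "2"},
]


def checkout_from_snapshots(n: int) -> dict:
    """One lookup, regardless of how long the history is."""
    return dict(snapshots[n])


def checkout_from_deltas(n: int) -> dict:
    """Replays every delta from the beginning up to version n."""
    state = {}
    for delta in deltas[: n + 1]:
        for path, content in delta.items():
            if content is None:
                state.pop(path, None)
            else:
                state[path] = content
    return state


# Both give the same result; only the amount of work differs.
assert checkout_from_snapshots(2) == checkout_from_deltas(2)
```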

*Several commits reference the same blob object*


Another important thing about commits is that they never change: they are immutable objects. Commands like rebase and amend actually create new commits. This is because a commit is identified by all of its properties, so every command that "changes" something in a commit (commit message, parent commit, etc.) is actually creating a new one.
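
The idea can be modeled in Python with a frozen dataclass whose ID is derived from all of its fields. This is a simplified sketch: real Git hashes the serialized commit object, and the Commit class and its field values here are hypothetical.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Commit:
    tree: str
    parent: str
    message: str

    @property
    def id(self) -> str:
        # Simplified: hash every field, so any change yields a different ID.
        data = f"{self.tree}\n{self.parent}\n{self.message}".encode()
        return hashlib.sha1(data).hexdigest()


original = Commit(tree="tree-1", parent="", message="Fix typo")
# "Amending" builds a new object; the original is left untouched.
amended = replace(original, message="Fix typo in README")

print(original.id != amended.id)  # True
print(original.message)           # Fix typo
```

Assigning to a field of original, such as original.message = "x", raises an error because the dataclass is frozen.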


Branches are just references

Another common misconception concerns what a "branch" is. Some people see a branch as a group of commits that extends back to the point where it forked from another branch, or all the way back to the very first commit, and assume that a command like git rebase works by somehow comparing those groups of commits.

Branches are just references! That's it, a pointer to a commit. This could be the single best concept to better understand how Git works. Branches are super volatile and lightweight, just pointers.
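
A small Python sketch makes the pointer idea concrete. The commit graph and branch names are hypothetical: commits are keys in a parent map, and a branch is nothing more than a name that maps to one commit ID.

```python
# Hypothetical commit graph: each commit ID maps to its parent (None for the root).
parents = {"A": None, "B": "A", "C": "B"}

# A branch is just a name that maps to one commit ID.
refs = {"main": "C"}

# Creating a branch copies a single pointer; no commits are copied.
refs["feature"] = refs["main"]

# Committing on a branch adds one commit and moves only that branch's pointer.
parents["D"] = refs["feature"]
refs["feature"] = "D"


def history(branch: str) -> list:
    """Walk parent links from the branch tip back to the root."""
    commit, result = refs[branch], []
    while commit is not None:
        result.append(commit)
        commit = parents[commit]
    return result


print(history("main"))     # ['C', 'B', 'A']
print(history("feature"))  # ['D', 'C', 'B', 'A']
```

In a real repository a branch is typically a small file under .git/refs/heads that contains a commit hash, which is why creating or deleting one is so cheap.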

Tags are also just pointers to a commit. They just usually stay on the same commit forever. There are other subtle differences between tags and branches that we will not cover here.


Classic commands

Now that you understand a little more about how Git works under the hood, let's look at two of the most used Git commands and how they do what they do, keeping in mind that they work with snapshots and not diffs.

For the examples below, let's use this Git repository:


Merge

Let's assume we want to merge feature_1 into feature_2 (running git merge feature_1 while feature_2 is checked out). Git will calculate all the changes made in the branch feature_1 and then apply them on top of feature_2.

To do that, Git creates a diff between the tip commit of feature_1 and commit A (the merge base, that is, the most recent commit the two branches have in common) and then replays those changes on feature_2.
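
The following Python sketch walks through those steps on a toy history. Everything in it is an assumption made for illustration: the commit names (base, f1_tip, and so on), the file contents, and the helper functions. It finds the common ancestor with a simple heuristic that only suits small graphs like this one, and it ignores conflicts entirely, which real Git does not.

```python
def ancestors(commit, parents):
    """Return the commit and everything reachable from it through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def diff(old, new):
    """Changes that turn snapshot old into new (None marks a deleted path)."""
    changes = {p: c for p, c in new.items() if old.get(p) != c}
    changes.update({p: None for p in old if p not in new})
    return changes


# Hypothetical history: both branches fork from "base".
parents = {"base": [], "f1_a": ["base"], "f1_tip": ["f1_a"], "f2_tip": ["base"]}
snapshots = {
    "base": {"app.py": "v1"},
    "f1_tip": {"app.py": "v1", "new.py": "n"},
    "f2_tip": {"app.py": "v2"},
}

# Toy merge base: the common ancestor with the longest history behind it.
common = ancestors("f1_tip", parents) & ancestors("f2_tip", parents)
merge_base = max(common, key=lambda c: len(ancestors(c, parents)))

merged = dict(snapshots["f2_tip"])  # start from the branch we are on
for path, content in diff(snapshots[merge_base], snapshots["f1_tip"]).items():
    if content is None:
        merged.pop(path, None)
    else:
        merged[path] = content

print(merge_base)  # base
print(merged)      # {'app.py': 'v2', 'new.py': 'n'}
```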

The snapshot that results from merging feature_1 into feature_2 is the same as the one from merging feature_2 into feature_1, but there is a difference in the process. In the example above, if we happen to find conflicts to resolve, we can see (by using the git status command) that new files from commits D and E are not displayed as new files. This is because we are applying feature_1 changes on top of feature_2.

Another thing to notice is what happens when you try to see the changes of a merge commit: against which parent should the merge commit be compared? It depends on which command you ask. By default, git log -p does not compare at all (you can make it show something by adding various flags). git show compares against the first parent, then against the second parent, and then combines the two diffs, producing a so-called "combined diff". If you use a Git GUI (Sourcetree, GitKraken, etc.), find out how it displays merge commits.


Rebase

Let's assume we want to rebase feature_1 onto feature_2.

Git will compute the diff between commit B and commit A, and the diff between commit C and commit B. These diffs are obtained by comparing the commits' snapshots. Once it has the diffs, it applies them in order on top of the feature_2 branch, creating a new commit for each one.

So, contrary to popular belief, git rebase does not copy whole commits and place them on top of the other branch. It calculates the difference (the changes) between commits and applies those changes to the destination branch one by one. We are used to seeing diffs all the time when using Git; this is because internally Git calculates diffs between commits all the time.
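
Here is a minimal Python sketch of that replay, under stated assumptions: snapshots are plain path-to-content mappings, the file names and contents are invented, and conflicts are ignored. The names A, B, and C follow the text above, and feature_2_tip stands for the commit that feature_2 points to.

```python
def diff(old, new):
    """Changes that turn snapshot old into new (None marks a deleted path)."""
    changes = {p: c for p, c in new.items() if old.get(p) != c}
    changes.update({p: None for p in old if p not in new})
    return changes


def apply_changes(snapshot, changes):
    """Return a new snapshot with the changes applied; the input is not modified."""
    result = dict(snapshot)
    for path, content in changes.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


# Hypothetical snapshots for the commits being rebased and for the destination.
A = {"app.py": "v1"}
B = {"app.py": "v1", "b.txt": "b"}
C = {"app.py": "v1", "b.txt": "b2"}
feature_2_tip = {"app.py": "v2"}

# Step 1: derive each commit's changes by comparing snapshots.
patches = [diff(A, B), diff(B, C)]

# Step 2: apply them in order on the destination, producing new snapshots.
new_commits = []
current = feature_2_tip
for patch in patches:
    current = apply_changes(current, patch)
    new_commits.append(current)

print(new_commits[0])  # {'app.py': 'v2', 'b.txt': 'b'}
print(new_commits[1])  # {'app.py': 'v2', 'b.txt': 'b2'}
```

The original snapshots B and C are left untouched, which matches the point that rebase creates new commits instead of moving existing ones.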


Conclusion

Git is all about commits and their relationships (the Git history); branches are just pointers to commits. So, to better understand Git, we need to better understand commits. Next time you have to do something with Git, remember: each commit has a reference to a complete working directory, with all its files, at a given moment. Hopefully this will improve your understanding of how Git works.

This was just a glimpse of what happens behind the scenes in Git. If you want a more technical or precise treatment, head over to https://git-scm.com/, which hosts Git's official documentation and is the main source of this article. And if you want to go even deeper, there is always the source code! If you really are that brave, here is the link: https://github.com/git/git.
