Optimizing Git’s Merge Machinery, #3

Published in Palantir Blog · Apr 9, 2021 · 10 min read

Editor’s note: This is the third post in a series by a Palantir Software Engineer on optimizing git’s merge and rename detection machinery. Click to read the first and second posts.

This is the third in a series of blog posts on scaling git’s merge and rename detection machinery. The first post included background information on how the merge machinery works, how we use git at Palantir, and why I have worked on optimizing and rewriting it. That background will be critical to understanding this post. So, if you’re unaware that git stores hashes for files in addition to hashes for commits, or that merging involves three-way content merges, or you don’t know why merges need two sets of rename detection, then you may want to go back and read that background information first. If you already know all that, read on. In this post I’ll consider another optimization, one that I summarized in the first post simply as “Find paths where detecting renames wouldn’t change the result, and don’t detect renames for those paths.”

Hard work pays off eventually, laziness pays off now

When faced with code that runs an O(N²) algorithm over a large number of inputs, and a desire to make it faster, most folks probably ask themselves one of the following questions:

  • Is there an O(N log N) or even O(N) algorithm that could be employed here?
  • Even if we can’t improve the algorithmic complexity, can we optimize the code in a way that makes it multiple times faster?
  • Are there approximate solutions that could be employed that solve slightly different problems, but provide us good enough solutions, and do so faster?
  • Can we cache relevant information somehow over time to reduce the computation costs at the point we need?

You might note that the previous blog post in this series essentially fell into the approximate-solution category. However, I’m not covering these optimizations in the order I discovered them; the first blog post described what was actually my second optimization, and the second blog post my third. My original optimization didn’t come from answering any of the questions above, but from answering a totally different set of questions:

  • How bad would it be if we just didn’t bother with this computation? Can we get away with it in some cases?

In effect, you might say that I’m taking the computer science concept of lazy evaluation — not evaluating results until they are requested — and trying to kick it up a notch to hyper lazy evaluation by just saying, “Nah, you don’t need those; I just won’t compute them” — and trying to figure out when I can get away with it.

(The eagle-eyed might view both of my previous optimizations and the current one as different forms of “Make N smaller.” The previous two optimizations did that through finding alternative ways to compute or approximate some of the renames, while the current one is literally about deciding to just not compute some of the renames.)

Why do we need renames anyway?

The merge machinery uses renames for two purposes:

  • Three-way content merging: in order to do three-way content merging of files, we need three different file versions. If one side of history renamed a file, then some of the content for the file is found under a different path than in the merge base.
  • Directory rename detection: when one side of history renames a directory, and the other side of history adds new files to that directory, we want to be able to warn the user about the need to choose whether those new files stay in the old directory or move to the new one.

If we don’t do rename detection, renamed files are likely to appear as modify/delete conflicts for the old filename and as a new file for the new filename. Both will be present in the working copy, both will have edits, the original version of the file from the merge base won’t be obvious, and users will be left attempting to reconcile the two. Also, if we don’t do rename detection, new files placed into old directories could be “left behind” when all the other files in those directories have moved elsewhere.
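To make this concrete, here is a minimal sketch in Python (with made-up hashes and a hypothetical classify_path helper; git’s real merge machinery is written in C and handles many more cases) of how a renamed-and-edited file shows up when rename detection is skipped:

    def classify_path(base, side1, side2):
        """Classify one path given its blob hashes in the three commits;
        None means the file is absent from that commit."""
        if base is not None and side1 is None and side2 not in (None, base):
            return "modify/delete conflict"  # deleted on side1, edited on side2
        if base is not None and side2 is None and side1 not in (None, base):
            return "modify/delete conflict"  # deleted on side2, edited on side1
        if base is None and (side1 is None) != (side2 is None):
            return "new file, kept as-is"    # added on only one side
        return "other (content merge, clean delete, etc.)"

    # side1 renamed old.c to new.c (editing it along the way), while side2
    # edited old.c in place. Without rename detection, the merge sees two
    # unrelated paths and leaves the user to reconcile them by hand:
    print(classify_path("01d01d", None, "5e1ec7"))  # old.c: modify/delete conflict
    print(classify_path(None, "f00dfa", None))      # new.c: new file, kept as-is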

Ok, we need renames; but do we need all of them?

If a rename is needed for the purposes of three-way content merging or for the purposes of directory rename detection, then we need it. Let’s dig into those two cases to see whether either of them needs renames in all cases, or whether there are cases where rename detection can be skipped:

Three-way content merges

Three-way content merges are about reconciling differences made on both sides of history. If only one side of history modified the content of a file, then we don’t need to do a three-way content merge. Let’s look at each of the two possibilities to see why this is the case. The brief version is:

  • If the file content wasn’t modified on the renamed side ⇒ we get to do exact rename detection, which is cheap
  • If the file content wasn’t modified on the unrenamed side ⇒ detection of a rename for that source file is irrelevant

Let’s walk through both in more detail, using two filenames, old.c & new.c, with the following abbreviated object ids (where the value ‘000000’ denotes that the file is missing in that commit). The first case corresponds to something like this:

                 old.c    new.c
    MERGE_BASE:  01d01d   000000
    MERGE_SIDE1: 000000   01d01d
    MERGE_SIDE2: 5e1ec7   000000

Here, exact rename detection will determine that old.c in MERGE_BASE and new.c in MERGE_SIDE1 are related (identical, in fact). That would suggest we should do a three-way content merge between MERGE_BASE:old.c, MERGE_SIDE1:new.c, and MERGE_SIDE2:old.c. However, two of those three files have the same content (noted by having the same hash), so the three-way content merge isn’t even needed; we already know the result will be a file whose hash is 5e1ec7 — which we already have stored in our repository. And the knowledge about the rename lets us know that 5e1ec7 is the hash for the file named new.c in the result of the merge, while old.c should be removed from the result.
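Sketched in Python (hypothetical helper names; git’s actual implementation is in C), the two shortcuts used above look roughly like this:

    def exact_renames(deleted, added):
        """Exact rename detection: pair up deleted and added paths whose
        blob hashes match exactly; a hash lookup, no content comparison.
        (Simplified: assumes at most one deleted path per hash.)"""
        by_hash = {h: path for path, h in deleted.items()}
        return {by_hash[h]: new for new, h in added.items() if h in by_hash}

    def trivial_merge(base, ours, theirs):
        """Three-way content merge shortcut: when two of the three versions
        are identical, the result is simply the third version."""
        if base == ours:
            return theirs  # only they changed it
        if base == theirs:
            return ours    # only we changed it
        if ours == theirs:
            return ours    # both sides made the same change
        return None        # a real content-level merge is needed

    print(exact_renames({"old.c": "01d01d"}, {"new.c": "01d01d"}))
    # {'old.c': 'new.c'}
    print(trivial_merge("01d01d", "01d01d", "5e1ec7"))
    # '5e1ec7' -- the result for new.c, with no file contents ever read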

The second case is a bit more involved, and perhaps the claim that the rename is irrelevant is surprising. Let’s walk through what that case looks like:

                 old.c    new.c
    MERGE_BASE:  01d01d   000000
    MERGE_SIDE1: 01d01d   000000
    MERGE_SIDE2: 000000   5e1ec7

Since I claimed that detection of the rename was irrelevant for this case, let’s compare what happens both with and without the rename being detected:

If the rename IS NOT detected ⇒

  • then old.c looks like it was unmodified on one side and deleted on the other and should thus be removed.
  • new.c looks like a new file we should keep as-is.

If the rename IS detected ⇒

  • the code sets up a three-way content merge between MERGE_BASE:old.c, MERGE_SIDE1:old.c, and MERGE_SIDE2:new.c. This runs as follows…
  • Since the version of the file in MERGE_BASE and MERGE_SIDE1 are identical (both are files with hash 01d01d), the three-way content merge would produce exactly the version of the file whose abbreviated object id is 5e1ec7. In fact, the code looks at the hashes and returns the answer without even looking at the file contents since it knows the answer and knows the repository already has a file with the necessary hash stored in it.
  • Since the file was renamed from old.c to new.c, it will record the file contents whose hash is 5e1ec7 at new.c (which means keeping it as-is), and remove old.c.

If you compare the two results above, you’ll note that they have an identical conclusion, even if they used different logic to arrive there. Thus we have the same result regardless of whether we detect a rename for old.c. There is no way for us to know a priori what file new.c would be paired with and whether that file has been modified, so new.c will have to be left in our big matrix of comparisons when searching for renames. The fact that old.c is unmodified on one side of history, though, is enough for us to know that it doesn’t need to be included — at least not for the purpose of three-way content merging.
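That observation turns into a simple filter. Here is a rough Python sketch (hypothetical names, and a big simplification of what git actually does) of pruning rename sources that are irrelevant for content merging; note that, per the above, only sources are pruned, never destinations:

    def content_relevant_sources(base, other_side, deleted_paths):
        """Keep only the deleted paths (rename sources) whose content was
        modified on the other, unrenamed side of history. `base` and
        `other_side` map path -> blob hash."""
        return {
            path for path in deleted_paths
            if path in other_side and other_side[path] != base[path]
        }

    base  = {"old.c": "01d01d", "util.c": "abcdef"}
    side2 = {"old.c": "01d01d", "util.c": "123456"}  # old.c untouched, util.c edited
    deleted_on_side1 = {"old.c", "util.c"}           # side1 renamed both away

    print(content_relevant_sources(base, side2, deleted_on_side1))
    # {'util.c'}: side2's edits must follow util.c to its new name, but
    # detecting (or not detecting) a rename for old.c changes nothing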

Directory rename detection

We only need renames for source files as part of directory rename detection if:

  • The directory containing the file no longer appears on the side of history for which we are detecting renames (if the directory still appears, it hasn’t been renamed, at least not on that side of history).
  • The containing directory had new files added to it on the other side of history, i.e. the side for which we are not detecting renames (if there were no new files added to that directory, then there will be nothing for directory rename detection to warn about or move into a new location).

Further, even in these cases, we don’t need renames for all the source files in a renamed directory; we only need enough renames to satisfy the “majority wins” rule. So if we get enough renames from exact rename detection or basename-guided rename detection (see the previous two blog posts in this series), then we don’t need to detect any more — at least not for the purposes of directory rename detection.
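For illustration, here is a rough Python sketch of the “majority wins” rule (hypothetical names; git additionally deals with nested directories, ties, and more):

    import os
    from collections import Counter

    def directory_rename_target(renames, old_dir):
        """'Majority wins': the removed directory old_dir is considered
        renamed to wherever most of its renamed files ended up."""
        votes = Counter(
            os.path.dirname(new)
            for old, new in renames.items()
            if os.path.dirname(old) == old_dir
        )
        if not votes:
            return None
        winner, count = votes.most_common(1)[0]
        return winner if 2 * count > sum(votes.values()) else None

    renames = {"src/a.c": "lib/a.c", "src/b.c": "lib/b.c", "src/c.c": "docs/c.txt"}
    print(directory_rename_target(renames, "src"))  # 'lib' (2 of 3 votes)

The early-exit opportunity follows from the voting: once the renames already found give some directory a lead that the still-undetected sources could not overturn, those remaining sources are irrelevant for directory rename detection.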

Much like with three-way content merges, the needs of directory rename detection focus on rename sources, giving us a way to mark certain source files as irrelevant for rename detection. We do not have a way to reduce the number of potential rename destinations.

Abstraction layers

It has been said that every problem in computer science can be solved with another layer of abstraction, except for the problem of having too many layers of abstraction. As software engineers, we love abstracting things; it allows us to create reusable components. However, sometimes the abstraction creates an artificial wall between different components of our programs, and we later find we want to create a building that spans both sides of the wall. Other times our reusable components were designed to solve one problem, but when we go to re-use them for a different problem, we find they provide enough information for us to solve what we need, but they provide more than what we actually need.

In git, there is a specific component responsible for rename (and copy) detection. It’s a sub-component of the diff machinery, which takes two trees (a start and end state) and computes the differences. This is very handy for things like git diff OLD_COMMIT NEW_COMMIT or git log -p as they’ll want to know all the differences between two commits; the diff machinery solves such a problem well.

Merges need two sets of renames — from the MERGE_BASE to MERGE_SIDE1, and from the MERGE_BASE to MERGE_SIDE2. So it seems quite obvious to just reuse our nice component twice, doing rename detection once for each side. That gives us the answers we need...but I think this abstraction obscured the optimization opportunities for a long time. Note that all of our reasoning above about when renames can be skipped involved information about the other side of history. The diff component only knows about two commits, not three, so it has no notion of another side of history.

Implementing these optimizations basically means we’ve got to pass a bunch of extra information from the merge component to the rename detection component so that it can make decisions based on this extra information. It also means moving all the directory rename detection logic from the merge component into the rename detection component (despite the name similarity, directory rename detection had always been merge specific and since it was a post-processing step it lived outside of the rename detection umbrella).
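In outline, the boundary shift looks something like this Python sketch (hypothetical names and signatures; git’s real interface is C and carries more information than shown here):

    def detect_renames(deleted, added, similar):
        """The old, pure two-tree interface: every deleted path is
        compared against every added path -- the O(N^2) part."""
        return {d: a for d in deleted for a in added if similar(d, a)}

    def detect_renames_for_merge(deleted, added, similar, relevant_sources):
        """The merge-aware variant: the merge machinery tells rename
        detection which sources actually matter, shrinking the comparison."""
        return {d: a for d in deleted if d in relevant_sources
                     for a in added if similar(d, a)}

    # Toy similarity: same basename counts as a rename.
    same = lambda d, a: d.split("/")[-1] == a.split("/")[-1]
    print(detect_renames_for_merge({"src/a.c", "src/b.c"},
                                   {"lib/a.c", "lib/b.c"},
                                   same, relevant_sources={"src/a.c"}))
    # {'src/a.c': 'lib/a.c'} -- src/b.c never entered the comparison at all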

This all sounds really complicated and special case; is it worth it?

At this point you may be noting some things that make you question the advisability of this optimization:

  • This optimization will only help the merge machinery (so git commands like merge, revert, cherry-pick, or rebase); it won’t help commands in the plain diff family (so it won’t help git commands like diff, log, or status).
  • This optimization is going to involve an awful lot of code and checking of special cases, as well as passing a lot of information across our old abstraction boundary.
  • This is all for special cases where the unrenamed side of a file isn’t modified AND the file isn’t potentially part of a directory rename. The second of those conditions may be common, but the first sounds like a special case.

So, is it worth it?

In my opinion, this is the best of all the optimizations I did. To see why it can be so powerful, let’s say we need to cherry-pick an old commit of ours, but upstream has in the meantime renamed 5K files. Let’s ignore the previous optimizations (exact rename detection and basename-guided rename detection) for a moment, or assume that we have 5K files left after those optimizations. Inexact rename detection in git will need to do 5K * 5K file comparisons to match up the rename pairs. However, if the commit we need to cherry-pick is typical, then it modifies ~10 files, and likely doesn’t even add any new files. For a case like that, our 5K * 5K comparison drops to ~10 * 5K — a reduction factor of ~500.
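In comparison-count terms (each comparison costs the same; only the number of sources shrinks):

    full    = 5_000 * 5_000  # 25,000,000 pairwise file comparisons
    reduced =    10 * 5_000  #     50,000 once sources are pruned to ~10
    print(full // reduced)   # 500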

For our testcase of interest, the basename-guided rename detection happened to be very effective, meaning that there wasn’t that much remaining optimization potential. But despite that, using this additional optimization still yields some pretty impressive results:

                   Before                After
    mega-renames:  130.465 s ± 0.259 s   11.435 s ± 0.158 s

Interestingly, if we turned off the basename-guided optimization from the last blog post, then this optimization by itself would have given us the following speedup for our testcase of interest:

                   Before                 After
    mega-renames:  1799.937 s ± 0.493 s   16.148 s ± 0.169 s

So it’s much more powerful than the basename-guided optimization, at least for typical rebases and cherry-picks. But the great thing is that some paths can only benefit from one of the optimizations, and we get to use whichever one is better for any given path.

For more details about this optimization, see the two upstream threads where my patches were reviewed, namely the threads on avoiding detecting irrelevant renames and on avoiding detecting even more irrelevant renames.

This optimization will appear in git-2.32, which is scheduled to be released in June 2021.

Author

Elijah Newren
