Faster Git checkouts on NFS and SSD with parallelism

Update (2022/07/13): I defended my Master’s dissertation about parallel checkout and was approved in the program :) Here are the dissertation and defense slides:

MSc Dissertation | MSc Defense Slides

June 2021 marked the release of Git 2.32.0, which includes the new “parallel checkout” mode. This feature can speed up some checkout operations by up to 3.6x on SSDs and 4.5x on NFS mounts!

Furthermore, it benefits not only git checkout, but all Git commands that make use of the checkout machinery, including: clone, reset, switch, restore, merge, and others.

I’ve worked on this project with other Git contributors for about a year, and I’m very happy that it is finally available for everyone to try out! In this post, I’d like to discuss the cases in which parallel checkout works best, and why that is.

TL;DR

Parallel checkout produces the best results for (large) repos on SSDs and NFS mounts.
It is not recommended for small repos and/or repos on HDDs (without prior benchmarking), as it can worsen performance.
Enable parallel checkout by setting the desired number of parallel workers with git config checkout.workers <N>. A value of 1 means sequential mode (the default).

Motivation

Checkout is usually a fast operation when updating a small number of files, but performance can become a problem as the workload grows. This is especially critical for Git users on networked file systems, who, due to the higher I/O latencies, may see some checkout commands take up to 50x or even 130x longer than on local file systems.

To put it into perspective, a full checkout of the Linux repository (which contains over 70K files) takes around 8 seconds on a local Linux machine with an SSD, but it can take 5 to 15 minutes on network file systems. Furthermore, Linux is not even the largest repository versioned with Git: the Chromium repository has about 400K working tree files with a repo size of 36 GiB, and Windows has over 3.5M working tree files in a repo of 300 GiB.

With that in mind, the goal of this project was to parallelize the working tree update phase of checkout and improve its performance for large workloads, especially over NFS. Note that other important subtasks of checkout, like the tree traversal or index update code, are beyond the scope of this work.

Benchmark

Since the primary goal was to speed up operations with many file creations, I chose to benchmark a command where this is the main bottleneck: a git checkout . execution on an empty working tree of the Linux kernel repository (version 5.12). This requires the creation of over 70 thousand files.

The command was benchmarked on two instances of the Linux kernel repository:

Packed Objects: a Linux clone containing all objects reachable from v5.12 in a single packfile (8.2M objects, totaling 3.5 GiB of real and disk size);
Loose Objects: a shallow clone (i.e. --depth=1) containing only the objects from v5.12 itself, expanded to loose format (~76K objects, totaling 254 MiB of real size, or 454 MiB of disk size).

Note: the “Loose Objects” benchmark was created mostly for research purposes. This artificial case should not happen much in practice, since Git packs loose objects automatically once their number surpasses a threshold (gc.auto, 6700 by default), and our repo has 12x that value.
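To see how close a given repository is to that threshold, the loose object count can be read from git count-objects -v. The following is only a sketch: the helper names are made up for illustration, and the fallback of 6700 is the default quoted above.

```bash
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not part of Git.

# Extract the loose object count from `git count-objects -v` output on stdin.
loose_count() {
    awk -F': ' '$1 == "count" { print $2 }'
}

# Succeed if the loose object count exceeds the auto-packing threshold.
# The threshold defaults to 6700 when no second argument is given.
exceeds_gc_threshold() {
    local count="$1" threshold="${2:-6700}"
    [ "$count" -gt "$threshold" ]
}

# Example usage inside a repository:
#   n="$(git count-objects -v | loose_count)"
#   exceeds_gc_threshold "$n" && echo "$n loose objects: above the threshold"
```

Git only estimates the loose object count when deciding whether to pack, so the exact comparison here is an approximation of that behavior.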

I’ll interleave the benchmark results with some profiling plots (for the sequential checkout). These were generated to better understand which tasks/functions consume the most time during the execution of the benchmarked checkout command in each storage type. The profiling data were collected using the bcc-tools, which provides both on- and off-CPU profilers, allowing us to properly see the time spent on I/O. For simplicity, I’ll only show a summary of the most time-consuming functions, but you can check this page for the full Flamegraphs.

The benchmark was executed on different Linux machines (with SSDs, HDDs, and NFS mounts). Each measurement was sampled 15 times, and the plots show the mean runtime with a 95% confidence interval. Additionally, the machines’ caches were dropped before each sample.
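For readers who want to summarize their own timings the same way, here is a rough sketch of computing a mean and a 95% confidence half-width from 15 samples. It assumes a Student's t interval (critical value 2.145 for 14 degrees of freedom); that method is an assumption of this sketch and may differ from how the plots were produced. The function name is hypothetical.

```bash
#!/bin/bash
# mean_ci95: reads exactly 15 runtimes (one number per line) on stdin and
# prints "<mean> <half-width>" of a 95% confidence interval.
# Assumption: a t-based interval, with 2.145 = t(0.975) for 14 degrees
# of freedom.
mean_ci95() {
    awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            if (n != 15) {
                print "expected 15 samples, got " n > "/dev/stderr"
                exit 1
            }
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)   # sample variance
            if (var < 0) var = 0                        # guard against rounding
            printf "%.3f %.3f\n", mean, 2.145 * sqrt(var / n)
        }'
}

# Example usage, with one runtime in seconds per line in times.txt:
#   mean_ci95 < times.txt
```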

Machines info
Please check this page for the hardware and software description of the machines used in this benchmark.

SSDs

Starting with the SSDs, we have three machines: Mango and Grenoble both have 4 cores with 2 threads per core (8 logical cores), and Songbird has 6 cores with 2 threads per core (12 logical cores):

Checkout Times on Machine "Mango" - SSD

Checkout Times on Machine "Grenoble" - SSD

Checkout Times on Machine "Songbird" - SSD

The SSD plots show speedups ranging from 2.5x to 3.6x on the packed objects case, and 5.6x to 7.2x on the loose objects case!

On all three machines, we got the best overall results at around 16 to 32 workers for the packed objects, and 64 workers for the loose objects. But notice that we start seeing diminishing returns around 8 workers. Thus, values higher than that may not be worth the additional usage of system resources (like RAM and I/O bandwidth).
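The speedups quoted here are simply the sequential runtime divided by the runtime with N workers. A small sketch of that calculation follows; the function name is hypothetical and the timings are invented for illustration, not taken from the plots.

```bash
#!/bin/bash
# speedup_table: reads "<workers> <seconds>" lines on stdin, where the first
# line is the sequential run, and prints "<workers> <speedup>" for each line.
speedup_table() {
    awk 'NR == 1 { base = $2 } { printf "%s %.2f\n", $1, base / $2 }'
}

# Hypothetical timings, for illustration only.
printf '1 10.0\n8 4.0\n16 2.5\n' | speedup_table
# prints:
#   1 1.00
#   8 2.50
#   16 4.00
```

Comparing consecutive rows of such a table is one way to spot where the returns start to diminish.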

Let’s take a look at the most time-consuming functions of a sequential checkout on one of these machines:

Checkout Profile on Mango - SSD

For the packed case, most of the time is spent in inflate(), the zlib function responsible for decompression. This work is CPU-bound and parallelizes quite well across multiple objects. In the loose case, most of the time is spent in filemap_fault(), the kernel function responsible for reading a file’s contents on a page fault in a memory-mapped region (see mmap(2)). Because of their architecture, SSDs allow for internal parallelism, which can be better exploited when there are more outstanding requests in the I/O queue. The increased I/O queue depth also allows the I/O scheduler to apply better optimizations (such as request reordering and merging).

HDDs

We will be looking at three machines on the HDD tests: Wall-e and Cicada have 4 logical cores each, and Grenoble has 8 (all with Intel’s Hyper-Threading).

Checkout Times on Machine "Wall-e" - HDD

Checkout Times on Machine "Grenoble" - HDD

Checkout Times on Machine "Cicada" - HDD

As we can see, parallel checkout was not very effective on the HDDs. All three machines achieved some improvement in the packed objects case, but the overall speedup was too small. The loose case, on the other hand, saw massive performance degradations. Again, let’s take a look at the performance profile for one of these machines:

Checkout Profile on Cicada - HDD

As we can see, checkout spends almost all its run time reading the objects from the HDD. However, unlike SSDs, HDDs can typically execute only a single operation at a time. Depending on the I/O patterns, the increase in the I/O queue depth *might* allow for some scheduler optimizations that reduce the overall disk-seeking time. If that is not the case, however, the concurrent requests from parallel checkout may end up only stressing the disk further (think of contention for critical resources) and degrading performance. I suspect that this is what we are seeing in the loose case. The files may be so scattered over the disk that there is not enough opportunity for request merging by the I/O scheduler. As for the packed case, the scheduler should be able to take more advantage of the increased I/O queue depth, as the objects are stored in a single file.

Yes, there is fragmentation (and the more fragmented case, Wall-e, actually saw higher speedups than the less fragmented ones). However, the number of fragments here doesn’t come even close to the number of discontinuous disk chunks used to store the loose objects.

NFS

I used two setups for the NFS tests: one with machine Cicada as the NFS server and machine Mango as the NFS client (both on LAN connected through 5GHz Wi-Fi), and another one using two AWS EC2 instances and an EBS gp3 volume for storage (which is SSD-based). Let’s start with the EBS one:

Checkout Times on NFS from SSD - AWS EBS gp3

On the Cicada NFS setup, the checkout benchmark with the Linux repository was taking too much time (almost an hour for a single execution of the sequential checkout on the SSD and over two hours on the HDD), so I used the Git repository instead. This repo contains about 4k files at v2.32.0. The packed and loose repository versions were set up exactly like the Linux ones. The first ended up with ~310K objects, and the second ended up with ~4K objects. Let’s see the results:

Checkout Times on NFS from HDD - Cicada

We obviously cannot compare the times from the two plots above, as they use different repositories, hardware, configurations, etc. But I wanted to see what difference an SSD makes for a parallel checkout over NFS. Fortunately, Cicada also has a small 20 GiB SSD :) Well, it is meant for caching and accelerating the Windows boot, not for general storage. But let’s give it a shot anyway:

Checkout Times on NFS from SSD - Cicada

On the NFS tests we got the following speedups (respectively for packed and loose case):

4.5x and 5.2x on the EBS gp3 SSD setup with the Linux repository;
2.6x and 3.4x on the Cicada HDD setup with the Git repository;
2.9x and 5.5x on the Cicada SSD setup with the Git repository.

The best results were at around 32 to 64 workers on all three setups. But we start to get diminishing returns from 8 workers, so this seems to be a good value for NFS mounts, as we can achieve good performance without overusing the system’s resources.

Let’s see the profile plot for the NFS from EBS:

Checkout Profile on NFS - EBS gp3 SSD

Two functions dominate the execution time: open(), with 44% of the total runtime, and fstat(), with 33 to 40%. For open(), the most time-consuming calls are the ones creating new files in the working tree, and their runtime is equally divided between two NFS operations: OPEN and SETATTR. Both of them require one round-trip to the server for each call. As for fstat(), the cost mostly comes from having to flush previous write operations (which were locally cached) to the server. This is required in order to update some file attributes, like the last modification time, which must be returned by fstat(). For both open() and fstat(), practically the entire time is spent off-CPU, either sending requests to the NFS server or waiting for its responses. Since NFS servers are typically able to process multiple connections simultaneously, parallel checkout can be an effective way to amortize the network latency and also promote parallelism in the server work associated with these operations.

Finally, I also wanted to see what results we could get on single-core machines. To do that, I disabled all cores but one on both the NFS client and server of the Cicada setup (which can be done by writing ‘0’ to the special files /sys/devices/system/cpu/cpu[1-9]*/online):
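As a rough sketch of that step, the loop below writes a value to every cpuN/online file except cpu0. The function name and the optional directory argument are assumptions added here so that the loop can be tried against a scratch directory; on a real system it needs root and will take cores offline.

```bash
#!/bin/bash
# set_secondary_cpus <0|1> [cpu_sysfs_dir]
# Writes the given value to every cpuN/online file except cpu0: 0 takes the
# cores offline, 1 brings them back. The directory defaults to the sysfs
# path mentioned in the text; the override is a convenience of this sketch.
set_secondary_cpus() {
    local value="$1" dir="${2:-/sys/devices/system/cpu}" f
    for f in "$dir"/cpu[1-9]*/online; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "$value" > "$f"
    done
}

# Example usage (as root):
#   set_secondary_cpus 0   # leave only cpu0 online
#   set_secondary_cpus 1   # bring the other cores back
```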

Checkout Times on NFS from HDD - Cicada - One Core

The results were very similar to the ones above, with the optimal configuration around 8 to 16 workers. This suggests that the performance gain we get from parallelism on NFS also applies to single-core machines.

Windows benchmark

It would be impractical to repeat these performance tests on every system and machine architecture where Git is used. However, I still wanted to see how parallel checkout behaves on a different operating system, so I ran the local benchmarks on Microsoft Windows as well. It has a large user base and an active Git development community. Besides, it is not a UNIX-like system (unlike macOS, BSD, and Linux), so it is a good choice to complement the benchmarks we already have on Linux.

Checkout Times on Machine "Mango" - SSD

Checkout Times on Machine "Cicada" - HDD

Conclusions

From the benchmarks, I would suggest using something around 8 workers on NFS mounts, and perhaps as many workers as the number of logical cores on local SSDs (at least on Linux).

For HDDs, I would not recommend enabling parallel checkout unless you have already run some tests to assess the performance on your specific machine and Git repo. You can do that, for example, with something like the following:

#!/bin/bash
# Note: expects $PWD to be the git repo you want to run the benchmark at.
# Also, hyperfine (https://github.com/sharkdp/hyperfine) must be installed.


tmpdir="$(mktemp -d ./tmp-worktree-XXXX)" &&
git worktree add "$tmpdir" HEAD &&
(
    cd "$tmpdir" &&
    hyperfine \
        -S /bin/bash \
        -p 'rm -rf * && sync && sudo /sbin/sysctl vm.drop_caches=3' \
        -L WORKERS 1,2,4,8,16,32,64 \
        'git -c checkout.workers={WORKERS} checkout .'
)
git worktree remove -f "$tmpdir"
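The suggestions in this section can be condensed into a small helper that prints a starting value for checkout.workers. This is only a sketch of the rules of thumb above: the function name and the storage labels are made up, and the result is a starting point for benchmarking rather than a guaranteed optimum.

```bash
#!/bin/bash
# suggest_workers <nfs|ssd|hdd>
# Prints a starting value for checkout.workers, following the suggestions
# in this section. Hypothetical helper, not part of Git.
suggest_workers() {
    case "$1" in
        nfs) echo 8 ;;
        # Number of online logical cores, with a fallback if nproc is missing.
        ssd) nproc 2>/dev/null || getconf _NPROCESSORS_ONLN ;;
        # Sequential mode, unless a benchmark shows otherwise.
        hdd) echo 1 ;;
        *)   echo "usage: suggest_workers <nfs|ssd|hdd>" >&2; return 1 ;;
    esac
}

# Example usage:
#   git config checkout.workers "$(suggest_workers nfs)"
```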

How to enable

Set the desired number of parallel workers with:

$ git config [--global|--local] checkout.workers <N>

The default value is one, i.e. sequential mode.

You can also change the checkout.thresholdForParallelism configuration, which defines the minimum number of files for which Git should enable parallelism. This avoids the cost of spawning multiple workers and performing inter-process communication when there is not enough workload for parallelism. (The default value is 100, which should be reasonable in most cases.)
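Both settings can be written together. The wrapper below is a hypothetical convenience (its name and argument order are not part of Git), and it assumes a threshold of 100 when none is given, matching the default mentioned above.

```bash
#!/bin/bash
# set_parallel_checkout <repo_dir> <workers> [threshold]
# Stores both parallel checkout settings in the repository's local config.
set_parallel_checkout() {
    local repo="$1" workers="$2" threshold="${3:-100}"
    git -C "$repo" config --local checkout.workers "$workers" &&
    git -C "$repo" config --local checkout.thresholdForParallelism "$threshold"
}

# Example usage:
#   set_parallel_checkout . 8
#
# For a one-off command, the same settings can be passed with -c instead:
#   git -c checkout.workers=8 -c checkout.thresholdForParallelism=100 checkout .
```

As far as I know, the git-config documentation also states that a checkout.workers value below one makes Git use as many workers as there are logical cores, which may be convenient on local SSDs.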

Notes on parallel-ineligible entries

Some files are currently checked out sequentially, regardless of the checkout mode configured. These are:

Symbolic links;
Regular files that require external smudge filters (like Git-LFS).

This limitation exists to prevent race conditions and to avoid breaking assumptions of non-concurrency that external filters might have. (You can read more about this here.) Additionally, all file removals and directory creations are currently performed sequentially.
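To get a rough idea of how many entries in a repository fall into these two categories, one can inspect the index and the filter attribute. The helpers below are a sketch with made-up names. Note that a path having a filter attribute does not by itself prove that an external smudge filter is configured for it, so the second helper may overcount, and paths containing ": " would confuse its parsing.

```bash
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not part of Git.

# Count symbolic links in `git ls-files -s` output read from stdin.
# Symlinks are stored in the index with mode 120000.
count_symlinks() {
    awk '$1 == "120000" { n++ } END { print n + 0 }'
}

# Print the paths that have a filter attribute set, given the output of
# `git check-attr --stdin filter` ("<path>: filter: <value>") on stdin.
paths_with_filter() {
    awk -F': ' '$2 == "filter" && $3 != "unspecified" && $3 != "unset" { print $1 }'
}

# Example usage inside a repository:
#   git ls-files -s | count_symlinks
#   git ls-files | git check-attr --stdin filter | paths_with_filter
```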

Source code

The three series of patches that compose this project can be seen at:

Acknowledgments

Thanks to Jeff Hostetler and Nguyễn Thái Ngọc Duy, who I co-developed parallel checkout with, and to all reviewers for dedicating their time and effort to improve the quality of this feature. In particular, special thanks to Christian Couder, Junio Hamano, and Derrick Stolee. Finally, thanks to Amazon for sponsoring me in this project.

Matheus Tavares