Matheus Tavares

05 Jun 2022

The "Schrödinger's tree object": is it there or not?

Tags: git, fun

Reading a reddit comment a while ago, I learned about a somewhat “mysterious” git object which can be both present and absent at the same time! Well… not by the same definition of “presence”, but that spoils the fun, right? :P Let’s look at this object in more detail.


If you want to check whether a given object exists in a git repository, you can use git rev-parse. As the man page says:

--verify
	   Verify that exactly one parameter is provided, and that it can be
	   turned into a raw 20-byte SHA-1 that can be used to access the
	   object database. If so, emit it to the standard output; otherwise,
	   error out.
	   [...]
	   To make sure that $VAR names an existing object of any
	   type, git rev-parse "$VAR^{object}" can be used.

Let’s run an example inside the git.git repository:

$ git rev-parse --verify e83c5163316f89bfbde7d9ab23ca2e25604af290^{object}
e83c5163316f89bfbde7d9ab23ca2e25604af290
$ echo $?
0

Ok. What about a nonexistent object?

$ git rev-parse --verify 0000000000000000000000000000000000000000^{object}
fatal: Needed a single revision
$ echo $?
128

Hm, nothing out of the ordinary here. But the fun starts when we look at the “mysterious” SHA1 hash 4b825dc642cb6eb9a060e54bf8d69288fbee4904. If we list all objects in the git.git repository using git cat-file and grep for this particular hash, we get nothing:

$ git cat-file --batch-check="%(objectname)" --batch-all-objects | \
	grep 4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ echo $?
1

However … rev-parse seems to disagree about the object’s presence:

$ git rev-parse --verify 4b825dc642cb6eb9a060e54bf8d69288fbee4904^{object}
4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ echo $?
0

Hmm, what is happening here? Is the object there or not? Let’s go a bit further with a reduced test case:

$ git init /tmp/repo
Initialized empty Git repository in /tmp/repo/.git/

$ git -C /tmp/repo rev-parse --verify 4b825dc642cb6eb9a060e54bf8d69288fbee4904^{object}
4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ echo $?
0

Wait, what? The just-created /tmp/repo repository clearly has no objects:

$ tree /tmp/repo/.git/objects
/tmp/repo/.git/objects
├── info
└── pack

2 directories, 0 files

Is rev-parse broken? Let’s try something else… Running git cat-file to print all object hashes and grepping for our target did not produce any result. But cat-file can also be used to print metadata about a given list of hashes. Let’s try that with our hash:

$ echo 4b825dc642cb6eb9a060e54bf8d69288fbee4904 | \
	git -C /tmp/repo cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk)'
4b825dc642cb6eb9a060e54bf8d69288fbee4904 tree 0 0

Hmmmm, so 4b825dc642 is a tree object of size 0, both on disk and decompressed. Nevertheless, we saw that there are no objects on disk… I was intrigued. Is this hardcoded somewhere in git? And if so, why?
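One thing worth noting before digging in: the hash itself is not random. Git names an object by hashing a short header of the form "<type> <size>" plus a NUL byte, followed by the object's content, so a tree with no entries hashes nothing but "tree 0" and a NUL. A minimal sketch to check this by hand, assuming coreutils' sha1sum is installed (the variable name is just for illustration):

```shell
# Hash the header git would write for a tree with no entries:
# the string "tree 0" followed by a single NUL byte, and no content.
empty_tree_sha1=$(printf 'tree 0\0' | sha1sum | cut -d' ' -f1)

echo "$empty_tree_sha1"
# prints 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

That is why every SHA-1 repository agrees on this name for the empty tree. Running git hash-object -t tree /dev/null should give the same answer without the manual header.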

My first attempt to “uncover the mystery” was:

$ git -C git.git grep 4b825dc642
git-rebase--preserve-merges.sh:288:             ptree=4b825dc642cb6eb9a060e54bf8d69288fbee4904
t/oid-info/hash-info:16:empty_tree sha1:4b825dc642cb6eb9a060e54bf8d69288fbee4904
t/t0015-hash.sh:26:     grep 4b825dc642cb6eb9a060e54bf8d69288fbee4904 actual

The first match comes from a shell script which uses our mysterious hash (a.k.a. the empty tree hash) as a fallback if a commit does not have a parent. This is done to compare the hashes of a commit’s tree and its parent commit’s tree to decide whether the commit is considered “empty” (i.e. its tree is the same as the parent’s). The other two matches come from the test suite.

Hmm, so no hardcoded value in the actual object reading code? That’s curious… Let’s see what gdb has to show us! Running git cat-file -e 4b825dc642cb6eb9a060e54bf8d69288fbee4904 through the debugger, we can see the following call chain:

cmd_cat_file()
  cat_one_file()
    repo_has_object_file()
      ...
        do_oid_object_info_extended()
          find_cached_object()

And at the end of find_cached_object() we have this code:

	if (oideq(oid, the_hash_algo->empty_tree))
		return &empty_tree;
	return NULL;

Aha! So we have this “empty tree” hash saved somewhere… Well, it turns out that my git grep search did not find it because it is not defined in hex format! See:

#define EMPTY_TREE_SHA1_BIN_LITERAL \
	"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60" \
	"\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04"
#define EMPTY_TREE_SHA256_BIN_LITERAL \
	"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
	"\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5" \
	"\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc" \
	"\x53\x21"
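To see why the plain hex search came up empty, it helps to rewrite the hash the way the C source spells it, one \x escape per byte. The sketch below does that with sed. The function name hex_to_c_escapes is made up for this post and is not part of git:

```shell
# Rewrite a hex object name as C-style \x escapes, two hex digits per byte.
# (hex_to_c_escapes is a hypothetical helper, not a git command.)
hex_to_c_escapes() {
	printf '%s\n' "$1" | sed 's/../\\x&/g'
}

hex_to_c_escapes 4b825dc642cb6eb9a060e54bf8d69288fbee4904
# prints \x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04
```

Note that the macro splits the escapes across two string literals, so searching for the whole output on one line would still miss. A fixed-string search for just the first few bytes, something like git grep -F '\x4b\x82\x5d', should be enough to land on the definition.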

The last thing to “uncover” is: why is this hash value hardcoded? Well, for that we can find the explanation using git blame (or tig blame). The code at find_cached_object() comes from the commit 346245a1bb ("hard-code the empty tree object", 2008-02-13), which says:

commit 346245a1bb6272dd370ba2f7b9bf86d3df5fed9a
Author: Jeff King <peff@peff.net>
Date:   Wed Feb 13 06:25:04 2008 -0500

    hard-code the empty tree object
    
    Now any commands may reference the empty tree object by its
    sha1 (4b825dc642cb6eb9a060e54bf8d69288fbee4904). This is
    useful for showing some diffs, especially for initial
    commits.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>
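To make the “showing some diffs” part concrete: an initial commit has no parent to diff against, but it can always be diffed against the empty tree, which then lists everything the commit contains as an addition. Here is a rough sketch of that idea. The helper name and the variable are my own, not something taken from git:

```shell
# The hardcoded empty tree, usable even in a repository with no such object on disk.
EMPTY_TREE=4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Hypothetical helper: list everything a commit contains by diffing
# the empty tree against it.
# $1: path to the repository, $2: commit-ish (defaults to HEAD)
diff_from_nothing() {
	git -C "$1" diff --name-status "$EMPTY_TREE" "${2:-HEAD}"
}
```

For example, diff_from_nothing /tmp/repo on a repository with a single commit should print one "A" line per file in that commit. Swapping --name-status for --stat or -p gives the usual diff views.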

There we have it, mystery uncovered!

Ok, Ok… I definitely over-dramatized this process… But I find it quite interesting to run this kind of analysis! It helps better understand parts of a code base, reproduce bugs, or even find the reason why a certain function (or line of code) was written in a given way. So I decided to document this particular small adventure. I hope you also enjoyed the “Schrödinger’s object” :)

Till next time,
Matheus