GSoC Week 13: Going for a [too] big step

We are approaching the end of GSoC. In fact, this is the final week! A working version of git-grep with parallel inflation was already sent and merged into pu. But threading is still disabled when --recurse-submodules or --textconv are given. So I’ve been working on a v2 to support grepping submodules and try some more optimizations along the way. It didn’t seem so complex at the beginning, but I quickly realized I was just at the tip of the iceberg. This was probably the most intense week in my GSoC. It went like this:

1) Submodule’s initializations

At first, to efficiently grep submodules in parallel I thought I just had to internally protect these four functions at the beginning of grep_submodule:

is_submodule_active()
repo_submodule_init()
repo_read_gitmodules()
add_to_alternates_memory()

Doing so, I would be able to remove the grep_read_mutex around them in favor of the internal object reading locks I added. This would let git grep --recurse-submodules --cached take full advantage of the parallel inflation. (the internal locks are released when inflating.)

1.1) Finding the thread-unsafe spots

The first thing to do was to discover why those functions needed to be in a critical section. A comment right above them give the answer:

/*
 * NEEDSWORK: submodules functions need to be protected because they
 * access the object store via config_from_gitmodules(): the latter
 * uses get_oid() which, for now, relies on the global the_repository
 * object.
 */

But inspecting config_from_gitmodules() I discovered the comment was outdated and the function no longer used get_oid(). However, it now called add_to_alternates_memory(), which adds the subrepo object directory as an in-memory alternate to the global the_repository. As this operation would be performed concurrently with the object reading operations by the worker threads, it could cause data races. (Because it writes to the r->objects->odb list, while oid_object_info_extended()’s call chain may need to iterate through it.)

1.2) Removing 1st `add_to_alternates_memory()`

config_from_gitmodules() was adding the subrepo to the alternates memory in order to call config_with_options(), which needs access to the subrepo’s objects. The call chains I found that reads the subrepo objects are:

config_with_options() > git_config_from_blob_ref() > get_oid()
config_with_options() > git_config_from_blob_ref() > get_oid() > git_config_from_blob_oid() > read_object_file()

So, in order to remove the addition to the alternates list, I added a repo_config_with_options(), which would take an extra repository parameter and pass it down in those chains. Then I made config_from_gitmodules() use this new function.

1.3) Thread-safe initializations?

At this point, I was pretty sure the first 3 submodule operations were already thread-safe without the need of grep_read_mutex or even my internal object reading locks. But there was no way to be sure other than going through each and every function in the call chains. I tried that, but as one would imagine, the call chains were too big for me to manually examine line-by-line…

1.4) Removing 2nd `add_to_alternates_memory()`.

I still had to take care of one more thread-unsafe function: the add_to_alternates_memory() at grep_submodule(). Removing it would not only help in the parallelization but would also solve another NEEDSWORK comment:

/*
 * NEEDSWORK: This adds the submodule's object directory to the list of
 * alternates for the single in-memory object store.  This has some bad
 * consequences for memory (processed objects will never be freed) and
 * performance (this increases the number of pack files git has to pay
 * attention to, to the sum of the number of pack files in all the
 * repositories processed so far).  This can be removed once the object
 * store is no longer global and instead is a member of the repository
 * object.
 */

The call was there so that worker threads could access the subrepo’s objects through the alternates list. Thus, to remove it I had to pass the subrepository reference down to the threads and make them use it explicitly. Unfotunatelly, it seems that this patch introduced some data races I wasn’t able to find until this moment :(

2) Other submodule’s data races

I also noticed some other operations on grep_submodule() which weren’t thread-safe:

The call to parse_object_or_die(), when grepping trees, is outside a lock. This function may load objects internally, write to r->parsed_objects_obj_hash and also call lookup_replace_object(), which is thread-unsafe. So calls to it must be done in a critical section.
To clean the memory allocated for the subrepository, repo_clear(&subrepo) was being called right before returning from grep_submodule(). However, by the time this function is called, there’s no guarantee that the worker threads have already finished working on the subrepo. To properly avoid that, we would need to implement a kind of mapping which tells, for each repo, how many objects are yet to be processed. As this would be somehow laborious, I just skipped the memory cleaning for now, to implement a testing version.

3) Thread-safe code inside critical section

I noticed what it seemed to be an already thread-safe code inside add_work()’s critical section. So I moved it out, which lead to a minor but noticeable speedup:

-static void add_work(struct grep_opt *opt, const struct grep_source *gs)
+static void add_work(struct grep_opt *opt, struct grep_source *gs)
 {
+       if (opt->binary != GREP_BINARY_TEXT)
+               grep_source_load_driver(gs, opt->repo->index);
+
        grep_lock();

        while ((todo_end+1) % ARRAY_SIZE(todo) == todo_done) {
                pthread_cond_wait(&cond_write, &grep_mutex);
        }

        todo[todo_end].source = *gs;
-       if (opt->binary != GREP_BINARY_TEXT)
-               grep_source_load_driver(&todo[todo_end].source,
-                                       opt->repo->index);

4) Move the `obj_read_mutex` down

I didn’t mention it before, but some of those four submodule operations use object reading function such as oid_object_info_extended(), which is not thread-safe. To overcome this, I decided to move v1’s obj_read_mutex from read_object_file_extended() down to oid_object_info_extended(). This allowed even more functions to be called in parallel, but some extra locks were needed.

5) Replace `grep_read_mutex` by internal locks

With all the above done, it was time to remove the grep_read_mutex and enable the internal object reading locks. This moment, I also re-enabled threads in the cached grep with --recurse-submodules. Unfortunatelly, the result wasn’t the best. I tried a lot of changes to this main idea, but all of them ended up in a data race, Segmentation Fault, failure to read subrepos’ objects, or performance drop. Each of these alternative versions are in my local branches, which I thought wouldn’t be worthy uploading to share, but here is the “main” one.

Next steps

I believe my idea for v2 turned out to be a very big step and I just got lost in it. So my current idea is to take a step back and go slower. With that in mind, I might end up not finishing this version during GSoC. But I thinkg that’s OK as we already have v1 sent and with good speedup. And I can keep working on this after GSoC, anyway :)

So, instead of making more tweaks to my v2 series, I sat down and analysed the code trying to list what are the real open issues (and how we might solve them):

add_to_alternates_memory(): althought this function cannot be called in parallel, the case in which git-grep uses it, may not be problematic, if:
- prepare_alt_odb() is called at repositories’ initalization (at cmd_grep() for the_repository and grep_submodule() for subrepos). This way, we force eager initialization and avoid data races with the worker threads in the initialization.
- The ent->next = NULL line is moved up in link_alt_odb_entry(). Then, the worker threads may iterate through r->objects->odb without possibly falling on an unmaped memory region. (That’s hacky, but should work).
parse_object_or_die(): again, this should not be a problem in our case, if:
- The object reading operations are properly protected internally.
- prepare_replace_object() is called at repositories’ initialization (as before).
Submodule initialization functions (at grep_submodule): This is the critical part! I re-investigated them and found out is_submodule_active(), for example, has repack_packed_git() in its call chain. This could clearly generate data races with the workers. And as we are working with “indermediate level” locks, I really don’t know yet how to protect these functions without going lower or higher in our “protection layer”. On the other hand, submodule_from_path() has the same problem and it is outside the critical section… So maybe we can ignore those data races which are too improbable?

Regarding the commits I already have, I plan to:

Remove the submodule-config: don't add subrepo to alternates from the series. It is no longer necessary here, so it might be sent separately.
Remove grep: don't add submodules to the alternates list and hold it in standby. It currently brings data races that must be solved if I’m going to send it in the future.
Refactor the patch adding the object reading locks. I plan to protect only oid_object_info_extended() and not read_object_file_extended(). This won’t work for the general case (which I’ll make explicit), but should work in our case as we’ll be forcing eager initialization for alternates and replace maps in all repos.
The grep: move driver pre-load out of critical section should be OK if we have the object reading locks.
The WIP: grep: remove racy call to repo_clear() must be properly finished.

And that’s the plan! Well, I really hope I’m not being too flustered again, and proposing bigger steps than what I can take…

Main difficulties

Main difficulties this week were:

Debugging data races: I used a combination of GDB, ThreadSanitizer and valgring (helgrind and memcheck). But sometimes the racy spot is to occasional to be caught.
Investigating deep call chains: To make a low-level “thread-protection layer” is hard. I needed to go deep in git-grep call stacks to make sure some functions were safe to be called unlocked. And a couple times I was still unsure.

I also spent some time thinking how I should properly deal with:

Lazy initializations: They are great in big codebases as they mitigate the need to keep track of all initializations in the beginning of an execution. But unfortunatelly, they are difficult to handle when threaded, as two threads can try to initialize a resource at the same time. (For example, take a look at prepare_replace_object() and prepare_alt_odb(), which are called by many functions.)
Static function-scope variables: Some functions such as oid_to_hex() return a static buffer. This is a good convenience for the callees and it avoids memory leaks, but it makes the function thread-unsafe.