
Matheus Tavares

14 Jul 2019

Week 8: A working parallel inflation

Some fixes in the dir-iterator patches

Before I talk about my main project, I think I should briefly mention how my dir-iterator patchset has evolved since the last report on it.

Last week I wrote that the problem regarding the st_ino usage was solved. But, in fact, there was still a way to go. Dscho and I exchanged some emails, trying to figure out the best way to overcome the problems that were still open. After considering some alternatives, we ended up deciding to remove the snippet that detects circular symlinks from the series. This code was still far from portable, and it was blocking the patchset from being merged. Also, we may work on a more portable implementation in the future, after the series gets merged. So, here you can find version 8 of this patchset, without the circular symlink finder and with the respective adjustments made. Junio has already added it to the respective topic branch.

Recap: where were we?

OK, back to the main project: Last week we saw that parallel inflation could lead to some good time reductions on git-grep. In fact, we even observed a 46% reduction on gentoo’s repo, for one of the test cases!

But since the targets are bigger repos, I was looking forward to seeing the impact of those changes on a linux or chromium repository. Unfortunately, grep'ing these repos with my modified git-grep only led to a Segmentation Fault :( The error was due to some race condition on the delta_base_cache, but I couldn't find the problematic spot. And why did it only happen on larger repositories, in the first place? That's when last week ended and the debugging started!

The race condition on delta_base_cache

My first hypothesis was that one thread could have free'd a cache entry which a second thread expected to still be present in the cache. Through GDB, I found out that this was not the case: the call stack showed that the error occurred when trying to remove an entry from the cache. OK, so why wasn't it able to remove the entry? GDB was really helpful, but Valgrind's memcheck gave me the answer:

Invalid read of size 4
   at: entry_equals (hashmap.c:99)
   by: find_entry_ptr (hashmap.c:137)
   by: hashmap_remove (hashmap.c:222)
   by: detach_delta_base_cache_entry (packfile.c:1395)
   by: release_delta_base_cache (packfile.c:1421)
   by: add_delta_base_cache (packfile.c:1447)
   by: unpack_entry (packfile.c:1730)
   by: ...
 Address 0x556c9c18 is 8 bytes inside a block of size 104 free'd
   at: free (vg_replace_malloc.c:530)
   by: detach_delta_base_cache_entry (packfile.c:1398)
   by: release_delta_base_cache (packfile.c:1421)
   by: add_delta_base_cache (packfile.c:1447)
   by: unpack_entry (packfile.c:1730)
   by: ...
 Block was alloc'd at
   at: malloc (vg_replace_malloc.c:299)
   by: do_xmalloc (wrapper.c:60)
   by: xmalloc (wrapper.c:87)
   by: add_delta_base_cache (packfile.c:1437)
   by: unpack_entry (packfile.c:1730)
   by: ...

From this output and some code examination, we can conclude that the program was trying to access a delta_base_cache entry that had already been released and free'd. And, as we had already seen through GDB, the invalid access was also trying to release the entry. The funny thing is that both of these conflicting releases happen at the same spot in unpack_entry()'s code. Since I knew this spot was protected by the added obj_read_mutex, there was no possibility of two threads executing it "simultaneously". So, how could this error be happening?

I studied the unpack_entry() function in its current form and in past revisions to learn as much as I could about how object reading from packfiles works. (It was very interesting!) This helped me understand the cache in more detail.

Before really adding a new entry, add_delta_base_cache() may need to remove some of the least recently used ones, if the cache is full. It keeps track of usage recency through a list from the list.h API (so the cache is composed of a hashmap and a list). The error I got was exactly in the code that iterates through this list removing old entries.

I thought maybe the list was not being properly updated in my threaded code, and it was keeping stale references. So I ran some tests to probe a few hypotheses. For example: what happens with just one thread? What happens if I remove the unlock-and-re-lock pattern from git_inflate()? And what if I add it somewhere else? Could it be the same thread trying to double-release an entry? Could a thread be holding a reference to an already released entry and re-inserting it into the cache later?

Long story short, I finally found the problem: what if there were two entries for the same data in the lru list? Then, maybe, a thread was trying to release an entry whose data had also been released through another entry, by another thread… The idea matched the information Valgrind was displaying.

My first solution to this problem was to check if the entry was already present in the lru list when trying to add a new one. To do that, I added the following snippet to the beginning of add_delta_base_cache():

list_for_each(lru, &delta_base_cache_lru) {
	struct delta_base_cache_entry *f =
		list_entry(lru, struct delta_base_cache_entry, lru);
	if (f->key.base_offset == base_offset && f->key.p == p)
		return;
}

And, with that, we got free from errors!! But… the execution time got really slow :( It's easy to see why: every time we add a new entry, we need to iterate through the whole list!

I scratched my head for a while until I got to a better solution: oh well, if we want to check for an entry's existence, why don't we use the hashmap instead of the list? Replacing the above code with a simple if (get_delta_base_cache()) return solved the issue while keeping up good performance! I'm very happy that we got to this point :) You can see this working version here.

Results

I tested the modified git-grep on chromium's repo [1] with these [2] test cases. The original sequential performance was also measured as a reference. Here are the means of 30 executions, after 2 warmups, for each test case:

Times of git-grep at chromium repo
==============================
  Threads 1
                        Regex 1            Regex 2
    Original:           17.5815s           21.7217s

  Threads 2
                        Regex 1            Regex 2
    Original:           14.6363s           15.6117s
    Parallel inflate:   9.7983s            11.3965s
    Reduction:          33.05%             27.00%

  Threads 8
                        Regex 1            Regex 2
    Original:           13.5643s           16.4347s
    Parallel inflate:   6.3097s            6.9667s
    Reduction:          53.48%             57.61%

Some very good execution time reductions!

Difficulties

Debugging threaded code can be quite a challenge! Even with the help of GDB and Valgrind (memcheck and helgrind) I spent a lot of time trying to understand what was happening and why it was happening. It’s been a long week, but I’m so happy that this barrier has finally been overcome!

Next steps

From last week, the following topics were completed:

  • Continue investigating the delta_base_cache SegFaults, and how to avoid the problems with the execution order of critical sections (the interleavings).
  • Measure speedup on Chromium.

Now that I have a working version with verified improvements, I have to:

  • Make the patch less hacky:
    • Split the changes into logical patches.
    • Improve how the object reading mutex is handled in sha1-file.c and zlib.c.
    • Possibly add an obj_read_use_locks variable for the API users to set, when needed.
    • Ensure there’s no other possible race condition introduced with these changes.
    • Protect the submodule and textconv operations on git-grep (they were left behind when I removed the grep_read_mutex).
    • Maybe check the improvements on a HDD machine.
  • Take a look at Duy’s patch again to see if I can improve my version based on it.
  • Try to refine the object reading locks to have even more parallel work.

I'm also considering whether I should investigate further why the threaded inflation version tries to add duplicates to the delta_base_cache. Perhaps, if I find the cause, I could refactor the code so that checking for the entry in add_delta_base_cache() becomes unnecessary. But I think this may be really time-demanding, so I need to check with my mentors about it.

Footnotes:

  1. chromium’s repo at commit 03ae96f (“Add filters testing at DSF=2”, 04-06-2019), after a git gc execution. 

  2. Four test cases: 2 and 8 threads, with regexes abcd[02] (“regex 1”) and -E '(static|extern) (int|double) \*' (“regex 2”). 

Til next time,
Matheus
