Week 4: A different approach
As I documented in the last post, we’ve been rethinking my initially proposed schedule, as things weren’t turning out exactly how I had expected. I was probably working on functions that didn’t really need to be thread-safe, and somewhat walking in the dark.
Talking with my mentors, we reached the conclusion that it would be more beneficial to the project if I switched approaches. Starting at git-grep, which is already threaded, and gradually refining its parallelism while making the necessary pack access functions thread-safe seemed the way to go. I felt more comfortable with this idea, but it also meant I had to study the codebase more, so that’s what I’ve been doing.
I spent most of the week studying git-grep’s code. Thankfully, it wasn’t as complex as the pack access code or git-blame’s. To implement threading, git-grep uses a producer-consumer mechanism, which I’m already familiar with: a master thread lists the files in the repository and adds tasks to a FIFO queue, while N worker threads grab and perform the tasks.
I also discovered that git-grep has two “operation modes”:
- grep the tracked files in the working tree (default)
- grep blobs registered in the index (--cached or when a ref is given)
The first mode runs with the producer-consumer mechanism and gets a very good (although not linear) speedup. But the second one, which uses the pack access code to read the blobs, is performed sequentially despite sharing the same machinery with the first. This very nice thread on the mailing list explains why (it also gave me some insights on possible improvements to git-grep’s performance).
Basically, the speedup observed when grepping the working tree with threads was not present when grepping blobs with threads. In fact, the developers observed a slowdown as the number of threads increased! To avoid that, threading was disabled for this case in commit 53b8d93 (“grep: disable threading in non-worktree case”, 2011-12-12). I reverted this commit to check for myself and, for simple regexes, got the same results as described.
The commit’s message says:
Measurements by various people have shown that grepping in parallel is not beneficial when the object store is involved.
So until the pack machinery allows unthreaded access, we disable grep’s threading in all but the worktree case.
This last sentence is quite motivating for my project :) But to gain more certainty about the weight of the pack access in git-grep, I decided to run some tests. In particular, I compiled Git with -g -pg -O0 and ran git grep -E '(static|extern) (int|double) \*' HEAD¹. After that, using gprof and gprof2dot, I got the following graph (click it for a larger visualization):
As you can see, read_object_file_extended() and its children take a significant slice of the execution time. So it seems promising to try performing it in parallel. Another eye-catcher is grep_source_is_binary(). I’m not yet sure why it seems so time-expensive, but it’s definitely something I want to investigate further.
I also reverted the commit disabling threads on cached grep and made a plot using 8 threads. You can see the results here.
Christian suggested this nice series of scripts to render perf’s output as a flamegraph. For the same previous test (just removing -pg), we get this image (click it to enlarge):
(The 8-threads version can be found here)
Here we can see, again, the object store calls distributed across the recursive grep_tree(). But what most draws attention is the massive time spent in libz.so (which should be the decompression routines?). If so, it’s another good indication for the parallelization.
Another possible improvement, suggested in that previous ML thread, is to try replicating in git-grep a nice git-diff feature: diff avoids retrieving blobs from the object store when it knows the files in the working tree are stat-clean (i.e., up to date).
Out of curiosity, I re-ran the test a few more times with different sets of parameters. One of the interesting results came when I re-executed the above experiment with a simpler “abc” regex: 76% of the time was spent at … alone! So it may be a nice hot spot for improvements.
It’s also curious how git-grep’s CPU occupancy stays quite low with or without threads… Perhaps much of the time is spent in lock contention or waiting for work to be assigned by the master thread. I surely want to inspect this further.
Other Git related activities
This week, Renato and I hosted a seminar on “Object-Oriented Programming in C: A case study on Git and Linux”. In it, we showed some of the OOP techniques that Git and Linux use to help organize code and solve problems. Here you can find the slides of our presentation: oop_git_and_kernel.pdf.
I’ve also been trying to form a group of local Git contributors at my college. More specifically, I’m helping interested students on the “Git front” of FLUSP, a group I participate in, composed of students passionate about FLOSS projects. This week, we had some new faces at FLUSP’s Git meeting :) Here is a picture of it, taken when we jointly downloaded Git’s source code, inspected the codebase, and did other things:
I surely underestimated the complexity of this project. And not knowing enough of the codebase makes it harder to propose ways of parallelizing/optimizing code, especially for heavy sections such as the pack access or blame. I’m a bit sad not to have written much code yet :( But I’ve been feeling kind of stuck, not knowing where to start. That’s why I’ve dedicated myself to studying the code, and that’s why I proposed the change of approach. I’m happy, though, to start digging into git-grep and gradually understand its internals. I’m not sure yet whether this new approach will work, but I’m anxious to start getting my hands on code.
The idea now is to focus on git-grep’s calls to the pack access code, in an effort to make them thread-safe. Then, we’ll be able to eliminate git-grep’s lock around these functions (or at least give the locks a finer granularity). But this is more of a long-term goal. I’m not exactly sure yet how to break it into small tasks or what, concretely, I’ll work on during the following week. What I have planned is to inspect these functions further and take a look at the other optimization options I mentioned in this post. I’m also curious to further inspect grep_source_is_binary() and the low CPU usage during git-grep.
I tried to choose a regex that is neither too simple nor too complex for the tests, in order to get a more realistic result for the average case. A regex at either extreme could bring the execution time to extremes as well, not allowing us to see how the object store code performs in git-grep. ↩
Til next time,