Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane
From: Kostya Serebryany <kcc <at> google.com>
Subject: multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Thursday 17th April 2014 12:21:46 UTC (over 3 years ago)
Hi,

The current design of -fprofile-instr-generate has the same fundamental
flaw
as the old gcc's gcov instrumentation: it has contention on counters.
A trivial synthetic test case was described here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html

For the problem to appear we need to have a hot function that is
simultaneously executed
by multiple threads -- then we will have high contention on the racy
profile counters.

Such situation is not necessary very frequent, but when it happens
-fprofile-instr-generate becomes barely usable due to huge slowdown
(5x-10x)

An example is the multi-threaded vp9 video encoder.

git clone https://chromium.googlesource.com/webm/libvpx
cd libvpx/
F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++
$F" LD="clang++ $F" ./configure
make -j32
# get sample video from from
https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m

When running single-threaded, -fprofile-instr-generate adds reasonable ~15%
overhead
(8.5 vs 10 seconds)
When running with 8 threads, it has 7x overhead (3.5 seconds vs 26
seconds).

I am not saying that this flaw is a showstopper, but with the continued
move
towards multithreading it will be hurting more and more users of coverage
and PGO.
AFAICT, most of our PGO users simply can not run their software in
single-threaded mode,
and some of them surely have hot functions running in all threads at once.

At the very least we should document this problem, but better try fixing
it.

Some ideas:

- per-thread counters. Solves the problem at huge cost in RAM per-thread
- 8-bit per-thread counters, dumping into central counters on overflow.
- per-cpu counters (not portable, requires very modern kernel with lots of
patches)
- sharded counters: each counter represented as N counters sitting in
different cache lines. Every thread accesses the counter with index TID%N.
Solves the problem partially, better with larger values of N, but then
again it costs RAM.
- reduce contention on hot counters by not incrementing them if they are
big enough:
   {if (counter < 65536) counter++}; This reduces the accuracy though. Is
that bad for PGO?
- self-cooling logarithmic counters: if ((fast_random() % (1 << counter))
== 0) counter++;

Other thoughts?

--kcc
 
CD: 44ms