Features Download
From: David Chisnall <david.chisnall <at> cl.cam.ac.uk>
Subject: Re: Proposal: "load linked" and "store conditional" atomic instructions
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Friday 30th May 2014 11:24:50 UTC (over 3 years ago)
On 29 May 2014, at 18:21, Tim Northover  wrote:

>    void atomic_foo(int *addr) {
>      int oldval = *addr;
>      do {
>        newval = foo(oldval);
>      } while (__c11_compare_exchange_weak(addr, &oldval, newval));

This particular example is a bit difficult, because the best representation
depends on the complexity of foo().  If foo() is simple (and guaranteed not
to contain any atomics) then you'd want to do the load-linked at the start
and store-conditional at the end, something like this:

     int oldval = load_linked(addr);
     do {
       newval = foo(oldval);
     } while (!store_conditional(addr, newval));

If, however, foo() is not visible, is complex, or has some other memory
accesses[1] then the correct lowering would be a load at the start and then
a load-linked/store conditional (but without a loop) at the end.

     int oldval = *addr;
     int oldval2;
     do {
       newval = foo(oldval);
       oldval2 = load_linked(addr);
     } while ((oldval != oldval2) || !store_conditional(addr, newval));

The lowering that we currently generate, however, is more like this:

     int oldval = *addr;
     int success = 1;
     do {
       newval = foo(oldval);
       int oldval2;
       do {
         oldval2 = load_linked(addr);
         if (oldval2 != oldval) {
           success = 0;
       } while ((oldval != oldval2) || !store_conditional(addr, newval));
     } while (!success);

This is clearly suboptimal.  My problem with your current proposal is that
we really need to preserve the weak cmpxchg semantics through optimisation
to ensure correctness.  The front end does not know (except in trivial
cases) which of the two earlier forms is correct.  It could always emit the
second form for any weak cmpxchg, but that then makes optimisation harder
because optimisers have to infer that it means a weak cmpxchg[2].

When we get to CodeGenPrepare, it's fairly easy to transform the original
structure into the first correct output, if there are no memory accesses in
the middle.  Transforming the second version into the first is harder
unless the structure is preserved.  

In short, I believe that adding ll/sc to the IR will:

- Not make generating IR easier than adding weak cmpxchg
- Not make optimising IR easier than adding weak cmpxchg
- Not make code generation easier than adding weak cmpxchg
- Be more invasive than adding weak cmpxchg


[1] RISC-V, for example, requires that there be no memory accesses *at all*
between ll and sc, some ARM implementations require that there be no memory
accesses within the same cache line (see a bug report from a year or so ago
on Apple hardware).

[2] There's a lot of work being done by Peter Sewell's group and others
currently on what are semantically valid transforms for the C11 memory
model, so as long as we preserve those semantics in the IR we have lots of
opportunities to improve optimisations.
CD: 3ms