From: Tim Northover <t.p.northover@gmail.com>
Subject: Proposal: "load linked" and "store conditional" atomic instructions
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Thursday 29th May 2014 11:55:52 UTC
Hi,

I've been looking at improving atomicrmw & cmpxchg code more,
particularly on architectures using the load-linked/store-conditional
model.

The summary is that the current expansion for cmpxchg seems to happen too late for LLVM to make meaningful use of the opportunities it provides. I'd like to move it earlier and express it in terms of a first-class pair of "load linked" and "store conditional" instructions.

More details are attached and inlined below. Currently the only code I
have is for one or two AtomicExpand peepholes, so any suggestions or
alternatives would be very welcome. It's possible this is entirely the
wrong approach.

What do people think?

Cheers.

Tim.

The Problem
-----------

We now expand atomic instructions to load-linked & store-conditional intrinsics just before ISel, but this expansion opens up more optimisation opportunities that LLVM *would* be able to exploit if it saw the control flow earlier.

For example, the return value of the C++11 and C11 compare_exchange operations is actually whether the exchange succeeded, which leads to some common idioms in Clang-produced IR.

From "if(__c11_compare_exchange_strong(...))":
    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    br i1 %success, label %true, label %false
The control flow here should be something like:
    loop:
      %loaded = load linked i32* %addr seq_cst
      %trystore = icmp eq %loaded, %oldval
      br i1 %trystore, label %store.cond, label %false
    store.cond:
      %success = store conditional i32 %newval, i32* %addr seq_cst
      br i1 %success, label %true, label %loop

From "return __c11_compare_exchange_strong(...);":
    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    %res = zext i1 %success to i32
    ret i32 %res
Here, we'd want something like:
    loop:
      %loaded = load linked i32* %addr seq_cst
      %trystore = icmp eq %loaded, %oldval
      br i1 %trystore, label %store.cond, label %false
    store.cond:
      %success = store conditional i32 %newval, i32* %addr seq_cst
      br i1 %success, label %true, label %loop
    true:
      ret i32 1
    false:
      ret i32 0

In both cases the compare is largely redundant on LL/SC architectures: you know from the control-flow graph what its result will be, and if LLVM sees this early enough it can produce something close to optimal code.

Unfortunately, most of LLVM has finished by the time this expansion occurs, so we end up with redundant compares.

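To make the redundancy concrete, here is a rough, hand-written sketch (not actual compiler output, and with the seq_cst fences omitted) of what the late expansion of the first example leaves behind on ARM today. The user-level compare survives, even though each edge into %done already determines its result:

    loop:
      %loaded = call i32 @llvm.arm.ldrex.p0i32(i32* %addr)
      %should.store = icmp eq i32 %loaded, %oldval
      br i1 %should.store, label %store.cond, label %done
    store.cond:
      %status = call i32 @llvm.arm.strex.p0i32(i32 %newval, i32* %addr)
      %failed = icmp ne i32 %status, 0
      br i1 %failed, label %loop, label %done
    done:
      ; redundant: the CFG already knows this on each incoming edge
      %success = icmp eq i32 %loaded, %oldval
      br i1 %success, label %true, label %false
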
Now, it's certainly possible to put various peephole optimisations into the expansion pass to cover these cases (at least), but to me that feels like fragile duplication of effort already spent in the mid-end.

My Proposal
-----------

I think the best way to solve this longer term is to add explicit "load linked" and "store conditional" instructions to the LLVM IR, with equal status to "cmpxchg" and "atomicrmw".

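As a purely illustrative sketch of the syntax, an "atomicrmw add" would then expand early into ordinary IR control flow around the two new instructions:

    loop:
      %loaded = load linked i32* %addr seq_cst
      %new = add i32 %loaded, %inc
      %success = store conditional i32 %new, i32* %addr seq_cst
      br i1 %success, label %done, label %loop
    done:
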
We could then simplify the AtomicExpandLoadLinked pass to use these, change it to use something like a TargetTransformInfo hook instead of TargetLowering, and schedule it much earlier in the mid-end. X86 & SystemZ would simply request that no expansion happen.

This means Clang could still emit cmpxchg and atomicrmw, which closely match C & C++ semantics, but most of LLVM would see the control flow and operations in a form much closer to how the targets will actually emit them.

In addition, other languages would be able to provide more general atomic operations if they want (slightly more complex bit twiddling seems like it could be particularly useful in some algorithms). The current atomic instructions cover (most of) C and C++, but nothing else.

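As one hypothetical example, a saturating atomic increment, which doesn't correspond to any single atomicrmw operation, could be written directly (same illustrative syntax as above; a real lowering might also need to clear the exclusive monitor on the early exit):

    loop:
      %loaded = load linked i32* %addr seq_cst
      %below = icmp ult i32 %loaded, 255
      br i1 %below, label %incr, label %done
    incr:
      %new = add i32 %loaded, 1
      %success = store conditional i32 %new, i32* %addr seq_cst
      br i1 %success, label %done, label %loop
    done:
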
Potential issues
----------------

1. Can invalid transformations be done on them? Separating an LL from its
   corresponding SC by too much almost guarantees failure (an OS task switch
   happens, clearing the monitor); see the sketch after this list.
2. We'd need to document their interaction with LLVM's memory model. I'm not
   sure I've seen a good happens-before model that explicitly uses LL/SC. The
   common ones people came up with in the Java & C++11 standardisation
   process seem to assume atomic RMW operations instead.
3. People might be assuming StoreInst produces no value and not calling
   ReplaceAllUses. A separate StoreConditionalInst is one option, but that
   risks the reverse: people not considering it a store when it is.
4. Not all targets can support them. There are very few LLVM instructions
   like that, which is a little concerning. But not all targets can support
   the current atomics either.

   If we make vulnerability to ABA problems target-specific in our
   documentation, then X86 & SystemZ could use the same trick as for min/max
   to implement them generically: a load/cmpxchg loop.

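To sketch issue 1 (illustrative syntax again, and %unrelated is just a stand-in): a normally-harmless transformation such as sinking an unrelated store between the pair can be fatal, because on many implementations it clears the exclusive monitor and the store conditional then fails on every iteration:

    loop:
      %loaded = load linked i32* %addr seq_cst
      store i32 0, i32* %unrelated    ; moved here by some pass: may clear
                                      ; the monitor, so the SC below can
                                      ; fail forever
      %new = add i32 %loaded, 1
      %success = store conditional i32 %new, i32* %addr seq_cst
      br i1 %success, label %done, label %loop
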
Alternatives
------------

1. Enhancing MachineCSE and friends so they can optimise this.
    - Works for "%old = atomicrmw OP %addr, %inc; OP %old, %inc" already.
    - Cmpxchg is more difficult. It's the control-flow optimisations that
      are particularly powerful in these cases.
    - It's more duplicated effort between different parts of LLVM.

2. Peephole optimisations in the AtomicExpandLoadLinked pass.
    - Feels like a hack, spotting particular patterns that the mid-end can
      already cope with.
    - Deals with toy functions well, but it's not quite clear that those are
      the only patterns in more general code (after inlining, say).
    - Potentially the safest option: atomicrmw & cmpxchg are preserved until
      very late, so nothing's going to come in and convert valid code to
      invalid.

3. Moving expansion further forward but still using intrinsics.
    - The intrinsics need to have fully general side effects to be correct:
      effectively an "asm volatile" for optimisation purposes, which is very
      heavy-handed for LLVM's other optimisations.
    - Still need target hooks to create the calls, because intrinsics don't
      get type lowered and so you can't write a fully generic one (e.g. an
      i64 ldrex on ARM needs to return { i32, i32 }).

4. Change the cmpxchg operation to return (say) { iN, i1 }, where the second
   value indicates success; a sketch follows at the end of this list.
    - Probably a good idea anyway, as part of support for weak
      compare-exchange operations.
    - Doesn't actually help this issue much: it's mostly the control-flow
      optimisations that seem to be helpful, and this would leave us with
      phi nodes in need of materialisation when lowered.
    - Might make the peephole optimisations in 2 slightly simpler.

5. Some kind of generalisation that can cover both LL/SC and transactional
   memory to avoid doing this again if and when that works out?
    - I can't think of one that's both neat and practically implementable
      for both.
    - Suggestions welcome.
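
For reference, a sketch of the modified cmpxchg form from alternative 4; the user-level icmp disappears because the success bit comes straight out of the instruction, though the control-flow points above still apply:

    %pair = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %loaded = extractvalue { i32, i1 } %pair, 0
    %success = extractvalue { i32, i1 } %pair, 1
    br i1 %success, label %true, label %false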