Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane
From: Michael Zolotukhin <mzolotukhin <at> apple.com>
Subject: Re: Proposal: add intrinsics for safe division
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Saturday 26th April 2014 21:11:07 UTC (over 3 years ago)
Hi,

Thanks to everybody for the very informative feedback! Due to difference in
timezones, I usually miss the most active time of discussions, however
I’ll try to cover the raised questions here and share my vision on them.

Generally speaking, we are trying to find answers to the following two
questions:
    o	Are these intrinsics useful in general? 
    o	If yes, when to lower them to actual IR/DAG?

In a nutshell, my answers are:
    o	Yes, they are useful, and useful for all targets. Not only for ARM64,
where they are especially useful.
    o	After loop optimizations but before lowering IR to DAG. Currently
it’s in CodeGenPrepare, but we might lower them before that if there is
any need in this.

I'll try to explain why I think so. For the beginning, let's consider a
simple expression div=x/y in different languages and write the
corresponding IR with use of the intrinsics (here I omit types for
conciseness, assuming result type of the safe.sdiv intrinsic being {i32,
i1, i1}) :

*** C/C-like ***
  %div = sdiv %x, %y

*** Java, Go ***
  %divr = call %safe.sdiv(%x, %y)
  %div = extractvalue %divr, 0
  %divbyzero_bit = extractvalue %divr, 1
  br i1 %divbyzero_bit, label %throw, label %normal

*** JavaScript, with “|0” ***
  %divr = call %safe.sdiv(%x, %y)
  %div = extractvalue %divr, 0

*** JavaScript without “|0”, Ruby and maybe others ***
  %divr = call %safe.sdiv(%x, %y)
  %div = extractvalue %divr, 0
  %divbyzero_bit = extractvalue %divr, 1
  %ovf_bit = extractvalue %divr, 2
  %exceptional_case_bit = or %ovf_bit, %divbyzero_bit
  br i1 %exceptional_case_bit, label %handle_exceptional_case, label
%normal


Now let’s write the IR for the same expression without the intrinsics:

*** C/C-like ***
  %div = sdiv %x, %y

*** Java, Go ***
  %divbyzero_bit = cmp %y, 0
  br i1 %divbyzero_bit, label %throw, label %normal
normal:
  %div = sdiv %x, %y

*** JavaScript, with “|0” ***
  %divbyzero_bit = cmp %y, 0
  br i1 %divbyzero_bit, label %division_by_zero, label %chkovf
division_by_zero:
  br label %end
chkovf:
  %tmin_bit = cmp %x, min
  %minus_one_bit = cmp %y, -1
  %ovf_bit = and %tmin_bit, %minus_one_bit
  br i1 %ovf_bit, label %ovf, label %normal
ovf:
  br label %end
normal:
  %div.0 = sdiv %x, %y
  br label %end
end:
  %div = phi [%div.0, %normal], [min, %ovf], [0, %division_by_zero]

*** JavaScript without “|0”, Ruby and maybe others ***
  %divbyzero_bit = cmp %y, 0
  br i1 %divbyzero_bit, label %throw, label %chkovf
chkovf:
  %tmin_bit = cmp %x, min
  %minus_one_bit = cmp %y, -1
  %ovf_bit = and %tmin_bit, %minus_one_bit
  br i1 %ovf_bit, label %ovf, label %normal
ovf:
  ; Special handling of min/-1: converting to BigInt, double or
something other
  unreachable
normal:
  %div.0 = sdiv %x, %y

As one can see, if the intrinsic is lowered to some CFG (i.e. on x86),
there is nor loss neither win: due to the implemented optimizations the
final code would be roughly the same as in the case with explicitly written
CFG.

If the intrinsic could be directly lowered to a single instruction (i.e. on
ARM64 and if we don’t bother about divz/ovf-flag), use of the intrinsic
would definitely be beneficial - that takes place for the  case.

That might make one think that the intrinsics are useful only for ARM64,
and in general aren’t that useful at all. But in fact even when it
can’t be lowered to a single instruction, keeping the control flow
‘inside’ the intrinsic till its lowering might be beneficial. For
instance, loop passes work much better on loops without control flow:
IndVars might be a good example of such pass. And, for instance, it seems
much easier to vectorize a loop with call to one of these intrinsics than
vectorize a loop with explicitly created control flow (and it could become
even worse if there are several calls to the intrinsics in the same loop).

In some cases we might want to lower these intrinsics earlier than in CGP -
as Reid mentioned, it could help us to hoist checks out of loops. That’s
true but we need to keep in mind possible disadvantages of such early
lowering (see previous paragraph). But in general I don’t see any
problems with allowing earlier lowering.

As for extending the result structure to keep one more flag (being that i2
instead of i1, or one more i1 flag) - that seems fine, and both options
{iN, i2} and {iN, i1, i1} look good to me.

Now, why not to lower these intrinsics even later? Almost for all targets
we want to get very similar CFG, and I don’t see any benefits of
duplicating very similar code all across different backends. Even on ARM64
we need to have control flow in some cases (e.g. for Java) - so we actually
win nothing from making lowering completely target-specific.

I hope that covers the raised concerns, at least to some extent. Does that
sound reasonable enough? If so, I’ll prepare and post here an updated
version of the patch.

Thanks,
Michael


On Apr 26, 2014, at 4:04 AM, Andrew Trick  wrote:

> 
> On Apr 25, 2014, at 2:21 PM, Eric Christopher  wrote:
> 
>>> In short, I agree with your observations that these intrinsics are not
an
>>> obvious slam-dunk compared to making the explicit control flow, but I
think
>>> that the intrinsics do give enough flexibility on the LLVM side that it
>>> would be great if front-ends used them rather than rolling the control
flow
>>> themselves.
>>> 
>> 
>> The problem is that then we have 2 problems: All targets (except for
>> arm64) then have to lower the intrinsic as the first thing they do
>> (giving us a TTI pass as the first thing in the pipeline) to take
>> advantage of the information later during optimization, _and_ we have
>> to plumb all of the work optimizing the intrinsic as well giving us a
>> situation where we've now split our optimization efforts as well as
>> the pain of maintaining an intrinsic that's useful for a single
>> target.
>> 
>> I really think that this is just solidifying my position that the
>> intrinsic is a bad idea and that this should be done as later
>> optimizations.
> 
> The goal of the intrinsic wasn't stated clearly.
> 
> This intrinsic isn't particularly necessary for any specific frontend
> or architecture. I think the LLVM community would benefit from a
> canonical way to represent well-defined integer division. We don't
> need to add an IR instruction, because it's fine for most optimization
> passes to ignore these operations. A target independent intrinsic
> works well as long as:
> 
> - It cleanly captures the language level semantics.
> 
> - It facilitates mid-level optimization.
> 
> - It naturally lowers into ideal code both on architectures that trap
>  (x86) and those that don't (arm).
> 
> To summarize my understanding of the concerns:
> 
> (1) The semantics of the safe.div intrinsic need to be useful for the
> Language/ISA Matrix that LLVMers care about.
> 
> At canonical IR level, the intrinsic is useful by eliminating control
> flow merges and representing divide-by-zero and/or signed overflow in
> a canonical form:
> 
>    %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
>    %bit = extractvalue {i32, i1} %res, 1
>    br i1 %bit, label %trap, label %continue
> trap:
>    call ...
>    unreachable
> 
> continue:
>    %div = extractvalue {i32, i1} %res, 0
> 
> The original proposal fails to achieve this because the common case of
> Java/Go would require a check in the slow-path to differentiate
> divide-by-zero from signed overflow. That should be fixed by
> generalizing the intrinsic so that the two conditions are distinct:
> 
>    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
>    %div0 = extractvalue {i32, i1} %res, 1
>    br i1 %div0, label %trap, label %checkovf
> 
> checkovf:
>    %ovf = extractvalue {i32, i1} %res, 2
>    br i1 %div0, label %trap, label %continue
> 
> trap:
>    call ...
>    unreachable
> 
> continue:
>    %div = extractvalue {i32, i1} %res, 0
> 
> ...or some variation of the above. I don't have a personal stake in this.
> 
> (2) The safe.div intrinsic inhibits generic code motion and Correlated
> Value Prop based optimization.
> 
> This goes both ways.
> 
> CVP could miss out on cases unless we teach it about the semantics of
> the intrinsics. Correct me if I'm wrong, but I don't think this would
> actually be too difficult.
> 
> OTOH, with the intrinsics, it would be easy to write a simple
> optimization pass that hoists and combines checks along the domtree.
> 
> After divide-by-zero optimization, you would have something like:
> 
>    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
>    %div0 = extractvalue {i32, i1} %res, 1
>    br i1 %div0, label %trap, label %continue
> 
> trap:
>    # No call here!
>    unreachable
> 
> continue:
>    %div = extractvalue {i32, i1} %res, 0
> 
> And the branch to trap just goes away at some point.
> 
> Now considering Reid's LICM example:
> 
> for (int i = 0; i < n; ++i) {
>  if (b == 0) c[i] = 0;
>  else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN;
>  else c[i] = a[i] / b;
> }
> 
> Simply put, if we want to unswitch this code for x86, then we need a
> TTI pass to lower the intrinsic. On the other hand, I believe it is
> more important to eliminate the control flow within the loop to aid
> loop analysis and other optimizations. So we still want the front end
> to emit the intrinsic, we just may eventually want it lowered earlier
> than CGP. I don't think this issue has any bearing on the intrinsic's
> LangRef spec.
> 
> There was some comparison made to iload/istore, which I don't
> follow:
> - subsuming loads and stores into another instruction is really scary
>  considering how much logic we have for analyzing the side effects of
>  memory access.
> - there is no benefit to IR optimization to the iload/istore intruction.
> - it is easy to detect null checked load/stores in IR at any point.
> - it is very rare that a platform would want to resume from trapping
load/store.
> 
> (3) The safe.div intrinsic is a target-specific codegen optimization.
> 
> Regardless of the target, we want to eliminate control flow merges in
> all cases, and completely eliminate control flow when we don't have
> trapping semantics. To do this, we need an intrinsic at canonical IR
> level. Clearly it should be a target independent intrinsic
> since *some* optimizations/analyses want to be aware of it.
> 
> Even ignoring the rationale above, supporting codegen with a
> target-specific intrinsic would minimally require the DAG builder to
> match this
> "if (b != 0) ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0)"
> 
> after all optimizations have had their way with it. A big problem here
> is that there are actually many different forms of this expression and
> surrounding control flow, depending on the frontend's integer division
> semantics. We would have to recognize all the variations of checking for
> divide-by-zero and/or overflow, trapping and/or producing a certain
> constant, branching or selecting values. This is obviously terrible
> from an engineering standpoint.
> 
> -Andy
> _______________________________________________
> LLVM Developers mailing list
> [email protected]         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
 
CD: 2ms