Features Download
From: Chandler Carruth <chandlerc <at> google.com>
Subject: Re: [RFC] Stripping unusable intrinsics
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Tuesday 23rd December 2014 21:40:54 UTC (over 3 years ago)
On Tue, Dec 23, 2014 at 12:41 PM, Chris Lattner  wrote:

> On Dec 23, 2014, at 10:54 AM, Bob Wilson  wrote:
> On Dec 23, 2014, at 10:41 AM, Chris Lattner  wrote:
> On Dec 23, 2014, at 10:28 AM, Chris Bieneman  wrote:
> It should be straight-forward to have something like
> LLVMInitializeX86Target/RegisterTargetMachine install the intrinsics into
> registry.
> I tried doing that a few years ago. It’s not nearly as easy as it
> because we’ve got hardcoded references to various target intrinsics
> scattered throughout the code.
> I was just writing to say exactly this. There are a number of ways we
> could work toward something like this. I’m completely in favor of a
> where Intrinsics are properties of the targets and don’t leach out,
> today they do in a lot of places.
> What are the specific problems here?  Anything that does an equality
> comparison with the IntrinsicID can be changed to do strcmp with the
> That would handle the one-off cases like
> InstCombiner::SimplifyDemandedUseBits in InstCombine.
> The other cases in InstCombine could be handled similarly, but may be
> better handled by adding a intrinsic behavior query APIs to the intrinsic
> registry, or would be better handled (eventually) by adding new
> to the intrinsics themselves.
> I don’t think there’s anything fundamentally difficult, but it’s a
> change. For example:
> $ git grep Intrinsic:: | wc
>     3364   12286  281078
> The vast majority of those 3,364 lines have hardcoded references to
> specific intrinsics. Many of them are used in contexts where you can’t
> easily insert a strcmp (e.g., case values in large switch statements, or
> worse, the m_Intrinsic PatternMatch templates).
> I don’t find this convincing.  It should be simple to introduce a new
> m_Intrinsic PatternMatch template that takes a string.  The switches are
> also straight-forward.  In BasicAA, for example, instead of:
>    switch (II->getIntrinsicID()) {
>     default: break;
>     case Intrinsic::memset:
>     case Intrinsic::memcpy:
>     case Intrinsic::memmove: {
>     ...
>     case Intrinsic::arm_neon_vld1: {
>       assert(ArgIdx == 0 && "Invalid argument index");
>       assert(Loc.Ptr == II->getArgOperand(ArgIdx) &&
>              "Intrinsic location pointer not argument?");
>       // LLVM's vld1 and vst1 intrinsics currently only support a single
>       // vector register.
>       if (DL)
>         Loc.Size = DL->getTypeStoreSize(II->getType());
>       break;
>     }
> }
> Just change this to:
>    switch (II->getIntrinsicID()) {
>     case Intrinsic::memset:
>     case Intrinsic::memcpy:
>     case Intrinsic::memmove: {
>     ...
>     default:
>       if (II->getName() == “llvm.arm.neon.vld1”) {
>         same stuff as above
>       }
>     break;

If we're going to talk about what the right long-term design is, let me put
out a different opinion. I used to be somewhat torn on this issue, but this
discussion and looking at the particular intrinsics in question, I'm
rapidly being persuaded.

We shouldn't have any target specific intrinsics. At the very least, we
shouldn't use them anywhere in the front- or middle-end, even if we have

Today, frontends need to emit specific target instrinsics *and* have the
optimizer be aware of them. I can see a few reasons why:

1) Missing semantics -- the IR may not have *quite* the semantics desired
and provided by the target's ISA.
2) Historical expectations -- the GCC-compatible builtins are named after
the instructions, and the target independent builtins lower to intrinsics
so the target-specific ones should too.
3) Poor instruction selection -- we could emit the logic as boring IR, but
we fail to instruction select that well, so as a hack we emit the
instruction directly and teach the optimizer to still optimize through it.

If we want to pursue the *right* design, I think we should be fixing these
three issues and then we won't need the optimizer to be aware of any of

Fixing #1) We just need to flush these out and define target independent
intrinsics. The number of these is relatively small, and not likely to be a
burden. I'm looking at you cvtss2si.

Fixing #2) I think this just requires a giant pile of work to re-target
builtins and other things that people expect the optimizer to operate on
into proper IR. Yes, it's hard, but it is the right design. With x86 we're
already about 90% of the way there. My understanding is that ARM has gone
the other direction. While unpleasant, I think we have to make a choice:
either the builtin doesn't get optimized or it gets lowered to "normal" IR.

Fixing #3) Yes, we need better instruction selection. We have always needed
better instruction selection. If there are truly critical needs that aren't
met, we can even achieve this through much more localized hacks *in the
backend* that form specific intrinsics to "pre select" instructions that we
know are important. While just as hacky and gross, this seems like a much
better tradeoff to make than the current approach.

In the end, we don't have target-specific intrinsics in the middle end at
all. Then we get to decide on whether it is worth our time to provide a
really efficient target-specific *encoding* of these operations, or if we
should just lower to the target specific encoding we already have: inline
assembly. I don't have strong feelings about this either way.

OK, so I think that is the right design, but I also completely agree with
Chris B here -- it is *way more work* than is really called for to address
a very narrow, targeted problem that WebKit has: binary size.

But I also agree with Chris L's point here:

> Spreading ifdefs through the codebase would be really unfortunate.

I just don't think this is necessary. And I don't think relying on a
fragile frontend-based "optimization" to achieve the same effects is really
great either.

The huge code in IR/Function.cpp.o is from *terrible* code being emitted by
tablegen. We should just change tablegen to emit something smaller. This
will get most of the size win, and will get it even when the target *is*
compiled in!

In particular, if we switch to a sorted table of intrinsic names, change
the *lookup* of names to use the same sorted table so we share the memory,
and do a binary search, I think the code size will be essentially free.

Once we address that, Chris B can use the somewhat ugly #ifdef's or his
clang attribute patch to look for other hotspots where we burn size on
intrinsics. I bet a few more could be fixed by similar effort.

If the code size is *still* a problem, we could could start factoring
things incrementally into target independent intrinsics and removing the
target specific ones which might help incrementally move us toward the
"right design" (IMO).

Then if someone is really motivated (which would be cool) to do all the
hard work to really move us over to the right design, we would have lost
nothing, there would be essentially no more technical debt than there is
today, and WebKit and others won't have to wait for their binary size

CD: 55ms