From: Ansari, Zia <zia.ansari <at> intel.com>
Subject: Bugzilla – Bug 5615
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Wednesday 10th June 2015 16:52:57 UTC
Old bug, but I decided to use some modern hardware to do some analysis on
it for fun. I updated the Bugzilla report, but it was suggested that I
should also share it with llvmdev for broader exposure, for anyone
interested. The text from the bug report is copied below, and the PPT is
attached to this mail.

Useful for anyone interested in or troubled by code alignment issues on IA.



Comment 4  Zia Ansari 2015-05-28 17:49:13 CDT
I know this is super old, but I took a quick look at this issue and the
test-case attached to pr3120 to see if anything jumped out at me. Mostly
for educational purposes, and also to see if there are any opportunities.

Since this report is very old, it’s unclear which architecture the
performance swings were reported on and, perhaps more importantly, whether
we still care about those architectures today.

I chose to play around with it a little on today’s hardware to see if
there are still any alignment issues. It turned out that with “0 mod 32”
vs. “16 mod 32” byte alignment, the benchmark did show significant swings
(~50%-70%) on both IVB (Ivy Bridge) and HSW (Haswell).
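
One way to see why start alignment matters at all: the DSB is indexed by
32B chunks of instruction addresses, so the same hot region can touch a
different number of chunks depending on where it starts. A toy sketch
(hypothetical Python for illustration, not from the bug report):

```python
# Toy model: which 32-byte chunks does a code region of a given start
# address and length occupy? The DSB allocates ways per 32B chunk, so
# spilling into an extra chunk changes how the region is cached.

def chunks_spanned(start, length, chunk_bytes=32):
    """Return the indices of the 32-byte chunks covered by
    the half-open address range [start, start + length)."""
    first = start // chunk_bytes
    last = (start + length - 1) // chunk_bytes
    return list(range(first, last + 1))

# A 20-byte hot region fits in one chunk at "0 mod 32" alignment,
# but straddles two chunks at "16 mod 32" alignment.
aligned = chunks_spanned(0, 20)    # start is 0 mod 32
offset = chunks_spanned(16, 20)    # start is 16 mod 32
```

Here `aligned` covers a single chunk while `offset` covers two, which is
the kind of layout difference the two alignments above create.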

The reason for the swings wasn’t immediately obvious, but some deeper
analysis pointed me to the issue being within the DSB (the post-decode uop
cache). I wrote up a detailed presentation of what’s going on so that I
could share it with the rest of my team for educational purposes.

The quick summary is: the DSB caches post-decoded uops that are frequently
executed so that the front-end pipeline stages and their overhead can be
bypassed, allowing it to feed 32B worth of instructions per clock instead
of 16B. The DSB allows 3 ways (each of which can hold 6 uops) to be
allocated to each 32B chunk of instructions (by IP address). Unconditional
branches always end a way. If the code is aligned and laid out in such a
way as to require more than 3 ways per 32B chunk of instructions, as in
tightly packed code with lots of JMP instructions, then we can get into
situations where execution keeps flip-flopping between the DSB and the
legacy front end. This is inefficient and incurs additional penalties. You
can find additional details in the presentation, and also in the public
Intel Optimization Reference Manual.
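
The way-allocation rule described above can be sketched as a toy model
(hypothetical Python, not LLVM or Intel code; the capacities of 6 uops per
way and 3 ways per chunk are the ones stated in the summary):

```python
# Toy model of DSB way allocation for a single 32-byte chunk.
# Assumptions (from the summary above): each way holds up to 6 uops,
# a chunk gets at most 3 ways, and an unconditional JMP always
# terminates the way it lands in.

WAY_CAPACITY = 6    # uops per DSB way
WAYS_PER_CHUNK = 3  # ways available per 32B chunk

def ways_needed(instructions):
    """instructions: list of (uop_count, is_uncond_jmp) tuples, all
    within one 32B chunk. Returns how many DSB ways the chunk needs."""
    ways = 0
    used = 0  # uops in the currently open way
    for uops, is_jmp in instructions:
        if used == 0:
            ways += 1                       # open a new way
        elif used + uops > WAY_CAPACITY:
            ways += 1                       # current way is full
            used = 0
        used += uops
        if is_jmp:
            used = 0  # unconditional branch closes the current way
    return ways

def fits_in_dsb(instructions):
    """True if the chunk can be held entirely in the DSB."""
    return ways_needed(instructions) <= WAYS_PER_CHUNK

# Four tightly packed single-uop JMPs need four ways, exceeding the
# three available, so this chunk cannot stay resident in the DSB.
packed_jmps = [(1, True)] * 4
```

In this model, `packed_jmps` needs 4 ways, which is exactly the
more-than-3-jumps-per-chunk situation that causes the DSB/front-end
flip-flopping described above.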

It’s tricky to decide whether something can or should be done about this.
One option is to pad code whenever we detect multiple jmp instructions in
a potential 32B chunk of instructions (specifically, more than 3). This
may cause unnecessary code bloat with no payoff, but the situation could
also be rare enough that the padding is insignificant while still boosting
performance in the cases that do hit it. I plan on playing around with
this a little to see how many cases we can catch in SPEC, for example, and
measure bloat vs. perf to see if it’s a viable solution.
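
The detection half of that heuristic can be sketched as follows (a
hypothetical Python model for illustration, not actual LLVM pass code; the
instruction-record format and thresholds are assumptions):

```python
# Toy detector for the padding heuristic described above: walk a list of
# (address, size_in_bytes, is_uncond_jmp) records and report every
# 32-byte chunk that contains more than 3 unconditional jumps, i.e. a
# candidate chunk where padding could keep the code DSB-resident.

from collections import Counter

def chunks_needing_padding(insns, chunk_bytes=32, max_jmps=3):
    """Return the start addresses of chunks with more than max_jmps
    unconditional jumps."""
    jmps_per_chunk = Counter()
    for addr, size, is_jmp in insns:
        if is_jmp:
            jmps_per_chunk[addr // chunk_bytes] += 1
    return sorted(chunk * chunk_bytes
                  for chunk, count in jmps_per_chunk.items()
                  if count > max_jmps)

# Eight 2-byte JMPs packed into addresses 0..15: the chunk starting at
# address 0 holds 8 unconditional jumps and gets flagged.
hot = [(addr, 2, True) for addr in range(0, 16, 2)]
```

A real pass would then insert NOP padding (or relax branch encodings) to
push some of the flagged jumps into the next chunk; the trade-off against
code bloat is exactly the SPEC measurement mentioned above.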

The other option would be to do nothing, and make do with simply
understanding what the problem is so that it can be identified in the
future. Architectures change rapidly, and this could be something that goes
away soon.

In either case, I’ll probably pursue the first option above and report
back on what I find.

Regarding the other details reported in this issue, I realize that the
slow vs. fast cases both had 0 mod 32 byte alignment. It’s hard to analyze
what the issue was there without having the exact code and the exact (old)
architecture on which it was run. If I had to guess, I would say it was a
case of unfortunate aliasing in the branch prediction buffer, causing
differences in the prediction of one of the many branches, particularly
the indirect branch, which is known to have prediction issues on some
older architectures.

Feel free to contact me if you’d like additional info.

Zia Ansari.