"Mark Knecht" posted
[email protected], excerpted
below, on Thu, 14 Sep 2006 07:15:42 -0700:
> I'm just curious whether anyone besides me is noticing their machine
> feeling somewhat sluggish since doing the gcc-4.1 upgrade? Mine seems to
> be using a lot of memory. Alt-tabbing between windows seems slow.
> Ethernet traffic in my browser is causing pretty noticeable
> interruptions in things like MythTV.
> The machine is still quite usable, but it doesn't feel as snappy as it
> did last week.
> I made no changes in /etc/make.conf for the upgrade. Everything is
> pretty basic as far as I can tell:
> CFLAGS="-march=k8 -O2 -pipe"
I've noticed rather the opposite, here. gcc-4.1.1 compiled binaries are
/dramatically/ faster and more efficient than 3.x. However, I'm using a
rather more elaborate CFLAGS/CXXFLAGS, and it's my conviction that gcc-4.1
does better at optimizing exactly the way you've told it to. That is, if
you've given it inefficient optimizations, I'm convinced it makes a bad
thing worse, while if you've chosen your optimizations well, it makes a
good thing dramatically better.
Here's my CFLAGS/CXXFLAGS:
CFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
-freorder-blocks-and-partition -combine -funit-at-a-time -ftree-pre
-fgcse-sm -fgcse-las -fgcse-after-reload -fmerge-all-constants"
CXXFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
-funit-at-a-time -ftree-pre -fgcse-sm -fgcse-las -fgcse-after-reload
The general strategy here is to take advantage of size optimization -- on
modern CPUs, L1 and L2 cache are FAR FAR faster than main memory, and
raw CPU cycles run circles around even cache speeds. Thus, optimizing
for CPU speed at the expense of size makes little sense, because all those
saved cycles and more are likely to be spent waiting for memory to return
code that /would/ have fit in the cache were it size optimized.
Thus, for example, traditional optimizations unroll loops into flat code
where possible, to avoid the expense of the jump back to the top of the
loop. That spreads the loop out to several times its original code size,
taking far more room in fast cache and forcing the CPU to wait far more
often for code to be fetched from main memory. I prefer to keep
the loops, making the code smaller and thus allowing more of it to fit in
faster cache. I believe that for most code, this technique will result in
faster execution in the real world, despite the theoretical loss of a CPU
cycle here or there due to jumping back to the top of the loop.
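To make the size trade-off concrete, here's a rough sketch in C of the same loop written both ways. The function names (sum_rolled, sum_unrolled4) are made up for illustration, and the unrolling is done by hand here, where a compiler would do it for you at the more aggressive speed settings. No timings are claimed; the point is only how much more code the unrolled form takes.

```c
#include <stddef.h>

/* Rolled loop: one add per iteration plus the loop branch; compact code. */
long sum_rolled(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Hand-unrolled by four: fewer branches taken, but roughly four times the
 * loop body, plus a tail loop for the leftover elements. */
long sum_unrolled4(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; i++)   /* remaining 0 to 3 elements */
        total += v[i];
    return total;
}
```

Both return the same result; compiling each and comparing the object sizes (with the size utility, say) should show the difference in footprint.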
The -freorder-blocks-and-partition flag, OTOH, can make code slightly larger,
but the effect is the same as the above, increasing execution speed. What
this optimization does is separate code that is used often from that which
is seldom used, so the "hot" code is smaller and fits better in high speed
cache, while the "cold" code ends up in slower main memory most of the
time. While a lower percentage of the code may be in cache due to the
larger size, cache will be used far more effectively, as more "hot" code
will be retained therein, with the cold code that's not used so often
allowed to drop out of cache into main memory. This particular
optimization doesn't work well with C++, however, so it's in my CFLAGS but
not my CXXFLAGS.
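As a hand-written analogy only (the flag itself works on basic blocks inside the compiler, with no source changes), the hot/cold idea looks something like this in C. The function names are hypothetical, and the cold attribute and __builtin_expect are GCC extensions; as far as I know the cold attribute only appeared in GCC releases newer than 4.1, so treat this as a sketch of the concept rather than something to build with the compiler discussed here.

```c
/* Rarely executed error path. The cold and noinline attributes (GCC
 * extensions) ask the compiler to keep it out of line, away from hot code. */
__attribute__((cold, noinline))
static int report_bad_digit(char c)
{
    (void)c;   /* a real program might log the offending character here */
    return -1;
}

/* Hot path: parse a decimal string; returns the value, or -1 on bad input.
 * No overflow checking, to keep the sketch short. */
int parse_decimal(const char *s)
{
    int value = 0;
    for (; *s != '\0'; s++) {
        /* __builtin_expect marks the failure branch as unlikely. */
        if (__builtin_expect(*s < '0' || *s > '9', 0))
            return report_bad_digit(*s);
        value = value * 10 + (*s - '0');
    }
    return value;
}
```

The digit loop is what runs nearly all the time, so that's the part worth keeping in cache; the error handler can sit elsewhere and only get pulled in when something actually goes wrong.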
Likewise with -combine, which allows the compiler to optimize across
multiple source files at a time. It's only implemented for C at this time
(according to the gcc manpage), so it's in my CFLAGS but omitted from my
CXXFLAGS.
The other strategy here is to make as full a use of the extra registers
available to amd64 in 64-bit mode (as opposed to 32-bit x86 mode) as
possible. Registers operate at the speed of the CPU, no wait at all, as
there is for even L1 cache, so it pays to use them as efficiently as
possible. Several of the flags (-frename-registers of course, -fweb, etc.)
in my CFLAGS are therefore designed to encourage gcc to do this.
All the flags I've not mentioned specifically are designed to further the
three common goals mentioned above, making as efficient a use as possible
of the speed of (1) registers and (2) cache memory, by allowing gcc to
optimize over as wide a scope (3, whole units with unit-at-a-time, or
even multiple units with -combine) as possible. Of course, see the gcc
manpage for additional details.
As I said, with the above, there's a /dramatic/ improvement in
performance between gcc-3.x and gcc-4.1.x.
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman