Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane
From: Mike Pall <mikelu-1102 <at> mike.de>
Subject: Re: Suggestions on implementing an efficient instruction set simulator in LuaJIT2
Newsgroups: gmane.comp.lang.lua.general
Date: Tuesday 15th February 2011 16:02:08 UTC (over 6 years ago)
Josh Haberman wrote:
> Josh Haberman  gmail.com> writes:
> > Mike Pall  mike.de> writes:
> > > The main loop of an interpreter is a tough job for compilers (and
> > > CPUs). Most interpreters have been written in C, so C compilers
> > > have been tuned for this case over the years. They still generate
> > > mediocre code, compared to what you can achieve in assembler.
> > 
> > Could you possibly summarize the few biggest strategies that you
> > use to beat C compilers when it comes to interpreter loops?  I'm
> > really interested in this problem.  I can (and do) read the LuaJIT
> > code, but that doesn't tell me what was deficient about the original
> > compiler output.
> 
> Hmm, this probably seems overly broad.  Maybe a better question
> to ask would be: what do compilers still suck at, in the context
> of interpreter main loops?  Common wisdom is that compilers these
> days are so good that a human can rarely ever beat them.  Clearly
> that's wrong in the case of LuaJIT, but how so?  What bad decisions
> do compilers make in interpreter loops?

The control-flow graph of an interpreter with C switch-based
dispatch looks like this:

      .------.
      V      |
      |      |
      L      |  L = instruction load
      D      |  D = instruction dispatch
   / /|\ \   |
  / / | \ \  |
  C C C C C  |  C = operand decode
  X X X X X  |  X = instruction execution
  \ \ | / /  |
   \ \|/ /   |
      |      |
      V      |
      `------'

Each individual instruction execution looks like this:

  ......V......
  :X    |     :
  :     |\ \  :
  :     F S S :  F = fast path
  :     |/ /  :  S = slow paths
  :     |     :
  :.....V.....:

We're talking here about dozens of instructions and hundreds of
slow paths. The compiler has to deal with the whole mess and gets
into trouble:

* Diamond-shaped control-flow is known to be the worst-case
  scenario for most optimizations and for register alloction.
  Nested diamond-shaped control-flow is even worse.

* The compiler doesn't have enough hints to see what the fast
  paths and what the slow paths are. Even if you'd be able to tell
  it, it still sees this as a single giant control-flow graph.

  Anything in this loop could potentially influence anything else,
  so almost nothing can be hoisted or eliminated. The slow paths
  kill the opportunities for the fast paths and the complex
  instructions kill the opportunities for the simpler instructions.
  
* The standard register allocation heuristics fail at this scale,
  so the compiler has trouble keeping important variables in
  registers.

We can use a direct or indirect-threaded interpreter even in C,
e.g. with the computed 'goto &' feature of GCC:

  * * * * *
  | | | | |
  C C C C C    C = operand decode
  X X X X X    X = instruction execution
  L L L L L    L = next instruction load
  D D D D D    D = next instruction dispatch
  | | | | |
  V V V V V

This effectively replicates the load and the dispatch, which helps
the CPU branch predictors. But it has its own share of problems:

* There's no loop the compiler could recognize. And all of those
  gotos can jump to pretty much anywhere in the code. Therefore
  the compiler is unable to hoist anything, because there _will_
  be a slow path where an aliasing store kills all opportunities.

* The register allocator can only treat each of these segments
  separately and will do a real bad job. There's just no way to
  give it a goal function like "I want the same register
  assignment before each goto".

* Tail-merging and CSE will happily join all these common tails of
  each instruction and generate a single dispatch point. Ick. You
  can try to disable some optimizations for this piece of code,
  but this will negatively impact all paths.
  
* There's a point where you start to fight the compiler and this
  is a game you cannot win.

If you write an interpreter loop in assembler, you can do much
better:

* Keep a fixed register assignment for all instructions.

* Keep everything in registers for the fast paths. Spill/reload
  only in the slow paths.

* Move the slow paths elsewhere, to help with I-Cache density.

* Pre-load instructions and pre-decode operands.

Here's how this would look like:

  *  *  *  *  *
  |  |  |  |  |
  C  C  C  C  C    C = partial operand decode for this instruction
  F> F> F> F> F>   F = fast path, > = exit to slow path
  L  L  L  L  L    L = next instruction load
  C  C  C  C  C    C = partial operand decode for the next instruction
  D  D  D  D  D    D = next instruction dispatch
  |  |  |  |  |
  V  V  V  V  V

You can get this down to just a few machine code instructions.
I've posted an example for the x86 LuaJIT interpreter here:

  http://www.reddit.com/r/programming/comments/badl2/luajit_2_beta_3_is_out_support_both_x32_x64/c0lrus0

On PPC/e500 I had to use a couple more tricks: e.g. merging the
operand decode and the index scaling. That crazy 'rlwinm'
instruction comes real handy here. Or hand-scheduling the
instruction load above the stores in the fast path. Or using
vector (!) instructions for type checks.

There's just no way you can reasonably expect even the most
advanced C compilers to do this on your behalf.

There are some more design goals for an interpreter, like having
only a single fast path in every bytecode instruction etc. ...
I won't go into that here, but remember: every percent counts!

Final words: A modern compiler contains a huge set of heuristics
that interact with each other in strange ways. They have been
tuned for 'average' code. But an interpreter is a very different
beast, so you'll inevitably get disappointing results.

--Mike
 
CD: 3ms