From: Andrew Trick <atrick <at> apple.com>
Subject: Re: "Anti" scheduling with OoO cores?
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Wednesday 5th November 2014 06:08:13 UTC
> On Nov 4, 2014, at 5:54 AM, James Molloy wrote:
> 
> Hi Andy,
> 
> Thanks for the reply!
> 
> > This doesn't seem to work (a poor schedule is produced) so I changed it
to also require another resource that I modelled as unbuffered
(BufferSize=0), in the hope that this would "block" other FDIVs... no joy.
> That should create a hazard that blocks scheduling of the FDIVs. So that
was the right thing to do, assuming that’s what you want - register
pressure could suffer in some cases.
> 
> This didn't work. From looking at the misched output, it seemed to see
the unbuffered resource use, assume nothing could be done for 18 cycles,
and then carry on again 18 cycles later, resulting in the FDIVs being
clustered in the final schedule.

I’m curious why no MUL was scheduled in the window. If you attach your
patch and debug-only=misched output to a PR, I could probably tell you.
Keep in mind it’s scheduling bottom-up initially, but may alternate.

>  The machine model is much more precise than the scheduler’s internal
model. It would be possible to approximately simulate the behavior of the
reorder buffer, but since most OoO machines have such large buffers now,
it’s not worth adding the cost and complexity to the generic scheduler.
At least I wasn’t able to find real examples where it mattered.
> 
> I think this is the real crux of the matter. Cortex-A57 doesn't have a
unified reorder buffer at all. It has a separate 8-entry reorder buffer per
pipeline. So scheduling really matters, and putting 8 dependent operations
in a row can completely kill the out-of-order execution. Every dependent
operation we put in eats up a queue slot, so scheduling really can make a
difference. If we changed the machine model to have a MicroOpBufferSize of
"small", and modelled a buffersize of 8 on each of the pipeline resources -
how much of that information would the generic scheduler use?

On your test case (not a loop) all values for MicroOpBufferSize > 1 will be
treated the same. The same is true for resource BufferSize.

They can be “reserved” (0), “unbuffered” (1), or “buffered” (>1).
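
To make that concrete, a rough sketch of what those settings look like in
the .td (the names here are made up for illustration, not taken from the
A57 model):

  def ExampleModel : SchedMachineModel {
    // 0 = in-order; 1 = schedule to hide latency without treating stalls
    // as unavoidable; for straight-line code every value > 1 is treated alike.
    let MicroOpBufferSize = 1;
  }

  def ExReserved   : ProcResource<1> { let BufferSize = 0; } // "reserved": hard in-order hazard
  def ExUnbuffered : ProcResource<1> { let BufferSize = 1; } // "unbuffered": latency priority
  def ExBuffered   : ProcResource<1> { let BufferSize = 8; } // "buffered": any value > 1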

In case you’re wondering why the machine model supports so much unused
precision, remember that it’s pretty easy to plug in your own scheduling
strategy. When I developed the machine model, I had some out-of-tree logic
to model the resource buffer limits. It just wasn’t worth maintaining in
the generic scheduler. The generic scheduler exercises the basic features
of the machine model and scheduling framework but should stay reasonably
lightweight.

The kind of issues you’re seeing mainly surface in very large single
block FP loops with parallel long latency chains. I found that the limiting
factor on those cases was that we were not considering the interloop
dependencies. I added a small heuristic for that but stopped short of
cyclic scheduling (which wouldn’t actually be that difficult either).

> (Also, what is "small"? It's out of order so "2"? But it's not massively
out of order so maybe model it as in-order ("0")? We do still want to
consider register pressure though... ("1")?)

Yeah. MicroOpBufferSize=1 is meant for scheduling to hide latency, but it
doesn’t treat stalls as unavoidable.

You could also try BufferSize=1 on individual resources. In fact, given the
way it’s currently implemented, you may want FDIV to use one reserved
resource (BufferSize=0) for 18 resource cycles and another resource, UnitX,
at BufferSize=1.
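
Something along these lines, though I haven’t tried it and the names are
only illustrative (use whatever your patch already defines):

  def A57FDivReserve : ProcResource<1> { let BufferSize = 0; } // reserved: creates the hazard
  def A57UnitXUnbuf  : ProcResource<1> { let BufferSize = 1; } // unbuffered: biases toward latency

  def A57WriteFDiv : SchedWriteRes<[A57FDivReserve, A57UnitXUnbuf]> {
    let Latency = 18;
    // ResourceCycles lines up positionally with the resource list, so the
    // 18 has to go with the reserved resource for the stall to show up.
    let ResourceCycles = [18, 1];
  }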

I won’t guarantee that any of this will do what you want. You’re at
the level of tuning where you just need to look at the heuristics that are
coming into play, and I haven’t done any tuning for this kind of target.

If someone is into this sort of thing, they could write an A57-specific
scheduling strategy and change heuristics like crazy without worrying about
other targets.

> Hmm. The implementation of inorder scheduling with the new machine model
is pretty lame.
> 
> OK, this needs to be added. That's fair enough.

Yeah, reserving individual processor resources within a group is blatantly
unimplemented. It’s not really hard though.
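
For reference, the shape of model I mean is roughly this (names and
latency are made up), where a write should be able to grab either unit of
a group and the in-order logic would have to track which unit is busy:

  def A57FDivUnit0 : ProcResource<1> { let BufferSize = 0; }
  def A57FDivUnit1 : ProcResource<1> { let BufferSize = 0; }
  def A57FDivAny   : ProcResGroup<[A57FDivUnit0, A57FDivUnit1]>;

  // An S-register divide could take either unit; a D-register divide would
  // need to reserve both units. Counting busy cycles per unit within the
  // group is the part that isn't implemented yet.
  def A57WriteFDivS : SchedWriteRes<[A57FDivAny]> {
    let Latency = 10;          // made-up latency, just for the sketch
    let ResourceCycles = [10];
  }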

-Andy

> 
> Cheers,
> 
> James
> 
> On 4 November 2014 08:34, Andrew Trick wrote:
> 
> > On Nov 2, 2014, at 4:46 AM, James Molloy wrote:
> >
> > Hi Andy, Dave,
> >
> > I've been doing a bit of experimentation trying to understand the
schedmodel a bit better and improving modelling of FDIV (on Cortex-A57).
> >
> > FDIV is not pipelined, and blocks other FDIV operations (FDIVDrr and
FDIVSrr). This seems to be already semi-modelled, with a
"ResourceCycles=[18]" line in the SchedWriteRes for this instruction.
> 
> Pretty typical - we should be able to handle this.
> 
> > This doesn't seem to work (a poor schedule is produced) so I changed it
to also require another resource that I modelled as unbuffered
(BufferSize=0), in the hope that this would "block" other FDIVs... no joy.
> 
> That should create a hazard that blocks scheduling of the FDIVs. So that
was the right thing to do, assuming that’s what you want - register
pressure could suffer in some cases.
> 
> ResourceCycles is an ordered list. It’s only going to stall if the
unbuffered resource is the one taking 18 cycles. You didn’t attach your
patch though, so I can’t be sure what you actually did...
> 
> > Then I noticed that the MicroOpBufferSize is set to 128, which is
wildly high, as Cortex-A57 has separate smaller reorder buffers, not one
large reorder buffer.
> > Even reducing it down to "2" had no effect; the divs were still scheduled
in a clump together. But "1" and "0" (denoting in-order) produced a nice
schedule.
> 
> There’s a huge difference between 0, 1, and > 1. Beyond that, the
generic scheduler only cares in some cases of very tight loops. Your
example is straight line code so it won’t matter. You could model buffers
on the individual resources instead to be more precise, but I don’t think
it will matter much unless you start customizing heuristics by plugging in
a new scheduling strategy.
> 
> > I'd expect an OoO machine with a buffer of 2 ops to produce a very
similar schedule to an in-order machine. So where am I going wrong?
> 
> See above. The machine model is much more precise than the scheduler’s
internal model. It would be possible to approximately simulate the behavior
of the reorder buffer, but since most OoO machines have such large buffers
now, it’s not worth adding the cost and complexity to the generic
scheduler. At least I wasn’t able to find real examples where it
mattered.
> 
> > Sample attached - I'd expect the FDIVs to be equally spread across the
MULs.
> 
> The stalls should be modeled as long as the FDIV uses an unbuffered
resource for 18 cycles and the MUL does not use the same resource at all.
But the way in-order hazards work in the scheduler, you may end up with
three MULs strangely smashed between two FDIVs.
> 
> To get a more even distribution, you can try BufferSize=1. That basically
prioritizes for latency, but is very sensitive to a bunch of heuristics.
> 
> > (The extension to this I want to model is that we can have 2 S-register
FDIVs in parallel but only one D-reg FDIV, and never both, but that can
wait until I've understood what's going on here!).
> 
> Hmm. The implementation of inorder scheduling with the new machine model
is pretty lame. It was a quick fix to get something working. It needs to be
extended so that it separately counts cycles for multiple units of the same
resource. It would be straightforward enough to do that. I can’t really
volunteer at the moment though.
> 
> -Andy
> 
> >
> > Cheers,
> >
> > James
> 
>
 