Subject: Where in the pass pipeline run the loop optimizer?
Date: Wednesday 3rd December 2014 08:32:49 UTC (over 3 years ago)
On 30.11.2014 15:05, Hongbin Zheng wrote: > What should us consider when we try to determinate the point to run Polly? I think we could benefit from some community feedback here. Some context: Since a couple of weeks we focus on tuning Polly to minimize compile time overhead. One important case for us is to minimize the compile time impact of Polly in case no loop optimizations are performed. One item that we would like to reconsider here is the position in the LLVM pass order where we execute Polly. Currently, when '-O3 -polly' is run, we first schedule nine canonicalization passes (taken from the -O3 pass pipeline) to prepare the code for Polly, then run Polly itself and after this we run the full unmodified -O3 sequence. The reason for this was that we wanted full control on how we canonicalize the code for Polly and we also wanted to run the _entire_ LLVM optimization chain to be sure the code is well optimized after. The reason for this choice was mostly to be able to experiment with loop optimizations without being constrained by compile time or pass order concerns. This pass order is obviously not optimal due to the inherent compile time increase it implies and also because we do not currently run the inliner in our canonicalizer. Hence, we would like to place the Polly optimizer inside the standard -O3 optimization sequence. The question is now what would be the optimal location. Here some considerations: 1) After the IR has been canonicalized This should remove the need for most of our -polly-canonicalize passes and eliminate the corresponding compile time cost. 2) After inlining has been performed. Inlining is essential to optimize C++ heavy code such as boost::ublas. 3) Before the Loop/SLP vectorizer We want to ensure that Polly is run at a position, where the vectorizer can continue to optimize the code generated by Polly (and to benefit from the parallelism, alias information and loop optimizations we performed). 4) Minimize interactions with the inliner At best Polly is run after all inlining has happened, such that Polly always sees the largest possible loop nest. If we optimize too early we may optimize code that later is inlined in a larger loop nest where the earlier optimization hinders the optimization of the larger loop nest. Some open questions: 5) Run it before / after LICM? LICM may make more loop bounds analyzable by making the parameters in the loop bounds loop invariant. On the other side, running it may introduce unnecessary scalar data dependences, which need to be eliminated again (Polly does not do this yet). 6) Run it before loop rotate / loop unswitch? Both of these passes introduce copies of code which Polly does not really benefit from. (Polly may e.g prove that a loop is executed at least once, such that the copy introduced by loop-rotate is not necessary any more). Hence running them before Polly may unnecessarily complicating the loop structure. On the other side, those passes may be helpful for LICM. I currently have been considering two locations: Option 1: Before the loop optimization passes ============================================= MPM.add(createReassociatePass()); // Reassociate expressions << Add Polly before the loop optimization passes >> // Or better call it EP_LoopCanonicalizerStart? + addExtensionsToPM(EP_LoopOptimizerStart, MPM); MPM.add(createLoopRotatePass()); // Rotate Loop MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3)); MPM.add(createInstructionCombiningPass()); MPM.add(createIndVarSimplifyPass()); // Canonicalize indvars MPM.add(createLoopIdiomPass()); // Recognize idioms like memset. MPM.add(createLoopDeletionPass()); // Delete dead loops Option 2: Right before the loop vectorizer ===================================== MPM.add(createBarrierNoopPass()); << Add Polly at the beginning of the target specific optimizations >> + addExtensionsToPM(EP_TargetOptsStart, MPM); // Re-rotate loops in all our loop nests. These may have fallout out // of rotated form due to GVN or other transformations, and the // vectorizer relies on the rotated form. if (ExtraVectorizerPasses) MPM.add(createLoopRotatePass()); MPM.add(createLoopVectorizePass(DisableUnrollLoops, LoopVectorize)); Originally I was thinking of 'Option 1'. It is after the scalar optimization passes, but before LICM and LoopRotate. However, this location seems still to be still somehow a canonicalization phase. 'Option 2' places Polly after all canonicalization (is the inliner somehow finished at this point?) and right at the beginning of the target specific optimizations. It seems somehow preferable to me at the moment. Does anybody have opinions on this? Cheers, Tobias