Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane

From: Adam Nemet <anemet <at> apple.com>
Subject: RFC: Loop distribution/Partial vectorization
Newsgroups: gmane.comp.compilers.llvm.devel
Date: Monday 12th January 2015 18:42:36 UTC (over 3 years ago)
Hi,

We'd like to propose new Loop Distribution pass.  The main motivation is to
allow partial vectorization of loops.  One such example is the main loop of
456.hmmer in SpecINT_2006.  The current version of the patch improves hmmer
by
24% on ARM64 and 18% on X86.

The goal of the pass is to distribute a loop that can't be vectorized
because of
memory dependence cycles.  The pass splits the part with cycles into a new
loop
making the remainder of the loop a candidate for vectorization.  E.g.:

        for (k = 0; k < M; k++) {
         S1:   MC[k+1] = …
         // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
         S2:   DC[k+1] = DC[k], MC[k]                                      
        }
  
      => (Loop Distribute)
  
        for (k = 0; k < M; k++) {
         S1:   MC[k+1] = ...
        }
        for (k = 0; k < M; k++) {
         S2:   DC[k+1] = DC[k], MC[k]
        }
  
      => (Loop Vectorize S1)
  
        for (k = 0; k < M; k += 4) {
         S1:   MC[k+1:k+5] = ...
        }
        for (k = 0; k < M; k++) {
         S2:   DC[k+1] = DC[k], MC[k]
        }

I'd like to collect feedback on the design decisions made so far.  These
are
implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930 <http://reviews.llvm.org/D6930>).
Here is the list of design choices:

- Loop Distribution is implemented as a separate pass to be run before the
Loop
  Vectorizer.

- The pass reuses the Memory Dependence Checker framework from the Loop
  Vectorizer.  This along with the AccessAnalysis class is split out into a
new
  LoopAccessAnalysis class.  We may want to turn this into an analysis pass
on its own.

- It also reuses the Run-time Memory Check code from the Loop Vectorizer. 
The
  hmmer loop requires memchecks.  This is again captured by the same
  LoopAccessAnalysis class.

- The actual loop distribution is implemented as follows:

  - The list of unsafe memory dependencies is computed for the loop. 
Unsafe
    means that the dependence may be part of a cycle (this is what the
current
    framework provides).
  - Partitions are created for each set of unsafe dependences.
  - Partitions are created for each of the remaining stores not yet
encountered.
    The order of the partitions preserve the original order of the
dependent
    memory accesses.
  - Simple partition merging is performed to minimize the number of new
loops.
  - Partitions are populated with the other dependent instructions by
following
    the SSA use-def chains and control dependence edges.
  - Finally, the actual distribution is performed by creating a loop for
each
    partition.  For each partition we clone the loop and remove all the
    instructions that don't belong to the partition.
  - Also, if run-time memory checks are necessary, these are emitted.  We
keep
    an original version of the loop around to branch too if the checks
fail.

My plan is to proceed with the following steps:

- Bring the current functionality to trunk by splitting off smaller patches
from
  the current patch and completing them.  The final commit will enable loop
  distribution with a command-line flag or a loop hint.

- Explore and fine-tune the proper cost model for loop distribution to
allow
  partial vectorization.  This is essentially whether to partition and what
  these partitions should be.  Currently instructions are mapped to
partitions
  using a simple heuristics to create a vectorizable partitions.  We may
need to
  interact with the vectorizer to make sure the vectorization will actually
  happen and it will be overall profitable.

- Explore other potentials for loop distribution, e.g.:
  - Partial vectorization of loops that can't be if-converted
  - Classic loop distribution to improve spatial locality
  - Compute the Program Dependence Graph rather than the list of unsafe
memory
    accesses and allow reordering of memory operations
  - Distribute a loop in order to recognize parts as loop idioms

    Long term, loop distribution could also become a transformation utility
    (Transform/Util).  That way, the loop transformation passes could use
it to
    strip the loop from parts that inhibits the given optimization.

Please let me know if you have feedback either on the design or on the next
steps.

Thanks,
Adam
 
CD: 12ms