Subject: Loop unrolling opportunity in SPEC's libquantum with profile info
Date: Thursday 16th January 2014 00:13:27 UTC
I am starting to use the sample profiler to analyze new performance opportunities. The loop unroller has popped up in several of the benchmarks I'm running, libquantum in particular. Allowing the runtime unroller to trigger there speeds up the program by about 12%. It helps functions like quantum_sigma_x (http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00149), which accounts for ~20% of total runtime.

I have been poking at the unroller a little bit. Currently, the runtime unroller is only triggered by a special flag or if the target requests it in its unrolling preferences. We could also consult the block frequency information here: if the loop header has a higher relative frequency than the rest of the function, we would enable runtime unrolling.

Chandler also pointed me at the vectorizer, which has its own unroller. However, the vectorizer only unrolls enough to serve the target; it is not as general as the runtime-triggered unroller. From what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on AVX targets). Additionally, the vectorizer only unrolls to aid reduction variables. When I forced the vectorizer to unroll these loops, the performance effects were nil.

I'm currently looking at changing LoopUnroll::runOnLoop() to consult block frequency information for the loop header to decide whether to try runtime triggers for loops that don't have a constant trip count but could be partially peeled.

Does that sound reasonable?

Thanks. Diego.
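
For illustration only, here is a minimal, self-contained sketch of the kind of check proposed above. It is not LLVM code: LoopProfile, kHotHeaderRatio, isHotHeader and shouldTryRuntimeUnroll are hypothetical names, and the ratio of 8 is an arbitrary assumption standing in for whatever "higher relative frequency" would mean in practice. The real change would query the block frequency analysis from inside LoopUnroll::runOnLoop().

```cpp
#include <cstdint>

// Hypothetical stand-in for the facts the unroller would look up.
// None of these names are LLVM types or APIs.
struct LoopProfile {
    std::uint64_t headerFreq;     // profile-derived frequency of the loop header
    std::uint64_t entryFreq;      // frequency of the function entry block
    bool hasConstantTripCount;    // true if the trip count is known at compile time
};

// Assumed threshold: how many times hotter than the function entry the
// loop header must be before runtime unrolling is attempted.
constexpr std::uint64_t kHotHeaderRatio = 8;

// A header counts as hot when it runs at least kHotHeaderRatio times per
// function entry. A zero entry frequency means there is no usable profile.
constexpr bool isHotHeader(const LoopProfile &p) {
    return p.entryFreq != 0 && p.headerFreq / p.entryFreq >= kHotHeaderRatio;
}

// Runtime unrolling is only of interest for loops without a constant trip
// count. For those, keep the existing triggers (flag or target preference)
// and additionally accept loops whose header is hot in the profile.
constexpr bool shouldTryRuntimeUnroll(const LoopProfile &p,
                                      bool requestedByFlagOrTarget) {
    return !p.hasConstantTripCount &&
           (requestedByFlagOrTarget || isHotHeader(p));
}
```

With these assumed numbers, a loop whose header frequency is 800 against an entry frequency of 100 would qualify, while one at 200 against 100 would not unless the flag or the target asked for it.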