Subject: Fast parallel binary-trees for the shootout: Control.Parallel.Strategies FTW!
Date: Monday 1st September 2008 22:18:13 UTC (over 9 years ago)
The Computer Language Benchmarks Game recently got a quad core 64 bit machine. Initial results just porting the single-threaded Haskell benchmarks to 64 bit look rather pleasing, http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=all Note that C++, D, Clean et al aren't ported to the 64 bit machine yet (and perhaps some implementations won't survive the 64 bit switch). Now, none of the Haskell programs are annotated for parallelism yet, despite that quad core sitting there ready to go. I've a conjecture that fast, smp capable, concise code, of the kind GHC produces, could dominate the shootout for a while to come, if we can effectively utilise those cores. This is a chance to make an impression! So, the first parallel benchmark has been submitted by Tom Davies and me, a parallel implementation of the binary-trees benchmark, using the Control.Parallel.Strategies lib to set up some top level parallel tasks. Initial results look promising, with the old entry: * Single core, no heap hints: 26.9s $ ghc -O2 -fasm Original.hs --make $ time ./Original 20 ./Original 20 26.79s user 0.16s system 100% cpu 26.933 total Looking at the GC stats (-sstderr) we see its spending too much time doing GC, 57.8%, so we can improve that by setting a default heap size: * Single core, with heap hints: 13.7s $ time ./Original 20 -A350M ... ./Original 20 +RTS -A350M 13.54s user 0.27s system 100% cpu 13.791 total Halves runtime in comparison to the current entry, and reduces GC to 3.8%. * Now, the parallel version, changing one line to use a parMap strategy, and using the N+1 capabilities rule, $ ghc -O2 -fasm Parallel2.hs --make -threaded $ time ./Parallel2 20 +RTS -N5 -A350M ... ./Parallel2 20 +RTS -N5 -A350M 15.38s user 1.57s system 305% cpu 5.684 total So 305% productive (not too bad for this memory-heavy program), and a good speedup. Observations: * check GC stats with -sstderr, then use -AxxxM to set a default heap size * parallelise the top level control using speculative strategies * Use -N+1 capabilities. -- Don