From: Dan Doel <dan.doel <at> gmail.com>
Subject: Low-level array performance
Newsgroups: gmane.comp.lang.haskell.glasgow.user
Date: Monday 16th June 2008 19:51:31 UTC
Greetings,

Recently, due to scattered complaints I'd seen on the internet, I set about
rewriting the fannkuch [1] benchmark on the Great Computer Language Shootout.
The current entry uses Ptr/Addr#, malloc, etc., so it's not particularly
representative of code one would actually write in Haskell these days. Over
the past few days I've written several versions of the benchmark, and
collaborated a bit with local speed guru Don Stewart, but an entry that bests
the current one does not seem to be in the cards at the moment. So I thought
I'd write about some of the issues, and hopefully get some encouraging news
about the situation.

Issue 1: STUArrays aren't optimized as fully as one might ideally expect.

This isn't so much of an issue, I suppose, except possibly for an environment
like the shootout. I wrote versions of the benchmark (and of particular
pieces of the overall benchmark as well) for both STUArrays and the MUArrays
from dons' new uvector [2] library. The STUArray versions are consistently
slower, and a look at the core reveals that the STUArray code contains
significantly more boxing than the uvector code. This isn't a big deal, as
there's no reason not to use uvector, aside from it not being blessed by
inclusion in the GHC distribution. However, I thought I'd bring it up in case
anyone hasn't heard of the library yet.
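
To give a flavour of the code involved, here's a rough sketch of the STUArray
reversal loop (an illustration, not the exact file I benchmarked), using
unsafeRead/unsafeWrite from Data.Array.Base so that bounds checks aren't part
of what's being measured:

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, runSTUArray, newListArray)
import Data.Array.Unboxed (elems)
import Data.Array.Base (unsafeRead, unsafeWrite)

-- Reverse the first n elements of the array in place, skipping bounds checks.
reverseSTU :: STUArray s Int Int -> Int -> ST s ()
reverseSTU arr n = go 0 (n - 1)
  where
    go !i !j
      | i >= j    = return ()
      | otherwise = do
          a <- unsafeRead arr i
          b <- unsafeRead arr j
          unsafeWrite arr i b
          unsafeWrite arr j a
          go (i + 1) (j - 1)

main :: IO ()
main = print . elems $ runSTUArray (do
         arr <- newListArray (0, 9) [0 .. 9]
         reverseSTU arr 10
         return arr)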

Issue 2: Reading from/writing to a MutableByteArray# is slower than an Addr#

This is, I think, the crux of the issue. The main content of the benchmark is
reversing/shifting items in an array. To get a somewhat easier look at the
core, I boiled things down to a benchmark that just reverses a small array
many times. In the interest of further reducing things, I wrote a version of
that benchmark that uses raw Addr#s and a version that uses raw
MutableByteArray#s. I've attached both versions.
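
Roughly, the byte-array version has the following shape (this is a sketch
using current GHC.Exts primop spellings and an assumed 8-byte Int, not the
attached file verbatim):

{-# LANGUAGE BangPatterns, MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts
import GHC.IO (IO(..))

-- A boxed wrapper so the raw byte array can be handed around in IO.
data MBA = MBA (MutableByteArray# RealWorld)

-- Reverse the Ints stored at word offsets 0..n-1, in place.
reverseMBA :: MBA -> Int -> IO ()
reverseMBA (MBA mba) (I# n) = IO (go 0# (n -# 1#))
  where
    go i j s0
      | isTrue# (i >=# j) = (# s0, () #)
      | otherwise =
          case readIntArray# mba i s0 of
            (# s1, a #) -> case readIntArray# mba j s1 of
              (# s2, b #) -> case writeIntArray# mba i b s2 of
                s3 -> case writeIntArray# mba j a s3 of
                  s4 -> go (i +# 1#) (j -# 1#) s4

-- Allocate room for n Ints and fill slots 0..n-1 with 0..n-1.
newFilled :: Int -> IO MBA
newFilled (I# n) = IO $ \s0 ->
  case newByteArray# (n *# 8#) s0 of        -- assumes 8 bytes per Int
    (# s1, mba #) ->
      let fill i s
            | isTrue# (i >=# n) = s
            | otherwise         = fill (i +# 1#) (writeIntArray# mba i i s)
      in (# fill 0# s1, MBA mba #)

readSlot :: MBA -> Int -> IO Int
readSlot (MBA mba) (I# i) = IO $ \s ->
  case readIntArray# mba i s of (# s', v #) -> (# s', I# v #)

main :: IO ()
main = do
  let n    = 10                 :: Int
      reps = 250 * 1000 * 1000  :: Int   -- 250 million, as in the timings below
  arr <- newFilled n
  let loop !k | k == reps = return ()
              | otherwise = reverseMBA arr n >> loop (k + 1)
  loop 0
  print =<< readSlot arr 0      -- force a result so the loop isn't dead code

The Addr# version is the same loop with the byte array swapped out for a
malloc'd buffer and the corresponding Addr# primops.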

The fannkuch benchmark (input 11) with byte arrays runs in around 12 seconds
on my machine. To get the reversal benchmark to around that time, I can tell
it to, for instance, reverse a 10-element array 250 million times. Those same
inputs to the Addr# version of the reversal benchmark result in a runtime of
7 to 8 seconds, so there's a significant discrepancy at the level the
benchmark actually runs at. (The current fannkuch entry that uses Addr#/Ptr
takes around 9 seconds, so there's other stuff going on in the benchmark
evening them out, but this is likely the source of the difference between the
two.)
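
(Back of the envelope, ignoring loop overhead: reversing a 10-element array
is 5 swaps, each swap is 2 reads and 2 writes, so 250 million reversals is
about 5 billion array accesses. That works out to roughly 2.4 ns per access
for the byte-array version at 12 seconds, versus roughly 1.5 ns per access
for the Addr# version at 7.5 seconds.)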

uvector performance on the reversal benchmark is comparable with
MutableByteArray#, so the practical slowdown from using a library that's
actually fit for public consumption seems to come down to the fact that it's
based on byte arrays (STUArray again falls further behind by retaining too
many boxes).

So, to wrap things up: dons and I were rather curious about the performance
difference between MutableByteArray# and Addr#, as one might expect them to
be comparable, both being low-level, raw-pointer-style structures; the code
for using the two is nearly identical.
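
To illustrate what I mean by nearly identical, here's the inner swap written
against the Addr# primops; substituting readIntArray#/writeIntArray# (and a
MutableByteArray# argument) into the same case chain gives back the byte-array
code from the sketch above:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
module SwapAddr where

import GHC.Exts

-- Swap the Ints at word offsets i and j behind a raw Addr#.
swapAddr :: Addr# -> Int# -> Int# -> State# RealWorld -> State# RealWorld
swapAddr p i j s0 =
  case readIntOffAddr# p i s0 of
    (# s1, a #) -> case readIntOffAddr# p j s1 of
      (# s2, b #) -> case writeIntOffAddr# p i b s2 of
        s3 -> writeIntOffAddr# p j a s3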

(If one were to raise an Issue 3, it'd likely be that even with Addr# and
malloc, which is currently the fastest Haskell solution, we're not that fast.
A glance at the shootout list reveals that we're beaten by a myriad of
Java-based languages, Python (Psyco), Ada, OCaml, Eiffel, Lisp (SBCL), Clean,
and of course C++, C, Fortran, etc. Is there any hope of performance
improvements that will push us up that list? I've heard that people are
working on improved code generation, but I don't know if that can help with
this sort of thing (or if it's more of a runtime system area).

While writing this mail, it occurred to me that this is also a potential
source of slowness in the sorting 'library' I've been writing for uvector. I
already have some routines that are competitive with (or beat) qsort from
glibc, but GNU's C++ STL introsort implementation has remained untouched.
However, if reads/writes simply take twice as much time in Haskell as they do
in C++, my results aren't as surprising.)

Anyhow, thanks for any light that can be shed on the subject.

1: http://shootout.alioth.debian.org/gp4/benchmark.php?test=fannkuch
2: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uvector
 