Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane
From: Donald Bruce Stewart <dons <at> cse.unsw.edu.au>
Subject: ByteStrings and the ram barrier
Newsgroups: gmane.comp.lang.haskell.libraries
Date: Friday 12th May 2006 06:07:47 UTC (over 11 years ago)
Hey [email protected],

Data.ByteStrings are strict, unboxed arrays of Word8s. They're great for
fast processing of strings that fit in memory but no good for lazy
processing, or dealing with data larger than your ram (i.e. they're
infeasible above 1G on a 2G ram box).

Last week Duncan Coutts had the bright idea to write
Data.ByteString.Lazy, a newtyped [ByteString], which would allow
constant space processing of chunks of strict ByteString arrays. 

The theory is that we'd be able to efficiently process large data, where
both ByteString and [Char] fails, and open a new range of applications
that could be handled successfully with Haskell.

By using list operations over the spine of the structure, and fast
ByteString operations over the leaves, we should be able to get some of
the benefits of both lazy lists and strict bytestrings.

For example, to map over a lazy bytestring you do a list map along the
spine of the list, and a bytestring map over the nodes:

    lazy bytestring map  = L.map (B.map f) xs

So we hacked away at it for a few days, and here are some preliminary
results:
Simple task, filter the letter 'e' from a file.

------------------------------------------------------------------------

4G file, 1G ram, Linux/x86, 3.20GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (128k chunks)
        constant heap, total time 2m 20s, 25% cpu

    Data.List
        constant heap, total time 8 minutes (quite good!), 98% cpu

    sed 's/e//g'
        constant heap,  I gave up after 10 minutes

------------------------------------------------------------------------

10G file, 1G ram, OpenBSD/x86, 2.4GHz P4, GHC 6.4.1

    Data.ByteString.Lazy (5M chunks)
        constant heap, total time 11.44 minutes, 34% cpu

    Data.List   
        constant heap, total time 18.21 minutes, 90% cpu

------------------------------------------------------------------------

A nice improvement for Haskell on mutli-gigabyte files, I think!

Data.ByteString.Lazy is in fps head, here:

    http://www.cse.unsw.edu.au/~dons/code/fps/Data/ByteString/Lazy.hs
   

Cheers,
  Don
 
CD: 3ms