Subject: Re: Fast saving/loading of huge matrices
Newsgroups: gmane.comp.python.scientific.user
Date: 2007-04-19 19:30:32 GMT

On Thu, 19 Apr 2007 at 09:23 -0500, Robert Kern wrote:
> Gael Varoquaux wrote:
> > I have a huge matrix (I don't know how big it is, it hasn't finished
> > loading yet, but the ascii file weights 381M). I was wondering what
> > format had best speed efficiency for saving/loading huge file. I don't
> > mind using a hdf5 even if it is not included in scipy itself.
>
> I think we've found that a simple pickle using protocol 2 works the fastest. At
> the time (a year or so ago) this was faster than PyTables for loading the entire
> array of about 1GB size. PyTables might be better now, possibly because of the
> new numpy support.

I was curious as well whether PyTables 2.0 is any faster than the 1.4
series (although I already knew that, for this sort of thing, the room
for improvement should be rather small). To find out, I've made a small
benchmark (see attachments) and compared the performance of PyTables
1.4 and 2.0 against pickle (protocol 2). In the benchmark, a NumPy
array of around 1 GB is created, and the time for writing it to disk
and reading it back is written to stdout. You can see the outputs of
the runs in the attachments as well.

From there, some conclusions can be drawn:

1. The difference in performance between PyTables 1.4 and 2.0 for this
   specific task is almost negligible. This was to be expected because,
   although 1.4 used numarray at the core, its use of the array
   protocol made copies of the arrays unnecessary (and hence its
   overhead relative to 2.0, which has NumPy at the core, is
   negligible).

2. For writing, the EArray (Extensible Array) object of PyTables has
   roughly the same speed as pickle (about 15% faster in fact, but that
   is not much).
   However, for reading, the speed-up of PyTables over pickle is more
   than 2x (up to 2.35x for 2.0), which is something to consider.

3. For compressed EArrays, writing times are relatively bad: between
   0.06x (zlib and PyTables 1.4) and 0.15x (lzo and PyTables 2.0) the
   speed of pickle. For reading, however, the ratios are quite good:
   between 0.57x (zlib and PyTables 1.4) and 1.45x (lzo and PyTables
   2.0). In general, one should expect better performance from
   compressed data, but I've chosen completely random data here, so the
   compressors weren't able to achieve even decent compression ratios,
   and that hurts I/O performance quite a bit.

4. The best performance is achieved by the Array object, which is
   simple (it can be neither enlarged nor compressed) but rather
   effective in terms of I/O. It can be up to 1.74x faster than pickle
   for writing (using PyTables 2.0) and up to 3.56x faster for reading
   (using PyTables 1.4), which is quite a lot (more than 500 MB/s) in
   terms of I/O speed.

I should warn the reader that these times were taken *without* taking
into account the time needed to flush the data to disk when writing.
When that time is included, the gap between PyTables and pickle will
shrink significantly (but not when using compression, where PyTables
will remain considerably slower in comparison). So, you should take the
above figures as *peak* throughputs (which can be achieved when the
dataset fits comfortably in main memory, thanks to the filesystem
cache).

For reading, when the files don't fit in the filesystem cache or are
read for the first time, one should expect a significant degradation in
all the figures presented here. However, when using compression on
real data (where compression ratios of 2x or more are realistic),
reading a compressed EArray should be up to 2x faster than the other
solutions (I've noticed this many times in other contexts). This is
because less data has to be read from disk and, moreover, today's CPUs
are exceedingly fast at decompressing.
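
The caveat about flush time can be made concrete. What follows is only a minimal sketch, not the attached benchmark script: it assumes nothing beyond the Python standard library, uses a small `array.array` of doubles as a stand-in for the 1 GB NumPy matrix, and the helper names `timed_pickle_dump` and `timed_pickle_load` are hypothetical. It times the same pickle write twice, once stopping the clock as soon as the data has been handed to the OS, and once after also calling `os.fsync` so that the flush to disk is included.

```python
import array
import os
import pickle
import tempfile
import time


def timed_pickle_dump(obj, path, sync):
    """Pickle obj to path with protocol 2 and return the elapsed seconds.

    When sync is True, the timing also covers draining Python's buffer
    and asking the OS to write its cached pages to the device.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=2)
        if sync:
            f.flush()             # drain Python's userspace buffer
            os.fsync(f.fileno())  # ask the OS to push cached pages to disk
    return time.perf_counter() - start


def timed_pickle_load(path):
    """Unpickle the file at path and return (object, elapsed seconds)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        obj = pickle.load(f)
    return obj, time.perf_counter() - start


# About 8 MB of doubles: a small stand-in for the ~1 GB matrix (assumption).
data = array.array("d", range(1_000_000))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")
    t_cached = timed_pickle_dump(data, path, sync=False)
    t_synced = timed_pickle_dump(data, path, sync=True)
    restored, t_read = timed_pickle_load(path)

print("write, no flush:   %.3fs" % t_cached)
print("write, with fsync: %.3fs" % t_synced)
print("read (likely from cache): %.3fs" % t_read)
```

The absolute numbers depend entirely on the machine and filesystem. The read in this sketch happens right after the write, so it is most likely served from the filesystem cache, which is the same situation the peak figures above describe.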

The above benchmarks were run on a machine running SUSE Linux with an
AMD Opteron @ 2 GHz, 8 GB of main memory and a 7200 rpm IDE disk.

Cheers,

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works,
www.carabos.com   |  I haven't tested it. -- Donald Knuth

Python version: 2.4.4 (#1, Nov 6 2006, 12:24:47) [GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
NumPy version: 1.0.1
PyTables version: 1.4
Checking with a 1000x125000 matrix of float64 elements (953.674 MB)

***** cPickle (protocol 2) *****
Time for writing: 3.992s
File size: 955M
Time for reading: 6.222s

***** PyTables EArray (dump row to row) *****
Time for writing: 3.745s.
Speed-up over cPickle: 1.07x
File size: 955M
Time for reading: 2.73s.
Speed-up over cPickle: 2.28x
File size: 955M

***** PyTables EArray (dump row to row, compressed with zlib) ******
Time for writing: 68.575s.
Speed-up over cPickle: 0.06x
File size: 810M
Time for reading: 10.956s.
Speed-up over cPickle: 0.57x
File size: 810M

***** PyTables EArray (dump row to row, compressed with lzo) *****
Time for writing: 33.865s.
Speed-up over cPickle: 0.12x
File size: 840M
Time for reading: 7.694s.
Speed-up over cPickle: 0.81x
File size: 840M

***** PyTables EArray (complete dump) *****
Time for writing: 3.389s.
Speed-up over cPickle: 1.18x
File size: 955M
Time for reading: 2.758s.
Speed-up over cPickle: 2.26x
File size: 955M

***** PyTables Array *****
Time for writing: 2.659s.
Speed-up over cPickle: 1.5x
File size: 955M
Time for reading: 1.746s.
Speed-up over cPickle: 3.56x
File size: 955M

Python version: 2.5 (r25:51908, Nov 3 2006, 12:01:01) [GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
NumPy version: 1.0.2.dev3640
PyTables version: 2.0b2pro
Checking with a 1000x125000 matrix of float64 elements (953.674 MB)

***** cPickle (protocol 2) *****
Time for writing: 4.674s
File size: 955M
Time for reading: 6.254s

***** PyTables EArray (dump row to row) *****
Time for writing: 3.844s.
Speed-up over cPickle: 1.22x
File size: 972M
Time for reading: 2.663s.
Speed-up over cPickle: 2.35x
File size: 972M

***** PyTables EArray (dump row to row, compressed with zlib) ******
Time for writing: 48.956s.
Speed-up over cPickle: 0.1x
File size: 831M
Time for reading: 8.597s.
Speed-up over cPickle: 0.73x
File size: 831M

***** PyTables EArray (dump row to row, compressed with lzo) *****
Time for writing: 30.643s.
Speed-up over cPickle: 0.15x
File size: 842M
Time for reading: 4.302s.
Speed-up over cPickle: 1.45x
File size: 842M

***** PyTables EArray (complete dump) *****
Time for writing: 4.071s.
Speed-up over cPickle: 1.15x
File size: 972M
Time for reading: 2.701s.
Speed-up over cPickle: 2.32x
File size: 972M

***** PyTables Array *****
Time for writing: 2.693s.
Speed-up over cPickle: 1.74x
File size: 955M
Time for reading: 1.81s.
Speed-up over cPickle: 3.46x
File size: 955M

_______________________________________________
SciPy-user mailing list
SciPy-user <at> scipy.org
http://projects.scipy.org/mailman/listinfo/scipy-user
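
The speed-up and throughput figures quoted in the message can be rechecked from the raw timings in the benchmark output. The sketch below assumes that "speed-up over cPickle" means the cPickle time divided by the PyTables time, which is consistent with the reported values; the function names `speedup` and `throughput_mb_s` are hypothetical helpers, and the numbers are copied from the cPickle and PyTables Array sections of the two runs.

```python
ARRAY_MB = 953.674  # size of the 1000x125000 float64 matrix, from the output

# (write seconds, read seconds) per PyTables version, copied from the output
pickle_times = {"1.4": (3.992, 6.222), "2.0": (4.674, 6.254)}
array_times = {"1.4": (2.659, 1.746), "2.0": (2.693, 1.81)}


def speedup(baseline_s, candidate_s):
    """Ratio of baseline time to candidate time (above 1 means faster)."""
    return baseline_s / candidate_s


def throughput_mb_s(size_mb, seconds):
    """Average transfer rate in MB/s for moving size_mb in the given time."""
    return size_mb / seconds


for version in ("1.4", "2.0"):
    p_write, p_read = pickle_times[version]
    a_write, a_read = array_times[version]
    print(
        "PyTables %s Array: write %.2fx, read %.2fx, read rate %.0f MB/s"
        % (
            version,
            speedup(p_write, a_write),
            speedup(p_read, a_read),
            throughput_mb_s(ARRAY_MB, a_read),
        )
    )
```

With these inputs the read rate of the Array object comes out above 500 MB/s for both versions, matching the claim in conclusion 4.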