From: Daniel Phillips <phillips <at> phunq.net>
Subject: Time to truncate
Newsgroups: gmane.comp.file-systems.tux3
Date: Tuesday 2nd September 2008 01:15:49 UTC
The last burst of checkins has brought Tux3 to the point where it
undeniably acts like a filesystem: one can write files, go away,
come back later and read those files by name.  We can see some of the
hoped-for attractiveness starting to emerge: Tux3 clearly does scale
from the very small to the very big at the same time.  We have our
Exabyte file with 4K blocksize and we can also create 64 Petabyte
files using 256 byte blocks.  How cool is that?  Not much chance for
internal fragmentation with 256 byte blocks.
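
Both sizes quoted above are consistent with a 48-bit logical block
index per file: 2^48 blocks of 4K is 2^60 bytes (one exabyte) and 2^48
blocks of 256 bytes is 2^56 bytes (64 petabytes).  The 48-bit figure is
inferred from those two numbers, it is not stated in this post, and the
function names below are made up for illustration.  A minimal sketch of
the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: files are addressed by a 48-bit logical block index.
 * This is inferred from the sizes quoted in the post, not stated there. */
enum { INDEX_BITS = 48 };

/* log2 of the maximum file size in bytes for blocks of 2^blockbits bytes:
 * 2^INDEX_BITS blocks times 2^blockbits bytes per block. */
unsigned max_file_size_bits(unsigned blockbits)
{
	assert(blockbits <= 64 - INDEX_BITS);
	return INDEX_BITS + blockbits;
}

/* Number of blocks needed to hold size bytes, rounding up. */
uint64_t blocks_needed(uint64_t size, unsigned blockbits)
{
	assert(blockbits < 64);
	return (size + ((uint64_t)1 << blockbits) - 1) >> blockbits;
}

/* Bytes left unused in the last block (internal fragmentation). */
uint64_t tail_waste(uint64_t size, unsigned blockbits)
{
	return (blocks_needed(size, blockbits) << blockbits) - size;
}
```

For a 1000 byte file, tail_waste() gives 3096 bytes with 4K blocks
(blockbits 12) and 24 bytes with 256 byte blocks (blockbits 8).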


I wonder how well Tux3 will perform with 256 byte blocks.  Actually,
I don't really see big problems.  We should probably be working mostly
with tiny blocks in initial development, because little blocks generate
bushy trees, and bushy trees expose boundary conditions much faster
than big blocks.  Which is exactly what we need now if we want to get
stable early.  Plus it helps focus on allocation strategy: more little
blocks means more chances for things to go wrong by fragmentation.
Let's keep that issue front and center throughout the entire course of
Tux3 development.
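
To put a rough number on "bushy", here is a sketch that counts index
levels for a given block size.  It assumes every index entry, leaf or
interior, takes 8 bytes and that nodes are packed full.  Both are
simplifying assumptions for illustration only, the real Tux3 leaf
formats differ, and tree_levels() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: each index entry occupies 8 bytes and nodes are full. */
enum { ENTRY_BYTES = 8 };

/* Count the btree levels needed to index the given number of data
 * blocks when each index node is one block of 2^blockbits bytes. */
unsigned tree_levels(uint64_t blocks, unsigned blockbits)
{
	assert(blockbits >= 4 && blockbits < 64);
	uint64_t fanout = ((uint64_t)1 << blockbits) / ENTRY_BYTES;
	uint64_t nodes = (blocks + fanout - 1) / fanout; /* leaf count */
	unsigned levels = 1;

	/* Add interior levels until a single root node covers everything. */
	while (nodes > 1) {
		nodes = (nodes + fanout - 1) / fanout;
		levels++;
	}
	return levels;
}
```

Under these assumptions a 1 megabyte file needs three index levels
with 256 byte blocks (4096 data blocks, fanout 32) and only one with
4K blocks (256 data blocks, fanout 512), so the small-block tree
splits and grows far sooner under test.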

(When we get closer to the kernel port I will switch to working mainly
with 512 byte blocks, which is the finest granularity supported by
Linux block devices at present.)

Anyway, the question naturally arises: what next?  There are so many
issues remaining, big and small.  Some of the big ones:

  * Atomic Commit - we want to know if Tux3's new forward logging
    strategy is as good as I have boasted, and indeed, does it work
    at all?  And what is the commit algorithm exactly?

  * Versioning - very nearly the entire reason for Tux3 to exist,
    although we are now beginning to see evidence that even as a
    conventional non-versioning filesystem, Tux3 is not without its

  * Coalesce on delete - without this we can still delete files but we
    cannot recover file index blocks, only empty them, not so good.

  * Kernel port - no kernel port, no proof of concept, no hordes of
    enthusiastic kernel developers flocking to help.  Imagining how
    well Tux3 will work in kernel is no substitute for actually being
    able to mount a Tux3 filesystem and take it for a spin.

  * Extents - without extents we are going to get hammered (pun
    intentional) by the competition in various benchmarks.  Not all
    benchmarks, but some important ones.  We cannot enter the
    benchmark sweepstakes until extents are working.  There is a big
    messy interaction between extents and versioning: versioned
    extents are much harder to do than versioned pointers because the
    number of boundary conditions in the algorithms explodes and
    new, very subtle block (de)allocation issues arise.  Not a
    weekend project, more like a couple of weeks.

  * Locking - often the biggest source of bugs and bottlenecks in a
    Linux kernel subsystem, not to mention the way it tends to force
    unnatural algorithmic modifications on the unfortunate coder, to
    get around roadblocks like not being able to sleep in spinlocks or
    interrupt context, situations that are encountered frequently in
    any kernel system having to do with storage.

  * Extended attributes.  Ok, so nobody exactly uses them.  Well,
    except Samba, which is very sensitive to xattr performance, and...
    security people, who love to play with weird and wonderful schemes
    for doing security better with the help of extended attributes.

So with all those big projects to do, and a host of little ones
besides, really, what next?

OK, I decided.  It's going to be coalesce on delete, just enough of
that to implement file truncation.  It is now time to truncate.  As
soon as file truncation is added to the test mix we will see much more
interesting behavior from the bitmap allocator, and we will discover
some great ways to generate horrible fragmentation issues.  Yummy.
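
A toy illustration of why truncation stresses the allocator.  This is
not the Tux3 bitmap allocator; the bitmap type, the function names and
the 64-block volume are all invented for the example.  Truncating a
file whose blocks are interleaved with another file's leaves a string
of one-block holes behind:

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_BLOCKS = 64 };	/* hypothetical tiny volume */

/* One bit per volume block; a set bit means the block is allocated. */
typedef struct { uint8_t bits[DEMO_BLOCKS / 8]; } demo_bitmap;

void demo_set(demo_bitmap *map, unsigned block)
{
	assert(block < DEMO_BLOCKS);
	map->bits[block >> 3] |= (uint8_t)(1u << (block & 7));
}

void demo_clear(demo_bitmap *map, unsigned block)
{
	assert(block < DEMO_BLOCKS);
	map->bits[block >> 3] &= (uint8_t)~(1u << (block & 7));
}

int demo_test(const demo_bitmap *map, unsigned block)
{
	assert(block < DEMO_BLOCKS);
	return (map->bits[block >> 3] >> (block & 7)) & 1;
}

/* Free every physical block mapped at logical index >= keep and
 * return the new block count of the file. */
unsigned demo_truncate(demo_bitmap *map, const unsigned *physical,
	unsigned count, unsigned keep)
{
	for (unsigned i = keep; i < count; i++)
		demo_clear(map, physical[i]);
	return keep < count ? keep : count;
}

/* Count runs of contiguous free blocks: more runs, more fragmentation. */
unsigned demo_free_runs(const demo_bitmap *map)
{
	unsigned runs = 0;
	int prev_free = 0;

	for (unsigned block = 0; block < DEMO_BLOCKS; block++) {
		int is_free = !demo_test(map, block);
		if (is_free && !prev_free)
			runs++;
		prev_free = is_free;
	}
	return runs;
}
```

With file A on blocks 0, 2, 4, 6 and file B on blocks 1, 3, 5, 7, free
space starts as a single run.  Truncating A to one block frees 2, 4
and 6, and demo_free_runs() goes from 1 to 4.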

One approachable project that pretty well anybody on the list here
could jump into while I am going at truncation: leaf methods to check
integrity of the two kinds of btree leaves we now have in use, file
data index leaves (dleaf.c) and inode table leaf blocks (ileaf.c).
Whoever wants to carve their initials on what is starting to look like
a for-real Linux filesystem, now is a great time to take a flyer.  The
code base is still tiny, builds fast, has lots of interactive feedback
and is easy to work on.  And you get to put your email address near
the beginning of the list, which will naturally write its way into the
history of open source.  Probably.
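
For anyone wondering what a leaf check method might look like, here is
a rough sketch.  The leaf layout below is invented for the example (a
magic number, an entry count and a sorted array of key/block pairs
sized for a 256 byte block) and is not the actual dleaf.c or ileaf.c
format, and demo_leaf_check() is a hypothetical name.  The idea is
just to verify the invariants a leaf is supposed to keep:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_MAGIC = 0x1eaf, DEMO_MAX_ENTRIES = 15 };

/* Hypothetical leaf layout, not the real Tux3 on-disk format. */
struct demo_entry { uint64_t key, block; };

struct demo_leaf {
	uint16_t magic;		/* identifies the leaf type */
	uint16_t count;		/* number of entries in use */
	uint32_t pad;
	struct demo_entry entries[DEMO_MAX_ENTRIES];
};

/* Return 0 if the leaf looks sane, or a negative code naming the
 * first invariant that fails. */
int demo_leaf_check(const struct demo_leaf *leaf)
{
	assert(leaf != NULL);
	if (leaf->magic != DEMO_MAGIC)
		return -1;	/* not a leaf of this type */
	if (leaf->count > DEMO_MAX_ENTRIES)
		return -2;	/* entries would overflow the block */
	for (unsigned i = 1; i < leaf->count; i++)
		if (leaf->entries[i - 1].key >= leaf->entries[i].key)
			return -3;	/* keys unsorted or duplicated */
	return 0;
}
```

A check like this can be called after every leaf operation in a test
run, so that a broken invariant is caught at the operation that broke
it rather than many operations later.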

