From: Daniel Phillips <daniel.raymond.phillips <at> gmail.com>
Subject: Tux3 Report: Initial fsck has landed
Newsgroups: gmane.comp.file-systems.tux3
Date: Monday 28th January 2013 05:55:40 UTC
Initial Tux3 fsck has landed

Things are moving right along in Tux3 land. Encouraged by our great initial
benchmarks for in-cache workloads, we are now busy working through our to-do
list to develop Tux3 the rest of the way into a functional filesystem that a
sufficiently brave person could actually mount.

At the top of the to-do list is "fsck". Because really, fsck has to rank as
one of the top features of any filesystem you would actually want to use.
Ext4 rules the world largely on the strength of e2fsck. Not just fsck, but
certainly that is a large part of it. Accordingly, we have set our sights on
creating an e2fsck-quality fsck in due course.

Today, I am happy to be able to say that a first draft of a functional Tux3
fsck has already landed:

    https://github.com/OGAWAHirofumi/tux3/blob/master/user/tux3_fsck.c

Note how short it is. That is because Tux3 fsck uses a "walker" framework
shared by a number of other features. It will soon also use our suite of
metadata format checking methods that were developed years ago (and still
continue to be improved).

The Tux3 walker framework (another great hack by Hirofumi, as is the initial
fsck) is interesting in that it evolved from tux3graph, Hirofumi's graphical
filesystem structure dumper. And before that, it came from our btree
traversing framework, which came from ddsnap, which came from HTree, which
came from Tux2. Whew. Nearly 15 years of history for that code when you trace
it all out.

Anyway, the walker is really sweet. You give it a few specialized methods and
poof, you have an fsck. So far, we just check physical referential integrity:
each block is either free or is referenced by exactly one pointer in the
filesystem tree, possibly as part of a data extent. This check is done with
the help of a "shadow bitmap". As we walk the tree, we mark off all referenced
blocks in the shadow bitmap, complaining if a block is already marked. At the
end of that, the shadow bitmap should be identical to the contents of the
allocation bitmap inode. And more often than not, it is.
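
To make the shadow bitmap idea concrete, here is a minimal sketch in Python.
It is not the code in tux3_fsck.c and does not use the real walker interface:
the Node class, its fields and check_physical_integrity() are hypothetical
names made up for illustration, the allocation bitmap is modeled as a plain
list of booleans, and reserved blocks that live outside the tree are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: occupies one block, may point at extents and children."""
    block: int
    extents: list = field(default_factory=list)   # (start, count) data extents
    children: list = field(default_factory=list)  # child Node objects


def check_physical_integrity(root, allocation_bitmap):
    """Walk the tree, build a shadow bitmap, then compare it with the real one."""
    total = len(allocation_bitmap)
    shadow = [False] * total
    errors = []

    def mark(block):
        if not 0 <= block < total:
            errors.append(f"block {block}: outside volume")
        elif shadow[block]:
            # Already marked: some other pointer references this block too.
            errors.append(f"block {block}: referenced more than once")
        else:
            shadow[block] = True

    stack = [root]
    while stack:
        node = stack.pop()
        mark(node.block)
        for start, count in node.extents:
            for block in range(start, start + count):
                mark(block)
        stack.extend(node.children)

    # After the walk, the shadow bitmap should match the allocation bitmap.
    for block in range(total):
        if shadow[block] != allocation_bitmap[block]:
            kind = ("allocated but unreferenced" if allocation_bitmap[block]
                    else "referenced but free")
            errors.append(f"block {block}: {kind}")
    return errors
```

An empty result means every block is either free or referenced exactly once.
Anything else is a list of complaints, one per offending block.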

The cases where we actually get differences now mostly arise during hacking,
though of course we do need to be checking a lot more volumes under different
loads to have a lot of confidence about that. As a development tool, even this
very simple fsck is a wonderful thing.

Tux3 fsck is certainly not going to stay simple. Here is roughly where we are
going with it next:

    http://phunq.net/pipermail/tux3/2013-January/001976.html
    "Fsck Revisited"

To recap, next on the list is checking referential integrity of the directory
namespace, a somewhat more involved problem than physical structure, but not
really hard. After that, the main difference between this and a real fsck
will be repair. Which is a big topic, but it is already underway. First
simple repairs, then tricky repairs.

Compared to Ext2/3/4, Tux3 has a big disadvantage in terms of fsck: it does
not confine inode table blocks to fixed regions of the volume. Tux3 may store
any metadata block anywhere, and tends to stir things around to new locations
during normal operation. To overcome this disadvantage, we have the concept
of uptags:

    http://phunq.net/pipermail/tux3/2013-January/001973.html
    "What are uptags?"

With uptags we should be able to fall back to a full scan of a damaged volume
and get a pretty good idea of which blocks are actually lost metadata blocks,
and to which filesystem objects they might belong.

Free form metadata has another disadvantage: we can't just slurp it up from
disk in huge, efficient reads. Instead we tend to mix inode table blocks,
directory entry blocks, data blocks and index blocks all together in one big
soup so that related blocks live close together. This is supposed to be great
for read performance on spinning media, and should also help control write
multiplication on solid state devices, but it is most probably going to suck
for fsck performance on spinning disk, due to seeking.

So what are we going to do about that? Well, first we want to verify that
there is actually an issue, as proved by slow fsck. We already suspect that
there is, but some of the layout optimization work we have underway might go
some distance to fixing it. After optimizing layout, we will probably still
have some work to do to get at least close to e2fsck performance. Maybe we can
come up with some smart cache preload strategy or something like that.

The real problem is, Moore's Law just does not work for spinning disks. Nobody
really wants their disk spinning faster than 7200 rpm, or they don't want to
pay for it. But density goes up as the inverse square of feature size. So
media transfer rate goes up linearly while disk size goes up quadratically.
Today, it takes a couple of hours to read each terabyte of disk. Fsck is
normally faster than that, because it only reads a portion of the disk, but
over time, it breaks in the same way. The bottom line is, full fsck just isn't
a viable thing to do on your system as a standard, periodic procedure. There
is really not a lot of choice but to move on to incremental and online fsck.
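
The arithmetic behind that can be sketched in a few lines of Python. The
140 MB/s sustained transfer rate and the 1 TB starting capacity are assumed,
illustrative numbers rather than measurements, and both function names are
made up for this example. The model is simply that at fixed rpm, shrinking
features by some linear factor multiplies capacity by the square of that
factor but transfer rate only by the factor itself.

```python
TB = 10**12  # bytes
MB = 10**6   # bytes


def full_read_hours(capacity_bytes, transfer_bytes_per_sec):
    """Hours needed to read the whole disk once, sequentially."""
    return capacity_bytes / transfer_bytes_per_sec / 3600


def scale_generation(capacity_bytes, transfer_bytes_per_sec, linear_factor):
    """One density step at fixed rpm: capacity grows with the square of the
    linear factor, transfer rate only with the factor itself."""
    return (capacity_bytes * linear_factor ** 2,
            transfer_bytes_per_sec * linear_factor)


if __name__ == "__main__":
    capacity, rate = 1 * TB, 140 * MB  # assumed starting point
    for generation in range(4):
        hours = full_read_hours(capacity, rate)
        print(f"gen {generation}: {capacity / TB:.0f} TB, "
              f"{rate / MB:.0f} MB/s, {hours:.1f} h for a full read")
        capacity, rate = scale_generation(capacity, rate, 2)
```

Under these assumptions the time for a full read doubles every time linear
density doubles, which is why anything that scales with volume size, full
fsck included, keeps getting worse.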

It is quite possible that Tux3 will get to incremental and online fsck before
Ext4 does. (There you go, Ted, that is a challenge.) There is no question that
this is something that every viable, modern filesystem must do, and no,
scrubbing does not cut the mustard. We need to be able to detect errors on the
filesystem, perhaps due to blocks going bad, or heaven forbid, bugs, then
report them to the user and *fix* them on command without taking the volume
offline. If that seems hard, it is. But it simply has to be done.

So that is the Tux3 Report for today. As usual, the welcome mat is out for
developers at oftc.net #tux3. Or hop on over and join our mailing list:

    http://phunq.net/cgi-bin/mailman/listinfo/tux3

We are open to donations of various kinds, particularly of your own awesome
developer power. We have an increasing need for testers. Expect to see a
nice simple recipe for KVM testing soon. Developing kernel code in userspace
is a normal thing in the Tux3 world. It's great. If you haven't tried it yet,
you should.

Thank you for reading, and see you on #tux3.

Regards,

Daniel
 