From: Tom Christiansen <tchrist <at> perl.com>
Subject: Filesystems? *I'll* give ya fileysytems!
Newsgroups: gmane.comp.lang.perl.perl5.porters
Date: Friday 20th March 2009 12:21:45 UTC
Mark Mielke wrote, and correctly, in replying to Michael Schwern's 
lamentably incorrect and misled posting, the following:

>> If you really want a close() that doesn't flush, for those
>> ten people out there writing Perl code where it's a good
>> idea, give a Perl filehandle option (we do have IO::Handle
>> after all) or pragma to turn it off.  But for god's sake,
>> default it to on.

>> It's even right there in the docs, so really this whole thing
>> is just a bug.


>>   close FILEHANDLE
>>   close   Closes the file or pipe associated with the file
>>           handle, flushes the IO buffers, and closes the 
>>                   ^^^^^^^^^^^^^^^^^^^^^^
>>           system file descriptor.

>> Language lawyers may say "oh, but that's Perl's IO buffer not
>> the filesystem's buffer" or whatever.  Bullshit, the user
>> doesn't draw so fine a line.

Schwern, *please* don't be Stupid--especially belligerently so.

I can see I'm going to have to thwap you with a weekend's worth
of homework.  Happy reading.

>> They don't know or care if Perl does it or the OS does it or
>> magic ponies do it. It's no comfort that we're technically
>> correct when their data is lost.  The purpose is to make
>> closing a filehandle safe.

You're old enough to know better than this.  

LISTEN TO ME: the fact that you edit a file with some program, "write
out" your changes, close the file, and exit the program IN NO
FASHION WHATSOEVER GUARANTEES that a physical write has occurred.
It never has.  Moreover, this is not a Perl problem; it is not a
C problem.  In fact, it is not a problem at all.  It is a design
requirement, goal, and feature.  *ALL* programs work this way, I
don't care whether it's vi or sh or cat or perl or emacs or open
office. They *all* make no such silly "guarantee" as you seem to
be ignorantly demanding.

There is one, nearly unique exception to this: fsck itself.
You remember fscks, don't you, Schwern?  fscks?  What are 
*those* for, anyway?

Do you know *why* fsck stands apart as the lone exception to the
rule, and do you know *how* it does what it must?  Do you know why 
the filesystem must be unmounted to be fsck'd?

It's because fsck does *not* use the block special device that
the filesystem is mounted on.  It uses the character special
device.  It does this because you *cannot* go through the block
buffer cache when what you're trying to do is diagnose errors made by
that very system!  You *must* go through the raw interface, which
is the other name for the corresponding character special
device, and NOT involve the buffer-cache at all.

I don't care whether you call it /dev/disk0s2 and /dev/rdisk0s2, or
/dev/sd0d and /dev/rsd0d, or whatever.  These pairs exist as an
integral part not just of "filesystem", but of the entire VM system
that's supporting you.  There's a lot more going on here than you seem
to respect.
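
The block/character distinction is visible from user space through
stat(2).  A minimal sketch, assuming a POSIX system; the function name
device_kind() is made up for illustration, and which device nodes exist
varies from system to system:

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>

/* Returns 'b' for a block special device, 'c' for a character (raw)
 * special device, '-' for any other kind of file, and '?' if stat(2)
 * fails.  device_kind() is a hypothetical helper, not a system call. */
char device_kind(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return '?';
    if (S_ISBLK(st.st_mode))
        return 'b';
    if (S_ISCHR(st.st_mode))
        return 'c';
    return '-';
}
```

On a system that has both halves of such a pair, the buffered node
should report 'b' and its raw twin 'c'.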

The interactions with the kernel's buffer-cache, the low-level device
drivers, the upper-level VM system with paging and perhaps DMA locked-
down pages, optimal scheduling of numerous simultaneous requests in the
read-ahead/write-behind system, and several other aspects of the system
are *all* tied together, inextricably.  

Trying to play God by fsync'ing your file descriptor is, first of all,
a very selfish thing and, second of all, rather less effective than
you seem to think it is.

Do you know what happens when you type sync(8)?  It calls sync(2).
Ever notice how fast it comes back?  Do you know why?  BECAUSE IT

All it does is preëmpt the normally delayed flushing of dirty blocks by
expiring them all.  It does *not* write them.  They just go on the
various disk queues to be written to each device, but that doesn't get
them there -- yet.  Once upon a time, and on some systems still, there
was a user process called update(8) that did this sort of thing every 30
seconds or so.

On more modern filesystems, the old update(8) daemon has been integrated
into the kernel's block buffer system, and so this is done in a more
staggered fashion.  Nevertheless, a sync(2) will expire them all.
Eventually.  For even *it* does not guarantee that they made it out to
disk.  Again, I remind you that it merely puts them on the write queues for
the next go around.

Now, if you weren't in user mode, and superuser is but a user, then you
*could* call vflushbuf(9) with an argument to block until that inode's
blocks are all safely put to bed.  But even that's not the default.  You
could also call vwaitforio(9), which sleeps [that word means something VERY
DIFFERENT to the kernel, you know; check out WCHAN from ps(1)] until all
asynchronous writes (if such there be) to that vnode are complete.

But again, we're not talking user-space activity here.  This is no realm
that *PROGRAMS* have any business playing in.  And in fact, they cannot.

You really should read the buffercache(9) manpage, you know. 

Once done there, go to /sys/kern/vfs_syscalls.c, and regard the sys_sync()
and sys_fsync() functions.  After you're done trying to figure out where
the fsync vnode opts really are for FFS (try /sys/sys/vnode.h), you still
don't know for sure.

You have to go look in /sys/kern/vfs_sync.c for the code you're really
looking for.  There you'll also find what became of update(8): it's
now the sched_sync() function found therein.  And don't get too thrilled 
by speedup_syncer(), as it's only called from

    /*
     * If memory utilization has gotten too high, deliberately slow things
     * down and speed up the I/O processing.
     */
    STATIC int
    request_cleanup(resource, islocked)

in /sys/ufs/ffs/ffs_softdep.c anyway.

Speaking of which, we're at the vnode level here, so you don't
really know what goes on.  It depends on your particular filesystem
type.  You'll find ffs_fsync() in /sys/ufs/ffs/ffs_vnops.c.  Pay
attention to what ap->a_waitfor == MNT_WAIT is really doing there:
it's not so simple as you might think.

> I tracked and commented on a similar thread for glib/gio. There
> are a lot of invalid expectations that people hold, and people
> are reacting to this problem without understanding it.

I'm surprised--and disappointed, too--but I guess I shouldn't be, 
that so many people fail to understand all this.  It's nothing 
new under the sun.

> First - POSIX and UNIX have never promised to fsync() on
> close() that I am aware of.

No, they have not.  This has always been there.  Check the V7 code.
And all the rest of my references throughout this epistle.

> The performance cost is unreasonable. Consider that every disk write
> may involve a disk seek, and disks are still in the 5ms - 10ms range
> for seek latency. Being unable to close more than 100 file descriptors
> a second is not reasonable. Do we want fsync() on every close()? I
> don't think so.


Not only is it piggishly selfish in a way no other program 
is, it may not even do you as much good as you think it does:
RTFM graciously included below.

Much depends on how that f/s type's vnode ops for fsync implement
the same.  And if you're running on top of something *ELSE*, oh,
like say Mach, what are you really doing, since you never know
whether it's all virtualized anyway.

You absolutely do *NOT* want to micromanage your operating
system, Schwern! Better to just throw it away altogether and give
up.  Very smart people have put in a lot of hard work over decades
to make multitasking operating systems ever faster
and more reliable.

And you choose to overrule them.  On what basis, sir?

> close() doesn't fsync() - it does not have to, and it's
> not a very good idea for applications to explicitly do in
> general either. It has a very real performance cost. If
> Perl arbitrarily switches to forcing fsync() on close(),
> Perl is going to seem much slower for certain types of
> applications.

And so will other applications.  Do not play God with the
buffercache meta-system UNTIL AND UNLESS you can show that you
know more than its makers. This remains unproven, and I
trust that state to persist indefinitely.

> The problem here is "what happens when my system crashes?"
> A lot of people seem surprised that system crashes easily
> caused corruptions in the past and still do today. How
> many people ranting about this subject do not realize that
> their hard disk comes from the factory in "write caching"
> mode, such that your file system can become corrupted even
> with a fully functional journalling file system?

Next to nobody.

> For the common case with write() and close(), we generally
> do not care. If the file system comes back up and it's at
> an earlier state than the instant the system died - nobody
> should be surprised.

> The cause for surprise is when the file system has an
> *inconsistent* state. The rename() case described here is
> a write() followed by a close() followed by a rename().
> The assumption of many has been that close() and rename()
> are guaranteed to be run in sequence such that if the
> close() fails, the rename() fails. The conclusion is
> invalid. close() touches the file. rename() touches the
> directory. Since they're touching different parts of the
> file system - why would they be guaranteed to happen in
> sequence?

I'm not really sure why you think it would matter even if
they *WERE* out of sequence.  After all, close(2) acts on a
file descriptor, something that's already been through
namei(9) and so doesn't *care* about names.  Whereas rename(2)
acts on filenames.  You can rename files at your whim while
still writing to them.  Or you better be able to!!

Well, on most file systems, that is.
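
The point that a descriptor doesn't care about names can be checked
directly.  A minimal sketch, assuming a POSIX system and a local
filesystem; the function name rename_while_open() and the strings it
writes are invented for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Writes through a descriptor, renames the file while the descriptor is
 * still open, writes some more through the same descriptor, and then
 * reads everything back through the NEW name.
 * Returns 1 if both writes landed in the renamed file, 0 if they did
 * not, and -1 if any system call failed. */
int rename_while_open(const char *old_path, const char *new_path)
{
    static const char expected[] = "before after";
    char buf[sizeof expected];
    ssize_t n;
    int fd;

    fd = open(old_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, "before ", 7) != 7 ||
        rename(old_path, new_path) != 0 ||  /* the name changes, the fd does not */
        write(fd, "after", 5) != 5) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    fd = open(new_path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof buf);
    close(fd);

    return n == (ssize_t)strlen(expected) &&
           memcmp(buf, expected, strlen(expected)) == 0;
}
```

None of this says anything about what is on disk after a crash; it only
shows that close(2) and rename(2) work on different objects.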

But it's invalid for several other reasons.  Not all filesystems
implement an atomic rename(2).  There are also propagation issues
involved in UDP/IP-based implementations of NFS in which most
operations are not idempotent -- although DIRECTORY creation and
deletion do propagate synchronously to guarantee consistency,
those of mere files within them do *NOT*.

Thus multiple link(2)s, unlink(2)s, and creat()s [O_EXCL|O_CREAT]
can all succeed on non-directory inodes, even though this
violates many sacred cows. It's due to problems inherent to the
UDP protocol's penchant for dropping, duplicating, mutating, or
misordering packets. And I believe I recall that TCP/IP-based
implementations may have less trouble with this.

Kirk went to a great deal of trouble getting rename() to be
atomic for directory inodes in FFS -- it's not in UFS.  Dennis
warned him that "that would be rather tricky/difficult to get
right."  Not having fully explored the problemspace as Dennis
had, Kirk didn't recognize the trouble he was setting himself up
for.  But he *did* manage it in the end.  And when he came back
to Dennis, he said something to the effect of how he hadn't
realized the severity of Dennis's typical understatement, and
that it *WAS* a bugger of a problem.

> In ext3 this difference is highlighted in writeback vs ordered
> journal mode. The writeback mode exhibits the behaviour that
> rename() can happen without write()/close() leaving the file
> empty on system startup. The ordered mode is a bit hacky - it
> ensures that data is flushed before metadata. This guarantee
> covers for problems such as write()/close()/rename().

That's what I would have imagined it to do.

> In ext4 it seems that they tried to lose this hacky bit -
> but it exposed the applications that make assumptions
> about close() always being scheduled before rename().

Sounds like a dumb NFS bug, or the stupidity of the mis-
implementation of fcntl's file locking on names not
descriptors on SysV.  I complained to Dennis about this,
and he looked at me quizzically, saying, "But that's just
wrong." Of course it was wrong.  Somebody messed up the
spec, and propagated it, without thinking.  These things
are touchy.

> This has raised a lot of mixed opinions. Mostly, because
> people are surprised and have a knee jerk reaction that
> things return to the bubble of safety that they once were.

Or that they thought they were.

> I do not support adding fsync() everywhere, and not on
> close() either. It's silly in my opinion.

Mark, you say it so much more gently than I. :-)

> The only time it is needed is when the application wishes
> to make a stronger guarantee about the state of the file
> system before continuing. The write()/close()/rename() is
> exactly this sort of case. The goal is to accomplish an
> atomic-change-in-place effect. The goal cannot be safely
> achieved without ensuring that the file is correctly
> written to disk before doing the rename. This requires
> write()/fsync()/close()/rename(). This is similar to how a
> database engine needs to fsync() to ensure consistent
> state of a data file before continuing.

Now this I tend to disbelieve.  There is some sort of filesystem
bug here if that be true.   I can argue this from 1st principles,
too, because something that is omissible but need always be done
to ensure a correctly behaving program -- AND ESPECIALLY A
something you should *ever* leave up to the user.
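
For reference, the write()/fsync()/close()/rename() sequence Mark
describes looks roughly like this.  It is only a sketch, assuming a
POSIX system: the function name replace_file() and the caller-supplied
temporary path are invented for illustration, and whether the fsync()
step ought to be needed at all is exactly what is in dispute here:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replaces `path` with `len` bytes of `data`, going through a temporary
 * file `tmp_path` that should live in the same directory as `path`.
 * Returns 0 on success and -1 on failure.  Hypothetical helper. */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    const char *p = data;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {                       /* write(): data to the kernel */
        ssize_t n = write(fd, p, len);
        if (n <= 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    if (fsync(fd) != 0) {                   /* fsync(): ask for it on the device */
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {                   /* close(): release the descriptor */
        unlink(tmp_path);
        return -1;
    }
    if (rename(tmp_path, path) != 0) {      /* rename(): swap the name over */
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```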

sync, sync, sync....

Ever wonder why one types:

    % sync
    % sync
    % sync

The first one schedules the writes, and comes right back to you.
The next one you type a little bit later to make you feel
better.  On some systems, the 2nd sync(2) might have blocked
till the kernel finished with the 1st one, but mostly, it's
about timing. And the third?  Because of course you aren't sure
whether you remembered to type that second one, so
just in case.  :-)

I am not kidding in the very least, and this is not apocryphal.
Larry and Dennis and I were dining together, just the three of
us, and this came up.  Dennis supplied the first two answers, and
Larry the last one--which caused Dennis to laugh really hard,
because Larry got it exactly right: it's just the insecurity of
not being sure you'd done the second one. It was utterly
hilarious, because we'd all done it a zillion times.

Have you ever worked on buffer-cache code and the VM system? 

I have.  

And in more than one job and on more than one operating system.
I *do* know just a little bit, maybe not much, but apparently
more than you need to know, about what we're discussing.

I've seen a lot of Unix filesystems, including ones with much odder
characteristics than these.  Ever experienced the joys of async
*READS* in wired-down DMA memory that bus-transferred?

Schwern, if you care to debate the Unix filesystem, you'd
best have a lot of theory and practice under your belt.  As
Andrew has said, Unix has its weaknesses, but its filesystem
isn't one of them.

Have *you* read the many academic papers, starting from the
earliest UFS work, then moving on to Kirk's FFS, the whole vnode
layer thing, Kirk's later work on soft updates--and what *those*
are all about, anyway?  Have you read about the early forays into
journaling or log-based filesystems, with many papers going back
and forth between Margo and Ouster?  Or the later work done on
some of the more exotic systems, now more commonplace?

Once you've done all that, then fine: we can start from a common
knowledge base.  But until then, I'm skeptical, because I don't 
see much evidence that you understand what you're saying, Schwern.

I include references to online documentation throughout this
message, but at the end, you will find more proper references to
more formal work, and it is these especially to which I should
like you to divert your time and attention.  You need to
understand a great deal more about the problemspace to appreciate
it, and it truly appears that you currently fail to do so.


> Does Perl have any code that does atomic-change-in-place effect
> using rename()? If it does, it should run fsync().

Huh?  That smells wrong.  Atomic rename isn't new, and providing
we're not talking directory inodes, hardly rocket science.

Look at vfs_bio.c and then the filesystem's vnode ops that
implement these, and explain how what you're saying can be true.

I believe it shouldn't be in Perl.
I believe it shouldn't be in libc.
I believe it should be in the kernel, if needed. 

And that demesne is not ours to plunder or poke, 
no matter how strong be our will.

> Note that fsync() should not be the same as autoflush.
> Autoflush writes the data to the operating system. This
> does not and has never guaranteed that the data is written
> to disk.

Yup.   And I still don't think fsync(2) guarantees what 
you think it may.  Besides the code citations and references
in /sys/ that make me dubious, this pretty much seals the deal
that it won't work:

    % man 2 fsync

         fsync -- synchronize a file's in-core state with that on disk


         fsync(int fildes);

         Fsync() causes all modified data and attributes of fildes to be
         moved to a permanent storage device.  This normally results in all
         in-core modified copies of buffers for the associated file to be
         written to a disk.

         Note that while fsync() will flush all data from the host to the
         drive (i.e. the "permanent storage device"), the drive itself may
         not physically write the data to the platters for quite some time
         and it may be written in an out-of-order sequence.

         Specifically, if the drive loses power or the OS crashes, the
         application may find that only some or none of their data was
         written.  The disk drive may also re-order the data so that later
         writes may be present, while earlier writes are not.

         This is not a theoretical edge case.  This scenario is easily
         reproduced with real world workloads and drive power failures.

So there!  

And now what?

> Forcing the data to disk on autoflush is a bad idea from a 
> performance perspective.

Very, very much so.

*And* "forcing to disk" is not so easy as you think it is anyway.
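
The two layers Mark distinguishes map onto two different calls.  A
minimal sketch, assuming a POSIX system; the function name
flush_then_fsync() is invented for illustration, and the second step
carries every caveat from the fsync(2) manpage quoted above:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Step 1 is what autoflush amounts to: user-space buffer to kernel.
 * Step 2 asks the kernel to move its buffered copy to the device.
 * Returns 0 if both calls succeed, -1 otherwise.  Hypothetical helper. */
int flush_then_fsync(FILE *fp)
{
    if (fflush(fp) != 0)            /* stdio buffer -> operating system */
        return -1;
    if (fsync(fileno(fp)) != 0)     /* operating system -> storage device */
        return -1;
    return 0;
}
```

Success from both calls still says nothing about the drive's own write
cache.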

Here's my *minimal* reading list for Schwern, after he's tackled
the kernel code whose references I sent his way above:

  *  Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and
     Robert S. Fabry
     "A Fast File System for UNIX" [1983, 1984]

  *  Mendel Rosenblum and John K. Ousterhout
     "The Design and Implementation of a Log-Structured File System" [1991]

  *  Tweedie, Stephen C (1998)
     "Journaling the Linux ext2fs Filesystem" (PDF), The Fourth Annual
     Linux Expo.

  *  McKusick, M. and Ganger, G. (1999). 
     "Soft Updates: A Technique for Eliminating Most Synchronous Writes in
      the Fast Filesystem."
     USENIX Annual Technical Conference. 1-18.  

  *  Seltzer, Margo I; Ganger, Gregory R; McKusick, M Kirk, 
     "Journaling Versus Soft Updates: Asynchronous Meta-data Protection in
      File Systems",
     2000 USENIX Annual Technical Conference (USENIX Association).

  *  Seltzer, M. et al. (2000).  
     "Journaling Versus Soft Updates: Asynchronous Meta-data Protection in
      File Systems."
     USENIX Annual Technical Conference. 71-84.

  *  McKusick, M. (2002). "Running "fsck" in the Background." 
     Proceedings of the BSDCon 2002. 55-64.

  *  Hans Reiser
     Reiser4 whitepaper of 2003

  *  Daniel Ellard, Jonathan Ledlie, and Margo Seltzer
     "The Utility of File Names" [2004]

  *  You should also look at the ext3, ext4, and HFS papers, 
     as well as descriptions of NFS under *both* UDP and TCP
     implementations, which are not the same.

  *  Also, if you google for Kirk, he's got notes out there from
     this year, like the talk he just presented in Asia.

Do that, *then* get back to us, ok? :-)

And yes, I *am* serious about that.

But if that's not enough reading for you, you might try these:

  *  Maurice J. Bach, The Design of the UNIX Operating System, 
     Prentice Hall, 1986.

  *  Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, and
     John S. Quarterman, The Design and Implementation of the
     4.4BSD Operating System, Addison Wesley, 1996.

  *  Leffler, et al., The Design and Implementation of the 4.3
     BSD Unix Operating System, Addison Wesley, 1989.
