Subject: Filesystems? *I'll* give ya filesystems!
Date: Friday 20th March 2009 12:21:45 UTC
Mark Mielke wrote, and correctly, in replying to Michael Schwern's lamentably incorrect and misled posting, the following:

>> If you really want a close() that doesn't flush, for those
>> ten people out there writing Perl code where it's a good
>> idea, give a Perl filehandle option (we do have IO::Handle
>> after all) or pragma to turn it off. But for god's sake,
>> default it to on.

>> It's even right there in the docs, so really this whole thing
>> is just a bug.

Wrong.

>>     close FILEHANDLE
>>     close   Closes the file or pipe associated with the file
>>             handle, flushes the IO buffers, and closes the
>>                     ^^^^^^^^^^^^^^^^^^^^^^
>>             system file descriptor.

>> Language lawyers may say "oh, but that's Perl's IO buffer not
>> the filesystem's buffer" or whatever. Bullshit, the user
>> doesn't draw so fine a line.

Schwern, *please* don't be stupid--especially belligerently so. I can see I'm going to have to thwap you with a weekend's worth of homework. Happy reading.

>> They don't know or care if Perl does it or the OS does it or
>> magic ponies do it. It's no comfort that we're technically
>> correct when their data is lost. The purpose is to make
>> closing a filehandle safe.

You're old enough to know better than this. LISTEN TO ME: just because you edit a file with some program, "write out" your changes, close the file, and exit the program, IN NO FASHION WHATSOEVER GUARANTEES that a physical write has occurred. It never has. Moreover, this is not a Perl problem; it is not a C problem. In fact, it is not a problem at all. It is a design requirement, goal, and feature. *ALL* programs work this way; I don't care whether it's vi or sh or cat or perl or emacs or open office. They *all* make no such silly "guarantee" as you seem to be ignorantly demanding.

There is one, nearly unique exception to this: fsck itself. You remember fscks, don't you, Schwern? fscks? What are *those* for, anyway?
Do you know *why* fsck stands apart as the lone exception to the rule, and do you know *how* it does what it must? Do you know why the filesystem must be unmounted to be fsck'd? It's because fsck does *not* use the block special device that the filesystem is mounted on. It uses the character special device. It does this because you *cannot* go through the block buffer cache when what you're trying to diagnose is errors made by that very system! You *must* go through the raw interface, which is the other name for the corresponding character special device, and NOT involve the buffer cache at all. I don't care whether you call it /dev/disk0s2 and /dev/rdisk0s2, or /dev/sd0d and /dev/rsd0d, or whatever. These pairs exist as an integral part not just of the "filesystem", but of the entire VM system that's supporting you.

There's a lot more going on here than you seem to be respectful of. The interactions with the kernel's buffer cache, the low-level device drivers, the upper-level VM system with paging and perhaps DMA locked-down pages, optimal scheduling of numerous simultaneous requests in the read-ahead/write-behind system, and several other aspects of the system are *all* tied together, inextricably. Trying to play God by fsync'ing your file descriptor is, first of all, a very selfish thing, and second of all, rather less effective than you seem to think it is.

Do you know what happens when you type sync(8)? It calls sync(2). Ever notice how fast it comes back? Do you know why? BECAUSE IT IS NOT A BLOCKING CALL. All it does is preëmpt the normally delayed flushing of dirty blocks by expiring them all. It does *not* write them. They just go on the various disk queues to be written to each device, but that doesn't get them there--yet. Once upon a time, and on some systems still, there was a user process called update(8) that did this sort of thing every 30 seconds or so.
On more modern filesystems, the old update(8) daemon has been integrated into the kernel's block buffer system, and so this is done in a more staggered fashion. Nevertheless, a sync(2) will expire them all. Eventually. For even *it* does not guarantee that they made it out to disk. Again, I remind you that it merely puts them on the write queues for the next go-around.

Now, if you weren't in user mode--and superuser is but a user--then you *could* call vflushbuf(9) with an argument to block until that inode's blocks are all safely put to bed. But even that's not the default. You could also call vwaitforio(9), which sleeps [that word means something VERY DIFFERENT to the kernel, you know; check out WCHAN from ps(1)] until all asynchronous writes (if such there be) to that vnode are complete. But again, we're not talking user-space activity here. This is no realm that *PROGRAMS* have any business playing in. And in fact, they cannot.

You really should read the buffercache(9) manpage, you know. Once done there, go to /sys/kern/vfs_syscalls.c, and regard the sys_sync() and sys_fsync() functions. After you're done trying to figure out where the fsync vnode ops really are for FFS (try /sys/sys/vnode.h), you still don't know for sure. You have to go look in /sys/kern/vfs_sync.c for the code you're really looking for. There you'll also find what became of update(8): it's now the sched_sync() function found therein. And don't get too thrilled by speedup_syncer(), as it's only called from

    /*
     * If memory utilization has gotten too high, deliberately slow things
     * down and speed up the I/O processing.
     */
    STATIC int
    request_cleanup(resource, islocked)

in /sys/ufs/ffs/ffs_softdep.c anyway. Speaking of which, we're at the vnode level here, so you don't really know what goes on. It depends on your particular filesystem type. You'll find ffs_fsync() in /sys/ufs/ffs/ffs_vnops.c.
Pay attention to what ap->a_waitfor == MNT_WAIT is really doing there: it's not so simple as you might think.

> I tracked and commented on a similar thread for glib/gio. There
> are a lot of invalid expectations that people hold, and people
> are reacting to this problem without understanding it.

I'm surprised--and disappointed, too--but I guess I shouldn't be, that so many people fail to understand all this. It's nothing new under the sun.

> First - POSIX and UNIX have never promised to fsync() on
> close() that I am aware of.

No, they have not. It has always been this way. Check the V7 code. And all the rest of my references throughout this epistle.

> The performance cost is unreasonable. Consider that every disk write
> may involve a disk seek, and disks are still in the 5ms - 10ms range
> for seek latency. Being unable to close more than 100 file descriptors
> a second is not reasonable. Do we want fsync() on every close()? I
> don't think so.

BINGO! Not only is it piggishly selfish in a way no other program is, it may not even do you as much good as you think it does: RTFM graciously included below. Much depends on how that f/s type's vnode ops for fsync implement the same. And if you're running on top of something *ELSE*, oh, like say Mach, what are you really doing, since you never know whether it's all virtualized anyway?

You absolutely do *NOT* want to micromanage your operating system, Schwern! Better to just throw it away altogether and give up. Very smart people have spent a lot of hard work over decades to make a multitasking operating system increasingly faster and more reliable. And you choose to overrule them. On what basis, sir?

> close() doesn't fsync() - it does not have to, and it's
> not a very good idea for applications to explicitly do in
> general either. It has a very real performance cost. If
> Perl arbitrarily switches to forcing fsync() on close(),
> Perl is going to seem much slower for certain types of
> applications.
And so will other applications. Do not play God with the buffer-cache meta-system UNTIL AND UNLESS you can show that you know more than its makers. This remains unproven, and I trust that state to persist indefinitely.

> The problem here is "what happens when my system crashes?"
> A lot of people seem surprised that system crashes easily
> caused corruptions in the past and still do today. How
> many people ranting about this subject do not realize that
> their hard disk comes from the factory in "write caching"
> mode, such that your file system can become corrupted even
> with a fully functional journalling file system?

Next to nobody.

> For the common case with write() and close(), we generally
> do not care. If the file system comes back up and it's at
> an earlier state than the instant the system died - nobody
> should be surprised.

> The cause for surprise is when the file system has an
> *inconsistent* state. The rename() case described here is
> a write() followed by a close() followed by a rename().
> The assumption of many has been that close() and rename()
> are guaranteed to be run in sequence such that if the
> close() fails, the rename() fails. The conclusion is
> invalid. close() touches the file. rename() touches the
> directory. Since they're touching different parts of the
> file system - why would they be guaranteed to happen in
> sequence?

I'm not really sure why you think it would matter even if they *WERE* out of sequence. After all, close(2) acts on a file descriptor, something that's already been through namei(9) and so doesn't *care* about names. Whereas rename(2) acts on filenames. You can rename files at your whim while still writing to them. Or you'd better be able to!! Well, on most file systems, that is.

But it's invalid for several other reasons. Not all filesystems implement an atomic rename(2).
There are also propagation issues involved in UDP/IP-based implementations of NFS in which most operations are not idempotent--although DIRECTORY creation and deletion do propagate synchronously to guarantee consistency, that of mere files within them does *NOT*. Thus multiple link(2)s, unlink(2)s, and creat()s [O_EXCL|O_CREAT] can all succeed on non-directory inodes, even though this violates many sacred cows. It's due to problems inherent to the UDP protocol's penchant for dropping, duplicating, mutating, or misordering packets. And I believe I recall that TCP/IP-based implementations may have less trouble with this.

Kirk went to a great deal of trouble getting rename() to be atomic for directory inodes in FFS--it's not in UFS. Dennis warned him that "that would be rather tricky/difficult to get right." Not having fully explored the problemspace as Dennis had, Kirk didn't recognize the trouble he was setting himself up for. But he *did* manage it in the end. And when he came back to Dennis, he said something to the effect of how he hadn't realized the severity of Dennis's typical understatement, and that it *WAS* a bugger of a problem.

> In ext3 this difference is highlighted in writeback vs ordered
> journal mode. The writeback mode exhibits the behaviour that
> rename() can happen without write()/close() leaving the file
> empty on system startup. The ordered mode is a bit hacky - it
> ensures that data is flushed before metadata. This guarantee
> covers for problems such as write()/close()/rename().

That's what I would have imagined it to do.

> In ext4 it seems that they tried to lose this hacky bit -
> but it exposed the applications that make assumptions
> about close() always being scheduled before rename().

Sounds like a dumb NFS bug, or the stupidity of the mis-implementation of fcntl's file locking on names not descriptors on SysV. I complained to Dennis about this, and he looked at me quizzically, saying, "But that's just wrong."
Of course it was wrong. Somebody messed up the spec, and propagated it, without thinking. These things are touchy.

> This has raised a lot of mixed opinions. Mostly, because
> people are surprised and have a knee jerk reaction that
> things return to the bubble of safety that they once were.

Or that they thought they were.

> I do not support adding fsync() everywhere, and not on
> close() either. It's silly in my opinion.

Mark, you say it so much more gently than I. :-)

> The only time it is needed is when the application wishes
> to make a stronger guarantee about the state of the file
> system before continuing. The write()/close()/rename() is
> exactly this sort of case. The goal is to accomplish an
> atomic-change-in-place effect. The goal cannot be safely
> achieved without ensuring that the file is correctly
> written to disk before doing the rename. This requires
> write()/fsync()/close()/rename(). This is similar to how a
> database engine needs to fsync() to ensure consistent
> state of a data file before continuing.

Now this I tend to disbelieve. There is some sort of filesystem bug here if that be true. I can argue this from first principles, too, because something that is omissible yet must always be done to ensure a correctly behaving program--AND ESPECIALLY A CORRECTLY BEHAVING KERNEL-IMPLEMENTED FILESYSTEM--is not something you should *ever* leave up to the user.

sync, sync, sync.... Ever wonder why one types:

    % sync
    % sync
    % sync

The first one schedules the writes, and comes right back to you. The next one you type a little bit later to make you feel better. On some systems, the 2nd sync(2) might have blocked till the kernel finished with the 1st one, but mostly, it's about timing. And the third? Because of course you aren't sure whether you remembered whether you typed that second one, so just in case. :-) I am not kidding in the very least, and this is not apocryphal.
Larry and Dennis and I were dining together, just the three of us, and this came up. Dennis supplied the first two answers, and Larry the last one--which caused Dennis to laugh really hard, because Larry got it exactly right: it's just the insecurity of not being sure you'd done the second one. It was utterly hilarious, because we'd all done it a zillion times.

Have you ever worked on buffer-cache code and the VM system? I have. And in more than one job and on more than one operating system. I *do* know just a little bit, maybe not much, but apparently more than you need to know, about what we're discussing. I've seen a lot of Unix filesystems, including ones with much odder characteristics than these. Ever experienced the joys of async *READS* into wired-down DMA memory that got bus-transferred? Schwern, if you care to debate the Unix filesystem, you'd best have a lot of theory and practice under your belt. As Andrew has said, Unix has its weaknesses, but its filesystem isn't one of them.

Have *you* read the many academic papers, starting from the earliest UFS work, then moving on to Kirk's FFS, the whole vnode layer thing, Kirk's later work on soft updates--and what *those* are all about, anyway? Have you read about the early forays into journaling or log-based filesystems, with many papers going back and forth between Margo and Ouster? Or the later work done on some of the more exotic systems, now more commonplace? Once you've done all that, then fine: we can start from a common knowledge base. But until then, I'm skeptical, because I don't see much evidence that you understand what you're saying, Schwern.

I include references to online documentation throughout this message, but at the end, you will find more proper references to more formal work, and it is these especially to which I should like you to divert your time and attention. You need to understand a great deal more about the problemspace to appreciate it, and it truly appears that you currently fail to do so.
Drastically.

> Does Perl have any code that does atomic-change-in-place effect
> using rename()? If it does, it should run fsync().

Huh? That smells wrong. Atomic rename isn't new, and providing we're not talking directory inodes, hardly rocket science. Look at vfs_bio.c and then the filesystem's vnode ops that implement these, and explain how what you're saying can be true. I believe it shouldn't be in Perl. I believe it shouldn't be in libc. I believe it should be in the kernel, if needed. And that demesne is not ours to plunder or poke, no matter how strong be our will.

> Note that fsync() should not be the same as autoflush.
> Autoflush writes the data to the operating system. This
> does not and has never guaranteed that the data is written
> to disk.

Yup. And I still don't think fsync(2) guarantees what you think it may. Besides the code citations and references in /sys/ that make me dubious, this pretty much seals the deal that it won't work:

    % man 2 fsync

    NAME
         fsync -- synchronize a file's in-core state with that on disk

    SYNOPSIS
         #include