From: Daniel Phillips <phillips <at> phunq.net>
Subject: Feature interaction between multiple volumes and atomic update
Newsgroups: gmane.comp.file-systems.tux3
Date: Saturday 30th August 2008 00:41:43 UTC
It turns out that multiple independent volumes sharing the same
allocation space is a feature that does not quite come for free as I
had earlier claimed.  The issue is this:

 * Tux3 guarantees that when fsync (or other filesystem sync) returns
   then the entire volume including all subvolumes is in a consistent
   state.  In particular, any block in use by the subvolume being
   synced is persistently recorded as in use, and no block that is not
   in use by (the persistent image of) any subvolume is recorded as in
   use.

 * It is desirable that an fsync apply only to the subvolume being
   synced, even if other subvolumes are mounted and in use at the same
   time.  Otherwise, syncing a given subvolume would require time
   proportional to the number of subvolumes simultaneously mounted,
   which would be a regression compared to having the volumes actually
   separate.  Since the multiple subvolume feature has a marginal use
   case anyway, such a drawback would verge on being fatal for this
   feature.

 * Therefore it seems logical that Tux3 should have a separate forward
   log for each subvolume to allow independent syncing of subvolumes.
   But global allocation state must always be consistent regardless of
   the order in which subvolumes are synced.

 * We do not want to have a separate log dedicated to block allocation
   because that would require updating two logs in many cases where
   only one log update would otherwise be required.

 * An unexpected interruption may occur when any combination of
   subvolumes is mounted and active.  But on restart, nothing requires
   that the same set of subvolumes be remounted.

 * If a subvolume is not mounted, then it is not desirable for Tux3 to
   recreate the cache state of that subvolume.  Recreating cache state
   is fundamental to the Tux3 integrity recovery design.  In other words,
   we do not want to replay the log into cache for every subvolume that
   was mounted at the time of a crash.
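
To make the constraints above concrete, here is a minimal Python model of
per-subvolume forward logs that carry block allocation changes.  It is only
a sketch, not Tux3 code: the record format ("alloc"/"free", block number),
the names apply_log, logs and persistent, and the assumption that a block
is owned by at most one subvolume at a time are all mine, for illustration.

```python
from itertools import permutations

def apply_log(in_use, log):
    """Apply one subvolume's (op, block) records to a set of in-use blocks."""
    result = set(in_use)
    for op, block in log:
        if op == "alloc":
            result.add(block)
        else:  # "free"
            result.discard(block)
    return result

# Hypothetical per-subvolume forward logs left behind by a crash.
logs = {
    "vol_a": [("alloc", 10), ("alloc", 11), ("free", 10)],
    "vol_b": [("alloc", 20), ("free", 3)],
}
persistent = {3, 4}  # blocks recorded as in use in the last consistent image

# Assumption: the logs touch disjoint blocks, so the order in which the
# subvolume logs are applied does not change the final allocation state.
states = set()
for order in permutations(logs):
    state = persistent
    for name in order:
        state = apply_log(state, logs[name])
    states.add(frozenset(state))

print(states)  # a single state, whichever order was used
```

Under that assumption the order of replay is harmless.  The catch is that
arriving at the correct global state still means reading every log,
including the logs of subvolumes that nobody has asked to remount.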

So what do we do?  Some ideas:

 1) Drop the multiple subvolume feature.

 2) When the first subvolume is remounted after a crash, scan all other
    subvolumes for allocation changes, roll those up into a dedicated
    allocation log, and mark in the dedicated allocation log the highest
    log sequence numbers of the subvolume logs that were rolled up into
    the allocation log.

 3) When the first subvolume is remounted after a crash, implicitly
    remount and replay all subvolumes that were also mounted at the time
    of the crash, roll up the logs, and unmount them.

 4) Partition the allocation space so that each subvolume allocates
    from a completely independent allocation space, which is separately
    logged and synced.  Either implement this by providing an
    additional level of indirection so that each subvolume has its own
    map of the complete volume which may be expanded from time to time
    by large increments, or record in each subvolume allocation map
    only those regions that are free and available to the subvolume.
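
Idea 2 can be sketched roughly as follows.  This is a minimal Python model
under stated assumptions, not a description of how Tux3 lays out its logs:
the (seq, op, block) record tuple, the dictionary layout, and the names
roll_up and rolled_up_to are hypothetical.

```python
def roll_up(subvolume_logs, allocation_log):
    """Fold not-yet-seen allocation records into the dedicated allocation log.

    subvolume_logs: {name: [(seq, op, block), ...]}, each list ordered by seq
    allocation_log: {"records": [...], "rolled_up_to": {name: highest seq}}
    """
    for name, log in subvolume_logs.items():
        high = allocation_log["rolled_up_to"].get(name, 0)
        for seq, op, block in log:
            if seq <= high:
                continue  # already rolled up on an earlier pass
            if op in ("alloc", "free"):
                allocation_log["records"].append((name, seq, op, block))
            high = seq  # non-allocation records still advance the mark
        allocation_log["rolled_up_to"][name] = high
    return allocation_log

logs = {
    "vol_a": [(1, "alloc", 10), (2, "write", 10), (3, "free", 7)],
    "vol_b": [(1, "alloc", 20)],
}
alloc_log = {"records": [], "rolled_up_to": {}}
roll_up(logs, alloc_log)

logs["vol_b"].append((2, "alloc", 21))
roll_up(logs, alloc_log)  # second pass picks up only the new record
```

Recording the highest rolled-up sequence number per subvolume is what makes
the scan safe to repeat: a later pass, or a later mount of one of the other
subvolumes, can tell which records are already reflected in the allocation
log.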

I am tending towards solutions 2 or 4 at the moment, though there are
no doubt other approaches I have not considered.  The main goal is to
avoid such complexity as to devalue the attractiveness of the subvolume
feature, which as I said earlier is not a feature anybody has actually
asked for.

Solution 4 seems to encroach on the territory of the volume manager,
something Tux3 wishes to avoid.  We would be better advised to improve
the volume manager so that it is capable enough to provide such
incremental allocation itself in a way that maps well to the needs of
filesystems such as Tux3.
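
Whichever layer ends up owning it, the incremental allocation in question
might look something like the sketch below.  It is a toy Python model, not
an existing volume manager interface: RegionPool, Subvolume, grant and
alloc are hypothetical names, and it assumes the volume size is a multiple
of the region size.

```python
class RegionPool:
    """Hands out large fixed-size block regions (hypothetical shared layer)."""

    def __init__(self, total_blocks, region_blocks):
        # Assumption: total_blocks is a multiple of region_blocks.
        self.region_blocks = region_blocks
        self.free_regions = list(range(0, total_blocks, region_blocks))

    def grant(self):
        # Raises IndexError when no regions are left.
        return self.free_regions.pop(0)


class Subvolume:
    """Allocates blocks only from regions it has been granted."""

    def __init__(self, pool):
        self.pool = pool
        self.free_blocks = []  # private free space, logged per subvolume

    def alloc(self):
        if not self.free_blocks:
            # Rare, coarse-grained update of the shared state.
            start = self.pool.grant()
            self.free_blocks = list(range(start, start + self.pool.region_blocks))
        return self.free_blocks.pop(0)


pool = RegionPool(total_blocks=4096, region_blocks=1024)
a, b = Subvolume(pool), Subvolume(pool)
blocks_a = [a.alloc() for _ in range(3)]
blocks_b = [b.alloc() for _ in range(2)]
```

The point of the large increment is that ordinary block allocation touches
only the subvolume's own state, so each subvolume can be logged and synced
independently; the shared pool changes only when a subvolume runs out of
space.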

I CC'd this one to Matt Dillon, perhaps mainly for sympathy.  Hammer
does not have this issue as it does not support subvolumes, perhaps

