From: Anand Avati <anand.avati-Re5JQEeQqe8AvxtiuMwx3w <at> public.gmane.org>
Subject: Re: regressions due to 64-bit ext4 directory cookies
Newsgroups: gmane.comp.file-systems.gluster.devel
Date: Wednesday 13th February 2013 21:21:06 UTC
> My understanding is that only one frontend server is running the server.
> So in your picture below, "NFS v3" should be some internal gluster
> protocol:
>                                                    /------ GFS Storage
>                                                   /        Server #1
>    GFS Cluster     NFS V3      GFS Cluster      -- gluster protocol
>    Client        <--------->   Frontend Server  ---------- GFS Storage
>                                                 --         Server #2
>                                                   \
>                                                    \------ GFS Storage
>                                                            Server #3
> That frontend server gets a readdir request for a directory which is
> stored across several of the storage servers.  It has to return a
> cookie.  It will get that cookie back from the client at some unknown
> later time (possibly after the server has rebooted).  So their solution
> is to return a cookie from one of the storage servers, plus some kind of
> node id in the top bits so they can remember which server it came from.
> (I don't know much about gluster, but I think that's the basic idea.)
> I've assumed that users of directory cookies should treat them as
> opaque, so I don't think what gluster is doing is correct.

NFS uses the term "cookies", while the man pages of readdir/seekdir/telldir
call them "offsets". RFC 1813 only talks about communication between an NFS
server and an NFS client. While knfsd performs a trivial 1:1 mapping of
d_off "offsets" onto these "opaque cookies", the gluster issue at hand is
that it made assumptions about the nature of these "offsets" (that they
represent some kind of true distance/offset and therefore fall
within some kind of bounded magnitude -- somewhat like inode
numbering), and performs a transformation (instead of a trivial 1:1
mapping) like this:

  final_d_off = (ext4_d_off * MAX_SERVERS) + server_idx

thereby using a few more top bits while keeping the ability to perform a
reverse transformation to "continue" from a previous location.  As you can
see, final_d_off now overflows for very large values of ext4_d_off. This
final_d_off is used both as the cookie in the gluster-NFS (userspace) server
and as the d_off entry parameter in the FUSE readdir reply. The gluster / ext4
d_off issue is therefore not limited to gluster-NFS; it also exists in the
FUSE client, where NFS is completely out of the picture.
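
To make the overflow concrete, here is a minimal sketch of the
transformation above and its reverse. It is not gluster's actual code:
MAX_SERVERS = 16, the function names, and treating d_off as an unsigned
64-bit value that wraps on overflow are all assumptions made for
illustration.

```python
# Illustrative sketch only; not taken from the gluster sources.
MAX_SERVERS = 16            # hypothetical number of storage servers
U64_MASK = (1 << 64) - 1    # assumption: d_off modelled as unsigned 64-bit


def encode_d_off(backend_d_off, server_idx):
    """final_d_off = (backend_d_off * MAX_SERVERS) + server_idx,
    truncated to 64 bits the way a fixed-width d_off field would be."""
    return (backend_d_off * MAX_SERVERS + server_idx) & U64_MASK


def decode_d_off(final_d_off):
    """Reverse transformation: recover (backend_d_off, server_idx)."""
    return final_d_off // MAX_SERVERS, final_d_off % MAX_SERVERS


def round_trips(backend_d_off, server_idx):
    """True if decoding the encoded value gives back the original pair."""
    encoded = encode_d_off(backend_d_off, server_idx)
    return decode_d_off(encoded) == (backend_d_off, server_idx)


if __name__ == "__main__":
    small = 0x7FFFFFFF            # fits in 31 bits: headroom remains
    large = 0x7FFFFFFFFFFFFFFF    # uses 63 bits: no headroom left
    print(round_trips(small, 2))  # small offsets survive the round trip
    print(round_trips(large, 2))  # top bits are lost after the multiply
```

With small backend offsets the multiplication stays inside 64 bits and
the reverse transformation recovers both the offset and the server
index. Once the backend returns values that already use the high bits,
the multiply pushes those bits out of the field; the server index is
still recoverable, but the backend offset is not.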

You are probably right in that gluster has made different assumptions about
the "nature" of values filled in d_off fields. But the language used in all
man pages makes you believe they were supposed to be numbers representing
some kind of distance/offset (with bounded magnitude), and not a "random"

This had worked (accidentally, you may call it) on all filesystems
including ext4, as expected. But after a kernel upgrade, only ext4-backed
deployments started having problems, and we have been advising our users to
either downgrade their kernel or use a different filesystem (we really do
not want to force them into making a choice of one backend filesystem vs

You can always say "this is your fault" for interpreting the man pages
differently and punish us by leaving things as they are (unfortunately leaving
a big chunk of users who want both ext4 and gluster in jeopardy). Or you
can be kind, generous and considerate to the legacy apps and users (of
which gluster is only a subset) and simply provide a mount option to control
the large d_off behavior.
