From: Daniel Phillips <phillips <at> phunq.net>
Subject: [RFC] An alternative interface to device mapper
Newsgroups: gmane.linux.kernel.device-mapper.devel
Date: Wednesday 5th March 2008 09:29:42 UTC
Q: If you have an axe with a rusty head and a rotten handle, how do you 
repair it?

A: It is a two step process.  First you replace the handle, then you 
replace the head.

At one time, volume management on Linux was new and shiny.  Today it has 
fallen well behind Sun, FreeBSD, NetApp and even Microsoft.  In order 
to avoid losing even more of the storage "market" than we have already 
lost, we must make a concerted effort to catch up with the state of the 
art, or ideally take the lead as has proved possible with so many other 
aspects of Linux.

This RFC is about replacing the handle of our not-so-shiny LVM axe.  
Design goals for this alternative "ddsetup" interface are:

  * Convenient to embed in a C program
  * Make it simple enough that a library is unnecessary
  * Support for creating detailed, accurate error messages
  * Error messages delivered to caller rather than logged
  * Naturally extensible as new requirements emerge
  * 32 bit ABI works on 64 bit kernel without translation
  * Avoid bad API practices identified by [ARND 07]
  * Do not break the existing ioctl interface

The patch below includes a new kernel interface generator called ddlink.  
The ddsetup device mapper interface is an instance of a ddlink 
interface, instantiated by supplying domain-specific methods for read, 
write, ioctl and poll.

In more detail: ddlink is a generic pipe-like interface for controlling 
device drivers.  It was inspired by Trond's venerable and successful 
rpc-pipefs, which he invented to control various aspects of NFS server 
and client operation.  ddlink takes the form of a virtual filesystem 
with no namespace.  It provides application programs with fd objects 
that can be read, written, ioctled and polled, suitable for efficient 
binary communication with kernel components.  Read, write and poll 
operations act similarly to a pipe.  Unlike a pipe, there is no write 
buffering.  Each write to a ddlink directly triggers some kernel 
handler.  Reads are buffered via an output queue of ddlink "items", each 
of which is an unrestricted blob.  Ioctls on ddlinks are unrestricted 
and the ioctl command space is unpolluted.

There are no partial reads on ddlinks.  A read call either provides 
enough space to hold the next outbound kernel item or EIO is triggered, 
meaning "make your buffer bigger and try again".  In practice this 
arrangement takes the onus off the userspace program to buffer partial 
reads in order to reassemble input that would otherwise be brutally 
dismembered.  As a bonus, the kernel code for ddlink is considerably 
simplified versus Trond's rpc-pipefs precursor.

Unlike a pipe, there is no waiting for input on a ddlink: if there is 
nothing to read then the read returns immediately with zero length.  If 
some other behavior is desired it can be obtained using poll.
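
A minimal userspace sketch of the read rules above.  The helper names
read_item and wait_item are hypothetical, and the sketch assumes that
an item stays queued after an EIO so that a retry with a bigger buffer
can pick it up:

```c
#include <errno.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read one whole item from fd into a malloc'd buffer.
 * Returns the item length (0 means nothing was queued) or -1 on error.
 * On success *out holds the data and the caller frees it.
 * Assumption: EIO means "buffer too small" and the item stays queued.
 */
static ssize_t read_item(int fd, char **out)
{
	size_t size = 64;
	char *buf = NULL;

	for (;;) {
		char *bigger = realloc(buf, size);
		if (!bigger) {
			free(buf);
			return -1;
		}
		buf = bigger;

		ssize_t len = read(fd, buf, size);
		if (len >= 0) {
			*out = buf;
			return len;
		}
		/* Give up on other errors, or if the size gets absurd. */
		if (errno != EIO || size > (1 << 20)) {
			free(buf);
			return -1;
		}
		size *= 2;	/* item did not fit: retry with more room */
	}
}

/* Wait until the fd is readable, for callers that want to block. */
static int wait_item(int fd, int timeout_ms)
{
	struct pollfd p = { .fd = fd, .events = POLLIN };

	return poll(&p, 1, timeout_ms);	/* > 0 ready, 0 timeout, -1 error */
}
```

A caller that wants pipe-like blocking would call wait_item first and
then read_item; a caller that is happy to get a zero length back when
nothing is queued can call read_item directly.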

There is a simple framework to provide for generalized allocation and 
destruction of dditems, the internal transaction unit for a ddlink.  
Finally, there is a small library of helper functions that are useful 
for creating domain-specific ddlink interfaces.  The core code for 
ddlink is about 150 lines of lindented C, plus 100 lines of library 
functions and a header file that has already ballooned to the 
intimidating size of 50 lines.  In other words, ddlink is about as 
light and tight as an interface gets.  It is also highly efficient, 
flexible and extensible, and requires very little boilerplate code.

I do not know whether the lack of a ddlink namespace is a bug or a 
feature.  In current usage, a ddlink fd is delivered to an application 
program by some out of band means.  For example, to get a ddsetup ddlink 
to control device mapper, you ioctl /dev/mapper/control.  Makes sense? 
One can imagine many other methods of obtaining a ddlink, for example, 
by reading a fd number in ascii from a sysfs file (gah).  In practice, 
lacking a namespace feels more like a feature than a bug.

A quick tour of the ddlink library:

int ddlink(struct file_operations *fops,
	void *(*create)(struct ddinode *dd, void *info), void *info)

   Create a new ddlink fd that will use the supplied file operations.
   An optional create method may be supplied and arbitrary state 
   information supplied via "info".  (I am not sure why I set up the
   create as a callback, but I do recall that when I tried to do it
   otherwise, conciseness of usage was degraded.  I might revisit this.)

void ddlink_queue(struct ddinode *dd, struct dditem *item)

   Queue an output data item onto the tail of the output queue.

void ddlink_push(struct ddinode *dd, struct dditem *item)

   Push an output data item onto the head of the output queue.  Useful
   for error messages or transaction-type calls that can return data
   without disturbing any preexisting queue contents

struct dditem *ddlink_pop(struct ddinode *dd)

   Remove a data item from the front of the queue

struct dditem *dditem_new(struct ddinode *dd, size_t size)

   Create a new dditem with the indicated data size, whose char data
   area is available as item->data

void ddlink_clear(struct ddinode *dd)

   Destroy all queued data items

int ddlink_ready(struct ddinode *dd)

   Returns true if the ddlink queue is nonempty

unsigned ddlink_poll(struct file *file, poll_table *table)

   A simple poll method that only supports polling for input,
   nearly always just what is needed

int ddlink_error(struct ddinode *dd, int err, const char *fmt, ...)

   Format and push an error item onto the queue so that the error
   text will be retrieved by the next read from the ddlink
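
To make the queue discipline concrete, here is a small userspace model
of it.  This is not the kernel code from the patch: the struct layout
and the q_* names are stand-ins invented for illustration, mirroring
ddlink_queue, ddlink_push, ddlink_pop, ddlink_ready and ddlink_clear:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for struct dditem and the ddinode queue. */
struct item {
	struct item *next;
	size_t size;
	char data[];		/* payload, as with item->data */
};

struct queue {
	struct item *head, *tail;
};

/* Allocate an item holding a copy of text (NUL kept for convenience). */
static struct item *item_new(const char *text)
{
	size_t size = strlen(text);
	struct item *it = malloc(sizeof *it + size + 1);

	if (!it)
		return NULL;
	it->next = NULL;
	it->size = size;
	memcpy(it->data, text, size + 1);
	return it;
}

/* Like ddlink_queue: append at the tail, so ordinary output is FIFO. */
static void q_queue(struct queue *q, struct item *it)
{
	it->next = NULL;
	if (q->tail)
		q->tail->next = it;
	else
		q->head = it;
	q->tail = it;
}

/* Like ddlink_push: insert at the head, so the next read sees it first. */
static void q_push(struct queue *q, struct item *it)
{
	it->next = q->head;
	q->head = it;
	if (!q->tail)
		q->tail = it;
}

/* Like ddlink_pop: detach and return the front item, or NULL if empty. */
static struct item *q_pop(struct queue *q)
{
	struct item *it = q->head;

	if (!it)
		return NULL;
	q->head = it->next;
	if (!q->head)
		q->tail = NULL;
	it->next = NULL;
	return it;
}

/* Like ddlink_ready: true if something is waiting to be read. */
static int q_ready(const struct queue *q)
{
	return q->head != NULL;
}

/* Like ddlink_clear: free everything still queued. */
static void q_clear(struct queue *q)
{
	struct item *it;

	while ((it = q_pop(q)))
		free(it);
}
```

In this model an error pushed with q_push jumps ahead of data that was
queued earlier, which is the property that lets an error message be
retrieved by the very next read.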

So that is ddlink, short and sweet.  Only a couple of trivial glue 
functions were omitted.  The remainder of this note is about ddsetup, 
which is the single extant example of a ddlink interface.

Device Mapper is actually a lot more capable than most people know.  
Each device mapper device consists of two layers: the virtual device 
exposed to applications, and an underlying table of "target devices", 
each of which is effectively a virtual block device in itself.

The "map" part of device mapper is about translating each bio directed 
at the virtual device into one or more bio transfers to some contiguous 
subset of the underlying device table.  In addition, device mapper 
implements stacking, whereby additional layers of virtual devices can 
be inserted into an existing top level virtual device while that device 
remains open and in use.  Details of how this is accomplished are 
surprisingly simple, but outside the scope of today's note.  The 
important thing here is to get a sense of just how rich the device 
mapper interface needs to be in order to expose the full range of 
device mapper capabilities to application programs.
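
As a rough illustration of the "map" step, the sketch below models a
table of linear targets only and translates one virtual sector into a
sector on an underlying device.  It is a simplified userspace model,
not device mapper code: struct extent, map_sector and the integer
device index are all hypothetical names chosen for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* One table row: a run of virtual sectors backed by one device. */
struct extent {
	uint64_t start;		/* first virtual sector covered by the row */
	uint64_t sectors;	/* length of the run */
	int device;		/* hypothetical index of the underlying device */
	uint64_t offset;	/* first sector of the run on that device */
};

/*
 * Translate a virtual sector.  Returns the matching row and stores the
 * underlying sector in *mapped, or returns NULL if nothing maps it.
 */
static const struct extent *map_sector(const struct extent *table,
				       size_t rows, uint64_t sector,
				       uint64_t *mapped)
{
	for (size_t i = 0; i < rows; i++) {
		const struct extent *e = &table[i];

		if (sector >= e->start && sector - e->start < e->sectors) {
			*mapped = e->offset + (sector - e->start);
			return e;
		}
	}
	return NULL;
}
```

With a single row { 0, 10000, 0, 1234 }, virtual sector 5 maps to
sector 1239 of device 0, and any sector at or beyond 10000 is unmapped.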

Device mapper is currently exposed to userland via a stupefyingly 
complex interface, in which 16 different device mapper subfunctions are 
multiplexed via ioctls through one grand unified parameter structure.  
This interface has proved so unwieldy that exactly one userspace program 
uses it, namely libdevmapper.  Unfortunately, libdevmapper brings its 
own oddly structured interface to the party, in which a series of ioctl 
calls is recast as a "task", a thoroughly unsuccessful abstraction.  As 
far as I know, the libdevmapper interface is only used by three 
userspace programs: dmsetup, lvm2 and cryptsetup.  There may be others, 
but the point is, if this interface were well suited to its task then 
there would be lots of programs using it by now.  Instead nearly all 
device mapper usage continues to be scripted via dmsetup or (less 
commonly) lvm2 commands, or carried out manually using the lvm2 
interactive interface.

This many years into the effort we ought to be slicing and dicing 
volumes as second nature, changing configuration on the fly, 
transparently expanding, shrinking and migrating filesystems, and many 
other things that ZFS and GEOM are already doing and we are not.  It is 
not so much that device mapper is incapable of such fancy tricks, but 
that we have taken a very powerful kernel subsystem and hobbled it with 
a nearly unusable application interface.  Think about a jet turbine 
racecar with a two inch air intake.

So here I have attempted to create a granular interface to expose the 
same functionality as the existing device mapper ioctl interface does, 
but in a transparent and easy enough way that no library is required, 
and concise enough that when you need to, you can realistically embed
your lvm operations inline in a C program.

The ddsetup design does not abandon ioctls entirely.  Though ddlink is 
perfectly capable of implementing the entire interface via rpc-like 
write and read calls with function codes included in-line, this style 
does not map well onto the expressive capabilities of C.  A mix of 
writes and ioctls ended up looking better on the page and is easier to 

In general, ddsetup uses write calls for variable length data and ioctls 
for fixed length structures.  One could also say: writes for data and 
ioctl commands for, ahem, commands.

There is a little state machine inside a ddsetup fd that keeps track of 
where you are in the midst of a complex call sequence, particularly the 
device create sequence.  Using ddsetup, you push strings onto a stack 
with write calls and turn the strings into more complex objects using 
ioctls.  If you make a mistake and get a -1 error return from any
of the calls, you can then read from the ddlink to get a text 
description of what went wrong, complete with message formatting 
courtesy of the ddlink_error convenience function.

For example, given a ddsetup fd named dd:

   ioctl(dd, DMTABLE, &(struct ddtable){ .targets = 1 });
   write(dd, "linear", 6);
   write(dd, "/dev/hda5", 9);
   write(dd, "1234", 4);
   ioctl(dd, DMTARGET, &(struct ddtarget){ .sectors = 10000 });
   write(dd, "foo", 3);
   ioctl(dd, DMCREATE);
   read(dd, &result, sizeof(result));

Leaving out error handling for clarity, this creates a 10,000 sector 
virtual device named "foo" which is a linear mapping of hda5 starting 
at sector offset 1234, equivalent to the shell command:

   echo 0 10000 linear /dev/hda5 1234 | dmsetup create foo

Except that we did not use the shell, or a library, just a header file 
to name the ioctl commands and provide some simple interface structs.
(Actually, the example above is more complex than necessary. I do not 
think the .targets field serves any useful purpose, and I will make it 
go away soon.)
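
The example leaves out error handling.  A sketch of what it might look
like for the write calls follows.  The helper names dd_error_text and
dd_write are hypothetical, and the sketch assumes the error item is
plain text with no terminating NUL and that it fits in the buffer
supplied:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * After a failed call on a ddsetup fd, fetch the queued error text.
 * Returns the message length, 0 if nothing was queued, or -1 if the
 * read itself failed.  size must be at least 1.
 */
static ssize_t dd_error_text(int dd, char *msg, size_t size)
{
	ssize_t len = read(dd, msg, size - 1);

	if (len < 0) {
		msg[0] = '\0';
		return -1;
	}
	msg[len] = '\0';	/* terminate so it can be printed as a string */
	return len;
}

/* Write one string argument; on failure print the kernel's explanation. */
static int dd_write(int dd, const char *arg)
{
	char msg[256];

	if (write(dd, arg, strlen(arg)) >= 0)
		return 0;
	if (dd_error_text(dd, msg, sizeof msg) > 0)
		fprintf(stderr, "ddsetup: %s\n", msg);
	else
		perror("ddsetup");	/* no text available, fall back to errno */
	return -1;
}
```

The ioctl calls could be wrapped the same way: check for -1, then read
the ddlink to see what went wrong.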

Reflecting device mapper's table structured arrangement, the sequence 
from the "linear" write to the DMTARGET ioctl may be repeated 
arbitrarily many times to build up a complex mapping.  In fact, this is 
how device mapper maps extents from an lvm partition to your 
lvm "partitions".  Easy, no?  Well it is when written as above.

Such mappings are not restricted to linear targets.  Some fancy mappings 
have linear targets at each end and a temporary mirror of two devices 
in the middle.  This is how lvm2 implements pvmove, its clever ability 
to relocate physical targets of a virtual device while the device is 
running.  Powerful, and practically unknown to most Linux users.  
According to me, that is because it is hard to write programs to drive 
such functionality.  As a result, when Linux users do it, they do it by 
hand.  Solaris users are having a party with this kind of thing, and 
laughing at us.  Really.

There is not a lot more to say about ddsetup, which is actually the 
point.  It is pretty obvious how to use it, and how it is implemented.  
I did have to do some pretty serious spelunking in dm-ioctl.c to ferret 
out the bits of device mapper that do the actual work, in some cases 
having to go pretty deep to work past dependencies on the monolithic 
ioctl interface struct.  Some header files needed to be rearranged, 
arguably into the form they should have taken in the first place.  
There are some needlessly strange object lifetime rules to deal with 
internally, but otherwise this was a pretty straightforward romp.

An early version of this code was shown to Eric Biederman and Alasdair 
Kergon at OLS last year.  After taking all the C99 bits out, I managed 
to convince Eric and others to actually read the examples.  Let us now 
see if the (positive) reaction I observed at that time survives wider 

A diffstat for ddlink and ddsetup together:

 Documentation/ioctl-number.txt |    1
 block/ll_rw_blk.c              |    2
 drivers/Makefile               |    1
 drivers/ddlink.c               |  294 ++++++++++++++++++++
 drivers/md/Makefile            |    1
 drivers/md/dm-ioctl.c          |  593
 drivers/md/dm-table.c          |   52 ---
 drivers/md/dm.c                |   53 ---
 drivers/md/dm.h                |   78 +++++
 include/linux/ddlink.h         |   41 ++
 include/linux/ddsetup.h        |   36 ++
 include/linux/device-mapper.h  |   37 ++
 12 files changed, 1056 insertions(+), 133 deletions(-)

Currently, this implements a majority of the device mapper interface 
calls, but not all of them, so expect another hundred or two lines 
before completion.  This is still significantly less code than the 
original ioctl interface (which is still in there) and much clearer.

I have written two example programs, ddsetup.c and ddcreate.c.  The 
former aims to be a drop-in replacement for dmsetup.c and the latter 
implements a (useful) demonstration command that creates a virtual 
device consisting of a single device mapper target, with all target 
parameters supplied on the command line.

For example:

   ddcreate foo 10000 linear /dev/ubdb 1234

ddcreate is 71 lines long including plenty of whitespace, while being 
quite general.  My message is about the 71 lines.

So who uses this ddsetup today?  Answer: nobody.  Better answer: the 
ddsnap cluster snapshot driver has a usability problem because of its 
reliance on PF_LOCAL sockets to glue components together.  Filesystem 
based sockets were adopted for the component glue because it is hard to 
do anything more elegant working with the command line device mapper 
setup utility.  We looked at hacking the device mapper ioctl interface 
to do what we needed, but then if we were willing to go that far then 
why not just drop the other shoe and improve the userspace interface to 
the point where it is actually pleasant to use, and maintainable too?  
This is how ddsetup was born.

The next thing we need to do with this interface is demonstrate a solid 
use case by adopting it on an experimental branch of zumastor.  I 
expect both ddsnap and zumastor systems to shrink as a result, 
including significantly shrinking the documentation.  This has not 
been done yet, and until it is, this effort deserves to be firmly 
relegated to the "nice but so what" category.  So, profound thanks to 
you, dear reader, for having had the stamina to read all the way to 
here, and we will see you here again after having eaten this delicious 
new dogfood ourselves.


[ARND 07] How to not invent kernel interfaces, Arnd Bergmann,
   July 31, 2007