From: Mark McLoughlin <markmc <at> redhat.com>
Subject: KVM PCI device assignment issues
Newsgroups: gmane.linux.kernel.pci
Date: Friday 13th February 2009 16:32:47 UTC

KVM has support for PCI device assignment using VT-d and AMD IOMMU, but
there are a number of inter-related issues that need some further
discussion:

  - Unbinding devices from any existing device driver before assignment

  - Resetting devices before and after assignment

  - Helping users figure out which devices can actually be assigned

This gets confusing, so some background constraints first:

  - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same 
    bridge must be assigned to the same VT-d domain - i.e. given device 
    A (0000:0f:1.0) and device B (0000:0f:2.0), if you assign device A 
    to a guest, you cannot then use device B in the host or another 
    guest.

  - Some newer PCIe devices (and newer conventional PCI devices too via 
    PCI Advanced Features) support Function Level Reset (FLR). This 
    allows a PCI function to be reset without affecting any other 
    functions on that device, or any other devices. This feature is not 
    widespread yet AFAIK - e.g. I've seen it on an audio controller, 
    and it must also be supported by SR-IOV devices.

  - Secondary Bus Reset (SBR) allows software to trigger a reset on all 
    devices (and functions) behind a PCI bridge.

  - A PCI Power Management D-state transition (D3hot to D0) can be used 
    to reset a device (all functions).

  - Some PCI devices don't have page aligned MMIO BARs. These devices 
    (all functions) cannot be safely assigned to guests.
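The last constraint above can be checked from userland without touching
the device. A minimal sketch, assuming the sysfs "resource" file layout
(one "start end flags" line per resource, the first six being BAR0-BAR5)
and a 4096 byte page; SYSFS_ROOT and bars_page_aligned are names made up
for this example:

```shell
#!/bin/sh
# Sketch only: check that the memory BARs of a PCI device are page aligned.
# SYSFS_ROOT is a hypothetical override, handy for trying this on a copy of sysfs.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
PAGE_SIZE=4096   # assumed; `getconf PAGESIZE` reports the real value

# Returns 0 if all memory BARs are page aligned, 1 if not, 2 if unreadable.
bars_page_aligned() {
    res="$SYSFS_ROOT/bus/pci/devices/$1/resource"
    [ -r "$res" ] || return 2
    n=0
    # Each line of "resource" holds: start end flags (all in hex)
    while read -r start end flags; do
        n=$((n + 1))
        [ "$n" -le 6 ] || break                      # lines 1-6 are BAR0-BAR5
        [ $((end)) -ne 0 ] || continue               # unused BAR
        [ $((flags & 0x200)) -ne 0 ] || continue     # not IORESOURCE_MEM
        size=$((end - start + 1))
        if [ $((start % PAGE_SIZE)) -ne 0 ] || [ $((size % PAGE_SIZE)) -ne 0 ]; then
            return 1
        fi
    done < "$res"
    return 0
}
```

e.g. "bars_page_aligned 0000:00:19.0 && echo ok". Since the constraint
applies to all functions of a device, a real tool would run this for
every function and not just the one being assigned.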

Driver Unbinding

Before a device is assigned to a guest, we should make sure that no host
device driver is currently bound to the device.

We can do that with e.g.

 $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
 $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
 $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/pci-stub/bind

One minor problem with this scheme is that at this point you can't
unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
In order to support that, we need a "remove_id" interface to remove the
dynamic ID.
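For what it's worth, the three commands above can be generalized so that
the IDs and the current driver are looked up rather than typed in. This
is only a sketch: stub_device and SYSFS_ROOT are names invented here, and
it assumes pci-stub is already loaded and that you are root.

```shell
#!/bin/sh
# Sketch only: hand a PCI device over to pci-stub using new_id/unbind/bind.
# SYSFS_ROOT is a hypothetical override so the steps can be dry-run on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

stub_device() {
    bdf="$1"                                   # e.g. 0000:00:19.0
    dev="$SYSFS_ROOT/bus/pci/devices/$bdf"
    stub="$SYSFS_ROOT/bus/pci/drivers/pci-stub"
    [ -d "$dev" ]  || { echo "no such device: $bdf" >&2; return 1; }
    [ -d "$stub" ] || { echo "pci-stub is not loaded" >&2; return 1; }

    # Name of the driver currently bound, if any
    current=""
    [ -L "$dev/driver" ] && current=$(basename "$(readlink "$dev/driver")")
    [ "$current" = pci-stub ] && return 0

    # sysfs reports IDs as e.g. 0x8086; new_id takes bare hex "vendor device"
    vendor=$(sed 's/^0x//' "$dev/vendor")
    device=$(sed 's/^0x//' "$dev/device")
    printf '%s %s' "$vendor" "$device" > "$stub/new_id"

    # Detach the existing driver, if there is one
    if [ -n "$current" ]; then
        printf '%s' "$bdf" > "$dev/driver/unbind"
    fi

    # A driverless device may already have been claimed when new_id was written
    [ -e "$stub/$bdf" ] || printf '%s' "$bdf" > "$stub/bind"
}
```

Usage would be "stub_device 0000:00:19.0". Note that it leaves the
dynamic ID registered with pci-stub, so the re-probe problem described
above still applies.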

What we don't support is a way to unbind permanently. Xen has a
pciback.hide module param which tries to achieve this, but you end up
with the inevitable issues around making sure pciback is loaded before
the device driver etc.

Permanent unbinding isn't necessarily needed, but it might help provide
a solution to some of the nastier issues below.

Device Reset

Before assigning a device to a guest, it should be reset. The host or a
previous guest may have left the device in an unknown state. In testing,
skipping the reset leads to e.g. "TX Unit Hang" errors with e1000e
devices.

FLR is without doubt the preferable solution here. KVM already
implements this. However, the range of devices which support FLR is
currently quite limited.

If we're assigning devices from behind a PCI/PCI-X bridge (remember all
devices must be assigned together), then we can use SBR to reset them
all together. Clearly, though, one should make sure that all devices
behind that bridge are not in use before doing the reset. We could
implement this with a "reset" sysfs interface for pci-stub - it would
only reset a device using SBR if all devices behind that bridge were
bound to pci-stub.
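The "all devices behind that bridge" check is proposed here as kernel
logic in pci-stub, but the idea can be illustrated from userland. A rough
sketch, assuming the usual sysfs layout where a device's real directory
sits inside its upstream bridge's directory; sbr_is_safe, driver_of and
SYSFS_ROOT are hypothetical names:

```shell
#!/bin/sh
# Sketch only: would a Secondary Bus Reset hit nothing but pci-stub devices?
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the driver bound to device directory $1, if any
driver_of() {
    [ -L "$1/driver" ] && basename "$(readlink "$1/driver")"
}

# Returns 0 if every device directly behind the same bridge as $1 is bound
# to pci-stub, 1 if some device is not, 2 if there is no bridge to reset.
sbr_is_safe() {
    # Resolve the symlink so that the parent directory is the upstream bridge
    dev=$(cd "$SYSFS_ROOT/bus/pci/devices/$1" 2>/dev/null && pwd -P) || return 2
    bridge=$(dirname "$dev")
    case "$(basename "$bridge")" in
        ????:??:??.?) ;;          # parent is itself a PCI device, i.e. a bridge
        *) return 2 ;;            # device sits on a root bus
    esac
    for sibling in "$bridge"/????:??:??.?; do
        [ -d "$sibling" ] || continue
        if [ "$(driver_of "$sibling")" != pci-stub ]; then
            echo "$(basename "$sibling") is not bound to pci-stub" >&2
            return 1
        fi
    done
    return 0
}
```

This is deliberately conservative: a sibling that is itself a bridge is
never bound to pci-stub, so any nested bus makes the check fail rather
than being walked.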

Where a conventional PCI device is on the root bus, or where a PCIe
device is on the root bus or another bus with multiple devices, we could
use the D-state transition reset. Since this resets all functions on a
device, we would need a similar approach where all functions must be
bound to pci-stub before being reset.
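The per-device variant of that check only has to look at functions
sharing the same domain:bus:device. Again just a userland illustration of
the rule, with all_functions_stubbed and SYSFS_ROOT being made-up names:

```shell
#!/bin/sh
# Sketch only: are all functions of a PCI device bound to pci-stub?
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Returns 0 only if the device exists and every function is bound to pci-stub.
all_functions_stubbed() {
    slot="${1%.*}"                 # 0000:00:19.0 -> 0000:00:19
    found=0
    for fn in "$SYSFS_ROOT/bus/pci/devices/$slot".?; do
        [ -e "$fn" ] || continue   # glob matched nothing
        found=1
        drv=""
        [ -L "$fn/driver" ] && drv=$(basename "$(readlink "$fn/driver")")
        [ "$drv" = pci-stub ] || return 1
    done
    [ "$found" -eq 1 ]
}
```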

Furthermore, we would need to prevent pci-stub from resetting a device
that is bound to it but already assigned to a guest. To achieve this, we
would want KVM to explicitly call in to pci-stub to mark a device as in
use.

The alternatives to such an approach are:

  a) Only support FLR capable devices

  b) Cross our fingers and hope that things work without a device reset

  c) Allow a driver to be permanently unbound from a device and require 
     the user to reboot after unbinding before assigning


Listing Assignable Devices

In order to support a sane user interface in management tools, it should
be possible to list all PCI devices available on a host and filter out
those which cannot be assigned to a guest.

Furthermore, it should be possible to do this without actually affecting
any of the devices - i.e. a "try to unbind and see if we oops" approach
clearly isn't great.
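As a starting point, a purely read-only inventory is already possible
from sysfs. The sketch below only reads files and symlinks, so it cannot
disturb any device; it does no filtering itself, and list_pci_devices and
SYSFS_ROOT are names invented for the example:

```shell
#!/bin/sh
# Sketch only: list PCI devices with their IDs and bound driver, read-only.
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Prints one line per device: <address> <vendor>:<device> <driver or ->
list_pci_devices() {
    for dev in "$SYSFS_ROOT"/bus/pci/devices/*; do
        [ -d "$dev" ] || continue
        drv="-"
        [ -L "$dev/driver" ] && drv=$(basename "$(readlink "$dev/driver")")
        printf '%s %s:%s %s\n' "$(basename "$dev")" \
            "$(sed 's/^0x//' "$dev/vendor")" \
            "$(sed 's/^0x//' "$dev/device")" \
            "$drv"
    done
}
```

A management tool could then drop entries from this list according to
the constraints described earlier (shared bridges, multi-function
devices, BAR alignment).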

Finally, some management tools would like to be able to do this
filtering given the constraint of a device being reserved for a
currently inactive guest.

This last constraint is the most difficult and points to the logic
needing to be in userland management libraries. Possibly the only sane
kernel space support would be "try to unbind and reset; if it works then
the device is assignable".


Conclusions

Only supporting devices with FLR restricts our user pool far too much.

Permanent unbinding is not supportable.

SBR and D-state reset support is doable with the addition of a "reset"
interface to pci-stub and some logic to check that a reset does not
affect devices not already bound to pci-stub.

KVM would need to be able to mark pci-stub bound devices as in use when
assigned to a guest.

We need the opposite to "new_id" to allow dynids to be removed.

The filtering abilities available to userland via kernel interfaces will
be limited. Further logic will need to be implemented in userland.
