Subject: KVM PCI device assignment issues
Date: Friday 13th February 2009 16:32:47 UTC

Hi,

KVM has support for PCI device assignment using VT-d and AMD IOMMU, but
there are a number of inter-related issues that need some further
discussion:

  - Unbinding devices from any existing device driver before assignment

  - Resetting devices before and after assignment

  - Helping users figure out which devices can actually be assigned

This gets confusing, so some background constraints first:

  - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same
    bridge must be assigned to the same VT-d domain - i.e. given device
    A (0000:0f:01.0) and device B (0000:0f:02.0), if you assign device
    A to a guest, you cannot then use device B in the host or another
    guest.

  - Some newer PCIe devices (and newer conventional PCI devices too,
    via PCI Advanced Features) support Function Level Reset (FLR). This
    allows a PCI function to be reset without affecting any other
    functions on that device, or any other devices. This feature is
    not widespread yet AFAIK - e.g. I've seen it on an audio
    controller, and it must also be supported by SR-IOV devices.

  - Secondary Bus Reset (SBR) allows software to trigger a reset on
    all devices (and functions) behind a PCI bridge.

  - A PCI Power Management D-state transition (D3hot to D0) can be
    used to reset a device (all functions).

  - Some PCI devices don't have page aligned MMIO BARs. These devices
    (all functions) cannot be safely assigned to guests.

Driver Unbinding
================

Before a device is assigned to a guest, we should make sure that no
host device driver is currently bound to the device.

We can do that with e.g.

  $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
  $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
  $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/pci-stub/bind

One minor problem with this scheme is that, at this point, you can't
unbind from pci-stub, trigger a re-probe and have e1000e bind to the
device again, since pci-stub still matches the device through its
dynamic ID. In order to support that, we need a "remove_id" interface
to remove the dynamic ID.

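As a rough illustration of the sequence above, here is a minimal shell
sketch that looks up whichever driver currently owns a device through
its sysfs "driver" symlink and then moves the device to pci-stub. The
function names (current_driver, rebind_to_stub) and the SYSFS override
are made up for this example and are not part of any existing tool; the
override only exists so the script can be dry-run against a fake
directory tree. It assumes pci-stub is already loaded.

```shell
# Sketch only: hand one PCI function over to pci-stub.
# SYSFS is a hypothetical override for dry runs; it defaults to /sys.
SYSFS="${SYSFS:-/sys}"

# Print the name of the driver bound to device $1, or nothing if unbound.
current_driver() {
    link="$SYSFS/bus/pci/devices/$1/driver"
    [ -L "$link" ] || return 0
    basename "$(readlink "$link")"
}

# rebind_to_stub BDF VENDOR DEVICE
# (plain POSIX sh, so the variables below are not function-local)
rebind_to_stub() {
    bdf="$1"
    stub="$SYSFS/bus/pci/drivers/pci-stub"
    drv="$(current_driver "$bdf")"

    # Nothing to do if pci-stub already owns the device.
    [ "$drv" = "pci-stub" ] && return 0

    # Teach pci-stub about this vendor/device ID pair.
    printf '%s %s' "$2" "$3" > "$stub/new_id" || return 1

    # Detach the existing host driver, if there is one.
    if [ -n "$drv" ]; then
        printf '%s' "$bdf" > "$SYSFS/bus/pci/drivers/$drv/unbind" || return 1
    fi

    # A driverless device may have been claimed by pci-stub as soon as
    # new_id was written; only bind explicitly if that did not happen.
    [ "$(current_driver "$bdf")" = "pci-stub" ] && return 0
    printf '%s' "$bdf" > "$stub/bind"
}
```

Run as root, "rebind_to_stub 0000:00:19.0 8086 10de" performs the same
three writes as the commands above. It does nothing about the re-probe
problem: the dynamic ID stays registered with pci-stub.
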
What we don't support is a way to unbind permanently. Xen has a
pciback.hide module param which tries to achieve this, but you end up
with the inevitable issues around making sure pciback is loaded before
the device driver etc. Permanent unbinding isn't necessarily needed, but
it might help provide a solution to some of the nastier issues below.

Device Reset
============

Before assigning a device to a guest, it should be reset. The host or a
previous guest may have left the device in an unknown state. In testing,
skipping the reset leads to e.g. "TX Unit Hang" errors with e1000e
devices.

FLR is without doubt the preferable solution here. KVM already
implements this. However, the range of devices which support FLR is
currently quite limited.

If we're assigning devices from behind a PCI/PCI-X bridge (remember, all
devices must be assigned together), then we can use SBR to reset them
all together. Clearly, though, one should make sure that none of the
devices behind that bridge is in use before doing the reset.

We could implement this with a "reset" sysfs interface for pci-stub - it
would only reset a device using SBR if all devices behind that bridge
were bound to pci-stub.

Where a conventional PCI device is on the root bus, or where a PCIe
device is on the root bus or another bus with multiple devices, we could
use the D-state transition reset. Since this resets all functions on a
device, we would need a similar approach where all functions must be
bound to pci-stub before being reset.

Furthermore, we would need to prevent pci-stub from resetting a device
it is bound to while that device is assigned to a guest. To achieve
this, we would want KVM to explicitly call in to pci-stub to mark a
device as in use.

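The "only reset if everything affected is bound to pci-stub" rule
described above could be prototyped in userland before any kernel
interface exists. The sketch below is one possible way to express that
check, matching devices purely by their sysfs address prefix. All
function names and the SYSFS override are hypothetical. It only looks at
devices directly on the same bus (it does not walk nested bridges), it
does not check whether the bus actually sits behind a bridge, and it
cannot tell whether a device is already assigned to a guest, which is
the part that would need KVM's help.

```shell
# Sketch only: would a reset touch anything not bound to pci-stub?
# SYSFS is a hypothetical override for dry runs; it defaults to /sys.
SYSFS="${SYSFS:-/sys}"

# Succeed only if at least one PCI function has an address starting
# with prefix $1 and every such function is bound to pci-stub.
all_bound_to_stub() {
    prefix="$1"
    found=0
    for dev in "$SYSFS"/bus/pci/devices/"$prefix"*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        found=1
        link="$dev/driver"
        [ -L "$link" ] || return 1         # no driver bound at all
        [ "$(basename "$(readlink "$link")")" = "pci-stub" ] || return 1
    done
    [ "$found" -eq 1 ]
}

# A D-state reset hits every function of one device:
# 0000:00:19.0 -> prefix "0000:00:19."
dstate_reset_is_safe() {
    all_bound_to_stub "${1%.*}."
}

# An SBR hits every device on the bus:
# 0000:0f:01.0 -> prefix "0000:0f:"
sbr_is_safe() {
    all_bound_to_stub "${1%:*}:"
}
```

For example, "sbr_is_safe 0000:0f:01.0" fails for as long as
0000:0f:02.0 is still bound to a host driver.
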
The alternatives to such an approach are:

  a) Only support FLR capable devices

  b) Cross our fingers and hope that devices work without a reset

  c) Allow a driver to be permanently unbound from a device and
     require the user to reboot after unbinding before assigning

Filtering
=========

In order to support a sane user interface in management tools, it
should be possible to list all PCI devices available on a host and
filter out those which cannot be assigned to a guest.

Furthermore, it should be possible to do this without actually
affecting any of the devices - i.e. a "try to unbind and see if we oops"
approach clearly isn't great.

Finally, some management tools would like to be able to do this
filtering given the constraint of a device being reserved for a
currently inactive guest.

This last constraint is the most difficult and points to the logic
needing to be in userland management libraries. Possibly the only sane
kernel space support would be "try to unbind and reset; if it works then
the device is assignable".

Conclusions
===========

Only supporting devices with FLR restricts our user pool far too
severely.

Permanent unbinding is not supportable.

SBR and D-state reset support is doable with the addition of a "reset"
interface to pci-stub and some logic to check that a reset does not
affect devices not already bound to pci-stub.

KVM would need to be able to mark pci-stub bound devices as in use when
assigned to a guest.

We need the opposite of "new_id" to allow dynamic IDs to be removed.

The filtering abilities available to userland via kernel interfaces will
be limited. Further logic will need to be implemented in userland.

Cheers,
Mark.