Sharing PCIe cards across architectures

A few days ago, during one of our conference calls, one of my co-workers asked:

Has anyone ever tried PCI forwarding to an ARM VM on an x86 box?

As my machine was already open, I just turned it off and inserted a SATA controller into one of the unused PCI Express slots. After boot I started one of my AArch64 CirrOS VM instances and gave it the card. It worked perfectly:

[   21.603194] pcieport 0000:00:01.0: pciehp: Slot(0): Attention button pressed
[   21.603849] pcieport 0000:00:01.0: pciehp: Slot(0) Powering on due to button press
[   21.604124] pcieport 0000:00:01.0: pciehp: Slot(0): Card present
[   21.604156] pcieport 0000:00:01.0: pciehp: Slot(0): Link Up
[   21.739977] pci 0000:01:00.0: [1b21:0612] type 00 class 0x010601
[   21.740159] pci 0000:01:00.0: reg 0x10: [io  0x0000-0x0007]
[   21.740199] pci 0000:01:00.0: reg 0x14: [io  0x0000-0x0003]
[   21.740235] pci 0000:01:00.0: reg 0x18: [io  0x0000-0x0007]
[   21.740271] pci 0000:01:00.0: reg 0x1c: [io  0x0000-0x0003]
[   21.740306] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x001f]
[   21.740416] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x000001ff]
[   21.742660] pci 0000:01:00.0: BAR 5: assigned [mem 0x10000000-0x100001ff]
[   21.742709] pci 0000:01:00.0: BAR 4: assigned [io  0x1000-0x101f]
[   21.742770] pci 0000:01:00.0: BAR 0: assigned [io  0x1020-0x1027]
[   21.742803] pci 0000:01:00.0: BAR 2: assigned [io  0x1028-0x102f]
[   21.742834] pci 0000:01:00.0: BAR 1: assigned [io  0x1030-0x1033]
[   21.742866] pci 0000:01:00.0: BAR 3: assigned [io  0x1034-0x1037]
[   21.742935] pcieport 0000:00:01.0: PCI bridge to [bus 01]
[   21.742961] pcieport 0000:00:01.0:   bridge window [io  0x1000-0x1fff]
[   21.744805] pcieport 0000:00:01.0:   bridge window [mem 0x10000000-0x101fffff]
[   21.745749] pcieport 0000:00:01.0:   bridge window [mem 0x8000000000-0x80001fffff 64bit pref]
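
For reference, giving the card to the guest can be done from virt-manager or straight through the libvirt Python bindings (the same bindings that show up in the traceback later in this post). A minimal sketch, assuming a guest name and a host PCI address that you would replace with your own:

import libvirt

# Placeholder values - use your own guest name and the card's host PCI address.
GUEST = "cirros-aarch64"
HOSTDEV_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x29' slot='0x00' function='0x0'/>
  </source>
</hostdev>
"""

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName(GUEST)

# Hot-plug the device into the running guest; with managed='yes' libvirt
# detaches it from the host driver and binds it to vfio-pci first.
dom.attachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
conn.close()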

Let’s go deeper

The next day I turned off the desktop for a CPU cooler upgrade. During the process I went through my box of expansion cards and plugged in an additional USB 3.0 controller (Renesas based). I also added a SATA hard drive and connected it to the previously installed controller.

Once the computer was back online I created a new VM instance. This time I used Fedora 32 Beta. But when I tried to add a PCI Express card I got an error:

Error while starting domain: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 66, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1279, in startup
    self._backend.create()
  File "/usr/lib64/python3.8/site-packages/libvirt.py", line 1234, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

Hmm. It had worked before. I tried the other card, with the same effect.

Debugging

I went to the #qemu IRC channel and started discussing the issue with QEMU developers. It turned out that probably no one had tried passing expansion cards through to a foreign-architecture guest (running in TCG mode instead of same-architecture KVM mode).
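
If you are unsure which mode your own guest uses, the domain type in the libvirt XML tells you: a foreign-architecture guest such as aarch64 on an x86 host has to be type='qemu' (TCG emulation), while a same-architecture guest is normally type='kvm'. A quick sketch with the Python bindings (guest name is a placeholder):

import libvirt
import xml.etree.ElementTree as ET

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("fedora32-aarch64")  # placeholder name

# The root <domain> element carries the virtualization type:
# 'qemu' means TCG emulation, 'kvm' means hardware virtualization.
root = ET.fromstring(dom.XMLDesc(0))
print(root.get("type"))
conn.close()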

As I had a VM instance where passing the card through worked, I started checking what was wrong. After some restarts it became clear that crossing 3054 MB of guest memory was enough to trigger VFIO errors like the one above.
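
The host side offers a hint about what is special around that boundary: the IOMMU exports its "reserved regions" in sysfs, and on a typical x86 host with interrupt remapping they include an MSI window at 0xfee00000. A small sketch to dump them (standard sysfs paths; it only assumes the host IOMMU is enabled):

from pathlib import Path

# Each IOMMU group lists address ranges that cannot be used as IOVA space,
# e.g. the x86 MSI window; guest RAM overlapping them cannot be mapped.
for region_file in sorted(Path("/sys/kernel/iommu_groups").glob("*/reserved_regions")):
    group = region_file.parent.name
    for line in region_file.read_text().splitlines():
        print(f"group {group}: {line}")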

Reporting

An issue that is not reported does not exist, so I opened a bug against QEMU and filled it with the error messages, “lspci” output for the cards used, the QEMU command line (generated by libvirt), and so on.

It looks like the problem lies in the architecture differences between x86-64 (host) and aarch64 (guest). Let me quote Alex Williamson:

The issue is that the device needs to be able to DMA into guest RAM, and to do that transparently (ie. the guest doesn’t know it’s being virtualized), we need to map GPAs into the host IOMMU such that the guest interacts with the device in terms of GPAs, the host IOMMU translates that to HPAs. Thus the IOMMU needs to support GPA range of the guest as IOVA. However, there are ranges of IOVA space that the host IOMMU cannot map, for example the MSI range here is handled by the interrupt remapper, not the DMA translation portion of the IOMMU (on physical ARM systems these are one and the same, on x86 they are different components, using different mapping interfaces of the IOMMU). Therefore if the guest programmed the device to perform a DMA to 0xfee00000, the host IOMMU would see that as an MSI, not a DMA. When we do an x86 VM on an x86 host, both the host and the guest have complementary reserved regions, which avoids this issue.

Also, to expand on what I mentioned on IRC, every x86 host is going to have some reserved range below 4G for this purpose, but if the aarch64 VM has no requirements for memory below 4G, the starting GPA for the VM could be at or above 4G and avoid this issue.
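
This also explains the 3054 MB threshold found earlier: QEMU’s mach-virt machine places guest RAM at 0x40000000 (the base address is visible in the vfio_dma_map error above), so anything beyond roughly 3054 MB of RAM reaches the x86 MSI window at 0xfee00000. A quick check of the arithmetic:

RAM_BASE = 0x40000000    # mach-virt guest RAM base (from the vfio_dma_map error)
MSI_WINDOW = 0xfee00000  # x86 MSI range reserved by the host IOMMU

headroom_mb = (MSI_WINDOW - RAM_BASE) // (1024 * 1024)
print(headroom_mb)       # 3054 - guest RAM beyond this overlaps the reserved range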

I have to admit that this is too low-level for me. I hope that the problem I hit will help someone improve QEMU.
