Using XSETBV to write to XCR0 creates a general protection fault in a VM on hardware that supports MPX

Question

I am trying to write to Extended Control Register 0 (xcr0) on an x86_64 Debian v7 virtual machine. My approach to doing so is through a kernel module (so CPL=0) with some inline assembly. However, I keep getting a general protection fault (#GP) when I try to execute the xsetbv instruction.

The init function of my module first checks that the osxsave bit is set in control register 4 (cr4). If it isn't, it sets it. Then, I read the xcr0 register using xgetbv. This works fine and (in the limited testing I have done) has the value 0b111. I would like to set the bndreg and bndcsr bits which are the 3rd and 4th bits (0-indexed), so I do some ORing and write 0b11111 back to xcr0 using xsetbv. The code to achieve this last part is as follows.

unsigned long xcr0;             /* extended register    */
unsigned long bndreg = 0x8;     /* 3rd bit in xcr0      */
unsigned long bndcsr = 0x10;    /* 4th bit in xcr0      */

/* ... checking cr4 for osxsave and reading xcr0 ... */

if (!(xcr0 & bndreg))
    xcr0 |= bndreg;

if (!(xcr0 & bndcsr))
    xcr0 |= bndcsr;

/* ... xcr0 is now 0b11111 ... */

/*
 * write changes to xcr0; ignore high bits (set them =0) b/c they are reserved
 */
unsigned long new_xcr0 = ((xcr0) & 0xffffffff);
__asm__ volatile (
    "mov $0, %%ecx      \t\n" // %ecx selects the xcr to write
    "xor %%rdx, %%rdx   \t\n" // set %rdx to zero
    "xsetbv             \t\n" // write from edx:eax into xcr0
    :
    : "a" (new_xcr0)        /* input    */
    : "ecx", "rdx"          /* clobbered    */
);
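
For comparison, the same write can be expressed by letting the compiler load ECX, EAX and EDX through register constraints instead of hand-written mov/xor. This is only a sketch: `xcr_split` and `xsetbv_sketch` are hypothetical names rather than kernel APIs, and the wrapper still has to run at CPL 0. Unlike the code above, it passes the high 32 bits through in EDX instead of forcing them to zero.

```c
#include <stdint.h>

/* Split a 64-bit XCR value into the EDX:EAX halves that XSETBV reads. */
static inline void xcr_split(uint64_t value, uint32_t *eax, uint32_t *edx)
{
    *eax = (uint32_t)value;          /* low 32 bits  -> EAX */
    *edx = (uint32_t)(value >> 32);  /* high 32 bits -> EDX */
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/*
 * Hypothetical wrapper: write `value` to the XCR selected by `index`.
 * Only meaningful at CPL 0 (e.g. inside a kernel module); calling it
 * from user space raises #GP.
 */
static inline void xsetbv_sketch(uint32_t index, uint64_t value)
{
    uint32_t eax, edx;

    xcr_split(value, &eax, &edx);
    __asm__ volatile("xsetbv"
                     : /* no outputs */
                     : "c"(index), "a"(eax), "d"(edx));
}
#endif
```

With this shape, the write in the question would read `xsetbv_sketch(0, new_xcr0);`.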

By looking at the trace from the general protection fault, I determined that the xsetbv instruction is the problem. However, if I don't manipulate xcr0 and just read its value and write it back, things seem to work fine. Looking at the Intel manual and this site, I found various reasons for a #GP, but none of them seem to match my situation. The reasons are as follows along with my explanation for why they most likely don't apply.

If the current privilege level is not 0 --> I use a kernel module to achieve CPL=0
If an invalid xcr is specified in %ecx --> 0 is in %ecx which is valid and worked for xgetbv
If the value in edx:eax sets bits that are reserved in the xcr specified by ecx --> according to the Intel manual and Wikipedia the bits I am setting are not reserved
If an attempt is made to clear bit 0 of xcr0 --> I printed out xcr0 before setting it, and it was 0b11111
If an attempt is made to set xcr0[2:1] to 0b10 --> I printed out xcr0 before setting it, and it was 0b11111
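
The three value-dependent conditions in this list can be sketched as a pure function, which makes it easy to test a candidate value before executing xsetbv. The function name, enum and `supported` parameter are hypothetical. `supported` is assumed to hold the set of XCR0 bits the (possibly virtual) CPU allows, which CPUID leaf 0xD, sub-leaf 0 reports in EDX:EAX. The sketch covers only the conditions quoted above, not every rule in the Intel manual.

```c
#include <stdint.h>

#define XCR0_X87 (UINT64_C(1) << 0)
#define XCR0_SSE (UINT64_C(1) << 1)
#define XCR0_AVX (UINT64_C(1) << 2)

enum xcr0_check {
    XCR0_OK = 0,
    XCR0_ERR_RESERVED,        /* sets a bit the CPU does not report     */
    XCR0_ERR_X87_CLEARED,     /* bit 0 of xcr0 must stay set            */
    XCR0_ERR_AVX_WITHOUT_SSE  /* xcr0[2:1] == 0b10 is not allowed       */
};

/*
 * Hypothetical helper: would writing `value` to xcr0 violate one of the
 * value-dependent #GP conditions, given the `supported` bit mask?
 */
enum xcr0_check xcr0_check_value(uint64_t value, uint64_t supported)
{
    if (value & ~supported)
        return XCR0_ERR_RESERVED;
    if (!(value & XCR0_X87))
        return XCR0_ERR_X87_CLEARED;
    if ((value & XCR0_AVX) && !(value & XCR0_SSE))
        return XCR0_ERR_AVX_WITHOUT_SSE;
    return XCR0_OK;
}
```

For example, if the supported mask were only 0b111, then `xcr0_check_value(0x1f, 0x07)` would report `XCR0_ERR_RESERVED`, because bits 3 and 4 count as reserved on a CPU that does not advertise them.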

Thank you in advance for any help discovering why this #GP is happening.

Tags: c, linux-kernel, x86, virtual-machine, vmware

asked on Stack Overflow May 14, 2020 by peachykeen • edited May 28, 2020 by Peter Cordes

2 Answers

Peter Cordes was right: it was a problem with my hypervisor. I am using VMware Fusion for virtualization, and after a lot of digging on the internet I found the following quote from VMware:

Memory protection extensions (MPX) were introduced in Intel Skylake generation CPUs and provided hardware support for bound checking. This feature will not be supported in Intel CPUs beginning with the Ice Lake generation.

Starting with ESXi 6.7 P02 and ESXi 7.0 GA, in order to minimize disruptions during future upgrades, VMware will no longer expose MPX by default to VMs at power-on. A VM configuration option can be used to continue exposing MPX.

The solution VMware proposed was to add the following directive to the virtual machine's .vmx file.

cpuid.enableMPX = "TRUE"

After I did this, things worked and I was able to use xsetbv to enable the bndreg and bndcsr bits of xcr0.

When using VMware to expose CPU features from the host to the guest under more normal conditions (i.e., the feature isn't being deprecated), you can mask individual bits of CPUID leaves by adding a line of the following form to the VM's .vmx file.

cpuid.<leaf>.<register> = "<value>"

So, for example, if we assume that SMAP can be exposed this way, we would want to set bit 20 of the EBX register of CPUID leaf 7.

cpuid.7.ebx = "----:----:---1:----:----:----:----:----"

Colons are optional and only make the string easier to read, ones and zeros override any default setting, and dashes leave the default setting alone.
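
As a small illustration of that string format, the sketch below builds the mask for a single bit. The helper name and buffer-size macro are hypothetical, and the bit ordering (bit 31 leftmost, bit 0 rightmost) is inferred from the SMAP example above rather than taken from VMware documentation.

```c
#include <string.h>

/* 32 mask characters + 7 colons + terminating NUL. */
#define CPUID_MASK_STRLEN 40

/*
 * Hypothetical helper: fill `out` with a mask that forces `bit` to
 * `value` (0 or 1) and leaves the other 31 bits at their defaults.
 * Returns 0 on success, -1 on invalid arguments.
 */
int cpuid_mask_for_bit(unsigned bit, int value, char out[CPUID_MASK_STRLEN])
{
    unsigned from_left, group;

    if (bit > 31 || (value != 0 && value != 1))
        return -1;

    /* Start with "leave default" everywhere. */
    memset(out, '-', CPUID_MASK_STRLEN - 1);
    out[CPUID_MASK_STRLEN - 1] = '\0';

    /* A colon after every group of four bits, except the last group. */
    for (group = 1; group < 8; group++)
        out[group * 5 - 1] = ':';

    /* Bit 31 is the first character; skip one colon per full group. */
    from_left = 31 - bit;
    out[from_left + from_left / 4] = (char)('0' + value);
    return 0;
}
```

Calling `cpuid_mask_for_bit(20, 1, buf)` reproduces the string used for `cpuid.7.ebx` above.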

answered on Stack Overflow May 28, 2020 by peachykeen • edited Jun 20, 2020 by Community

/proc/cpuinfo on the VM doesn't list mpx in the flags (it does list xsave though). My host does have MPX support though. I am running Linux kernel version 3.19 which does support MPX and I already have a binary compiled with MPX (the bnd instructions etc. are all there when I objdump). The problem is that the instructions get treated as NOPs. I thought the process I described above would fix this and enable the CPU to recognize MPX.

It would enable MPX if you ran it on a machine that supported MPX. (Assuming your code is correct.)

The virtual x86 CPU your VM is running on does not support MPX, according to its own virtualized CPUID, so it's not surprising at all that this faults. The hypervisor might be raising the fault manually on a VM exit, emulating xsetbv and checking the changes to the virtualized xcr0.
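
One way to see what the virtualized CPUID reports is to query it from user space inside the guest. The sketch below assumes GCC or Clang, whose `<cpuid.h>` provides `__get_cpuid_count`. The function names are hypothetical. It checks the MPX feature flag (CPUID leaf 7, sub-leaf 0, EBX bit 14) and whether leaf 0xD, sub-leaf 0 lists the bndreg and bndcsr bits (3 and 4) as settable in xcr0.

```c
#include <stdint.h>

#define CPUID7_EBX_MPX (UINT32_C(1) << 14)  /* MPX feature flag       */
#define XCR0_BNDREG    (UINT32_C(1) << 3)   /* bndreg state component */
#define XCR0_BNDCSR    (UINT32_C(1) << 4)   /* bndcsr state component */

/*
 * Pure decision logic: given EBX from leaf 7 and EAX from leaf 0xD,
 * does the CPU advertise MPX and allow both MPX bits in xcr0?
 */
int cpuid_reports_mpx(uint32_t leaf7_ebx, uint32_t leafd_eax)
{
    uint32_t need = XCR0_BNDREG | XCR0_BNDCSR;

    return (leaf7_ebx & CPUID7_EBX_MPX) != 0 && (leafd_eax & need) == need;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Query the CPUID values this (possibly virtual) CPU actually exposes. */
int guest_reports_mpx(void)
{
    unsigned int eax, ebx, ecx, edx, leaf7_ebx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                   /* leaf 7 not available */
    leaf7_ebx = ebx;

    if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
        return 0;                   /* leaf 0xD not available */

    return cpuid_reports_mpx(leaf7_ebx, eax);
}
#endif
```

If `guest_reports_mpx()` returns 0 inside the VM while the same program returns 1 on the host, the hypervisor is hiding the feature, which is consistent with the fault described in the question.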

If you want to use features your HW has but your VM doesn't support, in general you have to run on bare metal instead. Or find a different VM that does expose the feature to the guest.

Note that MPX introduces new architectural state (the bnd registers) that has to be saved and restored on context switches. If your hypervisor doesn't want to do that, that would be one reason to disable MPX. (I think it can be saved and restored as part of xsave, but it does make the save area slightly larger.) I haven't looked at MPX much; it might be something the hypervisor would have to deal with on VM exits so that bounds checking doesn't apply to the hypervisor itself. If so, that would be a major inconvenience.

answered on Stack Overflow May 14, 2020 by Peter Cordes

User contributions licensed under CC BY-SA 3.0