VMX performance issue with rdtsc (no RDTSC exiting, using TSC offsetting)

I am working on a Linux kernel module (a VMM) to experiment with Intel VMX by running a self-made VM (the VM starts in real mode, then switches to 32-bit protected mode with paging enabled).
The VMM is configured NOT to use RDTSC exiting and to use TSC offsetting.
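
Simplified, the relevant VMCS programming looks like this (field encodings and control-bit positions are from the Intel SDM; vmcs_write() and setup_tsc_controls() are illustrative stand-ins, not my actual code):

#include <linux/types.h>

/* VMCS field encodings and primary processor-based control bits (Intel SDM). */
#define CPU_BASED_VM_EXEC_CONTROL    0x4002
#define TSC_OFFSET_FIELD             0x2010
#define CPU_BASED_USE_TSC_OFFSETTING (1u << 3)
#define CPU_BASED_RDTSC_EXITING      (1u << 12)

/* Thin wrapper around the VMWRITE instruction. */
static inline void vmcs_write(unsigned long field, unsigned long value)
{
        /* AT&T operand order: source (value) first, destination (field encoding) second */
        asm volatile("vmwrite %1, %0" : : "r"(field), "r"(value) : "cc", "memory");
}

static void setup_tsc_controls(u32 cpu_based_ctls, u64 tsc_offset)
{
        cpu_based_ctls &= ~CPU_BASED_RDTSC_EXITING;      /* do not VM-exit on RDTSC        */
        cpu_based_ctls |=  CPU_BASED_USE_TSC_OFFSETTING; /* guest TSC = host TSC + offset  */
        vmcs_write(CPU_BASED_VM_EXEC_CONTROL, cpu_based_ctls);
        vmcs_write(TSC_OFFSET_FIELD, tsc_offset);        /* e.g. 0 for an identity mapping */
}
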
Then the VM runs rdtsc to check performance, as below.

static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile(
            "cpuid"
            :"=a"(*eax),"=b"(*ebx),"=c"(*ecx), "=d"(*edx)
            :"a"(code)
            :"cc");
}

uint64_t rdtsc(void)
{
        uint32_t  lo, hi;
        // RDTSC copies contents of 64-bit TSC into EDX:EAX
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

void i386mode_tests(void)
{
    u32 eax, ebx, ecx, edx;
    u32 i = 0;

    asm ("mov %%cr0, %%eax\n"
         "mov %%eax, %0  \n" : "=m" (eax) : :);

    my_printf("Guest CR0 = 0x%x\n", eax);
    cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
    vm_tsc[0]= rdtsc();
    for (i = 0; i < 100; i ++) {
        rdtsc();
    }
    vm_tsc[1]= rdtsc();
    my_printf("Rdtsc takes %d\n", vm_tsc[1] - vm_tsc[0]);
}

The output is something like this,

Guest CR0 = 0x80050033
Rdtsc takes 2742

On the other hand, I made a host application that does the same thing as above:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
        __asm__ volatile(
                        "cpuid"
                        :"=a"(*eax),"=b"(*ebx),"=c"(*ecx), "=d"(*edx)
                        :"a"(code)
                        :"cc");
}

uint64_t rdtsc(void)
{
        uint32_t  lo, hi;
        // RDTSC copies contents of 64-bit TSC into EDX:EAX
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
        uint64_t     vm_tsc[2];
        uint32_t eax, ebx, ecx, edx, i;

        cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
        vm_tsc[0]= rdtsc();
        for (i = 0; i < 100; i ++) {
                rdtsc();
        }
        vm_tsc[1]= rdtsc();
        printf("Rdtsc takes %ld\n", vm_tsc[1] - vm_tsc[0]);

        return 0;
}

It outputs the following:

Rdtsc takes 2325

Running the above two programs for 40 iterations and averaging the results gives:

avg(VM)   = 3188.000000
avg(host) = 2331.000000
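
For reference, a self-contained harness that would produce such a 40-run average on the host might look like this (the averaging loop here is illustrative; the exact harness is not shown above):

#include <stdio.h>
#include <stdint.h>

static uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

/* One measurement: 101 rdtsc executions between the two timestamps. */
static uint64_t measure_once(void)
{
        uint64_t t0 = rdtsc();
        for (int i = 0; i < 100; i++)
                rdtsc();
        return rdtsc() - t0;
}

int main(void)
{
        double sum = 0.0;
        for (int run = 0; run < 40; run++)
                sum += (double)measure_once();
        printf("avg = %f\n", sum / 40.0);
        return 0;
}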

The performance difference between running this code in the VM and on the host is too large to ignore, and it is not what I expected: divided over the 101 rdtsc executions between the two timestamps, that is roughly 32 cycles per rdtsc in the VM versus roughly 23 on the host (loop overhead included).
My understanding is that with TSC offsetting enabled and RDTSC exiting disabled, a guest rdtsc executes natively and simply returns the host TSC plus the TSC offset, so there should be little difference between running it in the VM and on the host.
Here are the relevant VMCS fields:

 0xA501E97E = control_VMX_cpu_based  
 0xFFFFFFFFFFFFFFF0 = control_CR0_mask  
 0x0000000080050033 = control_CR0_shadow  
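
Decoding the reported primary processor-based controls (0xA501E97E) confirms that configuration (bit positions per the Intel SDM; this check snippet is just for illustration):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t ctls = 0xA501E97E;
        printf("use TSC offsetting (bit 3):  %u\n", (ctls >> 3) & 1);  /* prints 1 */
        printf("RDTSC exiting      (bit 12): %u\n", (ctls >> 12) & 1); /* prints 0 */
        return 0;
}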

In the last-level EPT PTEs, bits[5:3] = 6 (write-back) and bit[6] = 1 (ignore PAT); EPTP[2:0] = 6 (write-back).
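
For clarity, this is how those bits are laid out (illustrative helpers, not my actual page-table code; layouts per the Intel SDM):

#include <stdint.h>

#define EPT_MT_WB 6ULL  /* write-back memory type */

/* Last-level EPT PTE: bits 2:0 = RWX, bits 5:3 = memory type, bit 6 = ignore PAT. */
uint64_t make_ept_pte(uint64_t page_phys)
{
        return page_phys | 0x7 | (EPT_MT_WB << 3) | (1ULL << 6);
}

/* EPTP: bits 2:0 = memory type, bits 5:3 = page-walk length minus 1 (3 => 4 levels). */
uint64_t make_eptp(uint64_t pml4_phys)
{
        return pml4_phys | EPT_MT_WB | (3ULL << 3);
}
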
I tested both on bare metal and in VMware, and got similar results.
I am wondering if there is anything I missed in this case.

linux
performance
x86
virtualization
rdtsc
asked on Stack Overflow Jun 7, 2018 by wangt13 • edited Jun 7, 2018 by phuclv

0 Answers



User contributions licensed under CC BY-SA 3.0