I have a hypothesis that speculative execution on a first-generation Intel Core (Nehalem) CPU is causing a crash. Is that possible, or am I completely wrong? If it is possible, what can I do to prevent it? Could I disable speculative execution for just one function, or for the whole translation unit?
The .cpp file that contains the problematic code is compiled with clang using the flags -mavx2 -mxsave; all other files are compiled without these flags. The code works fine on every contemporary MacBook and Windows laptop/desktop available to me.
The tester's MacBook has an Intel(R) Core(TM) i5 CPU 760, which does not support the AVX2 instruction set. There is code that checks whether AVX2 is supported, and the AVX2 path is not executed if it is not. I don't have direct access to this device for debugging, so I can't tell exactly which code causes the crash. For now I have two hypotheses: a bug in the AVX2 detection code (the primary one), and speculative execution of the AVX2 instructions.
I have already replaced ("fixed") the checking code, following the primary hypothesis, but the tester still reports the crash, so I can't be sure that isAvx2Supported is actually false on that machine.
Code that checks whether AVX2 is supported:
void cpuid(int info[4], int InfoType) noexcept
{
#ifdef _WIN32
    __cpuidex(info, InfoType, 0);
#else
    __cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
#endif
}

bool check_xcr0_ymm() noexcept
{
    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);
#else
    __asm__ __volatile__("xgetbv" : "=a" (xcr0) : "c" (0) : "%edx");
#endif
    // checking if xmm and ymm state are enabled in XCR0
    return (xcr0 & 6) == 6;
}

bool check_4th_gen_intel_core_features() noexcept
{
    // see original article
    // https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family
    int cpuInfo[4] = {};
    cpuid(cpuInfo, 1);
    // CPUID.(EAX=01H, ECX=0H):ECX.FMA[bit 12]==1
    // && CPUID.(EAX=01H, ECX=0H):ECX.MOVBE[bit 22]==1
    // && CPUID.(EAX=01H, ECX=0H):ECX.OSXSAVE[bit 27]==1
    constexpr uint32_t fma_movbe_osxsave_mask = ((1 << 12) | (1 << 22) | (1 << 27));
    if((cpuInfo[2] & fma_movbe_osxsave_mask) != fma_movbe_osxsave_mask)
        return false;
    if(!check_xcr0_ymm())
        return false;
    cpuid(cpuInfo, 7);
    // CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1
    // && CPUID.(EAX=07H, ECX=0H):EBX.BMI1[bit 3]==1
    // && CPUID.(EAX=07H, ECX=0H):EBX.BMI2[bit 8]==1
    constexpr uint32_t avx2_bmi12_mask = (1 << 5) | (1 << 3) | (1 << 8);
    if((cpuInfo[1] & avx2_bmi12_mask) != avx2_bmi12_mask)
        return false;
    cpuid(cpuInfo, 0x80000001);
    // CPUID.(EAX=80000001H):ECX.LZCNT[bit 5]==1
    if((cpuInfo[2] & (1 << 5)) == 0)
        return false;
    return true;
}

const auto isAvx2Supported = check_4th_gen_intel_core_features();
The actual code that uses AVX2:
int findCharFast(const char* data, size_t dataSize, char c, unsigned int& offset)
{
    if(isAvx2Supported)
    {
        const auto mask = _mm256_set1_epi8(c);
        auto it = data + offset;
        for(const auto end = data + dataSize - 31; it < end; it += 32)
        {
            if(const auto result = _mm256_movemask_epi8(_mm256_cmpeq_epi8(mask, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(it)))))
            {
                return it - data + get_first_bit_set(result);
            }
        }
        offset = it - data;
    }
    return -1;
}
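Because -mavx2 applies to the whole file, the compiler is free to emit AVX instructions anywhere in that translation unit, not only inside the guarded branch. One way to narrow this, sketched below, is to drop -mavx2 and enable AVX2 for a single function with the GCC/Clang target attribute. This is only an illustration under assumptions: the names find_char, find_char_avx2 and find_char_scalar are hypothetical, the runtime check uses the compiler builtin __builtin_cpu_supports rather than the hand-written CPUID code above, and it is not the fix that was eventually applied here.

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// Hypothetical helper: only this function is compiled with AVX2 enabled.
// The rest of the file keeps the baseline instruction set.
__attribute__((target("avx2")))
static long find_char_avx2(const char* data, std::size_t size, char c) noexcept
{
    const __m256i needle = _mm256_set1_epi8(c);
    std::size_t i = 0;
    // Compare 32 bytes at a time while a full chunk is available.
    for (; i + 32 <= size; i += 32) {
        const __m256i chunk =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        const unsigned mask = static_cast<unsigned>(
            _mm256_movemask_epi8(_mm256_cmpeq_epi8(needle, chunk)));
        if (mask != 0)
            return static_cast<long>(i) + __builtin_ctz(mask);
    }
    // Handle the remaining tail (fewer than 32 bytes) byte by byte.
    for (; i < size; ++i)
        if (data[i] == c)
            return static_cast<long>(i);
    return -1;
}
#endif

// Hypothetical portable fallback.
static long find_char_scalar(const char* data, std::size_t size, char c) noexcept
{
    for (std::size_t i = 0; i < size; ++i)
        if (data[i] == c)
            return static_cast<long>(i);
    return -1;
}

// Dispatches at run time; returns the index of the first match or -1.
long find_char(const char* data, std::size_t size, char c) noexcept
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return find_char_avx2(data, size, c);
#endif
    return find_char_scalar(data, size, c);
}
```

With this layout, only find_char_avx2 may contain AVX2 instructions, so a mistake elsewhere in the file cannot accidentally execute them on an older CPU.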
The crash report says:
Crashed Thread: 43 Queue(0x60400005b270)[16]
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000

Thread 43 Crashed:: Queue(0x60400005b270)[16]
0 0x000000010d54b06b findCharFast(char const*, unsigned long, char, unsigned int&) + 91

Thread 43 crashed with X86 Thread State (64-bit):
rax: 0x00000000ffffffff rbx: 0x0000000000000000 rcx: 0x000070000982bbd0 rdx: 0x0000000000000000
rdi: 0x00007f9ac044fa0b rsi: 0x000000000000000d rbp: 0x000070000982bc00 rsp: 0x000070000982bbc8
r8: 0x0000000000000001 r9: 0x0000000000000001 r10: 0x000060c000334030 r11: 0xfffffffffffc0fde
r12: 0x0000000000000001 r13: 0x0000600000016fb0 r14: 0x00007f9ac044fa0b r15: 0x00007f9ac044fa18
rip: 0x000000010d54b06b rfl: 0x0000000000010246 cr2: 0x000000010d529cf0
There was a bug in the AVX2 support detection code. The Intel article that describes how to do it is basically wrong: before calling xgetbv, the implementation MUST check that CPUID.(EAX=01H):ECX.XSAVE[bit 26]==1. Checking only the OSXSAVE flag is not sufficient.
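The corrected ordering can be sketched as follows. This is a minimal illustration for GCC/Clang on x86, not the exact code that was shipped: the names xsave_and_osxsave_set and can_use_ymm are hypothetical, and the sketch covers only the XSAVE/OSXSAVE/XCR0 part of the check, so the AVX2, BMI and LZCNT feature bits would still have to be tested afterwards.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Hypothetical helper: true only if both XSAVE (bit 26) and OSXSAVE (bit 27)
// are set in the ECX value returned by CPUID leaf 1.
constexpr bool xsave_and_osxsave_set(std::uint32_t ecx) noexcept
{
    constexpr std::uint32_t mask = (1u << 26) | (1u << 27);
    return (ecx & mask) == mask;
}

// Hypothetical helper: true if it is safe to use YMM registers, i.e. the
// CPU supports XSAVE, the OS enabled it, and XCR0 reports XMM and YMM state.
bool can_use_ymm() noexcept
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;

    // xgetbv must not be executed unless both bits are set.
    if (!xsave_and_osxsave_set(ecx))
        return false;

    std::uint32_t xcr0_lo = 0, xcr0_hi = 0;
    __asm__ __volatile__("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
    (void)xcr0_hi;

    // Bits 1 and 2 of XCR0: XMM and YMM state enabled by the OS.
    return (xcr0_lo & 6u) == 6u;
#else
    return false;  // not an x86 target
#endif
}
```

The point of the ordering is that xgetbv is reached only after both bits have been confirmed, so a CPU or environment that reports one without the other takes the early return instead of executing the instruction.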