Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code

Question

This question was put on hold as too broad, presumably because of the research I included in an effort to "show my work" instead of asking a low effort question. To remedy this, allow me to summarize the entire question in a single sentence (credit to @PeterCordes for this phrase):

How do I efficiently call (x86-64) ahead-of-time compiled functions (that I control, may be further than 2GB away) from JITed code (that I am generating)?

This alone, I suspect, would be put on hold as "too broad." In particular, it lacks a "what have you tried." So, I felt the need to add additional information showing my research/thinking and what I have tried. Below is a somewhat stream of consciousness of this.

Note that none of the questions posed below here are ones I expect to be answered; they are more rhetorical. Their purpose is to demonstrate why I can't answer the above question (despite my research, I lack the experience in this area to make definitive statements such as @PeterCordes's "branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well."). Also note that the Rust component is largely irrelevant here as this is an assembly issue. My reasoning for including it was the ahead-of-time compiled functions are written in Rust, so I was unsure if there was something that Rust did (or instructed LLVM to do) that could be advantageous in this situation. It is totally acceptable for an answer to not consider Rust at all; in fact, I expect this will be the case.

Think of the following as scratch work on the back of a math exam:

Note: I muddled the term intrinsics here. As pointed out in the comments, "ahead-of-time compiled functions" is a better description. Below I'll abbreviate that AOTC functions.

I'm writing a JIT in Rust (although Rust is only relevant to a bit of my question, the bulk of it relates to JIT conventions). I have AOTC functions that I've implemented in Rust that I need to be able to call from code emitted by my JIT. My JIT mmap(_, _, PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED)s some pages for the jitted code. I have the addresses of my AOTC functions, but unfortunately they are much further away than a 32-bit offset. I'm trying to decide now how to emit calls to these AOTC functions. I've considered the following options (these are not questions to be answered, just demonstrating why I can't answer the core question of this SO thread myself):

1. (Rust specific) Somehow make Rust place the AOTC functions close to (maybe on?) the heap so that the calls will be within a 32-bit offset. It's unclear whether that is possible with Rust (there is a way to specify custom linker args, but I can't tell what those are applied to or whether I could target a single function for relocation. And even if I could, where would I put it?). It also seems like this could fail if the heap is large enough.
2. (Rust specific) Allocate my JIT pages closer to the AOTC functions. This could be achieved with mmap(_, _, PROT_EXEC, MAP_FIXED), but I'm unsure how to pick an address that wouldn't clobber existing Rust code (while keeping within arch restrictions; is there a sane way to get those restrictions?).
3. Create stubs within the JIT pages that handle the absolute jump (code below), then call the stubs. This has the benefit of the (initial) call site in the JITted code being a nice small relative call. But it feels wrong to have to jump through something. This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction). Additionally, it seems like this jump would be slower since its address is indirect and it depends on the mov for that address.

mov rax, {ABSOLUTE_AOTC_FUNCTION_ADDRESS}
jmp rax

4. The reverse of (3), just inlining the above at each intrinsic call site in the JITed code. This resolves the indirection issue, but makes the JITted code larger (perhaps this has instruction cache and decoding consequences). It still has the issue that the jump is indirect and depends on the mov.
5. Place the addresses of the AOTC functions on a PROT_READ (only) page near the JIT pages. Make all the call sites near, absolute indirect calls (code below). This removes the second level of indirection from (3). But the encoding of this instruction is unfortunately large (6 bytes), so it has the same issues as (4). Additionally, now instead of depending on a register, jumps unnecessarily (insofar as the address is known at JIT time) depend on memory, which certainly has performance implications (despite perhaps this page being cached?).

aotc_function_address:
    .quad 0xDEADBEEF

# Then at the call site
call qword ptr [rip+aotc_function_address]

6. Futz with a segment register to place it closer to the AOTC functions so that calls can be made relative to that segment register. The encoding of such a call is long (so maybe this has decoding pipeline issues), but other than that this largely avoids lots of the tricky bits of everything before it. But, maybe calling relative to a non-cs segment performs poorly. Or maybe such futzing is not wise (messes with the Rust runtime, for example). (As pointed out by @prl, this doesn't work without a far call, which is terrible for performance.)
7. Not really a solution, but I could make the compiler 32-bit and not have this problem at all. That's not really a great solution and it also would prevent me from using the extended general purpose registers (of which I utilize all).

All of the options presented have drawbacks. Briefly, 1 and 2 are the only ones that don't seem to have performance impacts, but it's unclear if there is a non-hacky way to achieve them (or any way at all for that matter). 3-5 are independent of Rust, but have obvious performance drawbacks.

Given this stream of consciousness, I arrived at the following rhetorical questions (which don't need explicit answers) to demonstrate that I lack the knowledge to answer the core question of this SO thread by myself. I have struck them to make it abundantly clear that I am not posing all of these as part of my question.

For approach (1), is it possible to force Rust to link certain extern "C" functions at a specific address (near the heap)? How should I choose such an address (at compile time)? Is it safe to assume that any address returned by mmap (or allocated by Rust) will be within a 32 bit offset of this location?

For approach (2), how can I find a suitable place to place the JIT pages (such that it doesn't clobber existing Rust code)?

And some JIT (non-Rust) specific questions:

For approach (3), will the stubs hamper performance enough that I should care? What about the indirect jmp? I know this somewhat resembles linker stubs, except as I understand linker stubs are at least only resolved once (so they don't need to be indirect?). Do any JITs employ this technique?

For approach (4), if the indirect call in 3 is okay, is inlining the calls worth it? If JITs typically employ approach (3/4) is this option better?

For approach (5), is the dependence of the jump on memory (given that the address is known at compile time) bad? Would that make it less performant than (3) or (4)? Do any JITs employ this technique?

For approach (6), is such futzing unwise? (Rust specific) Is there a segment register available (not used by the runtime or ABI) for this purpose? Will calls relative to a non-cs segment be as performant as those relative to cs?

And finally (and most importantly), is there a better approach (perhaps employed more commonly by JITs) that I'm missing here?

I can't implement (1) or (2) without my Rust questions having answers. I could, of course, implement and benchmark 3-5 (perhaps 6, although it would be nice to know about the segment register futzing beforehand), but given that these are vastly different approaches, I was hoping there was existing literature about this that I couldn't find, because I didn't know the right terms to google for (I'm also currently working on those benchmarks). Alternatively maybe someone who's delved into JIT internals can share their experience or what they've commonly seen?

I am aware of this question: Jumps for a JIT (x86_64). It differs from mine because it is talking about stringing together basic blocks (and the accepted solution is way too many instructions for a frequently called intrinsic). I am also aware of Call an absolute pointer in x86 machine code, which while it discusses similar topics to mine, is different, because I am not assuming that absolute jumps are necessary (approaches 1-2 would avoid them, for example).

assembly

rust

compiler-construction

x86-64

jit

asked on Stack Overflow Mar 1, 2019 by

Bailey Parker • edited Apr 12, 2019 by

Bailey Parker

1 Answer

Summary: try to allocate memory near your static code. But for calls that can't reach with rel32, fall back to call qword [rel pointer] or inline mov r64,imm64 / call r64.

Your mechanism 5. is probably best for performance if you can't make 2. work, but 4. is easy and should be fine. Direct call rel32 needs some branch prediction, too, but it's definitely still better.

Terminology: "intrinsic functions" should probably be "helper" functions. "Intrinsic" usually means either language built-in (e.g. Fortran meaning) or "not a real function, just something that inlines to a machine instruction" (C/C++ / Rust meaning, like for SIMD, or stuff like _mm_popcnt_u32(), _pdep_u32(), or _mm_mfence()). Your Rust functions are going to compile to real functions that exist in machine code that you're going to call with call instructions.

Yes, allocating your JIT buffers within +-2GiB of your target functions is obviously ideal, allowing rel32 direct calls.

The most straightforward would be to use a large static array in the BSS (which the linker will place within 2GiB of your code) and carve your allocations out of that. (Use mprotect (POSIX) or VirtualProtect (Windows) to make it executable).

Most OSes (Linux included) do lazy allocation for the BSS (COW mapping to the zero page, only allocating physical page frames to back that allocation when it's written, just like mmap without MAP_POPULATE), so it only wastes virtual address space to have a 512MiB array in the BSS that you only use the bottom 10kB of.

Don't make it larger than or close to 2GiB, though, because that will push other things in the BSS too far away. The default "small" code model (as described in the x86-64 System V ABI) puts all static addresses within 2GiB of each other for RIP-relative data addressing and rel32 call/jmp.

Downside: you'd have to write at least a simple memory allocator yourself, instead of working with whole pages with mmap/munmap. But that's easy if you don't need to free anything. Maybe just generate code starting at an address, and update a pointer once you get to the end and discover how long your code block is. (But that's not multi-threaded...) For safety, remember to check when you get to the end of this buffer and abort, or fall back to mmap.
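A minimal sketch of such a never-freeing bump allocator, assuming a single thread. The name BumpArena and its fields are hypothetical; base would be the address of your static BSS array (or any other reserved region), and making the pages executable with mprotect is not shown.

```rust
/// Bump allocator over a fixed address range (single-threaded sketch).
/// `base` is assumed to be the address of a buffer you already own,
/// e.g. a large static array that the linker placed in the BSS.
pub struct BumpArena {
    base: usize, // address of the first byte of the buffer
    cap: usize,  // total size of the buffer in bytes
    used: usize, // bytes handed out so far, including alignment padding
}

impl BumpArena {
    pub fn new(base: usize, cap: usize) -> Self {
        BumpArena { base, cap, used: 0 }
    }

    /// Returns the start address of `len` bytes aligned to `align`
    /// (which must be a power of two), or None if the buffer is full.
    pub fn alloc(&mut self, len: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        let cur = self.base.checked_add(self.used)?;
        let start = cur.checked_add(align - 1)? & !(align - 1);
        let end_off = (start - self.base).checked_add(len)?;
        if end_off > self.cap {
            // Out of space: the caller should abort or fall back to mmap.
            return None;
        }
        self.used = end_off;
        Some(start)
    }
}
```

A failed alloc leaves the arena unchanged, so the caller can still satisfy smaller requests or switch to an mmap-based fallback.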

If your absolute target addresses are in the low 2GiB of virtual address space, use mmap(MAP_32BIT) on Linux. (e.g. if your Rust code is compiled into a non-PIE executable for x86-64 Linux. But that won't be the case for PIE executables (common these days), or for targets in shared libraries. You can detect this at run-time by checking the address of one of your helper functions.)

In general (if MAP_32BIT isn't helpful/available), your best bet is probably mmap without MAP_FIXED, but with a non-NULL hint address that you think is free.

Linux 4.17 introduced MAP_FIXED_NOREPLACE which would let you easily search for a nearby unused region (e.g. step by 64MB and retry if you get EEXIST, then remember that address to avoid searching next time). Otherwise you could parse /proc/self/maps once at startup to find some unmapped space near the mapping that contains the address of one of your helper functions. They will be close together.

Note that older kernels which do not recognize the MAP_FIXED_NOREPLACE flag will typically (upon detecting a collision with a preexisting mapping) fall back to a "non-MAP_FIXED" type of behavior: they will return an address that is different from the requested address.
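As a rough sketch of the /proc/self/maps approach: the function below takes the text of the maps file and the address of one helper function, and returns the free gap closest to it that can hold len bytes. The names find_gap_near and parse_ranges are made up for illustration, it only looks at gaps between existing mappings, and it assumes len is a multiple of the page size. The result is only a candidate to pass to mmap as a hint (or with MAP_FIXED_NOREPLACE); the mmap call itself is not shown.

```rust
/// Parse the leading "start-end" hex column of /proc/self/maps-style text.
fn parse_ranges(maps: &str) -> Vec<(u64, u64)> {
    let mut ranges: Vec<(u64, u64)> = maps
        .lines()
        .filter_map(|line| {
            let (start, end) = line.split_whitespace().next()?.split_once('-')?;
            Some((
                u64::from_str_radix(start, 16).ok()?,
                u64::from_str_radix(end, 16).ok()?,
            ))
        })
        .collect();
    ranges.sort();
    ranges
}

/// Find the unmapped gap nearest to `target` that can hold `len` bytes
/// while keeping every byte of the region within 2GiB of `target`.
/// Only gaps between two existing mappings are considered.
pub fn find_gap_near(maps: &str, target: u64, len: u64) -> Option<u64> {
    const REACH: u64 = 1 << 31; // conservative bound for rel32 range
    let ranges = parse_ranges(maps);
    let mut best: Option<(u64, u64)> = None; // (distance, address)
    for pair in ranges.windows(2) {
        let (gap_start, gap_end) = (pair[0].1, pair[1].0);
        if gap_end < gap_start || gap_end - gap_start < len {
            continue; // overlapping entries, or gap too small
        }
        // Use the end of the gap that is closer to the target.
        let addr = if gap_end <= target { gap_end - len } else { gap_start };
        // Distance from the target to the farthest byte of the region.
        let dist = addr.abs_diff(target).max((addr + len).abs_diff(target));
        if dist < REACH && best.map_or(true, |(d, _)| dist < d) {
            best = Some((dist, addr));
        }
    }
    best.map(|(_, addr)| addr)
}
```

In a real JIT you would read the text with std::fs::read_to_string("/proc/self/maps") and pass the address of a helper function as target. Another thread could map something into the gap before you do, so the mmap result still has to be checked.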

The next higher or lower free page(s) would be ideal for having a non-sparse memory map, so the page table doesn't need too many different top-level page directories. (HW page tables are a radix tree.) And once you find a spot that works, make future allocations contiguous with that. If you end up using a lot of space there, the kernel can opportunistically use a 2MB hugepage, and having your pages contiguous again means they share the same parent page directory in the HW page tables so iTLB misses triggering page walks may be slightly cheaper (if those higher levels stay hot in data caches, or even cached inside the pagewalk hardware itself). It's also more efficient for the kernel to track as one larger mapping. Of course, using more of an already-allocated page is even better, if there's room. Better code density on a page level helps the instruction TLB, and possibly also within a DRAM page (but that's not necessarily the same size as a virtual memory page).

Then as you do code-gen for each call, just check whether the target is in range for a call rel32 with off == (off as i32) as i64, else fall back to 10-byte mov r64,imm64 / call r64. (rustc will compile that to movsxd/cmp, so checking every time only has trivial cost for JIT compile times.)

(Or 5-byte mov r32,imm32 if possible. OSes that don't support MAP_32BIT might still have the target addresses down there. Check for that with target == (target as u32) as u64. The 3rd mov-immediate encoding, 7-byte mov r/m64, sign_extended_imm32 is probably not interesting unless you're JITing kernel code for a kernel mapped in the high 2GiB of virtual address space.)
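A sketch of what that per-call-site check could look like. The names emit_call and CallKind are hypothetical; site is assumed to be the address where the first emitted byte will end up, and RAX is assumed to be free as a scratch register at the call site (not the case for variadic callees, see below).

```rust
/// Which encoding was chosen, so the caller can count or log fallbacks.
#[derive(Debug, PartialEq, Eq)]
pub enum CallKind {
    Rel32,
    MovImm32,
    MovImm64,
}

/// Append a call to `target` to `buf`, where `site` is the address at
/// which the first emitted byte will live once the code is in place.
pub fn emit_call(buf: &mut Vec<u8>, site: u64, target: u64) -> CallKind {
    // rel32 is relative to the end of the 5-byte call instruction.
    let off = target.wrapping_sub(site.wrapping_add(5)) as i64;
    if off == (off as i32) as i64 {
        buf.push(0xE8); // call rel32
        buf.extend_from_slice(&(off as i32).to_le_bytes());
        CallKind::Rel32
    } else if target == (target as u32) as u64 {
        buf.push(0xB8); // mov eax, imm32 (zero-extends into rax)
        buf.extend_from_slice(&(target as u32).to_le_bytes());
        buf.extend_from_slice(&[0xFF, 0xD0]); // call rax
        CallKind::MovImm32
    } else {
        buf.extend_from_slice(&[0x48, 0xB8]); // mov rax, imm64
        buf.extend_from_slice(&target.to_le_bytes());
        buf.extend_from_slice(&[0xFF, 0xD0]); // call rax
        CallKind::MovImm64
    }
}
```

Returning the chosen kind makes it easy to bump a counter whenever the rel32 path is not taken.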

The beauty of checking and using a direct call whenever possible is that it decouples code-gen from any knowledge about allocating nearby pages or where the addresses come from, and just opportunistically makes good code. (You might record a counter or log once so you / your users at least notice if your nearby allocation mechanism is failing, because the perf diff won't typically be easily measurable.)

Alternatives to mov-imm / call reg

mov r64,imm64 is a 10-byte instruction that's a bit large to fetch/decode, and for the uop-cache to store. And may take an extra cycle to read from the uop cache on SnB-family according to Agner Fog's microarch pdf (https://agner.org/optimize). But modern CPUs have pretty good bandwidth for code-fetch, and robust front-ends.

If profiling finds that front-end bottlenecks are a big problem in your code, or large code size is causing eviction of other valuable code from L1 I-cache, I'd go with option 5.

BTW, if any of your functions are variadic, x86-64 System V requires that you pass AL = number of XMM args, so RAX isn't free to hold the function pointer; you could use r11 instead. It's call-clobbered and not used for arg-passing. But RAX (or another "legacy" register) will save a REX prefix on the call.

Allocate Rust functions near where mmap will allocate

No, I don't think there's any mechanism to get your statically compiled functions near where mmap might happen to put new pages.

mmap has more than 4GB of free virtual address space to pick from. You don't know ahead of time where it's going to allocate. (Although I think Linux at least does keep some amount of locality, to optimize the HW page tables.)

You in theory could copy the machine code of your Rust functions, but they probably reference other static code/data with RIP-relative addressing modes.

call rel32 to stubs that use mov/jmp reg

This seems like it would be detrimental to performance (perhaps interfering with RAS/jump address prediction).

The perf downside is only from having 2 total call/jump instructions for the front-end to get past before it can feed the back-end with useful instructions. It's not great; 5. is much better.

This is basically how the PLT works for calls to shared-library functions on Unix/Linux, and will perform the same. Calling through a PLT (Procedure Linkage Table) stub function is almost exactly like this. So the performance impacts have been well-studied and compared with other ways of doing things. We know that dynamic library calls aren't a performance disaster.

The Q&A "Asterisk before an address and push instructions, where is it being pushed to?" shows AT&T disassembly of one, or single-step a C program like main(){puts("hello"); puts("world");} if you're curious. (On the first call, it pushes an arg and jumps to a lazy dynamic linker function; on subsequent calls the indirect jump target is the address of the function in the shared library.)

The Q&A "Why does the PLT exist in addition to the GOT, instead of just using the GOT?" explains more. The jmp whose address is updated by lazy linking is jmp qword [xxx@GOTPLT]. (And yes, the PLT really does use a memory-indirect jmp here, even on i386 where a jmp rel32 that gets rewritten would work. IDK if GNU/Linux ever historically used to rewrite the offset in a jmp rel32.)

The jmp is just a standard tailcall, and does not unbalance the Return-Address predictor Stack. The eventual ret in the target function will return to the instruction after the original call, i.e. to the address that call pushed onto the call stack and onto the microarchitectural RAS. Only if you used a push / ret (like a "retpoline" for Spectre mitigation) would you unbalance the RAS.
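For reference, the mov/jmp stub from option 3 of the question is only 12 bytes. A sketch of emitting one (emit_jmp_stub is a made-up name, and RAX is assumed to be usable as a scratch register):

```rust
/// Encode `mov rax, imm64` followed by `jmp rax` for the given target.
pub fn emit_jmp_stub(target: u64) -> [u8; 12] {
    let mut stub = [0u8; 12];
    stub[0] = 0x48; // REX.W
    stub[1] = 0xB8; // mov rax, imm64
    stub[2..10].copy_from_slice(&target.to_le_bytes());
    stub[10] = 0xFF; // jmp rax is FF /4 with ModRM E0
    stub[11] = 0xE0;
    stub
}
```

Call sites would then reach the stub with an ordinary call rel32, and the callee's ret returns straight to the JITed caller.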

But the code in "Jumps for a JIT (x86_64)" that you linked is unfortunately terrible (see my comment under it). It will break the RAS for future returns. You'd think it would only break prediction for this one return, since the call (used to get a return address to adjust) should balance out the push/ret, but actually call +0 is a special case that doesn't go on the RAS in most CPUs: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks. (Calling over a nop could change that I guess, but the whole thing is totally insane vs. call rax unless it's trying to defend against Spectre exploits.) Normally on x86-64, you use a RIP-relative LEA to get a nearby address into a register, not call/pop.

inline mov r64, imm64 / call reg

This is probably better than 3; the front-end cost of larger code-size is probably lower than the cost of calling through a stub that uses jmp.

But this is also probably good enough, especially if your alloc-within-2GiB methods work well enough most of the time on most of the targets you care about.

There may be cases where it's slower than 5. though. Branch prediction hides the latency of fetching and checking the function pointer from memory, assuming that it predicts well. (And usually it will, or else it runs so infrequently that it's not performance relevant.)

call qword [rel nearby_func_ptr]

This is how gcc -fno-plt compiles calls to shared-library functions on Linux (call [rip + symbol@GOTPCREL]), and how Windows DLL function calls are normally done. (This is like one of the suggestions in http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/)

call [RIP-relative] is 6 bytes, only 1 byte larger than call rel32, so it has a negligible impact on code-size vs. calling a stub. Fun fact: you will sometimes see addr32 call rel32 in machine code (the address size prefix has no effect except for padding). This comes from a linker relaxing a call [RIP + symbol@GOTPCREL] to a call rel32 if the symbol with non-hidden ELF visibility was found in another .o during linking, not a different shared object after all.

For shared library calls, this is usually better than PLT stubs, with the only downside being slower program startup because it requires early binding (non-lazy dynamic linking). This isn't an issue for you; the target address is known ahead of code-gen time.

The patch author tested its performance vs. a traditional PLT on some unknown x86-64 hardware. Clang is maybe a worst-case scenario for shared library calls, because it makes many calls to small LLVM functions that don't take much time, and it's long running so early-binding startup overhead is negligible. After using gcc and gcc -fno-plt to compile clang, the time for clang -O2 -g to compile tramp3d goes from 41.6s (PLT) to 36.8s (-fno-plt). clang --help becomes slightly slower.

(x86-64 PLT stubs use jmp qword [symbol@GOTPLT], not mov r64,imm64/jmp though. A memory-indirect jmp is only a single uop on modern Intel CPUs, so it's cheaper on a correct prediction, but maybe slower on an incorrect prediction, especially if the GOTPLT entry misses in cache. If it's used frequently, it will typically predict correctly, though. But anyway a 10-byte movabs and a 2-byte jmp can fetch as a block (if it fits in a 16-byte aligned fetch block), and decode in a single cycle, so 3. is not totally unreasonable. But this is better.)

When allocating space for your pointers, remember that they're fetched as data, into L1d cache, and with a dTLB entry not iTLB. Don't interleave them with code, that would waste space in the I-cache on this data, and waste space in D-cache on lines that contain one pointer and mostly code. Group your pointers together in a separate 64-byte chunk from code so the line doesn't need to be in both L1I and L1D. It's fine if they're in the same page as some code; they're read-only so won't cause self-modifying-code pipeline nukes.
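A sketch of both halves of option 5, under a few assumptions: build_pointer_table and emit_call_mem are hypothetical names, the table is assumed to be copied to a 64-byte-aligned address near the code, and site is the address where the first byte of the call will live.

```rust
pub const CACHE_LINE: usize = 64;

/// Lay out 8-byte little-endian function pointers back to back and
/// zero-pad to whole cache lines, so that (when placed at a 64-byte
/// aligned address) no line mixes pointers with code.
pub fn build_pointer_table(targets: &[u64]) -> Vec<u8> {
    let mut table: Vec<u8> = targets.iter().flat_map(|t| t.to_le_bytes()).collect();
    let padded = (table.len() + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    table.resize(padded, 0);
    table
}

/// Append `call qword [rip + disp32]` (FF 15 disp32) to `buf`.
/// `slot` is the address of the table entry holding the target.
/// Returns false and emits nothing if the slot is out of disp32 range.
pub fn emit_call_mem(buf: &mut Vec<u8>, site: u64, slot: u64) -> bool {
    // disp32 is relative to the end of the 6-byte instruction.
    let disp = slot.wrapping_sub(site.wrapping_add(6)) as i64;
    if disp != (disp as i32) as i64 {
        return false;
    }
    buf.extend_from_slice(&[0xFF, 0x15]);
    buf.extend_from_slice(&(disp as i32).to_le_bytes());
    true
}
```

The slot for helper number i is then table_address + 8 * i, and the table only has to be within disp32 range of the code, not the helpers themselves.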

answered on Stack Overflow Apr 12, 2019 by

Peter Cordes • edited Jun 20, 2020 by

Community

User contributions licensed under CC BY-SA 3.0