Why does the startup code of a statically linked executable issue so many system calls?


I am experimenting by statically compiling a minimal program and examining the system calls that are issued:

$ cat hello.c
#include <unistd.h>   /* declares write(); stdio.h does not */

int main (void) {
  write(1, "Hello world!", 12);
  return 0;
}

$ gcc hello.c -static

$ objdump -f a.out
a.out:     file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x00000000004003c0

$ strace ./a.out
execve("./a.out", ["./a.out"], [/* 39 vars */]) = 0
uname({sys="Linux", node="ubuntu", ...}) = 0
brk(0)                                  = 0xa20000
brk(0xa211a0)                           = 0xa211a0
arch_prctl(ARCH_SET_FS, 0xa20880)       = 0
brk(0xa421a0)                           = 0xa421a0
brk(0xa43000)                           = 0xa43000
write(1, "Hello world!", 12Hello world!)            = 12
exit_group(0)                           = ?

I know that when a program is linked dynamically, the kernel maps the dynamic loader (ld.so, named in the executable's PT_INTERP header) into the process's address space, and ld.so then maps libc.so and loads any other shared libraries.

But in this case, why are so many system calls issued, apart from execve, write and exit_group?

Why the heck uname(2)? And why so many calls to brk(2) to get and set the program break, plus a call to arch_prctl(2) to set architecture-specific thread state, when that seems like something that should have been done in kernel space, at execve time?

asked on Stack Overflow Oct 3, 2011 by Blagovest Buyukliev • edited Oct 4, 2011 by Blagovest Buyukliev

2 Answers


uname is needed to check that the kernel version is not too ancient.
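To illustrate the kind of check involved, here is a minimal sketch, assuming the libc simply parses the release string that uname(2) returns. glibc's real check is implemented differently, and the helper kernel_at_least is hypothetical.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: returns 1 if the running kernel's release is at
 * least want_major.want_minor, 0 if it is older, and -1 if uname(2)
 * fails or the release string cannot be parsed. */
int kernel_at_least(int want_major, int want_minor)
{
    struct utsname u;
    int major = 0, minor = 0;

    if (uname(&u) != 0)
        return -1;
    /* Release strings look like "5.15.0-91-generic"; only the first two
     * numeric components are compared here. */
    if (sscanf(u.release, "%d.%d", &major, &minor) != 2)
        return -1;
    if (major != want_major)
        return major > want_major;
    return minor >= want_minor;
}
```

A startup routine built this way would abort with an error message when kernel_at_least returns 0 for its minimum supported version.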

Two brk calls are needed to set up thread-local storage. Two others are needed to set up the dynamic loader's library search path (the executable might still call dlopen, even though it's statically linked). I'm not sure why these come in pairs.
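The "query, then set" shape of the first pair in the trace (brk(0) followed by brk(0xa211a0)) can be reproduced with the raw system call. This is a minimal sketch, assuming Linux, where the kernel's brk returns the resulting break rather than 0/-1 as the libc wrapper does; the function names are hypothetical, and the sketch restores the original break so it does not disturb malloc.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issues the raw brk(2) system call. The kernel returns the new program
 * break on success, or the unchanged current break if it refuses. */
static uintptr_t raw_brk(uintptr_t addr)
{
    return (uintptr_t)syscall(SYS_brk, addr);
}

/* Hypothetical demo: mimics the brk(0) / brk(new) pair from the strace
 * output, then shrinks the break back. Returns how many bytes the break
 * moved, or 0 if the kernel refused to grow it. */
uintptr_t brk_grow_and_restore(uintptr_t nbytes)
{
    uintptr_t start = raw_brk(0);              /* query, like brk(0) */
    uintptr_t grown = raw_brk(start + nbytes); /* set, like brk(0xa211a0) */

    raw_brk(start);                            /* restore the original break */
    return grown - start;
}
```

Because brk(0) is an invalid target below the heap start, the kernel leaves the break alone and just reports it, which is why a libc that has not cached the break yet issues brk(0) before its first real request.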

On my system, arch_prctl isn't called; set_thread_area, the 32-bit x86 counterpart, is called in its place. Either call sets up TLS for the current thread.
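On x86-64 the value installed by arch_prctl(ARCH_SET_FS, ...) can be read back. The hypothetical helpers below are a sketch for x86-64 only: they call arch_prctl through syscall(2) (which avoids depending on whether the libc provides a wrapper) and check that the first word of the thread control block points to itself, as the x86-64 TLS ABI expects. The address is one that libc chose in user space, which is why the kernel could not have set it at execve time.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <asm/prctl.h> /* ARCH_GET_FS */
#endif

/* Returns the FS base (the thread pointer installed at startup), or 0 if
 * it cannot be read or this is not x86-64. */
unsigned long current_fs_base(void)
{
#if defined(__x86_64__)
    unsigned long base = 0;
    if (syscall(SYS_arch_prctl, ARCH_GET_FS, &base) != 0)
        return 0;
    return base;
#else
    return 0; /* other architectures use a different thread register */
#endif
}

/* On x86-64 the word at %fs:0 holds the thread pointer itself, so reading
 * it through the segment should give the same value as the FS base. */
int fs_base_is_self_pointer(void)
{
#if defined(__x86_64__)
    unsigned long self;
    __asm__("mov %%fs:0, %0" : "=r"(self));
    return self == current_fs_base();
#else
    return 0;
#endif
}
```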

These things could probably be done lazily (i.e., when the corresponding facilities are used for the first time), but perhaps that would make no sense performance-wise (just a guess).
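The trade-off can be seen in a minimal lazy-initialization sketch. The state structure and function names are hypothetical stand-ins for whatever a libc sets up at startup: the setup runs at most once, but every access pays for an extra check.

```c
/* Hypothetical per-process state that a libc might otherwise set up eagerly. */
struct lazy_state {
    int ready;
    char buffer[64];
};

static struct lazy_state state;
static int init_calls; /* counts how often the expensive setup ran */

/* Stand-in for an expensive startup step such as a system call. */
static void do_setup(struct lazy_state *s)
{
    init_calls++;
    s->buffer[0] = '\0';
    s->ready = 1;
}

/* Every access pays for one branch; the setup itself runs at most once.
 * Not thread-safe: a real implementation would need atomics or
 * pthread_once here. */
struct lazy_state *get_state(void)
{
    if (!state.ready)
        do_setup(&state);
    return &state;
}

int setup_count(void)
{
    return init_calls;
}
```

A program that never calls get_state never pays for the setup, which is the startup saving; a program that calls it in a hot loop pays for the branch on every call.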

By the way, gdb 7.x can stop on system calls with the catch syscall command.

answered on Stack Overflow Oct 3, 2011 by n. 'pronouns' m.

Shameless plug: when the program is built against musl libc, the strace output is the same whether it is linked statically or dynamically:

execve("./a.out", ["./a.out"], [/* 42 vars */]) = 0
write(1, "Hello world!", 12)            = 12
exit_group(0)                           = ?

It should be similarly minimal with dietlibc if you link statically, or with uClibc and static linking as long as you built uClibc with locale and advanced stdio support disabled. (For some reason, uClibc with those features enabled runs lots of startup code to initialize them even in programs that don't use them...) As far as I know, however, musl is the only one with a dynamic linker capable of avoiding heavy startup syscall overhead in dynamically linked programs.

As for why static linking with glibc makes all those brk calls, I really have no idea; you'd have to read the source. I suspect it's allocating space for internal data structures for malloc, stdio, and locale, and possibly the thread structure for the main thread. As n.m. said, the arch_prctl is for setting the thread register to point to the main thread's thread structure. This could be deferred to the first access (which musl does), but it's a bit of a pain to do so and mildly hurts performance. If you care about the runtime of large programs more than the startup time of many small programs, it may make sense to always initialize the thread register at program load time. Note that the kernel cannot set it for you because it does not know the address it should be set to.

It's possible that an extension to the ELF format could be made to allow the main thread structure to live in the .data section, with an ELF header telling the kernel where it is, but the acrobatics needed between the libc, the linker, and the kernel would probably be so ugly as to make this optimization undesirable... Such changes would also impose further constraints on the userspace implementation of threads.


User contributions licensed under CC BY-SA 3.0