The strangest Docker fault I've ever seen

I run docker-mailserver under Docker on one of my servers. A very strange problem appeared after I migrated some services from a legacy Debian Jessie server to an Ubuntu 16.04 LTS server. The relevant details of both servers:

Legacy:

someuser@legacyserver:~$ uname -r
3.16.0-4-amd64
someuser@legacyserver:~$ dpkg -l | grep systemd
...215-17+deb8u7...
someuser@legacyserver:~$ cat /proc/cmdline
root=ZFS=rpool/ROOT/debian-1 ro boot=zfs quiet

New server:

someuser@newserver:~$ uname -r
4.4.0-21-generic
someuser@newserver:~$ dpkg -l | grep systemd
...229-4ubuntu4...
someuser@newserver:~$ cat /proc/cmdline
root=ZFS=rpool/ROOT/debian-1 apparmor=0 ro

I am running docker-mailserver under Docker inside a systemd-nspawn Debian Jessie container. The first problem I encountered was read-only cgroups with the new systemd; this fixed that problem:

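# Skip the first cgroup line, then remount every remaining cgroup mount read-write.
mount | grep cgroup | tail -n +2 | while read -r line
do
    # Field 3 is the mount point, field 5 the filesystem type, field 6 the options in parentheses.
    umount -l "$(echo "$line" | cut -f3 -d" ")"
    mount -t "$(echo "$line" | cut -f5 -d" ")" -o "$(echo "$line" | cut -f6 -d" " | rev | cut -c2- | rev | cut -c2- | sed -e 's/ro,/rw,/g')" "$(echo "$line" | cut -f1 -d" ")" "$(echo "$line" | cut -f3 -d" ")"
done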

It just remounts all cgroups read-write (I cannot use -o remount).
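For anyone who wants to inspect what the loop does before running it, here is a minimal dry-run sketch of the same parsing. The function name cgroup_rw_cmd is hypothetical, and it assumes each line of mount output has the shape "<dev> on <mountpoint> type <fstype> (<opts>)" with no spaces in the mount point. It only prints the commands; it does not unmount or mount anything.

```shell
#!/bin/sh
# Hypothetical helper: print the umount/mount pair for one line of `mount` output.
# Assumed input shape: "<dev> on <mountpoint> type <fstype> (<opts>)"
cgroup_rw_cmd() {
    echo "$1" | {
        # Split on whitespace; on_word and type_word swallow the literal "on" and "type".
        read -r dev on_word mnt type_word fstype opts
        opts=${opts#\(}   # strip the leading "("
        opts=${opts%\)}   # strip the trailing ")"
        # Flip only a standalone "ro" flag, wherever it sits in the option list.
        opts=$(printf '%s\n' "$opts" | sed -e 's/^ro,/rw,/' -e 's/,ro,/,rw,/' -e 's/,ro$/,rw/' -e 's/^ro$/rw/')
        printf 'umount -l %s && mount -t %s -o %s %s %s\n' "$mnt" "$fstype" "$opts" "$dev" "$mnt"
    }
}
```

Something like `mount | grep cgroup | tail -n +2 | while read -r line; do cgroup_rw_cmd "$line"; done` would then list the commands for review. Anchoring the sed patterns is a defensive choice, so that an option which merely contains the letters "ro" is left alone.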

But then I rsh into the systemd-nspawn container first, and from there into the Docker container. When I reload Postfix, for example (or do anything else)... BOTH CONTAINERS (the nested Docker one and the systemd-nspawn one) exit as quiet as a mouse... Like this:

someuser@newserver:~# rsh somesystemdcontainer
Last login: Sun Jun 25 15:27:24 CEST 2017 from host0 on pts/0
Linux somesystemdcontainer 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@somesystemdcontainer:~# rsh mail #this is the docker container
Last login: Sun Jun 25 13:28:18 UTC 2017 from 172.18.0.1 on pts/0
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-21-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
root@mail:~# service postfix reload
 * Reloading Postfix configuration...
   ...done.
root@mail:~# rlogin: connection closed.
root@newserver:~#

NOTHING IN DMESG, NOTHING IN THE KERNEL LOG, NOTHING ANYWHERE. As you can see in the cmdline, disabling AppArmor on both the kernel and the userspace side doesn't help... After trying to stop the systemd-nspawn container:

jun 25 15:32:26 newserver kernel: INFO: task sh:10962 blocked for more than 120 seconds.
jun 25 15:32:26 newserver kernel:       Tainted: P           O    4.4.0-21-generic #37-Ubuntu
jun 25 15:32:26 newserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jun 25 15:32:26 newserver kernel: sh              D ffff88009ebb3c88     0 10962   9487 0x00000102
jun 25 15:32:26 newserver kernel:  ffff88009ebb3c88 0000000000000000 ffff88040dab3700 ffff8800c9450dc0
jun 25 15:32:26 newserver kernel:  ffff88009ebb4000 ffff8800c08008b0 0000000000000001 ffff8800c9450dc0
jun 25 15:32:26 newserver kernel:  ffff8800c2fe87e8 ffff88009ebb3ca0 ffffffff818203f5 ffff8800c9450dc0
jun 25 15:32:26 newserver kernel: Call Trace:
jun 25 15:32:26 newserver kernel:  [<ffffffff818203f5>] schedule+0x35/0x80
jun 25 15:32:26 newserver kernel:  [<ffffffff8111fd4f>] zap_pid_ns_processes+0x13f/0x1a0
jun 25 15:32:26 newserver kernel:  [<ffffffff8108432b>] do_exit+0xa6b/0xae0
jun 25 15:32:26 newserver kernel:  [<ffffffff8122383f>] ? dput+0x2f/0x220
jun 25 15:32:26 newserver kernel:  [<ffffffff81084423>] do_group_exit+0x43/0xb0
jun 25 15:32:26 newserver kernel:  [<ffffffff810904d2>] get_signal+0x292/0x600
jun 25 15:32:26 newserver kernel:  [<ffffffff8102e517>] do_signal+0x37/0x6f0
jun 25 15:32:26 newserver kernel:  [<ffffffff8181fd36>] ? __schedule+0x386/0xa10
jun 25 15:32:26 newserver kernel:  [<ffffffff81083526>] ? do_wait+0x116/0x240
jun 25 15:32:26 newserver kernel:  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
jun 25 15:32:26 newserver kernel:  [<ffffffff81003c5e>] syscall_return_slowpath+0x4e/0x60
jun 25 15:32:26 newserver kernel:  [<ffffffff81824650>] int_ret_from_sys_call+0x25/0x8f
jun 25 15:32:53 newserver systemd[1]: systemd-nspawn-macvlan@somesystemdcontainer.service: State 'stop-sigterm' timed out. Killing.
jun 25 15:32:53 newserver systemd-nspawn[9483]: somesystemdcontainer login:
jun 25 15:32:53 newserver systemd[1]: systemd-nspawn-macvlan@somesystemdcontainer.service: Main process exited, code=killed, status=9/KILL
jun 25 15:32:53 newserver systemd[1]: Stopped Container somesystemdcontainer.
jun 25 15:32:53 newserver systemd[1]: systemd-nspawn-macvlan@somesystemdcontainer.service: Unit entered failed state.
jun 25 15:32:53 newserver systemd[1]: systemd-nspawn-macvlan@somesystemdcontainer.service: Failed with result 'signal'.
jun 25 15:32:53 newserver systemd[1]: Stopped Container somesystemdcontainer.
jun 25 15:32:53 newserver systemd-machined[2890]: Machine somesystemdcontainer terminated.

PID 10962 is... bash inside the DOCKER container, which "jumps out of the namespace" in pstree...
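One way to double-check the pstree impression is to compare PID namespaces directly through /proc. This is only a sketch: the function names pidns_of and same_pidns are made up for illustration, and reading the namespace link of another user's process (PID 1, for instance) generally needs root on the host.

```shell
#!/bin/sh
# Print the PID-namespace identifier of a process, in the form "pid:[<inode>]".
pidns_of() {
    readlink "/proc/$1/ns/pid"
}

# Succeed if both PIDs are in the same PID namespace;
# return 2 if either namespace link cannot be read.
same_pidns() {
    a=$(pidns_of "$1") || return 2
    b=$(pidns_of "$2") || return 2
    [ "$a" = "$b" ]
}
```

Run on the host as root, `same_pidns 1 10962 && echo "same namespace as host init"` would show whether the stuck task really sits in the host's PID namespace or still belongs to a container's.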

What should I do now?

linux
docker
systemd
docker-compose
asked on Server Fault Jun 25, 2017 by MobileDevelopment


User contributions licensed under CC BY-SA 3.0