We are currently experiencing intermittent, seemingly random restarts of all our Docker Swarm services (in production). This behaviour started a couple of days ago and so far our services have restarted about 6 times.
When we check /var/log/messages, we see the following at the moment the services are being restarted:
Jul 18 14:27:07 MASTER_SERVER dockerd: time="2019-07-18T14:27:07.777591370+02:00" level=info msg="NetworkDB stats OUR_MASTER_SERVER(1f6c53091b31) - netID:xqo1jd54dy9h4si7p5kd147lf leaving:false netPeers:2 entries:16 Queue qLen:0 netMsg/s:0"
Jul 18 14:27:07 MASTER_SERVER dockerd: time="2019-07-18T14:27:07.778866829+02:00" level=info msg="NetworkDB stats OUR_MASTER_SERVER(1f6c53091b31) - netID:hlxpttt1y61naebb1ehva5oa0 leaving:false netPeers:2 entries:42 Queue qLen:0 netMsg/s:0"
Jul 18 14:27:07 MASTER_SERVER dockerd: time="2019-07-18T14:27:07.778894355+02:00" level=info msg="NetworkDB stats OUR_MASTER_SERVER(1f6c53091b31) - netID:hccgvewpmr4qg93smkf1zb3gk leaving:false netPeers:2 entries:4 Queue qLen:0 netMsg/s:0"
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 25: Deprecated option KeyRegenerationInterval
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 26: Deprecated option ServerKeyBits
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 40: Deprecated option RSAAuthentication
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 42: Deprecated option RhostsRSAAuthentication
Jul 18 14:28:05 MASTER_SERVER sshd[64861]: rexec line 61: Deprecated option UseLogin
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.219729414+02:00" level=warning msg="memberlist: Refuting a suspect message (from: b19e7378f796)"
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.242915522+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = NotFound desc = node not registered" method="(*session).heartbeat" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8 session.id=mhf0hqu2wo4t6x1n8ja4tzwr3 sessionID=mhf0hqu2wo4t6x1n8ja4tzwr3
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.377485716+02:00" level=error msg="agent: session failed" backoff=100ms error="node not registered" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.378175039+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.378223959+02:00" level=info msg="waiting 78.699854ms before registering session" module=node/agent node.id=0bf0j7rcw6xg7pda5lka7jyy8
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.392797756+02:00" level=info msg="worker 2ni1yglnxjxlpopn7kwxh6oh5 was successfully registered" method="(*Dispatcher).register"
Jul 18 14:28:37 MASTER_SERVER dockerd: time="2019-07-18T14:28:37.497010348+02:00" level=info msg="worker 0bf0j7rcw6xg7pda5lka7jyy8 was successfully registered" method="(*Dispatcher).register"
Jul 18 14:28:39 MASTER_SERVER kernel: IPVS: rr: FWM 452 0x000001C4 - no destination available
Jul 18 14:28:39 MASTER_SERVER kernel: IPVS: rr: FWM 452 0x000001C4 - no destination available
Jul 18 14:28:39 MASTER_SERVER kernel: IPVS: rr: FWM 451 0x000001C3 - no destination available
Jul 18 14:28:40 MASTER_SERVER NetworkManager[8489]: <info> [1563452920.8734] manager: (veth33d0b84): new Veth device (/org/freedesktop/NetworkManager/Devices/1919)
Jul 18 14:28:40 MASTER_SERVER NetworkManager[8489]: <info> [1563452920.8750] manager: (veth1ba455b): new Veth device (/org/freedesktop/NetworkManager/Devices/1920)
Jul 18 14:28:40 MASTER_SERVER kernel: br0: port 2(veth260) entered blocking state
Jul 18 14:28:40 MASTER_SERVER kernel: br0: port 2(veth260) entered disabled state
Jul 18 14:28:40 MASTER_SERVER kernel: device veth260 entered promiscuous mode
Jul 18 14:28:41 MASTER_SERVER NetworkManager[8489]: <info> [1563452921.0567] manager: (veth9b724a9): new Veth device (/org/freedesktop/NetworkManager/Devices/1921)
Jul 18 14:28:41 MASTER_SERVER NetworkManager[8489]: <info> [1563452921.0604] manager: (veth328f5ab): new Veth device (/org/freedesktop/NetworkManager/Devices/1922)
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered blocking state
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered disabled state
Jul 18 14:28:41 MASTER_SERVER kernel: device veth83 entered promiscuous mode
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered blocking state
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered forwarding state
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41+02:00" level=info msg="shim reaped" id=af9c57b5a3e15cf84eb6c11e81b9b1283c86d289819600ddd83fd5cf791141cf module="containerd/tasks"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41+02:00" level=info msg="shim reaped" id=08e21510d82e56462f4bcfe5e4d734c8fcf21c92470afcf711bca5c7b1a29e88 module="containerd/tasks"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.126973723+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.127491897+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.128341866+02:00" level=warning msg="rmServiceBinding 208e26beb7581a7e09041ebf23d6170f0f7ad7785a28b0d5b37b4194cca8d64e possible transient state ok:false entries:0 set:false "
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41+02:00" level=info msg="shim reaped" id=99889097e3444a485567d85d5dcdbd1f56c49369207f3137d7c80d7c8efd1168 module="containerd/tasks"
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.132298072+02:00" level=warning msg="rmServiceBinding 055c400762be47b9e7c09c3a948c17b049a91b0cb53fbcefaee5f48f448fb91a possible transient state ok:false entries:0 set:false "
Jul 18 14:28:41 MASTER_SERVER dockerd: time="2019-07-18T14:28:41.140787013+02:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 18 14:28:41 MASTER_SERVER NetworkManager[8489]: <info> [1563452921.1487] manager: (veth7b97007): new Veth device (/org/freedesktop/NetworkManager/Devices/1923)
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 4(veth83) entered disabled state
Jul 18 14:28:41 MASTER_SERVER kernel: br0: port 16(veth257) entered disabled state
...
MORE OF THE SAME
...
Jul 18 14:28:54 MASTER_SERVER kernel: br0: port 7(veth82) entered disabled state
Jul 18 14:28:54 MASTER_SERVER kernel: device veth82 left promiscuous mode
Jul 18 14:28:54 MASTER_SERVER kernel: br0: port 7(veth82) entered disabled state
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: IPVS: __ip_vs_del_service: enter
Jul 18 14:28:54 MASTER_SERVER kernel: docker_gwbridge: port 17(veth7648a95) entered disabled state
Jul 18 14:28:54 MASTER_SERVER kernel: device veth7648a95 left promiscuous mode
Jul 18 14:28:54 MASTER_SERVER kernel: docker_gwbridge: port 17(veth7648a95) entered disabled state
Jul 18 14:28:54 MASTER_SERVER NetworkManager[8489]: <info> [1563452934.9971] device (veth7648a95): released from master device docker_gwbridge
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 15(veth256) entered disabled state
Jul 18 14:28:55 MASTER_SERVER kernel: device veth256 left promiscuous mode
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 15(veth256) entered disabled state
Jul 18 14:28:55 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered blocking state
Jul 18 14:28:55 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered disabled state
Jul 18 14:28:55 MASTER_SERVER kernel: device veth865a1b5 entered promiscuous mode
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 11(veth253) entered disabled state
Jul 18 14:28:55 MASTER_SERVER dockerd: time="2019-07-18T14:28:55+02:00" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.155/32 -j MARK --set-mark 458]: (iptables failed: iptables --wait -t mangle -D OUTPUT -d 10.0.0.155/32 -j MARK --set-mark 458: iptables: No chain/target/match by that name.\n (exit status 1))"
Jul 18 14:28:55 MASTER_SERVER kernel: device veth253 left promiscuous mode
Jul 18 14:28:55 MASTER_SERVER kernel: br0: port 11(veth253) entered disabled state
Jul 18 14:28:55 MASTER_SERVER dockerd: time="2019-07-18T14:28:55.049845009+02:00" level=error msg="Failed to delete firewall mark rule in sbox e6c95a7 (16cba4b): reexec failed: exit status 5"
...
MORE OF THIS
...
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.877371310+02:00" level=warning msg="rmServiceBinding a22765c0f00db0a4949b009129fa3efc78e2450567e759ff306dcfd32ebe0043 possible transient state ok:false entries:0 set:false "
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.893365848+02:00" level=error msg="Failed to delete real server 10.0.0.124 for vip 10.0.0.161 fwmark 455 in sbox f9cd880 (f0562de): no such process"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.893415513+02:00" level=error msg="Failed to delete service for vip 10.0.0.161 fwmark 455 in sbox f9cd880 (f0562de): no such process"
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info> [1563452937.9029] device (veth8339f31): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 3(veth8339f31) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 3(veth8339f31) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info> [1563452937.9250] device (veth242a19a): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 5(veth242a19a) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 5(veth242a19a) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 6(veth08b162c) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 6(veth08b162c) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info> [1563452937.9454] device (veth08b162c): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info> [1563452937.9647] device (veth1c69a54): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57+02:00" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455]: (iptables failed: iptables --wait -t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455: iptables: No chain/target/match by that name.\n (exit status 1))"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.975416212+02:00" level=error msg="Failed to delete firewall mark rule in sbox f9cd880 (f0562de): reexec failed: exit status 5"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.975578597+02:00" level=error msg="Failed to delete real server 10.0.0.124 for vip 10.0.0.161 fwmark 455 in sbox fd62ef8 (7801a00): no such process"
Jul 18 14:28:57 MASTER_SERVER dockerd: time="2019-07-18T14:28:57.975609232+02:00" level=error msg="Failed to delete service for vip 10.0.0.161 fwmark 455 in sbox fd62ef8 (7801a00): no such process"
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 14(veth1c69a54) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 14(veth1c69a54) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered blocking state
Jul 18 14:28:57 MASTER_SERVER kernel: docker_gwbridge: port 2(veth865a1b5) entered forwarding state
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info> [1563452937.9814] device (veth865a1b5): carrier: link connected
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info> [1563452937.9902] manager: (veth90d68ef): new Veth device (/org/freedesktop/NetworkManager/Devices/1988)
Jul 18 14:28:57 MASTER_SERVER NetworkManager[8489]: <info> [1563452937.9937] manager: (veth1c52f1f): new Veth device (/org/freedesktop/NetworkManager/Devices/1989)
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered disabled state
Jul 18 14:28:58 MASTER_SERVER kernel: device veth713fa85 entered promiscuous mode
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: docker_gwbridge: port 9(veth713fa85) entered forwarding state
Jul 18 14:28:58 MASTER_SERVER NetworkManager[8489]: <info> [1563452938.0130] manager: (vethdc96c98): new Veth device (/org/freedesktop/NetworkManager/Devices/1990)
Jul 18 14:28:58 MASTER_SERVER NetworkManager[8489]: <info> [1563452938.0142] manager: (veth713fa85): new Veth device (/org/freedesktop/NetworkManager/Devices/1991)
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered disabled state
Jul 18 14:28:58 MASTER_SERVER kernel: device veth269 entered promiscuous mode
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered blocking state
Jul 18 14:28:58 MASTER_SERVER kernel: br0: port 4(veth269) entered forwarding state
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58+02:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/5cadb3994f4d81174c1b686028a6fd3e917c9b8b2ac62b17c4754e59ed4a2158/shim.sock" debug=false module="containerd/tasks" pid=2290
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58+02:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/536c3eae39192936e649668192e0daf6efba70cc60d1822cd0e8e2e593ebac40/shim.sock" debug=false module="containerd/tasks" pid=2300
Jul 18 14:28:58 MASTER_SERVER kernel: IPVS: Creating netns size=2048 id=305
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58+02:00" level=error msg="setting up rule failed, [-t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455]: (iptables failed: iptables --wait -t mangle -D OUTPUT -d 10.0.0.161/32 -j MARK --set-mark 455: iptables: No chain/target/match by that name.\n (exit status 1))"
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58.283086156+02:00" level=error msg="Failed to delete firewall mark rule in sbox fd62ef8 (7801a00): reexec failed: exit status 5"
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58.283246626+02:00" level=error msg="Failed to delete real server 10.0.0.124 for vip 10.0.0.161 fwmark 455 in sbox b828388 (4152113): no such process"
Jul 18 14:28:58 MASTER_SERVER dockerd: time="2019-07-18T14:28:58.283277874+02:00" level=error msg="Failed to delete service for vip 10.0.0.161 fwmark 455 in sbox b828388 (4152113): no such process"
Jul 18 14:28:58 MASTER_SERVER kernel: IPVS: Creating netns size=2048 id=306
When I check the services with docker service ps qbrek329yydx, I see that the previous containers have the desired state "Shutdown". When I inspect the stopped containers, they have exited with either status 0 or status 137.
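(For reference, the state blocks below were pulled roughly like this; the container ID is a placeholder for one of the stopped containers listed by service ps:)
docker service ps qbrek329yydx --no-trunc                  # list the service's tasks, including shut-down ones, with full IDs
docker inspect --format '{{json .State}}' <container-id>   # dump only the State block of a stopped container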
Status 0:
"State": {
"Status": "exited",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 0,
"Error": "",
"StartedAt": "2019-07-18T12:29:03.540071095Z",
"FinishedAt": "2019-07-18T14:32:57.229648337Z"
}
Status 137:
"State": {
"Status": "exited",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 137,
"Error": "",
"StartedAt": "2019-07-18T12:29:00.047740568Z",
"FinishedAt": "2019-07-18T14:33:07.497580769Z",
"Health": {
"Status": "unhealthy",
"FailingStreak": 0,
Those with status 137 are normal: if our RabbitMQ stops, Spring reports that it is unhealthy and Docker tries to restart the container. The containers are not killed by the OOM killer and the applications inside did not crash; Docker just decided to restart everything.
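For context, the health checks involved follow this general pattern (service name, endpoint and timings here are illustrative, not our exact configuration):
docker service create --name spring-app \
  --health-cmd 'curl -fsS http://localhost:8080/actuator/health || exit 1' \
  --health-interval 30s --health-timeout 5s --health-retries 3 \
  OUR_IMAGE
# once the health command fails --health-retries times in a row, Swarm kills the task (SIGKILL, exit code 137) and schedules a replacement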
We also have containers deployed outside of Swarm; those are never restarted and have been up and running for a long time. Only the containers created through a docker service are being restarted, which also indicates that the Docker daemon itself has not restarted.
We first assumed it was a network issue causing RabbitMQ to fail and our health checks to trigger restarts, but even our Filebeat container, which has no health check, gets restarted. In the meantime we have applied the SACK kernel panic attack mitigation and these fixes, but without any luck.
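For completeness, the SACK mitigation essentially boils down to disabling TCP selective acknowledgements; a minimal sketch (the sysctl.d file name is just an example):
sysctl -w net.ipv4.tcp_sack=0                                        # disable TCP SACK as a workaround for the SACK panic CVEs
echo 'net.ipv4.tcp_sack = 0' > /etc/sysctl.d/99-disable-sack.conf    # one way to make it persistent across reboots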
We still have enough disk space, so that is not the problem either. We are a bit low on memory, but nothing that would cause a crash:
total used free shared buff/cache available
Mem: 15G 11G 1.3G 183M 2.5G 3.2G
Swap: 4.0G 258M 3.7G
The output of docker info is:
Containers: 54
Running: 14
Paused: 0
Stopped: 40
Images: 48
Server Version: 18.03.1-ce
Storage Driver: overlay
Backing Filesystem: xfs
Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: 0bf0j7rcw6xg7pda5lka7jyy8
Is Manager: true
ClusterID: 0e4yg9p17skhonx5jkaulhrsi
Managers: 1
Nodes: 2
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.242.171.24
Manager Addresses:
10.242.171.24:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-957.12.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 15.5GiB
Name: OUR_MASTER_SERVER
ID: OZOG:AM6N:XGRS:KZMV:BOHR:QGAX:625X:AFDW:F6AV:P7ZF:FS4R:TCZL
Docker Root Dir: /opt/hpc/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: bridge-nf-call-ip6tables is disabled
Does anyone have a clue why we are getting all these restarts?
I guess the problem is related to your network connectivity, because of the heartbeat to manager { } failed error.
Managers: 1
Nodes: 2
If I read that correctly, you have only one manager node and one worker node?
After reconnecting to its nodes, the manager node will (I guess) restart all containers again in order to fulfill the service definitions.
For production workloads you should have at least three Docker Swarm managers; this makes the cluster more fault-tolerant. If your cluster consists of only three nodes, promote the two worker nodes to managers as well.
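Promoting a worker is a single command per node (the node names below are placeholders; use the names shown by docker node ls):
docker node ls                 # shows all nodes and their current manager status
docker node promote worker-1   # turn an existing worker into a manager
docker node promote worker-2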
So I did a lot more digging, put our production nodes in debug mode and checked the logs. In debug mode, Docker outputs a log message roughly every 1-2 seconds, and I noticed multiple gaps in the logs. Some of the gaps were 5 seconds long, but sometimes no log message appeared for over 20 seconds, after which Docker would consider the master node (itself) unstable and force itself to restart, pulling all other instances with it (the downside of having a single manager).
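For anyone who wants to do the same, this is roughly how daemon debug logging can be enabled (dockerd re-reads daemon.json on SIGHUP, so the running containers are not restarted):
# add "debug": true to /etc/docker/daemon.json, e.g. { "debug": true }
kill -SIGHUP $(pidof dockerd)   # tell dockerd to reload its configuration; debug logging starts immediately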
After migrating our VM to another server the problem never recurred, and everything is running fine again. We have ordered a third VM so we can run with three managers and survive the failure of a single manager.
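Once the new VM is there, joining it as an additional manager should be straightforward (the token is printed by the first command; the address is our current manager address from docker info):
docker swarm join-token manager                                 # run on the existing manager; prints the join command for new managers
docker swarm join --token <printed-token> 10.242.171.24:2377   # run on the new VM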