A few days ago all 8 of my app servers started having issues, and I have not deployed any code changes in months. This is not the first time; it happens every 4 months or so. The servers will be fine for a while, then all of a sudden one process quickly climbs to 100% CPU on a single core. Since I'm using Passenger, eventually another process does the same thing and maxes out another core, and so on until the web server can no longer handle traffic and stops responding.
I've recently set up some brand-new CentOS 7 boxes running nginx and Puma, so I now have a mix of both stacks. I wanted to see whether this was a web server issue or bug, but that doesn't seem to be the case.
I've done a lot of digging (which I am not great at), but I've at least found that when I run strace on the processes they look pretty normal at first. Then, when they go crazy as described above, the output is just a non-stop flood of calls like:
clock_gettime(CLOCK_REALTIME, {1518938625, 9566131}) = 0
A healthy process, like I said, is not constantly spitting anything out, only when a web request comes in, for example. But then something sets it off and it just goes nuts until I kill the process, restart Passenger, or reboot the server. Then within an hour or two it's back to having issues again.
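In case it helps, this is roughly how I've been catching a worker once it pins a core (the exact commands vary a bit and the PID below is just a placeholder):

passenger-status            # list the Passenger worker PIDs
top -c                      # the misbehaving worker sits at ~100% CPU
strace -p <worker_pid>      # attach to it; the output is the clock_gettime flood above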
I've spent the last couple of days babysitting the servers, non-stop restarting things to keep them limping along, but I am desperate for ideas. I've found a couple of really old posts, from around 2013, that talk about a 100% CPU issue involving clock_gettime, and I've tried both of the suggestions associated with them: one is setting a TZ variable, and the other is supposed to fix a leap-second bug of some sort (roughly as sketched below). I don't understand the reasoning behind either proposed fix, but sadly neither worked.
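For reference, the two workarounds were roughly along these lines (paraphrased from those old posts, so the exact commands may have differed):

# 1) TZ workaround: set TZ in the app's environment so glibc stops
#    re-reading /etc/localtime on every time lookup
export TZ=:/etc/localtime

# 2) Leap-second workaround: re-set the system clock to clear the
#    leap-second state left in the kernel (run as root)
service ntpd stop
date -s "$(date)"
service ntpd start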
Half of my servers run the following stack:
Ruby 2.2.0
Passenger standalone 4.0.58 (I tried upgrading to 5.2.0 on one server, with no change in behavior)
MySQL
CentOS 6.9
The other half run: Ruby 2.2.10, nginx, Puma, CentOS 7.2.
I've already asked on Server Fault and was directed here instead, on the grounds that this is probably a Ruby issue.
EDIT
I am getting different output on the new Puma servers.
(The Passenger processes that hang at 100% are still blowing up with
[pid 1631] clock_gettime(CLOCK_REALTIME, {1523052672, 424967903}) = 0
when I run strace.)
strace -p <pid> -f is now showing this over and over on the Puma workers:
[pid 1918] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 1918] futex(0x55cf6990e988, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1918] futex(0x55cf6990e914, FUTEX_WAIT_BITSET_PRIVATE, 16749, {8516, 752821584}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 1918] futex(0x55cf6990e988, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1918] futex(0x55cf6990e914, FUTEX_WAIT_BITSET_PRIVATE, 16751, {8517, 753257393}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 1918] futex(0x55cf6990e988, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1918] futex(0x55cf6990e914, FUTEX_WAIT_BITSET_PRIVATE, 16753, {8518, 753722228}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 1918] futex(0x55cf6990e988, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1918] futex(0x55cf6990e914, FUTEX_WAIT_BITSET_PRIVATE, 16755, {8519, 754308489}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 1918] futex(0x55cf6990e988, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1918] futex(0x55cf6990e914, FUTEX_WAIT_BITSET_PRIVATE, 16757, {8520, 754762287}, ffffffff <unfinished ...>
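(For completeness, the futex loop above and the ltrace output below both come from attaching to a hot Puma worker, roughly like this; the PID is a placeholder:)

top -H -p <worker_pid>     # per-thread view; the spinning LWP shows ~100% CPU
strace -f -p <worker_pid>  # follow all threads (the futex loop above)
ltrace -f -p <worker_pid>  # the same loop at the library-call level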
ltrace on the same process is showing this over and over:
[pid 1918] <... pthread_cond_timedwait resumed> ) = 110
[pid 1918] pthread_mutex_unlock(0x6990e988, 0, 0, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x6990e988, 0xffffffff) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0xffffffff) = 0
[pid 1918] clock_gettime(1, 0xfefd5b0, 8295, 0x6990e860) = 0
[pid 1918] time(0, 0xe93a4, 8775, 8775) = 0x5ac7ef79
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0xfeff050) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0x61d59520, 0, 0xfeff050) = 0
[pid 1918] strcmp(0x61a9f560, 0x61a9719f, 0, 0x61a9719f) = 0xfffffff9
[pid 1918] strcmp(0x61a9f570, 0x61a9719f, 3, 112) = 0xfffffff1
[pid 1918] strcmp(0x61a9f580, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f590, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f5a0, 0x61a9719f, 3, 112) = 0xfffffff8
[pid 1918] strcmp(0x61a9f5b0, 0x61a9719f, 3, 112) = 0
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0, 0x61aaf2b8, 0x6990e860) = 0
[pid 1918] time(0, 0xfeff038, 0x68b74c00, 0xfeff038) = 0x5ac7ef79
[pid 1918] clock_gettime(1, 0xfefd5b0, 0, 1) = 0
[pid 1918] clock_gettime(1, 0xfefd530, 0x6990e988, 8776) = 0
[pid 1918] _setjmp(0x6990ea70, 0xfefd530, 0x3a493cbc, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x4a053590, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0x6990e860) = 0
[pid 1918] pthread_mutex_lock(0x6990e988, 0, 0x62a38f30, 0x6990e860) = 0
[pid 1918] pthread_cond_timedwait(0x6990e910, 0x6990e988, 0xfefd530, 0x6990e860) = 110
[pid 1918] pthread_mutex_unlock(0x6990e988, 0, 0, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x6990e988, 0xffffffff) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0xffffffff) = 0
[pid 1918] clock_gettime(1, 0xfefd5b0, 8295, 0x6990e860) = 0
[pid 1918] time(0, 0xf0d3c, 8776, 8776) = 0x5ac7ef7a
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0xfeff050) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0x61d59520, 0, 0xfeff050) = 0
[pid 1918] strcmp(0x61a9f560, 0x61a9719f, 0, 0x61a9719f) = 0xfffffff9
[pid 1918] strcmp(0x61a9f570, 0x61a9719f, 3, 112) = 0xfffffff1
[pid 1918] strcmp(0x61a9f580, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f590, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f5a0, 0x61a9719f, 3, 112) = 0xfffffff8
[pid 1918] strcmp(0x61a9f5b0, 0x61a9719f, 3, 112) = 0
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0, 0x61aaf2b8, 0x6990e860) = 0
[pid 1918] time(0, 0xfeff038, 0x68b74c00, 0xfeff038) = 0x5ac7ef7a
[pid 1918] clock_gettime(1, 0xfefd5b0, 0, 1) = 0
[pid 1918] clock_gettime(1, 0xfefd530, 0x6990e988, 8778) = 0
[pid 1918] _setjmp(0x6990ea70, 0xfefd530, 0x7b0c57, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x4a053590, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0x6990e860) = 0
[pid 1918] pthread_mutex_lock(0x6990e988, 0, 0x62a38f30, 0x6990e860) = 0
[pid 1918] pthread_cond_timedwait(0x6990e910, 0x6990e988, 0xfefd530, 0x6990e860) = 110
[pid 1918] pthread_mutex_unlock(0x6990e988, 0, 0, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x6990e988, 0xffffffff) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0xffffffff) = 0
[pid 1918] clock_gettime(1, 0xfefd5b0, 8295, 0x6990e860) = 0
[pid 1918] time(0, 0x3f6f, 8778, 8778) = 0x5ac7ef7b
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0xfeff050) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0x61d59520, 0, 0xfeff050) = 0
[pid 1918] strcmp(0x61a9f560, 0x61a9719f, 0, 0x61a9719f) = 0xfffffff9
[pid 1918] strcmp(0x61a9f570, 0x61a9719f, 3, 112) = 0xfffffff1
[pid 1918] strcmp(0x61a9f580, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f590, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f5a0, 0x61a9719f, 3, 112) = 0xfffffff8
[pid 1918] strcmp(0x61a9f5b0, 0x61a9719f, 3, 112) = 0
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0, 0x61aaf2b8, 0x6990e860) = 0
[pid 1918] time(0, 0xfeff038, 0x68b74c00, 0xfeff038) = 0x5ac7ef7b
[pid 1918] clock_gettime(1, 0xfefd5b0, 0, 1) = 0
[pid 1918] clock_gettime(1, 0xfefd530, 0x6990e988, 8779) = 0
[pid 1918] _setjmp(0x6990ea70, 0xfefd530, 0x233674d, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x4a053590, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0x6990e860) = 0
[pid 1918] pthread_mutex_lock(0x6990e988, 0, 0x62a38f30, 0x6990e860) = 0
[pid 1918] pthread_cond_timedwait(0x6990e910, 0x6990e988, 0xfefd530, 0x6990e860) = 110
[pid 1918] pthread_mutex_unlock(0x6990e988, 0, 0, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x6990e988, 0xffffffff) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0xffffffff) = 0
[pid 1918] clock_gettime(1, 0xfefd5b0, 8295, 0x6990e860) = 0
[pid 1918] time(0, 0xb00f, 8779, 8779) = 0x5ac7ef7c
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0xfeff050) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0x61d59520, 0, 0xfeff050) = 0
[pid 1918] strcmp(0x61a9f560, 0x61a9719f, 0, 0x61a9719f) = 0xfffffff9
[pid 1918] strcmp(0x61a9f570, 0x61a9719f, 3, 112) = 0xfffffff1
[pid 1918] strcmp(0x61a9f580, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f590, 0x61a9719f, 3, 112) = 3
[pid 1918] strcmp(0x61a9f5a0, 0x61a9719f, 3, 112) = 0xfffffff8
[pid 1918] strcmp(0x61a9f5b0, 0x61a9719f, 3, 112) = 0
[pid 1918] pthread_mutex_lock(0x69515980, 0x61d59520, 0x61d59520, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x69515980, 0, 0x61aaf2b8, 0x6990e860) = 0
[pid 1918] time(0, 0xfeff038, 0x68b74c00, 0xfeff038) = 0x5ac7ef7c
[pid 1918] clock_gettime(1, 0xfefd5b0, 0, 1) = 0
[pid 1918] clock_gettime(1, 0xfefd530, 0x6990e988, 8780) = 0
[pid 1918] _setjmp(0x6990ea70, 0xfefd530, 0x3eedb9f, 0xffffffff) = 0
[pid 1918] pthread_mutex_lock(0x62a38f30, 0, 0x4a053590, 0x6990e860) = 0
[pid 1918] pthread_mutex_unlock(0x62a38f30, 0, 0, 0x6990e860) = 0
[pid 1918] pthread_mutex_lock(0x6990e988, 0, 0x62a38f30, 0x6990e860) = 0
EDIT
I just upgraded half of the app servers (the ones running Puma on CentOS 7.2) to Ruby 2.3.7, and within an hour one process was already at 100% CPU. I have no idea what could be causing this: the servers hum along at a load of 0.05 or less until the hang happens, and everything goes downhill from there.