How to debug this mass site outage situation?


I manage a CentOS 6.9 server that runs around 30-40 WordPress sites under WHM/cPanel. It's a 24-core server that really doesn't see much traffic and usually has plenty of resources to spare.

I was alerted to some site outages on Sunday. When someone visited some of the sites, they were met with a blank page and an empty text file was downloaded. If a path such as /wp-admin was visited, an empty PHP file was downloaded instead.

I then did the following:

httpd status - was running fine, did a restart anyway

MySQL (MariaDB) status - running fine, did a restart anyway

htop - nothing seemed too out of the ordinary. There isn't much traffic on Sundays and it was an off-peak time (6am EST), so there was not much CPU load. RAM usage was slightly higher than usual but still had plenty to spare, and swap was untouched.

Disk space - 41% used, which is normal

WordPress - As far as I recall, all sites were on the latest major version (4.9 or higher), though not all were on the latest minor version.

PHP error_logs - checked several of the down sites and no errors were reported.

Plugins - I tried renaming the plugins folder, but this didn't change anything. As above, they were mostly up to date as far as I recall, and no new ones had been installed recently.

Access logs - I checked some of these for individual sites. There still seemed to be plenty of GET requests coming in for site pages, and those pages had the same problem. Interestingly, full paths to image files at /wp-uploads/ worked fine and rendered.

Site files - nothing seemed out of place with the WordPress core or site files for the handful of sites I checked.

Finally, there were a few sites that were working fine, but I couldn't see anything different about them.

Eventually I logged into the WHM admin panel for the server and restarted php-fpm from there, and all the sites came back online. I can't work out what the underlying problem was, though.

The /var/log/messages file has a lot of activity in it, mostly due to brute-force attempts (all denied) on the sites, but if I grep for php-fpm I get this:

May 22 02:20:49 server kernel: [ 6482.242289] INFO: task php-fpm:8538 blocked for more than 120 seconds.
May 22 02:20:49 server kernel: [ 6482.242924] php-fpm       D 000000000000000d     0  8538  22433 0x00000000
May 22 02:20:49 server kernel: [ 6482.246611] INFO: task php-fpm:8543 blocked for more than 120 seconds.
May 22 02:20:49 server kernel: [ 6482.247243] php-fpm       D 000000000000000d     0  8543  22433 0x00000000
May 22 02:20:49 server kernel: [ 6482.249480] INFO: task php-fpm:8873 blocked for more than 120 seconds.
May 22 02:20:49 server kernel: [ 6482.250109] php-fpm       D 0000000000000006     0  8873  22433 0x00000000
May 22 13:06:11 server kernel: [37914.804467] php-fpm[25437] general protection ip:5ef507 sp:7ffda7a83260 error:0 in php-fpm[400000+346000]
May 23 03:25:43 server yum[26918]: Updated: ea-php56-php-fpm-5.6.36-1.1.4.cpanel.x86_64
May 23 03:25:49 server yum[26918]: Updated: ea-php55-php-fpm-5.5.38-37.37.3.cpanel.x86_64
May 23 03:25:54 server yum[26918]: Updated: ea-php70-php-fpm-7.0.30-1.1.4.cpanel.x86_64
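The kernel entries in this excerpt fall into two kinds: "blocked for more than 120 seconds" hung-task warnings and a "general protection" fault. The following is a minimal sketch, assuming Python 3 and the syslog line layout shown above, of how such lines could be counted per process so that repeats stand out from the brute-force noise. The names `classify` and `summarize` are hypothetical helpers written for this illustration, not part of any tool on the server, and the patterns are only fitted to the lines quoted here.

```python
import re
from collections import Counter

# Patterns fitted to the excerpt above; other kernel versions may format
# these messages differently, so treat them as assumptions.
HUNG_TASK = re.compile(
    r"kernel: \[\s*[\d.]+\] INFO: task (\S+?):(\d+) blocked for more than (\d+) seconds"
)
GENERAL_PROTECTION = re.compile(
    r"kernel: \[\s*[\d.]+\] (\S+?)\[(\d+)\] general protection"
)


def classify(line):
    """Return (kind, timestamp, process, pid) for a matching line, else None."""
    timestamp = line[:15]  # classic syslog prefix, e.g. "May 22 02:20:49"
    match = HUNG_TASK.search(line)
    if match:
        return ("hung_task", timestamp, match.group(1), int(match.group(2)))
    match = GENERAL_PROTECTION.search(line)
    if match:
        return ("general_protection", timestamp, match.group(1), int(match.group(2)))
    return None


def summarize(lines):
    """Count matching events per (kind, process) pair."""
    counts = Counter()
    for line in lines:
        event = classify(line)
        if event is not None:
            counts[(event[0], event[2])] += 1
    return counts
```

It could be run over the whole file with something like `summarize(open("/var/log/messages", errors="replace"))`. Lines that match neither pattern, such as the yum update entries, are simply ignored.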

However, the sites had not been down ever since May 22nd; the outage only started on Sunday. Unfortunately, I do not seem to have any logging enabled for php-fpm itself.
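The "D" in the kernel lines above is the uninterruptible-sleep process state. Without php-fpm logs, one way to notice a recurrence might be to look for php-fpm workers currently in that state. Below is a rough sketch, assuming a Linux /proc filesystem and a reasonably recent Python; `parse_stat` and `blocked_processes` are hypothetical names used only for this illustration, and the process name "php-fpm" is taken from the log lines above.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat record into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_processes(name="php-fpm", proc_root="/proc"):
    """Return sorted PIDs of processes called `name` that are in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (IOError, OSError, ValueError, IndexError):
            continue  # process exited mid-read, or the record was unparsable
        if comm == name and state == "D":
            found.append(pid)
    return sorted(found)
```

Calling `blocked_processes()` from a cron job and logging any non-empty result would at least timestamp when workers start to hang. It would not by itself explain why they are blocked.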

I would like to get to the bottom of this and find out what caused this.

Thank you.

centos6
wordpress
httpd
asked on Server Fault May 28, 2018 by goblin_rocket

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0