HP ProLiant going down with critical error LED blinking - what then?

After a few hours of working correctly, our ProLiant server stops computing, with the System Health LED 12 blinking, which according to the documentation ( http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c01706108-8.pdf ) is the sign of a "Critical system failure detected (processor, memory, regulator, thermal event, fan, NMI)" (page 96).

SSH access is then lost. We can reboot and get SSH back (I am not on site), but I don't know what to check after that. Is there a log file where I can find some information?

I found this guide: http://denis.herve.free.fr/trsfrt/HProliant.pdf but it seems overwhelming to me.

My colleague suggests it could be a RAM + swap overload that makes the whole server crash. I don't really agree with him: as far as I know, a memory issue wouldn't lead to a critical system failure. Any ideas on this point?

I am wondering if there could be any relationship with my previous post : Linux server swapping before memory is completely full.

We are on Ubuntu 14.04.

PS: the server is in a basement; there may be a bit of water condensation in the morning...

EDIT: Following @Hennes's remark, we moved the server back to the living room. But after a night of computation, the red light was blinking again :-(

Now I am trying to get my head around the log files. We rebooted the server this morning around 09:44. Here are the recently changed files (screenshot).

What should I search for, and where, to get some error information?

I tried :

romain@pl:/var/log$ cat syslog | grep error
Dec 27 12:00:23 pl kernel: [    1.053210] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 12:00:23 pl kernel: [    6.740763] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    6.741967] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    7.082169] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    7.112776] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 12:00:23 pl kernel: [    9.905224] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 27 11:52:18 pl kernel: [    1.053048] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 27 11:52:18 pl kernel: [    6.364768] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [    6.365903] ata3.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [    6.684685] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [    6.686080] ata4.00: failed to enable AA (error_mask=0x1)
Dec 27 11:52:18 pl kernel: [   11.211120] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Dec 28 09:46:55 pl kernel: [    1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
Dec 28 09:46:55 pl kernel: [    6.348693] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    6.349786] ata3.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    6.699099] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    6.731027] ata4.00: failed to enable AA (error_mask=0x1)
Dec 28 09:46:55 pl kernel: [    8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro

and :

romain@pl:/var/log$ cat dmesg | grep error
[    1.051638] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
[    6.348693] ata3.00: failed to enable AA (error_mask=0x1)
[    6.349786] ata3.00: failed to enable AA (error_mask=0x1)
[    6.699099] ata4.00: failed to enable AA (error_mask=0x1)
[    6.731027] ata4.00: failed to enable AA (error_mask=0x1)
[    8.959211] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro

-> Here I don't really understand the values in the first column, like [ 6.731027]: are they the number of seconds since boot?
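
Kernel log timestamps of this form are normally seconds since boot. Assuming that reading, a minimal sketch for turning one into a wall-clock time could look like the following. The helper name ktime_to_date is made up for illustration, and the date -d @N syntax assumes GNU date, as shipped with Ubuntu.

```shell
# Sketch: convert a kernel timestamp (assumed to be seconds since boot)
# into a UTC wall-clock time.
# ktime_to_date is a hypothetical helper name, not a standard tool.
ktime_to_date() {
    boot_epoch=$1   # boot time in seconds since the Unix epoch
    ksecs=$2        # bracketed kernel value, e.g. 6.731027
    # Drop the fractional part and add the whole seconds to the boot time.
    date -u -d "@$(( boot_epoch + ${ksecs%.*} ))" '+%Y-%m-%d %H:%M:%S'
}

# Usage on the running system, deriving the boot time from /proc/uptime:
#   boot=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
#   ktime_to_date "$boot" 6.731027
# Where the installed dmesg supports it, `dmesg -T` prints readable
# timestamps directly.
```

The result is only approximate, since the kernel clock and the wall clock can drift apart.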

I checked

romain@pl:/var/log$ cat syslog | grep memory
Dec 27 12:00:23 pl kernel: [    0.000000] Scanning 1 areas for low memory corruption
Dec 27 12:00:23 pl kernel: [    0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
[...]
Dec 27 12:00:23 pl kernel: [    0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 12:00:23 pl kernel: [    0.000000] Early memory node ranges
Dec 27 12:00:23 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 12:00:23 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 12:00:23 pl kernel: [    0.019764] Initializing cgroup subsys memory
Dec 27 12:00:23 pl kernel: [    0.019992] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 12:00:23 pl kernel: [    0.971501] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 12:00:23 pl kernel: [    0.972518] Scanning for low memory corruption every 60 seconds
Dec 27 12:00:23 pl kernel: [    6.154807] memory memory67: hash matches
Dec 27 12:00:23 pl kernel: [    6.205519] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 12:00:23 pl kernel: [    6.234958] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 12:00:23 pl kernel: [    6.254602] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 12:00:23 pl kernel: [    9.739558] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 12:00:32 pl kernel: [   20.152332] cgroup: docker-runc (2183) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 12:00:32 pl kernel: [   20.152335] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 27 11:52:18 pl kernel: [    0.000000] Scanning 1 areas for low memory corruption
Dec 27 11:52:18 pl kernel: [    0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 27 11:52:18 pl kernel: [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 27 11:52:18 pl kernel: [    0.000000] init_memory_mapping: [mem 0x100000000-0x61fffffff]
Dec 27 11:52:18 pl kernel: [    0.000000] Early memory node ranges
Dec 27 11:52:18 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 27 11:52:18 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 27 11:52:18 pl kernel: [    0.019779] Initializing cgroup subsys memory
Dec 27 11:52:18 pl kernel: [    0.020005] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 27 11:52:18 pl kernel: [    0.970708] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 27 11:52:18 pl kernel: [    0.971734] Scanning for low memory corruption every 60 seconds
Dec 27 11:52:18 pl kernel: [    5.854654] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 27 11:52:18 pl kernel: [    5.883624] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 27 11:52:18 pl kernel: [    5.902731] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 27 11:52:18 pl kernel: [   10.983190] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 27 11:52:25 pl kernel: [   19.933483] cgroup: docker-runc (2140) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 27 11:52:25 pl kernel: [   19.933486] cgroup: "memory" requires setting use_hierarchy to 1 on the root
Dec 28 09:46:55 pl kernel: [    0.000000] Scanning 1 areas for low memory corruption
Dec 28 09:46:55 pl kernel: [    0.000000] Base memory trampoline at [ffff880000094000] 94000 size 24576
Dec 28 09:46:55 pl kernel: [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[...]
Dec 28 09:46:55 pl kernel: [    0.000000] init_memory_mapping: [mem 0x100000000-0x51fffffff]
Dec 28 09:46:55 pl kernel: [    0.000000] Early memory node ranges
Dec 28 09:46:55 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[...]
Dec 28 09:46:55 pl kernel: [    0.000000] PM: Registered nosave memory: [mem 0xffc00000-0xffffffff]
Dec 28 09:46:55 pl kernel: [    0.020007] Initializing cgroup subsys memory
Dec 28 09:46:55 pl kernel: [    0.020233] Freeing SMP alternatives memory: 32K (ffffffff81e88000 - ffffffff81e90000)
Dec 28 09:46:55 pl kernel: [    0.970821] Freeing initrd memory: 20288K (ffff880035850000 - ffff880036c20000)
Dec 28 09:46:55 pl kernel: [    0.971834] Scanning for low memory corruption every 60 seconds
Dec 28 09:46:55 pl kernel: [    5.824432] Freeing unused kernel memory: 1412K (ffffffff81d27000 - ffffffff81e88000)
Dec 28 09:46:55 pl kernel: [    5.853109] Freeing unused kernel memory: 232K (ffff8800017c6000 - ffff880001800000)
Dec 28 09:46:55 pl kernel: [    5.871990] Freeing unused kernel memory: 336K (ffff880001bac000 - ffff880001c00000)
Dec 28 09:46:55 pl kernel: [    8.826997] EDAC i7core: Driver loaded, 2 memory controller(s) found.
Dec 28 09:47:04 pl kernel: [   19.154325] cgroup: docker-runc (2171) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
Dec 28 09:47:04 pl kernel: [   19.154328] cgroup: "memory" requires setting use_hierarchy to 1 on the root
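
One detail in the excerpt above may be worth a closer look: the last init_memory_mapping range ends at 0x61fffffff on both Dec 27 boots, but at 0x51fffffff on the Dec 28 boot. As a rough sketch, shell arithmetic shows how much address space that difference represents. Reading it as "less RAM was detected on the later boot" is an assumption, not a confirmed diagnosis, and the helper name gib_between is made up for illustration.

```shell
# Sketch: size of the gap between two physical address bounds, in GiB.
# gib_between is a hypothetical helper name.
gib_between() {
    # Shell arithmetic accepts hexadecimal constants written as 0x...
    echo $(( ($1 - $2) / 1024 / 1024 / 1024 ))
}

gib_between 0x61fffffff 0x51fffffff   # prints 4

# To compare against what is installed and what the OS currently sees:
#   free -m
#   sudo dmidecode -t memory
```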

I also checked for 'fan', 'nmi', 'critical' in the syslog file, without any output.
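
Note that plain grep is case-sensitive, so a search for 'error' misses 'Error', and 'nmi' misses 'NMI'. A sketch of a single case-insensitive search over several keywords at once could look like this. The keyword list and the helper name scan_hw are assumptions chosen for illustration, not an exhaustive set.

```shell
# Sketch: case-insensitive search for several hardware-related keywords.
# scan_hw is a hypothetical helper; the keyword list is only a starting
# point and will also match some harmless lines.
scan_hw() {
    grep -iE 'error|fail|critical|nmi|mce|machine check|thermal|temperature|fan|panic|out of memory|oom' "$@"
}

# Usage (reads standard input when no file is given):
#   scan_hw /var/log/syslog /var/log/syslog.1 /var/log/kern.log
# Older rotated logs are usually gzip-compressed; zgrep accepts the same
# options for those:
#   zgrep -iE 'nmi|mce|thermal' /var/log/syslog.*.gz
```

Including 'out of memory' and 'oom' also covers the RAM + swap hypothesis, since the kernel logs a message when it has to kill processes for lack of memory.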

I remember some Stack Overflow questions where people copy/pasted whole files to an external paste website (I can't remember the name). I am ready to put files online if someone is interested.

Any hint on where to search for what keyword is welcome.

We use the server with Docker and RStudio Server on top for ML computations. I really doubt that this kind of usage is the source of the issue, but in IT you never know, so I mention it ;)

Thanks for any idea.

linux
ubuntu
memory
asked on Super User Dec 27, 2016 by Romain Jouin • edited Mar 20, 2017 by Community

1 Answer

Assuming your system is the ML150 G6 that the documentation you linked mentions, let me strongly urge you to set up and use the Lights Out-100 management functionality on the system.

A basic how-to can be found here. Once you've gotten access into the Lights Out-100 management (I'd recommend using the web interface until you're more familiar with what LO100 offers and how you use it), then see especially pages 28-32 of that same doc; it shows how you can see in real-time sensors and event information for your system. Often, if a hardware issue is causing resets, it will be listed in the System Event Log, and finding it there will give you some insight into what's going on with your machine. The System Event Log should be capturing its data whether you've ever touched LO100 or not, so once you can get in there it should have something interesting to tell you.
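
LO100 is an IPMI-based management controller, so the System Event Log can usually also be read from a shell with ipmitool. The following is a sketch under assumptions: ipmitool is installed, the controller accepts the interface used, and BMC_HOST and BMC_USER are placeholders. The helper name sel_critical and its keyword list are made up for illustration.

```shell
# Sketch: keep only System Event Log entries that look severe.
# sel_critical is a hypothetical helper that filters text on standard input.
sel_critical() {
    grep -iE 'critical|non-recoverable|uncorrectable'
}

# In-band, on the server itself (needs the IPMI kernel modules loaded,
# e.g. ipmi_si and ipmi_devintf):
#   sudo ipmitool sel elist | sel_critical
# Out-of-band, from another machine (BMC_HOST and BMC_USER are placeholders;
# the controller may need -I lan instead of -I lanplus):
#   ipmitool -I lanplus -H BMC_HOST -U BMC_USER sel elist | sel_critical
```

The out-of-band form keeps working after the operating system has hung, which is when the log is most useful.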

Much of the same information can be fetched through your running OS, either via /var/log/messages (/var/log/syslog on Ubuntu, which you've already tried without much success) or through HP's Insight tools, which are available to install for some Linux flavors (see http://downloads.linux.hp.com/SDR/project/mcp/ for one good starting point to getting some of these tools). Unfortunately not all events are visible in the system logs, as they're hardware specific and the HP agents, not the kernel proper, are what instrument them.
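
Even when the hardware event itself never reaches the system log, the last lines written before each reboot can show what the machine was doing when it stopped. A sketch of pulling those lines out follows. It assumes each boot logs a "Linux version" line at kernel time 0.000000, which is typical but not guaranteed, and the helper name before_boot is made up for illustration.

```shell
# Sketch: print the N log lines that precede each boot marker.
# before_boot is a hypothetical helper.
#   $1             number of lines of context to show before each boot
#   remaining args log files to search
before_boot() {
    n=$1
    shift
    grep -B "$n" -E 'kernel: \[ *0\.000000\] Linux version' "$@"
}

# Usage:
#   before_boot 20 /var/log/syslog
```

A long silence between the last line before a boot marker and the marker itself gives a rough idea of when the machine actually stopped logging.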

Having said that, you may also see if you have mcelog installed and running; it can catch some hardware events, and usually logs something in the message log when it catches an event. It also usually either logs the event information to a separate log, or keeps it in memory so you can query it with the mcelog command. It's worth looking for mcelog in your messages log, or to look if you have a recently updated /var/log/mcelog file.
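
A quick way to check both points could be sketched as below. The default path /var/log/mcelog is the one mentioned above, and the helper name check_mcelog is made up for illustration.

```shell
# Sketch: report whether mcelog is installed and whether its log has content.
# check_mcelog is a hypothetical helper; the log path can be overridden.
check_mcelog() {
    logfile=${1:-/var/log/mcelog}
    if command -v mcelog >/dev/null 2>&1; then
        echo "mcelog: installed"
    else
        echo "mcelog: not installed"
    fi
    if [ -s "$logfile" ]; then
        echo "log: $logfile has entries"
        ls -l "$logfile"
    else
        echo "log: $logfile missing or empty"
    fi
}

# Usage:
#   check_mcelog
#   grep -i mcelog /var/log/syslog
# If the mcelog daemon is running, it can also be queried directly:
#   sudo mcelog --client
```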

answered on Super User Dec 31, 2016 by Rubio

User contributions licensed under CC BY-SA 3.0