Debian Server Keeps Restarting Unexpectedly

My laboratory's server, running Debian Wheezy 7.8 (stable), keeps restarting without any notification; it has done so a few times, each time after some hours of uptime. The server is used for fairly heavy numerical computation as well as parallel computation. I have printed the output of last reboot and the contents of /var/log/messages, but I find these log messages hard to understand. I tried to find the entries right before each reboot time in /var/log/messages, but the file seems to show only messages logged after the reboot happened.

I have searched around and found that some people have the same problem, but the cause seems to differ from case to case, and /var/log/messages appears to be the key place to look. What does my /var/log/messages actually say about these unwanted reboots? And how can a beginner start learning to read this log? For example, are there any important keywords to look for?

Thank you for any help you can provide.

last reboot

reboot   system boot  3.2.0-4-amd64    Wed May 20 03:29 - 12:43  (09:14)
reboot   system boot  3.2.0-4-amd64    Tue May 19 16:01 - 12:43  (20:42)

/var/log/messages

May 18 07:35:01 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2400" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
May 19 07:35:01 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2400" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
May 19 16:01:19 labserver kernel: imklog 5.8.11, log source = /proc/kmsg started.
May 19 16:01:19 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2401" x-info="http://www.rsyslog.com"] start
May 19 16:01:19 labserver kernel: [    0.000000] Initializing cgroup subsys cpuset
May 19 16:01:19 labserver kernel: [    0.000000] Initializing cgroup subsys cpu
May 19 16:01:19 labserver kernel: [    0.000000] Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.65-1+deb7u2
May 19 16:01:19 labserver kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=1fc245ac-9058-4208-862a-7f4e8e1b20b2 ro text
May 19 16:01:19 labserver kernel: [    0.000000] BIOS-provided physical RAM map:
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009ac00 (usable)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000000009ac00 - 00000000000a0000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000000100000 - 000000007df71000 (usable)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007df71000 - 000000007e0f1000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007e0f1000 - 000000007e2ec000 (ACPI NVS)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007e2ec000 - 000000007f367000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007f367000 - 000000007f800000 (ACPI NVS)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000080000000 - 0000000090000000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000fed1c000 - 00000000fed40000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000100000000 - 0000000880000000 (usable)
May 19 16:01:19 labserver kernel: [    0.000000] NX (Execute Disable) protection: active
May 19 16:01:19 labserver kernel: [    0.000000] SMBIOS 2.7 present.
May 19 16:01:19 labserver kernel: [    0.000000] No AGP bridge found
May 19 16:01:19 labserver kernel: [    0.000000] last_pfn = 0x880000 max_arch_pfn = 0x400000000
May 19 16:01:19 labserver kernel: [    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
May 19 16:01:19 labserver kernel: [    0.000000] last_pfn = 0x7df71 max_arch_pfn = 0x400000000
May 19 16:01:19 labserver kernel: [    0.000000] found SMP MP-table at [ffff8800000fd900] fd900
May 19 16:01:19 labserver kernel: [    0.000000] Using GB pages for direct mapping
May 19 16:01:19 labserver kernel: [    0.000000] init_memory_mapping: 0000000000000000-000000007df71000
May 19 16:01:19 labserver kernel: [    0.000000] init_memory_mapping: 0000000100000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000] RAMDISK: 36bea000 - 375ed000
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: RSDP 00000000000f04a0 00024 (v02 ALASKA)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: XSDT 000000007e204088 0008C (v01 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FACP 000000007e211040 0010C (v05 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI Warning: FADT (revision 5) is longer than ACPI 2.0 version, truncating length 268 to 244 (20110623/tbfadt-288)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: DSDT 000000007e2041a8 0CE96 (v02 ALASKA    A M I 00000015 INTL 20051117)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FACS 000000007e2e3080 00040
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: APIC 000000007e211150 00100 (v03 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FPDT 000000007e211250 00044 (v01 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: MCFG 000000007e211298 0003C (v01 ALASKA OEMMCFG. 01072009 MSFT 00000097)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HPET 000000007e2112d8 00038 (v01 ALASKA    A M I 01072009 AMI. 00000005)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: PRAD 000000007e211310 000BE (v02 PRADID  PRADTID 00000001 MSFT 03000001)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: SPMI 000000007e2113d0 00040 (v05 A M I   OEMSPMI 00000000 AMI. 00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: SSDT 000000007e211410 D0CB0 (v02  INTEL    CpuPm 00004000 INTL 20051117)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: EINJ 000000007e2e20c0 00130 (v01    AMI AMI EINJ 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: ERST 000000007e2e21f0 00230 (v01  AMIER AMI ERST 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HEST 000000007e2e2420 000A8 (v01    AMI AMI HEST 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: BERT 000000007e2e24c8 00030 (v01    AMI AMI BERT 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: DMAR 000000007e2e24f8 000C4 (v01 A M I   OEMDMAR 00000001 INTL 00000001)
May 19 16:01:19 labserver kernel: [    0.000000] No NUMA configuration found
May 19 16:01:19 labserver kernel: [    0.000000] Faking a node at 0000000000000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000] Initmem setup node 0 0000000000000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000]   NODE_DATA [000000087fffb000 - 000000087fffffff]
May 19 16:01:19 labserver kernel: [    0.000000] Zone PFN ranges:
May 19 16:01:19 labserver kernel: [    0.000000]   DMA      0x00000010 -> 0x00001000
May 19 16:01:19 labserver kernel: [    0.000000]   DMA32    0x00001000 -> 0x00100000
May 19 16:01:19 labserver kernel: [    0.000000]   Normal   0x00100000 -> 0x00880000
May 19 16:01:19 labserver kernel: [    0.000000] Movable zone start PFN for each node
May 19 16:01:19 labserver kernel: [    0.000000] early_node_map[3] active PFN ranges
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00000010 -> 0x0000009a
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00000100 -> 0x0007df71
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00100000 -> 0x00880000
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: PM-Timer IO Port: 0x408
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: IOAPIC (id[0x00] address[0xfec00000] gsi_base[0])
May 19 16:01:19 labserver kernel: [    0.000000] IOAPIC[0]: apic_id 0, version 32, address 0xfec00000, GSI 0-23
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec01000] gsi_base[24])
May 19 16:01:19 labserver kernel: [    0.000000] IOAPIC[1]: apic_id 2, version 32, address 0xfec01000, GSI 24-47
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
May 19 16:01:19 labserver kernel: [    0.000000] Using ACPI (MADT) for SMP configuration information
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HPET id: 0x8086a701 base: 0xfed00000
May 19 16:01:19 labserver kernel: [    0.000000] SMP: Allowing 12 CPUs, 0 hotplug CPUs
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000000009a000 - 000000000009b000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000000009b000 - 00000000000a0000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000000e0000 - 0000000000100000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007df71000 - 000000007e0f1000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007e0f1000 - 000000007e2ec000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007e2ec000 - 000000007f367000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007f367000 - 000000007f800000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007f800000 - 0000000080000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 0000000080000000 - 0000000090000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 0000000090000000 - 00000000fed1c000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000fed1c000 - 00000000fed40000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000fed40000 - 00000000ff000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000ff000000 - 0000000100000000
May 19 16:01:19 labserver kernel: [    0.000000] Allocating PCI resources starting at 90000000 (gap: 90000000:6ed1c000)
May 19 16:01:19 labserver kernel: [    0.000000] Booting paravirtualized kernel on bare hardware
May 19 16:01:19 labserver kernel: [    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:12 nr_node_ids:1
May 19 16:01:19 labserver kernel: [    0.000000] PERCPU: Embedded 27 pages/cpu @ffff88087fc00000 s78848 r8192 d23552 u131072
May 19 16:01:19 labserver kernel: [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 8258294
May 19 16:01:19 labserver kernel: [    0.000000] Policy zone: Normal
May 19 16:01:19 labserver kernel: [    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=1fc245ac-9058-4208-862a-7f4e8e1b20b2 ro text
May 19 16:01:19 labserver kernel: [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
May 19 16:01:19 labserver kernel: [    0.000000] xsave/xrstor: enabled xstate_bv 0x7, cntxt size 0x340
May 19 16:01:19 labserver kernel: [    0.000000] Checking aperture...
May 19 16:01:19 labserver kernel: [    0.000000] No AGP bridge found
May 19 16:01:19 labserver kernel: [    0.000000] Memory: 32975732k/35651584k available (3434k kernel code, 2130964k absent, 544888k reserved, 3305k data, 576k init)
May 19 16:01:19 labserver kernel: [    0.000000] Hierarchical RCU implementation.
May 19 16:01:19 labserver kernel: [    0.000000]    RCU dyntick-idle grace-period acceleration is enabled.
May 19 16:01:19 labserver kernel: [    0.000000] NR_IRQS:33024 nr_irqs:1184 16
May 19 16:01:19 labserver kernel: [    0.000000] Extended CMOS year: 2000
May 19 16:01:19 labserver kernel: [    0.000000] Console: colour VGA+ 80x25
May 19 16:01:19 labserver kernel: [    0.000000] console [tty0] enabled
May 19 16:01:19 labserver kernel: [    0.000000] Fast TSC calibration using PIT
May 19 16:01:19 labserver kernel: [    0.004000] Detected 2100.074 MHz processor.
May 19 16:01:19 labserver kernel: [    0.000003] Calibrating delay loop (skipped), value calculated using timer frequency.. 4200.14 BogoMIPS (lpj=8400296)
May 19 16:01:19 labserver kernel: [    0.000144] pid_max: default: 32768 minimum: 301
May 19 16:01:19 labserver kernel: [    0.000253] Security Framework initialized
May 19 16:01:19 labserver kernel: [    0.000324] AppArmor: AppArmor disabled by boot time parameter
May 19 16:01:19 labserver kernel: [    0.002355] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
May 19 16:01:19 labserver kernel: [    0.011585] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
May 19 16:01:19 labserver kernel: [    0.015724] Mount-cache hash table entries: 256
May 19 16:01:19 labserver kernel: [    0.015915] Initializing cgroup subsys cpuacct
May 19 16:01:19 labserver kernel: [    0.015986] Initializing cgroup subsys memory
May 19 16:01:19 labserver kernel: [    0.016063] Initializing cgroup subsys devices
May 19 16:01:19 labserver kernel: [    0.016133] Initializing cgroup subsys freezer
May 19 16:01:19 labserver kernel: [    0.016201] Initializing cgroup subsys net_cls
May 19 16:01:19 labserver kernel: [    0.016270] Initializing cgroup subsys blkio
May 19 16:01:19 labserver kernel: [    0.016344] Initializing cgroup subsys perf_event
May 19 16:01:19 labserver kernel: [    0.016441] CPU: Physical Processor ID: 0
May 19 16:01:19 labserver kernel: [    0.016509] CPU: Processor Core ID: 0
May 19 16:01:19 labserver kernel: [    0.017564] mce: CPU supports 23 MCE banks
May 19 16:01:19 labserver kernel: [    0.017670] CPU0: Thermal monitoring enabled (TM1)
May 19 16:01:19 labserver kernel: [    0.017768] using mwait in idle threads.
May 19 16:01:19 labserver kernel: [    0.018315] ACPI: Core revision 20110623
May 19 16:01:19 labserver kernel: [    0.049889] DMAR: Host address width 46
May 19 16:01:19 labserver kernel: [    0.049958] DMAR: DRHD base: 0x000000fbffc000 flags: 0x1
May 19 16:01:19 labserver kernel: [    0.050034] IOMMU 0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
May 19 16:01:19 labserver kernel: [    0.050122] DMAR: RMRR base: 0x0000007f239000 end: 0x0000007f247fff
May 19 16:01:19 labserver kernel: [    0.050195] DMAR: ATSR flags: 0x0
May 19 16:01:19 labserver kernel: [    0.050261] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x0
May 19 16:01:19 labserver kernel: [    0.050427] IOAPIC id 0 under DRHD base  0xfbffc000 IOMMU 0
May 19 16:01:19 labserver kernel: [    0.050497] IOAPIC id 2 under DRHD base  0xfbffc000 IOMMU 0
May 19 16:01:19 labserver kernel: [    0.050568] HPET id 0 under DRHD base 0xfbffc000
May 19 16:01:19 labserver kernel: [    0.050741] Enabled IRQ remapping in x2apic mode
May 19 16:01:19 labserver kernel: [    0.050810] Enabling x2apic
May 19 16:01:19 labserver kernel: [    0.050875] Enabled x2apic
May 19 16:01:19 labserver kernel: [    0.050943] Switched APIC routing to cluster x2apic.
May 19 16:01:19 labserver kernel: [    0.051552] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
May 19 16:01:19 labserver kernel: [    0.091256] CPU0: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz stepping 04
May 19 16:01:19 labserver kernel: [    0.195570] Performance Events: PEBS fmt1+, generic architected perfmon, Intel PMU driver.
May 19 16:01:19 labserver kernel: [    0.195802] ... version:                3
May 19 16:01:19 labserver kernel: [    0.195869] ... bit width:              48
May 19 16:01:19 labserver kernel: [    0.195936] ... generic registers:      4
May 19 16:01:19 labserver kernel: [    0.196003] ... value mask:             0000ffffffffffff
May 19 16:01:19 labserver kernel: [    0.196073] ... max period:             000000007fffffff
May 19 16:01:19 labserver kernel: [    0.196143] ... fixed-purpose events:   3
May 19 16:01:19 labserver kernel: [    0.196210] ... event mask:             000000070000000f
May 19 16:01:19 labserver kernel: [    0.196468] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.196637] Booting Node   0, Processors  #1
May 19 16:01:19 labserver kernel: [    0.312587] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.312765]  #2
May 19 16:01:19 labserver kernel: [    0.424400] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.424578]  #3
May 19 16:01:19 labserver kernel: [    0.536316] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.536489]  #4
May 19 16:01:19 labserver kernel: [    0.648124] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.648303]  #5
May 19 16:01:19 labserver kernel: [    0.759941] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.760115]  #6
May 19 16:01:19 labserver kernel: [    0.871864] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.872050]  #7
May 19 16:01:19 labserver kernel: [    0.983690] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.983866]  #8
May 19 16:01:19 labserver kernel: [    1.095600] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.095774]  #9
May 19 16:01:19 labserver kernel: [    1.207414] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.207589]  #10
May 19 16:01:19 labserver kernel: [    1.319223] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.319400]  #11 Ok.
May 19 16:01:19 labserver kernel: [    1.431095] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.431192] Brought up 12 CPUs
May 19 16:01:19 labserver kernel: [    1.431260] Total of 12 processors activated (50398.84 BogoMIPS).
May 19 16:01:19 labserver kernel: [    1.450786] devtmpfs: initialized
May 19 16:01:19 labserver kernel: [    1.455360] PM: Registering ACPI NVS region at 7e0f1000 (2076672 bytes)
May 19 16:01:19 labserver kernel: [    1.455494] PM: Registering ACPI NVS region at 7f367000 (4820992 bytes)
May 19 16:01:19 labserver kernel: [    1.455843] print_constraints: dummy: 
May 19 16:01:19 labserver kernel: [    1.455977] NET: Registered protocol family 16
May 19 16:01:19 labserver kernel: [    1.456140] ACPI: bus type pci registered
May 19 16:01:19 labserver kernel: [    1.456268] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
May 19 16:01:19 labserver kernel: [    1.456361] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
May 19 16:01:19 labserver kernel: [    1.466673] PCI: Using configuration type 1 for base access
May 19 16:01:19 labserver kernel: [    1.468173] bio: create slab <bio-0> at 0
May 19 16:01:19 labserver kernel: [    1.468353] ACPI: Added _OSI(Module Device)
May 19 16:01:19 labserver kernel: [    1.468422] ACPI: Added _OSI(Processor Device)
May 19 16:01:19 labserver kernel: [    1.468491] ACPI: Added _OSI(3.0 _SCP Extensions)
May 19 16:01:19 labserver kernel: [    1.468560] ACPI: Added _OSI(Processor Aggregator Device)
May 19 16:01:19 labserver kernel: [    1.484562] ACPI: Executed 1 blocks of module-level executable AML code
May 19 16:01:19 labserver kernel: [    1.727818] ACPI: Interpreter enabled
May 19 16:01:19 labserver kernel: [    1.727891] ACPI: (supports S0 S1 S4 S5)
May 19 16:01:19 labserver kernel: [    1.728159] ACPI: Using IOAPIC for interrupt routing
May 19 16:01:19 labserver kernel: [    1.736531] ACPI: No dock devices found.
May 19 16:01:19 labserver kernel: [    1.736630] HEST: Table parsing has been initialized.
May 19 16:01:19 labserver kernel: [    1.736704] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
May 19 16:01:19 labserver kernel: [    1.737041] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
May 19 16:01:19 labserver kernel: [    1.737361] pci_root PNP0A08:00: host bridge window [io  0x0000-0x03af]
May 19 16:01:19 labserver kernel: [    1.737435] pci_root PNP0A08:00: host bridge window [io  0x03e0-0x0cf7]
May 19 16:01:19 labserver kernel: [    1.737508] pci_root PNP0A08:00: host bridge window [io  0x03b0-0x03df]
May 19 16:01:19 labserver kernel: [    1.737586] pci_root PNP0A08:00: host bridge window [io  0x0d00-0xffff]
May 19 16:01:19 labserver kernel: [    1.737659] pci_root PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff]
May 19 16:01:19 labserver kernel: [    1.737747] pci_root PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff]
May 19 16:01:19 labserver kernel: [    1.737834] pci_root PNP0A08:00: host bridge window [mem 0xfed0e000-0xfed0ffff]
May 19 16:01:19 labserver kernel: [    1.737922] pci_root PNP0A08:00: host bridge window [mem 0x80000000-0xfbffffff]
May 19 16:01:19 labserver kernel: [    1.740791] pci 0000:00:01.0: PCI bridge to [bus 01-01]
May 19 16:01:19 labserver kernel: [    1.745575] pci 0000:00:01.1: PCI bridge to [bus 02-03]
May 19 16:01:19 labserver kernel: [    1.745700] pci 0000:00:02.0: PCI bridge to [bus 04-04]
May 19 16:01:19 labserver kernel: [    1.745816] pci 0000:00:03.0: PCI bridge to [bus 05-05]
May 19 16:01:19 labserver kernel: [    1.745933] pci 0000:00:03.2: PCI bridge to [bus 06-06]
May 19 16:01:19 labserver kernel: [    1.746285] pci 0000:00:11.0: PCI bridge to [bus 07-07]
May 19 16:01:19 labserver kernel: [    1.746541] pci 0000:00:1e.0: PCI bridge to [bus 08-08] (subtractive decode)
May 19 16:01:19 labserver kernel: [    1.747170]  pci0000:00: Requesting ACPI _OSC control (0x1d)
May 19 16:01:19 labserver kernel: [    1.747465]  pci0000:00: ACPI _OSC control (0x15) granted
May 19 16:01:19 labserver kernel: [    1.756901] ACPI: PCI Root Bridge [UNC0] (domain 0000 [bus ff])
May 19 16:01:19 labserver kernel: [    1.758443]  pci0000:ff: Requesting ACPI _OSC control (0x1d)
May 19 16:01:19 labserver kernel: [    1.758528]  pci0000:ff: ACPI _OSC control (0x1d) granted
May 19 16:01:19 labserver kernel: [    1.759439] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 *11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.760105] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *10 11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.760768] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 10 11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.761383] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 10 *11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.762006] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.762729] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.763450] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.764170] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 *7 10 11 12 14 15)
linux
debian
hardware
uptime
asked on Server Fault May 20, 2015 by Franky

1 Answer

You need to provide more information, especially the log entries right before the system rebooted. However, as far as I can see, those entries may not tell you much more. Check other logs as well, such as /var/log/syslog.
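One way to pull out the entries right before each reboot is to look for the line where rsyslogd's kernel log module starts (the "imklog ... started." line in the log above) and print the few lines that precede it. The following is only a minimal Python sketch, assuming the traditional syslog line format shown in the question; the function name, the marker string, and the context size are choices made for this illustration.

```python
from collections import deque

# Assumption: a line containing this text marks the start of a new boot, as in
# "imklog 5.8.11, log source = /proc/kmsg started." from the question's log.
BOOT_MARKER = "log source = /proc/kmsg started"


def lines_before_boots(lines, context=5, marker=BOOT_MARKER):
    """Return one list per boot marker with up to `context` lines preceding it."""
    recent = deque(maxlen=context)  # sliding window of the latest non-marker lines
    result = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            result.append(list(recent))
            recent.clear()
        else:
            recent.append(line)
    return result


# Example use (the path is the file from the question):
# with open("/var/log/messages", errors="replace") as handle:
#     for before in lines_before_boots(handle):
#         print("\n".join(before))
#         print("--- reboot ---")
```

In the log posted above, the last entry before the 16:01:19 boot is the routine "rsyslogd was HUPed" message from 07:35:01, with no shutdown messages in between. That suggests, but does not prove, an abrupt reset rather than an orderly shutdown.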

In my experience, sudden restarts without any real indication of what went wrong are most often hardware related. Otherwise the kernel will usually have a chance to write something to the logs that gives a clue.

Some common causes of sudden restarts:

  • Overheating, probably the main cause. Get an idea of the temperature and try to log it. Does the server have a display that could show the temperature? Is the room cooled properly? Perhaps replace the thermal compound on the heatsinks covering the CPU(s).

  • Bad hardware or drivers. Get a list of the hardware using "lspci", for example. A bad DIMM can cause a system to suddenly hang and/or reboot (re-seat DIMMs, CPUs and cards). I remember a server which rebooted occasionally due to an issue with its Intel Ethernet card. Sometimes a bad disk can cause such problems as well, although normally it would just cause the system to hang rather than restart.

  • A bad UPS. I remember a battery-backed UPS slowly going bad, and one of the indicators was a regular weekly power cycle of the servers connected to it. You could also simply have a misconfigured power cycle schedule.
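The suggestion to log the temperature could be sketched roughly as below. This is an illustration under assumptions, not a tested monitoring setup: it assumes the kernel exposes readings in millidegrees Celsius under /sys/class/thermal (on some systems they appear under /sys/class/hwmon instead, so the pattern may need adjusting), and the function names, log file name, and interval are hypothetical.

```python
import glob
import os
import time

# Assumption: temperatures are exposed in millidegrees Celsius under this sysfs
# pattern; adjust it if the sensors on this machine live elsewhere.
SENSOR_GLOB = "/sys/class/thermal/thermal_zone*/temp"


def read_temperatures(pattern=SENSOR_GLOB):
    """Return {sensor_path: degrees_celsius} for every readable sensor file."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                readings[path] = int(handle.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip sensors that cannot be read or parsed
    return readings


def log_once(logfile, pattern=SENSOR_GLOB):
    """Append one timestamped line and push it to disk before returning."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    temps = read_temperatures(pattern)
    fields = " ".join(
        f"{os.path.basename(os.path.dirname(path))}={value:.1f}C"
        for path, value in temps.items()
    )
    with open(logfile, "a") as out:
        out.write(f"{stamp} {fields}\n")
        out.flush()
        os.fsync(out.fileno())  # so the latest reading is on disk if the box resets


def run(logfile="temperature.log", interval=60):
    """Hypothetical loop: write one line every `interval` seconds until interrupted."""
    while True:
        log_once(logfile)
        time.sleep(interval)
```

Calling run() in the background while the computation jobs are active would leave a record of the last temperatures seen before each restart, which can then be compared against the reboot times reported by last reboot.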

answered on Server Fault May 20, 2015 by aseq • edited May 20, 2015 by aseq

User contributions licensed under CC BY-SA 3.0