My dGPU (Nvidia GTX 880M) looks dead, any hope?

Yesterday I started playing Kenshi (interesting game, by the way), and after some time it crashed. I had no overheating problem (far from it), nothing is overclocked on my laptop, and my not-so-young GTX 880M was still more than capable of running the game, but at this point the GPU seems to have stopped working. I powered off my computer and decided to look into it the next day.

The next day (today), I noticed the following:

  • The computer takes an unusually long time in the BIOS initialisation phase (i.e. before loading the OS)
  • Once the OS is booted, the dGPU light remains on despite it not being used
  • The Windows device manager complains about not being able to load the driver
  • On Linux, it's possible to power down the dGPU via an ACPI call (hence the light goes off), but trying to use it doesn't work at all

So I reset the BIOS to the so-called "Optimized defaults" in the hope it would help, but it looks like it didn't.

On Linux, I get the following kernel messages when trying to use my GTX 880M (everything before and after has been removed for the sake of clarity):

[Aug22 16:11] pci 0000:01:00.0: [10de:1198] type 00 class 0x030000
[  +0.000036] pci 0000:01:00.0: reg 0x10: [mem 0xf6000000-0xf6ffffff]
[  +0.000017] pci 0000:01:00.0: reg 0x14: [mem 0xe0000000-0xefffffff 64bit pref]
[  +0.000016] pci 0000:01:00.0: reg 0x1c: [mem 0xf0000000-0xf1ffffff 64bit pref]
[  +0.000012] pci 0000:01:00.0: reg 0x24: [io  0xe000-0xe07f]
[  +0.000012] pci 0000:01:00.0: reg 0x30: [mem 0xf7000000-0xf707ffff pref]
[  +0.000052] pci 0000:01:00.0: Enabling HDA controller
[  +0.000087] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8 GT/s x16 link)
[  +0.000486] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[  +0.000011] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[  +0.000160] pci 0000:01:00.1: [10de:0e0a] type 00 class 0x040300
[  +0.000029] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
[  +0.000057] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
[  +0.000382] pcieport 0000:00:01.0: ASPM: current common clock configuration is broken, reconfiguring
[  +0.009663] pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xefffffff 64bit pref]
[  +0.000037] pci 0000:01:00.0: BAR 3: assigned [mem 0xf0000000-0xf1ffffff 64bit pref]
[  +0.000018] pci 0000:01:00.0: BAR 0: assigned [mem 0xf6000000-0xf6ffffff]
[  +0.000003] pci 0000:01:00.0: BAR 6: assigned [mem 0xf7000000-0xf707ffff pref]
[  +0.000002] pci 0000:01:00.1: BAR 0: assigned [mem 0xf7080000-0xf7083fff]
[  +0.000002] pci 0000:01:00.0: BAR 5: assigned [io  0xe000-0xe07f]
[  +0.000004] pcieport 0000:00:01.0: PCI bridge to [bus 01]
[  +0.000001] pcieport 0000:00:01.0:   bridge window [io  0xe000-0xefff]
[  +0.000003] pcieport 0000:00:01.0:   bridge window [mem 0xf6000000-0xf70fffff]
[  +0.000002] pcieport 0000:00:01.0:   bridge window [mem 0xe0000000-0xf1ffffff 64bit pref]
[  +0.000176] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
[  +0.000036] snd_hda_intel 0000:01:00.1: enabling device (0000 -> 0002)
[  +0.000057] snd_hda_intel 0000:01:00.1: Disabling MSI
[  +0.000006] snd_hda_intel 0000:01:00.1: Handle vga_switcheroo audio client
[  +0.041484] IPMI message handler: version 39.2
[  +0.016187] ipmi device interface
[  +0.704043] nvidia: module license 'NVIDIA' taints kernel.
[  +0.000001] Disabling lock debugging due to kernel taint
[  +0.012267] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[  +0.000320] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[  +0.000078] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[  +0.099429] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.100  Fri May 29 08:45:51 UTC 2020
[  +0.055239] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.100  Fri May 29 08:14:04 UTC 2020
[  +0.002640] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  +0.020768] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190816/nsarguments-59)
[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.000492] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[  +0.000122] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[  +0.071852] nvidia-uvm: Loaded the UVM driver, major device number 235.
[Aug22 16:14] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[  +3.384946] rfkill: input handler enabled
[  +9.980611] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000047] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.167869] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000075] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.099416] gnome-shell[6474]: segfault at 20 ip 00007f4d29a83356 sp 00007ffd43c62db0 error 4 in libnvidia-glsi.so.440.100[7f4d29a21000+95000]
[  +0.000004] Code: 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 c3 0f 1f 44 00 00 48 8b 3d f1 31 24 00 e8 74 31 00 00 89 de 48 89 c7 e8 5a fe ff ff <48> 8b 78 20 e8 61 60 01 00 48 83 f8 01 48 89 45 00 19 c0 83 e0 0f
[  +4.409548] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[  +0.281982] rfkill: input handler disabled
[Aug22 16:15] rfkill: input handler enabled
[  +4.303838] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[  +0.355605] rfkill: input handler disabled
[  +8.056674] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000069] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.172351] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000030] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171805] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000022] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.167969] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000022] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171784] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000070] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.168094] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000023] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[Aug22 16:16] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000031] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.168161] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000025] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171680] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000023] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171964] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000039] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

The next thing I intend to do is unplug everything, remove the battery and wait an hour, but I'm running out of ideas. Apart from praying, any advice?

EDIT: My laptop is a Clevo P170SM, though I don't think Clevo is to blame. I may be the one at fault: Windows froze after the game crashed, and instead of forcing a shutdown, I waited for the OS to bugcheck and reboot by itself. It did, but only after a possibly deadly hour-long wait, and I suspect something fried in the process.

The lesson I learned here is: don't trust the OS to protect your hardware. Most hardware protections are implemented in the BIOS, in drivers, or directly in hardware by the device manufacturers, and those can sometimes fail.

The OS focuses more on protecting your data and on system availability. For example, filesystem journalling protects the data, and on Windows, GPU Timeout Detection and Recovery (TDR) provides availability. This is a personal opinion, but despite journalling being done at the driver level, I consider it part of the OS, since most OSes, if not all, were designed for and should be installed on a very specific filesystem format. I suppose Windows can be installed on an ext4 filesystem, but some functionality would be missing. But I digress...

linux
windows-10
graphics-card
crash
gpu
asked on Super User Aug 22, 2020 by NovHak • edited Aug 23, 2020 by NovHak

1 Answer

[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.000492] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[  +0.000122] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[  +0.099416] gnome-shell[6474]: segfault at 20 ip 00007f4d29a83356 sp 00007ffd43c62db0 error 4 in libnvidia-glsi.so.440.100[7f4d29a21000+95000]

According to the relevant log entries above, the OS failed to initialize the display adapter. Combined with the other symptoms you describe, that is typically bad news. The Windows Event Viewer may show similar hardware errors.

The left column of the log shows how much time elapsed since the previous message. Your PC is taking a long time to load due to the repeated failures to initialize the Nvidia hardware. The first failure takes about 30 seconds, and subsequent failures come roughly 8 to 10 seconds apart.
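
As a rough illustration of reading those delays, here is a minimal Python sketch that pulls the relative timestamp and the error code out of each "RmInitAdapter failed" line. It assumes the relative-timestamp layout shown in the question's log; the function name init_failure_gaps and the sample list are made up for this example and are not part of any Nvidia tool.

```python
import re

# Relative stamps look like "[  +8.167869]" in the log quoted in the question.
DELTA_RE = re.compile(r"^\[\s*\+(\d+\.\d+)\]")
# Captures the code in parentheses, e.g. "0x26:0x65:1227".
FAIL_RE = re.compile(r"RmInitAdapter failed! \(([^)]*)\)")


def init_failure_gaps(lines):
    """Return (seconds_since_previous_message, error_code) per RmInitAdapter failure.

    Lines stamped with an absolute time (e.g. "[Aug22 16:16]") carry no
    delta, so their gap is reported as None.
    """
    results = []
    for line in lines:
        fail = FAIL_RE.search(line)
        if not fail:
            continue  # not an adapter-init failure line
        delta = DELTA_RE.match(line)
        gap = float(delta.group(1)) if delta else None
        results.append((gap, fail.group(1)))
    return results


# A few lines copied from the log in the question (hypothetical sample variable).
sample = [
    "[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)",
    "[  +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0",
    "[  +9.980611] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)",
    "[Aug22 16:16] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)",
]

for gap, code in init_failure_gaps(sample):
    print(gap, code)
```

Feeding it the full log instead of the sample should make the pattern easy to see: one long wait before the first failure, then a steady series of retries that all fail with the same code.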

You don't mention the brand of laptop, but you specifically call out the "dGPU", which I presume means a discrete (dedicated) Nvidia graphics chip, and I assume you are still able to use the video output generated by the CPU's integrated graphics.

The GPU may be able to receive power and basic system management requests (the power up and down you mention), but the rest of the GPU is inaccessible, most likely due to hardware failure.

This is the best assessment I think possible with the information provided.

One further troubleshooting step would be to boot from an unmodified Live CD, just to be absolutely certain the issue isn't with your installed OSes.

If you Google the GPU-related error messages in your log, you will find links to the Nvidia support forums. Here is their recommendation:

Can you run

nvidia-bug-report.sh

on your machine and email the output file to linux-bugs [at] nvidia.com ? In the rare chance that this issue causes your machine to lock up while collecting this bug report please run

nvidia-bug-report.sh --safe-mode

Nvidia Support Forum

answered on Super User Aug 23, 2020 by Jim Diroff II

User contributions licensed under CC BY-SA 3.0