AMD Vega 64 crashes on kernels > 4.15

Kernels 4.19.39, 5.0.13 and 5.1 all freeze within seconds of starting Steam or Overwatch (through the Battle.net client). I am currently running 4.15, which is stable.

I have already tried the following:

  • Set GRUB_CMDLINE_LINUX_DEFAULT="splash idle=nomwait"
  • Set the power supply idle option to "typical"
  • Updated the BIOS (from AGESA 1.0.0.4 to 1.0.0.6)
  • Updated the OS (Ubuntu 18.04)
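
A boot parameter such as idle=nomwait only takes effect if it actually reaches the running kernel, so it can be worth checking /proc/cmdline after the reboot. The following is a minimal sketch, not part of the original post: the helper name has_boot_param is hypothetical, and it assumes the usual Ubuntu workflow of editing /etc/default/grub and running sudo update-grub before rebooting.

```shell
# has_boot_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole, space-separated word in the
# kernel command line. CMDLINE defaults to /proc/cmdline, i.e. the
# parameters of the currently running kernel.
has_boot_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;   # exact word match found
    esac
    return 1
}

# Example (after rebooting):
#   has_boot_param idle=nomwait && echo "idle=nomwait is active"
```

Passing the command line as a second argument makes it possible to test the helper against a saved string instead of the live system.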

Hardware

AMD Ryzen 7 2700X Wraith Boxed
Asus Vega 64 Strix    
Gigabyte X470 AORUS ULTRA GAMING (AGESA 1.0.0.6)
G.Skill Ripjaws V 16GB DDR4 3200MHz (4 x 16GB)
Corsair CX850M 850W ATX power supply unit

screenfetch -n

OS: Ubuntu 18.04 bionic
 Kernel: x86_64 Linux 4.15.0-48-generic
 Uptime: 1h 29m
 Packages: 3497
 Shell: bash 4.4.19
 Resolution: 3840x2160
 DE: GNOME 
 WM: GNOME Shell
 WM Theme: Adwaita
 GTK Theme: Ambiance [GTK2/3]
 Icon Theme: ubuntu-mono-dark
 Font: Ubuntu 11
 CPU: AMD Ryzen 7 2700X Eight-Core @ 16x 3.7GHz [36.3°C]
 GPU: Radeon RX Vega (VEGA10, DRM 3.23.0, 4.15.0-48-generic, LLVM 9.0.0)
 RAM: 6208MiB / 64432MiB

Drivers + additional info

~$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.2.0-devel - padoka PPA

~$ cat /etc/apt/sources.list.d/paulo-miguel-dias-ubuntu-mesa-bionic.list
deb http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic main
# deb-src http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic main

~$ sudo lspci -v | grep -i vga -A 10
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1) (prog-if 00 [VGA controller])
    Subsystem: ASUSTeK Computer Inc. Vega 10 XT [Radeon RX Vega 64]
    Flags: bus master, fast devsel, latency 0, IRQ 114
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at f0000000 (64-bit, prefetchable) [size=2M]
    I/O ports at e000 [size=256]
    Memory at fcc00000 (32-bit, non-prefetchable) [size=512K]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: 

    ...

~$ apt show libdrm-amdgpu1 -a
Package: libdrm-amdgpu1
Version: 2.4.98+git1905192304.922d929~b~padoka0
Priority: optional
Section: libs
Source: libdrm
Maintainer: Debian X Strike Force <debian-x@lists.debian.org>
Installed-Size: 76,8 kB
Depends: libc6 (>= 2.17), libdrm2 (>= 2.4.82)
Download-Size: 26,9 kB
APT-Manual-Installed: yes
APT-Sources: http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic/main amd64 Packages
Description: Userspace interface to amdgpu-specific kernel DRM services -- runtime
 This library implements the userspace interface to the kernel DRM
 services.  DRM stands for "Direct Rendering Manager", which is the
 kernelspace portion of the "Direct Rendering Infrastructure" (DRI).
 The DRI is currently used on Linux to provide hardware-accelerated

I found the following in the kernel log while testing with kernel 5.1:

May 22 18:46:31 [HOST] kernel: [  256.354386] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354390] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354391] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0050153D
May 22 18:46:31 [HOST] kernel: [  256.354395] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354397] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354398] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354404] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354405] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354407] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354411] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354412] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354413] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354418] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354419] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354420] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354424] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354426] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354427] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354430] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354432] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354433] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354437] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354438] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354439] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354443] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354444] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354445] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:31 [HOST] kernel: [  256.354449] amdgpu 0000:0c:00.0: [gfxhub] no-retry page fault (src_id:0 ring:158 vmid:5 pasid:32780, for process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575)
May 22 18:46:31 [HOST] kernel: [  256.354450] amdgpu 0000:0c:00.0:   in page starting at address 0x0000000000400000 from 27
May 22 18:46:31 [HOST] kernel: [  256.354451] amdgpu 0000:0c:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
May 22 18:46:41 [HOST] kernel: [  261.469953] [drm:amdgpu_dm_commit_planes.isra.43 [amdgpu]] *ERROR* Waiting for fences timed out.
May 22 18:46:41 [HOST] kernel: [  266.593840] [drm:amdgpu_dm_commit_planes.isra.43 [amdgpu]] *ERROR* Waiting for fences timed out.
May 22 18:46:41 [HOST] kernel: [  266.599848] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=18098, emitted seq=18100
May 22 18:46:41 [HOST] kernel: [  266.599914] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Battle.net.exe pid 10384 thread Battle.net:cs0 pid 10575
May 22 18:46:41 [HOST] kernel: [  266.599918] amdgpu 0000:0c:00.0: GPU reset begin!
May 22 18:46:47 [HOST] kernel: [  271.709694] [drm:amdgpu_dm_commit_planes.isra.43 [amdgpu]] *ERROR* Waiting for fences timed out.
May 22 18:46:47 [HOST] kernel: [  272.165625] amdgpu 0000:0c:00.0: GPU BACO reset
May 22 18:46:47 [HOST] kernel: [  272.643907] amdgpu 0000:0c:00.0: GPU reset succeeded, trying to resume
May 22 18:46:47 [HOST] kernel: [  272.644035] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
May 22 18:46:47 [HOST] kernel: [  272.644126] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
May 22 18:46:47 [HOST] kernel: [  272.644277] [drm] PSP is resuming...
May 22 18:46:47 [HOST] kernel: [  272.790964] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
May 22 18:46:47 [HOST] kernel: [  272.801714] amdgpu: [powerplay] Failed to send message: 0x46, ret value: 0xffffffff
May 22 18:46:47 [HOST] kernel: [  272.801830] amdgpu: [powerplay] Failed to send message: 0x61, ret value: 0xffffffff
May 22 18:46:48 [HOST] kernel: [  273.172332] [drm] UVD and UVD ENC initialized successfully.
May 22 18:46:48 [HOST] kernel: [  273.271995] [drm] VCE initialized successfully.
May 22 18:46:48 [HOST] kernel: [  273.273190] [drm] recover vram bo from shadow start
May 22 18:46:48 [HOST] kernel: [  273.279784] [drm] recover vram bo from shadow done
May 22 18:46:48 [HOST] kernel: [  273.279787] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279789] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279823] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279831] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279833] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279838] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279844] amdgpu 0000:0c:00.0: GPU reset(2) succeeded!
May 22 18:46:48 [HOST] kernel: [  273.279844] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279848] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279853] [drm] Skip scheduling IBs!
May 22 18:46:48 [HOST] kernel: [  273.279855] [drm] Skip scheduling IBs!
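
The log above follows a recognisable sequence: repeated gfxhub page faults, a "ring gfx timeout", then a GPU reset. To compare boots or kernel versions, those events can be counted from a saved kernel log. This is only a sketch: the function name amdgpu_log_summary is hypothetical, and the patterns are copied from the messages quoted above, so they may not match the wording used by other kernel versions.

```shell
# amdgpu_log_summary: read kernel log lines on stdin and print how many
# page faults, ring timeouts and GPU resets were seen.
amdgpu_log_summary() {
    awk '
        /no-retry page fault/      { faults++ }
        /ring [a-z0-9_.]+ timeout/ { timeouts++ }
        /GPU reset begin/          { resets++ }
        END {
            printf "page_faults=%d ring_timeouts=%d gpu_resets=%d\n",
                   faults, timeouts, resets
        }
    '
}

# Example: summarise the kernel messages of the previous boot
#   journalctl -k -b -1 | amdgpu_log_summary
```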
asked on Stack Overflow May 22, 2019 by PvdL • edited May 22, 2019 by PvdL

1 Answer

Kernel 5.5 runs and is stable!

uname -a

Linux patrick-X470-AORUS-ULTRA-GAMING 5.5.10-050510-generic #202003180732 SMP Wed Mar 18 07:35:23 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
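
To check whether another machine is already on a kernel at least as new as the 5.5 series reported stable here, the release string from uname -r can be compared in version order. This is a sketch under assumptions: the helper name kernel_at_least is hypothetical, it relies on sort -V from GNU coreutils, and 5.5 is simply the version observed to work in this answer, not a confirmed fix boundary.

```shell
# kernel_at_least MIN [VERSION]
# Succeeds if VERSION (default: output of `uname -r`) is greater than or
# equal to MIN in version order. Requires `sort -V` (GNU coreutils).
kernel_at_least() {
    min=$1
    ver=${2-$(uname -r)}
    # Keep only the part before the first dash,
    # e.g. 5.5.10 from 5.5.10-050510-generic.
    ver=${ver%%-*}
    # The smaller of the two versions sorts first.
    lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Example:
#   kernel_at_least 5.5 && echo "running kernel is 5.5 or newer"
```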

screenfetch -n

 patrick@patrick-X470-AORUS-ULTRA-GAMING
 OS: Ubuntu 18.04 bionic
 Kernel: x86_64 Linux 5.5.10-050510-generic
 Uptime: 17h 38m
 Packages: 3877
 Shell: bash 4.4.20
 Resolution: 3840x2160
 DE: GNOME 
 WM: GNOME Shell
 WM Theme: Adwaita
 GTK Theme: Ambiance [GTK2/3]
 Icon Theme: ubuntu-mono-dark
 Font: Ubuntu 11
 CPU: AMD Ryzen 7 2700X Eight-Core @ 16x 3.7GHz [38.8°C]
 GPU: Radeon RX Vega (VEGA10, DRM 3.36.0, 5.5.10-050510-generic, LLVM 10.0.0)
 RAM: 10126MiB / 64332MiB

Drivers + additional info

$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.0-devel - padoka PPA

$ sudo lspci -v | grep -i vga -A 10
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1) (prog-if 00 [VGA controller])
    Subsystem: ASUSTeK Computer Inc. Vega 10 XT [Radeon RX Vega 64]
    Flags: bus master, fast devsel, latency 0, IRQ 119
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at f0000000 (64-bit, prefetchable) [size=2M]
    I/O ports at e000 [size=256]
    Memory at fcc00000 (32-bit, non-prefetchable) [size=512K]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [64] Express Legacy Endpoint, MSI 00

$ apt show libdrm-amdgpu1 -a
Package: libdrm-amdgpu1
Version: 2.4.100+git2001081023.9ebfac1~b~padoka0
Priority: optional
Section: libs
Source: libdrm
Maintainer: Debian X Strike Force <debian-x@lists.debian.org>
Installed-Size: 80,9 kB
Depends: libc6 (>= 2.17), libdrm2 (>= 2.4.100)
Download-Size: 28,2 kB
APT-Manual-Installed: yes
APT-Sources: http://ppa.launchpad.net/paulo-miguel-dias/mesa/ubuntu bionic/main amd64 Packages
answered on Stack Overflow Mar 23, 2020 by PvdL
