PCI driver failed: Detected PCI bus error on device


I'm trying to reset a specific PCI device using my own custom driver on a ppc64 (PowerPC) machine.

This driver works on another ppc64 machine.

This is the function responsible for performing the reset. I removed several lines of code to emphasize the important flow.

int reset_device(void)
{
    pdev = g_reset_info.devs[ix];
    err = pci_enable_device(pdev);
    if (err) {
        return err;
    }

    pci_set_master(pdev);
    err = pci_save_state(pdev);
    if (err) {
        return err;
    }

    pdev = g_reset_info.devs[ix];

    err = pci_set_pcie_reset_state(pdev, pcie_hot_reset);
    if (err) {
        return err;
    }

    msleep(jiffies_to_msecs(HZ/2));
    msleep(jiffies_to_msecs(HZ/2));

    pdev = g_reset_info.devs[ix];

    err = pci_set_pcie_reset_state(pdev, pcie_deassert_reset);
    if (err) {
        return err;
    }

    pdev = g_reset_info.devs[ix];
    pci_restore_state(pdev);

    msleep(jiffies_to_msecs(HZ/2));
    msleep(jiffies_to_msecs(HZ/2));

    return 0;
}

This is the output from dmesg:

mst_ppc_pci_reset_driver reset_device 63 Send hot reset to device: 0000:50:00.0 
mst_ppc_pci_reset_driver reset_device 81 Deassert device: 0000:50:00.0 
Call Trace: 
[c000000186f92fe0] [c0000000000155ac] .show_stack+0x6c/0x198 (unreliable) 
[c000000186f93090] [c000000000076a8c] .eeh_dn_check_failure+0x354/0x3f0 
[c000000186f93150] [c000000000029b7c] .rtas_read_config+0x13c/0x198 
[c000000186f931f0] [c00000000039c8d0] .pci_bus_read_config_word+0xa0/0xf8 
[c000000186f932b0] [c0000000003a2730] .pci_find_capability+0x40/0xd0 
[c000000186f93360] [c0000000003a2b6c] .pci_restore_pcie_state+0x54/0x2e8 
[c000000186f93410] [c0000000003a501c] .pci_restore_state+0x84/0x1b8 
[c000000186f934d0] [d000000003810384] .reset_device+0x184/0x430 [mst_ppc_pci_reset] 
[c000000186f93590] [c0000000003a6254] .local_pci_probe+0x7c/0xf8 
[c000000186f93620] [c0000000003a63a8] .__pci_device_probe+0xd8/0x128 
[c000000186f936d0] [c0000000003a72a8] .pci_device_probe+0x38/0x68 
[c000000186f93760] [c0000000004d0bd8] .really_probe+0xb0/0x288 
[c000000186f93810] [c0000000004d0e4c] .driver_probe_device+0x9c/0x110 
[c000000186f938a0] [c0000000004d0fbc] .__driver_attach+0xfc/0x100 
[c000000186f93930] [c0000000004cfee4] .bus_for_each_dev+0xc4/0x118 
[c000000186f939e0] [c0000000004d08a8] .driver_attach+0x28/0x40 
[c000000186f93a60] [c0000000004cf3b0] .bus_add_driver+0x190/0x340 
[c000000186f93b10] [c0000000004d1950] .driver_register+0x98/0x1b8 
[c000000186f93bb0] [c0000000003a760c] .__pci_register_driver+0x64/0x140 
[c000000186f93c50] [d0000000038107c0] .init+0x28/0x400 [mst_ppc_pci_reset] 
[c000000186f93cd0] [c00000000000ab68] .do_one_initcall+0x68/0x1e0 
[c000000186f93d90] [c00000000010893c] .SyS_init_module+0xcc/0x218 
[c000000186f93e30] [c0000000000098ec] syscall_exit+0x0/0x40 
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xf (was 0xffffffff, writing 0xff) 
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xe (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xd (was 0xffffffff, writing 0x40)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xc (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xb (was 0xffffffff, writing 0x6115b3)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xa (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x9 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x7 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x6 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x5 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x9e00000c)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x3 (was 0xffffffff, writing 0x20)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x2 (was 0xffffffff, writing 0x2070000)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x1 (was 0xffffffff, writing 0x100146)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0x101115b3)
EEH: Detected PCI bus error on device <null>
EEH: This PCI device has failed 1 times in the last hour:
EEH: Bus location=U78CB.001.WZS02VY-P1-C11-T1 driver=mst_ppc_pci_reset_driver pci addr=0000:50:00.0
EEH: Device location=U78CB.001.WZS02VY-P1-C11-T1 driver= pci addr=<null>
EEH: of node=/pci@800000020000013/pci15b3,61@0
EEH: PCI device/vendor: 101115b3
EEH: PCI cmd/status register: 00100140
EEH: PCI-E capabilities and status follow:
EEH: PCI-E 00: 0002c010
EEH: PCI-E 01: 19008fe2
EEH: PCI-E 02: 0000595e
EEH: PCI-E 03: 0043f103
EEH: PCI-E 04: 10830000
EEH: PCI-E 05: 00000000
EEH: PCI-E 06: 00000000
EEH: PCI-E 07: 00000000
EEH: PCI-E 08: 00000000
EEH: PCI-E AER capability register set follows:
EEH: PCI-E AER 00: 00010001
EEH: PCI-E AER 01: 00000000
EEH: PCI-E AER 02: 00000000
EEH: PCI-E AER 03: 00062010
EEH: PCI-E AER 04: 00000000
EEH: PCI-E AER 05: 00002000
EEH: PCI-E AER 06: 000001e4
EEH: PCI-E AER 07: 00000000
EEH: PCI-E AER 08: 00000000
EEH: PCI-E AER 09: 00000000
EEH: PCI-E AER 0a: 00000000
EEH: PCI-E AER 0b: 00000000
EEH: PCI-E AER 0c: 00000000
EEH: PCI-E AER 0d: 00000000
RTAS: event: 2736, Type: Platform Error, Severity: 2
mst_ppc_pci_reset_driver 0000:50:00.0: PME# disabled
linux-kernel
linux-device-driver
powerpc
pci
asked on Stack Overflow Aug 22, 2019 by Itay Avraham

1 Answer


When debugging this sort of issue it's a very good idea to track which kernel versions you are using and to provide specific details about the HW you are testing with. From the fact that your kernel has eeh_dn_check_failure() rather than eeh_check_dev_failure() I can gather this is a very old kernel. Does the other system you tested with have the same kernel? Same firmware? All this is relevant to your problem.

Anyway, I'd say you need a one second wait between de-asserting the reset and restoring config space. The PCI spec requires that system software give the device one second to initialise after a reset before attempting IO, config cycles included. In 2015 commit 26833a5029b7 ("powerpc/eeh: Make the delay for PE reset unified") added a delay after de-assert (on powerpc at least), so on newer kernels that would be handled for you. Considering your kernel is old enough to still have eeh_dn_check_failure() (renamed in 2012, see f8f7d63fd96e), you probably don't have that patch and need to do the wait yourself.

What's probably happening is that the device isn't ready to respond to config accesses and drops them. The hypervisor detects a timeout, assumes the device is malfunctioning, and isolates (freezes) it using the EEH mechanism that IBM's Power hardware has. That would also explain why every config read in your log returns 0xffffffff. Normally the OS will try to un-freeze and reset the device after that happens, but that can fail for a lot of reasons, especially on older kernels.
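To make the ordering concrete, here is a small user-space model of the failure, not kernel code. Everything named fake_* is hypothetical, and so are reset_with_wait, struct fake_dev and the READY_AFTER_MS value, which simply encodes the one second mentioned above. The model assumes that a config read issued before the device has settled returns all ones and freezes the device, which is a simplification of what EEH does. In the real driver the equivalent change would be a wait, for example msleep(1000), between pci_set_pcie_reset_state(pdev, pcie_deassert_reset) and pci_restore_state(pdev).

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_DWORDS       16    /* standard 64-byte config header */
#define READY_AFTER_MS 1000    /* assumed settle time after de-assert */

/* Hypothetical model of a device behind an EEH-capable bridge. */
struct fake_dev {
    uint32_t cfg[CFG_DWORDS];   /* live config space */
    uint32_t ms_since_deassert; /* simulated time since reset was released */
    bool frozen;                /* set once the platform isolates the device */
};

/* Stand-in for msleep(): just advances the simulated clock. */
static void fake_msleep(struct fake_dev *d, uint32_t ms)
{
    d->ms_since_deassert += ms;
}

/* Releasing reset clears config space and restarts the settle timer. */
static void fake_deassert_reset(struct fake_dev *d)
{
    for (int i = 0; i < CFG_DWORDS; i++)
        d->cfg[i] = 0;
    d->ms_since_deassert = 0;
    d->frozen = false;
}

/* A read that arrives too early is dropped: the caller sees all ones
 * and the (modelled) platform freezes the device. */
static uint32_t fake_cfg_read(struct fake_dev *d, int idx)
{
    if (d->frozen || d->ms_since_deassert < READY_AFTER_MS) {
        d->frozen = true;
        return 0xffffffffu;
    }
    return d->cfg[idx];
}

/* Writes to a frozen device are silently discarded. */
static void fake_cfg_write(struct fake_dev *d, int idx, uint32_t val)
{
    if (!d->frozen)
        d->cfg[idx] = val;
}

/* Same shape as a config restore: read each dword, rewrite it if it
 * differs from the saved copy, then verify the result. */
static bool fake_restore_state(struct fake_dev *d, const uint32_t *saved)
{
    for (int i = CFG_DWORDS - 1; i >= 0; i--)
        if (fake_cfg_read(d, i) != saved[i])
            fake_cfg_write(d, i, saved[i]);

    for (int i = 0; i < CFG_DWORDS; i++)
        if (fake_cfg_read(d, i) != saved[i])
            return false;
    return true;
}

/* De-assert, wait wait_ms, then restore. Returns true on success. */
static bool reset_with_wait(struct fake_dev *d, const uint32_t *saved,
                            uint32_t wait_ms)
{
    fake_deassert_reset(d);
    fake_msleep(d, wait_ms);
    return fake_restore_state(d, saved);
}
```

In this model, calling reset_with_wait() with a wait of 0 leaves the device frozen and the restore fails, while a wait of 1000 lets the restore complete. Note that polling the device with config reads until it answers is not a safe substitute on this platform, since under the assumption above the early read itself is what triggers the freeze.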

answered on Stack Overflow Aug 23, 2019 by Oliver O'Halloran

User contributions licensed under CC BY-SA 3.0