Devlink reload hangs: fix race and lock issue

Bug #2039869 reported by William Tu
Affects                    Status       Importance  Assigned to  Milestone
linux-bluefield (Ubuntu)   Invalid      Undecided   Unassigned
  Jammy                    In Progress  Undecided   Unassigned

Bug Description

Summary:
Machine hangs when doing devlink reload

How to reproduce:
Host:
[root@bu-lab24v ~]# echo '2' > /sys/class/net/ens2f0np0/device/sriov_numvfs

Arm:
root@bu-lab24v-oob:~# uname -r
5.15.0-1027-bluefield
root@bu-lab24v-oob:~# devlink dev eswitch set pci/0000:03:00.0 mode switchdev
root@bu-lab24v-oob:~# devlink dev reload pci/0000:03:00.0
*Hangs*

Arm dmesg:
[ 1089.747409] INFO: task devlink:8753 blocked for more than 120 seconds.
[ 1089.760560] Tainted: G OE 5.15.0-1027-bluefield #29-Ubuntu
[ 1089.775086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1089.790829] task:devlink state:D stack: 0 pid: 8753 ppid: 5090 flags:0x00000004
[ 1089.790838] Call trace:
[ 1089.790840] __switch_to+0xf8/0x150
[ 1089.790857] __schedule+0x2b8/0x790
[ 1089.790865] schedule+0x64/0x140
[ 1089.790870] schedule_preempt_disabled+0x18/0x24
[ 1089.790874] __mutex_lock.constprop.0+0x1a0/0x680
[ 1089.790878] __mutex_lock_slowpath+0x40/0x90
[ 1089.790883] mutex_lock+0x64/0x70
[ 1089.790887] devl_lock+0x1c/0x30
[ 1089.790893] mlx5_detach_device+0x58/0x190 [mlx5_core]
[ 1089.791055] mlx5_unload_one+0x40/0xe4 [mlx5_core]
[ 1089.791177] mlx5_devlink_reload_down+0x184/0x270 [mlx5_core]
[ 1089.791318] devlink_reload+0x214/0x290

Fixes:
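Checking the OFED source code, we found that the devl trap group change
is missing from our tree and also needs to be backported to avoid the deadlock.
Without it, mlx5_detach_device() calls devl_lock() itself, which matches the
blocked call trace above (apparently the reload path already holds the devlink
lock). With it, the function only asserts that the lock is held: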

void mlx5_detach_device(struct mlx5_core_dev *dev, bool suspend)
{
...
#ifdef HAVE_DEVL_PORT_REGISTER
#ifdef HAVE_DEVL_TRAP_GROUPS_REGISTER
        devl_assert_locked(priv_to_devlink(dev));
#else
        devl_lock(devlink);
#endif /* HAVE_DEVL_TRAP_GROUPS_REGISTER */
#endif /* HAVE_DEVL_PORT_REGISTER */
        mutex_lock(&mlx5_intf_mutex);
#ifdef HAVE_DEVL_PORT_REGISTER
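
The difference between the two branches can be illustrated with a small user-space sketch. This is an analogy only, not kernel or mlx5 code: it assumes the reload path already holds the lock when the detach path runs, and every name in it (fake_devlink, fake_reload, detach_taking_lock, detach_assuming_locked) is hypothetical. A POSIX error-checking mutex stands in for the devlink lock so that a second lock attempt by the owner returns EDEADLK instead of hanging the way the kernel mutex does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct devlink; only the lock is modelled. */
struct fake_devlink {
	pthread_mutex_t lock;
};

typedef int (*detach_fn)(struct fake_devlink *dl);

/* Create the lock as an error-checking mutex: relocking by the owner
 * returns EDEADLK rather than blocking forever. */
static int fake_devlink_init(struct fake_devlink *dl)
{
	pthread_mutexattr_t attr;
	int err = pthread_mutexattr_init(&attr);

	if (err)
		return err;
	err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!err)
		err = pthread_mutex_init(&dl->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	return err;
}

/* Models the #else branch: the detach path takes the lock itself. */
static int detach_taking_lock(struct fake_devlink *dl)
{
	int err = pthread_mutex_lock(&dl->lock);

	if (err)
		return err; /* EDEADLK if the caller already owns the lock */
	/* ... teardown would happen here ... */
	pthread_mutex_unlock(&dl->lock);
	return 0;
}

/* Models the assert-locked branch: the detach path never takes the
 * lock and only verifies that the caller owns it. */
static int detach_assuming_locked(struct fake_devlink *dl)
{
	int err = pthread_mutex_lock(&dl->lock);

	if (err == EDEADLK)
		return 0; /* caller owns the lock, as required */
	if (err == 0) {
		/* Nobody held the lock: undo our acquisition and complain. */
		pthread_mutex_unlock(&dl->lock);
		return EPERM;
	}
	return err;
}

/* Models the reload path: take the lock, then call into detach. */
static int fake_reload(struct fake_devlink *dl, detach_fn detach)
{
	int err;

	assert(detach != NULL);
	err = pthread_mutex_lock(&dl->lock);
	if (err)
		return err;
	err = detach(dl);
	pthread_mutex_unlock(&dl->lock);
	return err;
}
```

Calling fake_reload() with detach_taking_lock reports EDEADLK, the user-space counterpart of the hang in the trace, while detach_assuming_locked completes normally under the same caller.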

Related issue:
#2032378 Devlink backport: fix race and lock issue

Proposed fix: cherry-pick the commit below.
commit 852e85a704c2e11c050bdea286bc438aba4f4a22
Author: Jiri Pirko <email address hidden>
Date: Sat Jul 16 13:02:34 2022 +0200

    net: devlink: add unlocked variants of devling_trap*() functions

    Add unlocked variants of devl_trap*() functions to be used in drivers
    called-in with devlink->lock held.

Changed in linux-bluefield (Ubuntu):
status: New → Invalid
Changed in linux-bluefield (Ubuntu Jammy):
status: New → Fix Committed
status: Fix Committed → In Progress