Mlx5 kworker blocked Kernel 5.19 (Jammy HWE)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
charm-ovn-chassis | Triaged | High | Unassigned |
linux (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
This is seen in particular with:
* Charmed OpenStack with Jammy Yoga
* 5.19.0-35-generic (linux-
* Mellanox ConnectX-6 card using the mlx5_core module
* SR-IOV with VF-LAG, used for OVN hardware offloading
The servers reach a very high load (around 75~100) quickly during boot, and every process that relies on network communication through the Mellanox network card is stuck or extremely slow.
Kernel logs report kworkers being blocked for more than 120 seconds.
The number of SR-IOV devices configured in both the firmware and the kernel seems to correlate strongly with the likelihood of this bug occurring.
Enabling more VFs seems to greatly increase the risk of hitting this bug.
This does not happen at every boot, but with 32 VFs on each PF it occurs about 40% of the time.
To recover the server, a cold reboot is required.
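Since the VF count appears to matter, it may help to record how many VFs are actually enabled on each PF when reporting a reproduction. The following is a minimal sketch, not part of the original report: it assumes the standard PCI sysfs attributes `sriov_numvfs` and `sriov_totalvfs` and the `driver` symlink under `/sys/bus/pci/devices`. The function name `sriov_vf_counts` and its parameters are hypothetical.

```python
import os


def sriov_vf_counts(sysfs_root="/sys/bus/pci/devices", driver="mlx5_core"):
    """Return {pci_address: (numvfs, totalvfs)} for SR-IOV PFs bound to `driver`.

    Hypothetical helper: the default path and driver name are assumptions
    based on the standard Linux PCI sysfs layout.
    """
    counts = {}
    if not os.path.isdir(sysfs_root):
        return counts
    for addr in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, addr)
        try:
            # The "driver" entry is a symlink whose last component is the driver name.
            bound = os.path.basename(os.readlink(os.path.join(dev, "driver")))
            if bound != driver:
                continue
            # Only physical functions expose the sriov_* attributes.
            with open(os.path.join(dev, "sriov_numvfs")) as f:
                numvfs = int(f.read().strip())
            with open(os.path.join(dev, "sriov_totalvfs")) as f:
                totalvfs = int(f.read().strip())
        except (OSError, ValueError):
            continue  # no driver bound, not a PF, or unreadable attribute
        counts[addr] = (numvfs, totalvfs)
    return counts


if __name__ == "__main__":
    for addr, (num, total) in sriov_vf_counts().items():
        print(f"{addr}: {num}/{total} VFs enabled")
```

Attaching this output alongside the hung-task trace would make it easier to compare affected and unaffected boots.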
Here is a quick sample of the trace; it seems to directly involve the mlx5 driver in the kernel:
Mar 07 05:24:56 nova-1 kernel: INFO: task kworker/0:1:19 blocked for more than 120 seconds.
Mar 07 05:24:56 nova-1 kernel: Tainted: P OE 5.19.0-35-generic #36~22.04.1-Ubuntu
Mar 07 05:24:56 nova-1 kernel: "echo 0 > /proc/sys/
Mar 07 05:24:56 nova-1 kernel: task:kworker/0:1 state:D stack: 0 pid: 19 ppid: 2 flags:0x00004000
Mar 07 05:24:56 nova-1 kernel: Workqueue: events work_for_cpu_fn
Mar 07 05:24:56 nova-1 kernel: Call Trace:
Mar 07 05:24:56 nova-1 kernel: <TASK>
Mar 07 05:24:56 nova-1 kernel: __schedule+
Mar 07 05:24:56 nova-1 kernel: schedule+0x68/0x110
Mar 07 05:24:56 nova-1 kernel: schedule_
Mar 07 05:24:56 nova-1 kernel: __mutex_
Mar 07 05:24:56 nova-1 kernel: __mutex_
Mar 07 05:24:56 nova-1 kernel: mutex_lock+
Mar 07 05:24:56 nova-1 kernel: mlx5_register_
Mar 07 05:24:56 nova-1 kernel: mlx5_init_
Mar 07 05:24:56 nova-1 kernel: probe_one+
Mar 07 05:24:56 nova-1 kernel: local_pci_
Mar 07 05:24:56 nova-1 kernel: work_for_
Mar 07 05:24:56 nova-1 kernel: process_
Mar 07 05:24:56 nova-1 kernel: worker_
Mar 07 05:24:56 nova-1 kernel: ? rescuer_
Mar 07 05:24:56 nova-1 kernel: kthread+0xee/0x120
Mar 07 05:24:56 nova-1 kernel: ? kthread_
Mar 07 05:24:56 nova-1 kernel: ret_from_
Mar 07 05:24:56 nova-1 kernel: </TASK>
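To check whether other workers are stuck behind the same mutex, the hung-task reports can be pulled out of a saved kernel log. This is a rough sketch that assumes only the "blocked for more than N seconds" message format shown in the trace above. The helper name `find_hung_tasks` is hypothetical, and the sample lines are copied from this report.

```python
import re

# Matches e.g. "INFO: task kworker/0:1:19 blocked for more than 120 seconds."
# The task name may itself contain ':' so the PID is taken as the last
# ':<digits>' group before " blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (task_name, pid, seconds) for each hung-task report."""
    hung = []
    for line in lines:
        m = HUNG_TASK_RE.search(line)
        if m:
            hung.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return hung


if __name__ == "__main__":
    sample = [
        "Mar 07 05:24:56 nova-1 kernel: INFO: task kworker/0:1:19 blocked for more than 120 seconds.",
        "Mar 07 05:24:56 nova-1 kernel: Workqueue: events work_for_cpu_fn",
    ]
    for comm, pid, secs in find_hung_tasks(sample):
        print(f"{comm} (pid {pid}) blocked > {secs}s")
```

Feeding it the full kernel log from an affected boot would list every blocked task, which helps show how many probe workers are waiting at once.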
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 2009594
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.