[mlx5] Intermittent VF-LAG activation failure

Bug #1988018 reported by Frode Nordahl
This bug affects 1 person
Affects                 Status          Importance   Assigned to     Milestone
linux (Ubuntu)          Fix Committed   Undecided    Unassigned
  Jammy                 New             Undecided    Unassigned
  Kinetic               Fix Committed   Undecided    Unassigned
netplan.io (Ubuntu)     Triaged         Medium       Unassigned
  Jammy                 In Progress     Undecided    Martin Kalcok
  Kinetic               Won't Fix       Medium       Unassigned

Bug Description

During system initialization there is a specific sequence that must be followed to enable the use of hardware offload and VF-LAG.

Intermittently one may see that VF-LAG initialization fails:
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                           Make sure all VFs are unbound prior to VF LAG activation or deactivation

This is caused by rebinding the driver before the VF LAG is ready.

A debugfs knob has recently been added to the driver [0], and we should monitor it before attempting to rebind the driver:

    $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
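
As a rough illustration of monitoring that file before rebinding, the sketch below polls it until it reports a ready state or a timeout expires. It rests on assumptions not stated in this report: that the file contains the single word "active" once the LAG is up, and that debugfs is mounted at /sys/kernel/debug. The function name, the 30-second default, and the one-second poll interval are hypothetical.

```shell
#!/bin/sh
# Sketch only: wait for the mlx5 debugfs LAG state file to report readiness.
# Assumption: the file reads "active" once VF-LAG has been activated.

wait_for_lag_state() {
    state_file=$1          # e.g. /sys/kernel/debug/mlx5/0000:08:00.0/lag/state
    timeout=${2:-30}       # seconds to wait before giving up (hypothetical default)
    expected=${3:-active}  # assumed "ready" value

    elapsed=0
    while :; do
        # The file may not exist yet (driver not loaded, debugfs not mounted).
        if [ -r "$state_file" ]; then
            state=$(cat "$state_file" 2>/dev/null)
            if [ "$state" = "$expected" ]; then
                return 0
            fi
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

A caller could run `wait_for_lag_state '/sys/kernel/debug/mlx5/0000:08:00.0/lag/state' 30` and proceed with the rebind only when it returns 0. Reading debugfs normally requires root.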

The kernel feature is available in the upcoming Kinetic 5.19 kernel and we should probably backport it to the Jammy 5.15 kernel.

0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

Frode Nordahl (fnordahl)
Changed in linux (Ubuntu Kinetic):
status: New → Fix Committed
Lukas Märdian (slyon)
Changed in netplan.io (Ubuntu Kinetic):
status: New → Triaged
importance: Undecided → Medium
tags: added: foundations-triage-discuss
Lukas Märdian (slyon)
tags: removed: foundations-triage-discuss
Utkarsh Gupta (utkarsh) wrote :

Ubuntu 22.10 (Kinetic Kudu) has reached end of life, so this bug will not be fixed for that specific release.

Changed in netplan.io (Ubuntu Kinetic):
status: Triaged → Won't Fix
Changed in netplan.io (Ubuntu Jammy):
assignee: nobody → Martin Kalcok (martin-kalcok)
status: New → In Progress
Lukas Märdian (slyon) wrote :
Frode Nordahl (fnordahl) wrote :

I think they are two distinct problems, and hopefully we will get a comment from NVIDIA/Mellanox, as the statements in bug 2020409 contradict the documentation [0] that the current Netplan implementation is based on.

Martin may have more details, but I wanted to mention that one of our suspected culprits is how Netplan lays out the udev rules for VF activation [1]:
1) It takes a long time when many VFs are configured, contrary to the expectation stated in the comment.
2) The process appears to be executed multiple times. Combined with how long it takes, this may end up clashing with both the networking backend's creation of the bond and the systemd unit rebinding the VFs.

Bug 2020409 also raises the question of whether there are any bond/LAG-related system bring-up quirks for systems using only Scalable Functions (SFs) or a combination of SFs and VFs. I have yet to see any documentation about that.

0: https://enterprise-support.nvidia.com/s/article/Configuring-VF-LAG-using-TC
1: https://github.com/canonical/netplan/blob/a7e4be03918c986020650743cb6cf0934696ef0c/src/sriov.c#L107-L112
