[bionic] machine stuck and bonding not working well when nvmet_rdma module is loaded
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | High | Joseph Salisbury |
Bionic | Fix Released | High | Joseph Salisbury |
Bug Description
== SRU Justification ==
This bug causes the machine to get stuck and bonding to not work when
the nvmet_rdma module is loaded.
Both of the commits listed under Fixes are in mainline as of v4.17-rc1.
== Fixes ==
a3dd7d0022c3 ("nvmet-rdma: Don't flush system_wq by default during remove_one")
9bad0404ecd7 ("nvme-rdma: Don't flush delete_wq by default during remove_one")
== Regression Potential ==
Low. Limited to nvme driver and tested by Mellanox.
== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
== Original Bug Description ==
Hi
The machine gets stuck after unregistering the bonding interface while the nvmet_rdma module is loaded.
scenario:
# modprobe nvmet_rdma
# modprobe -r bonding
# modprobe bonding -v mode=1 miimon=100 fail_over_mac=0
# ifdown eth4
# ifdown eth5
# ip addr add 15.209.12.173/8 dev bond0
# ip link set bond0 up
# echo +eth5 > /sys/class/
# echo +eth4 > /sys/class/
# echo -eth4 > /sys/class/
# echo -eth5 > /sys/class/
# echo -bond0 > /sys/class/
dmesg:
kernel: [78348.225556] bond0 (unregistering): Released all slaves
kernel: [78358.339631] unregister_
kernel: [78368.419621] unregister_
kernel: [78378.499615] unregister_
kernel: [78388.579625] unregister_
kernel: [78398.659613] unregister_
kernel: [78408.739655] unregister_
kernel: [78418.819634] unregister_
kernel: [78428.899642] unregister_
kernel: [78438.979614] unregister_
kernel: [78449.059619] unregister_
kernel: [78459.139626] unregister_
kernel: [78469.219623] unregister_
kernel: [78479.299619] unregister_
kernel: [78489.379620] unregister_
kernel: [78499.459623] unregister_
kernel: [78509.539631] unregister_
kernel: [78519.619629] unregister_
The following upstream commits fix this issue:
commit a3dd7d0022c3472
Author: Max Gurtovoy <email address hidden>
Date: Wed Feb 28 13:12:38 2018 +0200
nvmet-rdma: Don't flush system_wq by default during remove_one
The .remove_one function is called for any ib_device removal.
In case the removed device has no reference in our driver, there
is no need to flush the system work queue.
Reviewed-by: Israel Rukshin <email address hidden>
Signed-off-by: Max Gurtovoy <email address hidden>
Reviewed-by: Sagi Grimberg <email address hidden>
Signed-off-by: Keith Busch <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
diff --git a/drivers/
index aa8068f..a59263d 100644
--- a/drivers/
+++ b/drivers/
@@ -1469,8 +1469,25 @@ static struct nvmet_fabrics_ops nvmet_rdma_ops = {
static void nvmet_rdma_
{
struct nvmet_rdma_queue *queue, *tmp;
+ struct nvmet_rdma_device *ndev;
+ bool found = false;
+
+ mutex_lock(
+ list_for_
+ if (ndev->device == ib_device) {
+ found = true;
+ break;
+ }
+ }
+ mutex_unlock(
+
+ if (!found)
+ return;
- /* Device is being removed, delete all queues using this device */
+ /*
+ * IB Device that is used by nvmet controllers is being removed,
+ * delete all queues using this device.
+ */
commit 9bad0404ecd7594
Author: Max Gurtovoy <email address hidden>
Date: Wed Feb 28 13:12:39 2018 +0200
nvme-rdma: Don't flush delete_wq by default during remove_one
The .remove_one function is called for any ib_device removal.
In case the removed device has no reference in our driver, there
is no need to flush the work queue.
Reviewed-by: Israel Rukshin <email address hidden>
Signed-off-by: Max Gurtovoy <email address hidden>
Reviewed-by: Sagi Grimberg <email address hidden>
Signed-off-by: Keith Busch <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
diff --git a/drivers/
index f5f460b..250b277 100644
--- a/drivers/
+++ b/drivers/
@@ -2024,6 +2024,20 @@ static struct nvmf_transport_ops nvme_rdma_transport = {
static void nvme_rdma_
{
struct nvme_rdma_ctrl *ctrl;
+ struct nvme_rdma_device *ndev;
+ bool found = false;
+
+ mutex_lock(
+ list_for_
+ if (ndev->dev == ib_device) {
+ found = true;
+ break;
+ }
+ }
+ mutex_unlock(
+
+ if (!found)
+ return;
/* Delete all controllers using this device */
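Both patches apply the same guard: before doing any teardown or flushing a work queue, .remove_one first checks whether the removed ib_device is referenced by the driver at all, and returns early if it is not. The following is only a minimal user-space sketch of that idea, not the kernel code. All names in it (`ib_device_stub`, `rdma_dev_entry`, `flush_work_stub`, `remove_one_sketch`) are hypothetical stand-ins, and the locking that the real patches take around the list walk is omitted because the sketch is single-threaded.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ib_device; not the kernel definition. */
struct ib_device_stub {
    const char *name;
};

/* Hypothetical entry in the driver's list of devices it holds references to. */
struct rdma_dev_entry {
    struct ib_device_stub *device;
    struct rdma_dev_entry *next;
};

/* Counts how often the expensive flush path ran (illustration only). */
static int flush_count;

/* Stand-in for flushing a work queue. */
static void flush_work_stub(void)
{
    flush_count++;
}

/*
 * Walk the driver's device list and report whether 'dev' is referenced.
 * The real patches hold a mutex around this walk; omitted here.
 */
static bool device_in_use(const struct rdma_dev_entry *head,
                          const struct ib_device_stub *dev)
{
    for (const struct rdma_dev_entry *e = head; e != NULL; e = e->next) {
        if (e->device == dev)
            return true;
    }
    return false;
}

/* Sketch of a remove_one handler with the early-return guard. */
static void remove_one_sketch(const struct rdma_dev_entry *head,
                              const struct ib_device_stub *dev)
{
    /* Device unknown to this driver: nothing to tear down, skip the flush. */
    if (!device_in_use(head, dev))
        return;

    /* Queue or controller teardown for this device would happen here. */
    flush_work_stub();
}
```

In this sketch, removing a device that is not on the list leaves `flush_count` unchanged, which corresponds to the case in this bug where removing an unrelated device should not trigger a flush in the nvme drivers.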
summary:
- machine stuck and bonding not working well when nvmet_rdma module is loaded
+ [bionic] machine stuck and bonding not working well when nvmet_rdma module is loaded
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
status: Incomplete → Triaged
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-done-bionic; removed: verification-needed-bionic
Changed in linux (Ubuntu):
status: In Progress → Fix Released
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1764982
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.