[Azure] Fix VM crash/hang issues due to fast VF add/remove events
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux-azure (Ubuntu) | New | Undecided | Unassigned |
Jammy | Fix Released | Medium | Tim Gardner |
Lunar | Fix Released | Medium | Tim Gardner |
Bug Description
SRU Justification
[Impact]
A Linux guest on Hyper-V/Azure can occasionally crash during early kernel boot because of unusual host behavior:
1. The host assigns a VF to the guest;
2. The host immediately unassigns the VF from the guest (Dexuan: race-condition bugs in the Linux vPCI driver can make Linux crash at this point);
3. The host assigns the VF to the guest again.
I'm asking the Hyper-V team to investigate the host behavior, but I'm not sure when they'll get that fixed.
Starting in late 2022 (around November 2022), Linux guests on Azure began crashing more frequently because of a host-side update deployed at that time: a new host/hypervisor feature for handling "correctable memory errors" can trigger many successive VF remove/add events, so the race-condition bugs in the Linux vPCI driver surface much more easily. The Hyper-V team is implementing a batching mechanism so that guests receive far fewer VF remove/add events (ETA: June 2023). Meanwhile, the Linux race-condition bugs should also be fixed so that Linux guests do not crash even when they receive successive VF remove/add events.
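To make the failure mode concrete, here is a minimal Python sketch of the general idea of serializing VF add/remove handling. It is not the pci-hyperv driver code and not the fix that shipped in these kernels. Everything in it is hypothetical: the class `HypotheticalVpciBus`, its `state_lock`, and the event names `"add"`/`"remove"` are illustrative stand-ins. Host events are queued in arrival order and each handler runs to completion under one per-bus lock, so a remove that arrives while an add is still being processed waits for the add to finish instead of tearing the device down mid-probe.

```python
import queue
import threading


class HypotheticalVpciBus:
    """Toy model of a guest-side virtual PCI bus; NOT the pci-hyperv driver."""

    def __init__(self, vf_present=False):
        self.events = queue.Queue()          # host messages, in arrival order
        self.state_lock = threading.Lock()   # hypothetical per-bus lock
        self.vf_present = vf_present
        self.log = []
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def host_event(self, kind):
        # The "host" only enqueues; it never touches device state directly.
        self.events.put(kind)

    def close(self):
        # Sentinel: the worker drains every earlier event, then exits.
        self.events.put(None)
        self.worker.join()

    def _run(self):
        while True:
            kind = self.events.get()
            if kind is None:
                return
            # Each handler runs to completion under the lock, so a "remove"
            # can never interleave with a half-finished "add" (or vice versa).
            with self.state_lock:
                if kind == "add":
                    self._handle_add()
                elif kind == "remove":
                    self._handle_remove()

    def _handle_add(self):
        if self.vf_present:
            self.log.append("add: ignored (already present)")
            return
        self.log.append("add: probe")
        self.vf_present = True
        self.log.append("add: ready")

    def _handle_remove(self):
        if not self.vf_present:
            self.log.append("remove: ignored (not present)")
            return
        self.log.append("remove: teardown")
        self.vf_present = False


if __name__ == "__main__":
    # The assign / unassign / assign sequence from the bug description.
    bus = HypotheticalVpciBus()
    for event in ("add", "remove", "add"):
        bus.host_event(event)
    bus.close()
    print(bus.vf_present)  # True
    print(bus.log)
```

In this toy model, one ordered queue plus one lock is the simplest way to rule out interleaving. The trade-off is that a long burst of remove/add events is handled strictly one after another, which is why reducing the number of events on the host side (batching) would still help.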
[Test Plan]
MSFT tested
[Regression potential]
Guests may continue to crash.
[Other Info]
SF: #00349076
CVE References
- 2022-20369
- 2022-2196
- 2022-2663
- 2022-3061
- 2022-3524
- 2022-3545
- 2022-3564
- 2022-3565
- 2022-3566
- 2022-3567
- 2022-3594
- 2022-3621
- 2022-3643
- 2022-41218
- 2022-4139
- 2022-42703
- 2022-42896
- 2022-4378
- 2022-4382
- 2022-43945
- 2022-45934
- 2022-47520
- 2022-47940
- 2022-48502
- 2023-0045
- 2023-0179
- 2023-0266
- 2023-0461
- 2023-0597
- 2023-1075
- 2023-1118
- 2023-1281
- 2023-1380
- 2023-1670
- 2023-1829
- 2023-1859
- 2023-1872
- 2023-2124
- 2023-2176
- 2023-23559
- 2023-2612
- 2023-2640
- 2023-26545
- 2023-30456
- 2023-3090
- 2023-31248
- 2023-3141
- 2023-31436
- 2023-32233
- 2023-32629
- 2023-3269
- 2023-3389
- 2023-3390
- 2023-3439
- 2023-35001
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Changed in linux-azure (Ubuntu Jammy):
- assignee: nobody → Tim Gardner (timg-tpi)
- importance: Undecided → Medium
- status: New → In Progress
Changed in linux-azure (Ubuntu Lunar):
- assignee: nobody → Tim Gardner (timg-tpi)
- importance: Undecided → Medium
- status: New → In Progress
tags:
- added: verification-done-jammy verification-done-lunar
- removed: verification-needed-jammy verification-needed-lunar
This bug is awaiting verification that the linux-azure/5.15.0-1043.50 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.
If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!