[Azure] Fix VM crash/hang issues due to fast VF add/remove events

Bug #2023071 reported by Tim Gardner
This bug affects 1 person
Affects               Status        Importance  Assigned to  Milestone
linux-azure (Ubuntu)  New           Undecided   Unassigned
  Jammy               Fix Released  Medium      Tim Gardner
  Lunar               Fix Released  Medium      Tim Gardner

Bug Description

SRU Justification

[Impact]

A Linux guest on Hyper-V/Azure can occasionally crash during early kernel boot because of unusual host behavior:
1. The host assigns a VF to the guest;
2. The host immediately unassigns the VF from the guest; //Dexuan: due to race condition bugs in the Linux vPCI driver, Linux can crash here.
3. The host assigns the VF to the guest again.
I'm asking the Hyper-V team to investigate the host behavior, but I'm not sure when they'll get that fixed.
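
The add/remove/add sequence above can be illustrated with a small userspace sketch. This is not the kernel code: the `Bus` class, its fields, and the `replay` helper are hypothetical names made up for illustration. The only detail taken from this report is that the fix adds a per-bus mutex called `state_lock`; the sketch assumes its role is to keep an add handler and a remove handler from interleaving.

```python
import threading


class Bus:
    """Toy stand-in for a vPCI bus (hypothetical, not the kernel structure)."""

    def __init__(self):
        self.state_lock = threading.Lock()  # serializes all state changes
        self.device = None                  # currently attached VF, if any
        self.log = []                       # order in which handlers completed

    def vf_added(self, name):
        with self.state_lock:
            # Probing takes several steps. Without the lock, a remove could
            # run between them and observe a half-initialized device.
            dev = {"name": name, "ready": False}
            self.device = dev
            dev["ready"] = True
            self.log.append(("add", name))

    def vf_removed(self):
        with self.state_lock:
            if self.device is None:
                return  # nothing attached, so there is nothing to tear down
            self.log.append(("remove", self.device["name"]))
            self.device = None


def replay(events):
    """Deliver each event on its own thread, as if they arrived back to back."""
    bus = Bus()
    threads = []
    for kind, name in events:
        if kind == "add":
            threads.append(threading.Thread(target=bus.vf_added, args=(name,)))
        else:
            threads.append(threading.Thread(target=bus.vf_removed))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bus


if __name__ == "__main__":
    bus = replay([("add", "vf0"), ("remove", None), ("add", "vf0")])
    print(bus.log)
```

The order in which the three handlers run is not deterministic, but because each one holds `state_lock` for its whole body, a remove only ever sees either no device or a fully probed one.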

Starting in late 2022 (around Nov 2022), Linux guests on Azure began to crash more frequently because of a host-side update at that time: a new host/hypervisor feature for handling "correctable memory errors" can cause many successive VF remove/add events, so the race condition bugs in the Linux vPCI driver surface much more easily. The Hyper-V team is implementing a batching mechanism so that the guest will get far fewer VF remove/add events (ETA: June 2023), but meanwhile we should also get the Linux race condition bugs fixed so that Linux guests won't crash even if they receive successive VF remove/add events.
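
This report does not say how the Hyper-V batching mechanism works. Purely as an illustration of why batching reduces the number of events a guest sees, the sketch below collapses each burst of events into at most one remove followed by one add. The function names, the millisecond timestamps, and the window size are all assumptions, and the sketch further assumes the VF is attached before each burst and that events alternate starting with a remove.

```python
def flush(burst):
    """Reduce one burst to at most a single remove and a single add."""
    kinds = [kind for _, kind in burst]
    last_time = burst[-1][0]
    result = []
    if "remove" in kinds:
        result.append((last_time, "remove"))
    if kinds[-1] == "add":
        # The burst ended with the VF attached again.
        result.append((last_time, "add"))
    return result


def batch(events, window_ms):
    """Group (time_ms, kind) events into bursts and flush each burst.

    A new burst starts once an event arrives window_ms or more after the
    first event of the current burst.
    """
    out = []
    burst = []
    for time_ms, kind in events:
        if burst and time_ms - burst[0][0] >= window_ms:
            out.extend(flush(burst))
            burst = []
        burst.append((time_ms, kind))
    if burst:
        out.extend(flush(burst))
    return out


if __name__ == "__main__":
    events = [
        (0, "remove"), (10, "add"), (20, "remove"), (30, "add"),
        (40, "remove"), (50, "add"),
        (5000, "remove"), (5010, "add"),
    ]
    print(batch(events, window_ms=1000))
```

With the sample input, eight host events shrink to four delivered events. Fewer events narrow the window in which the guest-side races can trigger, but they do not remove the races, which is why the driver fixes are still needed.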

[Test Plan]

MSFT tested

[Regression potential]

Guests may continue to crash.

[Other Info]

SF: #00349076

Tim Gardner (timg-tpi)
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Changed in linux-azure (Ubuntu Jammy):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Medium
status: New → In Progress
Changed in linux-azure (Ubuntu Lunar):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Medium
status: New → In Progress
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1043.50 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-azure verification-needed-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/6.2.0-1009.9 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar' to 'verification-done-lunar'. If the problem still exists, change the tag 'verification-needed-lunar' to 'verification-failed-lunar'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-lunar-linux-azure verification-needed-lunar
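
When verifying a -proposed kernel as requested above, it helps to confirm that the VM actually booted a kernel at or beyond the versions named in the bot comments (5.15.0-1043.50 for Jammy, 6.2.0-1009.9 for Lunar). The sketch below is a hypothetical helper, not an official tool. It assumes the running kernel release string has the form `<base>-<abi>-<flavour>`, for example `5.15.0-1043-azure`, and that a higher ABI number on the same base version includes the fix.

```python
import platform
import re

# Lowest ABI number carrying the fix for each base version, taken from the
# package versions named in this bug (5.15.0-1043.50 and 6.2.0-1009.9).
FIXED_ABI = {(5, 15, 0): 1043, (6, 2, 0): 1009}


def parse_release(release):
    """Split e.g. '5.15.0-1043-azure' into ((5, 15, 0), 1043, 'azure')."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-(.+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch, abi = (int(x) for x in match.group(1, 2, 3, 4))
    return (major, minor, patch), abi, match.group(5)


def has_fix(release):
    """Return True/False for a known linux-azure series, None otherwise."""
    base, abi, flavour = parse_release(release)
    if flavour != "azure" or base not in FIXED_ABI:
        return None  # not a kernel series covered by this bug
    return abi >= FIXED_ABI[base]


if __name__ == "__main__":
    running = platform.release()
    try:
        print(running, has_fix(running))
    except ValueError as err:
        print(err)
```

A result of `None` means the helper cannot tell, for example on a generic kernel or a series not listed in this bug. It does not mean the fix is absent.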
Tim Gardner (timg-tpi)
tags: added: verification-done-jammy verification-done-lunar
removed: verification-needed-jammy verification-needed-lunar
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 6.2.0-1009.9

---------------
linux-azure (6.2.0-1009.9) lunar; urgency=medium

  * lunar/linux-azure: 6.2.0-1009.9 -proposed tracker (LP: #2026476)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis

  * Azure: Fix lockup in swiotlb when used as a CVM (LP: #2026736)
    - swiotlb: remove swiotlb_max_segment
    - swiotlb: fix the deadlock in swiotlb_do_find_slots
    - swiotlb: use wrap_area_index() instead of open-coding it
    - swiotlb: fix slot alignment checks
    - swiotlb: fix a braino in the alignment check fix

  * [Azure] Fix VM crash/hang issues due to fast VF add/remove events
    (LP: #2023071) // Case [Azure] Fix VM crash/hang issues due to fast VF
    add/remove events (LP: #2023594)
    - PCI: hv: Fix a race condition bug in hv_pci_query_relations()
    - PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
    - PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
    - Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
    - PCI: hv: Add a per-bus mutex state_lock
    - PCI: hv: Use async probing to reduce boot time

  * Azure: Fix perf regression: remove rx_cqes, tx_cqes counters for MANA
    (LP: #2022940)
    - net: mana: Fix perf regression: remove rx_cqes, tx_cqes counters

  * [Azure][MANA][VLANTagging] Support for VLAN Tagging for MANA (LP: #2023695)
    - net: mana: Add support for vlan tagging

  [ Ubuntu: 6.2.0-27.28 ]

  * lunar/linux: 6.2.0-27.28 -proposed tracker (LP: #2026488)
  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] update annotations scripts
  * CVE-2023-2640 // CVE-2023-32629
    - Revert "UBUNTU: SAUCE: overlayfs: handle idmapped mounts in
      ovl_do_(set|remove)xattr"
    - Revert "UBUNTU: SAUCE: overlayfs: Skip permission checking for
      trusted.overlayfs.* xattrs"
    - SAUCE: overlayfs: default to userxattr when mounted from non initial user
      namespace
  * UNII-4 5.9G Band support request on 8852BE (LP: #2023952)
    - wifi: rtw89: 8851b: add 8851B basic chip_info
    - wifi: rtw89: introduce realtek ACPI DSM method
    - wifi: rtw89: regd: judge UNII-4 according to BIOS and chip
    - wifi: rtw89: support U-NII-4 channels on 5GHz band
  * Disable hv-kvp-daemon if /dev/vmbus/hv_kvp is not present (LP: #2024900)
    - [Packaging] disable hv-kvp-daemon if needed
  * A deadlock issue in scsi rescan task while resuming from S3 (LP: #2018566)
    - ata: libata-scsi: Avoid deadlock on rescan after device resume
  * [SRU] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU (LP: #2008745)
    - [Config] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU
  * Lunar update: v6.2.15 upstream stable release (LP: #2025067)
    - ASOC: Intel: sof_sdw: add quirk for Intel 'Rooks County' NUC M15
    - ASoC: Intel: soc-acpi: add table for Intel 'Rooks County' NUC M15
    - ASoC: soc-pcm: fix hw->formats cleared by soc_pcm_hw_init() for dpcm
    - x86/hyperv: Block root partition functionality in a Confidential VM
    - ASoC: amd: yc: Add DMI entries to support Victus by HP Laptop 16-e1xxx
      (8A22)
    - iio:...

Changed in linux-azure (Ubuntu Lunar):
status: In Progress → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.15.0-1044.51

---------------
linux-azure (5.15.0-1044.51) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1044.51 -proposed tracker (LP: #2029291)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] update variants

linux-azure (5.15.0-1043.50) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1043.50 -proposed tracker (LP: #2026495)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] resync getabis

  * kdump fails on big arm64 systems when offset is not specified (LP: #2024479)
    - arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
    - arm64: kdump: Reimplement crashkernel=X
    - docs: kdump: Update the crashkernel description for arm64
    - arm64: kdump: Do not allocate crash low memory if not needed
    - arm64/mm: Define defer_reserve_crashkernel()
    - arm64: kdump: Provide default size when crashkernel=Y, low is not specified
    - arm64: kdump: Support crashkernel=X fall back to reserve region above DMA
      zones

  * Azure: MANA: Fix doorbell access for receives (LP: #2027615)
    - SAUCE: net: mana: Batch ringing RX queue doorbell on receiving packets
    - SAUCE: net: mana: Use the correct WQE count for ringing RQ doorbell

  * [Azure][MANA][InfinitiBand] Features Support and InfiniBand for MANA
    (LP: #2024917)
    - bpf: Let bpf_warn_invalid_xdp_action() report more info
    - PCI: Move PCI_VENDOR_ID_MICROSOFT/PCI_DEVICE_ID_HYPERV_VIDEO definitions to
      pci_ids.h
    - net: mana: Assign interrupts to CPUs based on NUMA nodes
    - net: mana: Add support for auxiliary device
    - net: mana: Record the physical address for doorbell page region
    - net: mana: Handle vport sharing between devices
    - net: mana: Set the DMA device max segment size
    - net: mana: Export Work Queue functions for use by RDMA driver
    - net: mana: Record port number in netdev
    - net: mana: Move header files to a common location
    - net: mana: Define max values for SGL entries
    - net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
    - net: mana: Define data structures for allocating doorbell page from GDMA
    - net: mana: Define data structures for protection domain and memory
      registration
    - net: mana: Fix return type of mana_start_xmit()
    - RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter
    - RDMA/mana: Remove redefinition of basic u64 type
    - RDMA/mana_ib: Prevent array underflow in mana_ib_create_qp_raw()
    - net: mana: Fix accessing freed irq affinity_hint
    - [Config] azure: Enable MANA_INFINIBAND

  * [Azure] Fix VM crash/hang issues due to fast VF add/remove events
    (LP: #2023071) // Case [Azure] Fix VM crash/hang issues due to fast VF
    add/remove events (LP: #2023594)
    - PCI: hv: Fix a race condition bug in hv_pci_query_relations()
    - PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
    - PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
    - Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
    - ...

Changed in linux-azure (Ubuntu Jammy):
status: In Progress → Fix Released