Infinite systemd loop when power off the machine with multiple MD RAIDs

Bug #2036184 reported by AceLan Kao
This bug affects 4 people
Affects                  Status         Importance  Assigned to  Milestone
HWE Next                 New            Undecided   Unassigned
linux (Ubuntu)           Status tracked in Mantic
  Jammy                  Invalid        Undecided   Unassigned
  Mantic                 In Progress    Undecided   AceLan Kao
linux-oem-6.1 (Ubuntu)   Status tracked in Mantic
  Jammy                  Fix Released   Undecided   AceLan Kao
  Mantic                 Invalid        Undecided   Unassigned
linux-oem-6.5 (Ubuntu)   Status tracked in Mantic
  Jammy                  Fix Released   Undecided   AceLan Kao
  Mantic                 Invalid        Undecided   Unassigned

Bug Description

[Impact]
A system with multiple MD RAID arrays sometimes hangs while rebooting because systemd cannot stop and close the MD devices.

[Fix]
The commit linked below fixes the issue, which was introduced after v6.0 by 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed").

https://patchwork.kernel.org<email address hidden>/

[Test case]
1. Reboot the system with multiple MD RAIDs at least 10 times.
2. Make sure the system can reboot successfully every time.
3. Confirm that no error messages like the following appear:

[ 205.360738] systemd-shutdown[1]: Stopping MD devices.
[ 205.366384] systemd-shutdown[1]: sd-device-enumerator: Scan all dirs
[ 205.373327] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/bus
[ 205.380427] systemd-shutdown[1]: sd-device-enumerator: Scanning /sys/class
[ 205.388257] systemd-shutdown[1]: Stopping MD /dev/md127 (9:127).
[ 205.394880] systemd-shutdown[1]: Failed to sync MD block device /dev/md127, ignoring: Input/output error
[ 205.404975] md: md127 stopped.
[ 205.470491] systemd-shutdown[1]: Stopping MD /dev/md126 (9:126).
[ 205.770179] md: md126: resync interrupted.
[ 205.776258] md126: detected capacity change from 1900396544 to 0
[ 205.783349] md: md126 stopped.
[ 205.862258] systemd-shutdown[1]: Stopping MD /dev/md125 (9:125).
[ 205.862435] md: md126 stopped.
[ 205.868376] systemd-shutdown[1]: Failed to sync MD block device /dev/md125, ignoring: Input/output error
[ 205.872845] block device autoloading is deprecated and will be removed.
[ 205.880955] md: md125 stopped.
[ 205.934349] systemd-shutdown[1]: Stopping MD /dev/md124p2 (259:7).
[ 205.947707] systemd-shutdown[1]: Could not stop MD /dev/md124p2: Device or resource busy
[ 205.957004] systemd-shutdown[1]: Stopping MD /dev/md124p1 (259:6).
[ 205.964177] systemd-shutdown[1]: Could not stop MD /dev/md124p1: Device or resource busy
[ 205.973155] systemd-shutdown[1]: Stopping MD /dev/md124 (9:124).
[ 205.979789] systemd-shutdown[1]: Could not stop MD /dev/md124: Device or resource busy
[ 205.988475] systemd-shutdown[1]: Not all MD devices stopped, 4 left.
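
Step 3 of the test case can be partly automated. The following is a minimal sketch, not part of the original report: it scans captured shutdown output for the "Could not stop MD" and "Not all MD devices stopped" lines shown above. The function name and the choice to treat only these two messages as failures are assumptions for illustration, and capturing the log (for example from a serial console) is left to the tester.

```python
import re

# Failure patterns copied from the systemd-shutdown messages in the log above.
# Treating only these two messages as failures is an assumption of this sketch.
FAILURE_PATTERNS = [
    re.compile(r"systemd-shutdown\[1\]: Could not stop MD \S+: Device or resource busy"),
    re.compile(r"systemd-shutdown\[1\]: Not all MD devices stopped, \d+ left"),
]


def find_md_shutdown_failures(log_text):
    """Return the log lines that match one of the known failure patterns."""
    failures = []
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            failures.append(line.strip())
    return failures
```

An empty list from find_md_shutdown_failures() for each of the 10 reboots would be consistent with the fix working. It does not prove the absence of the race.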

[Where problems could occur]
The patch fixes a data race and should not introduce any regressions.

AceLan Kao (acelankao)
Changed in linux (Ubuntu Jammy):
status: New → In Progress
Changed in linux (Ubuntu Mantic):
status: New → In Progress
Changed in linux (Ubuntu Jammy):
assignee: nobody → AceLan Kao (acelankao)
Changed in linux (Ubuntu Mantic):
assignee: nobody → AceLan Kao (acelankao)
Changed in linux-oem-6.5 (Ubuntu Jammy):
status: New → In Progress
Changed in linux-oem-6.5 (Ubuntu Mantic):
status: New → Invalid
Changed in linux-oem-6.5 (Ubuntu Jammy):
assignee: nobody → AceLan Kao (acelankao)
tags: added: oem-priority originate-from-2025253 somerville
AceLan Kao (acelankao)
Changed in linux (Ubuntu Jammy):
status: In Progress → Invalid
assignee: AceLan Kao (acelankao) → nobody
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oem-6.5/6.5.0-1004.4 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-oem-6.5' to 'verification-done-jammy-linux-oem-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-oem-6.5' to 'verification-failed-jammy-linux-oem-6.5'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-oem-6.5-v2 verification-needed-jammy-linux-oem-6.5
Timo Aaltonen (tjaalton)
Changed in linux-oem-6.5 (Ubuntu Jammy):
status: In Progress → Fix Committed
AceLan Kao (acelankao)
Changed in linux-oem-6.1 (Ubuntu Jammy):
assignee: nobody → AceLan Kao (acelankao)
status: New → In Progress
Changed in linux-oem-6.1 (Ubuntu Mantic):
status: New → Invalid
Timo Aaltonen (tjaalton)
summary: - Infiniate systemd loop when power off the machine with multiple MD RAIDs
+ Infinite systemd loop when power off the machine with multiple MD RAIDs
AceLan Kao (acelankao)
tags: added: verification-done-jammy-linux-oem-6.5
removed: verification-needed-jammy-linux-oem-6.5
Timo Aaltonen (tjaalton)
Changed in linux-oem-6.1 (Ubuntu Jammy):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-6.5 - 6.5.0-1004.4

---------------
linux-oem-6.5 (6.5.0-1004.4) jammy; urgency=medium

  * jammy/linux-oem-6.5: 6.5.0-1004.4 -proposed tracker (LP: #2036238)

  * Infiniate systemd loop when power off the machine with multiple MD RAIDs
    (LP: #2036184)
    - SAUCE: md: do not _put wrong device in md_seq_next

  * dell-uart-backlight fails to communicate with the scalar IC somtimes.
    (LP: #2035299)
    - SAUCE: platform/x86: dell-uart-backlight: add small delay after write
      command

  [ Ubuntu: 6.5.0-6.6 ]

  * mantic/linux: 6.5.0-6.6 -proposed tracker (LP: #2035595)
  * Mantic update: v6.5.3 upstream stable release (LP: #2035588)
    - drm/amd/display: ensure async flips are only accepted for fast updates
    - cpufreq: intel_pstate: set stale CPU frequency to minimum
    - tpm: Enable hwrng only for Pluton on AMD CPUs
    - Input: i8042 - add quirk for TUXEDO Gemini 17 Gen1/Clevo PD70PN
    - Revert "fuse: in fuse_flush only wait if someone wants the return code"
    - Revert "f2fs: clean up w/ sbi->log_sectors_per_block"
    - Revert "PCI: tegra194: Enable support for 256 Byte payload"
    - Revert "net: macsec: preserve ingress frame ordering"
    - reiserfs: Check the return value from __getblk()
    - splice: always fsnotify_access(in), fsnotify_modify(out) on success
    - splice: fsnotify_access(fd)/fsnotify_modify(fd) in vmsplice
    - splice: fsnotify_access(in), fsnotify_modify(out) on success in tee
    - eventfd: prevent underflow for eventfd semaphores
    - fs: Fix error checking for d_hash_and_lookup()
    - iomap: Remove large folio handling in iomap_invalidate_folio()
    - tmpfs: verify {g,u}id mount options correctly
    - selftests/harness: Actually report SKIP for signal tests
    - vfs, security: Fix automount superblock LSM init problem, preventing NFS sb
      sharing
    - ARM: ptrace: Restore syscall restart tracing
    - ARM: ptrace: Restore syscall skipping for tracers
    - btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
    - erofs: release ztailpacking pclusters properly
    - locking/arch: Avoid variable shadowing in local_try_cmpxchg()
    - refscale: Fix uninitalized use of wait_queue_head_t
    - clocksource: Handle negative skews in "skew is too large" messages
    - powercap: arm_scmi: Remove recursion while parsing zones
    - OPP: Fix potential null ptr dereference in dev_pm_opp_get_required_pstate()
    - OPP: Fix passing 0 to PTR_ERR in _opp_attach_genpd()
    - selftests/resctrl: Add resctrl.h into build deps
    - selftests/resctrl: Don't leak buffer in fill_cache()
    - selftests/resctrl: Unmount resctrl FS if child fails to run benchmark
    - selftests/resctrl: Close perf value read fd on errors
    - sched/fair: remove util_est boosting
    - arm64/ptrace: Clean up error handling path in sve_set_common()
    - sched/psi: Select KERNFS as needed
    - cpuidle: teo: Update idle duration estimate when choosing shallower state
    - x86/decompressor: Don't rely on upper 32 bits of GPRs being preserved
    - arm64/fpsimd: Only provide the length to cpufeature for xCR registers
    - sched/rt: Fix sysctl_sched_rr_timeslice in...

Changed in linux-oem-6.5 (Ubuntu Jammy):
status: Fix Committed → Fix Released
Ofir Gal (ofirgalvolumez) wrote :

I've identified a bug impacting Linux kernel images 6.0, 6.1, and 6.2 on Jammy. As I'm new to the Launchpad bugs platform, I'm seeking clarification on the process. Will the fix be incorporated into the upcoming Linux kernel releases for Jammy?

Additionally, I've noticed a patch submitted to the mailing list that would fix the issue for Jammy kernels.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oem-6.1/6.1.0-1024.24 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-oem-6.1' to 'verification-done-jammy-linux-oem-6.1'. If the problem still exists, change the tag 'verification-needed-jammy-linux-oem-6.1' to 'verification-failed-jammy-linux-oem-6.1'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-oem-6.1-v2 verification-needed-jammy-linux-oem-6.1
AceLan Kao (acelankao)
tags: added: verification-done-jammy-linux-oem-6.1
removed: verification-needed-jammy-linux-oem-6.1
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-6.1 - 6.1.0-1024.24

---------------
linux-oem-6.1 (6.1.0-1024.24) jammy; urgency=medium

  * jammy/linux-oem-6.1: 6.1.0-1024.24 -proposed tracker (LP: #2038210)

  * Packaging resync (LP: #1786013)
    - [Packaging] update annotations scripts
    - [Packaging] resync getabis
    - [Packaging] update helper scripts

  * CVE-2023-42756
    - netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP

  * CVE-2023-4244
    - netfilter: nf_tables: don't skip expired elements during walk
    - netfilter: nf_tables: GC transaction API to avoid race with control plane
    - netfilter: nf_tables: adapt set backend to use GC transaction API
    - netfilter: nft_set_hash: mark set element as dead when deleting from packet
      path
    - netfilter: nf_tables: remove busy mark and gc batch API
    - netfilter: nf_tables: don't fail inserts if duplicate has expired
    - netfilter: nf_tables: fix kdoc warnings after gc rework
    - netfilter: nf_tables: fix GC transaction races with netns and netlink event
      exit path
    - netfilter: nf_tables: GC transaction race with netns dismantle
    - netfilter: nf_tables: GC transaction race with abort path
    - netfilter: nf_tables: use correct lock to protect gc_list
    - netfilter: nf_tables: defer gc run if previous batch is still pending

  * CVE-2023-42752
    - net: remove osize variable in __alloc_skb()
    - net: factorize code in kmalloc_reserve()
    - net: deal with integer overflows in kmalloc_reserve()

  * CVE-2023-42572
    - net: add SKB_HEAD_ALIGN() helper

  * CVE-2023-5197
    - netfilter: nf_tables: disallow rule removal from chain binding

  * CVE-2023-42755
    - net/sched: Retire rsvp classifier
    - [Config] remove NET_CLS_RSVP and NET_CLS_RSVP6

  * CVE-2023-4881
    - netfilter: nftables: exthdr: fix 4-byte stack OOB write

  * Fix ADL: System enabled AHCI can't get into s0ix when attached ODD
    (LP: #2037493)
    - SAUCE: ata: ahci: Add Intel Alder Lake-P AHCI controller to low power
      chipsets list

  * Fix unstable audio at low levels on Thinkpad P1G4 (LP: #2037077)
    - ALSA: hda/realtek - ALC287 I2S speaker platform support

  * Infinite systemd loop when power off the machine with multiple MD RAIDs
    (LP: #2036184)
    - SAUCE: md: do not _put wrong device in md_seq_next

  * Fix RCU warning on AMD laptops (LP: #2036377)
    - power: supply: core: Use blocking_notifier_call_chain to avoid RCU complaint

 -- Timo Aaltonen <email address hidden> Tue, 03 Oct 2023 18:13:17 +0300

Changed in linux-oem-6.1 (Ubuntu Jammy):
status: Fix Committed → Fix Released