systemd mount units fail during boot, while file system is correctly mounted

Bug #1837227 reported by Guillaume Penin
This bug affects 2 people
Affects            Status         Importance  Assigned to               Milestone
Ubuntu Pro         Status tracked in 18.04
  18.04            In Progress    High        Heitor Alves de Siqueira
systemd            New            Unknown
linux (Ubuntu)     Fix Released   Undecided   Unassigned
  Bionic           Won't Fix      Undecided   Unassigned
  Focal            Fix Released   High        Heitor Alves de Siqueira
  Jammy            Fix Released   Undecided   Unassigned
systemd (Ubuntu)   Fix Released   Undecided   Unassigned
  Bionic           Won't Fix      Undecided   Heitor Alves de Siqueira
  Focal            In Progress    High        Heitor Alves de Siqueira
  Jammy            Fix Released   Undecided   Heitor Alves de Siqueira

Bug Description

[Impact]
systemd mount units fail during boot, and the system boots into emergency mode

[Test Plan]
This issue seems to happen randomly, and doesn't seem related to a specific mount unit.

During investigation, we used a test script that reproduces similar mount failures on a running system, and saw a strong correlation between failures of the script and the boot-time mount failures.

The attached 'rep-tmpfs.sh' script should be used to validate that mount points are working correctly under stress. One can run through the different variants as below:

# ./rep-tmpfs.sh --variant-0
# ./rep-tmpfs.sh --variant-1
# ./rep-tmpfs.sh --variant-2
# ./rep-tmpfs.sh --variant-3
# ./rep-tmpfs.sh --variant-4

All of these should run successfully without any reported errors.
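
As a rough sketch, the five runs above could be wrapped in a small helper that keeps going after a failure and reports which variants broke. The function name run_all_variants is hypothetical, and the sketch assumes that rep-tmpfs.sh exits non-zero when it detects a mount error, which is not confirmed here.

```shell
# Hypothetical helper: run every variant of the reproducer in turn.
# Assumption: the reproducer exits non-zero when it detects an error.
run_all_variants() {
    local script="${1:-./rep-tmpfs.sh}"   # path to the attached reproducer
    local failed=0
    local n
    for n in 0 1 2 3 4; do
        if "$script" "--variant-$n"; then
            echo "variant-$n: ok"
        else
            echo "variant-$n: FAILED"
            failed=1
        fi
    done
    # Return 1 if any variant failed, 0 if all of them passed.
    return "$failed"
}
```

Run as root, for example with `run_all_variants ./rep-tmpfs.sh`, this prints one line per variant and returns non-zero if any of them failed.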

[Where problems could occur]
The patches change the way systemd tracks and handles mount points in general, so potential regressions could affect other mount units. We should keep an eye out for any issues with mounting file systems, as well as with rapid mount/unmount operations. Successful test runs with the reproducer script should increase confidence that no new regressions have been introduced.

[Other Info]
This has been tackled upstream with several attempts, which have resulted in the final patch from 2022:
  01400460ae16 core/mount: adjust deserialized state based on /proc/self/mountinfo

For Bionic, systemd requires several dependency patches as below:
  6a1d4d9fa6b9 core: properly reset all ExecStatus structures when entering a new unit cycle
  7eba1463dedc mount: flush out cycle state on DEAD→MOUNTED only, not the other way round
  350804867dbc mount: rescan /proc/self/mountinfo before processing waitid() results
  1d086a6e5972 mount: mark an existing "mounting" unit from /proc/self/mountinfo as "just_mounted"

Additionally, the kernel also requires the following patches:
  28ca0d6d39ab list: introduce list_for_each_continue()
  9f6c61f96f2d proc/mounts: add cursor

[Original Description]
In Ubuntu 18.04 at least, we sometimes get a random server in emergency mode with a failed mount unit (ext4 file system), while the corresponding file system is in fact correctly mounted. It happens roughly once every 1000 reboots.

It seems to be related to this bug: https://github.com/systemd/systemd/issues/10872

Is it possible to apply the fix (https://github.com/systemd/systemd/commit/350804867dbcc9b7ccabae1187d730d37e2d8a21) in Ubuntu 18.04?

Thanks in advance.

Dan Streetman (ddstreet) wrote :

Hello,

from reading the upstream bug, it's unclear if the systemd commit actually did fix this; it seems like a kernel and/or util-linux patch is needed.

From your description, it sounds like you're not able to reliably reproduce this (only randomly), right? Have you tried the upstream reproducer? It would help a lot if we had a reliable way to reproduce and verify this.
https://github.com/systemd/systemd/issues/10872#issuecomment-523399087

Changed in systemd:
status: Unknown → New
Dan Streetman (ddstreet) wrote :

Ping for reproducer, if you have one.

Changed in systemd (Ubuntu):
status: New → Incomplete
tags: added: ddstreet
Dan Streetman (ddstreet) wrote :

this appears to still be waiting for a fix upstream, so this bug will need to wait for upstream.

Changed in systemd (Ubuntu):
status: Incomplete → New
Dan Streetman (ddstreet) wrote :

Just to follow up, a quick read of the upstream bug seems to indicate there are kernel patch(es) involved in fixing this as well as systemd patch(es). The upstream bug isn't yet marked fixed, so hopefully we can check back in the new year to start backporting.

Nick Rosbrook (enr0n) wrote :

The commits referenced in the upstream bug report are present in Lunar and newer.

Changed in systemd (Ubuntu):
status: New → Fix Released
Changed in systemd (Ubuntu Focal):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in systemd (Ubuntu Jammy):
assignee: nobody → Heitor Alves de Siqueira (halves)
status: New → In Progress
tags: added: se-sponsor-halves
removed: ddstreet
summary: - Random mount units sometimes fail, while file system is correctly
- mounted
+ Mount units sometimes fail, while file system is correctly mounted
summary: - Mount units sometimes fail, while file system is correctly mounted
+ systemd mount units fail during boot, while file system is correctly
+ mounted
Heitor Alves de Siqueira (halves) wrote :

I'm marking Bionic as Won't Fix, as this is no longer under standard support. The fixes for this bug will be made available under Ubuntu Pro for 18.04.

Changed in systemd (Ubuntu Bionic):
assignee: nobody → Heitor Alves de Siqueira (halves)
status: New → Won't Fix
description: updated
Heitor Alves de Siqueira (halves) wrote :

I wasn't able to reproduce this with the test script, so Jammy is likely unaffected by this. If this changes, please leave a comment so we can investigate.

Changed in systemd (Ubuntu Jammy):
status: In Progress → Fix Released
Heitor Alves de Siqueira (halves) wrote :

I've now successfully tested this on Bionic, Focal and Jammy. Using a test script based on upstream's bug report, several systemd mount units are created that are tmpfs-backed for better performance. The basic routine of the script is as follows:

- mount the unit (or target mount point)
- sleep 1s
- verify whether the mount exists and is mounted correctly

This is done concurrently with 100 mountpoints, through 40 test loops by default. Additionally, the script covers all variants of this issue that were reported upstream:
1) install all systemd units at once, mount with 'systemctl start' all units
2) install each systemd unit when mounting, mount with 'systemctl start' each unit
3) mount directly with 'systemd-mount'
4) mount directly with 'mount'
5) install each systemd unit when mounting, mount with 'systemctl start',
   trigger multiple 'systemctl daemon-reload' throughout test loop
Each of these scenarios has an equivalent test "variant" in the script that can be invoked directly. Unless specified otherwise, only scenario 4 is tested.
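
The attached script is not reproduced here, but the mount, sleep, verify routine for a single unit might look roughly like the sketch below. The names is_mounted and check_unit are hypothetical, and treating "mounted correctly" as "listed in /proc/self/mountinfo and reported active by systemd" is an assumption about what the real script checks.

```shell
# Hypothetical helper: succeed if MOUNTPOINT is listed in a mountinfo file.
# Field 5 of each mountinfo line is the mount point (see proc(5)).
# Note: the kernel escapes spaces in paths as \040; this sketch does not
# decode such escapes.
is_mounted() {
    local mountpoint="$1"
    local mountinfo="${2:-/proc/self/mountinfo}"
    MP="$mountpoint" awk '$5 == ENVIRON["MP"] { found = 1 }
                          END { exit !found }' "$mountinfo"
}

# Hypothetical helper: mount one unit, wait, then verify it.
# Assumption: a correct mount is both visible to the kernel and
# considered active by systemd.
check_unit() {
    local unit="$1" mountpoint="$2"
    systemctl start "$unit" || return 1
    sleep 1
    is_mounted "$mountpoint" && systemctl is-active --quiet "$unit"
}
```

The failure mode described in this bug corresponds to the case where is_mounted succeeds but systemd does not consider the unit active.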

Without the systemd and kernel fixes, at least one of the above scenarios seems to fail consistently in Bionic and Focal. After installing test packages from the PPAs [0] and [1], I've been able to run the test script without any errors for multiple iterations.

[0] https://launchpad.net/~halves/+archive/ubuntu/339757-test-kernel
[1] https://launchpad.net/~halves/+archive/ubuntu/339757-test-systemd

Heitor Alves de Siqueira (halves) wrote :
description: updated
Nick Rosbrook (enr0n)
tags: added: systemd-sru-next
Heitor Alves de Siqueira (halves) wrote :
Stefan Bader (smb)
Changed in systemd (Ubuntu Focal):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.4.0-164.181 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux' to 'verification-done-focal-linux'. If the problem still exists, change the tag 'verification-needed-focal-linux' to 'verification-failed-focal-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-v2 verification-needed-focal-linux
Changed in linux (Ubuntu Bionic):
status: New → Won't Fix
Changed in linux (Ubuntu Focal):
assignee: nobody → Heitor Alves de Siqueira (halves)
importance: Undecided → Medium
status: New → Fix Committed
importance: Medium → High
Changed in systemd (Ubuntu Focal):
status: Fix Committed → In Progress
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu Jammy):
status: New → Fix Released
Heitor Alves de Siqueira (halves) wrote :

I've tested the kernel from focal-proposed, with the systemd packages from my personal PPA (as the systemd patches aren't yet available in focal-proposed).

All test variants from the rep-tmpfs.sh script ran successfully, and general smoke testing revealed no further issues.

ubuntu@z-rotomvm35:~$ uname -rv
5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023
ubuntu@z-rotomvm35:~$ dpkg -l systemd
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============-==============================-============-=================================
ii systemd 245.4-4ubuntu3.22+20230626dbg2 amd64 system and service manager

tags: added: verification-done-focal-linux
removed: verification-needed-focal-linux
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-164.181

---------------
linux (5.4.0-164.181) focal; urgency=medium

  * focal/linux: 5.4.0-164.181 -proposed tracker (LP: #2033867)

  * Please enable Renesas RZ platform serial installer (LP: #2022361)
    - [Config] enable hihope RZ/G2M serial console

  * Azure: hv_netvsc: add support for vlans in AF_PACKET mode (LP: #2030872)
    - hv_netvsc: add support for vlans in AF_PACKET mode

  * systemd mount units fail during boot, while file system is correctly mounted
    (LP: #1837227)
    - list: introduce list_for_each_continue()
    - proc/mounts: add cursor

  * CVE-2023-40283
    - Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb

  * CVE-2023-20588
    - x86/bugs: Increase the x86 bugs vector size to two u32s
    - x86/CPU/AMD: Do not leak quotient data after a division by 0
    - x86/CPU/AMD: Fix the DIV(0) initial fix attempt

  * CVE-2023-4194
    - net: tun_chr_open(): set sk_uid from current_fsuid()
    - net: tap_open(): set sk_uid from current_fsuid()

  * CVE-2023-1206
    - tcp: Reduce chance of collisions in inet6_hashfn().

  * CVE-2021-4001
    - bpf: Fix toctou on read-only map's constant scalar tracking

  * Focal update: v5.4.248 upstream stable release (LP: #2031121)
    - test_firmware: fix a memory leak with reqs buffer
    - KEYS: asymmetric: Copy sig and digest in public_key_verify_signature()
    - dasd: refactor dasd_ioctl_information
    - s390/dasd: Use correct lock while counting channel queue length
    - power: supply: ab8500: Fix external_power_changed race
    - power: supply: sc27xx: Fix external_power_changed race
    - power: supply: bq27xxx: Use mod_delayed_work() instead of cancel() +
      schedule()
    - ARM: dts: vexpress: add missing cache properties
    - power: supply: Ratelimit no data debug output
    - platform/x86: asus-wmi: Ignore WMI events with codes 0x7B, 0xC0
    - regulator: Fix error checking for debugfs_create_dir
    - irqchip/meson-gpio: Mark OF related data as maybe unused
    - power: supply: Fix logic checking if system is running from battery
    - btrfs: handle memory allocation failure in btrfs_csum_one_bio
    - parisc: Improve cache flushing for PCXL in arch_sync_dma_for_cpu()
    - parisc: Flush gatt writes and adjust gatt mask in parisc_agp_mask_memory()
    - MIPS: Alchemy: fix dbdma2
    - mips: Move initrd_start check after initrd address sanitisation.
    - xen/blkfront: Only check REQ_FUA for writes
    - drm:amd:amdgpu: Fix missing buffer object unlock in failure path
    - ocfs2: fix use-after-free when unmounting read-only filesystem
    - ocfs2: check new file size on fallocate call
    - nios2: dts: Fix tse_mac "max-frame-size" property
    - nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
    - nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
    - kexec: support purgatories with .text.hot sections
    - powerpc/purgatory: remove PGO flags
    - nouveau: fix client work fence deletion race
    - RDMA/uverbs: Restrict usage of privileged QKEYs
    - net: usb: qmi_wwan: add support for Compal RXM-G1
    - ALSA: hda/realtek: Add a quirk for Compaq N14JP6
    - ...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1118.125 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-azure' to 'verification-done-focal-linux-azure'. If the problem still exists, change the tag 'verification-needed-focal-linux-azure' to 'verification-failed-focal-linux-azure'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-azure-v2 verification-needed-focal-linux-azure
Nick Rosbrook (enr0n)
tags: added: foundations-todo