Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

Bug #2036467 reported by Krister Johansen
This bug affects 2 people
Affects              Status                     Importance   Assigned to
cloud-images         New                        Critical     Unassigned
e2fsprogs (Ubuntu)   Status tracked in Mantic
  Trusty             Won't Fix                  Critical     Matthew Ruffell
  Xenial             Won't Fix                  Critical     Matthew Ruffell
  Bionic             Won't Fix                  Critical     Matthew Ruffell
  Focal              In Progress                Critical     Matthew Ruffell
  Jammy              In Progress                Critical     Matthew Ruffell
  Lunar              In Progress                Critical     Matthew Ruffell
  Mantic             In Progress                Critical     Matthew Ruffell

Bug Description

[Impact]

This is a long-running bug plaguing cloud-images: on rare occasions resize2fs fails, and the image is not resized to fill the entire disk.

Online resizes would fail due to a superblock checksum mismatch, where the superblock in memory differs from what is currently on disk due to changes made to the image.

$ resize2fs /dev/nvme1n1p1
resize2fs 1.47.0 (5-Feb-2023)
resize2fs: Superblock checksum does not match superblock while trying to open /dev/nvme1n1p1
Couldn't find valid filesystem superblock.

Changing the read of the superblock to Direct I/O solves the issue.
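
To illustrate the difference between the two read paths, here is a minimal shell sketch that hashes the primary ext4 superblock (1024 bytes at byte offset 1024) either through the page cache or with Direct I/O. It is not the resize2fs patch itself. The function name sb_digest is hypothetical, and the sketch assumes GNU coreutils (dd with iflag=direct, sha256sum).

```shell
# Print the SHA-256 of the 1024-byte primary superblock of an ext4 device.
# mode "direct" bypasses the page cache (O_DIRECT); "buffered" reads through it.
# sb_digest is a hypothetical helper name, not part of e2fsprogs.
sb_digest() {
    local dev="$1" mode="${2:-buffered}"
    # Direct I/O needs aligned reads, so fetch one whole 4 KiB block
    # and slice bytes 1024..2047 out of it afterwards.
    if [ "$mode" = direct ]; then
        dd if="$dev" bs=4096 count=1 iflag=direct 2>/dev/null
    else
        dd if="$dev" bs=4096 count=1 2>/dev/null
    fi | tail -c +1025 | head -c 1024 | sha256sum | cut -d' ' -f1
}
```

On a mounted filesystem under load, comparing `sb_digest /dev/nvme1n1p1 buffered` with `sb_digest /dev/nvme1n1p1 direct` may occasionally show different digests. Whether a difference is visible at any given moment depends on timing, so treat this as a diagnostic aid only.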

[Testcase]

Start a c5.large instance on AWS, and attach a 60 GB gp3 volume for use as a scratch disk.

Run the following script, courtesy of Krister Johansen and his team:

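   #!/usr/bin/bash
   # Abort on the first failing command, so the loop stops when resize2fs fails.
   set -euxo pipefail

   while true
   do
           # Create a roughly 1 GiB partition and a fresh ext4 filesystem on it.
           parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
           sleep .5
           mkfs.ext4 /dev/nvme1n1p1
           mount -t ext4 /dev/nvme1n1p1 /mnt
           # Generate filesystem activity in the background during the resize.
           stress-ng --temp-path /mnt -D 4 &
           STRESS_PID=$!
           sleep 1
           # Grow the partition, then resize the mounted filesystem online.
           growpart /dev/nvme1n1 1
           resize2fs /dev/nvme1n1p1
           kill $STRESS_PID
           wait $STRESS_PID
           umount /mnt
           # Wipe signatures so the next iteration starts from a clean disk.
           wipefs -a /dev/nvme1n1p1
           wipefs -a /dev/nvme1n1
   done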

Test packages are available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

If you install the test packages, the race no longer occurs.

[Where problems could occur]

We are changing how resize2fs reads the superblock from underlying disks.

If a regression were to occur, resize2fs could fail to resize offline or online volumes. As all cloud-images are online resized during their initial boot, a regression could have a large impact on public and private clouds.
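
One way to watch for such a regression is to check, after a resize, that the filesystem actually covers the partition. The sketch below is an assumption about how that check could be done, not part of the proposed packages. The helper name fs_fills_partition is hypothetical, and the default tolerance of one filesystem block is a guess that may need loosening for some geometries.

```shell
# Succeed if a filesystem of fs_blocks * block_size bytes fills a partition of
# part_bytes bytes, allowing a shortfall smaller than "slack" bytes
# (default: one filesystem block). Hypothetical helper, for illustration only.
fs_fills_partition() {
    local fs_blocks="$1" block_size="$2" part_bytes="$3"
    local slack="${4:-$2}"
    local fs_bytes=$((fs_blocks * block_size))
    # The filesystem must not exceed the partition ...
    [ "$fs_bytes" -le "$part_bytes" ] || return 1
    # ... and must not leave more than the allowed slack unused.
    [ $((part_bytes - fs_bytes)) -lt "$slack" ]
}
```

The inputs can be taken from the "Block count:" and "Block size:" lines of `dumpe2fs -h /dev/nvme1n1p1` and from `blockdev --getsize64 /dev/nvme1n1p1`.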

[Other info]

Upstream mailing list discussion:
https://<email address hidden>/
https://<email address hidden>/

This was fixed in the below commit upstream:

commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
Author: Theodore Ts'o <email address hidden>
Date: Thu, 15 Jun 2023 00:17:01 -0400
Subject: resize2fs: use Direct I/O when reading the superblock for
 online resizes
Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

The commit is not yet part of any tagged release. All supported Ubuntu releases require this fix, and it needs to be published in the standard non-ESM archives to be picked up in cloud images.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in e2fsprogs (Ubuntu):
status: New → Confirmed
Revision history for this message
Ye Lu (luye98) wrote :

Hi, we were seeing similar issues when bootstrapping AWS EC2 hosts in our service. We applied the upstream patch internally, and it did resolve the filesystem resize errors we had been encountering. It would be helpful to backport the patch to Ubuntu and make it generally available for the focal and jammy releases.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

@foundations & EDM can we have this backported all the way to trusty? Xenial for sure.

tags: added: rls-mm-incoming
information type: Public → Public Security
Changed in cloud-images:
importance: Undecided → Critical
Revision history for this message
Julian Andres Klode (juliank) wrote :

@Krister If you are interested in driving the process to get the patch landed, you can follow the procedure at

https://packaging.ubuntu.com/html/fixing-a-bug.html

And

https://wiki.ubuntu.com/StableReleaseUpdates

To prepare updates for all releases. Feel free to ask for help on IRC.

If not, no worries, we'll get to it; I tagged it foundations-todo for the team. But if you want to gain experience in packaging, this is a good place to start!

tags: added: foundations-todo
removed: rls-mm-incoming
Revision history for this message
Robby Pocase (rpocase) wrote :

@Krister additionally, can you clarify "occasionally"? It's clearly frequent enough to prioritize upstreaming a fix. Knowing the frequency could aid in prioritizing this internally.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

@rpocase please check internally; I believe we have multiple Azure and AWS customers affected, as per previous Salesforce escalations.

Revision history for this message
Krister Johansen (kmjohansen) wrote :

Thanks for all the responses. I'm not sure how quickly I'll be able to get to this either, so I'm hesitant to commit to fixing myself. That said, if I can get time to send patches before your team gets to fixing it, I'll do my best.

To answer the question about how frequently we see this: it was about 4-5 times a day until I applied the patches to our forked version of e2fsprogs.

A few other things to note about what's going on here. In 1.45.7, e2fsprogs added some additional retries to the checksum validation path on open:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=6338a8467564c3a0a12e9fcb08bdd748d736ac2f

I picked up this patch as well, and found that it helped a bit, but I was still able to reproduce the problem with the reproducer that I shared.
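
For systems that cannot take a patched e2fsprogs yet, the same retry idea can be approximated from the outside by re-running the command. The following is only a workaround sketch under that assumption: the helper name retry is hypothetical, and like the in-library retries it can narrow the race window without closing it.

```shell
# Run a command up to "attempts" times, sleeping "delay" seconds between tries.
# Returns 0 on the first success, 1 if every attempt fails.
# Hypothetical helper; it narrows the race window but does not remove it.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry 5 1 resize2fs /dev/nvme1n1p1` would try the resize up to five times, one second apart.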

My team is running on the linux-aws-5.15 HWE kernel that's from jammy but shipped to focal. There's a kernel fix that may help with this problem too, and it has been present since 5.10. That said, I haven't tested this on systems that are running <= 5.4. (We don't have very many of these anymore.)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=05c2c00f3769abb9e323fcaca70d2de0b48af7ba

Commit 05c2c00f3769 ("ext4: protect superblock modifications with a buffer lock") may help to ensure that the superblock contents on disk are always consistent prior to the DIO read, since the direct I/O path writes out any dirty cached sb pages before issuing the read.
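
When triaging a host, a quick version comparison shows whether the running kernel is at least 5.10, the version mentioned above. This is a rough sketch: the helper name kernel_at_least is hypothetical, it assumes GNU sort with -V, and it cannot tell whether a distribution kernel older than 5.10 carries the fix as a backport.

```shell
# Succeed if kernel release "have" (default: the running kernel) is at least
# version "want". Hypothetical helper; relies on GNU sort -V for ordering.
kernel_at_least() {
    local want="$1" have="${2:-$(uname -r)}"
    # Drop the distribution suffix, e.g. "5.15.0-1045-aws" -> "5.15.0".
    have="${have%%-*}"
    # "want" sorts first (or ties) exactly when have >= want.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}
```

For example, `kernel_at_least 5.10 || echo "kernel predates 5.10; buffer-lock fix may be missing"` prints a warning on older kernels.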

Additionally, there's another known issue with consecutive online resize attempts:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=a408f33e895e455f16cf964cb5cd4979b658db7b

We've gotten the fix for this in linux-aws-5.15 from Ubuntu, but it may be germane for testing on older releases.

Changed in e2fsprogs (Ubuntu Mantic):
status: Confirmed → In Progress
Changed in e2fsprogs (Ubuntu Lunar):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Jammy):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Focal):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Bionic):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Xenial):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Trusty):
status: New → In Progress
Changed in e2fsprogs (Ubuntu Mantic):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Lunar):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Jammy):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Focal):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Bionic):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Xenial):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Trusty):
importance: Undecided → Critical
Changed in e2fsprogs (Ubuntu Mantic):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Lunar):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Jammy):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Focal):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Bionic):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Xenial):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in e2fsprogs (Ubuntu Trusty):
assignee: nobody → Matthew Ruffell (mruffell)
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on mantic which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on lunar which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on jammy which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on focal which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on bionic which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on xenial which fixes this issue.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for e2fsprogs on trusty which fixes this issue.

summary: - superblock checksum mismatch in resize2fs
+ Resizing cloud-images occasionally fails due to superblock checksum
+ mismatch in resize2fs
description: updated
tags: added: sts
Revision history for this message
Julian Andres Klode (juliank) wrote :

Trusty and xenial receive bug updates via Pro and no longer via the main archive. You'll have to get SEG to add bug tasks for Pro and prepare +esm updates with them.

Changed in e2fsprogs (Ubuntu Trusty):
status: In Progress → Won't Fix
Changed in e2fsprogs (Ubuntu Xenial):
status: In Progress → Won't Fix
Revision history for this message
Julian Andres Klode (juliank) wrote :

@mruffell did you mean to get sponsoring for the patches? If so, you might want to subscribe ~ubuntu-sponsors so this can be merged by the patch pilots.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

@juliank I'm just doing a little bit more testing for the moment, as I really want to make sure this isn't going to cause any issues in the cloud images. It would be nice to have this bug fixed though, I have seen a few cases related to it over the years.

I'll ask my SEG colleagues for help with sponsoring in a day or two.

description: updated
Changed in e2fsprogs (Ubuntu Bionic):
status: In Progress → Won't Fix
This report contains Public Security information  
Everyone can see this security related information.
