Thanks for all the responses. I'm not sure how quickly I'll be able to get to this either, so I'm hesitant to commit to fixing it myself. That said, if I can find time to send patches before your team gets to it, I'll do my best.
To answer the question about how frequently we see this: it was about 4-5 times a day until I applied the patches to our forked version of e2fsprogs.
A few other things to note about what's going on here. In 1.45.7, e2fsprogs added some additional retries to the checksum validation path on open:
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=6338a8467564c3a0a12e9fcb08bdd748d736ac2f
I picked up this patch as well, and found that it helped a bit, but I was still able to reproduce the problem with the reproducer that I shared.
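To illustrate why bounded retries can help without fully closing the race, here is a minimal sketch of a retry-on-bad-checksum read. It is not the e2fsprogs implementation: the function names, the simulated "flaky" reader, and the use of CRC32 from zlib as a stand-in for the real superblock checksum are all assumptions made for illustration.

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Append a CRC32 of the payload (stand-in for the real superblock checksum)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def checksum_ok(block: bytes) -> bool:
    """Return True if the trailing 4-byte checksum matches the payload."""
    payload, stored = block[:-4], struct.unpack("<I", block[-4:])[0]
    return zlib.crc32(payload) == stored


def read_with_retries(read_block, max_retries=3):
    """Re-read the block while its checksum fails, up to max_retries extra attempts.

    Returns (block, number_of_failed_attempts_before_success).
    """
    for attempt in range(max_retries + 1):
        block = read_block()
        if checksum_ok(block):
            return block, attempt
    raise IOError("checksum still invalid after %d retries" % max_retries)


def make_flaky_reader(good: bytes, torn_reads: int):
    """Hypothetical reader that returns a torn block for the first torn_reads calls.

    A torn block here means new payload bytes paired with a stale checksum,
    which is roughly what a reader racing with a writer could observe.
    """
    state = {"calls": 0}

    def read_block():
        state["calls"] += 1
        if state["calls"] <= torn_reads:
            return b"X" + good[1:]  # payload changed, checksum not yet updated
        return good

    return read_block
```

In this sketch, a reader that sees two torn blocks succeeds on the third read, but one that keeps racing with the writer for longer than the retry budget still fails. That would be consistent with the patch helping somewhat while the reproducer can still trigger the problem.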
My team is running on the linux-aws-5.15 HWE kernel that's from jammy but shipped to focal. There's a kernel fix that may help with this problem too, and it has been present since 5.10. That said, I haven't tested this on systems that are running <= 5.4. (We don't have very many of these anymore.)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=05c2c00f3769abb9e323fcaca70d2de0b48af7ba
Commit 05c2c00f3769 ("ext4: protect superblock modifications with a buffer lock") may help ensure that the superblock contents on disk are always consistent before the DIO read, since the direct I/O path writes out any dirty cached sb pages before issuing the read.
Additionally, there's another known issue with consecutive online resize attempts:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=a408f33e895e455f16cf964cb5cd4979b658db7b
We've gotten the fix for this in linux-aws-5.15 from Ubuntu, but it may be germane for testing on older releases.