xfs_growfs: /dev/mapper/vg-lv_srv is not a mounted XFS filesystem
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
tripleo | Fix Released | Critical | Unassigned | |
Bug Description
Description
===========
Integration jobs are failing on xfs_growfs: the device is reported as not being a mounted XFS filesystem.
```
"cmd": ""/usr/
"stderr": "[ERROR] Running command failed: cmd \"xfs_growfs /dev/mapper/
```
While provisioning the overcloud, some logs also show UNREACHABLE[4][5].
Logs
====
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
[6] https:/
[7] https:/
Changed in tripleo: | |
status: | Confirmed → Triaged |
Amol Kahat (amolkahat) wrote : | #1 |
Marios Andreou (marios-b) wrote : | #2 |
not sure if this is a transient issue yet. we have seen this in master/wallaby runs from today at [1][2] however the -internal jobs we have on our hardware are passing i.e. periodic-
[1] https:/
[2] https:/
Rabi Mishra (rabi) wrote : | #3 |
Probably related to recent dib change(?) https:/
Harald Jensås (harald-jensas) wrote : | #4 |
Observation: there is a message on the node console log[1], "XFS (dm-10): Unmounting Filesystem". I believe that is the /srv volume, based on this earlier in the log:
[ 5.510097] XFS (dm-10): Mounting V5 Filesystem
Mounting /srv...
In this run overcloud-
overcloud-
[1] https:/
[2] https:/
FAILING NODE
------------
overcloud-
[ 75.058401] dm-2: detected capacity change from 10330112 to 165380096
[ 75.060856] device-mapper: thin: 253:2: growing the data device from 80704 to 1292032 blocks
[ 75.065373] dm-3: detected capacity change from 10330112 to 165380096
[ 75.172751] dm-4: detected capacity change from 7585792 to 23207936
[ 75.268501] dm-5: detected capacity change from 491520 to 2441216
[ 75.407233] dm-7: detected capacity change from 491520 to 20021248
[ 75.428521] systemd-
[ 75.515659] dm-8: detected capacity change from 393216 to 4292608
[ 75.607987] dm-9: detected capacity change from 491520 to 2441216
[ 75.738410] dm-6: detected capacity change from 1949696 to 79470592
[ 75.843440] dm-10: detected capacity change from 98304 to 34676736
[ 75.858760] XFS (dm-10): Unmounting Filesystem
------------
WORKING NODE
------------
overcloud-
[ 107.814654] dm-2: detected capacity change from 10330112 to 165380096
[ 107.815693] device-mapper: thin: 253:2: growing the data device from 80704 to 1292032 blocks
[ 107.835429] dm-3: detected capacity change from 10330112 to 165380096
[ 108.036161] dm-4: detected capacity change from 7585792 to 23207936
[ 108.169220] dm-5: detected capacity change from 491520 to 2441216
[ 108.375316] dm-7: detected capacity change from 491520 to 20021248
[ 108.406866] systemd-
[ 108.710300] dm-8: detected capacity change from 393216 to 4292608
[ 109.147461] dm-9: detected capacity change from 491520 to 2441216
[ 109.448583] dm-6: detected capacity change from 1949696 to 79470592
[ 109.559833] dm-10: detected capacity change from 98304 to 34676736
[ 257.648175] SELinux: Converting 479 SID table entries...
[ 257.655331] SELinux: policy capability network_
[ 257.655707] SELinux: policy capability open_perms=1
[ 257.655978] SELinux: policy capability extended_so...
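The difference between the two console logs can be checked mechanically. The following is a minimal sketch, assuming kernel log lines in the format quoted above; the function name `unmounts_after_resize` and the one-second window are illustrative choices, not part of any TripleO tool:

```python
import re

# Patterns for the two kernel messages of interest in the console log.
CAPACITY_RE = re.compile(r"\[\s*([\d.]+)\]\s+(dm-\d+): detected capacity change")
UNMOUNT_RE = re.compile(r"\[\s*([\d.]+)\]\s+XFS \((dm-\d+)\): Unmounting Filesystem")


def unmounts_after_resize(log_text, window=1.0):
    """Return (device, delay_seconds) for XFS unmounts that follow a
    capacity change on the same device within `window` seconds."""
    last_resize = {}
    hits = []
    for line in log_text.splitlines():
        match = CAPACITY_RE.search(line)
        if match:
            last_resize[match.group(2)] = float(match.group(1))
            continue
        match = UNMOUNT_RE.search(line)
        if match:
            timestamp, device = float(match.group(1)), match.group(2)
            if device in last_resize and timestamp - last_resize[device] <= window:
                hits.append((device, round(timestamp - last_resize[device], 6)))
    return hits


FAILING = """\
[ 75.738410] dm-6: detected capacity change from 1949696 to 79470592
[ 75.843440] dm-10: detected capacity change from 98304 to 34676736
[ 75.858760] XFS (dm-10): Unmounting Filesystem
"""

WORKING = """\
[ 109.448583] dm-6: detected capacity change from 1949696 to 79470592
[ 109.559833] dm-10: detected capacity change from 98304 to 34676736
[ 257.648175] SELinux: Converting 479 SID table entries...
"""

print(unmounts_after_resize(FAILING))
print(unmounts_after_resize(WORKING))
```

On the failing excerpt this flags dm-10, which was unmounted about 15 ms after its capacity change; the working excerpt produces no hits.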
Ronelle Landy (rlandy) wrote : | #5 |
Testing revert:
amolkahat proposed openstack/
Steve Baker (steve-stevebaker) wrote : | #6 |
I can't reproduce this on a 2TB disk baremetal with a rhel9.0 17.1 image, now trying on PSI with a centos9 image.
Changed in tripleo: | |
assignee: | nobody → Steve Baker (steve-stevebaker) |
Steve Baker (steve-stevebaker) wrote : | #7 |
In the gate there seems to be a high percentage of nodes which are not contactable, so growvols doesn't get run. Whatever is causing this would be unrelated to the growvols change
Steve Baker (steve-stevebaker) wrote : | #8 |
This log[1] on change[2] shows that the revert was successfully applied but the /srv mount issue remains:
xfs_growfs: /dev/mapper/
So I don't think this change is the root cause of this bug, I'll continue investigating.
[1] https:/
[2] https:/
Steve Baker (steve-stevebaker) wrote : | #9 |
Was there something else which changed in the image building yesterday, like diskimage-builder being unpinned?
chandan kumar (chkumar246) wrote : | #10 |
We are seeing a similar issue in CS9 zed OVB jobs
[1.] https:/
Amol Kahat (amolkahat) wrote : | #11 |
List of packages which are changed
Passed job
==========
device-
device-
device-
device-
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
fuse-
kernel-
kernel-
kernel-
kernel-
kernel-
kernel-
kernel-
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
lvm2-
lvm2-
selinux-
selinux-
Failed job
==========
device-
device-
device-
device-
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
86_64
libblockdev
libblockdev
fuse-
kernel-
kernel-
kernel-
kernel-
kernel-
kernel-
kernel-
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
libblockdev
lvm2-
lvm2-
selinux-
selinux-
Amol Kahat (amolkahat) wrote : | #12 |
Checking job run with package exclude: https:/
chandan kumar (chkumar246) wrote : | #13 |
@steve, we checked all the repos related to image build, nothing got changed. I checked rdoinfo repo. Dib is not pinned https:/
Steve Baker (steve-stevebaker) wrote : | #14 |
One thing that has occurred to me is that when /srv doesn't mount then growvols will fail, and when any other volume doesn't mount then the boot won't complete at all. That would explain the high proportion of nodes that are not contactable; it's actually the same issue.
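To make the failure mode concrete: xfs_growfs only works on a mounted filesystem, so a pre-check against the mount table would distinguish "volume unmounted" from other errors. This is a minimal sketch, assuming text in the `/proc/mounts` format (device, mount point, filesystem type, ...); `is_mounted_xfs` is a hypothetical helper and is not taken from growvols:

```python
def is_mounted_xfs(mount_point, mounts_text):
    """Return True if `mount_point` appears in the mount table as an XFS mount."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point and fields[2] == "xfs":
            return True
    return False


# Illustrative mount table in which /srv is missing.
SAMPLE_MOUNTS = """\
/dev/mapper/vg-lv_root / xfs rw,relatime 0 0
/dev/mapper/vg-lv_var /var xfs rw,relatime 0 0
/dev/vda3 /boot ext4 rw,relatime 0 0
"""

for path in ("/", "/var", "/srv", "/boot"):
    print(path, is_mounted_xfs(path, SAMPLE_MOUNTS))
```

On a live node the text would come from reading `/proc/mounts`; here a sample string is used so the sketch runs anywhere.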
Changed in tripleo: | |
assignee: | Steve Baker (steve-stevebaker) → nobody |
chandan kumar (chkumar246) wrote : | #15 |
on this patch https:/
It might fix the issue.
Sandeep Yadav (sandeepyadav93) wrote : | #16 |
Hello All,
Adding some observations:-
* The issue is transient; some jobs are passing.
* Even in the affected jobs, some of the overcloud nodes provision successfully; only ~1/2 out of 4 nodes fail to provision.
* Affected jobs show two different symptoms, but all of them fail on the same task, while running growvols.
A) Job fails with Unreachable issue
https:/
~~~
UNREACHABLE | Running /usr/local/
~~~
B) Job fails with Failure -
"FATAL | Running /usr/local/
* Rerunning node provisioning on the same nodes passes on the 2nd attempt (tried manually).
Sandeep Yadav (sandeepyadav93) wrote : | #17 |
Centos packages rpm diff: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master) | #18 |
Related fix proposed to branch: master
Review: https:/
Sandeep Yadav (sandeepyadav93) wrote : | #19 |
Hello All,
We tried pinning kernel to 5.14.0-
Affected kernel version: 5.14.0-214*, node provisioning passed for multiple jobs with older kernel, results at [2].
Observation:
Vexx mirror and mirror.
Because of a misconfiguration in our jobs, we are leaking content from mirror.
1) Component line job: centos.repo is not disabled, which results in the latest kernel leaking in.
https:/
~~~
2022-12-27 00:54:07.471 | kernel x86_64 5.14.0-214.el9 baseos 2.8 M
~~~
Fix: https:/
2) Integration line job:-
Overcloud nodes are wrongly using mirror.
baseurl=http://
The older kernel works, but I wonder if we are hitting this issue because we are mixing content (and this is not a kernel bug). Instead of excluding the kernel, maybe we should fix the repo to use the correct local mirror.
I will continue checking the baseurl for integration job tomorrow.
[1] https:/
[2] https:/
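Checking which repositories are enabled, and which host each one points at, is one way to spot this kind of content leak. The following is a rough sketch using only the Python standard library; the repo ids and hostnames in the sample are made up for illustration, and real `.repo` files may use `mirrorlist`/`metalink` or multi-line `baseurl` values that this does not handle:

```python
import configparser
from urllib.parse import urlparse


def enabled_repo_hosts(repo_text):
    """Map each enabled repo id to the hostname of its baseurl (or None)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    hosts = {}
    for section in parser.sections():
        # A repo is treated as enabled when the key is absent.
        if parser.get(section, "enabled", fallback="1").strip() != "1":
            continue
        baseurl = parser.get(section, "baseurl", fallback="")
        hosts[section] = urlparse(baseurl).hostname
    return hosts


# Hypothetical repo file: one local mirror, one upstream repo left enabled.
SAMPLE_REPO = """\
[quickstart-baseos]
name=local mirror
baseurl=http://local-mirror.example.org/centos/9-stream/BaseOS/x86_64/os/
enabled=1

[baseos]
name=upstream
baseurl=http://upstream-mirror.example.org/9-stream/BaseOS/x86_64/os/
enabled=1

[appstream]
name=disabled upstream
baseurl=http://upstream-mirror.example.org/9-stream/AppStream/x86_64/os/
enabled=0
"""

for repo_id, host in enabled_repo_hosts(SAMPLE_REPO).items():
    print(repo_id, host)
```

Any enabled repo whose host is not the expected local mirror is a candidate source of mismatched packages.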
chandan kumar (chkumar246) wrote : | #20 |
Thank you Sandeep for putting the detailed investigation report. I picked the log https:/
[1]. https:/
```
+ '[' -e /etc/ci/
2022-12-20 20:17:41 | + source /etc/ci/
2022-12-20 20:17:41 | ++ export NODEPOOL_
2022-12-20 20:17:41 | ++ NODEPOOL_
```
[2]. https:/
```
+ '[' -e /etc/ci/
2022-12-20 20:29:55 | + export NODEPOOL_
2022-12-20 20:29:55 | + NODEPOOL_
```
[3]. https:/
```
+ '[' -e /etc/ci/
2022-12-20 20:24:11 | + export NODEPOOL_
2022-12-20 20:24:11 | + NODEPOOL_
```
Based on above results, on overcloud node there is no /etc/ci/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master) | #21 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 6d250182278bf68
Author: Sandeep Yadav <email address hidden>
Date: Tue Dec 27 17:16:54 2022 +0530
Pass centos.repo from host during image build.
Image build in component jobs is pulling content from
centos.repo instead of quickstart repos, see [0]
This can cause mismatch of rpm when mirrors.centos.org and
local mirrors are not in sync.
    The base image which we use to build overcloud images already has
    centos.repo and when proxy mirrors are not updated this can cause
    an issue.
This is a workaround patch to pass centos.repo(which are disabled
on host) so that same repos in the image will be overridden.
One change in behavior is at the end of image build, dib will delete
the centos.repos in the overcloud image as dib cleans up what it adds
ignoring what was already present.
oooci-
[0] https:/
~~~
2022-12-27 00:54:07.471 | kernel x86_64 5.14.0-214.el9 baseos 2.8 M
~~~
[1] https:/
Related-Bug: #2000226
Change-Id: Iecf36eff8ef27f
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master) | #22 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master) | #23 |
Related fix proposed to branch: master
Review: https:/
chandan kumar (chkumar246) wrote : | #24 |
https:/
https:/
chandan kumar (chkumar246) wrote : | #25 |
Currently on pinning kernel https:/
As the node is already using the latest kernel.
https:/
```
CentOS Stream 9
Kernel 5.14.0-
Activate the web console with: systemctl enable --now cockpit.socket
```
and in fs01 component job where we build the images.
https:/
```
install-packages -u
2022-12-30 12:32:05.597 | Last metadata expiration check: 0:00:10 ago on Fri Dec 30 07:31:55 2022.
2022-12-30 12:32:05.706 | Error:
2022-12-30 12:32:05.706 | Problem: package kernel-
2022-12-30 12:32:05.706 | - cannot install the best update candidate for package kernel-
2022-12-30 12:32:05.706 | - package kernel-
2022-12-30 12:32:05.706 | (try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
```
so this is also not working.
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master) | #26 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 85fcfd4eb226337
Author: Chandan Kumar <email address hidden>
Date: Thu Dec 29 12:29:34 2022 +0530
Add modify_
In order to run specific command on the image, we need
to add support for --run-command to the modify image role.
It will be useful to run specific command instead of using
script.
This functionality will be consumed here:
https:/
Related-Bug: #2000226
Signed-off-by: Chandan Kumar <email address hidden>
Change-Id: I8575c712467db8
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master) | #27 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 5b5f1cf307751cc
Author: Chandan Kumar <email address hidden>
Date: Thu Dec 29 15:49:30 2022 +0530
Upload /etc/ci/
In order to use local mirrors on overcloud nodes,
we need to copy /etc/ci/
that we can pull the correct packages from afs mirror
    otherwise we will pull it from centos mirror which results in
    random issues, which are hard to debug.
    This patch copies the same using modify-image role.
https:/
makes sure before copying existing directory exists on the image.
The added task will work for qcow2 image and mirror_info.sh needs to
be present on undercloud.
Related-Bug: #2000226
Depends-On: https:/
Signed-off-by: Chandan Kumar <email address hidden>
Change-Id: Ia3b4634b551e57
Amol Kahat (amolkahat) wrote : | #28 |
Tried downgrading the kernel[1], but this also did not help resolve the issue[2][3].
[1] https:/
[2] https:/
[3] https:/
Marios Andreou (marios-b) wrote : | #29 |
per comment #28 above... the version of kernel installed in the controller node in that test is as follows
kernel.x86_64 5.14.0-210.el9 @quickstart-
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master) | #30 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #31 |
Related fix proposed to branch: master
Review: https:/
Steve Baker (steve-stevebaker) wrote : | #32 |
I have a reproducer with a custom image and an ansible playbook deployed to PSI.
Downgrading lvm2, device-mapper was enough to get 100% success after 20 attempts.
I'm still collecting data, but it looks like about 80% success rate with latest lvm2, device-mapper. Package versions are coupled between these two packages, but I think raising a bug against lvm2 would be best.
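As a rough sanity check on these numbers: if the per-attempt success rate with the latest packages really is about 80%, then 20 consecutive successes would be unlikely to happen by chance. A small sketch, assuming the attempts are independent (an assumption that has not been verified here):

```python
# Probability that all attempts pass if each passes independently
# with the observed ~80% success rate.
p_success = 0.80
attempts = 20

p_all_pass = p_success ** attempts
print(f"{p_all_pass:.4f}")  # 0.0115
```

So a 20-out-of-20 run with the downgraded packages would occur only about 1.2% of the time if the downgrade made no difference, although a sample this small is suggestive rather than conclusive.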
This change will log growvols[1] activity to the systemd journal, which shows exactly what command triggers the unmount when the journal is viewed (lvextend of /srv)
[1] https:/
Here is the growvols portion of the journal for a failed run:
Jan 03 21:28:03 lp2000226-1 python3[2427]: ansible-
Jan 03 21:28:03 lp2000226-1 growvols[2428]: [INFO] Finding all block devices
Jan 03 21:28:03 lp2000226-1 growvols[2428]: [INFO] Running: lsblk -Po kname,pkname,
Jan 03 21:28:03 lp2000226-1 growvols[2428]: [DEBUG] Result: KNAME="vda" PKNAME="" NAME="vda" LABEL="" TYPE="disk" FSTYPE="" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda1" PKNAME="vda" NAME="vda1" LABEL="MKFS_ESP" TYPE="part" FSTYPE="vfat" MOUNTPOINT=
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda2" PKNAME="vda" NAME="vda2" LABEL="" TYPE="part" FSTYPE="" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda3" PKNAME="vda" NAME="vda3" LABEL="mkfs_boot" TYPE="part" FSTYPE="ext4" MOUNTPOINT="/boot"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda4" PKNAME="vda" NAME="vda4" LABEL="" TYPE="part" FSTYPE=
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-0" PKNAME="vda4" NAME="vg-
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-2" PKNAME="dm-0" NAME="vg-
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-3" PKNAME="dm-2" NAME="vg-
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-4" PKNAME="dm-2" NAME="vg-lv_root" LABEL="img-rootfs" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-5" PKNAME="dm-2" NAME="vg-lv_tmp" LABEL="fs_tmp" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/tmp"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-6" PKNAME="dm-2" NAME="vg-lv_var" LABEL="fs_var" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/var"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-7" PKNAME="dm-2" NAME="vg-lv_log" LABEL="fs_log" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT=
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-8" PKNAME="dm-2" NAME="vg-lv_audit" LABEL="fs_audit" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT=
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-9" PKNAME="dm-2" NAME="v...
Sandeep Yadav (sandeepyadav93) wrote : | #33 |
Hello Steve,
We tried pinning to older lvm2/device-mapper, but it didn't help.
Please see the logs below:-
Steve Baker (steve-stevebaker) wrote : | #34 |
Since systemd-252, calling lvextend on /home or /srv will sometimes
(~20%) cause the volume to be unmounted, here is the logging sequence:
growvols[2428]: [INFO] Running: lvextend --size +17200840704B /dev/mapper/
dmeventd[775]: No longer monitoring thin pool vg-lv_thinpool-
kernel: dm-10: detected capacity change from 98304 to 33693696
systemd[1]: Stopped target Local File Systems.
systemd[1]: Unmounting /srv...
kernel: XFS (dm-10): Unmounting Filesystem
dmeventd[775]: Monitoring thin pool vg-lv_thinpool-
systemd[1]: srv.mount: Deactivated successfully.
systemd[1]: Unmounted /srv.
systemd[1]: systemd-
systemd[1]: Stopped File System Check on /dev/disk/
growvols[2428]: [DEBUG] Result: Size of logical volume vg/lv_srv changed from 48.00 MiB (12 extents) to <16.07 GiB (4113 extents).
growvols[2428]: Logical volume vg/lv_srv successfully resized.
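The numbers in this sequence are internally consistent, which confirms that the lvextend itself did exactly what was asked and that the unmount is a side effect. A short check, using only values quoted in the log above and assuming the kernel reports capacity in 512-byte sectors:

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
SECTOR = 512  # bytes per sector in the kernel "capacity change" message

# 48.00 MiB across 12 extents gives the extent size.
extent_size = 48 * MIB // 12

# Bytes requested by "lvextend --size +17200840704B".
grow_bytes = 17200840704
added_extents = grow_bytes // extent_size
new_extents = 12 + added_extents
new_bytes = new_extents * extent_size

print(extent_size // MIB)        # extent size in MiB
print(added_extents, new_extents)
print(12 * extent_size // SECTOR, new_bytes // SECTOR)
print(new_bytes / GIB)
```

This yields a 4 MiB extent size, 4101 added extents for a total of 4113, a sector count going from 98304 to 33693696, and a final size just under 16.07 GiB, all matching the logged values.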
The event "Stopped target Local File Systems." should only happen
when dmeventd notifies that the thin pool is near capacity as a safety
measure, which is clearly not the case:
$ lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lv_audit vg Vwi-aotz-- <2.05g lv_thinpool 0.62
lv_home vg Vwi-aotz-- 1.16g lv_thinpool 0.88
lv_log vg Vwi-aotz-- <9.55g lv_thinpool 0.31
lv_root vg Vwi-aotz-- <11.07g lv_thinpool 18.97
lv_srv vg Vwi-aotz-- <16.07g lv_thinpool 0.31
lv_thinpool vg twi-aotz-- 77.92g 3.36 1.68
lv_tmp vg Vwi-aotz-- 1.16g lv_thinpool 0.88
lv_var vg Vwi-aotz-- <37.43g lv_thinpool 1.10
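The "clearly not the case" claim can be read straight off the thin pool row. Below is a minimal sketch that extracts the pool usage from `lvs` output in the default column layout shown above; it relies on the Pool and Origin columns being empty for the pool row and on the Attr string of a thin pool starting with `t`, so it is an illustration for this output rather than a general parser:

```python
def thin_pool_usage(lvs_output):
    """Return {pool_name: (data_percent, meta_percent)} for thin pool rows."""
    pools = {}
    for line in lvs_output.splitlines():
        fields = line.split()
        # Thin pool rows: LV, VG, Attr (starts with 't'), LSize, Data%, Meta%
        if len(fields) >= 6 and fields[2].startswith("t"):
            pools[fields[0]] = (float(fields[4]), float(fields[5]))
    return pools


LVS_OUTPUT = """\
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lv_root vg Vwi-aotz-- <11.07g lv_thinpool 18.97
lv_srv vg Vwi-aotz-- <16.07g lv_thinpool 0.31
lv_thinpool vg twi-aotz-- 77.92g 3.36 1.68
lv_var vg Vwi-aotz-- <37.43g lv_thinpool 1.10
"""

print(thin_pool_usage(LVS_OUTPUT))
```

With data usage at 3.36% and metadata at 1.68%, the pool is nowhere near a level at which a near-capacity safety action would be expected.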
I speculate that the root cause is that systemd-252 is interpreting
messages from dmeventd differently since systemd-250 and either the
message or the interpretation is incorrect.
I've proposed [1] in an attempt to prevent dmeventd from being called at all during lvextend, I'll leave it testing overnight.
[1] https:/
Amol Kahat (amolkahat) wrote : | #35 |
Excluding systemd-252 resolved the issue[1]. We can see jobs passing in testproject[2][3]. Testing more jobs to make sure the systemd package is where the issue lies.
[1] https:/
[2] https:/
[3] https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master) | #36 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 5fee91651e08099
Author: Sandeep Yadav <email address hidden>
Date: Tue Dec 27 07:29:39 2022 +0530
Exclude latest systemd*-252-* packages
    With the latest systemd we are hitting bug[1], which is breaking
    the ovb node provisioning. It is blocking the promotion.
    Let's exclude the latest systemd*-252-* till we have a proper
    fix.
[1] https:/
Related-Bug: #2000226
Signed-off-by: Sandeep Yadav <email address hidden>
Co-Authored-by: Amol Kahat <email address hidden>
Change-Id: Ia6837781272ae3
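To illustrate which package names a glob such as `systemd*-252-*` would cover, here is a small sketch using Python's `fnmatch`. The package strings are made-up examples in name-version-release form, and the package manager's own matching rules may differ in detail, so treat this only as a way to reason about the pattern:

```python
from fnmatch import fnmatchcase

PATTERN = "systemd*-252-*"

# Hypothetical name-version-release strings for illustration.
candidates = [
    "systemd-252-2.el9",
    "systemd-libs-252-2.el9",
    "systemd-udev-252-2.el9",
    "systemd-250-12.el9",
    "lvm2-2.03.17-3.el9",
]

excluded = [pkg for pkg in candidates if fnmatchcase(pkg, PATTERN)]
print(excluded)
```

The pattern catches the main package and its subpackages at version 252 while leaving systemd 250 and unrelated packages installable.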
Ronelle Landy (rlandy) wrote : | #37 |
Bug created to systemd: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (master) | #38 |
Fix proposed to branch: master
Review: https:/
Changed in tripleo: | |
status: | Triaged → In Progress |
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (master) | #39 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 2ce67c3dbb6a7f8
Author: Steve Baker <email address hidden>
Date: Thu Feb 2 13:10:02 2023 +1300
Install modified udev rule to fix lvextend unmount
This fix has been proposed to lvm2 upstream[1]. This change can be
reverted once the fix is packaged. It is proposed here to unblock
upstream and downstream delivery pipelines.
[1] https:/
Change-Id: If187a2b1ec61ec
Closes-Bug: #2000226
Related: rhbz#2158628
Changed in tripleo: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (stable/zed) | #40 |
Fix proposed to branch: stable/zed
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (stable/wallaby) | #41 |
Fix proposed to branch: stable/wallaby
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (stable/wallaby) | #42 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/wallaby
commit 40dd9ee8e58b1ef
Author: Steve Baker <email address hidden>
Date: Thu Feb 2 13:10:02 2023 +1300
Install modified udev rule to fix lvextend unmount
This fix has been proposed to lvm2 upstream[1]. This change can be
reverted once the fix is packaged. It is proposed here to unblock
upstream and downstream delivery pipelines.
[1] https:/
Change-Id: If187a2b1ec61ec
Closes-Bug: #2000226
Related: rhbz#2158628
(cherry picked from commit 2ce67c3dbb6a7f8
tags: | added: in-stable-wallaby |
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-image-elements (stable/zed) | #43 |
Change abandoned by "dasm <email address hidden>" on branch: stable/zed
Review: https:/
Reason: Abandoning since deprecation of stable/zed is in progress: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (stable/zed) | #44 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/zed
commit c8173d557ee0772
Author: Steve Baker <email address hidden>
Date: Thu Feb 2 13:10:02 2023 +1300
Install modified udev rule to fix lvextend unmount
This fix has been proposed to lvm2 upstream[1]. This change can be
reverted once the fix is packaged. It is proposed here to unblock
upstream and downstream delivery pipelines.
[1] https:/
Change-Id: If187a2b1ec61ec
Closes-Bug: #2000226
Related: rhbz#2158628
(cherry picked from commit 2ce67c3dbb6a7f8
tags: | added: in-stable-zed |
This is hitting master[1], wallaby[2] and zed[3]. It started at midnight on 2022-12-20.
[1] https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-main&skip=0
[2] https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-stable1&skip=0
[3] https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-zed-centos9