xfs_growfs: /dev/mapper/vg-lv_srv is not a mounted XFS filesystem

Bug #2000226 reported by Amol Kahat
This bug affects 1 person
Affects  Status        Importance  Assigned to  Milestone
tripleo  Fix Released  Critical    Unassigned

Bug Description

Description
===========

Integration jobs are failing in growvols: xfs_growfs reports that the device is not a mounted XFS filesystem.

```
"cmd": "/usr/local/sbin/growvols --yes /=8GB /tmp=1GB /var/log=10GB /var/log/audit=2GB /home=1GB /var=50% /srv=50%"

"stderr": "[ERROR] Running command failed: cmd \"xfs_growfs /dev/mapper/vg-lv_srv\", stdout \"\", stderr \"xfs_growfs: /dev/mapper/vg-lv_srv is not a mounted XFS filesystem

```
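
xfs_growfs only works on a mounted filesystem, so the error above means /srv was not mounted at the moment growvols tried to grow it. A minimal sketch of how that precondition could be checked, assuming mount state is read from /proc/mounts-style text; the helper name `mounted_fstype` and the sample data are hypothetical and are not part of growvols:

```python
def mounted_fstype(device, mounts_text):
    """Return the filesystem type `device` is mounted with, or None if it is not mounted.

    `mounts_text` is expected in /proc/mounts format:
    <device> <mountpoint> <fstype> <options> <dump> <pass>
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == device:
            return fields[2]
    return None


# Hypothetical sample; on a real node this would come from open("/proc/mounts").read().
SAMPLE_MOUNTS = """\
/dev/mapper/vg-lv_root / xfs rw,relatime 0 0
/dev/mapper/vg-lv_var /var xfs rw,relatime 0 0
"""

for dev in ("/dev/mapper/vg-lv_var", "/dev/mapper/vg-lv_srv"):
    fstype = mounted_fstype(dev, SAMPLE_MOUNTS)
    if fstype == "xfs":
        print(f"{dev}: mounted, safe to run xfs_growfs")
    else:
        print(f"{dev}: not a mounted XFS filesystem")
```

In the sample, vg-lv_srv is absent from the mount table, which is the state the failing nodes were in when xfs_growfs ran.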

While provisioning the overcloud, some logs also show UNREACHABLE [4][5].

Logs
====
[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/377ff7b/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset064-master/6e501f8/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-master/8e50a0d/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

[4] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-master/0473b56/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[5] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-master/e48496c/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

[6] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset035-wallaby/c2da018/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[7] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp_1supp-featureset039-wallaby/c6b3f6b/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

Ronelle Landy (rlandy)
Changed in tripleo:
status: Confirmed → Triaged
Amol Kahat (amolkahat) wrote :
Marios Andreou (marios-b) wrote :

Not sure yet whether this is a transient issue. We have seen this in master/wallaby runs from today at [1][2]; however, the -internal jobs we run on our own hardware are passing, i.e. periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-internal-master is green.

[1] https://review.rdoproject.org/zuul/buildset/f92877447177403e8c687222264a6c8a
[2] https://review.rdoproject.org/zuul/buildset/d1afcc1ea56742feb41bef26915d0f0e

Rabi Mishra (rabi) wrote :
Harald Jensås (harald-jensas) wrote :

Observation: there is a message in the node console log [1], "XFS (dm-10): Unmounting Filesystem". I believe that is the /srv volume, based on this earlier in the log:
[ 5.510097] XFS (dm-10): Mounting V5 Filesystem
         Mounting /srv...

In this run overcloud-controller-1 and overcloud-controller-2 both failed growing /srv.
overcloud-controller-0 in the same job succeeded, and in the console log [2] for this node there is no "Unmounting Filesystem" log entry.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/377ff7b/logs/baremetal_2-console.log
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/377ff7b/logs/baremetal_3-console.log

FAILING NODE
------------
overcloud-controller-2 login: [ 75.015471] dm-1: detected capacity change from 10330112 to 165380096
[ 75.058401] dm-2: detected capacity change from 10330112 to 165380096
[ 75.060856] device-mapper: thin: 253:2: growing the data device from 80704 to 1292032 blocks
[ 75.065373] dm-3: detected capacity change from 10330112 to 165380096
[ 75.172751] dm-4: detected capacity change from 7585792 to 23207936
[ 75.268501] dm-5: detected capacity change from 491520 to 2441216
[ 75.407233] dm-7: detected capacity change from 491520 to 20021248
[ 75.428521] systemd-journald[774]: Received client request to relinquish /var/log/journal/93dec626c4b543199a464e80b4dcc4ab access.
[ 75.515659] dm-8: detected capacity change from 393216 to 4292608
[ 75.607987] dm-9: detected capacity change from 491520 to 2441216
[ 75.738410] dm-6: detected capacity change from 1949696 to 79470592
[ 75.843440] dm-10: detected capacity change from 98304 to 34676736
[ 75.858760] XFS (dm-10): Unmounting Filesystem
------------

WORKING NODE
------------
overcloud-controller-0 login: [ 107.742932] dm-1: detected capacity change from 10330112 to 165380096
[ 107.814654] dm-2: detected capacity change from 10330112 to 165380096
[ 107.815693] device-mapper: thin: 253:2: growing the data device from 80704 to 1292032 blocks
[ 107.835429] dm-3: detected capacity change from 10330112 to 165380096
[ 108.036161] dm-4: detected capacity change from 7585792 to 23207936
[ 108.169220] dm-5: detected capacity change from 491520 to 2441216
[ 108.375316] dm-7: detected capacity change from 491520 to 20021248
[ 108.406866] systemd-journald[773]: Received client request to relinquish /var/log/journal/93dec626c4b543199a464e80b4dcc4ab access.
[ 108.710300] dm-8: detected capacity change from 393216 to 4292608
[ 109.147461] dm-9: detected capacity change from 491520 to 2441216
[ 109.448583] dm-6: detected capacity change from 1949696 to 79470592
[ 109.559833] dm-10: detected capacity change from 98304 to 34676736
[ 257.648175] SELinux: Converting 479 SID table entries...
[ 257.655331] SELinux: policy capability network_peer_controls=1
[ 257.655707] SELinux: policy capability open_perms=1
[ 257.655978] SELinux: policy capability extended_so...

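
The difference between the two console logs can be spotted mechanically. A rough sketch that flags any dm device whose XFS filesystem is unmounted shortly after a capacity change; the function name and the 1-second window are assumptions chosen for illustration, and the sample lines are taken from the excerpts above:

```python
import re

CAPACITY_RE = re.compile(
    r"\[\s*([\d.]+)\] (dm-\d+): detected capacity change from (\d+) to (\d+)"
)
UNMOUNT_RE = re.compile(r"\[\s*([\d.]+)\] XFS \((dm-\d+)\): Unmounting Filesystem")


def unmounted_after_resize(log_text, window=1.0):
    """Return dm devices unmounted within `window` seconds of a capacity change."""
    resized_at = {}  # device name -> timestamp of its last capacity change
    hits = []
    for line in log_text.splitlines():
        match = CAPACITY_RE.search(line)
        if match:
            resized_at[match.group(2)] = float(match.group(1))
            continue
        match = UNMOUNT_RE.search(line)
        if match:
            timestamp, device = float(match.group(1)), match.group(2)
            if device in resized_at and 0 <= timestamp - resized_at[device] <= window:
                hits.append(device)
    return hits


FAILING = """\
[ 75.843440] dm-10: detected capacity change from 98304 to 34676736
[ 75.858760] XFS (dm-10): Unmounting Filesystem
"""
WORKING = """\
[ 109.559833] dm-10: detected capacity change from 98304 to 34676736
[ 257.648175] SELinux: Converting 479 SID table entries...
"""

print("failing node:", unmounted_after_resize(FAILING))
print("working node:", unmounted_after_resize(WORKING))
```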

Ronelle Landy (rlandy) wrote :

Testing revert:

 amolkahat proposed openstack/diskimage-builder master: Revert "Grow thin pool metadata by 1GiB" https://review.opendev.org/c/openstack/diskimage-builder/+/868281

Steve Baker (steve-stevebaker) wrote :

I can't reproduce this on a 2TB disk baremetal with a rhel9.0 17.1 image, now trying on PSI with a centos9 image.

Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
Steve Baker (steve-stevebaker) wrote :

In the gate a high percentage of nodes seem not to be contactable, so growvols doesn't get run. Whatever is causing this would be unrelated to the growvols change.

Steve Baker (steve-stevebaker) wrote :

This log [1] on change [2] shows that the revert was successfully applied but the /srv mount issue remains:
xfs_growfs: /dev/mapper/vg-lv_srv is not a mounted XFS filesystem

So I don't think this change is the root cause of this bug; I'll continue investigating.

[1] https://logserver.rdoproject.org/62/45162/21/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-tripleo-master/634f742/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[2] https://review.rdoproject.org/r/c/testproject/+/45162

Steve Baker (steve-stevebaker) wrote :

Was there something else which changed in the image building yesterday, like diskimage-builder being unpinned?

chandan kumar (chkumar246) wrote :
Amol Kahat (amolkahat) wrote :

List of packages that changed between a passed and a failed job:
    Passed job
    ==========
    device-mapper-1.02.185-3.el9.x86_64
    device-mapper-event-1.02.185-3.el9.x86_64
    device-mapper-event-libs-1.02.185-3.el9.x86_64
    device-mapper-libs-1.02.185-3.el9.x86_64

    libblockdev-2.28-2.el9.x86_64
    libblockdev-crypto-2.28-2.el9.x86_64
    libblockdev-fs-2.28-2.el9.x86_64
    libblockdev-loop-2.28-2.el9.x86_64
    libblockdev-mdraid-2.28-2.el9.x86_64
    libblockdev-part-2.28-2.el9.x86_64
    libblockdev-swap-2.28-2.el9.x86_64
    libblockdev-utils-2.28-2.el9.x86_64

    fuse-overlayfs-1.9-1.el9.x86_64

    kernel-5.14.0-210.el9.x86_64
    kernel-core-5.14.0-205.el9.x86_64
    kernel-core-5.14.0-210.el9.x86_64
    kernel-modules-5.14.0-205.el9.x86_64
    kernel-modules-5.14.0-210.el9.x86_64
    kernel-tools-5.14.0-210.el9.x86_64
    kernel-tools-libs-5.14.0-210.el9.x86_64

    libblockdev-2.28-2.el9.x86_64
    libblockdev-crypto-2.28-2.el9.x86_64
    libblockdev-fs-2.28-2.el9.x86_64
    libblockdev-loop-2.28-2.el9.x86_64
    libblockdev-mdraid-2.28-2.el9.x86_64
    libblockdev-part-2.28-2.el9.x86_64
    libblockdev-swap-2.28-2.el9.x86_64
    libblockdev-utils-2.28-2.el9.x86_64

    lvm2-2.03.16-3.el9.x86_64
    lvm2-libs-2.03.16-3.el9.x86_64

    selinux-policy-38.1.2-1.el9.noarch
    selinux-policy-targeted-38.1.2-1.el9.noarch

    Failed job
    ==========
    device-mapper-1.02.187-3.el9.x86_64
    device-mapper-event-1.02.187-3.el9.x86_64
    device-mapper-event-libs-1.02.187-3.el9.x86_64
    device-mapper-libs-1.02.187-3.el9.x86_64

    libblockdev-2.28-3.el9.x86_64
    libblockdev-crypto-2.28-3.el9.x86_64
    libblockdev-fs-2.28-3.el9.x86_64
    libblockdev-loop-2.28-3.el9.x86_64
    libblockdev-mdraid-2.28-3.el9.x86_64
    libblockdev-part-2.28-3.el9.x86_64
    libblockdev-swap-2.28-3.el9.x86_64
    libblockdev-utils-2.28-3.el9.x86_64

    fuse-overlayfs-1.10-1.el9.x86_64

    kernel-5.14.0-214.el9.x86_64
    kernel-core-5.14.0-205.el9.x86_64
    kernel-core-5.14.0-214.el9.x86_64
    kernel-modules-5.14.0-205.el9.x86_64
    kernel-modules-5.14.0-214.el9.x86_64
    kernel-tools-5.14.0-214.el9.x86_64
    kernel-tools-libs-5.14.0-214.el9.x86_64

    libblockdev-2.28-3.el9.x86_64
    libblockdev-crypto-2.28-3.el9.x86_64
    libblockdev-fs-2.28-3.el9.x86_64
    libblockdev-loop-2.28-3.el9.x86_64
    libblockdev-mdraid-2.28-3.el9.x86_64
    libblockdev-part-2.28-3.el9.x86_64
    libblockdev-swap-2.28-3.el9.x86_64
    libblockdev-utils-2.28-3.el9.x86_64

    lvm2-2.03.17-3.el9.x86_64
    lvm2-libs-2.03.17-3.el9.x86_64

    selinux-policy-38.1.3-1.el9.noarch
    selinux-policy-targeted-38.1.3-1.el9.noarch
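
A comparison like the one above can be generated rather than eyeballed. A minimal sketch, assuming each entry has the form name-version-release.arch and that the version starts at the first hyphen followed by a digit (a heuristic, not an rpm-aware parser); the sample entries are a subset of the lists above:

```python
import re

# Heuristic: the package name ends at the first hyphen that is followed by a digit.
NVR_RE = re.compile(r"^(?P<name>.+?)-(?P<version>\d.*)$")


def parse(lines):
    """Map package name -> set of version strings (a name may be installed twice)."""
    packages = {}
    for line in lines:
        match = NVR_RE.match(line.strip())
        if match:
            packages.setdefault(match.group("name"), set()).add(match.group("version"))
    return packages


def changed(passed_lines, failed_lines):
    """Return {name: (passed_versions, failed_versions)} for packages that differ."""
    passed, failed = parse(passed_lines), parse(failed_lines)
    return {
        name: (sorted(passed.get(name, ())), sorted(failed.get(name, ())))
        for name in sorted(set(passed) | set(failed))
        if passed.get(name) != failed.get(name)
    }


PASSED = [
    "device-mapper-1.02.185-3.el9.x86_64",
    "kernel-core-5.14.0-205.el9.x86_64",
    "kernel-core-5.14.0-210.el9.x86_64",
    "lvm2-2.03.16-3.el9.x86_64",
]
FAILED = [
    "device-mapper-1.02.187-3.el9.x86_64",
    "kernel-core-5.14.0-205.el9.x86_64",
    "kernel-core-5.14.0-214.el9.x86_64",
    "lvm2-2.03.17-3.el9.x86_64",
]

for name, (before, after) in changed(PASSED, FAILED).items():
    print(f"{name}: {before} -> {after}")
```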

Amol Kahat (amolkahat) wrote :
chandan kumar (chkumar246) wrote :

@steve, we checked all the repos related to the image build and nothing changed. I checked the rdoinfo repo; dib is not pinned: https://github.com/redhat-openstack/rdoinfo/blame/master/tags/antelope-uc.yml#L143 . That has not changed in the last 2 months.

Steve Baker (steve-stevebaker) wrote :

One thing that has occurred to me is that when /srv doesn't mount then growvols will fail, and when any other volume doesn't mount then the boot won't complete at all. That would explain the high proportion of nodes that are not contactable: it's actually the same issue.

Changed in tripleo:
assignee: Steve Baker (steve-stevebaker) → nobody
chandan kumar (chkumar246) wrote :
Sandeep Yadav (sandeepyadav93) wrote :

Hello All,

Adding some observations:-

* The issue is transient; some jobs are passing.
* Even in the affected jobs, some overcloud nodes provision successfully; only about 1 or 2 out of 4 nodes fail to provision.

* Affected jobs show two different symptoms, but all of them fail on the same task, while running growvols.

A) Job fails with Unreachable issue

https://logserver.rdoproject.org/37/28537/62/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-zed/0f4cbc3/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
~~~
UNREACHABLE | Running /usr/local/sbin/growvols /=8GB /tmp=1GB /var/log=10GB /var/log/audit=2GB /home=1GB /var=100% | overcloud-novacompute-0
~~~

B) Job fails with Failure -

ex. https://logserver.rdoproject.org/37/28537/63/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-zed/7dca9fd/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

"FATAL | Running /usr/local/sbin/growvols /=8GB /tmp=1GB /var/log=10GB /var/log/audit=2GB /home=1GB /var=50% /srv=50% "

* Rerunning node provisioning on the same nodes passes on the 2nd attempt (tried manually).

Sandeep Yadav (sandeepyadav93) wrote :

Centos packages rpm diff: https://www.diffchecker.com/9YppsLRG/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
Sandeep Yadav (sandeepyadav93) wrote :

Hello All,

We tried pinning the kernel to 5.14.0-210.el9.x86_64 (patch [1]) and it helped.
Affected kernel version: 5.14.0-214*. Node provisioning passed for multiple jobs with the older kernel; results are at [2].

Observation:

The Vexx mirror and mirror.stream.centos.org are currently not in sync and offer different kernel versions.
Because our jobs are misconfigured, content is leaking in from mirror.stream.centos.org and we are getting the latest kernel.

1) Component line job: centos.repo is not disabled, which results in the latest kernel leaking in.

https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/1cd7430/logs/undercloud/home/zuul/build.log.txt.gz
~~~
2022-12-27 00:54:07.471 | kernel x86_64 5.14.0-214.el9 baseos 2.8 M
~~~

Fix: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/868677

2) Integration line job:-

Overcloud nodes are wrongly using mirror.stream.centos.org instead of the local Vexx mirror and are pulling the latest kernel during the modify-image role.

https://logserver.rdoproject.org/37/28537/63/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-zed/7dca9fd/logs/overcloud-controller-0/etc/yum.repos.d/quickstart-centos-appstreams.repo.txt.gz

baseurl=http://mirror.stream.centos.org/9-stream/AppStream/x86_64/os/

The older kernel works, but I wonder whether we are hitting this issue because we are mixing content (and this is not a kernel bug). Instead of excluding the kernel, maybe we should fix the repo to use the correct local mirror.

I will continue checking the baseurl for integration job tomorrow.

[1] https://review.opendev.org/c/openstack/tripleo-quickstart/+/868605/6/config/release/tripleo-ci/CentOS-9/master.yml
[2] https://review.rdoproject.org/r/c/testproject/+/28537/69#message-ab5d371a81952bbc0f65beffc2fd40afa1bd6cc2

chandan kumar (chkumar246) wrote :

Thank you Sandeep for the detailed investigation report. I picked the log https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/377ff7b/logs/undercloud/home/zuul/ from the bug description and found 3 repo setup files used on the overcloud and undercloud nodes to populate the repos.
[1]. https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/377ff7b/logs/undercloud/home/zuul/repo_setup.log.txt.gz
```
 + '[' -e /etc/ci/mirror_info.sh ']'
2022-12-20 20:17:41 | + source /etc/ci/mirror_info.sh
2022-12-20 20:17:41 | ++ export NODEPOOL_MIRROR_HOST=mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org
2022-12-20 20:17:41 | ++ NODEPOOL_MIRROR_HOST=mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org
```
[2]. https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/377ff7b/logs/undercloud/home/zuul/repo_setup.sh.1671586194.log.txt.gz

```
+ '[' -e /etc/ci/mirror_info.sh ']'
2022-12-20 20:29:55 | + export NODEPOOL_CENTOS_MIRROR=http://mirror.stream.centos.org
2022-12-20 20:29:55 | + NODEPOOL_CENTOS_MIRROR=http://mirror.stream.centos.org
```

[3]. https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/377ff7b/logs/undercloud/home/zuul/repo_setup.sh.1671585849.log.txt.gz
```
+ '[' -e /etc/ci/mirror_info.sh ']'
2022-12-20 20:24:11 | + export NODEPOOL_CENTOS_MIRROR=http://mirror.stream.centos.org
2022-12-20 20:24:11 | + NODEPOOL_CENTOS_MIRROR=http://mirror.stream.centos.org
```

Based on the above results, there is no /etc/ci/mirror_info.sh file on the overcloud nodes, so the default mirror (mirror.stream.centos.org) is used. That is why we are seeing the above issue.
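
Which mirror a node's repo files actually point at can be checked with a short script. A minimal sketch, assuming yum .repo files parse as INI; the section name and the name/enabled keys in the sample are hypothetical, while the baseurl and the expected mirror host are the ones quoted in the logs above:

```python
import configparser
from urllib.parse import urlparse


def repo_hosts(repo_text):
    """Map each repo section to the hostname of its (first) baseurl."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    hosts = {}
    for section in parser.sections():
        baseurl = parser.get(section, "baseurl", fallback=None)
        if baseurl:
            # baseurl may list several URLs; only the first one is inspected here.
            hosts[section] = urlparse(baseurl.split()[0]).hostname
    return hosts


# Section name, "name" and "enabled" are illustrative; baseurl is from the job logs.
SAMPLE_REPO = """\
[quickstart-centos-appstreams]
name=quickstart-centos-appstreams
baseurl=http://mirror.stream.centos.org/9-stream/AppStream/x86_64/os/
enabled=1
"""

EXPECTED_HOST = "mirror.regionone.vexxhost-nodepool-tripleo.rdoproject.org"

leaking = {
    section: host
    for section, host in repo_hosts(SAMPLE_REPO).items()
    if host != EXPECTED_HOST
}
print(leaking)
```

Any section reported here would be pulling packages from somewhere other than the local mirror.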

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/868677
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/6d250182278bf686d5120f5c5863d1ee24bccb41
Submitter: "Zuul (22348)"
Branch: master

commit 6d250182278bf686d5120f5c5863d1ee24bccb41
Author: Sandeep Yadav <email address hidden>
Date: Tue Dec 27 17:16:54 2022 +0530

    Pass centos.repo from host during image build.

    Image build in component jobs is pulling content from
    centos.repo instead of quickstart repos, see [0]

    This can cause mismatch of rpm when mirrors.centos.org and
    local mirrors are not in sync.

    The base image which we use to build overcloud images already have
    centos.repo and when proxy mirros are not updated this can cause
    an issue.

    This is a workaround patch to pass centos.repo(which are disabled
    on host) so that same repos in the image will be overridden.

    One change in behavior is at the end of image build, dib will delete
    the centos.repos in the overcloud image as dib cleans up what it adds
    ignoring what was already present.

    oooci-build-images already have same fix.

    [0] https://logserver.rdoproject.org/openstack-component-baremetal/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-wallaby/1cd7430/logs/undercloud/home/zuul/build.log.txt.gz
    ~~~
    2022-12-27 00:54:07.471 | kernel x86_64 5.14.0-214.el9 baseos 2.8 M
    ~~~

    [1] https://github.com/openstack/tripleo-ci/commit/b2ce4b4c101d68eaa71d1499609423e10f50f2d5

    Related-Bug: #2000226
    Change-Id: Iecf36eff8ef27fd2734f77b09180d0b4d8654c52

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart/+/868864

chandan kumar (chkumar246) wrote :
chandan kumar (chkumar246) wrote :

Currently, the kernel pin in https://review.opendev.org/c/openstack/tripleo-quickstart/+/868605 (kernel-core-5.14.0-214*) is not helping,
because the node is already using the latest kernel:
https://logserver.rdoproject.org/03/46503/6/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/7db5da2/logs/baremetal_6_33617_0-console.log
```
CentOS Stream 9
Kernel 5.14.0-214.el9.x86_64 on an x86_64

Activate the web console with: systemctl enable --now cockpit.socket
```

And in the fs01 component job, where we build the images:
https://logserver.rdoproject.org/03/46503/6/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-baremetal-master/d255aa2/logs/undercloud/home/zuul/build.log.txt.gz
```
 install-packages -u
2022-12-30 12:32:05.597 | Last metadata expiration check: 0:00:10 ago on Fri Dec 30 07:31:55 2022.
2022-12-30 12:32:05.706 | Error:
2022-12-30 12:32:05.706 | Problem: package kernel-5.14.0-214.el9.x86_64 requires kernel-modules-uname-r = 5.14.0-214.el9.x86_64, but none of the providers can be installed
2022-12-30 12:32:05.706 | - cannot install the best update candidate for package kernel-5.14.0-80.el9.x86_64
2022-12-30 12:32:05.706 | - package kernel-modules-5.14.0-214.el9.x86_64 is filtered out by exclude filtering
2022-12-30 12:32:05.706 | (try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
```
So this is also not working.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/868861
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/85fcfd4eb226337d84f6a38aa5feaab6ca3372f5
Submitter: "Zuul (22348)"
Branch: master

commit 85fcfd4eb226337d84f6a38aa5feaab6ca3372f5
Author: Chandan Kumar <email address hidden>
Date: Thu Dec 29 12:29:34 2022 +0530

    Add modify_image_run_command var to modify-image role

    In order to run specific command on the image, we need
    to add support for --run-command to the modify image role.
    It will be useful to run specific command instead of using
    script.

    This functionality will be consumed here:
    https://review.opendev.org/c/openstack/tripleo-quickstart/+/868864

    Related-Bug: #2000226

    Signed-off-by: Chandan Kumar <email address hidden>
    Change-Id: I8575c712467db8fea403459d26731dd7070df131

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart/+/868864
Committed: https://opendev.org/openstack/tripleo-quickstart/commit/5b5f1cf307751cc440f0341914b1a54a2f617dd5
Submitter: "Zuul (22348)"
Branch: master

commit 5b5f1cf307751cc440f0341914b1a54a2f617dd5
Author: Chandan Kumar <email address hidden>
Date: Thu Dec 29 15:49:30 2022 +0530

    Upload /etc/ci/mirror_info.sh to overcloud images

    In order to use local mirrors on overcloud nodes,
    we need to copy /etc/ci/mirror_info.sh script there so
    that we can pull the correct packages from afs mirror
    otherwise we will pull it from centos mirror which results into
    random issues, which are hard to debug.

    This patch copys the same using modify-image role.
    https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/868861
    makes sure before copying existing directory exists on the image.

    The added task will work for qcow2 image and mirror_info.sh needs to
    be present on undercloud.

    Related-Bug: #2000226

    Depends-On: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/868861

    Signed-off-by: Chandan Kumar <email address hidden>
    Change-Id: Ia3b4634b551e5791b382ce62cfefd39167dfd27e

Amol Kahat (amolkahat) wrote :
Marios Andreou (marios-b) wrote :

Per comment #28 above, the kernel version installed on the controller node in that test is as follows:

kernel.x86_64 5.14.0-210.el9 @quickstart-centos-base

https://logserver.rdoproject.org/37/28537/71/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/93fc6da/logs/overcloud-controller-0/var/log/extra/package-list-installed.txt.gz

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/869130

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/869131

Steve Baker (steve-stevebaker) wrote :

I have a reproducer with a custom image and an ansible playbook deployed to PSI.

Downgrading lvm2 and device-mapper was enough to get 100% success over 20 attempts.

I'm still collecting data, but it looks like about an 80% success rate with the latest lvm2 and device-mapper. The versions of these two packages are coupled, but I think raising a bug against lvm2 would be best.

This change [1] logs growvols activity to the systemd journal; viewing the journal then shows exactly which command triggers the unmount (lvextend of /srv).

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/869130

Here is the growvols portion of the journal for a failed run:

Jan 03 21:28:03 lp2000226-1 python3[2427]: ansible-ansible.legacy.command Invoked with _raw_params=systemd-cat --identifier=growvols /usr/bin/growvols --verbose --debug --yes /=8GB /tmp=1GB /var/log=10GB /var/log/audit=2GB /home=1GB /var=50% /srv=50% _uses_shell=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jan 03 21:28:03 lp2000226-1 growvols[2428]: [INFO] Finding all block devices
Jan 03 21:28:03 lp2000226-1 growvols[2428]: [INFO] Running: lsblk -Po kname,pkname,name,label,type,fstype,mountpoint
Jan 03 21:28:03 lp2000226-1 growvols[2428]: [DEBUG] Result: KNAME="vda" PKNAME="" NAME="vda" LABEL="" TYPE="disk" FSTYPE="" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda1" PKNAME="vda" NAME="vda1" LABEL="MKFS_ESP" TYPE="part" FSTYPE="vfat" MOUNTPOINT="/boot/efi"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda2" PKNAME="vda" NAME="vda2" LABEL="" TYPE="part" FSTYPE="" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda3" PKNAME="vda" NAME="vda3" LABEL="mkfs_boot" TYPE="part" FSTYPE="ext4" MOUNTPOINT="/boot"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="vda4" PKNAME="vda" NAME="vda4" LABEL="" TYPE="part" FSTYPE="LVM2_member" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-0" PKNAME="vda4" NAME="vg-lv_thinpool_tmeta" LABEL="" TYPE="lvm" FSTYPE="" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-2" PKNAME="dm-0" NAME="vg-lv_thinpool-tpool" LABEL="" TYPE="lvm" FSTYPE="" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-3" PKNAME="dm-2" NAME="vg-lv_thinpool" LABEL="" TYPE="lvm" FSTYPE="" MOUNTPOINT=""
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-4" PKNAME="dm-2" NAME="vg-lv_root" LABEL="img-rootfs" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-5" PKNAME="dm-2" NAME="vg-lv_tmp" LABEL="fs_tmp" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/tmp"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-6" PKNAME="dm-2" NAME="vg-lv_var" LABEL="fs_var" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/var"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-7" PKNAME="dm-2" NAME="vg-lv_log" LABEL="fs_log" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/var/log"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-8" PKNAME="dm-2" NAME="vg-lv_audit" LABEL="fs_audit" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/var/log/audit"
Jan 03 21:28:03 lp2000226-1 growvols[2428]: KNAME="dm-9" PKNAME="dm-2" NAME="v...
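
The `lsblk -P` output in the journal above is a sequence of KEY="value" pairs per device, which makes it easy to inspect mount state programmatically. A small sketch of one way to parse it; this is an illustration only, not necessarily how growvols parses the output, and the helper name is hypothetical. The sample lines are taken from the journal with the syslog prefix removed:

```python
import shlex


def parse_lsblk_pairs(output):
    """Parse `lsblk -P` output into one dict per device line."""
    devices = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # shlex strips the double quotes, so 'LABEL=""' becomes 'LABEL='.
        tokens = [token for token in shlex.split(line) if "=" in token]
        devices.append(dict(token.split("=", 1) for token in tokens))
    return devices


SAMPLE = """\
KNAME="vda4" PKNAME="vda" NAME="vda4" LABEL="" TYPE="part" FSTYPE="LVM2_member" MOUNTPOINT=""
KNAME="dm-4" PKNAME="dm-2" NAME="vg-lv_root" LABEL="img-rootfs" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/"
KNAME="dm-5" PKNAME="dm-2" NAME="vg-lv_tmp" LABEL="fs_tmp" TYPE="lvm" FSTYPE="xfs" MOUNTPOINT="/tmp"
"""

devices = parse_lsblk_pairs(SAMPLE)
xfs_mountpoints = {
    dev["NAME"]: dev["MOUNTPOINT"] for dev in devices if dev["FSTYPE"] == "xfs"
}
print(xfs_mountpoints)
```

An XFS volume with an empty MOUNTPOINT in such a listing would be one that xfs_growfs cannot grow.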

Sandeep Yadav (sandeepyadav93) wrote :
Steve Baker (steve-stevebaker) wrote :

Since systemd-252, calling lvextend on /home or /srv will sometimes
(~20%) cause the volume to be unmounted. Here is the logging sequence:

growvols[2428]: [INFO] Running: lvextend --size +17200840704B /dev/mapper/vg-lv_srv
dmeventd[775]: No longer monitoring thin pool vg-lv_thinpool-tpool.
kernel: dm-10: detected capacity change from 98304 to 33693696
systemd[1]: Stopped target Local File Systems.
systemd[1]: Unmounting /srv...
kernel: XFS (dm-10): Unmounting Filesystem
dmeventd[775]: Monitoring thin pool vg-lv_thinpool-tpool.
systemd[1]: srv.mount: Deactivated successfully.
systemd[1]: Unmounted /srv.
systemd[1]: systemd-fsck@dev-disk-by\x2dlabel-fs_srv.service: Deactivated successfully.
systemd[1]: Stopped File System Check on /dev/disk/by-label/fs_srv.
growvols[2428]: [DEBUG] Result: Size of logical volume vg/lv_srv changed from 48.00 MiB (12 extents) to <16.07 GiB (4113 extents).
growvols[2428]: Logical volume vg/lv_srv successfully resized.

The event "Stopped target Local File Systems." should only happen,
as a safety measure, when dmeventd reports that the thin pool is near
capacity, which is clearly not the case:

$ lvs
  LV          VG Attr       LSize   Pool        Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_audit    vg Vwi-aotz--  <2.05g lv_thinpool        0.62
  lv_home     vg Vwi-aotz--   1.16g lv_thinpool        0.88
  lv_log      vg Vwi-aotz--  <9.55g lv_thinpool        0.31
  lv_root     vg Vwi-aotz-- <11.07g lv_thinpool        18.97
  lv_srv      vg Vwi-aotz-- <16.07g lv_thinpool        0.31
  lv_thinpool vg twi-aotz--  77.92g                    3.36   1.68
  lv_tmp      vg Vwi-aotz--   1.16g lv_thinpool        0.88
  lv_var      vg Vwi-aotz-- <37.43g lv_thinpool        1.10

I speculate that the root cause is that systemd-252 interprets
messages from dmeventd differently than systemd-250 did, and that either the
message or the interpretation is incorrect.

I've proposed [1] in an attempt to prevent dmeventd from being called at all during lvextend; I'll leave it testing overnight.

[1] https://review.opendev.org/c/openstack/diskimage-builder/+/869274

Amol Kahat (amolkahat) wrote :
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart/+/868605
Committed: https://opendev.org/openstack/tripleo-quickstart/commit/5fee91651e0809976e98e25f0d53989c09280021
Submitter: "Zuul (22348)"
Branch: master

commit 5fee91651e0809976e98e25f0d53989c09280021
Author: Sandeep Yadav <email address hidden>
Date: Tue Dec 27 07:29:39 2022 +0530

    Exclude latest systemd*-252-* packages

    After latest systemd we are hitting bug[1] and is breaking
    the ovb node provisioning. It is blocking the promotion.

    Let's exclude the latest systemd*-252-* till we have a proper
    fix.

    [1] https://bugs.launchpad.net/tripleo/+bug/2000226

    Related-Bug: #2000226

    Signed-off-by: Sandeep Yadav <email address hidden>
    Co-Authored-by: Amol Kahat <email address hidden>
    Change-Id: Ia6837781272ae3f3e9299a0ff4b8bf2d0bae2be5

Ronelle Landy (rlandy) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (master)
Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-image-elements/+/872481
Committed: https://opendev.org/openstack/tripleo-image-elements/commit/2ce67c3dbb6a7f8b193dbbf3fde6090005fa6700
Submitter: "Zuul (22348)"
Branch: master

commit 2ce67c3dbb6a7f8b193dbbf3fde6090005fa6700
Author: Steve Baker <email address hidden>
Date: Thu Feb 2 13:10:02 2023 +1300

    Install modified udev rule to fix lvextend unmount

    This fix has been proposed to lvm2 upstream[1]. This change can be
    reverted once the fix is packaged. It is proposed here to unblock
    upstream and downstream delivery pipelines.

    [1] https://github.com/lvmteam/lvm2/pull/105

    Change-Id: If187a2b1ec61ec47738b99b40919d4cc65fa9505
    Closes-Bug: #2000226
    Related: rhbz#2158628

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (stable/zed)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-image-elements/+/873134

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-image-elements/+/873134
Committed: https://opendev.org/openstack/tripleo-image-elements/commit/40dd9ee8e58b1efdd9cce4853348042e195e94c9
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 40dd9ee8e58b1efdd9cce4853348042e195e94c9
Author: Steve Baker <email address hidden>
Date: Thu Feb 2 13:10:02 2023 +1300

    Install modified udev rule to fix lvextend unmount

    This fix has been proposed to lvm2 upstream[1]. This change can be
    reverted once the fix is packaged. It is proposed here to unblock
    upstream and downstream delivery pipelines.

    [1] https://github.com/lvmteam/lvm2/pull/105

    Change-Id: If187a2b1ec61ec47738b99b40919d4cc65fa9505
    Closes-Bug: #2000226
    Related: rhbz#2158628
    (cherry picked from commit 2ce67c3dbb6a7f8b193dbbf3fde6090005fa6700)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-image-elements (stable/zed)

Change abandoned by "dasm <email address hidden>" on branch: stable/zed
Review: https://review.opendev.org/c/openstack/tripleo-image-elements/+/873133
Reason: Abandoning since deprecation of stable/zed is in progress: https://lists.openstack.org/pipermail/openstack-discuss/2023-February/032083.html

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/tripleo-image-elements/+/873133
Committed: https://opendev.org/openstack/tripleo-image-elements/commit/c8173d557ee07727e2477c418e6bf2bf2569673c
Submitter: "Zuul (22348)"
Branch: stable/zed

commit c8173d557ee07727e2477c418e6bf2bf2569673c
Author: Steve Baker <email address hidden>
Date: Thu Feb 2 13:10:02 2023 +1300

    Install modified udev rule to fix lvextend unmount

    This fix has been proposed to lvm2 upstream[1]. This change can be
    reverted once the fix is packaged. It is proposed here to unblock
    upstream and downstream delivery pipelines.

    [1] https://github.com/lvmteam/lvm2/pull/105

    Change-Id: If187a2b1ec61ec47738b99b40919d4cc65fa9505
    Closes-Bug: #2000226
    Related: rhbz#2158628
    (cherry picked from commit 2ce67c3dbb6a7f8b193dbbf3fde6090005fa6700)

tags: added: in-stable-zed