Stx-openstack apply-fail after swact standby controller, lock, unlock standby controller
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Critical | Gustavo Santos |
Bug Description
Bug Description
Brief Description
-----------------
Stx-openstack goes to apply-failed after swacting to the standby controller and then locking and unlocking the (new) standby controller. This is visible on Standard and Standard-EXT configurations on baremetal.
Severity
--------
<Critical: System/Feature is not usable due to the defect>
Steps to Reproduce
------------------
Swact standby controller
Lock standby controller
Unlock standby controller
Check: system application-list | grep openstack
Normally the application should be in the "applied" state, but it ends up "apply-failed".
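The status check in the steps above can be scripted. A minimal sketch, assuming a hypothetical pipe-delimited table layout for the `system application-list` output (the real CLI columns may differ):

```python
def openstack_apply_status(listing: str) -> str:
    """Return the status column for stx-openstack from the table printed
    by `system application-list`.
    Assumed columns: application | version | manifest | manifest file | status | progress
    """
    for line in listing.splitlines():
        cols = [c.strip() for c in line.split("|")]
        cols = [c for c in cols if c]          # drop empty edge cells
        if cols and cols[0] == "stx-openstack":
            return cols[4]                     # the status column
    return "not-found"
```

A sanity/reproduction script could poll this until the status settles on "applied" or "apply-failed".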
Expected Behavior
------------------
stx-openstack should apply fine, without any error
Actual Behavior
----------------
stx-openstack apply fails
Reproducibility
---------------
reproduced 3 times in a row.
System Configuration
-------
Multi-node system, dedicated storage, on baremetal
Branch/Pull Time/Commit
-------
master
Last Pass
---------
20210226T024233Z
Timestamp/Logs
--------------
will be attached
Test Activity
--------------
Sanity
Workaround
----------
-
CVE References
Alexandru Dimofte (adimofte) wrote : | #1 |
Dan Voiculeasa (dvoicule) wrote : | #2 |
Changed in starlingx: | |
importance: | Undecided → Critical |
status: | New → Triaged |
tags: | added: stx.5.0 stx.apps |
Ghada Khalil (gkhalil) wrote : | #3 |
stx.5.0 / critical - sanity issue introduced by recent commit
Bob Church (rchurch) wrote : | #4 |
- LP1917308.txt (40.6 KiB, text/plain)
Attaching key logs from this occurrence.
But I don't see any evidence of a network disconnect that would cause this. Right before this happens, controller-1 has just come online and finished DRBD syncing. It's possible we have a stale TCP connection to the helm postgres DB in the tiller container. The postgres logs report nothing unusual. Maybe we are running out of connections? Looking at the tiller process running in the container, there are quite a few threads running. I'm not sure if this is normal behavior.
I could not reproduce this in my local setup
Bob Church (rchurch) wrote : | #5 |
It is possible that the 5 second timeout on the tiller command is not long enough given the current responsiveness of the system:
(truncated log excerpts)
and that the command is being prematurely terminated before it could complete.
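The suspected failure mode above can be sketched as a wrapper that distinguishes "the command timed out" from "the command failed", so a caller can retry with a longer deadline instead of declaring an apply failure. This is an illustrative sketch, not the sysinv code:

```python
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run a CLI command under a deadline.
    Returns ("ok" | "error" | "timeout", stdout) so a timeout is a
    distinct, retryable outcome rather than a hard failure."""
    try:
        res = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=timeout_s)
        return ("ok" if res.returncode == 0 else "error", res.stdout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the deadline expires
        return ("timeout", "")
```

With a fixed 5s deadline on a busy system, the tiller command lands in the "timeout" branch even though it would eventually have succeeded.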
tags: | added: stx.containers |
chen haochuan (martin1982) wrote : | #6 |
(truncated log excerpts from the containers logs)
chen haochuan (martin1982) wrote : | #7 |
I could reproduce this issue.
Deploy a duplex system with the latest ISO; stx-openstack applies successfully.
on controller-0
$ system host-swact 1
on controller-1
$ system host-swact 2
then on controller-0(active controller)
$ system application-apply stx-openstack
Application apply fail
chen haochuan (martin1982) wrote : | #8 |
On the system where I reproduced this:
./sysinv.
And postgres listens on 192.188.204.2:5432:
[sysadmin@
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN 1357087/postgres
tcp6 0 0 :::5432 :::* LISTEN 1357087/postgres
[sysadmin@
nfsnobo+ 130497 0.1 0.4 213352 126384 ? Ssl 03:35 0:57 /tiller --storage=sql --sql-dialect=
postgres 1357087 0.0 0.1 312924 34880 ? S< 14:17 0:00 /usr/bin/postgres -D /var/lib/
And the cluster IP "172.16.192.72" is the armada-api pod's address.
./pods/
[sysadmin@
armada-
So after the swact, tiller in the armada-api pod could not access the postgres service.
chen haochuan (martin1982) wrote : | #9 |
After waiting for a while, it recovers.
Changed in starlingx: | |
assignee: | nobody → Gustavo Santos (gooshtavow) |
Gustavo Santos (gooshtavow) wrote : | #10 |
The armada-api pod, which runs helm 2, goes up with the following command when starting the tiller container:
tiller --storage=sql --sql-dialect=
Where 192.168.204.1 is the active controller's floating IP address. This creates a socket connecting the pod to the currently active controller. After performing a swact, this socket becomes invalid, because it still points to the now inactive controller, and that is why the broken pipe error happens.
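The broken-pipe recovery pattern this implies can be sketched as follows; `connect` and `query` are stand-ins for the real DB client calls, not StarlingX functions:

```python
def run_query(connect, query, max_retries=1):
    """Run `query` over a connection from `connect()`.
    On a broken pipe (a stale socket still bound to the previously
    active controller), re-establish the connection and retry: the new
    socket resolves the floating IP to the now-active controller."""
    conn = connect()
    for attempt in range(max_retries + 1):
        try:
            return query(conn)
        except (BrokenPipeError, ConnectionResetError):
            if attempt == max_retries:
                raise
            conn = connect()  # fresh socket to the current active controller
```

Without the reconnect step, every call after the swact keeps hitting the dead socket and fails exactly as described above.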
Yvonne Ding (yding) wrote : | #11 |
The issue can be reproduced on AIO-SX after lock/unlock controller with "20210401T032802Z" load.
2021-04-01 17:11:01.794 95532 ERROR sysinv. (truncated traceback)
Gustavo Santos (gooshtavow) wrote : | #12 |
A code review with a possible fix has been opened for this bug: https:/
Changed in starlingx: | |
status: | Triaged → Fix Released |
Alexandru Dimofte (adimofte) wrote : | #13 |
I checked again today(20210408T
sysinv 2021-04-08 17:40:06.274 920503 INFO sysinv.helm.utils [-] Caught HelmTillerFailure exception. Retrying... Exception: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.
command terminated with exit code 1
sysinv 2021-04-08 17:40:06.691 920503 INFO sysinv.helm.utils [-] Caught HelmTillerFailure exception. Retrying... Exception: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.
command terminated with exit code 1
sysinv 2021-04-08 17:40:06.692 920503 ERROR sysinv.
command terminated with exit code 1
: HelmTillerFailure: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.
command terminated with exit code 1
2021-04-08 17:40:06.692 920503 ERROR sysinv. (truncated traceback)
Gustavo Santos (gooshtavow) wrote : | #14 |
Alexandru, can you provide a little more information about the system you've tested this on and if you got the error more than once? I wasn't able to reproduce the issue in several attempts on two different systems and I'm wondering why you're still getting the error.
Alexandru Dimofte (adimofte) wrote : | #15 |
Today I checked again if this issue is still there and I tested using a baremetal Standard configuration.
The steps were:
system host-swact controller-0
ssh controller-1
system host-lock controller-0
system host-unlock controller-0
watch system application-list (in 5-6 minutes stx-openstack will try a reapply but will fail)
Alexandru Dimofte (adimofte) wrote : | #16 |
- Added collected logs from today, baremetal standard configuration (119.9 MiB, application/x-tar)
Alexandru Dimofte (adimofte) wrote : | #17 |
I manually re-tested this bug today on baremetal: Duplex, Standard, and Standard External. I reproduced it on Standard External only.
Alexandru Dimofte (adimofte) wrote : | #18 |
Gustavo Santos (gooshtavow) wrote : | #19 |
Alexandru, I have opened a code review (https:/
Ghada Khalil (gkhalil) wrote : | #20 |
Re-opening as there seems to be more code reviews required to address this issue.
Once a fix is merged in stx master, it will also need to be cherrypicked to the r/stx.5.0 release.
Changed in starlingx: | |
status: | Fix Released → In Progress |
Ghada Khalil (gkhalil) wrote : | #21 |
Note:
This seems to be a generic issue with the containerized application framework after a swact.
https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master) | #22 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit ad8567f06485a10
Author: Gustavo Santos <email address hidden>
Date: Tue Apr 13 16:09:21 2021 -0300
Restart tiller on openstack pending install check
This is another attempt at fixing the same bug as the merged review
https:/
there were reports indicating that the bug would still occur on certain
setups.
This patch explicitly forces a tiller restart when catching the first
HelmTillerFailure, instead of
only trying to rerun the 'helm list' command, which was believed to be
a reliable workaround to the problem, but didn't solve it in every
possible scenario.
Closes-Bug: #1917308
Signed-off-by: Gustavo Santos <email address hidden>
Change-Id: I38667609173ca5
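The restart-then-retry behavior described in the commit message can be sketched like this; the helper names are illustrative, not the actual sysinv code:

```python
class HelmTillerFailure(Exception):
    """Stand-in for the sysinv HelmTillerFailure exception."""

def pending_charts(list_charts, restart_tiller, retries=2):
    """Fetch the pending charts list.
    On a HelmTillerFailure, restart tiller before retrying, instead of
    only re-running the same command against a possibly stale
    connection (the earlier workaround that did not cover every case)."""
    for attempt in range(retries + 1):
        try:
            return list_charts()
        except HelmTillerFailure:
            if attempt == retries:
                raise
            restart_tiller()  # forces a fresh connection to postgres
```

The key difference from the first fix is the explicit `restart_tiller()` on the failure path, so the retry runs against a freshly established backend connection.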
Changed in starlingx: | |
status: | In Progress → Fix Released |
Ghada Khalil (gkhalil) wrote : | #23 |
@Gustavo, please cherrypick your changes to the r/stx.5.0 release asap.
tags: | added: stx.cherrypickneeded |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.5.0) | #24 |
Fix proposed to branch: r/stx.5.0
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.5.0) | #25 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: r/stx.5.0
commit 70df83f1f949f76
Author: Gustavo Santos <email address hidden>
Date: Tue Apr 13 16:09:21 2021 -0300
Restart tiller on openstack pending install check
This is another attempt at fixing the same bug as the merged review
https:/
there were reports indicating that the bug would still occur on certain
setups.
This patch explicitly forces a tiller restart when catching the first
HelmTillerFailure, instead of
only trying to rerun the 'helm list' command, which was believed to be
a reliable workaround to the problem, but didn't solve it in every
possible scenario.
Closes-Bug: #1917308
Signed-off-by: Gustavo Santos <email address hidden>
Change-Id: I38667609173ca5
(cherry picked from commit ad8567f06485a10
tags: |
added: in-r-stx50 removed: stx.cherrypickneeded |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (master) | #26 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (master) | #27 |
Related fix proposed to branch: master
Review: https:/
Angie Wang (angiewang) wrote : | #28 |
Just a note: helm uses the sqlx package to establish the connection to the postgres backend, and sqlx uses the Golang postgres driver. The "broken pipe" issue is an issue in the Golang postgres driver - https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (master) | #29 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit b3540ccfdfa6956
Author: Robert Church <email address hidden>
Date: Wed May 12 22:36:23 2021 -0400
Update the liveness probe to verify postgres connectivity
Change the tillerLivenessP
postgres backend. We will override the periodSeconds and
failureThre
the tiller pod over a swact when the postgres DB/server moves from one
controller to the other.
This will help guarantee that the tiller connection is always
re-established if the connectivity to the postgres backend fails.
Change-Id: I7fbed33a8c821f
Related-Bug: #1917308
Signed-off-by: Robert Church <email address hidden>
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master) | #30 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit d5460198dc0310a
Author: Robert Church <email address hidden>
Date: Wed May 12 22:45:38 2021 -0400
Adjust armada's tiller container liveness probe
With the liveness probe update in the armada helm chart to test the
connectivity to the postgres backend, adjust the periodSeconds and
failureThre
postgres switching from one controller to another.
Reviewing logs from various H/W labs it appears that average postgres
swact time ranges from 9s-20s, with the mean ~15s.
Times can be observed with:
(truncated log timestamps)
Set the periodSeconds to 4 and the failureThreshold to 2 so that if the
postgres server is not accessible, the tiller container will be
restarted within the 9s minimum swact time. This will ensure that the
next time tiller is required by Armada or used by the helmv2-cli that
the connection to postgres backend has been re-established.
Change-Id: I7454a737771d9a
Depends-On: https:/
Related-Bug: #1917308
Signed-off-by: Robert Church <email address hidden>
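The probe-timing arithmetic behind the chosen values can be checked directly. This is a worked sketch of the commit's reasoning (kubelet declares the container unhealthy after failureThreshold consecutive failed probes, one per periodSeconds), not code from the repo:

```python
# Values from the commit: probe every 4s, restart after 2 consecutive failures.
PERIOD_SECONDS = 4
FAILURE_THRESHOLD = 2

def worst_case_detection_s(period, threshold):
    """Seconds until kubelet restarts the container once the postgres
    backend becomes unreachable: one failed probe per period, times the
    number of consecutive failures required."""
    return period * threshold

# 4 * 2 = 8s, inside the 9s minimum observed swact time, so tiller is
# restarted before anything can use its stale connection; a single
# transient probe failure does not trigger a restart.
assert worst_case_detection_s(PERIOD_SECONDS, FAILURE_THRESHOLD) < 9
```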
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (master) | #31 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (master) | #32 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 4e1aa82e96d9b4c
Author: Robert Church <email address hidden>
Date: Sat May 15 16:24:29 2021 -0400
Update postgres liveness check to support IPv6 addresses
Templating will add square brackets for IPv6 addresses which are
interpreted as an array vs. a string. Quote this so that it is interpreted
correctly.
Change-Id: I2b705015a74ea2
Related-Bug: #1917308
Signed-off-by: Robert Church <email address hidden>
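The IPv6 formatting issue the commit fixes can be illustrated with a small helper (illustrative only; the actual fix is quoting in the Helm chart template):

```python
import ipaddress

def pg_endpoint(host: str, port: int) -> str:
    """Format host:port for a connection check. IPv6 literals need
    square brackets; when such a value is templated into YAML unquoted,
    the brackets parse as a flow sequence (array) instead of a string,
    which is exactly the bug being fixed."""
    if ipaddress.ip_address(host).version == 6:
        return f"[{host}]:{port}"
    return f"{host}:{port}"
```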
Ghada Khalil (gkhalil) wrote : | #33 |
The additional commits above will need to be merged in the r/stx.5.0 branch
tags: |
added: stx.cherrypickneeded removed: in-r-stx50 |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (r/stx.5.0) | #34 |
Related fix proposed to branch: r/stx.5.0
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (r/stx.5.0) | #35 |
Related fix proposed to branch: r/stx.5.0
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (r/stx.5.0) | #36 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: r/stx.5.0
commit 106331ecec1a77f
Author: Robert Church <email address hidden>
Date: Wed May 12 22:36:23 2021 -0400
Update the liveness probe to verify postgres connectivity
Change the tillerLivenessP
postgres backend. We will override the periodSeconds and
failureThre
the tiller pod over a swact when the postgres DB/server moves from one
controller to the other.
This will help guarantee that the tiller connection is always
re-established if the connectivity to the postgres backend fails.
Change-Id: I7fbed33a8c821f
Related-Bug: #1917308
Signed-off-by: Robert Church <email address hidden>
(cherry picked from commit b3540ccfdfa6956
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (r/stx.5.0) | #37 |
Related fix proposed to branch: r/stx.5.0
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (r/stx.5.0) | #38 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: r/stx.5.0
commit 4555715323b2561
Author: Robert Church <email address hidden>
Date: Wed May 12 22:45:38 2021 -0400
Adjust armada's tiller container liveness probe
With the liveness probe update in the armada helm chart to test the
connectivity to the postgres backend, adjust the periodSeconds and
failureThre
postgres switching from one controller to another.
Reviewing logs from various H/W labs it appears that average postgres
swact time ranges from 9s-20s, with the mean ~15s.
Times can be observed with:
2021-
2021-
Set the periodSeconds to 4 and the failureThreshold to 2 so that if the
postgres server is not accessible, the tiller container will be
restarted within the 9s minimum swact time. This will ensure that the
next time tiller is required by Armada or used by the helmv2-cli that
the connection to postgres backend has been re-established.
Change-Id: I7454a737771d9a
Depends-On: https:/
Related-Bug: #1917308
Signed-off-by: Robert Church <email address hidden>
(cherry picked from commit d5460198dc0310a
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (r/stx.5.0) | #39 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: r/stx.5.0
commit 821de96615cb6f9
Author: Robert Church <email address hidden>
Date: Sat May 15 16:24:29 2021 -0400
Update postgres liveness check to support IPv6 addresses
Templating will add square brackets for IPv6 addresses which are
interpreted as an array vs. a string. Quote this so that it is interpreted
correctly.
Change-Id: I2b705015a74ea2
Related-Bug: #1917308
Signed-off-by: Robert Church <email address hidden>
(cherry picked from commit 4e1aa82e96d9b4c
Ghada Khalil (gkhalil) wrote : | #40 |
Adding in-r-stx50 as the latest commits have been merged in the r/stx.5.0 release branch
tags: |
added: in-r-stx50 removed: stx.cherrypickneeded |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (f/centos8) | #41 |
Related fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8) | #42 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #43 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (f/centos8) | #44 |
Related fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (f/centos8) | #45 |
Related fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8) | #46 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (f/centos8) | #47 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: f/centos8
commit 4e96b762f549aad
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000
Revert "Restore host filesystems with collected sizes"
This reverts commit 255488739efa4ac
Reason for revert: Did a rework to fix https:/
Change-Id: Iea79701a874eff
Depends-On: I55ae6954d24ba3
commit c064aacc377c8bd
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400
Ensure apiserver keys are present before extract from tarball
This is to fix the upgrade playbook issue that happens during
AIO-SX upgrade from stx4.0 to stx5.0 which introduced by
https:/
The apiserver keys are not available in stx4.0 side so we need
to ensure the keys under /etc/kubernetes/pki are present in the
backed-up tarball before extracting, otherwise playbook fails
because the keys are not found in the archive.
Change-Id: I8602f07d1b1041
Closes-Bug: 928925
Signed-off-by: Angie Wang <email address hidden>
commit 0261f22ff7c23d2
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400
Update SX to DX migration to wait for coredns config
This commit updates the SX to DX migration playbook to wait after
modifying the system mode to duplex until the runtime manifest that
updates coredns config has completed. The playbook will wait for up to
20 minutes to allow for the possibility that sysinv has multiple
runtime manifests queued up, each of which could take several minutes.
Depends-On: https:/
Depends-On: https:/
Change-Id: I3bf94d3493ae20
Closes-Bug: 1929148
Signed-off-by: Don Penney <email address hidden>
commit 7c4f17bd0d92fc1
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000
Fixed missing apiserver-
When controller-1 is the active controller
the backup archive does not contain
/etc/
This change adds a new task which brings
the certs from /etc/kubernetes/pki
Closes-bug: 1928925
Signed-off-by: Daniel Safta <email address hidden>
Change-Id: I3c68377603e1af
commit e221ef8fbe51aa6
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500
Support boo...
tags: | added: in-f-centos8 |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8) | #48 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #49 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8) | #50 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (f/centos8) | #51 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: f/centos8
commit b310077093fd567
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300
Fix resize of filesystems in puppet logical_volume
After system reinstalls there is stale data on the disk
and puppet fails when resizing, reporting some wrong filesystem
types. In our case docker-lv was reported as drbd when
it should have been xfs.
This problem was solved in some cases e.g:
when doing a live fs resize we wipe the last 10MB
at the end of partition:
https:/
Our issue happened here:
https:/
Resize can happen at unlock when a bigger size is detected for the
filesystem and the 'logical_volume' will resize it.
To fix this we have to wipe the last 10MB of the partition after the
'lvextend' cmd in the 'logical_volume' module.
Tested the following scenarios:
B&R on SX with default sizes of filesystems and cgts-vg.
B&R on SX with with docker-lv of size 50G, backup-lv also 50G and
cgts-vg with additional physical volumes:
- name: cgts-vg
- path: /dev/disk/
size: 50
type: partition
- path: /dev/disk/
size: 30
type: partition
- path: /dev/disk/
type: disk
B&R on DX system with backup of size 70G and cgts-vg
with additional physical volumes:
physicalVol
- path: /dev/disk/
size: 50
type: partition
- path: /dev/disk/
size: 30
type: partition
- path: /dev/disk/
type: disk
Closes-Bug: 1926591
Change-Id: I55ae6954d24ba3
Signed-off-by: Mihnea Saracin <email address hidden>
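The wipe described in that commit (zeroing the tail of the partition after 'lvextend' so stale filesystem signatures don't survive) can be sketched as below. This is a simplified illustration: the real fix targets the logical volume device from puppet, and a block device's size comes from the block layer rather than os.path.getsize().

```python
import os

def wipe_tail(path: str, mib: int = 10) -> None:
    """Zero the last `mib` MiB of `path` so leftover filesystem
    signatures from a previous install are not detected after a
    resize (e.g. docker-lv being misreported as drbd instead of xfs)."""
    size = os.path.getsize(path)
    n = min(size, mib * 1024 * 1024)
    with open(path, "r+b") as f:
        f.seek(size - n)
        f.write(b"\0" * n)
```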
commit 322557053045895
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300
Execute once the ceph services script on AIO
The MTC client manages ceph services via ceph.sh which
is installed on all node types in
/etc/
Since the AIO controllers have both controller and worker
personalities, the MTC client will execute the ceph script
twice (/etc/service.
/etc/
This behavior will generate some issues.
We fix this by exiting the ceph script if it is the one from
/etc/
Closes-Bug: 1928934
Change-Id: I3e4dc313cc3764
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8) | #52 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: f/centos8
commit 9e420d9513e5faf
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400
Add more logging to run docker login
Add error log for running docker login. The new log could
help identify docker login failure.
Closes-Bug: 1930310
Change-Id: I8a709fb6665de8
Signed-off-by: Bin Qian <email address hidden>
commit 31c77439d2cea59
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500
Fix controller-0 downgrade failing to kill ceph
kill_
file that does not exist in an AIO-DX environment.
We no longer invoke kill_ceph_
AIO SX or DX env.
This allows: "system host-downgrade controller-0"
to proceed in an AIO-DX environment where that second
controller (controller-0) was upgraded.
Partial-Bug: 1929884
Signed-off-by: albailey <email address hidden>
Change-Id: I633853f7531773
commit 0dc99eee608336f
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500
Fix file permissions failure during duplex upgrade abort
When issuing a downgrade for controller-0 in a duplex upgrade
abort and rollback scenario, the downgrade command was failing
because the sysinv API does not have root permissions to set
a file flag.
The fix is to use RPC so the conductor can create the flag
and allow the downgrade for controller-0 to get further.
Partial-Bug: 1929884
Signed-off-by: albailey <email address hidden>
Change-Id: I913bcad73309fe
commit 7ef3724dad17375
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800
Fix bug rook-ceph provision with multi osd on one host
Test case:
1, deploy simplex system
2, apply rook-ceph with below override value
value.yaml
cluster:
storage:
nodes:
- name: controller-0
devices:
- name: sdb
- name: sdc
3, reboot
Without this fix, only osd pod could launch successfully after boot
as vg start with ceph could not correctly add in sysinv-database
Closes-bug: 1929511
Change-Id: Ia5be599cd168d1
Signed-off-by: Chen, Haochuan Z <email address hidden>
commit 23505ba77d76114
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400
Fix issue in partition data migration script
The created partition dictionary partition_map is not
an ordered dict so we need to sort it by its key -
device node when iterating it to adjust the device
nodes/paths for user created extra partitions to ensure
the number of device node...
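The sort described in that commit can be sketched as follows (illustrative helper, not the migration script itself):

```python
def ordered_partitions(partition_map: dict) -> list:
    """Iterate a device-node -> partition-info map in device order.
    A plain dict built from discovery output reflects insertion order,
    not device order, so sort by the device-node key before adjusting
    device nodes/paths. (A lexical sort is enough up to partition 9;
    beyond that a natural sort would be needed.)"""
    return sorted(partition_map.items())
```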
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8) | #53 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #54 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
Did a short investigation since https://review.opendev.org/c/starlingx/config/+/773451 landed.
There is a small error observed in the logs introduced by that commit, but it is not the cause of the issue observed here. This will be the fix for that error:

diff --git a/sysinv/sysinv/sysinv/sysinv/conductor/manager.py b/sysinv/sysinv/sysinv/sysinv/conductor/manager.py
index b5189f65..6fb2616e 100644
--- a/sysinv/sysinv/sysinv/sysinv/conductor/manager.py
+++ b/sysinv/sysinv/sysinv/sysinv/conductor/manager.py
@@ -11908,8 +11908,8 @@ class ConductorManager(service.PeriodicService):
                     LOG.exception("Failed to regenerate the overrides for app %s. %s" %
                                   (app.name, e))
                 else:
-                    LOG.info("{} app active:{} status:{} does not warrant re-apply",
-                             app.name, app.active, app.status)
+                    LOG.info("{} app active:{} status:{} does not warrant re-apply"
+                             "".format(app.name, app.active, app.status))

     def app_lifecycle_actions(self, context, rpc_app, hook_info):
         """Perform any lifecycle actions for the operation and timing supplied.
--
2.30.0
Back to the issue:
Seems armada/kubernetes related.
sysinv 2021-03-01 11:36:32.372 2356122 INFO sysinv.conductor.kube_app [-] lifecycle hook for application stx-openstack (1.0-78-centos-stable-versioned) started {'lifecycle_type': u'manifest', 'relative_timing': u'pre', 'mode': u'auto', 'operation': u'apply', 'extra': {'was_applied': True}}.
sysinv 2021-03-01 11:36:32.372 2356122 INFO k8sapp_openstack.lifecycle.lifecycle_openstack [-] Wait if there are openstack charts in pending install...
sysinv 2021-03-01 11:36:32.781 2356122 ERROR sysinv.conductor.kube_app [-] Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.192.176:45960->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
: HelmTillerFailure: Helm operation failure: Failed to obtain pending charts list: Helm operation failure: Error: write tcp 172.16.192.176:45960->10.10.59.10:5432: write: broken pipe
command terminated with exit code 1
2021-03-01 11:36:32.781 2356122 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
var/log/containers$ grep -R "10.10.59.10" | grep armada-api
armada-api-b86d46465-xdbjt_armada_tiller-a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b.log:2021-03-01T10:44:38.71962272Z stderr F [storage/driver] 2021/03/01 10:44:38 list: failed to list: write tcp 172.16.192.176:60758->10.10.59.10:5432: write: broken pipe
armada-api-b86d46465-xdbjt_armada_tiller-a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b.log:2021-03-01T11:36:32.776510152Z stderr F [storage/driver] 2021/03/01 11:36:32 list: failed to list: write tcp 172.16.192.176:45960->10.10.59.10:5432: write: broken pipe
armada-api-b86d46465-xdbjt_armada_tiller-a00cf66fa21b19f28771a99a2aa85643c1fbfd2ed9d19d0f10c2a8ac7925cc1b.log:2021-03-01T11:38:56.600564874Z stderr F [storage/driver] 2021/03/01 11:38:56 list: failed to list: write tcp 172.16.192.176:35854->10.10.59.10:5432: write: broken pipe
armada-api-b86d46465-xdbjt_armada_til...