Mistral fails to maintain a keystone session while deploying an overcloud
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Mistral | In Progress | High | Unassigned | |
| tripleo | Incomplete | High | Unassigned | |
Bug Description
Trying to deploy a containerized overcloud from a containerized undercloud in OVB environment, the overcloud gets deployed but Mistral Executor fails when Zaqar is trying to post the message on the queue:
Maybe irrelevant, but these messages pop up in our logs:
Loaded 2 Fernet keys from /etc/keystone/
Therefore, the overcloud fails to finish the deployment.
I've been working on aligning the mistral/zaqar configurations:
https:/
But it didn't help; so now wondering about key rotations etc.
Note that we haven't hit this bug in multinode jobs; maybe because those jobs run faster than the OVB one? Do we have some sort of expiration?
summary: | Failures to get tokens when undercloud is containerized → Mistral or Zaqar fail to maintain a keystone session while deploying an overcloud |
tags: | added: tech-debt |
Changed in mistral: | |
importance: | Undecided → High |
status: | New → Incomplete |
status: | Incomplete → New |
tags: | added: workflows |
Changed in mistral: | |
status: | New → Triaged |
milestone: | none → rocky-1 |
Changed in tripleo: | |
milestone: | rocky-1 → rocky-3 |
no longer affects: | zaqar |
summary: | Mistral or Zaqar fail to maintain a keystone session while deploying an overcloud → Mistral fails to maintain a keystone session while deploying an overcloud |
Changed in mistral: | |
assignee: | nobody → Brad P. Crochet (brad-9) |
Changed in mistral: | |
milestone: | rocky-1 → rocky-2 |
Changed in mistral: | |
milestone: | rocky-2 → rocky-3 |
Changed in tripleo: | |
milestone: | rocky-3 → rocky-rc1 |
Changed in tripleo: | |
assignee: | nobody → Brad P. Crochet (brad-9) |
status: | Triaged → In Progress |
Changed in mistral: | |
milestone: | rocky-3 → rocky-rc1 |
Changed in mistral: | |
milestone: | rocky-rc1 → rocky-rc2 |
Changed in mistral: | |
milestone: | rocky-rc2 → stein-1 |
Changed in tripleo: | |
milestone: | rocky-rc1 → stein-1 |
Changed in tripleo: | |
milestone: | stein-1 → stein-2 |
Changed in mistral: | |
milestone: | stein-1 → stein-2 |
Changed in tripleo: | |
assignee: | Brad P. Crochet (brad-9) → nobody |
Changed in mistral: | |
assignee: | Brad P. Crochet (brad-9) → nobody |
Changed in tripleo: | |
milestone: | stein-2 → stein-3 |
Changed in mistral: | |
milestone: | stein-2 → stein-3 |
Changed in tripleo: | |
milestone: | stein-3 → stein-rc1 |
Changed in mistral: | |
milestone: | stein-3 → train-1 |
Changed in tripleo: | |
milestone: | stein-rc1 → train-1 |
Changed in tripleo: | |
milestone: | train-1 → train-2 |
Changed in tripleo: | |
milestone: | train-2 → train-3 |
Changed in tripleo: | |
milestone: | train-3 → ussuri-1 |
Changed in mistral: | |
milestone: | train-1 → ussuri-1 |
Changed in mistral: | |
milestone: | ussuri-1 → ussuri-2 |
Changed in tripleo: | |
milestone: | ussuri-1 → ussuri-2 |
Changed in tripleo: | |
milestone: | ussuri-2 → ussuri-3 |
Changed in mistral: | |
milestone: | ussuri-2 → ussuri-3 |
Changed in tripleo: | |
milestone: | ussuri-3 → ussuri-rc3 |
Changed in mistral: | |
milestone: | ussuri-3 → ussuri-rc1 |
Changed in mistral: | |
milestone: | ussuri-rc1 → ussuri-rc2 |
Changed in mistral: | |
milestone: | ussuri-rc2 → none |
milestone: | none → victoria-1 |
Changed in tripleo: | |
milestone: | ussuri-rc3 → victoria-1 |
Changed in tripleo: | |
milestone: | victoria-1 → victoria-3 |
Changed in mistral: | |
milestone: | victoria-1 → wallaby-1 |
Changed in tripleo: | |
milestone: | victoria-3 → wallaby-1 |
Changed in tripleo: | |
milestone: | wallaby-1 → wallaby-2 |
Changed in tripleo: | |
milestone: | wallaby-2 → wallaby-3 |
I don't think it's a token provider configuration issue, since we only have 2 keys by default. And the deployment seems to start and go forward for quite a while: https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/home/jenkins/overcloud_deploy.log.txt.gz
From there I can see it goes up to step 5, and in the end it fails with this exception: "No JSON object could be decoded", which I guess comes from the mistral client.
At some point in the zaqar logs I can see that it fails with authorization failed: https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/var/log/containers/zaqar/zaqar.log.txt.gz#_2018-04-04_01_56_10_669
which gets reflected in the mistral executor logs here: https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/var/log/containers/mistral/executor.log.txt.gz#_2018-04-04_01_56_12_627
which is what Emilien reported.
I think the issue is that mistral (the server) is not refreshing the token that zaqar is using. The token works for a while and expires after an hour (which is what we configure). The log timings back this theory up:
The deploy starts at 0:55: https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2018-04-04_00_55_54
And we see the error at 1:55: https://logs.rdoproject.org/56/542556/100/openstack-check/gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/Z64d11a27268e46db803351bb52f7cc25/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2018-04-04_01_55_56
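The hour-long gap between those two timestamps matches a one-hour token lifetime almost exactly. A minimal sketch of that arithmetic, using the timestamps from the logs above and assuming the configured lifetime is keystone's default of 3600 seconds:

```python
from datetime import datetime, timedelta

# Assumption: keystone's [token]/expiration left at the default 3600 seconds.
TOKEN_LIFETIME = timedelta(seconds=3600)

def token_expired(issued_at, now):
    """True once a token issued at `issued_at` has outlived its lifetime."""
    return now - issued_at >= TOKEN_LIFETIME

deploy_start = datetime(2018, 4, 4, 0, 55, 54)  # deploy start, from overcloud_deploy.log
first_error = datetime(2018, 4, 4, 1, 55, 56)   # first auth failure, same log
print(token_expired(deploy_start, first_error))  # → True: the failure lands right at the 1h mark
```

So any OVB run whose deploy phase outlives the token lifetime would hit this, while a faster multinode job would finish inside the window.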
So ultimately it seems to me that it's an issue in how mistral creates the client (in a way that doesn't refresh the keystone tokens). This should already have been handled, though, and as far as I can tell mistral is using sessions correctly. Are we using an old mistral container?
This would usually be handled by the session object from keystoneauth1, which I thought was being used in zaqar.
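To illustrate the difference the session object makes: a client built from a raw token string keeps presenting the same token until it expires and then fails with an auth error, whereas a keystoneauth1-style session re-authenticates transparently. The sketch below is a simplified stand-in (the `FakeKeystone` and `Session` classes are illustrative, not the real keystoneauth1 implementation) showing the refresh behavior in question:

```python
import time

class FakeKeystone:
    """Stand-in for keystone: issues tokens valid for `lifetime` seconds."""
    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.issued = 0

    def issue(self):
        self.issued += 1
        return {"id": "token-%d" % self.issued,
                "expires_at": time.monotonic() + self.lifetime}

class Session:
    """What keystoneauth1's Session does conceptually: re-auth on expiry."""
    def __init__(self, auth):
        self.auth = auth
        self._token = None

    def get_token(self):
        # Re-authenticate whenever the cached token is missing or expired.
        if self._token is None or time.monotonic() >= self._token["expires_at"]:
            self._token = self.auth.issue()
        return self._token["id"]

ks = Session(FakeKeystone(lifetime=0.01))
a = ks.get_token()          # first token issued
time.sleep(0.02)            # outlive the token, like a long overcloud deploy
b = ks.get_token()          # session refreshes; a raw-token client would 401 here
print(a, b)
```

If mistral (or the zaqar transport it builds) captures the token string once at workflow start instead of holding the session, every call after the expiry window would fail exactly as the logs show.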