centos-8-standalone-on-multinode-ipa job with FIPS enabled failing with: "Can't run container rabbitmq_wait_bundle"

Bug #1950382 reported by Douglas Viroel on 2021-11-09

This bug affects 2 people

Affects   Status   Importance  Assigned to  Milestone
tripleo   Triaged  High        Unassigned   tripleo yoga-1

Bug Description

The following behavior is happening more often on standalone jobs with FIPS enabled:

ERROR: Can't run container rabbitmq_wait_bundle
stderr: + STEP=2
...
Error: 'rabbitmqctl eval "lists:keymember(rabbit, 1, application:which_applications())." | grep -q true' returned 1 instead of one of [0]
Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: change from 'notrun' to ['0'] failed: 'rabbitmqctl eval "lists:keymember(rabbit, 1, application:which_applications())." | grep -q true' returned 1 instead of one of [0]
Error: Could not prefetch rabbitmq_user provider 'rabbitmqctl': Command is still failing after 180 seconds expired!
Warning: /Stage[main]/Tripleo::Profile::Base::Rabbitmq/Rabbitmq_user[guest]: Skipping because of failed dependencies
Warning: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Rabbitmq_policy[ha-all@/]: Skipping because of failed dependencies
+ rc=6
+ set -e
+ set +ux
Log: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_801/808215/49/check/tripleo-ci-centos-8-standalone-on-multinode-ipa/8016e34/logs/undercloud/home/zuul/standalone_deploy.log
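
For readers unfamiliar with what the failing step does, the readiness check in the output above amounts to retrying one shell pipeline until it exits 0 or a deadline passes. The following is a minimal sketch of that pattern, not the actual Puppet implementation: the command string is copied from the error message, the 180-second limit mirrors the timeout quoted in the provider error, and the function names and 5-second interval are hypothetical.

```python
import subprocess
import time

# The pipeline is copied from the Puppet error output; everything else
# (function names, retry interval) is a hypothetical illustration.
READY_CHECK = (
    'rabbitmqctl eval '
    '"lists:keymember(rabbit, 1, application:which_applications())." '
    '| grep -q true'
)


def run_check(command=READY_CHECK):
    """Run the shell pipeline once and return its exit status."""
    return subprocess.run(command, shell=True).returncode


def wait_until_ready(check=run_check, timeout=180.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry `check` until it returns 0 or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In this job the check never succeeds, so a loop like this would return False once the deadline expires, which corresponds to the "still failing after 180 seconds" message.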

More details in the logs:
* rabbitmq seems to be failing to start:
(log_op_output) notice: rabbitmq_start_0[203] error output [ Call cib_query failed (-6): No such device or address ]
(log_op_output) notice: rabbitmq_start_0[203] error output [ Call cib_query failed (-6): No such device or address ]
(log_op_output) notice: rabbitmq_start_0[203] error output [ Schema validation of configuration is disabled (enabling is encouraged and prevents common misconfigurations) ]
(log_op_output) notice: rabbitmq_start_0[203] error output [ Error: operation wait on node <email address hidden> timed out. Timeout value used: 195000 ]
(log_finished) info: rabbitmq start (call 13, PID 203) exited with status 1 (execution time 200002ms, queue time 0ms)
(log_execute) info: executing - rsc:rabbitmq action:notify call_id:34
Log: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_801/808215/49/check/tripleo-ci-centos-8-standalone-on-multinode-ipa/8016e34/logs/undercloud/var/log/extra/podman/containers/rabbitmq-bundle-podman-0/stdout.log

More hits here:
https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-standalone-on-multinode-ipa&change=808215

Similar bug, but with different errors on rabbitmq:
https://bugs.launchpad.net/tripleo/+bug/1949327

Ade Lee (alee-3) wrote on 2021-11-09:

This failure seems to be consistently happening on the tripleo-ci-centos-8-standalone-on-multinode-ipa job.

There are other jobs that also run with FIPS enabled where rabbitmq seems to be set up just fine (scenarios 7 and 10).

See https://review.opendev.org/c/openstack/tripleo-ci/+/808215

Douglas Viroel (dviroel) wrote on 2021-11-12:

Hi Bogdan Dobrelya and John Eckersberg, when you have time, can you help us debug this issue?

John Eckersberg (jeckersb) wrote on 2021-11-16:

This looks like a duplicate of https://bugs.launchpad.net/tripleo/+bug/1949327

Douglas Viroel (dviroel) wrote on 2021-11-16:

Looks a little bit different, since this job doesn't have the same errors logged in #1949327, like:
- 'epmd: failed to bind on ipaddr 0.0.0.0'
or
- start up errors in rabbitmq:
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_801/808215/49/check/tripleo-ci-centos-8-standalone-on-multinode-ipa/8016e34/logs/undercloud/var/log/containers/rabbitmq/startup_err

But if you have any patch for #1949327, we can try it and see if it also fixes this one.

Thanks John

John Eckersberg (jeckersb) wrote on 2021-11-16:

Oh duh I see you already mentioned that at the very end of the original comment. Sorry, not having a good reading comprehension day :)

This one is almost certainly something to do with TLS, just based on the rabbit stdout.log linked above:

(log_op_output) notice: rabbitmq_stop_0[1386] error output [ * TCP connection succeeded but Erlang distribution failed ]

So the cli can locate rabbit and connect to it, but can't start distribution (handshake, basically). At that point either (1) there is an erlang cookie mismatch, or (2) the TLS handshake failed for some reason (probably certificate verification).

It's probably not (1) since this is all contained within the same node and both the server and CLI share the same cookie file. The only way it could possibly be a cookie mismatch is if rabbit starts, then something in tripleo changes the cookie out from underneath of it, and then the CLI tries to use the new cookie. This has happened repeatedly in the past during upgrades, but I wouldn't expect to see it show up in a more straightforward CI run.
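
One rough way to test the "cookie changed underneath a running broker" scenario is to compare the cookie file's modification time with the time the broker started. This is a sketch under stated assumptions: the function name is hypothetical, the caller has to supply the broker start time (as epoch seconds) from some other source, and a modification time can be misleading if the file was rewritten with identical contents.

```python
import os


def cookie_changed_after_start(cookie_path, broker_start_epoch):
    """True if the cookie file was modified after the broker started.

    `broker_start_epoch` is the broker's start time in seconds since the
    epoch; how it is obtained is left to the caller (assumption).
    """
    return os.stat(cookie_path).st_mtime > broker_start_epoch
```

A True result would only suggest that the running node may still hold an older cookie than the one the CLI now reads; it does not prove a mismatch.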

Michele and I did a rather significant overhaul of the rabbit TLS bits here recently:

https://review.opendev.org/c/openstack/puppet-tripleo/+/812401
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/812390

A lot of that was specifically to improve FIPS support by removing hard-coded ciphers and forcing everything to use only TLS 1.2 or TLS 1.3. Plus, newer erlang started logging errors about certificate verification, and these tweaks removed those by actually verifying the certificate or explicitly disabling verification in the cases that don't require it.

With all of that said, it's still not obvious why you would be hitting this only intermittently. I am always suspicious of name resolution, and maybe there is a mismatch with the name(s) in the certificate that only happens if something resolves in some particular manner. If we can hold a node once this reproduces, it would be a huge help to poke at it with the erlang cli as well as openssl s_client and see if we can get some idea of why cert verification might or might not be failing.
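
The name-mismatch theory could be checked on a held node along these lines. This is a minimal sketch, not a tested diagnostic for this job: it assumes the distribution port accepts a plain TLS client handshake without a client certificate, the host, port and CA file are supplied by the caller, all function names are hypothetical, and only exact DNS subjectAltName matches are compared (no wildcard handling).

```python
import socket
import ssl


def names_in_cert(cert):
    """Collect the DNS subjectAltName entries from an ssl getpeercert() dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def fetch_cert(host, port, cafile, timeout=5.0):
    """Do a CA-verified TLS handshake and return the peer certificate as a dict."""
    context = ssl.create_default_context(cafile=cafile)
    context.check_hostname = False  # names are compared by hand below
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def name_mismatches(cert, candidates):
    """Return the candidate names that the certificate does not list."""
    listed = {name.lower() for name in names_in_cert(cert)}
    return [name for name in candidates if name.lower() not in listed]
```

Passing the results of `socket.gethostname()` and `socket.getfqdn()` as `candidates` would show whether any name the node resolves to is missing from the certificate.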

John Eckersberg (jeckersb) wrote on 2021-11-17:

Ade was able to hold a node from a failed CI run and I think I understand why this isn't working.

- The version of erlang in the container image is older and is not compiled with FIPS support
- We aren't making any attempt to explicitly enable FIPS mode in erlang

As a result, when rabbitmqctl tries to connect to rabbitmq it cannot handshake because it tries to use an unsupported key:

[...]
** Reason for termination = error:badarg
** Callback modules = [tls_connection]
** Callback mode = state_functions
** Stacktrace =
** [{crypto,evp_generate_key_nif,[x25519],[]},
[...]

So we (1) need an updated erlang build with FIPS support compiled in, as well as (2) tht/puppet-tripleo changes which add the correct config to enable erlang FIPS mode.
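
To illustrate why the handshake breaks, the sketch below models the situation in Python rather than Erlang: the host is in FIPS mode, the TLS client still offers x25519, and key generation for that curve is refused. It is an illustration under assumptions, not the fix described above: the approved-curve set is simplified to the NIST P-curves, and the function names are hypothetical. The `/proc/sys/crypto/fips_enabled` flag is the standard Linux indicator of kernel FIPS mode.

```python
from pathlib import Path

# Assumption: the FIPS-approved set is simplified here to the NIST P-curves.
FIPS_APPROVED_CURVES = {"secp256r1", "secp384r1", "secp521r1"}


def host_fips_enabled(flag_file="/proc/sys/crypto/fips_enabled"):
    """Return True when the Linux kernel reports FIPS mode (file contains '1')."""
    try:
        return Path(flag_file).read_text().strip() == "1"
    except OSError:
        return False


def usable_curves(offered, fips):
    """Drop curves that a FIPS-mode crypto library would refuse to generate keys for."""
    if not fips:
        return list(offered)
    return [curve for curve in offered if curve in FIPS_APPROVED_CURVES]
```

A FIPS-aware client would apply a filter like `usable_curves` before the handshake; the erlang build in the container does not, so it attempts x25519 and fails with the badarg shown in the stack trace.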

Jiri Podivin (jpodivin) wrote on 2022-05-03:

Similar or perhaps identical issue on tripleo-ci-centos-9-scenario004-standalone under FIPS

Trace:
------
Error: 'rabbitmqctl eval "lists:keymember(rabbit, 1, application:which_applications())." | grep -q true' returned 1 instead of one of [0]

Logs:
-----
https://7e0f91f2308d33170f0c-6aefc8cc238ba217eb8a7ced9edefe1e.ssl.cf5.rackcdn.com/824479/26/check/tripleo-ci-centos-9-scenario004-standalone/4875097/logs/undercloud/var/log/containers/stdouts/rabbitmq_wait_bundle.log

John Eckersberg (jeckersb) wrote on 2022-05-03:

The short answer is that FIPS support has been temporarily removed from erlang when using openssl 3.0 (as is used in el9):

https://github.com/erlang/otp/commit/6bb9c51e900fe8fb5a88bd2498f6e5a92f94ed8d

We have been patching erlang in cbs to somewhat make it work until now, but the behavior in openssl seems to have changed such that our previous hack is no longer working. Given that upstream has completely removed support for the time being, I'm not sure we can easily get it functioning again.

Douglas Viroel (dviroel) on 2023-06-14:

tags: added: fips