Intermittent CI failures on focal-yoga on main development branch
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack RabbitMQ Server Charm |
Fix Committed
|
Medium
|
Unassigned |
Bug Description
There is an intermittent failure in CI with focal-yoga on the main development branch for:
test_408_
Error message:
2023-01-05 19:13:15.702273 | focal-medium | 2023-01-05 19:13:15 [INFO] =======
2023-01-05 19:13:15.702318 | focal-medium | 2023-01-05 19:13:15 [INFO] ERROR: test_408_
2023-01-05 19:13:15.702335 | focal-medium | 2023-01-05 19:13:15 [INFO] Send (and check) amqp messages to every rmq unit (ssl enabled).
2023-01-05 19:13:15.702348 | focal-medium | 2023-01-05 19:13:15 [INFO] -------
2023-01-05 19:13:15.702391 | focal-medium | 2023-01-05 19:13:15 [INFO] Traceback (most recent call last):
2023-01-05 19:13:15.702931 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/
2023-01-05 19:13:15.702973 | focal-medium | 2023-01-05 19:13:15 [INFO] connection = pika.BlockingCo
2023-01-05 19:13:15.702990 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/
2023-01-05 19:13:15.703003 | focal-medium | 2023-01-05 19:13:15 [INFO] self._impl = self._create_
2023-01-05 19:13:15.703017 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/
2023-01-05 19:13:15.703032 | focal-medium | 2023-01-05 19:13:15 [INFO] raise self._reap_
2023-01-05 19:13:15.703044 | focal-medium | 2023-01-05 19:13:15 [INFO] pika.exceptions
2023-01-05 19:13:15.703640 | focal-medium | 2023-01-05 19:13:15 [INFO] During handling of the above exception, another exception occurred:
2023-01-05 19:13:15.703680 | focal-medium | 2023-01-05 19:13:15 [INFO] Traceback (most recent call last):
2023-01-05 19:13:15.703696 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/
2023-01-05 19:13:15.703711 | focal-medium | 2023-01-05 19:13:15 [INFO] self._test_
2023-01-05 19:13:15.703725 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/
2023-01-05 19:13:15.703739 | focal-medium | 2023-01-05 19:13:15 [INFO] rmq_utils.
2023-01-05 19:13:15.704383 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/
2023-01-05 19:13:15.704423 | focal-medium | 2023-01-05 19:13:15 [INFO] connection = connect_
2023-01-05 19:13:15.704439 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/
2023-01-05 19:13:15.704453 | focal-medium | 2023-01-05 19:13:15 [INFO] raise Exception(msg)
2023-01-05 19:13:15.704466 | focal-medium | 2023-01-05 19:13:15 [INFO] Exception: amqp connection failed to 172.16.0.184:5671 as testuser1 ()
Rabbit is reporting as clustered:
rabbitmq-server/0 active idle 3 172.16.0.184 5672/tcp Unit is ready and clustered
nrpe/0* active idle 172.16.0.184 icmp,5666/tcp Ready
rabbitmq-server/1 active idle 4 172.16.0.102 5672/tcp Unit is ready and clustered
nrpe/2 active idle 172.16.0.102 icmp,5666/tcp Ready
rabbitmq-server/2* active idle 5 172.16.0.181 5672/tcp Unit is ready and clustered
nrpe/1 active idle 172.16.0.181 icmp,5666/tcp Ready
The actual connection error occurred at:
2023-01-05 19:03:09.782296 | focal-medium | 2023-01-05 19:03:09 [ERROR] Socket failed to connect: <socket.socket fd=10, family=
2023-01-05 19:03:09.783047 | focal-medium | 2023-01-05 19:03:09 [ERROR] TCP Connection attempt failed: ConnectionRefus
2023-01-05 19:03:09.783090 | focal-medium | 2023-01-05 19:03:09 [ERROR] AMQPConnector - reporting failure: AMQPConnectorSo
2023-01-05 19:03:14.788888 | focal-medium | 2023-01-05 19:03:14 [INFO] Pika version 1.3.1 connecting to ('172.16.0.184', 5671)
2023-01-05 19:03:14.790591 | focal-medium | 2023-01-05 19:03:14 [ERROR] Socket failed to connect: <socket.socket fd=10, family=
2023-01-05 19:03:14.791228 | focal-medium | 2023-01-05 19:03:14 [ERROR] TCP Connection attempt failed: ConnectionRefus
2023-01-05 19:03:14.791273 | focal-medium | 2023-01-05 19:03:14 [ERROR] AMQPConnector - reporting failure: AMQPConnectorSo
2023-01-05 19:03:19.797779 | focal-medium | 2023-01-05 19:03:19 [INFO] Pika version 1.3.1 connecting to ('172.16.0.184', 5671)
2023-01-05 19:03:19.799783 | focal-medium | 2023-01-05 19:03:19 [ERROR] Socket failed to connect: <socket.socket fd=10, family=
2023-01-05 19:03:19.800584 | focal-medium | 2023-01-05 19:03:19 [ERROR] TCP Connection attempt failed: ConnectionRefus
2023-01-05 19:03:19.800632 | focal-medium | 2023-01-05 19:03:19 [ERROR] AMQPConnector - reporting failure: AMQPConnectorSo
2023-01-05 19:03:19.801293 | focal-medium | 2023-01-05 19:03:19 [ERROR] AMQP connection workflow failed: AMQPConnectionW
2023-01-05 19:03:19.801444 | focal-medium | 2023-01-05 19:03:19 [ERROR] AMQPConnectionW
2023-01-05 19:03:19.801474 | focal-medium | 2023-01-05 19:03:19 [ERROR] Connection workflow failed: AMQPConnectionW
2023-01-05 19:03:19.803204 | focal-medium | 2023-01-05 19:03:19 [ERROR] Error in _create_
2023-01-05 19:03:19.803251 | focal-medium | Traceback (most recent call last):
2023-01-05 19:03:19.803269 | focal-medium | File "/home/
e.g. around 19:03:19
And the rabbitmq unit was re-starting up at that time and so couldn't accept connections:
2023-01-05 19:03:08.774 [info] <0.4415.0> RabbitMQ is asked to stop...
2023-01-05 19:03:08.922 [info] <0.4415.0> Stopping RabbitMQ applications and their dependencies in the following order:
..
2023-01-05 19:03:09.014 [info] <0.64.0> SIGTERM received - shutting down
2023-01-05 19:03:17.288 [info] <0.8.0> Log file opened with Lager
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: list of feature flags found:
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: [x] implicit_
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: [x] quorum_queue
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: [x] virtual_
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2023-01-05 19:03:19.729 [info] <0.43.0> Application mnesia exited with reason: stopped
2023-01-05 19:03:19.789 [info] <0.8.0> Feature flags: all running remote nodes support `[implicit_
2023-01-05 19:03:19.856 [info] <0.332.0> ra: meta data store initialised. 0 record(s) recovered
2023-01-05 19:03:19.857 [info] <0.340.0> WAL: recovering ["/var/
2023-01-05 19:03:19.859 [info] <0.345.0>
Starting RabbitMQ 3.8.2 on Erlang 22.2.7
Copyright (c) 2007-2019 Pivotal Software, Inc.
Licensed under the MPL 1.1. Website: https:/
2023-01-05 19:03:19.862 [info] <0.345.0>
node : rabbit@
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/
cookie hash : eM/c6tWpFiRvVvj
log(s) : /<email address hidden>
: /var/log/
database dir : /var/lib/
..
2023-01-05 19:03:20.189 [info] <0.448.0> rabbit on node 'rabbit@
2023-01-05 19:03:20.279 [notice] <0.106.0> Changed loghwm of /<email address hidden> to 50
2023-01-05 19:03:20.601 [info] <0.8.0> Server startup complete; 0 plugins started.
Thus, the reason the test failed is that the unit was restarting at the time. This is a race in the test.
Possible solutions are simply to retry the connect several times, or more robustly detect whether the unit is up prior to starting the test; i.e. verify that the unit really is up.
The rabbitmq unit was running leader-
2023-01-05 19:03:26 unit-rabbitmq-
2023-01-05 19:03:55 unit-rabbitmq-
e.g. leader-
and config-changed ran from 19:03:26 -> 19:03:55
So, what's the test code doing during 408?
Test 408 starts at 19:01:01:
2023-01-05 19:01:01.483261 | focal-medium | 2023-01-05 19:01:01 [INFO] test_408_
---
The full sequence of things running on the unit during the test were:
2023-01-05 19:01:32 unit-rabbitmq-
2023-01-05 19:01:38 unit-rabbitmq-
2023-01-05 19:01:48 unit-nrpe-0: nrpe/0 uniter (run relation-changed (11; unit: rabbitmq-server/0) hook), waited 6s, held 10s
2023-01-05 19:01:48 unit-rabbitmq-
2023-01-05 19:01:58 unit-rabbitmq-
2023-01-05 19:02:08 unit-rabbitmq-
2023-01-05 19:02:55 unit-rabbitmq-
2023-01-05 19:02:55 unit-rabbitmq-
2023-01-05 19:03:26 unit-rabbitmq-
---
So what I think is happening, from code inspection, is that each config-changed hook also triggers leader-
---
Potential solution:
The function connect_
Changed in charm-rabbitmq-server: | |
status: | In Progress → Fix Committed |
importance: | Undecided → Medium |
Changed in charm-rabbitmq-server: | |
assignee: | Alex Kavanagh (ajkavanagh) → nobody |
The fix will land in zaza-openstack- tests: https:/ /github. com/openstack- charmers/ zaza-openstack- tests/pull/ 979