Intermittent CI failures on focal-yoga on main development branch

Bug #2002156 reported by Alex Kavanagh
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Committed
Importance: Medium
Assigned to: Unassigned

Bug Description

There is an intermittent failure in CI with focal-yoga on the main development branch for:

test_408_rmq_amqp_messages_all_units_ssl_on

Error message:

2023-01-05 19:13:15.702273 | focal-medium | 2023-01-05 19:13:15 [INFO] ======================================================================
2023-01-05 19:13:15.702318 | focal-medium | 2023-01-05 19:13:15 [INFO] ERROR: test_408_rmq_amqp_messages_all_units_ssl_on (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)
2023-01-05 19:13:15.702335 | focal-medium | 2023-01-05 19:13:15 [INFO] Send (and check) amqp messages to every rmq unit (ssl enabled).
2023-01-05 19:13:15.702348 | focal-medium | 2023-01-05 19:13:15 [INFO] ----------------------------------------------------------------------
2023-01-05 19:13:15.702391 | focal-medium | 2023-01-05 19:13:15 [INFO] Traceback (most recent call last):
2023-01-05 19:13:15.702931 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/zaza/openstack/charm_tests/rabbitmq_server/utils.py", line 437, in connect_amqp_by_unit
2023-01-05 19:13:15.702973 | focal-medium | 2023-01-05 19:13:15 [INFO] connection = pika.BlockingConnection(parameters)
2023-01-05 19:13:15.702990 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/pika/adapters/blocking_connection.py", line 360, in __init__
2023-01-05 19:13:15.703003 | focal-medium | 2023-01-05 19:13:15 [INFO] self._impl = self._create_connection(parameters, _impl_class)
2023-01-05 19:13:15.703017 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/pika/adapters/blocking_connection.py", line 451, in _create_connection
2023-01-05 19:13:15.703032 | focal-medium | 2023-01-05 19:13:15 [INFO] raise self._reap_last_connection_workflow_error(error)
2023-01-05 19:13:15.703044 | focal-medium | 2023-01-05 19:13:15 [INFO] pika.exceptions.AMQPConnectionError
2023-01-05 19:13:15.703640 | focal-medium | 2023-01-05 19:13:15 [INFO] During handling of the above exception, another exception occurred:
2023-01-05 19:13:15.703680 | focal-medium | 2023-01-05 19:13:15 [INFO] Traceback (most recent call last):
2023-01-05 19:13:15.703696 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/zaza/openstack/charm_tests/rabbitmq_server/tests.py", line 215, in test_408_rmq_amqp_messages_all_units_ssl_on
2023-01-05 19:13:15.703711 | focal-medium | 2023-01-05 19:13:15 [INFO] self._test_rmq_amqp_messages_all_units(units,
2023-01-05 19:13:15.703725 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/zaza/openstack/charm_tests/rabbitmq_server/tests.py", line 145, in _test_rmq_amqp_messages_all_units
2023-01-05 19:13:15.703739 | focal-medium | 2023-01-05 19:13:15 [INFO] rmq_utils.publish_amqp_message_by_unit(dest_unit,
2023-01-05 19:13:15.704383 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/zaza/openstack/charm_tests/rabbitmq_server/utils.py", line 469, in publish_amqp_message_by_unit
2023-01-05 19:13:15.704423 | focal-medium | 2023-01-05 19:13:15 [INFO] connection = connect_amqp_by_unit(unit, ssl=ssl,
2023-01-05 19:13:15.704439 | focal-medium | 2023-01-05 19:13:15 [INFO] File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/zaza/openstack/charm_tests/rabbitmq_server/utils.py", line 445, in connect_amqp_by_unit
2023-01-05 19:13:15.704453 | focal-medium | 2023-01-05 19:13:15 [INFO] raise Exception(msg)
2023-01-05 19:13:15.704466 | focal-medium | 2023-01-05 19:13:15 [INFO] Exception: amqp connection failed to 172.16.0.184:5671 as testuser1 ()

Rabbit is reporting as clustered:

rabbitmq-server/0 active idle 3 172.16.0.184 5672/tcp Unit is ready and clustered
  nrpe/0* active idle 172.16.0.184 icmp,5666/tcp Ready
rabbitmq-server/1 active idle 4 172.16.0.102 5672/tcp Unit is ready and clustered
  nrpe/2 active idle 172.16.0.102 icmp,5666/tcp Ready
rabbitmq-server/2* active idle 5 172.16.0.181 5672/tcp Unit is ready and clustered
  nrpe/1 active idle 172.16.0.181 icmp,5666/tcp Ready

The actual connection error occurred at:

2023-01-05 19:03:09.782296 | focal-medium | 2023-01-05 19:03:09 [ERROR] Socket failed to connect: <socket.socket fd=10, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.16.0.132', 57850)>; error=111 (Connection refused)
2023-01-05 19:03:09.783047 | focal-medium | 2023-01-05 19:03:09 [ERROR] TCP Connection attempt failed: ConnectionRefusedError(111, 'Connection refused'); dest=(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('172.16.0.184', 5671))
2023-01-05 19:03:09.783090 | focal-medium | 2023-01-05 19:03:09 [ERROR] AMQPConnector - reporting failure: AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused')
2023-01-05 19:03:14.788888 | focal-medium | 2023-01-05 19:03:14 [INFO] Pika version 1.3.1 connecting to ('172.16.0.184', 5671)
2023-01-05 19:03:14.790591 | focal-medium | 2023-01-05 19:03:14 [ERROR] Socket failed to connect: <socket.socket fd=10, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.16.0.132', 57852)>; error=111 (Connection refused)
2023-01-05 19:03:14.791228 | focal-medium | 2023-01-05 19:03:14 [ERROR] TCP Connection attempt failed: ConnectionRefusedError(111, 'Connection refused'); dest=(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('172.16.0.184', 5671))
2023-01-05 19:03:14.791273 | focal-medium | 2023-01-05 19:03:14 [ERROR] AMQPConnector - reporting failure: AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused')
2023-01-05 19:03:19.797779 | focal-medium | 2023-01-05 19:03:19 [INFO] Pika version 1.3.1 connecting to ('172.16.0.184', 5671)
2023-01-05 19:03:19.799783 | focal-medium | 2023-01-05 19:03:19 [ERROR] Socket failed to connect: <socket.socket fd=10, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.16.0.132', 57854)>; error=111 (Connection refused)
2023-01-05 19:03:19.800584 | focal-medium | 2023-01-05 19:03:19 [ERROR] TCP Connection attempt failed: ConnectionRefusedError(111, 'Connection refused'); dest=(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('172.16.0.184', 5671))
2023-01-05 19:03:19.800632 | focal-medium | 2023-01-05 19:03:19 [ERROR] AMQPConnector - reporting failure: AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused')
2023-01-05 19:03:19.801293 | focal-medium | 2023-01-05 19:03:19 [ERROR] AMQP connection workflow failed: AMQPConnectionWorkflowFailed: 3 exceptions in all; last exception - AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused'); first exception - AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused').
2023-01-05 19:03:19.801444 | focal-medium | 2023-01-05 19:03:19 [ERROR] AMQPConnectionWorkflow - reporting failure: AMQPConnectionWorkflowFailed: 3 exceptions in all; last exception - AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused'); first exception - AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused')
2023-01-05 19:03:19.801474 | focal-medium | 2023-01-05 19:03:19 [ERROR] Connection workflow failed: AMQPConnectionWorkflowFailed: 3 exceptions in all; last exception - AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused'); first exception - AMQPConnectorSocketConnectError: ConnectionRefusedError(111, 'Connection refused')
2023-01-05 19:03:19.803204 | focal-medium | 2023-01-05 19:03:19 [ERROR] Error in _create_connection().
2023-01-05 19:03:19.803251 | focal-medium | Traceback (most recent call last):
2023-01-05 19:03:19.803269 | focal-medium | File "/home/ubuntu/src/review.opendev.org/openstack/charm-rabbitmq-server/.tox/func-target/lib/python3.8/site-packages/pika/adapters/blocking_connection.py", line 451, in _create_connection

e.g. around 19:03:19

And the rabbitmq unit was restarting at that time, so it couldn't accept connections:

2023-01-05 19:03:08.774 [info] <0.4415.0> RabbitMQ is asked to stop...
2023-01-05 19:03:08.922 [info] <0.4415.0> Stopping RabbitMQ applications and their dependencies in the following order:
..
2023-01-05 19:03:09.014 [info] <0.64.0> SIGTERM received - shutting down
2023-01-05 19:03:17.288 [info] <0.8.0> Log file opened with Lager
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: list of feature flags found:
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: [x] implicit_default_bindings
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: [x] quorum_queue
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: [x] virtual_host_metadata
2023-01-05 19:03:19.635 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2023-01-05 19:03:19.729 [info] <0.43.0> Application mnesia exited with reason: stopped
2023-01-05 19:03:19.789 [info] <0.8.0> Feature flags: all running remote nodes support `[implicit_default_bindings,quorum_queue,virtual_host_metadata]`
2023-01-05 19:03:19.856 [info] <0.332.0> ra: meta data store initialised. 0 record(s) recovered
2023-01-05 19:03:19.857 [info] <0.340.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbit@juju-2aa3df-zaza-1906cf48db01-3/quorum/rabbit@juju-2aa3df-zaza-1906cf48db01-3/00000003.wal"]

2023-01-05 19:03:19.859 [info] <0.345.0>
 Starting RabbitMQ 3.8.2 on Erlang 22.2.7
 Copyright (c) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL 1.1. Website: https://rabbitmq.com
2023-01-05 19:03:19.862 [info] <0.345.0>
 node : rabbit@juju-2aa3df-zaza-1906cf48db01-3
 home dir : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash : eM/c6tWpFiRvVvjkjVKsOw==
 log(s) : /var/log/rabbitmq/rabbit@juju-2aa3df-zaza-1906cf48db01-3.log
                : /var/log/rabbitmq/rabbit@juju-2aa3df-zaza-1906cf48db01-3_upgrade.log
 database dir : /var/lib/rabbitmq/mnesia/rabbit@juju-2aa3df-zaza-1906cf48db01-3

..

2023-01-05 19:03:20.189 [info] <0.448.0> rabbit on node 'rabbit@juju-2aa3df-zaza-1906cf48db01-5' up
2023-01-05 19:03:20.279 [notice] <0.106.0> Changed loghwm of /var/log/rabbitmq/rabbit@juju-2aa3df-zaza-1906cf48db01-3.log to 50
2023-01-05 19:03:20.601 [info] <0.8.0> Server startup complete; 0 plugins started.

Thus, the reason the test failed is that the unit was restarting at the time. This is a race in the test.

Possible solutions: simply retry the connection several times, or, more robustly, verify that the unit really is up before starting the test. A sketch of the latter follows.
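
A rough sketch of the "verify the unit is up" option (the helper name and parameters are hypothetical, not existing zaza API): poll the unit's AMQP SSL port until the broker accepts TCP connections before the test proceeds:

import socket
import time

def wait_for_amqp_port(host, port=5671, timeout=90, interval=5):
    """Block until host:port accepts a TCP connection, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the broker's listener is back up.
            with socket.create_connection((host, port), timeout=5):
                return
        except OSError:
            time.sleep(interval)
    raise TimeoutError('%s:%s did not accept connections within %ss'
                       % (host, port, timeout))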

The rabbitmq unit was running leader-settings-changed during the 19:03:20 timeframe:

2023-01-05 19:03:26 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run leader-settings-changed hook), waited 0s, held 24s
2023-01-05 19:03:55 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run config-changed hook), waited 0s, held 29s

e.g. leader-settings-changed ran from 19:03:02 -> 19:03:26 (logged at 19:03:26 after a 24s hold)
and config-changed ran from 19:03:26 -> 19:03:55 (logged at 19:03:55 after a 29s hold)

So, what's the test code doing during 408?

Test 408 starts at 19:01:01:

2023-01-05 19:01:01.483261 | focal-medium | 2023-01-05 19:01:01 [INFO] test_408_rmq_amqp_messages_all_units_ssl_on (zaza.openstack.charm_tests.rabbitmq_server.tests.RmqTests)

---

The full sequence of things that ran on the unit during the test was:

2023-01-05 19:01:32 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run config-changed hook), waited 1s, held 23s
2023-01-05 19:01:38 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run leader-settings-changed hook), waited 0s, held 6s
2023-01-05 19:01:48 unit-nrpe-0: nrpe/0 uniter (run relation-changed (11; unit: rabbitmq-server/0) hook), waited 6s, held 10s
2023-01-05 19:01:48 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run action 114), waited 10s, held 0s
2023-01-05 19:01:58 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run relation-changed (4; unit: rabbitmq-server/1) hook), waited 0s, held 10s
2023-01-05 19:02:08 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run relation-changed (4; unit: rabbitmq-server/1) hook), waited 0s, held 10s
2023-01-05 19:02:55 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run leader-settings-changed hook), waited 0s, held 6s
2023-01-05 19:02:55 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run action 120), waited 0s, held 0s
2023-01-05 19:03:26 unit-rabbitmq-server-0: rabbitmq-server/0 uniter (run leader-settings-changed hook), waited 0s, held 24s

---

So what I think is happening, from code inspection, is that each config-changed hook also triggers a leader-settings-changed hook on the other two units, so a non-leader gets both a config-changed AND a leader-settings-changed hook. The config-changing test code doesn't take into account that a leader-settings-changed hook may occur and make a unit unavailable for a few seconds (due to a restart). The code that attempts to connect to and verify a unit doesn't retry, and so is susceptible to these errors.

---

Potential solution:

The function connect_amqp_by_unit() needs to retry its connection attempt for at least 90 seconds to allow the leader-settings-changed hook to complete.
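
A minimal sketch of that retry, assuming tenacity is available in the test environment; this illustrates the idea rather than reproducing the committed fix:

import pika
import tenacity

@tenacity.retry(
    retry=tenacity.retry_if_exception_type(pika.exceptions.AMQPConnectionError),
    wait=tenacity.wait_fixed(5),         # pause 5s between attempts
    stop=tenacity.stop_after_delay(90),  # keep trying across the ~90s hook window
    reraise=True,
)
def _retrying_blocking_connection(parameters):
    """Open a pika BlockingConnection, retrying while the unit restarts."""
    return pika.BlockingConnection(parameters)

connect_amqp_by_unit() would then call _retrying_blocking_connection(parameters) in place of the bare pika.BlockingConnection(parameters) call at utils.py line 437.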

Revision history

Alex Kavanagh (ajkavanagh)
Changed in charm-rabbitmq-server:
  assignee: nobody → Alex Kavanagh (ajkavanagh)
  status: New → In Progress

Felipe Reyes (freyes)
Changed in charm-rabbitmq-server:
  status: In Progress → Fix Committed
  importance: Undecided → Medium

Changed in charm-rabbitmq-server:
  assignee: Alex Kavanagh (ajkavanagh) → nobody