Designate units blocked with services not running that should be: designate-producer

Bug #2035119 reported by Bas de Bruijne
Affects                    Status   Importance  Assigned to  Milestone
OpenStack Designate Charm  Expired  Undecided   Unassigned

Bug Description

In test run https://solutions.qa.canonical.com/testruns/715dc755-e475-42cf-8378-f57f13f10ac2/, which tests OpenStack Yoga on Focal with the following version configuration:

==========
maas 3.2.9-
juju 3.1.5
cpe-foundation 2.21.1
infra-ubuntu focal
ceph quincy/stable
charms yoga/stable
fce-container-image ubuntu:jammy
legacy-lma stable
openstack yoga
charmed-kubernetes 1.28
landscape-server 23.03+17-0landscape0
cloud-init 23.2.2-0ubuntu0~20.04.1
==========

the deployment fails because the designate units are stuck blocked:
==========
designate/0* blocked idle 0/lxd/3 10.246.164.146 9001/tcp Services not running that should be: designate-producer
  designate-mysql-router/0* active idle 10.246.164.146 Unit is ready
  hacluster-designate/0* active idle 10.246.164.146 Unit is ready and clustered
  logrotated/41 active idle 10.246.164.146 Unit is ready.
  prometheus-grok-exporter/41 active idle 10.246.164.146 9144/tcp Unit is ready
  public-policy-routing/19 active idle 10.246.164.146 Unit is ready
  ubuntu-advantage/41 active idle 10.246.164.146 Attached (esm-apps,esm-infra)
designate/1 blocked idle 1/lxd/3 10.246.167.39 9001/tcp Services not running that should be: designate-producer
  designate-mysql-router/1 active idle 10.246.167.39 Unit is ready
  hacluster-designate/1 active idle 10.246.167.39 Unit is ready and clustered
  logrotated/45 active idle 10.246.167.39 Unit is ready.
  prometheus-grok-exporter/45 active idle 10.246.167.39 9144/tcp Unit is ready
  public-policy-routing/23 active idle 10.246.167.39 Unit is ready
  ubuntu-advantage/45 active idle 10.246.167.39 Attached (esm-apps,esm-infra)
designate/2 blocked idle 2/lxd/3 10.246.166.217 9001/tcp Services not running that should be: designate-producer
  designate-mysql-router/2 active idle 10.246.166.217 Unit is ready
  hacluster-designate/2 active idle 10.246.166.217 Unit is ready and clustered
  logrotated/52 active idle 10.246.166.217 Unit is ready.
  prometheus-grok-exporter/52 active idle 10.246.166.217 9144/tcp Unit is ready
  public-policy-routing/29 active idle 10.246.166.217 Unit is ready
  ubuntu-advantage/52 active idle 10.246.166.217 Attached (esm-apps,esm-infra)
==========

The expected behaviour is that the designate charms install successfully without having to be manually unblocked.

Looking at the crashdump, we see the following tracebacks in the designate-producer logs:
==========
2023-09-11 05:15:31.309 82268 INFO designate.service [-] Starting producer service (version: 14.0.2)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service [req-e7fe101f-7cdf-4091-8453-ea6003000020 - - - - -] Error starting thread.: tooz.coordination.ToozConnectionError: [Errno 111] ECONNREFUSED (with errno ECONNREFUSED [111])
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service Traceback (most recent call last):
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 43, in _failure_translator
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service yield
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 70, in wrapper
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service return func(*args, **kwargs)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 495, in heartbeat
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service self.client.set(self._encode_member_id(self._member_id),
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 1022, in set
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service return client.set(key, value, expire=expire, noreply=noreply,
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 328, in set
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service return self._store_cmd(b'set', {key: value}, expire, noreply,
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 880, in _store_cmd
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service self._connect()
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 285, in _connect
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service sock.connect(self.server)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 270, in connect
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service socket_checkerr(fd)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 54, in socket_checkerr
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service raise socket.error(err, errno.errorcode[err])
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service ConnectionRefusedError: [Errno 111] ECONNREFUSED
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service The above exception was the direct cause of the following exception:
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service Traceback (most recent call last):
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 806, in run_service
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service service.start()
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/designate/producer/service.py", line 78, in start
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service self.coordination.start()
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/designate/coordination.py", line 81, in start
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service self._coordinator.start(start_heart=True)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/coordination.py", line 689, in start
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service super(CoordinationDriverWithExecutor, self).start(start_heart)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/coordination.py", line 426, in start
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service self._start()
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 70, in wrapper
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service return func(*args, **kwargs)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 293, in _start
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service self.heartbeat()
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 70, in wrapper
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service return func(*args, **kwargs)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service self.gen.throw(type, value, traceback)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 57, in _failure_translator
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service utils.raise_with_cause(coordination.ToozConnectionError,
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/tooz/utils.py", line 224, in raise_with_cause
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 142, in raise_with_cause
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service raise exc_cls(message, *args, **kwargs) from kwargs.get('cause')
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service tooz.coordination.ToozConnectionError: [Errno 111] ECONNREFUSED (with errno ECONNREFUSED [111])
2023-09-11 05:15:31.986 82268 ERROR oslo_service.service
2023-09-11 05:15:32.118 82268 INFO designate.service [req-e7fe101f-7cdf-4091-8453-ea6003000020 - - - - -] Stopping producer service
2023-09-11 05:15:32.119 82268 CRITICAL designate [req-e7fe101f-7cdf-4091-8453-ea6003000020 - - - - -] Unhandled error: tooz.coordination.ToozConnectionError: [Errno 111] ECONNREFUSED (with errno ECONNREFUSED [111])
2023-09-11 05:15:32.119 82268 ERROR designate Traceback (most recent call last):
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 43, in _failure_translator
2023-09-11 05:15:32.119 82268 ERROR designate yield
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 70, in wrapper
2023-09-11 05:15:32.119 82268 ERROR designate return func(*args, **kwargs)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 393, in _leave_group
2023-09-11 05:15:32.119 82268 ERROR designate group_members, cas = self.client.gets(encoded_group)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 1078, in gets
2023-09-11 05:15:32.119 82268 ERROR designate return client.gets(key)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 516, in gets
2023-09-11 05:15:32.119 82268 ERROR designate return self._fetch_cmd(b'gets', [key], True).get(key, defaults)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 809, in _fetch_cmd
2023-09-11 05:15:32.119 82268 ERROR designate self._connect()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/pymemcache/client/base.py", line 285, in _connect
2023-09-11 05:15:32.119 82268 ERROR designate sock.connect(self.server)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 270, in connect
2023-09-11 05:15:32.119 82268 ERROR designate socket_checkerr(fd)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/eventlet/greenio/base.py", line 54, in socket_checkerr
2023-09-11 05:15:32.119 82268 ERROR designate raise socket.error(err, errno.errorcode[err])
2023-09-11 05:15:32.119 82268 ERROR designate ConnectionRefusedError: [Errno 111] ECONNREFUSED
2023-09-11 05:15:32.119 82268 ERROR designate
2023-09-11 05:15:32.119 82268 ERROR designate The above exception was the direct cause of the following exception:
2023-09-11 05:15:32.119 82268 ERROR designate
2023-09-11 05:15:32.119 82268 ERROR designate Traceback (most recent call last):
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/bin/designate-producer", line 10, in <module>
2023-09-11 05:15:32.119 82268 ERROR designate sys.exit(main())
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/designate/cmd/producer.py", line 45, in main
2023-09-11 05:15:32.119 82268 ERROR designate service.wait()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/designate/service.py", line 379, in wait
2023-09-11 05:15:32.119 82268 ERROR designate _launcher.wait()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 388, in wait
2023-09-11 05:15:32.119 82268 ERROR designate status, signo = self._wait_for_exit_or_signal()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 373, in _wait_for_exit_or_signal
2023-09-11 05:15:32.119 82268 ERROR designate self.stop()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 288, in stop
2023-09-11 05:15:32.119 82268 ERROR designate self.services.stop()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 761, in stop
2023-09-11 05:15:32.119 82268 ERROR designate service.stop()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/designate/producer/service.py", line 103, in stop
2023-09-11 05:15:32.119 82268 ERROR designate self.coordination.stop()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/designate/coordination.py", line 94, in stop
2023-09-11 05:15:32.119 82268 ERROR designate self._disable_grouping()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/designate/coordination.py", line 119, in _disable_grouping
2023-09-11 05:15:32.119 82268 ERROR designate leave_group_req.get()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tooz/coordination.py", line 664, in get
2023-09-11 05:15:32.119 82268 ERROR designate return self._fut.result(timeout=timeout)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
2023-09-11 05:15:32.119 82268 ERROR designate return self.__get_result()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
2023-09-11 05:15:32.119 82268 ERROR designate raise self._exception
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/futurist/_utils.py", line 52, in run
2023-09-11 05:15:32.119 82268 ERROR designate result = self.fn(*self.args, **self.kwargs)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 333, in wrapped_f
2023-09-11 05:15:32.119 82268 ERROR designate return self(f, *args, **kw)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 423, in __call__
2023-09-11 05:15:32.119 82268 ERROR designate do = self.iter(retry_state=retry_state)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 360, in iter
2023-09-11 05:15:32.119 82268 ERROR designate return fut.result()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
2023-09-11 05:15:32.119 82268 ERROR designate return self.__get_result()
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
2023-09-11 05:15:32.119 82268 ERROR designate raise self._exception
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 426, in __call__
2023-09-11 05:15:32.119 82268 ERROR designate result = fn(*args, **kwargs)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 70, in wrapper
2023-09-11 05:15:32.119 82268 ERROR designate return func(*args, **kwargs)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
2023-09-11 05:15:32.119 82268 ERROR designate self.gen.throw(type, value, traceback)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tooz/drivers/memcached.py", line 57, in _failure_translator
2023-09-11 05:15:32.119 82268 ERROR designate utils.raise_with_cause(coordination.ToozConnectionError,
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/tooz/utils.py", line 224, in raise_with_cause
2023-09-11 05:15:32.119 82268 ERROR designate excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2023-09-11 05:15:32.119 82268 ERROR designate File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 142, in raise_with_cause
2023-09-11 05:15:32.119 82268 ERROR designate raise exc_cls(message, *args, **kwargs) from kwargs.get('cause')
2023-09-11 05:15:32.119 82268 ERROR designate tooz.coordination.ToozConnectionError: [Errno 111] ECONNREFUSED (with errno ECONNREFUSED [111])
2023-09-11 05:15:32.119 82268 ERROR designate
==========

It is not immediately clear to me where the connection failure is coming from.
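
As a rough aid for reading long tracebacks like the ones above, the following Python sketch counts the exception types that terminate each traceback. It assumes the log line layout seen in the excerpt (date, time, PID, level, logger name, message); `summarize_exceptions` is a hypothetical helper name, not part of designate or oslo.

```python
import re
from collections import Counter

# Matches lines whose message starts with an exception class name, e.g.
# "<date> <time> <pid> ERROR <logger> ConnectionRefusedError: [Errno 111] ..."
# The layout is an assumption based on the log excerpt in this report.
_EXC_LINE = re.compile(
    r"^\S+ \S+ \d+ (?:ERROR|CRITICAL) \S+ "
    r"((?:\w+\.)*\w*(?:Error|Exception)): (.*)$"
)


def summarize_exceptions(lines):
    """Count exception class names that appear as the final line of a traceback."""
    counts = Counter()
    for line in lines:
        match = _EXC_LINE.match(line.rstrip("\n"))
        if match:
            counts[match.group(1)] += 1
    return counts
```

Run over the designate-producer log, this should reduce both tracebacks to the underlying `ConnectionRefusedError` and the `ToozConnectionError` that wraps it, which makes it easier to compare units.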

Crashdumps and configs for this test run can be found here: https://oil-jenkins.canonical.com/artifacts/715dc755-e475-42cf-8378-f57f13f10ac2/index.html

tags: added: cdo-qa foundations-engine
Moises Emilio Benzan Mora (moisesbenzan) wrote :

Future occurrences can be found at: https://solutions.qa.canonical.com/bugs/2035119

Alex Kavanagh (ajkavanagh) wrote :

The error is basically that designate-producer cannot connect to the memcached service: tooz reports connection refused (ECONNREFUSED). It's the same on all three units. As to why that is, there's not really sufficient information to go on.
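
One way to check for a refused connection from a designate unit is a plain TCP connect to the memcached endpoint. The sketch below is illustrative only: `probe_tcp` is a hypothetical helper, and the host and port have to be taken from the deployment's own configuration (11211 is memcached's default port, but this thread does not state which endpoint is in use).

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect and report 'open', 'refused', 'timeout' or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition as the ECONNREFUSED (errno 111) in the traceback:
        # the host is reachable but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets dropped by a firewall.
        return "timeout"
    except OSError as exc:
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

For example, `probe_tcp("<memcached address>", 11211)` returning "refused" would point at memcached not listening on that address, while "timeout" would suggest a network or firewall problem instead.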

It would probably be good to enable more verbosity in the logging by adding '-v -v' (...) to the command via the config option "extra-options", and to enable debug and verbose logging in the designate charm via the 'debug' and 'verbose' options.

Changed in charm-designate:
status: New → Incomplete
Bas de Bruijne (basdbruijne) wrote :

Thanks, Alex. I have an MR up to set the "debug" and "verbose" options on all supported charms for SQA testing going forward. That should help with LP #2032971 too. I will let you know when we have new occurrences with these options set.

Konstantinos Kaskavelis (kaskavel) wrote :

A new occurrence of this one with the debug and verbose options enabled:

https://solutions.qa.canonical.com/testruns/7955dc5f-527b-4be1-84f6-5152319858ec

Alex Kavanagh (ajkavanagh) wrote :

Please could we also have the configuration files for designate? It would be really useful if the crashdump collected /etc so that we can see how a service is configured.

Bas de Bruijne (basdbruijne) wrote :

I have a PR up to collect /etc/designate: https://github.com/juju/juju-crashdump/pull/104. I also collected /etc/designate/designate.conf off a different deployment that does not show this bug: https://pastebin.canonical.com/p/zTJP72KhVQ/. I assume all our deployments have the same designate configuration but I can't be sure. I'll let you know when the crashdump change is merged and we have some more data.
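
Once /etc/designate is collected, the coordination backend that designate-producer connects to could be read from designate.conf and compared across units. A minimal sketch, assuming the file has a `[coordination]` section with a `backend_url` option holding a `memcached://` URL (the thread does not show the file contents); `coordination_endpoint` is a hypothetical helper name:

```python
import configparser
from urllib.parse import urlsplit


def coordination_endpoint(conf_text):
    """Return (scheme, host, port) from [coordination]/backend_url, or None."""
    # strict=False tolerates repeated options, which oslo.config files may contain;
    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    url = parser.get("coordination", "backend_url", fallback=None)
    if not url:
        return None
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.port
```

The returned host and port are the endpoint to check on a failing unit, for instance to see whether memcached is actually listening there.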

Alex Kavanagh (ajkavanagh) wrote :

Thanks Bas! Much appreciated; this is a really strange one in that it's hard to see what's going on. It may be necessary to try and 'catch it in action' and debug on the deployed model, if possible?

Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Designate Charm because there has been no activity for 60 days.]

Changed in charm-designate:
status: Incomplete → Expired