Ceph-dashboard charm stuck waiting while status log says it's blocked

Bug #2032971 reported by Bas de Bruijne
Affects                 Status         Importance   Assigned to   Milestone
Canonical Juju          Invalid        Undecided    Unassigned
Ceph Dashboard Charm    Fix Released   High         Unassigned

Bug Description

I'm not sure how to word a good description for this bug.

In test run https://solutions.qa.canonical.com/testruns/86603a2f-7356-474a-8f94-57623e95c282/, which uses juju 2.9.44, the run times out because ceph-dashboard is stuck waiting.
Juju wait (I know, we need to switch to `juju wait-for`, but since juju wait calls `juju status` this is still a valid concern) says that 2 ceph-dashboard units have been stuck waiting since `2023-08-24 06:38:02+00:00`.
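For reference, the `juju wait-for` equivalent would be something along these lines (the exact query syntax here is an assumption on my part, not what our tooling currently runs):

juju wait-for application ceph-dashboard --query='status=="active"' --timeout=4h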
Indeed, when we dump the `juju status` output (timestamped at `10:29:48Z`), it shows the ceph-dashboard units waiting with the message "Charm configuration in progress".
In the juju-crashdump we collect the status logs, which show the following:

====
bas@bas:~/Downloads/f6067689-3d44-43e5-a5f4-c182fd6ae230/ceph-dashboard_2$ cat juju-show-status-log/ceph-dashboard_2
Time Type Status Message
24 Aug 2023 05:16:55Z juju-unit executing running loadbalancer-relation-changed hook for ceph-dashboard-loadbalancer/0
24 Aug 2023 05:16:59Z juju-unit executing running loadbalancer-relation-joined hook for ceph-dashboard-loadbalancer/1
24 Aug 2023 05:17:04Z juju-unit executing running loadbalancer-relation-changed hook for ceph-dashboard-loadbalancer/1
24 Aug 2023 05:17:09Z juju-unit executing running loadbalancer-relation-joined hook for ceph-dashboard-loadbalancer/2
24 Aug 2023 05:17:14Z juju-unit executing running loadbalancer-relation-changed hook for ceph-dashboard-loadbalancer/2
24 Aug 2023 05:17:15Z juju-unit idle
24 Aug 2023 05:19:05Z juju-unit executing running loadbalancer-relation-changed hook
24 Aug 2023 05:19:07Z juju-unit idle
24 Aug 2023 06:28:24Z juju-unit executing running certificates-relation-changed hook for vault/0
24 Aug 2023 06:28:25Z juju-unit idle
24 Aug 2023 06:29:19Z juju-unit executing running certificates-relation-changed hook for vault/0
24 Aug 2023 06:29:27Z juju-unit executing running dashboard-relation-changed hook for ceph-mon/1
24 Aug 2023 06:29:30Z juju-unit idle
24 Aug 2023 06:32:15Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/2
24 Aug 2023 06:32:16Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/1
24 Aug 2023 06:32:17Z juju-unit executing running radosgw-dashboard-relation-changed hook
24 Aug 2023 06:32:18Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/0
24 Aug 2023 06:32:20Z juju-unit idle
24 Aug 2023 06:32:41Z workload blocked Dashboard is not enabled
24 Aug 2023 10:33:02Z workload waiting Charm configuration in progress
====

This says that the charm was not waiting, but had actually been blocked since 06:32 and would still have been blocked when we dumped the juju status. According to this, it only switched to waiting a few minutes after we viewed the juju status.

This bug feels very similar to LP: #1992666. It may be an issue in the ceph-dashboard charm instead, but with the conflicting information from juju it is hard to open a bug there.

Extra logs and config files can be found here: https://oil-jenkins.canonical.com/artifacts/86603a2f-7356-474a-8f94-57623e95c282/index.html

tags: added: cdo-qa foundations-engine
Revision history for this message
Ian Booth (wallyworld) wrote :

Can you provide a snippet of the yaml status showing the application status data and the status data for each unit? It could be related to juju combining the unit statuses to get an overall app status when the app status is not explicitly set.
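For example, something like this would capture both the application status data and the per-unit status data:

juju status ceph-dashboard --format=yaml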

Revision history for this message
Bas de Bruijne (basdbruijne) wrote (last edit ):

We don't collect the yaml status, but the human-readable status looks like:

=====
ceph-mon/0* active idle 0/lxd/1 10.246.165.72 Unit is ready and clustered
  ceph-dashboard/1 waiting idle 10.246.165.72 Charm configuration in progress
  logrotated/14 active idle 10.246.165.72 Unit is ready.
  prometheus-grok-exporter/14 active idle 10.246.165.72 9144/tcp Unit is ready
  ubuntu-advantage/14 active idle 10.246.165.72 Attached (esm-apps,esm-infra)
ceph-mon/1 active idle 1/lxd/1 10.246.165.152 Unit is ready and clustered
  ceph-dashboard/2 waiting idle 10.246.165.152 Charm configuration in progress
  logrotated/20 active idle 10.246.165.152 Unit is ready.
  prometheus-grok-exporter/20 active idle 10.246.165.152 9144/tcp Unit is ready
  ubuntu-advantage/20 active idle 10.246.165.152 Attached (esm-apps,esm-infra)
ceph-mon/2 active idle 2/lxd/1 10.246.167.10 Unit is ready and clustered
  ceph-dashboard/0* active idle 10.246.167.10 Unit is ready
  logrotated/13 active idle 10.246.167.10 Unit is ready.
  prometheus-grok-exporter/13 active idle 10.246.167.10 9144/tcp Unit is ready
  ubuntu-advantage/13 active idle 10.246.167.10 Attached (esm-apps,esm-infra)
=====

There is a juju-show-unit output of each unit in the crashdump (https://oil-jenkins.canonical.com/artifacts/86603a2f-7356-474a-8f94-57623e95c282/generated/generated/openstack/juju-dump-db-openstack-2023-08-24-10.30.27.tar.gz). That probably contains the necessary information.

Edit: we do collect the yaml status, I put a snippet here: https://pastebin.canonical.com/p/wQThZC3W4y/

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

I'm seeing this bug again today, again with ceph-dashboard. I am wondering if the issue is on the ceph-dashboard side instead of juju. I'm adding the bug to both to get some eyes on it.

summary: - Charm stuck waiting while status log says it's blocked
+ Ceph-dashboard charm stuck waiting while status log says it's blocked
Revision history for this message
Jeffrey Chang (modern911) wrote (last edit ):

Two runs with the Juju 3.2.3 candidate hit this: https://solutions.qa.canonical.com/testruns/e731952a-d0f5-4708-a785-46fed0a803fa and https://solutions.qa.canonical.com/testruns/bec7a29c-4554-4b12-b6bf-8a50ea02a9d6. All previous runs were with juju 2.9.x.

Logs
DEBUG:root:ceph-dashboard/1 workload status is waiting since 2023-09-07 08:27:46+00:00
DEBUG:root:ceph-dashboard/2 workload status is waiting since 2023-09-07 08:26:21+00:00
The runs eventually time out after 4 hours.

From juju-show-status-log/ceph-dashboard_2
07 Sep 2023 07:22:23Z juju-unit executing running loadbalancer-relation-changed hook
07 Sep 2023 07:22:24Z juju-unit idle
07 Sep 2023 08:16:38Z workload blocked Dashboard is not enabled
07 Sep 2023 08:22:22Z juju-unit executing running certificates-relation-changed hook for vault/0
07 Sep 2023 08:22:26Z juju-unit executing running dashboard-relation-changed hook for ceph-mon/0
07 Sep 2023 08:22:33Z juju-unit executing running certificates-relation-changed hook for vault/0
07 Sep 2023 08:22:35Z juju-unit idle
07 Sep 2023 08:24:26Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/2
07 Sep 2023 08:24:27Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/1
07 Sep 2023 08:24:29Z juju-unit executing running radosgw-dashboard-relation-changed hook
07 Sep 2023 08:24:30Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/0
07 Sep 2023 08:24:31Z juju-unit idle
07 Sep 2023 12:26:36Z workload waiting Charm configuration in progress

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

It's the non-leader units that are blocking.

Sadly there's insufficient information in the logs to work out what is or isn't going on:

ceph-mon/0 active idle 0/lxd/1 10.246.164.141 Unit is ready and clustered
  ceph-dashboard/2 waiting idle 10.246.164.141 Charm configuration in progress
  logrotated/24 active idle 10.246.164.141 Unit is ready.
  prometheus-grok-exporter/24 active idle 10.246.164.141 9144/tcp Unit is ready
  ubuntu-advantage/24 active idle 10.246.164.141 Attached (esm-apps,esm-infra)
ceph-mon/1 active idle 1/lxd/1 10.246.165.116 Unit is ready and clustered
  ceph-dashboard/1 waiting idle 10.246.165.116 Charm configuration in progress
  logrotated/11 active idle 10.246.165.116 Unit is ready.
  prometheus-grok-exporter/11 active idle 10.246.165.116 9144/tcp Unit is ready
  ubuntu-advantage/11 active idle 10.246.165.116 Attached (esm-apps,esm-infra)
ceph-mon/2* active idle 2/lxd/1 10.246.164.168 Unit is ready and clustered
  ceph-dashboard/0* active idle 10.246.164.168 Unit is ready
  logrotated/7 active idle 10.246.164.168 Unit is ready.
  prometheus-grok-exporter/7 active idle 10.246.164.168 9144/tcp Unit is ready
  ubuntu-advantage/7 active idle 10.246.164.168 Attached (esm-apps,esm-infra)

It would probably be useful to switch the charm's debug config to true to see if that elicits any additional information.
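For example, assuming the charm's option is indeed named `debug`, something like:

juju config ceph-dashboard debug=true
juju debug-log --replay --include ceph-dashboard/1

should surface any extra logging from the stuck units.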

Changed in charm-ceph-dashboard:
status: New → Incomplete
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

The message "Charm configuration in progress" is actually generated here: https://github.com/openstack/charm-ops-openstack/blob/48a2ce2fb90d8725b10f5435a948358b99375410/ops_openstack/core.py#L199

My assumption is that self._stored.is_started is therefore still false.
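
For context, that code path is roughly of the following shape (a paraphrased sketch, not the verbatim core.py): update_status() keeps reporting "Charm configuration in progress" until the stored is_started flag is flipped.

# Paraphrased sketch of the status derivation in ops_openstack (not verbatim):
# until the charm sets its stored is_started flag, update_status() keeps the
# unit in WaitingStatus("Charm configuration in progress").
from ops.charm import CharmBase
from ops.framework import StoredState
from ops.model import ActiveStatus, WaitingStatus


class StatusSketchCharm(CharmBase):
    _stored = StoredState()

    def __init__(self, *args):
        super().__init__(*args)
        self._stored.set_default(is_started=False)

    def update_status(self):
        if not self._stored.is_started:
            self.unit.status = WaitingStatus('Charm configuration in progress')
            return
        self.unit.status = ActiveStatus('Unit is ready')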

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

More thoughts:

In the code in the stable/quincy.2 charm:

    def _configure_dashboard(self, event) -> None:
        """Configure dashboard"""
        self.request_certificates()
        if not self.mon.mons_ready:
            logging.info("Not configuring dashboard, mons not ready")
            return
        if not ceph_utils.is_dashboard_enabled():
            if self.unit.is_leader():
                ceph_utils.mgr_enable_dashboard()
            else:
                logging.info("Dashboard not enabled, deferring event.")
                return

...

        self._register_dashboards()
        self._manage_radosgw()
        self._manage_iscsigw()
        self._stored.is_started = True
        self.update_status()

This appears to be the only place that self._stored.is_started is set to true; but this would mean that a non-leader unit could never get to this point.

The master branch is the same. I wonder if the gate tests actually miss this issue. I'll try a local deploy and see what I get from the stable/quincy.2 charm.

Changed in charm-ceph-dashboard:
status: Incomplete → In Progress
importance: Undecided → High
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

The issue seems to be that the 'return' in the else branch of the "if self.unit.is_leader()" check in:

    def _configure_dashboard(self, event) -> None:
        """Configure dashboard"""
        self.request_certificates()
        if not self.mon.mons_ready:
            logging.info("Not configuring dashboard, mons not ready")
            return
        if not ceph_utils.is_dashboard_enabled():
            if self.unit.is_leader():
                ceph_utils.mgr_enable_dashboard()
            else:
                logging.info("Dashboard not enabled, deferring event.")
                return

may mean that the subordinate units "run out of events" with which to then go on and run the rest of the _configure_dashboard() function, so self._stored.is_started never gets set to true.

In order to guarantee that it does occur, I will try adding a peer relation and setting a nonce (uuid4) when the leader enables the dashboard. That will trigger a peer-relation-changed hook on the subordinate units, giving them a hook execution in which to actually run the rest of the function after the leader has enabled the dashboard. (This assumes that ceph_utils.mgr_enable_dashboard() isn't async!)
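
Roughly what I have in mind (an illustrative sketch only; the 'cluster' peer relation name and the 'dashboard-nonce' key are placeholders, not necessarily what will land in the review):

# Illustrative sketch of the nonce-over-peer-relation idea; the 'cluster'
# peer relation and the 'dashboard-nonce' key are placeholder names, and a
# matching peers entry would also be needed in metadata.yaml.
import uuid

from ops.charm import CharmBase


class NonceSketchCharm(CharmBase):

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.cluster_relation_changed, self._on_cluster_changed)

    def announce_dashboard_enabled(self):
        # Called on the leader right after ceph_utils.mgr_enable_dashboard();
        # bumping the app data triggers cluster-relation-changed on the peers.
        relation = self.model.get_relation('cluster')
        if relation and self.unit.is_leader():
            relation.data[self.app]['dashboard-nonce'] = str(uuid.uuid4())

    def _on_cluster_changed(self, event):
        # Non-leader units get an event here after the leader has enabled the
        # dashboard, and can re-run the charm's existing _configure_dashboard()
        # path, which finally sets self._stored.is_started.
        self._configure_dashboard(event)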

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-dashboard (stable/quincy.2)

Related fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896418

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-dashboard (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-dashboard (master)

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896726
Reason: Abandoning in favour of the patches below. With SQA's testing we did not observe this issue anymore once we applied these patches:

https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896617
https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896742

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-dashboard (stable/quincy.2)

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896418

Revision history for this message
Joseph Phillips (manadart) wrote :

Can we confirm that this does not affect Juju?

Changed in juju:
status: New → Incomplete
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We haven't seen this issue in 2 weeks. I can confirm that no action is needed on the juju side. The ceph-dashboard task shows the status "In Progress", but I think it is actually "Fix Released".

Changed in juju:
status: Incomplete → Invalid
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This was resolved in a different way from my fix. I'm setting it to Fix Released; please re-open if it recurs.

Changed in charm-ceph-dashboard:
assignee: Alex Kavanagh (ajkavanagh) → nobody
status: In Progress → Fix Released
Revision history for this message
Nobuto Murata (nobuto) wrote :