Ceph-dashboard charm stuck waiting while status log says it's blocked

Bug #2032971 reported by Bas de Bruijne
Affects                 Status         Importance   Assigned to   Milestone
Canonical Juju          Invalid        Undecided    Unassigned
Ceph Dashboard Charm    Fix Released   High         Unassigned

Bug Description

I'm not sure how to word a good description for this bug.

In test run https://solutions.qa.canonical.com/testruns/86603a2f-7356-474a-8f94-57623e95c282/, which uses juju 2.9.44, the run times out because ceph-dashboard is stuck waiting.
Juju wait (I know, we need to switch to `juju wait-for`, but since juju wait calls `juju status` this is still a valid concern) says that 2 ceph-dashboard units have been stuck waiting since `2023-08-24 06:38:02+00:00`.
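For reference, the `juju wait-for` equivalent would be something along these lines (the exact query syntax here is an assumption on my part, not what our tooling currently runs):

juju wait-for application ceph-dashboard --query='status=="active"' --timeout=4h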
Indeed, when we dump the `juju status` output (timestamped at `10:29:48Z`), it shows the ceph-dashboard units waiting with the message "Charm configuration in progress".
In the juju-crashdump we collect the status logs, which show the following:

====
bas@bas:~/Downloads/f6067689-3d44-43e5-a5f4-c182fd6ae230/ceph-dashboard_2$ cat juju-show-status-log/ceph-dashboard_2
Time Type Status Message
24 Aug 2023 05:16:55Z juju-unit executing running loadbalancer-relation-changed hook for ceph-dashboard-loadbalancer/0
24 Aug 2023 05:16:59Z juju-unit executing running loadbalancer-relation-joined hook for ceph-dashboard-loadbalancer/1
24 Aug 2023 05:17:04Z juju-unit executing running loadbalancer-relation-changed hook for ceph-dashboard-loadbalancer/1
24 Aug 2023 05:17:09Z juju-unit executing running loadbalancer-relation-joined hook for ceph-dashboard-loadbalancer/2
24 Aug 2023 05:17:14Z juju-unit executing running loadbalancer-relation-changed hook for ceph-dashboard-loadbalancer/2
24 Aug 2023 05:17:15Z juju-unit idle
24 Aug 2023 05:19:05Z juju-unit executing running loadbalancer-relation-changed hook
24 Aug 2023 05:19:07Z juju-unit idle
24 Aug 2023 06:28:24Z juju-unit executing running certificates-relation-changed hook for vault/0
24 Aug 2023 06:28:25Z juju-unit idle
24 Aug 2023 06:29:19Z juju-unit executing running certificates-relation-changed hook for vault/0
24 Aug 2023 06:29:27Z juju-unit executing running dashboard-relation-changed hook for ceph-mon/1
24 Aug 2023 06:29:30Z juju-unit idle
24 Aug 2023 06:32:15Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/2
24 Aug 2023 06:32:16Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/1
24 Aug 2023 06:32:17Z juju-unit executing running radosgw-dashboard-relation-changed hook
24 Aug 2023 06:32:18Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/0
24 Aug 2023 06:32:20Z juju-unit idle
24 Aug 2023 06:32:41Z workload blocked Dashboard is not enabled
24 Aug 2023 10:33:02Z workload waiting Charm configuration in progress
====

This says that the charm was not waiting, but had actually been blocked since 06:32 and would still have been blocked when we dumped the juju status. According to this, it only switched to waiting a few minutes after we viewed the juju status.

This bug feels very similar to LP: #1992666. It may be an issue in the ceph-dashboard charm instead, but with the conflicting information from juju it is hard to open a bug there.

Extra logs and config files can be found here: https://oil-jenkins.canonical.com/artifacts/86603a2f-7356-474a-8f94-57623e95c282/index.html

tags: added: cdo-qa foundations-engine
Revision history for this message
Ian Booth (wallyworld) wrote :

Can you provide a snippet of the yaml status showing the application status data and the status data for each unit? It could be related to juju combining the unit statuses to get an overall app status when the app status is not explicitly set.
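For example, something like this would capture both the application status data and the per-unit status data:

juju status ceph-dashboard --format=yaml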

Revision history for this message
Bas de Bruijne (basdbruijne) wrote (last edit ):

We don't collect the yaml status, but the human-readable status looks like:

=====
ceph-mon/0* active idle 0/lxd/1 10.246.165.72 Unit is ready and clustered
  ceph-dashboard/1 waiting idle 10.246.165.72 Charm configuration in progress
  logrotated/14 active idle 10.246.165.72 Unit is ready.
  prometheus-grok-exporter/14 active idle 10.246.165.72 9144/tcp Unit is ready
  ubuntu-advantage/14 active idle 10.246.165.72 Attached (esm-apps,esm-infra)
ceph-mon/1 active idle 1/lxd/1 10.246.165.152 Unit is ready and clustered
  ceph-dashboard/2 waiting idle 10.246.165.152 Charm configuration in progress
  logrotated/20 active idle 10.246.165.152 Unit is ready.
  prometheus-grok-exporter/20 active idle 10.246.165.152 9144/tcp Unit is ready
  ubuntu-advantage/20 active idle 10.246.165.152 Attached (esm-apps,esm-infra)
ceph-mon/2 active idle 2/lxd/1 10.246.167.10 Unit is ready and clustered
  ceph-dashboard/0* active idle 10.246.167.10 Unit is ready
  logrotated/13 active idle 10.246.167.10 Unit is ready.
  prometheus-grok-exporter/13 active idle 10.246.167.10 9144/tcp Unit is ready
  ubuntu-advantage/13 active idle 10.246.167.10 Attached (esm-apps,esm-infra)
=====

There is a juju-show-unit output of each unit in the crashdump (https://oil-jenkins.canonical.com/artifacts/86603a2f-7356-474a-8f94-57623e95c282/generated/generated/openstack/juju-dump-db-openstack-2023-08-24-10.30.27.tar.gz). That probably contains the necessary information.

Edit: we do collect the yaml status, I put a snippet here: https://pastebin.canonical.com/p/wQThZC3W4y/

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

I'm seeing this bug again today, again with ceph-dashboard. I am wondering if the issue is on the ceph-dashboard side instead of juju. I'm adding the bug to both to get some eyes on it.

summary: - Charm stuck waiting while status log says it's blocked
+ Ceph-dashboard charm stuck waiting while status log says it's blocked
Revision history for this message
Jeffrey Chang (modern911) wrote (last edit ):

Two runs with the Juju 3.2.3 candidate hit this: https://solutions.qa.canonical.com/testruns/e731952a-d0f5-4708-a785-46fed0a803fa and https://solutions.qa.canonical.com/testruns/bec7a29c-4554-4b12-b6bf-8a50ea02a9d6. All previous runs were with juju 2.9.x.

Logs
DEBUG:root:ceph-dashboard/1 workload status is waiting since 2023-09-07 08:27:46+00:00
DEBUG:root:ceph-dashboard/2 workload status is waiting since 2023-09-07 08:26:21+00:00
The runs eventually time out after 4 hours.

From juju-show-status-log/ceph-dashboard_2
07 Sep 2023 07:22:23Z juju-unit executing running loadbalancer-relation-changed hook
07 Sep 2023 07:22:24Z juju-unit idle
07 Sep 2023 08:16:38Z workload blocked Dashboard is not enabled
07 Sep 2023 08:22:22Z juju-unit executing running certificates-relation-changed hook for vault/0
07 Sep 2023 08:22:26Z juju-unit executing running dashboard-relation-changed hook for ceph-mon/0
07 Sep 2023 08:22:33Z juju-unit executing running certificates-relation-changed hook for vault/0
07 Sep 2023 08:22:35Z juju-unit idle
07 Sep 2023 08:24:26Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/2
07 Sep 2023 08:24:27Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/1
07 Sep 2023 08:24:29Z juju-unit executing running radosgw-dashboard-relation-changed hook
07 Sep 2023 08:24:30Z juju-unit executing running radosgw-dashboard-relation-changed hook for ceph-radosgw/0
07 Sep 2023 08:24:31Z juju-unit idle
07 Sep 2023 12:26:36Z workload waiting Charm configuration in progress

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

It's the non-leader units that are blocking.

Sadly there's insufficient information in the logs to work out what is or isn't going on:

ceph-mon/0 active idle 0/lxd/1 10.246.164.141 Unit is ready and clustered
  ceph-dashboard/2 waiting idle 10.246.164.141 Charm configuration in progress
  logrotated/24 active idle 10.246.164.141 Unit is ready.
  prometheus-grok-exporter/24 active idle 10.246.164.141 9144/tcp Unit is ready
  ubuntu-advantage/24 active idle 10.246.164.141 Attached (esm-apps,esm-infra)
ceph-mon/1 active idle 1/lxd/1 10.246.165.116 Unit is ready and clustered
  ceph-dashboard/1 waiting idle 10.246.165.116 Charm configuration in progress
  logrotated/11 active idle 10.246.165.116 Unit is ready.
  prometheus-grok-exporter/11 active idle 10.246.165.116 9144/tcp Unit is ready
  ubuntu-advantage/11 active idle 10.246.165.116 Attached (esm-apps,esm-infra)
ceph-mon/2* active idle 2/lxd/1 10.246.164.168 Unit is ready and clustered
  ceph-dashboard/0* active idle 10.246.164.168 Unit is ready
  logrotated/7 active idle 10.246.164.168 Unit is ready.
  prometheus-grok-exporter/7 active idle 10.246.164.168 9144/tcp Unit is ready
  ubuntu-advantage/7 active idle 10.246.164.168 Attached (esm-apps,esm-infra)

It would probably be useful to switch the charm's debug config to true to see if that elicits any additional information.
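For example, assuming the charm's option is indeed named `debug`, something like:

juju config ceph-dashboard debug=true
juju debug-log --replay --include ceph-dashboard/1

should surface any extra logging from the stuck units.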

Changed in charm-ceph-dashboard:
status: New → Incomplete
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

The message "Charm configuration in progress" is actually generated here: https://github.com/openstack/charm-ops-openstack/blob/48a2ce2fb90d8725b10f5435a948358b99375410/ops_openstack/core.py#L199

My assumption is that self._stored.is_started is therefore still false.
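
For context, that code path is roughly of the following shape (a paraphrased sketch, not the verbatim core.py): update_status() keeps reporting "Charm configuration in progress" until the stored is_started flag is flipped.

# Paraphrased sketch of the status derivation in ops_openstack (not verbatim):
# until the charm sets its stored is_started flag, update_status() keeps the
# unit in WaitingStatus("Charm configuration in progress").
from ops.charm import CharmBase
from ops.framework import StoredState
from ops.model import ActiveStatus, WaitingStatus


class StatusSketchCharm(CharmBase):
    _stored = StoredState()

    def __init__(self, *args):
        super().__init__(*args)
        self._stored.set_default(is_started=False)

    def update_status(self):
        if not self._stored.is_started:
            self.unit.status = WaitingStatus('Charm configuration in progress')
            return
        self.unit.status = ActiveStatus('Unit is ready')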

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

More thoughts:

In the code in the stable/quincy.2 charm:

    def _configure_dashboard(self, event) -> None:
        """Configure dashboard"""
        self.request_certificates()
        if not self.mon.mons_ready:
            logging.info("Not configuring dashboard, mons not ready")
            return
        if not ceph_utils.is_dashboard_enabled():
            if self.unit.is_leader():
                ceph_utils.mgr_enable_dashboard()
            else:
                logging.info("Dashboard not enabled, deferring event.")
                return

...

        self._register_dashboards()
        self._manage_radosgw()
        self._manage_iscsigw()
        self._stored.is_started = True
        self.update_status()

This appears to be the only place that self._stored.is_started is set to true; but this would mean that a non-leader unit could never get to this point.

The master branch is the same. I wonder if the gate tests actually miss this issue. I'll try a local deploy and see what I get from the stable/quincy.2 charm.

Changed in charm-ceph-dashboard:
status: Incomplete → In Progress
importance: Undecided → High
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

The issue seems to be that the 'return' in the else branch of the "if self.unit.is_leader()" check in:

    def _configure_dashboard(self, event) -> None:
        """Configure dashboard"""
        self.request_certificates()
        if not self.mon.mons_ready:
            logging.info("Not configuring dashboard, mons not ready")
            return
        if not ceph_utils.is_dashboard_enabled():
            if self.unit.is_leader():
                ceph_utils.mgr_enable_dashboard()
            else:
                logging.info("Dashboard not enabled, deferring event.")
                return

may mean that the subordinate units "run out of events" with which to then go on and run the rest of the _configure_dashboard() function, so self._stored.is_started never gets set to true.

In order to guarantee that it does occur, I will try adding a peer relation and setting a nonce (uuid4) when the leader enables the dashboard. That will trigger a peer-relation-changed hook on the subordinate units, giving them a hook execution in which to actually run the rest of the function after the leader has enabled the dashboard. (This assumes that ceph_utils.mgr_enable_dashboard() isn't async!)
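
Roughly what I have in mind (an illustrative sketch only; the 'cluster' peer relation name and the 'dashboard-nonce' key are placeholders, not necessarily what will land in the review):

# Illustrative sketch of the nonce-over-peer-relation idea; the 'cluster'
# peer relation and the 'dashboard-nonce' key are placeholder names, and a
# matching peers entry would also be needed in metadata.yaml.
import uuid

from ops.charm import CharmBase


class NonceSketchCharm(CharmBase):

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.cluster_relation_changed, self._on_cluster_changed)

    def announce_dashboard_enabled(self):
        # Called on the leader right after ceph_utils.mgr_enable_dashboard();
        # bumping the app data triggers cluster-relation-changed on the peers.
        relation = self.model.get_relation('cluster')
        if relation and self.unit.is_leader():
            relation.data[self.app]['dashboard-nonce'] = str(uuid.uuid4())

    def _on_cluster_changed(self, event):
        # Non-leader units get an event here after the leader has enabled the
        # dashboard, and can re-run the charm's existing _configure_dashboard()
        # path, which finally sets self._stored.is_started.
        self._configure_dashboard(event)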

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-dashboard (stable/quincy.2)

Related fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896418

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-dashboard (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-dashboard (master)

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896726
Reason: Abandoning in favour of the patches below. With SQA's testing we did not observe this issue anymore once we applied these patches:

https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896617
https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896742

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-dashboard (stable/quincy.2)

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896418

Revision history for this message
Joseph Phillips (manadart) wrote :

Can we confirm that this does not affect Juju?

Changed in juju:
status: New → Incomplete
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We haven't seen this issue in 2 weeks. I can confirm that no action is needed on the juju side. The ceph-dashboard task shows the status "In Progress", but I think it is actually "Fix Released".

Changed in juju:
status: Incomplete → Invalid
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

This was resolved in a different way from my fix. I'm setting it to Fix Released; please re-open if it recurs.

Changed in charm-ceph-dashboard:
assignee: Alex Kavanagh (ajkavanagh) → nobody
status: In Progress → Fix Released
Revision history for this message
Nobuto Murata (nobuto) wrote :