hacluster for manila-ganesha gets stuck with "Resource: res_ganesha_xxx_vip not running"

Bug #1957738 reported by Yoshi Kadokawa
This bug affects 6 people
Affects: OpenStack Manila-Ganesha Charm
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

I have deployed manila and manila-ganesha, each across 3 LXD units with hacluster, as described in the official documentation[0].
However, the hacluster charm status gets stuck with the message "Resource: res_ganesha_bdb1339_vip not running".

$ juju status manila-ganesha
Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud 2.9.22 unsupported 04:08:40Z

App Version Status Scale Charm Store Channel Rev OS Message
hacluster-manila-ganesha blocked 3 hacluster charmstore stable 81 ubuntu Resource: res_ganesha_bdb1339_vip not running
manila-ganesha 15.2.14 active 3 manila-ganesha charmstore stable 20 ubuntu Unit is ready
manila-ganesha-mysql-router 8.0.27 active 3 mysql-router charmstore stable 15 ubuntu Unit is ready

Unit Workload Agent Machine Public address Ports Message
manila-ganesha/0* active idle 6/lxd/7 10.148.197.12 Unit is ready
  hacluster-manila-ganesha/2 blocked idle 10.148.197.12 Resource: res_ganesha_bdb1339_vip not running
  manila-ganesha-mysql-router/2 active idle 10.148.197.12 Unit is ready
manila-ganesha/1 active idle 7/lxd/7 10.148.197.29 Unit is ready
  hacluster-manila-ganesha/1 blocked idle 10.148.197.29 Resource: res_ganesha_bdb1339_vip not running
  manila-ganesha-mysql-router/1 active idle 10.148.197.29 Unit is ready
manila-ganesha/2 active idle 8/lxd/7 10.148.197.25 Unit is ready
  hacluster-manila-ganesha/0* active executing 10.148.197.25 Unit is ready and clustered
  manila-ganesha-mysql-router/0* active idle 10.148.197.25 Unit is ready

Machine State DNS Inst id Series AZ Message
6 started 10.148.196.229 controller-node-1 focal zone1 Deployed
6/lxd/7 started 10.148.197.12 juju-a80423-6-lxd-7 focal zone1 Container started
7 started 10.148.196.237 controller-node-5 focal zone2 Deployed
7/lxd/7 started 10.148.197.29 juju-a80423-7-lxd-7 focal zone2 Container started
8 started 10.148.196.235 controller-node-3 focal zone3 Deployed
8/lxd/7 started 10.148.197.25 juju-a80423-8-lxd-7 focal zone3 Container started

And here is the output from pacemaker. As you can see, not only res_ganesha_xxx_vip but also res_manila_share_manila_share is failing.

$ sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-a80423-8-lxd-7 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Thu Jan 13 04:14:17 2022
  * Last change: Thu Jan 13 04:14:13 2022 by root via crm_node on juju-a80423-6-lxd-7
  * 3 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ juju-a80423-6-lxd-7 juju-a80423-7-lxd-7 juju-a80423-8-lxd-7 ]

Full List of Resources:
  * res_manila_share_manila_share (systemd:manila-share): FAILED juju-a80423-6-lxd-7
  * res_nfs_ganesha_nfs_ganesha (systemd:nfs-ganesha): Stopping juju-a80423-6-lxd-7
  * Resource Group: grp_ganesha_vips:
    * res_ganesha_6cb3ded_vip (ocf::heartbeat:IPaddr2): Started juju-a80423-6-lxd-7
    * res_ganesha_959e94a_vip (ocf::heartbeat:IPaddr2): Started juju-a80423-6-lxd-7
    * res_ganesha_bdb1339_vip (ocf::heartbeat:IPaddr2): Stopped

Failed Resource Actions:
  * res_manila_share_manila_share_start_0 on juju-a80423-6-lxd-7 'error' (1): call=9627, status='complete', exitreason='', last-rc-change='2022-01-13 04:14:14Z', queued=0ms, exec=197ms

For now, I could work around it by executing the following.

$ juju run --unit manila-ganesha/leader -- sudo systemctl unmask manila-share

[0] https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/manila-ganesha.html

Revision history for this message
Yoshi Kadokawa (yoshikadokawa) wrote :

I have attached the pacemaker log as well.
In the log, I can see that the pacemaker resource is failing with the following error:

Could not issue start for res_manila_share_manila_share: Unit manila-share.service is masked.

This symptom is seen every time I deploy this environment.
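For reference (a sketch of the check rather than output captured from this deployment), the masked state can be confirmed on every unit with something like:

$ juju run --application manila-ganesha -- systemctl is-enabled manila-share
masked
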
BTW, since Vault is used for TLS termination of the API endpoints, the steps to reproduce are:

1. Deploy OpenStack with Vault
2. Unseal and activate Vault. Up to this point, hacluster-manila-ganesha is fine.
3. After the relation with Vault is completed, hacluster-manila-ganesha gets stuck. I have waited for more than 30 minutes, but it was still stuck.

description: updated
Revision history for this message
Felipe Reyes (freyes) wrote :

Hi Yoshi, I will try to reproduce this issue.

Changed in charm-manila-ganesha:
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Felipe Reyes (freyes) wrote :

Yoshi, I tried to reproduce this without success; the specific bundle I'm using can be found at https://paste.ubuntu.com/p/q6G3xhKRDd/

The latest stable version (21.10) includes this commit, https://opendev.org/openstack/charm-manila-ganesha/commit/b1cb18391ce8f913c5c64de1a8df88ae0308bbae, which precisely re-enables/unmasks the manila-share service, and the juju status posted in the bug description indicates you are using cs:manila-ganesha-20 (21.10).

I will give it another try using focal-wallaby, because with focal-xena things work ok.
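
For the record, the deployed charm revision can be double-checked with something along these lines (the output shown is illustrative, based on the status in the bug description):

$ juju status manila-ganesha --format yaml | grep 'charm:'
    charm: cs:manila-ganesha-20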

Revision history for this message
Felipe Reyes (freyes) wrote :

Do you have the /var/log/juju/ directory from one of the units where you found this issue?

Revision history for this message
Felipe Reyes (freyes) wrote :

focal-wallaby gave me the same results - https://pastebin.ubuntu.com/p/myrbqtFX9m/ - what I did find surprising, though, was that after unsealing Vault all of the manila-ganesha units had manila-share.service running, whereas before unsealing only one of them was running.

Revision history for this message
Felipe Reyes (freyes) wrote :

I will set this to Incomplete until the /var/log/juju from the manila-ganesha units (or a juju-crashdump) is provided.
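
If it helps, a crashdump can usually be collected with the juju-crashdump snap (exact flags may vary by version; "openstack" is the model name from the status output above):

$ sudo snap install juju-crashdump --classic
$ juju-crashdump -m openstack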

Changed in charm-manila-ganesha:
status: New → Incomplete
assignee: Felipe Reyes (freyes) → nobody
Revision history for this message
Yoshi Kadokawa (yoshikadokawa) wrote :

Hi Felipe,

Thank you for looking into this.
I can reproduce this with the following bundle.

https://pastebin.ubuntu.com/p/X79hSMdfYw/

As you can see from the bundle, I have configured a VIP for manila and manila-ganesha and also added a relation with Vault.
Before unsealing Vault, manila-ganesha works fine; however, after unsealing Vault, which updates the endpoints to SSL, this symptom appears.

I will add the juju-crashdump later once I have collected it.

Changed in charm-manila-ganesha:
status: Incomplete → New
Revision history for this message
Yoshi Kadokawa (yoshikadokawa) wrote :
Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1957738] Re: hacluster for manila-ganesha gets stuck with "Resource: res_ganesha_xxx_vip not running"

On Fri, 2022-02-04 at 02:29 +0000, Yoshi Kadokawa wrote:
> Hi Felipe,
>
> Thank you for looking into this.
> I can reproduce this with the following bundle.
>
> https://pastebin.ubuntu.com/p/X79hSMdfYw/
>
> As you can see from the bundle, I have configured a VIP for manila and
> manila-ganesha and also added relation with Vault.
> Before unsealing Vault, manila-ganesha works ok, however, after
> unsealing Vault which update the endpoint to SSL, then this symptom is
> seen.

I can reproduce the issue now; the piece of the puzzle I was
missing was the (direct) relation between manila-ganesha and vault.

I was going through the source of the charm and found no indication of
what it should do when it's related to vault; what it did was install
apache2 and render an empty configuration file:

$ sudo wc -l /etc/apache2/sites-enabled/openstack_https_frontend.conf
0 /etc/apache2/sites-enabled/openstack_https_frontend.conf

What's the expected setup when these two charms are related? I have
the feeling this is just an interface inherited from a layer and that
manila-ganesha doesn't actually use this functionality.

https://opendev.org/openstack/charm-manila-ganesha/src/branch/master/src/layer.yaml#L16-L18
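
To see whether this is consistent across units, the same check can be run everywhere (the output below is what I would expect per unit, not a capture):

$ juju run --application manila-ganesha -- wc -l /etc/apache2/sites-enabled/openstack_https_frontend.conf
0 /etc/apache2/sites-enabled/openstack_https_frontend.conf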

Changed in charm-manila-ganesha:
status: New → Incomplete
Revision history for this message
James Page (james-page) wrote :

Currently the manila-ganesha charm inherits from the openstack-api base layer - this implies access to keystone, endpoint registration, and TLS CA certs for trust.

Upstream documentation does detail configuration for access to the identity service, so let's assume this is actually required:

https://docs.openstack.org/manila/latest/install/install-share-ubuntu.html

That said, there is no need to enable an apache configuration with no content, so it looks like there is a bug in here somewhere.

The relation to vault would supply the CA cert for the deployment to enable trust - the charm appears to be using the default handler for this relation, so I suspect that is somewhat confusing things.
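
One way to see exactly what the vault relation supplied is to dump the relation data from a unit (the "certificates" endpoint name, relation id, and unit names below are assumptions for illustration, not taken from this environment):

$ juju run --unit manila-ganesha/0 -- 'relation-ids certificates'
certificates:58
$ juju run --unit manila-ganesha/0 -- 'relation-get -r certificates:58 - vault/0'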

Changed in charm-manila-ganesha:
status: Incomplete → New
status: New → Triaged
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote :

As Felipe was able to reproduce this issue, I'm marking this bug as Triaged and High.

Revision history for this message
Vern Hart (vern) wrote :

I have encountered this bug in my current deployment. This scenario occurs at deploy time when vault is related to the manila-ganesha application, but it also recurs when the signed cert from the CSR is uploaded into vault.

The workaround mentioned by Yoshi does indeed resolve the issue:

  juju run --unit manila-ganesha/leader -- sudo systemctl unmask manila-share

Revision history for this message
DUFOUR Olivier (odufourc) wrote :

Happened to me as well on an offline deployment...

After more testing, it appears that the manila-share service is masked on all manila-ganesha units.
It is therefore necessary to unmask the service on all of them; otherwise, if the leader unit goes down, the whole service goes down as well.

  juju run -a manila-ganesha systemctl unmask manila-share

And to have Corosync/Pacemaker try to start the service again right away, run this on one unit:
  juju ssh manila-ganesha/leader sudo crm resource cleanup
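
If you want to limit the cleanup to the failing resource (name taken from the crm status output in the bug description), crmsh also accepts the resource name, e.g.:
  juju ssh manila-ganesha/leader sudo crm resource cleanup res_manila_share_manila_share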

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

This is a very common issue in the SQA lab as well. We have also seen hacluster units going into this state after the OpenStack cluster has already been deployed. Let me know if we can do anything in terms of extra data collection to help debug the issue.
