Cinder-volume may fail to start properly during deployment
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Cinder Charm | New | Undecided | Unassigned |
Bug Description
This issue seems to happen specifically when deploying cinder-volume on separate units.
The topology of Cinder is the following:
- Cinder API and Scheduler on LXD units
- Cinder Volume on baremetal units (for access to multiple iSCSI backends)
When deploying a bundle, one or more cinder-volume units may end up in 'blocked' status with a message complaining that the 'cinder-volume' process isn't running, which is exactly the issue.
In terms of versions:
- MaaS 3.1
- Juju 2.9.28
- Cinder charm from Charmhub's stable channel: revision 530
So far I've seen this happening from time to time on:
- Focal Wallaby and Focal Xena with a PowerStore iSCSI backend.
- Focal Ussuri with PureStorage as the iSCSI backend.
The workaround is simply to run 'sudo systemctl restart cinder-volume' on the unit, after which the deployment can finish properly.
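For reference, a targeted form of that workaround on a single blocked unit looks like this (the unit number is just an example):

    juju ssh cinder-volume/0 sudo systemctl restart cinder-volume
    juju ssh cinder-volume/0 systemctl is-active cinder-volume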
Looking at the logs, cinder-volume fails to find a working backend and terminates itself, which is normal behavior since it happens while the deployment is ongoing and the local/subordinate charms may not have finished installing yet.
I can see that the systemd unit is configured to restart the cinder-volume service if it fails to start, but for some reason it seems to stop retrying at some point (see attached journalctl log).
The most interesting part of both log files is between 17:01:00 and 17:07:23 (the time when I manually restarted the service with systemctl).
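My guess, and it is only an assumption since I haven't checked the shipped unit file, is that systemd's start-rate limiting kicks in: after too many failed starts in a short window the unit is left in a failed state and is no longer retried. The following can be run on an affected unit to check the restart policy and pull the relevant window from the journal:

    # show the restart policy and start-rate-limit settings of the service
    systemctl show cinder-volume.service -p Restart -p StartLimitBurst -p StartLimitIntervalUSec
    # check whether the unit gave up (failed state / start limit hit)
    systemctl status cinder-volume.service
    # extract the window mentioned above from the journal
    journalctl -u cinder-volume --since "17:00:00" --until "17:08:00"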
I've seen this with my current deployment.
Using the "enabled-services" option, we've got scheduler and api deployed in control containers and volume deployed on the bare-metal nodes, because we are using PureStorage iSCSI backends.
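For context, that split is roughly what the following would produce (the application names and values here are assumptions about our deployment, not copied from the bundle):

    # API and scheduler in the control containers, volume service on the bare-metal units
    juju config cinder enabled-services=api,scheduler
    juju config cinder-volume enabled-services=volume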
The cinder-volume units often seem to fail to start the cinder-volume service. I am not sure why, but it looks like it may be related to the backend not being ready yet (at first).
We left the deployment for several hours (from about 5pm until about 9am) and the services never restarted on their own.
Regardless of *why* it fails at first, it eventually succeeds when I manually start the cinder-volume service. Could the charm be more proactive and restart services that should be running but aren't?
As suggested, this work-around gets us past this:
juju run -a cinder-volume sudo systemctl restart cinder-volume
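A slightly more targeted variant, which only restarts the service where it is not already active, could look like this (just a sketch of the same workaround, not something the charm currently does):

    juju run -a cinder-volume 'systemctl is-active --quiet cinder-volume || sudo systemctl restart cinder-volume'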