external ceph cinder volume config breaks volumes on ussuri upgrade

Bug #1904062 reported by Alexander Diana
Affects         Status        Importance  Assigned to      Milestone
kolla-ansible   In Progress   High        Michal Nasiadka
  Ussuri        Triaged       High        Unassigned
  Victoria      Triaged       High        Unassigned
  Wallaby       In Progress   High        Michal Nasiadka

Bug Description

**Bug Report**

What happened:
When refactoring to use the new external Ceph templates in Ussuri, the cinder-volume agents came up under their own hosts, which results in three "different" storage hosts.

This leaves all pre-Ussuri volumes unmanageable, as they are still tied to rbd:volumes@rbd-1, and new volumes also become unmanageable if the agent on their host goes down.

What you expected to happen:

cinder-volume services should come up under a single host, so that a single node failure does not result in unmanageable volumes.

How to fix:
cinder.conf needs backend_host=rbd:volumes added to the rbd-1 backend as a sane default, which matches the previous recommendation and the expected behaviour.
This will make existing deployments work without changes and fix the single-node-failure problem of the current settings; see the sketch below.
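A minimal sketch of the resulting backend section, assuming the Train-era defaults (rbd-1 backend name, volumes pool, cinder user); only backend_host is new relative to the Ussuri template:

[rbd-1]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = rbd-1
# Pin every cinder-volume service to one logical host, as pre-Ussuri deployments expect
backend_host = rbd:volumes
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = {{ cinder_rbd_secret_uuid }}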

How to reproduce it (minimal and precise):

**Environment**:
* Kolla-Ansible version: stable/ussuri

Mark Goddard (mgoddard)
Changed in kolla-ansible:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Mark Goddard (mgoddard) wrote :

Train external ceph docs: https://docs.openstack.org/kolla-ansible/train/reference/storage/external-ceph-guide.html#cinder

[rbd-1]
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_user=cinder
backend_host=rbd:volumes
rbd_pool=volumes
volume_backend_name=rbd-1
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_secret_uuid = {{ cinder_rbd_secret_uuid }}

Ussuri made the integration simpler, adding the following to cinder.conf:

{% if cinder_backend_ceph | bool %}
[rbd-1]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = rbd-1
rbd_pool = {{ ceph_cinder_pool_name }}
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
rbd_store_chunk_size = 4
rados_connect_timeout = 5
rbd_user = {{ ceph_cinder_user }}
rbd_secret_uuid = {{ cinder_rbd_secret_uuid }}
report_discard_supported = True
image_upload_use_cinder_backend = True
{% endif %}

This is missing backend_host=rbd:volumes. There is a related TripleO bug [1], which explains that this option is used to set the same host for all backends in an environment where multiple cinder-volume services represent a single storage cluster.

[1] https://bugs.launchpad.net/bugs/1753596

summary: - external_ceph cinder-volume config break volumes on ussuri upgrade
+ external ceph cinder volume config breaks volumes on ussuri upgrade
Revision history for this message
Mark Goddard (mgoddard) wrote :

Actually, this OpenStack Ansible bug suggests that backend_host is not recommended: https://bugs.launchpad.net/cinder/+bug/1837403. We might need to set [DEFAULT] cluster to use active/active cinder-volume though.
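A minimal sketch of that alternative in cinder.conf; the cluster name is illustrative, and active/active cinder-volume additionally requires a tooz coordination backend:

[DEFAULT]
# All cinder-volume services sharing the Ceph backend join one cluster,
# so any of them can manage its volumes (the name is illustrative)
cluster = rbd-cluster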

Revision history for this message
Mark Goddard (mgoddard) wrote :

OSA highlights the 'cinder-manage volume update_host' command in their release notes. It's not clear to me what the right solution is at this point though.
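For reference, a sketch of how that command could be used to move existing volumes onto a shared host; both host values here are illustrative:

cinder-manage volume update_host \
    --currenthost controller1@rbd-1 \
    --newhost rbd:volumes@rbd-1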

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/763011

Changed in kolla-ansible:
assignee: nobody → Michal Nasiadka (mnasiadka)
status: Triaged → In Progress
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

The relevant openstack-discuss ML thread: http://lists.openstack.org/pipermail/openstack-discuss/2020-November/018838.html (thank you all for answering our questions!)

tags: added: ceph cinder rbd
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/808003

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/808004

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/808003
Committed: https://opendev.org/openstack/kolla-ansible/commit/f97e752018affcb81604230e7e9b0101960cec83
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit f97e752018affcb81604230e7e9b0101960cec83
Author: Radosław Piliszek <email address hidden>
Date: Fri Dec 18 21:46:35 2020 +0100

    [CI] Cinder upgrade testing

    To gain visibility into how our upgrades affect existing Cinder
    volumes, a new testing path is required.
    This patch adds it.

    Additionally, it refactors the repeated actions and ensures that
    we wait for volume deletions as well.

    Change-Id: Ic08d461e6fdf91c378a87860765a489c2f86d690
    Related-Bug: #1904062
    (cherry picked from commit 62b8c6b68413330da032d14b45c6fbd340ec9e2d)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (stable/ussuri)

Change abandoned by "Radosław Piliszek <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/808004
Reason: ussuri going EM, thus us not extending CI

Revision history for this message
Sven Kieske (s-kieske) wrote :

Hi,

can someone provide an update on this bug?

Because we hit this in real life deployments:

1. A volume has a specific OS controller node as `os-vol-host-attr:host`
2. That controller node goes into maintenance
3. The VM instance with the attached volume gets deleted
4. Nova throws: openstack.nova nova-compute c84d9828-1277-457a-828d-db7dc3c03216 [instance: d5832d72-0b70-422e-ba94-12b24f1a75e1] Ignoring unknown cinder exception for volume 615d4759-bed3-4a84-91a8-8fce612bfb2a: Gateway Time-out (HTTP 504): cinderclient.exceptions.ClientException: Gateway Time-out (HTTP 504)

The volume then still exists and still claims to be attached to a nonexistent VM.

We can of course clean this up manually.

AFAIU, kolla-ansible needs to add an active/active cinder deployment with a coordination service like Pacemaker or etcd?

Would it be possible to mimic what TripleO does? As far as I understand, they have implemented an active/active cinder deployment with an etcd coordinator.
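For reference, a minimal sketch of the cinder.conf coordination setting such an active/active setup relies on, paired with the [DEFAULT] cluster option mentioned earlier; the etcd endpoint is illustrative:

[coordination]
# tooz coordination backend used by active/active cinder-volume
# (the endpoint below is illustrative)
backend_url = etcd3+http://192.0.2.10:2379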

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/847151

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (stable/yoga)

Change abandoned by "Radosław Piliszek <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/847151
Reason: proper docs are the way to go

Tom Fifield (fifieldt)
tags: added: docs
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Michal Nasiadka <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/847151
