Cinder

Race condition when deleting snapshots for backup with deferred deletion enabled

Bug #2012622 reported by Enrico Bocchi on 2023-03-23

This bug affects 2 people

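=======  ======  ==========  ===========  =========
Affects  Status  Importance  Assigned to  Milestone
=======  ======  ==========  ===========  =========
Cinder   New     Undecided   Unassigned
=======  ======  ==========  ===========  =========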

Description
===========

The Cinder controller provides the ability to back up volumes. An "in-use" volume can still be backed up by forcing the operation with `--force`, which creates a temporary snapshot and a temporary volume to back up from. When the backup process completes, Cinder deletes the intermediate snapshot and volume.

The deletion of the snapshot may fail when deferred deletion is enabled, because the temporary volume, which is a child of the snapshot, prevents the snapshot from being deleted.
The temporary volume remains in the "error_deleting" state, although it is eventually deleted once the asynchronous trash purging kicks in.

This issue has been identified with:
- Cinder 18.1.0 (and is still present in master)
- Ceph RBD Pacific, 16.2.9

Steps to reproduce
==================
* Configure one Ceph RBD cluster to be used with Cinder for the provisioning of volumes and enable deferred deletion.
* Make a backup of an in-use volume using the `--force` flag. This generates a snapshot and a temporary volume (created from the snapshot) that is used to make the backup.
* Once the backup process completes, Cinder tries to delete the temporary volume and the snapshot to clean up. Because deferred deletion is enabled, the temporary volume is moved to the trash instead of being deleted immediately.
* When Cinder tries to unprotect and delete the snapshot of the original volume, Ceph librbd refuses and returns "[errno 16] RBD image is busy", because the temporary volume in the trash is still a child of the snapshot.
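
The ordering problem in the steps above can be sketched in Python. This is a toy in-memory model, not the librbd or Cinder API: `ToyPool`, `ImageBusy`, and `backup_cleanup` are hypothetical names, and the only rule modeled is the one described in this report, namely that a snapshot cannot be removed while a child image still exists, including a child that sits in the trash.

```python
class ImageBusy(Exception):
    """Stands in for the '[errno 16] RBD image is busy' error (hypothetical)."""


class ToyPool:
    """Toy model of a pool: snapshots, live clones, and trashed clones."""

    def __init__(self):
        self.snapshots = set()  # snapshot names
        self.clones = {}        # live clone name -> parent snapshot
        self.trash = {}         # trashed clone name -> parent snapshot

    def create_snapshot(self, snap):
        self.snapshots.add(snap)

    def clone(self, snap, name):
        self.clones[name] = snap

    def trash_move(self, name):
        # Deferred deletion: the image is no longer listed as live,
        # but it still exists and still references its parent snapshot.
        self.trash[name] = self.clones.pop(name)

    def trash_purge(self):
        # Asynchronous purge: trashed images are really gone afterwards.
        self.trash.clear()

    def remove_snapshot(self, snap):
        # Children in the trash count as children too.
        everything = {**self.clones, **self.trash}
        children = sorted(n for n, parent in everything.items() if parent == snap)
        if children:
            raise ImageBusy(f"snapshot {snap!r} still has children: {children}")
        self.snapshots.discard(snap)


def backup_cleanup(pool, snap, temp_volume):
    """Replay the cleanup order from the report and return the outcome."""
    pool.trash_move(temp_volume)   # temporary volume goes to the trash
    try:
        pool.remove_snapshot(snap)  # attempted right away, before any purge
    except ImageBusy:
        return "snapshot left behind"
    return "clean"
```

In this model, calling `backup_cleanup` on a pool holding one snapshot and one clone of it leaves the snapshot behind, while calling `remove_snapshot` again after `trash_purge` succeeds. That matches the behavior described here only under the stated assumption about trashed children.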

Expected result
===============
When a backup completes successfully, the cleanup procedure deletes the temporary volume and the temporary snapshot, leaving only the original volume and its backup.

Actual result
=============
* The temporary volume is eventually deleted from the trash by the asynchronous trash purging, but it remains in the Cinder volume list with state "error_deleting".
* The snapshot is never visible through `volume snapshot list`, but it remains on Ceph RBD because Cinder failed to unprotect and remove it (the temporary volume was blocking it).
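
An operator who wants to check for this state could compare what Cinder reports with what exists on RBD. The sketch below is a hedged illustration working on plain Python data only: `find_leftovers` is a hypothetical helper, and the `snapshot-<id>` naming of RBD snapshots is an assumption about the driver's naming convention, not something stated in this report. A name flagged here is only a candidate for manual inspection.

```python
def find_leftovers(cinder_volumes, cinder_snapshot_ids, rbd_snapshot_names):
    """Return (stuck volume ids, candidate orphaned RBD snapshot names).

    cinder_volumes: iterable of (volume_id, status) pairs from Cinder.
    cinder_snapshot_ids: ids of the snapshots Cinder still lists.
    rbd_snapshot_names: snapshot names found on the RBD image.
    """
    # Volumes that Cinder could not finish deleting.
    stuck = [vid for vid, status in cinder_volumes if status == "error_deleting"]

    # Assumption: an RBD snapshot created by Cinder is named "snapshot-<id>".
    known = {f"snapshot-{sid}" for sid in cinder_snapshot_ids}

    # Snapshots present on RBD that Cinder no longer lists.
    orphaned = [name for name in rbd_snapshot_names if name not in known]
    return stuck, orphaned
```

For example, `find_leftovers([("v1", "in-use"), ("v2", "error_deleting")], ["a"], ["snapshot-a", "snapshot-b"])` reports `v2` as stuck and `snapshot-b` as a candidate leftover.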

Further comments
================
- I am attaching a patch against master (last commit for rbd.py being c827e5f8867fe71ca121b5671284b852c218aa23)
- Patch on master: https://github.com/ebocchi/cinder/commit/c5879197c74aaacc5675204b307224669e449429
- Patch on 18.1.0: https://github.com/ebocchi/cinder/commit/18414f37448db92db93f49a338345636d0bdac90

Enrico Bocchi (ebocchi) wrote on 2023-03-23:

Attachment: rbd.py (107.8 KiB, text/plain)

Enrico Bocchi (ebocchi) updated the description on 2023-03-23.

Christian Rohmann (christian-rohmann) wrote on 2023-04-13:

Thanks for reporting this and also writing a patch to fix the issue.
OpenStack (Cinder) uses Gerrit for changes / bugfixes, see https://docs.openstack.org/contributors/code-and-documentation/quick-start.html for a quickstart.

If you push a change there for review, it will appear in this bug automagically as a proposed fix.
