[RBD] Retyping of in-use boot volumes renders instances unusable (possible data corruption)

Bug #2019190 reported by Alexander Käb
This bug affects 5 people
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| Cinder | New | Critical | Eric Harney | |
| Cinder (Wallaby) | New | Critical | Unassigned | |
| OpenStack Compute (nova) | New | Undecided | Unassigned | |

Bug Description

While trying out the volume retype feature in Cinder, we noticed that after an instance is
rebooted it will either not come back online and be stuck in an error state, or, if it does
come back online, its filesystem is corrupted.

## Observations

Say there are two volume types, `fast` (stored in the Ceph pool `volumes`) and `slow`
(stored in the Ceph pool `volumes.hdd`). Before the retype, the volume in this example is
present in the `volumes.hdd` pool and has a watcher accessing it.

```sh
[ceph: root@mon0 /]# rbd ls volumes.hdd
volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

[ceph: root@mon0 /]# rbd status volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
Watchers:
        watcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456
```

Starting the retype using the migration policy `on-demand` for that volume, either via the
Horizon dashboard or the CLI, causes the volume to be correctly transferred to the
`volumes` pool within the Ceph cluster. However, the watcher does not get transferred, so
nothing is accessing the volume after the transfer.

```sh
[ceph: root@mon0 /]# rbd ls volumes
volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

[ceph: root@mon0 /]# rbd status volumes/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
Watchers: none
```
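For reference, the retype above was triggered with `cinder retype --migration-policy on-demand`. A minimal programmatic equivalent using python-cinderclient would look roughly like the following sketch; the auth URL and credentials are placeholders, while the volume ID and type name are the ones from this example:

```python
# Rough sketch: trigger the same retype via python-cinderclient.
# The auth URL and credentials below are placeholders.
from keystoneauth1 import loading, session
from cinderclient import client as cinder_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='https://keystone.example:5000/v3',
    username='admin', password='secret',
    project_name='admin',
    user_domain_id='default', project_domain_id='default')
sess = session.Session(auth=auth)
cinder = cinder_client.Client('3', session=sess)

# Equivalent of: cinder retype --migration-policy on-demand <volume> fast
cinder.volumes.retype('81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9',
                      'fast', 'on-demand')
```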

Taking a look at the libvirt XML of the instance in question, one can see that the `rbd`
volume path does not change after the retype completes. Therefore, if the instance is
restarted, Nova will not be able to find its volume, preventing the instance from starting.
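For anyone wanting to check this on their own compute nodes, here is a small sketch using the libvirt Python bindings; the domain name is hypothetical:

```python
# Sketch: dump the guest XML on the compute node and print the RBD
# <source> path. The domain name below is hypothetical.
import libvirt
from xml.etree import ElementTree as ET

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-00000001')

for source in ET.fromstring(dom.XMLDesc()).iter('source'):
    if source.get('protocol') == 'rbd':
        print(source.get('name'))  # still shows volumes.hdd/... after the retype

conn.close()
```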

#### Pre retype

```xml
[...]
<source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9' index='1'>
    <host name='2001:XX:XX:XXX::a088' port='6789'/>
    <host name='2001:XX:XX:XXX::3af1' port='6789'/>
    <host name='2001:XX:XX:XXX::ce6f' port='6789'/>
</source>
[...]
```

#### Post retype (no change)

```xml
[...]
<source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9' index='1'>
    <host name='2001:XX:XX:XXX::a088' port='6789'/>
    <host name='2001:XX:XX:XXX::3af1' port='6789'/>
    <host name='2001:XX:XX:XXX::ce6f' port='6789'/>
</source>
[...]
```

### Possible cause

While looking through the code responsible for the volume retype, we found a function
`_swap_volume` which, by our understanding, should be responsible for fixing the association
above. As we understand it, Cinder should use an internal API path to let Nova perform this
action. This does not seem to happen.

(`_swap_volume`: https://github.com/openstack/nova/blob/stable/wallaby/nova/compute/manager.py#L7218)
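For illustration, the Nova-side call we would expect to be involved here is the swap-volume API. A sketch using python-novaclient follows; the helper name and argument order are our best understanding and should be double-checked, the IDs are placeholders, and `sess` is a keystoneauth session as in the sketch further up:

```python
# Sketch: the Nova "swap volume" call we would expect to be triggered.
# python-novaclient is assumed; helper name and argument order should be
# double-checked. IDs are placeholders; sess is a keystoneauth session.
from novaclient import client as nova_client

nova = nova_client.Client('2.60', session=sess)

# Maps to: PUT /servers/{server_id}/os-volume_attachments/{old_volume_id}
nova.volumes.update_server_volume('<instance-uuid>',
                                  '<old-volume-uuid>',
                                  '<new-volume-uuid>')
```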

## Further observations

If one tries to regenerate the libvirt XML, e.g. by live migrating the instance and then
rebooting it, the filesystem gets corrupted.

## Environmental Information and possibly related reports

We are running the latest version of TripleO Wallaby using the hardened (whole disk)
overcloud image for the nodes.

Cinder Volume Version: `openstack-cinder-18.2.2-0.20230219112414.f9941d2.el8.noarch`

### Possibly related

- https://bugzilla.redhat.com/show_bug.cgi?id=1293440

(might want to paste the above to a markdown file for better readability)

Revision history for this message
Sofia Enriquez (lsofia-enriquez) wrote :

Hello Alexander Käb,

To clarify:
- (double check) Are the instances created from volumes, or are volumes attached to an instance? Can you share the commands you are using to do this (steps)?
- Is the data on the volumes encrypted?
- Have you encountered any errors in the cinder c-vol logs? Could you share the c-vol log?

Thanks!

tags: added: drivers live-migration nova rbd retype
Changed in cinder:
importance: Undecided → Medium
summary: - Retyping of in-use boot volumes renders instances unusable (possible
- data corruption)
+ [RBD] Retyping of in-use boot volumes renders instances unusable
+ (possible data corruption)
Revision history for this message
Sofia Enriquez (lsofia-enriquez) wrote :

Adding Nova because the report indicates that the volume is migrated to a different ceph pool but the instance points to the old location.

Revision history for this message
Alexander Käb (alexander-kaeb) wrote :

Hi Sofia,

All tested instances were created from an image via the dashboard, with the option
`Create New Volume` checked. The steps performed to retype the volumes are as follows:

- Retype the volume from slow to fast or fast to slow, either via the dashboard or the CLI (`cinder retype --migration-policy on-demand [...]`)
- Reboot the instance, e.g. with a soft reboot

Just these two steps are enough to bring the instance to an error state, as libvirt will
try to load the instance's volume from the pre-retype location, which will fail.
Sometimes live-migrating the instance after the retype can get the instance working again, but
if the instance performs some I/O operations, there is a good chance that the filesystem is
broken after a reboot:

```
[  OK  ] Stopped target Basic System.
[  OK  ] Reached target Initrd File Systems.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped dracut pre-mount hook.
[  OK  ] Stopped dracut initqueue hook.
[  OK  ] Stopped dracut pre-trigger hook.
[  OK  ] Stopped dracut pre-udev hook.
[  OK  ] Stopped dracut cmdline hook.
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

:/#
```

Attached you will find the cinder-volume log and the nova-compute log from an earlier
test (debug logging enabled).

Revision history for this message
Alexander Käb (alexander-kaeb) wrote :

nova log

Revision history for this message
melanie witt (melwitt) wrote :

Generally, nova gets the volume locations from cinder as a field called 'connection_info' which belongs to a volume attachment.

The way retype usually works is cinder creates a new empty volume with the destination volume type and then calls the nova swap_volume API [1] to swap the volume from the original source volume to the new destination volume. Nova will call the cinder API to create a new attachment for the destination volume. Then, nova gathers the nova-compute host connector and calls the cinder API to update the attachment with the host connector. Cinder API returns the new connection_info from this call. Nova calls down into the libvirt driver to connect the new volume and copy the volume data from the old volume to the new volume, using the new connection_info for the destination libvirt XML. Finally, Nova disconnects the old volume.
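As a rough illustration of that flow (not verbatim Nova code), this is what the attachment dance looks like at the Cinder REST API level; the endpoint URL, token and IDs are placeholders, and the connector dict is abbreviated:

```python
# Sketch of the attachment flow described above, expressed as raw calls
# to the Cinder attachments REST API (available since microversion 3.27).
# Endpoint, token and IDs are placeholders; error handling is omitted.
import requests

CINDER = 'https://cinder.example:8776/v3/<project-id>'
HEADERS = {'X-Auth-Token': '<token>',
           'OpenStack-API-Version': 'volume 3.44'}

new_volume_id = '<destination-volume-uuid>'
instance_uuid = '<instance-uuid>'

# 1. Nova creates a bare attachment for the destination volume.
resp = requests.post(f'{CINDER}/attachments', headers=HEADERS, json={
    'attachment': {'volume_uuid': new_volume_id,
                   'instance_uuid': instance_uuid,
                   'connector': None}})
attachment = resp.json()['attachment']

# 2. Nova updates the attachment with the nova-compute host connector;
#    Cinder answers with the new connection_info (new RBD pool/image name).
connector = {'platform': 'x86_64', 'os_type': 'linux', 'ip': '192.0.2.10',
             'host': 'compute-0', 'multipath': False}
resp = requests.put(f'{CINDER}/attachments/{attachment["id"]}',
                    headers=HEADERS,
                    json={'attachment': {'connector': connector}})
connection_info = resp.json()['attachment']['connection_info']

# 3. Nova's libvirt driver connects the new volume using connection_info,
#    copies the data across, and finally deletes the old attachment.
```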

However, from what I can tell reading the code, in the case of the RBD driver on the cinder side, I don't see that nova is called at all as part of the retyping process, so it doesn't know about the new volume location when it goes to generate the guest XML.

I found mention about this issue on the ceph-users mailing list recently as well:

https://<email address hidden>/thread/TJO6YBJFHCY743UPQDY4D4PENZDQFAHH

which pointed to these posts on the openstack-discuss mailing list:

https://lists.openstack.org/pipermail/openstack-discuss/2023-June/034160.html

https://lists.openstack.org/pipermail/openstack-discuss/2023-June/034165.html

According to the second post, the retype of attached RBD volumes was working in Victoria as long as the [nova] section of cinder.conf was configured, and then it stopped working in Wallaby. The second post noted https://bugs.launchpad.net/cinder/+bug/1886543 as the only change around retype for Wallaby, so is it possible that it is related?

I think this bug is Critical given it's a regression and has potential for data loss. Please let me know if I’ve got anything wrong here and/or if anything is needed on the nova side.

[1] https://github.com/openstack/cinder/blob/5728d3899f13140203d44259ca8dfb7ae132e192/cinder/volume/manager.py#L2429

Changed in cinder:
importance: Medium → Critical
Eric Harney (eharney)
Changed in cinder:
assignee: nobody → Eric Harney (eharney)
Revision history for this message
melanie witt (melwitt) wrote :

I spent some time on this and I was able to reproduce the bug.

I am not sure exactly how RBD assisted volume migration is supposed to work but there is no call to Nova happening, so Nova doesn't know anything has changed. That point kind of doesn't matter though because AFAICT there is no existing API call that could be used to tell Nova, "point at the new volume location without copying any volume data to it". The only API we have at present is the swap volume API and there's no way to tell it not to copy volume data.

The other issue I see is that the volume attachment connection_info on the Cinder side does not itself get updated with the new volume location. So even if Nova was able to pull new connection_info from Cinder [1], it would still fail to boot because the new volume location isn't there.

Based on the fact that we don't have an API to tell Nova about the new volume location without copying data, I'm not sure what we can do to immediately fix this other than revert the patch that changed the mechanism for RBD volume retype.

For a future fix, I "think" it would not be difficult to add a "do not copy" type of flag to the PUT /servers/{server_id}/os-volume_attachments/{volume_id} API in Nova [2]. Then after the retype Cinder could call Nova to say "this volume moved but don't copy any data there".
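Purely as an illustration of that idea, the proposal amounts to extending the existing swap-volume request with something like a "don't copy the data" field. The endpoint and body shape below are the current API; the extra flag and the microversion are hypothetical and do not exist in any release:

```python
# Illustration only: today's swap-volume request plus a HYPOTHETICAL
# "no data copy" flag. The flag and the microversion are not real.
import requests

NOVA = 'https://nova.example:8774/v2.1'   # placeholder endpoint
HEADERS = {'X-Auth-Token': '<token>',
           'OpenStack-API-Version': 'compute 2.96'}  # hypothetical microversion

server_id = '<instance-uuid>'
old_volume_id = '<source-volume-uuid>'

requests.put(
    f'{NOVA}/servers/{server_id}/os-volume_attachments/{old_volume_id}',
    headers=HEADERS,
    json={'volumeAttachment': {
        'volumeId': '<destination-volume-uuid>',
        'no_data_copy': True}})  # hypothetical flag proposed above
```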

Here are the steps I used to reproduce the issue:

https://paste.openstack.org/show/bNpzkjbeXrmTCwNHfDGs

No volumes are encrypted and the [nova] section is configured in cinder.conf.

[1] https://docs.openstack.org/nova/latest/cli/nova-manage.html#volume-attachment-refresh
[2] https://docs.openstack.org/api-ref/compute/?expanded=update-a-volume-attachment-detail#update-a-volume-attachment

Revision history for this message
melanie witt (melwitt) wrote :

I uploaded a DNM tempest patch to run modified TestVolumeMigrateRetypeAttached tests in tempest/scenario/test_volume_migrate_attached.py with the master, stable/wallaby, and stable/victoria branches [1]:

  https://review.opendev.org/c/openstack/tempest/+/890360

The tests in ^ are modified to add a hard reboot of the instance at the end.

The migrate volume test passes in all branches while the retype volume test fails in master and stable/wallaby but passes in stable/victoria [2].

The unmodified tests will pass because they aren't hard rebooting the server to cause regeneration of guest XML.

In the test logs on the DNM patch [2], I think I might have also found why migrate works while retype fails.

The RBD driver [3] makes a decision about which path to take based on the volume status. In the test logs, it's showing that for migrate, the volume is 'in-use' and the RBD driver (correctly) considers this case to be a move across different pools and falls back to a generic migrate which calls the Nova swap volume API. For retype however, the volume status is 'retyping' so it doesn't refuse the assisted migration and it goes ahead.
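In other words (a paraphrase of that decision, not the literal driver code), the status check behaves roughly like this:

```python
# Paraphrase (not the literal Wallaby RBD driver code) of how the driver
# decides whether to attempt an assisted migration.
def should_attempt_assisted_migration(volume_status: str) -> bool:
    # 'available' (detached) and 'retyping' go down the assisted path,
    # which just copies the RBD image to the destination pool without
    # telling Nova. 'in-use' is refused, so the manager falls back to the
    # generic migration that goes through Nova's swap-volume API.
    return volume_status in ('available', 'retyping')


assert should_attempt_assisted_migration('retyping') is True  # attached retype: buggy path
assert should_attempt_assisted_migration('in-use') is False   # plain migrate: generic path
```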

Excerpts from the c-vol log:

migrate volume:

Aug 03 22:24:16.833416 np0034853654 cinder-volume[116332]: DEBUG cinder.volume.manager [None req-1c151856-e8fb-41e3-ad42-36810f4fcec8 tempest-TestVolumeMigrateRetypeAttached-2102186043 None] Issue driver.migrate_volume. {{(pid=116332) migrate_volume /opt/stack/cinder/cinder/volume/manager.py:2609}}
Aug 03 22:24:16.834270 np0034853654 cinder-volume[116332]: DEBUG cinder.volume.drivers.rbd [None req-1c151856-e8fb-41e3-ad42-36810f4fcec8 tempest-TestVolumeMigrateRetypeAttached-2102186043 None] Attempting RBD assisted volume migration. volume: 9a27b9cd-e6e5-4f29-a127-a030e94c5356, host: {'host': 'np0034853654@ceph2#ceph2', 'cluster_name': None, 'capabilities': {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 24.56, 'free_capacity_gb': 24.56, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:018eb22d-04d2-464f-8294-675d033013df:cinder:othervolumes', 'backend_state': 'up', 'volume_backend_name': 'ceph2', 'replication_enabled': False, 'allocated_capacity_gb': 0, 'filter_function': None, 'goodness_function': None, 'timestamp': '2023-08-03T22:23:59.050934'}}, status=in-use. {{(pid=116332) migrate_volume /opt/stack/cinder/cinder/volume/drivers/rbd.py:1924}}
Aug 03 22:24:16.834270 np0034853654 cinder-volume[116332]: DEBUG os_brick.initiator.linuxrbd [None req-1c151856-e8fb-41e3-ad42-36810f4fcec8 tempest-TestVolumeMigrateRetypeAttached-2102186043 None] opening connection to ceph cluster (timeout=-1). {{(pid=116332) connect /opt/stack/os-brick/os_brick/initiator/linuxrbd.py:70}}
Aug 03 22:24:16.861112 np0034853654 cinder-volume[116332]: DEBUG cinder.volume.drivers.rbd [None req-1c151856-e8fb-41e3-ad42-36810f4fcec8 tempest-TestVolumeMigrateRetypeAttached-2102186043 None] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=-1). {{(pid=116332) _do_conn /opt/stack/cinder/cinder/volume/drivers/rbd.py:480}}
Au...


Revision history for this message
Luigi Toscano (ltoscano) wrote :

Can the tempest patch be resurrected and pushed as a proper patch? I didn't notice this comment (sorry) and ended up writing a simpler version, which I'm going to abandon: https://review.opendev.org/c/openstack/tempest/+/893863

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/896172

Revision history for this message
melanie witt (melwitt) wrote :

Thank you Luigi for pointing that out!

I have pushed a proper patch, and have also proposed two more patches that let us configure devstack to use a separate Ceph pool per backend:

* tempest patch to test regression: https://review.opendev.org/c/openstack/tempest/+/890360

* devstack-plugin-ceph patch to enable config of separate Ceph pools: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/895533

* cinder patch to add a cinder-tempest-ceph-multibackend job: https://review.opendev.org/c/openstack/cinder/+/896172

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi everyone, in our tests we have a workaround. After the volume retype, cold migrating the instance updates the pool name in the guest XML and creates a new volume attachment that contains the new pool name in the attachment properties.
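For anyone needing it, here is a sketch of that workaround with python-novaclient (a cold migration followed by confirming the resize); the server ID is a placeholder and `sess` is a pre-built keystoneauth session:

```python
# Sketch of the workaround above: cold-migrate the instance after the
# retype so Nova regenerates the guest XML and the volume attachment.
# python-novaclient is assumed; the server ID is a placeholder.
import time
from novaclient import client as nova_client

nova = nova_client.Client('2.60', session=sess)  # sess: keystoneauth session
server_id = '<instance-uuid>'

nova.servers.migrate(server_id)                  # cold migration

# Wait for the migration to finish, then confirm it.
while nova.servers.get(server_id).status != 'VERIFY_RESIZE':
    time.sleep(5)
nova.servers.confirm_resize(server_id)
```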
