race between threaded rbd operations
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Cinder | New | Undecided | Unassigned | |
| OpenStack Cinder-Ceph charm | Fix Released | Medium | Edward Hope-Morley | |
Bug Description
When using rbd_exclusive_
```
2018-08-29 06:57:41.604 1586622 DEBUG cinder.
2018-08-29 06:57:41.610 1586622 ERROR cinder.
2018-08-29 06:57:41.610 1586622 ERROR cinder.
2018-08-29 06:57:41.610 1586622 ERROR cinder.
2018-08-29 06:57:41.610 1586622 ERROR cinder.
2018-08-29 06:57:41.610 1586622 ERROR cinder.
2018-08-29 06:57:41.610 1586622 ERROR cinder.
2018-08-29 06:57:41.610 1586622 ERROR cinder.
2018-08-29 06:57:41.612 1586622 DEBUG cinder.
```
E.g.: while thread 1 is deleting volume A, thread 2 (which has just deleted volume B) tries to update the usage statistics. When thread 2 queried the database, volume A was still "available", but by the time it queries Ceph for that volume it is no longer there. The logic that updates the usage statistics is therefore racy.
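The race above can be sketched in a few lines. This is a minimal, hypothetical model, not Cinder's actual code: `db_list_volumes`, `image_size`, and the `ceph_pool` dict are illustrative stand-ins for the database query, the per-image RBD stat, and the Ceph backend; `ImageNotFound` stands in for `rbd.ImageNotFound`. The defensive fix is to treat a volume that vanished between the two queries as already deleted and skip it.

```python
class ImageNotFound(Exception):
    """Stand-in for rbd.ImageNotFound."""

# Fake backend state: two volumes currently stored in the pool.
ceph_pool = {"volume-A": 1024, "volume-B": 2048}

def db_list_volumes():
    # Stale snapshot from the database: volume-A still listed as available.
    return ["volume-A", "volume-B"]

def image_size(name):
    try:
        return ceph_pool[name]
    except KeyError:
        raise ImageNotFound(name)

def get_usage_info():
    total = 0
    for name in db_list_volumes():
        try:
            total += image_size(name)
        except ImageNotFound:
            # The volume was deleted between the DB query and the backend
            # query; skip it instead of letting the exception propagate.
            continue
    return total

# A concurrent delete wins the race before the backend is queried.
del ceph_pool["volume-A"]
print(get_usage_info())  # → 2048
```

Without the `except ImageNotFound` guard, the first vanished volume would abort the whole usage-update pass, which is the failure mode this bug describes.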
_get_usage_info is run in a native thread via RBDVolumeProxy. Native threads use blocking mode, so once _get_usage_info raises an ImageNotFound exception, the green thread that spawned the native thread cannot yield in time, and all other green threads are blocked as well; at that point, any image operation like delete/
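The hand-off pattern can be illustrated with plain standard-library threads (eventlet's green threads and tpool are not modeled here; this is only a sketch of the caller/worker contract, and `run_in_native_thread` is a hypothetical helper). The key point is that if the worker dies on an unhandled exception before reporting back, the caller blocks forever on the queue; capturing the exception and handing it back lets the caller re-raise it instead of hanging.

```python
import queue
import threading

def run_in_native_thread(fn):
    """Run fn in a worker thread; return its result or re-raise its exception.

    Without the except branch below, a worker that raises would exit
    without ever putting anything on the queue, and q.get() would block
    the caller forever -- the buggy pattern described above.
    """
    q = queue.Queue()

    def worker():
        try:
            q.put(("ok", fn()))
        except Exception as exc:
            # Report the failure instead of dying silently.
            q.put(("err", exc))

    threading.Thread(target=worker).start()
    kind, value = q.get()
    if kind == "err":
        raise value
    return value

print(run_in_native_thread(lambda: 6 * 7))  # → 42
```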
How to reproduce:
1. Set up a test env with the default rbd_exclusive_
2. Create 100 test volumes.
3. Delete the 100 test volumes:

```shell
openstack volume list | egrep -v "^\+-+|ID" | awk '{print $2}' | xargs openstack volume delete
```

4. The ImageNotFound exception should then appear in cinder-volume.log (if it does not, create/delete more volumes); at this point the time taken to delete a volume increases noticeably.
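For clarity, the ID-extraction pipeline from step 3 can be exercised against a fabricated sample of `openstack volume list` table output (the UUID and volume name below are made up): the `egrep` drops the `+---+` borders and the header row containing `ID`, and `awk '{print $2}'` picks the ID column out of each remaining row.

```shell
cat <<'EOF' | egrep -v "^\+-+|ID" | awk '{print $2}'
+--------------------------------------+--------+-----------+------+
| ID                                   | Name   | Status    | Size |
+--------------------------------------+--------+-----------+------+
| 11111111-2222-3333-4444-555555555555 | vol-1  | available |    1 |
+--------------------------------------+--------+-----------+------+
EOF
```

This prints only `11111111-2222-3333-4444-555555555555`, which `xargs` then feeds to `openstack volume delete`.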
description: updated
tags: added: ceph drivers rbd
tags: added: canonical-bootstack
Changed in charm-cinder:
  milestone: none → 18.11
  importance: Undecided → Medium
  assignee: nobody → Edward Hope-Morley (hopem)
affects: charm-cinder → charm-cinder-ceph
Changed in charm-cinder-ceph:
  assignee: Edward Hope-Morley (hopem) → nobody
  milestone: 18.11 → none
  milestone: none → 18.11
  assignee: nobody → Edward Hope-Morley (hopem)
tags: added: stable-backport
Changed in charm-cinder-ceph:
  status: Fix Committed → Fix Released
So this is mostly correct, but what I've seen is that the green thread waiting on the native thread never gets re-scheduled when the threaded operation raises an exception it does not explicitly handle. In other words, the green thread calling the native thread gets stuck, but other green threads still operate as normal. This simple test script seems to demonstrate the same behaviour: https://pastebin.ubuntu.com/p/JXgzhppKwr/