Okay, Gorka and I just had a nice long chat about things, and I think we made some progress on understanding the (several) ways we can get into this situation and came up with some action items. I'll try to summarize here, and I'll look to Gorka to correct me if I get anything wrong.
I think we're now on the same page that deleting a running instance is much more of a forceful act than some might think, and that we expect to try to be graceful about it, but with a limited amount of patience before we kill it with fire. That maps to us always calling force=True when we do the detachment. Even with force=True, brick *tries* to flush and disconnect gracefully, but if it can't, it will cut things off at the knees. Thus, if we passed force=True today, we wouldn't get into the situation this bug describes, because we would *definitely* have cleaned up at that point.
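For the record, here's roughly what that looks like from the caller's side. This is a sketch, not Nova's actual detach path -- the connection_info and device_info values below are placeholders for what cinder hands back at attach time -- but the force and ignore_errors keyword arguments to os-brick's disconnect_volume() are the real knobs:

    from os_brick.initiator import connector

    conn = connector.InitiatorConnector.factory(
        'iscsi', root_helper='sudo', use_multipath=True)

    connection_info = {
        # Placeholder values -- normally this dict comes back from
        # cinder at attach time.
        'target_portal': '192.0.2.10:3260',
        'target_iqn': 'iqn.2010-10.org.openstack:volume-00000000',
        'target_lun': 0,
    }
    device_info = {'path': '/dev/sdX'}  # placeholder

    # With force=True, brick still tries the graceful flush first, but
    # tears the device down even if the flush fails, so nothing is left
    # half-attached on the host after the instance is gone.
    conn.disconnect_volume(connection_info, device_info,
                           force=True, ignore_errors=True)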
It sounds like there are some robustification steps that can be taken in brick to do more validation of the full instance->multipathd->iscsi->volume chain when we're doing attachments, to try to avoid getting into the situation described by this bug, so Gorka is going to work on that.
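To make "extra validation" concrete, one plausible check (this is a hedged sketch, not os-brick's actual implementation) is to verify after connecting that the WWID udev reports for the device matches the WWID the storage backend advertised for the volume:

    import subprocess

    def wwid_matches(device_path, expected_wwid):
        """Return True if scsi_id reports the expected WWID for the device."""
        # scsi_id lives in /lib/udev or /usr/lib/udev depending on distro.
        out = subprocess.check_output(
            ['/lib/udev/scsi_id', '--whitelisted',
             '--device=%s' % device_path])
        return out.decode().strip() == expected_wwid

A mismatch at that point would mean the host-side device node doesn't actually lead to the volume we think we just attached, which is exactly the confusion this bug is about.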
Gorka also described another way to get into this situation, one that is much more easily exploitable by a user, and I'll let him describe it in more detail. The short story is that cinder should not let users delete attachments for instances that nova says are running (i.e. not deleted).
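In code terms, the guard would look something like the following hedged sketch (hypothetical helper names, not cinder's real API surface; nova_instance_is_active() stands in for a Nova API lookup):

    def can_delete_attachment(attachment, nova_instance_is_active):
        """Allow attachment delete only when no running instance owns it."""
        instance_uuid = attachment.get('instance_uuid')
        if instance_uuid is None:
            # Never attached to an instance; nothing to protect.
            return True
        # If nova still reports the instance as running, deleting the
        # attachment would yank the volume out from under it.
        return not nova_instance_is_active(instance_uuid)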
Multipathd, while well-intentioned, also has some behavior that is counterproductive when recovering from situations where paths to a device get disconnected. The recheck_wwid option in multipathd should be a recommended flag to have enabled, to reduce the likelihood of that happening. Especially in the case where nova has allowed a blind delete because of a downed compute node, we need multipathd to not "help" by reattaching things without extra checks.
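For anyone following along, that option lives in the defaults section of /etc/multipath.conf (it needs a reasonably recent multipath-tools, so deployment tools will have to gate on the version they ship):

    # /etc/multipath.conf -- recheck a restored path's WWID before
    # reinstating it, and drop the path if the WWID has changed.
    defaults {
        recheck_wwid yes
    }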
So, the action items roughly are:
1. Nova should start passing force=True in our call to brick detach for instance delete
2. Recommend the recheck_wwid flag for multipathd, and get deployment tools to enable it
3. Robustification of brick's attach workflow to do some extra sanity checks
4. Cinder should refuse to allow users to delete an attachment for an active volume
Given the user-exploitable cinder attack vector, it sounds to me like we should keep this bug private until we have at least the cinder/nova validation step in place. We could open a separate bug for just that scenario, but publicizing the accidental scenario and the discussion we already have in this bug might be enough of a hint that more people would figure out the user-oriented attack.