I don't know if my assumption is correct or not, because I can't reproduce the multipath device mapper situation from the report (some paths failed, some active) no matter how much I force things to break in different ways.
Since each iSCSI storage backend behaves differently, I don't know whether I can't reproduce it because of a difference in backend behavior or because the way I'm trying to reproduce it is different. It may even be that the multipathd on my system behaves differently.
Unfortunately I don't know whether the host where that happened already had leftover devices before the leak, or what the SCSI IDs of the two volumes involved really are.
From os-brick's connect_volume perspective it did the right thing: when it looked for the multipath device containing the newly connected devices it found dm-9, so that's the one it should return.
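To illustrate, here's a minimal sketch of the kind of sysfs lookup involved (not the actual os-brick code, which is more involved):

    import os

    def find_multipath_holder(dev_name):
        # Return the dm-X device that holds a block device, if any,
        # e.g. find_multipath_holder('sdc') -> 'dm-9'
        holders_dir = '/sys/block/%s/holders' % dev_name
        if os.path.isdir(holders_dir):
            for holder in os.listdir(holders_dir):
                if holder.startswith('dm-'):
                    return holder
        return None

If the newly logged-in paths both report dm-9 as their holder, then dm-9 is what connect_volume returns, even if dm-9 also contains paths that belong to a different volume.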
How multipath ended up with two different volumes in the same device mapper device, I don't know.
I don't think "recheck_wwid" would solve the issue, because os-brick would be too fast in finding the multipath device, and that wouldn't give multipathd enough time to activate the paths and form a new device mapper device.
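For reference, this is the multipath.conf knob I mean; even with it enabled, the re-check only happens when a path is reinstated, which would come after os-brick has already picked the map:

    defaults {
        # Re-check a path's WWID when it is reinstated; if it changed
        # while the path was down, multipathd removes it from the map.
        recheck_wwid yes
    }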
In any case, I strongly believe that nova should never proceed to delete the cinder attachment if detaching with os-brick fails, because that usually implies data loss.
The exception would be when the cinder volume is going to be deleted after disconnecting it; in that case the disconnect call to os-brick should always be forced, since data loss is irrelevant.
That would ensure that compute nodes are not left with leftover devices that could cause problems.
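Something along these lines on the caller's side (just a sketch of the policy, not actual nova code; os-brick's disconnect_volume does accept force and ignore_errors arguments):

    def detach_volume(connector, connection_properties, device_info,
                      volume_will_be_deleted):
        # Sketch of the proposed detach policy, not actual nova code.
        if volume_will_be_deleted:
            # Data loss is irrelevant, the volume is going away, so
            # always force the disconnect to avoid leftover devices.
            connector.disconnect_volume(connection_properties, device_info,
                                        force=True, ignore_errors=True)
            return
        # Normal detach: let any flush/disconnect failure propagate so
        # the caller does NOT go on to delete the cinder attachment.
        connector.disconnect_volume(connection_properties, device_info)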
I'll see if I can find a reasonable improvement in os-brick that would detect these issues and fail the connection, although it's probably going to be a bit of a mess.
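The kind of check I have in mind would be verifying that every path in the returned device mapper device reports the WWID of the volume we just connected. A rough sketch (the real code would use os-brick's own helpers, and the sysfs WWID format would need normalizing before comparison):

    import os

    def has_foreign_paths(dm_name, expected_wwid):
        # Return True if any path in the dm device reports a WWID
        # other than the volume we just connected.
        slaves_dir = '/sys/block/%s/slaves' % dm_name
        for dev in os.listdir(slaves_dir):
            # Note: sysfs reports e.g. 'naa.600...' while multipath
            # uses '3600...', so formats must be normalized first.
            with open('/sys/block/%s/device/wwid' % dev) as f:
                if f.read().strip() != expected_wwid:
                    return True
        return False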