The following procedure suits deployments where an operator can simply write a script and run it.
1. Delete all ephemeral backing files that are not currently referenced by a disk.local COW layer. It is better to delete every such unused ephemeral file, because step 2 below can be unreliable.
2. Identify the ephemeral files that are actually root disk images. This is fairly straightforward if every root disk has a partition table: looking at the file signature is then enough to tell whether a file is a partitionless image with an ext3 filesystem on it. But on a system with user-uploaded Glance images, or with single-partition root disks that have a kernel, there could be several false negatives (bad files that go undetected). Once a compute node is known to have a bad ephemeral backing file, first disable the compute service on it to prevent any further instance scheduling on that node.
3. Move all the VMs that use the bad ephemeral file(s) to a set of fresh nodes using live migration, delete the now-unused bad ephemeral files, and enable scheduling again.
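Step 1 can be sketched as a dry-run script. This is only a minimal sketch under stated assumptions: the instances directory `/var/lib/nova/instances`, its `_base` subdirectory, and the `ephemeral_*` name pattern are assumed defaults, not something the steps above specify, and the helper names are hypothetical. It relies on `qemu-img info --output=json`, which reports a `backing-filename` field for overlay images. It prints candidates and deletes nothing.

```python
import json
import subprocess
from pathlib import Path

# Assumed default nova layout; adjust to the deployment's actual instances path.
INSTANCES_DIR = Path("/var/lib/nova/instances")
BASE_DIR = INSTANCES_DIR / "_base"


def backing_file_of(overlay):
    """Return the backing file recorded in a COW overlay, or None if absent."""
    out = subprocess.check_output(
        ["qemu-img", "info", "--output=json", str(overlay)]
    )
    return json.loads(out).get("backing-filename")


def unused_ephemeral_files(base_files, referenced):
    """Return the base files that no overlay refers to (matched by file name)."""
    referenced_names = {Path(p).name for p in referenced if p}
    return sorted(f for f in base_files if Path(f).name not in referenced_names)


def main():
    # Every instance's disk.local overlay points at one ephemeral backing file.
    overlays = INSTANCES_DIR.glob("*/disk.local")
    referenced = [backing_file_of(o) for o in overlays]
    # Assumed naming pattern for ephemeral backing files in the base directory.
    candidates = [str(p) for p in BASE_DIR.glob("ephemeral_*")]
    for path in unused_ephemeral_files(candidates, referenced):
        print(path)  # dry run: review the list before deleting anything


if __name__ == "__main__":
    main()
```

Matching is done on the file name only, so the printed list should be reviewed by hand before any file is removed. Depending on the qemu version, `qemu-img info` may refuse to inspect an image that a running VM holds open unless it is told to share the image lock.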
If one doesn't want to take a chance with step 2 above, a more deterministic approach is to migrate every VM at least once to a node known to contain either no ephemeral backing files or only new, correct ones.
The result is that all ephemeral backing files are corrected, and new instances see the expected block device with an ext3 filesystem. All existing VMs continue to run, including those that have already seen the exposed data. However, they are all now backed by good ephemeral backing files.
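The file-signature test from step 2 can also serve as a rough check of the outcome. The sketch below assumes the backing files are raw images (not qcow2), and its function names and return labels are hypothetical. It uses two well-known on-disk signatures: the ext2/3/4 superblock magic `0xEF53`, stored little-endian at byte offset 1080, and the MBR boot signature `55 AA` at byte offset 510. The magic alone does not distinguish ext3 from ext2 or ext4, and the caveats about false negatives still apply.

```python
import struct

EXT_MAGIC_OFFSET = 1024 + 56  # superblock starts at byte 1024; s_magic is at +56
EXT_MAGIC = 0xEF53
MBR_SIG_OFFSET = 510
MBR_SIG = b"\x55\xaa"
HEADER_LEN = EXT_MAGIC_OFFSET + 2


def classify_header(head):
    """Classify the first bytes of a raw image as 'ext-fs', 'partitioned' or 'unknown'."""
    if len(head) >= HEADER_LEN:
        (magic,) = struct.unpack_from("<H", head, EXT_MAGIC_OFFSET)
        if magic == EXT_MAGIC:
            # Filesystem sits directly on the device: the expected ephemeral layout.
            return "ext-fs"
    if head[MBR_SIG_OFFSET:MBR_SIG_OFFSET + 2] == MBR_SIG:
        # Boot signature without an ext superblock: likely a root disk image.
        return "partitioned"
    return "unknown"


def classify_backing_file(path):
    """Read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return classify_header(f.read(HEADER_LEN))
```

A file reported as `partitioned` or `unknown` is a candidate for closer inspection, not proof of a bad backing file.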