Just to summarize my understanding, and perhaps clarify for others, this bug is focused on stale connection_info for rbd volumes (not rbd images). rbd images have a related issue during live migration that is being handled in a separate bug (see comment 12 above).
Focusing on connection_info for rbd volumes now (and thanks to Matt Riedemann's comments for the tips here). connection_info appears to be properly refreshed for live migration in pre_live_migration() where _get_instance_block_device_info() is called with refresh_conn_info=True (see comment 9 above and https://github.com/openstack/nova/blob/stable/queens/nova/compute/manager.py#L5977).
Is the fix as simple as flipping refresh_conn_info=False to True for some of the other calls to _get_instance_block_device_info()? Below is an audit of the _get_instance_block_device_info() calls.
Calls to _get_instance_block_device_info() with refresh_conn_info=True:
finish_revert_resize()
_finish_resize()
pre_live_migration()
Based on xavpaice's comments (see comment 13 above: "... existing, running, instances were fine, fresh new instances were fine, but when we stopped instances via nova, then started them again, they failed to start ..."), it would seem that the following should also have refresh_conn_info=True:
_power_on() # solves xavpaice's scenario?
_do_rebuild_instance()
reboot_instance()
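To make the stale-cache effect concrete, here is a toy model of what the refresh_conn_info flag changes. It is only a sketch: FakeVolumeAPI, FakeBDM and get_block_device_info() are hypothetical stand-ins and not nova code, and the "hosts" field that goes stale is an assumed example of connection_info content.

```python
# Toy model only: every class and function here is hypothetical, not nova code.

class FakeVolumeAPI:
    """Stands in for the volume service, which always knows the current hosts."""

    def __init__(self, hosts):
        self.hosts = list(hosts)

    def initialize_connection(self, volume_id):
        # Build connection_info from the current state of the backend.
        return {
            "driver_volume_type": "rbd",
            "data": {"name": f"volumes/{volume_id}", "hosts": list(self.hosts)},
        }


class FakeBDM:
    """Stands in for a block device mapping that caches connection_info."""

    def __init__(self, volume_id, volume_api):
        self.volume_id = volume_id
        # Cached once, at attach time.
        self.connection_info = volume_api.initialize_connection(volume_id)


def get_block_device_info(bdms, volume_api, refresh_conn_info=False):
    """Return connection_info per volume, re-fetching it only when asked to."""
    for bdm in bdms:
        if refresh_conn_info:
            bdm.connection_info = volume_api.initialize_connection(bdm.volume_id)
    return [bdm.connection_info for bdm in bdms]


api = FakeVolumeAPI(["10.0.0.1", "10.0.0.2"])
bdms = [FakeBDM("vol-1", api)]

# The backend addresses change after the volume was attached.
api.hosts = ["10.0.1.1", "10.0.1.2"]

stale = get_block_device_info(bdms, api, refresh_conn_info=False)
fresh = get_block_device_info(bdms, api, refresh_conn_info=True)
print(stale[0]["data"]["hosts"])  # still the addresses cached at attach time
print(fresh[0]["data"]["hosts"])  # the addresses the backend reports now
```

In this model a stop/start path that passes refresh_conn_info=False keeps handing the old addresses to the driver, which would match the symptom in comment 13 if the real code behaves the same way.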
Calls to _get_instance_block_device_info() with refresh_conn_info=False:
_destroy_evacuated_instances()
_init_instance()
_resume_guests_state()
_shutdown_instance()
_power_on()
_do_rebuild_instance()
reboot_instance()
revert_resize()
_resize_instance()
resume_instance()
shelve_offload_instance()
check_can_live_migrate_source()
_do_live_migration()
_post_live_migration()
post_live_migration_at_destination()
rollback_live_migration_at_destination()
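An audit like this can be repeated mechanically with Python's ast module. The sketch below is a rough helper, not a nova tool: the audit() function and the SAMPLE source are hypothetical, it only detects refresh_conn_info when passed as a keyword (a missing keyword is reported as "default"), it attributes calls inside nested functions to the outer function as well, and ast.unparse() needs Python 3.9 or later.

```python
import ast

TARGET = "_get_instance_block_device_info"


def audit(source):
    """Map each enclosing function name to the refresh_conn_info values it passes."""
    results = {}
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            # Handle both self.<name>(...) and a bare <name>(...).
            if isinstance(callee, ast.Attribute):
                name = callee.attr
            else:
                name = getattr(callee, "id", None)
            if name != TARGET:
                continue
            value = "default"  # keyword not passed at this call site
            for kw in node.keywords:
                if kw.arg == "refresh_conn_info":
                    value = ast.unparse(kw.value)
            results.setdefault(func.name, []).append(value)
    return results


# Hypothetical sample, shaped like the calls being audited (not copied from nova).
SAMPLE = '''
class Manager:
    def _power_on(self, context, instance):
        info = self._get_instance_block_device_info(context, instance)

    def pre_live_migration(self, context, instance):
        info = self._get_instance_block_device_info(
            context, instance, refresh_conn_info=True)
'''

report = audit(SAMPLE)
for fn, values in sorted(report.items()):
    print(fn, values)
```

To run it against a real tree, read nova/compute/manager.py from a checkout and pass its text to audit() in place of SAMPLE.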