evacuation failed: Port update failed : Unable to correlate PCI slot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Low | Balazs Gibizer |
Queens | Triaged | Low | Unassigned |
Rocky | In Progress | Low | Unassigned |
Stein | Triaged | Low | Unassigned |
Train | Fix Released | Low | Unassigned |
Ussuri | Fix Released | Low | Balazs Gibizer |
Victoria | Fix Released | Low | Balazs Gibizer |
Bug Description
Description
===========
if the _update_
nova/compute/
2931 def rebuild_
2932 +-- 84 lines: injected_files, new_pass, orig_sys_
3016 claim_ctxt = rebuild_claim(
3017 context, instance, scheduled_node,
3018 limits=limits, image_meta=
3019 migration=
3020 self._do_
3021 +-- 47 lines: claim_ctxt, context, instance, orig_image_
3068 instance.
3069 # NOTE (ndipanov): This save will now update the host and node
3070 # attributes making sure that next RT pass is consistent since
3071 # it will be based on the instance and not the migration DB
3072 # entry.
3073 instance.host = self.host
3074 instance.node = scheduled_node
3075 instance.save()
3076 instance.
the instance is not handled as a managed instance of the destination host, because its host has not been updated in the DB yet.
2020-09-19 07:27:36.321 8 WARNING nova.compute.
As a result, the SR-IOV ports (PCI devices) were freed by clean_usage() even though the VM already had the VF port.
743 def _update_
744 +-- 45 lines: # initialize the compute node object, creating it-----
789 self.pci_
790 dev_pools_obj = self.pci_
After that, when we evacuated this VM to another compute host again, we got the error shown below.
Steps to reproduce
==================
1. create a VM on com1 with SRIOV VF ports.
2. stop and disable nova-compute service on com1
3. wait 60 sec (nova-compute reporting interval)
4. evacuate the VM to com2
5. wait until the VM is active on com2
6. enable and start nova-compute on com1
7. wait 60 sec (nova-compute reporting interval)
8. stop and disable nova-compute service on com2
9. wait 60 sec (nova-compute reporting interval)
10. evacuate the VM to com1
11. wait until the VM is active on com1
12. enable and start nova-compute on com2
13. wait 60 sec (nova-compute reporting interval)
14. go to step 2.
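The loop in steps 2 to 14 can be written down as a small plan generator. This is only a sketch: it prints commands rather than running them, it skips step 1 (creating the VM with SR-IOV VF ports), and the server name `vm1` and the helper names are made up for illustration. How nova-compute is stopped and started depends on the deployment, so those steps are left as comments.

```python
def evacuation_round(server, source, target, report_interval=60):
    """Return the commands for one evacuation of `server` from source to target."""
    return [
        # Stopping the service is deployment specific (systemd unit, container, ...).
        f"# on {source}: stop the nova-compute service",
        f"openstack compute service set --disable {source} nova-compute",
        f"sleep {report_interval}",
        f"nova evacuate {server} {target}",
        f"# poll until ACTIVE: openstack server show -f value -c status {server}",
        f"# on {source}: start the nova-compute service again",
        f"openstack compute service set --enable {source} nova-compute",
        f"sleep {report_interval}",
    ]


def reproduction_plan(server, hosts=("com1", "com2"), rounds=2):
    """Alternate evacuations between the two hosts, as in steps 2 to 14."""
    source, target = hosts
    plan = []
    for _ in range(rounds):
        plan += evacuation_round(server, source, target)
        source, target = target, source  # next round goes the other way
    return plan


if __name__ == "__main__":
    for line in reproduction_plan("vm1"):
        print(line)
```

Two rounds correspond to one pass through steps 2 to 13; the failure is reported on a later evacuation, so more rounds may be needed.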
Expected result
===============
Evacuation should be done without errors.
Actual result
=============
Evacuation failed with "Port update failed"
Environment
===========
openstack-
Logs & Configs
==============
2020-09-19 07:34:22.670 8 ERROR nova.compute.
tags: added: evacuate pci resource-tracker
Changed in nova:
status: Incomplete → New
Before the instance is saved with the new host (the destination host) in rebuild_instance(),
2931 def rebuild_instance(self, context, instance, orig_image_ref, image_ref, metadata,
2932 +--136 lines: injected_files, new_pass, orig_sys_
3068 instance.apply_migration_context()
3069 # NOTE (ndipanov): This save will now update the host and node
3070 # attributes making sure that next RT pass is consistent since
3071 # it will be based on the instance and not the migration DB
3072 # entry.
3073 instance.host = self.host
3074 instance.node = scheduled_node
3075 instance.save()
3076 instance.drop_migration_context()
3077
3078 # NOTE (ndipanov): Mark the migration as done only after we
3079 # mark the instance as belonging to this host.
3080 self._set_migration_status(migration, 'done')
the resource tracker (RT) may get the instances via get_by_host_and_node()
743 def _update_available_resource(self, context, resources):
744 +-- 12 lines: # initialize the compute node object, creating it-----
756 # Grab all instances assigned to this node:
757 instances = objects.InstanceList.get_by_host_and_node(
and only after the instance host (the destination host) has been saved and the migration status set to 'done' does the RT get the migrations.
767 # Grab all in-progress migrations:
768 migrations = objects.MigrationList.get_in_progress_by_host_and_node(
769 context, self.host, nodename)
770
In this situation, the PCI devices are freed by clean_usage():
789 self.pci_tracker.clean_usage(instances, migrations, orphans)
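The ordering problem described above can be reproduced with a toy model. This is a sketch under assumptions, not nova code: the `Instance`, `Migration` and `PciDevice` classes, the simplified `clean_usage()` (which takes the device list directly and ignores orphans) and the PCI address are all invented for illustration. It only shows that a device is freed when the RT reads the instance list before `instance.save()` and reads the in-progress migrations after the migration is marked 'done'.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    uuid: str
    host: str


@dataclass
class Migration:
    instance_uuid: str
    dest_host: str
    status: str = "accepted"


@dataclass
class PciDevice:
    address: str
    instance_uuid: Optional[str] = None


def clean_usage(devices: List[PciDevice], instances, migrations) -> List[str]:
    """Simplified model: free devices whose owner is in neither list."""
    known = {i.uuid for i in instances} | {m.instance_uuid for m in migrations}
    freed = []
    for dev in devices:
        if dev.instance_uuid is not None and dev.instance_uuid not in known:
            freed.append(dev.address)
            dev.instance_uuid = None
    return freed


def simulate(racy: bool) -> List[str]:
    host = "com2"  # destination host of the evacuation
    inst = Instance(uuid="vm-1", host="com1")  # DB still points at the source
    mig = Migration(instance_uuid="vm-1", dest_host=host)
    devices = [PciDevice(address="0000:81:00.1", instance_uuid="vm-1")]

    def instances_on_host():
        return [i for i in [inst] if i.host == host]

    def in_progress_migrations():
        return [m for m in [mig] if m.dest_host == host and m.status != "done"]

    def finish_rebuild():
        inst.host = host      # instance.save() with the new host
        mig.status = "done"   # migration marked as done afterwards

    if racy:
        instances = instances_on_host()        # RT reads before the save
        finish_rebuild()                       # rebuild completes in between
        migrations = in_progress_migrations()  # RT reads after 'done'
    else:
        finish_rebuild()
        instances = instances_on_host()
        migrations = in_progress_migrations()
    return clean_usage(devices, instances, migrations)


if __name__ == "__main__":
    print("racy ordering frees:", simulate(racy=True))
    print("safe ordering frees:", simulate(racy=False))
```

In the racy ordering the instance appears in neither list, so the model frees its VF; in the other ordering the instance list covers it and nothing is freed.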