Rescheduled instance with pre-existing port fails with PortInUse exception
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | New | Undecided | Unassigned | | |
Bug Description
When creating an instance that uses an existing neutron port, if the build fails on the first compute node and is rescheduled to another compute node, the rescheduled attempt fails with a PortInUse exception. In case it matters, I'm using neutron ML2 with linuxbridge, and the port is on a VLAN provider network.
Steps to reproduce (starting with an AZ/aggregate with two functional compute nodes up and running):
1. Create a neutron port and make a note of its ID (openstack port create --network XXX myport)
2. Inject a failure on the first node - e.g. by renaming the qemu binary
3. Create an instance, using the port created earlier (openstack server create --nic port-id=XXX --image cirros --flavor m1.tiny myvm)
The instance will fail on the first node, and get rescheduled on the second, where it will fail with:
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
2018-02-15 22:52:39.347 43784 ERROR nova.compute.
I've reproduced this on both Ocata and Pike. It does not seem to happen if the port is created by nova (i.e. openstack server create --nic net-id=XXX ...).
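For anyone who wants to script the reproduction, here is a rough sketch that assembles the two CLI calls from the steps above. The helper names (`port_create_cmd`, `server_create_cmd`, `run`) are made up for illustration, XXX is the same placeholder used in the steps, and the failure injection on the first node still has to be done by hand. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def port_create_cmd(network, name):
    """Build the command for step 1: create a standalone neutron port."""
    return ["openstack", "port", "create", "--network", network, name]


def server_create_cmd(port_id, name, image="cirros", flavor="m1.tiny"):
    """Build the command for step 3: boot an instance on the existing port."""
    return [
        "openstack", "server", "create",
        "--nic", f"port-id={port_id}",
        "--image", image,
        "--flavor", flavor,
        name,
    ]


def run(cmd, dry_run=True):
    """Print the command (dry run) or execute it and return its stdout."""
    if dry_run:
        print(shlex.join(cmd))
        return None
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # XXX placeholders must be replaced with a real network and port ID.
    run(port_create_cmd("XXX", "myport"))
    run(server_create_cmd("XXX", "myvm"))
```

Swapping `--nic port-id=...` for `--nic net-id=...` in the second command gives the nova-created-port case, which did not show the problem for me.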
This looks a bit like https:/
tags: | added: compute neutron |
It's failing here because the port already has a device_id set (an instance id):
https://github.com/openstack/nova/blob/stable/pike/nova/network/neutronv2/api.py#L572
But we should unset that when cleaning up and unbinding the port on the first host before rescheduling:
https://github.com/openstack/nova/blob/stable/pike/nova/network/neutronv2/api.py#L511
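To make the suspected sequence concrete, here is a toy model of it. This is not nova's code: the port is a plain dict, and `claim_port`, `unbind_port` and `simulate_reschedule` are hypothetical stand-ins for the validation and cleanup paths linked above. It only assumes that the check rejects a port whose device_id is set and that a successful unbind clears device_id and the host binding.

```python
class PortInUse(Exception):
    """Stand-in for nova's PortInUse exception (simplified)."""


def claim_port(port, instance_uuid, host):
    """Model of the validation step: reject a port that already has a device_id."""
    if port.get("device_id"):
        raise PortInUse(
            f"Port {port['id']} is still in use by {port['device_id']}"
        )
    port["device_id"] = instance_uuid
    port["binding:host_id"] = host


def unbind_port(port, fail=False):
    """Model of the cleanup step; fail=True simulates the update not going through."""
    if fail:
        # Port is left untouched, so device_id still points at the instance.
        return False
    port["device_id"] = ""
    port["binding:host_id"] = None
    return True


def simulate_reschedule(cleanup_fails):
    """Build on compute-1, clean up, then retry the same port on compute-2."""
    port = {"id": "myport", "device_id": "", "binding:host_id": None}
    claim_port(port, "instance-1", "compute-1")
    # The build fails on compute-1 here, which triggers cleanup before rescheduling.
    unbind_port(port, fail=cleanup_fails)
    try:
        claim_port(port, "instance-1", "compute-2")
    except PortInUse:
        return "PortInUse"
    return "bound to " + port["binding:host_id"]


if __name__ == "__main__":
    print(simulate_reschedule(cleanup_fails=False))
    print(simulate_reschedule(cleanup_fails=True))
```

In this model the second attempt only hits PortInUse when the cleanup on the first host leaves device_id in place, which is why the first host's logs are the interesting ones.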
Do you see this error in the logs on the first host?
LOG.exception(_LE("Unable to clear device ID "
                  "for port '%s'"), port_id)