frequent spawn failures on busy ocata cloud
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Invalid | Undecided | Unassigned |
Ubuntu Cloud Archive | Incomplete | Undecided | Unassigned |
Ocata | New | Undecided | Unassigned |
neutron | New | Undecided | Unassigned |
Bug Description
On one of our OpenStack deployments, which runs ocata from the
Ubuntu Cloud Archive on Ubuntu 16.04 LTS, we often find that
instances fail to spawn as follows. We generally observe this
problem more frequently on older compute hosts when the spawn rate
is high enough (this cloud runs mostly short-lived instances).
From nova show's "fault" field:
{"message": "Build of instance 1d005a25-
2.7/dist-
filter_
File \"/usr/
reason=msg)
", "created": "2017-11-
Reviewing nova-compute.log reveals the following:
2017-11-23 00:50:01.027 4173 DEBUG nova.compute.
[...]
2017-11-23 00:55:05.978 4173 WARNING nova.virt.
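Worth noting: the two timestamps above are about five minutes apart. Assuming the WARNING line is the vif-plugging timeout (the message bodies are truncated here, so that is an assumption), the gap lines up with nova's default vif_plugging_timeout of 300 seconds. A quick sketch of the arithmetic:

from datetime import datetime

# Timestamps copied from the two nova-compute.log lines above; which events
# they mark is an assumption, since the log messages are truncated in this
# report.
start = datetime.strptime("2017-11-23 00:50:01.027", "%Y-%m-%d %H:%M:%S.%f")
warning = datetime.strptime("2017-11-23 00:55:05.978", "%Y-%m-%d %H:%M:%S.%f")
print((warning - start).total_seconds())  # 304.951 -- just over the 300 s default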
Reviewing the neutron-server logs, I see e.g.:
2017-11-23 00:55:08.951 1755 DEBUG neutron.
but no such line for network-vif-plugged for the failing port, which seems to
confirm that nova did not receive the network-vif-plugged event
because neutron did not send it in the first place; if neutron had
even tried to send the event, it would have logged it.
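One way to double-check that (a rough sketch only; the grep pattern is an assumption about what the relevant neutron-server DEBUG lines contain, not the canonical message format) is to scan neutron-server's log for any nova-notifier activity mentioning the port:

# Rough sketch: look for nova-notifier activity mentioning a given port UUID
# in neutron-server's log. The pattern below is an assumption, not the exact
# upstream log message.
import re
import sys

NOTIFY_PATTERN = re.compile(r"network-vif-plugged|send_events|nova", re.IGNORECASE)

def nova_event_lines(logfile, port_id):
    with open(logfile) as fh:
        return [line.rstrip() for line in fh
                if port_id in line and NOTIFY_PATTERN.search(line)]

if __name__ == "__main__":
    # e.g. python scan_neutron_log.py /var/log/neutron/neutron-server.log <port-uuid>
    for line in nova_event_lines(sys.argv[1], sys.argv[2]):
        print(line)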
Now for the neutron-openvswitch-agent log. The last line below looks
rather strange, or at least it does if you've had a long enough look
at OVSNeutronAgent:
2017-11-23 00:50:03.903 2800 DEBUG neutron.
2017-11-23 00:50:04.451 2800 DEBUG neutron.
[...]
2017-11-23 00:56:44.314 2800 DEBUG neutron.
_bind_devices looks like this
(full version at https:/ ):
def _bind_devices(self, need_binding_ports):
    # [...]
    port_names = [p['vif_port'].port_name for p in need_binding_ports]
    port_info = self.int_br.get_ports_attributes(
        "Port", columns=["name", "tag"], ports=port_names, if_exists=True)
    tags_by_name = {x['name']: x['tag'] for x in port_info}
    for port_detail in need_binding_ports:
        # [...]
        port = port_detail['vif_port']
        # Do not bind a port if it's already bound
        cur_tag = tags_by_name.get(port.port_name)
        if cur_tag is None:
            # logs that the port was deleted concurrently and skips it
which seems to suggest that binding was only even attempted five
minutes later, *after* nova had given up and deleted the port, which
seems rather strange, if it's true. Another theory that just occurred
to me, though, is that maybe neutron-openvswitch-agent somehow got
"stuck" for long enough that nova-compute timed out.
This is also the only line that _bind_devices logs for such ports
that fail to launch in this manner on our cloud. There is no
logging corresponding to "Setting status for %s to UP",
"Configuration for devices %s failed!", or to "Configuration for
devices up %(up)s and devices down %(down)s completed." (see
further down in the code at the github link above).
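To put a number on that lag, one option (a rough sketch; the log path and the assumption that every relevant line starts with a standard timestamp are mine) is to pull out the timestamps of every agent log line mentioning a given port and print the spread:

# Rough sketch: measure how long neutron-openvswitch-agent took between the
# first and last log lines mentioning a port (or any other search string).
import sys
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def port_timestamps(logfile, needle):
    stamps = []
    with open(logfile) as fh:
        for line in fh:
            if needle in line:
                try:
                    # first 23 characters: "YYYY-MM-DD HH:MM:SS.mmm"
                    stamps.append(datetime.strptime(line[:23], TS_FORMAT))
                except ValueError:
                    continue
    return stamps

if __name__ == "__main__":
    stamps = port_timestamps(sys.argv[1], sys.argv[2])
    if stamps:
        print("first seen:", stamps[0])
        print("last seen: ", stamps[-1])
        print("spread (s):", (stamps[-1] - stamps[0]).total_seconds())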
I cherry-picked commit c6044fb86174caa
from ocata (it looks like it will be in neutron 10.0.5; "git
show": http:// ), which seemed like it could be relevant, but it did
not improve the situation markedly, although we have logged a few
cases where the code was invoked.
We have another similar cloud hosting similar loads, but running
mitaka, that does not evince this problem.
affects: neutron (Ubuntu) → cloud-archive
Hi Paul,
Thanks for reporting this.
Based on the port binding attempt occurring ~5 minutes later, it seems you're on to something with the theory that neutron-openvswitch-agent somehow got stuck for long enough that nova-compute timed out. If you see this again, can you share more details, such as the system load on the nova/neutron hosts? /var/log/syslog may be helpful as well.
Adding the upstream projects to this as well in case they've seen similar issues.
Thanks,
Corey