No valid host nova error results in uninformative stack status

Bug #1585525 reported by Steven Hardy
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Heat
New
Undecided
Unassigned

Bug Description

I hit this in TripleO CI - the root cause appears to be Nova returning 500 with a no-valid-host-found error, but Heat then obfuscates this pretty well:

What you see first as a user is:

2016-05-20 08:36:33.000 | | Controller | 26791d29-e7d0-45dd-ae04-02f99722d0b5 | ResourceInError: resources.Controller.resources[0].resources.Controller: Went to status ERROR due to "Message: Unknown, Code: Unknown" | CREATE_FAILED | 2016-05-20T08:32:49 |

So, a ResourceGroup, containing nested stacks (that contain servers) failed to create, and the reason is "Unknown, Code: Unknown", great! :(

There's no indication what the reason is, until you look in the logs:

2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource Traceback (most recent call last):
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 715, in _action_recorder
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource yield
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 795, in _do_action
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 329, in wrapper
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource step = next(subtask)
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 763, in action_handler_task
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource done = check(handler_data)
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 903, in check_create_complete
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource check = self.client_plugin()._check_active(server_id)
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/nova.py", line 223, in _check_active
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource 'code': fault.get('code', _('Unknown'))
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-05-20 08:32:45.692 20437 ERROR heat.engine.resource

Nova has given us 500, but it did give us a reasonable message.

This causes the stack resource to fail, at which point all useful information about why the error happened is lost:

2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource Traceback (most recent call last):
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 715, in _action_recorder
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource yield
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 795, in _do_action
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 329, in wrapper
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource step = next(subtask)
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 763, in action_handler_task
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource done = check(handler_data)
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 365, in check_create_complete
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource return self._check_status_complete(self.CREATE)
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 414, in _check_status_complete
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource action=action)
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource ResourceFailure: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2016-05-20 08:32:47.844 20439 ERROR heat.engine.resource

So, I guess the question is how do we surface deeply nested errors in a way that's more useful than "unknown"?

Revision history for this message
Thomas Herve (therve) wrote :

I looked into it, and it appears there is one server which actually ends up with that error message. See http://logs.openstack.org/81/266881/10/check-tripleo/gate-tripleo-ci-f22-nonha/9eb113c/logs/undercloud/var/log/heat/heat-engine.txt.gz#_2016-05-20_08_30_40_620

ResourceGroup only surface the first error, so that's why you get that message at the top.

I don't know if we can aggregate errors, but I believe we're doing our best here.

Rico Lin (rico-lin)
Changed in heat:
milestone: none → no-priority-tag-bugs
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.