I installed the undercloud from RDO TripleO master-ci from May 31st, and used it as an all-in-one OpenStack/Ironic deployment. I used the Ironic instance to deploy a few bare-metal hosts, but did not deploy an overcloud. After the Keystone Admin API became temporarily unavailable, Mistral got caught in a loop, attempting the same tasks over and over again. This caused the Mistral DB to grow to 31GB, and the Mistral logs to grow to over 5GB, and the /var partition filled up.
The count of records in the Mistral DB tables after 3 days was:
mistral.task_executions_v2: 489,413 records
mistral.action_executions_v2: 613,495 records
mistral.workflow_executions_v2: 61,196 records
The actions that were being repeated were related to introspection, and the errors had to do with authentication. At some point the Keystone API was unavailable for a short time, but the errors in Mistral repeated for several days after the Keystone API was restored to operation.
Below is an example from the logs which shows the actions that are repeating over and over again. These were reflected in the database.
| 34a79011-9297-449e-9406-e6d778125936 | complete_swift | NULL | SUCCESS | NULL |
| 34a79bb7-ac2c-4bc7-a9b5-1b13f93c5aa7 | complete_swift | NULL | SUCCESS | NULL |
| 34a7b7ef-69d4-474c-9dc3-157ed30407a2 | check_status | NULL | ERROR | Failed by tasks: [u'send_message'] |
| 34a80ea3-7352-400a-b254-9281086b5157 | merge_payload | NULL | SUCCESS | NULL |
| 34a871e4-ec10-4602-a890-35bca1cdd58c | merge_payload | NULL | SUCCESS | NULL |
| 34a8756c-e41c-4dfe-8229-e8e673e9958c | send_message | NULL | ERROR | Workflow failed due to message status |
| 34a889d2-7eb2-4f65-ab05-cf7f242a7d31 | complete_swift | NULL | SUCCESS | NULL |
| 34a8bd2f-d48f-4944-a1e9-b02aefff0e7e | branch_workflow | NULL | SUCCESS | NULL |
| 34a90005-8525-4f12-8b1c-e2b4d3cd0e81 | merge_payload | NULL | SUCCESS | NULL |
| 34a9032d-a039-403d-b2f3-c5b9073f2316 | branch_workflow | NULL | SUCCESS | NULL |
| 34a91146-c8ec-4e04-b552-924217d9ac3d | send_message | NULL | ERROR | Failed to run action [action_ex_id=f20c7c8e-291b-44a0-b791-f75db7420807, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'queue_post'}', params='{u'queue_name': u'tripleo', u'messages': {u'body': {u'type': u'tripleo.baremetal.v1.introspect', u'payload': {u'status': u'FAILED', u'failed_introspection': [u'a5c148b5-1719-4571-ae32-44ad97e1cf16'], u'introspected_nodes': {u'a5c148b5-1719-4571-ae32-44ad97e1cf16': {u'uuid': u'a5c148b5-1719-4571-ae32-44ad97e1cf16', u'links': [{u'href': u'http://10.9.92.130:13050/v1/introspection/a5c148b5-1719-4571-ae32-44ad97e1cf16', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-06-01T18:09:08'}}, u'message': u'Unhandled workflow error', u'execution': {u'name': u'tripleo.baremetal.v1.introspect', u'created_at': u'2018-06-01 17:48:50', u'updated_at': u'2018-06-03 05:36:34', u'id': u'42c73f81-5270-4e04-9316-80edd1b54dee', u'params': {u'root_execution_id': u'3e257e45-b64c-47de-b662-585c83d5b5fe', u'index': 0, u'namespace': u'', u'task_execution_id': u'1a3110b8-c43e-4c69-a4bc-5d0bf5e2f5f0', u'env': {}}, u'input': {u'node_timeout': 1200, u'run_validations': False, u'node_uuids': [u'a5c148b5-1719-4571-ae32-44ad97e1cf16'], u'queue_name': u'tripleo', u'max_retry_attempts': 2, u'concurrency': 20}, u'spec': {u'tasks': {u'run_validations': {u'name': u'run_validations', u'on-error': u'set_validations_failed', u'workflow': u'tripleo.validations.v1.run_groups', u'on-success': u'introspect_nodes', u'version': u'2.0', u'input': {u'queue_name': u'<% $.queue_name %>', u'group_names': [u'pre-introspection']}, u'type': u'direct'}, u'successful_introspection': {u'on-complete': u'send_message', u'version': u'2.0', u'type': u'direct', u'name': u'successful_introspection', u'publish': {u'status': u'SUCCESS', u'message': u'Successfully introspected <% $.introspected_nodes.len() %> node(s).'}}, u'increase_attempt_counter': {u'on-complete': u'retry_failed_nodes', u'version': u'2.0', u'type': u'direct', u'name': u'increase_attempt_counter', u'publish': {u'introspection_attempt': u'<% $.introspection_attempt + 1 %>'}}, u'send_message': {u'name': u'send_message', u'workflow': u'tripleo.messaging.v1.send', u'on-success': [{u'fail': u'<% $.get(\'status\') = "FAILED" %>'}], u'version': u'2.0', u'input': {u'status': u"<% $.get('status', 'SUCCESS') %>", u'queue_name': u'<% $.queue_name %>', u'message': u"<% $.get('message', '') %>", u'execution': u'<% execution() %>', u'type': u'<% execution().name %>', u'payload': u"<% { introspected_nodes => $.get('introspected_nodes', []), failed_introspection => $.get('failed_introspection', []) } %>"}, u'type': u'direct'}, u'introspect_nodes': {u'with-items': u'uuid in <% $.node_uuids %>', u'name': u'introspect_nodes', u'on-error': [{u'get_introspection_status': u'<% $.introspection_attempt <= $.max_retry_attempts %>'}, {u'max_retry_attempts_reached': u'<% $.introspection_attempt > $.max_retry_attempts %>'}], u'workflow': u'_introspect', u'on-success': u'get_introspection_status', u'version': u'2.0', u'concurrency': u'<% $.concurrency %>', u'input': {u'node_uuid': u'<% $.uuid %>', u'queue_name': u'<% $.queue_name %>', u'timeout': u'<% $.node_timeout %>'}, u'type': u'direct'}, u'set_validations_failed': {u'version': u'2.0', u'type': u'direct', u'name': u'set_validations_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(run_validations).result %>'}, u'on-success': u'send_message'}, u'retry_failed_nodes': {u'version': u'2.0', u'type': u'direct', u'name': u'retry_failed_nodes', u'publish': {u'status': u'RUNNING', u'message': u"<% 'Retrying {0} nodes that failed introspection. Attempt {1} of {2} '.format($.failed_introspection.len(), $.introspection_attempt, $.max_retry_attempts + 1) %>", u'node_uuids': u'<% $.failed_introspection %>'}, u'on-success': [u'send_message', u'introspect_nodes']}, u'max_retry_attempts_reached': {u'on-complete': u'send_message', u'version': u'2.0', u'type': u'direct', u'name': u'max_retry_attempts_reached', u'publish': {u'status': u'FAILED', u'message': u"<% 'Retry limit reached with {0} nodes still failing introspection'.format($.failed_introspection.len()) %>"}}, u'initialize': {u'on-complete': [{u'run_validations': u'<% $.run_validations %>'}, {u'introspect_nodes': u'<% not $.run_validations %>'}], u'version': u'2.0', u'type': u'direct', u'name': u'initialize', u'publish': {u'introspection_attempt': 1}}, u'unhandled_error': {u'on-complete': u'send_message', u'version': u'2.0', u'type': u'direct', u'name': u'unhandled_error', u'publish': {u'status': u'FAILED', u'message': u'Unhandled workflow error'}}, u'get_introspection_status': {u'with-items': u'uuid in <% $.node_uuids %>', u'name': u'get_introspection_status', u'on-error': u'increase_attempt_counter', u'on-success': [{u'successful_introspection': u'<% $.failed_introspection.len() = 0 %>'}, {u'increase_attempt_counter': u'<% $.failed_introspection.len() > 0 %>'}], u'publish-on-error': {u'failed_introspection': u'<% ($.node_uuids.toSet() - task().result.where(isDict($)).select($.uuid).toSet()) + task().result.where(isDict($)).where($.finished = true and $.error != null).toSet() + task().result.where(isDict($)).where($.finished = false).toSet() %>', u'introspected_nodes': u'<% task().result.where(isDict($)).toDict($.uuid, $) %>'}, u'publish': {u'failed_introspection': u'<% task().result.where($.finished = true and $.error != null).select($.uuid) + task().result.where($.finished = false).select($.uuid) %>', u'introspected_nodes': u'<% task().result.toDict($.uuid, $) %>'}, u'version': u'2.0', u'action': u'baremetal_introspection.get_status', u'input': {u'uuid': u'<% $.uuid %>'}, u'type': u'direct'}}, u'name': u'introspect', u'tags': [u'tripleo-common-managed'], u'task-defaults': {u'on-error': u'unhandled_error'}, u'version': u'2.0', u'input': [u'node_uuids', {u'run_validations': False}, {u'queue_name': u'tripleo'}, {u'concurrency': 20}, {u'max_retry_attempts': 2}, {u'node_timeout': 1200}], u'description': u'Take a list of nodes and move them through introspection.\nBy default each node will attempt introspection up to 3 times (two retries plus the initial attemp) if it fails. This behaviour can be modified by changing the max_retry_attempts input.\nThe workflow will assume the node has timed out after 20 minutes (1200 seconds). This can be changed by passing the node_timeout input in seconds.\n'}}, u'plan_name': None, u'execution_id': u'42c73f81-5270-4e04-9316-80edd1b54dee', u'deployment_status': None}}}}']
ZaqarAction.queue_post failed: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}. |
| 34a95717-8ea2-4b38-8511-8f21f1fe6ca7 | complete_swift | NULL | SUCCESS | NULL |
| 34a95919-37de-4c23-948e-3c12aeb84283 | branch_workflow | NULL | SUCCESS | NULL |
| 34a9ba97-0df8-4b67-ab97-33b21a8f089f | check_status | NULL | ERROR | Failed by tasks: [u'send_message'] |
| 34a9c58b-721d-463d-bd7f-66eb55e3cfe8 | merge_payload | NULL | SUCCESS | NULL |
| 34a9cc59-ca58-44b2-8bba-7c4540d1d32a | merge_payload | NULL | SUCCESS | NULL |
Versions installed:
openstack- mistral- engine- 7.0.0-0. 20180523140054. 573d2d0. el7.noarch mistral- common- 7.0.0-0. 20180523140054. 573d2d0. el7.noarch mistral- executor- 7.0.0-0. 20180523140054. 573d2d0. el7.noarch mistral- api-7.0. 0-0.20180523140 054.573d2d0. el7.noarch mistral- lib-0.5. 0-0.20180425121 912.ef3482e. el7.centos. noarch mistralclient- 3.5.0-0. 20180518153811. 8cc7648. el7.noarch mistral- 7.0.0-0. 20180523140054. 573d2d0. el7.noarch mistral- 13.0.1- 0.2018052814212 5.a94142c. el7.noarch
openstack-
openstack-
openstack-
python2-
python2-
python-
puppet-