exceptions not correctly handled during DB outage when using the sqlalchemy backend

Bug #2037050 reported by Gregory Thiemonge
Affects: taskflow
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

When taskflow is running with a jobboard and the sqlalchemy backend, if a DB outage occurs, the exception is not handled correctly and the job is consumed instead of being skipped.

Reproduced in Octavia: the DB was shut down while a flow was being executed:

Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: WARNING taskflow.conductors.backends.impl_executor [-] Job execution failed (consumption proceeding): RedisJob: get_failover_LB_flow-edc09063-db6d-4bcc-91e4-810b69673789 (priority=JobPriority.NORMAL, uuid=eef69a08-37
1e-4a53-9dfd-c4bb3765f627, details={'store': {'loadbalancer': {'admin_state_up': True, 'description': None, 'loadbalancer_id': '60416b66-6184-425a-99f7-21abf776942c', 'name': 'lb1', 'project_id': '0758ab98a0d9459a958bb7acf51c4805', 'vip_address': '172.24.4.65', 'vip_network
_id': 'd57ae4e2-4380-442a-8bea-c6dd0ee18df4', 'vip_port_id': '38731152-5ce1-41bc-af98-fb85d601239f', 'vip_subnet_id': '1fc9bb57-591e-4a48-843c-e4faeca8b528', 'vip_qos_policy_id': None, 'availability_zone': None, 'flavor': {'loadbalancer_topology': 'ACTIVE_STANDBY'}}, 'build
_type_priority': 20, 'server_group_id': None, 'loadbalancer_id': '60416b66-6184-425a-99f7-21abf776942c', 'flavor': {'loadbalancer_topology': 'ACTIVE_STANDBY'}, 'availability_zone': {}}}): tenacity.RetryError: RetryError[<Future at 0x7f950d8d75e0 state=finished raised StorageFailure>]
[..]
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor Traceback (most recent call last):
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor result = fn(*args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 462, in _save_flow_detail
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor original_flow_detail.update(conn.update_flow_details(flow_detail))
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/persistence/backends/impl_sqlalchemy.py", line 486, in update_flow_details
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor exc.raise_with_cause(exc.StorageFailure,
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/exceptions.py", line 51, in raise_with_cause
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/oslo_utils/excutils.py", line 142, in raise_with_cause
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor raise exc_cls(message, *args, **kwargs) from kwargs.get('cause')
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor taskflow.exceptions.StorageFailure: Failed updating flow details with uuid 'edc09063-db6d-4bcc-91e4-810b69673789'
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor (Background on this error at: https://sqlalche.me/e/14/e3q8)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor The above exception was the direct cause of the following exception:
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor Traceback (most recent call last):
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/conductors/backends/impl_executor.py", line 178, in _dispatch_job
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor stage_func()
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/conductors/backends/impl_executor.py", line 164, in _run_engine
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor for _state in engine.run_iter():
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/engines/action_engine/engine.py", line 328, in run_iter
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor self._change_state(states.FAILURE)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/engines/action_engine/engine.py", line 365, in _change_state
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor moved, old_state = self.storage.change_flow_state(state)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/fasteners/lock.py", line 296, in wrapper
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return f(self, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 1157, in change_flow_state
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor self.set_flow_state(state)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/fasteners/lock.py", line 296, in wrapper
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return f(self, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 1135, in set_flow_state
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor self._with_connection(self._save_flow_detail, source, clone)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 346, in _with_connection
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return functor(conn, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return self(f, *args, **kw)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor do = self.iter(retry_state=retry_state)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 373, in iter
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor six.raise_from(retry_exc, fut.exception())
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "<string>", line 3, in raise_from
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor tenacity.RetryError: RetryError[<Future at 0x7f950d8d75e0 state=finished raised StorageFailure>]
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor

The issue occurs because, since [1], _save_flow_detail uses tenacity.retry: when StorageFailure is raised on every attempt and the retries are exhausted, tenacity does not re-raise StorageFailure but raises tenacity.RetryError instead.
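
A minimal standalone sketch of that tenacity behavior (the retry parameters here are assumed values, not taskflow's actual settings): once the retries are exhausted, the caller sees RetryError rather than the StorageFailure that caused it.

import tenacity

class StorageFailure(Exception):
    """Stands in for taskflow.exceptions.StorageFailure."""

@tenacity.retry(stop=tenacity.stop_after_attempt(3),
                wait=tenacity.wait_fixed(0.1))
def save_flow_detail():
    # Simulates the DB being unreachable on every attempt.
    raise StorageFailure("Failed updating flow details")

try:
    save_flow_detail()
except tenacity.RetryError as e:
    # The StorageFailure is only reachable through the last attempt;
    # the exception type the caller actually catches is RetryError.
    print(type(e.last_attempt.exception()))  # <class '__main__.StorageFailure'>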

In the caller [2], if one of NO_CONSUME_EXCEPTIONS ([ExecutionFailure, StorageFailure]) is caught, consumption of the job is skipped; but if any other exception is caught (including RetryError), the job is consumed (consuming a job that has not completed creates issues in Octavia).
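
A simplified illustration of that dispatch logic (not the actual impl_executor code): only the exception types listed in NO_CONSUME_EXCEPTIONS prevent consumption, so a RetryError falls through to the generic handler and the unfinished job gets consumed.

class ExecutionFailure(Exception):
    """Stands in for taskflow.exceptions.ExecutionFailure."""

class StorageFailure(Exception):
    """Stands in for taskflow.exceptions.StorageFailure."""

NO_CONSUME_EXCEPTIONS = (ExecutionFailure, StorageFailure)

def dispatch_job(run_engine, consume_job):
    consume = True
    try:
        run_engine()
    except NO_CONSUME_EXCEPTIONS:
        # Storage/execution failures: skip consumption so the job can be
        # picked up and retried later.
        consume = False
    except Exception:
        # Anything else (including tenacity.RetryError) keeps consume=True.
        pass
    if consume:
        consume_job()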

There are two ways to fix this issue:
1. Add the reraise=True parameter to tenacity.retry in _save_flow_detail and _save_atom_detail. Tenacity would then re-raise the last caught StorageFailure exception and the exception handler would work as expected (it would hide the intermediate StorageFailure exceptions, but IMO that's fine). See the sketch after this list.
2. Catch RetryError in the caller and check whether the last exception is one of NO_CONSUME_EXCEPTIONS (probably an ugly patch).
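
A sketch of the first approach (the signature and retry parameters below are simplified placeholders, not the actual taskflow code): with reraise=True, tenacity re-raises the last StorageFailure once the retries are exhausted, so the existing NO_CONSUME_EXCEPTIONS handler matches again.

import tenacity

@tenacity.retry(reraise=True,
                stop=tenacity.stop_after_attempt(3),
                wait=tenacity.wait_fixed(0.1))
def _save_flow_detail(conn, original_flow_detail, flow_detail):
    # Same body as today; only the decorator changes. If every attempt
    # raises StorageFailure, the last StorageFailure is re-raised instead
    # of being wrapped in tenacity.RetryError.
    original_flow_detail.update(conn.update_flow_details(flow_detail))
    return original_flow_detail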

I will propose a patch with the first method.

[1] https://opendev.org/openstack/taskflow/commit/c32454213e5aa1b6cda770f56ddef18f06c2c3c2
[2] https://opendev.org/openstack/taskflow/src/commit/cd80fc96de6951fa70159a43221bfe0461a92b57/taskflow/conductors/backends/impl_executor.py#L172-L203

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to taskflow (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/taskflow/+/896192

Changed in taskflow:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/taskflow/+/896232

description: updated