exceptions not correctly handled during DB outage when using the sqlalchemy backend

Bug #2037050 reported by Gregory Thiemonge
Affects: taskflow
Status: Fix Released
Importance: High
Assigned to: Gregory Thiemonge
Milestone: (none)

Bug Description

When taskflow is running with a jobboard and the sqlalchemy persistence backend, and a DB outage occurs, the resulting exception is not correctly handled and the job is consumed instead of having its consumption skipped (so it could be retried).

Reproduced in Octavia by shutting down the DB while a flow was being executed:

Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: WARNING taskflow.conductors.backends.impl_executor [-] Job execution failed (consumption proceeding): RedisJob: get_failover_LB_flow-edc09063-db6d-4bcc-91e4-810b69673789 (priority=JobPriority.NORMAL, uuid=eef69a08-371e-4a53-9dfd-c4bb3765f627, details={'store': {'loadbalancer': {'admin_state_up': True, 'description': None, 'loadbalancer_id': '60416b66-6184-425a-99f7-21abf776942c', 'name': 'lb1', 'project_id': '0758ab98a0d9459a958bb7acf51c4805', 'vip_address': '172.24.4.65', 'vip_network_id': 'd57ae4e2-4380-442a-8bea-c6dd0ee18df4', 'vip_port_id': '38731152-5ce1-41bc-af98-fb85d601239f', 'vip_subnet_id': '1fc9bb57-591e-4a48-843c-e4faeca8b528', 'vip_qos_policy_id': None, 'availability_zone': None, 'flavor': {'loadbalancer_topology': 'ACTIVE_STANDBY'}}, 'build_type_priority': 20, 'server_group_id': None, 'loadbalancer_id': '60416b66-6184-425a-99f7-21abf776942c', 'flavor': {'loadbalancer_topology': 'ACTIVE_STANDBY'}, 'availability_zone': {}}}): tenacity.RetryError: RetryError[<Future at 0x7f950d8d75e0 state=finished raised StorageFailure>]
[..]
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor Traceback (most recent call last):
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor result = fn(*args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 462, in _save_flow_detail
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor original_flow_detail.update(conn.update_flow_details(flow_detail))
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/persistence/backends/impl_sqlalchemy.py", line 486, in update_flow_details
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor exc.raise_with_cause(exc.StorageFailure,
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/exceptions.py", line 51, in raise_with_cause
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/oslo_utils/excutils.py", line 142, in raise_with_cause
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor raise exc_cls(message, *args, **kwargs) from kwargs.get('cause')
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor taskflow.exceptions.StorageFailure: Failed updating flow details with uuid 'edc09063-db6d-4bcc-91e4-810b69673789'
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor (Background on this error at: https://sqlalche.me/e/14/e3q8)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor The above exception was the direct cause of the following exception:
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor Traceback (most recent call last):
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/conductors/backends/impl_executor.py", line 178, in _dispatch_job
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor stage_func()
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/conductors/backends/impl_executor.py", line 164, in _run_engine
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor for _state in engine.run_iter():
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/engines/action_engine/engine.py", line 328, in run_iter
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor self._change_state(states.FAILURE)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/engines/action_engine/engine.py", line 365, in _change_state
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor moved, old_state = self.storage.change_flow_state(state)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/fasteners/lock.py", line 296, in wrapper
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return f(self, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 1157, in change_flow_state
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor self.set_flow_state(state)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/fasteners/lock.py", line 296, in wrapper
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return f(self, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 1135, in set_flow_state
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor self._with_connection(self._save_flow_detail, source, clone)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/opt/stack/taskflow/taskflow/storage.py", line 346, in _with_connection
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return functor(conn, *args, **kwargs)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor return self(f, *args, **kw)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor do = self.iter(retry_state=retry_state)
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 373, in iter
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor six.raise_from(retry_exc, fut.exception())
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor File "<string>", line 3, in raise_from
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor tenacity.RetryError: RetryError[<Future at 0x7f950d8d75e0 state=finished raised StorageFailure>]
Sep 22 02:20:32 gthiemon-devstack octavia-worker[398651]: ERROR taskflow.conductors.backends.impl_executor

The issue occurs because, since [1], _save_flow_detail is decorated with tenacity.retry: when repeated StorageFailure exceptions exhaust the retry attempts, tenacity does not re-raise StorageFailure but raises tenacity.RetryError instead.

In the caller [2], if one of the NO_CONSUME_EXCEPTIONS (ExecutionFailure, StorageFailure) is caught, consumption of the job is skipped; if any other exception is caught (including RetryError), the job is consumed. Consuming a job that never completed creates problems in Octavia.
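
For illustration, here is a simplified, hypothetical sketch of that dispatch pattern (not the actual impl_executor code; dispatch and stage_func are placeholders) showing why RetryError ends up in the consuming branch:

from taskflow import exceptions as exc

NO_CONSUME_EXCEPTIONS = (exc.ExecutionFailure, exc.StorageFailure)

def dispatch(stage_func):
    # Decide whether the job should be consumed after running one stage.
    try:
        stage_func()
    except NO_CONSUME_EXCEPTIONS:
        return False  # skip consumption, the job stays on the jobboard
    except Exception:
        return True   # tenacity.RetryError lands here, so the job is consumed
    return True       # normal completion also consumes the job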

There are 2 ways to fix this issue:
1. Add the reraise=True parameter to tenacity.retry in _save_flow_detail and _save_atom_detail: tenacity would then re-raise the last caught StorageFailure exception and the existing exception handler would work as expected (it hides the intermediate StorageFailure exceptions, but IMO that's fine). A sketch of this behavior is shown below.
2. Catch RetryError in the caller and check whether the last exception is one of NO_CONSUME_EXCEPTIONS (probably an ugly patch).

I will propose a patch with the first method.
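
A minimal standalone sketch of why option 1 works (assuming tenacity and taskflow are installed; the save functions below are hypothetical stand-ins, not the taskflow code):

import tenacity
from taskflow import exceptions as exc

@tenacity.retry(stop=tenacity.stop_after_attempt(3))
def save_without_reraise():
    raise exc.StorageFailure("DB is down")

@tenacity.retry(stop=tenacity.stop_after_attempt(3), reraise=True)
def save_with_reraise():
    raise exc.StorageFailure("DB is down")

try:
    save_without_reraise()
except exc.StorageFailure:
    print("never reached without reraise=True")
except tenacity.RetryError:
    print("current behavior: the caller only sees RetryError")

try:
    save_with_reraise()
except exc.StorageFailure:
    print("with reraise=True the caller sees StorageFailure again")

With reraise=True the attempts are still retried; only the exception raised after the last attempt changes.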

[1] https://opendev.org/openstack/taskflow/commit/c32454213e5aa1b6cda770f56ddef18f06c2c3c2
[2] https://opendev.org/openstack/taskflow/src/commit/cd80fc96de6951fa70159a43221bfe0461a92b57/taskflow/conductors/backends/impl_executor.py#L172-L203

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to taskflow (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/taskflow/+/896192

Changed in taskflow:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/taskflow/+/896232

description: updated
Changed in taskflow:
importance: Undecided → High
assignee: nobody → Gregory Thiemonge (gthiemonge)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to taskflow (master)

Reviewed: https://review.opendev.org/c/openstack/taskflow/+/896192
Committed: https://opendev.org/openstack/taskflow/commit/07a1a3f417fafdde5fd32692d5e0d9b6c077f62b
Submitter: "Zuul (22348)"
Branch: master

commit 07a1a3f417fafdde5fd32692d5e0d9b6c077f62b
Author: Gregory Thiemonge <email address hidden>
Date: Fri Sep 22 03:22:07 2023 -0400

    Fix incorrect handling of storage exceptions

    Since [0] _save_flow_detail and _save_atom_detail use tenacity.retry,
    they no longer raise StorageFailure but a tenacity.RetryError, which
    is not caught by the callers.

    Some jobs may have been incorrectly consumed (instead of being
    rescheduled) when StorageFailure was raised.
    Now tenacity re-raises only the last exception (StorageFailure), which
    is correctly handled.

    [0] Ic6b0a78d20124cc027468ecc6aeff189c25d1a8a

    Closes-Bug: #2037050
    Change-Id: I030dd8fc5b65833243cf0948af53dc1aeabf41d9

Changed in taskflow:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/taskflow/+/896232
Committed: https://opendev.org/openstack/taskflow/commit/981052a6cbb0478e3e13407a664c5cd65726d743
Submitter: "Zuul (22348)"
Branch: master

commit 981052a6cbb0478e3e13407a664c5cd65726d743
Author: Gregory Thiemonge <email address hidden>
Date: Fri Sep 22 08:21:08 2023 -0400

    Avoid endless backtraces on StorageFailure

    If the storage is down (for instance with the sqlalchemy backend), the
    logbook cannot be loaded and taskflow goes into an endless loop of
    exceptions. To mitigate this issue, use the retry feature of tenacity;
    the backtrace will then be displayed only every 5 seconds.

    Partial-Bug: #2037050

    Change-Id: I656abdf7325c46d3afb2cc7ca905f1a335fb0d2f
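
A rough sketch of that mitigation pattern (hypothetical helper name, not the actual patch): tenacity can space the retries and log the failure before each sleep, so the backtrace shows up at most once per attempt instead of in a tight loop.

import logging
import tenacity

LOG = logging.getLogger(__name__)

@tenacity.retry(
    wait=tenacity.wait_fixed(5),
    before_sleep=tenacity.before_sleep_log(LOG, logging.WARNING),
)
def load_logbook(backend, book_uuid):
    # Hypothetical placeholder: the real code would read the logbook
    # from the persistence backend here.
    ...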

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/taskflow 5.5.0

This issue was fixed in the openstack/taskflow 5.5.0 release.
