Octavia does not handle DBConnection error on batch_update_members

Bug #2015239 reported by Sergey Kraynev
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| octavia | Confirmed | Medium | Unassigned | |

Bug Description

Octavia handles DBConnectionError in the API and during taskflow execution.
However, this error is not handled during the preparation steps that run before the flow starts, here:
https://github.com/openstack/octavia/blob/stable/2023.1/octavia/controller/worker/v1/controller_worker.py#L496-L516

On my setup running Octavia Yoga, I get a DBConnectionError while fetching the list of old members.

```
2023-03-15 13:28:17,652.652 10 ERROR oslo_messaging.rpc.server self.connect()
2023-03-15 13:28:17,652.652 10 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/pymysql/connections.py", line 664, in connect
2023-03-15 13:28:17,652.652 10 ERROR oslo_messaging.rpc.server raise exc
2023-03-15 13:28:17,652.652 10 ERROR oslo_messaging.rpc.server oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'x.x.x.x' ([Errno 111] Connection refused)")
2023-03-15 13:28:17,652.652 10 ERROR oslo_messaging.rpc.server [SQL: SELECT 1]
2023-03-15 13:28:17,652.652 10 ERROR oslo_messaging.rpc.server (Background on this error at: https://sqlalche.me/e/14/e3q8)
2023-03-15 13:28:17,652.652 10 ERROR oslo_messaging.rpc.server
```

Another trace:

https://paste.opendev.org/show/btK99zWWfozmXCizn58L/

```
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/octavia/controller/worker/v1/controller_worker.py", line 502, in batch_update_members
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server updated_members = [
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/octavia/controller/worker/v1/controller_worker.py", line 503, in <listcomp>
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server (self._member_repo.get(db_apis.get_session(), id=m.get('id')), m)
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server File "/var/lib/openstack/lib/python3.8/site-packages/octavia/db/repositories.py", line 139, in get
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server return model.to_data_model()

...

2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server raise err.OperationalError(
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server [SQL: SELECT tags.resource_id AS tags_resource_id, tags.tag AS tags_tag, anon_1.pool_id AS anon_1_pool_id
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server FROM (SELECT pool.id AS pool_id
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server FROM pool
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server WHERE pool.id = %(pk_1)s) AS anon_1 INNER JOIN tags ON tags.resource_id = anon_1.pool_id]
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server [parameters: {'pk_1': 'da2108ed-cb40-4d8d-a0f2-0ab3468c4b8c'}]
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server (Background on this error at: https://sqlalche.me/e/14/e3q8)
2023-04-03 10:22:09,964.964 10 ERROR oslo_messaging.rpc.server
```

As a result, both LBs are stuck in the PENDING_UPDATE state.

Gregory Thiemonge (gthiemonge) wrote :

We probably need to add try/except blocks for all the DB queries (not only in batch_update_members) in {v1,v2}/controller_worker.py
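A minimal sketch of what such a guard around a pre-flow DB read might look like. Everything here is an assumption for illustration: `DBConnectionError` is a local stand-in for `oslo_db.exception.DBConnectionError`, and `load_members_guarded` and `get_member` are hypothetical names, not Octavia's actual code or repository signature.

```python
import logging

LOG = logging.getLogger(__name__)


class DBConnectionError(Exception):
    """Local stand-in for oslo_db.exception.DBConnectionError (assumption)."""


def load_members_guarded(get_member, updates):
    """Return [(db_member, update), ...], or None if the DB is unreachable.

    get_member is a hypothetical callable standing in for the repository
    lookup; it takes a member id and returns the stored member.
    """
    try:
        return [(get_member(u.get('id')), u) for u in updates]
    except DBConnectionError:
        # The read failed before any flow was started. Log the failure and
        # let the caller decide how to recover, instead of letting the
        # exception escape to the RPC server.
        LOG.exception("DB unreachable while loading members")
        return None
```

What the caller should do when it receives `None` is left open in this sketch.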

Sergey Kraynev (skraynev) wrote :

@Gregory: I agree with this point, but is it possible to do this for all such queries in one PR? Or does it have to be split into several PRs to simplify review and keep the PR size down?

Gregory Thiemonge (gthiemonge) wrote :

@Sergey: for me it's the same bug; it can be fixed in a single commit.

Changed in octavia:
importance: Undecided → Medium
status: New → Confirmed
Gregory Thiemonge (gthiemonge) wrote :

Thinking about this: we cannot set the provisioning_status to ERROR if the DB is down

Sergey Kraynev (skraynev) wrote :

@Gregory: hm, so what about retrying in some places? I'm talking only about cases where we read data from the DB, not where we write. I think a retry on a failed DB connection would be appropriate there, as long as it does not break the state of the resources.
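The retry idea for read-only queries could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed patch: `DBConnectionError` is a local stand-in for `oslo_db.exception.DBConnectionError`, and `retry_db_read` with its `attempts`, `delay`, `backoff`, and `sleep` parameters is a hypothetical helper.

```python
import functools
import time


class DBConnectionError(Exception):
    """Local stand-in for oslo_db.exception.DBConnectionError (assumption)."""


def retry_db_read(attempts=3, delay=0.5, backoff=2.0, sleep=time.sleep):
    """Retry a read-only DB call when the connection fails.

    Hypothetical helper: only safe for calls that do not modify state,
    since a retried read cannot leave a resource half-updated.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except DBConnectionError:
                    if attempt == attempts:
                        # Out of attempts: surface the original error.
                        raise
                    sleep(wait)
                    wait *= backoff  # exponential backoff between tries
        return wrapper
    return decorator
```

A read such as the member lookup could then be wrapped with `@retry_db_read()`. Writes are deliberately left out of scope here, and a retry library such as tenacity could replace the hand-rolled loop.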
