Comment 9 for bug 1889731

Trent Lloyd (lathiat) wrote :

I also saw this issue manifest as an actual problem in octavia-health-manager: it was stuck and not working, outputting the following errors:

2020-10-13 06:25:15.184 1728 ERROR octavia.db.api [-] Connection to database failed. Retrying in 10 seconds.: sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)
[backtrace omitted]
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api The above exception was the direct cause of the following exception:
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api (Background on this error at: http://sqlalche.me/e/e3q8)
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api During handling of the above exception, another exception occurred:
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)

I found the following:
sqlalchemy/oslo.db keeps a pool of connections open (a pool size of 10 here) to reduce DB connection latency. It also allows a limited overflow of extra connections on top of that (an overflow of 20 here, per the error message above) if all of the pooled connections are simultaneously in use by a thread, but won't allow any more even if requested.
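
For reference, a minimal sketch of how those limits map onto SQLAlchemy engine options (the values come from the error message above, the URL is a placeholder, and this is not necessarily Octavia's exact engine setup):

from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

engine = create_engine(
    "mysql+pymysql://user:password@db-host/octavia",  # placeholder URL
    pool_size=10,     # connections kept open in the pool
    max_overflow=20,  # extra connections allowed on top of pool_size
    pool_timeout=10,  # seconds to wait for a free connection
)

try:
    # Blocks for up to pool_timeout seconds if every pooled and overflow
    # connection is already checked out by another thread.
    conn = engine.connect()
except PoolTimeoutError:
    # This is the "QueuePool limit of size 10 overflow 20 reached" error.
    pass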

Octavia is multi-threaded with eventlet/greenlet. By default health_update_threads and stats_update_threads are both equal to the number of CPU threads. Most of the time each thread only uses a DB connection for a short period (so they multiplex their use of the smaller number of connections), but various factors may cause them to hold onto a connection for a while or to all have work to do at the same time.
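
For reference, those thread counts are set in octavia.conf; something like the following pins them down explicitly (section and option names as I understand the octavia config, values purely illustrative):

[health_manager]
# Both default to the number of CPU threads when unset (per the above).
health_update_threads = 4
stats_update_threads = 4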

Most charms limit the number of worker threads to 0.25x the CPU threads via the worker-multiplier config [and, I think, to a maximum of 4 inside LXD]; charm-octavia had this bug where it didn't do that.

Octavia hit the error "Lost connection to MySQL server during query" [this error can also happen if the server connection was lost before the query was sent]. Sometimes this is caused by an aggressive wait-timeout=180 on the server, but here we have wait-timeout=3600, so I suspect there was a MySQL server failover, which would cause the old connections to go stale as the new VIP owner didn't have the old TCP connections.
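
As an aside, at the plain-SQLAlchemy level stale connections from this kind of failover are usually mitigated by pre-pinging or recycling pooled connections; a minimal sketch of those generic knobs (not what oslo.db/Octavia actually configures here, and the URL is a placeholder):

from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:password@db-host/octavia",  # placeholder URL
    pool_pre_ping=True,  # test each pooled connection with a lightweight ping on checkout
    pool_recycle=3300,   # proactively replace connections before a 3600s wait-timeout expires
)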

sqlalchemy normally sees such an error and automatically reconnects. However, in doing so it hit the additional error "QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)" as a direct result of trying to handle the original "Lost connection" error.

I suspect this is actually a limitation in sqlalchemy: if a connection has an error and it wants to reconnect, rather than "replace" the existing connection it tries to get a new connection from the pool before removing the old one, but can't because we have hit the limit. I would argue this is a bug but I can't find any good references; I did find this note that the overflow limit should be >= the total number of threads, which is unlikely to be the case here due to the above bug: https://github.com/sqlalchemy/sqlalchemy/issues/5308
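
To illustrate the mechanism I mean, here is a rough plain-SQLAlchemy sketch (not Octavia's actual code) of where the reconnect path has to go back to the pool:

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine("mysql+pymysql://user:password@db-host/octavia")  # placeholder URL

conn = engine.connect()
try:
    conn.execute(text("SELECT 1"))
except OperationalError:
    # A "Lost connection" style error: the stale DBAPI connection gets
    # invalidated so the pool will discard it rather than reuse it
    # (sqlalchemy normally does this itself when it detects a disconnect).
    conn.invalidate()
    # Reconnecting then means obtaining a fresh DBAPI connection from the
    # pool -- and if every pool + overflow slot is already checked out by
    # other greenthreads doing the same thing, that is where the QueuePool
    # TimeoutError surfaces instead.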

There is also a case where you could hit this QueuePool limit under normal operation, but I didn't see that here and I haven't seen any other bugs or reports of it. So my guess is that all of the available connections were stuck in use by threads trying to get their connection to reconnect and failing. This is a kind of "deadlock". After restarting octavia-health-manager it reconnected and resumed operation normally, after failing like that for at least a day.

In any case this fix will likely prevent the issue, since we'll likely have fewer than 20 threads with this change in place. But I will also open another bug to, if necessary, raise the DB connection pool limit to at least match the expected number of threads.
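
For anyone who hits this in the meantime, I believe the relevant knobs are the standard oslo.db options in the [database] section of octavia.conf; an illustrative (not recommended-as-is) example of raising them:

[database]
# Sized so that pool + overflow covers at least the expected number of
# health/stats update threads.
max_pool_size = 10
max_overflow = 40
pool_timeout = 10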

Documenting here mostly so people can find this bug if they search for the relevant error messages.