openstack-ansible

Bug #1981463
Comment #0

Alexander Binzxxxxxx (devil000000) wrote on 2022-07-12:

setup:
Using Xena with mostly default settings.
So openstack_db_connection_recycle_time and galera_wait_timeout are both 600, while the HAProxy timeout for the Galera frontend/backend is 5000s.

symptom:
HAProxy reports Galera connection aborts in the ERSP column. The MariaDB log contains lines like:
"Aborted connection 594171 to db: 'placement' user: 'placement' host: 'hostA.mydomain.com' (Got timeout reading communication packets)"
The aborted connections counter in MariaDB is also rising.
These errors cause retries on the OpenStack side, which slows things down from time to time.
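
To see which services are affected most, the aborted-connection lines can be tallied from the MariaDB log. The following is a minimal sketch: the regular expression is modeled only on the single log line quoted above, so other MariaDB versions or log formats may need a different pattern, and `count_aborted` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this report (assumption:
# other lines of this kind follow the same layout).
ABORTED_RE = re.compile(
    r"Aborted connection (?P<id>\d+) to db: '(?P<db>[^']*)' "
    r"user: '(?P<user>[^']*)' host: '(?P<host>[^']*)' "
    r"\((?P<reason>[^)]*)\)"
)


def count_aborted(lines):
    """Count aborted connections per (db, user, host, reason)."""
    counts = Counter()
    for line in lines:
        match = ABORTED_RE.search(line)
        if match:
            key = (match["db"], match["user"], match["host"], match["reason"])
            counts[key] += 1
    return counts


sample = [
    "Aborted connection 594171 to db: 'placement' user: 'placement' "
    "host: 'hostA.mydomain.com' (Got timeout reading communication packets)",
    "some unrelated log line",
]

for key, n in count_aborted(sample).items():
    print(n, key)
```

In practice the `lines` argument would be an open log file instead of the `sample` list.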

expectation:
Not getting these kinds of errors.

some analysis:
MariaDB is actually dropping the connections at wait_timeout (= galera_wait_timeout = 600) because they have been idle for that long.
oslo.db, which is used in basically all OpenStack services, does connection pooling and is configured (e.g. in placement) with the following values (all defaults):
max_overflow = 50
max_pool_size = 5
pool_timeout = 30
connection_recycle_time = 600
So it should actually close connections and re-establish them before the timeout is reached.
The HAProxy timeouts of 5000s on the frontend and backend should not matter here.

not a solution:
Increasing wait_timeout in MariaDB to 1200 or 3600.

workaround (may not be a good solution):
Increasing wait_timeout in MariaDB to 7200.

I am not sure where the issue is actually coming from, but here are my best guesses:
* there is a bug on the OpenStack side, so the config values are not passed down to the lower-layer library
* there is a bug in the SQL-facing library code that keeps pooling and refresh from working properly
* the timeout in MariaDB must be higher than the one in oslo.db
* HAProxy may still cause some issue here, and the 5000s may be part of that
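
The third guess can be illustrated with a toy model. It assumes that the pool applies connection_recycle_time lazily, i.e. only when a connection is checked out, which is how the SQLAlchemy pooling documentation describes pool_recycle. `ToyPool` and `simulate` are hypothetical names and are not oslo.db or SQLAlchemy code. The model is only a sketch of the idea and does not explain why 1200 or 3600 did not help here.

```python
class ToyPool:
    """Toy single-connection pool; NOT oslo.db or SQLAlchemy code."""

    def __init__(self, recycle_time, server_wait_timeout):
        self.recycle_time = recycle_time
        self.server_wait_timeout = server_wait_timeout
        self.created_at = 0       # when the pooled connection was opened
        self.last_used = 0        # when it last carried traffic
        self.server_aborts = 0    # connections the server dropped as idle
        self.client_recycles = 0  # connections the client replaced in time

    def checkout(self, now):
        """Hand out the pooled connection at time `now` (seconds)."""
        idle = now - self.last_used
        age = now - self.created_at
        if idle >= self.server_wait_timeout:
            # The server hit wait_timeout first and aborted the connection;
            # the client only notices now and has to reconnect.
            self.server_aborts += 1
            self.created_at = now
        elif age >= self.recycle_time:
            # Assumed lazy recycle: the age is checked only at checkout.
            self.client_recycles += 1
            self.created_at = now
        self.last_used = now


def simulate(request_times, recycle_time, server_wait_timeout):
    """Replay checkouts at the given times and return the pool state."""
    pool = ToyPool(recycle_time, server_wait_timeout)
    for t in request_times:
        pool.checkout(t)
    return pool


# A service that goes quiet for 700s, with both timeouts at 600 (the defaults
# from this report) and with wait_timeout raised to 7200 (the workaround).
for wait_timeout in (600, 7200):
    pool = simulate([0, 700], recycle_time=600, server_wait_timeout=wait_timeout)
    print(wait_timeout, pool.server_aborts, pool.client_recycles)
```

In this model, a 700s gap with both values at 600 is counted as a server-side abort, because nothing on the client side touches the idle connection before MariaDB drops it. With wait_timeout at 7200 the same gap ends in a clean client-side recycle instead.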

impact:
Mostly annoying errors that cause retries and slow things down, without any big impact.
So I consider this a minor bug.