setup:
Using Xena with pretty much default settings,
so openstack_db_connection_recycle_time is 600 and galera_wait_timeout is 600 as well, while the haproxy timeout for the galera frontend/backend is 5000s.
symptom:
I see galera connection aborts reported by haproxy in the ERSP column. In the MariaDB log I get lines like:
"Aborted connection 594171 to db: 'placement' user: 'placement' host: 'hostA.mydomain.com' (Got timeout reading communication packets)"
The aborted connections counter in MariaDB is rising as well.
Such errors cause retries on the OpenStack side, which makes things slow from time to time.
expectation:
Not getting these kinds of errors.
some analysis:
MariaDB is actually dropping the connections at wait_timeout (= galera_wait_timeout = 600) because the connections have been idle for a long time.
The oslo.db config used in basically all OpenStack services does some connection pooling and is configured (e.g. in placement) with the following values (all default):
max_overflow = 50
max_pool_size = 5
pool_timeout = 30
connection_recycle_time = 600
So oslo.db should actually close connections and re-establish them before the MariaDB timeout is hit.
The haproxy timeouts of 5000s on the frontend and backend should not matter here either.
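One thing that may be worth checking against the expectation above: as far as I know, oslo.db passes connection_recycle_time to SQLAlchemy as pool_recycle, and SQLAlchemy only applies that limit when a connection is checked out of the pool, not while it sits idle. The toy model below is only a sketch of that assumption, it is not oslo.db or SQLAlchemy code, and ToyPool, PooledConnection and server_aborts are made-up names. It shows how an idle pooled connection could reach wait_timeout on the server before the client ever recycles it.

```python
from dataclasses import dataclass, field


@dataclass
class PooledConnection:
    created_at: float    # time the connection was opened (seconds)
    last_used_at: float  # last time the client sent anything on it


@dataclass
class ToyPool:
    """Hypothetical pool that applies recycle_time only at checkout."""
    recycle_time: float
    idle: list = field(default_factory=list)
    recycled: int = 0

    def checkout(self, now):
        if self.idle:
            conn = self.idle.pop()
            # The age check happens here, and only here.
            if now - conn.created_at > self.recycle_time:
                self.recycled += 1
                conn = PooledConnection(created_at=now, last_used_at=now)
        else:
            conn = PooledConnection(created_at=now, last_used_at=now)
        conn.last_used_at = now
        return conn

    def checkin(self, conn, now):
        conn.last_used_at = now
        self.idle.append(conn)


def server_aborts(pool, now, wait_timeout):
    """Count idle connections the server would have dropped by `now`."""
    return sum(1 for c in pool.idle if now - c.last_used_at >= wait_timeout)


# One connection is used briefly and then left idle in the pool.
pool = ToyPool(recycle_time=600)
conn = pool.checkout(now=0)
pool.checkin(conn, now=1)

# Nothing checks the pool out again, so nothing recycles the connection.
aborted_at_600 = server_aborts(pool, now=601, wait_timeout=600)
aborted_at_7200 = server_aborts(pool, now=601, wait_timeout=7200)

# The client only notices the stale connection at the next checkout.
pool.checkout(now=700)
```

In this model the server side aborts the idle connection with wait_timeout = 600 but not with 7200, and the client side recycle only happens afterwards, on the next checkout. Whether the real stack behaves like this would need to be confirmed against the oslo.db and SQLAlchemy versions shipped with Xena.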
not a solution:
Increasing wait_timeout in MariaDB to 1200 or 3600.
(workaround) solution but may not be a good one:
Increasing wait_timeout in MariaDB to 7200.
I am not sure where the issue is actually coming from, but here are my best guesses:
* there is a bug on the OpenStack end where the config values are not set in the lower-layer library
* there is some bug in the SQL DB-facing library code causing pooling and refresh not to work properly
* the timeout in MariaDB must be higher than the one in oslo.db
* haproxy may still cause some issue here and the 5000s may be part of that
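To make the last two guesses easier to reason about, here is a small sketch that just compares the idle timeouts of the three layers and reports which one is shortest. The helper name first_to_close and the dictionary labels are made up for illustration. The numbers are the ones from this report.

```python
def first_to_close(timeouts):
    """Return the layer name(s) with the shortest idle timeout, sorted."""
    shortest = min(timeouts.values())
    return sorted(name for name, value in timeouts.items() if value == shortest)


# Default settings from this report (seconds).
defaults = {
    "oslo.db connection_recycle_time": 600,
    "mariadb wait_timeout": 600,
    "haproxy timeout": 5000,
}

# Same settings with the workaround value for wait_timeout.
workaround = dict(defaults, **{"mariadb wait_timeout": 7200})

tied_by_default = first_to_close(defaults)
first_with_workaround = first_to_close(workaround)

# Is the haproxy timeout below the MariaDB timeout in each case?
haproxy_below_mariadb_default = (
    defaults["haproxy timeout"] < defaults["mariadb wait_timeout"]
)
haproxy_below_mariadb_workaround = (
    workaround["haproxy timeout"] < workaround["mariadb wait_timeout"]
)
```

With the defaults, oslo.db and MariaDB are tied at 600, so there is no margin between the recycle time and the server timeout. With wait_timeout = 7200 the haproxy timeout of 5000s ends up below the MariaDB timeout, which is not the case for 1200 or 3600. That may or may not be related to why only 7200 helps. It is just an observation about the numbers.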
impact:
Mostly annoying errors that cause retries and slow things down, without any big impact,
so I consider this a minor bug.