PostgreSQL charm should detect when replication is failing
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
PostgreSQL Charm | New | Undecided | Unassigned |
Bug Description
I have a two-node PostgreSQL cluster backing landscape-server, running on KVMs. While preparing for a clean shutdown of one host, I attempted a switchover so that the master would run on the machine whose host was remaining online.
However, this failed.
Upon investigation, I determined that the previous master had been encountering replication errors for quite some time (days). Example:
2021-05-10 15:53:56 UTC [6790]: [3-1] db=[unknown]
After the attempted switchover, the intended new master unit (which was also the juju leader) ended up in an error state due to failing its replication-
May 19 20:39:18 landscapesql-2 postgresql@
May 19 20:39:18 landscapesql-2 postgresql@
May 19 20:39:18 landscapesql-2 postgresql@
May 19 20:39:18 landscapesql-2 postgresql@
May 19 20:39:18 landscapesql-2 postgresql@
May 19 20:39:18 landscapesql-2 postgresql@
May 19 20:39:18 landscapesql-2 postgresql@
However, before I performed any maintenance, "juju status" didn't indicate any problems, nor did Nagios alert us to any issues. "juju status" for those units looked like this:
landscape-
landscape-
The charm deployed was cs:postgresql-208, on top of Bionic.
I think the charm should do something to detect replication failures, so as to avoid this type of issue.
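A minimal sketch of what such a check could look like, assuming it runs against the master and reads the standard `pg_stat_replication` view (one row per connected standby). The names `check_replication`, `fetch_rows`, and `expected_standbys` are hypothetical and not part of the charm; the exit codes follow the usual Nagios plugin convention.

```python
from typing import Iterable, List, Mapping, Tuple

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# pg_stat_replication has one row per standby connected to this server.
REPLICATION_QUERY = "SELECT application_name, state FROM pg_stat_replication"


def check_replication(
    rows: Iterable[Mapping[str, str]], expected_standbys: int
) -> Tuple[int, str]:
    """Classify replication health from pg_stat_replication rows.

    A standby counts as healthy only when its state is 'streaming'.
    """
    rows = list(rows)
    streaming = [r for r in rows if r["state"] == "streaming"]
    if len(streaming) >= expected_standbys:
        return OK, f"{len(streaming)} standby(s) streaming"
    if streaming:
        return WARNING, (
            f"only {len(streaming)} of {expected_standbys} standby(s) streaming"
        )
    return CRITICAL, f"no standby is streaming ({len(rows)} connected)"


def fetch_rows(dsn: str) -> List[Mapping[str, str]]:
    """Fetch the rows from a live server (hypothetical helper, needs psycopg2)."""
    import psycopg2  # imported lazily so check_replication works without it

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(REPLICATION_QUERY)
            return [
                {"application_name": name, "state": state}
                for name, state in cur.fetchall()
            ]
    finally:
        conn.close()
```

In a two-node cluster like this one, `check_replication(fetch_rows(dsn), expected_standbys=1)` would return a critical result whenever the standby is disconnected or stuck outside the streaming state. The same result could in principle feed both a Nagios check and the unit's workload status.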
While I think this is tangential to the bug's intended issue (lack of alerting or "juju status" indication of when replication is not healthy), if you wish to know the exact context of the Juju error encountered, here is a traceback from the failed replication-relation-changed hook:
2021-06-01 16:11:04 ERROR juju-log replication:3: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/charm/reactive/postgresql/replication.py", line 635, in drain_master_and_promote_anointed
    local_offset = postgresql.wal_received_offset(postgresql.connect())
  File "/var/lib/juju/agents/unit-landscape-postgresql-1/charm/reactive/postgresql/postgresql.py", line 130, in connect
    return psycopg2.connect(user=user, database=database, host=host, port=port_)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: the database system is starting up